By the way, here's a quicker (but slightly more dangerous) way to find
the print.XMLInternalDocument function: just call debug(print) before
printing the object. Two steps take you to the method and let you see
what it's doing. The "danger" is that print() will now always trigger
the debugger, which can be a little confusing! Remember to call
undebug(print) at the end.
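
A minimal sketch of that debug()/undebug() round trip, using only base R. The class "toydoc" and its print method are hypothetical stand-ins for the real XML classes, and the print() call itself is left as a comment because it only makes sense in an interactive session.

```r
# Toy S3 class standing in for the real document class (hypothetical).
print.toydoc <- function(x, ...) {
  cat("<toydoc with", length(unclass(x)), "elements>\n")
  invisible(x)
}
obj <- structure(list(1, 2), class = "toydoc")

debug(print)                        # every print() call now enters the browser
was_flagged <- isdebugged(print)    # records that the flag is set

# Interactive use only: print(obj) would now stop inside print();
# stepping with "n" goes through UseMethod("print") and on into the
# method that gets dispatched, here print.toydoc().

undebug(print)                      # always clear the flag afterwards
still_flagged <- isdebugged(print)  # records that the flag is cleared
```

If forgetting the undebug() step is a concern, debugonce(print) is a base R alternative that clears the flag by itself after the first call.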

Duncan Murdoch


On 26/07/2013 1:26 PM, Duncan Murdoch wrote:
On 26/07/2013 12:43 PM, Nick McClure wrote:
> I'm hitting a wall. When I use the 'scrape' function from the package
> 'scrapeR' to get the pagesource from a web page, I do the following:
> (as an example)
>
> website.doc = parse("http://www.google.com")
>
> When I look at it, it seems fine:
>
> website.doc[[1]]
>
> This seems to have the information I need.  Then when I try to get it
> into a character vector,
>
> character.website = as.character(website.doc[[1]])
>
> I get the error:
>
> Error in as.vector(x, "character") :
> cannot coerce type 'externalptr' to vector of type 'character'
>
> I'm trying very very hard to wrap my head around how to get this
> external pointer to a character, but after reading many help files, I
> cannot understand how to do this. Any ideas?

You should use str() in cases like this. When I look at
str(website.doc[[1]]) (after producing website.doc with scrape(), not
parse()), I see

  > str(website.doc[[1]])
Classes 'HTMLInternalDocument', 'HTMLInternalDocument',
'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
- attr(*, "headers")= Named chr [1:2] "<HTML><HEAD><meta
http-equiv=\"content-type\"
content=\"text/html;charset=utf-8\">\n<TITLE>302
Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
..- attr(*, "names")= chr [1:2] "<HTML><HEAD><meta
http-equiv=\"content-type\"
content=\"text/html;charset=utf-8\">\n<TITLE>302
Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"

So it is an external pointer with a number of classes. One or more of
those will have a print method. methods(print) will list all the print
methods, and I see there's a (hidden) print.XMLInternalDocument method
somewhere. Then

  > getAnywhere("print.XMLInternalDocument")
A single object matching ‘print.XMLInternalDocument’ was found
It was found in the following places
registered S3 method for print from namespace XML
namespace:XML
with value

function (x, ...)
{
cat(as(x, "character"), "\n")
}
<environment: namespace:XML>

shows that the as() generic should work, even though as.character()
doesn't, and indeed as(website.doc[[1]], "character") does display
something.
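
For anyone who wants to try this without scrapeR, here is a small sketch. It assumes the XML package is installed (it is skipped otherwise) and parses an in-memory string with htmlParse() instead of fetching a page; the variable names are only illustrative.

```r
# Assumes the CRAN package 'XML' is installed; the example is skipped otherwise.
has_xml <- requireNamespace("XML", quietly = TRUE)

if (has_xml) {
  library(XML)

  # Parse a small in-memory page rather than downloading one.
  doc <- htmlParse("<html><body><p>Hello</p></body></html>", asText = TRUE)

  # The document is an external pointer carrying several S3 classes.
  doc_classes <- class(doc)

  # as.character() on the pointer: keep the error message instead of stopping.
  ac_result <- tryCatch(as.character(doc),
                        error = function(e) conditionMessage(e))

  # The coercion that print.XMLInternalDocument uses internally.
  page_text <- as(doc, "character")

  # saveXML() is another exported way to serialise the document to text.
  page_text2 <- saveXML(doc)
}
```

Either page_text or page_text2 can then be passed to the usual string functions (grepl(), strsplit(), and so on).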

Duncan Murdoch



