Re: [R] Figuring out encodings of PDFs in R

Duncan Murdoch Tue, 26 Jun 2012 17:10:37 -0700

On 12-06-26 3:28 PM, Jonas Michaelis wrote:

Dear list,


I am currently scraping some text data from several PDFs using the
readPDF() function in the tm package. This all works very well and in most
cases the encoding seems to be "latin1" - in some, however, it is not. Is
there a good way in R to check character encodings? I found the functions
is.utf8() and is.local() in the tau package but that obviously only gets me
so far.

There are heuristics for guessing encodings, but I don't think they are
built into R. I think the way to do what you want is to read the PDF
spec to find out how the strings are encoded in the source file, and
believe that.


Duncan Murdoch
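
As an illustration of the kind of heuristic mentioned above (not something posted in this thread), here is a minimal sketch in base R. It assumes R 3.3.0 or later, where validUTF8() is available, and it assumes that any string which is neither plain ASCII nor valid UTF-8 is latin1. That last step is a guess, not a detection. The helper name guess_encoding() is hypothetical.

```r
# Hypothetical helper: a crude three-way guess, not a real encoding detector.
guess_encoding <- function(x) {
  vapply(x, function(s) {
    bytes <- as.integer(charToRaw(s))
    if (all(bytes < 128L)) {
      "ASCII"              # identical bytes in latin1 and UTF-8
    } else if (validUTF8(s)) {
      "UTF-8"              # the bytes form a well-formed UTF-8 sequence
    } else {
      "latin1 (assumed)"   # assumption: anything else is treated as latin1
    }
  }, character(1), USE.NAMES = FALSE)
}

# Example inputs: the same word as UTF-8 bytes and as latin1 bytes.
utf8_str   <- "caf\u00e9"
latin1_str <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

guesses <- guess_encoding(c("abc", utf8_str, latin1_str))

# Once an encoding has been decided on, iconv() converts to UTF-8.
converted <- iconv(latin1_str, from = "latin1", to = "UTF-8")
```

Note that Encoding() only reports the mark R has attached to a string, not what the bytes actually are, and that a check like this cannot tell latin1 apart from other single-byte encodings such as windows-1252. That is why reading the encoding from the PDF itself remains the more reliable route.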

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
