[EMAIL PROTECTED] wrote:
I never said it *should* work.

I was simply trying something that works on other types of files
I've needed in the past (e.g. html, csv, dat). I don't know the
details of the PDF format, but I thought it was worth a try; there is
certainly no harm in experimenting. Hence I learned that PDFs aren't
stored in the same way as the other files I've used in the past.
That's fine, it's good to learn new things.

As for trying the readPDF() function, yes, I have downloaded and used
xpdf to convert PDFs into plain text since reading the OP's email.
However, how can you make xpdf available to the system so that
readPDF() works in R? I don't know, hence why I posted in this thread.

You clearly seem to have a solution, fancy sharing?


Sure, I thought that could not be a real question:
Set your environment variable PATH so that it additionally points to the directory where these tools are installed, just as you would for any other software that is to be called without knowing where it is installed.

Uwe Ligges
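
The PATH advice above can also be tried from inside R, for the current session only. What follows is a minimal sketch, not a definitive recipe: the directory in xpdf_dir and the helper name add_to_path() are assumptions made for illustration, so substitute the folder where pdftotext and pdfinfo actually live on your machine.

```r
# Hypothetical install location of the xpdf tools; adjust to your machine.
xpdf_dir <- "C:/Program Files/xpdf"

# Append a directory to PATH for the current R session only
# (illustrative helper, not part of base R or tm).
add_to_path <- function(dir) {
  old <- Sys.getenv("PATH")
  parts <- strsplit(old, .Platform$path.sep, fixed = TRUE)[[1]]
  if (!(dir %in% parts)) {
    Sys.setenv(PATH = paste(old, dir, sep = .Platform$path.sep))
  }
  invisible(Sys.getenv("PATH"))
}

add_to_path(xpdf_dir)

# Sys.which() returns "" for any tool that cannot be found on PATH.
Sys.which(c("pdftotext", "pdfinfo"))
```

If both entries come back as non-empty paths, the tools are visible to R. A change made with Sys.setenv() is lost when the session ends; setting PATH in the operating system's own settings makes it permanent.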


Clair Crossupton xx



On 16 Nov, 12:34, Uwe Ligges <[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote:
Hello, I was just wondering if you had found a solution? I am having
the same difficulty converting PDFs into plain text documents in
R. I originally thought I could use the readLines() function, but as
you can see below, that did not work.
Why the hell should it? It is designed to read *text* files. And what
you get below is exactly what your PDF file looks like if you read it
as text, which it is NOT. Why do you not go the readPDF() way as well
(and yes, that way is not always possible or reliable)?

Uwe Ligges
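
To illustrate the point above that a PDF is not a text file, here is a rough sketch that first checks a file for the "%PDF-" signature and then hands it to the external pdftotext tool instead of readLines(). The helper names is_pdf() and pdf_to_lines() are made up for this example, and the sketch assumes pdftotext is already on PATH.

```r
# TRUE if the file starts with the PDF signature bytes "%PDF-".
is_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Convert a PDF with the external pdftotext tool, then read the
# resulting plain-text file (illustrative helper, assumes pdftotext
# is installed and on PATH).
pdf_to_lines <- function(pdf, txt = tempfile(fileext = ".txt")) {
  stopifnot(is_pdf(pdf))
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext not found on PATH")
  }
  status <- system2("pdftotext", args = c(shQuote(pdf), shQuote(txt)))
  if (status != 0) {
    stop("pdftotext failed with exit status ", status)
  }
  readLines(txt, warn = FALSE)
}

# Usage, with my.destfile as defined in the session quoted below:
# txt <- pdf_to_lines(my.destfile)
```

Only the converted text file is read with readLines(); the PDF itself is never parsed as text. As noted above, extraction is not always possible or reliable, so the result is worth inspecting.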



R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
R> download.file(url = my.url, destfile=my.destfile, mode='wb')
R> txt <- readLines(my.destfile)
R> txt
[1] "%PDF-1.4"
[2] "%ÐÔÅØ"
[3] "1 0 obj <<"
[4] "/Length 587 "
[5] "/Filter /FlateDecode"
[6] ">>"
[7] "stream"
[8] "xÚmTM [EMAIL PROTECTED]&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf
\017’W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê®\027[vïÖæ6ïWÛ7ñÑTÙÖvb
\030¯“uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G› "
Warm Regards,
Clair
On 13 Nov, 15:10, Tony Breyal <[EMAIL PROTECTED]> wrote:
Dear R-Help,
I need to convert a set of '.pdf' files into an equivalent set of
'.txt' files. This is so that I can do some text mining on the
content.
In the latest R-News letter
(http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
'tm' for text mining is mentioned. In that lovely package, there is a
function called 'readPDF()'. In order to use this, ?readPDF says
    "Note that this PDF reader needs both the tools pdftotext and
pdfinfo installed and accessable on your system."
These tools are available from http://www.foolabs.com/xpdf/download.html
I am able to download this and use it easily from a DOS window to
convert a pdf file into a txt file.
Question: how do I make these tools available to R, so that I can use
the readPDF() function?
Thank you in advance for any help, and I hope the above made sense.
Tony Breyal
### OS = Windows Vista Ultimate
> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] tm_0.3-1           XML_1.98-1         Snowball_0.0-3     RWeka_0.3-14
[5] rJava_0.6-0        Matrix_0.999375-16 lattice_0.17-15    filehash_2.0
loaded via a namespace (and not attached):
[1] proxy_0.4-1
______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.