Hi all, thank you so much for your help, I have now got it working! I thought I had replied already, but I don't think it got through, so this is a repost for anyone who searches on this topic.
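For anyone arriving from a search: the fix in this thread comes down to R being able to find pdftotext and pdfinfo on the PATH. Here is a minimal sketch for checking that from inside R, and for adding the xpdf folder to the PATH of the current R session only. The helper names find_pdf_tools() and add_to_path() are made up for illustration (they are not part of tm), and C:\Program Files\xpdf is simply where xpdf was installed in this thread, so adjust it as needed.

```r
# Hypothetical helper (not part of tm): report whether R can currently
# find the xpdf command-line tools on the PATH.
find_pdf_tools <- function(tools = c("pdftotext", "pdfinfo")) {
  paths <- unname(Sys.which(tools))  # "" for any tool that cannot be found
  data.frame(tool = tools,
             path = paths,
             found = nzchar(paths),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: append a directory to PATH for this R session only.
# The system-wide Windows setting is left untouched.
add_to_path <- function(dir) {
  sep <- .Platform$path.sep  # ";" on Windows, ":" elsewhere
  entries <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!(dir %in% entries)) {
    Sys.setenv(PATH = paste(Sys.getenv("PATH"), dir, sep = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# Assumed install location, taken from this thread; change it to match yours.
xpdf_dir <- "C:\\Program Files\\xpdf"
if (file.exists(xpdf_dir)) add_to_path(xpdf_dir)

print(find_pdf_tools())
```

If found is FALSE for either tool, readPDF() is unlikely to work until the PATH is sorted out. A session-only change like this avoids a restart, but it has to be repeated each time R is started.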
After adding the directory to the PATH variable, I should have restarted my laptop. I had assumed that Windows would update the path automatically, but apparently that didn't happen on my uni laptop (Windows XP SP2).

Also, I received 2 private emails asking how to use the readPDF function, so here is how you do it:

### R START ###
> library(tm)
> my.path <- 'C:\\Documents and Settings\\tony\\Desktop\\pdfs\\' # put your pdfs in here
> Corpus(DirSource(my.path), readerControl = list(reader=readPDF))
A text document collection with 1 text document

Warning message:
In readLines(filename, encoding = encoding) :
  incomplete final line found on 'C:\Documents and Settings\tony\Desktop\pdfs\/r-intro.pdf'
>
### R END ###

I'm not quite sure what consequence that warning has, but otherwise it works fine for me.

Cheers,
Tony Breyal

On 16 Nov, 23:49, "joris meys" <[EMAIL PROTECTED]> wrote:
> Hi Tony,
>
> You can name several variables 'Path' without problems. So you best
> restore the original variable PATH to its original value (or it ain't
> going to work any more) and just add a new one, call that PATH as
> well, and add the directory C:\Program Files\xpdf, like Uwe already
> suggested.
>
> This should make it work (I hope)
>
> Kind regards
> Joris
>
> On Sun, Nov 16, 2008 at 9:15 PM, Tony Breyal <[EMAIL PROTECTED]> wrote:
> > Hi Joris, there is already a variable called 'Path', therefore i
> > appended the directory path to the other strings already in the value
> > section:
> >
> > Name: Path
> > Value: %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SystemRoot%\system32\nls;%SystemRoot%\system32\nls\ENGLISH;C:\Program Files\Novell\ZENworks\;C:\Program Files\Common Files\Teleca Shared;C:\Program Files\QuickTime\QTSystem\;C:\Program Files\xpdf\
> >
> > Still didn't work i'm afraid, but cheers for the suggestion.
> >
> > Tony Breyal
> >
> > On 16 Nov, 20:00, "joris meys" <[EMAIL PROTECTED]> wrote:
> >> Try putting "PATH" under name, and the directory path (not the file)
> >> under value. That looks more appropriate to me...
> >>
> >> Kind regards
> >> Joris Meys
> >>
> >> On Sun, Nov 16, 2008 at 8:41 PM, Tony Breyal <[EMAIL PROTECTED]> wrote:
> >> > Hi,
> >> >
> >> > Uwe -- ahh, thank you kindly, I was able to do a web search after
> >> > reading your post above in order to find a guide on how to set the
> >> > path in windows (i wasn't aware that this is how a file is made
> >> > available to the system). I haven't got it to work yet, but at least
> >> > i'm on the right track! also just after reading your post, i've
> >> > discovered the system() function in R, what a wonderful thing that is!
> >> >
> >> > Clair -- I'm still working on getting the files to be accessible to
> >> > the system, but in the meantime i have just discovered the system()
> >> > function in R, which is a workaround for the moment... so using your
> >> > example, you could do:
> >> > ## R code
> >> > > system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/clair/Desktop/test/r-intro.pdf"'), wait=FALSE)
> >> >
> >> > the above will create a new text document in your c:/../test folder.
> >> >
> >> > Now obviously, we want to use the readPDF() function in package: tm.
> >> > so on my uni laptop, running windows XP, this is what i have done:
> >> >
> >> > 1. Click through: start >> control panel >> system
> >> > 2. Click the Advanced tab.
> >> > 3. Click Environment variables.
> >> > 4. Click New (under 'system') to add a new variable name and value.
> >> > 4a. name: pdftotext
> >> > 4b. value: C:\Program Files\xpdf\pdftotext.exe
> >> > 5. Click New (under 'system') to add a new variable name and value.
> >> > 5a. name: pdfinfo
> >> > 5b. value: C:\Program Files\xpdf\pdfinfo.exe
> >> >
> >> > In theory, i think, that should work. however so far it hasn't, so not
> >> > quite sure what to do.
> >> > But at least in the meantime we have the system() function as a
> >> > workaround. If you can figure out what i'm doing wrong (probably
> >> > something obvious knowing me!) please do let me know.
> >> >
> >> > Cheers,
> >> > Tony Breyal
> >> >
> >> > On 16 Nov, 18:14, Uwe Ligges <[EMAIL PROTECTED]> wrote:
> >> >> [EMAIL PROTECTED] wrote:
> >> >> > I never said it *should* work.
> >> >> >
> >> >> > I was simply trying something out that works on other types of files
> >> >> > I've needed in the past (eg: html, csv, dat, etc.). I don't know the
> >> >> > details of the pdf format, but I thought it was worth a try, certainly
> >> >> > no harm in experimenting, and hence I learned that pdfs aren't stored
> >> >> > in the same way that other files i've used in the past are. that's
> >> >> > fine, good to learn new things.
> >> >> >
> >> >> > As for trying the readPDF() function, yes, I have downloaded and used
> >> >> > xpdf to convert pdfs into plain text since reading the OP email.
> >> >> > However, how can you make xpdf available to the system so that
> >> >> > readPDF() works in R? i don't know, hence why I posted in this thread.
> >> >> >
> >> >> > You clearly seem to have a solution, fancy sharing?
> >> >>
> >> >> Sure, I thought that could not be a real question:
> >> >> Set your environment variable PATH so that it additionally points to the
> >> >> directory where these tools are installed. As you would do for any other
> >> >> software that is to be called without knowledge where it is installed.
> >> >>
> >> >> Uwe Ligges
> >> >>
> >> >> > Clair Crossupton xx
> >> >> >
> >> >> > On 16 Nov, 12:34, Uwe Ligges <[EMAIL PROTECTED]> wrote:
> >> >> >> [EMAIL PROTECTED] wrote:
> >> >> >>> Hello, I was just wondering if you had found a solution? I am having
> >> >> >>> the same difficulty of converting pdf's into plain text documents in
> >> >> >>> R. I originally thought I could use the readLines() function, but as
> >> >> >>> you can see below that did not work.
> >> >> >> Why the hell should it?
> >> >> >> It is designed to read *text* files. And what you get below is
> >> >> >> exactly what your PDF file looks like if you read it as text,
> >> >> >> which it is NOT. Why do you not also go the readPDF() way (and yes,
> >> >> >> it is not always possible nor reliable to go that way).
> >> >> >>
> >> >> >> Uwe Ligges
> >> >> >>
> >> >> >>> R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
> >> >> >>> R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
> >> >> >>> R> download.file(url = my.url, destfile=my.destfile, mode='wb')
> >> >> >>> R> txt <- readLines(my.destfile)
> >> >> >>> R> txt
> >> >> >>> [1] "%PDF-1.4"
> >> >> >>> [2] "%ÐÔÅØ"
> >> >> >>> [3] "1 0 obj <<"
> >> >> >>> [4] "/Length 587 "
> >> >> >>> [5] "/Filter /FlateDecode"
> >> >> >>> [6] ">>"
> >> >> >>> [7] "stream"
> >> >> >>> [8] "xÚmTM [EMAIL PROTECTED]&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf\017'W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê(R)\027[vïÖæ6ïWÛ7ñÑTÙÖvb\030¯"uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G› "
> >> >> >>> Warm Regards,
> >> >> >>> Clair
> >> >> >>> On 13 Nov, 15:10, Tony Breyal <[EMAIL PROTECTED]> wrote:
> >> >> >>>> Dear R-Help,
> >> >> >>>> I need to convert a set of '.pdf' files into an equivalent set of
> >> >> >>>> '.txt' files. This is so that i can do some text mining on the
> >> >> >>>> content.
> >> >> >>>> In the latest R-News letter
> >> >> >>>> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
> >> >> >>>> 'tm' for text mining is mentioned. In that lovely package, there is
> >> >> >>>> a function called 'readPDF()'. In order to use this, ?readPDF says
> >> >> >>>> "Note that this PDF reader needs both the tools pdftotext and
> >> >> >>>> pdfinfo installed and accessable on your system."
> >> >> >>>> These tools are available from
> >> >> >>>> http://www.foolabs.com/xpdf/download.html
> >> >> >>>> I am able to download this and use it easily from a dos window to
> >> >> >>>> convert a pdf file into a txt file.
> >> >> >>>> Question: how do i make these tools available to R, so that i can
> >> >> >>>> use the readPDF() function?
> >> >> >>>> Thank you in advance for any help, and I hope the above made sense.
> >> >> >>>> Tony Breyal
> >> >> >>>> ###OS = Windows Vista Ultimate
> >> >> >>>> > sessionInfo()
> >> >> >>>> R version 2.8.0 (2008-10-20)
> >> >> >>>> i386-pc-mingw32
> >> >> >>>> locale:
> >> >> >>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> >> >> >>>> attached base packages:
> >> >> >>>> [1] grid stats graphics grDevices utils datasets methods base
> >> >> >>>> other attached packages:
> >> >> >>>> [1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0
> >> >> >>>> loaded via a namespace (and not attached):
> >> >> >>>> [1] proxy_0.4-1
> >> >> >>>> ______________________________________________
> >> >> >>>> [EMAIL PROTECTED] mailing list
> >> >> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >>>> PLEASE do read the posting guide
> >> >> >>>> http://www.R-project.org/posting-guide.html
> >> >> >>>> and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.