Re: [R] PDF extraction with tm package

Jeff Newmiller Fri, 22 Jul 2016 08:21:50 -0700

This is neither the Xpdf support forum nor the Windows Setup Program 
Reinvention support group... and you really need to read and follow the Posting 
Guide for the R mailing lists.


FWIW I would guess that you need to learn about environment variables and in 
particular about the PATH variable. There are subtleties about when and how 
they get defined that are OS-specific and certainly off topic here that may 
trip you up along the way. Alternatively, you may read the Xpdf documentation 
or a how-to blog about Xpdf that gives you a recipe, but again that is not 
about R. Once you can start a CMD shell and run the command directly then you 
are most of the way to getting R to invoke it.
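
As a rough sketch of the R side of that check (an illustration only, not a recipe from the Xpdf documentation): PATH entries are directories, so the folder holding the executables is what gets added, not the .exe files themselves. The `xpdf_dir` value below is an assumption taken from the location described in the original post, and `Sys.setenv()` changes PATH only for the current R session. Note the spelling "pdftotext" in the `Sys.which()` call.

```r
# Assumed install location, taken from the original post; adjust as needed.
xpdf_dir <- "C:/Program Files/Xpdf/bin64"

# PATH entries must be directories, so append the folder, not the .exe files.
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
if (!(xpdf_dir %in% path_entries)) {
  Sys.setenv(PATH = paste(c(path_entries, xpdf_dir),
                          collapse = .Platform$path.sep))
}

# Sys.which() returns "" for any tool it cannot locate on PATH.
tools <- Sys.which(c("pdfinfo", "pdftotext"))
print(tools)

if (all(nzchar(tools))) {
  # "-v" asks pdftotext to print its version information.
  system2("pdftotext", "-v")
}
```

If both entries printed by `Sys.which()` are non-empty, the `system2()` call that tm makes should be able to find the tools as well. A permanent PATH change is an operating system setting rather than something done from R.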
-- 
Sent from my phone. Please excuse my brevity.

On July 21, 2016 5:26:26 PM PDT, Steven Kang <stochastick...@gmail.com> wrote:
>Hi R users,
>
>I’m having some issues trying to extract text from a PDF file using the tm
>package.
>
>Here are the steps that were carried out:
>
>1. Downloaded and installed the following programs:
>
>- Xpdf (copied the ‘bin32’, ‘bin64’, and ‘doc’ folders into the
>‘C:\Program Files\Xpdf’ directory; also added
>C:\Program Files\Xpdf\bin64\pdfinfo.exe and
>C:\Program Files\Xpdf\bin64\pdftotext.exe to the existing PATH)
>
>- Tesseract
>
>- Imagemagick
>
>2. Used the following scripts and the corresponding error messages:
>
># Directory where PDF files are stored
>
>>cname <- getwd()
>
>>Corpus(DirSource(cname), readerControl=list(reader = readPDF))
>
>Error in system2("pdftotext", c(control$text, shQuote(x), "-"), stdout = TRUE) :
>  '"pdftotext"' not found
>
> In addition: Warning message:
>
>running command '"pdfinfo" "C:\Users\R_Files\XXX.pdf"' had status 127
>
>>file.exists(Sys.which(c("pdfinfo","pdftpotext")))
>[1] FALSE FALSE
>
>It seems that R can’t find the pdfinfo and pdftotext exe files, but I’m not
>sure why this would be the case despite the Xpdf files being copied into
>‘C:\Program Files’ (I’m using 64-bit Windows 7).
>
>I’m aware that the ‘pdf_text’ function from the pdftools package can extract
>text from a PDF file and output it as a string. But I was after something
>that can convert a PDF (i.e. transaction data) into a data frame without
>regular expressions. Is the tm package capable of doing this conversion? Are
>there any other alternatives to these methods?
>
>Your expertise in resolving this problem would be highly appreciated.
>
>
>Steve
>
>______________________________________________
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

