Re: [R] Doing PDF OCR with R

2015-08-13 Thread Duncan Murdoch
On 13/08/2015 1:29 AM, Jeff Newmiller wrote:
> This code is using R like a command shell... there really is not much chance 
> that R is the problem, and this is not a "tesseract" support forum, so this 
> seems quite off-topic. 

I would have guessed the same, but the error message looks like an R
message.  But I can't see anything very different in the 3rd step compared
to the first, so I don't know what is going on.
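
One way to narrow down whether a failure comes from R or from the external
program is to run each step through a small wrapper that reports which side
complained. The following is only a sketch: run_step() and its runner
argument are hypothetical names, not something from the linked post, and the
runner argument exists only so the logic can be tried without the real tools
installed.

```r
# Hypothetical helper: run one external step and report whether a failure
# was signalled by R or by the external program's exit status.
run_step <- function(command, args = character(), runner = system2) {
  tryCatch({
    # With stdout/stderr = TRUE, system2() returns the captured output as a
    # character vector; a non-zero exit code is attached as attribute "status".
    out <- suppressWarnings(runner(command, args, stdout = TRUE, stderr = TRUE))
    status <- attr(out, "status")
    if (is.null(status)) status <- 0L
    list(source = if (status == 0L) "ok" else "external",
         status = as.integer(status),
         output = as.character(out))
  }, error = function(e) {
    # An R error raised while making the call ends up here.
    list(source = "R", status = NA_integer_, output = conditionMessage(e))
  })
}
```

Calling run_step() with the Tesseract executable and its arguments would then
give source "external" with a non-zero status if Tesseract itself exited with
an error, and source "R" with the error message if R raised the error.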

The use of shQuote() looks wrong:  Anshuk probably doesn't want to quote
the whole command expression, just the parts of it that may cause problems.
And the docs do recommend using system2() rather than shell().  But I
don't think either of those things should have caused that error.
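
To illustrate the quoting point, here is a minimal sketch. The Tesseract path
is the one from the original post; the file name and the tesseract_args()
helper are made up for the example. type = "cmd" is spelled out so the result
is the same on any platform (it is the default on Windows).

```r
pdf <- "my scans/book one.pdf"   # made-up file name containing spaces

# Quoting the whole command yields ONE quoted string containing both the
# program and its arguments.
whole <- shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", pdf), type = "cmd")

# Quoting only the parts that may contain spaces keeps the program and each
# argument as separate pieces.
tesseract_args <- function(tif, out_base, lang = "eng", type = "cmd") {
  c(shQuote(tif, type = type), shQuote(out_base, type = type), "-l", lang)
}
args <- tesseract_args(paste0(pdf, ".tif"), pdf)

# system2() takes the program and its arguments separately. Left commented
# out because it needs Tesseract installed at this path.
# system2("F:/Tesseract-OCR/tesseract.exe", args)
```

Printing whole and args side by side shows the difference: the first is a
single quoted token, the second a vector with each file name quoted on its
own and the -l eng option left unquoted.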

Duncan Murdoch


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Doing PDF OCR with R

2015-08-12 Thread Jeff Newmiller
This code is using R like a command shell... there really is not much chance 
that R is the problem, and this is not a "tesseract" support forum, so this 
seems quite off-topic. 
---
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.




[R] Doing PDF OCR with R

2015-08-12 Thread Anshuk Pal Chaudhuri
Hi All,

I have been trying to do OCR within R (reading PDF files whose content is
stored as scanned images). I have been reading about this at
http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

This is a very good post.

Effectively there are 3 steps:

1. convert pdf to ppm (an image format)
2. convert ppm to tif ready for tesseract (using ImageMagick for convert)
3. convert tif to text file

The code for the above 3 steps, as per the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i,
                       " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i,
                       ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i,
                       " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

The first two steps run fine, although they take a good amount of time for
a 4-page PDF. I will look into scalability later; first I want to see
whether this works at all.

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i,
                     " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

or else Tesseract crashes.

Any workaround or root cause analysis would be appreciated.

Regards,
Anshuk Pal Chaudhuri

