720 dpi seems high. Is that the native scan resolution? I'd use the
native resolution unless it's less than 200 dpi or more than 400 dpi.
Similarly, why are you rendering to tiffgray when the input looks like
it's bitonal? tesseract will just have to threshold it back to bitonal
again, resulting in two conversions where none are needed.
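
As a rough sketch of what that could look like (untested against 2000.pdf): the helper below keeps the native resolution when it falls in the 200 to 400 dpi range and renders straight to a bitonal TIFF with Ghostscript's tiffg4 device. The function names pick_dpi and gs_command are made up for illustration, the 300 dpi fallback is my assumption rather than a rule, and -g is dropped on the assumption that the page size can come from the PDF itself. The command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: keep the native dpi if it lies in 200..400,
# otherwise fall back to 300 dpi (the fallback value is an assumption).
pick_dpi() {
    native=$1
    if [ "$native" -ge 200 ] && [ "$native" -le 400 ]; then
        echo "$native"
    else
        echo 300
    fi
}

# Print (do not run) a Ghostscript command that renders directly to a
# bitonal, G4-compressed TIFF instead of going through tiffgray.
# The binary is called "gs" on many systems.
gs_command() {
    dpi=$(pick_dpi "$1")
    echo "ghostscript -o 2000_gs.tif -sDEVICE=tiffg4 -r${dpi} 2000.pdf"
}

# Example: the 720 dpi from the original command falls outside the range.
gs_command 720
```

If poppler is installed, pdfimages -list 2000.pdf should report the resolution of the embedded scan, which is the number to pass in place of 720.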

I don't have time to play with it myself, but perhaps you could outline the
matrix of different conversions you've tried so far, to help folks see what's
already been tried and eliminated as not helpful.
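
One possible way to lay out such a matrix, as a sketch only: the loop below prints one Ghostscript and tesseract command pair per (device, resolution) combination. The devices and resolutions listed are illustrative choices of mine, not combinations anyone has reported trying, list_matrix is a made-up name, and the rotation step is left out for brevity.

```shell
#!/bin/sh
# Print one render + OCR command pair for every (device, dpi) combination.
# Devices and resolutions are illustrative assumptions; edit to match
# what has actually been tried.
list_matrix() {
    for device in tiffg4 tiffgray; do
        for dpi in 300 400; do
            base="2000_${device}_${dpi}"
            echo "ghostscript -o ${base}.tif -sDEVICE=${device} -r${dpi} 2000.pdf"
            echo "tesseract ${base}.tif ${base} pdf"
        done
    done
}

list_matrix
```

Posting the printed list alongside a note on how each combination fared would show at a glance which variants are already ruled out.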

Tom

On Friday, January 22, 2016 at 5:14:48 AM UTC-5, Timo Grossenbacher wrote:
>
> Hey,
>
> Given the input file 2000.pdf, and the following code, ...
>
> # first, conversion to TIFF with ghostscript
> ghostscript -o 2000_gs.tif -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw 2000.pdf
> # then, rotation with imagemagick
> convert 2000_gs.tif -rotate 89.4 -background white -alpha Off 2000_rotated.tif
> # then, OCR with tesseract, using suggested parameters
> tesseract 2000_rotated.tif 2000_readable_gs_custom -c load_system_dawg=0 -c load_freq_dawg=0 -c textord_tablefind_recognize_tables=1 -c textord_tabfind_find_tables=1 pdf
>
> ...the quality of the OCR is really poor - hardly 30% of the text is 
> searchable in 2000_readable_gs_custom.pdf.
>
> I have uploaded all the files to 
> https://www.sendspace.com/filegroup/dGA6ojm%2BQ4tZ6gdkyuSM0xSIUD8P2vbB 
>
> When I OCR the same file with Adobe Acrobat Professional, I get almost 
> 100% accuracy. Of course I'd like to do it rather with FOSS than with a 
> commercial product, so do you have any hints on how I could mitigate those 
> problems?
>
> Thanks a lot,
> Timo
>
