[tesseract-ocr] Re: Text output vs. PDF

2016-07-19 Thread H . Mijail Antón Quiles
I just spent a couple of hours debugging a workflow because the PDF it 
finally generated seemed to have been OCR'd, but with every character 
replaced by a space.
It turns out that the problem was not in the workflow but in my use of 
Preview.app, as explained in this thread. Acrobat Reader does extract the 
correct text when selecting and copying.
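
For anyone who wants to check a PDF's text layer without relying on any 
viewer's copy and paste, here is a minimal sketch. It assumes Poppler's 
pdftotext command is installed and on the PATH; the function names 
extract_text and looks_blank are made up for illustration and are not part 
of Tesseract.

```python
import shutil
import subprocess
import sys


def looks_blank(text: str) -> bool:
    """Return True if the text contains no non-whitespace characters."""
    # str.strip() also removes the form feeds pdftotext emits between pages.
    return text.strip() == ""


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with Poppler's pdftotext ('-' means stdout)."""
    # Assumption: pdftotext is available; this is not a Tesseract tool.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on the PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    if looks_blank(text):
        print("extracted text is only whitespace")
    else:
        print("text layer contains real characters")
```

If this reports real characters while Preview.app copies only spaces, the 
PDF itself is probably fine and the viewer's text extraction is the likely 
culprit.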

I see a number of other questions in the forum that could be related to 
this same problem, so I've just added a FAQ entry:
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#the-produced-searchable-pdf-seems-to-only-contain-spaces

On Monday, 29 June 2015 09:45:37 UTC+2, Jeff Breidenbach wrote:
>
> Unfortunately, I think there is nothing we can do. I've done everything I 
> can to maximize compatibility with various PDF rendering engines, but 
> Preview uses particularly terrible text extraction heuristics. To be fair, 
> the root problem is the design and complexity of the PDF specification 
> itself.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/85cedd8a-9b39-4515-bd86-60c8b6754fa3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Text output vs. PDF

2015-06-29 Thread Jeff Breidenbach
Unfortunately, I think there is nothing we can do. I've done everything I 
can to maximize compatibility with various PDF rendering engines, but 
Preview uses particularly terrible text extraction heuristics. To be fair, 
the root problem is the design and complexity of the PDF specification 
itself.
