
[tesseract-ocr] Re: Text output vs. PDF

2016-07-19 Thread H. Mijail Antón Quiles

I just spent a couple of hours debugging a workflow because the final generated PDF seemed to have been OCR'd, but with every character extracted as a space. It turns out that the problem was not in the workflow but in my use of Preview.app, as explained in this thread. Acrobat Reader does extract the

[tesseract-ocr] Re: Text output vs. PDF

2015-06-29 Thread Jeff Breidenbach

Unfortunately, I think there is nothing we can do. I've done everything I can to maximize compatibility with various PDF rendering engines, but Preview uses particularly terrible text extraction heuristics. To be fair, the root problem is the design and complexity of the PDF specification