Hi Mark,

On 07/03/2024 20:53, Mark Pellegrino wrote:
I found more info here:
https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277

Glyphless appears to be an 'invisible font' and all that Tesseract supports. It seems like the solution is to use Tesseract to generate hOCR, then use another tool to combine the source image with the hOCR?

Does anyone have a simple workflow for editing/correcting Tesseract OCR documents that they can share?

If you're looking to do OCR and PDF generation separately, you might want to look into the Internet Archive's PDF generation tooling, which is designed to do exactly this (plus some aggressive compression): https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm the author of the tooling)

As for viewing and editing hOCR, there are a lot of different tools around, not all of them fully functional (I haven't tried most of these):

* https://www.not-implemented.de/hocr-proofreader/
* https://github.com/kba/hocrjs
* https://github.com/GeReV/hocr-editor-ts / https://github.com/GeReV/HocrEditor

There are also some GUI tools that I recall for editing hOCR, but they might require you to convert to another format first.
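
If it helps with the checking step before you pick an editor, here is a minimal sketch in Python of producing hOCR with the Tesseract command line and listing the words it was least sure about. The function and class names are made up for illustration, and the threshold of 80 is arbitrary. It assumes the hOCR marks words as 'ocrx_word' spans with 'bbox' and 'x_wconf' in the title attribute, which is what Tesseract's hOCR output normally looks like.

```python
import re
import subprocess
from html.parser import HTMLParser


def tesseract_hocr_command(image_path, output_base, lang="eng"):
    """Build the Tesseract CLI call that writes <output_base>.hocr."""
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_tesseract_hocr(image_path, output_base, lang="eng"):
    """Run Tesseract (must be on PATH) and return the hOCR file name."""
    subprocess.run(tesseract_hocr_command(image_path, output_base, lang), check=True)
    return output_base + ".hocr"


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, or None
        self._depth = 0       # spans nested inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # e.g. per-character spans inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self._current["text"] = self._current["text"].strip()
        self.words.append(self._current)
        self._current = None


def low_confidence_words(hocr_text, threshold=80.0):
    """Return the words whose x_wconf is below the threshold."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < threshold]
```

You would call run_tesseract_hocr("page.png", "page"), read page.hocr as text, and pass it to low_confidence_words() to get a short list of words worth looking at first. The corrected hOCR and the original image can then go to whatever PDF generation tool you end up using.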

Regards,
Merlijn



Thanks again,

On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:

    Hello,
    I'm trying to check PDFs made with Tesseract 5.2 for correctness
    using an OCR editor but am unable to open them in either Abbyy or
    Acrobat.

    If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
    the software just hangs and crashes. I can open Tesseract PDFs with
    Acrobat Pro, but when I enable the 'Make OCR text visible' option
    in Preflight, all of the text layer turns into unreadable black
    boxes. The font used shows as 'GlyphLessFont' and appears to be
    embedded in the file.

    It doesn't matter what training data I use, or what the source image
    was, I always get these results. Any other non-Tesseract made PDF
    works just fine. I'm guessing that the issue is a missing font? I
    don't have much of an understanding about how embedded PDF fonts
    work and I haven't found anything about this in the Tesseract docs.
    Can someone please point me in the right direction? Thanks.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com.

