Hi Mark,
On 07/03/2024 20:53, Mark Pellegrino wrote:
I found more info here:
https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
Glyphless appears to be an 'invisible font', and it is the only one
Tesseract supports. It seems like the solution is to use Tesseract to
generate hOCR, then use another tool to combine the source image with
the hOCR?
Does anyone have a simple workflow for editing/correcting Tesseract OCR
documents that they can share?
If you're looking to do OCR and PDF generation separately, you might
want to look into the Internet Archive's PDF generation tooling, which
is designed to do exactly this (plus some aggressive compression):
https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
the author of the tooling)
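For the OCR half of that split, here is a minimal sketch in Python. It assumes the `tesseract` binary is on your PATH and that the stock `hocr` config file is installed. The file names are placeholders, and `hocr_command` / `run_ocr` are just illustrative helper names, not part of any tool mentioned here.

```python
import subprocess


def hocr_command(image_path, output_base, lang="eng"):
    """Build a Tesseract command line that writes hOCR instead of a PDF.

    Tesseract's CLI takes: imagename outputbase [options...] [configfile...].
    The trailing "hocr" is the name of a config file shipped with Tesseract.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract; raises CalledProcessError if it exits non-zero."""
    subprocess.run(hocr_command(image_path, output_base, lang), check=True)
    # Assumption: recent Tesseract versions name the result <output_base>.hocr.
    return output_base + ".hocr"
```

Calling `run_ocr("page.png", "page")` should leave the hOCR file next to the image. That file plus the original image is what a separate PDF generation tool would take as input; check that tool's README for its exact options.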
As for viewing and editing hOCR, there are a lot of different tools
around, not all of them fully functional (I haven't tried most of these):
* https://www.not-implemented.de/hocr-proofreader/
* https://github.com/kba/hocrjs
* https://github.com/GeReV/hocr-editor-ts /
https://github.com/GeReV/HocrEditor
There are also some GUI tools that I recall for editing hOCR, but they
might require you to convert to another format first.
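If you only want to find the words worth checking before opening an editor, something like the following might help. It is a rough sketch that assumes Tesseract-style hOCR, where each word is a `<span class='ocrx_word'>` whose `title` attribute carries `bbox` and `x_wconf`. The names `WordCollector` and `low_confidence_words` and the threshold of 80 are made up for illustration.

```python
import re
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect (text, confidence, bbox) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word
        self._depth = 0       # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Formatting tags such as <strong> or <em> nested in the word.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            conf = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", title)
            bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
            self._current = {
                "text": "",
                "conf": float(conf.group(1)) if conf else None,
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
            }
            self._depth = 0

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself.
        word = self._current
        self.words.append((word["text"].strip(), word["conf"], word["bbox"]))
        self._current = None


def low_confidence_words(hocr_text, threshold=80):
    """Return words whose x_wconf is below the threshold."""
    parser = WordCollector()
    parser.feed(hocr_text)
    return [w for w in parser.words if w[1] is not None and w[1] < threshold]
```

Feeding it the contents of an hOCR file gives a list of suspicious words with their bounding boxes, which you can then look up in one of the editors above.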
Regards,
Merlijn
Thanks again,
On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
Hello,
I'm trying to check PDFs made with Tesseract 5.2 for correctness
using an OCR editor but am unable to open them in either Abbyy or
Acrobat.
If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
the software just hangs and crashes. I can open Tesseract PDFs with
Acrobat Pro, but when I enable the 'Make OCR text visible' option
in Preflight, all of the text layer turns into unreadable black
boxes. The font used shows as 'GlyphLessFont' and appears to be
embedded in the file.
It doesn't matter what training data I use, or what the source image
was, I always get these results. Any other non-Tesseract made PDF
works just fine. I'm guessing that the issue is a missing font? I
don't have much of an understanding about how embedded PDF fonts
work and I haven't found anything about this in the Tesseract docs.
Can someone please point me in the right direction? Thanks.
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com.