Hello,

I am not sure whether OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/)
allows redaction.

If you would like to implement the text layer yourself with a custom font,
have a look at PyMuPDF:

   - https://github.com/pymupdf/PyMuPDF/discussions/775 (Adding text layer
   to a scanned PDF)
   - https://github.com/pymupdf/PyMuPDF/discussions/2464 (invisible text
   layer)
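
As a rough illustration of the hOCR-plus-PyMuPDF route, here is a minimal
sketch, not a tested workflow. It assumes the hOCR comes from Tesseract
(word spans with class ocrx_word and a "bbox x0 y0 x1 y1" title) and that
the scan was made at 300 dpi. The names parse_hocr_words and add_text_layer
are made up for this example. Using the bottom of the word box as the text
baseline is only an approximation.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or ""):
            m = BBOX.search(a.get("title") or "")
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        # Only text that directly follows a word span with a bbox is kept.
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return the words and pixel bounding boxes found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr_text)
    return parser.words


def add_text_layer(image_path, words, out_path, dpi=300):
    """Write a one-page PDF: the scan plus an invisible text layer."""
    import fitz  # PyMuPDF, imported here so the parser works without it

    scale = 72.0 / dpi  # hOCR pixels -> PDF points (dpi is an assumption)
    pix = fitz.Pixmap(image_path)
    doc = fitz.open()  # new, empty PDF
    page = doc.new_page(width=pix.width * scale, height=pix.height * scale)
    page.insert_image(page.rect, filename=image_path)
    for text, (x0, y0, x1, y1) in words:
        page.insert_text(
            (x0 * scale, y1 * scale),    # bottom-left of the word box
            text,
            fontsize=(y1 - y0) * scale,  # rough size from the box height
            fontname="helv",             # built-in font, swap for your own
            render_mode=3,               # 3 = invisible text
        )
    doc.save(out_path)
```

With render_mode=0 instead of 3 the text becomes visible, which may help
when checking the OCR result against the scan.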


Zdenko


On Thu, 7 Mar 2024 at 20:53, Mark Pellegrino <[email protected]> wrote:

> I found more info here:
>
> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>
> Glyphless appears to be an 'invisible font' and all that Tesseract
> supports. It seems like the solution is to use Tesseract to generate hOCR,
> then use another tool to combine the source image with the hOCR?
>
> Does anyone have a simple workflow for editing/correcting Tesseract OCR
> documents that they can share?
>
> Thanks again,
>
> On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>
>> Hello,
>> I'm trying to check PDFs made with Tesseract 5.2 for correctness using an
>> OCR editor but am unable to open them in either Abbyy or Acrobat.
>>
>> If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor, the
>> software just hangs and crashes. I can open Tesseract PDFs with Acrobat
>> Pro, but when I enable the 'Make OCR text visible' option in Preflight,
>> all of the text layer turns into unreadable black boxes. The font used
>> shows as 'GlyphLessFont' and appears to be embedded in the file.
>>
>> It doesn't matter what training data I use, or what the source image was,
>> I always get these results. Any other non-Tesseract made PDF works just
>> fine. I'm guessing that the issue is a missing font? I don't have much of
>> an understanding about how embedded PDF fonts work and I haven't found
>> anything about this in the Tesseract docs. Can someone please point me in
>> the right direction? Thanks.
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

