Thanks Zdenko, PyMuPDF is an intriguing option. I'll check it out further.
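
For reference, here is a minimal sketch of the hOCR-plus-image workflow discussed in the quoted messages. It is not a tested pipeline. It assumes the hOCR file came from something like `tesseract page.png page hocr`, that the scan resolution is known (300 dpi is only a placeholder), and that PyMuPDF is installed. The names `HocrWords`, `to_points` and `add_invisible_text` are made up for illustration. Text is written with PyMuPDF's `render_mode=3` (invisible) and the built-in Helvetica font, which only covers Latin text; other scripts would need a font file passed via `fontfile`.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span in an hOCR file.

    Assumption: word spans contain no nested <span> elements, which is the
    case for default Tesseract hOCR output (character boxes disabled).
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # pixel bbox of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def to_points(bbox, dpi):
    """Convert a pixel bbox to PDF points (72 per inch), top-left origin."""
    return tuple(v * 72.0 / dpi for v in bbox)


def add_invisible_text(image_path, hocr_path, out_path, dpi=300):
    """Write a one-page PDF: the scan as background, hOCR words as hidden text."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    parser = HocrWords()
    with open(hocr_path, encoding="utf-8") as fh:
        parser.feed(fh.read())

    pix = fitz.Pixmap(image_path)  # only used to read the pixel dimensions
    doc = fitz.open()              # new, empty PDF
    page = doc.new_page(width=pix.width * 72.0 / dpi,
                        height=pix.height * 72.0 / dpi)
    page.insert_image(page.rect, filename=image_path)

    for text, bbox in parser.words:
        x0, y0, x1, y1 = to_points(bbox, dpi)
        # The bottom of the word box stands in for the baseline and the box
        # height for the font size; both are rough approximations.
        page.insert_text((x0, y1), text, fontsize=max(y1 - y0, 1.0),
                         render_mode=3)  # 3 = invisible text
    doc.save(out_path)
```

Calling `add_invisible_text("page.png", "page.hocr", "page.pdf")` should give a searchable PDF whose text layer uses a regular font instead of GlyphLessFont. Corrections can be made by editing the hOCR file before this step. Word placement is only approximate, so selection highlights may not line up exactly with the scanned text.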

On Fri, Mar 8, 2024 at 6:14 AM Zdenko Podobny <zde...@gmail.com> wrote:

> Hello,
>
>
> I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/)
> allows redaction.
>
> If you would like to implement the text layer yourself with a custom font,
> have a look at PyMuPDF:
>
>    - https://github.com/pymupdf/PyMuPDF/discussions/775 (Adding text
>    layer to a scanned PDF)
>    - https://github.com/pymupdf/PyMuPDF/discussions/2464 (invisible text
>    layer)
>
>
> Zdenko
>
>
> On Thu, Mar 7, 2024 at 20:53 Mark Pellegrino <mar...@gmail.com> wrote:
>
>> I found more info here:
>>
>> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>>
>> Glyphless appears to be an 'invisible font' and all that Tesseract
>> supports. It seems like the solution is to use Tesseract to generate hOCR,
>> then use another tool to combine the source image with the hOCR?
>>
>> Does anyone have a simple workflow for editing/correcting Tesseract OCR
>> documents that they can share?
>>
>> Thanks again,
>>
>> On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>>
>>> Hello,
>>> I'm trying to check PDFs made with Tesseract 5.2 for correctness using
>>> an OCR editor but am unable to open them in either Abbyy or Acrobat.
>>>
>>> If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor, the
>>> software just hangs and crashes. I can open Tesseract PDFs with Acrobat
>>> Pro, but when I enable the 'Make OCR text visible' option in Preflight,
>>> all of the text layer turns into unreadable black boxes. The font used
>>> shows as 'GlyphLessFont' and appears to be embedded in the file.
>>>
>>> It doesn't matter what training data I use, or what the source image
>>> was, I always get these results. Any other non-Tesseract made PDF works
>>> just fine. I'm guessing that the issue is a missing font? I don't have much
>>> of an understanding about how embedded PDF fonts work and I haven't found
>>> anything about this in the Tesseract docs. Can someone please point me in
>>> the right direction? Thanks.
>>>
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>

