Re: [tesseract-ocr] Re: Post OCR Verification and Editing

Greg Jay Wed, 10 Apr 2024 16:44:00 -0700

Try this: https://ocr.sanskritdictionary.com/#


It worked for me with the glyphs you mentioned.

Greg

On Wed, Apr 10, 2024 at 9:03 AM Mark Pellegrino <[email protected]> wrote:

> Hi Jeremiah,
>
> Thanks so much, this is a fantastic tool. I just tried using the Scribe
> OCR website to edit an hocr file that was generated with Tesseract against
> its source image, and it worked perfectly. I was also able to make some
> edits then successfully generate and download a PDF containing the image
> and edited text. This is great, just what I needed.
>
> The only issue that I ran into was that it doesn't seem to support the
> Latin characters and ligatures that I need, like æ, œ, ſ, etc. That's
> probably not a complicated fix on my end; I'll just have to dig around in
> the source code. If you could point me in the right direction, it would be
> greatly appreciated.
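
A small sketch that may help when checking character coverage: it lists every non-ASCII character in a piece of text with its Unicode code point and name, so you know exactly which glyphs an editor's font has to provide. The helper name `non_ascii_inventory` and the sample string are made up for illustration and have nothing to do with Scribe's source code.

```python
import unicodedata

def non_ascii_inventory(text):
    """Map each non-ASCII character in text to (code point, Unicode name)."""
    return {
        ch: (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(set(text))
        if ord(ch) > 127
    }

# Hypothetical sample text containing the three characters mentioned above.
for ch, (code_point, name) in non_ascii_inventory("Cæsar, cœlum, ſunt").items():
    print(ch, code_point, name)
```

Running the same function over the text of an OCR result gives the full list of characters to test in any editor.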
>
> Thanks again for your hard work on this, I'll certainly be in touch with
> more questions about Scribe.
>
>  Mark
> On Sunday 31 March 2024 at 04:18:09 UTC-4 Jeremiah wrote:
>
>> There currently is no desktop application, so running requires either (1)
>> using the public site on scribeocr.com or (2) serving the files on your
>> local system using an HTTP server.  I added instructions to the README
>> <https://github.com/scribeocr/scribeocr?tab=readme-ov-file#running> for
>> running locally, which I will also paste below.
>> git clone --recursive https://github.com/scribeocr/scribeocr.git
>> cd scribeocr
>> npm i
>> npx http-server
>> The site can then be visited from a browser at the location printed by
>> `npx http-server`.
>>
>> On Saturday, March 30, 2024 at 12:25:34 PM UTC-7 zdenop wrote:
>>
>>> Hello Jeremiah,
>>>
>>> this looks like a very interesting and nice app. Are there any
>>> instructions for installation?
>>>
>>> I just downloaded the code from GitHub, but recognizing text doesn't work for me:
>>>
>>> [image: image.png]
>>>
>>> BR,
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 30 Mar 2024 at 8:41, Jeremiah <[email protected]> wrote:
>>>
>>>> You can proofread and correct .hocr files made by Tesseract using
>>>> scribeocr.com, which is an open source program I wrote to address
>>>> difficulties proofreading OCR data.  A video demo can be seen here
>>>> <https://www.youtube.com/watch?v=aWDiq3t1EeA>, and the GitHub repo is
>>>> here <https://github.com/scribeocr/scribeocr>.  The program positions
>>>> the glyphs precisely over the source image, which (in my experience)
>>>> reduces the time spent proofreading by 90% versus other methods.  A
>>>> screenshot is below.
>>>>
>>>> [image: scribe_screenshot.PNG]
>>>>
>>>>
>>>> Proofreading .pdfs created by Tesseract is unfortunately not possible,
>>>> given that (as you experienced personally), the precise glyph
>>>> metrics/positioning data is lost when exporting to .pdf.  However, if you
>>>> upload the source image alongside a .hocr file from Tesseract (with
>>>> `hocr_char_boxes: '1'` to include glyph-level data), it should have much
>>>> more information to position glyphs with.  After proofreading is done, a
>>>> .pdf can be exported using the site.  Alternatively, you can run
>>>> recognition directly in the browser using a built-in build of Tesseract,
>>>> which will produce the most accurate overlay due to several changes to
>>>> Tesseract to improve positioning.  The site is still under active
>>>> development, so if you try it and experience any issues, please let me know
>>>> via a GitHub issue or by email to [email protected].
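
For reference, a minimal sketch of producing such a .hocr file with the command-line Tesseract, driven from Python. The `-c hocr_char_boxes=1` setting and the `hocr` config are standard Tesseract options; the helper name `build_hocr_command` and the file name `page.tif` are placeholders of my own, and a `tesseract` binary on the PATH is assumed.

```python
import os
import shutil
import subprocess

def build_hocr_command(image_path, output_base, lang=None):
    """Return the argv list for a Tesseract run that writes <output_base>.hocr."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    # hocr_char_boxes=1 adds per-character boxes to the hOCR output.
    cmd += ["-c", "hocr_char_boxes=1", "hocr"]
    return cmd

if __name__ == "__main__":
    cmd = build_hocr_command("page.tif", "page")  # placeholder file names
    print(" ".join(cmd))
    # Only run the command if Tesseract and the input image are actually present.
    if shutil.which("tesseract") and os.path.exists("page.tif"):
        subprocess.run(cmd, check=False)
```

The resulting `page.hocr` can then be uploaded together with the source image.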
>>>> On Friday, March 15, 2024 at 12:12:39 PM UTC-7 Mark Pellegrino wrote:
>>>>
>>>>> Hi Art,
>>>>>
>>>>> Thanks so much for this. These are very intriguing tools. I'll
>>>>> definitely give Aletheia a try. It seems more suited to my needs than
>>>>> Abbyy. I'll report back once I've done some experimentation.
>>>>>
>>>>> Best,
>>>>> Mark
>>>>>
>>>>> On Wed, Mar 13, 2024 at 3:00 PM Art Rhyno <[email protected]> wrote:
>>>>>
>>>>>> In addition to hOCR, Tesseract can produce the ALTO format, and this
>>>>>> allows the use of the Aletheia editor [1] from the Prima folks. I haven’t
>>>>>> done much correction of handwritten materials, but Aletheia seems flexible
>>>>>> for a Windows environment and exports the PAGE format. You can also start
>>>>>> with hOCR and/or round-trip between ALTO, hOCR, PAGE, and other XML
>>>>>> formats with the ocr-fileformat project [2], which includes some Prima
>>>>>> plumbing. Merlijn and the IA folks have great tools for combining hOCR and
>>>>>> images to make a lightweight PDF if that’s your end goal [3].
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>>
>>>>>> art
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> 1. https://www.primaresearch.org/tools/Aletheia
>>>>>>
>>>>>> 2. https://github.com/UB-Mannheim/ocr-fileformat
>>>>>>
>>>>>> 3. https://git.archive.org/merlijn/archive-pdf-tools
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* [email protected] <[email protected]> *On
>>>>>> Behalf Of *Mark Pellegrino
>>>>>> *Sent:* Wednesday, March 13, 2024 11:25 AM
>>>>>> *To:* [email protected]
>>>>>> *Subject:* Re: [tesseract-ocr] Re: Post OCR Verification and Editing
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Zdenko,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you so much for your continued interest. I'll provide a little
>>>>>> more context: I work for a rare book library in Canada and I have around
>>>>>> 10,000 pages of digitized, handwritten Latin manuscripts that I'm
>>>>>> trying to OCR.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I normally use Abbyy OCR Editor, which has good recognition but
>>>>>> struggles with Latin, particularly with ligatures or antiquated
>>>>>> characters like the long s. Tesseract used with the training data
>>>>>> available from latirocr.org  <http://latirocr.org/> has much better
>>>>>> recognition, near perfect. However, my issue with Tesseract is that I am
>>>>>> unable to define a recognition area in the image, and therefore many
>>>>>> unwanted elements on the page, like smudges, pen marks, tears, decorative
>>>>>> elements, etc., are also recognized as jumbled characters. I understand
>>>>>> that I can preprocess the image in Photoshop to remove these unwanted
>>>>>> elements, then generate hOCR with Tesseract, then merge the hOCR with the
>>>>>> original unprocessed image, but at my scale that's particularly
>>>>>> laborious. I was hoping to OCR all of the images and then use an OCR
>>>>>> editor like Acrobat or Abbyy to edit out any unwanted characters or
>>>>>> inspect the OCR for accuracy, but it appears that Tesseract's use of a
>>>>>> glyphless font makes that impossible.
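
One possible way around the stray marks, sketched under assumptions: rather than cleaning the image first, filter the hOCR afterwards and keep only the words whose bounding boxes fall inside a chosen text area. The sketch relies on the usual hOCR convention of `ocrx_word` spans carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute; the names `WordCollector` and `words_in_region`, the sample markup, and the region coordinates are invented for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

class WordCollector(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None   # bbox of the word currently being read
        self._depth = 0     # span nesting depth inside the current word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # nested span, e.g. per-character ocrx_cinfo
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

def words_in_region(hocr, region):
    """Return the text of words lying entirely inside region (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    collector = WordCollector()
    collector.feed(hocr)
    return [text for text, (x0, y0, x1, y1) in collector.words
            if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1]

# Invented sample: two real words plus a low-confidence smudge in the margin.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1000'>
<span class='ocrx_word' title='bbox 100 100 200 140; x_wconf 96'>Gallia</span>
<span class='ocrx_word' title='bbox 210 100 260 140; x_wconf 91'>est</span>
<span class='ocrx_word' title='bbox 5 900 40 990; x_wconf 12'>~|;</span>
</div>"""

print(words_in_region(SAMPLE, (50, 50, 950, 850)))  # ['Gallia', 'est']
```

This only reports which words survive. Writing a cleaned .hocr file back out would need an extra step, and the text area would have to be chosen per page or per batch.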
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here's what happens if I try to open a Tesseract-made PDF in Acrobat.
>>>>>> As you mentioned, it opens just fine, but when the 'Make OCR Visible'
>>>>>> option is enabled, all of the text turns into black boxes (it's not an
>>>>>> issue of redaction). My understanding is that because the file lacks
>>>>>> any embedded font information, Acrobat can't make sense of the text layer:
>>>>>> there are no associated glyphs to present on screen. Tesseract PDFs
>>>>>> won't open in Abbyy OCR Editor or FineReader at all, I'm guessing for the
>>>>>> same reason.
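
A rough way to check which fonts a PDF declares without opening it in Acrobat, offered as a heuristic sketch only: scan the raw bytes for `/BaseFont` entries. This sees only font dictionaries stored uncompressed, so a PDF that keeps its objects in compressed streams may show nothing. The function name `base_font_names` and the sample bytes are illustrative assumptions.

```python
import re

# Matches entries such as "/BaseFont /GlyphLessFont" in uncompressed PDF objects.
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([A-Za-z0-9+_\-]+)")

def base_font_names(pdf_bytes):
    """Return the set of /BaseFont names found in the raw PDF bytes."""
    return {name.decode("ascii") for name in BASEFONT_RE.findall(pdf_bytes)}

# Invented fragment resembling a font dictionary.
SAMPLE = b"<< /Type /Font /Subtype /Type0 /BaseFont /GlyphLessFont >>"
print(base_font_names(SAMPLE))  # {'GlyphLessFont'}
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes in.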
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for reading. I'll look further into hOCR editing tools. I'm
>>>>>> hoping other institutions can share their procedures for similar
>>>>>> projects.
>>>>>>
>>>>>>
>>>>>>
>>>>>> All the best,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Mar 9, 2024 at 12:52 PM Zdenko Podobny <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> " there's no way to use an off-the-shelf text editor with a glyphless
>>>>>> font."
>>>>>>
>>>>>> I converted
>>>>>> https://github.com/tesseract-ocr/test/blob/main/testing/8087_054.3B.tif
>>>>>> to pdf
>>>>>>
>>>>>> tesseract 8087_054.3B.tif 8087_054.3B pdf
>>>>>>
>>>>>>
>>>>>>
>>>>>> I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro
>>>>>> Version 2023.008.20555 64 bit (on Windows 11).
>>>>>>
>>>>>> However, it seems that it ignored the Tesseract text layer and ran its
>>>>>> own text recognition (including font identification).
>>>>>>
>>>>>>
>>>>>>
>>>>>> I tried to open 8087_054.3B.pdf at
>>>>>> https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and I can
>>>>>> modify the text:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Also https://tinywow.com/pdf/edit seems to work:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> IMO, if a PDF tool offers text editing, it should work with Tesseract
>>>>>> output too.
>>>>>>
>>>>>>
>>>>>>
>>>>>> BR,
>>>>>>
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 8 Mar 2024 at 20:24, Mark Pellegrino <[email protected]> wrote:
>>>>>>
>>>>>> Thank you Merlijn, this is very helpful.  I'm very interested in IA's
>>>>>> process so I'll have a deep dive through those tools.  This confirms my
>>>>>> suspicions that there's no way to use an off-the-shelf text editor with a
>>>>>> glyphless font. I'll explore these hOCR editor options. All the best,
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W. Wajer <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> On 07/03/2024 20:53, Mark Pellegrino wrote:
>>>>>> > I found more info here:
>>>>>> >
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>>>>>> >
>>>>>> > Glyphless appears to be an 'invisible font' and all that Tesseract
>>>>>> > supports. It seems like the solution is to use Tesseract to generate
>>>>>> > hOCR, then use another tool to combine the source image with the hOCR?
>>>>>> >
>>>>>> > Does anyone have a simple workflow for editing/correcting Tesseract OCR
>>>>>> > documents that they can share?
>>>>>>
>>>>>> If you're looking to do OCR and PDF generation separately, you might
>>>>>> want to look into the Internet Archive's PDF generation tooling, which
>>>>>> is designed to do exactly this (plus some aggressive compression):
>>>>>> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
>>>>>> the author of the tooling)
>>>>>>
>>>>>> As for viewing and editing hOCR, there are a lot of different tools
>>>>>> around, not all fully functional (I haven't tried most of these):
>>>>>>
>>>>>> * https://www.not-implemented.de/hocr-proofreader/
>>>>>> * https://github.com/kba/hocrjs
>>>>>> * https://github.com/GeReV/hocr-editor-ts /
>>>>>> https://github.com/GeReV/HocrEditor
>>>>>>
>>>>>> There are also some GUI tools that I recall for editing hOCR, but they
>>>>>> might require you to convert to another format first.
>>>>>>
>>>>>> Regards,
>>>>>> Merlijn
>>>>>>
>>>>>>
>>>>>> >
>>>>>> > Thanks again,
>>>>>> >
>>>>>> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>>>>>> >
>>>>>> >     Hello,
>>>>>> >     I'm trying to check PDFs made with Tesseract 5.2 for correctness
>>>>>> >     using an OCR editor but am unable to open them in either Abbyy or
>>>>>> >     Acrobat.
>>>>>> >
>>>>>> >     If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
>>>>>> >     the software just hangs and crashes. I can open Tesseract PDFs with
>>>>>> >     Acrobat Pro, but when I enable the 'Make OCR text visible' option
>>>>>> >     in Preflight, all of the text layer turns into unreadable black
>>>>>> >     boxes. The font used shows as 'GlyphLessFont' and appears to be
>>>>>> >     embedded in the file.
>>>>>> >
>>>>>> >     It doesn't matter what training data I use, or what the source image
>>>>>> >     was, I always get these results. Any other non-Tesseract-made PDF
>>>>>> >     works just fine. I'm guessing that the issue is a missing font? I
>>>>>> >     don't have much of an understanding of how embedded PDF fonts
>>>>>> >     work and I haven't found anything about this in the Tesseract docs.
>>>>>> >     Can someone please point me in the right direction? Thanks.
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > You received this message because you are subscribed to the Google
>>>>>> > Groups "tesseract-ocr" group.
>>>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>>>> send
>>>>>> > an email to [email protected]
>>>>>> > <mailto:[email protected]>.
>>>>>> > To view this discussion on the web visit
>>>>>> >
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
>>>>>> <
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer
>>>>>> >.
>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CA%2BboD0MbCuK%2Bd%2B5hxG7b2hPM6nodrC684iYzKxkGpvOWodH0fw%40mail.gmail.com.
