Re: [tesseract-ocr] Re: Post OCR Verification and Editing

Mark Pellegrino Wed, 10 Apr 2024 12:03:25 -0700

Hi Jeremiah,

Thanks so much, this is a fantastic tool. I just tried using the Scribe OCR 
website to edit an hocr file that was generated with Tesseract against its 
source image, and it worked perfectly. I was also able to make some edits 
then successfully generate and download a PDF containing the image and 
edited text. This is great, just what I needed.


The only issue that I ran into was that it doesn't seem to support the 
Latin characters and ligatures that I need, like æ, œ, ſ, etc. That's 
probably not a complicated fix on my end; I'll just have to dig around in 
the source code. If you could point me in the right direction, it would be 
greatly appreciated.

Thanks again for your hard work on this, I'll certainly be in touch with 
more questions about Scribe. 

 Mark
On Sunday 31 March 2024 at 04:18:09 UTC-4 Jeremiah wrote:

> There currently is no desktop application, so running requires either (1) 
> using the public site on scribeocr.com or (2) serving the files on your 
> local system using an HTTP server.  I added instructions to the README 
> <https://github.com/scribeocr/scribeocr?tab=readme-ov-file#running> for 
> running locally, which I will also paste below.
>
> git clone --recursive https://github.com/scribeocr/scribeocr.git
> cd scribeocr
> npm i
> npx http-server
>
> The site can then be visited from a browser at the location printed by 
> `npx http-server`.
>
> On Saturday, March 30, 2024 at 12:25:34 PM UTC-7 zdenop wrote:
>
>> Hello Jeremiah,
>>
>> this looks like a very interesting and nice app. Are there any 
>> instructions for installation?
>>
>> I just downloaded the code from GitHub, but recognizing text doesn't work for me:
>>
>> [image: image.png]
>>
>> BR,
>>
>>
>> Zdenko
>>
>>
>> On Sat, 30 Mar 2024 at 8:41, Jeremiah <[email protected]> wrote:
>>
>>> You can proofread and correct .hocr files made by Tesseract using 
>>> scribeocr.com, which is an open source program I wrote to address 
>>> difficulties proofreading OCR data.  A video demo can be seen here 
>>> <https://www.youtube.com/watch?v=aWDiq3t1EeA>, and the GitHub repo is 
>>> here <https://github.com/scribeocr/scribeocr>.  The program positions 
>>> the glyphs precisely over the source image, which (in my experience) 
>>> reduces the time spent proofreading by 90% versus other methods.  A 
>>> screenshot is below.
>>>
>>> [image: scribe_screenshot.PNG]
>>>
>>>
>>> Proofreading .pdfs created by Tesseract is unfortunately not possible, 
>>> given that (as you experienced personally) the precise glyph 
>>> metrics/positioning data is lost when exporting to .pdf.  However, if you 
>>> upload the source image alongside a .hocr file from Tesseract (with 
>>> `hocr_char_boxes: '1'` to include glyph-level data), the site should have 
>>> much more information to position glyphs with.  After proofreading is done, a 
>>> .pdf can be exported using the site.  Alternatively, you can run 
>>> recognition directly in the browser using a built-in build of Tesseract, 
>>> which will produce the most accurate overlay due to several changes to 
>>> Tesseract to improve positioning.   The site is still under active 
>>> development, so if you try it and experience any issues please let me know 
>>> via a Git Issue or email to [email protected]. 
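
For readers producing the .hocr file with the Tesseract command line rather than in the browser, the following is a minimal sketch of one way to enable character boxes. It assumes the `tesseract` executable is on PATH; the file names and the `lat` language code are placeholders, and the helper names are hypothetical.

```python
import shutil
import subprocess


def build_hocr_command(image_path, output_base, lang=None):
    """Build the argv for a Tesseract run that writes <output_base>.hocr."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # Optional language/traineddata name, e.g. "lat" (placeholder).
        command += ["-l", lang]
    # hocr_char_boxes=1 adds character-level boxes to the hOCR output;
    # the trailing "hocr" selects Tesseract's hOCR output config.
    command += ["-c", "hocr_char_boxes=1", "hocr"]
    return command


def run_hocr(image_path, output_base, lang=None):
    """Run Tesseract; raises if the executable is missing or the run fails."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    subprocess.run(build_hocr_command(image_path, output_base, lang), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    print(" ".join(build_hocr_command("page.tif", "page")))
```

The resulting .hocr file and the source image are the two inputs described in the message above.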
>>> On Friday, March 15, 2024 at 12:12:39 PM UTC-7 Mark Pellegrino wrote:
>>>
>>>> Hi Art,
>>>>
>>>> Thanks so much for this. These are very intriguing tools. I'll 
>>>> definitely give Aletheia a try. It seems more suited to my needs than 
>>>> Abbyy. I'll report back once I've done some experimentation.
>>>>
>>>> Best,
>>>> Mark
>>>>
>>>> On Wed, Mar 13, 2024 at 3:00 PM Art Rhyno <[email protected]> wrote:
>>>>
>>>>> In addition to hocr, Tesseract can produce the alto format, and this 
>>>>> allows the use of the Aletheia editor [1] from the Prima folks. I haven’t 
>>>>> done much correction of hand-written materials, but Aletheia seems flexible 
>>>>> for a Windows environment and exports the page format. You can also start 
>>>>> with hocr and/or roundtrip between alto, hocr, page, and other xml 
>>>>> formats with the ocr-fileformat project [2], which includes some Prima plumbing.  
>>>>> Merlijn and the IA folks have great tools for combining hocr and images to 
>>>>> make a lightweight PDF if that’s your end goal [3].
>>>>>
>>>>>  
>>>>>
>>>>> Best,
>>>>>
>>>>>  
>>>>>
>>>>> art
>>>>>
>>>>> ---
>>>>>
>>>>> 1. https://www.primaresearch.org/tools/Aletheia
>>>>>
>>>>> 2. https://github.com/UB-Mannheim/ocr-fileformat
>>>>>
>>>>> 3. https://git.archive.org/merlijn/archive-pdf-tools
>>>>>
>>>>>  
>>>>>
>>>>> *From:* [email protected] <[email protected]> *On 
>>>>> Behalf Of *Mark Pellegrino
>>>>> *Sent:* Wednesday, March 13, 2024 11:25 AM
>>>>> *To:* [email protected]
>>>>> *Subject:* Re: [tesseract-ocr] Re: Post OCR Verification and Editing
>>>>>
>>>>>  
>>>>>
>>>>> Hi Zdenko, 
>>>>>
>>>>>  
>>>>>
>>>>> Thank you so much for your continued interest. I'll provide a little 
>>>>> more context: I work for a rare book library in Canada and I have around 
>>>>> 10,000 pages of digitized, hand-written Latin manuscripts that I'm 
>>>>> trying to OCR.
>>>>>
>>>>>  
>>>>>
>>>>> I normally use Abbyy OCR Editor, which has good recognition but 
>>>>> struggles with Latin, particularly with ligatures or antiquated 
>>>>> characters like the long s. Tesseract used with the training data available from 
>>>>> latirocr.org  <http://latirocr.org/> has much better recognition, 
>>>>> near perfect. However, my issue with Tesseract is that I am unable to 
>>>>> define a recognition area in the image, and therefore many unwanted 
>>>>> elements on the page like smudges, pen marks, tears, decorative elements, 
>>>>> etc., are also recognized as jumbled characters. I understand that I can 
>>>>> preprocess the image in Photoshop to remove these unwanted elements, then 
>>>>> generate hocr with Tesseract, then merge the hocr with the original 
>>>>> unprocessed image, but at my scale that's particularly laborious. I was 
>>>>> hoping to OCR all of the images, then use an OCR editor like Acrobat or 
>>>>> Abbyy to edit out any unwanted characters or inspect the OCR for 
>>>>> accuracy, but it appears that Tesseract's use of a glyphless font makes that 
>>>>> impossible. 
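
One possible way to approximate a "recognition area" after the fact is to filter the hOCR output rather than the image. The sketch below is only an illustration under assumptions: it relies on the `bbox x0 y0 x1 y1` entries that hOCR stores in each element's `title` attribute, it has only been exercised on the small inline sample shown, and the function names and region coordinates are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Keep the XHTML namespace as the default one when serializing real hOCR files.
ET.register_namespace("", "http://www.w3.org/1999/xhtml")

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def word_bbox(element):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(v) for v in match.groups()) if match else None


def inside(bbox, region):
    """True if bbox lies entirely within region (both are x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def drop_words_outside(hocr_text, region):
    """Remove ocrx_word elements whose boxes fall outside the region."""
    root = ET.fromstring(hocr_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in list(root.iter()):
        if "ocrx_word" not in element.get("class", "").split():
            continue
        bbox = word_bbox(element)
        if bbox is not None and not inside(bbox, region):
            parents[element].remove(element)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: one real word and one marginal smudge.
SAMPLE = (
    '<div class="ocr_page" title="bbox 0 0 1000 1000">'
    '<span class="ocr_line" title="bbox 100 100 900 150">'
    '<span class="ocrx_word" title="bbox 100 100 300 150; x_wconf 95">Gallia</span> '
    '<span class="ocrx_word" title="bbox 5 400 20 420; x_wconf 12">~j</span>'
    "</span></div>"
)

cleaned = drop_words_outside(SAMPLE, region=(50, 50, 950, 950))
```

The region would still have to be chosen per page or per batch, so this reduces the manual cleanup rather than eliminating it.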
>>>>>
>>>>>  
>>>>>
>>>>> Here's what happens if I try to open a Tesseract-made PDF in Acrobat. 
>>>>> As you mentioned, it opens just fine, but when the 'Make OCR Visible' 
>>>>> option is enabled all of the text turns into black boxes (it's not an 
>>>>> issue of redaction). My understanding is that because of the lack of any 
>>>>> embedded font information in the file, Acrobat can't make sense of the text layer 
>>>>> because there are no associated glyphs to present on screen. Tesseract 
>>>>> PDFs won't open in Abbyy OCR Editor or FineReader at all, I'm guessing for the 
>>>>> same reason.
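
To sort a large batch of PDFs into "made by Tesseract" and "other", a rough check is to look for the font name reported in this thread, 'GlyphLessFont', in the raw file bytes. This is a heuristic sketch, not a PDF parser: it assumes the font name is stored uncompressed in the file, which may not hold once another tool has rewritten the PDF, and the function names are hypothetical.

```python
def looks_like_glyphless_pdf(pdf_bytes):
    """Heuristic: True if the raw bytes mention the GlyphLessFont name."""
    return b"/GlyphLessFont" in pdf_bytes


def check_file(path):
    """Read a PDF from disk and apply the heuristic to its bytes."""
    with open(path, "rb") as handle:
        return looks_like_glyphless_pdf(handle.read())
```

A `False` result does not prove the file has a usable text layer; it only means the font name was not found in plain form.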
>>>>>
>>>>>  
>>>>>
>>>>> Thanks for reading. I'll look further into hocr editing tools. I'm 
>>>>> hoping other institutions can share their procedures for similar projects.
>>>>>
>>>>>  
>>>>>
>>>>> All the best,
>>>>>
>>>>>  
>>>>>
>>>>>  
>>>>>
>>>>> On Sat, Mar 9, 2024 at 12:52 PM Zdenko Podobny <[email protected]> 
>>>>> wrote:
>>>>>
>>>>> " there's no way to use an off-the-shelf text editor with a glyphless 
>>>>> font."
>>>>>
>>>>> I converted  
>>>>> https://github.com/tesseract-ocr/test/blob/main/testing/8087_054.3B.tif 
>>>>> to pdf
>>>>>
>>>>> tesseract 8087_054.3B.tif 8087_054.3B pdf
>>>>>
>>>>>  
>>>>>
>>>>> I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro 
>>>>> Version 2023.008.20555 64 bit (on Windows 11).
>>>>>
>>>>> However, it seems that it ignored the Tesseract text layer and ran its 
>>>>> own text recognition (including font identification).
>>>>>
>>>>>  
>>>>>
>>>>> I tried to open 8087_054.3B.pdf at 
>>>>> https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and I could 
>>>>> modify the text:
>>>>>
>>>>>  
>>>>>
>>>>>  
>>>>>
>>>>> Also https://tinywow.com/pdf/edit seems to work:
>>>>>
>>>>>  
>>>>>
>>>>>  
>>>>>
>>>>> IMO, if a PDF tool offers text editing, it should work with Tesseract 
>>>>> output too.
>>>>>
>>>>>  
>>>>>
>>>>> BR,
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>  
>>>>>
>>>>>  
>>>>>
>>>>> On Fri, 8 Mar 2024 at 20:24, Mark Pellegrino <[email protected]> wrote:
>>>>>
>>>>> Thank you Merlijn, this is very helpful.  I'm very interested in IA's 
>>>>> process so I'll have a deep dive through those tools.  This confirms my 
>>>>> suspicions that there's no way to use an off-the-shelf text editor with a 
>>>>> glyphless font. I'll explore these hOCR editor options. All the best,
>>>>>
>>>>>  
>>>>>
>>>>> On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W. Wajer <[email protected]> 
>>>>> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> On 07/03/2024 20:53, Mark Pellegrino wrote:
>>>>> > I found more info here:
>>>>> > 
>>>>> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>>>>> > 
>>>>> > Glyphless appears to be an 'invisible font' and all that Tesseract 
>>>>> > supports. It seems like the solution is to use Tesseract to generate 
>>>>> > hOCR, then use another tool to combine the source image with the hOCR?
>>>>> > 
>>>>> > Does anyone have a simple workflow for editing/correcting Tesseract OCR 
>>>>> > documents that they can share?
>>>>>
>>>>> If you're looking to do OCR and PDF generation separately, you might 
>>>>> want to look into the Internet Archive's PDF generation tooling, which 
>>>>> is designed to do exactly this (plus some aggressive compression): 
>>>>> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm 
>>>>> the author of the tooling)
>>>>>
>>>>> As for viewing and editing hOCR, there are a lot of different tools 
>>>>> around, not all fully functional (I haven't tried most of these):
>>>>>
>>>>> * https://www.not-implemented.de/hocr-proofreader/
>>>>> * https://github.com/kba/hocrjs
>>>>> * https://github.com/GeReV/hocr-editor-ts / 
>>>>> https://github.com/GeReV/HocrEditor
>>>>>
>>>>> There are also some GUI tools that I recall for editing hOCR, but they 
>>>>> might require you to convert to another format first.
>>>>>
>>>>> Regards,
>>>>> Merlijn
>>>>>
>>>>>
>>>>> > 
>>>>> > Thanks again,
>>>>> > 
>>>>> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>>>>> > 
>>>>> >     Hello,
>>>>> >     I'm trying to check PDFs made with Tesseract 5.2 for correctness
>>>>> >     using an OCR editor but am unable to open them in either Abbyy or
>>>>> >     Acrobat.
>>>>> > 
>>>>> >     If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
>>>>> >     the software just hangs and crashes. I can open Tesseract PDFs with
>>>>> >     Acrobat Pro, but when I enable the 'Make OCR text visible' option
>>>>> >     in Preflight, all of the text layer turns into unreadable black
>>>>> >     boxes. The font used shows as 'GlyphLessFont' and appears to be
>>>>> >     embedded in the file.
>>>>> > 
>>>>> >     It doesn't matter what training data I use, or what the source image
>>>>> >     was, I always get these results. Any other non-Tesseract made PDF
>>>>> >     works just fine. I'm guessing that the issue is a missing font? I
>>>>> >     don't have much of an understanding about how embedded PDF fonts
>>>>> >     work and I haven't found anything about this in the Tesseract docs.
>>>>> >     Can someone please point me in the right direction? Thanks.
>>>>> > 
>>>>> > 
>>>>> > -- 
>>>>> > You received this message because you are subscribed to the Google 
>>>>> > Groups "tesseract-ocr" group.
>>>>> > To unsubscribe from this group and stop receiving emails from it, 
>>>>> send 
>>>>> > an email to [email protected] 
>>>>> > <mailto:[email protected]>.
>>>>> > To view this discussion on the web visit 
>>>>> > 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
>>>>>  
>>>>> <
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer
>>>>> >.
>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d10a895e-f1cb-42cd-8e1a-78cbffe08a2cn%40googlegroups.com.
