[tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Scott Goci Fri, 05 Jan 2024 09:48:33 -0800

Hey Tom,

Overall, thanks for your guidance here; I appreciate our back and forth!

RE: "[...] do you really *need* the italics?": I think a lot is actually 
lost without font attributes (e.g. bold / italic / underline). 
Consider the following sentences / quotes:

   - "I never said she stole the money"
   - "I never said *she* stole the money"
   - "I *never* said she stole the money"
   - "I never said she *stole* the money"

The meaning of the sentence varies drastically depending on which word (if 
any) is italicized.

For other font attributes (e.g. bold/underline) the case for implementation 
isn't as strong, but I still believe we miss some things. E.g. consider 
the following:

   - Not ten eggs, *ea*ten eggs (here, underlining emphasizes the specific 
   part of the text that changes the meaning of the word at hand)
   - *Scott:* What is your biggest accomplishment? (in an interview 
   context, this highlights who is asking the question, especially if a 
   different person is responding)

----

I can definitely try other OCR packages, but since this is the biggest 
non-commercial OCR library, I assume other non-commercial OCR libraries 
might not yield results as good. I can also try commercial libraries as 
you suggest, although then I would be beholden to potentially expensive 
pricing schemes.

Let me know if you have any final thoughts, but otherwise I'll take the 
advice you've given and go from here!

On Friday, January 5, 2024 at 11:27:10 AM UTC-5 tfmo...@gmail.com wrote:

> On Friday, January 5, 2024 at 9:30:05 AM UTC-5 sco...@gmail.com wrote:
>
> Would you offer any suggestions as to next steps I could take from here? 
> E.g. it seems my options are:
>
>    1. I can go back and train the legacy engine (e.g. *--oem 0*) on the 
>    fonts as well (I've been using this guide: 
>    https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/),
>    and hope the results improve enough that I get pretty good results
>    2. I can use some sort of post-processing step after tesseract to 
>    detect italics / bold / etc (although I'm not sure what 
>    tools/software/library I'd use for this, so I'd really need suggestions)
>    3. I could wait and hope the roadmap for adding back 
>    WordFontAttributes to the non-legacy engine becomes a priority
>    4. Something else perhaps?
>
> I'm afraid I don't have any magic solutions (or even good suggestions). 
> The only thing I can offer is to perhaps not be so fixated on Tesseract as 
> a solution.
>
> - would a different OCR package (including commercial) give you better 
> results?
> - do you really *need* the italics?
> - could you implement a crowdsourced annotation facility that let people 
> add the italics later?
>
> Good luck!
>
> Tom
>

