[tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Tom Morris Fri, 05 Jan 2024 08:27:24 -0800

On Friday, January 5, 2024 at 9:30:05 AM UTC-5 [email protected] wrote:

Would you offer any suggestions as to next steps I could take from here? 
E.g. it seems my options are:


   1. I can go back and train the legacy engine (e.g. *--oem 0*) on the 
   fonts as well (I've been using this guide: 
   https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/), 
   and hope that gets me pretty good results
   2. I can use some sort of post-processing step after tesseract to detect 
   italics / bold / etc. (although I'm not sure what tools/software/library 
   I'd use for this, so I'd really need suggestions)
   3. I could wait and hope that adding WordFontAttributes back to the 
   non-legacy engine becomes a priority on the roadmap
   4. Something else perhaps?
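
As a rough illustration of how options 1 and 2 could fit together: as far as I know, the legacy engine's hOCR renderer wraps words it judges bold or italic in <strong> and <em> tags, so a small post-processing script can read the style flags back out. The sketch below uses only the Python standard library. The names HocrStyleParser, styled_words and SAMPLE are made up for illustration, and SAMPLE is hand-written hOCR-like markup, not real Tesseract output.

```python
from html.parser import HTMLParser


class HocrStyleParser(HTMLParser):
    """Collect (text, bold, italic) for each hOCR word span (class ocrx_word)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._depth = 0  # nested <span> elements inside the current word
        self._bold = False
        self._italic = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self._in_word:
                # e.g. per-character spans nested inside a word
                self._depth += 1
            elif "ocrx_word" in (dict(attrs).get("class") or "").split():
                self._in_word = True
                self._depth = 0
                self._bold = self._italic = False
                self._text = []
        elif self._in_word and tag == "strong":
            self._bold = True
        elif self._in_word and tag == "em":
            self._italic = True

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            if self._depth > 0:
                self._depth -= 1
            else:
                text = "".join(self._text).strip()
                self.words.append((text, self._bold, self._italic))
                self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)


def styled_words(hocr):
    """Return a list of (text, bold, italic) tuples found in an hOCR string."""
    parser = HocrStyleParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample for illustration only (assumed shape of the markup).
SAMPLE = (
    "<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>plain</span> "
    "<span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 93'>"
    "<em>slanted</em></span> "
    "<span class='ocrx_word' title='bbox 140 10 200 30; x_wconf 91'>"
    "<strong>heavy</strong></span>"
)

if __name__ == "__main__":
    for text, bold, italic in styled_words(SAMPLE):
        print(text, "bold" if bold else "", "italic" if italic else "")
```

The input would come from something like "tesseract page.png page --oem 0 hocr" (file names are placeholders), which needs a traineddata file that still contains the legacy model. How reliable the resulting flags are on a given set of scans is a separate question that this sketch does not address.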

I'm afraid I don't have any magic solutions (or even good suggestions). The 
only thing I can offer is to perhaps not be so fixated on Tesseract as a 
solution.

- would a different OCR package (including commercial) give you better 
results?
- do you really *need* the italics?
- could you implement a crowdsourced annotation facility that let people 
add the italics later?

Good luck!

Tom

