On Friday, January 5, 2024 at 9:30:05 AM UTC-5 sco...@gmail.com wrote:

Would you offer any suggestions as to next steps I could take from here? E.g. it seems my options are:
1. I can go back and train the legacy engine (e.g. *--oem 0*) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/), and hope the results improve enough
2. I can use some sort of post-processing step after tesseract to detect italics / bold / etc. (although I'm not sure which tools, software, or libraries I'd use for this, so I'd really need suggestions)
3. I could wait and hope that adding WordFontAttributes back to the non-legacy engine becomes a priority on the roadmap
4. Something else perhaps?

I'm afraid I don't have any magic solutions (or even good suggestions). The only thing I can offer is to perhaps not be so fixated on Tesseract as a solution.

- would a different OCR package (including commercial) give you better results?
- do you really *need* the italics?
- could you implement a crowdsourced annotation facility that lets people add the italics later?

Good luck!

Tom