On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:

My naive assumption when I originally encountered issues with tesseract was that there would be some central repository of training data which we would collaborate on extending and improving in an open-source way, including with examples of bad results on fairly clean inputs.

Ray Smith has been very generous with his time and Google's resources, but it's a bit of an asymmetric situation, and the open source community, by and large, has not organized around wide-scale retraining. The work that has been done is typically isolated one-offs, with the results not captured and used to improve the state of play. The groups that have put significant resources into training typically have a very focused goal, such as early German blackletter or early modern printing.

Given that tesseract is focused on OCR of machine-created text in the first place, creating synthetic datasets also seems very viable.

I think one issue with creating synthetic datasets is access to commercially licensed fonts. Google has the resources to purchase licenses for hundreds of commercial fonts and use them to render a great variety of line images, but there's no economical way for them to provide those fonts to the open source community for reuse. Training also requires a non-trivial amount of computing resources as well as some specialized knowledge.

Tom