On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:

My naive assumption when I originally encountered issues with 
tesseract was that there would be some central repository of training 
data which we would collaborate on extending and improving in an 
open-source way, including with examples of bad results on fairly 
clean inputs. 


Ray Smith has been very generous with his time and Google's resources, but 
it's a bit of an asymmetric situation, and the open-source community, by and 
large, has not organized around wide-scale retraining. The work that has 
been done is typically isolated one-offs, with the results not captured 
and used to improve the state of play. The groups that have put significant 
resources into training typically have a very focused goal, such as early 
German blackletter or early modern printing.
 

Given that tesseract is focused on OCR of 
machine-created text in the first place, creating synthetic datasets 
also seems very viable.


I think one issue with creating synthetic datasets is access to 
commercially licensed fonts. Google has the resources to purchase licenses 
for hundreds of commercial fonts and use them to render a great variety of 
line images, but there's no economical way for them to provide those fonts 
to the open-source community for reuse. 

Training also requires a non-trivial amount of computing resources as well 
as some specialized knowledge. 

Tom
