Hi,
I've been using Tesseract 4 to extract text from PNG and TIFF images, all 
of which contain German text. Although the image quality is fairly good, 
the extraction results are disappointing for some of them. I understand 
that training Tesseract with additional data is the recommended way to 
improve OCR accuracy.

However, I only have the images, without the exact text (ground truth) or 
bounding boxes. Creating this data manually seems like a massive 
undertaking. Do you recommend it as the best course of action? Or are 
there other solutions, or perhaps existing prepared datasets for German, 
that I could use?

I'm also curious about the volume of training data required. Is there a 
minimum number of images and corresponding texts that you'd consider 
sufficient to start seeing improved results?

Any guidance or resources you can provide would be greatly appreciated.

Atef
