Hi, I've been using Tesseract 4 to extract text from PNG and TIFF images; all of the content is in German. The image quality is reasonably good, but the extraction results are poor for some of the images. As I understand it, the recommended way to improve OCR accuracy is to train Tesseract with additional data.
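For context, here is a minimal sketch of the kind of extraction call I mean. It is an illustration, not my exact setup: it assumes the `tesseract` binary is on PATH and the `deu` language data is installed, and the helper names `build_tesseract_cmd` and `ocr_image`, the `ocr_out` directory, and the `--oem 1` / `--psm 3` choices are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, lang="deu", oem=1, psm=3):
    """Build the argument list for a Tesseract 4 command-line call.

    lang="deu" selects the German model, --oem 1 the LSTM engine, and
    --psm 3 fully automatic page segmentation. These values are
    assumptions for illustration only.
    """
    return [
        "tesseract", str(image), str(out_base),
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def ocr_image(image, out_dir="ocr_out"):
    """Run Tesseract on one PNG/TIFF file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_base = out_dir / Path(image).stem
    subprocess.run(build_tesseract_cmd(image, out_base), check=True)
    # Tesseract appends ".txt" to the output base name it is given.
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image("scan_001.tiff")` (a made-up file name) would write `ocr_out/scan_001.txt` and return its contents.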
However, I've hit a roadblock: I only have the images, without the exact text (ground truth) or bounding boxes. Creating this data manually seems like a massive undertaking. Is that really the best course of action, or are there other approaches, or existing prepared datasets for German, that I could use? I'm also curious about the volume of training data required: is there a minimum number of images and corresponding texts that you'd consider sufficient to start seeing improved results? Any guidance or resources you can provide would be greatly appreciated.

Atef

