Hi,
We are interested in improving the performance of Tesseract, and we have prepared a large set of over 11k pages manually annotated with text-line bounding boxes and transcriptions. When we fine-tune Tesseract on this set we observe a slight decrease in performance, and we would like to identify the cause before running the fine-tuning again. We have some questions about the process and would be grateful if you could help us understand fine-tuning for Tesseract.

We have run several fine-tuning tests with this set, with mixed results. We evaluate against an existing benchmark that we call the mini-holistic set. The metrics we use are Levenshtein distance and the % of missing words (computed over unique words). Fine-tuning on our manually annotated set, we obtain a similar Levenshtein distance (probably not statistically different), but the % of missing words increases, e.g. from 7% to over 9.6%.

1. We noticed that our fine-tuned model degraded on scanned documents, so we used PIL to add noise to the preprocessed bounding boxes and trained on them together with the high-quality data. The noise is a random combination of rescaling, rotation, blur and salt-and-pepper noise. The results were mixed: some files improved significantly while others got much worse. Documents with tables and documents that do not appear to be scanned improved on our evaluation metrics, while scanned documents performed worse with the fine-tuned model. This polarization was stronger than when training with just the high-quality data.
- Is the way we do augmentation correct?
- What could cause this kind of mixed results?

2. We have tried different parameters. One of them is perfect_sample_delay, with values from 1 to 100, to remove the impact of examples for which Tesseract already produces a perfect output.
We find that this parameter has no impact: the BCER is similar to experiments run without it.
- Is our understanding of this parameter correct?
- Why might we not see any impact when using it?

3. We have tried splitting the set into examples for which Tesseract 5 produces a perfect output (a) and examples for which it does not (b). The (a) set reaches a low BCER of 0.042 during training, while (b) stays around 6% BCER, yet the Levenshtein distance and % of missing words on the benchmark are similar to our previous runs that trained on both (a) and (b).
- Why do you think benchmark performance is similar despite the different training BCER?
- In this case, the examples that Tesseract already gets right should have little or no impact on the training.
- Using perfect_sample_delay should prevent almost any learning on the (a) set, since all of its examples are initially perfect (we checked values from 1 to 100). Why do we see no impact? How would you recommend logging that this parameter is working as expected?

Thank you in advance for your help,
Antonio
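For reference, this is roughly how we compute the two metrics (a simplified sketch: the edit distance is a plain dynamic-programming implementation, and the exact tokenization we use may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def missing_word_pct(ground_truth: str, ocr_output: str) -> float:
    """% of unique ground-truth words that never appear in the OCR output."""
    gt_words = set(ground_truth.split())
    ocr_words = set(ocr_output.split())
    if not gt_words:
        return 0.0
    return 100.0 * len(gt_words - ocr_words) / len(gt_words)
```

For example, levenshtein("kitten", "sitting") is 3, and missing_word_pct("one two three", "one two") is about 33.3.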
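For reference, our augmentation in point 1 looks roughly like this (a simplified sketch: the probabilities and parameter ranges shown here are illustrative, not the exact values we use):

```python
import random

import numpy as np
from PIL import Image, ImageFilter

def augment(img: Image.Image, rng: random.Random) -> Image.Image:
    """Apply a random combination of rescaling, rotation, blur and
    salt-and-pepper noise to a grayscale line image."""
    img = img.convert("L")
    if rng.random() < 0.5:  # rescale down and back up to lose detail
        w, h = img.size
        f = rng.uniform(0.5, 0.9)
        img = img.resize((max(1, int(w * f)), max(1, int(h * f)))).resize((w, h))
    if rng.random() < 0.5:  # small rotation, filling the border with white
        img = img.rotate(rng.uniform(-2.0, 2.0), expand=False, fillcolor=255)
    if rng.random() < 0.5:  # Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.5, 1.5)))
    if rng.random() < 0.5:  # salt-and-pepper noise on ~2% of pixels
        a = np.array(img)
        mask = np.random.default_rng(rng.randrange(2**32)).random(a.shape)
        a[mask < 0.01] = 0      # pepper
        a[mask > 0.99] = 255    # salt
        img = Image.fromarray(a)
    return img
```

We apply this to the cropped line images before training, keeping the original high-quality copies in the training mix as well.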
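For clarity, the split in point 3 is computed from the unmodified Tesseract 5 output (a hypothetical helper just to illustrate the criterion: "perfect" means the baseline output matches the ground truth exactly):

```python
def split_by_baseline(samples):
    """Split (line_id, ground_truth, baseline_output) triples into
    set (a), where the unmodified Tesseract 5 output already matches
    the ground truth exactly, and set (b), where it does not."""
    set_a, set_b = [], []
    for line_id, gt, ocr in samples:
        if gt.strip() == ocr.strip():
            set_a.append(line_id)
        else:
            set_b.append(line_id)
    return set_a, set_b
```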

