Hi,

  We are interested in improving the performance of Tesseract, and we have 
prepared a large dataset of over 11k pages annotated manually with text-line 
bounding boxes and the transcribed text. We have been evaluating fine-tuning 
Tesseract with this set, and we observed a slight decrease in performance; 
we would like to identify the issue and run the fine-tuning again. We have 
some questions about the process and would be grateful if you could help us 
understand the fine-tuning process for Tesseract.

  We have done several tests to fine-tune Tesseract using this set, with 
mixed results. We evaluate the performance against an existing benchmark 
that we call the mini-holistic set. The metrics that we consider are 
Levenshtein distance and the percentage of missing words (computed over 
unique words). Using our manually annotated set we obtain a similar 
Levenshtein distance (probably not statistically different), but we get a 
higher percentage of missing words, e.g. from 7% to over 9.6%.
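
For clarity, here is a minimal sketch of how these two metrics can be 
computed; the tokenization and normalization are illustrative and may differ 
from our actual evaluation scripts:

import re

def levenshtein(a, b):
    # Standard dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_error_rate(gt, ocr):
    # Levenshtein distance normalized by the ground-truth length.
    return levenshtein(gt, ocr) / max(len(gt), 1)

def missing_word_pct(gt, ocr):
    # Percentage of unique ground-truth words that never appear in the OCR
    # output.
    gt_words = set(re.findall(r"\w+", gt.lower()))
    ocr_words = set(re.findall(r"\w+", ocr.lower()))
    return 100.0 * len(gt_words - ocr_words) / max(len(gt_words), 1)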

   1. We realized that our fine-tuned model degraded performance on scanned 
   documents, so we used PIL to add noise to the preprocessed bounding-box 
   crops and trained on them together with the high-quality data. We added 
   noise with a random combination of rescaling, rotation, blur and 
   salt-and-pepper noise (sketched below, after the results). 

Our results were mixed; we saw a significant improvement in some files while 
others got a lot worse. Documents with tables and documents that do not 
appear to be scanned saw an improvement in the evaluation metrics, whereas 
scanned documents performed worse with the fine-tuned model. This 
polarization effect was greater than when training with just the 
high-quality data.
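
For reference, the augmentation looks roughly like the following minimal 
sketch; the probabilities and parameter ranges are illustrative, and NumPy 
is used only for the salt-and-pepper step:

import random
import numpy as np
from PIL import Image, ImageFilter

def augment_line_image(img: Image.Image) -> Image.Image:
    # Apply a random combination of rescaling, rotation, blur and
    # salt-and-pepper noise to a line image (illustrative values only).
    img = img.convert("L")

    # Rescale down and back up to simulate a low-resolution scan.
    if random.random() < 0.5:
        w, h = img.size
        factor = random.uniform(0.5, 0.9)
        img = img.resize((max(1, int(w * factor)), max(1, int(h * factor))))
        img = img.resize((w, h))

    # Small rotation to simulate skew.
    if random.random() < 0.5:
        img = img.rotate(random.uniform(-2.0, 2.0), expand=False,
                         fillcolor=255)

    # Gaussian blur.
    if random.random() < 0.5:
        img = img.filter(
            ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))

    # Salt-and-pepper noise.
    if random.random() < 0.5:
        arr = np.array(img)
        mask = np.random.random(arr.shape)
        arr[mask < 0.01] = 0      # pepper
        arr[mask > 0.99] = 255    # salt
        img = Image.fromarray(arr)

    return img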

   - Is the way we do augmentation correct? 
   - What could potentially cause these mixed results? 


   2. We have tried different parameters. One of them is 
   perfect_sample_delay, with different values from 1 to 100, to remove the 
   impact of examples for which Tesseract already had a perfect output. 

We find that there is no impact from using this parameter: the BCER is 
similar to that of other experiments run without it.
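
For reference, we pass this flag to lstmtraining roughly as below; the paths 
and the other values are placeholders rather than our exact setup:

lstmtraining \
  --continue_from base/model.lstm \
  --traineddata base/model.traineddata \
  --train_listfile data/list.train \
  --model_output output/finetuned \
  --perfect_sample_delay 10 \
  --max_iterations 40000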

   - Is our understanding of this parameter correct? 
   - Why might we not see any impact when using this parameter? 


   3. We have tried splitting the set into examples for which Tesseract 5 
   has a perfect output (a) and examples for which it fails to produce a 
   perfect output (b). 

We find that the (a) set obtains a low BCER of 0.042 during training, while 
(b) gets ~6% BCER, but the performance in Levenshtein distance and % of 
missing words remains similar to the previous runs for both (a) and (b).
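
The split is done roughly as in the following minimal sketch, which assumes 
single-line images with .gt.txt transcriptions side by side (as in 
tesstrain); pytesseract stands in here for however the Tesseract binary is 
actually invoked:

from pathlib import Path
from PIL import Image
import pytesseract

def split_by_perfect_output(gt_dir):
    # Split line images into (a) perfect and (b) imperfect sets by comparing
    # Tesseract's output with the ground-truth transcription.
    perfect, imperfect = [], []
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        img_file = gt_file.with_name(gt_file.name.replace(".gt.txt", ".png"))
        gt_text = gt_file.read_text(encoding="utf-8").strip()
        # --psm 7: treat the image as a single text line.
        ocr_text = pytesseract.image_to_string(
            Image.open(img_file), config="--psm 7").strip()
        (perfect if ocr_text == gt_text else imperfect).append(img_file)
    return perfect, imperfect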

   - Performance is similar despite the different training BCERs. Why do you 
   think that is the case? 
   - In this case, the examples that Tesseract initially gets right should 
   have no or limited impact on the training. 
   - Using perfect_sample_delay might prevent any learning from happening, 
   since all the examples in the (a) set are initially perfect (we checked 
   values from 1 to 100). Why do we see no impact? How would you recommend 
   logging that this parameter is working as expected? 

  Thank you in advance for your help,
  Antonio
