Hi,
We are interested in improving the performance of Tesseract, and we have prepared a large set of over 11k pages manually annotated with text-line bounding boxes and transcriptions. When we fine-tune Tesseract on this set we observe a slight decrease in performance, and we would like to identify the cause before running the fine-tuning again. We have some questions about the process and would be grateful if you could help us understand fine-tuning for Tesseract.

We have run several fine-tuning tests with this set, with mixed results. We evaluate against an existing benchmark that we call the mini-holistic set. The metrics we use are Levenshtein distance and the % of missing words (computed over unique words). Fine-tuning on our manually annotated set, we obtain a similar Levenshtein distance (probably not statistically different), but the % of missing words increases, e.g. from 7% to over 9.6%.

1. We noticed that our fine-tuned model degraded on scanned documents, so we used PIL to add noise to the preprocessed bounding boxes and trained on them together with the high-quality data. The noise is a random combination of rescaling, rotation, blur and salt-and-pepper noise. The results were mixed: some files improved significantly while others got much worse. Documents with tables and documents that do not appear to be scanned improved on our evaluation metrics, while scanned documents performed worse with the fine-tuned model. This polarization was stronger than when training with just the high-quality data.
- Is the way we do augmentation correct?
- What could cause this kind of mixed results?

2. We have tried different parameters. One of them is perfect_sample_delay, with values from 1 to 100, to remove the impact of examples for which Tesseract already produces a perfect output.
We find that this parameter has no impact: the BCER is similar to experiments run without it.
- Is our understanding of this parameter correct?
- Why might we not see any impact when using it?

3. We have tried splitting the set into examples for which Tesseract 5 produces a perfect output (a) and examples for which it does not (b). The (a) set reaches a low BCER of 0.042 during training, while (b) stays around 6% BCER, yet the Levenshtein distance and % of missing words on the benchmark are similar to our previous runs that trained on both (a) and (b).
- Why do you think benchmark performance is similar despite the different training BCER?
- In this case, the examples that Tesseract already gets right should have little or no impact on the training.
- Using perfect_sample_delay should prevent almost any learning on the (a) set, since all of its examples are initially perfect (we checked values from 1 to 100). Why do we see no impact? How would you recommend logging that this parameter is working as expected?

Thank you in advance for your help,
Antonio
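For reference, this is roughly how we compute the two metrics (a simplified sketch: the edit distance is a plain dynamic-programming implementation, and the exact tokenization we use may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def missing_word_pct(ground_truth: str, ocr_output: str) -> float:
    """% of unique ground-truth words that never appear in the OCR output."""
    gt_words = set(ground_truth.split())
    ocr_words = set(ocr_output.split())
    if not gt_words:
        return 0.0
    return 100.0 * len(gt_words - ocr_words) / len(gt_words)
```

For example, levenshtein("kitten", "sitting") is 3, and missing_word_pct("one two three", "one two") is about 33.3.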
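For reference, our augmentation in point 1 looks roughly like this (a simplified sketch: the probabilities and parameter ranges shown here are illustrative, not the exact values we use):

```python
import random

import numpy as np
from PIL import Image, ImageFilter

def augment(img: Image.Image, rng: random.Random) -> Image.Image:
    """Apply a random combination of rescaling, rotation, blur and
    salt-and-pepper noise to a grayscale line image."""
    img = img.convert("L")
    if rng.random() < 0.5:  # rescale down and back up to lose detail
        w, h = img.size
        f = rng.uniform(0.5, 0.9)
        img = img.resize((max(1, int(w * f)), max(1, int(h * f)))).resize((w, h))
    if rng.random() < 0.5:  # small rotation, filling the border with white
        img = img.rotate(rng.uniform(-2.0, 2.0), expand=False, fillcolor=255)
    if rng.random() < 0.5:  # Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.5, 1.5)))
    if rng.random() < 0.5:  # salt-and-pepper noise on ~2% of pixels
        a = np.array(img)
        mask = np.random.default_rng(rng.randrange(2**32)).random(a.shape)
        a[mask < 0.01] = 0      # pepper
        a[mask > 0.99] = 255    # salt
        img = Image.fromarray(a)
    return img
```

We apply this to the cropped line images before training, keeping the original high-quality copies in the training mix as well.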
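For clarity, the split in point 3 is computed from the unmodified Tesseract 5 output (a hypothetical helper just to illustrate the criterion: "perfect" means the baseline output matches the ground truth exactly):

```python
def split_by_baseline(samples):
    """Split (line_id, ground_truth, baseline_output) triples into
    set (a), where the unmodified Tesseract 5 output already matches
    the ground truth exactly, and set (b), where it does not."""
    set_a, set_b = [], []
    for line_id, gt, ocr in samples:
        if gt.strip() == ocr.strip():
            set_a.append(line_id)
        else:
            set_b.append(line_id)
    return set_a, set_b
```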

