I don't know the answer to most of these questions, but one thing I 
noticed is that you added rotation to the training data to improve 
performance on scanned documents.  This may imply that the scanned 
documents being fed to Tesseract are also rotated.  Tesseract performs 
poorly on images with any rotation; even a few degrees may noticeably 
degrade results.  The recommended approach is to correct the rotation 
during image pre-processing rather than to re-train with rotated text.  
Various programs can auto-rotate (deskew) text as part of a 
pre-processing pipeline, and this PR 
<https://github.com/tesseract-ocr/tesseract/pull/4070> allows for getting 
the angle calculated by Tesseract during layout analysis without running 
recognition. 
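
To illustrate what a deskew step in such a pipeline might estimate, here is 
a minimal sketch of a projection-profile skew search in plain Python.  It is 
not what Tesseract or the PR above does internally; it only shows the 
general idea.  It assumes the page has already been binarized and reduced to 
a list of (x, y) coordinates of dark pixels.  The function names, the 
+/-10 degree search range, and the synthetic test data are all made up for 
this example.

```python
import math


def row_alignment_score(points, angle_deg):
    """Score how well points fall onto horizontal rows after rotating by angle_deg.

    Each rotated y-coordinate is rounded to a row index; the score is the sum
    of squared row counts, which is largest when text lines are horizontal.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    counts = {}
    for x, y in points:
        row = round(x * sin_t + y * cos_t)
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_correction_angle(points, max_angle=10.0, step=0.25):
    """Brute-force search for the rotation (degrees) that best aligns the rows.

    The search range and step are assumptions chosen for this sketch.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        score = row_alignment_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(skew_deg, n_lines=3, width=200, spacing=20):
    """Hypothetical test data: horizontal 'text lines' rotated by skew_deg."""
    theta = math.radians(skew_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    points = []
    for line in range(1, n_lines + 1):
        y0 = line * spacing
        for x in range(width):
            points.append((x * cos_t - y0 * sin_t, x * sin_t + y0 * cos_t))
    return points


if __name__ == "__main__":
    skewed = synthetic_lines(3.0)
    # The estimate should be close to the inverse of the applied skew.
    print(estimate_correction_angle(skewed))
```

The returned angle is the rotation to apply to the point coordinates to 
undo the skew.  When passing it to an image library's rotate function, 
check that library's sign convention and coordinate origin first, since 
they often differ from the math used here.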
On Saturday, May 11, 2024 at 3:43:51 AM UTC-7 ant...@unstructured.io wrote:

>   Hi,
>
>   We are interested in improving the performance of Tesseract and we have 
> prepared a large set with over 11k pages annotated manually with text lines 
> bounding boxes and the transcribed text. We have been evaluating fine 
> tuning Tesseract with this set and we observed that there is a slight 
> decrease in performance and we would like to identify the issue and run the 
> fine tuning again. We have some questions about the process and we would be 
> grateful if you could help us understand the fine-tuning process for 
> Tesseract.
>
>   We have done several tests to fine-tune Tesseract using this set with 
> mixed results. We evaluate the performance against an existing benchmark 
> that we name the mini-holistic set. The metrics that we consider are 
> Levenshtein distance and % of missing words (which considers unique words). 
> Using our manually annotated set we obtain a similar Levenshtein distance 
> (probably not statistically different) but we get a higher % of missing 
> words, e.g. from 7% to over 9.6%.
>
>    1. We realized our fine-tuned model degraded performance on scanned 
>    documents, so we used PIL to add noise to the preprocessed bounding boxes 
>    and train them with high quality data together. We added noise with random 
>    combination of rescaling, rotation, blur and salt and pepper noise. 
>
> Our results were mixed; we saw significant improvement in some files while 
> others got a lot worse. Documents with tables and documents that seem 
> not-scanned saw an improvement in the evaluation metrics. With scanned 
> documents, performance seemed to get worse with the fine-tuned 
> model. The polarization effect was greater compared to training with just 
> high-quality data.
>
>    - Is the way we do augmentation correct? 
>    - What can potentially cause this kind of mixed results? 
>
>
>    1. We have tried different parameters. One of them is 
>    perfect_sample_delay with different values, from 1 to 100, to remove the 
>    impact of examples for which Tesseract had a perfect output. 
>
> We find that this parameter has no impact: the BCER is similar to that of 
> other experiments run without it.
>
>    - Is our understanding of this parameter correct? 
>    - Why might we not see any impact when using this parameter? 
>
>
>    1. We have tried splitting the set into examples for which Tesseract 5 
>    has a perfect output (a) and examples for which it fails to produce a 
>    perfect output (b). 
>
> We find that the (a) set obtains a low BCER 0.042 during training, while 
> (b) gets ~6% BCER, but the performance in Levenshtein distance and % of 
> missing words is similar to previous output with both (a) and (b).
>
>    - Evaluation performance is similar despite the different training BCER. 
>    Why do you think this is the case? 
>    - In this case, the cases that are correct initially with Tesseract 
>    should have no or limited impact in the training. 
>    - Using the perfect_sample_delay might prevent any learning from 
>    happening since all the examples are initially perfect in the (a) set (we 
>    checked values from 1 to 100). Why do we see no impact? How would you 
>    recommend logging that this parameter is working as expected? 
>
>   Thank you in advance for your help,
>   Antonio
>
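
For reference, the two evaluation metrics mentioned in the quoted message 
could be computed roughly as below.  This is only a sketch: the exact 
definition of "% of missing words" is not given in the message, so the 
version here (unique whitespace-separated reference words that never appear 
in the OCR output, divided by the number of unique reference words) is an 
assumption, and both function names are made up for this example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute = 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def missing_word_rate(reference, hypothesis):
    """Fraction of unique reference words absent from the hypothesis.

    Assumed definition: words are split on whitespace and compared exactly.
    """
    ref_words = set(reference.split())
    if not ref_words:
        return 0.0
    hyp_words = set(hypothesis.split())
    return len(ref_words - hyp_words) / len(ref_words)


if __name__ == "__main__":
    truth = "total amount due"
    ocr = "total amonnt due"
    print(levenshtein(truth, ocr), missing_word_rate(truth, ocr))
```

Note that a single misrecognized character barely changes the Levenshtein 
distance but counts a whole word as missing under this definition, so the 
two metrics can move differently on the same output.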

