Hello everyone, 

I've been successfully fine-tuning the eng.traineddata model with smaller 
datasets, but when I try to scale up to a larger dataset to include a more 
diverse range of documents, I encounter an unusual error. The training 
process starts, but it immediately reports a negative Mean RMS error, which 
seems to be an anomaly.

Environment
Tesseract Version: 4.1.3
Platform: Ubuntu 20.04

I run the following command for fine-tuning:
lstmtraining --debug_interval 0 \
  --traineddata tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42.traineddata \
  --old_traineddata tesstrain/src/tessdata_best/eng.traineddata \
  --continue_from tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42.lstm \
  --model_output tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/checkpoints/model_eng_psm7_mi100000_5PX1000D_rs42 \
  --train_listfile tesstrain/data/experiments/5PX1000D_rs42/list.train \
  --eval_listfile tesstrain/data/experiments/5PX1000D_rs42/list.eval \
  --max_iterations 100000 \
  --target_error_rate 0.01

The output I'm wondering about is:
At iteration 1/600/600, Mean rms=-2147483.6%, delta=0.033%, char train=275.696%, word train=100%, skip ratio=0%,  New worst char error = 275.696 wrote checkpoint.
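
A note on the number itself, offered as a hint rather than a diagnosis: -2147483.6 matches the smallest 32-bit signed integer (-2147483648) divided by 1000, once rounded to the precision shown in the log. That may point to an integer overflow or an invalid value being converted somewhere, but nothing here confirms where it would come from in lstmtraining. The arithmetic can be checked with a small sketch (the helper name is made up for illustration):

```python
# Check whether the reported "Mean rms" value coincides with INT32_MIN / 1000.
# This only verifies the arithmetic; it says nothing about Tesseract internals.

INT32_MIN = -2**31  # smallest 32-bit signed integer: -2147483648


def as_percent_string(value: float) -> str:
    """Format a value with one decimal place, as it appears in the log line."""
    return f"{value:.1f}%"


reported = "-2147483.6%"  # copied from the lstmtraining output above
candidate = as_percent_string(INT32_MIN / 1000)

print(candidate, candidate == reported)
```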

I expected training to proceed normally, with the Mean RMS error showing 
sensible values as it does on smaller datasets. With around 100k lstmf 
files the problem does not occur; with 400k it does.
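
Since the only difference between the working and failing runs is the number of lstmf files, one cheap thing to rule out is a bad entry in the larger list file. The sketch below is an assumption-laden sanity check, not a known fix: it assumes list.train contains one lstmf path per line and that relative paths resolve from the current working directory. The function name find_bad_entries is hypothetical.

```python
from pathlib import Path


def find_bad_entries(list_file):
    """Return (line_number, entry, reason) for listed files that are missing or empty.

    Assumes one file path per line; blank lines are ignored.
    Relative paths are resolved against the current working directory.
    """
    problems = []
    lines = Path(list_file).read_text().splitlines()
    for number, line in enumerate(lines, start=1):
        entry = line.strip()
        if not entry:
            continue
        path = Path(entry)
        if not path.is_file():
            problems.append((number, entry, "missing"))
        elif path.stat().st_size == 0:
            problems.append((number, entry, "empty"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_list.py tesstrain/data/experiments/5PX1000D_rs42/list.train
    for number, entry, reason in find_bad_entries(sys.argv[1]):
        print(f"line {number}: {reason}: {entry}")
```

A clean result would only show that every listed file exists and is non-empty; it would not show that the contents of each lstmf file are valid.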

Am I looking in the wrong direction or missing something?
I searched the groups and discussions for something similar but 
couldn't find anything.
Thanks
