Hi there. I've tried fine-tuning Tesseract 4 with lstmtraining on 
handwritten text, using box/tiff pairs I generated myself. The overall 
training process went off without a hitch.

I now wish to apply this fine-tuning process at a larger scale to form 
images. Here's my conundrum: my forms often contain a mixture of printed 
and handwritten text. Do I have to annotate both the printed text and the 
handwritten text? Annotating both would take considerable extra effort, 
so I'm wondering if it is sufficient to draw boxes only around the 
handwritten portions. However, I worry that if I box only the handwritten 
parts and leave out the printed parts, it might confuse my model somehow.

My second question: when I run inference with my trained model, it throws 
a warning: *`Failed to load any lstm-specific dictionaries for lang 
X`*. I understand this is caused by the absence of word lists, 
punctuation lists, etc. (although it does still produce an inferenced 
output).

I'm wondering how much a word list affects the inference process. I could 
simply take the base language's word list from the GitHub repository and 
combine it into my newly trained tessdata. However, the forms I will run 
Tesseract on will contain lots of people's names (which may not be present 
in a word list?). In that case, do I have to compile a new word list, or 
is it sufficient to do without one?
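For reference, here's the repacking workflow I had in mind, sketched with 
a placeholder language code `foo` and a custom word-list file `words.txt` 
(both names are my own assumptions, and I haven't verified this end to end):

```shell
# Unpack the existing traineddata so its components can be inspected
combine_tessdata -u foo.traineddata foo.

# Compile the word list into an LSTM word dawg, using the
# lstm-unicharset extracted by the unpack step above
wordlist2dawg words.txt foo.lstm-word-dawg foo.lstm-unicharset

# Overwrite the word dawg inside the traineddata in place
combine_tessdata -o foo.traineddata foo.lstm-word-dawg
```

If that's roughly the right approach, my question still stands on whether 
`words.txt` needs to include the names, or whether the LSTM handles 
out-of-dictionary words well enough on its own.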

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4e53fc50-8b36-423a-9bcf-afa4af84e9e7n%40googlegroups.com.
