Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

Piyush Chandra Wed, 08 Apr 2020 23:45:25 -0700

Thank you Shree for giving the overview.

Could you please help me understand your last point, "Your unicharset should 
have Unicode codepoints"? What does that mean? An example would be helpful. 
I was actually using aksharas (see the attached box file image).




On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>
> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>
> The script unicharsets are useful for setting character properties. For most 
> scripts they are already available in langdata_lstm. I don't think they 
> are mandatory for LSTM training, but by copying them once you can avoid the 
> warning messages.
>
> radical-stroke.txt is used only for CJK languages, but tesseract checks 
> for it during the training process, so you need to make it available.
>
> For Chhattisgarhi, if you are training for text written in Devanagari, I 
> suggest training from script/Devanagari.traineddata rather than English.
>
> Please note that if you are starting from scratch, you don't need a 
> starting traineddata. If you use one, then you are fine-tuning.
>
> Finally, you need to use the correct mode for Indic languages with 
> unicharset_extractor. Your unicharset should have Unicode codepoints, not 
> aksharas (consonant + vowel sign combinations).
>
>
>
>
>
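
To illustrate the distinction in the last quoted point, here is a minimal Python sketch that splits an akshara into the Unicode codepoints it is made of. It uses only the standard library and does not call Tesseract; the helper name codepoints is made up for this example. The idea, as I understand the advice above, is that the unicharset should list the individual entries (KA, VOWEL SIGN I, VIRAMA, ...) rather than whole clusters such as "कि" or "क्ष". I believe this corresponds to the split-graphemes setting of unicharset_extractor (--norm_mode 2), but please treat that flag value as an assumption and check it against your Tesseract version.

```python
import unicodedata


def codepoints(text):
    """Return a (U+XXXX, Unicode name) pair for every codepoint in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# One akshara as a reader sees it: consonant KA + vowel sign I ("कि").
ki = "\u0915\u093F"

# A conjunct akshara: KA + VIRAMA + SSA ("क्ष").
ksha = "\u0915\u094D\u0937"

for akshara in (ki, ksha):
    print(akshara, "is", len(akshara), "codepoints:")
    for cp, name in codepoints(akshara):
        print("   ", cp, name)
```

So a single visual unit in the box file can be two or three codepoints, and each of those codepoints is what would appear as a separate line in the unicharset.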

