hin.des0.txt <https://github.com/tesseract-ocr/tesseract/files/4479075/hin.des0.txt> These are the files I used.
For the box file, I used the command below:

tesseract hin.des0.PNG hin.des0 -l hin lstmbox

On Wednesday, 15 April 2020 06:52:48 UTC+5:30, shree wrote:
>
> How are you creating the box files?
>
> On Wed, Apr 15, 2020, 01:52 Piyush Chandra <[email protected]> wrote:
>
>> For the other files, when I try on Linux, the output looks like this:
>>
>> unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box
>> Extracting unicharset from box file hin.desk0.box
>> Invalid start of grapheme sequence:H=0x94d
>> Normalization failed for string '्'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x93f
>> Normalization failed for string 'ि'
>> Invalid start of grapheme sequence:M=0x94b
>> Normalization failed for string 'ो'
>> Invalid start of grapheme sequence:D=0x902
>> Normalization failed for string 'ं'
>> Invalid start of grapheme sequence:M=0x940
>> Normalization failed for string 'ी'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x948
>> Normalization failed for string 'ै'
>> Invalid start of grapheme sequence:D=0x902
>> Normalization failed for string 'ं'
>> Invalid start of grapheme sequence:M=0x93f
>> Normalization failed for string 'ि'
>>
>>
>> On Tuesday, 14 April 2020 17:01:20 UTC+5:30, Piyush Chandra wrote:
>>>
>>> Hi Shree,
>>>
>>> When I ran the unicharset_extractor command, I got these errors:
>>>
>>> unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box
>>> Extracting unicharset from box file hin.exp1.box
>>> Invalid start of grapheme sequence:M=0x93e
>>> Normalization failed for string 'ा'
>>> Invalid start of grapheme sequence:D=0x901
>>> Normalization failed for string 'ँ'
>>> Invalid start of grapheme sequence:M=0x941
>>> Normalization failed for string 'ु'
>>> Invalid start of grapheme sequence:M=0x947
>>> Normalization failed for string 'े'
>>> Invalid start of grapheme sequence:M=0x940
>>> Normalization failed for string 'ी'
>>> Invalid start of grapheme sequence:M=0x948
>>> Normalization failed for string 'ै'
>>> Mirror ] of [ is not in unicharset
>>> Wrote unicharset file min.unicharset
>>>
>>> The box file used was:
>>>
>>> ह 28 33 261 74 0
>>> ा 28 33 261 74 0
>>> ँ 28 33 261 74 0
>>> , 28 33 261 74 0
>>> 28 33 261 74 0
>>> म 28 33 261 74 0
>>> ु 28 33 261 74 0
>>> झ 28 33 261 74 0
>>> े 28 33 261 74 0
>>> 28 33 261 74 0
>>> [ 28 33 261 74 0
>>> ख 28 33 261 74 0
>>> 28 33 261 74 0
>>> ल 28 33 261 74 0
>>> ग 28 33 261 74 0
>>> ी 28 33 261 74 0
>>> 28 33 261 74 0
>>> ह 28 33 261 74 0
>>> ै 28 33 261 74 0
>>> । 28 33 261 74 0
>>> 28 33 261 74 0
>>>
>>> Do I need to just ignore them, or what am I missing here?
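
The codepoints named in the messages above (0x93e, 0x901, 0x941, 0x947, 0x940, 0x948) are all Devanagari dependent signs, i.e. combining marks, which suggests that each one sits alone on a box line without a preceding consonant. A minimal Python sketch for listing such lines follows. It is only an illustration, not part of Tesseract: `find_leading_marks` is a hypothetical helper, the box layout (text followed by five space-separated numbers) is assumed from the listing above, and the Unicode categories Mn/Mc are used as a stand-in for Tesseract's own grapheme validation, which may apply different rules.

```python
import unicodedata


def find_leading_marks(box_lines):
    """Return (line_number, text, codepoint) for box entries whose text
    starts with a combining mark (Unicode category Mn or Mc)."""
    hits = []
    for number, line in enumerate(box_lines, start=1):
        # Assumed layout: <text> <left> <bottom> <right> <top> <page>
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6 or not parts[0]:
            continue  # skip malformed lines and entries with empty text
        text = parts[0]
        if unicodedata.category(text[0]) in ("Mn", "Mc"):
            hits.append((number, text, hex(ord(text[0]))))
    return hits


# First four entries of the box file quoted above, written as escapes
sample = [
    "\u0939 28 33 261 74 0",  # DEVANAGARI LETTER HA
    "\u093e 28 33 261 74 0",  # DEVANAGARI VOWEL SIGN AA
    "\u0901 28 33 261 74 0",  # DEVANAGARI SIGN CANDRABINDU
    ", 28 33 261 74 0",
]

for hit in find_leading_marks(sample):
    print(hit)
```

On this sample it reports lines 2 and 3 (0x93e and 0x901), the same two codepoints that head the error list for hin.exp1.box. To run it on a real file, pass `open("hin.exp1.box", encoding="utf-8")` in place of `sample`.
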
>>>
>>> On Thursday, 9 April 2020 12:34:38 UTC+5:30, shree wrote:
>>>>
>>>> # Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model
>>>> ifeq ($(LANG_TYPE),Indic)
>>>> NORM_MODE =2
>>>> RECODER =--pass_through_recoder
>>>>
>>>>
>>>> On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar <[email protected]> wrote:
>>>>
>>>>> Unicharset will look like the following:
>>>>>
>>>>> द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
>>>>> र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
>>>>> ् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
>>>>> श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
>>>>> य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
>>>>> त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
>>>>> ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
>>>>> प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
>>>>> ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
>>>>> ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x
>>>>>
>>>>> You can unpack any of the existing traineddatas from tessdata_best or
>>>>> tessdata_fast and check:
>>>>>
>>>>> combine_tessdata -u
>>>>>
>>>>> and look at the lstm-unicharset in the components.
>>>>>
>>>>> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra <[email protected]> wrote:
>>>>>
>>>>>> Thank you, Shree, for giving the overview.
>>>>>>
>>>>>> Could you please help me understand your last point? "Your unicharset
>>>>>> should have Unicode codepoints": what does that mean? Any example
>>>>>> would be helpful. I was actually using akshara (see the attached box
>>>>>> file image).
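
To illustrate the distinction being asked about here: an akshara is one written unit, but it is usually stored as several Unicode codepoints. The sketch below uses only Python's standard library to split two example strings into their codepoints. `codepoints` is a hypothetical helper name, and the example strings were chosen because their parts appear in the unicharset listing above; they are not taken from the training data.

```python
import unicodedata


def codepoints(text):
    """List each codepoint in text as (hex value, Unicode name)."""
    return [(format(ord(ch), "x"), unicodedata.name(ch, "UNKNOWN")) for ch in text]


# One akshara on the page: consonant DA followed by vowel sign I
for cp in codepoints("\u0926\u093f"):
    print(cp)

# A conjunct written as one unit: SHA + virama + YA
for cp in codepoints("\u0936\u094d\u092f"):
    print(cp)
```

The hex values it prints (926, 93f, 936, 94d, 92f) appear to correspond to the bracketed codes at the end of the unicharset lines above, where each codepoint has its own entry.
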
>>>>>>
>>>>>> On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>>>>>>>
>>>>>>> The script unicharsets are useful for setting character properties.
>>>>>>> For most scripts they are already available in langdata_lstm. I don't
>>>>>>> think they are mandatory for LSTM training, but by copying them once
>>>>>>> you can avoid the warning messages.
>>>>>>>
>>>>>>> radical-stroke.txt is used only for CJK languages, but tesseract
>>>>>>> checks for it during the training process, so you need to make it
>>>>>>> available.
>>>>>>>
>>>>>>> For Chhattisgarhi, if you are training it as written in Devanagari, I
>>>>>>> suggest training from script/Devanagari.traineddata rather than
>>>>>>> English.
>>>>>>>
>>>>>>> Please note that if you are starting from scratch, you don't need a
>>>>>>> starting traineddata. If you use one, you are fine-tuning.
>>>>>>>
>>>>>>> Finally, you need to use the correct mode for Indic languages with
>>>>>>> unicharset_extractor. Your unicharset should have Unicode codepoints,
>>>>>>> not akshara (consonant and vowel sign combinations).
>>>>>>>
>>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> ____________________________________________________________ >>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com >>>>> >>>> >>>> >>>> -- >>>> >>>> ____________________________________________________________ >>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com >>>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "tesseract-ocr" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/tesseract-ocr/23e8e435-a720-455b-aa2a-563edbb8a93c%40googlegroups.com >> >> <https://groups.google.com/d/msgid/tesseract-ocr/23e8e435-a720-455b-aa2a-563edbb8a93c%40googlegroups.com?utm_medium=email&utm_source=footer> >> . >> > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/38c5fd9e-0a5c-4053-a324-bb08e99309c0%40googlegroups.com.

