Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

Piyush Chandra Tue, 14 Apr 2020 21:17:27 -0700


hin.des0.txt 
<https://github.com/tesseract-ocr/tesseract/files/4479075/hin.des0.txt> 
These are the files I used.


For the box file, I used the command below:

tesseract hin.des0.PNG hin.des0 -l hin lstmbox
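
Every "Invalid start of grapheme sequence" message quoted in this thread names a dependent vowel sign, virama, or other combining mark (0x93e, 0x94d, 0x902, ...) as the first character of a box entry. A minimal diagnostic sketch, assuming one glyph per box line in the `glyph left bottom right top page` layout shown in the quoted box file, lists such lines before unicharset_extractor is run. The function name is hypothetical, and the check only mirrors the symptom; it is not a description of unicharset_extractor's internal validation.

```python
import unicodedata


def leading_mark_lines(box_lines):
    """Return (line_number, glyph, codepoint) for box entries starting with a combining mark."""
    flagged = []
    for number, line in enumerate(box_lines, start=1):
        # Split from the right so a glyph that is itself a space survives.
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6 or not parts[0]:
            continue  # not a "glyph l b r t page" entry
        glyph = parts[0]
        # Unicode general categories Mn/Mc/Me are combining marks.
        if unicodedata.category(glyph[0]).startswith("M"):
            flagged.append((number, glyph, f"0x{ord(glyph[0]):x}"))
    return flagged


if __name__ == "__main__":
    # Hypothetical path; substitute the box file being checked.
    with open("hin.exp1.box", encoding="utf-8") as handle:
        for number, glyph, codepoint in leading_mark_lines(handle):
            print(f"line {number}: {glyph!r} starts with combining mark {codepoint}")
```

Lines flagged this way are the ones where a mark has been separated from its base consonant, which is worth checking against how the box file was generated.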

On Wednesday, 15 April 2020 06:52:48 UTC+5:30, shree wrote:
>
> How are you creating the box files? 
>
> On Wed, Apr 15, 2020, 01:52 Piyush Chandra <[email protected]> wrote:
>
>> For the other files, when I try on Linux, the output looks like this:
>>
>> unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box
>> Extracting unicharset from box file hin.desk0.box
>> Invalid start of grapheme sequence:H=0x94d
>> Normalization failed for string '्'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x93f
>> Normalization failed for string 'ि'
>> Invalid start of grapheme sequence:M=0x94b
>> Normalization failed for string 'ो'
>> Invalid start of grapheme sequence:D=0x902
>> Normalization failed for string 'ं'
>> Invalid start of grapheme sequence:M=0x940
>> Normalization failed for string 'ी'
>> Invalid start of grapheme sequence:M=0x93e
>> Normalization failed for string 'ा'
>> Invalid start of grapheme sequence:M=0x947
>> Normalization failed for string 'े'
>> Invalid start of grapheme sequence:M=0x948
>> Normalization failed for string 'ै'
>> Invalid start of grapheme sequence:D=0x902
>> Normalization failed for string 'ं'
>> Invalid start of grapheme sequence:M=0x93f
>> Normalization failed for string 'ि'
>>
>>
>> On Tuesday, 14 April 2020 17:01:20 UTC+5:30, Piyush Chandra wrote:
>>>
>>> Hi Shree, 
>>>
>>> When I ran the unicharset_extractor command, I got these errors:
>>>
>>> unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box
>>> Extracting unicharset from box file hin.exp1.box
>>> Invalid start of grapheme sequence:M=0x93e
>>> Normalization failed for string 'ा'
>>> Invalid start of grapheme sequence:D=0x901
>>> Normalization failed for string 'ँ'
>>> Invalid start of grapheme sequence:M=0x941
>>> Normalization failed for string 'ु'
>>> Invalid start of grapheme sequence:M=0x947
>>> Normalization failed for string 'े'
>>> Invalid start of grapheme sequence:M=0x940
>>> Normalization failed for string 'ी'
>>> Invalid start of grapheme sequence:M=0x948
>>> Normalization failed for string 'ै'
>>> Mirror ] of [ is not in unicharset
>>> Wrote unicharset file min.unicharset
>>>
>>> The box file used was:
>>>
>>> ह 28 33 261 74 0
>>> ा 28 33 261 74 0
>>> ँ 28 33 261 74 0
>>> , 28 33 261 74 0
>>>   28 33 261 74 0
>>> म 28 33 261 74 0
>>> ु 28 33 261 74 0
>>> झ 28 33 261 74 0
>>> े 28 33 261 74 0
>>>   28 33 261 74 0
>>> [ 28 33 261 74 0
>>> ख 28 33 261 74 0
>>>   28 33 261 74 0
>>> ल 28 33 261 74 0
>>> ग 28 33 261 74 0
>>> ी 28 33 261 74 0
>>>   28 33 261 74 0
>>> ह 28 33 261 74 0
>>> ै 28 33 261 74 0
>>> । 28 33 261 74 0
>>> 28 33 261 74 0
>>>
>>> Can I just ignore these errors, or am I missing something here?
>>>
>>> On Thursday, 9 April 2020 12:34:38 UTC+5:30, shree wrote:
>>>>
>>>> # Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model
>>>> ifeq ($(LANG_TYPE),Indic)
>>>> NORM_MODE =2
>>>> RECODER =--pass_through_recoder
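
The Makefile excerpt above sets two values for Indic languages: `--norm_mode 2` for unicharset_extractor and `--pass_through_recoder` for combine_lang_model. A minimal sketch of how those settings end up on the two command lines follows. The helper names and all paths are hypothetical, and the combine_lang_model options other than the recoder flag (`--input_unicharset`, `--script_dir`, `--output_dir`, `--lang`) are assumed from the Tesseract training documentation, so check them against `combine_lang_model --help` for the installed version.

```python
# Only the Indic values come from the quoted Makefile excerpt.
INDIC_SETTINGS = {"norm_mode": 2, "recoder": "--pass_through_recoder"}


def extractor_args(box_files, output_unicharset, settings=INDIC_SETTINGS):
    """Build the unicharset_extractor argument list (not executed here)."""
    return [
        "unicharset_extractor",
        "--norm_mode", str(settings["norm_mode"]),
        "--output_unicharset", output_unicharset,
        *box_files,
    ]


def combine_args(unicharset, script_dir, output_dir, lang, settings=INDIC_SETTINGS):
    """Build the combine_lang_model argument list (not executed here)."""
    args = [
        "combine_lang_model",
        "--input_unicharset", unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    if settings["recoder"]:
        args.append(settings["recoder"])
    return args


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    print(" ".join(extractor_args(["hin.exp1.box"], "min.unicharset")))
    print(" ".join(combine_args("min.unicharset", "langdata", "output", "hin")))
```

The lists can be handed to `subprocess.run` once the paths point at real files.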
>>>>
>>>>
>>>> On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar <[email protected]> 
>>>> wrote:
>>>>
>>>>> Unicharset will look like the following:
>>>>>
>>>>> द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
>>>>> र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
>>>>> ् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
>>>>> श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
>>>>> य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
>>>>> त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
>>>>> ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
>>>>> प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
>>>>> ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
>>>>> ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x
>>>>>
>>>>> You can unpack any of the existing traineddatas from tessdata_best or 
>>>>> tessdata_fast and check.
>>>>>
>>>>> combine_tessdata -u 
>>>>>
>>>>> and look at the lstm-unicharset among the unpacked components.
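
In the sample unicharset above, the first field of each line is the character itself and the bracketed comment at the end repeats its codepoint in hex (`[926 ]`, `[94d ]`). A minimal sketch, assuming that space-separated layout, lists the codepoints of each entry and flags entries that fuse a base letter with a combining mark, which is the akshara-style entry the advice in this thread says to avoid. Both function names are hypothetical, and a flagged entry is a prompt to inspect the file, not proof of an error.

```python
import unicodedata


def entry_codepoints(unicharset_line):
    """Hex codepoints of the first field (the unichar) of a unicharset line."""
    unichar = unicharset_line.split(" ", 1)[0]
    return [f"{ord(ch):x}" for ch in unichar]


def fused_entries(unicharset_lines):
    """Entries longer than one codepoint that contain a combining mark."""
    fused = []
    for line in unicharset_lines:
        unichar = line.split(" ", 1)[0]
        has_mark = any(unicodedata.category(ch).startswith("M") for ch in unichar)
        if len(unichar) > 1 and has_mark:
            fused.append((unichar, entry_codepoints(line)))
    return fused


if __name__ == "__main__":
    # Hypothetical path to an unpacked or freshly extracted unicharset.
    with open("min.unicharset", encoding="utf-8") as handle:
        for unichar, codepoints in fused_entries(handle):
            print(f"{unichar!r}: {' '.join(codepoints)}")
```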
>>>>>
>>>>> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Thank you Shree for giving the overview.
>>>>>>
>>>>>> Could you please help me understand your last point, "Your unicharset 
>>>>>> should have Unicode codepoints"? What does that mean? An example would 
>>>>>> be helpful. I was actually using akshara (see the attached box file image).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>>>>>>>
>>>>>>> The script unicharsets are useful for setting character properties. 
>>>>>>> For most scripts they are already available in langdata_lstm. I don't 
>>>>>>> think they are mandatory for LSTM training, but copying them once 
>>>>>>> avoids the warning messages.
>>>>>>>
>>>>>>> radical-stroke.txt is used only for CJK languages, but tesseract 
>>>>>>> checks for it during the training process, so you need to make it available.
>>>>>>>
>>>>>>> For Chhattisgarhi, if you are training it as written in Devanagari, I would 
>>>>>>> suggest training from script/Devanagari.traineddata rather than English.
>>>>>>>
>>>>>>> Please note that if you are starting from scratch, you don't need a 
>>>>>>> starting traineddata. If you use one, you are fine-tuning.
>>>>>>>
>>>>>>> Finally, you need to use the correct mode for Indic languages with 
>>>>>>> unicharset_extractor. Your unicharset should have Unicode codepoints, 
>>>>>>> not akshara (consonant and vowel-sign combinations).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>>
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>
>>>>
>>>>
>>
>

