Re: [tesseract-ocr] Re: Creating Starter Traineddata

Simon Sat, 20 Jan 2024 07:00:13 -0800

OK, could you please be a little more precise?
I learned that "[21c6]" is the UTF-16 code. But where do I get the glyph
information from, and what does the 10 stand for?
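
For reference, here is a minimal sketch of how the example line quoted further down splits into fields. It assumes the field order given in the unicharset(5) manual linked in this thread (character, properties, glyph_metrics, script, other_case, direction, mirror, normed_form). The helper name parse_unichar_line and the UnicharEntry class are hypothetical, and this is not Tesseract's own loader. Under that reading, the "10" is the direction field, which uses ICU's UCharDirection values (10 is Other Neutral), and "[21c6 ]" is a trailing comment giving the character's code point in hex.

```python
from dataclasses import dataclass

# Names of the ten glyph_metrics values, as listed in the unicharset(5) manual.
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)

# A few values of ICU's UCharDirection enum, which the direction field uses.
BIDI_DIRECTIONS = {0: "Left-to-Right", 1: "Right-to-Left",
                   2: "European Number", 10: "Other Neutral"}


@dataclass
class UnicharEntry:
    char: str
    properties: int
    glyph_metrics: dict
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_unichar_line(line: str) -> UnicharEntry:
    """Split one full-format unicharset line into named fields (hypothetical helper)."""
    # Only the first eight whitespace-separated fields are data; anything
    # after them (here "# <char> [21c6 ]") is a trailing comment.
    char, props, metrics, script, other_case, direction, mirror, normed = line.split()[:8]
    values = [int(v) for v in metrics.split(",")]
    return UnicharEntry(
        char=char,
        properties=int(props, 16),  # assumption: the property bit mask is written in hex
        glyph_metrics=dict(zip(METRIC_NAMES, values)),
        script=script,
        other_case=int(other_case),
        direction=int(direction),
        mirror=int(mirror),
        normed_form=normed,
    )


# The example line from this thread (U+21C6 written as an escape).
entry = parse_unichar_line(
    "\u21c6 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 \u21c6 # \u21c6 [21c6 ]"
)
print(BIDI_DIRECTIONS[entry.direction])  # meaning of the direction field
print(f"{ord(entry.char):x}")            # code point, as shown in the brackets
print(entry.glyph_metrics)
```

The ten glyph metrics are positional measurements of the rendered glyph, so unlike the direction and the code point they cannot be read off a Unicode table.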


Thanks for your patience. I really appreciate your help :)

elvi...@gmail.com wrote on Saturday, 20 January 2024 at 14:19:33 UTC+1:

> You need to look it up in the Unicode list.
>
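
One way to do that lookup without leaving the terminal is Python's standard unicodedata module. This is only a sketch: the describe helper is hypothetical, and it covers just the properties that Unicode itself defines (code point, name, general category, bidirectional class). The bidirectional class "ON" (Other Neutral) corresponds to value 10 in ICU's UCharDirection enum. The glyph metrics are not Unicode properties, so this sketch cannot produce them.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return Unicode properties relevant to a unicharset entry (hypothetical helper)."""
    return {
        "code_point": f"{ord(ch):x}",                 # hex, as in the "[21c6 ]" comment
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),         # Unicode general category
        "bidi_class": unicodedata.bidirectional(ch),  # "ON" means Other Neutral
    }


# U+21C6 is the arrow from the example line; U+2316 is the character to be added.
for ch in ("\u21c6", "\u2316"):
    print(describe(ch))
```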
> On Sat, Jan 20, 2024, 3:50 PM Simon <smon...@gmail.com> wrote:
>
>> Hey, thanks for the response!
>>
>> How exactly do I add characters to the unicharset?
>>
>> Typically the unicharset has to follow a specific pattern (
>> Tesseract-unicharset_uni-mannheim 
>> <https://digi.bib.uni-mannheim.de/tesseract/manuals/unicharset.5.html>)
>>
>> Here is an example line from the Latin unicharset: 
>>
>> ⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]
>>
>> If I want to add, for example, the character "⌖", how would I know which 
>> numbers to put in for the glyph information?
>>
>> And what do the "10" and "[21c6]" mean?
>>
>>
>>
>>
>> elvi...@gmail.com wrote on Friday, 19 January 2024 at 16:22:24 UTC+1:
>>
>>> Yes, you need to add them before you create the starter model. You can 
>>> edit the Latin.unicharset before you run the combine_lang_model command.
>>>
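
A minimal sketch of such an edit, assuming (per the unicharset(5) manual) that the first line of the file holds the number of entries, so the count has to be raised whenever a line is appended. The function name append_unichar is hypothetical, and the entry line itself, including its metrics and ID fields, must be supplied by the caller. The demo reuses the example line from this thread on a made-up one-entry stub rather than a real Latin.unicharset.

```python
def append_unichar(unicharset_text: str, entry_line: str) -> str:
    """Append one entry to unicharset text and bump the count on line 1."""
    lines = unicharset_text.splitlines()
    count, body = int(lines[0]), lines[1:]

    # Skip the edit if the character (first field) is already listed.
    new_char = entry_line.split()[0]
    if any(existing.split()[:1] == [new_char] for existing in body):
        return unicharset_text

    body.append(entry_line.rstrip("\n"))
    return "\n".join([str(count + 1), *body]) + "\n"


# Made-up stub with a single entry, used only to exercise the function.
stub = "1\nNULL 0 NULL 0\n"
example = "\u21c6 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 \u21c6 # \u21c6 [21c6 ]"

updated = append_unichar(stub, example)
print(updated.splitlines()[0])  # the new entry count
```

On a real file, the text would be read and written back with encoding="utf-8". The other_case and mirror fields hold unichar IDs, so they need to be consistent with the position of the new line.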
>>> On Fri, Jan 19, 2024, 5:27 PM Simon <smon...@gmail.com> wrote:
>>>
>>>> OK, somehow I had "no entry point found" errors in the DLL files. 
>>>> Reinstalling Tesseract solved the problem. 
>>>>
>>>> Now I have run into another interesting problem. 
>>>>
>>>> combine_lang_model --input_unicharset 
>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset
>>>>  
>>>> --script_dir 
>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng --lang 
>>>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>>>
>>>> When I run this command, Tesseract tries to load many unicharsets, and I 
>>>> don't understand why.
>>>> What's the reason for loading all these unicharsets:
>>>>
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Latin.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Inherited.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Unknown.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Greek.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Armenian.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Arabic.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Devanagari.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Gujarati.unicharset
>>>> Failed to load script unicharset 
>>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Bopomofo.unicharset
>>>>
>>>> when I only want to train the English model? 
>>>>
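
The paths in the messages above all have the form <script_dir>/<Script>.unicharset, so the tool appears to look in --script_dir for one unicharset file per script. Since --input_unicharset in the same command points at langdata/Latin.unicharset, one plausible reading (an assumption, not confirmed in this thread) is that --script_dir should be the langdata directory itself rather than langdata/eng. The sketch below only checks which of those files exist in a candidate directory; the function name and the relative paths are hypothetical.

```python
from pathlib import Path


def missing_script_unicharsets(script_dir, scripts):
    """List the script names whose <Script>.unicharset file is absent from script_dir."""
    root = Path(script_dir)
    return [name for name in scripts if not (root / f"{name}.unicharset").is_file()]


# Script names taken from the error output quoted above.
SCRIPTS = ["Latin", "Inherited", "Unknown", "Greek", "Armenian",
           "Arabic", "Devanagari", "Gujarati", "Bopomofo"]

# Hypothetical relative paths modelled on the ones used in the thread.
for candidate in ("tesstutorial/langdata/eng", "tesstutorial/langdata"):
    print(candidate, "->", missing_script_unicharsets(candidate, SCRIPTS))
```

A directory for which the list comes back empty is the likelier value for --script_dir.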
>>>> Another question came up: 
>>>> When I want to train some new characters, do I have to add them to the 
>>>> Latin.unicharset before I create the starter traineddata, or do I just add 
>>>> them to the generated unicharset after I have created the starter 
>>>> traineddata?
>>>>
>>>> Simon wrote on Friday, 19 January 2024 at 10:38:24 UTC+1:
>>>>
>>>>> Here is a link to the website of Uni Mannheim: COMBINE_LANG_MODEL - 
>>>>> generate starter traineddata 
>>>>> <https://digi.bib.uni-mannheim.de/tesseract/manuals/combine_lang_model.1.html>
>>>>>
>>>>> Unfortunately, the command doesn't create any files, and after running 
>>>>> it I get no feedback on why it didn't work properly. 
>>>>> Even when I purposely use nonexistent paths, I still get no error 
>>>>> message!
>>>>>
>>>>> PS C:\Windows\system32> combine_lang_model --input_unicharset 
>>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset
>>>>>  
>>>>> --script_dir 
>>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng  --lang 
>>>>> eng 
>>>>> --wordlist 
>>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/eng.wordlist
>>>>>  
>>>>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>>>
>>>>> PS C:\Users\LCAdmin\Documents\FineTuning> combine_lang_model 
>>>>> --input_unicharset tesstutorial/langdata/Latin.unicharset --script_dir 
>>>>> tesstutorial/langdata/eng  --lang eng --wordlist 
>>>>> asdfasfdef/langdata/eng/eng.wordlist --output_dir output
>>>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>>>
>>>>> Does anyone have an idea how I can get log messages or other output 
>>>>> that would show me why it didn't work?
>>>>>
>>>>>
>>>>>
>>>>> Simon wrote on Thursday, 18 January 2024 at 11:11:52 UTC+1:
>>>>>
>>>>>> Hello everybody,
>>>>>>
>>>>>> I have a question regarding "Fine Tuning +- a few characters". 
>>>>>>
>>>>>> In general, the instructions at 
>>>>>> https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
>>>>>> say that you have to make a starter traineddata from the unicharset, but 
>>>>>> is this also required if I only want to fine-tune? 
>>>>>>
>>>>>> Furthermore, I have no idea how to create a starter traineddata. I read 
>>>>>> the "creating starter traineddata" chapter, but I still have no clue 
>>>>>> how to do it. The site is supposed to be a tutorial, so I expected 
>>>>>> step-by-step instructions. 
>>>>>>
>>>>>> Can anyone help me with this?
>>>>>>
>>>>>> I am a newbie at Tesseract training, so I would appreciate any help.
>>>>>>
>>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/702ec835-51a0-4ad1-a0f0-92b4a6e30a9fn%40googlegroups.com.
