Re: [tesseract-ocr] Re: Creating Starter Traineddata

Simon Sat, 20 Jan 2024 04:50:03 -0800

Hey, thanks for the response!

How exactly do I add characters to the unicharset?


The unicharset has to follow a specific pattern, described in the 
unicharset manual page hosted by Uni Mannheim 
<https://digi.bib.uni-mannheim.de/tesseract/manuals/unicharset.5.html>.

Here is an example of the Latin unicharset: 

⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]

If I want to add, for example, the character "⌖", how would I know which 
numbers to put in for the glyph information?

Also, what do the "10" and the "[21c6 ]" mean?
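
For reference, here is a minimal sketch of how the example line splits into 
fields, assuming the field order given in the unicharset manual page linked 
above (character, properties, glyph metrics, script, other case, direction, 
mirror, normalized form). The dictionary keys and the helper name 
hex_codepoints are my own labels, not Tesseract names. Under that reading, the 
"10" sits in the direction field, and the bracketed value in the trailing 
comment is the character's Unicode code point in hex. The sketch does not 
answer how to obtain the glyph metrics for a new character.

```python
# Split one unicharset entry into named fields.
# Assumption: field order as described in the unicharset(5) manual page.
line = "⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]"


def hex_codepoints(text):
    """Return the code points of text as space-separated lowercase hex."""
    return " ".join(format(ord(ch), "x") for ch in text)


# Everything after " # " is a comment; the entry itself comes before it.
entry, _, comment = line.partition(" # ")
fields = entry.split(" ")

record = {
    "char": fields[0],
    "properties": fields[1],
    "glyph_metrics": [int(v) for v in fields[2].split(",")],
    "script": fields[3],
    "other_case": int(fields[4]),
    "direction": int(fields[5]),
    "mirror": int(fields[6]),
    "normed": fields[7],
}

print(record)
print(hex_codepoints(record["char"]))  # compare with the bracketed value
print(hex_codepoints("⌖"))             # code point for the new character
```

The same helper gives the value that would go in brackets for "⌖"; the ten 
glyph metric numbers still have to come from somewhere else.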




elvi...@gmail.com schrieb am Freitag, 19. Januar 2024 um 16:22:24 UTC+1:

> Yes, you need to add them before you create the starter model. You can 
> edit Latin.unicharset before you run the combine command.
>
> On Fri, Jan 19, 2024, 5:27 PM Simon <smon...@gmail.com> wrote:
>
>> OK, somehow I got "no entry point found" errors from the DLL files. 
>> Reinstalling Tesseract solved the problem. 
>>
>> Now I have run into another interesting problem. 
>>
>> combine_lang_model --input_unicharset 
>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset 
>> --script_dir 
>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng --lang 
>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>
>> When I run this command, Tesseract tries to load many unicharsets, and I 
>> don't understand why.
>> What is the reason for loading all of these unicharsets:
>>
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Latin.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Inherited.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Unknown.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Greek.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Armenian.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Arabic.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Devanagari.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Gujarati.unicharset
>> Failed to load script unicharset 
>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Bopomofo.unicharset
>>
>> when I only want to train the English model? 
>>
>> Another question also arose: 
>> When I train some new characters, do I have to add them to 
>> Latin.unicharset before I create the starter traineddata, or do I add 
>> them to the generated unicharset after I have created the starter 
>> traineddata?
>>
>> Simon schrieb am Freitag, 19. Januar 2024 um 10:38:24 UTC+1:
>>
>>> Here is a link to the Uni Mannheim website: COMBINE_LANG_MODEL - 
>>> generate starter traineddata 
>>> <https://digi.bib.uni-mannheim.de/tesseract/manuals/combine_lang_model.1.html>
>>>
>>> Unfortunately, the command doesn't create any files, and after running it 
>>> I get no feedback on why it didn't work properly. 
>>> Even when I purposely use non-existent paths, I still get no error 
>>> message!
>>>
>>> PS C:\Windows\system32> combine_lang_model --input_unicharset 
>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset
>>>  
>>> --script_dir 
>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng  --lang eng 
>>> --wordlist 
>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/eng.wordlist
>>>  
>>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>
>>> PS C:\Users\LCAdmin\Documents\FineTuning> combine_lang_model 
>>> --input_unicharset tesstutorial/langdata/Latin.unicharset --script_dir 
>>> tesstutorial/langdata/eng  --lang eng --wordlist 
>>> asdfasfdef/langdata/eng/eng.wordlist --output_dir output
>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>
>>> Does anyone know how I can get log messages or any other output that 
>>> would show why it didn't work?
>>>
>>>
>>>
>>> Simon schrieb am Donnerstag, 18. Januar 2024 um 11:11:52 UTC+1:
>>>
>>>> Hello everybody,
>>>>
>>>> I have a question regarding "Fine Tuning +- a few characters". 
>>>>
>>>> The instructions at 
>>>> https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
>>>>  
>>>> say that you have to make a starter traineddata from the unicharset, but 
>>>> is this also required if I only want to fine-tune? 
>>>>
>>>> Furthermore, I don't know how to create a starter traineddata. I read 
>>>> the "creating starter traineddata" chapter, but it is still unclear to 
>>>> me how to do it. The site is supposed to be a tutorial, so I expected 
>>>> step-by-step instructions. 
>>>>
>>>> Can anyone help me with this?
>>>>
>>>> I am a newbie at Tesseract training, so I would appreciate any help.
>>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

