[tesseract-ocr] Re: Training from Scratch

Des Bw Thu, 23 Nov 2023 01:28:31 -0800

If the original model lacks the ∠ symbol, fine tuning is not going to add 
it for you. We have all gone through that process. To introduce a new 
character, removing the top layer and training from there is the most 
effective approach.
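
For illustration only, here is a rough sketch of what "removing the top 
layer" looks like on the command line, written as a small Python helper that 
assembles the two commands without running them. It follows the pattern of 
combine_tessdata -e followed by lstmtraining with --continue_from, 
--append_index and --net_spec. The helper name, the file paths, the 
append_index of 5 and the net_spec string are placeholder assumptions, not 
values from this thread; they have to be adapted to the base model and to a 
traineddata whose unicharset contains ∠.

```python
import shlex


def top_layer_commands(base_traineddata, new_traineddata, list_file,
                       work_dir="layer_train", append_index=5,
                       net_spec="[Lfx256 O1c111]", max_iterations=3000):
    """Build (but do not run) the commands for replacing the top layer.

    All default values are placeholders and must be adapted to the
    actual base model and training data.
    """
    lstm_path = f"{work_dir}/base.lstm"
    # Step 1: extract the LSTM network from the existing traineddata.
    extract = ["combine_tessdata", "-e", base_traineddata, lstm_path]
    # Step 2: keep the layers below append_index, append the layers in
    # net_spec, and train against the traineddata with the new unicharset.
    train = [
        "lstmtraining",
        "--continue_from", lstm_path,
        "--traineddata", new_traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", f"{work_dir}/layer",
        "--train_listfile", list_file,
        "--max_iterations", str(max_iterations),
    ]
    return extract, train


if __name__ == "__main__":
    # Hypothetical file names, for printing the command lines only.
    commands = top_layer_commands("eng.traineddata",
                                  "data/new/new.traineddata",
                                  "data/list.train")
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the paths before anything 
is actually run.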


On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 smon...@gmail.com wrote:

> If I need to train new characters that are not recognized by a default 
> model, is fine tuning in this case the right approach?
> One of these characters is the angularity symbol: ∠
>
> This symbol appears in technical drawings and should be recognized in 
> those. E.g. for the scenario in the following picture, tesseract should 
> recognize this symbol.
>
>
>
> [image: angularity.png]
>
> Also, here is one of the images I tried to train with: 
> [image: angularity_0_r0.jpg] 
> They all look pretty similar to this one. The things that change are the 
> angle, the proportions and the thickness of the lines. All examples have 
> this 64x64 pixel box around the symbol.
>
>
> Is fine tuning the right approach for this scenario? I can only find 
> information on fine tuning for specific fonts. Also, for fine tuning the 
> "tesstrain" repository would not be needed, as it is used for training from 
> scratch, correct?
> desal...@gmail.com wrote on Wednesday, November 22, 2023 at 15:27:02 
> UTC+1:
>
>> From my limited experience, you need a lot more data than that to train 
>> from scratch. If you can't produce more data than that, you might first 
>> try to fine tune, and then train by removing the top layer of the best 
>> model.
>>
>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com 
>> wrote:
>>
>>> As it is not properly possible to combine my traineddata from scratch 
>>> with an existing one, I have decided to also train my traineddata model 
>>> on numbers. Therefore I wrote a script which synthetically generates 
>>> ground truth data with text2image. 
>>> This script uses dozens of different fonts and creates numbers in the 
>>> following formats: 
>>> X.XXX
>>> X.XX
>>> X,XX
>>> X,XXX
>>> I generated 10,000 files to train the numbers. Unfortunately, numbers 
>>> are recognized pretty poorly with the best model (most of the time only 
>>> "0.", "0" or "0," is recognized). 
>>> So I wanted to ask whether this is not enough training (ground truth) 
>>> data for proper recognition when I train on several fonts. 
>>> Thanks in advance for your help. 
>>>
>>
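
For reference, the number formats listed in the quoted message (X.XXX, X.XX, 
X,XX, X,XXX) could be generated as ground truth text along the following 
lines. This is a minimal sketch and not the original poster's script: the 
names PATTERNS, random_number and make_lines are hypothetical, each X is 
assumed to stand for one random digit, and the sketch only produces the text 
lines that would then be rendered with text2image.

```python
import random

# The four formats from the quoted message; each X stands for one digit.
PATTERNS = ("X.XXX", "X.XX", "X,XX", "X,XXX")


def random_number(pattern, rng):
    """Replace every X in the pattern with a random digit."""
    return "".join(str(rng.randrange(10)) if ch == "X" else ch
                   for ch in pattern)


def make_lines(count, seed=0):
    """Return `count` ground truth lines, cycling through the patterns."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [random_number(PATTERNS[i % len(PATTERNS)], rng)
            for i in range(count)]


if __name__ == "__main__":
    for line in make_lines(8):
        print(line)
```

Cycling through the patterns keeps the four formats evenly represented, and 
drawing every digit at random avoids a data set in which the leading digit is 
almost always the same.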

