Hey Lorenzo,

thanks a lot for your response. I've seen in the hOCR files of different 
technical drawings that Tesseract's text segmentation has massive 
problems recognizing zones with text, probably because of the various lines 
and complex constructions within the technical drawings. Even the zones 
where text appears are recognized only very rarely. So it seems pretty 
obvious to me that Tesseract is not built for documents without clear text 
lines.
Therefore I decided to follow your suggestion to crop out the boxes 
(Feature Control Frames) and feed them separately to Tesseract. To identify 
those boxes I will try to use OpenCV. I will also try to generate training 
data similar to these Feature Control Frames for training Tesseract. Do you 
think this approach could be successful?


Lorenzo Blz wrote on Monday, 27 November 2023 at 16:52:46 UTC+1:

>
> Hi Simon, yes, I think the instructions you can give to the segmentation 
> step are quite limited, mostly the PSM parameter and, I suppose, a few minor 
> ones. There is something about tables, but I've never used it, and yours 
> might be too small for it to work. Yes, you should be able to see what is 
> happening by looking at the hOCR file.
>
> You could also try the attached script; it was made for the 4.x version 
> but might work with 5.x too. It draws boxes around letters according to the 
> tesseract output. I'm attaching the output on a simple text and on several 
> crops from your image: only in the clean one can you see the text boxes. 
> You can do the same from the hOCR file.
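For reference (this is not the attached script), the same kind of visualization can be rebuilt from tesseract's makebox output, where each line reads `<char> x1 y1 x2 y2 page` with y measured from the bottom of the image. A parsing sketch:

```python
def parse_box_lines(lines, img_height):
    """Convert tesseract 'makebox' lines to top-left-origin rectangles."""
    rects = []
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip malformed lines
        ch = parts[0]
        x1, y1, x2, y2 = map(int, parts[1:5])
        # tesseract's origin is the bottom-left corner; flip the y axis so
        # rectangles can be drawn with top-left-origin image libraries.
        rects.append((ch, x1, img_height - y2, x2, img_height - y1))
    return rects

# Drawing, e.g. with OpenCV:
#   for ch, x1, yt, x2, yb in parse_box_lines(open("out.box"), h):
#       cv2.rectangle(img, (x1, yt), (x2, yb), (0, 0, 255), 1)
```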
>
> Yes, you still need to fine-tune for the new character. I was able to 
> train up to 57k iterations while still improving the results on a test 
> dataset. You need to fine-tune including the new symbols AND all the other 
> symbols you expect to recognize in the training dataset.
>
>
> I'm not sure if you are using something like this:
>
>  merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset 
> $(TRAIN)/my.unicharset  "$@"
>
> if so, you can replace it with:
>
>  cp "$(TRAIN)/my.unicharset" "data/unicharset"
>
> and the new model will output only the characters that are present in your 
> new dataset (for example to discard lower case letters, the < character, %, 
> !, #, etc.)
>
> Also, if you do not need to recognize the < symbol, you could reuse it 
> rather than adding a completely new one. I mean that when you generate the 
> images with the "angle" symbol, you put < in the transcription. Maybe it 
> helps, maybe it won't.
>
>
>
> Bye
>
> Lorenzo
>
>
>
>
> On Sat, 25 Nov 2023 at 12:25, Simon <smon...@gmail.com> wrote:
>
>> Yes, in general I want to recognize this part "< 0,05 A", except that the 
>> < is actually ∠, the character for angularity. 
>>
>> The segmentation process of Tesseract can't be edited, right? So you mean 
>> I would need to make a Tesseract-independent program that localizes the 
>> boxes, crops them out, and feeds them to Tesseract? In that case I would 
>> still need to train Tesseract to recognize ∠, so I am still wondering how 
>> to train this symbol properly. 
>>
>> Since you asked whether the segmentation step is able to isolate it: I 
>> can check this by looking at the hOCR information, right?
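Checking the hOCR can also be scripted; a minimal sketch (regex-based for brevity, an HTML parser would be more robust) that lists the bounding box of every text line tesseract found:

```python
# Sketch: pull each ocr_line bounding box out of a tesseract hOCR file,
# to check whether the "< 0,05 A" frame was isolated as its own line.
import re

LINE_RE = re.compile(
    r"class=['\"]ocr_line['\"][^>]*?title=['\"]bbox (\d+) (\d+) (\d+) (\d+)")

def line_bboxes(hocr_text):
    """Return (x1, y1, x2, y2) for every ocr_line element in the hOCR."""
    return [tuple(map(int, m.groups())) for m in LINE_RE.finditer(hocr_text)]

# Usage (file name is a placeholder; hOCR comes from `tesseract img out hocr`):
#   print(line_bboxes(open("out.hocr").read()))
```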
>>
>>
>>
>> Lorenzo Blz wrote on Friday, 24 November 2023 at 10:45:14 UTC+1:
>>
>>> Hi Simon,
>>> if I understand correctly how tesseract works, it follows these steps:
>>>
>>> - it segments the image into lines of text
>>> - it then takes each individual line and slides a small window, 1px wide 
>>> I think, over it, from one end to the other. At each step the model 
>>> outputs a prediction. The model, being a bidirectional LSTM, has some 
>>> memory of the previous and following pixel columns.
>>> - all these predictions are converted into characters using beam search
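As a toy illustration of that last step (using a greedy collapse instead of a real beam search, and a made-up label-to-character map), the CTC-style conversion looks roughly like:

```python
def greedy_decode(frame_labels, charset, blank=0):
    """Collapse repeated per-frame predictions and drop blanks (CTC-style).

    frame_labels: the winning label for each pixel-column step.
    charset: maps each non-blank label to its character.
    """
    out, prev = [], blank
    for lab in frame_labels:
        # Emit a character only when the label changes and is not the blank.
        if lab != blank and lab != prev:
            out.append(charset[lab])
        prev = lab
    return "".join(out)
```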
>>>
>>> Please correct me if I got it wrong. So the first thing I think about, 
>>> looking at your picture, is the segmentation step. Do you want to read 
>>> the "< 0,05 A" block only? Is the segmentation step able to isolate it? 
>>> This is the first thing I would try to understand.
>>> Also, your sample image for "<" has a very different angle from the one 
>>> before 0,05.
>>>
>>> In this case I would try to do a custom segmentation, looking for 
>>> rectangular boxes of a certain height, aspect ratio, etc., then cropping 
>>> these out (maybe dropping the rectangular box and the black vertical 
>>> lines) and feeding them to tesseract. This of course requires custom 
>>> programming.
>>>
>>> This might give good results even without fine tuning. I would try this 
>>> manually with GIMP first.
>>>
>>>
>>> Also, I suppose you are not going to encounter a lot of wild fonts in 
>>> these kinds of diagrams. The more fonts you use, the harder the training. 
>>> I would focus on very few fonts, even one. I would start with exactly one 
>>> font and train on that to see quickly whether my training setup/pipeline 
>>> is working, and whether the training results carry over to the diagrams 
>>> later. If the model's error rate is good on the individual text lines but 
>>> bad on the real images, it might be a segmentation problem that training 
>>> cannot fix. Or the problem might be the external box, which I suppose you 
>>> do not have in your generated data.
>>>
>>> Ideally, I would use real crops from these diagrams rather than images 
>>> from text2image.
>>>
>>> Also, distinguishing 0 from O with many fonts is very hard. Often you 
>>> have domain knowledge that can help you fix these errors in post; for 
>>> example, 0,O5 can be easily spotted and fixed. You can, for example, 
>>> assume that each box contains only one kind of data and guess the most 
>>> likely one from this or from the box sequence, etc.
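That post-correction idea can be sketched as follows; the lookalike table and the pattern are illustrative only (for fields known to contain numbers), and `fix_numeric_field` is an invented name:

```python
# Sketch: inside a numeric tolerance field, a letter "O" can only be a
# misread "0", so lookalikes are mapped back to digits.
import re

def fix_numeric_field(text):
    """Map letter lookalikes to digits in tokens that are otherwise numeric."""
    table = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})
    # Only touch tokens shaped like a decimal number, e.g. "0,O5" -> "0,05".
    return re.sub(r"\b[\dOolI]+[.,][\dOolI]+\b",
                  lambda m: m.group(0).translate(table), text)
```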
>>>
>>> I got good results with 20k samples (real-world scanned docs, multiple 
>>> fonts). 10k seems reasonable; I also assume your output character set is 
>>> very small, like the digits, a few capital letters, and a couple of 
>>> symbols (no %, ^, &, {, etc.).
>>>
>>>
>>>
>>> Lorenzo
>>>
>>> On Thu, 23 Nov 2023 at 10:16, Simon <smon...@gmail.com> wrote:
>>>
>>>> If I need to train new characters that are not recognized by a default 
>>>> model, is fine-tuning the right approach in this case?
>>>> One of these characters is the one for angularity: ∠
>>>>
>>>> This symbol appears in technical drawings and should be recognized in 
>>>> those. E.g., for the scenario in the following picture, tesseract 
>>>> should recognize this symbol. 
>>>>
>>>>
>>>>
>>>> [image: angularity.png]
>>>>
>>>> Also here is one of the pngs I tried to train with: 
>>>> [image: angularity_0_r0.jpg] 
>>>> They all look pretty similar to this one. The things that change are 
>>>> the angle, the proportions, and the thickness of the lines. All 
>>>> examples have this 64x64 pixel box around them. 
>>>>
>>>>
>>>> Is fine-tuning the right approach for this scenario? I only find 
>>>> information about fine-tuning for specific fonts. Also, for fine-tuning 
>>>> the "tesstrain" repository would not be needed, as it is used for 
>>>> training from scratch, correct?
>>>> desal...@gmail.com wrote on Wednesday, 22 November 2023 at 15:27:02 
>>>> UTC+1:
>>>>
>>>>> From my limited experience, you need a lot more data than that to 
>>>>> train from scratch. If you can't make more data than that, you might 
>>>>> first try to fine-tune, and then train by removing the top layer of 
>>>>> the best model. 
>>>>>
>>>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> As it is not properly possible to combine my from-scratch traineddata 
>>>>>> with an existing one, I have decided to also train numbers into my 
>>>>>> traineddata model. Therefore I wrote a script which synthetically 
>>>>>> generates ground-truth data with text2image. 
>>>>>> This script uses dozens of different fonts and creates numbers in 
>>>>>> the following formats: 
>>>>>> X.XXX
>>>>>> X.XX
>>>>>> X,XX
>>>>>> X,XXX
>>>>>> I generated 10,000 files to train the numbers. But unfortunately, 
>>>>>> numbers are recognized pretty poorly with the best model (most of 
>>>>>> the time only "0.", "0", or "0," is recognized). 
>>>>>> So I wanted to ask whether this is simply not enough training 
>>>>>> (ground-truth) data for proper recognition when I train several 
>>>>>> fonts. 
>>>>>> Thanks in advance for your help. 
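The ground-truth text for those formats is straightforward to script; a sketch (the file naming and the text2image invocation in the comment are assumptions, not a tested pipeline):

```python
import random

FORMATS = ("X.XXX", "X.XX", "X,XX", "X,XXX")

def random_number(fmt, rng=random):
    """Replace every X in the pattern with a random digit."""
    return "".join(rng.choice("0123456789") if c == "X" else c for c in fmt)

def write_ground_truth(n, prefix="num"):
    # One transcription per .gt.txt file; each would then be rendered to an
    # image with something like:
    #   text2image --text=num_0.gt.txt --outputbase=num_0 \
    #       --font='SomeFont' --fonts_dir=/usr/share/fonts
    for i in range(n):
        with open(f"{prefix}_{i}.gt.txt", "w") as f:
            f.write(random_number(random.choice(FORMATS)) + "\n")
```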
>>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/6a904604-f0b7-48ef-a4b2-cf1e97123041n%40googlegroups.com
>

