Re: [tesseract-ocr] mftraining Segmentation fault error

Tom De Costere Tue, 08 Nov 2016 05:32:58 -0800

It seems my topic is not suitable for the DEV forum. (topic creation 
refused)


I would sincerely appreciate it if anyone could bring this topic to the 
attention of the devs.

Thanks in advance!

Tom

On Friday, 4 November 2016 at 13:21:56 UTC+1, shree wrote:
>
> Probably better to post on tesseract-dev, though there is no guarantee 
> that the developers will reply.
>
> On 4 Nov 2016 3:07 p.m., "Tom De Costere" <[email protected]> wrote:
>
>> Just to be sure, are the developers watching this Google Group or should 
>> I make a topic under the "tesseract-dev" group?
>>
>> FYI: we passed 5,000 fonts this morning.
>>
>> I'm thinking of telling the users that they should only create box 
>> files for documents containing terrible handwriting, since I'm seeing 
>> quite good detection rates on new documents, even when they have not 
>> been trained yet.
>>
>> On Thursday, 3 November 2016 at 17:53:51 UTC+1, shree wrote:
>>>
>>> Please see 
>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>
>>> The maximum number of fonts for each language is not very large.
>>>
>>> I am not even sure whether increasing the number of fonts beyond a 
>>> certain point will improve the recognition.
>>>
>>> I think it is unlikely that tesseract can handle the thousands of 
>>> box/tif pairs that you are planning.
>>>
>>> I hope one of the developers will reply with a more definitive response. 
>>>
>>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thank you for your responses!
>>>>
>>>> Let me clarify the situation in which the training is performed, so 
>>>> you understand why we have 130+ tr files.
>>>>
>>>>
>>>> We have fill-in forms for our customers, which they have to hand over 
>>>> to our workers in order to specify when our workers were at their 
>>>> house and what work they performed. On these forms there are fill-in 
>>>> boxes, such as date, name and work hours.
>>>>
>>>> The biggest waste of time at our company is the manual entry of these 
>>>> documents into our electronic bookkeeping application.
>>>> Currently, our staff have to manually retype the filled-in values 
>>>> from the paper forms into the application.
>>>> As you can guess, this is a long and time-consuming task, which 
>>>> nobody loves to do every day.
>>>>
>>>> Since there are, at the moment, almost no other OCR technologies which 
>>>> give a good recognition rate for handwriting, we are trying Tesseract to 
>>>> improve this job.
>>>>
>>>>
>>>> Our automated training process currently uses these fill-in forms as 
>>>> the basis for training Tesseract.
>>>> We created a .NET program for generating the box files and correcting 
>>>> the OCR values, which some of our workers use at the moment.
>>>> The corrected box files are then sent to our OCR server (running 
>>>> Tesseract), which trains the language file with the new inputs.
>>>>
>>>> So, in order to improve the detection rate, we are creating one big 
>>>> language file for our entire customer base, with a unique font for 
>>>> each customer, since every customer has his or her own handwriting.
>>>>
>>>> At the moment we have generated over 1000 box files for around 130 
>>>> customers (130 of our 3000+ customers).
>>>>
>>>>
>>>> So to give an example:
>>>>
>>>> ncorp.traineddata consists of these fonts:
>>>> - ocrB (standard printer font)
>>>> - customerA (handwriting for customer A)
>>>> - customerB (handwriting for customer B)
>>>> - customerC (handwriting for customer C)
>>>> - ...
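
A minimal sketch of the font_properties entries that a font list like this implies, assuming the Tesseract 3.x format of one line per font: the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur). The helper names are hypothetical, and the customer names are the placeholders from the list above.

```python
def font_properties_line(font_name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return one font_properties line: font name plus five 0/1 style flags."""
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(flag)) for flag in flags])


def build_font_properties(font_names):
    """Build the text of a font_properties file with every flag set to 0."""
    return "\n".join(font_properties_line(name) for name in font_names) + "\n"


if __name__ == "__main__":
    # Placeholder handwriting "fonts", one per customer.
    print(build_font_properties(["customerA", "customerB", "customerC"]), end="")
```

Setting every flag to 0 for handwriting is an assumption made here for illustration; the thread does not say which flags were used.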
>>>>
>>>>
>>>> This is why we have over 130 TR files at the moment, and the number is 
>>>> steadily rising every hour.
>>>>
>>>>
>>>> It would be ideal if Tesseract had a re-train function, instead of 
>>>> training the whole file again and again, so that we could simply 
>>>> inject a new font for a new customer when needed.
>>>>
>>>> Correct me if I'm wrong, but as far as I know and as far as I have 
>>>> found on the internet, Tesseract doesn't have a re-train function that 
>>>> takes an existing traineddata file as input and outputs an improved 
>>>> version of it.
>>>>
>>>>
>>>> *@Shree*
>>>> @Rkvsraman
>>>>
>>>> If there is a limit for Tesseract training, why are they supplying a 
>>>> font_properties file with around 4000 fonts then?
>>>> Or is this purely to be able to train using these fonts?
>>>>
>>>> Might there be another way to train with such a large number of 
>>>> fonts?
>>>> Could training the fonts into multiple language files be the 
>>>> solution?
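
The idea of spreading the fonts over several language files can be sketched as a simple partitioning step. This is only an illustration: the group size of 64 is taken from the limit guessed elsewhere in this thread, the file names follow an assumed `[lang].[fontname].exp[num].tr` pattern, and the thread does not establish whether splitting actually avoids the crash.

```python
def split_into_groups(tr_files, max_per_group):
    """Split .tr file names into consecutive groups of at most max_per_group."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    ordered = sorted(tr_files)  # stable, repeatable assignment of fonts to groups
    return [ordered[i:i + max_per_group]
            for i in range(0, len(ordered), max_per_group)]


if __name__ == "__main__":
    # Hypothetical file names for 130 customer fonts.
    files = [f"ncorp.customer{n:03d}.exp0.tr" for n in range(130)]
    for index, group in enumerate(split_into_groups(files, 64)):
        # Each group would be trained into its own language file.
        print(f"group {index}: {len(group)} files")
```

With 130 files and a group size of 64 this yields three groups of 64, 64 and 2 files.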
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Tom
>>>>
>>>>> On Wednesday, 2 November 2016 at 19:41:54 UTC+1, rkvsraman wrote:
>>>>>
>>>>> But why would you need 130 tr files? 
>>>>>
>>>>> Are you using 130 fonts?
>>>>>
>>>>> I think there is a limit of 64 fonts in tesseract. 
>>>>>
>>>>> If it is just 1 font (or 1 kind of handwriting in your case), then you 
>>>>> can put it in 1 multi-page tiff file that does not exceed 120 pages. 
>>>>>
>>>>>
>>>>>
>>>>> Best Regards
>>>>> -Raman
>>>>>
>>>>> -----------------------------------------------
>>>>> RKVS Raman
>>>>> http://sites.google.com/site/rkvsraman
>>>>> ------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Please see 
>>>>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>>
>>>>>> There seems to be a limit.
>>>>>>
>>>>>> ShreeDevi
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We are trying to train tesseract with a new font consisting of 
>>>>>>> multiple handwritings from our customers.
>>>>>>>
>>>>>>> The training itself works nicely and the OCR results are very good 
>>>>>>> (85-90% correct detection).
>>>>>>>
>>>>>>>
>>>>>>> However, today something strange started to happen during the 
>>>>>>> training process (which we have automated using Python on Ubuntu 16.04).
>>>>>>>
>>>>>>> During the training with MFTraining we encountered the error "*Ouch! 
>>>>>>> number of protos = 513, vs max of 512!Segmentation fault (core dumped)"*
>>>>>>>
>>>>>>> As a result, the pffmtable file, which is required in the next 
>>>>>>> step, is not created.
>>>>>>>
>>>>>>> This started to happen after we reached a certain number of font 
>>>>>>> files (130 concatenated TR files) on which the training has to happen.
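
A minimal sketch of how an automated Python pipeline might catch this failure early, so that a crashed step is reported instead of the next step failing on a missing pffmtable. The names `run_training_step` and `TrainingStepError` are hypothetical; the command is passed in as-is, so nothing about mftraining's flags is assumed.

```python
import subprocess
from pathlib import Path


class TrainingStepError(RuntimeError):
    """Raised when a training step crashes or does not produce its output."""


def run_training_step(command, expected_outputs, cwd="."):
    """Run one training command and check that its output files exist.

    `command` is the full argument list your pipeline already uses for the
    step; `expected_outputs` are file names relative to `cwd`.
    """
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)

    # On POSIX, a negative return code means the child was killed by a
    # signal (for example -11 for SIGSEGV, a segmentation fault).
    if result.returncode != 0:
        raise TrainingStepError(
            f"{command[0]} exited with code {result.returncode}:\n{result.stderr}"
        )

    missing = [name for name in expected_outputs
               if not (Path(cwd) / name).is_file()]
    if missing:
        raise TrainingStepError(
            f"{command[0]} did not create: {', '.join(missing)}"
        )
    return result.stdout
```

The pipeline would call this with its existing mftraining argument list and `["pffmtable"]` as the expected output, and stop when `TrainingStepError` is raised.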
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Can anybody help us with this problem?
>>>>>>>
>>>>>>>
>>>>>>> *Software details:*
>>>>>>> OS:                  Ubuntu 16.04.1 LTS
>>>>>>> Codename:       xenial
>>>>>>>
>>>>>>> Tesseract:        3.04  installed through apt-get
>>>>>>>
>>>>>>> tesseract-ocr/xenial,now                 3.04.01-4 amd64 [installed]
>>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all 
>>>>>>> [installed,automatic]
>>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all 
>>>>>>> [installed,automatic]
>>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all 
>>>>>>> [installed,automatic]
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>

