Re: [tesseract-ocr] mftraining Segmentation fault error

ShreeDevi Kumar Tue, 08 Nov 2016 06:11:21 -0800

Tom,

Please see https://github.com/tesseract-ocr/tesseract/pull/466


I think the developers may want to focus on the merge of Google's private
new LSTM codebase with the public github repo.




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 8, 2016 at 7:02 PM, Tom De Costere <[email protected]>
wrote:

> It seems my topic is not suitable for the DEV forum. (topic creation
> refused)
>
> I would sincerely appreciate it if anyone could bring this topic to the
> attention of the devs.
>
> Thanks in advance!
>
> Tom
>
> On Friday, 4 November 2016 13:21:56 UTC+1, shree wrote:
>>
>> Probably better to post on tesseract-dev, though there is no guarantee
>> that the developers will reply.
>>
>> On 4 Nov 2016 3:07 p.m., "Tom De Costere" <[email protected]> wrote:
>>
>>> Just to be sure, are the developers watching this Google Group or should
>>> I make a topic under the "tesseract-dev" group?
>>>
>>> FYI: we passed the 5,000-font mark this morning.
>>>
>>> I'm thinking of notifying the users that they should only create box
>>> files for documents containing terrible handwriting, since I'm seeing
>>> quite good detection rates on new documents, even when they are not
>>> trained yet.
>>>
>>> On Thursday, 3 November 2016 17:53:51 UTC+1, shree wrote:
>>>>
>>>> Please see
>>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>>
>>>> The max no of fonts for each language is not very large.
>>>>
>>>> I am not even sure whether increasing the number of fonts beyond a
>>>> limit will improve the recognition.
>>>>
>>>> I think it is unlikely that tesseract can handle thousands of box/tif
>>>> pairs that you are planning.
>>>>
>>>> I hope one of the developers will reply with a more definitive
>>>> response.
>>>>
>>>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for your responses!
>>>>>
>>>>> Let me clarify the situation in which the training is performed, so
>>>>> you understand why we have 130+ tr files.
>>>>>
>>>>>
>>>>> We have fill-in forms for our customers, which they have to hand over
>>>>> to our workers, in order to specify when and what work our workers have
>>>>> performed at their house. On these forms there are fill-in boxes, such as
>>>>> date, name and work hours.
>>>>>
>>>>> Now the major time waste at our company is the manual parsing of the
>>>>> documents into our electronic bookkeeping application.
>>>>> The current situation is that our workforce has to manually retype the
>>>>> filled-in values from the papers into the application.
>>>>> As you can guess, this is a very long and time-consuming task, which
>>>>> nobody loves to do every day.
>>>>>
>>>>> Since there are, at the moment, almost no other OCR technologies which
>>>>> give a good recognition rate for handwriting, we are trying Tesseract to
>>>>> improve this job.
>>>>>
>>>>>
>>>>> Our automated training process uses these fill-in forms as the basis
>>>>> for training Tesseract.
>>>>> We created a .NET program for generating the box files and correcting
>>>>> the OCR values, which some of our workers use at the moment.
>>>>> The corrected box files are then sent to our OCR server (running
>>>>> Tesseract), which trains the language file with the new inputs.
>>>>>
>>>>> So in order to improve the detection percentage, we are creating one
>>>>> big language file for our entire customer base, with a unique font for
>>>>> each customer, since every customer has his/her own handwriting.
>>>>>
>>>>> At the moment we have generated over 1000 box files for around 130
>>>>> customers (130 from 3000+ customers).
>>>>>
>>>>>
>>>>> So to give an example:
>>>>>
>>>>> ncorp.traineddata consists of fonts:
>>>>> - ocrB (standard printer font)
>>>>> - customerA (handwriting for customer A)
>>>>> - customerB (handwriting for customer B)
>>>>> - customerC (handwriting for customer C)
>>>>> - ...
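For illustration, a font list like the one above could be turned into font_properties entries roughly as follows. This is a minimal sketch, not the poster's actual tooling: the helper names are hypothetical, the line format (`fontname italic bold fixed serif fraktur`, each flag 0 or 1) is the one described in the Tesseract 3.x training documentation, and the all-zero flags are placeholder assumptions.

```python
# Hypothetical helpers; names and default flag values are assumptions.
# Assumed line format (Tesseract 3.x training docs):
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>   (each flag 0 or 1)

def font_properties_line(name, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Return one font_properties line for a single font name."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each flag must be 0 or 1")
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    return "{} {}".format(name, " ".join(str(flag) for flag in flags))


def build_font_properties(names):
    """Return the file content for a list of font names, all flags set to 0."""
    return "".join(font_properties_line(name) + "\n" for name in names)


if __name__ == "__main__":
    # Font names taken from the example in the message above.
    print(build_font_properties(["ocrB", "customerA", "customerB", "customerC"]), end="")
```

Each new customer would then add one line to the file, which is why the file grows with the number of trained handwritings.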
>>>>>
>>>>>
>>>>> This is why we have over 130 TR files at the moment, and the number is
>>>>> steadily rising every hour.
>>>>>
>>>>>
>>>>> Now it would be ideal if Tesseract had a re-train function, instead of
>>>>> training the whole file again and again, so that we could simply inject
>>>>> a new font for a new customer when it's needed.
>>>>>
>>>>> Correct me if I'm wrong, but as far as I know and as far as I have
>>>>> found on the internet, Tesseract doesn't have a re-train function that
>>>>> takes an existing traineddata file as input and outputs an improved
>>>>> version of that file.
>>>>>
>>>>>
>>>>> *@Shree*
>>>>> @Rkvsraman
>>>>>
>>>>> If there is a limit for Tesseract training, why are they supplying a
>>>>> font_properties file with around 4000 fonts then?
>>>>> Or is this purely to be able to train using these fonts?
>>>>>
>>>>> Might there be another way to use the training for such a large amount
>>>>> of fonts?
>>>>> Can training the fonts into multiple language files then be the
>>>>> solution?
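Whether splitting the fonts over several language files avoids the limit is left open in this thread. Purely as a sketch of the idea, the fonts could be partitioned as below. The cap of 64 fonts per language only echoes the unconfirmed guess made elsewhere in this thread, and the function names and the `ncorp` prefix numbering are hypothetical. The `+` separator is the syntax tesseract's `-l` option accepts for loading several languages at once.

```python
# Hypothetical sketch: split a long font list into several smaller "languages".
# max_per_lang is an assumption, not a documented Tesseract limit.

def split_fonts(fonts, max_per_lang):
    """Partition fonts into consecutive groups of at most max_per_lang items."""
    if max_per_lang < 1:
        raise ValueError("max_per_lang must be at least 1")
    return [fonts[i:i + max_per_lang] for i in range(0, len(fonts), max_per_lang)]


def language_names(prefix, group_count):
    """Name one language per group: prefix0, prefix1, ..."""
    return ["{}{}".format(prefix, index) for index in range(group_count)]


def lang_argument(names):
    """Join language names with '+', the form accepted by `tesseract -l`."""
    return "+".join(names)


if __name__ == "__main__":
    fonts = ["customer{}".format(n) for n in range(130)]
    groups = split_fonts(fonts, max_per_lang=64)
    names = language_names("ncorp", len(groups))
    for name, group in zip(names, groups):
        print(name, len(group))
    print("tesseract form.tif out -l " + lang_argument(names))
```

Each group would have to be trained into its own traineddata file, so recognition speed and accuracy with several languages loaded would need to be measured before relying on this.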
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wednesday, 2 November 2016 19:41:54 UTC+1, rkvsraman wrote:
>>>>>>
>>>>>> But why would you need 130 tr files?
>>>>>>
>>>>>> Are you using 130 fonts?
>>>>>>
>>>>>> There is a limit of 64 fonts, I guess, in Tesseract.
>>>>>>
>>>>>> If it is just 1 font (or 1 kind of handwriting in your case), then you
>>>>>> can put it in 1 multi-page TIFF file which does not exceed 120 pages.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best Regards
>>>>>> -Raman
>>>>>>
>>>>>> -----------------------------------------------
>>>>>> RKVS Raman
>>>>>> http://sites.google.com/site/rkvsraman
>>>>>> ------------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Please see
>>>>>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>>>
>>>>>>> There seems to be a limit ---
>>>>>>>
>>>>>>> ShreeDevi
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>
>>>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are trying to train tesseract with a new font consisting of
>>>>>>>> multiple handwritings from our customers.
>>>>>>>>
>>>>>>>> The training itself works nicely and the OCR results are very good
>>>>>>>> (85-90% correct detection).
>>>>>>>>
>>>>>>>>
>>>>>>>> However, today something strange started to happen during the
>>>>>>>> training process (which we have automated using Python on Ubuntu
>>>>>>>> 16.04).
>>>>>>>>
>>>>>>>> During the training with mftraining we encountered the error:
>>>>>>>> "Ouch! number of protos = 513, vs max of 512!
>>>>>>>> Segmentation fault (core dumped)"
>>>>>>>>
>>>>>>>> As a result, the pffmtable file, which is required in the next
>>>>>>>> step, is not created.
>>>>>>>>
>>>>>>>> This started to happen after we reached a certain number of font
>>>>>>>> files (130 concatenated TR files) on which the training has to happen.
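An automated pipeline like the one described here could at least detect this failure explicitly instead of continuing without pffmtable. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the `-F`, `-U` and `-O` options are the ones shown for mftraining in the Tesseract 3.x training documentation, and the regular expression is based only on the error text quoted above.

```python
import os
import re
import subprocess

# Pattern derived from the message quoted in this thread (an assumption that
# other versions print the same wording).
PROTO_ERROR = re.compile(r"number of protos = (\d+), vs max of (\d+)")


def build_mftraining_cmd(lang, tr_files, font_properties="font_properties",
                         unicharset="unicharset"):
    """Build the mftraining command line as a list of arguments."""
    return ["mftraining", "-F", font_properties, "-U", unicharset,
            "-O", lang + ".unicharset"] + list(tr_files)


def find_proto_overflow(output):
    """Return (protos, maximum) if the overflow message is present, else None."""
    match = PROTO_ERROR.search(output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def run_mftraining(lang, tr_files, workdir="."):
    """Run mftraining and raise RuntimeError if it visibly failed."""
    proc = subprocess.run(build_mftraining_cmd(lang, tr_files), cwd=workdir,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    overflow = find_proto_overflow(proc.stdout)
    if overflow:
        raise RuntimeError("proto limit exceeded: %d vs max %d" % overflow)
    if proc.returncode != 0:  # a crash by signal gives a negative code
        raise RuntimeError("mftraining exited with code %d" % proc.returncode)
    if not os.path.exists(os.path.join(workdir, "pffmtable")):
        raise RuntimeError("mftraining did not create pffmtable")
    return proc.stdout
```

This only reports the problem; it does not raise the 512 proto limit, which is compiled into mftraining.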
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Can anybody help us with this problem?
>>>>>>>>
>>>>>>>>
>>>>>>>> *Software details:*
>>>>>>>> OS:                  Ubuntu 16.04.1 LTS
>>>>>>>> Codename:       xenial
>>>>>>>>
>>>>>>>> Tesseract:        3.04  installed through apt-get
>>>>>>>>
>>>>>>>> tesseract-ocr/xenial,now            3.04.01-4 amd64 [installed]
>>>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all   [installed,automatic]
>>>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all   [installed,automatic]
>>>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all   [installed,automatic]
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e
>>>>>>>> 0-497e-806f-4de580b07a80%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>

