It seems my topic is not suitable for the DEV forum (topic creation was refused).
I would sincerely appreciate it if anyone could bring this topic to the attention of the devs.

Thanks in advance!

Tom

On Friday, 4 November 2016 at 13:21:56 UTC+1, shree wrote:
>
> Probably better to post on tesseract-dev, though there is no guarantee
> that the developers will reply.
>
> On 4 Nov 2016 3:07 p.m., "Tom De Costere" <[email protected]> wrote:
>
>> Just to be sure, are the developers watching this Google Group, or should
>> I create a topic in the "tesseract-dev" group?
>>
>> FYI: we passed the 5k mark for the number of fonts this morning.
>>
>> I'm thinking of telling the users that they should only create box
>> files for documents containing terrible handwriting, since I'm seeing
>> quite good detection rates on new documents, even when they have not
>> been trained yet.
>>
>> On Thursday, 3 November 2016 at 17:53:51 UTC+1, shree wrote:
>>>
>>> Please see
>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>
>>> The maximum number of fonts for each language is not very large.
>>>
>>> I am not even sure whether increasing the number of fonts beyond a limit
>>> will improve the recognition.
>>>
>>> I think it is unlikely that tesseract can handle the thousands of box/tif
>>> pairs that you are planning.
>>>
>>> I hope one of the developers will reply with a more definitive response.
>>>
>>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thank you for your responses!
>>>>
>>>> Let me clarify the situation in which the training is performed, so
>>>> you understand why we have 130+ tr files.
>>>>
>>>> We have fill-in forms for our customers, which they have to hand over
>>>> to our workers, in order to specify when our workers were at their
>>>> house and what work they performed. On these forms there are fill-in
>>>> boxes, such as a date, a name and work hours.
>>>>
>>>> The major time waste at our company is the manual entry of these
>>>> documents into our electronic bookkeeping application.
>>>> Currently, our workforce has to manually retype the filled-in values
>>>> from the papers into the application.
>>>> As you can guess, this is a very long and time-consuming task, which
>>>> nobody loves to do every day.
>>>>
>>>> Since there are, at the moment, almost no other OCR technologies that
>>>> give a good recognition rate for handwriting, we are trying Tesseract
>>>> to improve this job.
>>>>
>>>> Our automated training algorithm uses these fill-in forms as the
>>>> basis for training Tesseract.
>>>> We created a .NET program for generating the box files and correcting
>>>> the OCR values, which some of our workers use at the moment.
>>>> The corrected box files are then sent to our OCR server (running
>>>> Tesseract), which trains the language file with the new inputs.
>>>>
>>>> So, in order to improve the detection percentage, we are creating one
>>>> big language file for our entire customer base, with a unique font for
>>>> each customer, since every customer has his/her own handwriting.
>>>>
>>>> At the moment we have generated over 1000 box files for around 130
>>>> customers (130 of our 3000+ customers).
>>>>
>>>> To give an example:
>>>>
>>>> ncorp.traineddata consists of fonts:
>>>> - ocrB (standard printer font)
>>>> - customerA (handwriting for customer A)
>>>> - customerB (handwriting for customer B)
>>>> - customerC (handwriting for customer C)
>>>> - ...
>>>>
>>>> This is why we have over 130 TR files at the moment, and the number is
>>>> steadily rising every hour.
>>>>
>>>> It would be ideal if Tesseract had a re-train function, instead of
>>>> training the whole file again and again, so that we could simply
>>>> inject a new font for a new customer when it's needed.
>>>>
>>>> Correct me if I'm wrong, but as far as I know, and as far as I have
>>>> found on the internet, Tesseract doesn't have a re-train function that
>>>> takes an existing traineddata file as input and outputs an improved
>>>> version of that traineddata file.
>>>>
>>>> *@Shree*
>>>> @Rkvsraman
>>>>
>>>> If there is a limit for Tesseract training, why is a font_properties
>>>> file with around 4000 fonts supplied?
>>>> Or is this purely to be able to train using these fonts?
>>>>
>>>> Might there be another way to train with such a large number of fonts?
>>>> Could training the fonts into multiple language files be the solution?
>>>>
>>>> Kind regards,
>>>>
>>>> Tom
>>>>
>>>> On Wednesday, 2 November 2016 at 19:41:54 UTC+1, rkvsraman wrote:
>>>>>
>>>>> But why would you need 130 tr files?
>>>>>
>>>>> Are you using 130 fonts?
>>>>>
>>>>> There is a limit of 64 fonts in tesseract, I guess.
>>>>>
>>>>> If it is just 1 font (or 1 kind of handwriting in your case), then you
>>>>> can put it in 1 multi-page tiff file that does not exceed 120 pages.
>>>>>
>>>>> Best Regards
>>>>> -Raman
>>>>>
>>>>> -----------------------------------------------
>>>>> RKVS Raman
>>>>> http://sites.google.com/site/rkvsraman
>>>>> ------------------------------------------------
>>>>>
>>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Please see
>>>>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>>
>>>>>> There seems to be a limit ---
>>>>>>
>>>>>> ShreeDevi
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We are trying to train tesseract with a new font consisting of
>>>>>>> multiple handwritings from our customers.
>>>>>>>
>>>>>>> The training itself works nicely and the OCR results are very good
>>>>>>> (85-90% correct detection).
>>>>>>>
>>>>>>> However, today something strange started to happen during the
>>>>>>> training process (which we have automated using Python on Ubuntu 10.04).
>>>>>>>
>>>>>>> During the training with mftraining we encountered the error
>>>>>>> *"Ouch! number of protos = 513, vs max of 512!"* followed by
>>>>>>> *"Segmentation fault (core dumped)"*.
>>>>>>>
>>>>>>> As a result, the pffmtable file, which is required in the next step,
>>>>>>> is not created.
>>>>>>>
>>>>>>> This started to happen after we reached a certain number of font
>>>>>>> files (130 concatenated TR files) on which the training has to run.
>>>>>>>
>>>>>>> Can anybody help us with this problem?
>>>>>>>
>>>>>>> *Software details:*
>>>>>>> OS: Ubuntu 16.04.1 LTS
>>>>>>> Codename: xenial
>>>>>>>
>>>>>>> Tesseract: 3.0.4 installed through APT-GET
>>>>>>>
>>>>>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
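
A minimal sketch of the "multiple language files" idea raised in the thread, written in Python since the training pipeline is already automated in Python. It only plans how font names could be grouped so that each language file stays under a chosen font limit; it does not run any Tesseract tool. The function name `split_fonts`, the `ncorp` prefix with a two-digit suffix, and the limit of 64 (Raman's guess, not a confirmed value) are all assumptions for illustration. Whether training smaller groups actually avoids the "number of protos" error has not been verified here.

```python
from typing import Dict, List


def split_fonts(
    font_names: List[str],
    max_fonts_per_lang: int,
    lang_prefix: str = "ncorp",  # hypothetical language name prefix
) -> Dict[str, List[str]]:
    """Group font names into chunks of at most max_fonts_per_lang.

    Returns a mapping from a generated language name (prefix plus a
    two-digit index) to the fonts that would be trained into it.
    """
    if max_fonts_per_lang < 1:
        raise ValueError("max_fonts_per_lang must be at least 1")

    groups: Dict[str, List[str]] = {}
    for start in range(0, len(font_names), max_fonts_per_lang):
        index = start // max_fonts_per_lang
        lang_name = f"{lang_prefix}{index:02d}"
        groups[lang_name] = font_names[start:start + max_fonts_per_lang]
    return groups


if __name__ == "__main__":
    # One printer font plus 130 customer fonts, as described in the thread.
    fonts = ["ocrB"] + [f"customer{i}" for i in range(130)]
    # 64 is the limit guessed in the thread, used here only as an example.
    plan = split_fonts(fonts, max_fonts_per_lang=64)
    for lang, members in plan.items():
        print(lang, len(members))
    # Languages joined with "+" in the form accepted by tesseract's -l option.
    print("+".join(plan))
```

Each group would then be trained separately with the existing pipeline. At recognition time, Tesseract's `-l` option accepts several languages joined with `+`, so the groups could in principle be loaded together; how recognition speed and accuracy behave with many such files is something to measure, not something this sketch shows.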

