Tom, Please see https://github.com/tesseract-ocr/tesseract/pull/466
I think the developers may want to focus on merging Google's private new LSTM codebase with the public GitHub repo.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 8, 2016 at 7:02 PM, Tom De Costere <[email protected]> wrote:

> It seems my topic is not suitable for the DEV forum (topic creation refused).
>
> I would sincerely appreciate it if anyone could bring this topic to the attention of the devs.
>
> Thanks in advance!
>
> Tom
>
> On Friday, 4 November 2016 at 13:21:56 UTC+1, shree wrote:
>>
>> Probably better to post on tesseract-dev, though there is no guarantee that the developers will reply.
>>
>> On 4 Nov 2016 3:07 p.m., "Tom De Costere" <[email protected]> wrote:
>>
>>> Just to be sure, are the developers watching this Google Group, or should I make a topic under the "tesseract-dev" group?
>>>
>>> FYI: we passed the 5k mark for number of fonts this morning.
>>>
>>> I'm thinking of notifying the users that they should only create box files for documents containing terrible handwriting, since I'm seeing quite good detection rates on new documents, even when they are not trained yet.
>>>
>>> On Thursday, 3 November 2016 at 17:53:51 UTC+1, shree wrote:
>>>>
>>>> Please see https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>>
>>>> The maximum number of fonts for each language is not very large.
>>>>
>>>> I am not even sure whether increasing the number of fonts beyond a limit will improve the recognition.
>>>>
>>>> I think it is unlikely that Tesseract can handle the thousands of box/tif pairs that you are planning.
>>>>
>>>> I hope one of the developers will reply with a more definitive response.
>>>>
>>>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for your responses!
>>>>>
>>>>> Let me clarify the situation in which the training is performed, so you understand why we have 130+ tr files.
>>>>>
>>>>> We have fill-in forms for our customers, which they hand over to our workers to specify what work our workers performed at their house and when. These forms have fill-in boxes for things like the date, name and work hours.
>>>>>
>>>>> The biggest waste of time at our company is manually entering these documents into our electronic bookkeeping application.
>>>>> At present, our staff have to retype the filled-in values from the paper forms into the application.
>>>>> As you can guess, this is a long, time-consuming task that nobody loves to do every day.
>>>>>
>>>>> Since at the moment almost no other OCR technology gives a good recognition rate for handwriting, we are trying Tesseract to improve this job.
>>>>>
>>>>> Our automated training process currently uses these fill-in forms as the basis for training Tesseract.
>>>>> We created a .NET program for generating the box files and correcting the OCR values, which some of our workers are using at the moment.
>>>>> The corrected box files are then sent to our OCR server (running Tesseract), which trains the language file with the new inputs.
>>>>>
>>>>> To improve the detection percentage, we are creating one big language file for our entire customer base, with a unique font for each customer, since every customer has his or her own handwriting.
>>>>>
>>>>> At the moment we have generated over 1000 box files for around 130 customers (130 out of 3000+ customers).
>>>>>
>>>>> So to give an example:
>>>>>
>>>>> ncorp.traineddata consists of fonts:
>>>>> - ocrB (standard printer font)
>>>>> - customerA (handwriting for customer A)
>>>>> - customerB (handwriting for customer B)
>>>>> - customerC (handwriting for customer C)
>>>>> - ...
>>>>>
>>>>> This is why we have over 130 TR files at the moment, and the number is rising steadily every hour.
>>>>>
>>>>> It would be ideal if Tesseract had a re-train function, instead of training the whole file again and again, so that we could simply inject a new font for a new customer when needed.
>>>>>
>>>>> Correct me if I'm wrong, but as far as I know and as far as I have found on the internet, Tesseract doesn't have a re-train function that takes an existing traineddata file as input and outputs an improved version of it.
>>>>>
>>>>> *@Shree*
>>>>> @Rkvsraman
>>>>>
>>>>> If there is a limit for Tesseract training, why are they supplying a font_properties file with around 4000 fonts?
>>>>> Or is this purely to be able to train using these fonts?
>>>>>
>>>>> Might there be another way to train such a large number of fonts?
>>>>> Could training the fonts into multiple language files be the solution?
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wednesday, 2 November 2016 at 19:41:54 UTC+1, rkvsraman wrote:
>>>>>>
>>>>>> But why would you need 130 tr files?
>>>>>>
>>>>>> Are you using 130 fonts?
>>>>>>
>>>>>> I guess there is a limit of 64 fonts in Tesseract.
>>>>>>
>>>>>> If it is just 1 font (or 1 kind of handwriting in your case), then you can put it in 1 multi-page TIFF file which does not exceed 120 pages.
>>>>>>
>>>>>> Best Regards
>>>>>> -Raman
>>>>>>
>>>>>> -----------------------------------------------
>>>>>> RKVS Raman
>>>>>> http://sites.google.com/site/rkvsraman
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]> wrote:
>>>>>>
>>>>>>> Please see https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>>>
>>>>>>> There seems to be a limit.
>>>>>>>
>>>>>>> ShreeDevi
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>
>>>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are trying to train Tesseract with a new font consisting of multiple handwritings from our customers.
>>>>>>>>
>>>>>>>> The training itself works nicely and the OCR results are very good (85-90% correct detection).
>>>>>>>>
>>>>>>>> However, today something strange started to happen during the training process (which we have automated using Python on Ubuntu 16.04).
>>>>>>>>
>>>>>>>> During training with mftraining we encountered the error *"Ouch! number of protos = 513, vs max of 512!Segmentation fault (core dumped)"*
>>>>>>>>
>>>>>>>> As a result, the pffmtable file, which is required in the next step, is not created.
>>>>>>>>
>>>>>>>> This started to happen after we reached a certain number of font files (130 concatenated TR files) on which the training has to run.
>>>>>>>>
>>>>>>>> Can anybody help us with this problem?
>>>>>>>>
>>>>>>>> *Software details:*
>>>>>>>> OS: Ubuntu 16.04.1 LTS
>>>>>>>> Codename: xenial
>>>>>>>>
>>>>>>>> Tesseract: 3.04 installed through apt-get
>>>>>>>>
>>>>>>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
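Regarding Tom's question about training the fonts into multiple language files: below is a minimal Python sketch of how the font list could be partitioned into batches, one batch per language file. It is only an illustration. The 64-font ceiling is Raman's guess from this thread, not a confirmed Tesseract limit, and the function name, the `ncorp` prefix and the numbered language names are all hypothetical.

```python
from typing import Dict, List


def split_fonts_into_languages(
    fonts: List[str],
    max_fonts_per_language: int = 64,  # assumption: the limit guessed in this thread
    prefix: str = "ncorp",             # hypothetical language-name prefix
) -> Dict[str, List[str]]:
    """Group font names into batches, one batch per (hypothetical) language file."""
    if max_fonts_per_language < 1:
        raise ValueError("max_fonts_per_language must be at least 1")
    batches: Dict[str, List[str]] = {}
    for start in range(0, len(fonts), max_fonts_per_language):
        # Name each batch by its index, e.g. ncorp000, ncorp001, ...
        name = f"{prefix}{start // max_fonts_per_language:03d}"
        batches[name] = fonts[start:start + max_fonts_per_language]
    return batches


if __name__ == "__main__":
    # One printer font plus 130 customer handwriting fonts, as in the example above.
    customer_fonts = ["ocrB"] + [f"customer{i}" for i in range(1, 131)]
    for language, members in split_fonts_into_languages(customer_fonts).items():
        print(language, len(members))
```

Each batch would then be trained as its own traineddata file. Tesseract's `-l` option accepts several languages joined with `+`, so the batches could in principle be combined at recognition time. Whether speed and accuracy hold up with many such files is something only testing would show.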

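Since the training is automated, it may also help to have the script detect the prototype overflow explicitly instead of failing later on the missing pffmtable file. The sketch below scans captured training output for the message quoted in this thread. The function name is hypothetical, and the pattern is taken only from the error text Tom posted, so other Tesseract versions may word it differently.

```python
import re
from typing import Optional, Tuple

# Pattern based solely on the error text quoted in this thread.
PROTO_OVERFLOW = re.compile(r"number of protos = (\d+), vs max of (\d+)")


def find_proto_overflow(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (protos, maximum) if the overflow message appears in the log, else None."""
    match = PROTO_OVERFLOW.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = "Ouch! number of protos = 513, vs max of 512!Segmentation fault (core dumped)"
    overflow = find_proto_overflow(sample)
    if overflow is not None:
        protos, maximum = overflow
        print(f"Training aborted: {protos} protos exceeds the maximum of {maximum}")
```

The automation could run this check on the output of the training step and stop the pipeline with a clear message before the next step looks for the pffmtable file.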
