April 2017 - that was probably the 3.0x version. Try the 3.05 branch.

https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01

3.05.01 Release, published by zdenop on Jun 1, 2017 (26 commits to 3.05 since this release).

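For anyone retracing the comparison in this thread, the sweep over the three tessdata directories and page segmentation modes 1 to 13 can be scripted so that each run writes to its own output file instead of overwriting sample.txt. The following is a minimal sketch, not a tested recipe. The directory names and sample.tif come from the commands quoted below; the output naming scheme and the RUN switch are assumptions made for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: run one image through every tessdata directory and psm 1..13.
# Each run gets its own output base so the results can be compared afterwards.
# RUN is a hypothetical switch: unset (or 0) prints the commands, RUN=1 runs them.

IMAGE="sample.tif"
TESSDATA_DIRS="tessdata-master tessdata_best-master tessdata_fast-master"

sweep() {
    for dir in $TESSDATA_DIRS; do
        psm=1
        while [ "$psm" -le 13 ]; do
            # Illustrative naming scheme, e.g. sample.tessdata-master.psm1(.txt)
            out="sample.$dir.psm$psm"
            if [ "${RUN:-0}" = "1" ]; then
                tesseract --tessdata-dir "$dir" "$IMAGE" "$out" --psm "$psm"
            else
                echo "tesseract --tessdata-dir $dir $IMAGE $out --psm $psm"
            fi
            psm=$((psm + 1))
        done
    done
}

sweep
```

Running it with RUN=1 should leave one .txt file per directory and mode, which can then be compared side by side. Individual modes may fail or produce empty output on a given page; the loop simply moves on to the next one.
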
On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <[email protected]> wrote:

> Hi Shree,
>
> Thank you for the quick response.
> I used the trained data downloaded from
> https://github.com/tesseract-ocr/tessdata,
> https://github.com/tesseract-ocr/tessdata_best and
> https://github.com/tesseract-ocr/tessdata_fast.
>
> I ran the following commands for each of these datasets and changed psm from 1
> to 13, but the output is more or less like the one I posted. I couldn't get
> the output you posted, which has the data in the right order for the
> context.
>
> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>
> I am not sure what I am doing wrong here and would appreciate your help.
>
> Regards,
> Giriraj
>
> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>
>> Which eng.traineddata did you use?
>>
>> There are three options:
>> from tessdata, tessdata_best and tessdata_fast.
>>
>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>>
>>> Hello Shree,
>>>
>>> I realize this post is more than two years old now, but I would appreciate
>>> any help.
>>> I tried your suggestion on the same attached sample using tesseract v4
>>> and I am unable to get the result you posted.
>>> I have tried all page segmentation modes, but none of them produced the
>>> result you posted.
>>> Could you please let me know what I might be doing wrong?
>>>
>>> Here are the version details for tesseract on my machine:
>>>
>>> tesseract 4.0.0
>>> leptonica-1.77.0
>>> libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>> Found AVX2
>>> Found AVX
>>> Found SSE
>>>
>>> Here is the output I get for most of the psm modes:
>>>
>>>
>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>
>>> Did you know? Did you know?
>>>
>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>> service gives you access to millions Receive text message reminders when your
>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>
>>> more at business.comcast.conm/wifi.
>>>
>>> Your bill is ready
>>>
>>>
>>>
>>> Need help? We’re here for you.
>>>
>>>
>>>
>>> > Visit business.comcast.com/help Please notify us immediately with any
>>> Call 1-800-391-3000 questions regarding charges billed to your
>>> aa account. Comcast will issue a credit or
>>> Billing support refund for any verified billing error which is
>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>>> and 7 am-8 pm Sat of the bill.
>>>
>>> Technical support
>>> Open 24 hours, 7 days a week
>>>
>>> TT
>>>
>>> Automatic payment If you’re moving, give us as much
>>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>>
>>> Se Online can help make a smooth transition.
>>> Visit business.comcast.com/myaccount
>>>
>>> a By phone
>>> Call 1-800-391-3000
>>>
>>> Call 1-800-391-3000
>>>
>>> IME
>>>
>>>
>>>
>>>
>>> Regards,
>>> Giriraj.
>>>
>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>
>>>> If you want to OCR an invoice like the sample you posted, just use the
>>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>>
>>>> Here is the output I get
>>>>
>>>>
>>>>
>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>
>>>>
>>>> Did you know?
>>>>
>>>>
>>>> Your Comcast Business Internet
>>>>
>>>> service gives you access to millions
>>>>
>>>> of WiFi hotspots with the fastest WiFi
>>>>
>>>> and even more coverage. Find out
>>>>
>>>> more at businesscomcast.com/wifi.
>>>>
>>>>
>>>>
>>>> Need help? We’re here for you.
>>>>
>>>>
>>>> 9 Visit business.comcast.com/help
>>>>
>>>> Call 1-800—391 -3000
>>>>
>>>> A
>>>>
>>>>
>>>> Billing support
>>>>
>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>
>>>> and 7 am—8 pm Sat
>>>>
>>>>
>>>> Technical support
>>>>
>>>> Open 24 hours, 7 days a week
>>>>
>>>>
>>>>
>>>> Did you know?
>>>>
>>>>
>>>> Never miss a payment with text alerts.
>>>>
>>>> Receive text message reminders when your
>>>>
>>>> bill is ready to pay or past due. Sign up at
>>>>
>>>> business.comcast.com/myaccount.
>>>>
>>>>
>>>>
>>>> Your bill is ready
>>>>
>>>>
>>>>
>>>>
>>>> Please notify us immediately with any
>>>>
>>>> questions regarding charges billed to your
>>>>
>>>> account. Comcast will issue a credit or
>>>>
>>>> refund for any verified billing error which is
>>>>
>>>> brought to our attention within sixty (60) days
>>>>
>>>> of the bill.
>>>>
>>>>
>>>> llllllllllllllllllllllllllllllllll
>>>>
>>>>
>>>> Additional payment options Moving? Let us help.
>>>>
>>>>
>>>> Automatic payment
>>>>
>>>> Sign up at business.comcast.com/myaccount
>>>>
>>>>
>>>> a Oniine
>>>>
>>>>
>>>> Visit business.comcast.com/myaccount
>>>>
>>>>
>>>> a By phone
>>>>
>>>> Call 1-800-391 -3000
>>>>
>>>>
>>>> if you're moving, give us as much
>>>>
>>>> advanced notice as possible so we
>>>>
>>>> can help make a smooth transition.
>>>>
>>>>
>>>> Call 1 -800-391 -3000
>>>>
>>>>
>>>> |||||||llllllllllllllllllllllllll
>>>>
>>>>
>>>>
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am surprised by how many people tell me that tesseract is the best
>>>>> open-source OCR tool, yet there is no video explaining step by step the
>>>>> problems that you can encounter, nor a good explanation and documentation
>>>>> for OCR.
>>>>>
>>>>> Even so, everyone loves challenges!
>>>>> So here's the challenge I faced. I have many PDF files that are invoices
>>>>> and I want to train tesseract to be able to OCR them as scanned images.
>>>>> So first of all, I converted these PDF files into tif files using:
>>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>>> This is ImageMagick. Nothing important here other than that we have a 300
>>>>> dpi image with the alpha channel off.
>>>>>
>>>>> You must rename the .tif files to
>>>>> [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
>>>>>
>>>>> Great! After this step you must create your box files, right? So I
>>>>> simply called:
>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>>>>>
>>>>> Then I fixed my box files with CowBoxEditor, which did the job, as I
>>>>> couldn't find the famous jTessBoxEditor online (weird, right?).
>>>>>
>>>>> After that, I created my .tr files:
>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>
>>>>> And here come the surprises!
>>>>> Once you have your .tr files, you call unicharset_extractor.
>>>>> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0?
>>>>> That is wrong according to the documentation:
>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>> Second question: Should I process one box file, then the other, or combine
>>>>> them?
>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>> Third question: Why should I use set_unicharset_properties? It
>>>>> doesn't fix the metrics; it only specifies whether the script is Latin or Common!
>>>>> Link: https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>
>>>>> After all these unanswered questions, I ran mftraining and cntraining
>>>>> (no problems). Finally, I renamed my inttemp, normproto,
>>>>> pffmtable and shapetable files and combined them using combine_tessdata com.
>>>>>
>>>>> Final question: If I name them com.inttemp1 and com.inttemp2, does it work?
>>>>> Same for shapetable, normproto and pffmtable.
>>>>>
>>>>> I think these questions are asked more than once by all new users of
>>>>> tesseract. If any tesseract expert can answer these questions, it
>>>>> will be a great help for the whole community.
>>>>> Kindly find attached the 2 tif files and the generated boxes.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
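The 3.0x training sequence described in the original post can be summarised as a script. This is a rough sketch under stated assumptions, not a verified procedure: the language code (com), font name (test_font) and the exp0/exp1 pages come from the post, while the font_properties file, the ../langdata path, the unicharset_props file name, the stage arguments and the EXECUTE switch are hypothetical and introduced only for illustration. It follows the 3.0x training documentation in passing all box files to a single unicharset_extractor call (Option 2 in the post) and all .tr files to mftraining and cntraining. With no argument it only prints the commands.

```shell
#!/bin/sh
# Sketch of the Tesseract 3.0x training steps from the original post.
# Assumptions (not from the thread): a font_properties file exists in the
# current directory (e.g. a line "test_font 0 0 0 0 0"), and ../langdata
# holds the langdata script files used by set_unicharset_properties.

LANG_CODE="com"
FONT="test_font"
PAGES="0 1"      # exp0 and exp1
EXECUTE=0        # hypothetical switch: 0 prints commands, 1 runs them

step() {
    if [ "$EXECUTE" = "1" ]; then "$@"; else echo "$*"; fi
}

# Stage 1: generate box files, which are then corrected by hand in a box editor.
make_boxes() {
    for n in $PAGES; do
        base="$LANG_CODE.$FONT.exp$n"
        step tesseract "$base.tif" "$base" batch.nochop makebox
    done
}

# Stage 2: run after the box files have been corrected.
train() {
    boxes=""
    trs=""
    for n in $PAGES; do
        base="$LANG_CODE.$FONT.exp$n"
        step tesseract "$base.tif" "$base" nobatch box.train
        boxes="$boxes $base.box"
        trs="$trs $base.tr"
    done
    # All box files in one call, all .tr files in one call.
    step unicharset_extractor $boxes
    # unicharset_props is an illustrative output name.
    step set_unicharset_properties -U unicharset -O unicharset_props --script_dir=../langdata
    step mftraining -F font_properties -U unicharset_props -O "$LANG_CODE.unicharset" $trs
    step cntraining $trs
    # Prefix each component with the language code before combining.
    for f in inttemp normproto pffmtable shapetable; do
        step mv "$f" "$LANG_CODE.$f"
    done
    step combine_tessdata "$LANG_CODE."
}

case "${1:-print}" in
    boxes) EXECUTE=1; make_boxes ;;
    train) EXECUTE=1; train ;;
    *)     make_boxes; train ;;
esac
```

The two stages are kept separate because the box files need manual correction in between. On the final question in the post: as far as I know, combine_tessdata looks for the language prefix followed by fixed component names (com.inttemp, com.normproto and so on), so files named com.inttemp1 or com.inttemp2 would probably not be picked up.
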

