Thank you, I will try it out next. I wanted to use version 4 of Tesseract because it uses an LSTM-based OCR engine, and high accuracy is an essential requirement for my use case. Do you know whether v4 supports extracting text from an image with a two-column layout at all? Thank you for your quick response, Shree!
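For anyone repeating the comparison described in the quoted messages below, here is a minimal Python sketch (not something posted in the thread) that builds one tesseract command per tessdata directory and `--psm` value. The directory names and the psm range 1 to 13 come from the quoted commands; the helper names `build_commands` and `run_all`, the output naming scheme, and the argument order (image and output base first) are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

# Directory names as they appear in the quoted commands; adjust to your checkout.
TESSDATA_DIRS = ["tessdata_best-master", "tessdata_fast-master", "tessdata-master"]
# The thread reports trying --psm 1 through 13.
PSM_VALUES = range(1, 14)


def build_commands(image, tessdata_dirs=TESSDATA_DIRS, psm_values=PSM_VALUES):
    """Return one tesseract argv list per (tessdata dir, psm) pair."""
    stem = Path(image).stem
    commands = []
    for data_dir in tessdata_dirs:
        for psm in psm_values:
            # Hypothetical naming scheme so that runs do not overwrite each other.
            out_base = f"{stem}_{data_dir}_psm{psm}"
            commands.append([
                "tesseract", image, out_base,
                "--tessdata-dir", data_dir,
                "--psm", str(psm),
            ])
    return commands


def run_all(image):
    """Run every command in turn; needs the tesseract binary on PATH."""
    for cmd in build_commands(image):
        # check=False: a mode may fail on a given image, so keep going.
        subprocess.run(cmd, check=False)
```

Calling `run_all("sample.tif")` would launch all 39 runs and write one text file per combination, which should make it easier to compare side by side which mode, if any, keeps the two columns in reading order.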
Regards,
Giriraj.

On Friday, April 26, 2019 at 12:35:05 PM UTC-4, shree wrote:
>
> April 2017 - It is probably the 3.0x version. Try the 3.05 branch.
>
> https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
> 3.05.01 Release: zdenop released this on Jun 1, 2017 · 26 commits to
> 3.05 since this release
>
> On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <[email protected]> wrote:
>
>> Hi Shree,
>>
>> Thank you for the quick response.
>> I used the trained data by downloading the datasets at
>> https://github.com/tesseract-ocr/tessdata,
>> https://github.com/tesseract-ocr/tessdata_best and
>> https://github.com/tesseract-ocr/tessdata_fast.
>>
>> I ran the following commands for each of these datasets and changed psm
>> from 1 to 13, but the output is more or less like the one I posted. I
>> couldn't get output like the one you posted, which has the data in the
>> right order of the context.
>>
>> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
>> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
>> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>>
>> Not sure what I am doing wrong here, appreciate your help with this.
>>
>> Regards,
>> Giriraj
>>
>> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>>
>>> Which eng.traineddata did you use?
>>>
>>> There are three options
>>> From tessdata, tessdata_best and tessdata_fast.
>>>
>>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>>>
>>>> Hello Shree,
>>>>
>>>> I realize this post is more than two years old now, but would
>>>> appreciate any help.
>>>> I tried your suggestion on the same attached sample using tesseract v4
>>>> and I am unable to get the result as you have posted.
>>>> I have tried all page segmentation modes, but none of them produced the
>>>> result you have posted.
>>>> Could you please let me know what I might be doing wrong?
>>>>
>>>> Here is the version detail for the tesseract on my machine:
>>>>
>>>> tesseract 4.0.0
>>>> leptonica-1.77.0
>>>> libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>> Found AVX2
>>>> Found AVX
>>>> Found SSE
>>>>
>>>> Here is the output I get for most of the psm modes:
>>>>
>>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>>
>>>> Did you know? Did you know?
>>>>
>>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>>> service gives you access to millions Receive text message reminders when your
>>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>>
>>>> more at business.comcast.conm/wifi.
>>>>
>>>> Your bill is ready
>>>>
>>>> Need help? We’re here for you.
>>>>
>>>> > Visit business.comcast.com/help Please notify us immediately with any
>>>> Call 1-800-391-3000 questions regarding charges billed to your
>>>> aa account. Comcast will issue a credit or
>>>> Billing support refund for any verified billing error which is
>>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>>>> and 7 am-8 pm Sat of the bill.
>>>>
>>>> Technical support
>>>> Open 24 hours, 7 days a week
>>>>
>>>> TT
>>>>
>>>> Automatic payment If you’re moving, give us as much
>>>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>>>
>>>> Se Online can help make a smooth transition.
>>>> Visit business.comcast.com/myaccount
>>>>
>>>> a By phone
>>>> Call 1-800-391-3000
>>>>
>>>> Call 1-800-391-3000
>>>>
>>>> IME
>>>>
>>>> Regards,
>>>> Giriraj.
>>>>
>>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>>
>>>>> If you want to OCR an invoice like the sample you posted, just use the
>>>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>>>
>>>>> Here is the output I get
>>>>>
>>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>>
>>>>> Did you know?
>>>>>
>>>>> Your Comcast Business Internet
>>>>> service gives you access to millions
>>>>> of WiFi hotspots with the fastest WiFi
>>>>> and even more coverage. Find out
>>>>> more at businesscomcast.com/wifi.
>>>>>
>>>>> Need help? We’re here for you.
>>>>>
>>>>> 9 Visit business.comcast.com/help
>>>>> Call 1-800—391 -3000
>>>>> A
>>>>>
>>>>> Billing support
>>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>> and 7 am—8 pm Sat
>>>>>
>>>>> Technical support
>>>>> Open 24 hours, 7 days a week
>>>>>
>>>>> Did you know?
>>>>>
>>>>> Never miss a payment with text alerts.
>>>>> Receive text message reminders when your
>>>>> bill is ready to pay or past due. Sign up at
>>>>> business.comcast.com/myaccount.
>>>>>
>>>>> Your bill is ready
>>>>>
>>>>> Please notify us immediately with any
>>>>> questions regarding charges billed to your
>>>>> account. Comcast will issue a credit or
>>>>> refund for any verified billing error which is
>>>>> brought to our attention within sixty (60) days
>>>>> of the bill.
>>>>>
>>>>> llllllllllllllllllllllllllllllllll
>>>>>
>>>>> Additional payment options Moving? Let us help.
>>>>>
>>>>> Automatic payment
>>>>> Sign up at business.comcast.com/myaccount
>>>>>
>>>>> a Oniine
>>>>>
>>>>> Visit business.comcast.com/myaccount
>>>>>
>>>>> a By phone
>>>>> Call 1-800-391 -3000
>>>>>
>>>>> if you're moving, give us as much
>>>>> advanced notice as possible so we
>>>>> can help make a smooth transition.
>>>>>
>>>>> Call 1 -800-391 -3000
>>>>>
>>>>> |||||||llllllllllllllllllllllllll
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am surprised by how many people tell me that tesseract is the best
>>>>>> open-source OCR tool, and yet there is no video explaining step by
>>>>>> step the problems that you can encounter, nor a good explanation and
>>>>>> documentation for OCR.
>>>>>>
>>>>>> Even so, everyone loves challenges! So here's the challenge I faced.
>>>>>> I have many pdf files that are invoices, and I want to train
>>>>>> tesseract to be able to OCR them as scanned images.
>>>>>> So first of all, I transformed these pdf files into tif files using:
>>>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>>>> This is ImageMagick. Nothing important here other than that we have a
>>>>>> 300 dpi image with the alpha channel off.
>>>>>>
>>>>>> You must rename the .tif files to [lang].[name_font].exp0.tif
>>>>>> (com.test_font.exp0.tif in my example).
>>>>>>
>>>>>> Great! After this step you must create your box file, right? So I
>>>>>> simply called:
>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>>>>>>
>>>>>> Then I fixed my files with CowBoxEditor, which did the job, as I
>>>>>> wasn't finding the famous jTessBoxEditor online (weird, right?).
>>>>>>
>>>>>> After that, I created my .tr files:
>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>>
>>>>>> And here come the surprises!
>>>>>> After having your .tr files you call unicharset_extractor.
>>>>>> First question: Why are the glyph metrics all
>>>>>> 0,255,0,255,0,0,0,0,0,0? That is wrong according to the documentation:
>>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>>> Second question: Should I give one box file, then the other, or
>>>>>> combine them?
>>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>>> Third question: Why should I use set_unicharset_extractor? It
>>>>>> doesn't fix the metrics; it only specifies whether the script is
>>>>>> Latin or Common. Link:
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>>
>>>>>> After all these unanswered questions, I used mftraining and
>>>>>> cntraining (no problems). Finally, I renamed my inttemp, normproto,
>>>>>> pffmtable, shapetable and I combined them using combine_tessdata com.
>>>>>>
>>>>>> Final question: If I named com.inttemp1 com.inttemp2 does it work?
>>>>>> Same for shapetable, normproto, pffmtable
>>>>>>
>>>>>> I think these questions are asked more than once by all new users of
>>>>>> tesseract. If any expert in tesseract can answer these questions,
>>>>>> it will be a great help for the whole community.
>>>>>> Kindly find the attached 2 tif files and the boxes generated.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.

