I did not post the command that I used; it was probably run with the default psm and the code as of April 2017. If you really want to investigate, check out the master branch commit from that time and test with it.
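A rough sketch of how such a comparison could be scripted, so that the same image is run through every page segmentation mode with whichever tesseract binary is under test. This is not the command used in 2017. The helper names `build_command` and `run_all_modes`, the output naming, and the paths in the usage note are all hypothetical. The argument order mirrors the commands quoted further down in this thread. Older 3.0x documentation wrote the option as `-psm`, so the flag is left as a parameter.

```python
import subprocess
from pathlib import Path


def build_command(binary, image, out_base, psm, psm_flag="--psm", tessdata_dir=None):
    """Return the argv list for one tesseract run (nothing is executed here)."""
    cmd = [str(binary)]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific traineddata directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += [str(image), str(out_base), psm_flag, str(psm)]
    return cmd


def run_all_modes(binary, image, out_dir, psms=range(1, 14), **kwargs):
    """Run one OCR pass per psm value; tesseract writes <out_dir>/psm<N>.txt.

    Returns a dict mapping each psm value to the process return code,
    so failed modes are easy to spot before comparing the text files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for psm in psms:
        out_base = out_dir / f"psm{psm}"
        cmd = build_command(binary, image, out_base, psm, **kwargs)
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[psm] = proc.returncode
    return results
```

Calling `run_all_modes("/path/to/old/tesseract", "sample.tif", "out_old")` and then the same with a current build and a different output directory would leave two sets of text files that can be diffed mode by mode.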
In theory tesseract 4 should recognize two columns with the default psm. But there seem to be some issues with layout analysis. You could try other means of selecting text regions and using tesseract on those.

On Sat, 27 Apr 2019, 02:57 Giriraj Bhojak, <[email protected]> wrote:

> Hi Shree,
>
> I just tried the v3.05.02 as well for different modes and I still couldn't produce the output as you posted with the image file.
> I am wondering if I am doing anything wrong.
> Here is the command I have run for the v3.05.02 tesseract and changed psm mode from 1 to 13:
>
> */usr/local/Cellar/tesseract/3.05.02/bin/tesseract --tessdata-dir /usr/local/Cellar/tesseract/3.05.02/share/ "sample.tif" test --psm 3*
>
> It still produced the same output as earlier.
> Please let me know what I might be doing incorrectly here.
> Once again, thank you for your prompt responses.
>
> Regards,
> Giriraj.
>
> On Friday, April 26, 2019 at 1:42:17 PM UTC-4, shree wrote:
>>
>> @zdenko Please check this image (from the first post) with 3.0x and current 4.0x code to see if there is a regression in terms of recognition of 2 columns.
>>
>> On Fri, Apr 26, 2019 at 10:25 PM Giriraj Bhojak <[email protected]> wrote:
>>
>>> Thank you, I will try it out next.
>>> I wanted to use version 4 of tesseract since it uses LSTM based OCR engine. Higher accuracy is one of the essential requirements for my usecase.
>>> Would you know if v4 supports extracting text from a two column text structure image file at all?
>>> Thank you for your quick response Shree!
>>>
>>> Regards,
>>> Giriraj.
>>>
>>> On Friday, April 26, 2019 at 12:35:05 PM UTC-4, shree wrote:
>>>>
>>>> April 2017 - It is probably the 3.0x version. Try the 3.05 branch.
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
>>>> 3.05.01 Release
>>>> <https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01>
>>>> [image: @zdenop] <https://github.com/zdenop> zdenop
>>>> <https://github.com/zdenop> released this on Jun 1, 2017 · 26 commits
>>>> <https://github.com/tesseract-ocr/tesseract/compare/3.05.01...3.05> to
>>>> 3.05 since this release
>>>>
>>>> On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <[email protected]> wrote:
>>>>
>>>>> Hi Shree,
>>>>>
>>>>> Thank you for quick response.
>>>>> I used the trained data by downloading the datasets at
>>>>> https://github.com/tesseract-ocr/tessdata,
>>>>> https://github.com/tesseract-ocr/tessdata_best and
>>>>> https://github.com/tesseract-ocr/tessdata_fast.
>>>>>
>>>>> I ran following commands for each of these datasets and changed psm from 1 to 13 , but more or less the output is like the one I posted.
>>>>> Couldn't get the output as you have posted that has data in the right order of the context.
>>>>>
>>>>> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
>>>>> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
>>>>> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>>>>>
>>>>> Not sure what I am doing wrong here, appreciate your help with this.
>>>>>
>>>>> Regards,
>>>>> Giriraj
>>>>>
>>>>> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>>>>>
>>>>>> Which eng.traineddata did you use?
>>>>>>
>>>>>> There are three options
>>>>>> From tessdata, tessdata_best and tessdata_fast.
>>>>>>
>>>>>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>>>>>>
>>>>>>> Hello Shree,
>>>>>>>
>>>>>>> I realize this post is more than two years old now, but would appreciate any help.
>>>>>>> I tried your suggestion on the same attached sample using tesseract v4 and I am unable to get the result as you have posted.
>>>>>>> I have tried all page segmentation modes, but none of them produced the result you have posted.
>>>>>>> Could you please let me know what I might be doing wrong?
>>>>>>>
>>>>>>> Here is the version detail for the tesseract on my machine:
>>>>>>>
>>>>>>> tesseract 4.0.0
>>>>>>> leptonica-1.77.0
>>>>>>> libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>>>>> Found AVX2
>>>>>>> Found AVX
>>>>>>> Found SSE
>>>>>>>
>>>>>>> Here is the output I get for most of the psm modes:
>>>>>>>
>>>>>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>>>>>
>>>>>>> Did you know? Did you know?
>>>>>>>
>>>>>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>>>>>> service gives you access to millions Receive text message reminders when your
>>>>>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>>>>>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>>>>>
>>>>>>> more at business.comcast.conm/wifi.
>>>>>>>
>>>>>>> Your bill is ready
>>>>>>>
>>>>>>> Need help? We’re here for you.
>>>>>>>
>>>>>>> > Visit business.comcast.com/help Please notify us immediately with any
>>>>>>> Call 1-800-391-3000 questions regarding charges billed to your
>>>>>>> aa account. Comcast will issue a credit or
>>>>>>> Billing support refund for any verified billing error which is
>>>>>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>>>>>>> and 7 am-8 pm Sat of the bill.
>>>>>>>
>>>>>>> Technical support
>>>>>>> Open 24 hours, 7 days a week
>>>>>>>
>>>>>>> TT
>>>>>>>
>>>>>>> Automatic payment If you’re moving, give us as much
>>>>>>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>>>>>>
>>>>>>> Se Online can help make a smooth transition.
>>>>>>> Visit business.comcast.com/myaccount
>>>>>>>
>>>>>>> a By phone
>>>>>>> Call 1-800-391-3000
>>>>>>>
>>>>>>> Call 1-800-391-3000
>>>>>>>
>>>>>>> IME
>>>>>>>
>>>>>>> Regards,
>>>>>>> Giriraj.
>>>>>>>
>>>>>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>>>>>
>>>>>>>> If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training.
>>>>>>>>
>>>>>>>> Here is the output I get
>>>>>>>>
>>>>>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>>>>>
>>>>>>>> Did you know?
>>>>>>>>
>>>>>>>> Your Comcast Business Internet
>>>>>>>> service gives you access to millions
>>>>>>>> of WiFi hotspots with the fastest WiFi
>>>>>>>> and even more coverage. Find out
>>>>>>>> more at businesscomcast.com/wifi.
>>>>>>>>
>>>>>>>> Need help? We’re here for you.
>>>>>>>>
>>>>>>>> 9 Visit business.comcast.com/help
>>>>>>>> Call 1-800—391 -3000
>>>>>>>> A
>>>>>>>>
>>>>>>>> Billing support
>>>>>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>>>>> and 7 am—8 pm Sat
>>>>>>>>
>>>>>>>> Technical support
>>>>>>>> Open 24 hours, 7 days a week
>>>>>>>>
>>>>>>>> Did you know?
>>>>>>>>
>>>>>>>> Never miss a payment with text alerts.
>>>>>>>> Receive text message reminders when your
>>>>>>>> bill is ready to pay or past due. Sign up at
>>>>>>>> business.comcast.com/myaccount.
>>>>>>>>
>>>>>>>> Your bill is ready
>>>>>>>>
>>>>>>>> Please notify us immediately with any
>>>>>>>> questions regarding charges billed to your
>>>>>>>> account. Comcast will issue a credit or
>>>>>>>> refund for any verified billing error which is
>>>>>>>> brought to our attention within sixty (60) days
>>>>>>>> of the bill.
>>>>>>>>
>>>>>>>> llllllllllllllllllllllllllllllllll
>>>>>>>>
>>>>>>>> Additional payment options Moving? Let us help.
>>>>>>>>
>>>>>>>> Automatic payment
>>>>>>>> Sign up at business.comcast.com/myaccount
>>>>>>>>
>>>>>>>> a Oniine
>>>>>>>>
>>>>>>>> Visit business.comcast.com/myaccount
>>>>>>>>
>>>>>>>> a By phone
>>>>>>>> Call 1-800-391 -3000
>>>>>>>>
>>>>>>>> if you're moving, give us as much
>>>>>>>> advanced notice as possible so we
>>>>>>>> can help make a smooth transition.
>>>>>>>>
>>>>>>>> Call 1 -800-391 -3000
>>>>>>>>
>>>>>>>> |||||||llllllllllllllllllllllllll
>>>>>>>>
>>>>>>>> ShreeDevi
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
>>>>>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I am surprised by how many people tell me that tesseract is the best open-source OCR tool but yet there is no video explaining step-by-step the problems that you can encounter, or a good explanation and documentation for OCR.
>>>>>>>>>
>>>>>>>>> Well even though, everyone loves challenges! So here's the challenge I faced. I brought many pdf files that are invoices and I want to train tesseract to be able to ocr them as scanned images.
>>>>>>>>> So first of all, I transformed these pdf files into tif files using:
>>>>>>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>>>>>>> This is ImageMagick. Nothing important here other than we have a 300 dpi image with an alpha channel off.
>>>>>>>>>
>>>>>>>>> You must rename them so : rename .tif files to: [lang].[name_font].exp0.tif (com.test_font.exp0.tif) This is for my example
>>>>>>>>>
>>>>>>>>> Great! After this step you must create your box file right? So I simply called:
>>>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>>>>>>>>>
>>>>>>>>> Then I fixed my files with CowBoxEditor as I wasn't finding the famous jTessBoxEditor online (weird right?) which did the job.
>>>>>>>>>
>>>>>>>>> After that, I created my .tr files:
>>>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>>>>>
>>>>>>>>> And here comes the surprises!!!
>>>>>>>>> After having your .tr files you call unicharset_extractor.
>>>>>>>>> First question: Why the glyph metrics are all 0,255,0,255,0,0,0,0,0,0? Which is wrong according to the documentation:
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>>>>>> Second question: Should I write a box file, then the other or combine them?
>>>>>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>>>>>> or
>>>>>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>>>>>> Third question: set_unicharset_extractor why should I use it? It doesn't fix the metrics only specify if Latin or Common! Link:
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>>>>>
>>>>>>>>> After all these unanswered questions, I used mftraining and cntraining (no problems). Finally, I renamed my inttemp, normproto, pffmtable, shapetable and I combined them using combine_tessdata com.
>>>>>>>>>
>>>>>>>>> Final question: If I named com.inttemp1 com.inttemp2 does it work? Same for shapetable, normproto, pffmtable
>>>>>>>>>
>>>>>>>>> I think these questions are asked more than once by all new users to tesseract. Please if any expert in tesseract can answer these questions it will be a great help for all the community.
>>>>>>>>> Kindly find the attached 2 tif files and the boxes generated.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.

