Thank you, I will try it out next.
I wanted to use version 4 of Tesseract because it uses an LSTM-based OCR engine. 
Higher accuracy is one of the essential requirements for my use case.
Do you know whether v4 supports extracting text from an image with a 
two-column text layout at all?
Thank you for your quick response, Shree!

Regards,
Giriraj.
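
On the two-column question, one possible workaround is to take per-word positions (for example from Tesseract's `tsv` output config) and regroup the words by column before joining them. The sketch below is only an illustration under stated assumptions: the `(left, top, text)` tuples, the `split_x` column boundary and the `line_tol` tolerance are hypothetical inputs chosen for this example, not something confirmed in this thread.

```python
from typing import List, Tuple

# Hypothetical word record: (left, top, text), coordinates in pixels.
Word = Tuple[int, int, str]


def read_two_columns(words: List[Word], split_x: int, line_tol: int = 10) -> str:
    """Return the text of the left column first, then the right column.

    split_x is an assumed x coordinate separating the two columns;
    line_tol is the vertical distance within which words share a line.
    """
    columns = ([], [])
    for left, top, text in words:
        columns[0 if left < split_x else 1].append((left, top, text))

    out_lines = []
    for col in columns:
        # Sort top-to-bottom, then left-to-right within the same top value.
        col.sort(key=lambda w: (w[1], w[0]))
        current, current_top = [], None
        for left, top, text in col:
            # Start a new line once a word sits clearly below the current line.
            if current_top is not None and top - current_top > line_tol:
                out_lines.append(" ".join(t for _, t in sorted(current)))
                current, current_top = [], None
            if current_top is None:
                current_top = top
            current.append((left, text))
        if current:
            out_lines.append(" ".join(t for _, t in sorted(current)))
    return "\n".join(out_lines)
```

A fixed `split_x` only suits pages whose column gap is known in advance; a real page would need the boundary estimated from the word positions.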

On Friday, April 26, 2019 at 12:35:05 PM UTC-4, shree wrote:
>
> April 2017 - It is probably the 3.0x version. Try the 3.05 branch.
>
> https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01 
> 3.05.01 Release (released by zdenop on Jun 1, 2017)
>
> On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <[email protected]> wrote:
>
>> Hi Shree,
>>
>> Thank you for the quick response.
>> I used the trained data by downloading the datasets at 
>> https://github.com/tesseract-ocr/tessdata, 
>> https://github.com/tesseract-ocr/tessdata_best and 
>> https://github.com/tesseract-ocr/tessdata_fast.
>>
>> I ran the following commands for each of these datasets, varying --psm from 
>> 1 to 13, but the output is more or less the same as what I posted. I could 
>> not get output like yours, with the text in the correct reading order.
>>
>> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
>> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
>> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
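
The runs described above (three tessdata directories, page segmentation modes 1 to 13) can be generated rather than typed by hand. This is a minimal sketch that only builds the argument lists and does not run them; the output base naming scheme is a hypothetical choice made here so that the results of different runs do not overwrite each other.

```python
from itertools import product
from typing import List, Sequence


def build_commands(image: str, outputbase: str,
                   tessdata_dirs: Sequence[str],
                   psm_modes: Sequence[int]) -> List[List[str]]:
    """Build one tesseract argument list per (tessdata dir, psm) pair."""
    commands = []
    for tessdata_dir, psm in product(tessdata_dirs, psm_modes):
        commands.append([
            "tesseract", "--tessdata-dir", tessdata_dir,
            image,
            # Hypothetical naming: one output file per combination.
            f"{outputbase}_{tessdata_dir}_psm{psm}",
            "--psm", str(psm),
        ])
    return commands


if __name__ == "__main__":
    dirs = ["tessdata_best-master", "tessdata_fast-master", "tessdata-master"]
    for cmd in build_commands("sample.tif", "sample", dirs, range(1, 14)):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.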
>>
>> I am not sure what I am doing wrong here and would appreciate your help.
>>
>> Regards,
>> Giriraj
>>
>> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>>
>>> Which eng.traineddata did you use?
>>>
>>> There are three options: tessdata, tessdata_best and tessdata_fast.
>>>
>>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>>>
>>>> Hello Shree,
>>>>
>>>> I realize this post is more than two years old now, but would 
>>>> appreciate any help.
>>>> I tried your suggestion on the same attached sample using tesseract v4 
>>>> and I am unable to reproduce the result you posted.
>>>> I have tried all page segmentation modes, but none of them produced it.
>>>> Could you please let me know what I might be doing wrong?
>>>>
>>>> Here are the version details for tesseract on my machine:
>>>>
>>>> tesseract 4.0.0
>>>>  leptonica-1.77.0
>>>>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>>  Found AVX2
>>>>  Found AVX
>>>>  Found SSE
>>>>
>>>> Here is the output I get for most of the psm modes:
>>>>
>>>>
>>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>>
>>>> Did you know? Did you know?
>>>>
>>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>>> service gives you access to millions Receive text message reminders when your
>>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>>
>>>> more at business.comcast.conm/wifi.
>>>>
>>>> Your bill is ready
>>>>
>>>>    
>>>>
>>>> Need help? We’re here for you.
>>>>
>>>>  
>>>>
>>>> > Visit business.comcast.com/help Please notify us immediately with any
>>>> Call 1-800-391-3000 questions regarding charges billed to your
>>>> aa account. Comcast will issue a credit or
>>>> Billing support refund for any verified billing error which is
>>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>>>> and 7 am-8 pm Sat of the bill.
>>>>
>>>> Technical support
>>>> Open 24 hours, 7 days a week
>>>>
>>>> TT
>>>>
>>>> Automatic payment If you’re moving, give us as much
>>>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>>>
>>>> Se Online can help make a smooth transition.
>>>> Visit business.comcast.com/myaccount
>>>>
>>>> a By phone
>>>> Call 1-800-391-3000
>>>>
>>>> Call 1-800-391-3000
>>>>
>>>> IME
>>>>
>>>>  
>>>>
>>>>  
>>>>
>>>> Regards,
>>>> Giriraj.
>>>>
>>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>>
>>>>> If you want to OCR an invoice like the sample you posted, just use the 
>>>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>>>
>>>>> Here is the output I get 
>>>>>
>>>>>
>>>>>
>>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>>
>>>>>
>>>>> Did you know?
>>>>>
>>>>>
>>>>> Your Comcast Business Internet
>>>>>
>>>>> service gives you access to millions
>>>>>
>>>>> of WiFi hotspots with the fastest WiFi
>>>>>
>>>>> and even more coverage. Find out
>>>>>
>>>>> more at businesscomcast.com/wifi.
>>>>>
>>>>>
>>>>>
>>>>> Need help? We’re here for you.
>>>>>
>>>>>
>>>>> 9 Visit business.comcast.com/help
>>>>>
>>>>> Call 1-800—391 -3000
>>>>>
>>>>> A
>>>>>
>>>>>
>>>>> Billing support
>>>>>
>>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>>
>>>>> and 7 am—8 pm Sat
>>>>>
>>>>>
>>>>> Technical support
>>>>>
>>>>> Open 24 hours, 7 days a week
>>>>>
>>>>>
>>>>>
>>>>> Did you know?
>>>>>
>>>>>
>>>>> Never miss a payment with text alerts.
>>>>>
>>>>> Receive text message reminders when your
>>>>>
>>>>> bill is ready to pay or past due. Sign up at
>>>>>
>>>>> business.comcast.com/myaccount.
>>>>>
>>>>>
>>>>>
>>>>> Your bill is ready
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Please notify us immediately with any
>>>>>
>>>>> questions regarding charges billed to your
>>>>>
>>>>> account. Comcast will issue a credit or
>>>>>
>>>>> refund for any verified billing error which is
>>>>>
>>>>> brought to our attention within sixty (60) days
>>>>>
>>>>> of the bill.
>>>>>
>>>>>
>>>>> llllllllllllllllllllllllllllllllll
>>>>>
>>>>>
>>>>> Additional payment options Moving? Let us help.
>>>>>
>>>>>
>>>>> Automatic payment
>>>>>
>>>>> Sign up at business.comcast.com/myaccount
>>>>>
>>>>>
>>>>> a Oniine
>>>>>
>>>>>
>>>>> Visit business.comcast.com/myaccount
>>>>>
>>>>>
>>>>> a By phone
>>>>>
>>>>> Call 1-800-391 -3000
>>>>>
>>>>>
>>>>> if you're moving, give us as much
>>>>>
>>>>> advanced notice as possible so we
>>>>>
>>>>> can help make a smooth transition.
>>>>>
>>>>>
>>>>> Call 1 -800-391 -3000
>>>>>
>>>>>
>>>>> |||||||llllllllllllllllllllllllll
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am surprised by how many people tell me that tesseract is the best 
>>>>>> open-source OCR tool, yet there is no video explaining step by step the 
>>>>>> problems you can encounter, nor a good explanation and documentation 
>>>>>> for OCR.
>>>>>>
>>>>>> Even so, everyone loves a challenge! So here is the challenge I faced. 
>>>>>> I have many PDF invoices and I want to train tesseract to OCR them as 
>>>>>> scanned images. 
>>>>>> First, I converted these PDF files to TIFF files using:
>>>>>> magick -density 300 -depth 4   2151.pdf -background white -fill white -alpha Off  2151%d.tif
>>>>>> This is ImageMagick. The only important points are that the image is 
>>>>>> 300 dpi and the alpha channel is off.
>>>>>>
>>>>>> You must then rename the .tif files to 
>>>>>> [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
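
The renaming step above can be scripted. The following sketch only plans the renames for the `[lang].[name_font].exp<N>.tif` pattern described in the message; the function names are hypothetical, and the numbering simply follows the order of the input list.

```python
from typing import List, Sequence, Tuple


def training_basename(lang: str, font: str, exp: int) -> str:
    """Return the [lang].[fontname].exp<N> base name described in the thread."""
    return f"{lang}.{font}.exp{exp}"


def plan_renames(pages: Sequence[str], lang: str,
                 font: str) -> List[Tuple[str, str]]:
    """Pair each source .tif with its training name, numbered from exp0."""
    return [
        (src, training_basename(lang, font, i) + ".tif")
        for i, src in enumerate(pages)
    ]


if __name__ == "__main__":
    # File names as produced by the ImageMagick pattern 2151%d.tif.
    for src, dst in plan_renames(["21510.tif", "21511.tif"], "com", "test_font"):
        print(src, "->", dst)
```

Passing each pair to `os.rename(src, dst)` would apply the plan; printing it first makes it easy to check the numbering.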
>>>>>>
>>>>>> Great! After this step you must create your box files, right? So I 
>>>>>> simply called: 
>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>>>>>>
>>>>>> Then I fixed my box files with CowBoxEditor, which did the job, as I 
>>>>>> couldn't find the famous jTessBoxEditor online (weird, right?).
>>>>>>
>>>>>> After that, I created my .tr files:
>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>>
>>>>>> And here come the surprises!
>>>>>> Once you have your .tr files, you call unicharset_extractor. 
>>>>>> First question: Why are the glyph metrics all 
>>>>>> 0,255,0,255,0,0,0,0,0,0? This looks wrong according to the documentation: 
>>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>>> Second question: Should I process one box file and then the other, or 
>>>>>> combine them? 
>>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>>> Third question: Why should I use set_unicharset_properties? It 
>>>>>> doesn't fix the metrics; it only specifies whether the script is Latin 
>>>>>> or Common. Link: 
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/318
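
For the first question above, a quick way to see how many entries carry the all-default metrics string quoted in the message is to scan the unicharset file for it. This is a rough sketch: it assumes each entry is a whitespace-separated line whose glyph metrics field is ten comma-separated integers, as laid out in the unicharset(5) document linked above, and the sample lines in the usage part are made up for illustration.

```python
from typing import Iterable, List, Optional

# The metrics value reported in the thread.
PLACEHOLDER = "0,255,0,255,0,0,0,0,0,0"


def glyph_metrics(line: str) -> Optional[str]:
    """Return the first field made of ten comma-separated integers, if any."""
    for field in line.split():
        parts = field.split(",")
        if len(parts) == 10 and all(p.lstrip("-").isdigit() for p in parts):
            return field
    return None


def placeholder_entries(lines: Iterable[str]) -> List[str]:
    """List the characters whose metrics equal the placeholder value."""
    flagged = []
    for line in lines:
        if glyph_metrics(line) == PLACEHOLDER:
            flagged.append(line.split()[0])
    return flagged


if __name__ == "__main__":
    # Hypothetical unicharset lines, not taken from the attached files.
    sample = [
        "2",
        "a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a",
        "b 3 60,70,190,200,80,100,0,5,90,110 Latin 2 0 2 b",
    ]
    print(placeholder_entries(sample))
```

The scan only reports which entries still hold the default values; it does not say why they were not filled in.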
>>>>>>
>>>>>> After all these unanswered questions, I used mftraining and 
>>>>>> cntraining (no problems). Finally, I renamed my inttemp, normproto, 
>>>>>> pffmtable and shapetable files and combined them using 
>>>>>> combine_tessdata com.
>>>>>>
>>>>>> Final question: If I name the files com.inttemp1 and com.inttemp2, 
>>>>>> does it still work? The same question applies to shapetable, normproto 
>>>>>> and pffmtable.
>>>>>>
>>>>>> I think these questions are asked more than once by all new users of 
>>>>>> tesseract. If any tesseract expert can answer them, it will be a great 
>>>>>> help for the whole community.
>>>>>> Please find attached the two .tif files and the generated box files. 
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>
>
>
>

