That output dates from April 2017, so it was probably produced with a 3.0x
version. Try the 3.05 branch.

https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
3.05.01 Release: released by zdenop on Jun 1, 2017 (26 commits to 3.05
since this release).
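
To compare versions, the runs quoted below (three model directories, psm 1
to 13) can be generated programmatically. This is only a minimal sketch and
not something from the thread: the function name build_commands and the
output base names are hypothetical, and it only builds and prints the
command lines (using the --tessdata-dir and --psm options from the quoted
commands) rather than running tesseract.

```python
import shlex
from itertools import product
from pathlib import Path

# Model directories and psm range taken from the quoted message below.
TESSDATA_DIRS = ["tessdata_best-master", "tessdata_fast-master", "tessdata-master"]
PSM_MODES = range(1, 14)  # psm 1 to 13 inclusive


def build_commands(image="sample.tif", dirs=TESSDATA_DIRS, modes=PSM_MODES):
    """Return one tesseract argument list per (model directory, psm) pair."""
    stem = Path(image).stem
    commands = []
    for tessdata_dir, psm in product(dirs, modes):
        # Hypothetical naming scheme so that runs do not overwrite each other.
        out_base = f"{stem}_{tessdata_dir}_psm{psm}"
        commands.append(
            ["tesseract", "--tessdata-dir", tessdata_dir, image, out_base,
             "--psm", str(psm)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands only; run them by hand with each tesseract version.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Each printed line can then be run under both 3.05 and 4.0 and the resulting
text files compared.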

On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <[email protected]> wrote:

> Hi Shree,
>
> Thank you for the quick response.
> I used the trained data downloaded from
> https://github.com/tesseract-ocr/tessdata,
> https://github.com/tesseract-ocr/tessdata_best and
> https://github.com/tesseract-ocr/tessdata_fast.
>
> I ran the following commands for each of these datasets, varying psm from 1
> to 13, but the output is more or less like the one I posted. I couldn't get
> output like yours, with the text in the correct reading order.
>
> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>
> I am not sure what I am doing wrong here and would appreciate your help.
>
> Regards,
> Giriraj
>
> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>
>> Which eng.traineddata did you use?
>>
>> There are three options: tessdata, tessdata_best and tessdata_fast.
>>
>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>>
>>> Hello Shree,
>>>
>>> I realize this post is more than two years old now, but would appreciate
>>> any help.
>>> I tried your suggestion on the same attached sample using tesseract v4,
>>> but I am unable to get the result you posted.
>>> I have tried all page segmentation modes, and none of them produced it.
>>> Could you please let me know what I might be doing wrong?
>>>
>>> Here are the version details for tesseract on my machine:
>>>
>>> tesseract 4.0.0
>>>  leptonica-1.77.0
>>>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>  Found AVX2
>>>  Found AVX
>>>  Found SSE
>>>
>>> Here is the output I get for most of the psm modes:
>>>
>>>
>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>
>>> Did you know? Did you know?
>>>
>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>> service gives you access to millions Receive text message reminders when
>>> your
>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due.
>>> Sign up at
>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>
>>> more at business.comcast.conm/wifi.
>>>
>>> Your bill is ready
>>>
>>>
>>>
>>> Need help? We’re here for you.
>>>
>>>
>>>
>>> > Visit business.comcast.com/help Please notify us immediately with any
>>> Call 1-800-391-3000 questions regarding charges billed to your
>>> aa account. Comcast will issue a credit or
>>> Billing support refund for any verified billing error which is
>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within
>>> sixty (60) days
>>> and 7 am-8 pm Sat of the bill.
>>>
>>> Technical support
>>> Open 24 hours, 7 days a week
>>>
>>> TT
>>>
>>> Automatic payment If you’re moving, give us as much
>>> Sign up at business.comcast.com/myaccount advanced notice as possible
>>> so we
>>>
>>> Se Online can help make a smooth transition.
>>> Visit business.comcast.com/myaccount
>>>
>>> a By phone
>>> Call 1-800-391-3000
>>>
>>> Call 1-800-391-3000
>>>
>>> IME
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>> Giriraj.
>>>
>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>
>>>> If you want to OCR an invoice like the sample you posted, just use the
>>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>>
>>>> Here is the output I get
>>>>
>>>>
>>>>
>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>
>>>>
>>>> Did you know?
>>>>
>>>>
>>>> Your Comcast Business Internet
>>>>
>>>> service gives you access to millions
>>>>
>>>> of WiFi hotspots with the fastest WiFi
>>>>
>>>> and even more coverage. Find out
>>>>
>>>> more at businesscomcast.com/wifi.
>>>>
>>>>
>>>>
>>>> Need help? We’re here for you.
>>>>
>>>>
>>>> 9 Visit business.comcast.com/help
>>>>
>>>> Call 1-800—391 -3000
>>>>
>>>> A
>>>>
>>>>
>>>> Billing support
>>>>
>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>
>>>> and 7 am—8 pm Sat
>>>>
>>>>
>>>> Technical support
>>>>
>>>> Open 24 hours, 7 days a week
>>>>
>>>>
>>>>
>>>> Did you know?
>>>>
>>>>
>>>> Never miss a payment with text alerts.
>>>>
>>>> Receive text message reminders when your
>>>>
>>>> bill is ready to pay or past due. Sign up at
>>>>
>>>> business.comcast.com/myaccount.
>>>>
>>>>
>>>>
>>>> Your bill is ready
>>>>
>>>>
>>>>
>>>>
>>>> Please notify us immediately with any
>>>>
>>>> questions regarding charges billed to your
>>>>
>>>> account. Comcast will issue a credit or
>>>>
>>>> refund for any verified billing error which is
>>>>
>>>> brought to our attention within sixty (60) days
>>>>
>>>> of the bill.
>>>>
>>>>
>>>> llllllllllllllllllllllllllllllllll
>>>>
>>>>
>>>> Additional payment options Moving? Let us help.
>>>>
>>>>
>>>> Automatic payment
>>>>
>>>> Sign up at business.comcast.com/myaccount
>>>>
>>>>
>>>> a Oniine
>>>>
>>>>
>>>> Visit business.comcast.com/myaccount
>>>>
>>>>
>>>> a By phone
>>>>
>>>> Call 1-800-391 -3000
>>>>
>>>>
>>>> if you're moving, give us as much
>>>>
>>>> advanced notice as possible so we
>>>>
>>>> can help make a smooth transition.
>>>>
>>>>
>>>> Call 1 -800-391 -3000
>>>>
>>>>
>>>> |||||||llllllllllllllllllllllllll
>>>>
>>>>
>>>>
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am surprised by how many people tell me that tesseract is the best
>>>>> open-source OCR tool, and yet there is no video explaining step by step
>>>>> the problems you can encounter, nor a good explanation and documentation
>>>>> for OCR.
>>>>>
>>>>> Even so, everyone loves a challenge! So here is the one I faced. I have
>>>>> many PDF invoices, and I want to train tesseract to OCR them as scanned
>>>>> images.
>>>>> First of all, I converted these PDF files into TIFF files with
>>>>> ImageMagick:
>>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>>> Nothing important here, other than that we get a 300 dpi image with
>>>>> the alpha channel off.
>>>>>
>>>>> Next, rename the .tif files to [lang].[name_font].exp0.tif
>>>>> (com.test_font.exp0.tif in my example).
>>>>>
>>>>> Great! After this step you must create your box file right? So I
>>>>> simply called:
>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>>>>>
>>>>> Then I corrected my box files with CowBoxEditor, which did the job, as
>>>>> I couldn't find the famous jTessBoxEditor online (weird, right?).
>>>>>
>>>>> After that, I created my .tr files:
>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>
>>>>> And here come the surprises!
>>>>> Once you have your .tr files, you call unicharset_extractor.
>>>>> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0?
>>>>> That looks wrong according to the documentation:
>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>> Second question: Should I pass one box file at a time, or both together?
>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>> Third question: Why should I use set_unicharset_properties? It doesn't
>>>>> fix the metrics; it only specifies whether the script is Latin or Common!
>>>>> Link: https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>
>>>>> With these questions still unanswered, I ran mftraining and cntraining
>>>>> (no problems). Finally, I renamed my inttemp, normproto, pffmtable and
>>>>> shapetable files and combined them using combine_tessdata com.
>>>>>
>>>>> Final question: If I name the files com.inttemp1 and com.inttemp2, does
>>>>> it work? The same goes for shapetable, normproto and pffmtable.
>>>>>
>>>>> I think these questions are asked again and again by new tesseract
>>>>> users. If any tesseract expert can answer them, it will be a great help
>>>>> for the whole community.
>>>>> Please find attached the 2 tif files and the generated boxes.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
