I have completed the training portion using tesseract OCR training. 
After annotating the .box file, the misspelt character in my output 
extraction still did not change.

I followed only this article: Training Tesseract-OCR with custom data 
| by Sai Ashish | Medium 
<https://saiashish90.medium.com/training-tesseract-ocr-with-custom-data-d3f4881575c0>
How do I resolve this issue?
On Thursday 8 February 2024 at 10:22:40 UTC+5:30 aromal...@gmail.com wrote:

> Are you working on word-level text extraction or sentence-level text 
> extraction?
>
> On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 santhi...@gmail.com wrote:
>
>> Can you please tell me the model and the steps?
>>
>> On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com wrote:
>>
>>> If you are getting started with OCR, try some other engines, or just 
>>> start with some deep learning models and understand their basic working.
>>> On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com 
>>> wrote:
>>>
>>>> I already used the above-mentioned steps, but I lost the data.
>>>>
>>>> On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com 
>>>> wrote:
>>>>
>>>>> L.S.,
>>>>>
>>>>> *PDF. OCR. text extraction. best language models? not a lot of success 
>>>>> yet...*
>>>>>
>>>>> 🤔 
>>>>>
>>>>> Broad subject.  Learning curve ahead. 🚧 Workflow diagram included 
>>>>> today.
>>>>>
>>>>>
>>>>> *Tesseract does not live alone*
>>>>>
>>>>> Tesseract is an engine that takes an image as input and produces text 
>>>>> output; several output formats are available. If you are unsure, start 
>>>>> with HOCR output, as that's close to modern HTML and carries almost all 
>>>>> the info tesseract produces during the OCR process.
>>>>> If it isn't an image you've got, you need a preprocess (and consequently 
>>>>> additional tools) to produce images you can feed to tesseract. tesseract 
>>>>> is designed to process a SINGLE IMAGE. (Yes, that means you may want to 
>>>>> 'merge' its output afterwards: postprocessing.)
>>>>>
>>>>> *     To complicate matters immediately, tesseract can deal with 
>>>>> "multipage TIFF" images and can accept multiple images to process via its 
>>>>> commandline. Keep thinking "one page image in, bunch of text out" and 
>>>>> you'll be okay until you discover the additional possibilities.*
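>>>>>
>>>>> A minimal sketch of that "one page image in, bunch of text out" idea, 
>>>>> assuming the pytesseract wrapper and a placeholder page image called 
>>>>> page_001.png (both are just examples, not the only way):
>>>>>
>>>>>     # Sketch only: one page image in, plain text + HOCR out.
>>>>>     # Assumes the tesseract binary and the 'eng' language data are installed.
>>>>>     from PIL import Image
>>>>>     import pytesseract
>>>>>
>>>>>     page = Image.open("page_001.png")        # ONE page image, as said above
>>>>>     text = pytesseract.image_to_string(page, lang="eng")
>>>>>     hocr = pytesseract.image_to_pdf_or_hocr(page, lang="eng", extension="hocr")
>>>>>     with open("page_001.hocr", "wb") as f:   # HOCR comes back as bytes
>>>>>         f.write(hocr)
>>>>>     print(text)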
>>>>>
>>>>> *Advice Number 1: *get a tesseract executable, invoke it using its 
>>>>> commandline interface. If you can't build tesseract yourself, Uni 
>>>>> Mannheim 
>>>>> may have binaries for you to download and install. Linux distributions 
>>>>> often have tesseract binaries and the basic language models available as 
>>>>> packages, BUT 
>>>>> many of them are far behind the curve: the latest tesseract 
>>>>> release as of this writing is 5.3.4: 
>>>>> https://github.com/tesseract-ocr/tesseract/releases so VERIFY your 
>>>>> rig has the latest tesseract installed. Older releases are older and 
>>>>> "previous" for a reason!
>>>>>
>>>>>
>>>>> *Preprocessing is the chorus of this song*
>>>>>
>>>>> As you say "PDF", you therefore need to convert that thing to *page 
>>>>> images*. My personal favorite is the Artifex mupdf toolkit, using 
>>>>> mutool or mudraw / etc. tools from that commandline toolkit to render 
>>>>> accurate, high-rez page images. Others will favor other means, but it all 
>>>>> ends up doing the same thing: anything, PDFs et al, is to be converted to 
>>>>> one image per page and fed to tesseract that way. The rendered page 
>>>>> images MAY require additional *image preprocessing* (more on that below).
>>>>>
>>>>>
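>>>>> For that PDF-to-page-images step, a minimal sketch using mutool from the 
>>>>> mupdf toolkit (file names and the 300 dpi value are placeholders; see 
>>>>> Advice Number 2 below for the resolution question):
>>>>>
>>>>>     # Sketch: render every PDF page to a 300 dpi PNG, one image per page.
>>>>>     import subprocess
>>>>>
>>>>>     # -r 300           : render resolution in dpi
>>>>>     # -o page_%03d.png : output pattern, %03d becomes the page number
>>>>>     subprocess.run(
>>>>>         ["mutool", "draw", "-r", "300", "-o", "page_%03d.png", "input.pdf"],
>>>>>         check=True,
>>>>>     )
>>>>>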
>>>>> *This next bit cannot be stressed enough: *tesseract is designed and 
>>>>> engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN 
>>>>> WHITE BACKGROUND. I observe everyone and their granny dumping holiday 
>>>>> snapshots, favorite CD, LP and fancy colourful book covers straight into 
>>>>> tesseract and complaining "nothing sensible is coming out"; that's because 
>>>>> you're feeding it a load of dung as far as the engine is concerned: it 
>>>>> expects BLACK TEXT on PLAIN WHITE BACKGROUND like a regular dull printed 
>>>>> page in a BOOK, so anything with nature backgrounds, colourful 
>>>>> architectural backgrounds and such is begging for a disaster. And I only 
>>>>> empathize with the grannies. <drama + rant mode off/>   This is why 
>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is 
>>>>> mentioned almost every week on this mailing list, for example. It's very 
>>>>> important, but you'll need more...
>>>>>
>>>>>
>>>>> The take-away? You'll need additional tools for image preprocessing 
>>>>> until you can produce greyscale or B&W images that look almost as if 
>>>>> these 
>>>>> were plain old boring book pages: no or very little fancy stuff, black 
>>>>> text 
>>>>> (anti-aliased or not), white background. 
>>>>> Bonus points for you when your preprocess removes non-text image 
>>>>> components, e.g. photographs, from the page image: they can only confuse 
>>>>> the OCR engine, so when you strive for perfection, that's one more bit to 
>>>>> deal with BEFORE you feed it into tesseract and wait expectantly... 
>>>>> (Besides, tesseract will have less discovery to do, so it'll be faster 
>>>>> too. Of little importance, relatively speaking, but there you have it.)
>>>>> As also mentioned at 
>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools of 
>>>>> interest re image preprocessing are leptonica (parts of it are used by 
>>>>> tesseract, but don't count on it doing your preprocessing for you, as 
>>>>> that's a highly scenario/case-dependent activity and therefore not 
>>>>> included in tesseract itself). Also check out: OpenCV (a library, not a 
>>>>> tool, so you'll need scaffolding there before you can use it), 
>>>>> ImageMagick, Adobe Photoshop or open source Krita (great for 
>>>>> what-can-I-get experiments but not suitable for bulk), etc. etc.
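>>>>>
>>>>> As a taste of what such preprocessing can look like, a tiny OpenCV sketch 
>>>>> (greyscale + Otsu binarization only; real pages often need more, and the 
>>>>> right recipe is scenario-dependent, as said):
>>>>>
>>>>>     # Sketch: turn a colourful page image into "black text on plain white background".
>>>>>     import cv2
>>>>>
>>>>>     img = cv2.imread("page_001.png")
>>>>>     grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # drop the colour first
>>>>>     # Otsu picks a global threshold; uneven lighting may need adaptive thresholding instead.
>>>>>     _, bw = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
>>>>>     cv2.imwrite("page_001_bw.png", bw)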
>>>>>
>>>>>
>>>>> *Tesseract bliss and the afterglow: postprocessing*
>>>>>
>>>>> Once you are producing page images as if they were book pages, and 
>>>>> feeding them into tesseract, you get output, be it "plain text", HOCR or 
>>>>> otherwise.
>>>>>
>>>>> Personally I favor HOCR, but that's because it's closest to what *my 
>>>>> *workflow needs. You must look into "postprocessing" anyway: be it 
>>>>> additional tooling to recombine the OCR-ed text into a PDF "overlay", 
>>>>> PDF/A production, or anything else; advanced usage may require additional 
>>>>> postprocessing steps, e.g. pulling the OCR-ed text through a 
>>>>> spellchecker+corrector such as hunspell, if that floats your boat. You'll 
>>>>> also need to get, set up and/or program postprocess tooling if you 
>>>>> otherwise wish to merge multiple images' OCR results. You may want to 
>>>>> search the internet for this; I don't have any toolkit's name present off 
>>>>> the top of my head for that, as I'm using tesseract in a slightly 
>>>>> different workflow, where it is part of a custom, *augmented *mupdf 
>>>>> toolkit: PDF in, PDF + HOCR + misc document metadata out, so all that 
>>>>> preprocessing and postprocessing I hammer on is done by yours truly's 
>>>>> custom toolchain. It's under development, so I'm not working with the 
>>>>> diverse python stuff most everybody else will dig up after a quick google 
>>>>> search, I'm sure. Individual projects' requirements differ, so your path 
>>>>> will only be obvious to you.
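>>>>>
>>>>> Purely as an illustration of that "merge multiple images' OCR results" 
>>>>> bit, the plain-text case can be as dumb as this (assuming per-page .txt 
>>>>> files named page_001.txt, page_002.txt, ... from the earlier step; HOCR 
>>>>> merging is more involved):
>>>>>
>>>>>     # Sketch: stitch per-page plain-text OCR results back into one document.
>>>>>     from pathlib import Path
>>>>>
>>>>>     pages = sorted(Path(".").glob("page_*.txt"))
>>>>>     merged = "\n\f\n".join(p.read_text(encoding="utf-8") for p in pages)  # \f = page break
>>>>>     Path("document.txt").write_text(merged, encoding="utf-8")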
>>>>>
>>>>>
>>>>>
>>>>> *How to be trolling an OCR engine *😋
>>>>>
>>>>> Oh, before I forget: some peeps drop shopping bills and such into 
>>>>> off-the-shelf tesseract: *cute*, but not anything like a "plain 
>>>>> printed book page", so they encounter all kinds of "surprises".    
>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html  is 
>>>>> important, but it doesn't tell you *everything*. "Plain printed book 
>>>>> pages" are, by general assumption, pages of text, or, more precisely: 
>>>>> *stories*. Or other tracts with paragraphs of text. Bills, invoices 
>>>>> and other financial stuff are not just "tabulated semi-numeric content" 
>>>>> instead of "paragraphs of text"; those types of inputs also fail grade F 
>>>>> regarding the other implicit assumption that comes with human "paragraphs 
>>>>> of text": the latter are series of words, technically each a bunch of 
>>>>> alphabet glyphs (*alpha*numerics), while financials often mix 
>>>>> currency symbols and numeric values: while these were part of tesseract's 
>>>>> training set, I am sure, they are not its focal point, hence have been 
>>>>> given less attention than the words in your language dictionary. And 
>>>>> scanning those SKUs will fare even worse, as they're just jumbled *codes*, 
>>>>> rather than *language*. Consequently you'll need to retrain tesseract 
>>>>> if your CONTENT does not suit these mentioned assumptions re "plain 
>>>>> printed book page". I haven't done that yet myself; it's not for the 
>>>>> faint of heart, and since Google did the training for the "official" 
>>>>> tesseract language models everyone downloads and uses, you can bet your 
>>>>> bottom dollar retraining isn't going to be "nice" for the less well 
>>>>> funded either. Don't expect instant miracles and expect a long haul when 
>>>>> you decide you must go this route [of training tesseract], or you will 
>>>>> meet Captain Disappointment. Y'all have been warned. 😉
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Why your preprocess is more important than kickstarting tesseract by 
>>>>> blowing ether (*) up its carburetor*
>>>>>
>>>>> *Why is that "plain printed book page is like human stories and 
>>>>> similar tracts: paragraphs of text" mantra so important?* Well, 
>>>>> tesseract uses a lot of technology to get the OCR quality it achieves, 
>>>>> including using language dictionaries. While some smarter people will 
>>>>> find 
>>>>> switches in tesseract where *explicit* dictionary usage can be turned 
>>>>> off, those cannot switch off the *implicit* use, due to how the latest 
>>>>> and best core engine, LSTM+CTC (since tesseract v4), actually works: it 
>>>>> slowly moves its gaze across each word it is fed (jargon: *image 
>>>>> segmentation *preprocess inside tesseract produces these word images) 
>>>>> and LSTM is so good at recognizing text, because it has "learned 
>>>>> context": 
>>>>> that context being the characters surrounding the one it is gazing at 
>>>>> right 
>>>>> now. Which means LSTM can be argued to act akin to a *hidden Markov 
>>>>> model* (see wikipedia) and thus will deliver its predictions based on 
>>>>> what "language" (i.e. *dictionary*) it was fed during training: human 
>>>>> text which is used in professional papers and stories. Dutch VAT codes 
>>>>> didn't feature in the training set, as one member of the ML discovered a 
>>>>> while ago. Financial amounts, e.g. "EUR7.95", are also not prominently 
>>>>> featured in the LSTM's training, so you can now guess the amount of 
>>>>> confusion the LSTM will experience when scanning across such a thing: 
>>>>> reading "EUR" has it expect "O" with high confidence, as in "eur" 
>>>>> obviously leading to the word "euro", but what the heck is that "digit 7" 
>>>>> doing there?! That's *highly* unexpected, hence OCR probabilities drop, 
>>>>> cross decision-making thresholds, and you get WTF results, simply because 
>>>>> the engine went WTF *first*.
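>>>>>
>>>>> (For completeness: the switches usually pointed at for the *explicit* 
>>>>> part are, as far as I know, the load_system_dawg / load_freq_dawg config 
>>>>> variables; a sketch below, with the caveat that this does nothing about 
>>>>> the *implicit*, learned language context described above.)
>>>>>
>>>>>     # Sketch: run tesseract with the explicit word/frequency dictionaries disabled.
>>>>>     # This does NOT remove the language bias baked into the LSTM itself.
>>>>>     import subprocess
>>>>>
>>>>>     subprocess.run(
>>>>>         ["tesseract", "invoice_001.png", "invoice_001",
>>>>>          "-c", "load_system_dawg=0",
>>>>>          "-c", "load_freq_dawg=0",
>>>>>          "txt"],
>>>>>         check=True,
>>>>>     )
>>>>>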
>>>>> Ditto story/drama for calligraphed signs outside shops, and, *oh! 
>>>>> oh!, license plates*!! (google LPR/ALPR if you want any of that) and 
>>>>> *anything 
>>>>> else *that's *not *reams of text and thus you wouldn't expect to find 
>>>>> in a plain story- or textbook.
>>>>> (And for the detail-oriented folks: yes, tesseract had/has a module on 
>>>>> board for recognizing math, but I haven't seen that work very well with 
>>>>> my inputs, nor seen a lot of happy noises out there about it either, 
>>>>> though the Google engineer(s) surely must have anticipated OCRing that 
>>>>> kind of stuff alongside paragraphs of text. For us mere mortals, I'd 
>>>>> consider this bit "a historic attempt" and forget about it.)
>>>>>
>>>>>
>>>>> *Advice Number 2: *when rendering page images, the ppi (pixels per 
>>>>> inch) resolution to select would be best adjusted to produce regular 
>>>>> lines 
>>>>> of text in those images where the capital-height of the text is around 30 
>>>>> pixels. Typography people would rather like to refer to *x-height*, 
>>>>> so that would be a little lower in pixel height. Line height would be 
>>>>> larger, as that includes stems and interline spacing. However, from an 
>>>>> OCR engine perspective, these (x-height & line height) are very much 
>>>>> dependent on the font used and the page layout used, so they are more 
>>>>> variable than the reported optimal capital-D height of ~32 px. As no one 
>>>>> measures this up-front, 300 dpi in the render/print-to-image dialog of 
>>>>> your render tool of choice would be a reasonable initial guess, but when 
>>>>> you want more accuracy, tweaking this number can already bring some 
>>>>> quality gains. 
>>>>> Of course, when the source is (low rez) bitmap images already (embedded 
>>>>> in 
>>>>> PDF or otherwise), there's little you can do, but then there's still 
>>>>> scaling, sharpening, etc. image preprocessing to try. This advice is 
>>>>> driven 
>>>>> by the results published here: 
>>>>> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ 
>>>>> (and a quick google search already turned up someone else who does 
>>>>> something like that and published a small bit of tooling: 
>>>>> https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )
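>>>>>
>>>>> A back-of-the-envelope version of that advice, if you happen to know the 
>>>>> body text's point size (the 0.7 cap-height-to-point-size ratio below is a 
>>>>> rough, font-dependent assumption):
>>>>>
>>>>>     # Sketch: estimate the render dpi that puts capital letters at ~32 px height.
>>>>>     TARGET_CAP_PX = 32    # the "around 30 pixels" sweet spot mentioned above
>>>>>     CAP_RATIO = 0.7       # assumed capital-height / point-size ratio; varies per font
>>>>>
>>>>>     def suggested_dpi(font_pt: float) -> float:
>>>>>         cap_height_inch = CAP_RATIO * font_pt / 72.0   # 72 points per inch
>>>>>         return TARGET_CAP_PX / cap_height_inch
>>>>>
>>>>>     print(round(suggested_dpi(10)))   # ~330 dpi for 10 pt body text
>>>>>     print(round(suggested_dpi(8)))    # ~410 dpi for 8 pt footnotes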
>>>>>
>>>>>
>>>>> (*) the old-fashioned way to see if a rusty engine will still go (or 
>>>>> blow, alas). Replace with "SEO'd blog pages extolling instant success 
>>>>> with ease" to take this into the 21st century.
>>>>>
>>>>>
>>>>>
>>>>> *The mandatory readings list:*
>>>>>
>>>>> - https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
>>>>> - https://tesseract-ocr.github.io/tessdoc/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *The above in diagram form (suggested tesseract workflow ;-) )*
>>>>>
>>>>> [image: diagram.png]
>>>>> (diagram PikChr source + SVG attached)
>>>>>
>>>>>
>>>>>
>>>>> Met vriendelijke groeten / Best regards,
>>>>>
>>>>> Ger Hobbelt
>>>>>
>>>>> --------------------------------------------------
>>>>> web:    http://www.hobbelt.com/
>>>>>         http://www.hebbut.net/
>>>>> mail:   g...@hobbelt.com
>>>>> mobile: +31-6-11 120 978
>>>>> --------------------------------------------------
>>>>>
>>>>>
>>>>> On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Hi guys, I will start developing OCR for image and PDF to text 
>>>>>> extraction. What are the steps I need to follow? Can you please refer 
>>>>>> me to the best model? I have already used the pytesseract engine but I 
>>>>>> did not get proper extraction...
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sandhiya
>>>>>>
>>>>>

