How do I fix this issue by training Tesseract OCR on custom data?

On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 Santhiya C wrote:
Can you please tell me the model and the steps?

On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com wrote:

If you are getting started with OCR, try some other engines, or just start with some deep learning models and understand the basic workings first.

On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com wrote:

I already used the steps mentioned above, but I lost the data.

On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com wrote:

L.S.,

*PDF. OCR. Text extraction. Best language models? Not a lot of success yet...*

🤔

Broad subject. Learning curve ahead. 🚧 Workflow diagram included today.

*Tesseract does not live alone*

Tesseract is an engine: it takes an image as input and produces text output; several output formats are available. If you are unsure, start with HOCR output, as that is close to modern HTML and carries almost all the info tesseract produces during the OCR process.
If what you have isn't an image, you need a preprocess (and consequently additional tools) to produce images you can feed tesseract. Tesseract is designed to process a SINGLE IMAGE. (Yes, that means you may want to 'merge' its output afterwards: postprocessing.)

*To complicate matters immediately: tesseract can deal with "multipage TIFF" images and can accept multiple images to process via its command line. Keep thinking "one page image in, bunch of text out" and you'll be okay until you discover the additional possibilities.*

*Advice Number 1:* get a tesseract executable and invoke it using its command-line interface (a minimal invocation sketch follows below). If you can't build tesseract yourself, Uni Mannheim may have binaries for you to download and install. Linux distributions often have tesseract binaries and the mandatory language models available as packages, BUT many distributions lag more or less far behind the curve: the latest tesseract release as of this writing is 5.3.4 (https://github.com/tesseract-ocr/tesseract/releases), so VERIFY your rig has the latest tesseract installed. Older releases are older and "previous" for a reason!

*Preprocessing is the chorus of this song*

As you say "PDF", you therefore need to convert that thing to *page images*. My personal favorite is the Artifex mupdf toolkit, using mutool draw / mudraw etc. from that command-line toolkit to render accurate, high-resolution page images. Others will favor other means, but it all ends up doing the same thing: anything, PDFs et al., is to be converted to one image per page and fed to tesseract that way. The rendered page images MAY require additional *image preprocessing*:

*This next bit cannot be stressed enough:* tesseract is designed and engineered to work on plain printed book pages, i.e. BLACK TEXT on a PLAIN WHITE BACKGROUND. I observe everyone and their granny dumping holiday snapshots, favorite CD, LP and fancy colourful book covers straight into tesseract and complaining that "nothing sensible is coming out". That's because you're feeding it a load of dung as far as the engine is concerned: it expects BLACK TEXT on a PLAIN WHITE BACKGROUND, like a regular dull printed page in a BOOK, so anything with nature backgrounds, colourful architectural backgrounds and such is begging for disaster. And I only empathize with the grannies. <drama + rant mode off/>

This is why https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is mentioned almost every week on this mailing list, for example. It's very important, but you'll need more...
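To make Advice Number 1 and the PDF-to-page-images step concrete before moving on, here is a minimal sketch in Python (the engine the original poster already drives from Python anyway). It assumes mutool and tesseract are installed and on your PATH; the file names are illustrative, and this is one way to wire the preprocess to the engine, not the only one:

```python
# Minimal sketch: render a PDF to one image per page with mupdf's mutool,
# then feed each page image to tesseract, asking for HOCR output.
# Assumes `mutool` and `tesseract` are on the PATH; "input.pdf" and the
# page-*.png naming are illustrative.
import glob
import subprocess

# One image per page, rendered at 300 ppi (see Advice Number 2 below).
subprocess.run(
    ["mutool", "draw", "-r", "300", "-o", "page-%d.png", "input.pdf"],
    check=True,
)

# Tesseract processes a SINGLE image per invocation here; the `hocr`
# config file produces page-N.hocr next to each page image.
for page in sorted(glob.glob("page-*.png")):
    base = page.rsplit(".", 1)[0]
    subprocess.run(["tesseract", page, base, "hocr"], check=True)
```

Whether you keep the per-page .hocr files separate or merge them afterwards is a postprocessing decision; more on that below.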
The take-away? You'll need additional tools for image preprocessing until you can produce greyscale or B&W images that look almost as if they were plain old boring book pages: no (or very little) fancy stuff, black text (anti-aliased or not), white background.
Bonus points when your preprocess also removes non-text image components, e.g. photographs, from the page image: they can only confuse the OCR engine, so when you strive for perfection, that's one more bit to deal with BEFORE you feed the image into tesseract and wait expectantly... (Besides, tesseract will have less discovery to do, so it'll be faster too. Of little importance, relatively speaking, but there you have it.)
As also mentioned at https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, tools of interest re image processing are leptonica (parts of it are used by tesseract, but don't count on it doing your preprocessing for you: preprocessing is a highly scenario/case-dependent activity and therefore not included in tesseract itself). Also check out OpenCV (a library, not a tool, so you'll need scaffolding before you can use it), ImageMagick, and (Adobe Photoshop, or the open-source Krita: great for what-can-I-get experiments, but not suitable for bulk), etc. etc.

*Tesseract bliss and the afterglow: postprocessing*

Once you are producing page images as if they were book pages and feeding them into tesseract, you get output, be it "plain text", HOCR or otherwise.

Personally I favor HOCR, but that's because it's closest to what *my* workflow needs. You must look into "postprocessing" anyway: be it additional tooling to recombine the OCR-ed text into a PDF "overlay", PDF/A production, or anything else; advanced usage may require additional postprocessing steps, e.g. pulling the OCR-ed text through a spellchecker+corrector such as hunspell, if that floats your boat. You'll also need to obtain, set up and/or program postprocess tooling if you wish to merge multiple images' OCR results. You may want to search the internet for this; I don't have any toolkit's name at the top of my head for that, as I'm using tesseract in a slightly different workflow, where it is part of a custom, *augmented* mupdf toolkit: PDF in, PDF + HOCR + misc document metadata out, so all the preprocessing and postprocessing I keep hammering on is done by yours truly's custom toolchain. It's under development, so I'm not working with the diverse Python stuff most everybody else will dig up after a quick google search, I'm sure. Individual projects' requirements differ, so your path will only be obvious to you.
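To show the kind of postprocessing meant here, a minimal sketch that pulls words, bounding boxes and confidences out of tesseract's hOCR output. It assumes the standard markup tesseract emits (ocrx_word spans whose title attribute carries bbox and x_wconf); the file name comes from the earlier sketch and is illustrative:

```python
# Minimal postprocessing sketch: extract words, boxes and confidences from
# tesseract's hOCR output. Assumes the standard markup tesseract emits:
# <span class="ocrx_word" title="bbox x0 y0 x1 y1; x_wconf NN">word</span>,
# in well-formed XHTML. "page-1.hocr" is an illustrative file name.
import re
import xml.etree.ElementTree as ET

def hocr_words(path):
    """Yield (text, (x0, y0, x1, y1), confidence) per recognized word."""
    for elem in ET.parse(path).iter():
        if elem.tag.rsplit("}", 1)[-1] != "span":   # strip XHTML namespace
            continue
        if "ocrx_word" not in elem.get("class", ""):
            continue
        title = elem.get("title", "")
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        text = "".join(elem.itertext()).strip()
        if bbox and text:
            yield (text,
                   tuple(int(v) for v in bbox.groups()),
                   int(conf.group(1)) if conf else None)

for text, box, conf in hocr_words("page-1.hocr"):
    print(conf, box, text)
```

A loop like this is also the natural place to merge per-page results, or to route low-confidence words to a spellchecker such as hunspell.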
*How to be trolling an OCR engine* 😋

Oh, before I forget: some peeps drop shopping bills and such into off-the-shelf tesseract: *cute*, but nothing like a "plain printed book page", so they encounter all kinds of "surprises". https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is important, but it doesn't tell you *everything*. "Plain printed book pages" are, by general assumption, pages of text, or, more precisely: *stories*. Or other tracts with paragraphs of text. Bills, invoices and other financial stuff are not just "tabulated semi-numeric content" instead of "paragraphs of text"; those types of inputs also fail grade F regarding the other implicit assumption that comes with human "paragraphs of text": the latter are series of words, technically each a bunch of alphabet glyphs (*alpha*numerics), while financials often mix currency symbols and numeric values. While these were part of tesseract's training set, I am sure, they are not its focal point and hence have been given less attention than the words in your language dictionary. And scanning those SKUs will fare even worse, as they're just jumbled *codes* rather than *language*. Consequently you'll need to retrain tesseract if your CONTENT does not suit these mentioned assumptions re "plain printed book page". I haven't done that myself yet; it's not for the faint of heart, and since Google did the training for the "official" tesseract language models everyone downloads and uses, you can bet your bottom retraining isn't going to be "nice" for the less well funded either. Don't expect instant miracles, and expect a long haul when you decide you must go this route [of training tesseract], or you will meet Captain Disappointment. Y'all have been warned. 😉

*Why your preprocess is more important than kickstarting tesseract by blowing ether* up its carburetor*

*Why is that "a plain printed book page is like human stories and similar tracts: paragraphs of text" mantra so important?* Well, tesseract uses a lot of technology to get the OCR quality it achieves, including language dictionaries. While some smarter people will find switches in tesseract where *explicit* dictionary usage can be turned off (a hedged sketch of those switches follows below), nothing can switch off the *implicit* use, due to how the latest and best core engine, LSTM+CTC (since tesseract v4), actually works: it slowly moves its gaze across each word it is fed (jargon: an *image segmentation* preprocess inside tesseract produces these word images), and the LSTM is so good at recognizing text because it has "learned context", that context being the characters surrounding the one it is gazing at right now. Which means the LSTM can be argued to act akin to a *hidden Markov model* (see Wikipedia) and thus will deliver its predictions based on what "language" (i.e. *dictionary*) it was fed during training: human text as used in professional papers and stories. Dutch VAT codes didn't feature in the training set, as one member of this list discovered a while ago. Financial amounts, e.g. "EUR7.95", are also not prominently featured in the LSTM's training, so you can now guess the amount of confusion the LSTM will experience when scanning across such a thing: reading "EUR" has it expect "O" with high confidence, as "eur" obviously leads to the word "euro", but what the heck is that digit 7 doing there?! That's *highly* unexpected, hence OCR probabilities drop, pass decision-making thresholds, and you get WTF results, simply because the engine went WTF *first*.
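For completeness, those explicit-dictionary switches look roughly like this. A hedged sketch using pytesseract (which the original poster already uses); load_system_dawg and load_freq_dawg are real tesseract parameters that skip the explicit word lists, the file name is illustrative, and, as argued above, none of this removes the implicit language bias trained into the LSTM:

```python
# Hedged sketch: turn off tesseract's explicit dictionary word lists when
# OCR-ing non-language content (invoices, SKUs). This does NOT undo the
# implicit bias baked into the LSTM's training, as argued above.
# Assumes pytesseract and Pillow are installed; file name is illustrative.
from PIL import Image
import pytesseract

config = (
    "--psm 6 "                # assume a single uniform block of text
    "-c load_system_dawg=0 "  # skip the system word dictionary
    "-c load_freq_dawg=0"     # skip the frequent-words dictionary
)

text = pytesseract.image_to_string(
    Image.open("invoice-page-1.png"), config=config
)
print(text)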
Ditto story/drama for calligraphed signs outside shops, and, *oh! oh!, license plates*!! (google LPR/ALPR if you want any of that), and *anything else* that's *not* reams of text and thus not something you would expect to find in a plain story- or textbook.
(And for the detail-oriented folks: yes, tesseract had/has a module on board for recognizing math, but I haven't seen it work very well with my inputs, and I haven't seen a lot of happy noises out there about it either, though the Google engineer(s) surely must have anticipated OCR-ing that kind of stuff alongside paragraphs of text. For us mere mortals, I'd consider this bit "a historic attempt" and forget about it.)

*Advice Number 2:* when rendering page images, the ppi (pixels per inch) resolution to select is best adjusted so that the regular lines of text in those images have a capital-height of around 30 pixels. Typography people would rather refer to *x-height*, which would be a little lower in pixel height; line height would be larger, as it includes stems and interline spacing. From an OCR-engine perspective, however, these (x-height and line height) are very much dependent on the font and page layout used, so they are more variable than the reported optimum of a capital-D height at ~32px. As no one measures this up front, 300 dpi in the render/print-to-image dialog of your render tool of choice is a reasonable initial guess, but when you want more accuracy, tweaking this number can already bring some quality gains. Of course, when the source is (low-resolution) bitmap images already (embedded in a PDF or otherwise), there's little you can do, but then there's still scaling, sharpening, etc. image preprocessing to try. This advice is driven by the results published here: https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ (and google quickly produced one other person who does something like that and published a small bit of tooling: https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 ).
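As a back-of-the-envelope illustration of that ~30px target: a capital letter's height in inches is roughly the point size times a cap-height fraction divided by 72, so you can solve for the render dpi. A minimal sketch; the formula and the 0.7 cap-height fraction are my own assumptions, not something from tesseract's documentation:

```python
# Back-of-the-envelope sketch: pick a render DPI so capital letters in the
# page image come out ~32 px tall (per Advice Number 2). The formula and
# the 0.7 cap-height fraction are assumptions for illustration only.

TARGET_CAP_PX = 32   # target capital-letter height in pixels
CAP_FRACTION = 0.7   # typical cap height as a fraction of the point size

def render_dpi(body_pt: float) -> int:
    """Estimate render DPI for body text of `body_pt` points (1 pt = 1/72 inch)."""
    cap_inches = body_pt * CAP_FRACTION / 72.0  # cap height on paper, in inches
    return round(TARGET_CAP_PX / cap_inches)

# A 10 pt book page lands near ~330 dpi; larger print needs less.
for pt in (8, 10, 12):
    print(f"{pt} pt body text -> ~{render_dpi(pt)} dpi")
```

Which is why the flat 300 dpi default above is a reasonable start for ordinary book typography, and why small print benefits from rendering at a higher resolution.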
*) the old-fashioned way to see if a rusty engine will still go (or blow, alas). Replace with "SEO'd blog pages extolling instant success with ease" to take this into the 21st century.

*The mandatory readings list:*

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
- https://tesseract-ocr.github.io/tessdoc/

*The above in diagram form (suggested tesseract workflow ;-) )*

[image: diagram.png]
(diagram PikChr source + SVG attached)

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
     http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> wrote:

Hi guys, I am starting to develop OCR for image and PDF text extraction. What are the steps I need to follow, and can you please refer me to the best model? I already used the pytesseract engine, but I did not get a proper extraction...

Best Regards,

Sandhiya