Then go for the deep learning models. Since you have a dataset, it will be easier and less complex for the word-level text extraction task :)

On Tuesday 13 February 2024 at 11:10:53 UTC+5:30 santhi...@gmail.com wrote:
Word-level extraction only.

On Tuesday 13 February 2024 at 11:10:03 UTC+5:30 Santhiya C wrote:

I had completed the training portion using Tesseract OCR training. After annotating the .box file, it did not change the misspelt character in my output extraction.

I followed only this article: Training Tesseract-OCR with custom data. | by Sai Ashish | Medium
<https://saiashish90.medium.com/training-tesseract-ocr-with-custom-data-d3f4881575c0>

How do I resolve this issue?

On Thursday 8 February 2024 at 10:22:40 UTC+5:30 aromal...@gmail.com wrote:

Are you working on word-level text extraction or sentence-level text extraction?

On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 santhi...@gmail.com wrote:

Can you please tell me the model and the steps?

On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com wrote:

If you are getting started with OCR, try some other engines or just start with some deep learning models and understand the basic working.

On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com wrote:

I already used the above-mentioned steps, but I lost the data.

On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com wrote:

L.S.,

*PDF. OCR. Text extraction. Best language models? Not a lot of success yet...*

🤔

Broad subject. Learning curve ahead. 🚧 Workflow diagram included today.


*Tesseract does not live alone*

Tesseract is an engine: it takes an image as input and produces text output; several output formats are available. If you are unsure, start with HOCR output, as that's close to modern HTML and carries almost all the info tesseract produces during the OCR process.
If it isn't an image you've got, you need a preprocess (and consequently additional tools) to produce images you can feed tesseract. Tesseract is designed to process a SINGLE IMAGE. (Yes, that means you may want to 'merge' its output: postprocessing.)

*To complicate matters immediately: tesseract can deal with "multipage TIFF" images and can accept multiple images to process via its command line. Keep thinking "one page image in, bunch of text out" and you'll be okay until you discover the additional possibilities.*

*Advice Number 1:* get a tesseract executable and invoke it using its command-line interface. If you can't build tesseract yourself, Uni Mannheim may have binaries for you to download and install. Linux distributions often have tesseract binaries and the mandatory language models available as packages, BUT many distributions are more or less far behind the curve: the latest tesseract release as of this writing is 5.3.4: https://github.com/tesseract-ocr/tesseract/releases so VERIFY your rig has the latest tesseract installed. Older releases are older and "previous" for a reason!


*Preprocessing is the chorus of this song*

As you say "PDF", you therefore need to convert that thing to *page images*. My personal favorite is the Artifex mupdf toolkit, using mutool or mudraw / etc. tools from that command-line toolkit to render accurate, high-rez page images.
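To make that first leg concrete, here is a minimal sketch of the render-then-OCR loop driven from Python via subprocess (file names are hypothetical; it assumes the mutool and tesseract executables are installed and on your PATH, and that 300 dpi suits your input, see Advice Number 2 further down):

    # Minimal render-then-OCR sketch. Hypothetical file names; assumes the
    # `mutool` and `tesseract` binaries are installed and on PATH.
    import glob
    import subprocess

    # 1. Render every PDF page to a 300 dpi PNG: page-001.png, page-002.png, ...
    subprocess.run(
        ["mutool", "draw", "-r", "300", "-o", "page-%03d.png", "input.pdf"],
        check=True,
    )

    # 2. OCR each rendered page image; the trailing 'hocr' config makes
    #    tesseract write page-NNN.hocr instead of the default page-NNN.txt.
    for png in sorted(glob.glob("page-*.png")):
        base = png.rsplit(".", 1)[0]      # "page-001.png" -> "page-001"
        subprocess.run(["tesseract", png, base, "-l", "eng", "hocr"], check=True)

One page image in, one OCR result out, exactly as described above; merging those per-page results back together is the postprocessing topic further down.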
Others will favor other means, but it all ends up doing the same thing: anything, PDFs et al., is to be converted to one image per page and fed to tesseract that way. The rendered page images MAY require additional *image preprocessing*:

*This next bit cannot be stressed enough:* tesseract is designed and engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN WHITE BACKGROUND. As I observe everyone and their granny dumping holiday snapshots, favorite CD, LP and fancy colourful book covers straight into tesseract and complaining "nothing sensible is coming out": that's because you're feeding it a load of dung as far as the engine is concerned. It expects BLACK TEXT on PLAIN WHITE BACKGROUND like a regular dull printed page in a BOOK, so anything with nature backgrounds, colourful architectural backgrounds and such is begging for a disaster. And I only empathize with the grannies. <drama + rant mode off/> This is why https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is mentioned almost every week in this mailing list, for example. It's very important, but you'll need more...

The take-away? You'll need additional tools for image preprocessing until you can produce greyscale or B&W images that look almost as if they were plain old boring book pages: no or very little fancy stuff, black text (anti-aliased or not), white background.
Bonus points for you when your preprocess removes non-text image components, e.g. photographs, from the page image: they can only confuse the OCR engine, so when you strive for perfection, that's one more bit to deal with BEFORE you feed it into tesseract and wait expectantly... (Besides, tesseract will have less discovery to do, so it'll be faster too. Of little importance, relatively speaking, but there you have it.)
As also mentioned at https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools of interest re image processing are leptonica (parts of it are used by tesseract, but don't count on it doing your preprocessing for you, as that's a highly scenario/case-dependent activity and therefore not included in tesseract itself). Also check out: OpenCV (a library, not a tool, so you'll need scaffolding there before you can use it), ImageMagick, Adobe Photoshop or its open-source counterpart Krita (great for what-can-I-get experiments but not suitable for bulk), etc. etc.


*Tesseract bliss and the afterglow: postprocessing*

Once you are producing page images as if they were book pages and feeding them into tesseract, you get output, be it "plain text", HOCR or otherwise.

Personally I favor HOCR, but that's because it's closest to what *my* workflow needs. You must look into "postprocessing" anyway: be it additional tooling to recombine the OCR-ed text into a PDF "overlay", PDF/A production, or anything else; advanced usage may require additional postprocessing steps, e.g. pulling the OCR-ed text through a spellchecker+corrector such as hunspell, if that floats your boat.
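If you go the HOCR route, postprocessing typically starts with pulling the recognised words (plus their boxes and confidences) back out of that HTML. A minimal standard-library-only sketch, assuming a file like page-001.hocr from the earlier run and tesseract's usual hOCR markup (each word a span of class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1; x_wconf NN"):

    # Minimal hOCR word extractor (standard library only).
    from html.parser import HTMLParser

    class HocrWords(HTMLParser):
        def __init__(self):
            super().__init__()
            self.words = []        # (text, (x0, y0, x1, y1), confidence)
            self._pending = None   # bbox/conf of the ocrx_word span we're inside

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "span" and attrs.get("class") == "ocrx_word":
                bbox, conf = (0, 0, 0, 0), -1.0
                for part in attrs.get("title", "").split(";"):
                    part = part.strip()
                    if part.startswith("bbox "):
                        bbox = tuple(int(v) for v in part.split()[1:5])
                    elif part.startswith("x_wconf "):
                        conf = float(part.split()[1])
                self._pending = (bbox, conf)

        def handle_data(self, data):
            if self._pending and data.strip():
                bbox, conf = self._pending
                self.words.append((data.strip(), bbox, conf))

        def handle_endtag(self, tag):
            if tag == "span":
                self._pending = None

    parser = HocrWords()
    with open("page-001.hocr", encoding="utf-8") as f:
        parser.feed(f.read())

    for text, bbox, conf in parser.words:
        print(f"{conf:5.1f}  {bbox}  {text}")

From a word list like that you can rebuild lines, feed a spellchecker, or build a searchable text layer; what exactly depends entirely on your workflow.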
You'll also need to get, set up and/or program postprocess tooling if you otherwise wish to merge multiple images' OCR results. You may want to search the internet for this; I don't have any toolkit's name present off the top of my head for that, as I'm using tesseract in a slightly different workflow, where it is part of a custom, *augmented* mupdf toolkit: PDF in, PDF + HOCR + misc document metadata out, so all that preprocessing and postprocessing I hammer on is done by yours truly's custom toolchain. Under development, so I'm not working with the diverse Python stuff most everybody else will dig up after a quick google search, I'm sure. Individual projects' requirements differ, so your path will only be obvious to you.


*How to troll an OCR engine* 😋

Oh, before I forget: some peeps drop shopping bills and such into off-the-shelf tesseract: *cute*, but not anything like a "plain printed book page", so they encounter all kinds of "surprises": https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is important but it doesn't tell you *everything*. "Plain printed book pages" are, by general assumption, pages of text, or, more precisely: *stories*. Or other tracts with paragraphs of text. Bills, invoices and other financial stuff are not just "tabulated semi-numeric content" instead of "paragraphs of text"; those types of inputs also fail grade F regarding the other implicit assumption that comes with human "paragraphs of text": the latter are series of words, technically each a bunch of alphabet glyphs (*alpha*numerics), while financials often mix currency symbols and numeric values. While these were part of tesseract's training set, I am sure, they are not its focal point, hence they have been given less attention than the words in your language dictionary. And scanning those SKUs will fare even worse, as they're just jumbled *codes*, rather than *language*. Consequently you'll need to retrain tesseract if your CONTENT does not suit these mentioned assumptions re "plain printed book page". Haven't done that yet myself; it's not for the faint of heart, and since Google did the training for the "official" tesseract language models everyone downloads and uses, you can bet your bottom retraining isn't going to be "nice" for the less well funded either. Don't expect instant miracles, and expect a long haul when you decide you must go this route [of training tesseract], or you will meet Captain Disappointment. Y'all have been warned. 😉


*Why your preprocess is more important than kickstarting tesseract by blowing ether* up its carburetor*

*Why is that "a plain printed book page is like human stories and similar tracts: paragraphs of text" mantra so important?* Well, tesseract uses a lot of technology to get the OCR quality it achieves, including using language dictionaries.
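(For the curious: the "explicit" switches referred to in the next paragraph are tesseract config variables such as load_system_dawg and load_freq_dawg; a hedged sketch of turning them off through pytesseract, which, as explained below, still leaves the implicitly learned language bias in place:)

    # Sketch only: disable tesseract's *explicit* word/frequency dictionaries.
    # The LSTM's implicitly learned language bias remains either way.
    import pytesseract
    from PIL import Image

    img = Image.open("receipt.png")   # hypothetical input image
    text = pytesseract.image_to_string(
        img,
        lang="eng",
        config="-c load_system_dawg=0 -c load_freq_dawg=0",
    )
    print(text)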
While some smarter people will find switches in tesseract where *explicit* dictionary usage can be turned off, you cannot switch off the *implicit* use, due to how the latest and best core engine, LSTM+CTC (since tesseract v4), actually works: it slowly moves its gaze across each word it is fed (jargon: an *image segmentation* preprocess inside tesseract produces these word images), and LSTM is so good at recognizing text because it has "learned context": that context being the characters surrounding the one it is gazing at right now. Which means LSTM can be argued to act akin to a *hidden Markov model* (see Wikipedia) and thus will deliver its predictions based on what "language" (i.e. *dictionary*) it was fed during training: human text of the kind used in professional papers and stories. Dutch VAT codes didn't feature in the training set, as one member of the mailing list discovered a while ago. Financial amounts, e.g. "EUR7.95", are also not prominently featured in the LSTM's training, so you can now guess the amount of confusion the LSTM will experience when scanning across such a thing: reading "EUR" has it expect "O" with high confidence, as "eur" obviously leads to the word "euro", but what the heck is that "digit 7" doing there?! That's *highly* unexpected, hence OCR probabilities drop, cross decision-making thresholds, and you get WTF results, simply because the engine went WTF *first*.
Ditto story/drama for calligraphed signs outside shops, and, *oh! oh!, license plates*!! (google LPR/ALPR if you want any of that) and *anything else* that's *not* reams of text and thus you wouldn't expect to find in a plain story- or textbook.
(And for the detail-oriented folks: yes, tesseract had/has a module on board for recognizing math, but I haven't seen that work very well with my inputs, and I haven't seen a lot of happy noises out there about it either, although the Google engineer(s) surely must have anticipated OCRing that kind of stuff alongside paragraphs of text. For us mere mortals, I'd consider this bit "a historic attempt" and forget about it.)


*Advice Number 2:* when rendering page images, the ppi (pixels per inch) resolution to select would best be adjusted to produce regular lines of text in those images where the capital height of the text is around 30 pixels. Typography people would rather refer to *x-height*, so that would be a little lower in pixel height. Line height would be larger, as that includes stems and interline spacing. However, from an OCR engine perspective, these (x-height & line-height) are very much dependent on the font used and the page layout used, so they are more variable than the reported optimal capital-D height at ~32px. As no one measures this up front, as an initial guess, 300 dpi in the render/print-to-image dialog of your render tool of choice would be a reasonable start, but when you want more accuracy, tweaking this number can already bring some quality gains.
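As a back-of-the-envelope aid for that tweaking (a sketch, assuming you know or can guess the body text's point size; the ~0.7 cap-height-to-point-size ratio is a typical value and very much font dependent):

    # Estimate the render DPI needed to reach ~30 px capital height.
    # Assumptions: capital height ~= 0.7 x nominal point size (font dependent),
    # and 1 pt = 1/72 inch.
    def render_dpi_for_cap_height(point_size: float,
                                  target_cap_px: float = 30.0,
                                  cap_ratio: float = 0.7) -> float:
        cap_height_inch = cap_ratio * point_size / 72.0
        return target_cap_px / cap_height_inch

    for pt in (8, 10, 12):
        print(f"{pt:>2} pt body text -> render at roughly {render_dpi_for_cap_height(pt):.0f} dpi")

Note how 10 pt body text lands near the oft-quoted 300 dpi starting point, while small print wants noticeably more.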
Of course, when the source is (low-rez) bitmap images already (embedded in PDF or otherwise), there's little you can do, but then there's still scaling, sharpening, etc. image preprocessing to try. This advice is driven by the results published here: https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ (and google already quickly produced one other person who does something like that and published a small bit of tooling: https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )

*) the old-fashioned way to see if a rusty engine will still go (or blow, alas). Replace with "SEO'd blog pages extolling instant success with ease" to take this into the 21st century.


*The mandatory readings list:*

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
- https://tesseract-ocr.github.io/tessdoc/


*The above in diagram form (suggested tesseract workflow ;-) )*

[image: diagram.png]
(diagram PikChr source + SVG attached)


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> wrote:

Hi guys, I will start developing OCR using image and PDF to text extraction. What are the steps I need to follow? Can you please refer me to the best model? I already used the pytesseract engine, but I did not get proper extraction...

Best regards,

Sandhiya