Did you read the tesseract documentation? Do you understand it?
— Zdenko

On Tuesday, 6 February 2024 at 12:38, Santhiya C <santhiya.c...@gmail.com> wrote:
> How do I fix this issue when training tesseract OCR on custom data?

On Tuesday, 6 February 2024 at 12:11:03 UTC+5:30, Santhiya C wrote:
> Can you please tell me which model to use, and the steps?

On Monday, 5 February 2024 at 17:22:10 UTC+5:30, aromal...@gmail.com wrote:
> If you are getting started with OCR, try some other engines, or just start with some deep learning models and understand the basic workings.

On Thursday, 1 February 2024 at 11:17:14 UTC+5:30, santhi...@gmail.com wrote:
> I already used the steps mentioned above, but I lost the data.

On Saturday, 27 January 2024 at 06:52:54 UTC+5:30, g...@hobbelt.com wrote:

L.S.,

*PDF. OCR. text extraction. best language models? not a lot of success yet...*

🤔

Broad subject. Learning curve ahead. 🚧 Workflow diagram included today.


*Tesseract does not live alone*

Tesseract is an engine which takes an image as input and produces text output; several output formats are available. If you are unsure, start with HOCR output, as that is close to modern HTML and carries almost all the info tesseract produces during the OCR process.
If what you've got isn't an image, you need a preprocess (and consequently additional tools) to produce images you can feed tesseract. tesseract is designed to process a SINGLE IMAGE. (Yes, that means you may want to 'merge' its output: postprocessing.)

* To complicate matters immediately, tesseract can deal with "multipage TIFF" images and can accept multiple images to process via its commandline. Keep thinking "one page image in, bunch of text out" and you'll be okay until you discover the additional possibilities.*

*Advice Number 1:* get a tesseract executable and invoke it using its commandline interface.
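A minimal sketch of that commandline invocation, wrapped in Python for scripting over many page images (the `-l` flag and trailing `hocr` config name are standard tesseract CLI usage; the file names are made up):

```python
import shutil
import subprocess

def build_tesseract_cmd(image_path, out_base, lang="eng"):
    # One page image in, out_base.hocr out; the trailing "hocr"
    # selects HOCR output instead of plain text.
    return ["tesseract", image_path, out_base, "-l", lang, "hocr"]

def ocr_page(image_path, out_base, lang="eng"):
    if shutil.which("tesseract") is None:
        raise RuntimeError("no tesseract executable on PATH")
    subprocess.run(build_tesseract_cmd(image_path, out_base, lang),
                   check=True)
    return out_base + ".hocr"
```

Looping this per rendered page image (and merging the per-page HOCR afterwards) is the "one page image in, bunch of text out" workflow in its simplest scripted form.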
If you can't build tesseract yourself, Uni Mannheim may have binaries for you to download and install. Linuxes often have tesseract binaries and the mandatory language models available as packages, BUT many Linuxes are more or less far behind the curve: the latest tesseract release as of this writing is 5.3.4 ( https://github.com/tesseract-ocr/tesseract/releases ), so VERIFY your rig has the latest tesseract installed. Older releases are older and "previous" for a reason!


*Preprocessing is the chorus of this song*

As you say "PDF", you therefore need to convert that thing to *page images*. My personal favorite is the Artifex mupdf toolkit, using mutool / mudraw / etc. from that commandline toolkit to render accurate, high-rez page images. Others will favor other means, but it all ends up doing the same thing: anything, PDFs et al., is to be converted to one image per page and fed to tesseract that way. The rendered page images MAY require additional *image preprocessing*:

*This next bit cannot be stressed enough:* tesseract is designed and engineered to work on plain printed book pages, i.e. BLACK TEXT on a PLAIN WHITE BACKGROUND. As I observe everyone and their granny dumping holiday snapshots, favorite CD, LP and fancy colourful book covers straight into tesseract and complaining "nothing sensible is coming out": that's because you're feeding it a load of dung as far as the engine is concerned. It expects BLACK TEXT on a PLAIN WHITE BACKGROUND like a regular dull printed page in a BOOK, so anything with nature backgrounds, colourful architectural backgrounds and such is begging for disaster. And I only empathize with the grannies.
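That "black text on plain white background" demand is, at its core, a binarization problem. Real workflows reach for ImageMagick or OpenCV, but the central idea, picking a global threshold that best separates dark ink from light paper (Otsu's method), fits in a few lines of plain Python. A sketch over a bare list of greyscale pixel values, no image library assumed:

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold maximizing between-class variance
    (Otsu's method) for an iterable of greyscale pixel values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = cum = 0
    for t in range(256):
        w0 += hist[t]                 # pixels in the "ink" class: value <= t
        if w0 == 0 or w0 == total:
            continue
        cum += t * hist[t]
        m0 = cum / w0                            # mean of the dark class
        m1 = (total_sum - cum) / (total - w0)    # mean of the light class
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels):
    """Map greyscale values to pure black (0) / white (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

This is only the core idea; production preprocessing (deskewing, despeckling, adaptive rather than global thresholding) is exactly the scenario-dependent work the ImproveQuality page discusses.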
<drama + rant mode off/> This is why https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is mentioned almost every week on this mailing list, for example. It's very important, but you'll need more...


The take-away? You'll need additional tools for image preprocessing until you can produce greyscale or B&W images that look almost as if they were plain old boring book pages: no or very little fancy stuff, black text (anti-aliased or not), white background.
Bonus points for you when your preprocess removes non-text image components, e.g. photographs, from the page image: they can only confuse the OCR engine, so when you strive for perfection, that's one more bit to deal with BEFORE you feed it into tesseract and wait expectantly... (Besides, tesseract will have less discovery to do, so it'll be faster too. Of little importance, relatively speaking, but there you have it.)
As also mentioned at https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, tools of interest re image preprocessing are: leptonica (parts of it are used by tesseract, but don't count on it doing your preprocessing for you, as that is a highly scenario/case-dependent activity and therefore not included in tesseract itself). Also check out: OpenCV (a library, not a tool, so you'll need scaffolding before you can use it), ImageMagick, (Adobe Photoshop or the open source Krita: great for what-can-I-get experiments but not suitable for bulk), etc. etc.


*Tesseract bliss and the afterglow: postprocessing*

Once you are producing page images as if they were book pages, and feeding them into tesseract, you get output, be it "plain text", HOCR or otherwise.

Personally I favor HOCR, but that's because it's closest to what *my* workflow needs.
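Since HOCR is essentially XHTML with `ocrx_word` spans carrying bounding boxes in their `title` attributes, postprocessing can start with nothing but the Python stdlib. A minimal sketch (the sample span below is hand-written in HOCR's documented shape, not actual tesseract output):

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from HOCR 'ocrx_word' spans."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "ocrx_word":
            # title looks like: 'bbox 52 13 143 42; x_wconf 96'
            bbox_part = a.get("title", "").split(";")[0].split()
            self._bbox = tuple(int(v) for v in bbox_part[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

sample = ('<span class="ocrx_word" '
          'title="bbox 52 13 143 42; x_wconf 96">Hello</span>')
parser = HocrWords()
parser.feed(sample)
# parser.words now holds [('Hello', (52, 13, 143, 42))]
```

From word text plus coordinates you can rebuild reading order, build PDF text overlays, or feed a spellchecker, i.e. the postprocessing discussed next.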
You must look into "postprocessing" anyway: be it additional tooling to recombine the OCR-ed text into a PDF "overlay", PDF/A production, or anything else; advanced usage may require additional postprocessing steps, e.g. pulling the OCR-ed text through a spellchecker+corrector such as hunspell, if that floats your boat. You'll also need to get, set up and/or program postprocess tooling if you otherwise wish to merge multiple images' OCR results. You may want to search the internet for this; I don't have any toolkit's name present off the top of my head for that, as I'm using tesseract in a slightly different workflow, where it is part of a custom, *augmented* mupdf toolkit: PDF in, PDF + HOCR + misc document metadata out, so all that preprocessing and postprocessing I hammer on is done by yours truly's custom toolchain. Under development, so I'm not working with the diverse Python stuff most everybody else will dig up after a quick google search, I'm sure. Individual projects' requirements differ and such, so your path will only be obvious to you.


*How to be trolling an OCR engine* 😋

Oh, before I forget: some peeps drop shopping bills and such into off-the-shelf tesseract: *cute*, but not anything like a "plain printed book page", so they encounter all kinds of "surprises". https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is important, but it doesn't tell you *everything*. "Plain printed book pages" are, by general assumption, pages of text, or, more precisely: *stories*. Or other tracts with paragraphs of text.
Bills, invoices and other financial stuff are not just "tabulated semi-numeric content" instead of "paragraphs of text"; those types of input also fail, grade F, regarding the other implicit assumption that comes with human "paragraphs of text": the latter are series of words, technically each a bunch of alphabet glyphs (*alpha*numerics), while financials often mix currency symbols and numeric values. While these were part of tesseract's training set, I am sure, they are not its focal point, hence have been given less attention than the words in your language dictionary. And scanning those SKUs will fare even worse, as they're just jumbled *codes* rather than *language*. Consequently you'll need to retrain tesseract if your CONTENT does not suit these mentioned assumptions re "plain printed book page". I haven't done that myself yet; it's not for the faint of heart, and since Google did the training for the "official" tesseract language models everyone downloads and uses, you can bet your bottom retraining isn't going to be "nice" for the less well funded either. Don't expect instant miracles, and expect a long haul when you decide you must go this route [of training tesseract], or you will meet Captain Disappointment. Y'all have been warned. 😉


*Why your preprocess is more important than kickstarting tesseract by blowing ether* up its carburetor*

*Why is that "a plain printed book page is like human stories and similar tracts: paragraphs of text" mantra so important?* Well, tesseract uses a lot of technology to get the OCR quality it achieves, including language dictionaries.
While some smarter people will find switches in tesseract where *explicit* dictionary usage can be turned off, you cannot switch off the *implicit* use, due to how the latest and best core engine, LSTM+CTC (since tesseract v4), actually works: it slowly moves its gaze across each word it is fed (jargon: the *image segmentation* preprocess inside tesseract produces these word images), and the LSTM is so good at recognizing text because it has "learned context": that context being the characters surrounding the one it is gazing at right now. Which means the LSTM can be argued to act akin to a *hidden Markov model* (see wikipedia) and thus will deliver its predictions based on what "language" (i.e. *dictionary*) it was fed during training: human text as used in professional papers and stories. Dutch VAT codes didn't feature in the training set, as one member of the ML discovered a while ago. Financial amounts, e.g. "EUR7.95", are also not prominently featured in the LSTM's training, so you can now guess the amount of confusion the LSTM will experience when scanning across such a thing: reading "EUR" has it expect "O" with high confidence, as "eur" obviously leads to the word "euro", but what the heck is that "digit 7" doing there?! That's *highly* unexpected, hence OCR probabilities drop, pass decision-making thresholds, and you get WTF results, simply because the engine went WTF *first*.
Ditto story/drama for calligraphed signs outside shops, and, *oh! oh!, license plates*!! (google LPR/ALPR if you want any of that) and *anything else* that's *not* reams of text and thus wouldn't be expected in a plain story- or textbook.
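For the curious, those *explicit* dictionary switches are tesseract config variables; a config-file fragment (passed as a trailing config-file name on the commandline, or via `-c var=value`) that disables the wordlist-based dictionaries looks like this. Whether it helps is content-dependent, and per the above it does nothing about the implicitly learned language model:

```
load_system_dawg 0
load_freq_dawg 0
```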
(And for the detail-oriented folks: yes, tesseract had/has a module on board for recognizing math, but I haven't seen that work very well with my inputs, and I haven't seen a lot of happy noises out there about it either, though the Google engineer(s) surely must have anticipated OCRing that kind of stuff alongside paragraphs of text. For us mere mortals, I'd consider this bit "a historic attempt" and forget about it.)


*Advice Number 2:* when rendering page images, the ppi (pixels per inch) resolution to select is best adjusted to produce regular lines of text in those images where the capital height of the text is around 30 pixels. Typography people would rather refer to *x-height*, so that would be a little lower in pixel height. Line height would be larger, as that includes stems and interline spacing. However, from an OCR engine perspective, these (x-height & line height) are very much dependent on the font used and the page layout used, so they are more variable than the reported optimal capital-D height at ~32px. As no one measures this up front, 300 dpi in the render/print-to-image dialog of your render tool of choice would be a reasonable initial guess, but when you want more accuracy, tweaking this number can already bring some quality improvements. Of course, when the source already is (low-rez) bitmap images (embedded in PDF or otherwise), there's little you can do, but then there's still scaling, sharpening, etc. image preprocessing to try.
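The arithmetic behind Advice Number 2 is simple: a glyph set at P points has a nominal body of P/72 inch, of which the capital height is roughly 70% (that 0.7 ratio is a rough typographic rule of thumb, not a tesseract constant), so the ppi needed for a ~30 px capital height can be estimated:

```python
def ppi_for_cap_height(point_size, target_px=30.0, cap_ratio=0.7):
    """Estimate render resolution (ppi) so capitals come out ~target_px tall.

    point_size: nominal font size in points (1 pt = 1/72 inch)
    cap_ratio:  capital height as a fraction of the point size (rough guess)
    """
    cap_height_inch = point_size * cap_ratio / 72.0
    return target_px / cap_height_inch

# Common 10 pt body text lands close to the usual "try 300 dpi" advice:
# ppi_for_cap_height(10) ≈ 308.6
```

Which is why 300 dpi is a fine opening bid for ordinary body text, while small-print footnotes or large-type headings would warrant adjusting the number.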
This advice is driven by the results published here: https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ (and google already quickly produced one other person who does something like that and published a small bit of tooling: https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )


*) the old-fash way to see if a rusty engine will still go (or blow, alas). Replace with "SEO'd blog pages extolling instant success with ease" to take this into the 21st century.


*The mandatory reading list:*

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
- https://tesseract-ocr.github.io/tessdoc/


*The above in diagram form (suggested tesseract workflow ;-) )*

[image: diagram.png]
(diagram PikChr source + SVG attached)


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
     http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> wrote:
> Hi guys, I will start developing OCR using image and PDF to text extraction. What are the steps I need to follow? Can you please refer me to the best model? I already used the pytesseract engine, but I did not get proper extraction...
>
> Best regards,
> Sandhiya

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a92d17a9-4bcf-4ba0-a81c-71e8e08a4afen%40googlegroups.com