Then go for the deep learning models. Since you have a dataset, it will be easier and less complex for the word-level text extraction task :)

On Tuesday 13 February 2024 at 11:10:53 UTC+5:30 santhi...@gmail.com wrote:
Word-level extraction only.

On Tuesday 13 February 2024 at 11:10:03 UTC+5:30 Santhiya C wrote:

I had completed the training portion using Tesseract OCR training. After annotating the .box file, it did not change the misspelt character in my output extraction.

I followed only this article: Training Tesseract-OCR with custom data. | by Sai Ashish | Medium
<https://saiashish90.medium.com/training-tesseract-ocr-with-custom-data-d3f4881575c0>

How do I resolve this issue?

On Thursday 8 February 2024 at 10:22:40 UTC+5:30 aromal...@gmail.com wrote:

Are you working on word-level text extraction or sentence-level text extraction?

On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 santhi...@gmail.com wrote:

Can you please tell me the model and the steps?

On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com wrote:

If you are getting started with OCR, try some other engines or just start with some deep learning models and understand the basic working.

On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com wrote:

I already used the above-mentioned steps, but I lost the data.

On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com wrote:

L.S.,

*PDF. OCR. Text extraction. Best language models? Not a lot of success yet...*

🤔

Broad subject. Learning curve ahead. 🚧 Workflow diagram included today.


*Tesseract does not live alone*

Tesseract is an engine: it takes an image as input and produces text output; several output formats are available. If you are unsure, start with HOCR output, as that's close to modern HTML and carries almost all the info tesseract produces during the OCR process.
If it isn't an image you've got, you need a preprocess (and consequently additional tools) to produce images you can feed tesseract. Tesseract is designed to process a SINGLE IMAGE. (Yes, that means you may want to 'merge' its output: postprocessing.)

*To complicate matters immediately: tesseract can deal with "multipage TIFF" images and can accept multiple images to process via its command line. Keep thinking "one page image in, bunch of text out" and you'll be okay until you discover the additional possibilities.*

*Advice Number 1:* get a tesseract executable and invoke it using its command-line interface. If you can't build tesseract yourself, Uni Mannheim may have binaries for you to download and install. Linux distributions often have tesseract binaries and the mandatory language models available as packages, BUT many distributions are more or less far behind the curve: the latest tesseract release as of this writing is 5.3.4: https://github.com/tesseract-ocr/tesseract/releases so VERIFY your rig has the latest tesseract installed. Older releases are older and "previous" for a reason!


*Preprocessing is the chorus of this song*

As you say "PDF", you therefore need to convert that thing to *page images*. My personal favorite is the Artifex mupdf toolkit, using mutool or mudraw / etc. tools from that command-line toolkit to render accurate, high-rez page images.
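To make that first leg concrete, here is a minimal sketch of the render-then-OCR loop driven from Python via subprocess (file names are hypothetical; it assumes the mutool and tesseract executables are installed and on your PATH, and that 300 dpi suits your input, see Advice Number 2 further down):

    # Minimal render-then-OCR sketch. Hypothetical file names; assumes the
    # `mutool` and `tesseract` binaries are installed and on PATH.
    import glob
    import subprocess

    # 1. Render every PDF page to a 300 dpi PNG: page-001.png, page-002.png, ...
    subprocess.run(
        ["mutool", "draw", "-r", "300", "-o", "page-%03d.png", "input.pdf"],
        check=True,
    )

    # 2. OCR each rendered page image; the trailing 'hocr' config makes
    #    tesseract write page-NNN.hocr instead of the default page-NNN.txt.
    for png in sorted(glob.glob("page-*.png")):
        base = png.rsplit(".", 1)[0]      # "page-001.png" -> "page-001"
        subprocess.run(["tesseract", png, base, "-l", "eng", "hocr"], check=True)

One page image in, one OCR result out, exactly as described above; merging those per-page results back together is the postprocessing topic further down.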
Others will favor other means, but it all ends up doing the same thing: anything, PDFs et al., is to be converted to one image per page and fed to tesseract that way. The rendered page images MAY require additional *image preprocessing*:

*This next bit cannot be stressed enough:* tesseract is designed and engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN WHITE BACKGROUND. As I observe everyone and their granny dumping holiday snapshots, favorite CD, LP and fancy colourful book covers straight into tesseract and complaining "nothing sensible is coming out": that's because you're feeding it a load of dung as far as the engine is concerned. It expects BLACK TEXT on PLAIN WHITE BACKGROUND like a regular dull printed page in a BOOK, so anything with nature backgrounds, colourful architectural backgrounds and such is begging for a disaster. And I only empathize with the grannies. <drama + rant mode off/> This is why https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is mentioned almost every week in this mailing list, for example. It's very important, but you'll need more...

The take-away? You'll need additional tools for image preprocessing until you can produce greyscale or B&W images that look almost as if they were plain old boring book pages: no or very little fancy stuff, black text (anti-aliased or not), white background.
Bonus points for you when your preprocess removes non-text image components, e.g. photographs, from the page image: they can only confuse the OCR engine, so when you strive for perfection, that's one more bit to deal with BEFORE you feed it into tesseract and wait expectantly... (Besides, tesseract will have less discovery to do, so it'll be faster too. Of little importance, relatively speaking, but there you have it.)
As also mentioned at https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools of interest re image processing are leptonica (parts of it are used by tesseract, but don't count on it doing your preprocessing for you, as that's a highly scenario/case-dependent activity and therefore not included in tesseract itself). Also check out: OpenCV (a library, not a tool, so you'll need scaffolding there before you can use it), ImageMagick, Adobe Photoshop or its open-source counterpart Krita (great for what-can-I-get experiments but not suitable for bulk), etc. etc.


*Tesseract bliss and the afterglow: postprocessing*

Once you are producing page images as if they were book pages and feeding them into tesseract, you get output, be it "plain text", HOCR or otherwise.

Personally I favor HOCR, but that's because it's closest to what *my* workflow needs. You must look into "postprocessing" anyway: be it additional tooling to recombine the OCR-ed text into a PDF "overlay", PDF/A production, or anything else; advanced usage may require additional postprocessing steps, e.g. pulling the OCR-ed text through a spellchecker+corrector such as hunspell, if that floats your boat.
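If you go the HOCR route, postprocessing typically starts with pulling the recognised words (plus their boxes and confidences) back out of that HTML. A minimal standard-library-only sketch, assuming a file like page-001.hocr from the earlier run and tesseract's usual hOCR markup (each word a span of class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1; x_wconf NN"):

    # Minimal hOCR word extractor (standard library only).
    from html.parser import HTMLParser

    class HocrWords(HTMLParser):
        def __init__(self):
            super().__init__()
            self.words = []        # (text, (x0, y0, x1, y1), confidence)
            self._pending = None   # bbox/conf of the ocrx_word span we're inside

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "span" and attrs.get("class") == "ocrx_word":
                bbox, conf = (0, 0, 0, 0), -1.0
                for part in attrs.get("title", "").split(";"):
                    part = part.strip()
                    if part.startswith("bbox "):
                        bbox = tuple(int(v) for v in part.split()[1:5])
                    elif part.startswith("x_wconf "):
                        conf = float(part.split()[1])
                self._pending = (bbox, conf)

        def handle_data(self, data):
            if self._pending and data.strip():
                bbox, conf = self._pending
                self.words.append((data.strip(), bbox, conf))

        def handle_endtag(self, tag):
            if tag == "span":
                self._pending = None

    parser = HocrWords()
    with open("page-001.hocr", encoding="utf-8") as f:
        parser.feed(f.read())

    for text, bbox, conf in parser.words:
        print(f"{conf:5.1f}  {bbox}  {text}")

From a word list like that you can rebuild lines, feed a spellchecker, or build a searchable text layer; what exactly depends entirely on your workflow.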
You'll also need to get, set up and/or program postprocess tooling if you otherwise wish to merge multiple images' OCR results. You may want to search the internet for this; I don't have any toolkit's name present off the top of my head for that, as I'm using tesseract in a slightly different workflow, where it is part of a custom, *augmented* mupdf toolkit: PDF in, PDF + HOCR + misc document metadata out, so all that preprocessing and postprocessing I hammer on is done by yours truly's custom toolchain. Under development, so I'm not working with the diverse Python stuff most everybody else will dig up after a quick google search, I'm sure. Individual projects' requirements differ, so your path will only be obvious to you.


*How to troll an OCR engine* 😋

Oh, before I forget: some peeps drop shopping bills and such into off-the-shelf tesseract: *cute*, but not anything like a "plain printed book page", so they encounter all kinds of "surprises": https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is important but it doesn't tell you *everything*. "Plain printed book pages" are, by general assumption, pages of text, or, more precisely: *stories*. Or other tracts with paragraphs of text. Bills, invoices and other financial stuff are not just "tabulated semi-numeric content" instead of "paragraphs of text"; those types of inputs also fail grade F regarding the other implicit assumption that comes with human "paragraphs of text": the latter are series of words, technically each a bunch of alphabet glyphs (*alpha*numerics), while financials often mix currency symbols and numeric values. While these were part of tesseract's training set, I am sure, they are not its focal point, hence they have been given less attention than the words in your language dictionary. And scanning those SKUs will fare even worse, as they're just jumbled *codes*, rather than *language*. Consequently you'll need to retrain tesseract if your CONTENT does not suit these mentioned assumptions re "plain printed book page". Haven't done that yet myself; it's not for the faint of heart, and since Google did the training for the "official" tesseract language models everyone downloads and uses, you can bet your bottom retraining isn't going to be "nice" for the less well funded either. Don't expect instant miracles, and expect a long haul when you decide you must go this route [of training tesseract], or you will meet Captain Disappointment. Y'all have been warned. 😉


*Why your preprocess is more important than kickstarting tesseract by blowing ether* up its carburetor*

*Why is that "a plain printed book page is like human stories and similar tracts: paragraphs of text" mantra so important?* Well, tesseract uses a lot of technology to get the OCR quality it achieves, including using language dictionaries.
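(For the curious: the "explicit" switches referred to in the next paragraph are tesseract config variables such as load_system_dawg and load_freq_dawg; a hedged sketch of turning them off through pytesseract, which, as explained below, still leaves the implicitly learned language bias in place:)

    # Sketch only: disable tesseract's *explicit* word/frequency dictionaries.
    # The LSTM's implicitly learned language bias remains either way.
    import pytesseract
    from PIL import Image

    img = Image.open("receipt.png")   # hypothetical input image
    text = pytesseract.image_to_string(
        img,
        lang="eng",
        config="-c load_system_dawg=0 -c load_freq_dawg=0",
    )
    print(text)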
While some smarter people will find switches in tesseract where *explicit* dictionary usage can be turned off, you cannot switch off the *implicit* use, due to how the latest and best core engine, LSTM+CTC (since tesseract v4), actually works: it slowly moves its gaze across each word it is fed (jargon: an *image segmentation* preprocess inside tesseract produces these word images), and LSTM is so good at recognizing text because it has "learned context": that context being the characters surrounding the one it is gazing at right now. Which means LSTM can be argued to act akin to a *hidden Markov model* (see Wikipedia) and thus will deliver its predictions based on what "language" (i.e. *dictionary*) it was fed during training: human text of the kind used in professional papers and stories. Dutch VAT codes didn't feature in the training set, as one member of the mailing list discovered a while ago. Financial amounts, e.g. "EUR7.95", are also not prominently featured in the LSTM's training, so you can now guess the amount of confusion the LSTM will experience when scanning across such a thing: reading "EUR" has it expect "O" with high confidence, as "eur" obviously leads to the word "euro", but what the heck is that "digit 7" doing there?! That's *highly* unexpected, hence OCR probabilities drop, cross decision-making thresholds, and you get WTF results, simply because the engine went WTF *first*.
Ditto story/drama for calligraphed signs outside shops, and, *oh! oh!, license plates*!! (google LPR/ALPR if you want any of that) and *anything else* that's *not* reams of text and thus you wouldn't expect to find in a plain story- or textbook.
(And for the detail-oriented folks: yes, tesseract had/has a module on board for recognizing math, but I haven't seen that work very well with my inputs, and I haven't seen a lot of happy noises out there about it either, although the Google engineer(s) surely must have anticipated OCRing that kind of stuff alongside paragraphs of text. For us mere mortals, I'd consider this bit "a historic attempt" and forget about it.)


*Advice Number 2:* when rendering page images, the ppi (pixels per inch) resolution to select would best be adjusted to produce regular lines of text in those images where the capital height of the text is around 30 pixels. Typography people would rather refer to *x-height*, so that would be a little lower in pixel height. Line height would be larger, as that includes stems and interline spacing. However, from an OCR engine perspective, these (x-height & line-height) are very much dependent on the font used and the page layout used, so they are more variable than the reported optimal capital-D height at ~32px. As no one measures this up front, as an initial guess, 300 dpi in the render/print-to-image dialog of your render tool of choice would be a reasonable start, but when you want more accuracy, tweaking this number can already bring some quality gains.
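As a back-of-the-envelope aid for that tweaking (a sketch, assuming you know or can guess the body text's point size; the ~0.7 cap-height-to-point-size ratio is a typical value and very much font dependent):

    # Estimate the render DPI needed to reach ~30 px capital height.
    # Assumptions: capital height ~= 0.7 x nominal point size (font dependent),
    # and 1 pt = 1/72 inch.
    def render_dpi_for_cap_height(point_size: float,
                                  target_cap_px: float = 30.0,
                                  cap_ratio: float = 0.7) -> float:
        cap_height_inch = cap_ratio * point_size / 72.0
        return target_cap_px / cap_height_inch

    for pt in (8, 10, 12):
        print(f"{pt:>2} pt body text -> render at roughly {render_dpi_for_cap_height(pt):.0f} dpi")

Note how 10 pt body text lands near the oft-quoted 300 dpi starting point, while small print wants noticeably more.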
Of course, when the source is (low-rez) bitmap images already (embedded in PDF or otherwise), there's little you can do, but then there's still scaling, sharpening, etc. image preprocessing to try. This advice is driven by the results published here: https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ (and google already quickly produced one other person who does something like that and published a small bit of tooling: https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )

*) the old-fashioned way to see if a rusty engine will still go (or blow, alas). Replace with "SEO'd blog pages extolling instant success with ease" to take this into the 21st century.


*The mandatory readings list:*

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
- https://tesseract-ocr.github.io/tessdoc/


*The above in diagram form (suggested tesseract workflow ;-) )*

[image: diagram.png]
(diagram PikChr source + SVG attached)


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> wrote:

Hi guys, I will start developing OCR using image and PDF to text extraction. What are the steps I need to follow? Can you please refer me to the best model? I already used the pytesseract engine, but I did not get proper extraction...

Best regards,

Sandhiya