Guys, I am about to start a project that will mainly improve tesseract's ability to recognize text, regardless of input document/image quality. I am securing some grants/funding.
If you're interested, let me know.
Cheers

On Saturday 8 June 2024 at 05:37:41 UTC+12 [email protected] wrote:

> Hello Ger, and thank you for responding.
>
> Regarding training and/or tuning - I definitely don't have the available computing power for a full train, and assuming I'm understanding the requirements (specifically the 1000 images minimum thing) I'm not sure I have enough data for a tune (it's approximately 230 pages that use this font, with only about 50% text coverage on the more dense pages; the rest are non-OCR pictures, so even if the 1000 images are single-line images, I'm not sure I'd get there). I also have no idea what the font is. I suspect it's one that isn't available to the public (without a hefty fee), so generating new very clean images isn't possible either (if it's possible to tune using one font and have it apply to others that aren't visually similar, that might actually be an option).
>
> So, we're back to manually fixing after the OCR run and/or using graphics software to further "fix" the images before processing. I could open the hocr files in my text editor and "fix" commas that are read as periods, quotes that aren't quite correct and even super/sub fractions. Generating the bounding boxes when whole words are simply ignored due to uneven lighting (even though they are in the input image thanks to running a thresholding algorithm before being handed to tesseract) is something I haven't figured out how to do. (If you happen to know how to use The GIMP to selectively darken overexposed areas, that might help a lot.) Alternatively, is there a way to do a two run recognition? Something akin to a non-persistent tune - do one run to a text file, manually correct the text file, and have the second run to hocr use that text file as the dictionary to use for that run.
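A rough sketch of the "two run" idea above, assuming the manually corrected first-pass text is available as a plain string. It only builds a one-word-per-line list; the helper names `build_user_words` and `write_user_words` and the tokenisation rule are hypothetical. The tesseract CLI does accept a `--user-words PATH` option, but how much the LSTM engine actually honours such a list is something to test on your own pages rather than assume.

```python
import re


def build_user_words(corrected_text, min_length=2):
    """Return the unique words of a corrected first-pass transcript, sorted.

    Hypothetical helper: the tokenisation rule is an assumption, chosen to
    keep apostrophes and hyphens inside words and to drop bare punctuation.
    """
    # \w matches Unicode letters and digits, so accented words survive.
    tokens = re.findall(r"\w+(?:['’-]\w+)*", corrected_text)
    # Skip very short tokens and pure numbers; they add little to a word list.
    words = {t for t in tokens if len(t) >= min_length and not t.isdigit()}
    return sorted(words)


def write_user_words(words, path):
    """Write one word per line to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


if __name__ == "__main__":
    sample = "the baker's half-dozen, the 12 loaves"
    print(build_user_words(sample))
    # A second run could then look like this (illustrative file names, not run here):
    #   tesseract page.png page --user-words corrected.user-words hocr
```

The list is deliberately dumb: it carries no positions and no frequencies, so it can only nudge word choices, not recover words that were never detected because of lighting.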
>
> Biggest problem I am experiencing with manual correction: generating or fixing - mostly expand, sometimes contract - the bounding boxes after entering the correct characters, when what was recognized has the wrong metrics for what is supposed to be there.
>
> Second biggest problem (which if possible should be fixed first): I need an additional preprocessing step to fix uneven lighting. I have available for use Rawtherapee and The GIMP (I was able to fix overexposure, but that darkened everything equally; I need a way to spot darken the regions that received more light during scanning, as those regions are the ones that are most likely to not get recognized at all).
>
> On Mon, Jun 3, 2024, 17:06 Ger Hobbelt <[email protected]> wrote:
>
>> - "These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data."
>>
>> Well, to put it bluntly, diving into the rabbit hole without a helmet nor a 'chute: as far as I have been able to discover, the current "official" tesseract training data "databases" (neural net matrices) that are used to recognize anything we throw at tesseract have been produced ("trained") at google by Ray Smith, using copious hardware from google I expect -- training neural nets is no joy at the average Joe's hardware budget, after all. When you dig through the git commits, such as https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the last training file *content* update was back in '17 by @theraysmith and he hasn't been around much since: https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 -- without any hard data, my initial guess is a change of corporate google mind re tesseract.
>>
>> Stefan Weil et al have done a ton of important work since, but when you ask "what can this baby recognize?" that translates 1:1 to "what has tesseract been trained to recognize?"
>> and there... things get a little vague for me. I'd love to be corrected on this, slapped on the wrist or worse, but from what I've gleaned so far during my research:
>>
>> - though there's https://github.com/tesseract-ocr/langdata , https://github.com/tesseract-ocr/tesstrain , https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray Smith's public notes and papers about what was done for tesseract v4/v5 at https://github.com/tesseract-ocr/docs (which is separate from https://github.com/tesseract-ocr/tessdoc, which is more user oriented instead of architectural background), I am not confident that the actual list of training files used to produce those master traineddata LSTM files (= tesseract v4/v5 OCR engine) is checked into git: I have seen a list of font names used some place in there (or was it the mailing list?), but for anyone who works with fonts that already is a handwavey kinda thing and, yes, copyrights, yadayada, will forever prevent something more precise from being available, because the list most certainly included commercial fonts. Then there's also the training input files defining the "text lines" to be rendered as training material: those actually determine which glyphs in the fonts will be trained at all (and in what combinations). And there I am not feeling confident either, as it looks like the files published are the ones from the older v3 engine: still relevant, but *probably* not what Ray was using to produce those many traineddata files he did at the google shop.
>>
>> Having dug through the git histories, inspected the various files, scripts and notes about 2 years ago, I cannot say with complete confidence whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the training set for English back then.
>> My *guess* is that they have been included, if only as a few samples, so the neural net will have *some* recollection of them, but I also expect them to have "featured little" in the total training process, so recognition chances are reduced.
>>
>> (Aside: As we focus on the English language training set here, I didn't mention the metric ton of work done by @Shreeshrii for Asian scripts, particularly Devanagari and related, a few years later. As far as I can tell, most of the `traineddata` scripts and process today are due to @Shreeshrii's work and Stefan Weil's, who, if you look over there, you'll note has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, from the time when the Germans had a fondness for printing everything in (to my eyes) quite hard to read blackletter fonts. To make that feat happen, he and the university team (of several German unis together, if I read what was done right) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language: "frk" = Fraktur, if I recall the code correctly; "fra" is French). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language, few fonts vs. the 400+ allegedly used in the google production run) but you can bet your bottom dollar it wasn't cheap! Or quick!)
>>
>> When we pop out of the rabbit hole of tesseract history, we might now better understand why your problem is answered... haphazardly:
>>
>> - general advice number 1 out there is to 'tune' a language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother thus far.
>> This "tuning" advice is basically training advice to do a little extra training, which is, to me, a little hairy as you are expected to not deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net, as it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather more *paraphrasing* the engine instead of pointing at its carburetor, spark plugs, etc., if you get my drift.)
>>
>> Logically, this approach is met with varying success (and crushed hopes) as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a T+M superscript, while the fractions, being a combo of superscript, subscript and a / slash, might be doable or hard for the LSTM+CTC engine; I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5 minute thing to do. Which explains another type of silence around here.
>>
>> - if that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this says is that, basically, the attempt is to wipe just one of the many layers of the LSTM+CTC neural net, the one where it is expected to 'conclude' things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'..."
>> and hope that the lopping-off-and-retraining suffices to get acceptable training results after running the training for a while (& checking you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!)
>> This takes rather more time than "tuning" as you must now retrain at least an entire layer, while tuning was only intended to have the training activity result in a few cell connections in there being tweaked a little to get what you wanted.
>>
>> - general advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of creating font(s) and text line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for some considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, from the noises I recall from the last couple of years lurking on this mailing list.
>>
>> By now you'll surely have gotten the gist of it: from the distance of a mailing list POV, it's all a guess and there's so many little details involved to arrive at success that almost nobody dares venture saying much, at least not all at once. Because this stuff is *hard* to get right and the above can be a cause for scare with some folks.
>>
>> Me personally, I tried my hand at "tuning" a little about a year ago and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions that would differ from joining a crap shoot blindfolded. But that is me and I am not into the adrenalin rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.
>>
>> Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way which can keep tesseract as-is, and add a review/correction post-process that is acceptable for you. If you find it in your heart to accept that a little copy-editing after the OCR actions is A-okay, you are probably better off, both in time spent and frustration with machines' ways. After all, the initial setup cost for this option is much less for single-person shops, I expect. ;-) (The break-even would be a fairly large number of pages to process...)
>>
>> - "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner)"
>>
>> Personal experience to date is that image preprocessing is a "field of active research" (i.e. you need to try and test all your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do well/best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try to do a tesseract run on an image you posted earlier, sometime tomorrow, at various resize sizes to see what comes out of that one.)
>>
>> Ditto for post-processing: it might be useful, if the content is important enough to you, to dump it into a word processor / text editor with spellchecker on board for further assistance. A manual review process of some kind is called for, anyway, if you want consistent (very) high quality output.
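The rescaling experiment above boils down to simple arithmetic, sketched here under assumptions: you measure the text height on the scan yourself (which measure to use, e.g. capital height, is left open here), and the list of target heights is an arbitrary spread around the 30-33 pixel figure. The function names are hypothetical; the actual resizing is left to whatever image tool you prefer.

```python
def scale_for_text_height(measured_px, target_px=32.0):
    """Scale factor that brings text of `measured_px` height to `target_px`."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    return target_px / measured_px


def resize_candidates(width, height, measured_px, targets=(24, 28, 30, 33, 36, 40)):
    """Map each target text height to the (width, height) to resize the page to."""
    sizes = {}
    for target in targets:
        factor = scale_for_text_height(measured_px, target)
        sizes[target] = (round(width * factor), round(height * factor))
    return sizes


if __name__ == "__main__":
    # Hypothetical page: a 2480x3508 px scan whose capitals measure about 48 px.
    for target, size in resize_candidates(2480, 3508, 48).items():
        print(f"text height {target}px -> resize to {size[0]}x{size[1]}")
```

Running tesseract once per candidate size and comparing the outputs side by side is the "experiment!" part; nothing here predicts which size wins.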
>>
>> There's also processors/tools that can do "smart quotes" if you like, but I would reserve that for last; my initial approach there would be to have the OCR engine spit out quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted. French quotes would potentially be easier to OCR that way (as they appear at different vertical offsets) but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts and intricate little typography details like "smart quotes" are rather font specific, so recognizing them from an OCR engine's perspective screams "tuning! dedicated font training!" and a little headache starts to develop over here. ;-)
>>
>> - "Slightly related, how, exactly, do y'all deal with drop caps?"
>>
>> Errrrm, AFAICT.... we don't. Apologies. Seriously though, I don't recall any positive success info on that one.
>>
>> Here my initial gut response is to "recognize" the drop caps in preprocessing, i.e. in the "image segmentation phase", and cut them out specifically to have them extracted, rescaled to a sensible "regular text size" and only then fed into the OCR engine. Afterwards the output then has to be recombined with the text output of the rest of the image segments. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible dropcaps and segment and process them as I just described. Which means that today, you either do that up front and do the recombining afterwards in your own custom postprocess, or you decide to accept a little extra editorial post work by either keeping them in as-is (and expecting errors or at least uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them due to spelling correction.
>> Anyway, this is all currently non-existent, alas, so anything you come up with is better than what is, today.
>>
>> (I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)
>>
>> The 'tesseract does best with 30-33 pixel high text' stuff is at:
>> - https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
>>
>> I wrote https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ a while ago; maybe the diagram in there and some paragraphs there aid understanding what's going on under the hood, which is info I think you need, like I did/do.
>>
>> Take care,
>>
>> Ger
>>
>> P.S.: it was lying around for a gander, but my tesseract is buggered ATM. Anyway, I installed an "official distro" one yesterday for other purposes and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.
>>
>> On Monday, May 20, 2024 at 5:02:24 AM UTC+2 [email protected] wrote:
>>
>>> I've asked a couple different times, and each time I get just a little bit more information, but still not enough to work with.
>>>
>>> I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner, I'm working on that to re-scan but there are some problems that still wouldn't be fixed). These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data.
>>> Example characters not being recognized include fractions ( ⅛ ⅔ instead of 1/8 or 2/3), the TM ( ™ ) or C ( © ) symbols (the latter is actually in Latin-1, but isn't directly typeable and, from what I've been able to tell, the circled part comes out so faint on the input image that tesseract thinks it is noise) and "smart" or curly quotes - all characters that require using alt+ codes, insert special character dialogs or letting your wordprocessor/DTP handle converting for you. Which seems to mean they require some level of manual review and correction to be able to get them into the text output. BUT, once you see you need to input manually, how do you handle the positioning data (when working in hocr format)? I considered, briefly, using character whitelisting to help with these, but that would imply the characters are already included in the character set/wordlist, which, if memory serves, many of these aren't?
>>>
>>> Slightly related, how, exactly, do y'all deal with drop caps?
>>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/dc048b53-0767-4167-9976-819d2a2e0d8fn%40googlegroups.com.
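On the positioning-data question raised in the thread: in hOCR output each word's box lives in the element's `title` attribute as `bbox x0 y0 x1 y1`, usually followed by other properties such as `x_wconf`. A minimal sketch of the box arithmetic for hand corrections follows; the helper names are hypothetical, and the "gap between neighbours" rule is only a heuristic that assumes all three words sit on the same text line.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in {title!r}")
    return tuple(int(value) for value in match.groups())


def union(a, b):
    """Smallest box covering both a and b (e.g. to expand a too-small box)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def gap_between(left, right):
    """Heuristic box for a word missed between two recognized neighbours."""
    return (left[2], min(left[1], right[1]), right[0], max(left[3], right[3]))


def with_bbox(title, box):
    """Return `title` with its bbox replaced; other properties are left alone."""
    return BBOX_RE.sub("bbox %d %d %d %d" % tuple(box), title, count=1)


if __name__ == "__main__":
    title = "bbox 36 92 96 116; x_wconf 95"
    wider = union(parse_bbox(title), (30, 90, 100, 118))
    print(with_bbox(title, wider))
```

The box from `gap_between` spans the whole gap including inter-word spacing, so you may want to shrink it by a few pixels on each side before writing it back into the file.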

