I work on corpus research with texts whose scanning quality can be abysmal, yet the texts themselves are valuable. Based on my previous experience, as well as the comments and complaints I keep noticing, I don't think we will ever be able to fully automate the whole OCR process with reliable fidelity. In a sense, though, the situation is not hopeless: the human-expert side of it could be "easily" and optimally managed through a corpus of known-good data minded by experts (such as Wikipedia and gutenberg.org) and through a GUI that directs eyeballing human agents exactly to where OCR seems not to have got it right, presents contextual options to the user, keeps an editing history for each text, and so on. OCR mistakes that could be easily handled from context using such corpora include "another" OCRed as "mother", and "Andre ?\farie Arnpere" in an equally messy yet hopeful context such as "Andre ?\farie Arnpere ( 1775--1836) , professor of mathematical analysis and n1echanics at the f::cole Polytechnique".
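
To make the corpus-based correction idea a bit more concrete, here is a minimal sketch in Python (the corpus file name is made up and the cutoff is just a placeholder; a real tool would also weigh the surrounding words, which is what the "another"/"mother" case really needs, rather than edit distance alone):

#!/usr/bin/env python3
"""Suggest replacements for suspicious OCR tokens from a known-good corpus."""
import difflib
import re
from collections import Counter
from pathlib import Path

def load_vocabulary(corpus_path):
    # Word-frequency table built from known-good plain text (e.g. Gutenberg files).
    text = Path(corpus_path).read_text(encoding="utf-8", errors="ignore").lower()
    return Counter(re.findall(r"[a-z]+", text))

def suggest(token, vocab, n=3, cutoff=0.7):
    # Up to n close matches for an OCRed token, re-ranked so the most frequent
    # corpus word comes first; this is where the contextual options shown to the
    # eyeballing human agent in the GUI would come from.
    candidates = difflib.get_close_matches(token.lower(), vocab.keys(), n=n, cutoff=cutoff)
    return sorted(candidates, key=lambda w: -vocab[w])

if __name__ == "__main__":
    vocab = load_vocabulary("gutenberg_sample.txt")   # hypothetical corpus file
    for garbled in ["Arnpere", "n1echanics"]:
        print(garbled, "->", suggest(garbled, vocab))
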
I am especially interested in the following aspects:

1) options while pre-processing images in order to make tesseract's work optimal; since I will be working mostly with scientific texts, different font sizes and kinds of fonts, glyphs and multi-encoded text (texts containing formulas, charts, annotated pictures) must be handled well or at least flagged (a rough pre-processing sketch follows at the end of this message);

2) images embedded in the visual text should be spotted and extracted separately from the actual text, including the text segments that are part of the images (think cartoons); see
https://superuser.com/questions/1857597/preferably-linux-based-os-utility-to-extract-images-from-image-based-pdf-file
and the extraction sketch below;

3) relating to §2, tables should also be handled well (a table-flagging sketch is below as well);

4) multilingually encoded texts (which I think tesseract handles well).

~ An important project such as unpaper (pre-processing of pages to be fed into tesseract) was apparently abandoned without any accompanying documentation of the mathematical basis of its algorithms:
// __ document algorithms: https://github.com/unpaper/unpaper/issues/6

~ For a long time I have noticed complaints about tesseract-ocr's blanket assumptions about font size, which make it fail on texts with multiple font sizes, such as flyers, and on texts with a curved gradient (either artistic or partly an artifact of lousy scanning; on some of the texts you can even see the whole fingers of the agent scanning them). I don't think troubleshooting those problems is that difficult.

Given the nature and degree of complexity of the problem at hand, I am mostly interested in open, functionally described and well-documented step-by-step approaches, not just "results". Do you know of any similar prior art? Any shared experiences or general suggestions regarding road blocks such a project may run into?

My search on
https://groups.google.com/g/tesseract-ocr/search?q=pre-processing%20unpaper
resulted in only 8 hits, which were somewhat helpful.

lbrtchx
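
To make the kind of step-by-step approach I have in mind more concrete, here are the rough sketches mentioned above. First, pre-processing before tesseract (a minimal sketch assuming OpenCV and pytesseract; the upscale factor, the Otsu binarization and the page-segmentation mode are placeholders to be tuned per corpus, and the file name is made up):

#!/usr/bin/env python3
"""Rough page pre-processing before handing a scan to tesseract."""
import cv2
import pytesseract

def preprocess(path, scale=2.0):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Upscale so that small print (footnotes, formula indices) keeps enough pixels;
    # tesseract tends to behave better around 300 dpi.
    gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    # Light denoising, then Otsu's global threshold for a clean black-on-white page.
    # Deskewing/dewarping of the kind unpaper used to do would slot in right here.
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    _, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

if __name__ == "__main__":
    page = preprocess("scan_page_017.png")   # hypothetical file name
    # --psm 3 is tesseract's fully automatic page segmentation; --psm 4 or 6 are
    # worth trying on pages where the blanket layout assumptions fail.
    print(pytesseract.image_to_string(page, config="--psm 3"))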

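Second, pulling the embedded raster images out of an image-based PDF, for the §2 / superuser case above (a sketch assuming pdfimages from poppler-utils is installed; note that spotting figure regions inside a page that is itself one big scan image would still need a layout-analysis step on top of this, so that text which is part of a picture stays with the picture):

#!/usr/bin/env python3
"""Extract the embedded images of an image-based PDF with poppler's pdfimages."""
import subprocess
from pathlib import Path

def extract_images(pdf_path, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # -all keeps each image in its native format (JPEG, JBIG2, ...);
    # -p puts the page number into the output file names.
    subprocess.run(["pdfimages", "-all", "-p", str(pdf_path), str(out / "img")],
                   check=True)
    return sorted(out.iterdir())

if __name__ == "__main__":
    # Both file names here are made up.
    for img in extract_images("scanned_volume.pdf", "extracted_images"):
        print(img)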

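Third, table-ish regions and multilingual pages (a sketch using tesseract's word-level boxes via pytesseract.image_to_data; the row grouping is a deliberately crude heuristic that only shows where the data would come from, the '+'-joined language codes cover the multilingual case, and the file name is made up):

#!/usr/bin/env python3
"""Group tesseract's word boxes into rough rows and read multilingual pages."""
import cv2
import pytesseract
from pytesseract import Output

def words_with_boxes(path, langs="eng+fra"):
    # '+'-joined language codes make tesseract load several models at once.
    img = cv2.imread(path)
    data = pytesseract.image_to_data(img, lang=langs, output_type=Output.DICT)
    return [
        {"text": t, "left": l, "top": tp, "conf": float(c)}
        for t, l, tp, c in zip(data["text"], data["left"], data["top"], data["conf"])
        if t.strip()
    ]

def group_rows(words, tolerance=8):
    # Words whose tops fall into the same small vertical bucket are treated as one
    # "row"; real table reconstruction also needs column and ruling-line detection.
    rows = {}
    for w in sorted(words, key=lambda word: word["top"]):
        rows.setdefault(round(w["top"] / tolerance), []).append(w)
    return [sorted(r, key=lambda word: word["left"]) for r in rows.values()]

if __name__ == "__main__":
    for row in group_rows(words_with_boxes("table_page.png")):
        print("  ".join(w["text"] for w in row))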