I work on corpus research with texts whose scanning quality may be 
abysmal, yet the texts themselves are valuable. Based on my previous 
experience, as well as the comments and complaints I keep noticing, I don't 
think we will ever be able to fully automate the whole OCR process with 
reliable fidelity. Still, the situation is not entirely hopeless, since the 
human-expert aspect of it could be "easily" and optimally managed through a 
corpus of known-good data curated by experts (such as Wikipedia and 
gutenberg.org) and through guiding proofreading human agents via a GUI 
(directing them exactly to the places where the OCR seems to have gotten it 
wrong, presenting contextual options to the user, keeping an editing 
history for each text, ...). OCR mistakes that could easily be handled from 
context using such corpora include "another" OCRed as "mother", and "Andre 
?\farie Arnpere" in an equally messy yet hopeful context such as "Andre 
?\farie Arnpere ( 1775--1836) , professor of mathematical analysis and 
n1echanics at the f::cole Polytechnique".
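
 As a very rough illustration of the corpus-driven flagging I have in mind, 
here is a minimal Python sketch (the word-list path and the sample strings 
are placeholders, and it only uses the standard library's difflib); it 
flags tokens that are not in a known-good vocabulary and proposes close 
matches a human reviewer could pick from. Real-word errors such as "mother" 
for "another" would of course need a context model (e.g. n-gram counts from 
the same corpora) on top of this:

import difflib
import re

def load_vocabulary(path):
    # One word per line, e.g. extracted from Gutenberg/Wikipedia dumps;
    # the path and file format here are assumptions.
    with open(path, encoding="utf-8") as f:
        return sorted({line.strip().lower() for line in f if line.strip()})

def flag_ocr_suspects(text, vocab, cutoff=0.8):
    # Yield (token, suggestions) for tokens missing from the known-good
    # vocabulary, with close matches a reviewer could choose from in a GUI.
    known = set(vocab)
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", text):
        if token.lower() in known:
            continue
        yield token, difflib.get_close_matches(token.lower(), vocab,
                                               n=3, cutoff=cutoff)

if __name__ == "__main__":
    vocab = sorted({"another", "mother", "professor", "of", "and",
                    "mathematical", "analysis", "mechanics", "polytechnique"})
    sample = "professor of mathematical analysis and n1echanics"
    for tok, suggestions in flag_ocr_suspects(sample, vocab, cutoff=0.7):
        print(f"suspect: {tok!r} -> {suggestions}")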

 I am especially interested in the following aspects:
 1) options for pre-processing images in order to make tesseract's work 
optimal; since I will be working mostly with scientific texts, different 
font sizes and font families, special glyphs and mixed-encoding text (texts 
containing formulas, charts, annotated pictures) must be handled well or at 
least flagged (see the pre-processing sketch after this list);
 2) images embedded in the scanned pages should be detected and extracted 
separately from the actual text (including the text segments that are part 
of the images, think cartoons):
 
https://superuser.com/questions/1857597/preferably-linux-based-os-utility-to-extract-images-from-image-based-pdf-file
 3) relating to §2, tables should also be handled well;
 4) texts mixing several languages and scripts (which I think tesseract 
already handles well).
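
 Regarding 1) above, here is a minimal sketch of the kind of pre-processing 
I mean, assuming OpenCV (cv2), numpy and pytesseract are available and that 
Otsu binarization plus a crude global skew estimate are good enough for a 
given scan ("page.png" is just a placeholder); every threshold and 
heuristic in it would need tuning per corpus:

import cv2
import numpy as np
import pytesseract

def preprocess(path):
    # Grayscale -> Otsu binarization -> deskew; only one of many possible
    # pipelines, and the deskew step below is a common recipe, not
    # something tesseract itself requires.
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu picks the binarization threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Estimate the page skew from the minimum-area rectangle around the ink
    # (angle conventions differ across OpenCV versions, so double-check).
    ink = np.column_stack(np.where(binary == 0)).astype(np.float32)
    angle = cv2.minAreaRect(ink)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = binary.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, rot, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

if __name__ == "__main__":
    page = preprocess("page.png")  # placeholder file name
    print(pytesseract.image_to_string(page, lang="eng"))

 Formulas, figures and tables would of course still need to be segmented 
out (or at least flagged) around such a step, which is exactly the part I 
expect to need the most manual work.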
~
 An important project such as unpaper (pre-processing of pages to be fed 
to tesseract) was apparently abandoned without accompanying documentation 
of the mathematical basis of its algorithms:

// __ document algorithms

 https://github.com/unpaper/unpaper/issues/6
~
 For a long time I have noticed complaints about tesseract-ocr's blanket 
assumptions about font size, which make it fail on texts mixing several 
font sizes, such as flyers, and on texts with curved or warped lines 
(either by artistic choice or as an artifact of sloppy scanning; on some 
of the scans you can even see the fingers of the person scanning them). 
I don't think troubleshooting those problems is that difficult.
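
 For instance, as a first troubleshooting step one could look at the spread 
of word heights that tesseract itself reports, and re-OCR regions with very 
different type sizes separately (possibly with a different --psm value). A 
minimal sketch, assuming pytesseract and Pillow are installed ("flyer.png" 
and the 10 px bucket size are just placeholders/assumptions):

from collections import defaultdict

from PIL import Image
import pytesseract
from pytesseract import Output

def words_by_height(path):
    # Group recognized words by bounding-box height as a crude proxy for
    # font size; a wide spread suggests the page should be split and
    # re-OCRed region by region.
    data = pytesseract.image_to_data(Image.open(path), output_type=Output.DICT)
    buckets = defaultdict(list)
    for word, height, conf in zip(data["text"], data["height"], data["conf"]):
        if word.strip() and float(conf) > 0:
            buckets[int(height) // 10 * 10].append(word)
    return buckets

if __name__ == "__main__":
    for bucket, words in sorted(words_by_height("flyer.png").items()):
        print(f"~{bucket}px tall: {len(words)} words, e.g. {words[:5]}")
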
 Given the nature and complexity of the problem at hand, I am mostly 
interested in open, functionally described and well-documented 
step-by-step approaches, not just "results".
 Do you know of any similar prior art?
 Any shared experiences or general suggestions regarding possible 
roadblocks that such a project may run into?
 My search on:
 https://groups.google.com/g/tesseract-ocr/search?q=pre-processing%20unpaper
 resulted in only 8 hits, which were somewhat helpful.
 lbrtchx
