Hello,

It is always good to provide some problematic images (better than a
thousand words ;-) ).

For preprocessing: look at ScanTailor. There are several forks with
different improvements, and some of them also provide a CLI version. IMO it
should be able to replace unpaper.

I also recommend checking
https://github.com/ImageProcessing-ElectronicPublications: it is a
great collection of image-processing tools, including implementations of
various thresholding algorithms.
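
To make "thresholding" concrete, here is a minimal sketch of one classic
global method (Otsu), written in plain Python. It is only an illustration
and is not taken from any of the tools above; the function names
otsu_threshold and binarize are made up for this example, and the input is
assumed to be a flat list of 8-bit grey values (0-255).

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    `pixels` is assumed to be a flat iterable of ints in 0..255.
    """
    hist = [0] * 256
    total = 0
    for p in pixels:
        hist[p] += 1
        total += 1

    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0    # sum of grey values in the dark class so far

    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # dark class still empty
        w_light = total - w_dark
        if w_light == 0:
            break               # light class empty from here on
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, t):
    """Map each pixel to 0 (dark, value <= t) or 255 (light)."""
    return [0 if p <= t else 255 for p in pixels]


if __name__ == "__main__":
    # Toy "page": dark ink around 20-30, light paper around 200-220.
    page = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
    t = otsu_threshold(page)
    print("threshold:", t)
    print("dark pixels:", binarize(page, t).count(0))
```

A single global threshold like this tends to break down on unevenly lit
scans, which is why it is worth trying the different (local/adaptive)
algorithms on your own problematic images.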

Zdenko


On Tue, 15 Oct 2024 at 17:46, [email protected] <[email protected]> wrote:

>  I work on corpus research with texts whose scanning quality can be
> abysmal; yet the texts themselves are valuable. Based on my previous
> experience, as well as the comments and complaints that I have noticed, I
> don't think that we will ever be able to fully automate the whole OCR
> process with reliable fidelity. Still, the situation is not entirely
> hopeless, since the human-expert side of it could be "easily" and
> optimally managed through a corpus of known good data curated by experts
> (such as wikipedia and gutenberg.org) and by managing human proofreaders
> through a GUI (directing them exactly to where OCR seems to have gotten
> it wrong, even presenting contextual options to the user, keeping an
> editing history for each text, ...). OCR mistakes which could be
> easily handled based on the context using corpora are: "another" OCRed as
> "mother", and "Andre ?\farie Arnpere" in an equally messy yet hopeful
> context such as "Andre ?\farie Arnpere ( 1775--1836) , professor of
> mathematical analysis and n1echanics at the f::cole Polytechnique".
>
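
As a rough illustration of the corpus-based correction described above,
here is a minimal sketch using only Python's standard difflib module. The
tiny VOCAB list and the suggest() helper are hypothetical stand-ins for a
real word list built from a trusted corpus. Note that it only catches
non-words such as "n1echanics"; a real-word error such as "mother" for
"another" passes a dictionary check and would need context (e.g. word
n-gram statistics), which this sketch does not attempt.

```python
import difflib

# Hypothetical mini vocabulary; in practice this would come from a large,
# trusted corpus (all entries lower-cased).
VOCAB = [
    "professor", "of", "mathematical", "analysis", "and",
    "mechanics", "at", "the", "ampere",
]


def suggest(token, vocabulary, cutoff=0.7):
    """Return the closest vocabulary word for an unknown token.

    Returns None when the token is already a known word or when no
    vocabulary entry is similar enough (difflib ratio below `cutoff`).
    """
    word = token.lower()
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for token in ["professor", "n1echanics", "Arnpere"]:
        print(token, "->", suggest(token, VOCAB))
```

In a proofreading GUI, tokens for which suggest() returns a candidate (or
finds nothing at all) would be the places to send the human reader first.
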
>  I am especially interested in the following aspects:
>  1) options for pre-processing images so that tesseract gives the best
> possible results; since I will be working mostly with scientific texts,
> different font sizes and typefaces, glyphs and mixed-content text
> (texts containing formulas, charts, annotated pictures) must be handled
> well or at least flagged;
>  2) images within the scanned text should be detected and extracted
> separately from the actual text (including the text segments which are
> part of the images, think cartoons):
>
> https://superuser.com/questions/1857597/preferably-linux-based-os-utility-to-extract-images-from-image-based-pdf-file
>  3) related to item 2, tables should also be handled well;
>  4) multilingual texts (which I think tesseract handles well).
> ~
>  An important project such as unpaper (pre-processing of pages to be fed
> to tesseract) was apparently abandoned without any accompanying
> documentation of the mathematical basis of its algorithms:
>
> // __ document algorithms
>
>  https://github.com/unpaper/unpaper/issues/6
> ~
>  For a long time I have noticed complaints about tesseract-ocr's blanket
> assumptions about font size, which make it fail on texts with multiple
> font sizes, such as flyers, and on texts with a curved gradient (either
> for artistic reasons or as an artifact of lousy scanning; in some of the
> texts you can even see the fingers of the person scanning them). I think
> troubleshooting those problems is not that difficult.
>  Given the nature and degree of complexity of the problem at hand, I am
> mostly interested in open, functionally described and well-documented
> step-by-step approaches, not "results".
>  Do you know of any similar prior art?
>  Do you have any experiences to share, or general suggestions regarding
> possible roadblocks that such a project may encounter?
>  My search on:
>
> https://groups.google.com/g/tesseract-ocr/search?q=pre-processing%20unpaper
>  resulted in only 8 hits which were somewhat helpful.
>  lbrtchx
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8f3510a6-f019-4ef8-9a79-0ba86754e2dcn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/8f3510a6-f019-4ef8-9a79-0ba86754e2dcn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
