Re: [tesseract-ocr] Re: Using corrected text in second pass

2021-02-21 Thread Tom Morris
For alignment you're probably thinking of Burrows-Wheeler:
https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

There's a more fully worked, and more topical, example in ReTAS:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id
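
For anyone who wants to see what "alignment" means here in the simplest case, below is a minimal sketch using Python's standard-library difflib. To be clear, this is neither Burrows-Wheeler nor the ReTAS method; it is a plain word-level diff swapped in for illustration. The function name align_words and the sample strings are made up, and a whole-book alignment would likely need something more scalable.

```python
import difflib


def align_words(ocr_text, corrected_text):
    """Word-level alignment of OCR output against a corrected text.

    Hypothetical helper for illustration only: it uses a generic diff,
    not the Burrows-Wheeler or ReTAS approaches mentioned above.
    Returns a list of (tag, ocr_words, corrected_words) tuples, where
    tag is one of 'equal', 'replace', 'delete', 'insert'.
    """
    ocr_words = ocr_text.split()
    good_words = corrected_text.split()
    # autojunk=False turns off the heuristic that treats very frequent
    # items as junk on long sequences (common words would qualify).
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    return [
        (tag, ocr_words[i1:i2], good_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    # Made-up sample strings standing in for one OCR line and its correction.
    for tag, ocr, good in align_words("the qnick brown fox", "the quick brown fox"):
        if tag != "equal":
            print(tag, ocr, "->", good)
```

Each non-equal entry pairs a run of OCR words with the corrected words at the same position, which is the kind of pairing a second pass would need.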

Re: [tesseract-ocr] Re: Using corrected text in second pass

2021-02-19 Thread Graham Seaman
Thanks Tom - I probably shouldn't have given the Gutenberg example since it introduces extra problems. In my actual process at the moment I have the source scans, OCR output texts, and corrected text files produced by myself, so there are fewer variables to worry about. In particular, page division

[tesseract-ocr] Re: Using corrected text in second pass

2021-02-19 Thread Tom Morris
On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net wrote:
> There are lots of pdfs of scanned books around which include moderately
> good ocr-ed text (eg on archive.org).

OCR quality varies widely (even wildly) across scans and vintages of OCR, so it's worth checkin