For alignment you're probably thinking of
Burrows-Wheeler: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
There's a more fully worked, and more topical, example in ReTAS:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id
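As a rough illustration of what word-level alignment between OCR output and a corrected text looks like, here is a minimal sketch using Python's standard `difflib`. It is neither Burrows-Wheeler nor the ReTAS method, only a simple stand-in; the function name `align_words` and the sample strings are made up for the example.

```python
from difflib import SequenceMatcher


def align_words(ocr_text, corrected_text):
    """Align two texts word by word.

    Returns a list of (tag, ocr_words, corrected_words) tuples, where tag
    is one of 'equal', 'replace', 'delete' or 'insert'.
    """
    ocr = ocr_text.split()
    ref = corrected_text.split()
    # autojunk=False turns off difflib's heuristic that treats very
    # frequent items as junk on long sequences.
    matcher = SequenceMatcher(None, ocr, ref, autojunk=False)
    return [
        (tag, ocr[i1:i2], ref[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Hypothetical sample strings, not taken from any real scan.
pairs = align_words("Tbe quick hrown fox", "The quick brown fox")
for tag, ocr_words, ref_words in pairs:
    print(tag, ocr_words, ref_words)
```

The 'replace' spans are the places where the OCR text and the corrected text disagree. On whole books this simple approach can get slow, which is one reason to look at purpose-built tools.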
Thanks Tom - I probably shouldn't have given the Gutenberg example since
it introduces extra problems. In my actual process at the moment I have
the source scans, OCR output texts, and corrected text files produced by
myself, so there are fewer variables to worry about. In particular, page
division
On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net
wrote:
>
> There are lots of PDFs of scanned books around which include moderately
> good OCR-ed text (e.g. on archive.org).
>
OCR quality varies widely (even wildly) across scans and vintages of OCR,
so it's worth checkin