I completely agree with Lars. One example that comes to mind is postOCR, an awesome tool by Alex Brollo: a JS script that automatically corrects the most common OCR errors and converts apostrophes. The tool is very useful and widely used, and it would benefit a lot from a list of common OCR errors for each book.
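To make the idea of a per-book error list concrete, here is a minimal sketch in Python. It is not how postOCR itself works (that is a JS script whose internals are not described here); the function names, the `book_corrections` mapping and the sample sentence are all made up for illustration.

```python
import re


def apply_corrections(text, corrections):
    """Replace whole-word OCR misreadings using a per-book mapping."""
    if not corrections:
        return text
    # Longest keys first, so the longer alternative wins when two keys overlap.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(0)], text)


def curl_apostrophes(text):
    """Turn a straight apostrophe between two letters into a typographic one."""
    return re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)


# Hypothetical error list for one book (assumption, not real data).
book_corrections = {"tbe": "the", "aud": "and", "wbich": "which"}

sample = "It's tbe book aud tbe page wbich we need."
print(curl_apostrophes(apply_corrections(sample, book_corrections)))
```

Matching on word boundaries keeps the replacement from touching longer words that merely contain one of the listed misreadings.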
Moreover, a set of per-book statistics (the list of words used, their counts, etc.) could be very interesting for a small but skilled group of readers, such as digital humanists and philologists. As an example, we are collaborating right now with a philologist (a digital humanist) who puts texts on Wikisource, proofreads them with the community, and then works on them.

Aubrey

On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson <l...@aronsson.se> wrote:

> It should be possible, in any language of Wikisource, to
> check all existing text against a known dictionary valid
> for that year, and to find words that are outside the
> dictionary. These words could be proofread in some tool
> similar to a CAPTCHA. They might be uncommon place names
> that are correctly OCRed but not in the dictionary, or
> they could be OCR errors, or both.
>
> Has anybody tried this?
>
> Such finds are not necessarily the only OCR errors.
> Some OCR errors result in correctly spelled words, that
> are found in the dictionary, e.g. burn -> bum.
> So full manual proofreading and validation will still be
> needed. But a statistics based approach could fill gaps
> and quickly improve full text searchability.
>
>
> --
>   Lars Aronsson (l...@aronsson.se)
>   Aronsson Datateknik - http://aronsson.se
>
>   Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
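As a rough illustration of the dictionary check Lars describes, combined with the per-book word counts mentioned above, here is a small Python sketch. The `dictionary` set, the sample page and the function names are placeholders; a real run would need a word list appropriate to the language and year of the book, and ordering the candidates by frequency is just one possible choice.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def word_counts(text):
    """Count every word in the text, ignoring case."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def unknown_words(text, dictionary):
    """Return (word, count) pairs for words missing from the dictionary,
    most frequent first, then alphabetically."""
    counts = word_counts(text)
    unknown = {w: n for w, n in counts.items() if w not in dictionary}
    return sorted(unknown.items(), key=lambda item: (-item[1], item[0]))


# Toy dictionary and page text (assumptions, not real data).
dictionary = {"the", "fire", "will", "burn", "in", "old", "house", "bum"}
page = "The fire will bum in the old housc in Uppsala. The housc!"

print(unknown_words(page, dictionary))
# [('housc', 2), ('uppsala', 1)]
```

In this toy run "housc" is a likely OCR error and "uppsala" is a correct place name that is simply not in the dictionary, so both would go to a human for review. The misreading "bum" is not flagged at all because it is a valid word, which is exactly why full proofreading would still be needed.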