I completely agree with Lars.
One example that comes to mind is postOCR, an excellent tool by Alex Brollo:
a JavaScript script that automatically corrects the most common OCR errors and
converts apostrophes.
The tool is very useful and widely used, and it would benefit greatly from
a list of common OCR errors for each book.
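
To make the idea concrete, here is a rough sketch of how a per-book list
could drive the corrections. It is not postOCR's actual code (which is
JavaScript and which I am not reproducing here); the function name and the
sample entries are invented for illustration, and matching is
case-sensitive to keep the example short.

```python
import re

# Hypothetical per-book list of common OCR errors (wrong -> right).
# These entries are made up for illustration only.
BOOK_CORRECTIONS = {
    "tbe": "the",
    "aud": "and",
    "wbich": "which",
}


def apply_corrections(text, corrections):
    """Replace whole words that appear in the per-book corrections list."""
    if not corrections:
        return text
    # \b keeps the match to whole words, so "tbeory" is left untouched.
    alternatives = "|".join(re.escape(wrong) for wrong in corrections)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


if __name__ == "__main__":
    print(apply_corrections("tbe cat aud tbe dog", BOOK_CORRECTIONS))
```

Each book would carry its own dictionary of corrections, so a fix that is
right for one scan is not blindly applied to another.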

Moreover, a set of statistics per book
(the list of words used, how often each one occurs, etc.)
could be very interesting for a small but skilled group of readers,
such as digital humanists and philologists.
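
As a rough sketch of such statistics, and of the dictionary check Lars
describes in the message quoted below, something like the following could
work. The function names, the tokenising rule and the tiny word list are
assumptions for the example; a real dictionary valid for the book's year
would have to come from elsewhere.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word occurs, ignoring case."""
    # Runs of letters, optionally joined by an apostrophe ("don't").
    words = re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*", text.lower())
    return Counter(words)


def unknown_words(counts, dictionary):
    """List words missing from the dictionary, most frequent first."""
    missing = [w for w in counts if w not in dictionary]
    return sorted(missing, key=lambda w: (-counts[w], w))


if __name__ == "__main__":
    sample = "The burn is near tbe bum. The bum!"
    # Hypothetical word list standing in for a period dictionary.
    known = {"the", "burn", "is", "near", "bum"}
    counts = word_counts(sample)
    print(counts.most_common(3))
    print(unknown_words(counts, known))
```

In this sample only "tbe" is reported. The wrong "bum" is a valid word and
passes the check, which is exactly the limit Lars points out, so manual
proofreading is still needed.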

As an example, we are collaborating right now with a philologist (a digital
humanist)
who puts texts on Wikisource, proofreads them with the community,
and then works on them.

Aubrey


On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson <l...@aronsson.se> wrote:

> It should be possible, in any language of Wikisource, to
> check all existing text against a known dictionary valid
> for that year, and to find words that are outside the
> dictionary. These words could be proofread in some tool
> similar to a CAPTCHA. They might be uncommon place names
> that are correctly OCRed but not in the dictionary, or
> they could be OCR errors, or both.
>
> Has anybody tried this?
>
> Such finds are not necessarily the only OCR errors.
> Some OCR errors result in correctly spelled words, that
> are found in the dictionary, e.g. burn -> bum.
> So full manual proofreading and validation will still be
> needed. But a statistics based approach could fill gaps
> and quickly improve full text searchability.
>
>
> --
>   Lars Aronsson (l...@aronsson.se)
>   Aronsson Datateknik - http://aronsson.se
>
>   Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>