Re: [Wikisource-l] Proofreading based on statistics

Alex Brollo Fri, 24 May 2013 03:41:29 -0700

As a user, I explored the Distributed Proofreaders website to pick up
ideas about proofreading. It was a very productive and enlightening
experience, even if the whole philosophy of DP proofreading/formatting is
completely different from, and incompatible with, the wiki approach. One of
its tools is an excellent, customizable, js-based spelling dictionary. How
much I would like something like that in Wikisource! Obviously we need an
excellent, very easily customizable tool: ideally, a "book-specific
spelling tool". I have tried to think about it, but there are lots of
difficulties. The first one is that it is difficult to highlight words
inside a textarea with js. It may be that VisualEditor could make things
easier.
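
To make the "book-specific spelling tool" idea a bit more concrete, here is a
minimal sketch in Python (not js, and not the DP tool or postOCR). It only
covers the word-checking part, not the highlighting problem. The function name
suspect_words, the tiny dictionary and the per-book word list are all
hypothetical, chosen just for illustration.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe (straight or curly).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def suspect_words(text, dictionary, book_words=frozenset()):
    """Count words found in neither the general dictionary nor the per-book list.

    `dictionary` and `book_words` are plain iterables of words; both are
    assumptions of this sketch, not an existing Wikisource data source.
    """
    known = {w.lower() for w in dictionary} | {w.lower() for w in book_words}
    counts = Counter(m.group(0).lower() for m in WORD_RE.finditer(text))
    # Keep only the words that nobody has declared valid yet.
    return Counter({w: n for w, n in counts.items() if w not in known})


# Hypothetical data: a tiny dictionary and one book-specific proper name.
page = "The bum of the candle. Runeberg saw tbe candle bum."
general = {"the", "of", "candle", "saw", "bum", "burn"}
per_book = {"Runeberg"}

for word, n in suspect_words(page, general, per_book).most_common():
    print(word, n)
```

With this sample only "tbe" is reported. "Runeberg" is accepted through the
per-book list, while "bum" (an OCR error for "burn") goes unnoticed because it
is a valid dictionary word, so a check like this can only complement manual
proofreading.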


Alex


2013/5/24 Andrea Zanni <zanni.andre...@gmail.com>

> I completely agree with Lars.
> I remember, for example, an awesome tool from Alex Brollo, postOCR,
> a js script which automatically corrects the most common OCR errors and
> converts apostrophes.
> The tool is very useful and widely used, and it would improve a lot
> given a list of common OCR errors per book.
>
> Moreover, a set of stats per book
>  (list of words used, word counts, etc.)
> could be very interesting for a small but skilled group of readers,
> such as digital humanists and philologists.
>
> As an example, we are collaborating right now with a philologist (a
> digital humanist)
> who puts texts on Wikisource, proofreads them with the community,
> and then works on them.
>
> Aubrey
>
>
> On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson <l...@aronsson.se> wrote:
>
>> It should be possible, in any language of Wikisource, to
>> check all existing text against a known dictionary valid
>> for that year, and to find words that are outside the
>> dictionary. These words could be proofread in some tool
>> similar to a CAPTCHA. They might be uncommon place names
>> that are correctly OCRed but not in the dictionary, or
>> they could be OCR errors, or both.
>>
>> Has anybody tried this?
>>
>> Such finds are not necessarily the only OCR errors.
>> Some OCR errors result in correctly spelled words, that
>> are found in the dictionary, e.g. burn -> bum.
>> So full manual proofreading and validation will still be
>> needed. But a statistics based approach could fill gaps
>> and quickly improve full text searchability.
>>
>>
>> --
>>   Lars Aronsson (l...@aronsson.se)
>>   Aronsson Datateknik - http://aronsson.se
>>
>>   Project Runeberg - free Nordic literature - http://runeberg.org/
>>
>>
>>
>> _______________________________________________
>> Wikisource-l mailing list
>> Wikisource-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>
