Re: [Wikisource-l] Proofreading based on statistics

2013-05-23 Thread Federico Leva (Nemo)

Lars Aronsson, 24/05/2013 01:54:

> It should be possible, in any language of Wikisource, to
> check all existing text against

What do you mean by "existing text"? Only the text currently stored in
wiki pages? Also the text layer of the DjVu or PDF files in use on the
wiki? Also the files uploaded but not yet used?



> a known dictionary valid
> for that year, and to find words that are outside the
> dictionary. These words could be proofread in some tool
> similar to a CAPTCHA. They might be uncommon place names
> that are correctly OCRed but not in the dictionary, or
> they could be OCR errors, or both.
>
> Has anybody tried this?


In a way: 

> Such finds are not necessarily the only OCR errors.
> Some OCR errors result in correctly spelled words, that
> are found in the dictionary, e.g. burn -> bum.
> So full manual proofreading and validation will still be
> needed. But a statistics based approach could fill gaps
> and quickly improve full text searchability.


True. Listing tasks to direct people to is also always a good thing on 
wikis, better than leaving people to spend their time figuring out what to do.


Nemo
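
A minimal sketch of the dictionary check proposed above, assuming Node.js
and two hypothetical local files: wordlist.txt (one dictionary word per
line) and ocr.txt (the text to check):

    // Report words not found in a period dictionary, most frequent first.
    const fs = require('fs');

    const dictionary = new Set(
      fs.readFileSync('wordlist.txt', 'utf8')
        .split('\n')
        .map(w => w.trim().toLowerCase())
        .filter(Boolean)
    );

    const text = fs.readFileSync('ocr.txt', 'utf8');
    // Tokenize on anything that is not a letter; \p{L} keeps accented letters.
    const words = text.toLowerCase().match(/\p{L}+/gu) || [];

    // Count occurrences of each out-of-dictionary word.
    const unknown = new Map();
    for (const w of words) {
      if (!dictionary.has(w)) {
        unknown.set(w, (unknown.get(w) || 0) + 1);
      }
    }

    // The most frequent suspects are either recurring OCR errors or
    // legitimate words (place names, archaic spellings) missing from the list.
    [...unknown.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, 50)
      .forEach(([w, n]) => console.log(`${n}\t${w}`));

The frequency sorting matters: a nonsense word appearing hundreds of times
is probably a systematic OCR error worth a global fix, while a one-off is
a candidate for the CAPTCHA-like review queue Lars describes.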

___
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l


Re: [Wikisource-l] Proofreading based on statistics

2013-05-24 Thread Andrea Zanni
I completely agree with Lars.
I remember, for example, an awesome tool from Alex Brollo, postOCR,
a js script which automatically corrects the most common OCR errors and
converts apostrophes.
The tool is very useful and widely used, and it would improve a lot
given a list of common OCR errors per book.

Moreover, a set of statistics per book
(list of words used, counts of those words, etc.)
could be very interesting for a small but skilled range of readers,
such as digital humanists and philologists.

As an example, we are collaborating right now with a philologist (a
digital humanist) who puts texts on Wikisource, proofreads them with
the community, and then works on them.

Aubrey


On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson  wrote:

> It should be possible, in any language of Wikisource, to
> check all existing text against a known dictionary valid
> for that year, and to find words that are outside the
> dictionary. [...]


Re: [Wikisource-l] Proofreading based on statistics

2013-05-24 Thread Alex Brollo
I explored, as a user, the website of Distributed Proofreaders to pick up
ideas about proofreading. It has been a very productive and enlightening
experience, even if the whole philosophy of DP proofreading/formatting is
completely different from, and incompatible with, the wiki approach. One
of its tools is an excellent customizable, js-based spelling dictionary.
How I wish we had something like that on Wikisource! Obviously we need an
excellent, very easily customizable tool - ideally a "specific book
spelling tool". I have tried to think it through, but there are lots of
difficulties - the first one being that it's difficult to highlight words
inside a textarea with js. It may be that VisualEditor will make things
easier.

Alex
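
For what it's worth, the usual workaround for the textarea problem is not
to highlight inside the textarea at all, but to stack a mirror element
behind it and mark the suspect words there. A rough sketch (element ids
are hypothetical, except wpTextbox1, MediaWiki's standard edit box; the
mirror div must share the textarea's font, padding and line wrapping):

    // A textarea cannot style its own content, so copy the text into a
    // <div> positioned exactly behind it and highlight the words there.
    function highlightSuspects(textarea, mirror, suspects) {
      const escaped = textarea.value
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;');
      const pattern = new RegExp('\\b(' + suspects.join('|') + ')\\b', 'g');
      mirror.innerHTML = escaped.replace(pattern, '<mark>$1</mark>');
    }

    const ta = document.getElementById('wpTextbox1');     // the edit box
    const mirror = document.getElementById('ocr-mirror'); // div behind it
    ta.style.background = 'transparent'; // let the <mark>s show through
    ta.addEventListener('input', () =>
      highlightSuspects(ta, mirror, ['bum', 'tbe', 'arid']));

The suspect list could come from the per-book dictionary check discussed
earlier in the thread, so each book carries its own list of likely OCR
errors.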


2013/5/24 Andrea Zanni 

> I completely agree with Lars.
> I remember, for example, an awesome tool from Alex Brollo, postOCR,
> a js script which automatically corrects the most common OCR errors and
> converts apostrophes. [...]


Re: [Wikisource-l] Proofreading based on statistics

2013-05-25 Thread Lars Aronsson

On 05/24/2013 09:11 AM, Andrea Zanni wrote:

> I remember, for example, an awesome tool from Alex Brollo, postOCR,
> a js script which automatically corrects the most common OCR errors and
> converts apostrophes.

Where is this? Is it documented in English?

> As an example, we are collaborating right now with a philologist (a
> digital humanist) who puts texts on Wikisource, proofreads them with
> the community, and then works on them.

Do you document and distribute your experience?


--
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se

  Project Runeberg - free Nordic literature - http://runeberg.org/





Re: [Wikisource-l] Proofreading based on statistics

2013-05-25 Thread Alex Brollo
> On 05/24/2013 09:11 AM, Andrea Zanni wrote:
>
>> I remember, for example, an awesome tool from Alex Brollo, postOCR,
>> a js script which automatically corrects the most common OCR errors and
>> converts apostrophes.
>
> Where is this? Is it documented in English?

Andrea mentioned two different tools merged into one.
1. The postOCR code comes mainly from Pathoschild's RegexMenuFramework,
with minor changes for Italian OCR errors.
2. The apostrophe conversion (from the typewriter character ' into the
real apostrophe character ’) comes from an original it.source script (in
python, to be used by a bot, and in js, merged into postOCR); it's very
complex, since conversions inside templates, links, html tags, math tags
and wiki markup must be avoided. This is far from simple, since regex
doesn't help to manage nested templates/nested code structures. No, we
don't document this stuff. We simply use it a lot.

Alex
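
A rough illustration of the nesting problem and one way around it: instead
of a single regex, scan the text while tracking template depth, and only
convert apostrophes at depth zero. This is a simplified sketch, not the
it.source script (the real one must also skip links, html and math tags;
the template name below is made up):

    // Convert typewriter apostrophes (') into typographic ones (’),
    // but only outside {{...}} templates. A regex alone cannot do this,
    // because template markup nests.
    function convertApostrophes(text) {
      let out = '';
      let depth = 0; // current template nesting depth
      for (let i = 0; i < text.length; i++) {
        if (text.startsWith('{{', i)) { depth++; out += '{{'; i++; continue; }
        if (depth > 0 && text.startsWith('}}', i)) { depth--; out += '}}'; i++; continue; }
        // Convert only between letters, so wiki bold/italic runs of
        // apostrophes ('' and ''') are left untouched.
        if (text[i] === "'" && depth === 0 &&
            /\p{L}/u.test(text[i - 1] || '') && /\p{L}/u.test(text[i + 1] || '')) {
          out += '\u2019';
        } else {
          out += text[i];
        }
      }
      return out;
    }

    // The apostrophe in "l'arte" is converted; the one inside the
    // (made-up) template's parameter is not.
    console.log(convertApostrophes("l'arte e {{Esempio|l'altro}}"));
    // → "l’arte e {{Esempio|l'altro}}"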





Re: [Wikisource-l] Proofreading based on statistics

2013-05-26 Thread Andrea Zanni
>>> I remember, for example, an awesome tool from Alex Brollo, postOCR,
>>> a js script which automatically corrects the most common OCR errors and
>>> converts apostrophes.
>>
>> Where is this? Is it documented in English?
>
As Alex says, no. And it's a pity.
I really think it is paramount to have a central place (on Wikisource.org?)
to discuss tools, templates, and procedures as an international community.
Such a project has long been needed.
Where should we start it?

Aubrey


Re: [Wikisource-l] Proofreading based on statistics

2013-05-27 Thread Alex Brollo
> As Alex says, no. And it's a pity.
> I really think it is paramount to have a central place (on Wikisource.org?)
> to discuss tools, templates, and procedures as an international community.
> Such a project has long been needed.
> Where should we start it?
>
> Aubrey

Personally, I'll try to centralize tools and scripts - i.e. to write them
(at least as a copy) into oldwikisource, then import them from there.
The problem is that documenting and centralizing seems time-consuming at
the beginning for enthusiastic but inexperienced "programmers", while
experience teaches that a little time saved at the beginning means lots
of wasted time as soon as things become complex.

One advantage is that js really can be centralized: every project can
load a single, centrally maintained script, as in the sketch below. This,
IMHO, is not possible with templates and Lua modules, since they run as
pages of a specific project. So, keeping versions aligned across projects
is a serious issue.

Alex
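
For the js part the loading mechanism already exists: a project's local
script can pull a centrally maintained copy at page load, so only one
loader line has to be duplicated per wiki. A minimal sketch (the page
title is hypothetical):

    // In a local project's MediaWiki:Common.js: load the central script
    // from wikisource.org (oldwikisource) as raw javascript.
    mw.loader.load(
      '//wikisource.org/w/index.php?title=MediaWiki:PostOCR.js'
      + '&action=raw&ctype=text/javascript'
    );

Templates and Lua modules have no equivalent of this, which is exactly
the version-alignment problem described above.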