Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we
have founded a user group [1] to promote international coordination and
partnerships with fellow organizations. We have thousands of high-quality,
volunteer-proofread pages [2] matched with scans in about 50 different
languages [3]. Our editing interface for a single page looks like this
[4]; the same work can also be viewed as an "index" [5] or as continuous
text with all pages together [6]. There are several verification levels.
The most important are "yellow", which means that one contributor has
proofread the page, and "green", which means that a second person has
verified the proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we
discussed technical and social issues from several Wikisource language
communities. One of the most serious issues was raised by the Belarusian
community, which uses two different scripts with no commercial OCR support.
This means that the volunteers have to type each word manually. We wondered
whether it would be possible to train Tesseract to recognize these old
texts using the text that has already been typed.

We would like to know if you would be interested in exploring collaboration
possibilities. I imagine that with your guidance we could prepare training
data not only in different languages, but also from different time periods,
scripts, etc. At the moment, it is not clear to us how to achieve this.

Please let us know if you would like to have a Hangout/Skype conversation
any day next week.

Cheers,
Micru


[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4] https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf/2
[5] https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf
[6] https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
