Re: [Wikisource-l] Tesseract Open Source OCR Engine

2016-04-20 Thread Federico Leva (Nemo)
Yes, Tesseract is used for many Wikisource books, mainly (?) via phe's 
tool https://github.com/phil-el/phetools/tree/master/hocr / 
https://tools.wmflabs.org/phetools/


You can search the archives to see some of the things that have been tried 
in the past, including http://terese.sourceforge.net/ . There are many 
repositories with Indic training sets, but I have never understood the 
process for bringing them together and getting them more widely used.


Nemo

___
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l


[Wikisource-l] Tesseract Open Source OCR Engine

2016-04-20 Thread mathieu stumpf guntz

Hi,

I don't know where things stand with OCR for non-Latin scripts, so maybe 
this is no longer relevant. Last time I looked into it, the Google 
service had limitations that were a problem, notably for Indic 
languages. Yesterday we had a contribution day around Alsatian and 
Franconian dialects, where I had the opportunity to talk with some 
linguists. One of them told me that Google in fact uses Tesseract, 
which is open source, for its OCR service. According to what she told 
me (or at least what I remember of it), it relies on a training 
mechanism that is independent of the script: you define the matching 
between image samples and characters, and it goes from there. Looking 
quickly at the langdata repository, I see that there is material on 
Devanagari, which I believe is the script used in at least part of the 
Indic texts, isn't it?
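
To make the "matching between image samples and characters" idea a bit 
more concrete: as far as I understand, Tesseract training data pairs a 
page image with a "box file", a plain-text file with one line per glyph 
in the form `<symbol> <left> <bottom> <right> <top> <page>`, measured 
from the bottom-left corner of the image. Below is a minimal sketch in 
Python that writes such lines. The function name `to_box_lines`, the 
top-left input convention and the sample coordinates are made up for 
illustration and are not taken from Tesseract or langdata.

```python
from typing import Iterable, List, Tuple

# Hypothetical input format: (character, x, y, width, height), where
# (x, y) is the top-left corner of the glyph in pixels, as most image
# tools report it.
Glyph = Tuple[str, int, int, int, int]


def to_box_lines(glyphs: Iterable[Glyph], image_height: int, page: int = 0) -> List[str]:
    """Format glyph annotations as box-file lines.

    Box files use a bottom-left origin, so the vertical coordinates
    are flipped using the image height.
    """
    lines = []
    for char, x, y, w, h in glyphs:
        left, right = x, x + w
        top = image_height - y          # distance of the upper edge from the bottom
        bottom = image_height - (y + h)  # distance of the lower edge from the bottom
        lines.append(f"{char} {left} {bottom} {right} {top} {page}")
    return lines


# Two invented Devanagari glyph boxes on a 100-pixel-high image.
sample = [("क", 10, 20, 30, 40), ("ख", 50, 20, 28, 40)]
for line in to_box_lines(sample, image_height=100):
    print(line)
```

Each printed line ties one character to one region of the image, which 
is the kind of matching a training run would start from.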


Hope that may help,
mathieu