On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann <derhoe...@gmx.net> wrote:
> * Erik Moeller wrote:
>>Are there open source MT efforts that are close enough to merit
>>scrutiny?
>
> Wiktionary. If you want to help free software efforts in the area of
> machine translation, then what they seem to need most is high quality
> data about words, word forms, and so on, in a readily machine-usable
> form, and freely licensed.

Yes. Finding a way to capture and integrate the work OmegaWiki has done
into a new Wikidata-powered Wiktionary would be a useful start. And
we've already sort of claimed the space (though we are neglecting it) --
it's discouraging to anyone else who might otherwise try to build a
brilliant free structured dictionary that we are *so close* to getting
it right.

>< [ Andrea's ideas about using Wikisource to improve OCR tools ]
>
> I built various tools that could be fairly easily adapted for this, my
> http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
> notes are available. One of the tools for instance is a diff tool, see
> image at <http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031>.

I hope the related GSOC project gets support. Getting mentoring from
Tesseract team members seems like a handy way to keep the projects
connected.

Tim Starling writes:
> We could basically clone the frontend component of Google Translate,
> and use Moses as a backend. The work would be mostly JavaScript...
> the next job would be to develop a corpus sharing site, hosting any
> available freely-licensed output of the frontend tool.

This would be most useful. There are often short, quick translation
projects that I would like to do through this sort of TM-capturing
interface, for which the translatewiki prep process is rather
time-consuming.

We can set up a corpus sharing site now, with translatewiki - there is
already a lot of material there that could be part of it. Different
corpora (say, encyclopedic articles v. dictionary pages v. quotes) would
need to be tagged for context. And we could start letting people upload
their own freely licensed corpora to include as well. We would probably
want a vetting process before giving users the import tool, or a
quarantine until we had better ways to let editors revert / bulk-modify
entire imports.

SJ

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l