[ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037219#comment-14037219 ]
Dave Meikle commented on TIKA-1343:
-----------------------------------

Hey Chris - I am up for building out on this one. Will take a look at your patch and give it a whirl.

> Create a Tika Translator implementation that uses JoshuaDecoder
> ---------------------------------------------------------------
>
>                 Key: TIKA-1343
>                 URL: https://issues.apache.org/jira/browse/TIKA-1343
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
>
> The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine
> translation system hosted at GitHub:
> http://joshua-decoder.org/
> Joshua takes in corpora and trains models that can then be used to do
> language translation. Currently there is support for, e.g., Spanish->English,
> Indian dialects->English, Chinese->English, and a few others.
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of
> course several issues with this:
> * the models are huge - so we'll need a separate package or Maven module,
> maybe tika-translate-joshua or something, to release the models, and we'll need
> to build the models. I just went through the process of building the
> Spanish->English one, and it still needs to be rebuilt b/c I did it wrong,
> but it took over a day.
> * there is a configuration for Joshua, and so we need some way of passing
> that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the
> Joshua lists about this:
> https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual
> install into my Maven repo for brave souls out there who want to try it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)