Chris A. Mattmann created TIKA-1343:
---------------------------------------

             Summary: Create a Tika Translator implementation that uses JoshuaDecoder
                 Key: TIKA-1343
                 URL: https://issues.apache.org/jira/browse/TIKA-1343
             Project: Tika
          Issue Type: Bug
          Components: general
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
             Fix For: 1.6


The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine 
translation system hosted on GitHub:

http://joshua-decoder.org/

Joshua takes in corpora and trains models that can then be used to do language 
translation. Currently there is support for Spanish->English, Indian 
dialects->English, Chinese->English, and a few others.

https://github.com/joshua-decoder/joshua/

It would be nice to build a Tika Translator on top of Joshua. There are of 
course several issues with this:

* The models are huge, so we'll need a separate package or Maven module (maybe 
tika-translate-joshua or something similar) to release the models, and we'll 
need to build the models. I just went through the process of building the 
Spanish->English one; it took over a day, and it still needs to be rebuilt 
because I did it wrong.
* Joshua has its own configuration, so we need some way of passing that 
config into the Translator. I'm not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the 
Joshua lists about this: 
https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
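
On the configuration question, one possible approach (a sketch only, not what 
the current patch does) is to read the location of the Joshua config file from 
a small properties file on the classpath and let a JVM system property 
override it. Everything named here is hypothetical: the class 
JoshuaTranslatorConfig, the key joshua.config.path, and the example path are 
placeholders, and the code calls no Joshua or Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical helper that resolves where the Joshua config file lives. */
public class JoshuaTranslatorConfig {
    // Assumed property key; not an existing Tika or Joshua setting.
    static final String CONFIG_KEY = "joshua.config.path";

    private final String configPath;

    private JoshuaTranslatorConfig(String configPath) {
        this.configPath = configPath;
    }

    /**
     * Reads the key from the given properties stream (may be null).
     * A JVM system property with the same key takes precedence.
     */
    public static JoshuaTranslatorConfig load(InputStream in) throws IOException {
        Properties props = new Properties();
        if (in != null) {
            props.load(in);
        }
        String path = System.getProperty(CONFIG_KEY, props.getProperty(CONFIG_KEY));
        return new JoshuaTranslatorConfig(path);
    }

    public String getConfigPath() {
        return configPath;
    }

    /** A Translator built on this could report itself unavailable when no config is set. */
    public boolean isAvailable() {
        return configPath != null && !configPath.trim().isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a properties file shipped on the classpath.
        String text = "joshua.config.path=/opt/joshua/es-en/joshua.config\n";
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1));
        JoshuaTranslatorConfig cfg = load(in);
        System.out.println(cfg.getConfigPath() + " available=" + cfg.isAvailable());
    }
}
```

A Translator implementation could call load() once at construction time and 
return false from its availability check when no path is configured, so that 
Tika can skip it cleanly on machines without the models installed.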

Anyhoo, I've got a working patch right now with hard-coded settings and a 
manual install of Joshua into my Maven repo, for brave souls out there who want 
to try it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
