I haven't entered an issue on the Tika list yet, but in the near future MIT-LL will also have language detection for video and audio streams. Chris, if you're already going to make this pluggable, this may be something to consider.
--Paul

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Paul Ramirez, M.S.
Technical Group Supervisor
Computer Science for Data Intensive Applications (398M)
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 158-264, Mailstop: 158-242
Email: [email protected]
Office: 818-354-1015
Cell: 818-395-8194
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Jul 28, 2015, at 5:59 PM, "Mattmann, Chris A (3980)" <[email protected]> wrote:

Cool. Well with this one I found, along with language-detector, along with Ramirez and the work with Joe Campbell’s group at MIT-LL and the Julia stuff, I for one am going to take the step to make it pluggable. I’ll try and take this on over the next week. I’ll use a ServiceLoader approach similar to Translators, Detectors, Parsers, etc.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, July 28, 2015 at 5:39 PM
To: "[email protected]" <[email protected]>
Subject: RE: Bayesian N-Gram Language Detection

I think switching to language-detector is a reasonable first step (more languages, faster, better accuracy), after which we can evaluate the need to make it pluggable.

There were some code & resource packaging issues with the original project, but the fork I've been trying out seems much better.

See https://github.com/optimaize/language-detector

Still ALv2, and already in the Maven central repo.

-- Ken

From: Mattmann, Chris A (3980)
Sent: July 28, 2015 5:30:00pm PDT
To: [email protected]
Subject: Bayesian N-Gram Language Detection

FYI the code is ALv2:

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

I’m going to test this out and see how it compares with our own. Maybe we need to make the Language Detector pluggable too.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
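
For readers unfamiliar with the ServiceLoader approach Chris mentions, here is a minimal sketch of how a pluggable detector could be discovered at runtime with `java.util.ServiceLoader`. It is not Tika's actual API: the `LanguageDetectorPlugin` interface, the `UndeterminedDetector` fallback, and the `loadDetector` helper are hypothetical names made up for illustration.

```java
import java.util.ServiceLoader;

public class PluggableLanguageDetection {

    /** Hypothetical plugin interface (an assumption, not Tika's real API). */
    public interface LanguageDetectorPlugin {
        /** Returns a language code for the given text. */
        String detect(String text);
    }

    /** Hypothetical fallback used when no provider is registered. */
    public static final class UndeterminedDetector implements LanguageDetectorPlugin {
        @Override
        public String detect(String text) {
            return "und"; // ISO 639-2 code for "undetermined"
        }
    }

    /**
     * Looks up implementations registered under META-INF/services and
     * returns the first one found, or the fallback if none is registered.
     */
    public static LanguageDetectorPlugin loadDetector() {
        ServiceLoader<LanguageDetectorPlugin> loader =
                ServiceLoader.load(LanguageDetectorPlugin.class);
        for (LanguageDetectorPlugin plugin : loader) {
            return plugin; // first registered provider wins
        }
        return new UndeterminedDetector();
    }

    public static void main(String[] args) {
        LanguageDetectorPlugin detector = loadDetector();
        System.out.println(detector.getClass().getSimpleName()
                + " -> " + detector.detect("Bonjour le monde"));
    }
}
```

In this sketch, a third-party jar would plug in its own detector by shipping a provider-configuration file named after the interface's binary name (here `META-INF/services/PluggableLanguageDetection$LanguageDetectorPlugin`) that lists its implementation class. No change to the calling code would be needed, which is the appeal of the approach for swapping between detectors such as the ones discussed in this thread.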
