[ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639522#comment-14639522
 ] 

Paul Ramirez commented on TIKA-1696:
------------------------------------

Ken, thanks for the fast feedback and references. I haven't dug into this much 
yet, so it may take a couple of weeks to get something up here to test. As I 
dig in, I'll update the Jira issue with more details to help drive discussion. 
I'll also try to get the MITLL guys posting here, as they would be better able 
to describe the details. 

What wasn't clear on TIKA-369 is whether yalder was going to come back into 
Tika. The intent here is to get to a patch integrating the MITLL code so it can 
be tested in the same way that Tika's current approach was tested. Hopefully 
that patch will help answer the questions above. 

They are forwarding me some research papers so I can come up to speed on this 
too. As I gain knowledge, I'll flesh out the details here. 

Do you think this should instead happen on TIKA-369?

> Language Identification with Text Processing Toolkit from MITLL
> ---------------------------------------------------------------
>
>                 Key: TIKA-1696
>                 URL: https://issues.apache.org/jira/browse/TIKA-1696
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>            Reporter: Paul Ramirez
>             Fix For: 1.10
>
>
> The aim here is to extend the methods for language identification within 
> text. MIT Lincoln Labs has an open source library [1] written in Julia. 
> Having spoken with the MITLL guys, there is a possibility that there is a 
> Scala version of this library, which would make it easier to package in with 
> Tika. 
> At this point I'm not quite sure how many languages this library supports by 
> default but it can be extended when provided some training data.
> [1] https://github.com/mit-nlp/Text.jl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
