It's fine to discuss this on TIKA-1696.

> On Jul 23, 2015, at 2:19 PM, Paul Ramirez (JIRA) <j...@apache.org> wrote:
> 
> 
>    [ 
> https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639522#comment-14639522
>  ] 
> 
> Paul Ramirez commented on TIKA-1696:
> ------------------------------------
> 
> Ken, thanks for the fast feedback and references. I haven't dug into this
> much, so it may take a couple of weeks to get something up here to test. As I
> dig in, I'll update the JIRA issue with more details to help drive the
> discussion. I'll also try to get the MITLL guys posting here, since they
> would be better able to describe the details.
> 
> What wasn't clear from TIKA-369 is whether yalder was going to come back into
> Tika. The intent here is to produce a patch integrating the MITLL code so it
> can be tested in the same way that Tika's current approach was tested.
> Hopefully that patch will help answer the questions above.
> 
> They are forwarding me some research papers so I can come up to speed on
> this too, and as I learn more I'll flesh out the details here.
> 
> Do you think this should instead happen on TIKA-369?
> 
>> Language Identification with Text Processing Toolkit from MITLL
>> ---------------------------------------------------------------
>> 
>>                Key: TIKA-1696
>>                URL: https://issues.apache.org/jira/browse/TIKA-1696
>>            Project: Tika
>>         Issue Type: New Feature
>>         Components: languageidentifier
>>           Reporter: Paul Ramirez
>>            Fix For: 1.10
>> 
>> 
>> The aim here is to extend the methods for language identification within
>> text. MIT Lincoln Labs has an open source library [1] written in Julia.
>> From talking with the MITLL guys, there may also be a Scala version of this
>> library, which would make it easier to package with Tika.
>> At this point I'm not sure how many languages this library supports by
>> default, but it can be extended when provided with training data.
>> [1] https://github.com/mit-nlp/Text.jl
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
