[ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639257#comment-14639257
 ] 

Ken Krugler commented on TIKA-1696:
-----------------------------------

Hi Paul - see https://issues.apache.org/jira/browse/TIKA-369 for a lengthy 
discussion of possible improvements to language detection.

I started a new open source project to improve on what's available in 
language-detector (see https://github.com/kkrugler/yalder), and found that the 
latest version of language-detector is pretty darn good. I could beat it on 
either speed or accuracy, but not on both at once; overall it was about equal.

So I'd be interested in finding out where/how the MITLL library improves on the 
current version of language-detector, as that's a known Java-based solution 
that covers a lot of languages and has good performance/accuracy.

> Language Identification with Text Processing Toolkit from MITLL
> ---------------------------------------------------------------
>
>                 Key: TIKA-1696
>                 URL: https://issues.apache.org/jira/browse/TIKA-1696
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>            Reporter: Paul Ramirez
>             Fix For: 1.10
>
>
> The aim here is to extend the methods for language identification within 
> text. MIT Lincoln Labs has an open source library [1] written in Julia. 
> Having spoken with the MITLL guys, I understand there may be a Scala version 
> of this library, which would make it easier to package with Tika. 
> At this point I'm not quite sure how many languages this library supports by 
> default, but it can be extended when provided with some training data.
> [1] https://github.com/mit-nlp/Text.jl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
