Dawid Weiss created SOLR-4871:
---------------------------------

             Summary: Another (fast) language identifier (port of langid.py)
                 Key: SOLR-4871
                 URL: https://issues.apache.org/jira/browse/SOLR-4871
             Project: Solr
          Issue Type: New Feature
          Components: contrib - LangId
            Reporter: Dawid Weiss
            Priority: Trivial


I've ported langid.py to Java. It is a Python language identifier with some 
very nice properties (see the research paper by Marco Lui) and pretty good 
language identification quality.

The major benefit, though, is speed. Without subsampling (which Google Code's 
languagedetection does), the benchmark on Europarl clocks in at:
{code}
--> langid-v3
     20826/     21000 (99.1714%) in 0.75 sec. (28075 docs/sec.)
--> languagedetect
     20846/     21000 (99.2667%) in 4.24 sec. (4948 docs/sec.)
{code}
So the detection quality is nearly the same (99.17% vs. 99.27%) at more than 
five times the throughput (28075 vs. 4948 docs/sec.). If you limit the number 
of languages to detect, it will be faster still -- see the benchmarking snippets.
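
For reference, the comparison can be re-derived from the printed figures. This 
is only an arithmetic sketch; the class and method names are made up for 
illustration and are not part of the benchmark code. Note that recomputing 
docs/sec. from the printed time gives a slightly different number, presumably 
because the time is rounded to two decimals.

```java
import java.util.Locale;

public final class BenchmarkMath {
    /** Accuracy in percent: correctly identified documents over all documents. */
    public static double accuracyPercent(int correct, int total) {
        return 100.0 * correct / total;
    }

    /** Throughput ratio between two docs/sec. figures. */
    public static double speedup(double fasterDocsPerSec, double slowerDocsPerSec) {
        return fasterDocsPerSec / slowerDocsPerSec;
    }

    public static void main(String[] args) {
        // All figures below are copied from the benchmark output above.
        double gap = accuracyPercent(20846, 21000) - accuracyPercent(20826, 21000);
        double ratio = speedup(28075, 4948);
        System.out.printf(Locale.ROOT, "accuracy gap: %.4f percentage points%n", gap);
        System.out.printf(Locale.ROOT, "speedup: %.2fx%n", ratio);
        // Recomputed from the rounded time, so only approximately 28075.
        System.out.printf(Locale.ROOT, "langid docs/sec.: %.0f%n", 21000 / 0.75);
    }
}
```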

Yet another nice (?) property is that it runs on UTF-8 sequences natively. I've 
built in a loop that uses Java's default charset decoder, but if you already 
have a BytesRef you don't need to create strings at all.
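
To illustrate the idea of a byte-oriented input path, here is a minimal sketch 
of feeding UTF-8 chunks to a consumer through one reusable buffer. It uses only 
the JDK's CharsetEncoder; the Utf8Sink interface is a hypothetical stand-in and 
NOT langid-java's actual API, and this is not the loop from the port itself. 
If you already hold UTF-8 bytes (for example the bytes/offset/length triple of 
a BytesRef), you would call the sink directly and skip the encoding step.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Utf8Feed {
    /** Hypothetical byte-oriented consumer; an assumption, not langid-java's API. */
    public interface Utf8Sink {
        void append(byte[] bytes, int offset, int length);
    }

    /** Encodes text to UTF-8 chunk by chunk and hands each chunk to the sink. */
    public static void feed(CharSequence text, Utf8Sink sink) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(1024); // reused for every chunk
        CoderResult result;
        do {
            // Overflow means the output buffer filled up: drain it and continue.
            result = encoder.encode(in, out, true);
            drain(out, sink);
        } while (result.isOverflow());
        do {
            result = encoder.flush(out);
            drain(out, sink);
        } while (result.isOverflow());
    }

    /** Passes the bytes written so far to the sink, then resets the buffer. */
    private static void drain(ByteBuffer out, Utf8Sink sink) {
        out.flip();
        if (out.hasRemaining()) {
            sink.append(out.array(), out.arrayOffset() + out.position(), out.remaining());
        }
        out.clear();
    }
}
```

The sink receives (bytes, offset, length) so that callers holding a slice of a 
larger array can pass it along without copying.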

Released artifacts are on Sonatype OSS:
https://oss.sonatype.org/content/repositories/releases/com/carrotsearch/langid-java/

The source code is on GitHub:
https://github.com/carrotsearch/langid-java
