[jira] [Issue Comment Edited] (TIKA-369) Improve accuracy of language detection

Christian Moen (Issue Comment Edited) (JIRA) Sun, 19 Feb 2012 09:25:01 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211424#comment-13211424
 ]


Christian Moen edited comment on TIKA-369 at 2/19/12 5:23 PM:
--------------------------------------------------------------

Does anyone have any thoughts on how we should follow up on this?

The {{language-detection}} library looks attractive to me: it seems to be the 
best Java-based language detection library available, and it has a suitable 
license.  However, it seems to require Java 6, and Tika is still based on Java 
5.  Does this effectively rule out using {{language-detection}} for Tika?

Would it make sense to offer {{language-detection}} as an optional alternative 
to the current detector?

The idea is basically to support {{language-detection}} in addition to what we 
have today, with the current detector remaining the default.
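
To make the proposal concrete, here is a rough sketch of how an optional 
detector could sit next to the default one. Every type and name below 
(PluggableDetection, Detector, Registry, the "current" and 
"language-detection" keys) is hypothetical and not part of Tika or of the 
{{language-detection}} library; the two detector classes are stubs that stand 
in for the real implementations. The code avoids anything newer than Java 5 so 
that the core would not pick up a Java 6 dependency.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: every type here is hypothetical and not part of Tika. */
public final class PluggableDetection {

    /** Minimal contract a detector implementation would satisfy. */
    public interface Detector {
        String name();

        /** Returns a language code, or null when undecided. */
        String detect(String text);
    }

    /** Placeholder for the current n-gram detector; detection is stubbed out. */
    static final class CurrentDetector implements Detector {
        public String name() { return "current"; }
        public String detect(String text) { return null; }
    }

    /** Placeholder for an adapter around the language-detection library. */
    static final class AlternativeDetector implements Detector {
        public String name() { return "language-detection"; }
        public String detect(String text) { return null; }
    }

    /** Looks up detectors by name and falls back to the default. */
    public static final class Registry {
        private final Map<String, Detector> byName = new HashMap<String, Detector>();
        private final Detector fallback;

        public Registry(Detector fallback) {
            this.fallback = fallback;
            register(fallback);
        }

        public void register(Detector detector) {
            byName.put(detector.name(), detector);
        }

        /** Returns the requested detector, or the default if the name is null or unknown. */
        public Detector select(String requested) {
            Detector chosen = requested == null ? null : byName.get(requested);
            return chosen != null ? chosen : fallback;
        }
    }

    /** Builds a registry with the current detector as the default. */
    public static Registry defaultRegistry() {
        Registry registry = new Registry(new CurrentDetector());
        registry.register(new AlternativeDetector());
        return registry;
    }
}
```

With this shape, callers that ask for nothing in particular keep getting the 
current detector, and the alternative could live in a separate module that is 
only registered when a Java 6 runtime is available.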

                
                  
> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, 
> textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language 
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper 
> (attached) indicates that a log-likelihood ratio (LLR) test works much 
> better, which would then make language detection faster due to less text 
> needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
> value as a threshold for certainty. This is very sensitive to the amount of 
> text being processed, and thus gives false negative results for short runs of 
> text.
> 3. Certainty should also be based on how much better the result is for 
> language X, compared to the next best language. If two languages both had 
> identical sum-of-squares values, and this value was below the threshold, then 
> the result is still not very certain.
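
The three points above can be illustrated with a small sketch. It is not 
Tika code: the class CertaintySketch, its method names, and the maxDistance 
and minMargin parameters are assumptions made for illustration. The first 
method computes the log-likelihood ratio (G^2) statistic for a 2x2 table of 
counts in the entropy form commonly used for Dunning's test; how such counts 
would be derived from n-gram profiles is left open here. The other two methods 
show a certainty check that combines an absolute threshold with the relative 
gap between the best and second-best candidates, assuming lower non-negative 
distances are better.

```java
import java.util.Map;

/** Sketch only: hypothetical helper, not part of Tika. */
public final class CertaintySketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log((double) x);
    }

    /** Unnormalized entropy of a set of counts: N*ln(N) - sum(k*ln(k)). */
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /** Log-likelihood ratio (G^2) for a 2x2 contingency table of counts. */
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rows = entropy(k11 + k12, k21 + k22);
        double cols = entropy(k11 + k21, k12 + k22);
        double cells = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rows + cols - cells));
    }

    /** Relative gap between the best (lowest) and second-best distance; 0 means a tie. */
    public static double margin(Map<String, Double> distances) {
        if (distances.size() < 2) {
            throw new IllegalArgumentException("need at least two candidates");
        }
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        return second > 0.0 ? (second - best) / second : 0.0;
    }

    /** Certain only if the best distance is low enough AND clearly ahead of the runner-up. */
    public static boolean isReasonablyCertain(Map<String, Double> distances,
                                              double maxDistance, double minMargin) {
        double best = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            best = Math.min(best, d);
        }
        return best <= maxDistance && margin(distances) >= minMargin;
    }
}
```

In this sketch, two languages with identical distances give a margin of 0, so 
the result is reported as uncertain even when both distances are below the 
absolute threshold, which is the situation described in point 3.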
