[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143701#comment-13143701
]
Robert Muir commented on TIKA-369:
----------------------------------
{quote}
Cons: unsurprisingly, has trouble with short text.
{quote}
Not any less trouble than competing libraries:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
It's interesting if you read their paper. I think the normalizations etc. they
made make total sense, and I can easily see how that would make a big
difference on train vs. test when the training data is stuff like Wikipedia
(which isn't always totally realistic).
I haven't played with their approach for CJK detection, but it makes sense to
me. It would be great to see some evaluation results for that case.
On the other hand, I think CLD has nice stuff like segmenting per-script (not
ambiguous) first, to eliminate stupidity when a document has multiple scripts
(e.g. Cyrillic+Latin or Arabic+Latin).
It would be great if the cybozu impl integrated this approach as well.
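To make the per-script idea concrete, here is a rough Java sketch of splitting
text into single-script runs before any language detection happens. It is not
CLD's or cybozu's code: ScriptSegmenter and Run are hypothetical names, and the
only real API used is java.lang.Character.UnicodeScript (Java 7+). Characters
with a neutral script (spaces, digits, punctuation, combining marks) are
assumed to belong to whatever run they appear in.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Tika, CLD, or the cybozu library. */
final class ScriptSegmenter {

    /** A maximal stretch of text whose letters all share one Unicode script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /**
     * Splits text into runs by script. Neutral characters (COMMON, INHERITED,
     * UNKNOWN) never start a new run; they are appended to the current one.
     * Text that contains no scripted characters at all yields no runs.
     */
    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript s = UnicodeScript.of(cp);
            boolean neutral = s == UnicodeScript.COMMON
                    || s == UnicodeScript.INHERITED
                    || s == UnicodeScript.UNKNOWN;
            // A scripted character from a different script closes the current run.
            if (!neutral && current != null && s != current) {
                runs.add(new Run(current, buf.toString()));
                buf.setLength(0);
            }
            if (!neutral) {
                current = s;
            }
            buf.appendCodePoint(cp);
        }
        if (buf.length() > 0 && current != null) {
            runs.add(new Run(current, buf.toString()));
        }
        return runs;
    }
}
```

Each run could then go to a detector that only considers languages written in
that script, so a Cyrillic+Latin document never has its n-grams mixed.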
> Improve accuracy of language detection
> --------------------------------------
>
> Key: TIKA-369
> URL: https://issues.apache.org/jira/browse/TIKA-369
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 0.6
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf,
> textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper
> (attached) indicates that a log-likelihood ratio (LLR) test works much
> better, which would then make language detection faster due to less text
> needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact
> value as a threshold for certainty. This is very sensitive to the amount of
> text being processed, and thus gives false negative results for short runs of
> text.
> 3. Certainty should also be based on how much better the result is for
> language X, compared to the next best language. If two languages both had
> identical sum-of-squares values, and this value was below the threshold, then
> the result is still not very certain.
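Point 3 in the description above could be sketched roughly as follows. This is
only an illustration under stated assumptions, not the existing
LanguageIdentifier code: MarginCertainty, hasClearWinner and minRelativeMargin
are made-up names, and the scores are assumed to be distances where lower
means a better match.

```java
import java.util.Map;

/** Hypothetical helper; not part of Tika's LanguageIdentifier. */
final class MarginCertainty {

    /**
     * Returns true when the best (lowest) distance beats the runner-up by at
     * least minRelativeMargin, measured as a fraction of the runner-up's
     * distance. Ties and single-candidate inputs are never considered certain.
     */
    static boolean hasClearWinner(Map<String, Double> distances, double minRelativeMargin) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // Fewer than two candidates: there is nothing to compare against.
        if (Double.isInfinite(second)) {
            return false;
        }
        // Runner-up at distance zero means the best is also zero: a tie.
        if (second == 0.0) {
            return false;
        }
        return (second - best) / second >= minRelativeMargin;
    }
}
```

Because the margin is relative, two languages with identical distances are
rejected no matter how far below an absolute threshold they both fall.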