Tim Allison created TIKA-4686:
---------------------------------

             Summary: Improve within language likelihood scores in 4.x
                 Key: TIKA-4686
                 URL: https://issues.apache.org/jira/browse/TIKA-4686
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


In the charsoupencodingdetector, we're using raw logits to choose whether a 
given decoding is better than another. This doesn't work well for languages 
that have distinctive scripts. The intuition is that the model is very 
confident that the text is language X (e.g. Chinese) rather than English, but 
we're not measuring, given that it is Chinese, how "Chinese-y" it is.

This is a different score, but we might be able to compute that with our 
current weights.
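
To make the distinction concrete, here is a minimal sketch in Python (not 
Tika code). It assumes a toy linear model where a language's logit is the sum 
of its weights over the observed features. The weights, the feature names, and 
the within_language_score helper are all invented for illustration; a 
length-normalized mean weight is only one possible way to get a "how 
Chinese-y is it" number out of existing weights, not a decided approach.

```python
import math

# Toy linear model: logit(lang) = sum of that language's weights over the
# observed features. All weights and feature names are hypothetical.
WEIGHTS = {
    "zh": {"common_cjk_bigram": 2.0, "rare_cjk_bigram": 0.3},
    "en": {"common_cjk_bigram": -1.0, "rare_cjk_bigram": -1.0},
}


def logits(features):
    """Raw per-language logits for a list of feature names."""
    return {
        lang: sum(w.get(f, 0.0) for f in features)
        for lang, w in WEIGHTS.items()
    }


def softmax(scores):
    """Numerically stable softmax over a dict of scores."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def within_language_score(features, lang):
    """Mean per-feature weight for `lang` (assumed proxy for typicality)."""
    if not features:
        return 0.0
    return logits(features)[lang] / len(features)


# Two candidate decodings of the same bytes (hypothetical feature counts):
# a plausible one with a few common bigrams, and a garbled one that yields
# many rare CJK bigrams.
good = ["common_cjk_bigram"] * 4
garbled = ["rare_cjk_bigram"] * 30

for name, feats in (("good", good), ("garbled", garbled)):
    print(
        name,
        "raw zh logit:", logits(feats)["zh"],
        "P(zh):", softmax(logits(feats))["zh"],
        "within-zh:", within_language_score(feats, "zh"),
    )
```

In this toy setup the garbled decoding gets the higher raw zh logit, and the 
zh-vs-en probability is near 1 for both, so neither separates the two. Only 
the per-feature average ranks the plausible decoding higher.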

Let's figure out how to do this on this ticket.



