Tim Allison created TIKA-4690:
---------------------------------

             Summary: Add generative language model in 4.x
                 Key: TIKA-4690
                 URL: https://issues.apache.org/jira/browse/TIKA-4690
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


I finally realized that we can play with the language detector's logits all 
we want, but they are not a great basis for "languagey vs. junk" detection. On 
this ticket, we'll add a generative model trained on the same languages as the 
language detector so that we can get a better sense of, for example, "The 
language detector said Thai; how likely is this text to actually be Thai?"
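
To make the idea concrete, here is a minimal sketch of what "how likely is this text under a generative model of the language" could look like. It is not the model planned for this ticket and not Tika code. It assumes a toy character-bigram model with add-alpha smoothing, and the names CharBigramModel, avg_log_prob, and the tiny repeated corpus are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


class CharBigramModel:
    """Toy generative model (an assumption for illustration, not Tika's design):
    scores text as a product of smoothed character-bigram probabilities."""

    def __init__(self, training_text, alpha=1.0):
        self.alpha = alpha
        padded = "\x02" + training_text  # prepend a start-of-text marker
        # Count each (previous char, current char) pair and each context char.
        self.bigrams = Counter(zip(padded, padded[1:]))
        self.contexts = Counter(padded[:-1])
        # +1 reserves one smoothing slot for characters never seen in training.
        self.vocab_size = len(set(padded)) + 1

    def avg_log_prob(self, text):
        """Mean natural-log probability per character; higher is more 'languagey'."""
        if not text:
            return float("-inf")
        padded = "\x02" + text
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + self.alpha
            den = self.contexts[prev] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total / len(text)


# Hypothetical stand-in for a real per-language training corpus.
corpus = "the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 50
model = CharBigramModel(corpus)

likely = model.avg_log_prob("the cat sat on the dog")
junk = model.avg_log_prob("zq#x7@@kvj%%w")
```

In this sketch, the per-character average makes scores comparable across inputs of different lengths. A discriminative detector has to pick one of its known languages even for junk, whereas a score like this can be low for every language, which is the signal the ticket is after.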



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
