[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856154#comment-16856154
 ] 

Tim Allison commented on TIKA-2790:
-----------------------------------

{noformat}
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            sb.append("four score and seven years ago ");
        }
        for (int i = 0; i < 100; i++) {
            sb.append("La MEL réunit 90 communes sur un territoire de près de "
                    + "650 km2 où résident plus de 1,1 million d’habitants. Située au centre d'une "
                    + "aire géographique très densément peuplée, à l’extrême ouest de la plaine "
                    + "d'Europe du Nord, elle est encadrée");
        }
        List<LangDetectResult> results =
                d.detect(sb.toString());
{noformat}

results in:
{noformat}
[LangDetectResult{lang='eng', confidence=1.0}]
{noformat}

When you get rid of the English loop, the result is lang='fra', confidence=1.0.

To be clear, I think stopping short makes quite a bit of sense, and I'll see 
how we can do that in a modified version of OpenNLP.
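
For anyone who wants to see the effect in isolation, here is a rough, self-contained sketch of why a detector that stops short can label the mixed text above as English. This is not OpenNLP's implementation: the EarlyStopSketch class, the tiny word lists, the ASCII-only French fragment and the minLead threshold are all made up for illustration. It just counts marker words and, when early stopping is on, quits as soon as one language leads by minLead hits.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EarlyStopSketch {
    // Hypothetical marker-word lists standing in for a real language model.
    private static final Set<String> ENG =
            new HashSet<>(Arrays.asList("four", "and", "seven", "years", "ago"));
    private static final Set<String> FRA =
            new HashSet<>(Arrays.asList("la", "du", "elle", "est", "au", "de"));

    /** Builds text shaped like the test case: optional short English prefix, long French body. */
    static String buildText(boolean withEnglish) {
        StringBuilder sb = new StringBuilder();
        if (withEnglish) {
            for (int i = 0; i < 3; i++) {
                sb.append("four score and seven years ago ");
            }
        }
        for (int i = 0; i < 100; i++) {
            // Illustrative ASCII-only French fragment, not the text from the comment.
            sb.append("la plaine du nord elle est au centre de la ville ");
        }
        return sb.toString();
    }

    /**
     * Counts marker words token by token. With earlyStop set, reading ends as
     * soon as one language leads the other by minLead hits.
     */
    static String detect(String text, int minLead, boolean earlyStop) {
        int eng = 0;
        int fra = 0;
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (ENG.contains(tok)) {
                eng++;
            }
            if (FRA.contains(tok)) {
                fra++;
            }
            if (earlyStop && Math.abs(eng - fra) >= minLead) {
                break; // confident enough, ignore the rest of the text
            }
        }
        if (eng == fra) {
            return "unk";
        }
        return eng > fra ? "eng" : "fra";
    }

    public static void main(String[] args) {
        System.out.println(detect(buildText(true), 10, true));   // English prefix wins
        System.out.println(detect(buildText(true), 10, false));  // full text is read
        System.out.println(detect(buildText(false), 10, true));  // no English prefix
    }
}
```

With early stopping the short English prefix decides the answer before any French is read. Reading the whole string, or dropping the prefix, gives French. That is the same pattern as the result above, under these toy assumptions.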

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>



