[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856154#comment-16856154 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

{noformat}
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 3; i++) {
    sb.append("four score and seven years ago ");
}
for (int i = 0; i < 100; i++) {
    sb.append("La MEL réunit 90 communes sur un territoire de près de 650 km2 où résident plus de 1,1 million d’habitants. Située au centre d'une aire géographique très densément peuplée, à l’extrême ouest de la plaine d'Europe du Nord, elle est encadrée");
}
List<LangDetectResult> results = d.detect(sb.toString());
{noformat}

results in:

{noformat}
[LangDetectResult{lang='eng', confidence=1.0}]
{noformat}

When you get rid of the English loop, the result is 'fra' confidence=1.0.

To be clear, I think stopping short makes quite a bit of sense, and I'll see how we can do that in a modified version of OpenNLP.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
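
For scale, the test input in the comment above can be rebuilt without any detector to see how small the English prefix is relative to the whole string. The sketch below is only an illustration: the class `MixedLanguageInput` and its methods are hypothetical names, not part of Tika or OpenNLP, and it does not call `d.detect(...)` or any language-detection API.

```java
// Hypothetical helper (not part of Tika or OpenNLP): rebuilds the mixed
// English/French input from the comment and measures the English share.
final class MixedLanguageInput {

    static final String ENGLISH = "four score and seven years ago ";
    static final String FRENCH = "La MEL réunit 90 communes sur un territoire de près de 650 km2 où résident plus de 1,1 million d’habitants. Située au centre d'une aire géographique très densément peuplée, à l’extrême ouest de la plaine d'Europe du Nord, elle est encadrée";

    // Loop counts taken from the snippet in the comment.
    static final int ENGLISH_REPEATS = 3;
    static final int FRENCH_REPEATS = 100;

    // Same construction as in the comment: English first, then French.
    static String build() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ENGLISH_REPEATS; i++) {
            sb.append(ENGLISH);
        }
        for (int i = 0; i < FRENCH_REPEATS; i++) {
            sb.append(FRENCH);
        }
        return sb.toString();
    }

    // Number of chars contributed by the English loop.
    static int englishChars() {
        return ENGLISH.length() * ENGLISH_REPEATS;
    }

    // Fraction of the whole input (by char count) that is English.
    static double englishFraction() {
        return (double) englishChars() / build().length();
    }

    public static void main(String[] args) {
        System.out.printf("English prefix: %d of %d chars (%.2f%%)%n",
                englishChars(), build().length(), 100.0 * englishFraction());
    }
}
```

The English text is well under 1% of the input by character count, yet it sits at the very start of the string, which is the portion a detector that stops short would see first.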