[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856101#comment-16856101
 ] 

Tim Allison edited comment on TIKA-2790 at 6/4/19 8:38 PM:
-----------------------------------------------------------

In going down the path of sampling, or stopping short, I wanted to see how 
much text OpenNLP needs.  The question is: what is the minimum length or 
minimum confidence beyond which the detector is always correct?  To answer 
that, I measured the inverse: for each language, the maximum confidence the 
detector reported when it identified the language incorrectly, and the text 
length at which that happened.


In the following table, I show, for each language, the maximum confidence of 
a wrong detection, the language that was incorrectly detected, and the text 
length at which that wrong detection occurred.  For example, at a text length 
of 230 characters, OpenNLP had a confidence of 0.43 that the text was 'hrv', 
but it was really 'bos'.
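
The measurement described above can be sketched roughly as follows.  This is 
only an illustration in Python, not the code used to produce the table: the 
{{Detection}} record, the {{max_wrong_confidence}} function, and the sample 
values are all hypothetical, except for the 0.43 / 230 'bos' vs. 'hrv' case 
quoted above.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detector result on a text sample (hypothetical record layout)."""
    true_lang: str       # language the sample is really in
    predicted_lang: str  # language the detector reported
    confidence: float    # detector confidence for predicted_lang
    length: int          # sample length in characters


def max_wrong_confidence(
    detections: Iterable[Detection],
) -> Dict[str, Tuple[str, float, int]]:
    """For each true language, keep the wrong detection with the highest
    confidence as (wrong_lang, confidence, length)."""
    worst: Dict[str, Tuple[str, float, int]] = {}
    for d in detections:
        if d.predicted_lang == d.true_lang:
            continue  # correct detections are not of interest here
        current = worst.get(d.true_lang)
        if current is None or d.confidence > current[1]:
            worst[d.true_lang] = (d.predicted_lang, d.confidence, d.length)
    return worst


# Toy input: only the 0.43 / 230 case comes from the table; the rest is made up.
sample = [
    Detection("bos", "hrv", 0.30, 50),
    Detection("bos", "hrv", 0.43, 230),
    Detection("bos", "bos", 0.90, 500),
    Detection("eng", "eng", 0.99, 100),
]

result = max_wrong_confidence(sample)
print(result)
```

A language that is never misidentified simply does not appear in the result, 
which is one possible way to handle that case.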

As the original confusion matrix shows, some language pairs are much harder 
and require more evidence, e.g. {{ekk}} and {{est}}, {{fas}} and {{pes}}, 
{{hrv}} and {{bos}}, {{ind}} and {{sun}}, {{pus}} and {{por}}, but many 
languages require only a very small amount of text.

||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|ast|nno|0.03|90|
|bak|tat|0.08|70|
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|ces|slk|0.01|10|
|cym|min|0.01|10|
|dan|war|0.02|30|
|deu|war|0.01|10|
|ekk|est|0.54|310|
|eng|nan|0.01|10|
|est|ekk|0.56|250|
|fas|pes|0.52|550|
|fin|min|0.01|10|
|fra|fin|0.01|10|
|gsw|lat|0.01|10|
|hrv|bos|0.64|1010|
|hun|nob|0.01|10|
|ind|sun|0.73|810|
|isl|fao|0.02|30|
|ita|fra|0.04|110|
|jav|afr|0.02|50|
|lav|lvs|0.35|170|
|lim|epo|0.02|30|
|ltz|vol|0.01|10|
|lvs|lav|0.03|30|
|mlt|eng|0.02|50|
|msa|ind|0.45|490|
|nan|tur|0.01|10|
|nds|plt|0.01|10|
|nep|san|0.01|10|
|nld|plt|0.01|10|
|nno|nob|0.12|130|
|nob|nno|0.62|290|
|oci|ita|0.01|10|
|pes|fas|0.57|730|
|pus|por|0.27|130|
|ron|lat|0.04|70|
|rus|mkd|0.02|10|
|slk|epo|0.01|10|
|slv|min|0.01|10|
|spa|vol|0.01|10|
|sqi|zul|0.01|30|
|sun|ind|0.60|790|
|swe|dan|0.02|30|
|tat|bak|0.03|30|
|tgl|ceb|0.01|10|
|tur|min|0.01|10|
|ukr|che|0.02|10|
|uzb|kir|0.02|10|
|vie|war|0.02|30|
|zul|swa|0.02|10|
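
One way to read the table programmatically is sketched below, assuming the 
rows are kept in the wiki-table form shown above.  The {{Row}} type, the 
{{parse_rows}} helper, and the 0.25 cutoff are assumptions made for 
illustration; only a handful of rows from the table are included.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    lang: str
    wrong_id: str
    max_wrong_conf: float
    length: int  # text length (characters) at the max wrong confidence


# A few rows copied from the table above, in the same wiki-table markup.
TABLE = """\
||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|bos|hrv|0.43|230|
|eng|nan|0.01|10|
|hrv|bos|0.64|1010|
|ind|sun|0.73|810|
|nno|nob|0.12|130|
|pus|por|0.27|130|
"""


def parse_rows(text: str) -> List[Row]:
    """Parse '|a|b|0.1|10|' data rows, skipping the '||'-delimited header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("||"):
            continue
        lang, wrong_id, conf, length = line.strip("|").split("|")
        rows.append(Row(lang, wrong_id, float(conf), int(length)))
    return rows


rows = parse_rows(TABLE)

# Highest confidence ever observed on a wrong detection in these rows.
ceiling = max(rows, key=lambda r: r.max_wrong_conf)

# Languages whose worst wrong detection reached an (assumed) 0.25 cutoff.
hard = [r.lang for r in rows if r.max_wrong_conf >= 0.25]

print(ceiling, hard)
```

Note that the length column is the length at which the most confident wrong 
detection occurred, so this sketch only derives a confidence ceiling, not a 
length beyond which no wrong detection was seen.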


> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, langid_20190509.zip, 
> langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
