[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837189#comment-16837189 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

Sorry...the results from the initial runs are in the attached zip file. I took the filename from opennlp's leipzig as ground truth.

I've modified the code so that we now have 50 samples per lang/length/noise tuple. I'll post those results some time today or Monday.

I'm not sure I like the noisification with any codepoint between 1 and 1000000. This has the effect of adding CJK characters because they take up so much real estate; this is disastrous for yalder. If I understand the results correctly, opennlp is quite good at handling noise.

Some other ideas for noisification:
1) limit noise to 0-255 codepoints
2) noisify based on characters in file -- I don't like this one
3) keep as is

If you could take a look at my wrappers to make sure I haven't wrecked anything, and/or if you have any recommendations for the evaluation, let me know.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: langid_20190509.zip
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
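
For illustration only, the three noisification options listed in the comment could be sketched roughly as follows. This is not the tika-eval code: the `Noisifier` class, the `Strategy` enum, and the `noiseRate` parameter are hypothetical names, and skipping surrogate code points in the wide-range case is an added assumption so that the output stays a well-formed string.

```java
import java.util.Random;

/** Hypothetical helper (not part of tika-eval): three ways to pick noise code points. */
final class Noisifier {

    /** The three options from the comment; the names are made up for this sketch. */
    enum Strategy { ANY_CODEPOINT, LATIN1, FROM_TEXT }

    private Noisifier() {}

    /**
     * Replaces each code point of text with a noise code point with
     * probability noiseRate; all other code points are copied unchanged.
     */
    static String noisify(String text, double noiseRate, Strategy strategy, Random random) {
        int[] codePoints = text.codePoints().toArray();
        if (codePoints.length == 0) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int cp : codePoints) {
            if (random.nextDouble() < noiseRate) {
                sb.appendCodePoint(randomCodePoint(strategy, codePoints, random));
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    private static int randomCodePoint(Strategy strategy, int[] source, Random random) {
        switch (strategy) {
            case LATIN1:
                // Option 1: limit noise to code points 0-255.
                return random.nextInt(256);
            case FROM_TEXT:
                // Option 2: draw noise from the characters already in the sample.
                return source[random.nextInt(source.length)];
            case ANY_CODEPOINT:
            default:
                // Option 3 ("keep as is"): any code point in 1..1,000,000.
                // Assumption of this sketch: surrogates are skipped and redrawn.
                int cp;
                do {
                    cp = 1 + random.nextInt(1_000_000);
                } while (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
                return cp;
        }
    }
}
```

A call such as `Noisifier.noisify(sample, 0.1, Noisifier.Strategy.LATIN1, new Random(42))` would replace roughly 10% of the code points in `sample`; passing a seeded `Random` keeps the noisy samples reproducible across detectors.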