[ https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318165#comment-17318165 ]
Tim Allison commented on TIKA-3343:
-----------------------------------

I finally took the time to figure out our legacy model. It computes vector distance of character trigrams between the input string and the language models. It is really elegant, and I can't bear to delete it. Because it only operates on trigrams, its performance on short texts is quite poor, but let's leave it in...but move it to where it belongs in a submodule within tika-langdetect.

> Move Tika's legacy lang id to its own module
> --------------------------------------------
>
>                 Key: TIKA-3343
>                 URL: https://issues.apache.org/jira/browse/TIKA-3343
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find
> documentation, tho, so I'm opening this issue to discuss.
> My memory is that we agreed that we should outsource language id to other
> tools and remove our own lang ider for 2.x. If my memory is wrong, or if
> there's a good reason to keep our language detection algorithm and data,
> let's discuss.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
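
For readers unfamiliar with the approach described in the comment above (vector distance of character trigrams between the input string and the language models), here is a minimal Python sketch of the general idea. It is not Tika's code: the function names (trigram_profile, distance, detect), the space padding, the normalization, the choice of Euclidean distance, and the toy one-sentence "models" are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return normalized character-trigram frequencies for text (illustrative)."""
    # Assumption: lowercase and pad with spaces so word boundaries form trigrams.
    padded = " " + text.lower() + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


def distance(a, b):
    """Euclidean distance between two sparse trigram vectors (assumed metric)."""
    grams = a.keys() | b.keys()
    return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in grams))


def detect(text, models):
    """Return the language whose model profile is closest to the input text."""
    profile = trigram_profile(text)
    return min(models, key=lambda lang: distance(profile, models[lang]))


# Toy models built from one sentence each; real models need large corpora.
MODELS = {
    "en": trigram_profile(
        "the quick brown fox jumps over the lazy dog and then the cat"
    ),
    "de": trigram_profile(
        "der schnelle braune fuchs springt und der hund sieht die katze"
    ),
}

if __name__ == "__main__":
    print(detect("the dog and the fox", MODELS))
```

A very short input yields only a handful of trigrams, so its vector is sparse and noisy compared with the model vectors, which is consistent with the poor short-text performance noted in the comment.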