[ https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984654#comment-16984654 ]
Simon poortman commented on TEXT-109:
-------------------------------------

{code:java}
// Some comments here
public String getFoo()
{
    return foo;
}
{code}

> Implement or document how to use edit distances that consider the keyboard
> layout
> ---------------------------------------------------------------------------------
>
>                 Key: TEXT-109
>                 URL: https://issues.apache.org/jira/browse/TEXT-109
>             Project: Commons Text
>          Issue Type: New Feature
>            Reporter: Bruno P. Kinoshita
>            Priority: Minor
>              Labels: discussion, edit-distance, help-wanted
>         Attachments: 328DADB9-2465-45E4-B36C-953BFF7C2B9F.jpeg,
> 63B28BEB-F040-46D9-8997-3E09DA2C94C5.jpeg,
> 8A14350C-87F1-4A58-8AF0-4E1742ED8D64.jpeg
>
>
> Most edit distances consider the number of "changes" required in one string
> to make it match another string, and they give you a value that represents
> the distance between the two words.
> While that is helpful, typing mistakes are common in datasets and corpora
> that were created with keyboards (e.g. SMS, e-mail, transcripts). In some
> cases a letter was accidentally mistyped, and the character typed instead is
> normally close to the correct one on the keyboard.
> For example, take the word "one" and two misspellings, "onr" and "oni". The
> Levenshtein distance for both is 1. But if you know that the keyboard is an
> English one with the QWERTY layout (notice the E and the R), then the
> distance between "one" and "onr" should be smaller than the distance between
> "one" and "oni", because on that keyboard the letter 'E' is next to 'R',
> whereas 'I' is not even covered by the left hand, but by the right hand.
> Here are some reference links for further research:
> * https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
> * https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
> * http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
> * https://github.com/wsong/Typo-Distance
> * https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
> Ideally, such an edit distance would be extensible to support other keyboard
> layouts.
> There is some indication that an existing edit distance such as Levenshtein
> could be extended to take the keyboard layout into consideration, so a new
> edit distance may not be strictly necessary.
> We could come to the decision that it is too hard to implement and would be
> better done in a spell checker, or that it would require some statistics and
> would be out of the scope of Text. Or we could simply add documentation on
> how to do it, without adding any code. Or, perhaps, we add a new edit
> distance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
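
For illustration only, the "one" / "onr" / "oni" example from the issue description can be sketched as a Levenshtein variant whose substitution cost depends on key proximity. Everything below is an assumption made for this sketch and not part of Commons Text: the class name KeyboardDistanceSketch, the approximate QWERTY key coordinates (each row shifted by half a key), the neighbour threshold of 1.5 key widths, and the cost of 0.5 for substituting a neighbouring key. It also assumes that a neighbouring-key typo should count as a smaller distance than a far-key typo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Levenshtein distance with cheaper substitutions
// between keys that are physically close on a QWERTY keyboard.
final class KeyboardDistanceSketch {

    // Letter rows of a QWERTY keyboard (assumed layout, letters only).
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Approximate key positions as {x, y}; each row is shifted right by 0.5.
    private static final Map<Character, double[]> POSITIONS = new HashMap<>();

    static {
        for (int row = 0; row < ROWS.length; row++) {
            for (int col = 0; col < ROWS[row].length(); col++) {
                POSITIONS.put(ROWS[row].charAt(col), new double[] {col + 0.5 * row, row});
            }
        }
    }

    // 0.0 for equal characters, 0.5 for neighbouring keys, 1.0 otherwise.
    static double substitutionCost(char a, char b) {
        char x = Character.toLowerCase(a);
        char y = Character.toLowerCase(b);
        if (x == y) {
            return 0.0;
        }
        double[] p = POSITIONS.get(x);
        double[] q = POSITIONS.get(y);
        if (p == null || q == null) {
            return 1.0; // characters outside the layout get the plain cost
        }
        double keyDistance = Math.hypot(p[0] - q[0], p[1] - q[1]);
        return keyDistance < 1.5 ? 0.5 : 1.0;
    }

    // Standard two-row dynamic programme; insertions and deletions cost 1.0.
    static double distance(CharSequence left, CharSequence right) {
        double[] previous = new double[right.length() + 1];
        double[] current = new double[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                double substitute = previous[j - 1]
                        + substitutionCost(left.charAt(i - 1), right.charAt(j - 1));
                double delete = previous[j] + 1.0;
                double insert = current[j - 1] + 1.0;
                current[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("one", "onr")); // 'e' and 'r' are neighbours
        System.out.println(distance("one", "oni")); // 'e' and 'i' are far apart
    }
}
```

Under these assumptions "onr" ends up closer to "one" than "oni" does, while plain Levenshtein scores both as 1. Supporting another layout would only mean swapping the ROWS table, and the distance method has the same shape as the apply(CharSequence, CharSequence) method of the EditDistance interface in Commons Text, so it could probably be adapted to it.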