[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364542#comment-16364542
 ] 

Mark Dacek commented on TEXT-109:
---------------------------------

Hello. I discussed this with [~chtompki] today. We will be attempting this 
shortly. 

> Implement or document how to use edit distances that consider the keyboard 
> layout
> ---------------------------------------------------------------------------------
>
>                 Key: TEXT-109
>                 URL: https://issues.apache.org/jira/browse/TEXT-109
>             Project: Commons Text
>          Issue Type: New Feature
>            Reporter: Bruno P. Kinoshita
>            Priority: Minor
>              Labels: discussion, edit-distance, help-wanted
>
> Most edit distances consider the number of "changes" required to turn one 
> string into another, and give you a value that represents the distance 
> between the words.
> While that is helpful, datasets and corpora that were created with keyboards 
> (e.g. SMS, e-mail, transcripts) commonly contain typing mistakes. In some 
> cases a letter was accidentally mistyped, but the character typed is 
> normally close to the correct character on the keyboard.
> For example, take the word "one" and two misspellings, "onr" and "oni". The 
> Levenshtein distance for both is 1. But if you know that the keyboard is an 
> English one with the QWERTY layout (notice the E and the R), then the 
> distance between "one" and "onr" should be smaller than the distance between 
> "one" and "oni", because on that keyboard the letter 'E' neighbours 'R', 
> whereas 'I' is not even covered by the left hand, but by the right hand.
> Here are some reference links for further research.
> * https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
> * https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
> * http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
> * https://github.com/wsong/Typo-Distance
> * https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
> Ideally such edit distance would be extensible to support other keyboard 
> layouts.
> There is some indication that an existing edit distance like Levenshtein 
> could be extended to take the keyboard layout into consideration, so a new 
> edit distance may not be entirely necessary.
> We could come to the decision that it is too hard to implement and would be 
> better done in a spell checker, or that it would require some statistics and 
> would be out of the scope of Text. Or we could simply add documentation on 
> how to do it, without adding any code. Or perhaps we add a new edit 
> distance.
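
The idea in the description of extending Levenshtein with the keyboard layout 
could be sketched roughly as follows. This is only an illustration in Python, 
not Commons Text code and not the approach agreed on this issue. The names 
QWERTY_ROWS, substitution_cost and keyboard_levenshtein, the near_cost value 
of 0.5, and the simplified grid that ignores the stagger between key rows are 
all assumptions made for the example.

```python
# Assumed layout: the three letter rows of an English QWERTY keyboard,
# treated as a plain grid (real keyboards stagger the rows).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")

# Map each letter to its (row, column) position on that grid.
KEY_POS = {
    ch: (r, c)
    for r, row in enumerate(QWERTY_ROWS)
    for c, ch in enumerate(row)
}


def substitution_cost(a, b, near_cost=0.5):
    """Cost of replacing a with b: 0 if equal, near_cost if the keys are
    neighbours on the grid (or the same key in a different case), else 1."""
    if a == b:
        return 0.0
    pa = KEY_POS.get(a.lower())
    pb = KEY_POS.get(b.lower())
    if pa is None or pb is None:
        return 1.0  # characters outside the layout get the usual cost
    if abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
        return near_cost
    return 1.0


def keyboard_levenshtein(s, t, near_cost=0.5):
    """Levenshtein distance with unit insert/delete costs and a
    keyboard-aware substitution cost, computed row by row."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                                  # delete a
                curr[j - 1] + 1.0,                              # insert b
                prev[j - 1] + substitution_cost(a, b, near_cost),
            ))
        prev = curr
    return prev[-1]


print(keyboard_levenshtein("one", "onr"))  # 0.5, since E and R are neighbours
print(keyboard_levenshtein("one", "oni"))  # 1.0, since E and I are far apart
```

With near_cost set to 1.0 the function falls back to the plain Levenshtein 
distance, and supporting another keyboard layout would only mean swapping the 
row table.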



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
