[jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout

2019-11-28 Thread Simon poortman (Jira)


[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984654#comment-16984654
 ] 

Simon poortman commented on TEXT-109:
-


{code:java}
// Some comments here
public String getFoo() {
    return foo;
}
{code}


> Implement or document how to use edit distances that consider the keyboard 
> layout
> -
>
> Key: TEXT-109
> URL: https://issues.apache.org/jira/browse/TEXT-109
> Project: Commons Text
> Issue Type: New Feature
> Reporter: Bruno P. Kinoshita
> Priority: Minor
> Labels: discussion, edit-distance, help-wanted
> Attachments: 328DADB9-2465-45E4-B36C-953BFF7C2B9F.jpeg, 
> 63B28BEB-F040-46D9-8997-3E09DA2C94C5.jpeg, 
> 8A14350C-87F1-4A58-8AF0-4E1742ED8D64.jpeg
>
>
> Most edit distances count the number of "changes" required to turn one 
> string into another, and return a value that represents the distance 
> between the two words.
> While this is helpful, datasets and corpora that were created with 
> keyboards (e.g. SMS, e-mail, transcripts) commonly contain typing 
> mistakes. In some cases a letter was accidentally mistyped, and the 
> character used is normally close to the correct character on the keyboard.
> For example, take the word "one" and two misspellings, "onr" and "oni". 
> The Levenshtein distance for both is 1. But if you know that the keyboard 
> layout is English QWERTY (notice the E and the R), then the distance 
> between "one" and "onr" should be smaller than the distance between "one" 
> and "oni", because on that keyboard the letter 'E' neighbours 'R', whereas 
> 'I' is not even covered by the left hand, but by the right hand.
> Here are some reference links for further research.
> * https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
> * https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
> * http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
> * https://github.com/wsong/Typo-Distance
> * 
> https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
> Ideally such an edit distance would be extensible to support other keyboard 
> layouts.
> There is some indication that an existing edit distance such as 
> Levenshtein could be extended to take the keyboard layout into account, 
> so a new edit distance may not be necessary.
> We could come to the decision that it is too hard to implement and 
> would be better done in a spell checker, or that it would require some 
> statistics and would be out of the scope of Text. Or we could simply add 
> documentation on how to do it, without adding any code. Or perhaps we add a 
> new edit distance.
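
The "one" / "onr" / "oni" example from the description can be checked with a plain, unit-cost Levenshtein distance. The sketch below is only an illustration: the class and method names are hypothetical, and it is written without dependencies rather than calling the existing org.apache.commons.text.similarity.LevenshteinDistance class. It shows that the unweighted distance cannot tell the two misspellings apart.

```java
final class PlainLevenshteinDemo {

    // Classic dynamic-programming Levenshtein distance: every insertion,
    // deletion and substitution costs exactly 1, whatever the characters are.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("one", "onr")); // prints 1
        System.out.println(levenshtein("one", "oni")); // prints 1
    }
}
```

Both calls return 1, so any preference for "onr" as the likelier typo has to come from extra information such as the keyboard layout.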



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout

2019-11-28 Thread Simon poortman (Jira)


[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984652#comment-16984652
 ] 

Simon poortman commented on TEXT-109:
-

Begin war in eu union benelux come for 1 country nederland België luxenburg = 1 
bundesland



[jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout

2018-02-14 Thread Mark Dacek (JIRA)

[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364542#comment-16364542
 ] 

Mark Dacek commented on TEXT-109:
-

Hello. I discussed this with [~chtompki] today. We will be attempting this 
shortly. 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout

2018-02-12 Thread Rob Tompkins (JIRA)

[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361738#comment-16361738
 ] 

Rob Tompkins commented on TEXT-109:
---

It's also worth noting that if one were to weight the distance between keys 
using the standard Cartesian coordinate system, there will necessarily be 
inconsistencies in the distances because key alignment differs between 
keyboards. For example, the bottom row of the iPhone and Android keyboards 
lines up vertically with the middle row of keys. Maybe we'd consider those 
separately.
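
To make the alignment point concrete, here is a rough sketch of a Cartesian key distance where each row has its own horizontal offset. Everything in it is an assumption for illustration: the class name, the method names and especially the offset values, which are approximate guesses for a staggered desktop keyboard and for a phone keyboard whose bottom row sits directly under the middle row.

```java
final class KeyGeometry {

    // The three letter rows of a QWERTY layout, top to bottom.
    static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Assumed horizontal offsets of each row, in key widths (illustrative only).
    static final double[] DESKTOP_OFFSETS = {0.0, 0.25, 0.75};
    // Assumed phone layout: 'z' lines up vertically with 's'.
    static final double[] PHONE_OFFSETS = {0.0, 0.5, 1.5};

    // Returns {x, y} of the key centre; y is the row index, x is column + row offset.
    static double[] position(char c, double[] rowOffsets) {
        char lower = Character.toLowerCase(c);
        for (int row = 0; row < ROWS.length; row++) {
            int col = ROWS[row].indexOf(lower);
            if (col >= 0) {
                return new double[] {col + rowOffsets[row], row};
            }
        }
        throw new IllegalArgumentException("Key not in layout: " + c);
    }

    // Euclidean distance between two key centres for the given row offsets.
    static double keyDistance(char a, char b, double[] rowOffsets) {
        double[] pa = position(a, rowOffsets);
        double[] pb = position(b, rowOffsets);
        return Math.hypot(pa[0] - pb[0], pa[1] - pb[1]);
    }

    public static void main(String[] args) {
        // Same pair of keys, different distance depending on the layout geometry.
        System.out.println(keyDistance('s', 'z', PHONE_OFFSETS));
        System.out.println(keyDistance('s', 'z', DESKTOP_OFFSETS));
    }
}
```

With these assumed offsets, 's' and 'z' are exactly one key apart on the phone layout but slightly further apart on the desktop layout, which is the kind of inconsistency that might justify treating the two families of keyboards separately.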



[jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout

2018-02-12 Thread Rob Tompkins (JIRA)

[ 
https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361719#comment-16361719
 ] 

Rob Tompkins commented on TEXT-109:
---

It would seem that we want some flavor of a "weighted" edit distance here, 
where for each edit you consider the keyboard distance between the keys. 
For this we would clearly have to build keyboards as constants. Let me 
fiddle around with the ideas some and I might be able to come up with something 
interesting. My thought is to try coming up with something and then 
cross-check it against the other ideas in the space listed above. 
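
As a starting point for discussion, a minimal sketch of such a weighted distance could look like the following. It is not an existing Commons Text class: the class name, the QWERTY_ROWS constant and the cost values (0.5 for neighbouring keys, 1.0 otherwise) are all assumptions picked for illustration, and "neighbouring" is approximated as adjacent cells in a simple row/column grid. Letter case is ignored.

```java
final class KeyboardWeightedLevenshtein {

    // Hypothetical keyboard constant: the letter rows of a US QWERTY layout.
    static final String[] QWERTY_ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Assumed costs, chosen only for illustration.
    static final double NEIGHBOUR_COST = 0.5;
    static final double DEFAULT_COST = 1.0;

    // Returns {row, column} of a key, or null if it is not in the layout.
    static int[] locate(char c) {
        for (int row = 0; row < QWERTY_ROWS.length; row++) {
            int col = QWERTY_ROWS[row].indexOf(c);
            if (col >= 0) {
                return new int[] {row, col};
            }
        }
        return null;
    }

    // Substituting a key for one of its grid neighbours is cheaper than any other edit.
    static double substitutionCost(char a, char b) {
        a = Character.toLowerCase(a);
        b = Character.toLowerCase(b);
        if (a == b) {
            return 0.0;
        }
        int[] pa = locate(a);
        int[] pb = locate(b);
        if (pa == null || pb == null) {
            return DEFAULT_COST; // unknown keys fall back to the plain cost
        }
        boolean neighbours = Math.abs(pa[0] - pb[0]) <= 1 && Math.abs(pa[1] - pb[1]) <= 1;
        return neighbours ? NEIGHBOUR_COST : DEFAULT_COST;
    }

    // Levenshtein recurrence with a keyboard-aware substitution cost.
    static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            d[i][0] = i * DEFAULT_COST;
        }
        for (int j = 1; j <= t.length(); j++) {
            d[0][j] = j * DEFAULT_COST;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double delete = d[i - 1][j] + DEFAULT_COST;
                double insert = d[i][j - 1] + DEFAULT_COST;
                double substitute = d[i - 1][j - 1]
                        + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(Math.min(delete, insert), substitute);
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("one", "onr")); // prints 0.5
        System.out.println(distance("one", "oni")); // prints 1.0
    }
}
```

Keeping the layout in a constant and the cost in one method means another keyboard, or a different weighting scheme, could be swapped in without touching the recurrence.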
