[jira] [Commented] (DRILL-3747) UDF for "fuzzy" string and similarity matching

2015-10-30 Thread Karol Potocki (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982753#comment-14982753
 ] 

Karol Potocki commented on DRILL-3747:
--

Such functionality is often required when searching data produced by user 
collaboration (e.g. street names in internet data sources) or when building 
search conditions from user input (handling spelling mistakes).
I recently needed a solution like that; a basic implementation is on my GitHub:
https://github.com/k255/drill-fuzzy-search
It is built on the simmetrics library, which was recently relicensed under the 
Apache License.

> UDF for "fuzzy" string and similarity matching
> --
>
> Key: DRILL-3747
> URL: https://issues.apache.org/jira/browse/DRILL-3747
> Project: Apache Drill
> Issue Type: New Feature
> Components: Functions - Drill
> Affects Versions: Future
> Reporter: Edmon Begoli
> Priority: Minor
> Labels: features
> Fix For: Future
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> I propose implementing string distance and similarity matching functions 
> similar to those found in most other databases: soundex, metaphone, 
> levenshtein (and more advanced variants such as Damerau-Levenshtein, 
> Jaro-Winkler, etc.).
> See fuzzystrmatch 
> http://www.postgresql.org/docs/9.5/static/fuzzystrmatch.html, 
> and pg_similarity http://pgsimilarity.projects.pgfoundry.org/
> for inspiration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-3747) UDF for "fuzzy" string and similarity matching

2015-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983103#comment-14983103
 ] 

ASF GitHub Bot commented on DRILL-3747:
---

GitHub user k255 opened a pull request:

https://github.com/apache/drill/pull/224

DRILL-3747: basic similarity search with simmetric

Helps handle, e.g., typos in search queries using popular algorithms such as 
Levenshtein.
Sample query:
```
select levenshtein('foo', 'boo') from (VALUES(1)); -- gives 0.67
```
and
```
select levenshtein('foo', 'bar') from (VALUES(1)); -- not similar, gives 0
```
More:
https://github.com/k255/drill-fuzzy-search
https://en.wikipedia.org/wiki/Levenshtein_distance
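
The sample results above (0.67 and 0) are consistent with a Levenshtein 
similarity normalized by the length of the longer string, i.e. 
1 - distance / max(len(a), len(b)). The following is a minimal Python sketch 
of that calculation. The normalization is an assumption inferred from the two 
sample values, not a confirmed description of what simmetrics or this pull 
request does, and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(round(levenshtein_similarity("foo", "boo"), 2))  # 0.67
print(round(levenshtein_similarity("foo", "bar"), 2))  # 0.0
```

Under this assumption 'foo' and 'boo' differ by one substitution out of three 
characters, and 'foo' and 'bar' differ in all three positions.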

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/k255/drill drill-fuzzysearch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #224


commit 51248358adf7ee71a744cccb7a22b45850f192a8
Author: potocki 
Date:   2015-10-30T18:54:41Z

basic similarity search with simmetric



