[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/13870 )


Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.
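
As a rough illustration of the description above (not the code from this
patch), the computation could be sketched as follows. The function names,
the use of std::string instead of Impala's StringVal, the byte-wise
comparison, and the argument order (scaling factor, then boost threshold)
are assumptions made for the example. The prefix cap of 4 and the default
scaling factor of 0.1 are the conventional values from the Wikipedia
article linked above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Jaro similarity in [0, 1]; 1.0 means the strings are identical.
// Hypothetical helper: compares bytes, not Unicode code points.
double JaroSimilarity(const std::string& s1, const std::string& s2) {
  if (s1 == s2) return 1.0;  // also covers two empty strings
  if (s1.empty() || s2.empty()) return 0.0;

  const int len1 = static_cast<int>(s1.size());
  const int len2 = static_cast<int>(s2.size());
  // Two equal characters only count as a match if they are at most
  // this many positions apart.
  const int window = std::max(0, std::max(len1, len2) / 2 - 1);

  std::vector<bool> matched1(len1, false);
  std::vector<bool> matched2(len2, false);
  int matches = 0;
  for (int i = 0; i < len1; ++i) {
    const int lo = std::max(0, i - window);
    const int hi = std::min(len2 - 1, i + window);
    for (int j = lo; j <= hi; ++j) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] = true;
        matched2[j] = true;
        ++matches;
        break;
      }
    }
  }
  if (matches == 0) return 0.0;

  // Walk the matched characters of both strings in order and count the
  // positions where they disagree; half of that is the transposition count.
  int out_of_order = 0;
  for (int i = 0, j = 0; i < len1; ++i) {
    if (!matched1[i]) continue;
    while (!matched2[j]) ++j;
    if (s1[i] != s2[j]) ++out_of_order;
    ++j;
  }
  const double m = matches;
  const double t = out_of_order / 2;  // integer division is intended
  return (m / len1 + m / len2 + (m - t) / m) / 3.0;
}

// Jaro-Winkler similarity: the common-prefix bonus is applied only when
// the Jaro similarity exceeds boost_threshold (0.7 by default, as in the
// commit message).
double JaroWinklerSimilarity(const std::string& s1, const std::string& s2,
                             double scaling_factor = 0.1,
                             double boost_threshold = 0.7) {
  // With a prefix of at most 4 characters, a factor above 0.25 could
  // push the result past 1.0.
  assert(scaling_factor >= 0.0 && scaling_factor <= 0.25);
  const double jaro = JaroSimilarity(s1, s2);
  if (jaro <= boost_threshold) return jaro;

  std::size_t prefix = 0;
  while (prefix < 4 && prefix < s1.size() && prefix < s2.size() &&
         s1[prefix] == s2[prefix]) {
    ++prefix;
  }
  return jaro + prefix * scaling_factor * (1.0 - jaro);
}

// The distance is the complement of the similarity.
double JaroWinklerDistance(const std::string& s1, const std::string& s2,
                           double scaling_factor = 0.1,
                           double boost_threshold = 0.7) {
  return 1.0 - JaroWinklerSimilarity(s1, s2, scaling_factor, boost_threshold);
}
```

Under these assumptions, JaroSimilarity("martha", "marhta") is 17/18, and
raising the boost threshold above that value (for example to 0.99) makes
JaroWinklerSimilarity return the plain Jaro similarity, because the prefix
bonus is skipped.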

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 316 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/3
--
To view, visit http://gerrit.cloudera.org:8080/13870
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13870/3/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/13870/3/be/src/exprs/expr-test.cc@4071
PS3, Line 4071: TestValue(fn_name + "('martha', 'marhta', 0.1, 0.99)", TYPE_DOUBLE, 0.05558);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/13870/3/be/src/exprs/expr-test.cc@4101
PS3, Line 4101: TestValue(fn_name + "('martha', 'marhta', 0.1, 0.99)", TYPE_DOUBLE, 0.94442);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/13870/3/be/src/exprs/expr-test.cc@4102
PS3, Line 4102: TestValue(fn_name + "('dwayne', 'duane', 0.1, 0.9)", TYPE_DOUBLE, 0.8);
line too long (91 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Jul 2019 09:27:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#4). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 316 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/4

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13870/4/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/13870/4/be/src/exprs/expr-test.cc@4071
PS4, Line 4071: TestValue(fn_name + "('martha', 'marhta', 0.1, 0.99)", TYPE_DOUBLE, 0.05558);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/13870/4/be/src/exprs/expr-test.cc@4101
PS4, Line 4101: TestValue(fn_name + "('martha', 'marhta', 0.1, 0.99)", TYPE_DOUBLE, 0.94442);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/13870/4/be/src/exprs/expr-test.cc@4102
PS4, Line 4102: TestValue(fn_name + "('dwayne', 'duane', 0.1, 0.9)", TYPE_DOUBLE, 0.8);
line too long (91 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Jul 2019 09:57:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#5). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 319 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/5

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 5
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3905/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Jul 2019 10:06:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3906/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Jul 2019 10:37:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3907/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 5
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Jul 2019 10:37:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#6). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 319 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/6

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3944/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Jul 2019 11:20:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 6: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13870/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/13870/1/be/src/exprs/string-functions-ir.cc@1190
PS1, Line 1190:   if (s1len == s2len && memcmp(s1.ptr, s2.ptr, s1len) == 0) return DoubleVal(1.0);
              :   if (s1len == 0) return DoubleVal(0.0);
> Can we also add test cases when both parameters are empty strings?
I think my previous comment got forgotten.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Jul 2019 11:21:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Norbert Luksa (Code Review)
Norbert Luksa has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13870/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/13870/1/be/src/exprs/string-functions-ir.cc@1190
PS1, Line 1190:   if (s1len == s2len && memcmp(s1.ptr, s2.ptr, s1len) == 0) return DoubleVal(1.0);
              :   if (s1len == 0) return DoubleVal(0.0);
> I think my previous comment got forgotten.
Done




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Jul 2019 11:41:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#7). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 323 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/7

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-07-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3946/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Jul 2019 12:23:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-05 Thread Greg Rahn (Code Review)
Greg Rahn has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 7:

How's this coming along? Is there anything specific that I can clarify?



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Aug 2019 02:19:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-06 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 7:

Hi Greg, I was planning to give it +2 once we have tried an ASAN build. Do you
plan to review it as well, or should we just go forward?



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Aug 2019 09:04:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-12 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 7: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 12 Aug 2019 17:09:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 8: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 12 Aug 2019 17:09:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4773/ 
DRY_RUN=false



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 12 Aug 2019 17:09:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4773/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 12 Aug 2019 21:19:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 319 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/9

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 9
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Norbert Luksa (Code Review)
Norbert Luksa has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings; the Jaro-Winkler variants then add more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 323 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/10

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/4234/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 9
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 13 Aug 2019 13:51:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/4235/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 13 Aug 2019 14:00:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 10: Code-Review+2


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 13 Aug 2019 14:15:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4781/ 
DRY_RUN=false


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 13 Aug 2019 14:15:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..


Patch Set 10: Verified+1


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 13 Aug 2019 18:25:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

2019-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
..

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro or Jaro-Winkler similarity or
distance of two strings. The algorithm calculates the Jaro similarity
of the strings, then (for Jaro-Winkler) adds more weight to the result
if the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter, the boost
threshold. The prefix weight is applied only if the Jaro similarity
exceeds the given threshold. By default, its value is 0.7.
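
A minimal sketch of the computation described above, assuming the
standard definitions from the linked Wikipedia article. It is written in
Python for brevity and is not the code from this patch. The scaling
factor of 0.1, the 4-character prefix cap, and the "distance is one
minus similarity" relation are assumptions taken from the usual
formulation; only the 0.7 default boost threshold comes from this
commit message.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order,
    # counted in pairs.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str,
                            scaling: float = 0.1,          # assumed default
                            boost_threshold: float = 0.7) -> float:
    """Jaro similarity, boosted for a shared prefix above the threshold."""
    jaro = jaro_similarity(s1, s2)
    # The prefix weight applies only if the Jaro similarity exceeds
    # the boost threshold.
    if jaro <= boost_threshold:
        return jaro
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:  # assumed cap of 4 prefix characters
            break
        prefix += 1
    return jaro + prefix * scaling * (1.0 - jaro)


def jaro_winkler_distance(s1: str, s2: str, **kwargs) -> float:
    """Assumed relation: distance = 1 - similarity."""
    return 1.0 - jaro_winkler_similarity(s1, s2, **kwargs)
```

For the textbook pair "MARTHA" and "MARHTA", this sketch gives a Jaro
similarity of about 0.944 and a Jaro-Winkler similarity of about 0.961.
Raising boost_threshold to 0.95 suppresses the prefix boost, so the
Jaro-Winkler result falls back to the plain Jaro value.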

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manually tested over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   The results match Apache Commons.

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy 
Tested-by: Impala Public Jenkins 
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 323 insertions(+), 0 deletions(-)

Approvals:
  Zoltan Borok-Nagy: Looks good to me, approved
  Impala Public Jenkins: Verified

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy