Norbert Luksa has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/13870 )
Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
......................................................................

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance of
two strings. The algorithm calculates the Jaro similarity of the strings,
then adds more weight to the result if the strings share a common prefix
(Jaro-Winkler). For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight will only be applied if the Jaro similarity exceeds the
given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache Commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 319 insertions(+), 0 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/13870/6
--
To view, visit http://gerrit.cloudera.org:8080/13870
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Greg Rahn <gr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
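
For readers unfamiliar with the metric, the calculation described in the commit message (Jaro similarity first, then a prefix bonus applied only above the boost threshold) can be sketched roughly as below. This is an illustrative Python sketch, not the C++ code from the patch: the function names merely mirror the new built-ins, the scaling factor of 0.1 and the 4-character prefix cap are assumed from the conventional Jaro-Winkler formulation rather than stated in the commit message, and the handling of empty strings is likewise an assumption.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0  # assumption: two empty strings count as identical
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order; every position
    # where they differ is half of a transposition.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            scaling_factor: float = 0.1,   # assumed default
                            boost_threshold: float = 0.7) -> float:
    """Jaro similarity plus a common-prefix bonus above the threshold."""
    jaro = jaro_similarity(s1, s2)
    # The prefix weight applies only if the Jaro similarity exceeds the threshold.
    if jaro <= boost_threshold:
        return jaro
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # prefix capped at 4 chars (assumed)
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling_factor * (1.0 - jaro)


def jaro_winkler_distance(s1: str, s2: str, **kwargs) -> float:
    """Distance is taken here as 1 - similarity."""
    return 1.0 - jaro_winkler_similarity(s1, s2, **kwargs)


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
```

With the classic "MARTHA"/"MARHTA" pair, all six characters match with one transposition, and the shared prefix "MAR" raises the score from the plain Jaro value. Raising `boost_threshold` above the Jaro score suppresses the prefix bonus entirely.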