[
https://issues.apache.org/jira/browse/IMPALA-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joe McDonnell updated IMPALA-14753:
-----------------------------------
Summary: Optimizations for string search (was: SIMD optimizations for
string search)
> Optimizations for string search
> -------------------------------
>
> Key: IMPALA-14753
> URL: https://issues.apache.org/jira/browse/IMPALA-14753
> Project: IMPALA
> Issue Type: Task
> Components: Backend
> Affects Versions: Impala 5.0.0
> Reporter: Joe McDonnell
> Priority: Major
>
> In be/src/benchmarks/string-search-benchmark.cc, it compares our current
> implementation in be/src/runtime/string-search.h (which came from Python) to
> one that calls out to strstr in libc. Running this today, I get these results:
> {noformat}
> $ bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R
> be/build/latest/benchmarks/string-search-benchmark
> Machine Info: 12th Gen Intel(R) Core(TM) i9-12900K
> String Search: Function iters/ms 10%ile 50%ile 90%ile
> 10%ile 50%ile 90%ile
>
> (relative) (relative) (relative)
> ---------------------------------------------------------------------------------------------------------
> Python 165 168 169
> 1X 1X 1X
> LibC 1.37e+03 1.39e+03 1.4e+03
> 8.31X 8.26X 8.25X
> Null Terminated SSE 924 932 937
> 5.59X 5.56X 5.54X
> Non-null Terminated SSE 764 774 778
> 4.63X 4.61X 4.59X{noformat}
> This suggests that string search would benefit from SIMD optimizations.
> Further, there are libraries that claim to be faster than strstr (e.g.
> [https://github.com/ashvardanian/StringZilla] ).
> Our current code is based on Cython code in faststring.h:
> [https://github.com/python/cpython/blob/main/Objects/stringlib/fastsearch.h]
> The Cython code has since added an implementation of Crochemore and Perrin's
> (1991) Two-Way algorithm. This can avoid certain pathological cases.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]