[ 
https://issues.apache.org/jira/browse/IMPALA-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059852#comment-18059852
 ] 

Joe McDonnell edited comment on IMPALA-14753 at 2/20/26 7:00 AM:
-----------------------------------------------------------------

We can't use the LGPL code from the glibc implementation of strstr (though it 
seems to be quite good), and the null-terminated requirement is hard for us to 
satisfy (i.e. the benchmark code modifies strings in place to null-terminate, 
and that is not something we can actually do).

In Apache Doris, they started with code similar to ours, then first improved 
this by using std::search (for an unknown speedup): 
[https://github.com/apache/doris/commit/a4e7c76336e7f7ebfa5449dc26de461b601c5a49]

Then, they switched to Volnitsky: 
[https://github.com/apache/doris/commit/2335d233f1f52eb64a380b4c9959becdf182b71b]

Then they moved away from Volnitsky: 
[https://github.com/apache/doris/commit/5efafefedaf261d3ef87731bff28f5c7df7faa99]

It's unclear how much of a speedup we could get.


was (Author: joemcdonnell):
We can't use the LGPL code from the glibc implementation of strstr (though it 
seems to be quite good), and the null-terminated requirement is hard for us to 
satisfy.

In Apache Doris, they started with code similar to ours, then first improved 
this by using std::search (for an unknown speedup): 
[https://github.com/apache/doris/commit/a4e7c76336e7f7ebfa5449dc26de461b601c5a49]

Then, they switched to Volnitsky: 
[https://github.com/apache/doris/commit/2335d233f1f52eb64a380b4c9959becdf182b71b]

Then they moved away from Volnitsky: 
[https://github.com/apache/doris/commit/5efafefedaf261d3ef87731bff28f5c7df7faa99]

It's unclear how much of a speedup we could get.

> Optimizations for string search
> -------------------------------
>
>                 Key: IMPALA-14753
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14753
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Impala 5.0.0
>            Reporter: Joe McDonnell
>            Priority: Major
>
> In be/src/benchmarks/string-search-benchmark.cc, it compares our current 
> implementation in be/src/runtime/string-search.h (which came from Python) to 
> one that calls out to strstr in libc. Running this today, I get these results:
> {noformat}
> $ bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R 
> be/build/latest/benchmarks/string-search-benchmark 
> Machine Info: 12th Gen Intel(R) Core(TM) i9-12900K
> String Search:             Function  iters/ms   10%ile   50%ile   90%ile     
> 10%ile     50%ile     90%ile
>                                                                          
> (relative) (relative) (relative)
> ---------------------------------------------------------------------------------------------------------
>                              Python                165      168      169      
>    1X         1X         1X
>                                LibC           1.37e+03 1.39e+03  1.4e+03      
> 8.31X      8.26X      8.25X
>                 Null Terminated SSE                924      932      937      
> 5.59X      5.56X      5.54X
>             Non-null Terminated SSE                764      774      778      
> 4.63X      4.61X      4.59X{noformat}
> This suggests that string search would benefit from SIMD optimizations. 
> Further, there are libraries that claim to be faster than strstr (e.g. 
> [https://github.com/ashvardanian/StringZilla] ).
> Our current code is based on Cython code in faststring.h: 
> [https://github.com/python/cpython/blob/main/Objects/stringlib/fastsearch.h]
> The Cython code has since added an implementation of Crochemore and Perrin's 
> (1991) Two-Way algorithm. This can avoid certain pathological cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to