On 05/24/2016 03:54 AM, qznc wrote:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
On 05/23/2016 03:11 PM, qznc wrote:
Actually, std find should be faster, since it could use the Boyer Moore
algorithm instead of naive string matching.

Conventional wisdom has it that find() is brute force and that's that,
but probably it's time to destroy. Selectively using advanced
searching algorithms for the appropriate inputs is very DbI-ish.

There are a few nice precedents of blend algorithms, see e.g.
http://effbot.org/zone/stringlib.htm.

Writing a generic subsequence search blend algorithm, one that chooses
the right algorithm based on a combination of static and dynamic
decisions, is quite a fun and challenging project. Who wanna?

I observed that Boyer-Moore from std is still slower. My guess is due to
the relatively short strings in my benchmark and the need to allocate
for the tables. Knuth-Morris-Pratt might be worthwhile, because only
requires a smaller table, which could be allocated on the stack.

There's also Rabin-Karp that can be used when the sought range is relatively short.

The overall sentiment seems to be that KMP vs BM depends on the input.
This means an uninformed find function could only use heuristics.
Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to
the BoyerMooreFinder. I filed an issue [0].

Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.

Apart from advanced algorithms, find should not be slower than a naive
nested-for-loop implementation.

[0] https://issues.dlang.org/show_bug.cgi?id=16066

Nice, thanks.


Andrei

Reply via email to