faster splitter
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu wrote: That's comparable, please file an enh request. http://d.puremagic.com/issues/show_bug.cgi?id=9646

Below is an implementation which matches MySplitter with dmd, but not with ldc. More precisely:

dmd -O -release -inline -noboundscheck:
std.algorithm.splitter took 5 secs, 170 ms, and 656 μs
MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec
my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs

ldc2 -O -release:
std.algorithm.splitter took 3 secs, 540 ms, and 84 μs
MySplitter took 2 secs, 288 ms, and 535 μs
my_splitter took 3 secs, 463 ms, and 897 μs

CPU: Intel i7 M 620 2.67GHz
dmd: v2.070.2
ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1)

    auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s)
    if (is(typeof(binaryFun!pred(r.front, s.front)) : bool)
        && (hasSlicing!Range || isNarrowString!Range)
        && isForwardRange!Separator
        && (hasLength!Separator || isNarrowString!Separator))
    {
        import std.algorithm.searching : find;
        import std.conv : unsigned;

        static struct Result
        {
        private:
            Range _input;
            Separator _separator;
            size_t _next_index;
            bool _empty;

            @property auto separatorLength() { return _separator.length; }

            void findIndex()
            {
                auto found = find!pred(_input, _separator);
                _next_index = _input.length - found.length;
            }

        public:
            this(Range input, Separator separator)
            {
                _input = input;
                _separator = separator;
                _empty = false;
                findIndex();
            }

            @property Range front()
            {
                assert(!empty);
                return _input[0 .. _next_index];
            }

            static if (isInfinite!Range)
            {
                enum bool empty = false; // Propagate infiniteness
            }
            else
            {
                @property bool empty() { return _empty; }
            }

            void popFront()
            {
                assert(!empty);
                if (_input.empty)
                {
                    _empty = true;
                    assert(_next_index == 0);
                }
                else if (_input.length == _next_index)
                {
                    _input = _input[$ .. $];
                    _empty = true;
                }
                else
                {
                    _input = _input[_next_index + separatorLength .. $];
                    findIndex();
                }
            }

            static if (isForwardRange!Range)
            {
                @property typeof(this) save()
                {
                    auto ret = this;
                    ret._input = _input.save;
                    return ret;
                }
            }
        }
        return Result(r, s);
    }
Re: faster splitter
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: [...] Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...] have you thought about opening a PR to improve `splitter`?
Re: faster splitter
On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote: On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: [...] Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...] have you thought about opening a PR to improve `splitter`? Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?
Re: faster splitter
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote: On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote: [...] have you thought about opening a PR to improve `splitter`? Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Good luck. If you need help, it's probably best to open the PR, as people can directly see your code, and the Phobos repo is usually pretty active. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR? Absolutely ;-)
Re: faster splitter
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote: Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR? Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
Re: faster splitter
On 05/23/2016 08:17 AM, qznc wrote: [...] Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists. Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function, e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei
Re: faster splitter
On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote: Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei I don't see the value that things like filterBidirectional add when the back and popBack functions can be static if-ed in or out of the normal filter. filterBidirectional always felt like a relic of a time when static ifs didn't exist, even though I know that's not the case.
Re: faster splitter
On 5/23/16 9:44 AM, Jack Stouffer wrote: On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote: Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out the normal filter. What would be the condition tested in the static if? Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei
Re: faster splitter
On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote: What would be the condition tested in the static if? I would assume `static if (isBidirectionalRange!Range)` Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
Re: faster splitter
On Monday, 23 May 2016 at 14:25:52 UTC, Jack Stouffer wrote: [...] I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined? Here is a commit, which removes the dead code: https://github.com/qznc/phobos/commit/0cb1e70f4504a52b81c30342be7ec26d597fac69 It would be strictly better to implement back and popBack, no? My single test case says removing the code does not result in faster code. Maybe dmd optimizes the bookkeeping away. I see three options:

1. Remove the dead bookkeeping code
2. Implement back() and popBack()
3. Use an alternative splitter implementation (and implement back() and popBack())

The third one would be the best, if it is really faster.
Re: faster splitter
On 05/23/2016 10:47 AM, qznc wrote: It would be strictly better to implement the back and popBack, no? If it adds no overhead, you're right. -- Andrei
Re: faster splitter
On 05/23/2016 10:25 AM, Jack Stouffer wrote: [...] I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined? You might be doing extra work even if the user never calls back and popBack. -- Andrei
Re: faster splitter
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu wrote: That's comparable, please file an enh request. http://d.puremagic.com/issues/show_bug.cgi?id=9646

I think the actual slow thing is std.find, which splitter uses. Benchmark: https://dpaste.dzfl.pl/1e31b6f95659

The output is:

    std find took    162167000
    manual find took  99852000

Actually, std find should be faster, since it could use the Boyer-Moore algorithm instead of naive string matching.
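For reference, the "manual find" baseline such benchmarks compare against is a plain nested loop. Here is a hedged C sketch of it (the thread later dips into C anyway); `naive_find` and its convention of returning the haystack length when nothing is found are illustrative, not the benchmark's actual code:

```c
#include <stddef.h>

/* Naive nested-loop substring search. Returns the offset of the first
 * occurrence of needle in haystack, or hlen when not found (loosely
 * mirroring D's "remaining slice" convention). Sketch only. */
size_t naive_find(const char *haystack, size_t hlen,
                  const char *needle, size_t nlen)
{
    if (nlen == 0)
        return 0;
    if (nlen > hlen)
        return hlen;
    for (size_t i = 0; i + nlen <= hlen; ++i) {
        size_t j = 0;
        while (j < nlen && haystack[i + j] == needle[j])
            ++j;
        if (j == nlen)
            return i; /* full match starting at i */
    }
    return hlen;
}
```

The point of the thread is that std find should never lose to something this simple.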
Re: faster splitter
On Monday, 23 May 2016 at 14:47:22 UTC, qznc wrote: I see three options: 1. Remove dead bookkeeping code 2. Implement back() and popBack() 3. Use alternative splitter implementation (and implement back() and popBack()) The third one would be the best, if it is really faster. If the performance is really a problem (I think it's a micro optimization at best), then the best option is not to violate DRY. Have a template parameter, std.typecons.Flag, to turn off the back and popBack. Don't have two functions that are 95% the same code like filter and filterBidirectional.
Re: faster splitter
On 05/23/2016 03:11 PM, Jack Stouffer wrote: (I think it's a micro optimization at best) splitter in the stdlib is likely to be very frequently useful, any bit of speedup we put in it is likely to pay off immensely. -- Andrei
Re: faster splitter
On 05/23/2016 03:11 PM, qznc wrote: Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching. Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: [...] Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? I observed that Boyer-Moore from std is still slower. My guess is that this is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066
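The stack-allocated KMP table idea could look like this C sketch. The `kmp_find` name and the `MAX_NEEDLE` cutoff are assumptions of mine, not a Phobos API; a real implementation would fall back to another algorithm for longer needles:

```c
#include <stddef.h>

#define MAX_NEEDLE 256 /* arbitrary cutoff so the table fits on the stack */

/* Knuth-Morris-Pratt with the failure table on the stack.
 * Returns the match offset, or -1 if not found / needle too long. */
ptrdiff_t kmp_find(const char *haystack, size_t hlen,
                   const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen || nlen > MAX_NEEDLE)
        return -1;
    size_t fail[MAX_NEEDLE];
    fail[0] = 0;
    /* Build the failure table: fail[i] = length of the longest proper
     * prefix of needle[0..i] that is also a suffix. */
    for (size_t i = 1, k = 0; i < nlen; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }
    /* Scan the haystack; never re-examine haystack characters. */
    for (size_t i = 0, k = 0; i < hlen; ++i) {
        while (k > 0 && haystack[i] != needle[k])
            k = fail[k - 1];
        if (haystack[i] == needle[k])
            ++k;
        if (k == nlen)
            return (ptrdiff_t)(i - nlen + 1);
    }
    return -1;
}
```

No allocation happens at all, which is exactly the advantage claimed over the heap-allocated Boyer-Moore tables.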
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: On 05/23/2016 03:11 PM, qznc wrote: Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching. Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. Since I work a lot with text, I use canFind etc. a lot. It'd be nice to be able to choose faster algorithms. People are often advised to use library functions instead of rolling their own, because library functions are optimized for speed, but this doesn't seem to be true of Phobos. Simple everyday tasks like string matching should be as fast as C. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
Re: faster splitter
On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote: [...] Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066

From Phobos [1]:

    /* Preprocess #2: init skip[] */
    /* Note: This step could be made a lot faster.
     * A simple implementation is shown here. */
    this.skip = new size_t[needle.length];
    foreach (a; 0 .. needle.length)
    {
        size_t value = 0;
        while (value < needle.length && !needlematch(needle, a, value))
        {
            ++value;
        }
        this.skip[needle.length - a - 1] = value;
    }

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Re: faster splitter
On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote: On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote: Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066 From Phobos [1]: [...]

Unfortunately, it is slower. My current measurements [0]:

    $ ./benchmark 1000 10  # large haystack, few iterations
    std find took    753837231
    manual find took 129561980
    $ ./benchmark 10 1000  # small haystack, many iterations
    std find took    721676721
    manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow
Re: faster splitter
On 05/24/2016 06:44 AM, qznc wrote: [...] The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow

One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei
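For byte haystacks, one way to read the constant-length-skip-array suggestion is a Boyer-Moore-Horspool variant whose 256-entry skip table lives on the stack, so there is no dynamic allocation at all. This is a sketch of that reading, not Phobos code; for wider element types one would hash into the fixed table and fall back to brute force on saturation:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool with a fixed 256-entry skip table on the stack.
 * Returns the match offset, or -1 if not found. Sketch only. */
ptrdiff_t horspool_find(const unsigned char *haystack, size_t hlen,
                        const unsigned char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    size_t skip[256];
    for (size_t i = 0; i < 256; ++i)
        skip[i] = nlen;                 /* default: jump the whole needle */
    for (size_t i = 0; i + 1 < nlen; ++i)
        skip[needle[i]] = nlen - 1 - i; /* distance to the needle's last slot */
    size_t pos = 0;
    while (pos + nlen <= hlen) {
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return (ptrdiff_t)pos;
        /* Shift by the skip of the haystack byte aligned with the
         * needle's last position. */
        pos += skip[haystack[pos + nlen - 1]];
    }
    return -1;
}
```

The whole preprocessing is 256 stores plus one pass over the needle, which addresses the allocation overhead measured above.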
Re: faster splitter
On 05/24/2016 03:54 AM, qznc wrote: [...] I observed that Boyer-Moore from std is still slower. My guess is that this is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.

There's also Rabin-Karp that can be used when the sought range is relatively short.

The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0].

Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.

Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066

Nice, thanks. Andrei
Re: faster splitter
On 05/24/2016 08:38 AM, Andrei Alexandrescu wrote: One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei Another idea: in every searched string there is one specific index that offers the most information to a failed search, allowing for a large skip. Roughly speaking it's the largest index around which there are many characters not seen elsewhere in the searched string. (In a string with all distinct characters, that would be the last character.) If that offset is cheap to compute and offers a good skip, a search starting from that index would be very fast. -- Andrei
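A crude C sketch of the "most informative index" idea: probe the needle position whose character is rarest within the needle itself before paying for a full comparison. The rarest-character choice and the `probe_find` name are my simplification of the proposal above, not its exact formulation:

```c
#include <stddef.h>
#include <string.h>

/* Probe-first search: pick the latest needle index holding a character
 * that is rarest within the needle, and reject candidate positions by
 * checking that index first. Sketch only. */
ptrdiff_t probe_find(const char *haystack, size_t hlen,
                     const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    size_t count[256] = {0};
    for (size_t i = 0; i < nlen; ++i)
        count[(unsigned char)needle[i]]++;
    size_t probe = 0; /* latest index whose character is rarest */
    for (size_t i = 0; i < nlen; ++i)
        if (count[(unsigned char)needle[i]] <= count[(unsigned char)needle[probe]])
            probe = i;
    for (size_t i = 0; i + nlen <= hlen; ++i) {
        if (haystack[i + probe] != needle[probe])
            continue; /* cheap rejection at the informative index */
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

In a needle with all distinct characters this degenerates to checking the last character first, matching the parenthetical above.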
Re: faster splitter
On Tuesday, 24 May 2016 at 10:44:12 UTC, qznc wrote: [...] The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow

Plot twist: manual find is not always faster. My benchmark now generates a random haystack and needle. Sometimes Phobos clearly wins. Example:

    $ ./benchmark.ldc -n39 -l1 -i1
    Found at 1
    std find took    289529102
    manual find took 368958679

This is compiled with LDC; the needle is 39 chars, the haystack 1 char, and there is 1 iteration. "Found at 1" means the needle was not found. Very often manual find wins, though. Takeaway: manual find is too naive. However, something slows down std find in general. More research needed.

PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.
Re: faster splitter
On 05/24/2016 03:47 PM, qznc wrote: PS: Any recommendations for a Linux dissassembler? Objdump is ok, but some arrows for jumps would be nice. Try http://asm.dlang.org for short tests. -- Andrei
Re: faster splitter
On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote: PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.

I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly:

    |      0x00402e08      nop [rax+rax+0x0]
    .-->   0x00402e10      mov rdx, [rbx]
    ||     0x00402e13      mov rcx, [rbx+0x8]
    ||     0x00402e17      mov rdi, r15
    ||     0x00402e1a      mov rsi, r12
    ||     0x00402e1d      call 0x409160  ; sym._D3std9algorithm9searching34__T4fin
    ||     0x00402e22      mov rcx, [rbx]
    ||     0x00402e25      sub rcx, rax
    ||     0x00402e28      mov [rbx+0x20], rcx
    ||     0x00402e2c      dec ebp
    `==<   0x00402e2e      jnz 0x402e10
    |      0x00402e30      lea r15, [rsp]

Note the ASCII arrow on the left! Objdump should do that. :)
Re: faster splitter
On Tuesday, 24 May 2016 at 22:01:17 UTC, qznc wrote: [...] Note the ASCII arrow on the left! Objdump should do that. :)

Any suggestions how we can make find more efficient?
Re: faster splitter
On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
Re: faster splitter
On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95 And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
Re: faster splitter
On Wednesday, 25 May 2016 at 11:30:33 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95 And here it is: https://github.com/dlang/phobos/pull/4362 Destroy! Great stuff! I really appreciate your efforts. As I said, my code uses find/canFind a lot.
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: On 05/23/2016 03:11 PM, qznc wrote: Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient. NB: I've found that `foreach` is usually faster than a manual `for`-loop. I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.
Re: faster splitter
On 5/27/16 8:28 AM, Chris wrote: I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient. This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poor, e.g. searching for "aaa...aab" in a long string with only "a"s. NB: I've found that `foreach` is usually faster than a manual `for`-loop. That's surprising too. I'd be interested to look at a disassembly if you have one. I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed. That's definitely the way to go. Thanks for looking into it! Andrei
Re: faster splitter
On Friday, 27 May 2016 at 13:29:00 UTC, Andrei Alexandrescu wrote: On 5/27/16 8:28 AM, Chris wrote: This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaa...aab" in a long string with only "a"s.

I will look into this. Atm, I'm using qznc's benchmark.

NB: I've found that `foreach` is usually faster than a manual `for`-loop. That's surprising too. I'd be interested to look at a disassembly if you have one.

I have to correct myself. A manual loop is faster; I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.

[...] That's definitely the way to go. Thanks for looking into it! Andrei

There is a bug, if it is one, in qznc's `manual_find`. It doesn't find strings at the end of a string, such as "ing" in "string", and it does not find a string that is equal to the search string, as in "abba" == "abba". This outperforms both `manual_find` and the improved `std find`:

    string findStringS_Manual(string haystack, string needle)
    {
        if (needle.length > haystack.length)
            return haystack[$..$];
    outer:
        for (auto i = 0; i < haystack.length; i++)
        {
            if (haystack[i] != needle[0])
                continue;
            for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
                if (haystack[j] != needle[k])
                    continue outer;
            return haystack[i..$];
        }
        return haystack[$..$];
    }

A typical example would be:

    std find took     25642297
    manual find took   7426676  // manual_find
    my std find took   6654496  // improved find
    findStringS took   7159556  // foreach(i, c; haystack)
    findStringS_ took  5315498  // for (auto i = 0 ...) see above

I think it's clear that `std find` is very slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: This outperforms both `manual_find` and the improved `std find` [...]

Little bug fix:

    for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

Else it will be out of range when you look for "za" and the string ends with "z". There is a slight performance penalty, but it's still faster than the rest. Maybe there are cleverer ways, but every little check and decision comes with a performance penalty. The improved `std find` comes very close to the manual function above now; sometimes it's even faster or at least as fast.

    std find took     12573666
    manual find took   7418454
    my std find took   6903854  <===
    findStringS took   7166720
    findStringS_ took  6579529  <===

    std find took     11849892
    manual find took   7407091
    my std find took   6573102  <===
    findStringS took   7296610
    findStringS_ took  6573214  <===

    std find took     10744361
    manual find took   7576133
    my std find took   6332014  <===
    findStringS took   7224946
    findStringS_ took  6555976  <===
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: I think it's clear that `std find` is very slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version). If you want to see find (current or my improved version) get really slow, then reduce the alphabet for the haystack (e.g. --alphabet=ab). ;)
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast. I think it really depends on the use case. The manual algorithm is really naive and std find is slightly more advanced. My benchmark creates a random haystack and needle, but this is not real-world data. I believe the "slightly more advanced" approach is a good idea, but there is no data to prove it. It might be interesting to see what algorithms other languages' standard libraries (C++, Python, Java, etc.) use. Thanks for looking at the benchmarking, Chris! The more eyes, the better. :)
Re: faster splitter
On 5/27/16 10:41 AM, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast. What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei
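The last-element-first suggestion can be sketched in C for byte strings (hypothetical `find_last_first`; the real Phobos code works on generic ranges):

```c
#include <stddef.h>

/* Match the needle's last element first; only on a hit run an inner loop
 * that needs no end-of-haystack check, because the window already fits.
 * Returns the match offset, or -1 if not found. Sketch only. */
ptrdiff_t find_last_first(const char *haystack, size_t hlen,
                          const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    const char last = needle[nlen - 1];
    for (size_t i = nlen - 1; i < hlen; ++i) {
        if (haystack[i] != last)
            continue; /* cheap filter on the last element */
        const size_t start = i - (nlen - 1);
        size_t j = 0;
        while (j + 1 < nlen && haystack[start + j] == needle[j])
            ++j; /* no bounds check needed: window ends at i */
        if (j + 1 == nlen)
            return (ptrdiff_t)start;
    }
    return -1;
}
```

Starting the scan at index nlen-1 also gives the correct upper bound for free: no candidate window ever runs past the haystack.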
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: This outperforms both `manual_find` and the improved `std find`:

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$..$];
    outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0])
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;
        return haystack[i..$];
    }
    return haystack[$..$];
}

Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

The upper bound of the outer loop should be haystack.length-needle.length. There's no point in searching after that point, as the needle would not fit anymore, and it avoids the buffer overrun you have in your code when the first character of the needle is found after that point. You can then also remove the ?: test in your j loop, as that case is handled automatically.
Re: faster splitter
Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong.
Re: faster splitter
On Friday, 27 May 2016 at 19:17:44 UTC, Chris wrote: Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong. I had the same "bug" when I wrote my search function on the project at work. I also found out that using the simplest loop is generally the best way to go. Processors nowadays have loop caches, fast level-1 caches etc.; all the fancy search algorithms like Boyer-Moore can only overcome their memory and initialisation overhead for really huge haystacks. Here's the C code of my memmem() function that replaced the Boyer-Moore I had before. This implementation outran the BM search for sizes up to several hundred megabytes on a SPARC64 VII+ machine. On x86-64 I'm sure it would be even worse.

void *memmem(const void *haystack, size_t haystack_len,
             const void *needle, size_t needle_len)
{
    if (!haystack_len || !needle_len || needle_len > haystack_len)
        return NULL;

    uint8_t u8 = *((uint8_t *)needle);
    const uint8_t *p = (const uint8_t *)haystack;

    for (size_t i = haystack_len - needle_len + 1; i; --i) {
        if (*p == u8 && !memcmp(p, needle, needle_len))
            return (void *)p;
        else
            p++;
    }
    return NULL;
}

(sorry for posting C code, as I'm only beginning to learn D).
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

std find took 12573666
manual find took 7418454
my std find took 6903854 <===
findStringS took 7166720
findStringS_ took 6579529 <===

I just updated my benchmark. It now checks lots of different scenarios instead of you having to specify one. You can only set the number of iterations now. It generates a random scenario (haystack length, needle length, alphabet, etc) and runs it once with all algorithms. Instead of recording the absolute runtime, it records the slowdown compared to the fastest of the three. I average those and also compute the absolute deviation (the second number). Example:

std find: 153 ±33
manual find: 112 ±19
my std find: 119 ±16

So manual find is on average 12% slower than the fastest, ±19%. I would interpret that as no significant difference between manual and improved find.
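qznc's aggregation scheme (relative slowdown against the fastest contender, then mean absolute deviation) can be sketched in C; the helper names here are made up for illustration and are not from the benchmark source:

```c
#include <stddef.h>

/* Slowdown of one run relative to the fastest contender, scaled so
 * that 100 means "as fast as the fastest" (hypothetical helper). */
static double slowdown(double runtime, double fastest) {
    return 100.0 * runtime / fastest;
}

/* Mean of n values. */
static double mean(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s / n;
}

/* Mean absolute deviation from the mean: the "±" number in the tables. */
static double mean_abs_dev(const double *v, size_t n) {
    double m = mean(v, n), s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += (v[i] > m) ? v[i] - m : m - v[i];
    return s / n;
}
```

So a run that took 112 time units while the fastest took 100 contributes a slowdown of 112; averaging the slowdowns per algorithm yields lines like "manual find: 112 ±19".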
Re: faster splitter
Now added Chris' algo to the benchmark:

std find: 155 ±33
manual find: 112 ±19
qznc find: 122 ±18 <--- improved find
Chris find: 133 ±20 <--- findStringS_
Re: faster splitter
On 05/27/2016 04:39 PM, qznc wrote: Now added Chris' algo to the benchmark:

std find: 155 ±33
manual find: 112 ±19
qznc find: 122 ±18 <--- improved find
Chris find: 133 ±20 <--- findStringS_

Does any of these feature simultaneously:
* check the last element first
* consequently stop early when the length of the haystack falls below the length of the needle
* consequently no check against the haystack limit is necessary in the inner loop
* use only one induction variable
? Andrei
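Andrei's checklist can be illustrated with a brute-force C sketch (not code from the thread; the function name is invented). It checks the last element first, naturally stops once the needle no longer fits, and its inner loop checks only against the needle's length, never the haystack's:

```c
#include <stddef.h>

/* Returns the offset of the first occurrence of needle in haystack,
 * or hlen if there is none. */
static size_t find_last_first(const char *haystack, size_t hlen,
                              const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;  /* needle cannot fit: stop early */
    const size_t last = nlen - 1;
    /* j is the haystack position aligned with the needle's last element:
     * the single induction variable of the outer loop. The bound j < hlen
     * also guarantees the window start j - last stays in range. */
    for (size_t j = last; j < hlen; ++j) {
        if (haystack[j] != needle[last]) continue;  /* last element first */
        /* Last elements match; compare the rest. The only remaining
         * check is against the needle length, not the haystack. */
        size_t i = 0;
        while (i < last && haystack[j - last + i] == needle[i]) ++i;
        if (i == last) return j - last;
    }
    return hlen;
}
```

Whether this counts as "only one induction variable" depends on how you count the inner loop's index, but the outer loop carries exactly one.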
Re: faster splitter
On Friday, 27 May 2016 at 20:50:52 UTC, Andrei Alexandrescu wrote: On 05/27/2016 04:39 PM, qznc wrote: Now added Chris' algo to the benchmark: std find:155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_ Does any of these feature simultaneously: * check the last element first * consequently stop early when the length of the haystack falls below the length of the needle * consequently no check against the haystack limit is necessary in the inner loop Std find (currently in Phobos) and qznc find (see pull request) fulfill those three. Manual and Chris find are simpler. * use only one induction variable None of those. I made such a variant of qznc find and it gives me the same numbers.
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: I have to correct myself. A manual loop is faster. I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster. Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should look pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong. — David
Re: faster splitter
On Friday, 27 May 2016 at 18:21:06 UTC, Andrei Alexandrescu wrote: What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei

Another plot twist: Sentinels look interesting. Wilzbach [0] mentioned it in the pull request: "Btw on a related issue - it seems like no one has actually bothered to implement Andrei's "Fastware" sentinels in Phobos. If you are already benchmarking, you might give this specialization for mutable RandomAccessRanges a try ;-)" I tried it [1]. It changes the numbers from

std find: 153 ±32
manual find: 112 ±19
qznc find: 121 ±18
Chris find: 132 ±20

to

std find: 160 ±31
manual find: 118 ±24
qznc find: 109 ±13 <--- using the sentinel trick
Chris find: 142 ±27

It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown. This means another special case for mutable ranges could be worthwhile.

[0] https://github.com/dlang/phobos/pull/4362#issuecomment-24387
[1] https://github.com/qznc/find-is-too-slow/commit/043d1d2f6dced625e964e8df883ad5a45d7fe654
Re: faster splitter
On 05/27/2016 06:19 PM, qznc wrote: manual find: 118 ±24 qznc find: 109 ±13 <--- using the sentinel trick Chris find: 142 ±27 It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown. What compiler? This means another special case for mutable ranges could be worthwhile. Also mind the throwing possibilities. Andrei
Re: faster splitter
On Friday, 27 May 2016 at 20:39:03 UTC, qznc wrote: Now added Chris' algo to the benchmark: std find:155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_ Did you fix the bugs in my algorithm? S
Re: faster splitter
On Saturday, 28 May 2016 at 01:30:11 UTC, Andrei Alexandrescu wrote: What compiler?

LDC. For DMD the numbers show no improvement:

std find: 170 ±31
manual find: 132 ±25
qznc find: 103 ±6
Chris find: 157 ±30

to

std find: 168 ±30
manual find: 131 ±25
qznc find: 104 ±8 <--- sentinel
Chris find: 155 ±29

(Caveat: Chris find has a bug, which is triggered once in the 10,000 runs each.) The sentinel trick is only a proof of concept. The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.

"This means another special case for mutable ranges could be worthwhile. Also mind the throwing possibilities." What do you mean? Using exceptions?

If anybody wants to replicate: it only takes a few seconds and there are no dependencies except dmd/ldc.

git clone https://github.com/qznc/find-is-too-slow
cd find-is-too-slow
make dmd
make ldc
Re: faster splitter
On 5/28/16 6:56 AM, qznc wrote: The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges. No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

T[] find(T)(T[] haystack, T[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    size_t j = lastIndex;
    for (; j < haystack.length; ++j)
    {
        if (haystack[j] != last) continue;
        immutable k = j - lastIndex;
        // last elements match
        for (size_t i = 0; ; ++i)
        {
            if (i == lastIndex) return haystack[k .. $];
            if (needle[i] != haystack[k + i]) break;
        }
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "hello abc world";
    assert(find(s1, "abc") == "abc world");
    assert(find(s1, "def") == "");
}

void main() {}

Andrei
Re: faster splitter
On Friday, 27 May 2016 at 21:31:48 UTC, David Nadlinger wrote: Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong. — David I agree, and it has been said on this forum that tight manual loops are faster than the range paradigm (I think bearophile was one of them, if I remember correctly). Maybe we should benchmark this too, just in case.
Re: faster splitter
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote: Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work: [...]

I included your function (I also fixed the bug in my findStringS_). Here's a typical result:

std find: 208 ±61
manual find: 127 ±29
qznc find: 108 ±13
Chris find: 140 ±32
Andrei find: 126 ±27
=
std find: 207 ±59
manual find: 128 ±30
qznc find: 108 ±13
Chris find: 141 ±32
Andrei find: 125 ±27

I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.
Re: faster splitter
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote: No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

A nice one. Here are the numbers I got:

LDC:
std find: 156 ±33
manual find: 117 ±24
qznc find: 114 ±14
Chris find: 135 ±24
Andrei find: 112 ±27

DMD:
std find: 177 ±31
manual find: 140 ±28
qznc find: 102 ±4
Chris find: 165 ±31
Andrei find: 130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in the inner loop. My inner loop looks like this:

haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // only condition
        // ... ok, in here comes another, but this is not as costly
    }
}

As long as you are matching the needle, there is only one condition to check. There is no check for k < needleLength, because the comparison always fails for needle[$] aka haystack[scout] due to the sentinel. I thought it might be slower, since we need to write to memory twice instead of only reading. Turns out it is faster.
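The sentinel variant can be sketched in full in C, under the assumptions qznc's post states: the haystack must be mutable and `needleBack+1` must be representable. This is an illustration of the idea as described, not the benchmark's actual code:

```c
#include <stddef.h>

/* Sentinel-style search: once the needle's last element matches, the
 * matching haystack byte is overwritten with a value that cannot match,
 * so the inner loop's only branch is the mismatch test. The clobbered
 * byte is restored before returning. Returns hlen if not found. */
static size_t find_sentinel(char *haystack, size_t hlen,
                            const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;
    const size_t last = nlen - 1;
    const char needle_back = needle[last];
    for (size_t scout = last; scout < hlen; ++scout) {
        if (haystack[scout] != needle_back) continue;
        const char saved = haystack[scout];
        haystack[scout] = (char)(needle_back + 1);      /* place sentinel */
        size_t j = scout - last, k = 0;
        while (haystack[j] == needle[k]) { ++j; ++k; }  /* no k < nlen check */
        haystack[scout] = saved;                        /* restore */
        if (k == last) return scout - last;  /* stopped exactly at sentinel */
    }
    return hlen;
}
```

The place/restore pair is the "write to memory twice" cost mentioned above; the loop is guaranteed to stop at the sentinel at the latest, so no needle-length check is needed.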
Re: faster splitter
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote: I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner. DMD performance feels flaky to me. If performance is important, you should use ldc or gdc. Alternatively, you are Walter Bright and simply optimize dmd. ;)
Re: faster splitter
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
std find:    153 ±25 +66 (1934) -15 (7860)
manual find: 122 ±28 +80 (1812) -17 (8134)
qznc find:   125 ±16 +18 (4644) -15 (5126)
Chris find:  148 ±29 +75 (1976) -18 (7915)
Andrei find: 114 ±23 +100 (1191) -13 (8770)
(avg slowdown vs fastest; absolute deviation)

$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
std find:    160 ±27 +44 (3162) -20 (6709)
manual find: 148 ±28 +54 (2701) -19 (7178)
qznc find:   102 ±3  +27 ( 766) -1 (9136)
Chris find:  175 ±30 +55 (2796) -21 (7106)
Andrei find: 122 ±22 +46 (2554) -14 (7351)
(avg slowdown vs fastest; absolute deviation)

The additional numbers on the right are the ±MAD separated into above and below the mean. For example, Andrei find with ldc:

Andrei find: 114 ±23 +100 (1191) -13 (8770)

The mean slowdown is 114, which means 14% slower than the fastest one. The mean absolute deviation (MAD) is 23. More precisely, the mean deviation above the mean slowdown is +100 and the mean deviation below it is -13. 1191 of the 10,000 runs were above the mean slowdown and 8770 below. The 39 missing runs are equal to the mean slowdown.

What bothers me is that changing the alphabet changes the numbers so much. Currently, if you restrict the alphabets for haystack and needle, the numbers change. The benchmark already does a random subset on each run, but there is definitely a bias.
Re: faster splitter
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote: I played around with the benchmark. Some more numbers: [...] E: wrong result with Chris find [...] What bothers me is that changing the alphabet changes the numbers so much.
You can avoid "E: wrong result with Chris find" by using

outer:
for (auto i = 0; i < haystack.length-needle.length; i++)
{
    if (haystack[i] != needle[0])
        continue;
    for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
        if (haystack[j] != needle[k])
            continue outer;
    return haystack[i..$];
}

It's a tad faster. I'm planning to test on more varied data and see where a bias may occur.
Re: faster splitter
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote: I played around with the benchmark. Some more numbers: The mean slowdown is 114, which means 14% slower than the fastest one. A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension; the median will do a better job of reducing the effect of those outliers than the mean. Your benchmark code could publish both forms. --Jon
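Jon's suggestion, the median together with the median of absolute deviations from the median, can be sketched like this (an illustration only; error handling omitted):

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n values; sorts the array in place for simplicity. */
static double median(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Median absolute deviation from the median; works on copies so the
 * caller's data is untouched. */
static double median_mad(const double *v, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    double *dev = malloc(n * sizeof *dev);
    for (size_t i = 0; i < n; ++i) tmp[i] = v[i];
    double m = median(tmp, n);
    for (size_t i = 0; i < n; ++i)
        dev[i] = (v[i] > m) ? v[i] - m : m - v[i];
    double r = median(dev, n);
    free(tmp);
    free(dev);
    return r;
}
```

With runs {100, 110, 120, 1000}, the mean is dragged to 332.5 by the outlier, while the median stays at 115 and the median MAD is 10, which is the outlier resistance Jon is after.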
Re: faster splitter
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote: A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms. I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median? When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE. However, this is more advanced and requires further introspection. [0] http://forum.dlang.org/post/aahhopdcrengakvoe...@forum.dlang.org
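The word-load idea can be illustrated for the first-byte scan with the classic SWAR zero-in-word test (a sketch, not the code behind the linked post):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Word-at-a-time scan for one byte: XOR a broadcast of the byte into
 * each 8-byte chunk, then test whether any resulting byte is zero.
 * Falls back to a byte loop for the tail. Returns len if not found. */
static size_t scan_byte(const unsigned char *s, size_t len, unsigned char b) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t pattern = ones * b;  /* broadcast b to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);           /* safe unaligned load */
        uint64_t x = w ^ pattern;       /* matching bytes become zero */
        if ((x - ones) & ~x & highs) {  /* some byte in this word is zero */
            for (size_t j = 0; j < 8; ++j)
                if (s[i + j] == b) return i + j;
        }
    }
    for (; i < len; ++i)
        if (s[i] == b) return i;
    return len;
}
```

The word loop touches eight bytes per iteration with a single branch in the common no-hit case; only a word that actually contains the byte triggers the short byte-level confirmation loop.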
Re: faster splitter
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote: A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms. I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median? Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
Re: faster splitter
It's weird: on my machine, my find function is consistently faster than `manual find`.

LDC:
std find:    137 ±22 +33 (3580) -17 (6251)
manual find: 137 ±32 +53 (3105) -23 (6834)
qznc find:   106 ±8  +17 (2608) -5 (7195)
Chris find:  132 ±32 +58 (2803) -23 (7131)
Andrei find: 120 ±26 +54 (2466) -17 (7454)
===
std find:    136 ±22 +33 (3503) -17 (6304)
manual find: 137 ±33 +55 (3000) -23 (6920)
qznc find:   106 ±8  +17 (2535) -5 (7287)
Chris find:  132 ±33 +59 (2803) -23 (7137)
Andrei find: 119 ±25 +51 (2569) -16 (7374)

It's negligible, but I wonder if it's the extra initialization: `size_t end = haystack.length - needle.length + 1;` and `size_t i = 0` are defined before we know if it's a legitimate search. But that shouldn't matter, or should it? Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find` - well done! :)

DMD:
std find:    161 ±36 +46 (4015) -31 (5847)
manual find: 147 ±40 +68 (3001) -28 (6917)
qznc find:   106 ±8  +15 (2623) -5 (7172)
Chris find:  201 ±41 +73 (2830) -28 (7120)
Andrei find: 150 ±40 +61 (3347) -30 (6575)

PS I've corrected the end of the search loop to `outer: for (auto i = 0; i < haystack.length-(needle.length-1); i++)`, else it will not match at the end of a string.
Re: faster splitter
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE.

So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
std find:      133 ±25 +38 (3384) -19 (6486)
manual find:   140 ±37 +64 (2953) -25 (6962)
qznc find:     114 ±17 +33 (2610) -11 (7262)
Chris find:    146 ±39 +66 (3045) -28 (6873)
Andrei find:   126 ±29 +54 (2720) -19 (7189)
Wordwise find: 130 ±30 +53 (2934) -21 (6980)

Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear. Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
Re: faster splitter
On 05/30/2016 05:31 AM, qznc wrote: [...] Finding a reliable benchmark is hard. :/

Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically. https://dpaste.dzfl.pl/dc8dc6e1eb53 It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei
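In case the paste link rots, the trick as described can be reconstructed roughly in C (a sketch of the idea, not Andrei's actual code): after the needle's last character matches but the full comparison fails, skip ahead by the distance from the needle's end to the previous occurrence of that character inside the needle, and compute that skip lazily in a separate function.

```c
#include <stddef.h>

/* Distance from the needle's last character back to its previous
 * occurrence inside the needle; the whole needle length if there is
 * none. Called at most once, on first use. */
static size_t compute_skip(const char *needle, size_t nlen) {
    size_t skip = nlen;
    for (size_t i = nlen - 1; i-- > 0; )
        if (needle[i] == needle[nlen - 1]) { skip = nlen - 1 - i; break; }
    return skip;
}

/* Returns the offset of needle in haystack, or hlen if not found. */
static size_t find_lazy_skip(const char *haystack, size_t hlen,
                             const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;
    const size_t last = nlen - 1;
    size_t skip = 0;  /* 0 means "not computed yet" */
    size_t j = last;
    while (j < hlen) {
        if (haystack[j] != needle[last]) { ++j; continue; }
        size_t i = 0;
        while (i < last && haystack[j - last + i] == needle[i]) ++i;
        if (i == last) return j - last;
        if (skip == 0) skip = compute_skip(needle, nlen);  /* lazy */
        j += skip;  /* safe: no closer alignment can match the last char */
    }
    return hlen;
}
```

Keeping compute_skip out of line mirrors the stated design goal: the hot loop stays tight, and the skip table work is only paid once the last character has matched and failed at least once.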
Re: faster splitter
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote: Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically. https://dpaste.dzfl.pl/dc8dc6e1eb53 It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
Re: faster splitter
On Monday, 30 May 2016 at 18:57:15 UTC, Chris wrote: Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.

Congratulations!

DMD:
./benchmark.dmd
std:    178 ±31 +36 (4475) -29 (5344)
manual: 167 ±46 +82 (2883) -32 (7054)
qznc:   114 ±7  +18 (1990) -5 (7288)
Chris:  228 ±49 +82 (3050) -35 (6780)
Andrei: 103 ±5  +47 ( 658) -2 (9295)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

LDC:
std:    184 ±19 +28 (3420) -14 (6523)
manual: 205 ±59 +120 (2463) -39 (7443)
qznc:   151 ±25 +44 (2983) -17 (6911)
Chris:  194 ±57 +78 (3702) -46 (6251)
Andrei: 101 ±2  +42 ( 435) -1 (9542)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Re: faster splitter
On 05/30/2016 04:00 PM, Chris wrote: [benchmark numbers snipped]

Thanks for looking into this! @qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
Re: faster splitter
On Monday, 30 May 2016 at 20:08:46 UTC, Andrei Alexandrescu wrote: Thanks for looking into this! @qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei

What function call should be replaced with inline code? Overall, I'm very confused and displeased by the benchmark right now. Initially, the problem was that a naive manual find was slower than std find in Phobos. Look at the LDC numbers: manual find is slower than std find. By tweaking the benchmark, the numbers can be shifted around arbitrarily. Even without tweaking, it is very processor dependent.
Here are the numbers from my laptop:

./benchmark.ldc
std:    149 ±24 +29 (4275) -21 (5604)
manual: 129 ±37 +90 (2055) -23 (7917)
qznc:   131 ±22 +26 (4381) -20 (5488)
Chris:  154 ±35 +81 (2192) -22 (7672)
Andrei: 118 ±28 +82 (1762) -16 (8194)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

./benchmark.dmd
std:    143 ±22 +26 (4219) -19 (5602)
manual: 158 ±31 +47 (3436) -24 (6492)
qznc:   101 ±2  +12 (1349) -1 (7851)
Chris:  216 ±38 +56 (3393) -28 (6541)
Andrei: 135 ±31 +48 (3326) -24 (6589)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

And Desktop:

./benchmark.ldc
std:    129 ±24 +40 (3121) -17 (6767)
manual: 129 ±31 +59 (2668) -21 (7244)
qznc:   112 ±14 +30 (2542) -9 (7312)
Chris:  134 ±33 +58 (2835) -23 (7068)
Andrei: 123 ±27 +53 (2679) -18 (7225)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

./benchmark.dmd
std:    157 ±31 +44 (3693) -24 (6234)
manual: 143 ±41 +73 (2854) -28 (7091)
qznc:   116 ±21 +35 (3092) -14 (6844)
Chris:  181 ±50 +74 (3452) -38 (6510)
Andrei: 136 ±38 +64 (2975) -27 (6953)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Re: faster splitter
On 05/30/2016 06:16 PM, qznc wrote: What function call should be replaced with inline code?

The call to computeSkip.

Overall, I'm very confused and displeased by the benchmark right now.

I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind:

* Search a long string for a one-character string, match and fail.
* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).
* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.

Andrei
Re: faster splitter
On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote: [desktop numbers snipped]

Benchmark from desktop machine:

DMD:
std:    164 ±34 +43 (4054) -29 (5793)
manual: 150 ±41 +72 (2889) -29 (7032)
qznc:   103 ±6  +42 ( 878) -2 (9090)
Chris:  205 ±43 +81 (2708) -29 (7232)
Andrei: 136 ±31 +53 (2948) -22 (6977)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
===
LDC:
std:    138 ±23 +35 (3457) -18 (6360)
manual: 145 ±33 +45 (3748) -27 (6181)
qznc:   105 ±7  +17 (2267) -4 (7534)
Chris:  135 ±33 +56 (3061) -23 (6882)
Andrei: 121 ±27 +52 (2630) -18 (7301)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

On my laptop Andrei's was the fastest (see post above).
Re: faster splitter
On Tuesday, 31 May 2016 at 08:43:59 UTC, Chris wrote: [benchmark numbers snipped] On my laptop Andrei's was the fastest (see post above).

Comparing the chips involved, could it be cache related?

3M cache; Andrei wins both: http://ark.intel.com/products/75459/Intel-Core-i5-4200U-Processor-3M-Cache-up-to-2_60-GHz

4M cache; qznc wins DMD (and is faster than LDC's best? What?); Andrei wins LDC: http://ark.intel.com/products/43560/Intel-Core-i7-620M-Processor-4M-Cache-2_66-GHz

8M cache; qznc wins both: http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz http://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz

Normally, I'd expect the 4200U to be similar to the desktop parts. Unless... Say, for the laptops (and I guess the desktops too, but it's more important in a mobile), did you verify that CPU frequency scaling wasn't interfering? -Wyatt
Re: faster splitter
On Tuesday, 31 May 2016 at 17:31:29 UTC, Wyatt wrote: qznc wins DMD (and is faster than the LDC's best? Careful! These are not absolute numbers, but relative slowdowns. You cannot compare the numbers between LDC and DMD.
Re: faster splitter
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote: I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail. There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. * Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length). How long should the english text be? A Tweet? A book? A Gigabyte of log files? English text means basically ASCII and no Unicode? * Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended. Why the prepend/append? To force a mismatch?
Re: faster splitter
On 5/31/16 1:54 PM, qznc wrote: Using a one-letter needle string is more like a user mistake than something to optimize for. How is splitting on ',' a user mistake? -- Andrei
Re: faster splitter
On 5/31/16 1:54 PM, qznc wrote: On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote: I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail. There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline. * Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length). How long should the english text be? A Tweet? A book? A Gigabyte of log files? I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000. English text means basically ASCII and no Unicode? My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect. * Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended. Why the prepend/append? To force a mismatch? Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world. Thanks, Andrei
Re: faster splitter
On Tuesday, 31 May 2016 at 18:31:47 UTC, Andrei Alexandrescu wrote: On 5/31/16 1:54 PM, qznc wrote: Using a one-letter needle string is more like a user mistake than something to optimize for. How is splitting on ',' a user mistake? -- Andrei

The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
std:     168 ±6  +29 ( 107)  -3 ( 893)
manual:  112 ±3  +28 (  81)  -1 ( 856)
qznc:    149 ±4  +30 (  79)  -1 ( 898)
Chris:   142 ±5  +28 ( 102)  -2 ( 898)
Andrei:  153 ±3  +28 (  79)  -1 ( 919)
Andrei2: 101 ±2  +34 (  31)  -1 ( 969)
Search random haystack with random needle
std:     172 ±19 +61 ( 161) -11 ( 825)
manual:  161 ±47 +73 ( 333) -35 ( 666)
qznc:    163 ±21 +33 ( 314) -15 ( 661)
Chris:   190 ±47 +80 ( 302) -33 ( 693)
Andrei:  140 ±37 +60 ( 315) -27 ( 676)
Andrei2: 103 ±6  +57 (  64)  -2 ( 935)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.
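The "lazy skip" idea behind the Andrei2 entry can be sketched as follows. This is a hedged C illustration, not the actual D code from the thread: `find_lazy_skip` is a made-up name, and the shape of the function is my own. The point is that the search scans for the needle's last character, and the Boyer-Moore-style skip for that character is only computed the first time a partial match fails, so needles that never partially match pay nothing for it.

```c
#include <stddef.h>
#include <string.h>

/* Hedged sketch of a search with a lazily computed skip.
   Returns a pointer to the first match, or NULL if absent. */
const char *find_lazy_skip(const char *hay, size_t hlen,
                           const char *nee, size_t nlen)
{
    if (nlen == 0)
        return hay;
    if (hlen < nlen)
        return NULL;
    const size_t last = nlen - 1;
    const char lastc = nee[last];
    size_t skip = 0; /* 0 means "not yet computed" */

    for (size_t i = last; i < hlen; ++i) {
        if (hay[i] != lastc)
            continue; /* cheap scan for the needle's last character */
        if (memcmp(hay + i - last, nee, last) == 0)
            return hay + i - last; /* full match */
        /* Partial match failed: compute the skip once, lazily.
           It is the distance from the needle's last character to its
           previous occurrence in the needle (or nlen if none). */
        if (skip == 0) {
            skip = 1;
            while (skip < nlen && nee[last - skip] != lastc)
                ++skip;
        }
        i += skip - 1; /* with the loop's ++i, we advance by `skip` */
    }
    return NULL;
}
```

Because the skip only depends on the needle, computing it on the first failed partial match amortises to the same cost as precomputing it, while keeping the fast path free of setup work.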
Re: faster splitter
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote:

The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
std:     168 ±6  +29 ( 107)  -3 ( 893)
manual:  112 ±3  +28 (  81)  -1 ( 856)
qznc:    149 ±4  +30 (  79)  -1 ( 898)
Chris:   142 ±5  +28 ( 102)  -2 ( 898)
Andrei:  153 ±3  +28 (  79)  -1 ( 919)
Andrei2: 101 ±2  +34 (  31)  -1 ( 969)
Search random haystack with random needle
std:     172 ±19 +61 ( 161) -11 ( 825)
manual:  161 ±47 +73 ( 333) -35 ( 666)
qznc:    163 ±21 +33 ( 314) -15 ( 661)
Chris:   190 ±47 +80 ( 302) -33 ( 693)
Andrei:  140 ±37 +60 ( 315) -27 ( 676)
Andrei2: 103 ±6  +57 (  64)  -2 ( 935)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.

Would it speed things up even more if we put the function `computeSkip` into the loop, or is this done automatically by the compiler?
Re: faster splitter
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?

LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
std:     147 ±1
manual:  100 ±0
qznc:    121 ±1
Chris:   103 ±1
Andrei:  144 ±1
Andrei2: 105 ±1
Search in random short strings
std:     125 ±15
manual:  117 ±10
qznc:    104 ±6
Chris:   123 ±14
Andrei:  104 ±5
Andrei2: 103 ±4
Mismatch in random long strings
std:     140 ±22
manual:  164 ±64
qznc:    115 ±13
Chris:   167 ±63
Andrei:  161 ±68
Andrei2: 106 ±9
Search random haystack with random needle
std:     138 ±27
manual:  135 ±33
qznc:    116 ±16
Chris:   141 ±36
Andrei:  131 ±33
Andrei2: 109 ±12
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

"Random short strings" uses haystacks of 10 to 300 characters and needles of 2 to 10 — basically, no time for initialisation. "Random long strings" uses haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles of 50 to 500; it inserts a character at a random index of the needle to force a mismatch. The last one is the same configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)
Re: faster splitter
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote:

On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?

LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
std:     147 ±1
manual:  100 ±0
qznc:    121 ±1
Chris:   103 ±1
Andrei:  144 ±1
Andrei2: 105 ±1
Search in random short strings
std:     125 ±15
manual:  117 ±10
qznc:    104 ±6
Chris:   123 ±14
Andrei:  104 ±5
Andrei2: 103 ±4
Mismatch in random long strings
std:     140 ±22
manual:  164 ±64
qznc:    115 ±13
Chris:   167 ±63
Andrei:  161 ±68
Andrei2: 106 ±9
Search random haystack with random needle
std:     138 ±27
manual:  135 ±33
qznc:    116 ±16
Chris:   141 ±36
Andrei:  131 ±33
Andrei2: 109 ±12
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

"Random short strings" uses haystacks of 10 to 300 characters and needles of 2 to 10 — basically, no time for initialisation. "Random long strings" uses haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles of 50 to 500; it inserts a character at a random index of the needle to force a mismatch. The last one is the same configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)

Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. The nice thing about `Andrei2` is that it performs consistently well.
Re: faster splitter
On 05/31/2016 04:18 PM, Chris wrote: I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei In general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though. Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. — David
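The branch-expectation hint David mentions looks like this in C. This is a hedged illustration only: `count_newlines` is a made-up example function, and the macro is the usual GCC/Clang idiom corresponding to `ldc.intrinsics.llvm_expect` (with a portable no-op fallback):

```c
#include <stddef.h>

/* __builtin_expect is a GCC/Clang extension; fall back to a no-op
   elsewhere so the code stays portable. */
#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

/* Count '\n' bytes in a buffer. Marking the match path as unlikely
   tells the compiler to optimise basic-block placement for the hot,
   non-matching path — the situation in long-string scanning. */
size_t count_newlines(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (UNLIKELY(s[i] == '\n'))
            ++count;
    return count;
}
```

The hint does not change semantics, only code layout, so it is safe to apply even when the guess is occasionally wrong.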
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei

I've added it as `Andrei3`. It runs faster with dmd, but it's not as good with ldc. Seems like ldc performs some extra optimization when moving `computeSkip` into the loop, something it doesn't bother to do when it's already there.

dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
Search in Alice in Wonderland
std:     190 ±4
manual:  115 ±4
qznc:    106 ±2
Chris:   160 ±4
Andrei:  159 ±4
Andrei2: 108 ±2
Andrei3: 100 ±0
Search in random short strings
std:     222 ±27
manual:  193 ±49
qznc:    120 ±12
Chris:   224 ±57
Andrei:  114 ±9
Andrei2: 106 ±5
Andrei3: 102 ±3
Mismatch in random long strings
std:     186 ±28
manual:  206 ±85
qznc:    118 ±14
Chris:   265 ±104
Andrei:  194 ±85
Andrei2: 116 ±18
Andrei3: 102 ±4
Search random haystack with random needle
std:     189 ±38
manual:  171 ±45
qznc:    118 ±11
Chris:   225 ±52
Andrei:  149 ±32
Andrei2: 110 ±7
Andrei3: 103 ±6
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

./benchmark.ldc
Search in Alice in Wonderland
std:     170 ±1
manual:  143 ±2
qznc:    133 ±1
Chris:   144 ±1
Andrei:  196 ±10
Andrei2: 100 ±0
Andrei3: 111 ±1
Search in random short strings
std:     223 ±30
manual:  211 ±51
qznc:    124 ±12
Chris:   223 ±61
Andrei:  115 ±10
Andrei2: 102 ±3
Andrei3: 105 ±4
Mismatch in random long strings
std:     181 ±17
manual:  253 ±109
qznc:    146 ±24
Chris:   248 ±108
Andrei:  228 ±96
Andrei2: 101 ±2
Andrei3: 108 ±6
Search random haystack with random needle
std:     187 ±22
manual:  208 ±60
qznc:    152 ±27
Chris:   202 ±58
Andrei:  173 ±35
Andrei2: 102 ±4
Andrei3: 110 ±8
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Re: faster splitter
On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command.
Re: faster splitter
On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command.

1) How about a CTFE find? s.find!(needle, pred) If we can initialize Boyer-Moore or KMP at compile time, it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile time.

2) Even for a run-time one-letter needle I am pretty sure it's worth specializing.
Re: faster splitter
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) How about a CTFE find? s.find!(needle, pred) If we can initialize Boyer-Moore or KMP at compile time, it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile time. 2) Even for a run-time one-letter needle I am pretty sure it's worth specializing.

That makes sense. I think that std.algorithm needs to be revised and optimized for speed. We cannot afford to have suboptimal algorithms in there.
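The run-time specialisation in point 2 can be as simple as dispatching a one-character needle to the libc fast path. A hedged C sketch — `find_dispatch` is an illustrative name of my own, `memchr` is typically vectorised by libc, and `strstr` stands in here for whatever generic substring search is actually used:

```c
#include <stddef.h>
#include <string.h>

/* Dispatch: one-character needles go through memchr, which is usually
   heavily optimised; everything else falls back to the generic search.
   Returns a pointer to the first match, or NULL if absent. */
const char *find_dispatch(const char *hay, size_t hlen,
                          const char *nee, size_t nlen)
{
    if (nlen == 0)
        return hay;              /* empty needle matches at the start */
    if (nlen == 1)               /* special-case a single character */
        return memchr(hay, (unsigned char)nee[0], hlen);
    (void)hlen;                  /* strstr stops at the NUL terminator */
    return strstr(hay, nee);     /* stand-in generic substring search */
}
```

The branch costs one comparison per call, which is negligible next to the scan itself, so the specialisation is essentially free.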
Re: faster splitter
On 06/01/2016 08:41 AM, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest! That would require partial evaluation, which sadly we don't have in D. -- Andrei
Re: faster splitter
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) how about a CTFE find?

What I wanted to say is that in real life, the input of the search routine is very often run-time user-provided data. Think of search boxes in browsers and apps, command-line parameters à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1-character strings, unusual Unicode, or when needle==haystack. I only said this to not lose focus on what is being attempted here. It's a frequent danger of micro-optimization and unit-test-focused development that a lot of time and effort is spent on improvements that are completely irrelevant when checked against what is really needed in the real world (i.e. we're fully in bike-shed territory here).
Re: faster splitter
On Wednesday, 1 June 2016 at 13:47:10 UTC, Patrick Schluter wrote: What I wanted to say, is that in real life, the input of the search routine is very often run-time user provided data. Think of search box in browsers and apps, command line parameter à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1 character strings, unusual Unicode or when needle==haystack. I only said this to not lose the focus on what is being tried to be achieved here. It's often a danger of micro-optimization and unit test focused development, that a lot of time and effort are spent on improvements that are completely irrelevant when checked against what is really needed in real world (i.e. we're full in bike shed territory here). That's true. We should try to optimize for the worst case, i.e. random input. However, it is often the case that the needle is known at compile time, e.g. things like if (input.canFind("foo")) There might be ways to optimize those cases at compile time.
Re: faster splitter
On Wednesday, 1 June 2016 at 14:32:38 UTC, Chris wrote:

For fun, I've written a search function that alternates between the beginning and the end of a string. I'm basically using Andrei's search algorithm for this. It is, of course, only useful if `needle` is expected to be either towards the beginning or the end of a string. If `needle` occurs multiple times, it matches wherever it encounters it first.

T[] findUnique(T)(T[] haystack, T[] needle)
{
    static size_t computeSkip(T[] needle)
    {
        size_t result = 1;
        while (result < needle.length && needle[$ - 1 - result] != needle[$ - 1])
        {
            ++result;
        }
        return result;
    }

    if (needle.length == 0)
        return haystack;

    immutable size_t len = haystack.length;
    immutable size_t lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    auto first = needle[0];
    size_t skip = 0;

    for (size_t i = lastIndex, j = len - lastIndex - 1; i < len; ++i, --j)
    {
        if (last != haystack[i])
            goto back;
        for (size_t k = 0; ; ++k)
        {
            if (k == lastIndex)
                return haystack[i - lastIndex .. $];
            if (haystack[i - lastIndex + k] != needle[k])
                break;
        }
    back:
        if (i > j)
            break;
        if (first != haystack[j])
            continue;
        for (size_t k = lastIndex; ; --k)
        {
            if (k == 0)
                return haystack[j .. $];
            if (haystack[j + k] != needle[k])
                break;
        }
        if (skip == 0)
            skip = computeSkip(needle);
        i += skip;
        j -= skip - 1;
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "Hello abc world!";
    assert(findUnique(s1, "abc") == "abc world!");
    assert(findUnique(s1, "def") == "");
    string s2 = "Ola, mundo!";
    assert(findUnique(s2, "do!") == "do!");
    string s3 = "abc";
    assert(findUnique(s3, "c") == "c");
    assert(findUnique(s3, "z") == "");
    string s4 = "aaabaaa";
    assert(findUnique(s4, "b") == "baaa");
}

I added it to the benchmarking suite (I had to turn off the error warnings, see description above). Naturally, it shines in the first test, as the needle is at the end of the text.
The whole thing is to be taken with a grain of salt ;)

Search in Alice in Wonderland
std:     535 ±8
manual:  454 ±10
qznc:    421 ±5
Chris:   460 ±10
Andrei:  643 ±19
Andrei2: 314 ±6
Andrei3: 352 ±5
Biloop:  100 ±0
Search in random short strings
std:     219 ±33
manual:  194 ±47
qznc:    123 ±11
Chris:   219 ±69
Andrei:  113 ±10
Andrei2: 104 ±4
Andrei3: 103 ±3
Biloop:  195 ±46
Mismatch in random long strings
std:     193 ±30
manual:  249 ±103
qznc:    156 ±30
Chris:   260 ±108
Andrei:  238 ±97
Andrei2: 107 ±11
Andrei3: 115 ±12
Biloop:  141 ±44
Search random haystack with random needle
std:     189 ±25
manual:  193 ±57
qznc:    152 ±26
Chris:   203 ±59
Andrei:  176 ±35
Andrei2: 103 ±6
Andrei3: 110 ±8
Biloop:  141 ±28
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: On 05/31/2016 04:18 PM, Chris wrote: I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei

I ported this to Phobos style (A2Phobos in the benchmark) and changed the pull request [0]. In practically all tests, this algorithm wipes the floor with the competition.

[0] https://github.com/dlang/phobos/pull/4362
Re: faster splitter
On Tuesday, 31 May 2016 at 22:50:37 UTC, David Nadlinger wrote: Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. The current algorithm is generic with respect to the predicate. Once we use SSE/AVX tricks, it is a special case for equality. As a next step in Phobos, this is probably worth it for strings. We could probably steal some well-optimized strcmp from somewhere.
Re: faster splitter
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote: On 05/30/2016 05:31 AM, qznc wrote: On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: worthwhile to use word loads [0] instead. Really fancy would be SSE. I wrote a splitter in SSE4.2 some time ago as a contribution to a github project. Perhaps it is related. https://github.com/HFTrader/string-splitting/blob/master/splithb2.cpp Cheers,
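The "word loads" idea mentioned in the quote can be sketched in portable C with the classic SWAR zero-byte trick, without going all the way to SSE. This is a hedged illustration — `find_byte_swar` is my own example name, not code from the linked project: it scans eight bytes per iteration for a target byte, then confirms with a plain byte loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Find the first occurrence of byte c in s[0..n), scanning a 64-bit
   word at a time. Returns the index, or n if the byte is absent.
   The (w - 0x01..01) & ~w & 0x80..80 expression is nonzero exactly
   when w contains a zero byte; XOR-ing with a broadcast of c turns
   "byte equals c" into "byte is zero". */
size_t find_byte_swar(const char *s, size_t n, char c)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t pattern = ones * (unsigned char)c; /* broadcast c */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);          /* safe unaligned word load */
        uint64_t x = w ^ pattern;      /* zero byte where s[i..] == c */
        if ((x - ones) & ~x & highs)
            break;                     /* some byte in this word matches */
    }
    for (; i < n; ++i)                 /* tail, and exact index in word */
        if (s[i] == c)
            return i;
    return n;
}
```

Even if the word test ever flagged a word spuriously, the byte loop that follows re-checks from that position, so correctness never depends on the bit trick — only the speed does.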