faster splitter
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu wrote: That's comparable, please file an enh request. http://d.puremagic.com/issues/show_bug.cgi?id=9646

Below is an implementation which matches MySplitter with dmd, but not with ldc. More precisely:

dmd -O -release -inline -noboundscheck:
std.algorithm.splitter took 5 secs, 170 ms, and 656 μs
MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec
my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs

ldc2 -O -release:
std.algorithm.splitter took 3 secs, 540 ms, and 84 μs
MySplitter took 2 secs, 288 ms, and 535 μs
my_splitter took 3 secs, 463 ms, and 897 μs

CPU: Intel i7 M 620 2.67GHz
dmd: v2.070.2
ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1)

    auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s)
    if (is(typeof(binaryFun!pred(r.front, s.front)) : bool)
        && (hasSlicing!Range || isNarrowString!Range)
        && isForwardRange!Separator
        && (hasLength!Separator || isNarrowString!Separator))
    {
        import std.algorithm.searching : find;
        import std.conv : unsigned;

        static struct Result
        {
        private:
            Range _input;
            Separator _separator;
            size_t _next_index;
            bool _empty;

            @property auto separatorLength() { return _separator.length; }

            void findIndex()
            {
                auto found = find!pred(_input, _separator);
                _next_index = _input.length - found.length;
            }

        public:
            this(Range input, Separator separator)
            {
                _input = input;
                _separator = separator;
                _empty = false;
                findIndex();
            }

            @property Range front()
            {
                assert(!empty);
                return _input[0 .. _next_index];
            }

            static if (isInfinite!Range)
            {
                enum bool empty = false; // Propagate infiniteness
            }
            else
            {
                @property bool empty() { return _empty; }
            }

            void popFront()
            {
                assert(!empty);
                if (_input.empty)
                {
                    _empty = true;
                    assert(_next_index == 0);
                }
                else if (_input.length == _next_index)
                {
                    _input = _input[$ .. $];
                    _empty = true;
                }
                else
                {
                    _input = _input[_next_index + separatorLength .. $];
                    findIndex();
                }
            }

            static if (isForwardRange!Range)
            {
                @property typeof(this) save()
                {
                    auto ret = this;
                    ret._input = _input.save;
                    return ret;
                }
            }
        }
        return Result(r, s);
    }
Re: faster splitter
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: [...] Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...] have you thought about opening a PR to improve `splitter`?
Re: faster splitter
On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote: On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: [...] Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...] have you thought about opening a PR to improve `splitter`? Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?
Re: faster splitter
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote: On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote: [...] have you thought about opening a PR to improve `splitter`? Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Good luck. If you need help, it's probably best to open the PR, as people can directly see your code, and the Phobos repo is usually pretty active. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR? Absolutely ;-)
Re: faster splitter
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote: Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR? Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
Re: faster splitter
On 05/23/2016 08:17 AM, qznc wrote: [...] Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists. Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function, e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei
Re: faster splitter
On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote: Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei I don't see the value that things like filterBidirectional add when the back and popBack functions can be static if-ed in or out of the normal filter. filterBidirectional always felt like a relic of a time when static ifs didn't exist, even though I know that's not the case.
Re: faster splitter
On 5/23/16 9:44 AM, Jack Stouffer wrote: On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote: Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out the normal filter. What would be the condition tested in the static if? Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei
Re: faster splitter
On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote: What would be the condition tested in the static if? I would assume `static if (isBidirectionalRange!Range)` Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
Re: faster splitter
On Monday, 23 May 2016 at 14:25:52 UTC, Jack Stouffer wrote: [...] I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined? Here is a commit, which removes the dead code: https://github.com/qznc/phobos/commit/0cb1e70f4504a52b81c30342be7ec26d597fac69 It would be strictly better to implement back and popBack, no? My single test case says removing the code does not result in faster code. Maybe dmd optimizes the bookkeeping away. I see three options:

1. Remove the dead bookkeeping code
2. Implement back() and popBack()
3. Use an alternative splitter implementation (and implement back() and popBack())

The third one would be the best, if it is really faster.
Re: faster splitter
On 05/23/2016 10:47 AM, qznc wrote: It would be strictly better to implement the back and popBack, no? If it adds no overhead, you're right. -- Andrei
Re: faster splitter
On 05/23/2016 10:25 AM, Jack Stouffer wrote: [...] I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined? You might be doing extra work even if the user never calls back and popBack. -- Andrei
Re: faster splitter
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote: On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote: On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu wrote: That's comparable, please file an enh request. http://d.puremagic.com/issues/show_bug.cgi?id=9646

I think the actual slow thing is std.find, which splitter uses. Benchmark: https://dpaste.dzfl.pl/1e31b6f95659

The output is:

    std find took    162167000
    manual find took  99852000

Actually, std find should be faster, since it could use the Boyer-Moore algorithm instead of naive string matching.
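For reference, the "manual find" baseline such benchmarks compare against is a plain nested loop. Here is a hedged C sketch of it (the thread later dips into C anyway); `naive_find` and its convention of returning the haystack length when nothing is found are illustrative, not the benchmark's actual code:

```c
#include <stddef.h>

/* Naive nested-loop substring search. Returns the offset of the first
 * occurrence of needle in haystack, or hlen when not found (loosely
 * mirroring D's "remaining slice" convention). Sketch only. */
size_t naive_find(const char *haystack, size_t hlen,
                  const char *needle, size_t nlen)
{
    if (nlen == 0)
        return 0;
    if (nlen > hlen)
        return hlen;
    for (size_t i = 0; i + nlen <= hlen; ++i) {
        size_t j = 0;
        while (j < nlen && haystack[i + j] == needle[j])
            ++j;
        if (j == nlen)
            return i; /* full match starting at i */
    }
    return hlen;
}
```

The point of the thread is that std find should never lose to something this simple.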
Re: faster splitter
On Monday, 23 May 2016 at 14:47:22 UTC, qznc wrote: I see three options: 1. Remove dead bookkeeping code 2. Implement back() and popBack() 3. Use alternative splitter implementation (and implement back() and popBack()) The third one would be the best, if it is really faster. If the performance is really a problem (I think it's a micro optimization at best), then the best option is not to violate DRY. Have a template parameter, std.typecons.Flag, to turn off the back and popBack. Don't have two functions that are 95% the same code like filter and filterBidirectional.
Re: faster splitter
On 05/23/2016 03:11 PM, Jack Stouffer wrote: (I think it's a micro optimization at best) splitter in the stdlib is likely to be very frequently useful, any bit of speedup we put in it is likely to pay off immensely. -- Andrei
Re: faster splitter
On 05/23/2016 03:11 PM, qznc wrote: Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching. Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: [...] Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? I observed that Boyer-Moore from std is still slower. My guess is that this is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066
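The stack-allocated KMP table idea could look like this C sketch. The `kmp_find` name and the `MAX_NEEDLE` cutoff are assumptions of mine, not a Phobos API; a real implementation would fall back to another algorithm for longer needles:

```c
#include <stddef.h>

#define MAX_NEEDLE 256 /* arbitrary cutoff so the table fits on the stack */

/* Knuth-Morris-Pratt with the failure table on the stack.
 * Returns the match offset, or -1 if not found / needle too long. */
ptrdiff_t kmp_find(const char *haystack, size_t hlen,
                   const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen || nlen > MAX_NEEDLE)
        return -1;
    size_t fail[MAX_NEEDLE];
    fail[0] = 0;
    /* Build the failure table: fail[i] = length of the longest proper
     * prefix of needle[0..i] that is also a suffix. */
    for (size_t i = 1, k = 0; i < nlen; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }
    /* Scan the haystack; never re-examine haystack characters. */
    for (size_t i = 0, k = 0; i < hlen; ++i) {
        while (k > 0 && haystack[i] != needle[k])
            k = fail[k - 1];
        if (haystack[i] == needle[k])
            ++k;
        if (k == nlen)
            return (ptrdiff_t)(i - nlen + 1);
    }
    return -1;
}
```

No allocation happens at all, which is exactly the advantage claimed over the heap-allocated Boyer-Moore tables.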
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: On 05/23/2016 03:11 PM, qznc wrote: Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching. Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. Since I work a lot with text, I use canFind etc. a lot. It'd be nice to be able to choose faster algorithms. People are often advised to use library functions instead of rolling their own, because library functions are optimized for speed, but this doesn't seem to be true of Phobos. Simple everyday tasks like string matching should be as fast as C. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
Re: faster splitter
On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote: [...] Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066

From Phobos [1]:

    /* Preprocess #2: init skip[] */
    /* Note: This step could be made a lot faster.
     * A simple implementation is shown here. */
    this.skip = new size_t[needle.length];
    foreach (a; 0 .. needle.length)
    {
        size_t value = 0;
        while (value < needle.length && !needlematch(needle, a, value))
        {
            ++value;
        }
        this.skip[needle.length - a - 1] = value;
    }

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Re: faster splitter
On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote: On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote: Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066 From Phobos [1]: [...]

Unfortunately, it is slower. My current measurements [0]:

    $ ./benchmark 1000 10  # large haystack, few iterations
    std find took    753837231
    manual find took 129561980
    $ ./benchmark 10 1000  # small haystack, many iterations
    std find took    721676721
    manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow
Re: faster splitter
On 05/24/2016 06:44 AM, qznc wrote: [...] The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow

One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei
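For byte haystacks, one way to read the constant-length-skip-array suggestion is a Boyer-Moore-Horspool variant whose 256-entry skip table lives on the stack, so there is no dynamic allocation at all. This is a sketch of that reading, not Phobos code; for wider element types one would hash into the fixed table and fall back to brute force on saturation:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool with a fixed 256-entry skip table on the stack.
 * Returns the match offset, or -1 if not found. Sketch only. */
ptrdiff_t horspool_find(const unsigned char *haystack, size_t hlen,
                        const unsigned char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    size_t skip[256];
    for (size_t i = 0; i < 256; ++i)
        skip[i] = nlen;                 /* default: jump the whole needle */
    for (size_t i = 0; i + 1 < nlen; ++i)
        skip[needle[i]] = nlen - 1 - i; /* distance to the needle's last slot */
    size_t pos = 0;
    while (pos + nlen <= hlen) {
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return (ptrdiff_t)pos;
        /* Shift by the skip of the haystack byte aligned with the
         * needle's last position. */
        pos += skip[haystack[pos + nlen - 1]];
    }
    return -1;
}
```

The whole preprocessing is 256 stores plus one pass over the needle, which addresses the allocation overhead measured above.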
Re: faster splitter
On 05/24/2016 03:54 AM, qznc wrote: [...] I observed that Boyer-Moore from std is still slower. My guess is that this is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.

There's also Rabin-Karp that can be used when the sought range is relatively short.

The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0].

Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.

Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066

Nice, thanks. Andrei
Re: faster splitter
On 05/24/2016 08:38 AM, Andrei Alexandrescu wrote: One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei Another idea: in every searched string there is one specific index that offers the most information to a failed search, allowing for a large skip. Roughly speaking it's the largest index around which there are many characters not seen elsewhere in the searched string. (In a string with all distinct characters, that would be the last character.) If that offset is cheap to compute and offers a good skip, a search starting from that index would be very fast. -- Andrei
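A crude C sketch of the "most informative index" idea: probe the needle position whose character is rarest within the needle itself before paying for a full comparison. The rarest-character choice and the `probe_find` name are my simplification of the proposal above, not its exact formulation:

```c
#include <stddef.h>
#include <string.h>

/* Probe-first search: pick the latest needle index holding a character
 * that is rarest within the needle, and reject candidate positions by
 * checking that index first. Sketch only. */
ptrdiff_t probe_find(const char *haystack, size_t hlen,
                     const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    size_t count[256] = {0};
    for (size_t i = 0; i < nlen; ++i)
        count[(unsigned char)needle[i]]++;
    size_t probe = 0; /* latest index whose character is rarest */
    for (size_t i = 0; i < nlen; ++i)
        if (count[(unsigned char)needle[i]] <= count[(unsigned char)needle[probe]])
            probe = i;
    for (size_t i = 0; i + nlen <= hlen; ++i) {
        if (haystack[i + probe] != needle[probe])
            continue; /* cheap rejection at the informative index */
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

In a needle with all distinct characters this degenerates to checking the last character first, matching the parenthetical above.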
Re: faster splitter
On Tuesday, 24 May 2016 at 10:44:12 UTC, qznc wrote: [...] The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slow

Plot twist: manual find is not always faster. My benchmark now generates a random haystack and needle. Sometimes Phobos clearly wins. Example:

    $ ./benchmark.ldc -n39 -l1 -i1
    Found at 1
    std find took    289529102
    manual find took 368958679

This is compiled with LDC; the needle is 39 chars, the haystack 1 char, and there is 1 iteration. "Found at 1" means the needle was not found. Very often manual find wins, though. Takeaway: manual find is too naive. However, something slows down std find in general. More research needed.

PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.
Re: faster splitter
On 05/24/2016 03:47 PM, qznc wrote: PS: Any recommendations for a Linux dissassembler? Objdump is ok, but some arrows for jumps would be nice. Try http://asm.dlang.org for short tests. -- Andrei
Re: faster splitter
On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote: PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.

I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly:

    |      0x00402e08      nop [rax+rax+0x0]
    .-->   0x00402e10      mov rdx, [rbx]
    ||     0x00402e13      mov rcx, [rbx+0x8]
    ||     0x00402e17      mov rdi, r15
    ||     0x00402e1a      mov rsi, r12
    ||     0x00402e1d      call 0x409160  ; sym._D3std9algorithm9searching34__T4fin
    ||     0x00402e22      mov rcx, [rbx]
    ||     0x00402e25      sub rcx, rax
    ||     0x00402e28      mov [rbx+0x20], rcx
    ||     0x00402e2c      dec ebp
    `==<   0x00402e2e      jnz 0x402e10
    |      0x00402e30      lea r15, [rsp]

Note the ASCII arrow on the left! Objdump should do that. :)
Re: faster splitter
On Tuesday, 24 May 2016 at 22:01:17 UTC, qznc wrote: [...] Note the ASCII arrow on the left! Objdump should do that. :)

Any suggestions how we can make find more efficient?
Re: faster splitter
On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
Re: faster splitter
On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95 And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
Re: faster splitter
On Wednesday, 25 May 2016 at 11:30:33 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote: On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote: Any suggestions how we can make find more efficient? I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95 And here it is: https://github.com/dlang/phobos/pull/4362 Destroy! Great stuff! I really appreciate your efforts. As I said, my code uses find/canFind a lot.
Re: faster splitter
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote: On 05/23/2016 03:11 PM, qznc wrote: Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient. NB: I've found that `foreach` is usually faster than a manual `for`-loop. I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.
Re: faster splitter
On 5/27/16 8:28 AM, Chris wrote: I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient. This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poor, e.g. searching for "aaa...aab" in a long string with only "a"s. NB: I've found that `foreach` is usually faster than a manual `for`-loop. That's surprising too. I'd be interested to look at a disassembly if you have one. I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed. That's definitely the way to go. Thanks for looking into it! Andrei
Re: faster splitter
On Friday, 27 May 2016 at 13:29:00 UTC, Andrei Alexandrescu wrote: On 5/27/16 8:28 AM, Chris wrote: This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaa...aab" in a long string with only "a"s.

I will look into this. Atm, I'm using qznc's benchmark.

NB: I've found that `foreach` is usually faster than a manual `for`-loop. That's surprising too. I'd be interested to look at a disassembly if you have one.

I have to correct myself. A manual loop is faster; I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.

[...] That's definitely the way to go. Thanks for looking into it! Andrei

There is a bug, if it is one, in qznc's `manual_find`. It doesn't find strings at the end of a string, such as "ing" in "string", and it does not find a string that is equal to the search string, as in "abba" == "abba". This outperforms both `manual_find` and the improved `std find`:

    string findStringS_Manual(string haystack, string needle)
    {
        if (needle.length > haystack.length)
            return haystack[$..$];
    outer:
        for (auto i = 0; i < haystack.length; i++)
        {
            if (haystack[i] != needle[0])
                continue;
            for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
                if (haystack[j] != needle[k])
                    continue outer;
            return haystack[i..$];
        }
        return haystack[$..$];
    }

A typical example would be:

    std find took     25642297
    manual find took   7426676  // manual_find
    my std find took   6654496  // improved find
    findStringS took   7159556  // foreach(i, c; haystack)
    findStringS_ took  5315498  // for (auto i = 0 ...) see above

I think it's clear that `std find` is very slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: This outperforms both `manual_find` and the improved `std find` [...]

Little bug fix:

    for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

Else it will be out of range when you look for "za" and the string ends with "z". There is a slight performance penalty, but it's still faster than the rest. Maybe there are cleverer ways, but every little check and decision comes with a performance penalty. The improved `std find` comes very close to the manual function above now; sometimes it's even faster or at least as fast.

    std find took     12573666
    manual find took   7418454
    my std find took   6903854  <===
    findStringS took   7166720
    findStringS_ took  6579529  <===

    std find took     11849892
    manual find took   7407091
    my std find took   6573102  <===
    findStringS took   7296610
    findStringS_ took  6573214  <===

    std find took     10744361
    manual find took   7576133
    my std find took   6332014  <===
    findStringS took   7224946
    findStringS_ took  6555976  <===
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: I think it's clear that `std find` is very slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version). If you want to see find (current or my improved version) get really slow, then reduce the alphabet for the haystack (e.g. --alphabet=ab). ;)
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast. I think it really depends on the use case. The manual algorithm is really naive and std find is slightly more advanced. My benchmark creates a random haystack and needle, but this is not real-world data. I believe the "slightly more advanced" approach is a good idea, but there is no data to prove it. It might be interesting to see what algorithms other languages' standard libraries (C++, Python, Java, etc.) use. Thanks for looking at the benchmarking, Chris! The more eyes, the better. :)
Re: faster splitter
On 5/27/16 10:41 AM, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast. What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei
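The last-element-first suggestion can be sketched in C for byte strings (hypothetical `find_last_first`; the real Phobos code works on generic ranges):

```c
#include <stddef.h>

/* Match the needle's last element first; only on a hit run an inner loop
 * that needs no end-of-haystack check, because the window already fits.
 * Returns the match offset, or -1 if not found. Sketch only. */
ptrdiff_t find_last_first(const char *haystack, size_t hlen,
                          const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return -1;
    const char last = needle[nlen - 1];
    for (size_t i = nlen - 1; i < hlen; ++i) {
        if (haystack[i] != last)
            continue; /* cheap filter on the last element */
        const size_t start = i - (nlen - 1);
        size_t j = 0;
        while (j + 1 < nlen && haystack[start + j] == needle[j])
            ++j; /* no bounds check needed: window ends at i */
        if (j + 1 == nlen)
            return (ptrdiff_t)start;
    }
    return -1;
}
```

Starting the scan at index nlen-1 also gives the correct upper bound for free: no candidate window ever runs past the haystack.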
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: This outperforms both `manual_find` and the improved `std find`:

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$..$];
    outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0])
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;
        return haystack[i..$];
    }
    return haystack[$..$];
}

Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

The upper bound of the outer loop should be haystack.length-needle.length. There's no point in searching after that point, as the needle would not fit anymore, and it avoids the buffer overrun you have in your code when the first character of the needle is found after that point. You can then also remove the ?: test in your j loop, as that case is handled automatically.
Re: faster splitter
Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong.
Re: faster splitter
On Friday, 27 May 2016 at 19:17:44 UTC, Chris wrote: Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong. I had the same "bug" when I wrote my search function on the project at work. I also found out that using the simplest loop is generally the best way to go. Processors nowadays have loop caches, fast level-1 caches etc.; all the fancy search algorithms like Boyer-Moore can only overcome their memory and initialisation overhead for really huge haystacks. Here's the C code of my memmem() function that replaced the Boyer-Moore I had before. This implementation outran the BM search for sizes up to several hundred megabytes on a SPARC64 VII+ machine. On x86-64 I'm sure it would be even worse.

void *memmem(const void *haystack, size_t haystack_len,
             const void *needle, size_t needle_len)
{
    if (!haystack_len || !needle_len || needle_len > haystack_len)
        return NULL;

    uint8_t u8 = *((uint8_t *)needle);
    const uint8_t *p = (const uint8_t *)haystack;

    for (size_t i = haystack_len - needle_len + 1; i; --i) {
        if (*p == u8 && !memcmp(p, needle, needle_len))
            return (void *)p;
        else
            p++;
    }
    return NULL;
}

(sorry for posting C code, as I'm only beginning to learn D).
Re: faster splitter
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote: The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

std find took 12573666
manual find took 7418454
my std find took 6903854 <===
findStringS took 7166720
findStringS_ took 6579529 <===

I just updated my benchmark. It now checks lots of different scenarios instead of you having to specify one. You can only set the number of iterations now. It generates a random scenario (haystack length, needle length, alphabet, etc) and runs it once with all algorithms. Instead of recording the absolute runtime, it records the slowdown compared to the fastest of the three. I average those and also compute the absolute deviation (the second number). Example:

std find: 153 ±33
manual find: 112 ±19
my std find: 119 ±16

So manual find is on average 12% slower than the fastest, ±19%. I would interpret that as no significant difference between manual and improved find.
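qznc's aggregation scheme (relative slowdown against the fastest contender, then mean absolute deviation) can be sketched in C; the helper names here are made up for illustration and are not from the benchmark source:

```c
#include <stddef.h>

/* Slowdown of one run relative to the fastest contender, scaled so
 * that 100 means "as fast as the fastest" (hypothetical helper). */
static double slowdown(double runtime, double fastest) {
    return 100.0 * runtime / fastest;
}

/* Mean of n values. */
static double mean(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s / n;
}

/* Mean absolute deviation from the mean: the "±" number in the tables. */
static double mean_abs_dev(const double *v, size_t n) {
    double m = mean(v, n), s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += (v[i] > m) ? v[i] - m : m - v[i];
    return s / n;
}
```

So a run that took 112 time units while the fastest took 100 contributes a slowdown of 112; averaging the slowdowns per algorithm yields lines like "manual find: 112 ±19".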
Re: faster splitter
Now added Chris' algo to the benchmark:

std find: 155 ±33
manual find: 112 ±19
qznc find: 122 ±18 <--- improved find
Chris find: 133 ±20 <--- findStringS_
Re: faster splitter
On 05/27/2016 04:39 PM, qznc wrote: Now added Chris' algo to the benchmark:

std find: 155 ±33
manual find: 112 ±19
qznc find: 122 ±18 <--- improved find
Chris find: 133 ±20 <--- findStringS_

Does any of these feature simultaneously:
* check the last element first
* consequently stop early when the length of the haystack falls below the length of the needle
* consequently no check against the haystack limit is necessary in the inner loop
* use only one induction variable
? Andrei
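Andrei's checklist can be illustrated with a brute-force C sketch (not code from the thread; the function name is invented). It checks the last element first, naturally stops once the needle no longer fits, and its inner loop checks only against the needle's length, never the haystack's:

```c
#include <stddef.h>

/* Returns the offset of the first occurrence of needle in haystack,
 * or hlen if there is none. */
static size_t find_last_first(const char *haystack, size_t hlen,
                              const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;  /* needle cannot fit: stop early */
    const size_t last = nlen - 1;
    /* j is the haystack position aligned with the needle's last element:
     * the single induction variable of the outer loop. The bound j < hlen
     * also guarantees the window start j - last stays in range. */
    for (size_t j = last; j < hlen; ++j) {
        if (haystack[j] != needle[last]) continue;  /* last element first */
        /* Last elements match; compare the rest. The only remaining
         * check is against the needle length, not the haystack. */
        size_t i = 0;
        while (i < last && haystack[j - last + i] == needle[i]) ++i;
        if (i == last) return j - last;
    }
    return hlen;
}
```

Whether this counts as "only one induction variable" depends on how you count the inner loop's index, but the outer loop carries exactly one.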
Re: faster splitter
On Friday, 27 May 2016 at 20:50:52 UTC, Andrei Alexandrescu wrote: On 05/27/2016 04:39 PM, qznc wrote: Now added Chris' algo to the benchmark: std find:155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_ Does any of these feature simultaneously: * check the last element first * consequently stop early when the length of the haystack falls below the length of the needle * consequently no check against the haystack limit is necessary in the inner loop Std find (currently in Phobos) and qznc find (see pull request) fulfill those three. Manual and Chris find are simpler. * use only one induction variable None of those. I made such a variant of qznc find and it gives me the same numbers.
Re: faster splitter
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote: I have to correct myself. A manual loop is faster. I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster. Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should look pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong. — David
Re: faster splitter
On Friday, 27 May 2016 at 18:21:06 UTC, Andrei Alexandrescu wrote: What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei

Another plot twist: Sentinels look interesting. Wilzbach [0] mentioned it in the pull request: "Btw on a related issue - it seems like no one has actually bothered to implement Andrei's "Fastware" sentinels in Phobos. If you are already benchmarking, you might give this specialization for mutable RandomAccessRanges a try ;-)" I tried it [1]. It changes the numbers from

std find: 153 ±32
manual find: 112 ±19
qznc find: 121 ±18
Chris find: 132 ±20

to

std find: 160 ±31
manual find: 118 ±24
qznc find: 109 ±13 <--- using the sentinel trick
Chris find: 142 ±27

It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown. This means another special case for mutable ranges could be worthwhile.

[0] https://github.com/dlang/phobos/pull/4362#issuecomment-24387
[1] https://github.com/qznc/find-is-too-slow/commit/043d1d2f6dced625e964e8df883ad5a45d7fe654
Re: faster splitter
On 05/27/2016 06:19 PM, qznc wrote: manual find: 118 ±24 qznc find: 109 ±13 <--- using the sentinel trick Chris find: 142 ±27 It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown. What compiler? This means another special case for mutable ranges could be worthwhile. Also mind the throwing possibilities. Andrei
Re: faster splitter
On Friday, 27 May 2016 at 20:39:03 UTC, qznc wrote: Now added Chris' algo to the benchmark: std find:155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_ Did you fix the bugs in my algorithm? S
Re: faster splitter
On Saturday, 28 May 2016 at 01:30:11 UTC, Andrei Alexandrescu wrote: What compiler?

LDC. For DMD the numbers show no improvement:

std find: 170 ±31
manual find: 132 ±25
qznc find: 103 ±6
Chris find: 157 ±30

to

std find: 168 ±30
manual find: 131 ±25
qznc find: 104 ±8 <--- sentinel
Chris find: 155 ±29

(Caveat: Chris find has a bug, which is triggered once in the 10,000 runs each.) The sentinel trick is only a proof of concept. The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.

"This means another special case for mutable ranges could be worthwhile. Also mind the throwing possibilities." What do you mean? Using exceptions?

If anybody wants to replicate: it only takes a few seconds and there are no dependencies except dmd/ldc.

git clone https://github.com/qznc/find-is-too-slow
cd find-is-too-slow
make dmd
make ldc
Re: faster splitter
On 5/28/16 6:56 AM, qznc wrote: The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges. No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

T[] find(T)(T[] haystack, T[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    size_t j = lastIndex;
    for (; j < haystack.length; ++j)
    {
        if (haystack[j] != last) continue;
        immutable k = j - lastIndex;
        // last elements match
        for (size_t i = 0; ; ++i)
        {
            if (i == lastIndex) return haystack[k .. $];
            if (needle[i] != haystack[k + i]) break;
        }
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "hello abc world";
    assert(find(s1, "abc") == "abc world");
    assert(find(s1, "def") == "");
}

void main() {}

Andrei
Re: faster splitter
On Friday, 27 May 2016 at 21:31:48 UTC, David Nadlinger wrote: Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong. — David I agree, and it has been said on this forum that tight manual loops are faster than the range paradigm (I think bearophile was one of them, if I remember correctly). Maybe we should benchmark this too, just in case.
Re: faster splitter
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote: Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work: [...]

I included your function (I also fixed the bug in my findStringS_). Here's a typical result:

std find: 208 ±61
manual find: 127 ±29
qznc find: 108 ±13
Chris find: 140 ±32
Andrei find: 126 ±27
=
std find: 207 ±59
manual find: 128 ±30
qznc find: 108 ±13
Chris find: 141 ±32
Andrei find: 125 ±27

I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.
Re: faster splitter
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote: No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

A nice one. Here are the numbers I got:

LDC:
std find: 156 ±33
manual find: 117 ±24
qznc find: 114 ±14
Chris find: 135 ±24
Andrei find: 112 ±27

DMD:
std find: 177 ±31
manual find: 140 ±28
qznc find: 102 ±4
Chris find: 165 ±31
Andrei find: 130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in the inner loop. My inner loop looks like this:

haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // only condition
        // ... ok, in here comes another, but this is not as costly
    }
}

As long as you are matching the needle, there is only one condition to check. There is no check for k < needleLength, because the comparison always fails for needle[$] aka haystack[scout] due to the sentinel. I thought it might be slower, since we need to write to memory twice instead of only reading. Turns out it is faster.
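The sentinel variant can be sketched in full in C, under the assumptions qznc's post states: the haystack must be mutable and `needleBack+1` must be representable. This is an illustration of the idea as described, not the benchmark's actual code:

```c
#include <stddef.h>

/* Sentinel-style search: once the needle's last element matches, the
 * matching haystack byte is overwritten with a value that cannot match,
 * so the inner loop's only branch is the mismatch test. The clobbered
 * byte is restored before returning. Returns hlen if not found. */
static size_t find_sentinel(char *haystack, size_t hlen,
                            const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;
    const size_t last = nlen - 1;
    const char needle_back = needle[last];
    for (size_t scout = last; scout < hlen; ++scout) {
        if (haystack[scout] != needle_back) continue;
        const char saved = haystack[scout];
        haystack[scout] = (char)(needle_back + 1);      /* place sentinel */
        size_t j = scout - last, k = 0;
        while (haystack[j] == needle[k]) { ++j; ++k; }  /* no k < nlen check */
        haystack[scout] = saved;                        /* restore */
        if (k == last) return scout - last;  /* stopped exactly at sentinel */
    }
    return hlen;
}
```

The place/restore pair is the "write to memory twice" cost mentioned above; the loop is guaranteed to stop at the sentinel at the latest, so no needle-length check is needed.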
Re: faster splitter
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote: I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner. DMD performance feels flaky to me. If performance is important, you should use ldc or gdc. Alternatively, you are Walter Bright and simply optimize dmd. ;)
Re: faster splitter
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
std find:    153 ±25 +66 (1934) -15 (7860)
manual find: 122 ±28 +80 (1812) -17 (8134)
qznc find:   125 ±16 +18 (4644) -15 (5126)
Chris find:  148 ±29 +75 (1976) -18 (7915)
Andrei find: 114 ±23 +100 (1191) -13 (8770)
(avg slowdown vs fastest; absolute deviation)

$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
std find:    160 ±27 +44 (3162) -20 (6709)
manual find: 148 ±28 +54 (2701) -19 (7178)
qznc find:   102 ±3  +27 ( 766) -1 (9136)
Chris find:  175 ±30 +55 (2796) -21 (7106)
Andrei find: 122 ±22 +46 (2554) -14 (7351)
(avg slowdown vs fastest; absolute deviation)

The additional numbers on the right are the ±MAD separated into above and below the mean. For example, Andrei find with ldc:

Andrei find: 114 ±23 +100 (1191) -13 (8770)

The mean slowdown is 114, which means 14% slower than the fastest one. The mean absolute deviation (MAD) is 23. More precisely, the mean deviation above the mean slowdown is +100 and the mean deviation below it is -13. 1191 of the 10,000 runs were above the mean slowdown and 8770 below. The 39 missing runs are equal to the mean slowdown.

What bothers me is that changing the alphabet changes the numbers so much. Currently, if you restrict the alphabets for haystack and needle, the numbers change. The benchmark already does a random subset on each run, but there is definitely a bias.
Re: faster splitter
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote: I played around with the benchmark. Some more numbers: [...] E: wrong result with Chris find [...] What bothers me is that changing the alphabet changes the numbers so much.
You can avoid "E: wrong result with Chris find" by using

outer:
for (auto i = 0; i < haystack.length-needle.length; i++)
{
    if (haystack[i] != needle[0])
        continue;
    for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
        if (haystack[j] != needle[k])
            continue outer;
    return haystack[i..$];
}

It's a tad faster. I'm planning to test on more varied data and see where a bias may occur.
Re: faster splitter
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote: I played around with the benchmark. Some more numbers: The mean slowdown is 114, which means 14% slower than the fastest one. A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension; the median will do a better job of reducing the effect of those outliers than the mean. Your benchmark code could publish both forms. --Jon
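Jon's suggestion, the median together with the median of absolute deviations from the median, can be sketched like this (an illustration only; error handling omitted):

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n values; sorts the array in place for simplicity. */
static double median(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Median absolute deviation from the median; works on copies so the
 * caller's data is untouched. */
static double median_mad(const double *v, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    double *dev = malloc(n * sizeof *dev);
    for (size_t i = 0; i < n; ++i) tmp[i] = v[i];
    double m = median(tmp, n);
    for (size_t i = 0; i < n; ++i)
        dev[i] = (v[i] > m) ? v[i] - m : m - v[i];
    double r = median(dev, n);
    free(tmp);
    free(dev);
    return r;
}
```

With runs {100, 110, 120, 1000}, the mean is dragged to 332.5 by the outlier, while the median stays at 115 and the median MAD is 10, which is the outlier resistance Jon is after.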
Re: faster splitter
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote: A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms. I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median? When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE. However, this is more advanced and requires further introspection. [0] http://forum.dlang.org/post/aahhopdcrengakvoe...@forum.dlang.org
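The word-load idea can be illustrated for the first-byte scan with the classic SWAR zero-in-word test (a sketch, not the code behind the linked post):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Word-at-a-time scan for one byte: XOR a broadcast of the byte into
 * each 8-byte chunk, then test whether any resulting byte is zero.
 * Falls back to a byte loop for the tail. Returns len if not found. */
static size_t scan_byte(const unsigned char *s, size_t len, unsigned char b) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t pattern = ones * b;  /* broadcast b to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);           /* safe unaligned load */
        uint64_t x = w ^ pattern;       /* matching bytes become zero */
        if ((x - ones) & ~x & highs) {  /* some byte in this word is zero */
            for (size_t j = 0; j < 8; ++j)
                if (s[i + j] == b) return i + j;
        }
    }
    for (; i < len; ++i)
        if (s[i] == b) return i;
    return len;
}
```

The word loop touches eight bytes per iteration with a single branch in the common no-hit case; only a word that actually contains the byte triggers the short byte-level confirmation loop.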
Re: faster splitter
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote: A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms. I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median? Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
Re: faster splitter
It's weird: on my machine, my find function is consistently faster than `manual find`.

LDC:
std find:    137 ±22 +33 (3580) -17 (6251)
manual find: 137 ±32 +53 (3105) -23 (6834)
qznc find:   106 ±8  +17 (2608) -5 (7195)
Chris find:  132 ±32 +58 (2803) -23 (7131)
Andrei find: 120 ±26 +54 (2466) -17 (7454)
===
std find:    136 ±22 +33 (3503) -17 (6304)
manual find: 137 ±33 +55 (3000) -23 (6920)
qznc find:   106 ±8  +17 (2535) -5 (7287)
Chris find:  132 ±33 +59 (2803) -23 (7137)
Andrei find: 119 ±25 +51 (2569) -16 (7374)

It's negligible, but I wonder if it's the extra initialization: `size_t end = haystack.length - needle.length + 1;` and `size_t i = 0` are defined before we know if it's a legitimate search. But that shouldn't matter, or should it? Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find` - well done! :)

DMD:
std find:    161 ±36 +46 (4015) -31 (5847)
manual find: 147 ±40 +68 (3001) -28 (6917)
qznc find:   106 ±8  +15 (2623) -5 (7172)
Chris find:  201 ±41 +73 (2830) -28 (7120)
Andrei find: 150 ±40 +61 (3347) -30 (6575)

PS I've corrected the end of the search loop to `outer: for (auto i = 0; i < haystack.length-(needle.length-1); i++)`, else it will not match at the end of a string.
Re: faster splitter
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE.

So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
std find:      133 ±25 +38 (3384) -19 (6486)
manual find:   140 ±37 +64 (2953) -25 (6962)
qznc find:     114 ±17 +33 (2610) -11 (7262)
Chris find:    146 ±39 +66 (3045) -28 (6873)
Andrei find:   126 ±29 +54 (2720) -19 (7189)
Wordwise find: 130 ±30 +53 (2934) -21 (6980)

Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear. Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
Re: faster splitter
On 05/30/2016 05:31 AM, qznc wrote: [...] Finding a reliable benchmark is hard. :/

Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically. https://dpaste.dzfl.pl/dc8dc6e1eb53 It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei
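In case the paste link rots, the trick as described can be reconstructed roughly in C (a sketch of the idea, not Andrei's actual code): after the needle's last character matches but the full comparison fails, skip ahead by the distance from the needle's end to the previous occurrence of that character inside the needle, and compute that skip lazily in a separate function.

```c
#include <stddef.h>

/* Distance from the needle's last character back to its previous
 * occurrence inside the needle; the whole needle length if there is
 * none. Called at most once, on first use. */
static size_t compute_skip(const char *needle, size_t nlen) {
    size_t skip = nlen;
    for (size_t i = nlen - 1; i-- > 0; )
        if (needle[i] == needle[nlen - 1]) { skip = nlen - 1 - i; break; }
    return skip;
}

/* Returns the offset of needle in haystack, or hlen if not found. */
static size_t find_lazy_skip(const char *haystack, size_t hlen,
                             const char *needle, size_t nlen) {
    if (nlen == 0) return 0;
    if (nlen > hlen) return hlen;
    const size_t last = nlen - 1;
    size_t skip = 0;  /* 0 means "not computed yet" */
    size_t j = last;
    while (j < hlen) {
        if (haystack[j] != needle[last]) { ++j; continue; }
        size_t i = 0;
        while (i < last && haystack[j - last + i] == needle[i]) ++i;
        if (i == last) return j - last;
        if (skip == 0) skip = compute_skip(needle, nlen);  /* lazy */
        j += skip;  /* safe: no closer alignment can match the last char */
    }
    return hlen;
}
```

Keeping compute_skip out of line mirrors the stated design goal: the hot loop stays tight, and the skip table work is only paid once the last character has matched and failed at least once.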
Re: faster splitter
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote: Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically. https://dpaste.dzfl.pl/dc8dc6e1eb53 It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
Re: faster splitter
On Monday, 30 May 2016 at 18:57:15 UTC, Chris wrote: Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.

Congratulations!

DMD:
./benchmark.dmd
std:    178 ±31 +36 (4475) -29 (5344)
manual: 167 ±46 +82 (2883) -32 (7054)
qznc:   114 ±7  +18 (1990) -5 (7288)
Chris:  228 ±49 +82 (3050) -35 (6780)
Andrei: 103 ±5  +47 ( 658) -2 (9295)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

LDC:
std:    184 ±19 +28 (3420) -14 (6523)
manual: 205 ±59 +120 (2463) -39 (7443)
qznc:   151 ±25 +44 (2983) -17 (6911)
Chris:  194 ±57 +78 (3702) -46 (6251)
Andrei: 101 ±2  +42 ( 435) -1 (9542)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Re: faster splitter
On 05/30/2016 04:00 PM, Chris wrote: [benchmark numbers snipped]

Thanks for looking into this! @qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
Re: faster splitter
On Monday, 30 May 2016 at 20:08:46 UTC, Andrei Alexandrescu wrote: Thanks for looking into this! @qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei

What function call should be replaced with inline code? Overall, I'm very confused and displeased by the benchmark right now. Initially, the problem was that a naive manual find was slower than std find in Phobos. Look at the LDC numbers: manual find is slower than std find. By tweaking the benchmark, the numbers can be shifted around arbitrarily. Even without tweaking, it is very processor dependent.
Here are the numbers from my laptop:

./benchmark.ldc
std:    149 ±24 +29 (4275) -21 (5604)
manual: 129 ±37 +90 (2055) -23 (7917)
qznc:   131 ±22 +26 (4381) -20 (5488)
Chris:  154 ±35 +81 (2192) -22 (7672)
Andrei: 118 ±28 +82 (1762) -16 (8194)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

./benchmark.dmd
std:    143 ±22 +26 (4219) -19 (5602)
manual: 158 ±31 +47 (3436) -24 (6492)
qznc:   101 ±2  +12 (1349) -1 (7851)
Chris:  216 ±38 +56 (3393) -28 (6541)
Andrei: 135 ±31 +48 (3326) -24 (6589)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

And Desktop:

./benchmark.ldc
std:    129 ±24 +40 (3121) -17 (6767)
manual: 129 ±31 +59 (2668) -21 (7244)
qznc:   112 ±14 +30 (2542) -9 (7312)
Chris:  134 ±33 +58 (2835) -23 (7068)
Andrei: 123 ±27 +53 (2679) -18 (7225)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

./benchmark.dmd
std:    157 ±31 +44 (3693) -24 (6234)
manual: 143 ±41 +73 (2854) -28 (7091)
qznc:   116 ±21 +35 (3092) -14 (6844)
Chris:  181 ±50 +74 (3452) -38 (6510)
Andrei: 136 ±38 +64 (2975) -27 (6953)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Re: faster splitter
On 05/30/2016 06:16 PM, qznc wrote: What function call should be replaced with inline code?

The call to computeSkip.

Overall, I'm very confused and displeased by the benchmark right now.

I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind:

* Search a long string for a one-character string, match and fail.
* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).
* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.

Andrei
Re: faster splitter
On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote: [desktop numbers snipped]

Benchmark from desktop machine:

DMD:
std:    164 ±34 +43 (4054) -29 (5793)
manual: 150 ±41 +72 (2889) -29 (7032)
qznc:   103 ±6  +42 ( 878) -2 (9090)
Chris:  205 ±43 +81 (2708) -29 (7232)
Andrei: 136 ±31 +53 (2948) -22 (6977)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
===
LDC:
std:    138 ±23 +35 (3457) -18 (6360)
manual: 145 ±33 +45 (3748) -27 (6181)
qznc:   105 ±7  +17 (2267) -4 (7534)
Chris:  135 ±33 +56 (3061) -23 (6882)
Andrei: 121 ±27 +52 (2630) -18 (7301)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

On my laptop Andrei's was the fastest (see post above).
Re: faster splitter
On Tuesday, 31 May 2016 at 08:43:59 UTC, Chris wrote: [benchmark numbers snipped] On my laptop Andrei's was the fastest (see post above).

Comparing the chips involved, could it be cache related?

3M cache; Andrei wins both: http://ark.intel.com/products/75459/Intel-Core-i5-4200U-Processor-3M-Cache-up-to-2_60-GHz

4M cache; qznc wins DMD (and is faster than LDC's best? What?); Andrei wins LDC: http://ark.intel.com/products/43560/Intel-Core-i7-620M-Processor-4M-Cache-2_66-GHz

8M cache; qznc wins both: http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz http://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz

Normally, I'd expect the 4200U to be similar to the desktop parts. Unless... Say, for the laptops (and I guess the desktops too, but it's more important in a mobile), did you verify that CPU frequency scaling wasn't interfering? -Wyatt
Re: faster splitter
On Tuesday, 31 May 2016 at 17:31:29 UTC, Wyatt wrote: qznc wins DMD (and is faster than the LDC's best? Careful! These are not absolute numbers, but relative slowdowns. You cannot compare the numbers between LDC and DMD.
Re: faster splitter
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote: I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail. There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. * Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length). How long should the english text be? A Tweet? A book? A Gigabyte of log files? English text means basically ASCII and no Unicode? * Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended. Why the prepend/append? To force a mismatch?
Re: faster splitter
On 5/31/16 1:54 PM, qznc wrote: Using a one-letter needle string is more like a user mistake than something to optimize for. How is splitting on ',' a user mistake? -- Andrei
Re: faster splitter
On 5/31/16 1:54 PM, qznc wrote: On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote: I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail. There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline. * Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length). How long should the english text be? A Tweet? A book? A Gigabyte of log files? I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000. English text means basically ASCII and no Unicode? My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect. * Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended. Why the prepend/append? To force a mismatch? Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world. Thanks, Andrei
Re: faster splitter
On Tuesday, 31 May 2016 at 18:31:47 UTC, Andrei Alexandrescu wrote: On 5/31/16 1:54 PM, qznc wrote: Using a one-letter needle string is more like a user mistake than something to optimize for. How is splitting on ',' a user mistake? -- Andrei

The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
std:     168 ±6  +29 ( 107)  -3 ( 893)
manual:  112 ±3  +28 (  81)  -1 ( 856)
qznc:    149 ±4  +30 (  79)  -1 ( 898)
Chris:   142 ±5  +28 ( 102)  -2 ( 898)
Andrei:  153 ±3  +28 (  79)  -1 ( 919)
Andrei2: 101 ±2  +34 (  31)  -1 ( 969)
Search random haystack with random needle
std:     172 ±19 +61 ( 161) -11 ( 825)
manual:  161 ±47 +73 ( 333) -35 ( 666)
qznc:    163 ±21 +33 ( 314) -15 ( 661)
Chris:   190 ±47 +80 ( 302) -33 ( 693)
Andrei:  140 ±37 +60 ( 315) -27 ( 676)
Andrei2: 103 ±6  +57 (  64)  -2 ( 935)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.
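The "lazy skip" idea behind the Andrei2 entry can be sketched as follows. This is a hedged C illustration, not the actual D code from the thread: `find_lazy_skip` is a made-up name, and the shape of the function is my own. The point is that the search scans for the needle's last character, and the Boyer-Moore-style skip for that character is only computed the first time a partial match fails, so needles that never partially match pay nothing for it.

```c
#include <stddef.h>
#include <string.h>

/* Hedged sketch of a search with a lazily computed skip.
   Returns a pointer to the first match, or NULL if absent. */
const char *find_lazy_skip(const char *hay, size_t hlen,
                           const char *nee, size_t nlen)
{
    if (nlen == 0)
        return hay;
    if (hlen < nlen)
        return NULL;
    const size_t last = nlen - 1;
    const char lastc = nee[last];
    size_t skip = 0; /* 0 means "not yet computed" */

    for (size_t i = last; i < hlen; ++i) {
        if (hay[i] != lastc)
            continue; /* cheap scan for the needle's last character */
        if (memcmp(hay + i - last, nee, last) == 0)
            return hay + i - last; /* full match */
        /* Partial match failed: compute the skip once, lazily.
           It is the distance from the needle's last character to its
           previous occurrence in the needle (or nlen if none). */
        if (skip == 0) {
            skip = 1;
            while (skip < nlen && nee[last - skip] != lastc)
                ++skip;
        }
        i += skip - 1; /* with the loop's ++i, we advance by `skip` */
    }
    return NULL;
}
```

Because the skip only depends on the needle, computing it on the first failed partial match amortises to the same cost as precomputing it, while keeping the fast path free of setup work.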
Re: faster splitter
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote:

The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
std:     168 ±6  +29 ( 107)  -3 ( 893)
manual:  112 ±3  +28 (  81)  -1 ( 856)
qznc:    149 ±4  +30 (  79)  -1 ( 898)
Chris:   142 ±5  +28 ( 102)  -2 ( 898)
Andrei:  153 ±3  +28 (  79)  -1 ( 919)
Andrei2: 101 ±2  +34 (  31)  -1 ( 969)
Search random haystack with random needle
std:     172 ±19 +61 ( 161) -11 ( 825)
manual:  161 ±47 +73 ( 333) -35 ( 666)
qznc:    163 ±21 +33 ( 314) -15 ( 661)
Chris:   190 ±47 +80 ( 302) -33 ( 693)
Andrei:  140 ±37 +60 ( 315) -27 ( 676)
Andrei2: 103 ±6  +57 (  64)  -2 ( 935)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.

Would it speed things up even more if we put the function `computeSkip` into the loop, or is this done automatically by the compiler?
Re: faster splitter
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?

LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
std:     147 ±1
manual:  100 ±0
qznc:    121 ±1
Chris:   103 ±1
Andrei:  144 ±1
Andrei2: 105 ±1
Search in random short strings
std:     125 ±15
manual:  117 ±10
qznc:    104 ±6
Chris:   123 ±14
Andrei:  104 ±5
Andrei2: 103 ±4
Mismatch in random long strings
std:     140 ±22
manual:  164 ±64
qznc:    115 ±13
Chris:   167 ±63
Andrei:  161 ±68
Andrei2: 106 ±9
Search random haystack with random needle
std:     138 ±27
manual:  135 ±33
qznc:    116 ±16
Chris:   141 ±36
Andrei:  131 ±33
Andrei2: 109 ±12
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

"Random short strings" uses haystacks of 10 to 300 characters and needles of 2 to 10 — basically, no time for initialisation. "Random long strings" uses haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles of 50 to 500; it inserts a character at a random index of the needle to force a mismatch. The last one is the same configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)
Re: faster splitter
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote:

On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?

LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
std:     147 ±1
manual:  100 ±0
qznc:    121 ±1
Chris:   103 ±1
Andrei:  144 ±1
Andrei2: 105 ±1
Search in random short strings
std:     125 ±15
manual:  117 ±10
qznc:    104 ±6
Chris:   123 ±14
Andrei:  104 ±5
Andrei2: 103 ±4
Mismatch in random long strings
std:     140 ±22
manual:  164 ±64
qznc:    115 ±13
Chris:   167 ±63
Andrei:  161 ±68
Andrei2: 106 ±9
Search random haystack with random needle
std:     138 ±27
manual:  135 ±33
qznc:    116 ±16
Chris:   141 ±36
Andrei:  131 ±33
Andrei2: 109 ±12
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

"Random short strings" uses haystacks of 10 to 300 characters and needles of 2 to 10 — basically, no time for initialisation. "Random long strings" uses haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles of 50 to 500; it inserts a character at a random index of the needle to force a mismatch. The last one is the same configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)

Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. The nice thing about `Andrei2` is that it performs consistently well.
Re: faster splitter
On 05/31/2016 04:18 PM, Chris wrote: I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei In general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though. Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. — David
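The branch-expectation hint David mentions looks like this in C. This is a hedged illustration only: `count_newlines` is a made-up example function, and the macro is the usual GCC/Clang idiom corresponding to `ldc.intrinsics.llvm_expect` (with a portable no-op fallback):

```c
#include <stddef.h>

/* __builtin_expect is a GCC/Clang extension; fall back to a no-op
   elsewhere so the code stays portable. */
#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

/* Count '\n' bytes in a buffer. Marking the match path as unlikely
   tells the compiler to optimise basic-block placement for the hot,
   non-matching path — the situation in long-string scanning. */
size_t count_newlines(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (UNLIKELY(s[i] == '\n'))
            ++count;
    return count;
}
```

The hint does not change semantics, only code layout, so it is safe to apply even when the guess is occasionally wrong.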
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei

I've added it as `Andrei3`. It runs faster with dmd, but it's not as good with ldc. Seems like ldc performs some extra optimization when moving `computeSkip` into the loop, something it doesn't bother to do when it's already there.

dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
Search in Alice in Wonderland
std:     190 ±4
manual:  115 ±4
qznc:    106 ±2
Chris:   160 ±4
Andrei:  159 ±4
Andrei2: 108 ±2
Andrei3: 100 ±0
Search in random short strings
std:     222 ±27
manual:  193 ±49
qznc:    120 ±12
Chris:   224 ±57
Andrei:  114 ±9
Andrei2: 106 ±5
Andrei3: 102 ±3
Mismatch in random long strings
std:     186 ±28
manual:  206 ±85
qznc:    118 ±14
Chris:   265 ±104
Andrei:  194 ±85
Andrei2: 116 ±18
Andrei3: 102 ±4
Search random haystack with random needle
std:     189 ±38
manual:  171 ±45
qznc:    118 ±11
Chris:   225 ±52
Andrei:  149 ±32
Andrei2: 110 ±7
Andrei3: 103 ±6
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

./benchmark.ldc
Search in Alice in Wonderland
std:     170 ±1
manual:  143 ±2
qznc:    133 ±1
Chris:   144 ±1
Andrei:  196 ±10
Andrei2: 100 ±0
Andrei3: 111 ±1
Search in random short strings
std:     223 ±30
manual:  211 ±51
qznc:    124 ±12
Chris:   223 ±61
Andrei:  115 ±10
Andrei2: 102 ±3
Andrei3: 105 ±4
Mismatch in random long strings
std:     181 ±17
manual:  253 ±109
qznc:    146 ±24
Chris:   248 ±108
Andrei:  228 ±96
Andrei2: 101 ±2
Andrei3: 108 ±6
Search random haystack with random needle
std:     187 ±22
manual:  208 ±60
qznc:    152 ±27
Chris:   202 ±58
Andrei:  173 ±35
Andrei2: 102 ±4
Andrei3: 110 ±8
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Re: faster splitter
On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command.
Re: faster splitter
On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command.

1) How about a CTFE find? s.find!(needle, pred) If we can initialize Boyer-Moore or KMP at compile time, it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile time.

2) Even for a run-time one-letter needle I am pretty sure it's worth specializing.
Re: faster splitter
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) How about a CTFE find? s.find!(needle, pred) If we can initialize Boyer-Moore or KMP at compile time, it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile time. 2) Even for a run-time one-letter needle I am pretty sure it's worth specializing.

That makes sense. I think that std.algorithm needs to be revised and optimized for speed. We cannot afford to have suboptimal algorithms in there.
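The run-time specialisation in point 2 can be as simple as dispatching a one-character needle to the libc fast path. A hedged C sketch — `find_dispatch` is an illustrative name of my own, `memchr` is typically vectorised by libc, and `strstr` stands in here for whatever generic substring search is actually used:

```c
#include <stddef.h>
#include <string.h>

/* Dispatch: one-character needles go through memchr, which is usually
   heavily optimised; everything else falls back to the generic search.
   Returns a pointer to the first match, or NULL if absent. */
const char *find_dispatch(const char *hay, size_t hlen,
                          const char *nee, size_t nlen)
{
    if (nlen == 0)
        return hay;              /* empty needle matches at the start */
    if (nlen == 1)               /* special-case a single character */
        return memchr(hay, (unsigned char)nee[0], hlen);
    (void)hlen;                  /* strstr stops at the NUL terminator */
    return strstr(hay, nee);     /* stand-in generic substring search */
}
```

The branch costs one comparison per call, which is negligible next to the scan itself, so the specialisation is essentially free.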
Re: faster splitter
On 06/01/2016 08:41 AM, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest! That would require partial evaluation, which sadly we don't have in D. -- Andrei
Re: faster splitter
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote: On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote: On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote: There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for. At compile time you may not know the length of the needle, like in the grep command. 1) how about a CTFE find?

What I wanted to say is that in real life, the input of the search routine is very often run-time user-provided data. Think of search boxes in browsers and apps, command-line parameters à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1-character strings, unusual Unicode, or when needle==haystack. I only said this to not lose focus on what is being attempted here. It's a frequent danger of micro-optimization and unit-test-focused development that a lot of time and effort is spent on improvements that are completely irrelevant when checked against what is really needed in the real world (i.e. we're fully in bike-shed territory here).
Re: faster splitter
On Wednesday, 1 June 2016 at 13:47:10 UTC, Patrick Schluter wrote: What I wanted to say, is that in real life, the input of the search routine is very often run-time user provided data. Think of search box in browsers and apps, command line parameter à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1 character strings, unusual Unicode or when needle==haystack. I only said this to not lose the focus on what is being tried to be achieved here. It's often a danger of micro-optimization and unit test focused development, that a lot of time and effort are spent on improvements that are completely irrelevant when checked against what is really needed in real world (i.e. we're full in bike shed territory here). That's true. We should try to optimize for the worst case, i.e. random input. However, it is often the case that the needle is known at compile time, e.g. things like if (input.canFind("foo")) There might be ways to optimize those cases at compile time.
Re: faster splitter
On Wednesday, 1 June 2016 at 14:32:38 UTC, Chris wrote:

For fun, I've written a search function that alternates between the beginning and the end of a string. I'm basically using Andrei's search algorithm for this. It is, of course, only useful if `needle` is expected to be either towards the beginning or the end of a string. If `needle` occurs multiple times, it matches wherever it encounters it first.

T[] findUnique(T)(T[] haystack, T[] needle)
{
    static size_t computeSkip(T[] needle)
    {
        size_t result = 1;
        while (result < needle.length && needle[$ - 1 - result] != needle[$ - 1])
        {
            ++result;
        }
        return result;
    }

    if (needle.length == 0)
        return haystack;

    immutable size_t len = haystack.length;
    immutable size_t lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    auto first = needle[0];
    size_t skip = 0;

    for (size_t i = lastIndex, j = len - lastIndex - 1; i < len; ++i, --j)
    {
        if (last != haystack[i])
            goto back;
        for (size_t k = 0; ; ++k)
        {
            if (k == lastIndex)
                return haystack[i - lastIndex .. $];
            if (haystack[i - lastIndex + k] != needle[k])
                break;
        }
    back:
        if (i > j)
            break;
        if (first != haystack[j])
            continue;
        for (size_t k = lastIndex; ; --k)
        {
            if (k == 0)
                return haystack[j .. $];
            if (haystack[j + k] != needle[k])
                break;
        }
        if (skip == 0)
            skip = computeSkip(needle);
        i += skip;
        j -= skip - 1;
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "Hello abc world!";
    assert(findUnique(s1, "abc") == "abc world!");
    assert(findUnique(s1, "def") == "");
    string s2 = "Ola, mundo!";
    assert(findUnique(s2, "do!") == "do!");
    string s3 = "abc";
    assert(findUnique(s3, "c") == "c");
    assert(findUnique(s3, "z") == "");
    string s4 = "aaabaaa";
    assert(findUnique(s4, "b") == "baaa");
}

I added it to the benchmarking suite (I had to turn off the error warnings, see description above). Naturally, it shines in the first test, as the needle is at the end of the text.
The whole thing is to be taken with a grain of salt ;)

Search in Alice in Wonderland
std:     535 ±8
manual:  454 ±10
qznc:    421 ±5
Chris:   460 ±10
Andrei:  643 ±19
Andrei2: 314 ±6
Andrei3: 352 ±5
Biloop:  100 ±0
Search in random short strings
std:     219 ±33
manual:  194 ±47
qznc:    123 ±11
Chris:   219 ±69
Andrei:  113 ±10
Andrei2: 104 ±4
Andrei3: 103 ±3
Biloop:  195 ±46
Mismatch in random long strings
std:     193 ±30
manual:  249 ±103
qznc:    156 ±30
Chris:   260 ±108
Andrei:  238 ±97
Andrei2: 107 ±11
Andrei3: 115 ±12
Biloop:  141 ±44
Search random haystack with random needle
std:     189 ±25
manual:  193 ±57
qznc:    152 ±26
Chris:   203 ±59
Andrei:  176 ±35
Andrei2: 103 ±6
Andrei3: 110 ±8
Biloop:  141 ±28
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: On 05/31/2016 04:18 PM, Chris wrote: I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei

I ported this to Phobos style (A2Phobos in the benchmark) and changed the pull request [0]. In practically all tests, this algorithm wipes the floor with the competition.

[0] https://github.com/dlang/phobos/pull/4362
Re: faster splitter
On Tuesday, 31 May 2016 at 22:50:37 UTC, David Nadlinger wrote: Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. The current algorithm is generic with respect to the predicate. Once we use SSE/AVX tricks, it is a special case for equality. As a next step in Phobos, this is probably worth it for strings. We could probably steal some well-optimized strcmp from somewhere.
Re: faster splitter
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote: On 05/30/2016 05:31 AM, qznc wrote: On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote: worthwhile to use word loads [0] instead. Really fancy would be SSE. I wrote a splitter in SSE4.2 some time ago as a contribution to a github project. Perhaps it is related. https://github.com/HFTrader/string-splitting/blob/master/splithb2.cpp Cheers,
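The "word loads" idea mentioned in the quote can be sketched in portable C with the classic SWAR zero-byte trick, without going all the way to SSE. This is a hedged illustration — `find_byte_swar` is my own example name, not code from the linked project: it scans eight bytes per iteration for a target byte, then confirms with a plain byte loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Find the first occurrence of byte c in s[0..n), scanning a 64-bit
   word at a time. Returns the index, or n if the byte is absent.
   The (w - 0x01..01) & ~w & 0x80..80 expression is nonzero exactly
   when w contains a zero byte; XOR-ing with a broadcast of c turns
   "byte equals c" into "byte is zero". */
size_t find_byte_swar(const char *s, size_t n, char c)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t pattern = ones * (unsigned char)c; /* broadcast c */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);          /* safe unaligned word load */
        uint64_t x = w ^ pattern;      /* zero byte where s[i..] == c */
        if ((x - ones) & ~x & highs)
            break;                     /* some byte in this word matches */
    }
    for (; i < n; ++i)                 /* tail, and exact index in word */
        if (s[i] == c)
            return i;
    return n;
}
```

Even if the word test ever flagged a word spuriously, the byte loop that follows re-checks from that position, so correctness never depends on the bit trick — only the speed does.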