Re: Any relation?

2016-10-12 Thread Neica Dorin via Digitalmars-d
On Tuesday, 11 October 2016 at 18:13:53 UTC, Andrei Alexandrescu 
wrote:

http://indianautosblog.com/2016/10/most-powerful-suzuki-swift-produces-350-hp-25
 -- Andrei


Good day,
Dear Mr. Alexandrescu, I have studied the new language you 
developed a little. I tried to run a few small programs but had 
problems with the IDEs. What IDE do you recommend for 
beginners? Thank you!


Re: Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 09:35 PM, Stefan Koch wrote:

On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:

On 10/12/2016 08:41 PM, safety0ff wrote:

On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:


It made little difference: LDC compiled into AVX2 vectorized addition
(vpmovzxbq & vpaddq.)


Measurements without -mcpu=native:
overhead 0.336s
bytes 0.610s
without branch hints 0.852s
code pasted 0.766s


So we should be able to reduce overhead by means of proper code
arrangement and interplay of inlining and outlining. The prize,
however, would be to get the AVX instructions for ASCII going. Is that
possible? -- Andrei


AVX for ASCII?
What are you referring to?
Most text processing is terribly incompatible with SIMD.
SSE 4.2 has a few instructions that do help, but as far as I am aware it
is not yet widespread.


Oh ok, so it's that checksum in particular that got optimized. Bad 
benchmark! Bad! -- Andrei


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu 
wrote:

On 10/12/2016 08:41 PM, safety0ff wrote:

On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:


It made little difference: LDC compiled into AVX2 vectorized 
addition

(vpmovzxbq & vpaddq.)


Measurements without -mcpu=native:
overhead 0.336s
bytes 0.610s
without branch hints 0.852s
code pasted 0.766s


So we should be able to reduce overhead by means of proper code 
arrangement and interplay of inlining and outlining. The prize, 
however, would be to get the AVX instructions for ASCII going. 
Is that possible? -- Andrei


AVX for ASCII?
What are you referring to?
Most text processing is terribly incompatible with SIMD.
SSE 4.2 has a few instructions that do help, but as far as I am 
aware it is not yet widespread.


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu 
wrote:

On 10/12/2016 08:11 PM, Stefan Koch wrote:

We should probably introduce a new module for stuff like this.
object.d is already filled with too many unrelated things.


Yah, shouldn't go in object.d as it's fairly niche. On the 
other hand defining a new module for two functions seems 
excessive unless we have a good theme. On the third hand we may 
find an existing module that's topically close. Thoughts? -- 
Andrei


maybe core.intrinsics?
or code.codelayout?

We can control the layout at object-file level.
We should be able to expose some of that functionality.


Re: Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 08:41 PM, safety0ff wrote:

On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:


It made little difference: LDC compiled into AVX2 vectorized addition
(vpmovzxbq & vpaddq.)


Measurements without -mcpu=native:
overhead 0.336s
bytes 0.610s
without branch hints 0.852s
code pasted 0.766s


So we should be able to reduce overhead by means of proper code 
arrangement and interplay of inlining and outlining. The prize, however, 
would be to get the AVX instructions for ASCII going. Is that possible? 
-- Andrei


Re: Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 08:11 PM, Stefan Koch wrote:

We should probably introduce a new module for stuff like this.
object.d is already filled with too many unrelated things.


Yah, shouldn't go in object.d as it's fairly niche. On the other hand 
defining a new module for two functions seems excessive unless we have a 
good theme. On the third hand we may find an existing module that's 
topically close. Thoughts? -- Andrei


Re: Reducing the cost of autodecoding

2016-10-12 Thread safety0ff via Digitalmars-d

On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:


It made little difference: LDC compiled into AVX2 vectorized 
addition (vpmovzxbq & vpaddq.)


Measurements without -mcpu=native:
overhead 0.336s
bytes 0.610s
without branch hints 0.852s
code pasted 0.766s


Re: Reducing the cost of autodecoding

2016-10-12 Thread safety0ff via Digitalmars-d
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:


Wait, so going through the bytes made almost no difference? Or 
did you subtract the overhead already?




It made little difference: LDC compiled into AVX2 vectorized 
addition (vpmovzxbq & vpaddq.)


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d

On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote:
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:


I think we should define two aliases "likely" and "unlikely" 
with default implementations:


bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into 
their intrinsics.


Works?


Andrei


I was about to suggest the same.
I can prepare a PR.


We should probably introduce a new module for stuff like this.
object.d is already filled with too many unrelated things.


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:


I think we should define two aliases "likely" and "unlikely" 
with default implementations:


bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into 
their intrinsics.


Works?


Andrei


I was about to suggest the same.
I can prepare a PR.


Re: Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 04:02 PM, safety0ff wrote:

On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:


Remember the ASCII part is the bothersome one. There are only two
comparisons, all with 100% predictability. We should be able to
arrange matters so the loss is negligible. -- Andrei


My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes 0.385s


Wait, so going through the bytes made almost no difference? Or did you 
subtract the overhead already?



current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-ASCII code outside of the loop
and creates a really tight loop body.


I think we should define two aliases "likely" and "unlikely" with 
default implementations:


bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their 
intrinsics.


Works?


Andrei
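For concreteness, a sketch of what the default implementations plus a compiler hook might look like. The `ldc.intrinsics.llvm_expect` import is LDC's actual intrinsic; the `version (LDC)` guard, the `pragma(inline)` annotations, and the placement are assumptions, not a merged design.

```d
// Sketch only: likely/unlikely defaults with an LDC hook.
version (LDC)
{
    import ldc.intrinsics : llvm_expect;
    pragma(inline, true) bool likely(bool b)   { return llvm_expect(b, true); }
    pragma(inline, true) bool unlikely(bool b) { return llvm_expect(b, false); }
}
else
{
    // Default implementations: identity functions, free after inlining.
    pragma(inline, true) bool likely(bool b)   { return b; }
    pragma(inline, true) bool unlikely(bool b) { return b; }
}

unittest
{
    int x = 3;
    if (likely(x > 0)) ++x; // semantics identical to the raw condition
    assert(x == 4 && !unlikely(false));
}
```

Since the fallbacks are identity functions, code using the hints behaves identically on every compiler; only the generated block layout differs.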


Re: Any front end experts n da house?

2016-10-12 Thread tsbockman via Digitalmars-d

On Wednesday, 12 October 2016 at 22:38:33 UTC, Stefan Koch wrote:

On Wednesday, 12 October 2016 at 22:16:38 UTC, tsbockman wrote:
Yes. The path to fix 259 is clear, and Lionello Lunesu and 
I have already done most of the work.


14835 is a blocker due to the nature of the solution that 
Walter and Andrei approved (which I agree is the right one); 
an independent implementation would run into the same problem.


Great news!


Only if that blocker is dealt with - otherwise it's just wasted 
effort...


Re: Any front end experts n da house?

2016-10-12 Thread Stefan Koch via Digitalmars-d

On Wednesday, 12 October 2016 at 22:16:38 UTC, tsbockman wrote:
On Wednesday, 12 October 2016 at 16:36:32 UTC, Andrei 
Alexandrescu wrote:

On 10/12/2016 12:31 PM, Stefan Koch wrote:

I can take a look at 259.
14835 is nothing trivial though.


My understanding is Thomas has an attack on 259 once a 
solution to 14835 is up. -- Andrei


Yes. The path to fix 259 is clear, and Lionello Lunesu and 
I have already done most of the work.


14835 is a blocker due to the nature of the solution that 
Walter and Andrei approved (which I agree is the right one); an 
independent implementation would run into the same problem.


Great news!


Rust-style Ownership and Borrowing In D!

2016-10-12 Thread Nordlöw via Digitalmars-d

Hi!

I've recently started a new job, and with that come new 
colleagues and new language discussions/wars ;)


So there's this language Rust. And it provides some pretty 
amazing safety guarantees when it comes to memory management and 
algorithm correctness (preventing iterator invalidation), in 
both single-threaded and multi-threaded contexts (task-based 
parallelism). One of its brightest showcases is Rayon:


https://github.com/nikomatsakis/rayon

and more specifically

https://github.com/nikomatsakis/rayon#using-join-for-recursive-divide-and-conquer-problems

After having given this a lot of thought, I decided to explore 
whether we can implement some of Rust's great ideas in D.


My conclusion so far is that we can convert some of Rust's great 
compile-time checks to corresponding run-time checks in D, and 
use these to provide safe scope-checked reference access to 
non-RC C++/Rust-style containers, with a one-word memory overhead 
alongside the container and range/slice.


Here's my proof of concept so far:

https://github.com/nordlow/phobos-next/blob/master/src/borrown.d

To understand its usage, start by looking at the unittests at

https://github.com/nordlow/phobos-next/blob/master/src/borrown.d#L161

Note that it is used as a wrapper on top of a non-copyable 
array-container `Array` I've been developing over the last few months.


Destroy!
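To make the run-time borrow-counting idea above concrete, here is a minimal sketch in D. This is illustrative only, not the actual borrown.d code; the `Owned`/`ReadBorrow` names, the single-word `borrows` counter, and the omission of the write-borrow path are all assumptions made for brevity.

```d
// Sketch of run-time borrow counting: one word of state alongside the
// payload, checked at run time instead of at compile time as in Rust.
// (Illustrative names; not the actual borrown.d design.)
struct Owned(T)
{
    private T payload;
    private int borrows; // 0: free, >0: N readers (writer path omitted here)

    static struct ReadBorrow
    {
        private Owned!T* owner;
        ref const(T) get() { return owner.payload; }
        ~this() { if (owner) --owner.borrows; } // release on scope exit
        @disable this(this); // a borrow handle is not copyable
    }

    ReadBorrow readBorrow() return
    {
        assert(borrows >= 0, "cannot read-borrow while write-borrowed");
        ++borrows;
        return ReadBorrow(&this);
    }
}

unittest
{
    Owned!int x;
    {
        auto r = x.readBorrow();
        assert(r.get == 0);
    } // borrow released here by ReadBorrow's destructor
}
```

The key design point is that the handle's destructor releases the borrow, so violations surface as assertion failures at run time rather than compile errors.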


Re: Any front end experts n da house?

2016-10-12 Thread tsbockman via Digitalmars-d
On Wednesday, 12 October 2016 at 16:36:32 UTC, Andrei 
Alexandrescu wrote:

On 10/12/2016 12:31 PM, Stefan Koch wrote:

I can take a look at 259.
14835 is nothing trivial though.


My understanding is Thomas has an attack on 259 once a solution 
to 14835 is up. -- Andrei


Yes. The path to fix 259 is clear, and Lionello Lunesu and I 
have already done most of the work.


14835 is a blocker due to the nature of the solution that Walter 
and Andrei approved (which I agree is the right one); an 
independent implementation would run into the same problem.


Re: Reducing the cost of autodecoding

2016-10-12 Thread safety0ff via Digitalmars-d

On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:


where did you apply the branch hints ?


Code: http://pastebin.com/CFCpUftW


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d

On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
Alexandrescu wrote:


Remember the ASCII part is the bothersome one. There are only 
two comparisons, all with 100% predictability. We should be 
able to arrange matters so the loss is negligible. -- Andrei


My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes 0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-ASCII code outside of 
the loop and creates a really tight loop body.


where did you apply the branch hints ?


Re: Reducing the cost of autodecoding

2016-10-12 Thread safety0ff via Digitalmars-d
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
Alexandrescu wrote:


Remember the ASCII part is the bothersome one. There are only two 
comparisons, all with 100% predictability. We should be able to 
arrange matters so the loss is negligible. -- Andrei


My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes 0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-ASCII code outside of the 
loop and creates a really tight loop body.
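For readers who don't follow the pastebin link, here is an illustrative sketch of what an `llvm_expect`-hinted ASCII fast path can look like. This is not safety0ff's actual code: the non-LDC fallback stub and the bsr-based length computation (which does not special-case an invalid 0xFF lead byte) are assumptions made for the sketch.

```d
// Sketch of an ASCII branch hint. llvm_expect is LDC's real intrinsic; the
// stub below just lets the sketch compile on other compilers.
version (LDC)
    import ldc.intrinsics : llvm_expect;
else
    T llvm_expect(T)(T val, T expected) { return val; } // no-op fallback

void popFrontHinted(ref char[] s) @trusted pure nothrow
{
    immutable c = s[0];
    if (llvm_expect(c < 0x80, true))   // hint: ASCII dominates
    {
        s = s.ptr[1 .. s.length];      // tight fast path stays in the loop
    }
    else                               // cold path; LDC can move it out of line
    {
        import core.bitop : bsr;
        import std.algorithm.comparison : min;
        // count of leading 1 bits in c = sequence length
        // (0xFF would hit bsr(0); real code must special-case it)
        immutable size_t n = 7 - bsr(~cast(uint)c & 0xFF);
        s = s.ptr[min(n, s.length) .. s.length];
    }
}
```

With the hint, the compiler lays out the multi-byte branch as the cold, out-of-line path, which is exactly the effect described in the measurements above.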


Re: Reducing the cost of autodecoding

2016-10-12 Thread Johan Engelen via Digitalmars-d
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:


On my machine, with "ldc2 -release -O3 -enable-inlining"


"-O3 -enable-inlining" is synonymous with "-O3"  :-)

With LDC 1.1.0-beta3, you can try with 
"-enable-cross-module-inlining". It won't cross-module inline 
everything (notably: nested functions), but it's a start. If you 
see large performance differences, let me know.


cheers,
  Johan



Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 01:05 PM, safety0ff wrote:

On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:

[Snip]


Didn't see the LUT implementation, nvm!


Yah, that's pretty clever. Better yet, I suspect we can reuse the 
look-up table for front() as well. -- Andrei




Re: Can you shrink it further?

2016-10-12 Thread safety0ff via Digitalmars-d

On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:

[Snip]


Didn't see the LUT implementation, nvm!



Re: Can you shrink it further?

2016-10-12 Thread safety0ff via Digitalmars-d

My current favorites:

void popFront(ref char[] s) @trusted pure nothrow {
  immutable byte c = s[0];
  if (c >= -2) {
s = s.ptr[1 .. s.length];
  } else {
import core.bitop;
size_t i = 7u - bsr(~c);
import std.algorithm;
s = s.ptr[min(i, s.length) .. s.length];
  }
}

I also experimented with explicit speculation:

void popFront(ref char[] s) @trusted pure nothrow {
  immutable byte c = s[0];
  s = s.ptr[1 .. s.length];
  if (c < -2) {
import core.bitop;
size_t i = 6u - bsr(~c);
import std.algorithm;
s = s.ptr[min(i, s.length) .. s.length];
  }
}


LDC and GDC both compile these to 23 instructions.
DMD does worse than with my other code.

You can influence GDC's block layout with __builtin_expect.

I notice that many other snippets posted use uint instead of 
size_t in the multi-byte branch. This generates extra 
instructions for me.


Any front end experts n da house?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d
So it would be great to get the super annoying 
https://issues.dlang.org/show_bug.cgi?id=259 to a conclusion, and it 
seems the similarly annoying 
https://issues.dlang.org/show_bug.cgi?id=14835 is in the way.


If anyone would like to look into the latter that would be great. Good 
regression testing (e.g. on dub projects) would be necessary.



Andrei


Re: Any front end experts n da house?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 12:31 PM, Stefan Koch wrote:

I can take a look at 259.
14835 is nothing trivial though.


My understanding is Thomas has an attack on 259 once a solution to 14835 
is up. -- Andrei


Re: Any front end experts n da house?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 16:27:05 UTC, Andrei 
Alexandrescu wrote:
So it would be great to get the super annoying 
https://issues.dlang.org/show_bug.cgi?id=259 to a conclusion, 
and it seems the similarly annoying 
https://issues.dlang.org/show_bug.cgi?id=14835 is in the way.


If anyone would like to look into the latter that would be 
great. Good regression testing (e.g. on dub projects) would be 
necessary.



Andrei


I can take a look at 259.
14835 is nothing trivial though.


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei 
Alexandrescu wrote:


No need. 1% for dmd is negligible. 25% would raise an eyebrow. 
-- Andrei


Alright then
PR: https://github.com/dlang/phobos/pull/4849


Re: Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 12:03 PM, Stefan Koch wrote:

This will only work really efficiently with some state on the stack.


Remember the ASCII part is the bothersome one. There are only two 
comparisons, all with 100% predictability. We should be able to arrange 
matters so the loss is negligible. -- Andrei


Re: Reducing the cost of autodecoding

2016-10-12 Thread Ilya Yaroshenko via Digitalmars-d
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko 
wrote:
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from 
s = s[1 .. $]. Smaller functions make sure that the impact on 
instruction cache in larger applications is not high.


Now it's time to look at the end-to-end cost of autodecoding. 
I wrote this simple microbenchmark:


=
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
import std.algorithm, std.array, std.stdio;
char[] line = "0123456789".dup.repeat(50_000_000).join;


For a fair test the line should fit into the L1 cache. --Ilya


EDIT: For a fair test the line should fit into the L1 cache. --Ilya



Re: Reducing the cost of autodecoding

2016-10-12 Thread Ilya Yaroshenko via Digitalmars-d
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from s 
= s[1 .. $]. Smaller functions make sure that the impact on 
instruction cache in larger applications is not high.


Now it's time to look at the end-to-end cost of autodecoding. I 
wrote this simple microbenchmark:


=
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
import std.algorithm, std.array, std.stdio;
char[] line = "0123456789".dup.repeat(50_000_000).join;


For a fair test the line should fit into the L1 cache. --Ilya


Re: Reducing the cost of autodecoding

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from s 
= s[1 .. $]. Smaller functions make sure that the impact on 
instruction cache in larger applications is not high.


Now it's time to look at the end-to-end cost of autodecoding. I 
wrote this simple microbenchmark:


=
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
import std.algorithm, std.array, std.stdio;
char[] line = "0123456789".dup.repeat(50_000_000).join;
ulong checksum;
if (args.length == 1)
{
while (line.length) {
version(autodecode)
{
checksum += line.myFront;
line.myPopFront;
}
else
{
checksum += line[0];
line = line[1 .. $];
}
}
version(autodecode)
writeln("autodecode ", checksum);
else
writeln("bytes ", checksum);
}
else
writeln("overhead");
}
=

On my machine, with "ldc2 -release -O3 -enable-inlining" I get 
something like 0.54s overhead, 0.81s with no autodecoding, and 
1.12s with autodecoding.


Your mission, should you choose to accept it, is to define a 
combination front/popFront that reduces the gap.



Andrei


This will only work really efficiently with some state on the 
stack.

If we are to support Unicode.


Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 10:39 AM, Stefan Koch wrote:

On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:

On 10/12/2016 09:39 AM, Stefan Koch wrote:




Thanks! I'd say make sure there is exactly 0% loss on performance
compared to the popFront in the ASCII case, and if so make a PR with
the table version. -- Andrei


I measured again.
The table version DECREASES the performance for dmd by 1%.
I think we should keep performance for dmd in mind.
I could add the table version in a version (LDC) block.


No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei 
Alexandrescu wrote:

On 10/12/2016 09:39 AM, Stefan Koch wrote:




Thanks! I'd say make sure there is exactly 0% loss on 
performance compared to the popFront in the ASCII case, and if 
so make a PR with the table version. -- Andrei


I measured again.
The table version DECREASES the performance for dmd by 1%.
I think we should keep performance for dmd in mind.
I could add the table version in a version (LDC) block.

I cannot currently measure on gdc.
I'd appreciate more perf data.


Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 09:39 AM, Stefan Koch wrote:

On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:

On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:


In the second case, the compiler generates an inc for bumping the
pointer and a dec for decreasing the length (small instructions). If
the variable char_length is used, add/sub must be used (larger). --
Andrei


When I write the code so I can keep the
str = str[1 .. str.length]
pattern, it leads to larger code overall and decreases performance,
both in the table-lookup and in the multi-branch code.

update:
for ldc the table-lookup code is 27% faster than the current popFront.
on dmd the table-lookup is 2% faster than the current popFront.

for ldc the branching code is 23% faster than the current popFront
for dmd the branching code is 20% faster than the current popFront


Thanks! I'd say make sure there is exactly 0% loss on performance 
compared to the popFront in the ASCII case, and if so make a PR with the 
table version. -- Andrei


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
Alexandrescu wrote:

On 10/12/2016 06:56 AM, Stefan Koch wrote:
I just confirmed that the branching version is faster than 
table-lookup.


please test it out for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.


Nice. I like that the table is NOT looked up on the ASCII path, 
so it stays cold in ASCII text. However, there's a problem with 
this pattern:


size_t char_length = 1;
immutable c = s[0];
if (c < 192)
{
Lend :
s = s.ptr[char_length .. s.length];
return ;
}

as opposed to:

immutable c = s[0];
if (c < 192)
{
Lend :
s = s.ptr[1 .. s.length];
return ;
}

In the second case, the compiler generates an inc for bumping 
the pointer and a dec for decreasing the length (small 
instructions). If the variable char_length is used, add/sub 
must be used (larger). -- Andrei


btw.
We could get rid of a sub if we changed slices from holding a 
pointer and a length to holding two pointers.
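Purely to illustrate that trade-off (this is not a proposed ABI change; D slices really are a length-and-pointer pair), a two-pointer "slice" makes popFront a single pointer increment:

```d
// Sketch: a two-pointer slice, where advancing needs no length decrement.
struct Slice2(T)
{
    T* ptr; // first element
    T* end; // one past the last element

    bool empty() const { return ptr is end; }
    size_t length() const { return end - ptr; } // length computed on demand
    ref inout(T) front() inout { return *ptr; }
    void popFront() { ++ptr; } // one inc; a {length, ptr} slice needs inc + dec
}

unittest
{
    int[3] a = [10, 20, 30];
    auto s = Slice2!int(a.ptr, a.ptr + a.length);
    s.popFront();
    assert(s.length == 2 && s.front == 20);
}
```

The cost moves from popFront to length, which is now a subtraction, so the win depends on which operation dominates the loop.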




Reducing the cost of autodecoding

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d
So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. 
$]. Smaller functions make sure that the impact on instruction cache in 
larger applications is not high.


Now it's time to look at the end-to-end cost of autodecoding. I wrote 
this simple microbenchmark:


=
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
import std.algorithm, std.array, std.stdio;
char[] line = "0123456789".dup.repeat(50_000_000).join;
ulong checksum;
if (args.length == 1)
{
while (line.length) {
version(autodecode)
{
checksum += line.myFront;
line.myPopFront;
}
else
{
checksum += line[0];
line = line[1 .. $];
}
}
version(autodecode)
writeln("autodecode ", checksum);
else
writeln("bytes ", checksum);
}
else
writeln("overhead");
}
=

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something 
like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with 
autodecoding.


Your mission, should you choose to accept it, is to define a combination 
front/popFront that reduces the gap.



Andrei


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d

On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
Alexandrescu wrote:


In the second case, the compiler generates an inc for bumping 
the pointer and a dec for decreasing the length (small 
instructions). If the variable char_length is used, add/sub 
must be used (larger). -- Andrei


When I write the code so I can keep the
str = str[1 .. str.length]
pattern, it leads to larger code overall and decreases performance,
both in the table-lookup and in the multi-branch code.

update:
for ldc the table-lookup code is 27% faster than the current 
popFront.

on dmd the table-lookup is 2% faster than the current popFront.

for ldc the branching code is 23% faster than the current popFront
for dmd the branching code is 20% faster than the current popFront


Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 06:56 AM, Stefan Koch wrote:

I just confirmed that the branching version is faster than table-lookup.

please test it out for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.


Nice. I like that the table is NOT looked up on the ASCII path, so it 
stays cold in ASCII text. However, there's a problem with this pattern:


size_t char_length = 1;
immutable c = s[0];
if (c < 192)
{
Lend :
s = s.ptr[char_length .. s.length];
return ;
}

as opposed to:

immutable c = s[0];
if (c < 192)
{
Lend :
s = s.ptr[1 .. s.length];
return ;
}

In the second case, the compiler generates an inc for bumping the 
pointer and a dec for decreasing the length (small instructions). If the 
variable char_length is used, add/sub must be used (larger). -- Andrei


Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 05:23 AM, Stefan Koch wrote:

All three are slower than baseline for my test case.
What did you test it against?


I'd say: (a) test for speed of ASCII-only text; (b) make it small. 
That's all we need. Nobody worries about 10-20% in multibyte-heavy text. 
-- Andrei


Re: Can you shrink it further?

2016-10-12 Thread Andrei Alexandrescu via Digitalmars-d

On 10/12/2016 04:56 AM, Matthias Bentrup wrote:


void popFront1b(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= -8) {
s = s[1 .. $];
  } else {
uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
import std.algorithm;
s = s[min(i, $) .. $];
  }
}


This is really small although it does two jumps on the frequent path. 
Perhaps an improvement over the current implementation. Nice work! -- Andrei


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
I just confirmed that the branching version is faster than 
table-lookup.


please test it out for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.



Re: Any relation?

2016-10-12 Thread Martin Krejcirik via Digitalmars-d

Then maybe this isn't photoshopped:

https://twitter.com/stamcd/status/742563964656062464


Why would it be? It's a screenshot from the Forza Motorsport game.



Re: New encryption block...

2016-10-12 Thread Era Scarecrow via Digitalmars-d

On Sunday, 9 October 2016 at 20:33:29 UTC, Era Scarecrow wrote:
 Something that comes to mind is the idea of a small algorithm 
to be used with existing encryption functions to extend the 
encryption block size with minimal complexity growth.


 For fun I'm experimenting with this. So far it seems fairly 
promising, although I'm sure I'm writing it very insecurely. 
Maybe it would be better for random number generation than for 
secure encryption? Not sure.


 Anyways, 16bit replacement, extending to 64bit via reordering, 
and 8 unique xor's between stages. Once I get the 576 block 
finished (1 for salt) I'll probably publish my ugly code for 
consideration and to be torn apart for security issues.


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup 
wrote:
On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch 
wrote:
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias 
Bentrup wrote:

[...]


All three are slower than baseline for my test case.
What did you test it against?


The blns.txt file mentioned upthread.


And what were your timings?

BTW, the code you posted would not be a proper replacement for 
the UTF-8 popFront, since our UTF-8 popFront does handle 
deprecated sequences correctly.


making it equivalent to this code:
void pfnew(ref char[] s) @trusted pure nothrow
{
immutable c = s[0];
uint char_length = 1;
if (c < 192)
{
Lend :
s = s.ptr[char_length .. s.length];
} else {
if (c < 192+32)
{
char_length = 2;
}
else if (c < 192+32+16)
{
char_length = 3;
}
else if (c < 192+32+16+8)
{
char_length = 4;
}
else if (c < 192+32+16+8+4)
{
char_length = 5;
}
else if (c < 192+32+16+8+4+2)
{
char_length = 6;
}

        char_length = char_length > s.length ? s.length : char_length;

goto Lend;
}
}
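For comparison, the table-lookup variant discussed upthread can be sketched roughly like this. This is illustrative only, not the actual pastebin/LUT code: the table layout (keyed by the top 5 bits of the lead byte) and the choice to skip invalid lead bytes one byte at a time are assumptions.

```d
// Sketch of the LUT approach: UTF-8 sequence length keyed by the top 5 bits
// of the lead byte. The table is only touched on the non-ASCII path, so it
// stays cold in ASCII-only text.
static immutable ubyte[32] utf8Stride = [
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xxxxxxx: ASCII
    1, 1, 1, 1, 1, 1, 1, 1,                         // 10xxxxxx: continuation
    2, 2, 2, 2,                                     // 110xxxxx: 2-byte lead
    3, 3,                                           // 1110xxxx: 3-byte lead
    4,                                              // 11110xxx: 4-byte lead
    1,                                              // 11111xxx: invalid
];

void popFrontLUT(ref char[] s) @trusted pure nothrow
{
    immutable c = s[0];
    if (c < 0x80)
        s = s.ptr[1 .. s.length]; // ASCII fast path: no table access
    else
    {
        import std.algorithm.comparison : min;
        s = s.ptr[min(utf8Stride[c >> 3], s.length) .. s.length];
    }
}
```

Keeping the table off the ASCII path is what makes it cheap in practice, matching Andrei's observation above that the table "stays cold in ASCII text".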


Re: Can you shrink it further?

2016-10-12 Thread Matthias Bentrup via Digitalmars-d

On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
wrote:

[...]


All three are slower than baseline for my test case.
What did you test it against?


The blns.txt file mentioned upthread.


Re: Any relation?

2016-10-12 Thread Seb via Digitalmars-d
On Tuesday, 11 October 2016 at 18:13:53 UTC, Andrei Alexandrescu 
wrote:

http://indianautosblog.com/2016/10/most-powerful-suzuki-swift-produces-350-hp-25
 -- Andrei


Then maybe this isn't photoshopped:

https://twitter.com/stamcd/status/742563964656062464


Re: Can you shrink it further?

2016-10-12 Thread Stefan Koch via Digitalmars-d
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
wrote:


Here are three branch-less variants that use the sign instead 
of the carry bit.


The last one is the fastest on my machine, although it mixes 
the rare error case and the common 1-byte case into one branch.


void popFront1(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {
s = s[1 .. $];
  } else if (c < -8) {
    uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);

import std.algorithm;
s = s[min(i, $) .. $];
  } else {
s = s[1 .. $];
  }
}

void popFront1a(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {
s = s[1 .. $];
  } else {
    uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));

import std.algorithm;
s = s[min(i, $) .. $];
  }
}

void popFront1b(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= -8) {
s = s[1 .. $];
  } else {
    uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);

import std.algorithm;
s = s[min(i, $) .. $];
  }
}


All three are slower than baseline for my test case.
What did you test it against?


Re: Can you shrink it further?

2016-10-12 Thread Matthias Bentrup via Digitalmars-d
On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu 
wrote:

On 10/11/2016 10:49 AM, Matthias Bentrup wrote:


void popFrontAsmIntel(ref char[] s) @trusted pure nothrow {
  immutable c = s[0];
  if (c < 0x80) {
s = s[1 .. $];
  } else {
uint l = void;
asm pure nothrow @nogc {
  mov EAX, 1;
  mov BL, 0xf8-1;
  sub BL, c;
  cmp BL, 0xf8-0xc0;
  adc EAX, 0;
  cmp BL, 0xf8-0xe0;
  adc EAX, 0;
  cmp BL, 0xf8-0xf0;
  adc EAX, 0;
  mov l, EAX;
}
s = s[l <= $ ? l : $ .. $];
  }
}


Did you take a look at the codegen on http://ldc.acomirei.ru? 
It's huge. -- Andrei


Here are three branch-less variants that use the sign instead of 
the carry bit.


The last one is the fastest on my machine, although it mixes the 
rare error case and the common 1-byte case into one branch.


void popFront1(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {
s = s[1 .. $];
  } else if (c < -8) {
uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
import std.algorithm;
s = s[min(i, $) .. $];
  } else {
s = s[1 .. $];
  }
}

void popFront1a(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {
s = s[1 .. $];
  } else {
    uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));

import std.algorithm;
s = s[min(i, $) .. $];
  }
}

void popFront1b(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= -8) {
s = s[1 .. $];
  } else {
uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
import std.algorithm;
s = s[min(i, $) .. $];
  }
}