Re: Any relation?
On Tuesday, 11 October 2016 at 18:13:53 UTC, Andrei Alexandrescu wrote: http://indianautosblog.com/2016/10/most-powerful-suzuki-swift-produces-350-hp-25 -- Andrei Good day, dear Mr. Alexandrescu. I have studied your new language a little. I tried to run a few small programs but had problems with the IDEs. Which IDE do you recommend for beginners? Thank you!
Re: Reducing the cost of autodecoding
On 10/12/2016 09:35 PM, Stefan Koch wrote: On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote: On 10/12/2016 08:41 PM, safety0ff wrote: On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote: It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.) Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei AVX for ASCII? What are you referring to? Most text processing is terribly incompatible with SIMD. SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread. Oh, OK, so it's that checksum in particular that got optimized. Bad benchmark! Bad! -- Andrei
Re: Reducing the cost of autodecoding
On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote: On 10/12/2016 08:41 PM, safety0ff wrote: On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote: It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.) Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei AVX for ASCII? What are you referring to? Most text processing is terribly incompatible with SIMD. SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread.
Re: Reducing the cost of autodecoding
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu wrote: On 10/12/2016 08:11 PM, Stefan Koch wrote: We should probably introduce a new module for stuff like this. object.d is already filled with too many unrelated things. Yah, shouldn't go in object.d as it's fairly niche. On the other hand, defining a new module for two functions seems excessive unless we have a good theme. On the third hand, we may find an existing module that's topically close. Thoughts? -- Andrei Maybe core.intrinsics? Or code.codelayout? We can control the layout at object-file level. We should be able to expose some of that functionality.
Re: Reducing the cost of autodecoding
On 10/12/2016 08:41 PM, safety0ff wrote: On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote: It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.) Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
Re: Reducing the cost of autodecoding
On 10/12/2016 08:11 PM, Stefan Koch wrote: We should probably introduce a new module for stuff like this. object.d is already filled with too much unrelated things. Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
Re: Reducing the cost of autodecoding
On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote: It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.) Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote: Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already? It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.)
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote: I think we should define two aliases "likely" and "unlikely" with default implementations: bool likely(bool b) { return b; } bool unlikely(bool b) { return b; } They'd go in druntime. Then implementers can hook them into their intrinsics. Works? Andrei I was about to suggest the same. I can prepare a PR. We should probably introduce a new module for stuff like this. object.d is already filled with too many unrelated things.
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote: I think we should define two aliases "likely" and "unlikely" with default implementations: bool likely(bool b) { return b; } bool unlikely(bool b) { return b; } They'd go in druntime. Then implementers can hook them into their intrinsics. Works? Andrei I was about to suggest the same. I can prepare a PR.
Re: Reducing the cost of autodecoding
On 10/12/2016 04:02 PM, safety0ff wrote: On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote: Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already? current autodecoding 0.915s (with new LUT popFront) copy-pasting std.utf decoding functions into current file 0.840s adding ASCII branch hints (llvm_expect) 0.770s With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body. I think we should define two aliases "likely" and "unlikely" with default implementations: bool likely(bool b) { return b; } bool unlikely(bool b) { return b; } They'd go in druntime. Then implementers can hook them into their intrinsics. Works? Andrei
Re: Any front end experts n da house?
On Wednesday, 12 October 2016 at 22:38:33 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 22:16:38 UTC, tsbockman wrote: Yes. The path to fix 259 is clear, and Lionello Lunesu and myself have already done most of the work. 14835 is a blocker due to the nature of the solution that Walter and Andrei approved (which I agree is the right one); an independent implementation would run into the same problem. Great news! Only if that blocker is dealt with - otherwise it's just wasted effort...
Re: Any front end experts n da house?
On Wednesday, 12 October 2016 at 22:16:38 UTC, tsbockman wrote: On Wednesday, 12 October 2016 at 16:36:32 UTC, Andrei Alexandrescu wrote: On 10/12/2016 12:31 PM, Stefan Koch wrote: I can take a look at 259. 14835 is nothing trivial though. My understanding is Thomas has an attack on 259 once a solution to 14835 is up. -- Andrei Yes. The path to fix 259 is clear, and Lionello Lunesu and myself have already done most of the work. 14835 is a blocker due to the nature of the solution that Walter and Andrei approved (which I agree is the right one); an independent implementation would run into the same problem. Great news!
Rust-style Ownership and Borrowing In D!
Hi! I've recently started a new job, and with that come new colleagues and new language discussions/wars ;) So there's this language Rust. And it provides some pretty amazing safety guarantees when it comes to memory management and algorithm correctness (preventing iterator invalidation), in both single-threaded and multi-threaded contexts (task-based parallelism). One of its brightest shining showcases is Rayon: https://github.com/nikomatsakis/rayon and more specifically https://github.com/nikomatsakis/rayon#using-join-for-recursive-divide-and-conquer-problems After having given this a lot of thought, I decided to explore whether we can implement some of Rust's great ideas in D. My conclusion so far is that we can convert some of Rust's compile-time checks to corresponding run-time checks in D, and use these to provide safe, scope-checked reference access to non-RC C++/Rust-style containers, with a one-word memory overhead alongside the container and range/slice. Here's my proof of concept so far: https://github.com/nordlow/phobos-next/blob/master/src/borrown.d To understand its usage, start by looking at the unittests at https://github.com/nordlow/phobos-next/blob/master/src/borrown.d#L161 Note that it is used as a wrapper on top of a non-copyable array container `Array` I've been developing over the last few months. Destroy!
Re: Any front end experts n da house?
On Wednesday, 12 October 2016 at 16:36:32 UTC, Andrei Alexandrescu wrote: On 10/12/2016 12:31 PM, Stefan Koch wrote: I can take a look at 259. 14835 is nothing trivial though. My understanding is Thomas has an attack on 259 once a solution to 14835 is up. -- Andrei Yes. The path to fix 259 is clear, and Lionello Lunesu and myself have already done most of the work. 14835 is a blocker due to the nature of the solution that Walter and Andrei approved (which I agree is the right one); an independent implementation would run into the same problem.
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote: where did you apply the branch hints ? Code: http://pastebin.com/CFCpUftW
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote: On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote: Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s current autodecoding 0.915s (with new LUT popFront) copy-pasting std.utf decoding functions into current file 0.840s adding ASCII branch hints (llvm_expect) 0.770s With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body. Where did you apply the branch hints?
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote: Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s current autodecoding 0.915s (with new LUT popFront) copy-pasting std.utf decoding functions into current file 0.840s adding ASCII branch hints (llvm_expect) 0.770s With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote: On my machine, with "ldc2 -release -O3 -enable-inlining" "-O3 -enable-inlining" is synonymous with "-O3" :-) With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know. cheers, Johan
Re: Can you shrink it further?
On 10/12/2016 01:05 PM, safety0ff wrote: On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote: [Snip] Didn't see the LUT implementation, nvm! Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote: [Snip] Didn't see the LUT implementation, nvm!
Re: Can you shrink it further?
My current favorites:

void popFront(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    if (c >= -2)
    {
        s = s.ptr[1 .. s.length];
    }
    else
    {
        import core.bitop;
        size_t i = 7u - bsr(~c);
        import std.algorithm;
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

I also experimented with explicit speculation:

void popFront(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    s = s.ptr[1 .. s.length];
    if (c < -2)
    {
        import core.bitop;
        size_t i = 6u - bsr(~c);
        import std.algorithm;
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

LDC and GDC both compile these to 23 instructions. DMD does worse than with my other code. You can influence GDC's block layout with __builtin_expect. I notice that many other snippets posted use uint instead of size_t in the multi-byte branch. This generates extra instructions for me.
Any front end experts n da house?
So it would be great to get the super annoying https://issues.dlang.org/show_bug.cgi?id=259 to a conclusion, and it seems the similarly annoying https://issues.dlang.org/show_bug.cgi?id=14835 is in the way. If anyone would like to look into the latter that would be great. Good regression testing (e.g. on dub projects) would be necessary. Andrei
Re: Any front end experts n da house?
On 10/12/2016 12:31 PM, Stefan Koch wrote: I can take a look at 259. 14835 is nothing trivial though. My understanding is Thomas has an attack on 259 once a solution to 14835 is up. -- Andrei
Re: Any front end experts n da house?
On Wednesday, 12 October 2016 at 16:27:05 UTC, Andrei Alexandrescu wrote: So it would be great to get the super annoying https://issues.dlang.org/show_bug.cgi?id=259 to a conclusion, and it seems the similarly annoying https://issues.dlang.org/show_bug.cgi?id=14835 is in the way. If anyone would like to look into the latter that would be great. Good regression testing (e.g. on dub projects) would be necessary. Andrei I can take a look at 259. 14835 is nothing trivial though.
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei Alexandrescu wrote: No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei Alright then PR: https://github.com/dlang/phobos/pull/4849
Re: Reducing the cost of autodecoding
On 10/12/2016 12:03 PM, Stefan Koch wrote: This will only work really efficiently with some state on the stack. Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko wrote: On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote: So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high. Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark: = import std.range; alias myPopFront = std.range.popFront; alias myFront = std.range.front; void main(string[] args) { import std.algorithm, std.array, std.stdio; char[] line = "0123456789".dup.repeat(50_000_000).join; For a fair test, the line should fit into the L1 cache. --Ilya
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote: So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high. Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark: = import std.range; alias myPopFront = std.range.popFront; alias myFront = std.range.front; void main(string[] args) { import std.algorithm, std.array, std.stdio; char[] line = "0123456789".dup.repeat(50_000_000).join; For a fair test, the line should fit into the L1 cache. --Ilya
Re: Reducing the cost of autodecoding
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote: So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high. Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark: = import std.range; alias myPopFront = std.range.popFront; alias myFront = std.range.front; void main(string[] args) { import std.algorithm, std.array, std.stdio; char[] line = "0123456789".dup.repeat(50_000_000).join; ulong checksum; if (args.length == 1) { while (line.length) { version(autodecode) { checksum += line.myFront; line.myPopFront; } else { checksum += line[0]; line = line[1 .. $]; } } version(autodecode) writeln("autodecode ", checksum); else writeln("bytes ", checksum); } else writeln("overhead"); } = On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding. Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap. Andrei If we are to support Unicode, this will only work really efficiently with some state on the stack.
Re: Can you shrink it further?
On 10/12/2016 10:39 AM, Stefan Koch wrote: On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote: On 10/12/2016 09:39 AM, Stefan Koch wrote: Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei I measured again. The table version DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block. No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote: On 10/12/2016 09:39 AM, Stefan Koch wrote: Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei I measured again. The table version DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block. I cannot currently measure on gdc. I'd appreciate more perf data.
Re: Can you shrink it further?
On 10/12/2016 09:39 AM, Stefan Koch wrote: On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote: In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei When I write the code so I can keep str = str[1 .. str.length], it leads to larger code overall and decreases performance, both in the table lookup and in the multi-branch code. Update: for ldc the table-lookup code is 27% faster than the current popFront. On dmd the table-lookup is 2% faster than the current popFront. For ldc the branching code is 23% faster than the current popFront. For dmd the branching code is 20% faster than the current popFront. Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote: On 10/12/2016 06:56 AM, Stefan Koch wrote: I just confirmed that the branching version is faster than table-lookup. Please test it out for yourself: http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though. Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei BTW, we could get rid of a sub if we changed slices from holding a pointer and a length to holding two pointers.
Reducing the cost of autodecoding
So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high. Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:
=
import std.range;
alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args)
{
    import std.algorithm, std.array, std.stdio;
    char[] line = "0123456789".dup.repeat(50_000_000).join;
    ulong checksum;
    if (args.length == 1)
    {
        while (line.length)
        {
            version(autodecode)
            {
                checksum += line.myFront;
                line.myPopFront;
            }
            else
            {
                checksum += line[0];
                line = line[1 .. $];
            }
        }
        version(autodecode) writeln("autodecode ", checksum);
        else writeln("bytes ", checksum);
    }
    else writeln("overhead");
}
=
On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding. Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap. Andrei
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote: In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei When I write the code so I can keep str = str[1 .. str.length], it leads to larger code overall and decreases performance, both in the table lookup and in the multi-branch code. Update: for ldc the table-lookup code is 27% faster than the current popFront. On dmd the table-lookup is 2% faster than the current popFront. For ldc the branching code is 23% faster than the current popFront. For dmd the branching code is 20% faster than the current popFront.
Re: Can you shrink it further?
On 10/12/2016 06:56 AM, Stefan Koch wrote: I just confirmed that the branching version is faster than table-lookup. Please test it out for yourself: http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though. Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Re: Can you shrink it further?
On 10/12/2016 05:23 AM, Stefan Koch wrote: All three are slower than baseline, for my test-case. What did you test it against? I'd say: (a) test for speed of ASCII-only text; (b) make it small. That's all we need. Nobody worries about 10-20% in multibyte-heavy text. -- Andrei
Re: Can you shrink it further?
On 10/12/2016 04:56 AM, Matthias Bentrup wrote: void popFront1b(ref char[] s) @trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= -8) { s = s[1 .. $]; } else { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } } This is really small although it does two jumps on the frequent path. Perhaps an improvement over the current implementation. Nice work! -- Andrei
Re: Can you shrink it further?
I just confirmed that the branching version is faster than table-lookup. Please test it out for yourself: http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though.
Re: Any relation?
Then maybe this isn't photoshopped: https://twitter.com/stamcd/status/742563964656062464 Why would it be? It's a screenshot from the Forza Motorsport game.
Re: New encryption block...
On Sunday, 9 October 2016 at 20:33:29 UTC, Era Scarecrow wrote: Something coming to mind is the idea of making a small algorithm to be used with other already existing encryption functions to extend the blocksize of encryption with minimal complexity growth. For fun I'm experimenting with this. So far it seems fairly promising, although I'm sure I'm writing it very insecurely. Maybe it would be better for random number generation rather than secure encryption? Not sure. Anyway: 16-bit replacement, extending to 64-bit via reordering, and 8 unique XORs between stages. Once I get the 576 block finished (1 for salt) I'll probably publish my ugly code for consideration and to be torn apart for security issues.
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup wrote: On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote: [...] All three are slower than baseline, for my test-case. What did you test it against. The blns.txt file mentioned upthread. And what were your timings? BTW, the code you posted would not be a proper replacement for the UTF-8 popFront, since our UTF-8 popFront handles deprecated sequences correctly, making it equivalent to this code:

void pfnew(ref char[] s) @trusted pure nothrow
{
    immutable c = s[0];
    uint char_length = 1;
    if (c < 192)
    {
    Lend:
        s = s.ptr[char_length .. s.length];
    }
    else
    {
        if (c < 192+32) { char_length = 2; }
        else if (c < 192+32+16) { char_length = 3; }
        else if (c < 192+32+16+8) { char_length = 4; }
        else if (c < 192+32+16+8+4) { char_length = 5; }
        else if (c < 192+32+16+8+4+2) { char_length = 6; }
        char_length = char_length > s.length ? s.length : char_length;
        goto Lend;
    }
}
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote: On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote: [...] All three are slower than baseline, for my test-case. What did you test it against? The blns.txt file mentioned upthread.
Re: Any relation?
On Tuesday, 11 October 2016 at 18:13:53 UTC, Andrei Alexandrescu wrote: http://indianautosblog.com/2016/10/most-powerful-suzuki-swift-produces-350-hp-25 -- Andrei Then maybe this isn't photoshopped: https://twitter.com/stamcd/status/742563964656062464
Re: Can you shrink it further?
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote: Here are three branch-less variants that use the sign instead of the carry bit. The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch.

void popFront1(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= 0)
    {
        s = s[1 .. $];
    }
    else if (c < -8)
    {
        uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
    else
    {
        s = s[1 .. $];
    }
}

void popFront1a(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= 0)
    {
        s = s[1 .. $];
    }
    else
    {
        uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
}

void popFront1b(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= -8)
    {
        s = s[1 .. $];
    }
    else
    {
        uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
}

All three are slower than baseline, for my test-case. What did you test it against?
Re: Can you shrink it further?
On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu wrote: On 10/11/2016 10:49 AM, Matthias Bentrup wrote:

void popFrontAsmIntel(ref char[] s) @trusted pure nothrow
{
    immutable c = s[0];
    if (c < 0x80)
    {
        s = s[1 .. $];
    }
    else
    {
        uint l = void;
        asm pure nothrow @nogc
        {
            mov EAX, 1;
            mov BL, 0xf8-1;
            sub BL, c;
            cmp BL, 0xf8-0xc0;
            adc EAX, 0;
            cmp BL, 0xf8-0xe0;
            adc EAX, 0;
            cmp BL, 0xf8-0xf0;
            adc EAX, 0;
            mov l, EAX;
        }
        s = s[l <= $ ? l : $ .. $];
    }
}

Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei Here are three branch-less variants that use the sign instead of the carry bit. The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch.

void popFront1(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= 0)
    {
        s = s[1 .. $];
    }
    else if (c < -8)
    {
        uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
    else
    {
        s = s[1 .. $];
    }
}

void popFront1a(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= 0)
    {
        s = s[1 .. $];
    }
    else
    {
        uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
}

void popFront1b(ref char[] s) @trusted pure nothrow
{
    immutable c = cast(byte)s[0];
    if (c >= -8)
    {
        s = s[1 .. $];
    }
    else
    {
        uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
        import std.algorithm;
        s = s[min(i, $) .. $];
    }
}