While doing some string processing I've seen some unusual timings compared to
the C code, so I have written this to see the situation better.
When USE_MEMCPY is false this little benchmark runs about 3+ times slower:
import std.c.stdlib: malloc;
import std.c.string: memcpy;
import std.stdio: writ
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
> While doing some string processing I've seen some unusual timings
> compared to the C code, so I have written this to see the situation
> better. When USE_MEMCPY is false this little benchmark runs about 3+
> times slower:
I did a little ben
Moritz Warning:
> I don't see a very big difference between slice copying and memcpy (but
> between compilers).
I have taken the times again. My timings, best of 4:
USE_MEMCPY = true:  1.33 s
USE_MEMCPY = false: 4.28 s
I have used dmd 1.041, with Phobos, on WinXP 32 bit, 2 GB RAM, CPU Core 2 at 2
GHz.
This may be anothe
On Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>
>> While doing some string processing I've seen some unusual timings
>> compared to the C code, so I have written this to see the situation
>> better. When USE_MEMCPY is false this
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>
>> While doing some string processing I've seen some unusual timings
>> compared to the C code, so I have written this to see the situation
>> better. When USE_MEMCPY is false th
For reference here's a simple C version:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define N 1
#define L 6

char h[L] = "hello\n";

int main() {
    char *ptr;
    if (N <= 100)
        ptr = malloc(N * L + 1); // the +1 is for the final printing
    else
        ptr = ma
Sun, 15 Mar 2009 10:31:10 -0400, bearophile wrote:
> The ASM of the inner loop:
>
> L:      movl    _h, %eax
>         movl    %eax, (%edx)
>         movzwl  _h+4, %eax
>         movw    %ax, 4(%edx)
>         addl    $6, %edx
>         cmpl    %ecx, %edx
>         jne     L
Obviously, a memcpy intrinsic is at work here. DMD
Sergey Gromov:
> Obviously, a memcpy intrinsic is at work here.<
Yes, gcc is able to recognize some calls to C library functions and replace
them with intrinsics.
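As an illustration of that builtin handling: gcc and clang treat memcpy itself as a builtin, so a call with a small constant length typically compiles to plain loads and stores rather than a call into libc (observable with -S, or disabled with -fno-builtin). A sketch:

```c
#include <stdint.h>
#include <string.h>

/* With a constant 4-byte length the compiler typically expands these memcpy
   calls into single 32-bit loads/stores; no call into libc is emitted.
   This is also the portable way to do unaligned access and type punning. */
uint32_t load_u32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

void store_u32(unsigned char *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```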
I think LDC too uses an intrinsic to copy memory of a slice.
This isn't a very interesting benchmark; there's nothing much
intere
Sergey Gromov wrote:
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
While doing some string processing I've seen some unusual timings
compared to the C code, so I have written this to see the situation
better. When USE_MEMCPY
Don:
>Which means that memcpy probably isn't anywhere near optimal, either.<
Some time ago I read an article by AMD showing that with modern CPUs there
are indeed ways to go much faster, using vector asm instructions,
loop unrolling and explicit cache prefetching (but it's useful with
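Two of those techniques, unrolling and prefetching, can be sketched portably. This is my illustration, not the AMD code; __builtin_prefetch is a gcc/clang extension, and a real fast memcpy would also handle alignment and use non-temporal stores:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 64 at a time, prefetching ahead; tail handled by plain memcpy. */
void *copy_unrolled(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 64) {
        __builtin_prefetch(s + 256);  /* hint: pull in a line we'll need soon */
        for (int i = 0; i < 64; i += 8) {
            uint64_t w;
            memcpy(&w, s + i, 8);     /* 8-byte chunks: compiled to plain loads */
            memcpy(d + i, &w, 8);     /* ...and plain stores, no libc calls */
        }
        d += 64; s += 64; n -= 64;
    }
    memcpy(d, s, n);                  /* remaining tail, 0..63 bytes */
    return dst;
}
```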
On Mon, Mar 16, 2009 at 8:43 AM, bearophile wrote:
> Don:
>>Which means that memcpy probably isn't anywhere near optimal, either.<
>
> Some time ago I read an article by AMD showing that with modern CPUs there
> are indeed ways to go much faster, using vector asm instructions,
> loop
Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
> Sergey Gromov wrote:
>> Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
>>
>>> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>>>
While doing some string processing I've seen some unusual timings
compared to the C code, s
Sergey Gromov wrote:
Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
Sergey Gromov wrote:
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
While doing some string processing I've seen some unusual timings
compared to the C code,
Don wrote:
Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's
six bytes -- it's in the asm. Blimey. It should just be doing that as a
direct sequence of loads and stores, for anything up to at least 8 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or
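Spelled out, a 6-byte copy done "as a direct sequence of loads and stores" is one 32-bit plus one 16-bit move, exactly the movl/movw pair in the asm quoted earlier. A C rendering of that idea (constant-size memcpy compiles to the same moves, so no actual calls are emitted):

```c
#include <stdint.h>
#include <string.h>

/* Copy exactly 6 bytes as a 4-byte plus a 2-byte move, mirroring the
   movl / movzwl+movw pair in the asm listing. */
void copy6(void *dst, const void *src) {
    uint32_t lo;
    uint16_t hi;
    memcpy(&lo, src, 4);                    /* movl   _h, %eax     */
    memcpy(&hi, (const char *)src + 4, 2);  /* movzwl _h+4, %eax   */
    memcpy(dst, &lo, 4);                    /* movl   %eax, (%edx) */
    memcpy((char *)dst + 4, &hi, 2);        /* movw   %ax, 4(%edx) */
}
```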
Hello Jarrett,
I'm actually kind of shocked that given the prevalence of memory block
copy operations that more CPUs haven't implemented it as a basic
instruction. Yes, RISC is nice, but geez, this seems like a
no-brainer.
How about memory-to-memory DMA? Why even make the CPU wait for it to
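For what it's worth, x86 has carried a block-copy instruction since the 8086: rep movsb. A guarded sketch (the inline asm is gcc-style and x86-only, with a portable memcpy fallback; whether it's actually fast depends on the microarchitecture):

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes with the x86 string-move instruction where available. */
void *copy_rep_movsb(void *dst, const void *src, size_t n) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* rep movsb copies ECX/RCX bytes from [ESI/RSI] to [EDI/RDI]. */
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);  /* portable fallback for other architectures */
#endif
    return dst;
}
```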
Walter Bright wrote:
Don wrote:
Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's
six bytes -- it's in the asm. Blimey. It should just be doing that as
a direct sequence of loads and stores, for anything up to at least 8
bytes.
The compiler will replace it with a simple
On Mon, Mar 16, 2009 at 3:29 PM, BCS wrote:
>> I'm actually kind of shocked that given the prevalence of memory block
>> copy operations that more CPUs haven't implemented it as a basic
>> instruction. Yes, RISC is nice, but geez, this seems like a
>> no-brainer.
>>
>
> How about memory to memory
BCS wrote:
Hello Jarrett,
I'm actually kind of shocked that given the prevalence of memory block
copy operations that more CPUs haven't implemented it as a basic
instruction. Yes, RISC is nice, but geez, this seems like a
no-brainer.
How about memory-to-memory DMA? Why even make the CPU wai
Hello Don,
BCS wrote:
Hello Jarrett,
I'm actually kind of shocked that given the prevalence of memory
block copy operations that more CPUs haven't implemented it as a
basic instruction. Yes, RISC is nice, but geez, this seems like a
no-brainer.
How about memory-to-memory DMA? Why even mak
Hello Jarrett,
On Mon, Mar 16, 2009 at 3:29 PM, BCS wrote:
I'm actually kind of shocked that given the prevalence of memory
block copy operations that more CPUs haven't implemented it as a
basic instruction. Yes, RISC is nice, but geez, this seems like a
no-brainer.
How about memory to mem
bearophile wrote:
Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
Some time ago I read an article by AMD showing that with modern CPUs there are
indeed ways to go much faster, using vector asm instructions, loop unrolling
and explicit cache prefetching
Christopher Wright wrote:
bearophile wrote:
Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
Some time ago I read an article by AMD showing that with modern CPUs
there are indeed ways to go much faster, using vector asm
instructions, loop unrolling and
This has been discussed before, to no avail.
http://d.puremagic.com/issues/show_bug.cgi?id=2313
L.
Mon, 16 Mar 2009 11:36:50 -0700, Walter Bright wrote:
> Don wrote:
>> Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's
>> six bytes -- it's in the asm. Blimey. It should just be doing that as a
>> direct sequence of loads and stores, for anything up to at least 8 bytes.
>