On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:
- default linux:
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
To see what is executed when you call memcpy() on a regular
GNU/Linux distro, you'd want to have a look at glibc instead. For
example, the AVX2 and AVX512 i
On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:
On 6/10/2018 9:44 PM, Patrick Schluter wrote:
See what Agner Fog has to say about it:
Thanks. Agner Fog gets the last word on this topic!
Well, Agner is rarely wrong indeed, but there is a limit to how
much material a single pers
On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:
I understand that you actually need to reimplement memcpy, as
in your microcontroller use case you don't want to have any C
runtime. So you'll basically have to rewrite the C runtime
parts D depends on.
However, I think for memcpy
https://github.com/dlang/druntime/pull/2213
On 6/11/2018 11:17 AM, Guillaume Piolat wrote:
I don't know if someone really wrote this code, or if it was all from
intrinsics.
memcpy is so critical to success it is likely written by Intel itself to ensure
every drop of perf is wrung out of the CPU.
If I were Intel CEO, I'd direct the CPU har
On Mon, 11 Jun 2018 10:54:23 +, Mike Franklin wrote:
> On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
>> On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
>>
I think there might also be optimization opportunities using
templates, metaprogramming, and type
BTW, the way memcpy is (was?) implemented in the C runtime that
ships with the Intel C++ compiler was really enlightening about the
sheer difficulty of such a task.
First of all there isn't one loop but many depending on the
source and destination alignment.
- If both are aligned on 16-byte boundarie
On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:
No, __doPostblit is necessary -- you are making a copy.
example:
File[] fs = new File[5];
fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;
At this point, fs points at a separate block from fs2. If you did not do
postblit on this,
On 6/11/18 4:00 AM, Walter Bright wrote:
(I notice it is doing __doPostblit(). This looks wrong, D allows data to
be moved. As far as I can tell with a perfunctory examination, that's
the only "can throw" bit.)
No, __doPostblit is necessary -- you are making a copy.
example:
File[] fs = new
On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
I think there might also be optimization opportunities using
templates, metaprogramming, and type introspection, that are
not currently possible with the current design.
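A minimal sketch of the kind of compile-time specialization being described, assuming a plain-old-data type whose size is known when the template is instantiated. The name `copyPod` and the 16-byte threshold are hypothetical and chosen only for illustration; this is not druntime's implementation, and whether it beats the C runtime's memcpy would have to be measured.

```d
import core.stdc.string : memcpy;

// Hypothetical helper for illustration only; not a druntime API.
// The 16-byte threshold is an arbitrary assumption, not a tuned value.
void copyPod(T)(ref T dst, ref T src) @trusted nothrow @nogc
if (__traits(isPOD, T))
{
    static if (T.sizeof <= 16)
    {
        // T.sizeof is a compile-time constant, so this loop has a fixed
        // trip count that an optimizer is free to unroll.
        auto d = cast(ubyte*) &dst;
        auto s = cast(const(ubyte)*) &src;
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
    else
    {
        // Larger types fall back to the C runtime's memcpy.
        memcpy(&dst, &src, T.sizeof);
    }
}

struct Small { int a; int b; }      // 8 bytes: takes the byte-loop path
struct Big   { ubyte[64] data; }    // 64 bytes: takes the memcpy path

void main()
{
    auto s1 = Small(1, 2);
    Small s2;
    copyPod(s2, s1);
    assert(s2 == s1);

    Big b1;
    Big b2;
    b1.data[] = 7;
    copyPod(b2, b1);
    assert(b2.data == b1.data);
}
```

The `static if` branch is selected per instantiation, so each type pays only for the path it actually uses; an `extern(C)` runtime hook that receives a size at run time cannot make that choice ahead of time.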
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
I think there might also be optimization opportunities using
templates, metaprogramming, and type introspection, that are
not currently possible with the current design.
Just making it a template doesn't automatically enable any of
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
We have no design for this function that doesn't rely on the
GC, and the GC needs TypeInfo. This function is not usable with
betterC with or without the TypeInfo argument.
I understand that. I was using `_d_arraysetlengthT` as an
On 6/11/2018 1:12 AM, Mike Franklin wrote:
On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:
Making it a template is not really necessary. The compiler knows if there is
the possibility of it throwing based on the type, it doesn't need to infer it.
There are other reasons to make
On Monday, 11 June 2018 at 08:05:14 UTC, Walter Bright wrote:
On 6/10/2018 8:34 PM, Basile B. wrote:
- default win32 OMF:
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
I think you mean:
https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Cool! and it's e
On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:
Making it a template is not really necessary. The compiler
knows if there is the possibility of it throwing based on the
type, it doesn't need to infer it.
There are other reasons to make it a template, though. For
example, if it
On 6/10/2018 8:34 PM, Basile B. wrote:
- default win32 OMF:
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
I think you mean:
https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
On 6/10/2018 8:43 PM, Mike Franklin wrote:
That only addresses the @safe attribute, and that code is much too complex for
anyone to audit it and certify it as safe.
Exceptions are also not all handled, so there is no way it can pass as nothrow.
The runtime call needs to be replaced with a temp
On 6/10/2018 9:44 PM, Patrick Schluter wrote:
See what Agner Fog has to say about it:
Thanks. Agner Fog gets the last word on this topic!
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:
On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:
[...]
- default win32 OMF:
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux:
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs
Still, the ASM version is much faster.
rep movsd is very CPU dependent
On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:
On 6/10/2018 7:49 PM, Mike Franklin wrote:
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
If the compiler can't get it right then who can?
The compiler implementation is faulty. It rewrites the
expressions to an `extern(C)
On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:
I've modified the test based on the feedback so far, so here's
what it looks like now:
import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;
enum length = 4096 * 2;
void
On 6/10/2018 7:49 PM, Mike Franklin wrote:
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
If the compiler can't get it right then who can?
The compiler implementation is faulty. It rewrites the expressions to an
`extern(C)` runtime implementation that is not @safe, nothrow, or pure:
On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote:
The compiler implementation is faulty. It rewrites the
expressions to an `extern(C)` runtime implementation that is
not @safe, nothrow, or pure:
https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/r
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
There are many reasons to do this, one of which is to leverage
information available at compile-time and in D's type system
(type sizes, alignment, etc...) in order to optimize
On 06/10/2018 08:01 PM, Walter Bright wrote:
On 6/10/2018 4:39 PM, David Nadlinger wrote:
That's not entirely true. Intel started optimising some of the REP
string instructions again on Ivy Bridge and above. There is a CPUID
bit to indicate that (ERMS?); I'm sure the Optimization Manual has
fu
I've modified the test based on the feedback so far, so here's
what it looks like now:
import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;
enum length = 4096 * 2;
void init(ref ubyte[] a)
{
a.length = length;
for(int i = 0
On 6/10/2018 4:39 PM, David Nadlinger wrote:
That's not entirely true. Intel started optimising some of the REP string
instructions again on Ivy Bridge and above. There is a CPUID bit to indicate
that (ERMS?); I'm sure the Optimization Manual has further details. From what I
remember, `rep movs
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one
can draw from this is that memcpyD is the slowest,
Probably because it does a memory allocation.
Of course; that was alre
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one
can draw from this is that memcpyD is the slowest,
Probably because it does a memory allocation.
followed by the ASM imp
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
void memcpyASM()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}
Pr
On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one can draw from
this is that memcpyD is the slowest,
Probably because it does a memory allocation.
followed by the ASM implementation.
The CPU makers abandoned optimizing the REP inst
On 6/10/2018 6:45 AM, Mike Franklin wrote:
void memcpyD()
{
    dst = src.dup;
}
Note that .dup is doing a GC memory allocation.
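A small sketch of the difference, assuming two dynamic arrays of equal length (the buffer size mirrors the `4096 * 2` used in the test elsewhere in this thread). It only demonstrates that `.dup` rebinds the slice to a newly allocated block while a slice assignment reuses the existing storage; it makes no timing claim.

```d
void main()
{
    enum length = 4096 * 2;

    ubyte[] src = new ubyte[](length);
    ubyte[] dst = new ubyte[](length);
    src[] = 42;

    // Remember where dst's storage lives before copying.
    auto before = dst.ptr;

    // Slice assignment copies element-wise into the existing block.
    dst[] = src[];
    assert(dst.ptr is before);
    assert(dst == src);

    // .dup allocates a fresh GC block and rebinds dst to it,
    // so the measurement would include the allocation.
    dst = src.dup;
    assert(dst.ptr !is before);
    assert(dst.ptr !is src.ptr);
    assert(dst == src);
}
```

A benchmark that wants to time only the copy would therefore compare against `dst[] = src[]` rather than `dst = src.dup`.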
On 6/10/2018 5:49 AM, Mike Franklin wrote:
[...]
One source of entropy in the results is src and dst being global variables.
Global variables in D are in TLS, and TLS access can be complex (many
instructions) and is influenced by the -fPIC switch. Worse, global variable
access is not optimiz
Don't C implementations already do 90% of what you want? I
thought most compilers know about and optimize these methods
based on context. I thought they were *special* in the eyes of
the compiler already. I think you are fighting a battle pitting
40 years of tweaking against you...
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm not experienced with this kind of programming, so I'm
doubting these results. Have I done something wrong? Am I
overlooking something?
You've just discovered the fact that one can rarely be careful
enough with what is benchma
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
There are many reasons to do this, one of which is to leverage
information available at compile-time and in D's type system
(type sizes, alignment, etc...) in order to optimize the
implementation of these functions, and allow them to
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
arr1[] = arr2[]; // the compiler makes this memcpy, the
optimizer can further do its magic
void memcpyD()
{
    dst = src.dup;
}
void memcpyD2()
{
    dst[] = src[];
}
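For comparison, here is a hedged sketch of how the slice copy and the C runtime's memcpy might be timed side by side with Phobos' `benchmark`. The function name `memcpyC`, the iteration count, and the use of `__gshared` buffers (to sidestep the TLS access raised elsewhere in this thread) are assumptions for illustration. No timings are shown because they depend on the CPU, compiler, and flags.

```d
import core.stdc.string : memcpy;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

enum length = 4096 * 2;

// __gshared keeps the buffers out of thread-local storage (an assumption
// made here so that TLS lookups do not add noise to the measurement).
__gshared ubyte[] src;
__gshared ubyte[] dst;

// Hypothetical baseline: call the C runtime's memcpy directly.
void memcpyC()
{
    memcpy(dst.ptr, src.ptr, length);
}

// Slice copy into already-allocated storage; no allocation is timed.
void memcpyD2()
{
    dst[] = src[];
}

void main()
{
    src = new ubyte[](length);
    dst = new ubyte[](length);
    foreach (i, ref b; src)
        b = cast(ubyte) i;

    // Run each function 10_000 times; one Duration per function is returned.
    auto results = benchmark!(memcpyC, memcpyD2)(10_000);
    writefln("memcpyC:  %s", results[0]);
    writefln("memcpyD2: %s", results[1]);

    assert(dst == src);
}
```

An optimizing compiler may still hoist or elide work in such a tight loop, so the numbers are best treated as a rough comparison.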
On 11/06/2018 1:45 AM, Mike Franklin wrote:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can
further do its magic
void memcpyD()
{
dst = src.dup;
malloc (for slice not static array)
}
void memcpyD2()
{
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
arr1[] = arr2[]; // the compiler makes this memcpy, the
optimizer can further do its magic
void memcpyD()
{
    dst = src.dup;
}
void memcpyD2()
{
    dst[] = src[];
}
-
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:
Please make one that guarantees the usage of the corresponding
backend intrinsic, for example on LLVM.
I tested with ldc and got similar results. I thought the
implementation in C forwarded to the backend intrinsic. I think
ev
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
And D already has it built in as well for @safe etc:
arr1[] = arr2[]; // the compiler makes this memcpy, the
optimizer can further do its magic
so be sure to check against that too.
My intent is to use the D implementation in the
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm exploring the possibility of implementing some of the
basic software building blocks (memcpy, memcmp, memmove,
etc...) that D utilizes from the C library with D
imp
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
D utilizes from the C library with D implementations. There
are many reasons to do this, one of which is to leverage
information available at compile-time and in D's type system
(type sizes, alignment, etc...) in order to optimize t
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm not experienced with this kind of programming, so I'm
doubting these results. Have I done something wrong? Am I
overlooking something?
Hi,
I've spent a lot of time optimizing memcpy. One of the results was
that on Intel ICC
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm exploring the possibility of implementing some of the basic
software building blocks (memcpy, memcmp, memmove, etc...) that
D utilizes from the C library with D implementations. There
are many reasons to do this, one of which is
I'm exploring the possibility of implementing some of the basic
software building blocks (memcpy, memcmp, memmove, etc...) that D
utilizes from the C library with D implementations. There are
many reasons to do this, one of which is to leverage information
available at compile-time and in D's