Hi Dominique,
> For the record, the following CSHIFT is still 4 times slower than the DO loop
I have looked into this a bit. The main reason is that, unlike cshift0
(the variant without an array as shift), we do not generate individual
functions to call for the usual data types. Instead, we use memcpy with
a size determined at run time by looking at the array descriptor. This
is, of course, quite slow.
So, the solution should probably be to generate functions like
cshift1_4_i4 and then call them. This would generate a bit of
bloat, but if people use this in a serious way, I think
this is OK.
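
To illustrate the difference, here is a minimal sketch, not the
libgfortran code. It shifts each row of a row-major matrix by its own
amount, once with a per-element memcpy whose size is only known at run
time, and once with a version typed for 4-byte integers. The names
shift_rows_generic, shift_rows_i4 and normalize_shift are made up for
this example, and the flat row-major layout is an assumption (real
array descriptors carry strides).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a possibly negative shift into [0, n).  */
static size_t
normalize_shift (ptrdiff_t shift, size_t n)
{
  ptrdiff_t m = shift % (ptrdiff_t) n;
  return (size_t) (m < 0 ? m + (ptrdiff_t) n : m);
}

/* Generic variant: the element size is a run-time value, so every
   element is moved with a memcpy of that size.  */
void
shift_rows_generic (void *dst, const void *src, size_t rows, size_t cols,
                    size_t elem_size, const ptrdiff_t *shift)
{
  char *d = dst;
  const char *s = src;

  assert (cols > 0);
  for (size_t r = 0; r < rows; r++)
    {
      size_t sh = normalize_shift (shift[r], cols);
      for (size_t c = 0; c < cols; c++)
        {
          size_t from = (c + sh) % cols;
          memcpy (d + (r * cols + c) * elem_size,
                  s + (r * cols + from) * elem_size, elem_size);
        }
    }
}

/* Typed variant: the element type is fixed at compile time, so the
   copies are plain assignments in two contiguous runs per row.  */
void
shift_rows_i4 (int32_t *dst, const int32_t *src, size_t rows, size_t cols,
               const ptrdiff_t *shift)
{
  assert (cols > 0);
  for (size_t r = 0; r < rows; r++)
    {
      size_t sh = normalize_shift (shift[r], cols);
      const int32_t *row = src + r * cols;
      int32_t *out = dst + r * cols;
      size_t tail = cols - sh;

      for (size_t c = 0; c < tail; c++)   /* elements sh .. cols-1 */
        out[c] = row[c + sh];
      for (size_t c = 0; c < sh; c++)     /* wrapped-around elements */
        out[tail + c] = row[c];
    }
}
```

Both functions produce the same result; the typed one simply gives the
compiler a fixed element size to work with.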
This was already done for cshift0 a few years ago. What
we have there (intrinsics/cshift0.c) looks like this:
  type_size = GFC_DTYPE_TYPE_SIZE (array);
  switch(type_size)
    {
    case GFC_DTYPE_LOGICAL_1:
    case GFC_DTYPE_INTEGER_1:
    case GFC_DTYPE_DERIVED_1:
      cshift0_i1 ((gfc_array_i1 *)ret, (gfc_array_i1 *) array, shift,
                  which);
      return;
    case GFC_DTYPE_LOGICAL_2:
    case GFC_DTYPE_INTEGER_2:
      cshift0_i2 ((gfc_array_i2 *)ret, (gfc_array_i2 *) array, shift,
                  which);
      return;
so this is something that we could also emulate.
A bit of work, but nothing that looks un-doable.
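
As a rough, self-contained sketch of that dispatch pattern (not the
actual library code): a macro stamps out one rotate function per
element type, and a front end switches on the run-time element size,
falling back to memcpy for sizes without a specialization. All names
here (DEFINE_ROTATE, rotate_i1/i2/i4, rotate_generic, rotate_dispatch,
wrap_shift) are hypothetical, and the sketch assumes contiguous,
suitably aligned 1-D data rather than a real array descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map any shift into [0, n).  */
static size_t
wrap_shift (ptrdiff_t shift, size_t n)
{
  ptrdiff_t m = shift % (ptrdiff_t) n;
  return (size_t) (m < 0 ? m + (ptrdiff_t) n : m);
}

/* Generate one typed circular shift per element type.  */
#define DEFINE_ROTATE(NAME, TYPE)                                 \
  static void                                                     \
  NAME (TYPE *dst, const TYPE *src, size_t n, ptrdiff_t shift)    \
  {                                                               \
    size_t sh = wrap_shift (shift, n);                            \
    for (size_t i = 0; i < n; i++)                                \
      dst[i] = src[(i + sh) % n];                                 \
  }

DEFINE_ROTATE (rotate_i1, int8_t)
DEFINE_ROTATE (rotate_i2, int16_t)
DEFINE_ROTATE (rotate_i4, int32_t)

/* Fallback for element sizes without a typed version.  */
static void
rotate_generic (char *dst, const char *src, size_t n, size_t elem_size,
                ptrdiff_t shift)
{
  size_t sh = wrap_shift (shift, n);
  for (size_t i = 0; i < n; i++)
    memcpy (dst + i * elem_size, src + ((i + sh) % n) * elem_size,
            elem_size);
}

/* Front end: pick a typed function from the run-time element size.  */
void
rotate_dispatch (void *dst, const void *src, size_t n, size_t elem_size,
                 ptrdiff_t shift)
{
  assert (n > 0);
  switch (elem_size)
    {
    case 1:
      rotate_i1 ((int8_t *) dst, (const int8_t *) src, n, shift);
      return;
    case 2:
      rotate_i2 ((int16_t *) dst, (const int16_t *) src, n, shift);
      return;
    case 4:
      rotate_i4 ((int32_t *) dst, (const int32_t *) src, n, shift);
      return;
    default:
      rotate_generic (dst, src, n, elem_size, shift);
      return;
    }
}
```

The switch is paid once per call, so its cost is negligible next to
the per-element work it selects.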
Regards
Thomas