Hi Dominique,

For the record, the following CSHIFT is still 4 times slower than the DO loop

I have looked into this a bit. The main reason is that, unlike for
cshift0 (the variant with a scalar shift rather than an array of
shifts), we do not generate individual functions to call for the
usual data types. Instead, we use memcpy with a size determined at
run time by looking at the array descriptor.  This is, of course,
quite slow.
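
To illustrate the difference, here is a minimal sketch (not the
libgfortran code; copy_generic and copy_i4 are hypothetical names made
up for this mail). The first loop only knows the element size at run
time, so every element goes through a variable-length memcpy. The second
is specialized for a 4-byte integer, so the compiler sees a plain typed
assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic path (hypothetical): the element size is only known at run
   time, so each element is moved by a memcpy of variable length.  */
void
copy_generic (void *dest, const void *src, size_t n, size_t size)
{
  char *d = dest;
  const char *s = src;

  assert (size > 0);
  for (size_t i = 0; i < n; i++)
    memcpy (d + i * size, s + i * size, size);
}

/* Typed path (hypothetical): the element type is fixed at compile
   time, so the loop body is a simple 4-byte load and store.  */
void
copy_i4 (int32_t *dest, const int32_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dest[i] = src[i];
}
```

Both functions produce the same result; the typed one simply gives the
compiler much more to work with.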

So, the solution should probably be to generate functions like
cshift1_4_i4 and then call them. This would generate a bit of
bloat, but if people use this in a serious way, I think
this is OK.

This was already done for cshift0 a few years ago. What
we have there looks like (intrinsics/cshift0.c)

  type_size = GFC_DTYPE_TYPE_SIZE (array);

  switch(type_size)
    {
    case GFC_DTYPE_LOGICAL_1:
    case GFC_DTYPE_INTEGER_1:
    case GFC_DTYPE_DERIVED_1:
      cshift0_i1 ((gfc_array_i1 *)ret, (gfc_array_i1 *) array, shift, which);
      return;

    case GFC_DTYPE_LOGICAL_2:
    case GFC_DTYPE_INTEGER_2:
      cshift0_i2 ((gfc_array_i2 *)ret, (gfc_array_i2 *) array, shift, which);
      return;

so this is something that we could also emulate.
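
A rough sketch of what the same idea could look like for the
array-shift case follows. It is heavily simplified and everything in it
is an assumption for illustration: the *_sketch names are hypothetical,
the array is taken to be contiguous, rank 2 and shifted along its first
dimension, and the dispatch is on the raw element size instead of the
descriptor's dtype. The real version would have to handle strides and
arbitrary rank.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduce a shift to the range [0, len).  Requires len > 0.  */
static size_t
normalize_shift (ptrdiff_t sh, size_t len)
{
  sh %= (ptrdiff_t) len;
  if (sh < 0)
    sh += (ptrdiff_t) len;
  return (size_t) sh;
}

/* Hypothetical typed worker: each of the ncols columns of length len
   is shifted circularly by its own shift[j].  */
void
cshift1_i4_sketch (int32_t *ret, const int32_t *array,
                   const ptrdiff_t *shift, size_t len, size_t ncols)
{
  if (len == 0)
    return;
  for (size_t j = 0; j < ncols; j++)
    {
      const int32_t *src = array + j * len;
      int32_t *dst = ret + j * len;
      size_t sh = normalize_shift (shift[j], len);

      for (size_t i = 0; i < len; i++)
        dst[i] = src[(i + sh) % len];
    }
}

/* Hypothetical generic fallback: same loop, but every element is
   copied by a memcpy whose size is only known at run time.  */
void
cshift1_generic_sketch (char *ret, const char *array,
                        const ptrdiff_t *shift, size_t len,
                        size_t ncols, size_t size)
{
  if (len == 0)
    return;
  for (size_t j = 0; j < ncols; j++)
    {
      const char *src = array + j * len * size;
      char *dst = ret + j * len * size;
      size_t sh = normalize_shift (shift[j], len);

      for (size_t i = 0; i < len; i++)
        memcpy (dst + i * size, src + ((i + sh) % len) * size, size);
    }
}

/* Hypothetical dispatcher.  Assumes the data is suitably aligned for
   the typed worker whenever the element size matches.  */
void
cshift1_sketch (void *ret, const void *array, const ptrdiff_t *shift,
                size_t len, size_t ncols, size_t size)
{
  switch (size)
    {
    case sizeof (int32_t):
      cshift1_i4_sketch (ret, array, shift, len, ncols);
      return;

    default:
      cshift1_generic_sketch (ret, array, shift, len, ncols, size);
    }
}
```

With a 3x2 array holding the columns (1,2,3) and (4,5,6) and shifts
(1,-1), this gives (2,3,1) and (6,4,5), which matches what CSHIFT
returns for those shifts.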

A bit of work, but nothing that looks un-doable.

Regards

        Thomas
