Em segunda-feira, 24 de janeiro de 2011, às 12:57:32, Carsten Munk escreveu:
> Hi,
>
> Do we have a sane and performing NEON memcpy that would be suitable
> for MeeGo glibc version anywhere? Would be useful for glibc armv7nhl
> variant.

Shouldn't be very hard to implement one:

NOTE: NOT TESTED!

void *my_memcpy(void *dest, void *src, long n)
{
    const int stride_bytes = 16;
    uint8_t *d = dest;
    uint8_t *s = src;
    {
        /* main copy, Neon vectorised */
        long vector_len = n / stride_bytes;
        uint8_t *end = s + vector_len * stride_bytes;
        n -= vector_len * stride_bytes;

        while (s != end) {
#ifdef __CC_ARM
            vst1q_u8(d, vld1q_u8(s));
            d += stride_bytes;
            s += stride_bytes;
#else
            /*
             * Assembly equivalent:
             */
            asm ("vld1.8  {d0, d1}, [%[s]]!\n"
                 "vst1.8  {d0, d1}, [%[d]]!\n"
                 : [s] "+r" (s), [d] "+r" (d)
                 : /* no inputs */
                 : "d0", "d1");
#endif
        }
    }

    if (stride_bytes > 8 && n >= 8) {
        /* one last 8-byte step */
        n -= 8;
#ifdef __CC_ARM
        vst1_u8(d, vld1_u8(s));
        d += 8;
        s += 8;
#else
        /*
         * Assembly equivalent:
         */
        asm ("vld1.8  {d0}, [%[s]]!\n"
             "vst1.8  {d0}, [%[d]]!\n"
             : [s] "+r" (s), [d] "+r" (d)
             : /* no inputs */
             : "d0");
#endif
    }

    /* residue */
    switch (n) {
        case 7: *d++ = *s++;
        case 6: *d++ = *s++;
        case 5: *d++ = *s++;
        case 4: *d++ = *s++;
        case 3: *d++ = *s++;
        case 2: *d++ = *s++;
        case 1: *d++ = *s++;
    }

    return dest;
}

You can modify the above code:
        stride          load/store
        8                       vld1_u8 / vst1_u8
        16                      vld1q_u8 / vst1q_u8
        24                      vld3_u8 / vst3_u8
        48                      vld3q_u8 / vst3q_u8

Given that the 24- and 48-byte strides require a division by 3, I recommend
sticking to the 8- or 16-byte stride versions.

The GCC versions are written in inline assembly because all current versions
of GCC spill the Neon registers to memory with the intrinsics.

Comparing the assembly generated by both GCC and RVCT indicates that the math
portion of the function and the transition from 16-byte to 8-byte stride seem
to be better with RVCT, but the handling of the switch at the function
epilogue seems better with GCC 4.5.

Each of the case statements in GCC is:
        ldrb    r3, [r4], #1    @ zero_extendqisi2
        strb    r3, [ip], #1

whereas RVCT produces:
        LDRB     r4,[r1],#1
        ADD      r2,r3,#1
        STRB     r4,[r3,#0]
        MOV      r3,r2

That is, the same first instruction, but it uses two additional instructions to
update the "d" variable, instead of doing the inline post-update.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Senior Product Manager - Nokia, Qt Development Frameworks
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358

Attachment: signature.asc
Description: This is a digitally signed message part.

_______________________________________________
MeeGo-dev mailing list
MeeGo-dev@meego.com
http://lists.meego.com/listinfo/meego-dev

Reply via email to