Re: [MeeGo-dev] NEON memcpy?

leonid.moiseichuk Tue, 25 Jan 2011 01:06:56 -0800

We have tested version in our libc and Juha could share it.

With best wishes,
Leonid



-----Original Message-----
From: Thiago Macieira [mailto:thi...@kde.org] 
Sent: 24 January, 2011 20:04
To: meego-dev@meego.com
Cc: Carsten Munk; Moiseichuk Leonid (Nokia-MS/Helsinki); Kant Jarmo 
(Nokia-MS/Helsinki); Kallioinen Juha (Nokia-MS/Helsinki)
Subject: Re: [MeeGo-dev] NEON memcpy?

Em segunda-feira, 24 de janeiro de 2011, às 12:57:32, Carsten Munk escreveu:
> Hi,
> 
> Do we have a sane and performing NEON memcpy that would be suitable 
> for MeeGo glibc version anywhere? Would be useful for glibc armv7nhl 
> variant.

Shouldn't be very hard to implement one:

NOTE: NOT TESTED!

void *my_memcpy(void *dest, void *src, long n) {
    const int stride_bytes = 16;
    uint8_t *d = dest;
    uint8_t *s = src;
    {
        /* main copy, Neon vectorised */
        long vector_len = n / stride_bytes;
        uint8_t *end = s + vector_len * stride_bytes;
        n -= vector_len * stride_bytes;

        while (s != end) {
#ifdef __CC_ARM
            vst1q_u8(d, vld1q_u8(s));
            d += stride_bytes;
            s += stride_bytes;
#else
            /*
             * Assembly equivalent:
             */
            asm ("vld1.8  {d0, d1}, [%[s]]!\n"
                 "vst1.8  {d0, d1}, [%[d]]!\n"
                 : [s] "+r" (s), [d] "+r" (d)
                 : /* no inputs */
                 : "d0", "d1");
#endif
        }
    }

    if (stride_bytes > 8 && n >= 8) {
        /* one last 8-byte step */
        n -= 8;
#ifdef __CC_ARM
        vst1_u8(d, vld1_u8(s));
        d += 8;
        s += 8;
#else
        /*
         * Assembly equivalent:
         */
        asm ("vld1.8  {d0}, [%[s]]!\n"
             "vst1.8  {d0}, [%[d]]!\n"
             : [s] "+r" (s), [d] "+r" (d)
             : /* no inputs */
             : "d0");
#endif
    }

    /* residue */
    switch (n) {
        case 7: *d++ = *s++;
        case 6: *d++ = *s++;
        case 5: *d++ = *s++;
        case 4: *d++ = *s++;
        case 3: *d++ = *s++;
        case 2: *d++ = *s++;
        case 1: *d++ = *s++;
    }

    return dest;
}

You can modify the above code:
        stride          load/store
        8                       vld1_u8 / vst1_u8
        16                      vld1q_u8 / vst1q_u8
        24                      vld3_u8 / vst3_u8
        48                      vld3q_u8 / vst3q_u8

Given that the 24- and 48-byte strides require a division by 3, I recommend 
sticking to the 8- or 16-byte stride versions.

The GCC versions are written in inline assembly because all current versions of 
GCC spill the Neon registers to memory with the intrinsics.

Comparing the assembly generated by both GCC and RVCT indicates that the math 
portion of the function and the transition from 16-byte to 8-byte stride seem 
to be better with RVCT, but the handling of the switch at the function epilogue 
seems better with GCC 4.5.

Each of the case statements in GCC is:
        ldrb    r3, [r4], #1    @ zero_extendqisi2
        strb    r3, [ip], #1

whereas RVCT produces:
        LDRB     r4,[r1],#1
        ADD      r2,r3,#1
        STRB     r4,[r3,#0]
        MOV      r3,r2

That is, the same first instruction, but it uses two additional instructions to 
update the "d" variable, instead of doing the inline post-update.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Senior Product Manager - Nokia, Qt Development Frameworks
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
_______________________________________________
MeeGo-dev mailing list
MeeGo-dev@meego.com
http://lists.meego.com/listinfo/meego-dev

Re: [MeeGo-dev] NEON memcpy?

Reply via email to