We have tested version in our libc and Juha could share it. With best wishes, Leonid
-----Original Message----- From: Thiago Macieira [mailto:thi...@kde.org] Sent: 24 January, 2011 20:04 To: meego-dev@meego.com Cc: Carsten Munk; Moiseichuk Leonid (Nokia-MS/Helsinki); Kant Jarmo (Nokia-MS/Helsinki); Kallioinen Juha (Nokia-MS/Helsinki) Subject: Re: [MeeGo-dev] NEON memcpy? Em segunda-feira, 24 de janeiro de 2011, às 12:57:32, Carsten Munk escreveu: > Hi, > > Do we have a sane and performing NEON memcpy that would be suitable > for MeeGo glibc version anywhere? Would be useful for glibc armv7nhl > variant. Shouldn't be very hard to implement one: NOTE: NOT TESTED! void *my_memcpy(void *dest, void *src, long n) { const int stride_bytes = 16; uint8_t *d = dest; uint8_t *s = src; { /* main copy, Neon vectorised */ long vector_len = n / stride_bytes; uint8_t *end = s + vector_len * stride_bytes; n -= vector_len * stride_bytes; while (s != end) { #ifdef __CC_ARM vst1q_u8(d, vld1q_u8(s)); d += stride_bytes; s += stride_bytes; #else /* * Assembly equivalent: */ asm ("vld1.8 {d0, d1}, [%[s]]!\n" "vst1.8 {d0, d1}, [%[d]]!\n" : [s] "+r" (s), [d] "+r" (d) : /* no inputs */ : "d0", "d1"); #endif } } if (stride_bytes > 8 && n >= 8) { /* one last 8-byte step */ n -= 8; #ifdef __CC_ARM vst1_u8(d, vld1_u8(s)); d += 8; s += 8; #else /* * Assembly equivalent: */ asm ("vld1.8 {d0}, [%[s]]!\n" "vst1.8 {d0}, [%[d]]!\n" : [s] "+r" (s), [d] "+r" (d) : /* no inputs */ : "d0"); #endif } /* residue */ switch (n) { case 7: *d++ = *s++; case 6: *d++ = *s++; case 5: *d++ = *s++; case 4: *d++ = *s++; case 3: *d++ = *s++; case 2: *d++ = *s++; case 1: *d++ = *s++; } return dest; } You can modify the above code: stride load/store 8 vld1_u8 / vst1_u8 16 vld1q_u8 / vst1q_u8 24 vld3_u8 / vst3_u8 48 vld3q_u8 / vst3q_u8 Given that the 24- and 48-byte strides require a division by 3, I recommend sticking to the 8- or 16-byte stride versions. The GCC versions are written in inline assembly because all current versions of GCC spill the Neon registers to memory with the intrinsics. Comparing the assembly generated by both GCC and RVCT indicates that the math portion of the function and the transition from 16-byte to 8-byte stride seem to be better with RVCT, but the handling of the switch at the function epilogue seems better with GCC 4.5. Each of the case statements in GCC is: ldrb r3, [r4], #1 @ zero_extendqisi2 strb r3, [ip], #1 whereas RVCT produces: LDRB r4,[r1],#1 ADD r2,r3,#1 STRB r4,[r3,#0] MOV r3,r2 That is, the same first instruction, but it uses two additional instructions to update the "d" variable, instead of doing the inline post-update. -- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Senior Product Manager - Nokia, Qt Development Frameworks PGP/GPG: 0x6EF45358; fingerprint: E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358 _______________________________________________ MeeGo-dev mailing list MeeGo-dev@meego.com http://lists.meego.com/listinfo/meego-dev