On Mon, Apr 4, 2011 at 7:12 PM, Taekyun Kim <podai...@gmail.com> wrote:
> I've done various experiments on PLD instruction.
> I removed cache preload in neon fast path functions and then benchmarked,
> there was no difference at performance.
> I tested some other neon functions (like memcpy) in similar way, but no
> difference at all.

I can try to explain the results from
http://lists.freedesktop.org/archives/pixman/2011-April/001156.html :)

First of all, the NEON unit in ARM Cortex-A8 was supposed to have direct
access to the L2 cache, as described in
http://www.arm.com/files/pdf/A8_Paper.pdf
You can check section "5.3 Non-blocking NEON loads" for more details.
Unfortunately, early revisions of Cortex-A8 (r1pX), such as the one used in
the Nokia N900, had a hardware bug which required this direct access to the
L2 cache to be disabled by setting the L1NEON bit in the Auxiliary Control
Register:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Bgbffjhh.html

That surely causes major differences when prefetch is not used.

But why is a simple memcpy even faster without any explicit prefetch? My
understanding is that this happens because NEON instructions are executed
with a significant delay relative to the ARM pipeline. The flow of NEON
instructions passes through the ARM pipeline, where all the memory addresses
for load/store instructions are resolved, and then the NEON instructions are
put into a separate long queue to be actually executed much later. So for a
simple copy, the processor can easily see lots of VLD1 instructions in the
queue, understand that we are reading really far ahead, and keep the memory
controller busy. In this case PLD instructions would just interfere
unnecessarily. But for more computationally intensive functions such as
bilinear scaling, the NEON queue is not flooded with that many VLD1
instructions far ahead (because we need to fill the queue with arithmetic
instructions too), so explicit prefetch is still needed.

Anyway, there are many Nokia N900 users around, and also users of similar
devices. Disabling prefetch for cases like simple copy, where more modern
Cortex-A8 processors do not strictly need it, would cause a serious
performance regression on older hardware.
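
To make the two variants being compared more concrete, here is a rough C
sketch. It is not pixman's NEON assembly, only an illustration under stated
assumptions: the function names, PREFETCH_DISTANCE and CACHE_LINE are
hypothetical, the 64-byte line size and 256-byte distance are untuned guesses,
and it relies on the GCC/Clang builtin __builtin_prefetch, which compilers
typically (not necessarily always) lower to a PLD instruction on ARM targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only, not measured or tuned. */
#define CACHE_LINE        64   /* assumed cache line size in bytes */
#define PREFETCH_DISTANCE 256  /* hypothetical look-ahead in bytes */

/* Plain copy loop: no explicit prefetch, the loads themselves are the
 * only thing that tells the memory system what is coming. */
void copy_plain(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Same copy, but with one explicit prefetch hint per assumed cache line,
 * issued PREFETCH_DISTANCE bytes ahead of the current read position. */
void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Only hint at addresses that are still inside the source buffer. */
        if ((i % CACHE_LINE) == 0 && i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(src + i + PREFETCH_DISTANCE, 0, 0);
        dst[i] = src[i];
    }
}
```

Both functions produce identical output; only the memory access hints differ.
Whether the second one is faster, slower, or the same depends on the core
revision and on how much arithmetic sits between the loads, which is exactly
the point of the explanation above.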

> As I know coretex-a8 have preload engine (maybe not according to different
> SoC integration??)

ARM Cortex-A8 does not have automatic hardware prefetch. My understanding is
that prefetch should be done explicitly, either by using PLD instructions or
by programming the PLE engine (something which is not normally accessible
from userspace, so we can forget about it).

> but PLD is just an hint to the HW.
> So it is implementation dependent, right?

PLD should work on all Cortex-A8 systems unless it has been disabled for
whatever reason (for example, to work around some bug).

-- 
Best regards,
Siarhei Siamashka