On 14 March 2011 07:28, Jon Nettleton <jon.nettle...@gmail.com> wrote:
> It has been a fun and fulfilling weekend tracking down performance
> regressions when using DRI on the XO 1.5 platform. One place I was
> looking at specifically was blitting solids from userspace to the
> kernel. I found that copy_from_user was really chewing up a
> significant amount of cpu. Looking further I found that for VIA C7
> processors these two options, X86_INTEL_USERCOPY and
> X86_USE_PPRO_CHECKSUM, are not enabled.
>
> I have not done thorough testing on it, but after patching the kernel
> with the attached patch my gtkperf run dropped from 63 seconds down
> to 50 seconds.
Wow! Nice work! But I think your gains must have come from elsewhere.
Please correct me if you spot something that I've missed.

Firstly, X86_INTEL_USERCOPY. The function that decides when the
alternative codepath enabled by this config option gets used is
__movsl_is_ok() in arch/x86/lib/usercopy_32.c. This function has to
return 0 for the interesting new functions, such as
__copy_user_zeroing_intel(), to be called. The only way that this
function can return 0 is:

    if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask))
            return 0;

This "return 0" will never happen on our platform, because
movsl_mask.mask is always 0. (On other platforms it is initialized in
Intel-specific code in arch/x86/kernel/cpu/intel.c.) So
X86_INTEL_USERCOPY doesn't seem to have any effect for us.

I tried hacking movsl_mask.mask to 7, like Intel, and it resulted in a
0.9% speedup in copy_to_user (possibly just noise) when doing
unaligned writes. (It's important to realise that the codepaths
enabled by this option are only an optimization for unaligned
accesses; well-aligned accesses are unchanged.)

X86_USE_PPRO_CHECKSUM looks like it is worth doing. It results in
csum_partial() calls speeding up by a factor of 1.5. But I'd be
surprised if this is having any direct impact on your graphics work
(this checksum-calculating function seems to have only a handful of
users outside of networking), unless your new DRI code is calling it
directly?

Your performance gains are very exciting, but I think they must have
resulted from something else. My benchmark code is here (including a
hack to enable the unaligned write optimization when
X86_INTEL_USERCOPY is set):
http://dev.laptop.org/~dsd/20110314/benchmark-copy_from_user-and-csum_partial.patch

Daniel
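
P.S. For anyone following along, the check in question looks roughly
like this. This is a paraphrase of __movsl_is_ok() from
arch/x86/lib/usercopy_32.c as I remember it, so treat it as a sketch
rather than a verbatim quote:

    /*
     * Sketch of the decision function in arch/x86/lib/usercopy_32.c.
     * Returning 1 means "plain rep movsl is fine"; 0 sends us into
     * the special Intel copy routines. The 0 case only triggers for
     * large copies whose source and destination disagree within the
     * mask bits.
     */
    static inline int __movsl_is_ok(unsigned long a1, unsigned long a2,
                                    unsigned long n)
    {
    #ifdef CONFIG_X86_INTEL_USERCOPY
            /* mask stays 0 unless CPU setup code sets it (Intel uses 7) */
            if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask))
                    return 0;       /* take __copy_user_*_intel() paths */
    #endif
            return 1;
    }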
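
The Intel-specific initialization I mentioned sits in the Intel CPU
setup code in arch/x86/kernel/cpu/intel.c and goes something like the
below (again from memory, so details may be slightly off; the point is
that only the Intel setup path ever sets the mask, while the
VIA/Centaur setup path leaves it at 0):

    #ifdef CONFIG_X86_INTEL_USERCOPY
            /* Set the preferred alignment for movsl bulk memory moves */
            switch (c->x86) {
            case 6:         /* PII/PIII like movsl with 8-byte alignment */
                    movsl_mask.mask = 7;
                    break;
            case 15:        /* P4 is OK down to 8-byte alignment */
                    movsl_mask.mask = 7;
                    break;
            }
    #endif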
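
And if anyone wants to reproduce the csum_partial() numbers without
digging through my patch, a throwaway module along these lines should
do it. This is a minimal sketch, not my actual benchmark; the buffer
size, iteration count and module boilerplate are arbitrary:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/timex.h>        /* get_cycles() */
    #include <net/checksum.h>       /* csum_partial() */

    #define BUF_LEN 4096
    #define ITERS   100000

    static int __init csum_bench_init(void)
    {
            void *buf = kmalloc(BUF_LEN, GFP_KERNEL);
            cycles_t start, end;
            __wsum sum = 0;
            int i;

            if (!buf)
                    return -ENOMEM;
            memset(buf, 0x5a, BUF_LEN);

            start = get_cycles();
            for (i = 0; i < ITERS; i++)
                    sum = csum_partial(buf, BUF_LEN, sum);
            end = get_cycles();

            /* report total cycles; avoids 64-bit division on i386 */
            pr_info("csum_partial: %llu cycles for %d x %d bytes (sum=%x)\n",
                    (unsigned long long)(end - start), ITERS, BUF_LEN,
                    (unsigned int)sum);
            kfree(buf);
            return 0;
    }

    static void __exit csum_bench_exit(void)
    {
    }

    module_init(csum_bench_init);
    module_exit(csum_bench_exit);
    MODULE_LICENSE("GPL");

Load it once with the generic csum_partial and once with
X86_USE_PPRO_CHECKSUM enabled and compare the cycle counts; that is
roughly where my factor-of-1.5 figure comes from.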