On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu.  Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
>
> AVG = 492 ns
>
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
>
> AVG = 488 ns
>
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
>
> AVG = 449 ns
>
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
>
> AVG = 438 ns
>
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).
>
> These results, while they've changed as we've modified the test case slightly,
> have remained consistent in their speedup ordinality.  Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
>
> Unless you see something else that I'm doing wrong here, it seems like a win to
> do both.

Well, I only said (or maybe I forgot to say) that on my machines I got
no improvement at all with the multiple alu or the prefetch (I tried
different strides). Only noise in the results.

It seems to depend on the cpu and/or several other factors.

Last machine I used for the tests had:

processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
stepping        : 2
microcode       : 0x13
cpu MHz         : 2800.256
cache size      : 12288 KB
physical id     : 1
siblings        : 12
core id         : 10
cpu cores       : 6