> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
>
> Try now with 32 huges pages, to get 64 MBytes of working set.
>
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
>
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
>
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
>
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
>
So, I took your suggestion and modified my test module to allocate 32 huge
pages instead of a single 4K page. The module code and the results are below.
Contrary to your assertion above, the results came out the same as in my
first run. Each value is the total time in nanoseconds for 100000 iterations
of csum_partial(), as printed by the module; AVG is the mean per iteration.

base results:
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542
AVG=939 ns

prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501
AVG=919 ns

parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318
AVG=718 ns

both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705
AVG=693 ns

So based on these, it seems that your assertion that prefetching is the key
to the speedup here isn't quite correct. Either that or the testing continues
to be invalid. I'm going to try some of Ingo's microbenchmarking just to see
if that provides any further details. But any other thoughts about what might
be going awry are appreciated.

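As a cross-check on the AVG lines, here is a minimal userspace sketch that recomputes them from the raw totals. It assumes each value is the total for 100000 iterations and that AVG is the truncated integer mean per iteration; the helper name avg_ns_per_call is hypothetical and is not part of the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Iterations per run, matching the loop count in the test module. */
#define ITERATIONS 100000ULL

/* Raw per-run totals in nanoseconds, copied from the results above. */
static const uint64_t base_runs[10] = {
	80381491, 85279536, 99537729, 80398029, 121385411,
	109478429, 85369632, 99242786, 80250395, 98170542,
};
static const uint64_t prefetch_runs[10] = {
	86803812, 101891541, 85762713, 95866956, 102316712,
	93529111, 90473728, 79374183, 93744053, 90075501,
};
static const uint64_t parallel_runs[10] = {
	68994797, 63503221, 64298412, 63784256, 75350022,
	66398821, 77776050, 79158271, 91006098, 67822318,
};
static const uint64_t both_runs[10] = {
	68852213, 77536525, 63963560, 67255913, 76169867,
	80418081, 63485088, 62386262, 75533808, 57731705,
};

/*
 * Mean nanoseconds per iteration across several runs (hypothetical helper).
 * Uses integer division, so the result is truncated, not rounded.
 */
static uint64_t avg_ns_per_call(const uint64_t *totals, size_t runs)
{
	uint64_t sum = 0;
	size_t i;

	assert(runs > 0);
	for (i = 0; i < runs; i++)
		sum += totals[i];
	return sum / (runs * ITERATIONS);
}
```

Calling avg_ns_per_call(base_runs, 10) and likewise for the other three arrays gives the per-iteration figures to compare against the AVG lines.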
My module code:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

/* alloc_pages() order is in units of PAGE_SIZE: order 4 is 16 base pages */
#define BUFSIZ_ORDER 4
/* evaluates to 32 * 2 MB = 64 MB, the range the loop below walks */
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))

static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION\n");
		return -ENOMEM;
	}
	buf = page_address(page);

	printk(KERN_CRIT "INITIALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);

	/* checksum one PAGE_SIZE chunk per iteration, wrapping at BUFSIZ */
	for (i = 0; i < 100000; i++) {
		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0;
	}
	getnstimeofday(&end);
	preempt_enable();

	/* elapsed time, taking tv_sec into account as well as tv_nsec */
	time = timespec_to_ns(&end) - timespec_to_ns(&start);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;
}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
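The interval between two timespec samples depends on both tv_sec and tv_nsec, since a run can cross a second boundary. The following is a minimal userspace sketch of that arithmetic, not kernel code; the helper name elapsed_ns is hypothetical, and it assumes the end sample is not earlier than the start sample.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Nanoseconds elapsed from *start to *end (hypothetical helper).
 * The seconds difference is scaled and added to the nanoseconds
 * difference, so a negative tv_nsec difference is absorbed correctly
 * when the interval crosses a second boundary.
 */
static int64_t elapsed_ns(const struct timespec *start,
			  const struct timespec *end)
{
	int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
	int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;
	int64_t total = sec * NSEC_PER_SEC + nsec;

	/* Assumption stated above: end is not earlier than start. */
	assert(total >= 0);
	return total;
}
```

For example, a start of 10 s + 900000000 ns and an end of 11 s + 100000000 ns are 200000000 ns apart, even though the end tv_nsec is smaller than the start tv_nsec.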