On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > 
> > > So, early testing results today.  I wrote a test module that allocated a 4k
> > > buffer, initialized it with random data, and called csum_partial on it
> > > 100000 times, recording the time at the start and end of that loop.
> > > Results on a 2.4 GHz Intel Xeon processor:
> > > 
> > > Without patch: Average execute time for csum_partial was 808 ns
> > > With patch: Average execute time for csum_partial was 438 ns
> > 
> > Impressive, but could you try again with data out of cache ?
> 
> So I tried your patch on a GRE tunnel and got following results on a
> single TCP flow. (short result : no visible difference)
> 
> 

I tried to reproduce these results but could not, because the only network I
have available for testing with these devices at the moment is quite jittery.
Instead, I went back to measuring with the module I cobbled together, on the
assumption that it gives accurate, relatively jitter-free results (the module
code is attached below for reference).  My results show slightly different
behavior:

Base results runs:
89417240
85170397
85208407
89422794
91645494
103655144
86063791
75647774
83502921
85847372
AVG = 875 ns

Prefetch only runs:
70962849
77555099
81898170
68249290
72636538
83039294
78561494
83393369
85317556
79570951
AVG = 781 ns

Parallel addition only runs:
42024233
44313064
48304416
64762297
42994259
41811628
55654282
64892958
55125582
42456403
AVG = 510 ns


Both prefetch and parallel addition:
41329930
40689195
61106622
46332422
49398117
52525171
49517101
61311153
43691814
49043084
AVG = 494 ns


For reference, each of the large numbers above is the number of nanoseconds
taken to compute the checksum of a 4 KB buffer 100000 times.  To get my average
results, I ran the test 10 times, averaged the totals, and divided by
100000.
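
The arithmetic described above can be sketched in a few lines of userspace C.
This is only an illustration using the base-run figures from this message; the
names `base_runs_ns`, `CALLS_PER_RUN`, and `avg_ns_per_call` are made up for
the example and are not part of the test module.

```c
#include <stddef.h>
#include <stdint.h>

/* Each run times this many csum_partial calls (from the message). */
#define CALLS_PER_RUN 100000ULL

/* Base-run totals from the message: nanoseconds per 100000 calls. */
static const uint64_t base_runs_ns[] = {
    89417240, 85170397, 85208407, 89422794, 91645494,
    103655144, 86063791, 75647774, 83502921, 85847372,
};

/*
 * Mean nanoseconds per call across all runs, truncated to an integer.
 * Summing first and dividing once is equivalent to averaging the runs
 * and then dividing by the call count.
 */
static uint64_t avg_ns_per_call(const uint64_t *runs, size_t nruns)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nruns; i++)
        total += runs[i];

    return total / (nruns * CALLS_PER_RUN);
}
```

Calling `avg_ns_per_call(base_runs_ns, 10)` reproduces the 875 ns base figure
quoted above.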


Based on these, prefetching is obviously a good improvement, but not as good
as parallel execution, and the winner by far is doing both.
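
For readers unfamiliar with the idea, here is a rough userspace sketch of what
"parallel addition" can mean for a checksum: splitting the sum across
independent accumulators so the adds do not form a single dependency chain.
This is an assumption-laden illustration, not the code from the patch under
discussion; the function names `fold64`, `sum16_serial`, and
`sum16_two_chains` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* One accumulator: every add depends on the result of the previous add. */
static uint16_t sum16_serial(const uint16_t *w, size_t n)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += w[i];

    return fold64(acc);
}

/* Two accumulators: the two add chains are independent of each other. */
static uint16_t sum16_two_chains(const uint16_t *w, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        a += w[i];
        b += w[i + 1];
    }
    if (i < n)          /* odd word left over */
        a += w[i];

    return fold64(a + b);
}
```

Both functions return the same value for the same input; whether the second is
faster depends on the CPU and the compiler, which is exactly what the numbers
above are trying to measure for the real implementation.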

Thoughts?

Neil



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>         /* kmalloc, kfree */
#include <linux/random.h>       /* get_random_bytes */
#include <linux/time.h>         /* struct timespec, getnstimeofday */
#include <linux/preempt.h>      /* preempt_disable, preempt_enable */
#include <net/checksum.h>       /* csum_partial */

static char *buf;

static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        struct timespec start, end;
        u64 time;

        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

        if (!buf) {
                printk(KERN_CRIT "UNABLE TO ALLOCATE A BUFFER OF %lu bytes\n",
                       PAGE_SIZE);
                return -ENOMEM;
        }

        printk(KERN_CRIT "INITIALIZING BUFFER\n");
        get_random_bytes(buf, PAGE_SIZE);

        preempt_disable();
        printk(KERN_CRIT "STARTING ITERATIONS\n");
        getnstimeofday(&start);

        for (i = 0; i < 100000; i++)
                sum = csum_partial(buf, PAGE_SIZE, sum);
        getnstimeofday(&end);
        preempt_enable();
        /* Elapsed time must include tv_sec, not just tv_nsec. */
        time = timespec_to_ns(&end) - timespec_to_ns(&start);

        printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n",
               time);
        kfree(buf);
        return 0;


}

static void __exit csum_cleanup_module(void)
{
        return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
