On Sun, Feb 28, 2016 at 11:15 AM, Tom Herbert <t...@herbertland.com> wrote:
> On Sun, Feb 28, 2016 at 10:56 AM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>>> +{
>>>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>>> +           "clc\n\t"
>>>> +           "jmpq *%%r11\n\t"
>>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>>> +           "nop\n\t"
>>>> +           "40: adcq $0,%[res]"
>>>> +           : [res] "=r" (sum)
>>>> +           : [src] "r" (buff),
>>>> +             [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>>> +           : "r11");
>>>> +
>>>
>>> With this patch I cannot mix/match different length checksums without
>>> things failing.  In perf the jmpq in the loop above seems to be set to
>>> a fixed value, so perhaps it is something in how the compiler is
>>> interpreting the inline assembler.
>>
>> The perf thing was a red herring.  Turns out the code is working
>> correctly there.
>>
>> I actually found the root cause.  The problem is in add32_with_carry3.
>>
> Thanks for the follow-up.  Btw, are you trying to build csum_partial in
> userspace for testing, or was this all in kernel?
It was in the kernel.  I have been doing some user space work, but all of
the problems I was having were in the kernel.  My guess is that the
original sum value wasn't being used.

- Alex
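For what it's worth, the adcq chain in the quoted asm can be sketched portably in C for userspace testing, which also makes the suspected bug easy to check: if the initial sum is dropped (as guessed for add32_with_carry3), the second assertion below fails.  This is only an illustrative sketch of the technique, not the kernel's csum_partial; the function name and layout are made up here.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Portable sketch of the unrolled adcq chain: add len/8 64-bit words
 * into a running sum, folding each carry back in (the end-around
 * carry that "adcq ... / adcq $0" implements in the asm above).
 * The initial 'sum' argument must contribute to the result, which is
 * exactly what the matching "[res]" input constraint is there for. */
static uint64_t sum64_words(const unsigned char *buff, size_t len,
                            uint64_t sum)
{
    size_t i;

    for (i = 0; i < len / 8; i++) {
        uint64_t w, prev = sum;

        memcpy(&w, buff + i * 8, sizeof(w));  /* unaligned-safe load */
        sum += w;
        if (sum < prev)   /* carry out of bit 63... */
            sum += 1;     /* ...wraps around (end-around carry) */
    }
    return sum;
}
```

A zero buffer sums to zero, so any nonzero result with a nonzero initial sum must come from that initial value being carried through; a build where the input constraint is broken would return 0 instead.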