Re: [lwip-users] TCP Checksum = 0xFFFF
Hi Guys,

Thanks for the replies.

@Bill - I agree, it's unlikely that someone else would not have found such a bug given how widely used lwIP is. My first assumption is always that I've made an error somewhere :-) but there is no harm in asking the question while I search for the answer. I should have mentioned that I have LWIP_CHECKSUM_ON_COPY = 1 and CHECKSUM_CHECK_TCP = 1. So my code takes the #if TCP_CHECKSUM_ON_COPY path and hence doesn't call inet_chksum_pseudo(); see below for more.

@Simon - I'll apply the patch and re-test, but I can see from a debug run that that bit of code is not being executed in my implementation.

If you saw my follow-up email, you will have noticed that I identified the code that is causing my problem. It is the line of code at line 1146 in tcp_out.c (note I have LWIP_CHECKSUM_ON_COPY = 1):

    acc += (u16_t)~(seg->chksum);

acc is a one's complement checksum obtained from a call to inet_chksum_pseudo_partial() and seg->chksum is a checksum of the payload. What is happening is that occasionally during operation acc ends up with a value of M and seg->chksum has, by coincidence, a value of M as well. Then M + (~M) always gives 0xFFFF.

Why hasn't it been seen by others before? As I'm sure you are aware (I have just been reading up on it!) some checksum checkers might accept 0xFFFF as a valid checksum depending on how they validate the checksum (recalculate and compare to the inserted checksum OR calculate with the checksum value included and check that the result is 0). On Windows 7 in my application it seems it recalculates and compares the checksum and expects 0x0000 (Wireshark does too!). This combination of lwIP options and checksum validation method might explain why others may not have seen this error before now?

Mathematically speaking, using one's complement maths, ~(sum(a+b+c+d)) is not the same as [(~sum(a+b)) + (~sum(c+d))] for the special corner case where sum(a+b) = ~sum(c+d). In this special case the answer will be 0xFFFF instead of 0x0000. Which is what is happening in my case!

Example (using 4-bit numbers for simplicity):

    let a = 1, b = 2, c = 4, d = 8
    checksum = ~sum(a+b+c+d) = ~(0xF) = 0x0
    sum(a+b) = 3
    sum(c+d) = 0xC
    calculated by code = [(~sum(a+b)) + (~sum(c+d))] = [~(3) + ~(0xC)] = [0xC + 3] = 0xF

QED!?

I'm more convinced that this is a coding issue in lwIP that doesn't handle this special corner case, but am happy to be proved wrong!

Regards,
Niall.

On 14 May 2014 06:25, Simon Goldschmidt goldsi...@gmx.de wrote:

> Bill Auerbach wrote:
>> From an empirical standpoint, lwIP is used in far too many places for there to be this significant of a bug. I’d look for a compiler bug or some other issue. I seriously doubt it’s a bug in lwIP. Some of my company’s users run our systems 24/7 sending lots of data through lwIP and I’d hear about it really fast if there was this kind of a TCP lockup.
>
> I'm flattered by your opinion but I fear this does not prevent lwIP from having bugs :-)
>
> In this case, I think I fixed a bug in git master not too long ago (#36153), here is the change, maybe it fixes things for you:
>
>     @@ -658,6 +662,10 @@ tcp_write(struct tcp_pcb *pcb, const void *arg, u16_t len, u8_t apiflags)
>            last_unsent->len += concat_p->tot_len;
>      #if TCP_CHECKSUM_ON_COPY
>            if (concat_chksummed) {
>     +        /*if concat checksumm swapped - swap it back */
>     +        if (concat_chksum_swapped){
>     +          concat_chksum = SWAP_BYTES_IN_WORD(concat_chksum);
>     +        }
>              tcp_seg_add_chksum(concat_chksum, concat_chksummed, &last_unsent->chksum,
>                &last_unsent->chksum_swapped);
>              last_unsent->flags |= TF_SEG_DATA_CHECKSUMMED;
>
> Simon

___
lwip-users mailing list
lwip-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/lwip-users
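The corner case in Niall's 4-bit example can be reproduced at 16 bits. The following is only a minimal sketch: `fold16`, `chksum_split` and `chksum_whole` are hypothetical helper names, not lwIP functions, and `chksum_split` merely mimics the shape of the quoted tcp_out.c line under the assumption that both inputs are plain (un-complemented) partial sums.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
uint16_t fold16(uint32_t acc)
{
    while (acc >> 16) {
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* Complement each partial sum, then add them: the order of operations
 * discussed in the thread (hypothetical stand-in, not lwIP code). */
uint16_t chksum_split(uint16_t hdr_sum, uint16_t payload_sum)
{
    uint32_t acc = (uint16_t)~hdr_sum;
    acc += (uint16_t)~payload_sum;
    return fold16(acc);
}

/* Add the partial sums first, then complement once at the end. */
uint16_t chksum_whole(uint16_t hdr_sum, uint16_t payload_sum)
{
    return (uint16_t)~fold16((uint32_t)hdr_sum + payload_sum);
}

void corner_case_demo(void)
{
    /* Ordinary values: both orders agree. */
    assert(chksum_split(0x1234, 0x0001) == chksum_whole(0x1234, 0x0001));
    /* hdr_sum == ~payload_sum: the two orders diverge. */
    assert(chksum_whole(0x716C, 0x8E93) == 0x0000);
    assert(chksum_split(0x716C, 0x8E93) == 0xFFFF);
}
```

The values 0x716C and 0x8E93 are chosen so that the complemented header part equals 0x8E93, the figure mentioned elsewhere in the thread. Both results represent zero in one's complement arithmetic, which is why only a receiver that compares bit for bit would notice.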
Re: [lwip-users] TCP Checksum = 0xFFFF
Quick check: try changing the checksum algorithm with LWIP_CHKSUM_ALGORITHM. This might point to a possible bug in the algorithm.
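For reference, the switch suggested above would be a one-line change in lwipopts.h. This is a hedged fragment only: the thread states that the original setup used algorithm 3, and the value 2 below is just an assumed alternative to try.

```c
/* lwipopts.h (fragment) */
/* The reporter's build used 3; 2 is an assumed alternative for comparison. */
#define LWIP_CHKSUM_ALGORITHM 2
```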
Re: [lwip-users] TCP Checksum = 0xFFFF
Niall Donovan nfdono...@gmail.com wrote:

> If you saw my follow up email, you will notice that I identified the code that is causing my problem. It is caused by the line of code at line 1146 in tcp_out.c (Note I have LWIP_CHECKSUM_ON_COPY = 1)
>
>     acc += (u16_t)~(seg->chksum);
>
> acc is a one's complement checksum obtained from a call to inet_chksum_pseudo_partial() and seg->chksum is a checksum of the payload. What is happening is that occasionally during operation acc is resulting in a value of M and seg->chksum has, by coincidence, a value of M. Then M + (~M) always gives 0xFFFF.
> [...]
> Mathematically speaking using one's complement maths, ~(sum(a+b+c+d)) is not the same as [(~sum(a+b)) + (~sum(c+d))] for the special corner case where sum(a+b) = ~sum(c+d). In this special case the answer will be 0xFFFF instead of 0x0000. Which is what is happening in my case!

Your analysis is correct.

As for the likelihood of stumbling into just the right numbers, I recently had an interesting problem. I have a diskless machine that has been running off NFS. For ages. After an NFS server upgrade the diskless machine would reliably wedge quite early in the boot process. Eventually I tracked it down to a hardware checksum bug in the ethernet of the diskless machine.

In UDP the checksum value 0 means no checksum, and if the datagram data actually has checksum 0, it's replaced with 0xFFFF, which gives the same result if verification is done properly (computing the sum with the checksum field included). Apparently the hardware checksum used the recompute-and-compare method instead, so it flagged such valid UDP datagrams as having a bad checksum.

The hardware bug has always been there and I've never seen it. Then, when the NFS server change altered the numerology of the NFS handles just right, the bug was triggered reliably by some particular NFS response datagram.

So don't underestimate luck as a factor in system stability :)

-uwe
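The two verification methods contrasted above, and the UDP substitution of 0xFFFF for a computed 0, can be sketched in C. This is an illustration under assumptions, not code from lwIP or from any NIC: `ones_sum`, `verify_by_sum`, `verify_by_compare` and `udp_wire_chksum` are hypothetical names, and the data is treated as already split into 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of n 16-bit words, folded with end-around carry. */
uint16_t ones_sum(const uint16_t *words, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += words[i];
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* Method 1: sum the data together with the received checksum.
 * A folded result of 0xFFFF (complement 0) means the packet is valid. */
int verify_by_sum(const uint16_t *words, size_t n, uint16_t rx_chksum)
{
    uint32_t acc = (uint32_t)ones_sum(words, n) + rx_chksum;
    acc = (acc & 0xFFFFu) + (acc >> 16);
    return (uint16_t)acc == 0xFFFFu;
}

/* Method 2: recompute the checksum and compare it bit for bit. */
int verify_by_compare(const uint16_t *words, size_t n, uint16_t rx_chksum)
{
    return (uint16_t)~ones_sum(words, n) == rx_chksum;
}

/* UDP: 0 on the wire means "no checksum", so a computed 0 is sent as 0xFFFF. */
uint16_t udp_wire_chksum(uint16_t computed)
{
    return computed == 0 ? 0xFFFFu : computed;
}

void verify_demo(void)
{
    const uint16_t data[] = { 0x8E93, 0x716C };  /* sums to 0xFFFF */
    uint16_t wire = udp_wire_chksum((uint16_t)~ones_sum(data, 2));

    assert(wire == 0xFFFF);
    assert(verify_by_sum(data, 2, wire));        /* accepted */
    assert(!verify_by_compare(data, 2, wire));   /* rejected */
    assert(verify_by_sum(data, 2, 0x0000));      /* 0x0000 passes both */
    assert(verify_by_compare(data, 2, 0x0000));
}
```

Method 1 accepts both 0x0000 and 0xFFFF because they are the two representations of zero in one's complement arithmetic; method 2 accepts only the exact bit pattern it recomputed.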
Re: [lwip-users] TCP Checksum = 0xFFFF
To follow up on this issue. The offending piece of code that is generating the 0xFFFF as the TCP checksum is this (lines 1137-1147 in tcp_out.c):

    /* rebuild TCP header checksum (TCP header changes for retransmissions!) */
    acc = inet_chksum_pseudo_partial(seg->p, &(pcb->local_ip),
             &(pcb->remote_ip),
             IP_PROTO_TCP, seg->p->tot_len, TCPH_HDRLEN(seg->tcphdr) * 4);
    /* add payload checksum */
    if (seg->chksum_swapped) {
      seg->chksum = SWAP_BYTES_IN_WORD(seg->chksum);
      seg->chksum_swapped = 0;
    }
    acc += (u16_t)~(seg->chksum);
    seg->tcphdr->chksum = FOLD_U32T(acc);

If acc happens to have a value equal to seg->chksum, for example acc=0x8E93 and seg->chksum=0x8E93 (which can happen given the right set of values), one gets acc += ~acc, which always results in 0xFFFF. The FOLD_U32T has nothing to do and the 0xFFFF is set as the seg->tcphdr->chksum. Which is wrong? Am I missing something? Anyone able to enlighten me as to why this isn't a coding error? Shouldn't there be something to ensure 0xFFFF is converted to 0x0000?!

Thanks.
Niall.

On 13 May 2014 13:17, Niall Donovan nfdono...@gmail.com wrote:

> Hi All,
>
> I'd appreciate some help on my problem. I occasionally have seen my TCP socket connection hang and when I captured the fault on Wireshark I could see, on the packet causing the hang, that the calculated TCP checksum value was 0xFFFF, which Wireshark indicated was incorrect. Wireshark says it should be 0x0000. It also helpfully pointed to RFC1624 for further information.
>
> The socket hangs because the recipient of the packet (Win 7 PC) sees a checksum error, discards the packet and resends its previous packet. lwIP sends a duplicate ACK then resends (and keeps sending) the offending packet, with the same erroneous checksum. Hence my ping-pong type link gets stuck.
>
> I don't modify the packet content after handing it to lwIP and my MAC device driver simply copies the packet from pbuf(s) to a tx buffer verbatim. I depend on lwIP to calculate the checksum and CRC. I've attached the offending packet in a pcap file.
>
> I hand calculated the checksum and the one's complement sum is 0xFFFF, hence the one's complement of that is 0x0000. Why is lwIP inserting a checksum of 0xFFFF? It should have inserted 0x0000, right?
>
> Is this a known issue? I didn't see any mention of it in the mail archives. If this is a known issue, hopefully someone can point me in the right direction for a fix/workaround so I don't have to debug and/or re-code the checksum code of lwIP!! While I'm awaiting a reply I'll start that process...
>
> FYI: I am using lwIP 1.4.1 and have LWIP_CHKSUM_ALGORITHM = 3 in lwipopts.h
>
> Thanks for your time
> Regards
> Niall.
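On the question of whether 0xFFFF should be turned into 0x0000: one possible way to avoid the corner case is to add the un-complemented parts and complement only once at the end. The sketch below is an assumption-laden illustration, not the actual lwIP fix. `fold16` and `combine_chksum` are hypothetical names, and it assumes that the header value arrives already complemented (as acc is described in this thread) while the payload value is a plain sum.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
uint16_t fold16(uint32_t acc)
{
    while (acc >> 16) {
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* hdr_chksum:  complemented checksum over pseudo-header and TCP header
 *              (assumption: the form acc has in the thread).
 * payload_sum: un-complemented one's complement sum of the payload
 *              (assumption about what seg->chksum holds). */
uint16_t combine_chksum(uint16_t hdr_chksum, uint16_t payload_sum)
{
    uint32_t acc = (uint16_t)~hdr_chksum;  /* undo the earlier complement */
    acc += payload_sum;                    /* add the plain sums */
    return (uint16_t)~fold16(acc);         /* complement once, at the end */
}

void combine_demo(void)
{
    /* The values from the thread now yield 0x0000 instead of 0xFFFF. */
    assert(combine_chksum(0x8E93, 0x8E93) == 0x0000);
    /* An ordinary case is unaffected. */
    assert(combine_chksum(0xEDCB, 0x0001) == 0xEDCA);
}
```

With this ordering the result for the reported values matches what Wireshark and the Windows 7 receiver are said to expect.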
Re: [lwip-users] TCP Checksum = 0xFFFF
Niall,

This final line in tcp_out sets the checksum and includes the ~ in the call:

    seg->tcphdr->chksum = inet_chksum_pseudo(seg->p, &(pcb->local_ip),
             &(pcb->remote_ip),
             IP_PROTO_TCP, seg->p->tot_len);

From an empirical standpoint, lwIP is used in far too many places for there to be this significant of a bug. I’d look for a compiler bug or some other issue. I seriously doubt it’s a bug in lwIP. Some of my company’s users run our systems 24/7 sending lots of data through lwIP and I’d hear about it really fast if there was this kind of a TCP lockup.

Regards,
Bill

From: lwip-users-bounces+bauerbach=arrayonline@nongnu.org [mailto:lwip-users-bounces+bauerbach=arrayonline@nongnu.org] On Behalf Of Niall Donovan
Sent: Tuesday, May 13, 2014 11:09 AM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] TCP Checksum = 0xFFFF

> To follow up on this issue. The offending piece of code that is generating the 0xFFFF as the TCP checksum is this (lines 1137-1147 in tcp_out.c):
>
>     /* rebuild TCP header checksum (TCP header changes for retransmissions!) */
>     acc = inet_chksum_pseudo_partial(seg->p, &(pcb->local_ip),
>              &(pcb->remote_ip),
>              IP_PROTO_TCP, seg->p->tot_len, TCPH_HDRLEN(seg->tcphdr) * 4);
>     /* add payload checksum */
>     if (seg->chksum_swapped) {
>       seg->chksum = SWAP_BYTES_IN_WORD(seg->chksum);
>       seg->chksum_swapped = 0;
>     }
>     acc += (u16_t)~(seg->chksum);
>     seg->tcphdr->chksum = FOLD_U32T(acc);
>
> If acc happens to have a value equal to seg->chksum, for example acc=0x8E93 and seg->chksum=0x8E93 (which can happen given the right set of values), one gets acc += ~acc, which always results in 0xFFFF. The FOLD_U32T has nothing to do and the 0xFFFF is set as the seg->tcphdr->chksum. Which is wrong? Am I missing something? Anyone able to enlighten me as to why this isn't a coding error? Shouldn't there be something to ensure 0xFFFF is converted to 0x0000?!
>
> Thanks.
> Niall.
>
> On 13 May 2014 13:17, Niall Donovan nfdono...@gmail.com wrote:
>
>> Hi All,
>>
>> I'd appreciate some help on my problem. I occasionally have seen my TCP socket connection hang and when I captured the fault on Wireshark I could see, on the packet causing the hang, that the calculated TCP checksum value was 0xFFFF, which Wireshark indicated was incorrect. Wireshark says it should be 0x0000. It also helpfully pointed to RFC1624 for further information.
>>
>> The socket hangs because the recipient of the packet (Win 7 PC) sees a checksum error, discards the packet and resends its previous packet. lwIP sends a duplicate ACK then resends (and keeps sending) the offending packet, with the same erroneous checksum. Hence my ping-pong type link gets stuck.
>>
>> I don't modify the packet content after handing it to lwIP and my MAC device driver simply copies the packet from pbuf(s) to a tx buffer verbatim. I depend on lwIP to calculate the checksum and CRC. I've attached the offending packet in a pcap file.
>>
>> I hand calculated the checksum and the one's complement sum is 0xFFFF, hence the one's complement of that is 0x0000. Why is lwIP inserting a checksum of 0xFFFF? It should have inserted 0x0000, right?
>>
>> Is this a known issue? I didn't see any mention of it in the mail archives. If this is a known issue, hopefully someone can point me in the right direction for a fix/workaround so I don't have to debug and/or re-code the checksum code of lwIP!! While I'm awaiting a reply I'll start that process...
>>
>> FYI: I am using lwIP 1.4.1 and have LWIP_CHKSUM_ALGORITHM = 3 in lwipopts.h
>>
>> Thanks for your time
>> Regards
>> Niall.
Re: [lwip-users] TCP Checksum = 0xFFFF
Bill Auerbach wrote:

> From an empirical standpoint, lwIP is used in far too many places for there to be this significant of a bug. I’d look for a compiler bug or some other issue. I seriously doubt it’s a bug in lwIP. Some of my company’s users run our systems 24/7 sending lots of data through lwIP and I’d hear about it really fast if there was this kind of a TCP lockup.

I'm flattered by your opinion but I fear this does not prevent lwIP from having bugs :-)

In this case, I think I fixed a bug in git master not too long ago (#36153), here is the change, maybe it fixes things for you:

    @@ -658,6 +662,10 @@ tcp_write(struct tcp_pcb *pcb, const void *arg, u16_t len, u8_t apiflags)
           last_unsent->len += concat_p->tot_len;
     #if TCP_CHECKSUM_ON_COPY
           if (concat_chksummed) {
    +        /*if concat checksumm swapped - swap it back */
    +        if (concat_chksum_swapped){
    +          concat_chksum = SWAP_BYTES_IN_WORD(concat_chksum);
    +        }
             tcp_seg_add_chksum(concat_chksum, concat_chksummed, &last_unsent->chksum,
               &last_unsent->chksum_swapped);
             last_unsent->flags |= TF_SEG_DATA_CHECKSUMMED;

Simon
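As background to why a partial checksum may need its bytes swapped at all: when a chunk of data starts at an odd byte offset within the segment, its 16-bit one's complement sum has to be byte-swapped before it is added to the running total. The following is a minimal sketch of that property only. `fold16`, `sum_bytes` and `swap16` are hypothetical helpers, not lwIP's implementation of SWAP_BYTES_IN_WORD or tcp_seg_add_chksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
uint16_t fold16(uint32_t acc)
{
    while (acc >> 16) {
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* Big-endian one's complement sum of len bytes.
 * An odd trailing byte is treated as the high byte of a zero-padded word. */
uint16_t sum_bytes(const uint8_t *p, size_t len)
{
    uint32_t acc = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        acc += ((uint32_t)p[i] << 8) | p[i + 1];
    }
    if (i < len) {
        acc += (uint32_t)p[i] << 8;
    }
    return fold16(acc);
}

/* Exchange the two bytes of a 16-bit word. */
uint16_t swap16(uint16_t w)
{
    return (uint16_t)((w << 8) | (w >> 8));
}

void swap_demo(void)
{
    const uint8_t buf[] = { 1, 2, 3, 4, 5 };
    uint16_t first  = sum_bytes(buf, 3);      /* 0x0102 + 0x0300 */
    uint16_t second = sum_bytes(buf + 3, 2);  /* 0x0405, starts at odd offset */

    assert(sum_bytes(buf, 5) == 0x0906);
    /* Swapping the odd-offset chunk's sum reproduces the whole-buffer sum. */
    assert(fold16((uint32_t)first + swap16(second)) == 0x0906);
    /* Without the swap the combined sum is wrong. */
    assert(fold16((uint32_t)first + second) != 0x0906);
}
```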