Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-14 Thread Niall Donovan
Hi Guys,
  Thanks for the replies.

@Bill - I agree, it's unlikely that someone else would not have found such
a bug given how widely used lwIP is. My first assumption is always that
I've made an error somewhere:-) but there is no harm in asking the question
while I search for the answer. I should have mentioned that I
have LWIP_CHECKSUM_ON_COPY = 1 and CHECKSUM_CHECK_TCP = 1, so my code takes
the #if TCP_CHECKSUM_ON_COPY path and hence doesn't
call inet_chksum_pseudo(); see below for more.

@Simon - I'll apply the patch and re-test, but I can see from a debug run
that that bit of code is not being executed in my implementation.

If you saw my follow up email, you will notice that I identified the code
that is causing my problem. It is caused by the line of code at line 1146
in tcp_out.c (Note I have LWIP_CHECKSUM_ON_COPY = 1)
acc += (u16_t)~(seg->chksum);

acc is a one's complement checksum obtained from a call to
inet_chksum_pseudo_partial()
and seg->chksum is a checksum of the payload.
What is happening is that occasionally during operation acc ends up with
a value of M and seg->chksum has, by coincidence, a value of M. Then M +
(~M) always gives 0xFFFF.

Why hasn't it been seen by others before?
As I'm sure you are aware (I have just been reading up on it!) some
checksum checkers might accept 0xFFFF as a valid checksum depending on how
they validate the checksum (recalculate and compare to the inserted checksum OR
calculate with the checksum value included and check that the result is 0). On
Windows 7 in my application it seems it recalculates and compares the checksum
and expects 0x0000 (Wireshark does too!). This combination of lwIP options and
checksum validation method might explain why others may not have seen this error
before now?

Mathematically speaking, using one's complement maths, ~(sum(a+b+c+d)) is not
the same as [(~sum(a+b)) + (~sum(c+d))] for the special corner case where
sum(a+b) = ~sum(c+d). In this special case the answer will be 0xFFFF
instead of 0x0000. Which is what is happening in my case!

example (using 4 bit numbers for simplicity):
let a = 1, b = 2, c = 4, d = 8.
checksum = ~sum(a+b+c+d) = ~(0xF) = 0x0
sum(a+b) = 3
sum(c+d) = 0xC
Calculated by code = [(~sum(a+b)) + (~sum(c+d))] = [~(3) + ~(0xC)] = [0xC +
3] = 0xF
QED!?
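
The 4-bit example above can be reproduced with a small C sketch. The helper names (add4, not4, chk_whole4, chk_split4, demo_4bit_example) are made up for illustration and are not part of lwIP; the sketch only assumes one's complement addition with end-around carry on 4-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit one's complement addition: add, then fold the carry back in. */
uint8_t add4(uint8_t x, uint8_t y)
{
    uint8_t s = (uint8_t)((x & 0xF) + (y & 0xF));
    return (uint8_t)((s & 0xF) + (s >> 4));
}

/* 4-bit bitwise complement. */
uint8_t not4(uint8_t x)
{
    return (uint8_t)(~x & 0xF);
}

/* Complement applied once, after summing everything. */
uint8_t chk_whole4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return not4(add4(add4(a, b), add4(c, d)));
}

/* Two partial sums, each complemented first, then added. */
uint8_t chk_split4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return add4(not4(add4(a, b)), not4(add4(c, d)));
}

/* The values from the example: the two forms disagree only in the
 * corner case where sum(a+b) == ~sum(c+d). */
void demo_4bit_example(void)
{
    assert(chk_whole4(1, 2, 4, 8) == 0x0);
    assert(chk_split4(1, 2, 4, 8) == 0xF);
    /* An ordinary case: both forms agree. */
    assert(chk_whole4(1, 2, 4, 4) == chk_split4(1, 2, 4, 4));
}
```

Both results are representations of zero in one's complement arithmetic, which is why a receiver that sums over the whole packet accepts either, while one that recomputes and compares does not.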

I'm more convinced that this is a coding issue in lwIP that doesn't handle
this special corner case, but am happy to be proved wrong!

Regards,
 Niall.


___
lwip-users mailing list
lwip-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/lwip-users

Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-14 Thread Sergio R. Caprile
Quick check: try changing the checksum algorithm with LWIP_CHKSUM_ALGORITHM
This might point to a possible bug in the algo
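
As a sketch of that check, assuming the option is set in lwipopts.h as in the original report (which used 3), the change would be a single line. The value 2 here is only an example of a different setting:

```c
/* lwipopts.h (fragment): select a different software checksum routine
 * to see whether the 0xFFFF result depends on the algorithm. */
#define LWIP_CHKSUM_ALGORITHM 2   /* the original report used 3 */
```

If the bad checksum still appears with another algorithm, the summing routine itself is unlikely to be the cause.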



Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-14 Thread Valery Ushakov
Niall Donovan nfdono...@gmail.com wrote:

 If you saw my follow up email, you will notice that I identified the code
 that is causing my problem. It is caused by the line of code at line 1146
 in tcp_out.c (Note I have LWIP_CHECKSUM_ON_COPY = 1)
 acc += (u16_t)~(seg->chksum);
 
 acc is a one's complement checksum obtained from a call to
 inet_chksum_pseudo_partial()
 and seg->chksum is a checksum of the payload.
 What is happening is that occasionally during operation acc ends up with
 a value of M and seg->chksum has, by coincidence, a value of M. Then M +
 (~M) always gives 0xFFFF.
 
[...] 
 Mathematically speaking, using one's complement maths, ~(sum(a+b+c+d)) is not
 the same as [(~sum(a+b)) + (~sum(c+d))] for the special corner case where
 sum(a+b) = ~sum(c+d). In this special case the answer will be 0xFFFF
 instead of 0x0000. Which is what is happening in my case!

Your analysis is correct.


As for the likelihood of stumbling into just the right numbers I
recently had an interesting problem.  I have a diskless machine that
has been running off NFS.  For ages.  After NFS server upgrade the
diskless machine would reliably wedge quite early in the boot process.

Eventually I tracked it down to a hardware checksum bug in the
ethernet of the diskless machine.  In UDP the checksum value 0 means
no checksum and if the datagram data actually has checksum 0, it's
replaced with 0xFFFF - which gives the same result if verification is
done properly (computing the sum with the checksum field included).
Apparently the hardware checksum used the recompute and compare method
instead, so it flagged such valid UDP datagrams as having a bad checksum.

The hardware bug has always been there and I've never seen it.  Then
when NFS server change changed the numerology of the NFS handles just
right, the bug was triggered reliably by some particular NFS response
datagram.

So don't underestimate luck as a factor in system stability :)
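
The two verification methods mentioned above can be sketched in C as follows. This is a simplified illustration, not code from any real driver: the function names are hypothetical, the packet is treated as a plain array of 16-bit words, and the UDP pseudo-header is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's complement addition with end-around carry. */
static uint16_t ones_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFFu) + (s >> 16));
}

/* Method 1: sum every word, checksum field included.
 * The packet is valid if the sum is 0xFFFF (its complement is 0). */
int verify_by_sum(const uint16_t *w, size_t n)
{
    uint16_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = ones_add(acc, w[i]);
    }
    return acc == 0xFFFFu;
}

/* Method 2: recompute the checksum over everything except the checksum
 * field and compare it with the transmitted value. */
int verify_by_recompute(const uint16_t *w, size_t n, size_t chk_idx)
{
    assert(chk_idx < n);
    uint16_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (i != chk_idx) {
            acc = ones_add(acc, w[i]);
        }
    }
    return (uint16_t)~acc == w[chk_idx];
}
```

For data words 0x1234 and 0xEDCB the computed checksum is 0x0000, so a UDP sender transmits 0xFFFF instead. verify_by_sum accepts that packet, while verify_by_recompute expects 0x0000 and rejects it.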

-uwe




Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-13 Thread Niall Donovan
To follow up on this issue. The offending piece of code that is generating
0xFFFF as the TCP checksum is this (lines 1137-1147 in tcp_out.c):

/* rebuild TCP header checksum (TCP header changes for retransmissions!) */
acc = inet_chksum_pseudo_partial(seg->p, &(pcb->local_ip),
       &(pcb->remote_ip),
       IP_PROTO_TCP, seg->p->tot_len, TCPH_HDRLEN(seg->tcphdr) * 4);
/* add payload checksum */
if (seg->chksum_swapped) {
  seg->chksum = SWAP_BYTES_IN_WORD(seg->chksum);
  seg->chksum_swapped = 0;
}
acc += (u16_t)~(seg->chksum);
seg->tcphdr->chksum = FOLD_U32T(acc);

If acc happens to have a value equal to seg->chksum, for example
acc=0x8E93 and seg->chksum=0x8E93 (which can happen given the right set of
values), one gets acc += ~acc, which always results in 0xFFFF. The
FOLD_U32T has nothing to do and 0xFFFF is set as
seg->tcphdr->chksum, which is wrong. Am I missing something?

Is anyone able to enlighten me as to why this isn't a coding error?

Shouldn't there be something to ensure 0xFFFF is converted to 0x0000?!
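
One possible way to avoid the corner case, sketched here as an assumption and not as the change actually made in lwIP, is to undo the complement on the header part, add the payload sum, fold, and complement once at the end. The function names are hypothetical, and the sketch assumes hdr_chksum is the already complemented header checksum and payload_sum is the uncomplemented payload sum:

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold32(uint32_t acc)
{
    while (acc >> 16) {
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* The pattern quoted above: complemented header checksum plus the
 * complement of the payload sum, then fold. */
uint16_t combine_as_quoted(uint16_t hdr_chksum, uint16_t payload_sum)
{
    uint32_t acc = hdr_chksum;
    acc += (uint16_t)~payload_sum;
    return fold32(acc);
}

/* Alternative: add the uncomplemented values, fold, complement once. */
uint16_t combine_then_complement(uint16_t hdr_chksum, uint16_t payload_sum)
{
    uint32_t acc = (uint16_t)~hdr_chksum;
    acc += payload_sum;
    return (uint16_t)~fold32(acc);
}

/* The values from the message: both inputs equal to 0x8E93. */
void demo_combine(void)
{
    assert(combine_as_quoted(0x8E93, 0x8E93) == 0xFFFF);
    assert(combine_then_complement(0x8E93, 0x8E93) == 0x0000);
    /* An ordinary case: both forms give the same checksum. */
    assert(combine_as_quoted(0x1234, 0x0001) ==
           combine_then_complement(0x1234, 0x0001));
}
```

The two functions differ only when the combined sum is 0xFFFF, which is exactly the case that produced the rejected packet.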

Thanks.
Niall.


On 13 May 2014 13:17, Niall Donovan nfdono...@gmail.com wrote:

 Hi All,
 I'd appreciate some help on my problem. I have occasionally seen my TCP
 socket connection hang and when I captured the fault on Wireshark I could
 see, on the packet causing the hang, that the calculated TCP Checksum value
 was 0xFFFF, which Wireshark indicated was incorrect. Wireshark says it
 should be 0x0000. It also helpfully pointed to RFC 1624 for further
 information.

 The socket hangs because the recipient of the packet (Win 7 PC) sees a
 checksum error and discards the packet and resends its previous packet.
 LwIP sends a duplicate Ack then resends (and keeps sending) the offending
 packet, with the same erroneous checksum. Hence my ping-pong type link gets
 stuck.

 I don't modify the packet content after handing it to lwIP and my MAC
 device driver simply copies the packet from pbuf(s) to a tx buffer
 verbatim. I depend on lwIP to calculate the Checksum and CRC.

 I've attached the offending packet in a pcap file. I hand calculated the
 checksum and the one's complement sum is 0xFFFF, hence the one's complement
 of that is 0x0000.

 Why is lwIP inserting a checksum of 0xFFFF? It should have inserted 0x0000,
 right? Is this a known issue? I didn't see any mention of it in the mail
 archives. If this is a known issue hopefully someone can point me in the
 right direction for a fix/workaround so I don't have to debug and/or
 re-code the checksum code of lwIP!! While I'm awaiting a reply I'll start
 that process...

 FYI:
 I am using lwIP 1.4.1 and have LWIP_CHKSUM_ALGORITHM = 3 in lwipopts.h

 Thanks for your time
 Regards
 Niall.




Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-13 Thread Bill Auerbach
Niall,

 

This final line in tcp_out.c sets the checksum and includes the ~ in the call.

  seg->tcphdr->chksum = inet_chksum_pseudo(seg->p, &(pcb->local_ip),
         &(pcb->remote_ip),
         IP_PROTO_TCP, seg->p->tot_len);

 

From an empirical standpoint, lwIP is used in far too many places for there to 
be this significant of a bug.  I’d look for a compiler bug or some other 
issue.  I seriously doubt it’s a bug in lwIP.  Some of my company’s users run 
our systems 24/7 sending lots of data through lwIP and I’d hear about it 
really fast if there was this kind of a TCP lockup.

 

Regards,

Bill

 


Re: [lwip-users] TCP Checksum = 0xFFFF

2014-05-13 Thread Simon Goldschmidt
Bill Auerbach wrote:

 From an empirical standpoint, lwIP is used in far too many places for there 
 to be this significant of a bug.  I’d look for a compiler bug or some other 
 issue.  I seriously doubt it’s a bug in lwIP.  Some of my company’s users run 
 our systems 24/7 sending lots of data through lwIP and I’d hear about it 
 really fast if there was this kind of a TCP lockup.

I'm flattered by your opinion but I fear this does not prevent lwIP from having 
bugs :-)

In this case, I think I fixed a bug in git master not too long ago (#36153), 
here is the change, maybe it fixes things for you:

@@ -658,6 +662,10 @@ tcp_write(struct tcp_pcb *pcb, const void *arg, u16_t len, u8_t apiflags)
     last_unsent->len += concat_p->tot_len;
 #if TCP_CHECKSUM_ON_COPY
     if (concat_chksummed) {
+      /*if concat checksumm swapped - swap it back */
+      if (concat_chksum_swapped){
+        concat_chksum = SWAP_BYTES_IN_WORD(concat_chksum);
+      }
       tcp_seg_add_chksum(concat_chksum, concat_chksummed, &last_unsent->chksum,
         &last_unsent->chksum_swapped);
       last_unsent->flags |= TF_SEG_DATA_CHECKSUMMED;


Simon