Hi! 

On Wed, 28 Jan 2009, Mike Frysinger wrote:
> > On the wire between the client and the firewall, this happens:
> >
> > a packet 1 is sent
> > b packet 2 is sent
> > c answer 1 is received
> > d answer 2 is received
> >
> > Sometimes d doesn't happen because b is lost in the firewall
> > along the way (where the race condition happens).
> 
> does this affect actual userspace behavior ?  in other words,
> does this lead to lost lookups and errors from the resolver ?

The most visible effect (and the way we first found out about it)
is a 5s hang on ssh connects. How long that timeout lasts is
program-dependent (whatever the program passes to select()); a
bare recvfrom() simply hangs. I wrote a simple C program to do
what glibc does (simplified for brevity):

sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
connect(sockfd, tgt->ai_addr, tgt->ai_addrlen);
sendto(sockfd, payload1, sizeof(payload1), 0, tgt->ai_addr, tgt->ai_addrlen); 
sendto(sockfd, payload2, sizeof(payload2), 0, tgt->ai_addr, tgt->ai_addrlen); 
recvfrom(sockfd, buf, sizeof(buf), 0, &addr, &fromlen);
recvfrom(sockfd, buf, sizeof(buf), 0, &addr, &fromlen);

payload1 and 2 are an A and an AAAA request for the same name,
respectively. That second recvfrom() hangs indefinitely in the
error case. Here's the full program for those interested:

http://eric.schwarzvogel.de/~klausman/dnstest2.c.txt

It'd be easy to put in a call to select() and make the program
time out as glibc does instead of simply hanging. Note that for
an actual test in your environment, you'll probably have to
change the payloads and line 44.

Here's the tcpdump of the error case:
09:42:53.614905 IP 192.168.0.2.39355 > 192.168.22.9.53: 64583+[|domain]
09:42:53.614920 IP 192.168.0.2.39355 > 192.168.22.9.53: 61812+[|domain]
09:42:53.615623 IP 192.168.22.9.53 > 192.168.0.2.39355: 64583[|domain]

Or, if you prefer tshark:

0.000000 192.168.0.2 -> 192.168.22.9 DNS Standard query A eric.schwarzvogel.de
0.000015 192.168.0.2 -> 192.168.22.9 DNS Standard query AAAA eric.schwarzvogel.de
0.000667 192.168.22.9 -> 192.168.0.2 DNS Standard query response A 194.97.4.250

As you can see, the two queries are sent very close together.
glibc usually is in the 20-50 microsecond range on this machine;
my little program can get as low as 5 microseconds. The timing
needed to trigger the race of course depends on a myriad of
variables, including load on the packet filter, the kernel
version there, and so on.

A "quickfix" would indeed be using two different ports for the
queries - but the bug in Netfilter would still be there.

Regards,
Tobias
