Hi,

I've been working on solving various problems with dhcpcd 0.7, which we use
with the single-floppy-disk router Freesco (http://www.freesco.org).

First a little background info that will explain what I'm trying to do.
Freesco is based on a 2.0.38 kernel with a few minor patches, libc5, an old
version of busybox, a few extra commands, various services like bind,
thttpd, telnet daemon, etc, and a lot of scripting :) One of our features
is a dhcp client to support cable and dsl ISPs who use dhcp to assign
addresses, but we've always had a lot of trouble with it.

There don't really seem to be any alternatives for us to use with a 2.0
kernel and libc5: pump is tied to 2.2/glibc2, dhcpcd 1.3 requires glibc2
and is too big, and isc's dhclient is just too big. That leaves us stuck
with dhcpcd 0.7, so I decided to try to fix its problems myself.

Although I've had great success so far in fixing long-standing issues, two
problems have eluded me. The first one I think I know how to fix, or at
least work around; the second has me completely stumped.

The first problem is as follows: (assume eth0 is going to the isp and has
dhcpcd running on it, and eth1 is local, with dhcpd running on it)

Because we also run dhcpd 2.0 as a server on local networks, we must set a
route to 255.255.255.255 for each interface it is serving, or dhcpd can't
work properly. But if we do that, when dhcpcd tries to send an initial
DHCPDISCOVER broadcast, that route snatches the broadcasts away from eth0
and they end up being sent on eth1, despite the send socket seemingly being
bound to eth0. To work around this, we also add a route for
255.255.255.255 to eth0, and this allows both programs to co-exist. For the
moment.

The trouble comes when dhcpcd configures the interface after a lease is
obtained: the route to 255.255.255.255 for eth0 is lost. Because of this,
broadcasts once again end up going to eth1 instead of eth0. When it reaches
the renewal phase, things are ok since renewals are sent as unicasts to the
dhcp server and follow normal routing. However, should renewing fail for
some reason, once rebinding is reached things start falling apart, because
the broadcasts are being sent on eth1 again. After the lease time expires
eth0 is taken down and broadcasts are sent in the DISCOVER stage again, but
they're still going out on eth1. Another thing that can cause this is if
the server NAKs the client because it wants to shift it to another address.

Now if we have *no* routes to 255.255.255.255 at all, dhcpcd works just
fine; it seems to be able to send its broadcasts out on the correct
interface without any routes. But dhcpd isn't able to, so once we add a
route for it, we have to do it for both of them, and dhcpcd does not
maintain its route when it configures the interface. Hence the problem.

I can see two solutions to this - one is to somehow bypass normal routing
so we can send our broadcasts out on the desired interface without the
interference of the broadcast address routes, and the second is to recreate
a broadcast route in dhcpcd each time the interface is reconfigured.

As far as the first method goes, I haven't had any luck yet; I'm floundering
around a bit in unfamiliar water. I've tried using setsockopt() on the send
socket to set the SO_DONTROUTE option, and I've also tried passing the
MSG_DONTROUTE flag in the sendto() calls, but neither seems to have any
effect on the situation, and for lack of decent documentation on the
various flags, I can't be sure whether they even do what I want.
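
For reference, here is a minimal sketch of the two attempts just described,
assuming a plain UDP socket. The helper names open_broadcast_socket() and
send_discover() are made up for illustration and are not taken from the
dhcpcd source. Going by the socket(7) man page, SO_DONTROUTE and
MSG_DONTROUTE only mean "don't send via a gateway"; they do not pick the
outgoing interface, which may be why neither helps here.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: open a UDP socket that may send broadcasts and
 * has SO_DONTROUTE set for every send.  Returns the fd, or -1 on error. */
int open_broadcast_socket(void)
{
    int on = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: per-call variant, passing MSG_DONTROUTE to
 * sendto() for a broadcast to the DHCP server port (67). */
ssize_t send_discover(int fd, const void *msg, size_t len)
{
    struct sockaddr_in to;

    memset(&to, 0, sizeof(to));
    to.sin_family = AF_INET;
    to.sin_port = htons(67);
    to.sin_addr.s_addr = htonl(INADDR_BROADCAST);
    return sendto(fd, msg, len, MSG_DONTROUTE,
                  (struct sockaddr *)&to, sizeof(to));
}
```

Either form alone should be enough to exercise the flag; both are shown
only because both were tried.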

I'm reasonably sure that the second method of setting a route within dhcpcd
will work. I'm just studying the source for 'route' from net-tools to see
exactly how routes can be added from C (again, some decent documentation on
the ioctls would be nice, but I can't seem to find any). I'm not completely
happy with this method, though: I don't particularly like the idea of
fiddling with the routing table from within a program, as I can think of
cases where it might interfere with another program, perhaps another dhcp
client running on a different interface.

On the other hand, freesco could be considered a special-case, almost
"embedded" environment where taking such a short cut isn't such a big deal,
for the sake of having the smallest possible client which works properly,
since we have control over what else is running on the system.
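
A rough sketch of what recreating the broadcast route from C might look
like, along the lines of what 'route add -host 255.255.255.255 dev eth0'
does. It assumes the SIOCADDRT ioctl and struct rtentry as declared in
<net/route.h> on glibc systems; on libc5 the header may well be
<linux/route.h> instead. The function names are hypothetical, and the
ioctl itself needs root.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <net/route.h>

/* Hypothetical helper: describe a host route to 255.255.255.255 that
 * goes out through the named interface.  ifname must stay valid until
 * the ioctl has been issued, since rt_dev only stores the pointer. */
void fill_broadcast_route(struct rtentry *rt, char *ifname)
{
    struct sockaddr_in sin;

    memset(rt, 0, sizeof(*rt));
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_BROADCAST);

    memcpy(&rt->rt_dst, &sin, sizeof(sin));      /* destination        */
    memcpy(&rt->rt_genmask, &sin, sizeof(sin));  /* /32 mask, host route */
    rt->rt_flags = RTF_UP | RTF_HOST;
    rt->rt_dev = ifname;
}

/* Hypothetical helper: add the route.  Returns 0 on success, -1 on
 * failure (errno is EEXIST if the route is already present). */
int add_broadcast_route(char *ifname)
{
    struct rtentry rt;
    int fd, ret;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    fill_broadcast_route(&rt, ifname);
    ret = ioctl(fd, SIOCADDRT, &rt);
    close(fd);
    return ret;
}
```

The idea would be to call something like add_broadcast_route("eth0") each
time dhcpcd has finished reconfiguring the interface.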

Now the second problem is a bit more sinister and is the one that has me
stumped.

Basically, during the DHCPDISCOVER phase, before eth0 has an address,
dhcpcd sets the ip address of the interface to 0.0.0.0 to send its
broadcasts, but during that time all other interfaces are effectively dead.
Not a huge problem if the dhcp server responds quickly, but if it or the
connection to it is down for any period of time, it can be a very serious
problem if the system is also (for example) acting as a router between two
local networks, or a dns server, print server, dhcp server, and all the
other things it can do.

I've done a fair bit of testing of this problem, and quite a bit of hunting
through various mailing list archives and other sources of information, and
it seems as if this might be an actual bug in the kernel, where an
interface address of 0.0.0.0 is mis-handled as a special case. Very odd
things happen with the arp cache in this situation. As an example, I ran
tcpdump on the router machine monitoring eth1, set a machine connected to
eth1 to ping the router continuously, and allowed the dhcp lease to time
out on eth0. Here is the system log from the time in question:

Jul  1 18:55:09 - dhcpcd[704]: Entering REBINDING state
Jul  1 18:55:21 - dhcpcd[704]: REBINDING: Lease time expired. Fall back to INIT
Jul  1 18:55:21 - dhcpcd[704]: Entering INIT state
Jul  1 18:55:21 - dhcpcd[704]: Entering SELECTING state
Jul  1 18:55:21 - kernel: ARP: arp called for own IP address

At 18:55:21 dhcpcd resets eth0 and configures it to 0.0.0.0 to begin
sending DHCPDISCOVER broadcasts. Notice the strange ARP error message:
this is actually in response to the next ping which arrives from the
pinging machine connected to eth1, after eth0 is reconfigured. Here is a
snippet from tcpdump tracing eth1 at the same time:

18:55:20.740000 0:60:97:cd:d3:fa 0:20:af:f5:bc:c 0800 74: 10.0.0.10 > 10.0.0.1: icmp: echo request
18:55:20.740000 0:20:af:f5:bc:c 0:60:97:cd:d3:fa 0800 74: 10.0.0.1 > 10.0.0.10: icmp: echo reply
18:55:21.010000 0:20:af:f5:bc:c ff:ff:ff:ff:ff:ff 0800 590: 10.0.0.1.68 > 255.255.255.255.67: xid:0x27bfda76 [|bootp]
18:55:21.740000 0:60:97:cd:d3:fa 0:20:af:f5:bc:c 0800 74: 10.0.0.10 > 10.0.0.1: icmp: echo request
18:55:21.740000 0:20:af:f5:bc:c 0:20:af:f5:bc:c 0800 74: 10.0.0.1 > 10.0.0.10: icmp: echo reply

The first two packets are a perfectly normal ping and reply. The third is
the DHCPDISCOVER broadcast, which should be going out on eth0 but is
actually being sent on eth1 because of the first problem discussed. Now
notice the last packet, the reply to the second ping: it is addressed to
the correct ip address, but the *WRONG* hardware ethernet address! Its
destination is the hardware address of eth1 in the router, i.e. itself.
Because of this, it doesn't actually get physically sent, and the machine
pinging the router stops receiving replies. For all intents and purposes
eth1 is dead as far as sending goes, as are any additional network
interfaces.

I've confirmed that it is the act of setting the ip address of eth0 to
0.0.0.0 which causes this "black hole effect" (for lack of a better
description). The same situation can be simulated by manually configuring
eth0 using ifconfig, and setting a route for it with route.

I've searched the linux-net archives and a few other places and seen two or
three reports some time ago of exactly the same effect, but none of them
had a reply. :(

As far as I can see, there are only two ways of solving this problem. One
is to try to find what actually causes it (kernel bug?), but that doesn't
seem hopeful. The other is to avoid it, by somehow sending the DISCOVER
messages without setting the interface to 0.0.0.0.

The second way seems the logical approach, but how can it be done with a
2.0 kernel? Is it possible to fabricate packets with a forged source
address without too much difficulty? And how about receiving the reply
from the server? The ideal situation would be if we could send our
fabricated packets and receive the reply *without* even having the
interface up at all, and not bring it "up" until we have a valid ip address.
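
To make the question concrete, here is a sketch of hand-building a
DHCPDISCOVER frame (Ethernet + IP + UDP headers) with a source address of
0.0.0.0, so the kernel's idea of the interface address never enters into
it. The function names are hypothetical and the BOOTP payload is left to
the caller. Actually transmitting the frame would presumably go through a
link-level socket (SOCK_PACKET on a 2.0 kernel), which needs root; whether
2.0 will send or receive that way with the interface down is untested
here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR 14
#define IP_HDR  20
#define UDP_HDR 8

/* Standard Internet checksum (RFC 1071) over len bytes. */
uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: build a broadcast frame 0.0.0.0:68 ->
 * 255.255.255.255:67 around the given payload.  Returns the frame
 * length, or 0 if it does not fit in buf or in an Ethernet frame. */
size_t build_discover_frame(unsigned char *buf, size_t buflen,
                            const unsigned char src_mac[6],
                            const unsigned char *payload,
                            size_t payload_len)
{
    size_t total = ETH_HDR + IP_HDR + UDP_HDR + payload_len;
    unsigned char *ip = buf + ETH_HDR;
    unsigned char *udp = ip + IP_HDR;
    uint16_t ip_len, udp_len, csum;

    if (total > buflen || total > 1514)
        return 0;
    ip_len = (uint16_t)(IP_HDR + UDP_HDR + payload_len);
    udp_len = (uint16_t)(UDP_HDR + payload_len);

    memset(buf, 0, ETH_HDR + IP_HDR + UDP_HDR);

    /* Ethernet: broadcast destination, our MAC, type 0x0800 (IP) */
    memset(buf, 0xff, 6);
    memcpy(buf + 6, src_mac, 6);
    buf[12] = 0x08;
    buf[13] = 0x00;

    /* IP: version 4, 20-byte header, UDP; source stays 0.0.0.0 */
    ip[0] = 0x45;
    ip[2] = ip_len >> 8;
    ip[3] = ip_len & 0xff;
    ip[8] = 64;                      /* TTL */
    ip[9] = 17;                      /* protocol: UDP */
    memset(ip + 16, 0xff, 4);        /* destination 255.255.255.255 */
    csum = inet_checksum(ip, IP_HDR);
    ip[10] = csum >> 8;
    ip[11] = csum & 0xff;

    /* UDP: port 68 -> 67; checksum left 0, i.e. "not computed" */
    udp[1] = 68;
    udp[3] = 67;
    udp[4] = udp_len >> 8;
    udp[5] = udp_len & 0xff;

    memcpy(udp + UDP_HDR, payload, payload_len);
    return total;
}
```

With a 548-byte BOOTP payload this produces a 590-byte frame, the same
size as the DISCOVER in the tcpdump output above. Receiving the server's
reply would need a matching read path on the same kind of socket, which
this sketch does not attempt.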

Any suggestions on where to look, what to try next, or how to go about
things would be gratefully received.

Regards,
Simon

