Hi,

I've been working on solving various problems with dhcpcd 0.7, which we use with the single floppy disk router freesco (http://www.freesco.org).

First, a little background info that will explain what I'm trying to do. Freesco is based on a 2.0.38 kernel with a few minor patches, libc5, an old version of busybox, a few extra commands, various services like bind, thttpd, a telnet daemon, etc., and a lot of scripting :)

One of our features is a dhcp client to support cable and dsl ISPs that use dhcp to assign addresses, but we've always had a lot of trouble with it. There don't really seem to be any alternatives for us to use with a 2.0 kernel and libc5: pump is tied to 2.2/glibc2, dhcpcd 1.3 requires glibc2 and is too big, and isc's dhclient is just too big. So we're stuck with dhcpcd 0.7, and I decided to try to fix its problems myself.

Although I've had great success so far in fixing long-standing issues, two problems have eluded me. The first one I think I know how to fix, or at least work around; the second has me completely stumped.

The first problem is as follows (assume eth0 goes to the isp and has dhcpcd running on it, and eth1 is local, with dhcpd running on it):

Because we also run dhcpd 2.0 as a server on local networks, we must set a route to 255.255.255.255 for each interface it is serving, or dhcpd can't work properly. But if we do that, when dhcpcd tries to send its initial DHCPDISCOVER broadcast, the route just mentioned snatches the broadcasts away from eth0 and they end up being sent on eth1, despite the send socket seemingly being bound to eth0.

To work around this, we also add a route for 255.255.255.255 to eth0, and this allows both programs to co-exist. For the moment. The trouble comes when dhcpcd configures the interface after a lease is obtained: the route to 255.255.255.255 for eth0 is lost, so broadcasts once again end up going to eth1 instead of eth0.

When dhcpcd reaches the renewal phase, things are OK, since renewals are sent as unicasts to the dhcp server and follow normal routing. However, should renewing fail for some reason, things start falling apart once rebinding is reached, because the broadcasts are being sent on eth1 again. After the lease time expires, eth0 is taken down and broadcasts are sent in the DISCOVER stage again, but they're still going out on eth1. The same thing can happen if the server NAKs the client because it wants to shift it to another address.

Now if we have *no* routes to 255.255.255.255 at all, dhcpcd works just fine: it seems to be able to send its broadcasts out on the correct interface without any routes. But dhcpd isn't able to, so once we add a route for it we have to add one for both of them, and dhcpcd does not maintain its route when it configures the interface, hence the problem.

I can see two solutions to this. One is to somehow bypass normal routing, so we can send our broadcasts out on the desired interface without interference from the broadcast address routes. The other is to recreate a broadcast route in dhcpcd each time the interface is reconfigured.

As far as the first method goes, I haven't had any luck yet; I'm floundering around a bit in unfamiliar water. I've tried using setsockopt() on the send socket to set the SO_DONTROUTE flag, and I've also tried setting the MSG_DONTROUTE flag in the sendto() calls, but neither seems to have any effect on the situation, and for lack of decent documentation on the various flags I can't be sure whether they even do what I want.

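For reference, the two calls mentioned above look roughly like this. This is only a minimal sketch, not dhcpcd's actual code: the function names open_broadcast_socket() and send_broadcast() are made up for illustration. Later socket(7) documentation describes SO_DONTROUTE (and MSG_DONTROUTE) as "don't send via a gateway, send only to directly connected hosts", which suggests the flag only skips gateways and does not by itself pick the outgoing interface. That may be why it made no visible difference here, but treat that as a guess for a 2.0 kernel.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket that is allowed to broadcast
 * and has SO_DONTROUTE set for every send. Returns the fd, or -1. */
int open_broadcast_socket(void)
{
    int on = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: send one datagram to 255.255.255.255:port,
 * requesting "don't route" for this call only via MSG_DONTROUTE. */
ssize_t send_broadcast(int fd, const void *buf, size_t len,
                       unsigned short port)
{
    struct sockaddr_in to;

    memset(&to, 0, sizeof(to));
    to.sin_family = AF_INET;
    to.sin_port = htons(port);
    to.sin_addr.s_addr = htonl(INADDR_BROADCAST);
    return sendto(fd, buf, len, MSG_DONTROUTE,
                  (struct sockaddr *)&to, sizeof(to));
}
```

Neither call names an interface, so the kernel still has to decide where a packet for 255.255.255.255 goes, and that decision is presumably where the eth1 route wins.
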
I'm reasonably sure that the second method, setting a route from within dhcpcd, will work. I'm just studying the source for 'route' from net-tools to see exactly how routes can be added from C (again, some decent documentation on the ioctls would be nice, but I can't seem to find any). I'm not completely happy with this method, though: I don't particularly like the idea of fiddling with the routing table from within a program, as I can think of cases where it might interfere with another program, perhaps another dhcp client running on a different interface. On the other hand, freesco could be considered a special case, an almost "embedded" environment where taking such a short cut isn't such a big deal for the sake of having the smallest possible client that works properly, since we have control over what else is running on the system.

Now the second problem is a bit more sinister, and is the one that has me stumped. Basically, during the DHCPDISCOVER phase, before eth0 has an address, dhcpcd sets the ip address of the interface to 0.0.0.0 to send its broadcasts, but during that time all other interfaces are effectively dead. That's not a huge problem if the dhcp server responds quickly, but if the server or the connection to it is down for any period of time, it can be a very serious problem if the system is also (for example) acting as a router between two local networks, or as a dns server, print server, dhcp server, or any of the other things it can do.

I've done a fair bit of testing of this problem, and quite a bit of hunting through various mailing list archives and other sources of information, and it seems as if this might be an actual bug in the kernel, where an interface address of 0.0.0.0 is mishandled as a special case. Very odd things happen with the arp cache in this situation.

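Going back for a moment to the idea of recreating the broadcast route from within dhcpcd: a minimal sketch of what that might look like is below. It assumes the SIOCADDRT ioctl and struct rtentry as declared in <net/route.h> on glibc systems; the libc5 headers on freesco may name or place things differently (linux/route.h, for instance), so that part is an assumption. The function names are invented for illustration. The intent is the equivalent of "route add -host 255.255.255.255 dev eth0".

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <net/route.h>
#include <unistd.h>

/* Hypothetical helper: describe a host route to 255.255.255.255 that
 * is tied to the named interface. ifname must stay valid until the
 * ioctl has been issued, because rt_dev only stores the pointer. */
void fill_broadcast_route(struct rtentry *rt, char *ifname)
{
    struct sockaddr_in *dst = (struct sockaddr_in *)&rt->rt_dst;
    struct sockaddr_in *mask = (struct sockaddr_in *)&rt->rt_genmask;

    memset(rt, 0, sizeof(*rt));
    dst->sin_family = AF_INET;
    dst->sin_addr.s_addr = htonl(INADDR_BROADCAST);
    mask->sin_family = AF_INET;
    mask->sin_addr.s_addr = htonl(0xffffffffUL);  /* host route */
    rt->rt_flags = RTF_UP | RTF_HOST;
    rt->rt_dev = ifname;
}

/* Hypothetical helper: install the route. Needs root. Returns 0 on
 * success, -1 on failure with errno set by socket() or ioctl(). */
int add_broadcast_route(char *ifname)
{
    struct rtentry rt;
    int fd, ret;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    fill_broadcast_route(&rt, ifname);
    ret = ioctl(fd, SIOCADDRT, &rt);
    close(fd);
    return ret;
}
```

If something like this were called each time dhcpcd reconfigures eth0, the route would be restored after every address change. It does nothing to address the concern above about interfering with other programs that manage routes.
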
As an example, I ran tcpdump on the router machine monitoring eth1, set a machine connected to eth1 to ping the router continuously, and allowed the dhcp lease to time out on eth0. Here is the system log from the time in question:

Jul 1 18:55:09 - dhcpcd[704]: Entering REBINDING state
Jul 1 18:55:21 - dhcpcd[704]: REBINDING: Lease time expired. Fall back to INIT
Jul 1 18:55:21 - dhcpcd[704]: Entering INIT state
Jul 1 18:55:21 - dhcpcd[704]: Entering SELECTING state
Jul 1 18:55:21 - kernel: ARP: arp called for own IP address

At 18:55:21 dhcpcd resets eth0 and configures it to 0.0.0.0 to begin sending DHCPDISCOVER broadcasts. Notice the strange ARP error message: this is actually in response to the next ping that arrives from the pinging machine connected to eth1, after eth0 is reconfigured. Here is a snippet from tcpdump tracing eth1 at the same time:

18:55:20.740000 0:60:97:cd:d3:fa 0:20:af:f5:bc:c 0800 74: 10.0.0.10 > 10.0.0.1: icmp: echo request
18:55:20.740000 0:20:af:f5:bc:c 0:60:97:cd:d3:fa 0800 74: 10.0.0.1 > 10.0.0.10: icmp: echo reply
18:55:21.010000 0:20:af:f5:bc:c ff:ff:ff:ff:ff:ff 0800 590: 10.0.0.1.68 > 255.255.255.255.67: xid:0x27bfda76 [|bootp]
18:55:21.740000 0:60:97:cd:d3:fa 0:20:af:f5:bc:c 0800 74: 10.0.0.10 > 10.0.0.1: icmp: echo request
18:55:21.740000 0:20:af:f5:bc:c 0:20:af:f5:bc:c 0800 74: 10.0.0.1 > 10.0.0.10: icmp: echo reply

The first two lines are a perfectly normal ping and response. The third line is the DHCPDISCOVER broadcast, which should be going out on eth0 but is actually being sent on eth1 because of the first problem discussed. The fourth line is the next ping request. Now notice the fifth line: the ping reply is addressed to the correct ip address, but the *WRONG* hardware ethernet address! The router is trying to send to the hardware address of its own eth1, i.e. to itself. Because of this, the reply doesn't actually get physically sent, and the machine pinging the router stops receiving replies.

For all intents and purposes eth1 is dead as far as sending goes, as are any additional network interfaces. I've confirmed that it is the act of setting the ip address of eth0 to 0.0.0.0 which causes this "black hole effect" (for lack of a better description). The same situation can be simulated by manually configuring eth0 using ifconfig and setting a route for it with route. I've searched the linux-net archives and a few other places and seen two or three reports, some time ago, of exactly the same effect, but none of them had a reply. :(

As far as I can see, there are only two ways of solving this problem. One is to try to find what actually causes it (kernel bug??), but that doesn't seem hopeful. The other is to avoid it, by somehow sending the DISCOVER messages without setting the interface to 0.0.0.0.

The second way seems the logical approach, but how can it be done with a 2.0 kernel? Is it possible to fabricate packets with a forged source address without too much difficulty? And how about receiving the reply from the server? The ideal situation would be if we could send our fabricated packets and receive the reply *without* even having the interface up at all, and not bring it "up" until we have a valid ip address.

Any suggestions on where to look, what to try next, or how to go about things gratefully received.

Regards,
Simon
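To make the "fabricated packet" question above a little more concrete, here is a rough sketch of just the header-building half: it writes an IP header with source 0.0.0.0 and destination 255.255.255.255, followed by a UDP header from port 68 to port 67, into a caller-supplied buffer. Everything here is illustrative: the function names are invented, the TTL of 64 is an arbitrary choice, and the UDP checksum is left at zero (meaning "not computed", which IPv4 permits). How such a frame would actually be put on the wire on a 2.0 kernel, presumably through some link-level packet socket with an Ethernet header prepended, is exactly the open question and is not shown.

```c
#include <stddef.h>
#include <string.h>

#define IP_HDR_LEN  20
#define UDP_HDR_LEN 8

/* Internet checksum (RFC 1071): one's complement of the one's
 * complement sum of the data taken as big-endian 16-bit words. */
unsigned short inet_checksum(const unsigned char *data, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((unsigned long)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (unsigned long)data[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (unsigned short)(~sum & 0xffff);
}

/* Store a 16-bit value in network (big-endian) byte order. */
static void put16(unsigned char *p, unsigned int v)
{
    p[0] = (unsigned char)((v >> 8) & 0xff);
    p[1] = (unsigned char)(v & 0xff);
}

/* Hypothetical helper: write IP and UDP headers for a broadcast from
 * 0.0.0.0:68 to 255.255.255.255:67 that will carry payload_len bytes.
 * buf must have room for 28 bytes. Returns the header length. */
size_t build_discover_headers(unsigned char *buf, size_t payload_len)
{
    size_t udp_len = UDP_HDR_LEN + payload_len;
    size_t ip_len = IP_HDR_LEN + udp_len;

    memset(buf, 0, IP_HDR_LEN + UDP_HDR_LEN);
    buf[0] = 0x45;                      /* IPv4, 5-word header */
    put16(buf + 2, (unsigned int)ip_len);
    buf[8] = 64;                        /* TTL, arbitrary choice */
    buf[9] = 17;                        /* protocol: UDP */
    /* bytes 12..15: source 0.0.0.0, already zero */
    memset(buf + 16, 0xff, 4);          /* destination: broadcast */
    put16(buf + 10, inet_checksum(buf, IP_HDR_LEN));

    put16(buf + 20, 68);                /* source port: bootpc */
    put16(buf + 22, 67);                /* destination port: bootps */
    put16(buf + 24, (unsigned int)udp_len);
    /* bytes 26..27: UDP checksum left as 0 (not computed) */
    return IP_HDR_LEN + UDP_HDR_LEN;
}
```

The DHCP message itself would follow these 28 bytes. Because the source address is written by hand, nothing in this step depends on the interface having any ip address configured.
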
