Hi

I have an issue on my servers related to both ucarp and the e1000
drivers, thus the crossposting. :-) 

I think that during system boot the e1000 driver (e1000e too) reports to
the OS that the link is up some seconds before it really is.

Server & module info:

Red Hat Enterprise Linux ES release 4 (Nahant)

filename:       /lib/modules/2.6.9-5.ELsmp/kernel/drivers/net/e1000/e1000.ko
parm:           copybreak:Maximum size of packet that is copied to a new buffer on receive
author:         Intel Corporation, <linux.n...@intel.com>
description:    Intel(R) PRO/1000 Network Driver
license:        GPL
version:        7.5.5-NAPI

ucarp 1.2

Networking is configured on rc2.d/S10network and ucarp on S98ucarp.

This is what happens: after a reboot of the master server, configured
with preemption so that it would be master again after getting back
online, the virtual IP was unresponsive. We did some tcpdumps and found
out that the gratuitous-arp that ucarp sends when going to master state
wasn't reaching the router, so in the router's arp table the virtual IP
still pointed to the secondary server's MAC address.

On syslog on the primary server we have:

Mar 17 13:45:48 server1 network: Bringing up interface eth2:  succeeded
Mar 17 13:45:54 server1 ucarp[2489]: [INFO] Local advertised ethernet address is [00:15:17:58:19:08]
Mar 17 13:45:54 server1 ucarp[2489]: [WARNING] Spawning [/opt/VIP/servicioVIP_add.sh eth2]
Mar 17 13:45:54 server1 ucarp[2489]: [WARNING] Switching to state: MASTER
Mar 17 13:46:12 server1 kernel: e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
Mar 17 13:46:13 server1 kernel: e1000: eth2: e1000_watchdog_task: 10/100 speed: disabling TSO
Mar 17 13:46:13 server1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex, Flow Control: None

So it seems that, although the network is configured before ucarp is
launched (S10 vs S98), the cards (or the driver?) don't have link until
some 25 seconds after the network startup script runs. So when ucarp
runs, the network still isn't really working. ucarp sends the
gratuitous-arp but it gets lost. After some seconds the link comes up
and the heartbeats reach the secondary server, which goes into backup
state and releases the VIP. But, as the router hasn't received the
gratuitous-arp, in its table the VIP still belongs to the secondary
server. All traffic to the VIP gets routed to the secondary server,
which drops it as it doesn't recognize the VIP any more. This last point
was verified with dumps on both the router and the secondary server and
by taking a look at the arp table on the router.
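
To make the timing explicit, here is a small sketch that computes the
gaps between the three syslog events quoted above. The timestamps are
copied from the log; the year is not logged, so the 2000 used below is
only a placeholder and the event labels are my own wording.

```python
from datetime import datetime

# Timestamps copied from the syslog excerpt; labels are descriptive only.
EVENTS = [
    ("network script reports eth2 up", "Mar 17 13:45:48"),
    ("ucarp switches to MASTER", "Mar 17 13:45:54"),
    ("e1000 watchdog reports link up", "Mar 17 13:46:13"),
]

def parse(stamp):
    # syslog omits the year; prepend a placeholder so parsing is unambiguous.
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")

def gaps(events):
    """Return (label, seconds elapsed since the first event) per event."""
    start = parse(events[0][1])
    return [(label, int((parse(stamp) - start).total_seconds()))
            for label, stamp in events]

for label, offset in gaps(EVENTS):
    print(f"+{offset}s  {label}")
```

So ucarp claims MASTER 6 seconds after the network script finishes, and
the watchdog only logs the link as up 19 seconds after that.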

There are two things that make me think the driver has to do with this
issue:

- I've talked with the people in charge of all the networking systems
and there has been no flapping on the port the server is plugged into.
In other words, according to the switch (Cisco Catalyst 4510), that link
has never gone down.

- I've inserted both a mii-tool and an ethtool call in the ucarp startup
script, just before launching ucarp. According to both of them the link
is UP at that moment. But according to the e1000_watchdog_task messages
in syslog, the link goes UP some 19 seconds after that (13:45:54 vs
13:46:13)!!! And in any case the first packets sent by ucarp never leave
the server.

Besides, after all this testing I've tried upgrading the driver to the
latest e1000e-0.5.11.2. Same problem, same log traces (bring up
interface succeeded -> ucarp runs -> link UP), same behavior when
studying the traffic with dumps.

On a side note: the VIP works with ucarp 1.5. The first gratuitous-arp
still gets lost, but it sends an additional one once the link comes up
and it receives the heartbeats from the other server, "fixing" the
router's arp table at that moment.
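
For reference, this is roughly what such an extra gratuitous-arp looks
like on the wire. It is only a sketch and not ucarp's code: the function
names are made up, the VIP 192.0.2.10 is a placeholder (documentation
range), and the MAC is the one from the syslog excerpt. Sending needs
Linux and root.

```python
import socket
import struct

def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
    ether = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr            # sender MAC and sender IP (the VIP)
    arp += b"\x00" * 6 + addr   # target MAC unset, target IP is the same VIP
    return ether + arp

def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on ifname (Linux AF_PACKET socket, needs root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)

# Hypothetical usage, with a placeholder VIP:
# send_frame("eth2", build_gratuitous_arp("00:15:17:58:19:08", "192.0.2.10"))
```

Sender and target IP are both the VIP, which is what makes the ARP
"gratuitous" and lets the router update its table.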

So, is this a known issue with the e1000/e1000e drivers? Has anybody
else experienced a similar situation? Just after a reboot, apparently
having the network up but losing traffic for some seconds? Why do
mii-tool and ethtool report that the link is UP, when it only appears as
going UP in syslog some seconds after that? Is there any other way to
check the link status?
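
One alternative I can think of, as a sketch only: poll the kernel's
carrier flag in sysfs and start ucarp only once it has stayed up for a
few seconds. This assumes the kernel exposes
/sys/class/net/<interface>/carrier (I'm not sure 2.6.9 does), and it may
well report the same thing mii-tool and ethtool do. The function names
and the default timings are made up.

```python
import time
from pathlib import Path

def read_carrier(ifname, sys_root="/sys/class/net"):
    """Return True if the kernel reports carrier on ifname, else False."""
    try:
        text = (Path(sys_root) / ifname / "carrier").read_text()
    except OSError:
        # File missing, or the read failed (e.g. interface is down).
        return False
    return text.strip() == "1"

def wait_for_stable_link(ifname, stable_for=5.0, timeout=60.0, poll=0.5,
                         probe=read_carrier, clock=time.monotonic,
                         sleep=time.sleep):
    """Wait until probe(ifname) has been True for stable_for seconds.

    Returns True on success, False if timeout seconds pass first.
    """
    deadline = clock() + timeout
    up_since = None
    while clock() < deadline:
        if probe(ifname):
            now = clock()
            if up_since is None:
                up_since = now          # link just came up, start counting
            if now - up_since >= stable_for:
                return True
        else:
            up_since = None             # link dropped, reset the counter
        sleep(poll)
    return False

# Hypothetical usage in the ucarp startup script:
# if not wait_for_stable_link("eth2"):
#     raise SystemExit("eth2 never reported a stable link")
```

The probe, clock and sleep arguments are only there so the loop can be
exercised without a real interface.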

Thanks in advance.

Regards

-- 
   Vicente Aguilar <bise...@bisente.com> | http://www.bisente.com
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel