On 13 March 2015 16:12, quoth I:
> gavinl-1011-e1000e_watchdog:
> This resolves an issue with the e1000e driver that I mentioned earlier --
> when wired for cable redundancy, the second port didn't establish link until
> the network broke, causing an unacceptable delay in failover to the
> redundant connection. Turns out the problem was that the port watchdog has
> the job of detecting link up/down and the watchdog was not run if the port
> was receiving packets, even if it didn't think it had a link. (With
> redundant wiring, it would transmit on the main link and receive back on the
> backup link, resetting the backup link's watchdog each time so that it never
> ran.) This patch removes the reset of watchdog on receive, so that the
> watchdog runs every 2 seconds regardless.
> I haven't checked the other network drivers to see if they're similarly
> afflicted.
After a bit more testing, I need to revise this patch. It causes ~450us of extra delay inside ecrt_master_receive whenever the 2 second timer fires, which I think we can all agree is a bad thing.

On looking closer at the different kernel versions, I noticed that in 2.6.35 and earlier the watchdog task was scheduled on a kernel worker thread, while in 2.6.37 and later it was changed to run directly on the master application thread. Does anyone recall what the reason for this change was, or whether it was accidental? It seems to have happened in commit c350fc89afd7ac6bb64b706bbc333df5e53e3d2f.

(Note that prior to this patch, on all versions, the watchdog task would simply never execute as long as the port was receiving packets, meaning that the stats calculations and other housekeeping tasks that seem to be part of it were never performed. I'm not familiar enough with the driver/hardware internals to know whether this is a good thing or not. Given the cyclic nature of EtherCAT, there is rarely a time when ports stop receiving packets.)

In the revised patch (attached), I've chosen to continue running the watchdog every 2 seconds even if RX happens (which fixes redundancy), but I've moved the watchdog work back to the worker thread (on 2.6.37+) to avoid holding up ecrt_master_receive. There is a slight race with the timer reset as a result (it doesn't take into account the time required to run the watchdog task), but as this is 2 seconds vs. ~500us that seems reasonably safe, and it's what happened in the older kernel versions as well.

I did consider an alternate patch which still avoids calling the watchdog if the port is receiving data, but I'm not convinced there's value in avoiding the "link_up" work in the watchdog task, especially when it's being done on a worker thread. Perhaps someone more familiar with this could enlighten me?
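To illustrate the difference in behaviour described above, here is a rough userspace sketch. It is not the attached patch and not the real e1000e code: the struct, the function names, the 1 ms cycle and the "worker" stand-in are all assumptions made up for the illustration. It only models the logic: with reset-on-RX the watchdog deadline is never reached while packets keep arriving, whereas without it the watchdog is queued every 2 seconds and run outside the cyclic path.

```c
#include <stdbool.h>

#define WATCHDOG_PERIOD_MS 2000 /* the 2 second watchdog interval */

/* Hypothetical, simplified port state; not the real e1000e adapter struct. */
struct port {
    long watchdog_due_ms; /* time at which the watchdog should next run */
    bool work_queued;     /* watchdog handed to the worker, not yet run */
    bool carrier;         /* what the hardware would report if asked */
    bool link_up;         /* what the driver currently believes */
};

/* Watchdog body: the only place where the link state gets refreshed. */
static void watchdog_task(struct port *p)
{
    p->link_up = p->carrier;
}

/* Called from the cyclic (master) context once per cycle. */
static void poll_port(struct port *p, long now_ms, bool got_rx, bool reset_on_rx)
{
    if (got_rx && reset_on_rx) {
        /* Old behaviour: every received packet pushes the deadline out. */
        p->watchdog_due_ms = now_ms + WATCHDOG_PERIOD_MS;
    }
    if (now_ms >= p->watchdog_due_ms) {
        /* Re-arm immediately and only queue the work, so the cyclic
         * path does not pay for the watchdog's run time. */
        p->watchdog_due_ms = now_ms + WATCHDOG_PERIOD_MS;
        p->work_queued = true;
    }
}

/* Stand-in for the kernel worker thread picking up the queued work. */
static void worker_run(struct port *p)
{
    if (p->work_queued) {
        p->work_queued = false;
        watchdog_task(p);
    }
}

/* Simulates a port with carrier present that receives a packet on every
 * 1 ms cycle. Returns the time (ms) at which the driver first sees the
 * link as up, or -1 if it never does within duration_ms. */
long simulate(bool reset_on_rx, long duration_ms)
{
    struct port p = { WATCHDOG_PERIOD_MS, false, true, false };

    for (long now = 0; now <= duration_ms; now++) {
        poll_port(&p, now, true, reset_on_rx);
        worker_run(&p);
        if (p.link_up)
            return now;
    }
    return -1;
}
```

In this model, simulate(true, 10000) never reports link on the backup port, while simulate(false, 10000) reports it once the first 2 second deadline expires. Re-arming the deadline before the work actually runs is the slight race mentioned above.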
Attachment: gavinl-1011-e1000e_watchdog.patch
_______________________________________________
etherlab-dev mailing list
etherlab-dev@etherlab.org
http://lists.etherlab.org/mailman/listinfo/etherlab-dev