Hi, John,

Thanks for your quick reply.

1.
Concerning the first kind of race (e1000_open() VS e1000_vlan_filter_on_off). 

After KernelStrider spotted it, I reproduced it directly by introducing delays 
at the end of register_netdev(), etc. The race on RCTL did happen but I saw 
nothing unusual in the behaviour of the driver after that, no visible failures 
yet.

I currently don't know, however, what RCTL register of the device is for, so 
the consequences of the race are still unclear to me. An opinion of an expert 
would be nice here.

I have rebuilt the driver with the call to e1000_vlan_filter_on_off() moved 
before register_netdev() in e1000_probe() as I suggested. Will work with the 
driver and test it more to see if the fix introduces problems.

2.
Concerning the races on adapter->stats.*, etc. (ethtool VS watchdog). Haven't 
reproduced it myself directly (i.e. with delays and breakpoints) yet, will 
probably try to.

Ethtool operations execute under rtnl_lock, but neither e1000_watchdog nor the 
functions called from it take that mutex. On the other hand, they take 
adapter->stats_lock when reading and updating the stats but ethtool operations 
do not. I suppose, locking as it is now, does not prevent these functions from 
executing concurrently.

As for the consequences, they are probably not critical in this case. If I 
understand the code correctly, there are only 32-bit or 64-bit values 
(adapter->stats.*) read and written there. IIRC, reading and writing of 32-bit 
data is atomic on 32-bit x86 systems at least. So the race is probably benign 
on such systems, not sure about others.

In case the race triggers on such systems, the user that executes ethtool will 
see stale data as a result, but they always may see stale data, so it is not a 
problem.

If there were longer pieces of data (or unaligned data) read and written, 
corrupt data could be output due to that race. Well, need to look at the code 
in more detail to see if there are such cases there.

Anyway, it is perhaps quite easy to eliminate this race altogether: 
spin_lock_irqsave(&adapter->stats_lock, flags) at the beginning of 
e1000_get_ethtool_stats(), spin_unlock_irqrestore(...) at the end. This is what 
some other functions in e1000 already do.

3.
Concerning possible races between NAPI poll and .ndo_start_xmit. 

AFAIK, NAPI poll callback executes in softirq context. .ndo_start_xmit executes 
with BH disabled at least, so they cannot race with each other on the same CPU. 
But in my experiments, they were executing on different CPUs sometimes, so the 
question if the race is possible is still open.

.ndo_start_xmit executes with Tx spinlock for the corresponding Tx queue taken 
but NAPI code seems not to take this lock. Same for NAPI-related locks. So, the 
currently used locks also probably do not prevent the race here.  

Is there anything else that prevents such callbacks from running concurrently 
on different CPUs? Are, for example, Tx queues stopped somehow when NAPI 
callback is executed? Probably not, but may be I am missing something? 

Again, help from an expert is highly appreciated here. 

If these callbacks can run concurrently, I will try to reproduce at least the 
race between dql_queued() and dql_avail() on the accesses to dql->num_queued 
(netdev_tx_sent_queue VS netdev_completed_queue) as it seems to be most likely 
to make a mess with Tx queues when it triggers.

Regards,

Eugene

--
Eugene Shatokhin, ROSA Laboratory.
www.rosalab.com

----- Original Message -----
From: John Ronciak <john.ronc...@intel.com>
To: Eugene Shatokhin <eugene.shatok...@rosalab.ru>, e1000-devel 
<e1000-devel@lists.sourceforge.net>
Sent: Fri, 02 Aug 2013 21:54:41 +0400 (MSD)
Subject: RE: [E1000-devel] Possible races in e1000

Have you seen an actual problem where this has really happened?  It's unclear 
to me that this can happen with how the locks are implemented.

Cheers,
John

------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent 
caught up. So what steps can you take to put your SQL databases under 
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Reply via email to