> -----Original Message-----
> From: Jord Pool [mailto:jord.p...@outlook.com]
> Sent: Tuesday, August 07, 2018 2:21 AM
> To: Richard Cochran <richardcoch...@gmail.com>
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; Cliff Spradlin
> <csprad...@waymo.com>; Chris Caudle <ch...@chriscaudle.org>; Cliff Spradlin
> via Linuxptp-users <linuxptp-users@lists.sourceforge.net>
> Subject: Re: PXE Boot PTP Issues
> 
> Hi Richard,
> 
> It is not PXE per se, but network load in general. When other servers PXE
> boot, the PXE boot server, which runs as a PTP slave, sends a high volume of
> network traffic out to the servers that are booting.
> 
> This high network load causes the PTP slave instance to print the message
> that suggests either increasing the tx_timestamp_timeout value or a driver
> bug.
> 
> To be sure it has nothing to do with PXE specifically: when copying an .iso
> file of ~5GB over the Ethernet connection at the maximum gigabit speed of
> about 120MB/s, the PTP slave instance stops and returns the same
> tx_timestamp_timeout message. This clearly indicates that high network load
> causes PTP to stop working, at least with the e1000e driver.
> 
> The weird part is that PTP does not recover by itself after being put on
> hold for a minute when the tx_timestamp_timeout message appears. This
> completely defeats the point of synchronising time: when network load
> increases, the synchronisation process stops and the clock only drifts
> further away instead of re-synchronising.
> 
> The driver is e1000e version 3.2.6, which is the default in Fedora 22. I
> have also tried versions 3.4.0.2 and 3.4.1.1 of the e1000e driver, but they
> don't seem to work either.
> 
> Jord
> 
 

Oh! Hmm. This sounds suspiciously familiar... What version of the kernel
is your Fedora running? I think I recall a fix upstream that might be
related... and it's quite possible the team that owns the sourceforge driver
never released the fix into that driver...

The fix wasn't released until kernel 4.13; it's commit 5012863b7347 ("e1000e:
fix race condition around skb_tstamp_tx()", 2017-06-06).

I don't know for sure if this fix would resolve your issue or not, but it seems 
related. The way timestamps were handled, there was a race such that we would 
ignore some timestamp requests from the application.
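
A quick way to answer the kernel version question is to compare the running
release against 4.13. The sketch below is only a rough check under some
assumptions: the function names are made up for illustration, it only looks
at the major.minor part of the release string, and it says nothing about
distro backports or about the out-of-tree sourceforge driver, which is
versioned separately.

```python
import platform
import re

# First mainline release that carries commit 5012863b7347, per the note above.
FIX_RELEASE = (4, 13)


def parse_kernel_release(release):
    """Return (major, minor) from a string such as '4.0.4-301.fc22.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def mainline_has_fix(release):
    """True if a mainline kernel with this release number includes the fix.

    Hypothetical helper: a distro kernel may backport the fix to an older
    release, so a False result is a hint, not proof.
    """
    return parse_kernel_release(release) >= FIX_RELEASE


if __name__ == "__main__":
    running = platform.release()
    if mainline_has_fix(running):
        print(f"{running}: in-tree e1000e should already contain the fix")
    else:
        print(f"{running}: older than 4.13, in-tree e1000e likely lacks the fix")
```

On a stock Fedora 22 kernel this would be expected to report "older than
4.13", which would fit the theory that the race is still present there.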

What's the exact behavior you see after the first timeout? Do you keep seeing
more timeouts? I'm curious what other behavior you see. You might also check
the ethtool stats (ethtool -S on the interface) to see if any of the
timestamp statistics are incrementing, as this might help indicate the
problem.
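
To see whether the timestamp statistics move, one option is to snapshot
`ethtool -S` twice while the load is running and print only the counters that
changed. This is a minimal sketch, not a polished tool: the interface name
`eth0`, the 10 second interval, and the helper names are assumptions, and it
simply matches any counter with "tstamp" in its name rather than relying on
specific e1000e counter names.

```python
import subprocess
import sys
import time


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            # Skips the "NIC statistics:" header and any non-numeric values.
            continue
    return stats


def changed_timestamp_counters(before, after):
    """Return {name: delta} for counters containing 'tstamp' that changed."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if "tstamp" in name.lower() and after[name] != before.get(name, 0)
    }


def read_stats(interface):
    """Run `ethtool -S <interface>` and parse its output."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        check=True, capture_output=True, text=True,
    )
    return parse_ethtool_stats(result.stdout)


if __name__ == "__main__":
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"  # assumed name
    first = read_stats(iface)
    time.sleep(10)  # arbitrary interval; run the file copy during this window
    second = read_stats(iface)
    for name, delta in sorted(changed_timestamp_counters(first, second).items()):
        print(f"{name}: +{delta}")
```

If nothing is printed while ptp4l is reporting timeouts, that is useful
information too, since it suggests the driver is not counting the lost
timestamps at all.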

It's possible there is still some race condition in that driver that is causing 
failures to cascade.

Thanks,
Jake
_______________________________________________
Linuxptp-users mailing list
Linuxptp-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-users
