Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards

Ian Thompson Wed, 05 Apr 2017 07:16:22 -0700

All

Here’s a log with phc2sys output. This board ran for 11 hours without an error 
before this happened …


Apr  5 09:18:21 localhost user.info phc2sys: [38796.187] rms   28 max   65 freq 
  -242 +/-   1 delay  1099 +/-  11
Apr  5 09:20:21 localhost user.info phc2sys: [38916.211] rms   29 max   73 freq 
  -242 +/-   3 delay  1100 +/-  12
Apr  5 09:20:40 localhost user.info ptp4l: [38934.698] rms   29 max   76 freq   
-242 +/-   6 delay  2029 +/-   4
Apr  5 09:20:52 localhost user.err ptp4l: [38946.961] timed out while polling 
for tx timestamp
Apr  5 09:20:52 localhost user.err ptp4l: [38946.961] increasing 
tx_timestamp_timeout may correct this issue, but it is likely caused by a 
driver bug
Apr  5 09:20:52 localhost user.err ptp4l: [38946.962] Failure in 
transport_send(to)()
Apr  5 09:20:52 localhost user.err ptp4l: [38946.962] port 1: send delay 
request failed
Apr  5 09:20:52 localhost user.notice ptp4l: [38946.962] port 1: SLAVE to 
FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] port 
2cee26.fffe.000189-1 changed state
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] reconfiguring after 
port state change
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] selecting eth0 for 
synchronization
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] nothing to synchronize
Apr  5 09:21:08 localhost user.notice ptp4l: [38962.962] port 1: FAULTY to 
LISTENING on INIT_COMPLETE
Apr  5 09:21:08 localhost user.warn ptp4l: [38963.198] clockcheck: clock jumped 
backward or running slower than expected!
Apr  5 09:21:10 localhost user.notice ptp4l: [38964.698] port 1: new foreign 
master 000cec.fffe.0a0f8d-1
Apr  5 09:21:14 localhost user.notice ptp4l: [38968.698] selected best master 
clock 000cec.fffe.0a0f8d
Apr  5 09:21:14 localhost user.notice ptp4l: [38968.698] port 1: LISTENING to 
UNCALIBRATED on RS_SLAVE
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] port 
2cee26.fffe.000189-1 changed state
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] reconfiguring after 
port state change
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] master clock not 
ready, waiting...
Apr  5 09:21:21 localhost user.notice ptp4l: [38975.698] port 1: UNCALIBRATED 
to SLAVE on MASTER_CLOCK_SELECTED
Apr  5 09:21:21 localhost user.info phc2sys: [38976.229] port 
2cee26.fffe.000189-1 changed state
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] reconfiguring after 
port state change
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] selecting 
CLOCK_REALTIME for synchronization
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] selecting eth0 as the 
master clock
Apr  5 09:22:50 localhost user.info phc2sys: [39065.260] rms   34 max   88 freq 
  -241 +/-   5 delay  1100 +/-  10
Apr  5 09:24:50 localhost user.info phc2sys: [39185.288] rms   28 max   67 freq 
  -242 +/-   1 delay  1099 +/-  11
Apr  5 09:25:23 localhost user.info ptp4l: [39217.698] rms 3270369174 max 
37000003713 freq   -238 +/-  95 delay  2030 +/-   8
Apr  5 09:26:50 localhost user.info phc2sys: [39305.330] rms   29 max   70 freq 
  -241 +/-   2 delay  1099 +/-  10
Apr  5 09:28:50 localhost user.info phc2sys: [39425.376] rms   29 max   71 freq 
  -242 +/-   1 delay  1100 +/-  11
Apr  5 09:29:39 localhost user.info ptp4l: [39473.698] rms   29 max   75 freq   
-243 +/-   2 delay  2023 +/-   9

Ian T.



From: David Mirabito [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 6:18 PM
To: Ian Thompson
Cc: [email protected]
Subject: [External] Re: [Linuxptp-users] clockcheck jumps forwards and backwards

Hi,

The device is a:
  00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 
03)

Using
  bash-4.3# ethtool -i ma2
  driver: igb
  version: 5.3.0-k
  firmware-version: 0.0.0
  expansion-rom-version:
  bus-info: 0000:00:14.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: no

And:
 # ethtool -T ma2
 Time stamping parameters for ma2:
 Capabilities:
     hardware-transmit     (SOF_TIMESTAMPING_TX_HARDWARE)
     software-transmit     (SOF_TIMESTAMPING_TX_SOFTWARE)
     hardware-receive      (SOF_TIMESTAMPING_RX_HARDWARE)
     software-receive      (SOF_TIMESTAMPING_RX_SOFTWARE)
     software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
     hardware-raw-clock    (SOF_TIMESTAMPING_RAW_HARDWARE)
 PTP Hardware Clock: 1
 Hardware Transmit Timestamp Modes:
     off                   (HWTSTAMP_TX_OFF)
     on                    (HWTSTAMP_TX_ON)
 Hardware Receive Filter Modes:
     none                  (HWTSTAMP_FILTER_NONE)
     all                   (HWTSTAMP_FILTER_ALL)


The config is:

[global]
slaveOnly 1
summary_interval 6
priority1 255

[ma2]

And running as:
   /usr/sbin/ptp4l -f /etc/ptp4l.conf
   /usr/sbin/phc2sys -a -r -u 64 -n 5

We are running version 1.8, downloaded from the sourceforge mirror. It's built 
with openembedede/bitbake and their recipie defines some extra cflags, I can 
look iwhy these were deemed to be necessary or if they could affect anything:
EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN 
-DHAVE_ONESTEP_SYNC'"

I will look into obtaining more verbose logs.

For what it's worth, this exact same setup works elsewhere it is just this one 
physical setup that exhibits this, although unclear if the cause a physical 
fault or something about the network/master outside.

Additionally, since Ian brought it up
   a) We do sometimes see tx timestamp timeouts too
   b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb
My understanding is that b) is an intel bug (bad per-device assumptions made in 
code regarding default state of PPS IRQ) on this HW and seems to be generally 
treated as benign. I do have a slight suspicion that a and b may be somehow 
related (backing out of the unexpected wrap IRQ 'forgets' to notice the 
available tx timestamp being ready?) but I have some digging to to on that 
front.

I currently expect (although happy to be proven wrong) that both a) and b) are 
unrelated to the clockcheck jumps, since a+b happens readily and doesn't affect 
sync *too* badly, whereas constant clockcheck aborts happens only in one place 
and is apparently disastrous to sync quality.

Cheers, and thanks for your replies,
Dave

On Wed, Apr 5, 2017 at 1:45 AM, Ian Thompson 
<[email protected]<mailto:[email protected]>> wrote:
Possibly following on from David’s post.

We have a system with 18 boards in a rack, each board has a Altera SoC with the 
STM Ethernet MAC connected via gigabit Ethernet to an Arista ptp-aware switch 
and then a Spectracom GrandMaster.
The boards are running Linux kernel 3.15.0.

They lock quickly after boot and can remain locked for several hours but 
usually any one of the boards may do the following …

Apr  4 13:42:04 localhost 
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
 ptp4l: [537.164] rms  123 max  599 freq   +255 +/-  39 delay  7362 +/-  48
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] timed out while polling for 
tx timestamp
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] increasing 
tx_timestamp_timeout may correct this issue, but it is likely caused by a 
driver bug
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request 
failed
Apr  4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to FAULTY 
on FAULT_DETECTED (FT_UNSPECIFIED)
Apr  4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to 
LISTENING on FAULT_CLEARED
Apr  4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock jumped 
backward or running slower than expected!
Apr  4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign 
master 000cec.fffe.0a085d-1
Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] selected best master 
clock 000cec.fffe.0a085d
Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING to 
UNCALIBRATED on RS_SLAVE
Apr  4 13:42:54 localhost user.notice ptp4l: [587.164] port 1: UNCALIBRATED to 
SLAVE on MASTER_CLOCK_SELECTED
Apr  4 13:46:46 localhost 
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
 ptp4l: [818.414] rms 2312500092 max 37000001557 freq   +246 +/- 250 delay  
7358 +/-  46
Apr  4 13:51:02 localhost 
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
 ptp4l: [1074.413] rms  116 max  681 freq   +256 +/-  48 delay  7373 +/-  88

Does this imply that one lost delay request can do this, or is there a retry 
mechanism?
Notice that the system recovers but we can’t afford the large timing glitch 
that gets introduced.
We have a lot of traffic leaving the boards but only PTP traffic coming in. As 
we increase the off board transfer rates the problem seems to occur more often.

Thanks for any help,
Ian T.

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Linuxptp-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linuxptp-users

Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards

Reply via email to