All
Here’s a log with phc2sys output. This board ran for 11 hours without an error
before this happened …
Apr 5 09:18:21 localhost user.info phc2sys: [38796.187] rms 28 max 65 freq
-242 +/- 1 delay 1099 +/- 11
Apr 5 09:20:21 localhost user.info phc2sys: [38916.211] rms 29 max 73 freq
-242 +/- 3 delay 1100 +/- 12
Apr 5 09:20:40 localhost user.info ptp4l: [38934.698] rms 29 max 76 freq
-242 +/- 6 delay 2029 +/- 4
Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] timed out while polling
for tx timestamp
Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] increasing
tx_timestamp_timeout may correct this issue, but it is likely caused by a
driver bug
Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] Failure in
transport_send(to)()
Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] port 1: send delay
request failed
Apr 5 09:20:52 localhost user.notice ptp4l: [38946.962] port 1: SLAVE to
FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] port
2cee26.fffe.000189-1 changed state
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] reconfiguring after
port state change
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] selecting eth0 for
synchronization
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] nothing to synchronize
Apr 5 09:21:08 localhost user.notice ptp4l: [38962.962] port 1: FAULTY to
LISTENING on INIT_COMPLETE
Apr 5 09:21:08 localhost user.warn ptp4l: [38963.198] clockcheck: clock jumped
backward or running slower than expected!
Apr 5 09:21:10 localhost user.notice ptp4l: [38964.698] port 1: new foreign
master 000cec.fffe.0a0f8d-1
Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] selected best master
clock 000cec.fffe.0a0f8d
Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] port 1: LISTENING to
UNCALIBRATED on RS_SLAVE
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] port
2cee26.fffe.000189-1 changed state
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] reconfiguring after
port state change
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] master clock not
ready, waiting...
Apr 5 09:21:21 localhost user.notice ptp4l: [38975.698] port 1: UNCALIBRATED
to SLAVE on MASTER_CLOCK_SELECTED
Apr 5 09:21:21 localhost user.info phc2sys: [38976.229] port
2cee26.fffe.000189-1 changed state
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] reconfiguring after
port state change
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting
CLOCK_REALTIME for synchronization
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting eth0 as the
master clock
Apr 5 09:22:50 localhost user.info phc2sys: [39065.260] rms 34 max 88 freq
-241 +/- 5 delay 1100 +/- 10
Apr 5 09:24:50 localhost user.info phc2sys: [39185.288] rms 28 max 67 freq
-242 +/- 1 delay 1099 +/- 11
Apr 5 09:25:23 localhost user.info ptp4l: [39217.698] rms 3270369174 max
37000003713 freq -238 +/- 95 delay 2030 +/- 8
Apr 5 09:26:50 localhost user.info phc2sys: [39305.330] rms 29 max 70 freq
-241 +/- 2 delay 1099 +/- 10
Apr 5 09:28:50 localhost user.info phc2sys: [39425.376] rms 29 max 71 freq
-242 +/- 1 delay 1100 +/- 11
Apr 5 09:29:39 localhost user.info ptp4l: [39473.698] rms 29 max 75 freq
-243 +/- 2 delay 2023 +/- 9
Ian T.
From: David Mirabito [mailto:david.mirab...@gmail.com]
Sent: Tuesday, April 04, 2017 6:18 PM
To: Ian Thompson
Cc: linuxptp-users@lists.sourceforge.net
Subject: [External] Re: [Linuxptp-users] clockcheck jumps forwards and backwards
Hi,
The device is a:
00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev
03)
Using
bash-4.3# ethtool -i ma2
driver: igb
version: 5.3.0-k
firmware-version: 0.0.0
expansion-rom-version:
bus-info: 0000:00:14.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
And:
# ethtool -T ma2
Time stamping parameters for ma2:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
off (HWTSTAMP_TX_OFF)
on (HWTSTAMP_TX_ON)
Hardware Receive Filter Modes:
none (HWTSTAMP_FILTER_NONE)
all (HWTSTAMP_FILTER_ALL)
The config is:
[global]
slaveOnly 1
summary_interval 6
priority1 255
[ma2]
And running as:
/usr/sbin/ptp4l -f /etc/ptp4l.conf
/usr/sbin/phc2sys -a -r -u 64 -n 5
We are running version 1.8, downloaded from the sourceforge mirror. It's built
with openembedede/bitbake and their recipie defines some extra cflags, I can
look iwhy these were deemed to be necessary or if they could affect anything:
EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN
-DHAVE_ONESTEP_SYNC'"
I will look into obtaining more verbose logs.
For what it's worth, this exact same setup works elsewhere it is just this one
physical setup that exhibits this, although unclear if the cause a physical
fault or something about the network/master outside.
Additionally, since Ian brought it up
a) We do sometimes see tx timestamp timeouts too
b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb
My understanding is that b) is an intel bug (bad per-device assumptions made in
code regarding default state of PPS IRQ) on this HW and seems to be generally
treated as benign. I do have a slight suspicion that a and b may be somehow
related (backing out of the unexpected wrap IRQ 'forgets' to notice the
available tx timestamp being ready?) but I have some digging to to on that
front.
I currently expect (although happy to be proven wrong) that both a) and b) are
unrelated to the clockcheck jumps, since a+b happens readily and doesn't affect
sync *too* badly, whereas constant clockcheck aborts happens only in one place
and is apparently disastrous to sync quality.
Cheers, and thanks for your replies,
Dave
On Wed, Apr 5, 2017 at 1:45 AM, Ian Thompson
<ian.thomp...@pgs.com<mailto:ian.thomp...@pgs.com>> wrote:
Possibly following on from David’s post.
We have a system with 18 boards in a rack, each board has a Altera SoC with the
STM Ethernet MAC connected via gigabit Ethernet to an Arista ptp-aware switch
and then a Spectracom GrandMaster.
The boards are running Linux kernel 3.15.0.
They lock quickly after boot and can remain locked for several hours but
usually any one of the boards may do the following …
Apr 4 13:42:04 localhost
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
ptp4l: [537.164] rms 123 max 599 freq +255 +/- 39 delay 7362 +/- 48
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] timed out while polling for
tx timestamp
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] increasing
tx_timestamp_timeout may correct this issue, but it is likely caused by a
driver bug
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request
failed
Apr 4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to FAULTY
on FAULT_DETECTED (FT_UNSPECIFIED)
Apr 4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to
LISTENING on FAULT_CLEARED
Apr 4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock jumped
backward or running slower than expected!
Apr 4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign
master 000cec.fffe.0a085d-1
Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] selected best master
clock 000cec.fffe.0a085d
Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING to
UNCALIBRATED on RS_SLAVE
Apr 4 13:42:54 localhost user.notice ptp4l: [587.164] port 1: UNCALIBRATED to
SLAVE on MASTER_CLOCK_SELECTED
Apr 4 13:46:46 localhost
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
ptp4l: [818.414] rms 2312500092 max 37000001557 freq +246 +/- 250 delay
7358 +/- 46
Apr 4 13:51:02 localhost
user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=>
ptp4l: [1074.413] rms 116 max 681 freq +256 +/- 48 delay 7373 +/- 88
Does this imply that one lost delay request can do this, or is there a retry
mechanism?
Notice that the system recovers but we can’t afford the large timing glitch
that gets introduced.
We have a lot of traffic leaving the boards but only PTP traffic coming in. As
we increase the off board transfer rates the problem seems to occur more often.
Thanks for any help,
Ian T.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Linuxptp-users mailing list
Linuxptp-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-users