Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, 23 July 2018 at 12:43 +0200, Oliver Freyermuth wrote:
> > There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> > Fault", with a timestamp matching the timestamps below, and no more
> > information.
> 
> If this kind of failure (or a less severe one) also happens at
> runtime, mcelog should catch it. 

I'll install mcelog ASAP, even though it probably wouldn't have added
much in that case.
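
For the record, a minimal setup on Debian stretch would be something
like the following (package, service and unit names assumed to match
the stock Debian packaging):

  apt-get install mcelog          # userspace decoder for machine-check exceptions
  systemctl enable --now mcelog   # run the daemon so runtime MCEs get logged
  mcelog --client                 # query the running daemon for recorded errors
  journalctl -u mcelog            # or look for decoded events in the journal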

> For CATERR errors, we also found that sometimes the web interface of
> the BMC shows more information for the event log entry 
> than querying the event log via ipmitool - you may want to check
> this. 

I got that from the web interface. ipmitool does not give more
information anyway (lots of "missing" and "unknown", and no
description...):
ipmitool> sel get 118
SEL Record ID  : 0076
 Record Type   : 02
 Timestamp : 07/21/2018 01:58:48
 Generator ID  : 0020
 EvM Revision  : 04
 Sensor Type   : Unknown
 Sensor Number : 76
 Event Type: Sensor-specific Discrete
 Event Direction   : Assertion Event
 Event Data (RAW)  : 00
 Event Interpretation  : Missing
 Description   : 

Sensor ID  : CPU CATERR (0x76)
 Entity ID : 26.1
 Sensor Type (Discrete): Unknown
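
For completeness, the other obvious ipmitool queries would be roughly
the following (the sensor name is simply the one from the output
above), in case they show anything more on another BMC:

  ipmitool sel elist              # SEL entries with decoded sensor names
  ipmitool -v sel list            # verbose dump including raw event data
  ipmitool sdr get "CPU CATERR"   # sensor definition behind the event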

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Oliver Freyermuth
On 23.07.2018 at 11:39, Nicolas Huillard wrote:
> On Monday, 23 July 2018 at 10:28 +0200, Caspar Smit wrote:
>> Do you have any hardware watchdog running in the system? A watchdog
>> could trigger a powerdown if some threshold is met. Any event logs
>> from the chassis itself?
> 
> Nice suggestions ;-)
> 
> I see several [watchdog/N] kernel threads and one [watchdogd] thread,
> along with a "kernel: [0.116002] NMI watchdog: enabled on all CPUs,
> permanently consumes one hw-PMU counter." line in the kernel log, but
> no user-land watchdog daemon, so I'm not sure whether the watchdog is
> actually active.
> 
> There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> Fault", with a timestamp matching the timestamps below, and no more
> information.

If this kind of failure (or a less severe one) also happens at runtime, mcelog 
should catch it. 
For CATERR errors, we also found that sometimes the web interface of the BMC 
shows more information for the event log entry 
than querying the event log via ipmitool - you may want to check this. 


> If I understand correctly, this is a signal emitted by the CPU towards
> the BMC upon a "catastrophic error" (worse than "fatal"), to which the
> BMC may respond however it sees fit; Intel's suggestions include
> resetting the chassis.
> 
> https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf
> 
> Does that mean that the hardware is failing, or a neutrino just crossed
> some CPU register?
> CPU is a Xeon D-1521 with ECC memory.
> 
>> Kind regards,
> 
> Many thanks!
> 
>>
>> Caspar
>>
>> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard :
>>
>>> Hi all,
>>>
>>> One of my servers silently shut down last night, with no explanation
>>> whatsoever in any logs. According to the existing logs, the
>>> shutdown
>>> (without reboot) happened between 03:58:20.061452 (last timestamp
>>> from
>>> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
>>> election called, for which oxygene didn't answer).
>>>
>>> Is there any way in which Ceph could silently shut down a server?
>>> Can SMART self-test influence scrubbing or compaction?
>>>
>>> The only thing I have is that smartd started a long self-test on
>>> both OSD spinning drives on that host:
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
>>> test in
>>> progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
>>> test in
>>> progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
>>> previous
>>> self-test completed without error
>>>
>>> ...and smartctl now says that the self-tests didn't finish (on both
>>> drives):
>>> # 1  Extended offlineInterrupted (host
>>> reset)  00% 10636
>>> -
>>>
>>> MON logs on oxygene talk about RocksDB compaction a few minutes
>>> before
>>> the shutdown, and a deep-scrub finished earlier:
>>> /var/log/ceph/ceph-osd.6.log
>>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
>>> [DBG]
>>> : 6.1d deep-scrub starts
>>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
>>> [DBG]
>>> : 6.1d deep-scrub ok
>>> 2018-07-21 03:43:36.720707 7fd178082700  0 --
>>> 172.22.0.16:6801/478362 >>
>>> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>> l=1).handle_connect_msg: challenging authorizer
>>>
>>> /var/log/ceph/ceph-mgr.oxygene.log
>>> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
>>>
>>> /var/log/ceph/ceph-mon.oxygene.log
>>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
>>> Time
>>> 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/
>>> rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual
>>> compaction
>>> from level-0 to level-1 from 'mgrstat .. '
>>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb:
>>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403]
>>> [default] [JOB
>>> 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
>>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb:
>>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407]
>>> [default]
>>> Compaction start summary: Base version 1745 Base level 0, inputs:
>>> [149507(602KB)], [149505(13MB)]
>>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
>>> {"time_micros": 1532137947702334, "job": 1746, "event":
>>> "compaction_started", "files_L0": [149507], "files_L1": [149505],
>>> "score":
>>> -1, "input_data_size": 14916379}

Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, 23 July 2018 at 10:28 +0200, Caspar Smit wrote:
> Do you have any hardware watchdog running in the system? A watchdog
> could trigger a powerdown if some threshold is met. Any event logs
> from the chassis itself?

Nice suggestions ;-)

I see several [watchdog/N] kernel threads and one [watchdogd] thread,
along with a "kernel: [0.116002] NMI watchdog: enabled on all CPUs,
permanently consumes one hw-PMU counter." line in the kernel log, but
no user-land watchdog daemon, so I'm not sure whether the watchdog is
actually active.
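
A quick way to check would be something like this (standard sysfs and
device paths assumed; I have not dug further yet):

  cat /proc/sys/kernel/nmi_watchdog   # 1 = NMI watchdog armed, matches the dmesg line above
  ls -l /dev/watchdog*                # device node only exists if a watchdog driver is loaded
  lsmod | grep -Ei 'wdt|watchdog'     # e.g. iTCO_wdt on Intel platforms
  systemctl status watchdog.service   # user-land daemon (not installed here)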

There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
Fault", with a timestamp matching the timestamps below, and no more
information.
If I understand correctly, this is a signal emitted by the CPU towards
the BMC upon a "catastrophic error" (worse than "fatal"), to which the
BMC may respond however it sees fit; Intel's suggestions include
resetting the chassis.

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf

Does that mean that the hardware is failing, or a neutrino just crossed
some CPU register?
CPU is a Xeon D-1521 with ECC memory.
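
To rule out (or at least count) ECC events before blaming cosmic rays,
the EDAC counters seem to be the place to look, something like this
(edac-utils assumed to be available; paths are the standard sysfs ones):

  apt-get install edac-utils
  edac-util -v                                      # per-controller/DIMM corrected and uncorrected error counts
  grep . /sys/devices/system/edac/mc/mc*/ce_count   # raw corrected-error counters
  grep . /sys/devices/system/edac/mc/mc*/ue_count   # uncorrected-error counters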

> Kind regards,

Many thanks!

> 
> Caspar
> 
> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard :
> 
> > Hi all,
> > 
> > One of my servers silently shut down last night, with no explanation
> > whatsoever in any logs. According to the existing logs, the
> > shutdown
> > (without reboot) happened between 03:58:20.061452 (last timestamp
> > from
> > /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> > election called, for which oxygene didn't answer).
> > 
> > Is there any way in which Ceph could silently shut down a server?
> > Can SMART self-test influence scrubbing or compaction?
> > 
> > The only thing I have is that smartd started a long self-test on
> > both OSD spinning drives on that host:
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> > test in
> > progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> > test in
> > progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > previous
> > self-test completed without error
> > 
> > ...and smartctl now says that the self-tests didn't finish (on both
> > drives):
> > # 1  Extended offlineInterrupted (host
> > reset)  00% 10636
> > -
> > 
> > MON logs on oxygene talk about RocksDB compaction a few minutes
> > before
> > the shutdown, and a deep-scrub finished earlier:
> > /var/log/ceph/ceph-osd.6.log
> > 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> > [DBG]
> > : 6.1d deep-scrub starts
> > 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> > [DBG]
> > : 6.1d deep-scrub ok
> > 2018-07-21 03:43:36.720707 7fd178082700  0 --
> > 172.22.0.16:6801/478362 >>
> > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=1).handle_connect_msg: challenging authorizer
> > 
> > /var/log/ceph/ceph-mgr.oxygene.log
> > 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> > 
> > /var/log/ceph/ceph-mon.oxygene.log
> > 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> > Time
> > 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/
> > rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual
> > compaction
> > from level-0 to level-1 from 'mgrstat .. '
> > 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb:
> > [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403]
> > [default] [JOB
> > 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
> > 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb:
> > [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407]
> > [default]
> > Compaction start summary: Base version 1745 Base level 0, inputs:
> > [149507(602KB)], [149505(13MB)]
> > 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros": 1532137947702334, "job": 1746, "event":
> > "compaction_started", "files_L0": [149507], "files_L1": [149505],
> > "score":
> > -1, "input_data_size": 14916379}
> > 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb:
> > [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116]
> > [default] [JOB
> > 1746] Generated table #149508: 4904 keys, 14808953 bytes
> > 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros": 1532137947785565, "cf_name": "default", "job":
> > 1746,
> > "event": "table_file_creation", "file_number": 149508, "