Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)
On Monday, 23 July 2018 at 12:43 +0200, Oliver Freyermuth wrote:
> > There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> > Fault", with a timestamp matching the timestamps below, and no more
> > information.
>
> If this kind of failure (or a less severe one) also happens at
> runtime, mcelog should catch it.

I'll install mcelog ASAP, even though it probably wouldn't have added
much in that case.

> For CATERR errors, we also found that sometimes the web interface of
> the BMC shows more information for the event log entry than querying
> the event log via ipmitool - you may want to check this.

I got that from the web interface. ipmitool does not give more
information anyway (lots of "missing" and "unknown", and no
description...):

ipmitool> sel get 118
SEL Record ID          : 0076
 Record Type           : 02
 Timestamp             : 07/21/2018 01:58:48
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Unknown
 Sensor Number         : 76
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data (RAW)      : 00
 Event Interpretation  : Missing
 Description           :

Sensor ID              : CPU CATERR (0x76)
 Entity ID             : 26.1
 Sensor Type (Discrete): Unknown

-- 
Nicolas Huillard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
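One detail worth cross-checking: BMC clocks frequently run in UTC while the syslog timestamps in this thread are at +0200, which would put the 01:58:48 SEL entry at 03:58:48 local time, squarely inside the silence window in the Ceph logs. A minimal sketch of that check (the UTC assumption is mine; the thread doesn't say which timezone the BMC uses):

```python
from datetime import datetime, timedelta

# SEL timestamp as printed by ipmitool; BMC clocks frequently run in UTC
# (an assumption here -- the thread doesn't state the BMC's timezone).
sel_utc = datetime.strptime("07/21/2018 01:58:48", "%m/%d/%Y %H:%M:%S")
sel_local = sel_utc + timedelta(hours=2)  # syslog timestamps are at +0200

# Window in which the host went silent, taken from the Ceph logs (local time):
last_beacon = datetime(2018, 7, 21, 3, 58, 20)  # last mgr send_beacon
mon_election = datetime(2018, 7, 21, 3, 59, 1)  # MON election oxygene missed

# Prints True: the CATERR assertion lands inside the 41-second silence window.
print(last_beacon <= sel_local <= mon_election)
```

If the BMC clock were local time instead, the SEL entry would predate the shutdown by two hours, so this is worth verifying against the BMC's time settings.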
Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)
On 23 Jul 2018 at 11:39, Nicolas Huillard wrote:
> On Monday, 23 July 2018 at 10:28 +0200, Caspar Smit wrote:
>> Do you have any hardware watchdog running in the system? A watchdog
>> could trigger a powerdown if it meets some value. Any event logs
>> from the chassis itself?
>
> Nice suggestions ;-)
>
> I see some [watchdog/N] and one [watchdogd] kernel threads, along with
> a "kernel: [0.116002] NMI watchdog: enabled on all CPUs, permanently
> consumes one hw-PMU counter." line in the kernel log, but no user-land
> watchdog daemon: I'm not sure whether the watchdog is actually active.
>
> There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> Fault", with a timestamp matching the timestamps below, and no more
> information.

If this kind of failure (or a less severe one) also happens at runtime,
mcelog should catch it.

For CATERR errors, we also found that sometimes the web interface of
the BMC shows more information for the event log entry than querying
the event log via ipmitool - you may want to check this.

> If I understand correctly, this is a signal emitted by the CPU to the
> BMC upon a "catastrophic error" (more severe than "fatal"), to which
> the BMC may respond as it sees fit; Intel's suggestions include
> resetting the chassis.
>
> https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf
>
> Does that mean that the hardware is failing, or that a neutrino just
> crossed some CPU register?
> The CPU is a Xeon D-1521 with ECC memory.
>
>> Kind regards,
>
> Many thanks!
>
>> Caspar
>>
>> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard:
>>
>>> Hi all,
>>>
>>> One of my servers silently shut down last night, with no explanation
>>> whatsoever in any logs. According to the existing logs, the shutdown
>>> (without reboot) happened between 03:58:20.061452 (last timestamp
>>> from /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new
>>> MON election called, for which oxygene didn't answer).
>>>
>>> Is there any way in which Ceph could silently shut down a server?
>>> Can a SMART self-test influence scrubbing or compaction?
>>>
>>> The only thing I have is that smartd started a long self-test on
>>> both OSD spinning drives on that host:
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous self-test completed without error
>>>
>>> ...and smartctl now says that the self-tests didn't finish (on both
>>> drives):
>>> # 1  Extended offline    Interrupted (host reset)    00%    10636    -
>>>
>>> MON logs on oxygene talk about rocksdb compaction a few minutes
>>> before the shutdown, and a deep-scrub finished earlier:
>>> /var/log/ceph/ceph-osd.6.log
>>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub starts
>>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub ok
>>> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
>>>
>>> /var/log/ceph/ceph-mgr.oxygene.log
>>> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
>>>
>>> /var/log/ceph/ceph-mon.oxygene.log
>>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual compaction from level-0 to level-1 from 'mgrstat .. '
>>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
>>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction start summary: Base version 1745 Base level 0, inputs: [149507(602KB)], [149505(13MB)]
>>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947702334, "job": 1746, "event": "compaction_started", "files_L0": [149507], "files_L1": [149505], "score": -1, "input_data_size": 14916379}
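Pinpointing the moment a host went silent, as done by hand above, amounts to finding the largest gap between consecutive timestamps in a log. A small sketch of that scan, using the Luminous log format quoted above (the post-reboot line in the sample is hypothetical, purely for illustration):

```python
from datetime import datetime

def largest_gap(lines):
    """Return (prev, next, seconds) for the largest gap between
    consecutive 'YYYY-MM-DD HH:MM:SS.ffffff ...' log timestamps."""
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f"))
        except ValueError:
            continue  # skip continuation / non-timestamped lines
    best = None
    for a, b in zip(stamps, stamps[1:]):
        gap = (b - a).total_seconds()
        if best is None or gap > best[2]:
            best = (a, b, gap)
    return best

log = [
    "2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby",
    "2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby",
    "2018-07-21 08:12:03.000000 7fbcd300  1 mgr init",  # hypothetical first line after reboot
]
a, b, gap = largest_gap(log)
print(a, "->", b, f"({gap:.0f}s)")  # the ~4h13m gap brackets the outage
```

Run against the real ceph-mgr.oxygene.log this would recover the 03:58:20.061452 "last seen" timestamp directly, without grepping by eye.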
Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)
On Monday, 23 July 2018 at 10:28 +0200, Caspar Smit wrote:
> Do you have any hardware watchdog running in the system? A watchdog
> could trigger a powerdown if it meets some value. Any event logs from
> the chassis itself?

Nice suggestions ;-)

I see some [watchdog/N] and one [watchdogd] kernel threads, along with
a "kernel: [0.116002] NMI watchdog: enabled on all CPUs, permanently
consumes one hw-PMU counter." line in the kernel log, but no user-land
watchdog daemon: I'm not sure whether the watchdog is actually active.

There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
Fault", with a timestamp matching the timestamps below, and no more
information.

If I understand correctly, this is a signal emitted by the CPU to the
BMC upon a "catastrophic error" (more severe than "fatal"), to which
the BMC may respond as it sees fit; Intel's suggestions include
resetting the chassis.

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf

Does that mean that the hardware is failing, or that a neutrino just
crossed some CPU register?
The CPU is a Xeon D-1521 with ECC memory.

> Kind regards,

Many thanks!

> Caspar
>
> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard:
>
>> Hi all,
>>
>> One of my servers silently shut down last night, with no explanation
>> whatsoever in any logs. According to the existing logs, the shutdown
>> (without reboot) happened between 03:58:20.061452 (last timestamp
>> from /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new
>> MON election called, for which oxygene didn't answer).
>>
>> Is there any way in which Ceph could silently shut down a server?
>> Can a SMART self-test influence scrubbing or compaction?
>>
>> The only thing I have is that smartd started a long self-test on
>> both OSD spinning drives on that host:
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting scheduled Long Self-Test.
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting scheduled Long Self-Test.
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in progress, 90% remaining
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous self-test completed without error
>>
>> ...and smartctl now says that the self-tests didn't finish (on both
>> drives):
>> # 1  Extended offline    Interrupted (host reset)    00%    10636    -
>>
>> MON logs on oxygene talk about rocksdb compaction a few minutes
>> before the shutdown, and a deep-scrub finished earlier:
>> /var/log/ceph/ceph-osd.6.log
>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub starts
>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub ok
>> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
>>
>> /var/log/ceph/ceph-mgr.oxygene.log
>> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
>> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
>> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
>>
>> /var/log/ceph/ceph-mon.oxygene.log
>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual compaction from level-0 to level-1 from 'mgrstat .. '
>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction start summary: Base version 1745 Base level 0, inputs: [149507(602KB)], [149505(13MB)]
>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947702334, "job": 1746, "event": "compaction_started", "files_L0": [149507], "files_L1": [149505], "score": -1, "input_data_size": 14916379}
>> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746] Generated table #149508: 4904 keys, 14808953 bytes
>> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746, "event": "table_file_creation", "file_number": 149508, "