BTW under 12.0-STABLE (built this afternoon after the advisories came out, with the patches) it's MUCH worse. I get the same device resets BUT it's followed by an immediate panic which I cannot dump as it generates a page-fault (supervisor read data, page not present) in the mps *driver* at mpssas_send_abort+0x21.
This precludes a dump of course since attempting to do so gives you a double-panic (I was wondering why I didn't get a crash dump!); I'll re-jigger the box to stick a dump device on an internal SATA device so I can successfully get the dump when it happens and see if I can obtain a proper crash dump on this. I think it's fair to assume that 12.0-STABLE should not panic on a disk problem (unless of course the problem is trying to page something back in -- it's not, the drive that aborts and resets is on a data pack doing a scrub) On 2/5/2019 10:26, Karl Denninger wrote: > On 2/5/2019 09:22, Karl Denninger wrote: >> On 2/2/2019 12:02, Karl Denninger wrote: >>> I recently started having some really oddball things happening under >>> stress. This coincided with the machine being updated to 11.2-STABLE >>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1. >>> >>> Specifically, I get "errors" like this: >>> >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00 >>> length 131072 SMID 269 Aborting command 0xfffffe0001179110 >>> mps0: Sending reset from mpssas_send_abort for target ID 37 >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00 >>> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state >>> c xfer 0 >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00 >>> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state >>> c xfer 0 >>> mps0: Unfreezing devq for target ID 37 >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00 >>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error >>> (da12:mps0:0:37:0): Retrying command >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00 >>> (da12:mps0:0:37:0): CAM status: Command timeout >>> (da12:mps0:0:37:0): Retrying command >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00 >>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error >>> (da12:mps0:0:37:0): Retrying command >>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00 >>> (da12:mps0:0:37:0): CAM status: SCSI Status Error >>> (da12:mps0:0:37:0): SCSI status: Check Condition >>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, >>> reset, or bus device reset occurred) >>> (da12:mps0:0:37:0): Retrying command (per sense data) >>> >>> The "Unit Attention" implies the drive reset. It only occurs on certain >>> drives under very heavy load (e.g. a scrub.) I've managed to provoke it >>> on two different brands of disk across multiple firmware and capacities, >>> however, which tends to point away from a drive firmware problem. >>> >>> A look at the pool data shows /no /errors (e.g. no checksum problems, >>> etc) and a look at the disk itself (using smartctl) shows no problems >>> either -- whatever is going on here the adapter is recovering from it >>> without any data corruption or loss registered on *either end*! >>> >>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows: >>> >>> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem >>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3 >>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd >>> mps0: IOCCapabilities: >>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> >> After considerable additional work this looks increasingly like either a >> missed interrupt or a command is getting lost between the host adapter >> and the expander. >> >> I'm going to turn the driver debug level up and see if I can capture >> more information..... whatever is behind this, however, it is >> almost-certainly related to something that changed between 11.1 and >> 11.2, as I never saw these on the 11.1-STABLE build. >> >> -- >> Karl Denninger >> k...@denninger.net <mailto:k...@denninger.net> >> /The Market Ticker/ >> /[S/MIME encrypted email preferred]/ > Pretty decent trace here -- any ideas? > > mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0 > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00 > length 131072 SMID 634 Aborting command 0xfffffe00011b5020 > mps0: Sending reset from mpssas_send_abort for target ID 37 > mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm > 0xfffffe00011812a0 > mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm > 0xfffffe00011812a0 > mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm > 0xfffffe00011812a0 > mps0: EventReply : > EventDataLength: 2 > AckRequired: 0 > Event: SasDiscovery (0x16) > EventContext: 0x0 > Flags: 1<InProgress> > ReasonCode: Discovery Started > PhysicalPort: 0 > DiscoveryStatus: 0 > mps0: (0)->(mpssas_fw_work) Working on Event: [16] > mps0: (1)->(mpssas_fw_work) Event Free: [16] > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 > length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb > 0xfffff8019458e000 during recovery ioc 804b scsi 0 state c > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length > 131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0 > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00 > length 131072 SMID 634 completed timedout cm > 0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 > 01 00 00 > 020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state > c(da11:mps0:0:37:0): CAM status: CCB request completed with an error > (da11:mps0:0:37:0): Retrying command > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 > length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb > 0xfffff80054e86000 during recovery ioc 804b scsi 0 state c > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length > 131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0 > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 > length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb > 0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length > 131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0 > (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0 > count 4 > (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting > TaskMID 634 > mps0: Unfreezing devq for target ID 37 > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00 > (da11:mps0:0:37:0): CAM status: Command timeout > (da11:mps0:0:37:0): Retrying command > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 > (da11:mps0:0:37:0): CAM status: CCB request completed with an error > (da11:mps0:0:37:0): Retrying command > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 > (da11:mps0:0:37:0): CAM status: CCB request completed with an error > (da11:mps0:0:37:0): Retrying command > mps0: EventReply : > EventDataLength: 4 > AckRequired: 0 > Event: SasTopologyChangeList (0x1c) > EventContext: 0x0 > EnclosureHandle: 0x2 > ExpanderDevHandle: 0x9 > NumPhys: 25 > NumEntries: 1 > StartPhyNum: 12 > ExpStatus: Responding (0x3) > PhysicalPort: 0 > PHY[12].AttachedDevHandle: 0x0013 > PHY[12].LinkRate: 6.0Gbps (0xaa) > PHY[12].PhyStatus: PHYLinkStatusChange > mps0: (1)->(mpssas_fw_work) Working on Event: [1c] > mps0: (2)->(mpssas_fw_work) Event Free: [1c] > mps0: EventReply : > EventDataLength: 2 > AckRequired: 0 > Event: SasDiscovery (0x16) > EventContext: 0x0 > Flags: 0 > ReasonCode: Discovery Complete > PhysicalPort: 0 > DiscoveryStatus: 0 > mps0: (2)->(mpssas_fw_work) Working on Event: [16] > mps0: (3)->(mpssas_fw_work) Event Free: [16] > (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 > (da11:mps0:0:37:0): CAM status: SCSI Status Error > (da11:mps0:0:37:0): SCSI status: Check Condition > (da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, > reset, or bus device reset occurred) > (da11:mps0:0:37:0): Retrying command (per sense data) > > This is what's running right now that produced that -- note no errors > reported on the two providers (the degraded state is intentional; the > third unit is off-site in a bank vault): > > [karl@NewFS /var/log]$ zpool status backup > pool: backup > state: DEGRADED > status: One or more devices has been taken offline by the administrator. > Sufficient replicas exist for the pool to continue functioning in a > degraded state. > action: Online the device using 'zpool online' or replace the device with > 'zpool replace'. > scan: scrub in progress since Tue Feb 5 06:57:14 2019 > 840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total > 0 repaired, 15.04% done, 0 days 06:44:17 to go > config: > > NAME STATE READ WRITE CKSUM > backup DEGRADED 0 0 0 > mirror-0 DEGRADED 0 0 0 > gpt/backup61.eli ONLINE 0 0 0 > gpt/backup62-1.eli ONLINE 0 0 0 > 13282812295755460479 OFFLINE 0 0 0 was > /dev/gpt/backup62-2.eli > > errors: No known data errors > > But, here's the SMART data on the disk -- it DOES show the reset: > > Before.... > > Device Statistics (GP Log 0x04) > Page Offset Size Value Flags Description > 0x01 ===== = = === == General Statistics (rev 2) == > 0x01 0x008 4 19 --- Lifetime Power-On Resets > 0x01 0x018 6 32002138534 --- Logical Sectors Written > 0x01 0x020 6 155895325 --- Number of Write Commands > 0x01 0x028 6 265457016500 --- Logical Sectors Read > 0x01 0x030 6 1132601490 --- Number of Read Commands > 0x01 0x038 6 61537515300 --- Date and Time TimeStamp > 0x03 ===== = = === == Rotating Media Statistics (rev 1) == > 0x03 0x008 4 6114 --- Spindle Motor Power-on Hours > 0x03 0x010 4 6114 --- Head Flying Hours > 0x03 0x018 4 30602 --- Head Load Events > 0x03 0x020 4 0 --- Number of Reallocated Logical Sectors > 0x03 0x028 4 283735 --- Read Recovery Attempts > 0x03 0x030 4 0 --- Number of Mechanical Start Failures > 0x04 ===== = = === == General Errors Statistics (rev 1) == > 0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors > 0x04 0x010 4 17 --- Resets Between Cmd Acceptance and > Completio > n > 0x05 ===== = = === == Temperature Statistics (rev 1) == > 0x05 0x008 1 38 --- Current Temperature > 0x05 0x010 1 40 N-- Average Short Term Temperature > 0x05 0x018 1 32 N-- Average Long Term Temperature > 0x05 0x020 1 52 --- Highest Temperature > 0x05 0x028 1 23 --- Lowest Temperature > 0x05 0x030 1 48 N-- Highest Average Short Term Temperature > 0x05 0x038 1 25 N-- Lowest Average Short Term Temperature > 0x05 0x040 1 44 N-- Highest Average Long Term Temperature > 0x05 0x048 1 25 N-- Lowest Average Long Term Temperature > 0x05 0x050 4 0 --- Time in Over-Temperature > 0x05 0x058 1 60 --- Specified Maximum Operating Temperature > 0x05 0x060 4 0 --- Time in Under-Temperature > 0x05 0x068 1 0 --- Specified Minimum Operating Temperature > 0x06 ===== = = === == Transport Statistics (rev 1) == > 0x06 0x008 4 6 --- Number of Hardware Resets > 0x06 0x010 4 0 --- Number of ASR Events > 0x06 0x018 4 0 --- Number of Interface CRC Errors > |||_ C monitored condition met > ||__ D supports DSN > |___ N normalized value > > And after.... > > Device Statistics (GP Log 0x04) > Page Offset Size Value Flags Description > 0x01 ===== = = === == General Statistics (rev 2) == > 0x01 0x008 4 19 --- Lifetime Power-On Resets > 0x01 0x018 6 32004195070 --- Logical Sectors Written > 0x01 0x020 6 155946732 --- Number of Write Commands > 0x01 0x028 6 266340052468 --- Logical Sectors Read > 0x01 0x030 6 1136347463 --- Number of Read Commands > 0x01 0x038 6 61540814700 --- Date and Time TimeStamp > 0x03 ===== = = === == Rotating Media Statistics (rev 1) == > 0x03 0x008 4 6114 --- Spindle Motor Power-on Hours > 0x03 0x010 4 6114 --- Head Flying Hours > 0x03 0x018 4 30602 --- Head Load Events > 0x03 0x020 4 0 --- Number of Reallocated Logical Sectors > 0x03 0x028 4 283848 --- Read Recovery Attempts > 0x03 0x030 4 0 --- Number of Mechanical Start Failures > 0x04 ===== = = === == General Errors Statistics (rev 1) == > 0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors > 0x04 0x010 4 17 --- Resets Between Cmd Acceptance and > Completion > 0x05 ===== = = === == Temperature Statistics (rev 1) == > 0x05 0x008 1 41 --- Current Temperature > 0x05 0x010 1 40 N-- Average Short Term Temperature > 0x05 0x018 1 32 N-- Average Long Term Temperature > 0x05 0x020 1 52 --- Highest Temperature > 0x05 0x028 1 23 --- Lowest Temperature > 0x05 0x030 1 48 N-- Highest Average Short Term Temperature > 0x05 0x038 1 25 N-- Lowest Average Short Term Temperature > 0x05 0x040 1 44 N-- Highest Average Long Term Temperature > 0x05 0x048 1 25 N-- Lowest Average Long Term Temperature > 0x05 0x050 4 0 --- Time in Over-Temperature > 0x05 0x058 1 60 --- Specified Maximum Operating Temperature > 0x05 0x060 4 0 --- Time in Under-Temperature > 0x05 0x068 1 0 --- Specified Minimum Operating Temperature > 0x06 ===== = = === == Transport Statistics (rev 1) == > 0x06 0x008 4 7 --- Number of Hardware Resets > 0x06 0x010 4 0 --- Number of ASR Events > 0x06 0x018 4 0 --- Number of Interface CRC Errors > |||_ C monitored condition met > ||__ D supports DSN > |___ N normalized value > > Number of Hardware Resets has incremented. There are no other errors shown: > > SMART Attributes Data Structure revision number: 16 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE > 1 Raw_Read_Error_Rate PO-R-- 100 100 016 - 0 > 2 Throughput_Performance P-S--- 137 137 054 - 104 > 3 Spin_Up_Time POS--- 132 132 024 - 497 > (Average 497) > 4 Start_Stop_Count -O--C- 099 099 000 - 7074 > 5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0 > 7 Seek_Error_Rate PO-R-- 100 100 067 - 0 > 8 Seek_Time_Performance P-S--- 128 128 020 - 18 > 9 Power_On_Hours -O--C- 098 098 000 - 17094 > 10 Spin_Retry_Count PO--C- 100 100 060 - 0 > 12 Power_Cycle_Count -O--CK 100 100 000 - 19 > 192 Power-Off_Retract_Count -O--CK 075 075 000 - 30602 > 193 Load_Cycle_Count -O--C- 075 075 000 - 30602 > 194 Temperature_Celsius -O---- 146 146 000 - 41 (Min/Max > 23/52) > 196 Reallocated_Event_Count -O--CK 100 100 000 - 0 > 197 Current_Pending_Sector -O---K 100 100 000 - 0 > 198 Offline_Uncorrectable ---R-- 100 100 000 - 0 > 199 UDMA_CRC_Error_Count -O-R-- 200 200 000 - 0 > ||||||_ K auto-keep > |||||__ C event count > ||||___ R error rate > |||____ S speed/performance > ||_____ O updated online > |______ P prefailure warning > > I'd throw possible shade at the backplane or cable /but I have already > swapped both out for spares without any change in behavior./ > -- Karl Denninger k...@denninger.net <mailto:k...@denninger.net> /The Market Ticker/ /[S/MIME encrypted email preferred]/
smime.p7s
Description: S/MIME Cryptographic Signature