Re: RAID1: system stability
On 2015-07-22 07:00, Russell Coker wrote:
> On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:
>> OK I actually don't know what the intended block layer behavior is
>> when unplugging a device, if it is supposed to vanish, or change state
>> somehow so that things that depend on it can know it's missing or
>> what. So the question here is, is this working as intended? If the
>> layer Btrfs depends on isn't working as intended, then Btrfs is
>> probably going to do wild and crazy things. And I don't know that the
>> part of the block layer Btrfs depends on for this is the same (or
>> different) as what the md driver depends on.
>
> I disagree with that statement. BTRFS should be expected to not do wild
> and crazy things regardless of what happens with block devices.

I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and, if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us.

> A BTRFS RAID-1/5/6 array should cope with a single disk failing or
> returning any manner of corrupted data and should not lose data or
> panic the kernel.

It's debatable, however, whether the array should go read-only when degraded. MD/DM RAID (at least, AFAIK) and most hardware RAID controllers I've seen will still accept writes to degraded arrays, although there are arguments for forcing read-only as well. Personally, I think that should be controlled by a mount option, so the sysadmin can decide, as it really is a policy decision.

> A BTRFS RAID-0 or single disk setup should cope with a disk giving
> errors by mounting read-only or failing all operations on the
> filesystem. It should not affect any other filesystem or have any
> significant impact on the system unless it's the root filesystem.

Or some other critical filesystem (there are still people who put /usr and/or /var on separate filesystems). Ideally, I'd love to see some kind of warning from the kernel if a filesystem gets mounted that has the metadata/system profile set to raid0 (and possibly have some of the tools emit such a warning as well).
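Such a userspace warning could be sketched by parsing `btrfs filesystem df` output. A minimal illustration, not an existing tool -- the sample text below is modeled on that command's output, and the values are assumed:

```python
import re

# Modeled on `btrfs filesystem df /mnt` output for a raid0 filesystem
# (values assumed for illustration).
SAMPLE = """\
Data, RAID0: total=8.00GiB, used=6.45GiB
System, RAID0: total=8.00MiB, used=16.00KiB
Metadata, RAID0: total=1.00GiB, used=450.25MiB
"""

def risky_profiles(df_output):
    """Return (block group type, profile) pairs where System or Metadata
    chunks use raid0, i.e. where one disk failure loses the filesystem."""
    risky = []
    for line in df_output.splitlines():
        m = re.match(r"(Data|System|Metadata)(?:, (\w+))?:", line)
        if not m:
            continue
        bg_type, profile = m.group(1), m.group(2) or "single"
        if bg_type in ("System", "Metadata") and profile == "RAID0":
            risky.append((bg_type, profile))
    return risky

for bg_type, profile in risky_profiles(SAMPLE):
    print(f"WARNING: {bg_type} chunks use {profile} -- one disk failure "
          f"loses the whole filesystem")
```

The same check could just as easily live in mkfs/balance tooling, which is where the warning would do the most good.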
Re: RAID1: system stability
On Wednesday, 5 August 2015, 13:32:41, Austin S Hemmelgarn wrote:
> On 2015-07-22 07:00, Russell Coker wrote:
>> On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:
>>> OK I actually don't know what the intended block layer behavior is
>>> when unplugging a device, if it is supposed to vanish, or change
>>> state somehow so that things that depend on it can know it's missing
>>> or what. So the question here is, is this working as intended? If
>>> the layer Btrfs depends on isn't working as intended, then Btrfs is
>>> probably going to do wild and crazy things. And I don't know that
>>> the part of the block layer Btrfs depends on for this is the same
>>> (or different) as what the md driver depends on.
>>
>> I disagree with that statement. BTRFS should be expected to not do
>> wild and crazy things regardless of what happens with block devices.
>
> I would generally agree with this, although we really shouldn't be
> doing things like trying to handle hardware failures without user
> intervention. If a block device disappears from under us, we should
> throw a warning and, if it's the last device in the FS, kill anything
> that is trying to read or write to that FS. At the very least, we
> should try to avoid hanging or panicking the system if all of the
> devices in an FS disappear out from under us.

The best solution I have ever seen for removable media is in AmigaOS. You remove a disk (or nowadays a USB stick) while it is being written to, and AmigaDOS/AmigaOS pops up a dialog window saying "You MUST insert volume $VOLUMENAME again." And if you did, it just continued writing. I bet this may be difficult to do on Linux for all devices, as unwritten changes pile up in memory until dirty limits are reached, unless one says "okay, disk gone, we block all processes writing to it immediately or quite soon" -- but for removable media I never saw anything else with that amount of sanity. There was a GSoC project for NetBSD once to implement this, but I don't know whether it is implemented there now.

For AmigaOS and floppy disks with the filesystems of that era there was just one catch: if you didn't insert the disk again, it was often broken beyond repair. For a journaling or COW filesystem it would just be like any other sudden stop to writes.

On Linux with eSATA I saw that I can also replug the disk if I didn't yet hit the timeouts in the block layer. After that, the disk is gone.

Ciao,
--
Martin
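The block-layer timeout mentioned here is per-device and tunable through sysfs. A hedged sketch of raising it so a brief unplug/replug can complete before the kernel gives up on outstanding I/O -- the device name `sde` is an assumption, and the demo writes to a scratch directory standing in for /sys (on a real system you would write, as root, to /sys/block/<dev>/device/timeout):

```shell
#!/bin/sh
# Raise the SCSI command timeout (in seconds) for a disk.
bump_timeout() {
    sysroot=$1; dev=$2; secs=$3
    echo "$secs" > "$sysroot/block/$dev/device/timeout"
}

# Demo against a scratch directory standing in for /sys, so this sketch
# can run without root or real hardware:
SYSROOT=$(mktemp -d)
mkdir -p "$SYSROOT/block/sde/device"
echo 30 > "$SYSROOT/block/sde/device/timeout"   # 30s is the usual default
bump_timeout "$SYSROOT" sde 180
cat "$SYSROOT/block/sde/device/timeout"
```

Note this only widens the window; once the timeout fires and the device is torn down, replugging produces a new device node (sdg in the tests later in this thread) rather than reviving the old one.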
Re: RAID1: system stability
On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:
> OK I actually don't know what the intended block layer behavior is
> when unplugging a device, if it is supposed to vanish, or change state
> somehow so that things that depend on it can know it's missing or
> what. So the question here is, is this working as intended? If the
> layer Btrfs depends on isn't working as intended, then Btrfs is
> probably going to do wild and crazy things. And I don't know that the
> part of the block layer Btrfs depends on for this is the same (or
> different) as what the md driver depends on.

I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices.

A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data and should not lose data or panic the kernel.

A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem.

--
My Main Blog: http://etbe.coker.com.au/
My Documents Blog: http://doc.coker.com.au/
Re: RAID1: system stability
On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim...@gmail.com> wrote:
> Okay, logs, i did release disk /dev/sde1 and get:
>
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

So what's up with this? Does this only happen after you try to (software) remove /dev/sde1, or is it happening before that as well? Because this looks like some kind of hardware problem, with the drive reporting an error for a particular sector on read, as if it's a bad sector.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

Again, the same sector as before. This is not a Btrfs error message; it's coming from the block layer.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read

I'm not a dev, so take it with a grain of salt, but because this references a logical block, this is the layer in between Btrfs and the physical device. Btrfs works on logical blocks, and those have to be translated to a device and a physical sector. Maybe what's happening is that there's confusion somewhere about this device not actually being unavailable, so Btrfs or something else is trying to read this logical block again, which causes a read attempt to happen instead of a flat-out "this device doesn't exist" type of error. So I don't know if this is a problem strictly in Btrfs missing-device error handling, or if there's something else that's not really working correctly.

You could test by physically removing the device, if you have hot plug support (be certain all the hardware components support it), and see if you get different results. Or you could try to reproduce the software delete of the device with mdraid or lvm raid with XFS and no Btrfs at all, and see if you get different results. It's known that the btrfs multiple-device failure use case is weak right now. Data isn't lost, but the error handling, notification, all that is almost non-existent compared to mdadm.
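The notification gap can be made concrete: mdadm ships a monitor mode, while the nearest btrfs equivalent today is polling the per-device error counters and alerting on anything nonzero. A hypothetical sketch, not an existing tool -- the sample text is modeled on `btrfs device stats` output, with assumed values:

```python
import re

# Modeled on `btrfs device stats /mnt` output (values assumed).
SAMPLE = """\
[/dev/sde1].write_io_errs   13
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdd1].write_io_errs   0
[/dev/sdd1].read_io_errs    0
"""

def failing_devices(stats_text):
    """Map each device to its nonzero error counters."""
    errs = {}
    for line in stats_text.splitlines():
        m = re.match(r"\[([^\]]+)\]\.(\w+)\s+(\d+)", line)
        if m and int(m.group(3)) != 0:
            errs.setdefault(m.group(1), {})[m.group(2)] = int(m.group(3))
    return errs

for dev, counters in failing_devices(SAMPLE).items():
    print(f"ALERT: {dev}: {counters}")
```

Run from cron against the live command output, this at least surfaces the write-error counters that show up later in this thread (wr 1, wr 2, ...) instead of leaving them buried in dmesg.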
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0x880449541400)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd

OK it looks like not until here does it actually get deleted (?) and then that
Re: RAID1: system stability
On Mon, Jun 22, 2015 at 10:36 AM, Timofey Titovets <nefelim...@gmail.com> wrote:
> 2015-06-22 19:03 GMT+03:00 Chris Murphy <li...@colorremedies.com>:
>> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim...@gmail.com> wrote:
>>> Okay, logs, i did release disk /dev/sde1 and get:
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>>
>> So what's up with this? This only happens after you try to (software)
>> remove /dev/sde1? Or is it happening also before that? Because this
>> looks like some kind of hardware problem when the drive is reporting
>> an error for a particular sector on read, as if it's a bad sector.
>
> Nope, I physically removed the device, and as you can see it produces
> errors at the block layer -.- and these disks have 100% 'health'.
> Because it's a hot-plug device, the kernel sees that the device is now
> missing and removes all kernel objects related to it.

OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that things that depend on it can know it's missing or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on.

>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>>
>> You could test by physically removing the device, if you have hot plug
>> support (be certain all the hardware components support it), and see
>> if you get different results. Or you could try to reproduce the
>> software delete of the device with mdraid or lvm raid with XFS and no
>> Btrfs at all, and see if you get different results. It's known that
>> the btrfs multiple device failure use case is weak right now. Data
>> isn't lost, but the error handling, notification, all that is almost
>> non-existent compared to mdadm.
>
> So sad -.- I have tested this case with md raid1, and the system
> continues to work without problems when I release one of the two md
> devices.

OK, well then it's either a Btrfs bug or something it directly depends on that md does not.

> You are right about USB devices: they don't produce an oops. Maybe
> it's because the kernel uses different modules for SAS/SATA disks and
> USB sticks.

They appear as sd devices on my system, so they're using libata, and as such they ultimately still depend on the SCSI block layer. But there may be a very different kind of missing-device error handling for USB that somehow makes its way up to libata differently than SAS/SATA hotplug. I'd say the oops is definitely a
Re: RAID1: system stability
2015-06-22 19:03 GMT+03:00 Chris Murphy <li...@colorremedies.com>:
> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim...@gmail.com> wrote:
>> Okay, logs, i did release disk /dev/sde1 and get:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>
> So what's up with this? This only happens after you try to (software)
> remove /dev/sde1? Or is it happening also before that? Because this
> looks like some kind of hardware problem when the drive is reporting an
> error for a particular sector on read, as if it's a bad sector.

Nope, I physically removed the device, and as you can see it produces errors at the block layer -.- and these disks have 100% 'health'. Because it's a hot-plug device, the kernel sees that the device is now missing and removes all kernel objects related to it.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>
> Again same sector as before. This is not a Btrfs error message, it's
> coming from the block layer. [...] So I don't know if this is a problem
> strictly in Btrfs missing device error handling, or if there's
> something else that's not really working correctly. You could test by
> physically removing the device, if you have hot plug support (be
> certain all the hardware components support it), you can see if you get
> different results. Or you could try to reproduce the software delete of
> the device with mdraid or lvm raid with XFS and no Btrfs at all, and
> see if you get different results. It's known that the btrfs multiple
> device failure use case is weak right now. Data isn't lost, but the
> error handling, notification, all that is almost non-existent compared
> to mdadm.

So sad -.- I have tested this case with md raid1, and the system continues to work without problems when I release one of the two md devices.

>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Re: RAID1: system stability
And again, if I try:

# echo 1 > /sys/block/sdf/device/delete

Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [ cut here ]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: kernel BUG at /build/buildd/linux-3.19.0/fs/btrfs/extent_io.c:2056!
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: invalid opcode: [#1] SMP
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc ipmi_ssif amdkfd amd_iommu_v2 gpio_ich radeon ttm drm_kms_helper lpc_ich coretemp drm kvm_intel kvm i5000_edac i2c_algo_bit edac_core i5k_amb shpchp ipmi_si serio_raw 8250_fintek ioatdma dca joydev mac_hid ipmi_msghandler bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq hid_generic raid1 e1000e raid0 usbhid mptsas mptscsih multipath psmouse hid mptbase ptp scsi_transport_sas pps_core linear
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CPU: 0 PID: 1150 Comm: kworker/u16:12 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: task: 88044c603110 ti: 88044b4b8000 task.ti: 88044b4b8000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP: 0010:[c043fa80] [c043fa80] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP: 0018:88044b4bbba8 EFLAGS: 00010202
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RAX: RBX: 1000 RCX:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RDX: RSI: 880449841b08 RDI: 880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RBP: 88044b4bbc08 R08: 00109000 R09: 880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R10: 9000 R11: 0002 R12: 8803fa878068
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R13: 880448f5d000 R14: 88044cde8d28 R15: 000524f09000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: FS: () GS:88045fc0() knlGS:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CS: 0010 DS: ES: CR0: 8005003b
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CR2: 7fdcef9cafb8 CR3: 01c13000 CR4: 000407f0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 880448f5d100 1000 4b4bbbd8 ea000fb66d40
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 7000 880449841a80 88044b4bbc08 880439a44b58
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 1000 880448f5d000 88044cde8d28 88044cde8bf0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [c043fd7c] clean_io_failure+0x19c/0x1b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [c04401b0] end_bio_extent_readpage+0x310/0x5e0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [811d5795] ? __slab_free+0xa5/0x320
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [8101e74a] ? native_sched_clock+0x2a/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [8137f1eb] bio_endio+0x6b/0xa0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [811d5bce] ? kmem_cache_free+0x1be/0x200
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [8137f232] bio_endio_nodec+0x12/0x20
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [c0414f3f] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [c044f4e2] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [c044f7a2] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [8108fc98] process_one_work+0x158/0x430
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [810907db] worker_thread+0x5b/0x530
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [81090780] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [81095879] kthread+0xc9/0xe0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [810957b0] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [817cae18] ret_from_fork+0x58/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [810957b0] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Code: f4 fe ff ff 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 4c 89 e7 e8 e0 e4 f3 c0 41 b9 fb ff ff ff e9 d2 fe ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 4c 89 e7 e8 c0 e4 f3 c0 31 f6 4c 89 ef
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP [c043fa80] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP 88044b4bbba8
Re: RAID1: system stability
Okay, logs, i did release disk /dev/sde1 and get:

Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0x880449541400)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at
Re: RAID1: system stability
Update: I tried removing the disk the 'right' way:

# echo 1 > /sys/block/sdf/device/delete

All okay: the system doesn't crash immediately on a 'sync' call and can work for some time without problems. But after a later call, which I can reproduce with:

# apt-get update

the test system (from which I deleted one of the btrfs RAID-1 devices) gets a kernel crash. I get the following dmesg:

Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel lpc_ich ipmi_ssif kvm amdkfd amd_iommu_v2 serio_raw radeon ttm i5000_edac drm_kms_helper drm edac_core i2c_algo_bit i5k_amb ioatdma dca shpchp 8250_fintek joydev mac_hid ipmi_si ipmi_msghandler bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse usbhid hid mptsas mptscsih e1000e mptbase scsi_transport_sas ptp pps_core
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CPU: 3 PID: 99 Comm: kworker/u16:4 Not tainted 4.0.4-040004-generic #201505171336
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: task: 88009ab31400 ti: 88009ab4 task.ti: 88009ab4
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP: 0010:[c0477d50] [c0477d50] repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP: 0018:88009ab43bb8 EFLAGS: 00010206
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RAX: RBX: 88009b1d3f30 RCX: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RDX: 88044902f400 RSI: RDI: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RBP: 88009ab43c18 R08: R09:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R10: 880448c1b090 R11: R12: 3907
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R13: 880439599e68 R14: 1000 R15: 88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: FS: () GS:88045fcc() knlGS:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CS: 0010 DS: ES: CR0: 8005003b
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CR2: 7f640a27e675 CR3: 98b4b000 CR4: 000407e0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Stack:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  9a860de0 ea0002644380 0003d2ee8000
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  8000 88009b53f9c0 88009ab43c18 88009b1d3f30
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  88044c44a3c0 88009b0c1190 88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0477f30] clean_io_failure+0x1a0/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0478218] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [8137b2c3] bio_endio+0x53/0xa0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [8137b322] bio_endio_nodec+0x12/0x20
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c044efb8] end_workqueue_fn+0x48/0x60 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0488b2e] normal_work_helper+0x7e/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0488d32] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092204] process_one_work+0x144/0x490
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092c6e] worker_thread+0x11e/0x450
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092b50] ? create_worker+0x1f0/0x1f0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81098999] kthread+0xc9/0xe0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [810988d0] ? flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [817f08d8] ret_from_fork+0x58/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [810988d0] ? flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 34 f0 c0 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b 0f 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP [c0477d50] repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP 88009ab43bb8
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ---[ end trace 0361c6fdca5f7ee2 ]---

---
Another test case: I deleted a device:

# echo 1 > /sys/block/sdf/device/delete

then re-inserted the same disk (physically removed and inserted it again in the server). The server found it as a new sdg device and all seemed okay, but the kernel crashed with the following stack trace:

---
Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: kernel BUG at
Re: RAID1: system stability
Oh, I forgot to mention: I tested this on 3.19+ kernels. I can get the trace from the screen if it is of interest to the developers.

2015-05-26 14:23 GMT+03:00 Timofey Titovets nefelim...@gmail.com:
> Hi list, I'm a regular on this list and I really like btrfs; I want to use it on a production server and replace the HW RAID there.
> Test case: a server with N SCSI discs; 2 SAS disks used for the RAID-1 root FS.
> If I just remove one disk physically, all is okay: the kernel shows me write errors and the system continues to work for some time. But after the first sync call, for example:
> # sync
> # dd if=/dev/zero of=/zero
> the kernel crashes and the system freezes.
> Yes, after a reboot I can mount with the degraded and recovery options, I can add the failed disk again, and btrfs will rebuild the array. But is a kernel crash and reboot expected in this case, or can I avoid it? How?
> # mount -o remount,degraded  ->  kernel crash
> Inserting the failed disk again  ->  kernel crash
> Maybe I'm missing something? I just want to avoid shutdown time and/or reboots.
--
Have a nice day, Timofey.
Re: RAID1: system stability
Oh, thanks for the advice, I'll capture and attach it. I.e., if I understand correctly, this behaviour is not expected. Good to know.

2015-05-26 22:49 GMT+03:00 Chris Murphy li...@colorremedies.com:
> Without a complete dmesg it's hard to say what's going on. The call trace alone probably doesn't show the instigating factor, so you may need to use remote ssh with journalctl -f, or use netconsole, to continuously get kernel messages prior to the implosion.
>
> Chris Murphy
--
Have a nice day, Timofey.
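When attaching a captured oops, the function+offset frames are the part developers usually need first. A minimal sketch for pulling them out of a saved log (the file name oops.log is hypothetical; the sample lines are taken from the trace in this thread):

```shell
# Save two sample oops lines to a file, then extract the "name+0xOFF/0xLEN" frames.
cat > oops.log <<'EOF'
kernel: RIP: 0010:[c0477d50] repair_io_failure+0x1c0/0x200 [btrfs]
kernel: [c0477f30] clean_io_failure+0x1a0/0x1b0 [btrfs]
EOF
# -oE prints only the matching function+offset substrings, one per line.
grep -oE '[a-z_]+\+0x[0-9a-f]+/0x[0-9a-f]+' oops.log
```

Run against a full dmesg capture, this yields the call chain in a form that is easy to paste into a report.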
Re: RAID1: system stability
Without a complete dmesg it's hard to say what's going on. The call trace alone probably doesn't show the instigating factor, so you may need to use remote ssh with journalctl -f, or use netconsole, to continuously get kernel messages prior to the implosion.

Chris Murphy
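The netconsole approach mentioned above can be set up by loading the module with source and target endpoints on its command line (a sketch; all addresses, ports, interface names, and the MAC here are placeholders, not values from this thread):

```shell
# Sender (the machine under test): stream printk output over UDP.
# Parameter format: src-port@src-ip/src-dev,dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff

# Receiver: listen for the UDP stream and log it.
# (netcat flag syntax varies between variants; some need 'nc -u -l -p 6666')
nc -u -l 6666 | tee netconsole.log
```

Because the messages leave the machine as they are printed, the receiver keeps the lines leading up to the crash even when the sender locks up before anything reaches disk.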
RAID1: system stability
Hi list, I'm a regular on this list and I really like btrfs; I want to use it on a production server and replace the HW RAID there.

Test case: a server with N SCSI discs; 2 SAS disks used for the RAID-1 root FS.

If I just remove one disk physically, all is okay: the kernel shows me write errors and the system continues to work for some time. But after the first sync call, for example:

# sync
# dd if=/dev/zero of=/zero

the kernel crashes and the system freezes.

Yes, after a reboot I can mount with the degraded and recovery options, I can add the failed disk again, and btrfs will rebuild the array. But is a kernel crash and reboot expected in this case, or can I avoid it? How?

# mount -o remount,degraded  ->  kernel crash
Inserting the failed disk again  ->  kernel crash

Maybe I'm missing something? I just want to avoid shutdown time and/or reboots.
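For the recovery half of the question, the usual sequence after losing one btrfs RAID-1 member is roughly the following (a sketch only, not runnable outside a real degraded array; /dev/sdf, /dev/sdg, /mnt, and the devid are placeholders):

```shell
# Mount the surviving member read-write in degraded mode.
mount -o degraded /dev/sdf /mnt

# Option A: replace the missing device in one step.
# Use the missing device's devid (shown by 'btrfs filesystem show'); 2 is a placeholder.
btrfs replace start -B 2 /dev/sdg /mnt

# Option B: add a new device, then drop the missing one (this triggers the rebuild).
btrfs device add /dev/sdg /mnt
btrfs device delete missing /mnt

# Verify the redundancy profile afterwards.
btrfs filesystem df /mnt
```

The thread's point stands regardless of the procedure: none of this should require a kernel crash and reboot to reach in the first place.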