Re: diagnosis for disk drive errors (zfs on cgd on sata disk)
Duplicate, please ignore. Apologies for the noise.

On Fri, 20 Aug 2021 at 06:34 +0200, Pouya Tafti wrote:
[snip]
Re: diagnosis for disk drive errors (zfs on cgd on sata disk)
On Fri, 20 Aug 2021 at 06:13 -, Michael van Elst wrote:
[snip]
> Yes. It could be the drive itself, but I'd suspect the
> backplane or cables. The PSU is also a possible candidate.

Thanks. Retrying the replication in another bay now before
opening up the box.
diagnosis for disk drive errors (zfs on cgd on sata disk)
After a recent drive failure in my primary zfs pool, I set
up a secondary pool on a cgd(4) device on a single new sata
hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
hdd) to back up the primary.

I initially scrubbed the entire disk without apparent
incident using a temporary cryptographic device and dd(1)
as in the cgdconfig(8) man page.

Since then, twice in the past two days, the drive has
failed in the same way and been detached: once on the very
first zfs(8) create operation, and the second time (after a
reboot) after having written hundreds of GiBs to it with a
zfs(8) send/receive pipe. Here are the relevant system
messages:

# dmesg
...
[ 57131.573806] mpii0: physical device removed from slot 7
[ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
[ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
[ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
[ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
[ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
[ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
[ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
[ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
[ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
[ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
[ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
[ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
[ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
[ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
[ 57131.643840] dk11 at sd7 (backupcgd0) deleted
[ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
[ 57131.643840] sd7: detached

I don't know how to go about diagnosing the issue and would
appreciate any suggestions. In particular, the hdd is new
and I wonder if I should return it for a replacement. The
previous disk in the same bay had also been showing
read/write errors (that drive never got detached, though).

Apart from the drive, I also have little faith in the
backplane, cables, SAS controller (which I reflashed), RAM,
etc., although here it looks to me like the problem could
be somewhere between the drive and the controller.

Many thanks,
Pouya

N.B. I'm also a bit confused by how zfs is handling this:
zpool(8) appears to think the drive is still online, while
zfs(8) doesn't list any datasets on it:

# zpool status -v puddle
  pool: puddle
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        puddle            ONLINE       0 3.62K     0
          wedges/backup0  ONLINE       0   213     0

errors: Permanent errors have been detected in the following files:

        puddle/backup.pond/backup:<0x0>
        puddle/backup.pond/backup:<0x10ecc5>

# zfs list puddle
cannot open 'puddle': pool I/O is currently suspended
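
One quick sanity check on the dmesg output above is whether the cgd0 errors are simply the sd7 errors seen one layer up, rather than independent failures. The following is only a rough sketch, not part of the original report: the function names `parse_errors` and `common_offset` are made up for illustration, and the log lines are a subset copied from the messages above. It pairs errors of the same kind and length on both devices and looks for a block offset shared by all of them.

```python
import re

# A subset of the error lines from the dmesg output above (timestamps dropped).
LOG = """\
sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
"""

PATTERN = re.compile(
    r"(\w+): error (reading|writing) fsbn (\d+) of (\d+)-(\d+) \((\w+) bn (\d+);"
)


def parse_errors(text):
    """Return a list of (device, op, first_block, last_block) tuples."""
    return [
        (m.group(6), m.group(2), int(m.group(4)), int(m.group(5)))
        for m in PATTERN.finditer(text)
    ]


def common_offset(errors, outer="sd7", inner="cgd0"):
    """Return the set of block offsets (outer minus inner) shared by all inner errors.

    For each error on the inner device, collect the offsets to every outer
    error with the same operation and the same length, then intersect the
    candidate sets across all inner errors.
    """
    common = None
    for dev, op, first, last in errors:
        if dev != inner:
            continue
        candidates = {
            o_first - first
            for o_dev, o_op, o_first, o_last in errors
            if o_dev == outer and o_op == op and o_last - o_first == last - first
        }
        common = candidates if common is None else common & candidates
    return common or set()


print(common_offset(parse_errors(LOG)))  # prints {262184}
```

Every cgd0 error in this subset lines up with an sd7 error exactly 262184 blocks further in, which is presumably where the wedge backing cgd0 starts on sd7 (an assumption, since the partition table is not shown). If that holds for the full log, the cgd0 messages add no new information beyond the sd7 ones.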
Re: LTO support
On Fri, 13 Aug 2021 at 11:48 +0100, David Brownlee wrote:
> How does the rate of change in data compare to upload bandwidth? In my
> case I bootstrapped the remote backup boxes by having them connected
> to the same network for a few days until everything was up to date,
> then transported them to the remote location.

This might actually work, thanks for the suggestion. The
rate of change should be relatively low.
Re: diagnosis for disk drive errors (zfs on cgd on sata disk)
pouya+lists.net...@nohup.io (Pouya Tafti) writes:

Your disk controller gives the error reason:

>[ 57131.573806] mpii0: physical device removed from slot 7

>Apart from the drive, I also have little faith in the
>backplane, cables, SAS controller (which I reflashed), RAM,
>etc., although here it looks to me like the problem could
>be somewhere between the drive and the controller.

Yes. It could be the drive itself, but I'd suspect the
backplane or cables. The PSU is also a possible candidate.
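
To illustrate the point about the controller message, here is a small sketch (not from the thread; `events` and `errors_around_removal` are hypothetical helper names, and the sample lines are copied from the dmesg output quoted earlier) that checks whether any I/O errors were logged before the "physical device removed" line or only after it.

```python
import re

# Matches lines of the form "[ 57131.573806] sd7d: message".
# Lines without "device:" right after the timestamp (such as the
# "dk11 at sd7 ... deleted" lines) are skipped by this pattern.
LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(\w+): (.*)")

SAMPLE = """\
[ 57131.573806] mpii0: physical device removed from slot 7
[ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
[ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
[ 57131.643840] sd7: detached
"""


def events(text):
    """Parse log text into (timestamp, device, message) tuples, in log order."""
    result = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            result.append((float(m.group(1)), m.group(2), m.group(3)))
    return result


def errors_around_removal(evts):
    """Count read/write errors logged before and after the first removal message.

    Returns (before, after), or None if no removal message is present.
    """
    for i, (_ts, _dev, msg) in enumerate(evts):
        if "physical device removed" in msg:
            before = sum(1 for e in evts[:i] if e[2].startswith("error "))
            after = sum(1 for e in evts[i + 1:] if e[2].startswith("error "))
            return before, after
    return None


print(errors_around_removal(events(SAMPLE)))  # prints (0, 1)
```

No errors before the removal and all of them after it would be consistent with the drive dropping off the bus first and the read/write errors being a consequence, which fits the suspicion of the backplane, cables or PSU rather than bad sectors. It does not rule out the drive itself.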