Re: diagnosis for disk drive errors (zfs on cgd on sata disk)

2021-08-20 Thread Pouya Tafti
Duplicate, please ignore.  Apologies for the noise.

On Fri, 20 Aug 2021 at 06:34 +0200, Pouya Tafti wrote:
> After a recent drive failure in my primary zfs pool, I set
> up a secondary pool on a cgd(4) device on a single new sata
> hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
> hdd) to back up the primary.
> 
> I initially scrubbed the entire disk without apparent
> incident using a temporary cryptographic device and dd(1)
> as in the cgdconfig(8) man page.
> 
> Since then, twice already, in the past two days, the drive
> has failed in the same way and been detached, once on the
> very first zfs(8) create operation, and the second time
> (after a reboot) after having written hundreds of GiBs to
> it with a zfs(8) send/receive pipe.  Here are the relevant
> system messages:
> 
> # dmesg
> ...
> [ 57131.573806] mpii0: physical device removed from slot 7
> [ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
> [ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
> [ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
> [ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
> [ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
> [ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
> [ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
> [ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
> [ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
> [ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
> [ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
> [ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
> [ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.643840] dk11 at sd7 (backupcgd0) deleted
> [ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
> [ 57131.643840] sd7: detached
> 
> I don't know how to go about diagnosing the issue and would
> appreciate any suggestions.  In particular, the hdd is new
> and I wonder if I should return it for a replacement.  The
> previous disk in the same bay had also been showing
> read/write errors (the other drive never got detached,
> though).
> 
> Apart from the drive, I also have little faith in the
> backplane, cables, SAS controller (which I reflashed), RAM,
> etc., although here it looks to me like the problem could
> be somewhere between the drive and the controller.
> 
> Many thanks,
> Pouya
> 
> N.B. I'm also a bit confused by how zfs is handling this:
> zpool(8) appears to think the drive is still online, while
> zfs(8) doesn't list any datasets on it:
> 
> # zpool status -v puddle
>   pool: puddle
>  state: ONLINE
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool clear'.
>    see: http://illumos.org/msg/ZFS-8000-HC
>   scan: none requested
> config:
> 
>         NAME              STATE     READ WRITE CKSUM
>         puddle            ONLINE       0 3.62K     0
>           wedges/backup0  ONLINE       0   213     0
> 
> errors: Permanent errors have been detected in the following files:
> 
> puddle/backup.pond/backup:<0x0>
> puddle/backup.pond/backup:<0x10ecc5>
> 
> # zfs list puddle
> cannot open 'puddle': pool I/O is currently suspended
> 


Re: diagnosis for disk drive errors (zfs on cgd on sata disk)

2021-08-20 Thread Pouya Tafti
On Fri, 20 Aug 2021 at 06:13 -, Michael van Elst wrote:
[snip]
> Yes. It could be the drive itself, but I'd suspect the
> backplane or cables. The PSU is also a possible candidate.

Thanks.  Retrying the replication in another bay now before
opening up the box. 


diagnosis for disk drive errors (zfs on cgd on sata disk)

2021-08-20 Thread Pouya Tafti
After a recent drive failure in my primary zfs pool, I set
up a secondary pool on a cgd(4) device on a single new sata
hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
hdd) to back up the primary.

I initially scrubbed the entire disk without apparent
incident using a temporary cryptographic device and dd(1)
as in the cgdconfig(8) man page.

Since then, twice already, in the past two days, the drive
has failed in the same way and been detached, once on the
very first zfs(8) create operation, and the second time
(after a reboot) after having written hundreds of GiBs to
it with a zfs(8) send/receive pipe.  Here are the relevant
system messages:

# dmesg
...
[ 57131.573806] mpii0: physical device removed from slot 7
[ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
[ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
[ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
[ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
[ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
[ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
[ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
[ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
[ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
[ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
[ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
[ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
[ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
[ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
[ 57131.643840] dk11 at sd7 (backupcgd0) deleted
[ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
[ 57131.643840] sd7: detached

I don't know how to go about diagnosing the issue and would
appreciate any suggestions.  In particular, the hdd is new
and I wonder if I should return it for a replacement.  The
previous disk in the same bay had also been showing
read/write errors (the other drive never got detached,
though).

Apart from the drive, I also have little faith in the
backplane, cables, SAS controller (which I reflashed), RAM,
etc., although here it looks to me like the problem could
be somewhere between the drive and the controller.

Many thanks,
Pouya

N.B. I'm also a bit confused by how zfs is handling this:
zpool(8) appears to think the drive is still online, while
zfs(8) doesn't list any datasets on it:

# zpool status -v puddle
  pool: puddle
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        puddle            ONLINE       0 3.62K     0
          wedges/backup0  ONLINE       0   213     0

errors: Permanent errors have been detected in the following files:

puddle/backup.pond/backup:<0x0>
puddle/backup.pond/backup:<0x10ecc5>

# zfs list puddle
cannot open 'puddle': pool I/O is currently suspended



Re: LTO support

2021-08-20 Thread Pouya Tafti
On Fri, 13 Aug 2021 at 11:48 +0100, David Brownlee wrote:
> How does the rate of change in data compare to upload bandwidth? In my
> case I bootstrapped the remote backup boxes by having them connected
> to the same network for a few days until everything was up to date,
> then transported them to the remote location.

This might actually work, thanks for the suggestion.  The
rate of change should be relatively low.


Re: diagnosis for disk drive errors (zfs on cgd on sata disk)

2021-08-20 Thread Michael van Elst
pouya+lists.net...@nohup.io (Pouya Tafti) writes:

Your disk controller gives the error reason:

>[ 57131.573806] mpii0: physical device removed from slot 7

>Apart from the drive, I also have little faith in the
>backplane, cables, SAS controller (which I reflashed), RAM,
>etc., although here it looks to me like the problem could
>be somewhere between the drive and the controller.

Yes. It could be the drive itself, but I'd suspect the
backplane or cables. The PSU is also a possible candidate.
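
The reading that the I/O errors follow from the removal, rather than from bad sectors, can be checked against the log itself. The sketch below is only an illustration: it parses a few of the dmesg lines quoted in this thread and reports whether any error precedes the "physical device removed" message, and whether the sd7 and cgd0 block numbers differ by a constant. The names LOG, parse and summarize are made up for this example and are not part of any NetBSD tool. A constant difference would be consistent with cgd0 sitting on a wedge at a fixed offset into sd7, so that each failed request is simply reported once per layer; that interpretation is an assumption, not something the log states.

```python
import re

# Excerpt of the dmesg output quoted in this thread (wrapped lines rejoined).
LOG = """\
[ 57131.573806] mpii0: physical device removed from slot 7
[ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
[ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
[ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
[ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
[ 57131.643840] sd7: detached
"""

# "[ <seconds>] <device>: <message>"
LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>\w+):\s+(?P<msg>.*)$")
# "error reading|writing fsbn <block> of ..."
ERR = re.compile(r"error (?P<op>reading|writing) fsbn (?P<bn>\d+) of")


def parse(log):
    """Return (timestamp, device, message) for every line that matches LINE."""
    events = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            events.append((float(m["ts"]), m["dev"], m["msg"]))
    return events


def summarize(log):
    """Return (removal_ts, error_count, errors_before_removal, offsets)."""
    events = parse(log)
    removals = [ts for ts, _, msg in events if "physical device removed" in msg]
    removal_ts = removals[0] if removals else None

    errors = []
    for ts, dev, msg in events:
        e = ERR.search(msg)
        if e:
            errors.append((ts, dev, int(e["bn"])))

    # Errors logged strictly before the controller reported the removal.
    before = [e for e in errors if removal_ts is not None and e[0] < removal_ts]

    # Pair sd* and cgd* block numbers in sorted order and collect the differences.
    sd = sorted(bn for _, dev, bn in errors if dev.startswith("sd"))
    cgd = sorted(bn for _, dev, bn in errors if dev.startswith("cgd"))
    offsets = {s - c for s, c in zip(sd, cgd)} if len(sd) == len(cgd) else set()

    return removal_ts, len(errors), len(before), offsets


if __name__ == "__main__":
    removal_ts, n_errors, n_before, offsets = summarize(LOG)
    print("removal reported at:", removal_ts)
    print("I/O errors:", n_errors, "of which before the removal:", n_before)
    print("sd7 - cgd0 block offsets:", sorted(offsets))
```

On this excerpt no error is stamped earlier than the removal, and every sd7/cgd0 pair differs by the same number of blocks. That fits the drive dropping off the bus with requests in flight, but it does not by itself say whether the drive, the backplane, the cables or the PSU is at fault.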