Re: [lustre-discuss] ZFS wobble

2022-04-28 Thread Simon Guilbault
Hi,

Start a ZFS scrub on your pool. This will verify that all the content is
fine, since the short resilver that runs when re-adding dead disks to a pool
does not check everything, only what changed on the pool while that disk was
gone.

Sadly, I often see that kind of error on my personal NAS because of some bad
hardware, but in my experience ZFS has always been able to fix everything,
even when it reports "permanent errors", and those permanent errors disappear
after the scrub.
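
To compare what ZFS reports before and after the scrub, something like the
following could help. It is only a rough sketch in Python: the function names
are made up for illustration, the sample text is copied from your message, and
it assumes the usual layout of the "errors:" section in `zpool status -v`
output. The scrub itself would be started with `zpool scrub cos6-ost7`, taking
cos6-ost7 to be the pool name (a guess from the dataset paths in your output).

```python
# Sketch: list the entries under "errors:" in saved `zpool status -v` output
# and see which ones are still reported after a scrub.
# Function names are hypothetical; the sample text comes from the thread.

def parse_error_entries(status_text):
    """Return the entries listed after 'errors: Permanent errors ...'."""
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            # "errors: No known data errors" is not followed by a list.
            in_errors = "Permanent errors" in stripped
            continue
        if stripped.startswith("pool:"):
            # The report for the next pool begins here.
            in_errors = False
            continue
        if in_errors and stripped:
            entries.append(stripped)
    return entries


def still_reported(before_text, after_text):
    """Entries present in both reports, i.e. not cleared by the scrub."""
    after = set(parse_error_entries(after_text))
    return [e for e in parse_error_entries(before_text) if e in after]


BEFORE = """\
errors: Permanent errors have been detected in the following files:

        :<0x0>
        :<0xb01>
        :<0x15>
        :<0x383>
        cos6-ost7/ost7:/O/40400/d11/10617643
        cos6-ost7/ost7:/O/40400/d21/583029
"""

# What zpool status prints when it has nothing to report.
AFTER = "errors: No known data errors\n"

if __name__ == "__main__":
    print(len(parse_error_entries(BEFORE)), "entries before the scrub")
    print("still reported afterwards:", still_reported(BEFORE, AFTER))
```

An empty "still reported" list only means ZFS no longer lists those entries;
it says nothing about Lustre-level consistency.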

On Thu, Apr 28, 2022 at 4:10 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We have OSDs on ZFS (0.7.9) / Lustre 2.12.6.
>
> Recently, one of our JBODs had a wobble, and the disks (as presented to
> the OS) disappeared for a few seconds (and then returned).
>
> This upset a few zpools which SUSPENDED.
>
> A zpool clear on these then started the resilvering process, and zpool
> status gave e.g.:
> errors: Permanent errors have been detected in the following files:
>
>  :<0x0>
>  :<0xb01>
>  :<0x15>
>  :<0x383>
>  cos6-ost7/ost7:/O/40400/d11/10617643
>  cos6-ost7/ost7:/O/40400/d21/583029
>
>
> However, once the resilvering had completed, these permanent errors had
> gone.
>
> The question is then: are these errors really permanent, or was ZFS able
> to correct them?
>
> Lustre remains fine (though it obviously froze while the pools were
> suspended).
>
> Should we be worried that there might be some under-the-hood corruption
> that will present itself when we need to remount the OST (e.g. after a
> reboot)?  In particular the :<0x0> file worries me a bit!
>
> Thanks,
> Alastair.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


[lustre-discuss] ZFS wobble

2022-04-28 Thread Alastair Basden via lustre-discuss

Hi,

We have OSDs on ZFS (0.7.9) / Lustre 2.12.6.

Recently, one of our JBODs had a wobble, and the disks (as presented to 
the OS) disappeared for a few seconds (and then returned).


This upset a few zpools which SUSPENDED.

A zpool clear on these then started the resilvering process, and zpool 
status gave e.g.:

errors: Permanent errors have been detected in the following files:

:<0x0>
:<0xb01>
:<0x15>
:<0x383>
cos6-ost7/ost7:/O/40400/d11/10617643
cos6-ost7/ost7:/O/40400/d21/583029


However, once the resilvering had completed, these permanent errors had 
gone.


The question is then: are these errors really permanent, or was ZFS able
to correct them?

Lustre remains fine (though it obviously froze while the pools were
suspended).


Should we be worried that there might be some under-the-hood corruption
that will present itself when we need to remount the OST (e.g. after a
reboot)?  In particular the :<0x0> file worries me a bit!


Thanks,
Alastair.