Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Richard Elling
On Feb 1, 2011, at 5:56 AM, Mike Tancsa wrote:
> On 1/31/2011 4:19 PM, Mike Tancsa wrote:
>> On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
>>> Hi Mike,
>>> 
>>> Yes, this is looking much better.
>>> 
>>> Some combination of removing corrupted files indicated in the zpool
>>> status -v output, running zpool scrub and then zpool clear should
>>> resolve the corruption, but it depends on how bad the corruption is.
>>> 
>>> First, I would try the least destructive method: try to remove the
>>> files listed below by using the rm command.
>>> 
>>> This entry probably means that the metadata is corrupted or some
>>> other file (like a temp file) no longer exists:
>>> 
>>> tank1/argus-data:<0xc6>
>> 
>> 
>> Hi Cindy,
>>  I removed the files that were listed, and now I am left with
>> 
>> errors: Permanent errors have been detected in the following files:
>> 
>>tank1/argus-data:<0xc5>
>>tank1/argus-data:<0xc6>
>>tank1/argus-data:<0xc7>
>> 
>> I have started a scrub
>> scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go
> 
> 
> Looks like that was it!  The scrub finished in the time it estimated and
> that was all I needed to do. I did not have to do zpool clear or any
> other commands.  Is there anything beyond scrub to check the integrity
> of the pool?

That is exactly what scrub does. It validates all data on the disks.
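
A minimal sketch of that check, assuming the pool name tank1 from this thread:

  # kick off a full-pool verification; scrub reads and checksums every block
  zpool scrub tank1

  # watch progress; 0 errors after completion means everything validated
  zpool status -v tank1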


> 
> 0(offsite)# zpool status -v
>  pool: tank1
> state: ONLINE
> scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46
> 2011
> config:
> 
>        NAME        STATE     READ WRITE CKSUM
>        tank1       ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            ad0     ONLINE       0     0     0
>            ad1     ONLINE       0     0     0
>            ad4     ONLINE       0     0     0
>            ad6     ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            ada0    ONLINE       0     0     0
>            ada1    ONLINE       0     0     0
>            ada2    ONLINE       0     0     0
>            ada3    ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            ada5    ONLINE       0     0     0
>            ada8    ONLINE       0     0     0
>            ada7    ONLINE       0     0     0
>            ada6    ONLINE       0     0     0
> 
> errors: No known data errors

Congrats!
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Cindy Swearingen

Excellent.

I think you are good for now as long as your hardware setup is stable.

You survived a severe hardware failure, so say a prayer and make sure
this doesn't happen again. Always have good backups.
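
One hedged sketch of such a backup using plain zfs commands; the host name
backuphost and the target pool backuppool are only placeholders:

  # take a recursive snapshot of the whole pool
  zfs snapshot -r tank1@backup-20110201

  # replicate the snapshot off-box
  zfs send -R tank1@backup-20110201 | ssh backuphost zfs receive -d backuppool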

Thanks,

Cindy

On 02/01/11 06:56, Mike Tancsa wrote:

On 1/31/2011 4:19 PM, Mike Tancsa wrote:

On 1/31/2011 3:14 PM, Cindy Swearingen wrote:

Hi Mike,

Yes, this is looking much better.

Some combination of removing corrupted files indicated in the zpool
status -v output, running zpool scrub and then zpool clear should
resolve the corruption, but it depends on how bad the corruption is.

First, I would try the least destructive method: try to remove the
files listed below by using the rm command.

This entry probably means that the metadata is corrupted or some
other file (like a temp file) no longer exists:

tank1/argus-data:<0xc6>


Hi Cindy,
I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

tank1/argus-data:<0xc5>
tank1/argus-data:<0xc6>
tank1/argus-data:<0xc7>

I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go



Looks like that was it!  The scrub finished in the time it estimated and
that was all I needed to do. I did not have to do zpool clear or any
other commands.  Is there anything beyond scrub to check the integrity
of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46
2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#


---Mike

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Mike Tancsa
On 1/31/2011 4:19 PM, Mike Tancsa wrote:
> On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
>> Hi Mike,
>>
>> Yes, this is looking much better.
>>
>> Some combination of removing corrupted files indicated in the zpool
>> status -v output, running zpool scrub and then zpool clear should
>> resolve the corruption, but it depends on how bad the corruption is.
>>
>> First, I would try the least destructive method: try to remove the
>> files listed below by using the rm command.
>>
>> This entry probably means that the metadata is corrupted or some
>> other file (like a temp file) no longer exists:
>>
>> tank1/argus-data:<0xc6>
> 
> 
> Hi Cindy,
>   I removed the files that were listed, and now I am left with
> 
> errors: Permanent errors have been detected in the following files:
> 
> tank1/argus-data:<0xc5>
> tank1/argus-data:<0xc6>
> tank1/argus-data:<0xc7>
> 
> I have started a scrub
>  scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go


Looks like that was it!  The scrub finished in the time it estimated and
that was all I needed to do. I did not have to do zpool clear or any
other commands.  Is there anything beyond scrub to check the integrity
of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46
2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#


---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Richard Elling
On Jan 31, 2011, at 1:19 PM, Mike Tancsa wrote:
> On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
>> Hi Mike,
>> 
>> Yes, this is looking much better.
>> 
>> Some combination of removing corrupted files indicated in the zpool
>> status -v output, running zpool scrub and then zpool clear should
>> resolve the corruption, but it depends on how bad the corruption is.
>> 
>> First, I would try the least destructive method: try to remove the
>> files listed below by using the rm command.
>> 
>> This entry probably means that the metadata is corrupted or some
>> other file (like a temp file) no longer exists:
>> 
>> tank1/argus-data:<0xc6>
> 
> 
> Hi Cindy,
>   I removed the files that were listed, and now I am left with
> 
> errors: Permanent errors have been detected in the following files:
> 
>tank1/argus-data:<0xc5>
>tank1/argus-data:<0xc6>
>tank1/argus-data:<0xc7>
> 
> I have started a scrub
> scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go
> 
> I will report back once the scrub is done!

The "permanent" errors report shows the current and previous results.
When you have recovered from multiple failures, consider running scrub twice
before attempting to correct or delete files.
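
A hedged sketch of that two-pass sequence for this pool:

  # first pass validates the pool and repairs what it can
  zpool scrub tank1
  # ...wait for "scrub completed" in zpool status before continuing...

  # second pass confirms the error list only reflects real, current damage
  zpool scrub tank1
  zpool status -v tank1
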
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
> Hi Mike,
> 
> Yes, this is looking much better.
> 
> Some combination of removing corrupted files indicated in the zpool
> status -v output, running zpool scrub and then zpool clear should
> resolve the corruption, but it depends on how bad the corruption is.
> 
> First, I would try the least destructive method: try to remove the
> files listed below by using the rm command.
> 
> This entry probably means that the metadata is corrupted or some
> other file (like a temp file) no longer exists:
> 
> tank1/argus-data:<0xc6>


Hi Cindy,
I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

tank1/argus-data:<0xc5>
tank1/argus-data:<0xc6>
tank1/argus-data:<0xc7>

I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

I will report back once the scrub is done!

---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Cindy Swearingen

Hi Mike,

Yes, this is looking much better.

Some combination of removing corrupted files indicated in the zpool
status -v output, running zpool scrub and then zpool clear should
resolve the corruption, but it depends on how bad the corruption is.

First, I would try the least destructive method: try to remove the
files listed below by using the rm command.

This entry probably means that the metadata is corrupted or some
other file (like a temp file) no longer exists:

tank1/argus-data:<0xc6>

If you are able to remove the individual file with rm, run another
zpool scrub and then a zpool clear to clear the pool errors. You
might need to repeat the zpool scrub/zpool clear combo.
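
A minimal sketch of that combo, assuming the default mountpoint for
tank1/argus-data; the file name shown is only a placeholder (the <0x..>
entries have no path left to rm):

  # remove the damaged files that still resolve to a path
  rm /tank1/argus-data/damaged-file

  # re-validate the pool, then clear the now-stale error counters
  zpool scrub tank1
  zpool clear tank1
  zpool status -v tank1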

If you can't remove the individual files, then you might have to
destroy the tank1/argus-data file system.

Let us know what actually works.

Thanks,

Cindy

On 01/31/11 12:20, Mike Tancsa wrote:

On 1/29/2011 6:18 PM, Richard Elling wrote:

On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:


On 1/29/2011 12:57 PM, Richard Elling wrote:

0(offsite)# zpool status
pool: tank1
state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
  replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
 see: http://www.sun.com/msg/ZFS-8000-3C
scrub: none requested
config:

          NAME        STATE     READ WRITE CKSUM
          tank1       UNAVAIL      0     0     0  insufficient replicas
            raidz1    ONLINE       0     0     0
              ad0     ONLINE       0     0     0
              ad1     ONLINE       0     0     0
              ad4     ONLINE       0     0     0
              ad6     ONLINE       0     0     0
            raidz1    ONLINE       0     0     0
              ada4    ONLINE       0     0     0
              ada5    ONLINE       0     0     0
              ada6    ONLINE       0     0     0
              ada7    ONLINE       0     0     0
            raidz1    UNAVAIL      0     0     0  insufficient replicas
              ada0    UNAVAIL      0     0     0  cannot open
              ada1    UNAVAIL      0     0     0  cannot open
              ada2    UNAVAIL      0     0     0  cannot open
              ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#

This is usually easily solved without data loss by making the
disks available again.  Can you read anything from the disks using
any program?

That's the strange thing, the disks are readable.  The drive cage just
reset a couple of times prior to the crash. But they seem OK now.  Same
order as well.

# camcontrol devlist
  at scbus0 target 0 lun 0 (pass0,ada0)
  at scbus0 target 1 lun 0 (pass1,ada1)
  at scbus0 target 2 lun 0 (pass2,ada2)
  at scbus0 target 3 lun 0 (pass3,ada3)


# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#
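
To repeat that quick read test across all four suspect drives, something like
the following; a shallow read only proves the devices answer, it says nothing
about label or data integrity:

  for d in ada0 ada1 ada2 ada3; do
      dd if=/dev/$d of=/dev/null count=20 bs=1024
  done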

The next step is to run "zdb -l" and look for all 4 labels. Something like:
zdb -l /dev/ada2

If all 4 labels exist for each drive and appear intact, then look more closely
at how the OS locates the vdevs. If you can't solve the "UNAVAIL" problem,
you won't be able to import the pool.
 -- richard
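
A small sketch of that label check across the four suspect disks; the fields
grepped for are illustrative and vary between ZFS versions:

  for d in ada0 ada1 ada2 ada3; do
      echo "== /dev/$d =="
      # a healthy member should show four copies of the label (LABEL 0-3)
      zdb -l /dev/$d | egrep 'LABEL|pool_guid|txg|state'
  done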


On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:

On 1/28/2011 4:46 PM, Mike Tancsa wrote:

I had just added another set of disks to my zfs array. It looks like the
drive cage with the new drives is faulty.  I had added a couple of files
to the main pool, but not much.  Is there any way to restore the pool
below ? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
one file on the new drives in the bad cage.

Get another enclosure and verify it works OK.  Then move the disks from
the suspect enclosure to the tested enclosure and try to import the pool.

The problem may be cabling or the controller instead - you didn't
specify how the disks were attached or which version of FreeBSD you're
using.
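
For the import step, a hedged sketch of the usual sequence once the disks are
visible in the replacement enclosure:

  # scan attached devices for importable pools and show their health
  zpool import

  # if tank1 is listed with enough replicas, bring it back in
  zpool import tank1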



First off thanks to all who responded on and offlist!

Good news (for me) it seems. New cage and all seems to be recognized
correctly.  The history is

...
2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6
/dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2
/dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

 zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0

Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/29/2011 6:18 PM, Richard Elling wrote:
> 
> On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:
> 
>> On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
 pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
   replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:

           NAME        STATE     READ WRITE CKSUM
           tank1       UNAVAIL      0     0     0  insufficient replicas
             raidz1    ONLINE       0     0     0
               ad0     ONLINE       0     0     0
               ad1     ONLINE       0     0     0
               ad4     ONLINE       0     0     0
               ad6     ONLINE       0     0     0
             raidz1    ONLINE       0     0     0
               ada4    ONLINE       0     0     0
               ada5    ONLINE       0     0     0
               ada6    ONLINE       0     0     0
               ada7    ONLINE       0     0     0
             raidz1    UNAVAIL      0     0     0  insufficient replicas
               ada0    UNAVAIL      0     0     0  cannot open
               ada1    UNAVAIL      0     0     0  cannot open
               ada2    UNAVAIL      0     0     0  cannot open
               ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#
>>>
>>> This is usually easily solved without data loss by making the
>>> disks available again.  Can you read anything from the disks using
>>> any program?
>>
>> That's the strange thing, the disks are readable.  The drive cage just
>> reset a couple of times prior to the crash. But they seem OK now.  Same
>> order as well.
>>
>> # camcontrol devlist
>>   at scbus0 target 0 lun 0 (pass0,ada0)
>>   at scbus0 target 1 lun 0 (pass1,ada1)
>>   at scbus0 target 2 lun 0 (pass2,ada2)
>>   at scbus0 target 3 lun 0 (pass3,ada3)
>>
>>
>> # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
>> 20+0 records in
>> 20+0 records out
>> 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
>> 0(offsite)#
> 
> The next step is to run "zdb -l" and look for all 4 labels. Something like:
>   zdb -l /dev/ada2
> 
> If all 4 labels exist for each drive and appear intact, then look more closely
> at how the OS locates the vdevs. If you can't solve the "UNAVAIL" problem,
> you won't be able to import the pool.
>  -- richard

On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:
> On 1/28/2011 4:46 PM, Mike Tancsa wrote:
>>
>> I had just added another set of disks to my zfs array. It looks like the
>> drive cage with the new drives is faulty.  I had added a couple of files
>> to the main pool, but not much.  Is there any way to restore the pool
>> below ? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
>> one file on the new drives in the bad cage.
>
> Get another enclosure and verify it works OK.  Then move the disks from
> the suspect enclosure to the tested enclosure and try to import the pool.
>
> The problem may be cabling or the controller instead - you didn't
> specify how the disks were attached or which version of FreeBSD you're
> using.
>

First off thanks to all who responded on and offlist!

Good news (for me) it seems. New cage and all seems to be recognized
correctly.  The history is

...
2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6
/dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2
/dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

 zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0