Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Mike Tancsa
On 1/31/2011 4:19 PM, Mike Tancsa wrote:
 On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,

 Yes, this is looking much better.

 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but it depends on how bad the corruption is.

 First, I would try the least destructive method: try to remove the
 files listed below by using the rm command.

 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:

 tank1/argus-data:0xc6
 
 
 Hi Cindy,
   I removed the files that were listed, and now I am left with
 
 errors: Permanent errors have been detected in the following files:
 
 tank1/argus-data:0xc5
 tank1/argus-data:0xc6
 tank1/argus-data:0xc7
 
 I have started a scrub
  scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go


Looks like that was it!  The scrub finished in the time it estimated and
that was all I needed to do. I did not have to run zpool clear or any
other commands.  Is there anything beyond a scrub to check the integrity
of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#


---Mike


Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Cindy Swearingen

Excellent.

I think you are good for now as long as your hardware setup is stable.

You survived a severe hardware failure, so say a prayer and make sure
this doesn't happen again. Always have good backups.

Thanks,

Cindy

On 02/01/11 06:56, Mike Tancsa wrote:

On 1/31/2011 4:19 PM, Mike Tancsa wrote:

On 1/31/2011 3:14 PM, Cindy Swearingen wrote:

Hi Mike,

Yes, this is looking much better.

Some combination of removing corrupted files indicated in the zpool
status -v output, running zpool scrub and then zpool clear should
resolve the corruption, but it depends on how bad the corruption is.

First, I would try the least destructive method: try to remove the
files listed below by using the rm command.

This entry probably means that the metadata is corrupted or some
other file (like a temp file) no longer exists:

tank1/argus-data:0xc6


Hi Cindy,
I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

tank1/argus-data:0xc5
tank1/argus-data:0xc6
tank1/argus-data:0xc7

I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go



Looks like that was it!  The scrub finished in the time it estimated and
that was all I needed to do. I did not have to run zpool clear or any
other commands.  Is there anything beyond a scrub to check the integrity
of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#


---Mike



Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Richard Elling
On Feb 1, 2011, at 5:56 AM, Mike Tancsa wrote:
 On 1/31/2011 4:19 PM, Mike Tancsa wrote:
 On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,
 
 Yes, this is looking much better.
 
 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but it depends on how bad the corruption is.
 
 First, I would try the least destructive method: try to remove the
 files listed below by using the rm command.
 
 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:
 
 tank1/argus-data:0xc6
 
 
 Hi Cindy,
  I removed the files that were listed, and now I am left with
 
 errors: Permanent errors have been detected in the following files:
 
tank1/argus-data:0xc5
tank1/argus-data:0xc6
tank1/argus-data:0xc7
 
 I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go
 
 
 Looks like that was it!  The scrub finished in the time it estimated and
 that was all I needed to do. I did not have to run zpool clear or any
 other commands.  Is there anything beyond a scrub to check the integrity
 of the pool?

That is exactly what scrub does. It validates all data on the disks.
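
For reference, a minimal sketch of that cycle, using the pool name from this
thread (the exact wording of the status output varies by ZFS version):

    zpool scrub tank1          # start a full scrub of the pool
    zpool status -v tank1      # re-run until it reports "scrub completed ... with 0 errors"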


 
 0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46 2011
 config:
 
        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
 
 errors: No known data errors

Congrats!
 -- richard



Re: [zfs-discuss] multiple disk failure

2011-01-31 Thread James Van Artsdalen
He says he's using FreeBSD.  ZFS recorded names like ada0, which always refer to
a whole disk.

In any case FreeBSD will search all block storage for the ZFS dev components if 
the cached name is wrong: if the attached disks are connected to the system at 
all FreeBSD will find them wherever they may be.

Try FreeBSD 8-STABLE rather than just 8.2-RELEASE as many improvements and 
fixes have been backported.  Perhaps try 9-CURRENT as I'm confident the code 
there has all of the dev search fixes.

Add the line vfs.zfs.debug=1 to /boot/loader.conf to get detailed debug 
output as FreeBSD tries to import the pool.
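
As a sketch, the loader.conf entry would look like this (quoting the value is
the usual loader.conf style; either form should work):

    # /boot/loader.conf
    vfs.zfs.debug="1"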


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/29/2011 6:18 PM, Richard Elling wrote:
 
 On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:
 
 On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
 pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
   replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#

 This is usually easily solved without data loss by making the
 disks available again.  Can you read anything from the disks using
 any program?

 That's the strange thing: the disks are readable.  The drive cage just
 reset a couple of times prior to the crash. But they seem OK now.  Same
 order as well.

 # camcontrol devlist
 WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
 (pass0,ada0)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
 (pass1,ada1)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
 (pass2,ada2)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
 (pass3,ada3)


 # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
 20+0 records in
 20+0 records out
 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
 0(offsite)#
 
 The next step is to run zdb -l and look for all 4 labels. Something like:
   zdb -l /dev/ada2
 
 If all 4 labels exist for each drive and appear intact, then look more closely
 at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
 you won't be able to import the pool.
  -- richard

On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:
 On 1/28/2011 4:46 PM, Mike Tancsa wrote:

 I had just added another set of disks to my zfs array. It looks like the
 drive cage with the new drives is faulty.  I had added a couple of files
 to the main pool, but not much.  Is there any way to restore the pool
 below ? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
 one file on the new drives in the bad cage.

 Get another enclosure and verify it works OK.  Then move the disks from
 the suspect enclosure to the tested enclosure and try to import the pool.

 The problem may be cabling or the controller instead - you didn't
 specify how the disks were attached or which version of FreeBSD you're
 using.


First off, thanks to all who responded on- and off-list!

Good news (for me), it seems. With the new cage, everything seems to be
recognized correctly.  The history is:

...
2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6
/dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2
/dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

 zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Cindy Swearingen

Hi Mike,

Yes, this is looking much better.

Some combination of removing corrupted files indicated in the zpool
status -v output, running zpool scrub and then zpool clear should
resolve the corruption, but it depends on how bad the corruption is.

First, I would try the least destructive method: try to remove the
files listed below by using the rm command.

This entry probably means that the metadata is corrupted or some
other file (like a temp file) no longer exists:

tank1/argus-data:0xc6

If you are able to remove the individual file with rm, run another
zpool scrub and then a zpool clear to clear the pool errors. You
might need to repeat the zpool scrub/zpool clear combo.

If you can't remove the individual files, then you might have to
destroy the tank1/argus-data file system.
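
As a rough sketch of that sequence (the mountpoint and file name below are only
examples; the 0x entries in the error list no longer have a path to rm, which is
why the scrub/clear cycle may be all that remains):

    rm /tank1/argus-data/some-damaged-file   # only possible while a real path is reported
    zpool scrub tank1                        # wait for the scrub to finish
    zpool clear tank1
    zpool status -v tank1                    # repeat the scrub/clear pair if errors remain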

Let us know what actually works.

Thanks,

Cindy

On 01/31/11 12:20, Mike Tancsa wrote:

On 1/29/2011 6:18 PM, Richard Elling wrote:

On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:


On 1/29/2011 12:57 PM, Richard Elling wrote:

0(offsite)# zpool status
pool: tank1
state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
  replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
 see: http://www.sun.com/msg/ZFS-8000-3C
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#

This is usually easily solved without data loss by making the
disks available again.  Can you read anything from the disks using
any program?

That's the strange thing: the disks are readable.  The drive cage just
reset a couple of times prior to the crash. But they seem OK now.  Same
order as well.

# camcontrol devlist
WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
(pass0,ada0)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
(pass1,ada1)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
(pass2,ada2)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
(pass3,ada3)


# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#

The next step is to run zdb -l and look for all 4 labels. Something like:
zdb -l /dev/ada2

If all 4 labels exist for each drive and appear intact, then look more closely
at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
you won't be able to import the pool.
 -- richard


On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:

On 1/28/2011 4:46 PM, Mike Tancsa wrote:

I had just added another set of disks to my zfs array. It looks like the
drive cage with the new drives is faulty.  I had added a couple of files
to the main pool, but not much.  Is there any way to restore the pool
below ? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
one file on the new drives in the bad cage.

Get another enclosure and verify it works OK.  Then move the disks from
the suspect enclosure to the tested enclosure and try to import the pool.

The problem may be cabling or the controller instead - you didn't
specify how the disks were attached or which version of FreeBSD you're
using.



First off, thanks to all who responded on- and off-list!

Good news (for me), it seems. With the new cage, everything seems to be
recognized correctly.  The history is:

...
2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6
/dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2
/dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

 zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 

Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,
 
 Yes, this is looking much better.
 
 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but it depends on how bad the corruption is.
 
 First, I would try the least destructive method: try to remove the
 files listed below by using the rm command.
 
 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:
 
 tank1/argus-data:0xc6


Hi Cindy,
I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

tank1/argus-data:0xc5
tank1/argus-data:0xc6
tank1/argus-data:0xc7

I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

I will report back once the scrub is done!

---Mike


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Richard Elling
On Jan 31, 2011, at 1:19 PM, Mike Tancsa wrote:
 On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,
 
 Yes, this is looking much better.
 
 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but it depends on how bad the corruption is.
 
 First, I would try the least destructive method: try to remove the
 files listed below by using the rm command.
 
 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:
 
 tank1/argus-data:0xc6
 
 
 Hi Cindy,
   I removed the files that were listed, and now I am left with
 
 errors: Permanent errors have been detected in the following files:
 
tank1/argus-data:0xc5
tank1/argus-data:0xc6
tank1/argus-data:0xc7
 
 I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go
 
 I will report back once the scrub is done!

The permanent errors report shows both the current and the previous scrub
results. When you have multiple failures that have since been recovered,
consider running a scrub twice before attempting to correct or delete files.
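
A sketch of that two-pass approach, again with the pool name from this thread:

    zpool scrub tank1        # first pass; wait for it to complete
    zpool scrub tank1        # second pass, so entries from the previous scrub can age out
    zpool status -v tank1    # then decide what, if anything, still needs rm or zpool clear
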
 -- richard




Re: [zfs-discuss] multiple disk failure

2011-01-30 Thread Mike Tancsa
On 1/30/2011 12:39 AM, Richard Elling wrote:
 Hmmm, doesn't look good on any of the drives.
 
 I'm not sure of the way BSD enumerates devices.  Some clever person thought
 that hiding the partition or slice would be useful. I don't find it useful.
 On a Solaris system, ZFS can show a disk something like c0t1d0, but that
 doesn't exist. The actual data is in slice 0, so you need to use c0t1d0s0
 as the argument to zdb.

I think it's the right syntax.  On the older drives:


0(offsite)# zdb -l /dev/ada0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3
0(offsite)# zdb -l /dev/ada4

LABEL 0

    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338

LABEL 1

    version=15
    name='tank1'
    state=0
    txg=44592523
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338

LABEL 2

    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338

Re: [zfs-discuss] multiple disk failure

2011-01-30 Thread Richard Elling

On Jan 30, 2011, at 4:31 AM, Mike Tancsa wrote:

 On 1/30/2011 12:39 AM, Richard Elling wrote:
  Hmmm, doesn't look good on any of the drives.
  
  I'm not sure of the way BSD enumerates devices.  Some clever person thought
  that hiding the partition or slice would be useful. I don't find it useful.
  On a Solaris system, ZFS can show a disk something like c0t1d0, but that
  doesn't exist. The actual data is in slice 0, so you need to use c0t1d0s0
  as the argument to zdb.
  
  I think it's the right syntax.  On the older drives,

Bummer. You've got to fix this before you can import the pool.
No labels, no import.
 -- richard



Re: [zfs-discuss] multiple disk failure

2011-01-30 Thread Peter Jeremy
On 2011-Jan-30 13:39:22 +0800, Richard Elling richard.ell...@gmail.com wrote:
I'm not sure of the way BSD enumerates devices.  Some clever person thought
that hiding the partition or slice would be useful.

No, there's no hiding.  /dev/ada0 always refers to the entire physical disk.
If it had PC-style fdisk slices, there would be a sN suffix.
If it had GPT partitions, there would be a pN suffix.
If it had BSD partitions, there would be an alpha suffix [a-h].
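
As a quick illustrative check (ada0 is just an example device), listing the
device nodes shows which of those forms actually exist for a given disk:

    ls /dev/ada0*     # e.g. just ada0, or ada0s1, ada0p1, ada0a, ... if it is partitioned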

On a Solaris
system, ZFS can show a disk something like c0t1d0, but that doesn't exist.

If we're discussing brokenness in OS device names, I've always thought
that reporting device names that don't exist and not having any way to
access the complete physical disk in Solaris was silly.  Having a fake
's2' meaning the whole disk if there's no label is a bad kludge.

Mike might like to try gpart list - which will display FreeBSD's view
of the physical disks.  It might also be worthwhile looking at a hexdump
of the first and last few MB of the faulty disks - it's possible that
the controller has decided to just shift things by a few sectors so the
labels aren't where ZFS expects to find them.
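
A sketch of those checks on one of the suspect disks (ada0 as an example; the
offset for the last few MB has to be worked out from the size diskinfo reports):

    gpart list ada0           # FreeBSD's view of any partitioning on the disk
    diskinfo -v ada0          # mediasize, sector size, etc.
    dd if=/dev/ada0 bs=1m count=4 | hexdump -C | less     # first 4 MB
    # for the last 4 MB, compute skip from mediasize, e.g.:
    # dd if=/dev/ada0 bs=1m skip=$((SIZE_IN_MB - 4)) | hexdump -C | less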

-- 
Peter Jeremy




Re: [zfs-discuss] multiple disk failure

2011-01-30 Thread Richard Elling
On Jan 30, 2011, at 1:09 PM, Peter Jeremy wrote:

 On 2011-Jan-30 13:39:22 +0800, Richard Elling richard.ell...@gmail.com wrote:
 I'm not sure of the way BSD enumerates devices.  Some clever person thought
 that hiding the partition or slice would be useful.
 
 No, there's no hiding.  /dev/ada0 always refers to the entire physical disk.

ZFS on Solaris hides the slice when dealing with whole disks using EFI labels.

 If it had PC-style fdisk slices, there would be a sN suffix.
 If it had GPT partitions, there would be a pN suffix.
 If it had BSD partitions, there would be an alpha suffix [a-h].
 
 On a Solaris
 system, ZFS can show a disk something like c0t1d0, but that doesn't exist.
 
 If we're discussing brokenness in OS device names, I've always thought
 that reporting device names that don't exist and not having any way to
 access the complete physical disk in Solaris was silly.  Having a fake
 's2' meaning the whole disk if there's no label is a bad kludge.

The fake s2 goes back to BSD, where the c partition traditionally meant
the whole disk.  This was just carried forward and changed to s2 when
numbers were used instead of letters. With EFI on Solaris, this is no longer
possible and there is a whole-disk partition. On a default Solaris system, s0
usually refers to the whole disk less s8.

 Mike might like to try gpart list - which will display FreeBSD's view
 of the physical disks.  It might also be worthwhile looking at a hexdump
 of the first and last few MB of the faulty disks - it's possible that
 the controller has decided to just shift things by a few sectors so the
 labels aren't where ZFS expects to find them.

Yes, sometimes controllers will steal space from the disk for implementing RAID.
 -- richard



Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Mike Tancsa
 
        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open

That is a huge bummer.  I don't know if there is any way to recover aside
from restoring backups.  But I will say this much:

That is precisely the reason why you always want to spread your mirror/raidz
devices across multiple controllers or chassis.  If you lose a controller or
a whole chassis, you lose one device from each vdev, and you're able to
continue production in a degraded state...
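
For example, a hypothetical layout along those lines (device names are made up;
the point is that each raidz1 mixes disks from two controllers, so losing one
controller degrades every vdev rather than taking one out entirely):

    zpool create tank \
        raidz1 da0 da1 ada0 ada1 \
        raidz1 da2 da3 ada2 ada3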



Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Richard Elling
On Jan 28, 2011, at 6:41 PM, Mike Tancsa wrote:

 Hi,
   I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
 offsite storage.  All was working fine for about 20 minutes and then the new
 drive cage started to fail.  Silly me for assuming new hardware would be
 fine :(
 
 The new drive cage started to fail; it hung the server and the box
 rebooted.  After it rebooted, the entire pool is gone and in the state
 below.  I had only written a few files to the new larger pool and I am
 not concerned about restoring that data.  However, is there a way to get
 back the original pool data?
 Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the web
 page listed BTW.

Oracle has its fair share of idiots :-(  They have been changing around the
websites and blowing away all of the links people have set up over the past
20+ years.

 0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:
 
        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#

This is usually easily solved without data loss by making the
disks available again.  Can you read anything from the disks using
any program?
 -- richard



Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#
 
 This is usually easily solved without data loss by making the
 disks available again.  Can you read anything from the disks using
 any program?

That's the strange thing: the disks are readable.  The drive cage just
reset a couple of times prior to the crash. But they seem OK now.  Same
order as well.

# camcontrol devlist
WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
(pass0,ada0)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
(pass1,ada1)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
(pass2,ada2)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
(pass3,ada3)


# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#

---Mike


Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 11:38 AM, Edward Ned Harvey wrote:
 
 That is precisely the reason why you always want to spread your mirror/raidz
 devices across multiple controllers or chassis.  If you lose a controller or
 a whole chassis, you lose one device from each vdev, and you're able to
 continue production in a degraded state...


Thanks.  These are backups of backups. It would be nice to restore them
as it will take a while to sync up once again.  But if I need to start
fresh, is there a resource you can point me to with the current best
practices for laying out large storage like this?  It's just for backups
of backups in a DR site.

---Mike


Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Richard Elling

On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:

 On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
 pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
   replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:
 
        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#
 
 This is usually easily solved without data loss by making the
 disks available again.  Can you read anything from the disks using
 any program?
 
 That's the strange thing: the disks are readable.  The drive cage just
 reset a couple of times prior to the crash. But they seem OK now.  Same
 order as well.
 
 # camcontrol devlist
 WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
 (pass0,ada0)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
 (pass1,ada1)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
 (pass2,ada2)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
 (pass3,ada3)
 
 
 # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
 20+0 records in
 20+0 records out
 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
 0(offsite)#

The next step is to run zdb -l and look for all 4 labels. Something like:
zdb -l /dev/ada2

If all 4 labels exist for each drive and appear intact, then look more closely
at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
you won't be able to import the pool.
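
A sketch of checking all four suspect disks in one go (FreeBSD device names from
this thread; on this pool version each intact label prints one version= line, so
a healthy member should show 4):

    for d in ada0 ada1 ada2 ada3; do
        echo "== /dev/$d =="
        zdb -l /dev/$d | grep -c 'version='   # expect 4 if all labels unpack
    done
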
 -- richard



Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 6:18 PM, Richard Elling wrote:
 0(offsite)#
 
 The next step is to run zdb -l and look for all 4 labels. Something like:
   zdb -l /dev/ada2
 
 If all 4 labels exist for each drive and appear intact, then look more closely
 at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
 you won't be able to import the pool.



Hmmm, doesn't look good on any of the drives.  Before I give up, I will
try the drives in a different cage on Monday. Unfortunately, it's 150 km
away from me at our DR site.


# zdb -l /dev/ada0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3


Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Richard Elling
On Jan 29, 2011, at 4:14 PM, Mike Tancsa wrote:

 On 1/29/2011 6:18 PM, Richard Elling wrote:
 0(offsite)#
 
 The next step is to run zdb -l and look for all 4 labels. Something like:
  zdb -l /dev/ada2
 
 If all 4 labels exist for each drive and appear intact, then look more 
 closely
 at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
 you won't be able to import the pool.
 
 
 
 Hmmm, doesn't look good on any of the drives.

I'm not sure of the way BSD enumerates devices.  Some clever person thought
that hiding the partition or slice would be useful. I don't find it useful.
On a Solaris system, ZFS can show a disk something like c0t1d0, but that
doesn't exist. The actual data is in slice 0, so you need to use c0t1d0s0
as the argument to zdb.
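
In other words, on Solaris the check would look something like this (the disk
name is just an example):

    zdb -l /dev/dsk/c0t1d0s0    # note the explicit s0; zpool status shows only c0t1d0
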
 -- richard

  Before I give up, I will
 try the drives in a different cage on Monday. Unfortunately, it's 150 km
 away from me at our DR site.
 
 
 # zdb -l /dev/ada0
 
 LABEL 0
 
 failed to unpack label 0
 
 LABEL 1
 
 failed to unpack label 1
 
 LABEL 2
 
 failed to unpack label 2
 
 LABEL 3
 
 failed to unpack label 3



[zfs-discuss] multiple disk failure

2011-01-28 Thread Mike Tancsa
Hi,
I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
offsite storage.  All was working fine for about 20 minutes and then the new
drive cage started to fail.  Silly me for assuming new hardware would be
fine :(

The new drive cage started to fail; it hung the server and the box
rebooted.  After it rebooted, the entire pool is gone and in the state
below.  I had only written a few files to the new larger pool and I am
not concerned about restoring that data.  However, is there a way to get
back the original pool data?
Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the web
page listed BTW.


0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#