Re: [zfs-discuss] Severe Problems on ZFS server

2010-04-22 Thread Andreas Höschler

Hi Bob,

The problem could be due to a faulty/failing disk, a poor connection 
with a disk, or some other hardware issue.  A failing disk can easily 
make the system pause temporarily like that.


As root you can run '/usr/sbin/fmdump -ef' to see all the fault events 
as they are reported.  Be sure to execute '/usr/sbin/fmadm faulty' to 
see if a fault has already been identified on your system.  Also 
execute '/usr/bin/iostat -xe' to see if there are errors reported 
against some of your disks, or if some are reported as being 
abnormally slow.


You might also want to verify that your Solaris 10 is current.  I 
notice that you did not identify what Solaris 10 you are using.


Thanks a lot for these hints. I checked all this. On my mirror server I 
found a faulty DIMM with these commands. But on the main server 
exhibiting the described problem everything seems fine.
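For my own notes, the checks boiled down to roughly the following (run 
as root; the interval argument to iostat is my addition, adjust as 
needed):

    /usr/sbin/fmdump -ef      # stream fault events as they are reported
    /usr/sbin/fmadm faulty    # list faults already diagnosed on the system
    /usr/bin/iostat -xe 5     # per-device error counters and service times

On the mirror server the fmdump/fmadm output pointed at the DIMM; on 
the main server all three come back clean.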


On another machine with 6GB RAM I fired up a second virtual machine 
(vbox). This brought the machine almost to a halt. The second vbox 
instance never came up. I finally saw a panel raised by the first 
vbox instance saying that there was not enough memory available (a 
non-severe vbox error) and that the virtual machine had been halted!! 
After killing the process of the second vbox I could simply press 
resume and the first vbox machine continued to work properly.


Maybe you should read the VirtualBox documentation.  There is a note 
about Solaris 10 and about how VirtualBox may fail if it can't get 
enough contiguous memory space.


Maybe I am lucky since I have run three VirtualBox instances at a time 
(2GB allocation each) on my system with no problem at all.


I have inserted

set zfs:zfs_arc_max = 0x2

in /etc/system and rebooted the machine, which has 64GB of memory. 
Tomorrow will show whether this did the trick!
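To verify after the reboot that the new cap actually took effect, 
something like the following should do (kstat statistic names quoted 
from memory, so please double-check):

    kstat -p zfs:0:arcstats:c_max    # configured ARC limit in bytes
    kstat -p zfs:0:arcstats:size     # current ARC size in bytes

If c_max does not reflect the value from /etc/system, the line was 
probably not picked up at boot.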


Thanks a lot,

 Andreas



Re: [zfs-discuss] Severe Problems on ZFS server

2010-04-22 Thread Andreas Höschler

Hi all,

we are encountering severe problems on our X4240 (64GB, 16 disks) 
running Solaris 10 and ZFS. From time to time (5-6 times a day)


• FrontBase hangs or crashes
• VBox virtual machines hang
• Other applications show a rubber-band effect (white windows) while 
being moved


I have been tearing my hair out over where this comes from. Could be 
software bugs, but in all these applications from different vendors? 
Could be a Solaris bug or bad memory!? Rather unlikely. Then a thought 
hit me. On another machine with 6GB RAM I fired up a second virtual 
machine (vbox). This brought the machine almost to a halt. The second 
vbox instance never came up. I finally saw a panel raised by the first 
vbox instance saying that there was not enough memory available (a 
non-severe vbox error) and that the virtual machine had been halted!! 
After killing the process of the second vbox I could simply press 
resume and the first vbox machine continued to work properly.


OK, now this starts to make sense. My idea is that ZFS is 
blocking/allocating all of the available system memory. When an app 
(FrontBase, VBox, ...) is started and suddenly requests larger chunks 
of memory from the system, the malloc calls fail because ZFS has 
allocated all the memory, or because the system cannot release the 
memory quickly enough to make it available to the requesting apps. So 
the malloc fails or times out, which is not caught by the apps and 
makes them hang, crash or stall for minutes. Does this make any sense? 
Any similar experiences?
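Next time it happens I plan to compare the ARC size against what the 
kernel reports as free; a minimal check, with statistic names quoted 
from memory, would be:

    kstat -p zfs:0:arcstats:size     # bytes currently held by the ZFS ARC
    echo ::memstat | mdb -k          # breakdown of kernel / ZFS / anon / free pages
    vmstat 5 3                       # watch the free column and the scan rate

If nearly all memory sits with the ARC while the apps are failing to 
allocate, that would support the theory.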




Follow-up to my own message. On the X4240 I have

set zfs:zfs_arc_max = 0x78000

in /etc/system. Would it be a good idea to reduce that to, say,

set zfs:zfs_arc_max = 0x28000

?? Hints greatly appreciated!
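(For reference, the value is a plain byte count; a hypothetical 16 GB 
cap, purely to illustrate the format, would look like

    set zfs:zfs_arc_max = 0x400000000    # 16 GB, illustrative value only

and needs a reboot to take effect.)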

Thanks,

 Andreas



[zfs-discuss] Severe Problems on ZFS server

2010-04-22 Thread Andreas Höschler

Hi all

we are encountering severe problems on our X4240 (64GB, 16 disks) 
running Solaris 10 and ZFS. From time to time (5-6 times a day)


• FrontBase hangs or crashes
• VBox virtual machines hang
• Other applications show a rubber-band effect (white windows) while 
being moved


I have been tearing my hair out over where this comes from. Could be 
software bugs, but in all these applications from different vendors? 
Could be a Solaris bug or bad memory!? Rather unlikely. Then a thought 
hit me. On another machine with 6GB RAM I fired up a second virtual 
machine (vbox). This brought the machine almost to a halt. The second 
vbox instance never came up. I finally saw a panel raised by the first 
vbox instance saying that there was not enough memory available (a 
non-severe vbox error) and that the virtual machine had been halted!! 
After killing the process of the second vbox I could simply press 
resume and the first vbox machine continued to work properly.


OK, now this starts to make sense. My idea is that ZFS is 
blocking/allocating all of the available system memory. When an app 
(FrontBase, VBox, ...) is started and suddenly requests larger chunks 
of memory from the system, the malloc calls fail because ZFS has 
allocated all the memory, or because the system cannot release the 
memory quickly enough to make it available to the requesting apps. So 
the malloc fails or times out, which is not caught by the apps and 
makes them hang, crash or stall for minutes. Does this make any sense? 
Any similar experiences?


What can I do about that?

Thanks a lot,

 Andreas



Re: [zfs-discuss] Replacing disk in zfs pool

2010-04-09 Thread Andreas Höschler

Hi Ragnar,

I need to replace a disk in a zfs pool on a production server (X4240 
running Solaris 10) today and won't have access to my documentation 
there. That's why I would like to have a good plan on paper before 
driving to that location. :-)


The current tank pool looks as follows:

 pool: tank
state: ONLINE
scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
tank         ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t2d0   ONLINE       0     0     0
    c1t3d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t5d0   ONLINE       0     0     0
    c1t4d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t15d0  ONLINE       0     0     0
    c1t7d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0

errors: No known data errors

Note that disk c1t15d0 is being used and has taken over the duty of 
c1t6d0. c1t6d0 failed and was replaced with a new disk a couple of 
months ago. However, the new disk does not show up in /dev/rdsk and 
/dev/dsk. I was told that the disk has to be initialized first in the 
SCSI BIOS. I am going to do that today (and reboot the server). Once 
the disk shows up in /dev/rdsk I am planning to do the following:


I don't think that the BIOS and rebooting part ever has to be true,
at least I don't hope so. You shouldn't have to reboot just because
you replace a hot plug disk.


Hard to believe! But that's the most recent state of affairs. Not even 
the Sun technician could make the disk show up in /dev/dsk. They 
replaced it three times, assuming it to be defective! :-)


I tried to remotely reboot the server (via LOM) and go into the SCSI 
BIOS to initialize the disk, but the BIOS requires a key combination 
to initialize the disk that does not go through the remote connection 
(I don't remember which one). That's why I am planning to drive to the 
remote location and do it manually, with a server reboot and keyboard 
and screen attached like in the very old days. :-(



Depending on the hardware and the state
of your system, it might not be the problem at all, and rebooting may
not help. Are the device links for c1t6* gone in /dev/(r)dsk?
Then someone must have run a "devfsadm -C" or something like that.
You could try "devfsadm -sv" to see if it wants to (re)create any
device links. If you think that it looks good, run it with "devfsadm 
-v".


If it is the HBA/raid controller acting up and not showing recently
inserted drives, you should be able to talk to it with a program
from within the OS. raidctl for some LSI HBAs, and arcconf for
some SUN/StorageTek HBAs.


I have /usr/sbin/raidctl on that machine and have just studied its man 
page, but I couldn't find any hint on how to initialize a disk such as 
c1t6d0. It only talks about setting up RAID volumes!? :-(
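Another thing I intend to try before resorting to a reboot is the 
hot-plug framework; from memory (please correct me if the syntax is 
off):

    cfgadm -al                            # list attachment points; the new disk may show up unconfigured
    cfgadm -c configure c1::dsk/c1t6d0    # attachment-point name is only my guess for this box
    devfsadm -v                           # then let devfsadm recreate the /dev links

The attachment-point name c1::dsk/c1t6d0 is an assumption; whatever 
cfgadm -al actually lists for that slot is what counts.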


Thanks a lot,

 Andreas



[zfs-discuss] Replacing disk in zfs pool

2010-04-09 Thread Andreas Höschler

Hi all,

I need to replace a disk in a zfs pool on a production server (X4240 
running Solaris 10) today and won't have access to my documentation 
there. That's why I would like to have a good plan on paper before 
driving to that location. :-)


The current tank pool looks as follows:

  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
tank         ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t2d0   ONLINE       0     0     0
    c1t3d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t5d0   ONLINE       0     0     0
    c1t4d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t15d0  ONLINE       0     0     0
    c1t7d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
  mirror     ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0

errors: No known data errors

Note that disk c1t15d0 is being used and has taken over the duty of 
c1t6d0. c1t6d0 failed and was replaced with a new disk a couple of 
months ago. However, the new disk does not show up in /dev/rdsk and 
/dev/dsk. I was told that the disk has to be initialized first in the 
SCSI BIOS. I am going to do that today (and reboot the server). Once 
the disk shows up in /dev/rdsk I am planning to do the following:


zpool attach tank c1t7d0 c1t6d0

This hopefully gives me a three-way mirror:

  mirror ONLINE   0 0 0
c1t15d0  ONLINE   0 0 0
c1t7d0   ONLINE   0 0 0
c1t6d0   ONLINE   0 0 0

And then a

zpool detach tank c1t15d0

to get c1t15d0 out of the mirror to finally have

  mirror ONLINE   0 0 0
c1t6d0   ONLINE   0 0 0
c1t7d0   ONLINE   0 0 0

again. Is that a good plan? I am then intending to do

zpool add tank mirror c1t14d0 c1t15d0

to add another 146GB to the pool.

Please let me know if I am missing anything. This is a production 
server. A failure of the pool would be fatal.
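One additional check I intend to make: only detach c1t15d0 after the 
resilver onto c1t6d0 has finished, e.g. by watching

    zpool status tank | grep -i resilver

until it reports "resilver completed" (plain zpool status works just as 
well, the grep is only a convenience).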


Thanks a lot,

 Andreas



Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi Khyron,

No, he did *not* say that a mirrored SLOG has no benefit, 
redundancy-wise. He said that YOU do *not* have a mirrored SLOG. You 
have 2 SLOG devices which are striped. And if this machine is running 
Solaris 10, then you cannot remove a log device because those updates 
have not made their way into Solaris 10 yet. You need pool version >= 
19 to remove log devices, and S10 does not currently have patches to 
ZFS to get to a pool version >= 19.

If your SLOG above were mirrored, you'd have "mirror" under "logs". And 
you probably would have "log" not "logs" - notice the "s" at the end 
meaning plural, meaning multiple independent log devices, not a 
mirrored pair of logs which would effectively look like 1 device.


Thanks for the clarification! This is very annoying. My intention was 
to create a log mirror. I used


zpool add tank log c1t6d0 c1t7d0

and this was obviously wrong. Would

zpool add tank mirror log c1t6d0 c1t7d0

have done what I intended? If so, it seems I have to tear down the 
tank pool and recreate it from scratch!? Can I simply use


zpool destroy -f tank

to do so?
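(For the record, my understanding now is that the mirrored-log syntax 
should have been

    zpool add tank log mirror c1t6d0 c1t7d0

i.e. "log", then "mirror", then the two devices -- please correct me if 
that ordering is still wrong.)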

Thanks,

 Andreas



Re: [zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi Edward,

thanks a lot for your detailed response!


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Andreas Höschler

• I would like to remove the two SSDs as log devices from the pool and
instead add them as a separate pool for sole use  by the database to
see how this enhences performance. I could certainly do

zpool detach tank c1t7d0

to remove one disk from the log mirror. But how can I get back the
second SSD?


If you're running solaris, sorry, you can't remove the log device.  You
better keep your log mirrored until you can plan for destroying and
recreating the pool.  Actually, in your example, you don't have a 
mirror of logs.  You have two separate logs.  This is fine for 
opensolaris (zpool >= 19), but not solaris (presently up to zpool 15).  
If this is solaris, and *either* one of those SSD's fails, then you 
lose your pool.
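If I understand this correctly, I can confirm it on my side with 
something like (exact property/option names from memory):

    zpool get version tank    # on-disk version of the pool
    zpool upgrade -v          # versions this ZFS release supports

Log device removal (zpool remove tank c1t6d0) would then only be an 
option once the box supports pool version 19 or later.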


I run Solaris 10 (not Open Solaris)!

You say the log mirror

  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
...
logs
  c1t6d0    ONLINE       0     0     0
  c1t7d0    ONLINE       0     0     0

does not do me any good (redundancy-wise)!? Shouldn't I detach the 
second drive then and try to use it for something else, maybe 
another machine?


I understand it is very dangerous to use SSDs for logs then (no 
redundancy)!?



If you're running opensolaris, "man zpool" and look for "zpool remove"

Is the database running locally on the machine?


Yes!


 Or at the other end of something like nfs?  You should have better 
performance using your present config than just about any other config 
... By enabling the log devices, such as you've done, you're dedicating 
the SSD's for sync writes.  And that's what the database is probably 
doing.  This config should be *better* than dedicating the SSD's as 
their own pool.  Because with the dedicated log device on a stripe of 
mirrors, you're allowing the spindle disks to do what they're good at 
(sequential blocks) and allowing the SSD's to do what they're good at 
(low latency IOPS).


OK!

I actually have two machines here: a production machine (X4240 with 
16 disks, no SSDs) with performance issues, and a development machine 
(X4140 with 6 disks and two SSDs) configured as shown in my previous 
mail. The question for me is how to improve the performance of the 
production machine and whether buying SSDs for it is worth the 
investment.


"zpool iostat" on the development machine with the SSDs gives me

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        114G   164G      0      4  13.5K  36.0K
tank         164G   392G      3    131   444K  10.8M
----------  -----  -----  -----  -----  -----  -----

When I do that on the production machine without SSDs I get

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       98.3G  37.7G      0      7  32.5K  36.9K
tank         480G   336G     16     53  1.69M  2.05M
----------  -----  -----  -----  -----  -----  -----

It is interesting to note that the write bandwidth on the SSD machine 
is five times higher. I take this as an indicator that the SSDs have 
some effect.
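To see whether the SSD log devices are really absorbing the writes, I 
suppose I could also watch the per-device statistics, e.g.

    zpool iostat -v tank 5

and check the write columns for c1t6d0/c1t7d0 under "logs" while the 
database is busy.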


I am still wondering what your "if one SSD fails you lose your pool" 
means for me. Would you recommend detaching one of the SSDs in the 
development machine and adding it to the production machine with


zpool add tank log c1t15d0

?? And how safe (reliable) is it to use SSDs for this? I mean, when do 
I have to expect the SSD to fail and thus ruin the pool!?


Thanks a lot,

 Andreas



[zfs-discuss] Removing SSDs from pool

2010-04-05 Thread Andreas Höschler

Hi all,

while setting up our X4140 I have - following suggestions - added two 
SSDs as log devices as follows


zpool add tank log c1t6d0 c1t7d0

I currently have

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c1t0d0s0  ONLINE   0 0 0
c1t1d0s0  ONLINE   0 0 0

errors: No known data errors

  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
logs
  c1t6d0    ONLINE       0     0     0
  c1t7d0    ONLINE       0     0     0

errors: No known data errors

We have performance problems, especially with FrontBase (a relational 
database) running on this ZFS configuration, and need to look for 
optimizations.


• I would like to remove the two SSDs as log devices from the pool and 
instead add them as a separate pool for sole use by the database, to 
see how this enhances performance. I could certainly do


zpool detach tank c1t7d0

to remove one disk from the log mirror. But how can I get back the 
second SSD?


Any experiences with running databases on ZFS pools? What can I do to 
tune the performance? A smaller block size maybe?
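What I have in mind for the block size is roughly the following; 
tank/frontbase is a made-up dataset name, and 8K is only my assumption 
about the database's page size:

    zfs create -o recordsize=8k tank/frontbase
    zfs get recordsize,compression tank/frontbase

i.e. give the database its own filesystem with recordsize matched to 
the DB page size before loading the data, since recordsize only affects 
files written afterwards.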


Thanks a lot,

 Andreas




[zfs-discuss] SSD and ZFS

2010-02-12 Thread Andreas Höschler

Hi all,

just after sending a message to sunmanagers I realized that my question 
should rather have gone here, so sunmanagers please excuse the double 
post:


I have inherited an X4140 (8 SAS slots) and have just set up the system 
with Solaris 10 09. I first set up the system on a mirrored pool over 
the first two disks


  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c1t0d0s0  ONLINE   0 0 0
c1t1d0s0  ONLINE   0 0 0

errors: No known data errors

and then tried to add the second pair of disks to this pool, which did 
not work (the famous error message regarding labels, the root pool BIOS 
issue). I therefore simply created an additional pool, tank.


  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c1t0d0s0  ONLINE   0 0 0
c1t1d0s0  ONLINE   0 0 0

errors: No known data errors

 pool: tank
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0

errors: No known data errors

So far so good. I have now replaced the last two SAS disks with 32GB 
SSDs and am wondering how to add these to the system. I googled a lot 
for best practices but found nothing so far that made me any wiser. My 
current approach is still to simply do


zpool add tank mirror c0t6d0 c0t7d0

as I would with normal disks, but I am wondering whether that's the 
right approach to significantly increase system performance. Will ZFS 
automatically use these SSDs and optimize accesses to tank? Probably! 
But it won't optimize accesses to rpool, of course. I am not sure 
whether I need that or should look for it. Should I try to get all 
disks into rpool in spite of the BIOS label issue, so that the SSDs 
are used for all accesses to the disk system?
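The alternative I keep reading about would be to hand the SSDs to ZFS 
as dedicated log and/or cache devices rather than as another data 
mirror, roughly:

    zpool add tank log mirror c0t6d0 c0t7d0    # mirrored ZIL for synchronous writes
    zpool add tank cache c0t6d0 c0t7d0         # or: L2ARC read cache (if my pool version supports cache devices)

but I am unsure which of the two (or a split, one SSD each) makes more 
sense for this workload.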


Hints (best practices) are greatly appreciated!

Thanks a lot,

 Andreas




Re: [zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Andreas Höschler

Hi Cindy,


I think you can still offline the faulted disk, c1t6d0.


OK, here it gets tricky. I have

NAME   STATE READ WRITE CKSUM
tank   DEGRADED 0 0 0
  mirror   ONLINE   0 0 0
c1t2d0 ONLINE   0 0 0
c1t3d0 ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c1t5d0 ONLINE   0 0 0
c1t4d0 ONLINE   0 0 0
  mirror   DEGRADED 0 0 0
spare  DEGRADED 0 0 0
  c1t6d0   FAULTED      0    19     0  too many errors
  c1t15d0  ONLINE   0 0 0
c1t7d0 ONLINE   0 0 0
spares
  c1t15d0  INUSE currently in use

now. When I issue the command

zpool offline tank c1t6d0

I get

cannot offline c1t6d0: no valid replicas

??

However

zpool detach tank c1t6d0

seems to work!

pool: tank
 state: ONLINE
 scrub: resilver completed after 0h22m with 0 errors on Thu Aug  6 
22:55:37 2009

config:

NAME STATE READ WRITE CKSUM
tank ONLINE   0 0 0
  mirror ONLINE   0 0 0
c1t2d0   ONLINE   0 0 0
c1t3d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c1t5d0   ONLINE   0 0 0
c1t4d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c1t15d0  ONLINE   0 0 0
c1t7d0   ONLINE   0 0 0

errors: No known data errors

This looks like I can remove and physically replace c1t6d0 now! :-)

Thanks,

  Andreas



Re: [zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Andreas Höschler

Hi all,


zpool add tank spare c1t15d0
? After doing that c1t6d0 is offline and ready to be physically 
replaced?


Yes, that is correct.
Then you could physically replace c1t6d0 and add it back to the pool 
as a spare, like this:

# zpool add tank spare c1t6d0

For a production system, the steps above might be the most efficient.
Get the faulted disk replaced with a known good disk so the pool is
no longer degraded, then physically replace the bad disk when you have
the time and add it back to the pool as a spare.

It is also good practice to run a zpool scrub to ensure the
replacement is operational

That would be
zpool scrub tank
in my case!?


Yes.

and use zpool clear to clear the previous
errors on the pool.

I assume the complete command for my case is
zpool clear tank
Why do we have to do that? Couldn't zfs realize that everything is 
fine again after executing "zpool replace tank c1t6d0 c1t15d0"?


Yes, sometimes the clear is not necessary but it will also clear the 
error counts if need be.


I have done

zpool add tank spare c1t15d0
zpool replace tank c1t6d0 c1t15d0

now and waited for the completion of the resilvering process. "zpool 
status" now gives me


scrub: resilver completed after 0h22m with 0 errors on Thu Aug  6 
22:55:37 2009

config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED 0 0 0
  mirror   ONLINE   0 0 0
c1t2d0 ONLINE   0 0 0
c1t3d0 ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c1t5d0 ONLINE   0 0 0
c1t4d0 ONLINE   0 0 0
  mirror   DEGRADED 0 0 0
spare  DEGRADED 0 0 0
  c1t6d0   FAULTED      0    19     0  too many errors
  c1t15d0  ONLINE   0 0 0
c1t7d0 ONLINE   0 0 0
spares
  c1t15d0  INUSE currently in use

errors: No known data errors

It looks like a final step is missing. Can I simply physically 
replace c1t6d0 now, or do I have to do


zpool offline tank c1t6d0

first? Moreover it seems I have to run a

zpool clear

in my case to get rid of the DEGRADED message!? What is the missing bit 
here?
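My current guess at the missing bit (please correct me if this is 
wrong) is that the faulted disk has to be detached so the spare becomes 
a permanent member of the mirror:

    zpool detach tank c1t6d0    # drop the faulted disk; c1t15d0 stops being listed as a spare
    zpool clear tank            # then clear the old error counters

after which the mirror should show c1t15d0 and c1t7d0 as ONLINE.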



zpool offline tank c1t6d0

zpool replace tank c1t6d0
zpool online tank c1t6d0


Just out of curiosity (since I took the other route this time): how 
does the replace command know what exactly to do here? In my case I 
told the system specifically to replace c1t6d0 with c1t15d0 by doing 
"zpool replace tank c1t6d0 c1t15d0", but if I simply issue


zpool replace tank c1t6d0

it ...!??

Thanks a lot,

  Andreas





Re: [zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Andreas Höschler

Hi Cindy,



Good job for using a mirrored configuration. :-)


Thanks!


Your various approaches would work.

My only comment about #2 is that it might take some time for the spare
to kick in for the faulted disk.

Both 1 and 2 would take a bit more time than just replacing the faulted
disk with a spare disk, like this:

# zpool replace tank c1t6d0 c1t15d0


You mean I can execute

zpool replace tank c1t6d0 c1t15d0

without having made c1t15d0 a spare disk first with

zpool add tank spare c1t15d0

? After doing that c1t6d0 is offline and ready to be physically 
replaced?



Then you could physically replace c1t6d0 and add it back to the pool as
a spare, like this:

# zpool add tank spare c1t6d0

For a production system, the steps above might be the most efficient.
Get the faulted disk replaced with a known good disk so the pool is
no longer degraded, then physically replace the bad disk when you have
the time and add it back to the pool as a spare.

It is also good practice to run a zpool scrub to ensure the
replacement is operational


That would be

zpool scrub tank

in my case!?


and use zpool clear to clear the previous
errors on the pool.


I assume the complete command for my case is

zpool clear tank

Why do we have to do that? Couldn't zfs realize that everything is fine 
again after executing "zpool replace tank c1t6d0 c1t15d0"?


 If the system is used heavily, then you might want to run the zpool 
scrub when system use is reduced.


That would be now! :-)


If you were going to physically replace c1t6d0 while it was still
attached to the pool, then you might offline it first.


Ok, this sounds like approach 3)

zpool offline tank c1t6d0

zpool online tank c1t6d0

Would that be it?

Thanks a lot!

Regards,

  Andreas




[zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Andreas Höschler

Dear managers,

one of our servers (X4240) shows a faulty disk:


-bash-3.00# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c1t6d0  FAULTED      0    19     0  too many errors
            c1t7d0  ONLINE       0     0     0

errors: No known data errors

I derived the following possible approaches to solve the problem:

1) A way to reestablish redundancy would be to use the command

       zpool attach tank c1t7d0 c1t15d0

to add c1t15d0 to the virtual device "c1t6d0 + c1t7d0". We would still
have the faulty disk in the virtual device.

We could then detach the faulty disk with the command

       zpool detach tank c1t6d0

2) Another approach would be to add a spare disk to tank

       zpool add tank spare c1t15d0

and then replace the faulty disk:

       zpool replace tank c1t6d0 c1t15d0

In theory that is easy, but since I have never done this and since this
is a production server, I would appreciate it if someone with more
experience would look over my agenda before I issue these commands.

What is the difference between the two approaches? Which one do you
recommend? And is that really all that has to be done, or am I missing a
bit? I mean, can c1t6d0 be physically replaced after issuing "zpool
detach tank c1t6d0" or "zpool replace tank c1t6d0 c1t15d0"? I also
found the command

       zpool offline tank  ...

but am not sure whether this should be used in my case. Hints are
greatly appreciated!

Thanks a lot,

  Andreas
