Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-14 Thread Paul Armstrong
> Paul Kraus wrote:
> > In the ZFS case I could replace the disk and the zpool would
> > resilver automatically. I could also take the removed disk and put it
> > into the second system and have it recognize the zpool (and that it
> > was missing half of a mirror) and the data was all there.
> >
> > In no case did I see any data loss or corruption. I had
> > attributed the system hanging to an interaction between the SAS and
> > ZFS layers, but the previous post makes me question that assumption.
> >
> > As another data point, I have an old Intel box at home I am
> > running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on
> > SVM/UFS mirrored partitions and /export home is on a pair of
> > partitions in a zpool (mirror). I had a bad power connector and
> > sometime after booting lost one of the drives. The server kept running
> > fine. Once I got the drive powered back up (while the server was shut
> > down), the SVM mirrors resync'd and the zpool resilvered. The zpool
> > finished substantially before the SVM.
> >
> > In all cases the OS was Solaris 10 U 3 (11/06) with no
> > additional patches.
>
> The behaviour you describe is what I would expect for that release of
> Solaris + ZFS.

It seems this is fixed in SXCE; do you know if any of the fixes made it into
10_U4?

Thanks,
Paul
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Marion Hakanson
> . . .
>> Use JBODs. Or tell the cache controllers to ignore
>> the flushing requests.
[EMAIL PROTECTED] said:
> Unfortunately HP EVA can't do it. About the 9900V, it is really fast (64GB
> cache helps a lot) and reliable. 100% uptime in years. We'll never touch it
> to solve a ZFS problem. 

On our low-end HDS array (9520V), turning on "Synchronize Cache Invalid Mode"
did the trick for ZFS purposes (Solaris-10U3).  They've since added a Solaris
kernel tunable in /etc/system:
set zfs:zfs_nocacheflush = 1

This has the unfortunate side effect of disabling cache flushes for all disks
on the whole system, though.
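For reference, the same tunable can also be flipped on a live system with mdb,
as mentioned elsewhere in this thread; roughly (verify the syntax on your own
build, and remember it disables flush requests for every pool on the host, so
it is only safe when everything sits behind battery-backed cache):

    # check the current value (0 = flushes enabled)
    echo zfs_nocacheflush/D | mdb -k
    # flip it on the running kernel -- immediate, but not persistent across reboots
    echo zfs_nocacheflush/W 1 | mdb -kw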

ZFS is getting more mature all the time

Regards,

Marion




Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> It seems that maybe there is too large a code path
> leading to panics --
> maybe a side effect of ZFS being "new" (compared to
> other filesystems).  I
> would hope that as these panic issues are coming up
> that the code path
> leading to the panic is evaluated for a specific fix
> or behavior code path.
> Sometimes it does make sense to panic (if there
> _will_ be data damage if
> you continue).  Other times not.
 
I think the same about panics.  So, IMHO, ZFS should not be called "stable".
But you know ... marketing ...  ;)

> I can understand where you are coming from as
>  far as the need for
> uptime and loss of money on that app server. Two years
> of testing for the
> app, Sunfire servers for N+1 because the app can't be
> clustered and you
> have chosen to run a filesystem that has just been
> made public? 

What? That server is running and will be running on UFS for many years!
Upgrading, patching, cleaning ... even touching it is strictly prohibited :)
We upgraded to S10 because of DTrace (it helped us a lot), and during the
test phase we also evaluated ZFS.
Now we only use ZFS for our central backup servers (for many applications, 
systems, customers, ...).
We also manage a lot of other systems and always try to migrate customers to 
Solaris because of stability, resource control, DTrace ..  but we have found ZFS 
disappointing as of today (probably tomorrow it will be THE filesystem).

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Wade . Stuart

[EMAIL PROTECTED] wrote on 09/12/2007 08:04:33 AM:

> > Gino wrote:
> > > The real problem is that ZFS should stop forcing kernel panics.
> > >
> > I found these panics very annoying, too. And even
> > more that the zpool
> > was faulted afterwards. But my problem is that when
> > someone asks me what
> > ZFS should do instead, I have no idea.
>
> Well, what about just hanging processes waiting for I/O on that zpool?
> Would that be possible?

It seems that maybe there is too large a code path leading to panics --
maybe a side effect of ZFS being "new" (compared to other filesystems).  I
would hope that as these panic issues are coming up that the code path
leading to the panic is evaluated for a specific fix or behavior code path.
Sometimes it does make sense to panic (if there _will_ be data damage if
you continue).  Other times not.


>
> > Seagate FibreChannel drives, Cheetah 15k, ST3146855FC
> > for the databases.
>
> What kind of JBOD for those drives? Just to know ...
> We found Xyratex's to be good products.
>
> > That depends on the individual requirements of each
> > service. Basically,
> > we change to recordsize according to the transaction
> > size of the
> > databases and, on the filers, the performance results
> > were best when the
> > recordsize was a bit lower than the average file size
> > (average file size
> > is 12K, so I set a recordsize of 8K). I set a vdev
> > cache size of 8K and
> > our databases worked best with a vq_max_pending of
> > 32. ZFSv3 was used,
> > that's the version which is shipped with Solaris 10
> > 11/06.
>
> thanks for sharing.
>
> > Yes, but why doesn't your application fail over to a
> > standby?
>
> It is a little complex to explain. Basically those apps do a lot of
> number crunching on some very big data sets in RAM. Failover would mean
> starting again from the beginning, with all the customers waiting for
> hours (and losing money).
> We are working on a new app, capable of working with a couple of nodes,
> but it will take some months to reach beta, then 2 years of testing ...
>
> > a system reboot can be a single point of failure,
> > what about the network
> > infrastructure? Hardware errors? Or power outages?
>
> We use Sunfire for that reason. We had 2 cpu failures and no service
> interruption, the same for 1 dimm module (we have been lucky with
> cpu failures ;)).
> HDS raid arrays are excellent about availability. Lots of fc links,
> network links ..
> All this is in a fully redundant datacenter .. and, sure, we have a
> stand by system on a disaster recovery site (hope to never use it!).

  I can understand where you are coming from as far as the need for
uptime and loss of money on that app server. Two years of testing for the
app, Sunfire servers for N+1 because the app can't be clustered and you
have chosen to run a filesystem that has just been made public? ZFS may be
great and all, but this stinks of running a .0 version on the production
machine.  VXFS+snap has well known and documented behaviors tested for
years on production machines. Why did you even choose to run ZFS on that
specific box?

Do not get me wrong, I really like many things about ZFS -- it is
groundbreaking.  I still do not get why it would be chosen for a server in that
position until it has better real-world production testing and modeling.
You have taken all of the buildup you have done and introduced an unknown
into the mix.


>
> > I'm definitely NOT some kind of know-it-all, don't
> > misunderstand me.
> > Your statement just let my alarm bells ring and
> > that's why I'm asking.
>
> Don't worry Ralf. Any suggestion/opinion/critic is welcome.
> It's a pleasure to exchange our experience
>
> Gino
>
>


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> Gino wrote:
> > The real problem is that ZFS should stop forcing kernel panics.
> >
> I found these panics very annoying, too. And even
> more that the zpool 
> was faulted afterwards. But my problem is that when
> someone asks me what 
> ZFS should do instead, I have no idea.

Well, what about just hanging processes waiting for I/O on that zpool?
Would that be possible?

> Seagate FibreChannel drives, Cheetah 15k, ST3146855FC
> for the databases.

What kind of JBOD for those drives? Just to know ...
We found Xyratex's to be good products.

> That depends on the individual requirements of each
> service. Basically, 
> we change to recordsize according to the transaction
> size of the 
> databases and, on the filers, the performance results
> were best when the 
> recordsize was a bit lower than the average file size
> (average file size 
> is 12K, so I set a recordsize of 8K). I set a vdev
> cache size of 8K and 
> our databases worked best with a vq_max_pending of
> 32. ZFSv3 was used, 
> that's the version which is shipped with Solaris 10
> 11/06.

thanks for sharing.

> Yes, but why doesn't your application fail over to a
> standby? 

It is a little complex to explain. Basically those apps do a lot of number
crunching on some very big data sets in RAM. Failover would mean starting
again from the beginning, with all the customers waiting for hours (and losing
money).
We are working on a new app, capable of working with a couple of nodes, but it
will take some months to reach beta, then 2 years of testing ...

> a system reboot can be a single point of failure,
> what about the network 
> infrastructure? Hardware errors? Or power outages?

We use Sunfire for that reason. We had 2 CPU failures and no service
interruption, and the same for 1 DIMM module (we have been lucky with CPU
failures ;)).
HDS RAID arrays are excellent for availability. Lots of FC links, network
links ..
All this is in a fully redundant datacenter .. and, sure, we have a standby
system at a disaster recovery site (hope to never use it!).

> I'm definitely NOT some kind of know-it-all, don't
> misunderstand me. 
> Your statement just let my alarm bells ring and
> that's why I'm asking.

Don't worry, Ralf. Any suggestion/opinion/criticism is welcome.
It's a pleasure to exchange our experiences.

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Ralf Ramge
Gino wrote:
> The real problem is that ZFS should stop forcing kernel panics.
>
>   
I found these panics very annoying, too. And even more so that the zpool 
was faulted afterwards. But my problem is that when someone asks me what 
ZFS should do instead, I have no idea.

>> I have large Sybase database servers and file servers
>> with billions of 
>> inodes running using ZFSv3. They are attached to
>> X4600 boxes running 
>> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using
>> dumb and cheap 
>> Infortrend FC JBODs (2 GBit/s) as storage shelves.
>> 
>
> Are you using FATA drives?
>
>   
Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the databases.

For the NFS filers we use Infortrend FC shelves with SATA inside.

>> During all my 
>> benchmarks (both on the command line and within
>> applications) show that 
>> the FibreChannel is the bottleneck, even with random
>> read. ZFS doesn't 
>> do this out of the box, but a bit of tuning helped a
>> lot.
>> 
>
> You found another good point.
> I think that with ZFS and JBOD, FC links will soon be the bottleneck.
> What tuning have you done?
>
>   
That depends on the individual requirements of each service. Basically, 
we change the recordsize according to the transaction size of the 
databases and, on the filers, the performance results were best when the 
recordsize was a bit lower than the average file size (average file size 
is 12K, so I set a recordsize of 8K). I set a vdev cache size of 8K and 
our databases worked best with a vq_max_pending of 32. ZFSv3 was used, 
that's the version which ships with Solaris 10 11/06.
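For anyone wanting to try the dataset-level part of that: recordsize is a
per-dataset property and only affects files written after it is changed, so
set it before loading the data (the pool/dataset name below is just an
example). The vdev cache size and vq_max_pending were kernel tunables at the
time, set via mdb or /etc/system, and their exact names varied between builds,
so I won't quote them here.

    # example only -- substitute your own pool/dataset
    zfs set recordsize=8K tank/db
    zfs get recordsize tank/db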

> It is a problem if your apps hang waiting for you to power down/pull out the
> drive!
> Especially in a time=money environment :)
>
>   
Yes, but why doesn't your application fail over to a standby? I'm also 
working in a "time is money and failure is not an option" environment, and I 
doubt I would sleep better if I were responsible for an application 
under such a service level agreement without full high availability. If 
a system reboot can be a single point of failure, what about the network 
infrastructure? Hardware errors? Or power outages?
I'm definitely NOT some kind of know-it-all, don't misunderstand me. 
Your statement just set my alarm bells ringing, and that's why I'm asking.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren



Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> > -We had tons of kernel panics because of ZFS.
> > Here a "reboot" must be planned with a couple of
> weeks in advance
> > and done only at saturday night ..
> >   
> Well, I'm sorry, but if your datacenter runs into
> problems when a single 
> server isn't available, you probably have much worse
> problems. ZFS is a 
> file system. It's not a substitute for hardware
> trouble or a misplanned 
> infrastructure. What would you do if you had the fsck
> you mentioned 
> earlier? Or with another file system like UFS, ext3,
> whatever? Boot a 
> system into single user mode and fsck several
> terabytes, after planning 
> it a couple of weeks in advance?

For example, we have a couple of apps using 80-290GB of RAM and some thousands
of users.
We use Solaris + SPARC + high-end storage because we can't afford downtime.
We can deal with a failed file system. A reboot during the day would cost a lot 
of money.
The real problem is that ZFS should stop forcing kernel panics.

> > -Our 9900V and HP EVAs works really BAD with ZFS
> because of large cache.
> > (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve
> the problem. Only helped a bit.
> >
> >   
> Use JBODs. Or tell the cache controllers to ignore
> the flushing 
> requests. Should be possible, even the $10k low-cost
> StorageTek arrays 
> support this.

Unfortunately the HP EVA can't do it.
About the 9900V, it is really fast (the 64GB cache helps a lot) and reliable.
100% uptime in years.
We'll never touch it to solve a ZFS problem.
We started using JBODs (12 x 16-drive shelves) with ZFS, but the speed and
reliability (today) are not comparable to HDS+UFS.

> > -ZFS performs badly with a lot of small files.
> > (about 20 times slower that UFS with our millions
> file rsync procedures)
> >
> >   
> I have large Sybase database servers and file servers
> with billions of 
> inodes running using ZFSv3. They are attached to
> X4600 boxes running 
> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using
> dumb and cheap 
> Infortrend FC JBODs (2 GBit/s) as storage shelves.

Are you using FATA drives?

> During all my 
> benchmarks (both on the command line and within
> applications) show that 
> the FibreChannel is the bottleneck, even with random
> read. ZFS doesn't 
> do this out of the box, but a bit of tuning helped a
> lot.

You found another good point.
I think that with ZFS and JBOD, FC links will soon be the bottleneck.
What tuning have you done?

> > -ZFS+FC JBOD:  failed hard disk need a reboot
> :(
> > (frankly unbelievable in 2007!)
> >   
> No. Read the thread carefully. It was mentioned that
> you don't have to 
> reboot the server, all you need to do is pull the
> hard disk. Shouldn't 
> be a problem, except if you don't want to replace the
> faulty one anyway. 

It is a problem if your apps hang waiting for you to power down/pull out the
drive!
Especially in a time=money environment :)

> No other manual operations will be necessary, except
> for the final "zfs 
> replace". You could also try cfgadm to get rid of ZFS
> pool problems, 
> perhaps it works - I'm not sure about this, because I
> had the idea 
> *after* I solved that problem, but I'll give it a try
> someday.
> > Anyway we happily use ZFS on our new backup systems
> (snapshotting with ZFS is amazing), but to tell you
> the true we are keeping 2 large zpool in sync on each
> system because we fear an other zpool corruption.
> >
> >   
> May I ask how you accomplish that?

During the day we sync pool1 with pool2, then we "umount pool2" during scheduled
backup operations at night.
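One way to do that kind of sync is with snapshots and zfs send/receive (the
dataset names below are illustrative, and rsync works too; the target must not
be modified between receives, or it has to be rolled back first):

    # full copy once
    zfs snapshot pool1/data@monday
    zfs send pool1/data@monday | zfs recv pool2/data-copy
    # later, send only the changes since the last snapshot
    zfs snapshot pool1/data@tuesday
    zfs send -i pool1/data@monday pool1/data@tuesday | zfs recv pool2/data-copy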

> And why are you doing this? You should replicate your
> zpool to another 
> host, instead of mirroring locally. Where's your
> redundancy in that?

We have 4 backup hosts. Soon we'll move to a 10G network and we'll replicate to
different hosts, as you pointed out.

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> We have seen just the opposite... we have a server with about
> 40 million files and only 4 TB of data. We have been benchmarking FSes
> for creation and manipulation of large populations of
> small files and
> ZFS is the only one we have found that continues to
> scale linearly
> above one million files in one FS. UFS, VXFS, HFS+
> (don't ask why),
> NSS (on NW not Linux) all show exponential growth in
> response time as
> you cross a certain knee (we are graphing time to
> create  zero
> length files, then do a series of basic manipulations
> on them) in
> number of files. For all the FSes we have tested that
> knee has been
> under one million files, except for ZFS. I know this
> is not 'real
> world' but it does reflect the response time issues
> we have been
> trying to solve. I will see if my client (I am a
> consultant) will
> allow me to post the results, as I am under NDA for
> most of the
> details of what we are doing.

It would be great!

> On the other hand, we have seen serious
> issues using rsync to
> migrate this data from the existing server to the
> Solaris 10 / ZFS
> system, so perhaps your performance issues were rsync
> related and not
> ZFS. In fact, so far the fastest and most reliable
> method for moving
> the data is proving to be Veritas NetBackup (back it
> up on the source
> server, restore to the new ZFS server).
> 
> Now having said all that, we are probably never going to see
> 100 million files in one zpool, because the ZFS architecture lets us
> use a more distributed model (many zpools and
> datasets within them)
> and still present the end users with a single view of
> all the data.

Hi Paul,
may I ask what your average file size is? Have you done any optimization?
ZFS recordsize?
Did your test also include writing 1 million files?

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> Yes, this is a case where the disk has not completely
> failed.
> ZFS seems to handle the completely failed disk case
> properly, and
> has for a long time.  Cutting the power (which you
> can also do with
> luxadm) makes the disk appear completely failed.

Richard, I think you're right.
The failed disk is still working, but it has run out of spare space for remapping bad sectors...

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Gino
> On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
> > -ZFS+FC JBOD:  failed hard disk need a reboot
> :(
> > (frankly unbelievable in 2007!)
> 
> So, I've been using ZFS with some creaky old FC JBODs
> (A5200's) and old
> disks which have been failing regularly and haven't
> seen that; the worst
> I've seen running nevada was that processes touching
> the pool got stuck,

this is the problem

> but they all came unstuck when I powered off the
> at-fault FC disk via
> the A5200 front panel.

I'll try again with the EMC JBOD, but the fact remains that you need
to manually recover from a hard disk failure.

gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Ralf Ramge
Gino wrote:
[...]

> Just a few examples:
> -We lost several zpool with S10U3 because of "spacemap" bug,
> and -nothing- was recoverable.  No fsck here :(
>
>   
Yes, I criticized the lack of zpool recovery mechanisms, too, during my 
AVS testing.  But I don't have the know-how to judge whether there are 
technical reasons for it.

> -We had tons of kernel panics because of ZFS.
> Here a "reboot" must be planned with a couple of weeks in advance
> and done only at saturday night ..
>   
Well, I'm sorry, but if your datacenter runs into problems when a single 
server isn't available, you probably have much worse problems. ZFS is a 
file system. It can't compensate for hardware trouble or a misplanned 
infrastructure. What would you do if you had the fsck you mentioned 
earlier? Or with another file system like UFS, ext3, whatever? Boot a 
system into single user mode and fsck several terabytes, after planning 
it a couple of weeks in advance?

> -Our 9900V and HP EVAs works really BAD with ZFS because of large cache.
> (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. Only helped 
> a bit.
>
>   
Use JBODs. Or tell the cache controllers to ignore the flushing 
requests. Should be possible, even the $10k low-cost StorageTek arrays 
support this.

> -ZFS performs badly with a lot of small files.
> (about 20 times slower that UFS with our millions file rsync procedures)
>
>   
I have large Sybase database servers and file servers with billions of 
inodes running on ZFSv3. They are attached to X4600 boxes running 
Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap 
Infortrend FC JBODs (2 GBit/s) as storage shelves. All my 
benchmarks (both on the command line and within applications) show that 
the FibreChannel is the bottleneck, even with random reads. ZFS doesn't 
do this out of the box, but a bit of tuning helped a lot.

> -ZFS+FC JBOD:  failed hard disk need a reboot :(
> (frankly unbelievable in 2007!)
>   
No. Read the thread carefully. It was mentioned that you don't have to 
reboot the server; all you need to do is pull the hard disk. That shouldn't 
be a problem, unless you don't want to replace the faulty one anyway. 
No other manual operations should be necessary, except for the final "zpool 
replace". You could also try cfgadm to get rid of ZFS pool problems -- 
perhaps it works; I'm not sure about this, because I had the idea 
*after* I solved that problem, but I'll give it a try someday.
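For what it's worth, the cfgadm route would look roughly like this (the
attachment-point ID and device names are illustrative; check the cfgadm -al
output on your own box):

    # list attachment points and their state
    cfgadm -al
    # unconfigure the failed disk's attachment point
    cfgadm -c unconfigure c2::500104f0008a23b1
    # once the replacement is configured, hand it back to ZFS
    zpool replace mypool c2t500104F0008A23B1d0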

> Anyway we happily use ZFS on our new backup systems (snapshotting with ZFS is 
> amazing), but to tell you the true we are keeping 2 large zpool in sync on 
> each system because we fear an other zpool corruption.
>
>   
May I ask how you accomplish that?

And why are you doing this? You should replicate your zpool to another 
host, instead of mirroring locally. Where's your redundancy in that?

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren



Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-11 Thread Paul Kraus
On 9/11/07, Gino <[EMAIL PROTECTED]> wrote:

> -ZFS performs badly with a lot of small files.
> (about 20 times slower that UFS with our millions file rsync procedures)

We have seen just the opposite... we have a server with about
40 million files and only 4 TB of data. We have been benchmarking FSes
for creation and manipulation of large populations of small files and
ZFS is the only one we have found that continues to scale linearly
above one million files in one FS. UFS, VXFS, HFS+ (don't ask why),
NSS (on NW not Linux) all show exponential growth in response time as
you cross a certain knee (we are graphing time to create  zero
length files, then do a series of basic manipulations on them) in
number of files. For all the FSes we have tested that knee has been
under one million files, except for ZFS. I know this is not 'real
world' but it does reflect the response time issues we have been
trying to solve. I will see if my client (I am a consultant) will
allow me to post the results, as I am under NDA for most of the
details of what we are doing.
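A rough sketch of that kind of test, for anyone who wants to reproduce the
shape of it (this is not the actual harness, just an illustration; adjust N
and the target directory):

    #!/bin/ksh
    # create N empty files, then stat and rename them, timing each pass
    N=${1:-100000}
    DIR=${2:-/tank/bench/run.$$}
    mkdir -p $DIR
    time ( i=0; while [ $i -lt $N ]; do touch $DIR/f$i; i=$((i+1)); done )
    time ( i=0; while [ $i -lt $N ]; do ls -l $DIR/f$i >/dev/null; i=$((i+1)); done )
    time ( i=0; while [ $i -lt $N ]; do mv $DIR/f$i $DIR/g$i; i=$((i+1)); done )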

On the other hand, we have seen serious issues using rsync to
migrate this data from the existing server to the Solaris 10 / ZFS
system, so perhaps your performance issues were rsync related and not
ZFS. In fact, so far the fastest and most reliable method for moving
the data is proving to be Veritas NetBackup (back it up on the source
server, restore to the new ZFS server).

Now having said all that, we are probably never going to see
100 million files in one zpool, because the ZFS architecture lets us
use a more distributed model (many zpools and datasets within them)
and still present the end users with a single view of all the data.

-- 
Paul Kraus


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-11 Thread Richard Elling
Bill Sommerfeld wrote:
> On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
>> -ZFS+FC JBOD:  failed hard disk need a reboot :(
>> (frankly unbelievable in 2007!)
> 
> So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old
> disks which have been failing regularly and haven't seen that; the worst
> I've seen running nevada was that processes touching the pool got stuck,
> but they all came unstuck when I powered off the at-fault FC disk via
> the A5200 front panel.

Yes, this is a case where the disk has not completely failed.
ZFS seems to handle the completely failed disk case properly, and
has for a long time.  Cutting the power (which you can also do with
luxadm) makes the disk appear completely failed.
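For the record, that looks roughly like the following on an A5x00-class
enclosure (the enclosure name and slot here are made up -- check luxadm(1M)
and the luxadm probe output for your own box):

    # list FC enclosures and their names
    luxadm probe
    # power off the suspect drive, e.g. front slot 3 of enclosure "disko"
    luxadm power_off disko,f3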
  -- richard


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-11 Thread Bill Sommerfeld
On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
> -ZFS+FC JBOD:  failed hard disk need a reboot :(
> (frankly unbelievable in 2007!)

So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old
disks which have been failing regularly and haven't seen that; the worst
I've seen running nevada was that processes touching the pool got stuck,
but they all came unstuck when I powered off the at-fault FC disk via
the A5200 front panel.




Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-11 Thread Gino
> To put this in perspective, no system on the planet
> today handles all faults.
> I would even argue that building such a system is
> theoretically impossible.

no doubt about that ;)

> So the subset of faults which ZFS covers is different than the subset
> that UFS covers and different than what SVM covers.  For example, we *know*
> that ZFS has allowed people to detect and recover from faulty SAN switches,
> broken RAID arrays, and accidental deletions which UFS could never even have
> detected.  There are some known gaps which are being closed in ZFS, but it is
> simply not the case that UFS is superior in all RAS respects to ZFS.

I agree ZFS features are outstanding, BUT from my point of view ZFS
has been integrated into Solaris too early, without much testing.

Just a few examples:
-We lost several zpools with S10U3 because of the "spacemap" bug,
and -nothing- was recoverable.  No fsck here :(

-We had tons of kernel panics because of ZFS.
Here a "reboot" must be planned a couple of weeks in advance
and done only on Saturday night ..

-Our 9900V and HP EVAs work really BADLY with ZFS because of their large caches.
(echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. Only helped a 
bit.

-ZFS performs badly with a lot of small files.
(about 20 times slower than UFS with our millions-of-files rsync procedures)

-ZFS+FC JBOD:  a failed hard disk needs a reboot :(
(frankly unbelievable in 2007!)

Anyway we happily use ZFS on our new backup systems (snapshotting with ZFS is 
amazing), but to tell you the truth we are keeping 2 large zpools in sync on each 
system because we fear another zpool corruption.

Many friends of mine working in big Solaris environments moved to ZFS with 
S10U3 and then soon went back to UFS because of the same problems. 
Sure, for our home server with cheap ATA drives ZFS is unbeatable and free :)

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-10 Thread Richard Elling
Gino wrote:
>>> Richard, thank you for your detailed reply.
>>> Unfortunately an other reason to stay with UFS in
>> production ..
>>>   
>> IMHO, maturity is the primary reason to stick with
>> UFS.  To look at
>> this through the maturity lens, UFS is the great
>> grandfather living on
>> life support (prune juice and oxygen) while ZFS is
>> the late adolescent,
>> soon to bloom into a young adult. The torch will pass
>> when ZFS
>> becomes the preferred root file system.
>>  -- richard
> 
> I agree with you but don't understand why Sun has integrated ZFS on Solaris 
> and declared it as stable.
> Sun Sales tell you to trash your old redundant arrays and go with jbod and 
> ZFS...
> but don't tell you that you probably will need to reboot your SF25k because 
> of a disk failure!!  :(

To put this in perspective, no system on the planet today handles all faults.
I would even argue that building such a system is theoretically impossible.

So the subset of faults which ZFS covers is different than the subset
that UFS covers and different than what SVM covers.  For example, we *know*
that ZFS has allowed people to detect and recover from faulty SAN switches,
broken RAID arrays, and accidental deletions which UFS could never even have
detected.  There are some known gaps which are being closed in ZFS, but it is
simply not the case that UFS is superior in all RAS respects to ZFS.
  -- richard


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-09 Thread Gino
> >  
> > Richard, thank you for your detailed reply.
> > Unfortunately an other reason to stay with UFS in
> production ..
> >
> >   
> IMHO, maturity is the primary reason to stick with
> UFS.  To look at
> this through the maturity lens, UFS is the great
> grandfather living on
> life support (prune juice and oxygen) while ZFS is
> the late adolescent,
> soon to bloom into a young adult. The torch will pass
> when ZFS
> becomes the preferred root file system.
>  -- richard

I agree with you, but I don't understand why Sun has integrated ZFS into Solaris
and declared it stable.
Sun Sales tell you to trash your old redundant arrays and go with JBODs and
ZFS...
but don't tell you that you will probably need to reboot your SF25k because of
a disk failure!!  :(
Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-08 Thread Richard Elling
Gino wrote:
>>> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
>>> After a reboot  ZFS recognized the drive as failed and all worked well.
>>>
>>> Do we need to restart Solaris after a drive failure??
>>
>> It depends...
>> ... on which version of Solaris you are running.  ZFS FMA phase 2 was
>> integrated into SXCE build 68.  Prior to that release, ZFS had a limited
>> view of the (many) disk failure modes -- it would say a disk was failed
>> if it could not be opened.  In phase 2, the ZFS diagnosis engine was
>> enhanced to look for per-vdev soft error rate discriminator (SERD) engines.
>
> Richard, thank you for your detailed reply.
> Unfortunately another reason to stay with UFS in production ..
>
IMHO, maturity is the primary reason to stick with UFS.  To look at
this through the maturity lens, UFS is the great grandfather living on
life support (prune juice and oxygen) while ZFS is the late adolescent,
soon to bloom into a young adult. The torch will pass when ZFS
becomes the preferred root file system.
 -- richard



Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-08 Thread Gino
> >> "cfgadm -al" or "devfsadm -C" didn't solve the
> problem.
> >> After a reboot  ZFS recognized the drive as failed
> and all worked well.
> >>
> >> Do we need to restart Solaris after a drive
> failure??
> 
> It depends...
> ... on which version of Solaris you are running.  ZFS
> FMA phase 2 was
> integrated into SXCE build 68.  Prior to that
> release, ZFS had a limited
> view of the (many) disk failure modes -- it would say
> a disk was failed
> if it could not be opened.  In phase 2, the ZFS
> diagnosis engine was
> enhanced to look for per-vdev soft error rate
> discriminator (SERD) engines.
 
Richard, thank you for your detailed reply.
Unfortunately another reason to stay with UFS in production ..

Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-05 Thread Richard Elling
Paul Kraus wrote:
> On 9/4/07, Gino <[EMAIL PROTECTED]> wrote:
> 
>> yesterday we had a drive failure on a fc-al jbod with 14 drives.
>> Suddenly the zpool using that jbod stopped to respond to I/O requests
>> and we get tons of the following messages on /var/adm/messages:
> 
> 
> 
>> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
>> After a reboot  ZFS recognized the drive as failed and all worked well.
>>
>> Do we need to restart Solaris after a drive failure??

It depends...

> I would hope not but ... prior to putting some ZFS volumes
> into production we did some failure testing. The hardware I was
> testing with was a couple SF-V245 with 4 x 72 GB disks each. Two disks
> were setup with SVM/UFS as mirrored OS, the other two were handed to
> ZFS as a mirrored zpool. I did some large file copies to generate I/O.
> While a large copy was going on (lots of disk I/O) I pulled one of the
> drives.

... on which version of Solaris you are running.  ZFS FMA phase 2 was
integrated into SXCE build 68.  Prior to that release, ZFS had a limited
view of the (many) disk failure modes -- it would say a disk was failed
if it could not be opened.  In phase 2, the ZFS diagnosis engine was
enhanced to look for per-vdev soft error rate discriminator (SERD) engines.

More details can be found in the ARC case materials:
http://www.opensolaris.org/os/community/arc/caselog/2007/283/materials/portfolio-txt/

In SXCE build 72 we gain a new FMA I/O retire agent.  This is more general
purpose and allows a process to set a contract against a device in use.
http://www.opensolaris.org/os/community/on/flag-days/pages/2007080901/
http://www.opensolaris.org/os/community/arc/caselog/2007/290/
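On builds with the newer diagnosis code, the usual way to see what FMA has
concluded about a pool's disks is something like the following (these are the
standard Solaris FMA/ZFS tools; output details vary by release):

    # resources FMA currently considers faulty
    fmadm faulty
    # recent fault diagnoses, and the raw error telemetry feeding the SERD engines
    fmdump
    fmdump -eV | more
    # ZFS's own summary of unhealthy pools
    zpool status -x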

> If the I/O was to the zpool the system would hang (just like
> it was hung waiting on an I/O operation). I let it sit this way for
> over an hour with no recovery. After rebooting it found the existing
> half of the ZFS mirror just fine. Just to be clear, once I pulled the
> disk, over about a 5 minute period *all* activity on the box hung.
> Even a shell just running prstat.

It may depend on what shell you are using.  Some shells, such as ksh
write to the $HISTFILE before exec'ing the command.  If your $HISTFILE
was located in an affected file system, then you would appear hung.

> If the I/O was to one of the SVM/UFS disks there would be a
> 60-90 second pause in all activity (just like the ZFS case), but then
> operation would resume. This is what I am used to seeing for a disk
> failure.

Default retries for most disks are 60 seconds (last time I checked).
There are several layers involved here, so you can expect something to
happen at 60-second intervals, even if it is just another retry.
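For completeness, the per-command timeout behind that 60 seconds has
historically been tunable in /etc/system; the names below are the commonly
cited sd/ssd tunables and should be treated as assumptions to verify against
your release before relying on them:

    * shorten the SCSI command timeout from the default 60 seconds
    * (sd = SCSI/SAS disks, ssd = FC disks; names assumed, check your release)
    set sd:sd_io_time = 30
    set ssd:ssd_io_time = 30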

> In the ZFS case I could replace the disk and the zpool would
> resilver automatically. I could also take the removed disk and put it
> into the second system and have it recognize the zpool (and that it
> was missing half of a mirror) and the data was all there.
> 
> In no case did I see any data loss or corruption. I had
> attributed the system hanging to an interaction between the SAS and
> ZFS layers, but the previous post makes me question that assumption.
> 
> As another data point, I have an old Intel box at home I am
> running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on
> SVM/UFS mirrored partitions and /export home is on a pair of
> partitions in a zpool (mirror). I had a bad power connector and
> sometime after booting lost one of the drives. The server kept running
> fine. Once I got the drive powered back up (while the server was shut
> down), the SVM mirrors resync'd and the zpool resilvered. The zpool
> finished substantially before the SVM.
> 
> In all cases the OS was Solaris 10 U 3 (11/06) with no
> additional patches.

The behaviour you describe is what I would expect for that release of
Solaris + ZFS.
  -- richard


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-05 Thread Paul Kraus
On 9/4/07, Gino <[EMAIL PROTECTED]> wrote:

> yesterday we had a drive failure on a fc-al jbod with 14 drives.
> Suddenly the zpool using that jbod stopped to respond to I/O requests
> and we get tons of the following messages on /var/adm/messages:



> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
> After a reboot  ZFS recognized the drive as failed and all worked well.
>
> Do we need to restart Solaris after a drive failure??

I would hope not but ... prior to putting some ZFS volumes
into production we did some failure testing. The hardware I was
testing with was a couple of SF-V245s with 4 x 72 GB disks each. Two disks
were set up with SVM/UFS as a mirrored OS, the other two were handed to
ZFS as a mirrored zpool. I did some large file copies to generate I/O.
While a large copy was going on (lots of disk I/O) I pulled one of the
drives.

If the I/O was to the zpool the system would hang (just like
it was hung waiting on an I/O operation). I let it sit this way for
over an hour with no recovery. After rebooting it found the existing
half of the ZFS mirror just fine. Just to be clear, once I pulled the
disk, over about a 5 minute period *all* activity on the box hung.
Even a shell just running prstat.

If the I/O was to one of the SVM/UFS disks there would be a
60-90 second pause in all activity (just like the ZFS case), but then
operation would resume. This is what I am used to seeing for a disk
failure.

In the ZFS case I could replace the disk and the zpool would
resilver automatically. I could also take the removed disk and put it
into the second system and have it recognize the zpool (and that it
was missing half of a mirror) and the data was all there.

In no case did I see any data loss or corruption. I had
attributed the system hanging to an interaction between the SAS and
ZFS layers, but the previous post makes me question that assumption.

As another data point, I have an old Intel box at home I am
running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on
SVM/UFS mirrored partitions and /export home is on a pair of
partitions in a zpool (mirror). I had a bad power connector and
sometime after booting lost one of the drives. The server kept running
fine. Once I got the drive powered back up (while the server was shut
down), the SVM mirrors resync'd and the zpool resilvered. The zpool
finished substantially before the SVM.

In all cases the OS was Solaris 10 U 3 (11/06) with no
additional patches.

-- 
Paul Kraus


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-04 Thread Gino
Hi Mark,

the drive (147GB, FC 2Gb) failed on a Xyratex JBOD.
In the past we also had the same problem with a drive that failed on an EMC CX JBOD.

Anyway, I can't understand why rebooting Solaris sorted out the situation ..

Thank you,
Gino
 
 


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-04 Thread Mark Ashley
I'm going to go out on a limb here and say you have an A5000 with the 
1.6" disks in it. Because of their design (all drives see each other 
on both the A and B loops), it's possible for one disk that is behaving 
badly to take over the FC-AL loop and require human intervention. You 
can physically go up to the A5000 and remove the faulty drive if your 
volume manager software (SVM, VxVM, ZFS, etc.) can still run without the 
drive.

In the above case the WWN (ending in 81b9f) is printed on the label, so 
it's easy to locate the faulty drive. Keep in mind that sometimes the /next/ 
functioning drive in the loop can be the reporting one. It's 
just a quirk of that storage unit.

These days devices will usually have an individual internal FC-AL loop 
to each drive to alleviate this sort of problem.
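If the pool is still responsive, the swap itself usually amounts to something
like this (the pool, device and enclosure names below are illustrative, and the
luxadm enclosure/slot syntax should be checked against luxadm(1M)):

    # take the suspect disk out of service in the pool
    zpool offline tank c4t22000020371B9F00d0
    # prepare the A5x00 slot and pull the drive when luxadm says so
    luxadm remove_device enclosure0,f3
    # after inserting the replacement
    luxadm insert_device
    zpool replace tank c4t22000020371B9F00d0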

Cheers,
Mark.

> Hi all,
>
> yesterday we had a drive failure on a fc-al jbod with 14 drives.
> Suddenly the zpool using that jbod stopped to respond to I/O requests and we 
> get tons of the following messages on /var/adm/messages:
>
> Sep  3 15:20:10 fb2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/[EMAIL 
> PROTECTED] (sd52):
> Sep  3 15:20:10 fb2 SCSI transport failed: reason 'timeout': giving up
>
> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
> After a reboot  ZFS recognized the drive as failed and all worked well.
>
> Do we need to restart Solaris after a drive failure??
>
> Gino