[EMAIL PROTECTED] wrote on 09/12/2007 08:04:33 AM:

> > Gino wrote:
> > > The real problem is that ZFS should stop forcing kernel panics.
> >
> > I found these panics very annoying, too. And even more that the zpool
> > was faulted afterwards. But my problem is that when someone asks me
> > what ZFS should do instead, I have no idea.
>
> Well, what about just letting processes hang while they wait for I/O on
> that zpool? Would that be possible?
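For what it's worth, the behavior Gino asks for -- blocking I/O instead of panicking -- is exactly what the pool-level "failmode" property provides in newer ZFS bits (it is not in the Solaris 10 11/06 / ZFSv3 release discussed here). A sketch, assuming a pool named "tank" (hypothetical name) and a ZFS release that supports the property:

```shell
# failmode controls what happens on catastrophic pool I/O failure:
#   wait     - block all I/O until the devices recover (no panic)
#   continue - return EIO to new writes, keep reads working if possible
#   panic    - the old behavior: crash dump and reboot
zpool set failmode=wait tank

# Verify the setting took effect.
zpool get failmode tank
```

With failmode=wait, processes touching the pool simply hang until the fault is repaired, which is essentially the behavior proposed above.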
It seems there is too large a code path leading to panics -- maybe a side
effect of ZFS being "new" (compared to other filesystems). I would hope
that as these panic issues come up, the code path leading to each panic
is evaluated for a specific fix or an alternate behavior. Sometimes it
does make sense to panic (if there _will_ be data damage if you
continue); other times it does not.

> > Seagate FibreChannel drives, Cheetah 15k, ST3146855FC
> > for the databases.
>
> What kind of JBOD for those drives? Just to know ...
> We found Xyratex's to be good products.

> > That depends on the individual requirements of each service.
> > Basically, we change the recordsize according to the transaction size
> > of the databases and, on the filers, the performance results were
> > best when the recordsize was a bit lower than the average file size
> > (average file size is 12K, so I set a recordsize of 8K). I set a vdev
> > cache size of 8K and our databases worked best with a vq_max_pending
> > of 32. ZFSv3 was used, that's the version shipped with Solaris 10
> > 11/06.
>
> Thanks for sharing.

> > Yes, but why doesn't your application fail over to a standby?
>
> It is a little complex to explain. Basically, those apps are doing a
> lot of number crunching on some very big data in RAM. Failover would
> mean starting again from the beginning, with all the customers waiting
> for hours (and losing money).
> We are working on a new app capable of working with a couple of nodes,
> but it will take some months to reach beta, then 2 years of testing ...

> > A system reboot can be a single point of failure; what about the
> > network infrastructure? Hardware errors? Or power outages?
>
> We use Sunfire for that reason. We had 2 CPU failures and no service
> interruption, the same for 1 DIMM module (we have been lucky with CPU
> failures ;)).
> HDS raid arrays are excellent about availability. Lots of FC links,
> network links ..
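The recordsize tuning quoted above maps onto standard zfs(1M) commands; the kernel-side knobs (vq_max_pending, the vdev cache) were tuned via /etc/system in that era. A sketch, assuming a hypothetical dataset "mypool/db"; the /etc/system tunable names are from Solaris 10-era ZFS and should be verified against your specific release:

```shell
# recordsize just below the ~12K average file size, as described above.
# Note: recordsize only affects files written after the change.
zfs set recordsize=8K mypool/db
zfs get recordsize mypool/db

# Kernel tunables go in /etc/system and take effect on reboot.
# Names are era-specific assumptions -- check your release's docs:
#   set zfs:zfs_vdev_max_pending = 32     (queue depth per vdev)
#   set zfs:zfs_vdev_cache_max = 8192     (vdev cache I/O size, 8K)
```

Keeping recordsize aligned with (or slightly below) the typical I/O size avoids read-modify-write amplification on small updates, which is consistent with the results reported above.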
> All this is in a fully redundant datacenter .. and, sure, we have a
> standby system on a disaster recovery site (hope to never use it!).

I can understand where you are coming from as far as the need for uptime
and the loss of money on that app server. Two years of testing for the
app, Sunfire servers for N+1 because the app can't be clustered -- and
yet you have chosen to run a filesystem that has only just been made
public? ZFS may be great and all, but this stinks of running a .0
version on a production machine. VxFS + snapshots has well-known and
documented behavior, tested for years on production machines. Why did
you choose to run ZFS on that specific box?

Do not get me wrong, I really like many things about ZFS -- it is
groundbreaking. I still do not get why it would be chosen for a server
in that position until it has had more real-world production testing and
modeling. You have taken all of the careful buildup you have done and
introduced an unknown into the mix.

> > I'm definitely NOT some kind of know-it-all, don't misunderstand me.
> > Your statement just let my alarm bells ring and that's why I'm
> > asking.
>
> Don't worry, Ralf. Any suggestion/opinion/criticism is welcome.
> It's a pleasure to exchange our experiences.
>
> Gino

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss