[zfs-discuss] question about modification of dd_parent_obj during COW
From the online document of ZFS On-Disk Specification, I found there is a field named "dd_parent_obj" in dsl_dir_phys_t. Will this field be modified or kept unchanged during snapshot COW?

For example, consider a ZFS filesystem mounted on /myzfs, which contains 2 subdirs (A and B). If we do the following steps:

1) create a snapshot named /[EMAIL PROTECTED]
2) rename /myzfs/A to /myzfs/A1

I think the directory objects of /myzfs/A and /myzfs will be COW'd during the rename operation. Now we can access directory B by specifying either "/myzfs/B" or "/[EMAIL PROTECTED]/B". The problems are:

1) What is the parent of B? Will "dd_parent_obj" of B be changed during the COW of the directory object /myzfs?
2) If we remove /myzfs/B thereafter, will "dd_parent_obj" of /[EMAIL PROTECTED]/B be changed?

Thanks.

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Root and upgrades
How long before we can upgrade a ZFS based root fs? Not looking for a Live Upgrade feature, just to be able to boot off a newer release DVD and upgrade in place. Currently using a build 62 based system; would like to start taking a look at some of the features showing up in newer builds.

Thanks,
Mark
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
Pawel Jakub Dawidek FreeBSD.org> writes:

> This is how RAIDZ fills the disks (follow the numbers):
>
>   Disk0   Disk1   Disk2   Disk3
>
>    D0      D1      D2      P3
>    D4      D5      D6      P7
>    D8      D9      D10     P11
>    D12     D13     D14     P15
>    D16     D17     D18     P19
>    D20     D21     D22     P23
>
> D is data, P is parity.

This layout assumes of course that large stripes have been written to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is possible for a single logical block to span only 2 disks (for those who don't know what I am talking about, see the "red" block occupying LBAs D3 and E3 on page 13 of these ZFS slides [1]). To read this logical block (and validate its checksum), only D_0 needs to be read (LBA E3). So in this very specific case, a RAIDZ read operation is as cheap as a RAID5 read operation.

The existence of these small stripes could explain why RAIDZ doesn't perform as badly as RAID5 in Pawel's benchmark...

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
> >This is how RAIDZ fills the disks (follow the numbers):
> >
> >   Disk0   Disk1   Disk2   Disk3
> >
> >    D0      D1      D2      P3
> >    D4      D5      D6      P7
> >    D8      D9      D10     P11
> >    D12     D13     D14     P15
> >    D16     D17     D18     P19
> >    D20     D21     D22     P23
> >
> >D is data, P is parity.
> >
> >And RAID5 does this:
> >
> >   Disk0   Disk1   Disk2   Disk3
> >
> >    D0      D3      D6      P0,3,6
> >    D1      D4      D7      P1,4,7
> >    D2      D5      D8      P2,5,8
> >    D9      D12     D15     P9,12,15
> >    D10     D13     D16     P10,13,16
> >    D11     D14     D17     P11,14,17
>
> Surely the above is not accurate? You're showing the parity data only
> being written to disk3. In RAID5 the parity is distributed across all
> disks in the RAID5 set. What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are differences in practice), but my point wasn't how the parity is distributed :)

Ok, RAID5 once again:

   Disk0   Disk1   Disk2       Disk3

    D0      D3      D6          P0,3,6
    D1      D4      D7          P1,4,7
    D2      D5      D8          P2,5,8
    D9      D12     P9,12,15    D15
    D10     D13     P10,13,16   D16
    D11     D14     P11,14,17   D17

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, 13 Sep 2007, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
>> On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
>>> Hi.
>>>
>>> I've a prototype RAID5 implementation for ZFS. It only works in
>>> non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
>>> performance, as I suspected that RAIDZ, because of full-stripe
>>> operations, doesn't work well for random reads issued by many
>>> processes in parallel.
>>>
>>> There is of course the write-hole problem, which can be mitigated by
>>> running scrub after a power failure or system crash.
>>
>> If I read your suggestion correctly, your implementation is much
>> more like traditional raid-5, with a read-modify-write cycle?
>>
>> My understanding of the raid-z performance issue is that it requires
>> full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read parity.
>
> This is how RAIDZ fills the disks (follow the numbers):
>
>   Disk0   Disk1   Disk2   Disk3
>
>    D0      D1      D2      P3
>    D4      D5      D6      P7
>    D8      D9      D10     P11
>    D12     D13     D14     P15
>    D16     D17     D18     P19
>    D20     D21     D22     P23
>
> D is data, P is parity.
>
> And RAID5 does this:
>
>   Disk0   Disk1   Disk2   Disk3
>
>    D0      D3      D6      P0,3,6
>    D1      D4      D7      P1,4,7
>    D2      D5      D8      P2,5,8
>    D9      D12     D15     P9,12,15
>    D10     D13     D16     P10,13,16
>    D11     D14     D17     P11,14,17

Surely the above is not accurate? You're showing the parity data only being written to disk3. In RAID5 the parity is distributed across all disks in the RAID5 set. What is illustrated above is RAID3.

> As you can see even a small block is stored on all disks in RAIDZ,
> where on RAID5 a small block can be stored on one disk only.

--
Regards,
Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, Sep 13, 2007 at 12:56:44AM +0200, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> > My understanding of the raid-z performance issue is that it requires
> > full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read parity.

I doubt reading the parity could cost all that much (particularly if there's enough I/O capacity). It's that you have to read the full 128KB block, if a file's record size is 128KB, in order to satisfy a 2KB read. And ZFS has to read full blocks in order to verify the checksum.

Nico
--
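[An aside on the arithmetic Nico describes, as a small sketch; the function name is mine, not anything from ZFS, and it assumes an aligned read that touches a single record:]

```python
# Hypothetical back-of-the-envelope sketch (not ZFS code): how much data
# must be read from disk when a whole record has to be fetched just to
# verify its checksum.
def read_amplification(recordsize: int, io_size: int) -> float:
    """Bytes read from disk per byte the application requested,
    assuming an aligned read contained in a single record."""
    return recordsize / io_size

# A 2KB aligned read from a file with the default 128KB recordsize:
print(read_amplification(128 * 1024, 2 * 1024))  # 64.0

# Matching the recordsize to the I/O size removes the amplification:
print(read_amplification(2 * 1024, 2 * 1024))    # 1.0
```

So under these assumptions a 2KB read against the default recordsize pulls 64 times the requested data, which is consistent with the gap seen in the benchmark.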
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
> > Hi.
> >
> > I've a prototype RAID5 implementation for ZFS. It only works in
> > non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> > performance, as I suspected that RAIDZ, because of full-stripe
> > operations, doesn't work well for random reads issued by many
> > processes in parallel.
> >
> > There is of course the write-hole problem, which can be mitigated by
> > running scrub after a power failure or system crash.
>
> If I read your suggestion correctly, your implementation is much
> more like traditional raid-5, with a read-modify-write cycle?
>
> My understanding of the raid-z performance issue is that it requires
> full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and this is not the reason why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read parity.

This is how RAIDZ fills the disks (follow the numbers):

   Disk0   Disk1   Disk2   Disk3

    D0      D1      D2      P3
    D4      D5      D6      P7
    D8      D9      D10     P11
    D12     D13     D14     P15
    D16     D17     D18     P19
    D20     D21     D22     P23

D is data, P is parity.

And RAID5 does this:

   Disk0   Disk1   Disk2   Disk3

    D0      D3      D6      P0,3,6
    D1      D4      D7      P1,4,7
    D2      D5      D8      P2,5,8
    D9      D12     D15     P9,12,15
    D10     D13     D16     P10,13,16
    D11     D14     D17     P11,14,17

As you can see even a small block is stored on all disks in RAIDZ, where on RAID5 a small block can be stored on one disk only.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
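[The difference in the two layouts above can be sketched as a rough model; this is illustrative only, not actual ZFS internals, and the 4-disk group, 512-byte sector, and 128kB RAID5 stripe unit are assumptions taken from the thread:]

```python
# Illustrative sketch: how many disks one random read of a single
# logical block touches, for the 4-disk layouts shown above.
DISKS = 4
PARITY = 1

def raidz_disks_read(block_size: int, sector: int = 512) -> int:
    """RAIDZ spreads every block across the data columns, so a
    non-degraded read touches min(data columns, sectors in block)."""
    data_cols = DISKS - PARITY
    sectors = block_size // sector
    return min(data_cols, sectors)

def raid5_disks_read(block_size: int, stripe_unit: int = 128 * 1024) -> int:
    """RAID5 with a 128kB stripe unit keeps a small block on one disk;
    parity is not read in the non-degraded case."""
    return max(1, -(-block_size // stripe_unit))  # ceiling division

# A 32kB block: RAIDZ touches all 3 data disks, RAID5 only 1.
print(raidz_disks_read(32 * 1024), raid5_disks_read(32 * 1024))  # 3 1
```

Under this model a pool full of small random reads keeps every RAIDZ disk busy on every request, while RAID5 can serve several independent reads in parallel, which matches the roughly 2x requests-per-second gap in the benchmark.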
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large blocks
> written, RAID-Z and RAID-5 should be laid out on disk very similarly
> resulting in similar read performance.
>
> Did you compare the I/O characteristic of both? Was the bottleneck in
> the software or the hardware?

Note that Pawel wrote:

Pawel> I was using 8 processes, I/O size was a random value between 2kB
Pawel> and 32kB (with 2kB step), offset was a random value between 0 and
Pawel> 10GB (also with 2kB step).

If the dataset's record size was the default (Pawel didn't say, right?) then the reason for the lousy read performance is clear: RAID-Z has to read full blocks to verify the checksum, whereas RAID-5 need only read as much as is requested (assuming aligned reads, which Pawel did seem to indicate: "2kB steps"). Peter Tribble pointed out much the same thing already.

The crucial requirement is to match the dataset record size to the I/O size done by the application. If the app writes in bigger chunks than it reads and you want to optimize for write performance, then set the record size to match the write size; else set the record size to match the read size.

Where the dataset record size is not matched to the application's I/O size, I guess we could say that RAID-Z trades off the RAID-5 write hole for a read hole.

Nico
--
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> > And here are the results:
> >
> > RAIDZ:
> >
> > Number of READ requests: 4.
> > Number of WRITE requests: 0.
> > Number of bytes to transmit: 695678976.
> > Number of processes: 8.
> > Bytes per second: 1305213
> > Requests per second: 75
> >
> > RAID5:
> >
> > Number of READ requests: 4.
> > Number of WRITE requests: 0.
> > Number of bytes to transmit: 695678976.
> > Number of processes: 8.
> > Bytes per second: 2749719
> > Requests per second: 158
>
> I'm a bit surprised by these results. Assuming relatively large blocks
> written, RAID-Z and RAID-5 should be laid out on disk very similarly
> resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The smallest block size used was 2kB, to ensure each block is written to all disks in the RAIDZ configuration. In the RAID5 configuration however, a 128kB stripe size was used, which means each block was stored on one disk only. Now when you read the data, RAIDZ needs to read all disks for each block, and RAID5 needs to read only one disk for each block.

> Did you compare the I/O characteristic of both? Was the bottleneck in
> the software or the hardware?

The bottleneck was definitely the disks. The CPU was like 96% idle. To be honest I expected, just like Jeff, a much bigger win in the RAID5 case.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I've a prototype RAID5 implementation for ZFS. It only works in
> non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> performance, as I suspected that RAIDZ, because of full-stripe
> operations, doesn't work well for random reads issued by many processes
> in parallel.
>
> There is of course the write-hole problem, which can be mitigated by
> running scrub after a power failure or system crash.

If I read your suggestion correctly, your implementation is much more like traditional raid-5, with a read-modify-write cycle?

My understanding of the raid-z performance issue is that it requires full-stripe reads in order to validate the checksum. So to get better random read performance, why not simply have a separate checksum for each chunk in the stripe? You still eliminate the raid-5 write hole (albeit at some loss in performance because you have to compute and write extra checksums) but you allow multiple independent reads.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
>
> RAIDZ:
>
> Number of READ requests: 4.
> Number of WRITE requests: 0.
> Number of bytes to transmit: 695678976.
> Number of processes: 8.
> Bytes per second: 1305213
> Requests per second: 75
>
> RAID5:
>
> Number of READ requests: 4.
> Number of WRITE requests: 0.
> Number of bytes to transmit: 695678976.
> Number of processes: 8.
> Bytes per second: 2749719
> Requests per second: 158

I'm a bit surprised by these results. Assuming relatively large blocks written, RAID-Z and RAID-5 should be laid out on disk very similarly resulting in similar read performance.

Did you compare the I/O characteristic of both? Was the bottleneck in the software or the hardware?

Very interesting experiment...

Adam

--
Adam Leventhal, FishWorks                 http://blogs.sun.com/ahl
Re: [zfs-discuss] Again ZFS with expanding LUNs!
I like option #1 because it is simple and quick. It seems unlikely that this will lead to an excessive number of LUNs in the pool in most cases unless you start with a large number of very small LUNs. If you begin with 5 100GB LUNs and over time add 5 more, it still seems like a reasonable and manageable pool with twice the original capacity. And considering the array can likely support hundreds and perhaps thousands of LUNs, it really isn't an issue on the array side either.

Regards,
Vic

On 9/12/07, Bill Korb <[EMAIL PROTECTED]> wrote:
> I found this discussion just today as I recently set up my first S10
> machine with ZFS. We use a NetApp Filer via multipathed FC HBAs, and I
> wanted to know what my options were in regards to growing a ZFS
> filesystem.
>
> After looking at this thread, it looks like there is currently no way
> to grow an existing LUN on our NetApp and then tell ZFS to expand to
> fill the new space. This may be coming down the road at some point, but
> I would like to be able to do this now.
>
> At this point, I believe I have two options:
>
> 1. Add a second LUN and simply do a "zpool add" to add the new space to
> the existing pool.
>
> 2. Create a new LUN that is the size I would like my pool to be, then
> use "zpool replace oldLUNdev newLUNdev" to ask ZFS to resilver my data
> to the new LUN then detach the old one.
>
> The advantage of the first option is that it happens very quickly, but
> it could get kind of messy if you grow the ZFS pool on multiple
> occasions. I've read that some SANs are also limited as to how many
> LUNs can be created (some are limitations of the SAN itself whereas I
> believe that some others impose a limit as part of the SAN license).
> That would also make the first approach less attractive.
>
> The advantage of the second approach is that all of the space would be
> contained in a single LUN. The disadvantages are that this would
> involve copying all of the data from the old LUN to the new one and
> also this means that you need to have enough free space on your SAN to
> create this new, larger LUN.
>
> Is there a best practice regarding this? I'm leaning towards option #2
> so as to keep the number of LUNs I have to manage at a minimum, but #1
> seems like a reasonable alternative, too. Or perhaps there's an option
> #3 that I haven't thought of?
>
> Thanks,
> Bill
Re: [zfs-discuss] Again ZFS with expanding LUNs!
I found this discussion just today as I recently set up my first S10 machine with ZFS. We use a NetApp Filer via multipathed FC HBAs, and I wanted to know what my options were in regards to growing a ZFS filesystem.

After looking at this thread, it looks like there is currently no way to grow an existing LUN on our NetApp and then tell ZFS to expand to fill the new space. This may be coming down the road at some point, but I would like to be able to do this now.

At this point, I believe I have two options:

1. Add a second LUN and simply do a "zpool add" to add the new space to the existing pool.

2. Create a new LUN that is the size I would like my pool to be, then use "zpool replace oldLUNdev newLUNdev" to ask ZFS to resilver my data to the new LUN then detach the old one.

The advantage of the first option is that it happens very quickly, but it could get kind of messy if you grow the ZFS pool on multiple occasions. I've read that some SANs are also limited as to how many LUNs can be created (some are limitations of the SAN itself whereas I believe that some others impose a limit as part of the SAN license). That would also make the first approach less attractive.

The advantage of the second approach is that all of the space would be contained in a single LUN. The disadvantages are that this would involve copying all of the data from the old LUN to the new one and also this means that you need to have enough free space on your SAN to create this new, larger LUN.

Is there a best practice regarding this? I'm leaning towards option #2 so as to keep the number of LUNs I have to manage at a minimum, but #1 seems like a reasonable alternative, too. Or perhaps there's an option #3 that I haven't thought of?

Thanks,
Bill
Re: [zfs-discuss] compression=on and zpool attach
> On 9/12/07, Mike DeMarco <[EMAIL PROTECTED]> wrote:
>
> > Striping several disks together with a stripe width that is tuned for
> > your data model is how you could get your performance up. Striping
> > has been left out of the ZFS model for some reason. Where it is true
> > that RAIDZ will stripe the data across a given drive set, it does not
> > give you the option to tune the stripe width. Due to the write
> > performance problems of RAIDZ you may not get a performance boost
> > from its striping if your write to read ratio is too high, since the
> > driver has to calculate parity for each write.
>
> I am not sure why you think striping has been left out of the ZFS
> model. If you create a ZFS pool without the "raidz" or "mirror"
> keywords, the pool will be striped. Also, the "recordsize" tunable can
> be useful for matching up application I/O to physical I/O.
>
> Thanks,
> - Ryan

Oh... How right you are. I dug into the PDFs and read up on Dynamic Striping. My bad. ZFS rocks.
Re: [zfs-discuss] compression=on and zpool attach
Mike DeMarco wrote:
> I/O bottlenecks are usually caused by a slow disk or one that has heavy
> workloads reading many small files. Two factors that need to be
> considered are head seek latency and spin latency. Head seek latency is
> the amount of time it takes for the head to move to the track that is
> to be written; this is an eternity for the system (usually around 4 or
> 5 milliseconds).

For most modern disks, writes are cached in a buffer. But reads still take the latency hit. The trick is to ensure that writes are committed to media, which ZFS will do for you.

> Spin latency is the amount of time it takes for the spindle to spin the
> track to be read or written over the head. Ideally you only want to pay
> the latency penalty once. If you have large reads and writes going to
> the disk then compression may help a little, but if you have many small
> reads or writes it will do nothing more than burden your CPU with a
> no-gain amount of work, since you are going to be paying Mr Latency for
> each read or write.
>
> Striping several disks together with a stripe width that is tuned for
> your data model is how you could get your performance up. Striping has
> been left out of the ZFS model for some reason. Where it is true that
> RAIDZ will stripe the data across a given drive set, it does not give
> you the option to tune the stripe width.

It is called "dynamic striping" and a write is not compelled to be spread across all vdevs. This opens up an interesting rat hole conversation about whether stochastic spreading is always better than an efficient, larger block write. Our grandchildren might still be arguing this when they enter the retirement home. In general, for ZFS, the top-level dynamic stripe interlace is 1 MByte, which seems to fit well with the 128kByte block size. YMMV.

> Due to the write performance problems of RAIDZ you may not get a
> performance boost from its striping if your write to read ratio is too
> high, since the driver has to calculate parity for each write.

Write performance for raidz is generally quite good, better than most other RAID-5 implementations, which are bitten by the read-modify-write cycle (added latency). raidz can pay for this optimization when doing small, random reads, TANSTAAFL.
 -- richard
Re: [zfs-discuss] I/O freeze after a disk failure
> . . .
>> Use JBODs. Or tell the cache controllers to ignore
>> the flushing requests.

[EMAIL PROTECTED] said:
> Unfortunately HP EVA can't do it. About the 9900V, it is really fast
> (64GB cache helps a lot) and reliable. 100% uptime in years. We'll
> never touch it to solve a ZFS problem.

On our low-end HDS array (9520V), turning on "Synchronize Cache Invalid Mode" did the trick for ZFS purposes (Solaris-10U3). They've since added a Solaris kernel tunable in /etc/system:

   set zfs:zfs_nocacheflush = 1

This has the unfortunate side-effect of disabling cache flushes on all disks for the whole system, though. ZFS is getting more mature all the time.

Regards,
Marion
[zfs-discuss] can we monitor a ZFS pool with SunMC 3.6.1 ?
Hello Everyone,

Can we monitor a ZFS pool with SunMC 3.6.1? Is this a base function? If not, will SunMC 4.0 solve this?

Juan

--
Juan Berlie
Engagement Architect/Architecte de Systèmes
Sun Microsystems, Inc.
1800 McGill College, Suite 800
Montréal, Québec H3A 3J6 CA
Phone x25349/514-285-8349
Mobile 1 514 781 1443
Fax 1 514 285-1983
Email [EMAIL PROTECTED]
Re: [zfs-discuss] I/O freeze after a disk failure
> It seems that maybe there is too large a code path leading to panics --
> maybe a side effect of ZFS being "new" (compared to other filesystems).
> I would hope that as these panic issues are coming up, the code path
> leading to the panic is evaluated for a specific fix or behavior code
> path. Sometimes it does make sense to panic (if there _will_ be data
> damage if you continue). Other times not.

I think the same about panics. So, IMHO, ZFS should not be called "stable". But you know ... marketing ... ;)

> I can understand where you are coming from as far as the need for
> uptime and loss of money on that app server. Two years of testing for
> the app, Sunfire servers for N+1 because the app can't be clustered,
> and you have chosen to run a filesystem that has just been made public?

What? That server is running and will be running on UFS for many years! Upgrading, patching, cleaning ... even touching it is strictly prohibited :) We upgraded to S10 because of DTrace (it helped us a lot) and during the test phase we also evaluated ZFS. Now we only use ZFS for our central backup servers (for many applications, systems, customers, ...).

We also manage a lot of other systems and always try to migrate customers to Solaris because of stability, resource control, DTrace ... but found ZFS disappointing today (probably tomorrow it will be THE filesystem).

Gino

This message posted from opensolaris.org
Re: [zfs-discuss] compression=on and zpool attach
On 9/12/07, Mike DeMarco <[EMAIL PROTECTED]> wrote:
> Striping several disks together with a stripe width that is tuned for
> your data model is how you could get your performance up. Striping has
> been left out of the ZFS model for some reason. Where it is true that
> RAIDZ will stripe the data across a given drive set, it does not give
> you the option to tune the stripe width. Due to the write performance
> problems of RAIDZ you may not get a performance boost from its striping
> if your write to read ratio is too high, since the driver has to
> calculate parity for each write.

I am not sure why you think striping has been left out of the ZFS model. If you create a ZFS pool without the "raidz" or "mirror" keywords, the pool will be striped. Also, the "recordsize" tunable can be useful for matching up application I/O to physical I/O.

Thanks,
- Ryan
--
UNIX Administrator
http://prefetch.net
Re: [zfs-discuss] I/O freeze after a disk failure
[EMAIL PROTECTED] wrote on 09/12/2007 08:04:33 AM:

> > Gino wrote:
> > > The real problem is that ZFS should stop to force kernel panics.
> >
> > I found these panics very annoying, too. And even more that the zpool
> > was faulted afterwards. But my problem is that when someone asks me
> > what ZFS should do instead, I have no idea.
>
> well, what about just hanging processes waiting for I/O on that zpool?
> Could that be possible?

It seems that maybe there is too large a code path leading to panics -- maybe a side effect of ZFS being "new" (compared to other filesystems). I would hope that as these panic issues are coming up, the code path leading to the panic is evaluated for a specific fix or behavior code path. Sometimes it does make sense to panic (if there _will_ be data damage if you continue). Other times not.

> > Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the
> > databases.
>
> What kind of JBOD for those drives? Just to know ...
> We found Xyratex's to be good products.
>
> > That depends on the individual requirements of each service.
> > Basically, we change the recordsize according to the transaction size
> > of the databases and, on the filers, the performance results were
> > best when the recordsize was a bit lower than the average file size
> > (average file size is 12K, so I set a recordsize of 8K). I set a vdev
> > cache size of 8K and our databases worked best with a vq_max_pending
> > of 32. ZFSv3 was used, that's the version which is shipped with
> > Solaris 10 11/06.
>
> thanks for sharing.
>
> > Yes, but why doesn't your application fail over to a standby?
>
> It is a little complex to explain. Basically those apps are doing a lot
> of number crunching on some very big data in RAM. Failover would mean
> starting again from the beginning, with all the customers waiting for
> hours (and losing money).
> We are working on a new app, capable of working with a couple of nodes,
> but it will take some months to be in beta, then 2 years of testing ...
>
> > a system reboot can be a single point of failure, what about the
> > network infrastructure? Hardware errors? Or power outages?
>
> We use Sunfire for that reason. We had 2 cpu failures and no service
> interruption, the same for 1 dimm module (we have been lucky with cpu
> failures ;)).
> HDS raid arrays are excellent about availability. Lots of fc links,
> network links ..
> All this is in a fully redundant datacenter .. and, sure, we have a
> stand-by system on a disaster recovery site (hope to never use it!).

I can understand where you are coming from as far as the need for uptime and loss of money on that app server. Two years of testing for the app, Sunfire servers for N+1 because the app can't be clustered, and you have chosen to run a filesystem that has just been made public? ZFS may be great and all, but this stinks of running a .0 version on the production machine. VXFS+snap has well known and documented behaviors tested for years on production machines. Why did you even choose to run ZFS on that specific box?

Do not get me wrong, I really like many things about ZFS -- it is ground breaking. I still do not get why it would be chosen for a server in that position until it has better real world production testing and modeling. You have taken all of the buildup you have done and introduced an unknown into the mix.

> > I'm definitely NOT some kind of know-it-all, don't misunderstand me.
> > Your statement just let my alarm bells ring and that's why I'm asking.
>
> Don't worry Ralf. Any suggestion/opinion/criticism is welcome.
> It's a pleasure to exchange our experience
>
> Gino
Re: [zfs-discuss] I/O freeze after a disk failure
> Gino wrote: > > The real problem is that ZFS should stop to force > kernel panics. > > > I found these panics very annoying, too. And even > more that the zpool > was faulted afterwards. But my problem is that when > someone asks me what > ZFS should do instead, I have no idea. well, what about just hang processes waiting for I/O on that zpool? Could be possible? > Seagate FibreChannel drives, Cheetah 15k, ST3146855FC > for the databases. What king of JBOD for that drives? Just to know ... We found Xyratex's to be good products. > That depends on the indivdual requirements of each > service. Basically, > we change to recordsize according to the transaction > size of the > databases and, on the filers, the performance results > were best when the > recordsize was a bit lower than the average file size > (average file size > is 12K, so I set a recordsize of 8K). I set a vdev > cache size of 8K and > our databases worked best with a vq_max_pending of > 32. ZFSv3 was used, > that's the version which is shipped with Solaris 10 > 11/06. thanks for sharing. > Yes, but why doesn't your application fail over to a > standby? It is a little complex to explain. Basically that apps are making a lot of "number cruncing" on some a very big data in ram. Failover would be starting again from the beginning, with all the customers waiting for hours (and loosing money). We are working on a new app, capable to work with a couple of nodes but it will takes some months to be in beta, then 2 years of testing ... > a system reboot can be a single point of failure, > what about the network > infrastructure? Hardware errors? Or power outages? We use Sunfire for that reason. We had 2 cpu failures and no service interruption, the same for 1 dimm module (we have been lucky with cpu failures ;)). HDS raid arrays are excellent about availability. Lots of fc links, network links .. All this is in a fully redundant datacenter .. 
and, sure, we have a standby system on a disaster recovery site (hope to never use it!).

> I'm definitely NOT some kind of know-it-all, don't misunderstand me.
> Your statement just let my alarm bells ring and that's why I'm asking.

Don't worry, Ralf. Any suggestion/opinion/criticism is welcome. It's a pleasure to exchange our experience.

Gino

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compression=on and zpool attach
> On 11/09/2007, Mike DeMarco <[EMAIL PROTECTED]> wrote:
> > > I've got 12Gb or so of db+web in a zone on a ZFS filesystem on a
> > > mirrored zpool. Noticed during some performance testing today that
> > > it's i/o bound but using hardly any CPU, so I thought turning on
> > > compression would be a quick win.
> >
> > If it is io bound won't compression make it worse?
>
> Well, the CPUs are sat twiddling their thumbs. I thought reducing the
> amount of data going to disk might help I/O - is that unlikely?

I/O bottlenecks are usually caused by a slow disk or one that has a heavy workload reading many small files. Two factors that need to be considered are head seek latency and spin latency. Head seek latency is the amount of time it takes for the head to move to the track that is to be written; this is an eternity for the system (usually around 4 or 5 milliseconds). Spin latency is the amount of time it takes for the spindle to spin the track to be read or written over the head. Ideally you only want to pay the latency penalty once.

If you have large reads and writes going to the disk then compression may help a little, but if you have many small reads or writes it will do nothing more than burden your CPU with a no-gain amount of work, since you are going to be paying Mr. Latency for each read or write.

Striping several disks together with a stripe width that is tuned for your data model is how you could get your performance up. Tunable striping has been left out of the ZFS model for some reason. While it is true that RAIDZ will stripe the data across a given drive set, it does not give you the option to tune the stripe width. Due to the write performance problems of RAIDZ you may not get a performance boost from its striping if your write-to-read ratio is too high, since the driver has to calculate parity for each write.

> > > benefit of compression on the blocks that are copied by the mirror
> > > being resilvered?
> >
> > No!
Since you are doing a block-for-block mirror of the data, this could not compress the data.

> No problem, another job for rsync then :)
>
> --
> Rasputin :: Jack of All Trades - Master of Nuns
> http://number9.hellooperator.net/
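The latency arithmetic Mike describes can be sketched with a toy model (the seek time, RPM, and transfer rate below are illustrative assumptions, not measurements of any particular drive):

```python
# Rough model of why small random I/O dominates disk time:
# each request pays seek + rotational latency before any data moves.

def io_time_ms(io_size_kib, seek_ms=4.5, rpm=15000, transfer_mib_s=70.0):
    """Estimated time for one random I/O: seek + average rotational + transfer."""
    rotational_ms = (60_000.0 / rpm) / 2  # on average, half a revolution
    transfer_ms = io_size_kib / 1024.0 / transfer_mib_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms

# 8 KiB random read: fixed latency dwarfs transfer time, so halving the
# data size with compression barely helps.
small = io_time_ms(8)

# 1 MiB read: transfer time dominates, so compression can actually pay off.
large = io_time_ms(1024)

print(f"8 KiB: {small:.2f} ms   1 MiB: {large:.2f} ms")
```

With these numbers the 8 KiB request spends roughly 98% of its time on seek and rotation, which is why compressing many small I/Os buys almost nothing.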
Re: [zfs-discuss] I/O freeze after a disk failure
Gino wrote:
> The real problem is that ZFS should stop forcing kernel panics.

I found these panics very annoying, too. And even more that the zpool was faulted afterwards. But my problem is that when someone asks me what ZFS should do instead, I have no idea.

>> I have large Sybase database servers and file servers with billions of
>> inodes running using ZFSv3. They are attached to X4600 boxes running
>> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap
>> Infortrend FC JBODs (2 GBit/s) as storage shelves.
>
> Are you using FATA drives?

Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the databases. For the NFS filers we use Infortrend FC shelves with SATA inside.

>> All my benchmarks (both on the command line and within applications)
>> show that the FibreChannel is the bottleneck, even with random reads.
>> ZFS doesn't do this out of the box, but a bit of tuning helped a lot.
>
> You found another good point. I think that with ZFS and JBOD, FC links
> will soon be the bottleneck. What tuning have you done?

That depends on the individual requirements of each service. Basically, we change the recordsize according to the transaction size of the databases and, on the filers, the performance results were best when the recordsize was a bit lower than the average file size (average file size is 12K, so I set a recordsize of 8K). I set a vdev cache size of 8K and our databases worked best with a vq_max_pending of 32. ZFSv3 was used, that's the version which is shipped with Solaris 10 11/06.

> It is a problem if your apps hang waiting for you to power down/pull out
> the drive! Almost in a time=money environment :)

Yes, but why doesn't your application fail over to a standby? I'm also working in a "time is money and failure is no option" environment, and I doubt I would sleep better if I were responsible for an application under such a service level agreement without full high availability.
If a system reboot can be a single point of failure, what about the network infrastructure? Hardware errors? Or power outages?

I'm definitely NOT some kind of know-it-all, don't misunderstand me. Your statement just let my alarm bells ring and that's why I'm asking.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
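The recordsize tuning Ralf describes maps onto standard ZFS commands; the pool/dataset name `tank/db` below is a placeholder, and the mdb tunable name is the commonly used one from that era's Solaris (verify against your release):

```shell
# Match recordsize to the database transaction size. Note: recordsize
# only affects files written after the change, so set it before loading.
zfs set recordsize=8K tank/db

# Confirm the property took effect.
zfs get recordsize tank/db

# The vdev queue depth (vq_max_pending) was adjustable live via the
# zfs_vdev_max_pending kernel tunable, e.g.:
#   echo zfs_vdev_max_pending/W 0t32 | mdb -kw
```

This is a sketch of the workflow, not a recommendation of the specific values; as the thread shows, the best settings depend on the workload's transaction and file sizes.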
Re: [zfs-discuss] I/O freeze after a disk failure
> > -We had tons of kernel panics because of ZFS.
> > Here a "reboot" must be planned a couple of weeks in advance
> > and done only on a Saturday night ..
>
> Well, I'm sorry, but if your datacenter runs into problems when a single
> server isn't available, you probably have much worse problems. ZFS is a
> file system. It's not a substitute for hardware trouble or a misplanned
> infrastructure. What would you do if you had the fsck you mentioned
> earlier? Or with another file system like UFS, ext3, whatever? Boot a
> system into single user mode and fsck several terabytes, after planning
> it a couple of weeks in advance?

For example, we have a couple of apps using 80-290GB of RAM, and some thousands of users. We use Solaris + SPARC + high-end storage because we can't afford downtime. We can deal with a failed file system; a reboot during the day would cost a lot of money. The real problem is that ZFS should stop forcing kernel panics.

> > -Our 9900V and HP EVAs work really badly with ZFS because of large
> > caches. (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the
> > problem. Only helped a bit.
>
> Use JBODs. Or tell the cache controllers to ignore the flushing
> requests. Should be possible, even the $10k low-cost StorageTek arrays
> support this.

Unfortunately the HP EVA can't do it. As for the 9900V, it is really fast (64GB of cache helps a lot) and reliable: 100% uptime in years. We'll never touch it to solve a ZFS problem. We started using JBODs (12 x 16-drive shelves) with ZFS, but speed and reliability (today) are not comparable to HDS+UFS.

> > -ZFS performs badly with a lot of small files.
> > (about 20 times slower than UFS with our millions-of-files rsync
> > procedures)
>
> I have large Sybase database servers and file servers with billions of
> inodes running using ZFSv3. They are attached to X4600 boxes running
> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap
> Infortrend FC JBODs (2 GBit/s) as storage shelves.
Are you using FATA drives?

> All my benchmarks (both on the command line and within applications)
> show that the FibreChannel is the bottleneck, even with random reads.
> ZFS doesn't do this out of the box, but a bit of tuning helped a lot.

You found another good point. I think that with ZFS and JBOD, FC links will soon be the bottleneck. What tuning have you done?

> > -ZFS+FC JBOD: a failed hard disk needs a reboot :(
> > (frankly unbelievable in 2007!)
>
> No. Read the thread carefully. It was mentioned that you don't have to
> reboot the server, all you need to do is pull the hard disk. Shouldn't
> be a problem, except if you don't want to replace the faulty one anyway.

It is a problem if your apps hang waiting for you to power down/pull out the drive! Especially in a time=money environment :)

> No other manual operations will be necessary, except for the final "zfs
> replace". You could also try cfgadm to get rid of ZFS pool problems,
> perhaps it works - I'm not sure about this, because I had the idea
> *after* I solved that problem, but I'll give it a try someday.
>
> > Anyway we happily use ZFS on our new backup systems (snapshotting with
> > ZFS is amazing), but to tell you the truth we are keeping 2 large
> > zpools in sync on each system because we fear another zpool corruption.
>
> May I ask how you accomplish that?

During the day we sync pool1 with pool2, then we "umount pool2" during scheduled backup operations at night.

> And why are you doing this? You should replicate your zpool to another
> host, instead of mirroring locally. Where's your redundancy in that?

We have 4 backup hosts. Soon we'll move to a 10G network and we'll replicate on different hosts, as you pointed out.

Gino
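For reference, the manual recovery being debated here normally ends with a replace sequence like the following (pool and device names are examples only):

```shell
# Identify the faulted device.
zpool status -x

# Replace the failed disk with a new one inserted in the same slot ...
zpool replace tank c1t2d0

# ... or with a different device elsewhere on the bus.
zpool replace tank c1t2d0 c3t1d0

# Watch the resilver progress until the pool is healthy again.
zpool status tank
```

The complaint in the thread is not about these commands themselves, but about the hang/panic behaviour that can occur before you get the chance to run them.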
Re: [zfs-discuss] I/O freeze after a disk failure
> We have seen just the opposite... we have a server with about
> 0 million files and only 4 TB of data. We have been benchmarking FSes
> for creation and manipulation of large populations of small files, and
> ZFS is the only one we have found that continues to scale linearly
> above one million files in one FS. UFS, VXFS, HFS+ (don't ask why),
> NSS (on NW, not Linux) all show exponential growth in response time as
> you cross a certain knee (we are graphing time to create zero-length
> files, then do a series of basic manipulations on them) in the number
> of files. For all the FSes we have tested that knee has been under one
> million files, except for ZFS. I know this is not 'real world' but it
> does reflect the response time issues we have been trying to solve. I
> will see if my client (I am a consultant) will allow me to post the
> results, as I am under NDA for most of the details of what we are doing.

It would be great!

> On the other hand, we have seen serious issues using rsync to migrate
> this data from the existing server to the Solaris 10 / ZFS system, so
> perhaps your performance issues were rsync related and not ZFS. In
> fact, so far the fastest and most reliable method for moving the data
> is proving to be Veritas NetBackup (back it up on the source server,
> restore to the new ZFS server).
>
> Now having said all that, we are probably never going to see
> 00 million files in one zpool, because the ZFS architecture lets us
> use a more distributed model (many zpools and datasets within them)
> and still present the end users with a single view of all the data.

Hi Paul, may I ask what your average file size is? Have you done any optimization - ZFS recordsize? Did your test also include writing 1 million files?

Gino
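Paul's "time to create zero-length files" methodology can be sketched in a few lines. This toy version uses small file counts in a temporary directory; the real benchmark grew the population into the millions and graphed per-file cost to find the knee:

```python
import os
import tempfile
import time

def time_file_creation(n):
    """Create n zero-length files in a fresh directory; return elapsed seconds."""
    with tempfile.TemporaryDirectory() as d:
        start = time.perf_counter()
        for i in range(n):
            # open + close creates an empty file, like touch(1)
            open(os.path.join(d, f"f{i:08d}"), "w").close()
        return time.perf_counter() - start

# On a file system that scales linearly, per-file cost stays roughly flat
# as the population grows; past the "knee" it climbs sharply.
t_small = time_file_creation(1_000)
t_large = time_file_creation(5_000)
print(f"per-file: {t_small / 1_000:.2e}s vs {t_large / 5_000:.2e}s")
```

To reproduce the thread's actual comparison you would run this on the file system under test with much larger n, and add the follow-up manipulations (stat, rename, unlink) the original benchmark included.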
Re: [zfs-discuss] I/O freeze after a disk failure
> Yes, this is a case where the disk has not completely failed.
> ZFS seems to handle the completely failed disk case properly, and
> has for a long time. Cutting the power (which you can also do with
> luxadm) makes the disk appear completely failed.

Richard, I think you're right. The failed disk is still working, but it has no space left for bad sectors...

Gino
Re: [zfs-discuss] I/O freeze after a disk failure
> On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
> > -ZFS+FC JBOD: a failed hard disk needs a reboot :(
> > (frankly unbelievable in 2007!)
>
> So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old
> disks which have been failing regularly and haven't seen that; the worst
> I've seen running Nevada was that processes touching the pool got stuck,

This is the problem.

> but they all came unstuck when I powered off the at-fault FC disk via
> the A5200 front panel.

I'll try again with the EMC JBOD, but there still remains the fact that you need to manually recover from a hard disk failure.

Gino
Re: [zfs-discuss] I/O freeze after a disk failure
Gino wrote:
[...]
> Just a few examples:
> -We lost several zpools with S10U3 because of the "spacemap" bug,
> and -nothing- was recoverable. No fsck here :(

Yes, I criticized the lack of zpool recovery mechanisms, too, during my AVS testing. But I don't have the know-how to judge whether it has technical reasons.

> -We had tons of kernel panics because of ZFS.
> Here a "reboot" must be planned a couple of weeks in advance
> and done only on a Saturday night ..

Well, I'm sorry, but if your datacenter runs into problems when a single server isn't available, you probably have much worse problems. ZFS is a file system. It's not a substitute for hardware trouble or a misplanned infrastructure. What would you do if you had the fsck you mentioned earlier? Or with another file system like UFS, ext3, whatever? Boot a system into single user mode and fsck several terabytes, after planning it a couple of weeks in advance?

> -Our 9900V and HP EVAs work really badly with ZFS because of large
> caches. (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the
> problem. Only helped a bit.

Use JBODs. Or tell the cache controllers to ignore the flushing requests. Should be possible, even the $10k low-cost StorageTek arrays support this.

> -ZFS performs badly with a lot of small files.
> (about 20 times slower than UFS with our millions-of-files rsync
> procedures)

I have large Sybase database servers and file servers with billions of inodes running using ZFSv3. They are attached to X4600 boxes running Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap Infortrend FC JBODs (2 GBit/s) as storage shelves. All my benchmarks (both on the command line and within applications) show that the FibreChannel is the bottleneck, even with random reads. ZFS doesn't do this out of the box, but a bit of tuning helped a lot.

> -ZFS+FC JBOD: a failed hard disk needs a reboot :(
> (frankly unbelievable in 2007!)

No. Read the thread carefully.
It was mentioned that you don't have to reboot the server; all you need to do is pull the hard disk. Shouldn't be a problem, except if you don't want to replace the faulty one anyway. No other manual operations will be necessary, except for the final "zfs replace". You could also try cfgadm to get rid of ZFS pool problems, perhaps it works - I'm not sure about this, because I had the idea *after* I solved that problem, but I'll give it a try someday.

> Anyway we happily use ZFS on our new backup systems (snapshotting with
> ZFS is amazing), but to tell you the truth we are keeping 2 large
> zpools in sync on each system because we fear another zpool corruption.

May I ask how you accomplish that? And why are you doing this? You should replicate your zpool to another host, instead of mirroring locally. Where's your redundancy in that?

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
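Replicating to another host instead of mirroring pools locally, as suggested above, is typically done with incremental zfs send/receive. Hostnames and dataset names below are illustrative placeholders:

```shell
# One-time full copy of a snapshot to the backup host.
zfs snapshot tank/data@base
zfs send tank/data@base | ssh backuphost zfs recv backup/data

# Thereafter, ship only the changes since the previous snapshot.
zfs snapshot tank/data@today
zfs send -i tank/data@base tank/data@today | ssh backuphost zfs recv backup/data
```

Unlike a second local pool, this keeps a copy that survives the loss of the whole server, and the incremental stream keeps the nightly transfer small.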
Re: [zfs-discuss] MS Exchange storage on ZFS?
Microsoft have a document you should read: "Optimizing Storage for Microsoft Exchange Server 2003"
http://download.microsoft.com/download/b/e/0/be072b12-9c30-4e00-952d-c7d0d7bcea5f/StoragePerformance.doc

Microsoft also have a utility, JetStress, which you can use to verify the performance of the storage system:
http://www.microsoft.com/downloads/details.aspx?familyid=94b9810b-670e-433a-b5ef-b47054595e9c&displaylang=en

I think you can use JetStress on a non-Exchange server if you copy across some of the Exchange DLLs. If you do any testing along these lines, please report success or failure back to this forum, and on the 'storage-discuss' forum where these sorts of questions are more usually discussed.

Thanks
Nigel Smith