Re: [zfs-discuss] ZFS Random Read Performance
more below... On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote: On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling wrote: Try disabling prefetch. Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but sequential read performance dropped from about 200 MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file accessed in 256 KB records. ARC is set to a max of 1 GB for testing. arcstat.pl shows that the vast majority (>95%) of reads are missing the cache. hmmm... more testing needed. The question is whether the low I/O rate is due to zfs itself or to the application. Disabling prefetch will expose the application, because zfs is not creating additional and perhaps unnecessary read I/O. Your data, which shows the sequential write, random write, and sequential read driving actv to 35, reflects prefetching being enabled for the read. We expect the writes to drive to 35 with a sustained write workload of any flavor. The random read (with cache misses) will stall the application, so it takes a lot of threads (>>16?) to keep 35 concurrent I/Os in the pipeline without prefetching. The ZFS prefetching algorithm is "intelligent", so it actually complicates the interpretation of the data. You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per drive. Since ZFS will max out at 128KB per I/O, the disks see something more than 66 IOPS each. The IOPS data from iostat would be a better metric to observe than bandwidth. These drives are good for about 80 random IOPS each, so you may be close to disk saturation. The iostat data for IOPS and svc_t will confirm. The T2000 data (sheet 3) shows pretty consistently around 90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20% less than I would expect, perhaps due to the measurement. Also, the 3511 RAID-5 configuration will perform random reads at around 1/2 IOPS capacity if the partition offset is 34. This was the default long ago. The new default is 256.
The reason is that with a 34 block offset, you are almost guaranteed that a larger I/O will stride 2 disks. You won't notice this as easily with a single thread, but it will be measurable with more threads. Double check the offset with prtvtoc or format. Writes are a completely different matter. ZFS has a tendency to turn random writes into sequential writes, so it is pretty much useless to look at random write data. The sequential writes should easily blow through the cache on the 3511. Squinting my eyes, I would expect the array can do around 70 MB/s writes, or 25 256KB IOPS saturated writes. By contrast, the T2000 JBOD data shows consistent IOPS at the disk level and exposes the track cache effect on the sequential read test. Did I mention that I'm a member of BAARF? www.baarf.com :-) Hint: for performance work with HDDs, pay close attention to IOPS, then convert to bandwidth for the PHB. The reason I don't think that this is hitting our end users is that the cache hit ratio (reported by arc_summary.pl) is 95% on the production system (I am working on our test system and am the only one using it right now, so all the I/O load is iozone). I think my next step (beyond more poking with DTrace) is to try a backup and see what I get for ARC hit ratio ... I expect it to be low, but I may be surprised (then I have to figure out why backups are as slow as they are). We are using NetBackup and it takes about 3 days to do a FULL on a 3.3 TB zfs with about 30 million files. Differential incrementals take 16-22 hours (and almost no data changes). The production server is an M4000, 4 dual-core CPUs, 16 GB memory, and about 25 TB of data overall. A big SAMBA file server. b119 has improved stat() performance, which should make a positive difference for such backups. But eventually you may need to move to a multi-stage backup, depending on your business requirements.
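The IOPS-first habit suggested above is simple arithmetic; a quick sketch using the figures from this thread (658 aggregate 256 KB random IOPS, with the 11-disk RAID-5 giving roughly 10 data spindles — both numbers taken from the discussion, the split into data disks is an assumption):

```shell
# Convert between IOPS and bandwidth for the 3511 example above.
aggregate_iops=658                 # observed 256 KB random IOPS
io_kb=256                          # application record size, KB
data_disks=10                      # assumed data spindles in the 11-disk R5
bw_kb_s=$(( aggregate_iops * io_kb ))
echo "bandwidth: $(( bw_kb_s / 1024 )) MB/s"
echo "per-disk:  $(( aggregate_iops / data_disks )) IOPS"
```

Against the ~80 random IOPS these SATA drives can sustain, the per-disk figure shows how close the array is to saturation before the bandwidth number looks alarming.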
-- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs-raidz - simulate disk failure
Those are great, but they're about testing the ZFS software. There's a small amount of overlap, in that these injections include trying to simulate the hoped-for system response (e.g., EIO) to various physical scenarios, so it's worth looking at for scenario suggestions. However, for most of us, we generally rely on Sun's (generally acknowledged as excellent) testing of the software stack. I suspect the OP is more interested in verifying, on his own hardware, that physical events and problems will be connected to the software fault injection test scenarios. The rest of us running on random commodity hardware have largely the same interest, because Sun hasn't qualified the hardware parts of the stack as well. We've taken on that responsibility ourselves (both individually, and as a community by sharing findings). For example, for the various kinds of failures that might happen: * Does my particular drive/controller/chipset/BIOS/etc. combination notice the problem and result in the appropriate error from the driver upwards? * How quickly does it notice? Do I have to wait for some long timeout or other retry cycle, and is that a problem for my usage? * Does the rest of the system keep working to allow ZFS to recover/react, or is there some kind of follow-on failure (bus hangs/resets, etc.) that will have wider impact? Yanking disk controller and/or power cables is an easy and obvious test. Testing scenarios that involve things like disk firmware behaviour in response to bad reads is harder - though apparently yelling at them might be worthwhile :-) Finding ways to dial up the load on your PSU (or drop voltage/limit current to a specific device with an inline filter) might be an idea, since overloaded power supplies seem to be implicated in various people's reports of trouble. Finding ways to generate EMF or "cosmic rays" to induce other kinds of failure is left as an exercise.
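A minimal sketch for the "how quickly does it notice?" question above: timestamp pool health once a second while you yank a cable or drive, and read the detection latency off the log afterwards. The pool name is an assumption, and the loop is bounded so it doesn't run forever:

```shell
# Timestamped pool-health log for measuring fault-detection latency.
# Pull the cable/drive while this runs, then check when the status
# line changes. Assumes a pool named "tank"; runs for 10 minutes.
pool=tank
i=0
while [ $i -lt 600 ]; do
    printf '%s  ' "$(date '+%H:%M:%S')"
    zpool status -x "$pool" | head -1
    i=$(( i + 1 ))
    sleep 1
done
```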
[zfs-discuss] Lost zfs root partition after zfs upgrade
Using an HP DL360 G5 with an HP Smart Array P400i controller. Created 2 mirrored (hardware) RAID volumes. I installed Solaris onto a ZFS partition during setup onto one of the mirrored volumes. Used the second mirrored volume to create another ZFS pool. I patched the OS, rebooted, then ran zpool upgrade followed by zfs upgrade and rebooted again. This time, the boot partition was gone and I was left at the bare grub prompt. I reinstalled the OS, patched, rebooted, and ran zpool upgrade and rebooted fine. I was able to import the other ZFS pool (on the second mirrored volume); now that one is running zfs v4 and the root partition is running zfs v3, but I'm reluctant to try to upgrade zfs again. 1. Is this a supported hardware/software configuration? 2. Has anybody seen this before? I found this: http://blogs.sun.com/pomah/entry/zpool_upgrade_broke_grub Don't know if it's related or not...
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
> you can fetch the "cr_txg" (cr for creation) for a > snapshot using zdb, yes, but this is hardly an appropriate interface. zdb is also likely to cause disk activity because it looks at many things other than the specific item in question. > but the very creation of a snapshot requires a new > txg to note that fact in the pool. yes, which is exactly what we're trying to avoid, because it requires disk activity to write. > if the snapshot is taken recursively, all snapshots > will have the same cr_txg, but that requires the > same configuration for all filesets. again, yes, but that's irrelevant - the important knowledge at this moment is that the txg has not changed since last time, and thus there will be no benefit in taking further snapshots, regardless of configuration.
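For reference, the zdb interface being criticized looks roughly like this (dataset and snapshot names are assumptions). Note the caveat above: zdb inspects much more than the one field, so this is not a cheap or stable way to poll for txg changes:

```shell
# Hedged sketch: pull a snapshot's creation txg out of zdb's dataset
# dump. This may itself cause disk activity, per the discussion above.
zdb -dddd tank/home@hourly-1 | grep cr_txg
```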
Re: [zfs-discuss] sharemgr
glidic anthony wrote: > I have a solution with use zfs set sharenfs=rw,nosuid zpool but i prefer > use the sharemgr command. Then you prefer wrongly. ZFS filesystems are not shared this way. Read up on ZFS and NFS. -- Dick Hoogendijk -- PGP/GnuPG key: F86289CE +http://nagual.nl/ | SunOS 10u7 05/09 ZFS+
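For the archives, the property-based approach being recommended looks like this (pool name is an assumption). Because sharenfs is an inherited property, setting it once at the top covers the nested filesystems, which is the original poster's actual problem with a single sharemgr share:

```shell
# Share via the zfs property rather than sharemgr; children inherit it.
zfs set sharenfs=rw,nosuid tank
zfs get -r sharenfs tank    # verify the descendants inherit the setting
```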
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling wrote: > On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote: > >> On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling >> wrote: >>> >>> Good question! Additional thoughts below... >>> >>> On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: Suppose I have a storage server that runs ZFS, presumably providing file (NFS) and/or block (iSCSI, FC) services to other machines that are running Solaris. Some of the use will be for LDoms and zones[1], which would create zpools on top of zfs (fs or zvol). I have concerns about variable block sizes and the implications for performance. 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss Suppose that on the storage server, an NFS shared dataset is created without tuning the block size. This implies that when the client (ldom or zone v12n server) runs mkfile or similar to create the backing store for a vdisk or a zpool, the file on the storage server will be created with 128K blocks. Then when Solaris or OpenSolaris is installed into the vdisk or zpool, files of a wide variety of sizes will be created. At this layer they will be created with variable block sizes (512B to 128K). The implications for a 512 byte write in the upper level zpool (inside a zone or ldom) seem to be: - The 512 byte write turns into a 128 KB write at the storage server (256x multiplication in write size). - To write that 128 KB block, the rest of the block needs to be read to recalculate the checksum. That is, a read/modify/write process is forced. (Less impact if block already in ARC.) - Deduplication is likely to be less effective because it is unlikely that the same combination of small blocks in different zones/ldoms will be packed into the same 128 KB block. Alternatively, the block size could be forced to something smaller at the storage server.
Setting it to 512 bytes could eliminate the read/modify/write cycle, but would presumably be less efficient (less performant) with moderate to large files. Setting it somewhere in between may be desirable as well, but it is not clear where. The key competition in this area seems to have a fixed 4 KB block size. Questions: Are my basic assumptions correct, that a given file consists of only a single block size, except perhaps for the final block? >>> >>> Yes, for a file system dataset. Volumes are fixed block size with >>> the default being 8 KB. So in the iSCSI over volume case, OOB >>> it can be more efficient. 4KB matches well with NTFS or some of >>> the Linux file systems >> >> OOB is missing from my TLA translator. Help, please. > > Out of box. Looky there, it was in my TLA translator after all. Not sure how I missed it the first time. > >>> Has any work been done to identify the performance characteristics in this area? >>> >>> None to my knowledge. The performance teams know to set the block >>> size to match the application, so they don't waste time re-learning this. >> >> That works great for certain workloads, particularly those with a >> fixed record size or large sequential I/O. If the workload is >> "installing then running an operating system" the answer is harder to >> define. > > running OSes don't create much work, post boot Agreed, particularly if backups are pushed to the storage server. I suspect that most apps that shuffle bits between protocols but do little disk I/O can piggyback on this idea. That is, a J2EE server that just talks to the web and database tier, with some log entries and occasional app deployments should be pretty safe too. > Is there less to be concerned about from a performance standpoint if the workload is primarily read? >>> >>> Sequential read: yes >>> Random read: no >> >> I was thinking that random wouldn't be too much of a concern either >> assuming that the things that are commonly read are in cache.
I guess >> this does open the door for a small chunk of useful code in the middle >> of a largely useless shared library to force a lot of that shared >> library into the ARC, among other things. > > This was much more of a problem years ago when memory was small. > Don't see it much on modern machines. Probably so, particularly if deduped blocks are deduped in the ARC. If 1000 virtual machines are each forcing the same 5 MB of extra stuff into the ARC, that can be inconsequential or take up a somewhat significant amount of the ARC. > >>> To maximize the efficacy of dedup, would it be best to pick a fixed block size and match it between the layers of zfs? >>> >>> I don't think we know yet. Until b128 arrives in binary, and folks get >>> some time to experiment, we just don't have much data... and there >>> are way too many variables at play to predict. I can make one >>> prediction, though, dedupe
Re: [zfs-discuss] Best practices for zpools on zfs
On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote: On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling wrote: Good question! Additional thoughts below... On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: Suppose I have a storage server that runs ZFS, presumably providing file (NFS) and/or block (iSCSI, FC) services to other machines that are running Solaris. Some of the use will be for LDoms and zones[1], which would create zpools on top of zfs (fs or zvol). I have concerns about variable block sizes and the implications for performance. 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss Suppose that on the storage server, an NFS shared dataset is created without tuning the block size. This implies that when the client (ldom or zone v12n server) runs mkfile or similar to create the backing store for a vdisk or a zpool, the file on the storage server will be created with 128K blocks. Then when Solaris or OpenSolaris is installed into the vdisk or zpool, files of a wide variety of sizes will be created. At this layer they will be created with variable block sizes (512B to 128K). The implications for a 512 byte write in the upper level zpool (inside a zone or ldom) seem to be: - The 512 byte write turns into a 128 KB write at the storage server (256x multiplication in write size). - To write that 128 KB block, the rest of the block needs to be read to recalculate the checksum. That is, a read/modify/write process is forced. (Less impact if block already in ARC.) - Deduplication is likely to be less effective because it is unlikely that the same combination of small blocks in different zones/ldoms will be packed into the same 128 KB block. Alternatively, the block size could be forced to something smaller at the storage server. Setting it to 512 bytes could eliminate the read/modify/write cycle, but would presumably be less efficient (less performant) with moderate to large files.
Setting it somewhere in between may be desirable as well, but it is not clear where. The key competition in this area seems to have a fixed 4 KB block size. Questions: Are my basic assumptions correct, that a given file consists of only a single block size, except perhaps for the final block? Yes, for a file system dataset. Volumes are fixed block size with the default being 8 KB. So in the iSCSI over volume case, OOB it can be more efficient. 4KB matches well with NTFS or some of the Linux file systems OOB is missing from my TLA translator. Help, please. Out of box. Has any work been done to identify the performance characteristics in this area? None to my knowledge. The performance teams know to set the block size to match the application, so they don't waste time re-learning this. That works great for certain workloads, particularly those with a fixed record size or large sequential I/O. If the workload is "installing then running an operating system" the answer is harder to define. running OSes don't create much work, post boot Is there less to be concerned about from a performance standpoint if the workload is primarily read? Sequential read: yes Random read: no I was thinking that random wouldn't be too much of a concern either assuming that the things that are commonly read are in cache. I guess this does open the door for a small chunk of useful code in the middle of a largely useless shared library to force a lot of that shared library into the ARC, among other things. This was much more of a problem years ago when memory was small. Don't see it much on modern machines. To maximize the efficacy of dedup, would it be best to pick a fixed block size and match it between the layers of zfs? I don't think we know yet. Until b128 arrives in binary, and folks get some time to experiment, we just don't have much data... and there are way too many variables at play to predict.
I can make one prediction, though, dedupe for mkfile or dd if=/dev/zero will scream :-) We already have that optimization with compression. Dedupe just messes up my method of repeatedly writing the same smallish (<1MB) chunk of random or already compressed data to avoid the block-of-zeros compression optimization. Pretty soon filebench is going to need to add statistical methods to mimic the level of duplicate data it is simulating. Trying to write simple benchmarks to test increasingly smart systems looks to be problematic. :-) Also, the performance of /dev/*random is not very good. So prestaging lots of random data will be particularly challenging. -- richard
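The 512-byte-into-128-KB write path discussed in this thread reduces to simple arithmetic; a sketch with the thread's own numbers (512 B guest write, 128 KB backing recordsize):

```shell
# Write amplification for a small write into a large-record dataset.
app_write=512                  # bytes written inside the zone/ldom
recsize=$(( 128 * 1024 ))      # backing dataset recordsize, bytes
echo "amplification: $(( recsize / app_write ))x"
echo "read-modify-write first reads: $(( recsize - app_write )) bytes"
```

This is the 256x multiplication and forced read/modify/write described above; shrinking the recordsize shrinks both numbers at the cost of per-block overhead on larger files.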
[zfs-discuss] sharemgr
Hi all, I want to share a folder where I have mounted many ZFS filesystems. But when I mount this share I have access to the folder but not to my ZFS filesystems. If anyone has a solution other than making one share per ZFS filesystem, that would be great. I have a solution using zfs set sharenfs=rw,nosuid zpool but I prefer to use the sharemgr command. Thanks
Re: [zfs-discuss] zfs-raidz - simulate disk failure
On Nov 23, 2009, at 11:41 AM, Richard Elling wrote: On Nov 23, 2009, at 9:44 AM, sundeep dhall wrote: All, I have a test environment with 4 internal disks and RAIDZ option. Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz handles things well without data errors NB, the comments section of zinject.c is an interesting read. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zinject/ -- richard
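As a software-side companion to physically pulling a disk, a hedged zinject sketch (pool and device names are assumptions): inject EIO on reads from one raidz member, scrub, and confirm the pool reports repairs but no data errors:

```shell
# Inject read errors on one raidz member, then verify raidz repairs.
zinject -d c1t2d0 -e io -T read tank
zpool scrub tank
zpool status -v tank    # expect repaired checksums, zero data errors
zinject -c all          # clear all injection handlers when done
```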
Re: [zfs-discuss] Large ZFS server questions
Lustre is coming in a year(?). It will then use ZFS
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling wrote: > Good question! Additional thoughts below... > > On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: > >> Suppose I have a storage server that runs ZFS, presumably providing >> file (NFS) and/or block (iSCSI, FC) services to other machines that >> are running Solaris. Some of the use will be for LDoms and zones[1], >> which would create zpools on top of zfs (fs or zvol). I have concerns >> about variable block sizes and the implications for performance. >> >> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss >> >> Suppose that on the storage server, an NFS shared dataset is created >> without tuning the block size. This implies that when the client >> (ldom or zone v12n server) runs mkfile or similar to create the >> backing store for a vdisk or a zpool, the file on the storage server >> will be created with 128K blocks. Then when Solaris or OpenSolaris is >> installed into the vdisk or zpool, files of a wide variety of sizes >> will be created. At this layer they will be created with variable >> block sizes (512B to 128K). >> >> The implications for a 512 byte write in the upper level zpool (inside >> a zone or ldom) seem to be: >> >> - The 512 byte write turns into a 128 KB write at the storage server >> (256x multiplication in write size). >> - To write that 128 KB block, the rest of the block needs to be read >> to recalculate the checksum. That is, a read/modify/write process >> is forced. (Less impact if block already in ARC.) >> - Deduplication is likely to be less effective because it is unlikely >> that the same combination of small blocks in different zones/ldoms >> will be packed into the same 128 KB block. >> >> Alternatively, the block size could be forced to something smaller at >> the storage server. Setting it to 512 bytes could eliminate the >> read/modify/write cycle, but would presumably be less efficient (less >> performant) with moderate to large files.
Setting it somewhere in >> between may be desirable as well, but it is not clear where. The key >> competition in this area seems to have a fixed 4 KB block size. >> >> Questions: >> >> Are my basic assumptions correct, that a given file consists of only a >> single block size, except perhaps for the final block? > > Yes, for a file system dataset. Volumes are fixed block size with > the default being 8 KB. So in the iSCSI over volume case, OOB > it can be more efficient. 4KB matches well with NTFS or some of > the Linux file systems OOB is missing from my TLA translator. Help, please. > >> Has any work been done to identify the performance characteristics in >> this area? > > None to my knowledge. The performance teams know to set the block > size to match the application, so they don't waste time re-learning this. That works great for certain workloads, particularly those with a fixed record size or large sequential I/O. If the workload is "installing then running an operating system" the answer is harder to define. > >> Is there less to be concerned about from a performance standpoint if >> the workload is primarily read? > > Sequential read: yes > Random read: no I was thinking that random wouldn't be too much of a concern either assuming that the things that are commonly read are in cache. I guess this does open the door for a small chunk of useful code in the middle of a largely useless shared library to force a lot of that shared library into the ARC, among other things. > >> To maximize the efficacy of dedup, would it be best to pick a fixed >> block size and match it between the layers of zfs? > > I don't think we know yet. Until b128 arrives in binary, and folks get > some time to experiment, we just don't have much data... and there > are way too many variables at play to predict. I can make one > prediction, though, dedupe for mkfile or dd if=/dev/zero will scream :-) We already have that optimization with compression.
Dedupe just messes up my method of repeatedly writing the same smallish (<1MB) chunk of random or already compressed data to avoid the block-of-zeros compression optimization. Pretty soon filebench is going to need to add statistical methods to mimic the level of duplicate data it is simulating. Trying to write simple benchmarks to test increasingly smart systems looks to be problematic. -- Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
Thank you to all who've provided data about this. I've updated the bugs mentioned earlier and I believe we can now make progress on diagnosis. The new synopsis (should show up on b.o.o tomorrow) is as follows: 6894775 mpt's msi support is suboptimal with xVM James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
> Travis Tabbal wrote: > > I have a possible workaround. Mark Johnson > has > > been emailing me today about this issue and he > proposed the > > following: > > > >> You can try adding the following to /etc/system, > then rebooting... > >> set xpv_psm:xen_support_msi = -1 > > I am also running XVM, and after modifying > /etc/system and rebooting, my > zpool scrub test is running along merrily with no > hangs so far, where > usually I would expect to see several by now. > > Can the other folks who have seen this please test > and report back? I'd > hate to think we solved it only to discover there > were overlapping bugs. > > Fingers crossed, and many thanks to those who have > worked to track this > down! Nice to see we have one confirmed report that things are working. Hopefully we get a few more! Even if it's just a workaround until a real fix makes it in, it gets us running.
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
> > On Nov 23, 2009, at 7:28 PM, Travis Tabbal wrote: > > > I have a possible workaround. Mark Johnson > > > has been emailing me today about this issue and he > proposed the > > following: > > > >> You can try adding the following to /etc/system, > then rebooting... > >> set xpv_psm:xen_support_msi = -1 > > would this change affect systems not using XVM? we > are just using > these as backup storage. Probably not. Are you seeing the issue without XVM installed? We had one other user report that the issue went away when they removed XVM, so I had thought it wouldn't affect other users. If you are getting the same issue without XVM, there may be overlapping bugs in play. Someone at Sun might be able to tell you how to disable MSI on the controller. Someone told me how to do it for the NVidia SATA controller when there was a bug in that driver. I would think there is a way to do it for the MPT driver.
Re: [zfs-discuss] ZFS Random Read Performance
On Tue, 24 Nov 2009, Paul Kraus wrote: On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling wrote: Try disabling prefetch. Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but sequential read performance dropped from about 200 MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file accessed in 256 KB records. ARC is set to a max of 1 GB for testing. arcstat.pl shows that the vast majority (>95%) of reads are missing the cache. You will often see the best random access performance if you access the data using the same record size that zfs uses. For example, if you request data in 256KB records, but zfs is using 128KB records, then zfs needs to access, reconstruct, and concatenate two 128K zfs records before it can return any data to the user. This increases the access latency and decreases the opportunity to take advantage of concurrency. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
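Bob's suggestion in practice: align the benchmark record size with the dataset recordsize so one application read maps onto one zfs record. A hedged sketch (dataset and file paths are assumptions; -i 0 writes the file that -i 2's random read/write phase then exercises):

```shell
# Match iozone's record size to the dataset recordsize (128K default).
zfs get recordsize tank/test
iozone -i 0 -i 2 -r 128k -s 3g -f /tank/test/iozone.tmp
```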
Re: [zfs-discuss] ZFS Random Read Performance
On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling wrote: > Try disabling prefetch. Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but sequential read performance dropped from about 200 MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file accessed in 256 KB records. ARC is set to a max of 1 GB for testing. arcstat.pl shows that the vast majority (>95%) of reads are missing the cache. The reason I don't think that this is hitting our end users is that the cache hit ratio (reported by arc_summary.pl) is 95% on the production system (I am working on our test system and am the only one using it right now, so all the I/O load is iozone). I think my next step (beyond more poking with DTrace) is to try a backup and see what I get for ARC hit ratio ... I expect it to be low, but I may be surprised (then I have to figure out why backups are as slow as they are). We are using NetBackup and it takes about 3 days to do a FULL on a 3.3 TB zfs with about 30 million files. Differential incrementals take 16-22 hours (and almost no data changes). The production server is an M4000, 4 dual-core CPUs, 16 GB memory, and about 25 TB of data overall. A big SAMBA file server. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, Lunacon 2010 (http://www.lunacon.org/) -> Technical Advisor, RPI Players
Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage
Erik Trimble wrote: Miles Nordin wrote: "lz" == Len Zaifman writes: lz> So I now have 2 disk paths and two network paths as opposed to lz> only one in the 7310 cluster. You're configuring all your failover on the client, so the HA stuff is stateless wrt the server? sounds like the smart way since you control both ends. The entirely-server-side HA makes more sense when you cannot control the clients because they are end users. lz> Under these circumstances what advantage would a 7310 cluster lz> have over 2 X4540s backing each other up and splitting the load? remember fishworks does not include any source code nor run the standard opensolaris builds, both of which could be a big problem if you are working with IB, RDMA, and eventually Lustre. You may get access to certain fancy/flaky stuff around those protocols a year sooner by sticking to the trunk builds. Furthermore by not letting them take the source away you get the ability to backport your own fixes, if you get really aggressive. Having a cluster where one can give up the redundancy for a moment and turn one of the boxes into ``development'' is exactly what I'd want if I were betting my future on bleeding edge stuff like HOL-blocking fabrics, RDMA, and zpool-backed Lustre. It also lets you copy your whole dataset with rsync so you don't get painted into a corner, e.g. trying to downgrade a zpool version. I'd still get the 7310 hardware. Worst case scenario is that you can blow away the AmberRoad software load, and install OpenSolaris/Solaris. The hardware is a standard X4140 and J4200. Note, that if you do that, well, you can't re-load A-R without a support contract. Oops, I mean a J4400. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage
Miles Nordin wrote: "lz" == Len Zaifman writes: lz> So I now have 2 disk paths and two network paths as opposed to lz> only one in the 7310 cluster. You're configuring all your failover on the client, so the HA stuff is stateless wrt the server? sounds like the smart way since you control both ends. The entirely-server-side HA makes more sense when you cannot control the clients because they are end users. lz> Under these circumstances what advantage would a 7310 cluster lz> have over 2 X4540s backing each other up and splitting the load? remember fishworks does not include any source code nor run the standard opensolaris builds, both of which could be a big problem if you are working with IB, RDMA, and eventually Lustre. You may get access to certain fancy/flaky stuff around those protocols a year sooner by sticking to the trunk builds. Furthermore by not letting them take the source away you get the ability to backport your own fixes, if you get really aggressive. Having a cluster where one can give up the redundancy for a moment and turn one of the boxes into ``development'' is exactly what I'd want if I were betting my future on bleeding edge stuff like HOL-blocking fabrics, RDMA, and zpool-backed Lustre. It also lets you copy your whole dataset with rsync so you don't get painted into a corner, e.g. trying to downgrade a zpool version. I'd still get the 7310 hardware. Worst case scenario is that you can blow away the AmberRoad software load, and install OpenSolaris/Solaris. The hardware is a standard X4140 and J4200. Note, that if you do that, well, you can't re-load A-R without a support contract. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] ZFS Random Read Performance
Try disabling prefetch. -- richard On Nov 24, 2009, at 6:45 AM, Paul Kraus wrote: [quoted message trimmed; Paul's original post appears in full later in this digest] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
Good question! Additional thoughts below... On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: [quoted message trimmed; Mike's original post appears in full later in this digest]
Questions: Are my basic assumptions correct, that a given file consists of only a single block size, except perhaps for the final block?

Yes, for a file system dataset. Volumes are a fixed block size, with the default being 8 KB. So in the iSCSI-over-volume case, out of the box it can be more efficient. 4 KB matches well with NTFS or some of the Linux file systems.

Has any work been done to identify the performance characteristics in this area?

None to my knowledge. The performance teams know to set the block size to match the application, so they don't waste time re-learning this.

Is there less to be concerned about from a performance standpoint if the workload is primarily read?

Sequential read: yes. Random read: no.

To maximize the efficacy of dedup, would it be best to pick a fixed block size and match it between the layers of zfs?

I don't think we know yet. Until b128 arrives in binary, and folks get some time to experiment, we just don't have much data... and there are way too many variables at play to predict. I can make one prediction, though: dedupe for mkfile or dd if=/dev/zero will scream :-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
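The point about matching the volume block size to the client can be stated as a rule of thumb: a guest write avoids a read/modify/write only if it covers whole blocks. A minimal sketch (the helper name and the worst-case, ARC-miss framing are mine, not an existing API):

```python
# Whether a guest write of `size` bytes at `offset` can be served without a
# read/modify/write on a zvol with the given volblocksize: it must cover
# whole blocks, i.e. be block-aligned and a multiple of the block size.
# Worst case (ARC miss) assumed; the default volblocksize is 8 KB.

def needs_rmw(offset: int, size: int, volblocksize: int = 8192) -> bool:
    return offset % volblocksize != 0 or size % volblocksize != 0

print(needs_rmw(0, 512))         # True: 512 B write into an 8 KB block
print(needs_rmw(0, 8192))        # False: exactly one whole block
print(needs_rmw(4096, 8192))     # True: misaligned, straddles two blocks
print(needs_rmw(0, 4096, 4096))  # False: volblocksize matched to 4 KB NTFS-style I/O
```

The last line is the "key competition" case from the question: with the block size matched to the client's native 4 KB I/O, aligned writes replace whole blocks and no read is needed.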
[zfs-discuss] ZFS Random Read Performance
I know there have been a bunch of discussions of various ZFS performance issues, but I did not see anything specifically on this. In testing a new configuration of an SE-3511 (SATA) array, I ran into an interesting ZFS performance issue. I do not believe that this is creating a major issue for our end users (but it may), but it is certainly impacting our nightly backups. I am only seeing 10-20 MB/sec per thread for random read throughput using iozone for testing. Here is the full config:

SF-V480 --- 4 x 1.2 GHz III+ --- 16 GB memory --- Solaris 10U6 with ZFS patch and IDR for snapshot / resilver bug.
SE-3511 --- 12 x 500 GB SATA drives --- 11 disk R5 --- dual 2 Gbps FC host connection

I have the ARC size limited to 1 GB so that I can test with a reasonable data set size. The total amount of data that I am testing with is 3 GB and a 256 KB record size. I tested with 1 through 20 threads. With 1 thread I got the following results: sequential write: 112 MB/sec. sequential read: 221 MB/sec. random write: 96 MB/sec. random read: 18 MB/sec. As I scaled the number of threads (and kept the total data size the same) I got the following (throughput is in MB/sec):

threads   sw    sr    rw    rr
2         105   218   93    34
4         106   219   88    52
8         95    189   69    92
16        71    153   76    128

As the number of threads climbs, the first three values drop once you get above 4 threads (one per CPU), but the fourth (random read) climbs well past 4 threads. It is just about linear through 9 threads, then starts fluctuating, but continues climbing to at least 20 threads (I did not test past 20). Above 16 threads the random read even exceeds the sequential read values. Looking at iostat output for the LUN I am using for the 1 thread case, for the first three tests (sequential write, sequential read, random write) I see %b at 100 and actv climb to 35 and hang out there. For the random read test I see %b at 5 to 7, actv at less than 1 (usually around 0.5 to 0.6), wsvc_t is essentially 0, and asvc_t runs about 14.
As the number of threads increases, the iostat values don't really change for the first three tests (sequential write, sequential read, random write), but they climb for the random read. The array is close to saturated at about 170 MB/sec. random read (18 threads), so I know that the 18 MB/sec. value for one thread is _not_ limited by the array. I know the 3511 is not a high performance array, but we needed lots of bulk storage and could not afford better when we bought these 3 years ago. But it seems to me that there is something wrong with the random read performance of ZFS. To test whether this is an effect of the 3511, I ran some tests on another system we have, as follows:

T2000 --- 32 thread 1 GHz --- 32 GB memory --- Solaris 10U8 --- 4 internal 72 GB SAS drives

We have a zpool built of one slice on each of the 4 internal drives configured as a striped mirror layout (2 vdevs, each of 2 slices). So I/O is spread over all 4 spindles. I started with 4 threads and 8 GB each (32 GB total to ensure I got past the ARC; it is not tuned down on this system). I saw exactly the same ratio of sequential read to random read (the random read performance was 23% of the sequential read performance in both cases). Based on looking at iostat values during the test, I am saturating all four drives with the write operations with just 1 thread. The sequential read is saturating the drives with anything more than 1 thread, and the random read is not saturating the drives until I get to about 6 threads.

threads   sw    sr    rw    rr
1         100   207   88    30
2         103   370   88    53
4         98    350   90    82
8         101   434   92    95

I confirmed that the problem is not unique to either 10U6 or the IDR; 10U8 has the same behavior. I confirmed that the problem is not unique to an FC-attached disk array or the SE-3511 in particular. Then I went back and took another look at my original data (SF-V480/SE-3511) and looked at throughput per thread.
For the sequential operations and the random write, the throughput per thread fell pretty far and pretty fast, but the per-thread random read numbers fell very slowly. Per-thread throughput in MB/sec.:

threads   sw    sr    rw    rr
1         112   221   96    18
2         53    109   46    17
4         26    55    22    13
8         12    24    9     12
16        5     10    5     8

So this makes me think that the random read performance issue is a per-thread limitation. Does anyone have any idea why ZFS is not reading as fast as the underlying storage can handle in the case of random reads? Or am I seeing an artifact of iozone itself? Is there another benchmark I should be using? P.S. I posted an OpenOffice.org spreadsheet of my test results here: http://www.ilk.org/~ppk/Geek/throughput-summary.ods -- Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, Lunacon 2010 (http://www.lunacon.org/) -> Technica
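A back-of-the-envelope check of the single-thread number, using the figures quoted above (asvc_t of about 14 ms, 256 KB records): with one outstanding I/O, throughput is bounded by transfer size divided by per-I/O latency, which lands almost exactly on the observed 18 MB/sec. A sketch (the function name is mine, and the model ignores queueing and prefetch):

```python
# Little's-law-style bound: with `outstanding` concurrent reads, throughput
# cannot exceed outstanding * (transfer size) / (per-I/O service time).
# Assumed inputs, taken from the iostat numbers in the post:
#   asvc_t ~= 14 ms, iozone record size = 256 KB.

def latency_bound_mb_per_s(record_kb: float, asvc_t_ms: float,
                           outstanding: int = 1) -> float:
    """Upper bound on read throughput in MB/sec for a given concurrency."""
    per_io_s = asvc_t_ms / 1000.0
    return outstanding * (record_kb / 1024.0) / per_io_s

print(f"1 thread : {latency_bound_mb_per_s(256, 14):.1f} MB/s")   # ~17.9, vs 18 observed
print(f"8 threads: {latency_bound_mb_per_s(256, 14, 8):.1f} MB/s (before disk saturation)")
```

That the measured 18 MB/sec tracks this bound while aggregate throughput keeps scaling with thread count is consistent with the poster's conclusion: a per-thread latency limit, not an array limit.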
[zfs-discuss] Best practices for zpools on zfs
Suppose I have a storage server that runs ZFS, presumably providing file (NFS) and/or block (iSCSI, FC) services to other machines that are running Solaris. Some of the use will be for LDoms and zones[1], which would create zpools on top of zfs (fs or zvol). I have concerns about variable block sizes and the implications for performance.

1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

Suppose that on the storage server, an NFS shared dataset is created without tuning the block size. This implies that when the client (ldom or zone v12n server) runs mkfile or similar to create the backing store for a vdisk or a zpool, the file on the storage server will be created with 128K blocks. Then when Solaris or OpenSolaris is installed into the vdisk or zpool, files of a wide variety of sizes will be created. At this layer they will be created with variable block sizes (512B to 128K). The implications for a 512 byte write in the upper level zpool (inside a zone or ldom) seem to be:

- The 512 byte write turns into a 128 KB write at the storage server (256x multiplication in write size).
- To write that 128 KB block, the rest of the block needs to be read to recalculate the checksum. That is, a read/modify/write process is forced. (Less impact if the block is already in the ARC.)
- Deduplication is likely to be less effective because it is unlikely that the same combination of small blocks in different zones/ldoms will be packed into the same 128 KB block.

Alternatively, the block size could be forced to something smaller at the storage server. Setting it to 512 bytes could eliminate the read/modify/write cycle, but would presumably be less efficient (less performant) with moderate to large files. Setting it somewhere in between may be desirable as well, but it is not clear where. The key competition in this area seems to have a fixed 4 KB block size.
Questions: Are my basic assumptions correct, that a given file consists of only a single block size, except perhaps for the final block? Has any work been done to identify the performance characteristics in this area? Is there less to be concerned about from a performance standpoint if the workload is primarily read? To maximize the efficacy of dedup, would it be best to pick a fixed block size and match it between the layers of zfs? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
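The 256x amplification arithmetic above can be sketched as follows (the helper is hypothetical and models the worst case, where every sub-record write rewrites the whole record on an ARC miss):

```python
# Sketch of worst-case write amplification when a small write from the upper
# zpool lands in a larger record on the storage server.  A sub-record write
# forces a read/modify/write of the whole record (to recompute the checksum),
# so the entire record is rewritten for the client's few bytes.

def write_amplification(write_bytes: int, recordsize_bytes: int) -> float:
    """Bytes the server writes per byte the client wrote (ARC-miss worst case)."""
    if write_bytes >= recordsize_bytes:
        return 1.0
    return recordsize_bytes / write_bytes

for rs_kb in (0.5, 4, 8, 128):
    rs = int(rs_kb * 1024)
    print(f"recordsize {rs_kb:>5} KB -> {write_amplification(512, rs):>5.0f}x for a 512 B write")
```

This is the trade-off in the question: 512-byte records give 1x amplification but poor large-file efficiency, 128 KB records give 256x for tiny writes, and an intermediate size (such as the 4 KB the competition uses) splits the difference.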
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Daniel Carosone writes:

>> I don't think it is easy to do, the txg counter is on
>> a pool level,
>> [..]
>> it would help when the entire pool is idle, though.
>
> .. which is exactly the scenario in question: when the disks are
> likely to be spun down already (or to spin down soon without further
> activity), and you want to avoid waking them up (or keeping them
> awake) with useless snapshot activity.

good point!

> However, this highlights that a (pool? fs?) property that exposes the
> current txg id (frozen in snapshots, as normal, if an fs property)
> might be enough for the userspace daemon to make its own decision to
> avoid requesting snapshots, without needing a whole discretionary
> mechanism in zfs itself.

you can fetch the "cr_txg" (cr for creation) for a snapshot using zdb, but the very creation of a snapshot requires a new txg to note that fact in the pool. if there are several filesystems to snapshot, you'll get a sequence of cr_txg values, and they won't be adjacent.

# zdb tank/te...@snap1
Dataset tank/te...@snap1 [ZVOL], ID 78, cr_txg 872401, 4.03G, 3 objects
# zdb -u tank
txg = 872402
timestamp = 1259064201 UTC = Tue Nov 24 13:03:21 2009
# sync
# zdb -u tank
txg = 872402
# zfs snapshot tank/te...@snap1
# zdb tank/te...@snap1
Dataset tank/te...@snap1 [ZVOL], ID 80, cr_txg 872419, 4.03G, 3 objects
# zdb -u tank
txg = 872420
timestamp = 1259064641 UTC = Tue Nov 24 13:10:41 2009

if the snapshot is taken recursively, all snapshots will have the same cr_txg, but that requires the same configuration for all filesets. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
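The userspace-daemon decision Daniel describes (skip the snapshot when no new txg has been committed) could be prototyped today by scraping `zdb -u` output, as the transcript suggests. A minimal parsing sketch (function names are mine; actually invoking zdb, and the sampling policy, are left out):

```python
import re

# Pull the current txg out of `zdb -u <pool>` output, which looks like the
# transcript above: "txg = 872402\ntimestamp = 1259064201 UTC = ...".
def parse_txg(zdb_u_output: str) -> int:
    m = re.search(r"^\s*txg\s*=\s*(\d+)", zdb_u_output, re.MULTILINE)
    if m is None:
        raise ValueError("no txg line in zdb -u output")
    return int(m.group(1))

def pool_was_idle(before: str, after: str) -> bool:
    """True if no new txg was committed between the two zdb -u samples."""
    return parse_txg(before) == parse_txg(after)

sample = "txg = 872402\ntimestamp = 1259064201 UTC = Tue Nov 24 13:03:21 2009\n"
print(parse_txg(sample))              # 872402
print(pool_was_idle(sample, sample))  # True
```

As the thread points out, taking a snapshot itself commits a new txg, so the comparison has to happen before the snapshot request; a txg property exposed by zfs would make the scraping unnecessary.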
Re: [zfs-discuss] ZFS Deduplication Replication
Hi Darren, Could you post the -D part of the man pages? I have no access to a system (yet) with the latest man pages. http://docs.sun.com/app/docs/doc/819-2240/zfs-1m has not been updated yet. Regards Peter Darren J Moffat wrote: Steven Sim wrote: Hello; Dedup on ZFS is an absolutely wonderful feature! Is there a way to conduct dedup replication across boxes from one dedup ZFS data set to another? Pass the '-D' argument to 'zfs send'. -- Regards Peter Brouwer, Sun Microsystems Linlithgow Principal Storage Architect, ABCP DRII Consultant Office:+44 (0) 1506 672767 Mobile:+44 (0) 7720 598226 Skype :flyingdutchman_,flyingdutchman_l ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss