Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote: As someone who currently faces kernel panics with recent U7+ kernel patches (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take the time to make sure that the implementation is as good as it can be and is thoroughly tested before release. Are you referring to the same testing that gained you this PCI panic feature in s10u7? No. The system worked with the kernel patch corresponding to baseline S10U7. Problems started with later kernel patches (which seem to be much less tested). Of course there could actually be a real hardware problem. Regardless, when the integrity of our data is involved, I prefer to wait for more testing rather than to potentially have to recover the pool from backup. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 6:28 PM, Bob Friesenhahn wrote: On Tue, 15 Sep 2009, Dale Ghent wrote: Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs. As someone who currently faces kernel panics with recent U7+ kernel patches (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take the time to make sure that the implementation is as good as it can be and is thoroughly tested before release. Are you referring to the same testing that gained you this PCI panic feature in s10u7? Testing is a no-brainer, and I would expect that there already exists some level of assurance that a CR fix is correct at the point of putback. But I've dealt with many bugs both very recently and long in the past where a fix has existed in Nevada for months, even a year, before I got bit by the same bug in s10 and then had to go through the support channels to A) convince whomever I'm talking to that, yes, I'm hitting this bug, B) yes, there is a fix, and then C) pretty please can I have an IDR. Just this week I'm wrapping up testing of an IDR which addresses an e1000g hardware erratum that was fixed in onnv earlier this year in February. For something that addresses a hardware issue on an Intel chipset used on shipping Sun servers, one would think that Sustaining would be on the ball and get that integrated ASAP. But the current mode of operation appears to be "no CR, no backport", which leaves us customers needlessly running into bugs and then begging for their fixes... or hearing the dreaded "oh that fix will be available two updates from now." Not cool. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote: Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs. As someone who currently faces kernel panics with recent U7+ kernel patches (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take the time to make sure that the implementation is as good as it can be and is thoroughly tested before release. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Reference below... On Sep 15, 2009, at 2:38 PM, Dale Ghent wrote: On Sep 15, 2009, at 5:21 PM, Richard Elling wrote: On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote: On Sep 10, 2009, at 3:12 PM, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching.

Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1.

This only affects metadata. Wouldn't it be better to disable prefetching for data?

Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did provide relief. Just a general description of my environment: My setup consists of several s10uX iscsi clients which get LUNs from a pair of thumpers. Each thumper pair exports identical LUNs to each iscsi client, and the client in turn mirrors each LUN pair inside a local zpool. As more space is needed on a client, a new LUN is created on the pair of thumpers, exported to the iscsi client, which then picks it up and we add a new mirrored vdev to the client's existing zpool. This is so we have data redundancy across chassis, so if one thumper were to fail or need patching, etc, the iscsi clients just see one side of their mirrors drop out. The problem that we observed on the iscsi clients was that, when viewing things through 'zpool iostat -v', far more IO was being requested from the LUs than was being registered for the vdev those LUs were a member of. Since that was an iscsi setup with stock thumpers (no SSD ZIL, L2ARC) serving the LUs, this apparent overhead caused far more unnecessary disk IO on the thumpers, thus starving out IO for data that was actually needed. The working set is lots of small-ish files, entirely random IO. If zfs_vdev_cache_max only affects metadata prefetches, which parameter affects data prefetches?

There are two main areas for prefetch: at the transactional object layer (DMU) and the pooled storage level (VDEV). zfs_vdev_cache_max works at the VDEV level, obviously. The DMU knows more about the context of the data and is where the intelligent prefetching algorithm works. You can easily observe the VDEV cache statistics with kstat:

# kstat -n vdev_cache_stats
module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          38.83342625
        delegations                     14030
        hits                            105169
        misses                          59452
        snaptime                        4564628.18130739

This represents a 59% cache hit rate, which is pretty decent. But you will notice far fewer delegations+hits+misses than real IOPS because it is only caching metadata. Unfortunately, there is not a kstat for showing the DMU cache stats. But a DTrace script can be written or, even easier, lockstat will show if you are spending much time in the zfetch_* functions. More details are in the Evil Tuning Guide, including how to set zfs_prefetch_disable: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

I have to admit that disabling device-level prefetching was a shot in the dark, but it did result in drastically reduced contention on the thumpers.

That is a little bit surprising. I would expect little metadata activity for iscsi service. It would not be surprising for older Solaris 10 releases, though. It was fixed in NV b70, circa July 2007. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
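For anyone who wants to make the same two checks on their own pool, here is a rough, untested sketch using only stock tools. The kstat/awk field handling and the lockstat invocation are additions of mine, not something from Richard's message; the arithmetic simply mirrors his hits/(delegations+hits+misses) calculation.

#!/bin/ksh
# Print the VDEV cache hit rate the same way as the 59% figure above,
# then take a 30-second kernel profile; zfetch_* functions near the top
# of the profile indicate time spent in DMU-level file prefetch.
kstat -p zfs:0:vdev_cache_stats | awk -F'\t' '
        /:delegations/ { del  = $2 }
        /:hits/        { hit  = $2 }
        /:misses/      { miss = $2 }
        END {
                if (del + hit + miss > 0)
                        printf("vdev cache hit rate: %.1f%%\n",
                               100 * hit / (del + hit + miss))
        }'
lockstat -kIW -D 20 sleep 30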
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 5:21 PM, Richard Elling wrote: On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote: On Sep 10, 2009, at 3:12 PM, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching.

Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1.

This only affects metadata. Wouldn't it be better to disable prefetching for data?

Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did provide relief. Just a general description of my environment: My setup consists of several s10uX iscsi clients which get LUNs from a pair of thumpers. Each thumper pair exports identical LUNs to each iscsi client, and the client in turn mirrors each LUN pair inside a local zpool. As more space is needed on a client, a new LUN is created on the pair of thumpers, exported to the iscsi client, which then picks it up and we add a new mirrored vdev to the client's existing zpool. This is so we have data redundancy across chassis, so if one thumper were to fail or need patching, etc, the iscsi clients just see one side of their mirrors drop out. The problem that we observed on the iscsi clients was that, when viewing things through 'zpool iostat -v', far more IO was being requested from the LUs than was being registered for the vdev those LUs were a member of. Since that was an iscsi setup with stock thumpers (no SSD ZIL, L2ARC) serving the LUs, this apparent overhead caused far more unnecessary disk IO on the thumpers, thus starving out IO for data that was actually needed. The working set is lots of small-ish files, entirely random IO. If zfs_vdev_cache_max only affects metadata prefetches, which parameter affects data prefetches?

I have to admit that disabling device-level prefetching was a shot in the dark, but it did result in drastically reduced contention on the thumpers. /dale

Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs. Am I the only one that thinks this is way too conservative? It's just maddening to know that a highly beneficial fix is out there, but its release is based on time rather than need. Sustaining really needs to be more proactive when it comes to this stuff. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
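Dale doesn't show how he set zfs_vdev_cache_max=1. For anyone wanting to repeat the experiment, the usual two routes are sketched below; this assumes the variable can be poked with mdb and set from /etc/system like other zfs module tunables, so verify against the Evil Tuning Guide before relying on it.

# Live change (effective immediately, lost at reboot):
echo zfs_vdev_cache_max/W0t1 | mdb -kw
# Read it back to confirm:
echo zfs_vdev_cache_max/D | mdb -k
# Persistent change: add this line to /etc/system and reboot:
#   set zfs:zfs_vdev_cache_max = 1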
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote: On Sep 10, 2009, at 3:12 PM, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1. This only affects metadata. Wouldn't it be better to disable prefetching for data? -- richard Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs. Am I the only one that thinks this is way too conservative? It's just maddening to know that a highly beneficial fix is out there, but its release is based on time rather than need. Sustaining really needs to be more proactive when it comes to this stuff. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 10, 2009, at 3:12 PM, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1. Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs. Am I the only one that thinks this is way too conservative? It's just maddening to know that a highly beneficial fix is out there, but its release is based on time rather than need. Sustaining really needs to be more proactive when it comes to this stuff. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Is a diff for the source already available? On Sep 11, 2009, at 4:02 PM, Rich Morris wrote: On 09/10/09 16:22, en...@businessgrade.com wrote: Quoting Bob Friesenhahn : On Thu, 10 Sep 2009, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? Thanks, Bob Is this fixed in snv_122 or something else? snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:22, en...@businessgrade.com wrote: Quoting Bob Friesenhahn : On Thu, 10 Sep 2009, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? Thanks, Bob Is this fixed in snv_122 or something else? snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote: Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS prefetch appears to work well for most prefetch patterns. But this CR found a pattern that should have worked well but did not. It seems that after doing a fresh mount, the zfs prefetch is not quite enough to keep my hungry highly-tuned application sufficiently well fed. I will have to wait and see though. In the mean time, I need to investigate why recent Solaris 10 kernel patches (141415-10) cause my Sun Ultra-40M2 system to panic five minutes into 'zpool scrub' with a fault being reported against the motherboard. Maybe a few more motherboard swaps will solve it (on 4th motherboard now). 141415-3 seems less likely to panic since it survives a full scrub (unless VirtualBox is running a Linux instance). Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:17, Bob Friesenhahn wrote: On Thu, 10 Sep 2009, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS prefetch appears to work well for most prefetch patterns. But this CR found a pattern that should have worked well but did not. -- Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hello Rich, On Sep 10, 2009, at 9:12 PM, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Nice work, do you know if it will be released as a patch for s10u8 or will it only be part of the update 9 KUP? Regards Henrik http://sparcv9.blogspot.com___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Quoting Bob Friesenhahn : On Thu, 10 Sep 2009, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? Thanks, Bob Is this fixed in snv_122 or something else? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote: On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch? Thanks, Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 07/28/09 17:13, Rich Morris wrote: On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9. This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching. -- Rich

CR 6859997 has been accepted and is actively being worked on. The following info has been added to that CR:

This is a problem with the ZFS file prefetch code (zfetch) in dmu_zfetch.c. The test script provided by the submitter (thanks Bob!) does no file prefetching the second time through each file. This problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris. This test script creates 3000 files each 8M long so the amount of data (24G) is greater than the amount of memory (16G on a Thumper). With the default blocksize of 128k, each of the 3000 files has 63 blocks. The first time through, zfetch ramps up a single prefetch stream normally. But the second time through, dmu_zfetch() calls dmu_zfetch_find() which thinks that the data has already been prefetched so no additional prefetching is started.

This problem is not seen with 500 files each 48M in length (still 24G of data). In that case there's still only one prefetch stream but it is reclaimed when one of the requested offsets is not found. The reason it is not found is that the stream "strided" the first time through after reaching the zfetch cap, which is 256 blocks. Files with no more than 256 blocks don't require a stride. So this problem will only be seen when the data from a file with no more than 256 blocks is accessed after being tossed from the ARC.

The fix for this problem may be more feedback between the ARC and the zfetch code. Or it may make sense to restart the prefetch stream after some time has passed or perhaps whenever there's a miss on a block that was expected to have already been prefetched? On a Thumper running Nevada build 118, the first pass of this test takes 2 minutes 50 seconds and the second pass takes 5 minutes 22 seconds. If dmu_zfetch_find() is modified to restart the prefetch stream when the requested offset is 0 and more than 2 seconds has passed since the stream was last accessed then the time needed for the second pass is reduced to 2 minutes 24 seconds. Additional investigation is currently taking place to determine if another solution makes more sense. And more testing will be needed to see what effect this change has on other prefetch patterns.

6412053 is a related CR which mentions that the zfetch code may not be issuing I/O at a sufficient pace. This behavior is also seen on a Thumper running the test script in CR 6859997 since, even when prefetch is ramping up as expected, less than half of the available I/O bandwidth is being used. More aggressive file prefetching could, however, increase memory pressure as described in CRs 6258102 and 6469558. -- Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
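One way to watch whether the second pass really stops prefetching, without patching the kernel, is to count calls into the zfetch entry points while the test runs. This is a speculative sketch of mine: the function names come from dmu_zfetch.c as described above, but the fbt probes may not exist on every build (static functions can be inlined), so check the output of dtrace -l first.

# Count zfetch activity and ARC reads in 10-second buckets while a cpio
# pass runs; comparing the first and second passes shows whether the
# prefetch code is still being exercised on the re-read.
dtrace -qn '
        fbt::dmu_zfetch:entry      { @calls[probefunc] = count(); }
        fbt::dmu_zfetch_find:entry { @calls[probefunc] = count(); }
        fbt::arc_read:entry        { @calls[probefunc] = count(); }
        tick-10sec                 { printa(@calls); trunc(@calls); }'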
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote: The fix for this problem may be more feedback between the ARC and the zfetch code. Or it may make sense to restart the prefetch stream after some time has passed or perhaps whenever there's a miss on a block that was expected to have already been prefetched? Regarding this approach of waiting for a prefetch miss, this seems like it would produce an uneven flow of data to the application and not ensure that data is always available when the application goes to read it. A stutter is likely to produce at least a 10ms gap (and possibly far greater) while the application is blocked in read() waiting for data. Since zfs blocks are large, stuttering becomes expensive, and if the application itself needs to read ahead 128K in order to avoid the stutter, then it consumes memory in an expensive non-sharable way. In the ideal case, zfs will always stay one 128K block ahead of the application's requirement and the unconsumed data will be cached in the ARC where it can be shared with other processes. For an application with real-time data requirements, it is definitely desirable not to stutter at all if possible. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote: 6412053 is a related CR which mentions that the zfetch code may not be issuing I/O at a sufficient pace. This behavior is also seen on a Thumper running the test script in CR 6859997 since, even when prefetch is ramping up as expected, less than half of the available I/O bandwidth is being used. Although more aggressive file prefetching could increase memory pressure as described in CRs 6258102 and 6469558. It is good to see this analysis. Certainly the optimum prefetching required for an Internet video streaming server (with maybe 300 kilobits/second per stream) is radically different than what is required for uncompressed 2K preview (8MB/frame) of motion picture frames (320 megabytes/second per stream) but zfs should be able to support both. Besides real-time analysis based on current stream behavior and memory, it would be useful to maintain some recent history for the whole pool so that a pool which is usually used for 1000 slow-speed video streams behaves differently by default than one used for one or two high-speed video streams. With this bit of hint information, files belonging to a pool recently producing high-speed streams can be ramped up quickly while files belonging to a pool which has recently fed low-speed streams can be ramped up more conservatively (until proven otherwise) in order to not flood memory and starve the I/O needed by other streams. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. CR 6859997 has been accepted and is actively being worked on. The following info has been added to that CR:

This is a problem with the ZFS file prefetch code (zfetch) in dmu_zfetch.c. The test script provided by the submitter (thanks Bob!) does no file prefetching the second time through each file. This problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris. This test script creates 3000 files each 8M long so the amount of data (24G) is greater than the amount of memory (16G on a Thumper). With the default blocksize of 128k, each of the 3000 files has 63 blocks. The first time through, zfetch ramps up a single prefetch stream normally. But the second time through, dmu_zfetch() calls dmu_zfetch_find() which thinks that the data has already been prefetched so no additional prefetching is started.

This problem is not seen with 500 files each 48M in length (still 24G of data). In that case there's still only one prefetch stream but it is reclaimed when one of the requested offsets is not found. The reason it is not found is that the stream "strided" the first time through after reaching the zfetch cap, which is 256 blocks. Files with no more than 256 blocks don't require a stride. So this problem will only be seen when the data from a file with no more than 256 blocks is accessed after being tossed from the ARC.

The fix for this problem may be more feedback between the ARC and the zfetch code. Or it may make sense to restart the prefetch stream after some time has passed or perhaps whenever there's a miss on a block that was expected to have already been prefetched? On a Thumper running Nevada build 118, the first pass of this test takes 2 minutes 50 seconds and the second pass takes 5 minutes 22 seconds. If dmu_zfetch_find() is modified to restart the prefetch stream when the requested offset is 0 and more than 2 seconds has passed since the stream was last accessed then the time needed for the second pass is reduced to 2 minutes 24 seconds. Additional investigation is currently taking place to determine if another solution makes more sense. And more testing will be needed to see what effect this change has on other prefetch patterns.

6412053 is a related CR which mentions that the zfetch code may not be issuing I/O at a sufficient pace. This behavior is also seen on a Thumper running the test script in CR 6859997 since, even when prefetch is ramping up as expected, less than half of the available I/O bandwidth is being used. More aggressive file prefetching could, however, increase memory pressure as described in CRs 6258102 and 6469558. -- Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
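For readers without access to the CR, the access pattern it describes can be approximated with a few lines of ksh. This is only a sketch of the shape of the workload: the pool name tank is a placeholder, /dev/zero is fine as a data source while compression is off (the default), and the real zfs-cache-test.ksh referenced elsewhere in this thread is the authoritative version.

#!/bin/ksh
# Sketch of the CR 6859997 test pattern: write ~24 GB of 8192000-byte
# files, remount to empty the ARC, then read everything twice with cpio.
# The interesting number is how much slower the second pass is once the
# metadata is cached but the file data has been evicted.
pool=tank                       # placeholder pool name
zfs create $pool/zfscachetest
cd /$pool/zfscachetest || exit 1
i=0
while [ $i -lt 3000 ]; do
        dd if=/dev/zero of=file.$i bs=8000k count=1 >/dev/null 2>&1
        i=$((i + 1))
done
cd /
zfs unmount $pool/zfscachetest
zfs mount $pool/zfscachetest
cd /$pool/zfscachetest
echo "first pass (cold cache):"
time find . -type f | cpio -C 131072 -o > /dev/null
echo "second pass (warm metadata, evicted data):"
time find . -type f | cpio -C 131072 -o > /dev/null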
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 22 Jul 2009, Roch wrote: Hi Bob, did you consider running the 2 runs with

echo zfs_prefetch_disable/W0t1 | mdb -kw

and seeing if performance is constant between the 2 runs (and low)? That would help clear the cause a bit. Sorry, I'd do it for you but since you have the setup etc... Revert with:

echo zfs_prefetch_disable/W0t0 | mdb -kw

-r

I see that if I update my test script so that prefetch is disabled before the first cpio is executed, the read performance of the first cpio reported by 'zpool iostat' is similar to what has been normal for the second cpio case (i.e. 32MB/second). This seems to indicate that prefetch is entirely disabled if the file has ever been read before. However, there is a new wrinkle in that the second cpio completes twice as fast with prefetch disabled even though 'zpool iostat' indicates the same consistent throughput. The difference goes away if I triple the number of files.

With 3000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    3m41.61s
user    0m0.44s
sys     0m8.12s

Doing second 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    1m50.12s
user    0m0.42s
sys     0m7.21s

Now if I increase the number of files to 9000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m51.47s
user    0m4.46s
sys     1m20.11s

Doing second 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m22.41s
user    0m4.40s
sys     1m14.22s

Notice that with 3X the files, the throughput is dramatically reduced and the time is the same for both cases.

Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
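For completeness, the current value of the tunable can be read back before and after the change; the /D read is ordinary mdb usage rather than something from Roch's note:

echo zfs_prefetch_disable/D | mdb -k        # show the current value (0 means prefetch enabled)
echo zfs_prefetch_disable/W0t1 | mdb -kw    # disable file-level prefetch
# ... run the cpio passes ...
echo zfs_prefetch_disable/W0t0 | mdb -kw    # restore the default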
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS pre-fetching disabled altogether to see if the results are consistent between runs? Brad Brad Diggs Senior Directory Architect Virtualization Architect xVM Technology Lead Sun Microsystems, Inc. Phone x52957/+1 972-992-0002 Mail bradley.di...@sun.com Blog http://TheZoneManager.com Blog http://BradDiggs.com On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote: On Wed, 15 Jul 2009, Ross wrote: Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. I don't think that this hypothesis is quite correct. If you use 'zpool iostat' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read- ahead algorithm. You would not want to do read-ahead on data that you already have in a cache. Recent OpenSolaris seems to take a 2X performance hit rather than the 4X hit that Solaris 10 takes. This may be due to improvement of existing algorithm function performance (optimizations) rather than a related design improvement. I wonder if there is any tuning that can be done to counteract this? Is there any way to tell ZFS to bias towards prefetching rather than preserving data in the ARC? That may provide better performance for scripts like this, or for random access workloads. Recent zfs development focus has been on how to keep prefetch from damaging applications like database where prefetch causes more data to be read than is needed. Since OpenSolaris now apparently includes an option setting which blocks file data caching and prefetch, this seems to open the door for use of more aggressive prefetch in the normal mode. In summary, I agree with Richard Elling's hypothesis (which is the same as my own). Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote: > On Mon, 20 Jul 2009, Marion Hakanson wrote: > > It is definitely real. Sun has opened internal CR 6859997. It is now in > Dispatched state at High priority. > Is there a way we can get a Sun person on this list to supply a little bit more info on that CR? Seems there's a lot of people bitten by this, from low-end to extremely high-end hardware. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 20 Jul 2009, Marion Hakanson wrote: Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I've definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. It is likely that adding more cpios would cause more data to be read, but it would also thrash the disks with many more conflicting IOPS. I'm not suggesting that the bug you're demonstrating is not real. It's It is definitely real. Sun has opened internal CR 6859997. It is now in Dispatched state at High priority. that points out a problem. Rather, I'm thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. The similarity of performance between the low-end and high-end storage systems is a sign that the rotating rust is not a whole lot faster on the high-end storage systems. Since zfs is failing to use pre-fetch, only one (or maybe two) disks are accessed at a time. If more read I/Os are issued in parallel, then the data read rate will be vastly higher on the higher-end systems. With my 12 disk array and a large sequential read, zfs can issue 12 requests for 128K at once and since it can also queue pending I/Os, it can request many more than that. Care is required since over-reading will penalize the system. It is not an easy thing to get right. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
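If anyone wants to try the multi-reader idea against the same file set, a crude way is to split the file list across several concurrent cpio streams. A sketch only: the path and stream count are placeholders, and as Bob notes above, more streams also means more competing IOPS.

#!/bin/ksh
# Run several concurrent cpio read streams over disjoint subsets of the
# test files (stream k reads every Nth file) and time the whole run.
dir=/tank/zfscachetest          # placeholder path to the test file set
streams=4
parallel_read() {
        s=0
        while [ $s -lt $streams ]; do
                find . -type f | awk -v n=$streams -v k=$s 'NR % n == k' |
                    cpio -C 131072 -o > /dev/null &
                s=$((s + 1))
        done
        wait                    # block until every stream finishes
}
cd $dir || exit 1
time parallel_read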
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
bfrie...@simple.dallas.tx.us said: > No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) > currently have a software-imposed read bottleneck which places a limit on > how well systems will perform on this simple sequential read benchmark. > After a certain point (which is unfortunately not very high), throwing more > hardware at the problem does not result in any speed improvement. This is > demonstrated by Scott Lawson's little two disk mirror almost producing the > same performance as our much more exotic setups. Apologies for reawakening this thread -- I was away last week. Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I've definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. I'm not suggesting that the bug you're demonstrating is not real. It's clear that subsequent runs on the same system show the degradation, and that points out a problem. Rather, I'm thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have received email that Sun CR numbers 6861397 & 6859997 have been created to get this performance problem fixed. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sun, 2009-07-12 at 16:38 -0500, Bob Friesenhahn wrote:
> In order to raise visibility of this issue, I invite others to see if
> they can reproduce it in their ZFS pools. The script at
>
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

Here's the results from two machines, the first has 12x400MHz US-II CPUs, 11GB of RAM and the disks are 18GB 10krpm SCSI in a split D1000:

System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise 4000/5000
System architecture: sparc
System release level: 5.11 snv_101
CPU ISA list: sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc

Pool configuration:
  pool: space
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h22m with 0 errors on Mon Jul 13 17:18:55 2009
config:

        NAME         STATE     READ WRITE CKSUM
        space        ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t2d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c2t13d0  ONLINE       1     0     0  128K repaired

errors: No known data errors

zfs create space/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ...
Done!
zfs unmount space/zfscachetest
zfs mount space/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    11m40.67s
user    0m20.32s
sys     5m27.16s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    31m29.42s
user    0m19.31s
sys     6m46.39s

Feel free to clean up with 'zfs destroy space/zfscachetest'.

The second has 2x1.2GHz US-III+, 4GB RAM and 10krpm FC disks on a single loop.

System Configuration: Sun Microsystems sun4u Sun Fire 480R
System architecture: sparc
System release level: 5.11 snv_97
CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc

Pool configuration:
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        space        ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t34d0  ONLINE       0     0     0
            c1t48d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t35d0  ONLINE       0     0     0
            c1t49d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t36d0  ONLINE       0     0     0
            c1t51d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t33d0  ONLINE       0     0     0
            c1t52d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t38d0  ONLINE       0     0     0
            c1t53d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t39d0  ONLINE       0     0     0
            c1t54d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t40d0  ONLINE       0     0     0
            c1t55d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t41d0  ONLINE       0     0     0
            c1t56d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t42d0  ONLINE       0     0     0
            c1t57d0  ONLINE       0     0     0
        logs         ONLINE       0     0     0
          c1t50d0    ONLINE       0     0     0

errors: No known data errors

zfs create space/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ...
Done!
zfs unmount space/zfscachetest
zfs mount space/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    5m45.66s
user    0m5.63s
sys     1m14.66s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    15m29.42s
user    0m5.65s
sys     1m37.83s

Feel free to clean up with 'zfs destroy space/zfscachetest'.

James Andrewartha ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Aaah, ok, I think I understand now. Thanks Richard. I'll grab the updated test and have a look at the ARC ghost results when I get back to work tomorrow. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote: heh. What you would be looking for is evidence of prefetching. If there is a lot of prefetching, the actv will tend to be high and latencies relatively low. If there is no prefetching, actv will be low and latencies may be higher. This also implies that if you use IDE disks, which cannot handle multiple outstanding I/Os, the performance will look similar for both runs. Ok, here are some stats for the "poor" (initial "USB" rates) and "terrible" (sub-"USB" rates) cases. "poor" (29% busy) iostat: extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c0t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c1t0d0 0.01.20.0 11.4 0.0 0.00.04.5 0 0 c1t1d0 91.20.0 11654.70.0 0.0 0.80.09.2 0 27 c4t600A0B80003A8A0B096147B451BEd0 95.00.0 12160.30.0 0.0 0.90.09.9 0 29 c4t600A0B800039C9B50A9C47B4522Dd0 96.40.0 12333.10.0 0.0 0.90.09.5 0 29 c4t600A0B800039C9B50AA047B4529Bd0 96.80.0 12377.90.0 0.0 0.90.09.5 0 30 c4t600A0B80003A8A0B096647B453CEd0 100.40.0 12845.10.0 0.0 1.00.09.5 0 29 c4t600A0B800039C9B50AA447B4544Fd0 93.40.0 11949.10.0 0.0 0.80.09.0 0 28 c4t600A0B80003A8A0B096A47B4559Ed0 91.50.0 11705.90.0 0.0 0.90.09.7 0 28 c4t600A0B800039C9B50AA847B45605d0 91.40.0 11680.30.0 0.0 0.90.0 10.1 0 29 c4t600A0B80003A8A0B096E47B456DAd0 88.90.0 11366.70.0 0.0 0.90.09.7 0 27 c4t600A0B800039C9B50AAC47B45739d0 94.30.0 12045.50.0 0.0 0.90.09.9 0 29 c4t600A0B800039C9B50AB047B457ADd0 96.50.0 12339.50.0 0.0 0.90.09.3 0 28 c4t600A0B80003A8A0B097347B457D4d0 87.90.0 11232.70.0 0.0 0.90.0 10.4 0 29 c4t600A0B800039C9B50AB447B4595Fd0 0.00.00.00.0 0.0 0.00.00.0 0 0 c5t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c6t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c2t202400A0B83A8A0Bd31 0.00.00.00.0 0.0 0.00.00.0 0 0 c3t202500A0B83A8A0Bd31 0.00.00.00.0 0.0 0.00.00.0 0 0 freddy:vold(pid508) "terrible" (8% busy) iostat: extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c0t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c1t0d0 0.01.80.01.0 0.0 0.00.0 26.6 0 1 c1t1d0 26.80.0 3430.40.0 0.0 0.10.02.9 0 8 c4t600A0B80003A8A0B096147B451BEd0 21.00.0 2688.00.0 0.0 0.10.03.9 0 8 c4t600A0B800039C9B50A9C47B4522Dd0 24.00.0 3059.60.0 0.0 0.10.03.4 0 8 c4t600A0B800039C9B50AA047B4529Bd0 27.60.0 3532.80.0 0.0 0.10.03.2 0 9 c4t600A0B80003A8A0B096647B453CEd0 20.80.0 2662.40.0 0.0 0.10.03.1 0 6 c4t600A0B800039C9B50AA447B4544Fd0 26.50.0 3392.00.0 0.0 0.10.02.6 0 7 c4t600A0B80003A8A0B096A47B4559Ed0 20.60.0 2636.80.0 0.0 0.10.03.0 0 6 c4t600A0B800039C9B50AA847B45605d0 22.90.0 2931.20.0 0.0 0.10.03.8 0 9 c4t600A0B80003A8A0B096E47B456DAd0 21.40.0 2739.20.0 0.0 0.10.03.5 0 7 c4t600A0B800039C9B50AAC47B45739d0 23.10.0 2944.40.0 0.0 0.10.03.7 0 9 c4t600A0B800039C9B50AB047B457ADd0 24.90.0 3187.20.0 0.0 0.10.03.4 0 8 c4t600A0B80003A8A0B097347B457D4d0 28.30.0 3622.40.0 0.0 0.10.02.8 0 8 c4t600A0B800039C9B50AB447B4595Fd0 0.00.00.00.0 0.0 0.00.00.0 0 0 c5t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c6t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c2t202400A0B83A8A0Bd31 0.00.00.00.0 0.0 0.00.00.0 0 0 c3t202500A0B83A8A0Bd31 0.00.00.00.0 0.0 0.00.00.0 0 0 freddy:vold(pid508) Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: On Wed, 15 Jul 2009, Richard Elling wrote: Unfortunately, "zpool iostat" doesn't really tell you anything about performance. All it shows is bandwidth. Latency is what you need to understand performance, so use iostat. You are still thinking about this as if it was a hardware-related problem when it is clearly not. Iostat is useful for analyzing hardware-related problems in the case where the workload is too much for the hardware, or the hardware is non-responsive. Anyone who runs this crude benchmark will discover that iostat shows hardly any disk utilization at all, latencies are low, and read I/O rates are low enough that they could be satisfied by a portable USB drive. You can even observe the blinking lights on the front of the drive array and see that it is lightly loaded. This explains why a two disk mirror is almost able to keep up with a system with 40 fast SAS drives. heh. What you would be looking for is evidence of prefetching. If there is a lot of prefetching, the actv will tend to be high and latencies relatively low. If there is no prefetching, actv will be low and latencies may be higher. This also implies that if you use IDE disks, which cannot handle multiple outstanding I/Os, the performance will look similar for both runs. Or, you could get more sophisticated and use a dtrace script to look at the I/O behavior to determine the latency between contiguous I/O requests. Something like iopattern is a good start, though it doesn't try to measure the time between requests, it would be easy to add. http://www.richardelling.com/Home/scripts-and-programs-1/iopattern -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
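As a rough illustration of the "time between requests" measurement mentioned above, the io provider can collect it in a few lines. This is my sketch, not part of iopattern; it lumps reads and writes together and the first event per device lands in the 0 bucket.

# Distribution of the gap (microseconds) between successive I/O requests
# issued to each device; long gaps during a sequential read suggest the
# prefetcher is not keeping the device busy.
dtrace -qn '
        io:::start
        {
                this->dev = args[1]->dev_statname;
                @gap[this->dev] = quantize(last[this->dev] ?
                    (timestamp - last[this->dev]) / 1000 : 0);
                last[this->dev] = timestamp;
        }'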
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote: Unfortunately, "zpool iostat" doesn't really tell you anything about performance. All it shows is bandwidth. Latency is what you need to understand performance, so use iostat. You are still thinking about this as if it was a hardware-related problem when it is clearly not. Iostat is useful for analyzing hardware-related problems in the case where the workload is too much for the hardware, or the hardware is non-responsive. Anyone who runs this crude benchmark will discover that iostat shows hardly any disk utilization at all, latencies are low, and read I/O rates are low enough that they could be satisfied by a portable USB drive. You can even observe the blinking lights on the front of the drive array and see that it is lightly loaded. This explains why a two disk mirror is almost able to keep up with a system with 40 fast SAS drives. This is the opposite situation from the zfs writes which periodically push the hardware to its limits. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: On Wed, 15 Jul 2009, Ross wrote: Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. I don't think that this hypothesis is quite correct. If you use 'zpool iostat' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. Unfortunately, "zpool iostat" doesn't really tell you anything about performance. All it shows is bandwidth. Latency is what you need to understand performance, so use iostat. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read-ahead algorithm. You would not want to do read-ahead on data that you already have in a cache. I realized this morning that what I posted last night might be misleading to the casual reader. Clearly the first time through the data is prefetched and misses the cache. On the second pass, it should also miss the cache, if it were a simple cache. But the ARC tries to be more clever and has ghosts -- where the data is no longer in cache, but the metadata is. I suspect the prefetching is not being used for the ghosts. The arcstats will show this. As benr blogs, "These Ghosts lists are magic. If you get a lot of hits to the ghost lists, it means that ARC is WAY too small and that you desperately need either more RAM or an L2 ARC device (likely, SSD). Please note, if you are considering investing in L2 ARC, check this FIRST." http://www.cuddletech.com/blog/pivot/entry.php?id=979 This is the explicit case presented by your test. This also explains why the entry from the system with an L2ARC did not have the performance "problem." Also, another test would be to have two large files. Read from one, then the other, then from the first again. Capture arcstats from between the reads and see if the haunting stops ;-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
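The ghost-list hits Richard describes are visible in the standard arcstats kstat, so the check can be scripted without extra tools. The stat names are the stock mru/mfu ghost counters; the before/after comparison is my suggestion, so treat it as a sketch.

# Snapshot the ARC ghost-list counters around the second read pass; a
# large jump in mru_ghost_hits or mfu_ghost_hits means the reads keep
# finding buffers the ARC remembers but has already evicted, i.e. the
# ARC is too small for the working set (or an L2ARC would help).
kstat -p zfs:0:arcstats | grep ghost_hits > /tmp/ghost.before
# ... run the second cpio pass ...
kstat -p zfs:0:arcstats | grep ghost_hits > /tmp/ghost.after
diff /tmp/ghost.before /tmp/ghost.after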
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, My D. Truong wrote: Here's an example of an OpenSolaris machine, 2008.11 upgraded to the 117 devel release. X4540, 32GB RAM. The file count was bumped up to 9000 to be a little over double the RAM. Your timings show a 3.1X hit so it appears that the OpenSolaris improvement is not as much as was assumed. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Ross wrote: Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. I don't think that this hypothesis is quite correct. If you use 'zpool iostat' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read-ahead algorithm. You would not want to do read-ahead on data that you already have in a cache. Recent OpenSolaris seems to take a 2X performance hit rather than the 4X hit that Solaris 10 takes. This may be due to improvement of existing algorithm function performance (optimizations) rather than a related design improvement. I wonder if there is any tuning that can be done to counteract this? Is there any way to tell ZFS to bias towards prefetching rather than preserving data in the ARC? That may provide better performance for scripts like this, or for random access workloads. Recent zfs development focus has been on how to keep prefetch from damaging applications like database where prefetch causes more data to be read than is needed. Since OpenSolaris now apparently includes an option setting which blocks file data caching and prefetch, this seems to open the door for use of more aggressive prefetch in the normal mode. In summary, I agree with Richard Elling's hypothesis (which is the same as my own). Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
> It would be good to see results from a few OpenSolaris users running a
> recent 64-bit kernel, and with fast storage to see if this is an
> OpenSolaris issue as well.

Bob,

Here's an example of an OpenSolaris machine, 2008.11 upgraded to the 117 devel release. X4540, 32GB RAM. The file count was bumped up to 9000 to be a little over double the RAM.

r...@deviant:~# ./zfs-cache-test.ksh gauss
System Configuration: Sun Microsystems Sun Fire X4540
System architecture: i386
System release level: 5.11 snv_117
CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86

Pool configuration:
  pool: gauss
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        gauss       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0

errors: No known data errors

zfs create gauss/zfscachetest
Creating data file set (9000 files of 8192000 bytes) under /gauss/zfscachetest ...
Done!
zfs unmount gauss/zfscachetest
zfs mount gauss/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    9m15.87s
user    0m5.16s
sys     1m29.32s

Doing second 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    28m57.54s
user    0m5.47s
sys     1m50.32s

Feel free to clean up with 'zfs destroy gauss/zfscachetest'.

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. On the second run, however, the ARC is already full of the data that we just read, and I'm guessing that the prefetch code is less aggressive when there is already data in the ARC. For normal use that may be what you want - it's trying to keep things in the ARC in case they are needed. However, that does mean that ZFS prefetch is always going to suffer performance degradation on a live system, although early signs are that this might not be so severe in snv_117. I wonder if there is any tuning that can be done to counteract this? Is there any way to tell ZFS to bias towards prefetching rather than preserving data in the ARC? That may provide better performance for scripts like this, or for random access workloads. Also, could there be any generic algorithm improvements that could help? Why should ZFS keep data in the ARC if it hasn't been used again? This script uses 8MB files, and the ARC should be using at least 1GB of RAM. That's a minimum of 128 files in memory, none of which would have been read more than once. If we're reading a new file now, prefetching should be able to displace any old object in the ARC that hasn't been used - in this case all 127 previous files should be candidates for replacement. I wonder how that would interact with an L2ARC. If that was fast enough I'd certainly want to allocate more of the ARC to prefetching. Finally, would it make sense for the ARC to always allow a certain percentage for prefetching, possibly with that percentage being tunable, allowing us to balance the needs of the two mechanisms according to the expected usage? Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
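On the tuning question: there is no documented knob that reserves a share of the ARC for prefetch, but the two tunables usually mentioned in this context are the ARC size cap and the prefetch enable/disable switch. A sketch, with illustrative values (the /etc/system entries take effect after a reboot):

kstat -p zfs:0:arcstats:c zfs:0:arcstats:c_max      # current ARC target and ceiling, in bytes

echo 'set zfs:zfs_arc_max = 0x100000000' >> /etc/system       # cap the ARC at 4 GB
echo 'set zfs:zfs_prefetch_disable = 0' >> /etc/system        # 0 = leave file-level prefetch enabled

Capping the ARC lower than RAM is a blunt instrument, but it is one way to experiment with the balance Ross is asking about.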
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Richard Elling wrote: > I think a picture is emerging that if you have enough RAM, the > ARC is working very well. Which means that the ARC management > is suspect. > > I propose the hypothesis that ARC misses are not prefetched. The > first time through, prefetching works. For the second pass, ARC > misses are not prefetched, so sequential reads go slower. You may be right: it may be that the cache is not being refilled with new, important data because it is already 100% full of unimportant data. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I think a picture is emerging that if you have enough RAM, the ARC is working very well. Which means that the ARC management is suspect. I propose the hypothesis that ARC misses are not prefetched. The first time through, prefetching works. For the second pass, ARC misses are not prefetched, so sequential reads go slower. For JBODs, the effect will be worse than for LUNs on a storage array with lots of cache. benr's prefetch script will help shed light on this, but apparently doesn't work for Solaris 10. Since the Solaris 10 source is not publicly available, someone with source access might need to adjust it to match the Solaris 10 source. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
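One way to test this hypothesis without source access is to sample the DMU prefetch statistics around each cpio pass; the zfetchstats kstat provides them on builds that export it (Solaris 10 may not, which is consistent with benr's script not working there):

kstat -m zfs -n zfetchstats > /tmp/zfetch.before
# ... run one cpio pass over the file set ...
kstat -m zfs -n zfetchstats > /tmp/zfetch.after
diff /tmp/zfetch.before /tmp/zfetch.after

If the hypothesis is right, the counters should advance strongly during the first pass and barely move during the second.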
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
This system has 32 GB of RAM so I will probbaly need to increase the data set size. [r...@x tmp]#> ./zfs-cache-test.ksh nbupool System Configuration: Sun Microsystems sun4v SPARC Enterprise T5220 System architecture: sparc System release level: 5.10 Generic_141414-02 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: nbupool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nbupool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 c2t9d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 c2t13d0 ONLINE 0 0 0 c2t14d0 ONLINE 0 0 0 c2t15d0 ONLINE 0 0 0 c2t16d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t17d0 ONLINE 0 0 0 c2t18d0 ONLINE 0 0 0 c2t19d0 ONLINE 0 0 0 c2t20d0 ONLINE 0 0 0 c2t21d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t22d0 ONLINE 0 0 0 c2t23d0 ONLINE 0 0 0 c2t24d0 ONLINE 0 0 0 c2t25d0 ONLINE 0 0 0 c2t26d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t27d0 ONLINE 0 0 0 c2t28d0 ONLINE 0 0 0 c2t29d0 ONLINE 0 0 0 c2t30d0 ONLINE 0 0 0 c2t31d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t32d0 ONLINE 0 0 0 c2t33d0 ONLINE 0 0 0 c2t34d0 ONLINE 0 0 0 c2t35d0 ONLINE 0 0 0 c2t36d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t37d0 ONLINE 0 0 0 c2t38d0 ONLINE 0 0 0 c2t39d0 ONLINE 0 0 0 c2t40d0 ONLINE 0 0 0 c2t41d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t42d0 ONLINE 0 0 0 c2t43d0 ONLINE 0 0 0 c2t44d0 ONLINE 0 0 0 c2t45d0 ONLINE 0 0 0 c2t46d0 ONLINE 0 0 0 spares c2t47d0AVAIL c2t48d0AVAIL c2t49d0AVAIL errors: No known data errors zfs create nbupool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /nbupool/zfscachetest ... Done! zfs unmount nbupool/zfscachetest zfs mount nbupool/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m37.24s user0m9.87s sys 1m54.08s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real1m59.11s user0m9.93s sys 1m49.15s Feel free to clean up with 'zfs destroy nbupool/zfscachetest'. Scott Lawson wrote: Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [r...@xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [r...@xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [r...@xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [r...@xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) 'cpio -o
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: On Wed, 15 Jul 2009, Scott Lawson wrote: NAME STATE READ WRITE CKSUM test1 ONLINE 0 0 0 mirror ONLINE 0 0 0 c3t600A0B8000562264039B4A257E11d0 ONLINE 0 0 0 c3t600A0B8000336DE204394A258B93d0 ONLINE 0 0 0 Each of these LUNS is a pair of 146GB 15K drives in a RAID1 on Crystal firmware on a 6140. Each LUN is 2km apart in different data centres. 1 LUN where the server is, 1 remote. Interestingly, by creating the mirror vdev the first run got faster, and the second much much slower. The second cpio took an extra 2 minutes by virtue of it being a mirror. I ran the script once again prior to adding the mirror and the results were pretty much the same as the first run posted. (plus or minus a couple of seconds, which is to be expected as these LUNS are on prod arrays feeding other servers as well) I will try these tests on some of my J4500's when I get a chance shortly. My interest is now piqued. Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m25.13s user0m2.67s sys 0m28.40s It is quite impressive that your little two disk mirror reads as fast as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible! Does this have something to do with your well-managed power and cooling? :-) Maybe it is Bob, maybe it is. ;) haha. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote: You have some mighty pools there. Something I find quite interesting is that those who have "mighty pools" generally obtain about the same data rate regardless of their relative degree of excessive "might". This causes me to believe that the Solaris kernel is throttling the read rate so that throwing more and faster hardware at the problem does not help. Are you saying the X4500s we have are set up incorrectly, or done in a way which will make them run poorly? No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) currently have a software-imposed read bottleneck which places a limit on how well systems will perform on this simple sequential read benchmark. After a certain point (which is unfortunately not very high), throwing more hardware at the problem does not result in any speed improvement. This is demonstrated by Scott Lawson's little two disk mirror almost producing the same performance as our much more exotic setups. Evidence suggests that SPARC systems are doing better than x86. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote: Hi Bob, My guess is something like it's single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. A quick summary of some of the figures, with times normalized for 3000 files: Sun x2200, single 500GB sata: 6m25.15s Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s This new one from Scott Lawson is incredible (but technically quite possible): SPARC Enterprise M3000, single SAS mirror pair: 3m25.13s Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
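The normalization is simple proportional scaling; for a run with a different file count the arithmetic looks like this, using the 9000-file X4540 result posted earlier as the input:

awk 'BEGIN { files=9000; secs=9*60+15.87; printf("%.1f seconds per 3000 files\n", secs*3000/files) }'

which prints about 185 seconds, i.e. roughly 3m05s for the first pass.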
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Scott Lawson wrote: NAME STATE READ WRITE CKSUM test1 ONLINE 0 0 0 mirror ONLINE 0 0 0 c3t600A0B8000562264039B4A257E11d0 ONLINE 0 0 0 c3t600A0B8000336DE204394A258B93d0 ONLINE 0 0 0 Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m25.13s user0m2.67s sys 0m28.40s It is quite impressive that your little two disk mirror reads as fast as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible! Does this have something to do with your well-managed power and cooling? :-) Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You have some mighty pools there. Something I find quite interesting is that those who have "mighty pools" generally obtain about the same data rate regardless of their relative degree of excessive "might". This causes me to believe that the Solaris kernel is throttling the read rate so that throwing more and faster hardware at the problem does not help. Are you saying the X4500s we have are set up incorrectly, or done in a way which will make them run poorly? The servers came with no documentation nor advice. I have yet to find a good place that suggests configurations for dedicated x4500 NFS servers. We had to find out about NFSD_SERVERS when the first trouble came in. (Followed by 5 other tweaks and limits-reached troubles). If Sun really wants to compete with NetApp, you'd think they would ship us hardware configured for NFS servers, not x4500s configured for desktops :( They are cheap though! Nothing like being the Wal-Mart of Storage! That is how the pools were created as well. Admittedly it may be down to our Vendor again. Lund -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
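For reference, the NFS thread limit Jorgen mentions lives in /etc/default/nfs on Solaris 10; a sketch of checking and applying it (1024 is simply the value quoted in this thread):

grep '^NFSD_SERVERS' /etc/default/nfs
# edit the file so it reads NFSD_SERVERS=1024, then restart the NFS server:
svcadm restart svc:/network/nfs/server

This only affects the NFS side, of course; it has no bearing on the local cpio results being discussed.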
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote: Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m1.58s user0m1.92s sys 0m56.67s Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m5.51s user0m1.70s sys 0m29.53s You have some mighty pools there. Something I find quite interesting is that those who have "mighty pools" generally obtain about the same data rate regardless of their relative degree of excessive "might". This causes me to believe that the Solaris kernel is throttling the read rate so that throwing more and faster hardware at the problem does not help. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I added a second Lun identical in size as a mirror and reran test. Results are more in line with yours now. ./zfs-cache-test.ksh test1 System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System architecture: sparc System release level: 5.10 Generic_139555-08 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: test1 state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Wed Jul 15 11:38:54 2009 config: NAME STATE READ WRITE CKSUM test1 ONLINE 0 0 0 mirror ONLINE 0 0 0 c3t600A0B8000562264039B4A257E11d0 ONLINE 0 0 0 c3t600A0B8000336DE204394A258B93d0 ONLINE 0 0 0 errors: No known data errors zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m25.13s user0m2.67s sys 0m28.40s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real8m53.05s user0m2.69s sys 0m32.83s Feel free to clean up with 'zfs destroy test1/zfscachetest'. Scott Lawson wrote: Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [r...@xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [r...@xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [r...@xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [r...@xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m48.94s user0m21.58s sys 0m44.91s Doing second 'cpio -o > /dev/null' 48000247 blocks real6m39.87s user0m21.62s sys 0m46.20s Feel free to clean up with 'zfs destroy test1/zfscachetest'. Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60M/'s sustained on the 2nd. /Scott. Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. 
The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real2m54.17s user0m7.65s sys 0m36.59s Doing second 'cpio -o > /dev/null' 48000247 blocks real11m54.65s user0m7.70s sys 0m35.06s Feel free to
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
3 servers contained within. Both x4500 and x4540 are setup the way Sun shipped to us. With minor changes (nfsservers=1024 etc). I was a little disappointed that they were identical in speed on round one, but the x4540 looked better part 2. Which I suspect is probably just OS version? x4500 Sol 10 100% idle, but with 3.86T existing data. 16GB memory, 4 core. x4500-03:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4500 System architecture: i386 System release level: 5.10 on10-public-x:s10idr_ldi:03/27/2009 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t0d0 ONLINE 0 0 0 c1t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c7t0d0 ONLINE 0 0 0 c8t0d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t1d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c8t1d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t2d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c8t4d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t6d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c8t6d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m1.58s user0m1.92s sys 0m56.67s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real7m7.76s user0m1.77s sys 1m6.82s Feel free to clean up with 'zfs destroy zpool1/zfscachetest'. x4540 Sol svn 117, 100% idle, completely empty, 32GB memory, 8 core. x4500-07:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4540 System architecture: i386 System release level: 5.11 snv_117 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c4t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c2t1d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 c4t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi! Do you think that this issue will be seen on ZVOLs that are exported as iSCSI targets? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
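For context, the setup being asked about is typically a zvol shared over iSCSI; a minimal sketch using the legacy shareiscsi path of that era (pool and volume names are illustrative, and COMSTAR would be configured differently):

zfs create -V 100G tank/iscsivol
zfs set shareiscsi=on tank/iscsivol
iscsitadm list target

Since zvol reads go through the same ARC, it is a fair question whether the cached-read slowdown carries over; nothing posted in this thread settles it.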
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Richard Elling wrote: That is because file prefetch is dynamic. benr wrote a good blog on the subject and includes a DTrace script to monitor DMU prefetches. http://www.cuddletech.com/blog/pivot/entry.php?id=1040 Apparently not dynamic enough. The provided DTrace script has a syntax error when used for Solaris 10 U7. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: On Tue, 14 Jul 2009, Ross wrote: My guess is something like it's single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. Keep in mind that there is supposed to be file level read-ahead. As an example, ZFS is able to read from my array at up to 551 MB/second when reading from a huge (64GB) file yet it is only managing 145MB/second or so for these 8MB files sequentially accessed by cpio. This suggests that even for the initial read case that zfs is not applying enough file level read-ahead (or applying it soon enough) to keep the disks busy. 8MB is still pretty big in the world of files. Perhaps it takes zfs a long time to decide that read-ahead is required. I have yet to find a tunable for file level read-ahead. There are tunables for vdev-level read-ahead but vdev read-ahead pretty minor read-ahead and increasing it may cause more harm than help. That is because file prefetch is dynamic. benr wrote a good blog on the subject and includes a DTrace script to monitor DMU prefetches. http://www.cuddletech.com/blog/pivot/entry.php?id=1040 -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote: > On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications and >> ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running >> svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand the > situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss And mine: d...@pax:1512 $ pfexec ./zfs-cache-test.ksh tank System Configuration: MICRO-STAR INTERNATIONAL CO.,LTD MS-7365 System architecture: i386 System release level: 5.11 snv_101b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: tank state: ONLINE scrub: scrub completed after 3h30m with 0 errors on Tue Jul 7 19:38:45 2009 config: NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 raidz1ONLINE 0 0 0 c4d0ONLINE 0 0 0 c5d0ONLINE 0 0 0 c7d0ONLINE 0 0 0 errors: No known data errors zfs create tank/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /tank/zfscachetest ... Done! zfs unmount tank/zfscachetest zfs mount tank/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real8m19.62s user0m2.07s sys 0m30.18s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real5m4.59s user0m1.86s sys 0m34.06s Feel free to clean up with 'zfs destroy tank/zfscachetest'. -- Regards, Dóri ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Le 14 juil. 09 à 18:09, Bob Friesenhahn a écrit : On Tue, 14 Jul 2009, Jorgen Lundman wrote: I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? If your system has quite a lot of memory, the number of files should be increased to at least match the amount of memory. We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? It is useful to test as much as possible in order to fully understand the situation. Since results often get posted without system details, the script is updated to dump some system info and the pool configuration. Refresh from http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Here is the result on another host with faster drives (sas 1 rpm) and solaris 10u7. System Configuration: Sun Microsystems SUN FIRE X4150 System architecture: i386 System release level: 5.10 Generic_139556-08 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool : rpool état : ONLINE purger : aucun requis configuration : NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t0d0s0 ONLINE 0 0 0 c1t1d0s0 ONLINE 0 0 0 erreurs : aucune erreur de données connue zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocs real4m56.84s user0m1.72s sys 0m28.48s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocs real13m48.19s user0m2.07s sys 0m44.45s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. -- Gaëtan Lehmann Biologie du Développement et de la Reproduction INRA de Jouy-en-Josas (France) tel: +33 1 34 65 29 66fax: 01 34 65 29 09 http://voxel.jouy.inra.fr http://www.itk.org http://www.mandriva.org http://www.bepo.fr PGP.sig Description: Ceci est une signature électronique PGP ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote: My guess is something like it's single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. Keep in mind that there is supposed to be file level read-ahead. As an example, ZFS is able to read from my array at up to 551 MB/second when reading from a huge (64GB) file yet it is only managing 145MB/second or so for these 8MB files sequentially accessed by cpio. This suggests that even for the initial read case, zfs is not applying enough file level read-ahead (or applying it soon enough) to keep the disks busy. 8MB is still pretty big in the world of files. Perhaps it takes zfs a long time to decide that read-ahead is required. I have yet to find a tunable for file level read-ahead. There are tunables for vdev-level read-ahead, but vdev read-ahead is fairly minor and increasing it may cause more harm than help. A quick summary of some of the figures, with times normalized for 3000 files: Sun x2200, single 500GB sata: 6m25.15s Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s And mine: Ultra 40-M2 / StorageTek 2540, 6 sets of mirrored 300GB SAS: 2m44.20s I think that Jorgen implied that his system is using SAN storage with a mirror across two jumbo LUNs. The raid pool of SAS drives is quicker again, but for a single threaded request that also seems about right. The random read benefits of the mirror aren't going to take effect unless you run multiple reads in parallel. What I suspect is helping here are the slightly better seek times of the SAS drives, along with slightly higher throughput due to the raid. Once ZFS decides to apply file level read-ahead then it can issue many reads in parallel. It should be able to keep at least six disks busy at once, leading to much better performance than we are seeing. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
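A rough way to reproduce Bob's large-file versus small-file comparison on another pool (paths are illustrative; unmount and remount the filesystem first so the ARC is cold for both measurements):

ptime dd if=/tank/bigfile of=/dev/null bs=131072
ptime sh -c 'cd /tank/zfscachetest && find . -type f | cpio -C 131072 -o > /dev/null'

Dividing the bytes read by the elapsed time from each ptime gives the two streaming rates; a large gap points at file-level read-ahead rather than the disks.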
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Just FYI. I ran a slightly different version of the test. I used SSD (for log & cache)! 3 x 32GB SSDs. 2 mirrored for log and one for cache. The systems is a 4150 with 12 GB of RAM. Here are the results $ pfexec ./zfs-cache-test.ksh sdpool System Configuration: System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: sdpool state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Fri Jul 10 11:33:01 2009 config: NAMESTATE READ WRITE CKSUM sdpool ONLINE 0 0 0 mirrorONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 logsONLINE 0 0 0 mirrorONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 cache c8t4d0ONLINE 0 0 0 errors: No known data errors zfs unmount sdpool/zfscachetest zfs mount sdpool/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real3m27.06s user0m2.05s sys 0m30.14s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real2m47.32s user0m2.09s sys 0m32.32s Feel free to clean up with 'zfs destroy sdpool/zfscachetest'. -Angelo On Jul 14, 2009, at 12:09 PM, Bob Friesenhahn wrote: On Tue, 14 Jul 2009, Jorgen Lundman wrote: I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? If your system has quite a lot of memory, the number of files should be increased to at least match the amount of memory. We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? It is useful to test as much as possible in order to fully understand the situation. Since results often get posted without system details, the script is updated to dump some system info and the pool configuration. Refresh from http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
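For anyone wanting to repeat this variant, a pool like the one above is assembled with ordinary zpool commands; the device names below simply mirror the status output. Note that the mirrored log device only comes into play for synchronous writes, so for this read-only test the cache (L2ARC) device is the part most likely to matter.

zpool create sdpool mirror c7t1d0 c7t3d0
zpool add sdpool log mirror c7t2d0 c8t5d0
zpool add sdpool cache c8t4d0
zpool status sdpool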
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi Bob, My guess is something like it's single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. A quick summary of some of the figures, with times normalized for 3000 files: Sun x2200, single 500GB sata: 6m25.15s Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s Taking my single SATA drive as a base, a pool of mirrored SATA is almost exactly twice as quick which makes sense if ZFS is reading the file off both drives at once. The raid pool of SAS drives is quicker again, but for a single threaded request that also seems about right. The random read benefits of the mirror aren't going to take effect unless you run multiple reads in parallel. What I suspect is helping here are the slightly better seek times of the SAS drives, along with slightly higher throughput due to the raid. What might be interesting would be to see the results off a ramdisk or SSD drive. Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote: > On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications >> and ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 >> running svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand > the situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh > from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Whitebox Quad-core Phenom, 8G RAM, RAID-Z (3x1TB + 3x1.5TB) SATA drives via an AOC-USAS-L8i: System Configuration: Gigabyte Technology Co., Ltd. GA-MA770-DS3 System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: pool state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM poolONLINE 0 0 0 raidz1ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c3t6d0 ONLINE 0 0 0 c3t4d0 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c3t1d0 ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 errors: No known data errors zfs create pool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /pool/zfscachetest ... Done! zfs unmount pool/zfscachetest zfs mount pool/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 48000256 blocks real4m59.33s user0m21.83s sys 2m56.05s Doing second 'cpio -C 131072 -o > /dev/null' 48000256 blocks real8m28.11s user0m22.66s sys 3m13.26s Feel free to clean up with 'zfs destroy pool/zfscachetest'. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Jorgen Lundman wrote: I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? If your system has quite a lot of memory, the number of files should be increased to at least match the amount of memory. We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? It is useful to test as much as possible in order to fully understand the situation. Since results often get posted without system details, the script is updated to dump some system info and the pool configuration. Refresh from http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
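A small helper for picking the file count Bob describes, assuming 8 MB files and a data set of roughly twice the installed memory (prtconf reports memory in megabytes):

MEM_MB=$(prtconf 2>/dev/null | awk '/^Memory size/ {print $3}')
echo "suggested file count: $(( MEM_MB * 2 / 8 ))"

For the 32 GB machines in this thread that suggests 8192 files, close to the 9000 used in the X4540 run earlier.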
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ross, Please refresh your test script from the source. The current script tells cpio to use 128k blocks and mentions the proper command in its progress message. I have now updated it to display useful information about the system being tested, and to dump the pool configuration. It is really interesting seeing the various posted numbers. This is as close as it comes to a common benchmark. A sort of sanity check. What is most interesting to me is the reported performance for those who paid for really fast storage hardware and are using what should be really fast storage configurations. The reason why it is interesting is that there seems to be a hardware-independent cap on maximum read performance. It seems that ZFS's read algorithm is rate-limiting the read so that regardless of how nice the hardware is, there is a peak read limit. There can be no other explanation as to why an ideal configuration of "Thumper II" SAS type hardware is neck and neck with my own setup, and quite similar to another fast system as well. My own setup is delivering less than 1/2 the performance that I would expect for the initial read (iozone says it can read 540MB/second from a huge file). Do the math and see if you think that zfs is giving you the read performance you expect based on your hardware. I think that we are encountering several bugs here. We also have a general read bottleneck. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
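The "do the math" step is straightforward: cpio reports 512-byte blocks, so the block count and the real time give the aggregate read rate directly. For example, one of the X4500 first passes quoted earlier (48000256 blocks in 3m1.58s) works out to about 129 MB/s:

awk 'BEGIN { blocks=48000256; secs=3*60+1.58; printf("%.0f MB/s\n", blocks*512/1048576/secs) }'

Compare that figure with what the pool's disks should deliver in aggregate to see the gap Bob is describing.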
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
For what it's worth, I just repeated that test. The timings are suspiciously similar. This is very definitely a reproducible bug: zfs unmount rc-pool/zfscachetest zfs mount rc-pool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m45.69s user0m10.22s sys 0m53.29s Doing second 'cpio -o > /dev/null' 48000247 blocks real15m47.48s user0m10.58s sys 1m10.96s -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I also ran this on my future RAID/NAS. Intel Atom 330 (D945GCLF2) dual core 1.6ghz, on a single HDD pool. svn_114, 64 bit, 2GB RAM. bash-3.23 ./zfs-cache-test.ksh zboot zfs create zboot/zfscachetest creating data file set (3000 files of 8192000 bytes) under /zboot/zfscachetest ... done1 zfs unmount zboot/zfscachetest zfs mount zboot/zfscachetest doing initial (unmount/mount) 'cpio -c 131072 -o . /dev/null' 48000256 blocks real7m45.96s user0m6.55s sys 1m20.85s doing second 'cpio -c 131072 -o . /dev/null' 48000256 blocks real7m50.35s user0m6.76s sys 1m32.91s feel free to clean up with 'zfs destroy zboot/zfscachetest'. Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real2m54.17s user0m7.65s sys 0m36.59s Doing second 'cpio -o > /dev/null' 48000247 blocks real11m54.65s user0m7.70s sys 0m35.06s Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'. And here is a similar run on my Blade 2500 using the default rpool: # ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real13m3.91s user2m43.04s sys 9m28.73s Doing second 'cpio -o > /dev/null' 48000247 blocks real23m50.27s user2m41.81s sys 9m46.76s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. I am interested to hear about systems which do not suffer from this bug. 
Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
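For readers who cannot fetch the script, its structure is simple enough to sketch; the outline below is illustrative only, not Bob's actual script (pool name, file count and file size are parameters, and it must be run as root):

#!/bin/ksh
# Sketch: build a data set larger than RAM, then read it twice with cpio.
POOL=${1:-rpool}
COUNT=3000                      # raise this so COUNT * 8 MB comfortably exceeds RAM
zfs create $POOL/zfscachetest
cd /$POOL/zfscachetest || exit 1
i=0
while [ $i -lt $COUNT ]; do
  dd if=/dev/zero of=file.$i bs=8192000 count=1 2>/dev/null
  i=$((i+1))
done
cd /
zfs unmount $POOL/zfscachetest
zfs mount $POOL/zfscachetest
cd /$POOL/zfscachetest
time find . -type f | cpio -C 131072 -o > /dev/null     # cold pass, after unmount/mount
time find . -type f | cpio -C 131072 -o > /dev/null     # second pass, metadata now cached
# clean up afterwards with: zfs destroy $POOL/zfscachetest

With compression left at its default (off), zero-filled files still occupy real blocks, so the reads are genuine.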
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 08:54:36AM +0200, Ross wrote: > Ok, build 117 does seem a lot better. The second run is slower, > but not by such a huge margin. Hm, I can't support this: SunOS fred 5.11 snv_117 sun4u sparc SUNW,Sun-Fire-V440 The system has 16GB of Ram, pool is mirrored over two FUJITSU-MBA3147NC. >-1007: sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (4000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'tar to /dev/null' real5m12.61s user0m0.30s sys 1m28.36s Doing second 'tar to /dev/null' real11m13.93s user0m0.22s sys 1m37.41s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. user=2.32 sec, sys=343.41 sec, elapsed=23:39.41 min, cpu use=24.3% And here's what arcstat.pl has to say when starting the second read: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 11:53:26 11K 895 7410 854 10013 10013G 13G 11:53:27 12K 832 6390 793 10013 10013G 13G 11:53:28 11K 832 7390 793 10013 10013G 13G 11:53:29 11K 832 7390 793 10013 7613G 13G 11:53:30 12K 896 7420 854 10014 10013G 13G 11:53:31 11K 832 7390 793 10013 10013G 13G 11:53:32 11K 768 6360 732 10012 10013G 13G 11:53:33 11K 832 7390 793 10013 10013G 13G 11:53:347K 497 7 2533 244 99 4 1113G 13G 11:53:355K 385 7 3857 00 0013G 13G 11:53:365K 374 7 3747 00 0013G 13G 11:53:375K 368 7 3687 00 0013G 13G 11:53:384K 340 7 3407 00 0013G 13G 11:53:395K 383 7 3837 00 0013G 13G 11:53:405K 406 7 4067 00 0013G 13G 11:53:414K 360 7 3607 00 0013G 13G 11:53:424K 328 7 3287 00 0013G 13G 11:53:434K 346 7 3467 00 0013G 13G 11:53:444K 346 7 3467 00 0013G 13G 11:53:454K 319 7 3197 00 0013G 13G 11:53:474K 337 7 3377 00 0013G 13G I used tar in this run instead of cpio, just to give it a try... [time (find . -type f | xargs -i tar cf /dev/null {} )] Another run with Bob's new script: (rpool/zfscachetest not destroyed before this run, so wall clock time below is lower) >-1008: sudo ksh zfs-cache-test.ksh.1 zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 64000512 blocks real4m40.25s user0m7.96s sys 1m28.62s Doing second 'cpio -C 131072 -o > /dev/null' 64000512 blocks real11m0.08s user0m7.37s sys 1m38.58s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. user=15.35 sec, sys=187.87 sec, elapsed=15:43.65 min, cpu use=21.5% Not much difference to the "tar"-run... Kurt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ah yes, my apologies! I haven't quite worked out why the OS X VNC server can't handle keyboard mappings. I have to copy/paste "@" even. As I pasted the output into my mail over VNC, it would have destroyed the (not very) "unusual" characters. Ross wrote: Aaah, nevermind, it looks like there's just a rogue 9 appeared in your output. It was just a standard run of 3,000 files. -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Aaah, nevermind, it looks like there's just a rogue 9 appeared in your output. It was just a standard run of 3,000 files. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? Lund Ross wrote: Jorgen, Am I right in thinking the numbers here don't quite work. 48M blocks is just 9,000 files isn't it, not 93,000? I'm asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I'm wondering if your test has done a similar thing. Ross x4540 running svn117 # ./zfs-cache-test.ksh zpool1 zfs create zpool1/zfscachetest creating data file set 93000 files of 8192000 bytes0 under /zpool1/zfscachetest ... done1 zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest doing initial (unmount/mount) 'cpio -o . /dev/null' 48000247 blocks real4m7.13s user0m9.27s sys 0m49.09s doing second 'cpio -o . /dev/null' 48000247 blocks real4m52.52s user0m9.13s sys 0m47.51s ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discu ss -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Jorgen, Am I right in thinking the numbers here don't quite work. 48M blocks is just 9,000 files isn't it, not 93,000? I'm asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I'm wondering if your test has done a similar thing. Ross > > x4540 running svn117 > > # ./zfs-cache-test.ksh zpool1 > zfs create zpool1/zfscachetest > creating data file set 93000 files of 8192000 bytes0 > under > /zpool1/zfscachetest ... > done1 > zfs unmount zpool1/zfscachetest > zfs mount zpool1/zfscachetest > > doing initial (unmount/mount) 'cpio -o . /dev/null' > 48000247 blocks > > real4m7.13s > user0m9.27s > sys 0m49.09s > > doing second 'cpio -o . /dev/null' > 48000247 blocks > > real4m52.52s > user0m9.13s > sys 0m47.51s > > > > > > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ok, build 117 does seem a lot better. The second run is slower, but not by such a huge margin. This was the end of the 98GB test: Creating data file set (12000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 192000985 blocks real26m17.80s user0m47.55s sys 3m56.94s Doing second 'cpio -o > /dev/null' 192000985 blocks real27m14.35s user0m46.84s sys 4m39.85s -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob: Sun v490, 4 x 1.35 GHz processors, 32GB RAM, Solaris 10u7 working with a raidz1 zpool made up of 6x146GB SAS drives on a J4200. Results of running your script: # zfs-cache-test.ksh pool2 zfs create pool2/zfscachetest Creating data file set (6000 files of 8192000 bytes) under /pool2/zfscachetest ... Done! zfs unmount pool2/zfscachetest zfs mount pool2/zfscachetest Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null' 96000512 blocks real5m32.58s user0m12.75s sys 2m56.58s Doing second 'cpio -C 131072 -o > /dev/null' 96000512 blocks real17m26.68s user0m12.97s sys 4m34.33s Feel free to clean up with 'zfs destroy pool2/zfscachetest'. # Same results as you are seeing. Thanks Randy -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote: If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header. Note that the output of cpio is sent to /dev/null in this test so it is only the reading part which is significant as long as cpio's CPU use is low. Sun Service won't have a clue about 'star' since it is not part of Solaris 10. It is best to stick with what they know so the problem report won't be rejected. If star is truly more efficient than cpio, it may make the difference even more obvious. What did you discover when you modified my test script to use 'star' instead? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mark Shellenbaum wrote: I've opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. I think that it has always been there as long as I have been using ZFS (1-3/4 years). Sometimes it takes a while for me to wake up and smell the coffee. Meanwhile I have opened a formal service request (IBIS 71326296) with Sun Support. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 4:41 PM, Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Jim Mauro wrote: > >> Bob - Have you filed a bug on this issue? I am not up to speed on this >> thread, so I can not comment on whether or not there is a bug here, but you >> seem to have a test case and supporting data. Filing a bug will get the >> attention of ZFS engineering. > > No, I have not filed a bug report yet. Any problem report to Sun's Service > department seems to require at least one day's time. > > I was curious to see if recent OpenSolaris suffers from the same problem, > but posted results (thus far) are not as conclusive as they are for Solaris > 10. It doesn't seem to be quite as bad as S10, but there is certainly a hit. # /var/tmp/zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (400 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 6400033 blocks real1m26.16s user0m12.83s sys 0m25.88s Doing second 'cpio -o > /dev/null' 6400033 blocks real2m44.46s user0m12.59s sys 0m24.34s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. # cat /etc/release OpenSolaris 2009.06 snv_111b SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 07 May 2009 # uname -srvp SunOS 5.11 snv_111b sparc -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Jim Mauro wrote: Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. No, I have not filed a bug report yet. Any problem report to Sun's Service department seems to require at least one day's time. I was curious to see if recent OpenSolaris suffers from the same problem, but posted results (thus far) are not as conclusive as they are for Solaris 10. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > Using cpio's -C option seems to not change the behavior for this bug, > > but I did see a performance difference with the case where I hadn't > > modified the zfs caching behavior. That is, the performance of the > > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > > sys. > > Interesting. I just updated zfs-cache-test.ksh on my web site so that > it uses 131072 byte blocks. I see a tiny improvement in performance > from doing this, but I do see a bit less CPU consumption so the CPU > consumption is essentially zero. The bug remains. It seems best to > use ZFS's ideal block size so that issues don't get confused. If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mike Gerdts wrote: > Using cpio's -C option seems to not change the behavior for this bug, > but I did see a performance difference with the case where I hadn't > modified the zfs caching behavior. That is, the performance of the > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >/dev/null". At this point cpio was spending roughly 13% usr and 87% > sys. As mentioned before, a lot of the user CPU time from cpio is spent creating cpio archive headers or is caused by the fact that cpio archives copy the file content to unaligned archive locations while the "tar" archive format starts each new file on a modulo 512 offset in the archive. This requires a lot of unneeded copying of file data. You can of course slightly modify parameters even with cpio. I am not sure what you mean by "13% usr and 87%" as star typically spends 6% of the wall clock time in user+sys CPU where the user CPU time is typically only 1.5% of the system CPU time. In the "cached" case, it is obviously ZFS that's responsible for the slowdown, regardless of what cpio did in the other case. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Joerg Schilling wrote: > > > > cpio reads/writes in 8192 byte chunks from the filesystem. > > Yes, I was just reading the cpio manual page and see that. I think > that re-reading the 128K zfs block 16 times to satisfy each request > for 8192 bytes explains the 16X performance loss when caching is > disabled. I don't think that this is strictly a bug since it is what > the database folks are looking for. cpio spends 1.6x more SYStem CPU time than star. This may mainly result from the fact that cpio (when using the cpio archive format) reads/writes 512 byte blocks from/to the archive file. cpio by default spends 19x more USER CPU time than star. This seems to be a result of the inappropriate header structure of the cpio archive format and reblocking, and cannot be easily changed (well, you could use "scpio" - or in other words the "cpio" CLI personality of star, but this reduces the USER CPU time only by 10%-50% compared to Sun cpio). cpio is a program from the past that does not fit well in our current world. The internal limits cannot be lifted without creating a new incompatible archive format. In other words: if you use cpio for your work, you have to live with its problems ;-) If you like to play with different parameter values (e.g. read sizes), cpio is unsuitable for tests. Star allows you to set big filesystem read sizes by using the FIFO and playing with the fifo size, and small filesystem read sizes by switching off the FIFO and playing with the archive block size. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
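As a rough illustration of the knobs Jörg describes (the option spellings follow star's fs=, bs= and -no-fifo syntax; the pool path, archive target and sizes here are only placeholders, not values anyone in this thread actually ran):

  # Large FIFO: star issues large reads from the filesystem.
  star -c fs=256m f=/dev/null /pool/zfscachetest
  # FIFO switched off: filesystem reads are governed by the archive block size.
  star -c -no-fifo bs=128k f=/dev/null /pool/zfscachetest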
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest I've opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Ross Walker wrote: Have you tried limiting the ARC so it doesn't squash the page cache? Yes, the ARC is limited to 10GB, leaving another 10GB for the OS and applications. Resource limits are not the problem. There is a ton of memory and CPU to go around. Current /etc/system tunables: set maxphys = 0x2 set zfs:zfs_arc_max = 0x28000 set zfs:zfs_write_limit_override = 0xea60 set zfs:zfs_vdev_max_pending = 5 Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe. In this testing mmap is not being used (cpio does not use mmap) so the page cache is not an issue. It does become an issue for 'cp -r' though where we see the I/O be substantially (and essentially permanently) reduced even more for impacted files until the filesystem is unmounted. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
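For anyone wanting to double-check that settings like the ones above actually took effect, the live ARC counters can be read from the arcstats kstat (values are in bytes; statistic names as exposed by the standard arcstats kstat):

  kstat -p zfs:0:arcstats:size    # current ARC size
  kstat -p zfs:0:arcstats:c       # current adaptive target size
  kstat -p zfs:0:arcstats:c_max   # hard cap, should match zfs_arc_max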
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote: Using cpio's -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn't modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) /dev/null". At this point cpio was spending roughly 13% usr and 87% sys. Interesting. I just updated zfs-cache-test.ksh on my web site so that it uses 131072 byte blocks. I see a tiny improvement in performance from doing this, but I do see a bit less CPU consumption so the CPU consumption is essentially zero. The bug remains. It seems best to use ZFS's ideal block size so that issues don't get confused. Using an ARC monitoring script called 'arcstat.pl' I see a huge number of 'dmis' events when performance is poor. The ARC size is 7GB, which is less than its prescribed cap of 10GB. Better: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:39:37 20K1K 65801K 10019 100 7G 10G 15:39:38 19K1K 55701K 10019 100 7G 10G 15:39:39 19K1K 65401K 10018 100 7G 10G 15:39:40 17K1K 65101K 10017 100 7G 10G Worse: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:43:244K 280 6 2806 00 4 100 9G 10G 15:43:254K 277 6 2776 00 4 100 9G 10G 15:43:264K 268 6 2686 00 5 100 9G 10G 15:43:274K 259 6 2596 00 4 100 9G 10G An ARC stats summary from a tool called 'arc_summary.pl' is appended to this message. Operation is quite consistent across the full span of files. Since 'dmis' is still low when things are "good" (and even when the ARC has surely cycled already) this leads me to believe that prefetch is mostly working and is usually satisfying read requests. When things go bad I see that 'dmiss' becomes 100% of the misses. A hypothesis is that if zfs thinks that the data might be in the ARC (due to having seen the file before) that it disables file prefetch entirely, assuming that it can retrieve the data from its cache. Then once it finally determines that there is no cached data after all, it issues a read request. Even the "better" read performance is 1/2 of what I would expect from my hardware and based on prior test results from 'iozone'. More prefetch would surely help. 
Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ System Memory: Physical RAM: 20470 MB Free Memory : 2511 MB LotsFree: 312 MB ZFS Tunables (/etc/system): * set zfs:zfs_arc_max = 0x3 set zfs:zfs_arc_max = 0x28000 * set zfs:zfs_arc_max = 0x2 set zfs:zfs_write_limit_override = 0xea60 * set zfs:zfs_write_limit_override = 0xa000 set zfs:zfs_vdev_max_pending = 5 ARC Size: Current Size: 8735 MB (arcsize) Target Size (Adaptive): 10240 MB (c) Min Size (Hard Limit):1280 MB (zfs_arc_min) Max Size (Hard Limit):10240 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 95%9791 MB (p) Most Frequently Used Cache Size: 4%448 MB (c-p) ARC Efficency: Cache Access Total: 827767314 Cache Hit Ratio: 96% 800123657 [Defined State for buffer] Cache Miss Ratio: 3% 27643657 [Undefined State for Buffer] REAL Hit Ratio: 89% 743665046 [MRU/MFU Hits Only] Data Demand Efficiency:99% Data Prefetch Efficiency:61% CACHE HITS BY CACHE LIST: Anon:5%47497010 [ New Customer, First Cache Hit ] Most Recently Used: 33%271365449 (mru)[ Return Customer ] Most Frequently Used: 59%472299597 (mfu)[ Frequent Customer ] Most Recently Used Ghost:0%1700764 (mru_ghost)[ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 0%7260837 (mfu_ghost)[ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:73%589582518 Prefetch Data: 2%20424879 Demand Metadata:17%139111510 Prefetch Metadata: 6%51004750 CACHE MISSES BY DATA TYPE: Demand Data:21%5814459 Prefetch Data: 46%12788265 Demand Metadata:27%7700169 Prefetch Metada
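The numbers above come from the same arcstats counters that arcstat.pl and arc_summary.pl read, so they can also be pulled directly. A small sketch for computing the demand and prefetch data hit ratios, assuming the standard arcstats statistic names:

  kstat -p zfs:0:arcstats | awk '
      /demand_data_hits/     { dh = $2 }
      /demand_data_misses/   { dm = $2 }
      /prefetch_data_hits/   { ph = $2 }
      /prefetch_data_misses/ { pm = $2 }
      END {
          printf "demand data hit%%:   %.1f\n", 100 * dh / (dh + dm)
          printf "prefetch data hit%%: %.1f\n", 100 * ph / (ph + pm)
      }'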
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 13, 2009, at 2:54 PM, Bob Friesenhahn > wrote: On Mon, 13 Jul 2009, Brad Diggs wrote: You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear if the same bug is shared by current OpenSolaris since it seems like it has not been tested. Solaris 10 U7 reads files that it has not seen before at a constant rate regardless of the amount of file data it has already read. When the file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, then that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC. It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup because of ZFS. This is making me madder and madder by the minute. Have you tried limiting the ARC so it doesn't squash the page cache? Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:23 PM, Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Joerg Schilling wrote: >> >> cpio reads/writes in 8192 byte chunks from the filesystem. > > Yes, I was just reading the cpio manual page and see that. I think that > re-reading the 128K zfs block 16 times to satisfy each request for 8192 > bytes explains the 16X performance loss when caching is disabled. I don't > think that this is strictly a bug since it is what the database folks are > looking for. > > Bob I did other tests with "dd bs=128k" and verified via truss that each read(2) was returning 128K. I thought I had seen excessive reads there too, but now I can't reproduce that. Creating another fs with recordsize=8k seems to make this behavior go away - things seem to be working as designed. I'll go update the (nota-)bug. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
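For reference, a recordsize-matched filesystem like the one Mike describes can be set up along these lines (the dataset name and test file are placeholders):

  zfs create -o recordsize=8k testpool/rs8k
  # ... copy the test file set into /testpool/rs8k ...
  dd if=/testpool/rs8k/file-0001 of=/dev/null bs=128k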
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:16 PM, Joerg Schilling wrote: > Bob Friesenhahn wrote: > >> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> > >> > FWIW, I hit another bug if I turn off primarycache. >> > >> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> > >> > This causes really abysmal performance - but equally so for repeat runs! >> >> It is quite facinating seeing the huge difference in I/O performance >> from these various reports. The bug you reported seems likely to be >> that without at least a little bit of caching, it is necessary to >> re-request the underlying 128K ZFS block several times as the program >> does numerous smaller I/Os (cpio uses 10240 bytes?) across it. > > cpio reads/writes in 8192 byte chunks from the filesystem. > > BTW: star by default creates a shared memory based FIFO of 8 MB size and > reads in the biggest possible size that would currently fit into the FIFO. > > Jörg Using cpio's -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn't modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >/dev/null". At this point cpio was spending roughly 13% usr and 87% sys. I haven't tried star, but I did see that I could also reproduce with "cat $file | cat > /dev/null". This seems like a worthless use of cat, but it forces cat to actually copy data from input to output unlike when cat can mmap input and output. When it does that and output is /dev/null Solaris is smart enough to avoid any reads. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
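Spelled out, the two command forms Mike mentions look roughly like this (directory and file names are placeholders):

  # cpio with a 1 MB buffer instead of its default I/O size
  find /testpool/zfscachetest -type f | cpio -o -C $((1024 * 1024)) > /dev/null
  # force cat to actually copy data rather than mmap straight to /dev/null
  cat /testpool/zfscachetest/file-0001 | cat > /dev/null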
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote: cpio reads/writes in 8192 byte chunks from the filesystem. Yes, I was just reading the cpio manual page and see that. I think that re-reading the 128K zfs block 16 times to satisfy each request for 8192 bytes explains the 16X performance loss when caching is disabled. I don't think that this is strictly a bug since it is what the database folks are looking for. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
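The 16X figure falls straight out of the block-size ratio:

  echo $(( 131072 / 8192 ))    # 128K ZFS record / 8K cpio read = 16 re-reads per record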
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I cannot comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. Thanks, /jim Bob Friesenhahn wrote: On Mon, 13 Jul 2009, Mike Gerdts wrote: FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! It is quite fascinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C). It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then 'zpool iostat' would start showing lower read performance once the ARC becomes full, but that is not the case. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > FWIW, I hit another bug if I turn off primarycache. > > > > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 > > > > This causes really abysmal performance - but equally so for repeat runs! > > It is quite fascinating seeing the huge difference in I/O performance > from these various reports. The bug you reported seems likely to be > that without at least a little bit of caching, it is necessary to > re-request the underlying 128K ZFS block several times as the program > does numerous smaller I/Os (cpio uses 10240 bytes?) across it. cpio reads/writes in 8192 byte chunks from the filesystem. BTW: star by default creates a shared memory based FIFO of 8 MB size and reads in the biggest possible size that would currently fit into the FIFO. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote: FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! It is quite fascinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C). It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then 'zpool iostat' would start showing lower read performance once the ARC becomes full, but that is not the case. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
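For the database-style case Bob alludes to, the usual knob on builds that have it is the per-dataset primarycache property rather than a global tunable; the dataset name below is only an example:

  zfs set primarycache=metadata tank/db   # cache metadata only, skip file data
  zfs get primarycache tank/db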
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 9:34 AM, Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Alexander Skwar wrote: >> >> Still on S10 U7 Sparc M4000. >> >> So I'm now inline with the other results - the 2nd run is WAY slower. 4x >> as slow. > > It would be good to see results from a few OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris > issue as well. Indeed it is. Using ldoms with tmpfs as the backing store for virtual disks, I see: With S10u7: # ./zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m30.35s user0m9.90s sys 0m19.81s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m43.95s user0m9.67s sys 0m17.96s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. # ./zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m31.14s user0m10.09s sys 0m20.47s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m40.24s user0m9.68s sys 0m17.86s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. When I move the zpool to a 2009.06 ldom, # /var/tmp/zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m30.09s user0m9.58s sys 0m19.83s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m44.21s user0m9.47s sys 0m18.18s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m29.89s user0m9.58s sys 0m19.72s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m44.40s user0m9.59s sys 0m18.24s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. Notice in these runs that each time the usr+sys time of the first run adds up to the elapsed time - the rate was choked by CPU. This is verified by "prstat -mL". The second run seemed to be slow due to a lock as we had just demonstrated that the IO path can do more (not an IO bottleneck) and "prstat -mL shows cpio at in sleep for a significant amount of time. FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real4m21.57s user0m9.72s sys 0m36.30s Doing second 'cpio -o > /dev/null' 4800025 blocks real4m21.56s user0m9.72s sys 0m36.19s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. This bug report contains more detail of the configuration. One thing not covered in that bug report is that the S10u7 ldom has 2048 MB of RAM and the 2009.06 ldom has 2024 MB of RAM. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Sun X4500 (thumper) with 16Gb of memory running Solaris 10 U6 with patches current to the end of Feb 2009. Current ARC size is ~6Gb. ZFS filesystem created in a ~3.2 Tb pool consisting of 7 sets of mirrored 500Gb SATA drives. I used 4000 8Mb files for a total of 32Gb. run 1: ~140M/s average according to zpool iostat real4m1.11s user0m10.44s sys 0m50.76s run 2: ~37M/s average according to zpool iostat real13m53.43s user0m10.62s sys 0m55.80s A zfs unmount followed by a mount of the filesystem returned the performance to the run 1 case. real3m58.16s user0m11.54s sys 0m51.95s In summary, the second run performance drops to about 30% of the original run. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Brad Diggs wrote: You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear if the same bug is shared by current OpenSolaris since it seems like it has not been tested. Solaris 10 U7 reads files that it has not seen before at a constant rate regardless of the amount of file data it has already read. When the file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, then that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC. It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup because of ZFS. This is making me madder and madder by the minute. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html Brad Brad Diggs Senior Directory Architect Virtualization Architect xVM Technology Lead Sun Microsystems, Inc. Phone x52957/+1 972-992-0002 Mail bradley.di...@sun.com Blog http://TheZoneManager.com Blog http://BradDiggs.com On Jul 4, 2009, at 2:48 AM, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there ZFS will be slow for those files, even if you subsequently use read(2) to access them. If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system. I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris. The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five. Hope this helps, Phil blogs.sun.com/pgdh Sent from my iPod On 4 Jul 2009, at 05:26, Bob Friesenhahn wrote: On Fri, 3 Jul 2009, Bob Friesenhahn wrote: Copy MethodData Rate == cpio -pdum75 MB/s cp -r32 MB/s tar -cf - . | (cd dest && tar -xf -)26 MB/s It seems that the above should be ammended. Running the cpio based copy again results in zpool iostat only reporting a read bandwidth of 33 MB/second. The system seems to get slower and slower as it runs. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
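One way to see the mmap(2)-versus-read(2) difference Phil describes is to trace just those calls while each tool moves the same file; the file paths below are placeholders, and the initial shared-library mappings will also show up in the traces:

  # cp(1): the source file should appear in repeated mmap() calls
  truss -t open,mmap,read -o /tmp/cp.truss cp /tank/zfscachetest/file-0001 /tmp/copy
  grep 'mmap(' /tmp/cp.truss
  # cpio(1): the data should move through plain read() calls instead
  echo /tank/zfscachetest/file-0001 | truss -t open,mmap,read -o /tmp/cpio.truss cpio -o > /dev/null
  grep -c 'read(' /tmp/cpio.truss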
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Interesting, I repeated the test on a few other machines running newer builds. First impressions are good: snv_114, virtual machine, 1GB RAM, 30GB disk - 16% slowdown. (Only 9GB free so I ran an 8GB test) Doing initial (unmount/mount) 'cpio -o > /dev/null' 1683 blocks real3m4.85s user0m16.74s sys 0m41.69s Doing second 'cpio -o > /dev/null' 1683 blocks real3m34.58s user0m18.85s sys 0m45.40s And again on snv_117, Sun x2200, 40GB RAM, single 500GB sata disk: First run (with the default 24GB set): real6m25.15s user0m11.93s sys 0m54.93s Doing second 'cpio -o > /dev/null' 48000247 blocks real1m9.97s user0m12.17s sys 0m57.80s ... d'oh! At least I know the ARC is working :-) The second run, with a 98GB test is running now, I'll post the results in the morning. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote: Still on S10 U7 Sparc M4000. So I'm now inline with the other results - the 2nd run is WAY slower. 4x as slow. It would be good to see results from a few OpenSolaris users running a recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris issue as well. It seems likely to be more evident with fast SAS disks or SAN devices rather than a few SATA disks since the SATA disks have more access latency. Pools composed of mirrors should offer less read latency as well. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote: This is a M4000 mit 32 GB RAM and two HDs in a mirror. I think that you should edit the script to increase the file count since your RAM size is big enough to cache most of the data. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
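A back-of-the-envelope way to pick that file count is to size the data set at roughly twice physical memory; this sketch assumes each test file is 8192000 bytes (about 8 MB), and the variable that actually controls the count inside zfs-cache-test.ksh may be named differently:

  phys_mb=$(prtconf 2>/dev/null | awk '/^Memory size/ {print $3}')
  # each test file is ~8 MB; aim for about 2x RAM worth of data
  numfiles=$(( phys_mb * 2 / 8 ))
  echo "suggested file count: $numfiles"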
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Here's a more useful output, with having set the number of files to 6000, so that it has a dataset which is larger than the amount of RAM. --($ ~)-- time sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (6000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 96000493 Blöcke real8m44.82s user0m46.85s sys2m15.01s Doing second 'cpio -o > /dev/null' 96000493 Blöcke real29m15.81s user0m45.31s sys3m2.36s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. real48m40.890s user1m47.192s sys8m2.165s Still on S10 U7 Sparc M4000. So I'm now inline with the other results - the 2nd run is WAY slower. 4x as slow. Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
x4540 running snv_117 # ./zfs-cache-test.ksh zpool1 zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m7.13s user0m9.27s sys 0m49.09s Doing second 'cpio -o > /dev/null' 48000247 blocks real4m52.52s user0m9.13s sys 0m47.51s ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi, Solaris 10U7, patched to the latest released patches two weeks ago. Four ST31000340NS attached to two SI3132 SATA controller, RAIDZ1. Selfmade system with 2GB RAM and an x86 (chipid 0x0 AuthenticAMD family 15 model 35 step 2 clock 2210 MHz) AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ processor. On the first run throughput was ~110MB/s, on the second run only 80MB/s. Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 Blöcke real3m37.17s user0m11.15s sys 0m47.74s Doing second 'cpio -o > /dev/null' 48000247 Blöcke real4m55.69s user0m10.69s sys 0m47.57s Daniel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hey Bob, Here are my results on a Dual 2.2Ghz Opteron, 8GB of RAM and 16 SATA disks connected via a Supermicro AOC-SAT2-MV8 (albeit with one dead drive). Looks like a 5x slowdown to me: Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m46.45s user0m10.29s sys 0m58.27s Doing second 'cpio -o > /dev/null' 48000247 blocks real15m50.62s user0m10.54s sys 1m11.86s Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, On Sun, Jul 12, 2009 at 23:38, Bob Friesenhahn wrote: > There has been no forward progress on the ZFS read performance issue for a > week now. A 4X reduction in file read performance due to having read the > file before is terrible, and of course the situation is considerably worse > if the file was previously mmapped as well. Many of us have sent a lot of > money to Sun and were not aware that ZFS is sucking the life out of our > expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. On that system there is a 1.8X read slowdown from the file > being accessed previously. > > In order to raise visibility of this issue, I invite others to see if they > can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test. --($ ~)-- time sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 Blöcke real4m7.70s user0m24.10s sys 1m5.99s Doing second 'cpio -o > /dev/null' 48000247 Blöcke real1m44.88s user0m22.26s sys 0m51.56s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. real10m47.747s user0m54.189s sys 3m22.039s This is a M4000 mit 32 GB RAM and two HDs in a mirror. Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi, Here is the result on a Dell Precision T5500 with 24 GB of RAM and two HD in a mirror (SATA, 7200 rpm, NCQ). [glehm...@marvin2 tmp]$ uname -a SunOS marvin2 5.11 snv_117 i86pc i386 i86pc Solaris [glehm...@marvin2 tmp]$ pfexec ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/ zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real8m19,74s user0m6,47s sys 0m25,32s Doing second 'cpio -o > /dev/null' 48000247 blocks real10m42,68s user0m8,35s sys 0m30,93s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. HTH, Gaëtan Le 13 juil. 09 à 01:15, Scott Lawson a écrit : Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [r...@xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [r...@xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [r...@xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [r...@xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/ zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m48.94s user0m21.58s sys 0m44.91s Doing second 'cpio -o > /dev/null' 48000247 blocks real6m39.87s user0m21.62s sys 0m46.20s Feel free to clean up with 'zfs destroy test1/zfscachetest'. Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60M/'s sustained on the 2nd. /Scott. Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. 
The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under / Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real2m54.17s user0m7.65s sys 0m36.59s Doing second 'cpio -o > /dev/null' 48000247 blocks real11m54.65s user0m7.70s sys 0m35.06s Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'. And here is a similar run on my Blade 2500 using the default rpool: # ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/ zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real13m3.91s user2m43.04s sys 9m28.73s Doing second 'cpio -o > /dev/null' 48000247 blocks real23m50.27s user2m41.81s sys 9m46.76s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. I am interested to hear about systems which do not suffer from this bu
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [r...@xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [r...@xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [r...@xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [r...@xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m48.94s user0m21.58s sys 0m44.91s Doing second 'cpio -o > /dev/null' 48000247 blocks real6m39.87s user0m21.62s sys 0m46.20s Feel free to clean up with 'zfs destroy test1/zfscachetest'. Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60M/'s sustained on the 2nd. /Scott. Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real2m54.17s user0m7.65s sys 0m36.59s Doing second 'cpio -o > /dev/null' 48000247 blocks real11m54.65s user0m7.70s sys 0m35.06s Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'. 
And here is a similar run on my Blade 2500 using the default rpool: # ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real13m3.91s user2m43.04s sys 9m28.73s Doing second 'cpio -o > /dev/null' 48000247 blocks real23m50.27s user2m41.81s sys 9m46.76s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. I am interested to hear about systems which do not suffer from this bug. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real2m54.17s user0m7.65s sys 0m36.59s Doing second 'cpio -o > /dev/null' 48000247 blocks real11m54.65s user0m7.70s sys 0m35.06s Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'. And here is a similar run on my Blade 2500 using the default rpool: # ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real13m3.91s user2m43.04s sys 9m28.73s Doing second 'cpio -o > /dev/null' 48000247 blocks real23m50.27s user2m41.81s sys 9m46.76s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. I am interested to hear about systems which do not suffer from this bug. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
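For readers who would rather see the shape of the test before downloading it, the following is a simplified sketch reconstructed from the output above; it is not the actual zfs-cache-test.ksh, and in particular the file-generation method and the cpio invocation are assumptions:

  #!/bin/ksh
  POOL=${1:-rpool}
  FS=$POOL/zfscachetest
  zfs create $FS
  cd /$FS || exit 1
  echo "Creating data file set (3000 files of 8192000 bytes) under /$FS ..."
  i=0
  while [ $i -lt 3000 ]; do
      mkfile 8192000 file-$i     # zero-filled files; fine while compression is off
      i=$(( i + 1 ))
  done
  cd /
  zfs unmount $FS
  zfs mount $FS
  cd /$FS
  echo "Doing initial (unmount/mount) 'cpio -o > /dev/null'"
  time find . -type f | cpio -o > /dev/null
  echo "Doing second 'cpio -o > /dev/null'"
  time find . -type f | cpio -o > /dev/null
  echo "Feel free to clean up with 'zfs destroy $FS'."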
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I don't swear. The word it bleeped was not a bad word -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have a much more generic question regarding this thread. I have a sun T5120 (T2 quad core, 1.4GHz) with two 10K RPM SAS drives in a mirrored pool running Solaris 10 u7. The disk performance seems horrible. I have the same apps running on a Sun X2100M2 (dual core 1.8GHz AMD) also running Solaris 10u7 and an old, really poor performing SATA drive (also with ZFS), and its disk performance seems at least 5x better. I'm not offering much detail here, but I had been attributing this to what I've always observed--Solaris on x86 performs far better than on sparc for any app I've ever used. I guess the real question would be is ZFS ready for production in Solaris 10, or should I flar this bugger up and rebuild with UFS? This thread concerns me, and I really want to keep ZFS on this system for its many features. Sorry if this is off-topic, but you guys got me wondering. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Joerg Schilling wrote: Based on the prior discussions of using mmap() with ZFS and the way ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at all and POSIX_FADV_DONTNEED probably does not work either. These are pretty straightforward to implement with UFS since UFS benefits from the existing working madvise() functionality. I did run my tests on UFS... To clarify, you are not likely to see benefits until the system becomes starved for memory resources, or there is contention from multiple processes for read cache. Solaris UFS is very well tuned so it is likely that a single process won't see much benefit. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss