Re: [zfs-discuss] SSD and ZFS
On Mon, Feb 15, 2010 at 5:51 PM, Daniel Carosone d...@geek.com.au wrote:

  On Sun, Feb 14, 2010 at 11:08:52PM -0600, Tracey Bernath wrote:
    Now, to add the second SSD ZIL/L2ARC for a mirror.

  Just to be clear: mirror the ZIL by all means, but don't mirror the L2ARC; just add more devices and let them load-balance. This is especially true if you're sharing SSD writes with the ZIL, as slices on the same devices.

Well, the problem I am trying to solve is: wouldn't it read 2x faster with the mirror? It seems that once I can drive the single device to 10 queued actions and 100% busy, it would be more useful to have two channels to the same data. Is ZFS not smart enough to understand that there are two identical mirror devices in the cache and split requests between them? Or are you saying that ZFS is smart enough to cache data in two places, just not mirrored?

If the device itself were full, and items were falling off the L2ARC, then I could see having two separate cache devices; but since I am only at about 50% utilization of the available capacity, and maxing out the IO, mirroring seemed smarter. Am I missing something here?

Tracey

    I may even splurge for one more to get a three-way mirror.

  With more devices, questions about selecting different devices appropriate for each purpose come into play.

    Now I need a bigger server

  See? :)

  --
  Dan.
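For anyone wanting to try the layout Dan describes, a minimal sketch of the commands involved, assuming the second SSD shows up as c0t0d5 (a hypothetical device name) and is sliced the same way as the first:

  # attach a second slice to the existing log device, turning the slog
  # into a two-way mirror
  zpool attach dpool c0t0d4s0 c0t0d5s0

  # add the other slice as an additional, independent cache device;
  # L2ARC devices are load-balanced across, never mirrored
  zpool add dpool cache c0t0d5s1

Reads are then spread over however many cache devices are present, while the synchronous writes to the log are protected by the mirror.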
Re: [zfs-discuss] SSD and ZFS
For those following the saga: with the prefetch problem fixed, and data coming off the L2ARC instead of the disks, the system switched from IO-bound to CPU-bound. I opened up the throttles with some explicit PARALLEL hints in the Oracle commands, and we were finally able to max out the single SSD:

      r/s   w/s      kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    826.0   3.2  104361.8  35.2   0.0   9.9     0.0    12.0   3  100  c0t0d4

So, when we maxed out the SSD cache, it was delivering 100+ MB/s and 830 IOPS, with 3.4 TB behind it in a 4-disk SATA RAIDZ1. I still have to remap it to 8 KB blocks to get more efficiency, but for raw numbers it's exactly what I was looking for.

Now, to add the second SSD ZIL/L2ARC for a mirror. I may even splurge for one more to get a three-way mirror. That will completely saturate the SCSI channel. Now I need a bigger server. Did I mention it was $1000 for the whole setup? Bah-ha-ha-ha.

Tracey

On Sat, Feb 13, 2010 at 11:51 PM, Tracey Bernath tbern...@ix.netcom.com wrote:

OK, that was the magic incantation I was looking for:
- changing the noprefetch option opened the floodgates to the L2ARC
- changing the max queue depth relieved the wait time on the drives, although I may undo this again in the benchmarking since these drives all have NCQ

I went from all four disks of the array at 100%, doing about 170 read IOPS/25 MB/s, to all four disks of the array at 0%, hitting nearly 500 IOPS/65 MB/s off the cache drive (at only 50% load). This bodes well for adding a second mirrored cache drive to push for the 1K IOPS. Now I am ready to insert the mirror for the ZIL and the CACHE, and we will be ready for some production benchmarking.

BEFORE:
device     r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
sd0      170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  8  0 82
sd1      168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
sd2      172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
sd4      170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
sd5        1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31

AFTER:
                    extended device statistics
    r/s   w/s     kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d0
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d1
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d2
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d3
  285.2   0.8  36236.2  14.4   0.0   0.5     0.0     1.8   1  37  c0t0d4

And, keep in mind this was on less than $1000 of hardware.

Thanks for the pointers guys,
Tracey

On Sat, Feb 13, 2010 at 9:22 AM, Richard Elling richard.ell...@gmail.com wrote:

comment below...

On Feb 12, 2010, at 2:25 PM, TMB wrote:

I have a similar question. I put together a cheapo RAID with four 1TB WD Black (7200) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with slice 0 (5GB) for ZIL and the rest of the SSD for cache:

# zpool status dpool
  pool: dpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t0d1    ONLINE       0     0     0
            c0t0d2    ONLINE       0     0     0
            c0t0d3    ONLINE       0     0     0
        logs
          c0t0d4s0    ONLINE       0     0     0
        cache
          c0t0d4s1    ONLINE       0     0     0
        spares
          c0t0d6      AVAIL
          c0t0d7      AVAIL

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
dpool       72.1G  3.55T    237     12  29.7M   597K
  raidz1    72.1G  3.55T    237      9  29.7M   469K
    c0t0d0      -      -    166      3  7.39M   157K
    c0t0d1      -      -    166      3  7.44M   157K
    c0t0d2      -      -    166      3  7.39M   157K
    c0t0d3      -      -    167      3  7.45M   157K
  c0t0d4s0     20K  4.97G     0      3      0   127K
cache           -      -      -      -      -      -
  c0t0d4s1  17.6G  36.4G      3      1   249K   119K
----------  -----  -----  -----  -----  -----  -----

I just don't seem to be getting any bang for the buck I should be. This was taken while rebuilding an Oracle index, all files stored in this pool. The WD disks are at 100%, and nothing is coming from the cache. The cache does have the entire DB cached (17.6G used), but hardly reads anything from it. I am also not seeing the spike of data flowing into the ZIL, although iostat shows there is just write traffic hitting the SSD:

                   extended device statistics                        cpu
device     r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
sd0      170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  8  0 82
sd1      168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
sd2      172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
sd3        0.0   0.0     0.0    0.0   0.0   0.0    0.0   0    0
sd4      170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
sd5        1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31
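The "remap it to 8 KB blocks" step mentioned above is a dataset recordsize change. A minimal sketch of what that typically looks like, assuming the Oracle datafiles live in a dataset named dpool/oradata (a hypothetical name):

  # match the ZFS recordsize to the 8 KB Oracle block size
  zfs set recordsize=8k dpool/oradata

  # recordsize only applies to blocks written after the change, so existing
  # datafiles have to be copied or recreated (e.g. restored) to pick it up

Smaller records mean random 8 KB database reads no longer drag whole 128 KB blocks through the ARC and L2ARC.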
Re: [zfs-discuss] SSD and ZFS
OK, that was the magic incantation I was looking for:
- changing the noprefetch option opened the floodgates to the L2ARC
- changing the max queue depth relieved the wait time on the drives, although I may undo this again in the benchmarking since these drives all have NCQ

I went from all four disks of the array at 100%, doing about 170 read IOPS/25 MB/s, to all four disks of the array at 0%, hitting nearly 500 IOPS/65 MB/s off the cache drive (at only 50% load). This bodes well for adding a second mirrored cache drive to push for the 1K IOPS. Now I am ready to insert the mirror for the ZIL and the CACHE, and we will be ready for some production benchmarking.

device     r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
sd0      170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  8  0 82
sd1      168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
sd2      172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
sd4      170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
sd5        1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31

                    extended device statistics
    r/s   w/s     kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d0
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d1
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d2
    0.0   0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d3
  285.2   0.8  36236.2  14.4   0.0   0.5     0.0     1.8   1  37  c0t0d4

And, keep in mind this was on less than $1000 of hardware.

Thanks,
Tracey

On Sat, Feb 13, 2010 at 9:22 AM, Richard Elling richard.ell...@gmail.com wrote:

comment below...

On Feb 12, 2010, at 2:25 PM, TMB wrote:

I have a similar question. I put together a cheapo RAID with four 1TB WD Black (7200) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with slice 0 (5GB) for ZIL and the rest of the SSD for cache:

# zpool status dpool
  pool: dpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t0d1    ONLINE       0     0     0
            c0t0d2    ONLINE       0     0     0
            c0t0d3    ONLINE       0     0     0
        logs
          c0t0d4s0    ONLINE       0     0     0
        cache
          c0t0d4s1    ONLINE       0     0     0
        spares
          c0t0d6      AVAIL
          c0t0d7      AVAIL

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
dpool       72.1G  3.55T    237     12  29.7M   597K
  raidz1    72.1G  3.55T    237      9  29.7M   469K
    c0t0d0      -      -    166      3  7.39M   157K
    c0t0d1      -      -    166      3  7.44M   157K
    c0t0d2      -      -    166      3  7.39M   157K
    c0t0d3      -      -    167      3  7.45M   157K
  c0t0d4s0     20K  4.97G     0      3      0   127K
cache           -      -      -      -      -      -
  c0t0d4s1  17.6G  36.4G      3      1   249K   119K
----------  -----  -----  -----  -----  -----  -----

I just don't seem to be getting any bang for the buck I should be. This was taken while rebuilding an Oracle index, all files stored in this pool. The WD disks are at 100%, and nothing is coming from the cache. The cache does have the entire DB cached (17.6G used), but hardly reads anything from it. I am also not seeing the spike of data flowing into the ZIL, although iostat shows there is just write traffic hitting the SSD:

                   extended device statistics                        cpu
device     r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
sd0      170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  8  0 82
sd1      168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
sd2      172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
sd3        0.0   0.0     0.0    0.0   0.0   0.0    0.0   0    0
sd4      170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
sd5        1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31

iostat has a -n option, which is very useful for looking at device names :-) The SSD here is performing well. The rest are clobbered. A 205 millisecond response time will be agonizingly slow. By default, for this version of ZFS, up to 35 I/Os will be queued to the disk, which is why you see 35.0 in the actv column. The combination of actv=35 and svc_t > 200 indicates that this is the place to start working.
Begin by reducing zfs_vdev_max_pending from 35 to something like 1 to 4. This will reduce the concurrent load on the disks, thus reducing svc_t.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
-- richard
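A sketch of how that tuning is applied in practice (the value 4 is just an example; the mdb change takes effect immediately but does not survive a reboot):

  # change it on the live system
  echo zfs_vdev_max_pending/W0t4 | mdb -kw

  # or make it persistent by adding this line to /etc/system and rebooting
  set zfs:zfs_vdev_max_pending = 4

Watch the actv and svc_t columns in iostat afterwards; the queue depth should drop from ~35 toward the new value and per-I/O service times should fall with it.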
Re: [zfs-discuss] SSD and ZFS
Thanks Brendan, I was going to move it over to an 8 KB block size once I got through this index rebuild. My thinking was that a disproportionate block size would show up as excessive IO throughput, not a lack of throughput.

The question about the cache comes from the fact that the 18 GB or so that it says is in the cache IS the database. This was why I was thinking the index rebuild should be CPU constrained, and I should see a spike in reading from the cache. If the entire file is cached, why would it go to the disks at all for the reads? The disks are delivering about 30 MB/s of reads, but this SSD is rated for a sustained 70 MB/s, so there should be a chance to pick up a 100% gain. I've seen lots of mention of kernel settings, but those only seem to apply to cache flushes on sync writes.

Any idea on where to look next? I've spent about a week tinkering with it. I'm trying to get a major customer to switch over to ZFS and an open storage solution, but I'm afraid that if I can't get it to work at the small scale, I can't convince them about the large scale.

Thanks,
Tracey

On Fri, Feb 12, 2010 at 4:43 PM, Brendan Gregg - Sun Microsystems bren...@sun.com wrote:

On Fri, Feb 12, 2010 at 02:25:51PM -0800, TMB wrote:

I have a similar question. I put together a cheapo RAID with four 1TB WD Black (7200) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with slice 0 (5GB) for ZIL and the rest of the SSD for cache:

# zpool status dpool
  pool: dpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t0d1    ONLINE       0     0     0
            c0t0d2    ONLINE       0     0     0
            c0t0d3    ONLINE       0     0     0
        logs
          c0t0d4s0    ONLINE       0     0     0
        cache
          c0t0d4s1    ONLINE       0     0     0
        spares
          c0t0d6      AVAIL
          c0t0d7      AVAIL

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
dpool       72.1G  3.55T    237     12  29.7M   597K
  raidz1    72.1G  3.55T    237      9  29.7M   469K
    c0t0d0      -      -    166      3  7.39M   157K
    c0t0d1      -      -    166      3  7.44M   157K
    c0t0d2      -      -    166      3  7.39M   157K
    c0t0d3      -      -    167      3  7.45M   157K
  c0t0d4s0     20K  4.97G     0      3      0   127K
cache           -      -      -      -      -      -
  c0t0d4s1  17.6G  36.4G      3      1   249K   119K
----------  -----  -----  -----  -----  -----  -----

I just don't seem to be getting any bang for the buck I should be. This was taken while rebuilding an Oracle index, all files stored in this pool. The WD disks are at 100%, and nothing is coming from the cache. The cache does have the entire DB cached (17.6G used), but hardly reads anything from it. I am also not seeing the spike of data flowing into the ZIL, although iostat shows there is just write traffic hitting the SSD:

                   extended device statistics                        cpu
device     r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
sd0      170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  8  0 82
sd1      168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
sd2      172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
sd3        0.0   0.0     0.0    0.0   0.0   0.0    0.0   0    0
sd4      170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
sd5        1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31

Since this SSD is in a RAID array, and just presents as a regular disk LUN, is there a special incantation required to turn on the Turbo mode? Doesn't it seem that all this traffic should be maxing out the SSD? Reads from the cache, and writes to the ZIL? I have a second identical SSD I wanted to add as a mirror, but it seems pointless if there's no zip to be had.

The most likely reason is that this workload has been identified as streaming by ZFS, which is prefetching from disk instead of the L2ARC (l2arc_noprefetch=1). It also looks like you've used a 128 Kbyte ZFS record size. Is Oracle doing 128 Kbyte random I/O?
We usually tune that down before creating the database, which will use the L2ARC device more efficiently. Brendan

--
Brendan Gregg, Fishworks            http://blogs.sun.com/brendan

--
Tracey Bernath
913-488-6284
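For anyone else hitting the same wall, the two knobs Brendan points at look roughly like this on an OpenSolaris build of this vintage (a sketch, not a recipe; the mdb change is not persistent across reboots):

  # allow prefetched/streaming blocks to be cached in, and read from, the L2ARC
  echo l2arc_noprefetch/W0t0 | mdb -kw

  # set the dataset recordsize to match the 8 KB Oracle block size before
  # loading the database (only newly written files pick up the new size)
  zfs set recordsize=8k dpool

The first change is what opened the floodgates in Tracey's follow-up; the second makes the L2ARC hold 8 KB records instead of 128 KB ones, so random database reads waste far less cache bandwidth.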