Hi Jim - cross-posting to zfs-discuss, because 20X is, to say the least, 
compelling.

Obviously, it would be awesome if we had the opportunity to whittle down which
of the changes made this fly, or whether it was a combination of the changes.
Looking at them individually....


> set zfs:zfs_vdev_cache_size = 0

The default for this is 10MB per vdev, and as I understand it (which may be
wrong), it is part of the device-level prefetch on reads.


> set zfs:zfs_vdev_cache_bshift = 13

This obscurely named parameter defines the size of each device-level disk read
(I think). The default value for this parameter is 16, equating to 64k reads.
The value of 13 reduces disk read sizes to 8k.
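
The shift-to-bytes relationship is just 2^bshift, which is easy to sanity-check
with plain shell arithmetic:

```shell
# vdev cache reads are 2^zfs_vdev_cache_bshift bytes each
for bshift in 16 13; do
  echo "bshift=$bshift -> $((1 << bshift)) bytes per read"
done
# bshift=16 -> 65536 bytes per read
# bshift=13 -> 8192 bytes per read
```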

> set zfs:zfs_prefetch_disable = 1 

The vdev parameters above relate to device-level prefetching.
zfs_prefetch_disable applies to file level prefetching.
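
To keep the two layers straight, here is a quick summary sketch mapping each
tunable from this thread to the layer it affects (on a live box the equivalent
check would be grep '^set zfs:' /etc/system):

```shell
# Prefetch-related tunables from this thread, by layer
cat <<'EOF'
zfs_vdev_cache_size=0     device-level (prefetch cache disabled)
zfs_vdev_cache_bshift=13  device-level (2^13 = 8k reads)
zfs_prefetch_disable=1    file-level   (prefetch disabled)
EOF
```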

With regard to the COW/scattered blocks query, it is certainly a possible
side-effect of COW that maintaining sequential file block layout can get
challenging, but the TXG model and coalescing writes help with that.

With regard to the changes (including the ARC size increase), it's really
impossible to say without data to what extent prefetching at one or both layers
made the difference here. Was it the cumulative effect of both, or was one a
much larger contributing factor?

It would be interesting to reproduce this in a lab.
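
A minimal sketch of such a reproduction, assuming nothing about the customer's
pool layout (the paths and sizes are placeholders; on the real system you'd
point this at a ZFS filesystem and toggle the tunables between timed runs):

```shell
#!/bin/sh
# Reproduction sketch: write a large file, then do a sequential read at
# the application's 16k block size (as in the dd tests in this thread).
TESTDIR=$(mktemp -d)                        # placeholder for a ZFS filesystem
dd if=/dev/zero of="$TESTDIR/bigfile" bs=1M count=64 2>/dev/null
# On the real box: time this read with prefetch disabled, then re-enabled,
# and compare elapsed times.
dd if="$TESTDIR/bigfile" bs=16k 2>/dev/null | wc -c   # expect 67108864
rm -rf "$TESTDIR"
```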

What release of Solaris 10 is this?

Thanks
/jim


> 
> ... and increased their ARC to 8GB and backups that took 15+ hours now take 
> 45 minutes.  They are still analyzing what effects re-enabling prefetch has 
> on their applications.  
> 
> One other thing they noticed, before removing these tunables, is that the 
> backups were taking progressively longer each day.  For instance, at the 
> beginning of last week, they took 12 hours.  By Friday, they were taking 17 
> hours.  This is with similar-sized datasets.  They will be keeping an eye on 
> this, too, but I'm interested in any possible causes that might be related to 
> ZFS.  One thing I've been told is that ZFS COW (copy-on-write) operations can 
> cause blocks to be scattered across a disk, where they were once located 
> closer to one another.
> 
> We'll see how it behaves in the next week or so.
> 
> Thanks for the feedback,
> Jim
> 
> On 10/21/10 02:49 PM, Amer Ather wrote:
>> 
>> Jim,
>> 
>> For sequential IO read performance, you need file system read ahead. By 
>> setting:
>> 
>> set zfs:zfs_prefetch_disable = 1 
>> 
>> You have disabled ZFS prefetch, which is needed to boost sequential IO 
>> performance. Normally, we recommend disabling it for Oracle OLTP-type 
>> workloads to avoid IO inflation due to read-ahead. However, for backups it 
>> needs to be enabled. Take this setting out of the /etc/system file and retest.
>> 
>> Amer.
>> 
>> 
>> 
>> On 10/21/10 12:00 PM, Jim Nissen wrote:
>>> 
>>> I'm working with a customer who is having Directory Server backup performance 
>>> problems since switching to ZFS.  In short, backups that used to take 1 - 4 
>>> hours on UFS are now taking 12+ hours on ZFS.  We've figured out that ZFS 
>>> reads seem to be throttled, whereas writes seem really fast.  Backend storage 
>>> is IBM SVC.
>>> 
>>> As part of their cutover, they were given the following Best Practice 
>>> recommendations from LDAP folks @Sun...
>>> 
>>> /etc/system tunables:
>>> set zfs:zfs_arc_max = 0x100000000 
>>> set zfs:zfs_vdev_cache_size = 0 
>>> set zfs:zfs_vdev_cache_bshift = 13 
>>> set zfs:zfs_prefetch_disable = 1 
>>> set zfs:zfs_nocacheflush = 1 
>>> 
>>> At ZFS filesystem level:
>>> recordsize = 32K
>>> noatime
>>> 
>>> One of the things they noticed is that simple dd reads from one of the 128K 
>>> recordsize filesystems run much faster (4 - 7 times) than from their 32K 
>>> filesystems.  I joined a shared shell where we switched the same filesystem 
>>> from 32K to 128K, and we could see the underlying disks were getting 4x better 
>>> throughput (from 1.5 - 2MB/sec to 8 - 10MB/s), whereas a direct dd against 
>>> one of the disks showed that the disks were capable of much more (45+ 
>>> MB/sec).
>>> 
>>> Here are some snippets from iostat...
>>> 
>>> ZFS recordsize of 32K, dd if=./somelarge5gfile of=/dev/null bs=16k (to 
>>> mimic application blocksizes)
>>> 
>>>                     extended device statistics              
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>>    67.6    0.0 2132.7    0.0  0.0  0.3    0.0    4.5   0  30 
>>> c6t60050768018E82BDA800000000000565d0
>>>    67.4    0.0 2156.8    0.0  0.0  0.1    0.0    1.5   0  10 
>>> c6t60050768018E82BDA800000000000564d0
>>>    68.4    0.0 2158.3    0.0  0.0  0.3    0.0    4.5   0  31 
>>> c6t60050768018E82BDA800000000000563d0
>>>    66.2    0.0 2118.4    0.0  0.0  0.2    0.0    3.4   0  22 
>>> c6t60050768018E82BDA800000000000562d0
>>> 
>>> ZFS recordsize of 128K, dd if=./somelarge5gfile of=/dev/null bs=16k (to 
>>> mimic application blocksizes)
>>> 
>>>                     extended device statistics              
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>>    78.2    0.0 10009.6    0.0  0.0  0.2    0.0    1.9   0  15 
>>> c6t60050768018E82BDA800000000000565d0
>>>    78.6    0.0 9960.0    0.0  0.0  0.1    0.0    1.2   0  10 
>>> c6t60050768018E82BDA800000000000564d0
>>>    79.4    0.0 10062.3    0.0  0.0  0.4    0.0    4.4   0  35 
>>> c6t60050768018E82BDA800000000000563d0
>>>    76.6    0.0 9804.8    0.0  0.0  0.2    0.0    2.3   0  17 
>>> c6t60050768018E82BDA800000000000562d0
>>> 
>>> dd if=/dev/rdsk/c6t60050768018E82BDA800000000000564d0s0 of=/dev/null bs=32k 
>>> (to mimic small ZFS blocksize)
>>>                     extended device statistics              
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>>  3220.9    0.0 51533.9    0.0  0.0  0.9    0.0    0.3   1  94 
>>> c6t60050768018E82BDA800000000000564d0
>>> 
>>> So, it's not like the underlying disk isn't capable of much more than what 
>>> ZFS is asking of it.  I understand the part where it will have to do 4x as 
>>> much work with a 32K blocksize as with 128K, but it doesn't seem as if ZFS is 
>>> doing much at all with the underlying disks.
>>> 
>>> We've asked the customer to rerun the test without the /etc/system tunables.  
>>> Has anybody else worked a similar issue?  Any hints provided would be greatly 
>>> appreciated.
>>> 
>>> Thanks!
>>> 
>>> Jim
>>> 
>> 

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
