Re: [zfs-discuss] application writes are blocked near the end of spa_sync

2010-02-26 Thread Shane Cox
Thanks for your reply.  I disabled write throttling, but didn't observe any
change in behavior.  After doing some more research, I have a theory as to
the root cause of the pauses that I'm observing.


Near the end of spa_sync, writes are blocked in function zil_itx_assign as
illustrated by the following lockstat output:

Adaptive mutex block: 179 events in 5.015 seconds (36 events/sec)
Count indv cuml rcnt nsec Hottest Lock   Hottest Caller

---
3 100% 100% 0.00 178617192 0x82a7e4c0 zil_itx_assign+0x22


The calling thread is blocked for 178ms while attempting to acquire a lock
on the zfs intent log.


The function holding the lock is zil_itx_clean as illustrated by the
following lockstat output:

Adaptive mutex hold: 146357 events in 5.059 seconds (28927 events/sec)
Count indv cuml rcnt nsec Lock   Caller

1   0% 100% 0.00 178438696 0x82a7e4c0 zil_itx_clean+0xd1


Since zil_itx_clean holds a lock on the zfs intent log for 178ms, no new
writes can be performed during this time.


Looking into the source, it appears that zil_itx_clean obtains the lock on
the zfs intent log, then enters a while loop, moving the already sync'd
transactions into another list so that they can be freed.  Here's a comment
from the code within the synchronized block:
* Move the sync'd log transactions to a separate list so we can call
* kmem_free without holding the zl_lock.


So it appears that sync'ing the transactions to disk isn't causing the
delays.  Instead, the cleanup after the sync is the problem.  This cleanup
holds a lock on the zfs intent log while old/sync'd transactions are moved
out of the intent log, during which time new zfs writes are blocked.

At least, that's my theory.
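
To make the theory concrete, here's a minimal C sketch of that pattern.
This is not the actual OpenSolaris source -- the type and field names
(itx_list, synced_txg, etc.) are simplified stand-ins for the real
zilog/itx machinery:

#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct itx {
    struct itx *next;
    uint64_t birth_txg;             /* txg this log record belongs to */
    /* ... log record payload ... */
} itx_t;

typedef struct zilog {
    pthread_mutex_t lock;           /* stands in for zl_lock */
    itx_t *itx_list;                /* pending intent-log transactions */
    uint64_t synced_txg;            /* highest txg already on disk */
} zilog_t;

static void
itx_clean(zilog_t *zl)
{
    itx_t *clean = NULL, *itx;

    /*
     * While this mutex is held, writers blocked in the equivalent of
     * zil_itx_assign() must wait; with a long list, this walk is the
     * ~178ms window seen in the lockstat output above.
     */
    pthread_mutex_lock(&zl->lock);
    while ((itx = zl->itx_list) != NULL &&
        itx->birth_txg <= zl->synced_txg) {
        zl->itx_list = itx->next;
        itx->next = clean;          /* move onto a private list */
        clean = itx;
    }
    pthread_mutex_unlock(&zl->lock);

    /* Free outside the lock, as the quoted comment intends. */
    while ((itx = clean) != NULL) {
        clean = itx->next;
        free(itx);
    }
}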



On Fri, Feb 26, 2010 at 11:30 PM, Zhu Han  wrote:

> Hi,
>
> This page may indicate the root cause.
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
>
> ZFS throttles writes to match the rate at which data enters the txg to the
> speed of the disk I/O. If it detects that the modest measure (a 1-tick
> pause) cannot keep the tx group from growing too large, it falls back to
> stalling all write requests. That could be the situation you have observed.
>
> However, please note that this may not be correct, since I'm not a ZFS
> developer.
>
> As a workaround, you may add more disks to the ZFS pool to get more
> bandwidth and alleviate the problem. Or you may want to disable write
> throttling, if you are sure the writes come only in occasional bursts.
> Again, I'm not sure whether the latter solution is feasible.
>
> best regards,
> hanzhu
>
>
> On Sat, Feb 27, 2010 at 2:29 AM, Bob Friesenhahn <
> bfrie...@simple.dallas.tx.us> wrote:
>
>> On Fri, 26 Feb 2010, Shane Cox wrote:
>>
>>>
>>> I've reviewed the forum archives and read a number of threads related to
>>> this issue.  However, I didn't find a root-cause explanation for these
>>> pauses, only talk of how to ameliorate them.  In my particular case, I
>>> would like to know why zfs_log_writes are blocked for 180ms on a mutex
>>> (seemingly blocked on the intent log itself) when performing
>>> zil_itx_assign.  Another thread must have a lock on the intent log, no?
>>> Overall, the system appears healthy as other system calls (e.g., reads
>>> and writes to network devices) complete successfully while writes to the
>>> intent log are blocked ... so the problem seems to be access to the zfs
>>> intent log.  Any additional insight would be appreciated.
>>>
>>
>> As far as I am aware, none of the zfs authors has been willing to address
>> this issue in public.  It is not clear (to me) if the fundamental design of
>> zfs transaction groups requires that writes stop briefly until the
>> transaction group has been flushed to disk.  I suspect that this is the
>> case.
>>
>> Perhaps zfs will never meet your timing requirements.  Others here have
>> had considerable success by using RAID interface adaptor cards with
>> battery-backed cache memory and configuring those cards to "IT" JBOD mode.
>>  By limiting the TXG group size to the amount which will fit in
>> battery-backed cache memory, the time to "commit" the TXG group is
>> dramatically reduced as long as the continual write rate does not exceed
>> what the backing disks can sustain.  Unfortunately, this may increase the
>> total amount of data written to underlying storage.
>>
>>
>> Bob
>> --
>> Bob Friesenhahn
>> bfrie...@simple.dallas.tx.us,
>> http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-02-26 Thread Richard Elling
On Feb 26, 2010, at 8:59 PM, Richard Elling wrote:

> On Feb 26, 2010, at 8:25 PM, Eric D. Mudama wrote:
>> On Thu, Feb 25 at 20:21, Bob Friesenhahn wrote:
>>> On Thu, 25 Feb 2010, Alastair Neil wrote:
>>> 
 I do not know and I don't think anyone would deploy a system in that way
 with UFS.  This is the model that is imposed in order to take full
 advantage of zfs advanced features such as snapshots, encryption and
 compression, and I know many universities in particular are eager to adopt
 it for just that reason, but are stymied by this problem.
>>> 
>>> It was not really a serious question but it was posed to make a point. 
>>> However, it would be interesting to know if there is another type of 
>>> filesystem (even on Linux or some other OS) which is able to reasonably and 
>>> efficiently support 16K mounted and exported file systems.
>>> 
>>> Eventually Solaris is likely to work much better for this than it does 
>>> today, but most likely there are higher priorities at the moment.
>> 
>> I agree with the above, but the best practices guide:
>> 
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_file_service_for_SMB_.28CIFS.29_or_SAMBA
>> 
>> states in the SAMBA section that "Beware that mounting 1000s of file
>> systems, will impact your boot time".  I'd say going from a 2-3 minute
>> boot time to a 4+ hour boot time is more than just "impact".  That's
>> getting hit by a train.

Perhaps someone that has a SAMBA config large enough could make a
test similar to the NFS set described in
http://developers.sun.com/solaris/articles/nfs_zfs.html
(note the date, 2007)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-02-26 Thread Alastair Neil
Ironically, it's NFS exporting that is the real hog; CIFS shares seem to
come up pretty fast.  The fact that CIFS shares can be fast makes it hard
for me to understand why Sun/Oracle seem to be making such a meal of this
bug.  Possibly because it only critically affects poor universities, and
not clients with the budget to throw hardware at the problem.

On Fri, Feb 26, 2010 at 11:59 PM, Richard Elling
wrote:

> On Feb 26, 2010, at 8:25 PM, Eric D. Mudama wrote:
> > On Thu, Feb 25 at 20:21, Bob Friesenhahn wrote:
> >> On Thu, 25 Feb 2010, Alastair Neil wrote:
> >>
> >>> I do not know and I don't think anyone would deploy a system in that
> >>> way with UFS.  This is the model that is imposed in order to take
> >>> full advantage of zfs advanced features such as snapshots, encryption
> >>> and compression, and I know many universities in particular are eager
> >>> to adopt it for just that reason, but are stymied by this problem.
> >>
> >> It was not really a serious question but it was posed to make a point.
> However, it would be interesting to know if there is another type of
> filesystem (even on Linux or some other OS) which is able to reasonably and
> efficiently support 16K mounted and exported file systems.
> >>
> >> Eventually Solaris is likely to work much better for this than it does
> today, but most likely there are higher priorities at the moment.
> >
> > I agree with the above, but the best practices guide:
> >
> >
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_file_service_for_SMB_.28CIFS.29_or_SAMBA
> >
> > states in the SAMBA section that "Beware that mounting 1000s of file
> > systems, will impact your boot time".  I'd say going from a 2-3 minute
> > boot time to a 4+ hour boot time is more than just "impact".  That's
> > getting hit by a train.
>
> The shares are more troublesome than the mounts.
>
> >
> > Might be useful for folks, if the above document listed a few concrete
> > datapoints of boot time scaling with the number of filesystems or
> > something similar.
>
> Gory details and timings are available in the many references to CR 6850837
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6850837
>  -- richard
>
> ZFS storage and performance consulting at http://www.RichardElling.com
> ZFS training on deduplication, NexentaStor, and NAS performance
> http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
>
>
>
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-02-26 Thread Richard Elling
On Feb 26, 2010, at 8:25 PM, Eric D. Mudama wrote:
> On Thu, Feb 25 at 20:21, Bob Friesenhahn wrote:
>> On Thu, 25 Feb 2010, Alastair Neil wrote:
>> 
>>> I do not know and I don't think anyone would deploy a system in that way
>>> with UFS.  This is the model that is imposed in order to take full
>>> advantage of zfs advanced features such as snapshots, encryption and
>>> compression, and I know many universities in particular are eager to
>>> adopt it for just that reason, but are stymied by this problem.
>> 
>> It was not really a serious question but it was posed to make a point. 
>> However, it would be interesting to know if there is another type of 
>> filesystem (even on Linux or some other OS) which is able to reasonably and 
>> efficiently support 16K mounted and exported file systems.
>> 
>> Eventually Solaris is likely to work much better for this than it does 
>> today, but most likely there are higher priorities at the moment.
> 
> I agree with the above, but the best practices guide:
> 
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_file_service_for_SMB_.28CIFS.29_or_SAMBA
> 
> states in the SAMBA section that "Beware that mounting 1000s of file
> systems, will impact your boot time".  I'd say going from a 2-3 minute
> boot time to a 4+ hour boot time is more than just "impact".  That's
> getting hit by a train.

The shares are more troublesome than the mounts.  

> 
> Might be useful for folks, if the above document listed a few concrete
> datapoints of boot time scaling with the number of filesystems or
> something similar.

Gory details and timings are available in the many references to CR 6850837
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6850837
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] application writes are blocked near the end of spa_sync

2010-02-26 Thread Zhu Han
Hi,

This page may indicate the root cause.
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

ZFS throttles writes to match the rate at which data enters the txg to the
speed of the disk I/O. If it detects that the modest measure (a 1-tick
pause) cannot keep the tx group from growing too large, it falls back to
stalling all write requests. That could be the situation you have observed.

However, please note that this may not be correct, since I'm not a ZFS
developer.

As a workaround, you may add more disks to the ZFS pool to get more
bandwidth and alleviate the problem. Or you may want to disable write
throttling, if you are sure the writes come only in occasional bursts.
Again, I'm not sure whether the latter solution is feasible.

best regards,
hanzhu


On Sat, Feb 27, 2010 at 2:29 AM, Bob Friesenhahn <
bfrie...@simple.dallas.tx.us> wrote:

> On Fri, 26 Feb 2010, Shane Cox wrote:
>
>>
>> I've reviewed the forum archives and read a number of threads related to
>> this issue.  However, I didn't find a root-cause explanation for these
>> pauses, only talk of how to ameliorate them.  In my particular case, I
>> would like to know why zfs_log_writes are blocked for 180ms on a mutex
>> (seemingly blocked on the intent log itself) when performing
>> zil_itx_assign.  Another thread must have a lock on the intent log, no?
>> Overall, the system appears healthy as other system calls (e.g., reads
>> and writes to network devices) complete successfully while writes to the
>> intent log are blocked ... so the problem seems to be access to the zfs
>> intent log.  Any additional insight would be appreciated.
>>
>
> As far as I am aware, none of the zfs authors has been willing to address
> this issue in public.  It is not clear (to me) if the fundamental design of
> zfs transaction groups requires that writes stop briefly until the
> transaction group has been flushed to disk.  I suspect that this is the
> case.
>
> Perhaps zfs will never meet your timing requirements.  Others here have had
> considerable success by using RAID interface adaptor cards with
> battery-backed cache memory and configuring those cards to "IT" JBOD mode.
>  By limiting the TXG group size to the amount which will fit in
> battery-backed cache memory, the time to "commit" the TXG group is
> dramatically reduced as long as the continual write rate does not exceed
> what the backing disks can sustain.  Unfortunately, this may increase the
> total amount of data written to underlying storage.
>
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-02-26 Thread Eric D. Mudama

On Thu, Feb 25 at 20:21, Bob Friesenhahn wrote:

On Thu, 25 Feb 2010, Alastair Neil wrote:


I do not know and I don't think anyone would deploy a system in that way
with UFS.  This is the model that is imposed in order to take full advantage
of zfs advanced features such as snapshots, encryption and compression, and
I know many universities in particular are eager to adopt it for just that
reason, but are stymied by this problem.


It was not really a serious question but it was posed to make a 
point. However, it would be interesting to know if there is another 
type of filesystem (even on Linux or some other OS) which is able to 
reasonably and efficiently support 16K mounted and exported file 
systems.


Eventually Solaris is likely to work much better for this than it 
does today, but most likely there are higher priorities at the 
moment.


I agree with the above, but the best practices guide:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_file_service_for_SMB_.28CIFS.29_or_SAMBA

states in the SAMBA section that "Beware that mounting 1000s of file
systems, will impact your boot time".  I'd say going from a 2-3 minute
boot time to a 4+ hour boot time is more than just "impact".  That's
getting hit by a train.

Might be useful for folks, if the above document listed a few concrete
datapoints of boot time scaling with the number of filesystems or
something similar.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread David Dyer-Bennet

On 2/26/2010 8:45 PM, Paul B. Henson wrote:

On Fri, 26 Feb 2010, David Dyer-Bennet wrote:


So, even if you're willing to completely discard 30 years of legacy
scripts and applications -- how do you propose that a NEW script or
application should be written so as to work in this brave new
environment?

[...]

And how should new utilities be written to take the place of the 30
years of work you're throwing out?  I don't yet see how it can be done.

First of all, you make a choice. Maybe the correct operation of some 30
year old script is most important to you. So you set an aclmode so it
works. But maybe making sure your sensitive data file doesn't get
accidentally exposed to the world via a unexpected hidden chmod in a 30
year old script is more important than that script working. So you set an
aclmode so your ACL doesn't get destroyed. It's your choice. Choice is
good.



I think of using ACLs to extend extra access beyond what the permission 
bits grant.  Are you talking about using them to prevent things that the 
permission bits appear to grant?   Because so long as they're only 
granting extended access, losing them can't expose anything.  (It can 
still be tremendously inconvenient, of course; but I don't see that it 
can create unintended access.)


I'm serious about not seeing how it'd be possible to write new 
applications for this environment.  Most especially, new applications 
that also worked in a POSIX environment.  (Other than the brute force 
approach of completely separate code for the two environments.) It's not 
just a problem for existing code.



Second, you're not necessarily discarding all of those legacy
scripts/applications. You're just making sure they don't screw up your
ACL's. Take the example of the editor that chmod's a file and you don't
want it to (but it's a binary app and you can't make it stop). Configuring
zfs to ignore the chmod doesn't break the application. The editor continues
to edit fine. It just doesn't destroy your ACL. Win-win.

If there's some app/script for which changing permissions are essential to
its operation, but it only understands mode bits, either the security
provided by mode bits is sufficient, so you configure aclmode so it works.
Or the security provided by mode bits isn't sufficient, so you replace the
app/script with one that understands ACLs. Using the published ACL API. man
-s 2 acl ;). You can claim it might be a lot of work, but I'm not sure how
you could claim it can't be done.


The problem, of course, is old apps that think they're being especially 
careful about replicating permissions properly.  That seems to be the 
scenario that breaks your ACLs.  Is there any way for a bash script to
replicate permissions in an ACL environment?  A Perl app?  A C app?  
Especially one that's trying to be POSIX-portable?
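
For what it's worth, on Solaris a C program can at least replicate an ACL
wholesale through the libsec API (acl_get(3SEC)/acl_set(3SEC)) instead of
round-tripping through stat()/chmod().  A rough sketch with minimal error
handling -- though note this is exactly the kind of platform-specific code
that gives up POSIX portability:

#include <stdio.h>
#include <sys/acl.h>    /* libsec: acl_get(), acl_set(), acl_free() */

/* Copy src's full ACL (trivial or not) onto dst; link with -lsec. */
int
copy_acl(const char *src, const char *dst)
{
    acl_t *aclp;

    if (acl_get(src, 0, &aclp) != 0) {
        perror("acl_get");
        return (-1);
    }
    if (acl_set(dst, aclp) != 0) {
        perror("acl_set");
        acl_free(aclp);
        return (-1);
    }
    acl_free(aclp);
    return (0);
}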


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, David Dyer-Bennet wrote:

> So, even if you're willing to completely discard 30 years of legacy
> scripts and applications -- how do you propose that a NEW script or
> application should be written so as to work in this brave new
> environment?
[...]
> And how should new utilities be written to take the place of the 30
> years of work you're throwing out?  I don't yet see how it can be done.

First of all, you make a choice. Maybe the correct operation of some 30
year old script is most important to you. So you set an aclmode so it
works. But maybe making sure your sensitive data file doesn't get
accidentally exposed to the world via a unexpected hidden chmod in a 30
year old script is more important than that script working. So you set an
aclmode so your ACL doesn't get destroyed. It's your choice. Choice is
good.

Second, you're not necessarily discarding all of those legacy
scripts/applications. You're just making sure they don't screw up your
ACL's. Take the example of the editor that chmod's a file and you don't
want it to (but it's a binary app and you can't make it stop). Configuring
zfs to ignore the chmod doesn't break the application. The editor continues
to edit fine. It just doesn't destroy your ACL. Win-win.

If there's some app/script for which changing permissions are essential to
its operation, but it only understands mode bits, either the security
provided by mode bits is sufficient, so you configure aclmode so it works.
Or the security provided by mode bits isn't sufficient, so you replace the
app/script with one that understands ACLs. Using the published ACL API. man
-s 2 acl ;). You can claim it might be a lot of work, but I'm not sure how
you could claim it can't be done.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson

I was in the middle of a lengthy reply to this, which I've abandoned, as it
can pretty much be summarized as "If you don't want this behavior, don't
enable it."

It wouldn't be the default, and if you didn't want it, you wouldn't enable
it. Perhaps it might be enabled on some system you inherit, but in that
case whoever originally turned it on must have wanted it. So you'd change
it to suit your needs. There's *lotso* stuff on a hand-me-down system
that's probably not configured the way you want :).

I'm not trying to force a particular behavior on anybody. I just want an
optional behavior available to meet the specific needs of my deployment.


On Fri, 26 Feb 2010, David Dyer-Bennet wrote:

> The problem with that, of course, is that it's equally true in a
> pure-permissions world -- if I'm trying to change the permissions with
> chmod, it's safe to assume that the new values aren't what the person
> who originally configured the protections on that file wanted.  THAT'S
> WHY I'M CHANGING THEM!
>
> So I don't see how that's a great argument for ignoring what I do.
[...]
> Okay, but the argument goes the other way just as well -- when I run
> "chmod 6400 foobar", I want the permissions set that specific way, and I
> don't want some magic background feature blocking me.  Particulary if
> "I" am a complex system of scripts that wasn't even written locally.

-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread David Dyer-Bennet

On 2/26/2010 6:52 PM, Paul B. Henson wrote:

On Fri, 26 Feb 2010, David Dyer-Bennet wrote:


chown ddb /path/to/file
chmod 640 /path/to/file



I'll tell you, if I type that and then find I (I'm "ddb") *can't* read the
file, I'm going to be REALLY unhappy.
 

Then clearly you should configure your zfs filesystem in such a manner as
to propagate the mode bit changes to the ACL. Which is currently, and even
if the additional modes I'd like to see are implemented, would remain the
default. So unless you explicitly selected an alternative that better met
your needs you could continue to ignore the differences between legacy mode
bits and ACL's.



So, even if you're willing to completely discard 30 years of legacy 
scripts and applications -- how do you propose that a NEW script or
application should be written so as to work in this brave new environment?




The concept of having parts of a filesystem designated ACL-only and parts
designated permissions-only leads to a total nightmare for utilities,
applications, and admin scripts of all kinds, so I don't think that can
be the answer.

I disagree. If your deployment scenario is better served by preventing a
ACL from being mangled by a well intentioned but destructive mapping of
legacy permission mode bits, why shouldn't that option be available for
you? Nobody would be forced to use it. It would probably be very unwise to
set such an option on a root pool filesystem. But for a data filesystem
with files accessed both via CIFS and NFSv4, the ability to keep *exactly*
that same set of utilities, applications, and admin scripts from screwing
up your ACL's would be invaluable.


And how should new utilities be written to take the place of the 30 
years of work you're throwing out?  I don't yet see how it can be done.




Maybe you could make some rules, though.

No, that's been tried before. There is no good mapping from mode bits to
ACL's. My understanding is that Sun is currently considering getting rid of
both the groupmask and passthrough aclmode's (both examples of trying to
apply rules to map mode bit changes to ACL's), leaving only discard. I
actually agree with that -- if you're going to apply mode bit changes to an
object with an ACL, you might as well just get rid of it. However, in
addition to discard, I think an option to just not *let* the ACL be
destroyed should also be available.



It doesn't have to be complete to be extremely useful.

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread David Dyer-Bennet

On 2/26/2010 6:26 PM, Paul B. Henson wrote:

On Fri, 26 Feb 2010, Nicolas Williams wrote:


I believe we can do a bit better.

A chmod that adds (see below) or removes one of r, w or x for owner is a
simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
modifying / replacing / adding owner@ ACEs (if there is one).  A similar
chmod affecting group bits should probably apply to group@ ACEs.  A
similar chmod affecting other should apply to any everyone@ ACEs.

I don't necessarily think that's better; and I believe that's approximately
the behavior you can already get with aclmode=passthrough.

If something is trying to change permissions on an object with a
non-trivial ACL using chmod, I think it's safe to assume that's not what
the original user who configured the ACL wants. At least, that would be
safe to assume if the user had explicitly configured the hypothetical
aclmode=deny or aclmode=ignore :).


The problem with that, of course, is that it's equally true in a 
pure-permissions world -- if I'm trying to change the permissions with 
chmod, it's safe to assume that the new values aren't what the person 
who originally configured the protections on that file wanted.  THAT'S 
WHY I'M CHANGING THEM!


So I don't see how that's a great argument for ignoring what I do.



Take, for example, a problem I'm currently having on Linux clients mounting
ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate
NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated
chmod command in Solaris). However, the default behavior of the linux cp
command is to try and copy the mode bits along with the file. So, I copy
a file into zfs over the NFSv4 mount from some local location. The file is
created and inherits the explicitly configured ACL from the parent
directory; the cp command then does a chmod() on it and the ACL is broken.
That's not what I want, I configured that inheritable ACL for a reason, and
I want it respected regardless of the permissions of the file in its
original location.


Okay, but the argument goes the other way just as well -- when I run 
"chmod 6400 foobar", I want the permissions set that specific way, and I 
don't want some magic background feature blocking me.  Particularly if
"I" am a complex system of scripts that wasn't even written locally.

--

David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Bill Sommerfeld wrote:

> acl-chmod interactions have been mishandled so badly in the past that i
> think a bit of experimentation with differing policies is in order.

I volunteer to help test discard and deny :). Heck, I volunteer to help
*implement* discard and deny...

> Based on the amount of wailing I see around acls, and on personal
> experience with both systems, I think AFS had it more or less right and
> POSIX got it more or less wrong -- once you step into the world of acls,
> the file mode should be mostly ignored, and an accidental chmod should
> *not* destroy carefully crafted acls.

We prototyped an AFS deployment for a while (it was the closest thing to
our existing DFS available). The location independence was great (I got
spoiled under DFS with the ability to transparently migrate data between
servers while in use), but the inability to apply an ACL to a file kind of
sucked. I guess you could have every file be in its own individual
subdirectory with the parent directory having a symlink to it to simulate
per-file ACL's, but talk about kludgy.

I'm actually much happier with our ZFS deployment (other than a couple of
ongoing unresolved scalability issues and this acl issue). But I can't
agree with you more that an undesired chmod should not destroy carefully
crafted acls. Now if I could only get a ZFS engineer to share that
viewpoint :).


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 17:38, Paul B. Henson wrote:

As I wrote in that new sub-thread, I see no option that isn't surprising
in some way.  My preference would be for what I labeled as option (b).


And I think you absolutely should be able to configure your fileserver to
implement your preference. Why shouldn't I be able to configure my
fileserver to implement mine :)?


acl-chmod interactions have been mishandled so badly in the past that i 
think a bit of experimentation with differing policies is in order.


Based on the amount of wailing I see around acls, and on personal
experience with both systems, I think AFS had it more or less right and
POSIX got it more or less wrong -- once you step into the world of acls, 
the file mode should be mostly ignored, and an accidental chmod should 
*not* destroy carefully crafted acls.


- Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Nicolas Williams wrote:

> Suppose you deny or ignore chmods.  Well, how would you ever set or reset
> set-uid/gid and sticky bits?  chmod(2) deals only in absolute modes, not
> relative changes, which means that in order to distinguish those bits
> from the rwx bits the filesystem would have to know the file's current
> mode bits in order to compare them to the new bits -- but this is hard
> (see my other e-mail in a new sub-thread).  You'd have to remove the ACL
> then chmod; oof.

You actually answered that in your previous email with option c. Ignore the
ugo bits of the argument to chmod, and only process the suid/sgid/sticky
bits. The filesystem does know the current mode bits when chmod is called,
doesn't it?

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zfs_znode.h

line 145, the zp_mode value in the znode_phys_t structure, labeled "file
mode bits". At any given time, unless I'm mistaken, this value stores the
legacy mode bits for an object, distinct from and separate of the ACL.

It seems it would be fairly trivial to implement an aclmode which only
applied the suid/sgid/sticky bit part of the argument to chmod and ignored
the rest, leaving it as is.
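
The core of such an aclmode would be little more than a mask.  A
hypothetical sketch -- not actual ZFS code, with zp_mode standing in for
the znode's cached mode bits:

#include <sys/stat.h>

#define MODE_SPECIAL    (S_ISUID | S_ISGID | S_ISVTX)

/*
 * Honor only the suid/sgid/sticky bits of a chmod(2) request,
 * leaving the rwx bits (and therefore the ACL) untouched.
 */
static mode_t
apply_special_bits_only(mode_t zp_mode, mode_t requested)
{
    return ((zp_mode & ~MODE_SPECIAL) | (requested & MODE_SPECIAL));
}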

> Can you make that utility avoid the chmod?  The mode bits should come
> from the open(2)/creat(2), and there should be no need to set them again
> after setting the ACL.

I think there is an option not to copy the mode bits. But it does by
default, and I don't really want to try and get every person on campus who
mounts their files via NFSv4 from a linux system to try and change the
default behavior of a base utility. And they might even want that behavior
on the local filesystem.

> Such an app is broken.

Yes it is. But even if I could fix it (assuming it's not a proprietary
binary), there would be another one after it. And then another. And
another. The only way to fully fix this issue for all possible instances of
bad defaults or broken applications is for the filesystem itself to enforce
it.

> But we'd have to extend NFSv4 and get the extension adopted and
> deployed.  There's no chance of such a change being made in a short
> period of time -- we're talking years.

No need; based on your other email and a little code digging I think the
ignore option could be implemented entirely within the zfs code, allowing
manipulation of suid/sgid without changing ugo bits, with no change in
behavior or interface required by anything else.

> But is an application that sets an ACL and chmods ACL-aware?  How can the
> filesystem tell?  (Answer: it can't really, as it may not be able to
> relate the two operations.)

My definition of an ACL-aware application is one that *never* tries to
manipulate legacy mode bits on an object with a non-trivial ACL. Based on
that definition, it's easy to tell :). And if an ACL aware application
wants to play with mode bits, first it should use the ACL API to set a
trivial ACL on the object, at which point chmod and mode bits would work
fine.

> As I wrote in that new sub-thread, I see no option that isn't surprising
> in some way.  My preference would be for what I labeled as option (b).

And I think you absolutely should be able to configure your fileserver to
implement your preference. Why shouldn't I be able to configure my
fileserver to implement mine :)?


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] chmod(2) vs. ACLs (Re: Who is using ZFS ACL's in production?)

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Nicolas Williams wrote:

> a) clobber the ACL;
> b) map the change as best you can to an ACL change;
> c) ignore the rwx bits in the mode mask (except on create from a POSIX
>    open(2)/creat(2), in which case the ACL has to be derived from the
>    initial mode);
> d) fail the chmod().

Option d I believe maps to my proposed aclmode=deny; option c I *think*
lines up with my aclmode=ignore, and even takes care of the issue of
flipping the suid/sgid et al bits, as an absolute chmod of 02000 would turn
on sgid and the ugo parts would be ignored (and the ACL would only need to
be derived if the three special ACE's aren't specified by inheritance
(which they probably would be if somebody configured option c)).

a and b are both currently available, what do I need to do to get you on
board with implementing c and d ;)?

> All four can be surprising!

Agreed. There is no one, two, three, or even four different ways of
handling this issue that will meet the needs of every possible deployment.
But without getting into ridiculous levels of complexity, having some
reasonable number of options available seems highly desirable.

Evil thought -- implement a way to attach a custom chmod->ACL mapper to a
zfs filesystem allowing some basic scripting language to specify what
happens. Then everybody could make it do exactly what they want :).


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 04:26:43PM -0800, Paul B. Henson wrote:
> On Fri, 26 Feb 2010, Nicolas Williams wrote:
> > I believe we can do a bit better.
> >
> > A chmod that adds (see below) or removes one of r, w or x for owner is a
> > simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
> > modifying / replacing / adding owner@ ACEs (if there is one).  A similar
> > chmod affecting group bits should probably apply to group@ ACEs.  A
> > similar chmod affecting other should apply to any everyone@ ACEs.
> 
> I don't necessarily think that's better; and I believe that's approximately
> the behavior you can already get with aclmode=passthrough.
> 
> If something is trying to change permissions on an object with a
> non-trivial ACL using chmod, I think it's safe to assume that's not what
> the original user who configured the ACL wants. At least, that would be
> safe to assume if the user had explicitly configured the hypothetical
> aclmode=deny or aclmode=ignore :).

Suppose you deny or ignore chmods.  Well, how would you ever set or
reset set-uid/gid and sticky bits?  chmod(2) deals only in absolute
modes, not relative changes, which means that in order to distinguish
those bits from the rwx bits the filesystem would have to know the
file's current mode bits in order to compare them to the new bits -- but
this is hard (see my other e-mail in a new sub-thread).  You'd have to
remove the ACL then chmod; oof.

> Take, for example, a problem I'm currently having on Linux clients mounting
> ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate
> NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated
> chmod command in Solaris). However, the default behavior of the linux cp
> command is to try and copy the mode bits along with the file. So, I copy
> a file into zfs over the NFSv4 mount from some local location. The file is
> created and inherits the explicitly configured ACL from the parent
> directory; the cp command then does a chmod() on it and the ACL is broken.
> That's not what I want, I configured that inheritable ACL for a reason, and
> I want it respected regardless of the permissions of the file in its
> original location.

Can you make that utility avoid the chmod?  The mode bits should come
from the open(2)/creat(2), and there should be no need to set them again
after setting the ACL.

> Another instance is an application that doesn't seem to trust creat() and
> umask to do the right thing, after creating a file it explicitly chmod's it
> to match the permissions it thinks it should have had based on the
> requested mode and the current umask. If the file inherited an explicitly
> specified non-trivial ACL, there's really nothing that can be done about
> that chmod, other than ignore or deny it, that will result in the
> permissions intended by the user who configured the ACL.

Such an app is broken.

> > For set-uid/gid and the sticky bits being set/cleared on non-directories
> > chmod should not affect the ACL at all.
> 
> Agreed.

But see above, below.

> > For directories the sticky and setgid bits may require editing the
> > inheritable ACEs of the ACL.
> 
> Sticky bit yes; in fact, as it affects permissions I think I'd lump that in
> to the ignore/deny category. sgid on directory though? That doesn't
> explicitly affect permission, it just potentially changes the group
> ownership of new files/directories. I suppose that indirectly affects
> permissions, as the implicit group@ ACE would be applied to a different
> group, but that's probably the intention of the person setting the sgid
> bit, and I don't think any actual ACL entry changes should occur from it.

I think both can be implemented as inheritable ACLs.

> > chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
> > the relative change based on the previous mode...
> 
> Or perhaps some interface extension allowing relative changes to the
> non-permission mode bits?

But we'd have to extend NFSv4 and get the extension adopted and
deployed.  There's no chance of such a change being made in a short
period of time -- we're talking years.

>   For example, chown(2) allows you to specify -1
> for either the user or group, meaning don't change that one. mode_t is
> unsigned, so negative values won't work there, but there are a ton of
> extra bits in an unsigned int not relevant to the mode, perhaps setting one
> of them (say 010000) to signify that only non-permission-related mode bits
> should be manipulated:

True, there's enough unused bits there that you could add ignore bits
(and mode4 is an unsigned 32-bit integer in NFSv4 too), but once again
you'd have to get clients and servers to understand this...

> [...]
> 
> But back to ACL/chmod; I don't think there's any way to map a permission
> mode bits change via chmod to an ACL change that is guaranteed to be
> acceptable to the creator of the ACL. I think there should be some form of
> option available such that if an application is not ACL aware, it flat out
> shouldn't be allowed to muck with permissions on an object with a
> non-trivial ACL.

Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, David Dyer-Bennet wrote:

> chown ddb /path/to/file
> chmod 640 /path/to/file
>
> constitutes explicit instructions to give read-write access to ddb, read
> access to people in the group, and no access to others.  Now,  how should
> that be combined with an ACL?

The first changes the owner of the file, and hence what object the
special owner@ ACE applies to.

The second (assuming "file" has a non-trivial ACL) is an attempt to change
the permission related mode bits on a file with an ACL. There are three
ways this could currently be handled by the solaris implementation, all of
which end up applying mode bit permission changes to the ACL. I'd like to
see two more ways implemented, both of which would result in no change to
the ACL.

> I'll tell you, if I type that and then find I (I'm "ddb") *can't* read the
> file, I'm going to be REALLY unhappy.

Then clearly you should configure your zfs filesystem in such a manner as
to propagate the mode bit changes to the ACL. Which is currently, and even
if the additional modes I'd like to see are implemented, would remain the
default. So unless you explicitly selected an alternative that better met
your needs you could continue to ignore the differences between legacy mode
bits and ACL's.

> The concept of having parts of a filesystem designated ACL-only and parts
> designated permissions-only leads to a total nightmare for utilities,
> applications, and admin scripts of all kinds, so I don't think that can
> be the answer.

I disagree. If your deployment scenario is better served by preventing a
ACL from being mangled by a well intentioned but destructive mapping of
legacy permission mode bits, why shouldn't that option be available for
you? Nobody would be forced to use it. It would probably be very unwise to
set such an option on a root pool filesystem. But for a data filesystem
with files accessed both via CIFS and NFSv4, the ability to keep *exactly*
that same set of utilities, applications, and admin scripts from screwing
up your ACL's would be invaluable.

> Maybe you could make some rules, though.

No, that's been tried before. There is no good mapping from mode bits to
ACL's. My understanding is that Sun is currently considering getting rid of
both the groupmask and passthrough aclmode's (both examples of trying to
apply rules to map mode bit changes to ACL's), leaving only discard. I
actually agree with that -- if you're going to apply mode bit changes to an
object with an ACL, you might as well just get rid of it. However, in
addition to discard, I think an option to just not *let* the ACL be
destroyed should also be available.

-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Nicolas Williams wrote:

> I believe we can do a bit better.
>
> A chmod that adds (see below) or removes one of r, w or x for owner is a
> simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
> modifying / replacing / adding owner@ ACEs (if there is one).  A similar
> chmod affecting group bits should probably apply to group@ ACEs.  A
> similar chmod affecting other should apply to any everyone@ ACEs.

I don't necessarily think that's better; and I believe that's approximately
the behavior you can already get with aclmode=passthrough.

If something is trying to change permissions on an object with a
non-trivial ACL using chmod, I think it's safe to assume that's not what
the original user who configured the ACL wants. At least, that would be
safe to assume if the user had explicitly configured the hypothetical
aclmode=deny or aclmode=ignore :).

Take, for example, a problem I'm currently having on Linux clients mounting
ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate
NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated
chmod command in Solaris). However, the default behavior of the linux cp
command is to try and copy the mode bits along with the file. So, I copy
a file into zfs over the NFSv4 mount from some local location. The file is
created and inherits the explicitly configured ACL from the parent
directory; the cp command then does a chmod() on it and the ACL is broken.
That's not what I want, I configured that inheritable ACL for a reason, and
I want it respected regardless of the permissions of the file in its
original location.

Another instance is an application that doesn't seem to trust creat() and
umask to do the right thing, after creating a file it explicitly chmod's it
to match the permissions it thinks it should have had based on the
requested mode and the current umask. If the file inherited an explicitly
specified non-trivial ACL, there's really nothing that can be done about
that chmod, other than ignore or deny it, that will result in the
permissions intended by the user who configured the ACL.

> For set-uid/gid and the sticky bits being set/cleared on non-directories
> chmod should not affect the ACL at all.

Agreed.

> For directories the sticky and setgid bits may require editing the
> inheritable ACEs of the ACL.

Sticky bit yes; in fact, as it affects permissions I think I'd lump that in
to the ignore/deny category. sgid on directory though? That doesn't
explicitly affect permission, it just potentially changes the group
ownership of new files/directories. I suppose that indirectly affects
permissions, as the implicit group@ ACE would be applied to a different
group, but that's probably the intention of the person setting the sgid
bit, and I don't think any actual ACL entry changes should occur from it.

> chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
> the relative change based on the previous mode...

Or perhaps some interface extension allowing relative changes to the
non-permission mode bits? For example, chown(2) allows you to specify -1
for either the user or group, meaning don't change that one. mode_t is
unsigned, so negative values won't work there, but there are a ton of
extra bits in an unsigned int not relevant to the mode, perhaps setting one
of them (say 010000) to signify that only non-permission-related mode bits
should be manipulated:

chmod("foo", 012000) // turn on sgid bit
chmod("foo", 01) // turn off sgid bit
chmod("foo", 014000) // turn on suid bit
chmod("foo", 01) // turn off suid bit

> You should probably stop using the set-gid bit on directories and use
> inheritable ACLs instead...

Hmm, I suppose that could be implemented by using an explicit group: ACE
rather than the group@ ACE, but having the group ownership of the object
match and be expressed by group@ just seems a lot cleaner. ACL's don't get
rid of the concept of user and group ownership, and I don't think
the suid/sgid concept is going to get dropped anytime soon, so might as
well avail of it :).

But back to ACL/chmod; I don't think there's any way to map a permission
mode bits change via chmod to an ACL change that is guaranteed to be
acceptable to the creator of the ACL. I think there should be some form of
option available such that if an application is not ACL aware, it flat out
shouldn't be allowed to muck with permissions on an object with a
non-trivial ACL. In such a mode, only ACL operations should be allowed to
modify the permissions. They're really two separate security domains,
operations from one shouldn't be mixed with the other.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

[zfs-discuss] chmod(2) vs. ACLs (Re: Who is using ZFS ACL's in production?)

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 05:02:34PM -0600, David Dyer-Bennet wrote:
> 
> On Fri, February 26, 2010 12:45, Paul B. Henson wrote:
> 
> > I've already posited as to an approach that I think would make a pure-ACL
> > deployment possible:
> >
> > 
> > http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html
> >
> > Via this concept or something else, there needs to be a way to configure
> > ZFS to prevent the attempted manipulation of legacy permission mode bits
> > from breaking the security policy of the ACL.
> 
> It seems to me that it should depend.
> 
> chown ddb /path/to/file
> chmod 640 /path/to/file
> 
> constitutes explicit instructions to give read-write access to ddb, read
> access to people in the group, and no access to others.  Now,  how should
> that be combined with an ACL?

The chown is irrelevant (well, it's relevant to you in terms of your
intentions, but it's very hard for the filesystem to consider a chmod in
relation to earlier chowns and chgrps).

I see four ways to handle the mode mask vs. ACL conflict:

a) clobber the ACL;
b) map the change as best you can to an ACL change;
c) ignore the rwx bits in the mode mask (except on create from a POSIX
   open(2)/creat(2), in which case the ACL has to be derived from the
   initial mode);
d) fail the chmod().

All four can be surprising!  (d) may be the least surprising, but it
may disrupt some apps.  (b) is the next least surprising, but it has
some dangerous effects.  (b) is tricky because the filesystem needs to
figure out what the change actually was by tracking mode bits from the
beginning.

For (b) IMO the right thing to do would be to always track a mode mask
whose rwx bits are not actually used for authorization, but which are
used to detect changes on chmod(2), and then the changes should be
applied as best effort edits of the ACLs.  On create via non-POSIX
methods the mode mask would have to be constructed synthetically.  When
the ACL is edited the current mode bits have to be brought in sync with
owner@/group@/everyone@ ACEs.  All methods of synchronizing or
synthesizing a mode mask from/to an ACL are going to be lossy.
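
The tracking side of (b) reduces to a diff of the remembered mode against
the requested one.  An illustrative sketch only -- the hard, lossy part is
mapping the resulting delta onto ACE edits, which this omits:

#include <sys/types.h>
#include <sys/stat.h>

typedef struct mode_delta {
    mode_t set;                     /* bits this chmod turns on */
    mode_t clear;                   /* bits this chmod turns off */
} mode_delta_t;

/* Recover the relative change hidden inside an absolute chmod(2). */
static mode_delta_t
mode_diff(mode_t tracked, mode_t requested)
{
    mode_delta_t d;

    d.set = requested & ~tracked;   /* e.g. S_IRGRP granted: edit group@ */
    d.clear = tracked & ~requested; /* e.g. S_IROTH revoked: edit everyone@ */
    return (d);
}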

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 02:50:05PM -0800, Paul B. Henson wrote:
> On Fri, 26 Feb 2010, Bill Sommerfeld wrote:
> 
> > I believe this proposal is sound.
> 
> Mere words can not express the sheer joy with which I receive this opinion
> from an @sun.com address ;).

I believe we can do a bit better.

A chmod that adds (see below) or removes one of r, w or x for owner is a
simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
modifying / replacing / adding owner@ ACEs (if there is one).  A similar
chmod affecting group bits should probably apply to group@ ACEs.  A
similar chmod affecting other should apply to any everyone@ ACEs.

For set-uid/gid and the sticky bits being set/cleared on non-directories
chmod should not affect the ACL at all.  For directories the sticky and
setgid bits may require editing the inheritable ACEs of the ACL.

> There's also the question of what to do with the non-access-control pieces
> of the legacy mode bits that have no ACL equivalent (suid, sgid, sticky
> bit, et al). I think the only way to set those is with an absolute chmod,

chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
the relative change based on the previous mode... but how to know what
the "previous mode" was?  ZFS would have to construct one from the
owner@/group@/everyone@ + set-uid/gid + sticky bits, if any.  Best
effort will do.

> so there'd be no way to manipulate them in the current implementation
> without whacking the ACL. That's likely done relatively infrequently, those
> bits could always be set before the ACL is applied. In our current
> deployment the only one we use is sgid on directories, which is inherited,
> not directly applied.

You should probably stop using the set-gid bit on directories and use
inheritable ACLs instead...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread David Dyer-Bennet

On Fri, February 26, 2010 12:45, Paul B. Henson wrote:

> I've already posited as to an approach that I think would make a pure-ACL
> deployment possible:
>
>   
> http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html
>
> Via this concept or something else, there needs to be a way to configure
> ZFS to prevent the attempted manipulation of legacy permission mode bits
> from breaking the security policy of the ACL.

It seems to me that it should depend.

chown ddb /path/to/file
chmod 640 /path/to/file

constitutes explicit instructions to give read-write access to ddb, read
access to people in the group, and no access to others.  Now,  how should
that be combined with an ACL?

I'll tell you, if I type that and then find I (I'm "ddb") *can't* read the
file, I'm going to be REALLY unhappy.  Which parts of what those commands
say to do are "explicit" and should take precedence, and which parts are
accidental and shouldn't override anything?  When using the octal number
form, I don't think you can tell.  (If I type "chmod o-rwx", I think I've
been exactly explicit about what I want, and I think I should get it if I
have permission.)

I guess it shouldn't be unexpected that ACLs, which are clearly much more
powerful than basic permissions, are also more complicated.  Additional
power very often arrives accompanied by more complexity.

The concept of having parts of a filesystem designated ACL-only and parts
designated permissions-only leads to a total nightmare for utilities,
applications, and admin scripts of all kinds, so I don't think that can be
the answer.

Maybe you could make some rules, though.  For example, off the top of my
head it seems reasonable that changes to permissions for "other" should
not override ACL entries for specific users.  Changes to permissions for
"owner" SHOULD override ACL entries for the user that's the same as the
current owner, if any exist.  I'm not terribly sanguine about coming up
with a set of rules that avoids disaster and avoids surprise and is
possible to keep in your head, though.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Richard Elling
On Feb 26, 2010, at 12:20 PM, Ronny Egner wrote:
> Hi Richard,
> 
> r...@openstorage:~# echo "swapfs_minfree/D" | mdb -k
> swapfs_minfree:
> swapfs_minfree: 2358757
> 
> 
> 2358757 pages * 4 KB = 9435028 KB ≈ 9.0 GB
> 
> So there is my memory :-)

I believe so.

> 
> If i read the documentation correctly this parameter has to be set in 
> /etc/system?

There is an interesting (to me :-) discussion about this in
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4469865

The way I see this is that the rule of 7/8 memory or memory - 1GB is, for 
all practical purposes, just 7/8 of memory because the ARC will begin
reclaiming as it crosses the 7/8 threshold.  Obviously, the box needs to
have more than 8GB before this is noticeable.

I think it is quite reasonable to set this to a lower value for a storage server
with lots of memory because you don't expect to have a sudden need for large
quantities of swap.
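
For example, assuming the standard /etc/system syntax (the value is in
pages, so 262144 is roughly 1 GB with 4 KB pages):

    set swapfs_minfree = 262144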

See also
http://docs.sun.com/app/docs/doc/819-2724/chapter2-125?a=view

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Bill Sommerfeld wrote:

> I believe this proposal is sound.

Mere words can not express the sheer joy with which I receive this opinion
from an @sun.com address ;).

> There are already per-filesystem tunables for ZFS which allow the system
> to escape the confines of POSIX (noatime, for one); I don't see why a
> "chmod doesn't truncate acls" option couldn't join it so long as it was
> off by default and left off while conformance tests were run.

It always frustrates me when cutting edge technology is artificially
hampered by the chains and straitjacket of an obsolete (or at least not
necessarily relevant to the problem at hand) standard. I had the same
problem with our previous DCE/DFS environment and the POSIX mask_obj.
Compliance with standards is good, but also having the option to knowingly
disregard them is even better :).

There are (as always) various pesky details that need to be ironed out. For
example, it should probably only apply to objects with a non-trivial ACL;
ones with a trivial ACL should still be chmod'able for compatibility.

There's also the question of what to do with the non-access-control pieces
of the legacy mode bits that have no ACL equivalent (suid, sgid, sticky
bit, et al). I think the only way to set those is with an absolute chmod,
so there'd be no way to manipulate them in the current implementation
without whacking the ACL. That's likely done relatively infrequently; those
bits could always be set before the ACL is applied. In our current
deployment the only one we use is sgid on directories, which is inherited,
not directly applied.

I was hoping to find some ZFS engineers that might be interested in tossing
the concept back and forth to the point where it was workable, but so far
no luck. It looks like you work more in the network security area? Ignoring
the zfs specific details, from an abstract security perspective, it seems
generally "not good" to be able to so easily and unintentionally subvert
explicitly configured security policy :(.

I've got an open case, SR#72456444, regarding chmod/ACL conflicts, if
anybody would like to help it along :).

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for an l2arc device?

2010-02-26 Thread Brandon High
On Fri, Feb 26, 2010 at 12:24 PM, Lutz Schumann
 wrote:
> Or use multiple X25-V (L2ARC is not filled fast anyhow, so write does not 
> matter). You can get 4 of them for 1 160 GB X25-M. With 4 X25-V you get ~500 
> MB /sec instead of ~ 140 MB / sec with the X25-M - much better value for the 
> same price :)

I'm aware that write performance is less of an issue, which is why I'm
not concerned with using a device with slower write performance like
the X25-V. I'm unlikely to add more than one device to the l2arc
anytime soon, either. If I do use two devices, I'd be at $180 vs.
$260.

The Indilinx controllers support garbage collection when they're idle,
which may help performance as well.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Richard Elling
On Feb 26, 2010, at 11:55 AM, Lutz Schumann wrote:

> This would be an idea and I thought about this. However I see the following 
> problems: 
> 
> 1) using deduplication
> 
> This will reduce the on-disk size; however, the DDT will grow forever, and for 
> the deletion of zvols this will mean a lot of time and work (see other 
> threads regarding DDT memory issues on the list).

Use compression and deduplication.  If you are mostly concerned about the zero
fills, then the zle compressor is very fast and efficient.
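
For example, on a hypothetical zvol (zle, like dedup, requires a build
recent enough to offer it):

    zfs set compression=zle tank/vol1
    zfs set dedup=on tank/vol1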

> 2) compression
> 
> As I understand it - if I do zfs send/receive (which we do for DR) data is 
> grown to the original size again on the wire. This makes it difficult. 

uhmmm... compress it, too :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Slowing down "zfs destroy"

2010-02-26 Thread Giovanni Tirloni
Hello,

 While destroying a dataset, sometimes ZFS kind of hangs the machine. I
imagine it's starving all I/O while deleting the blocks, right?

 Here logbias=latency, the commit interval is the default (30 seconds) and
we have SSDs for logs and cache.

 Is there a way to "slow down" the destroy a little bit in order to reserve
I/O for NFS clients?  Degraded performance isn't as bad as a total loss of
availability in our case.

 I was thinking we could set logbias=throughput and decrease the commit
interval to 10 seconds to keep it running more smoothly.
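
A sketch of that tuning, assuming the pool name from the config below; the
txg interval can also be set persistently via zfs:zfs_txg_timeout in
/etc/system:

    zfs set logbias=throughput trunk
    echo zfs_txg_timeout/W0t10 | mdb -kw    # runtime change: 10-second txg interval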

 Here's the pool configuration. Note the 2 slog devices; they were
supposed to be a mirror but got added by mistake.

NAME STATE READ WRITE CKSUM
trunk  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t4d0   ONLINE   0 0 0
c7t5d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t6d0   ONLINE   0 0 0
c7t7d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t8d0   ONLINE   0 0 0
c7t9d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t10d0  ONLINE   0 0 0
c7t11d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t12d0  ONLINE   0 0 0
c7t13d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t14d0  ONLINE   0 0 0
c7t15d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t16d0  ONLINE   0 0 0
c7t17d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t18d0  ONLINE   0 0 0
c7t19d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t20d0  ONLINE   0 0 0
c7t21d0  ONLINE   0 0 0
logs ONLINE   0 0 0
  c7t1d0 ONLINE   0 0 0
  c7t2d0 ONLINE   0 0 0
cache
  c7t22d0ONLINE   0 0 0
spares
  c7t3d0 AVAIL


 Any ideas?

Thank you,

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for an l2arc device?

2010-02-26 Thread Lutz Schumann
> with the Intel product...but save up a few more pennies
> and get the X-25M. The extra boost on read and
> write performance is worth it.

Or use multiple X25-V (L2ARC is not filled fast anyhow, so write does not 
matter). You can get 4 of them for 1 160 GB X25-M. With 4 X25-V you get ~500 MB 
/sec instead of ~ 140 MB / sec with the X25-M - much better value for the same 
price :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for an l2arc device?

2010-02-26 Thread Lutz Schumann
I use the Intel X25-V and I like it :)

Actually I have 2 in a striped setup. 

40 MB write / sec (just enough for ZIL filling),
something like 130 MB / sec reads. Just enough.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Ronny Egner
Hi Richard,

r...@openstorage:~# echo "swapfs_minfree/D" | mdb -k
swapfs_minfree:
swapfs_minfree: 2358757


2358757 pages * 4 KB = 9435028 KB ≈ 9.0 GB

So there is my memory :-)

If I read the documentation correctly this parameter has to be set in 
/etc/system?


Ronny
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Marc Nicholas
On Fri, Feb 26, 2010 at 2:42 PM, Lutz Schumann
wrote:

>
> Now if a virtual machine writes to the zvol, blocks are allocated on disk.
> Reads are now partial from disk (for all blocks written) and from ZFS layer
> (all unwritten blocks).
>
> If the virtual machine (which may be vmware / xen / hyperv) deletes blocks
> / frees space within the zvol, this also means a write - usually in meta
> data area only. Thus the underlying storage system does not know which
> blocks in a zvol are really used.
>

You're using VMs and *not* using dedupe?! VMs are almost the perfect
use-case for dedupe :)

-marc
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Miles Nordin
> "nw" == Nicolas Williams  writes:

nw> What could we do to make it easier to use ACLs?

1. how about AFS-style ones where the effective permission is the AND
   of the ACL and the unix permission?  You might have to combine this
   with an inheritable-by-subdirectories umask setting so you could
   create ACL-dominated lands of files that are all unix 777, but this
   would stop clobbering difficult-to-recreate ACL's as well as
   unintended information leaks.

2. define a standard API for them, add ability to replicate them to
   the GNU tools everyone else uses: GNUtar, rsync, and the fileutils
   (not the Solaris private versions full of weird options that can't
   handle large files or long pathnames, and not the Joerg Shilling
   tool), and *GET THE CHANGES MERGED UPSTREAM* so that as other OS's
   start supporting NFSv4, the same code is working over the ACL's
   everywhere.

Maybe we're beyond the point of no return for the first suggestion.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for an l2arc device?

2010-02-26 Thread Marc Nicholas
On Fri, Feb 26, 2010 at 2:43 PM, Brandon High  wrote:

> 
> The drives I'm considering are:
>
> OCZ Vertex 30GB
> Intel X25V 40GB
> Crucial CT64M225 64GB
>

Personally, I'd go with the Intel product...but save up a few more pennies
and get the X-25M. The extra boost on read and write performance is worth
it.

-marc
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Lutz Schumann
This would be an idea and I thought about this. However I see the following 
problems: 

1) using deduplication

This will reduce the on-disk size; however, the DDT will grow forever, and for the 
deletion of zvols this will mean a lot of time and work (see other threads 
regarding DDT memory issues on the list).

2) compression

As I understand it - if I do zfs send/receive (which we do for DR) data is 
grown to the original size again on the wire. This makes it difficult. 

Regards, 
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 11:42, Lutz Schumann wrote:

Idea:
   - If the guest writes a block with 0's only, the block is freed again
   - if someone reads this block again - it will get the same 0's it would get 
if the 0's would be written
   - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so 
the comparison for "is this a 0-only block" is easy.

With this in place, a host wishing to free thin provisioned zvol space can fill 
the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero 
of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.


You've just described how ZFS behaves when compression is enabled -- a 
block of zeros is compressed to a hole represented by an all-zeros block 
pointer.


> Does anyone know why this is not incorporated into ZFS ?

It's in there.  Turn on compression to use it.


- Bill




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Tomas Ögren
On 26 February, 2010 - Lutz Schumann sent me these 2,2K bytes:

> Hello list, 
> 
> ZFS can be used in both file level (zfs) and block level access (zvol). When 
> using zvols, those are always thin provisioned (space is allocated on first 
> write). We use zvols with comstar to do iSCSI and FC access - and excuse me in 
> advance - but this may also be a more comstar-related question, then.
> 
> When reading from a freshly created zvol, no data comes from disk. All reads 
> are satisfied by ZFS and comstar returns 0's (I guess) for all reads. 
> 
> Now if a virtual machine writes to the zvol, blocks are allocated on disk. 
> Reads are now partial from disk (for all blocks written) and from ZFS layer 
> (all unwritten blocks). 
> 
> If the virtual machine (which may be vmware / xen / hyperv) deletes blocks / 
> frees space within the zvol, this also means a write - usually in meta data 
> area only. Thus the underlying storage system does not know which blocks in 
> a zvol are really used.
> 
> So reducing size in zvols is really difficult / not possible. Even if one 
> deletes everything in the guest, the blocks stay allocated. If one zeros out all 
> blocks, even more space is allocated. 
> 
> For the same purpose TRIM (ATA) / PUNCH (SCSI) have been introduced. With 
> these commands the guest can tell the storage which blocks are not used anymore. 
> Those commands are not available in Comstar today :(
> 
> However I had the idea that comstar can get the same result in the way vmware 
> did it some time ago with "vmware tools". 
> 
> Idea: 
>   - If the guest writes a block with 0's only, the block is freed again
>   - if someone reads this block again - it will get the same 0's it would get 
> if the 0's would be written 
>    - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so 
> the comparison for "is this a 0-only block" is easy.
> 
> With this in place, a host wishing to free thin provisioned zvol space can 
> fill the unused blocks with 0s easily with simple tools (e.g. dd 
> if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on 
> the zvol side. 
> 
> Does anyone know why this is not incorporated into ZFS ?

What you can do until this is to enable compression (like lzjb) on the
zvol, then do your dd dance in the client, then you can disable the
compression again.
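
A minimal sketch of that sequence, assuming a zvol named tank/vol:

    zfs set compression=lzjb tank/vol
    # in the guest: dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE
    zfs set compression=off tank/vol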

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Recommendations for an l2arc device?

2010-02-26 Thread Brandon High
I'm considering adding an l2arc to my home system and was wondering if
anyone had recommendations. I've also considered slicing the drive and
using it for a zil and l2arc, but I don't think that my workload would
really benefit from a zil. I'm running a few VMs and serving up
content via CIFS mostly. I have dedup enabled, and the slow writes
(presumably from the DDT getting too large) have pushed me toward using
an l2arc.

I've got an 8 drive raidz2 pool, with ~ 3TB currently allocated. The
server has 8gb of memory. I'm not sure how much l2arc I need to hold
the DDT, but I imagine 30GB is enough.
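
One rough way to check, assuming a pool named tank (zdb prints the DDT
entry count and the per-entry on-disk and in-core sizes):

    zdb -D tank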

The drives I'm considering are:

OCZ Vertex 30GB
Intel X25V 40GB
Crucial CT64M225 64GB

The OCZ drive uses the Indilinx Barefoot controller and Intel 34nm
MLC. From what I've read, it performs very close to the Vertex drives
of similar capacity.

The Intel is based on their controller and Intel 34nm MLC. Write
performance is crippled compared to the "real" Intel drives, but reads
still seem good.

The Crucial drive is based on the Indilinx Barefoot controller and
Samsung NAND. It should have performance on par with other Indilinx
devices.

The OCZ looks like the best deal right now at $129, with a $40 rebate.
The Intel is $129, and the Crucial is $189.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Lutz Schumann
Hello list, 

ZFS can be used in both file level (zfs) and block level access (zvol). When 
using zvols, those are always thin provisioned (space is allocated on first 
write). We use zvols with comstar to do iSCSI and FC access - and excuse me in 
advance - but this may also be a more comstar-related question, then.

When reading from a freshly created zvol, no data comes from disk. All reads 
are satisfied by ZFS and comstar returns 0's (I guess) for all reads. 

Now if a virtual machine writes to the zvol, blocks are allocated on disk. 
Reads are now partial from disk (for all blocks written) and from ZFS layer 
(all unwritten blocks). 

If the virtual machine (which may be vmware / xen / hyperv) deletes blocks / 
frees space within the zvol, this also means a write - usually in meta data 
area only. Thus the underlying storage system does not know which blocks in a 
zvol are really used.

So reducing size in zvols is really difficult / not possible. Even if one 
deletes everything in the guest, the blocks stay allocated. If one zeros out all 
blocks, even more space is allocated. 

For the same purpose TRIM (ATA) / PUNCH (SCSI) have been introduced. With these 
commands the guest can tell the storage which blocks are not used anymore. 
Those commands are not available in Comstar today :(

However I had the idea that comstar can get the same result in the way vmware 
did it some time ago with "vmware tools". 

Idea: 
  - If the guest writes a block with 0's only, the block is freed again
  - if someone reads this block again - it will get the same 0's it would get if 
the 0's would be written 
   - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so 
the comparison for "is this a 0-only block" is easy.

With this in place, a host wishing to free thin provisioned zvol space can fill 
the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero 
of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side. 

Does anyone know why this is not incorporated into ZFS ?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 10:45, Paul B. Henson wrote:

I've already posited as to an approach that I think would make a pure-ACL
deployment possible:


http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html

Via this concept or something else, there needs to be a way to configure
ZFS to prevent the attempted manipulation of legacy permission mode bits
from breaking the security policy of the ACL.


I believe this proposal is sound.

In it, you wrote:


The feedback was that the internal Sun POSIX compliance police
wouldn't like that ;).


There are already per-filesystem tunables for ZFS which allow the
system to escape the confines of POSIX (noatime, for one); I don't see
why a "chmod doesn't truncate acls" option couldn't join it so long as
it was off by default and left off while conformance tests were run.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Richard Elling
explanation below...


On Feb 26, 2010, at 10:11 AM, Ronny Egner wrote:

> Hi,
> 
> please find below the requested information:
> 
> r...@openstorage:~# mdb -k
> Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp 
> rootnex scsi_vhci zfs sockfs ip hook neti sctp arp usba uhci fctl stmf md 
> lofs idm random mpt sd nfs crypto fcp fcip cpc smbsrv ufs logindmux ptm sppp 
> nsmb ii nsctl rdc sv sdbc ]
> 
>> ::zfs_params
> arc_reduce_dnlc_percent = 0x3
> zfs_arc_max = 0x0
> zfs_arc_min = 0x0
> arc_shrink_shift = 0x5
> zfs_mdcomp_disable = 0x0
> zfs_prefetch_disable = 0x0
> zfetch_max_streams = 0x8
> zfetch_min_sec_reap = 0x2
> zfetch_block_cap = 0x100
> zfetch_array_rd_sz = 0x10
> zfs_default_bs = 0x9
> zfs_default_ibs = 0xe
> metaslab_aliquot = 0x8
> mdb: variable reference_tracking_enable not found: unknown symbol name
> mdb: variable reference_history not found: unknown symbol name
> spa_max_replication_override = 0x3
> spa_mode_global = 0x3
> zfs_flags = 0x0
> mdb: variable zfs_txg_synctime not found: unknown symbol name
> zfs_txg_timeout = 0x1e
> zfs_write_limit_min = 0x200
> zfs_write_limit_max = 0x23fde5800
> zfs_write_limit_shift = 0x3
> zfs_write_limit_override = 0x0
> zfs_no_write_throttle = 0x0
> 

Note: "kstat -n arcstats" allows you to see these variables without
the need to be root or use mdb.

>> ::arc
> hits  = 682809234
> misses=  41519142
> demand_data_hits  =  26047450
> demand_data_misses=  17440267
> demand_metadata_hits  = 636130758
> demand_metadata_misses=  15436051
> prefetch_data_hits=  10328015
> prefetch_data_misses  =   8549656
> prefetch_metadata_hits=  10303011
> prefetch_metadata_misses  = 93168
> mru_hits  =  15961928
> mru_ghost_hits=287507
> mfu_hits  = 655313464
> mfu_ghost_hits=  14603118
> deleted   =  4395
> recycle_miss  =448696
> mutex_miss=572393
> evict_skip= 99942
> evict_l2_cached   = 0
> evict_l2_eligible = 421991499264
> evict_l2_ineligible   = 30080494080
> hash_elements =   5756890
> hash_elements_max =  10234471
> hash_collisions   =  51949950
> hash_chains   =   1583031
> hash_chain_max=19
> p = 32573 MB
> c = 42754 MB

target ARC size

> c_min =  9085 MB
> c_max = 72687 MB

upper limit to the target ARC size

> size  = 42754 MB

current size

So you can see that the ARC will allow itself to grow to 72GB - 1GB.
However, the current size is the same as the target size and far
less than target max, which can indicate...

> hdr_size  = 1105922976
> data_size = 43097086464
> other_size= 627925600
> l2_hits   = 0
> l2_misses = 0
> l2_feeds  = 0
> l2_rw_clash   = 0
> l2_read_bytes = 0
> l2_write_bytes= 0
> l2_writes_sent= 0
> l2_writes_done= 0
> l2_writes_error   = 0
> l2_writes_hdr_miss= 0
> l2_evict_lock_retry   = 0
> l2_evict_reading  = 0
> l2_free_on_write  = 0
> l2_abort_lowmem   = 0
> l2_cksum_bad  = 0
> l2_io_error   = 0
> l2_size   = 0
> l2_hdr_size   = 0
> memory_throttle_count = 0
> arc_no_grow   = 1

flag to indicate whether the ARC will try to grow. 

When it is set to 1, the ARC won't try to grow for another 60 
seconds. This indicates that the ARC was recently asked to 
reclaim space. Three conditions can cause this for x86:

1. pageout scanner is running (a small margin applies here)

2. swapfs does not have enough space so that anonymous 
  reservations can succeed.  This is calculated as:
	swapfs_minfree + swapfs_reserv + desfree
  on one of my machines with 2 GBytes of RAM, this limit is
  73,437 pages (300,797,952 bytes).
  hint: echo "swapfs_minfree/D" | mdb -k

3. [x86 only] kernel heap space is more than 75% allocated.
  IIRC, this is more of a problem for 32-bit systems.  Unless I'm
  mistaken, you can see the kernel heap arena size use ratio
  by looking at the "mem_inuse" to "mem_total" ratio via 
  "kstat -n heap"

 -- richard


> arc_tempreserve   = 0 MB
> arc_meta_used =  9977 MB
> arc_meta_limit= 18171 MB
> arc_meta_max  = 16326 MB
> 
> 
>> ::memstat
> Page Summary                Pages        MB   %Tot
> ------------     ----------------  --------  ----
> Kernel                    7676041     29984    41%
> ZFS File Data             8150818     31839    43%
> Anon                        31940       124     0%
> Exec and libs                1605         6     0%
> Page cache                   5726        22     0%
> Free (cachelist)           429684      1678     2%
> Free (freelist)           2574246     10055    14%
> 
> Total                    18870060     73711
> Physical                 18870059     73711
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Jason King wrote:

> Did you try adding:
>
>nfs4: mode = special
>vfs objects = zfsacl
>
> To the shares in smb.conf?  While we haven't done extensive work on
> S10, it appears to work well enough for our (limited) purposes (along
> with setting the acl properties to passthrough on the fs).

Yes, I've got that configuration. The ACL's are seen and manipulated from a
Windows client fine. The problem is that samba occasionally chmod's stuff,
which breaks the ACL. I'm not clear on exactly the circumstances, but I was
unable to make it stop. We disabled unix extensions and all of the
dos-attributes-to-mode-bits mappings, but it would still screw up ACL's as
things were copied or moved around.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Nicolas Williams wrote:

> Can you describe your struggles?  What could we do to make it easier to
> use ACLs?  Is this about chmod [and so random apps] clobbering ACLs? or
> something more fundamental about ACLs?

I understand and accept that ACL's are complicated, and have no issues with
that. My current struggle is that other than in a few restricted use cases,
they cannot be relied on to serve their purpose, as it is far too easy for
an accidental chmod (frequently in an unexpected and unnoticed context) to
wipe them out.

Even Solaris itself is guilty of such:


http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037249.html

If you're trying to use ACL's in a general purpose deployment involving
access by applications which are ACL-ignorant, and over NFS to other
operating systems which might not even have ACL's themselves, I do not
believe there is any way with the current implementation to do so
successfully. Something is going to run chmod on a file or directory, and
the ACL will be broken.

I've already posited as to an approach that I think would make a pure-ACL
deployment possible:


http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html

Via this concept or something else, there needs to be a way to configure
ZFS to prevent the attempted manipulation of legacy permission mode bits
from breaking the security policy of the ACL.

If anyone has thoughts on a different approach that would achieve the same
goal, I'd love to hear about it. But I'm not sure how you could do that as
long as the ACL is so easily mangled.

Thanks...

-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] application writes are blocked near the end of spa_sync

2010-02-26 Thread Bob Friesenhahn

On Fri, 26 Feb 2010, Shane Cox wrote:

 
I've reviewed the forum archives and read a number of threads related to this 
issue.  However I
didn't find a root-cause explanation for these pauses, only talk of how to 
ameliorate them.  In my
particular case, I would like to know why zfs_log_writes are blocked for 180ms 
on a mutex (seemingly
blocked on the intent log itself) when performing zil_itx_assign.  Another 
thread must have a lock on
the intent log, no?  Overall, the system appears healthy as other system calls 
(e.g., reads and
writes to network devices) complete successfully while writes to the intent log 
are blocked ... so
the problem seems to be access to the zfs intent log.
Any additional insight would be appreciated.


As far as I am aware, none of the zfs authors has been willing to 
address this issue in public.  It is not clear (to me) if the 
fundamental design of zfs transaction groups requires that writes stop 
briefly until the transaction group has been flushed to disk.  I 
suspect that this is the case.


Perhaps zfs will never meet your timing requirements.  Others here 
have had considerable success by using RAID interface adaptor cards 
with battery-backed cache memory and configuring those cards to "IT" 
JBOD mode.  By limiting the TXG size to the amount which will 
fit in battery-backed cache memory, the time to "commit" the TXG 
is dramatically reduced as long as the continual write rate does not 
exceed what the backing disks can sustain.  Unfortunately, this 
may increase the total amount of data written to underlying storage.
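
For example, in /etc/system (the value is in bytes and hypothetical; size
it to the card's cache):

* cap each txg at 256 MB so it fits in battery-backed cache
set zfs:zfs_write_limit_override = 0x10000000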


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Ronny Egner
Hi,

please find below the requested information:

r...@openstorage:~# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp 
rootnex scsi_vhci zfs sockfs ip hook neti sctp arp usba uhci fctl stmf md lofs 
idm random mpt sd nfs crypto fcp fcip cpc smbsrv ufs logindmux ptm sppp nsmb ii 
nsctl rdc sv sdbc ]

> ::zfs_params
arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x0
zfs_arc_min = 0x0
arc_shrink_shift = 0x5
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x10
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x8
mdb: variable reference_tracking_enable not found: unknown symbol name
mdb: variable reference_history not found: unknown symbol name
spa_max_replication_override = 0x3
spa_mode_global = 0x3
zfs_flags = 0x0
mdb: variable zfs_txg_synctime not found: unknown symbol name
zfs_txg_timeout = 0x1e
zfs_write_limit_min = 0x200
zfs_write_limit_max = 0x23fde5800
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0

> ::arc
hits  = 682809234
misses=  41519142
demand_data_hits  =  26047450
demand_data_misses=  17440267
demand_metadata_hits  = 636130758
demand_metadata_misses=  15436051
prefetch_data_hits=  10328015
prefetch_data_misses  =   8549656
prefetch_metadata_hits=  10303011
prefetch_metadata_misses  = 93168
mru_hits  =  15961928
mru_ghost_hits=287507
mfu_hits  = 655313464
mfu_ghost_hits=  14603118
deleted   =  4395
recycle_miss  =448696
mutex_miss=572393
evict_skip= 99942
evict_l2_cached   = 0
evict_l2_eligible = 421991499264
evict_l2_ineligible   = 30080494080
hash_elements =   5756890
hash_elements_max =  10234471
hash_collisions   =  51949950
hash_chains   =   1583031
hash_chain_max=19
p = 32573 MB
c = 42754 MB
c_min =  9085 MB
c_max = 72687 MB
size  = 42754 MB
hdr_size  = 1105922976
data_size = 43097086464
other_size= 627925600
l2_hits   = 0
l2_misses = 0
l2_feeds  = 0
l2_rw_clash   = 0
l2_read_bytes = 0
l2_write_bytes= 0
l2_writes_sent= 0
l2_writes_done= 0
l2_writes_error   = 0
l2_writes_hdr_miss= 0
l2_evict_lock_retry   = 0
l2_evict_reading  = 0
l2_free_on_write  = 0
l2_abort_lowmem   = 0
l2_cksum_bad  = 0
l2_io_error   = 0
l2_size   = 0
l2_hdr_size   = 0
memory_throttle_count = 0
arc_no_grow   = 1
arc_tempreserve   = 0 MB
arc_meta_used =  9977 MB
arc_meta_limit= 18171 MB
arc_meta_max  = 16326 MB


> ::memstat
Page Summary                Pages        MB   %Tot
------------     ----------------  --------  ----
Kernel                    7676041     29984    41%
ZFS File Data             8150818     31839    43%
Anon                        31940       124     0%
Exec and libs                1605         6     0%
Page cache                   5726        22     0%
Free (cachelist)           429684      1678     2%
Free (freelist)           2574246     10055    14%

Total                    18870060     73711
Physical                 18870059     73711
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 08:23:40AM -0800, Paul B. Henson wrote:
> So far it's been quite a struggle to deploy ACL's on an enterprise central
> file services platform with access via multiple protocols and have them
> actually be functional and reliable. I can see why the average consumer
> might give up.

Can you describe your struggles?  What could we do to make it easier to
use ACLs?  Is this about chmod [and so random apps] clobbering ACLs? or
something more fundamental about ACLs?

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to know the recordsize of a file

2010-02-26 Thread Richard Elling
comment below...

On Feb 25, 2010, at 5:34 PM, Jesus Cea wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 02/24/2010 11:42 PM, Robert Milkowski wrote:
>> mi...@r600:~# ls -li /bin/bash
>> 1713998 -r-xr-xr-x 1 root bin 799040 2009-10-30 00:41 /bin/bash
>> 
>> mi...@r600:~# zdb -v rpool/ROOT/osol-916 1713998
>> Dataset rpool/ROOT/osol-916 [ZPL], ID 302, cr_txg 6206087, 24.2G,
>> 1053147 objects
>> 
>>    Object  lvl   iblk   dblk  dsize  lsize   %full  type
>>   1713998    2    16K   128K   898K   896K  100.00  ZFS plain file
> 
> CUTE!.
> 
> Under Solaris 10U7 (can't upgrade the machine to U8 because
> incompatibilities between ZFS, Zones and Live Upgrade, but that is
> another issue), I have this:
> 
> """
> [r...@stargate-host /]# zdb -v
> datos/zones/stargate/dataset/correo/buzones 25
> Dataset datos/zones/stargate/dataset/correo/buzones [ZPL], ID 163,
> cr_txg 36887, 2.59G, 13 objects
> 
>    ZIL header: claim_txg 0, claim_seq 0 replay_seq 0, flags 0x0

>    TX_WRITE  len    952, txg 1885840, seq 414431
>    TX_WRITE  len   1680, txg 1885840, seq 414432
>    TX_WRITE  len   2008, txg 1885840, seq 414433
>    TX_WRITE  len   1400, txg 1885840, seq 414434
>    TX_WRITE  len   1296, txg 1885840, seq 414435
>    TX_WRITE  len   3080, txg 1885840, seq 414436
>    TX_WRITE  len    888, txg 1885840, seq 414437
>    TX_WRITE  len   7408, txg 1885840, seq 414438
>    TX_WRITE  len   9424, txg 1885840, seq 414439
>    TX_WRITE  len   7352, txg 1885840, seq 414440
>    TX_WRITE  len  13104, txg 1885840, seq 414441
>    Total     11
>    TX_WRITE  11
> 
> 
>    Object  lvl   iblk   dblk  lsize  asize  type
>        25    4    16K    16K  2.91G  2.52G  ZFS plain file
> """
> 
> The reply format is a little bit different. Could you explain the
> meaning of each field? "lvl", "iblk", etc.


ZFS uses a transactional object model, so at this level the discussion
is about objects, where file contents are a type of object. In source terms,
this is a dump of the DMU object information (the dmu_object_info struct):

Object = object number
lvl = indirection level
iblk = metadata block size
dblk = data block size (max, as used)
lsize = logical size (max block offset)
asize = physical size (data + metadata)
type = type of the object (dnode, plain file, directory contents, object array, 
etc.)

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Thomas Burgess wrote:

> I think most people are just confused by ACL's; I know I was when I first
> started using them.  Having said that, once I got them set correctly,
> they work very well for my CIFS shares.

Are you using the in-kernel CIFS server or samba? Are the files ever
accessed via NFS or local shell?


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Thomas Burgess
I think most people are just confused by ACL's; I know I was when I first
started using them.  Having said that, once I got them set correctly, they
work very well for my CIFS shares.


On Fri, Feb 26, 2010 at 11:23 AM, Paul B. Henson  wrote:

> On Fri, 26 Feb 2010, Darren J Moffat wrote:
>
> > Anyone sharing files over CIFS backed by ZFS is using ACLs, particularly
> > when there are only Windows clients.  There are a large number of
> > deployments, some very significant in size.
>
> If you're running the opensolaris in-kernel CIFS server, you avoid the
> POSIX compatibility layer and zfs does actually work in a pure ACL fashion.
> OTOH, under Solaris 10, I was unable to find a samba configuration that
> didn't result in some files being hit by a chmod and losing their ACL.
>
> > I doubt it is something people tend to talk about or publish blogs etc
> > on.  That is probably the main reason you can't "find" them.
>
> It's not like I'm typing "People who use ZFS ACL's" into google and nothing
> pops up, I'm inquiring in various forums generally populated by Solaris
> using people, in which typically a "Hey, who uses foo?" post finds a fair
> number of respondents. Given the dearth of responses, I can only conclude
> their use is not very widespread. The most frequent response so far has
> been along the lines of "ACL's suck. I wish they weren't there" 8-/.
>
> So far it's been quite a struggle to deploy ACL's on an enterprise central
> file services platform with access via multiple protocols and have them
> actually be functional and reliable. I can see why the average consumer
> might give up.
>
>
> --
> Paul B. Henson  |  (909) 979-6361  |  
> http://www.csupomona.edu/~henson/
> Operating Systems and Network Analyst  |  hen...@csupomona.edu
> California State Polytechnic University  |  Pomona CA 91768
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Darren J Moffat wrote:

> Anyone sharing files over CIFS backed by ZFS is using ACLs, particularly
> when there are only Windows clients.  There are a large number of
> deployments, some very significant in size.

If you're running the opensolaris in-kernel CIFS server, you avoid the
POSIX compatibility layer and zfs does actually work in a pure ACL fashion.
OTOH, under Solaris 10, I was unable to find a samba configuration that
didn't result in some files being hit by a chmod and losing their ACL.

> I doubt it is something people tend to talk about or publish blogs etc
> on.  That is probably the main reason you can't "find" them.

It's not like I'm typing "People who use ZFS ACL's" into google and nothing
pops up, I'm inquiring in various forums generally populated by Solaris
using people, in which typically a "Hey, who uses foo?" post finds a fair
number of respondents. Given the dearth of responses, I can only conclude
their use is not very widespread. The most frequent response so far has
been along the lines of "ACL's suck. I wish they weren't there" 8-/.

So far it's been quite a struggle to deploy ACL's on an enterprise central
file services platform with access via multiple protocols and have them
actually be functional and reliable. I can see why the average consumer
might give up.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Ethan Erchinger
I would probably tune lotsfree down as well. At 72G of ram currently it's 
probably reserving around 1.1GB of ram.

http://docs.sun.com/app/docs/doc/819-2724/6n50b07bk?a=view

Ethan

-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tomas Ögren
Sent: Friday, February 26, 2010 6:45 AM
To: Ronny Egner
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 
11 GB free memory

On 26 February, 2010 - Ronny Egner sent me these 0,6K bytes:

> Dear All,
> 
> our storage system running opensolaris b133 + ZFS has a lot of memory for 
> caching. 72 GB total. While testing we observed free memory never falls below 
> 11 GB.
> 
> Even if we create a ram disk free memory drops below 11 GB but will be 11 GB 
> shortly after (i assume ARC cache is shrunken in this context).
> 
> As far as i know ZFS is designed to use all memory except 1 GB for caching

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_init

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed


So you have a max limit which it won't try to go past, but also a "keep
this much free for the rest of the system". Both are a bit too
protective for a pure ZFS/NFS server in my opinion (but can be tuned).

You can check most variables with f.ex:
echo freemem/D | mdb -k


On one server here, I have in /etc/system:

* 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
* about 7.8*1024*1024*1024, must be < physmem*pagesize
* (2062222*4096=8446861312 right now)
set zfs:zfs_arc_max = 8350000000
set zfs:zfs_arc_meta_limit = 70
* some tuning
set ncsize = 50
set nfs:nrnode = 5


And I've done runtime modifications to swapfs_minfree to force usage of another
chunk of memory.


/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Paul B. Henson
On Fri, 26 Feb 2010, Ian Collins wrote:

> One of my clients makes extensive use of ACLs.  Some of them are so
> complex, I had to write them an application to interpret and manage them!

Yah, manipulating them directly isn't for the faint of heart ;). But it's
not too hard to abstract them to a simpler interface.

> They have a user base of around 1000, with a couple of hundred (!)
> groups.  Nearly all file access is through Samba.

How did you keep Samba from whacking the ACL's with chmod? I couldn't find
a configuration where some part of it didn't chmod something at some point.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Oscar del Rio

On 2/25/2010 10:24 PM, Paul B. Henson wrote:

The main ACL problem we're having now (having resolved most of
them, yay) is interaction with chmod() and legacy mode bits, and the
disappointing ease with which an undesired chmod can completely destroy an
ACL.


examples?
Are you using aclmode=passthrough and aclinherit=passthrough?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] application writes are blocked near the end of spa_sync

2010-02-26 Thread Shane Cox
Bob,

Thanks for your reply.  As you mentioned, adjusting the zfs tunables to
reduce the transaction group size yields shorter but more frequent pauses.
Unfortunately, this workaround doesn't sufficiently meet our needs (pauses
are still too long).

I've reviewed the forum archives and read a number of threads related to
this issue.  However I didn't find a root-cause explanation for these
pauses, only talk of how to ameliorate them.  In my particular case, I would
like to know why zfs_log_writes are blocked for 180ms on a mutex (seemingly
blocked on the intent log itself) when performing zil_itx_assign.  Another
thread must have a lock on the intent log, no?  Overall, the system appears
healthy as other system calls (e.g., reads and writes to network devices)
complete successfully while writes to the intent log are blocked ... so the
problem seems to be access to the zfs intent log.
Any additional insight would be appreciated.

Thanks
On Thu, Feb 25, 2010 at 8:47 PM, Bob Friesenhahn <
bfrie...@simple.dallas.tx.us> wrote:

> On Thu, 25 Feb 2010, Shane Cox wrote:
>
> I'm new to ZFS and looking for some assistance with a performance problem:
>>
>> At the interval of zfs_txg_timeout (I'm using the default of 30), I
>> observe 100-200ms
>> pauses in my application.  Based on my application log files, it appears
>> that the
>> write() system call is blocked.  Digging deeper into the problem with
>> DTrace, I
>>
>
> If you check the forum archives you will find quite a few long discussion
> threads about this issue.  I initiated at least one of them.
>
> When zfs writes a transaction group there will be a stall in an application
> which writes continuously.  The main thing you can do is to adjust zfs
> tunables to limit the size of a transaction group, or to increase the
> frequency of transaction group commits.  One such tunable is
>
>  zfs:zfs_write_limit_override
>
> set in /etc/system.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,
> http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Richard Elling
On Feb 26, 2010, at 5:46 AM, Ronny Egner wrote:
> Dear All,
> 
> our storage system running opensolaris b133 + ZFS has a lot of memory for 
> caching. 72 GB total. While testing we observed free memory never falls below 
> 11 GB.
> 
> Even if we create a ram disk free memory drops below 11 GB but will be 11 GB 
> shortly after (i assume ARC cache is shrunken in this context).
> 
> As far as i know ZFS is designed to use all memory except 1 GB for caching

In arcstat (or the kstats for the arc), the "c" value is the target for
"arcsz", which is the current size.
http://www.solarisinternals.com/wiki/index.php/Arcstat
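
For example, sampling every 5 seconds (field names as used by the arcstat
script):

    arcstat.pl -f time,arcsz,c 5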

You can track this over time to see how it reacts. However, if there is a 
decrease in "c," the cause is not seen from the ZFS perspective and
you will need to look for other sources of memory demand.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Jesse Reynolds
Ah, thanks Robert! 

Yes, I remember now. mailtmp was indeed a separate zfs dataset I created to use 
as the source for the mailbox migration I used when I built this server. It no 
longer exists, and is not needed. I see now that I can just remove the reference to it 
and all should hopefully be good with the world. 

It's booting!!! Woot!!!

For some reason I thought the <dataset> line was necessary to tell it 
where to find the root filesystem for this zone, but obviously this is a way to 
give the zone access to 'auxiliary' datasets. 

Thanks very much, everyone who helped this zones-on-OpenSolaris-with-ZFS dummy!

Jesse
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Enda O'Connor

On 26/02/2010 14:03, Jesse Reynolds wrote:

Hello

I have an amd64 server running OpenSolaris 2009-06. In December I created one 
container on this server named 'cpmail' with its own zfs dataset and it's been 
running ever since. Until earlier this evening when the server did a kernel 
panic and rebooted. Now, I can't see any contents in the zfs dataset for this 
zone!

The server has two disks which are root mirrored with ZFS:

# zpool status
   pool: rpool
  state: ONLINE
  scrub: none requested
config:

 NAME  STATE READ WRITE CKSUM
 rpool ONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c8t0d0s0  ONLINE   0 0 0
 c8t1d0s0  ONLINE   0 0 0

errors: No known data errors

Here are the datasets:

# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
rpool 161G  67.6G  79.5K  /rpool
rpool/ROOT   3.66G  67.6G19K  legacy
rpool/ROOT/opensolaris   3.66G  67.6G  3.51G  /
rpool/cpmail  139G  67.6G22K  /zones/cpmail
rpool/cpmail/ROOT 139G  67.6G19K  legacy
rpool/cpmail/ROOT/zbe 139G  67.6G   139G  legacy
rpool/dump   2.00G  67.6G  2.00G  -
rpool/export 7.64G  67.6G  7.49G  /export
rpool/export/home 150M  67.6G21K  /export/home
rpool/export/home/jesse   150M  67.6G   150M  /export/home/jesse
rpool/repo   6.56G  67.6G  6.56G  /rpool/repo
rpool/swap   2.00G  69.4G   130M  -

/zones/cpmail is where it should be mounting the zone's dataset, I believe.

Here's what happens when I try and start the zone:

# zoneadm -z cpmail boot
could not verify zfs dataset mailtmp: dataset does not exist
zoneadm: zone cpmail failed to verify


So the zone is trying to find a dataset 'mailtmp' and failing because it 
doesn't exist. So, what happened to it?

Here's the zone config file, at /etc/zones/cpmail.xml (with IP address 
obfuscated)

# cat /etc/zones/cpmail.xml
[zone configuration XML stripped by the list archive]


Not sure if the above looks correct to me; surely this should be 
rpool/mailtmp, assuming you don't have other pools it might live in. (What 
does zpool import say, by the way?)


Did this get added to a running zone, and then fail on reboot perhaps? I.e., 
to me this never worked.


Enda


I just don't understand where the dataset 'mailtmp' went to.  Perhaps it was an 
initial name I used for the dataset and I then renamed it to cpmail, but then I 
can't see any of the zones files in /zones/cpmail :

# find /zones/cpmail/
/zones/cpmail/
/zones/cpmail/dev
/zones/cpmail/root

Does ZFS store a log file of all operations applied to it? It feels like 
someone has gained access and run 'zfs destroy mailtmp' to me, but then again 
it could just be my own ineptitude.

Thank you
Jesse


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Tomas Ögren
On 26 February, 2010 - Ronny Egner sent me these 0,6K bytes:

> Dear All,
> 
> our storage system running opensolaris b133 + ZFS has a lot of memory for 
> caching. 72 GB total. While testing we observed free memory never falls below 
> 11 GB.
> 
> Even if we create a ram disk free memory drops below 11 GB but will be 11 GB 
> shortly after (i assume ARC cache is shrunken in this context).
> 
> As far as i know ZFS is designed to use all memory except 1 GB for caching

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_init

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed


So you have a max limit which it won't try to go past, but also a "keep
this much free for the rest of the system". Both are a bit too
protective for a pure ZFS/NFS server in my opinion (but can be tuned).

You can check most variables with f.ex:
echo freemem/D | mdb -k


On one server here, I have in /etc/system:

* 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
* about 7.8*1024*1024*1024, must be < physmem*pagesize
* (2062222*4096=8446861312 right now)
set zfs:zfs_arc_max = 8350000000
set zfs:zfs_arc_meta_limit = 70
* some tuning
set ncsize = 50
set nfs:nrnode = 5


And I've done runtime modifications to swapfs_minfree to force usage of another
chunk of memory.


/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Robert Milkowski

On 26/02/2010 14:03, Jesse Reynolds wrote:

Hello

I have an amd64 server running OpenSolaris 2009-06. In December I created one 
container on this server named 'cpmail' with its own zfs dataset and it's been 
running ever since. Until earlier this evening when the server did a kernel 
panic and rebooted. Now, I can't see any contents in the zfs dataset for this 
zone!

The server has two disks which are root mirrored with ZFS:

# zpool status
   pool: rpool
  state: ONLINE
  scrub: none requested
config:

 NAME  STATE READ WRITE CKSUM
 rpool ONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c8t0d0s0  ONLINE   0 0 0
 c8t1d0s0  ONLINE   0 0 0

errors: No known data errors

Here are the datasets:

# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
rpool 161G  67.6G  79.5K  /rpool
rpool/ROOT   3.66G  67.6G19K  legacy
rpool/ROOT/opensolaris   3.66G  67.6G  3.51G  /
rpool/cpmail  139G  67.6G22K  /zones/cpmail
rpool/cpmail/ROOT 139G  67.6G19K  legacy
rpool/cpmail/ROOT/zbe 139G  67.6G   139G  legacy
rpool/dump   2.00G  67.6G  2.00G  -
rpool/export 7.64G  67.6G  7.49G  /export
rpool/export/home 150M  67.6G21K  /export/home
rpool/export/home/jesse   150M  67.6G   150M  /export/home/jesse
rpool/repo   6.56G  67.6G  6.56G  /rpool/repo
rpool/swap   2.00G  69.4G   130M  -

/zones/cpmail is where it should be mounting the zone's dataset, I believe.

Here's what happens when I try and start the zone:

# zoneadm -z cpmail boot
could not verify zfs dataset mailtmp: dataset does not exist
zoneadm: zone cpmail failed to verify


So the zone is trying to find a dataset 'mailtmp' and failing because it 
doesn't exist. So, what happened to it?

Here's the zone config file, at /etc/zones/cpmail.xml (with IP address 
obfuscated)

# cat /etc/zones/cpmail.xml
[XML stripped by the list archive; per the boot error above it contains a 
dataset entry referencing 'mailtmp'.]

I just don't understand where the dataset 'mailtmp' went.  Perhaps it was an 
initial name I used for the dataset and I later renamed it to cpmail, but then 
I can't see any of the zone's files in /zones/cpmail:

# find /zones/cpmail/
/zones/cpmail/
/zones/cpmail/dev
/zones/cpmail/root

Does ZFS keep a log of all operations applied to it? It feels to me like 
someone gained access and ran 'zfs destroy mailtmp', but then again it could 
just be my own ineptitude.

Thank you
Jesse
   


mailtmp must have been a separate pool, and it was not the root fs for your zone.
Can you first try running 'zpool list; zpool import'? That won't do 
anything except list the pools that are imported and those that could be imported.


If it is still not there and you want to boot your zone without it anyway, 
make a copy of your cpmail.xml and edit the original, deleting the 
dataset line that references mailtmp.
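
Something like this, as a sketch (back up first; the zonecfg one-liner is an 
alternative to hand-editing the XML, assuming the resource is matchable by name):

zpool list                                        # pools currently imported
zpool import                                      # only scans for importable pools
cp /etc/zones/cpmail.xml /etc/zones/cpmail.xml.bak
zonecfg -z cpmail 'remove dataset name=mailtmp'   # or delete the line by hand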


Then your zone should boot.

--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Robert Milkowski

On 26/02/2010 13:46, Ronny Egner wrote:

Dear All,

Our storage system running OpenSolaris b133 + ZFS has a lot of memory for 
caching: 72 GB in total. While testing, we observed that free memory never 
falls below 11 GB.

Even if we create a RAM disk, free memory drops below 11 GB but is back at 
11 GB shortly after (I assume the ARC is shrunk in this case).

As far as I know, ZFS is designed to use all memory except 1 GB for caching.



Thanks in advance
   


Can you send the output of:

mdb -k
  ::zfs_params
  ::arc
  ::memstat

?


--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Robert Milkowski


On 64-bit platforms it is MAX(3/4 of memory, memory - 1 GB) by default.

So for a system with 72 GB it should be MAX(54 GB, 71 GB), which is 71 GB.
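
(You can confirm what a given box actually picked with the ::arc dcmd, e.g. 
'echo ::arc | mdb -k', and look for c_max.)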


On 26/02/2010 13:51, Thomas Burgess wrote:

Errr, I mean 3/4... I know it's some fraction anyway.


On Fri, Feb 26, 2010 at 8:49 AM, Thomas Burgess wrote:


I thought it was designed to use 2/3 of the available memory.



On Fri, Feb 26, 2010 at 8:46 AM, Ronny Egner <ronnyeg...@gmx.de> wrote:

Dear All,

Our storage system running OpenSolaris b133 + ZFS has a lot of
memory for caching: 72 GB in total. While testing, we observed
that free memory never falls below 11 GB.

Even if we create a RAM disk, free memory drops below 11 GB but
is back at 11 GB shortly after (I assume the ARC is shrunk in
this case).

As far as I know, ZFS is designed to use all memory except 1 GB
for caching.



Thanks in advance






Re: [zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Jesse Reynolds
Thanks Andrew! Sorry to be dumb. 

So, here's the whole history:

r...@marmoset:/rpool/repo/updatelog# zpool history rpool
History for 'rpool':
2009-11-22.21:10:34 zpool create -f rpool c8t0d0s0
2009-11-22.21:10:35 zfs set org.opensolaris.caiman:install=busy rpool
2009-11-22.21:10:36 zfs create -b 4096 -V 2047m rpool/swap
2009-11-22.21:10:37 zfs create -b 131072 -V 2047m rpool/dump
2009-11-22.21:10:40 zfs set mountpoint=/a/export rpool/export
2009-11-22.21:10:40 zfs set mountpoint=/a/export/home rpool/export/home
2009-11-22.21:10:40 zfs set mountpoint=/a/export/home/jesse 
rpool/export/home/jesse
2009-11-22.21:20:08 zpool set bootfs=rpool/ROOT/opensolaris rpool
2009-11-22.21:21:14 zfs set org.opensolaris.caiman:install=ready rpool
2009-11-22.21:21:31 zfs set mountpoint=/export/home/jesse 
rpool/export/home/jesse
2009-11-22.21:21:31 zfs set mountpoint=/export/home rpool/export/home
2009-11-22.21:21:32 zfs set mountpoint=/export rpool/export
2009-11-27.16:29:42 zfs create -o compression=on rpool/repo
2009-11-30.01:08:03 zfs set atime=off rpool/repo
2009-12-09.13:12:37 zfs create -p rpool/zones/cpmail
2009-12-09.13:14:50 zfs create -o mountpoint=/zones/cpmail rpool/cpmail
2009-12-09.13:15:24 zfs destroy rpool/zones/cpmail
2009-12-09.13:15:34 zfs destroy rpool/zones
2009-12-09.13:17:53 zfs create -o mountpoint=legacy -o zoned=on 
rpool/cpmail/ROOT
2009-12-09.13:17:53 zfs create -o org.opensolaris.libbe:active=on -o 
org.opensolaris.libbe:parentbe=2e4070c4-8df3-4b4d-fc14-94164d8a3dcc -o 
canmount=noauto rpool/cpmail/ROOT/zbe
2009-12-13.17:15:52 zfs snapshot rpool/cpmail@newmail
2009-12-14.00:05:36 zpool attach -f rpool c8t0d0s0 c8t1d0s0


The creation of the zfs filesystem for this zone is here: 

2009-12-09.13:17:53 zfs create -o mountpoint=legacy -o zoned=on 
rpool/cpmail/ROOT
2009-12-09.13:17:53 zfs create -o org.opensolaris.libbe:active=on -o 
org.opensolaris.libbe:parentbe=2e4070c4-8df3-4b4d-fc14-94164d8a3dcc -o 
canmount=noauto rpool/cpmail/ROOT/zbe


So, how do I verify that this is actually intact before fixing up the zone 
configuration so that it points at the right dataset?

I'm scared to run zonecfg to 'fix it up' to point to rpool/cpmail/ROOT in case 
it makes things worse. What's the best way to proceed here? And how did the 
zone manifest get out of sync with reality in the first place, I wonder?
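
Maybe something like this would be a safe way to look without changing 
anything (just guessing):

zfs list -r -t all rpool/cpmail                 # datasets plus any snapshots
zfs get -r zoned,canmount,mounted rpool/cpmail  # why nothing shows under /zones/cpmail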

Thank you
Jesse


Re: [zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Andrew Gabriel

Jesse Reynolds wrote:

Does ZFS keep a log of all operations applied to it? It feels to me like someone gained access and ran 'zfs destroy mailtmp', but then again it could just be my own ineptitude.


Yes...

zpool history rpool
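
If your build supports them, these add detail:

zpool history -l rpool   # long format: adds user, hostname and zone
zpool history -i rpool   # also show internally logged events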

--
Andrew Gabriel


[zfs-discuss] Help, my zone's dataset has disappeared!

2010-02-26 Thread Jesse Reynolds
Hello

I have an amd64 server running OpenSolaris 2009.06. In December I created one 
container on this server named 'cpmail' with its own ZFS dataset, and it had 
been running ever since, until earlier this evening when the server kernel 
panicked and rebooted. Now I can't see any contents in the ZFS dataset for 
this zone!

The server has two disks which are root mirrored with ZFS:

# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0

errors: No known data errors

Here are the datasets:

# zfs list
NAME                      USED  AVAIL  REFER  MOUNTPOINT
rpool                     161G  67.6G  79.5K  /rpool
rpool/ROOT               3.66G  67.6G    19K  legacy
rpool/ROOT/opensolaris   3.66G  67.6G  3.51G  /
rpool/cpmail              139G  67.6G    22K  /zones/cpmail
rpool/cpmail/ROOT         139G  67.6G    19K  legacy
rpool/cpmail/ROOT/zbe     139G  67.6G   139G  legacy
rpool/dump               2.00G  67.6G  2.00G  -
rpool/export             7.64G  67.6G  7.49G  /export
rpool/export/home         150M  67.6G    21K  /export/home
rpool/export/home/jesse   150M  67.6G   150M  /export/home/jesse
rpool/repo               6.56G  67.6G  6.56G  /rpool/repo
rpool/swap               2.00G  69.4G   130M  -

/zones/cpmail is where it should be mounting the zone's dataset, I believe. 

Here's what happens when I try and start the zone:

# zoneadm -z cpmail boot
could not verify zfs dataset mailtmp: dataset does not exist
zoneadm: zone cpmail failed to verify


So the zone is trying to find a dataset 'mailtmp' and failing because it 
doesn't exist. So, what happened to it? 

Here's the zone config file, at /etc/zones/cpmail.xml (with IP address 
obfuscated)

# cat /etc/zones/cpmail.xml
[XML stripped by the list archive; per the boot error above it contains a 
dataset entry referencing 'mailtmp'.]

I just don't understand where the dataset 'mailtmp' went.  Perhaps it was an 
initial name I used for the dataset and I later renamed it to cpmail, but then 
I can't see any of the zone's files in /zones/cpmail:

# find /zones/cpmail/
/zones/cpmail/
/zones/cpmail/dev
/zones/cpmail/root

Does ZFS keep a log of all operations applied to it? It feels to me like 
someone gained access and ran 'zfs destroy mailtmp', but then again it could 
just be my own ineptitude.

Thank you
Jesse


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Thomas Burgess
Errr, I mean 3/4... I know it's some fraction anyway.


On Fri, Feb 26, 2010 at 8:49 AM, Thomas Burgess  wrote:

> I thought it was designed to use 2/3 of the available memory.
>
>
>
> On Fri, Feb 26, 2010 at 8:46 AM, Ronny Egner  wrote:
>
>> Dear All,
>>
>> Our storage system running OpenSolaris b133 + ZFS has a lot of memory for
>> caching: 72 GB in total. While testing, we observed that free memory never
>> falls below 11 GB.
>>
>> Even if we create a RAM disk, free memory drops below 11 GB but is back at
>> 11 GB shortly after (I assume the ARC is shrunk in this case).
>>
>> As far as I know, ZFS is designed to use all memory except 1 GB for
>> caching.
>>
>>
>>
>> Thanks in advance


Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Thomas Burgess
I thought it was designed to use 2/3 of the available memory.


On Fri, Feb 26, 2010 at 8:46 AM, Ronny Egner  wrote:

> Dear All,
>
> Our storage system running OpenSolaris b133 + ZFS has a lot of memory for
> caching: 72 GB in total. While testing, we observed that free memory never
> falls below 11 GB.
>
> Even if we create a RAM disk, free memory drops below 11 GB but is back at
> 11 GB shortly after (I assume the ARC is shrunk in this case).
>
> As far as I know, ZFS is designed to use all memory except 1 GB for
> caching.
>
>
>
> Thanks in advance


[zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory

2010-02-26 Thread Ronny Egner
Dear All,

Our storage system running OpenSolaris b133 + ZFS has a lot of memory for 
caching: 72 GB in total. While testing, we observed that free memory never 
falls below 11 GB.

Even if we create a RAM disk, free memory drops below 11 GB but is back at 
11 GB shortly after (I assume the ARC is shrunk in this case).

As far as I know, ZFS is designed to use all memory except 1 GB for caching.



Thanks in advance


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Darren J Moffat

On 26/02/2010 00:56, Paul B. Henson wrote:


I've been surveying various forums looking for other places using ZFS ACL's
in production to compare notes and see how if at all they've handled some
of the issues we've found deploying them.


Anyone sharing files over CIFS backed by ZFS is using ACLs, particularly 
when there are only Windows clients. There are a large number of such 
deployments, some very significant in size.



So far, I haven't found anybody using them in any substantial way, let
alone trying to leverage them to allow a very large user population to have
highly flexible control over access to their data.


I doubt it is something people tend to talk about or blog about. That is 
probably the main reason you can't "find" them.


--
Darren J Moffat


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Ian Collins

Paul B. Henson wrote:

I've been surveying various forums looking for other places using ZFS ACL's
in production to compare notes and see how if at all they've handled some
of the issues we've found deploying them.

So far, I haven't found anybody using them in any substantial way, let
alone trying to leverage them to allow a very large user population to have
highly flexible control over access to their data.

Anyone here that has a non-negligible ACL deployment that would be
interested in discussing it?

  
One of my clients makes extensive use of ACLs.  Some of them are so 
complex, I had to write them an application to interpret and manage them!


They have a user base of around 1000, with a couple of hundred (!) 
groups.  Nearly all file access is through Samba.
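
For anyone who hasn't seen them, this is the flavour of ACL involved (path and 
user are made up):

ls -V /tank/share/report.doc                  # show the NFSv4-style ACL
chmod A+user:fred:read_data/write_data:allow /tank/share/report.doc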


--
Ian.



Re: [zfs-discuss] upgrading ZFS tools in opensolaris.com

2010-02-26 Thread Michael Schuster

On 02/26/10 09:36, Laurence wrote:

I'm probably getting this all wrong, but basically OpenSolaris 2009.06 (which is 
the latest ISO available, iirc) ships with snv_111b.
My problem is I have a borked zpool and could really use PSARC 2009/479 to fix 
it. The problem is that PSARC 2009/479 was only integrated recently and was 
released in solaris_nevada (snv_128).

Is there a safe way of bringing snv_128 to OpenSolaris?


Set your publisher to the /dev branch and run 'pkg image-update'; this will 
get you b133 (as long, of course, as the pool you borked isn't the root pool ;-)
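
Roughly (a sketch, assuming the dev repository still publishes b133):

pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
pkg image-update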


HTH
Michael
--
Michael Schuster        http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'


Re: [zfs-discuss] upgrading ZFS tools in opensolaris.com

2010-02-26 Thread Thomas Burgess
You can use one of the live CDs from genunix.


On Fri, Feb 26, 2010 at 3:36 AM, Laurence  wrote:

> I'm probably getting this all wrong, but basically OpenSolaris 2009.06
> (which is the latest ISO available, iirc) ships with snv_111b.
> My problem is I have a borked zpool and could really use PSARC 2009/479 to
> fix it. The problem is that PSARC 2009/479 was only integrated recently and
> was released in solaris_nevada (snv_128).
>
> Is there a safe way of bringing snv_128 to OpenSolaris?
>
> PSARC 2009/479 details:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667683


[zfs-discuss] upgrading ZFS tools in opensolaris.com

2010-02-26 Thread Laurence
I'm probably getting this all wrong, but basically OpenSolaris 2009.06 (which is 
the latest ISO available, iirc) ships with snv_111b.
My problem is I have a borked zpool and could really use PSARC 2009/479 to fix 
it. The problem is that PSARC 2009/479 was only integrated recently and was 
released in solaris_nevada (snv_128).

Is there a safe way of bringing snv_128 to OpenSolaris?

PSARC 2009/479 details: http://bugs.opensolaris.org/view_bug.do?bug_id=6667683