[zfs-discuss] ZFS write I/O stalls

2009-06-23 Thread Bob Friesenhahn
It has been quite some time (about a year) since I did testing of 
batch processing with my software (GraphicsMagick).  In the intervening 
time, ZFS added write-throttling.  I am using Solaris 10 with kernel 
141415-03.


Quite a while back I complained that ZFS was periodically stalling the 
writing process (which UFS did not do).  The ZFS write-throttling 
feature was supposed to avoid that.  In my testing today I am still 
seeing ZFS stall the writing process periodically.  When the process 
is stalled, there is a burst of disk activity, a burst of context 
switching, and total CPU use drops to almost zero. Zpool iostat says 
that read bandwidth is 15.8M and write bandwidth is 15.8M over a 60 
second averaging interval.  Since my drive array is good for writing 
over 250MB/second, this is a very small write load and the array is 
loafing.


My program uses the simple read->process->write approach.  Each file 
written (about 8MB/file) is written contiguously and written just 
once.  Data is read and written in 128K blocks.  For this application 
there is no value obtained by caching the file just written.  From 
what I am seeing, reading occurs as needed, but writes are being 
batched up until the next ZFS synchronization cycle.  During the ZFS 
synchronization cycle it seems that processes are blocked from 
writing. Since my system has a lot of memory and the ARC is capped at 
10GB, quite a lot of data can be queued up to be written.  The ARC is 
currently running at its limit of 10GB.


If I tell my software to invoke fsync() before closing each written 
file, then the stall goes away, but the program then needs to block so 
there is less beneficial use of the CPU.


If this application stall annoys me, I am sure that it would really 
annoy a user with mission-critical work which needs to get done on a 
uniform basis.


If I run this little script then the application runs more smoothly 
but I see evidence of many shorter stalls:


while true
do
  sleep 3
  sync
done

Is there a solution in the works for this problem?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-23 Thread milosz
is this a direct write to a zfs filesystem or is it some kind of zvol export?

anyway, sounds similar to this:

http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0

On Tue, Jun 23, 2009 at 7:14 PM, Bob
Friesenhahn wrote:
> It has been quite some time (about a year) since I did testing of batch
> processing with my software (GraphicsMagick).  In between time, ZFS added
> write-throttling.  I am using Solaris 10 with kernel 141415-03.
>
> Quite a while back I complained that ZFS was periodically stalling the
> writing process (which UFS did not do).  The ZFS write-throttling feature
> was supposed to avoid that.  In my testing today I am still seeing ZFS stall
> the writing process periodically.  When the process is stalled, there is a
> burst of disk activity, a burst of context switching, and total CPU use
> drops to almost zero. Zpool iostat says that read bandwidth is 15.8M and
> write bandwidth is 15.8M over a 60 second averaging interval.  Since my
> drive array is good for writing over 250MB/second, this is a very small
> write load and the array is loafing.
>
> My program uses the simple read->process->write approach.  Each file written
> (about 8MB/file) is written contiguously and written just once.  Data is
> read and written in 128K blocks.  For this application there is no value
> obtained by caching the file just written.  From what I am seeing, reading
> occurs as needed, but writes are being batched up until the next ZFS
> synchronization cycle.  During the ZFS synchronization cycle it seems that
> processes are blocked from writing. Since my system has a lot of memory and
> the ARC is capped at 10GB, quite a lot of data can be queued up to be
> written.  The ARC is currently running at its limit of 10GB.
>
> If I tell my software to invoke fsync() before closing each written file,
> then the stall goes away, but the program then needs to block so there is
> less beneficial use of the CPU.
>
> If this application stall annoys me, I am sure that it would really annoy a
> user with mission-critical work which needs to get done on a uniform basis.
>
> If I run this little script then the application runs more smoothly but I
> see evidence of many shorter stalls:
>
> while true
> do
>  sleep 3
>  sync
> done
>
> Is there a solution in the works for this problem?
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Tue, 23 Jun 2009, milosz wrote:


is this a direct write to a zfs filesystem or is it some kind of zvol export?


This is direct write to a zfs filesystem implemented as six mirrors of 
15K RPM 300GB drives on a Sun StorageTek 2500.  This setup tests very 
well under iozone and performs remarkably well when extracting from 
large tar files.



anyway, sounds similar to this:

http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0


Yes, this does sound very similar.  It looks to me like data from read 
files is clogging the ARC so that there is no more room for more 
writes when ZFS periodically goes to commit unwritten data.  The 
"Perfmeter" tool shows that almost all disk I/O occurs during a brief 
interval of time.  The storage array is capable of writing at high 
rates, but ZFS is coming at it with huge periodic writes which are 
surely much larger than what the array's internal buffering can 
handle.


What is clear to me is that my drive array is "loafing".  The 
application runs much slower than expected and zfs is to blame for 
this.  Observed write performance could be sustained by a single fast 
disk drive.  In fact, if I direct the output to a single SAS drive 
formatted with UFS, the observed performance is fairly similar except 
there are no stalls until iostat reports that the drive is extremely 
(close to 99%) busy.  When the UFS-formatted drive is reported to be 
60% busy (at 48MB/second), application execution is very smooth.  If a 
similar rate is sent to the ZFS pool (52.9MB/second according to zpool 
iostat) and the individual drives in the pool are reported to be 5 to 
33% busy (24-31% for 60 second average), then execution stutters for 
three seconds at a time as the 1.5GB to 3GB of "written" data which 
has been batched up is suddenly written.


Something else interesting I notice is that performance is not 
consistent over time:


% zpool iostat Sun_2540 60
              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540     460G  1.18T    368    447  45.7M  52.9M
Sun_2540     463G  1.18T    336    400  42.1M  47.5M
Sun_2540     465G  1.17T    341    400  42.6M  47.2M
Sun_2540     469G  1.17T    280    473  34.8M  55.9M
Sun_2540     472G  1.17T    286    449  35.5M  52.5M
Sun_2540     474G  1.17T    338    391  42.1M  45.7M
Sun_2540     477G  1.16T    332    400  41.3M  47.0M
Sun_2540     479G  1.16T    300    356  37.5M  41.4M
Sun_2540     482G  1.16T    314    381  39.3M  43.8M
Sun_2540     485G  1.15T    520    479  63.0M  55.9M
Sun_2540     490G  1.15T    564    722  67.3M  84.7M
Sun_2540     494G  1.15T    586    539  70.4M  63.1M
Sun_2540     499G  1.14T    549    698  66.9M  81.9M
Sun_2540     504G  1.14T    547    749  65.6M  87.7M
Sun_2540     507G  1.13T    584    495  70.8M  57.8M
Sun_2540     512G  1.13T    544    822  64.9M  91.1M
Sun_2540     516G  1.13T    596    527  72.0M  60.4M
Sun_2540     521G  1.12T    561    759  68.0M  87.2M
Sun_2540     526G  1.12T    548    779  65.9M  88.6M

A 2X variation in minute-to-minute performance while performing 
consistently similar operations is remarkable.  Also notice that the 
write data rates are gradually increasing (on average) even though 
the task being performed remains the same.


Here is a Perfmeter graph showing what is happening in normal 
operation:


http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-stalls.png

and here is one which shows what happens if fsync() is used to force 
the file data entirely to disk immediately after each file has been 
written:


http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-fsync.png

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Ethan Erchinger
> > http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0
> 
> Yes, this does sound very similar.  It looks to me like data from read
> files is clogging the ARC so that there is no more room for more
> writes when ZFS periodically goes to commit unwritten data.  

I'm wondering if changing txg_time to a lower value might help.
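
For example, something along these lines might be worth trying (untested sketch; 
I'm assuming txg_time is the same variable, in seconds, that Roch's 
write-throttle blog mentions, and that it can be poked on a live system with mdb):

# sketch: aim for roughly 1-second TXG syncs instead of ~5 (untested assumption)
echo txg_time/W0t1 | mdb -kw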
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Ethan Erchinger wrote:


http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0


Yes, this does sound very similar.  It looks to me like data from read
files is clogging the ARC so that there is no more room for more
writes when ZFS periodically goes to commit unwritten data.


I'm wondering if changing txg_time to a lower value might help.


There is no doubt that having ZFS sync the written data more often 
would help.  However, it should not be necessary to tune the OS for 
such a common task as batch processing a bunch of files.


A more appropriate solution is for ZFS to notice that more than XXX 
megabytes are uncommitted, so maybe it should wake up and go write 
some data.  It is useful for ZFS to defer data writes in case the same 
file is updated many times.  In the case where the same file is 
updated many times, the total uncommitted data is still limited by the 
amount of data which is re-written and so the 30 second cycle is fine. 
In my case the amount of uncommitted data is limited by available RAM 
and how fast my application is able to produce new data to write.


The problem is very much related to how fast the data is output.  If 
the new data is created at a slower rate (output files are smaller) 
then the problem just goes away.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Marcelo Leal
Hello Bob,
 I think that is related to my post about "zio_taskq_threads and TXG sync":
( http://www.opensolaris.org/jive/thread.jspa?threadID=105703&tstart=0 )
 Roch did say that this is on top of the performance problems, and in the same 
email I talked about the change from 5s to 30s, which I think makes this 
problem worse if this txg sync interval is "fixed".
 

 Leal
[ http://www.eall.com.br/blog ]
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Marcelo Leal wrote:


Hello Bob,
I think that is related with my post about "zio_taskq_threads and TXG sync ":
( http://www.opensolaris.org/jive/thread.jspa?threadID=105703&tstart=0 )
Roch did say that this is on top of the performance problems, and in 
the same email i did talk about the change from 5s to 30s, what i 
think makes this problem worst, if this txg sync interval be 
"fixed".


The problem is that basing disk writes on a simple timeout and 
available memory does not work.  It is easy for an application to 
write considerable amounts of new data in 30 seconds, or even 5 
seconds.  If the application blocks while the data is being committed, 
then the application is not performing any useful function during that 
time.


Current ZFS write behavior makes it not very useful for the creative 
media industries even though otherwise it should be a perfect fit, 
since hundreds of terabytes of working disk (or even petabytes) are 
normal for this industry.  For example, when data is captured to disk 
from film via a datacine (real time = 24 files/second and 6MB to 50MB 
per file), or captured to disk from a high-definition video camera, 
there is little margin for error and blocking on writes will result in 
missed frames or other malfunction.  Current ZFS write behavior is 
based on timing and the amount of system memory and it does not seem 
that throwing more storage hardware at the problem solves anything at 
all.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Ross
Wouldn't it make sense for the timing technique to be used if the data is 
coming in at a rate slower than the underlying disk storage?

But then if the data starts to come at a faster rate, ZFS needs to start 
streaming to disk as quickly as it can, and instead of re-ordering writes in 
blocks, it should just do the best it can with whatever is currently in memory. 
 And when that mode activates, inbound data should be throttled to match the 
current throughput to disk.

That preserves the efficient write ordering that ZFS was originally designed 
for, but means a more graceful degradation under load, with the system tending 
towards a steady state of throughput that matches what you would expect from 
other filesystems on those physical disks.

Of course, I have no idea how difficult this is technically.  But the idea 
seems reasonable to me.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Ian Collins

Bob Friesenhahn wrote:

On Wed, 24 Jun 2009, Marcelo Leal wrote:


Hello Bob,
I think that is related with my post about "zio_taskq_threads and TXG 
sync ":

( http://www.opensolaris.org/jive/thread.jspa?threadID=105703&tstart=0 )
Roch did say that this is on top of the performance problems, and in 
the same email i did talk about the change from 5s to 30s, what i 
think makes this problem worst, if this txg sync interval be "fixed".


The problem is that basing disk writes on a simple timeout and 
available memory does not work.  It is easy for an application to 
write considerable amounts of new data in 30 seconds, or even 5 
seconds.  If the application blocks while the data is being comitted, 
then the application is not performing any useful function during that 
time.


Current ZFS write behavior make it not very useful for the creative 
media industries even though otherwise it should be a perfect fit 
since hundreds of terrabytes of working disk (or even petabytes) are 
normal for this industry.  For example, when data is captured to disk 
from film via a datacine (real time = 24 files/second and 6MB to 50MB 
per file), or captured to disk from a high-definition video camera, 
there is little margin for error and blocking on writes will result in 
missed frames or other malfunction.  Current ZFS write behavior is 
based on timing and the amount of system memory and it does not seem 
that throwing more storage hardware at the problem solves anything at 
all.


I wonder whether a filesystem property "streamed" might be appropriate?  
This could act as a hint to ZFS that the data is sequential and should be 
streamed direct to disk.


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Ross wrote:

Wouldn't it make sense for the timing technique to be used if the 
data is coming in at a rate slower than the underlying disk storage?


I am not sure how zfs would know the rate of the underlying disk 
storage without characterizing it for a while with actual I/O. 
Regardless, buffering up to 3GB of data and then writing it all at 
once does not make sense no matter how fast the underlying disk 
storage can write.  It results in the I/O channel being 
completely clogged for 3-7 seconds.


But then if the data starts to come at a faster rate, ZFS needs to 
start streaming to disk as quickly as it can, and instead of 
re-ordering writes in blocks, it should just do the best it can with 
whatever is currently in memory.  And when that mode activates, 
inbound data should be throttled to match the current throughput to 
disk.


In my case, the data is produced at a continual rate (40-80MB/s). 
ZFS batches it up in a huge buffer for 30 seconds and then writes it 
all at once.  It is not clear to me if the writer is blocking, or if 
the reader is blocking due to ZFS's sudden huge use of the I/O 
channel.  I am sure that I could find the answer via dtrace.
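
For example, an untested sketch of the sort of probe I have in mind ("gm" here 
is just a placeholder for the actual GraphicsMagick process name):

# time the application's read() and write() calls to see which side stalls
dtrace -n '
syscall::read:entry,syscall::write:entry /execname == "gm"/
{ self->ts = timestamp; }
syscall::read:return,syscall::write:return /self->ts/
{ @[probefunc] = quantize(timestamp - self->ts); self->ts = 0; }'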


That preserves the efficient write ordering that ZFS was originally 
designed for, but means a more graceful degradation under load, with 
the system tending towards a steady state of throughput that matches 
what you would expect from other filesystems on those physical 
disks.


In this case the files are complete and ready to be written in optimum 
order.  Of course ZFS has no way to know that the application won't 
try to update them again.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Marcelo Leal
I think that is the purpose of the current implementation:
 http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
 But it seems like it is not that easy... as I understood what Roch said, it seems 
the cause is not always a "hardy" writer.

 Leal
[ http://www.eall.com.br/blog ]
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Marcelo Leal wrote:

I think that is the purpose of the current implementation: 
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle But seems 
like is not that easy... as i did understand what Roch said, seems 
like the cause is not always a "hardy" writer.


I see this:

"The new code keeps track of the amount of data accepted in a TXG and 
the time it takes to sync. It dynamically adjusts that amount so that 
each TXG sync takes about 5 seconds (txg_time variable). It also 
clamps the limit to no more than 1/8th of physical memory."


It is interesting that it was decided that a TXG sync should take 5 
seconds by default.  That does seem to be about what I am seeing here. 
There is no mention of the devastation to the I/O channel which occurs 
if the kernel writes 5 seconds worth of data (e.g. 2GB) as fast as 
possible on a system using mirroring (2GB becomes 4GB of writes).  If 
it writes 5 seconds of data as fast as possible, then it seems that 
this blocks any opportunity to read more data so that application 
processing can continue during the TXG sync.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Thu, 25 Jun 2009, Ian Collins wrote:


I wonder whether a filesystem property "streamed" might be appropriate?  This 
could act as hint to ZFS that the data is sequential and should be streamed 
direct to disk.


ZFS does not seem to offer an ability to stream direct to disk other 
than perhaps via the special "raw" mode known to database developers.


It seems that current ZFS behavior is "works as designed".  The write 
transaction time is currently tuned for 5 seconds and so it writes 
data intensely for 5 seconds while either starving the readers 
and/or blocking the writers.  Notice that by the end of the TXG write, 
zpool iostat is reporting zero reads:


% zpool iostat Sun_2540 1
              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540     456G  1.18T     14      0  1.86M      0
Sun_2540     456G  1.18T      0     19      0  1.47M
Sun_2540     456G  1.18T      0  3.11K      0   385M
Sun_2540     456G  1.18T      0  3.00K      0   385M
Sun_2540     456G  1.18T      0  3.34K      0   387M
Sun_2540     456G  1.18T      0  3.01K      0   386M
Sun_2540     458G  1.18T     19  1.87K  30.2K   220M
Sun_2540     458G  1.18T      0      0      0      0
Sun_2540     458G  1.18T    275      0  34.4M      0
Sun_2540     458G  1.18T    448      0  56.1M      0
Sun_2540     458G  1.18T    468      0  58.5M      0
Sun_2540     458G  1.18T    425      0  53.2M      0
Sun_2540     458G  1.18T    402      0  50.4M      0
Sun_2540     458G  1.18T    364      0  45.5M      0
Sun_2540     458G  1.18T    339      0  42.4M      0
Sun_2540     458G  1.18T    376      0  47.0M      0
Sun_2540     458G  1.18T    307      0  38.5M      0
Sun_2540     458G  1.18T    380      0  47.5M      0
Sun_2540     458G  1.18T    148  1.35K  18.3M   117M
Sun_2540     458G  1.18T     20  3.01K  2.60M   385M
Sun_2540     458G  1.18T     15  3.00K  1.98M   384M
Sun_2540     458G  1.18T      4  3.03K   634K   388M
Sun_2540     458G  1.18T      0  3.01K      0   386M
Sun_2540     460G  1.18T    142    792  15.8M  82.7M
Sun_2540     460G  1.18T    375      0  46.9M      0

Here is an interesting discussion thread on another list that I had 
not seen before:


http://opensolaris.org/jive/thread.jspa?messageID=347212

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Richard Elling

Bob Friesenhahn wrote:

On Wed, 24 Jun 2009, Marcelo Leal wrote:

I think that is the purpose of the current implementation: 
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle But seems 
like is not that easy... as i did understand what Roch said, seems 
like the cause is not always a "hardy" writer.


I see this:

"The new code keeps track of the amount of data accepted in a TXG and 
the time it takes to sync. It dynamically adjusts that amount so that 
each TXG sync takes about 5 seconds (txg_time variable). It also 
clamps the limit to no more than 1/8th of physical memory."


hmmm... methinks there is a chance that the 1/8th rule might not work so well 
for machines with lots of RAM and slow I/O.  I'm also reasonably sure that 
that sort of machine is not what Sun would typically build for performance 
lab testing, as a rule.  Hopefully Roch will comment when it is morning in 
Europe.

-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Richard Elling wrote:


"The new code keeps track of the amount of data accepted in a TXG and the 
time it takes to sync. It dynamically adjusts that amount so that each TXG 
sync takes about 5 seconds (txg_time variable). It also clamps the limit to 
no more than 1/8th of physical memory."


hmmm... methinks there is a chance that the 1/8th rule might not work so well
for machines with lots of RAM and slow I/O.  I'm also reasonably sure that
that sort of machine is not what Sun would typically build for performance 
lab testing, as a rule.  Hopefully Roch will comment when it is morning in 
Europe.


Slow I/O is relative.  If I install more memory does that make my I/O 
even slower?


I did some more testing.  I put the input data on a different drive 
and sent application output to the ZFS pool.  I no longer noticed any 
stalls in the execution even though the large ZFS flushes are taking 
place.  This proves that my application is seeing stalled reads rather 
than stalled writes.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-24 Thread Lejun Zhu
> On Wed, 24 Jun 2009, Richard Elling wrote:
> >> 
> >> "The new code keeps track of the amount of data
> accepted in a TXG and the 
> >> time it takes to sync. It dynamically adjusts that
> amount so that each TXG 
> >> sync takes about 5 seconds (txg_time variable). It
> also clamps the limit to 
> >> no more than 1/8th of physical memory."
> >
> > hmmm... methinks there is a chance that the 1/8th
> rule might not work so well
> > for machines with lots of RAM and slow I/O.  I'm
> also reasonably sure that
> > that sort of machine is not what Sun would
> typically build for performance 
> > lab
> > testing, as a rule.  Hopefully Roch will comment
> when it is morning in 
> > Europe.
> 
> Slow I/O is relative.  If I install more memory does
> that make my I/O 
> even slower?
> 
> I did some more testing.  I put the input data on a
> different drive 
> and sent application output to the ZFS pool.  I no
> longer noticed any 
> stalls in the execution even though the large ZFS
> flushes are taking 
> place.  This proves that my application is seeing
> stalled reads rather 
> than stalled writes.

There is a bug in the database about reads blocked by writes which may be 
related:

http://bugs.opensolaris.org/view_bug.do?bug_id=6471212

The symptom is sometimes reducing queue depth makes read perform better.

> 
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,
>http://www.GraphicsMagick.org/
> 
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-25 Thread Ross
> I am not sure how zfs would know the rate of the
> underlying disk storage 

Easy:  Is the buffer growing?  :-)

If the amount of data in the buffer is growing, you need to throttle back a bit 
until the disks catch up.  Don't stop writes until the buffer is empty, just 
slow them down to match the rate at which you're clearing data from the buffer.

In your case I'd expect to see ZFS buffer the early part of the write (so you'd 
see a very quick initial burst), but from then on you would want a continual 
stream of data to disk, at a steady rate.

To the client it should respond just like storing to disk; the only difference 
is there's actually a small delay before the data hits the disk, which will be 
proportional to the buffer size.  ZFS won't have so much opportunity to 
optimize writes, but you wouldn't get such stuttering performance.

However, reading through the other messages, if it's a known bug with ZFS 
blocking reads while writing, there may not be any need for this idea.  But 
then, that bug has been open since 2006, is flagged as fix in progress, and was 
planned for snv_51 o_0.  So it probably is worth having this discussion.

And I may be completely wrong here, but reading that bug, it sounds like ZFS 
issues a whole bunch of writes at once as it clears the buffer, which ties in 
with the experiences of stalling actually being caused by reads being blocked.

I'm guessing given ZFS's aims it made sense to code it that way - if you're 
going to queue a bunch of transactions to make them efficient on disk, you 
don't want to interrupt that batch with a bunch of other (less efficient) 
reads. 

But the unintended side effect of this is that ZFS's attempt to optimize writes 
will cause jerky read and write behaviour any time you have a large amount of 
writes going on, and when you should be pushing the disks to 100% usage you're 
never going to reach that as it's always going to have 5s of inactivity, 
followed by 5s of running the disks flat out.

In fact, I wonder if it's a simple as the disks ending up doing 5s of reads, a 
delay for processing, 5s of writes, 5s of reads, etc...

It's probably efficient, but it's going to *feel* horrible: a 5s delay is 
easily noticeable by the end user, and is a deal breaker for many applications.

In situations like that, 5s is a *huge* amount of time, especially so if you're 
writing to a disk or storage device which has its own caching!  Might it be 
possible to keep the 5s buffer for ordering transactions, but then commit that 
as a larger number of small transactions instead of one huge one?

The number of transactions could even be based on how busy the system is - if 
there are a lot of reads coming in, I'd be quite happy to split that into 50 
transactions.  On 10GbE, 5s is potentially 6.25GB of data.  Even split into 50 
transactions you're writing 128MB at a time, and that sounds plenty big enough 
to me!

Either way, something needs to be done.  If we move to ZFS our users are not 
going to be impressed with 5s delays on the storage system.

Finally, I do have one question for the ZFS guys:  How does the L2ARC interact 
with this?  Are reads from the L2ARC blocked, or will they happen in parallel 
with the writes to the main storage?  I suspect that a large L2ARC (potentially 
made up of SSD disks) would eliminate this problem the majority of the time.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-25 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Lejun Zhu wrote:


There is a bug in the database about reads blocked by writes which may be 
related:

http://bugs.opensolaris.org/view_bug.do?bug_id=6471212

The symptom is sometimes reducing queue depth makes read perform better.


This one certainly sounds promising.  Since Matt Ahrens has been 
working on it for almost a year, it must be almost fixed by now. :-)


I am not sure how queue depth is managed, but it seems possible to 
detect when reads are blocked by bulk writes and make some automatic 
adjustments to improve balance.
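
For anyone who wants to experiment by hand, I believe the per-vdev queue depth 
is governed by the zfs_vdev_max_pending tunable in this release; an untested 
sketch, with 10 chosen arbitrarily as a trial value:

# reduce the number of I/Os ZFS queues to each vdev (10 is an arbitrary guess)
echo zfs_vdev_max_pending/W0t10 | mdb -kw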


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-25 Thread Bob Friesenhahn

On Thu, 25 Jun 2009, Ross wrote:

But the unintended side effect of this is that ZFS's attempt to 
optimize writes will causes jerky read and write behaviour any time 
you have a large amount of writes going on, and when you should be 
pushing the disks to 100% usage you're never going to reach that as 
it's always going to have 5s of inactivity, followed by 5s of 
running the disks flat out.


In fact, I wonder if it's a simple as the disks ending up doing 5s 
of reads, a delay for processing, 5s of writes, 5s of reads, etc...


It's probably efficient, but it's going to *feel* horrible, a 5s 
delay is easily noticeable by the end user, and is a deal breaker 
for many applications.


Yes, 5 seconds is a long time.  For an application which mixes 
computation with I/O it is not really acceptable for read I/O to go 
away for up to 5 seconds.  This represents time that the CPU is not 
being used, and a time that the application may be unresponsive to the 
user.  When compression is used the impact is different, but the 
compression itself consumes considerable CPU (and quite abruptly) so 
that other applications (e.g. X11) stop responding during the 
compress/write cycle.


The read problem is one of congestion.  If I/O is congested with 
massive writes, then reads don't work.  It does not really matter how 
fast your storage system is.  If the 5 seconds of buffered writes are 
larger than what the device driver and storage system buffering allows 
for, then the I/O channel will be congested.


As an example, my storage array is demonstrated to be able to write 
359MB/second but ZFS will blast data from memory as fast as it can, 
and the storage path can not effectively absorb 1.8GB (359*5) of data 
since the StorageTek 2500's internal buffers are much smaller than 
that, and fiber channel device drivers are not allowed to consume much 
memory either.  To make matters worse, I am using ZFS mirrors so the 
amount of data written to the array in those five seconds is doubled 
to 3.6GB.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-29 Thread Bob Friesenhahn

On Wed, 24 Jun 2009, Lejun Zhu wrote:


There is a bug in the database about reads blocked by writes which may be 
related:

http://bugs.opensolaris.org/view_bug.do?bug_id=6471212

The symptom is sometimes reducing queue depth makes read perform better.


I have been banging away at this issue without resolution.  Based on 
Roch Bourbonnais's blog description of the ZFS write throttle code, it 
seems that I am facing a perfect storm.  Both the storage write 
bandwidth (800+ MB/second) and the memory size of my system (20 GB) 
result in the algorithm batching up 2.5 GB of user data to write. 
Since I am using mirrors, this results in 5 GB of data being written 
at full speed to the array on a very precise schedule since my 
application is processing fixed-sized files with a fixed algorithm. 
The huge writes lead to at least 3 seconds of read starvation, 
resulting in a stalled application and a square-wave of system CPU 
utilization.  I could attempt to modify my application to read ahead 
by 3 seconds but that would require gigabytes of memory, lots of 
complexity, and would not be efficient.


Richard Elling thinks that my array is pokey, but based on write speed 
and memory size, ZFS is always going to be batching up data to fill 
the write channel for 5 seconds so it does not really matter how fast 
that write channel is.  If I had 32GB of RAM and 2X the write speed, 
the situation would be identical.


Hopefully someone at Sun is indeed working this read starvation issue 
and it will be resolved soon.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-29 Thread Brent Jones
On Mon, Jun 29, 2009 at 2:48 PM, Bob
Friesenhahn wrote:
> On Wed, 24 Jun 2009, Lejun Zhu wrote:
>
>> There is a bug in the database about reads blocked by writes which may be
>> related:
>>
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6471212
>>
>> The symptom is sometimes reducing queue depth makes read perform better.
>
> I have been banging away at this issue without resolution.  Based on Roch
> Bourbonnais's blog description of the ZFS write throttle code, it seems that
> I am facing a perfect storm.  Both the storage write bandwidth (800+
> MB/second) and the memory size of my system (20 GB) result in the algorithm
> batching up 2.5 GB of user data to write. Since I am using mirrors, this
> results in 5 GB of data being written at full speed to the array on a very
> precise schedule since my application is processing fixed-sized files with a
> fixed algorithm. The huge writes lead to at least 3 seconds of read
> starvation, resulting in a stalled application and a square-wave of system
> CPU utilization.  I could attempt to modify my application to read ahead by
> 3 seconds but that would require gigabytes of memory, lots of complexity,
> and would not be efficient.
>
> Richard Elling thinks that my array is pokey, but based on write speed and
> memory size, ZFS is always going to be batching up data to fill the write
> channel for 5 seconds so it does not really matter how fast that write
> channel is.  If I had 32GB of RAM and 2X the write speed, the situation
> would be identical.
>
> Hopefully someone at Sun is indeed working this read starvation issue and it
> will be resolved soon.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

I see similar square-wave performance. However, my load is primarily
write-based; when those commits happen, I see all network activity
pause while the buffer is committed to disk.
I write about 750Mbit/sec over the network to the X4540's during
backup windows using primarily iSCSI. When those writes occur to my
RaidZ volume, all activity pauses until the writes are fully flushed.

One thing to note: on build 117, the effects are seemingly reduced and
performance is a bit more even, but the problem is still there.

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-29 Thread Lejun Zhu
> On Wed, 24 Jun 2009, Lejun Zhu wrote:
> 
> > There is a bug in the database about reads blocked
> by writes which may be related:
> >
> >
> http://bugs.opensolaris.org/view_bug.do?bug_id=6471212
> >
> > The symptom is sometimes reducing queue depth makes
> read perform better.
> 
> I have been banging away at this issue without
> resolution.  Based on 
> Roch Bourbonnais's blog description of the ZFS write
> throttle code, it 
> seems that I am facing a perfect storm.  Both the
> storage write 
> bandwidth (800+ MB/second) and the memory size of my
> system (20 GB) 
> result in the algorithm batching up 2.5 GB of user
> data to write. 

With the ZFS write throttle, the number 2.5GB is tunable. From what I've read in 
the code, it is possible to e.g. set zfs:zfs_write_limit_override = 0x8000000 
(bytes) to make it write 128M instead.
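
Presumably the equivalent persistent setting would be an /etc/system entry 
along these lines (sketch; the value is 128M expressed in bytes):

* sketch: cap the amount of dirty data accepted per TXG at 128 MB
set zfs:zfs_write_limit_override = 0x8000000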

> Since I am using mirrors, this results in 5 GB of
> data being written 
> at full speed to the array on a very precise schedule
> since my 
> application is processing fixed-sized files with a
> fixed algorithm. 
> The huge writes lead to at least 3 seconds of read
> starvation, 
> resulting in a stalled application and a square-wave
> of system CPU 
> utilization.  I could attempt to modify my
> application to read ahead 
> by 3 seconds but that would require gigabytes of
> memory, lots of 
> complexity, and would not be efficient.
> 
> Richard Elling thinks that my array is pokey, but
> based on write speed 
> and memory size, ZFS is always going to be batching
> up data to fill 
> the write channel for 5 seconds so it does not really
> matter how fast 
> that write channel is.  If I had 32GB of RAM and 2X
> the write speed, 
> the situation would be identical.
> 
> Hopefully someone at Sun is indeed working this read
> starvation issue 
> and it will be resolved soon.
> 
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,
>http://www.GraphicsMagick.org/
> 
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Ross
> backup windows using primarily iSCSI. When those
> writes occur to my RaidZ volume, all activity pauses until the writes
> are fully flushed.

The more I read about this, the worse it sounds.  The thing is, I can see where 
the ZFS developers are coming from - in theory this is a more efficient use of 
the disk, and with that being the slowest part of the system, there probably is 
a slight benefit in computational time.

However, it completely breaks any process like this that can't afford 3-5s 
delays in processing; it makes ZFS a nightmare for things like audio or video 
editing (where it would otherwise be a perfect fit), and it's also horrible 
from the perspective of the end user.

Does anybody know if a L2ARC would help this?  Does that work off a different 
queue, or would reads still be blocked?

I still think a simple solution to this could be to split the ZFS writes into 
smaller chunks.  That creates room for reads to be squeezed in (with the ratio 
of reads to writes something that should be automatically balanced by the 
software), but you still get the benefit of ZFS write ordering with all the 
work that's gone into perfecting that.  

Regardless of whether there are reads or not, your data is always going to be 
written to disk in an optimized fashion, and you could have a property on the 
pool that specifies how finely chopped up writes should be, allowing this to be 
easily tuned.

We're considering ZFS as storage for our virtualization solution, and this 
could be a big concern.  We really don't want the entire network pausing for 
3-5 seconds any time there is a burst of write activity.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Bob Friesenhahn

On Tue, 30 Jun 2009, Ross wrote:


However, it completely breaks any process like this that can't 
afford 3-5s delays in processing, it makes ZFS a nightmare for 
things like audio or video editing (where it would otherwise be a 
perfect fit), and it's also horrible from the perspective of the end 
user.


Yes.  I updated the image at 
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-stalls.png 
so that it shows the execution impact with more processes running. 
This is taken with three processes running in parallel so that there 
can be no doubt that I/O is being globally blocked and it is not just 
misbehavior of a single process.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Scott Meilicke
For what it is worth, I too have seen this behavior when load testing our zfs 
box. I used iometer and the RealLife profile (1 worker, 1 target, 65% reads, 
60% random, 8k, 32 IOs in the queue). When writes are being dumped, reads drop 
close to zero, from 600-700 read IOPS to 15-30 read IOPS.

zpool iostat data01 1

Where data01 is my pool name

              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data01      55.5G  20.4T    691      0  4.21M      0
data01      55.5G  20.4T    632      0  3.80M      0
data01      55.5G  20.4T    657      0  3.93M      0
data01      55.5G  20.4T    669      0  4.12M      0
data01      55.5G  20.4T    689      0  4.09M      0
data01      55.5G  20.4T    488  1.77K  2.94M  9.56M
data01      55.5G  20.4T     29  4.28K   176K  23.5M
data01      55.5G  20.4T     25  4.26K   165K  23.7M
data01      55.5G  20.4T     20  3.97K   133K  22.0M
data01      55.6G  20.4T    170  2.26K  1.01M  11.8M
data01      55.6G  20.4T    678      0  4.05M      0
data01      55.6G  20.4T    625      0  3.74M      0
data01      55.6G  20.4T    685      0  4.17M      0
data01      55.6G  20.4T    690      0  4.04M      0
data01      55.6G  20.4T    679      0  4.02M      0
data01      55.6G  20.4T    664      0  4.03M      0
data01      55.6G  20.4T    699      0  4.27M      0
data01      55.6G  20.4T    423  1.73K  2.66M  9.32M
data01      55.6G  20.4T     26  3.97K   151K  21.8M
data01      55.6G  20.4T     34  4.23K   223K  23.2M
data01      55.6G  20.4T     13  4.37K  87.1K  23.9M
data01      55.6G  20.4T     21  3.33K   136K  18.6M
data01      55.6G  20.4T    468    496  2.89M  1.82M
data01      55.6G  20.4T    687      0  4.13M      0

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Bob Friesenhahn

On Mon, 29 Jun 2009, Lejun Zhu wrote:


With ZFS write throttle, the number 2.5GB is tunable. From what I've 
read in the code, it is possible to e.g. set 
zfs:zfs_write_limit_override = 0x8000000 (bytes) to make it write 
128M instead.


This works, and the difference in behavior is profound.  Now it is a 
matter of finding the "best" value which optimizes both usability and 
performance.  A tuning for 384 MB:


# echo zfs_write_limit_override/W0t402653184 | mdb -kw
zfs_write_limit_override:   0x30000000  =   0x18000000

CPU is smoothed out quite a lot and write latencies (as reported by a 
zio_rw.d dtrace script) are radically different than before.
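
For anyone who wants to repeat the comparison, an untested sketch that steps 
through the same settings shown below (256 MB, 384 MB and 768 MB expressed in 
bytes; run as root):

for limit in 268435456 402653184 805306368
do
    echo "zfs_write_limit_override/W0t$limit" | mdb -kw
    zpool iostat Sun_2540 1 60    # watch a minute of 1-second samples
done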


Perfmeter display for 256 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-256mb.png

Perfmeter display for 384 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-384mb.png

Perfmeter display for 768 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-768mb.png

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Brent Jones
On Tue, Jun 30, 2009 at 12:25 PM, Bob
Friesenhahn wrote:
> On Mon, 29 Jun 2009, Lejun Zhu wrote:
>>
>> With ZFS write throttle, the number 2.5GB is tunable. From what I've read
>> in the code, it is possible to e.g. set zfs:zfs_write_limit_override =
>> 0x8000000 (bytes) to make it write 128M instead.
>
> This works, and the difference in behavior is profound.  Now it is a matter
> of finding the "best" value which optimizes both usability and performance.
>  A tuning for 384 MB:
>
> # echo zfs_write_limit_override/W0t402653184 | mdb -kw
> zfs_write_limit_override:       0x30000000      =       0x18000000
>
> CPU is smoothed out quite a lot and write latencies (as reported by a
> zio_rw.d dtrace script) are radically different than before.
>
> Perfmeter display for 256 MB:
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-256mb.png
>
> Perfmeter display for 384 MB:
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-384mb.png
>
> Perfmeter display for 768 MB:
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-768mb.png
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

Maybe there could be a supported ZFS tuneable (per file system even?)
that is optimized for 'background' tasks, or 'foreground'.

Beyond that, I will give this tuneable a shot and see how it impacts
my own workload.

Thanks!

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Bob Friesenhahn

On Tue, 30 Jun 2009, Brent Jones wrote:


Maybe there could be a supported ZFS tuneable (per file system even?)
that is optimized for 'background' tasks, or 'foreground'.

Beyond that, I will give this tuneable a shot and see how it impacts
my own workload.


Note that this issue does not apply at all to NFS service, database 
service, or any other usage which does synchronous writes.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Rob Logan

> CPU is smoothed out quite a lot
yes, but the area under the CPU graph is less, so the
rate of real work performed is less, so the entire
job took longer. (albeit "smoother")

Rob
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Ross
Interesting to see that it makes such a difference, but I wonder what effect it 
has on ZFS's write ordering, and its attempts to prevent fragmentation?

By reducing the write buffer, are you losing those benefits?

Although on the flip side, I guess this is no worse off than any other 
filesystem, and as SSD drives take off, fragmentation is going to be less and 
less of an issue.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Bob Friesenhahn

On Tue, 30 Jun 2009, Rob Logan wrote:


CPU is smoothed out quite a lot

yes, but the area under the CPU graph is less, so the
rate of real work performed is less, so the entire
job took longer. (allbeit "smoother")


For the purpose of illustration, the case showing the huge sawtooth 
was when running three processes at once.  The period/duration of the 
sawtooth was pretty similar, but the magnitude changed.


I agree that there is a size which provides the best balance of 
smoothness and application performance.  Probably the value should be 
dialed down to just below the point where the sawtooth occurs.


More at 11.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-06-30 Thread Scott Meilicke
> On Tue, 30 Jun 2009, Bob Friesenhahn wrote:
> 
> Note that this issue does not apply at all to NFS
> service, database 
> service, or any other usage which does synchronous
> writes.

I see read starvation with NFS. I was using iometer on a Windows VM, connecting 
to an NFS mount on a 2008.11 physical box. iometer params: 65% read, 60% 
random, 8k blocks, 32 outstanding IO requests, 1 worker, 1 target.

NFS Testing
              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data01      59.6G  20.4T     46     24   757K  3.09M
data01      59.6G  20.4T     39     24   593K  3.09M
data01      59.6G  20.4T     45     25   687K  3.22M
data01      59.6G  20.4T     45     23   683K  2.97M
data01      59.6G  20.4T     33     23   492K  2.97M
data01      59.6G  20.4T     16     41   214K  1.71M
data01      59.6G  20.4T      3  2.36K  53.4K  30.4M
data01      59.6G  20.4T      1  2.23K  20.3K  29.2M
data01      59.6G  20.4T      0  2.24K  30.2K  28.9M
data01      59.6G  20.4T      0  1.93K  30.2K  25.1M
data01      59.6G  20.4T      0  2.22K      0  28.4M
data01      59.7G  20.4T     21    295   317K  4.48M
data01      59.7G  20.4T     32     12   495K  1.61M
data01      59.7G  20.4T     35     25   515K  3.22M
data01      59.7G  20.4T     36     11   522K  1.49M
data01      59.7G  20.4T     33     24   508K  3.09M
data01      59.7G  20.4T     35     23   536K  2.97M
data01      59.7G  20.4T     32     23   483K  2.97M
data01      59.7G  20.4T     37     37   538K  4.70M

While writes are being committed to the ZIL all the time, periodic dumping to 
the pool still occurs, and during those times reads are starved. Maybe this 
doesn't happen in the 'real world'?

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-01 Thread Bob Friesenhahn
Even if I set zfs_write_limit_override to 8053063680 I am unable to 
achieve the massive writes that Solaris 10 (141415-03) sends to my 
drive array by default.


When I read the blog entry at 
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle, I see this 
statement:


"The new code keeps track of the amount of data accepted in a TXG and 
the time it takes to sync. It dynamically adjusts that amount so that 
each TXG sync takes about 5 seconds (txg_time variable). It also 
clamps the limit to no more than 1/8th of physical memory."


On my system I see that the "about 5 seconds" rule is being followed, 
but see no sign of clamping the limit to no more than 1/8th of 
physical memory.  There is no sign of clamping at all.  The written 
data is captured and does take about 5 seconds to write (good 
estimate).


On my system with 20GB of RAM, and ARC memory limit set to 10GB 
(zfs:zfs_arc_max = 0x280000000), the maximum zfs_write_limit_override 
value I can set is on the order of 8053063680, yet this results in a 
much smaller amount of data being written per write cycle than the 
Solaris 10 default operation.  The default operation is 24 seconds of 
no write activity followed by 5 seconds of write.


On my system, 1/8 of memory would be 2.5GB.  If I set the 
zfs_write_limit_override value to 2684354560 then it seems that about 
1.2 seconds of data is captured for write.  In this case I see 5 
seconds of no write followed by maybe a second of write.


This causes me to believe that the algorithm is not implemented as 
described in Solaris 10.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-01 Thread Marcelo Leal
> 
> Note that this issue does not apply at all to NFS
> service, database 
> service, or any other usage which does synchronous
> writes.
> 
> Bob
 Hello Bob,
 There is an impact for "all" workloads.
 Whether the write is sync or not is just a question of whether it goes to the 
slog (SSD) or not.
 But the txg interval and sync time are the same. Actually, the ZIL code exists 
precisely to preserve that same behavior for synchronous writes.

 Leal
[ http://www.eall.com.br/blog ]
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-01 Thread Zhu, Lejun
Actually it seems to be 3/4:

dsl_pool.c
    391         zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
    392         zfs_write_limit_inflated = MAX(zfs_write_limit_min,
    393             spa_get_asize(dp->dp_spa, zfs_write_limit_max));

While spa_get_asize is:

spa_misc.c
   1249 uint64_t
   1250 spa_get_asize(spa_t *spa, uint64_t lsize)
   1251 {
   1252         /*
   1253          * For now, the worst case is 512-byte RAID-Z blocks, in which
   1254          * case the space requirement is exactly 2x; so just assume that.
   1255          * Add to this the fact that we can have up to 3 DVAs per bp, and
   1256          * we have to multiply by a total of 6x.
   1257          */
   1258         return (lsize * 6);
   1259 }

Which will result in:
   zfs_write_limit_inflated = MAX((32 << 20), (ptob(physmem) >> 3) * 6);
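
For a machine with 20 GB of RAM, as in Bob's case, that evaluates to roughly 
15 GB rather than 2.5 GB; a quick check with bc:

$ echo '20*1024/8*6' | bc    # (physmem >> 3) * 6, in megabytes
15360                        # i.e. 15 GB, which is 3/4 of physical memory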

Bob Friesenhahn wrote:
> Even if I set zfs_write_limit_override to 8053063680 I am unable to
> achieve the massive writes that Solaris 10 (141415-03) sends to my
> drive array by default.
> 
> When I read the blog entry at
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle, I see this
> statement:
> 
> "The new code keeps track of the amount of data accepted in a TXG and
> the time it takes to sync. It dynamically adjusts that amount so that
> each TXG sync takes about 5 seconds (txg_time variable). It also
> clamps the limit to no more than 1/8th of physical memory."
> 
> On my system I see that the "about 5 seconds" rule is being followed,
> but see no sign of clamping the limit to no more than 1/8th of
> physical memory.  There is no sign of clamping at all.  The written
> data is captured and does take about 5 seconds to write (good
> estimate).
> 
> On my system with 20GB of RAM, and ARC memory limit set to 10GB
> (zfs:zfs_arc_max = 0x280000000), the maximum zfs_write_limit_override
> value I can set is on the order of 8053063680, yet this results in a
> much smaller amount of data being written per write cycle than the
> Solaris 10 default operation.  The default operation is 24 seconds of
> no write activity followed by 5 seconds of write.
> 
> On my system, 1/8 of memory would be 2.5GB.  If I set the
> zfs_write_limit_override value to 2684354560 then it seems that about
> 1.2 seconds of data is captured for write.  In this case I see 5
> seconds of no write followed by maybe a second of write.
> 
> This causes me to believe that the algorithm is not implemented as
> described in Solaris 10.
> 
> Bob
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-02 Thread Bob Friesenhahn

On Thu, 2 Jul 2009, Zhu, Lejun wrote:


Actually it seems to be 3/4:


3/4 is an awful lot.  That would be 15 GB on my system, which explains 
why the "5 seconds to write" rule is dominant.


It seems that both rules are worthy of reconsideration.

There is also still the little problem that ZFS is incapable of reading 
during all/much of the time it is syncing a TXG.  Even if the TXG is 
written more often, readers will still block, resulting in a similar 
cumulative effect on performance.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Victor Latushkin

On 02.07.09 22:05, Bob Friesenhahn wrote:

On Thu, 2 Jul 2009, Zhu, Lejun wrote:


Actually it seems to be 3/4:


3/4 is an awful lot.  That would be 15 GB on my system, which explains 
why the "5 seconds to write" rule is dominant.


3/4 is 1/8 * 6, where 6 is the worst-case inflation factor (for raidz2 it 
is actually 9, and considering a ganged 1k block on raidz2 in a really bad 
case it should be even bigger than that).  The DSL inflates write sizes 
too, so inflated write sizes are compared against an inflated limit, so it 
should be fine.
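
In other words (a back-of-the-envelope illustration of where the factors 
come from, not the actual spa_get_asize() code):

#include <stdio.h>

int main(void)
{
        int dvas_per_bp = 3;          /* up to 3 copies of each block pointer */

        /* worst case for a 512-byte data block */
        int raidz1_sectors = 1 + 1;   /* data + single parity -> 2x space */
        int raidz2_sectors = 1 + 2;   /* data + double parity -> 3x space */

        printf("raidz1 worst-case inflation: %dx\n", raidz1_sectors * dvas_per_bp); /* 6 */
        printf("raidz2 worst-case inflation: %dx\n", raidz2_sectors * dvas_per_bp); /* 9 */
        return 0;
}

With the 6x factor, the 1/8th memory limit becomes 6/8 = 3/4 of physical 
memory.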


victor
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Tristan Ball

Is the system otherwise responsive during the zfs sync cycles?

I ask because I think I'm seeing a similar thing - except that it's not 
only other writers that block, it seems like other interrupts are 
blocked.  Pinging my ZFS server at 1s intervals results in large delays 
while the system syncs, followed by normal response times while the 
system buffers more input...


Thanks,
   Tristan.

Bob Friesenhahn wrote:
It has been quite some time (about a year) since I did testing of 
batch processing with my software (GraphicsMagick).  In between time, 
ZFS added write-throttling.  I am using Solaris 10 with kernel 141415-03.


Quite a while back I complained that ZFS was periodically stalling the 
writing process (which UFS did not do).  The ZFS write-throttling 
feature was supposed to avoid that.  In my testing today I am still 
seeing ZFS stall the writing process periodically.  When the process 
is stalled, there is a burst of disk activity, a burst of context 
switching, and total CPU use drops to almost zero. Zpool iostat says 
that read bandwidth is 15.8M and write bandwidth is 15.8M over a 60 
second averaging interval.  Since my drive array is good for writing 
over 250MB/second, this is a very small write load and the array is 
loafing.


My program uses the simple read->process->write approach.  Each file 
written (about 8MB/file) is written contiguously and written just 
once.  Data is read and written in 128K blocks.  For this application 
there is no value obtained by caching the file just written.  From 
what I am seeing, reading occurs as needed, but writes are being 
batched up until the next ZFS synchronization cycle.  During the ZFS 
synchronization cycle it seems that processes are blocked from 
writing. Since my system has a lot of memory and the ARC is capped at 
10GB, quite a lot of data can be queued up to be written.  The ARC is 
currently running at its limit of 10GB.


If I tell my software to invoke fsync() before closing each written 
file, then the stall goes away, but the program then needs to block so 
there is less beneficial use of the CPU.


If this application stall annoys me, I am sure that it would really 
annoy a user with mission-critical work which needs to get done on a 
uniform basis.


If I run this little script then the application runs more smoothly 
but I see evidence of many shorter stalls:


while true
do
  sleep 3
  sync
done

Is there a solution in the works for this problem?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Damon Atkins


With regard to http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

I would have thought that if you have enough data to be written, it is 
worth just writing it, rather than waiting X seconds or trying to adjust 
things so that each sync only takes 5 seconds.


For example, different disk buses have different transfer sizes, e.g. (if I 
remember correctly) Fibre Channel has a 2MB packet size.  If you have 2MB 
that can be written to a single disk/LUN, why not just write it straight away?


If the transaction group metadata/log reaches a certain size (say 128k??), 
why not write the TXG?  If the transaction group metadata/log is estimated 
to take more than X ms to sync, why not write the TXG?  (Assume reads are 
stopped while this happens, to prevent large pauses.)


If any single file has 2MB?? of outstanding data to be written, why not do a 
TXG and stall that process's write thread until the data is written?  That 
is, to prevent the situation described in the blog: "And to avoid the system 
wide and seconds long throttle effect, the new code will detect when we are 
dangerously close to that situation (7/8th of the limit) and will **insert 1 
tick** delays for applications issuing writes. This prevents a write 
intensive thread from hogging the available space starving out other 
threads. This delay should also generally prevent the system wide throttle." 
A write limit on an individual thread/file would prevent a single file from 
filling up the ARC?


By "write the TXG" I mean close the existing open TXG and place it into the 
quiescing state, ready for syncing.


So the real question is: why wait and give the system the chance to stall?  
If there is enough data queued to let the target disk(s) perform at optimal 
performance (i.e. a decent write I/O size), why not write the data out to 
the disks straight away, even if the quiescing state needed to be turned 
into a small FIFO queue?
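
Roughly what I am suggesting, as an untested sketch; the names and 
thresholds here (txg_note_dirty, TXG_CUT_BYTES and so on) are invented for 
illustration and are not real ZFS tunables or functions:

#include <stdint.h>
#include <stdio.h>

/* Invented thresholds, purely for illustration. */
#define TXG_CUT_BYTES      (64ULL << 20)   /* enough queued for an efficient write burst */
#define PER_FILE_CUT_BYTES (2ULL << 20)    /* 2MB outstanding on a single file */

/* Called (conceptually) as dirty data accumulates in the open TXG. */
static void
txg_note_dirty(uint64_t txg_dirty_bytes, uint64_t file_dirty_bytes)
{
        /* Enough data for an efficient burst: close the TXG and let it
         * move to quiescing/syncing instead of waiting for the timer. */
        if (txg_dirty_bytes >= TXG_CUT_BYTES)
                printf("close the open TXG now (%llu bytes queued)\n",
                    (unsigned long long)txg_dirty_bytes);

        /* One file hogging the limit: stall just that writer rather than
         * throttling the whole system. */
        if (file_dirty_bytes >= PER_FILE_CUT_BYTES)
                printf("stall only this writer (%llu bytes on one file)\n",
                    (unsigned long long)file_dirty_bytes);
}

int main(void)
{
        txg_note_dirty(10ULL << 20, 1ULL << 20);   /* keep accumulating */
        txg_note_dirty(80ULL << 20, 3ULL << 20);   /* cut early and throttle the hog */
        return 0;
}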


Back to the Tennis (Wimbledon)

Cheers

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Tristan Ball

Red herring...

Actually, I had compression=gzip-9 enabled on that filesystem, which is 
apparently too much for the old Xeons in that server (it's a Dell 
1850).  The CPU was sitting at 100% kernel time while it tried to 
compress + sync.


Switching to compression=off or compression=on (lzjb) makes the problem 
go away.


Interestingly, creating a second processor set also alleviates many of 
the symptoms - certainly the slow ping goes away.  Assigning the ssh + 
shell session I had open on the machine during these runs to the second 
set restores responsiveness to that too; it appears that all the 
compression happens in processor set 0.


Regards,
   Tristan.

Tristan Ball wrote:

Is the system otherwise responsive during the zfs sync cycles?

I ask because I think I'm seeing a similar thing - except that it's 
not only other writers that block , it seems like other interrupts are 
blocked. Pinging my zfs server in 1s intervals results in large delays 
while the system syncs, followed by normal response times while the 
system buffers more input...


Thanks,
   Tristan.

Bob Friesenhahn wrote:
It has been quite some time (about a year) since I did testing of 
batch processing with my software (GraphicsMagick).  In between time, 
ZFS added write-throttling.  I am using Solaris 10 with kernel 
141415-03.


Quite a while back I complained that ZFS was periodically stalling 
the writing process (which UFS did not do).  The ZFS write-throttling 
feature was supposed to avoid that.  In my testing today I am still 
seeing ZFS stall the writing process periodically.  When the process 
is stalled, there is a burst of disk activity, a burst of context 
switching, and total CPU use drops to almost zero. Zpool iostat says 
that read bandwidth is 15.8M and write bandwidth is 15.8M over a 60 
second averaging interval.  Since my drive array is good for writing 
over 250MB/second, this is a very small write load and the array is 
loafing.


My program uses the simple read->process->write approach.  Each file 
written (about 8MB/file) is written contiguously and written just 
once.  Data is read and written in 128K blocks.  For this application 
there is no value obtained by caching the file just written.  From 
what I am seeing, reading occurs as needed, but writes are being 
batched up until the next ZFS synchronization cycle.  During the ZFS 
synchronization cycle it seems that processes are blocked from 
writing. Since my system has a lot of memory and the ARC is capped at 
10GB, quite a lot of data can be queued up to be written.  The ARC is 
currently running at its limit of 10GB.


If I tell my software to invoke fsync() before closing each written 
file, then the stall goes away, but the program then needs to block 
so there is less beneficial use of the CPU.


If this application stall annoys me, I am sure that it would really 
annoy a user with mission-critical work which needs to get done on a 
uniform basis.


If I run this little script then the application runs more smoothly 
but I see evidence of many shorter stalls:


while true
do
  sleep 3
  sync
done

Is there a solution in the works for this problem?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Bob Friesenhahn

On Fri, 3 Jul 2009, Victor Latushkin wrote:


On 02.07.09 22:05, Bob Friesenhahn wrote:

On Thu, 2 Jul 2009, Zhu, Lejun wrote:


Actually it seems to be 3/4:


3/4 is an awful lot.  That would be 15 GB on my system, which explains why 
the "5 seconds to write" rule is dominant.


3/4 is 1/8 * 6, where 6 is the worst-case inflation factor (for raidz2 it is 
actually 9, and considering a ganged 1k block on raidz2 in a really bad case 
it should be even bigger than that).  The DSL inflates write sizes too, so 
inflated write sizes are compared against an inflated limit, so it should be 
fine.


But blocking read I/O for several seconds is not so fine.  There are 
various amounts of buffering and caching in the write pipeline. 
These suggest that there is a certain amount of write data which is 
handled efficiently by the write pipeline.  Once the buffers and caches 
fill, and the disks are maximally busy with write I/O, there is no 
further opportunity to do a read from the same disks for several seconds 
(up to five seconds).  When a TXG is written, the system writes just 
as fast and hard as it can (for up to five seconds) without 
considering other requirements.


ZFS's asynchronous write caching is speculative, hoping that the 
application will update the data just written several times so that 
only the final version needs to be written and disk I/O and precious 
IOPS are saved.  Unfortunately, not all applications work that way.
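
To make the distinction concrete (a contrived sketch, not my actual 
program; the file names are invented):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

enum { BLK = 128 * 1024 };

int main(void)
{
        static char buf[BLK];
        int fd, i;

        memset(buf, 'x', sizeof (buf));

        /* Pattern the cache is betting on: the same region is rewritten
         * over and over, so only the last version has to reach the disks. */
        fd = open("hot-file", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return (1);
        for (i = 0; i < 100; i++)
                (void) pwrite(fd, buf, sizeof (buf), 0);
        (void) close(fd);

        /* My workload: each 128K block is written exactly once and never
         * touched again, so holding it in the ARC buys nothing. */
        fd = open("streamed-file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return (1);
        for (i = 0; i < 64; i++)        /* roughly an 8MB file */
                (void) write(fd, buf, sizeof (buf));
        (void) close(fd);

        return (0);
}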


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Tristan Ball wrote:


Is the system otherwise responsive during the zfs sync cycles?

I ask because I think I'm seeing a similar thing - except that it's not only 
other writers that block , it seems like other interrupts are blocked. 
Pinging my zfs server in 1s intervals results in large delays while the 
system syncs, followed by normal response times while the system buffers more 
input...


I don't see any such problems unless compression is enabled.  When 
compression is enabled, the TXG sync causes definite response time 
issues in the system.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-08 Thread John Wythe
> This causes me to believe that the algorithm is not
> implemented as  described in Solaris 10.

I was all ready to write about my frustrations with this problem, but I 
upgraded to snv_117 last night to fix some iSCSI bugs and now it seems that 
the write throttling is working as described in that blog.

If a process starts filling the ARC, it is throttled and the data is written 
out at a nice, constant rate using just about all of the disk bandwidth, 
without freezing the system every 5 seconds.

However, with gzip-1 compression the symptoms return.  For my system I think 
it's because the gzip compression is not multi-threaded: I'm only getting 50% 
CPU utilization on a dual-core system.  LZJB seems to work well, though.

Is anyone aware of any bug fixes since 111b that would have helped to mitigate 
the freezing with the cache flushes?

-John
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-08 Thread John Wythe
> I was all ready to write about my frustrations with
> this problem, but I upgraded to snv_117 last night to
> fix some iscsi bugs and now it seems that the write
> throttling is working as described in that blog.

I may have been a little premature. While everything is much improved for Samba 
and local disk operations (dd, cp) on snv_117, COMSTAR iSCSI writes still seem 
to incur this "write a bit, block, write a bit, block" pattern every 5 seconds.

But on top of that, I am getting relatively poor iSCSI performance for some 
reason over a direct gigabit link with MTU=9000. I'm not sure what that is 
about yet.

-John
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss