Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-30 Thread Robert Milkowski
Hello Bob,

Wednesday, July 30, 2008, 3:07:05 AM, you wrote:

BF> On Wed, 30 Jul 2008, Robert Milkowski wrote:
>>
>> Both cases are basically the same.
>> Please notice I'm not talking about disabling ZIL, I'm talking about
>> disabling cache flushes in ZFS. ZFS will still wait for the array to
>> confirm that it did receive data (nvram).

BF> So it seems that in your opinion, the periodic "burp" in system call 
BF> completion time is due to ZFS's periodic cache flush.  That is 
BF> certainly quite possible.


Could be. Additionally, he will end up with up to 35 I/Os queued per
LUN, and if he doesn't effectively have an NVRAM cache there, the
latency can dramatically increase during these periods.
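
For reference, the 35 figure corresponds to the default of the
zfs_vdev_max_pending tunable in builds of that era. A minimal sketch of
inspecting and lowering it, assuming the running build exports the
symbol under that name:

  # Inspect the current per-vdev I/O queue depth
  echo "zfs_vdev_max_pending/D" | mdb -k

  # Lower it on the live system (0t10 is decimal 10)
  echo "zfs_vdev_max_pending/W 0t10" | mdb -kw

  # Or persistently, by adding this line to /etc/system and rebooting:
  #   set zfs:zfs_vdev_max_pending = 10

Whether 10 is a better value than 35 depends on the array; it is only an
example.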


-- 
Best regards,
 Robert Milkowskimailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-29 Thread Bob Friesenhahn
On Wed, 30 Jul 2008, Robert Milkowski wrote:
>
> Both cases are basically the same.
> Please notice I'm not talking about disabling ZIL, I'm talking about
> disabling cache flushes in ZFS. ZFS will still wait for the array to
> confirm that it did receive data (nvram).

So it seems that in your opinion, the periodic "burp" in system call 
completion time is due to ZFS's periodic cache flush.  That is 
certainly quite possible.

Testing will prove it, but the testing can be on someone else's system 
rather than my own. :)

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-29 Thread Robert Milkowski
Hello Bob,

Friday, July 25, 2008, 4:58:54 PM, you wrote:

BF> On Fri, 25 Jul 2008, Robert Milkowski wrote:

>> Both on 2540 and 6540 if you do not disable it your performance will 
>> be very bad especially for synchronous IOs as ZIL will force your 
>> array to flush its cache every time. If you are not using ZFS on any 
>> other storage than 2540 on your servers then put "set 
>> zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you 
>> haven't done so it should help you considerably.

BF> This does not seem wise since then data (records of trades) may be 
BF> lost if the system crashes or loses power.  It is much better to apply
BF> the firmware tweaks so that the 2540 reports that the data is written 
BF> as soon as it is safely in its NVRAM rather than waiting for it to be 
BF> on disk.  ZFS should then perform rather well with low latency. 

Both cases are basically the same.
Please notice I'm not talking about disabling ZIL, I'm talking about
disabling cache flushes in ZFS. ZFS will still wait for the array to
confirm that it did receive data (nvram).

If you lose power the behavior will be the same - no difference here.




-- 
Best regards,
 Robert Milkowskimailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-28 Thread Bob Friesenhahn
On Mon, 28 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
> 
> I have tried your pdf but did not get good latency numbers even after array
> tuning...

Right.  And since I observed only slightly less optimal performance 
from a mirror pair of USB drives it seems that your requirement is not 
challenging at all for the storage hardware.  USB does not offer very 
much throughput.  A $200 portable disk drive is able to almost match a 
$23K drive array for this application.

I did test your application with output to /dev/null and /tmp (memory 
based) and it did report consistently tiny numbers in that case.

It seems likely that writes to ZFS encounter a "hiccup" every so often 
in the write system call.

There is still the possibility that a bug in your application is 
causing the "hiccup" in the write timings.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-28 Thread Tharindu Rukshan Bamunuarachchi




Dear All,

I will try to post DTool source code asap 

DTool depends on our patented middleware; I need one or two days to
clarify  :-P

Very Sorry.

Bob,

I have tried your pdf but did not get good latency numbers even after
array tuning...


cheers
tharindu


Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Richard Elling wrote:
>> Is it doing buffered or sync writes?  I'll try it later today or
>> tomorrow...
>
> I have not seen the source code but truss shows that this program is
> doing more than expected such as using send/recv to send a message. In
> fact, send(), pollsys(), recv(), and write() constitute most of the
> activity.  A POSIX.4 real-time timer is created. Perhaps it uses two
> threads, with one sending messages to the other over a socketpair, and
> the second thread does the actual write.
>
>>> I did not run your program in a real-time scheduling class (see priocntl).
>>> Perhaps it would perform better using real-time scheduling.  It might also
>>> do better in a fixed-priority class.
>>
>> This might be more important.  But a better solution is to assign a
>> processor set to run only the application -- a good idea any time you
>> need a predictable response.
>
> Later on I did try running the program in the real time scheduling
> class with high priority and it made no difference at all.
>
> While it is clear that filesystem type (ZFS or UFS) does make a
> significant difference, it seems that the program is doing more than
> simply timing the write system call.  A defect in the program could
> easily account for the long delays.
>
> It would help if source code for the program can be posted.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/




Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Ellis, Mike
Bob Says: 

"But a better solution is to assign a processor set to run only
the application -- a good idea any time you need a predictable
response."

Bob's suggestion above along with "no interrupts on that pset", and a
fixed scheduling class for the application/processes in question could
also be helpful.
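
For anyone who wants to try that combination, a minimal sketch of the
processor-set approach (the CPU IDs, set ID, and <pid> are placeholders
for the box in question):

  # Create a processor set from two CPUs reserved for the application
  psrset -c 2 3

  # Assuming the new set got id 1: keep interrupts off its CPUs
  psrset -f 1

  # Bind the latency-sensitive process to the set
  psrset -b 1 <pid>

  # Put it into the fixed-priority scheduling class
  priocntl -s -c FX -m 60 -p 60 -i pid <pid>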

Tharindu, would you be able to share the source of your
write-latency-measuring application? This might give us a better idea of
exactly what it's measuring and how. That might allow people (way smarter
than me) to do some additional/alternative DTrace work to help drill down
further towards the source and resolution of the issue.
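
As a starting point for that, a minimal DTrace sketch that histograms
write(2) latency for the test program (the execname filter assumes the
binary really runs as "DTool"; the quantize() buckets are nanoseconds):

  dtrace -n '
    syscall::write:entry
    /execname == "DTool"/
    { self->ts = timestamp; }

    syscall::write:return
    /self->ts/
    {
      @lat["write(2) latency (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
    }'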

Thanks,

 -- MikeE
 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Richard Elling
Sent: Saturday, July 26, 2008 3:33 PM
To: Bob Friesenhahn
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>   
>> I suspect that the maximum peak latencies have something to do with 
>> zfs itself (or something in the test program) rather than the pool 
>> configuration.
>> 
>
> As confirmation that the reported timings have virtually nothing to do
> with the pool configuration, I ran the program on a two-drive ZFS 
> mirror pool consisting of two cheap 500MB USB drives.  The average 
> latency was not much worse.  The peak latency values are often larger 
> but the maximum peak is still on the order of 9000 microseconds.
>   

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM 
> SAS drive) which is freshly created and see that the average latency 
> is somewhat lower but the maximum peak for each interval is typically 
> much higher (at least 1200 but often 4000). I even saw a measured peak
> as high as 4.
>
> Based on the findings, it seems that using the 2540 is a complete 
> waste if two cheap USB drives in a zfs mirror pool can almost obtain 
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see 
> priocntl).  Perhaps it would perform better using real-time 
> scheduling.  It might also do better in a fixed-priority class.
>   

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Bob Friesenhahn
On Sat, 26 Jul 2008, Richard Elling wrote:
>
> Is it doing buffered or sync writes?  I'll try it later today or
> tomorrow...

I have not seen the source code but truss shows that this program is 
doing more than expected such as using send/recv to send a message. 
In fact, send(), pollsys(), recv(), and write() constitute most of the 
activity.  A POSIX.4 real-time timer is created. Perhaps it uses two 
threads, with one sending messages to the other over a socketpair, and 
the second thread does the actual write.

>> I did not run your program in a real-time scheduling class (see priocntl). 
>> Perhaps it would perform better using real-time scheduling.  It might also 
>> do better in a fixed-priority class.
>
> This might be more important.  But a better solution is to assign a
> processor set to run only the application -- a good idea any time you
> need a predictable response.

Later on I did try running the program in the real time scheduling 
class with high priority and it made no difference at all.

While it is clear that filesystem type (ZFS or UFS) does make a 
significant difference, it seems that the program is doing more than 
simply timing the write system call.  A defect in the program could 
easily account for the long delays.

It would help if source code for the program can be posted.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Richard Elling
Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>   
>> I suspect that the maximum peak latencies have something to do with 
>> zfs itself (or something in the test program) rather than the pool 
>> configuration.
>> 
>
> As confirmation that the reported timings have virtually nothing to do 
> with the pool configuration, I ran the program on a two-drive ZFS 
> mirror pool consisting of two cheap 500MB USB drives.  The average 
> latency was not much worse.  The peak latency values are often larger 
> but the maximum peak is still on the order of 9000 microseconds.
>   

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM 
> SAS drive) which is freshly created and see that the average latency 
> is somewhat lower but the maximum peak for each interval is typically 
> much higher (at least 1200 but often 4000). I even saw a measured peak 
> as high as 4.
>
> Based on the findings, it seems that using the 2540 is a complete 
> waste if two cheap USB drives in a zfs mirror pool can almost obtain 
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see 
> priocntl).  Perhaps it would perform better using real-time 
> scheduling.  It might also do better in a fixed-priority class.
>   

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Bob Friesenhahn
On Sat, 26 Jul 2008, Bob Friesenhahn wrote:

> I suspect that the maximum peak latencies have something to do with 
> zfs itself (or something in the test program) rather than the pool 
> configuration.

As confirmation that the reported timings have virtually nothing to do 
with the pool configuration, I ran the program on a two-drive ZFS 
mirror pool consisting of two cheap 500MB USB drives.  The average 
latency was not much worse.  The peak latency values are often larger 
but the maximum peak is still on the order of 9000 microseconds.

I then ran the test on a single-drive UFS filesystem (300GB 15K RPM 
SAS drive) which is freshly created and see that the average latency 
is somewhat lower but the maximum peak for each interval is typically 
much higher (at least 1200 but often 4000). I even saw a measured peak 
as high as 4.

Based on the findings, it seems that using the 2540 is a complete 
waste if two cheap USB drives in a zfs mirror pool can almost obtain 
the same timings.  UFS on the fast SAS drive performed worse.

I did not run your program in a real-time scheduling class (see 
priocntl).  Perhaps it would perform better using real-time 
scheduling.  It might also do better in a fixed-priority class.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Bob Friesenhahn
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> It is impossible to simulate my scenario with iozone. iozone performs very 
> well for ZFS. OTOH,
> iozone does not measure latency.
> 
> Please find attached tool (Solaris x86), which we have written to measure 
> latency.

Very interesting software.  I ran it for a little while in a ZFS 
filesystem configured with 8K zfs blocksize and produced a 673MB file. 
This is with a full graphical login environment running.  I did a 
short run using 128K zfs blocksize and notice that the average peak 
latencies are about 2X as high as with 8K blocks, but the maximum peak 
latencies are similar (i.e. somewhat under 10,000 us).  I suspect that 
the maximum peak latencies have something to do with zfs itself (or 
something in the test program) rather than the pool configuration.
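
For anyone reproducing the comparison, a minimal sketch of how the two
blocksize runs would be set up (the dataset name is a placeholder, and
recordsize only affects files written after the change):

  # 8K run
  zfs set recordsize=8k tank/latencytest

  # 128K run (the ZFS default)
  zfs set recordsize=128k tank/latencytest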

Here is the text output with 8k filesystem blocks:

% ./DTool -W -i 1 -s 700 -r 1 -f file
System Tick = 100 usecs
Clock resolution 10
HR Timer created for 100usecs
z_FileName = file
i_Rate = 1
l_BlockSize = 700
i_SyncInterval = 0
l_TickInterval = 100
i_TicksPerIO = 1
i_NumOfIOsPerSlot = 1
Max (us)| Min (us)  | Avg (us)  | MB/S  | File  Freq 
Distribution
   80   |  5|  6.5637   |  3.3371   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   2116 |  4|  7.2277   |  6.6429   |  file50(99.88), 
200(0.10), 500(0.00), 2000(0.01), 5000(0.01), 1(0.00), 10(0.00), 
20(0.00),
   60   |  5|  6.7522   |  6.6733   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   64   |  5|  6.6542   |  6.6753   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   46   |  4|  6.5489   |  6.6753   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   68   |  5|  6.5236   |  6.6726   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   8694 |  4|  8.7859   |  6.4859   |  file50(99.39), 
200(0.54), 500(0.03), 2000(0.03), 5000(0.00), 1(0.01), 10(0.00), 
20(0.00),
   70   |  4|  6.5669   |  6.6753   |  file50(99.98), 
200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   48   |  5|  6.5907   |  6.6733   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   49   |  5|  6.5948   |  6.6753   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   47   |  4|  6.5437   |  6.6753   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   7991 |  4|  8.7452   |  6.5043   |  file50(99.45), 
200(0.45), 500(0.06), 2000(0.03), 5000(0.00), 1(0.01), 10(0.00), 
20(0.00),
   57   |  4|  6.7606   |  6.6753   |  file50(99.98), 
200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   49   |  5|  6.6358   |  6.6753   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   46   |  5|  6.4603   |  6.6726   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   60   |  5|  6.4511   |  6.6727   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   9099 |  4|  9.0321   |  6.4891   |  file50(99.37), 
200(0.51), 500(0.07), 2000(0.04), 5000(0.00), 1(0.01), 10(0.00), 
20(0.00),
   48   |  5|  6.5132   |  6.6727   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   72   |  5|  6.5453   |  6.6726   |  file50(99.99), 
200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   44   |  5|  6.5788   |  6.6753   |  file50(100.00), 
200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   71   |  5|  6.5554   |  6.6727   |  file50(99.98), 
200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 1(0.00), 10(0.00), 
20(0.00),
   9138 |  4|  8.9271   |  6.5061   |  file50(99.43), 
200(0.48), 500(0.03), 2000(0.04), 5000(0.01), 1(0.01), 10(0.00), 
20(0.00),
   45   |  5|  6.5028   |  6.6753   |  file50(100.00), 
200(0

Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Bob Friesenhahn
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
> 
> 1.Re configure array with 12 independent disks
> 2. Allocate disks to RAIDZed pool

Using raidz will penalize your transaction performance since all disks 
will need to perform I/O for each write.  It is definitely better to 
use load-shared mirrors for this purpose.

> 3. Fine tune the 2540 according to Bob's 2540-ZFS-Performance.pdf (Thankx Bob)
> 4. Apply ZFS tunings (i.e. zfs_nocacheflush=1 etc.)

Hopefully after step #3, step #4 will not be required.  Step #4 puts 
data at risk if there is a system crash.

> However, I could not find additional cards to support I/O Multipath. Hope 
> that would not affect
> on latency.

Probably not.  It will affect sequential I/O performance, but latency 
is primarily dependent on disk configuration and ZFS filesystem block 
size.

I have performed some tests here of synchronous writes using iozone 
with multi-threaded readers/writers.  This is for the same 2540 
configuration that I wrote about earlier.  For this particular test, 
the ZFS filesystem blocksize is 8K and the size of the I/Os is 8K. 
This may not be a good representation of your own workload since the 
threads are contending for I/O with random access.  In your case, it 
seems that writes may be written in a sequential append mode.

I also have test results handy for similar test parameters but using 
various ZFS filesystem settings (8K/128K block size, checksum 
enable/disable, noatime, and sha256 checksum), and 8K or 128K I/O 
block sizes.  Let me know if there is something you would like for me 
to measure.  It should be easy to simulate your application behavior 
using iozone.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

Iozone: Performance Test of File I/O
Version $Revision: 3.283 $
Compiled for 64 bit mode.
Build: Solaris10gcc-64

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million,
 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
 Erik Habbinga, Kris Strecker, Walter Wong.

Run began: Wed Jul  2 10:54:19 2008

Multi_buffer. Work area 16777216 bytes
OPS Mode. Output is in operations per second.
Record Size 8 KB
SYNC Mode.
File size set to 2097152 KB
Command line used: iozone -m -t 8 -T -O -r 8k -o -s 2G
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 8 threads
Each thread writes a 2097152 Kbyte file in 8 Kbyte records

Children see throughput for  8 initial writers  =4315.57 ops/sec
Parent sees throughput for  8 initial writers   =4266.15 ops/sec
Min throughput per thread   = 532.18 ops/sec
Max throughput per thread   = 543.36 ops/sec
Avg throughput per thread   = 539.45 ops/sec
Min xfer=  256746.00 ops

Children see throughput for  8 rewriters=2595.08 ops/sec
Parent sees throughput for  8 rewriters =2595.06 ops/sec
Min throughput per thread   = 322.07 ops/sec
Max throughput per thread   = 326.15 ops/sec
Avg throughput per thread   = 324.38 ops/sec
Min xfer=  258867.00 ops

Children see throughput for  8 readers  =   53462.03 ops/sec
Parent sees throughput for  8 readers   =   53451.08 ops/sec
Min throughput per thread   =6340.39 ops/sec
Max throughput per thread   =6859.59 ops/sec
Avg throughput per thread   =6682.75 ops/sec
Min xfer=  242368.00 ops

Children see throughput for 8 re-readers=   54585.11 ops/sec
Parent sees throughput for 8 re-readers =   54573.08 ops/sec
Min throughput per thread   =6022.81 ops/sec
Max throughput per thread   =7164.78 ops/sec
Avg throughput per thread   =6823.14 ops/sec
Min xfer=  220373.00 ops

Children see throughput for 8 reverse readers   =   56755.70 ops/sec
Parent sees throughput for 8 reverse re

Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread Tharindu Rukshan Bamunuarachchi





Dear All,

Thank you very much for the continuous support 

Sorry for the late reply ...
I was trying to allocate a 2540 & 2 x 4600 to try out your
recommendations ...
Finally, I could reserve the 2540 disk array for testing purposes,
so I am free to try out each and every point you have emphasized over
the next week or two.

The current volume setup is something like this ...
I have created a Storage Profile with RAID1 (there was no option for
either RAID 1+0 or 0+1, maybe due to the CAM version).
No read-ahead.
16 KB segment size.

4 physical SAS disks have been allocated to each volume,
so I got two disks' worth of effective space.
That particular 272GB volume is mounted as a ZFS pool. (We need at least 3
independent volumes to start our system.)


This is my plan for next week ...

1. Reconfigure the array with 12 independent disks
2. Allocate the disks to a raidz pool
3. Fine-tune the 2540 according to Bob's 2540-ZFS-Performance.pdf
(Thankx Bob)
4. Apply ZFS tunings (i.e. zfs_nocacheflush=1 etc.)


Did I miss anything?
However, I could not find additional cards to support I/O multipathing.
Hope that will not affect latency.


Thankx again ...

Cheers
Tharindu







David Collier-Brown wrote:
> Brandon High wrote:
>> On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <[EMAIL PROTECTED]> wrote:
>>
>>> And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???
>>
>> Or perhaps 4 RAID1 mirrors concatenated?
>
> I wondered that too, but he insists he doesn't have 0+1 or 1+0...
>
> Tharindu. could you clarify this for us? It significantly
> affects what advice we give!
>
> --dave (former tech lead, performance engineering at ACE) c-b




Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-26 Thread David Collier-Brown
Brandon High wrote:
> On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <[EMAIL PROTECTED]> 
> wrote:
> 
>>And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???
> 
> 
> Or perhaps 4 RAID1 mirrors concatenated?
> 
I wondered that too, but he insists he doesn't have 0+1 or 1+0...

Tharindu. could you clarify this for us? It significantly
affects what advice we give!

--dave (former tech lead, performance engineering at ACE) c-b
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread Brandon High
On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <[EMAIL PROTECTED]> wrote:
> And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???

Or perhaps 4 RAID1 mirrors concatenated?

-B

-- 
Brandon High [EMAIL PROTECTED]
"The good is the enemy of the best." - Nietzsche


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread David Collier-Brown
And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???

--dave

Robert Milkowski wrote:
> Hello Tharindu,
> 
> 
> Thursday, July 24, 2008, 6:02:31 AM, you wrote:
> 
> 
>>
> 
>   
> 
> We do not use raidz*. Virtually, no raid or stripe through OS.
> 
> 
> We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.
> 
> 
> 2540 does not have RAID 1+0 or 0+1.
> 
> 
> 
> 
> Of course it does 1+0. Just add more drives to RAID-1
> 
> 
> 
> 
> -- 
> 
> Best regards,
> 
>  Robert Milkowski   mailto:[EMAIL PROTECTED]
> 
>http://milek.blogspot.com
> 
> 
> 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread Bob Friesenhahn
On Fri, 25 Jul 2008, Robert Milkowski wrote:

> Both on 2540 and 6540 if you do not disable it your performance will 
> be very bad especially for synchronous IOs as ZIL will force your 
> array to flush its cache every time. If you are not using ZFS on any 
> other storage than 2540 on your servers then put "set 
> zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you 
> haven't done so it should help you considerably.

This does not seem wise since then data (records of trades) may be 
lost if the system crashes or loses power.  It is much better to apply 
the firmware tweaks so that the 2540 reports that the data is written 
as soon as it is safely in its NVRAM rather than waiting for it to be 
on disk.  ZFS should then perform rather well with low latency. 
However, I have yet to see any response from Tharindu which indicates 
he has seen any of my emails regarding this (or many emails from 
others).  Based on his responses I would assume that Tharindu is 
seeing less than a third of the response messages regarding his topic.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread Robert Milkowski
Hello Tharindu,

Thursday, July 24, 2008, 6:02:31 AM, you wrote:

> We do not use raidz*. Virtually, no raid or stripe through OS.
>
> We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.
>
> 2540 does not have RAID 1+0 or 0+1.

Of course it does 1+0. Just add more drives to RAID-1



-- 
Best regards,
 Robert Milkowski                           mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com





Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread Robert Milkowski
Hello Tharindu,

Wednesday, July 23, 2008, 10:03:15 AM, you wrote:

> 10,000 x 700 = 7MB per second ..
>
> We have this rate for whole day 
>
> 10,000 orders per second is minimum requirments of modern day stock
> exchanges ...
>
> Cache still help us for ~1 hours, but after that who will help us ...

Have you disabled SCSI cache flushes on the ZFS side, or have you
disabled them on the array? On both the 2540 and the 6540, if you do
not disable them your performance will be very bad, especially for
synchronous I/Os, as the ZIL will force your array to flush its cache
every time. If you are not using ZFS on any storage other than the 2540
on your servers, then put "set zfs:zfs_nocacheflush=1" in /etc/system
and do a reboot. If you haven't done so, it should help you considerably.

With such relatively low throughput and with ZFS, plus cache on the
array (after the above correction), plus you stated in another email
that you are basically not reading at all, you should be able to cache
everything in the array and then stream it to disks (partly thanks to
CoW in ZFS).

An additional question: how do you write your data? Are you updating
larger files or creating a new file each time, or...?
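
For reference, a minimal sketch of that setting, with the caveat raised
elsewhere in the thread that it is only reasonable when every pool sits
behind battery-backed NVRAM (the mdb lines assume the build exports the
tunable under this name):

  # Persistent: add to /etc/system and reboot
  #   set zfs:zfs_nocacheflush = 1

  # On a live system, without a reboot
  echo "zfs_nocacheflush/W 0t1" | mdb -kw

  # Verify the current value
  echo "zfs_nocacheflush/D" | mdb -k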



-- 
Best regards,
 Robert Milkowski                           mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com





Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Bob Friesenhahn
On Thu, 24 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> Do you have any recommend parameters should I try ?

Using an external log is really not needed when using the StorageTek 
2540.  I doubt that it is useful at all.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Bob Friesenhahn
On Thu, 24 Jul 2008, Brandon High wrote:
>
> Have you tried exporting the individual drives and using zfs to handle
> the mirroring? It might have better performance in your situation.

It should indeed have better performance.  The single LUN exported 
from the 2540 will be treated like a single drive from ZFS's 
perspective.  The data written needs to be serialized in the same way 
that it would be for a drive.  ZFS has no understanding that some 
offsets will access a different drive so it may be that one pair of 
drives is experiencing all of the load.

The most performant configuration would be to export a LUN from each 
of the 2540's 12 drives and create a pool of 6 mirrors.  In this 
situation, ZFS will load share across the 6 mirrors so that each pair 
gets its fair share of the IOPS based on its backlog.
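
A minimal sketch of that layout, assuming the twelve drives are exported
as individual LUNs and appear under the placeholder device names below:

  # 'tradepool' and the cXtYd0 names are placeholders for the 12 exported LUNs
  zpool create tradepool \
      mirror c2t0d0 c2t1d0 \
      mirror c2t2d0 c2t3d0 \
      mirror c2t4d0 c2t5d0 \
      mirror c2t6d0 c2t7d0 \
      mirror c2t8d0 c2t9d0 \
      mirror c2t10d0 c2t11d0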

The 2540 cache tweaks will also help tremendously for this sort of 
work load.

Since this is for critical data I would not disable the cache 
mirroring in the 2540's controllers.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Bob Friesenhahn

On Thu, 24 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> We do not use raidz*. Virtually, no raid or stripe through OS.
>
> We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.

What ZFS block size are you using?

Are you using synchronous writes for each 700-byte message?  10k 
synchronous writes per second is pretty high and would depend heavily 
on the 2540's write cache and how the 2540's firmware behaves.

You will find some cache tweaks for the 2540 in my writeup available 
at 
http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf.

Without these tweaks, the 2540 waits for the data to be written to 
disk rather than written to its NVRAM whenever ZFS flushes the write 
cache.
Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread David Collier-Brown
  Hmmn, that *sounds* as if you are saying you've a very-high-redundancy
RAID1 mirror, 4 disks deep, on an 'enterprise-class tier 2 storage' array
that doesn't support RAID 1+0 or 0+1. 

  That sounds weird: the 2540 supports RAID levels 0, 1, (1+0), 3 and 5,
and deep mirrors are normally only used on really fast equipment in
mission-critical tier 1 storage...

  Are you sure you don't mean you have raid 0 (stripes) 4 disks wide,
each stripe presented as a LUN?

  If you really have 4-deep RAID 1, you have a configuration that will
perform somewhat slower than any single disk, as the array launches
4 writes to 4 drives in parallel, and returns success when they
all complete.

  If you had 4-wide RAID 0, with mirroring done at the host, you would
have a configuration that would (probabilistically) perform better than 
a single drive when writing to each side of the mirror, and the write
would return success when the slowest side of the mirror completed.

 --dave (puzzled!) c-b

Tharindu Rukshan Bamunuarachchi wrote:
> We do not use raidz*. Virtually, no raid or stripe through OS.
> 
> We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.
> 
> 2540 does not have RAID 1+0 or 0+1.
> 
> cheers
> tharindu
> 
> Brandon High wrote:
> 
>>On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi
>><[EMAIL PROTECTED]> wrote:
>>  
>>
>>>Dear Mark/All,
>>>
>>>Our trading system is writing to local and/or array volume at 10k
>>>messages per second.
>>>Each message is about 700bytes in size.
>>>
>>>Before ZFS, we used UFS.
>>>Even with UFS, there was evey 5 second peak due to fsflush invocation.
>>>
>>>However each peak is about ~5ms.
>>>Our application can not recover from such higher latency.
>>>
>>>
>>
>>Is the pool using raidz, raidz2, or mirroring? How many drives are you using?
>>
>>-B
>>
>>  
>>
> 
> 
> 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Tharindu Rukshan Bamunuarachchi




Thankx for your continuous help ...

We do not read ...

We hardly read ...

Actually our system is writing the whole day, each and every transaction
it receives ...

We need the written data to recover the system from a crash in the
middle of the day (a very rare situation, but the most important part of
a trading system) ...

cheers
tharindu

Richard Elling wrote:
> [EMAIL PROTECTED] wrote:
>> On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
>>
>>> 10,000 x 700 = 7MB per second ..
>>>
>>> We have this rate for whole day 
>>>
>>> 10,000 orders per second is minimum requirments of modern day stock
>>> exchanges ...
>>>
>>> Cache still help us for ~1 hours, but after that who will help us ...
>>>
>>> We are using 2540 for current testing ...
>>> I have tried same with 6140, but no significant improvement ... only
>>> one or two hours ...
>>
>> It might not be exactly what you have in mind, but this "how do I get
>> latency down at all costs" thing reminded me of this old paper:
>>
>>   http://www.sun.com/blueprints/1000/layout.pdf
>>
>> I'm not a storage architect, someone with more experience in the area
>> care to comment on this ? With huge disks as we have these days, the
>> "wide thin" idea has gone under a bit - but how to replace such setups
>> with modern arrays, if the workload is such that caches eventually must
>> get blown and you're down to spindle speed ?
>
> Bob Larson wrote that article, and I would love to ask him for an
> update.  Unfortunately, he passed away a few years ago :-(
> http://blogs.sun.com/relling/entry/bob_larson_my_friend
>
> I think the model still holds true, the per-disk performance hasn't
> significantly changed since it was written.
>
> This particular problem screams for a queuing model.  You don't
> really need to have a huge cache as long as you can de-stage
> efficiently.  However, the original poster hasn't shared the read
> workload details... if you never read, it is a trivial problem to
> solve with a WOM.
>  -- richard




Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Tharindu Rukshan Bamunuarachchi




Do you have any recommend parameters should I try ?

Ellis, Mike wrote:

  Would adding a dedicated ZIL/SLOG (what is the difference between those 2 exactly? Is there one?) help meet your requirement?

The idea would be to use some sort of relatively large SSD drive of some variety to absorb the initial write-hit. After hours when things quieit down (or perhaps during "slow periods" in the day) data is transparently destaged into the main disk-pool, providing you a transparent/rudimentary form of HSM. 

Have a look at Adam Leventhal's blog and ACM article for some interesting perspectives on this stuff... (Specifically the potential "return of the 3600 rpm drive" ;-)

Thanks -- mikee
  



Actually, we do not need this data at the end of the day.

We will write a summary into an Oracle DB.

SSD is a good option, but the cost is not feasible for some clients.

Is Sun providing SSD arrays??

  

- Original Message -
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
To: Tharindu Rukshan Bamunuarachchi <[EMAIL PROTECTED]>
Cc: zfs-discuss@opensolaris.org 
Sent: Wed Jul 23 11:22:51 2008
Subject: Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

  
  
10,000 x 700 = 7MB per second ..

We have this rate for whole day 

10,000 orders per second is minimum requirments of modern day stock exchanges ...

Cache still help us for ~1 hours, but after that who will help us ...

We are using 2540 for current testing ...
I have tried same with 6140, but no significant improvement ... only one or two hours ...

  
  
Does your application request synchronous file writes or use fsync()? 
While normally fsync() slows performance I think that it will also 
serve to even the write response since ZFS will not be buffering lots 
of unwritten data.  However, there may be buffered writes from other 
applications which gets written periodically and which may delay the 
writes from your critical application.  In this case reducing the ARC 
size may help so that the ZFS sync takes less time.

You could also run a script which executes 'sync' every second or two 
in order to convince ZFS to cache less unwritten data. This will cause 
a bit of a performance hit for the whole system though.
  

This did not work, and I got a much higher peak once in a while.

Other than the array-mounted disks, our applications write to local
hard disks (e.g. logs).

AFAIK, "sync" is applicable to all file systems.

  
You 7MB per second is a very tiny write load so it is worthwhile 
investigating to see if there are other factors which are causing your 
storage system to not perform correctly.  The 2540 is capable of 
supporting writes at hundreds of MB per second.
  


Yes, the 2540 can go up to 40MB/s or more with more striped hard disks.

But we are struggling with latency, not bandwidth. I/O bandwidth is
superb, but latency is poor.


  
As an example of "another factor", let's say that you used the 2540 to 
create 6 small LUNs and then put them into a ZFS zraid.  However, in 
this case the 2540 allocated all of the LUNs from the same disk (which 
it is happy to do by default) so now that disk is being severely 
thrashed since it is one disk rather than six.
  


I did not use raidz.

I have manually allocated 4 independent disks per volume.

I will try to get a few independent disks through a few LUNs.

Then I would be able to create a raidz and try.

  
Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


  




Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Brandon High
On Wed, Jul 23, 2008 at 10:02 PM, Tharindu Rukshan Bamunuarachchi
<[EMAIL PROTECTED]> wrote:
> We do not use raidz*. Virtually, no raid or stripe through OS.

So it's ZFS on a single LUN exported from the 2540? Or have you
created a zpool from multiple raid1 LUNs on the 2540?

Have you tried exporting the individual drives and using zfs to handle
the mirroring? It might have better performance in your situation.

-B

-- 
Brandon High [EMAIL PROTECTED]
"The good is the enemy of the best." - Nietzsche


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread Tharindu Rukshan Bamunuarachchi




We do not use raidz*.
Virtually, no raid or stripe through OS.

We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.

2540 does not have RAID 1+0 or 0+1.

cheers
tharindu

Brandon High wrote:
> On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi
> <[EMAIL PROTECTED]> wrote:
>> Dear Mark/All,
>>
>> Our trading system is writing to local and/or array volume at 10k
>> messages per second.
>> Each message is about 700bytes in size.
>>
>> Before ZFS, we used UFS.
>> Even with UFS, there was evey 5 second peak due to fsflush invocation.
>>
>> However each peak is about ~5ms.
>> Our application can not recover from such higher latency.
>
> Is the pool using raidz, raidz2, or mirroring? How many drives are you using?
>
> -B




Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Brandon High
On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi
<[EMAIL PROTECTED]> wrote:
>
> Dear Mark/All,
>
> Our trading system is writing to local and/or array volume at 10k
> messages per second.
> Each message is about 700bytes in size.
>
> Before ZFS, we used UFS.
> Even with UFS, there was evey 5 second peak due to fsflush invocation.
>
> However each peak is about ~5ms.
> Our application can not recover from such higher latency.

Is the pool using raidz, raidz2, or mirroring? How many drives are you using?

-B

-- 
Brandon High [EMAIL PROTECTED]
"The good is the enemy of the best." - Nietzsche


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Richard Elling
[EMAIL PROTECTED] wrote:
> On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
>
>   
>> 10,000 x 700 = 7MB per second ..
>>
>> We have this rate for whole day 
>>
>> 10,000 orders per second is minimum requirments of modern day stock 
>> exchanges ...
>>
>> Cache still help us for ~1 hours, but after that who will help us ...
>>
>> We are using 2540 for current testing ...
>> I have tried same with 6140, but no significant improvement ... only one or 
>> two hours ...
>> 
>
> It might not be exactly what you have in mind, but this "how do I get 
> latency down at all costs" thing reminded me of this old paper:
>
>   http://www.sun.com/blueprints/1000/layout.pdf
>
> I'm not a storage architect, someone with more experience in the area care 
> to comment on this ? With huge disks as we have these days, the "wide 
> thin" idea has gone under a bit - but how to replace such setups with 
> modern arrays, if the workload is such that caches eventually must get 
> blown and you're down to spindle speed ?
>   

Bob Larson wrote that article, and I would love to ask him for an
update.  Unfortunately, he passed away a few years ago :-(
http://blogs.sun.com/relling/entry/bob_larson_my_friend

I think the model still holds true, the per-disk performance hasn't
significantly changed since it was written.

This particular problem screams for a queuing model.  You don't
really need to have a huge cache as long as you can de-stage
efficiently.  However, the original poster hasn't shared the read
workload details... if you never read, it is a trivial problem to
solve with a WOM.
 -- richard



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Ellis, Mike
Would adding a dedicated ZIL/SLOG (what is the difference between those 2 
exactly? Is there one?) help meet your requirement?

The idea would be to use some sort of relatively large SSD drive of some 
variety to absorb the initial write-hit. After hours when things quiet down 
(or perhaps during "slow periods" in the day) data is transparently destaged 
into the main disk-pool, providing you a transparent/rudimentary form of HSM. 
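
For what it's worth, a minimal sketch of adding a dedicated log device
to an existing pool (pool and device names are placeholders). Note that
a slog is simply a separate device holding the ZIL, and it only absorbs
synchronous writes rather than acting as a general destaging tier:

  # Single separate intent-log (slog) device
  zpool add tradepool log c3t0d0

  # Or a mirrored slog, so losing one device cannot lose recent sync writes
  zpool add tradepool log mirror c3t0d0 c3t1d0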

Have a look at Adam Leventhal's blog and ACM article for some interesting 
perspectives on this stuff... (Specifically the potential "return of the 3600 
rpm drive" ;-)

Thanks -- mikee


- Original Message -
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
To: Tharindu Rukshan Bamunuarachchi <[EMAIL PROTECTED]>
Cc: zfs-discuss@opensolaris.org 
Sent: Wed Jul 23 11:22:51 2008
Subject: Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> 10,000 x 700 = 7MB per second ..
> 
> We have this rate for whole day 
> 
> 10,000 orders per second is minimum requirments of modern day stock exchanges 
> ...
> 
> Cache still help us for ~1 hours, but after that who will help us ...
> 
> We are using 2540 for current testing ...
> I have tried same with 6140, but no significant improvement ... only one or 
> two hours ...

Does your application request synchronous file writes or use fsync()? 
While normally fsync() slows performance I think that it will also 
serve to even the write response since ZFS will not be buffering lots 
of unwritten data.  However, there may be buffered writes from other 
applications which gets written periodically and which may delay the 
writes from your critical application.  In this case reducing the ARC 
size may help so that the ZFS sync takes less time.

You could also run a script which executes 'sync' every second or two 
in order to convince ZFS to cache less unwritten data. This will cause 
a bit of a performance hit for the whole system though.

You 7MB per second is a very tiny write load so it is worthwhile 
investigating to see if there are other factors which are causing your 
storage system to not perform correctly.  The 2540 is capable of 
supporting writes at hundreds of MB per second.

As an example of "another factor", let's say that you used the 2540 to 
create 6 small LUNs and then put them into a ZFS zraid.  However, in 
this case the 2540 allocated all of the LUNs from the same disk (which 
it is happy to do by default) so now that disk is being severely 
thrashed since it is one disk rather than six.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Bob Friesenhahn
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> 10,000 x 700 = 7MB per second ..
> 
> We have this rate for whole day 
> 
> 10,000 orders per second is minimum requirments of modern day stock exchanges 
> ...
> 
> Cache still help us for ~1 hours, but after that who will help us ...
> 
> We are using 2540 for current testing ...
> I have tried same with 6140, but no significant improvement ... only one or 
> two hours ...

Does your application request synchronous file writes or use fsync()? 
While normally fsync() slows performance I think that it will also 
serve to even the write response since ZFS will not be buffering lots 
of unwritten data.  However, there may be buffered writes from other 
applications which gets written periodically and which may delay the 
writes from your critical application.  In this case reducing the ARC 
size may help so that the ZFS sync takes less time.
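
A minimal sketch of capping the ARC as suggested here (the 1 GB value is
only an example; the setting is in bytes and takes effect after a
reboot):

  # /etc/system fragment: limit the ZFS ARC to 1 GB
  set zfs:zfs_arc_max = 0x40000000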

You could also run a script which executes 'sync' every second or two 
in order to convince ZFS to cache less unwritten data. This will cause 
a bit of a performance hit for the whole system though.
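
A minimal sketch of that workaround (plain sh; the one-second interval
is the value suggested above):

  #!/bin/sh
  # Nudge ZFS into writing dirty data more often so each txg stays small
  while true; do
      sync
      sleep 1
  done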

Your 7MB per second is a very tiny write load, so it is worthwhile 
investigating whether there are other factors which are causing your 
storage system to not perform correctly.  The 2540 is capable of 
supporting writes at hundreds of MB per second.

As an example of "another factor", let's say that you used the 2540 to 
create 6 small LUNs and then put them into a ZFS raidz.  However, in 
this case the 2540 allocated all of the LUNs from the same disk (which 
it is happy to do by default) so now that disk is being severely 
thrashed since it is one disk rather than six.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Frank . Hofmann
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> 10,000 x 700 = 7MB per second ..
> 
> We have this rate for whole day 
> 
> 10,000 orders per second is minimum requirments of modern day stock exchanges 
> ...
> 
> Cache still help us for ~1 hours, but after that who will help us ...
> 
> We are using 2540 for current testing ...
> I have tried same with 6140, but no significant improvement ... only one or 
> two hours ...

It might not be exactly what you have in mind, but this "how do I get 
latency down at all costs" thing reminded me of this old paper:

http://www.sun.com/blueprints/1000/layout.pdf

I'm not a storage architect, someone with more experience in the area care 
to comment on this ? With huge disks as we have these days, the "wide 
thin" idea has gone under a bit - but how to replace such setups with 
modern arrays, if the workload is such that caches eventually must get 
blown and you're down to spindle speed ?

FrankH.

> 
> Robert Milkowski wrote:
>
>  Hello Tharindu,
> 
> Wednesday, July 23, 2008, 6:35:33 AM, you wrote:
> 
> TRB> Dear Mark/All,
> 
> TRB> Our trading system is writing to local and/or array volume at 10k 
> TRB> messages per second.
> TRB> Each message is about 700bytes in size.
> 
> TRB> Before ZFS, we used UFS.
> TRB> Even with UFS, there was evey 5 second peak due to fsflush invocation.
> 
> TRB> However each peak is about ~5ms.
> TRB> Our application can not recover from such higher latency.
> 
> TRB> So we used several tuning parameters (tune_r_* and autoup) to decrease
> TRB> the flush interval.
> TRB> As a result peaks came down to ~1.5ms. But it is still too high for our
> TRB> application.
> 
> TRB> I believe, if we could reduce ZFS sync interval down to ~1s, peaks will
> TRB> be reduced to ~1ms or less.
> TRB> We like <1ms peaks per second than 5ms peak per 5 second :-)
> 
> TRB> Are there any tunable, so i can reduce ZFS sync interval.
> TRB> If there is no any tunable, can not I use "mdb" for the job ...?
> 
> TRB> This is not general and we are ok with increased I/O rate.
> TRB> Please advice/help.
> 
> txt_time/D
> 
> btw:
>  10,000 * 700 = ~7MB
> 
> What's your storage subsystem? Any, even small, raid device with write
> cache should help.
> 
>
> 
> 
> 
>

--
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
--


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Tharindu Rukshan Bamunuarachchi




> txt_time/D
mdb: failed to dereference symbol: unknown symbol name
> txg_time/D
mdb: failed to dereference symbol: unknown symbol name


Am I doing something wrong?
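
For reference, a minimal sketch of what such a session might look like against
the live kernel, assuming the interval tunable in this build is the zfs
module's txg_time variable (both the symbol name and its default are
assumptions that vary between releases). The "unknown symbol name" error above
is consistent with mdb not being attached to the running kernel (no -k), or
with the symbol simply not existing in this build:

    # mdb -kw                # -k: attach to the running kernel, -w: allow writes
    > zfs`txg_time/D         # print the current txg interval, in seconds
    > zfs`txg_time/W 0t1     # hypothetically lower it to 1 second (0t = decimal)
    > $q                     # quit mdb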

Robert Milkowski wrote:

  Hello Tharindu,

Wednesday, July 23, 2008, 6:35:33 AM, you wrote:

TRB> Dear Mark/All,

TRB> Our trading system is writing to a local and/or array volume at 10k
TRB> messages per second.
TRB> Each message is about 700 bytes in size.

TRB> Before ZFS, we used UFS.
TRB> Even with UFS, there was a peak every 5 seconds due to fsflush invocation.

TRB> However, each peak is about ~5ms.
TRB> Our application cannot recover from such high latency.

TRB> So we used several tuning parameters (tune_t_* and autoup) to decrease
TRB> the flush interval.
TRB> As a result, peaks came down to ~1.5ms. But that is still too high for our
TRB> application.

TRB> I believe that if we could reduce the ZFS sync interval down to ~1s, peaks would
TRB> be reduced to ~1ms or less.
TRB> We would rather have <1ms peaks every second than a 5ms peak every 5 seconds :-)

TRB> Is there any tunable with which I can reduce the ZFS sync interval?
TRB> If there is no such tunable, can I not use "mdb" for the job ...?

TRB> This is not a general-purpose setup, and we are OK with an increased I/O rate.
TRB> Please advise/help.

txg_time/D

btw:
 10,000 * 700 = ~7MB

What's your storage subsystem? Any, even small, RAID device with a write
cache should help.


  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Tharindu Rukshan Bamunuarachchi




10,000 x 700 bytes = 7 MB per second.

We have this rate for the whole day.

10,000 orders per second is the minimum requirement of modern-day stock
exchanges ...

Cache still helps us for ~1 hour, but after that who will help us ...

We are using a 2540 for the current testing ...
I have tried the same with a 6140, but no significant improvement ... only
one or two hours ...

Robert Milkowski wrote:

  Hello Tharindu,

Wednesday, July 23, 2008, 6:35:33 AM, you wrote:

TRB> Dear Mark/All,

TRB> Our trading system is writing to a local and/or array volume at 10k
TRB> messages per second.
TRB> Each message is about 700 bytes in size.

TRB> Before ZFS, we used UFS.
TRB> Even with UFS, there was a peak every 5 seconds due to fsflush invocation.

TRB> However, each peak is about ~5ms.
TRB> Our application cannot recover from such high latency.

TRB> So we used several tuning parameters (tune_t_* and autoup) to decrease
TRB> the flush interval.
TRB> As a result, peaks came down to ~1.5ms. But that is still too high for our
TRB> application.

TRB> I believe that if we could reduce the ZFS sync interval down to ~1s, peaks would
TRB> be reduced to ~1ms or less.
TRB> We would rather have <1ms peaks every second than a 5ms peak every 5 seconds :-)

TRB> Is there any tunable with which I can reduce the ZFS sync interval?
TRB> If there is no such tunable, can I not use "mdb" for the job ...?

TRB> This is not a general-purpose setup, and we are OK with an increased I/O rate.
TRB> Please advise/help.

txg_time/D

btw:
 10,000 * 700 = ~7MB

What's your storage subsystem? Any, even small, RAID device with a write
cache should help.


  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Robert Milkowski
Hello Tharindu,

Wednesday, July 23, 2008, 6:35:33 AM, you wrote:

TRB> Dear Mark/All,

TRB> Our trading system is writing to a local and/or array volume at 10k
TRB> messages per second.
TRB> Each message is about 700 bytes in size.

TRB> Before ZFS, we used UFS.
TRB> Even with UFS, there was a peak every 5 seconds due to fsflush invocation.

TRB> However, each peak is about ~5ms.
TRB> Our application cannot recover from such high latency.

TRB> So we used several tuning parameters (tune_t_* and autoup) to decrease
TRB> the flush interval.
TRB> As a result, peaks came down to ~1.5ms. But that is still too high for our
TRB> application.

TRB> I believe that if we could reduce the ZFS sync interval down to ~1s, peaks would
TRB> be reduced to ~1ms or less.
TRB> We would rather have <1ms peaks every second than a 5ms peak every 5 seconds :-)

TRB> Is there any tunable with which I can reduce the ZFS sync interval?
TRB> If there is no such tunable, can I not use "mdb" for the job ...?

TRB> This is not a general-purpose setup, and we are OK with an increased I/O rate.
TRB> Please advise/help.

txg_time/D

btw:
 10,000 * 700 = ~7MB

What's your storage subsystem? Any, even small, RAID device with a write
cache should help.


-- 
Best regards,
 Robert Milkowski            mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-22 Thread Tharindu Rukshan Bamunuarachchi

Dear Mark/All,

Our trading system is writing to local and/or array volume at 10k 
messages per second.
Each message is about 700bytes in size.

Before ZFS, we used UFS.
Even with UFS, there was evey 5 second peak due to fsflush invocation.

However each peak is about ~5ms.
Our application can not recover from such higher latency.

So we used several tuning parameters (tune_r_* and autoup) to decrease 
the flush interval.
As a result peaks came down to ~1.5ms. But it is still too high for our 
application.

I believe, if we could reduce ZFS sync interval down to ~1s, peaks will 
be reduced to ~1ms or less.
We like <1ms peaks per second than 5ms peak per 5 second :-)

Are there any tunable, so i can reduce ZFS sync interval.
If there is no any tunable, can not I use "mdb" for the job ...?

This is not general and we are ok with increased I/O rate.
Please advice/help.

Thanks in advance.
tharindu
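
For comparison, a sketch of the persistent form such a tunable would take in
/etc/system, assuming the variable in this kernel's zfs module is really named
txg_time (an assumed name that differs between releases):

    * /etc/system -- hypothetically drop the ZFS transaction group interval to 1s
    * (assumes the zfs module exposes an integer tunable named txg_time)
    set zfs:txg_time = 1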


Mark Maybee wrote:
> ZFS is designed to "sync" a transaction group about every 5 seconds
> under normal work loads.  So your system looks to be operating as
> designed.  Is there some specific reason why you need to reduce this
> interval?  In general, this is a bad idea, as there is somewhat of a
> "fixed overhead" associated with each sync, so increasing the sync
> frequency could result in increased IO.
>
> -Mark
>
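
One way to confirm that the periodic peaks line up with these transaction-group
syncs is to time spa_sync directly with DTrace -- a sketch, assuming the fbt
provider and the spa_sync function are traceable on this kernel:

    # quantize how long each txg sync takes, and note how often one fires
    dtrace -n '
    fbt::spa_sync:entry  { self->ts = timestamp; }
    fbt::spa_sync:return /self->ts/ {
            @["spa_sync duration (ms)"] = quantize((timestamp - self->ts) / 1000000);
            self->ts = 0;
    }'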
> Tharindu Rukshan Bamunuarachchi wrote:
>> Dear ZFS Gurus,
>>
>> We are developing low-latency transaction processing systems for 
>> stock exchanges.
>> A low-latency, high-performance file system is a critical component of our 
>> trading systems.
>>
>> We have chosen ZFS as our primary file system.
>> But we saw periodic disk write peaks every 4-5 seconds.
>>
>> Please refer to the first column of the output below (marked in bold).
>> The output is generated from our own disk performance measuring tool, 
>> i.e. DTool (please find attachment).
>>
>> Compared to UFS/VxFS, ZFS is performing very well, but we could not 
>> minimize the periodic peaks.
>> We used the autoup and tune_t_fsflushr flags for UFS tuning.
>>
>> Is there any ZFS-specific tuning that will reduce the file system 
>> flush interval of ZFS?
>>
>> I have tried all parameters specified in "solarisinternals" and 
>> google.com.
>> I would like to go for a ZFS code change/recompile if necessary.
>>
>> Please advise.
>>
>> Cheers
>> Tharindu
>>
>>
>>
>> cpu4600-100 /tantan > ./DTool -f M -s 1000 -r 1 -i 1 -W
>> System Tick = 100 usecs
>> Clock resolution 10
>> HR Timer created for 100usecs
>> z_FileName = M
>> i_Rate = 1
>> l_BlockSize = 1000
>> i_SyncInterval = 0
>> l_TickInterval = 100
>> i_TicksPerIO = 1
>> i_NumOfIOsPerSlot = 1
>> Max (us)| Min (us)  | Avg (us)  | MB/S  | 
>> File  Freq Distribution
>>   336   |  4|  10.5635  |  4.7688   |  M   
>> 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   *1911 * |  4|  10.3152  |  9.4822   |  M   
>> 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   307   |  4|  9.9386   |  9.5324   |  M   
>> 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   331   |  4|  9.9465   |  9.5332   |  M   
>> 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   318   |  4|  10.1241  |  9.5309   |  M   
>> 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   303   |  4|  9.9236   |  9.5296   |  M   
>> 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   560   |  4|  10.2604  |  9.4565   |  M   
>> 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   376   |  4|  9.9975   |  9.5176   |  M   
>> 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   *9783 * |  4|  10.8216  |  9.5301   |  M   
>> 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 1(0.01), 
>> 10(0.00), 20(0.00),
>>   332   |  4|  9.9345   |  9.5252   |  M   
>> 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   355   |  4|  9.9906   |  9.5315   |  M   
>> 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   356   |  4|  10.2341  |  9.5207   |  M   
>> 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0.00), 20(0.00),
>>   320   |  4|  9.8893   |  9.5279   |  M   
>> 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 1(0.00), 
>> 10(0