Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-07 Thread Joerg Schilling
Bob Friesenhahn  wrote:

> On Tue, 7 Jul 2009, Joerg Schilling wrote:
> >
> > posix_fadvise seems to be _very_ new for Solaris and even though I am
> > frequently reading/writing the POSIX standards mailing list, I was not 
> > aware of
> > it.
> >
> > From my tests with star, I cannot see a significant performance increase 
> > but it
> > may have a 3% effect
>
> Based on the prior discussions of using mmap() with ZFS and the way 
> ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at 
> all and POSIX_FADV_DONTNEED probably does not work either.  These are 
> pretty straightforward to implement with UFS since UFS benefits from 
> the existing working madvise() functionality.

I did run my tests on UFS...

> ZFS seems to want to cache all read data in the ARC, period.

And this is definitely a conceptual mistake, as there are applications like 
star that benefit from read-ahead but don't want to trash caches.
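
For concreteness, here is a minimal sketch (not star's actual code) of the advisory
calls such an application could issue where posix_fadvise(3C) exists; whether ZFS
honours any of them is exactly the open question in this thread:

/*
 * Hedged sketch of a one-pass sequential read with cache-release hints.
 * The POSIX_FADV_* calls are purely advisory and may well be no-ops on ZFS.
 */
#include <fcntl.h>
#include <unistd.h>

static long
read_file_once(const char *path, char *buf, size_t bufsz)
{
	int fd = open(path, O_RDONLY);
	long total = 0;
	off_t consumed = 0;
	ssize_t n;

	if (fd < 0)
		return (-1);

	/* Ask for aggressive read-ahead and declare single-use access. */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

	while ((n = read(fd, buf, bufsz)) > 0) {
		/* ... archive the data ... */
		consumed += n;
		total += n;
		/* The bytes just consumed will not be needed again. */
		(void) posix_fadvise(fd, 0, consumed, POSIX_FADV_DONTNEED);
	}
	(void) close(fd);
	return (n < 0 ? -1 : total);
}

On UFS such hints can map onto the existing read-ahead and page-freeing machinery;
whether ZFS does anything at all with them is, as Bob says, doubtful.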

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-07 Thread Bob Friesenhahn

On Tue, 7 Jul 2009, Joerg Schilling wrote:


posix_fadvise seems to be _very_ new for Solaris and even though I am
frequently reading/writing the POSIX standards mailing list, I was not aware of
it.

From my tests with star, I cannot see a significant performance increase but it
may have a 3% effect


Based on the prior discussions of using mmap() with ZFS and the way 
ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at 
all and POSIX_FADV_DONTNEED probably does not work either.  These are 
pretty straightforward to implement with UFS since UFS benefits from 
the existing working madvise() functionality.


ZFS seems to want to cache all read data in the ARC, period.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-07 Thread Gary Mills
On Mon, Jul 06, 2009 at 04:54:16PM +0100, Andrew Gabriel wrote:
> Andre van Eyssen wrote:
> >On Mon, 6 Jul 2009, Gary Mills wrote:
> >
> >>As for a business case, we just had an extended and catastrophic
> >>performance degradation that was the result of two ZFS bugs.  If we
> >>have another one like that, our director is likely to instruct us to
> >>throw away all our Solaris toys and convert to Microsoft products.
> >
> >If you change platform every time you get two bugs in a product, you 
> >must cycle platforms on a pretty regular basis!
> 
> You often find the change is towards Windows. That very rarely has the 
> same rules applied, so things then stick there.

There's a more general principle in operation here.  Organizations do
sometimes change platforms for peculiar reasons, but once they do that
they're not going to do it again for a long time.  That's why they
disregard problems with the new platform.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-07 Thread Joerg Schilling
James Andrewartha  wrote:

> Joerg Schilling wrote:
> > I would be interested to see an open(2) flag that tells the system that I will
> > read a file that I opened exactly once in native order. This could tell the
> > system to do read-ahead and to later mark the pages as immediately reusable.
> > This would make star even faster than it is now.
>
> Are you aware of posix_fadvise(2) and madvise(2)?

I have of course been aware of madvise since December 1987, but it is an interface 
that does not play nicely with a highly portable program like star.
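
To illustrate the portability point (a hedged sketch, not star's code): madvise
only applies to mapped regions, and the useful advice values differ between
systems, so a portable program ends up wrapping everything in feature tests:

/*
 * Hedged sketch: advising the kernel about a mapped region, with the
 * feature tests a portable program would need.  Not star's actual code.
 */
#include <sys/mman.h>
#include <unistd.h>

static void *
map_for_single_pass(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	if (p == MAP_FAILED)
		return (NULL);
#ifdef MADV_SEQUENTIAL
	(void) madvise(p, len, MADV_SEQUENTIAL);	/* read-ahead hint */
#endif
	return (p);
}

static void
unmap_after_single_pass(void *p, size_t len)
{
#ifdef MADV_DONTNEED
	(void) madvise(p, len, MADV_DONTNEED);	/* pages immediately reusable */
#endif
	(void) munmap(p, len);
}

Even with the guards, the behaviour behind each advice value still differs from
system to system, which is the portability problem.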

posix_fadvise seems to be _very_ new for Solaris and even though I am 
frequently reading/writing the POSIX standards mailing list, I was not aware of 
it.

From my tests with star, I cannot see a significant performance increase, but it 
may have a 3% effect.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-07 Thread James Andrewartha
Joerg Schilling wrote:
> I would be interested to see an open(2) flag that tells the system that I will
> read a file that I opened exactly once in native order. This could tell the
> system to do read-ahead and to later mark the pages as immediately reusable.
> This would make star even faster than it is now.

Are you aware of posix_fadvise(2) and madvise(2)?

-- 
James Andrewartha
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Lejun Zhu
If the CPU seems to be idle, the tool latencytop can probably give you some clues. 
It was developed for OpenSolaris, but it should work on Solaris 10 too (with glib 2.14 
installed). You can get a copy of v0.1 at 
http://opensolaris.org/os/project/latencytop/

To use latencytop, open a terminal and start "latencytop -s -k 2". The tool 
will show a window with activities that are being blocked in the system. Then 
you can launch your application to reproduce the performance problem in another 
terminal, switch back to the latencytop window, and use "<" and ">" to find your 
process. The list will tell you which function is causing the delay.

After a couple of minutes you may press "q" to exit latencytop. When it exits, 
a log file /var/log/latencytop.log will be created. It includes the stack traces 
of waits for I/O, semaphores, etc. while latencytop was running. If you post the 
log here, I can probably extract a list of the worst delays in the ZFS source code, 
and other experts may comment.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Sanjeev
Bob,

Catching up late on this thread.

Would it be possible for you to collect the following data:
- /usr/sbin/lockstat -CcwP -n 5 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -HcwP -n 5 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5

Or, if you have access to the GUDs tool, please collect the data using that.

We need to understand how the ARC plays a role here.

Thanks and regards,
Sanjeev.
On Sat, Jul 04, 2009 at 02:49:05PM -0500, Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Jonathan Edwards wrote:
>>
>> this is only going to help if you've got problems in zfetch .. you'd 
>> probably see this better by looking for high lock contention in zfetch 
>> with lockstat
>
> This is what lockstat says when performance is poor:
>
> Adaptive mutex spin: 477 events in 30.019 seconds (16 events/sec)
>
> Count indv cuml rcnt nsec Lock   Caller  
> ---
>47  10%  10% 0.00 5813 0x80256000 untimeout+0x24
>46  10%  19% 0.00 2223 0xb0a2f200 taskq_thread+0xe3
>38   8%  27% 0.00 2252 0xb0a2f200 cv_wait+0x70
>29   6%  34% 0.00 1115 0x80256000 callout_execute+0xeb
>26   5%  39% 0.00 3006 0xb0a2f200 taskq_dispatch+0x1b8
>22   5%  44% 0.00 1200 0xa06158c0 post_syscall+0x206
>18   4%  47% 0.00 3858 arc_eviction_mtx   arc_do_user_evicts+0x76
>16   3%  51% 0.00 1352 arc_eviction_mtx   arc_buf_add_ref+0x2d
>15   3%  54% 0.00 5376 0xb1adac28 taskq_thread+0xe3
>11   2%  56% 0.00 2520 0xb1adac28 taskq_dispatch+0x1b8
> 9   2%  58% 0.00 2158 0xbb909e20 pollwakeup+0x116
> 9   2%  60% 0.00 2431 0xb1adac28 cv_wait+0x70
> 8   2%  62% 0.00 3912 0x80259000 untimeout+0x24
> 7   1%  63% 0.00 3679 0xb10dfbc0 polllock+0x3f
> 7   1%  65% 0.00 2171 0xb0a2f2d8 cv_wait+0x70
> 6   1%  66% 0.00  771 0xb3f23708 pcache_delete_fd+0xac
> 6   1%  67% 0.00 4679 0xb0a2f2d8 taskq_dispatch+0x1b8
> 5   1%  68% 0.00  500 0xbe555040 fifo_read+0xf8
> 5   1%  69% 0.0015838 0x8025c000 untimeout+0x24
> 4   1%  70% 0.00 1213 0xac44b558 sd_initpkt_for_buf+0x110
> 4   1%  71% 0.00  638 0xa28722a0 polllock+0x3f
> 4   1%  72% 0.00  610 0x80259000 timeout_common+0x39
> 4   1%  73% 0.0010691 0x80256000 timeout_common+0x39
> 3   1%  73% 0.00 1559 htable_mutex+0x78  htable_release+0x8a
> 3   1%  74% 0.00 3610 0xbb909e20 cv_timedwait_sig+0x1c1
> 3   1%  74% 0.00 1636 0xa240d410 
> ohci_allocate_periodic_in_resource+0x71
> 2   0%  75% 0.00 5959 0xbe555040 fifo_read+0x5c
> 2   0%  75% 0.00 3744 0xbe555040 polllock+0x3f
> 2   0%  76% 0.00  635 0xb3f23708 pollwakeup+0x116
> 2   0%  76% 0.00  709 0xb3f23708 cv_timedwait_sig+0x1c1
> 2   0%  77% 0.00  831 0xb3dd2070 pcache_insert+0x13d
> 2   0%  77% 0.00 5976 0xb3dd2070 pollwakeup+0x116
> 2   0%  77% 0.00 1339 0xb1eb9b80 
> metaslab_group_alloc+0x136
> 2   0%  78% 0.00 1514 0xb0a2f2d8 taskq_thread+0xe3
> 2   0%  78% 0.00 4042 0xb0a22988 vdev_queue_io_done+0xc3
> 2   0%  79% 0.00 3428 0xb0a21f08 vdev_queue_io_done+0xc3
> 2   0%  79% 0.00 1002 0xac44b558 sd_core_iostart+0x37
> 2   0%  79% 0.00 1387 0xa8c56d80 xbuf_iostart+0x7d
> 2   0%  80% 0.00  698 0xa58a3318 sd_return_command+0x11b
> 2   0%  80% 0.00  385 0xa58a3318 sd_start_cmds+0x115
> 2   0%  81% 0.00  562 0xa5647800 ssfcp_scsi_start+0x30
> 2   0%  81% 0.00 1620 0xa4162d58 ssfcp_scsi_init_pkt+0x1be
> 2   0%  82% 0.00  897 0xa4162d58 ssfcp_scsi_start+0x42
> 2   0%  82% 0.00  475 0xa4162b78 ssfcp_scsi_start+0x42
> 2   0%  82% 0.00  697 0xa40fb158 sd_start_cmds+0x115
> 2   0%  83% 0.0010901 0xa28722a0 fifo_write+0x5b
> 2   0%  83% 0.00 4379 0xa28722a0 fifo_read+0xf8
> 2   0%  84% 0.00 1534 0xa2638390 emlxs_tx_get+0x38
> 2   0%  84% 0.00 1601 0xa2638350 emlxs_issue_iocb_cmd+0xc1
> 2   0%  84% 0.00 6697 0xa2503f08 vdev_queue_io_done+0x7b
> 2   0%  85% 0.00 4113 0xa24040b0 
> gcpu_ntv_mca_poll_wrapper+0x64
> 2   0%  85% 0.00  928 0xfe85dc140658 pollwakeup+0x116
> 1   0%  86% 0.00  404 iommulib_lock  lookup_cache+0x2c
> 1   0%  86% 0.00 4867 pidlock   

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Andrew Gabriel

Andre van Eyssen wrote:

On Mon, 6 Jul 2009, Gary Mills wrote:


As for a business case, we just had an extended and catastrophic
performance degradation that was the result of two ZFS bugs.  If we
have another one like that, our director is likely to instruct us to
throw away all our Solaris toys and convert to Microsoft products.


If you change platform every time you get two bugs in a product, you 
must cycle platforms on a pretty regular basis!


You often find the change is towards Windows. That very rarely has the 
same rules applied, so things then stick there.


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Bryan Allen
+--
| On 2009-07-07 01:29:11, Andre van Eyssen wrote:
| 
| On Mon, 6 Jul 2009, Gary Mills wrote:
| 
| >As for a business case, we just had an extended and catastrophic
| >performance degradation that was the result of two ZFS bugs.  If we
| >have another one like that, our director is likely to instruct us to
| >throw away all our Solaris toys and convert to Microsoft products.
| 
| If you change platform every time you get two bugs in a product, you must 
| cycle platforms on a pretty regular basis!

Given that policy, I don't imagine Windows will last very long anyway.
-- 
bda
cyberpunk is dead. long live cyberpunk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Andre van Eyssen

On Mon, 6 Jul 2009, Gary Mills wrote:


As for a business case, we just had an extended and catastrophic
performance degradation that was the result of two ZFS bugs.  If we
have another one like that, our director is likely to instruct us to
throw away all our Solaris toys and convert to Microsoft products.


If you change platform every time you get two bugs in a product, you must 
cycle platforms on a pretty regular basis!


--
Andre van Eyssen.
mail: an...@purplecow.org  jabber: an...@interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Gary Mills
On Sat, Jul 04, 2009 at 07:18:45PM +0100, Phil Harman wrote:
> Gary Mills wrote:
> >On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
> >  
> >>ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
> >>instead of the Solaris page cache. But mmap() uses the latter. So if  
> >>anyone maps a file, ZFS has to keep the two caches in sync.
> >
> >That's the first I've heard of this issue.  Our e-mail server runs
> >Cyrus IMAP with mailboxes on ZFS filesystems.  Cyrus uses mmap(2)
> >extensively.  I understand that Solaris has an excellent
> >implementation of mmap(2).  ZFS has many advantages, snapshots for
> >example, for mailbox storage.  Is there anything that we can be do to
> >optimize the two caches in this environment?  Will mmap(2) one day
> >play nicely with ZFS?
> 
[..]
> Software engineering is always about prioritising resource. Nothing 
> prioritises performance tuning attention quite like compelling 
> competitive data. When Bart Smaalders and I wrote libMicro we generated 
> a lot of very compelling data. I also coined the phrase "If Linux is 
> faster, it's a Solaris bug". You will find quite a few (mostly fixed) 
> bugs with the synopsis "linux is faster than solaris at ...".
> 
> So, if mmap(2) playing nicely with ZFS is important to you, probably the 
> best thing you can do to help that along is to provide data that will 
> help build the business case for spending engineering resource on the issue.

First of all, how significant is the double caching in terms of
performance?  If the effect is small, I won't worry about it anymore.

What sort of data do you need?  Would a list of software products that
utilize mmap(2) extensively and could benefit from ZFS be suitable?

As for a business case, we just had an extended and catastrophic
performance degradation that was the result of two ZFS bugs.  If we
have another one like that, our director is likely to instruct us to
throw away all our Solaris toys and convert to Microsoft products.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Bob Friesenhahn

On Mon, 6 Jul 2009, Boyd Adamson wrote:


Probably this is encouraged by documentation like this:


The memory mapping interface is described in Memory Management
Interfaces. Mapping files is the most efficient form of file I/O for
most applications run under the SunOS platform.


Found at:

http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view


People often think of the main benefit of mmap() as reduced CPU 
consumption and fewer buffer copies, but the mmap() family of programming 
interfaces is much richer than low-level read/write, pread/pwrite, or 
stdio, because madvise() provides the ability to schedule I/O or to 
flush stale data from memory.  In recent Solaris releases, it also includes 
provisions which allow applications to improve their performance on 
NUMA systems.
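
By way of illustration, a hedged sketch of the kinds of advice referred to here;
which MADV_* values are available on a given release is an assumption the code
has to test for:

/*
 * Hedged sketch of the madvise(3C) uses mentioned above; the exact set of
 * advice values on any given release is an assumption, hence the guard.
 */
#include <sys/types.h>
#include <sys/mman.h>

static void
advise_mapping(char *addr, size_t len)
{
	/* I/O scheduling: start read-ahead for a range we will touch soon. */
	(void) madvise(addr, len, MADV_WILLNEED);

#ifdef MADV_ACCESS_LWP
	/* Solaris NUMA hint: place pages near the next LWP that touches them. */
	(void) madvise(addr, len, MADV_ACCESS_LWP);
#endif

	/* Later: drop stale data so the pages can be reclaimed immediately. */
	(void) madvise(addr, len, MADV_DONTNEED);
}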


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-06 Thread Boyd Adamson
Phil Harman  writes:

> Gary Mills wrote:
> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather
> than just to support dynamic linking).

Probably this is encouraged by documentation like this:

> The memory mapping interface is described in Memory Management
> Interfaces. Mapping files is the most efficient form of file I/O for
> most applications run under the SunOS platform.

Found at:

http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view


Boyd.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Phil Harman wrote:


This is not a new problem.  It seems that I have been banging my head 
against this from the time I started using zfs. 


I'd like to see mpstat 1 for each case, on an otherwise idle system, 
but then there's probably a whole lot of dtrace I'd like to do ... 
but I'm just off on vacation for a week, and this will probably have 
to be my last post on this thread until I'm back.


Shame on you for taking well-earned vacation in my time of need. :-)

'mpstat 1' output when I/O is good:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   00  1700  247 2187   11  214   110 102702   5   0  93
  10   00  14785 2812   18  241   100 184242   4   0  94
  20   01  12100 2392   60  185   190 3019275  28   0  67
  30   00  3242 2320 2028   60  18190 2225003  24   0  73
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   00  1862  244 25549  23160  28802   3   0  95
  10   00  11581 2055   17  22170  44791   3   0  96
  20   00  10370 2051   65  186   140 2502114  24   0  73
  30   00  3037 2167 2101   62  186   110 2513934  25   0  71

'mpstat 1'  output when I/O is bad:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   00   859  243 10065  10600 207332   3   0  95
  10   00   504   15  942   12   8460 740093   6   0  91
  20   00   1920  3380   4800380   1   0  99
  30   00   549  376  5221   3600   1350   2   0  98

Notice how intensely unbusy the CPU cores are when I/O is bad.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman

Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:


However, it seems that memory mapping is not responsible for the 
problem I am seeing here.  Memory mapping may make the problem seem 
worse, but it is clearly not the cause.


mmap(2) is what brings ZFS files into the page cache. I think you've 
shown us that once you've copied files with cp(1) - which does use 
mmap(2) - that anything that uses read(2) on the same files is impacted.


The problem is observed with cpio, which does not use mmap.  This is 
immediately after a reboot or unmount/mount of the filesystem.


Sorry, I didn't get to your other post ...

Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) 
performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 
with latest firmware.  I rebooted the system, used cpio to send the 
input files to /dev/null, and then immediately used cpio a second time 
to send the input files to /dev/null.  Note that the amount of file 
data (243 GB) is plenty sufficient to purge any file data from the ARC 
(which has a cap of 10 GB).


% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 1.573 total
cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 0.198 total
cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total

zpool iostat averaged over 60 seconds reported that the first run 
through the files read the data at 251 MB/s and the second run only 
achieved 68 MB/s.  It seems clear that there is something really bad 
about Solaris 10 zfs's file caching code which is causing it to go 
into the weeds.


I don't think that the results mean much, but I have attached output 
from 'hotkernel' while a subsequent cpio copy is taking place.  It 
shows that the kernel is mostly sleeping.


This is not a new problem.  It seems that I have been banging my head 
against this from the time I started using zfs. 


I'd like to see mpstat 1 for each case, on an otherwise idle system, but 
then there's probably a whole lot of dtrace I'd like to do ... but I'm 
just off on vacation for a week, and this will probably have to be my 
last post on this thread until I'm back.


Cheers,
Phil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread dick hoogendijk
On Sat, 4 Jul 2009 13:03:52 -0500 (CDT)
Bob Friesenhahn  wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:

> > Did you try to use highly performant software like star?
> 
> No, because I don't want to tarnish your software's stellar 
> reputation.  I am focusing on Solaris 10 bugs today.

Blunt.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2009.06 release
+ All that's really worth doing is what we do for others (Lewis Carrol)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Jonathan Edwards wrote:


this is only going to help if you've got problems in zfetch .. you'd probably 
see this better by looking for high lock contention in zfetch with lockstat


This is what lockstat says when performance is poor:

Adaptive mutex spin: 477 events in 30.019 seconds (16 events/sec)

Count indv cuml rcnt nsec Lock   Caller 
---

   47  10%  10% 0.00 5813 0x80256000 untimeout+0x24
   46  10%  19% 0.00 2223 0xb0a2f200 taskq_thread+0xe3
   38   8%  27% 0.00 2252 0xb0a2f200 cv_wait+0x70
   29   6%  34% 0.00 1115 0x80256000 callout_execute+0xeb
   26   5%  39% 0.00 3006 0xb0a2f200 taskq_dispatch+0x1b8
   22   5%  44% 0.00 1200 0xa06158c0 post_syscall+0x206
   18   4%  47% 0.00 3858 arc_eviction_mtx   arc_do_user_evicts+0x76
   16   3%  51% 0.00 1352 arc_eviction_mtx   arc_buf_add_ref+0x2d
   15   3%  54% 0.00 5376 0xb1adac28 taskq_thread+0xe3
   11   2%  56% 0.00 2520 0xb1adac28 taskq_dispatch+0x1b8
9   2%  58% 0.00 2158 0xbb909e20 pollwakeup+0x116
9   2%  60% 0.00 2431 0xb1adac28 cv_wait+0x70
8   2%  62% 0.00 3912 0x80259000 untimeout+0x24
7   1%  63% 0.00 3679 0xb10dfbc0 polllock+0x3f
7   1%  65% 0.00 2171 0xb0a2f2d8 cv_wait+0x70
6   1%  66% 0.00  771 0xb3f23708 pcache_delete_fd+0xac
6   1%  67% 0.00 4679 0xb0a2f2d8 taskq_dispatch+0x1b8
5   1%  68% 0.00  500 0xbe555040 fifo_read+0xf8
5   1%  69% 0.0015838 0x8025c000 untimeout+0x24
4   1%  70% 0.00 1213 0xac44b558 sd_initpkt_for_buf+0x110
4   1%  71% 0.00  638 0xa28722a0 polllock+0x3f
4   1%  72% 0.00  610 0x80259000 timeout_common+0x39
4   1%  73% 0.0010691 0x80256000 timeout_common+0x39
3   1%  73% 0.00 1559 htable_mutex+0x78  htable_release+0x8a
3   1%  74% 0.00 3610 0xbb909e20 cv_timedwait_sig+0x1c1
3   1%  74% 0.00 1636 0xa240d410 
ohci_allocate_periodic_in_resource+0x71
2   0%  75% 0.00 5959 0xbe555040 fifo_read+0x5c
2   0%  75% 0.00 3744 0xbe555040 polllock+0x3f
2   0%  76% 0.00  635 0xb3f23708 pollwakeup+0x116
2   0%  76% 0.00  709 0xb3f23708 cv_timedwait_sig+0x1c1
2   0%  77% 0.00  831 0xb3dd2070 pcache_insert+0x13d
2   0%  77% 0.00 5976 0xb3dd2070 pollwakeup+0x116
2   0%  77% 0.00 1339 0xb1eb9b80 metaslab_group_alloc+0x136
2   0%  78% 0.00 1514 0xb0a2f2d8 taskq_thread+0xe3
2   0%  78% 0.00 4042 0xb0a22988 vdev_queue_io_done+0xc3
2   0%  79% 0.00 3428 0xb0a21f08 vdev_queue_io_done+0xc3
2   0%  79% 0.00 1002 0xac44b558 sd_core_iostart+0x37
2   0%  79% 0.00 1387 0xa8c56d80 xbuf_iostart+0x7d
2   0%  80% 0.00  698 0xa58a3318 sd_return_command+0x11b
2   0%  80% 0.00  385 0xa58a3318 sd_start_cmds+0x115
2   0%  81% 0.00  562 0xa5647800 ssfcp_scsi_start+0x30
2   0%  81% 0.00 1620 0xa4162d58 ssfcp_scsi_init_pkt+0x1be
2   0%  82% 0.00  897 0xa4162d58 ssfcp_scsi_start+0x42
2   0%  82% 0.00  475 0xa4162b78 ssfcp_scsi_start+0x42
2   0%  82% 0.00  697 0xa40fb158 sd_start_cmds+0x115
2   0%  83% 0.0010901 0xa28722a0 fifo_write+0x5b
2   0%  83% 0.00 4379 0xa28722a0 fifo_read+0xf8
2   0%  84% 0.00 1534 0xa2638390 emlxs_tx_get+0x38
2   0%  84% 0.00 1601 0xa2638350 emlxs_issue_iocb_cmd+0xc1
2   0%  84% 0.00 6697 0xa2503f08 vdev_queue_io_done+0x7b
2   0%  85% 0.00 4113 0xa24040b0 
gcpu_ntv_mca_poll_wrapper+0x64
2   0%  85% 0.00  928 0xfe85dc140658 pollwakeup+0x116
1   0%  86% 0.00  404 iommulib_lock  lookup_cache+0x2c
1   0%  86% 0.00 4867 pidlockthread_exit+0x6f
1   0%  86% 0.00 1245 plocks+0x3c0   pollhead_delete+0x23
1   0%  86% 0.00 2452 plocks+0x3c0   pollhead_insert+0x35
1   0%  86% 0.00  882 htable_mutex+0x3c0 htable_lookup+0x83
1   0%  87% 0.0028547 htable_mutex+0x3c0 htable_create+0xe3
1   0%  87% 0.0021173 htable_mutex+0x3c0 htable_release+0x8a
1   0%  87% 0.00 1235 htable_mutex+0x370 htable_lookup+0x83
1   0%  87% 0.00 3212 htable_mutex+0x370 htable_release+0x8a
1   0%  87% 0.00  793 htable_mutex+0x78  htable_lookup+0x83
1   0%  

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Phil Harman wrote:


However, it seems that memory mapping is not responsible for the problem I 
am seeing here.  Memory mapping may make the problem seem worse, but it is 
clearly not the cause.


mmap(2) is what brings ZFS files into the page cache. I think you've shown us 
that once you've copied files with cp(1) - which does use mmap(2) - that 
anything that uses read(2) on the same files is impacted.


The problem is observed with cpio, which does not use mmap.  This is 
immediately after a reboot or unmount/mount of the filesystem.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman

Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:
However, this is only part of the problem. The fundamental issue is 
that ZFS has its own ARC apart from the Solaris page cache, so 
whenever mmap() is used, all I/O to that file has to make sure that 
the two caches are in sync. Hence, a read(2) on a file which has at 
some time been mapped will be impacted, even if the file is no longer 
mapped.


However, it seems that memory mapping is not responsible for the 
problem I am seeing here.  Memory mapping may make the problem seem 
worse, but it is clearly not the cause.


mmap(2) is what brings ZFS files into the page cache. I think you've 
shown us that once you've copied files with cp(1) - which does use 
mmap(2) - that anything that uses read(2) on the same files is impacted.



Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Joerg Schilling
Phil Harman  wrote:

> I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0) 
> was the first UNIX to get a working implementation of mmap(2) for files 
> (if I recall correctly, BSD 4.3 had a manpage but no implementation for 
> files). From that we got a whole lot of cool stuff, not least dynamic 
> linking with ld.so (which has made it just about everywhere).

Well, on BSD you could mmap() devices, but because there was no useful
address space management, you had to first malloc() the required amount of
space, forcing you to have the same amount of memory available as swap.
The device was then mapped on top of the allocated memory, making the
underlying swap space inaccessible. At Berthold AG we had to add expensive
amounts of swap at the time just to be able to mmap the 256 MB of RAM in our
image processor.


> The Solaris implementation of mmap(2) is functionally correct, but the 
> wait for a 64 bit address space rather moved the attention of 
> performance tuning elsewhere. I must admit I was surprised to see so 
> much code out there that still uses mmap(2) for general I/O (rather than 
> just to support dynamic linking).

When the new memory management architecture was introduced with SunOS-4.0,
things became better, although the now unified and partially anonymous address 
space made it hard to implement "limit memoryuse" (rlimit with RLIMIT_RSS).
I made a working implementation for SunOS-4.0, but it did not make it into 
SunOS.

There are still related performance issues. If you e.g. store a CD/DVD/BluRay
image in /tmp that is bigger than the amount of RAM in the machine, you will
observe a buffer underrun while writing with cdrecord unless you use 
driveropts=burnfree, because paging in is slow on tmpfs.

> Software engineering is always about prioritising resource. Nothing 
> prioritises performance tuning attention quite like compelling 
> competitive data. When Bart Smaalders and I wrote libMicro we generated 
> a lot of very compelling data. I also coined the phrase "If Linux is 
> faster, it's a Solaris bug". You will find quite a few (mostly fixed) 
> bugs with the synopsis "linux is faster than solaris at ...".

Fortunately, Linux is slower at most tasks ;-)

In 1988, the effect of mmap() was much more visible than it is now. 20 years 
ago, CPU speed limited copy operations, making pipes, copyout() and similar 
slow. This has changed with modern CPUs, and for this reason the demand for 
using mmap() is lower than it was 20 years ago.


> So, if mmap(2) playing nicely with ZFS is important to you, probably the 
> best thing you can do to help that along is to provide data that will 
> help build the business case for spending engineering resource on the issue.

I would be interested to see an open(2) flag that tells the system that I will
read a file that I opened exactly once in native order. This could tell the 
system to do read-ahead and to later mark the pages as immediately reusable. 
This would make star even faster than it is now.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Phil Harman wrote:

However, this is only part of the problem. The fundamental issue is that ZFS 
has its own ARC apart from the Solaris page cache, so whenever mmap() is 
used, all I/O to that file has to make sure that the two caches are in sync. 
Hence, a read(2) on a file which has at some time been mapped will be impacted, 
even if the file is no longer mapped.


However, it seems that memory mapping is not responsible for the 
problem I am seeing here.  Memory mapping may make the problem seem 
worse, but it is clearly not the cause.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn
Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) 
performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 
with latest firmware.  I rebooted the system, used cpio to send the 
input files to /dev/null, and then immediately used cpio a second time 
to send the input files to /dev/null.  Note that the amount of file 
data (243 GB) is plenty sufficient to purge any file data from the ARC 
(which has a cap of 10 GB).


% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 1.573 total
cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 0.198 total
cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total

zpool iostat averaged over 60 seconds reported that the first run 
through the files read the data at 251 MB/s and the second run only 
achieved 68 MB/s.  It seems clear that there is something really bad 
about Solaris 10 zfs's file caching code which is causing it to go 
into the weeds.


I don't think that the results mean much, but I have attached output 
from 'hotkernel' while a subsequent cpio copy is taking place.  It 
shows that the kernel is mostly sleeping.


This is not a new problem.  It seems that I have been banging my head 
against this from the time I started using zfs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

Sampling... Hit Ctrl-C to end.

FUNCTIONCOUNT   PCNT
unix`SHA1Update 1   0.0%
unix`page_unlock1   0.0%
unix`lwp_segregs_save   1   0.0%
rootnex`rootnex_dma_allochdl1   0.0%
unix`mutex_delay_default1   0.0%
emlxs`emlxs_initialize_pkt  1   0.0%
genunix`pid_lookup  1   0.0%
TS`ts_setrun1   0.0%
fcp`ssfcp_adjust_cmd1   0.0%
genunix`strrput 1   0.0%
genunix`cyclic_softint  1   0.0%
genunix`fop_poll1   0.0%
sd`sd_xbuf_strategy 1   0.0%
ohci`ohci_state_is_operational  1   0.0%
zfs`SHA256Transform 1   0.0%
unix`cpu_resched1   0.0%
nvidia`_nv006110rm  1   0.0%
genunix`lwp_timer_timeout   1   0.0%
genunix`realtime_timeout1   0.0%
fcp`ssfcp_scsi_destroy_pkt  1   0.0%
nvidia`nvidia_pci_check_config_space1   0.0%
genunix`closef  1   0.0%
sd`sd_setup_rw_pkt  1   0.0%
unix`vsnprintf  1   0.0%
zfs`vdev_dtl_contains   1   0.0%
genunix`siginfo_kto32   1   0.0%
iommulib`iommulib_nex_open  1   0.0%
genunix`vn_has_cached_data  1   0.0%
ohci`ohci_sendup_td_message 1   0.0%
scsi_vhci`vhci_scsi_destroy_pkt 1   0.0%
genunix`avl_add 1   0.0%
unix`page_create_va 1   0.0%
genunix`savectx 1   0.0%
ohci`ohci_root_hub_allocate_intr_pipe_resource  1   0.0%
unix`page_add   1   0.0%
zfs`zfs_unix_to_v4  1   0.0%
genunix`set_qend1   0.0%
zfs`vdev_queue_io_done  1   0.0%
unix`set_idle_cpu   1   0.0%
zfs`vdev_cache_read 1   0.0%
nvidia`_nv002998rm  1   0.0%
ohci`ohci_do_intrs_stats1   0.0%
genunix`putq1   0.0%
genunix`strput  1   0.0%
zfs`zio_buf_alloc   1   0.0%
sockfs`socktpi_poll 1   0.0%
sockfs`so_update_attrs  1   0.0%
sockfs`so_unlock_read   

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman

Gary Mills wrote:

On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
  
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
instead of the Solaris page cache. But mmap() uses the latter. So if  
anyone maps a file, ZFS has to keep the two caches in sync.



That's the first I've heard of this issue.  Our e-mail server runs
Cyrus IMAP with mailboxes on ZFS filesystems.  Cyrus uses mmap(2)
extensively.  I understand that Solaris has an excellent
implementation of mmap(2).  ZFS has many advantages, snapshots for
example, for mailbox storage.  Is there anything that we can be do to
optimize the two caches in this environment?  Will mmap(2) one day
play nicely with ZFS?
  


I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0) 
was the first UNIX to get a working implementation of mmap(2) for files 
(if I recall correctly, BSD 4.3 had a manpage but no implementation for 
files). From that we got a whole lot of cool stuff, not least dynamic 
linking with ld.so (which has made it just about everywhere).


The Solaris implementation of mmap(2) is functionally correct, but the 
wait for a 64 bit address space rather moved the attention of 
performance tuning elsewhere. I must admit I was surprised to see so 
much code out there that still uses mmap(2) for general I/O (rather than 
just to support dynamic linking).


Software engineering is always about prioritising resource. Nothing 
prioritises performance tuning attention quite like compelling 
competitive data. When Bart Smaalders and I wrote libMicro we generated 
a lot of very compelling data. I also coined the phrase "If Linux is 
faster, it's a Solaris bug". You will find quite a few (mostly fixed) 
bugs with the synopsis "linux is faster than solaris at ...".


So, if mmap(2) playing nicely with ZFS is important to you, probably the 
best thing you can do to help that along is to provide data that will 
help build the business case for spending engineering resource on the issue.


Cheers,
Phil


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Jonathan Edwards


On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

This brings me to the absurd conclusion that the system must be  
rebooted immediately prior to each use.


see Phil's later email .. an export/import of the pool or a remount of  
the filesystem should clear the page cache - with mmap'd files you're  
essentially keeping them both in the page cache and in the ARC ..  
then invalidations in the page cache are going to have effects on  
dirty data in the cache



/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5



if you're on x86 - i'd also increase maxphys to 128K .. we still have  
a 56KB default value in there which is still a bad thing (IMO)


---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Joerg Schilling
Bob Friesenhahn  wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
> >> by more than half.  Based on yesterday's experience, that may diminish
> >> to only 33 MB/s.
> >
> > "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
> >
> > is nearly 40% faster than
> >
> > "find . | cpio -pdum to-dir"
> >
> > Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar 
> reputation.  I am focusing on Solaris 10 bugs today.

I've seen more professional replies. In the end it is your decision
to ignore helpful advice. 

BTW: if star on ZFS were not faster than cpio, that would just be a 
hint of a problem in ZFS that needs to be fixed.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman

Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:


If you reboot, your cpio(1) tests will probably go fast again, until 
someone uses mmap(2) on the files again. I think tar(1) uses read(2), 
but from my iPod I can't be sure. It would be interesting to see how 
tar(1) performs if you run that test before cp(1) on a freshly 
rebooted system.


Ok, I just rebooted the system.  Now 'zpool iostat Sun_2540 60' shows 
that the cpio read rate has increased from (the most recently 
observed) 33 MB/second to as much as 132 MB/second.  To some this may 
not seem significant but to me it looks a whole lot different. ;-)


Thanks, that's really useful data. I wasn't near a machine at the time, 
so I couldn't do it for myself. I answered your initial question based 
on what I understood of the implementation, and it's very satisfying to 
have the data to back it up.


I have done some work with the ZFS team towards a fix, but it is only 
currently in OpenSolaris.


Hopefully the fix is very very good.  It is difficult to displace the 
many years of SunOS training that using mmap is the path to best 
performance.  Mmap provides many tools to improve application 
performance which are just not available via traditional I/O.


The part of the problem I highlighted was ...

  6699438 zfs induces crosscall storm under heavy mapped sequential read

This has been fixed in OpenSolaris, and should be fixed in Solaris 10 
update 8.


However, this is only part of the problem. The fundamental issue is that 
ZFS has its own ARC apart from the Solaris page cache, so whenever 
mmap() is used, all I/O to that file has to make sure that the two 
caches are in sync. Hence, a read(2) on a file which has at some time 
been mapped will be impacted, even if the file is no longer mapped.
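
A quick way to observe the effect is sketched below (illustrative only; the
result depends entirely on the file size relative to the ARC and the page
cache): map a file once, unmap it, then time plain read(2) passes before and
after.

/*
 * Illustrative sketch only: map a file once, unmap it, then time plain
 * read(2) passes before and after.  Timings are entirely system-dependent.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static double
read_pass(const char *path)
{
	static char buf[1024 * 1024];
	struct timeval t0, t1;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		exit(1);
	}
	gettimeofday(&t0, NULL);
	while (read(fd, buf, sizeof (buf)) > 0)
		;
	gettimeofday(&t1, NULL);
	(void) close(fd);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(int argc, char **argv)
{
	struct stat st;
	unsigned long sum = 0;
	off_t i;
	char *p;
	int fd;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}

	(void) printf("read(2) before mmap: %.2f s\n", read_pass(argv[1]));

	/* Touch every page of the file via mmap, then drop the mapping. */
	fd = open(argv[1], O_RDONLY);
	(void) fstat(fd, &st);
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return (1);
	for (i = 0; i < st.st_size; i += 4096)
		sum += (unsigned char)p[i];	/* force the pages in */
	(void) munmap(p, st.st_size);
	(void) close(fd);

	(void) printf("read(2) after mmap:  %.2f s (checksum %lu)\n",
	    read_pass(argv[1]), sum);
	return (0);
}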


I'm sure the data and interest from this thread will be useful to the 
ZFS team in prioritising further performance enhancements. So thanks 
again. And if there's any more useful data you can add, please do so. If 
you have a support contract, you might also consider logging a call and 
even raising an escalation request.


Cheers,
Phil


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Joerg Schilling wrote:

by more than half.  Based on yesterday's experience, that may diminish
to only 33 MB/s.


"star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"

is nearly 40% faster than

"find . | cpio -pdum to-dir"

Did you try to use highly performant software like star?


No, because I don't want to tarnish your software's stellar 
reputation.  I am focusing on Solaris 10 bugs today.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman

Joerg Schilling wrote:

Phil Harman  wrote:

  
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
instead of the Solaris page cache. But mmap() uses the latter. So if  
anyone maps a file, ZFS has to keep the two caches in sync.


cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it  
copies into the Solaris page cache. As long as they remain there ZFS  
will be slow for those files, even if you subsequently use read(2) to  
access them.


If you reboot, your cpio(1) tests will probably go fast again, until  



Do you believe that reboot is the only way to reset this?
  


No, but from my iPod I didn't have the patience to write a fuller 
explanation :)


See ...

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_vnops.c#514

We take the long path if the vnode has any pages cached in the page cache.

So instead of a reboot, you should also be able to export/import the 
pool or unmount/mount the filesystem.


Also, if you didn't touch the file for a long time, and had lots of 
other page cache churn, the file might eventually get expunged from the 
page cache.


Phil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Joerg Schilling
Bob Friesenhahn  wrote:

> A tar pipeline still provides terrible file copy performance.  Read 
> bandwidth is only 26 MB/s.  So I stopped the tar copy and re-tried the 
> cpio copy.
>
> A second copy with the cpio results in a read/write data rate of only 
> 54.9 MB/s (vs the just experienced 132 MB/s).  Performance is reduced 
> by more than half.  Based on yesterday's experience, that may diminish 
> to only 33 MB/s.

"star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"

is nearly 40% faster than 

"find . | cpio -pdum to-dir"

Did you try to use highly performant software like star?

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn
A tar pipeline still provides terrible file copy performance.  Read 
bandwidth is only 26 MB/s.  So I stopped the tar copy and re-tried the 
cpio copy.


A second copy with the cpio results in a read/write data rate of only 
54.9 MB/s (vs the just experienced 132 MB/s).  Performance is reduced 
by more than half.  Based on yesterday's experience, that may diminish 
to only 33 MB/s.


The amount of data being copied is much larger than any cache yet 
somehow reading a file a second time is less than 1/2 as fast.


This brings me to the absurd conclusion that the system must be 
rebooted immediately prior to each use.


/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Gary Mills
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
> instead of the Solaris page cache. But mmap() uses the latter. So if  
> anyone maps a file, ZFS has to keep the two caches in sync.

That's the first I've heard of this issue.  Our e-mail server runs
Cyrus IMAP with mailboxes on ZFS filesystems.  Cyrus uses mmap(2)
extensively.  I understand that Solaris has an excellent
implementation of mmap(2).  ZFS has many advantages, snapshots for
example, for mailbox storage.  Is there anything that we can be do to
optimize the two caches in this environment?  Will mmap(2) one day
play nicely with ZFS?

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Phil Harman wrote:


If you reboot, your cpio(1) tests will probably go fast again, until someone 
uses mmap(2) on the files again. I think tar(1) uses read(2), but from my 
iPod I can't be sure. It would be interesting to see how tar(1) performs if 
you run that test before cp(1) on a freshly rebooted system.


Ok, I just rebooted the system.  Now 'zpool iostat Sun_2540 60' shows 
that the cpio read rate has increased from (the most recently 
observed) 33 MB/second to as much as 132 MB/second.  To some this may 
not seem significant but to me it looks a whole lot different. ;-)


I have done some work with the ZFS team towards a fix, but it is only 
currently in OpenSolaris.


Hopefully the fix is very very good.  It is difficult to displace the 
many years of SunOS training that using mmap is the path to best 
performance.  Mmap provides many tools to improve application 
performance which are just not available via traditional I/O.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Phil Harman wrote:

ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead 
of the Solaris page cache. But mmap() uses the latter. So if anyone maps a 
file, ZFS has to keep the two caches in sync.


cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies 
into the Solaris page cache. As long as they remain there ZFS will be slow 
for those files, even if you subsequently use read(2) to access them.


This is very interesting information and certainly can explain a lot. 
My application has a choice of using mmap or traditional I/O.  I often 
use mmap.  From what you are saying, using mmap is poison to 
subsequent performance.
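
For reference, the copy loop that cp(1) is said to use looks roughly like the
following (a simplified sketch; the real code maps the source in large chunks,
and the 8 MB chunk size here is an assumption). Every page touched this way
lands in the Solaris page cache, which is exactly what later read(2) calls on
the same file then have to stay in sync with:

/*
 * Simplified sketch of an mmap-based file copy (roughly what cp(1) does);
 * chunk size and error handling are illustrative assumptions.
 */
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define	CHUNK	(8 * 1024 * 1024)	/* assumed mapping granularity */

static int
mmap_copy(int srcfd, int dstfd, off_t filesize)
{
	off_t off;

	for (off = 0; off < filesize; off += CHUNK) {
		size_t len = (size_t)(filesize - off < CHUNK ?
		    filesize - off : CHUNK);
		/*
		 * This mapping is what pulls the source pages into the
		 * Solaris page cache, alongside whatever the ARC holds.
		 */
		char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, srcfd, off);

		if (p == MAP_FAILED)
			return (-1);
		if (write(dstfd, p, len) != (ssize_t)len) {
			(void) munmap(p, len);
			return (-1);
		}
		(void) munmap(p, len);
	}
	return (0);
}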


On June 29th I tested my application (which was set to use mmap) 
shortly after a reboot and got this overall initial runtime:


real  2:24:25.675
user  4:38:57.837
sys 14:30.823

By June 30th (with no intermediate reboot) the overall runtime had 
increased to


real  3:08:58.941
user  4:38:38.192
sys 15:44.197

which seems like quite a large change.

If you reboot, your cpio(1) tests will probably go fast again, until someone 
uses mmap(2) on the files again. I think tar(1) uses read(2), but from my


I will test.

The other thing that slows you down is that ZFS only flushes to disk every 5 
seconds if there are no synchronous writes. It would be interesting to see 
iostat -xnz 1 while you are running your tests. You may find the disks are 
writing very efficiently for one second in every five.


Actually I found that the disks were writing flat out for five seconds 
at a time which stalled all other pool I/O (and dependent CPU) for at 
least three seconds (see earlier discussion).  So at the moment I have 
zfs_write_limit_override set to 2684354560 so that the write cycle is 
more on the order of one second in every five.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, Jonathan Edwards wrote:


somehow i don't think that reading the first 64MB (presumably) off a raw 
disk device 3 times and picking the middle value is going to give you much 
useful information on the overall state of the disks .. i believe this was 
more of a quick hack just to validate that there's nothing too far out of the 
norm, but with that said - what's the c2 and c3 device above?  you've got to 
be caching the heck out of that to get that unbelievable 13 GB/s - so you're 
really only seeing memory speeds there


Agreed.  It is just a quick sanity check.  I think that the c2 and c3 
devices are speedy USB drives.


more useful information would be something more like the old taz or some of 
the disk IO latency tools when you're driving a workload.


What I see from 'iostat -cx' is a low latency (<= 4 ms) and low 
workload while the data is being read, and then (periodically) a burst 
of write data with much higher latency (40-64ms svc_t).  The write 
burst does not take long so it is clear that reading is the 
bottleneck.


if you're using LUNs off an array - this might be another case of the 
zfs_vdev_max_pending being tuned more for direct attach drives .. you could 
be trying to queue up too much I/O against the RAID controller, particularly 
if the RAID controller is also trying to prefetch out of it's cache.


I have played with zfs_vdev_max_pending before.  It does dial down the 
latency pretty linearly during the write phase (e.g. 35 queued I/Os 
results in 64 ms svc_t).


you might want to dtrace this to break down where the latency is occuring .. 
eg: is this a DNLC caching problem, ARC problem, or device level problem


also - is this really coming off a 2540? if so - you should probably 
investigate the array throughput numbers and what's happening on the RAID 
controller .. i typically find it helpful to understand what the raw hardware 
is capable of (hence tools like vdbench to drive an anticipated load before i 
configure anything) - and then attempting to configure the various tunables 
to match after that


Yes, this comes off of a 2540.  I used iozone for testing and see that 
through zfs, the hardware is able to write a 64GB file at 380 MB/s and 
read at 551 MB/s.  Unfortunately, this does not seem to translate well 
for the actual task.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Joerg Schilling
Phil Harman  wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
> instead of the Solaris page cache. But mmap() uses the latter. So if  
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it  
> copies into the Solaris page cache. As long as they remain there ZFS  
> will be slow for those files, even if you subsequently use read(2) to  
> access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until  

Do you believe that reboot is the only way to reset this?

> someone uses mmap(2) on the files again. I think tar(1) uses read(2),  
> but from my iPod I can't be sure. It would be interesting to see how  
> tar(1) performs if you run that test before cp(1) on a freshly  
> rebooted system.

There are many tar implementations. The oldest is the UNIX tar implementation
from around 1978, the next was star from 1982, then there is GNU tar from 1987.

Star forks into two processes that are connected via shared memory in order to
speed things up. 

If you compare the copy speed of star and cp on UFS, and if you tell star to 
be as unreliable as cp (by specifying the star option -no-fsync), star will do 
the job about 30% faster than cp does, even though star does not use mmap. Copying 
with Sun's tar is a tick faster than using cp, and it is a bit more accurate.
GNU tar is no better than Sun's tar.

If you are looking for the best speed, use:

star -copy -no-fsync -C from-dir . to-dir

and set e.g. bs=1m fs=128m.


Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread David Magda

On Jul 4, 2009, at 03:48, Phil Harman wrote:

The other thing that slows you down is that ZFS only flushes to disk  
every 5 seconds if there are no synchronous writes. It would be  
interesting to see iostat -xnz 1 while you are running your tests.  
You may find the disks are writing very efficiently for one second  
in every five.


The value of 5 seconds is no longer a hard stop since SNV 87. Since  
snv_87 (and S10u6) it can be up to 30 seconds (but it does shoot for 5  
seconds):


http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

See the 20-Mar-2008 change for txg.c for details.
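
If you want to see what a particular build is actually using, the kernel
variables can be inspected with mdb -k; the names below (zfs_txg_synctime
for the 5 second target, zfs_txg_timeout for the 30 second cap) are my
assumption for snv_87-era bits and do vary between builds:

    echo zfs_txg_synctime/D | mdb -k
    echo zfs_txg_timeout/D  | mdb -k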

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Jonathan Edwards


On Jul 4, 2009, at 12:03 AM, Bob Friesenhahn wrote:


% ./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 130 MB/sec
c2t202400A0B83A8A0Bd31 13422 MB/sec
c3t202500A0B83A8A0Bd31 13422 MB/sec
c4t600A0B80003A8A0B096A47B4559Ed0 191 MB/sec
c4t600A0B80003A8A0B096E47B456DAd0 192 MB/sec
c4t600A0B80003A8A0B096147B451BEd0 192 MB/sec
c4t600A0B80003A8A0B096647B453CEd0 192 MB/sec
c4t600A0B80003A8A0B097347B457D4d0 212 MB/sec
c4t600A0B800039C9B50A9C47B4522Dd0 191 MB/sec
c4t600A0B800039C9B50AA047B4529Bd0 192 MB/sec
c4t600A0B800039C9B50AA447B4544Fd0 192 MB/sec
c4t600A0B800039C9B50AA847B45605d0 191 MB/sec
c4t600A0B800039C9B50AAC47B45739d0 191 MB/sec
c4t600A0B800039C9B50AB047B457ADd0 191 MB/sec
c4t600A0B800039C9B50AB447B4595Fd0 191 MB/sec


somehow I don't think that reading the first 64MB (presumably off a raw  
disk device) 3 times and picking the middle value is going to give you  
much useful information on the overall state of the disks .. I believe  
this was more of a quick hack to just validate that there's nothing too  
far out of the norm, but with that said - what are the c2 and c3 devices  
above?  you've got to be caching the heck out of that to get that  
unbelievable 13 GB/s - so you're really only seeing memory speeds there


more useful information would be something more like the old taz or  
some of the disk IO latency tools when you're driving a workload.
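
Failing taz, even plain iostat and zpool iostat while the copy is running
show where the time goes (watch actv, wsvc_t and asvc_t per LUN); the pool
name is the one from this thread:

    iostat -xnz 5
    zpool iostat -v Sun_2540 5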



% arc_summary.pl

System Memory:
 Physical RAM:  20470 MB
 Free Memory :  2371 MB
 LotsFree:  312 MB

ZFS Tunables (/etc/system):
 * set zfs:zfs_arc_max = 0x3
 set zfs:zfs_arc_max = 0x28000
 * set zfs:zfs_arc_max = 0x2

ARC Size:
 Current Size: 9383 MB (arcsize)
 Target Size (Adaptive):   10240 MB (c)
 Min Size (Hard Limit):1280 MB (zfs_arc_min)
 Max Size (Hard Limit):10240 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:   6%644 MB (p)
 Most Frequently Used Cache Size:93%9595 MB (c-p)

ARC Efficency:
 Cache Access Total: 674638362
 Cache Hit Ratio:  91%   615586988  [Defined State for buffer]
 Cache Miss Ratio:  8%   59051374   [Undefined State for Buffer]
 REAL Hit Ratio:   87%   590314508  [MRU/MFU Hits Only]

 Data Demand   Efficiency:96%
 Data Prefetch Efficiency: 7%

CACHE HITS BY CACHE LIST:
  Anon:                            2%    13626529              [ New Customer, First Cache Hit ]
  Most Recently Used:             78%    480379752 (mru)       [ Return Customer ]
  Most Frequently Used:           17%    109934756 (mfu)       [ Frequent Customer ]
  Most Recently Used Ghost:        0%    5180256 (mru_ghost)   [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:      1%    6465695 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]

CACHE HITS BY DATA TYPE:
  Demand Data:            78%    485431759
  Prefetch Data:           0%    3045442
  Demand Metadata:        16%    103900170
  Prefetch Metadata:       3%    23209617
CACHE MISSES BY DATA TYPE:
  Demand Data:            30%    18109355
  Prefetch Data:          60%    35633374
  Demand Metadata:         6%    3806177
  Prefetch Metadata:       2%    1502468


Prefetch seems to be performing badly.  Ben Rockwood's blog  
entry at http://www.cuddletech.com/blog/pivot/entry.php?id=1040  
discusses prefetch.  The sample DTrace script on that page only  
shows cache misses:


vdev_cache_read: 6507827833451031357 read 131072 bytes at offset  
6774849536: MISS
vdev_cache_read: 6507827833451031357 read 131072 bytes at offset  
6774980608: MISS


Unfortunately, the file-level prefetch DTrace sample script from the  
same page seems to have a syntax error.
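
As a cruder substitute, the raw prefetch counters that arc_summary.pl
summarizes can be read straight from the arcstats kstat, e.g.:

    kstat -p zfs:0:arcstats | grep prefetch

prefetch_data_hits versus prefetch_data_misses is the "Prefetch Data"
efficiency shown above.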


if you're using LUNs off an array - this might be another case of the  
zfs_vdev_max_pending being tuned more for direct attach drives .. you  
could be trying to queue up too much I/O against the RAID controller,  
particularly if the RAID controller is also trying to prefetch out of  
its cache.


I tried disabling file level prefetch (zfs_prefetch_disable=1) but  
did not observe any change in behavior.


this is only going to help if you've got problems in zfetch .. you'd  
probably see this better by looking for high lock contention in zfetch  
with lockstat
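
A minimal way to look for that (grepping the Caller column for
zfetch/dmu_zfetch is just a guess at the relevant function names):

    # top 20 lock contention events over a 30 second window of the copy
    lockstat -C -D 20 sleep 30

    # or profile where kernel time is spent and look for zfetch frames
    lockstat -kIW -D 20 sleep 30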



# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:classmisc
zfs:0:vdev_cache_stats:crtime   130.61298275
zfs:0:vdev_cache_stats:delegations  754287
zfs:0:vdev_cache_stats:hits 3973496
zfs:0:vdev_cache_stats:misses   2154959
zfs:0:vdev_cache_stats:snaptime 451955.55419545

Performance when copying 236 GB of files (each file is 5537792 bytes,  
wit

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Joerg Schilling
Mattias Pantzare  wrote:

> > Performance when coping 236 GB of files (each file is 5537792 bytes, with
> > 20001 files per directory) from one directory to another:
> >
> > Copy Method                             Data Rate
> > ====================================    =========
> > cpio -pdum                              75 MB/s
> > cp -r                                   32 MB/s
> > tar -cf - . | (cd dest && tar -xf -)    26 MB/s
> >
> > I would expect data copy rates approaching 200 MB/s.
> >
>
> What happens if you run two copy at the same time? (On different data)

Before you do things like this, you should first start using tests
that may give you useful results.

None of the programs above has been written for decent performance.
I know that "cp" on Solaris is a partial exception for single file copies, but  
that does not help us if we want to compare _apparent_ performance.

Let me first introduce other programs:

sdd     A dd(1) replacement that was first written in 1984 and that includes
        built-in speed metering since July 1988.

star    A tar(1) replacement that was first written in 1982 and that supports
        much better performance by using a shared memory based FIFO.

Note that most speed tests that are run on Linux do not result in useful values,
as you don't know what's happening during the observation time.



If you would like to measure read performance, I recommend using a filesystem
that was mounted directly before the test, or using files that are big enough 
not to fit into memory.

Use e.g.:   sdd if=file-name bs=64k -onull -time

If you would like to measure write performance, I recommend writing files that
are big enough to avoid getting wrong numbers as a result of caching.

Use e.g.sdd -inull bs=64k count=some-number of=file-name -time

Use an appropriate value for "some-number".

For copying files, I recommend to use:

star -copy bs=1m fs=128m -time -C from-dir . to-dir


It makes sense to run another test using the option -no-fsync in addition.
On Solaris with UFS, using -no-fsync speeds things up by approx. 10%.
On Linux with a local filesystem, using -no-fsync speeds things up by 
approx. 400%. This is why you get uselessly high numbers from using GNU tar
for copy tests on Linux.
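
Putting this together for the pool in this thread, a hypothetical test run
could look like the following (the mount point, file names and the roughly
64 GB size - chosen to be well above the 20 GB of RAM reported earlier -
are placeholders, not measured values):

    # write test: ~64 GB of null data, timed by sdd itself
    sdd -inull bs=64k count=1000000 of=/Sun_2540/testfile -time

    # read test, on a freshly mounted filesystem or a file bigger than RAM
    sdd if=/Sun_2540/testfile bs=64k -onull -time

    # copy test, with and without -no-fsync
    star -copy bs=1m fs=128m -time -C /Sun_2540/src . /Sun_2540/dst
    star -copy -no-fsync bs=1m fs=128m -time -C /Sun_2540/src . /Sun_2540/dst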



Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Mattias Pantzare
On Sat, Jul 4, 2009 at 06:03, Bob Friesenhahn wrote:
> I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS
> performs so terribly on my system.  I blew a good bit of personal life
> savings on this set-up but am not seeing performance anywhere near what is
> expected.  Testing with iozone shows that bulk I/O performance is good.
>  Testing with Jeff Bonwick's 'diskqual.sh' shows expected disk performance.
>  The problem is that actual observed application performance sucks, and
> could often be satisified by portable USB drives rather than high-end SAS
> drives.  It could be satisified by just one SAS disk drive.  Behavior is as
> if zfs is very slow to read data since disks are read at only 2 or 3
> MB/second followed by an intermittent write on a long cycle.  Drive lights
> blink slowly.  It is as if ZFS does no successful sequential read-ahead on
> the files (see Prefetch Data hit rate of 0% and Prefetch Data cache miss of
> 60% below), or there is a semaphore bottleneck somewhere (but CPU use is
> very low).
>
> Observed behavior is very program dependent.
>
> # zpool status Sun_2540
>  pool: Sun_2540
>  state: ONLINE
> status: The pool is formatted using an older on-disk format.  The pool can
>        still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
>        pool will no longer be accessible on older software versions.
>  scrub: scrub completed after 0h46m with 0 errors on Mon Jun 29 05:06:33
> 2009
> config:
>
>        NAME                                       STATE     READ WRITE CKSUM
>        Sun_2540                                   ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B80003A8A0B096A47B4559Ed0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AA047B4529Bd0  ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B80003A8A0B096E47B456DAd0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AA447B4544Fd0  ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B80003A8A0B096147B451BEd0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AA847B45605d0  ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B80003A8A0B096647B453CEd0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AAC47B45739d0  ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B80003A8A0B097347B457D4d0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AB047B457ADd0  ONLINE       0     0     0
>          mirror                                   ONLINE       0     0     0
>            c4t600A0B800039C9B50A9C47B4522Dd0  ONLINE       0     0     0
>            c4t600A0B800039C9B50AB447B4595Fd0  ONLINE       0     0     0
>
> errors: No known data errors
>
>
> Prefetch seems to be performing badly.  The Ben Rockwood's blog entry at
> http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch.
>  The sample Dtrace script on that page only shows cache misses:
>
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536:
> MISS
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608:
> MISS
>
> Unfortunately, the file-level prefetch DTrace sample script from the same
> page seems to have a syntax error.
>
> I tried disabling file level prefetch (zfs_prefetch_disable=1) but did not
> observe any change in behavior.
>
> # kstat -p zfs:0:vdev_cache_stats
> zfs:0:vdev_cache_stats:class    misc
> zfs:0:vdev_cache_stats:crtime   130.61298275
> zfs:0:vdev_cache_stats:delegations      754287
> zfs:0:vdev_cache_stats:hits     3973496
> zfs:0:vdev_cache_stats:misses   2154959
> zfs:0:vdev_cache_stats:snaptime 451955.55419545
>
> Performance when coping 236 GB of files (each file is 5537792 bytes, with
> 20001 files per directory) from one directory to another:
>
> Copy Method                             Data Rate
>     ==
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> I would expect data copy rates approaching 200 MB/s.
>

What happens if you run two copies at the same time? (On different data)

Your test is very bad at using striping, as reads are done sequentially.
Prefetch can only help within a file, and your files are only 5 MB.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Phil Harman
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC  
instead of the Solaris page cache. But mmap() uses the latter. So if  
anyone maps a file, ZFS has to keep the two caches in sync.


cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it  
copies into the Solaris page cache. As long as they remain there ZFS  
will be slow for those files, even if you subsequently use read(2) to  
access them.
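
One quick way to confirm which path a given copier takes is to truss it
and watch for mmap versus read calls, e.g. (file names are placeholders):

    truss -t mmap,mmap64,read cp /tank/bigfile /tmp/bigfile.copy 2>&1 | head -20
    truss -t mmap,mmap64,read tar cf /dev/null /tank/bigfile 2>&1 | head -20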


If you reboot, your cpio(1) tests will probably go fast again, until  
someone uses mmap(2) on the files again. I think tar(1) uses read(2),  
but from my iPod I can't be sure. It would be interesting to see how  
tar(1) performs if you run that test before cp(1) on a freshly  
rebooted system.


I have done some work with the ZFS team towards a fix, but it is only  
currently in OpenSolaris.


The other thing that slows you down is that ZFS only flushes to disk  
every 5 seconds if there are no synchronous writes. It would be  
interesting to see iostat -xnz 1 while you are running your tests. You  
may find the disks are writing very efficiently for one second in  
every five.
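
If you want to watch the transaction group cadence directly, a one-liner
like this (assuming spa_sync is the txg sync entry point on this build)
timestamps each sync alongside the iostat output:

    dtrace -qn 'fbt::spa_sync:entry { printf("%Y  txg sync\n", walltimestamp); }'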


Hope this helps,
Phil

blogs.sun.com/pgdh


Sent from my iPod

On 4 Jul 2009, at 05:26, Bob Friesenhahn wrote:



On Fri, 3 Jul 2009, Bob Friesenhahn wrote:


Copy Method                             Data Rate
====================================    =========
cpio -pdum                              75 MB/s
cp -r                                   32 MB/s
tar -cf - . | (cd dest && tar -xf -)    26 MB/s


It seems that the above should be amended.  Running the cpio based  
copy again results in zpool iostat only reporting a read bandwidth  
of 33 MB/second.  The system seems to get slower and slower as it  
runs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-03 Thread Bob Friesenhahn

On Fri, 3 Jul 2009, Bob Friesenhahn wrote:


Copy Method                             Data Rate
====================================    =========
cpio -pdum                              75 MB/s
cp -r                                   32 MB/s
tar -cf - . | (cd dest && tar -xf -)    26 MB/s


It seems that the above should be amended.  Running the cpio based 
copy again results in zpool iostat only reporting a read bandwidth of 
33 MB/second.  The system seems to get slower and slower as it runs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-03 Thread Bob Friesenhahn
I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS 
performs so terribly on my system.  I blew a good bit of personal life 
savings on this set-up but am not seeing performance anywhere near 
what is expected.  Testing with iozone shows that bulk I/O performance 
is good.  Testing with Jeff Bonwick's 'diskqual.sh' shows expected 
disk performance.  The problem is that actual observed application 
performance sucks, and could often be satisfied by portable USB 
drives rather than high-end SAS drives.  It could be satisfied by 
just one SAS disk drive.  Behavior is as if zfs is very slow to read 
data since disks are read at only 2 or 3 MB/second followed by an 
intermittent write on a long cycle.  Drive lights blink slowly.  It is 
as if ZFS does no successful sequential read-ahead on the files (see 
Prefetch Data hit rate of 0% and Prefetch Data cache miss of 60% 
below), or there is a semaphore bottleneck somewhere (but CPU use is 
very low).


Observed behavior is very program dependent.

# zpool status Sun_2540
  pool: Sun_2540
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h46m with 0 errors on Mon Jun 29 05:06:33 2009
config:

NAME   STATE READ WRITE CKSUM
Sun_2540   ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096A47B4559Ed0  ONLINE   0 0 0
c4t600A0B800039C9B50AA047B4529Bd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096E47B456DAd0  ONLINE   0 0 0
c4t600A0B800039C9B50AA447B4544Fd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096147B451BEd0  ONLINE   0 0 0
c4t600A0B800039C9B50AA847B45605d0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096647B453CEd0  ONLINE   0 0 0
c4t600A0B800039C9B50AAC47B45739d0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B097347B457D4d0  ONLINE   0 0 0
c4t600A0B800039C9B50AB047B457ADd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B800039C9B50A9C47B4522Dd0  ONLINE   0 0 0
c4t600A0B800039C9B50AB447B4595Fd0  ONLINE   0 0 0

errors: No known data errors

% ./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 130 MB/sec
c2t202400A0B83A8A0Bd31 13422 MB/sec
c3t202500A0B83A8A0Bd31 13422 MB/sec
c4t600A0B80003A8A0B096A47B4559Ed0 191 MB/sec
c4t600A0B80003A8A0B096E47B456DAd0 192 MB/sec
c4t600A0B80003A8A0B096147B451BEd0 192 MB/sec
c4t600A0B80003A8A0B096647B453CEd0 192 MB/sec
c4t600A0B80003A8A0B097347B457D4d0 212 MB/sec
c4t600A0B800039C9B50A9C47B4522Dd0 191 MB/sec
c4t600A0B800039C9B50AA047B4529Bd0 192 MB/sec
c4t600A0B800039C9B50AA447B4544Fd0 192 MB/sec
c4t600A0B800039C9B50AA847B45605d0 191 MB/sec
c4t600A0B800039C9B50AAC47B45739d0 191 MB/sec
c4t600A0B800039C9B50AB047B457ADd0 191 MB/sec
c4t600A0B800039C9B50AB447B4595Fd0 191 MB/sec

% arc_summary.pl

System Memory:
 Physical RAM:  20470 MB
 Free Memory :  2371 MB
 LotsFree:  312 MB

ZFS Tunables (/etc/system):
 * set zfs:zfs_arc_max = 0x3
 set zfs:zfs_arc_max = 0x28000
 * set zfs:zfs_arc_max = 0x2

ARC Size:
 Current Size: 9383 MB (arcsize)
 Target Size (Adaptive):   10240 MB (c)
 Min Size (Hard Limit):1280 MB (zfs_arc_min)
 Max Size (Hard Limit):10240 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:   6%644 MB (p)
 Most Frequently Used Cache Size:93%9595 MB (c-p)

ARC Efficency:
 Cache Access Total: 674638362
 Cache Hit Ratio:  91%   615586988  [Defined State for buffer]
 Cache Miss Ratio:  8%   59051374   [Undefined State for Buffer]
 REAL Hit Ratio:   87%   590314508  [MRU/MFU Hits Only]

 Data Demand   Efficiency:96%
 Data Prefetch Efficiency: 7%

CACHE HITS BY CACHE LIST:
  Anon:                            2%    13626529    [ New Customer, First Cache Hit ]
  Most Recently Used: 
