Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-03 Thread Brandon High
On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone d...@geek.com.au wrote:
 The other thing that can cause a storm of tiny IOs is dedup, and this
 effect can last long after space has been freed and/or dedup turned
 off, until all the blocks corresponding to DDT entries are rewritten.
 I wonder if this was involved here.

Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so
many ways...

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-03 Thread Daniel Carosone
On Wed, Aug 03, 2011 at 12:32:56PM -0700, Brandon High wrote:
 On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone d...@geek.com.au wrote:
  The other thing that can cause a storm of tiny IOs is dedup, and this
  effect can last long after space has been freed and/or dedup turned
  off, until all the blocks corresponding to DDT entries are rewritten.
  I wonder if this was involved here.
 
 Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so
 many ways...

Indeed, but alas people still Do It Wrong.  In particular, when a pool
is approaching full, turning on dedup might seem like an attractive
proposition to someone who doesn't understand the cost. 

So I just wonder if they have, or had at some point in the past, enabled it.
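
For anyone who wants to check, a quick sketch (the pool name 'tank' is a placeholder, and zdb output details vary by release): the dedup property shows the current setting, while a populated DDT shows whether deduped blocks are still on disk even after dedup has been turned off.

# zfs get -r dedup tank      # current dedup setting per dataset
# zdb -DD tank               # dump DDT statistics; a non-empty table means DDT entries remain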

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Josh Simon

Hello,

One of my coworkers was sent the following explanation from Oracle as to 
why one of our backup systems was running a scrub so slowly. I figured I 
would share it with the group.


http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL

PS: Thought it was kind of odd that Oracle would direct us to a blog, 
but the post is very thorough.


Thanks,

Josh Simon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Neil Perrin
In general the blog's conclusion is correct. When file systems get full there is
fragmentation (it happens to all file systems), and for ZFS the pool uses gang
blocks made up of smaller blocks when there are insufficient large blocks.
However, the ZIL never allocates or uses gang blocks. It directly allocates
blocks (outside of the zio pipeline) using zio_alloc_zil() -> metaslab_alloc().

Gang blocks are only used by the main pool when the pool transaction
group (txg) commit occurs.  Solutions to the problem include (see the sketch below):
   - add a separate intent log
   - add more top level devices (hopefully replicated)
   - delete unused files/snapshots etc. within the pool...
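A minimal sketch of the first two suggestions (pool and device names are placeholders; a dedicated SSD is the usual choice for a slog):

# zpool add tank log c1t2d0                # separate intent log device
# zpool add tank mirror c2t0d0 c2t1d0      # another top-level (mirrored) vdev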

Neil.


On 08/01/11 08:29, Josh Simon wrote:

Hello,

One of my coworkers was sent the following explanation from Oracle as 
to why one of our backup systems was running a scrub so slowly. I figured 
I would share it with the group.


http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL 



PS: Thought it was kind of odd that Oracle would direct us to a blog, 
but the post is very thorough.


Thanks,

Josh Simon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Brandon High
On Mon, Aug 1, 2011 at 2:16 PM, Neil Perrin neil.per...@oracle.com wrote:
 In general the blogs conclusion is correct . When file systems get full
 there is
 fragmentation (happens to all file systems) and for ZFS the pool uses gang
 blocks of smaller blocks when there are insufficient large blocks.

The blog doesn't mention how full the pool was. It's pretty well
documented that performance takes a nosedive at a certain point.
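
A quick way to see how close a pool is (or was) to that point, assuming a pool named 'tank' (the exact threshold varies, but staying well below roughly 80-90% full is the usual advice):

# zpool list tank            # the CAP column shows percent of capacity used
# zpool get capacity tank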

A slow scrub is actually not related to the problems in the blog post,
since there's not a lot of writes during (or at least caused by) a
scrub. Fragmentation is a real issue with pools that are (or have
been) very full. The data gets written out in fragments and has to be
read back in the same order.

If the mythical bp_rewrite code ever shows up, it will be possible to
defrag a pool. But not yet.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Richard Elling
On Aug 1, 2011, at 2:16 PM, Neil Perrin wrote:

 In general the blogs conclusion is correct . When file systems get full there 
 is
 fragmentation (happens to all file systems) and for ZFS the pool uses gang
 blocks of smaller blocks when there are insufficient large blocks.
 However, the ZIL never allocates or uses gang blocks. It directly allocates
 blocks (outside of the zio pipeline) using zio_alloc_zil() - 
 metaslab_alloc().
 Gang blocks are only used by the main pool when the pool transaction
 group (txg) commit occurs.  Solutions to the problem include:
   - add a separate intent log

Yes, I thought that it was odd that someone who is familiar with Oracle 
databases,
and their redo logs, didn't use separate intent logs.

   - add more top level devices (hopefully replicated)
   - delete unused files/snapshots etc with in the poll…

If gang activity is the root cause of the performance problem, then they must be at the
edge of effective space utilization.
 -- richard

 
 Neil.
 
 
 On 08/01/11 08:29, Josh Simon wrote:
 Hello,
 
 One of my coworkers was sent the following explanation from Oracle as to why 
 one of backup systems was conducting a scrub so slow. I figured I would 
 share it with the group.
 
 http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL
  
 
 PS: Thought it was kind of odd that Oracle would direct us to a blog, but 
 the post is very thorough.
 
 Thanks,
 
 Josh Simon
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-18 Thread Mertol Ozyoney
There is work underway to make NDMP more efficient in highly fragmented file
systems with a lot of small files.
I am not a development engineer so I don't know much, and I do not think that
there is any committed work. However, the ZFS engineers on the forum may comment
further.
Mertol 



Mertol Ozyoney 
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email mertol.ozyo...@sun.com


-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ed Spencer
Sent: Sunday, August 09, 2009 12:14 AM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] zfs fragmentation


On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:

 A SSD slog backed by a SAS 15K JBOD array should perform much better 
 than a big iSCSI LUN.

Now...yes. We implemented this pool years ago. I believe, back then, the
server would crash if you had a zfs drive fail. We decided to let the
netapp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly. And a 7000
series appliance would also be a nice addition to our central NFS
service. Not to mention more cost effective than expanding our Network
Appliance (We have researchers who are quite hungry for storage and NFS
is always our first choice).

We now have quite an investment in the current implementation. It's
difficult to move away from. The netapp is quite a reliable product.

We are quite happy with zfs and our implementation. We just need to
address our backup performance and  improve it just a little bit!

We were almost lynched this spring because we encountered some pretty
severe zfs bugs. We are still running the IDR named "A wad of ZFS bug
fixes for Solaris 10 Update 6". It took over a month to resolve the
issues.

I work at a University and Final Exams and year end occur at the same
time. I don't recommend having email problems during this time! People
are intolerant to email problems.

I live in hope that a Netapp OS update, or a solaris patch, or a zfs
patch, or an iscsi patch, or something will come along that improves our
performance just a bit so our backup people get off my back!

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-12 Thread Damjan Perenic
On Tue, Aug 11, 2009 at 11:04 PM, Richard Elling richard.ell...@gmail.com wrote:
 On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:

 I suspect that if we 'rsync' one of these filesystems to a second
 server/pool  that we would also see a performance increase equal to what
 we see on the development server. (I don't know how zfs send a receive
 work so I don't know if it would address this Filesystem Entropy or
 specifically reorganize the files and directories). However, when we
 created a testfs filesystem in the zfs pool on the production server,
 and copied data to it, we saw the same performance as the other
 filesystems, in the same pool.

 Directory walkers, like NetBackup or rsync, will not scale well as
 the number of files increases.  It doesn't matter what file system you
 use, the scalability will look more-or-less similar. For millions of files,
 ZFS send/receive works much better.  More details are in my paper.

It would be nice if ZFS had something similar to VxFS File Change Log.
This feature is very useful for incremental backups and other
directory walkers, providing they support FCL.

Damjan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-12 Thread Scott Lawson



Ed Spencer wrote:

I don't know of any reason why we can't turn 1 backup job per filesystem
into say, up to say , 26 based on the cyrus file and directory
structure.
  
No reason whatsoever. Sometimes the more the better as per the rest of 
this thread. The key
here is to test and tweak till you get the optimal arrangement of backup 
window time and performance.


Performance tuning is a little bit of a journey that sooner or later 
has a final destination. ;)

The cyrus file and directory structure is designed with users located
under the directories A,B,C,D,etc to deal with the millions of little
files issue at the  filesystem layer.
  
The Sun Messaging Server actually hashes the user names into a structure 
which looks quite similar to a squid cache store. This has a top level of 
128 directories, which each in turn contain 128 directories, which then 
contain a folder for each user that has been mapped into that structure 
by the hash algorithm on the user name. I use a wildcard mapping to split 
this into 16 streams to cover the 0-9, a-f of the hexadecimal directory 
structure names, e.g. /mailstore1/users/0*

Our backups will have to be changed to use this design feature.
There will be a little work on the front end  to create the jobs but
once done the full backups should finish in a couple of hours.
  
The nice thing about this work is it really is only a one-off configuration 
in the backup software and then it is done. Certainly works a lot better 
than something like ALL_LOCAL_DRIVES in NetBackup, which effectively forks 
one backup thread per file system.

As an aside, we are currently upgrading our backup server to a sun4v
machine.
This architecture is well suited to run more jobs in parallel.
  
I use a T5220 with staging to a J4500 with 48 x 1 TB disks in a zpool with 
6 file systems. This then gets streamed to 6 LTO4 tape drives in an SL500. 
Needless to say, this supports a high degree of parallelism and generally 
finds the source server to be the bottleneck. I also take advantage of the 
10 GigE capability built straight into the UltraSPARC T2. The only major 
bottleneck in this system is the SAS interconnect to the J4500.
 
Thanx for all your help and advice.


Ed

On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
  

On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer ed_spen...@umanitoba.ca wrote:


We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
saturday. We backup to disk and then clone to tape. Our backup people
can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is
one solution but its a major redesign of the of the mail system.
  

What is magical about a 1:1 mapping of backup job to file system?
According to the Networker manual[1], a save set in Networker can be
configured to back up certain directories.  According to some random
documentation about Cyrus[2], mail boxes fall under a pretty
predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way that your mailboxes get hashed fall into a
structure like $fs/b/bigbird and $fs/g/grover (and not just
$fs/bigbird and $fs/grover), you should be able to set a save set per
top level directory or per group of a few directories.  That is,
create a save set for $fs/a, $fs/b, etc. or $fs/a - $fs/d, $fs/e -
$fs/h, etc.  If you are able to create many smaller save sets and turn
the parallelism up you should be able to drive more throughput.

I wouldn't get too worried about ensuring that they all start at the
same time[3], but it would probably make sense to prioritize the
larger ones so that they start early and the smaller ones can fill in
the parallelism gaps as the longer-running ones finish.

3. That is, there is sometimes benefit in having many more jobs to run
than you have concurrent streams.  This avoids having one save set
that finishes long after all the others because of poorly balanced
save sets.


Couldn't agree more Mike.

--
Mike Gerdts
http://mgerdts.blogspot.com/



--
___


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Ed Spencer
I've come up with a better name for the concept of file and directory
fragmentation, which is "Filesystem Entropy": where, over time, an
active and volatile filesystem moves from an organized state to a
disorganized state, resulting in backup difficulties.

Here are some stats which illustrate the issue:

First the development mail server:
==
(Jumbo frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to
2097152)

Small file workload (copy from zfs on iscsi network to local ufs
filesystem)
# zpool iostat 10
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   247K  59.7K
space       70.5G  29.0G    136      0  8.37M      0
space       70.5G  29.0G    115      0  6.31M      0
space       70.5G  29.0G    108      0  7.08M      0
space       70.5G  29.0G    105      0  3.72M      0
space       70.5G  29.0G    135      0  3.74M      0
space       70.5G  29.0G    155      0  6.09M      0
space       70.5G  29.0G    193      0  4.85M      0
space       70.5G  29.0G    142      0  5.73M      0
space       70.5G  29.0G    159      0  7.87M      0

Large File workload (cd and dvd iso's) 
# zpool iostat 10
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   224K  59.8K
space       70.5G  29.0G    462      0  57.8M      0
space       70.5G  29.0G    427      0  53.5M      0
space       70.5G  29.0G    406      0  50.8M      0
space       70.5G  29.0G    430      0  53.8M      0
space       70.5G  29.0G    382      0  47.9M      0

The production mail server:
===
Mail system is running with 790 imap users logged in (low imap work
load).
Two backup streams are running.
Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set
to 2097152
- we've never seen any effect of changing the iscsi transport
parameters
  under this small file workload.

# zpool iostat 10
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.06T   955G     96     69  5.20M  2.69M
space       1.06T   955G    175    105  8.96M  2.22M
space       1.06T   955G    182     16  4.47M   546K
space       1.06T   955G    170     16  4.82M  1.85M
space       1.06T   955G    145    159  4.23M  3.19M
space       1.06T   955G    138     15  4.97M  92.7K
space       1.06T   955G    134     15  3.82M  1.71M
space       1.06T   955G    109    123  3.07M  3.08M
space       1.06T   955G    106     11  3.07M  1.34M
space       1.06T   955G    120     17  3.69M  1.74M

# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG
PROCESS/LWPID
 12438 root      12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
 27399 cyrus     15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
 20230 root     3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
 25913 root     0.5 3.3 0.0 0.0 0.0 0.0  96 0.0  22   2  1K   0 prstat/1
 20495 cyrus    1.1 0.2 0.0 0.0 0.5 0.0  98 0.0  14   3 191   0 imapd/1
  1051 cyrus    1.2 0.0 0.0 0.0 0.0 0.0  99 0.0  19   1  80   0 master/1
 24350 cyrus    0.5 0.5 0.0 0.0 1.4 0.0  98 0.0  57   1 484   0 lmtpd/1
 22645 cyrus    0.6 0.3 0.0 0.0 0.0 0.0  99 0.0  53   1 603   0 imapd/1
 24904 cyrus    0.3 0.4 0.0 0.0 0.0 0.0  99 0.0  66   0 863   0 imapd/1
 18139 cyrus    0.3 0.2 0.0 0.0 0.0 0.0  99 0.0  24   0 195   0 imapd/1
 21459 cyrus    0.2 0.3 0.0 0.0 0.0 0.0  99 0.0  54   0 635   0 imapd/1
 24891 cyrus    0.3 0.3 0.0 0.0 0.9 0.0  99 0.0  28   0 259   0 lmtpd/1
   388 root     0.2 0.3 0.0 0.0 0.0 0.0 100 0.0   1   1  48   0 in.routed/1
 21643 cyrus    0.2 0.3 0.0 0.0 0.2 0.0  99 0.0  49   7 540   0 imapd/1
 18684 cyrus    0.2 0.3 0.0 0.0 0.0 0.0 100 0.0  48   1 544   0 imapd/1
 25398 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 466   0 pop3d/1
 23724 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 540   0 imapd/1
 24909 cyrus    0.1 0.2 0.0 0.0 0.2 0.0  99 0.0  25   1 251   0 lmtpd/1
 16317 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  37   1 495   0 imapd/1
 28243 cyrus    0.1 0.3 0.0 0.0 0.0 0.0 100 0.0  32   0 289   0 imapd/1
 20097 cyrus    0.1 0.2 0.0 0.0 0.3 0.0  99 0.0  26   5 253   0 lmtpd/1
Total: 893 processes, 1125 lwps, load averages: 1.14, 1.16, 1.16
 
-- 
Ed  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Alex Lam S.L.
At a first glance, your production server's numbers are looking fairly
similar to the small file workload results of your development
server.

I thought you were saying that the development server has faster performance?

Alex.


On Tue, Aug 11, 2009 at 1:33 PM, Ed Spencer ed_spen...@umanitoba.ca wrote:
 I've come up with a better name for the concept of file and directory
 fragmentation which is, Filesystem Entropy. Where, over time, an
 active and volitile  filesystem moves from an organized state to a
 disorganized state resulting in backup difficulties.

 Here are some stats which illustrate the issue:

 First the development mail server:
 ==
 (Jump frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to
 2097152)

 Small file workload (copy from zfs on iscsi network to local ufs
 filesystem)
 # zpool iostat 10
               capacity     operations    bandwidth
 pool         used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space       70.5G  29.0G      3      0   247K  59.7K
 space       70.5G  29.0G    136      0  8.37M      0
 space       70.5G  29.0G    115      0  6.31M      0
 space       70.5G  29.0G    108      0  7.08M      0
 space       70.5G  29.0G    105      0  3.72M      0
 space       70.5G  29.0G    135      0  3.74M      0
 space       70.5G  29.0G    155      0  6.09M      0
 space       70.5G  29.0G    193      0  4.85M      0
 space       70.5G  29.0G    142      0  5.73M      0
 space       70.5G  29.0G    159      0  7.87M      0

 Large File workload (cd and dvd iso's)
 # zpool iostat 10
               capacity     operations    bandwidth
 pool         used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space       70.5G  29.0G      3      0   224K  59.8K
 space       70.5G  29.0G    462      0  57.8M      0
 space       70.5G  29.0G    427      0  53.5M      0
 space       70.5G  29.0G    406      0  50.8M      0
 space       70.5G  29.0G    430      0  53.8M      0
 space       70.5G  29.0G    382      0  47.9M      0

 The production mail server:
 ===
 Mail system is running with 790 imap users logged in (low imap work
 load).
 Two backup streams are running.
 Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set
 to 2097152
    - we've never seen any effect of changing the iscsi transport
 parameters
      under this small file workload.

 # zpool iostat 10
               capacity     operations    bandwidth
 pool         used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space       1.06T   955G     96     69  5.20M  2.69M
 space       1.06T   955G    175    105  8.96M  2.22M
 space       1.06T   955G    182     16  4.47M   546K
 space       1.06T   955G    170     16  4.82M  1.85M
 space       1.06T   955G    145    159  4.23M  3.19M
 space       1.06T   955G    138     15  4.97M  92.7K
 space       1.06T   955G    134     15  3.82M  1.71M
 space       1.06T   955G    109    123  3.07M  3.08M
 space       1.06T   955G    106     11  3.07M  1.34M
 space       1.06T   955G    120     17  3.69M  1.74M

 # prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG
 PROCESS/LWPID
  12438 root      12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
  27399 cyrus     15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
  20230 root     3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
  25913 root     0.5 3.3 0.0 0.0 0.0 0.0  96 0.0  22   2  1K   0 prstat/1
  20495 cyrus    1.1 0.2 0.0 0.0 0.5 0.0  98 0.0  14   3 191   0 imapd/1
  1051 cyrus    1.2 0.0 0.0 0.0 0.0 0.0  99 0.0  19   1  80   0 master/1
  24350 cyrus    0.5 0.5 0.0 0.0 1.4 0.0  98 0.0  57   1 484   0 lmtpd/1
  22645 cyrus    0.6 0.3 0.0 0.0 0.0 0.0  99 0.0  53   1 603   0 imapd/1
  24904 cyrus    0.3 0.4 0.0 0.0 0.0 0.0  99 0.0  66   0 863   0 imapd/1
  18139 cyrus    0.3 0.2 0.0 0.0 0.0 0.0  99 0.0  24   0 195   0 imapd/1
  21459 cyrus    0.2 0.3 0.0 0.0 0.0 0.0  99 0.0  54   0 635   0 imapd/1
  24891 cyrus    0.3 0.3 0.0 0.0 0.9 0.0  99 0.0  28   0 259   0 lmtpd/1
   388 root     0.2 0.3 0.0 0.0 0.0 0.0 100 0.0   1   1  48   0
 in.routed/1
  21643 cyrus    0.2 0.3 0.0 0.0 0.2 0.0  99 0.0  49   7 540   0 imapd/1
  18684 cyrus    0.2 0.3 0.0 0.0 0.0 0.0 100 0.0  48   1 544   0 imapd/1
  25398 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 466   0 pop3d/1
  23724 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 540   0 imapd/1
  24909 cyrus    0.1 0.2 0.0 0.0 0.2 0.0  99 0.0  25   1 251   0 lmtpd/1
  16317 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  37   1 495   0 imapd/1
  28243 cyrus    0.1 0.3 0.0 0.0 0.0 0.0 100 0.0  32   0 289   0 imapd/1
  20097 cyrus    0.1 0.2 0.0 0.0 0.3 0.0  99 0.0  26   5 253   0 lmtpd/1
 Total: 893 processes, 1125 lwps, load averages: 1.14, 1.16, 1.16

 --
 Ed

 ___
 zfs-discuss mailing list
 

Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Mike Gerdts
On Tue, Aug 11, 2009 at 7:33 AM, Ed Spencer ed_spen...@umanitoba.ca wrote:
 I've come up with a better name for the concept of file and directory
 fragmentation which is, Filesystem Entropy. Where, over time, an
 active and volitile  filesystem moves from an organized state to a
 disorganized state resulting in backup difficulties.

 Here are some stats which illustrate the issue:

 First the development mail server:
 ==
 (Jump frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to
 2097152)

 Small file workload (copy from zfs on iscsi network to local ufs
 filesystem)
 # zpool iostat 10
   capacity operationsbandwidth
 pool used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space   70.5G  29.0G  3  0   247K  59.7K
 space   70.5G  29.0G136  0  8.37M  0
 space   70.5G  29.0G115  0  6.31M  0
 space   70.5G  29.0G108  0  7.08M  0
 space   70.5G  29.0G105  0  3.72M  0
 space   70.5G  29.0G135  0  3.74M  0
 space   70.5G  29.0G155  0  6.09M  0
 space   70.5G  29.0G193  0  4.85M  0
 space   70.5G  29.0G142  0  5.73M  0
 space   70.5G  29.0G159  0  7.87M  0

So you are averaging about 6 MB/s on a small file workload.  The
average read size was about 44 KB.

This throughput could be limited by the file creation rate on UFS.
Perhaps a better command to use to judge how fast a single stream
can read is tar cf /dev/null $dir.
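
For example (a sketch; the path is a placeholder for one of your mail filesystems), timing the read-only pass removes the UFS write side from the measurement:

# ptime tar cf /dev/null /space/testfs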

 Large File workload (cd and dvd iso's)
 # zpool iostat 10
   capacity operationsbandwidth
 pool used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space   70.5G  29.0G  3  0   224K  59.8K
 space   70.5G  29.0G462  0  57.8M  0
 space   70.5G  29.0G427  0  53.5M  0
 space   70.5G  29.0G406  0  50.8M  0
 space   70.5G  29.0G430  0  53.8M  0
 space   70.5G  29.0G382  0  47.9M  0

Here the average throughput was about 53 MB/s, with the average read
size at 128 KB.  Note that 128 KB is not only the largest block size
that ZFS supports, it is also the default value of maxphys.  Tuning
maxphys to 1 MB may give you a performance boost, so long as the files
are contiguous.  Unless the files were trickled in very slowly with a
lot of other IO at the same time, they are probably mostly contiguous.
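
If you want to try that, a sketch of the usual Solaris tuning (verify the value and whether your HBA/iSCSI stack honors it before relying on it) is an /etc/system entry followed by a reboot:

set maxphys=1048576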

1 Gbit links, they are at about 25% capacity - good.  I assume you
have similar load balancing at the NetApp side too.

In a previous message you said that this server was seeing better
backup throughput than the production server.  How does the mixture of
large files vs. small files compare on the two systems?

 The production mail server:
 ===
 Mail system is running with 790 imap users logged in (low imap work
 load).
 Two backup streams are running.
 Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set
 to 2097152
- we've never seen any effect of changing the iscsi transport
 parameters
  under this small file workload.

 # zpool iostat 10
   capacity operationsbandwidth
 pool used  avail   read  write   read  write
 --  -  -  -  -  -  -
 space   1.06T   955G 96 69  5.20M  2.69M
 space   1.06T   955G175105  8.96M  2.22M
 space   1.06T   955G182 16  4.47M   546K
 space   1.06T   955G170 16  4.82M  1.85M
 space   1.06T   955G145159  4.23M  3.19M
 space   1.06T   955G138 15  4.97M  92.7K
 space   1.06T   955G134 15  3.82M  1.71M
 space   1.06T   955G109123  3.07M  3.08M
 space   1.06T   955G106 11  3.07M  1.34M
 space   1.06T   955G120 17  3.69M  1.74M

Here your average read throughput is about 4.6 MB/s with an average
read size of 47 KB.  That looks a lot like the simulation in the
non-production environment.

I would guess that the average message size is somewhere in the
40 - 50 KB range.


 # prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG
 PROCESS/LWPID
  12438 root  12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
  27399 cyrus 15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
  20230 root 3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
[snip]

The save process is from Networker, right?  These processes do not
look CPU bound (less than 20% on CPU).

In a previous message you showed iostat data at a time when backups
weren't running.  I've reproduced below, removing the device column
for sake of formatting.

 iostat -xpn 5 5
extended device statistics
r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

   17.1   29.2  746.7  317.1  0.0  0.60.0   12.5   0  

Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Ed Spencer

On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
 At a first glance, your production server's numbers are looking fairly
 similar to the small file workload results of your development
 server.
 
 I thought you were saying that the development server has faster performance?

The development server was running only one cp -pr command.

The production mail server was running two concurrent backup jobs and of
course the mail system, with each job having the same performance
throughput as if there were a single job running. The single-threaded
backup jobs do not conflict with each other over performance.

If we ran 20 concurrent backup jobs, overall performance would scale up
quite a bit. (I would guess between 5 and 10 times the performance). (I
just read Mike's post and will do some 'concurrency' testing).

Users are currently evenly distributed over 5 filesystems (I previously
mentioned 7 but it's really 5 filesystems for users and 1 for system
data, totalling 6, and one test filesystem).

We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
saturday. We backup to disk and then clone to tape. Our backup people
can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is
one solution but it's a major redesign of the mail system.

Adding a second server to halve the pool, and thereby halve the problem, is
another solution (and we would also create more filesystems at the same
time).

Moving the pool to an FC SAN or a JBOD may also increase performance.
(Fewer of the layers introduced by the appliance, thereby increasing
performance.)

I suspect that if we 'rsync' one of these filesystems to a second
server/pool we would also see a performance increase equal to what
we see on the development server. (I don't know how zfs send and receive
work, so I don't know if they would address this Filesystem Entropy or
specifically reorganize the files and directories). However, when we
created a testfs filesystem in the zfs pool on the production server,
and copied data to it, we saw the same performance as the other
filesystems in the same pool.

We will have to do something to address the problem. A combination of
what I just listed is our probable course of action. (Much testing will
have to be done to ensure our solution will address the problem because
we are not 100% sure what is the cause of performance degradation).  I'm
also dealing with Network Appliance to see if there is anything we can
do at the filer end to increase performance.  But I'm holding out little
hope.

But please, don't miss the point I'm trying to make. ZFS would benefit
from a utility or a background process that would reorganize files and
directories in the pool to optimize performance. A utility to deal with
Filesystem Entropy. Currently a zfs pool will live as long as the
lifetime of the disks that it is on, without reorganization. This can be
a long long time. Not to mention slowly expanding the pool over time
contributes to the issue. 

--
Ed



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Richard Elling

On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:



On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
At a first glance, your production server's numbers are looking fairly
similar to the small file workload results of your development server.

I thought you were saying that the development server has faster
performance?


The development server was running only one cp -pr command.

The production mail server was running two concurrent backup jobs and of
course the mail system, with each job having the same performance
throughput as if there were a single job running. The single-threaded
backup jobs do not conflict with each other over performance.


Agree.

If we ran 20 concurrent backup jobs, overall performance would scale up
quite a bit. (I would guess between 5 and 10 times the performance). (I
just read Mike's post and will do some 'concurrency' testing).


Yes.

Users are currently evenly distributed over 5 filesystems (I previously
mentioned 7 but it's really 5 filesystems for users and 1 for system
data, totalling 6, and one test filesystem).

We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
saturday. We backup to disk and then clone to tape. Our backup people
can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is
one solution but it's a major redesign of the mail system.


Really?  I presume this is because of the way you originally
allocated accounts to file systems.  Creating file systems in ZFS is
easy, so could you explain in a new thread?

Adding a second server to halve the pool, and thereby halve the problem, is
another solution (and we would also create more filesystems at the same
time).


I'm not convinced this is a good idea. It is a lot of work based on
the assumption that the server is the bottleneck.

Moving the pool to an FC SAN or a JBOD may also increase performance.
(Fewer of the layers introduced by the appliance, thereby increasing
performance.)


Disagree.

I suspect that if we 'rsync' one of these filesystems to a second
server/pool we would also see a performance increase equal to what
we see on the development server. (I don't know how zfs send and receive
work, so I don't know if they would address this Filesystem Entropy or
specifically reorganize the files and directories). However, when we
created a testfs filesystem in the zfs pool on the production server,
and copied data to it, we saw the same performance as the other
filesystems in the same pool.


Directory walkers, like NetBackup or rsync, will not scale well as
the number of files increases.  It doesn't matter what file system you
use, the scalability will look more-or-less similar. For millions of
files, ZFS send/receive works much better.  More details are in my paper.

We will have to do something to address the problem. A combination of
what I just listed is our probable course of action. (Much testing will
have to be done to ensure our solution will address the problem because
we are not 100% sure what is the cause of the performance degradation).
I'm also dealing with Network Appliance to see if there is anything we can
do at the filer end to increase performance.  But I'm holding out little
hope.


DNLC hit rate?
Also, is atime on?
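
A hedged sketch of how to check both (dataset names are placeholders; on an active mail store, atime updates turn every read into a write):

# vmstat -s | grep 'name lookups'     # DNLC hit percentage
# kstat -n dnlcstats                  # detailed DNLC counters
# zfs get atime space/fs1
# zfs set atime=off space/fs1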



But please, don't miss the point I'm trying to make. ZFS would benefit
from a utility or a background process that would reorganize files and
directories in the pool to optimize performance. A utility to deal with
Filesystem Entropy. Currently a zfs pool will live as long as the
lifetime of the disks that it is on, without reorganization. This can be
a long long time. Not to mention slowly expanding the pool over time
contributes to the issue.


This does not come for free in either performance or risk. It will
do nothing to solve the directory walker's problem.

NB, people who use UFS don't tend to see this because UFS can't
handle millions of files.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread David Magda
On Tue, August 11, 2009 10:39, Ed Spencer wrote:

 I suspect that if we 'rsync' one of these filesystems to a second
 server/pool  that we would also see a performance increase equal to what
 we see on the development server. (I don't know how zfs send a receive

Rsync has to traverse the entire directory tree to stat() every file to
see if it's changed (and if it has, it then computes which parts of the
file have been updated). ZFS send/recv however works at a lower level
and doesn't go to each file: it can simply compare which file system
blocks have changed.

So you would create a snapshot on the ZFS file system(s) of interest and
send it over to where ever you want to replicate it. Later on you would
create another snapshot and, with the incremental (-i) option in
zfs(1M), you could then only transfer the blocks of data that were changed
since the first snapshot. ZFS will be able to figure out the block
differences without having to touch every file.
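
A minimal sketch of that workflow (dataset, snapshot and host names are placeholders):

# zfs snapshot space/fs1@mon
# zfs send space/fs1@mon | ssh backuphost zfs receive backup/fs1

# ... later, send only the blocks changed since @mon ...
# zfs snapshot space/fs1@tue
# zfs send -i space/fs1@mon space/fs1@tue | ssh backuphost zfs receive backup/fs1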

Two pretty good explanations at:

http://www.markround.com/archives/38-ZFS-Replication.html
http://www.cuddletech.com/blog/pivot/entry.php?id=984

 work so I don't know if it would address this Filesystem Entropy or
 specifically reorganize the files and directories). However, when we
 created a testfs filesystem in the zfs pool on the production server,
 and copied data to it, we saw the same performance as the other
 filesystems, in the same pool.

Not surprising, since any file systems on any particular pool would be
using the same spindles. If you want different I/O characteristics you'd
need a different pool with different spindles.

 We will have to do something to address the problem. A combination of
 what I just listed is our probable course of action. (Much testing will
 have to be done to ensure our solution will address the problem because
 we are not 100% sure what is the cause of performance degradation).  I'm

Don't forget about the DTrace Toolkit, as it has many handy scripts for
digging into various performance characteristics:

http://www.brendangregg.com/dtrace.html

 But please, don't miss the point I'm trying to make. ZFS would benefit
 from a utility or a background process that would reorganize files and
 directories in the pool to optimize performance. A utility to deal with

If you have a Sun support contract, call them up and ask for this
enhancement. If there are enough people asking for it, the ZFS team will add
it. Talking on the list is one thing, but if there's no official paper
trail in Sun's database, then it won't get the attention it may deserve.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Scott Lawson



Richard Elling wrote:

On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:



On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:

At a first glance, your production server's numbers are looking fairly
similar to the small file workload results of your development
server.

I thought you were saying that the development server has faster 
performance?


The development serer was running only one cp -pr command.

The production mail sevrer was running two concurrent backup jobs and of
course the mail system, with each job having the same performance
throughput as if there were a single job running. The single threaded
backup jobs do not conflict with each other over performance.


Agree.


If we ran 20 concurrent backup jobs, overall performance would scale up
quite a bit. (I would guess between 5 and 10 times the performance). (I
just read Mike's post and will do some 'concurrency' testing).


Yes.


Users are currently evenly distributed over 5 filesystems (I previously
mentioned 7 but its really 5 filesystems for users and 1 for system
data, totalling 6, and one test filesystem).

We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
saturday. We backup to disk and then clone to tape. Our backup people
can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is
one solution but its a major redesign of the of the mail system.


Really?  I presume this is because of the way you originally
allocated accounts to file systems.  Creating file systems in ZFS is
easy, so could you explain in a new thread?

Ed, This would be a good idea.

This issue has been discussed many times on the iMS mailing list for the 
Sun Messaging Server, which, as far as the way it stores messages on disk 
goes, is very similar to Cyrus. (In fact I think it was once based on the 
same code base.)

The upshot of it is what has been explained by Mike, in that these types 
of store create millions of little files that NetBackup or Legato must 
walk over and back up one after another sequentially. This does not scale 
very well at all, due to the reasons explained by Mike.


The issue commonly discussed on the iMS list has been one of file system 
size. In general the rule of thumb most people had for this was around 
100 to 250 GB per file system, and lots of them, mostly to increase the 
parallelism in the backup process rather than for performance gains in 
the actual functionality of the application.

As a rule of thumb, I group my large users, who have large mailboxes which 
in turn have lots of large attachments, into particular larger file 
systems. Students, who have small quotas and generally lots of small 
messages or small files, go into other, smaller file systems. In this case 
one size really does not suit all. To keep backups within the time 
allocation, a bit of filesystem monitoring is useful. In the days of UFS 
I used to use a command like this to help make decisions.

[r...@xxx]# df -F ufs -o i
Filesystem            iused     ifree  %iused  Mounted on
/dev/md/dsk/d0       605765   6674235      8%  /
/dev/md/dsk/d50     2387509  28198091      8%  /mail1
/dev/md/dsk/d70     2090768  30669232      6%  /mail3
/dev/md/dsk/d60     2447548  30312452      7%  /mail2

I used this to balance the inodes. My guess is that around 85-90% of the 
inodes
in a messaging server store are files with the remainder directories. 
Either way
it is a simple way to make sure the stores are reasonably balanced. I am 
sure there

will be a good way to do this for ZFS?
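
A rough equivalent for ZFS (a sketch; mountpoints are placeholders, and the df inode figures on ZFS are estimates because inodes are allocated dynamically) is simply to count the files per filesystem:

# df -F zfs -o i
# find /mail1 | wc -l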




Adding a second server to half the pool and thereby  half the problem is
another solution (and we would also create more filesystems at the same
time).


It can be a good idea, but it really depends on how many file systems 
you split
your message stores into. Also good for relocating message stores to if 
the first server
fails. This of course depends on your message store architecture. Easy 
to do with Sun
Messaging, not so sure about Cyrus. But I did once run a Simeon message 
server for
a University in London and that was based on Cyrus and was pretty 
similar from recollection.

I'm not convinced this is a good idea. It is a lot of work based on
the assumption that the server is the bottleneck.


Moving the pool to a FC San or a JBOD may also increase performance.
(Less layers, introduced by the appliance, thereby increasing
performance.)


Disagree.


I suspect that if we 'rsync' one of these filesystems to a second
server/pool  that we would also see a performance increase equal to what
we see on the development server. (I don't know how zfs send a receive
work so I don't know if it would address this Filesystem Entropy or
specifically reorganize the files and directories). However, when we
created a testfs filesystem in the zfs pool on the production server,
and copied data to it, we saw the same performance as the other
filesystems, in the same pool.


Directory walkers, like NetBackup or rsync, 

Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Ed Spencer
Concurrency/Parallelism testing.
I have 6 different filesystems populated with email data on our mail
development server.
I rebooted the server before beginning the tests.
The server is a T2000 (sun4v) machine so its ideally suited for this
type of testing.
The test was to tar (to /dev/null) each of the filesystems. Launch 1,
gather stats launch another , gather stats, etc.
The underlying storage system is a Network Appliance. Our only one. In
production. Serving NFS, CIFS and iscsi. Other work the appliance is
doing may affect these tests, and vice versa :) . No one seemed to
notice I was running these tests.

Once 6 concurrent tars are running we are probably seeing benefits of the
ARC.
At certain points I included load averages and traffic stats for each of
the iscsi ethernet interfaces that are configured with MPXIO.

After the first 6 jobs, I launched duplicates of the 6. Then another 6,
etc.

At the end I included the zfs kernel statistics:

1 job
=
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      0      0      0      0
space       70.5G  29.0G     19      0  1.04M      0
space       70.5G  29.0G    268      0  8.71M      0
space       70.5G  29.0G    196      0  11.3M      0
space       70.5G  29.0G    171      0  11.0M      0
space       70.5G  29.0G    182      0  5.01M      0
space       70.5G  29.0G    273      0  9.71M      0
space       70.5G  29.0G    292      0  8.91M      0
space       70.5G  29.0G    279      0  15.4M      0
space       70.5G  29.0G    219      0  11.3M      0
space       70.5G  29.0G    175      0  8.67M      0

2 jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    381      0  23.8M      0
space       70.5G  29.0G    422      0  28.0M      0
space       70.5G  29.0G    386      0  26.5M      0
space       70.5G  29.0G    380      0  22.9M      0
space       70.5G  29.0G    411      0  18.8M      0
space       70.5G  29.0G    393      0  20.7M      0
space       70.5G  29.0G    302      0  15.0M      0
space       70.5G  29.0G    267      0  15.6M      0
space       70.5G  29.0G    304      0  18.7M      0
space       70.5G  29.0G    534      0  19.7M      0
space       70.5G  29.0G    339      0  17.0M      0

3 jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    530      0  22.9M      0
space       70.5G  29.0G    428      0  16.3M      0
space       70.5G  29.0G    439      0  16.4M      0
space       70.5G  29.0G    511      0  22.1M      0
space       70.5G  29.0G    464      0  17.9M      0
space       70.5G  29.0G    371      0  12.1M      0
space       70.5G  29.0G    447      0  16.5M      0
space       70.5G  29.0G    379      0  15.5M      0

4jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    434      0  22.0M      0
space       70.5G  29.0G    506      0  29.5M      0
space       70.5G  29.0G    424      0  21.3M      0
space       70.5G  29.0G    643      0  36.0M      0
space       70.5G  29.0G    688      0  31.1M      0
space       70.5G  29.0G    726      0  37.6M      0
space       70.5G  29.0G    652      0  24.8M      0
space       70.5G  29.0G    646      0  33.9M      0

5jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    629      0  31.1M      0
space       70.5G  29.0G    774      0  45.8M      0
space       70.5G  29.0G    815      0  39.8M      0
space       70.5G  29.0G    895      0  44.4M      0
space       70.5G  29.0G    800      0  48.1M      0
space       70.5G  29.0G    857      0  51.8M      0
space       70.5G  29.0G    725      0  47.6M      0

6jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    924      0  58.8M      0
space       70.5G  29.0G    767      0  51.8M      0
space       70.5G  29.0G    862      0  48.4M      0
space       70.5G  29.0G    977      0  43.9M      0
space       70.5G  29.0G    954      0  53.7M      0
space       70.5G  29.0G    903      0  48.3M      0

# uptime
  2:19pm  up 15 min(s),  2 users,  load average: 1.44, 1.10, 0.67

26MB ( 1 minute average) on each iSCSI ethernet port

12jobs
==
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----

Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Richard Elling


On Aug 11, 2009, at 1:21 PM, Ed Spencer wrote:


Concurrency/Parallelism testing.
I have 6 different filesystems populated with email data on our mail
development server.
I rebooted the server before beginning the tests.
The server is a T2000 (sun4v) machine so its ideally suited for this
type of testing.
The test was to tar (to /dev/null) each of the filesystems. Launch 1,
gather stats launch another , gather stats, etc.
The underlying storage system is a Network Appliance. Our only one. In
production. Serving NFS, CIFS and iscsi. Other work the appliance is
doing may effect these tests, and vice versa :) . No one seemed to
notice I was running these tests.

After 6 concurrent tar's running we are probabaly seeing benefits of  
the

ARC.
At certian points I included load averages and traffic stats for  
each of

the iscsi ethernet interfaces that are configured with MPXIO.

After the first 6 jobs, I launched duplicates of the 6. Then another  
6,

etc.



iostat and zpool iostat measure I/O to the disks. fsstat measures
I/O to the file system (hence the name ;-).  A large discrepancy
between the two is another indicator of filesystem caching.

While tar is slightly interesting, I would expect your normal
backup workload to show a lot of lookups and attr gets. If these
are cached, life will be better.
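
A sketch of running both side by side during a backup window (interval in seconds; 'zfs' selects the filesystem type, or a specific mountpoint can be given instead):

# fsstat zfs 10
# zpool iostat space 10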
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Louis-Frédéric Feuillette
On Tue, 2009-08-11 at 08:04 -0700, Richard Elling wrote:
 On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
  I suspect that if we 'rsync' one of these filesystems to a second
  server/pool  that we would also see a performance increase equal to  
  what
  we see on the development server. (I don't know how zfs send a receive
  work so I don't know if it would address this Filesystem Entropy or
  specifically reorganize the files and directories). However, when we
  created a testfs filesystem in the zfs pool on the production server,
  and copied data to it, we saw the same performance as the other
  filesystems, in the same pool.
 
 Directory walkers, like NetBackup or rsync, will not scale well as
 the number of files increases.  It doesn't matter what file system you
 use, the scalability will look more-or-less similar. For millions of  
 files,
 ZFS send/receive works much better.  More details are in my paper.

Is there link to this paper available?

-- 
Louis-Frédéric Feuillette jeb...@gmail.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Mike Gerdts
On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer ed_spen...@umanitoba.ca wrote:
 We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
 saturday. We backup to disk and then clone to tape. Our backup people
 can only handle doing 2 filesystems per night.

 Creating more filesystems to increase the parallelism of our backup is
 one solution but its a major redesign of the of the mail system.

What is magical about a 1:1 mapping of backup job to file system?
According to the Networker manual[1], a save set in Networker can be
configured to back up certain directories.  According to some random
documentation about Cyrus[2], mail boxes fall under a pretty
predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way your mailboxes get hashed falls into a
structure like $fs/b/bigbird and $fs/g/grover (and not just
$fs/bigbird and $fs/grover), you should be able to set a save set per
top level directory or per group of a few directories.  That is,
create a save set for $fs/a, $fs/b, etc. or $fs/a - $fs/d, $fs/e -
$fs/h, etc.  If you are able to create many smaller save sets and turn
the parallelism up you should be able to drive more throughput.
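
A hedged sketch of enumerating the candidate save set paths from the hash directories (the mount point is a placeholder; the output can then be grouped a few directories per save set in the Networker client definition):

# ls -d /space/fs1/?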

I wouldn't get too worried about ensuring that they all start at the
same time[3], but it would probably make sense to prioritize the
larger ones so that they start early and the smaller ones can fill in
the parallelism gaps as the longer-running ones finish.

3. That is, there is sometimes benefit in having many more jobs to run
than you have concurrent streams.  This avoids having one save set
that finishes long after all the others because of poorly balanced
save sets.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Fri, 2009-08-07 at 19:33, Richard Elling wrote:

 This is very unlikely to be a fragmentation problem. It is a  
 scalability problem
 and there may be something you can do about it in the short term.

You could be right.

Our test mail server, which consists of the exact same design and same
hardware (sun4v) but in a smaller configuration (less memory and 4 x 25G
SAN LUNs), has a backup/copy throughput of 30GB/hour. Data used for testing
was copied from our production mail server.

  Adding another pool and copying all/some data over to it would only  
  a short term solution.
 
 I'll have to disagree.

What is the point of a filesystem that can grow to such a huge size and
not have functionality built in to optimize data layout?  Real world
implementations of filesystems that are intended to live for
years/decades need this functionality, don't they?

Our mail system works well, only the backup doesn't perform well.
All the features of ZFS that make reads perform well (prefetch, ARC)
have little effect.
 
We think backup is quite important. We do quite a few restores of months
old data. Snapshots help in the short term, but for longer term restores
we need to go to tape. 

Of course, as you can tell, I'm kinda stuck on this idea that file and
directory fragmentation is causing our issues with the backup. I don't
know how to analyze the pool to better understand the problem.

If we did chop the pool up into, let's say, 7 pools (one for each current
filesystem), then over time these 7 pools would grow and we would end up
with the same issues. That's why it seems to me to be a short term
solution.

If our issues with zfs are scalability, then you could say zfs is not
scalable. Is that true?
(It certainly is if the solution is to create more pools!)

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mattias Pantzare
  Adding another pool and copying all/some data over to it would only
  a short term solution.

 I'll have to disagree.

 What is the point of a filesystem the can grow to such a huge size and
 not have functionality built in to optimize data layout?  Real world
 implementations of filesystems that are intended to live for
 years/decades need this functionality, don't they?

 Our mail system works well, only the backup doesn't perform well.
 All the features of ZFS that make reads perform well (prefetch, ARC)
 have little effect.

 We think backup is quite important. We do quite a few restores of months
 old data. Snapshots help in the short term, but for longer term restores
 we need to go to tape.

Your scalability problem may be in your backup solution.

The problem is not how many GB of data you have but the number of files.

It has been a while since I worked with Networker, so things may have changed.

If you are doing backups directly to tape you may have a buffering
problem. By simply staging backups on disk we got a lot faster
backups.

Have you configured Networker to do several simultaneous backups from
your pool?
You can do that by having several zfs filesystems on the same pool, or by
telling Networker to do backups one directory level down so that it thinks
you have more file systems. And don't forget to play with the parallelism
settings in Networker.

This made a huge difference for us on VxFS.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Bob Friesenhahn

On Sat, 8 Aug 2009, Ed Spencer wrote:


What is the point of a filesystem the can grow to such a huge size and
not have functionality built in to optimize data layout?  Real world
implementations of filesystems that are intended to live for
years/decades need this functionality, don't they?


Enterprise storage should work fine without needing to run a tool to 
optimize data layout or repair the filesystem.  Well designed software 
uses an approach which does not unravel through use.



Our mail system works well, only the backup doesn't perform well.
All the features of ZFS that make reads perform well (prefetch, ARC)
have little effect.


It is already known that ZFS prefetch is often not aggressive enough 
for bulk reads, and sometimes gets lost entirely.  I think that is the 
first issue to resolve in order to get your backups going faster.


Many of us here already tested our own systems and found that under 
some conditions ZFS was offering up only 30MB/second for bulk data 
reads regardless of how exotic our storage pool and hardware was.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
 Many of us here already tested our own systems and found that under 
 some conditions ZFS was offering up only 30MB/second for bulk data 
 reads regardless of how exotic our storage pool and hardware was.

Just so we are using the same units of measurement: backup/copy
throughput on our development mail server is 8.5MB/sec. The people
running our backups would be overjoyed with that performance.

However backup/copy throughput on our production mail server is 2.25
MB/sec.

The underlying disk is 15000 RPM 146GB FC drives.
Our performance may be hampered somewhat because the luns are on a
Network Appliance accessed via iSCSI, but not to the extent that we are
seeing, and it does not account for the throughput difference in the
development and production pools.

When I talk about fragmentation it's not in the normal sense. I'm not
talking about blocks in a file not being sequential. I'm talking about
files in a single directory that end up spread across the entire
filesystem/pool.

My problem right now is diagnosing the performance issues.  I can't
address them without understanding the underlying cause.  There is a
lack of tools to help in this area. There is also a lack of acceptance
that I'm actually having a problem with zfs. It's frustrating.

Anyone know how to significantly increase the performance of a zfs
filesystem without causing any downtime to an Enterprise email system
used by 30,000 intolerant people, when you don't really know what is
causing the performance issues in the first place? (Yeah, it sucks to be
me!)

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:

 Your scalability problem may be in your backup solution.
We've eliminated the backup system as being involved with the
performance issues. 

The servers are Solaris 10 with the OS on UFS filesystems. (In zfs
terms, the pool is old/mature). Solaris has been patched to a fairly
current level.  

Copying data from the zfs filesystem to the local ufs filesystem enjoys
the same throughput as the backup system. 

The test was simple. Create a test filesystem on the zfs pool. Restore
production email data to it. Reboot the server. Backup the data (29
minutes for a 15.8 gig of data). Reboot the server. Copy data from zfs
to ufs using a 'cp -pr ...' command, which also took 29 minutes. 

And if anyone is interested it only took 15 minutes to restore (write) 
the 15.8GB of data over the network. 
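
(For anyone who wants to reproduce the comparison, the rough recipe was
along these lines; the names are only placeholders:)

zfs create space/backuptest
(restore ~15.8 GB of production mail data into /space/backuptest)
reboot                                   # start with a cold cache
(run the networker backup of /space/backuptest and note the elapsed time)
reboot                                   # cold cache again for a fair test
time cp -pr /space/backuptest /ufs/backuptest-copy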

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mike Gerdts
On Sat, Aug 8, 2009 at 3:02 PM, Ed Spencer ed_spen...@umanitoba.ca wrote:

 On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:

 Enterprise storage should work fine without needing to run a tool to
 optimize data layout or repair the filesystem.  Well designed software
 uses an approach which does not unravel through use.

 Hmmm, this is counter to my understanding. I always thought that to
 optimize sequential read performance you must store the data according
 to how the device will read the data.

 Spinning rust reads data in a sequential fashion. In order to optimize
 read performance it has to be laid down that way.

 When reading files in a directory, the files need to be laid out on the
 physical device sequentially for optimal read performance.

 I'm probably not the person to argue this point though... Is there a DBA
 around?

The DBA's that I know use files that are at least hundreds of
megabytes in size.  Your problem is very different.

 Maybe my problems will go away once we move into the next generation of
 storage devices, SSD's! I'm starting to think that ZFS will really shine
 on SSD's.

Your problem seems to be related to cold reads in a pretty large data
set.  With SSD's (l2arc) you are likely to see a performance boost for
a larger set of recently read files, but my guess is that backups will
still be pretty slow.  There is likely more benefit in restore speed
with SSD's than there is in read speeds.  However, the NVRAM on the
NetApp that is backing your iSCSI LUNs is probably already giving you
most of this benefit (assuming low latency on network connections).

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:

 The DBA's that I know use files that are at least hundreds of
 megabytes in size.  Your problem is very different.
Yes, definitely. 

I'm relating records in a table to my small files because our email
system treats the filesystem as a database.

And in the back of my mind I'm also thinking that you have to
rebuild/repair the database once in a while to improve performance.

And in my case, since the filesystem is the database, I want to do that
to zfs! 

At least that's what I'm thinking; however, and I always come back to
this, I'm not certain what is causing my problem. I need certainty
before taking action on the production system. 

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mike Gerdts
On Sat, Aug 8, 2009 at 3:25 PM, Ed Spencer ed_spen...@umanitoba.ca wrote:

 On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:

 The DBA's that I know use files that are at least hundreds of
 megabytes in size.  Your problem is very different.
 Yes, definitely.

 I'm relating records in a table to my small files because our email
 system treats the filesystem as a database.

Right... but ZFS doesn't understand your application.  The reason that
a file system would put files that are in the same directory in the
same general area on a disk is to minimize seek time.  I would argue
that seek time doesn't matter a whole lot here - at least from the
vantage point of ZFS.  The LUNs that you have presented from the filer
are probably RAID6 across many disks.  ZFS seems to be doing a  4 way
stripe (or are you mirroring or raidz?).  Assuming you are doing
something like a 7+2 RAID6 on the back end, the contents would be
spread across 36 drives.[1]  The trick to making this perform well is
to have 36 * N worker threads.  Mail is a great thing to keep those
spindles kinda busy while getting decent performance.  A small number
of sequential readers - particularly with small files where you can't
do a reasonable job with read-ahead - has little chance of keeping
that number of drives busy.

1. Or you might have 4 LUNs presented from one 4+1 RAID5 in which you
may be forcing more head movement because ZFS thinks it can speed
things up by striping data across the LUNs.

ZFS can recognize a database (or other application) doing a sequential
read on a large file.  While data located sequentially on disk can be
helpful for reads, this is much less important when the pool sits
across tens of disks.  This is because it has the ability to spread
the iops across lots of disks, potentially reading a heavily
fragmented file much faster than a purely sequential file.

In either case, your backup application is competing for iops (and
seeks) with other workload.  With the NetApp backend there are likely
other applications on the same aggregate that are forcing head
movement away from any data belonging to these LUNs.

 And in the back of my mind I'm also thinking that you have to
 rebuild/repair the database once in a while to improve performance.

Certainly.  Databases become fragmented and are reorganized to fix this.

 And in my case, since the filesystem is the database, I want to do that
 to zfs!

 At least that's what I'm thinking; however, and I always come back to
 this, I'm not certain what is causing my problem. I need certainty
 before taking action on the production system.

Most databases are written in such a way that they can be optimized
for sequential reads (table scans) and for backups, whether on raw
disk or on a file system.  The more advanced the database is, the more
likely it is to ask the file system to get out of its way and *not* do
anything fancy.

It seems that cyrus was optimized for operations that make sense for a
mail program (deliver messages, retrieve messages, delete messages)
and nothing else.  I would argue that any application that creates
lots of tiny files is not optimized for backing up using a small
number of streams.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:

 A SSD slog backed by a SAS 15K JBOD array should perform much better 
 than a big iSCSI LUN.

Now...yes. We implemented this pool years ago. I believe, back then, the
server would crash if you had a zfs drive fail. We decided to let the
netapp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly. And a 7000
series appliance would also be a nice addition to our central NFS
service. Not to mention more cost effective than expanding our Network
Appliance (We have researchers who are quite hungry for storage and NFS
is always our first choice).

We now have quite an investment in the current implementation. It's
difficult to move away from. The netapp is quite a reliable product.

We are quite happy with zfs and our implementation. We just need to
address our backup performance and  improve it just a little bit!

We were almost lynched this spring because we encountered some pretty
severe zfs bugs. We are still running the IDR named "A wad of ZFS bug
fixes for Solaris 10 Update 6". It took over a month to resolve the
issues.

I work at a University and Final Exams and year end occur at the same
time. I don't recommend having email problems during this time! People
are intolerant to email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a zfs
patch, or an iSCSI patch, or something will come along that improves our
performance just a bit so our backup people get off my back!

-- 
Ed 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Ed Spencer

On Sat, 2009-08-08 at 15:05, Mike Gerdts wrote:
 On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer ed_spen...@umanitoba.ca wrote:
 
  On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
  Many of us here already tested our own systems and found that under
  some conditions ZFS was offering up only 30MB/second for bulk data
  reads regardless of how exotic our storage pool and hardware was.
 
  Just so we are using the same units of measurements. Backup/copy
  throughput on our development mail server is 8.5MB/sec. The people
  running our backups would be over joyed with that performance.
 
  However backup/copy throughput on our production mail server is 2.25
  MB/sec.
 
  The underlying disk is 15000 RPM 146GB FC drives.
  Our performance may be hampered somewhat because the luns are on a
  Network Appliance accessed via iSCSI, but not to the extent that we are
  seeing, and it does not account for the throughput difference in the
  development and production pools.
 
 NetApp filers run WAFL - Write Anywhere File Layout.  Even if ZFS
 arranged everything perfectly (however that is defined) WAFL would
 undo its hard work.
 
 Since you are using iSCSI, I assume that you have disabled the Nagle
 algorithm and increased  tcp_xmit_hiwat and tcp_recv_hiwat.  If not,
 go do that now.
We've tried many different iSCSI parameter changes on our development server:
Jumbo Frames
Disabling the Nagle algorithm
I'll double check next week on tcp_xmit_hiwat and tcp_recv_hiwat.

Nothing has made any real difference.
We are only using about 5% of the bandwidth on our IP SAN.

We use two Cisco ethernet switches on the IP SAN. The iSCSI initiators
use MPXIO in a round-robin configuration.
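
(For reference, on Solaris these are normally checked and set with ndd;
the values below are only examples, they are lost at reboot unless put in
a startup script, and windows larger than tcp_max_buf also require raising
tcp_max_buf:)

ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_recv_hiwat
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576
ndd -set /dev/tcp tcp_naglim_def 1      # effectively disables Nagle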

  When I talk about fragmentation its not in the normal sense. I'm not
  talking about blocks in a file not being sequential. I'm talking about
  files in a single directory that end up spread across the entire
  filesytem/pool.
 
 It's tempting to think that if the files were in roughly the same area
 of the block device that ZFS sees that reading the files sequentially
 would at least trigger a read-ahead at the filer.  I suspect that even
 a moderate amount of file creation and deletion would cause the I/O
 pattern to be random enough (not purely sequential) that the back-end
 storage would not have a reasonable chance of recognizing it as a good
 time for read-ahead.  Further, since the backup application is
 probably in a loop of:
 
  while there are more files in the directory
     if next file mtime > last backup time
        open file
        read file contents, send to backup stream
        close file
     end if
  end while
 
 In other words, other I/O operations are interspersed between the
 sequential data reads, some files are likely to be skipped, and there
 is latency introduced by writing to the data stream.  I would be
 surprised to see any file system do intelligent read-ahead here.  In
 other words, lots of small file operations make backups and especially
 restores go slowly.  More backup and restore streams will almost
 certainly help.  Multiplex the streams so that you can keep your tapes
 moving at a constant speed.

We backup to disk first and then put to tape later.

 Do you have statistics on network utilization to ensure that you
 aren't stressing it?
 
 Have you looked at iostat data to be sure that you are seeing asvc_t +
 wsvc_t that supports the number of operations that you need to
 perform?  That is if asvc_t + wsvc_t for a device adds up to 10 ms, a
 workload that waits for the completion of one I/O before issuing the
 next will max out at 100 iops.  Presumably ZFS should hide some of
 this from you[1], but it does suggest that each backup stream would be
 limited to about 100 files per second[2].  This is because the read
 request for one file does not happen before the close of the previous
 file[3].  Since cyrus stores each message as a separate file, this
 suggests that 2.5 MB/s corresponds to average mail message size of 25
 KB.
 
 1. via metadata caching, read-ahead on file data reads, etc.
 2. Assuming wsvc_t + asvc_t = 10 ms
 3. Assuming that networker is about as smart as tar, zip, cpio, etc.

There is a backup of a single filesystem in the pool going on right now:
# zpool iostat 5 5
        capacity     operations    bandwidth
pool    used  avail   read  write   read  write
-----  -----  -----  -----  -----  -----  -----
space  1.05T   965G     97     69  5.24M  2.71M
space  1.05T   965G    113     10  6.41M   996K
space  1.05T   965G    100    112  2.87M  1.81M
space  1.05T   965G    112      8  2.35M  35.9K
space  1.05T   965G    106      3  1.76M  55.1K

Here are examples :
iostat -xpn 5 5
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

   17.1   29.2  746.7  317.1  0.0  0.6    0.0   12.5   0  27
c4t60A98000433469764E4A2D456A644A74d0
   25.0   11.9  991.9  277.0  0.0  0.6    0.0   16.1   0  36

Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Richard Elling


On Aug 8, 2009, at 5:02 AM, Ed Spencer wrote:



On Fri, 2009-08-07 at 19:33, Richard Elling wrote:


This is very unlikely to be a fragmentation problem. It is a
scalability problem
and there may be something you can do about it in the short term.


You could be right.

Our test mail server, which has the exact same design and hardware
(SUN4V) but a smaller configuration (less memory and 4 x 25GB SAN
LUNs), has a backup/copy throughput of 30GB/hour. Data used for testing
was copied from our production mail server.


Adding another pool and copying all/some data over to it would only be
a short-term solution.


I'll have to disagree.


What is the point of a filesystem that can grow to such a huge size and
not have functionality built in to optimize data layout?  Real world
implementations of filesystems that are intended to live for
years/decades need this functionality, don't they?

Our mail system works well, only the backup doesn't perform well.
All the features of ZFS that make reads perform well (prefetch, ARC)
have little effect.


The best workload is one that doesn't read from disk to begin with :-)
For workloads with millions of files (eg large-scale mail servers) you
will need to increase the size of the Directory Name Lookup Cache
(DNLC). By default, it is way too small for such workloads. If the
directory names are in cache, then they do not have to be read from
disk -- a big win.

You can see how well the DNLC is working by looking at the output of
vmstat -s and looking for the "total name lookups" line. You can size the DNLC
by tuning the ncsize parameter, but it requires a reboot.  See the
Solaris Tunable Parameters Guide for details.
http://docs.sun.com/app/docs/doc/817-0404/chapter2-35?a=view
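
As a rough illustration (the ncsize value below is only an example; size it
to the number of files and directories that are actually active):

vmstat -s | grep 'total name lookups'   # hit rate is shown on that line
echo ncsize/D | mdb -k                  # current DNLC size
(to enlarge it, add a line like the following to /etc/system and reboot)
set ncsize=1000000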

I'd like to revisit the backup problem, but that is much more complicated
and probably won't fit in a mail thread very easily (hence, the white
paper :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Bob Friesenhahn

On Sat, 8 Aug 2009, Ed Spencer wrote:

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.9   43.0  528.9 1972.8  0.0  2.1    0.0   38.9   0  31
c4t60A98000433469764E4A2D456A644A74d0
   17.0   19.6  496.9 1499.0  0.0  1.4    0.0   38.8   0  39
c4t60A98000433469764E4A2D456A696579d0
   14.0   30.0  670.2 1971.3  0.0  1.7    0.0   38.0   0  34
c4t60A98000433469764E4A476D2F664E4Fd0
   19.7   28.7  985.2 1647.6  0.0  1.6    0.0   32.5   0  37
c4t60A98000433469764E4A476D2F6B385Ad0


I have this in my /etc/system file:

* Set device I/O maximum concurrency
* 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29

set zfs:zfs_vdev_max_pending = 5

This parameter may be worthwhile to look at to reduce your asvc_t. 
It seems that the default (35) is tuned for a true JBOD setup and not 
a SAN-hosted LUN.


As I recall, you can use the kernel debugger to set it while the 
system is running and immediately see differences in iostat output.
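
Something like the following should do it (treat it as a sketch and be
careful with mdb -kw on a production box):

echo zfs_vdev_max_pending/D | mdb -k      # show the current value
echo zfs_vdev_max_pending/W0t5 | mdb -kw  # set it to 5 on the live system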


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mattias Pantzare
On Sat, Aug 8, 2009 at 20:20, Ed Spencer ed_spen...@umanitoba.ca wrote:

 On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:

 Your scalability problem may be in your backup solution.
 We've eliminated the backup system as being involved with the
 performance issues.

 The servers are Solaris 10 with the OS on UFS filesystems. (In zfs
 terms, the pool is old/mature). Solaris has been patched to a fairly
 current level.

 Copying data from the zfs filesystem to the local ufs filesystem enjoys
 the same throughput as the backup system.

 The test was simple. Create a test filesystem on the zfs pool. Restore
 production email data to it. Reboot the server. Backup the data (29
 minutes for a 15.8 gig of data). Reboot the server. Copy data from zfs
 to ufs using a 'cp -pr ...' command, which also took 29 minutes.

Yes, that was expected. What happens if you run two cp -pr at the same
time? I am guessing that two cp will take almost the same time as one.

If you get twice the performance from two cp  then you will get twice
the performance from doing two backups in parallel.
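
A quick way to check (the paths below are only placeholders), comparing
one copy stream against two running at once:

time cp -pr /space/fs1/dirA /ufs/copyA

time sh -c 'cp -pr /space/fs1/dirC /ufs/copyC & cp -pr /space/fs1/dirD /ufs/copyD & wait'
(if the elapsed time of the second run is about the same as the first,
parallel backup streams should roughly double your throughput)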
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs fragmentation

2009-08-07 Thread Hua
1. Due to the COW nature of zfs, files on zfs are more prone to becoming fragmented 
compared to a traditional file system. Is this statement correct?

2. If so, the common understanding is that fragmentation causes performance degradation; 
does this affect zfs, and to what extent is zfs performance affected by fragmentation?

3. Being a relatively new file system, has it seen much adoption in large 
implementations?

4. Googling zfs fragmentation doesn't return a lot of results. That could be because 
either there isn't much major adoption of zfs yet, or fragmentation isn't really a 
problem for zfs.

Any information is appreciated.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Bob Friesenhahn

On Thu, 6 Aug 2009, Hua wrote:

1. Due to the COW nature of zfs, files on zfs are more prone to becoming 
fragmented compared to a traditional file system. Is this statement 
correct?


Yes and no.  Fragmentation is a complex issue.

ZFS uses 128K data blocks by default whereas other filesystems 
typically use 4K or 8K blocks.  This naturally reduces the potential 
for fragmentation by 32X over 4k blocks.


ZFS storage pools are typically comprised of multiple vdevs and 
writes are distributed over these vdevs.  This means that the first 
128K of a file may go to the first vdev and the second 128K may go to 
the second vdev.  It could be argued that this is a type of 
fragmentation but since all of the vdevs can be read at once (if zfs 
prefetch chooses to do so) the seek time for single-user contiguous 
access is essentially zero since the seeks occur while the application 
is already busy processing other data.  When mirror vdevs are used, 
any device in the mirror may be used to read the data.


ZFS uses a slab allocator and allocates large contiguous chunks of space 
from the vdev storage, and then carves the 128K blocks from those 
large chunks.  This dramatically increases the probability that 
related data will be very close on the same disk.


ZFS delays ordinary writes to the very last minute according to these 
rules (my understanding): 7/8th total memory consumed, 5 seconds of 
100% write I/O is collected, or 30 seconds has elapsed.  Since quite a 
lot of data is written at once, zfs is able to write that data in the 
best possible order.


ZFS uses a copy-on-write model.  Copy-on-write tends to cause 
fragmentation if portions of existing files are updated.  If a large 
portion of a file is overwritten in a short period of time, the result 
should be reasonably fragment-free but if parts of the file are 
updated over a long period of time (like a database) then the file is 
certain to be fragmented.  This is not such a big problem as it 
appears to be since such files were already typically accessed using 
random access.


ZFS absolutely observes synchronous write requests (e.g. by NFS or a 
database).  The synchronous write requests do not benefit from the 
long write aggregation delay so the result may not be written as 
ideally as ordinary write requests.  Recently zfs has added support 
for using a SSD as a synchronous write log, and this allows zfs to 
turn synchronous writes into more ordinary writes which can be written 
more intelligently while returning to the user with minimal latency.


Perhaps the most significant fragmentation concern for zfs is if the 
pool is allowed to become close to 100% full.  Similar to other 
filesystems, the quality of the storage allocations goes downhill fast 
when the pool is almost 100% full, so even files written contiguously 
may be written in fragments.


3. Being a relatively new file system, has it seen much adoption in 
large implementations?


There are indeed some sites which heavily use zfs.  One very large 
site using zfs is archive.org.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Scott Meilicke
 ZFS absolutely observes synchronous write requests (e.g. by NFS or a 
 database). The synchronous write requests do not benefit from the 
 long write aggregation delay so the result may not be written as 
 ideally as ordinary write requests. Recently zfs has added support 
 for using a SSD as a synchronous write log, and this allows zfs to 
 turn synchronous writes into more ordinary writes which can be written 
 more intelligently while returning to the user with minimal latency.

Bob, since the ZIL is used always, whether a separate device or not, won't 
writes to a system without a separate ZIL also be written as intelligently as 
with a separate ZIL?

Thanks,
Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Neil Perrin



On 08/07/09 10:54, Scott Meilicke wrote:
ZFS absolutely observes synchronous write requests (e.g. by NFS or a 
database). The synchronous write requests do not benefit from the 
long write aggregation delay so the result may not be written as 
ideally as ordinary write requests. Recently zfs has added support 
for using a SSD as a synchronous write log, and this allows zfs to 
turn synchronous writes into more ordinary writes which can be written 
more intelligently while returning to the user with minimal latency.


Bob, since the ZIL is used always, whether a separate device or not,
won't writes to a system without a separate ZIL also be written as
intelligently as with a separate ZIL?


- Yes. ZFS uses the same code path (intelligence?) to write out the data
from NFS - regardless of whether there's a separate log (slog) or not.



Thanks,
Scott

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Bob Friesenhahn

On Fri, 7 Aug 2009, Scott Meilicke wrote:


Bob, since the ZIL is used always, whether a separate device or not, 
won't writes to a system without a separate ZIL also be written as 
intelligently as with a separate ZIL?


I don't know the answer to that.  Perhaps there is no current 
advantage.  The longer the final writes can be deferred, the more 
opportunity there is to write the data with a better layout, or to 
avoid writing some data at all.


One thing I forgot to mention in my summary is that zfs is commonly 
used in multi-user environments where there may be many simultaneous 
writers.  Simultaneous writers tend to naturally fragment a filesystem 
unless the filesystem is willing to spread the data out in advance and 
take a seek hit (from one file to another) for each file write.  Zfs 
deferment of the writes allows the data to be written more 
intelligently in these multi-user environments.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Ed Spencer
Let me give a real life example of what I believe is a fragmented zfs pool.

Currently the pool is 2 terabytes in size  (55% used) and is made of 4 san luns 
(512gb each).
The pool has never gotten close to being full. We increase the size of the pool 
by adding 2 512gb luns about once a year or so.

The pool has been divided into 7 filesystems.

The pool is used for imap email data. The email system (cyrus) has 
approximately 80,000 accounts all located within the pool, evenly distributed 
between the filesystems.

Each account has a directory associated with it. This directory is the users 
inbox. Additional mail folders are subdirectories. Mail is stored as individual 
files.

We receive mail at a rate of 0-20MB/Second, every minute of every  hour of 
every day of every week, etc etc.

Users receive mail constantly over time. They read it and then either delete it 
or store it in a subdirectory/folder.

I imagine that my mail (located in a single subdirectory structure) is spread 
over the entire pool because it has been received over time. I believe the data 
is highly fragmented (from a file and directory perspective).

The result of this is that backup throughput of a single filesystem in this pool 
is about 8GB/hour.
We use EMC networker for backups.
  
This is a problem. There are no utilities available to evaluate this type of 
fragmentation.  
There are no utilities to fix it.

ZFS, from the mail system perspective works great. 
Writes and random reads operate well.

Backup is a problem and not just because of small files, but small files 
scattered over the entire pool. 

Adding another pool and copying all/some data over to it would only be a short 
term solution.

I believe zfs needs a feature that operates in the background and defrags the 
pool to optimize sequential reads of the file and directory structure.

Ed
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-07 Thread Richard Elling

On Aug 7, 2009, at 2:29 PM, Ed Spencer wrote:

Let me give a real life example of what I believe is a fragmented  
zfs pool.


Currently the pool is 2 terabytes in size  (55% used) and is made of  
4 san luns (512gb each).
The pool has never gotten close to being full. We increase the size  
of the pool by adding 2 512gb luns about once a year or so.


The pool has been divided into 7 filesystems.

The pool is used for imap email data. The email system (cyrus) has  
approximately 80,000 accounts all located within the pool, evenly  
distributed between the filesystems.


Each account has a directory associated with it. This directory is  
the users inbox. Additional mail folders are subdirectories. Mail is  
stored as individual files.


We receive mail at a rate of 0-20MB/Second, every minute of every   
hour of every day of every week, etc etc.


Users receive mail constantly over time. They read it and then  
either delete it or store it in a subdirectory/folder.


I imagine that my mail (located in a single subdirectory structure)  
is spread over the entire pool because it has been received over  
time. I believe the data is highly fragmented (from a file and  
directory perspective).


The result of this is that backup throughput of a single filesystem  
in this pool is about 8GB/hour.

We use EMC networker for backups.


This is very unlikely to be a fragmentation problem. It is a  
scalability problem

and there may be something you can do about it in the short term.

However, though I usually like to tease, in this case I need to tease. I
recently completed a white paper on this exact workload and how we
designed it to scale. I hope to publish that paper RSN.  When the paper
hits the web, I'll restart a new thread on using ZFS for large-scale  
email

systems.



This is a problem. There are no utilities available to evaluate this  
type of fragmentation.

There are no utilities to fix it.

ZFS, from the mail system perspective works great.
Writes and random reads operate well.

Backup is a problem and not just because of small files, but small  
files scattered over the entire pool.


Adding another pool and copying all/some data over to it would only be  
a short-term solution.


I'll have to disagree.

I believe zfs needs a feature that operates in the background and  
defrags the pool to optimize sequential reads of the file and  
directory structure.


This will not solve your problem, but there are other methods that can.
 -- richard



Ed
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-12-02 Thread t. johnson

 One would expect so, yes. But the usefulness of this is limited to the cases 
 where the entire working set will fit into an SSD cache.


 Not entirely out of the question. SSDs can be purchased today
 with more than 500 GBytes in a 2.5 form factor. One or more of
 these would make a dandy L2ARC.
 http://www.stecinc.com/product/mach8mlc.php


Speaking of which.. what's the current limit on L2ARC size? Gathering tidbits 
here and there (7000 storage line config limits, FAST talk given by Bill Moore) 
there are indications that L2ARC can only be ~500GB?

Is this the case? If so, is that a raw size limitation or a number of devices 
used to form the L2ARC limitation or something else? I'm sure some of us can 
come with examples where we really would like to use much more than a 500GB 
L2ARC :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-12-02 Thread Darren J Moffat
t. johnson wrote:
 One would expect so, yes. But the usefulness of this is limited to the 
 cases where the entire working set will fit into an SSD cache.

 Not entirely out of the question. SSDs can be purchased today
 with more than 500 GBytes in a 2.5 form factor. One or more of
 these would make a dandy L2ARC.
 http://www.stecinc.com/product/mach8mlc.php
 
 
 Speaking of which.. what's the current limit on L2ARC size? Gathering tidbits 
 here and there (7000 storage line config limits, FAST talk given by Bill 
 Moore) there are indications that L2ARC can only be ~500GB?

There are no limits on the size of the L2ARC that I could find 
implemented in the source code.

However every buffer that is cached on an L2ARC device needs an ARC 
header in the in memory ARC that points to it.  So in practical terms 
there will be a limit on the size of an L2ARC based on the size of 
physical ram.

For example a machine with 512 MegaByte RAM and a 500GByte SSD L2ARC is 
probably pretty silly.

I'll leave it as an exercise to the reader to work out how much core 
memory is needed based on the sizes of arc_buf_t (0x30) and 
arc_buf_hdr_t (0xf8).
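
A rough back-of-the-envelope version of that exercise, assuming about
0x30 + 0xf8 = 296 bytes of in-memory header per cached L2ARC buffer and
ignoring all other overhead:

  500 GB L2ARC / 128K records = ~4.1M buffers  * 296 bytes ~= 1.1 GB of RAM
  500 GB L2ARC /   8K records = ~65.5M buffers * 296 bytes ~= 18 GB of RAM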

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-24 Thread Richard Elling
Luke Lonergan wrote:
 Actually, it does seem to work quite
 well when you use a read optimized
 SSD for the L2ARC.  In that case,
 random read workloads have very
 fast access, once the cache is warm.
 

 One would expect so, yes.  But the usefulness of this is limited to the cases 
 where the entire working set will fit into an SSD cache.
   

Not entirely out of the question. SSDs can be purchased today
with more than 500 GBytes in a 2.5 form factor.  One or more of
these would make a dandy L2ARC.
http://www.stecinc.com/product/mach8mlc.php

 In other words, for random access across a working set larger (by say X%) 
 than the SSD-backed L2 ARC, the cache is useless.  This should asymptotically 
 approach truth as X grows and experience shows that X=200% is where it's 
 about 99% true.

 As time passes and SSDs get larger while many OLTP random workloads remain 
 somewhat constrained in size, this becomes less important.
   

You can also purchase machines with 2+ TBytes of RAM, which will
do nicely for caching most OLTP databases :-) 
 Modern DB workloads are becoming hybridized, though.  A 'mixed workload' 
 scenario is now common where there are a mix of updated working sets and 
 indexed access alongside heavy analytical 'update rarely if ever' kind of 
 workloads.
   

Agree.  We think that the hybrid storage pool architecture will work
well for a variety of these workloads, but the proof will be in the
pudding.  No doubt we'll discover some interesting interactions along
the way.  Stay tuned...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-23 Thread Bob Friesenhahn
On Sat, 22 Nov 2008, Bob Netherton wrote:


 In other words, for random access across a working set larger (by 
 say X%) than the SSD-backed L2 ARC, the cache is useless.  This 
 should asymptotically approach truth as X grows and experience 
 shows that X=200% is where it's about 99% true.

 Ummm, before we throw around phrases like useless, how about a 
 little testing ?  I like a good academic argument just like the next 
 guy, but before I dismiss something completely out of hand I'd like 
 to see some data.

This argument can be proven by basic statistics without need to resort 
to actual testing.

A similar issue applies to non-volatile write caches.

Luckily, most data access is not completely random in nature.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-23 Thread Bob Netherton

 This argument can be proven by basic statistics without need to resort 
 to actual testing.

Mathematical proof != reality of how things end up getting used.

 Luckily, most data access is not completely random in nature.

Which was my point exactly.   I've never seen a purely mathematical
model put in production anywhere :-)


Bob

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-23 Thread Bob Friesenhahn
On Sun, 23 Nov 2008, Bob Netherton wrote:

 This argument can be proven by basic statistics without need to resort
 to actual testing.

 Mathematical proof != reality of how things end up getting used.

Right.  That is a good thing since otherwise the technologies that Sun 
has recently deployed for Amber Road would be deemed virtually 
useless (as would most computing architectures).  It is quite trivial 
to demonstrate scenarios where read caches will fail, or NV write 
cache devices will become swamped (regardless of capacity) and 
worthless.  Luckily, these are not the common scenarios for most 
users.

For the write cache case it may be seen that if the volume of writes 
continually exceeds the write rate of the backing store and is 
continually to new locations, then the write cache becomes useless 
since it will always become full.

The read cache case is subject to the normal rules which require that 
the read cache needs to be large enough to contain the common working 
set of data in order for it to be effective.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Tamer Embaby
Kees Nuyt wrote:
 My explanation would be: Whenever a block within a file
 changes, zfs has to write it at another location (copy on
 write), so the previous version isn't immediately lost.

 Zfs will try to keep the new version of the block close to
 the original one, but after several changes on the same
 database page, things get pretty messed up and logical
 sequential I/O becomes pretty much physically random indeed.

 The original blocks will eventually be added to the freelist
 and reused, so proximity can be restored, but it will never
 be 100% sequential again.
 The effect is larger when many snapshots are kept, because
 older block versions are not freed, or when the same block
 is changed very often and freelist updating has to be
 postponed.

 That is the trade-off between always consistent and
 fast.
   
Well, does that mean ZFS is not best suited for database engines as the
underlying filesystem?  With databases it will always be fragmented, hence slow
performance?

Because in that case it would be best to use it for large file servers whose
files don't usually change frequently.

Thanks,
Tamer
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Luke Lonergan
ZFS works marvelously well for data warehouse and analytic DBs.  For lots of 
small updates scattered across the breadth of the persistent working set, it's 
not going to work well IMO.

Note that we're using ZFS to host databases as large as 10,000 TB - that's 10PB 
(!!).  Solaris 10 U5 on X4540.  That said - it's on 96 servers running 
Greenplum DB.

With SSD, the randomness won't matter much I expect, though the filesystem 
won't be helping by virtue of this fragmentation effect of COW.

- Luke

- Original Message -
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
To: zfs-discuss@opensolaris.org zfs-discuss@opensolaris.org
Sent: Sat Nov 22 16:43:53 2008
Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

Kees Nuyt wrote:
 My explanation would be: Whenever a block within a file
 changes, zfs has to write it at another location (copy on
 write), so the previous version isn't immediately lost.

 Zfs will try to keep the new version of the block close to
 the original one, but after several changes on the same
 database page, things get pretty messed up and logical
 sequential I/O becomes pretty much physically random indeed.

 The original blocks will eventually be added to the freelist
 and reused, so proximity can be restored, but it will never
 be 100% sequential again.
 The effect is larger when many snapshots are kept, because
 older block versions are not freed, or when the same block
 is changed very often and freelist updating has to be
 postponed.

 That is the trade-off between always consistent and
 fast.

Well, does that mean ZFS is not best suited for database engines as
underlying
filesystem?  With databases it will always be fragmented, hence slow
performance?

Because this way it would be best to use it for large file server that
don't usually change frequently.

Thanks,
Tamer
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Bob Friesenhahn
On Sun, 23 Nov 2008, Tamer Embaby wrote:
 That is the trade-off between always consistent and
 fast.

 Well, does that mean ZFS is not best suited for database engines as 
 underlying filesystem?  With databases it will always be fragmented, 
 hence slow performance?

Assuming that the filesystem block size matches the database block size 
there is not so much of an issue with fragmentation because databases 
are generally fragmented (almost by definition) due to their nature of 
random access.  Only a freshly written database from carefully ordered 
insert statements might be in a linear order, and only for accesses in 
the same linear order.  Database indexes could be negatively impacted, 
but they are likely to be cached in RAM anyway.  I understand that zfs 
uses a slab allocator so that file data is reserved in larger slabs 
(e.g. 1MB) and then the blocks are carved out of that.  This tends to 
keep more of the file data together and reduces allocation overhead.
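
(As an aside, matching the recordsize to the database block size is a
one-line property change, but it only applies to files written after the
change, so it needs to be done before loading the data.  The dataset name
and the 8K value below are only examples; use whatever block size the
database actually writes.)

zfs set recordsize=8k tank/db
zfs get recordsize tank/db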

Fragmentation is more of an impact for large files which should 
usually be accessed sequentially.

Zfs's COW algorithm and ordered writes will always be slower than for 
filesystems which simply overwrite existing blocks, but there is a 
better chance that the database will be immediately usable if someone 
pulls the power plug, and without needing to rely on special 
battery-backed hardware.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Richard Elling
Luke Lonergan wrote:
 ZFS works marvelously well for data warehouse and analytic DBs.  For lots of 
 small updates scattered across the breadth of the persistent working set, 
 it's not going to work well IMO.
   

Actually, it does seem to work quite well when you use a read optimized
SSD for the L2ARC.  In that case, random read workloads have very
fast access, once the cache is warm.
 -- richard

 Note that we're using ZFS to host databases as large as 10,000 TB - that's 
 10PB (!!).  Solaris 10 U5 on X4540.  That said - it's on 96 servers running 
 Greenplum DB.

 With SSD, the randomness won't matter much I expect, though the filesystem 
 won't be helping by virtue of this fragmentation effect of COW.

 - Luke

 - Original Message -
 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 To: zfs-discuss@opensolaris.org zfs-discuss@opensolaris.org
 Sent: Sat Nov 22 16:43:53 2008
 Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

 Kees Nuyt wrote:
   
 My explanation would be: Whenever a block within a file
 changes, zfs has to write it at another location (copy on
 write), so the previous version isn't immediately lost.

 Zfs will try to keep the new version of the block close to
 the original one, but after several changes on the same
 database page, things get pretty messed up and logical
 sequential I/O becomes pretty much physically random indeed.

 The original blocks will eventually be added to the freelist
 and reused, so proximity can be restored, but it will never
 be 100% sequential again.
 The effect is larger when many snapshots are kept, because
 older block versions are not freed, or when the same block
 is changed very often and freelist updating has to be
 postponed.

 That is the trade-off between always consistent and
 fast.

 
 Well, does that mean ZFS is not best suited for database engines as
 underlying
 filesystem?  With databases it will always be fragmented, hence slow
 performance?

 Because this way it would be best to use it for large file server that
 don't usually change frequently.

 Thanks,
 Tamer
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Luke Lonergan

 Actually, it does seem to work quite
 well when you use a read optimized
 SSD for the L2ARC.  In that case,
 random read workloads have very
 fast access, once the cache is warm.

One would expect so, yes.  But the usefulness of this is limited to the cases 
where the entire working set will fit into an SSD cache.

In other words, for random access across a working set larger (by say X%) than 
the SSD-backed L2 ARC, the cache is useless.  This should asymptotically 
approach truth as X grows and experience shows that X=200% is where it's about 
99% true.

As time passes and SSDs get larger while many OLTP random workloads remain 
somewhat constrained in size, this becomes less important.

Modern DB workloads are becoming hybridized, though.  A 'mixed workload' 
scenario is now common where there are a mix of updated working sets and 
indexed access alongside heavy analytical 'update rarely if ever' kind of 
workloads.

- Luke

- Original Message -
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
To: Luke Lonergan
Cc: [EMAIL PROTECTED] [EMAIL PROTECTED]; zfs-discuss@opensolaris.org 
zfs-discuss@opensolaris.org
Sent: Sat Nov 22 20:28:54 2008
Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

Luke Lonergan wrote:
 ZFS works marvelously well for data warehouse and analytic DBs.  For lots of 
 small updates scattered across the breadth of the persistent working set, 
 it's not going to work well IMO.


Actually, it does seem to work quite well when you use a read optimized
SSD for the L2ARC.  In that case, random read workloads have very
fast access, once the cache is warm.
 -- richard

 Note that we're using ZFS to host databases as large as 10,000 TB - that's 
 10PB (!!).  Solaris 10 U5 on X4540.  That said - it's on 96 servers running 
 Greenplum DB.

 With SSD, the randomness won't matter much I expect, though the filesystem 
 won't be helping by virtue of this fragmentation effect of COW.

 - Luke

 - Original Message -
 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 To: zfs-discuss@opensolaris.org zfs-discuss@opensolaris.org
 Sent: Sat Nov 22 16:43:53 2008
 Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

 Kees Nuyt wrote:

 My explanation would be: Whenever a block within a file
 changes, zfs has to write it at another location (copy on
 write), so the previous version isn't immediately lost.

 Zfs will try to keep the new version of the block close to
 the original one, but after several changes on the same
 database page, things get pretty messed up and logical
 sequential I/O becomes pretty much physically random indeed.

 The original blocks will eventually be added to the freelist
 and reused, so proximity can be restored, but it will never
 be 100% sequential again.
 The effect is larger when many snapshots are kept, because
 older block versions are not freed, or when the same block
 is changed very often and freelist updating has to be
 postponed.

 That is the trade-off between always consistent and
 fast.


 Well, does that mean ZFS is not best suited for database engines as
 underlying
 filesystem?  With databases it will always be fragmented, hence slow
 performance?

 Because this way it would be best to use it for large file server that
 don't usually change frequently.

 Thanks,
 Tamer
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-22 Thread Bob Netherton

 In other words, for random access across a working set larger (by say X%) 
 than the SSD-backed L2 ARC, the cache is useless.  This should asymptotically 
 approach truth as X grows and experience shows that X=200% is where it's 
 about 99% true.
   
Ummm, before we throw around phrases like "useless", how about a little
testing?  I like a good academic argument just like the next guy, but
before I dismiss something completely out of hand I'd like to see some
data.

Bob
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-21 Thread Vincent Kéravec
I just tried ZFS on one of our slaves and got some really bad performance.

When I started the server yesterday, it was able to keep up with the main server 
without problems, but after two days of consecutive running the server is crushed 
by IO.

After running the dtrace script iopattern, I noticed that the workload is now 
100% random IO. Copying the database (140GB) from one directory to another took 
more than 4 hours without any other tasks running on the server, and all the 
reads on tables that were updated were random... Keeping an eye on iopattern and 
zpool iostat I saw that when the system was accessing files that have not been 
changed the disk was reading sequentially at more than 50MB/s, but when reading 
files that changed often the speed dropped to 2-3 MB/s.

The server has plenty of disk space so it should not have such a level of file 
fragmentation in such a short time.

For information, I'm using Solaris 10/08 with a mirrored root pool on two 1TB 
SATA hard disks (slow with random IO). I'm using MySQL 5.0.67 with the MyISAM 
engine. The zfs recordsize is 8k as recommended in the zfs guide.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation with MySQL databases

2008-11-21 Thread Kees Nuyt
[Default] On Fri, 21 Nov 2008 17:20:48 PST, Vincent Kéravec
[EMAIL PROTECTED] wrote:

 I just try ZFS on one of our slave and got some really
 bad performance.
 
 When I start the server yesterday, it was able to keep
 up with the main server without problem but after two
 days of consecutive run the server is crushed by IO.
 
 After running the dtrace script iopattern, I notice
 that the workload is now 100% Random IO. Copying the
 database (140Go) from one directory to an other took
 more than 4 hours without any other tasks running on
 the server, and all the reads on table that where
 updated where random... Keeping an eye on iopattern and
 zpool iostat I saw that when the systems was accessing
 file that have not been changed the disk was reading
 sequentially at more than 50Mo/s but when reading files
 that changed often the speed got down to 2-3 Mo/s.

Good observation and analysis.
 
 The server has plenty of diskplace so it should not
 have such a level of file fragmentation in such a short
 time.

My explanation would be: Whenever a block within a file
changes, zfs has to write it at another location (copy on
write), so the previous version isn't immediately lost.

Zfs will try to keep the new version of the block close to
the original one, but after several changes on the same
database page, things get pretty messed up and logical
sequential I/O becomes pretty much physically random indeed.

The original blocks will eventually be added to the freelist
and reused, so proximity can be restored, but it will never
be 100% sequential again.
The effect is larger when many snapshots are kept, because
older block versions are not freed, or when the same block
is changed very often and freelist updating has to be
postponed.

That is the trade-off between always consistent and
fast.

 For information I'm using solaris 10/08 with a mirrored
 root pool on two 1Tb Sata harddisk (slow with random
 io). I'm using MySQL 5.0.67 with MyISAM engine. The zfs
 recordsize is 8k as recommended on the zfs guide.

I would suggest enlarging the MyISAM buffers.
The InnoDB engine does copy on write within its data files,
so things might be different there. 
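
A minimal sketch of that (the value is only an example and should be sized
against the RAM you can spare; key_buffer_size caches MyISAM index blocks,
while the data blocks still rely on the OS/ARC cache):

# /etc/my.cnf
[mysqld]
key_buffer_size = 512M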
-- 
  (  Kees Nuyt
  )
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation

2008-06-18 Thread Lance
Any progress on a defragmentation utility?  We appear to be having a severe 
fragmentation problem on an X4500, vanilla S10U4, no additional patches.  500GB 
disks in 4 x 11 disk RAIDZ2 vdevs.  It hit 97% full and fell off a 
cliff...about 50KB/sec on writes.  Deleting files so the zpool is at 92% has 
not helped.  I rebooted the host...no difference.  I lowered the recordsize 
from 128KB to 8KB.  That has boosted performance to 250-500KB/sec on writes 
(still 10x-100x too slow).  Reads have been fine all along.

This is one big zpool and one file system of 16TB.  Approximately 25-30M files, 
some of which change often.  Lots of small, changing files, which are probably 
aggravating the problem.  Due to the Marvell driver bug, I have SATA NCQ turned 
off in /etc/system via set sata:sata_func_enable=0x5.  We plan to go to the 
most recent patch set so I can remove that, but I'm not convinced patching will 
fix the slowness we're seeing.

We'll try to delete more files, but having a defragmentation utility might help 
in this case.  It seems a shame to waste 10-20% of your disk space to maintain 
moderate performance, though I guess that's what we'll have to do.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS fragmentation

2008-06-18 Thread Sanjeev Bagewadi
Lance,

This could be bug #6596237 "Stop looking and start ganging" 
(http://monaco.sfbay/detail.jsf?cr=6596237).
The fix is in progress and Victor Latushkin is working on it.

We have an IDR based on the patches 127127-11/127128-11 which has the 
first cut of the fix.
You could raise an escalation and get these IDRs.
Thanks and regards,
Sanjeev.

Lance wrote:
 Any progress on a defragmentation utility?  We appear to be having a severe 
 fragmentation problem on an X4500, vanilla S10U4, no additional patches.  
 500GB disks in 4 x 11 disk RAIDZ2 vdevs.  It hit 97% full and fell off a 
 cliff...about 50KB/sec on writes.  Deleting files so the zpool is at 92% has 
 not helped.  I rebooted the host...no difference.  I lowered the recordsize 
 from 128KB to 8KB.  That has boosted performance to 250-500KB/sec on writes 
 (still 10x-100x too slow).  Reads have been fine all along.

 This is one big zpool and one file system of 16TB.  Approximately 25-30M 
 files, some of which change often.  Lots of small, changing files, which are 
 probably aggravating the problem.  Due to the Marvell driver bug, I have SATA 
 NCQ turned off in /etc/system via set sata:sata_func_enable=0x5.  We plan 
 to go to the most recent patch set so I can remove that, but I'm not 
 convinced patching will fix the slowness we're seeing.

 We'll try to delete more files, but having a defragmentation utility might 
 help in this case.  It seems a shame to waste 10-20% of your disk space to 
 maintain moderate performance, though I guess that's what we'll have to do.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS fragmentation

2007-02-13 Thread Matty

Howdy,

I have seen a number of folks run into issues due to ZFS file system
fragmentation, and was curious if anyone on team ZFS is working on
this issue? Would it be possible to share with the list any changes
that will be made to help address fragmentation problems?

Thanks,
- Ryan
--
UNIX Administrator
http://prefetch.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss