Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-24 Thread Pramod Batni

> I would greatly appreciate it if you could open the bug, I don't have an
> opensolaris bugzilla account yet and you'd probably put better technical
> details in it anyway :). If you do, could you please let me know the bug#
> so I can refer to it once S10U6 is out and I confirm it has the same
> behavior?
>   

   6763592 creating zfs filesystems gets slower as the number of zfs 
filesystems increase

Pramod



Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-23 Thread Paul B. Henson
On Thu, 23 Oct 2008, Pramod Batni wrote:

> On 10/23/08 08:19, Paul B. Henson wrote:
> >
> > Ok, that leads to another question, why does creating a new ZFS filesystem
> > require determining if any of the existing filesystems in the dataset are
> > mounted :)?
>
> I am not sure. All the checking is done as part of libshare's sa_init,
> which calls into sa_get_zfs_shares().

It does make a big difference whether or not sharenfs is enabled. I haven't
finished my testing, but at 5000 filesystems it takes about 30 seconds to
create a new filesystem and over 30 minutes to reboot if they are shared,
versus only 7 seconds to create a filesystem and about 15 minutes to reboot
if they are not.
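
The difference between the two cases is just the sharenfs property, which
can be set once at the top of the tree and inherited by the children;
'export' below is the pool name from my earlier examples:

zfs set sharenfs=on export       # shared case: ~30s per create, >30 min reboot
zfs set sharenfs=off export      # unshared case: ~7s per create, ~15 min reboot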

> You could do that, or else I can open a bug for you citing the Nevada
> build [b97] you are using.

I would greatly appreciate it if you could open the bug, I don't have an
opensolaris bugzilla account yet and you'd probably put better technical
details in it anyway :). If you do, could you please let me know the bug#
so I can refer to it once S10U6 is out and I confirm it has the same
behavior?

Thanks much...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-23 Thread Pramod Batni



On 10/23/08 08:19, Paul B. Henson wrote:

> On Tue, 21 Oct 2008, Pramod Batni wrote:
>
> > > Why does creating a new ZFS filesystem require enumerating all existing
> > > ones?
> >
> >   This is to determine if any of the filesystems in the dataset are mounted.
>
> Ok, that leads to another question, why does creating a new ZFS filesystem
> require determining if any of the existing filesystems in the dataset are
> mounted :)? I could see checking the parent filesystems, but why the
> siblings?

  I am not sure. All the checking is done as part of libshare's sa_init,
  which calls into sa_get_zfs_shares().

> > In any case a bug can be filed on this.
>
> Should I open a sun support call to request such a bug? I guess I should
> wait until U6 is released, I don't have support for SXCE...

  You could do that, or else I can open a bug for you citing the Nevada
  build [b97] you are using.

Pramod

> Thanks...


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-22 Thread Paul B. Henson
On Tue, 21 Oct 2008, Pramod Batni wrote:

> > Why does creating a new ZFS filesystem require enumerating all existing
> > ones?
>
>   This is to determine if any of the filesystems in the dataset are mounted.

Ok, that leads to another question, why does creating a new ZFS filesystem
require determining if any of the existing filesystems in the dataset are
mounted :)? I could see checking the parent filesystems, but why the
siblings?

> In any case a bug can be filed on this.

Should I open a sun support call to request such a bug? I guess I should
wait until U6 is released, I don't have support for SXCE...

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-21 Thread Pramod Batni



On 10/21/08 04:52, Paul B. Henson wrote:

> On Mon, 20 Oct 2008, Pramod Batni wrote:
>
> > Yes, the implementation of the above ioctl walks the list of mounted
> > filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
> > before the ioctl returns] This in-kernel traversal of the filesystems is
> > taking time.
>
> Hmm, O(n) :(... I guess that is the implementation of getmntent(3C)?

  In fact the problem is that 'zfs create' calls the ioctl way too many
  times. getmntent(3C) issues a single ioctl(MNTIOC_GETMNTENT).

> Why does creating a new ZFS filesystem require enumerating all existing
> ones?

  This is to determine if any of the filesystems in the dataset are mounted.
  The ioctl calls are coming from:

 libc.so.1`ioctl+0x8
 libc.so.1`getmntany+0x200
 libzfs.so.1`is_mounted+0x60
 libshare.so.1`sa_get_zfs_shares+0x118
 libshare.so.1`sa_init+0x330
 libzfs.so.1`zfs_init_libshare+0xac
 libzfs.so.1`zfs_share_proto+0x4c
 zfs`zfs_do_create+0x608
 zfs`main+0x2b0
 zfs`_start+0x108

  zfs_init_libshare is walking through a list of filesystems and determining
  if each of them is mounted. I think there can be a better way to do this
  rather than doing an is_mounted() check on each of the filesystems. In any
  case a bug can be filed on this.
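
  A quick way to see the effect from userland is to count the
  MNTIOC_GETMNTENT ioctls issued by a single create [the dataset name below
  is just an example; truss writes its trace to stderr]:

  # truss -t ioctl zfs create export/user/ioctltest 2>&1 | grep -c MNTIOC_GETMNTENT

  On a pool with only a handful of filesystems the count stays small; it
  should grow quickly as more filesystems are mounted.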

Pramod
  

> > You could set 'zfs set mountpoint=none <pool>' and then create the
> > filesystems under the <pool>. [In my experiments the number of
> > ioctl's went down drastically.] You could then set a mountpoint for the
> > pool and then issue a 'zfs mount -a'.
>
> That would work for an initial mass creation, but we are going to need to
> create and delete fairly large numbers of file systems over time, this
> workaround would not help for that.


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Paul B. Henson
On Mon, 20 Oct 2008, Pramod Batni wrote:

> Yes, the implementation of the above ioctl walks the list of mounted
> filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
> before the ioctl returns] This in-kernel traversal of the filesystems is
> taking time.

Hmm, O(n) :(... I guess that is the implementation of getmntent(3C)?

Why does creating a new ZFS filesystem require enumerating all existing
ones?

> You could set 'zfs set mountpoint=none <pool>' and then create the
> filesystems under the <pool>. [In my experiments the number of
> ioctl's went down drastically.] You could then set a mountpoint for the
> pool and then issue a 'zfs mount -a'.

That would work for an initial mass creation, but we are going to need to
create and delete fairly large numbers of file systems over time, this
workaround would not help for that.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Bob Friesenhahn
On Mon, 20 Oct 2008, Paul B. Henson wrote:
>
> I haven't rebooted it yet; I somewhat naively assumed performance would be
> much better and just started a script to create test file systems for about
> 10,000 people. I'm going to delete the pool and re-create it, then create
> 1000 filesystems at a time and gather some performance statistics.

It would be useful to know if there is a performance difference 
between many filesystems in one directory, or the same number of 
filesystems in multiple directories.  For example, you could have 
upper directories 'a', 'b', 'c', etc, and put the filesystems under 
these upper directories so there are fewer filesystems per directory.
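
Something along these lines is what I have in mind (names are made up); the
upper-level buckets are just small intermediate filesystems that the user
filesystems hang off of:

zfs create export/user/a
zfs create export/user/a/anderson
zfs create export/user/b
zfs create export/user/b/baker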

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Paul B. Henson
On Sun, 19 Oct 2008, Ed Plese wrote:

> The biggest problem I ran into was the boot time, specifically when "zfs
> volinit" is executing.  With ~3500 filesystems on S10U3 the boot time for
> our X4500 was around 40 minutes.  Any idea what your boot time is like
> with that many filesystems on the newer releases?

I haven't rebooted it yet; I somewhat naively assumed performance would be
much better and just started a script to create test file systems for about
10,000 people. I'm going to delete the pool and re-create it, then create
1000 filesystems at a time and gather some performance statistics.
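
Roughly what I have in mind for the measurement, with placeholder names and
counts:

batch=1
while [ $batch -le 10 ]; do
    i=1
    while [ $i -le 1000 ]; do
        zfs create export/user/bench.$batch.$i
        i=`expr $i + 1`
    done
    echo "filesystems so far: `zfs list -H -t filesystem | wc -l`"
    time zfs create export/user/probe.$batch     # timing sample at this size
    zfs destroy export/user/probe.$batch
    batch=`expr $batch + 1`
done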


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Pramod Batni


Paul B. Henson wrote:
> 
>
> At about 5000 filesystems, it starts taking over 30 seconds to
> create/delete additional filesystems.
>
> At 7848, over a minute:
>
> # time zfs create export/user/test
>
> real    1m22.950s
> user    1m12.268s
> sys     0m10.184s
>
> I did a little experiment with truss:
>
> # truss -c zfs create export/user/test2
>
> syscall               seconds    calls  errors
> _exit                    .000        1
> read                     .004      892
> open                     .023       67       2
> close                    .001       80
> brk                      .006      653
> getpid                   .037     8598
> mount                    .006        1
> sysi86                   .000        1
> ioctl                 115.534 31303678    7920
> execve                   .000        1
> fcntl                    .000       18
> openat                   .000        2
> mkdir                    .000        1
> getppriv                 .000        1
> getprivimplinfo          .000        1
> issetugid                .000        4
> sigaction                .000        1
> sigfillset               .000        1
> getcontext               .000        1
> setustack                .000        1
> mmap                     .000       78
> munmap                   .000       28
> xstat                    .000       65      21
> lxstat                   .000        1       1
> getrlimit                .000        1
> memcntl                  .000       16
> sysconfig                .000        5
> lwp_sigmask              .000        2
> lwp_private              .000        1
> llseek                   .084    15819
> door_info                .000       13
> door_call                .103     8391
> schedctl                 .000        1
> resolvepath              .000       19
> getdents64               .000        4
> stat64                   .000        3
> fstat64                  .000       98
> zone_getattr             .000        1
> zone_lookup              .000        2
>                      --------  -------  ------
> sys totals:           115.804 31338551    7944
> usr time:             107.174
> elapsed:              897.670
>
>
> and it seems the majority of time is spent in ioctl calls, specifically:
>
> ioctl(16, MNTIOC_GETMNTENT, 0x08045A60) = 0
>   

Yes, the implementation of the above ioctl walks the list of mounted
filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
before the ioctl returns] This in-kernel traversal of the filesystems is
taking time.
> Interestingly, I tested creating 6 filesystems simultaneously, which took a
> total of only three minutes, rather than 9 minutes had they been created
> sequentially. I'm not sure how parallelizable I can make an identity
> management provisioning system though.
>
> Was I mistaken about the increased scalability that was going to be
> available? Is there anything I could configure differently to improve this
> performance? We are going to need about 30,000 filesystems to cover our
> faculty, staff, students, and group project directories. We do have 5
> x4500's which will be allocated to the task, so about 6000 filesystems per.
> Depending on what time of the quarter it is, our identity management system
> can create hundreds up to thousands of accounts, and when we purge accounts
> quarterly we typically delete 10,000 or so. Currently those jobs only take
> 2-6 hours, with this level of performance from ZFS they would take days if
> not over a week :(.
>
> Thanks for any suggestions. What is the internal recommendation on maximum
> number of file systems per server?

You could set 'zfs set mountpoint=none <pool>' and then create the
filesystems under the <pool>. [In my experiments the number of ioctl's went
down drastically.] You could then set a mountpoint for the pool and then
issue a 'zfs mount -a'.
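
In concrete terms the sequence would look something like this [the pool and
dataset names are placeholders based on your earlier examples]:

zfs set mountpoint=none export        # children inherit 'none', nothing gets mounted
zfs create export/user/test0001       # bulk-create the filesystems here
zfs create export/user/test0002
zfs set mountpoint=/export export     # give the pool back its real mountpoint
zfs mount -a                          # then mount everything in one pass
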
Pramod


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-19 Thread Ed Plese
On Sun, Oct 19, 2008 at 4:08 PM, Paul B. Henson <[EMAIL PROTECTED]> wrote:
> At about 5000 filesystems, it starts taking over 30 seconds to
> create/delete additional filesystems.

The biggest problem I ran into was the boot time, specifically when
"zfs volinit" is executing.  With ~3500 filesystems on S10U3 the boot
time for our X4500 was around 40 minutes.  Any idea what your boot
time is like with that many filesystems on the newer releases?


Ed Plese


[zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-19 Thread Paul B. Henson

I originally started testing a prototype for an enterprise file service
implementation on our campus using S10U4. Scalability in terms of file
system count was pretty bad, anything over a couple of thousand and
operations started taking way too long.

I had thought there were a number of improvements/enhancements that had
been made since then to improve performance and scalability when a large
number of file systems exist. I've been testing with SXCE (b97) which
presumably has all of the enhancements (and potentially then some) that
will be available in U6, and I'm still seeing very poor scalability once
more than a few thousand filesystems are created.

I have a test install on an x4500 with two TB disks as a ZFS root pool, 44
TB disks configured as mirror pairs belonging to one zpool, and the last
two TB disks as hot spares.

At about 5000 filesystems, it starts taking over 30 seconds to
create/delete additional filesystems.

At 7848, over a minute:

# time zfs create export/user/test

real    1m22.950s
user    1m12.268s
sys     0m10.184s

I did a little experiment with truss:

# truss -c zfs create export/user/test2

syscall               seconds    calls  errors
_exit                    .000        1
read                     .004      892
open                     .023       67       2
close                    .001       80
brk                      .006      653
getpid                   .037     8598
mount                    .006        1
sysi86                   .000        1
ioctl                 115.534 31303678    7920
execve                   .000        1
fcntl                    .000       18
openat                   .000        2
mkdir                    .000        1
getppriv                 .000        1
getprivimplinfo          .000        1
issetugid                .000        4
sigaction                .000        1
sigfillset               .000        1
getcontext               .000        1
setustack                .000        1
mmap                     .000       78
munmap                   .000       28
xstat                    .000       65      21
lxstat                   .000        1       1
getrlimit                .000        1
memcntl                  .000       16
sysconfig                .000        5
lwp_sigmask              .000        2
lwp_private              .000        1
llseek                   .084    15819
door_info                .000       13
door_call                .103     8391
schedctl                 .000        1
resolvepath              .000       19
getdents64               .000        4
stat64                   .000        3
fstat64                  .000       98
zone_getattr             .000        1
zone_lookup              .000        2
                     --------  -------  ------
sys totals:           115.804 31338551    7944
usr time:             107.174
elapsed:              897.670


and it seems the majority of time is spent in ioctl calls, specifically:

ioctl(16, MNTIOC_GETMNTENT, 0x08045A60) = 0

Interestingly, I tested creating 6 filesystems simultaneously, which took a
total of only three minutes, rather than 9 minutes had they been created
sequentially. I'm not sure how parallelizable I can make an identity
management provisioning system though.
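
For what it's worth, the simultaneous creation can be reproduced with nothing
fancier than backgrounded creates, something like this (dataset names are
made up):

for i in 1 2 3 4 5 6; do
    zfs create export/user/partest$i &     # run the creates in parallel
done
wait                                       # all six complete in ~3 minutes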

Was I mistaken about the increased scalability that was going to be
available? Is there anything I could configure differently to improve this
performance? We are going to need about 30,000 filesystems to cover our
faculty, staff, students, and group project directories. We do have 5
x4500's which will be allocated to the task, so about 6000 filesystems per.
Depending on what time of the quarter it is, our identity management system
can create hundreds up to thousands of accounts, and when we purge accounts
quarterly we typically delete 10,000 or so. Currently those jobs only take
2-6 hours, with this level of performance from ZFS they would take days if
not over a week :(.

Thanks for any suggestions. What is the internal recommendation on maximum
number of file systems per server?


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768