Re: hundreds of mount --bind mountpoints?

2001-04-26 Thread Andreas Dilger

Al writes:
> It's not a fscking rocket science - encapsulate accesses to ->u.foofs_i
> into inlined function, find ->read_inode, find places that do get_empty_inode

OK, I was doing this for the ext3 port I'm working on for 2.4, and ran into
a snag.  In the ext3_inode_info, there is a list_head.  However, if this is
moved into a separate slab struct, it is now impossible to locate the inode
from the offset in the slab struct.  When I was checking the size of each
inode_info struct, I noticed several others that had list_heads in them.
One solution is that we store list_heads in the inode proper, after generic_ip.

Cheers, Andreas
-- 
Andreas Dilger   TurboLabs filesystem development
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/



Re: hundreds of mount --bind mountpoints?

2001-04-25 Thread Alexander Viro



On Wed, 25 Apr 2001, Andreas Dilger wrote:

> Al writes:
> > It's not a fscking rocket science - encapsulate accesses to ->u.foofs_i
> > into inlined function, find ->read_inode, find places that do get_empty_inode
> 
> OK, I was doing this for the ext3 port I'm working on for 2.4, and ran into
> a snag.  In the ext3_inode_info, there is a list_head.  However, if this is
> moved into a separate slab struct, it is now impossible to locate the inode
> from the offset in the slab struct.  When I was checking the size of each
> inode_info struct, I noticed several others that had list_heads in them.
> One solution is that we store list_heads in the inode proper, after generic_ip.

If you need to go from ext3_inode_info to inode - put the pointer into the
thing and be done with that. No need to bump ->i_count - fs-private
part dies before inode itself.
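
For illustration, a rough sketch of that back-pointer (not the actual
ext3 code; the field and helper names here are made up):

struct ext3_inode_info {
	struct list_head i_orphan;	/* now lives in the slab object */
	struct inode *vfs_inode;	/* back-pointer, filled in where the
					 * info struct gets allocated in
					 * ->read_inode() */
	/* ... rest of the ext3-private fields ... */
};

static inline struct inode *orphan_entry_to_inode(struct list_head *p)
{
	return list_entry(p, struct ext3_inode_info, i_orphan)->vfs_inode;
}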




Re: hundreds of mount --bind mountpoints?

2001-04-25 Thread Christoph Rohland

Hi Andreas,

On Tue, 24 Apr 2001, Andreas Dilger wrote:
> On the other hand, sockets and shmem are both relatively large...

shmem is only large because the union is large. I introduced the
direct swap array of size SHMEM_NR_DIRECT simply to take advantage of
the union. We can decrease SHMEM_NR_DIRECT very easily. I am thinking
about 1 or 5 which would mean that we allocate an indirect block for
files bigger than 4k or 20k respectively.

The shmem_inode_info would then be 8 or 12 words.
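
Roughly what that looks like (a sketch, not the exact tmpfs code; with
4k pages, SHMEM_NR_DIRECT direct entries cover SHMEM_NR_DIRECT * 4k
bytes before an indirect block has to be allocated):

#define SHMEM_NR_DIRECT 5	/* 5 * 4k = 20k covered without indirection */

struct shmem_inode_info {
	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* swap entries for the
						    * first few pages */
	swp_entry_t   **i_indirect;	/* only allocated for bigger files */
	unsigned long	max_index;
	/* ... locking etc. ... */
};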

Greetings
Christoph





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Trond Myklebust

> " " == Alexander Viro <[EMAIL PROTECTED]> writes:

 > _Ouch_. So what are you going to do if another iget4() comes
 > between the moment when you hash the inode and set these
 > fields? You are filling them only after you drop inode_lock, so
 > AFAICS the current code has the same problem.

The entire call to iget4() is protected by the BKL in all relevant
instances. As long as we don't sleep between find_inode() and
nfs_fill_inode(), we're safe.

In fact the BKL protection is needed also for another reason: we don't
actually initialize the inode in the I_LOCK-protected read_inode() but
instead rely on the caller of iget4 to do it for us. The reason is
that we would need to pass the struct nfs_fattr to read_inode(),
and this wasn't possible until the ReiserFS people introduced
read_inode2().

Cheers,
   Trond



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Trond Myklebust wrote:

> Hi Al,
> 
>   I believe your patch introduces a race for the NFS case. The problem
> lies in the fact that nfs_find_actor() needs to read several of the
> fields from nfs_inode_info. By adding an allocation after the inode
> has been hashed, you are creating a window during which the inode can
> be found by find_inode(), but during which you aren't even guaranteed
> that the nfs_inode_info exists let alone that it's been initialized
> by nfs_fill_inode().

_Ouch_. So what are you going to do if another iget4() comes between
the moment when you hash the inode and set these fields? You are
filling them only after you drop inode_lock, so AFAICS the current
code has the same problem.




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Ingo Oeser

On Tue, Apr 24, 2001 at 02:49:23PM -0400, Alexander Viro wrote:
> On Tue, 24 Apr 2001, Andreas Dilger wrote:
> > One thing to watch out for is that the current code zeros the u. struct
> > for us (as you pointed out to me previously), but allocating from the
> > slab cache will not...  This could be an interesting source of bugs for
> > some filesystems that assume zero'd inode_info structs.
> True, but easy to catch.

Yep. Just request SLAB_ZERO (still to be implemented) instead of
SLAB_POISON, or provide a constructor.

A nice set of macros for this would make it quite easy. The ctor
is the way to handle it. Maybe we could even put all the fs-specific
initializers into it (e.g. magics, zeroes).
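
Something along these lines, for a hypothetical foofs (untested sketch;
note that a slab ctor only runs when the cache grows, not on every
allocation, so objects have to be put back into this state before they
are freed):

static kmem_cache_t *foofs_cachep;

static void foofs_inode_ctor(void *obj, kmem_cache_t *cachep,
			     unsigned long flags)
{
	struct foofs_inode_info *ei = obj;

	memset(ei, 0, sizeof(*ei));	/* what a SLAB_ZERO flag would buy us */
	ei->i_magic = FOOFS_INODE_MAGIC;	/* fs-specific initializer */
	INIT_LIST_HEAD(&ei->i_dirty_list);
}

static int __init foofs_init_inodecache(void)
{
	foofs_cachep = kmem_cache_create("foofs_inodes",
					 sizeof(struct foofs_inode_info),
					 0, SLAB_HWCACHE_ALIGN,
					 foofs_inode_ctor, NULL);
	return foofs_cachep ? 0 : -ENOMEM;
}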

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag 
  been there and had much fun   



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Trond Myklebust

> " " == Alexander Viro <[EMAIL PROTECTED]> writes:

 > On Mon, 23 Apr 2001, Jan Harkes wrote:

>> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

>> > BTW: Is it still less than one page? Then it doesn't make me
>> >nervous. Why? Guess what granularity we allocate at, if we
>> >just store pointers instead of the inode.u. Or do you like
>> >every FS creating his own slab cache?

 > Oh, for crying out loud. All it takes is half an hour per
 > filesystem.  Here - completely untested patch that does it for
 > NFS. Took about that long. Absolutely straightforward, very
 > easy to verify correctness.

 > Some stuff may need tweaking, but not much (e.g. some functions
 > should take nfs_inode_info instead of inodes, etc.). From the
 > look of flushd cache it seems that we would be better off with
 > cyclic lists instead of single-linked ones for the hash, but I
 > didn't look deep enough.

 > So consider the patch below as proof-of-concept. Enjoy:

Hi Al,

  I believe your patch introduces a race for the NFS case. The problem
lies in the fact that nfs_find_actor() needs to read several of the
fields from nfs_inode_info. By adding an allocation after the inode
has been hashed, you are creating a window during which the inode can
be found by find_inode(), but during which you aren't even guaranteed
that the nfs_inode_info exists let alone that it's been initialized
by nfs_fill_inode().

One solution could be to have find_inode sleep on encountering a
locked inode. It would have to be something along the lines of

static struct inode * find_inode(struct super_block * sb, unsigned long ino,
				 struct list_head *head,
				 find_inode_t find_actor, void *opaque)
{
	struct list_head *tmp;
	struct inode * inode;

	tmp = head;
	for (;;) {
		tmp = tmp->next;
		inode = NULL;
		if (tmp == head)
			break;
		inode = list_entry(tmp, struct inode, i_hash);
		if (inode->i_ino != ino)
			continue;
		if (inode->i_sb != sb)
			continue;
		if (find_actor) {
			if (inode->i_state & I_LOCK) {
				spin_unlock(&inode_lock);
				__wait_on_inode(inode);
				spin_lock(&inode_lock);
				tmp = head;
				continue;
			}
			if (!find_actor(inode, ino, opaque))
				continue;
		}
		break;
	}
	return inode;
}


Cheers,
   Trond



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Erik Mouw

On Tue, Apr 24, 2001 at 12:47:38PM -0600, Andreas Dilger wrote:
> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert.  The iso_inode_info is only 4*__u32, as is
> proc_inode_info.  Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.

Well, I know a little bit about procfs because I'm currently
documenting it, so that's why I picked it first. After I got the idea,
isofs was quite easy.

In retrospect it would have been more effective to pick a filesystem
with a larger *_inode_info field, but then again: Al is right. Struct
inode is cluttered with *_inode_info fields, while we use anonymous
data entries in other parts of the kernel (like the data pointer in
struct proc_dir_entry, or the priv pointer in struct net_device).

There is another advantage: suppose you're hacking on a filesystem and
change its *_fs_i.h header. With Al's proposal you only have to
recompile the filesystem you're hacking on, while you have to recompile
the complete kernel in the current situation.


Erik

-- 
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031,  2600 GA Delft, The Netherlands
Phone: +31-15-2783635  Fax: +31-15-2781843  Email: [EMAIL PROTECTED]
WWW: http://www-ict.its.tudelft.nl/~erik/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Al writes:
> > Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> > approximate for 32-bit arches - I was just counting by hand and not
> > strictly checking alignment), then almost all other filesystems are below
> > 25 * __u32 (i.e. half of the previous size).
> 
> Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
> I'm thinking about.

But then again, you are saving 50-25 words for every non-NFS inode, and I
think _most_ systems will have more local inodes than NFS inodes.  Even
NFS servers will have local inodes, only clients (AFAIK) use nfs_inode_info.

> > Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> > an absolute minimum size (16 * __u32 or so), but then people who want
> > reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> > For ext2 (the next largest and most common fs), we could make it part of
> > the union if it is compiled in, and on a slab cache if it is a module?
> 
> NO. Sorry about shouting, but that's the way to madness. I can understand
> code depending on SMP vs. UP and similar beasts, but presence of specific
> filesystems ...

But then again, if the size of nfs_inode_info changes, it is the same
problem... sizeof(struct inode) may have changed (depends if slab has
some padding between inodes or not).  If we stick to a minimum size
(16 words or maybe even 8), then it will never change anymore, and we
do not have overhead for small inode_info structs.

> > Should uncommon-but-widely-used things like socket and shmem have their
> > own slab cache, or should they just allocate from the generic size-32 slab?
> 
> That's pretty interesting - especially for sockets. I wonder whether
> we would get problems with separate allocation of these - we don't
> go from inode to socket all that often, but...

I never thought of that.  I guess the socket code does not know which
fs the inode_info was allocated from, so it cannot free it from the slab
(even if it had access to these slabs, which it does not).  In that case,
each fs would have struct socket as the minimum size allocatable, which
is unfortunately one of the largest inode_info sizes.  It is smaller
than ext2, but...

Any ideas?  Do we ever get back into fs-specific clear_inode() from
a socket?  In that case, the socket would just hold a pointer to the
fs-specific inode_info inside its own struct socket until the inode
is dropped.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert.  The iso_inode_info is only 4*__u32, as is
> proc_inode_info.  Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.
> 
> On the other hand, sockets and shmem are both relatively large...
> Watch out that the *_inode_info structs have all of the fields
> initialized, because the union field is zeroed for us, but slab is not.

Frankly, I'd rather start with encapsulation part. It's easy to
verify, it can go in right now and it makes separate allocation
part uncluttered. Besides, it simply makes code cleaner, so it
makes sense even if you don't want to go for separate allocation for
that particular fs.




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> One thing to watch out for is that the current code zeros the u. struct
> for us (as you pointed out to me previously), but allocating from the
> slab cache will not...  This could be an interesting source of bugs for
> some filesystems that assume zero'd inode_info structs.

True, but easy to catch.
 
> Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> approximate for 32-bit arches - I was just counting by hand and not
> strictly checking alignment), then almost all other filesystems are below
> 25 * __u32 (i.e. half of the previous size).

Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
I'm thinking about.
 
> Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> an absolute minimum size (16 * __u32 or so), but then people who want
> reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> For ext2 (the next largest and most common fs), we could make it part of
> the union if it is compiled in, and on a slab cache if it is a module?

NO. Sorry about shouting, but that's the way to madness. I can understand
code depending on SMP vs. UP and similar beasts, but presence of specific
filesystems ...

> Should uncommon-but-widely-used things like socket and shmem have their
> own slab cache, or should they just allocate from the generic size-32 slab?

That's pretty interesting - especially for sockets. I wonder whether
we would get problems with separate allocation of these - we don't
go from inode to socket all that often, but...




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Erik Mouw writes:
> Al is right, it is no rocket science. Here is a patch against
> 2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
> not familiar with the fs code. It compiles, and the procfs code even
> runs (sorry, no CDROM player available on my embedded StrongARM
> system), though it is possible that there are some bugs in it.

While I applaud your initiative, you made an unfortunate choice of
filesystems to convert.  The iso_inode_info is only 4*__u32, as is
proc_inode_info.  Given that we still need to keep a pointer to the
external info structs, and the overhead of the slab cache itself
(both CPU usage and memory overhead, however small), I don't think
it is worthwhile to have isofs and procfs in separate slabs.

On the other hand, sockets and shmem are both relatively large...
Watch out that the *_inode_info structs have all of the fields
initialized, because the union field is zeroed for us, but slab is not.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Al writes:
> Encapsulation part is definitely worth doing - it cleans the code up
> and doesn't change the result of compile. Adding allocation/freeing/
> cache initialization/cache removal and changing FOOFS_I() definition -
> well, it's probably worth to keep such patches around, but whether
> to switch any individual filesystem during 2.4 is a policy decision.
> Up to maintainer, indeed. Notice that these patches (separate allocation
> per se) are going to be within 3-4Kb per filesystem _and_ completely
> straightforward.

One thing to watch out for is that the current code zeros the u. struct
for us (as you pointed out to me previously), but allocating from the
slab cache will not...  This could be an interesting source of bugs for
some filesystems that assume zero'd inode_info structs.

Fortunately, it doesn't appear that my patch to clean out all of the
"duplicate" zero initializers in the fs-specific code was accepted...

> What I would like to avoid is scenario like:
> 
> Maintainers of filesystems with large private inodes: Why would we separate
> them? We would only waste memory, since the other filesystems stay in ->u
> and keep it large.

Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
approximate for 32-bit arches - I was just counting by hand and not
strictly checking alignment), then almost all other filesystems are below
25 * __u32 (i.e. half of the previous size).

For large-private-inode filesystems, we are wasting memory in EVERY inode
in the slab cache, not just ones in use with the large private inode.  If
it were the most common filesystem (ext2, maybe reiser, msdos next) then
it wouldn't make much difference.

At some point the union becomes small enough that separate slab
allocations are no longer a win from a memory usage standpoint.

The remaining info structs are (approx. for 32-bit arch) (size in __u32):

ext2    27
affs    26
ufs     25
socket  24
shmem   22
coda    20
qnx4    18

minix   16
umsdos  15
hpfs    15
efs     14
sysv    13
reiser  12
udf     12
ntfs    11
ncp     10
msdos    9
adfs     7
smb      6
usbdev   5
proc     4
iso      4
bfs      3
romfs    2


> Maintainers of the rest of filesystems: Since there's no patches that would
> take large stuff out of ->u, why would we bother?

Maybe the size of the union can depend on CONFIG_*_FS?  There should be
an absolute minimum size (16 * __u32 or so), but then people who want
reiserfs as their primary fs do not need to pay the memory penalty of ext2.
For ext2 (the next largest and most common fs), we could make it part of
the union if it is compiled in, and on a slab cache if it is a module?

Should uncommon-but-widely-used things like socket and shmem have their
own slab cache, or should they just allocate from the generic size-32 slab?
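
For comparison, the two options would look something like this (sketch
only; "foofs" and foofs_cachep are stand-ins).  A private cache packs
exact-sized objects and gets its own line in /proc/slabinfo, while
kmalloc() just rounds the request up to the next generic size-N cache:

static struct foofs_inode_info *alloc_from_private_cache(void)
{
	return kmem_cache_alloc(foofs_cachep, SLAB_KERNEL);
}

static struct foofs_inode_info *alloc_from_generic_slab(void)
{
	return kmalloc(sizeof(struct foofs_inode_info), GFP_KERNEL);
}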

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David L. Parsley

Christoph Rohland wrote:
> 
> OK I will do that for tmpfs soon. And I will do the symlink inlining
> with that patch.

Wow, this thread really exploded, eh?  But thanks, Christoph, I look
forward to seeing your patch.  4k symlinks really suck for embedders
who never swap out pages. ;-)

regards,
David

-- 
David L. Parsley
Network Administrator
Roanoke College



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of
> them it will be easier to do without PITA in process.

OK I will do that for tmpfs soon. And I will do the symlink inlining
with that patch.

Greetings
Christoph





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Erik Mouw

On Tue, Apr 24, 2001 at 06:01:12AM -0400, Alexander Viro wrote:
> For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
> builds (it had most of the accesses to ->u.hfs_i already encapsulated).

Al is right, it is no rocket science. Here is a patch against
2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
not familiar with the fs code. It compiles, and the procfs code even
runs (sorry, no CDROM player available on my embedded StrongARM
system), though it is possible that there are some bugs in it.


Erik

Index: fs/isofs/inode.c
===
RCS file: /home/erik/cvsroot/elinux/fs/isofs/inode.c,v
retrieving revision 1.1.1.24
diff -d -u -r1.1.1.24 inode.c
--- fs/isofs/inode.c2001/04/21 21:24:00 1.1.1.24
+++ fs/isofs/inode.c2001/04/24 13:13:29
@@ -15,6 +15,7 @@
 #include <linux/stat.h>
 #include <linux/sched.h>
 #include <linux/iso_fs.h>
+#include <linux/iso_fs_i.h>
 #include <linux/kernel.h>
 #include <linux/major.h>
 #include <linux/mm.h>
@@ -44,6 +45,8 @@
 static int check_bread = 0;
 #endif
 
+static kmem_cache_t *isofs_cachep;
+
 static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
 static int isofs_hash(struct dentry *parent, struct qstr *qstr);
 static int isofs_dentry_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b);
@@ -74,10 +77,12 @@
 }
 
 static void isofs_read_inode(struct inode *);
+static void isofs_clear_inode(struct inode *);
 static int isofs_statfs (struct super_block *, struct statfs *);
 
 static struct super_operations isofs_sops = {
read_inode: isofs_read_inode,
+   clear_inode:isofs_clear_inode,
put_super:  isofs_put_super,
statfs: isofs_statfs,
 };
@@ -908,9 +913,9 @@
goto abort_beyond_end;
 
offset= 0;
-   firstext  = inode->u.isofs_i.i_first_extent;
-   sect_size = inode->u.isofs_i.i_section_size >> ISOFS_BUFFER_BITS(inode);
-   nextino   = inode->u.isofs_i.i_next_section_ino;
+   firstext  = ISOFS_I(inode)->i_first_extent;
+   sect_size = ISOFS_I(inode)->i_section_size >> ISOFS_BUFFER_BITS(inode);
+   nextino   = ISOFS_I(inode)->i_next_section_ino;
 
i = 0;
if (nextino) {
@@ -923,9 +928,9 @@
ninode = iget(inode->i_sb, nextino);
if (!ninode)
goto abort;
-   firstext  = ninode->u.isofs_i.i_first_extent;
-   sect_size = ninode->u.isofs_i.i_section_size;
-   nextino   = ninode->u.isofs_i.i_next_section_ino;
+   firstext  = ISOFS_I(ninode)->i_first_extent;
+   sect_size = ISOFS_I(ninode)->i_section_size;
+   nextino   = ISOFS_I(ninode)->i_next_section_ino;
iput(ninode);
 
if (++i > 100)
@@ -1025,7 +1030,7 @@
struct iso_directory_record * tmpde = NULL;
 
inode->i_size = 0;
-   inode->u.isofs_i.i_next_section_ino = 0;
+   ISOFS_I(inode)->i_next_section_ino = 0;
 
block = f_pos >> ISOFS_BUFFER_BITS(inode);
offset = f_pos & (bufsize-1);
@@ -1077,7 +1082,7 @@
 
inode->i_size += isonum_733(de->size);
if (i == 1)
-   inode->u.isofs_i.i_next_section_ino = f_pos;
+   ISOFS_I(inode)->i_next_section_ino = f_pos;
 
more_entries = de->flags[-high_sierra] & 0x80;
 
@@ -1174,9 +1179,10 @@
inode->i_uid = inode->i_sb->u.isofs_sb.s_uid;
inode->i_gid = inode->i_sb->u.isofs_sb.s_gid;
inode->i_blocks = inode->i_blksize = 0;
+   inode->u.generic_ip = kmem_cache_alloc(isofs_cachep, SLAB_KERNEL);
 
 
-   inode->u.isofs_i.i_section_size = isonum_733 (de->size);
+   ISOFS_I(inode)->i_section_size = isonum_733 (de->size);
if(de->flags[-high_sierra] & 0x80) {
if(isofs_read_level3_size(inode)) goto fail;
} else {
@@ -1230,7 +1236,7 @@
inode->i_mtime = inode->i_atime = inode->i_ctime =
iso_date(de->date, high_sierra);
 
-   inode->u.isofs_i.i_first_extent = (isonum_733 (de->extent) +
+   ISOFS_I(inode)->i_first_extent = (isonum_733 (de->extent) +
   isonum_711 (de->ext_attr_length));
 
/*
@@ -1298,6 +1304,16 @@
goto out;
 }
 
+
+static void isofs_clear_inode(struct inode *inode)
+{
+   struct iso_inode_info *isofsi = ISOFS_I(inode);
+   inode->u.generic_ip = NULL;
+   if(isofsi)
+   kmem_cache_free(isofs_cachep, isofsi);
+}
+
+
 #ifdef LEAK_CHECK
 #undef malloc
 #undef free_s
@@ -1332,12 +1348,21 @@
 
 static int __init init_iso9660_fs(void)
 {
+   isofs_cachep = kmem_cache_create("isofs_inodes",
+sizeof(struct iso_inode_info),
+0, SLAB_HWCACHE_ALIGN,
+NULL, NULL);
+   if(isofs_cachep == NULL)
+  

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
>  What I would like to avoid is scenario like
> Maintainers of filesystems with large private inodes: Why would we
> separate them? We would only waste memory, since the other filesystems
> stay in ->u and keep it large.

> Maintainers of the rest of filesystems: Since there's no patches that
> would take large stuff out of ->u, why would we bother?

> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of them
> it will be easier to do without PITA in process.

JFFS2 has the encapsulation part already. I'll make it do separate 
allocation in 2.5, when it's actually a gain.

--
dwmw2





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Christoph Rohland wrote:

> Hi Al,
> 
> On Tue, 24 Apr 2001, Alexander Viro wrote:
> >> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
> >> be very surprised.
> > 
> > <tone polite> What's stopping you? </tone>
> > You _are_ JFFS maintainer, aren't you?
> 
> So is this the start to change all filesystems in 2.4? I am not sure
> we should do that. 

Encapsulation part is definitely worth doing - it cleans the code up
and doesn't change the result of compile. Adding allocation/freeing/
> cache initialization/cache removal and changing FOOFS_I() definition -
well, it's probably worth to keep such patches around, but whether
to switch any individual filesystem during 2.4 is a policy decision.
Up to maintainer, indeed. Notice that these patches (separate allocation
per se) are going to be within 3-4Kb per filesystem _and_ completely
straightforward.

What I would like to avoid is scenario like

Maintainers of filesystems with large private inodes: Why would we separate
them? We would only waste memory, since the other filesystems stay in ->u
and keep it large.

Maintainers of the rest of filesystems: Since there's no patches that would
take large stuff out of ->u, why would we bother?

So yes, IMO having such patches available _is_ a good thing. And in 2.5
we definitely want them in the tree. If encapsulation part gets there
during 2.4 and separate allocation is available for all of them it will
be easier to do without PITA in process.
Al





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
>> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
>> be very surprised.
> 
> <tone polite> What's stopping you? </tone>
> You _are_ JFFS maintainer, aren't you?

So is this the start to change all filesystems in 2.4? I am not sure
we should do that. 

Greetings
Christoph





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
>  <tone polite> What's stopping you? </tone> You _are_ JFFS maintainer,
> aren't you?

It already uses...

#define JFFS2_INODE_INFO(i) (&i->u.jffs2_i)

It's trivial to switch over when the size of the inode union goes below the 
size of struct jffs2_inode_info. Until then, I'd just be wasting space.

--
dwmw2





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, David Woodhouse wrote:

> 
> [EMAIL PROTECTED] said:
> >  Oh, for crying out loud. All it takes is half an hour per filesystem.
> 
> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
> surprised.

<tone polite> What's stopping you? </tone>
You _are_ JFFS maintainer, aren't you?




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Mon, 23 Apr 2001, Andreas Dilger wrote:

> Al posted a patch to the NFS code which removes nfs_inode_info from the
> inode union.  Since it is (AFAIK) the largest member of the union, we
> have just saved 24 bytes per inode (hfs_inode_info is also rather large).
> If we removed hfs_inode_info as well, we would save 108 bytes per inode,
> about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
builds (it had most of the accesses to ->u.hfs_i already encapsulated).

What I really don't understand is why the hell people keep coming up
with the grand and convoluted plans of removing the inode bloat and
no bleedin' one of them actually cared to sit down and do the simplest variant
possible.

I can certainly go through the rest of filesystems and even do a testing
for most of them, but WTF? Could the rest of you please join the show?
It's not a fscking rocket science - encapsulate accesses to ->u.foofs_i
into inlined function, find ->read_inode, find places that do get_empty_inode()
or new_inode(), add allocation there, add freeing to ->clear_inode()
(defining one if needed), change that inlined function so that it would
return ->u.generic_ip and you are done. Clean the results up and test
them. Furrfu...
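
Spelled out for a made-up foofs, the whole recipe comes down to roughly
this (sketch only, untested):

/* 1: every access to the private part goes through this.  Before the
 *    switch it returns &inode->u.foofs_i, afterwards: */
static inline struct foofs_inode_info *FOOFS_I(struct inode *inode)
{
	return inode->u.generic_ip;
}

/* 2: allocate in ->read_inode() (and wherever get_empty_inode() or
 *    new_inode() creates a foofs inode) */
static void foofs_read_inode(struct inode *inode)
{
	inode->u.generic_ip = kmem_cache_alloc(foofs_cachep, SLAB_KERNEL);
	/* ... fill FOOFS_I(inode) from the on-disk inode ... */
}

/* 3: free in ->clear_inode() */
static void foofs_clear_inode(struct inode *inode)
{
	struct foofs_inode_info *ei = FOOFS_I(inode);

	inode->u.generic_ip = NULL;
	if (ei)
		kmem_cache_free(foofs_cachep, ei);
}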

It's not like it was a global change that affected the whole kernel -
at every step changes are local to one filesystem and changes for
different filesystems are independent from each other. If at some point
in 2.5 .generic_ip is the only member of union - fine, we just do
%s/u.generic_ip/fs_inode/g
or something like that. Moreover, if maintainer of filesystem foo is
OK with change it _can_ be done in 2.4 - it doesn't affect anything
outside of foofs.

Guys, doing all these patches is ~20 man-hours. And that's bloody generous
estimate. Looking through the results and doing necessary tweaking
(as in "hmm... we keep passing pointer to inode through the long chain
of functions and all of them need only fs-specific part", etc.) - about
the same. Verifying that thing wasn't fucked up - maybe an hour or two of
audit per filesystem (split the patch into encapsulation part - trivial
to verify - and the rest - pretty small). Grrr...

Oh, well... Initial HFS patch follows:

diff -urN S4-pre6/fs/hfs/inode.c S4-pre6-hfs/fs/hfs/inode.c
--- S4-pre6/fs/hfs/inode.c  Fri Feb 16 22:55:36 2001
+++ S4-pre6-hfs/fs/hfs/inode.c  Tue Apr 24 05:10:21 2001
@@ -231,7 +231,7 @@
 static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, 
unsigned to)
 {
return cont_prepare_write(page,from,to,hfs_get_block,
-   &page->mapping->host->u.hfs_i.mmu_private);
+   &HFS_I(page->mapping->host)->mmu_private);
 }
 static int hfs_bmap(struct address_space *mapping, long block)
 {
@@ -309,7 +309,7 @@
return NULL;
}
 
-   if (inode->i_dev != sb->s_dev) {
+   if (inode->i_dev != sb->s_dev || !HFS_I(inode)) {
iput(inode); /* automatically does an hfs_cat_put */
inode = NULL;
} else if (!inode->i_mode || (*sys_entry == NULL)) {
@@ -373,7 +373,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
@@ -433,7 +433,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
@@ -479,7 +479,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
diff -urN S4-pre6/fs/hfs/super.c S4-pre6-hfs/fs/hfs/super.c
--- S4-pre6/fs/hfs/super.c  Sat Apr 21 14:35:20 2001
+++ S4-pre6-hfs/fs/hfs/super.c  Tue Apr 24 05:26:04 2001
@@ -35,6 +35,7 @@
 /* Forward declarations */
 
 static void hfs_read_inode(struct inode *);
+static void hfs_clear_inode(struct inode *);
 static void hfs_put_super(struct super_block *);
 static int hfs_statfs(struct super_block *, struct statfs *);
 static void hfs_write_super(struct super_block *);
@@ -43,6 +44,7 @@
 
 static struct super_operations hfs_super_operations = { 
read_inode: hfs_read_inode,
+   clear_inode:hfs_clear_inode,
put_inode:  hfs_put_inode,
put_super:  

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
>  Oh, for crying out loud. All it takes is half an hour per filesystem.

Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
surprised.

--
dwmw2





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Alexander,

On Mon, 23 Apr 2001, Alexander Viro wrote:
>> I like it. ext2fs does the same, so there should be no VFS
>> hassles involved. Al?
> 
> We should get ext2 and friends to move the sucker _out_ of struct
> inode.  As it is, sizeof(struct inode) is way too large. This is 2.5
> stuff, but it really has to be done. More filesystems adding stuff
> into the union is a Bad Thing(tm). If you want to allocate space -
> allocate it yourself; ->clear_inode() is the right place for freeing
> it.

Yes, I agree that the union is way too large and I did not plan to
extend it but simply use the size it has.

if (strlen(path) < sizeof(inode->u))
	inline the symlink;
else
	put it into the page cache;
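
(In real code that decision would look roughly like the following --
just a sketch, not the actual tmpfs patch; the function and
inode_operations names are made up:)

static int shmem_store_symlink(struct inode *inode, const char *path)
{
	int len = strlen(path) + 1;

	if (len <= sizeof(inode->u)) {
		/* short target: reuse the space the union pays for anyway */
		memcpy(&inode->u, path, len);
		inode->i_op = &shmem_symlink_inline_inode_operations;
	} else {
		/* long target: put it into page 0 of inode->i_mapping */
		inode->i_op = &shmem_symlink_inode_operations;
		/* ... grab a page, copy the path in, mark it uptodate ... */
	}
	return 0;
}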

So if somebody really cleans up the private inode structures it will
not trigger that often any more and we perhaps have to rethink the
idea.

But also if we use struct shmem_inode_info which is 92 bytes right now
we would inline all symlinks on my machine.

If we reduced its size to 32 (which could be easily done) we would
still inline 6642 out of 9317 symlinks on my machine. That's not bad.

Greetings
Christoph





Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Alexander,

On Mon, 23 Apr 2001, Alexander Viro wrote:
 I like it. ext2fs does the same, so there should be no VFS
 hassles involved. Al?
 
 We should get ext2 and friends to move the sucker _out_ of struct
 inode.  As it is, sizeof(struct inode) is way too large. This is 2.5
 stuff, but it really has to be done. More filesystems adding stuff
 into the union is a Bad Thing(tm). If you want to allocates space -
 allocate if yourself; -clear_inode() is the right place for freeing
 it.

Yes, I agree that the union is way too large and I did not plan to
extend it but simply use the size it has.

if (strlen(path)  sizeof(inode-u))
inline the symlink;
else
put it into the page cache;

So if somebody really cleans up the private inode structures it will
not trigger that often any more and we perhaps have to rethink the
idea.

But also if we use struct shmem_inode_info which is 92 bytes right now
we would inline all symlinks on my machine.

If we reduced its size to 32 (which could be easily done) we would
still inline 6642 out of 9317 symlinks on my machine. That's not bad.

Greetings
Christoph


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
  Oh, for crying out loud. All it takes is half an hour per filesystem.

Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
surprised.

--
dwmw2


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Mon, 23 Apr 2001, Andreas Dilger wrote:

 Al posted a patch to the NFS code which removes nfs_inode_info from the
 inode union.  Since it is (AFAIK) the largest member of the union, we
 have just saved 24 bytes per inode (hfs_inode_info is also rather large).
 If we removed hfs_inode_info as well, we would save 108 bytes per inode,
 about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
builds (it had most of the accesses to -u.hfs_i already encapsulated).

What I really don't understand is why the hell people keep coming up
with the grand and convoluted plans of removing the inode bloat and
nobleedinone of them actually cared to sit down and do the simplest variant
possible.

I can certainly go through the rest of filesystems and even do a testing
for most of them, but WTF? Could the rest of you please join the show?
It's not a fscking rocket science - encapsulate accesses to -u.foofs_i
into inlined function, find -read_inode, find places that do get_empty_inode()
or new_inode(), add allocation there, add freeing to -clear_inode()
(defining one if needed), change that inlined function so that it would
return -u.generic_ip and you are done. Clean the results up and test
them. Furrfu...

It's not like it was a global change that affected the whole kernel -
at every step changes are local to one filesystem and changes for
different filesystems are independent from each other. If at some point
in 2.5 .generic_ip is the only member of union - fine, we just do
%s/u.generic_ip/fs_inode/g
or something like that. Moreover, if maintainer of filesystem foo is
OK with change it _can_ be done in 2.4 - it doesn't affect anything
outside of foofs.

Guys, doing all these patches is ~20 man-hours. And that's bloody generous
estimate. Looking through the results and doing necessary tweaking
(as in hmm... we keep passing pointer to inode through the long chain
of functions and all of them need only fs-specific part, etc.) - about
the same. Verifiying that thing wasn't fucked up - maybe an hour or two of
audit per filesystem (split the patch into encapsulation part - trivial
to verify - and the rest - pretty small). Grrr...

Oh, well... Initial HFS patch follows:

diff -urN S4-pre6/fs/hfs/inode.c S4-pre6-hfs/fs/hfs/inode.c
--- S4-pre6/fs/hfs/inode.c  Fri Feb 16 22:55:36 2001
+++ S4-pre6-hfs/fs/hfs/inode.c  Tue Apr 24 05:10:21 2001
@@ -231,7 +231,7 @@
 static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, 
unsigned to)
 {
return cont_prepare_write(page,from,to,hfs_get_block,
-   page-mapping-host-u.hfs_i.mmu_private);
+   HFS_I(page-mapping-host)-mmu_private);
 }
 static int hfs_bmap(struct address_space *mapping, long block)
 {
@@ -309,7 +309,7 @@
return NULL;
}
 
-   if (inode-i_dev != sb-s_dev) {
+   if (inode-i_dev != sb-s_dev || !HFS_I(inode)) {
iput(inode); /* automatically does an hfs_cat_put */
inode = NULL;
} else if (!inode-i_mode || (*sys_entry == NULL)) {
@@ -373,7 +373,7 @@
inode-i_op = hfs_file_inode_operations;
inode-i_fop = hfs_file_operations;
inode-i_mapping-a_ops = hfs_aops;
-   inode-u.hfs_i.mmu_private = inode-i_size;
+   HFS_I(inode)-mmu_private = inode-i_size;
} else { /* Directory */
struct hfs_dir *hdir = entry-u.dir;
 
@@ -433,7 +433,7 @@
inode-i_op = hfs_file_inode_operations;
inode-i_fop = hfs_file_operations;
inode-i_mapping-a_ops = hfs_aops;
-   inode-u.hfs_i.mmu_private = inode-i_size;
+   HFS_I(inode)-mmu_private = inode-i_size;
} else { /* Directory */
struct hfs_dir *hdir = entry-u.dir;
 
@@ -479,7 +479,7 @@
inode-i_op = hfs_file_inode_operations;
inode-i_fop = hfs_file_operations;
inode-i_mapping-a_ops = hfs_aops;
-   inode-u.hfs_i.mmu_private = inode-i_size;
+   HFS_I(inode)-mmu_private = inode-i_size;
} else { /* Directory */
struct hfs_dir *hdir = entry-u.dir;
 
diff -urN S4-pre6/fs/hfs/super.c S4-pre6-hfs/fs/hfs/super.c
--- S4-pre6/fs/hfs/super.c  Sat Apr 21 14:35:20 2001
+++ S4-pre6-hfs/fs/hfs/super.c  Tue Apr 24 05:26:04 2001
@@ -35,6 +35,7 @@
 /* Forward declarations */
 
 static void hfs_read_inode(struct inode *);
+static void hfs_clear_inode(struct inode *);
 static void hfs_put_super(struct super_block *);
 static int hfs_statfs(struct super_block *, struct statfs *);
 static void hfs_write_super(struct super_block *);
@@ -43,6 +44,7 @@
 
 static struct super_operations hfs_super_operations = { 
read_inode: hfs_read_inode,
+   clear_inode:hfs_clear_inode,
put_inode:  hfs_put_inode,
put_super: 

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, David Woodhouse wrote:

 
 [EMAIL PROTECTED] said:
   Oh, for crying out loud. All it takes is half an hour per filesystem.
 
 Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
 surprised.

tone polite What's stopping you? /tone
You _are_ JFFS maintainer, aren't you?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
  tone polite What's stopping you? /tone You _are_ JFFS maintainer,
 aren't you?

It already uses...

#define JFFS2_INODE_INFO(i) (i-u.jffs2_i) 

It's trivial to switch over when the size of the inode union goes below the 
size of struct jffs2_inode_info. Until then, I'd just be wasting space.

--
dwmw2


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
 Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
 be very surprised.
 
 tone polite What's stopping you? /tone
 You _are_ JFFS maintainer, aren't you?

So is this the start to change all filesystems in 2.4? I am not sure
we should do that. 

Greetings
Christoph


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Christoph Rohland wrote:

 Hi Al,
 
 On Tue, 24 Apr 2001, Alexander Viro wrote:
  Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
  be very surprised.
  
  tone polite What's stopping you? /tone
  You _are_ JFFS maintainer, aren't you?
 
 So is this the start to change all filesystems in 2.4? I am not sure
 we should do that. 

Encapsulation part is definitely worth doing - it cleans the code up
and doesn't change the result of compile. Adding allocation/freeing/
cache initialization/cache removal and chaninging FOOFS_I() definition -
well, it's probably worth to keep such patches around, but whether
to switch any individual filesystem during 2.4 is a policy decision.
Up to maintainer, indeed. Notice that these patches (separate allocation
per se) are going to be within 3-4Kb per filesystem _and_ completely
straightforward.

What I would like to avoid is scenario like

Maintainers of filesystems with large private inodes: Why would we separate
them? We would only waste memory, since the other filesystems stay in -u
and keep it large.

Maintainers of the rest of filesystems: Since there's no patches that would
take large stuff out of -u, why would we bother?

So yes, IMO having such patches available _is_ a good thing. And in 2.5
we definitely want them in the tree. If encapsulation part gets there
during 2.4 and separate allocation is available for all of them it will
be easier to do without PITA in process.
Al


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David Woodhouse


[EMAIL PROTECTED] said:
  What I would like to avoid is scenario like
 Maintainers of filesystems with large private inodes: Why would we
 separate them? We would only waste memory, since the other filesystems
 stay in -u and keep it large.

 Maintainers of the rest of filesystems: Since there's no patches that
 would take large stuff out of -u, why would we bother?

 So yes, IMO having such patches available _is_ a good thing. And in
 2.5 we definitely want them in the tree. If encapsulation part gets
 there during 2.4 and separate allocation is available for all of them
 it will be easier to do without PITA in process.

JFFS2 has the encapsulation part already. I'll make it do separate 
allocation in 2.5, when it's actually a gain.

--
dwmw2


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Erik Mouw

On Tue, Apr 24, 2001 at 06:01:12AM -0400, Alexander Viro wrote:
 For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
 builds (it had most of the accesses to -u.hfs_i already encapsulated).

Al is right, it is no rocket science. Here is a patch against
2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
not familiar with the fs code. It compiles, and the procfs code even
runs (sorry, no CDROM player availeble on my embedded StrongARM
system), though it is possible that there are some bugs in it.


Erik

Index: fs/isofs/inode.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/isofs/inode.c,v
retrieving revision 1.1.1.24
diff -d -u -r1.1.1.24 inode.c
--- fs/isofs/inode.c	2001/04/21 21:24:00	1.1.1.24
+++ fs/isofs/inode.c	2001/04/24 13:13:29
@@ -15,6 +15,7 @@
 #include <linux/stat.h>
 #include <linux/sched.h>
 #include <linux/iso_fs.h>
+#include <linux/iso_fs_i.h>
 #include <linux/kernel.h>
 #include <linux/major.h>
 #include <linux/mm.h>
@@ -44,6 +45,8 @@
 static int check_bread = 0;
 #endif
 
+static kmem_cache_t *isofs_cachep;
+
 static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
 static int isofs_hash(struct dentry *parent, struct qstr *qstr);
 static int isofs_dentry_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b);
@@ -74,10 +77,12 @@
 }
 
 static void isofs_read_inode(struct inode *);
+static void isofs_clear_inode(struct inode *);
 static int isofs_statfs (struct super_block *, struct statfs *);
 
 static struct super_operations isofs_sops = {
 	read_inode:	isofs_read_inode,
+	clear_inode:	isofs_clear_inode,
 	put_super:	isofs_put_super,
 	statfs:		isofs_statfs,
 };
@@ -908,9 +913,9 @@
 		goto abort_beyond_end;
 
 	offset    = 0;
-	firstext  = inode->u.isofs_i.i_first_extent;
-	sect_size = inode->u.isofs_i.i_section_size >> ISOFS_BUFFER_BITS(inode);
-	nextino   = inode->u.isofs_i.i_next_section_ino;
+	firstext  = ISOFS_I(inode)->i_first_extent;
+	sect_size = ISOFS_I(inode)->i_section_size >> ISOFS_BUFFER_BITS(inode);
+	nextino   = ISOFS_I(inode)->i_next_section_ino;
 
 	i = 0;
 	if (nextino) {
@@ -923,9 +928,9 @@
 		ninode = iget(inode->i_sb, nextino);
 		if (!ninode)
 			goto abort;
-		firstext  = ninode->u.isofs_i.i_first_extent;
-		sect_size = ninode->u.isofs_i.i_section_size;
-		nextino   = ninode->u.isofs_i.i_next_section_ino;
+		firstext  = ISOFS_I(ninode)->i_first_extent;
+		sect_size = ISOFS_I(ninode)->i_section_size;
+		nextino   = ISOFS_I(ninode)->i_next_section_ino;
 		iput(ninode);
 
 		if (++i > 100)
@@ -1025,7 +1030,7 @@
 	struct iso_directory_record * tmpde = NULL;
 
 	inode->i_size = 0;
-	inode->u.isofs_i.i_next_section_ino = 0;
+	ISOFS_I(inode)->i_next_section_ino = 0;
 
 	block = f_pos >> ISOFS_BUFFER_BITS(inode);
 	offset = f_pos & (bufsize-1);
@@ -1077,7 +1082,7 @@
 
 		inode->i_size += isonum_733(de->size);
 		if (i == 1)
-			inode->u.isofs_i.i_next_section_ino = f_pos;
+			ISOFS_I(inode)->i_next_section_ino = f_pos;
 
 		more_entries = de->flags[-high_sierra] & 0x80;
 
@@ -1174,9 +1179,10 @@
 	inode->i_uid = inode->i_sb->u.isofs_sb.s_uid;
 	inode->i_gid = inode->i_sb->u.isofs_sb.s_gid;
 	inode->i_blocks = inode->i_blksize = 0;
+	inode->u.generic_ip = kmem_cache_alloc(isofs_cachep, SLAB_KERNEL);
 
 
-	inode->u.isofs_i.i_section_size = isonum_733 (de->size);
+	ISOFS_I(inode)->i_section_size = isonum_733 (de->size);
 	if(de->flags[-high_sierra] & 0x80) {
 		if(isofs_read_level3_size(inode)) goto fail;
 	} else {
@@ -1230,7 +1236,7 @@
 	inode->i_mtime = inode->i_atime = inode->i_ctime =
 		iso_date(de->date, high_sierra);
 
-	inode->u.isofs_i.i_first_extent = (isonum_733 (de->extent) +
+	ISOFS_I(inode)->i_first_extent = (isonum_733 (de->extent) +
 					   isonum_711 (de->ext_attr_length));
 
 	/*
@@ -1298,6 +1304,16 @@
 	goto out;
 }
 
+
+static void isofs_clear_inode(struct inode *inode)
+{
+	struct iso_inode_info *isofsi = ISOFS_I(inode);
+	inode->u.generic_ip = NULL;
+	if(isofsi)
+		kmem_cache_free(isofs_cachep, isofsi);
+}
+
+
 #ifdef LEAK_CHECK
 #undef malloc
 #undef free_s
@@ -1332,12 +1348,21 @@
 
 static int __init init_iso9660_fs(void)
 {
+	isofs_cachep = kmem_cache_create("isofs_inodes",
+					 sizeof(struct iso_inode_info),
+					 0, SLAB_HWCACHE_ALIGN,
+					 NULL, NULL);
+
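
[The diff is cut off above and never shows the ISOFS_I() helper it relies on.
A minimal sketch of the usual shape of such an accessor, assuming it simply
wraps the generic_ip pointer the patch assigns - this is an illustration, not
part of the original mail:]

static inline struct iso_inode_info *ISOFS_I(struct inode *inode)
{
	/* the private info now hangs off the generic pointer in the union */
	return (struct iso_inode_info *)inode->u.generic_ip;
}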

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Christoph Rohland

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of
> them it will be easier to do without PITA in process.

OK I will do that for tmpfs soon. And I will do the symlink inlining
with that patch.

Greetings
Christoph


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread David L. Parsley

Christoph Rohland wrote:
 
> OK I will do that for tmpfs soon. And I will do the symlink inlining
> with that patch.

Wow, this thread really exploded, eh?  But thanks, Christoph, I look
forward to seeing
your patch.  4k symlinks really suck for embedders who never swap out
pages. ;-)

regards,
David

-- 
David L. Parsley
Network Administrator
Roanoke College
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Al writes:
> Encapsulation part is definitely worth doing - it cleans the code up
> and doesn't change the result of compile. Adding allocation/freeing/
> cache initialization/cache removal and changing FOOFS_I() definition -
> well, it's probably worth to keep such patches around, but whether
> to switch any individual filesystem during 2.4 is a policy decision.
> Up to maintainer, indeed. Notice that these patches (separate allocation
> per se) are going to be within 3-4Kb per filesystem _and_ completely
> straightforward.

One thing to watch out for is that the current code zeros the u. struct
for us (as you pointed out to me previously), but allocating from the
slab cache will not...  This could be an interesting source of bugs for
some filesystems that assume zero'd inode_info structs.
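
[For illustration, the simplest way a converted filesystem could keep the old
zeroed-union behaviour is to clear the object right after allocation - a
sketch only, with "foofs" as a made-up name:]

	struct foofs_inode_info *fi;

	fi = kmem_cache_alloc(foofs_cachep, SLAB_KERNEL);
	if (!fi) {
		make_bad_inode(inode);
		return;
	}
	memset(fi, 0, sizeof(*fi));	/* the union used to arrive pre-zeroed */
	inode->u.generic_ip = fi;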

Fortunately, it doesn't appear that my patch to clean out all of the
duplicate zero initializers in the fs-specific code was accepted...

> What I would like to avoid is scenario like:
> 
> Maintainers of filesystems with large private inodes: Why would we separate
> them? We would only waste memory, since the other filesystems stay in ->u
> and keep it large.

Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
approximate for 32-bit arches - I was just counting by hand and not
strictly checking alignment), then almost all other filesystems are below
25 * __u32 (i.e. half of the previous size).

For large-private-inode filesystems, we are wasting memory in EVERY inode
in the slab cache, not just ones in use with the large private inode.  If
it were the most common filesystem (ext2, maybe reiser, msdos next) then
it wouldn't make much difference.

At some point reducing the union size is not efficient to have separate
slab allocations from a memory usage standpoint.

The remaining info structs are (approx. for 32-bit arch) (size in __u32):

ext2    27
affs    26
ufs     25
socket  24
shmem   22
coda    20
qnx4    18

minix   16
umsdos  15
hpfs    15
efs     14
sysv    13
reiser  12
udf     12
ntfs    11
ncp     10
msdos   9
adfs    7
smb     6
usbdev  5
proc    4
iso     4
bfs     3
romfs   2


> Maintainers of the rest of filesystems: Since there's no patches that would
> take large stuff out of ->u, why would we bother?

Maybe the size of the union can depend on CONFIG_*_FS?  There should be
an absolute minimum size (16 * __u32 or so), but then people who want
reiserfs as their primary fs do not need to pay the memory penalty of ext2.
For ext2 (the next largest and most common fs), we could make it part of
the union if it is compiled in, and on a slab cache if it is a module?

Should uncommon-but-widely-used things like socket and shmem have their
own slab cache, or should they just allocate from the generic size-32 slab?

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> One thing to watch out for is that the current code zeros the u. struct
> for us (as you pointed out to me previously), but allocating from the
> slab cache will not...  This could be an interesting source of bugs for
> some filesystems that assume zero'd inode_info structs.

True, but easy to catch.

> Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> approximate for 32-bit arches - I was just counting by hand and not
> strictly checking alignment), then almost all other filesystems are below
> 25 * __u32 (i.e. half of the previous size).

Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
I'm thinking about.
 
> Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> an absolute minimum size (16 * __u32 or so), but then people who want
> reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> For ext2 (the next largest and most common fs), we could make it part of
> the union if it is compiled in, and on a slab cache if it is a module?

NO. Sorry about shouting, but that's the way to madness. I can understand
code depending on SMP vs. UP and similar beasts, but presence of specific
filesystems <shudder>

> Should uncommon-but-widely-used things like socket and shmem have their
> own slab cache, or should they just allocate from the generic size-32 slab?

That's pretty interesting - especially for sockets. I wonder whether
we would get problems with separate allocation of these - we don't
go from inode to socket all that often, but...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Erik Mouw writes:
> Al is right, it is no rocket science. Here is a patch against
> 2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
> not familiar with the fs code. It compiles, and the procfs code even
> runs (sorry, no CDROM player available on my embedded StrongARM
> system), though it is possible that there are some bugs in it.

While I applaud your initiative, you made an unfortunate choice of
filesystems to convert.  The iso_inode_info is only 4*__u32, as is
proc_inode_info.  Given that we still need to keep a pointer to the
external info structs, and the overhead of the slab cache itself
(both CPU usage and memory overhead, however small), I don't think
it is worthwhile to have isofs and procfs in separate slabs.

On the other hand, sockets and shmem are both relatively large...
Watch out that the *_inode_info structs have all of the fields
initialized, because the union field is zeroed for us, but slab is not.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert.  The iso_inode_info is only 4*__u32, as is
> proc_inode_info.  Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.
> 
> On the other hand, sockets and shmem are both relatively large...
> Watch out that the *_inode_info structs have all of the fields
> initialized, because the union field is zeroed for us, but slab is not.

Frankly, I'd rather start with the encapsulation part. It's easy to
verify, it can go in right now and it makes the separate allocation
part uncluttered. Besides, it simply makes code cleaner, so it
makes sense even if you don't want to go for separate allocation for
that particular fs.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Andreas Dilger

Al writes:
> > Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> > approximate for 32-bit arches - I was just counting by hand and not
> > strictly checking alignment), then almost all other filesystems are below
> > 25 * __u32 (i.e. half of the previous size).
> 
> Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
> I'm thinking about.

But then again, you are saving 50-25 words for every non-NFS inode, and I
think _most_ systems will have more local inodes than NFS inodes.  Even
NFS servers will have local inodes, only clients (AFAIK) use nfs_inode_info.

> > Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> > an absolute minimum size (16 * __u32 or so), but then people who want
> > reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> > For ext2 (the next largest and most common fs), we could make it part of
> > the union if it is compiled in, and on a slab cache if it is a module?
> 
> NO. Sorry about shouting, but that's the way to madness. I can understand
> code depending on SMP vs. UP and similar beasts, but presence of specific
> filesystems <shudder>

But then again, if the size of nfs_inode_info changes, it is the same
problem... sizeof(struct inode) may have changed (depends if slab has
some padding between inodes or not).  If we stick to a minimum size
(16 words or maybe even 8), then it will never change anymore, and we
do not have overhead for small inode_info structs.

> > Should uncommon-but-widely-used things like socket and shmem have their
> > own slab cache, or should they just allocate from the generic size-32 slab?
> 
> That's pretty interesting - especially for sockets. I wonder whether
> we would get problems with separate allocation of these - we don't
> go from inode to socket all that often, but...

I never thought of that.  I guess the socket code does not know which
fs the inode_info was allocated from, so it cannot free it from the slab
(even if it had access to these slabs, which it does not).  In that case,
each fs would have struct socket as the minimum size allocatable, which
is unfortunately one of the largest inode_info sizes.  It is smaller
than ext2, but...

Any ideas?  Do we ever get back into fs-specific clear_inode() from
a socket?  In that case, the socket would just hold a pointer to the
fs-specific inode_info inside its own struct socket until the inode
is dropped.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Erik Mouw

On Tue, Apr 24, 2001 at 12:47:38PM -0600, Andreas Dilger wrote:
> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert.  The iso_inode_info is only 4*__u32, as is
> proc_inode_info.  Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.

Well, I know a little bit about procfs because I'm currently
documenting it, so that's why I picked it first. After I got the idea,
isofs was quite easy.

In retrospect it would have been more effective to pick a filesystem
with a larger *_inode_info field, but then again: Al is right. Struct
inode is cluttered with *_inode_info fields, while we use anonymous
data entries in other parts of the kernel (like the data pointer in
struct proc_dir_entry, or the priv pointer in struct net_device).

There is another advantage: suppose you're hacking on a filesystem and
change its *_fs_i.h header. With Al's proposal you only have to
recompile the filesystem you're hacking on, while you have to recompile
the complete kernel in the current situation.


Erik

-- 
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031,  2600 GA Delft, The Netherlands
Phone: +31-15-2783635  Fax: +31-15-2781843  Email: [EMAIL PROTECTED]
WWW: http://www-ict.its.tudelft.nl/~erik/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Trond Myklebust

   == Alexander Viro [EMAIL PROTECTED] writes:

  On Mon, 23 Apr 2001, Jan Harkes wrote:

 On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

  BTW: Is it still less than one page? Then it doesn't make me
 nervous. Why? Guess what granularity we allocate at, if we
 just store pointers instead of the inode.u. Or do you like
 every FS creating his own slab cache?

  Oh, for crying out loud. All it takes is half an hour per
  filesystem.  Here - completely untested patch that does it for
  NFS. Took about that long. Absolutely straightforward, very
  easy to verify correctness.

  Some stuff may need tweaking, but not much (e.g. some functions
  should take nfs_inode_info instead of inodes, etc.). From the
  look of flushd cache it seems that we would be better off with
  cyclic lists instead of single-linked ones for the hash, but I
  didn't look deep enough.

  So consider the patch below as proof-of-concept. Enjoy:

Hi Al,

  I believe your patch introduces a race for the NFS case. The problem
lies in the fact that nfs_find_actor() needs to read several of the
fields from nfs_inode_info. By adding an allocation after the inode
has been hashed, you are creating a window during which the inode can
be found by find_inode(), but during which you aren't even guaranteed
that the nfs_inode_info exists let alone that it's been initialized
by nfs_fill_inode().

One solution could be to have find_inode sleep on encountering a
locked inode. It would have to be something along the lines of

static struct inode * find_inode(struct super_block * sb, unsigned long ino,
				 struct list_head *head,
				 find_inode_t find_actor, void *opaque)
{
	struct list_head *tmp;
	struct inode * inode;

	tmp = head;
	for (;;) {
		tmp = tmp->next;
		inode = NULL;
		if (tmp == head)
			break;
		inode = list_entry(tmp, struct inode, i_hash);
		if (inode->i_ino != ino)
			continue;
		if (inode->i_sb != sb)
			continue;
		if (find_actor) {
			if (inode->i_state & I_LOCK) {
				spin_unlock(&inode_lock);
				__wait_on_inode(inode);
				spin_lock(&inode_lock);
				tmp = head;
				continue;
			}
			if (!find_actor(inode, ino, opaque))
				continue;
		}
		break;
	}
	return inode;
}


Cheers,
   Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Ingo Oeser

On Tue, Apr 24, 2001 at 02:49:23PM -0400, Alexander Viro wrote:
> On Tue, 24 Apr 2001, Andreas Dilger wrote:
> > One thing to watch out for is that the current code zeros the u. struct
> > for us (as you pointed out to me previously), but allocating from the
> > slab cache will not...  This could be an interesting source of bugs for
> > some filesystems that assume zero'd inode_info structs.
> True, but easy to catch.

Jepp. Just request SLAB_ZERO (still to be implemented) instead of
SLAB_POISON, or provide a constructor.

A nice set of macros for this would make it quite easy. The ctor
is the way to handle it. Maybe we could even put all the fs
specific initializers into it (e.g. magics, zeroes).
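
[A sketch of what such a constructor could look like with the 2.4 slab API,
using an invented "foofs". Note that slab only runs the constructor when an
object is first set up, so freed objects must be returned to this state - or
simply be re-zeroed on every allocation:]

static kmem_cache_t *foofs_cachep;

static void foofs_info_ctor(void *obj, kmem_cache_t *cachep, unsigned long flags)
{
	/* start every foofs_inode_info out zeroed, like the old in-inode union */
	memset(obj, 0, sizeof(struct foofs_inode_info));
}

	/* at cache-creation time: */
	foofs_cachep = kmem_cache_create("foofs_inodes",
					 sizeof(struct foofs_inode_info),
					 0, SLAB_HWCACHE_ALIGN,
					 foofs_info_ctor, NULL);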

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag http://www.tu-chemnitz.de/linux/tag
  been there and had much fun   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Trond Myklebust wrote:

> Hi Al,
> 
>   I believe your patch introduces a race for the NFS case. The problem
> lies in the fact that nfs_find_actor() needs to read several of the
> fields from nfs_inode_info. By adding an allocation after the inode
> has been hashed, you are creating a window during which the inode can
> be found by find_inode(), but during which you aren't even guaranteed
> that the nfs_inode_info exists let alone that it's been initialized
> by nfs_fill_inode().

_Ouch_. So what are you going to do if another iget4() comes between
the moment when you hash the inode and set these fields? You are
filling them only after you drop inode_lock, so AFAICS the current
code has the same problem.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Trond Myklebust

   == Alexander Viro [EMAIL PROTECTED] writes:

  _Ouch_. So what are you going to do if another iget4() comes
  between the moment when you hash the inode and set these
  fields? You are filling them only after you drop inode_lock, so
  AFAICS the current code has the same problem.

The entire call to iget4() is protected by the BKL in all relevant
instances. As long as we don't sleep between find_inode() and
nfs_fill_inode(), we're safe.

In fact the BKL protection is needed also for another reason: we don't
actually initialize the inode in the I_LOCK-protected read_inode() but
instead rely on the caller of iget4 to do it for us. The reason is
that we would need to pass the struct nfs_fattr to read_inode()
and this wasn't possible until the ReiserFS people introduced
read_inode2().

Cheers,
   Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Andreas Dilger

Ed Tomlinson writes:
> > Consider, when I was doing some fs benchmark, my inode slab cache was
> > over 120k items on a 128MB machine.  At 480 butes per inode, this is
> > almost 58 MB, close to half of RAM.  Reducing this to exactly ext2
> > sized inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> > systems)!!! (This assumes nfs_inode_info is the largest).
> 
> Was this with a recient kernel (post Alexander Viro's dcache pressure fix)?  
> If not I suggest rerunning the benchmark.  I had/have a patch to apply
> pressure to the dcache and icache from kswapd but its not been needed here
> since the above fix.

Actually, it had the dcache patch but I'm not aware of a patch from Al to
change icache behaviour.  In any case, changing the icache behaviour is
not what I'm getting at here - having the union of all private inode
structs in the generic inode is a huge waste of RAM.  Even for filesystems
that are heavy NFS users, they will likely still have a considerable amount
of local filesystem space (excluding NFS root systems, which are very few).

Al posted a patch to the NFS code which removes nfs_inode_info from the
inode union.  Since it is (AFAIK) the largest member of the union, we
have just saved 24 bytes per inode (hfs_inode_info is also rather large).
If we removed hfs_inode_info as well, we would save 108 bytes per inode,
about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

No point in punishing all users for filesystems they don't necessarily use.
Even for people that DO use NFS and/or HFS, they are probably still wasting
10k inodes * 108 bytes = 1MB of RAM for no good reason (because most of
their inodes are probably not NFS and/or HFS).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Jan Harkes wrote:

> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

> > BTW: Is it still less than one page? Then it doesn't make me
> >nervous. Why? Guess what granularity we allocate at, if we
> >just store pointers instead of the inode.u. Or do you like
> >every FS creating his own slab cache?

Oh, for crying out loud. All it takes is half an hour per filesystem.
Here - completely untested patch that does it for NFS. Took about that
long. Absolutely straightforward, very easy to verify correctness.

Some stuff may need tweaking, but not much (e.g. some functions
should take nfs_inode_info instead of inodes, etc.). From the look
of flushd cache it seems that we would be better off with cyclic
lists instead of single-linked ones for the hash, but I didn't look
deep enough.

So consider the patch below as proof-of-concept. Enjoy:

diff -urN S4-pre6/fs/nfs/flushd.c S4-pre6-nfs/fs/nfs/flushd.c
--- S4-pre6/fs/nfs/flushd.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/flushd.c Mon Apr 23 22:23:11 2001
@@ -162,11 +162,11 @@
 
if (NFS_FLAGS(inode) & NFS_INO_FLUSH)
goto out;
-   inode->u.nfs_i.hash_next = NULL;
+   NFS_I(inode)->hash_next = NULL;
 
	q = &cache->inodes;
while (*q)
-   q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
*q = inode;
 
/* Note: we increase the inode i_count in order to prevent
@@ -188,9 +188,9 @@
 
	q = &cache->inodes;
while (*q && *q != inode)
-   q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
if (*q) {
-   *q = inode->u.nfs_i.hash_next;
+   *q = NFS_I(inode)->hash_next;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
iput(inode);
}
@@ -238,8 +238,8 @@
cache->inodes = NULL;
 
while ((inode = next) != NULL) {
-   next = next->u.nfs_i.hash_next;
-   inode->u.nfs_i.hash_next = NULL;
+   next = NFS_I(next)->hash_next;
+   NFS_I(inode)->hash_next = NULL;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 
if (flush) {
diff -urN S4-pre6/fs/nfs/inode.c S4-pre6-nfs/fs/nfs/inode.c
--- S4-pre6/fs/nfs/inode.c  Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/inode.c  Mon Apr 23 22:43:45 2001
@@ -40,11 +40,14 @@
 #define NFSDBG_FACILITY		NFSDBG_VFS
 #define NFS_PARANOIA 1
 
+static kmem_cache_t *nfs_inode_cachep;
+
 static struct inode * __nfs_fhget(struct super_block *, struct nfs_fh *, struct 
nfs_fattr *);
 void nfs_zap_caches(struct inode *);
 static void nfs_invalidate_inode(struct inode *);
 
 static void nfs_read_inode(struct inode *);
+static void nfs_clear_inode(struct inode *);
 static void nfs_delete_inode(struct inode *);
 static void nfs_put_super(struct super_block *);
 static void nfs_umount_begin(struct super_block *);
@@ -52,6 +55,7 @@
 
 static struct super_operations nfs_sops = { 
read_inode: nfs_read_inode,
+   clear_inode:nfs_clear_inode,
put_inode:  force_delete,
delete_inode:   nfs_delete_inode,
put_super:  nfs_put_super,
@@ -96,23 +100,44 @@
 static void
 nfs_read_inode(struct inode * inode)
 {
+   struct nfs_inode_info *nfsi;
+
+   nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+   if (!nfsi)
+   goto Enomem;
+
inode->i_blksize = inode->i_sb->s_blocksize;
inode->i_mode = 0;
inode->i_rdev = 0;
+   inode->u.generic_ip = nfsi;
NFS_FILEID(inode) = 0;
NFS_FSID(inode) = 0;
NFS_FLAGS(inode) = 0;
-	INIT_LIST_HEAD(&inode->u.nfs_i.read);
-	INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
-	INIT_LIST_HEAD(&inode->u.nfs_i.commit);
-	INIT_LIST_HEAD(&inode->u.nfs_i.writeback);
-   inode->u.nfs_i.nread = 0;
-   inode->u.nfs_i.ndirty = 0;
-   inode->u.nfs_i.ncommit = 0;
-   inode->u.nfs_i.npages = 0;
+	INIT_LIST_HEAD(&nfsi->read);
+	INIT_LIST_HEAD(&nfsi->dirty);
+	INIT_LIST_HEAD(&nfsi->commit);
+	INIT_LIST_HEAD(&nfsi->writeback);
+   nfsi->nread = 0;
+   nfsi->ndirty = 0;
+   nfsi->ncommit = 0;
+   nfsi->npages = 0;
NFS_CACHEINV(inode);
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
+   return;
+
+Enomem:
+   make_bad_inode(inode);
+   return;
+}
+
+static void
+nfs_clear_inode(struct inode * inode)
+{
+   struct nfs_inode_info *p = NFS_I(inode);
+   inode->u.generic_ip = NULL;
+   if (p)
+   kmem_cache_free(nfs_inode_cachep, p);
 }
 
 static void
@@ -594,7 +619,7 @@
NFS_CACHE_ISIZE(inode) = fattr->size;
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
-	memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+   memcpy(NFS_FH(inode), fh, sizeof(struct nfs_fh));
}

Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Jan Harkes

On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].
> 
> So that struct inode around is ok.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>nervous. Why? Guess what granularity we allocate at, if we
>just store pointers instead of the inode.u. Or do you like
>every FS creating his own slab cache?

I've actually got the coda_inode_info (inode->u.u_coda_fs_i) split out
of the union in my development kernel. It doesn't shrink the size of the
struct inode yet, Coda isn't the biggest user at the moment.

But, it forced me to do several cleanups in the code, and it even has
resulted in fixing a 'leak'. Not a real memory loss leak one, but we
left uninitialized inodes around in the icache for no good reason. Also
changing a bit in a coda specific header file does trigger an almost
complete rebuild of the whole kernel (coda.h -> coda_fs_i.h -> fs.h ->
everything?)

The allocation overhead really isn't that bad. kmalloc/kfree are also
using the slabcache, and a trivial variant using a 'private' slabcache
gave me the counters to find the 'leak' I mentioned before.

I can't really evaluate performance impacts. The struct inode is still
the same size, so for now there even is a little bit of additional
memory pressure. Also, Coda wasn't really developed to achieve high
performance but more to explore novel features.

Jan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ed Tomlinson

Andreas Dilger wrote:

> Consider, when I was doing some fs benchmark, my inode slab cache was
> over 120k items on a 128MB machine.  At 480 butes per inode, this is
> almost 58 MB, close to half of RAM.  Reducing this to exactly ext2
> sized inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> systems)!!! (This assumes nfs_inode_info is the largest).

Was this with a recent kernel (post Alexander Viro's dcache pressure fix)?
If not I suggest rerunning the benchmark.  I had/have a patch to apply pressure
to the dcache and icache from kswapd but it's not been needed here since the above
fix.

Ed Tomlinson <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Rik van Riel

On Mon, 23 Apr 2001, Alexander Viro wrote:
> On Mon, 23 Apr 2001, Richard Gooch wrote:
>
> > - keep a separate VFSinode and FSinode slab cache
>
> Yup.

Would it make sense to unify these with the struct
address_space ?

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Richard Gooch wrote:

> - keep a separate VFSinode and FSinode slab cache

Yup.

> - allocate an enlarged VFSinode that contains the FSinode at the end,
>   with the generic pointer in the VFSinode part pointing to FSinode
>   part.

Please, don't. It would help with bloat only if you allocated these
beasts separately for each fs and then you end up with _many_ allocators
that can generate pointer to struct inode. 

"One type - one allocator" is a good rule - violating it turns into major
PITA couple of years down the road 9 times out of 10.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Alexander Viro writes:
> 
> 
> On Mon, 23 Apr 2001, Richard Gooch wrote:
> 
> > - keep a separate VFSinode and FSinode slab cache
> 
> Yup.
> 
> > - allocate an enlarged VFSinode that contains the FSinode at the end,
> >   with the generic pointer in the VFSinode part pointing to FSinode
> >   part.
> 
> Please, don't. It would help with bloat only if you allocated these
> beasts separately for each fs and then you end up with _many_ allocators
> that can generate pointer to struct inode. 
> 
> "One type - one allocator" is a good rule - violating it turns into
> major PITA couple of years down the road 9 times out of 10.

Agreed. The better option is the separate VFSinode and FSinode caches.
The enlarged inode scheme is also ugly, like the unions. It's just
less bloated :-)

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Albert D. Cahalan writes:
> Richard Gooch writes:
> 
> > We want to take out that union because it sucks for virtual
> > filesystems. Besides, it's ugly.
> 
> I hope you won't mind if people trash this with benchmarks.

But they can't. At least, not for a well designed patch. If there is a
real issue of fragmentation, then there are ways to fix that without
using a bloated union structure. Don't punish some filesystems just
because others have a problem.

Solutions to avoid fragmentation:

- keep a separate VFSinode and FSinode slab cache
- allocate an enlarged VFSinode that contains the FSinode at the end,
  with the generic pointer in the VFSinode part pointing to FSinode
  part.
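
[A sketch of the second option, purely illustrative - the struct and names
are invented and do not come from any patch in this thread:]

struct foofs_bundled_inode {
	struct inode		vfs_inode;	/* VFS part first, so a pointer to the
						 * bundle is also a struct inode * */
	struct foofs_inode_info	fs_info;	/* private part in the same allocation */
};

	/* after allocating one object from the per-fs cache: */
	bundle->vfs_inode.u.generic_ip = &bundle->fs_info;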

It's simply wrong to bloat everyone because some random FS found it
easier to thow in a union.

Besides, for every benchmark that shows how fragmentation hurts, I can
generate a benchmark showing how inode bloat hurts. Lies, damn lies
and benchmarks.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Andreas Dilger

Ingo Oeser writes:
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>nervous. Why? Guess what granularity we allocate at, if we
>just store pointers instead of the inode.u. Or do you like
>every FS creating his own slab cache?

I would much rather we allocate a slab cache for each fs type (it
would be trivial at register_fs time).  Most people have only a limited
number of filesystems active at a single time, yet tens or hundreds of
thousands of inodes in the inode slab cache.  Making the per-fs private
inode data as small as possible would reduce memory wastage considerably,
and not impact performance (AFAICS) if we use a per-fs type slab cache
for fs private data.

Consider, when I was doing some fs benchmark, my inode slab cache was
over 120k items on a 128MB machine.  At 480 bytes per inode, this is
almost 58 MB, close to half of RAM.  Reducing this to exactly ext2
sized inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
systems)!!! (This assumes nfs_inode_info is the largest).

This also makes it possible to safely (and efficiently) use external
filesystem modules without the need to recompile the kernel.  Granted,
if the external filesystem doesn't use more than the largest .u struct,
then it is currently possible as well, but that number changes, so it
is not safe.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Albert D. Cahalan

Richard Gooch writes:

> We want to take out that union because it sucks for virtual
> filesystems. Besides, it's ugly.

I hope you won't mind if people trash this with benchmarks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Tue, 24 Apr 2001, Ingo Oeser wrote:

> We have this kind of stuff all over the place. If we allocate
> some small amount of memory and and need some small amount
> associated with this memory, there is no problem with a little
> waste.

Little? How about quarter of kilobyte per inode? sizeof(struct inode)
is nearly half-kilobyte. And icache can easily get to ~100000 elements.

> Waste is better than fragmentation. This is the lesson people
> learned from segments in the ia32.
> 
> Objects are easier to manage, if they are the same size.

So don't keep them in the same cache. Notice that quite a few systems
keep vnode separately from fs-specific data. For a very good reason.

Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

On Mon, Apr 23, 2001 at 10:56:16PM +0200, Christoph Hellwig wrote:
> In article <[EMAIL PROTECTED]> you wrote:
> > Last time we suggested this, people ended up with some OS trying
> > it and getting worse performance. 
> 
> Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

Don't remember. I think Larry McVoy told the story, so I cc'ed
him ;-)

> Because having an union in generic code that includes filesystem-specific
> memebers is ugly? It's one of those a little more performance for a lot of
> bad style optimizations.

We have this kind of stuff all over the place. If we allocate
some small amount of memory and need some small amount
associated with this memory, there is no problem with a little
waste.

Waste is better than fragmentation. This is the lesson people
learned from segments in the ia32.

Objects are easier to manage, if they are the same size.

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag 
  been there and had much fun   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Ingo Oeser writes:
> On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > > Great idea. We allocate this space anyway. And we don't have to
> > > care about the internals of this union, because never have to use
> > > it outside the kernel ;-)
> > > 
> > > I like it. ext2fs does the same, so there should be no VFS
> > > hassles involved. Al?
> > 
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> You need an inode anyway. So why not using the space in it? tmpfs
> would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
> this kind of symlinks.
> 
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

We want to take out that union because it sucks for virtual
filesystems. Besides, it's ugly.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 

Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

Because having a union in generic code that includes filesystem-specific
members is ugly? It's one of those "a little more performance for a lot of
bad style" optimizations.

Christoph


-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > Great idea. We allocate this space anyway. And we don't have to
> > care about the internals of this union, because never have to use
> > it outside the kernel ;-)
> > 
> > I like it. ext2fs does the same, so there should be no VFS
> > hassles involved. Al?
> 
> We should get ext2 and friends to move the sucker _out_ of struct inode.
> As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> it really has to be done. More filesystems adding stuff into the union
> is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> ->clear_inode() is the right place for freeing it.

You need an inode anyway. So why not use the space in it? tmpfs
would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
this kind of symlinks.

Last time we suggested this, people ended up with some OS trying
it and getting worse performance. 

Why? You need to allocate the VFS-inode (vnode in other OSs) and
the on-disk-inode anyway at the same time. You get better
performance and less fragmentation, if you allocate them both
together[1].

So that struct inode around is ok.

BTW: Is it still less than one page? Then it doesn't make me
   nervous. Why? Guess what granularity we allocate at, if we
   just store pointers instead of the inode.u. Or do you like
   every FS creating his own slab cache?

Regards

Ingo Oeser

[1] Which is true for other allocations, too.
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag 
  been there and had much fun   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Ingo Oeser wrote:

> Hi Chris,
> 
> On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > > The question is: How? If you do it like ramfs, you cannot swap
> > > these symlinks and this is effectively a mlock(symlink) operation
> > > allowed for normal users. -> BAD!
> > 
> > How about storing it into the inode structure if it fits into the
> > fs-private union? If it is too big we allocate the page as we do it
> > now. The union has 192 bytes. This should be sufficient for most
> > cases.
> 
> Great idea. We allocate this space anyway. And we don't have to
> care about the internals of this union, because never have to use
> it outside the kernel ;-)
> 
> I like it. ext2fs does the same, so there should be no VFS
> hassles involved. Al?

We should get ext2 and friends to move the sucker _out_ of struct inode.
As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
it really has to be done. More filesystems adding stuff into the union
is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
->clear_inode() is the right place for freeing it.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

Hi Chris,

On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > The question is: How? If you do it like ramfs, you cannot swap
> > these symlinks and this is effectively a mlock(symlink) operation
> > allowed for normal users. -> BAD!
> 
> How about storing it into the inode structure if it fits into the
> fs-private union? If it is too big we allocate the page as we do it
> now. The union has 192 bytes. This should be sufficient for most
> cases.

Great idea. We allocate this space anyway. And we don't have to
care about the internals of this union, because we never have to use
it outside the kernel ;-)

I like it. ext2fs does the same, so there should be no VFS
hassles involved. Al?

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag 
  been there and had much fun   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Rohland

Hi Ingo,

On Mon, 23 Apr 2001, Ingo Oeser wrote:
> On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
>> On Sun, 22 Apr 2001, David L. Parsley wrote:
>> > attach packages inside it.  Since symlinks in a tmpfs filesystem
>> > cost 4k each (ouch!), I'm considering using mount --bind for
>> > everything.
>> 
>> What about fixing tmpfs instead?
> 
> The question is: How? If you do it like ramfs, you cannot swap
> these symlinks and this is effectively a mlock(symlink) operation
> allowed for normal users. -> BAD!

How about storing it into the inode structure if it fits into the
fs-private union? If it is too big we allocate the page as we do it
now. The union has 192 bytes. This should be sufficient for most
cases.
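
[A rough sketch of that idea, assuming it would go into shmem_symlink() where
the target string is available as symname - the field layout here is invented,
only the "does it fit in the union" test matters:]

	int len = strlen(symname) + 1;

	if (len <= sizeof(inode->u) - sizeof(struct shmem_inode_info)) {
		/* short link: tuck the target into the unused tail of the union */
		char *inline_target = (char *)&inode->u + sizeof(struct shmem_inode_info);
		memcpy(inline_target, symname, len);
	} else {
		/* long link: fall back to allocating a whole page, as today */
	}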

Greetings
Christoph


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, David L. Parsley wrote:

> What I'm not sure of is which solution is actually 'better' - I'm
> guessing that performance-wise, neither will make a noticable
> difference, so I guess memory usage would be the deciding factor.  If I

Bindings are faster on lookup. For obvious reasons - in case of symlinks
you do name resolution every time you traverse the link; in case of
bindings it is done when you create them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread David L. Parsley

Christoph Rohland wrote:
> 
> Hi David,
> 
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > I'm still working on a packaging system for diskless
> > (quasi-embedded) devices.  The root filesystem is all tmpfs, and I
> > attach packages inside it.  Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
> 
> What about fixing tmpfs instead?

That would be great - are you volunteering? ;-)  Seriously - I might be
able to look at what ramfs does and port that to tmpfs for my needs, but
that's about the extent of my kernel hacking skills.  For now, mount
--bind looks like it'll work just fine.  If somebody wants to fix tmpfs,
I'll be happy to test patches; it'll just change a couple of lines in my
package loading logic (mount --bind x y -> ln -s x y).

What I'm not sure of is which solution is actually 'better' - I'm
guessing that performance-wise, neither will make a noticable
difference, so I guess memory usage would be the deciding factor.  If I
can get a lot closer to the size of a symlink (10-20 bytes) that would
be best.  The issue with /proc/mounts really shouldn't hurt anything - I
could almost get by without mounting /proc anyway, it's mainly a
convenience.

regards,
David

-- 
David L. Parsley
Network Administrator
Roanoke College
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > attach packages inside it.  Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
> 
> What about fixing tmpfs instead?

The question is: How? If you do it like ramfs, you cannot swap
these symlinks and this is effectively a mlock(symlink) operation
allowed for normal users. -> BAD!

One idea is to only use a page, if the entry will be pushed into
swap and thus only wasting swap, not memory (where we have more
of it).

But allocating a page on memory pressure is also not a bright
idea.

OTOH we could force this entry to swap immediately, after we
copied it from the dentry. So we can do a GFP_ATOMIC allocation
and not do too much harm to memory pressure and only make the IO
a bit stormier.

I think there are a lot of races, which I don't see now.

So please don't beat me too much, if this is a completely stupid
idea, ok?  ;-)


Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag 
  been there and had much fun   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Rohland

Hi David,

On Sun, 22 Apr 2001, David L. Parsley wrote:
> I'm still working on a packaging system for diskless
> (quasi-embedded) devices.  The root filesystem is all tmpfs, and I
> attach packages inside it.  Since symlinks in a tmpfs filesystem
> cost 4k each (ouch!), I'm considering using mount --bind for
> everything.

What about fixing tmpfs instead?

Greetings
Christoph


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Rohland

Hi David,

On Sun, 22 Apr 2001, David L. Parsley wrote:
 I'm still working on a packaging system for diskless
 (quasi-embedded) devices.  The root filesystem is all tmpfs, and I
 attach packages inside it.  Since symlinks in a tmpfs filesystem
 cost 4k each (ouch!), I'm considering using mount --bind for
 everything.

What about fixing tmpfs instead?

Greetings
Christoph


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, David L. Parsley wrote:

 What I'm not sure of is which solution is actually 'better' - I'm
 guessing that performance-wise, neither will make a noticable
 difference, so I guess memory usage would be the deciding factor.  If I

Bindings are faster on lookup. For obvious reasons - in case of symlinks
you do name resolution every time you traverse the link; in case of
bindings it is done when you create them.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Rohland

Hi Ingo,

On Mon, 23 Apr 2001, Ingo Oeser wrote:
 On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
 On Sun, 22 Apr 2001, David L. Parsley wrote:
  attach packages inside it.  Since symlinks in a tmpfs filesystem
  cost 4k each (ouch!), I'm considering using mount --bind for
  everything.
 
 What about fixing tmpfs instead?
 
 The question is: How? If you do it like ramfs, you cannot swap
 these symlinks and this is effectively a mlock(symlink) operation
 allowed for normal users. - BAD!

How about storing it into the inode structure if it fits into the
fs-private union? If it is too big we allocate the page as we do it
now. The union has 192 bytes. This should be sufficient for most
cases.

Greetings
Christoph


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

Hi Chris,

On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
  The question is: How? If you do it like ramfs, you cannot swap
  these symlinks and this is effectively a mlock(symlink) operation
  allowed for normal users. - BAD!
 
 How about storing it into the inode structure if it fits into the
 fs-private union? If it is too big we allocate the page as we do it
 now. The union has 192 bytes. This should be sufficient for most
 cases.

Great idea. We allocate this space anyway. And we don't have to
care about the internals of this union, because never have to use
it outside the kernel ;-)

I like it. ext2fs does the same, so there should be no VFS
hassles involved. Al?

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag http://www.tu-chemnitz.de/linux/tag
  been there and had much fun   
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Ingo Oeser wrote:

 Hi Chris,
 
 On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
   The question is: How? If you do it like ramfs, you cannot swap
   these symlinks and this is effectively a mlock(symlink) operation
   allowed for normal users. - BAD!
  
  How about storing it into the inode structure if it fits into the
  fs-private union? If it is too big we allocate the page as we do it
  now. The union has 192 bytes. This should be sufficient for most
  cases.
 
 Great idea. We allocate this space anyway. And we don't have to
 care about the internals of this union, because never have to use
 it outside the kernel ;-)
 
 I like it. ext2fs does the same, so there should be no VFS
 hassles involved. Al?

We should get ext2 and friends to move the sucker _out_ of struct inode.
As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
it really has to be done. More filesystems adding stuff into the union
is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
-clear_inode() is the right place for freeing it.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > Great idea. We allocate this space anyway. And we don't have to
> > care about the internals of this union, because we never have to use
> > it outside the kernel ;-)
> > 
> > I like it. ext2fs does the same, so there should be no VFS
> > hassles involved. Al?
> 
> We should get ext2 and friends to move the sucker _out_ of struct inode.
> As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> it really has to be done. More filesystems adding stuff into the union
> is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> ->clear_inode() is the right place for freeing it.

You need an inode anyway. So why not use the space in it? tmpfs
would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
this kind of symlink.

Last time we suggested this, people ended up with some OS trying
it and getting worse performance. 

Why? You need to allocate the VFS-inode (vnode in other OSs) and
the on-disk-inode anyway at the same time. You get better
performance and less fragmentation, if you allocate them both
together[1].

So that struct inode around is ok.
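As a crude illustration of the "allocate them both together" point: the union inside struct inode already gives the layout below implicitly, i.e. one object and one allocation per cached inode, with the fs part reachable by plain pointer arithmetic (foofs_inode_info is a placeholder, not a real filesystem):

struct foofs_inode_info { int dummy; };	/* placeholder fs-private data */

/* What the 2.4 union amounts to, written out explicitly. */
struct combined_inode {
	struct inode		vfs;	/* generic VFS part, must come first */
	struct foofs_inode_info	fs;	/* fs-specific part, same allocation */
};

static struct inode *combined_alloc(kmem_cache_t *cachep)
{
	struct combined_inode *ci = kmem_cache_alloc(cachep, GFP_KERNEL);

	return ci ? &ci->vfs : NULL;	/* the VFS sees an ordinary inode */
}

static struct foofs_inode_info *combined_priv(struct inode *inode)
{
	/* no second allocation, no second lookup */
	return &((struct combined_inode *)inode)->fs;
}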

BTW: Is it still less than one page? Then it doesn't make me
   nervous. Why? Guess what granularity we allocate at, if we
   just store pointers instead of the inode.u. Or do you like
   every FS creating his own slab cache?

Regards

Ingo Oeser

[1] Which is true for other allocations, too.
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag http://www.tu-chemnitz.de/linux/tag
  been there and had much fun   
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Christoph Hellwig

In article [EMAIL PROTECTED] you wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 

Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

Because having a union in generic code that includes filesystem-specific
members is ugly? It's one of those "a little more performance for a lot
of bad style" optimizations.

Christoph


-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Ingo Oeser writes:
> On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > > Great idea. We allocate this space anyway. And we don't have to
> > > care about the internals of this union, because we never have to use
> > > it outside the kernel ;-)
> > > 
> > > I like it. ext2fs does the same, so there should be no VFS
> > > hassles involved. Al?
> > 
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> You need an inode anyway. So why not use the space in it? tmpfs
> would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
> this kind of symlink.
> 
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

We want to take out that union because it sucks for virtual
filesystems. Besides, it's ugly.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ingo Oeser

On Mon, Apr 23, 2001 at 10:56:16PM +0200, Christoph Hellwig wrote:
> In article [EMAIL PROTECTED] you wrote:
> > Last time we suggested this, people ended up with some OS trying
> > it and getting worse performance. 
> 
> Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

Don't remember. I think Larry McVoy told the story, so I cc'ed
him ;-)

> Because having a union in generic code that includes filesystem-specific
> members is ugly? It's one of those "a little more performance for a lot
> of bad style" optimizations.

We have this kind of stuff all over the place. If we allocate
some small amount of memory and need some small amount
associated with this memory, there is no problem with a little
waste.

Waste is better than fragmentation. This is the lesson people
learned from segments in the ia32.

Objects are easier to manage, if they are the same size.

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag http://www.tu-chemnitz.de/linux/tag
  been there and had much fun   
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Tue, 24 Apr 2001, Ingo Oeser wrote:

> We have this kind of stuff all over the place. If we allocate
> some small amount of memory and need some small amount
> associated with this memory, there is no problem with a little
> waste.

Little? How about a quarter of a kilobyte per inode? sizeof(struct inode)
is nearly half a kilobyte. And the icache can easily get to ~100000 elements.
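To put rough numbers on that, using figures quoted elsewhere in this thread (a ~480-byte struct inode of which the ~192-byte union is fs-private, and an icache that can reach 100k+ entries): 192 bytes * 100,000 inodes is roughly 19 MB tied up in the union alone, and 480 bytes * 100,000 is roughly 48 MB for the icache as a whole.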

> Waste is better than fragmentation. This is the lesson people
> learned from segments in the ia32.
> 
> Objects are easier to manage, if they are the same size.

So don't keep them in the same cache. Notice that quite a few systems
keep vnode separately from fs-specific data. For a very good reason.

Al

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Albert D. Cahalan

Richard Gooch writes:

> We want to take out that union because it sucks for virtual
> filesystems. Besides, it's ugly.

I hope you won't mind if people trash this with benchmarks.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Andreas Dilger

Ingo Oeser writes:
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>      nervous. Why? Guess what granularity we allocate at, if we
>      just store pointers instead of the inode.u. Or do you like
>      every FS creating his own slab cache?

I would much rather we allocate a slab cache for each fs type (it
would be trivial at register_fs time).  Most people have only a limited
number of filesystems active at a single time, yet tens or hundreds of
thousands of inodes in the inode slab cache.  Making the per-fs private
inode data as small as possible would reduce memory wastage considerably,
and not impact performance (AFAICS) if we use a per-fs type slab cache
for fs private data.
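A sketch of what a per-fs-type cache could look like, assuming a hook at registration time (the private_cachep field and the wrapper below are invented for illustration; 2.4's struct file_system_type has no such member):

/* Hypothetical: one slab cache of private inode data per filesystem type,
 * created when the filesystem registers itself. */
static int register_filesystem_with_cache(struct file_system_type *fs,
					  size_t private_size)
{
	fs->private_cachep = kmem_cache_create(fs->name, private_size, 0,
					       SLAB_HWCACHE_ALIGN, NULL, NULL);
	if (!fs->private_cachep)
		return -ENOMEM;
	return register_filesystem(fs);
}

Each filesystem would then allocate its private info from its own cache in ->read_inode and free it in ->clear_inode, much as in the NFS patch further down the thread, so the objects in each cache stay uniform in size.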

Consider, when I was doing some fs benchmark, my inode slab cache was
over 120k items on a 128MB machine.  At 480 bytes per inode, this is
almost 58 MB, close to half of RAM.  Reducing this to exactly ext2-sized
inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
systems)!!! (This assumes nfs_inode_info is the largest).

This also makes it possible to safely (and efficiently) use external
filesystem modules without the need to recompile the kernel.  Granted,
if the external filesystem doesn't use more than the largest .u struct,
then it is currently possible as well, but that number changes, so it
is not safe.

Cheers, Andreas
-- 
Andreas Dilger  \ If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Albert D. Cahalan writes:
> Richard Gooch writes:
> 
> > We want to take out that union because it sucks for virtual
> > filesystems. Besides, it's ugly.
> 
> I hope you won't mind if people trash this with benchmarks.

But they can't. At least, not for a well designed patch. If there is a
real issue of fragmentation, then there are ways to fix that without
using a bloated union structure. Don't punish some filesystems just
because others have a problem.

Solutions to avoid fragmentation:

- keep a separate VFSinode and FSinode slab cache
- allocate an enlarged VFSinode that contains the FSinode at the end,
  with the generic pointer in the VFSinode part pointing to FSinode
  part.

It's simply wrong to bloat everyone because some random FS found it
easier to throw in a union.

Besides, for every benchmark that shows how fragmentation hurts, I can
generate a benchmark showing how inode bloat hurts. Lies, damn lies
and benchmarks.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Richard Gooch

Alexander Viro writes:
 
 
> On Mon, 23 Apr 2001, Richard Gooch wrote:
> 
> > - keep a separate VFSinode and FSinode slab cache
> 
> Yup.
> 
> > - allocate an enlarged VFSinode that contains the FSinode at the end,
> >   with the generic pointer in the VFSinode part pointing to FSinode
> >   part.
> 
> Please, don't. It would help with bloat only if you allocated these
> beasts separately for each fs and then you end up with _many_ allocators
> that can generate pointer to struct inode. 
> 
> One type - one allocator is a good rule - violating it turns into
> major PITA couple of years down the road 9 times out of 10.

Agreed. The better option is the separate VFSinode and FSinode caches.
The enlarged inode scheme is also ugly, like the unions. It's just
less bloated :-)

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Richard Gooch wrote:

> - keep a separate VFSinode and FSinode slab cache

Yup.

> - allocate an enlarged VFSinode that contains the FSinode at the end,
>   with the generic pointer in the VFSinode part pointing to FSinode
>   part.

Please, don't. It would help with bloat only if you allocated these
beasts separately for each fs and then you end up with _many_ allocators
that can generate pointer to struct inode. 

One type - one allocator is a good rule - violating it turns into major
PITA couple of years down the road 9 times out of 10.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Rik van Riel

On Mon, 23 Apr 2001, Alexander Viro wrote:
> On Mon, 23 Apr 2001, Richard Gooch wrote:
> 
> > - keep a separate VFSinode and FSinode slab cache
> 
> Yup.

Would it make sense to unify these with the struct
address_space?

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Ed Tomlinson

Andreas Dilger wrote:

> Consider, when I was doing some fs benchmark, my inode slab cache was
> over 120k items on a 128MB machine.  At 480 bytes per inode, this is
> almost 58 MB, close to half of RAM.  Reducing this to exactly ext2-sized
> inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> systems)!!! (This assumes nfs_inode_info is the largest).

Was this with a recent kernel (post Alexander Viro's dcache pressure fix)?
If not I suggest rerunning the benchmark.  I had/have a patch to apply pressure
to the dcache and icache from kswapd but it's not been needed here since the
above fix.

Ed Tomlinson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Jan Harkes

On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].
> 
> So that struct inode around is ok.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>      nervous. Why? Guess what granularity we allocate at, if we
>      just store pointers instead of the inode.u. Or do you like
>      every FS creating his own slab cache?

I've actually got the coda_inode_info (inode->u.u_coda_fs_i) split out
of the union in my development kernel. It doesn't shrink the size of the
struct inode yet, since Coda isn't the biggest user at the moment.

But it forced me to do several cleanups in the code, and it even
resulted in fixing a 'leak'. Not a real memory-loss leak, but we
left uninitialized inodes around in the icache for no good reason. Also,
changing a bit in a Coda-specific header file does trigger an almost
complete rebuild of the whole kernel (coda.h -> coda_fs_i.h -> fs.h ->
everything?)

The allocation overhead really isn't that bad. kmalloc/kfree are also
using the slabcache, and a trivial variant using a 'private' slabcache
gave me the counters to find the 'leak' I mentioned before.
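The kind of trivial counting wrapper being described might look like this (a sketch only; the names are invented, while kmem_cache_alloc/kmem_cache_free and the atomic ops are real 2.4 interfaces):

static kmem_cache_t *coda_icachep;		/* private cache, created at init */
static atomic_t coda_ii_alloced = ATOMIC_INIT(0);
static atomic_t coda_ii_freed   = ATOMIC_INIT(0);

static struct coda_inode_info *coda_alloc_ii(void)
{
	struct coda_inode_info *ci = kmem_cache_alloc(coda_icachep, GFP_KERNEL);

	if (ci)
		atomic_inc(&coda_ii_alloced);
	return ci;
}

static void coda_free_ii(struct coda_inode_info *ci)
{
	atomic_inc(&coda_ii_freed);
	kmem_cache_free(coda_icachep, ci);
}

Comparing the two counters at unmount time is enough to spot the "inodes left behind in the icache" kind of imbalance mentioned above.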

I can't really evaluate performance impacts. The struct inode is still
the same size, so for now there even is a little bit of additional
memory pressure. Also, Coda wasn't really developed to achieve high
performance but more to explore novel features.

Jan

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Jan Harkes wrote:

> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:
> 
> > BTW: Is it still less than one page? Then it doesn't make me
> >      nervous. Why? Guess what granularity we allocate at, if we
> >      just store pointers instead of the inode.u. Or do you like
> >      every FS creating his own slab cache?

Oh, for crying out loud. All it takes is half an hour per filesystem.
Here - completely untested patch that does it for NFS. Took about that
long. Absolutely straightforward, very easy to verify correctness.

Some stuff may need tweaking, but not much (e.g. some functions
should take nfs_inode_info instead of inodes, etc.). From the look
of flushd cache it seems that we would be better off with cyclic
lists instead of single-linked ones for the hash, but I didn't look
deep enough.

So consider the patch below as proof-of-concept. Enjoy:

diff -urN S4-pre6/fs/nfs/flushd.c S4-pre6-nfs/fs/nfs/flushd.c
--- S4-pre6/fs/nfs/flushd.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/flushd.c	Mon Apr 23 22:23:11 2001
@@ -162,11 +162,11 @@
 
 	if (NFS_FLAGS(inode) & NFS_INO_FLUSH)
 		goto out;
-	inode->u.nfs_i.hash_next = NULL;
+	NFS_I(inode)->hash_next = NULL;
 
 	q = &cache->inodes;
 	while (*q)
-		q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
 	*q = inode;
 
 	/* Note: we increase the inode i_count in order to prevent
@@ -188,9 +188,9 @@
 
 	q = &cache->inodes;
 	while (*q && *q != inode)
-		q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
 	if (*q) {
-		*q = inode->u.nfs_i.hash_next;
+		*q = NFS_I(inode)->hash_next;
 		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 		iput(inode);
 	}
@@ -238,8 +238,8 @@
 	cache->inodes = NULL;
 
 	while ((inode = next) != NULL) {
-		next = next->u.nfs_i.hash_next;
-		inode->u.nfs_i.hash_next = NULL;
+		next = NFS_I(next)->hash_next;
+		NFS_I(inode)->hash_next = NULL;
 		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 
 		if (flush) {
diff -urN S4-pre6/fs/nfs/inode.c S4-pre6-nfs/fs/nfs/inode.c
--- S4-pre6/fs/nfs/inode.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/inode.c	Mon Apr 23 22:43:45 2001
@@ -40,11 +40,14 @@
 #define NFSDBG_FACILITY		NFSDBG_VFS
 #define NFS_PARANOIA 1
 
+static kmem_cache_t *nfs_inode_cachep;
+
 static struct inode * __nfs_fhget(struct super_block *, struct nfs_fh *, struct nfs_fattr *);
 void nfs_zap_caches(struct inode *);
 static void nfs_invalidate_inode(struct inode *);
 
 static void nfs_read_inode(struct inode *);
+static void nfs_clear_inode(struct inode *);
 static void nfs_delete_inode(struct inode *);
 static void nfs_put_super(struct super_block *);
 static void nfs_umount_begin(struct super_block *);
@@ -52,6 +55,7 @@
 
 static struct super_operations nfs_sops = { 
 	read_inode:	nfs_read_inode,
+	clear_inode:	nfs_clear_inode,
 	put_inode:	force_delete,
 	delete_inode:	nfs_delete_inode,
 	put_super:	nfs_put_super,
@@ -96,23 +100,44 @@
 static void
 nfs_read_inode(struct inode * inode)
 {
+	struct nfs_inode_info *nfsi;
+
+	nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+	if (!nfsi)
+		goto Enomem;
+
 	inode->i_blksize = inode->i_sb->s_blocksize;
 	inode->i_mode = 0;
 	inode->i_rdev = 0;
+	inode->u.generic_ip = nfsi;
 	NFS_FILEID(inode) = 0;
 	NFS_FSID(inode) = 0;
 	NFS_FLAGS(inode) = 0;
-	INIT_LIST_HEAD(&inode->u.nfs_i.read);
-	INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
-	INIT_LIST_HEAD(&inode->u.nfs_i.commit);
-	INIT_LIST_HEAD(&inode->u.nfs_i.writeback);
-	inode->u.nfs_i.nread = 0;
-	inode->u.nfs_i.ndirty = 0;
-	inode->u.nfs_i.ncommit = 0;
-	inode->u.nfs_i.npages = 0;
+	INIT_LIST_HEAD(&nfsi->read);
+	INIT_LIST_HEAD(&nfsi->dirty);
+	INIT_LIST_HEAD(&nfsi->commit);
+	INIT_LIST_HEAD(&nfsi->writeback);
+	nfsi->nread = 0;
+	nfsi->ndirty = 0;
+	nfsi->ncommit = 0;
+	nfsi->npages = 0;
 	NFS_CACHEINV(inode);
 	NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
 	NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
+	return;
+
+Enomem:
+	make_bad_inode(inode);
+	return;
+}
+
+static void
+nfs_clear_inode(struct inode * inode)
+{
+	struct nfs_inode_info *p = NFS_I(inode);
+	inode->u.generic_ip = NULL;
+	if (p)
+		kmem_cache_free(nfs_inode_cachep, p);
 }
 
 static void
@@ -594,7 +619,7 @@
 	NFS_CACHE_ISIZE(inode) = fattr->size;
 	NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
 	NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
-	memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+	memcpy(NFS_FH(inode), fh, sizeof(struct nfs_fh));
 }
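The hunks above use an NFS_I() accessor and an nfs_inode_cachep cache whose definitions fall outside the quoted context; presumably they would look something like this (a guess at the missing pieces, not part of the posted patch):

/* probably in include/linux/nfs_fs.h */
#define NFS_I(inode)	((struct nfs_inode_info *)(inode)->u.generic_ip)

/* and somewhere in the NFS init path (the function name is invented) */
static int nfs_init_inodecache(void)
{
	nfs_inode_cachep = kmem_cache_create("nfs_inode_info",
					     sizeof(struct nfs_inode_info),
					     0, SLAB_HWCACHE_ALIGN,
					     NULL, NULL);
	return nfs_inode_cachep ? 0 : -ENOMEM;
}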

Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Andreas Dilger

Ed Tomlinson writes:
> > Consider, when I was doing some fs benchmark, my inode slab cache was
> > over 120k items on a 128MB machine.  At 480 bytes per inode, this is
> > almost 58 MB, close to half of RAM.  Reducing this to exactly ext2-sized
> > inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> > systems)!!! (This assumes nfs_inode_info is the largest).
> 
> Was this with a recent kernel (post Alexander Viro's dcache pressure fix)?
> If not I suggest rerunning the benchmark.  I had/have a patch to apply
> pressure to the dcache and icache from kswapd but it's not been needed here
> since the above fix.

Actually, it had the dcache patch but I'm not aware of a patch from Al to
change icache behaviour.  In any case, changing the icache behaviour is
not what I'm getting at here - having the union of all private inode
structs in the generic inode is a huge waste of RAM.  Even systems
that are heavy NFS users will likely still have a considerable amount
of local filesystem space (excluding NFS root systems, which are very few).

Al posted a patch to the NFS code which removes nfs_inode_info from the
inode union.  Since it is (AFAIK) the largest member of the union, we
have just saved 24 bytes per inode (hfs_inode_info is also rather large).
If we removed hfs_inode_info as well, we would save 108 bytes per inode,
about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

No point in punishing all users for filesystems they don't necessarily use.
Even for people that DO use NFS and/or HFS, they are probably still wasting
10k inodes * 108 bytes = 1MB of RAM for no good reason (because most of
their inodes are probably not NFS and/or HFS).

Cheers, Andreas
-- 
Andreas Dilger  \ If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: hundreds of mount --bind mountpoints?

2001-04-22 Thread Alexander Viro



On Sun, 22 Apr 2001, David L. Parsley wrote:

> Hi,
> 
> I'm still working on a packaging system for diskless (quasi-embedded)
> devices.  The root filesystem is all tmpfs, and I attach packages inside
> it.  Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
> considering using mount --bind for everything.  This appears to use very
> little memory, but I'm wondering if I'll run into problems when I start
> having many hundreds of bind mountings.  Any feel for this?

Memory use is sizeof(struct vfsmount) per binding. In principle, you can get
in trouble when size of /proc/mounts will get past 4Kb - you'll get only
first 4 (actually 3, IIRC) kilobytes, so stuff that relies on the contents
of said file may get unhappy. It's fixable, though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


