Re: [RFC] getgroups2 system call

2011-12-13 Thread Michael van Elst
mm_li...@pulsar-zone.net (Matthew Mondor) writes:

>What does NFS do in this case?  I seem to remember that it also imposes
>a sane size limit, possibly even below NGROUPS_MAX, is it really the
>case?  If so, would this also be acceptable?

NFS (or rather the underlying SunRPC) passes an array of 16 gids, which is
a common problem when you try to use groups for fine grained access control.
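
To make the effect of that limit concrete, here is a minimal sketch of what happens when a process belongs to more groups than the credential can carry. The struct and function names are hypothetical and the layout is simplified; it is not the real SunRPC wire format, only an illustration of the 16-slot cap mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Limit taken from the message above: the credential carries 16 gids. */
#define RPC_CRED_MAX_GIDS 16

/* Hypothetical, simplified credential; not the real SunRPC wire layout. */
struct rpc_cred_sketch {
    uid_t  uid;
    gid_t  gid;
    size_t ngids;                       /* number of valid entries in gids[] */
    gid_t  gids[RPC_CRED_MAX_GIDS];
};

/*
 * Copy up to RPC_CRED_MAX_GIDS supplementary groups into the credential.
 * Returns the number of groups that did not fit and were dropped.
 */
static size_t
cred_set_groups(struct rpc_cred_sketch *cred, const gid_t *groups, size_t n)
{
    assert(cred != NULL && (groups != NULL || n == 0));

    size_t kept = n < RPC_CRED_MAX_GIDS ? n : RPC_CRED_MAX_GIDS;
    for (size_t i = 0; i < kept; i++)
        cred->gids[i] = groups[i];
    cred->ngids = kept;
    return n - kept;
}
```

A user in 20 groups loses 4 of them here, and any access check on the server that depends on one of the dropped groups will fail.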


-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Lost file-system story

2011-12-13 Thread Michael van Elst
wo...@planix.ca ("Greg A. Woods") writes:

>easy, if not even easier, to do a "mount -u -r"

Does this work again?

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: [RFC] getgroups2 system call

2011-12-13 Thread YAMAMOTO Takashi
hi,

> At this point, I think I will fetch secondary groups through sysctl,
> this seems to be the point of least resistance.

do you mean to implement fuse_getgroups for NetBSD with the sysctl?
if you are adding a #ifdef NetBSD block to the fuse, can't it use
a NetBSD-specific sidechannel to get the info from an appropriate
puffs-supplied uucred?

YAMAMOTO Takashi


Re: [RFC] getgroups2 system call

2011-12-13 Thread Matthew Mondor
On Wed, 14 Dec 2011 07:04:06 +0100
m...@netbsd.org (Emmanuel Dreyfus) wrote:

> - a fixed length header is highly desirable for performance
> optimization. For instance glusterfs fetches the header and the data
> using readv(2) with an iovec that has two slots. That way it gets write
> data aligned on a page boundary.

What does NFS do in this case?  I seem to remember that it also imposes
a sane size limit, possibly even below NGROUPS_MAX, is it really the
case?  If so, would this also be acceptable?
-- 
Matt


Re: [RFC] getgroups2 system call

2011-12-13 Thread Emmanuel Dreyfus
Thor Lancelot Simon  wrote:

> > At this point, I think I will fetch secondary groups through sysctl,
> > this seems to be the point of least resistance.
> 
> You are not worried about security issues resulting from the fact
> that time will pass, and the process may do other operations which
> modify its credentials, before the operation completes?

I explored the option of modifying the FUSE protocol, and that is
tough. We can easily negotiate an extended FUSE header that contains
secondary groups, and I already submitted a patch that does exactly
that, but then we face two conflicting requirements:

- a fixed length header is highly desirable for performance
optimization. For instance glusterfs fetches the header and the data
using readv(2) with an iovec that has two slots. That way it gets write
data aligned on a page boundary.

- a fixed length header means an array of secondary groups with
NGROUPS_MAX slots, but Linux's NGROUPS_MAX is 65536, which means an
insane waste of space. Therefore we need an array of secondary groups
that is not bigger than the used slots.

As a tradeoff between the two requirements, I proposed that the
filesystem could request a minimum size for the secondary group array. That
way, the header would be of fixed length most of the time, except when
there are many groups (something that can only happen on Linux: NetBSD's
NGROUPS_MAX is much more reasonable). A large number of secondary groups
kills the write optimization, but the filesystem can always be configured to
request a bigger minimum secondary group array size on initialization, if
desired. That last proposal has been considered "a series of hacks to
make it conform to the requirements", therefore I am left with fetching
secondary groups asynchronously through sysctl.
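
The two-slot readv(2) trick described above can be sketched as follows. This is only an illustration under stated assumptions: struct req_header is a made-up fixed-length header, not the real FUSE layout, and read_request is a hypothetical helper. The point is that the split between the two iovec slots has to be known before the read, which is why a variable-length header defeats the optimization.

```c
/* Ask for POSIX/XSI declarations (posix_memalign, readv) explicitly. */
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical fixed-length request header; not the real FUSE layout. */
struct req_header {
    uint32_t len;      /* payload length in bytes */
    uint32_t opcode;
    uint32_t uid, gid, pid;
};

/*
 * Read one request from fd: the header goes into *hdr and the payload
 * into a freshly allocated page-aligned buffer returned through *data.
 * Returns the number of payload bytes read, or -1 on error.
 * The caller frees *data.
 */
static ssize_t
read_request(int fd, struct req_header *hdr, void **data, size_t maxdata)
{
    assert(hdr != NULL && data != NULL);

    long pagesize = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    if (pagesize <= 0 || posix_memalign(&buf, (size_t)pagesize, maxdata) != 0)
        return -1;

    struct iovec iov[2] = {
        { .iov_base = hdr, .iov_len = sizeof(*hdr) },  /* slot 0: header  */
        { .iov_base = buf, .iov_len = maxdata },       /* slot 1: payload */
    };
    ssize_t n = readv(fd, iov, 2);
    if (n < (ssize_t)sizeof(*hdr)) {
        free(buf);
        return -1;
    }
    *data = buf;
    return n - (ssize_t)sizeof(*hdr);
}
```

With a fixed sizeof(struct req_header), the payload always starts at the beginning of the aligned buffer. If the header grew by a per-request number of gids, slot 0 would be too short for some requests and the tail of the header would spill into the data buffer.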

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [RFC] getgroups2 system call

2011-12-13 Thread Thor Lancelot Simon
On Wed, Dec 14, 2011 at 06:05:53AM +0100, Emmanuel Dreyfus wrote:
> 
> At this point, I think I will fetch secondary groups through sysctl,
> this seems to be the point of least resistance.

You are not worried about security issues resulting from the fact
that time will pass, and the process may do other operations which
modify its credentials, before the operation completes?

This seems like a very dangerous idea for a filesystem.

-- 
Thor Lancelot Simon                                      t...@panix.com
  "All of my opinions are consistent, but I cannot present them all
   at once."    -Jean-Jacques Rousseau, On The Social Contract


Re: [RFC] getgroups2 system call

2011-12-13 Thread Emmanuel Dreyfus
Christos Zoulas  wrote:

> Don't you need a getuid2(pid_t pid)? 

uid, gid and pid are passed in the FUSE header, so we already have them.

> Why don't you add separate fuse messages to send and retrieve this
> information? Then the kernel can notify if these have changed...

That adds a lot of state in the kernel (you need to update creds on
process termination and setgroups(2) calls), which makes the thing even
harder to port. And on the performance front, new messages add latency.

At this point, I think I will fetch secondary groups through sysctl,
this seems to be the point of least resistance.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), "Matt W. Benjamin" 
 wrote:
Subject: Re: Lost file-system story
> 
> Why would sync not be effective under MNT_ASYNC?  Use of sync is not
> required to lead to consistency except with respect to an arbitrary
> point in time, but I don't think anyone ever believed otherwise.
> However, there should be no question of metadata never being written
> out if sync was run?

Well, sync(2) _could_ be effective even in the face of MNT_ASYNC, though
I'm not sure it will, or indeed even should be required to, have a
guaranteed ongoing beneficial effect on the on-disk consistency of a
filesystem that was mounted with MNT_ASYNC while activity continues to
proceed on the filesystem.

I.e. I don't expect sync(2) to suddenly enforce order on the writes that
it schedules to a MNT_ASYNC-mounted filesystem.  The ordering _may_ be a
natural result of the implementation, but if it's not then I wouldn't
consider that to be a bug, and I certainly wouldn't write any
documentation that suggested it might be a possible outcome.  MNT_ASYNC
means, to me at least, that even sync(2) can get away with doing writes
to a filesystem mounted with that flag in an order other than one which
would guarantee on-disk consistency to a level where fsck could repair
it.

I.e. sync(2) could possibly make things worse for MNT_ASYNC mounted
filesystems before it makes them better, and I don't see how that could
be considered to be a bug.

I do agree that IFF the filesystem is made quiescent, AND all writes
necessary and scheduled by sync(2) are allowed to come to completion,
THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be
consistent (and all data blocks must be flushed to the disk too).

However if you're going to go to that trouble (i.e. close all files open
on the MNT_ASYNC-mounted filesystem and somehow prevent any other file
operations of any kind on that filesystem until such time that you think
the sync(2) scheduled writes are all done), then it should be just as
easy, if not even easier, to do a "mount -u -r" (or "mount -u -o
noasync", or even "umount"), in which case you'll not only be sure that
the filesystem is consistent and secure, but you'll know when it reaches
this state (i.e. you won't have to guess about when sync(2)'s scheduled
work completes).

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675                        http://www.planix.com/




Re: Benchmark results for i386/amd64, native/Xen, TLS/noTLS

2011-12-13 Thread Thor Lancelot Simon
You should be aware when rerunning benchmarks MP that unless we have
had a radical improvement since June, build.sh will run faster with 14 CPUs
hot than with 24.

Thor


Re: Lost file-system story

2011-12-13 Thread James Chacon
On Tue, Dec 13, 2011 at 4:09 PM, Greg A. Woods  wrote:
> At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn  
> wrote:
> Subject: Re: Lost file-system story
>>
>> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
>> >
>> > fsck is supposed to handle *all* corruptions to the file system that can
>> > occur as part of normal file system operation in the kernel. It is doing
>> > best effort for others. It's a bug if it doesn't do the former and a
>> > potential missing feature for the latter.
>> >
>>
>> There are a lot of slips twixt cup and lip.  If you are really unlucky
>> you can get an outage at just the wrong time that will cause the
>> filesystem to be hosed so badly that fsck cannot recover it.  Sure, fsck
>> can run to completion but all you have is most of your FS in lost+found
>> which you have to be really really desperate to sort through.  I have
>> been working with UNIX for over 20 years now and I have only seen this
>> happen once and it was with a commercial UNIX.
>
> I've seen that happen more than once unfortunately.  SunOS-4 once I think.
>
> I agree 100% with Joerg here though.
>
> I'm pretty sure at least some of the times I've seen fsck do more damage
> than good it was due to a kernel bug or more breaking assumptions about
> ordered operations.
>
> There have of course also been some pretty serious bugs in various fsck
> implementations across the years and vendors.
>

I'd be suspicious of fsck failing on a regularly mounted disk with
corruption that can't otherwise be tracked to outside influences (bad
ram, bad disk cache, etc). I've seen some bizarre things happen on ram
errors over the years for instance.

James


Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn  
wrote:
Subject: Re: Lost file-system story
> 
> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
> > 
> > fsck is supposed to handle *all* corruptions to the file system that can
> > occur as part of normal file system operation in the kernel. It is doing
> > best effort for others. It's a bug if it doesn't do the former and a
> > potential missing feature for the latter.
> > 
> 
> There are a lot of slips twixt cup and lip.  If you are really unlucky
> you can get an outage at just the wrong time that will cause the
> filesystem to be hosed so badly that fsck cannot recover it.  Sure, fsck
> can run to completion but all you have is most of your FS in lost+found
> which you have to be really really desperate to sort through.  I have
> been working with UNIX for over 20 years now and I have only seen this
> happen once and it was with a commercial UNIX.

I've seen that happen more than once unfortunately.  SunOS-4 once I think.

I agree 100% with Joerg here though.

I'm pretty sure at least some of the times I've seen fsck do more damage
than good it was due to a kernel bug or more breaking assumptions about
ordered operations.

There have of course also been some pretty serious bugs in various fsck
implementations across the years and vendors.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675                        http://www.planix.com/




Benchmark results for i386/amd64, native/Xen, TLS/noTLS

2011-12-13 Thread Jean-Yves Migeon
So, I told some people that I'd run benchmarks to qualify the TLS 
(Thread Local Storage) vs no-TLS overhead, both for Xen and native 
setups, i386 and amd64.


For results, jump directly to the bottom of this mail.

=== Context ===

To compare GENERIC and XEN3 kernels, the GENERIC kernel was always 
started UP (with boot -1). The MP support in GENERIC made it _very_ 
unfair when compared to Xen. For MP benchmarking, see the Conclusion at 
the bottom of this mail.


Host: Core i7, 24 cores, 6GiB.

i386 was always PAE enabled, both for GENERIC and XEN3. The Xen 
hypervisor used is Xen 4.1.


The no-TLS build is a release build from 2011-03-10, a few days before 
TLS support got in for x86.


The TLS build is a -currentish release build, from the beginning of
December (did not take note of my last cvs up, sorry).


=== The following benchmarks were run ===

- blogbench: blogbench -d /tmp/ -i 10 -w 10 -W 10 -r 100

It tries to reproduce a typical blog setup, with readers, writers, and
updaters that modify content made (mostly) of text and images. The bench
itself runs with threads, and is quite heavyweight for the kernel (you
can run out of file descriptors very easily when launched brute-force).


It returns a "score" for read and write performance.

- sysbench: sysbench --num-threads=128 --thread-yields=1024 \
   --thread-locks=32 --test=threads run

Short description: http://sysbench.sourceforge.net/docs/#threads_mode

That bench was more focused on LWP's scheduling. I kept all the results 
(for those interested), but my main interest is the average time to 
complete the test.


- build.sh -j2 src runs

Obviously you all know what this one does. My interest is total build 
time. The same src was used all the time, but the build is native (no 
cross-compilation, no -m). obj/tools/dest/release were rm'ed each time 
before starting the build anew.


- bonnie++: bonnie++ -r 1024 -u nobody

A popular file-system and I/O benchmark. Creates, reads and writes files 
in bytes and blocks, either sequentially or randomly. Also used to 
stress hard drives.


=== Results ===

= blogbench =

http://www.netbsd.org/~jym/blogbench.results

While it uses threads to (re)produce blog-like behavior, I believe it's 
more representative of file-system and I/O performance rather than 
threads themselves, given the results.


There seems to be a trade-off between read and write performance with 
this benchmark: when one is above average, the other one is below. The 
GENERIC average perfs are a bit higher than XEN3 (~10%) though.


Other than that, there is no clear winner between TLS/no-TLS. They are 
in the same range, for both ports.


= sysbench =

http://www.netbsd.org/~jym/sysbench.results

More interestingly, sysbench stresses threads and scheduler a bit. 
There's a clear cut between TLS and no TLS systems, with TLS-enabled 
releases being slower by 15-20% (i386 and amd64) for the work completion.


Curiously, there's no real winner between XEN3 and GENERIC. i386 Xen is
slightly faster than native (by a few percent), while amd64 Xen is
slower than native (again by a few percent). This is likely due to the
pmap overhead: we flush mappings constantly between kernel and userland
with amd64 (the kernel runs in ring 3 for Xen amd64, just like a typical
userland process).


= bonnie++ =

http://www.netbsd.org/~jym/bonnie.1.html
http://www.netbsd.org/~jym/bonnie.2.html
http://www.netbsd.org/~jym/bonnie.3.html

Given that bonnie is a file-system benchmark, I did not expect too much 
deviation between TLS and no-TLS. That's generally the case, except for 
sequential (and random) file creation:


with GENERIC, the score is cut to about a third (ouch). The release from
2011-03-10 has a score flirting with the 9k-10k points, while the 
-currentish kernel is more like 3-3.5k. This result is 100% 
reproducible, but only with GENERIC kernels. XEN3 kernels remain 
unaffected, and TLS/noTLS releases are on par.


I am not sure that this "regression" comes from TLS (I can only express 
doubt, I have not investigated the thing technically). There has been 
lots of work in the vfs layer for the last few months, so one of these 
changes is likely to affect bonnie's results directly.


= build.sh runs =

http://www.netbsd.org/~jym/build.results

These benchmarks were run on a UP system, with build.sh invoked with
-j2. I did not notice any real regression there; TLS -current is faster
(3-4%) in almost every case when compared to no-TLS releases.


Please note that all runs were made with UP kernels.

=== Conclusion ===

Except for the bonnie++ regression, overall I did not notice any real
performance hit between a release from before the TLS commit and one from
December.


Yeah, "TLS vs noTLS" is a misnomer; it's rather "release from March vs
release from December, with bits of TLS."


I am currently making another run with a release from a few days after 
the TLS commit, and subsequent kernels up to -current to investigate the 

Re: Lost file-system story

2011-12-13 Thread Brett Lymn
On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
> 
> fsck is supposed to handle *all* corruptions to the file system that can
> occur as part of normal file system operation in the kernel. It is doing
> best effort for others. It's a bug if it doesn't do the former and a
> potential missing feature for the latter.
> 

There are a lot of slips twixt cup and lip.  If you are really unlucky
you can get an outage at just the wrong time that will cause the
filesystem to be hosed so badly that fsck cannot recover it.  Sure, fsck
can run to completion but all you have is most of your FS in lost+found
which you have to be really really desperate to sort through.  I have
been working with UNIX for over 20 years now and I have only seen this
happen once and it was with a commercial UNIX.

-- 
Brett Lymn




Re: bug #44412

2011-12-13 Thread Hauke Fath
At 14:55 -0400 on 22.7.2011, David Riley wrote:
>I filed this report a while back.  Someone else has tested my fix on
>non-PPC systems (x86, x86_64) and reported that it seems to work as well.
>I'm attaching the patch against -current here;

[appended to PR kern/44412]

>could someone give it a
>look and include it if it seems to be OK?  Or provide feedback if it
>doesn't.

Late feedback: I have set up a macppc G3/400 with a wm(4) network card,
running Netatalk 2.2.1, where your patch fixes AppleTalk both with netbsd-5
and HEAD kernels.

Before I commit your patches: Could tech-net people please give them a look?

hauke

--
"It's never straight up and down" (DEVO)




Re: [RFC] getgroups2 system call

2011-12-13 Thread Christos Zoulas
In article <20111213141930.ge15...@homeworld.netbsd.org>,
Emmanuel Dreyfus   wrote:
>Hello
>
>FUSE has no way to send the calling process's secondary groups to the
>filesystem. A filesystem that wants this information currently has to
>open a /proc file, read and parse the string representation of the
>groups, and close the file.
>
>This is not very good performance-wise, as the filesystem needs to
>open/read/close in /proc for each file operation that requires access
>control. Moreover, this can lead to deadlocks because of root vnode
>locking. 
>
>A first approach for improving this would be to fetch the secondary 
>groups using sysctl. Manuel Bouyer noted that this interface was never 
>meant to be fast, and therefore it would not address the performance issue.
>
>In a second attempt, I submitted a patch for libfuse, which would extend
>the FUSE protocol so that secondary groups could be optionally appended
>to FUSE headers, should the filesystem request it. That did not meet
>consensus, because on one hand, having a fixed-length header is desirable
>for optimizing performance. On the other hand, having a fixed-length
>header with an array of NGROUPS_MAX slots for secondary groups is just
>impossible on Linux, where NGROUPS_MAX is 65536.
>
>A third way was suggested on the fuse-devel mailing list: adding a
>system call to retrieve a process's secondary groups. The prototype
>would be modeled on getgroups(2):
>
>   int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid);
>
>If this is preferred, it could also be named getgroupspid(2)

Don't you need a getuid2(pid_t pid)? And where does this stop? Are
you going to need other information too? The approach of adding extra
syscalls to support FUSE will make the protocol more difficult to port
across different operating systems and the new syscalls will need to be
vetted against security and usefulness. I.e. is getgroups2 limited
to the superuser? This is for VOP_ACCESS() in userland, right?

Why don't you add separate fuse messages to send and retrieve this
information? Then the kernel can notify if these have changed...

christos



Re: Lost file-system story

2011-12-13 Thread Rhialto
> > Linux ext2 is not a Unix-based filesystem and Linux itself is not a
> > Unix-based kernel.
> 
> It's about as Unix-based as NetBSD is.  Unless you mean something
> strange by "Unix-based" - what _do_ you mean by it?

I'm guessing that the point is that ext2 is a scaled-up
re-implementation of minix-fs, which itself is a scaled-down
re-implementation of v7-fs. NetBSD's ffs is probably more directly based
on CSRG code.

In this roundabout fashion the Linux people arrived at a filesystem that
is feature-wise very similar to ffs, but without being on-disk
compatible.  Rather a missed chance, I'd say.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- There's no point being grown-up if you
\X/ rhialto/at/xs4all.nl    -- can't be childish sometimes. -The 4th Doctor


Fwd: Lost file-system story

2011-12-13 Thread Donald Allen
I did it again. gmail is trying to teach an old dog a new trick 


-- Forwarded message --
From: Donald Allen 
Date: Tue, Dec 13, 2011 at 10:04 AM
Subject: Re: Lost file-system story
To: David Holland 


On Tue, Dec 13, 2011 at 1:27 AM, David Holland  wrote:
> On Mon, Dec 12, 2011 at 03:31:09PM -0500, Donald Allen wrote:
>  > Note that this bug *may* not worsen the probability of recovery after
>  > a crash. It might even increase it! Think about it. If you boot NetBSD
>  > and mount a filesystem async, it is going to be correctly structured
>  > (or deemed to be by fsck) at boot time, or the system wouldn't mount
>  > it. Assuming the system is happy with it, if you then make changes to
>  > the filesystem,  but, because of this bug they are all in the buffer
>  > cache and never get written out, and then the system crashes ---
>  > you've got the filesystem you started with.
>
> Not necessarily;

I did say "*may*" (which I wrote because you could write a good book
about NetBSD internals with what I don't know about NetBSD internals).

> right off I can see two ways to get hosed:
>
> 1. Delete a large file. This causes the in-memory FS to believe the
> indirect blocks from this file are free; then it can reallocate them
> as data for some other file. That data then *does* get written out, so
> after crashing and rebooting the indirect blocks contain utter
> nonsense. The ffs fsck probably can't recover this.
>
> 2. Use a program that calls fsync(). This will write out some metadata
> blocks and not others; in the relatively benign case it will just
> update a previously-free inode and after crashing fsck will place the
> file in lost+found. In less benign cases it might do the converse of
> (1), and e.g. overwrite file data with indirect blocks, leading to
> crosslinked files or worse and probably total fsck failure.
>
> Not that any of this matters...

I agree. I was just indulging in some idle speculation, having some
fun. This bug should be fixed and I think the fix, as I said before,
should include a knob to allow the user to control the sync frequency
(maybe the knob is already there in sysctl and getting ignored for
some reason?). I'm running NetBSD again on my test machine, and have a
sleep-sync loop started in rc.local.

/Don


>
> --
> David A. Holland
> dholl...@netbsd.org


EOPNOTSUPP / ENOTSUP

2011-12-13 Thread David Holland
Currently we have both EOPNOTSUPP ("Operation not supported") and
ENOTSUP ("Not supported") errnos. EOPNOTSUPP is historical; ENOTSUP
was randomly added by POSIX relatively recently.

And lately I've noticed a tendency to conflate them, which isn't
healthy.

It is too late to do #define ENOTSUP EOPNOTSUPP (although we could do
this as part of the mythical libc major version bump to happen
Sometime(TM)) but I think we should take steps to prevent any more
confusion than already exists:

   - do not add new uses of ENOTSUP except where specifically mandated
     by POSIX;
   - avoid creating interfaces where ENOTSUP and EOPNOTSUPP are both
     expected error conditions (and remove any that have crept in);
   - mention this in errno(2);
   - also add a notice to this effect somewhere it will be seen
     (although I have no idea where that would be).

Alternatively we could try to articulate a distinction between them
and adjust the messages to match; but I'm not sure this is compatible
with the mandated behavior.
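
For illustration, this is roughly what a careful caller ends up writing today when an interface might hand back either errno. It is only a sketch; is_not_supported is a hypothetical helper, not an existing libc function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Return true if e means "not supported" under either spelling.
 * Written with ||, not switch: on systems where the two constants share
 * a value (glibc on Linux, for example), two case labels would be a
 * duplicate-case compile error.
 */
static bool
is_not_supported(int e)
{
    return e == EOPNOTSUPP || e == ENOTSUP;
}
```

Having to test both everywhere is exactly the kind of confusion the steps above are meant to stop from spreading.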

Thoughts anyone?

-- 
David A. Holland
dholl...@netbsd.org


Re: [RFC] getgroups2 system call

2011-12-13 Thread David Holland
On Tue, Dec 13, 2011 at 02:19:30PM +, Emmanuel Dreyfus wrote:
 > A third way was suggested on the fuse-devel mailing list: adding a
 > system call to retrieve a process's secondary groups. The prototype
 > would be modeled on getgroups(2):
 > 
 >  int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid);
 > 
 > If this is preferred, it could also be named getgroupspid(2)

Ugh.

I don't like it. The credentials for an operation should be passed
along with the operation, not fetched through a side channel. Even if
the operation is completely synchronous, using a side channel like
this is at best bodgy. If it's not completely synchronous, it's doomed
to fail horribly.

This interface would also make it permanently impossible to run fuse
servers with reduced privilege.

I would argue that if what you need is a hack, fuse itself was never
meant to be fast and so sysctl is an adequate method; if you want to
do it right, extend the protocol correctly.

(And in any event, it should be "int getpidgroups(pid_t pid, int
gidsetlen, gid_t *gidset)".)

-- 
David A. Holland
dholl...@netbsd.org


[RFC] getgroups2 system call

2011-12-13 Thread Emmanuel Dreyfus
Hello

FUSE has no way to send the calling process's secondary groups to the
filesystem. A filesystem that wants this information currently has to
open a /proc file, read and parse the string representation of the
groups, and close the file.

This is not very good performance-wise, as the filesystem needs to
open/read/close in /proc for each file operation that requires access
control. Moreover, this can lead to deadlocks because of root vnode
locking.
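
To give an idea of the parsing step involved, here is a minimal sketch. It assumes the Linux-style layout where /proc/<pid>/status contains a line of the form "Groups:" followed by whitespace-separated gids; that format is an assumption of the sketch, not something specified in this mail, and parse_groups_line is a hypothetical helper. The open/read/close around it, which is where the cost and the locking trouble come from, is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Parse a line of the assumed form "Groups:\t4 24 27\n" into out[].
 * Returns the number of gids stored (at most max), or -1 if the line
 * does not start with the "Groups:" tag or a number is out of range.
 * Parsing stops at the first token that is not a number.
 */
static int
parse_groups_line(const char *line, gid_t *out, size_t max)
{
    static const char tag[] = "Groups:";
    assert(line != NULL && out != NULL);

    if (strncmp(line, tag, sizeof(tag) - 1) != 0)
        return -1;

    const char *p = line + sizeof(tag) - 1;
    size_t n = 0;
    while (n < max) {
        char *end;
        errno = 0;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p)          /* no more digits: end of list */
            break;
        if (errno != 0)        /* value did not fit in unsigned long */
            return -1;
        out[n++] = (gid_t)v;
        p = end;
    }
    return (int)n;
}
```

Even in this stripped-down form, every access check pays for a string scan and a number conversion per group, on top of the file operations.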

A first approach for improving this would be to fetch the secondary 
groups using sysctl. Manuel Bouyer noted that this interface was never 
meant to be fast, and therefore it would not address the performance issue.

In a second attempt, I submitted a patch for libfuse, which would extend
the FUSE protocol so that secondary groups could be optionally appended
to FUSE headers, should the filesystem request it. That did not meet
consensus, because on one hand, having a fixed-length header is desirable
for optimizing performance. On the other hand, having a fixed-length
header with an array of NGROUPS_MAX slots for secondary groups is just
impossible on Linux, where NGROUPS_MAX is 65536.

A third way was suggested on the fuse-devel mailing list: adding a
system call to retrieve a process's secondary groups. The prototype
would be modeled on getgroups(2):

int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid);

If this is preferred, it could also be named getgroupspid(2)
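
For reference, this is the calling convention of the existing getgroups(2) that the proposal is modeled on: a first call with a zero length returns the number of groups, and a second call fills the array. The sketch below uses only the existing call on the current process; getgroups2 does not exist and is not called, and get_own_groups is a hypothetical wrapper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fetch the calling process's supplementary groups into a malloc'ed
 * array returned through *out. Returns the count, or -1 on error.
 * The caller frees *out.
 */
static int
get_own_groups(gid_t **out)
{
    assert(out != NULL);

    int n = getgroups(0, NULL);          /* size query: nothing is stored */
    if (n < 0)
        return -1;

    /* Allocate at least one slot so malloc(0) never comes into play. */
    gid_t *set = malloc((size_t)(n > 0 ? n : 1) * sizeof(*set));
    if (set == NULL)
        return -1;

    if (n > 0) {
        n = getgroups(n, set);           /* second call fills the array */
        if (n < 0) {
            free(set);
            return -1;
        }
    }
    *out = set;
    return n;
}
```

The proposed call would presumably be used the same way, with the pid of the process that issued the file operation as an extra argument.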

Opinions?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: Lost file-system story

2011-12-13 Thread Joerg Sonnenberger
On Tue, Dec 13, 2011 at 12:20:16AM -0500, Mouse wrote:
> > Possibilities other than zero or one are not useful in manual pages,
> 
> Then we can throw away fsck, because there is always _some_ chance the
> filesystem will be irreparable.  Memory, CPUs, disks, and the
> transports between them do fail, occasionally transiently.

fsck is supposed to handle *all* corruptions to the file system that can
occur as part of normal file system operation in the kernel. It is doing
best effort for others. It's a bug if it doesn't do the former and a
potential missing feature for the latter.

Joerg