Re: [RFC] getgroups2 system call
mm_li...@pulsar-zone.net (Matthew Mondor) writes: >What does NFS do in this case? I seem to remember that it also imposes >a sane size limit, possibly even below NGROUPS_MAX, is it really the >case? If so, would this also be acceptable? NFS (or rather the underlying SunRPC) passes an array of 16 gids, which is a common problem when you try to use groups for fine grained access control. -- -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."
Re: Lost file-system story
wo...@planix.ca ("Greg A. Woods") writes: >easy, if not even easier, to do a "mount -u -r" Does this work again? -- -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."
Re: [RFC] getgroups2 system call
hi, > At this point, I think I will fetch secondary groups through sysctl, > this seems to be the point of least resistance. do you mean to implement fuse_getgroups for NetBSD with the sysctl? if you are adding a #ifdef NetBSD block to the fuse, can't it use a NetBSD-specific sidechannel to get the info from an appropriate puffs-supplied uucred? YAMAMOTO Takashi
Re: [RFC] getgroups2 system call
On Wed, 14 Dec 2011 07:04:06 +0100 m...@netbsd.org (Emmanuel Dreyfus) wrote: > - a fixed length header is highly desirable for performance > optimization. For instance glusterfs fetches the header and the data > using readv(2) with an iovec that has two slots. That way it gets write > data aligned on a page boundary. What does NFS do in this case? I seem to remember that it also imposes a sane size limit, possibly even below NGROUPS_MAX; is that really the case? If so, would this also be acceptable? -- Matt
Re: [RFC] getgroups2 system call
Thor Lancelot Simon wrote: > > At this point, I think I will fetch secondary groups through sysctl, > > this seems to be the point of least resistance. > > You are not worried about security issues resulting from the fact > that time will pass, and the process may do other operations which > modify its credentials, before the operation completes? I explored the option of modifying the FUSE protocol, and that is tough. We can easily negotiate an extended FUSE header that contains secondary groups, and I already submitted a patch that does exactly that, but then we face two conflicting requirements: - a fixed length header is highly desirable for performance optimization. For instance glusterfs fetches the header and the data using readv(2) with an iovec that has two slots. That way it gets write data aligned on a page boundary. - a fixed length header means an array of secondary groups with NGROUPS_MAX slots, but Linux's NGROUPS_MAX is 65536, which means an insane waste of space. Therefore we need an array of secondary groups that is not bigger than the used slots. As a tradeoff between the two requirements, I proposed that the filesystem could request a minimum size for the secondary group array. That way, the header would be of fixed length most of the time, except when there are many groups (something that can only happen on Linux: NetBSD's NGROUPS_MAX is much more reasonable). A large number of secondary groups kills the write optimization, but the filesystem can always be configured to request a bigger minimum secondary group array size on initialization, if desired. That last proposal has been considered "a series of hacks to make it confirm to the requirements", therefore I am left with fetching secondary groups asynchronously through sysctl. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
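The readv(2) pattern mentioned in this message, one iovec slot for a fixed-length header and one for the payload, can be sketched roughly as follows. This is only an illustration under stated assumptions, not glusterfs or libfuse code: `struct req_header` and `read_request()` are hypothetical names, and the header layout is invented for the example rather than taken from the FUSE protocol.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical fixed-length header; NOT the real FUSE header layout. */
struct req_header {
	uint32_t len;		/* payload length in bytes */
	uint32_t opcode;	/* request type */
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
};

/*
 * Read one request with a single readv(2) call: slot 0 receives the
 * fixed-length header, slot 1 receives the payload.  Because the header
 * size is known in advance, the caller can hand in a page-aligned
 * payload buffer and the data lands in it directly, with no copy.
 * Returns the total number of bytes read, or -1 on error.
 */
static ssize_t
read_request(int fd, struct req_header *hdr, void *payload, size_t payload_len)
{
	struct iovec iov[2];

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = payload;
	iov[1].iov_len = payload_len;

	return readv(fd, iov, 2);
}
```

A caller that wants the alignment benefit would presumably allocate `payload` with something like posix_memalign(3). With a variable-length header the split point between the two slots is no longer known before the read, which is the conflict described above.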
Re: [RFC] getgroups2 system call
On Wed, Dec 14, 2011 at 06:05:53AM +0100, Emmanuel Dreyfus wrote: > > At this point, I think I will fetch secondary groups through sysctl, > this seems to be the point of least resistance. You are not worried about security issues resulting from the fact that time will pass, and the process may do other operations which modify its credentials, before the operation completes? This seems like a very dangerous idea for a filesystem. -- Thor Lancelot Simon t...@panix.com "All of my opinions are consistent, but I cannot present them all at once." -Jean-Jacques Rousseau, On The Social Contract
Re: [RFC] getgroups2 system call
Christos Zoulas wrote: > Don't you need a getuid2(pid_t pid)? uid, gid and pid are passed in the FUSE header, so we already have them. > Why don't you add separate fuse messages to send and retrieve this > information? Then the kernel can notify if these have changed... That adds a lot of state in the kernel (you need to update creds on process termination and setgroups(2) calls), which makes the thing even harder to port. And on the performance front, new messages add latency. At this point, I think I will fetch secondary groups through sysctl, this seems to be the point of least resistance. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: Lost file-system story
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), "Matt W. Benjamin" wrote: Subject: Re: Lost file-system story > > Why would sync not be effective under MNT_ASYNC? Use of sync is not > required to lead to consistency except with respect to an arbitrary > point in time, but I don't think anyone ever believed otherwise. > However, there should be no question of metadata never being written > out if sync was run? Well sync(2) _could_ be effective even in the face of MNT_ASYNC, though I'm not sure it will, or indeed even should be required to, have a guaranteed ongoing beneficial effect on the on-disk consistency of a filesystem that was mounted with MNT_ASYNC while activity continues to proceed on the filesystem. I.e. I don't expect sync(2) to suddenly enforce order on the writes that it schedules to a MNT_ASYNC-mounted filesystem. The ordering _may_ be a natural result of the implementation, but if it's not then I wouldn't consider that to be a bug, and I certainly wouldn't write any documentation that suggested it might be a possible outcome. MNT_ASYNC means, to me at least, that even sync(2) can get away with doing writes to a filesystem mounted with that flag in an order other than one which would guarantee on-disk consistency to a level where fsck could repair it. I.e. sync(2) could possibly make things worse for MNT_ASYNC mounted filesystems before it makes them better, and I don't see how that could be considered to be a bug. I do agree that IFF the filesystem is made quiescent, AND all writes necessary and scheduled by sync(2) are allowed to come to completion, THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be consistent (and all data blocks must be flushed to the disk too). However if you're going to go to that trouble (i.e. close all files open on the MNT_ASYNC-mounted filesystem and somehow prevent any other file operations of any kind on that filesystem until such time that you think the sync(2) scheduled writes are all done), then it should be just as easy, if not even easier, to do a "mount -u -r" (or "mount -u -o noasync", or even "umount"), in which case you'll not only be sure that the filesystem is consistent and secure, but you'll know when it reaches this state (i.e. you won't have to guess about when sync(2)'s scheduled work completes). -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Benchmark results for i386/amd64, native/Xen, TLS/noTLS
You should be aware when rerunning benchmarks MP that unless we have had a radical improvement since June, build.sh will run faster with 14 CPUs hot than with 24. Thor
Re: Lost file-system story
On Tue, Dec 13, 2011 at 4:09 PM, Greg A. Woods wrote: > At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn > wrote: > Subject: Re: Lost file-system story >> >> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote: >> > >> > fsck is supposed to handle *all* corruptions to the file system that can >> > occur as part of normal file system operation in the kernel. It is doing >> > best effort for others. It's a bug if it doesn't do the former and a >> > potential missing feature for the latter. >> > >> >> There are a lot of slips twixt cup and lip. If you are really unlucky >> you can get an outage at just the wrong time that will cause the >> filesystem to be hosed so badly that fsck cannot recover it. Sure, fsck >> can run to completion but all you have is most of your FS in lost+found >> which you have to be really really desperate to sort through. I have >> been working with UNIX for over 20years now and I have only seen this >> happen once and it was with a commercial UNIX. > > I've seen that happen more than once unfortunately. SunOS-4 once I think. > > I agree 100% with Joerg here though. > > I'm pretty sure at least some of the times I've seen fsck do more damage > than good it was due to a kernel bug or more breaking assumptions about > ordered operations. > > There have of course also been some pretty serious bugs in various fsck > implementations across the years and vendors. > I'd be suspicious of fsck failing on a regularly mounted disk with corruption that can't otherwise be tracked to outside influences (bad ram, bad disk cache, etc). I've seen some bizarre things happen on ram errors over the years for instance. James
Re: Lost file-system story
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn wrote: Subject: Re: Lost file-system story > > On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote: > > > > fsck is supposed to handle *all* corruptions to the file system that can > > occur as part of normal file system operation in the kernel. It is doing > > best effort for others. It's a bug if it doesn't do the former and a > > potential missing feature for the latter. > > > > There are a lot of slips twixt cup and lip. If you are really unlucky > you can get an outage at just the wrong time that will cause the > filesystem to be hosed so badly that fsck cannot recover it. Sure, fsck > can run to completion but all you have is most of your FS in lost+found > which you have to be really really desperate to sort through. I have > been working with UNIX for over 20 years now and I have only seen this > happen once and it was with a commercial UNIX. I've seen that happen more than once unfortunately. SunOS-4 once I think. I agree 100% with Joerg here though. I'm pretty sure at least some of the times I've seen fsck do more damage than good it was due to a kernel bug or more breaking assumptions about ordered operations. There have of course also been some pretty serious bugs in various fsck implementations across the years and vendors. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Benchmark results for i386/amd64, native/Xen, TLS/noTLS
So, I told some people that I'd run benchmarks to qualify the TLS (Thread Local Storage) vs no-TLS overhead, both for Xen and native setups, i386 and amd64. For results, jump directly to the bottom of this mail.

=== Context ===

To compare GENERIC and XEN3 kernels, the GENERIC kernel was always started UP (with boot -1). The MP support in GENERIC made it _very_ unfair when compared to Xen. For MP benchmarking, see the Conclusion at the bottom of this mail.

Host: Core i7, 24 cores, 6GiB. i386 was always PAE enabled, both for GENERIC and XEN3. The Xen hypervisor used is Xen 4.1.

The no-TLS build is a release build from 2011-03-10, a few days before TLS support got in for x86. The TLS build is a -currentish release build, from the beginning of December (I did not take note of my last cvs up, sorry).

=== The following benchmarks were run ===

- blogbench: blogbench -d /tmp/ -i 10 -w 10 -W 10 -r 100
  It (tries to) reproduce a typical blog setup, with readers, writers, and updaters that modify content made (mostly) of text and images. The bench itself runs with threads, and is quite heavyweight for the kernel (you can run out of file descriptors very easily when launched brute-force). It returns a "score" for read and write performance.

- sysbench: sysbench --num-threads=128 --thread-yields=1024 \
  --thread-locks=32 --test=threads run
  Short description: http://sysbench.sourceforge.net/docs/#threads_mode
  That bench was more focused on LWP scheduling. I kept all the results (for those interested), but my main interest is the average time to complete the test.

- build.sh -j2 src runs
  Obviously you all know what this one does. My interest is total build time. The same src was used all the time, but the build is native (no cross-compilation, no -m). obj/tools/dest/release were rm'ed each time before starting the build anew.

- bonnie++: bonnie++ -r 1024 -u nobody
  A popular file-system and I/O benchmark. Creates, reads and writes files in bytes and blocks, either sequentially or randomly. Also used to stress hard drives.

=== Results ===

= blogbench =

http://www.netbsd.org/~jym/blogbench.results

While it uses threads to (re)produce blog-like behavior, I believe it's more representative of file-system and I/O performance than of threads themselves, given the results. There seems to be a trade-off between read and write performance with this benchmark: when one is above average, the other one is below. The GENERIC average perfs are a bit higher than XEN3 (~10%) though. Other than that, there is no clear winner between TLS/no-TLS. They are in the same range, for both ports.

= sysbench =

http://www.netbsd.org/~jym/sysbench.results

More interestingly, sysbench stresses threads and the scheduler a bit. There's a clear cut between TLS and no-TLS systems, with TLS-enabled releases being slower by 15-20% (i386 and amd64) for the work completion. Curiously, there's no real winner between XEN3 and GENERIC. i386 Xen is slightly faster than native (by a few percent), while amd64 Xen is slower than native (again by a few percent). That is likely due to the pmap overhead: we flush mappings constantly between kernel and userland with amd64 (the kernel runs in ring 3 for Xen amd64, just like a typical userland process).

= bonnie++ =

http://www.netbsd.org/~jym/bonnie.1.html
http://www.netbsd.org/~jym/bonnie.2.html
http://www.netbsd.org/~jym/bonnie.3.html

Given that bonnie is a file-system benchmark, I did not expect too much deviation between TLS and no-TLS. That's generally the case, except for sequential (and random) file creation: with GENERIC, the score drops to about a third (ouch). The release from 2011-03-10 has a score flirting with 9k-10k points, while the -currentish kernel is more like 3-3.5k. This result is 100% reproducible, but only with GENERIC kernels. XEN3 kernels remain unaffected, and TLS/noTLS releases are on par. I am not sure that this "regression" comes from TLS (I can only express doubt, I have not investigated the thing technically). There has been lots of work in the vfs layer over the last few months, so one of these changes is likely to affect bonnie's results directly.

= build.sh runs =

http://www.netbsd.org/~jym/build.results

These benchmarks were run with an UP system, build.sh being invoked with -j2. I can't notice any real regression there; TLS -current is faster (3-4%) in almost every case when compared to no-TLS releases. Please note that all runs were made with UP kernels.

=== Conclusion ===

Except for the bonnie++ regression, overall I did not notice any real performance hit between a release from before the TLS commit and one from December. Yeah, the "TLS vs noTLS" release is a misnomer, it's rather "release from March vs release from December, with bits of TLS." I am currently making another run with a release from a few days after the TLS commit, and subsequent kernels up to -current to investigate the
Re: Lost file-system story
On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote: > > fsck is supposed to handle *all* corruptions to the file system that can > occur as part of normal file system operation in the kernel. It is doing > best effort for others. It's a bug if it doesn't do the former and a > potential missing feature for the latter. > There are a lot of slips twixt cup and lip. If you are really unlucky you can get an outage at just the wrong time that will cause the filesystem to be hosed so badly that fsck cannot recover it. Sure, fsck can run to completion but all you have is most of your FS in lost+found which you have to be really really desperate to sort through. I have been working with UNIX for over 20 years now and I have only seen this happen once and it was with a commercial UNIX. -- Brett Lymn
Re: bug #44412
At 14:55 -0400 on 22.7.2011, David Riley wrote: >I filed this report a while back. Someone else has tested my fix on >non-PPC systems (x86, x86_64) and reported that it seems to work as well. >I'm attaching the patch against -current here; [appended to PR kern/44412] >could someone give it a >look and include it if it seems to be OK? Or provide feedback if it >doesn't. Late feedback: I have set up a macppc G3/400 with a wm(4) network card, running Netatalk 2.2.1, where your patch fixes AppleTalk both with netbsd-5 and HEAD kernels. Before I commit your patches: Could tech-net people please give them a look? hauke -- "It's never straight up and down" (DEVO)
Re: [RFC] getgroups2 system call
In article <20111213141930.ge15...@homeworld.netbsd.org>, Emmanuel Dreyfus wrote: >Hello > >FUSE has no way to send the calling process secondary groups to the >filesystem. A filesystem that wants this operation currently has to >open a /proc file, read and parse the string representation of the >groups, and close the file. > >This is not very good performance-wise, as the filesystem needs to >open/read/close in /proc for each file operation that require access >control. Moreover, this can lead to deadlocks because of root vnode >locking. > >A first approach for improving this would be to fetch the secondary >groups using sysctl. Manuel Bouyer noted that this interface was never >meant to be fast, and therefore it would not address the performance issue. > >In a second attempt, I submitted a patch for libfuse, which would extend >the FUSE protocol so that secondary groups could be optionally appended >to FUSE headers, should the filesystem request it. That did not meet >consensus, because on one hand, having a fixed-length header is desirable >for optimizing performances. On the other hand, having a fixed-length >header with an array of NGROUPS_MAX slots for secondary groups is just >impossible on Linux, where NGROUPS_MAX is 65536. > >A third way was suggested on the fuse-devel mailing list: adding a >system call to retrieve a process' secondary groups. The prototype >would be modelled on getgroups(2): > > int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid); > >If this is preferred, it could also be named getgroupspid(2) Don't you need a getuid2(pid_t pid)? And where does this stop? Are you going to need other information too? The approach of adding extra syscalls to support FUSE will make the protocol more difficult to port across different operating systems and the new syscalls will need to be vetted against security and usefulness. I.e. is getgroups2 limited to the superuser? This is for VOP_ACCESS() in userland, right? 
Why don't you add separate fuse messages to send and retrieve this information? Then the kernel can notify if these have changed... christos
Re: Lost file-system story
> > Linux ext2 is not a Unix-based filesystem and Linux itself is not a > > Unix-based kernel. > > It's about as Unix-based as NetBSD is. Unless you mean something > strange by "Unix-based" - what _do_ you mean by it? I'm guessing that the point is that ext2 is a scaled-up re-implementation of minix-fs, which itself is a scaled-down re-implementaion of v7-fs. NetBSD's ffs is probably more directly based on CSRG code. In this roundabout fashion the Linux people arrived at a filesystem that is feature-wise very similar to ffs, but without being on-disk compatible. Rather a missed chance, I'd say. -Olaf. -- ___ Olaf 'Rhialto' Seibert -- There's no point being grown-up if you \X/ rhialto/at/xs4all.nl-- can't be childish sometimes. -The 4th Doctor
Fwd: Lost file-system story
I did it again. gmail is trying to teach an old dog a new trick -- Forwarded message -- From: Donald Allen Date: Tue, Dec 13, 2011 at 10:04 AM Subject: Re: Lost file-system story To: David Holland On Tue, Dec 13, 2011 at 1:27 AM, David Holland wrote: > On Mon, Dec 12, 2011 at 03:31:09PM -0500, Donald Allen wrote: > > Note that this bug *may* not worsen the probability of recovery after > > a crash. It might even increase it! Think about it. If you boot NetBSD > > and mount a filesystem async, it is going to be correctly structured > > (or deemed to be by fsck) at boot time, or the system wouldn't mount > > it. Assuming the system is happy with it, if you then make changes to > > the filesystem, but, because of this bug they are all in the buffer > > cache and never get written out, and then the system crashes --- > > you've got the filesystem you started with. > > Not necessarily; I did say "*may*" (which I wrote because you could write a good book about NetBSD internals with what I don't know about NetBSD internals). > right off I can see two ways to get hosed: > > 1. Delete a large file. This causes the in-memory FS to believe the > indirect blocks from this file are free; then it can reallocate them > as data for some other file. That data then *does* get written out, so > after crashing and rebooting the indirect blocks contain utter > nonsense. The ffs fsck probably can't recover this. > > 2. Use a program that calls fsync(). This will write out some metadata > blocks and not others; in the relatively benign case it will just > update a previously-free inode and after crashing fsck will place the > file in lost+found. In less benign cases it might do the converse of > (1), and e.g. overwrite file data with indirect blocks, leading to > crosslinked files or worse and probably total fsck failure. > > Not that any of this matters... I agree. I was just indulging in some idle speculation, having some fun. 
This bug should be fixed and I think the fix, as I said before, should include a knob to allow the user to control the sync frequency (maybe the knob is already there in sysctl and getting ignored for some reason?). I'm running NetBSD again on my test machine, and have a sleep-sync loop started in rc.local. /Don > > -- > David A. Holland > dholl...@netbsd.org
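The "sleep-sync loop" workaround mentioned in this message would normally be a one-line shell loop in rc.local; the same idea can be sketched in C as below. This is an assumption-laden illustration, not the poster's actual script: `periodic_sync()` and its parameters are hypothetical names, and the interval used on the test machine is not stated in the thread.

```c
#include <unistd.h>

/*
 * Call sync(2) every `interval` seconds.  With rounds == 0 the loop
 * never ends, which is what a boot-time workaround would want; a
 * non-zero `rounds` bounds the loop so it can be exercised in a test.
 * Returns the number of sync(2) calls made.
 */
static unsigned long
periodic_sync(unsigned int interval, unsigned long rounds)
{
	unsigned long done = 0;

	for (;;) {
		sync();		/* ask the kernel to write out dirty buffers */
		done++;
		if (rounds != 0 && done >= rounds)
			break;	/* bounded run: stop without a final sleep */
		sleep(interval);
	}
	return done;
}
```

As discussed elsewhere in the thread, periodic sync(2) calls only limit how much unwritten data can pile up; they do not impose any write ordering on a filesystem mounted with MNT_ASYNC.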
EOPNOTSUPP / ENOTSUP
Currently we have both EOPNOTSUPP ("Operation not supported") and ENOTSUP ("Not supported") errnos. EOPNOTSUPP is historical; ENOTSUP was randomly added by POSIX relatively recently. And lately I've noticed a tendency to conflate them, which isn't healthy. It is too late to do #define ENOTSUP EOPNOTSUPP (although we could do this as part of the mythical libc major version bump to happen Sometime(TM)) but I think we should take steps to prevent any more confusion than already exists: - do not add new uses of ENOTSUP except where specifically mandated by POSIX; - avoid creating interfaces where ENOTSUP and EOPNOTSUPP are both expected error conditions (and remove any that have crept in); - mention this in errno(2); - also add a notice to this effect somewhere it will be seen (although I have no idea where that would be); Alternatively we could try to articulate a distinction between them and adjust the messages to match; but I'm not sure this is compatible with the mandated behavior. Thoughts anyone? -- David A. Holland dholl...@netbsd.org
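For code that already has to cope with callers or callees conflating the two values, one defensive pattern is a small helper that accepts either. The sketch below is only an illustration; `is_not_supported()` is a hypothetical name, not an existing libc or kernel function.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Return true if `error` is either of the two "not supported" errnos.
 * A plain comparison is used instead of a switch statement on purpose:
 * on systems where ENOTSUP and EOPNOTSUPP share one numeric value,
 * listing both as case labels would be a duplicate-case compile error.
 */
static bool
is_not_supported(int error)
{
	return error == EOPNOTSUPP || error == ENOTSUP;
}
```

A helper like this papers over existing confusion at the call site; it is not a substitute for the steps proposed above, which aim to stop new interfaces from returning both.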
Re: [RFC] getgroups2 system call
On Tue, Dec 13, 2011 at 02:19:30PM +, Emmanuel Dreyfus wrote: > A third way was suggested on the fuse-devel mailing list: adding a > system call to retrieve a process' secondary groups. The prototype > would be modelled on getgroups(2): > > int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid); > > If this is preferred, it could also be named getgroupspid(2) Ugh. I don't like it. The credentials for an operation should be passed along with the operation, not fetched through a side channel. Even if the operation is completely synchronous, using a side channel like this is at best bodgy. If it's not completely synchronous, it's doomed to fail horribly. This interface would also make it permanently impossible to run fuse servers with reduced privilege. I would argue that if what you need is a hack, fuse itself was never meant to be fast and so sysctl is an adequate method; if you want to do it right, extend the protocol correctly. (And in any event, it should be "int getpidgroups(pid_t pid, int gidsetlen, gid_t *gidset)".) -- David A. Holland dholl...@netbsd.org
[RFC] getgroups2 system call
Hello FUSE has no way to send the calling process's secondary groups to the filesystem. A filesystem that wants this information currently has to open a /proc file, read and parse the string representation of the groups, and close the file. This is not very good performance-wise, as the filesystem needs to open/read/close in /proc for each file operation that requires access control. Moreover, this can lead to deadlocks because of root vnode locking. A first approach for improving this would be to fetch the secondary groups using sysctl. Manuel Bouyer noted that this interface was never meant to be fast, and therefore it would not address the performance issue. In a second attempt, I submitted a patch for libfuse, which would extend the FUSE protocol so that secondary groups could be optionally appended to FUSE headers, should the filesystem request it. That did not meet consensus, because on one hand, having a fixed-length header is desirable for optimizing performance. On the other hand, having a fixed-length header with an array of NGROUPS_MAX slots for secondary groups is just impossible on Linux, where NGROUPS_MAX is 65536. A third way was suggested on the fuse-devel mailing list: adding a system call to retrieve a process's secondary groups. The prototype would be modelled on getgroups(2): int getgroups2(int gidsetlen, gid_t *gidset, pid_t pid); If this is preferred, it could also be named getgroupspid(2). Opinions? -- Emmanuel Dreyfus m...@netbsd.org
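Since the proposed call is modelled on getgroups(2), the usual calling convention for the existing call gives an idea of how a filesystem would use it. The sketch below uses only the real getgroups(2), which reports the groups of the calling process itself; `fetch_own_groups()` is a hypothetical helper name, and nothing here implements the proposed getgroups2().

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fetch the calling process's supplementary groups with getgroups(2).
 * On success, stores a malloc'ed array in *out (the caller frees it)
 * and returns the number of groups; returns -1 on error.
 */
static int
fetch_own_groups(gid_t **out)
{
	/* A size of 0 asks only for the number of groups. */
	int n = getgroups(0, NULL);
	if (n < 0)
		return -1;

	/* Allocate at least one slot so malloc(0) is never relied upon. */
	gid_t *set = malloc((size_t)(n > 0 ? n : 1) * sizeof(*set));
	if (set == NULL)
		return -1;

	/* Second call fills the array and returns how many were stored. */
	n = getgroups(n, set);
	if (n < 0) {
		free(set);
		return -1;
	}

	*out = set;
	return n;
}
```

A per-pid variant would presumably follow the same two-call pattern, with the extra difficulty that another process's group list can change between the sizing call and the fetching call.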
Re: Lost file-system story
On Tue, Dec 13, 2011 at 12:20:16AM -0500, Mouse wrote: > > Possibilities other than zero or one are not useful in manual pages, > > Then we can throw away fsck, because there is always _some_ chance the > filesystem will be irreparable. Memory, CPUs, disks, and the > transports between them do fail, occasionally transiently. fsck is supposed to handle *all* corruptions to the file system that can occur as part of normal file system operation in the kernel. It is doing best effort for others. It's a bug if it doesn't do the former and a potential missing feature for the latter. Joerg