Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Bodo Eggert
Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Mon, 2008-01-28 at 05:38 +0100, Andi Kleen wrote:
>> On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
>> > On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:

>> > > The problem is that it's not a race in who gets to do its thing first,
>> > > but a parallel reader can actually see a corrupted value from the two
>> > > independent words on 32bit (e.g. during a 4GB). And this could actually
>> > > completely corrupt f_pos when it happens with two racing relative seeks
>> > > or read/write()s
>> > > 
>> > > I would consider that a bug.
>> > 
>> > I disagree. The corruption occurs because this isn't a situation that is
>> > allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
>> > to here?
>> 
>> No specific spec, just general quality of implementation. We normally don't
>> have non thread safe system calls even if it was in theory allowed by some
>> specification.
> 
> We've had the existing implementation for quite some time. The arguments
> against changing it have been the same all along: if your application
> wants to share files between threads, the portability argument implies
> that you should either use pread/pwrite or use a mutex or some other
> form of synchronisation primitive in order to ensure that
> lseek()/read()/write() do not overlap.

Does anything in the kernel depend on f_pos being valid?
E.g. is it possible to read beyond the EOF using this race, or to have files
larger than the ulimit?

If not, update the manpage and be done.
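
To make the 32-bit failure mode from the quoted discussion concrete, here is
a small userspace model (not kernel code). It assumes f_pos is stored as two
32-bit words that are written independently; split_pos and torn_read are
names made up for this illustration.

```c
#include <stdint.h>

/* A 64-bit file offset as a 32-bit machine would store it: two words. */
struct split_pos {
    uint32_t lo;
    uint32_t hi;
};

static struct split_pos split(uint64_t v)
{
    struct split_pos p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t join(struct split_pos p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/*
 * Value seen by a reader that races with an update from old_pos to new_pos.
 * lo_is_new != 0: the low word was already updated, the high word not yet.
 * lo_is_new == 0: the high word was already updated, the low word not yet.
 */
static uint64_t torn_read(uint64_t old_pos, uint64_t new_pos, int lo_is_new)
{
    struct split_pos o = split(old_pos);
    struct split_pos n = split(new_pos);
    struct split_pos seen;

    seen.lo = lo_is_new ? n.lo : o.lo;
    seen.hi = lo_is_new ? o.hi : n.hi;
    return join(seen);
}
```

With an old offset of 0xFFFFFFFF and a new one of 0x100000000, torn_read()
yields either 0x1FFFFFFFF or 0, depending on which half was written first.
Neither value was ever stored by anybody.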

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [14/18] BKL-removal: Add unlocked_fasync

2008-01-27 Thread Bodo Eggert
> +++ linux/fs/fcntl.c
> @@ -240,11 +240,15 @@ static int setfl(int fd, struct file * f
>  
> lock_kernel();
> if ((arg ^ filp->f_flags) & FASYNC) {
> -   if (filp->f_op && filp->f_op->fasync) {
> +   if (filp->f_op && filp->f_op->unlocked_fasync)
> +   error = filp->f_op->unlocked_fasync(fd, filp,
> +   !!(arg & FASYNC));
> +   else if (filp->f_op && filp->f_op->fasync) {
> error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 0);
> if (error < 0)
> goto out;

No goto if you use unlocked_fasync?

> }
> +   /* AK: no else error = -EINVAL here? */
> }
>  
> filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
> --
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
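
Regarding the missing goto in the unlocked_fasync branch: one possible shape
is to run a single error check after both callbacks. The following is a
userspace mock, not the kernel's setfl(); mock_file, mock_fops, mock_setfl,
FASYNC_FLAG and the demo callbacks are all hypothetical names, and the flag
handling is simplified.

```c
#include <stddef.h>

#define FASYNC_FLAG 0x2000u  /* hypothetical stand-in for FASYNC */

struct mock_file;

struct mock_fops {
    int (*fasync)(int fd, struct mock_file *filp, int on);
    int (*unlocked_fasync)(int fd, struct mock_file *filp, int on);
};

struct mock_file {
    unsigned int f_flags;
    const struct mock_fops *f_op;
};

/* Demo callbacks: one always fails, one always succeeds. */
static int demo_fail(int fd, struct mock_file *filp, int on)
{
    (void)fd; (void)filp; (void)on;
    return -22;
}

static int demo_ok(int fd, struct mock_file *filp, int on)
{
    (void)fd; (void)filp; (void)on;
    return 0;
}

static int mock_setfl(int fd, struct mock_file *filp, unsigned int arg)
{
    int error = 0;

    if ((arg ^ filp->f_flags) & FASYNC_FLAG) {
        int on = !!(arg & FASYNC_FLAG);

        if (filp->f_op && filp->f_op->unlocked_fasync)
            error = filp->f_op->unlocked_fasync(fd, filp, on);
        else if (filp->f_op && filp->f_op->fasync)
            error = filp->f_op->fasync(fd, filp, on);

        /* One check covers both callbacks, so neither path can
         * fall through and update f_flags after a failure. */
        if (error < 0)
            return error;
    }

    filp->f_flags = arg;  /* simplified: the real code masks with SETFL_MASK */
    return 0;
}
```

With this layout a failing unlocked_fasync leaves f_flags untouched, the same
as a failing fasync does in the quoted code.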



Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Bodo Eggert
On Fri, 25 Jan 2008, Bryan Henderson wrote:

> > AIX basically did this with SIGDANGER (the signal is ignored by
> > default), except there wasn't the ability for the process to tell the
> > kernel at what level of memory pressure before it should start getting
> > notified, and there was no way for the kernel to tell how bad the
> > memory pressure actually was.  On the other hand, it was a relatively
> > simple design.
> 
> AIX does provide a system call to find out how much paging backing store 
> space is available and the thresholds set by the system administrator. 
> Running out of paging space is the only memory pressure AIX is concerned 
> about.  While I think having processes make memory usage decisions based 
> on that is a shoddy way to manage system resources, that's what it is 
> intended for.

If you start partitioning the system into virtual servers (or something
similar), being close to swapping may be somebody else's problem.
(They shouldn't have exceeded their guaranteed memory limit).


> Incidentally, some context for the AIX approach to the OOM problem: a 
> process may exclude itself from OOM vulnerability altogether.  It places 
> itself in "early allocation" mode, which means at the time it creates 
> virtual memory, it reserves enough backing store for the worst case.  The 
> memory manager does not send such a process the SIGDANGER signal or 
> terminate it when it runs out of paging space.  Before c. 2000, this was 
> the only mode.  Now the default is late allocation mode, which is similar 
> to Linux.

This is an interesting approach. It feels like some programs might be 
interested in choosing this mode instead of risking OOM. 
-- 
The programmer's National Anthem is ''


Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Bodo Eggert
Alan Cox <[EMAIL PROTECTED]> wrote:

>> I'd tried to advocate SIGDANGER some years ago as well, but none of
>> the kernel maintainers were interested.  It definitely makes sense
>> to have some sort of mechanism like this.  At the time I first brought
>> it up it was in conjunction with Netscape using too much cache on some
>> system, but it would be just as useful for all kinds of other memory-
>> hungry applications.
> 
> There is an early thread for a /proc file which you can add to your
> poll() set and it will wake people when memory is low. Very elegant and
> if async support is added it will also give you the signal variant for
> free.

IMO you'll need a userspace daemon. The kernel only knows about the
amount of memory available / recommended for a system (or container),
while the user knows which program's cache is most precious today.

(Of course the userspace daemon will in turn need the /proc file.)

I think a single, system-wide signal is the second-worst solution: All
applications (or the wrong one, if you select one) would free their caches
and start to crawl, and either stay in this state or slowly increase their
caches again until they get signaled again. And the signal would either
come too early or too late. The userspace daemon could collect the weighted
demand of memory from all applications and tell them how much to use.
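
A rough sketch of what such a daemon could compute, assuming each application
reports a demand and the user assigns a weight. The structure, the function
name and the proportional policy are all assumptions made for illustration;
nothing like this exists in the thread.

```c
#include <stddef.h>
#include <stdint.h>

struct app_demand {
    uint64_t demand_kb;   /* cache size the application would like */
    uint32_t weight;      /* user-assigned importance, higher = more precious */
    uint64_t granted_kb;  /* result, filled in by distribute_budget() */
};

/*
 * Split budget_kb among the applications in proportion to
 * demand * weight, never granting more than was asked for.
 * Leftover memory is not redistributed in this simple version,
 * and very large values could overflow the 64-bit products.
 */
static void distribute_budget(struct app_demand *apps, size_t n,
                              uint64_t budget_kb)
{
    uint64_t total_demand = 0, total_weighted = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total_demand += apps[i].demand_kb;
        total_weighted += apps[i].demand_kb * apps[i].weight;
    }

    for (i = 0; i < n; i++) {
        uint64_t share;

        if (total_demand <= budget_kb) {
            /* No pressure: everybody gets what they asked for. */
            apps[i].granted_kb = apps[i].demand_kb;
            continue;
        }
        if (total_weighted == 0) {
            /* All weights are zero: nobody is worth keeping cached. */
            apps[i].granted_kb = 0;
            continue;
        }
        share = budget_kb * (apps[i].demand_kb * apps[i].weight)
                / total_weighted;
        apps[i].granted_kb = share < apps[i].demand_kb
                             ? share : apps[i].demand_kb;
    }
}
```

With a budget of 100 and two applications that both ask for 100 but carry
weights 3 and 1, the first is told to use 75 and the second 25.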



Re: [RFD] Incremental fsck

2008-01-11 Thread Bodo Eggert
Al Boldi <[EMAIL PROTECTED]> wrote:

> Even after a black-out shutdown, the corruption is pretty minimal, using
> ext3fs at least.  So let's take advantage of this fact and do an optimistic
> fsck, to assure integrity per-dir, and assume no external corruption.  Then
> we release this checked dir to the wild (optionally ro), and check the next.
> Once we find external inconsistencies we either fix it unconditionally,
> based on some preconfigured actions, or present the user with options.

Maybe we can know the changes that need to be done in order to fix the
filesystem. Let's record this information in - eh - let's call it a journal!



Re: [patch 2/2] getattr - fill the size of FIFOs

2007-10-03 Thread Bodo Eggert
Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> [PATCH]: Fill the size of FIFOs
> 
> Instead of reporting 0 in size when stating() a 

FIFO
-- 
Whenever you have plenty of ammo, you never miss. Whenever you are low on
ammo, you can't hit the broad side of a barn.



Re: [PATCH V3] limit minixfs printks on corrupted dir i_size, CVE-2006-6058

2007-08-15 Thread Bodo Eggert
On Mon, 13 Aug 2007, Eric Sandeen wrote:
> Bodo Eggert wrote:

> > Warning: I'm only looking at the patch.
> >
> > You are supposed to print an error message for a user, not to write in a
> > chat window to a 1337 script kiddie. OK, you just matched the current style,
> > and your patch is IMHO OK for a quick security fix, but:
> >
> > - Security fixes should be CCed to the security mailing list, shouldn't they?
> >   (It might be security@ or stable@, I'll remember tomorrow, but then I'd
> >    forget to comment)
> > - Imagine you have three mounts containing a minix fs, how can you tell which
> >   one is the defective one?
> > - The message says "minix_bmap", while the patch suggests it's in
> >   block_to_path. Therefore I assume "minix_bmap" to have only random
> >   informational value.
> > - Does block < 0 or block > $size make a difference?
> > - the printk lacks the loglevel.
> > - Assuming minix supports error handling, shouldn't it do something?
> >
> > I'd suggest a message saying something like "minix: Bad block address on
> > device 08:15, needs fsck".
> >   
> Ok, do you like this slightly better?  It states the subsystem, the 
> function with the error, the block nr. in the case of a too-large block,
> and the block device on which the error occurred.

- How long is BDEVNAME_SIZE? Will it fit on the stack?
- Does it include the space for \0?

I assume you copied other users, and the other users will do it right (or 
at least not terribly wrong:), but I can't dig through the code right now.

>  Honestly minix.fsck
> doesn't handle the situation well either, so at this point I hesitate
> to recommend it in the print.  :)

*g*
-- 
Top 100 things you don't want the sysadmin to say:
79. What's this "any" key I'm supposed to press?


Re: [RFC PATCH 1/4] pass open file to ->setattr()

2007-08-09 Thread Bodo Eggert
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

>> > This is needed to be able to correctly implement open-unlink-fsetattr
>> > semantics in some filesystem such as sshfs, without having to resort
>> > to "silly-renaming".
>> 
>> How do you plan to do that?
> 
> Easy: the SFTP protocol has stateful opens and defines an FSTAT call.

Is it possible to reconnect without umounting? If yes, the unlinked files
would be lost in spite of being opened, wouldn't they?
-- 
Top 100 things you don't want the sysadmin to say:
11. Can you get VMS for this Sparc thingy?



Re: [PATCH V2] limit minixfs printks on corrupted dir i_size, CVE-2006-6058

2007-08-09 Thread Bodo Eggert
Eric Sandeen <[EMAIL PROTECTED]> wrote:

> This attempts to address CVE-2006-6058
> http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2006-6058
>  
> first reported at http://projects.info-pull.com/mokb/MOKB-17-11-2006.html
> 
> Essentially a corrupted minix dir inode reporting a very large
> i_size will loop for a very long time in minix_readdir, minix_find_entry,
> etc, because on EIO they just move on to try the next page.  This is
> under the BKL, printk-storming as well.  This can lock up the machine
> for a very long time.  Simply ratelimiting the printks gets things back
> under control.

> Index: linux-2.6.22-rc4/fs/minix/itree_v1.c
> ===
> --- linux-2.6.22-rc4.orig/fs/minix/itree_v1.c
> +++ linux-2.6.22-rc4/fs/minix/itree_v1.c
> @@ -27,7 +27,8 @@ static int block_to_path(struct inode *
>  if (block < 0) {
>  printk("minix_bmap: block<0\n");
>  } else if (block >= (minix_sb(inode->i_sb)->s_max_size/BLOCK_SIZE)) {
> - printk("minix_bmap: block>big\n");
> + if (printk_ratelimit())
> + printk("minix_bmap: block>big\n");

Warning: I'm only looking at the patch.

You are supposed to print an error message for a user, not to write in a
chat window to a 1337 script kiddie. OK, you just matched the current style,
and your patch is IMHO OK for a quick security fix, but:

- Security fixes should be CCed to the security mailing list, shouldn't they?
  (It might be security@ or stable@, I'll remember tomorrow, but then I'd
   forget to comment)
- Imagine you have three mounts containing a minix fs, how can you tell which
  one is the defective one?
- The message says "minix_bmap", while the patch suggests it's in
  block_to_path. Therefore I assume "minix_bmap" to have only random
  informational value.
- Does block < 0 or block > $size make a difference?
- the printk lacks the loglevel.
- Assuming minix supports error handling, shouldn't it do something?

I'd suggest a message saying something like "minix: Bad block address on
device 08:15, needs fsck".
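
As a userspace sketch of such a message: the helper below builds the
suggested text with a log level prefix and the device numbers. The function
name is hypothetical, and printing major:minor as two decimal fields is an
assumption taken from the "08:15" example; in the kernel this would be a
printk() with KERN_ERR rather than snprintf().

```c
#include <stdio.h>

/* Stand-in for the kernel's KERN_ERR prefix. */
#define LOGLEVEL_ERR "<3>"

/*
 * Format the error message into buf. Returns what snprintf() returns:
 * the length the full message needs, so a result >= len means the
 * buffer was too small and the text was truncated.
 */
static int format_bad_block(char *buf, size_t len,
                            unsigned int major, unsigned int minor)
{
    return snprintf(buf, len,
                    LOGLEVEL_ERR "minix: Bad block address on device "
                    "%02u:%02u, needs fsck\n",
                    major, minor);
}
```

The message names the filesystem and the device, which answers the "which of
my three mounts" question, and it carries a log level.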
-- 
Oops. My brain just hit a bad sector. 



Re: utimes() with vfat is problematic

2007-07-11 Thread Bodo Eggert
Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> vfat does not know about ownership, hence the files are always owned by the
> vfat mounter (or whatever the uid= option specified). Which brings
> a problem to userspace programs trying to utime() but which do not
> run as the same user as the vfat mounter, because:
> 
> 
> fs/attr.c:53
> ret = -EPERM;
> [...]
> 
> /* Check for setting the inode time. */
> if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET)) {
> if (current->fsuid != inode->i_uid && !capable(CAP_FOWNER))
> goto error;
> }
> 
> 
> To trigger the problem:
> # mount /somevfat -o umask=0,uid=root
> $ touch -d "2005-05-05" /somevfat/myfile
> 
> I am not sure how this could be dealt with besides passing -o quiet to
> mount.vfat. Any ideas?

Would it be possible to allow any user to modify the fs by adding
"&& current->fsuid != -1"? I think it's commonly the desired behaviour.
Of course the default behaviour should stay the same.
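
One way to read this suggestion, as a userspace sketch: a mount whose uid= is
set to -1 is treated as "owned by everybody". Checking the inode's uid
instead of current->fsuid is an interpretation, and may_set_times and
EVERYONE_UID are hypothetical names, not anything from fs/attr.c.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical marker: files with this owner may be touched by anyone. */
#define EVERYONE_UID ((uid_t)-1)

/*
 * Model of the time-setting permission check with the extra escape.
 * has_cap_fowner stands in for capable(CAP_FOWNER).
 */
static bool may_set_times(uid_t fsuid, uid_t inode_uid, bool has_cap_fowner)
{
    if (fsuid == inode_uid || has_cap_fowner)
        return true;                    /* existing behaviour */
    return inode_uid == EVERYONE_UID;   /* new: opt-in via uid=-1 */
}
```

Mounts with a regular uid= keep the current behaviour; only an explicit -1
opens the check up.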
-- 
Those who hesitate under fire usually do not end up KIA or WIA. 



Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-06 Thread Bodo Eggert
On Thu, 5 Jul 2007, DervishD wrote:
>  * Bodo Eggert <[EMAIL PROTECTED]> dixit:

> > Standardisation is good, but autotools (as they are used) usually isn't.
> 
> Usually, by picking other's project configure.in and tweak blindly.

If it were that easy to write a correct automake script, people would do 
that. Wouldn't they?

> > Configuring the build of an autotools program is harder than necessary;
> > if it used a config file, you could easily save it somewhere while adding
> > comments on how and why you did *that* choice, and you could possibly
> > use a set of default configs which you'd just include.
> 
> Looks like CMake...

Obviously something I should look at.
-- 
Top 100 things you don't want the sysadmin to say:
45. Was that YOUR directory?


Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Bodo Eggert
Nix <[EMAIL PROTECTED]> wrote:
> On 4 Jul 2007, DervishD stated:

>> Anyway, if you don't like mobs or you just don't want to try it,
>> that's fine, but please don't use autotools, it doesn't make much sense
>> for a linux only project, since you will be using only the "directory
>> choosing" part of autotools. Maybe a hand made script will help (and I
> 
> Oh, yeah, great, another hand-rolled build system. That's *just* what
> those of us who have autotools working well (with config.site's that
> do all we need and then some) are looking forward to.
> 
> There are advantages to standardization, you know. A *lot* of
> autobuilders know how to make autoconf-generated configure scripts jump
> through hoops. I was downright *happy* when util-linux was
> autoconfiscated: I could ditch the code to handle automatic
> configuration of yet another one-package hand-rolled build system.

Standardisation is good, but autotools (as they are used) usually isn't. It
tests for the availability of a fortran compiler for a C-only project, checks
the width of integers on i386 for projects not caring about that and fails to
find installed libraries without telling how it was supposed to find them or
how to make it find that library.

Configuring the build of an autotools program is harder than necessary;
if it used a config file, you could easily save it somewhere while adding
comments on how and why you made *that* choice, and you could possibly
use a set of default configs which you'd just include.

The Makefiles generated by autotools are a huge mess; if autotools got it
wrong (again!), fixing them requires editing a lot of files.

I'm really really happy if I read 'edit Makefile.conf and run make...'.
-- 
No matter which way you have to march, its always uphill. 



Re: Versioning file system

2007-06-18 Thread Bodo Eggert
alan <[EMAIL PROTECTED]> wrote:

> I just wish that people would learn from the mistakes of others.  The
> MacOS is a prime example of why you do not want to use a forked
> filesystem, yet some people still seem to think it is a good idea.
> (Forked filesystems tend to be fragile and do not play well with
> non-forked filesystems.)

What's the conceptual difference between forks and extended user attributes?
-- 
"Unix policy is to not stop root from doing stupid things because
that would also stop him from doing clever things." - Andi Kleen

"It's such a fine line between stupid and clever" - Derek Smalls


[PATCH] pipefs unique inode numbers

2007-01-30 Thread Bodo Eggert
change pipefs to use a unique inode number equal to the memory
address unless it would be truncated.

Signed-Off-By: Bodo Eggert <[EMAIL PROTECTED]>
---
Tested on i386.

--- 2.6.19/fs/pipe.c.ori	2007-01-30 22:02:46.0 +0100
+++ 2.6.19/fs/pipe.c	2007-01-30 23:22:27.0 +0100
@@ -864,6 +864,10 @@ static struct inode * get_pipe_inode(voi
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   /* The address of *inode is unique, so we'll get a unique inode number.
+* Of course this will not work for 32 bit inodes on 64 bit systems. */
+   if (sizeof(inode->i_ino) >= sizeof(struct inode*))
+   inode->i_ino = (unsigned long) inode;
 
return inode;
 


Re: [PATCH 2/2] make pipefs do lazy i_ino assignment and hashing

2007-01-30 Thread Bodo Eggert
Jeff Layton <[EMAIL PROTECTED]> wrote:

> This patch updates pipefs to do defer assigning an i_ino value to its inodes
> until someone actually tries to stat it. This allows us to have unique i_ino
> values for the inodes here, without the performance impact for anyone who
> doesn't actually care about it.
> 
> Since we don't have an i_ino value at pipe creation time, we need something
> else to stuff into the dentry name. Here, I'm using the pointer address of
> the inode xor'ed with a random value. There are certainly better hashing
> schemes, so if someone wants to propose a better way to do this, then I'm
> open to looking at it (maybe halfmd4?).

Why XOR? To pretend a non-existent level of security?
Either you can't use the address, or you can read the obfuscator value, too.

OTOH, if sizeof(void*) <= sizeof(ino_t), using the address will result in a
unique inode number without need for expensive hashing algorithms.
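
The size check can be sketched in userspace like this. ino_from_address is a
made-up helper, and the userspace ino_t stands in for the kernel's i_ino
type, so treat it as an illustration of the idea only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Derive an inode number from an object's address.
 * Returns false (and leaves *ino alone) if a pointer does not fit
 * into ino_t, because truncated addresses are no longer unique.
 */
static bool ino_from_address(const void *obj, ino_t *ino)
{
    if (sizeof(void *) > sizeof(ino_t))
        return false;

    /* Go through uintptr_t so the pointer-to-integer cast is well defined. */
    *ino = (ino_t)(uintptr_t)obj;
    return true;
}
```

Two live objects never share an address, so as long as the check passes the
numbers are unique for the lifetime of the objects without any hashing.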
-- 
Why did the hacker cross the road? To get to the other side.
Why did the cracker cross the road? To get what was on the other side.
The difference is small, but important.
-- Gandalf Parker


Re: Finding hardlinks

2007-01-05 Thread Bodo Eggert
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

>> > Well, sort of.  Samefile without keeping fds open doesn't have any
>> > protection against the tree changing underneath between first
>> > registering a file and later opening it.  The inode number is more
>> 
>> You only need to keep one-file-per-hardlink-group open during final
>> verification, checking that inode hashing produced reasonable results.
> 
> What final verification?  I wasn't just talking about 'tar' but all
> cases where st_ino might be used to check the identity of two files at
> possibly different points in time.
> 
> Time A:remember identity of file X
> Time B:check if identity of file Y matches that of file X
> 
> With samefile() if you open X at A, and keep it open till B, you can
> accumulate large numbers of open files and the application can fail.
> 
> If you don't keep an open file, just remember the path, then renaming
> X will foil the later identity check.  Changing the file at this path
> between A and B can even give you a false positive.  This applies to
> 'tar' as well as the other uses.

If you open Y, this open file descriptor will guarantee that no distinct
file will have the same inode number while all hardlinked files must have
the same inode number. (AFAIK)

Now you will check this against the list of hardlink candidates using the
stored inode number. If the inode number has changed, this will result in
a false negative. If you removed X, recreated it with the same inode number
and linked that to Y, you'll get a false positive (which could be identified
by the [mc]time changes).

Samefile without keeping the files open will result in the same false
positive as open+fstat+stat, while samefile with keeping the files open
will occasionally overflow the file table. Therefore I think it's not
worthwhile introducing samefile as long as the inode is unique for open
files. OTOH you'll want to keep the inode number as stable as possible,
since it's the only sane way to find sets of hardlinked files and some
important programs may depend on it.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

http://david.woodhou.se/why-not-spf.html


RFC: Stable inodes for inode-less filesystems (was: Finding hardlinks)

2007-01-05 Thread Bodo Eggert
Pavel Machek <[EMAIL PROTECTED]> wrote:

>> Another idea is to export the filesystem internal ID as an arbitrary
>> length cookie through the extended attribute interface.  That could be
>> stored/compared by the filesystem quite efficiently.
> 
> How will that work for FAT?

> Or maybe we can relax that "inode may not change over rename" and
> "zero length files need unique inode numbers"...

I didn't look into the code, and I'm not experienced in writing (linux)
fs, but I have an idea I'd like to share. Maybe it's not that bad ...

(I'm going to type about inode numbers, since having constant inodes
 is desired and the extended attribute would only be an aid if the
 inode is too small.)

IIRC, no cluster is reserved for empty files on FAT; if I'm wrong, it'll
be much easier, you would just use the cluster-number (less than 32 bit).

The basic idea is to use a different inode range for non-empty and empty
files. This will result in the inode possibly changing after close()* or
on rename(empty1, empty2). OTOH it will keep a stable inode for non-empty
files and for newly written files** if they aren't stat()ed before writing
the first byte. I'm not sure if it's better than changing inodes after
$randomtime, but I just made a quick strace on gtar, rsync and cp -a;
they don't look at the dest inode before it would change (or at all).

(If this idea is applied to iso9660, the hard problem will be finding the
 number of hardlinked files for one location)

Changing the inode# on the last close* can be done by purging the cache
if the file is empty XOR the file has an inode# from the empty-range.
(That should be the same operation as done by unlink()?)
A new open(), stat() or readdir should create the correct kind of inode#.

*) You can optionally just wait for the inode to expire, but you need to
   keep the associated reservation(s) until then. I don't expect any
   visible effect from doing this, but see ** from the next paragraph
   on how to minimize the effect. The reserved directory entry (see far
   below in this text) is 32 Bytes, but the fragmentation may be bad.
**) which are empty on open() and therefore don't yet have a stable inode#
   Those inode numbers will appear to be stable because nobody saw them
   change. It's safe to change the inode# because by reserving disk space,
   we got a unique inode#. I hope the kernel side allows this ...


For non-empty files, you can use the cluster-number (start address), it's
unique, and won't change on rename. It will, however, change on emptying
or writing to an empty file. If you write to an empty file, no special
handling is required, all reservations are implicit*. If you empty a file,
you'll have to keep the first cluster reserved** until it's closed,
otherwise you'd risk an inode collision.

*) since the inode# doesn't change yet, you'll still have to treat it like
   an empty file while unlinking or renaming.
**) It's OK to reuse it if it's in the middle of a file, so you may
optionally keep a list of these clusters and not start files there
instead of reserving the space. OTOH, it's more work.


Empty files will be a PITA with 32-bit-inodes, since a full-sized FAT32 can
have about 2^38 empty files*. (The extended attribute would work as described
below.) You can, however, generate inode numbers for empty files, risking
collisions. This requires all generated inode numbers to be above 0x4000
(or above the number of clusters on disk).

*) 8 TB divided by 32 B / directory entry

With 64-bit-values, you can generate a unique inode for empty files
using cluster#-of-dir | 0x8000 | index_in_dir << 32. The downside
is, it will change on cross-directory-renames and may change on in-
directory-renames. If this happens to an open file, you'll need to
make sure the old inode# is not reused by reserving that directory
entry, since the inode# can't change for open files.
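
A sketch of that encoding in C. The exact bit layout here is an assumption:
bit 31 marks the "empty" range, the low 31 bits hold the directory's cluster
number, and the upper 32 bits hold the index within the directory. The masks
written in this message may differ, and all names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout, see the note above. */
#define EMPTY_FLAG   UINT64_C(0x80000000)
#define CLUSTER_MASK UINT64_C(0x7fffffff)

/* Build the inode number for an empty file from its directory position. */
static uint64_t empty_file_ino(uint32_t dir_cluster, uint32_t index_in_dir)
{
    return ((uint64_t)dir_cluster & CLUSTER_MASK)
           | EMPTY_FLAG
           | ((uint64_t)index_in_dir << 32);
}

/* Non-empty files use the bare start cluster, so the flag bit is clear. */
static int ino_is_empty_range(uint64_t ino)
{
    return (ino & EMPTY_FLAG) != 0;
}

/* Recover the two values needed to (un)reserve the directory entry. */
static uint32_t ino_dir_cluster(uint64_t ino)
{
    return (uint32_t)(ino & CLUSTER_MASK);
}

static uint32_t ino_dir_index(uint64_t ino)
{
    return (uint32_t)(ino >> 32);
}
```

Decoding gives back exactly the pair passed to the unreserve and reserve
operations in the pseudocode that follows.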


extra operations on final close:
if "empty" inode:
 if !empty
  unreserve_directory_entry(inode & 0x7fff, inode >> 32)
  uncache inode (will change inode#)
  stop
 if unreserve_directory_entry(inode & 0x7fff, inode >> 32)
  uncache inode
if "non-empty" inode
 if empty
  free start cluster
  uncache inode

extra operations on unlink/rename:
if "empty" inode:
 if can_use_current_inode#_for_dest
  do it
  unreserve_directory_entry(inode & 0x7fff, inode >> 32)
  // because of "mv a/empty b/empty; mv b/empty a/empty"
 else if is_open_file
  // the current inode won't fit the new location:
  reserve_directory_entry(old_inode & 0x7fff, inode >> 32)

extra operations on truncate
if "non-empty" inode && empty_after_truncate
 exempt start cluster from being freed,
 or put it on a list of non-startclusters

extra operation on extend
if "empty" inode && nobody did e.g. stat() after opening this file
 silently change inode, nobody will notice. Racy? Possible?


Required data in filehandle:
 Location of directory entry (d.e. contains inode information)
  (this shouldn't be new 

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-20 Thread Bodo Eggert <[EMAIL PROTECTED]>
Mike Waychison <[EMAIL PROTECTED]> wrote:

> Consider the following pseudo example:
> 
> main():
> chdir("/");
> fd = open(".", O_RDONLY);
> clone(cloned_func, cloned_stack, CLONE_NEWNS, NULL);
> 
> cloned_func:
> fchdir(fd);
> chdir("..");
> 
> if main is run within a chroot where its "/" is on the same vfsmount as
>  its "..", then the application can step out of the chroot using clone(2).
> 
> Note: using chdir in a vfsmount outside of your namespace works, however
> you won't be able to walk off that vfsmount (to its parent or children).

IMO the '..' file descriptor should be attached to its chroot domain.
This should avoid all chroot-escapes, even with fd-passing etc.

I wonder why nobody thought of that. Either it's too obvious or too stupid.
-- 
Funny quotes:
7. You have the right to remain silent. Anything you say will be misquoted,
   then used against you.



Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Bodo Eggert
On Tue, 19 Apr 2005, Eric Van Hensbergen wrote:
> On 4/19/05, Bodo Eggert <[EMAIL PROTECTED]> wrote:

> > Allowing user mounts with no* should be always ok (no config needed
> > besides the ulimit), and mounting specified files to defined locations
> > is already supported by fstab.
> >
> 
> Do folks think that the limits should be per-user or per-process for
> user-mounts, what about separate limits for # of private namespaces
> and # of mounts?

Per-user.

> The fstab support doesn't seem to provide enough flexibility for
> certain situations, say I want to support mounting any remote file
> system, as long as its in the user's private hierarchy?
[...]

The dir is owned by the user, therefore it's allowed with no*.
-- 
Top 100 things you don't want the sysadmin to say:
11. Can you get VMS for this Sparc thingy?


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Bodo Eggert
On Tue, 19 Apr 2005, Eric Van Hensbergen wrote:
> On 4/17/05, Bodo Eggert <[EMAIL PROTECTED]>

> > > I was thinking about this a while back and thought having a user-mount
> > > permissions file might be the right way to address lots of these
> > > issues.  Essentially it would contain information about what
> > > users/groups were allowed to mount what sources to what destinations
> > > and with what mandatory options.
> > 
> > Users being able to mount random fs containing suid or device nodes
> > are root whenever they want to. If you want to mount with dev or suid,
> > use sudo and restrict the mount to a limited set of images/devices/whatever.
> 
> Well, that would kinda be the intent behind the permissions file  --
> it can specify what restricted set of images/devices/whatever the user
> can mount, I suppose the sensible thing would be to always enforce
> nosuid and nsgid, but I'd rather keep these as the default version of
> options (allowing admins to shoot themselves in the foot perhaps, but
> in the single-user workstation case, it seems like there's less reason
> to be so paranoid).

I think you shouldn't help the admins by creating shoes with target marks.

Allowing user mounts with no* should always be ok (no config needed
besides the ulimit), and mounting specified files to defined locations
is already supported by fstab.
-- 
Top 100 things you don't want the sysadmin to say:
6. We prefer not to change the root password, it's a nice easy one


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-17 Thread Bodo Eggert <[EMAIL PROTECTED]>
Eric Van Hensbergen <[EMAIL PROTECTED]> wrote:
> On 4/11/05, Miklos Szeredi <[EMAIL PROTECTED]> wrote:

>> 
>>   1) Only allow mount over a directory for which the user has write
>>  access (and is not sticky)
>> 
>>   2) Use nosuid,nodev mount options
[...]

> Do these solve all the security concerns with unprivileged mounts, or
> are there other barriers/concerns?  Should there be ulimit (or rlimit)
> style restrictions on how many mounts/binds a user is allowed to have
> to prevent users from abusing mount privs?

Definitely. Mountpoints use kernel space, so users could DoS the machine.
A per-machine limit isn't fine-grained enough, since users could still DoS
each other.
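
To illustrate the per-user accounting, here is a minimal userspace sketch in C.
It is not kernel code; the names (mount_table, charge_mount, uncharge_mount,
MAX_TRACKED_USERS) are hypothetical and only show the idea that each user is
charged against his own limit, so one user hitting it doesn't affect another.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACKED_USERS 64    /* hypothetical table size for the sketch */

struct user_mounts {
    uid_t uid;
    unsigned int count;         /* mounts currently charged to this user */
    bool used;
};

struct mount_table {
    struct user_mounts users[MAX_TRACKED_USERS];
    unsigned int per_user_limit; /* plays the role of the ulimit/rlimit */
};

/* Charge one mount to uid; returns false if the user is at the limit
 * or the (hypothetical) table is full. */
bool charge_mount(struct mount_table *t, uid_t uid)
{
    struct user_mounts *free_slot = NULL;

    for (size_t i = 0; i < MAX_TRACKED_USERS; i++) {
        struct user_mounts *u = &t->users[i];

        if (u->used && u->uid == uid) {
            if (u->count >= t->per_user_limit)
                return false;
            u->count++;
            return true;
        }
        if (!u->used && free_slot == NULL)
            free_slot = u;
    }
    if (free_slot == NULL || t->per_user_limit == 0)
        return false;
    free_slot->used = true;
    free_slot->uid = uid;
    free_slot->count = 1;
    return true;
}

/* Release one mount previously charged to uid. */
void uncharge_mount(struct mount_table *t, uid_t uid)
{
    for (size_t i = 0; i < MAX_TRACKED_USERS; i++) {
        struct user_mounts *u = &t->users[i];

        if (u->used && u->uid == uid && u->count > 0) {
            if (--u->count == 0)
                u->used = false;
            return;
        }
    }
}
```

With a single machine-wide counter, the first user to exhaust it would lock
everybody else out; here each uid only competes with itself.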

You'll also have to prevent users from trapping system daemons in D state or
in artificially slowed-down directory forests. I think namespaces will
do most of the trick.

> I was thinking about this a while back and thought having a user-mount
> permissions file might be the right way to address lots of these
> issues.  Essentially it would contain information about what
> users/groups were allowed to mount what sources to what destinations
> and with what mandatory options.

Users being able to mount random fs containing suid or device nodes
are root whenever they want to. If you want to mount with dev or suid,
use sudo and restrict the mount to a limited set of images/devices/whatever.
-- 
Anger, fear, aggression. The Dark Side of the Force are they.
Once you start down the Dark Path, forever will it dominate your destiny.
-- Jedi Master Yoda



Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Bodo Eggert
On Tue, 12 Apr 2005, Jamie Lokier wrote:
> Bodo Eggert <[EMAIL PROTECTED]> wrote:

> > > I think that would be _much_ nicer implemented as a mount which is
> > > invisible to other users, rather than one which causes the admin's
> > > scripts to spew error messages.  Is the namespace mechanism at all
> > > suitable for that?
> > 
> > This will require shared subtrees plus a way for new logins from the same
> > user to join an existing (previous login) namespace.
> 
> Or "per-user namespaces".

A general way to enter child namespaces would be much more flexible. The 
mechanism could be reused by projects like linux-vserver.
-- 
Our last fight was my fault: My wife asked me "What's on the TV?"
I said, "Dust!"


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Bodo Eggert <[EMAIL PROTECTED]>
Jamie Lokier <[EMAIL PROTECTED]> wrote:
> Miklos Szeredi wrote:

>>   4) Access should not be further restricted for the owner of the
>>  mount, even if permission bits, uid or gid would suggest
>>  otherwise
> 
> Why?  Surely you want to prevent writing to files which don't have the
> writable bit set?  A filesystem may also create append-only files -
> and all users including the mount owner should be bound by that.

That will depend on the situation. If the user is mounting a tgz owned
by himself, FUSE should default to being a convenient hex-editor.

>>   5) As much of the available information should be exported via the
>>  filesystem as possible
> 
> This is the root of the conflict.  You are trying to overload the
> permission bits and uid/gid to mean something different than they
> normally do.
> 
> While it's convenient to see some "remote" information such as the
> uid/gid in a tar file, are you sure it's a good idea to break the unix
> permissions model - which will break some programs?  (For example, try
> editing a file with the broken semantics in an editor which checks the
> uid/gid of the file against the current user).

The editor will try to keep the original permissions, and saving will be
less effective.

>>   1) Only allow mount over a directory for which the user has write
>>  access (and is not sticky)
> 
> Seems good - but why not sticky?  Mounting a user filesystem in
> /tmp/user-xxx/my-mount-point seems not unreasonable - provided the
> administrator can delete the directory (which is possible with
> detachable mount points).

I once mounted a filesystem in ~/tmp after forgetting about it being a
symlink to /tmp/$me/tmp, and I had to promise never to do that again.
At midnight, the cleanup-temp-script kicked in.

>>   5) The filesystem daemon is free to fill in all file attributes to
>>  any (sane) value, and the kernel won't modify these.
> 
> Dangerous, because an administrative program might actually trust the
> attributes to mean what they normally mean in the unix permissions model.

The same risk applies to smbmounted file systems.

Sane daemons will do no check besides matching the owner of a file in the
user's home against the expected UID and checking the permission mask,
since you can't trust users not to mess with files in directories they own.
The "best" they can do should be shooting themselves in the foot.

(If the user doesn't own the directory, FUSE shouldn't mount.)
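
As a rough sketch of that rule combined with the nosuid,nodev proposal quoted
earlier in the thread, something like the following C could do the check. This
is an assumption-laden illustration, not FUSE's actual code: the function names
user_may_mount_on and user_mount_flags are hypothetical, and the policy shown
(owned by the user, not sticky) is only one reading of the proposals.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mount.h>   /* MS_NOSUID, MS_NODEV (Linux) */
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: whatever the user asked for, an unprivileged
 * mount always gets nosuid and nodev added. */
unsigned long user_mount_flags(unsigned long requested)
{
    return requested | MS_NOSUID | MS_NODEV;
}

/* Hypothetical policy check on the mount point, given its stat data:
 * it must be a directory, owned by the mounting user, and not sticky. */
bool user_may_mount_on(const struct stat *st, uid_t uid)
{
    if (!S_ISDIR(st->st_mode))
        return false;
    if (st->st_uid != uid)
        return false;           /* not the user's own directory */
    if (st->st_mode & S_ISVTX)
        return false;           /* sticky, e.g. /tmp itself */
    return true;
}
```

A caller would stat() the target, refuse the mount if user_may_mount_on()
returns false, and otherwise pass user_mount_flags(flags) on to mount(2).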
-- 
Top 100 things you don't want the sysadmin to say:
80. I cleaned up the root partition and now there's LOTS of free space.
