lock_rename for cluster filesystems? (was: Re: [PATCH] prune_icache_sb)

2007-03-02 Thread Mike Snitzer

On 12/4/06, Wendy Cheng <[EMAIL PROTECTED]> wrote:

Russell Cattelan wrote:
> Wendy Cheng wrote:
>
>> Linux kernel, particularly the VFS layer, is starting to show signs
>> of inadequacy as the software components built upon it keep growing.
>> I have doubts that it can keep up and handle this complexity with a
>> development policy like you just described (filesystem is a dumb
>> layer ?). Aren't these DIO_xxx_LOCKING flags inside
>> __blockdev_direct_IO() a perfect example why trying to do too many
>> things inside vfs layer for so many filesystems is a bad idea ? By
>> the way, since we're on this subject, could we discuss a little bit
>> about vfs rename call (or I can start another new discussion thread) ?
>>
>> Note that linux do_rename() starts with the usual lookup logic,
>> followed by "lock_rename", then a final round of dentry lookup, and
>> finally comes to filesystem's i_op->rename call. Since lock_rename()
>> only calls for vfs layer locks that are local to this particular
>> machine, for a cluster filesystem, there exists a huge window between
>> the final lookup and filesystem's i_op->rename calls such that the
>> file could get deleted from another node before fs can do anything
>> about it. Is it possible that we could get a new function pointer
>> (lock_rename) in inode_operations structure so a cluster filesystem
>> can do proper locking ?
>
> It looks like the ocfs2 guys have the similar problem?
>
> http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt
>
>
>

Thanks for the pointer. Same as ocfs2, under current VFS code, both
GFS1/2 also need FS_ODD_RENAME flag for the rename problem - got an ugly
~200 line draft patch ready for GFS1 (and am looking into GFS2 at this
moment). The issue here is, for GFS, if vfs lock_rename() can call us,
this complication can be greatly reduced. Will start another thread to
see whether the wish can be granted.


Hi Wendy,

Have you (or others) made any progress on a possible solution to
simplify handling cluster fs do_rename() races (e.g. your proposed
lock_rename in inode_operations)?

I couldn't find a newer thread that continued this discussion...

please advise, thanks.
Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
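
The ordering problem raised in this thread can be pictured with a small userspace mock. Nothing below is kernel code: `mock_inode_ops`, `mock_do_rename`, the `demo_*` callbacks, and the trace buffer are hypothetical names used only for illustration. The sketch assumes the proposed hook would be called right after the node-local lock and before the final lookup, so that a cluster-wide lock is held across the filesystem's rename.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for inode_operations; not the kernel structure. */
struct mock_inode_ops {
    int  (*lock_rename)(void *dir);    /* optional cluster-wide lock */
    void (*unlock_rename)(void *dir);  /* optional matching unlock   */
    int  (*rename)(void *dir);         /* filesystem rename proper   */
};

/* Records the order of steps so the sequence can be inspected. */
static char trace[128];

static void trace_step(const char *step)
{
    strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

/* Example callbacks a cluster filesystem might supply. */
static int  demo_lock(void *dir)   { (void)dir; trace_step("fs_lock;");   return 0; }
static void demo_unlock(void *dir) { (void)dir; trace_step("fs_unlock;"); }
static int  demo_rename(void *dir) { (void)dir; trace_step("fs_rename;"); return 0; }

/* Mock of the do_rename() sequence with the optional hook added. */
static int mock_do_rename(const struct mock_inode_ops *ops, void *dir)
{
    int ret;

    trace[0] = '\0';
    trace_step("lookup;");
    trace_step("vfs_lock;");             /* node-local lock_rename() */
    if (ops->lock_rename) {
        ret = ops->lock_rename(dir);     /* cluster lock, if provided */
        if (ret) {
            trace_step("vfs_unlock;");
            return ret;
        }
    }
    trace_step("final_lookup;");         /* now covered by the fs lock */
    ret = ops->rename(dir);
    if (ops->unlock_rename)
        ops->unlock_rename(dir);
    trace_step("vfs_unlock;");
    return ret;
}
```

With the hook left NULL the sequence degenerates to today's behaviour, where nothing cluster-wide protects the span between the final lookup and the rename call.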


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Andrew Morton wrote:


On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott <[EMAIL PROTECTED]> wrote:



On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:


On Fri, 2 Mar 2007 00:04:45 +0530
"Amit K. Arora" <[EMAIL PROTECTED]> wrote:



This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);


...

I'd agree with Eric on the "command" flag extension.


Seems like a separate syscall would be better, "command" sounds
a bit ioctl like, especially if that command is passed into the
filesystems..




madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to "mode"? ;)

I am wondering whether it would be useful to add another mode to advise the
block allocation policy, something like indicating which physical block or
block group to allocate from (the goal), and whether to ask for strictly
contiguous blocks. This would help preallocation or reservation choose the
right blocks for the file.


Right now neither the ext4 preallocation implementation nor reservation is
guaranteed to allocate/reserve contiguous extents. If the application
asked for that, the allocator could do more searching to satisfy the requirement.


Or fadvise is the right interface?

Mingming

I'm inclined to merge this patch nice and early, so the syscall number is
stabilised.  Otherwise the people who are working on out-of-tree code (ie:
ext4) will have to keep playing catchup.
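
The "mode" argument discussed above can be sketched as a dispatch on a small integer, so that new behaviours are added without a new syscall number. This is a userspace mock under stated assumptions: the `DEMO_FA_*` values, `struct demo_file`, and `demo_fallocate()` are hypothetical, and the "allocated" field is a crude high-water mark rather than a real block map.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical mode values; the thread had not settled on any. */
enum demo_falloc_mode {
    DEMO_FA_ALLOCATE = 0,  /* back the range with blocks and extend the size */
    DEMO_FA_RESERVE  = 1   /* back the range with blocks, keep the size      */
};

/* Toy file: logical size plus a high-water mark of backed bytes. */
struct demo_file {
    long long size;
    long long allocated;
};

static long demo_fallocate(struct demo_file *f, int mode,
                           long long offset, long long len)
{
    long long end;

    if (offset < 0 || len <= 0 || len > LLONG_MAX - offset)
        return -EINVAL;
    end = offset + len;

    switch (mode) {
    case DEMO_FA_ALLOCATE:
        if (end > f->allocated)
            f->allocated = end;
        if (end > f->size)
            f->size = end;
        return 0;
    case DEMO_FA_RESERVE:
        if (end > f->allocated)
            f->allocated = end;
        return 0;
    default:
        /* Unknown modes fail cleanly, leaving room for later additions. */
        return -EOPNOTSUPP;
    }
}
```

A hint such as an allocation goal or a contiguity requirement would need more than one integer, which is one reason to ask whether fadvise is the better fit for that kind of advice.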






Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Dave Kleikamp wrote:

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:


On Thu, 01 Mar 2007 22:44:16 +
Dave Kleikamp <[EMAIL PROTECTED]> wrote:



On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:


On Fri, 2 Mar 2007 00:04:45 +0530
"Amit K. Arora" <[EMAIL PROTECTED]> wrote:



+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+        struct file *file;
+        struct inode *inode;
+        long ret = -EINVAL;
+        file = fget(fd);
+        if (!file)
+                goto out;
+        inode = file->f_path.dentry->d_inode;
+        if (inode->i_op && inode->i_op->fallocate)
+                ret = inode->i_op->fallocate(inode, offset, len);
+        else
+                ret = -ENOTTY;
+        fput(file);
+out:
+        return ret;
+}


ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...


Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range?  I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code.  Maybe we don't care, since the major
file systems can probably do something better in their own code.


Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.



I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op->fallocate().

if (inode->i_op && inode->i_op->fallocate)
        ret = inode->i_op->fallocate(inode, offset, len);
else
        ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.


I think this is useful.

Mingming
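
The fallback idea can be sketched in userspace with an in-memory "file". Everything here is a hypothetical mock (`struct demo_inode`, `demo_generic_fallocate()`, `demo_do_fallocate()`, the page constants); it only illustrates "use the filesystem hook if there is one, otherwise zero-fill the pages that do not exist yet", and says nothing about how a kernel implementation would walk the page cache.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_MAX_PAGES 16
#define DEMO_MAX_BYTES ((long long)DEMO_MAX_PAGES * DEMO_PAGE_SIZE)

/* Hypothetical in-memory file: a fixed set of pages, some not yet present. */
struct demo_inode {
    unsigned char data[DEMO_MAX_PAGES][DEMO_PAGE_SIZE];
    int present[DEMO_MAX_PAGES];  /* 1 if the page is backed by storage */
    /* Optional filesystem hook; NULL means "not implemented". */
    long (*fallocate)(struct demo_inode *inode, long long offset, long long len);
};

/* Fallback: instantiate and zero every missing page the range touches. */
static long demo_generic_fallocate(struct demo_inode *inode,
                                   long long offset, long long len)
{
    long long first, last, i;

    if (offset < 0 || len <= 0)
        return -EINVAL;
    if (offset > DEMO_MAX_BYTES || len > DEMO_MAX_BYTES - offset)
        return -ENOSPC;

    first = offset / DEMO_PAGE_SIZE;
    last = (offset + len - 1) / DEMO_PAGE_SIZE;
    for (i = first; i <= last; i++) {
        if (inode->present[i])
            continue;             /* never overwrite existing data */
        memset(inode->data[i], 0, DEMO_PAGE_SIZE);
        inode->present[i] = 1;
    }
    return 0;
}

/* Dispatch: prefer the filesystem's own method, else the generic one. */
static long demo_do_fallocate(struct demo_inode *inode,
                              long long offset, long long len)
{
    if (inode->fallocate)
        return inode->fallocate(inode, offset, len);
    return demo_generic_fallocate(inode, offset, len);
}
```

The design point is that the caller never sees a "not supported" error; whether the zero-filling cost is acceptable is the open question in the thread.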



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:


BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 


With XFS at least, "du," "stat," etc tell you a little:

[EMAIL PROTECTED] test]# touch resvsp
[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[EMAIL PROTECTED] test]# ls -lh resvsp
-rw-r--r--  1 root root 0 Nov 28 14:11 resvsp
[EMAIL PROTECTED] test]# du -hc resvsp
10G resvsp
10G total
[EMAIL PROTECTED] test]# stat resvsp
  File: `resvsp'
  Size: 0   Blocks: 20971520   IO Block: 4096   regular empty file

Device: 81eh/2078d  Inode: 186 Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

xfs also has an interface to find out what allocations are where:

if you reserve some ranges not starting at 0...

[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit

[EMAIL PROTECTED] test]# xfs_bmap -v resvsp
resvsp:
 EXT: FILE-OFFSET              BLOCK-RANGE        AG AG-OFFSET            TOTAL   FLAGS
   0: [0..2097151]:            hole                                       2097152
   1: [2097152..4194303]:      42392..2139543      0 (42392..2139543)     2097152 1
   2: [4194304..6291455]:      hole                                       2097152
   3: [6291456..8388607]:      4236696..6333847    0 (4236696..6333847)   2097152 1
   4: [8388608..10485759]:     hole                                       2097152
   5: [10485760..12582911]:    8431000..10528151   0 (8431000..10528151)  2097152 1


The flags value of 1 means that these extents are preallocated/unwritten.

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell 
you what is preallocated vs. allocated/written


-Eric
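
The FILE-OFFSET numbers in the xfs_bmap output above are in 512-byte units, which is why a 1g reservation starting at 1g shows up as [2097152..4194303]. A minimal sketch of that arithmetic follows; `bytes_to_bmap_range()` and `struct bmap_range` are hypothetical helper names, and the function assumes a non-zero length.

```c
#include <stdint.h>

#define BMAP_UNIT 512u  /* xfs_bmap reports offsets in 512-byte blocks */

struct bmap_range {
    uint64_t first;  /* first 512-byte block covered by the byte range */
    uint64_t last;   /* last 512-byte block covered (inclusive)        */
};

/* Convert a byte range (len > 0) to an inclusive 512-byte block range. */
static struct bmap_range bytes_to_bmap_range(uint64_t offset, uint64_t len)
{
    struct bmap_range r;

    r.first = offset / BMAP_UNIT;
    r.last = (offset + len - 1) / BMAP_UNIT;
    return r;
}
```

For the "resvsp 5g 1g" reservation this gives 10485760..12582911, matching extent 5 in the listing.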


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andrew Morton
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> > 
> > > What about if the blocks already exist? What would be the return
> > > values in those cases?
> > 
> > 0 on success, other normal errors otherwise.
> > 
> > If asked for a range that includes already-allocated blocks, you just 
> > allocate any non-allocated blocks in the range, I think.
> 
> Yes. What I was trying to figure out is whether there is a requirement that
> the interface return the exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> throw it out there.

Hopefully not, because posix didn't anticipate that.

We could of course return a positive number on success, but it'd get
tricky on 32-bit machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ? 

stat.st_blocks?
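
A minimal sketch of reading that value from userspace, assuming Linux, where st_blocks counts 512-byte units regardless of the filesystem block size. `file_usage()` is a hypothetical helper name. It reports the logical size and the allocated bytes side by side, which is the distinction a preallocated-but-unwritten file exposes (for example, Size 0 with Blocks 20971520 is 10 GiB allocated).

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Fill in the logical size and the allocated bytes for a path.
 * Returns 0 on success or -1 with errno set, like stat() itself.
 */
static int file_usage(const char *path, long long *size, long long *allocated)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *size = (long long)st.st_size;
    /* Assumption: st_blocks is in 512-byte units, as on Linux. */
    *allocated = (long long)st.st_blocks * 512;
    return 0;
}
```

This does not say which blocks are preallocated versus written; it only gives the total.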


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Badari Pulavarty
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> > 
> > Amit K. Arora wrote:
> > 
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall. Is it supposed to 
> > return the number of bytes allocated? What about partial allocations?
> 
> If you don't have enough blocks to cover the request, you should 
> probably just return -ENOSPC, not a partial allocation.

That could be challenging when multiple writers are working in
parallel. You may not be able to return -ENOSPC until the allocation
actually fails (for filesystems which allocate a block at a time).

> 
> > What about if the blocks already exist? What would be the return
> > values in those cases?
> 
> 0 on success, other normal errors otherwise.
> 
> If asked for a range that includes already-allocated blocks, you just 
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is whether there is a requirement that
the interface return the exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
throw it out there.

BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 

Thanks,
Badari



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:


Amit K. Arora wrote:


This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall. Is it supposed to 
return the number of bytes allocated? What about partial allocations?


If you don't have enough blocks to cover the request, you should 
probably just return -ENOSPC, not a partial allocation.


What about if the blocks already exist? What would be the return values
in those cases?


0 on success, other normal errors otherwise.

If asked for a range that includes already-allocated blocks, you just 
allocate any non-allocated blocks in the range, I think.


-Eric
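
The semantics suggested above (all-or-nothing, 0 on success, already-allocated blocks simply skipped) can be sketched with a toy block map. `struct demo_fs` and `demo_prealloc()` are hypothetical names, and the sketch is single-threaded: with concurrent writers the count-then-allocate sequence would need a lock or an up-front reservation to stay atomic.

```c
#include <errno.h>

#define DEMO_NBLOCKS 16

/* Toy filesystem state for one file. */
struct demo_fs {
    int mapped[DEMO_NBLOCKS];  /* 1 if the file block already has storage */
    int free_blocks;           /* blocks left in the filesystem           */
};

/* Preallocate file blocks [first, first + count); 0 or a negative errno. */
static long demo_prealloc(struct demo_fs *fs, int first, int count)
{
    int i, needed = 0;

    if (first < 0 || count <= 0 || first > DEMO_NBLOCKS - count)
        return -EINVAL;

    /* Pass 1: count only the blocks that are not yet allocated. */
    for (i = first; i < first + count; i++)
        if (!fs->mapped[i])
            needed++;

    /* Refuse up front rather than leave a partial allocation behind. */
    if (needed > fs->free_blocks)
        return -ENOSPC;

    /* Pass 2: fill the gaps; existing blocks are left untouched. */
    for (i = first; i < first + count; i++) {
        if (!fs->mapped[i]) {
            fs->mapped[i] = 1;
            fs->free_blocks--;
        }
    }
    return 0;
}
```

Because the return value is only 0 or an error, nothing here needs to report a byte count back to the caller.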



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Ulrich Drepper

On 3/2/07, Dave Kleikamp <[EMAIL PROTECTED]> wrote:

Then there's no need for sys_fallocate to return a long.


Every syscall must return a long.  Otherwise you can have problems on
64-bit archs.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Jan Engelhardt

On Mar 1 2007 23:09, Dave Kleikamp wrote:
>> 
>> Given that glibc already implements fallocate for all filesystems, it will
>> need to continue to do so for filesystems which don't implement this
>> syscall - otherwise applications would start breaking.
>
>I didn't make it clear, but my point was to call generic_fallocate if
>the file system did not define i_op->fallocate().
>
>if (inode->i_op && inode->i_op->fallocate)
>   ret = inode->i_op->fallocate(inode, offset, len);
>else
>   ret = generic_fallocate(inode, offset, len);
>
>I'm not sure it's worth the effort, but I thought I'd throw the idea out
>there.

Writing zeroes using glibc emu most likely means write() --
so generic_fallocate should be preferable (think splice).
Or does glibc use mmap() and it's all different?


Jan
-- 


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
Amit wrote:

>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Just curious .. What does posix_fallocate() return ?
> 
> bookmark this:
> 
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
> 
> Upon successful completion, posix_fallocate() shall return zero;
> otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
> On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> > One thing I'd like to see is a cmd argument as well, to allow for 
> > example allocation vs. reservation (i.e. allocating blocks vs. simply 
> > reserving a number), as well as the inverse of those functions 
> > (un-reservation, de-allocation)?
> > 
> > If the allocation interface allows allocation/reservation within 
> > arbitrary ranges, if the only way to un-allocate is via a truncate, 
> > that's pretty asymmetric.
> 
> I'd rather we just get the oft-discussed punch() syscall instead.
> This is really what "unallocate" would do for persistent allocations
> and it would be useful for files that were not preallocated.

I can see a difference though.  punch() would throw away written data as
well as pre-allocated-but-never-written-to data.  I can see where a user
might preallocate a large file and do a lot of random writes.  At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks.  This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

> 
> For filesystems that don't implement punch(), glibc would do zero-filling
> of the punched area I guess (to make it equivalent to reading from a
> hole in the file).

Or it could just fail.  Writing zeroes may be really slow and not give
the caller any benefit.  (The intention was to free blocks back to the
file system.)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center
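
The distinction drawn above, between freeing everything in a range and freeing only preallocated-but-never-written blocks, can be sketched with a per-block state array. All names (`demo_state`, `DEMO_PUNCH_UNWRITTEN_ONLY`, `demo_punch()`) are hypothetical; no such flag or syscall is implied to exist.

```c
#include <errno.h>

#define DEMO_NBLOCKS 8

/* State of each file block in the toy model. */
enum demo_state { DEMO_HOLE, DEMO_UNWRITTEN, DEMO_WRITTEN };

/* Hypothetical flag: free only preallocated (unwritten) blocks. */
#define DEMO_PUNCH_UNWRITTEN_ONLY 0x1

/* Punch [first, first + count); returns blocks freed or a negative errno. */
static long demo_punch(enum demo_state *blocks, int first, int count, int flags)
{
    int i;
    long freed = 0;

    if (first < 0 || count <= 0 || first > DEMO_NBLOCKS - count)
        return -EINVAL;

    for (i = first; i < first + count; i++) {
        if (blocks[i] == DEMO_HOLE)
            continue;                      /* nothing to free */
        if ((flags & DEMO_PUNCH_UNWRITTEN_ONLY) && blocks[i] == DEMO_WRITTEN)
            continue;                      /* keep user data  */
        blocks[i] = DEMO_HOLE;
        freed++;
    }
    return freed;
}
```

With the flag set, a file that was preallocated and then written at random offsets keeps its data and gives back only the untouched blocks; without it, the call behaves like a plain punch.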



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andreas Dilger
On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> One thing I'd like to see is a cmd argument as well, to allow for 
> example allocation vs. reservation (i.e. allocating blocks vs. simply 
> reserving a number), as well as the inverse of those functions 
> (un-reservation, de-allocation)?
> 
> If the allocation interface allows allocation/reservation within 
> arbitrary ranges, if the only way to un-allocate is via a truncate, 
> that's pretty asymmetric.

I'd rather we just get the oft-discussed punch() syscall instead.
This is really what "unallocate" would do for persistent allocations
and it would be useful for files that were not preallocated.

For filesystems that don't implement punch(), glibc would do zero-filling
of the punched area I guess (to make it equivalent to reading from a
hole in the file).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
