Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-30 Thread Christoph Hellwig
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?  The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP).  I guess the one potential limitation here is the
> number of function parameters on some architectures.

This isn't really about "more suitable for XFS" but more about more
suitable for sophisticated layout decisions.

But I'm still not confident this should be shohorned into this
syscall.  In fact I'm already rather unhappy about the feature churn in
the current patch series.

The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-30 Thread Christoph Hellwig
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
 I suppose it might be a bit late in the game to add a goal
 parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
 the API more suitable for XFS?  The goal could be a single __u64, or
 a struct with e.g. __u64 byte offset (possibly also __u32 lun like
 in FIEMAP).  I guess the one potential limitation here is the
 number of function parameters on some architectures.

This isn't really about more suitable for XFS but more about more
suitable for sophisticated layout decisions.

But I'm still not confident this should be shohorned into this
syscall.  In fact I'm already rather unhappy about the feature churn in
the current patch series.

The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  22:04 +1000, David Chinner wrote:
> On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> > > B FA_DEALLOCATE
> > > removes the underlying disk space with the given range. The disk space
> > > shall be removed regardless of it's contents so both allocated space
> > > from
> > > B FA_ALLOCATE
> > > and
> > > B FA_PREALLOCATE
> > > as well as from
> > > B write(3)
> > > will be removed.
> > > B FA_DEALLOCATE
> > > shall never remove disk blocks outside the range specified.
> > 
> > So this is essentially the same as "punch".
> 
> Depends on your definition of "punch".
> 
> > There doesn't seem to be
> > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> > end.
> 
> ftruncate()

No, that will delete written data also.  What I'm thinking is in cases
where an application does fallocate() to reserve a lot of space, and
when the application is finished it wants to unreserve any unused space.

> > > B FA_DEALLOCATE
> > > shall never change the file size. If changing the file size
> > > is required when deallocating blocks from an offset to end
> > > of file (or beyond end of file) is required,
> > > B ftuncate64(3)
> > > should be used.
> > 
> > This also seems to be a bit of a wart, since it isn't a natural converse
> > of either of the above functions.  How about having two modes,
> > similar to FA_ALLOCATE and FA_PREALLOCATE?
> 
> 
> 
> whatever.
> 
> > Say, FA_PUNCH (which
> > would be as you describe here - deletes all data in the specified
> > range changing the file size if it overlaps EOF,
> 
> Punch means different things to different people. To me (and probably
> most XFS aware ppl) punch implies no change to the file size.

If "punch" does not change the file size, how is it possible to determine
the end of the actual written data?  Say you have a file with records
in it, and these records are cancelled as they are processed (e.g. a
journal of sorts).  One usage model for punch() that we had in the past
is to punch out each record after it finishes processing, so that it will
not be re-processed after a crash.  If the file size doesn't change with
punch then there is no way to know when the last record is hit and the
rest of the file needs to be scanned.

> i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
> holes to leave the file size unchanged. This is the behaviour I
> described for FA_DEALLOCATE.
> 
> > and FA_DEALLOCATE,
> > which only deallocates unused FA_{PRE,}ALLOCATE space?
> 
> That's an "unwritten-to-hole" extent conversion. Is that really
> useful for anything? That's easily implemented with FIEMAP
> and FA_DEALLOCATE.

But why force the application to do this instead of making the
fallocate API sensible and allowing it to be done directly?

> Anyway, because we can't agree on a single pair of flags:
> 
>   FA_ALLOCATE== posix_fallocate()
>   FA_DEALLOCATE  == unwritten-to-hole ???

I'd think this makes sense, being natural opposites of each other.
FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE
shouldn't erase existing data.  If FA_ALLOCATE extends the file size,
then FA_DEALLOCATE should shrink it if there is no data at the end.

>   FA_RESV_SPACE  == XFS_IOC_RESVSP64
>   FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

> > We might also consider making @mode be a mask instead of an enumeration:
> > 
> > FA_FL_DEALLOC   0x01 (default allocate)
> > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
> > FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)
> 
> #define FA_ALLOCATE   0
> #define FA_DEALLOCATE FA_FL_DEALLOC
> #define FA_RESV_SPACE FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE   FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64
clear at least.  The benefit is that it would also be possible (I'm
not necessarily advocating this as a flag, just an example) to have
semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while
preallocating) with:

#define FA_ZERO_SPACEFA_DEL_DATA

or whatever semantics the caller actually wants, instead of restricting
them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE
(whatever it is we decide on in the end).

> > > B ENOSPC
> > > There is not enough space left on the device containing the file
> > > referred to by
> > > IR fd.
> > 
> > Should probably say whether space is removed on failure or not.  In
> 
> Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
> allocated was freed back up. That way the error handling in the allocate
> functions is much simpler (i.e. no need to undo there).

Hmm, another flag?  FA_FL_FREE_ENOSPC?  I can imagine applications like
PVRs to want to preallocate, say, an estimated 30 min of space for a show
but if they only get 25 min of space returned they know some cleanup is
in order (which can be done asynchronously while the show is filling the
first 25 min of 

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread David Chinner
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> On Jun 14, 2007  09:52 +1000, David Chinner wrote:
> > B FA_PREALLOCATE
> > provides the same functionality as
> > B FA_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> > append workloads.
> > TP
> > B FA_DEALLOCATE
> > removes the underlying disk space with the given range. The disk space
> > shall be removed regardless of it's contents so both allocated space
> > from
> > B FA_ALLOCATE
> > and
> > B FA_PREALLOCATE
> > as well as from
> > B write(3)
> > will be removed.
> > B FA_DEALLOCATE
> > shall never remove disk blocks outside the range specified.
> 
> So this is essentially the same as "punch".

Depends on your definition of "punch".

> There doesn't seem to be
> a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> end.

ftruncate()

> > B FA_DEALLOCATE
> > shall never change the file size. If changing the file size
> > is required when deallocating blocks from an offset to end
> > of file (or beyond end of file) is required,
> > B ftuncate64(3)
> > should be used.
> 
> This also seems to be a bit of a wart, since it isn't a natural converse
> of either of the above functions.  How about having two modes,
> similar to FA_ALLOCATE and FA_PREALLOCATE?



whatever.

> Say, FA_PUNCH (which
> would be as you describe here - deletes all data in the specified
> range changing the file size if it overlaps EOF,

Punch means different things to different people. To me (and probably
most XFS aware ppl) punch implies no change to the file size.

i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.

> and FA_DEALLOCATE,
> which only deallocates unused FA_{PRE,}ALLOCATE space?

That's an "unwritten-to-hole" extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.

Anyway, because we can't agree on a single pair of flags:

FA_ALLOCATE== posix_fallocate()
FA_DEALLOCATE  == unwritten-to-hole ???
FA_RESV_SPACE  == XFS_IOC_RESVSP64
FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

> We might also consider making @mode be a mask instead of an enumeration:
> 
> FA_FL_DEALLOC 0x01 (default allocate)
> FA_FL_KEEP_SIZE   0x02 (default extend/shrink size)
> FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC)

i.e:

#define FA_ALLOCATE 0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.

> The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP).  I guess the one potential limitation here is the
> number of function parameters on some architectures.

To be useful it needs to __u64.

> > B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > IR fd.
> 
> Should probably say whether space is removed on failure or not.  In

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  09:52 +1000, David Chinner wrote:
> B FA_PREALLOCATE
> provides the same functionality as
> B FA_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising
> append workloads.
> TP
> B FA_DEALLOCATE
> removes the underlying disk space with the given range. The disk space
> shall be removed regardless of it's contents so both allocated space
> from
> B FA_ALLOCATE
> and
> B FA_PREALLOCATE
> as well as from
> B write(3)
> will be removed.
> B FA_DEALLOCATE
> shall never remove disk blocks outside the range specified.

So this is essentially the same as "punch".  There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
end.

> B FA_DEALLOCATE
> shall never change the file size. If changing the file size
> is required when deallocating blocks from an offset to end
> of file (or beyond end of file) is required,
> B ftuncate64(3)
> should be used.

This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions.  How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE?  Say, FA_PUNCH (which
would be as you describe here - deletes all data in the specified
range changing the file size if it overlaps EOF, and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?

We might also consider making @mode be a mask instead of an enumeration:

FA_FL_DEALLOC   0x01 (default allocate)
FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)

We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.

I suppose it might be a bit late in the game to add a "goal"
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS?  The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP).  I guess the one potential limitation here is the
number of function parameters on some architectures.

> B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> IR fd.

Should probably say whether space is removed on failure or not.  In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  09:52 +1000, David Chinner wrote:
 B FA_PREALLOCATE
 provides the same functionality as
 B FA_ALLOCATE
 except it does not ever change the file size. This allows allocation
 of zero blocks beyond the end of file and is useful for optimising
 append workloads.
 TP
 B FA_DEALLOCATE
 removes the underlying disk space with the given range. The disk space
 shall be removed regardless of it's contents so both allocated space
 from
 B FA_ALLOCATE
 and
 B FA_PREALLOCATE
 as well as from
 B write(3)
 will be removed.
 B FA_DEALLOCATE
 shall never remove disk blocks outside the range specified.

So this is essentially the same as punch.  There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
end.

 B FA_DEALLOCATE
 shall never change the file size. If changing the file size
 is required when deallocating blocks from an offset to end
 of file (or beyond end of file) is required,
 B ftuncate64(3)
 should be used.

This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions.  How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE?  Say, FA_PUNCH (which
would be as you describe here - deletes all data in the specified
range changing the file size if it overlaps EOF, and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?

We might also consider making @mode be a mask instead of an enumeration:

FA_FL_DEALLOC   0x01 (default allocate)
FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)

We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.

I suppose it might be a bit late in the game to add a goal
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS?  The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP).  I guess the one potential limitation here is the
number of function parameters on some architectures.

 B ENOSPC
 There is not enough space left on the device containing the file
 referred to by
 IR fd.

Should probably say whether space is removed on failure or not.  In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread David Chinner
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
 On Jun 14, 2007  09:52 +1000, David Chinner wrote:
  B FA_PREALLOCATE
  provides the same functionality as
  B FA_ALLOCATE
  except it does not ever change the file size. This allows allocation
  of zero blocks beyond the end of file and is useful for optimising
  append workloads.
  TP
  B FA_DEALLOCATE
  removes the underlying disk space with the given range. The disk space
  shall be removed regardless of it's contents so both allocated space
  from
  B FA_ALLOCATE
  and
  B FA_PREALLOCATE
  as well as from
  B write(3)
  will be removed.
  B FA_DEALLOCATE
  shall never remove disk blocks outside the range specified.
 
 So this is essentially the same as punch.

Depends on your definition of punch.

 There doesn't seem to be
 a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
 end.

ftruncate()

  B FA_DEALLOCATE
  shall never change the file size. If changing the file size
  is required when deallocating blocks from an offset to end
  of file (or beyond end of file) is required,
  B ftuncate64(3)
  should be used.
 
 This also seems to be a bit of a wart, since it isn't a natural converse
 of either of the above functions.  How about having two modes,
 similar to FA_ALLOCATE and FA_PREALLOCATE?

shrug

whatever.

 Say, FA_PUNCH (which
 would be as you describe here - deletes all data in the specified
 range changing the file size if it overlaps EOF,

Punch means different things to different people. To me (and probably
most XFS aware ppl) punch implies no change to the file size.

i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.

 and FA_DEALLOCATE,
 which only deallocates unused FA_{PRE,}ALLOCATE space?

That's an unwritten-to-hole extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.

Anyway, because we can't agree on a single pair of flags:

FA_ALLOCATE== posix_fallocate()
FA_DEALLOCATE  == unwritten-to-hole ???
FA_RESV_SPACE  == XFS_IOC_RESVSP64
FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

 We might also consider making @mode be a mask instead of an enumeration:
 
 FA_FL_DEALLOC 0x01 (default allocate)
 FA_FL_KEEP_SIZE   0x02 (default extend/shrink size)
 FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC)

i.e:

#define FA_ALLOCATE 0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

 I suppose it might be a bit late in the game to add a goal
 parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
 the API more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.

 The goal could be a single __u64, or
 a struct with e.g. __u64 byte offset (possibly also __u32 lun like
 in FIEMAP).  I guess the one potential limitation here is the
 number of function parameters on some architectures.

To be useful it needs to __u64.

  B ENOSPC
  There is not enough space left on the device containing the file
  referred to by
  IR fd.
 
 Should probably say whether space is removed on failure or not.  In

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  22:04 +1000, David Chinner wrote:
 On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
   B FA_DEALLOCATE
   removes the underlying disk space with the given range. The disk space
   shall be removed regardless of it's contents so both allocated space
   from
   B FA_ALLOCATE
   and
   B FA_PREALLOCATE
   as well as from
   B write(3)
   will be removed.
   B FA_DEALLOCATE
   shall never remove disk blocks outside the range specified.
  
  So this is essentially the same as punch.
 
 Depends on your definition of punch.
 
  There doesn't seem to be
  a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
  end.
 
 ftruncate()

No, that will delete written data also.  What I'm thinking is in cases
where an application does fallocate() to reserve a lot of space, and
when the application is finished it wants to unreserve any unused space.

   B FA_DEALLOCATE
   shall never change the file size. If changing the file size
   is required when deallocating blocks from an offset to end
   of file (or beyond end of file) is required,
   B ftuncate64(3)
   should be used.
  
  This also seems to be a bit of a wart, since it isn't a natural converse
  of either of the above functions.  How about having two modes,
  similar to FA_ALLOCATE and FA_PREALLOCATE?
 
 shrug
 
 whatever.
 
  Say, FA_PUNCH (which
  would be as you describe here - deletes all data in the specified
  range changing the file size if it overlaps EOF,
 
 Punch means different things to different people. To me (and probably
 most XFS aware ppl) punch implies no change to the file size.

If punch does not change the file size, how is it possible to determine
the end of the actual written data?  Say you have a file with records
in it, and these records are cancelled as they are processed (e.g. a
journal of sorts).  One usage model for punch() that we had in the past
is to punch out each record after it finishes processing, so that it will
not be re-processed after a crash.  If the file size doesn't change with
punch then there is no way to know when the last record is hit and the
rest of the file needs to be scanned.

 i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
 holes to leave the file size unchanged. This is the behaviour I
 described for FA_DEALLOCATE.
 
  and FA_DEALLOCATE,
  which only deallocates unused FA_{PRE,}ALLOCATE space?
 
 That's an unwritten-to-hole extent conversion. Is that really
 useful for anything? That's easily implemented with FIEMAP
 and FA_DEALLOCATE.

But why force the application to do this instead of making the
fallocate API sensible and allowing it to be done directly?

 Anyway, because we can't agree on a single pair of flags:
 
   FA_ALLOCATE== posix_fallocate()
   FA_DEALLOCATE  == unwritten-to-hole ???

I'd think this makes sense, being natural opposites of each other.
FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE
shouldn't erase existing data.  If FA_ALLOCATE extends the file size,
then FA_DEALLOCATE should shrink it if there is no data at the end.

   FA_RESV_SPACE  == XFS_IOC_RESVSP64
   FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

  We might also consider making @mode be a mask instead of an enumeration:
  
  FA_FL_DEALLOC   0x01 (default allocate)
  FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
  FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)
 
 #define FA_ALLOCATE   0
 #define FA_DEALLOCATE FA_FL_DEALLOC
 #define FA_RESV_SPACE FA_FL_KEEP_SIZE
 #define FA_UNRESV_SPACE   FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64
clear at least.  The benefit is that it would also be possible (I'm
not necessarily advocating this as a flag, just an example) to have
semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while
preallocating) with:

#define FA_ZERO_SPACEFA_DEL_DATA

or whatever semantics the caller actually wants, instead of restricting
them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE
(whatever it is we decide on in the end).

   B ENOSPC
   There is not enough space left on the device containing the file
   referred to by
   IR fd.
  
  Should probably say whether space is removed on failure or not.  In
 
 Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
 allocated was freed back up. That way the error handling in the allocate
 functions is much simpler (i.e. no need to undo there).

Hmm, another flag?  FA_FL_FREE_ENOSPC?  I can imagine applications like
PVRs to want to preallocate, say, an estimated 30 min of space for a show
but if they only get 25 min of space returned they know some cleanup is
in order (which can be done asynchronously while the show is filling the
first 25 min of preallocated space).  Otherwise, they have to loop in
userspace trying decreasing preallocations until they fit, or starting
small and incrementally 

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-13 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).

First pass is attached.

`nroff -man fallocate.2 | less` to view.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include 
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are three modes:
.TP
.B FA_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
B FA_ALLOCATE
closely resembles
B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
B FA_ALLOCATE
may allocate a larger range that was specified.
TP
B FA_PREALLOCATE
provides the same functionality as
B FA_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
TP
B FA_DEALLOCATE
removes the underlying disk space with the given range. The disk space
shall be removed regardless of it's contents so both allocated space
from
B FA_ALLOCATE
and
B FA_PREALLOCATE
as well as from
B write(3)
will be removed.
B FA_DEALLOCATE
shall never remove disk blocks outside the range specified.
B FA_DEALLOCATE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file) is required,
B ftuncate64(3)
should be used.

SH "RETURN VALUE"
BR fallocate()
returns zero on success, or an error number on failure.
Note that
IR errno
is not set.
SH "ERRORS"
TP
B EBADF
I fd
is not a valid file descriptor, or is not opened for writing.
TP
B EFBIG
I offset+len
exceeds the maximum file size.
TP
B EINVAL
I offset
or
I len
was less than 0.
TP
B ENODEV
I fd
does not refer to a regular file or a directory.
TP
B ENOSPC
There is not enough space left on the device containing the file
referred to by
IR fd.
TP
B ESPIPE
I fd
refers to a pipe of file descriptor.
B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
SH AVAILABILITY
The
BR fallocate ()
system call is available since 2.6.XX
SH "SEE ALSO"
BR syscall (2),
BR posix_fadvise (3)
BR ftruncate (3)


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-13 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
 Did you get time to write the above man page ? It will help to push
 further patches in time (eg. for FA_PREALLOCATE mode).

First pass is attached.

`nroff -man fallocate.2 | less` to view.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include sys/syscall.h
.PP
.BI int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are three modes:
.TP
.B FA_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
B FA_ALLOCATE
closely resembles
B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
B FA_ALLOCATE
may allocate a larger range that was specified.
TP
B FA_PREALLOCATE
provides the same functionality as
B FA_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
TP
B FA_DEALLOCATE
removes the underlying disk space with the given range. The disk space
shall be removed regardless of it's contents so both allocated space
from
B FA_ALLOCATE
and
B FA_PREALLOCATE
as well as from
B write(3)
will be removed.
B FA_DEALLOCATE
shall never remove disk blocks outside the range specified.
B FA_DEALLOCATE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file) is required,
B ftuncate64(3)
should be used.

SH RETURN VALUE
BR fallocate()
returns zero on success, or an error number on failure.
Note that
IR errno
is not set.
SH ERRORS
TP
B EBADF
I fd
is not a valid file descriptor, or is not opened for writing.
TP
B EFBIG
I offset+len
exceeds the maximum file size.
TP
B EINVAL
I offset
or
I len
was less than 0.
TP
B ENODEV
I fd
does not refer to a regular file or a directory.
TP
B ENOSPC
There is not enough space left on the device containing the file
referred to by
IR fd.
TP
B ESPIPE
I fd
refers to a pipe of file descriptor.
B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
SH AVAILABILITY
The
BR fallocate ()
system call is available since 2.6.XX
SH SEE ALSO
BR syscall (2),
BR posix_fadvise (3)
BR ftruncate (3)


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> > Minimal definition to replace what applicaitons use on XFS and to
> > support poasix_fallocate are the thre that have been mentioned so
> > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> > all in a man page...
> 
> Hi Dave,
> 
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).

No, I didn't. Instead of working on new preallocation stuff, I've
been spending all my time fixing bugs found by new and interesting
(ab)uses of preallocation and hole punching.

> The idea I had was to push the patch with bare minimum functionality
> (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
> new mode(s) based on the man page you planned to provide.

Push them. I'll just make XFS work with whatever is provided.
Is there a test harness for the syscall yet?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread Amit K. Arora
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > > All I'm really interested in right now is that the fallocate
> > > _interface_ can be used as a *complete replacement* for the
> > > pre-existing XFS-specific ioctls that are already used by
> > > applications.  What ext4 can or can't do right now is irrelevant to
> > > this discussion - the interface definition needs to take priority
> > > over implementation
> > 
> > Would you like to write up an interface definition description (likely
> > man page) and post it for review, possibly with a mention of apps using
> > it today ?
> 
> Yeah, I started doing that yesterday as i figured it was the only way
> to cut the discussion short
> 
> > One reason for introducing the mode parameter was to allow the interface to
> > evolve incrementally as more options / semantic questions are proposed, so
> > that we don't have to make all the decisions right now. 
> > So it would be good to start with a *minimal* definition, even just one 
> > mode.
> > The rest could follow as subsequent patches, each being reviewed and debated
> > separately. Otherwise this discussion can drag on for a long time.
> 
> Minimal definition to replace what applicaitons use on XFS and to
> support poasix_fallocate are the thre that have been mentioned so
> far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> all in a man page...

Hi Dave,

Did you get time to write the above man page ? It will help to push
further patches in time (eg. for FA_PREALLOCATE mode).

The idea I had was to push the patch with bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread Amit K. Arora
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
 On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
  On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
   All I'm really interested in right now is that the fallocate
   _interface_ can be used as a *complete replacement* for the
   pre-existing XFS-specific ioctls that are already used by
   applications.  What ext4 can or can't do right now is irrelevant to
   this discussion - the interface definition needs to take priority
   over implementation
  
  Would you like to write up an interface definition description (likely
  man page) and post it for review, possibly with a mention of apps using
  it today ?
 
 Yeah, I started doing that yesterday as i figured it was the only way
 to cut the discussion short
 
  One reason for introducing the mode parameter was to allow the interface to
  evolve incrementally as more options / semantic questions are proposed, so
  that we don't have to make all the decisions right now. 
  So it would be good to start with a *minimal* definition, even just one 
  mode.
  The rest could follow as subsequent patches, each being reviewed and debated
  separately. Otherwise this discussion can drag on for a long time.
 
 Minimal definition to replace what applicaitons use on XFS and to
 support poasix_fallocate are the thre that have been mentioned so
 far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
 all in a man page...

Hi Dave,

Did you get time to write the above man page ? It will help to push
further patches in time (eg. for FA_PREALLOCATE mode).

The idea I had was to push the patch with bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
 On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
  Minimal definition to replace what applicaitons use on XFS and to
  support poasix_fallocate are the thre that have been mentioned so
  far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
  all in a man page...
 
 Hi Dave,
 
 Did you get time to write the above man page ? It will help to push
 further patches in time (eg. for FA_PREALLOCATE mode).

No, I didn't. Instead of working on new preallocation stuff, I've
been spending all my time fixing bugs found by new and interesting
(ab)uses of preallocation and hole punching.

 The idea I had was to push the patch with bare minimum functionality
 (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
 new mode(s) based on the man page you planned to provide.

Push them. I'll just make XFS work with whatever is provided.
Is there a test harness for the syscall yet?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-12 Thread David Chinner
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > All I'm really interested in right now is that the fallocate
> > _interface_ can be used as a *complete replacement* for the
> > pre-existing XFS-specific ioctls that are already used by
> > applications.  What ext4 can or can't do right now is irrelevant to
> > this discussion - the interface definition needs to take priority
> > over implementation
> 
> Would you like to write up an interface definition description (likely
> man page) and post it for review, possibly with a mention of apps using
> it today ?

Yeah, I started doing that yesterday as i figured it was the only way
to cut the discussion short

> One reason for introducing the mode parameter was to allow the interface to
> evolve incrementally as more options / semantic questions are proposed, so
> that we don't have to make all the decisions right now. 
> So it would be good to start with a *minimal* definition, even just one mode.
> The rest could follow as subsequent patches, each being reviewed and debated
> separately. Otherwise this discussion can drag on for a long time.

Minimal definition to replace what applicaitons use on XFS and to
support poasix_fallocate are the thre that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
all in a man page...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-12 Thread David Chinner
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
 On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
  All I'm really interested in right now is that the fallocate
  _interface_ can be used as a *complete replacement* for the
  pre-existing XFS-specific ioctls that are already used by
  applications.  What ext4 can or can't do right now is irrelevant to
  this discussion - the interface definition needs to take priority
  over implementation
 
 Would you like to write up an interface definition description (likely
 man page) and post it for review, possibly with a mention of apps using
 it today ?

Yeah, I started doing that yesterday as i figured it was the only way
to cut the discussion short

 One reason for introducing the mode parameter was to allow the interface to
 evolve incrementally as more options / semantic questions are proposed, so
 that we don't have to make all the decisions right now. 
 So it would be good to start with a *minimal* definition, even just one mode.
 The rest could follow as subsequent patches, each being reviewed and debated
 separately. Otherwise this discussion can drag on for a long time.

Minimal definition to replace what applicaitons use on XFS and to
support poasix_fallocate are the thre that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
all in a man page...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-11 Thread Suparna Bhattacharya
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's comments.
> > > > Will run some tests and post them soon.
> > > > 
> > > > But, before submitting these patches, I think it will be better to
> > > > finalize on certain things which might be worth some discussion here:
> > > > 
> > > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > > in this case. I also tend to agree with them. Does anyone has an
> > > > argument in favor of changing the filesize ?  If not, I will remove the
> > > > code which changes the filesize, before I resubmit the concerned ext4
> > > > patch.
> > > 
> > > I think there needs to be both. If we don't have a mechanism to atomically
> > > change the file size with the preallocation, then applications that use
> > > stat() to work out if they need to preallocate more space will end up
> > > racing.
> > 
> > By "both" above, do you mean we should give user the flexibility if it wants
> > the filesize changed or not ? It can be done by having *two* modes for
> > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> > will change the filesize if required (i.e.  when allocation is beyond EOF)
> > and also update [cm]time.  This way, the application can decide what it
> > wants.
> 
> Yes, that's right.
> 
> > This will be helpfull for the partial allocation scenario also. Think of the
> > case when we do not change the filesize in fallocate() and expect
> > applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> > Now if fallocate() results in a partial allocation with -ENOSPC error
> > returned, applications/posix_fallocate() will not know for what length
> > ftruncate() has to be called.  :(
> 
> Well, posix_fallocate() either gets all the space or it fails. If
> you truncate to extend the file size after an ENOSPC, then that is
> a buggy implementation.
> 
> The same could be said for any application, or even the fallocate()
> call itself if it changes the filesize without having completely
> preallocated the space asked
> 
> > Hence it may be a good idea to give user the flexibility if it wants to
> > atomically change the file size with preallocation or not. But, with more
> > flexibility there comes inconsistency in behavior, which is worth
> > considering.
> 
> We've got different modes to specify different behaviour. That's
> what the mode field was put there for in the first place - the
> interface is *designed* to support different preallocation
> behaviours
> 
> > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > > normal (non-preallocated) blocks (blocks allocated via regular
> > > > write/truncate operations) also (i.e. work as punch()) ?
> > > 
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > > i did for FA_UNALLOCATE as well.
> > 
> > Ok. But, some people may not expect/like this. I think, we can keep it on
> > the backburner for a while, till other issues are sorted out.
> 
> How can it be a "backburner" issue when it defines the
> implementation?  I've already implemented some thing in XFS that
> sort of does what I think that the interface is supposed to do, but
> I need that interface to be nailed down before proceeding any
> further.
> 
> All I'm really interested in right now is that the fallocate
> _interface_ can be used as a *complete replacement* for the
> pre-existing XFS-specific ioctls that are already used by
> applications.  What ext4 can or can't do right now is irrelevant to
> this discussion - the interface definition needs to take priority
> over implementation

Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?

One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> 
> Cheers,
> 
> Dave,
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info 

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-11 Thread Suparna Bhattacharya
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
 On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
  On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
   On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to
finalize on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the file size
in this case. I also tend to agree with them. Does anyone has an
argument in favor of changing the filesize ?  If not, I will remove the
code which changes the filesize, before I resubmit the concerned ext4
patch.
   
   I think there needs to be both. If we don't have a mechanism to atomically
   change the file size with the preallocation, then applications that use
   stat() to work out if they need to preallocate more space will end up
   racing.
  
  By both above, do you mean we should give user the flexibility if it wants
  the filesize changed or not ? It can be done by having *two* modes for
  preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
  use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
  change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
  will change the filesize if required (i.e.  when allocation is beyond EOF)
  and also update [cm]time.  This way, the application can decide what it
  wants.
 
 Yes, that's right.
 
  This will be helpfull for the partial allocation scenario also. Think of the
  case when we do not change the filesize in fallocate() and expect
  applications/posix_fallocate() to do ftruncate() after fallocate() for this.
  Now if fallocate() results in a partial allocation with -ENOSPC error
  returned, applications/posix_fallocate() will not know for what length
  ftruncate() has to be called.  :(
 
 Well, posix_fallocate() either gets all the space or it fails. If
 you truncate to extend the file size after an ENOSPC, then that is
 a buggy implementation.
 
 The same could be said for any application, or even the fallocate()
 call itself if it changes the filesize without having completely
 preallocated the space asked
 
  Hence it may be a good idea to give user the flexibility if it wants to
  atomically change the file size with preallocation or not. But, with more
  flexibility there comes inconsistency in behavior, which is worth
  considering.
 
 We've got different modes to specify different behaviour. That's
 what the mode field was put there for in the first place - the
 interface is *designed* to support different preallocation
 behaviours
 
2) For FA_UNALLOCATE mode, should the file system allow unallocation of
normal (non-preallocated) blocks (blocks allocated via regular
write/truncate operations) also (i.e. work as punch()) ?
   
   Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
   i did for FA_UNALLOCATE as well.
  
  Ok. But, some people may not expect/like this. I think, we can keep it on
  the backburner for a while, till other issues are sorted out.
 
 How can it be a backburner issue when it defines the
 implementation?  I've already implemented some thing in XFS that
 sort of does what I think that the interface is supposed to do, but
 I need that interface to be nailed down before proceeding any
 further.
 
 All I'm really interested in right now is that the fallocate
 _interface_ can be used as a *complete replacement* for the
 pre-existing XFS-specific ioctls that are already used by
 applications.  What ext4 can or can't do right now is irrelevant to
 this discussion - the interface definition needs to take priority
 over implementation

Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?

One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

 
 Cheers,
 
 Dave,
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the 

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread David Chinner
On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > I have the updated patches ready which take care of Andrew's comments.
> > > Will run some tests and post them soon.
> > > 
> > > But, before submitting these patches, I think it will be better to
> > > finalize on certain things which might be worth some discussion here:
> > > 
> > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > in this case. I also tend to agree with them. Does anyone has an
> > > argument in favor of changing the filesize ?  If not, I will remove the
> > > code which changes the filesize, before I resubmit the concerned ext4
> > > patch.
> > 
> > I think there needs to be both. If we don't have a mechanism to atomically
> > change the file size with the preallocation, then applications that use
> > stat() to work out if they need to preallocate more space will end up
> > racing.
> 
> By "both" above, do you mean we should give user the flexibility if it wants
> the filesize changed or not ? It can be done by having *two* modes for
> preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> will change the filesize if required (i.e.  when allocation is beyond EOF)
> and also update [cm]time.  This way, the application can decide what it
> wants.

Yes, that's right.

> This will be helpfull for the partial allocation scenario also. Think of the
> case when we do not change the filesize in fallocate() and expect
> applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> Now if fallocate() results in a partial allocation with -ENOSPC error
> returned, applications/posix_fallocate() will not know for what length
> ftruncate() has to be called.  :(

Well, posix_fallocate() either gets all the space or it fails. If
you truncate to extend the file size after an ENOSPC, then that is
a buggy implementation.

The same could be said for any application, or even the fallocate()
call itself if it changes the filesize without having completely
preallocated the space asked

> Hence it may be a good idea to give user the flexibility if it wants to
> atomically change the file size with preallocation or not. But, with more
> flexibility there comes inconsistency in behavior, which is worth
> considering.

We've got different modes to specify different behaviour. That's
what the mode field was put there for in the first place - the
interface is *designed* to support different preallocation
behaviours

> > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > normal (non-preallocated) blocks (blocks allocated via regular
> > > write/truncate operations) also (i.e. work as punch()) ?
> > 
> > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > i did for FA_UNALLOCATE as well.
> 
> Ok. But, some people may not expect/like this. I think, we can keep it on
> the backburner for a while, till other issues are sorted out.

How can it be a "backburner" issue when it defines the
implementation?  I've already implemented some thing in XFS that
sort of does what I think that the interface is supposed to do, but
I need that interface to be nailed down before proceeding any
further.

All I'm really interested in right now is that the fallocate
_interface_ can be used as a *complete replacement* for the
pre-existing XFS-specific ioctls that are already used by
applications.  What ext4 can or can't do right now is irrelevant to
this discussion - the interface definition needs to take priority
over implementation

Cheers,

Dave,
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread Amit K. Arora
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > I have the updated patches ready which take care of Andrew's comments.
> > Will run some tests and post them soon.
> > 
> > But, before submitting these patches, I think it will be better to finalize
> > on certain things which might be worth some discussion here:
> > 
> > 1) Should the file size change when preallocation is done beyond EOF ?
> >- Andreas and Chris Wedgwood are in favor of not changing the
> >  file size in this case. I also tend to agree with them. Does anyone
> >  has an argument in favor of changing the filesize ?
> >  If not, I will remove the code which changes the filesize, before I
> >  resubmit the concerned ext4 patch.
> 
> I think there needs to be both. If we don't have a mechanism to
> atomically change the file size with the preallocation, then
> applications that use stat() to work out if they need to preallocate
> more space will end up racing.

By "both" above, do you mean we should give user the flexibility if it
wants the filesize changed or not ? It can be done by having *two* modes
for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE
mode is used, fallocate() will change the filesize if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.

This will be helpfull for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called.  :(

Hence it may be a good idea to give user the flexibility if it wants to
atomically change the file size with preallocation or not. But, with
more flexibility there comes inconsistency in behavior, which is worth
considering.

> 
> > 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> >of normal (non-preallocated) blocks (blocks allocated via
> >regular write/truncate operations) also (i.e. work as punch()) ?
> 
> Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
> what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
 
> >- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> >  we need to finalize on the convention here as a general guideline
> >  to all the filesystems that implement fallocate.
> > 
> > 3) If above is true, the file size will need to be changed
> >for "unallocation" when block holding the EOF gets unallocated.
> 
> No - we punch a hole. If you want the filesize to change, then
> you use ftruncate() to remove the blocks at EOF and change the
> file size atomically.

Ok.
> 
> > 4) Should we update mtime & ctime on a successfull allocation/
> >unallocation ?
> >- David Chinner raised this question in following post:
> >  http://lkml.org/lkml/2007/4/29/407
> >  I think it makes sense to update the [mc]time for a successfull
> >  preallocation/unallocation. Does anyone feel otherwise ?
> >  It will be interesting to know how XFS behaves currently. Does XFS
> >  update [mc]time for preallocation ?
> 
> No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
> changes. If the filesize changes, it behaves exactly the same way that
> ftruncate() behaves.

Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread Amit K. Arora
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
 On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
  I have the updated patches ready which take care of Andrew's comments.
  Will run some tests and post them soon.
  
  But, before submitting these patches, I think it will be better to finalize
  on certain things which might be worth some discussion here:
  
  1) Should the file size change when preallocation is done beyond EOF ?
 - Andreas and Chris Wedgwood are in favor of not changing the
   file size in this case. I also tend to agree with them. Does anyone
   has an argument in favor of changing the filesize ?
   If not, I will remove the code which changes the filesize, before I
   resubmit the concerned ext4 patch.
 
 I think there needs to be both. If we don't have a mechanism to
 atomically change the file size with the preallocation, then
 applications that use stat() to work out if they need to preallocate
 more space will end up racing.

By both above, do you mean we should give user the flexibility if it
wants the filesize changed or not ? It can be done by having *two* modes
for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE
mode is used, fallocate() will change the filesize if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.

This will be helpfull for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called.  :(

Hence it may be a good idea to give user the flexibility if it wants to
atomically change the file size with preallocation or not. But, with
more flexibility there comes inconsistency in behavior, which is worth
considering.

 
  2) For FA_UNALLOCATE mode, should the file system allow unallocation
 of normal (non-preallocated) blocks (blocks allocated via
 regular write/truncate operations) also (i.e. work as punch()) ?
 
 Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
 what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
 
 - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
   we need to finalize on the convention here as a general guideline
   to all the filesystems that implement fallocate.
  
  3) If above is true, the file size will need to be changed
 for unallocation when block holding the EOF gets unallocated.
 
 No - we punch a hole. If you want the filesize to change, then
 you use ftruncate() to remove the blocks at EOF and change the
 file size atomically.

Ok.
 
  4) Should we update mtime  ctime on a successfull allocation/
 unallocation ?
 - David Chinner raised this question in following post:
   http://lkml.org/lkml/2007/4/29/407
   I think it makes sense to update the [mc]time for a successfull
   preallocation/unallocation. Does anyone feel otherwise ?
   It will be interesting to know how XFS behaves currently. Does XFS
   update [mc]time for preallocation ?
 
 No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
 changes. If the filesize changes, it behaves exactly the same way that
 ftruncate() behaves.

Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread David Chinner
On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
 On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
  On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
   I have the updated patches ready which take care of Andrew's comments.
   Will run some tests and post them soon.
   
   But, before submitting these patches, I think it will be better to
   finalize on certain things which might be worth some discussion here:
   
   1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the file size
   in this case. I also tend to agree with them. Does anyone has an
   argument in favor of changing the filesize ?  If not, I will remove the
   code which changes the filesize, before I resubmit the concerned ext4
   patch.
  
  I think there needs to be both. If we don't have a mechanism to atomically
  change the file size with the preallocation, then applications that use
  stat() to work out if they need to preallocate more space will end up
  racing.
 
 By both above, do you mean we should give user the flexibility if it wants
 the filesize changed or not ? It can be done by having *two* modes for
 preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
 use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
 change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
 will change the filesize if required (i.e.  when allocation is beyond EOF)
 and also update [cm]time.  This way, the application can decide what it
 wants.

Yes, that's right.

 This will be helpfull for the partial allocation scenario also. Think of the
 case when we do not change the filesize in fallocate() and expect
 applications/posix_fallocate() to do ftruncate() after fallocate() for this.
 Now if fallocate() results in a partial allocation with -ENOSPC error
 returned, applications/posix_fallocate() will not know for what length
 ftruncate() has to be called.  :(

Well, posix_fallocate() either gets all the space or it fails. If
you truncate to extend the file size after an ENOSPC, then that is
a buggy implementation.

The same could be said for any application, or even the fallocate()
call itself if it changes the filesize without having completely
preallocated the space asked

 Hence it may be a good idea to give user the flexibility if it wants to
 atomically change the file size with preallocation or not. But, with more
 flexibility there comes inconsistency in behavior, which is worth
 considering.

We've got different modes to specify different behaviour. That's
what the mode field was put there for in the first place - the
interface is *designed* to support different preallocation
behaviours

   2) For FA_UNALLOCATE mode, should the file system allow unallocation of
   normal (non-preallocated) blocks (blocks allocated via regular
   write/truncate operations) also (i.e. work as punch()) ?
  
  Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
  i did for FA_UNALLOCATE as well.
 
 Ok. But, some people may not expect/like this. I think, we can keep it on
 the backburner for a while, till other issues are sorted out.

How can it be a backburner issue when it defines the
implementation?  I've already implemented some thing in XFS that
sort of does what I think that the interface is supposed to do, but
I need that interface to be nailed down before proceeding any
further.

All I'm really interested in right now is that the fallocate
_interface_ can be used as a *complete replacement* for the
pre-existing XFS-specific ioctls that are already used by
applications.  What ext4 can or can't do right now is irrelevant to
this discussion - the interface definition needs to take priority
over implementation

Cheers,

Dave,
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread David Chinner
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
> 
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
> 
> 1) Should the file size change when preallocation is done beyond EOF ?
>- Andreas and Chris Wedgwood are in favor of not changing the
>  file size in this case. I also tend to agree with them. Does anyone
>  has an argument in favor of changing the filesize ?
>  If not, I will remove the code which changes the filesize, before I
>  resubmit the concerned ext4 patch.

I think there needs to be both. If we don't have a mechanism to
atomically change the file size with the preallocation, then
applications that use stat() to work out if they need to preallocate
more space will end up racing.

> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>of normal (non-preallocated) blocks (blocks allocated via
>regular write/truncate operations) also (i.e. work as punch()) ?

Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what i did for FA_UNALLOCATE as well.

>- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>  we need to finalize on the convention here as a general guideline
>  to all the filesystems that implement fallocate.
> 
> 3) If above is true, the file size will need to be changed
>for "unallocation" when block holding the EOF gets unallocated.

No - we punch a hole. If you want the filesize to change, then
you use ftruncate() to remove the blocks at EOF and change the
file size atomically.

> 4) Should we update mtime & ctime on a successfull allocation/
>unallocation ?
>- David Chinner raised this question in following post:
>  http://lkml.org/lkml/2007/4/29/407
>  I think it makes sense to update the [mc]time for a successfull
>  preallocation/unallocation. Does anyone feel otherwise ?
>  It will be interesting to know how XFS behaves currently. Does XFS
>  update [mc]time for preallocation ?

No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
changes. If the filesize changes, it behaves exactly the same way that
ftruncate() behaves.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Mingming Cao
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
> 
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
> 
> 1) Should the file size change when preallocation is done beyond EOF ?
>- Andreas and Chris Wedgwood are in favor of not changing the
>  file size in this case. I also tend to agree with them. Does anyone
>  has an argument in favor of changing the filesize ?
>  If not, I will remove the code which changes the filesize, before I
>  resubmit the concerned ext4 patch.
> 

If we chose not to update the file size beyong EOF, then for filesystem
without fallocate() support (ext2,3 currently), posix_fallocate() will
follow the hard way(zero-out) to do preallocation. Then we will get
different behavior on filesystems w/o fallocate() support. It make sense
to be consistent, IMO.

My point of view, preallocation is just a efficient way to allocating
blocks for files without zero-out, other than this, the new behavior
should be consistent with the old way: file size update,mtime/ctime,
ENOSPC etc.

Mingming


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>of normal (non-preallocated) blocks (blocks allocated via
>regular write/truncate operations) also (i.e. work as punch()) ?
>- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>  we need to finalize on the convention here as a general guideline
>  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

> 3) If above is true, the file size will need to be changed
>for "unallocation" when block holding the EOF gets unallocated.
>- If we do not "unallocate" normal (non-preallocated) blocks and we
>  do not change the file size on preallocation, then this is a
>  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.

> 4) Should we update mtime & ctime on a successfull allocation/
>unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the "this file was modified"
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
 file size in this case. I also tend to agree with them. Does anyone
 has an argument in favor of changing the filesize ?
 If not, I will remove the code which changes the filesize, before I
 resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
 we need to finalize on the convention here as a general guideline
 to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for "unallocation" when block holding the EOF gets unallocated.
   - If we do not "unallocate" normal (non-preallocated) blocks and we
 do not change the file size on preallocation, then this is a
 non-issue.

4) Should we update mtime & ctime on a successfull allocation/
   unallocation ?
   - David Chinner raised this question in following post:
 http://lkml.org/lkml/2007/4/29/407
 I think it makes sense to update the [mc]time for a successfull
 preallocation/unallocation. Does anyone feel otherwise ?
 It will be interesting to know how XFS behaves currently. Does XFS
 update [mc]time for preallocation ?


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> 
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> > 
> > I think that was understood.
> 
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
> 
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

You are right to say that. But, it may not be _that_ a minor point,
especially for the arch which is getting affected. It has
other implications like what Heiko noticed in his post below:
http://lkml.org/lkml/2007/4/27/377
 - implications like modifying glibc and *trace utilities for a particular
arch.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Martin Schwidefsky

On 5/9/07, Paul Mackerras <[EMAIL PROTECTED]> wrote:

Suparna Bhattacharya writes:

> > Of course the interface used by an application program would have the
> > fd first.  Glibc can do the translation.
>
> I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).


Ah, almost but not quite the point. But I admit it is hard to understand..
The trouble started with the futex call which has been the first
system call with 6 arguments. s390 supported only 5 arguments up to
that point (%r2 - %r6). For futex we added a wrapper to the glibc that
loaded the 6th argument to %r7. In entry.S we set up things so that
%r7 gets stored to the kernel stack where normal C code expects the
first overflow argument. This enabled us to use the standard futex
system call with 6 arguments.
fallocate now has an additional problem: the last argument is a 64 bit
integers AND registers %r2-%r5 are already used. In this case the 64
bit number would have to be split into the high part in %r6 and the
low part on the stack so that the glibc wrapper can load the low part
to %r7. But the C compiler will skip %r6 and store the 64 bit number
on the stack.
If the order of the arguments if modified so that %r6 is assigned to a
32-bit argument, then the entry.S magic with %r7 would work.

--
blue skies,
 Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Paul Mackerras
Suparna Bhattacharya writes:

> > Of course the interface used by an application program would have the
> > fd first.  Glibc can do the translation.
> 
> I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> 
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range.  Maybe the prototype should be:
> > > 
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> > 
> > Yes, but the trouble is that there was a contrary viewpoint preferring that 
> > fd
> > first be maintained as a convention like other syscalls (see the following
> > posts)
> 
> Of course the interface used by an application program would have the
> fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

> 
> Paul.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Paul Mackerras
Suparna Bhattacharya writes:

> > This looks like it will have the same problem on s390 as
> > sys_sync_file_range.  Maybe the prototype should be:
> > 
> > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> 
> Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> first be maintained as a convention like other syscalls (see the following
> posts)

Of course the interface used by an application program would have the
fd first.  Glibc can do the translation.

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> 
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t 
> > > len)
> > 
> > Please add a comment over this function which specifies its behaviour. 
> > Really it should be enough material from which a full manpage can be
> > written.
> 
> This looks like it will have the same problem on s390 as
> sys_sync_file_range.  Maybe the prototype should be:
> 
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)

http://marc.info/?l=linux-fsdevel=117585330016809=2 (Andreas)
http://marc.info/?l=linux-fsdevel=117690157917378=2  (Andreas)

http://marc.info/?l=linux-fsdevel=117578821827323=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390

http://marc.info/?l=linux-fsdevel=117760995610639=2  
http://marc.info/?l=linux-fsdevel=117708124913098=2 
http://marc.info/?l=linux-fsdevel=117767607229807=2

Are there any better ideas ?

Regards
Suparna

> 
> Paul.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
 Andrew Morton writes:
 
  On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:
  
   This patch implements the fallocate() system call and adds support for
   i386, x86_64 and powerpc.
   
   ...
  
   +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t 
   len)
  
  Please add a comment over this function which specifies its behaviour. 
  Really it should be enough material from which a full manpage can be
  written.
 
 This looks like it will have the same problem on s390 as
 sys_sync_file_range.  Maybe the prototype should be:
 
 asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)

http://marc.info/?l=linux-fsdevelm=117585330016809w=2 (Andreas)
http://marc.info/?l=linux-fsdevelm=117690157917378w=2  (Andreas)

http://marc.info/?l=linux-fsdevelm=117578821827323w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390

http://marc.info/?l=linux-fsdevelm=117760995610639w=2  
http://marc.info/?l=linux-fsdevelm=117708124913098w=2 
http://marc.info/?l=linux-fsdevelm=117767607229807w=2

Are there any better ideas ?

Regards
Suparna

 
 Paul.
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Paul Mackerras
Suparna Bhattacharya writes:

  This looks like it will have the same problem on s390 as
  sys_sync_file_range.  Maybe the prototype should be:
  
  asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
 
 Yes, but the trouble is that there was a contrary viewpoint preferring that fd
 first be maintained as a convention like other syscalls (see the following
 posts)

Of course the interface used by an application program would have the
fd first.  Glibc can do the translation.

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
 Suparna Bhattacharya writes:
 
   This looks like it will have the same problem on s390 as
   sys_sync_file_range.  Maybe the prototype should be:
   
   asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
  
  Yes, but the trouble is that there was a contrary viewpoint preferring that 
  fd
  first be maintained as a convention like other syscalls (see the following
  posts)
 
 Of course the interface used by an application program would have the
 fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

 
 Paul.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Paul Mackerras
Suparna Bhattacharya writes:

  Of course the interface used by an application program would have the
  fd first.  Glibc can do the translation.
 
 I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Martin Schwidefsky

On 5/9/07, Paul Mackerras [EMAIL PROTECTED] wrote:

Suparna Bhattacharya writes:

  Of course the interface used by an application program would have the
  fd first.  Glibc can do the translation.

 I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).


Ah, almost but not quite the point. But I admit it is hard to understand..
The trouble started with the futex call which has been the first
system call with 6 arguments. s390 supported only 5 arguments up to
that point (%r2 - %r6). For futex we added a wrapper to the glibc that
loaded the 6th argument to %r7. In entry.S we set up things so that
%r7 gets stored to the kernel stack where normal C code expects the
first overflow argument. This enabled us to use the standard futex
system call with 6 arguments.
fallocate now has an additional problem: the last argument is a 64 bit
integers AND registers %r2-%r5 are already used. In this case the 64
bit number would have to be split into the high part in %r6 and the
low part on the stack so that the glibc wrapper can load the low part
to %r7. But the C compiler will skip %r6 and store the 64 bit number
on the stack.
If the order of the arguments if modified so that %r6 is assigned to a
32-bit argument, then the entry.S magic with %r7 would work.

--
blue skies,
 Martin
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
 Suparna Bhattacharya writes:
 
   Of course the interface used by an application program would have the
   fd first.  Glibc can do the translation.
  
  I think that was understood.
 
 OK, then what does it matter what the glibc/kernel interface is, as
 long as it works?
 
 It's only a minor point; the order of arguments can vary between
 architectures if necessary, but it's nicer if they don't have to.
 32-bit powerpc will need to have the two int arguments adjacent in
 order to avoid using more than 6 argument registers at the user/kernel
 boundary, and s390 will need to avoid having a 64-bit argument last
 (if I understand it correctly).

You are right to say that. But, it may not be _that_ a minor point,
especially for the arch which is getting affected. It has
other implications like what Heiko noticed in his post below:
http://lkml.org/lkml/2007/4/27/377
 - implications like modifying glibc and *trace utilities for a particular
arch.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
 file size in this case. I also tend to agree with them. Does anyone
 has an argument in favor of changing the filesize ?
 If not, I will remove the code which changes the filesize, before I
 resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
 we need to finalize on the convention here as a general guideline
 to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for unallocation when block holding the EOF gets unallocated.
   - If we do not unallocate normal (non-preallocated) blocks and we
 do not change the file size on preallocation, then this is a
 non-issue.

4) Should we update mtime  ctime on a successfull allocation/
   unallocation ?
   - David Chinner raised this question in following post:
 http://lkml.org/lkml/2007/4/29/407
 I think it makes sense to update the [mc]time for a successfull
 preallocation/unallocation. Does anyone feel otherwise ?
 It will be interesting to know how XFS behaves currently. Does XFS
 update [mc]time for preallocation ?


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.
- If we do not unallocate normal (non-preallocated) blocks and we
  do not change the file size on preallocation, then this is a
  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.

 4) Should we update mtime  ctime on a successfull allocation/
unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the this file was modified
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Mingming Cao
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
 I have the updated patches ready which take care of Andrew's comments.
 Will run some tests and post them soon.
 
 But, before submitting these patches, I think it will be better to finalize
 on certain things which might be worth some discussion here:
 
 1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
  file size in this case. I also tend to agree with them. Does anyone
  has an argument in favor of changing the filesize ?
  If not, I will remove the code which changes the filesize, before I
  resubmit the concerned ext4 patch.
 

If we chose not to update the file size beyong EOF, then for filesystem
without fallocate() support (ext2,3 currently), posix_fallocate() will
follow the hard way(zero-out) to do preallocation. Then we will get
different behavior on filesystems w/o fallocate() support. It make sense
to be consistent, IMO.

My point of view, preallocation is just a efficient way to allocating
blocks for files without zero-out, other than this, the new behavior
should be consistent with the old way: file size update,mtime/ctime,
ENOSPC etc.

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread David Chinner
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
 I have the updated patches ready which take care of Andrew's comments.
 Will run some tests and post them soon.
 
 But, before submitting these patches, I think it will be better to finalize
 on certain things which might be worth some discussion here:
 
 1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
  file size in this case. I also tend to agree with them. Does anyone
  has an argument in favor of changing the filesize ?
  If not, I will remove the code which changes the filesize, before I
  resubmit the concerned ext4 patch.

I think there needs to be both. If we don't have a mechanism to
atomically change the file size with the preallocation, then
applications that use stat() to work out if they need to preallocate
more space will end up racing.

 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?

Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what i did for FA_UNALLOCATE as well.

- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.
 
 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.

No - we punch a hole. If you want the filesize to change, then
you use ftruncate() to remove the blocks at EOF and change the
file size atomically.

 4) Should we update mtime  ctime on a successfull allocation/
unallocation ?
- David Chinner raised this question in following post:
  http://lkml.org/lkml/2007/4/29/407
  I think it makes sense to update the [mc]time for a successfull
  preallocation/unallocation. Does anyone feel otherwise ?
  It will be interesting to know how XFS behaves currently. Does XFS
  update [mc]time for preallocation ?

No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
changes. If the filesize changes, it behaves exactly the same way that
ftruncate() behaves.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Ulrich Drepper

Jakub Jelinek wrote:

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail".  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for len 
bytes"
doesn't make sense for negative size.  


This wording has already been cleaned up.  The current draft for the 
next revision reads:



[EINVAL] The len argument is less than or equal to zero, or the offset
 argument is less than zero, or the underlying file system does not
 support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len<0 is already outlawed.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the 
> above
> check was relaxed?

I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?

One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.
> 
> If that's all too much, this material should at least be spelled out in the
> changelog.  Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
> 
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice.  Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
> 
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?

Ok. I will add a detailed comment here.

> 
> > +{
> > +   struct file *file;
> > +   struct inode *inode;
> > +   long ret = -EINVAL;
> > +
> > +   if (len == 0 || offset < 0)
> > +   goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

> 
> > +   ret = -EBADF;
> > +   file = fget(fd);
> > +   if (!file)
> > +   goto out;
> > +   if (!(file->f_mode & FMODE_WRITE))
> > +   goto out_fput;
> > +
> > +   inode = file->f_path.dentry->d_inode;
> > +
> > +   ret = -ESPIPE;
> > +   if (S_ISFIFO(inode->i_mode))
> > +   goto out_fput;
> > +
> > +   ret = -ENODEV;
> > +   if (!S_ISREG(inode->i_mode))
> > +   goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

True. 
 
> > +   ret = -EFBIG;
> > +   if (offset + len > inode->i_sb->s_maxbytes)
> > +   goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
> here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
> > +   if (inode->i_op && inode->i_op->fallocate)
> > +   ret = inode->i_op->fallocate(inode, mode, offset, len);
> > +   else
> > +   ret = -ENOSYS;
> 
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
> 
> If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
> > +out_fput:
> > +   fput(file);
> > +out:
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
> 
> I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE0x1
> > +#define FA_DEALLOCATE  0x2
> 
> Now those aren't in posix.  They should be documented, along with their
> expected semantics.

Will add a comment describing the role of these modes.
 
> >  #ifdef __KERNEL__
> >  
> >  #include 
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> > ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > int (*removexattr) (struct dentry *, const char *);
> > void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
> 
> I really do think it's better to put the variable names in definitions such
> as this.  Especially when we have two identically-typed variables next to
> each other like that.  Quick: which one is the offset and which is the
> length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.
 
 If that's all too much, this material should at least be spelled out in the
 changelog.  Because there's no way in which this change can be fully
 reviewed unless someone (ie: you) tells us what it is setting out to
 achieve.
 
 If we 100% implement some standard then a URL for what we claim to
 implement would suffice.  Given that we're at least using different types from
 posix I doubt if such a thing would be sufficient.
 
 And given the complexity and potential variability within the filesystem
 implementations of this, I'd expect that _something_ additional needs to be
 said?

Ok. I will add a detailed comment here.

 
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset  0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

 
  +   ret = -EBADF;
  +   file = fget(fd);
  +   if (!file)
  +   goto out;
  +   if (!(file-f_mode  FMODE_WRITE))
  +   goto out_fput;
  +
  +   inode = file-f_path.dentry-d_inode;
  +
  +   ret = -ESPIPE;
  +   if (S_ISFIFO(inode-i_mode))
  +   goto out_fput;
  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode-i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

True. 
 
  +   ret = -EFBIG;
  +   if (offset + len  inode-i_sb-s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
 here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
  +   if (inode-i_op  inode-i_op-fallocate)
  +   ret = inode-i_op-fallocate(inode, mode, offset, len);
  +   else
  +   ret = -ENOSYS;
 
 If we _are_ going to support negative `len', as posix suggests, I think we
 should perform the appropriate sanity conversions to `offset' and `len'
 right here, rather than expecting each filesystem to do it.
 
 If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
  +out_fput:
  +   fput(file);
  +out:
  +   return ret;
  +}
  +EXPORT_SYMBOL(sys_fallocate);
 
 I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
  +/*
  + * fallocate() modes
  + */
  +#define FA_ALLOCATE0x1
  +#define FA_DEALLOCATE  0x2
 
 Now those aren't in posix.  They should be documented, along with their
 expected semantics.

Will add a comment describing the role of these modes.
 
   #ifdef __KERNEL__
   
   #include linux/linkage.h
  @@ -1125,6 +1131,7 @@ struct inode_operations {
  ssize_t (*listxattr) (struct dentry *, char *, size_t);
  int (*removexattr) (struct dentry *, const char *);
  void (*truncate_range)(struct inode *, loff_t, loff_t);
  +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
 
 I really do think it's better to put the variable names in definitions such
 as this.  Especially when we have two identically-typed variables next to
 each other like that.  Quick: which one is the offset and which is the
 length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.
 
 But yes, sounds like a desirable thing.  Would XFS support it easily if the 
 above
 check was relaxed?

I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?

One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Ulrich Drepper

Jakub Jelinek wrote:

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is shall fail.  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
required storage for regular file data starting at offset and continuing for len 
bytes
doesn't make sense for negative size.  


This wording has already been cleaned up.  The current draft for the 
next revision reads:



[EINVAL] The len argument is less than or equal to zero, or the offset
 argument is less than zero, or the underlying file system does not
 support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len0 is already outlawed.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote:
> > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > This patch implements the fallocate() system call and adds support for
> > > > i386, x86_64 and powerpc.
> > > > 
> > > > ...
> > > > +{
> > > > +   struct file *file;
> > > > +   struct inode *inode;
> > > > +   long ret = -EINVAL;
> > > > +
> > > > +   if (len == 0 || offset < 0)
> > > > +   goto out;
> > > 
> > > The posix spec implies that negative `len' is permitted - presumably 
> > > "allocate
> > > ahead of `offset'".  How peculiar.
> > 
> > I just checked the man page for posix_fallocate() and it says:
> > 
> >   EINVAL  offset or len was less than zero.
> > 
> > We should probably follow this lead.
> 
> Yes, I think so.  I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy.  Or I can't read.
> 
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment?  Head spins.

I don't think we should care. If we provide a syscall with the
semantics of "allocate from offset to offset+len" then glibc's
implementation can turn negative length into two separate
fallocate syscalls

> > > > +   ret = -ENODEV;
> > > > +   if (!S_ISREG(inode->i_mode))
> > > > +   goto out_fput;
> > > 
> > > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > > seems a bit silly of them.
> > 
> > H - I thought that the intention of sys_fallocate() was to
> > be generic enough to eventually allow preallocation on directories.
> > If that is the case, then this check will prevent that
> 
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the 
> above
> check was relaxed?

No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.

It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).

IIRC directory preallocation has been talked about more for ext3/4

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Jakub Jelinek
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> > > The posix spec implies that negative `len' is permitted - presumably 
> > > "allocate
> > > ahead of `offset'".  How peculiar.
> > 
> > I just checked the man page for posix_fallocate() and it says:
> > 
> >   EINVAL  offset or len was less than zero.

That describes the current glibc implementation.


> > We should probably follow this lead.
> 
> Yes, I think so.  I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy.  Or I can't read.
> 
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment?  Head spins.
> 
> However it would be neat if someone could test $OTHER_OS and, perhaps more
> importantly, the present glibc emulation (which I assume your manpage is
> referring to, so this would be a manpage test ;)).

int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
  struct stat64 st;
  struct statfs f;

  /* `off_t' is a signed type.  Therefore we can determine whether
 OFFSET + LEN is too large if it is a negative value.  */
  if (offset < 0 || len < 0)
return EINVAL;
  if (offset + len < 0)
return EFBIG;

  /* First thing we have to make sure is that this is really a regular
 file.  */
  if (__fxstat64 (_STAT_VER, fd, ) != 0)
return EBADF;
  if (S_ISFIFO (st.st_mode))
return ESPIPE;
  if (! S_ISREG (st.st_mode))
return ENODEV;

  if (len == 0)
{
  if (st.st_size < offset)
{
  int ret = __ftruncate (fd, offset);

  if (ret != 0)
ret = errno;
  return ret;
}
  return 0;
}
...

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail".  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for 
len bytes"
doesn't make sense for negative size.  
And given the general
"Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring."
I believe returning EINVAL for len < 0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
len has undefined or implementation defined behavior.

> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.

No, see above.

Jakub
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > > +{
> > > + struct file *file;
> > > + struct inode *inode;
> > > + long ret = -EINVAL;
> > > +
> > > + if (len == 0 || offset < 0)
> > > + goto out;
> > 
> > The posix spec implies that negative `len' is permitted - presumably 
> > "allocate
> > ahead of `offset'".  How peculiar.
> 
> I just checked the man page for posix_fallocate() and it says:
> 
>   EINVAL  offset or len was less than zero.
> 
> We should probably follow this lead.

Yes, I think so.  I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy.  Or I can't read.

I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment?  Head spins.

However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).

> > > +
> > > + ret = -ENODEV;
> > > + if (!S_ISREG(inode->i_mode))
> > > + goto out_fput;
> > 
> > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > seems a bit silly of them.
> 
> H - I thought that the intention of sys_fallocate() was to
> be generic enough to eventually allow preallocation on directories.
> If that is the case, then this check will prevent that

The above opengroup page only permits S_ISREG.  Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it.  And there's a decent case to
be made for emulating it - run-anywhere reasons.  Does glibc emulation support
directories?  Quite unlikely.

But yes, sounds like a desirable thing.  Would XFS support it easily if the 
above
check was relaxed?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> > +{
> > +   struct file *file;
> > +   struct inode *inode;
> > +   long ret = -EINVAL;
> > +
> > +   if (len == 0 || offset < 0)
> > +   goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I just checked the man page for posix_fallocate() and it says:

  EINVAL  offset or len was less than zero.

We should probably follow this lead.

> > +
> > +   ret = -ENODEV;
> > +   if (!S_ISREG(inode->i_mode))
> > +   goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

H - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset  0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I just checked the man page for posix_fallocate() and it says:

  EINVAL  offset or len was less than zero.

We should probably follow this lead.

  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode-i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

H - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote:

 On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
  On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:
  
   This patch implements the fallocate() system call and adds support for
   i386, x86_64 and powerpc.
   
   ...
   +{
   + struct file *file;
   + struct inode *inode;
   + long ret = -EINVAL;
   +
   + if (len == 0 || offset  0)
   + goto out;
  
  The posix spec implies that negative `len' is permitted - presumably 
  allocate
  ahead of `offset'.  How peculiar.
 
 I just checked the man page for posix_fallocate() and it says:
 
   EINVAL  offset or len was less than zero.
 
 We should probably follow this lead.

Yes, I think so.  I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy.  Or I can't read.

I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment?  Head spins.

However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).

   +
   + ret = -ENODEV;
   + if (!S_ISREG(inode-i_mode))
   + goto out_fput;
  
  So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
  seems a bit silly of them.
 
 H - I thought that the intention of sys_fallocate() was to
 be generic enough to eventually allow preallocation on directories.
 If that is the case, then this check will prevent that

The above opengroup page only permits S_ISREG.  Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it.  And there's a decent case to
be made for emulating it - run-anywhere reasons.  Does glibc emulation support
directories?  Quite unlikely.

But yes, sounds like a desirable thing.  Would XFS support it easily if the 
above
check was relaxed?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Jakub Jelinek
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
   The posix spec implies that negative `len' is permitted - presumably 
   allocate
   ahead of `offset'.  How peculiar.
  
  I just checked the man page for posix_fallocate() and it says:
  
EINVAL  offset or len was less than zero.

That describes the current glibc implementation.


  We should probably follow this lead.
 
 Yes, I think so.  I'm suspecting that
 http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
 is just buggy.  Or I can't read.
 
 I mean, if we're going to support negative `len' then is the byte at
 `offset' inside or outside the segment?  Head spins.
 
 However it would be neat if someone could test $OTHER_OS and, perhaps more
 importantly, the present glibc emulation (which I assume your manpage is
 referring to, so this would be a manpage test ;)).

int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
  struct stat64 st;
  struct statfs f;

  /* `off_t' is a signed type.  Therefore we can determine whether
 OFFSET + LEN is too large if it is a negative value.  */
  if (offset  0 || len  0)
return EINVAL;
  if (offset + len  0)
return EFBIG;

  /* First thing we have to make sure is that this is really a regular
 file.  */
  if (__fxstat64 (_STAT_VER, fd, st) != 0)
return EBADF;
  if (S_ISFIFO (st.st_mode))
return ESPIPE;
  if (! S_ISREG (st.st_mode))
return ENODEV;

  if (len == 0)
{
  if (st.st_size  offset)
{
  int ret = __ftruncate (fd, offset);

  if (ret != 0)
ret = errno;
  return ret;
}
  return 0;
}
...

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is shall fail.  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
required storage for regular file data starting at offset and continuing for 
len bytes
doesn't make sense for negative size.  
And given the general
Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring.
I believe returning EINVAL for len  0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
len has undefined or implementation defined behavior.

 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.

No, see above.

Jakub
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
 On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote:
  On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
   On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
   wrote:
   
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

...
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset  0)
+   goto out;
   
   The posix spec implies that negative `len' is permitted - presumably 
   allocate
   ahead of `offset'.  How peculiar.
  
  I just checked the man page for posix_fallocate() and it says:
  
EINVAL  offset or len was less than zero.
  
  We should probably follow this lead.
 
 Yes, I think so.  I'm suspecting that
 http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
 is just buggy.  Or I can't read.
 
 I mean, if we're going to support negative `len' then is the byte at
 `offset' inside or outside the segment?  Head spins.

I don't think we should care. If we provide a syscall with the
semantics of allocate from offset to offset+len then glibc's
implementation can turn negative length into two separate
fallocate syscalls

+   ret = -ENODEV;
+   if (!S_ISREG(inode-i_mode))
+   goto out_fput;
   
   So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
   seems a bit silly of them.
  
  H - I thought that the intention of sys_fallocate() was to
  be generic enough to eventually allow preallocation on directories.
  If that is the case, then this check will prevent that
 
 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.
 
 But yes, sounds like a desirable thing.  Would XFS support it easily if the 
 above
 check was relaxed?

No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.

It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).

IIRC directory preallocation has been talked about more for ext3/4

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Paul Mackerras
Andrew Morton writes:

> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.

This looks like it will have the same problem on s390 as
sys_sync_file_range.  Maybe the prototype should be:

asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton <[EMAIL PROTECTED]> wrote:

> > +   ret = -EFBIG;
> > +   if (offset + len > inode->i_sb->s_maxbytes)
> > +   goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.

But it doesn't handle offset+len wrapping through zero.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> This patch implements the fallocate() system call and adds support for
> i386, x86_64 and powerpc.
> 
> ...
>
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour. 
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog.  Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.

If we 100% implement some standard then a URL for what we claim to
implement would suffice.  Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

> +{
> + struct file *file;
> + struct inode *inode;
> + long ret = -EINVAL;
> +
> + if (len == 0 || offset < 0)
> + goto out;

The posix spec implies that negative `len' is permitted - presumably "allocate
ahead of `offset'".  How peculiar.

> + ret = -EBADF;
> + file = fget(fd);
> + if (!file)
> + goto out;
> + if (!(file->f_mode & FMODE_WRITE))
> + goto out_fput;
> +
> + inode = file->f_path.dentry->d_inode;
> +
> + ret = -ESPIPE;
> + if (S_ISFIFO(inode->i_mode))
> + goto out_fput;
> +
> + ret = -ENODEV;
> + if (!S_ISREG(inode->i_mode))
> + goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
seems a bit silly of them.

> + ret = -EFBIG;
> + if (offset + len > inode->i_sb->s_maxbytes)
> + goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
here would settle the reader's mind.

> + if (inode->i_op && inode->i_op->fallocate)
> + ret = inode->i_op->fallocate(inode, mode, offset, len);
> + else
> + ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

> +out_fput:
> + fput(file);
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE  0x1
> +#define FA_DEALLOCATE0x2

Now those aren't in posix.  They should be documented, along with their
expected semantics.

>  #ifdef __KERNEL__
>  
>  #include 
> @@ -1125,6 +1131,7 @@ struct inode_operations {
>   ssize_t (*listxattr) (struct dentry *, char *, size_t);
>   int (*removexattr) (struct dentry *, const char *);
>   void (*truncate_range)(struct inode *, loff_t, loff_t);
> + long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this.  Especially when we have two identically-typed variables next to
each other like that.  Quick: which one is the offset and which is the
length?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:

 This patch implements the fallocate() system call and adds support for
 i386, x86_64 and powerpc.
 
 ...

 +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour. 
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog.  Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.

If we 100% implement some standard then a URL for what we claim to
implement would suffice.  Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

 +{
 + struct file *file;
 + struct inode *inode;
 + long ret = -EINVAL;
 +
 + if (len == 0 || offset  0)
 + goto out;

The posix spec implies that negative `len' is permitted - presumably allocate
ahead of `offset'.  How peculiar.

 + ret = -EBADF;
 + file = fget(fd);
 + if (!file)
 + goto out;
 + if (!(file-f_mode  FMODE_WRITE))
 + goto out_fput;
 +
 + inode = file-f_path.dentry-d_inode;
 +
 + ret = -ESPIPE;
 + if (S_ISFIFO(inode-i_mode))
 + goto out_fput;
 +
 + ret = -ENODEV;
 + if (!S_ISREG(inode-i_mode))
 + goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
seems a bit silly of them.

 + ret = -EFBIG;
 + if (offset + len  inode-i_sb-s_maxbytes)
 + goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
here would settle the reader's mind.

 + if (inode-i_op  inode-i_op-fallocate)
 + ret = inode-i_op-fallocate(inode, mode, offset, len);
 + else
 + ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

 +out_fput:
 + fput(file);
 +out:
 + return ret;
 +}
 +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

 +/*
 + * fallocate() modes
 + */
 +#define FA_ALLOCATE  0x1
 +#define FA_DEALLOCATE0x2

Now those aren't in posix.  They should be documented, along with their
expected semantics.

  #ifdef __KERNEL__
  
  #include linux/linkage.h
 @@ -1125,6 +1131,7 @@ struct inode_operations {
   ssize_t (*listxattr) (struct dentry *, char *, size_t);
   int (*removexattr) (struct dentry *, const char *);
   void (*truncate_range)(struct inode *, loff_t, loff_t);
 + long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this.  Especially when we have two identically-typed variables next to
each other like that.  Quick: which one is the offset and which is the
length?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton [EMAIL PROTECTED] wrote:

  +   ret = -EFBIG;
  +   if (offset + len  inode-i_sb-s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.

But it doesn't handle offset+len wrapping through zero.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Paul Mackerras
Andrew Morton writes:

 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.

This looks like it will have the same problem on s390 as
sys_sync_file_range.  Maybe the prototype should be:

asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-04-26 Thread Amit K. Arora
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

NOTE: It is based on 2.6.21 kernel version.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 ++
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   41 +++
 include/asm-i386/unistd.h|3 +-
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 +-
 include/asm-x86_64/unistd.h  |4 ++-
 include/linux/fs.h   |7 ++
 include/linux/syscalls.h |1 
 10 files changed, 66 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset < 0)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   if (!S_ISREG(inode->i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   if (offset + len > inode->i_sb->s_maxbytes)
+   goto out_fput;
+
+   if (inode->i_op && inode->i_op->fallocate)
+   ret = inode->i_op->fallocate(inode, mode, offset, len);
+   else
+   ret = -ENOSYS;
+out_fput:
+   fput(file);
+out:
+   return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.21/include/asm-i386/unistd.h
===
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages301
 #define __NR_getcpu302
 #define __NR_epoll_pwait   303
+#define __NR_fallocate 304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  304
+#define __NR_syscalls  305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate 280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: 

[PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-04-26 Thread Amit K. Arora
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

NOTE: It is based on 2.6.21 kernel version.

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 ++
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   41 +++
 include/asm-i386/unistd.h|3 +-
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 +-
 include/asm-x86_64/unistd.h  |4 ++-
 include/linux/fs.h   |7 ++
 include/linux/syscalls.h |1 
 10 files changed, 66 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset  0)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file-f_mode  FMODE_WRITE))
+   goto out_fput;
+
+   inode = file-f_path.dentry-d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode-i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   if (!S_ISREG(inode-i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   if (offset + len  inode-i_sb-s_maxbytes)
+   goto out_fput;
+
+   if (inode-i_op  inode-i_op-fallocate)
+   ret = inode-i_op-fallocate(inode, mode, offset, len);
+   else
+   ret = -ENOSYS;
+out_fput:
+   fput(file);
+out:
+   return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.21/include/asm-i386/unistd.h
===
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages301
 #define __NR_getcpu302
 #define __NR_epoll_pwait   303
+#define __NR_fallocate 304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  304
+#define __NR_syscalls  305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate 280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/linux/fs.h