Re: [PATCH 0/2] LogFS take two

2007-05-07 Thread Albert Cahalan

[EMAIL PROTECTED], [EMAIL PROTECTED],
linux-fsdevel@vger.kernel.org, [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]

Re: [PATCH 0/2] LogFS take two

You seem to be missing the immutable bit. This is really useful
for dealing with buggy or badly-designed things running as root.
I've used to to protect /dev/null from becoming a normal file
filled with junk, and to protect /etc/resolv.conf from "helpful"
network management daemons that don't know my DNS servers.

Anything else missing?

BTW, BSD offers an unprivileged immutable bit as well. I'm sure
it's useful for the apps that trash their own config files.
Actually, this bit alone would do fine, and we could really use
a way to protect writable device files from deletion or permission
bit changes.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will
prevent the obvious error case where someone tries to preallocate 10GB
when there is only 2GB left.  But it won't help if there are multiple
processes trying to allocate blocks the same time.  On the other hand,
that case is probably relatively rare, and in that case, the
filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that could simply release all
unitialized extents back to the system.  The problem is distinguishing
between the unitialized extents that had just been newly added, versus
the ones that had there from before.  (On the other hand, if the
filesystem was completely full, releasing unitialized blocks wouldn't
be the worse thing in the world to do, although releasing previously
fallocated blocks probably does violate the princple of least
surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window?  When the filesystem
is almost full, the free extents are likely going to be scattered all
over the disk.  The general principle of grabbing all of the extents
and keeping them in an in-memory data structure, and only adding them
to the extent tree would work, though; I'm just not sure we could do
it using the existing reservation window code, since it only supports
a single reservation window per file, yes?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.



Agreed, thanks for the clarification.

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only is that this is the only way to allocate space in the
> >filesystem without doing the explicit zeroing.
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel error when closing a pipe

2007-05-07 Thread Greg KH
On Mon, May 07, 2007 at 10:29:29AM -0600, Chris Kottaridis wrote:
> I am getting the following when trying to close a pipe:

Do you have a small, sample example code that can easily duplicate this
problem?  Can you post it here?

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
> On Mon, 07 May 2007 17:00:24 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > > +   while (ret >= 0 && ret < max_blocks) {
> > > +   block = block + ret;
> > > +   max_blocks = max_blocks - ret;
> > > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > > + max_blocks, &map_bh,
> > > + EXT4_CREATE_UNINITIALIZED_EXT, 
> > > 0);
> > > +   BUG_ON(!ret);
> > > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > > +   && ((block + ret) > (i_size_read(inode) << 
> > > blkbits)))
> > > +   nblocks = nblocks + ret;
> > > +   }
> > > +
> > > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, 
> > > &retries))
> > > +   goto retry;
> > > +
> > > Now the interesting question is: what do we do if we get halfway through
> > > this loop and then run out of space?  We could leave the disk all filled 
> > > up
> > > and then return failure to the caller, but that's pretty poor behaviour,
> > > IMO.
> > > 
> > The current code handles earlier ENOSPC by three times retries. After
> > that if we still run out of space, then it's propably right to notify
> > the caller there isn't much space left.
> > 
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
> 
> yes, but my point is that the proposed behaviour is really quite bad.
> 
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.


> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
> 

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.
> 
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme 
> which
> delayed allocation will need to introduce.
> 
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.


Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 19:14:42 -0400
> Theodore Tso <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > > Actually, this is a non-issue.  The reason that it is handled for 
> > > > extent-only
> > > > is that this is the only way to allocate space in the filesystem without
> > > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > > ext4 with block-mapped files) the filesystem should return an error 
> > > > (e.g.
> > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
> > > > userspace.
> > > 
> > > It can be a bit suboptimal from the layout POV.  The reservations code 
> > > will
> > > largely save us here, but kernel support might make it a bit better.
> > 
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
> 
> No!  Reservations code is *critical* here.  Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
> 
> My point is that even though reservations save us, we could do even-better
> in-kernel.
> 

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating "please spend time if needed to find
a window covers at least N free blocks".

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
> 
> > This wlil result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality.  That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate().  On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> 
> A PVR implementor would take all this over and would do it themselves, for
> sure.
> 
> > If we think applications programmers badly need to know in advance if
> > posix_fallocate() will be fast or slow, probably the right thing is to
> > define a new fpathconf() configuration option so they can query to see
> > whether a particular file will support a fast posix_fallocate().  I'm
> > not 100% convinced such complexity is really needed, but I'm willing
> > to be convinced  what do folks think?
> > 
> 
> An application could do sys_fallocate(one-byte) to work out whether it's
> supported in-kernel, I guess.
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao <[EMAIL PROTECTED]> wrote:

> > +   while (ret >= 0 && ret < max_blocks) {
> > +   block = block + ret;
> > +   max_blocks = max_blocks - ret;
> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > +   && ((block + ret) > (i_size_read(inode) << 
> > blkbits)))
> > +   nblocks = nblocks + ret;
> > +   }
> > +
> > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, 
> > &retries))
> > +   goto retry;
> > +
> > Now the interesting question is: what do we do if we get halfway through
> > this loop and then run out of space?  We could leave the disk all filled up
> > and then return failure to the caller, but that's pretty poor behaviour,
> > IMO.
> > 
> The current code handles earlier ENOSPC by three times retries. After
> that if we still run out of space, then it's propably right to notify
> the caller there isn't much space left.
> 
> We could extend the block reservation window size before the while loop
> so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 05:37:54 -0600
> Andreas Dilger <[EMAIL PROTECTED]> wrote:
> 
> > > > +   block = offset >> blkbits;
> > > > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> 
> > > > blkbits)
> > > > +- block;
> > > > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > > 
> > > Now I'm mystified.  Given that we're allocating an arbitrary amount of 
> > > disk
> > > space, and that this disk space will require an arbitrary amount of
> > > metadata, how can we work out how much journal space we'll be needing
> > > without at least looking at `len'?
> > 
> > Good question.
> > 
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks.  That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
> 
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
> 
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space.  If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
> 
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons.  If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
> 
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop: 
> 
>   while (len) {
>   journal_start();
>   len -= do_fallocate(len, ...);
>   journal_stop();
>   }
> 
> 

I agree.  There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{


> +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

> +   handle=ext4_journal_start(inode, credits +
> +   
> EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +   if (IS_ERR(handle))
> +   return PTR_ERR(handle);
> +retry:
> +   ret = 0;
> +   while (ret >= 0 && ret < max_blocks) {
> +   block = block + ret;
> +   max_blocks = max_blocks - ret;
> +   ret = ext4_ext_get_blocks(handle, inode, block,
> + max_blocks, &map_bh,
> + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +   BUG_ON(!ret);
> +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +   && ((block + ret) > (i_size_read(inode) << blkbits)))
> +   nblocks = nblocks + ret;
> +   }
> +
> +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +   goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space?  We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
> 
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend 

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> >>Final point: it's fairly disappointing that the present implementation is
> >>ext4-only, and extent-only.  I do think we should be aiming at an ext4
> >>bitmap-based implementation and an ext3 implementation.
> >
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only
> >is that this is the only way to allocate space in the filesystem without
> >doing the explicit zeroing.  For other filesystems (including ext3 and
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as unitialized, ext4 returns zero's, to avoid returning the
uninitalized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitalized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
two (i.e., 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 19:14:42 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > Actually, this is a non-issue.  The reason that it is handled for 
> > > extent-only
> > > is that this is the only way to allocate space in the filesystem without
> > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
> > > userspace.
> > 
> > It can be a bit suboptimal from the layout POV.  The reservations code will
> > largely save us here, but kernel support might make it a bit better.
> 
> Actually, the reservations code won't matter, since glibc will fall
> back to its current behavior, which is it will do the preallocation by
> explicitly writing zeros to the file.

No!  Reservations code is *critical* here.  Without reservations, we get
disastrously-bad layout if two processes were running a large fallocate()
at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
lengths save us).

My point is that even though reservations save us, we could do even-better
in-kernel.

But then, a smart application would bypass the glibc() fallocate()
implementation and would tune the reservation window size and would use
direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).

> This wlil result in the same
> layout as if we had done the persistent preallocation, but of course
> it will mean the posix_fallocate() could potentially take a long time
> if you're a PVR and you're reserving a gig or two for a two hour movie
> at high quality.  That seems suboptimal, granted, and ideally the
> application should be warned about this before it calls
> posix_fallocate().  On the other hand, it's what happens today, all
> the time, so applications won't be too badly surprised.

A PVR implementor would take all this over and would do it themselves, for
sure.

> If we think applications programmers badly need to know in advance if
> posix_fallocate() will be fast or slow, probably the right thing is to
> define a new fpathconf() configuration option so they can query to see
> whether a particular file will support a fast posix_fallocate().  I'm
> not 100% convinced such complexity is really needed, but I'm willing
> to be convinced  what do folks think?
> 

An application could do sys_fallocate(one-byte) to work out whether it's
supported in-kernel, I guess.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > Actually, this is a non-issue.  The reason that it is handled for 
> > extent-only
> > is that this is the only way to allocate space in the filesystem without
> > doing the explicit zeroing.  For other filesystems (including ext3 and
> > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> 
> It can be a bit suboptimal from the layout POV.  The reservations code will
> largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced  what do folks think?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

On May 07, 2007  13:58 -0700, Andrew Morton wrote:

Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.


Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and


Precisely /how/ do you avoid the zeroing issue, for extents?

If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
otherwise the implementation is broken.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> > Final point: it's fairly disappointing that the present implementation is
> > ext4-only, and extent-only.  I do think we should be aiming at an ext4
> > bitmap-based implementation and an ext3 implementation.
> 
> Actually, this is a non-issue.  The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing.  For other filesystems (including ext3 and
> ext4 with block-mapped files) the filesystem should return an error (e.g.
> -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> Final point: it's fairly disappointing that the present implementation is
> ext4-only, and extent-only.  I do think we should be aiming at an ext4
> bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] introduce I_SYNC

2007-05-07 Thread Jörn Engel
This patch is actually independent of LogFS.  It fixes a deadlock
hidden in fs/fs-writeback.c that LogFS was unlucky enough to trigger.
I strongly suspect NTFS triggered the same deadlock and "solved" it by
introducing iget5_nowait().  For LogFS, iget5_nowait() would translate
the deadlock into data corruption, so that is not an option.


1. Introduction

In its write path, LogFS may have to do some Garbage Collection when
space is getting tight.  GC requires reading inodes, so the I_LOCK bit
is taken for some random inodes.

I_LOCK is also held when syncing inodes to flash, so LogFS has to wait
for those inodes.  Inodes are written by the same code path as regular
file data and needs to acquire an fs-global mutex.  Call stacks of the
1-2 processes will look roughly like this:

Process A:  Process B:
inode_wait  [filesystem locking write path]
__wait_on_bit   __writeback_single_inode
out_of_line_wait_on_bit
ifind_fast
[filesystem calling iget()]
[filesystem locking write path]


2. The usage of inode_lock and I_LOCK

Almost all modifications of inodes are protected by the inode_lock, a
global spinlock.  Some modifications, however, can block for various
reasons and require the inode_lock to get dropped temporarily.  In the
meantime, the individual inode needs to get protected somehow.  Usually
this happens through the use of I_LOCK.

But I_LOCK is not a simple mutex.  It is a Janus-faced bit in the inode
that is used for several things, including mutual exclusion and
completion notification.  Most users are open-coded, so it is not easy
to follow, but can be summarized in the table below.

In this table columns indicate events when I_LOCK is either set or
reset (or not reset but all waiters are notified anyway).  Rows
indicate code that either checks for I_LOCK and changes behaviour
depending on its state or is waiting until I_LOCK gets reset (or is
waiting even if I_LOCK is not set).

__sync_single_inode
|   get_new_inode[_fast]
|   | unlock_new_inode
|   | | dispose_list
|   | | |   generic_delete_inode
|   | | |   |   generic_forget_inode
lockv   v | |   |   |
unlock/complete v   v   v   v   v   comment
---
__writeback_single_inodeX   O   O   O   O   sync
write_inode_now X   O   O   O   O   sync
clear_inode X   O   O   O   O   sync
__mark_inode_dirty  X   O   O   O   O   lists
generic_osync_inode X   O   O   O   O   sync
get_new_inode[_fast]O   X   O   O   O   mutex
find_inode[_fast]   O   O   X   X   X   I_FREEING
ifind[_fast]O   X   O   O   O   read

jfs txCommit?   ?   ?   ?   ?   ?
xfs_ichgtime[_fast] X   O   O   O   O   sync

Comments:
sync - wait for writeout to finish
lists - move inode to dirty list without racing against __sync_single_inode
mutex - protect against two concurrent get_new_inode[_fast] creating two inodes
I_FREEING - wait for inode to get freed, then repeat
read - don't return inode until it is read from medium

Now, the "X"s mark combinations where columns and rows are related.
"O"s mark combinations where afaics columns and rows share no
relationship whatsoever except that both use either I_LOCK or
wake_up_inode()/wait_on_inode() or any other of the partially open-coded
variants.

The table shows that two large usage groups exist for I_LOCK, one
dealing exclusively with the various sync() functions in
fs/fs-writeback.c and another basically confined to fs/inode.c code.
JFS has one remaining user that is unclear to me.


This patch introduces a new flag, I_SYNC and seperates out all sync()
users of I_LOCK to use the new flag instead.


 fs/fs-writeback.c   |   39 ---
 fs/xfs/linux-2.6/xfs_iops.c |4 ++--
 include/linux/fs.h  |2 ++
 include/linux/writeback.h   |7 +++
 4 files changed, 35 insertions(+), 17 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,11 +99,11 @@ void __mark_inode_dirty(struct inode *in
inode->i_state |= flags;
 
/*
-* If the inode is locked, just update its dirty state. 
+* If the inode is being synced, just update its dirty state.
 * The unlocker will place the inode on the appropriate
 * superblock list, based upon its state.
 */
-   if (inode->i_state & I_LOCK)
+   if (inode->i_s

[PATCH 0/2] LogFS take two

2007-05-07 Thread Jörn Engel
Motivation:

Linux currently has 1-2 flash filesystems to choose from, JFFS2 and
YAFFS.  The latter has never made a serious attempt of kernel
integration, which may disqualify it to some.

The two main problems of JFFS2 are memory consumption and mount time.
Unlike most filesystems, there is no tree structure of any sorts on
the medium, so the complete medium needs to be scanned at mount time
and a tree structure kept in-memory while the filesystem is mounted.
With bigger devices, both mount time and memory consumption increase
linearly.

JFFS2 has recently gained summary support, which helps reduce mount
time by a constant factor.  Linear scalability remains.  YAFFS also
appears to be better by a constant factor, yet still scales linearly.

LogFS has an on-medium tree, fairly similar to Ext2 in structure, so
mount times are O(1).  In absolute terms, the OLPC system has mount
times of ~3.3s for JFFS2 and ~60ms for LogFS.


Motivation 2:

Flash is becoming increasingly common in standard PC hardware.  Nearly
a dozen different manufacturers have announced Solid State Disks
(SSDs), the OLPC and the Intel Classmate no longer contain hard disks
and ASUS announced a flash-only Laptop series for regular consumers.
And that doesn't even mention the ubiquitous USB-Sticks, SD-Cards,
etc.

Flash behaves significantly different to hard disks.  In order to use
flash, the current standard practice is to add an emulation layer and
an old-fashioned hard disk filesystem.  As can be expected, this is
eating up some of the benefits flash can offer over hard disks.

In principle it is possible to achieve better performance with a flash
filesystem than with the current emulated approach.  In practice our
current flash filesystems are not even near that theoretical goal.
LogFS in its current state is already closer.


Current state:

LogFS works and survives my testcases.  It has fairly good chances of
not eating your data during regular operation.  There are still two
known bugs that will eat data if the filesystem is uncleanly
unmounted.  Also still missing is wear leveling.

Handling of read/write/erase errors currently is BUG().  It is on my
list, no need to remind me. :)

Overall I consider this to be -mm material.  It would be good to get
some review and have the usual allyesconfig crowd build it and find
coverity bugs and the like.

http://logfs.org/logfs/ may have some further information.


Shameless plug:

I have quit my job last November to concentrate on LogFS.  While I
have found one sponsor kind enough to fund me, my monetary reserves
are fairly stressed.  Fairly soon I will be forced to take an
old-fashioned job again and work on other less exciting stuff.  So if
anyone needs a fast flash filesystem and has spare money to spend,
please contact me.

Jörn

-- 
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> > > + block = offset >> blkbits;
> > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > +  - block;
> > > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > 
> > Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> > space, and that this disk space will require an arbitrary amount of
> > metadata, how can we work out how much journal space we'll be needing
> > without at least looking at `len'?
> 
> Good question.
> 
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks.  That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 03, 2007  21:31 -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
> 
> But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > len)
> > +{
> > +   handle_t *handle;
> > +   ext4_fsblk_t block, max_blocks;
> > +   int ret, ret2, nblocks = 0, retries = 0;
> > +   struct buffer_head map_bh;
> > +   unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +   /* Currently supporting (pre)allocate mode _only_ */
> > +   if (mode != FA_ALLOCATE)
> > +   return -EOPNOTSUPP;
> > +
> > +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +   return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
> 
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > +   block = offset >> blkbits;
> > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +- block;
> > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-07 Thread Quentin Godfroy
On Mon, May 07, 2007 at 01:47:02PM -0400, Quentin Godfroy wrote:
> I have made another patch, which I hope should not break anything this
> time. I tested it on an x86_64 kernel with 32 and 64 bits executables,
> PIE and not PIE (well my only PIE are libc-2.5.so and ld-2.5.so). I
> rejoined the loops as suggested, and the kernel falls back on load_addr +
> exec->e_phoff if there was no PT_PHDR entry in the program headers.
> 
> Signed-off-by: Quentin Godfroy <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.21.1/fs/binfmt_elf.c2007-05-07 10:58:38.0 -0400
> +++ linux-2.6.21.1-patch/fs/binfmt_elf.c  2007-05-07 13:14:33.0 
> -0400
> @@ -134,6 +134,7 @@ static int padzero(unsigned long elf_bss
>  static int
>  create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
>   int interp_aout, unsigned long load_addr,
> + unsigned long phdr_addr,
>   unsigned long interp_load_addr)
>  {
>   unsigned long p = bprm->p;
> @@ -190,7 +191,10 @@ create_elf_tables(struct linux_binprm *b
>   NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
>   NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
>   NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
> - NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
> + if(phdr_addr)
> + NEW_AUX_ENT(AT_PHDR, phdr_addr);
> + else
> + NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
>   NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
>   NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
>   NEW_AUX_ENT(AT_BASE, interp_load_addr);
> @@ -529,7 +533,7 @@ static unsigned long randomize_stack_top
>  static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
>  {
>   struct file *interpreter = NULL; /* to shut gcc up */
> - unsigned long load_addr = 0, load_bias = 0;
> + unsigned long load_addr = 0, load_bias = 0, phdr_addr = 0;
>   int load_addr_set = 0;
>   char * elf_interpreter = NULL;
>   unsigned int interpreter_type = INTERPRETER_NONE;
> @@ -620,7 +624,12 @@ static int load_elf_binary(struct linux_
>   start_data = 0;
>   end_data = 0;
>  
> - for (i = 0; i < loc->elf_ex.e_phnum; i++) {
> + for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
> + if (elf_ppnt->p_type == PT_PHDR) {
> + phdr_addr = elf_ppnt->p_vaddr;
> + continue;
> + }
> +
>   if (elf_ppnt->p_type == PT_INTERP) {
>   /* This is the program interpreter used for
>* shared libraries - for now assume that this
> @@ -703,20 +712,17 @@ static int load_elf_binary(struct linux_
>   /* Get the exec headers */
>   loc->interp_ex = *((struct exec *)bprm->buf);
>   loc->interp_elf_ex = *((struct elfhdr *)bprm->buf);
> - break;
> + continue;
>   }
> - elf_ppnt++;
> - }
>  
> - elf_ppnt = elf_phdata;
> - for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
>   if (elf_ppnt->p_type == PT_GNU_STACK) {
>   if (elf_ppnt->p_flags & PF_X)
>   executable_stack = EXSTACK_ENABLE_X;
>   else
>   executable_stack = EXSTACK_DISABLE_X;
> - break;
> + continue;
>   }
> + }
>  
>   /* Some simple consistency checks for the interpreter */
>   if (elf_interpreter) {
> @@ -985,9 +991,12 @@ static int load_elf_binary(struct linux_
>  
>   compute_creds(bprm);
>   current->flags &= ~PF_FORKNOEXEC;
> + if (phdr_addr)
> + phdr_addr += load_bias;
> +
>   create_elf_tables(bprm, &loc->elf_ex,
> (interpreter_type == INTERPRETER_AOUT),
> -   load_addr, interp_load_addr);
> +   load_addr, phdr_addr, interp_load_addr);
>   /* N.B. passed_fileno might not be initialized? */
>   if (interpreter_type == INTERPRETER_AOUT)
>   current->mm->arg_start += strlen(passed_fileno) + 1;

Wooops. Wrong one. Sorry. This one could have memory leaks.

Signed-off-by: Quentin Godfroy <[EMAIL PROTECTED]>
---
--- linux-2.6.21.1/fs/binfmt_elf.c  2007-05-07 10:58:38.0 -0400
+++ linux-2.6.21.1-patch/fs/binfmt_elf.c2007-05-07 13:58:14.0 
-0400
@@ -134,6 +134,7 @@ static int padzero(unsigned long elf_bss
 static int
 create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
int interp_aout, unsigned long load_addr,
+   unsigned long phdr_addr,
unsigned long interp_load_addr)
 {
unsigned long p = bprm->p;
@@ -190,7 +191,10 @@ create_elf_tables(struct linux_binprm *b
NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
-   NEW_AUX_ENT(AT_PHDR, load_add

Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-07 Thread Quentin Godfroy
On Fri, May 04, 2007 at 10:09:21AM -0400, Quentin Godfroy wrote:
> On a dynamic ELF executable, the current kernel loader gives to the
> interpreter (in the AUXV vector) the AT_PHDR argument as :
> offset_of_phdr_in_file + first address.
> 
> It can be wrong for an executable where the program headers are not located 
> in the first loaded segment.
> 
> This patch corrects the behaviour.
> 
> Signed-off-by: Quentin Godfroy <[EMAIL PROTECTED]>
> ---
>  Here is an example of such an ELF executable which the current code
>  fails on :
>  ftp://quatramaran.ens.fr/pub/godfroy/addrpath/broken-sample
> 
> --- linux-2.6.21.1/fs/binfmt_elf.c2007-05-04 03:20:00.0 -0400
> +++ linux-2.6.21.1-patch/fs/binfmt_elf.c  2007-05-04 08:02:18.0 
> -0400
> @@ -134,6 +134,7 @@ static int padzero(unsigned long elf_bss
>  static int
>  create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
>   int interp_aout, unsigned long load_addr,
> + unsigned long phdr_addr,
>   unsigned long interp_load_addr)
>  {
>   unsigned long p = bprm->p;
> @@ -190,7 +191,7 @@ create_elf_tables(struct linux_binprm *b
>   NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
>   NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
>   NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
> - NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
> + NEW_AUX_ENT(AT_PHDR, phdr_addr);
>   NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
>   NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
>   NEW_AUX_ENT(AT_BASE, interp_load_addr);
> @@ -529,7 +530,7 @@ static unsigned long randomize_stack_top
>  static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
>  {
>   struct file *interpreter = NULL; /* to shut gcc up */
> - unsigned long load_addr = 0, load_bias = 0;
> + unsigned long load_addr = 0, load_bias = 0, phdr_addr = 0;
>   int load_addr_set = 0;
>   char * elf_interpreter = NULL;
>   unsigned int interpreter_type = INTERPRETER_NONE;
> @@ -718,6 +719,16 @@ static int load_elf_binary(struct linux_
>   break;
>   }
>  
> + elf_ppnt = elf_phdata;
> + for (i = 0; i< loc->elf_ex.e_phnum; i++, elf_ppnt++)
> + if (elf_ppnt->p_type == PT_PHDR) {
> + phdr_addr = elf_ppnt->p_vaddr;
> + break;
> + }
> + retval = -ENOEXEC;
> + if (!phdr_addr)
> + goto out_free_dentry;
> +
>   /* Some simple consistency checks for the interpreter */
>   if (elf_interpreter) {
>   interpreter_type = INTERPRETER_ELF | INTERPRETER_AOUT;
> @@ -987,7 +998,7 @@ static int load_elf_binary(struct linux_
>   current->flags &= ~PF_FORKNOEXEC;
>   create_elf_tables(bprm, &loc->elf_ex,
> (interpreter_type == INTERPRETER_AOUT),
> -   load_addr, interp_load_addr);
> +   load_addr, phdr_addr, interp_load_addr);
>   /* N.B. passed_fileno might not be initialized? */
>   if (interpreter_type == INTERPRETER_AOUT)
>   current->mm->arg_start += strlen(passed_fileno) + 1;

I have made another patch, which I hope should not break anything this
time. I tested it on an x86_64 kernel with 32 and 64 bits executables,
PIE and not PIE (well my only PIE are libc-2.5.so and ld-2.5.so). I
rejoined the loops as suggested, and the kernel falls back on load_addr +
exec->e_phoff if there was no PT_PHDR entry in the program headers.

Signed-off-by: Quentin Godfroy <[EMAIL PROTECTED]>
---
--- linux-2.6.21.1/fs/binfmt_elf.c  2007-05-07 10:58:38.0 -0400
+++ linux-2.6.21.1-patch/fs/binfmt_elf.c2007-05-07 13:14:33.0 
-0400
@@ -134,6 +134,7 @@ static int padzero(unsigned long elf_bss
 static int
 create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
int interp_aout, unsigned long load_addr,
+   unsigned long phdr_addr,
unsigned long interp_load_addr)
 {
unsigned long p = bprm->p;
@@ -190,7 +191,10 @@ create_elf_tables(struct linux_binprm *b
NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
-   NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
+   if(phdr_addr)
+   NEW_AUX_ENT(AT_PHDR, phdr_addr);
+   else
+   NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
NEW_AUX_ENT(AT_BASE, interp_load_addr);
@@ -529,7 +533,7 @@ static unsigned long randomize_stack_top
 static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
 {
struct file *interpreter = NULL; /* to shut gcc up */
-   unsigned long load_addr = 0, load_bias = 0;
+   unsigned long load_addr = 0, load_bias = 0, phdr_addr = 0;
int load_addr_set = 0;
char * elf_interpreter = NULL;
   

Kernel error when closing a pipe

2007-05-07 Thread Chris Kottaridis
I am getting the following when trying to close a pipe:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
CPU:2
EIP:0060:[]Not tainted VLI
EFLAGS: 00010002   (2.6.10-pne)
EIP is at fasync_helper+0x4f/0xcf
eax: c04d2e28   ebx:    ecx: d9562f6c   edx: 0001
esi: f2686d80   edi: d9562f6c   ebp: f22cbf20   esp: f22cbf08
ds: 007b   es: 007b   ss: 0068
Process dsIpMgr (pid: 18122, threadinfo=f22ca000 task=f40a0710)
Stack: f40a0710 0292  f5db1b70 c5246580 f5db1b70 f22cbf3c
c017470d
    f2686d80  d9562f6c f2686d80 f22cbf50 c0174828

   f2686d80  f22cbf70 c016985d f5db1b70 f2686d80 ed0daab4
f2686d80
Call Trace:
 [] show_stack+0x80/0x96
 [] show_registers+0x15d/0x1d6
 [] die+0x106/0x194
 [] do_page_fault+0x583/0x8d8
 [] error_code+0x2b/0x30
 [] pipe_read_fasync+0x3d/0x56
 [] pipe_read_release+0x21/0x3e
 [] __fput+0xf3/0x122
 [] filp_close+0x50/0x8e
 [] sys_close+0x96/0xc1
 [] no_dpa_vsyscall_enter+0x8/0x1b
Code: c7 44 24 04 d0 00 00 00 89 04 24 e8 5f 60 fd ff 89 c3 b8 f4 ff ff
ff85 db 74 50 b8 28 2e 4d c0 e8 52 b7 2e 00 89 f9 8b 17 eb 0b <39> 72 0c
74 43 8d 4a 08 8b 52 08 85 d2 75 f1 8b 45 10 85 c0 74

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
I am running a flavor of 2.6.10 and am not really in a position to
upgrade to a newer version. I assume this is happening when fa is
getting de-referenced in the for loop. So, I assume the fasync reader
list has gotten corrupted somehow:

/*
 * fasync_helper() is used by some character device drivers (mainly
mice)
 * to set up the fasync queue. It returns negative on error, 0 if it did
 * no changes and positive if it added/deleted the entry.
 */
int fasync_helper(int fd, struct file * filp, int on, struct
fasync_struct **fapp)
{
struct fasync_struct *fa, **fp;
struct fasync_struct *new = NULL;
int result = 0;

if (on) {
new = kmem_cache_alloc(fasync_cache, SLAB_KERNEL);
if (!new)
return -ENOMEM;
}
write_lock_irq(&fasync_lock);
for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
if (fa->fa_file == filp) {
if(on) {
fa->fa_fd = fd;
kmem_cache_free(fasync_cache, new);
} else {
*fp = fa->fa_next;
kmem_cache_free(fasync_cache, fa);
result = 1;
}
goto out;
}
}

if (on) {
new->magic = FASYNC_MAGIC;
new->fa_file = filp;
new->fa_fd = fd;
new->fa_next = *fapp;
*fapp = new;
result = 1;
}
out:
write_unlock_irq(&fasync_lock);
return result;
}

The variable "on" is zero when this gets called and fd is -1.

Anyone seen anything like this when trying to close down a pipe ?

Any recommendations on how to track this down would be appreciated also.

Thanks
Chris Kottaridis([EMAIL PROTECTED])
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Implement renaming for debugfs

2007-05-07 Thread Greg KH
On Fri, May 04, 2007 at 04:14:28PM +0200, Jan Kara wrote:
> On Thu 03-05-07 17:16:02, Greg KH wrote:
> > On Thu, May 03, 2007 at 11:54:52AM +0200, Jan Kara wrote:
> > > On Tue 01-05-07 20:26:27, Greg KH wrote:
> > > > On Mon, Apr 30, 2007 at 07:55:36PM +0200, Jan Kara wrote:
> > > > >   Hello,
> > > > > 
> > > > >   attached patch implements renaming for debugfs. I was asked for this
> > > > > feature by WLAN guys and I guess it makes sence (they have some debug 
> > > > > info
> > > > > in the directory identified by interface name and that can change...).
> > > > > Could someone have a look at what I wrote whether it looks reasonable?
> > > > > Thanks.
> > > > > 
> > > > >   Honza
> > > > > 
> > > > > -- 
> > > > > Jan Kara <[EMAIL PROTECTED]>
> > > > > SuSE CR Labs
> > > > 
> > > > > Implement debugfs_rename() to allow renaming files/directories in 
> > > > > debugfs.
> > > > 
> > > > I think you are going to need more infrastructure here, the caller
> > > > doesn't want to have to allocate a new dentry themselves, they just want
> > > > to pass in the new filename :)
> > >   Actually, I wanted the call to be in the spirit of other debugfs calls.
> > > So we have for example:
> > > void debugfs_remove(struct dentry *dentry)
> > 
> > That is because 'debugfs_create' returns a dentry.
> > 
> > > struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
> > > etc.
> > 
> > Same here, you already have a dentry to place this directory into, _and_
> > all the user needs to provide is a name for the new directory.  They
> > don't ever create a dentry themselves, which is what your function
> > required them to do.
> > 
> > Try using your function and you'll see what I mean :)
>   I've tried it when testing the function :). The code looked like:
> dir1 = debugfs_create_dir("dir1", NULL);
> dir2 = debugfs_create_dir("dir2", NULL);
> file1 = debugfs_create_file("file1", 0644, dir1, NULL, NULL);
> file2 = debugfs_rename(dir1, file1, dir2, "new_name");
>   No new dentries needed to be created...

Ah, ok, sorry, that makes more sense, I missed that the dentry's passed
in was the new directory location.  This will still work if you use the
same directory like:
debugfs_rename(dir1, file1, dir1, "new_name");

right?

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Dave Kleikamp
On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > wrote:

> > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > > len)
> > > +{
> > > + handle_t *handle;
> > > + ext4_fsblk_t block, max_blocks;
> > > + int ret, ret2, nblocks = 0, retries = 0;
> > > + struct buffer_head map_bh;
> > > + unsigned int credits, blkbits = inode->i_blkbits;
> > > +
> > > + /* Currently supporting (pre)allocate mode _only_ */
> > > + if (mode != FA_ALLOCATE)
> > > + return -EOPNOTSUPP;
> > > +
> > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > + return -ENOTTY;
> > 
> > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > news.  The changelog would be an appropriate place to communicate this,
> > along with reasons why, or a description of the plan to fix it.
> 
> Ok. Will add this in the function description as well.
> 
> > Also, posix says nothing about fallocate() returning ENOTTY.
> 
> Right. I don't seem to find any suitable error from posix description.
> Can you please suggest an error code which might make more sense here ?
> Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> non-extent files.

Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
here, and fall back to the current library code to do preallocation?
This way, the caller of fallocate() will never see this return code, so
it won't violate posix.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


nfs locking for cluster filesystems

2007-05-07 Thread J. Bruce Fields
Apologies if this is late

Please pull from the 'server-cluster-locking-api' branch at

git://linux-nfs.org/~bfields/linux.git server-cluster-locking-api

for a series of patches which allow NFS to export the locking
functionality provided by filesystems which define their own ->lock()
method (cluster filesystems being the interesting case, and GFS2 the
first example).  There's also a little miscellaneous locks.c cleanup
along the way.

This has gone through an iteration or two with linux-fsdevel and sat in
-mm a couple weeks, and Trond has made a pass through it.

We've tested it by running cthon -l on ext3, gfs2, and nfs exports of
the two, in addition to doing some manual testing to ensure correct
handling of conflicts across multiple servers in a gfs2 cluster.

--b.

---

 fs/fuse/file.c|3 +-
 fs/gfs2/locking/dlm/plock.c   |  109 +++--
 fs/gfs2/locking/nolock/main.c |8 +-
 fs/gfs2/ops_file.c|   12 +-
 fs/lockd/svc4proc.c   |6 +-
 fs/lockd/svclock.c|  275 ++---
 fs/lockd/svcproc.c|7 +-
 fs/lockd/svcsubs.c|2 +-
 fs/locks.c|  264 +++
 fs/nfs/file.c |7 +-
 fs/nfs/nfs4proc.c |1 +
 fs/nfsd/nfs4state.c   |   30 +++--
 include/linux/fcntl.h |4 +
 include/linux/fs.h|9 +-
 include/linux/lockd/lockd.h   |   14 ++-
 15 files changed, 546 insertions(+), 205 deletions(-)

commit 586759f03e2e9031ac5589912a51a909ed53c30a
Author: Marc Eshel <[EMAIL PROTECTED]>
Date:   Tue Nov 14 16:37:25 2006 -0500

gfs2: nfs lock support for gfs2

Add NFS lock support to GFS2.

Signed-off-by: Marc Eshel <[EMAIL PROTECTED]>
Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
Acked-by: Steven Whitehouse <[EMAIL PROTECTED]>

commit 1a8322b2b02071b0c7ac37a28357b93e6362f13e
Author: Marc Eshel <[EMAIL PROTECTED]>
Date:   Tue Nov 28 16:27:06 2006 -0500

lockd: add code to handle deferred lock requests

Rewrite nlmsvc_lock() to use the asynchronous interface.

As with testlock, we answer nlm requests in nlmsvc_lock by first looking up
the block and then using the results we find in the block if B_QUEUED is
set, and calling vfs_lock_file() otherwise.

If this a new lock request and we get -EINPROGRESS return on a non-blocking
request then we defer the request.

Also modify nlmsvc_unlock() to call the filesystem method if appropriate.

Signed-off-by: Marc Eshel <[EMAIL PROTECTED]>
Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>

commit f812048020282fdfa9b72a6cf539c33b6df1fd07
Author: Marc Eshel <[EMAIL PROTECTED]>
Date:   Tue Dec 5 23:48:10 2006 -0500

lockd: always preallocate block in nlmsvc_lock()

Normally we could skip ever having to allocate a block in the case where
the client asks for a non-blocking lock, or asks for a blocking lock that
succeeds immediately.

However we're going to want to always look up a block first in order to
check whether we're revisiting a deferred lock call, and to be prepared to
handle the case where the filesystem returns -EINPROGRESS--in that case we
want to make sure the lock we've given the filesystem is the one embedded
in the block that we'll use to track the deferred request.

Signed-off-by: Marc Eshel <[EMAIL PROTECTED]>
Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>

commit 5ea0d75037b93baa453b4d326c6319968fe91cea
Author: Marc Eshel <[EMAIL PROTECTED]>
Date:   Tue Nov 28 16:27:06 2006 -0500

lockd: handle test_lock deferrals

Rewrite nlmsvc_testlock() to use the new asynchronous interface: instead of
immediately doing a posix_test_lock(), we first look for a matching block.
If the subsequent test_lock returns anything other than -EINPROGRESS, we
then remove the block we've found and return the results.

If it returns -EINPROGRESS, then we defer the lock request.

In the case where the block we find in the first step has B_QUEUED set,
we bypass the vfs_test_lock entirely, instead using the block to decide how
to respond:
with nlm_lck_denied if B_TIMED_OUT is set.
with nlm_granted if B_GOT_CALLBACK is set.
by dropping if neither B_TIMED_OUT nor B_GOT_CALLBACK is set

Signed-off-by: Marc Eshel <[EMAIL PROTECTED]>
Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>

commit 85f3f1b3f7a6197b51a2ab98d927517df730214c
Author: Marc Eshel <[EMAIL PROTECTED]>
Date:   Tue Nov 28 16:27:06 2006 -0500

lockd: pass cookie in nlmsvc_testlock

Change NLM internal interface to pass more information for test lock; we
need this to make sure the cookie information is pushed down to the place
where we do request deferral, which is handled for testlock by the
following patch.

Signed-off-by: 

Re: cifs review reminder

2007-05-07 Thread Steven French
Yes.  I can do that trivially as long as it does not bore people on 
fsdevel.   Only three or four big patches had gone in this year.

I had been sending a few of the bigger patches off to lkml and/or jra (the 
Samba 3 lead) and /or Shaggy.


Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench at-sign us dot ibm dot com



Christoph Hellwig <[EMAIL PROTECTED]> 
05/06/2007 07:48 AM

To
Steven French/Austin/[EMAIL PROTECTED], linux-fsdevel@vger.kernel.org
cc

Subject
cifs review reminder






Steve,

any chance to please send cifs uodates to -fsdevel before sending them
to Linus?  A lot of the cifs code really would benefit from some review.


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Ulrich Drepper

Jakub Jelinek wrote:

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail".  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for len 
bytes"
doesn't make sense for negative size.  


This wording has already been cleaned up.  The current draft for the 
next revision reads:



[EINVAL] The len argument is less than or equal to zero, or the offset
 argument is less than zero, or the underlying file system does not
 support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len<0 is already outlawed.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora
On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
> On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote:
> > /*
> >+ * ext4_ext_try_to_merge:
> >+ * tries to merge the "ex" extent to the next extent in the tree.
> >+ * It always tries to merge towards right. If you want to merge towards
> >+ * left, pass "ex - 1" as argument instead of "ex".
> >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> >+ * 1 if they got merged.
> >+ */
> >+int ext4_ext_try_to_merge(struct inode *inode,
> >+   struct ext4_ext_path *path,
> >+   struct ext4_extent *ex)
> >+{
> 
> Please either use proper kerneldoc format or drop
> "ext4_ext_try_to_merge" from the comment.

Ok, Thanks.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Pekka Enberg

On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote:

 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_extent *ex)
+{


Please either use proper kerneldoc format or drop
"ext4_ext_try_to_merge" from the comment.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + */
> > +int ext4_ext_try_to_merge(struct inode *inode,
> > +   struct ext4_ext_path *path,
> > +   struct ext4_extent *ex)
> > +{
> > +   struct ext4_extent_header *eh;
> > +   unsigned int depth, len;
> > +   int merge_done=0, uninitialized = 0;
> 
> space around "=", please.
> 
> Many people prefer not to do the multiple-definitions-per-line, btw:
> 
>   int merge_done = 0;
>   int uninitialized = 0;

Ok. Will make the change.

> 
> reasons:
> 
> - If gives you some space for a nice comment
> 
> - It makes patches much more readable, and it makes rejects easier to fix
> 
> - standardisation.
> 
> > +   depth = ext_depth(inode);
> > +   BUG_ON(path[depth].p_hdr == NULL);
> > +   eh = path[depth].p_hdr;
> > +
> > +   while (ex < EXT_LAST_EXTENT(eh)) {
> > +   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> > +   break;
> > +   /* merge with next extent! */
> > +   if (ext4_ext_is_uninitialized(ex))
> > +   uninitialized = 1;
> > +   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +   + ext4_ext_get_actual_len(ex + 1));
> > +   if (uninitialized)
> > +   ext4_ext_mark_uninitialized(ex);
> > +
> > +   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> > +   len = (EXT_LAST_EXTENT(eh) - ex - 1)
> > +   * sizeof(struct ext4_extent);
> > +   memmove(ex + 1, ex + 2, len);
> > +   }
> > +   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
> 
> Kenrel convention is to put spaces around "-"

Will fix this.

> 
> > +   merge_done = 1;
> > +   BUG_ON(eh->eh_entries == 0);
> 
> eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
> better to warn and run ext4_error() here?
Ok.
> 
> > +   }
> > +
> > +   return merge_done;
> > +}
> > +
> > +
> >
> > ...
> >
> > +/*
> > + * ext4_ext_convert_to_initialized:
> > + * this function is called by ext4_ext_get_blocks() if someone tries to 
> > write
> > + * to an uninitialized extent. It may result in splitting the uninitialized
> > + * extent into multiple extents (upto three). Atleast one initialized 
> > extent
> > + * and atmost two uninitialized extents can result.
> 
> There are some typos here
> 
> > + * There are three possibilities:
> > + *   a> No split required: Entire extent should be initialized.
> > + *   b> Split into two extents: Only one end of the extent is being 
> > written to.
> > + *   c> Split into three extents: Somone is writing in middle of the 
> > extent.
> 
> and here
> 
Ok. Will fix them.
> > + */
> > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> > +   struct ext4_ext_path *path,
> > +   ext4_fsblk_t iblock,
> > +   unsigned long max_blocks)
> > +{
> > +   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> > +   struct ext4_extent_header *eh;
> > +   unsigned int allocated, ee_block, ee_len, depth;
> > +   ext4_fsblk_t newblock;
> > +   int err = 0, ret = 0;
> > +
> > +   depth = ext_depth(inode);
> > +   eh = path[depth].p_hdr;
> > +   ex = path[depth].p_ext;
> > +   ee_block = le32_to_cpu(ex->ee_block);
> > +   ee_len = ext4_ext_get_actual_len(ex);
> > +   allocated = ee_len - (iblock - ee_block);
> > +   newblock = iblock - ee_block + ext_pblock(ex);
> > +   ex2 = ex;
> > +
> > +   /* ex1: ee_block to iblock - 1 : uninitialized */
> > +   if (iblock > ee_block) {
> > +   ex1 = ex;
> > +   ex1->ee_len = cpu_to_le16(iblock - ee_block);
> > +   ext4_ext_mark_uninitialized(ex1);
> > +   ex2 = &newex;
> > +   }
> > +   /* for sanity, update the length of the ex2 extent before
> > +* we insert ex3, if ex1 is NULL. This is to avoid temporary
> > +* overlap of blocks.
> > +*/
> > +   if (!ex1 && allocated > max_blocks)
> > +   ex2->ee_len = cpu_to_le16(max_blocks);
> > +   /* ex3: to ee_block + ee_len : uninitialised */
> > +   if (allocated > max_blocks) {
> > +   unsigned int newdepth;
> > +   ex3 = &newex;
> > +   ex3->ee_block = cpu_to_le32(iblock + max_blocks);
> > +   ext4_ext_store_pblock(ex3, newblock + max_blocks);
> > +   ex3->ee_len = cpu_to_le16(allocated - max_blocks);
> > +   ext4_ext_mark_uninitialized(ex3);
> > +   err = ext4_ext_insert_extent(handle, inode, path, ex3);
> > +   if (err)
> > +   goto out;
> > +   /* The depth, and hence eh & ex might change
> > +* as part of the insert above.
> > +*/
> > +   newdepth = ext_depth(inode);
> > +   if (newdepth != dep

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch has the ext4 implemtation of fallocate system call.
> > 
> > ...
> >
> > +   /* ext4_can_extents_be_merged should have checked that either
> > +* both extents are uninitialized, or both aren't. Thus we
> > +* need to check only one of them here.
> > +*/
> 
> Please always format multiline comments like this:
> 
>   /*
>* ext4_can_extents_be_merged should have checked that either
>* both extents are uninitialized, or both aren't. Thus we
>* need to check only one of them here.
>*/

Ok.
 
> > ...
> >
> > +/*
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
> 
> But what?   Here is where it should be described.

Ok. Will expand the description.
 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > len)
> > +{
> > +   handle_t *handle;
> > +   ext4_fsblk_t block, max_blocks;
> > +   int ret, ret2, nblocks = 0, retries = 0;
> > +   struct buffer_head map_bh;
> > +   unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +   /* Currently supporting (pre)allocate mode _only_ */
> > +   if (mode != FA_ALLOCATE)
> > +   return -EOPNOTSUPP;
> > +
> > +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +   return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
> Also, posix says nothing about fallocate() returning ENOTTY.

Right. I don't seem to find any suitable error from posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.
 
> > +   block = offset >> blkbits;
> > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +- block;
> > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

You are right to say that the credits can not be fixed here. But, 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the "segment range" already
being allocated case) the allocation request might touch.
One solution I have thought is to check the buffer credits after a call to
ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the
credits are falling short. Incase journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of "len".
 
> > +   handle=ext4_journal_start(inode, credits +
> 
> Please always put spaces around "="A
Ok.
> 
> > +   EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> 
> And around "+"
Ok.
> 
> > +   if (IS_ERR(handle))
> > +   return PTR_ERR(handle);
> > +retry:
> > +   ret = 0;
> > +   while (ret >= 0 && ret < max_blocks) {
> > +   block = block + ret;
> > +   max_blocks = max_blocks - ret;
> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ok. Will do that.
> 
> > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> 
> Use buffer_new() here.   A separate patch which fixes the three existing
> instances of open-coded BH_foo usage would be appreciated.

Ok.
> 
> > +   && ((block + ret) > (i_size_read(inode) << blkbits)))
> 
> Check for wrap though the sign bit and through zero please.
Ok.
> 
> > +   nblocks = nblocks + ret;
> > +   }
> > +
> > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > +   goto retry;
> > +
> > +   /* Time to update the file size.
> > +* Update only when preallocation was requested beyond the file size.

Re: [PATCH 3/5] ext4: Extent overlap bugfix

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > +unsigned int ext4_ext_check_overlap(struct inode *inode,
> > +   struct ext4_extent *newext,
> > +   struct ext4_ext_path *path)
> > +{
> > +   unsigned long b1, b2;
> > +   unsigned int depth, len1;
> > +
> > +   b1 = le32_to_cpu(newext->ee_block);
> > +   len1 = le16_to_cpu(newext->ee_len);
> > +   depth = ext_depth(inode);
> > +   if (!path[depth].p_ext)
> > +   goto out;
> > +   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> > +
> > +   /* get the next allocated block if the extent in the path
> > +* is before the requested block(s) */
> > +   if (b2 < b1) {
> > +   b2 = ext4_ext_next_allocated_block(path);
> > +   if (b2 == EXT_MAX_BLOCK)
> > +   goto out;
> > +   }
> > +
> > +   if (b1 + len1 > b2) {
> 
> Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!
 
> > +   newext->ee_len = cpu_to_le16(b2 - b1);
> > +   return 1;
> > +   }


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the 
> above
> check was relaxed?

I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?

One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.
> 
> If that's all too much, this material should at least be spelled out in the
> changelog.  Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
> 
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice.  Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
> 
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?

Ok. I will add a detailed comment here.

> 
> > +{
> > +   struct file *file;
> > +   struct inode *inode;
> > +   long ret = -EINVAL;
> > +
> > +   if (len == 0 || offset < 0)
> > +   goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

> 
> > +   ret = -EBADF;
> > +   file = fget(fd);
> > +   if (!file)
> > +   goto out;
> > +   if (!(file->f_mode & FMODE_WRITE))
> > +   goto out_fput;
> > +
> > +   inode = file->f_path.dentry->d_inode;
> > +
> > +   ret = -ESPIPE;
> > +   if (S_ISFIFO(inode->i_mode))
> > +   goto out_fput;
> > +
> > +   ret = -ENODEV;
> > +   if (!S_ISREG(inode->i_mode))
> > +   goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

True. 
 
> > +   ret = -EFBIG;
> > +   if (offset + len > inode->i_sb->s_maxbytes)
> > +   goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
> here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
> > +   if (inode->i_op && inode->i_op->fallocate)
> > +   ret = inode->i_op->fallocate(inode, mode, offset, len);
> > +   else
> > +   ret = -ENOSYS;
> 
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
> 
> If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
> > +out_fput:
> > +   fput(file);
> > +out:
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
> 
> I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE0x1
> > +#define FA_DEALLOCATE  0x2
> 
> Now those aren't in posix.  They should be documented, along with their
> expected semantics.

Will add a comment describing the role of these modes.
 
> >  #ifdef __KERNEL__
> >  
> >  #include 
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> > ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > int (*removexattr) (struct dentry *, const char *);
> > void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
> 
> I really do think it's better to put the variable names in definitions such
> as this.  Especially when we have two identically-typed variables next to
> each other like that.  Quick: which one is the offset and which is the
> length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NFS] [PATCH 0/18] export operations rewrite

2007-05-07 Thread Christoph Hellwig
On Mon, May 07, 2007 at 10:06:10AM +1000, Neil Brown wrote:
> On Monday May 7, [EMAIL PROTECTED] wrote:
> > 
> > Would you be able to respin that second patch series with one of those
> > changes?
> 
> Of course it is actually the first series of patches that introduces
> this problem.  So maybe just a full respin??

I'll take a look at how to solve this issues.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html