Re: Proposal for annotating _unstable_ pages

2015-05-21 Thread Jan Kara
On Wed 20-05-15 18:04:40, Kent Overstreet wrote:
> > Yeah.  I never figured out a sane way to migrate pages and keep everything
> > else happy.  Daniel Phillips is having a go at page forking for tux3; let's
> > see if the questions about that get resolved.
> 
> That would be great, we need something.
> 
> I'd also be really curious what btrfs is doing today - is it just bouncing
> everything internally, or did they come up with something more clever?

Btrfs is just waiting for IO to complete.

> > > Also, there's probably always going to be situations where we're reading 
> > > or
> > > writing to pages user space can stomp on (dio) - IMO we need to add a bio 
> > > flag
> > > to annotate this - "if you need this to be stable you have to bounce it".
> > > Otherwise either filesystems/block drivers are going to be stuck bouncing
> > > everything, or it'll just (continue to be) buggy.
> > 
> > Well, for now there's BIO_SNAP_STABLE that forces the block layer to bounce 
> > it,
> > but right now ext3 is the last user of it, and afaict btrfs is the only 
> > other
> > FS that takes care of stable pages on its own.
> 
> I have no idea what BIO_SNAP_STABLE was supposed to be for, but I don't see 
> how
> it's useful for anything sane.

It's for the case where lower layer requests it needs stable pages but
upper layer isn't able to provide them (as is the case of ext3). Then block
layer bounces the data for the caller.

> But that's the complete opposite of the problem stable pages are supposed to
> solve: stable pages are for when the _lower_ layer (be it filesystem, bcache,
> md, lvm) needs the memory being either read to or written from (both, it's not
> just writes) to not be diddled over while the IO is in flight.
> 
> Now, a point that I think has been missed is that stable pages are _not_ a
> complete solution, at least for consumers in the block layer.
> 
> The situation today is that if I'm in the block layer, and I get a handed a 
> read
> or write bio, I _don't know_ if it's from something that's going to diddle 
> over
> those pages or not. So if I require stable pages - be it for data checksumming
> or for other things - I've just got to bounce the bio myself.
> 
> And then the really annoying thing is that if you've got stacked things that 
> all
> need stable pages (maybe btrfs on top of bcache on top of md) - they _all_ 
> have
> to assume the pages aren't going to be stable, so if they need them they _all_
> have to bounce - even though once the first layer bounced the bio that made it
> stable for everything underneath it.

The current design is that if you need stable pages for your device, set
bdi capability BDI_CAP_STABLE_WRITES, fs then takes care of not scribbling
over your page while it is under writeback or uses BIO_SNAP_STABLE if it
cannot.

There is no reason why this shouldn't work with device stacking (although
I'm not sure it really works currently). You are right that this won't
solve the possible issues with direct IO where user scribbles over the
buffers while direct IO is in flight. We could make direct IO submit pages
with BIO_SNAP_STABLE when underlying device declares they are required but
I assume some users would rather like to promise they don't touch them than
paying the cost of copy...

I have to say I don't quite see the advantage of your proposal over this...

> Stable pages for IO to/from the pagecache are _not_ going to solve this 
> problem,
> because the page cache is not the only source of IO to non stable pages 
> (Direct
> IO will always be, even if everything else gets fixed).
> 
> So what I'm proposing is:
> 
>  - Add a new bio flag: BIO_PAGES_NOT_STABLE
> 
>  - Everything that submits a bio and _doesn't_ guarantee that the pages won't 
> be
>touched while the IO is in flight has to set that flag.  This flag will 
> have
>to be preserved when cloning a bio, but not when cloning a bio and its 
> pages
>(i.e. bouncing it).
> 
> This is going to be a lot of not-fun work auditing code, but IMO it really 
> needs
> to be done. As a bonus, once it's done everything that generates IO that must 
> be
> expensively bounced will be nicely annotated.
> 
> To verify that the annotations are correct, for writes we can add some debug
> code to the generic IO path that checksums the data before and after the IO 
> and
> complains loudly if the checksums don't match. Dunno what we can do for reads.
> 
> Thoughts?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Jan Kara
On Fri 28-11-14 13:14:21, Ted Tso wrote:
> On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
> > Hum, when someone calls fsync() for an inode, you likely want to sync
> > timestamps to disk even if everything else is clean. I think that doing
> > what you did in last version:
> > dirty = inode->i_state & I_DIRTY_INODE;
> > inode->i_state &= ~I_DIRTY_INODE;
> > spin_unlock(&inode->i_lock);
> > if (dirty & I_DIRTY_TIME)
> > mark_inode_dirty_sync(inode);
> > looks better to me. IMO when someone calls __writeback_single_inode() we
> > should write whatever we have...
> 
> Yes, but we also have to distinguish between what happens on an
> fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
> (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
> check in the fsync() code path to handle the concern you raised above.
  Ah, this is the thing you have been likely talking about but which I was
constantly missing in my thoughts. You don't want to write times when inode
has only dirty pages and timestamps - I was always thinking about a
situation where inode has only dirty timestamps and not pages. This
situation also complicates the writeback logic because when inode has dirty
pages, you need to track it as normal dirty inode for page writeback (with
dirtied_when correspoding to time when pages were dirtied) but in
parallel you now need to track the information that inode has timestamps
that weren't written for X long. And even if we stored how old are
timestamps it isn't easily possible to keep the list of inodes with just
dirty timestamps sorted by dirty time. So now I finally understand why you
did things the way you did them... Sorry for misleading you.

So let's restart the design so that things are clear:
1) We have new inode bit I_DIRTY_TIME. This means that only timestamps in
the inode have changed. The desired behavior is that inode is with
I_DIRTY_TIME and without I_DIRTY_SYNC | I_DIRTY_DATASYNC is written by
background writeback only once per 24 hours. Such inodes do get written by
sync(2) and fsync(2) calls.

2) Inodes with only I_DIRTY_TIME are tracked in a new dirty list
b_dirty_time. We use i_wb_list list head for this. Unlike b_dirty list,
this list isn't kept sorted by dirtied_when. If queue_io() sees for_sync
bit set in the work item, it will call mark_inode_dirty_sync() for all
inodes in b_dirty_time before queuing io from b_dirty list. Once per hour
(or something like that) flusher thread scans the whole b_dirty_time list
and calls mark_inode_dirty_sync() for all inodes that have too old dirty
timestamps (to detect this we need a new time stamp in the inode).

3) When fsync() sees inode with I_DIRTY_TIME set, it calls
mark_inode_dirty_sync().

4) When we are dropping last inode reference and inode has I_DIRTY_TIME
set, we call mark_inode_dirty_sync().

And that should be it, right?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-28 Thread Jan Kara
On Thu 27-11-14 18:00:16, Ted Tso wrote:
> On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> > * change queue_io() to also call
> > moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 
> > 24hours)
> >   For this you need to tweak move_expired_inodes() to take pointer to
> >   timestamp instead of pointer to work but that's trivial. Also you want
> >   probably leave time ->older_than_this value (i.e. without +24 hours) if
> >   we are doing WB_SYNC_ALL writeback. With this you can remove
> >   flush_sb_dirty_time() completely.
> 
> Well it's not quite enough.  The problem is that for ext3 and
> ext4, the actual work of writing the inode happens in dirty_inode(),
> not in write_inode().  Which means we need to do something like this.
  Right, I didn't realize this problem.

> I'm not entirely sure whether or not this is too ugly to live;
> personally, I think my hack of handling this in update_time() might be
> preferable
  Actually handling the copying of timestamps in __writeback_single_inode()
would look fine to me. You mention in your next email, calling
mark_inode_dirty_sync() from flusher may be problematic - why? How is this
any different from calling mark_inode_dirty_sync() from
flush_sb_dirty_time()?

I will note that for a while I thought copying the full inode to on-disk
buffer may be problematic because inode may be in an intermediate state of
some transactional change. But that isn't an issue - if there's any
transactional change in progress, it has a handle open and until the
change is node, thus the buffer with the partial change cannot go to the
journal (transaction cannot commit) until mark_inode_dirty_sync() copies
the final state of the inode.

Another solution may be to convey the information that copying of timestamps
is necessary to ->write_inode method. We could do that via a
flag bit in writeback_control. Each filesystem can then copy timestamps
when this bit is set. But calling mark_inode_dirty_sync() from
__writeback_single_inode() looks simpler to me.

Honza

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b93c529..95a42b3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -253,7 +253,7 @@ static bool inode_dirtied_after(struct inode *inode, 
> unsigned long t)
>   */
>  static int move_expired_inodes(struct list_head *delaying_queue,
>  struct list_head *dispatch_queue,
> -struct wb_writeback_work *work)
> +unsigned long *older_than_this)
>  {
>   LIST_HEAD(tmp);
>   struct list_head *pos, *node;
> @@ -264,8 +264,8 @@ static int move_expired_inodes(struct list_head 
> *delaying_queue,
>  
>   while (!list_empty(delaying_queue)) {
>   inode = wb_inode(delaying_queue->prev);
> - if (work->older_than_this &&
> - inode_dirtied_after(inode, *work->older_than_this))
> + if (older_than_this &&
> + inode_dirtied_after(inode, *older_than_this))
>   break;
>   list_move(&inode->i_wb_list, &tmp);
>   moved++;
> @@ -309,9 +309,14 @@ out:
>  static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work 
> *work)
>  {
>   int moved;
> + unsigned long one_day_later = jiffies + (HZ * 86400);
> +
>   assert_spin_locked(&wb->list_lock);
>   list_splice_init(&wb->b_more_io, &wb->b_io);
> - moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
> + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
> + work->older_than_this);
> + moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
> +  &one_day_later);
>   trace_writeback_queue_io(wb, work, moved);
>  }
>  
> @@ -637,6 +642,17 @@ static long writeback_sb_inodes(struct super_block *sb,
>   }
>  
>   /*
> +  * If the inode is marked dirty time but is not dirty,
> +  * then at last for ext3 and ext4 we need to call
> +  * mark_inode_dirty_sync in order to get the inode
> +  * timestamp transferred to the on disk inode, since
> +  * write_inode is a no-op for those file systems.  HACK HACK 
> HACK
> +  */
> + if ((inode->i_state & I_DIRTY_TIME) &&
> + ((inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0))
> + mark_inode_dirty_sync(inode);
> +

Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-28 Thread Jan Kara
On Thu 27-11-14 15:19:54, Ted Tso wrote:
> On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> > Looking into the code & your patch I'd prefer to do something like:
> > * add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
> >   call __mark_inode_dirty() with this flag if any of the times was updated.
> >   That way we can just remove your ->write_time() callback - filesystems
> >   can just handle this in their ->dirty_inode() methods if they wish.
> >   __mark_inode_dirty() will take care of moving inode into proper writeback
> >   list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
> 
> One of the tricky bits about this is that btrfs wants to be able to
> return an error from write_time() which gets reflected up through
> update_time() to the callers of file_update_time().  Currently
> __mark_inode_dirty() and family return a void, and changing this is
> going to be a bit of a mess, since doing this correctly would require
> auditing all of the callers of mark_inode_dirty(),
> mark_inode_dirty_sync(), __mark_inode_dirty(), etc.
> 
> Doing this would be a good thing, and eventually I think it would be
> nice if we could allow the mark_inode_dirty() functions return an
> error instead of void, but I wonder if that's a cleanup that's better
> saved for later.  While we were at it, maybe we should rename
> mark_inode_dirty() to inode_dirty(), since that way we can be sure
> we've looked at all of the call site of mark_inode_dirty() and friends
> --- and we have a number of file systems, including btrfs, ext3, and
> ext4, where mark_inode_dirty() is doing a lot more than just marking
> the inode is dirty, and the only reason why it's named that is mostly
> historical.
  Except that lots of callers of update_time() / file_update_time() just
ignore the return value anyway. And frankly most of the time it's a
simplification we can get away with. I agree that ultimately we should
propagate and handle these errors but as you say above handling errors from
__mark_inode_dirty() is what we'd really need - that handles the whole
class of errors. So for now I would be OK, with just ignoring the error
when updating time stamps.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-27 Thread Jan Kara
  struct inode *inode, *ret_inode = NULL;
>   int mval;
> 
>   spin_lock(&inode_hash_lock);
>   hlist_for_each_entry(inode, head, i_hash) {
>   if (inode->i_sb != sb)
>       continue;
>   mval = match(inode, hashval, data);
>   if (mval == 0)
>   continue;
>   if (mval == 1)
>   ret_inode = inode;
>   goto out;
>   }
> out:
>   spin_unlock(&inode_hash_lock);
>   return ret_inode;
> }
> EXPORT_SYMBOL(find_inode_nowait);
> 
> Comments?
> 
>   - Ted
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-27 Thread Jan Kara
On Thu 27-11-14 14:27:52, Jan Kara wrote:
> On Thu 27-11-14 10:35:37, Dave Chinner wrote:
> > On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote:
> > > On Nov 26, 2014, at 3:48 PM, Dave Chinner  wrote:
> > > > 
> > > > On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
> > > >> Add an optimization for the MS_LAZYTIME mount option so that we will
> > > >> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> > > >> in a particular inode table block when we need to update some inode
> > > >> in that inode table block anyway.
> > > >> 
> > > >> Also add some temporary code so that we can set the lazytime mount
> > > >> option without needing a modified /sbin/mount program which can set
> > > >> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> > > >> added support.
> > > >> 
> > > >> Google-Bug-Id: 18297052
> > > >> 
> > > >> Signed-off-by: Theodore Ts'o 
> > > >> ---
> > > >> fs/ext4/inode.c | 49 
> > > >> ++---
> > > >> fs/ext4/super.c |  9 +
> > > >> include/trace/events/ext4.h | 30 +++
> > > >> 3 files changed, 85 insertions(+), 3 deletions(-)
> > > >> 
> > > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > >> index 5653fa4..8308c82 100644
> > > >> --- a/fs/ext4/inode.c
> > > >> +++ b/fs/ext4/inode.c
> > > >> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t 
> > > >> *handle,
> > > >> }
> > > >> 
> > > >> /*
> > > >> + * Opportunistically update the other time fields for other inodes in
> > > >> + * the same inode table block.
> > > >> + */
> > > >> +static void ext4_update_other_inodes_time(struct super_block *sb,
> > > >> +unsigned long orig_ino, char 
> > > >> *buf)
> > > >> +{
> > > >> +  struct ext4_inode_info  *ei;
> > > >> +  struct ext4_inode   *raw_inode;
> > > >> +  unsigned long   ino;
> > > >> +  struct inode*inode;
> > > >> +  int i, inodes_per_block = 
> > > >> EXT4_SB(sb)->s_inodes_per_block;
> > > >> +  int inode_size = EXT4_INODE_SIZE(sb);
> > > >> +
> > > >> +  ino = orig_ino & ~(inodes_per_block - 1);
> > > >> +  for (i = 0; i < inodes_per_block; i++, ino++, buf += 
> > > >> inode_size) {
> > > >> +  if (ino == orig_ino)
> > > >> +  continue;
> > > >> +  inode = find_active_inode_nowait(sb, ino);
> > > >> +  if (!inode ||
> > > >> +  (inode->i_state & I_DIRTY_TIME) == 0 ||
> > > >> +  !spin_trylock(&inode->i_lock)) {
> > > >> +  iput(inode);
> > > >> +  continue;
> > > >> +  }
> > > >> +  inode->i_state &= ~I_DIRTY_TIME;
> > > >> +  inode->i_ts_dirty_day = 0;
> > > >> +  spin_unlock(&inode->i_lock);
> > > >> +  inode_requeue_dirtytime(inode);
> > > >> +
> > > >> +  ei = EXT4_I(inode);
> > > >> +  raw_inode = (struct ext4_inode *) buf;
> > > >> +
> > > >> +  spin_lock(&ei->i_raw_lock);
> > > >> +  EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
> > > >> +  EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
> > > >> +  EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
> > > >> +  ext4_inode_csum_set(inode, raw_inode, ei);
> > > >> +  spin_unlock(&ei->i_raw_lock);
> > > >> +  trace_ext4_other_inode_update_time(inode, orig_ino);
> > > >> +  iput(inode);
> > > >> +  }
> > > >> +}
> > > > 
> > > > Am I right in that this now does unlogged timestamp updates of
> > >

Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-27 Thread Jan Kara
; > > inode update "optimisation" so that all changes (including timestamps)
> > > are transactional so inode state on disk not matching what log
> > > recovery wrote to disk for all the other inode metadata...
> > > 
> > > Optimistic unlogged inode updates are a slippery slope, and history
> > > tells me that it doesn't lead to a nice place
> > 
> > Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
> > logical journaling, this isn't an unlogged update.  It is just taking
> > advantage of the fact that the whole block is going to be logged and
> > written to the disk anyway.
> 
> Urk - that's worse, isn't it? i.e the code above calls iput() from
> within a current transaction context?  What happens if that drops
> the last reference to the inode and it gets evicted due to racing
> with an unlink? Won't that try to start another transaction to free
> the inode (i.e. through ext4_evict_inode())?
  Yeah, the patch looks buggy (and racy wrt concurrent updates of time
stamps as well). I think if we want to do this optimization, we would need
a function like "clear inode dirty bits for this range of inode numbers".
That is doable atomically within VFS and although it looks somewhat ugly,
the performance
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-27 Thread Jan Kara
On Wed 26-11-14 05:23:52, Ted Tso wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
> 
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
> 
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
  So this looks better to me than previous versions but I'm still not 100%
happy :)

Looking into the code & your patch I'd prefer to do something like:
* add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
  call __mark_inode_dirty() with this flag if any of the times was updated.
  That way we can just remove your ->write_time() callback - filesystems
  can just handle this in their ->dirty_inode() methods if they wish.
  __mark_inode_dirty() will take care of moving inode into proper writeback
  list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
* change queue_io() to also call
moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 
24hours)
  For this you need to tweak move_expired_inodes() to take pointer to
  timestamp instead of pointer to work but that's trivial. Also you want
  probably leave time ->older_than_this value (i.e. without +24 hours) if
  we are doing WB_SYNC_ALL writeback. With this you can remove
  flush_sb_dirty_time() completely.
* Changes for iput() & fsync stay as they are.

And this should be all that's necessary. I'm not 100% sure about your dirty
bits naming changes - let's see how that will look like when the above more
substantial changes are done.

One technical detail below:

> diff --git a/fs/inode.c b/fs/inode.c
> index 8f5c4b5..9e464cc 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1430,11 +1430,22 @@ static void iput_final(struct inode *inode)
>   */
>  void iput(struct inode *inode)
>  {
> - if (inode) {
> - BUG_ON(inode->i_state & I_CLEAR);
> -
> - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
> - iput_final(inode);
> + if (!inode)
> + return;
> + BUG_ON(inode->i_state & I_CLEAR);
  I think we can better handle this without retry at this place like:
if (atomic_read(&inode->i_count) == 1 && inode->i_nlink &&
(inode->i_state & I_DIRTY_TIME)) {
if (inode->i_op->write_time)
inode->i_op->write_time(inode);
else if (inode->i_sb->s_op->write_inode)
mark_inode_dirty_sync(inode);
}
  Sure it will be one more read of i_count in the fast path but that's IMO
negligible.

BTW: Is the test for ->write_inode really needed? We don't do it e.g. in
update_time().

> +retry:
> + if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> + atomic_inc(&inode->i_count);
> + inode->i_state &= ~I_DIRTY_TIME;
> + spin_unlock(&inode->i_lock);
> + if (inode->i_op->write_time)
> + inode->i_op->write_time(inode);
> + else if (inode->i_sb->s_op->write_inode)
> + mark_inode_dirty_sync(inode);
> + goto retry;
> + }
> + iput_final(inode);
>   }
>  }
>  EXPORT_SYMBOL(iput);

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Jan Kara
On Wed 26-11-14 11:23:28, Christoph Hellwig wrote:
> As mentioned last round please move the addition of the is_readonly
> operation to the first thing in the series, so that the ordering makes
> more sense.
> 
> Second I think this patch is incorrect for XFS - XFS uses ->update_time
> to set the time stampst in the dinode.  These two need to be coherent
> as we can write out a dirty inode any time, so it needs to have the
> timestamp uptodate.
  But Ted changed XFS to copy timestamps to on-disk structure from the
in-memory inode fields after VFS updated the timestamps. So the stamps
should be coherent AFAICT, shouldn't they?

> Third update_time now calls mark_inode_dirty unconditionally, while
> previously it wasn't called when ->update_time was set.  At least
> for XFS that's a major change in behavior as XFS never used VFS dirty
> tracking for metadata updates.
  We don't call mark_inode_dirty() when ->write_time is set (note the
return, I missed it on the first reading) which looks sensible to me.

        Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Tue 25-11-14 12:57:16, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> >   Actually, I'd also prefer to do the writing from iput_final(). My main
> > reason is that shrinker starts behaving very differently when you put
> > inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
> > particular:
> > /*
> >  * Referenced or dirty inodes are still in use. Give them another
> >  * pass
> >  * through the LRU as we canot reclaim them now.
> >  */
> > if (atomic_read(&inode->i_count) ||
> > (inode->i_state & ~I_REFERENCED)) {
> > list_del_init(&inode->i_lru);
> > spin_unlock(&inode->i_lock);
> > this_cpu_dec(nr_unused);
> > return LRU_REMOVED;
> > }
> 
> I must be missing something; how would the shirnker behave
> differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
> note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
> deliberate, because I didn't want I_DIRTY_TIME to have any affect on
> any of the other parts of the writeback or inode management parts.
  Sure, but the test tests whether the inode has *any other* bit than
I_REFERENCED set. So I_DIRTY_TIME will trigger the test and we just remove
the inode from lru list. You could exclude I_DIRTY_TIME from this test to
avoid this problem but then the shrinker latency would get much larger
because it will suddently do IO in evict(). So I still think doing the
write in iput_final() is the best solution.

> > Regarding your concern that we'd write the inode when file is closed -
> > that's not true. We'll write the inode only after corresponding dentry is
> > evicted and thus drops inode reference. That doesn't seem too bad to me.
> 
> True, fair enough.  It's not quite so lazy, but it should be close
> enough.
> 
> I'm still not seeing the benefit in waiting until the last possible
> minute to write out the timestamps; evict() can block as it is if
> there are any writeback that needs to be completed, and if the
> writeback happens to pages subject to delalloc, the timestamp update
> could happen for free at that point.
  Yeah, doing IO from evict is OK in princible but the change in shrinker
success rate / latency worries me... It would certainly need careful
testing under memory pressure & IO load with lots of outstanding timestamp
updates and see how shrinker behaves (change in CPU consumption, numbers of
evicted inodes, etc.).

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] ext4: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Fri 21-11-14 14:59:24, Ted Tso wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
> 
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> added support.
...
> diff --git a/fs/inode.c b/fs/inode.c
> index f0d6232..89cfca7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned 
> long ino)
>  }
>  EXPORT_SYMBOL(ilookup);
>  
> +/**
> + * find_active_inode_nowait - find an active inode in the inode cache
> + * @sb:  super block of file system to search
> + * @ino: inode number to search for
> + *
> + * Search for an active inode @ino in the inode cache, and if the
> + * inode is in the cache, the inode is returned with an incremented
> + * reference count.  If the inode is being freed or is newly
> + * initialized, return nothing instead of trying to wait for the inode
> + * initialization or destruction to be complete.
> + */
> +struct inode *find_active_inode_nowait(struct super_block *sb,
> +unsigned long ino)
> +{
> + struct hlist_head *head = inode_hashtable + hash(sb, ino);
> + struct inode *inode, *ret_inode = NULL;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, head, i_hash) {
> + if ((inode->i_ino != ino) ||
> + (inode->i_sb != sb))
> + continue;
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
> + __iget(inode);
> + ret_inode = inode;
> + }
> + spin_unlock(&inode->i_lock);
> + goto out;
> + }
> +out:
> + spin_unlock(&inode_hash_lock);
> + return ret_inode;
> +}
> +EXPORT_SYMBOL(find_active_inode_nowait);
> +
  Please move this function definition into a separate patch so that it
isn't hidden in an ext4-specific patch. Otherwise it looks good.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Jan Kara
On Fri 21-11-14 14:59:23, Ted Tso wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
  Hum, how about reusing i_dirtied_when for this. Using that field even
makes a good sence to me...

Honza

> Signed-off-by: Theodore Ts'o 
> ---
>  fs/fs-writeback.c  | 1 +
>  fs/inode.c | 7 ++-
>  include/linux/fs.h | 1 +
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ce7de22..eb04277 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>   if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>   trace_writeback_dirty_inode_start(inode, flags);
>  
> + inode->i_ts_dirty_day = 0;
>   if (sb->s_op->dirty_inode)
>   sb->s_op->dirty_inode(inode, flags);
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 6e91aca..f0d6232 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
> struct inode *inode,
>   */
>  static int update_time(struct inode *inode, struct timespec *time, int flags)
>  {
> + int days_since_boot = jiffies / (HZ * 86400);
>   int ret;
>  
>   if (inode->i_op->update_time) {
> @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct 
> timespec *time, int flags)
>   if (flags & S_MTIME)
>   inode->i_mtime = *time;
>   }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> + (!inode->i_ts_dirty_day ||
> +  inode->i_ts_dirty_day == days_since_boot)) {
>   spin_lock(&inode->i_lock);
>   inode->i_state |= I_DIRTY_TIME;
>   spin_unlock(&inode->i_lock);
> + inode->i_ts_dirty_day = days_since_boot;
>   return 0;
>   }
> + inode->i_ts_dirty_day = 0;
>   if (inode->i_op->write_time)
>   return inode->i_op->write_time(inode);
>   mark_inode_dirty_sync(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 489b2f2..e3574cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -575,6 +575,7 @@ struct inode {
>   struct timespec i_ctime;
>   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
>   unsigned short  i_bytes;
> + unsigned short  i_ts_dirty_day;
>   unsigned int    i_blkbits;
>   blkcnt_ti_blocks;
>  
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
>   ...
> > > +}
> > 
> > This just seems wrong to me, not to mention extremely expensive when we have
> > millions of cached inodes on the superblock.
> 
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
> 
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
> 
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
> 
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes for the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inde or data
> pages sense).
  I agree this isn't going to work easily. Currently flusher relies on
dirty list being sorted by i_dirtied_when and that gets harder to maintain
if we ever have inodes with i_dirtied_when in future (we'd have to sort-in
newly dirtied inodes).

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.  Given #1 and #2 above,
> yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
> we're not iterating over all of the inodes twice, but I believe that
> is a better trade-off than bloating the inode structure with an extra
> set of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).
  This would be possible and as Boaz says, it might be possible to reuse
the same list_head in the inode for this. Getting rid of the full scan of
all superblock inodes would be nice (as the scan gets really expensive for
large numbers of inodes (think of i_sb_lock contention) and this makes it
twice as bad) so I'd prefer to do this if possible.
 
Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
> 
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.
> 
> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.  For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had been only been read on every
> close --- which in the case of say, all of the files in the kernel
> tree, would be a little unfortunate.
  Actually, I'd also prefer to do the writing from iput_final(). My main
reason is that shrinker starts behaving very differently when you put
inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
particular:
/*
 * Referenced or dirty inodes are still in use. Give them another
 * pass
 * through the LRU as we canot reclaim them now.
 */
if (atomic_read(&inode->i_count) ||
(inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
}

Regarding your concern that we'd write the inode when file is closed -
that's not true. We'll write the inode only after corresponding dentry is
evicted and thus drops inode reference. That doesn't seem too bad to me.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 6/7] fs: pass iocb to generic_write_sync

2014-11-06 Thread Jan Kara
On Wed 05-11-14 16:14:52, Milosz Tanski wrote:
> From: Christoph Hellwig 
> 
> Clean up the generic_write_sync by just passing an iocb and a bytes
> written / negative errno argument.  In addition to simplifying the
> callers this also prepares for passing a per-operation O_DSYNC
> flag.  Two callers didn't quite fit that scheme:
> 
>  - dio_complete didn't both to update ki_pos as we don't need it
>on a iocb that is about to be freed, so we had to add it. Additionally
>it also synced out written data in the error case, which has been
>changed to operate like the other callers.
>  - gfs2 also used generic_write_sync to implement a crude version
>of fallocate.  It has been switched to use an open coded variant
>instead.
> 
> Signed-off-by: Christoph Hellwig 
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> ---
>  fs/block_dev.c |  8 +---
>  fs/btrfs/file.c|  7 ++-
>  fs/cifs/file.c |  8 +---
>  fs/direct-io.c |  8 ++--
>  fs/ext4/file.c |  8 +---
>  fs/gfs2/file.c |  9 +++--
>  fs/ntfs/file.c |  8 ++--
>  fs/udf/file.c  | 11 ++-
>  fs/xfs/xfs_file.c  |  8 +---
>  include/linux/fs.h |  8 +---
>  mm/filemap.c   | 30 --
>  11 files changed, 40 insertions(+), 73 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index cc9d411..c529b1c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned 
> cmd, unsigned long arg)
>   */
>  ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
> - struct file *file = iocb->ki_filp;
>   struct blk_plug plug;
>   ssize_t ret;
>  
>   blk_start_plug(&plug);
>   ret = __generic_file_write_iter(iocb, from);
> - if (ret > 0) {
> - ssize_t err;
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
>   blk_finish_plug(&plug);
>   return ret;
>  }
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a18ceab..4f4a6f7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> *iocb,
>*/
>   BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
>   BTRFS_I(inode)->last_sub_trans = root->log_transid;
> - if (num_written > 0) {
> - err = generic_write_sync(file, pos, num_written);
> - if (err < 0)
> - num_written = err;
> - }
> +
> + num_written = generic_write_sync(iocb, num_written);
>  
>   if (sync)
>   atomic_dec(&BTRFS_I(inode)->sync_writers);
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index c485afa..32359de 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
>   rc = __generic_file_write_iter(iocb, from);
>   mutex_unlock(&inode->i_mutex);
>  
> - if (rc > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - rc, rc);
> - if (err < 0)
> - rc = err;
> - }
> + rc = generic_write_sync(iocb, rc);
>   } else {
>   mutex_unlock(&inode->i_mutex);
>   }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..b72ac83 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t 
> offset, ssize_t ret,
>   inode_dio_done(dio->inode);
>   if (is_async) {
>   if (dio->rw & WRITE) {
> - int err;
> -
> - err = generic_write_sync(dio->iocb->ki_filp, offset,
> -  transferred);
> - if (err < 0 && ret > 0)
> - ret = err;
> + dio->iocb->ki_pos = offset + transferred;
> + ret = generic_write_sync(dio->iocb, ret);
>   }
>  
>   aio_complete(dio->iocb, ret, 0);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index aca7b24..79b000c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -175,13 +175,7 @@ ext4_file_wr

Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread Jan Kara
On Tue 02-09-14 07:31:04, Ted Tso wrote:
> >  - the very small max readahead size
> 
> For things like the readahead size, that's probably something that we
> should autotune, based the time it takes to read N sectors.  i.e.,
> start N relatively small, such as 128k, and then bump it up based on
> how long it takes to do a sequential read of N sectors until it hits a
> given tunable, which is specified in milliseconds instead of kilobytes.
  Actually the amount of readahead we do is autotuned (based on hit rate).
So I would keep the setting in sysfs as the maximum size adaptive readahead
can ever read and we can bump it up. We can possibly add another feedback
into the readahead code to tune actualy readahead size depending on device
speed but we'd have to research exactly what algorithm would work best.

        Honza

-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Jan Kara
On Thu 13-03-14 10:20:56, Ted Tso wrote:
> Previously, the no-op "mount -o mount /dev/xxx" operation when the
  ^^remount

> file system is already mounted read-write causes an implied,
> unconditional syncfs().  This seems pretty stupid, and it's certainly
> documented or guaraunteed to do this, nor is it particularly useful,
> except in the case where the file system was mounted rw and is getting
> remounted read-only.
> 
> However, it's possible that there might be some file systems that are
> actually depending on this behavior.  In most file systems, it's
> probably fine to only call sync_filesystem() when transitioning from
> read-write to read-only, and there are some file systems where this is
> not needed at all (for example, for a pseudo-filesystem or something
> like romfs).
  Hum, I'd avoid this excercise at least for filesystem where
sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
efs, freevxfs, romfs, squashfs. I think you can find a couple more which
clearly don't care about sync_filesystem() if you look a bit closer.

Honza
> 
> Signed-off-by: "Theodore Ts'o" 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Christoph Hellwig 
> Cc: Artem Bityutskiy 
> Cc: Adrian Hunter 
> Cc: Evgeniy Dushistov 
> Cc: Jan Kara 
> Cc: OGAWA Hirofumi 
> Cc: Anders Larsen 
> Cc: Phillip Lougher 
> Cc: Kees Cook 
> Cc: Mikulas Patocka 
> Cc: Petr Vandrovec 
> Cc: x...@oss.sgi.com
> Cc: linux-btrfs@vger.kernel.org
> Cc: linux-c...@vger.kernel.org
> Cc: samba-techni...@lists.samba.org
> Cc: codal...@coda.cs.cmu.edu
> Cc: linux-e...@vger.kernel.org
> Cc: linux-f2fs-de...@lists.sourceforge.net
> Cc: fuse-de...@lists.sourceforge.net
> Cc: cluster-de...@redhat.com
> Cc: linux-...@lists.infradead.org
> Cc: jfs-discuss...@lists.sourceforge.net
> Cc: linux-...@vger.kernel.org
> Cc: linux-ni...@vger.kernel.org
> Cc: linux-ntfs-...@lists.sourceforge.net
> Cc: ocfs2-de...@oss.oracle.com
> Cc: reiserfs-de...@vger.kernel.org
> ---
>  fs/adfs/super.c  | 1 +
>  fs/affs/super.c  | 1 +
>  fs/befs/linuxvfs.c   | 1 +
>  fs/btrfs/super.c | 1 +
>  fs/cifs/cifsfs.c | 1 +
>  fs/coda/inode.c  | 1 +
>  fs/cramfs/inode.c| 1 +
>  fs/debugfs/inode.c   | 1 +
>  fs/devpts/inode.c| 1 +
>  fs/efs/super.c   | 1 +
>  fs/ext2/super.c  | 1 +
>  fs/ext3/super.c  | 2 ++
>  fs/ext4/super.c  | 2 ++
>  fs/f2fs/super.c  | 2 ++
>  fs/fat/inode.c   | 2 ++
>  fs/freevxfs/vxfs_super.c | 1 +
>  fs/fuse/inode.c  | 1 +
>  fs/gfs2/super.c  | 2 ++
>  fs/hfs/super.c   | 1 +
>  fs/hfsplus/super.c   | 1 +
>  fs/hpfs/super.c  | 2 ++
>  fs/isofs/inode.c | 1 +
>  fs/jffs2/super.c | 1 +
>  fs/jfs/super.c   | 1 +
>  fs/minix/inode.c | 1 +
>  fs/ncpfs/inode.c | 1 +
>  fs/nfs/super.c   | 2 ++
>  fs/nilfs2/super.c| 1 +
>  fs/ntfs/super.c  | 2 ++
>  fs/ocfs2/super.c | 2 ++
>  fs/openpromfs/inode.c| 1 +
>  fs/proc/root.c   | 2 ++
>  fs/pstore/inode.c| 1 +
>  fs/qnx4/inode.c  | 1 +
>  fs/qnx6/inode.c  | 1 +
>  fs/reiserfs/super.c  | 1 +
>  fs/romfs/super.c | 1 +
>  fs/squashfs/super.c  | 1 +
>  fs/super.c   | 2 --
>  fs/sysv/inode.c  | 1 +
>  fs/ubifs/super.c | 1 +
>  fs/udf/super.c   | 1 +
>  fs/ufs/super.c   | 1 +
>  fs/xfs/xfs_super.c   | 1 +
>  44 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/adfs/super.c b/fs/adfs/super.c
> index 7b3003c..952aeb0 100644
> --- a/fs/adfs/super.c
> +++ b/fs/adfs/super.c
> @@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char 
> *options)
>  
>  static int adfs_remount(struct super_block *sb, int *flags, char *data)
>  {
> + sync_filesystem(sb);
>   *flags |= MS_NODIRATIME;
>   return parse_options(sb, data);
>  }
> diff --git a/fs/affs/super.c b/fs/affs/super.c
> index d098731..3074530 100644
> --- a/fs/affs/super.c
> +++ b/fs/affs/super.c
> @@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char 
> *data)
>  
>   pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
>  
> + sync_filesystem(sb);
>   *flags |= MS_NODIRATIME;
>  
>   memcpy(volume, sbi->s_volume, 32);
> diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
> i

Re: [PATCH 12/18] ocfs2: use generic posix ACL infrastructure

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:15, Christoph Hellwig wrote:
> This contains some major refactoring for the create path so that
> inodes are created with the right mode to start with instead of
> fixing it up later.
> 
> Signed-off-by: Christoph Hellwig 
...
> -int ocfs2_acl_chmod(struct inode *inode)
> -{
> - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> - struct posix_acl *acl;
> - int ret;
> -
> - if (S_ISLNK(inode->i_mode))
> - return -EOPNOTSUPP;
> -
> - if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
> - return 0;
> -
> - acl = ocfs2_get_acl(inode, ACL_TYPE_ACCESS);
> - if (IS_ERR(acl) || !acl)
> - return PTR_ERR(acl);
> - ret = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> - if (ret)
> - return ret;
> - ret = ocfs2_set_acl(NULL, inode, NULL, ACL_TYPE_ACCESS,
> - acl, NULL, NULL);
> - posix_acl_release(acl);
> - return ret;
> -}
...

> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6fff128..ac371ad 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1236,7 +1236,7 @@ bail:
>   dqput(transfer_to[qtype]);
>  
>   if (!status && attr->ia_valid & ATTR_MODE) {
> - status = ocfs2_acl_chmod(inode);
> + status = posix_acl_chmod(inode);
>   if (status < 0)
>   mlog_errno(status);
>   }
  Hum, this changes the cluster locking. Previously ocfs2_acl_get() used
from ocfs2_acl_chmod() grabbed cluster wide inode lock. Now getting of ACL
isn't protected by the inode lock. That being said the cluster locking
around setattr looks fishy anyway - if two processes on different
nodes are changing attributes of the same file, changing ACLs post fact
after dropping inode lock could cause interesting effects. Also I'm
wondering how inode_change_ok() can ever be safe without holding inode
lock... Until we grab that other node is free to change e.g. owner of the
inode thus leading even to security implications. But maybe I'm missing
something. Mark, Joel?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/18] reiserfs: use generic posix ACL infrastructure

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:16, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> ---
>  fs/reiserfs/acl.h   |4 +-
>  fs/reiserfs/file.c  |1 +
>  fs/reiserfs/namei.c |3 +
>  fs/reiserfs/xattr.c |5 +-
>  fs/reiserfs/xattr_acl.c |  175 
> ---
>  5 files changed, 36 insertions(+), 152 deletions(-)
> 
> diff --git a/fs/reiserfs/acl.h b/fs/reiserfs/acl.h
> index f096b80..4a211f5 100644
> --- a/fs/reiserfs/acl.h
> +++ b/fs/reiserfs/acl.h
> @@ -48,18 +48,18 @@ static inline int reiserfs_acl_count(size_t size)
>  
>  #ifdef CONFIG_REISERFS_FS_POSIX_ACL
>  struct posix_acl *reiserfs_get_acl(struct inode *inode, int type);
> +int reiserfs_set_acl(struct inode *inode, struct posix_acl *acl, int type);
>  int reiserfs_acl_chmod(struct inode *inode);
>  int reiserfs_inherit_default_acl(struct reiserfs_transaction_handle *th,
>struct inode *dir, struct dentry *dentry,
>struct inode *inode);
>  int reiserfs_cache_default_acl(struct inode *dir);
> -extern const struct xattr_handler reiserfs_posix_acl_default_handler;
> -extern const struct xattr_handler reiserfs_posix_acl_access_handler;
>  
>  #else
>  
>  #define reiserfs_cache_default_acl(inode) 0
>  #define reiserfs_get_acl NULL
> +#define reiserfs_set_acl NULL
>  
>  static inline int reiserfs_acl_chmod(struct inode *inode)
>  {
> diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
> index dcaafcf..ed58d84 100644
> --- a/fs/reiserfs/file.c
> +++ b/fs/reiserfs/file.c
> @@ -260,4 +260,5 @@ const struct inode_operations 
> reiserfs_file_inode_operations = {
>   .removexattr = reiserfs_removexattr,
>   .permission = reiserfs_permission,
>   .get_acl = reiserfs_get_acl,
> + .set_acl = reiserfs_set_acl,
>  };
> diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
> index dc5236f..8ba707e 100644
> --- a/fs/reiserfs/namei.c
> +++ b/fs/reiserfs/namei.c
> @@ -1522,6 +1522,7 @@ const struct inode_operations 
> reiserfs_dir_inode_operations = {
>   .removexattr = reiserfs_removexattr,
>   .permission = reiserfs_permission,
>   .get_acl = reiserfs_get_acl,
> + .set_acl = reiserfs_set_acl,
>  };
>  
>  /*
> @@ -1539,6 +1540,7 @@ const struct inode_operations 
> reiserfs_symlink_inode_operations = {
>   .removexattr = reiserfs_removexattr,
>   .permission = reiserfs_permission,
>   .get_acl = reiserfs_get_acl,
> + .set_acl = reiserfs_set_acl,
>  
>  };
>  
> @@ -1553,4 +1555,5 @@ const struct inode_operations 
> reiserfs_special_inode_operations = {
>   .removexattr = reiserfs_removexattr,
>   .permission = reiserfs_permission,
>   .get_acl = reiserfs_get_acl,
> + .set_acl = reiserfs_set_acl,
>  };
> diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
> index 8a9e2dc..5cdfbd6 100644
> --- a/fs/reiserfs/xattr.c
> +++ b/fs/reiserfs/xattr.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define PRIVROOT_NAME ".reiserfs_priv"
>  #define XAROOT_NAME   "xattrs"
> @@ -904,8 +905,8 @@ static const struct xattr_handler 
> *reiserfs_xattr_handlers[] = {
>   &reiserfs_xattr_security_handler,
>  #endif
>  #ifdef CONFIG_REISERFS_FS_POSIX_ACL
> - &reiserfs_posix_acl_access_handler,
> - &reiserfs_posix_acl_default_handler,
> + &posix_acl_access_xattr_handler,
> + &posix_acl_default_xattr_handler,
>  #endif
>   NULL
>  };
> diff --git a/fs/reiserfs/xattr_acl.c b/fs/reiserfs/xattr_acl.c
> index d95c959..ff04988 100644
> --- a/fs/reiserfs/xattr_acl.c
> +++ b/fs/reiserfs/xattr_acl.c
> @@ -11,35 +11,19 @@
>  #include "acl.h"
>  #include 
>  
> -static int reiserfs_set_acl(struct reiserfs_transaction_handle *th,
> +static int __reiserfs_set_acl(struct reiserfs_transaction_handle *th,
>   struct inode *inode, int type,
>   struct posix_acl *acl);
>  
> -static int
> -reiserfs_posix_acl_set(struct dentry *dentry, const char *name, const void 
> *value,
> - size_t size, int flags, int type)
> +
> +int
> +reiserfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
> - struct inode *inode = dentry->d_inode;
> - struct posix_acl *acl;
>   int error, error2;
>   struct reiserfs_transaction_handle th;
>   size_t jcreate_blocks;
> - if (!reiserfs_posixacl(inode->i_sb))
> - return -EO

Re: [PATCH 08/18] ext2/3/4: use generic posix ACL infrastructure

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:11, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext2/acl.c   |  176 -
>  fs/ext2/acl.h   |8 +--
>  fs/ext2/file.c  |1 +
>  fs/ext2/inode.c |2 +-
>  fs/ext2/namei.c |2 +
>  fs/ext2/xattr.c |8 +--
>  fs/ext2/xattr.h |2 -
>  fs/ext3/acl.c   |  213 
> ---
>  fs/ext3/acl.h   |9 +--
>  fs/ext3/file.c  |1 +
>  fs/ext3/inode.c |2 +-
>  fs/ext3/namei.c |2 +
>  fs/ext3/xattr.c |8 +--
>  fs/ext3/xattr.h |2 -
>  fs/ext4/acl.c   |  213 
> ---
>  fs/ext4/acl.h   |9 +--
>  fs/ext4/file.c  |1 +
>  fs/ext4/inode.c |2 +-
>  fs/ext4/namei.c |2 +
>  fs/ext4/xattr.c |8 +--
>  fs/ext4/xattr.h |2 -
>  21 files changed, 100 insertions(+), 573 deletions(-)
> 
> diff --git a/fs/ext2/acl.c b/fs/ext2/acl.c
> index 6e842a7..f04a295 100644
> --- a/fs/ext2/acl.c
> +++ b/fs/ext2/acl.c
> @@ -189,8 +189,8 @@ ext2_get_acl(struct inode *inode, int type)
>  /*
>   * inode->i_mutex: down
>   */
> -static int
> -ext2_set_acl(struct inode *inode, int type, struct posix_acl *acl)
> +int
> +ext2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
>   int name_index;
>   void *value = NULL;
> @@ -250,169 +250,21 @@ ext2_set_acl(struct inode *inode, int type, struct 
> posix_acl *acl)
>  int
>  ext2_init_acl(struct inode *inode, struct inode *dir)
>  {
> - struct posix_acl *acl = NULL;
> - int error = 0;
> -
> - if (!S_ISLNK(inode->i_mode)) {
> - if (test_opt(dir->i_sb, POSIX_ACL)) {
> - acl = ext2_get_acl(dir, ACL_TYPE_DEFAULT);
> - if (IS_ERR(acl))
> - return PTR_ERR(acl);
> - }
> - if (!acl)
> - inode->i_mode &= ~current_umask();
> - }
> - if (test_opt(inode->i_sb, POSIX_ACL) && acl) {
> - if (S_ISDIR(inode->i_mode)) {
> - error = ext2_set_acl(inode, ACL_TYPE_DEFAULT, acl);
> - if (error)
> - goto cleanup;
> - }
> - error = __posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
> - if (error < 0)
> - return error;
> - if (error > 0) {
> - /* This is an extended ACL */
> - error = ext2_set_acl(inode, ACL_TYPE_ACCESS, acl);
> - }
> - }
> -cleanup:
> -   posix_acl_release(acl);
> -   return error;
> -}
> -
> -/*
> - * Does chmod for an inode that may have an Access Control List. The
> - * inode->i_mode field must be updated to the desired value by the caller
> - * before calling this function.
> - * Returns 0 on success, or a negative error number.
> - *
> - * We change the ACL rather than storing some ACL entries in the file
> - * mode permission bits (which would be more efficient), because that
> - * would break once additional permissions (like  ACL_APPEND, ACL_DELETE
> - * for directories) are added. There are no more bits available in the
> - * file mode.
> - *
> - * inode->i_mutex: down
> - */
> -int
> -ext2_acl_chmod(struct inode *inode)
> -{
> - struct posix_acl *acl;
> -int error;
> + struct posix_acl *default_acl, *acl;
> + int error;
>  
> - if (!test_opt(inode->i_sb, POSIX_ACL))
> - return 0;
> - if (S_ISLNK(inode->i_mode))
> - return -EOPNOTSUPP;
> - acl = ext2_get_acl(inode, ACL_TYPE_ACCESS);
> - if (IS_ERR(acl) || !acl)
> - return PTR_ERR(acl);
> - error = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + error = posix_acl_create(dir, &inode->i_mode, &default_acl, &acl);
>   if (error)
>   return error;
> - error = ext2_set_acl(inode, ACL_TYPE_ACCESS, acl);
> - posix_acl_release(acl);
> - return error;
> -}
> -
> -/*
> - * Extended attribut handlers
> - */
> -static size_t
> -ext2_xattr_list_acl_access(struct dentry *dentry, char *list, size_t 
> list_size,
> -const char *name, size_t name_len, int type)
> -{
> - const size_t size = sizeof(POSIX_ACL_XATTR_ACCESS);
> -
> - if (!test_opt(dentry->d_sb, POSIX_ACL))
> - return 0;
> - if (list && size <= l

Re: [PATCH 06/18] fs: make posix_acl_create more useful

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:09, Christoph Hellwig wrote:
> Rename the current posix_acl_created to __posix_acl_create and add
> a fully featured helper to set up the ACLs on file creation that
> uses get_acl().
  Looks good, you can add:
Reviewed-by: Jan Kara 

Honza

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/9p/acl.c   |2 +-
>  fs/btrfs/acl.c|2 +-
>  fs/ext2/acl.c |2 +-
>  fs/ext3/acl.c |2 +-
>  fs/ext4/acl.c |2 +-
>  fs/f2fs/acl.c |2 +-
>  fs/generic_acl.c  |2 +-
>  fs/gfs2/acl.c |2 +-
>  fs/hfsplus/posix_acl.c|2 +-
>  fs/jffs2/acl.c|2 +-
>  fs/jfs/acl.c  |2 +-
>  fs/nfs/nfs3acl.c  |2 +-
>  fs/ocfs2/acl.c|2 +-
>  fs/posix_acl.c|   53 
> +++--
>  fs/reiserfs/xattr_acl.c   |2 +-
>  fs/xfs/xfs_acl.c  |4 ++--
>  include/linux/posix_acl.h |6 -
>  17 files changed, 72 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/9p/acl.c b/fs/9p/acl.c
> index f5ce5c5..8482f2d 100644
> --- a/fs/9p/acl.c
> +++ b/fs/9p/acl.c
> @@ -200,7 +200,7 @@ int v9fs_acl_mode(struct inode *dir, umode_t *modep,
>   if (acl) {
>   if (S_ISDIR(mode))
>   *dpacl = posix_acl_dup(acl);
> - retval = posix_acl_create(&acl, GFP_NOFS, &mode);
> + retval = __posix_acl_create(&acl, GFP_NOFS, &mode);
>   if (retval < 0)
>   return retval;
>   if (retval > 0)
> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
> index 1af04ff..b56519d 100644
> --- a/fs/btrfs/acl.c
> +++ b/fs/btrfs/acl.c
> @@ -222,7 +222,7 @@ int btrfs_init_acl(struct btrfs_trans_handle *trans,
>   if (ret)
>   goto failed;
>   }
> - ret = posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
> + ret = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
>   if (ret < 0)
>   return ret;
>  
> diff --git a/fs/ext2/acl.c b/fs/ext2/acl.c
> index 7006ced..6e842a7 100644
> --- a/fs/ext2/acl.c
> +++ b/fs/ext2/acl.c
> @@ -268,7 +268,7 @@ ext2_init_acl(struct inode *inode, struct inode *dir)
>   if (error)
>   goto cleanup;
>   }
> - error = posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
> + error = __posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
>   if (error < 0)
>   return error;
>   if (error > 0) {
> diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
> index 6691a6c..4f3d8fa 100644
> --- a/fs/ext3/acl.c
> +++ b/fs/ext3/acl.c
> @@ -271,7 +271,7 @@ ext3_init_acl(handle_t *handle, struct inode *inode, 
> struct inode *dir)
>   if (error)
>   goto cleanup;
>   }
> - error = posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
> + error = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
>   if (error < 0)
>   return error;
>  
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 2eebe02..f827f3b 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -276,7 +276,7 @@ ext4_init_acl(handle_t *handle, struct inode *inode, 
> struct inode *dir)
>   if (error)
>   goto cleanup;
>   }
> - error = posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
> + error = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
>   if (error < 0)
>   return error;
>  
> diff --git a/fs/f2fs/acl.c b/fs/f2fs/acl.c
> index 14c4df0..45e8430 100644
> --- a/fs/f2fs/acl.c
> +++ b/fs/f2fs/acl.c
> @@ -285,7 +285,7 @@ int f2fs_init_acl(struct inode *inode, struct inode *dir, 
> struct page *ipage)
>   if (error)
>   goto cleanup;
>   }
> - error = posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
> + error = __posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
>   if (error < 0)
>   return error;
>   if (error > 0)
> diff --git a/fs/generic_acl.c b/fs/generic_acl.c
> index 46a5076..4357f39 100644
> --- a/fs/generic_acl.c
> +++ b/fs/generic_acl.c
> @@ -128,7 +128,7 @@ generic_acl_ini

Re: [PATCH 05/18] fs: make posix_acl_chmod more useful

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:08, Christoph Hellwig wrote:
> Rename the current posix_acl_chmod to __posix_acl_chmod and add
> a fully featured ACL chmod helper that uses the ->set_acl inode
> operation.
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/9p/acl.c   |2 +-
>  fs/btrfs/acl.c|2 +-
>  fs/ext2/acl.c |2 +-
>  fs/ext3/acl.c |2 +-
>  fs/ext4/acl.c |2 +-
>  fs/f2fs/acl.c |2 +-
>  fs/generic_acl.c  |2 +-
>  fs/gfs2/acl.c |2 +-
>  fs/hfsplus/posix_acl.c|2 +-
>  fs/jffs2/acl.c|2 +-
>  fs/jfs/acl.c  |2 +-
>  fs/ocfs2/acl.c|2 +-
>  fs/posix_acl.c|   30 +++---
>  fs/reiserfs/xattr_acl.c   |2 +-
>  fs/xfs/xfs_acl.c  |2 +-
>  include/linux/posix_acl.h |   17 +
>  16 files changed, 54 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/9p/acl.c b/fs/9p/acl.c
> index 7af425f..f5ce5c5 100644
> --- a/fs/9p/acl.c
> +++ b/fs/9p/acl.c
> @@ -156,7 +156,7 @@ int v9fs_acl_chmod(struct inode *inode, struct p9_fid 
> *fid)
>   return -EOPNOTSUPP;
>   acl = v9fs_get_cached_acl(inode, ACL_TYPE_ACCESS);
>   if (acl) {
> - retval = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + retval = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (retval)
>   return retval;
>   set_cached_acl(inode, ACL_TYPE_ACCESS, acl);
> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
> index 0890c83..1af04ff 100644
> --- a/fs/btrfs/acl.c
> +++ b/fs/btrfs/acl.c
> @@ -256,7 +256,7 @@ int btrfs_acl_chmod(struct inode *inode)
>   if (IS_ERR_OR_NULL(acl))
>   return PTR_ERR(acl);
>  
> - ret = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + ret = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (ret)
>   return ret;
>   ret = btrfs_set_acl(NULL, inode, acl, ACL_TYPE_ACCESS);
> diff --git a/fs/ext2/acl.c b/fs/ext2/acl.c
> index 110b6b3..7006ced 100644
> --- a/fs/ext2/acl.c
> +++ b/fs/ext2/acl.c
> @@ -308,7 +308,7 @@ ext2_acl_chmod(struct inode *inode)
>   acl = ext2_get_acl(inode, ACL_TYPE_ACCESS);
>   if (IS_ERR(acl) || !acl)
>   return PTR_ERR(acl);
> - error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + error = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (error)
>   return error;
>   error = ext2_set_acl(inode, ACL_TYPE_ACCESS, acl);
> diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
> index dbb5ad5..6691a6c 100644
> --- a/fs/ext3/acl.c
> +++ b/fs/ext3/acl.c
> @@ -314,7 +314,7 @@ ext3_acl_chmod(struct inode *inode)
>   acl = ext3_get_acl(inode, ACL_TYPE_ACCESS);
>   if (IS_ERR(acl) || !acl)
>   return PTR_ERR(acl);
> - error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + error = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (error)
>   return error;
>  retry:
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 39a54a0..2eebe02 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -320,7 +320,7 @@ ext4_acl_chmod(struct inode *inode)
>   acl = ext4_get_acl(inode, ACL_TYPE_ACCESS);
>   if (IS_ERR(acl) || !acl)
>   return PTR_ERR(acl);
> - error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + error = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (error)
>   return error;
>  retry:
> diff --git a/fs/f2fs/acl.c b/fs/f2fs/acl.c
> index d0fc287..14c4df0 100644
> --- a/fs/f2fs/acl.c
> +++ b/fs/f2fs/acl.c
> @@ -311,7 +311,7 @@ int f2fs_acl_chmod(struct inode *inode)
>   if (IS_ERR(acl) || !acl)
>   return PTR_ERR(acl);
>  
> - error = posix_acl_chmod(&acl, GFP_KERNEL, mode);
> + error = __posix_acl_chmod(&acl, GFP_KERNEL, mode);
>   if (error)
>   return error;
>  
> diff --git a/fs/generic_acl.c b/fs/generic_acl.c
> index b3f3676..46a5076 100644
> --- a/fs/generic_acl.c
> +++ b/fs/generic_acl.c
> @@ -158,7 +158,7 @@ generic_acl_chmod(struct inode *inode)
>   return -EOPNOTSUPP;
>   acl = get_cached_acl(inode, ACL_TYPE_ACCESS);
>   if (acl) {
> - error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
> + error = __posix_acl_chmod(&acl, GFP

Re: [PATCH 04/18] fs: add generic xattr_acl handlers

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:07, Christoph Hellwig wrote:
> With the ->set_acl inode operation we can implement the Posix ACL
> xattr handlers in generic code instead of duplicating them all
> over the tree.
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xattr_acl.c  |   95 
> +++
>  include/linux/posix_acl_xattr.h |3 ++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/fs/xattr_acl.c b/fs/xattr_acl.c
> index 9fbea87..932ec76 100644
> --- a/fs/xattr_acl.c
> +++ b/fs/xattr_acl.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Fix up the uids and gids in posix acl extended attributes in place.
> @@ -178,3 +179,97 @@ posix_acl_to_xattr(struct user_namespace *user_ns, const 
> struct posix_acl *acl,
>   return real_size;
>  }
>  EXPORT_SYMBOL (posix_acl_to_xattr);
> +
> +static int
> +posix_acl_xattr_get(struct dentry *dentry, const char *name,
> + void *value, size_t size, int type)
> +{
> + struct posix_acl *acl;
> + int error;
> +
> + if (!IS_POSIXACL(dentry->d_inode))
> + return -EOPNOTSUPP;
> +
> + acl = get_acl(dentry->d_inode, type);
> + if (IS_ERR(acl))
> + return PTR_ERR(acl);
> + if (acl == NULL)
> + return -ENODATA;
> +
> + error = posix_acl_to_xattr(&init_user_ns, acl, value, size);
> + posix_acl_release(acl);
> +
> + return error;
> +}
> +
> +static int
> +posix_acl_xattr_set(struct dentry *dentry, const char *name,
> + const void *value, size_t size, int flags, int type)
> +{
> + struct inode *inode = dentry->d_inode;
> + struct posix_acl *acl = NULL;
> + int ret;
> +
> + if (type == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode))
> + return value ? -EACCES : 0;
> + if (!inode_owner_or_capable(inode))
> + return -EPERM;
> + if (!IS_POSIXACL(inode))
> + return -EOPNOTSUPP;
> +
> + if (value) {
> + acl = posix_acl_from_xattr(&init_user_ns, value, size);
> + if (IS_ERR(acl))
> + return PTR_ERR(acl);
> +
> + if (acl) {
> + ret = posix_acl_valid(acl);
> + if (ret)
> + goto out;
> + }
> + }
> +
> + ret = inode->i_op->set_acl(inode, acl, type);
> +out:
> + posix_acl_release(acl);
> + return ret;
> +}
> +
> +static size_t
> +posix_acl_xattr_list(struct dentry *dentry, char *list, size_t list_size,
> + const char *name, size_t name_len, int type)
> +{
> + const char *xname;
> + size_t size;
> +
> + if (!IS_POSIXACL(dentry->d_inode))
> + return -EOPNOTSUPP;
> +
> + if (type == ACL_TYPE_ACCESS)
> + xname = POSIX_ACL_XATTR_ACCESS;
> + else
> + xname = POSIX_ACL_XATTR_DEFAULT;
> +
> + size = strlen(xname) + 1;
> + if (list && size <= list_size)
> + memcpy(list, xname, size);
> + return size;
> +}
> +
> +const struct xattr_handler posix_acl_access_xattr_handler = {
> + .prefix = POSIX_ACL_XATTR_ACCESS,
> + .flags = ACL_TYPE_ACCESS,
> + .list = posix_acl_xattr_list,
> + .get = posix_acl_xattr_get,
> + .set = posix_acl_xattr_set,
> +};
> +EXPORT_SYMBOL_GPL(posix_acl_access_xattr_handler);
> +
> +const struct xattr_handler posix_acl_default_xattr_handler = {
> + .prefix = POSIX_ACL_XATTR_DEFAULT,
> + .flags = ACL_TYPE_DEFAULT,
> + .list = posix_acl_xattr_list,
> + .get = posix_acl_xattr_get,
> + .set = posix_acl_xattr_set,
> +};
> +EXPORT_SYMBOL_GPL(posix_acl_default_xattr_handler);
> diff --git a/include/linux/posix_acl_xattr.h b/include/linux/posix_acl_xattr.h
> index ad93ad0..6f14ee2 100644
> --- a/include/linux/posix_acl_xattr.h
> +++ b/include/linux/posix_acl_xattr.h
> @@ -69,4 +69,7 @@ struct posix_acl *posix_acl_from_xattr(struct 
> user_namespace *user_ns,
>  int posix_acl_to_xattr(struct user_namespace *user_ns,
>  const struct posix_acl *acl, void *buffer, size_t size);
>  
> +extern const struct xattr_handler posix_acl_access_xattr_handler;
> +extern const struct xattr_handler posix_acl_default_xattr_handler;
> +
>  #endif   /* _POSIX_ACL_XATTR_H */
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/18] fs: add a set_acl inode operation

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:06, Christoph Hellwig wrote:
> This will allow moving all the Posix ACL handling into the VFS and clean
> up tons of cruft in the filesystems.
  Looks good. Btw I'd merge this with the following patch... Anyway:
Reviewed-by: Jan Kara 

Honza
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/fs.h |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 121f11f..09f553c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1580,6 +1580,7 @@ struct inode_operations {
>  struct file *, unsigned open_flag,
>  umode_t create_mode, int *opened);
>   int (*tmpfile) (struct inode *, struct dentry *, umode_t);
> + int (*set_acl)(struct inode *, struct posix_acl *, int);
>  } cacheline_aligned;
>  
>  ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/18] reiserfs: prefix ACL symbols with reiserfs_

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:04, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> ---
>  fs/reiserfs/xattr_acl.c |   20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/reiserfs/xattr_acl.c b/fs/reiserfs/xattr_acl.c
> index 06c04f7..6f721ea 100644
> --- a/fs/reiserfs/xattr_acl.c
> +++ b/fs/reiserfs/xattr_acl.c
> @@ -16,7 +16,7 @@ static int reiserfs_set_acl(struct 
> reiserfs_transaction_handle *th,
>   struct posix_acl *acl);
>  
>  static int
> -posix_acl_set(struct dentry *dentry, const char *name, const void *value,
> +reiserfs_posix_acl_set(struct dentry *dentry, const char *name, const void 
> *value,
>   size_t size, int flags, int type)
>  {
>   struct inode *inode = dentry->d_inode;
> @@ -65,7 +65,7 @@ posix_acl_set(struct dentry *dentry, const char *name, 
> const void *value,
>  }
>  
>  static int
> -posix_acl_get(struct dentry *dentry, const char *name, void *buffer,
> +reiserfs_posix_acl_get(struct dentry *dentry, const char *name, void *buffer,
>   size_t size, int type)
>  {
>   struct posix_acl *acl;
> @@ -88,7 +88,7 @@ posix_acl_get(struct dentry *dentry, const char *name, void 
> *buffer,
>  /*
>   * Convert from filesystem to in-memory representation.
>   */
> -static struct posix_acl *posix_acl_from_disk(const void *value, size_t size)
> +static struct posix_acl *reiserfs_posix_acl_from_disk(const void *value, 
> size_t size)
>  {
>   const char *end = (char *)value + size;
>   int n, count;
> @@ -158,7 +158,7 @@ static struct posix_acl *posix_acl_from_disk(const void 
> *value, size_t size)
>  /*
>   * Convert from in-memory to filesystem representation.
>   */
> -static void *posix_acl_to_disk(const struct posix_acl *acl, size_t * size)
> +static void *reiserfs_posix_acl_to_disk(const struct posix_acl *acl, size_t 
> * size)
>  {
>   reiserfs_acl_header *ext_acl;
>   char *e;
> @@ -257,7 +257,7 @@ struct posix_acl *reiserfs_get_acl(struct inode *inode, 
> int type)
>   } else if (retval < 0) {
>   acl = ERR_PTR(retval);
>   } else {
> - acl = posix_acl_from_disk(value, retval);
> + acl = reiserfs_posix_acl_from_disk(value, retval);
>   }
>   if (!IS_ERR(acl))
>   set_cached_acl(inode, type, acl);
> @@ -307,7 +307,7 @@ reiserfs_set_acl(struct reiserfs_transaction_handle *th, 
> struct inode *inode,
>   }
>  
>   if (acl) {
> - value = posix_acl_to_disk(acl, &size);
> + value = reiserfs_posix_acl_to_disk(acl, &size);
>   if (IS_ERR(value))
>   return (int)PTR_ERR(value);
>   }
> @@ -499,8 +499,8 @@ static size_t posix_acl_access_list(struct dentry 
> *dentry, char *list,
>  const struct xattr_handler reiserfs_posix_acl_access_handler = {
>   .prefix = POSIX_ACL_XATTR_ACCESS,
>   .flags = ACL_TYPE_ACCESS,
> - .get = posix_acl_get,
> - .set = posix_acl_set,
> + .get = reiserfs_posix_acl_get,
> + .set = reiserfs_posix_acl_set,
>   .list = posix_acl_access_list,
>  };
>  
> @@ -519,7 +519,7 @@ static size_t posix_acl_default_list(struct dentry 
> *dentry, char *list,
>  const struct xattr_handler reiserfs_posix_acl_default_handler = {
>   .prefix = POSIX_ACL_XATTR_DEFAULT,
>   .flags = ACL_TYPE_DEFAULT,
> - .get = posix_acl_get,
> - .set = posix_acl_set,
> + .get = reiserfs_posix_acl_get,
> + .set = reiserfs_posix_acl_set,
>   .list = posix_acl_default_list,
>  };
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/18] fs: add get_acl helper

2013-12-02 Thread Jan Kara
On Sun 01-12-13 03:59:05, Christoph Hellwig wrote:
> Factor out the code to get an ACL either from the inode or disk from
> check_acl, so that it can be used elsewhere later on.
> 
> Signed-off-by: Christoph Hellwig 
  Looks good. You can add:
Reviewed-by: Jan Kara 

Honza

> ---
>  fs/namei.c|   24 +++-
>  fs/posix_acl.c|   23 +++
>  include/linux/posix_acl.h |2 ++
>  3 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index c53d3a9..8acd1e8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -235,27 +235,9 @@ static int check_acl(struct inode *inode, int mask)
>   return posix_acl_permission(inode, acl, mask & ~MAY_NOT_BLOCK);
>   }
>  
> - acl = get_cached_acl(inode, ACL_TYPE_ACCESS);
> -
> - /*
> -  * A filesystem can force a ACL callback by just never filling the
> -  * ACL cache. But normally you'd fill the cache either at inode
> -  * instantiation time, or on the first ->get_acl call.
> -  *
> -  * If the filesystem doesn't have a get_acl() function at all, we'll
> -  * just create the negative cache entry.
> -  */
> - if (acl == ACL_NOT_CACHED) {
> - if (inode->i_op->get_acl) {
> - acl = inode->i_op->get_acl(inode, ACL_TYPE_ACCESS);
> - if (IS_ERR(acl))
> - return PTR_ERR(acl);
> - } else {
> - set_cached_acl(inode, ACL_TYPE_ACCESS, NULL);
> - return -EAGAIN;
> - }
> - }
> -
> + acl = get_acl(inode, ACL_TYPE_ACCESS);
> + if (IS_ERR(acl))
> + return PTR_ERR(acl);
>   if (acl) {
>   int error = posix_acl_permission(inode, acl, mask);
>   posix_acl_release(acl);
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 8bd2135..9dd03e0 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -418,3 +418,26 @@ posix_acl_chmod(struct posix_acl **acl, gfp_t gfp, 
> umode_t mode)
>   return err;
>  }
>  EXPORT_SYMBOL(posix_acl_chmod);
> +
> +struct posix_acl *get_acl(struct inode *inode, int type)
> +{
> + struct posix_acl *acl;
> +
> + acl = get_cached_acl(inode, type);
> + if (acl != ACL_NOT_CACHED)
> + return acl;
> +
> + /*
> +  * A filesystem can force a ACL callback by just never filling the
> +  * ACL cache. But normally you'd fill the cache either at inode
> +  * instantiation time, or on the first ->get_acl call.
> +  *
> +  * If the filesystem doesn't have a get_acl() function at all, we'll
> +  * just create the negative cache entry.
> +  */
> +if (!inode->i_op->get_acl) {
> + set_cached_acl(inode, type, NULL);
> + return ERR_PTR(-EAGAIN);
> + }
> + return inode->i_op->get_acl(inode, type);
> +}
> diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
> index 7931efe..a8d9918 100644
> --- a/include/linux/posix_acl.h
> +++ b/include/linux/posix_acl.h
> @@ -175,4 +175,6 @@ static inline void cache_no_acl(struct inode *inode)
>  #endif
>  }
>  
> +struct posix_acl *get_acl(struct inode *inode, int type);
> +
>  #endif  /* __LINUX_POSIX_ACL_H */
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] block: Convert various code to bio_for_each_segment()

2013-11-07 Thread Jan Kara
  page_cache_release(bvec->bv_page);
> + }
>   bio_put(bio);
>   if (atomic_dec_and_test(&super->s_pending_writes))
>   wake_up(&wq);
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 92b125f..4979ffa 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -43,16 +43,14 @@
>   */
>  static void mpage_end_io(struct bio *bio, int err)
>  {
> - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> - struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> + struct bio_vec *bv;
> + int i;
>  
> - do {
> - struct page *page = bvec->bv_page;
> + bio_for_each_segment_all(bv, bio, i) {
> + struct page *page = bv->bv_page;
>  
> - if (--bvec >= bio->bi_io_vec)
> - prefetchw(&bvec->bv_page->flags);
>   if (bio_data_dir(bio) == READ) {
> - if (uptodate) {
> + if (!err) {
>   SetPageUptodate(page);
>   } else {
>   ClearPageUptodate(page);
> @@ -60,14 +58,15 @@ static void mpage_end_io(struct bio *bio, int err)
>   }
>   unlock_page(page);
>   } else { /* bio_data_dir(bio) == WRITE */
> - if (!uptodate) {
> + if (err) {
>   SetPageError(page);
>   if (page->mapping)
>   set_bit(AS_EIO, &page->mapping->flags);
>   }
>   end_page_writeback(page);
>   }
> - } while (bvec >= bio->bi_io_vec);
> + }
> +
>   bio_put(bio);
>  }
>  
> diff --git a/fs/nfs/blocklayout/blocklayout.c 
> b/fs/nfs/blocklayout/blocklayout.c
> index af73896..56ff823 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -202,18 +202,14 @@ static struct bio *bl_add_page_to_bio(struct bio *bio, 
> int npg, int rw,
>  static void bl_end_io_read(struct bio *bio, int err)
>  {
>   struct parallel_io *par = bio->bi_private;
> - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> - struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> + struct bio_vec *bvec;
> + int i;
>  
> - do {
> - struct page *page = bvec->bv_page;
> + if (!err)
> + bio_for_each_segment_all(bvec, bio, i)
> + SetPageUptodate(bvec->bv_page);
>  
> - if (--bvec >= bio->bi_io_vec)
> - prefetchw(&bvec->bv_page->flags);
> - if (uptodate)
> - SetPageUptodate(page);
> - } while (bvec >= bio->bi_io_vec);
> - if (!uptodate) {
> + if (err) {
>   struct nfs_read_data *rdata = par->data;
>   struct nfs_pgio_header *header = rdata->header;
>  
> @@ -384,20 +380,16 @@ static void mark_extents_written(struct 
> pnfs_block_layout *bl,
>  static void bl_end_io_write_zero(struct bio *bio, int err)
>  {
>   struct parallel_io *par = bio->bi_private;
> - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> - struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> -
> - do {
> - struct page *page = bvec->bv_page;
> + struct bio_vec *bvec;
> + int i;
>  
> - if (--bvec >= bio->bi_io_vec)
> - prefetchw(&bvec->bv_page->flags);
> + bio_for_each_segment_all(bvec, bio, i) {
>   /* This is the zeroing page we added */
> - end_page_writeback(page);
> - page_cache_release(page);
> - } while (bvec >= bio->bi_io_vec);
> + end_page_writeback(bvec->bv_page);
> + page_cache_release(bvec->bv_page);
> + }
>  
> - if (unlikely(!uptodate)) {
> + if (unlikely(err)) {
>   struct nfs_write_data *data = par->data;
>   struct nfs_pgio_header *header = data->header;
>  
> -- 
> 1.8.4.rc3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: convert: access name_len and file_type the old way

2013-03-18 Thread Jan Kara
On Mon 18-03-13 19:39:31, David Sterba wrote:
> We can't use ext2_dir_entry_2 typecast on big endian machines directly.
> The bytes do not get converted during extX block read due to missing
> flag EXT2_DIRBLOCK_V2_STRUCT passed down to ext2fs_read_dir_block4 from
> ext2fs_process_dir_block. Fixing on the ext2 side needs updating callers
> and (maybe) the library interfaces. We'll fix it on the convert side for
> now.
> 
> CC: Jan Kara 
> Signed-off-by: David Sterba 
  Yes, now the patch looks OK to me.

Honza

> ---
> 
> v2:
> - use v1 ext2_dir_entry everywhere
> - fix remaining incorrect uses of dirent->name_len
> 
>  convert.c | 18 ++
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/convert.c b/convert.c
> index f462597..399856f 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -252,7 +252,7 @@ static u8 filetype_conversion_table[EXT2_FT_MAX] = {
>  };
>  
>  static int dir_iterate_proc(ext2_ino_t dir, int entry,
> - struct ext2_dir_entry *old,
> + struct ext2_dir_entry *dirent,
>   int offset, int blocksize,
>   char *buf,void *priv_data)
>  {
> @@ -262,12 +262,14 @@ static int dir_iterate_proc(ext2_ino_t dir, int entry,
>  u64 inode_size;
>   char dotdot[] = "..";
>   struct btrfs_key location;
> - struct ext2_dir_entry_2 *dirent = (struct ext2_dir_entry_2 *)old;
>   struct dir_iterate_data *idata = (struct dir_iterate_data *)priv_data;
> + int name_len;
> +
> + name_len = dirent->name_len & 0xFF;
>  
>   objectid = dirent->inode + INO_OFFSET;
> - if (!strncmp(dirent->name, dotdot, dirent->name_len)) {
> - if (dirent->name_len == 2) {
> + if (!strncmp(dirent->name, dotdot, name_len)) {
> + if (name_len == 2) {
>   BUG_ON(idata->parent != 0);
>   idata->parent = objectid;
>   }
> @@ -280,24 +282,24 @@ static int dir_iterate_proc(ext2_ino_t dir, int entry,
>   location.offset = 0;
>   btrfs_set_key_type(&location, BTRFS_INODE_ITEM_KEY);
>  
> - file_type = dirent->file_type;
> + file_type = dirent->name_len >> 8;
>   BUG_ON(file_type > EXT2_FT_SYMLINK);
>   ret = btrfs_insert_dir_item(idata->trans, idata->root,
> - dirent->name, dirent->name_len,
> + dirent->name, name_len,
>   idata->objectid, &location,
>   filetype_conversion_table[file_type],
>   idata->index_cnt);
>   if (ret)
>   goto fail;
>   ret = btrfs_insert_inode_ref(idata->trans, idata->root,
> -  dirent->name, dirent->name_len,
> +  dirent->name, name_len,
>objectid, idata->objectid,
>idata->index_cnt);
>   if (ret)
>   goto fail;
>   idata->index_cnt++;
>   inode_size = btrfs_stack_inode_size(idata->inode) +
> -  dirent->name_len * 2;
> +  name_len * 2;
>   btrfs_set_stack_inode_size(idata->inode, inode_size);
>   return 0;
>  fail:
> -- 
> 1.8.1.4
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: convert: access name_len and file_type the old way

2013-03-18 Thread Jan Kara
On Fri 15-03-13 17:10:45, David Sterba wrote:
> We can't use ext2_dir_entry_2 typecast on big endian machines directly.
> The bytes do not get converted during extX block read due to missing
> flag EXT2_DIRBLOCK_V2_STRUCT passed down to ext2fs_read_dir_block4 from
> ext2fs_process_dir_block. Fixing on the ext2 side needs updating callers
> and (maybe) the library interfaces. We'll fix it on the convert side for
> now.
> 
> CC: Jan Kara 
> Signed-off-by: David Sterba 
> ---
>  convert.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/convert.c b/convert.c
> index 400488e..7766c09 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -264,10 +264,13 @@ static int dir_iterate_proc(ext2_ino_t dir, int entry,
>   struct btrfs_key location;
>   struct ext2_dir_entry_2 *dirent = (struct ext2_dir_entry_2 *)old;
  You may as well remove this and use 'old' everywhere. But technically the
patch is OK.

Honza

>   struct dir_iterate_data *idata = (struct dir_iterate_data *)priv_data;
> + int name_len;
> +
> + name_len = old->name_len & 0xFF;
>  
>   objectid = dirent->inode + INO_OFFSET;
> - if (!strncmp(dirent->name, dotdot, dirent->name_len)) {
> - if (dirent->name_len == 2) {
> + if (!strncmp(dirent->name, dotdot, name_len)) {
> + if (name_len == 2) {
>   BUG_ON(idata->parent != 0);
>   idata->parent = objectid;
>   }
> @@ -280,7 +283,7 @@ static int dir_iterate_proc(ext2_ino_t dir, int entry,
>   location.offset = 0;
>   btrfs_set_key_type(&location, BTRFS_INODE_ITEM_KEY);
>  
> - file_type = dirent->file_type;
> + file_type = old->name_len >> 8;
>   BUG_ON(file_type > EXT2_FT_SYMLINK);
>   ret = btrfs_insert_dir_item(idata->trans, idata->root,
>   dirent->name, dirent->name_len,
> -- 
> 1.8.1.4
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

2013-01-11 Thread Jan Kara
On Thu 10-01-13 13:47:57, Miao Xie wrote:
> writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
> with down_read_trylock() because
> - If ->s_umount is write locked, then the sb is not idle. That is
>   writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
> - writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
>   writeback, it may bring us deadlock problem when doing umount. In order to
>   fix the problem, ext4 and btrfs implemented their own writeback functions
>   instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the 
> redundant
>   code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().
> 
> The name of these two functions is cumbersome, so rename them to
> try_to_writeback_inodes_sb(_nr).
> 
> This idea came from Christoph Hellwig.
> Some code is from the patch of Kamal Mostafa.
  The patch looks good. Thanks! You can add:
Reviewed-by: Jan Kara 

  BTW, since changes to filesystems are minimal, maybe Fengguang can take
the patch through his writeback tree? Fengguang?

Honza
> 
> Signed-off-by: Miao Xie 
> ---
> Changelog v1 -> v2:
> - do not remove EXPORT_SYMBOL of writeback_inodes_sb_br()
> ---
>  fs/btrfs/extent-tree.c| 20 +++-
>  fs/ext4/inode.c   |  8 ++--
>  fs/fs-writeback.c | 44 
>  include/linux/writeback.h |  6 +++---
>  4 files changed, 28 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 521e9d4..f31abb1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3689,20 +3689,6 @@ static int can_overcommit(struct btrfs_root *root,
>   return 0;
>  }
>  
> -static int writeback_inodes_sb_nr_if_idle_safe(struct super_block *sb,
> -unsigned long nr_pages,
> -enum wb_reason reason)
> -{
> - if (!writeback_in_progress(sb->s_bdi) &&
> - down_read_trylock(&sb->s_umount)) {
> - writeback_inodes_sb_nr(sb, nr_pages, reason);
> - up_read(&sb->s_umount);
> - return 1;
> - }
> -
> - return 0;
> -}
> -
>  /*
>   * shrink metadata reservation for delalloc
>   */
> @@ -3735,9 +3721,9 @@ static void shrink_delalloc(struct btrfs_root *root, 
> u64 to_reclaim, u64 orig,
>   while (delalloc_bytes && loops < 3) {
>   max_reclaim = min(delalloc_bytes, to_reclaim);
>   nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
> - writeback_inodes_sb_nr_if_idle_safe(root->fs_info->sb,
> - nr_pages,
> - WB_REASON_FS_FREE_SPACE);
> + try_to_writeback_inodes_sb_nr(root->fs_info->sb,
> +   nr_pages,
> +   WB_REASON_FS_FREE_SPACE);
>  
>   /*
>* We need to wait for the async pages to actually start before
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cbfe13b..5f6eef7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2512,12 +2512,8 @@ static int ext4_nonda_switch(struct super_block *sb)
>   /*
>* Start pushing delalloc when 1/2 of free blocks are dirty.
>*/
> - if (dirty_blocks && (free_blocks < 2 * dirty_blocks) &&
> - !writeback_in_progress(sb->s_bdi) &&
> - down_read_trylock(&sb->s_umount)) {
> - writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
> - up_read(&sb->s_umount);
> - }
> + if (dirty_blocks && (free_blocks < 2 * dirty_blocks))
> + try_to_writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
>  
>   if (2 * free_blocks < 3 * dirty_blocks ||
>   free_blocks < (dirty_blocks + EXT4_FREECLUSTERS_WATERMARK)) {
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 310972b..ad3cc46 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1332,47 +1332,43 @@ void writeback_inodes_sb(struct super_block *sb, enum 
> wb_reason reason)
>  EXPORT_SYMBOL(writeback_inodes_sb);
>  
>  /**
> - * writeback_inodes_sb_if_idle   -   start writeback if none underway
> + * try_to_writeback_inodes_sb_nr - try to start writeback if none underway
>   * @sb: the superblock
> - * @reason: reason why some writeback work was initiated
> + * @nr

Re: [PATCH RFC] Btrfs: fix deadlock between sys_sync and freeze

2012-08-15 Thread Jan Kara
On Wed 15-08-12 20:51:17, Liu Bo wrote:
> (CCed Jan, the author of freeze code)
> On 08/14/2012 10:12 PM, Marco Stornelli wrote:
> > Il 14/08/2012 15:53, Liu Bo ha scritto:
> >> On 08/14/2012 08:59 PM, Marco Stornelli wrote:
> >>> Il 14/08/2012 07:01, liub.li...@gmail.com ha scritto:
> >>>> From: Liu Bo 
> >>>>
> >>>> I found this while testing xfstests 068, the story is
> >>>>
> >>>>   t1t2
> >>>> sys_syncthaw_super
> >>>>   iterate_supers
> >>>> down_read(sb->s_umount)   
> >>>> down_write(sb->s_umount) --->wait for t1
> >>>> sync_fs (with wait mode)
> >>>>   start_transaction
> >>>> sb_start_intwrite > wait for t2 to set 
> >>>> s_writers.frozen to SB_UNFROZEN
> >>>>
> >>>> In this patch, I add an helper sb_start_intwrite_trylock() and use it 
> >>>> before we
> >>>> start_transaction in sync_fs() with wait mode so that we won't hit the 
> >>>> deadlock.
> >>>>
> >>>
> >>> IMHO, we should avoid to call the sync operation on a frozen fs. The 
> >>> freeze operation, indeed, already include a sync operation.
> >>> According to man page, no other operation should modify the fs after the 
> >>> freeze.
> >>> So for me the modification is inside sync_filesystem (and sync_one_sb).
> >>
> >> Do you mean that we should add the trylock check in sync_filesystem?
> >>
> >> But it seems to be useless because we already run into 
> >> down_read(sb->s_umount) before starting sync_one_sb().
> >>
> >> thanks,
> >> liubo
> >>
> > 
> > I meant that we should check if there are in a "complete" freeze state 
> > (according to the "states" of a freeze transaction) and simply skip the 
> > sync operation.
> > 
> 
> I'm ok with it.
> 
> What do you think about it, Jan?  Any comments?
  Hum, so what I don't exactly understand is why does btrfs start a
transaction in ->sync_fs(). The idea of freeze code is that when filesystem
is frozen, there is no data / metadata to write and thus we avoid deadlocks
arising from trying to write anything with s_umount held.

So checking whether filesystem is frozen and avoiding transaction start in
that case in btrfs_sync_fs() will work but looks hacky. Rather the check
should be for the thing which is the reason for transaction start-end pair
in btrfs_sync_fs(). My naive guess would be we should check whether there
is any transaction running...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex

2012-06-12 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 24b776c..440655b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -272,9 +272,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -640,6 +640,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -651,13 +655,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -675,12 +675,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 24/27] btrfs: Convert to new freezing mechanism

2012-06-12 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 fs/btrfs/transaction.c |7 +++
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7ae51de..663f3a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1532,8 +1532,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1565,7 +1563,6 @@ static int transaction_kthread(void *arg)
do {
cannot_commit = false;
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 70dc8ca..5131c3a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1392,7 +1392,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1482,6 +1482,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f6ab6f5..2e191f6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6535,6 +6535,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = file_update_time(vma->vm_file);
@@ -6624,12 +6625,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1791c6e..05d8c29 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -325,6 +325,8 @@ again:
if (!h)
return ERR_PTR(-ENOMEM);
 
+   sb_start_intwrite(root->fs_info->sb);
+
if (may_wait_transaction(root, type))
wait_current_trans(root);
 
@@ -335,6 +337,7 @@ again:
} while (ret == -EBUSY);
 
if (ret < 0) {
+   sb_end_intwrite(root->fs_info->sb);
kmem_cache_free(btrfs_trans_handle_cachep, h);
return ERR_PTR(ret);
}
@@ -524,6 +527,8 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
count++;
}
 
+   sb_end_intwrite(root->fs_info->sb);
+
if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
should_end_transaction(trans, root)) {
trans->transaction->blocked = 1;
@@ -1507,6 +1512,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
put_transaction(cur_trans);
put_transaction(cur_trans);
 
+   sb_end_intwrite(root->fs_info->sb);
+
trace_btrfs_transaction_commit(root);
 
btrfs_scrub_continue(root);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/27 v7] Fix filesystem freezing deadlocks

2012-06-12 Thread Jan Kara
  Hello,

  here is the seventh iteration of my patches to improve filesystem freezing.
I've rebased patches on top of 3.5-rc2 as Al requested. Otherwise I've just
fixed some outdated text in the introduction below and added one ack.

Introductory text to first time readers:

Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 13 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
file_update_time() call to ->page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress, bash-shared-mapping, and
aio-stress while freezing and unfreezing for several hours (using ext4 and xfs)
so I'm reasonably confident this could finally be the right solution.

Changes since v6:
  * rebased on 3.5-rc2
  * added ack

Changes since v5:
  * handle unlinked & open files on frozen filesystem
  * lockdep keys for freeze protection are now per filesystem type
  * taught lockdep that freeze protection at lower level does not create
dependency when we already hold freeze protection at higher level 
  * rebased on 3.5-rc1-ish

Changes since v4:
  * added a couple of Acked-by's
  * added some comments & doc update
  * added patches from series "Push file_update_time() into .page_mkwrite"
since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/27 v6] Fix filesystem freezing deadlocks

2012-06-01 Thread Jan Kara
  Hello,

  here is the sixth iteration of my patches to improve filesystem freezing.
The change since last iteration is that filesystem can be frozen with open but
unlinked files. After some thinking, I've decided that the best way to handle
this is to block removal inside ->evict_inode() of each filesystem and use
fs-internal level of freeze protection for that (usually I've instrumented
filesystem's transaction system to use freeze protection). Handling
inside VFS would be less work but the only level of freeze protection that
has a chance of not causing deadlocks is the one used for page faults and even
there it's not clear lock ordering would be correct wrt some fs-specific locks.
I've converted ext2, ext4, btrfs, xfs, nilfs2, ocfs2, gfs2 and also checked
that ext3, reiserfs, jfs should work as well (they have their internal freeze
protection mechanisms, possibly they could be replaced by a generic one but
given these are mostly aging filesystems, it's not a real priority IHMO).
So finally I'm not aware of any pending issue with this patch set so if you
have some concern, please speak up!

Introductory text to first time readers:

Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 13 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
file_update_time() call to ->page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v5:
  * handle unlinked & open files on frozen filesystem
  * lockdep keys for freeze protection are now per filesystem type
  * taught lockdep that freeze protection at lower level does not create
dependency when we already hold freeze protection at higher level 
  * rebased on 3.5-rc1-ish

Changes since v4:
  * added a couple of Acked-by's
  * added some comments & doc update
  * added patches from series "Push file_update_time() into .page_mkwrite"
since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 24/27] btrfs: Convert to new freezing mechanism

2012-06-01 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 fs/btrfs/transaction.c |7 +++
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7ae51de..663f3a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1532,8 +1532,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1565,7 +1563,6 @@ static int transaction_kthread(void *arg)
do {
cannot_commit = false;
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 876cddd..7f3d7fe 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1392,7 +1392,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1482,6 +1482,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9991ad..54e5378 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6563,6 +6563,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma->vm_file);
@@ -6652,12 +6653,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1791c6e..05d8c29 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -325,6 +325,8 @@ again:
if (!h)
return ERR_PTR(-ENOMEM);
 
+   sb_start_intwrite(root->fs_info->sb);
+
if (may_wait_transaction(root, type))
wait_current_trans(root);
 
@@ -335,6 +337,7 @@ again:
} while (ret == -EBUSY);
 
if (ret < 0) {
+   sb_end_intwrite(root->fs_info->sb);
kmem_cache_free(btrfs_trans_handle_cachep, h);
return ERR_PTR(ret);
}
@@ -524,6 +527,8 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
count++;
}
 
+   sb_end_intwrite(root->fs_info->sb);
+
if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
should_end_transaction(trans, root)) {
trans->transaction->blocked = 1;
@@ -1507,6 +1512,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
put_transaction(cur_trans);
put_transaction(cur_trans);
 
+   sb_end_intwrite(root->fs_info->sb);
+
trace_btrfs_transaction_commit(root);
 
btrfs_scrub_continue(root);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex

2012-06-01 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 24b776c..440655b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -272,9 +272,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -640,6 +640,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -651,13 +655,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -675,12 +675,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-17 Thread Jan Kara
On Mon 16-04-12 15:02:50, Andreas Dilger wrote:
> On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > Another potential contention point might be patch 19. In that patch
> > we make freeze_super() refuse to freeze the filesystem when there
> > are open but unlinked files which may be impractical in some cases.
> > The main reason for this is the problem with handling of file deletion
> > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > then there's the fact that we cannot really force such filesystem
> > into a consistent state... But if people think that freezing with
> > open but unlinked files should happen, then I have some possible
> > solutions in mind (maybe as a separate patchset since this is
> > large enough).
> 
> Looking at a desktop system, I think it is very typical that there
> are open-unlinked files present, so I don't know if this is really
> an acceptable solution.  It isn't clear from your comments whether
> this is a blanket refusal for all open-unlinked files, or only in
> some particular cases...
  Thanks for looking at this. It is currently a blanket refusal. And I
agree it's problematic. There are two problems with open but unlinked
files.

One is that some old filesystems cannot get in a consistent state in
presence of open but unlinked files but for filesystems we really care
about - xfs, ext4, ext3, btrfs, or even ocfs2, gfs2 - that is not a real
issue (these filesystems will delete those inodes on next mount read-write).

The other problem is with what should happen when you put last inode
reference on a frozen filesystem. Two possibilities I see are:

a) block the iput() call - that is inconvenient because it can be
called in various contexts. I think we could possibly use the same level of
freeze protection as for page fault (this has changed since I originally
thought about this and that would make things simpler) but I'm not
completely sure.

b) let the iput finish but filesystem will keep inode on its orphan list
(or it's equivalent) and the inode will be deleted after the filesystem is
thawed. The advantage of this is we don't have to block iput(), the
disadvantage is we have to have filesystem support and not all filesystems
can do this.

Any thoughts?

Honza
> 
> lsof | grep deleted
> nautilus  25393  adilger   19r  REG   253,0  340 253954 
> /home/adilger/.local/share/gvfs-metadata/home (deleted)
> nautilus  25393  adilger   20r  REG   253,032768 253964 
> /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> gnome-ter 25623  adilger   22u  REG0,18178412717846 
> /tmp/vtePIRJCW (deleted)
> gnome-ter 25623  adilger   23u  REG0,18 55682717847 
> /tmp/vteDCSJCW (deleted)
> gnome-ter 25623  adilger   29u  REG0,18  4802728484 
> /tmp/vte6C1TCW (deleted)
  
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex

2012-04-16 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 18cc23d..869d913 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -639,6 +639,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -650,13 +654,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -674,12 +674,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 24/27] btrfs: Convert to new freezing mechanism

2012-04-16 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20196f4..555a57a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1527,8 +1527,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1560,7 +1558,6 @@ static int transaction_kthread(void *arg)
do {
cannot_commit = false;
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d83260d..a48251e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1358,7 +1358,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1449,6 +1449,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 115bc05..db4fc01 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6592,6 +6592,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma->vm_file);
@@ -6681,12 +6682,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Jan Kara
  The subject should have been [PATCH 00/27]... Sorry for the mistake.

Honza

On Mon 16-04-12 18:13:38, Jan Kara wrote:
>   Hello,
> 
>   here is the fifth iteration of my patches to improve filesystem freezing.
> No serious changes since last time. Mostly I rebased patches and merged this
> series with series moving file_update_time() to ->page_mkwrite() to simplify
> testing and merging.
> 
> Filesystem freezing is currently racy and thus we can end up with dirty data 
> on
> frozen filesystem (see changelog patch 13 for detailed race description). This
> patch series aims at fixing this.
> 
> To be able to block all places where inodes get dirtied, I've moved filesystem
> file_update_time() call to ->page_mkwrite callback (patches 01-07) and put
> freeze handling in mnt_want_write() / mnt_drop_write(). That however required
> some code shuffling and changes to kern_path_create() (see patches 09-12). I
> think the result is OK but opinions may differ ;). The advantage of this 
> change
> also is that all filesystems get freeze protection almost for free - even ext2
> can handle freezing well now.
> 
> Another potential contention point might be patch 19. In that patch we make
> freeze_super() refuse to freeze the filesystem when there are open but 
> unlinked
> files which may be impractical in some cases. The main reason for this is the
> problem with handling of file deletion from fput() called with mmap_sem held
> (e.g. from munmap(2)), and then there's the fact that we cannot really force
> such filesystem into a consistent state... But if people think that freezing
> with open but unlinked files should happen, then I have some possible
> solutions in mind (maybe as a separate patchset since this is large enough).
> 
> I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
> filesystem despite beating it with fsstress and bash-shared-mapping while
> freezing and unfreezing for several hours (using ext4 and xfs) so I'm
> reasonably confident this could finally be the right solution.
> 
> Changes since v4:
>   * added a couple of Acked-by's
>   * added some comments & doc update
>   * added patches from series "Push file_update_time() into .page_mkwrite"
> since it doesn't make much sense to keep them separate anymore
>   * rebased on top of 3.4-rc2
> 
> Changes since v3:
>   * added third level of freezing for fs internal purposes - hooked some
> filesystems to use it (XFS, nilfs2)
>   * removed racy i_size check from filemap_mkwrite()
> 
> Changes since v2:
>   * completely rewritten
>   * freezing is now blocked at VFS entry points
>   * two stage freezing to handle both mmapped writes and other IO
> 
> The biggest changes since v1:
>   * have two counters to provide safe state transitions for SB_FREEZE_WRITE
> and SB_FREEZE_TRANS states
>   * use percpu counters instead of own percpu structure
>   * added documentation fixes from the old fs freezing series
>   * converted XFS to use SB_FREEZE_TRANS counter instead of its private
> m_active_trans counter
> 
>   Honza
> 
> CC: Alex Elder 
> CC: Anton Altaparmakov 
> CC: Ben Myers 
> CC: Chris Mason 
> CC: cluster-de...@redhat.com
> CC: "David S. Miller" 
> CC: fuse-de...@lists.sourceforge.net
> CC: "J. Bruce Fields" 
> CC: Joel Becker 
> CC: KONISHI Ryusuke 
> CC: linux-btrfs@vger.kernel.org
> CC: linux-e...@vger.kernel.org
> CC: linux-...@vger.kernel.org
> CC: linux-ni...@vger.kernel.org
> CC: linux-ntfs-...@lists.sourceforge.net
> CC: Mark Fasheh 
> CC: Miklos Szeredi 
> CC: ocfs2-de...@oss.oracle.com
> CC: OGAWA Hirofumi 
> CC: Steven Whitehouse 
> CC: "Theodore Ts'o" 
> CC: x...@oss.sgi.com
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Jan Kara
  Hello,

  here is the fifth iteration of my patches to improve filesystem freezing.
No serious changes since last time. Mostly I rebased patches and merged this
series with series moving file_update_time() to ->page_mkwrite() to simplify
testing and merging.

Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 13 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
file_update_time() call to ->page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v4:
  * added a couple of Acked-by's
  * added some comments & doc update
  * added patches from series "Push file_update_time() into .page_mkwrite"
since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-28 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
do {
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19 v4] Fix filesystem freezing deadlocks

2012-03-28 Thread Jan Kara
  Hello,

  here is the fourth iteration of my patches to improve filesystem freezing.
Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 06 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-28 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3/4, btrfs, ocfs2: How to assure that cleancache_invalidate_fs is called on every superblock free

2012-03-12 Thread Jan Kara
  Hello,

On Fri 09-03-12 14:40:22, Andor Daam wrote:
> Is it ever possible for a superblock for a mounted filesystem to be
> free'd without a previous call to unmount the filesystem?
  No, I don't think so (well, except for cases where we do not manage to
fully setup the superblock). But be aware that mount/umount need not be
really the entry points you are looking for since filesystem can be mounted
several times. Rather deactivate_locked_supers() is the place you are
looking for...

> I need to be certain that the function cleancache_invalidate_fs, which is
> at the moment called by deactivate_locked_super (fs/super.c) [1], is
> called before every free on a superblock of cleancache-enabled
> filesystems.  Is this already the case or are there situations in which
> this does not happen?
> 
> It would be interesting to know this, as we are planning to have
> cleancache save pointers to superblocks of every mounted
> cleancache-enabled filesystem [2] and it would be fatal if a
> superblock is free'd without cleancache being notified.

        Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-05 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19] Fix filesystem freezing deadlocks

2012-03-05 Thread Jan Kara
  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-05 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
do {
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: getdents - ext4 vs btrfs performance

2012-03-05 Thread Jan Kara
On Fri 02-03-12 14:32:15, Ted Tso wrote:
> On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
> It would be interesting to have a project where someone added
> fallocate() support into libelf, and then added some hueristics into
> ext4 so that if a file is fallocated to a precise size, or if the file
> is fully written and closed before writeback begins, that we use this
> to more efficiently pack the space used by the files by the block
> allocator.  This is a place where I would not be surprised that XFS
> has some better code to avoid accelerated file system aging, and where
> we could do better with ext4 with some development effort.
  AFAIK XFS people actually prefer that applications let them do their work
using delayed allocation and do not interfere with fallocate(2) calls. The
problem they have with fallocate(2) is that it forces you to allocate
blocks while with delayed allocation you can make the decision about
allocation later. So for small files which completely fit into pagecache
before they get pushed out by writeback, they can make better decisions
from delayed allocation. Just dumping my memory from some other thread...

        Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara 
---
 fs/gfs2/aops.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 501e5cb..9c381ff 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 
rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
  offset, nr_segs, gfs2_get_block_direct,
- NULL, NULL, 0);
+ NULL, NULL, DIO_SYNC_WRITES);
 out:
gfs2_glock_dq_m(1, &gh);
gfs2_holder_uninit(&gh);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] vfs: Handle O_SYNC AIO DIO in generic code properly

2012-02-10 Thread Jan Kara
Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystem wanting to
use the helpers has to pass DIO_SYNC_WRITES to __blockdev_direct_IO. Then if
they don't use direct IO end_io handler, generic code takes care of everything
else. Otherwise their end_io handler is passed struct dio_sync_io_work pointer
as 'private' argument and they have to call generic_dio_end_io() to finish
their AIO DIO. Generic code then takes care to call generic_write_sync() from
a workqueue context when AIO DIO is completed.

Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
and the generic one is enough for them, make blockdev_direct_IO() pass
DIO_SYNC_WRITES flag.

Signed-off-by: Jan Kara 
---
 fs/direct-io.c |  128 ++--
 fs/super.c |2 +
 include/linux/fs.h |   13 +-
 3 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4a588db..79aa531 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include 
 #include 
 
+#include 
+
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
  * the size of a structure in the slab cache
@@ -112,6 +114,15 @@ struct dio_submit {
unsigned tail;  /* last valid page + 1 */
 };
 
+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+   struct kiocb *iocb;
+   loff_t offset;
+   ssize_t len;
+   int ret;
+   struct work_struct work;
+};
+
 /* dio_state communicated between submission path and end_io */
 struct dio {
int flags;  /* doesn't change */
@@ -134,6 +145,7 @@ struct dio {
/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
ssize_t result; /* IO result */
+   struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */
 
/*
 * pages[] (and any fields placed after it) are not zeroed out at
@@ -261,6 +273,45 @@ static inline struct page *dio_get_page(struct dio *dio,
 }
 
 /**
+ * generic_dio_end_io() - generic dio ->end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+   struct dio_sync_io_work *work, int ret, bool is_async)
+{
+   struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+
+   if (!is_async) {
+   inode_dio_done(inode);
+   return;
+   }
+
+   /*
+* If we need to sync file, we offload completion to workqueue
+*/
+   if (work) {
+   work->ret = ret;
+   work->offset = offset;
+   work->len = bytes;
+   queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+   } else {
+   aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
+   }
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
  *
@@ -302,12 +353,22 @@ static ssize_t dio_complete(struct dio *dio, loff_t 
offset, ssize_t ret, bool is
ret = transferred;
 
if (dio->end_io && dio->result) {
+   void *private;
+
+   if (dio->sync_work)
+   private = dio->sync_work;
+   else
+   private = dio->private;
dio->end_io(dio->iocb, offset, transferred,
-   dio->private, ret, is_async);
+   private, ret, is_async);
} else {
-   if (is_async)
-   aio_complete(dio->iocb, ret, 0);
-   inode_dio_done(dio->inode);
+   /* No IO submitted? Skip syncing... */
+   if (!dio->result && dio->sync_work) {
+   kfree(dio->sync_work);
+   dio->sync_work = NULL;
+   }
+   generic_dio_end_io(dio->iocb, offset, transferred,
+  dio->sync_work, ret, is_async);
}
 
return ret;
@@ -1064,6 +1125,41 @@ static inline int drop_refcount(struct dio *dio)
 }
 
 /*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *wo

[PATCH 2/4] ocfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara 
---
 fs/ocfs2/aops.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 78b68af..3d14c2b 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
level = ocfs2_iocb_rw_locked_level(iocb);
ocfs2_rw_unlock(inode, level);
 
-   if (is_async)
-   aio_complete(iocb, ret, 0);
-   inode_dio_done(inode);
+   generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
 }
 
 /*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
iov, offset, nr_segs,
ocfs2_direct_IO_get_blocks,
-   ocfs2_dio_end_io, NULL, 0);
+   ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
 }
 
 static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] Generic O_SYNC AIO DIO handling

2012-02-10 Thread Jan Kara

  Hi Jeff,

  these patches implement generic way of handling O_SYNC AIO DIO. They work
for all filesystems except for ext4 and xfs. Thus together with your patches,
all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3,
btrfs, and xfs (to check that I didn't break anything when the generic code
is unused) and things seem to work fine. Will you add these patches to your
series please? Thanks.

Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] btrfs: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file. Although we use our own bio->end_io function, we call dio_end_io()
from it and thus, because we don't set any specific dio->end_io function,
generic code ends up calling generic_dio_end_io() which is all what we need
for proper O_SYNC AIO DIO handling.

Signed-off-by: Jan Kara 
---
 fs/btrfs/inode.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..68add6e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6221,7 +6221,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
ret = __blockdev_direct_IO(rw, iocb, inode,
   BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
   iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-  btrfs_submit_direct, 0);
+  btrfs_submit_direct, DIO_SYNC_WRITES);
 
if (ret < 0 && ret != -EIOCBQUEUED) {
clear_extent_bit(&BTRFS_I(inode)->io_tree, offset,
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-10 Thread Jan Kara
On Wed 08-02-12 22:05:26, Phillip Susi wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 02/08/2012 06:20 PM, Jan Kara wrote:
> >   Thanks for your reply. I admit I was not sure what exactly size argument
> > should be. So after looking into the code for a while I figured it should
> > be a total size of the filesystem - or differently it should be size of
> > virtual block address space in the filesystem. Thus when filesystem has
> > more devices (or admin wants to add more devices later), it can be larger
> > than the first device. But I'm not really a btrfs developper so I might be
> > wrong and of course feel free to fix the issue as you deem fit.
> 
> The size of the fs is the total size of the individual disks.  When you
> limit the size, you limit the size of a disk, not the whole fs.  IIRC,
> mkfs initializes the fs on the first disk, which is why it was using that
> size as the size of the whole fs, and then adds the other disks after (
> which then add their size to the total fs size ).
  OK, I missed that btrfs_add_to_fsid() increases total size of the
filesystem. So now I agree with you. New patch is attached. Thanks for your
review.

> It might be nice if
> mkfs could take sizes for each disk, but it only seems to take one size
> for the initial disk.
  Yes, but I don't see a realistic usecase so I don't think it's really
worth the work.

Honza
-- 
Jan Kara 
SUSE Labs, CR
>From e5f46872232520310c56327593c02ef6a7f5ea33 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Fri, 10 Feb 2012 11:44:44 +0100
Subject: [PATCH] mkfs: Handle creation of filesystem larger than the first device

mkfs does not properly check requested size of the filesystem. Thus if the
requested size is larger than the first device, it happily creates larger
filesystem than a device it resides on which results in 'attemp to access
beyond end of device' messages from the kernel. So verify specified filesystem
size against the size of the first device.

CC: David Sterba 
Signed-off-by: Jan Kara 
---
 mkfs.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index e3ced19..3afe9eb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1282,6 +1282,10 @@ int main(int ac, char **av)
 		ret = btrfs_prepare_device(fd, file, zero_end, &dev_block_count, &mixed);
 		if (block_count == 0)
 			block_count = dev_block_count;
+		else if (block_count > dev_block_count) {
+			fprintf(stderr, "%s is smaller than requested size\n", file);
+			exit(1);
+		}
 	} else {
 		ac = 0;
 		file = av[optind++];
-- 
1.7.1



Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-08 Thread Jan Kara
On Wed 08-02-12 17:01:15, Phillip Susi wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 1/26/2012 11:03 AM, Jan Kara wrote:
> > make_btrfs() function takes a size of filesystem as an argument. It
> > uses this value to set the size of the first device as well which
> > is wrong for filesystems larger than this device. It results in
> > 'attemp to access beyond end of device' messages from the kernel.
> > So add size of the first device as an argument to make_btrfs().
> 
> I don't think this patch is correct.  Yes, the size switch only
> applies to the first device, so it doesn't make any sense to try to
> use a value larger than that device.  Attempting to do so probably
> should be trapped and it should error out, and the man page should
> probably clarify the fact that the size is only for the first device.
> 
> It looks like you think the size should somehow apply to multiple
> devices or to the total fs size when creating on multiple devices, and
> that just doesn't make sense.
  Thanks for your reply. I admit I was not sure what exactly size argument
should be. So after looking into the code for a while I figured it should
be a total size of the filesystem - or differently it should be size of
virtual block address space in the filesystem. Thus when filesystem has
more devices (or admin wants to add more devices later), it can be larger
than the first device. But I'm not really a btrfs developper so I might be
wrong and of course feel free to fix the issue as you deem fit.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Fix busyloops in transaction waiting code

2012-01-26 Thread Jan Kara
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

Signed-off-by: Jan Kara 
---
 fs/btrfs/tree-log.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cb877e0..966cc74 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle 
*trans,
 
finish_wait(&root->log_commit_wait[index], &wait);
mutex_lock(&root->log_mutex);
-   } while (root->log_transid < transid + 2 &&
+   } while (root->fs_info->last_trans_log_full_commit !=
+trans->transid && root->log_transid < transid + 2 &&
 atomic_read(&root->log_commit[index]));
return 0;
 }
@@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle 
*trans,
   struct btrfs_root *root)
 {
DEFINE_WAIT(wait);
-   while (atomic_read(&root->log_writers)) {
+   while (root->fs_info->last_trans_log_full_commit !=
+  trans->transid && atomic_read(&root->log_writers)) {
prepare_to_wait(&root->log_writer_wait,
&wait, TASK_UNINTERRUPTIBLE);
mutex_unlock(&root->log_mutex);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-01-26 Thread Jan Kara
make_btrfs() function takes a size of filesystem as an argument. It uses this
value to set the size of the first device as well which is wrong for
filesystems larger than this device. It results in 'attemp to access beyond end
of device' messages from the kernel. So add size of the first device as an
argument to make_btrfs().

CC: David Sterba 
Signed-off-by: Jan Kara 
---
 convert.c |2 +-
 mkfs.c|6 --
 utils.c   |4 ++--
 utils.h   |2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

  As a side note, I'd guess that creating filesystem larger than all given
devices (especially in case of single device) is usually not what sysadmin
wants (we've spotted this bug when xfstest were happily creating 1 GB
filesystem on 500 MB device and it took us a while to notice the problem).
Thus maybe it should require some --force switch?

diff --git a/convert.c b/convert.c
index 291dc27..7f1932c 100644
--- a/convert.c
+++ b/convert.c
@@ -2374,7 +2374,7 @@ int do_convert(const char *devname, int datacsum, int 
packing, int noxattr)
goto fail;
}
ret = make_btrfs(fd, devname, ext2_fs->super->s_volume_name,
-blocks, total_bytes, blocksize, blocksize,
+blocks, total_bytes, total_bytes, blocksize, blocksize,
 blocksize, blocksize);
if (ret) {
fprintf(stderr, "unable to create initial ctree\n");
diff --git a/mkfs.c b/mkfs.c
index e3ced19..f0d29bb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1294,8 +1294,10 @@ int main(int ac, char **av)
first_file = file;
source_dir_size = size_sourcedir(source_dir, sectorsize,
 &num_of_meta_chunks, 
&size_of_data);
-   if(block_count < source_dir_size)
+   if (block_count < source_dir_size)
block_count = source_dir_size;
+   dev_block_count = block_count;
+
ret = zero_output_file(fd, block_count, sectorsize);
if (ret) {
fprintf(stderr, "unable to zero the output file\n");
@@ -1321,7 +1323,7 @@ int main(int ac, char **av)
leafsize * i;
}
 
-   ret = make_btrfs(fd, file, label, blocks, block_count,
+   ret = make_btrfs(fd, file, label, blocks, block_count, dev_block_count,
 nodesize, leafsize,
 sectorsize, stripesize);
if (ret) {
diff --git a/utils.c b/utils.c
index 178d1b9..f34da51 100644
--- a/utils.c
+++ b/utils.c
@@ -74,7 +74,7 @@ static u64 reference_root_table[] = {
 };
 
 int make_btrfs(int fd, const char *device, const char *label,
-  u64 blocks[7], u64 num_bytes, u32 nodesize,
+  u64 blocks[7], u64 num_bytes, u64 dev_num_bytes, u32 nodesize,
   u32 leafsize, u32 sectorsize, u32 stripesize)
 {
struct btrfs_super_block super;
@@ -276,7 +276,7 @@ int make_btrfs(int fd, const char *device, const char 
*label,
dev_item = btrfs_item_ptr(buf, nritems, struct btrfs_dev_item);
btrfs_set_device_id(buf, dev_item, 1);
btrfs_set_device_generation(buf, dev_item, 0);
-   btrfs_set_device_total_bytes(buf, dev_item, num_bytes);
+   btrfs_set_device_total_bytes(buf, dev_item, dev_num_bytes);
btrfs_set_device_bytes_used(buf, dev_item,
BTRFS_MKFS_SYSTEM_GROUP_SIZE);
btrfs_set_device_io_align(buf, dev_item, sectorsize);
diff --git a/utils.h b/utils.h
index c5f55e1..bf2d5a4 100644
--- a/utils.h
+++ b/utils.h
@@ -22,7 +22,7 @@
 #define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4 * 1024 * 1024)
 
 int make_btrfs(int fd, const char *device, const char *label,
-  u64 blocks[6], u64 num_bytes, u32 nodesize,
+  u64 blocks[6], u64 num_bytes, u64 dev_num_bytes, u32 nodesize,
   u32 leafsize, u32 sectorsize, u32 stripesize);
 int btrfs_make_root_dir(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 objectid);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Fix busyloops in transaction waiting code

2012-01-25 Thread Jan Kara
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

CC: sta...@kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/tree-log.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cb877e0..966cc74 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle 
*trans,
 
finish_wait(&root->log_commit_wait[index], &wait);
mutex_lock(&root->log_mutex);
-   } while (root->log_transid < transid + 2 &&
+   } while (root->fs_info->last_trans_log_full_commit !=
+trans->transid && root->log_transid < transid + 2 &&
 atomic_read(&root->log_commit[index]));
return 0;
 }
@@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle 
*trans,
   struct btrfs_root *root)
 {
DEFINE_WAIT(wait);
-   while (atomic_read(&root->log_writers)) {
+   while (root->fs_info->last_trans_log_full_commit !=
+  trans->transid && atomic_read(&root->log_writers)) {
prepare_to_wait(&root->log_writer_wait,
&wait, TASK_UNINTERRUPTIBLE);
mutex_unlock(&root->log_mutex);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: push file_update_time into ->page_mkwrite

2011-11-29 Thread Jan Kara
;   return 0;
>  }
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index edeb9e8..ba22704 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -359,6 +359,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>   struct gfs2_alloc *al;
>   int ret;
>  
> + file_update_time(vma->vm_file);
>   gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
>   ret = gfs2_glock_nq(&gh);
>   if (ret)
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 28b8c3f..bfa0c48 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -571,6 +571,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct 
> *vma, struct vm_fault *vmf)
>   filp->f_mapping->host->i_ino,
>   (long long)page_offset(page));
>  
> + file_update_time(filp);
>   /* make sure the cache has finished storing the page */
>   nfs_fscache_wait_on_page_write(NFS_I(dentry->d_inode), page);
>  
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index 2660152..d1ab350 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -70,6 +70,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>   if (unlikely(nilfs_near_disk_full(inode->i_sb->s_fs_info)))
>   return VM_FAULT_SIGBUS; /* -ENOSPC */
>  
> + file_update_time(vma->vm_file);
>   lock_page(page);
>   if (page->mapping != inode->i_mapping ||
>   page_offset(page) >= i_size_read(inode) || !PageUptodate(page)) {
> diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
> index 3e9393c..27c59e5 100644
> --- a/fs/ocfs2/mmap.c
> +++ b/fs/ocfs2/mmap.c
> @@ -141,6 +141,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>  
>   ocfs2_block_signals(&oldset);
>  
> + file_update_time(vma->vm_file);
>   /*
>* The cluster locks taken will block a truncate from another
>* node. Taking the data lock will also ensure that we don't
> diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
> index a475983..4879bfb 100644
> --- a/fs/sysfs/bin.c
> +++ b/fs/sysfs/bin.c
> @@ -225,6 +225,7 @@ static int bin_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>   if (!sysfs_get_active(attr_sd))
>   return VM_FAULT_SIGBUS;
>  
> + file_update_time(file);
>   ret = 0;
>   if (bb->vm_ops->page_mkwrite)
>   ret = bb->vm_ops->page_mkwrite(vma, vmf);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 0f85778..abbaee4 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3466,6 +3466,7 @@ static int perf_mmap_fault(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>   struct ring_buffer *rb;
>   int ret = VM_FAULT_SIGBUS;
>  
> + file_update_time(vma->vm_file);
>   if (vmf->flags & FAULT_FLAG_MKWRITE) {
>   if (vmf->pgoff == 0)
>   ret = 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index a56e3ba..8e4686f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2619,10 +2619,6 @@ reuse:
>   }
>   }
>  
> - /* file_update_time outside page_lock */
> - if (vma->vm_file)
> - file_update_time(vma->vm_file);
> -
>   return ret;
>   }
>  
> @@ -3303,10 +3299,6 @@ static int __do_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>*/
>   balance_dirty_pages_ratelimited(mapping);
>   }
> -
> - /* file_update_time outside page_lock */
> - if (vma->vm_file)
> -         file_update_time(vma->vm_file);
>   } else {
>   unlock_page(vmf.page);
>   if (anon)
> diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
> index 55d92cb..e5f4be1 100644
> --- a/security/selinux/selinuxfs.c
> +++ b/security/selinux/selinuxfs.c
> @@ -458,6 +458,7 @@ static int sel_mmap_policy_fault(struct vm_area_struct 
> *vma,
>   unsigned long offset;
>   struct page *page;
>  
> + file_update_time(vma->vm_file);
>   if (vmf->flags & (FAULT_FLAG_MKWRITE | FAULT_FLAG_WRITE))
>   return VM_FAULT_SIGBUS;
>  
> -- 
> 1.7.5.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Jan Kara
On Wed 16-11-11 08:42:34, Christoph Hellwig wrote:
> On Wed, Nov 16, 2011 at 02:39:15PM +0100, Jan Kara wrote:
> > > This would work fine with XFS and be equivalent to what it does for
> > > O_DSYNC now.  But I'd rather see every filesystem do the right thing
> > > and make sure the update actually is on disk when doing O_(D)SYNC
> > > operations.
> >   OK, I don't really have a strong opinion here. Are you afraid that just
> > calling fsync() need not be enough to push all updates fallocate did to
> > disk?
> 
> No, the point is that you should not have to call fsync when doing
> O_SYNC I/O.  That's the whole point of it.
  I agree with you that userspace shouldn't have to call fsync. What I
meant is that sys_fallocate() or do_fallocate() can call
generic_write_sync(file, pos, len), and that would be completely
transparent to userspace.

Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Jan Kara
On Wed 16-11-11 07:45:50, Christoph Hellwig wrote:
> On Wed, Nov 16, 2011 at 11:54:13AM +0100, Jan Kara wrote:
> >   Yeah, only that nobody calls that fsync() automatically if the fd is
> > O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
> > fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
> > most filesystems? That would match how we treat O_SYNC for other operations
> > as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
> > with this.
> 
> This would work fine with XFS and be equivalent to what it does for
> O_DSYNC now.  But I'd rather see every filesystem do the right thing
> and make sure the update actually is on disk when doing O_(D)SYNC
> operations.
  OK, I don't really have a strong opinion here. Are you afraid that just
calling fsync() need not be enough to push all updates fallocate did to
disk?

    Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Jan Kara
  Hello,

On Wed 16-11-11 09:43:08, Steven Whitehouse wrote:
> On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> > make sure the size update transaction made it to disk.
> > 
> > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> > operation (it adds new blocks that return zeroes) that seems like a
> > fairly nasty surprise for O_SYNC users.
> 
> In GFS2 we zero out the data blocks as we go (since our metadata doesn't
> allow us to mark blocks as zeroed at alloc time) and also because we are
> mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
> on our rindex system file in order to ensure that there is always enough
> space to expand a filesystem.
> 
> So there is no danger of having non-zeroed blocks appearing later, as
> that is done before the metadata change.
> 
> Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
> call, so that fsync should pick that up and ensure that the metadata has
> been written back. So we should thus have both data and metadata stable
> on disk.
> 
> Do you have some evidence that this is not happening?
  Yeah, only that nobody calls that fsync() automatically if the fd is
O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
most filesystems? That would match how we treat O_SYNC for other operations
as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
with this.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-22 Thread Jan Kara
On Wed 22-06-11 10:22:35, Ted Tso wrote:
> On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
> > ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> > it?
> 
> So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
> will do not dropping the page's refcount until after the workqueue
> finishes its job), does your patch need changing, or is it a matter of
> fixing the commit description?
  Looking into patchwork (http://patchwork.ozlabs.org/patch/97924/) I see
the discussion of two issues (page refcounting and page_mkwrite got mixed).
I see the patch just needs fixing commit description because it's no longer
accurate after "stable page under writeback" changes went it. I'll send you
a patch with updated changelog in a minute.

> In any case, this dragged got out, and it's late in -rc cycle for 3.0,
> so I was planning on queuing this for the next merge window.  (Sorry
> for that, this was mostly due to my not having enough time to really
> dive into the issues involved.)
  No problem. Just we have to somehow coordinate with Christoph... Either
he can avoid touching ext4 and merge his patch set after you merge my patch
or he can take my patch instead of his ext4 change. Since my patch touches
only ext4_page_mkwrite() there's not a high risk of collision. The latter
would seem easier for both of you to me but I don't really care.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-21 Thread Jan Kara
On Mon 20-06-11 16:15:33, Christoph Hellwig wrote:
> i_alloc_sem has always been a bit of an odd "lock".  It's the only remaining
> rw_semaphore that can be released by a different thread than the one that
> locked it, and it's use case in the core direct I/O code is more like a
> counter given that the writers already have external serialization.
> 
> This series removes it in favour of a simpler counter scheme, thus getting
> rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
> same time shrinking the size of struct inode by 160 bytes on 64-bit systems.
> 
> The only nasty bit is that two filesystems (fat and ext4) have started
> abusing the lock for their own purposes.  I've added a new rw_semaphore
  ext4 abuse should be gone when Ted merges my rewrite of
ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
it?

        Honza

-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-24 Thread Jan Kara
On Wed 24-11-10 12:03:43, Nick Piggin wrote:
> > For the _nr variant that btrfs uses, it's worse for the filesystems
> > that don't have a 1:1 bdi<->sb mapping.  It might not actually write any
> > of the pages from the SB that is out of space.
> 
> That's true, but it might not write anything anyway (and after we
> check whether writeout is in progress, the writeout thread could go
> to sleep and not do anything anyway).
> 
> So it's a pretty hacky interface anyway. If you want to do anything
> deterministic, you obviously need real coupling between producer and
> consumer. This should only be a performance tweak (or a workaround
> hack in worst case).
  Yes, the current interface is a band aid for the problem and better
interface is welcome. But it's not trivial to do better...

> > > It makes no further guarantees, and anyway
> > > the sb has to compete for normal writeback within this bdi.
> > 
> > > 
> > > I think Christoph is right because filesystems should not really
> > > know about how bdi writeback queueing works. But I don't know if it's
> > > worth doing anything more complex for this functionality?
> > 
> > I think we should make a writeback_inodes_sb_unlocked() that doesn't
> > warn when the semaphore isn't held.  That should be enough given where
> > btrfs and ext4 are calling it from.
> 
> It doesn't solve the bugs -- calling and waiting for writeback is
> still broken because completion requires i_mutex and it is called
> from under i_mutex.
  Well, as I wrote in my previous email, only ext4 has the problem with
i_mutex and I personally view it as a bug. But ultimately it's Ted's call
to decide.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-23 Thread Jan Kara
On Tue 23-11-10 19:07:58, Nick Piggin wrote:
> On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote:
> > On Fri 19-11-10 16:16:19, Nick Piggin wrote:
> > > On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> > > > On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > > > > The fact that a call to ->write_begin can randomly return with 
> > > > > s_umount
> > > > > held, to be randomly released at some random time in the future is a
> > > > > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > > > > thing.
> > > >   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> > > > variants) does:
> > > > bdi_queue_work(sb->s_bdi, &work);
> > > > wait_for_completion(&done);
> > > >   So we return only after all the IO has been submitted and unlock 
> > > > s_umount
> > > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO 
> > > > ourselves
> > > > because we are holding i_mutex and we need to get and put references
> > > > to other inodes while doing writeback (those would be really horrible 
> > > > lock
> > > > dependencies - writeback thread can put the last reference to an 
> > > > unlinked
> > > > inode...).
> > > 
> > > But if we're waiting for it, with the lock held, then surely it can
> > > deadlock just the same as if we submit it ourself?
> >   Yes, that's what I realized as well and what needs fixing. It was an
> > unintentional consequence of locking changes Christoph did to the writeback
> > code to fix other races.
> > 
> > > BTW. are you taking i_mutex inside writeback? I mutex can be held
> > > while entering page reclaim, and thus writepage... so it could be a
> > > bug too.
> >   No, writeback does not need i_mutex.
> 
> I did in fact see i_mutex from writeback, which is how the lock order
> reversal was noticed in the first place. I haven't had much luck in
> reproducing it yet. It did come from end_io_work, I believe.
> 
> There is another deadlock in here, by the looks (although this is not
> the one which triggers lockdep -- the workqueue coupling means it
> defeats lockdep detection).
> 
> Buffered write holds i_lock, and waits for inode writeback submission,
> but the work queue seems to be blocked on i_mutex (ext4_end_io_work),
> and so it deadlocks.
  Ah, I see. That's a new thing introduced by Ted's rewrite of
ext4_writepages() - bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc. And indeed it
introduced a deadlock as you describe...

Honza

> Note that this is an AA deadlock, different from ABBA one relating to
> s_umount lock (but very similar call chains IIRC).
> 
> 
> [  748.406457] SysRq : Show Blocked State
> [  748.406685]   taskPC stack   pid father
> [  748.406868] kworker/9:1   D   6296   118  2
> 0x
> [  748.407173]  88012acb3c90 0046 
> 88012b1cec80
> [  748.407686]  0002 88012acb2000 88012acb2000
> d6c8
> [  748.408200]  88012acb2010 88012acb3fd8 880129c7dd00
> 88012b1cec80
> [  748.408711] Call Trace:
> [  748.408866]  [] ? get_parent_ip+0x11/0x50
> [  748.409033]  [] __mutex_lock_common+0x18c/0x480
> [  748.409205]  [] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409379]  [] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409546]  [] mutex_lock_nested+0x3e/0x50
> [  748.409717]  [] ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409883]  [] process_one_work+0x1b8/0x5a0
> [  748.410046]  [] ? process_one_work+0x14e/0x5a0
> [  748.410219]  [] ? ext4_end_io_work+0x0/0xb0 [ext4]
> [  748.410386]  [] worker_thread+0x175/0x3a0
> [  748.410549]  [] ? worker_thread+0x0/0x3a0
> [  748.410712]  [] kthread+0x96/0xa0
> [  748.410874]  [] kernel_thread_helper+0x4/0x10
> [  748.411038]  [] ? finish_task_switch+0x78/0x110
> [  748.411202]  [] ? restore_args+0x0/0x30
> [  748.411364]  [] ? kthread+0x0/0xa0
> [  748.411524]  [] ? kernel_thread_helper+0x0/0x10
> [  748.606853] dbenchD   2872  2916  1
> 0x0004
> [  748.607154]  880129ec58b8 0046 880129ec5838
> 810806fd
> [  748.607665]  880129ec5898 880129ec4000 880129ec4000
> d6c8
> [  748.608176]  880129ec4010 880129ec5fd8 88010a2d
> 88011ff69f00
> [  748.608686] Call Trace:
> [  748.608837]  [] ? trace_hardirqs_off+0xd/

Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-23 Thread Jan Kara
On Tue 23-11-10 21:11:49, Nick Piggin wrote:
> The issue of writeback_inodes_sb being synchronous so far as it has to
> wait until the work has been dequeued is another subtlety. That is a
> funny interface though, really. It has 3 callers, sync, quota, and
> ubifs. From a quick look, quota and ubifs seem to think it is some kind
> of synchronous writeout API.
  Yes, they expect it and it used to be the case before per-bdi flusher
threads existed (because the function submitted IO on its own). Then it
was changed to an async interface in per-bdi flusher thread patches and
then back again to a synchronous one... Sad history...

> It also really sucks that it can get deadlocked behind an unrelated item
> in a workqueue. I think it should just get converted over to the
> async-submission style as well.
  Here I don't quite agree. After my patches (currently in -mm tree) all
work items have reasonably well defined lifetime so no livelocks should
occur. After all writeback thread is doing its best to do as much IO as
possible (and hopefully saturates the storage) so given all the IO work
items we cannot do much better. Where I see a room for improvement is
that work items usually try to achieve a common goal - for example when we
get two items "write all dirty pages", we have mostly fulfilled the second
one after finishing the first one but we start from the beginning when
processing the second request. But it seems non-trivial to do this request
merging especially for different types of requests...

        Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-22 Thread Jan Kara
On Fri 19-11-10 16:16:19, Nick Piggin wrote:
> On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> > On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > > The fact that a call to ->write_begin can randomly return with s_umount
> > > held, to be randomly released at some random time in the future is a
> > > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > > thing.
> >   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> > variants) does:
> > bdi_queue_work(sb->s_bdi, &work);
> > wait_for_completion(&done);
> >   So we return only after all the IO has been submitted and unlock s_umount
> > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO 
> > ourselves
> > because we are holding i_mutex and we need to get and put references
> > to other inodes while doing writeback (those would be really horrible lock
> > dependencies - writeback thread can put the last reference to an unlinked
> > inode...).
> 
> But if we're waiting for it, with the lock held, then surely it can
> deadlock just the same as if we submit it ourself?
  Yes, that's what I realized as well and what needs fixing. It was an
unintentional consequence of locking changes Christoph did to the writeback
code to fix other races.

> BTW. are you taking i_mutex inside writeback? I mutex can be held
> while entering page reclaim, and thus writepage... so it could be a
> bug too.
  No, writeback does not need i_mutex.

> > In fact, as I'm speaking about it, pushing things to writeback thread and
> > waiting on the work does not help a bit with the locking issues (we didn't
> > wait for the work previously but that had other issues). Bug, sigh.
> > 
> > What might be better interface for usecases like above is to allow
> > filesystem to kick flusher thread to start doing background writeback
> > (regardless of dirty limits). Then the caller can wait for some delayed
> > allocation reservations to get freed (easy enough to check in
> > ->writepage() and wake the waiters) - possibly with a reasonable timeout
> > so that we don't stall forever.
> 
> We really need to throttle the producer without any locks held, no?
> So the filesystem would like to to hook into dirtying path somewhere
> without i_mutex held (and without implementing your own .write). Eg.
> around the dirty throttling calls somewhere I suppose.
  Well, the particular case where we need to do delayed allocation but the
filesystem has already all blocks reserved is hard to detect without any
locks. At least we need some locks to find out how many blocks do we need
to reserve for that write(). So for filesystems it would solve the locking
issues if we could bail out of write() path up to the point where we hold
no locks, wait there / do necessary writeback and then restart the write...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Jan Kara
On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> I'm not sure that s_umount versus i_mutex has come up before.
> 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.
> 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?
> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
  As I wrote to Chris, the function just needs exclusion from umount /
remount happening (and want to stop umount from returning EBUSY when
writeback thread is writing something out). When the function is called
from ->write_begin this is no issue as you properly noted so s_umount is
not needed in that particular case.

> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
  Possibly, but currently the advantage is that we can have WARN_ON in the
writeback code that complains if someone starts writeback without properly
pinned superblock and we cannot easily do that with your general rule. I'm
not saying that should stop us from changing the rule but it was kind of
nice.

> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.
  I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
variants) does:
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
  So we return only after all the IO has been submitted and unlock s_umount
in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
because we are holding i_mutex and we need to get and put references
to other inodes while doing writeback (those would be really horrible lock
dependencies - writeback thread can put the last reference to an unlinked
inode...).

In fact, as I'm speaking about it, pushing things to writeback thread and
waiting on the work does not help a bit with the locking issues (we didn't
wait for the work previously but that had other issues). Bug, sigh.

What might be better interface for usecases like above is to allow
filesystem to kick flusher thread to start doing background writeback
(regardless of dirty limits). Then the caller can wait for some delayed
allocation reservations to get freed (easy enough to check in
->writepage() and wake the waiters) - possibly with a reasonable timeout
so that we don't stall forever.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Jan Kara
On Thu 18-11-10 13:33:54, Chris Mason wrote:
> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> > 
> > And why _do_ we need to hold s_umount during the bdi_queue_work()
> > handover?  Would simply bumping s_count suffice?
> > 
> 
> We don't need to keep the super in ram, we need to keep the FS mounted
> so that writepage and friends continue to do useful things.  s_count
> isn't enough for that...but when the bdi stuff is passed an SB from
> something that has the SB explicitly pinned, we should be able to safely
> skip the locking.
> 
> Since these functions are only used in that context it makes good sense
> to try_lock them or drop the lock completely.
> 
> I think the only reason we need the lock:
> 
> void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
> {
> ...
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
  The above is the only reason why we need the lock in the call from
->write_begin(). Yes. But:

> We're not going to go from rw to ro with dirty pages unless something
> horrible has gone wrong (eios all around in the FS), so I'm not sure why
> we need the lock at all?
  One of the nasty cases is for example: Writeback thread decides inode
needs writeout, so the thread gets inode reference. Then someone calls
umount and gets EBUSY because writeback thread is working. Kind of
unexpected... So generally we *do* need a serialization of writeback thread
and umount / remount. Just in that particular case where we call the
function from ->write_begin(), s_umount is overkill...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-18 Thread Jan Kara
On Wed 17-11-10 20:46:15, Josef Bacik wrote:
> Hole punching has already been implemented by XFS and OCFS2, and has the
> potential to be implemented on both BTRFS and EXT4 so we need a generic way to
> get to this feature.  The simplest way in my mind is to add 
> FALLOC_FL_PUNCH_HOLE
> to fallocate() since it already looks like the normal fallocate() operation.
> I've tested this patch with XFS and BTRFS to make sure XFS did what it's
> supposed to do and that BTRFS failed like it was supposed to.  Thank you,
  Looks nice now. Acked-by: Jan Kara 

Honza
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/open.c  |7 ++-
>  include/linux/falloc.h |1 +
>  2 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 4197b9e..5b6ef7e 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -223,7 +223,12 @@ int do_fallocate(struct file *file, int mode, loff_t 
> offset, loff_t len)
>   return -EINVAL;
>  
>   /* Return error if mode is not supported */
> - if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + return -EOPNOTSUPP;
> +
> + /* Punch hole must have keep size set */
> + if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> + !(mode & FALLOC_FL_KEEP_SIZE))
>   return -EOPNOTSUPP;
>  
>   if (!(file->f_mode & FMODE_WRITE))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3c15510..73e0b62 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -2,6 +2,7 @@
>  #define _FALLOC_H_
>  
>  #define FALLOC_FL_KEEP_SIZE  0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
>  
>  #ifdef __KERNEL__
>  
> -- 
> 1.6.6.1
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-16 Thread Jan Kara
On Tue 16-11-10 07:52:50, Josef Bacik wrote:
> On Tue, Nov 16, 2010 at 12:43:46PM +0100, Jan Kara wrote:
> > On Tue 16-11-10 12:16:11, Jan Kara wrote:
> > > On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 4197b9e..ab8dedf 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, 
> > > > loff_t offset, loff_t len)
> > > > return -EINVAL;
> > > >  
> > > > /* Return error if mode is not supported */
> > > > -   if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > > > +   if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | 
> > > > FALLOC_FL_PUNCH_HOLE)))
> > >   Why not just:
> > > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
> >   And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
> > not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
> > much which way but keeping it ambiguous (ignored) in the interface usually
> > proves as a bad idea in future when we want to further extend the 
> > interface...
> >
> 
> Yeah I went back and forth on this.  KEEP_SIZE won't change the behavior of
> PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size.  I figured since 
> its
> "mode" and not "flags" it would be ok to make either way accepted, but if you
> prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool with 
> that,
> just let me know one way or the other.  Thanks,
  I was wondering about 'mode' vs 'flags' as well. The manpage says:
The mode argument determines the operation to be performed on the given
range.  Currently only one flag is supported for mode...
  So we call it "mode" but speak about "flags"? Seems a bit inconsistent.
I'd maybe lean a bit at the "flags" side and just make sure that
only one of FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE is set (interpreting
FALLOC_FL_KEEP_SIZE as allocate blocks beyond i_size). But I'm not sure
what others think.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-16 Thread Jan Kara
On Tue 16-11-10 22:00:58, Nick Piggin wrote:
> I saw a lock order warning on ext4 trigger. This should solve it.
> Raciness shouldn't matter much, because writeback can stop just
> after we make the test and return anyway (so the API is racy anyway).
  Hmm, for now the fix is OK. Ultimately, we probably want to call
writeback_inodes_sb() directly from all the callers. They all just want to
reduce uncertainty of delayed allocation reservations by writing delayed
data and actually wait for some of the writeback to happen before they
retry again the allocation.

Although the callers generally cannot get umount_sem because they hold
other locks, they have the superblock well pinned so grabbing umount_sem
makes sense mostly to make assertions happy. But as I'm thinking about it,
trylock *is* maybe the right answer to this anyway...

So
Acked-by: Jan Kara 

Honza
> 
> Signed-off-by: Nick Piggin 
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===
> --- linux-2.6.orig/fs/fs-writeback.c  2010-11-16 21:44:32.0 +1100
> +++ linux-2.6/fs/fs-writeback.c   2010-11-16 21:49:37.0 +1100
> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>   if (!writeback_in_progress(sb->s_bdi)) {
> - down_read(&sb->s_umount);
> - writeback_inodes_sb(sb);
> - up_read(&sb->s_umount);
> - return 1;
> - } else
> - return 0;
> + if (down_read_trylock(&sb->s_umount)) {
> + writeback_inodes_sb(sb);
> + up_read(&sb->s_umount);
> + return 1;
> + }
> + }
> + return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>  
> @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>  unsigned long nr)
>  {
>   if (!writeback_in_progress(sb->s_bdi)) {
> - down_read(&sb->s_umount);
> - writeback_inodes_sb_nr(sb, nr);
> - up_read(&sb->s_umount);
> - return 1;
> - } else
> - return 0;
> + if (down_read_trylock(&sb->s_umount)) {
> + writeback_inodes_sb_nr(sb, nr);
> + up_read(&sb->s_umount);
> + return 1;
> + }
> + }
> + return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] Ext4: fail if we try to use hole punch

2010-11-16 Thread Jan Kara
On Mon 15-11-10 12:05:21, Josef Bacik wrote:
> Ext4 doesn't have the ability to punch holes yet, so make sure we return
> EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
> be added later.  Thanks,
> 
> Signed-off-by: Josef Bacik 
  Acked-by: Jan Kara 

Honza
> ---
>  fs/ext4/extents.c |4 
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0554c48..35bca73 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3622,6 +3622,10 @@ long ext4_fallocate(struct inode *inode, int mode, 
> loff_t offset, loff_t len)
>   struct ext4_map_blocks map;
>   unsigned int credits, blkbits = inode->i_blkbits;
>  
> + /* We only support the FALLOC_FL_KEEP_SIZE mode */
> + if (mode && (mode != FALLOC_FL_KEEP_SIZE))
> + return -EOPNOTSUPP;
> +
>   /*
>* currently supporting (pre)allocate mode for extent-based
>* files _only_
> -- 
> 1.6.6.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] Ocfs2: handle hole punching via fallocate properly

2010-11-16 Thread Jan Kara
On Mon 15-11-10 12:05:20, Josef Bacik wrote:
> This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch
> flag in fallocate.  I didn't test it, but it seems simple enough.  Thanks,
> 
> Signed-off-by: Josef Bacik 
  You might want to directly CC Joel Becker  who
maintains OCFS2. Otherwise the patch looks OK so you can add
  Acked-by: Jan Kara 
for what it's worth ;).

Honza
> ---
>  fs/ocfs2/file.c |   10 --
>  1 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 77b4c04..181ae52 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1992,6 +1992,7 @@ static long ocfs2_fallocate(struct inode *inode, int 
> mode, loff_t offset,
>   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>   struct ocfs2_space_resv sr;
>   int change_size = 1;
> + int cmd = OCFS2_IOC_RESVSP64;
>  
>   if (!ocfs2_writes_unwritten_extents(osb))
>   return -EOPNOTSUPP;
> @@ -2002,12 +2003,17 @@ static long ocfs2_fallocate(struct inode *inode, int 
> mode, loff_t offset,
>   if (mode & FALLOC_FL_KEEP_SIZE)
>   change_size = 0;
>  
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + cmd = OCFS2_IOC_UNRESVSP64;
> + change_size = 0;
> + }
> +
>   sr.l_whence = 0;
>   sr.l_start = (s64)offset;
>   sr.l_len = (s64)len;
>  
> - return __ocfs2_change_file_space(NULL, inode, offset,
> -  OCFS2_IOC_RESVSP64, &sr, change_size);
> + return __ocfs2_change_file_space(NULL, inode, offset, cmd, &sr,
> +  change_size);
>  }
>  
>  int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
> -- 
> 1.6.6.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-16 Thread Jan Kara
On Tue 16-11-10 12:16:11, Jan Kara wrote:
> On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > diff --git a/fs/open.c b/fs/open.c
> > index 4197b9e..ab8dedf 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t 
> > offset, loff_t len)
> > return -EINVAL;
> >  
> > /* Return error if mode is not supported */
> > -   if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > +   if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
>   Why not just:
> if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
  And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
much which way but keeping it ambiguous (ignored) in the interface usually
proves as a bad idea in future when we want to further extend the interface...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-16 Thread Jan Kara
On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> diff --git a/fs/open.c b/fs/open.c
> index 4197b9e..ab8dedf 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t 
> offset, loff_t len)
>   return -EINVAL;
>  
>   /* Return error if mode is not supported */
> - if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> + if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
  Why not just:
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?

> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3c15510..851cba2 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -2,6 +2,7 @@
>  #define _FALLOC_H_
>  
>  #define FALLOC_FL_KEEP_SIZE  0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE 0X02 /* de-allocates range */
 ^ use lowercase 'x' please...


Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-09 Thread Jan Kara
On Tue 09-11-10 16:41:47, Ted Ts'o wrote:
> On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
> > Implementation is up to the filesystem. However, XFS does (b)
> > because:
> > 
> > 1) it was extremely simple to implement (one of the
> >advantages of having an exceedingly complex allocation
> >interface to begin with :P)
> > 2) conversion is atomic, fast and reliable
> > 3) it is independent of the underlying storage; and
> > 4) reads of unwritten extents operate at memory speed,
> >not disk speed.
> 
> Yeah, I was thinking that using a device-style TRIM might be better
> since future attempts to write to it won't require a separate seek to
> modify the extent tree.  But yeah, there are a bunch of advantages of
> simply mutating the extent tree.
> 
> While we're on the subject of changes to fallocate, what do people
> think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
> privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
> CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
> to fallocate blocks with the extent already marked initialized.  I've
> had two requests for such functionality for ext4 already.  
> 
> (Take for example a trusted cluster filesystem backend that checks the
> object checksum before returning any data to the user; and if the
> check fails the cluster file system will try to use some other replica
> stored on some other server.)
  Hum, could you elaborate a bit? I fail to see how above fallocate() flag
could be used to help solving this problem... Just curious...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 16:10:06, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 02:18:08PM +0200, Jan Kara wrote:
> > On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
> > > ---
> > > From: Christoph Hellwig 
> > > Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes
> > > 
> > ...
> > > The one exception for now is the block device filesystem which really
> > > wants different writeback contexts for it's different (internal) inodes
> > > to handle the writeout more efficiently.  For now we do this with
> > > a hack in fs-writeback.c because we're so late in the cycle, but in
> > > the future I plan to replace this with a superblock method that allows
> > > for multiple writeback contexts per filesystem.
> >   Another exception I know about is mtd_inodefs filesystem
> > (drivers/mtd/mtdchar.c).
> 
> No, it's not.  MTD only has three different backing_dev_info instances
> which have different flags in the mapping-relevant portion of the
> backing_dev. 
  In the end I agree I was probably wrong but it's not that simple ;)

> >   So at least here you'd need also add a similar exception for
> > "mtd_inodefs".
> 
> No.  For one thing we don't need any exception for correctnes alone -
> even the block device variant would work fine with the default case.
Here I don't agree. If you don't have some kind of exception, sb->s_bdi
for both "block" and "mtd_inodefs" filesystems points to
noop_backing_dev_info and you get no writeback for that one. So it isn't
just a performance issue but also a correctness one.

Regarding mtd_inodefs I now looked in more detail what MTD actually does
and it seems to me that MTD device inodes do not seem to carry any
cached state that flusher threads could write back. So returning
noop_backing_dev_info might be the right thing for them after all...
(added David Woodhouse and MTD list to CC so that they can shout if it's
not the case). Coming to this conclusion, I'm happy with your patch going
in as is...
Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Mon 27-09-10 20:55:45, Cesar Eduardo Barros wrote:
> Em 27-09-2010 19:25, Jan Kara escreveu:
> >[Added CCs for similar ecryptfs warning]
> >On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> >>>[...]
> >>>device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342
> >>>/dev/mapper/vg_cesarbinspiro-lv_home
> >>>SELinux: initialized (dev dm-3, type btrfs), uses xattr
> >>>[ cut here ]
> >>>WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> >>>Hardware name: Inspiron N4010
> >>>Dirtiable inode bdi default != sb bdi btrfs
> >   That suggests that we should probably handle such cases in a more generic
> >way by changing the code in inode_init_always(). The patch below makes at
> >least btrfs happy for me... Could you maybe test it? Thanks.
> 
> Applied on top of v2.6.36-rc5-151-g32163f4, running it right now.
> The warning messages no longer happen, and everything seems to be
> working fine.
> 
> Tested-by: Cesar Eduardo Barros 
  Great, thanks for testing.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Tue 28-09-10 10:05:49, Artem Bityutskiy wrote:
> On Mon, 2010-09-27 at 18:54 -0400, Chris Mason wrote:
> > On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote:
> > > [Added CCs for similar ecryptfs warning]
> > > On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> > > > > This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
> > > > > happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
> > > > > commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
> > > > > warning.
> > > > > [...]
> > > > > device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> > > > > /dev/mapper/vg_cesarbinspiro-lv_home
> > > > > SELinux: initialized (dev dm-3, type btrfs), uses xattr
> > > > > [ cut here ]
> > > > > WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> > > > > Hardware name: Inspiron N4010
> > > > > Dirtiable inode bdi default != sb bdi btrfs
> > > > > Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> > > > > snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> > > > > iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> > > > > snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> > > > > cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop 
> > > > > snd 
> > > > > pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> > > > > rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd 
> > > > > aes_x86_64 
> > > > > aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> > > > > i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> > > > > Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> > > > > Call Trace:
> > > > >   [] warn_slowpath_common+0x85/0x9d
> > > > >   [] warn_slowpath_fmt+0x46/0x48
> > > > >   [] inode_to_bdi+0x62/0x6d
> > > > >   [] __mark_inode_dirty+0xd0/0x177
> > > > >   [] touch_atime+0x107/0x12a
> > > > >   [] ? filldir+0x0/0xd0
> > > > >   [] vfs_readdir+0x8d/0xb4
> > > > >   [] sys_getdents+0x81/0xd1
> > > > >   [] system_call_fastpath+0x16/0x1b
> > >   Thanks for the report. These bdi pointers are a mess. As Chris pointed
> > > out, btrfs forgets to properly initialize 
> > > inode->i_mapping.backing_dev_info
> > > for directories and special inodes and thus these were previously attached
> > > to default_backing_dev_info which probably isn't what Chris would like to
> > > see.
> > 
> > There's no actual writeback for these, so it works fine for btrfs either
> > way.
> 
> Side note: every time inode is marked as dirty, we wake up a bdi thread
> or the default bdi thread. So if we have inodes which do not need
> write-back, we should never mark them as dirty.
  Are you sure? I think we wake up the thread only when it's the first
dirty inode for the bdi...
  And a side side note ;): It's harder not to dirty the inode than it seems.
E.g. btrfs (or similarly ext3) add the new inode data to the journal already
at inode dirty time but still they need to track that the transaction
carrying the inode is still uncommitted. Thus the inode *is* dirty in some
sense. Only it does not need any writeout to happen...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
> ---
> From: Christoph Hellwig 
> Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes
> 
...
> The one exception for now is the block device filesystem which really
> wants different writeback contexts for it's different (internal) inodes
> to handle the writeout more efficiently.  For now we do this with
> a hack in fs-writeback.c because we're so late in the cycle, but in
> the future I plan to replace this with a superblock method that allows
> for multiple writeback contexts per filesystem.
  Another exception I know about is mtd_inodefs filesystem
(drivers/mtd/mtdchar.c).

> Signed-off-by: Christoph Hellwig 
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===
> --- linux-2.6.orig/fs/fs-writeback.c  2010-09-29 16:58:41.750557721 +0900
> +++ linux-2.6/fs/fs-writeback.c   2010-09-29 17:11:35.040557719 +0900
> @@ -72,22 +72,10 @@ int writeback_in_progress(struct backing
>  static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
>  {
>   struct super_block *sb = inode->i_sb;
> - struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
>  
> - /*
> -  * For inodes on standard filesystems, we use superblock's bdi. For
> -  * inodes on virtual filesystems, we want to use inode mapping's bdi
> -  * because they can possibly point to something useful (think about
> -  * block_dev filesystem).
> -  */
> - if (sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) {
> - /* Some device inodes could play dirty tricks. Catch them... */
> - WARN(bdi != sb->s_bdi && bdi_cap_writeback_dirty(bdi),
> - "Dirtiable inode bdi %s != sb bdi %s\n",
> - bdi->name, sb->s_bdi->name);
> - return sb->s_bdi;
> - }
> - return bdi;
> + if (strcmp(sb->s_type->name, "bdev") == 0)
> + return inode->i_mapping->backing_dev_info;
> + return sb->s_bdi;
  So at least here you'd need also add a similar exception for
"mtd_inodefs". Because of these exeptions I've chosen the
(sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) check rather than your
exception based check. All in all I don't care much what ends up in the
kernel as it's just a temporary solution...
  Also I've added the warning to catch situations where inodes would get
filed to a different backing device after the patch. So far the reported
warnings were harmless but still I'm more comfortable when it's there
because otherwise we can so easily miss some device-driver-invented
filesystem like mtd_inodefs which would break silently after the change...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-27 Thread Jan Kara
On Mon 27-09-10 18:54:52, Chris Mason wrote:
> On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote:
> > [Added CCs for similar ecryptfs warning]
> > On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> > > > This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
> > > > happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
> > > > commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
> > > > warning.
> > > > [...]
> > > > device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> > > > /dev/mapper/vg_cesarbinspiro-lv_home
> > > > SELinux: initialized (dev dm-3, type btrfs), uses xattr
> > > > [ cut here ]
> > > > WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> > > > Hardware name: Inspiron N4010
> > > > Dirtiable inode bdi default != sb bdi btrfs
> > > > Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> > > > snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> > > > iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> > > > snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> > > > cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
> > > > pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> > > > rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
> > > > aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> > > > i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> > > > Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> > > > Call Trace:
> > > >   [] warn_slowpath_common+0x85/0x9d
> > > >   [] warn_slowpath_fmt+0x46/0x48
> > > >   [] inode_to_bdi+0x62/0x6d
> > > >   [] __mark_inode_dirty+0xd0/0x177
> > > >   [] touch_atime+0x107/0x12a
> > > >   [] ? filldir+0x0/0xd0
> > > >   [] vfs_readdir+0x8d/0xb4
> > > >   [] sys_getdents+0x81/0xd1
> > > >   [] system_call_fastpath+0x16/0x1b
> >   Thanks for the report. These bdi pointers are a mess. As Chris pointed
> > out, btrfs forgets to properly initialize inode->i_mapping.backing_dev_info
> > for directories and special inodes and thus these were previously attached
> > to default_backing_dev_info which probably isn't what Chris would like to
> > see.
> 
> There's no actual writeback for these, so it works fine for btrfs either
> way.
  Ah, OK. I see you do all the writing already at inode dirtying time so
you don't really care what happens after the dirtying for inodes without
page cache. So it's really just a cosmetic issue for btrfs.

> >   I've also got a similar report for ecryptfs which also does not
> > initialize inode->i_mapping.backing_dev_info although it sets sb->s_bdi and
> > thus again its inodes get filed to default_backing_dev_info lists.  Quick
> > search seems to reveal that other filesystems using handcrafted bdi's get
> > it wrong as well and thus their inodes end up in the 
> > default_backing_dev_info
> > lists which is generally undesirable (this was happening already before my
> > patch, my code just started complaining about that).
> >   That suggests that we should probably handle such cases in a more generic
> > way by changing the code in inode_init_always(). The patch below makes at
> > least btrfs happy for me... Could you maybe test it? Thanks.
> 
> Christoph had a slightly different plan for this, I've cc'd him and kept
> the patch below for comment.
  OK, thanks.

Honza
> > ---
> > 
> > From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001
> > From: Jan Kara 
> > Date: Mon, 27 Sep 2010 23:56:48 +0200
> > Subject: [PATCH] bdi: Initialize inode->i_mapping.backing_dev_info to 
> > sb->s_bdi
> > 
> > Currently, we initialize inode->i_mapping.backing_dev_info to the bdi of 
> > device
> > sb->s_bdev points to. However there is quite a big number of filesystems 
> > that
> > do not set sb->s_bdev (because they do not have one) but do set sb->s_bdi.
> > These filesystems would generally benefit from setting
> > inode->i_mapping.backing_dev_info to their s_bdi because otherwise their 
> > inodes
> > would point to default_backing_dev_info and thus dirty inode tracking would
> > happen there. So change inode initialization cod

Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-27 Thread Jan Kara
[Added CCs for similar ecryptfs warning]
On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> > This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
> > happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
> > commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
> > warning.
> > [...]
> > device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> > /dev/mapper/vg_cesarbinspiro-lv_home
> > SELinux: initialized (dev dm-3, type btrfs), uses xattr
> > [ cut here ]
> > WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> > Hardware name: Inspiron N4010
> > Dirtiable inode bdi default != sb bdi btrfs
> > Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> > snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> > iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> > snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> > cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
> > pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> > rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
> > aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> > i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> > Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> > Call Trace:
> >   [] warn_slowpath_common+0x85/0x9d
> >   [] warn_slowpath_fmt+0x46/0x48
> >   [] inode_to_bdi+0x62/0x6d
> >   [] __mark_inode_dirty+0xd0/0x177
> >   [] touch_atime+0x107/0x12a
> >   [] ? filldir+0x0/0xd0
> >   [] vfs_readdir+0x8d/0xb4
> >   [] sys_getdents+0x81/0xd1
> >   [] system_call_fastpath+0x16/0x1b
  Thanks for the report. These bdi pointers are a mess. As Chris pointed
out, btrfs forgets to properly initialize inode->i_mapping.backing_dev_info
for directories and special inodes and thus these were previously attached
to default_backing_dev_info which probably isn't what Chris would like to
see.
  I've also got a similar report for ecryptfs which also does not
initialize inode->i_mapping.backing_dev_info although it sets sb->s_bdi and
thus again its inodes get filed to default_backing_dev_info lists.  Quick
search seems to reveal that other filesystems using handcrafted bdi's get
it wrong as well and thus their inodes end up in the default_backing_dev_info
lists which is generally undesirable (this was happening already before my
patch, my code just started complaining about that).
  That suggests that we should probably handle such cases in a more generic
way by changing the code in inode_init_always(). The patch below makes at
least btrfs happy for me... Could you maybe test it? Thanks.

Honza
-- 
Jan Kara 
SUSE Labs, CR
---

>From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 27 Sep 2010 23:56:48 +0200
Subject: [PATCH] bdi: Initialize inode->i_mapping.backing_dev_info to sb->s_bdi

Currently, we initialize inode->i_mapping.backing_dev_info to the bdi of device
sb->s_bdev points to. However there is quite a big number of filesystems that
do not set sb->s_bdev (because they do not have one) but do set sb->s_bdi.
These filesystems would generally benefit from setting
inode->i_mapping.backing_dev_info to their s_bdi because otherwise their inodes
would point to default_backing_dev_info and thus dirty inode tracking would
happen there. So change inode initialization code to use sb->s_bdi if it
is available.

Signed-off-by: Jan Kara 
---
 fs/inode.c |   22 ++
 1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8646433..e415be4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -172,15 +172,21 @@ int inode_init_always(struct super_block *sb, struct 
inode *inode)
mapping->writeback_index = 0;
 
/*
-* If the block_device provides a backing_dev_info for client
-* inodes then use that.  Otherwise the inode share the bdev's
-* backing_dev_info.
+* If the filesystem provides a backing_dev_info for client inodes
+* then use that. Otherwise inodes share default_backing_dev_info.
 */
-   if (sb->s_bdev) {
-   struct backing_dev_info *bdi;
-
-   bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-   mapping->backing_dev_info = bdi;
+   if (sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) {
+   /*
+* Catch cases where filesystem might be bitten by using s_bdi
+* instead of sb->s_bdev. Can be removed in 2.6.38.
+ 

Re: [patch v2 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-09-02 Thread Jan Kara
On Thu 02-09-10 09:59:13, Jiri Slaby wrote:
> On 09/02/2010 03:02 AM, David Rientjes wrote:
> > --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -334,6 +334,57
> > @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
> > return kmalloc_node(size, flags | __GFP_ZERO, node); }
> >  
> > +/** + * kmalloc_nofail - infinitely loop until kmalloc() succeeds.  +
> > * @size: how many bytes of memory are required.  + * @flags: the type
> > of memory to allocate (see kmalloc).  + * + * NOTE: no new callers of
> > this function should be implemented!  + * All memory allocations should
> > be failable whenever possible.  + */ +static inline void
> > *kmalloc_nofail(size_t size, gfp_t flags) +{ +  void *ret; + +  for
> > (;;) { +ret = kmalloc(size, flags); +   if (ret) +
> > return ret; +   WARN_ON_ONCE(get_order(size) >
> > PAGE_ALLOC_COSTLY_ORDER);
> 
> This doesn't work as you expect. kmalloc will warn every time it fails.
> __GFP_NOFAIL used to disable the warning. Actually what's wrong with
> __GFP_NOFAIL? I cannot find a reason in the changelogs why the patches
> are needed.
  David should probably add the reasoning to the changelogs so that he
doesn't have to explain again and again ;). But if I understood it
correctly, the concern is that the looping checks slightly impact fast path
of the callers which do not need it. Generally, also looping for a long
time inside allocator isn't a nice thing but some callers aren't able to do
better for now to the patch is imperfect in this sence...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-24 Thread Jan Kara
On Tue 24-08-10 03:50:19, David Rientjes wrote:
> Add kmalloc_nofail(), kcalloc_nofail(), and kzalloc_nofail().  These
> functions are equivalent to kmalloc(), kcalloc(), and kzalloc(),
> respectively, except that they will never return NULL and instead loop
> forever trying to allocate memory.
> 
> If the first allocation attempt fails, a warning will be emitted,
> including a call trace.  Subsequent failures will suppress this warning.
> 
> These were added as helper functions for documentation and auditability.
> No future callers should be added.
  This looks like a cleaner approach than the previous one. You can add
Acked-by: Jan Kara 
  for the JBD part.

Honza
> Signed-off-by: David Rientjes 
> ---
>  drivers/md/dm-region-hash.c |2 +-
>  fs/btrfs/inode.c|2 +-
>  fs/gfs2/log.c   |2 +-
>  fs/gfs2/rgrp.c  |   18 -
>  fs/jbd/transaction.c|   11 ++--
>  fs/reiserfs/journal.c   |3 +-
>  include/linux/slab.h|   55 
> +++
>  7 files changed, 68 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
> --- a/drivers/md/dm-region-hash.c
> +++ b/drivers/md/dm-region-hash.c
> @@ -290,7 +290,7 @@ static struct dm_region *__rh_alloc(struct dm_region_hash 
> *rh, region_t region)
>  
>   nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
>   if (unlikely(!nreg))
> - nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
> + nreg = kmalloc_nofail(sizeof(*nreg), GFP_NOIO);
>  
>   nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
> DM_RH_CLEAN : DM_RH_NOSYNC;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1967,7 +1967,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>   if (atomic_add_unless(&inode->i_count, -1, 1))
>   return;
>  
> - delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
> + delayed = kmalloc_nofail(sizeof(*delayed), GFP_NOFS);
>   delayed->inode = inode;
>  
>   spin_lock(&fs_info->delayed_iput_lock);
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -709,7 +709,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
> gfs2_glock *gl)
>   }
>   trace_gfs2_log_flush(sdp, 1);
>  
> - ai = kzalloc(sizeof(struct gfs2_ail), GFP_NOFS | __GFP_NOFAIL);
> + ai = kzalloc_nofail(sizeof(struct gfs2_ail), GFP_NOFS);
>   INIT_LIST_HEAD(&ai->ai_ail1_list);
>   INIT_LIST_HEAD(&ai->ai_ail2_list);
>  
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1440,8 +1440,8 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd 
> *sdp, u64 bstart,
>   rgrp_blk++;
>  
>   if (!bi->bi_clone) {
> - bi->bi_clone = kmalloc(bi->bi_bh->b_size,
> -GFP_NOFS | __GFP_NOFAIL);
> + bi->bi_clone = kmalloc_nofail(bi->bi_bh->b_size,
> +GFP_NOFS);
>   memcpy(bi->bi_clone + bi->bi_offset,
>  bi->bi_bh->b_data + bi->bi_offset,
>  bi->bi_len);
> @@ -1759,9 +1759,6 @@ fail:
>   * @block: the block
>   *
>   * Figure out what RG a block belongs to and add that RG to the list
> - *
> - * FIXME: Don't use NOFAIL
> - *
>   */
>  
>  void gfs2_rlist_add(struct gfs2_sbd *sdp, struct gfs2_rgrp_list *rlist,
> @@ -1789,8 +1786,8 @@ void gfs2_rlist_add(struct gfs2_sbd *sdp, struct 
> gfs2_rgrp_list *rlist,
>   if (rlist->rl_rgrps == rlist->rl_space) {
>   new_space = rlist->rl_space + 10;
>  
> - tmp = kcalloc(new_space, sizeof(struct gfs2_rgrpd *),
> -   GFP_NOFS | __GFP_NOFAIL);
> + tmp = kcalloc_nofail(new_space, sizeof(struct gfs2_rgrpd *),
> +   GFP_NOFS);
>  
>   if (rlist->rl_rgd) {
>   memcpy(tmp, rlist->rl_rgd,
> @@ -1811,17 +1808,14 @@ void gfs2_rlist_add(struct gfs2_sbd *sdp, struct 
> gfs2_rgrp_list *rlist,
>   * @rlist: the list of resource groups
>   * @state: the lock state to acquire the RG lock in
>   * @flags: the modifier flags for the holder structures
> - *
> - * FIXME: Don't use NOFAIL
> - *
>   */
>  
>  void gfs2_rlist_alloc(struct gfs

[PATCH] btrfs: Do not return more items that user asked from from search ioctl

2010-08-20 Thread Jan Kara
While searching a tree we didn't properly check number of items we really
stored in user's buffer thus possibly exceeding number of items requested
by user. This was mostly harmless since actual buffer overflow is checked
correctly in a different place. Anyway, let's fix the check.

Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..94e7ab5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -977,7 +977,7 @@ static noinline int copy_to_sk(struct btrfs_root *root,
}
found++;
 
-   if (*num_found >= sk->nr_items)
+   if (*num_found + found >= sk->nr_items)
break;
}
 advance_key:
-- 
1.6.4.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Usefulness of max_transid in the tree search?

2010-08-20 Thread Jan Kara
  Hi,

  I just had a look at how btrfs ioctl searches the tree for new items and
while looking at the code I've noticed that maximum transaction id as
specified in the search is rather useless.
  If I understand right, transaction id is stored in a header of each node.
Thus when any item of the node gets modified, new transaction id gets
recorded. So when you give the search some maximum transaction id, you will
not receive items in nodes where some item has been modified in a later
transaction. That seems like a rather useless set of items to me... Or am I
missing something?

Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


<    1   2   3