Re: [RFC] fsblock

2007-06-27 Thread Chris Mason
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > 
> > [ ... fsblocks vs extent range mapping ]
> > 
> > > iomaps can double as range locks simply because iomaps are
> > > expressions of ranges within the file.  Seeing as you can only
> > > access a given range exclusively to modify it, inserting an empty
> > > mapping into the tree as a range lock gives an effective method of
> > > allowing safe parallel reads, writes and allocation into the file.
> > > 
> > > The fsblocks and the vm page cache interface cannot be used to
> > > facilitate this because a radix tree is the wrong type of tree to
> > > store this information in. A sparse, range based tree (e.g. btree)
> > > is the right way to do this and it matches very well with
> > > a range based API.
> > 
> > I'm really not against the extent based page cache idea, but I kind of
> > assumed it would be too big a change for this kind of generic setup.  At
> > any rate, if we'd like to do it, it may be best to ditch the idea of
> > "attach mapping information to a page", and switch to "lookup mapping
> > information and range locking for a page".
> 
> Well the get_block equivalent API is extent based one now, and I'll
> look at what is required in making map_fsblock a more generic call
> that could be used for an extent-based scheme.
> 
> An extent based thing IMO really isn't appropriate as the main generic
> layer here though. If it is really useful and popular, then it could
> be turned into generic code and sit along side fsblock or underneath
> fsblock...

Lets look at a typical example of how IO actually gets done today,
starting with sys_write():

sys_write(file, buffer, 1MB)
for each page:
prepare_write()
allocate contiguous chunks of disk
attach buffers
copy_from_user()
commit_write()
dirty buffers

pdflush:
writepages()
find pages with contiguous chunks of disk
build and submit large bios

So, we replace prepare_write and commit_write with an extent based api,
but we keep the dirty each buffer part.  writepages has to turn that
back into extents (bio sized), and the result is completely full of dark
dark corner cases.

I do think fsblocks is a nice cleanup on its own, but Dave has a good
point that it makes sense to look for ways generalize things even more.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-28 Thread Chris Mason
On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> > On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > > Lets look at a typical example of how IO actually gets done today,
> > > starting with sys_write():
> > > 
> > > sys_write(file, buffer, 1MB)
> > > for each page:
> > > prepare_write()
> > >   allocate contiguous chunks of disk
> > > attach buffers
> > > copy_from_user()
> > > commit_write()
> > > dirty buffers
> > > 
> > > pdflush:
> > > writepages()
> > > find pages with contiguous chunks of disk
> > >   build and submit large bios
> > > 
> > > So, we replace prepare_write and commit_write with an extent based api,
> > > but we keep the dirty each buffer part.  writepages has to turn that
> > > back into extents (bio sized), and the result is completely full of dark
> > > dark corner cases.
> 
> That's true but I don't think an extent data structure means we can
> become too far divorced from the pagecache or the native block size
> -- what will end up happening is that often we'll need "stuff" to map
> between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does:
mapping_info = page->something, where something is attached on a per
page basis.  What we really want is mapping_info = lookup_mapping(page),
where that function goes and finds something stored on a per extent
basis, with extra bits for tracking dirty and locked state.

Ideally, in at least some of the cases the dirty and locked state could
be at an extent granularity (streaming IO) instead of the block
granularity (random IO).

In my little brain, even block based filesystems should be able to take
advantage of this...but such things are always easier to believe in
before the coding starts.

> 
> But the point is taken, and I do believe that at least for APIs, extent
> based seems like the best way to go. And that should allow fsblock to
> be replaced or augmented in future without _too_ much pain.
> 
>  
> > Yup - I've been on the painful end of those dark corner cases several
> > times in the last few months.
> > 
> > It's also worth pointing out that mpage_readpages() already works on
> > an extent basis - it overloads bufferheads to provide a "map_bh" that
> > can point to a range of blocks in the same state. The code then iterates
> > the map_bh range a page at a time building bios (i.e. not even using
> > buffer heads) from that map..
> 
> One issue I have with the current nobh and mpage stuff is that it
> requires multiple calls into get_block (first to prepare write, then
> to writepage), it doesn't allow filesystems to attach resources
> required for writeout at prepare_write time, and it doesn't play nicely
> with buffers in general. (not to mention that nobh error handling is
> buggy).
> 
> I haven't done any mpage-like code for fsblocks yet, but I think they
> wouldn't be too much trouble, and wouldn't have any of the above
> problems...

Could be, but the fundamental issue of sometimes pages have mappings
attached and sometimes they don't is still there.  The window is
smaller, but non-zero.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] extent mapped page cache

2007-07-18 Thread Chris Mason
On Thu, 12 Jul 2007 00:00:28 -0700
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> On Tuesday 10 July 2007 14:03, Chris Mason wrote:
> > This patch aims to demonstrate one way to replace buffer heads with
> > a few extent trees...
> 
> Hi Chris,
> 
> Quite terse commentary on algorithms and data structures, but I
> suppose that is not a problem because Jon has a whole week to reverse
> engineer it for us.
> 
> What did you have in mind for subpages?
> 

This partially depends on input here.  The goal is to have one
interface that works for subpages, highmem and superpages, and for
the FS maintainers to not care if the mappings come magically from
clameter's work or vmap or whatever.

Given the whole extent based theme, I plan on something like this:

struct extent_ptr {
char *ptr;
some way to indicate size and type of map
struct page pages[];
};

struct extent_ptr *alloc_extent_ptr(struct extent_map_tree *tree,
u64 start, u64 end);
void free_extent_ptr(struct extent_map_tree *tree,
 struct extent_ptr *ptr);

And then some calls along the lines of kmap/kunmap that gives you a
pointer you can use for accessing the ram.  read/write calls would also
be fine by me, but harder to convert filesystems to use.

The struct extent_ptr would increase the ref count on the pages, but
the pages would have no back pointers to it.  All
dirty/locked/writeback state would go in the extent state tree and would
not be stored in the struct extent_ptr.  

The idea is to make a simple mapping entity, and not complicate it
by storing FS specific state in there. It could be variably sized to
hold an array of pages, and allocated via kmap.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE] seekwatcher IO graphing v0.2

2007-07-23 Thread Chris Mason

Hello everyone,

Since doing the initial Btrfs benchmarks, I've made my blktrace graphing
utility a little more generic and tossed it out on oss.oracle.com.

This new version can easily graph two different runs, and has a few
other tweaks that make the graphs look nicer.

Docs, examples and other details are at:

http://oss.oracle.com/~mason/seekwatcher

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-21 Thread Chris Mason
On Sun, 12 Aug 2007 17:11:20 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> Andrew and Ken,
> 
> Here are some more experiments on the writeback stuff.
> Comments are highly welcome~ 

I've been doing benchmarks lately to try and trigger fragmentation, and
one of them is a simulation of make -j N.  It takes a list of all
the .o files in the kernel tree, randomly sorts them and then
creates bogus files with the same names and sizes in clean kernel trees.

This is basically creating a whole bunch of files in random order in a
whole bunch of subdirectories.

The results aren't pretty:

http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png

The top graph shows one dot for each write over time.  It shows that
ext3 is basically writing all over the place the whole time.  But, ext3
actually wins the read phase, so the layout isn't horrible.  My guess
is that if we introduce some write clustering by sending a group of
inodes down at the same time, it'll go much much better.

Andrew has mentioned bringing a few radix trees into the writeback paths
before, it seems like file servers and other general uses will benefit
from better clustering here.

I'm hoping to talk you into trying it out ;)

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-22 Thread Chris Mason
On Wed, 22 Aug 2007 09:18:41 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > On Sun, 12 Aug 2007 17:11:20 +0800
> > Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > Andrew and Ken,
> > > 
> > > Here are some more experiments on the writeback stuff.
> > > Comments are highly welcome~ 
> > 
> > I've been doing benchmarks lately to try and trigger fragmentation,
> > and one of them is a simulation of make -j N.  It takes a list of
> > all the .o files in the kernel tree, randomly sorts them and then
> > creates bogus files with the same names and sizes in clean kernel
> > trees.
> > 
> > This is basically creating a whole bunch of files in random order
> > in a whole bunch of subdirectories.
> > 
> > The results aren't pretty:
> > 
> > http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
> > 
> > The top graph shows one dot for each write over time.  It shows that
> > ext3 is basically writing all over the place the whole time.  But,
> > ext3 actually wins the read phase, so the layout isn't horrible.
> > My guess is that if we introduce some write clustering by sending a
> > group of inodes down at the same time, it'll go much much better.
> > 
> > Andrew has mentioned bringing a few radix trees into the writeback
> > paths before, it seems like file servers and other general uses
> > will benefit from better clustering here.
> > 
> > I'm hoping to talk you into trying it out ;)
> 
> Thank you for the description of problem. So far I have a similar one
> in mind: if we are to delay writeback of atime-dirty-only inodes to
> above 1 hour, some grouping/piggy-backing scenario would be
> beneficial.  (Which I guess does not deserve the complexity now that
> we have Ingo's make-reltime-default patch.)

Good clustering would definitely help some delayed atime writeback
scheme.

> 
> My vague idea is to
> - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
> queue.
> - convert s_dirty to some radix-tree/rbtree based data structure.
>   It would have dual functions: delayed-writeback and
> clustered-writeback. 
> clustered-writeback:
> - Use inode number as clue of locality, hence the key for the sorted
>   tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
>   but do it in the ascending order of inode number instead of
>   ->dirtied_when. 
> 
> delayed-writeback:
> - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
>   dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers.  We want to be able to pick a start
time (right now) and find all the inodes older than that start time.
New things will come in while we're scanning.  But perhaps that's what
you're saying...

At any rate, we've got two types of lists now.  One keeps track of age
and the other two keep track of what is currently being written.  I
would try two things:

1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
indexes by inode number (or some arbitrary field the FS can set in the
inode).  Radix tree tags are used to indicate which things in s_io are
already in progress or are pending (hand waving because I'm not sure
exactly).

inodes are pulled off s_dirty and the corresponding slot in s_io is
tagged to indicate IO has started.  Any nearby inodes in s_io are also
sent down.

2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
sequence number that corresponds to age.  It is treated as a big
circular indexed list that can wrap around over time.  Radix tree tags
are used both on s_dirty and s_io to flag which inodes are in progress.

> 
> Notes:
> (1) I'm not sure inode number is correlated to disk location in
> filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time.  It may
make sense to one day let the FS provide a clustering hint
(corresponding to the first block in the file?), but for starters it
makes sense to just go with the inode number.

> (2) It duplicates some function of elevators. Why is it necessary?
> Maybe we have no clue on the exact data location at this time?

The elevator can only sort the pending IO, and we send down a
relatively small window of all the dirty pages at a time.  If we sent
down all the dirty pages and let the elevator sort it out, we wouldn't
need this clustering at all.

But, that has other issues ;)

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-23 Thread Chris Mason
On Thu, 23 Aug 2007 12:47:23 +1000
David Chinner <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > I think we should assume a full scan of s_dirty is impossible in the
> > presence of concurrent writers.  We want to be able to pick a start
> > time (right now) and find all the inodes older than that start time.
> > New things will come in while we're scanning.  But perhaps that's
> > what you're saying...
> > 
> > At any rate, we've got two types of lists now.  One keeps track of
> > age and the other two keep track of what is currently being
> > written.  I would try two things:
> > 
> > 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
> > indexes by inode number (or some arbitrary field the FS can set in
> > the inode).  Radix tree tags are used to indicate which things in
> > s_io are already in progress or are pending (hand waving because
> > I'm not sure exactly).
> > 
> > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > tagged to indicate IO has started.  Any nearby inodes in s_io are
> > also sent down.
> 
> the problem with this approach is that it only looks at inode
> locality. Data locality is ignored completely here and the data for
> all the inodes that are close together could be splattered all over
> the drive. In that case, clustering by inode location is exactly the
> wrong thing to do.

Usually it won't be less wrong than clustering by time.

> 
> For example, XFs changes allocation strategy at 1TB for 32bit inode
> filesystems which makes the data get placed way away from the inodes.
> i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
> by inode number for data writeback is mostly useless in the >1TB
> case.

I agree we'll want a way to let the FS provide the clustering key.  But
for the first cut on the patch, I would suggest keeping it simple.

> 
> The inode32 for <1Tb and inode64 allocators both try to keep data
> close to the inode (i.e. in the same AG) so clustering by inode number
> might work better here.
> 
> Also, it might be worthwhile allowing the filesystem to supply a
> hint or mask for "closeness" for inode clustering. This would help
> the gernic code only try to cluster inode writes to inodes that
> fall into the same cluster as the first inode

Yes, also a good idea after things are working.

> 
> > > Notes:
> > > (1) I'm not sure inode number is correlated to disk location in
> > > filesystems other than ext2/3/4. Or parent dir?
> > 
> > In general, it is a better assumption than sorting by time.  It may
> > make sense to one day let the FS provide a clustering hint
> > (corresponding to the first block in the file?), but for starters it
> > makes sense to just go with the inode number.
> 
> Perhaps multiple hints are needed - one for data locality and one
> for inode cluster locality.

So, my feature creep idea would have been more data clustering.  I'm
mainly trying to solve this graph:

http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png

Where background writing of the block device inode is making ext3 do
seeky writes while directory trees.  My simple idea was to kick
off a 'I've just written block X' call back to the FS, where it may
decide to send down dirty chunks of the block device inode that also
happen to be dirty.

But, maintaining the kupdate max dirty time and congestion limits in
the face of all this clustering gets tricky.  So, I wasn't going to
suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;)  But, lots of people will be
interested in the results.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-24 Thread Chris Mason
On Fri, 24 Aug 2007 21:24:58 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:
 
> > 2) s_dirty and s_io both become radix trees.  s_dirty is indexed by
> > a sequence number that corresponds to age.  It is treated as a big
> > circular indexed list that can wrap around over time.  Radix tree
> > tags are used both on s_dirty and s_io to flag which inodes are in
> > progress.
> 
> It's meaningless to convert s_io to radix tree. Because inodes on s_io
> will normally be sent to block layer elevators at the same time.

Not entirely, using a radix tree instead lets you tag things instead of
doing the current backflips across three lists.

> 
> Also s_dirty holds 30 seconds of inodes, while s_io only 5 seconds.
> The more inodes, the more chances of good clustering. That's the
> general rule.
> 
> s_dirty is the right place to do address-clustering.
> As for the dirty_expire_interval parameter on dirty age,
> we can apply a simple rule: do one full scan/sweep over the
> fs-address-space in every 30s, syncing all inodes encountered,
> and sparing those newly dirtied in less than 5s. With that rule,
> any inode will get synced after being dirtied for 5-35 seconds.

This gives you an O(inodes dirty) behavior instead of the current O(old
inodes).  It might not matter, but walking the radix tree is more
expensive than walking a list.

But, I look forward to your patches, we can tune from there.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: how do versioning filesystems take snapshot of opened files?

2007-07-03 Thread Chris Mason
On Tue, 3 Jul 2007 01:28:57 -0400
"Xin Zhao" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> 
> If a file is already opened when snapshot command is issued,  the file
> itself could be in an inconsistent state already. Before the file is
> closed, maybe part of the file contains old data, the rest contains
> new data.
> How does a versioning filesystem guarantee that the file snapshot is
> in a consistent state in this case?
> 
> I googled it but didn't find any answer. Can someone explain it a
> little bit?

It's the same answer as in most filesystem related questions...it
depends ;)  Consistent state means many different things.  It may mean
that the metadata accurately reflects the space on disk allocated to
the file and that all data for the file is properly on disk (ie from an
fsync).

But, even this is less than useful because very few files on the
filesystem stand alone.  Applications spread their state across a
number of files and so consistent means something different to
every application.

Getting a snapshot that is useful with respect to application data
requires help from the application.  The app needs to be shutdown or
paused prior to the snapshot and then started up again after the
snapshot is taken.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Versioning file system

2007-07-05 Thread Chris Mason
On Thu, 5 Jul 2007 09:57:40 -0400
"John Stoffel" <[EMAIL PROTECTED]> wrote:

> > "Erik" == Erik Mouw <[EMAIL PROTECTED]> writes:
> 
> Erik> (sorry for the late reply, just got back from holiday)
> Erik> On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote:
> >> As I mentioned in my Linux.conf.au presentation a year and a half
> >> ago, the main use of Streams in Windows to date has been for system
> >> crackers to hide trojan horse code and rootkits so that system
> >> administrators couldn't find them.  :-)
> 
> Erik> The only valid use of Streams in Windows I've seen was a virus
> Erik> checker that stored a hash of the file in a separate
> Erik> stream. Checking a file was a matter of rehashing it and
> Erik> comparing against the hash stored in the special hash data
> Erik> stream for that particular file.
> 
> So what was stopping a virus from infecting a file, re-computing the
> hash and pushing the new hash into the stream?  
> 
> You need to keep the computed hashes on Read-Only media for true
> security, once you let the system change them, then you're toast

I'm not a huge fan of streams, but I'm pretty sure there are various
encryption tools that let us verify and validate the source of data.
It's entirely possible the virus checker wasn't doing it right, but
storing verification info in an EA or stream isn't entirely invalid.

You still need an external key that you do trust of course.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] [RFC][PATCH] clustered writeback

2007-08-27 Thread Chris Mason
On Mon, 27 Aug 2007 05:03:36 -0700
Arjan van de Ven <[EMAIL PROTECTED]> wrote:

> On Mon, 27 Aug 2007 19:21:52 +0800
> > 
> > Because it does the work in small batches of 10 inodes, when the
> > system has <=10 dirty inodes, its behavior will reduce to:
> > - do a full sweep *at once* on every 25s
> > Which means the disk will flicker once every 25s, not bad :)
> 
> 25 seconds is quite not good already though it takes a disk a
> second or two of no activity to go into low power mode, every 25
> seconds means you now have at least a 10% constant power cost
> 
> I don't know the right answer (well other than "make sure inodes
> aren't dirty", which involves fixing apps to not do as much file
> operations, as well as relatime) but just "every 25s is no big deal"
> isn't really the case ;-(

But fixing this isn't the job of this patchIt needs something like
the laptop mode logic where it says o, the disk is awake, lets send
stuff down.

kupdate hitting on the disk isn't really a new problem, I'd rather
address it with a different patch series.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-28 Thread Chris Mason
On Wed, 29 Aug 2007 00:55:30 +1000
David Chinner <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > > 
> > > The correspond to the exact location on disk on XFS. But, XFS has
> > > it's own inode clustering (see xfs_iflush) and it can't be moved
> > > up into the generic layers because of locking and integration into
> > > the transaction subsystem.
> > >
> > > > (2) It duplicates some function of elevators. Why is it
> > > > necessary?
> > > 
> > > The elevators have no clue as to how the filesystem might treat
> > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > feature of the inode reading and writing and that is something no
> > > elevator can hope to acheive
> >  
> > Thank you. That explains the linear write curve(perfect!) in Chris'
> > graph.
> > 
> > I wonder if XFS can benefit any more from the general writeback
> > clustering. How large would be a typical XFS cluster?
> 
> Depends on inode size. typically they are 8k in size, so anything
> from 4-32 inodes. The inode writeback clustering is pretty tightly
> integrated into the transaction subsystem and has some intricate
> locking, so it's not likely to be easy (or perhaps even possible) to
> make it more generic.

When I talked to hch about this, he said the order file data pages got
written in XFS was still dictated by the order the higher layers sent
things down.  Shouldn't the clustering still help to have delalloc done
in inode order instead of in whatever random order pdflush sends things
down now?

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-28 Thread Chris Mason
On Wed, 29 Aug 2007 02:33:08 +1000
David Chinner <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > > > 
> > > > I wonder if XFS can benefit any more from the general writeback
> > > > clustering. How large would be a typical XFS cluster?
> > > 
> > > Depends on inode size. typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible)
> > > to make it more generic.
> > 
> > When I talked to hch about this, he said the order file data pages
> > got written in XFS was still dictated by the order the higher
> > layers sent things down.
> 
> Sure, that's file data. I was talking about the inode writeback, not
> the data writeback.

I think we're trying to gain different things from inode based
clustering...I'm not worried that the inode be next to the data.  I'm
going under the assumption that most of the time, the FS will try to
allocate inodes in groups in a directory, and so most of the time the
data blocks for inode N will be close to inode N+1.

So what I'm really trying for here is data block clustering when
writing multiple inodes at once.  This matters most when files are
relatively small and written in groups, which is a common workload.

It may make the most sense to change the patch to supply some key for
the data block clustering instead of the inode number, but its an easy
first pass.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] extent mapped page cache

2007-07-24 Thread Chris Mason
On Tue, 10 Jul 2007 17:03:26 -0400
Chris Mason <[EMAIL PROTECTED]> wrote:

> This patch aims to demonstrate one way to replace buffer heads with a
> few extent trees.  Buffer heads provide a few different features:
> 
> 1) Mapping of logical file offset to blocks on disk
> 2) Recording state (dirty, locked etc)
> 3) Providing a mechanism to access sub-page sized blocks.
> 
> This patch covers #1 and #2, I'll start on #3 a little later next
> week.
> 
Well, almost.  I decided to try out an rbtree instead of the radix,
which turned out to be much faster.  Even though individual operations
are slower, the rbtree was able to do many fewer ops to accomplish the
same thing, especially for merging extents together.  It also uses much
less ram.

This code still has lots of room for optimization, but it comes in at
around 2-5% more cpu time for ext2 streaming reads and writes.  I
haven't done readpages or writepages yet, so this is more or less a
worst case setup.  I'm comparing against ext2 with readpages and
writepages disabled.

The new code has the added benefit of passing fsx-linux, and not
triggering MCE's on my poor little test box.

The basic idea is to store state in byte ranges in an rbtree, and to
mirror that state down into individual pages.  This allows us to store
arbitrary state outside of the page struct, so we could include the pid
of the process that dirtied a page range for cfq purposes.  The example
readpage and writepage code is probably the easiest way to understand
the basic API.

A separate rbtree stores a mapping of byte offset in the file to byte
offset on disk.  This allows the filesystem to fill in mapping
information in bulk, and reduces the number of metadata lookups
required to do common operations.

Because the state and mapping information are separate from the page,
pages can come and go and their corresponding metadata can still be
cached (the current code drops mappings as the last page corresponding
to that mapping disappears).

Two patches follow, the core extent_map implementation and a sample
user (ext2).  This is pretty basic, implementing prepare/commit_write,
read/writepage and a few other funcs to exercise the new code.  Longer
term, it should fit in with Nick's other extent work instead of
prepare/commit_write.

My patch sets page->private to 1, really for no good reason.  It is
just a debugging aid I was using to make sure the page took the right
path down the line.  If this catches on, we might set it to a magic
value so you can if (ExtentPage(page)) or just leave it as null.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] extent mapped page cache main code

2007-07-24 Thread Chris Mason
Core Extentmap implementation

diff -r 126111346f94 -r 53cabea328f7 fs/Makefile
--- a/fs/Makefile   Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile   Tue Jul 24 15:40:27 2007 -0400
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o
+   stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 -r 53cabea328f7 fs/extent_map.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/fs/extent_map.c   Tue Jul 24 15:40:27 2007 -0400
@@ -0,0 +1,1591 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct kmem_cache *extent_map_cache;
+static struct kmem_cache *extent_state_cache;
+
+struct tree_entry {
+   u64 start;
+   u64 end;
+   int in_tree;
+   struct rb_node rb_node;
+};
+
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+void __init extent_map_init(void)
+{
+   extent_map_cache = kmem_cache_create("extent_map",
+   sizeof(struct extent_map), 0,
+   SLAB_RECLAIM_ACCOUNT |
+   SLAB_DESTROY_BY_RCU,
+   NULL, NULL);
+   extent_state_cache = kmem_cache_create("extent_state",
+   sizeof(struct extent_state), 0,
+   SLAB_RECLAIM_ACCOUNT |
+   SLAB_DESTROY_BY_RCU,
+   NULL, NULL);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree,
+ struct address_space *mapping, gfp_t mask)
+{
+   tree->map.rb_node = NULL;
+   tree->state.rb_node = NULL;
+   rwlock_init(>lock);
+   tree->mapping = mapping;
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+   struct extent_map *em;
+   em = kmem_cache_alloc(extent_map_cache, mask);
+   if (!em || IS_ERR(em))
+   return em;
+   em->in_tree = 0;
+   atomic_set(>refs, 1);
+   return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+   if (atomic_dec_and_test(>refs)) {
+   WARN_ON(em->in_tree);
+   kmem_cache_free(extent_map_cache, em);
+   }
+}
+EXPORT_SYMBOL(free_extent_map);
+
+struct extent_state *alloc_extent_state(gfp_t mask)
+{
+   struct extent_state *state;
+   state = kmem_cache_alloc(extent_state_cache, mask);
+   if (!state || IS_ERR(state))
+   return state;
+   state->state = 0;
+   state->in_tree = 0;
+   atomic_set(>refs, 1);
+   init_waitqueue_head(>wq);
+   return state;
+}
+EXPORT_SYMBOL(alloc_extent_state);
+
+void free_extent_state(struct extent_state *state)
+{
+   if (atomic_dec_and_test(>refs)) {
+   WARN_ON(state->in_tree);
+   kmem_cache_free(extent_state_cache, state);
+   }
+}
+EXPORT_SYMBOL(free_extent_state);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+  struct rb_node *node)
+{
+   struct rb_node ** p = >rb_node;
+   struct rb_node * parent = NULL;
+   struct tree_entry *entry;
+
+   while(*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct tree_entry, rb_node);
+
+   if (offset < entry->end)
+   p = &(*p)->rb_left;
+   else if (offset > entry->end)
+   p = &(*p)->rb_right;
+   else
+   return parent;
+   }
+
+   entry = rb_entry(node, struct tree_entry, rb_node);
+   entry->in_tree = 1;
+   rb_link_node(node, parent, p);
+   rb_insert_color(node, root);
+   return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+  struct rb_node **prev_ret)
+{
+   struct rb_node * n = root->rb_node;
+   struct rb_node *prev = NULL;
+   struct tree_entry *entry;
+   struct tree_entry *prev_entry = NULL;
+
+   while(n) {
+   entry = rb_entry(n, struct tree_entry, rb_node);
+   prev = n;
+   prev_entry = entry;
+
+   if (offset < entry->end)
+   n = n->rb_left;
+   else if (offset > entry->end)
+   n = n->rb_right;
+   

[PATCH RFC] ext2 extentmap support

2007-07-24 Thread Chris Mason
mount -o extentmap to use the new stuff

diff -r 126111346f94 -r 53cabea328f7 fs/ext2/ext2.h
--- a/fs/ext2/ext2.hMon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.hTue Jul 24 15:40:27 2007 -0400
@@ -1,5 +1,6 @@
 #include 
 #include 
+#include 
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
struct posix_acl*i_default_acl;
 #endif
rwlock_t i_meta_lock;
+   struct extent_map_tree extent_tree;
struct inodevfs_inode;
 };
 
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 -r 53cabea328f7 fs/ext2/inode.c
--- a/fs/ext2/inode.c   Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c   Tue Jul 24 15:40:27 2007 -0400
@@ -625,6 +625,84 @@ changed:
goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+  size_t page_offset, u64 start, u64 end,
+  int create)
+{
+   struct buffer_head bh;
+   sector_t iblock;
+   struct extent_map *em = NULL;
+   struct extent_map_tree *extent_tree = _I(inode)->extent_tree;
+   int ret = 0;
+   u64 max_end = (u64)-1;
+   u64 found_len;
+   u64 bh_start;
+   u64 bh_end;
+
+   bh.b_size = inode->i_sb->s_blocksize;
+   bh.b_state = 0;
+again:
+   em = lookup_extent_mapping(extent_tree, start, end);
+   if (em) {
+   return em;
+   }
+
+   iblock = start >> inode->i_blkbits;
+   if (!buffer_mapped()) {
+   ret = ext2_get_block(inode, iblock, , create);
+   if (ret)
+   goto out;
+   }
+
+   found_len = min((u64)(bh.b_size), max_end - start);
+   if (!em)
+   em = alloc_extent_map(GFP_NOFS);
+
+   bh_start = start;
+   bh_end = start + found_len - 1;
+   em->start = start;
+   em->end = bh_end;
+   em->bdev = inode->i_sb->s_bdev;
+
+   if (!buffer_mapped()) {
+   em->block_start = 0;
+   em->block_end = 0;
+   } else {
+   em->block_start = bh.b_blocknr << inode->i_blkbits;
+   em->block_end = em->block_start + found_len - 1;
+   }
+   ret = add_extent_mapping(extent_tree, em);
+   if (ret == -EEXIST) {
+   free_extent_map(em);
+   em = NULL;
+   max_end = end;
+   goto again;
+   }
+out:
+   if (ret) {
+   if (em)
+   free_extent_map(em);
+   return ERR_PTR(ret);
+   } else if (em && buffer_new()) {
+   set_extent_new(extent_tree, bh_start, bh_end, GFP_NOFS);
+   }
+   return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+struct writeback_control *wbc)
+{
+   struct extent_map_tree *tree;
+   tree = _I(page->mapping->host)->extent_tree;
+   return extent_write_full_page(tree, page, ext2_get_extent, wbc);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
return block_write_full_page(page, ext2_get_block, wbc);
@@ -633,6 +711,42 @@ static int ext2_readpage(struct file *fi
 static int ext2_readpage(struct file *file, struct page *page)
 {
return mpage_readpage(page, ext2_get_block);
+}
+
+static int ext2_extent_map_readpage(struct file *file, struct page *page)
+{
+   struct extent_map_tree *tree;
+   tree = _I(page->mapping->host)->extent_tree;
+   return extent_read_full_page(tree, page, ext2_get_extent);
+}
+
+static int ext2_extent_map_releasepage(struct page *page,
+  gfp_t unused_gfp_flags)
+{
+   struct extent_map_tree *tree;
+   int ret;
+
+   if (page->private != 1)
+   return try_to_free_buffers(page);
+   tree = _I(page->mapping->host)->extent_tree;
+   ret = try_release_extent_mapping(tree, page);
+   if (ret == 1) {
+   ClearPagePrivate(page);
+   set_page_private(page, 0);
+   page_cache_release(page);
+   }
+   return ret;
+}
+
+
+static void ext2_extent_map_invalidatepage(struct page *page,
+  unsigned long offset)
+{
+   struct extent_map_tree *tree;
+
+   tree = _I(page->mapping->host)->extent_tree;
+   

Re: [PATCH RFC] extent mapped page cache

2007-07-24 Thread Chris Mason
On Tue, 24 Jul 2007 23:25:43 +0200
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-07-24 at 16:13 -0400, Trond Myklebust wrote:
> > On Tue, 2007-07-24 at 16:00 -0400, Chris Mason wrote:
> > > On Tue, 10 Jul 2007 17:03:26 -0400
> > > Chris Mason <[EMAIL PROTECTED]> wrote:
> > > 
> > > > This patch aims to demonstrate one way to replace buffer heads
> > > > with a few extent trees.  Buffer heads provide a few different
> > > > features:
> > > > 
> > > > 1) Mapping of logical file offset to blocks on disk
> > > > 2) Recording state (dirty, locked etc)
> > > > 3) Providing a mechanism to access sub-page sized blocks.
> > > > 
> > > > This patch covers #1 and #2, I'll start on #3 a little later
> > > > next week.
> > > > 
> > > Well, almost.  I decided to try out an rbtree instead of the
> > > radix, which turned out to be much faster.  Even though
> > > individual operations are slower, the rbtree was able to do many
> > > fewer ops to accomplish the same thing, especially for merging
> > > extents together.  It also uses much less ram.
> > 
> > The problem with an rbtree is that you can't use it together with
> > RCU to do lockless lookups. You can probably modify it to allocate
> > nodes dynamically (like the radix tree does) and thus make it
> > RCU-compatible, but then you risk losing the two main benefits that
> > you list above.

The tree is a critical part of the patch, but it is also the easiest to
rip out and replace.  Basically the code stores a range by inserting
an object at an index corresponding to the end of the range.

Then it does searches by looking forward from the start of the range.
More or less any tree that can search and return the first key >=
than the requested key will work.

So, I'd be happy to rip out the tree and replace with something else.
Going completely lockless will be tricky, its something that will deep
thought once the rest of the interface is sane.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] extent mapped page cache

2007-07-25 Thread Chris Mason
On Wed, 25 Jul 2007 04:32:17 +0200
Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote:
> > On Tue, 24 Jul 2007 23:25:43 +0200
> > Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > The tree is a critical part of the patch, but it is also the
> > easiest to rip out and replace.  Basically the code stores a range
> > by inserting an object at an index corresponding to the end of the
> > range.
> > 
> > Then it does searches by looking forward from the start of the
> > range. More or less any tree that can search and return the first
> > key >= than the requested key will work.
> > 
> > So, I'd be happy to rip out the tree and replace with something
> > else. Going completely lockless will be tricky, its something that
> > will deep thought once the rest of the interface is sane.
> 
> Just having the other tree and managing it is what makes me a little
> less positive of this approach, especially using it to store pagecache
> state when we already have the pagecache tree.
> 
> Having another tree to store block state I think is a good idea as I
> said in the fsblock thread with Dave, but I haven't clicked as to why
> it is a big advantage to use it to manage pagecache state. (and I can
> see some possible disadvantages in locking and tree manipulation
> overhead).

Yes, there are definitely costs with the state tree, it will take some
careful benchmarking to convince me it is a feasible solution. But,
storing all the state in the pages themselves is impossible unless the
block size equals the page size. So, we end up with something like
fsblock/buffer heads or the state tree.

One advantage to the state tree is that it separates the state from
the memory being described, allowing a simple kmap style interface
that covers subpages, highmem and superpages.

It also more naturally matches the way we want to do IO, making for
easy clustering.

O_DIRECT becomes a special case of readpages and writepagesthe
memory used for IO just comes from userland instead of the page cache.

The ability to put in additional tracking info like the process that
first dirtied a range is also significant.  So, I think it is worth
trying.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] extent mapped page cache

2007-07-25 Thread Chris Mason
On Thu, 26 Jul 2007 03:37:28 +0200
Nick Piggin <[EMAIL PROTECTED]> wrote:

>  
> > One advantage to the state tree is that it separates the state from
> > the memory being described, allowing a simple kmap style interface
> > that covers subpages, highmem and superpages.
> 
> I suppose so, although we should have added those interfaces long
> ago ;) The variants in fsblock are pretty good, and you could always
> do an arbitrary extent (rather than block) based API using the
> pagecache tree if it would be helpful.

Yes, you could use fsblock for the state bits and make a separate API
to map the actual pages.

>  
> 
> > It also more naturally matches the way we want to do IO, making for
> > easy clustering.
> 
> Well the pagecache tree is used to reasonable effect for that now.
> OK the code isn't beautiful ;). Granted, this might be an area where
> the seperate state tree ends up being better. We'll see.
> 

One thing it gains us is finding the start of the cluster.  Even if
called by kswapd, the state tree allows writepage to find the start of
the cluster and send down a big bio (provided I implement trylock to
avoid various deadlocks).

>  
> > O_DIRECT becomes a special case of readpages and writepagesthe
> > memory used for IO just comes from userland instead of the page
> > cache.
> 
> Could be, although you'll probably also need to teach the mm about
> the state tree and/or still manipulate the pagecache tree to prevent
> concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS
specific ops.

> 
> But isn't the main aim of O_DIRECT to do as little locking and
> synchronisation with the pagecache as possible? I thought this is
> why your race fixing patches got put on the back burner (although
> they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because handling a corner case
where userland did O_DIRECT from a mmap'd region of the same file (Linus
pointed it out to me).  Basically my patches had to work in 64k chunks
to avoid a deadlock in get_user_pages.  With the state tree, I can
allow the page to be faulted in but still properly deal with it.

> 
> Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look
> like this might be another advantage of the state tree (although you
> aren't allowed to slow down buffered IO to achieve the locking ;)).

;) The O_DIRECT benefit is a fringe thing.  I've long wanted to help
clean up that code, but the real point of the patch is to make general
usage faster and less complex.  If I can't get there, the O_DIRECT
stuff doesn't matter.
> 
>  
> > The ability to put in additional tracking info like the process that
> > first dirtied a range is also significant.  So, I think it is worth
> > trying.
> 
> Definitely, and I'm glad you are. You haven't converted me yet, but
> I look forward to finding the best ideas from our two approaches when
> the patches are further along (ext2 port of fsblock coming along, so
> we'll be able to have races soon :P).

I'm sure we can find some river in Cambridge, winner gets to throw
Axboe in.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] extent mapped page cache

2007-07-26 Thread Chris Mason
On Thu, 26 Jul 2007 04:36:39 +0200
Nick Piggin <[EMAIL PROTECTED]> wrote:

[ are state trees a good idea? ]

> > One thing it gains us is finding the start of the cluster.  Even if
> > called by kswapd, the state tree allows writepage to find the start
> > of the cluster and send down a big bio (provided I implement
> > trylock to avoid various deadlocks).
> 
> That's very true, we could potentially also do that with the block
> extent tree that I want to try with fsblock.

If fsblock records and extent of 200MB, and writepage is called on a
page in the middle of the extent, how do you walk the radix backwards
to find the first dirty & up to date page in the range?

> 
> I'm looking at "cleaning up" some of these aops APIs so hopefully
> most of the deadlock problems go away. Should be useful to both our
> efforts. Will post patches hopefully when I get time to finish the
> draft this weekend.

Great

> 
> 
> > > > O_DIRECT becomes a special case of readpages and
> > > > writepagesthe memory used for IO just comes from userland
> > > > instead of the page cache.
> > > 
> > > Could be, although you'll probably also need to teach the mm about
> > > the state tree and/or still manipulate the pagecache tree to
> > > prevent concurrency?
> > 
> > Well, it isn't coded yet, but I should be able to do it from the FS
> > specific ops.
> 
> Probably, if you invalidate all the pagecache in the range beforehand
> you should be able to do it (and I guess you want to do the invalidate
> anyway). Although, below deadlock issues might still bite somehwere...

Well, O_DIRECT is french for deadlocks.  But I shouldn't have to worry
so much about evicting the pages themselves since I can tag the range.

> 
> 
> > > But isn't the main aim of O_DIRECT to do as little locking and
> > > synchronisation with the pagecache as possible? I thought this is
> > > why your race fixing patches got put on the back burner (although
> > > they did look fairly nice from a correctness POV).
> > 
> > I put the placeholder patches on hold because handling a corner case
> > where userland did O_DIRECT from a mmap'd region of the same file
> > (Linus pointed it out to me).  Basically my patches had to work in
> > 64k chunks to avoid a deadlock in get_user_pages.  With the state
> > tree, I can allow the page to be faulted in but still properly deal
> > with it.
> 
> Oh right, I didn't think of that one. Would you still have similar
> issues with the external state tree? I mean, the filesystem doesn't
> really know why the fault is taken. O_DIRECT read from a file into
> mmapped memory of the same block in the file is almost hopeless I
> think.

Racing is fine as long as we don't deadlock or expose garbage from disk.

> > > > The ability to put in additional tracking info like the process
> > > > that first dirtied a range is also significant.  So, I think it
> > > > is worth trying.
> > > 
> > > Definitely, and I'm glad you are. You haven't converted me yet,
> > > but I look forward to finding the best ideas from our two
> > > approaches when the patches are further along (ext2 port of
> > > fsblock coming along, so we'll be able to have races soon :P).
> > 
> > I'm sure we can find some river in Cambridge, winner gets to throw
> > Axboe in.
> 
> Very noble of you to donate your colleage to such a worthy cause.

Jens is always interested in helping solve such debates.  It's a
fantastic service he provides to the community.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE] seekwatcher v0.3 IO graphing an animation

2007-07-27 Thread Chris Mason

Hello everyone,

I've tossed out seekwatcher v0.3.  The major changes are using rolling
averages to smooth out the seek and throughput graphs, and it can
generate mpgs of the IO done by a given trace.

Here's a sample of the smoother graphs (creating 20 kernel trees):

http://oss.oracle.com/~mason/seekwatcher/ext3_vs_btrfs_vs_xfs.png

There are details and sample movies of the kernel tree run at:

http://oss.oracle.com/~mason/seekwatcher

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-06 Thread Chris Mason
On Sun, 5 Aug 2007 11:00:29 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
> > I always thought the right solution would be to just sync atime only
> > very very lazily. This means if a inode is only dirty because of an
> > atime update put it on a "only write out when there is nothing to do
> > or the memory is really needed" list.
> 
> As I've mentioend earlier, the memory balancing issues that arise when
> we add an "atime dirty" bit scare me a little.  It can be addressed,
> obviously, but at the cost of more code complexity.

ext3 and reiser both use a dirty_inode method to make sure that we
don't actually have dirty inodes.  This way, kswapd doesn't get stuck
on the log and is able to do real work.

It would be interesting to see a comparison of relatime with a kinoded
that is willing to get stuck on the log.  The FS would need a few
tweaks so that write_inode() could know if it really needed to log or
not, but for testing you could just drop ext3_dirty_inode and have
ext3_write_inode do real work.

Then just change kswapd to kick a new kinoded and benchmark away.  A
real patch would have to look for places where mark_inode_dirty was
used and expected the dirty_inode callback to log things right away,
but for testing its good enough.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-22 Thread Chris Mason
On Sun, 21 Oct 2007 12:39:30 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> Nick Piggin <[EMAIL PROTECTED]> writes:
> 
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> >> Christian Borntraeger <[EMAIL PROTECTED]> writes:
> >
> >> Let me put it another way.  Looking at /proc/slabinfo I can get
> >> 37 buffer_heads per page.  I can allocate 10% of memory in
> >> buffer_heads before we start to reclaim them.  So it requires just
> >> over 3.7 buffer_heads on very page of low memory to even trigger
> >> this case.  That is a large 1k filesystem or a weird sized
> >> partition, that we have written to directly.
> >
> > On a highmem machine it it could be relatively common.
> 
> Possibly.  But the same proportions still hold.  1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

It is definitely common during run time.  It was seen in practice enough
to be reproducible and get fixed for the non-ramdisk case.

The big underlying question is how which ramdisk usage case are we
shooting for. Keeping the ram disk pages off the LRU can certainly help
the VM if larger ramdisks used at runtime are very common.

Otherwise, I'd say to keep it as simple as possible and use Eric's
patch.  By simple I'm not counting lines of code, I'm counting overall
readability between something everyone knows (page cache usage) and
something specific to ramdisks (Nick's patch).

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

2007-10-23 Thread Chris Mason
On Tue, 23 Oct 2007 19:56:20 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > [ adding reiserfs devs to the CC ]
> 
> Thank you.
> 
> This fix is kind of crude - even when it fixed Maxim's problem, and
> survived my stress testing of a lot of patching and kernel compiling.
> I'd be glad to see better solutions.

This should be safe, reiserfs has the buffer heads themselves clean and
the page should get cleaned eventually.  The cancel_dirty_page call was
just an optimization to be VM friendly.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[CFP] 2008 Linux Storage and Filesystem Workshop

2007-10-24 Thread Chris Mason

Hello everyone,

We are organizing another filesystem and storage workshop in San Jose
next Feb 25 and 26.  You can find some great writeups of last year's
conference on LWN:

http://lwn.net/Articles/226351/

This year we're trying to concentrate on more problem solving sessions,
short term projects and joint sessions.  You can find all the details
on the conference webpages:

http://www.usenix.org/events/lsf08/

Soon there will be a link for submitting your position statement, which
is basically a note to the organizers that you are interested in
attending and which topics you think should be covered.

We're also looking for people to lead the discussion around the major
topics, so please let us know if you're interested in that.  The
discussion leaders will have input into the people that get invited and
the format of the discussion.

Please let me know if there are any questions about the workshop.

Thanks,
Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dio_get_page() lockdep complaints

2007-11-09 Thread Chris Mason
On Fri, 09 Nov 2007 09:48:22 -0800
Zach Brown <[EMAIL PROTECTED]> wrote:

> 
> >> So reiser and NFS need to be fixed.  No?
> > 
> > Actually, it is rather mmap() needs to be fixed.
> 
> Sure, I'm willing to have that demonstrated.  My point was that DIO
> getting the mmap_sem inside i_mutex is currently correct.
> 
> reiserfs, though, seems to be out on a more precarious limb ;).

reiserfs is doing tail packing during the file_release call, which has
lots of advantages for removing complexity in file_write.

Without getting into a huge patch, the best fix would just be switching
to try lock.  If the tail doesn't get packed, the world doesn't end.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dio_get_page() lockdep complaints

2007-11-09 Thread Chris Mason
On Fri, 09 Nov 2007 10:35:04 -0800
Zach Brown <[EMAIL PROTECTED]> wrote:

> > Without getting into a huge patch, the best fix would just be
> > switching to try lock.  If the tail doesn't get packed, the world
> > doesn't end.
> 
> So, something like this?

Reviewed-by: Chris Mason <[EMAIL PROTECTED]>

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dio_get_page() lockdep complaints

2007-11-09 Thread Chris Mason
On Fri, 9 Nov 2007 13:53:27 -0500
Chris Mason <[EMAIL PROTECTED]> wrote:

> On Fri, 09 Nov 2007 10:35:04 -0800
> Zach Brown <[EMAIL PROTECTED]> wrote:
> 
> > > Without getting into a huge patch, the best fix would just be
> > > switching to try lock.  If the tail doesn't get packed, the world
> > > doesn't end.
> > 
> > So, something like this?
> 
> Reviewed-by: Chris Mason <[EMAIL PROTECTED]>
> 

Ugh, I thought the preallocation was getting freed elsewhere, but it
looks like I was wrong.  We can't just skip the i_mutex after all,
sorry.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dio_get_page() lockdep complaints

2007-11-09 Thread Chris Mason
On Fri, 09 Nov 2007 11:16:53 -0800
Zach Brown <[EMAIL PROTECTED]> wrote:

> > Ugh, I thought the preallocation was getting freed elsewhere, but it
> > looks like I was wrong.  We can't just skip the i_mutex after all,
> > sorry.
> 
> Ah, so none of those tests at the top will stop tail packing if
> there's been pre-allocation?
> 
> Like, uh, the inode reference count test?
> 

It always drops prealloc, regardless of i_count.  One fix is to change
the prealloc routines to take/drop a reference on the inode.
preallocation is already dropped when the transaction ends, so things
would continue working.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


More Large blocksize benchmarks

2007-10-15 Thread Chris Mason
Hello everyone,

I'm stealing the cc list and reviving and old thread because I've
finally got some numbers to go along with the Btrfs variable blocksize
feature.  The basic idea is to create a read/write interface to
map a range of bytes on the address space, and use it in Btrfs for all
metadata operations (file operations have always been extent based).

So, instead of casting buffer_head->b_data to some structure, I read and
write at offsets in a struct extent_buffer.  The extent buffer is very
small and backed by an address space, and I get large block sizes the
same way file_write gets to write to 16k at a time, by finding the
appropriate page in the addess space.  This is an over simplification
since I try to cache these mapping decisions to avoid using too much
CPU, but hopefully you get the idea.

The advantage to this approach is the changes are all inside Btrfs.  No
extra kernel patches were required.

Dave reported that XFS saw much higher write throughput with large
blocksizes, but so far I'm seeing the most benefits during reads.

The next step is a bunch more benchmarks.  I've done the first round
and posted it here:

http://oss.oracle.com/~mason/blocksizes/

The Btrfs code makes it relatively easy to experiment, and so this may
be a good step toward figuring out if some automagic solution is worth
it in general.  I can even use different sizes for nodes and leaves,
although I haven't done much testing at all there yet.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: More Large blocksize benchmarks

2007-10-16 Thread Chris Mason
On Tue, 2007-10-16 at 12:36 +1000, David Chinner wrote:
> On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote:
> > Hello everyone,
> > 
> > I'm stealing the cc list and reviving and old thread because I've
> > finally got some numbers to go along with the Btrfs variable blocksize
> > feature.  The basic idea is to create a read/write interface to
> > map a range of bytes on the address space, and use it in Btrfs for all
> > metadata operations (file operations have always been extent based).
> > 
> > So, instead of casting buffer_head->b_data to some structure, I read and
> > write at offsets in a struct extent_buffer.  The extent buffer is very
> > small and backed by an address space, and I get large block sizes the
> > same way file_write gets to write to 16k at a time, by finding the
> > appropriate page in the addess space.  This is an over simplification
> > since I try to cache these mapping decisions to avoid using too much
> > CPU, but hopefully you get the idea.
> > 
> > The advantage to this approach is the changes are all inside Btrfs.  No
> > extra kernel patches were required.
> > 
> > Dave reported that XFS saw much higher write throughput with large
> > blocksizes, but so far I'm seeing the most benefits during reads.
> 
> Apples to oranges, Chris ;)
> 

Grin, if the two were the same, there'd be no reason to write a new one.
I didn't expect faster writes on btrfs, at least not for workloads that
did not require reads.  The basic idea is to show there are a variety of
ways the larger blocks can improve (and hurt) performance.

Also, vmap isn't the only implementation path.  Its true the Btrfs
changes for this were huge, but a big chunk of the changes were for
different leaf/node blocksizes, something that may never get used in
practice.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 11:57 -0600, Eric W. Biederman wrote:
> Christian Borntraeger <[EMAIL PROTECTED]> writes:
> 
> > Eric,
> >
> > Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger:
> >> Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman:
> >> 
> >> > fs/buffer.c |3 +++
> >> > 1 files changed, 3 insertions(+), 0 deletions(-)
> >> >  drivers/block/rd.c |   13 +
> >> >  1 files changed, 1 insertions(+), 12 deletions(-)
> >> 
> >> Your patches look sane so far. I have applied both patches, and the 
> >> problem 
> >> seems gone. I will try to get these patches to our testers.
> >> 
> >> As long as they dont find new problems:
> >
> > Our testers did only a short test, and then they were stopped by problems 
> > with
> > reiserfs. At the moment I cannot say for sure if your patch caused this, 
> > but 
> > we got the following BUG
> 
> Thanks.
> 
> > ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr
> > storage.
> > [ cut here ]
> > kernel BUG at
> > /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
> > illegal operation: 0001 [#1]
> > Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
> > CPU:3Not tainted
> > Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
> > Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
> >R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
> > Krnl GPRS: 0002 7411b5c8 002b 
> >7b04d000 0001  76d1de00
> >7513eec0 0003 0012 77f77680
> >7411b608 fb343b7e fb34404a 7513ee50
> > Krnl Code: fb344374: a7210002   tmll%r2,2
> >fb344378: a7840004   brc 8,fb344380
> >fb34437c: a7f40001   brc 15,fb34437e
> >   >fb344380: 5810d8c2   l   %r1,2242(%r13)
> >fb344384: 5820b03c   l   %r2,60(%r11)
> >fb344388: 0de1   basr%r14,%r1
> >fb34438a: 5810d90e   l   %r1,2318(%r13)
> >fb34438e: 5820b03c   l   %r2,60(%r11)
> >
> >
> > Looking at the code, this really seems related to dirty buffers, so your 
> > patch
> > is the main suspect at the moment. 
> 
> Sounds reasonable.
> 
> > if (!barrier) {
> > /* If there was a write error in the journal - we can't 
> > commit
> >  * this transaction - it will be invalid and, if successful,
> >  * will just end up propagating the write error out to
> >  * the file system. */
> > if (likely(!retval && !reiserfs_is_journal_aborted 
> > (journal))) {
> > if (buffer_dirty(jl->j_commit_bh))
> > 1117>BUG();
> > mark_buffer_dirty(jl->j_commit_bh) ;
> > sync_dirty_buffer(jl->j_commit_bh) ;
> > }
> > }
> 
> Grr.  I'm not certain how to call that.
> 
> Given that I should also be able to trigger this case by writing to
> the block device through the buffer cache (to the write spot at the
> write moment) this feels like a reiserfs bug. 
> Although I feel
> screaming about filesystems that go BUG instead of remount read-only
> 

In this case, the commit block isn't allowed to be dirty before reiserfs
decides it is safe to write it.  The journal code expects it is the only
spot in the kernel setting buffer heads dirty, and it only does so after
the rest of the log blocks are safely on disk.

Given this is a ramdisk, the check can be ignored, but I'd rather not
sprinkle if (ram_disk) into the FS code

> At the same time I increasingly don't think we should allow user space
> to dirty or update our filesystem metadata buffer heads.  That seems
> like asking for trouble.
> 

Demanding trouble ;)

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 14:29 -0600, Eric W. Biederman wrote:
> Chris Mason <[EMAIL PROTECTED]> writes:
> 
> > In this case, the commit block isn't allowed to be dirty before reiserfs
> > decides it is safe to write it.  The journal code expects it is the only
> > spot in the kernel setting buffer heads dirty, and it only does so after
> > the rest of the log blocks are safely on disk.
> 
> Ok.  So the journal code here fundamentally depends on being able to
> control the order of the writes, and something else being able to set
> the buffer head dirty messes up that control.
> 

Right.

> >> At the same time I increasingly don't think we should allow user space
> >> to dirty or update our filesystem metadata buffer heads.  That seems
> >> like asking for trouble.
> >> 
> >
> > Demanding trouble ;)
> 
> Looks like it.  There are even comments in jbd about the same class
> of problems.  Apparently dump and tune2fs on mounted filesystems have
> triggered some of these issues.  The practical question is any of this
> trouble worth handling.
> 
> Thinking about it.  I don't believe anyone has ever intentionally built
> a filesystem tool that depends on being able to modify a file systems
> metadata buffer heads while the filesystem is running, and doing that
> would seem to be fragile as it would require a lot of cooperation
> between the tool and the filesystem about how the filesystem uses and
> implement things.
> 

That's right.  For example, ext2 is doing directories in the page cache
of the directory inode, so there's a cache alias between the block
device page cache and the directory inode page cache.

> Now I guess I need to see how difficult a patch would be to give
> filesystems magic inodes to keep their metadata buffer heads in.

Not hard, the block device inode is already a magic inode for metadata
buffer heads.  You could just make another one attached to the bdev.

But, I don't think I fully understand the problem you're trying to
solve?

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 15:30 -0600, Eric W. Biederman wrote:
> Chris Mason <[EMAIL PROTECTED]> writes:
> 
> >> Thinking about it.  I don't believe anyone has ever intentionally built
> >> a filesystem tool that depends on being able to modify a file systems
> >> metadata buffer heads while the filesystem is running, and doing that
> >> would seem to be fragile as it would require a lot of cooperation
> >> between the tool and the filesystem about how the filesystem uses and
> >> implement things.
> >> 
> >
> > That's right.  For example, ext2 is doing directories in the page cache
> > of the directory inode, so there's a cache alias between the block
> > device page cache and the directory inode page cache.
> >
> >> Now I guess I need to see how difficult a patch would be to give
> >> filesystems magic inodes to keep their metadata buffer heads in.
> >
> > Not hard, the block device inode is already a magic inode for metadata
> > buffer heads.  You could just make another one attached to the bdev.
> >
> > But, I don't think I fully understand the problem you're trying to
> > solve?
> 
> 
> So the start:
> When we write buffers from the buffer cache we clear buffer_dirty
> but not PageDirty
>
> So try_to_free_buffers() will mark any page with clean buffer_heads
> that is not clean itself clean.
> 
> The ramdisk set pages dirty to keep them from being removed from the
> page cache, just like ramfs.
> 
So, the problem is using the Dirty bit to indicate pinned.  You're
completely right that our current setup of buffer heads and pages and
filesystem metadata is complex and difficult.

But, moving the buffer heads off of the page cache pages isn't going to
make it any easier to use dirty as pinned, especially in the face of
buffer_head users for file data pages.

You've already seen Nick fsblock code, but you can see my general
approach to replacing buffer heads here:

http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h

(alpha quality implementation in extent_map.c and users in inode.c)  The
basic idea is to do extent based record keeping for mapping and state of
things in the filesystem, and to avoid attaching these things to the
page.

> Unfortunately when those dirty ramdisk pages get buffers on them and
> those buffers all go clean and we are trying to reclaim buffer_heads
> we drop those pages from the page cache.   Ouch!
> 
> We can fix the ramdisk by setting making certain that buffer_heads
> on ramdisk pages stay dirty as well.  The problem is this breaks
> filesystems like reiserfs and ext3 that expect to be able to make 
> buffer_heads clean sometimes.
> 
> There are other ways to solve this for ramdisks, such as changing
> where ramdisks are stored.  However fixing the ramdisks this way
> still leaves the general problem that there are other paths to the
> filesystem metadata buffers, and those other paths cause the code
> to be complicated and buggy.
> 
> So I'm trying to see if we can untangle this Gordian knot, so the
> code because more easily maintainable.  
> 

Don't get me wrong, I'd love to see a simple and coherent fix for what
reiserfs and ext3 do with buffer head state, but I think for the short
term you're best off pinning the ramdisk pages via some other means.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote:
> Chris Mason <[EMAIL PROTECTED]> writes:
> 
> > So, the problem is using the Dirty bit to indicate pinned.  You're
> > completely right that our current setup of buffer heads and pages and
> > filesystpem metadata is complex and difficult.
> >
> > But, moving the buffer heads off of the page cache pages isn't going to
> > make it any easier to use dirty as pinned, especially in the face of
> > buffer_head users for file data pages.
> 
> Let me specific.  Not moving buffer_heads off of page cache pages,
> moving buffer_heads off of the block devices page cache pages.
> 
> My problem is the coupling of how block devices are cached and the
> implementation of buffer heads, and by removing that coupling
> we can generally make things better.  Currently that coupling
> means silly things like all block devices are cached in low memory.
> Which probably isn't what you want if you actually have a use
> for block devices.
> 
> For the ramdisk case in particular what this means is that there
> are no more users that create buffer_head mappings on the block
> device cache so using the dirty bit will be safe.

Ok, we move the buffer heads off to a different inode, and that indoe
has pages.  The pages on the inode still need to get pinned, how does
that pinning happen?

The problem you described where someone cleans a page because the buffer
heads are clean happens already without help from userland.  So, keeping
the pages away from userland won't save them from cleaning.

Sorry if I'm reading your suggestion wrong...

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 4/6][RFC] Attempt to plug race with truncate

2007-10-29 Thread Chris Mason
On Fri, 26 Oct 2007 16:37:36 -0700
Mike Waychison <[EMAIL PROTECTED]> wrote:

> Attempt to deal with races with truncate paths.
> 
> I'm not really sure on the locking here, but these seem to be taken
> by the truncate path.  BKL is left as some filesystem may(?) still
> require it.
> 
> Signed-off-by: Mike Waychison <[EMAIL PROTECTED]>
>  fs/ioctl.c |8 
>  1 file changed, 8 insertions(+)
> 
> Index: linux-2.6.23/fs/ioctl.c
> ===
> --- linux-2.6.23.orig/fs/ioctl.c  2007-10-26 15:27:29.0
> -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26
> 16:16:28.0 -0700 @@ -43,13 +43,21 @@ static long
> do_ioctl(struct file *filp, static int do_fibmap(struct address_space
> *mapping, sector_t block, sector_t *phys_block)
>  {
> + struct inode *inode = mapping->host;
> +
>   if (!capable(CAP_SYS_RAWIO))
>   return -EPERM;
>   if (!mapping->a_ops->bmap)
>   return -EINVAL;
>  
>   lock_kernel();
> + /* Avoid races with truncate */
> + mutex_lock(>i_mutex);
> + /* FIXME: Do we really need i_alloc_sem? */
> + down_read(>i_alloc_sem);


i_alloc_sem will avoid races with filesystems filling holes inside
writepage (where i_mutex isn't held).  I'd expect everyone to currently
give some consistent result (either the old value or the new but not
garbage), but I wouldn't expect taking the semaphore to hurt anything.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 5/6][RFC] Introduce FIBMAP64

2007-10-29 Thread Chris Mason
On Fri, 26 Oct 2007 16:37:37 -0700
Mike Waychison <[EMAIL PROTECTED]> wrote:

> Introduce FIBMAP64.  This is the same as FIBMAP, but takes a u64.

If we're adding new ioctls, I'd rather see the FIEMAP stuff go
in, a quick search found discussions but has it died off?

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Chris Mason
On Sat, 27 Oct 2007 18:57:06 +0100
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> ->bmap is ugly and horrible!  If you have to do this at the very
> least please cause ->bmap64 to be able to return error values in case
> the file system failed to get the information or indeed such
> information does not exist as is the case for compressed and
> encrypted files for example and also for small files that are inside
> the on-disk inode (NTFS resident files and reiserfs packed tails are
> examples of this).
> 
> And another of my pet peeves with ->bmap is that it uses 0 to mean  
> "sparse" which causes a conflict on NTFS at least as block zero is  
> part of the $Boot system file so it is a real, valid block...  NTFS  
> uses -1 to denote sparse blocks internally.

Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone.  But that's
more of a feature for the FIEMAP stuff.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Chris Mason
On Mon, 29 Oct 2007 12:18:22 -0700
Mike Waychison <[EMAIL PROTECTED]> wrote:

> Zach Brown wrote:
> >>> And another of my pet peeves with ->bmap is that it uses 0 to
> >>> mean "sparse" which causes a conflict on NTFS at least as block
> >>> zero is part of the $Boot system file so it is a real, valid
> >>> block...  NTFS uses -1 to denote sparse blocks internally.
> >> Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if
> >> there was a way to indicate your-data-is-here-but-isn't-alone.
> >> But that's more of a feature for the FIEMAP stuff.
> > 
> > And maybe we can step back and see what the callers of FIBMAP are
> > doing with the results they're getting.
> > 
> > One use is to discover the order in which to read file data that
> > will result in efficient IO.
> > 
> > If we had an interface specifically for this use case then perhaps a
> > sparse block would be better reported as the position of the inode
> > relative to other data blocks.  Maybe the inode block number in
> > ext* land.
> > 
> 
> Can you clarify what you mean above with an example?  I don't really
> follow.

This is a larger topic of helping userland optimize access to groups of
files.  For example, during a readdir if we knew the next step was to
delete all the files found, we could do one top of readahead (or even
ordering the returned values).

If we knew the next step would be to read all the files found, a
different type of readahead would be useful.

But, we shouldn't inflict all of this on fibmap/fiemapwe'll get
lost trying to make the one true interface for all operations.

For grouping operations on files, I think a read_tree syscall with
hints for what userland will do (read, stat, delete, list
filenames), and a better cookie than readdir should do it.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2008 Linux Storage and Filesystem Workshop

2007-11-05 Thread Chris Mason

Hello everyone,

The position statement submission system for the 2008 storage and
filesystem workshop is now online.  This is how you let us know you're
interested in attending and what topics are most important for
discussion.

For all the details, please see:

http://www.usenix.org/events/lsf08/

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-08-19 Thread Chris Mason
Quoting Linus Torvalds (2013-08-19 17:16:36)
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter  wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >>[... ]  The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object 880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
> >> lkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
> 
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
> 
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.
> 

Simon, is this box doing btrfs send/receive?  If so, it's probably where
this pipe is coming from.

Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
way to find it, I can give you a patch if it'll help.

It would be nice if you could trigger this on plain 3.11-rcX instead of
btrfs-next.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-08-10 Thread Chris Mason
Hi Linus

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are assorted fixes, mostly from Josef nailing down xfstests runs.
Zach also has a long standing fix for problems with readdir wrapping
f_pos (or ctx->pos)

These patches were spread out over different bases, so I rebased things on
top of rc4 and retested overnight.

Josef Bacik (6) commits (+82/-52):
Btrfs: check to see if root_list is empty before adding it to dead roots 
(+5/-5)
Btrfs: make sure the backref walker catches all refs to our extent (+14/-11)
Btrfs: allow splitting of hole em's when dropping extent cache (+40/-22)
Btrfs: release both paths before logging dir/changed extents (+2/-3)
Btrfs: fix backref walking when we hit a compressed extent (+15/-8)
Btrfs: do not offset physical if we're compressed (+6/-3)

Liu Bo (2) commits (+12/-5):
Btrfs: fix a bug of snapshot-aware defrag to make it work on partial 
extents (+12/-4)
Btrfs: fix extent buffer leak after backref walking (+0/-1)

Zach Brown (1) commits (+25/-8):
btrfs: don't loop on large offsets in readdir

Jie Liu (1) commits (+0/-3):
btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified

Total: (10) commits (+119/-68)

 fs/btrfs/backref.c | 48 ++
 fs/btrfs/ctree.c   |  1 -
 fs/btrfs/extent_io.c   |  9 +---
 fs/btrfs/file.c| 62 --
 fs/btrfs/inode.c   | 52 ++
 fs/btrfs/transaction.c |  8 +++
 fs/btrfs/transaction.h |  2 +-
 fs/btrfs/tree-log.c|  5 ++--
 8 files changed, 119 insertions(+), 68 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Boot failure on Arndale with next-20131105

2013-11-05 Thread Chris Mason
Quoting Olof Johansson (2013-11-05 15:23:51)
> On Tue, Nov 5, 2013 at 11:33 AM, Jens Axboe  wrote:

[ horrible crashes fixed by removing my patch ]

> > Very weird! What file system is being used?
> 
> Most of my failures have happened on regular MMC cards with ext4
> filesystems on them.
> 
> Note that the panic happens during device probe / partition table
> scanning, not after mounting the filesystem.
> 
> Giving your patch a go now across the board. I'm very concerned about
> the reports of bisectability, build failures and heaps of warnings
> though. Did the 0-day builder pick up any of those? :-/
> 

Hmmm, is bcache in your config?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Boot failure on Arndale with next-20131105

2013-11-05 Thread Chris Mason
Quoting Olof Johansson (2013-11-05 15:38:33)
> On Tue, Nov 5, 2013 at 12:33 PM, Chris Mason  wrote:
> > Quoting Olof Johansson (2013-11-05 15:23:51)
> >> On Tue, Nov 5, 2013 at 11:33 AM, Jens Axboe  wrote:
> >
> > [ horrible crashes fixed by removing my patch ]
> >
> >> > Very weird! What file system is being used?
> >>
> >> Most of my failures have happened on regular MMC cards with ext4
> >> filesystems on them.
> >>
> >> Note that the panic happens during device probe / partition table
> >> scanning, not after mounting the filesystem.
> >>
> >> Giving your patch a go now across the board. I'm very concerned about
> >> the reports of bisectability, build failures and heaps of warnings
> >> though. Did the 0-day builder pick up any of those? :-/
> >>
> >
> > Hmmm, is bcache in your config?
> 
> Doesn't look that way -- no ARM defconfigs enable it (it's what I
> build and boot), and the option defaults to off and nothing selects
> it.

Ok, I think I see it.  My guess is that you're hitting bounce buffers.
__blk_queue_bounce is the only caller of the bio splitting code I can
see that you might be hitting.

My first patch exposed a lurking bug in bio_clone_biovec.  Basically
bio->bi_vcnt is being doubled instead of initialized.

This patch is only compile tested, but I think it'll fix it.

diff --git a/fs/bio.c b/fs/bio.c
index be93de1..3595456 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -612,6 +612,7 @@ int bio_clone_biovec(struct bio *bio, gfp_t gfp_mask)
unsigned nr_iovecs = 0;
struct bio_vec bv, *bvl = NULL;
struct bvec_iter iter;
+   int i;
 
BUG_ON(!bio->bi_pool);
BUG_ON(BIO_POOL_IDX(bio) != BIO_POOL_NONE);
@@ -628,8 +629,9 @@ int bio_clone_biovec(struct bio *bio, gfp_t gfp_mask)
bvl = bio->bi_inline_vecs;
}
 
+   i = 0;
bio_for_each_segment(bv, bio, iter)
-   bvl[bio->bi_vcnt++] = bv;
+   bvl[i++] = bv;
 
bio->bi_io_vec = bvl;
bio->bi_iter.bi_idx = 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Boot failure on Arndale with next-20131105

2013-11-05 Thread Chris Mason
Quoting Olof Johansson (2013-11-05 17:41:42)
> On Tue, Nov 5, 2013 at 2:06 PM, Stephen Warren  wrote:
> > On 11/05/2013 01:56 PM, Chris Mason wrote:
> >> Quoting Olof Johansson (2013-11-05 15:38:33)
> >>> On Tue, Nov 5, 2013 at 12:33 PM, Chris Mason  
> >>> wrote:
> >>>> Quoting Olof Johansson (2013-11-05 15:23:51)
> >>>>> On Tue, Nov 5, 2013 at 11:33 AM, Jens Axboe  wrote:
> >>>>
> >>>> [ horrible crashes fixed by removing my patch ]
> >>>>
> > ...
> >> Ok, I think I see it.  My guess is that you're hitting bounce buffers.
> >> __blk_queue_bounce is the only caller of the bio splitting code I can
> >> see that you might be hitting.
> >>
> >> My first patch exposed a lurking bug in bio_clone_biovec.  Basically
> >> bio->bi_vcnt is being doubled instead of initialized.
> >>
> >> This patch is only compile tested, but I think it'll fix it.
> >
> > Tested-by: Stephen Warren 
> > (this fixes the issue on Tegra30/Beaver at least)
> 
> Tested-by: Olof Johansson 
> 
> This resolves boot failures on:
> * Tegra30/beaver
> * OMAP4/panda
> * i.MX6/wandboard
> 
> Thanks Chris!

Perfect, thanks for bisecting and trying the patches.  Kent, if things
get rebased, could you please fold this patch and my bi_vcnt patch in?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-05 22:48:41)
> This patch reverts the default behaviour introduced by
> 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> shares the source bio's biovec, cloning the biovec is once again the
> default.
> 
> Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> that shares the source's biovec. This patch changes bcache and md to use
   ^
   dm?

> __bio_clone_biovec_fast() since they're expecting the new behaviour due
> to other refactoring; most of the other uses of bio_clone() should be
> same to convert to the _fast() variant but that will be done more
> incrementally in other patches (bio_split() in particular).

Hi Kent,

I noticed yesterday the _fast variants of bio clone introduce sharing
between the src and the clone, but without any reference counts:

bio->bi_io_vec = bio_src->bi_io_vec;

Have you audited all of the _fast users to make sure they are not
freeing the src before the clone?  Sorry if this came up already in past
reviews.

> 
> Note that __bio_clone() isn't being readded - the reason being that with
> immutable biovecs allocating the right number of biovecs for the new
> clone is no longer trivial so we don't want drivers trying to do that
> themselves.
> 
> This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> short,

I think I see what you mean with tying bi_vcnt to ownership of the bio,
but we're not consistent.  Looking at bio_for_eaach_segment_all:

*
 * drivers should _never_ use the all version - the bio may have been split
 * before it got to the driver and the driver won't own all of it
 */
#define bio_for_each_segment_all(bvl, bio, i)   \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)

bio_for_each_segment_all still trusts bi_vcnt, so any
bio_for_each_segment_all operation on a clone will basically be a noop.

Just looking at MD raid1 make_request()

mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
...
alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
...
if (r1_bio->behind_bvecs) {
bio_for_each_segment_all(bvec, mbio, j)
...

I didn't test MD without the vcnt fix, but I think any operations in MD
that duplicate data for raid1 turn into noops.  I think we'll end up
writing garbage (or nothing) to the second mirror.

If you look at dm's crypt_free_buffer_pages(), it had similar problems.

> not setting it might cause bugs in the short term but long term
> it's likely to hide nastier more subtle bugs, we don't want code looking
> at bi_vcnt at all for bios it does not own.

I think the concept of bio ownership is still much too weak, at least
for established users like MD and DM.  I don't know how to verify the
sharing of bi_io_vec without some kind of reference counting on the
iovec.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:02:22)
> On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-05 22:48:41)
> > > This patch reverts the default behaviour introduced by
> > > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > > shares the source bio's biovec, cloning the biovec is once again the
> > > default.
> > > 
> > > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > > that shares the source's biovec. This patch changes bcache and md to use
> >^
> >  dm?
> > 
> > > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > > to other refactoring; most of the other uses of bio_clone() should be
> > > same to convert to the _fast() variant but that will be done more
> > > incrementally in other patches (bio_split() in particular).
> > 
> > Hi Kent,
> > 
> > I noticed yesterday the _fast variants of bio clone introduce sharing
> > between the src and the clone, but without any reference counts:
> > 
> > bio->bi_io_vec = bio_src->bi_io_vec;
> > 
> > Have you audited all of the _fast users to make sure they are not
> > freeing the src before the clone?  Sorry if this came up already in past
> > reviews.
> 
> Yup - that should actually be safe for all the existing bio_clone() users
> actually, I audited all of them - because normally you're not going to 
> complete
> the original bio until the clone finishes.

I'd say we need an ack from Neil before we can switch to _fast.  The
completions look non-trivial and _fast adds new ordering requirements on
free.

> 
> > > Note that __bio_clone() isn't being readded - the reason being that with
> > > immutable biovecs allocating the right number of biovecs for the new
> > > clone is no longer trivial so we don't want drivers trying to do that
> > > themselves.
> > > 
> > > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > > short,
> > 
> > I think I see what you mean with tying bi_vcnt to ownership of the bio,
> > but we're not consistent.  Looking at bio_for_eaach_segment_all:
> > 
> > *
> >  * drivers should _never_ use the all version - the bio may have been split
> >  * before it got to the driver and the driver won't own all of it
> >  */
> > #define bio_for_each_segment_all(bvl, bio, i)   \
> > for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> > 
> > bio_for_each_segment_all still trusts bi_vcnt, so any
> > bio_for_each_segment_all operation on a clone will basically be a noop.
> > 
> > Just looking at MD raid1 make_request()
> > 
> > mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> >   ...
> >   alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
> >   ...
> >   if (r1_bio->behind_bvecs) {
> > bio_for_each_segment_all(bvec, mbio, j)
> >   ...
> > 
> > I didn't test MD without the vcnt fix, but I think any operations in MD
> > that duplicate data for raid1 turn into noops.  I think we'll end up
> > writing garbage (or nothing) to the second mirror.
> > 
> > If you look at dm's crypt_free_buffer_pages(), it had similar problems.
> 
> Those are fine actually - in both cases, they're bios they allocated, not the
> bios that were submitted to them.

Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
the case before the bi_vcnt patch.  With your current patch (adding
_fast) both should now be correct.

>
> Though md _definitely_ shouldn't have been
> sharing the original bio's biovec, so looks like this patch will fix a bug
> there...
> 
> (I remember seeing that code before and I thought I added a bio_clone_biovec()
> call to that md code, but apparently that never got commited. Argh.)
> 
> > 
> > > not setting it might cause bugs in the short term but long term
> > > it's likely to hide nastier more subtle bugs, we don't want code looking
> > > at bi_vcnt at all for bios it does not own.
> > 
> > I think the concept of bio ownership is still much too weak, at least
> > for established users lik

Re: block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Mike Snitzer (2013-11-06 15:36:40)
> On Wed, Nov 06 2013 at  3:22pm -0500,
> Chris Mason  wrote:
> 
> > Quoting Kent Overstreet (2013-11-06 15:02:22)
> > > On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > > > 
> > > > I think the concept of bio ownership is still much too weak, at least
> > > > for established users like MD and DM.  I don't know how to verify the
> > > > sharing of bi_io_vec without some kind of reference counting on the
> > > > iovec.
> > > 
> > > What's unclear about it? The rule is just - if you didn't allocate the 
> > > biovec,
> > > don't modify it or use bio_for_each_segment_all() (probably I didn't 
> > > quite state
> > > it clearly enough before though)
> > 
> > That part makes sense.  The new rule that scares me is that we can't
> > free the src of the clone until all the clones are freed.  If it works
> > with today's existing users it feels like it is more by accident than
> > design.  I'm not saying we can't do it, we just need some bigger
> > flashing warning lights.
> 
> But we probably don't want those warning lights to come with the cost of
> managing extra refcounts in the fast path -- so maybe a debug-only
> refcount?

I'd be happy with some code comments and a few extra SOBs.  In general
I'm happy with the new patches that add _fast and make us explicitly
choose the sharing.  Lots of little chances for bugs, but opt-in is a
much better starting point.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Revert bio_clone() default behaviour

2013-11-06 Thread Chris Mason
Quoting Kent Overstreet (2013-11-06 15:57:34)
> On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-06 15:02:22)

[ ... nods, thanks! ... ]

> OTOH - with regards to just the ordering requirements, the more I look at
> various code the less accidental the fact that that works seems to be: the 
> best
> explanation I've come up with so far is that you already needed to ensure that
> the _pages_ the clone points to stick around until the clone completes, and if
> you don't own the original bio the only way to do that is to not complete the
> original bio until after the clone completes.
> 
> So if you're a driver cloning bios that were submitted to you, 
> bio_clone_fast()
> introduces no new ordering requirements.
> 
> On the third hand - if you're cloning (i.e. splitting) your own bios, in that
> case it would be possible to screw up the ordering - I don't know of any code 
> in
> the kernel that does this today (except for, sort of, bcache) but my dio 
> rewrite
> takes this approach - but if you do the obvious and sane thing with 
> bio_chain()
> it's a non issue, it seems to me you'd have to come up with something pretty
> contrived and dumb for this to actually be an issue in practice.
> 
> Anyways, I haven't come to any definite conclusions, those are just my
> observations so far.

I do think you're right.  We all seem to have clones doing work on
behalf of the original, and when everyone is done we complete the
original.

But, btrfs does have silly things like this:

dio_end_io(dio_bio, err); // end and free the original
bio_put(bio); // free the clone

It's not a bug yet, but given enough time the space between those two
frees will grow new code that kills us all.

Really though, the new stuff is better, thanks.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-10-18 Thread Chris Mason
Hi Linus,

My for-linus branch has a one line fix:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a
regression in our initial rc1 pull.  When doing nocow writes we were
sometimes starting a transaction with locks held.

Josef Bacik (1) commits (+1/-0):
Btrfs: release path before starting transaction in can_nocow_extent

Total: (1) commits (+1/-0)

 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-11-14 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is our usual merge window set of bug fixes, performance
improvements and cleanups.  Miao Xie has some really nice optimizations
for writeback.

Josef also expanded our sanity checks quite a bit; these make up a big
chunk of the new lines.

This was tested both on 3.12 and your current git

Josef Bacik (30) commits (+2038/-419):
Btrfs: make sure the delalloc workers actually flush compressed writes 
(+12/-6)
Btrfs: do not free the dirty bytes from the trans block rsv on cleanup 
(+0/-2)
Btrfs: add a sanity test for a vacant extent at the front of a file 
(+128/-5)
Btrfs: take ordered root lock when removing ordered operations inode (+2/-0)
Btrfs: stop committing the transaction so much during relocate (+13/-20)
Btrfs: do not bug_on if we try to cow a free space cache inode (+4/-1)
Btrfs: add an assert to btrfs_lookup_csums_range for alignment (+3/-0)
Btrfs: cleanup reserved space when freeing tree log on error (+25/-22)
Btrfs: fix two use-after-free bugs with transaction cleanup (+52/-82)
Btrfs: don't delete ordered roots from list during cleanup (+2/-1)
Btrfs: do a full search everytime in btrfs_search_old_slot (+6/-2)
Btrfs: handle a missing extent for the first file extent (+9/-2)
Btrfs: return an error from btrfs_wait_ordered_range (+78/-41)
Btrfs: remove all BUG_ON()'s from commit_cowonly_roots (+8/-5)
Btrfs: don't abort transaction in run_delalloc_nocow (+5/-15)
Btrfs: do not release metadata for space cache inodes (+7/-1)
Btrfs: relocate csums properly with prealloc extents (+15/-3)
Btrfs: free reserved space on error in a few places (+21/-2)
Btrfs: check file extent type before anything else (+5/-5)
Btrfs: stop all workers after we free block groups (+2/-2)
Btrfs: add tests for find_lock_delalloc_range (+389/-9)
Btrfs: add a sanity test for btrfs_split_item (+283/-7)
Btrfs: fix up seek_hole/seek_data handling (+19/-75)
Btrfs: free up block groups after everything (+2/-2)
Btrfs: reset intwrite on transaction abort (+2/-0)
Btrfs: fix hole check in log_one_extent (+1/-1)
Btrfs: add tests for btrfs_get_extent (+861/-2)
Btrfs: stop using vfs_read in send (+72/-103)
Btrfs: cleanup transaction on abort (+6/-1)
Btrfs: fixup reserved trace points (+6/-2)

Filipe David Borba Manana (19) commits (+127/-104):
Btrfs: fix sync fs to actually wait for all data to be persisted (+9/-3)
Btrfs: fix csum search offset/length calculation in log tree (+7/-7)
Btrfs: optimize extent item search in run_delayed_extent_op (+20/-7)
Btrfs: remove path arg from btrfs_truncate_free_space_cache (+3/-15)
Btrfs: remove unused max_key arg from btrfs_search_forward (+7/-30)
Btrfs: don't leak delayed node on path allocation failure (+3/-1)
Btrfs: remove unnecessary tree search when logging inode (+5/-5)
Btrfs: log recovery, don't unlink inode always on error (+4/-1)
Btrfs: fix btrfs_prev_leaf() previous key computation (+8/-4)
Btrfs: remove unnecessary key copy when logging inode (+2/-3)
Btrfs: fix memory leaks on transaction commit failure (+8/-4)
Btrfs: remove duplicated ino cache's inode lookup (+5/-9)
Btrfs: improve inode hash function/inode lookup (+25/-3)
Btrfs: don't store NULL byte in symlink extents (+2/-2)
Btrfs: optimize tree-log.c:count_inode_refs() (+5/-0)
Btrfs: fix tracking of orphan inode count (+8/-5)
Btrfs: don't leak block group on error (+1/-2)
Btrfs: fix incorrect inode acl reset (+1/-1)
Btrfs: fix verification of dir_item (+4/-2)

Miao Xie (10) commits (+121/-52):
Btrfs: remove unnecessary initialization and memory barrior in 
shrink_delalloc() (+3/-4)
Btrfs: pick up the code for the item number calculation in flush_space() 
(+16/-9)
Btrfs: fix the free space write out failure when there is no data space 
(+12/-3)
Btrfs: don't wait for all the async delalloc when shrinking delalloc 
(+12/-2)
Btrfs: fix the confusion between delalloc bytes and metadata bytes (+6/-0)
Btrfs: improve jitter performance of the sequential buffered write (+4/-3)
Btrfs: don't wait for the completion of all the ordered extents (+31/-18)
Btrfs: fix BUG_ON() casued by the reserved space migration (+28/-3)
Btrfs: wait for the ordered extent only when we want (+2/-1)
Btrfs: rename btrfs_start_all_delalloc_inodes (+7/-9)

Liu Bo (7) commits (+69/-32):
Btrfs: fix a crash when running balance and defrag concurrently (+3/-0)
Btrfs: do not run snapshot-aware defragment on error (+28/-19)
Btrfs: export btrfs space shared info to userspace (+33/-1)
Btrfs: fixup error path in __btrfs_inc_extent_ref (+2/-8)
Btrfs: kill unused code in btrfs_search_forward (+0/-2)
Btrfs: fix memory leak of chunks' extent map (+2/-0)
Btrfs: cleanup dead code of defragment (+1/-2)


Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

2013-10-29 Thread Chris Mason
Quoting Mel Gorman (2013-10-29 13:38:14)
> Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
> currently have a problem with the page lock being contended for unbounded
> periods of time during futex operations. The three of us discussed the
> possibiltity that the page lock is unnecessary in this case because we are
> not concerned with the usual races with reclaim and page cache updates. For
> anonymous pages, the associated futex object is the mm_struct which does
> not require the page lock. For inodes, we should be able to check under
> RCU read lock if the page mapping is still valid to take a reference to
> the inode.  This just leaves one rare race that requires the page lock
> in the slow path. This patch does not completely eliminate the page lock
> but it should reduce contention in the majority of cases.
> 
> Patch boots and futextest did not explode but I did no comparison
> performance tests. Thomas, do you have details of the workload that
> drove you to examine this problem? Alternatively, can you test it and
> see does it help you? I added Chris to the To list because he mentioned
> that some filesystems might already be doing tricks similar to this
> patch that are worth copying.

Unfortunately, all the special cases I see in the filesystems either
have an inode ref or are trylocking the page to safety.

XFS is a special case because they have their own inode cache, but by my
reading they are still using i_count and free by rcu.

The iput in here is a little tricky:

> 
> +   /* Should be impossible but lets be paranoid for now */
> +   if (WARN_ON(inode->i_mapping != mapping)) {
> +   rcu_read_unlock();
> +   iput(inode);
> +   put_page(page_head);
> +   goto again;
> +   }
> +

Once you call iput, you add the potential to call the filesystem unlink
operation if i_nlink had gone to zero.  This shouldn't be a problem
since you've dropped the rcu lock, but just for fun I'd move the
put_page up a line.

Or, change it to a BUG_ON instead, it really should be impossible.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Btrfs: update bi_remaining to relfect our bio endio chaining

2013-10-31 Thread Chris Mason
Btrfs is sometimes calling bio_endio twice on the same bio while
we chain things.  This makes sure we don't trip over new assertions in
fs/bio.c

Signed-off-by: Chris Mason 

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7fcac70..5b30360 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2289,6 +2289,10 @@ static void btrfsic_bio_end_io(struct bio *bp, int 
bio_error_status)
block = next_block;
} while (NULL != block);
 
+   /*
+* since we're not using bio_endio here, we don't need to worry about
+* the remaining count
+*/
bp->bi_end_io(bp, bio_error_status);
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 62176ad..786ddac 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1684,7 +1684,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
bio->bi_private = end_io_wq->private;
bio->bi_end_io = end_io_wq->end_io;
kfree(end_io_wq);
-   bio_endio(bio, error);
+   bio_endio_nodec(bio, error);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ef48947..a31448f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5284,9 +5284,17 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
}
 
-   if (bio == bbio->orig_bio)
+   if (bio == bbio->orig_bio) {
is_orig_bio = 1;
 
+   /*
+* eventually we will call the bi_endio for the original bio,
+* make sure that we've properly bumped bi_remaining to reflect
+* our chain of endios here
+*/
+   atomic_inc(>bi_remaining);
+   }
+
if (atomic_dec_and_test(>stripes_pending)) {
if (!is_orig_bio) {
bio_put(bio);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] Btrfs

2013-11-15 Thread Chris Mason
Quoting Heiko Carstens (2013-11-15 06:32:16)
> On Thu, Nov 14, 2013 at 12:19:52PM -0500, Chris Mason wrote:
> > Hi Linus,
> > 
> > Please pull my for-linus branch:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> > for-linus
> > 
> > This is our usual merge window set of bug fixes, performance
> > improvements and cleanups.  Miao Xie has some really nice optimizations
> > for writeback.
> > 
> > Josef also expanded our sanity checks quite a bit; these make up a big
> > chunk of the new lines.
> 
> Hmm..  b19e68439375  "btrfs: Remove redundant local zero structure" seems to
> use the empty_zero_page incorrectly and causes this compile warning on s390:
> 
>   CC  fs/btrfs/ioctl.o
> fs/btrfs/ioctl.c: In function 'btrfs_is_empty_uuid':
> fs/btrfs/ioctl.c:372:2: warning: passing argument 2 of 'memcmp' makes pointer 
> from
> integer without a cast [enabled by default]
>   return !memcmp(uuid, empty_zero_page, BTRFS_UUID_SIZE);
>   ^
> 
> In fact there seem to be two more incorrect usages in the kernel. The patch
> below is not really tested.

Thanks Heiko,

I'll make a new pull with the btrfs part of this.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] Btrfs

2013-11-15 Thread Chris Mason
Quoting Chris Mason (2013-11-15 07:21:31)
> Quoting Heiko Carstens (2013-11-15 06:32:16)
> > On Thu, Nov 14, 2013 at 12:19:52PM -0500, Chris Mason wrote:
> > > Hi Linus,
> > > 
> > > Please pull my for-linus branch:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> > > for-linus
> > > 
> > > This is our usual merge window set of bug fixes, performance
> > > improvements and cleanups.  Miao Xie has some really nice optimizations
> > > for writeback.
> > > 
> > > Josef also expanded our sanity checks quite a bit; these make up a big
> > > chunk of the new lines.
> > 
> > Hmm..  b19e68439375  "btrfs: Remove redundant local zero structure" seems to
> > use the empty_zero_page incorrectly and causes this compile warning on s390:
> > 
> >   CC  fs/btrfs/ioctl.o
> > fs/btrfs/ioctl.c: In function 'btrfs_is_empty_uuid':
> > fs/btrfs/ioctl.c:372:2: warning: passing argument 2 of 'memcmp' makes 
> > pointer from
> > integer without a cast [enabled by default]
> >   return !memcmp(uuid, empty_zero_page, BTRFS_UUID_SIZE);
> >   ^
> > 
> > In fact there seem to be two more incorrect usages in the kernel. The patch
> > below is not really tested.
> 
> Thanks Heiko,
> 
> I'll make a new pull with the btrfs part of this.

Or something slightly different ;)

BUG: unable to handle kernel paging request at 01ba6000
IP: [] memcmp+0xf/0x22

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-11-15 Thread Chris Mason
Hi Linus,

This pull fixes the empty_zero_page bug that Heiko reported, and
includes one more cleanup from Al Viro.   Please grab my for-linus:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Chris Mason (1) commits (+7/-2):
btrfs: fix empty_zero_page misusage

Al Viro (1) commits (+4/-9):
btrfs: get rid of fdentry()

Total: (2) commits (+11/-11)

 fs/btrfs/ctree.h |  5 -
 fs/btrfs/ioctl.c | 17 +++--
 2 files changed, 11 insertions(+), 11 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs Maintainer update

2013-12-04 Thread Chris Mason
Hi Linus,

I'm still getting settled into new devel hardware etc, but I do have one
commit for the next rc.  Please grab it from my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This changes my email over to fb.com, and adds a MAINTAINERS entry for
Josef as well.

Instead of diffstat etc, I've put the patch below.  The commit is also
signed, my updated key should be floating around the pgp servers now.

commit c0778e2534b81be6b85b5ee07c0e15ff774f7136
Author: Chris Mason 
Date:   Tue Dec 3 20:16:03 2013 -0500

Btrfs: update the MAINTAINERS file

Josef and I have new email addresses

Signed-off-by: Chris Mason 
Signed-off-by: Josef Bacik 

diff --git a/MAINTAINERS b/MAINTAINERS
index ffcaf97..db89b5d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1921,7 +1921,8 @@ S:Maintained
 F: drivers/gpio/gpio-bt8xx.c
 
 BTRFS FILE SYSTEM
-M: Chris Mason 
+M: Chris Mason 
+M: Josef Bacik 
 L: linux-bt...@vger.kernel.org
 W: http://btrfs.wiki.kernel.org/
 Q: http://patchwork.kernel.org/project/linux-btrfs/list/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] Btrfs Maintainer update

2013-12-05 Thread Chris Mason
Quoting Linus Torvalds (2013-12-05 12:53:26)
> On Wed, Dec 4, 2013 at 9:53 AM, Chris Mason  wrote:
> >
> > Instead of diffstat etc, I've put the patch below.  The commit is also
> > signed, my updated key should be floating around the pgp servers now.
> 
> You've done something to the git tree since your test request, because
> now the signed tag isn't there any more, now it's just a regular
> branch.
> 
> Mind re-uploading the *tag*, which is where the signature is?

I had used git commit -S instead of signing the tag.  I just pushed out
a signed tag on for-linus as well.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-12-12 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch (or for-linus signed tag):

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is a small collection of fixes.  It was rebased this morning, but I
was just fixing signed-off-by tags with the wrong email.

Wang Shilong (3) commits (+57/-24):
Btrfs: skip building backref tree for uuid and quota tree when doing 
balance relocation (+3/-1)
Btrfs: make sure we cleanup all reloc roots if error happens (+7/-0)
Btrfs: fix an oops when doing balance relocation (+47/-23)

Filipe David Borba Manana (1) commits (+10/-12):
Btrfs: don't miss skinny extent items on delayed ref head contention

Dan Carpenter (1) commits (+2/-2):
Btrfs: fix access_ok() check in btrfs_ioctl_send()

Miao Xie (1) commits (+2/-3):
Btrfs: don't clear the default compression type

David Sterba (1) commits (+2/-1):
btrfs: call mnt_drop_write after interrupted subvol deletion

Total: (7) commits (+73/-42)

 fs/btrfs/extent-tree.c | 22 +++---
 fs/btrfs/ioctl.c   |  3 +-
 fs/btrfs/relocation.c  | 81 +++---
 fs/btrfs/send.c|  4 +--
 fs/btrfs/super.c   |  5 ++--
 5 files changed, 73 insertions(+), 42 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs fixes

2013-03-17 Thread Chris Mason
Hi Linus,

My for-linus branch has some btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Eric's rcu barrier patch fixes a long standing problem with our unmount
code hanging on to devices in workqueue helpers.  Liu Bo nailed down a
difficult assertion for in-memory extent mappings.

Liu Bo (4) commits (+9/-7):
Btrfs: get better concurrency for snapshot-aware defrag work (+3/-0)
Btrfs: fix warning when creating snapshots (+5/-6)
Btrfs: fix warning of free_extent_map (+1/-0)
Btrfs: remove btrfs_try_spin_lock (+0/-1)

Josef Bacik (1) commits (+4/-1):
Btrfs: return EIO if we have extent tree corruption

Eric Sandeen (1) commits (+6/-0):
btrfs: use rcu_barrier() to wait for bdev puts at unmount

Wang Shilong (1) commits (+6/-4):
Btrfs: return as soon as possible when edquot happens

Total: (7) commits (+25/-12)

 fs/btrfs/extent-tree.c |  5 -
 fs/btrfs/file.c|  1 +
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/locking.h |  1 -
 fs/btrfs/qgroup.c  | 10 ++
 fs/btrfs/transaction.c | 11 +--
 fs/btrfs/volumes.c |  6 ++
 7 files changed, 25 insertions(+), 12 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs fixes

2013-02-06 Thread Chris Mason
[ sorry, my lbdb seems to really like linux-ker...@vger.kerrnel.org,
fixed for real this time ]

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've got corner cases for updating i_size that ceph was hitting, error
handling for quotas when we run out of space, a very subtle snapshot
deletion race, a crash while removing devices, and one deadlock between
subvolume creation and the sb_internal code (thanks lockdep).

Josef Bacik (3) commits (+12/-4):
Btrfs: do not merge logged extents if we've removed them from the tree 
(+2/-1)
Btrfs: fix possible stale data exposure (+1/-1)
Btrfs: fix missing i_size update (+9/-2)

Miao Xie (2) commits (+21/-9):
Btrfs: fix missing release of the space/qgroup reservation in 
start_transaction() (+19/-8)
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() (+2/-1)

Jan Schmidt (1) commits (+10/-12):
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata

Liu Bo (1) commits (+38/-9):
Btrfs: fix race between snapshot deletion and getting inode

Chris Mason (1) commits (+4/-1):
Btrfs: move d_instantiate outside the transaction during mksubvol

Eric Sandeen (1) commits (+2/-1):
btrfs: don't try to notify udev about missing devices

Total: (9) commits

 fs/btrfs/extent-tree.c  | 22 ++
 fs/btrfs/extent_map.c   |  3 ++-
 fs/btrfs/file.c | 25 -
 fs/btrfs/ioctl.c|  5 -
 fs/btrfs/ordered-data.c | 13 ++---
 fs/btrfs/scrub.c| 25 -
 fs/btrfs/transaction.c  | 27 +++
 fs/btrfs/volumes.c  |  3 ++-
 8 files changed, 87 insertions(+), 36 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops when mounting btrfs partition

2013-02-02 Thread Chris Mason
Hi Arnd,

First things first, nospace_cache is a safe thing to use.  It is slow
because it's finding free extents, but it's just a cache and always safe
to discard.  With your other errors, I'd just mount it readonly
and then you won't waste time on atime updates.

I'll take a look at the BUG you got during log recovery.  We've fixed a
few of those during the 3.8 rc cycle.

> Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
> a01fdcf7 [verbose debug info unavailable]

> Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
> 15619835 off 454656 csum 2755731641 private 864823192
> Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
> errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> ...
> Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
> failed on 17006399488 wanted 54700 found 54764

These aren't good.  With a few exceptions for really tight races in fsx
use cases, csum errors are bad data from the disk.  The transid verify
failed shows we wanted to find a metadata block from generation 54700
but found 54764 instead:

54700 = 0xD5AC
54764 = 0xD5EC

This same bad block comes up a few different times.

> Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error 
> corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)

This shows we pulled from the second copy of this block and got the
right answer, and then wrote the right answer to the duplicate.
Inode 1 means it was metadata.

But for some reason still aborted the transaction.  It could have been
an EIO on the correction, but the auto correction code in 3.5 did work
well.

I think your plan to pull the data off and reformat is a good one.  I'd
also look hard at your ram since drives don't usually send back single bit
errors.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason
Quoting Tejun Heo (2013-04-19 01:57:54)
> 
> Ewweehh
> 
> No wonder this thing crashes.  Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?

Yes, we can definitely carry bbio up higher in the stack.  I'll patch it
up right now.  I do agree that it'll be too big for -final, but we'll
have it either way.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason
Quoting Jens Axboe (2013-04-19 09:32:50)
> > 
> > No wonder this thing crashes.  Chris, can't the original bio carry
> > bbio in bi_private and let end_bio_extent_readpage() free the bbio
> > instead of abusing bi_bdev like this?
> 
> Ugh, wtf.
> 
> Chris, time for a swim in the bay :-)

Yeah, I can't really defend this one.  We needed a space for an int and
I assumed end_io meant the FS was free to do horrible things.

Really though, I'll just take a quick dip in the lake and patch this out
of btrfs. 

Jan is probably right about changing around our endio callbacks to
explicitly pass the mirror, it should be less complex and cleaner.

Many thanks to everyone here that tracked it down.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] SLAB changes for v3.10

2013-05-08 Thread Chris Mason
[ Sorry if I break the threading on this, I had to pull it off gmane ]

On Tue, 7 May 2013, Tony Lindgren wrote:
> OK got it narrowed down to CONFIG_DEBUG_SPINLOCK=y causing the problem
> with commit 8a965b3b. Ain't nothing like bisecting and booting and then
> diffing .config files on top of that.

I'm unable to boot with slab on current Linus -master, and bisected it
down almost as far as Tony did before trying SLUB and then finding this
thread.   My box is a standard x86-64, nothing exciting and spinlock
debugging isn't on.

A few printks stuffed into Christoph's code:

cache 88047f80 at index 5
creating slab cache at 6
create special cache #1 (this is kmalloc_caches[1])
cache 88047f0001c0 at index 7
creating slab cache at 8
creating slab cache at 9
creating slab cache at 10
... more get created

Pulling this into the code from commit 8a965b3b:

for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
if (!kmalloc_caches[i]) {
kmalloc_caches[i] = create_kmalloc_cache(NULL,
1 << i, flags);

/*
 * Caches that are not of the two-to-the-power-of size.
 * These have to be created immediately after the
 * earlier power of two caches
 */
if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i 
== 6)
kmalloc_caches[1] = create_kmalloc_cache(NULL, 
96, flags);

if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i 
== 7)
kmalloc_caches[2] = create_kmalloc_cache(NULL, 
192, flags);
}
}

kmalloc_caches[7] was not null, and so kmalloc_caches[2] was never
created.

I get this oops (with early printk on)

[ cut here ]
kernel BUG at mm/slab.c:1635!
invalid opcode:  [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-josef+ #920
Hardware name: Supermicro 
X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0a 
03/06/2012
task: 8196a410 ti: 8195a000 task.ti: 8195a000
RIP: 0010:[]  [] 
kmem_cache_init_late+0x40/0x7d
RSP: :8195bf78  EFLAGS: 00010282
RAX: fff4 RBX: 88047f006480 RCX: ff31
RDX: 000e RSI: 0046 RDI: 81baf238
RBP: 8195bf80 R08: 0400 R09: 
R10: 2fa4 R11:  R12: 81ab58d0
R13: 81abd2c0 R14: 88047ffaf0c0 R15: 
FS:  () GS:88047fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 88047000 CR3: 01965000 CR4: 000406b0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
  8195bfc0 81a25b64 81a2573d
 81abd2c0 3000  
  8195bfd0 81a2547f 8195bfe8
Call Trace:
 [] start_kernel+0x235/0x394
 [] ? repair_env_string+0x58/0x58
 [] x86_64_start_reservations+0x2a/0x2c
 [] x86_64_start_kernel+0xc7/0xca
Code: 53 e8 40 57 bd ff 48 8b 05 21 f2 f4 ff 48 8d 58 a8 48 8d 43 58 48 3d a0 
8e 99 81 74 1a 31 f6 48 89 df e8 ba 7f 6d ff 85 c0 74 02 <0f> 0b 48 8b 5b 58 48 
83 eb 58 eb da 48 c7 c7 70 8e 99 81 e8 e6
RIP  [] kmem_cache_init_late+0x40/0x7d
 RSP 
---[ end trace 2e5587581263f881 ]---

This patch fixes things for me, but to maintain the rules from
Christoph's patch,  kmalloc_caches[2] should have been created whenever
kmalloc_caches[7] was done.

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d2517b0..ff3218a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -446,18 +446,18 @@ void __init create_kmalloc_caches(unsigned long flags)
if (!kmalloc_caches[i]) {
kmalloc_caches[i] = create_kmalloc_cache(NULL,
1 << i, flags);
+   }
 
-   /*
-* Caches that are not of the two-to-the-power-of size.
-* These have to be created immediately after the
-* earlier power of two caches
-*/
-   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i 
== 6)
-   kmalloc_caches[1] = create_kmalloc_cache(NULL, 
96, flags);
+   /*
+* Caches that are not of the two-to-the-power-of size.
+* These have to be created immediately after the
+* earlier power of two caches
+*/
+   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
+   

Re: [GIT PULL] SLAB changes for v3.10

2013-05-08 Thread Chris Mason
Quoting Christoph Lameter (2013-05-08 14:25:49)
> On Wed, 8 May 2013, Chris Mason wrote:
> 
> > This patch fixes things for me, but to maintain the rules from
> > Christoph's patch,  kmalloc_caches[2] should have been created whenever
> > kmalloc_caches[7] was done.
> 
> Not necessary. The early slab bootstrap must create some slab caches of
> specific sizes, it will only use those during very early bootstrap.
> 
> The later creation of the array must skip those.
> 
> You correctly moved the checks out of the if (!kmalloc_cacheS())
> condition so that the caches are created properly.

But if the ordering is required at all, why is it ok to create cache 2
after cache 6 instead of after cache 7?

IOW if we can safely do cache 2 after cache 6, why can't we just do both
cache 1 and cache 2 after the loop?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Fix crash during slab init

2013-05-08 Thread Chris Mason
Commit 8a965b3b introduced a regression that caused us to crash early
during boot.  The commit was introducing ordering of slab creation,
making sure two odd-sized slabs were created after specific powers of
two sizes.

But, if any of the  power of two slabs were created earlier during boot,
slabs at index 1 or 2 might not get created at all.  This patch makes
sure none of the slabs get skipped.

Tony Lindgren bisected this down to the offending commit, which really
helped because bisect kept bringing me to almost but not quite this one.

Signed-off-by: Chris Mason 
Acked-by: Christoph Lameter 
Acked-by: Tony Lindgren 

---

v1->v2 reword description

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d2517b0..ff3218a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -446,18 +446,18 @@ void __init create_kmalloc_caches(unsigned long flags)
if (!kmalloc_caches[i]) {
kmalloc_caches[i] = create_kmalloc_cache(NULL,
1 << i, flags);
+   }
 
-   /*
-* Caches that are not of the two-to-the-power-of size.
-* These have to be created immediately after the
-* earlier power of two caches
-*/
-   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i 
== 6)
-   kmalloc_caches[1] = create_kmalloc_cache(NULL, 
96, flags);
+   /*
+* Caches that are not of the two-to-the-power-of size.
+* These have to be created immediately after the
+* earlier power of two caches
+*/
+   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
+   kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, 
flags);
 
-   if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i 
== 7)
-   kmalloc_caches[2] = create_kmalloc_cache(NULL, 
192, flags);
-   }
+   if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
+   kmalloc_caches[2] = create_kmalloc_cache(NULL, 192, 
flags);
}
 
/* Kmalloc array is now usable */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-05-09 Thread Chris Mason
 of btrfs_csum_data (+15/-21)
Btrfs: return free space in cow error path (+9/-3)
Btrfs: improve the loop of scrub_stripe (+57/-26)
Btrfs: use helper to cleanup tree roots (+1/-14)
Btrfs: share stop worker code (+23/-32)
Btrfs: cleanup unused function (+0/-1)
Btrfs: pass NULL instead of 0 (+1/-1)

Jan Schmidt (7) commits (+682/-212):
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log 
(+63/-19)
Btrfs: fix accessing the root pointer in tree mod log functions (+20/-20)
Btrfs: split btrfs_qgroup_account_ref into four functions (+148/-105)
Btrfs: fix tree mod log regression on root split operations (+29/-26)
Btrfs: automatic rescan after "quota enable" command (+11/-0)
Btrfs: fix unlock after free on rewinded tree blocks (+11/-7)
Btrfs: rescan for qgroups (+400/-35)

Tsutomu Itoh (6) commits (+59/-79):
Btrfs: remove unused variable in __process_changed_new_xattr() (+0/-2)
Btrfs: cleanup of function where btrfs_extend_item() is called (+2/-3)
Btrfs: cleanup of function where fixup_low_keys() is called (+38/-51)
Btrfs: remove unused argument of btrfs_extend_item() (+9/-11)
Btrfs: remove unused argument of fixup_low_keys() (+8/-10)
Btrfs: fix error handling in btrfs_ioctl_send() (+2/-2)

Stefan Behrens (4) commits (+51/-23):
Btrfs: allow omitting stream header and end-cmd for btrfs send (+33/-11)
Btrfs: clear received_uuid field for new writable snapshots (+8/-4)
Btrfs: delete unused parameter to btrfs_read_root_item() (+6/-8)
Btrfs: set UUID in root_item for created trees (+4/-0)

Eric Sandeen (4) commits (+383/-458):
btrfs: document mount options in Documentation/fs/btrfs.txt (+173/-7)
btrfs: ignore device open failures in __btrfs_open_devices (+3/-3)
btrfs: make static code static & remove dead code (+135/-392)
btrfs: move leak debug code to functions (+72/-56)

Miao Xie (4) commits (+155/-53):
Btrfs: use a lock to protect incompat/compat flag of the super block 
(+26/-11)
Btrfs: allocate new chunks if the space is not enough for global rsv 
(+11/-8)
Btrfs: improve the performance of the csums lookup (+111/-31)
Btrfs: fix unblocked autodefraggers when remount (+7/-3)

Zhi Yong Wu (2) commits (+2/-7):
btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range() (+2/-5)
btrfs: Cleanup some redundant codes in btrfs_log_inode() (+0/-2)

Zach Brown (1) commits (+2/-0):
btrfs: abort unlink trans in missed error case

Simon Kirby (1) commits (+133/-109):
Btrfs: Include the device in most error printk()s

Nathaniel Yazdani (1) commits (+1/-1):
btrfs: fix minor typo in comment

Chris Mason (1) commits (+5/-0):
Btrfs: allow superblock mismatch from older mkfs

Vincent (1) commits (+3/-2):
Btrfs: fix reada debug code compilation

Total: (101) commits

 Documentation/filesystems/btrfs.txt | 180 +++-
 fs/btrfs/Kconfig|  22 +-
 fs/btrfs/backref.c  |  87 ++--
 fs/btrfs/backref.h  |   3 -
 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/compression.c  |  14 +-
 fs/btrfs/compression.h  |   2 -
 fs/btrfs/ctree.c| 382 +++-
 fs/btrfs/ctree.h| 145 ---
 fs/btrfs/delayed-inode.c|  66 +--
 fs/btrfs/delayed-ref.c  |  30 +-
 fs/btrfs/dir-item.c |  11 +-
 fs/btrfs/disk-io.c  | 409 ++
 fs/btrfs/disk-io.h  |   5 +-
 fs/btrfs/extent-tree.c  | 549 +++
 fs/btrfs/extent_io.c| 310 ++---
 fs/btrfs/extent_io.h|  44 +-
 fs/btrfs/extent_map.c   |  23 +-
 fs/btrfs/extent_map.h   |   3 +-
 fs/btrfs/file-item.c| 102 ++---
 fs/btrfs/file.c |  37 +-
 fs/btrfs/free-space-cache.c | 596 +++--
 fs/btrfs/free-space-cache.h |   5 +
 fs/btrfs/inode-item.c   |  17 +-
 fs/btrfs/inode.c| 183 
 fs/btrfs/ioctl.c| 108 -
 fs/btrfs/locking.c  |   4 +-
 fs/btrfs/ordered-data.c |  28 +-
 fs/btrfs/ordered-data.h |   3 +-
 fs/btrfs/print-tree.c   |   9 +-
 fs/btrfs/print-tree.h   |   2 +-
 fs/btrfs/qgroup.c   | 840 
 fs/btrfs/raid56.c   |  14 +-
 fs/btrfs/reada.c|   5 +-
 fs/btrfs/relocation.c   | 111 +++--
 fs/btrfs/root-tree.c|   7 +-
 fs/btrfs/scrub.c| 130 +++---
 fs/btrfs/send.c |  32 +-
 fs/btrfs/send.h |   1 -
 fs/btrfs/super.c| 107 +++--
 fs/btrfs/transaction.c  |  95 ++--
 fs/btrfs/transaction.h  |   3 +-
 fs/btrfs/tree-log.c

[GIT PULL] Btrfs

2013-03-02 Thread Chris Mason
 (+28/-10)
Btrfs: remove extent mapping if we fail to add chunk (+12/-2)
Btrfs: relax the block group size limit for bitmaps (+9/-3)
Btrfs: cleanup orphan reservation if truncate fails (+2/-0)
Btrfs: make sure NODATACOW also gets NODATASUM set (+2/-1)
Btrfs: don't re-enter when allocating a chunk (+9/-0)
Btrfs: remove unused extent io tree ops V2 (+11/-27)
Btrfs: fix chunk allocation error handling (+22/-10)

Liu Bo (14) commits (+796/-109):
Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay (+3/-6)
Btrfs: fix cleaner thread not working with inode cache option (+8/-1)
Btrfs: use token to avoid times mapping extent buffer (+35/-28)
Btrfs: extend the checksum item as much as possible (+46/-21)
Btrfs: fix NULL pointer after aborting a transaction (+7/-1)
Btrfs: use reserved space for creating a snapshot (+2/-0)
Btrfs: kill unused argument of update_block_group (+5/-7)
Btrfs: kill unused arguments of cache_block_group (+5/-8)
Btrfs: do not change inode flags in rename (+0/-25)
Btrfs: record first logical byte in memory (+20/-1)
Btrfs: fix memory leak of log roots (+9/-2)
Btrfs: remove deprecated comments (+0/-6)
Btrfs: snapshot-aware defrag (+654/-0)
Btrfs: save us a read_lock (+2/-3)

Eric Sandeen (11) commits (+58/-108):
btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk (+5/-1)
btrfs: remove unused "item" in btrfs_insert_delayed_item() (+0/-2)
btrfs: remove unused fs_info from btrfs_decode_error() (+4/-5)
btrfs: remove cache only arguments from defrag path (+32/-82)
btrfs: remove unnecessary DEFINE_WAIT() declarations (+0/-2)
btrfs: annotate intentional switch case fallthroughs (+2/-0)
btrfs: add missing break in btrfs_print_leaf() (+1/-0)
btrfs: remove unused fd in btrfs_ioctl_send() (+0/-3)
btrfs: handle null fs_info in btrfs_panic() (+7/-4)
btrfs: fix varargs in __btrfs_std_error (+7/-7)
btrfs: list_entry can't return NULL (+0/-2)

Chris Mason (7) commits (+561/-30):
Btrfs: reduce CPU contention while waiting for delayed extent operations 
(+70/-5)
Btrfs: remove conflicting check for minimum number of devices in raid56 
(+0/-8)
Btrfs: reduce lock contention on extent buffer locks (+16/-0)
Btrfs: add a plugging callback to raid56 writes (+124/-4)
Btrfs: fix cluster alignment for mount -o ssd (+6/-1)
Btrfs: fix max chunk size on raid5/6 (+21/-4)
Btrfs: Add a stripe cache to raid56 (+324/-8)

Wang Shilong (6) commits (+78/-68):
Btrfs: remove reduplicate check about root in the function 
btrfs_clean_quota_tree (+0/-3)
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more 
logic (+38/-44)
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails 
(+9/-3)
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails 
(+6/-5)
Btrfs: fix missing deleted items in btrfs_clean_quota_tree (+21/-13)
Btrfs: fix missing check before disabling quota (+4/-0)

David Sterba (6) commits (+131/-42):
btrfs: access superblock via pagecache in scan_one_device (+64/-6)
btrfs: put some enospc messages under enospc_debug (+15/-11)
btrfs: try harder to allocate raid56 stripe cache (+26/-7)
btrfs: use only inline_pages from extent buffer (+7/-17)
btrfs: remove a printk from scan_one_device (+0/-1)
btrfs: add cancellation points to defrag (+19/-0)

Zach Brown (2) commits (+9/-12):
btrfs: limit fallocate extent reservation to 256MB (+4/-3)
btrfs: define BTRFS_MAGIC as a u64 value (+5/-9)

David Woodhouse (2) commits (+2294/-113):
Btrfs: add rw argument to merge_bio_hook() (+11/-11)
Btrfs: RAID5 and RAID6 (+2283/-102)

Ilya Dryomov (2) commits (+6/-6):
Btrfs: allow for selecting only completely empty chunks (+1/-1)
Btrfs: eliminate a use-after-free in btrfs_balance() (+5/-5)

jeff.liu (2) commits (+67/-0):
Btrfs: Add a new ioctl to get the label of a mounted file system (+23/-0)
Btrfs: set/change the label of a mounted file system (+44/-0)

Filipe Brandenburger (1) commits (+19/-11):
Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h

Mark Fasheh (1) commits (+54/-4):
btrfs: add "no file data" flag to btrfs send ioctl

Alexandre Oliva (1) commits (+3/-3):
clear chunk_alloc flag on retryable failure

Thomas Gleixner (1) commits (+1/-0):
btrfs: Init io_lock after cloning btrfs device struct

Paul Gortmaker (1) commits (+1/-4):
btrfs: fixup/remove module.h usage as required

Tomasz Torcz (1) commits (+1/-0):
Btrfs: select XOR_BLOCKS in Kconfig

Jan Schmidt (1) commits (+1/-4):
Btrfs: fix backref walking race with tree deletions

Qu Wenruo (1) commits (+25/-38):
btrfs: cleanup for open-coded alignment

Kusanagi Kouichi (1) commits (+1/-1):
Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS

Arne Jansen (1) commits (+1/-1):
Btrfs: fix crash in log replay with qgroups 

Re: [GIT PULL] Btrfs

2013-03-02 Thread Chris Mason
On Sat, Mar 02, 2013 at 05:45:41PM -0700, Linus Torvalds wrote:
> On Sat, Mar 2, 2013 at 7:15 AM, Chris Mason  wrote:
> >
> > Our set of btrfs features, fixes and cleanups are in my for-linus
> > branch:
> 
> I *really* wish that big pull requests like this would come in earlier
> in the merge window. I hate seeing them the day before I close the
> window - really.  A number of the latter commits are done in the last
> few days, which also smells bad.

Definitely, I wanted to send this earlier in the merge window.  But I
was out last week and also didn't want to send the big stuff (raid 5/6
and the fsync work) to you right before I left on vacation.

So instead I sent things off to linux-next, and everyone on the btrfs
list collected fixes while I was gone.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] btrfs/raid56: Add missing #include

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 04:44:41AM -0700, Geert Uytterhoeven wrote:
> tilegx_defconfig:
> 
> fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
> fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' 
> [-Werror=implicit-function-declaration]
> fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer 
> without a cast [enabled by default]
> fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' 
> [-Werror=implicit-function-declaration]

Thanks, I've got this one in my for-linus now.  It'll go with the next
pull.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs fixup

2013-03-03 Thread Chris Mason
Hi Linus,

Geert and James both sent this one in, sorry guys.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Geert Uytterhoeven (1) commits (+1/-0):
btrfs/raid56: Add missing #include 

Total: (1) commits (+1/-0)

 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] One more btrfs

2013-04-13 Thread Chris Mason
Hi Linus

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has a recent fix from Josef for our tree log replay code.  It fixes
problems where the inode counter for the number of bytes in the file
wasn't getting updated properly during fsync replay.

The commit did get rebased this morning, but it was only to clean up the
subject line.  The code hasn't changed.

Josef Bacik (1) commits (+42/-6):
Btrfs: make sure nbytes are right after log replay

Total: (1) commits (+42/-6)

 fs/btrfs/tree-log.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2014-02-16 Thread Chris Mason
Hi Linus,

We have a small collection of fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

The big thing that stands out is a revert of a new ioctl.  Users haven't
shipped yet in btrfs-progs, and Dave Sterba found a better way to export
the information.

Anand Jain (1) commits (+8/-2):
btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105

Liu Bo (1) commits (+0/-1):
Btrfs: fix a lockdep warning when cleaning up aborted transaction

Filipe David Borba Manana (1) commits (+10/-0):
Btrfs: use right clone root offset for compressed extents

Mitch Harder (1) commits (+1/-1):
Btrfs: fix max_inline mount option

Chris Mason (1) commits (+0/-17):
Revert "btrfs: add ioctl to export size of global metadata reservation"

Josef Bacik (1) commits (+9/-2):
Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol

Total: (6) commits (+28/-23)

 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/inode.c   |  2 +-
 fs/btrfs/ioctl.c   | 16 
 fs/btrfs/send.c| 10 ++
 fs/btrfs/super.c   | 11 +--
 fs/btrfs/sysfs.c   | 10 --
 include/uapi/linux/btrfs.h |  1 -
 7 files changed, 28 insertions(+), 23 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] fs: xattr-based FS_IOC_[GS]ETFLAGS interface

2014-01-07 Thread Chris Mason
On Tue, 2014-01-07 at 11:43 -0800, Darrick J. Wong wrote:
> On Tue, Jan 07, 2014 at 12:04:30PM -0500, Theodore Ts'o wrote:
> > On Tue, Jan 07, 2014 at 07:49:35AM -0800, Christoph Hellwig wrote:
> > > On Tue, Jan 07, 2014 at 01:48:31PM +0100, Jan Kara wrote:
> > > >   I have to say I'm not thrilled by the idea of juggling strings in
> > > > userspace and in kernel to set a flag for an inode...
> > > 
> > > Nevermind the massive amounts of code that sit in the filesystem.
> > 
> > The reason for this patch was to address what Dave Chinner has called
> > "a shitty interface"[1].  Using bitfields that need to be coordinated
> > across file systems, when sometimes a bit assignment is validly a fs
> > specific thing, and then later becomes something that gets shared
> > across file systems.
> > 
> > [1] http://thread.gmane.org/gmane.linux.file-systems/80164/focus=80396
> > 
> > If we don't go about it this way, there are alternatives: we could
> > create new ioctls (or a new syscall) as we start running out of bits
> > used by FS_IOC_[GS]ETFLAGS.  We can create new ioctls for bits which
> > are intended for fs-specific flags, which then later get promoted to
> > the new syscall when some functionality starts to get shared accross
> > other file systems (probably with a different bit assignment).  This
> > is certainly less code, but it does mean more complexity outside of
> > the code when we try to coordinate new functionality across file
> > systems.
> 
> I had thought of indexed inode flags as an alternative to the xattr/string
> parsing thing.  Feature flags make their first appearance as part of a per-FS
> flag-space and are migrated to the common flag-space when there is demand.
> It would also avoid the need for each fs to create its own flag ioctl.
> 
> On the other hand, someone suggested I try remaking IOC_[GS]ETFLAG as an 
> xattr,
> so off I went. :)
> 

At least in btrfs xattrs are more expensive than something right in the
inode.  We can cache it when we load the inode (it'll be right next to
the inode most of the time) but for performance critical things I do
like the good old fashioned flags.

It's also possible to turn xattrs off, so we have to deal with
filesystems that are mounted with them off and then back on again.  I
can't think of huge problems from that right now, just something to be
aware of.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-07 Thread Chris Mason
On Tue, 2014-01-07 at 12:15 -0800, Muthu Kumar wrote:
> Thanks Fengguang. Final patch with added comment. BTW, fengguang
> mentioned that git-am has trouble with the inline patch and "quilt
> import" worked fine for him...
> 
> 
> In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not,
> we restore the orig_bio but failed to increment bi_remaining for
> orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix
> is to increment bi_remaining when we restore the orig bio as well.
> 

Hi everyone,

Which git tree is this against?  Just Jens or some extra code too?

I'll run some tests here.  My original patch is below (it's slightly
different from Muthu's).

Btrfs is sometimes calling bio_endio twice on the same bio while
we chain things.  This makes sure we don't trip over new assertions in
fs/bio.c

Signed-off-by: Chris Mason 

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7fcac70..5b30360 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2289,6 +2289,10 @@ static void btrfsic_bio_end_io(struct bio *bp, int 
bio_error_status)
block = next_block;
} while (NULL != block);
 
+   /*
+* since we're not using bio_endio here, we don't need to worry about
+* the remaining count
+*/
bp->bi_end_io(bp, bio_error_status);
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 62176ad..786ddac 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1684,7 +1684,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
bio->bi_private = end_io_wq->private;
bio->bi_end_io = end_io_wq->end_io;
kfree(end_io_wq);
-   bio_endio(bio, error);
+   bio_endio_nodec(bio, error);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ef48947..a31448f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5284,9 +5284,17 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
}
 
-   if (bio == bbio->orig_bio)
+   if (bio == bbio->orig_bio) {
is_orig_bio = 1;
 
+   /*
+* eventually we will call the bi_endio for the original bio,
+* make sure that we've properly bumped bi_remaining to reflect
+* our chain of endios here
+*/
+   atomic_inc(>bi_remaining);
+   }
+
if (atomic_dec_and_test(>stripes_pending)) {
if (!is_orig_bio) {
bio_put(bio);
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] fs: xattr-based FS_IOC_[GS]ETFLAGS interface

2014-01-07 Thread Chris Mason
On Tue, 2014-01-07 at 14:02 -0800, Darrick J. Wong wrote:
> On Tue, Jan 07, 2014 at 07:59:15PM +0000, Chris Mason wrote:
> > On Tue, 2014-01-07 at 11:43 -0800, Darrick J. Wong wrote:
> > > On Tue, Jan 07, 2014 at 12:04:30PM -0500, Theodore Ts'o wrote:
> > > > On Tue, Jan 07, 2014 at 07:49:35AM -0800, Christoph Hellwig wrote:
> > > > > On Tue, Jan 07, 2014 at 01:48:31PM +0100, Jan Kara wrote:
> > > > > >   I have to say I'm not thrilled by the idea of juggling strings in
> > > > > > userspace and in kernel to set a flag for an inode...
> > > > > 
> > > > > Nevermind the massive amounts of code that sit in the filesystem.
> > > > 
> > > > The reason for this patch was to address what Dave Chinner has called
> > > > "a shitty interface"[1].  Using bitfields that need to be coordinated
> > > > across file systems, when sometimes a bit assignment is validly a fs
> > > > specific thing, and then later becomes something that gets shared
> > > > across file systems.
> > > > 
> > > > [1] http://thread.gmane.org/gmane.linux.file-systems/80164/focus=80396
> > > > 
> > > > If we don't go about it this way, there are alternatives: we could
> > > > create new ioctls (or a new syscall) as we start running out of bits
> > > > used by FS_IOC_[GS]ETFLAGS.  We can create new ioctls for bits which
> > > > are intended for fs-specific flags, which then later get promoted to
> > > > the new syscall when some functionality starts to get shared accross
> > > > other file systems (probably with a different bit assignment).  This
> > > > is certainly less code, but it does mean more complexity outside of
> > > > the code when we try to coordinate new functionality across file
> > > > systems.
> > > 
> > > I had thought of indexed inode flags as an alternative to the xattr/string
> > > parsing thing.  Feature flags make their first appearance as part of a 
> > > per-FS
> > > flag-space and are migrated to the common flag-space when there is demand.
> > > It would also avoid the need for each fs to create its own flag ioctl.
> > > 
> > > On the other hand, someone suggested I try remaking IOC_[GS]ETFLAG as an 
> > > xattr,
> > > so off I went. :)
> > > 
> > 
> > At least in btrfs xattrs are more expensive than something right in the
> > inode.  We can cache it when we load the inode (it'll be right next to
> > the inode most of the time) but for performance critical things I do
> > like the good old fashioned flags.
> 
> Just to clarify -- I wasn't proposing any on-disk changes for any filesystems,
> merely creating virtual xattrs that wrap the inode flags.

Ah, sorry I missed that part.  I was assuming that as we run out of
flags we'll want to use real xattrs instead.

> 
> > It's also possible to turn xattrs off, so we have to deal with
> > filesystems that are mounted with them off and then back on again.  I
> > can't think of huge problems from that right now, just something to be
> > aware of.
> 
> Just to satisfy my curiosity: are xattrs always separate objects in btrfs?

Strictly speaking, yes.  They aren't in the inode struct.  But most of
the time they will be in the same btree block.  For anything performance
critical we can order the keys to push them to the front.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-08 Thread Chris Mason
On Tue, 2014-01-07 at 13:23 -0800, Muthu Kumar wrote:
> Chris,
> This is based off of Jens block tree, for-3.14/core branch...
> 

Ok, Kent did pull in one of my hunks, one was a comment and the third
was effectively the same as your patch.  I tried to test the end result
today, but get these on boot with ext4:

[8.336061] WARNING: CPU: 0 PID: 0 at fs/bio.c:1778 bio_endio+0xbe/0x100()
[8.336062] bio_endio: bio for (unknown) without endio
[8.336063] Modules linked in: megaraid_sas(+)
[8.336065] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc7-mason+ #1
[8.336066] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 
05/10/2012
[8.336069]  06f2 88087fc03c28 815cb8c6 
06f2
[8.336071]  88087fc03c78 88087fc03c68 81047497 
88085561a8e8
[8.336073]  8808582b6d80 00fe fffb 
8808582b6d80
[8.336073] Call Trace:
[8.336078][] dump_stack+0x49/0x5b
[8.336082]  [] warn_slowpath_common+0x87/0xb0
[8.336084]  [] warn_slowpath_fmt+0x41/0x50
[8.336086]  [] ? scsi_request_fn+0xc8/0x6a0
[8.336087]  [] bio_endio+0xbe/0x100
[8.336091]  [] blk_update_request+0x243/0x3a0
[8.336092]  [] blk_update_bidi_request+0x22/0xa0
[8.336094]  [] blk_end_bidi_request+0x2a/0x80
[8.336096]  [] blk_end_request+0xb/0x10
[8.336098]  [] scsi_io_completion+0xa6/0x700
[8.336100]  [] scsi_finish_command+0xc8/0x130
[8.336101]  [] scsi_softirq_done+0x13f/0x160
[8.336104]  [] blk_done_softirq+0x6d/0x80
[8.336106]  [] __do_softirq+0xdb/0x290
[8.336108]  [] irq_exit+0xbd/0xd0
[8.336110]  [] do_IRQ+0x61/0xe0
[8.336112]  [] common_interrupt+0x6a/0x6a
[8.336117][] ? cpuidle_enter_state+0x4a/0xc0
[8.336119]  [] ? cpuidle_enter_state+0x46/0xc0
[8.336121]  [] cpuidle_idle_call+0xc7/0x160
[8.336123]  [] arch_cpu_idle+0x9/0x20
[8.336126]  [] cpu_startup_entry+0x9a/0x250
[8.336128]  [] rest_init+0x72/0x80
[8.336131]  [] start_kernel+0x3fd/0x40a
[8.336133]  [] ? repair_env_string+0x5b/0x5b
[8.336134]  [] x86_64_start_reservations+0x2a/0x2c
[8.336136]  [] x86_64_start_kernel+0x140/0x147
[8.336137] ---[ end trace d0966e2430ea53b4 ]---
[8.336146] [ cut here ]
[8.336146] kernel BUG at fs/bio.c:523!
[8.336148] invalid opcode:  [#1] SMP
[8.336148] Modules linked in: megaraid_sas(+)
[8.336150] CPU: 0 PID: 2911 Comm: scsi_id Tainted: GW
3.13.0-rc7-mason+ #1
[8.336150] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 
05/10/2012
[8.336151] task: 8808556b4150 ti: 8808556b6000 task.ti: 
8808556b6000
[8.336153] RIP: 0010:[]  [] 
bio_put+0x8a/0xa0
[8.336153] RSP: 0018:8808556b7b68  EFLAGS: 00010246
[8.336154] RAX:  RBX: 8808582b6d80 RCX: 
[8.336155] RDX: 8808582b6dec RSI: 0003 RDI: 8808582b6d80
[8.336155] RBP: 8808556b7b78 R08: 0004 R09: 
[8.336156] R10: 0001 R11: 0001 R12: 
[8.336156] R13:  R14: 8808567ebe28 R15: 8808582b6d80
[8.336157] FS:  7f16056bd700() GS:88087fc0() 
knlGS:
[8.336158] CS:  0010 DS:  ES:  CR0: 80050033
[8.336159] CR2: e8f7ffc0 CR3: 000856303000 CR4: 000407f0
[8.336159] Stack:
[8.336164]  8808582b6d80  8808556b7ba8 
81291b37
[8.336168]  8808556b7b88 8808556b7cf8 88085561a8e8 
880855685400
[8.336172]  8808556b7c78 8129b42d 8808556b7be8 
8119e09b
[8.336172] Call Trace:
[8.336174]  [] blk_rq_unmap_user+0x47/0x60
[8.336177]  [] sg_io+0x26d/0x370
[8.336179]  [] ? bdget+0x11b/0x130
[8.336183]  [] ? find_get_page+0x19/0xa0
[8.336185]  [] scsi_cmd_ioctl+0x409/0x480
[8.336186]  [] ? unlock_page+0x22/0x30
[8.336189]  [] ? __do_fault+0x439/0x560
[8.336191]  [] scsi_cmd_blk_ioctl+0x4c/0x70
[8.336194]  [] sd_ioctl+0xcf/0x160
[8.336196]  [] __blkdev_driver_ioctl+0x23/0x30
[8.336198]  [] blkdev_ioctl+0x1f8/0x790
[8.336199]  [] block_ioctl+0x37/0x40
[8.336201]  [] do_vfs_ioctl+0x87/0x4f0
[8.336204]  [] ? file_has_perm+0x8a/0xa0
[8.336205]  [] SyS_ioctl+0x91/0xa0
[8.336207]  [] system_call_fastpath+0x16/0x1b
[8.336218] Code: 8b 74 24 10 48 29 fb 48 89 df e8 a2 d2 f6 ff 48 8b 1c 24 
4c 8b 64 24 08 c9 c3 0f 1f 80 00 00 00 00 48 89 df e8 38 60 fb ff eb 9a <0f> 0b 
0f 1f 40 00 eb f
a 66 66 66 66 66 2e 0f 1f 84 00 00 00 00
[8.336220] RIP  [] bio_put+0x8a/0xa0
[8.336220]  RSP 
[8.336221] ---[ end trace d0966e2430ea53b5 ]---

Trying to track it down.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-08 Thread Chris Mason
On Wed, 2014-01-08 at 11:54 -0800, Muthu Kumar wrote:
> Chris,
> 
> [8.336061] WARNING: CPU: 0 PID: 0 at fs/bio.c:1778 bio_endio+0xbe/0x100()
> [8.336062] bio_endio: bio for (unknown) without endio
> 
> This is my recent change to avoid memory leak in bio_endio. But I
> think the problem is higher up, most likely bio_endio is called twice
> on the same bio (which was freed before).
> 

I think these are just two separate problems.  Lets ignore the WARN_ON
for now.

> Are you running the unmodified for-3.14/core or do you have local patches?
> 

It's for-3.14/core with my btrfs branch.  Basically rc7 instead of rc6
but no changes to the block layer.  I hadn't mounted btrfs yet.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-08 Thread Chris Mason
On Wed, 2014-01-08 at 12:40 -0800, Muthu Kumar wrote:
> On Wed, Jan 8, 2014 at 12:16 PM, Chris Mason  wrote:
> > On Wed, 2014-01-08 at 11:54 -0800, Muthu Kumar wrote:
> >> Chris,
> >>
> >> [8.336061] WARNING: CPU: 0 PID: 0 at fs/bio.c:1778 
> >> bio_endio+0xbe/0x100()
> >> [8.336062] bio_endio: bio for (unknown) without endio
> >>
> >> This is my recent change to avoid memory leak in bio_endio. But I
> >> think the problem is higher up, most likely bio_endio is called twice
> >> on the same bio (which was freed before).
> >>
> >
> > I think these are just two separate problems.  Lets ignore the WARN_ON
> > for now.
> >
> 
> Not really... the BUG that is triggered:
> 
> kernel BUG at fs/bio.c:523!
> 
> It is in bio_put() (added to bio_endio() as part of recent change)
> which gets an already freed bio.
> 

Oh! I see.  Let me try with that one reverted.  Thanks!

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-08 Thread Chris Mason
On Wed, 2014-01-08 at 13:01 -0800, Muthu Kumar wrote:
> On Wed, Jan 8, 2014 at 12:51 PM, Chris Mason  wrote:
> > On Wed, 2014-01-08 at 12:40 -0800, Muthu Kumar wrote:
> >> On Wed, Jan 8, 2014 at 12:16 PM, Chris Mason  wrote:
> >> > On Wed, 2014-01-08 at 11:54 -0800, Muthu Kumar wrote:
> >> >> Chris,
> >> >>
> >> >> [8.336061] WARNING: CPU: 0 PID: 0 at fs/bio.c:1778 
> >> >> bio_endio+0xbe/0x100()
> >> >> [8.336062] bio_endio: bio for (unknown) without endio
> >> >>
> >> >> This is my recent change to avoid memory leak in bio_endio. But I
> >> >> think the problem is higher up, most likely bio_endio is called twice
> >> >> on the same bio (which was freed before).
> >> >>
> >> >
> >> > I think these are just two separate problems.  Lets ignore the WARN_ON
> >> > for now.
> >> >
> >>
> >> Not really... the BUG that is triggered:
> >>
> >> kernel BUG at fs/bio.c:523!
> >>
> >> It is in bio_put() (added to bio_endio() as part of recent change)
> >> which gets an already freed bio.
> >>
> >
> > Oh! I see.  Let me try with that one reverted.  Thanks!
> >
> > -chris
> >
> 
> But, like I said, problem is in different place. I am running a "dd"
> on ext4 fs for a while now, but didn't hit the problem. Any idea to
> repro locally? I would also suggest running just the for-3.1/core to
> isolate the issue.

Just reverting that change fixes it for me.  Jens mentioned it was
broken for on-stack bios.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-08 Thread Chris Mason
On Tue, 2014-01-07 at 12:15 -0800, Muthu Kumar wrote:
> Thanks Fengguang. Final patch with added comment. BTW, fengguang
> mentioned that git-am has trouble with the inline patch and "quilt
> import" worked fine for him...
> 
> 
> In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not,
> we restore the orig_bio but failed to increment bi_remaining for
> orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix
> is to increment bi_remaining when we restore the orig bio as well.
> 
> Reported-and-Tested-by: Fengguang wu 
> CC: Kent Overstreet 
> CC: Jens Axboe 
> CC: Chris Mason 
> Signed-off-by: Muthukumar Ratty 
> 

Reviewed-by: Chris Mason 

Jens, please pull this one in.

> ---
>  fs/btrfs/volumes.c |8 ++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 37972d5..34aba2b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5297,9 +5297,13 @@ static void btrfs_end_bio(struct bio *bio, int err)
> if (!is_orig_bio) {
> bio_put(bio);
> bio = bbio->orig_bio;
> -   } else {
> -   atomic_inc(>bi_remaining);
> }
> +/*
> + * We have original bio now. So increment bi_remaining to
> + * account for it in endio
> + */
> +   atomic_inc(>bi_remaining);
> +
> bio->bi_private = bbio->private;
> bio->bi_end_io = bbio->end_io;
> btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
> 
> -


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread Chris Mason

On Fri 07 Feb 2014 07:10:38 AM EST, Fengguang Wu wrote:

On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:

On Fri, 7 Feb 2014, Fengguang Wu wrote:


[1.625020] BTRFS: selftest: Running btrfs_split_item tests
[1.627004] BTRFS: selftest: Running find delalloc tests
[2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
[  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, 
oom_score_adj=0
[  292.086439] kthreadd cpuset=
[  292.087072] BUG: unable to handle kernel NULL pointer dereference at 
0038
[  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c


This looks like a problem with the cpuset cgroup name, are you sure this
isn't related to the removal of cgroup->name?


It looks not related to patch "cgroup: remove cgroup->name", because
that patch lies in the cgroup tree and not contained in output of "git log 
BAD_COMMIT".


Still not sure exactly what is going on, but I can't trigger it here.  
My first guess is that it is related to having btrfs static, some part 
of our init is happening at the wrong time, and the self tests are 
swooping in and causing trouble.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2014-02-09 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is a small collection of fixes

Josef Bacik (2) commits (+4/-5):
Btrfs: don't loop forever if we can't run because of the tree mod log 
(+1/-0)
Btrfs: fix assert screwup for the pending move stuff (+3/-5)

David Sterba (1) commits (+1/-1):
btrfs: reserve no transaction units in btrfs_ioctl_set_features

Filipe David Borba Manana (1) commits (+2/-0):
Btrfs: fix data corruption when reading/updating compressed extents

Jeff Mahoney (1) commits (+2/-2):
btrfs: commit transaction after setting label and features

Total: (5) commits (+9/-8)

 fs/btrfs/compression.c | 2 ++
 fs/btrfs/extent-tree.c | 1 +
 fs/btrfs/ioctl.c   | 6 +++---
 fs/btrfs/send.c| 8 +++-
 4 files changed, 9 insertions(+), 8 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 1be41b78: +18% increased btrfs write throughput

2014-01-03 Thread Chris Mason
On Fri, 2014-01-03 at 23:54 +0800, fengguang...@intel.com wrote:
> Hi Josef,
> 
> FYI. We are doing 0day performance tests and happen to notice that
> btrfs write throughput increased considerably during v3.10-11 time
> frame:
> 
>   v3.10  v3.11  v3.12 
>  v3.13-rc6
> ---  -  -  
> -
>  50619 ~ 1% +17.0%  59209 ~ 2% +18.8%  60159 ~ 2% 
> +20.5%  61007 ~ 0%  lkp-ws02/micro/dd-write/11HDD-JBOD-cfq-btrfs-1dd
>  50619  +17.0%  59209  +18.8%  60159  
> +20.5%  61007   TOTAL iostat.sdd.wkB/s
> 
> and it's contributed by 
> 
> commit 1be41b78bc688fc634bf30965d2be692c99fd11d
> Author: Josef Bacik 
> AuthorDate: Wed Jun 12 13:56:06 2013 -0400
> Commit: Josef Bacik 
> CommitDate: Mon Jul 1 08:52:28 2013 -0400
> 
> Btrfs: fix transaction throttling for delayed refs

Bonus points for increasing the performance on purpose.  Thanks for
running these Wu.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 2/2] fs: btrfs: Extends btrfs/raid56 to support up to six parities

2014-01-06 Thread Chris Mason
On Mon, 2014-01-06 at 10:31 +0100, Andrea Mazzoleni wrote:
> This patch changes btrfs/raid56.c to use the new raid interface and
> extends its support to an arbitrary number of parities.
> 
> More in details, the two faila/failb failure indexes are now replaced
> with a fail[] vector that keeps track of up to six failures, and now
> the new raid_par() and raid_rec() functions are used to handle with
> parity instead of the old xor/raid6 ones.
> 

Neat.  The faila/failb were always my least favorite part of the btrfs
code ;)  Did you test just raid5/6 or also the higher parity counts?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Build failures due to commit 416161db (btrfs: offline dedupe)

2013-09-13 Thread Chris Mason
Quoting Guenter Roeck (2013-09-13 12:35:35)
> On Fri, Sep 13, 2013 at 03:52:43PM +0200, Geert Uytterhoeven wrote:
> > On Fri, Sep 13, 2013 at 3:33 PM, Guenter Roeck  wrote:
> > > fs/btrfs/ioctl.c: In function 'btrfs_ioctl_file_extent_same':
> > > fs/btrfs/ioctl.c:2802:3: error: implicit declaration of function 
> > > '__put_user_unaligned' [-Werror=implicit-function-declaration]
> > > cc1: some warnings being treated as errors
> > > make[2]: *** [fs/btrfs/ioctl.o] Error 1
> > > make[2]: *** Waiting for unfinished jobs
> > >
> > > Seen with alpha:allmodconfig, arm:allmodconfig, m68k:allmodconfig, and
> > > xtensa:allmodconfig.
> > 
> > Known issue, cfr. my early warning 10 days ago:
> > 
> > "Btrfs is the first user of __put_user_unaligned() outside the compat code,
> > hence now all 32-bit architectures should make sure to implement this, too."
> > 
> > http://marc.info/?l=linux-arch=137820065929216=2
> > 
> > and today's thread https://lkml.org/lkml/2013/9/12/814
> > 
> 
> It doesn't seem right that a patch breaks the build for several platforms, and
> the problem is then blamed on the platform code instead of the code that is
> introducing the problem.
> 
> Maybe we should add BROKEN to the btrfs dependencies for the affected 
> platforms.
> After all, it _is_ broken.

I'm happy to fix this with a bigger put of the info struct, just
let me know the preferred arch-happy solution.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Build failures due to commit 416161db (btrfs: offline dedupe)

2013-09-13 Thread Chris Mason
Quoting Mark Fasheh (2013-09-13 13:58:01)
> On Fri, Sep 13, 2013 at 01:00:22PM -0400, Chris Mason wrote:
> > Quoting Guenter Roeck (2013-09-13 12:35:35)
> > I'm happy to fix this with a bigger put of the info struct, just
> > let me know the preferred arch-happy solution.
> 
> In fact old versions of the patch were putting the whole struct but during
> review I was asked to change it. This should be very straight forward to fix
> so long as we all stay calm ;)
> --Mark

Mark, could you please send a patch for the whole-struct option until
the unaligned put is upstreamed?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-10-11 Thread Chris Mason
Hi Linus,

We've got more bug fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

One of these fixes another corner of the compression oops from last
time.  Miao nailed down some problems with concurrent snapshot deletion
and drive balancing.

I kept out one of his patches for more testing, but these are all
stable.

Josef Bacik (2) commits (+5/-9):
Btrfs: limit delalloc pages outside of find_delalloc_range (+4/-8)
Btrfs: use right root when checking for hash collision (+1/-1)

Miao Xie (2) commits (+20/-12):
Btrfs: fix oops caused by the space balance and dead roots (+17/-7)
Btrfs: insert orphan roots into fs radix tree (+3/-5)

Total: (4) commits (+25/-21)

 fs/btrfs/disk-io.c|  9 +
 fs/btrfs/disk-io.h| 13 +++--
 fs/btrfs/extent_io.c  | 12 
 fs/btrfs/inode.c  |  2 +-
 fs/btrfs/relocation.c |  2 +-
 fs/btrfs/root-tree.c  |  8 +++-
 6 files changed, 25 insertions(+), 21 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-09-22 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are mostly bug fixes and a two small performance fixes.  The most
important of the bunch are Josef's fix for a snapshotting regression and
Mark's update to fix compile problems on arm.

These are on top of 3.11 + my first pull, but they were also tested
against your master as of last night.

Josef Bacik (13) commits (+219/-102):
Btrfs: check roots last log commit when checking if an inode has been 
logged (+4/-1)
Revert "Btrfs: rework the overcommit logic to be based on the total size" 
(+3/-12)
Btrfs: kill delay_iput arg to the wait_ordered functions (+14/-33)
Btrfs: drop dir i_size when adding new names on replay (+27/-0)
Btrfs: replay dir_index items before other items (+12/-3)
Btrfs: fix worst case calculator for space usage (+1/-1)
Btrfs: actually log directory we are fsync()'ing (+9/-1)
Btrfs: actually limit the size of delalloc range (+5/-3)
Btrfs: fixup error handling in btrfs_reloc_cow (+32/-22)
Btrfs: remove space_info->reservation_progress (+0/-12)
Btrfs: create the uuid tree on remount rw (+10/-0)
Btrfs: improve replacing nocow extents (+98/-14)
Btrfs: iput inode on allocation failure (+4/-0)

Stefan Behrens (2) commits (+4/-0):
btrfs: show compiled-in config features at module load time (+3/-0)
Btrfs: add the missing mutex unlock in write_all_supers() (+1/-0)

Filipe David Borba Manana (2) commits (+7/-7):
Btrfs: don't leak transaction in btrfs_sync_file() (+2/-2)
Btrfs: more efficient inode tree replace operation (+5/-5)

David Sterba (2) commits (+8/-0):
btrfs: add lockdep and tracing annotations for uuid tree (+2/-0)
btrfs: refuse to remount read-write after abort (+6/-0)

Guangyu Sun (1) commits (+2/-0):
Btrfs: dir_inode_operations should use btrfs_update_time also

Mark Fasheh (1) commits (+45/-31):
btrfs: change extent-same to copy entire argument struct

Ilya Dryomov (1) commits (+2/-1):
Btrfs: do not add replace target to the alloc_list

Frank Holton (1) commits (+2/-2):
btrfs: Add btrfs: prefix to kernel log output

chandan (1) commits (+1/-1):
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when 
arg is 0

Miao Xie (1) commits (+74/-31):
Btrfs: allocate the free space by the existed max extent size when ENOSPC

Total: (25) commits

 fs/btrfs/btrfs_inode.h   |   5 +-
 fs/btrfs/ctree.c |   7 ++-
 fs/btrfs/ctree.h |  17 ++-
 fs/btrfs/dev-replace.c   |   4 +-
 fs/btrfs/disk-io.c   |   2 +
 fs/btrfs/extent-tree.c   |  57 +++---
 fs/btrfs/extent_io.c |   8 ++--
 fs/btrfs/file.c  |   4 +-
 fs/btrfs/free-space-cache.c  |  67 ++
 fs/btrfs/free-space-cache.h  |   5 +-
 fs/btrfs/inode.c |  16 +--
 fs/btrfs/ioctl.c |  80 ++-
 fs/btrfs/ordered-data.c  |  24 ++
 fs/btrfs/ordered-data.h  |   5 +-
 fs/btrfs/relocation.c|  43 ++---
 fs/btrfs/scrub.c | 112 +--
 fs/btrfs/super.c |  21 +++-
 fs/btrfs/transaction.c   |   2 +-
 fs/btrfs/tree-log.c  |  52 ++--
 fs/btrfs/volumes.c   |   7 +--
 include/trace/events/btrfs.h |   1 +
 21 files changed, 364 insertions(+), 175 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs

2013-09-12 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is against 3.11-rc7, but was pulled and tested against your tree as
of yesterday.  We do have two small incrementals queued up, but I wanted
to get this bunch out the door before I hop on an airplane.

This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.

We've included Stefan Behren's work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place.  It closes a long standing performance issue that was built
in to the disk format.

Mark Fasheh's offline dedup work is also here.  In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO.   This is a building block where utilities  are
able to ask the FS to dedup a series of extents.  The kernel takes
care of verifying the data involved really is the same.  Today this
involves reading both extents, but we'll continue to evolve the patches.

Anand Jain (3):
  btrfs: fix get set label blocking against balance
  btrfs: use BTRFS_SUPER_INFO_SIZE macro at btrfs_read_dev_super()
  btrfs: return btrfs error code for dev excl ops err

Andy Shevchenko (1):
  btrfs: reuse kbasename helper

Carey Underwood (1):
  Btrfs: Release uuid_mutex for shrink during device delete

Dan Carpenter (1):
  btrfs/raid56: fix and cleanup some error paths

Dave Jones (1):
  Fix leak in __btrfs_map_block error path

David Sterba (2):
  btrfs: make errors in btrfs_num_copies less noisy
  btrfs: add mount option to set commit interval

Filipe David Borba Manana (18):
  Btrfs: optimize btrfs_lookup_extent_info()
  Btrfs: add missing error checks to add_data_references
  Btrfs: optimize function btrfs_read_chunk_tree
  Btrfs: add missing error check to find_parent_nodes
  Btrfs: add missing error handling to read_tree_block
  Btrfs: fix inode leak on kmalloc failure in tree-log.c
  Btrfs: don't ignore errors from btrfs_run_delayed_items
  Btrfs: return ENOSPC when target space is full
  Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler
  Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP
  Btrfs: reset force_compress on btrfs_file_defrag failure
  Btrfs: fix memory leak of orphan block rsv
  Btrfs: fix printing of non NULL terminated string
  Btrfs: fix race between removing a dev and writing sbs
  Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
  Btrfs: fix memory leak of uuid_root in free_fs_info
  Btrfs: fix deadlock in uuid scan kthread
  Btrfs: optimize key searches in btrfs_search_slot

Geert Uytterhoeven (12):
  Btrfs: Remove superfluous casts from u64 to unsigned long long
  Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant
  Btrfs: Format PAGE_SIZE as unsigned long
  Btrfs: Format mirror_num as int
  Btrfs: Make btrfs_device_uuid() return unsigned long
  Btrfs: Make btrfs_device_fsid() return unsigned long
  Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long
  Btrfs: Make btrfs_header_fsid() return unsigned long
  Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long
  Btrfs: PAGE_CACHE_SIZE is already unsigned long
  Btrfs: Do not truncate sector_t on 32-bit with CONFIG_LBDAF=y
  Btrfs: Use %z to format size_t

Ilya Dryomov (5):
  Btrfs: find_next_devid: root -> fs_info
  Btrfs: add btrfs_alloc_device and switch to it
  Btrfs: add alloc_fs_devices and switch to it
  Btrfs: rollback btrfs_device fields on umount
  Btrfs: stop refusing the relocation of chunk 0

Jeff Mahoney (1):
  btrfs: fall back to global reservation when removing subvolumes

Josef Bacik (30):
  Btrfs: stop using GFP_ATOMIC for the tree mod log allocations
  Btrfs: set lockdep class before locking new extent buffer
  Btrfs: reset ret in record_one_backref
  Btrfs: cleanup reloc roots properly on error
  Btrfs: don't bother autodefragging if our root is going away
  Btrfs: cleanup arguments to extent_clear_unlock_delalloc
  Btrfs: fix what bits we clear when erroring out from delalloc
  Btrfs: check to see if we have an inline item properly
  Btrfs: change how we queue blocks for backref checking
  Btrfs: don't bug_on when we fail when cleaning up transactions
  Btrfs: handle errors when doing slow caching
  Btrfs: check our parent dir when doing a compare send
  Btrfs: deal with enomem in the rewind path
  Btrfs: stop using GFP_ATOMIC when allocating rewind ebs
  Btrfs: skip subvol entries when checking if we've created a dir already
  Btrfs: don't allow a subvol to be deleted if it is the default subovl
  Btrfs: fix the error handling wrt orphan items
  Btrfs: fix heavy delalloc related deadlock
  Btrfs: 

Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason
On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> 
> > Indeed.  Though how well my patches will work with Oracle will
> > depend a lot on what kind of semctl syscalls they are doing.
> > 
> > Does Oracle typically do one semop per semctl syscall, or does
> > it pass in a whole bunch at once?
> 
> https://oss.oracle.com/~mason/sembench.c
> 
> I think Chris wrote that to match a particular pattern of semaphore
> operations the database engine in question does. I haven't checked to
> see if it triggers the case in point though.
> 
> Also, Chris since left Oracle but maybe he knows who to poke.
> 

Dave Kleikamp (cc'd) took over my patches and did the most recent
benchmarking.  Ported against 3.0:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c

The current versions are still in the 2.6.32 oracle kernel, but it looks
like they reverted this 3.0 commit.  I think with Manfred's upstream
work my more complex approach wasn't required anymore, but hopefully
Dave can fill in details.

Here is some of the original discussion around the patch:

https://lkml.org/lkml/2010/4/12/257

In terms of how oracle uses IPC, the part that shows up in profiles is
using semtimedop for bulk wakeups.  They can configure things to use
either a bunch of small arrays or a huge single array (and anything in
between). 

There is one IPC semaphore per process and they use this to wait for
some event (like a log commit).  When the event comes in, everyone
waiting is woken in bulk via a semtimedop call.

So, single proc waking many waiters at once.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason
On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote:
> On 03/07/2013 06:55 AM, Chris Mason wrote:
> > On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> >> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> >>
> >>> Indeed.  Though how well my patches will work with Oracle will
> >>> depend a lot on what kind of semctl syscalls they are doing.
> >>>
> >>> Does Oracle typically do one semop per semctl syscall, or does
> >>> it pass in a whole bunch at once?
> >>
> >> https://oss.oracle.com/~mason/sembench.c
> >>
> >> I think Chris wrote that to match a particular pattern of semaphore
> >> operations the database engine in question does. I haven't checked to
> >> see if it triggers the case in point though.
> >>
> >> Also, Chris since left Oracle but maybe he knows who to poke.
> >>
> > 
> > Dave Kleikamp (cc'd) took over my patches and did the most recent
> > benchmarking.  Ported against 3.0:
> > 
> > https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
> > 
> > The current versions are still in the 2.6.32 oracle kernel, but it looks
> > like they reverted this 3.0 commit.  I think with Manfred's upstream
> > work my more complex approach wasn't required anymore, but hopefully
> > Dave can fill in details.
> 
> From what I recall, I could never get better performance from your
> patches that we saw with Manfred's work alone. I can't remember the
> reasons for including and then reverting the patches from the 3.0
> (2.6.39) Oracle kernel, but in the end we weren't able to justify their
> inclusion.

Ok, so after this commit, oracle was happy:

commit fd5db42254518fbf241dc454e918598fbe494fa2
Author: Manfred Spraul 
Date:   Wed May 26 14:43:40 2010 -0700

ipc/sem.c: optimize update_queue() for bulk wakeup calls

But that doesn't explain why Davidlohr saw semtimedop at the top of the
oracle profiles in his runs.

Looking through the patches in this thread, I don't see anything that
I'd expect to slow down oracle TPC numbers.

I dealt with the ipc_perm lock a little differently:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528

My code switched the ipc_rcu_hdr refcount to an atomic, which changed
where I needed the spinlock.  It may make things easier in patches 3/4
and 4/4.

(some of this code was Jens, but at the time he made me promise to
pretend he never touched it)

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs updates

2013-03-08 Thread Chris Mason
Hi Linus,

Please grab my for-linus:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are scattered fixes and one performance improvement.  The biggest
functional change is in how we throttle metadata changes.  The new code
bumps our average file creation rate up by ~13% in fs_mark, and lowers
CPU usage.

Stefan bisected out a regression in our allocation code that made
balance loop on extents larger than 256MB.

Liu Bo (6) commits (+71/-19):
Btrfs: build up error handling for merge_reloc_roots (+35/-12)
Btrfs: check for NULL pointer in updating reloc roots (+2/-0)
Btrfs: avoid deadlock on transaction waiting list (+7/-0)
Btrfs: free all recorded tree blocks on error (+6/-3)
Btrfs: do not BUG_ON on aborted situation (+12/-3)
Btrfs: do not BUG_ON in prepare_to_reloc (+9/-1)

Chris Mason (2) commits (+96/-63):
Btrfs: enforce min_bytes parameter during extent allocation (+4/-2)
Btrfs: improve the delayed inode throttling (+92/-61)

Miao Xie (2) commits (+45/-39):
Btrfs: fix unclosed transaction handler when the async transaction 
commitment fails (+4/-0)
Btrfs: fix wrong handle at error path of create_snapshot() when the commit 
fails (+41/-39)

Stefan Behrens (1) commits (+0/-8):
Btrfs: allow running defrag in parallel to administrative tasks

Ilya Dryomov (1) commits (+5/-0):
Btrfs: fix a mismerge in btrfs_balance()

Josef Bacik (1) commits (+4/-1):
Btrfs: use set_nlink if our i_nlink is 0

Total: (13) commits (+221/-130)

 fs/btrfs/delayed-inode.c | 151 ---
 fs/btrfs/delayed-inode.h |   2 +
 fs/btrfs/disk-io.c   |  16 +++--
 fs/btrfs/inode.c |   6 +-
 fs/btrfs/ioctl.c |  18 ++
 fs/btrfs/relocation.c|  74 +--
 fs/btrfs/transaction.c   |  65 
 fs/btrfs/tree-log.c  |   5 +-
 fs/btrfs/volumes.c   |  14 -
 9 files changed, 221 insertions(+), 130 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: btrfs triggered lockdep WARN.

2013-06-27 Thread Chris Mason
Quoting Dave Jones (2013-06-27 10:58:24)
> Another bug caused by this script. 
> https://github.com/kernelslacker/io-tests/blob/master/setup.sh

I'm still struggling to reproduce that one here.  I've tried every
variation I can think of but I'll try again.

I really hope you don't already have CONFIG_DEBUG_PAGE_ALLOC turned on,
maybe it will catch this?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: btrfs triggered lockdep WARN.

2013-06-27 Thread Chris Mason
Quoting Dave Jones (2013-06-27 11:19:22)
> On Thu, Jun 27, 2013 at 11:01:30AM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-27 10:58:24)
>  > > Another bug caused by this script. 
> https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > 
>  > I'm still struggling to reproduce that one here.  I've tried every
>  > variation I can think of but I'll try again.
>  
> Note that this is a different trace to the other post about that script.

Yeah, but I haven't hit anything unusual at all yet.

> 
>  > I really hope you don't already have CONFIG_DEBUG_PAGE_ALLOC turned on,
>  > maybe it will catch this?
> 
> I do. Though given this is lockdep complaining about what looks like
> memory corruption, it's probably not related.

Ok, could you please try this with some heavy memory pressure?  I'm
hoping to trigger a use-after-free that points us in the right
direction.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-17 Thread Chris Mason
Quoting Dave Jones (2013-06-17 09:49:55)
> Hit this while running this script in a loop..
> https://github.com/kernelslacker/io-tests/blob/master/setup.sh
> [34385.251507] [ cut here ]
> [34385.254068] WARNING: at fs/btrfs/inode.c:7961 
> btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

Thanks Dave, how long did you have to run the script to trigger it?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-17 Thread Chris Mason
Quoting Dave Jones (2013-06-17 14:20:06)
> On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > > Hit this while running this script in a loop..
>  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > > [34385.251507] [ cut here ]
>  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 
> btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > 
>  > Thanks Dave, how long did you have to run the script to trigger it?
>  > 
>  > -chris
> 
> Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
> (oldish) quad opteron. Might repro faster on a more modern machine.

Exactly how did you run it?  I want to make sure I'm matching your
config.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-19 Thread Chris Mason
Quoting Dave Jones (2013-06-17 14:58:10)
> On Mon, Jun 17, 2013 at 02:42:27PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 14:20:06)
>  > > On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > >  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > >  > > Hit this while running this script in a loop..
>  > >  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > >  > > [34385.251507] [ cut here ]
>  > >  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 
> btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > >  > 
>  > >  > Thanks Dave, how long did you have to run the script to trigger it?
>  > >  > 
>  > >  > -chris
>  > > 
>  > > Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
>  > > (oldish) quad opteron. Might repro faster on a more modern machine.
>  > 
>  > Exactly how did you run it?  I want to make sure I'm matching your
>  > config.
> 
> while [ 1 ];
> do
>   setup.sh
> done
> 
> You'll need to set DISK1 etc variables at the top of the script to point to
> at least 3 disks for it to scribble over.
> 
> you'll also need fsx and fsstress in /usr/local/bin.

I've tried with and without memory pressure, and let it run for about 30
hours.  So far, nothing here.  Have you seen this again?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-19 Thread Chris Mason
Quoting Dave Jones (2013-06-19 14:34:50)
> On Wed, Jun 19, 2013 at 02:02:33PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 14:58:10)
>  > > On Mon, Jun 17, 2013 at 02:42:27PM -0400, Chris Mason wrote:
>  > >  > Quoting Dave Jones (2013-06-17 14:20:06)
>  > >  > > On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > >  > >  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > >  > >  > > Hit this while running this script in a loop..
>  > >  > >  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > >  > >  > > [34385.251507] [ cut here ]
>  > >  > >  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 
> btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > >  > >  > 
>  > >  > >  > Thanks Dave, how long did you have to run the script to trigger 
> it?
>  > >  > > 
>  > >  > > Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
>  > >  > > (oldish) quad opteron. Might repro faster on a more modern machine.
>  > >  > 
>  > >  > Exactly how did you run it?  I want to make sure I'm matching your
>  > >  > config.
>  > > 
>  > > while [ 1 ];
>  > > do
>  > >   setup.sh
>  > > done
>  > > 
>  > > You'll need to set DISK1 etc variables at the top of the script to point 
> to
>  > > at least 3 disks for it to scribble over.
>  > > 
>  > > you'll also need fsx and fsstress in /usr/local/bin.
>  > 
>  > I've tried with and without memory pressure, and let it run for about 30
>  > hours.  So far, nothing here.  Have you seen this again?
> 
> yeah, one time I hit it within 30 minutes.

Ok, I'll try bumping the thread count on both fsstress and fsx

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


<    6   7   8   9   10   11   12   13   14   15   >