Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Dave Chinner
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote:
> Some function definitions have either the initial open brace and/or
> the closing brace outside of column 1.
> 
> Move those braces to column 1.
> 
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
> 
> Miscellanea:
> 
> o Remove extra trailing ; and blank line from xfs_agf_verify
> 
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
....
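
For illustration, the kind of change being made looks like this - a made-up
example, not a hunk from the actual patch:

static int example_function(int arg)
	{
	return arg + 1;
	}

becomes:

static int example_function(int arg)
{
	return arg + 1;
}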

XFS bits look fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <da...@fromorbit.com> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been busy with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducible: create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? Stable and
steady as a rock using PMEM devices. The moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping:  
(null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.754671] raw: 4000   

[ 2729.756469] raw: dead0100 dead0200  

[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: GB   
4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: 
ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]  
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpa

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios which
> had the magic single page that sometimes was unused attached.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <da...@fromorbit.com> wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't do a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW  
>  4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system is back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - I'm seeing
an average of ~175k files/s on your current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32  -d  /mnt/scratch/0  -d 
 /mnt/scratch/1  -d  /mnt/scratch/2

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <da...@fromorbit.com> wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] [ cut here ]
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW   
4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
35074 root  -2   0   0  0  0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root  -2   0   0  0  0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root  -2   0   0  0  0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root  -2   0   0  0  0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root  -2   0   0  0  0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root  -2   0   0  0  0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root  -2   0   0  0  0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root  -2   0   0  0  0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root  -2   0   0  0  0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root  -2   0   0  0  0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root  -2   0   0  0  0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root  -2   0   0  0  0 R  98.1  0.0  12:39.65 acpi_pad

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it.  I've also added
linux-scsi, linux-block to the cc list.

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > > enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > > 000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done 
> > (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - 
> > segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106 
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > > PMD 0 [  160.277309]
> > > [  160.277523] Oops:  [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: 88003e28 task.stack: c908
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > > 
> > > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > > 880039061730
> > > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > > 05a8
> > > [  160.285627] R10: 9835607d R11:  R12: 
> > > 0200
> > > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > > 880039061730
> > > [  160.287362] FS:  () GS:88003fc0() 
> > > knlGS:
> > > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > > 06f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  16

Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-19 Thread Dave Chinner
On Tue, Jul 19, 2016 at 02:22:47PM -0700, Calvin Owens wrote:
> On 07/18/2016 07:05 PM, Calvin Owens wrote:
> >On 07/17/2016 11:02 PM, Dave Chinner wrote:
> >>On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> >>>On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> >>>>Hello all,
> >>>>
> >>>>I've found a nasty source of slab corruption. Based on seeing similar
> >>>>symptoms on boxes at Facebook, I suspect it's been around since at least 3.10.
> >>>>
> >>>>It only reproduces under memory pressure so far as I can tell: the issue
> >>>>seems to be that XFS reclaims pages from buffers that are still in use by
> >>>>scsi/block. I'm not sure which side the bug lies on, but I've only
> >>>>observed it with XFS.
> >>[]
> >>>But this indicates that the page is under writeback at this point,
> >>>so that tends to indicate that the above freeing was incorrect.
> >>>
> >>>Hmmm - it's clear we've got direct reclaim involved here, and the
> >>>suspicion of a dirty page that has had its bufferheads cleared.
> >>>Are there any other warnings in the log from XFS prior to kasan
> >>>throwing the error?
> >>
> >>Can you try the patch below?
> >
> >Thanks for getting this out so quickly :)
> >
> >So far so good: I booted Linus' tree as of this morning and reproduced the
> >ASAN splat. After applying your patch I haven't triggered it.
> >
> >I'm a bit wary since it was hard to trigger reliably in the first place...
> >so I lined up a few dozen boxes to run the test case overnight. I'll confirm
> >in the morning (-0700) they look good.
> 
> All right, my testcase ran 2099 times overnight without triggering anything.
> 
> For the overnight tests, I booted the boxes with "mem=" to artificially
> limit RAM, which makes my repro *much* more reliable (I feel silly for not
> thinking of that in the first place). With that setup, I hit the ASAN splat
> 21 times in 98 runs on vanilla 4.7-rc7. So I'm sold.
> 
> Tested-by: Calvin Owens <calvinow...@fb.com>

Thanks for testing, Calvin. I'll update the patch and get it
reviewed and committed.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-18 Thread Dave Chinner
On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> > Hello all,
> > 
> > I've found a nasty source of slab corruption. Based on seeing similar
> > symptoms on boxes at Facebook, I suspect it's been around since at least 3.10.
> > 
> > It only reproduces under memory pressure so far as I can tell: the issue
> > seems to be that XFS reclaims pages from buffers that are still in use by
> > scsi/block. I'm not sure which side the bug lies on, but I've only
> > observed it with XFS.
[]
> But this indicates that the page is under writeback at this point,
> so that tends to indicate that the above freeing was incorrect.
> 
> Hmmm - it's clear we've got direct reclaim involved here, and the
> suspicion of a dirty page that has had its bufferheads cleared.
> Are there any other warnings in the log from XFS prior to kasan
> throwing the error?

Can you try the patch below?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: bufferhead chains are invalid after end_page_writeback

From: Dave Chinner <dchin...@redhat.com>

In xfs_finish_page_writeback(), we have a loop that looks like this:

	do {
		if (off < bvec->bv_offset)
			goto next_bh;
		if (off > end)
			break;
		bh->b_end_io(bh, !error);
next_bh:
		off += bh->b_size;
	} while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have been marked as no longer
under IO.  The issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.
 Call Trace:
[] dump_stack+0x68/0x94
  [] print_trailer+0x115/0x1a0
  [] object_err+0x34/0x40
  [] kasan_report_error+0x217/0x530
  [] __asan_report_load8_noabort+0x43/0x50
  [] xfs_destroy_ioend+0x3bf/0x4c0
  [] xfs_end_bio+0x154/0x220
  [] bio_endio+0x158/0x1b0
  [] blk_update_request+0x18b/0xb80
  [] scsi_end_request+0x97/0x5a0
  [] scsi_io_completion+0x438/0x1690
  [] scsi_finish_command+0x375/0x4e0
  [] scsi_softirq_done+0x280/0x340


Where the access is occurring during IO completion after the buffer
had been freed from direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the bh->b_end_io() callout.

Yet another example of why Bufferheads Must Die.

Signed-off-by: Dave Chinner <dchin...@redhat.com>
---
 fs/xfs/xfs_aops.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 80714eb..0cfb944 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -87,6 +87,12 @@ xfs_find_bdev_for_inode(
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
  * we need to process on the page.
+ *
+ * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last
+ * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or
+ * the page at all, as we may be racing with memory reclaim and it can free both
+ * the bufferhead chain and the page as it will see the page as clean and
+ * unused.
  */
 static void
 xfs_finish_page_writeback(
@@ -95,8 +101,9 @@ xfs_finish_page_writeback(
 	int			error)
 {
 	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
-	struct buffer_head	*head, *bh;
+	struct buffer_head	*head, *bh, *next;
 	unsigned int		off = 0;
+	unsigned int		bsi
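
A rough sketch of the pre-calculation approach described above - not the
committed patch itself, and the helper name is made up - would look like:

static void
finish_page_writeback_sketch(
	struct page		*page,
	struct bio_vec		*bvec,
	int			error)
{
	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
	unsigned int		off = 0;
	unsigned int		bsize;
	struct buffer_head	*head = page_buffers(page);
	struct buffer_head	*bh = head;
	struct buffer_head	*next;

	do {
		/*
		 * Read everything we need from the bufferhead before
		 * calling b_end_io(): the final b_end_io() can run
		 * end_page_writeback(), after which the page and its
		 * bufferhead chain may be torn down by reclaim.
		 */
		next = bh->b_this_page;
		bsize = bh->b_size;

		if (off < bvec->bv_offset)
			goto next_bh;
		if (off > end)
			break;
		bh->b_end_io(bh, !error);
next_bh:
		off += bsize;
	} while ((bh = next) != head);
}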

Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-16 Thread Dave Chinner

But this indicates that the page is under writeback at this point,
so that tends to indicate that the above freeing was incorrect.

Hmmm - it's clear we've got direct reclaim involved here, and the
suspicion of a dirty page that has had its bufferheads cleared.
Are there any other warnings in the log from XFS prior to kasan
throwing the error?

> Is there anything else I can send that might be helpful?

full console/dmesg output from a crashed machine, plus:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> --
> /*
>  * Run as "./repro outfile 1000", where "outfile" sits on an XFS filesystem.
>  */
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #define CHUNK (32768)
> 
> static const char crap[CHUNK];
> 
> int main(int argc, char **argv)
> {
>   int r, fd, i;
>   size_t allocsize, count;
>   void *p;
> 
>   if (argc != 3) {
>   printf("Usage: %s filename count\n", argv[0]);
>   return 1;
>   }
> 
>   fd = open(argv[1], O_RDWR|O_CREAT, 0644);
>   if (fd == -1) {
>   perror("Can't open");
>   return 1;
>   }
> 
>   if (!fork()) {
>   count = atol(argv[2]);
> 
>   while (1) {
>   for (i = 0; i < count; i++)
>   if (write(fd, crap, CHUNK) != CHUNK)
>   perror("Eh?");
> 
>   fsync(fd);
>   ftruncate(fd, 0);
>   }

Hmmmm. Truncate is used, but only after fsync. If the truncate
is removed, does the problem go away?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

2016-05-30 Thread Dave Chinner
On Mon, May 30, 2016 at 10:27:52AM +0200, Gernot Hillier wrote:
> Hi!
> 
> On 25.05.2016 01:13, Theodore Ts'o wrote:
> > On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
> >> We experience strange delays with kernel 4.1.18 during dpkg
> >> package installation on an ext4 filesystem after switching from
> >> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
> >> Installation of the same package takes 2s with ext3 and 31s with
> >> ext4 on the same partition.
> >>
> >> Hardware is an Intel-based server with Supermicro X8DTH board and
> >> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
> >> 0x1000:0x0072, mpt2sas driver).
> [...]
> >> To me, the problem looks comparable to
> >> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
> >> a full hang and there's no RAID involved for us), so a closer look on
> >> the SCSI layer or driver might be the next step?
> > 
> > What I would suggest is to create a small test case which compares the
> > time it takes to allocate 1 megabyte of memory, zero it, and then
> > write one megabytes of zeros using the write(2) system call.  Then try
> > writing one megabytes of zero using the BLKZEROOUT ioctl.
> 
> Ok, this is my test code:
> 
>   const int SIZE = 1*1024*1024;
>   char* buffer = malloc(SIZE);
>   uint64_t range[2] = { 0, SIZE };
>   int fd = open("/dev/sdb2", O_WRONLY);
> 
>   bzero(buffer, SIZE);
>   write(fd, buffer, SIZE);
>   sync_file_range(fd, 0, 0, 2);
> 
>   ioctl (fd, BLKZEROOUT, range);
> 
>   close(fd);
>   free(buffer);
> 
> # strace -tt ./test-tytso
> [...]
> 15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
> 15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
> 15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
> 15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 10]) = 0
> 15:46:27.546971 close(3)= 0
> 
> So the write() and sync_file_range() in the first case takes ~400 us
> each while BLKZEROOUT takes... 60 ms. Wow.

Comparing apples to oranges.

Unlike the name implies, sync_file_range() does not provide any data
integrity semantics what-so-ever: SYNC_FILE_RANGE_WRITE only submits
IO to clean dirty pages - that only takes 400us of CPU time.  It
does not wait for completion, nor does it flush the drive cache and
so by the time the syscall returns to userspace the IO may not have
even been sent to the device (e.g. it could be queued by the IO
scheduler in the block layer). i.e. you're not timing IO, you're
timing CPU overhead of IO submission.

For an apples to apples comparison, you need to use fsync() to
physically force the written data to stable storage and wait for
completion. This is what BLKZEROOUT is effectively doing, so I think
you'll find fdatasync() also takes around 60ms...
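
For an apples-to-apples comparison on the buffered path, a minimal sketch
like this (the device path is just a placeholder, error handling omitted)
times write() plus fdatasync() rather than sync_file_range():

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SIZE (1024 * 1024)

int main(void)
{
	char *buf = calloc(1, SIZE);		/* one megabyte of zeros */
	int fd = open("/dev/sdb2", O_WRONLY);	/* placeholder device */
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	write(fd, buf, SIZE);
	fdatasync(fd);	/* waits for completion and issues a cache flush */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("write+fdatasync: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	close(fd);
	free(buf);
	return 0;
}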

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 13/35] xfs: set bi_op to REQ_OP

2016-01-06 Thread Dave Chinner
On Tue, Jan 05, 2016 at 02:53:16PM -0600, mchri...@redhat.com wrote:
> From: Mike Christie <mchri...@redhat.com>
> 
> This patch has xfs set the bio bi_op to a REQ_OP, and
> rq_flag_bits to bi_rw.
> 
> Note:
> I have run xfs tests on these btrfs patches. There were some failures
> with and without the patches. I have not had time to track down why
> xfstest fails without the patches.
> 
> Signed-off-by: Mike Christie <mchri...@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  3 ++-
>  fs/xfs/xfs_buf.c  | 27 +++
>  2 files changed, 17 insertions(+), 13 deletions(-)

Not sure which patches your note is referring to here.

The XFS change here looks fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Dave Chinner
On Wed, Jan 06, 2016 at 08:40:09PM -0500, Martin K. Petersen wrote:
> >>>>> "Mike" == mchristi  <mchri...@redhat.com> writes:
> 
> Mike> The following patches begin to cleanup the request->cmd_flags and
> bio->bi_rw mess. We currently use cmd_flags to specify the operation,
> Mike> attributes and state of the request. For bi_rw we use it for
> Mike> similar info and also the priority but then also have another
> Mike> bi_flags field for state. At some point, we abused them so much we
> Mike> just made cmd_flags 64 bits, so we could add more.
> 
> Mike> The following patches seperate the operation (read, write discard,
> Mike> flush, etc) from cmd_flags/bi_rw.
> 
> Mike> This patchset was made against linux-next from today Jan 5 2016.
> Mike> (git tag next-20160105).
> 
> Very nice work. Thanks for doing this!
> 
> I think it's a much needed cleanup. I focused mainly on the core block,
> discard, write same and sd.c pieces and everything looks sensible to me.
> 
> I wonder what the best approach is to move a patch set with this many
> stakeholders forward? Set a "speak now or forever hold your peace"
> review deadline?

I say just ask Linus to pull it immediately after the next merge
window closes...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists

2015-04-08 Thread Dave Chinner
[ Sending again with a trimmed CC list to just the lists. Jeff - cc
lists that large get blocked by mailing lists... ]

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
 The way the on-stack plugging currently works, each nesting level
 flushes its own list of I/Os.  This can be less than optimal (read
 awful) for certain workloads.  For example, consider an application
 that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
 I/Os together in a single io_submit call, only to have each of them
 dispatched individually down in the bowels of the dirct I/O code.
 The reason is that there are blk_plug-s instantiated both at the upper
 call site in do_io_submit and down in do_direct_IO.  The latter will
 submit as little as 1 I/O at a time (if you have a small enough I/O
 size) instead of performing the batching that the plugging
 infrastructure is supposed to provide.

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately. e.g. we are doing writeback, so
there is a high level plug in place and we need to page in btree
blocks to do extent allocation. We do readahead at this point,
but it looks like this change will prevent the readahead from being
issued by the unplug in xfs_buf_iosubmit().

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations that it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show similar real-world workload improvements to your patchset by
being smarter about using high level plugging to enable cross-file
merging of IO, but it still relies on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

 NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
 would always flush the plug list.  After this patch, this is only the
 case for the outer-most plug.  If you require the plug list to be
 flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
 maintainers should take a close look at this patch and ensure they get
 the right behavior in the end.

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to *try* to maintain the same behaviour as
we currently have? I say *try*, because now instead of just flushing
the readahead IO on the plug, we'll also flush all the queued data
writeback IO on the high level plug. We don't actually want to do
that; we only want to submit the readahead and not the bulk IO that
will delay the latency sensitive dependent IOs...

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs fundamental change?
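
For reference, the pattern in question - a minimal kernel-style sketch with
a made-up function name, not code from the patch - looks like this:

#include <linux/blkdev.h>
#include <linux/sched.h>

static void submit_with_nested_plug(struct bio *bio)
{
	struct blk_plug inner;

	/* An outer plug may already be active on this task (e.g. writeback). */
	blk_start_plug(&inner);

	submit_bio(WRITE, bio);		/* bio sits on the task's plug list */

	/*
	 * With the proposed change a nested blk_finish_plug() no longer
	 * flushes - only the outermost one does - so a caller that needs
	 * this IO issued immediately (e.g. readahead it is about to wait
	 * on) has to flush explicitly.
	 */
	blk_flush_plug(current);
	blk_finish_plug(&inner);
}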

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-17 Thread Dave Chinner
On Mon, Mar 16, 2015 at 08:12:16PM -0500, Alireza Haghdoost wrote:
 On Mon, Mar 16, 2015 at 3:32 PM, Dave Chinner da...@fromorbit.com wrote:
  On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
  Probably need to cc dm-devel here.  However, I think we're all agreed
  this is RAID across multiple devices, rather than within a single
  device?  In which case we just need a way of ensuring identical zoning
  on the raided devices and what you get is either a standard zone (for
  mirror) or a larger zone (for hamming etc).
 
  Any sort of RAID is a bloody hard problem, hence the fact that I'm
  designing a solution for a filesystem on top of an entire bare
  drive. I'm not trying to solve every use case in the world, just the
  one where the drive manufactures think SMR will be mostly used: the
  back end of never delete distributed storage environments
  We can't wait for years for infrastructure layers to catch up in the
  brave new world of shipping SMR drives. We may not like them, but we
  have to make stuff work. I'm not trying to solve every problem - I'm
  just trying to address the biggest use case I see for SMR devices
  and it just so happens that XFS is already used pervasively in that
  same use case, mostly within the same no raid, fs per entire
  device constraints as I've documented for this proposal...
 
 I am confused what kind of application you are referring to for this
 back end, no raid, fs per entire device. Are you gonna rely on the
 application to do replication for disk failure protection ?

Exactly. Think distributed storage such as Ceph and gluster where
the data redundancy and failure recovery algorithms are in layers
*above* the local filesystem, not in the storage below the fs.  The
no raid, fs per device model is already a very common back end
storage configuration for such deployments.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-16 Thread Dave Chinner
  and which are not.  The stack will have to collectively work together
  to find a way to request and use zones in an orderly fashion.
 
 Here I think the sense of LSF/MM was that only allowing a fixed number
 of zones to be open would get a bit unmanageable (unless the drive
 silently manages it for us).  The idea of different sized zones is also
 a complicating factor.

Not for XFS - my proposal handles variable sized zones without any
additional complexity. Indeed, it will handle zone sizes from 16MB
to 1TB without any modification - mkfs handles it all when it
queries the zones and sets up the zone allocation inodes...

And we limit the number of open zones by the number of zone groups
we allow concurrent allocation to...

 The other open question is that if we go for
 fully drive managed, what sort of alignment, size, trim + anything else
 should we do to make the drive's job easier.  I'm guessing we won't
 really have a practical answer to any of these until we see how the
 market responds.

I'm not aiming this proposal at drive managed, or even host-managed
drives: this proposal is for full host-aware (i.e. error on
out-of-order write) drive support. If you have drive managed SMR,
then there's pretty much nothing to change in existing filesystems.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-16 Thread Dave Chinner
On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
 [cc to linux-scsi added since this seems relevant]
 On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
  Hi Folks,
  
  As I told many people at Vault last week, I wrote a document
  outlining how we should modify the on-disk structures of XFS to
  support host aware SMR drives on the (long) plane flights to Boston.
  
  TL;DR: not a lot of change to the XFS kernel code is required, no
  specific SMR awareness is needed by the kernel code.  Only
  relatively minor tweaks to the on-disk format will be needed and
  most of the userspace changes are relatively straight forward, too.
  
  The source for that document can be found in this git tree here:
  
  git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
  
  in the file design/xfs-smr-structure.asciidoc. Alternatively,
  pull it straight from cgit:
  
  https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
  
  Or there is a pdf version built from the current TOT on the xfs.org
  wiki here:
  
  http://xfs.org/index.php/Host_Aware_SMR_architecture
  
  Happy reading!
 
 I don't think it would have caused too much heartache to post the entire
 doc to the list, but anyway
 
 The first is a meta question: What happened to the idea of separating
 the fs block allocator from filesystems?  It looks like a lot of the
 updates could be duplicated into other filesystems, so it might be a
 very opportune time to think about this.

Which requires a complete rework of the fs/block layer. That's the
long term goal, but we aren't going to be there for a few years yet.
Just look at how long it's taken for copy offload (which is trivial
compared to allocation offload) to be implemented...

  === RAID on SMR
  
  How does RAID work with SMR, and exactly what does that look like to
  the filesystem?
  
  How does libzbc work with RAID given it is implemented through the scsi ioctl
  interface?
 
 Probably need to cc dm-devel here.  However, I think we're all agreed
 this is RAID across multiple devices, rather than within a single
 device?  In which case we just need a way of ensuring identical zoning
 on the raided devices and what you get is either a standard zone (for
 mirror) or a larger zone (for hamming etc).

Any sort of RAID is a bloody hard problem, hence the fact that I'm
designing a solution for a filesystem on top of an entire bare
drive. I'm not trying to solve every use case in the world, just the
one where the drive manufactures think SMR will be mostly used: the
back end of never delete distributed storage environments

We can't wait for years for infrastructure layers to catch up in the
brave new world of shipping SMR drives. We may not like them, but we
have to make stuff work. I'm not trying to solve every problem - I'm
just trying to address the biggest use case I see for SMR devices
and it just so happens that XFS is already used pervasively in that
same use case, mostly within the same no raid, fs per entire
device constraints as I've documented for this proposal...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: BUG: scheduling while atomic in blk_mq codepath?

2014-06-19 Thread Dave Chinner
On Thu, Jun 19, 2014 at 12:21:44PM -0400, Theodore Ts'o wrote:
 On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote:
   The other issue, not sure, not a lot of detail. It may be fixed by the pull
   request I sent out yesterday. You can try pulling in:
   
   git://git.kernel.dk/linux-block.git for-linus
  
  Thanks, I'll give that a try.
 
 I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing
 the following.  On a 32-bit x86 3.15 kernel, run: mke2fs -t ext3
 /dev/vdc where /dev/vdc is a 5 gig virtio partition.

Short reads are more likely a bug in all the iovec iterator stuff
that got merged in from the vfs tree. ISTR a 32 bit-only bug in that
stuff go past, to do with not being able to partition a 32GB block
dev on a 32 bit system due to a 32 bit size_t overflow somewhere...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-29 Thread Dave Chinner
On Wed, Jan 29, 2014 at 09:52:46PM -0700, Matthew Wilcox wrote:
 On Fri, Jan 24, 2014 at 10:57:48AM +, Mel Gorman wrote:
  So far on the table is
  
  1. major filesystem overhaul
  2. major vm overhaul
  3. use compound pages as they are today and hope it does not go
 completely to hell, reboot when it does
 
 Is the below paragraph an exposition of option 2, or is it an option 4,
 change the VM unit of allocation?  Other than the names you're using,
 this is basically what I said to Kirill in an earlier thread; either
 scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start
 making use of it.

Christoph Lameter's compound page patch set scrapped PAGE_CACHE_SIZE
and made it a variable that was set on the struct address_space when
it was instantiated by the filesystem. In effect, it allowed
filesystems to specify the unit of page cache allocation on a
per-inode basis.

 The fact that EVERYBODY in this thread has been using PAGE_SIZE when they
 should have been using PAGE_CACHE_SIZE makes me wonder if part of the
 problem is that the split in naming went the wrong way.  ie use PTE_SIZE
 for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for
 'the amount of memory described by a struct page'.

PAGE_CACHE_SIZE was never distributed sufficiently to be used, and
if you #define it to something other than PAGE_SIZE stuff will
simply break.
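
For reference, this is roughly what sat in include/linux/pagemap.h at the
time - the macros existed, but were hard-wired to the CPU page size:

#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE
#define PAGE_CACHE_MASK		PAGE_MASK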

 (we need to remove the current users of PTE_SIZE; sparc32 and powerpc32,
 but that's just a detail)
 
 And we need to fix all the places that are currently getting the
 distinction wrong.  SMOP ... ;-)  What would help is correct typing of
 variables, possibly with sparse support to help us out.  Big Job.

Yes, that's what Christoph's patchset did.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 02:34:52PM +, Mel Gorman wrote:
 On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
  On 01/22/2014 04:34 AM, Mel Gorman wrote:
  On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
  One topic that has been lurking forever at the edges is the current
  4k limitation for file system block sizes. Some devices in
  production today and others coming soon have larger sectors and it
  would be interesting to see if it is time to poke at this topic
  again.
  
  Large block support was proposed years ago by Christoph Lameter
  (http://lwn.net/Articles/232757/). I think I was just getting started
  in the community at the time so I do not recall any of the details. I do
  believe it motivated an alternative by Nick Piggin called fsblock though
  (http://lwn.net/Articles/321390/). At the very least it would be nice to
  know why neither were never merged for those of us that were not around
  at the time and who may not have the chance to dive through mailing list
  archives between now and March.
  
  FWIW, I would expect that a show-stopper for any proposal is requiring
  high-order allocations to succeed for the system to behave correctly.
  
  
  I have a somewhat hazy memory of Andrew warning us that touching
  this code takes us into dark and scary places.
  
 
 That is a light summary. As Andrew tends to reject patches with poor
 documentation in case we forget the details in 6 months, I'm going to guess
 that he does not remember the details of a discussion from 7ish years ago.
 This is where Andrew swoops in with a dazzling display of his eidetic
 memory just to prove me wrong.
 
 Ric, are there any storage vendor that is pushing for this right now?
 Is someone working on this right now or planning to? If they are, have they
 looked into the history of fsblock (Nick) and large block support (Christoph)
 to see if they are candidates for forward porting or reimplementation?
 I ask because without that person there is a risk that the discussion
 will go as follows
 
 Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
 Room: Send patches and we'll talk.

So, from someone who was done in the trenches of the large
filesystem block size code wars, the main objection to Christoph
lameter's patchset was that it used high order compound pages in the
page cache so that nothing at filesystem level needed to be changed
to support large block sizes.

The patch to enable XFS to use 64k block sizes with Christoph's
patches was simply removing 5 lines of code that limited the block
size to PAGE_SIZE. And everything just worked.

Given that compound pages are used all over the place now and we
also have page migration, compaction and other MM support that
greatly improves high order memory allocation, perhaps we should
revisit this approach.

As to Nick's fsblock rewrite, he basically rewrote all the
bufferhead code to handle filesystem blocks larger than a page
whilst leaving the page cache untouched. i.e. the complete opposite
approach. The problem with this approach is that every filesystem
needs to be re-written to use fsblocks rather than bufferheads. For
some filesystems that isn't hard (e.g. ext2) but for filesystems
that use bufferheads in the core of their journalling subsystems
that's a completely different story.

And for filesystems like XFS, it doesn't solve any of the problem
with using bufferheads that we have now, so it simply introduces a
huge amount of IO path rework and validation without providing any
advantage from a feature or performance point of view. i.e. extent
based filesystems mostly negate the impact of filesystem block size
on IO performance...

Realistically, if I'm going to do something in XFS to add block size >
page size support, I'm going to do it with something XFS can track
through it's own journal so I can add data=journal functionality
with the same filesystem block/extent header structures used to
track the pages in blocks larger than PAGE_SIZE. And given that we
already have such infrastructure in XFS to support directory
blocks larger than filesystem block size

FWIW, as to the original large sector size support question, XFS
already supports sector sizes up to 32k in size. The limitation is
actually a limitation of the journal format, so going larger than
that would take some work...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  
  [ I like big sectors and I cannot lie ]
 
 I think I might be sceptical, but I don't think that's showing in my
 concerns ...
 
I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.
   
   Do we even need to do that (eliminate buffer heads)?  We cope with 4k
   sector only devices just fine today because the bh mechanisms now
   operate on top of the page cache and can do the RMW necessary to update
   a bh in the page cache itself which allows us to do only 4k chunked
   writes, so we could keep the bh system and just alter the granularity of
   the page cache.
   
  
  We're likely to have people mixing 4K drives and fill in some other
  size here on the same box.  We could just go with the biggest size and
  use the existing bh code for the sub-pagesized blocks, but I really
  hesitate to change VM fundamentals for this.
 
 If the page cache had a variable granularity per device, that would cope
 with this.  It's the variable granularity that's the VM problem.
 
  From a pure code point of view, it may be less work to change it once in
  the VM.  But from an overall system impact point of view, it's a big
  change in how the system behaves just for filesystem metadata.
 
 Agreed, but only if we don't do RMW in the buffer cache ... which may be
 a good reason to keep it.
 
   The other question is if the drive does RMW between 4k and whatever its
   physical sector size, do we need to do anything to take advantage of
   it ... as in what would altering the granularity of the page cache buy
   us?
  
  The real benefit is when and how the reads get scheduled.  We're able to
  do a much better job pipelining the reads, controlling our caches and
  reducing write latency by having the reads done up in the OS instead of
  the drive.
 
 I agree with all of that, but my question is still can we do this by
 propagating alignment and chunk size information (i.e. the physical
 sector size) like we do today.  If the FS knows the optimal I/O patterns
 and tries to follow them, the odd cockup won't impact performance
 dramatically.  The real question is can the FS make use of this layout
 information *without* changing the page cache granularity?  Only if you
 answer me no to this do I think we need to worry about changing page
 cache granularity.

We already do this today.

The problem is that we are limited by the page cache assumption that
the block device/filesystem never need to manage multiple pages as
an atomic unit of change. Hence we can't use the generic
infrastructure as it stands to handle block/sector sizes larger than
a page size...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 09:21:40AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 15:19 +, Mel Gorman wrote:
   On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
On 01/22/2014 09:34 AM, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
On 01/22/2014 04:34 AM, Mel Gorman wrote:
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.

Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I
do believe it motivated an alternative by Nick Piggin called fsblock
though (http://lwn.net/Articles/321390/). At the very least it would
be nice to know why neither was ever merged for those of us that were
not around at the time and who may not have the chance to dive through
mailing list archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching
this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to
guess that he does not remember the details of a discussion from 7ish
years ago. This is where Andrew swoops in with a dazzling display of
his eidetic memory just to prove me wrong.

Ric, are there any storage vendors that are pushing for this right now?
Is someone working on this right now or planning to? If they are, have
they looked into the history of fsblock (Nick) and large block support
(Christoph) to see if they are candidates for forward porting or
reimplementation?
I ask because without that person there is a risk that the discussion
will go as follows

Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
Room: Send patches and we'll talk.


I will have to see if I can get a storage vendor to make a public
statement, but there are vendors hoping to see this land in Linux in
the next few years.
   
   What about the second and third questions -- is someone working on this
   right now or planning to? Have they looked into the history of fsblock
   (Nick) and large block support (Christoph) to see if they are candidates
   for forward porting or reimplementation?
  
  I really think that if we want to make progress on this one, we need
  code and someone that owns it.  Nick's work was impressive, but it was
  mostly there for getting rid of buffer heads.  If we have a device that
  needs it and someone working to enable that device, we'll go forward
  much faster.
 
 Do we even need to do that (eliminate buffer heads)?

No, the reason bufferheads were replaced was that a bufferhead can
only reference a single page. i.e. the structure is that a page can
reference multiple bufferheads (block size <= page size) but a
bufferhead can't reference multiple pages, which is what is needed for
block size > page size. fsblock was designed to handle both cases.
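
To make the asymmetry concrete, the relevant bufferhead fields look
something like this (trimmed from include/linux/buffer_head.h, from
memory, keeping only the fields that matter here):

struct buffer_head {
	struct buffer_head *b_this_page;	/* circular list of the page's buffers */
	struct page *b_page;			/* the one page this bh is mapped to */
	sector_t b_blocknr;			/* start block number */
	size_t b_size;				/* size of the mapping, at most one page */
	char *b_data;				/* pointer to data within that page */
	/* ... */
};

There's one b_page pointer and one b_data pointer into that page, so a
bufferhead simply has nowhere to hang a second page off.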

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 11:50:02AM -0800, Andrew Morton wrote:
 On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
 
  But this, I think, is the fundamental point for debate.  If we can pull
  alignment and other tricks to solve 99% of the problem is there a need
  for radical VM surgery?  Is there anything coming down the pipe in the
  future that may move the devices ahead of the tricks?
 
 I expect it would be relatively simple to get large blocksizes working
 on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
 amounts of work, perhaps someone can do a proof-of-concept on powerpc
 (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been
used in production on XFS for at least 10 years. It's exactly the
same case as 4k block size on 4k page size - one page, one buffer
head, one filesystem block.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
 On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
   
   I expect it would be relatively simple to get large blocksizes working
   on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
   amounts of work, perhaps someone can do a proof-of-concept on powerpc
   (or ia64) with 64k blocksize.
  
  Reality check: 64k block sizes on 64k page Linux machines have been
  used in production on XFS for at least 10 years. It's exactly the
  same case as 4k block size on 4k page size - one page, one buffer
  head, one filesystem block.
 
 This is true for ext4 as well.  Block size == page size support is
 pretty easy; the hard part is when block size > page size, due to
 assumptions in the VM layer that require the FS to do a lot of extra
 work to fudge around.  So the real problem comes with
 trying to support 64k block sizes on a 4k page architecture, and can
 we do it in a way where every single file system doesn't have to do
 their own specific hacks to work around assumptions made in the VM
 layer.
 
 Some of the problems include handling the case where someone
 dirties a single block in a sparse page, and the FS needs to manually
 fault in the other 56k pages around that single page.  Or the VM not
 understanding that page eviction needs to be done in chunks of 64k so
 we don't have part of the block evicted but not all of it, etc.

Right, this is part of the problem that fsblock tried to handle, and
some of the nastiness it had was that a page fault only resulted in
the individual page being read from the underlying block. This means
that it was entirely possible that the filesystem would need to do
RMW cycles in the writeback path itself to handle things like block
checksums, copy-on-write, unwritten extent conversion, etc. i.e. all
the stuff that the page cache currently handles by doing RMW cycles
at the page level.

The method of using compound pages in the page cache so that the
page cache could do 64k RMW cycles so that a filesystem never had to
deal with new issues like the above was one of the reasons that
approach is so appealing to us filesystem people. ;)
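
To spell out what "the page cache handles the RMW" means here - a
hand-waving sketch only, every name below is made up for illustration,
and today the unit is a single page where we'd want it to be the
filesystem block:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one cached filesystem block (e.g. 64k) */
struct cache_unit {
	char	*data;		/* mapping of the whole unit */
	size_t	 size;		/* unit size */
	bool	 uptodate;	/* whole unit has been read in */
	bool	 dirty;		/* whole unit needs writeback */
};

/*
 * A write covering only part of the unit: read the whole thing in
 * first, modify the written range, and let writeback push the whole
 * unit back out.  If the page cache works on compound pages the
 * filesystem never sees the partial write; otherwise it has to do
 * this itself in its writeback path.
 */
static int write_partial_unit(struct cache_unit *u, size_t off, size_t len,
			      const void *src,
			      int (*read_unit)(struct cache_unit *))
{
	int err;

	if (!u->uptodate) {
		err = read_unit(u);		/* Read */
		if (err)
			return err;
		u->uptodate = true;
	}
	memcpy(u->data + off, src, len);	/* Modify */
	u->dirty = true;			/* Write(back) later */
	return 0;
}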

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 04:44:38PM +, Mel Gorman wrote:
 On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
  On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
   On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  The other question is if the drive does RMW between 4k and whatever its
  physical sector size, do we need to do anything to take advantage of
  it ... as in what would altering the granularity of the page cache buy
  us?
 
 The real benefit is when and how the reads get scheduled.  We're able to
 do a much better job pipelining the reads, controlling our caches and
 reducing write latency by having the reads done up in the OS instead of
 the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.
   
   We already do this today.
   
   The problem is that we are limited by the page cache assumption that
   the block device/filesystem never need to manage multiple pages as
   an atomic unit of change. Hence we can't use the generic
   infrastructure as it stands to handle block/sector sizes larger than
   a page size...
  
  If the compound page infrastructure exists today and is usable for this,
  what else do we need to do? ... because if it's a couple of trivial
  changes and a few minor patches to filesystems to take advantage of it,
  we might as well do it anyway. 
 
 Do not do this as there is no guarantee that a compound allocation will
 succeed. If the allocation fails then it is potentially unrecoverable
 because we can no longer write to storage and then you're hosed.  If you are
 now thinking mempool then the problem becomes that the system will be
 in a state of degraded performance for an unknowable length of time and
 may never recover fully.

We are talking about page cache allocation here, not something deep
down inside the IO path that requires mempools to guarantee IO
completion. IOWs, we have an *existing error path* to return ENOMEM
to userspace when page cache allocation fails.
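
i.e. something like the usual ->write_begin() shape.  This is
simplified from memory rather than lifted from any particular
filesystem, but the error path is the point:

/* Simplified sketch - the point is only the error path */
#include <linux/pagemap.h>

static int example_write_begin(struct address_space *mapping, pgoff_t index,
			       struct page **pagep)
{
	struct page *page;

	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;		/* propagates back to write(2) */

	*pagep = page;
	return 0;
}

A compound page allocation failing there takes exactly the same
-ENOMEM path back to userspace; it's nothing like an allocation inside
the IO completion path that would need a mempool.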

 64K MMU page size systems get away with this
 because the blocksize is still <= PAGE_SIZE and no core VM changes are
 necessary. Critically, pages like the page table pages are the same size as
 the basic unit of allocation used by the kernel so external fragmentation
 simply is not a severe problem.

Christoph's old patches didn't need 64k MMU page sizes to work.
IIRC, the compound page was mapped into the page cache as
individual 4k pages. Any change of state on the child pages followed
the back pointer to the head of the compound page and changed the
state of that page. On page faults, the individual 4k pages were
mapped to userspace rather than the compound page, so there was no
userspace visible change, either.

The question I had at the time that was never answered was this: if
pages are faulted and mapped individually through their own ptes,
why did the compound pages need to be contiguous? copy-in/out
through read/write was still done a PAGE_SIZE granularity, mmap
mappings were still on PAGE_SIZE granularity, so why can't we build
a compound page for the page cache out of discontiguous pages?

FWIW, XFS has long used discontiguous pages for large block support
in metadata. Some of that is vmapped to make metadata processing
simple. The point of this is that we don't need *contiguous*
compound pages in the page cache if we can map them into userspace
as individual PAGE_SIZE pages. Only the page cache management needs
to handle the groups of pages that make up a filesystem block
as a compound page.
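
For the curious, the guts of that XFS trick is nothing more exotic
than vmap() over a discontiguous page array.  A sketch, not the actual
_xfs_buf_map_pages() code:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Present nr_pages discontiguous pages as one contiguous kernel
 * mapping.  The pages can come from anywhere; only the virtual
 * mapping is contiguous.  Torn down later with vunmap().
 */
static void *map_discontig_block(struct page **pages, unsigned int nr_pages)
{
	return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}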

  I was only objecting on the grounds that
  the last time we looked at it, it was major VM surgery.  Can someone
  give a summary of how far we are away from being able to do this with
  the VM system today and what extra work is needed (and how big is this
  piece of work)?
  
 
 Offhand no idea. For fsblock, probably a similar amount of work to what
 had to be done in 2007, and I'd expect it would still hit the filesystem
 awareness problems that Dave Chinner pointed out earlier. For large block,
 it'd hit into the same wall that allocations must always succeed. If we
 want to break the connection between the basic unit of memory managed
 by the kernel and the MMU page size then I don't know but it would be a
 fairly large amount of surgery and need a lot of design work.

Here's the patch that Christoph wrote back in 2007 to add PAGE_SIZE
based mmap

Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.

2014-01-20 Thread Dave Chinner
On Mon, Jan 20, 2014 at 05:58:55AM -0800, Christoph Hellwig wrote:
 On Thu, Jan 16, 2014 at 09:07:21AM +1100, Dave Chinner wrote:
  Yes, I think it can be done relatively simply. We'd have to change
  the code in xfs_file_aio_write_checks() to check whether EOF zeroing
  was required rather than always taking an exclusive lock (for block
  aligned IO at EOF sub-block zeroing isn't required),
 
 That's not even required for supporting aio appends, just a further
 optimization for it.

Oh, right, I got an off-by-one when reading the code - the EOF
zeroing only occurs when the offset is beyond EOF, not at or beyond
EOF...

  and then we'd
  have to modify the direct IO code to set the is_async flag
  appropriately. We'd probably need a new flag to tell the DIO
  code that AIO beyond EOF is OK, but that isn't hard to do
 
 Yep, need a flag to allow appending writes and then defer them.
 
  Christoph, are you going to get any time to look at doing this in
  the next few days?
 
 I'll probably need at least another week before I can get to it.  If you
 wanna pick it up before then, feel free.

I'm probably not going to get to it before then, either, so check
back in a week?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.

2014-01-15 Thread Dave Chinner
On Tue, Jan 14, 2014 at 03:30:11PM +0200, Sergey Meirovich wrote:
 Hi Cristoph,
 
 On 8 January 2014 16:03, Christoph Hellwig h...@infradead.org wrote:
  On Tue, Jan 07, 2014 at 08:37:23PM +0200, Sergey Meirovich wrote:
  Actually my initial report (14.67Mb/sec  3755.41 Requests/sec) was about ext4.
  However I have tried XFS as well. It was a bit slower than ext4 on all
  occasions.
 
  I wasn't trying to say XFS fixes your problem, but that we could
  implement appending AIO writes in XFS fairly easily.
 
  To verify Jan's theory, can you try to preallocate the file to the full
  size and then run the benchmark by doing a:
 
  # fallocate -l <size> <filename>
 
  and then run it?  If that's indeed the issue I'd be happy to implement
  the real aio append support for you as well.
 
 
 I've resorted to writing a simple wrapper around io_submit() and ran it
 against a preallocated file (exactly to avoid the append AIO scenario).
 Random data was used to avoid XtremIO online deduplication but results
 were still wonderful for 4k sequential AIO write:
 
 744.77 MB/s   190660.17 Req/sec
 
 Clearly Linux lacks real aio append being available for any FS.
 It seems that you are thinking it would be relatively easy to
 implement it for XFS on Linux? If so - I will really appreciate your
 effort.
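
[ For reference, a wrapper like the one described above boils down to
roughly the following - a minimal sketch using libaio against an
already preallocated file.  The file name, the 1GiB size and the queue
depth of 1 are placeholders; a real benchmark would keep many more
iocbs in flight.  Build with "gcc -o seqaio seqaio.c -laio". ]

#define _GNU_SOURCE		/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS	4096UL
#define TOTAL	(1UL << 30)	/* 1 GiB, placeholder */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	unsigned long off;
	void *buf;
	int fd;

	fd = open("testfile", O_WRONLY | O_DIRECT);	/* preallocated file */
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 'x', BS);		/* the real test used random data */

	for (off = 0; off < TOTAL; off += BS) {
		io_prep_pwrite(&cb, fd, buf, BS, off);
		if (io_submit(ctx, 1, cbs) != 1)
			break;
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			break;
	}
	io_destroy(ctx);
	close(fd);
	return 0;
}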

Yes, I think it can be done relatively simply. We'd have to change
the code in xfs_file_aio_write_checks() to check whether EOF zeroing
was required rather than always taking an exclusive lock (for block
aligned IO at EOF sub-block zeroing isn't required), and then we'd
have to modify the direct IO code to set the is_async flag
appropriately. We'd probably need a new flag to tell the DIO
code that AIO beyond EOF is OK, but that isn't hard to do.
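
The shape of that check is simple enough.  As a sketch only - this is
not the actual xfs_file_aio_write_checks() change, the helper name is
made up and locking is elided:

#include <linux/fs.h>

/*
 * Only a write that starts beyond EOF, where EOF is not block aligned,
 * needs the old EOF block zeroed (and hence the exclusive lock).
 */
static bool write_needs_eof_zeroing(struct inode *inode, loff_t pos,
				    unsigned int blocksize)
{
	loff_t isize = i_size_read(inode);

	if (pos <= isize)
		return false;		/* not extending past EOF */
	if ((isize & (blocksize - 1)) == 0)
		return false;		/* EOF is block aligned, nothing to zero */
	return true;			/* partial block at EOF must be zeroed */
}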

And for those that are wondering about the stale data exposure problem
documented in the aio code:

/*
 * For file extending writes updating i_size before data
 * writeouts complete can expose uninitialized blocks. So
 * even for AIO, we need to wait for i/o to complete before
 * returning in this case.
 */

This is fixed in XFS by removing a single if() check in
xfs_iomap_write_direct(). We already use unwritten extents for DIO
within EOF to avoid races that could expose uninitialised blocks, so
we just need to make that unconditional behaviour.  Hence racing IO
on concurrent appending i_size updates will only ever see a hole
(zeros), an unwritten region (zeros) or the written data.
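
So the completion-side ordering effectively becomes the following.
Again a sketch, not the real xfs_end_io() path, and the conversion
callback is just a placeholder for the unwritten extent conversion:

#include <linux/fs.h>

/*
 * Convert the unwritten extent to written *before* moving i_size, so a
 * racing reader never sees stale blocks past the old EOF.  Locking and
 * error handling elided.
 */
static void dio_append_complete(struct inode *inode, loff_t offset, size_t size,
				int (*convert_unwritten)(struct inode *, loff_t, size_t))
{
	convert_unwritten(inode, offset, size);		/* unwritten -> written */
	if (offset + size > i_size_read(inode))
		i_size_write(inode, offset + size);	/* now safe to expose */
}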

Christoph, are you going to get any time to look at doing this in
the next few days?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com