Re: [PATCH] staging: exfat: add exfat filesystem code to staging

2019-08-30 Thread Dave Chinner
On Thu, Aug 29, 2019 at 01:18:10PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Aug 29, 2019 at 03:37:49AM -0700, Christoph Hellwig wrote:
> > On Thu, Aug 29, 2019 at 11:50:19AM +0200, Greg Kroah-Hartman wrote:
> > > I know the code is horrible, but I will gladly take horrible code into
> > > staging.  If it bothers you, just please ignore it.  That's what staging
> > > is there for :)
> > 
> > And then after a while you decide it's been long enough and force move
> > it out of staging like the POS erofs code?
> 
> Hey, that's not nice, erofs isn't a POS.  It could always use more
> review, which the developers asked for numerous times.
> 
> There's nothing different from a filesystem compared to a driver.  If
> it's stand-alone, and touches nothing else, all issues with it are
> self-contained and do not bother anyone else in the kernel.  We merge

I whole-heartedly disagree with that statement.

The major difference between filesystems and the rest of the kernel
that seems to be missed by most kernel developers is that
filesystems maintain persistent data - you can't fix a problem/bug
by rebooting or power cycling. Once the filesystem code screws up,
the user is left with a mess they have to fix and that invariably
results in data loss.

Users remember when a filesystem eats their data - they don't tend
to want to have anything to do with that filesystem ever again if it
happens to them. We still get people saying "XFS ate my data back in
2002, I don't trust it and I'll never use it again".

Users put up with shit hardware and drivers - it's an inconvenience
more than anything. They don't put up with buggy filesystems that
lose their data - that is completely unacceptable to users.  As a
result, the quality and stability standard for merging a new
filesystem needs to be far higher than what is acceptable for
merging a new driver.

The correct place for new filesystem review is where all the
experienced filesystem developers hang out - that's linux-fsdevel,
not the driver staging tree.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] staging: exfat: add exfat filesystem code to staging

2019-08-31 Thread Dave Chinner
On Sat, Aug 31, 2019 at 06:31:45AM -0400, Valdis Klētnieks wrote:
> On Sat, 31 Aug 2019 07:54:10 +1000, Dave Chinner said:
> 
> > The correct place for new filesystem review is where all the
> > experienced filesystem developers hang out - that's linux-fsdevel,
> > not the driver staging tree.
> 
> So far everything's been cc'ed to linux-fsdevel, which has been spending
> more time discussing unlikely() usage in a different filesystem.

That's just noise - you'll get whitespace and other trivial
review on any list you post a patch series for review. Go back and
look at what other people have raised w.r.t. that filesystem -
on-disk format validation, re-implementation of largely generic
code, lack of namespacing of functions leading to conflicts with
generic/VFS functionality, etc.

Review bandwidth for things like on-disk format definition and
manipulation, consistency with other filesystems, efficient
integration into the generic infrastructure, etc is limited.
Everyone has to juggle that around the load they have for their own
filesystem maintenance, and there's usually only bandwidth for a
single filesystem at a time.

Just be patient - trying to force the merging of code before there's
even been consensus on fundamental architecture choices doesn't make
things better for anyone.  Merging incomplete filesystem code early
in the development cycle has -always- been something we've regretted
in the long run.  We've learned this lesson many times before, yet we
seem doomed to repeat it yet again...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC] errno.h: Provide EFSCORRUPTED for everybody

2019-11-01 Thread Dave Chinner
On Fri, Nov 01, 2019 at 09:57:31PM +0100, Geert Uytterhoeven wrote:
> Hi Valdis,
> 
> On Thu, Oct 31, 2019 at 2:11 AM Valdis Kletnieks
>  wrote:
> > Three questions: (a) ACK/NAK on this patch, (b) should it be all in one
> > patch, or one to add to errno.h and 6 patches for 6 filesystems?), and
> > (c) if one patch, who gets to shepherd it through?
> >
> > There's currently 6 filesystems that have the same #define. Move it
> > into errno.h so it's defined in just one place.
> >
> > Signed-off-by: Valdis Kletnieks 
> 
> Thanks for your patch!
> 
> > --- a/include/uapi/asm-generic/errno.h
> > +++ b/include/uapi/asm-generic/errno.h
> > @@ -98,6 +98,7 @@
> >  #define EINPROGRESS     115     /* Operation now in progress */
> >  #define ESTALE          116     /* Stale file handle */
> >  #define EUCLEAN         117     /* Structure needs cleaning */
> > +#define EFSCORRUPTED    EUCLEAN
> 
> I have two questions:
> a) Why not use EUCLEAN everywhere instead?
> Having two different names for the same errno complicates grepping.

Because:
a) EUCLEAN is horrible for code documentation. EFSCORRUPTED
describes exactly the error being returned and/or checked for.

b) we've used EFSCORRUPTED in XFS since 1993. i.e. it was an
official, published error value on Irix, and we've kept it
in the linux code for the past ~20 years because of a).

c) Userspace programs that include filesystem specific
headers have already been exposed to and use EFSCORRUPTED,
so we can't remove/change it without breaking userspace.

d) EUCLEAN has a convenient userspace string description
that is appropriate for filesystem corruption: "Structure
needs cleaning" is precisely what needs to happen. Repair of
the filesystem (i.e. recovery to a clean state) is what is
required to fix the error.
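
As an illustration of a) and d), a minimal userspace sketch (not part
of the patch; it assumes only the alias added by the hunk above and
glibc's stock strerror() text for EUCLEAN):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* The alias the patch formalises, for headers that lack it. */
    #ifndef EFSCORRUPTED
    #define EFSCORRUPTED EUCLEAN
    #endif

    int main(void)
    {
            /* Same errno value, two names - prints the value twice
             * (117 on x86) and "Structure needs cleaning" on glibc. */
            printf("%d %d\n", EUCLEAN, EFSCORRUPTED);
            printf("%s\n", strerror(EFSCORRUPTED));
            return 0;
    }

A check like "if (err == -EFSCORRUPTED)" documents itself at the call
site in a way the bare EUCLEAN never will.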

> b) Perhaps both errors should use different values?

That horse bolted to userspace years ago - this is just formalising
the practice that has spread across multiple linux filesystems from
XFS over the past ~10 years...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] errno.h: Provide EFSCORRUPTED for everybody

2019-11-05 Thread Dave Chinner
On Tue, Nov 05, 2019 at 04:15:50PM +0100, David Sterba wrote:
> On Sat, Nov 02, 2019 at 08:38:23AM +1100, Dave Chinner wrote:
> > On Fri, Nov 01, 2019 at 09:57:31PM +0100, Geert Uytterhoeven wrote:
> > > Hi Valdis,
> > > 
> > > On Thu, Oct 31, 2019 at 2:11 AM Valdis Kletnieks
> > >  wrote:
> > > > Three questions: (a) ACK/NAK on this patch, (b) should it be all in one
> > > > patch, or one to add to errno.h and 6 patches for 6 filesystems?), and
> > > > (c) if one patch, who gets to shepherd it through?
> > > >
> > > > There's currently 6 filesystems that have the same #define. Move it
> > > > into errno.h so it's defined in just one place.
> > > >
> > > > Signed-off-by: Valdis Kletnieks 
> > > 
> > > Thanks for your patch!
> > > 
> > > > --- a/include/uapi/asm-generic/errno.h
> > > > +++ b/include/uapi/asm-generic/errno.h
> > > > @@ -98,6 +98,7 @@
> > > >  #define EINPROGRESS     115     /* Operation now in progress */
> > > >  #define ESTALE          116     /* Stale file handle */
> > > >  #define EUCLEAN         117     /* Structure needs cleaning */
> > > > +#define EFSCORRUPTED    EUCLEAN
> > > 
> > > I have two questions:
> > > a) Why not use EUCLEAN everywhere instead?
> > > Having two different names for the same errno complicates grepping.
> > 
> > Because:
> > a) EUCLEAN is horrible for code documentation. EFSCORRUPTED
> > describes exactly the error being returned and/or checked for.
> > 
> > b) we've used EFSCORRUPTED in XFS since 1993. i.e. it was an
> > official, published error value on Irix, and we've kept it
> > in the linux code for the past ~20 years because of a).
> > 
> > c) Userspace programs that include filesystem specific
> > headers have already been exposed to and use EFSCORRUPTED,
> > so we can't remove/change it without breaking userspace.
> > 
> > d) EUCLEAN has a convenient userspace string description
> > that is appropriate for filesystem corruption: "Structure
> > needs cleaning" is precisely what needs to happen. Repair of
> > the filesystem (i.e. recovery to a clean state) is what is
> > required to fix the error.
> 
> The description is very confusing to users who are also not filesystem
> developers.

That's a pretty good description of most error messages for modern
software packages. Anything that is a non-trivial error requires web
searching to understand and diagnose these days.

Google knows exactly what you are looking for if you search for
"mount structure needs cleaning" or other similar error messages.
That means users also know what it means and what they need to
do to address it.  i.e. it's been around long enough that there's a
deep history in internet search engines and question forums like
stackexchange.

> "Structure needs cleaning" says what needs to be done but
> not what happened. Unlike other error codes like "not enough memory",
> "IO error" etc. We don't have EBUYMEM / "Buy more memory" instead of
> ENOMEM.

Message grammar is largely irrelevant. And not unique to EUCLEAN. e.g.
EAGAIN = "Try again".

> Fuzzing tests and crafted images produce most of the EUCLEAN errors and
> in this context "structure needs cleaning" makes even less sense.

It's pretty silly to argue that a developer crafting and/or fuzzing
corrupted filesystem images is not going to understand what the
error message returned when they successfully corrupt a filesystem
actually means...

> > > b) Perhaps both errors should use different values?
> > 
> > That horse bolted to userspace years ago - this is just
> > formalising the practice that has spread across multiple linux
> > filesystems from XFS over the past ~10 years...
> 
> EFSCORRUPTED is an appropriate name, but sharing the number with
> EUCLEAN was a terrible decision, and now every filesystem has to
> suffer and explain to users what the code really means and point
> to the sad story when asked "So why don't you have a separate
> code?".

Damned if you do, damned if you don't.

Back in 2005-2006, XFS developers tried to make EFSCORRUPTED a
proper system wide errno (like we had on Irix). The NIH was strong
in the linux community back then, and we were largely told to go
away as the superior Linux filesystems would never need to report
corruption to userspace so we don't need a special errno just
because some shitt

Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-03 Thread Dave Chinner
On Wed, Oct 02, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 1 Oct 2013, Joe Thornber wrote:
> 
> > > Alternatively, delaying them will stall the filesystem because it's
> > > waiting for said REQ_FUA IO to complete. For example, journal writes
> > > in XFS are extremely IO latency sensitive in workloads that have a
> > > significant number of ordering constraints (e.g. O_SYNC writes,
> > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > > filesystem for the majority of that barrier_deadline_ms.
> > 
> > Yes, this is a valid concern, but I assume Akira has benchmarked.
> > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> > see if there are any other FUA requests on my queue that can be
> > aggregated into a single flush.  I agree with you that the target
> > should never delay waiting for new io; that's asking for trouble.
> > 
> > - Joe
> 
> You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
> and aggregate all the requests that were received while you processed the 
> initial request. This way, you can do request batching without introducing 
> artificial delays.

Yes, that's what XFS does with its log when lots of fsync requests
come in. i.e. the first is dispatched immediately, and the others
are gathered into the next log buffer until it is either full or the
original REQ_FUA log write completes.
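
That pattern - dispatch the first flush immediately, aggregate
whatever arrives while it is in flight - looks roughly like the
sketch below. All names are hypothetical; this is neither XFS nor
dm code, and it assumes the 3.x-era bio_list helpers and
generic_make_request():

    #include <linux/bio.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct flush_batcher {
            spinlock_t      lock;
            bool            in_flight;  /* a flush is already on the wire */
            struct bio_list pending;    /* flushes that arrived meanwhile */
    };

    static void batcher_submit(struct flush_batcher *fb, struct bio *bio)
    {
            bool dispatch;

            spin_lock(&fb->lock);
            dispatch = !fb->in_flight;
            if (dispatch)
                    fb->in_flight = true;            /* first one: go now */
            else
                    bio_list_add(&fb->pending, bio); /* aggregate the rest */
            spin_unlock(&fb->lock);

            if (dispatch)
                    generic_make_request(bio);       /* no artificial delay */
    }

    /* Called from the completion path of the in-flight flush. */
    static void batcher_flush_done(struct flush_batcher *fb)
    {
            struct bio_list batch;
            struct bio *bio;

            bio_list_init(&batch);
            spin_lock(&fb->lock);
            bio_list_merge(&batch, &fb->pending);
            bio_list_init(&fb->pending);
            fb->in_flight = !bio_list_empty(&batch);
            spin_unlock(&fb->lock);

            /* Everything that queued up rides out as one batch. */
            while ((bio = bio_list_pop(&batch)))
                    generic_make_request(bio);
    }

The batching falls out of completion latency that already exists -
no timer, no deadline, and the issuer never waits on an artificial
delay.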

That's where arbitrary delays in the storage stack below XFS cause
problems - if the first FUA log write is delayed, the next log
buffer will get filled, issued and delayed, and when we run out of
log buffers (there are 8 maximum) the entire log subsystem will
stall, stopping *all* log commit operations until log buffer
IOs complete and become free again. i.e. it can stall modifications
across the entire filesystem while we wait for batch timeouts to
expire and issue and complete FUA requests.

IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
point where they are issued - any attempt to further optimise them
by adding delays down in the stack to aggregate FUA operations will
only increase latency of the operations that the issuer wants to have
complete as fast as possible...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-07 Thread Dave Chinner
On Sat, Oct 05, 2013 at 04:51:16PM +0900, Akira Hayakawa wrote:
> Dave,
> 
> > That's where arbitrary delays in the storage stack below XFS cause
> > problems - if the first FUA log write is delayed, the next log
> > buffer will get filled, issued and delayed, and when we run out of
> > log buffers (there are 8 maximum) the entire log subsystem will
> > stall, stopping *all* log commit operations until log buffer
> > IOs complete and become free again. i.e. it can stall modifications
> > across the entire filesystem while we wait for batch timeouts to
> > expire and issue and complete FUA requests.
> To me, this sounds like a design failure in the XFS log subsystem.

If you say so. As it is, XFS is the best of all the linux
filesystems when it comes to performance under a heavy fsync
workload, so if you consider it broken by design then you've got a
horror show waiting for you on any other filesystem...

> Or just a limitation of the metadata journal.

It's a recovery limitation - the more uncompleted log buffers we
have outstanding, the more space in the log will be considered
unrecoverable during a crash...

> > IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> > point where they are issued - any attempt to further optimise them
> > by adding delays down in the stack to aggregate FUA operations will
> > only increase latency of the operations that the issuer wants to have
> > complete as fast as possible...
> Lower layers attempting to optimize further
> can benefit any filesystem.
> So, your opinion is not always correct, although
> it is always correct for error handling or memory management.
> 
> I have proposed a future plan of using persistent memory.
> I believe that with this leap forward
> filesystems are freed from doing such optimizations
> relevant to write barriers. For more detail, please see my post.
> https://lkml.org/lkml/2013/10/4/186

Sure, we already do that in the storage stack to minimise the impact
of FUA operations - it's called a non-volatile write cache, and most
RAID controllers have them. They rely on immediate dispatch of FUA
operations to get them into the write cache as quickly as possible
(i.e. what filesystems do right now), and that is something your
proposed behaviour will prevent.

i.e. there's no point justifying a behaviour with "we could do this
in future so let's ignore the impact on current users"...

> However,
> I think I should leave an option to disable the optimization
> in case the upper layer doesn't like it.
> Maybe writeboost should disable deferring barriers
> if the barrier_deadline_ms parameter is set to 0.
> The Linux kernel's layered architecture is obviously not always perfect,
> so there are similar cases at other boundaries,
> such as O_DIRECT bypassing the page cache.

Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
relying on the user tweaking the correct knobs for their workload.

e.g. what happens if a user has a mixed workload - one where
performance benefits are only seen by delaying FUA, and another that
is seriously slowed down by delaying FUA requests?  This is where
knobs are problematic...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-15 Thread Dave Chinner
[cc x...@oss.sgi.com]

On Tue, Oct 15, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> On Mon, 14 Oct 2013, Akira Hayakawa wrote:
> > But, XFS stalls ...
> > ---
> > For testing,
> > I manually turn `blockup` to 1
> > while a Ruby compile is in progress
> > on XFS on a writeboost device.
> > As soon as I do it,
> > XFS starts to dump error messages
> > like "metadata I/O error: ... ("xlog_iodone") error ..."
> > and after a few seconds it then starts to dump
> > like "BUG: soft lockup - CPU#3 stuck for 22s!".
> > The system stalls and doesn't accept keyboard input.
> > 
> > I think this behavior is caused by
> > the device always returning -EIO after turning 
> > the variable to 1.
> > But why does XFS stall on I/O error?
> 
> Because it is bloated and buggy.

How did I know you'd take that cheap shot, Mikulas? You are so
predictable...

> We have bug 924301 for XFS crash on I/O 
> error...

Which is a problem with memory corruption: after filling a dm
snapshot volume to 100%, and shortly after XFS has shut down, the
kernel panics from memory corruption. It can't be reproduced without
filling the dm-snapshot volume to 100%, and can't be reproduced with
any other filesystem. Crashes are also occurring randomly in printk
and the worker thread infrastructure. Memory and list poisoning
clearly indicates worker thread lists have freed objects on them.
There are lockdep messages from the DM snapshot code, etc.

There's actually very little to point at XFS problems other than the
first hang that was reported where XFS was stuck in a tight loop due
to memory corruption.  It reminds me of a very similar bug report
and triage we went through last week:

http://oss.sgi.com/pipermail/xfs/2013-October/030681.html

Further analysis and bisects pointed to the zram driver being buggy,
not XFS:

http://oss.sgi.com/pipermail/xfs/2013-October/030707.html

XFS has a long history of exposing bugs in block device drivers that
no other filesystem exposes, and so when a new block device driver
gets tested with XFS and we start seeing memory corruption symptoms,
it's a fair bet that it's not XFS that is causing it...

Just sayin'.

---

Akira, can you please post the entire set of messages you are
getting when XFS is showing problems? That way I can try to confirm
whether it's a regression in XFS or something else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-16 Thread Dave Chinner
On Wed, Oct 16, 2013 at 07:34:38PM +0900, Akira Hayakawa wrote:
> Dave
> 
> > Akira, can you please post the entire set of messages you are
> > getting when XFS is showing problems? That way I can try to confirm
> > whether it's a regression in XFS or something else.
> 
> Environment:
> - The kernel version is 3.12-rc1
> - The debuggee is a KVM virtual machine equipped with 8 vcpus.
> - writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
>   You can clone it from https://github.com/akiradeveloper/dm-writeboost
> 
> Test:
> 1. Make a writeboost device with a 3MB cache device and a 3GB backing
>    store with default options (segment size order is 7 and a 2MB RAM
>    buffer is allocated).
> 2. Start the testing/1 script (compiling Ruby, then make test after it).
> 3. Set the blockup variable to 1 via the message interface a few seconds
>    later. The writeboost device starts to return -EIO on all incoming
>    requests. I guess this behavior causes the problem.
> 
> In some cases, XFS doesn't collapse after setting blockup to 1.
> When I set the variable to 1 about 10 or 20 seconds later,
> it didn't collapse but neatly stopped the compile, and
> after I set it back to 0, it restarted the compile.
> That XFS collapses (badly shutting down the filesystem as seen below)
> in some cases but doesn't in others suggests to me that
> the former case runs into a very corner-case bug.

XFS shuts down because you've returned EIO to a log IO. That's a
fatal error. If you do the same to an ext4 journal write, it will do
the equivalent of shut down (e.g. complain and turn read-only).

> The entire set of messages via virsh console is shown below.
> The lines related to writeboost are all benign.
> The daemons are just stopping because the blockup variable is 1.
> 
> [  146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") 
> error 5 numblks 64
> [  146.285825] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
> [  146.286699] XFS (dm-3): Please umount the filesystem and rectify the 
> problem(s)

What happened before this? Please attach the *full* log.

> [  146.560036] device-mapper: writeboost: err@modulator_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  147.244036] device-mapper: writeboost: err@migrate_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
> [  172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

These should be emitting a stack trace. Can you turn up the logging
level you are using so that they emit a full stack trace? The
messages are useless without the stack dump...
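
For reference, the standard knobs for getting those (illustrative
commands only):

    # raise the console log level so lockup reports aren't filtered
    echo 8 > /proc/sys/kernel/printk
    # dump the stacks of all tasks to the kernel log via sysrq
    echo t > /proc/sysrq-trigger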

Also, 23 seconds before this timestamp is 149s, about 3s after the
XFS filesystem shut down, so it's not clear that the XFS shutdown is
related to the soft lockup yet. That's what we need the stack traces
for...

> [  180.560040] device-mapper: writeboost: err@recorder_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked 
> up on I/O error. set blockup to 0 after checkup.

What's with the 35s delay between these writeboost messages? Have
you only done a partial shutdown of the block device, and it takes
this length of time for it to completely block IO?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-16 Thread Dave Chinner
> [  100.052005]  [] ? __kthread_parkme+0x5d/0x5d

You need to compile your kernel with framepointers enabled so we get
reliable stack traces. I think it's stuck on a spinlock in
xfs_buf_iodone, which would imply the AIL lock.

.

> [  100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-3 Tainted: G   O 
> 3.12.0-rc1 #8

FWIW, you should probably be testing the latest Linus kernel
(3.12-rc5, IIRC) rather than -rc1.



> [  100.244006] RIP: 0010:[]  [] 
> do_raw_spin_lock+0x16/0x23
> [  100.244006] Call Trace:
> [  100.244006]  [] ? xfs_trans_committed_bulk+0x2f/0x19d 
> [xfs]
> [  100.244006]  [] ? _xfs_buf_ioapply+0x271/0x29c [xfs]
> [  100.244006]  [] ? remove_wait_queue+0xe/0x48
> [  100.244006]  [] ? xlog_wait+0x62/0x6b [xfs]
> [  100.244006]  [] ? try_to_wake_up+0x190/0x190
> [  100.244006]  [] ? xlog_state_get_iclog_space+0x5a/0x1fb 
> [xfs]
> [  100.244006]  [] ? __cache_free.isra.46+0x178/0x187
> [  100.244006]  [] ? xlog_cil_committed+0x2f/0xe6 [xfs]
> [  100.244006]  [] ? xlog_cil_push+0x2f6/0x311 [xfs]
> [  100.244006]  [] ? mmdrop+0xd/0x1c
> [  100.244006]  [] ? xlog_cil_force_lsn+0x71/0xdd [xfs]
> [  100.244006]  [] ? _xfs_log_force+0x55/0x1a0 [xfs]
> [  100.244006]  [] ? xfs_log_force+0x1f/0x4e [xfs]
> [  100.244006]  [] ? xfsaild+0x144/0x4cd [xfs]
> [  100.244006]  [] ? finish_task_switch+0x7f/0xaa
> [  100.244006]  [] ? xfs_trans_ail_cursor_first+0x76/0x76 
> [xfs]
> [  100.244006]  [] ? xfs_trans_ail_cursor_first+0x76/0x76 
> [xfs]
> [  100.244006]  [] ? kthread+0x81/0x89
> [  100.244006]  [] ? __kthread_parkme+0x5d/0x5d
> [  100.244006]  [] ? ret_from_fork+0x7c/0xb0
> [  100.244006]  [] ? __kthread_parkme+0x5d/0x5d

It's stuck on a spin lock, but I don't know what function it's
in because the stack trace is indeterminate (i.e. need frame
pointers enabled). It might be the AIL lock (as it's the xfsaild),
but I can't tell.

> [  100.436010] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:537]
...
> [  100.436010] Workqueue: xfs-reclaim/dm-3 xfs_reclaim_worker [xfs]
> [  100.436010] RIP: 0010:[]  [] 
> do_raw_spin_lock+0x13/0x23
> [  100.436010] Call Trace:
> [  100.436010]  [] ? xfs_iflush_abort+0x35/0x9a [xfs]
> [  100.436010]  [] ? xfs_reclaim_inode+0x85/0x246 [xfs]
> [  100.436010]  [] ? xfs_reclaim_inodes_ag+0x147/0x1fc [xfs]
> [  100.436010]  [] ? try_to_wake_up+0x190/0x190
> [  100.436010]  [] ? __wake_up_common+0x42/0x78
> [  100.436010]  [] ? fold_diff+0x22/0x2e
> [  100.436010]  [] ? lock_timer_base.isra.35+0x23/0x48
> [  100.436010]  [] ? internal_add_timer+0xd/0x28
> [  100.436010]  [] ? __mod_timer+0xfa/0x10c
> [  100.436010]  [] ? xfs_reclaim_inodes+0x16/0x1b [xfs]
> [  100.436010]  [] ? xfs_reclaim_worker+0x15/0x1e [xfs]
> [  100.436010]  [] ? process_one_work+0x191/0x294
> [  100.436010]  [] ? worker_thread+0x121/0x1e7
> [  100.436010]  [] ? rescuer_thread+0x269/0x269
> [  100.436010]  [] ? kthread+0x81/0x89
> [  100.436010]  [] ? __kthread_parkme+0x5d/0x5d
> [  100.436010]  [] ? ret_from_fork+0x7c/0xb0
> [  100.436010]  [] ? __kthread_parkme+0x5d/0x5d

Also stuck on a spin lock, but again it is not obvious what function
it is in and hence what spinlock is affected. xfs_iflush_abort()
does take the AIL lock, so that might be it.

> [  100.628005] BUG: soft lockup - CPU#6 stuck for 22s! [script:3151]
> [  100.628005] RIP: 0010:[]  [] 
> do_raw_spin_lock+0x16/0x23
> [  100.628005] Call Trace:
> [  100.628005]  [] ? xfs_ail_push_all+0x13/0x4f [xfs]
> [  100.628005]  [] ? xfs_reclaim_inodes_nr+0x1a/0x34 [xfs]
> [  100.628005]  [] ? super_cache_scan+0x121/0x13e
> [  100.628005]  [] ? shrink_slab+0x1e3/0x2f9
> [  100.628005]  [] ? iput+0x34/0x13d
> [  100.628005]  [] ? do_coredump+0xbc3/0xbc3
> [  100.628005]  [] ? drop_caches_sysctl_handler+0x65/0x76
> [  100.628005]  [] ? proc_sys_call_handler+0x98/0xbf
> [  100.628005]  [] ? vfs_write+0x9e/0x104
> [  100.628005]  [] ? SyS_write+0x51/0x85
> [  100.628005]  [] ? system_call_fastpath+0x16/0x1b

That can only be stuck on the AIL spin lock.

So, I've just audited all the uses of the AIL lock, and I cannot
find an unbalanced user of the AIL lock. If we've leaked the spin
lock, it's not an obvious or easy to trigger bug. Can you turn on
lockdep as well as CONFIG_XFS_DEBUG and see what warnings that
throws?
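
For reference, the debug options in play here (an illustrative
.config fragment - exact dependency selections vary by kernel
version):

    CONFIG_FRAME_POINTER=y      # reliable stack traces, as noted earlier
    CONFIG_PROVE_LOCKING=y      # lockdep: reports unbalanced lock usage
    CONFIG_DEBUG_SPINLOCK=y     # catches spinlock misuse and corruption
    CONFIG_XFS_DEBUG=y          # XFS internal consistency checks (ASSERTs)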

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-20 Thread Dave Chinner
...under heavy load just
using the godown utility in xfstests:

[1044535.986040] XFS (vdc): Mounting Filesystem
[1044536.040630] XFS (vdc): Ending clean mount
[1044536.059299] XFS: Clearing xfsstats
# src/xfstests-dev/src/godown -v /mnt/scratch
Opening "/mnt/scratch"
Calling XFS_IOC_GOINGDOWN
[1044553.342767] XFS (vdc): metadata I/O error: block 0x19000c2e98 
("xlog_iodone") error 5 numblks 512
[1044553.345993] XFS (vdc): xfs_do_force_shutdown(0x2) called from line 1171 of 
file fs/xfs/xfs_log.c.  Return address = 0x814e61ad
#[1044554.136965] XFS (vdc): xfs_log_force: error 5 returned.
[1044554.154892] XFS: Clearing xfsstats
[1044566.108484] XFS (vdc): xfs_log_force: error 5 returned.
[1044596.188515] XFS (vdc): xfs_log_force: error 5 returned.
[1044626.268166] XFS (vdc): xfs_log_force: error 5 returned.
[1044656.348146] XFS (vdc): xfs_log_force: error 5 returned.

IOWs, this looks like something is corrupting the state of the log
*before* the xlog_iodone() callback is being run. i.e. it is in a
valid state before IO dispatch and it's in a corrupt state on IO
completion despite XFS having *never touched that state* during the
IO.

Indeed, this is a different issue to the one you posted
earlier, which was that the AIL lock (which is in a different XFS
structure) had not been released. Again, I couldn't see how that
could occur, and we're now seeing a pattern of random structures
being corrupted. Hmmm, looking at pahole:

atomic_t                ic_refcnt;      /*   192     4 */

spinlock_t              xa_lock;        /*    64     2 */

Both the items that were corrupted are on the first word of a
cacheline - with 64-byte cachelines, offsets 64 and 192 (3 x 64) both
sit exactly on a cacheline boundary. That further points to memory
corruption of some kind...
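
(Output like the above comes from running pahole against the XFS
module - illustrative invocations, assuming a module build:)

    $ pahole -C xlog_in_core fs/xfs/xfs.ko | grep ic_refcnt
    $ pahole -C xfs_ail fs/xfs/xfs.ko | grep xa_lock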

IOWs, the more I look, the more this looks like memory corruption is
the cause of the problems. The only unusual thing that is happening
is that an error handling path in a brand new block device driver is
being run immediately before the memory corruption causes problems,
and that tends to indicate that this corruption is not caused by
XFS...

> My Idea:
> - If something problematic happens in the underlying devices,
>   the writeboost device returns -EIO on any requests and
>   stops all the daemons

> - I will not remove `blockup`.
>   Yes, if the problem is in hardware it is very difficult
>   to recover.
>   However, leaving a chance of recovering the device is
>   at least better than just shutting down the device,
>   as long as it doesn't pollute the code too much.
>   So, I will leave this option.

It doesn't matter whether the underlying hardware might be able to
recover - once you've sent EIOs during IO completion, that
data/metadata is considered *unrecoverable* and so the filesystem
will end up inconsistent or the *user will lose data* as a result.

IOWs, your idea is flawed, will not work and will result in
data/filesystem loss when used. You cannot call this a "recovery
mechanism" because it simply isn't.

> BTW, what do you mean by the word "fatal"?

"He suffered a fatal gunshot wound to the head."

i.e. the person is *dead* and life is *unrecoverable*.

That's what you are doing here - returning EIOs to IOs in progress
is effectively shooting the filesystem (and user data) in the
head.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-25 Thread Dave Chinner
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them,
> at worst, within the `barrier_deadline_ms` period.

That rings alarm bells.

If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
delaying them to allow other IOs to be submitted and dispatched may
very well violate the IO ordering constraints the filesystem is
trying to achieve.

Alternatively, delaying them will stall the filesystem because it's
waiting for said REQ_FUA IO to complete. For example, journal writes
in XFS are extremely IO latency sensitive in workloads that have a
significant number of ordering constraints (e.g. O_SYNC writes,
fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
filesystem for the majority of that barrier_deadline_ms.

i.e. this says to me that the best performance you can get from such
workloads is one synchronous operation per process per
barrier_deadline_ms, even when the storage and filesystem might be
capable of executing hundreds of synchronous operations per
barrier_deadline_ms...
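
To put rough numbers on that (barrier_deadline_ms = 10 is purely an
illustrative value, not one from the patch): a single O_SYNC writer
is capped at 1000ms / 10ms = 100 synchronous writes per second,
while a device with a non-volatile write cache completing FUA writes
in ~0.1ms could sustain on the order of 10,000 per second - two
orders of magnitude left on the table.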

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com