On Mon, Oct 20, 2014 at 06:34:03PM +0200, David Sterba wrote:
> On Thu, Oct 16, 2014 at 01:33:37PM +0200, David Sterba wrote:
> > I'd like to make it default with the 3.17 release of btrfs-progs.
> > Please let me know if you have objections.
>
> For the record, 3.17 will not change the defaults. The timing of the
> poll was very bad to get enough feedback before the release. Let's keep
> it open for now.
I don't have hard data, but I do have disturbing soft data:

  12 btrfs filesystems with various mixed workloads
   4 of those w/skinny metadata (converted with btrfstune -x)
   3 of those have processes or the entire filesystem hanging every
     few days, triggering watchdog reboots

I'm still trying to find the smoking gun, but it looks like there's a
problem that only shows up when skinny metadata is enabled (or possibly
one that only shows up when both skinny and non-skinny metadata are
mixed?).

One thing that may be significant is _when_ those 3 hanging filesystems
are hanging: when using rsync to update local files. These machines are
using the traditional rsync copy-then-rename method rather than
--inplace updates (a sketch of both invocation styles is at the end of
this mail). There's no problem copying data into an empty directory
with rsync, but as soon as I start updating existing data, some process
(not necessarily rsync) using the filesystem gets stuck within 36
hours, and stays stuck for days. If I don't run rsync on the skinny
filesystems, they'll run for a week or more without incident--and if I
then start running rsync again, they hang later the same day.

When I get kernel stacks they show ~50 processes stuck all over the
btrfs metadata manipulation code. If someone wants to wade through
these I can collect them easily enough.

The 4th skinny-metadata machine--the one that doesn't hang often--is
the only one that isn't using rsync to receive files from elsewhere.
It's also the busiest filesystem (in IOPS) with the largest variety in
its workload, so all things being equal it should be encountering more
random btrfs problems than the other three.

Some of my machines have multiple filesystems, some with skinny
metadata and some without. I've tried moving the rsync destination
tree to the non-skinny filesystems on those machines, and in those
cases I was able to complete several rsync updates without incident.
That seems to rule out any system-level problem.

The 8 filesystems without skinny metadata don't have the hang problem.
They have had a variety of other issues, but not these hangs.
Currently 3.17 + stable-queue patches fixes all the problems I've
encountered so far with the non-skinny filesystems, so the skinny
filesystems are now earning most of my attention.

With this small sample size and data collection rate I admit I could
just have a spurious correlation. The data also supports conclusions
such as "Western Digital hard drives cause hangs" or "filesystems
created in August 2014 cause hangs." I'd encourage anyone with the
infrastructure set up to do a larger-scale test to see whether this
is--or is not--reproducible.
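To be concrete about the rsync workload, here is a minimal sketch of
the two update styles; the paths are hypothetical placeholders, the
real destination trees here are larger and more varied:

  # default rsync behaviour: each changed file is written to a
  # temporary file and then renamed over the old one (copy-then-rename)
  rsync -a --delete /mnt/source/ /mnt/skinny/dest/

  # alternative: rewrite changed files in place, no temporary + rename
  rsync -a --delete --inplace /mnt/source/ /mnt/skinny/dest/

The hangs described above were seen with the first (default) form.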