Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-20 Thread Rich Freeman
On Sat, Mar 15, 2014 at 7:51 AM, Duncan 1i5t5.dun...@cox.net wrote:
 1) Does running the snapper cleanup command from that cron job manually
 trigger the problem as well?

As you can imagine, I'm not too keen to trigger this often.  But yes, I
just gave it a shot on my SSD, and cleaning a few days' worth of
timeline snapshots triggered a panic.

 2) What about modifying the cron job to run hourly, or perhaps every six
 hours, so it's deleting only 2 or 12 instead of 48 at a time?  Does that
 help?

 If so then it's a thundering herd problem.  While definitely still a bug,
 you'll at least have a workaround until it's fixed.

Definitely looks like a thundering herd problem.

I stopped the cron jobs (including the creation of snapshots based on
your later warning).  However, I am deleting my snapshots one at a time at a
rate of one every 5-30 minutes, and while that is creating
surprisingly high disk loads on my ssd and hard drives, I don't get
any panics.  I figured that having only one deletion pending per
checkpoint would eliminate locking risk.
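
For reference, the deletion side is essentially a loop like this (a
sketch; the path is a placeholder for my actual snapper layout):

  # delete snapshots one at a time, sleeping in between so that only
  # one deletion is pending per commit
  for s in /pool/.snapshots/*/snapshot; do
      btrfs subvolume delete "$s"
      sleep 300
  done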

I did get some blocked task messages in dmesg, like:
[105538.121239] INFO: task mysqld:3006 blocked for more than 120 seconds.
[105538.121251]   Not tainted 3.13.6-gentoo #1
[105538.121262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[105538.121262] mysqld  D 880395f63e80  3432  3006  1 0x
[105538.121273]  88028b623d38 0086 88028b623dc8 81c10440
[105538.121283]  0200 88028b623fd8 880395f63b80 00012c40
[105538.121291]  00012c40 880395f63b80 532b7877 880410e7e578
[105538.121299] Call Trace:
[105538.121316]  [81623d73] schedule+0x6a/0x6c
[105538.121327]  [81623f52] schedule_preempt_disabled+0x9/0xb
[105538.121337]  [816251af] __mutex_lock_slowpath+0x155/0x1af
[105538.121347]  [812b9db0] ? radix_tree_tag_set+0x71/0xd4
[105538.121356]  [81625225] mutex_lock+0x1c/0x2e
[105538.121365]  [8123c168] btrfs_log_inode_parent+0x161/0x308
[105538.121373]  [8162466d] ? mutex_unlock+0x11/0x13
[105538.121382]  [8123cd37] btrfs_log_dentry_safe+0x39/0x52
[105538.121390]  [8121a0c9] btrfs_sync_file+0x1bc/0x280
[105538.121401]  [811339a3] vfs_fsync_range+0x13/0x1d
[105538.121409]  [811339c4] vfs_fsync+0x17/0x19
[105538.121416]  [81133c3c] do_fsync+0x30/0x55
[105538.121423]  [81133e40] SyS_fsync+0xb/0xf
[105538.121432]  [8162c2e2] system_call_fastpath+0x16/0x1b

I suspect that this may not be terribly helpful - it probably reflects
tasks waiting for a lock rather than whatever is holding it.  It was
more of a problem when I was trying to delete a snapshot per minute on
my ssd, or one every 5 min on hdd.
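
For what it's worth, a sysrq-t dump would show all task stacks,
including whatever is actually holding the lock, not just the waiters
(a sketch, assuming sysrq is enabled):

  echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it isn't already
  echo t > /proc/sysrq-trigger      # dump all task stacks to dmesg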

Rich


Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-20 Thread Duncan
Rich Freeman posted on Thu, 20 Mar 2014 22:13:51 -0400 as excerpted:

 However, I am deleting my snapshots one at a time at a rate of one every 5-30
 minutes, and while that is creating surprisingly high disk loads on my
 ssd and hard drives, I don't get any panics.  I figured that having only
 one deletion pending per checkpoint would eliminate locking risk.
 
 I did get some blocked task messages in dmesg, like:
 [105538.121239] INFO: task mysqld:3006 blocked for more than 120
 seconds.

These... are a continuing issue.  The devs are working on it, but...

The people that seem to have it the worst are doing scripted 
snapshotting of large (gig+) constantly internally-rewritten files such 
as VM images (the most commonly reported case) or databases.  Properly 
setting NOCOW on the files[1] helps, but...

* The key thing to realize about snapshotting continually rewritten NOCOW 
files is that the first change to a block after a snapshot by definition 
MUST be COWed anyway, since the file content has changed from that of the 
snapshot.  Further writes to the same block (until the next snapshot) 
will be rewritten in-place (the existing NOCOW attribute is maintained 
thru that mandatory COW), but next snapshot and following write, BAM! 
gotta COW again!
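
You can actually watch this happen with filefrag.  A rough 
illustration, assuming /mnt/data is a btrfs subvolume (paths and sizes 
are examples only):

  touch /mnt/data/vm.img
  chattr +C /mnt/data/vm.img          # NOCOW, set while still zero-size
  dd if=/dev/zero of=/mnt/data/vm.img bs=1M count=100
  filefrag /mnt/data/vm.img           # few extents
  btrfs subvolume snapshot /mnt/data /mnt/snap1
  dd if=/dev/urandom of=/mnt/data/vm.img bs=4k count=10 seek=50 conv=notrunc
  filefrag /mnt/data/vm.img           # more extents: the first write to
                                      # each block after the snapshot got
                                      # COWed despite the NOCOW attribute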

So while NOCOW helps, in scenarios such as hourly snapshotting of active 
VM-image data loads its ability to control actual fragmentation is 
unfortunately rather limited.  And it's precisely this fragmentation that 
appears to be the problem! =:^(

It's almost certainly that fragmentation that's triggering your 
blocked-for-X-seconds issues.  But the interesting thing here is the reports even 
from people with fast SSDs where seek-time and even IOPs shouldn't be a 
huge issue.  In at least some cases, the problem has been CPU time, not 
physical media access.

Which is one reason the snapshot-aware-defrag was disabled again 
recently, because it simply wasn't scaling.  (To answer the question, 
yes, defrag still works; it's only the snapshot-awareness that was 
disabled.  Defrag is back to dumbly ignoring other snapshots and simply 
defragging the working file-extent-mapping the defrag is being run on, 
with other snapshots staying untouched.)  They're reworking the whole 
feature now in order to scale better.

But while that considerably reduces the pain point (people were seeing 
little or no defrag/balance/restripe progress in /hours/ if they had 
enough snapshots, and that problem has been bypassed for the moment), 
we're still left with these nasty N-second stalls at times, especially 
when doing anything else involving those snapshots and the 
corresponding fragmentation they cover, including deleting them.  
Hopefully tweaking and eventually optimizing the algorithms can do away 
with much of this problem, but I've a feeling it'll be around to some 
degree for some years.

Meanwhile, for data that fits that known problematic profile, the current 
recommendation is, preferably, to isolate it to a subvolume that has only 
very limited or no snapshotting done.
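
A sketch of that isolation (names are examples only; the point is that 
snapper or whatever does the snapshotting simply isn't configured for 
this subvolume, and snapshots of the parent won't descend into a nested 
subvolume anyway):

  btrfs subvolume create /var/lib/images
  chattr +C /var/lib/images     # new files created here inherit NOCOW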

The other alternative, of course, is to just stick that data on an 
entirely different filesystem: either btrfs with the nocow mount 
option, or arguably something a bit more traditional and mature such as 
ext4 or xfs.  NOCOW already turns off many of the features a lot of 
people are using btrfs for in the first place (checksumming and 
compression are disabled with NOCOW as well, tho it turns out they're 
not so well suited to VM images anyway), and given the subvolume 
isolation already suggested, a separate filesystem isn't a big step 
further.  xfs in particular is actually targeted at large to huge file 
use-cases, so multi-gig VM images should be an ideal fit.  Of course 
you lose the benefits of btrfs doing that, but given its COW nature, 
btrfs arguably isn't the ideal solution for such huge internal-write 
files in the first place.  Even when fully mature it will likely only 
have /acceptable/ performance with them, as befits a general-purpose 
filesystem, with xfs or similar still likely being a better dedicated 
filesystem for such use-cases.

Meanwhile, I think everyone agrees that getting that locking down to 
avoid the deadlocks, etc, really must be priority one, at least now that 
the huge scaling blocker of snapshot-aware-defrag is (hopefully 
temporarily) disabled.  Blocking for a couple minutes at a time certainly 
isn't ideal, but since the triggering jobs such as snapshot deletion, 
etc, can be rescheduled to otherwise idle time, that's certainly less 
critical than crashes if people accidentally or in ignorance queue up too 
many snapshot deletions at a time!

---
[1] NOCOW: chattr +C .  With btrfs, this should be done while the file is 
zero-size, before it has content.  The easiest way to do that is to 
create a dedicated directory for these files and set the attribute on the 
directory, such that the files inherit it at file creation.
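
For example (a sketch; the directory name is arbitrary):

  mkdir /data/vm-images
  chattr +C /data/vm-images          # NOCOW on the directory
  lsattr -d /data/vm-images          # the C attribute should be listed
  touch /data/vm-images/disk.img
  lsattr /data/vm-images/disk.img    # the new file inherited C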

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-17 Thread Josef Bacik

On 03/14/2014 06:40 PM, Rich Freeman wrote:

On Wed, Mar 12, 2014 at 12:34 PM, Rich Freeman
r-bt...@thefreemanclan.net wrote:

On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik jba...@fb.com wrote:

On 03/12/2014 08:56 AM, Rich Freeman wrote:


  After a number of reboots the system became stable, presumably
whatever race condition btrfs was hitting followed a favorable
path.

I do have a 2GB btrfs-image pre-dating my application of this
patch that was causing the issue last week.



Uhm wow that's pretty epic.  I will talk to chris and figure out how
we want to deal with that and send you a patch shortly.  Thanks,


A tiny bit more background.


And some more background.  I had more reboots over the next two days
at the same time each day, just after my crontab successfully
completed.  One of the last things it does is run the snapper cleanups
which delete a bunch of snapshots.  During a reboot I checked and
there were a bunch of deleted snapshots, which disappeared over the
next 30-60 seconds before the panic, and then they would re-appear on
the next reboot.

I disabled the snapper cron job and this morning had no issues at all.
  One day isn't much to establish a trend, but I suspect that this is
the cause.  Obviously getting rid of snapshots would be desirable at
some point, but I can wait for a patch.  Snapper would be deleting
about 48 snapshots at the same time, since I create them hourly and
the cleanup occurs daily on two different subvolumes on the same
filesystem.


Ok, that's helpful.  I'm no longer positive I know what's causing 
this; I'll try to reproduce once I've nailed down these backref 
problems and balance corruption.  Thanks,


Josef



Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-15 Thread Duncan
Rich Freeman posted on Fri, 14 Mar 2014 18:40:25 -0400 as excerpted:

 And some more background.  I had more reboots over the next two days at
 the same time each day, just after my crontab successfully completed. 
 One of the last things it does is run the snapper cleanups which delete
 a bunch of snapshots.  During a reboot I checked and there were a bunch
 of deleted snapshots, which disappeared over the next 30-60 seconds
 before the panic, and then they would re-appear on the next reboot.
 
 I disabled the snapper cron job and this morning had no issues at all.
  One day isn't much to establish a trend, but I suspect that this is
 the cause.  Obviously getting rid of snapshots would be desirable at
 some point, but I can wait for a patch.  Snapper would be deleting about
 48 snapshots at the same time, since I create them hourly and the
 cleanup occurs daily on two different subvolumes on the same filesystem.

Hi, Rich.  Imagine seeing you here! =:^)  (Note to others, I run gentoo 
and he's a gentoo dev, so we normally see each other on the gentoo 
lists.  But btrfs comes up occasionally there too, so we knew we were 
both running it, I'd just not noticed any of his posts here, previously.)

Three things:

1) Does running the snapper cleanup command from that cron job manually 
trigger the problem as well?

Presumably if you run it manually, you'll do so at a different time of 
day, thus eliminating the possibility that it's a combination of that and 
something else occurring at that specific time, as well as confirming 
that it is indeed the snapper cleanup that's triggering it.

2) What about modifying the cron job to run hourly, or perhaps every six 
hours, so it's deleting only 2 or 12 instead of 48 at a time?  Does that 
help?

If so then it's a thundering herd problem.  While definitely still a bug, 
you'll at least have a workaround until it's fixed.

3) I'd be wary of letting too many snapshots build up.  A couple hundred 
shouldn't be a huge issue, but particularly when the snapshot-aware-
defrag was still enabled, people were reporting problems with thousands 
of snapshots, so I'd recommend trying to keep it under 500 or so, at 
least of the same subvol (so under 1000 total since you're snapshotting 
two different subvols).

So an hourly cron job deleting or at least thinning down snapshots 
over, say, 2 days old, possibly in the same cron job that creates the 
new snaps, might be a good idea.  That'd only delete two at a time, 
the same rate they're created at, while still keeping a 48-hour set of 
snaps around before deletion.
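
A sketch of what that could look like in a crontab (the snapper config 
names are placeholders for yours; retention itself is governed by the 
TIMELINE_LIMIT_* values in each config):

  # run the timeline cleanup hourly instead of daily, so only a couple
  # of snapshots get deleted per run
  0 * * * *  snapper -c root cleanup timeline
  5 * * * *  snapper -c home cleanup timeline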

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-14 Thread Rich Freeman
On Wed, Mar 12, 2014 at 12:34 PM, Rich Freeman
r-bt...@thefreemanclan.net wrote:
 On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik jba...@fb.com wrote:
 On 03/12/2014 08:56 AM, Rich Freeman wrote:

  After a number of reboots the system became stable, presumably
 whatever race condition btrfs was hitting followed a favorable
 path.

 I do have a 2GB btrfs-image pre-dating my application of this
 patch that was causing the issue last week.


 Uhm wow that's pretty epic.  I will talk to chris and figure out how
 we want to deal with that and send you a patch shortly.  Thanks,

 A tiny bit more background.

And some more background.  I had more reboots over the next two days
at the same time each day, just after my crontab successfully
completed.  One of the last things it does is run the snapper cleanups
which delete a bunch of snapshots.  During a reboot I checked and
there were a bunch of deleted snapshots, which disappeared over the
next 30-60 seconds before the panic, and then they would re-appear on
the next reboot.

I disabled the snapper cron job and this morning had no issues at all.
 One day isn't much to establish a trend, but I suspect that this is
the cause.  Obviously getting rid of snapshots would be desirable at
some point, but I can wait for a patch.  Snapper would be deleting
about 48 snapshots at the same time, since I create them hourly and
the cleanup occurs daily on two different subvolumes on the same
filesystem.

Rich


Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-12 Thread Rich Freeman
On Thu, Mar 6, 2014 at 7:25 PM, Zach Brown z...@redhat.com wrote:
 On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
 Zach found this deadlock that would happen like this


 And this fixes it.  It's run through a few times successfully.

I'm not sure if my issue is related to this or not - happy to start a
new thread if not.  I applied this patch as I was running into lockups,
but I am still having them.

See: http://picpaste.com/IMG_20140312_072458-KPH35pQ6.jpg

After a number of reboots the system became stable, presumably
whatever race condition btrfs was hitting followed a favorable path.

I do have a 2GB btrfs-image pre-dating my application of this patch
that was causing the issue last week.

Rich


Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-12 Thread Josef Bacik

On 03/12/2014 08:56 AM, Rich Freeman wrote:
 On Thu, Mar 6, 2014 at 7:25 PM, Zach Brown z...@redhat.com wrote:
 On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
 Zach found this deadlock that would happen like this
 
 
 And this fixes it.  It's run through a few times successfully.
 
 I'm not sure if my issue is related to this or not - happy to start
 a new thread if not.  I applied this patch as I was running into
 lockups, but I am still having them.
 
 See: http://picpaste.com/IMG_20140312_072458-KPH35pQ6.jpg

  After a number of reboots the system became stable, presumably 
 whatever race condition btrfs was hitting followed a favorable
 path.
 
 I do have a 2GB btrfs-image pre-dating my application of this
 patch that was causing the issue last week.
 

Uhm wow that's pretty epic.  I will talk to chris and figure out how
we want to deal with that and send you a patch shortly.  Thanks,

Josef



Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-12 Thread Rich Freeman
On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik jba...@fb.com wrote:
 On 03/12/2014 08:56 AM, Rich Freeman wrote:

  After a number of reboots the system became stable, presumably
 whatever race condition btrfs was hitting followed a favorable
 path.

 I do have a 2GB btrfs-image pre-dating my application of this
 patch that was causing the issue last week.


 Uhm wow that's pretty epic.  I will talk to chris and figure out how
 we want to deal with that and send you a patch shortly.  Thanks,

If you need any info from me at all beyond the capture let me know.

A tiny bit more background.  The system would boot normally, but panic
after about 30-90 seconds (usually long enough to log into KDE,
perhaps even fire up a browser/etc).  In single-user mode I could
mount the filesystem read-only without issue.  If I mounted it
read-write (in recovery mode or normally) I'd get the panic after
about 30-60 seconds.  On one occasion it seemed stable, but panicked
when I unmounted it.

I have to say that I'm impressed that it recovers at all.  I'd rather
have the file system not write anything if it isn't sure it can
write it correctly, and that seems to be the effect here.  Just about
all the issues I've run into with btrfs have tended to be lockup/etc
type issues, and not silent corruption.

Rich
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix deadlock with nested trans handles

2014-03-06 Thread Zach Brown
On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
 Zach found this deadlock that would happen like this
 
 btrfs_end_transaction <- reduce trans->use_count to 0
   btrfs_run_delayed_refs
     btrfs_cow_block
       find_free_extent
         btrfs_start_transaction <- increase trans->use_count to 1
         allocate chunk
         btrfs_end_transaction <- decrease trans->use_count to 0
           btrfs_run_delayed_refs
             lock tree block we are cowing above ^^

Indeed, I stumbled across this while trying to reproduce reported
problems with iozone.  This deadlock would consistently hit during
random 1k reads in a 2gig file.

 We need to only decrease trans->use_count if it is above 1, otherwise leave it
 alone.  This will make nested trans be the only ones who decrease their added
 ref, and will let us get rid of the trans->use_count++ hack if we have to
 commit the transaction.  Thanks,

And this fixes it.  It's run through a few times successfully.

 cc: sta...@vger.kernel.org
 Reported-by: Zach Brown z...@redhat.com
 Signed-off-by: Josef Bacik jba...@fb.com

Tested-by: Zach Brown z...@redhat.com

- z