Re: btrfs stability

2013-01-26 Thread Andrew McNabb
Here's an update.  I tried the new kernel, and I seem to be having some
new (possibly worse) problems.  In my ssh session, I'm seeing many errors
of this sort:

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[  308.223834] BUG: soft lockup - CPU#0 stuck for 23s! [btrfs-endio-wri:2073]

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[  308.248754] BUG: soft lockup - CPU#2 stuck for 23s! [btrfs-delalloc-:594]

In the logs, I'm seeing several warnings and bugs, including:

WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
BUG: unable to handle kernel NULL pointer dereference at (null)
BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]

Kernel logs (across a few reboots) are at:

http://students.cs.byu.edu/~amcnabb/messages2
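
For anyone skimming logs like these, here is a minimal sketch for
tallying the soft-lockup reports per kernel thread.  It is only an
illustration: the regular expression is an assumption based on the lines
quoted above, not a format guaranteed by the kernel, and the helper name
count_soft_lockups is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the messages quoted above, e.g.:
#   BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]:]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Return a Counter mapping task name to number of soft-lockup reports."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


# Sample input copied from the log excerpts in this message.
sample = [
    "BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]",
    "BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]",
    "WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()",
]

for task, n in count_soft_lockups(sample).items():
    print(task, n)
```

Any iterable of lines works, so an open file object for a locally saved
copy of the log can be passed in place of the sample list.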

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs stability

2013-01-25 Thread Andrew McNabb
I tried creating a multi-device btrfs filesystem for the first time (on
Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems.  I
had heard that btrfs is now reasonably stable, and though I expected to
possibly see a problem here or there, I was a little surprised at just
how many problems I encountered in such a short period of time.  I now
have about a thousand error messages in my kernel logs related to
several different problems.  Is this roughly the expected level of
stability for btrfs with multiple devices, or am I just particularly
lucky? :)

Am I correct in assuming that I'll need to switch to md for a few months
and try btrfs again later, or are there known problems in the specific
kernel I'm running that I could avoid by trying a different version?

For the sake of being specific, I'll detail a few of the problems I've
hit:

These two may have been caused by a possibly faulty disk (I'm still
trying to determine whether it was faulty or whether the bug was purely
in btrfs):

https://bugzilla.redhat.com/show_bug.cgi?id=903794
https://bugzilla.redhat.com/show_bug.cgi?id=904143

This one was triggered when I tried to remove a possibly faulty disk:

https://bugzilla.redhat.com/show_bug.cgi?id=904197

With a freshly created filesystem, I got a kernel bug, associated with a
hang in most filesystem operations.  This occurred in the middle of
ordinary operation and without any sort of hardware-related errors in
the kernel logs.

https://bugzilla.redhat.com/show_bug.cgi?id=904223

I've noticed that a lot of the reports in the Fedora bugzilla and kernel
bugzilla don't seem to include much discussion; is there any specific
type of information that bug submitters should try to include to make
the reports more helpful?  Thanks.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: btrfs stability

2013-01-25 Thread Andrew McNabb
On Fri, Jan 25, 2013 at 03:37:17PM -0500, Josef Bacik wrote:
  https://bugzilla.redhat.com/show_bug.cgi?id=903794
 
 This one is just an allocator warning because the relocator doesn't do the
 right accounting for relocation.  It's just complaining; we need to fix it,
 but it won't keep it from working.

I won't worry about this one, then.

  https://bugzilla.redhat.com/show_bug.cgi?id=904143
 
 This I'm almost certain (I have to check) was just a result of me making fsync
 faster and forgetting to remove this warn on.  It's fixed upstream.  Again,
 nothing to worry about, but annoying.

Sounds good.

  This one was triggered when I tried to remove a possibly faulty disk:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=904197
  
 
 OK, this is a bug; I can fix this.  Basically we tried to read from the faulty
 disk, it failed, we read from the other copy, and then tried to write the good
 copy back to the failed disk, and when we saw that the IO wasn't actually going
 to go to the bad disk we panicked.  Silly but easy enough to understand/fix.

I was a little surprised that this happened after I had already done a
btrfs dev delete--is there a way to tell btrfs that a disk really is
gone?

  With a freshly created filesystem, I got a kernel bug, associated with a
  hang in most filesystem operations.  This occurred in the middle of
  ordinary operation and without any sort of hardware-related errors in
  the kernel logs.
  
  https://bugzilla.redhat.com/show_bug.cgi?id=904223
  
 
 So this is from the fsync stuff, and I'm sure I fixed this somewhere but I
 can't account for where I did it.

Would this also be the cause of the hangs that I'm seeing?  In the end,
a hang with the load rising to 260.10 is the most serious problem.  It's
happened a few times, and it gets temporarily fixed by a reboot, but
then tends to recur fairly soon.

 Can you give btrfs-next a try and see if you can
 still reproduce it?  Thanks,

Is there a pre-built RPM for btrfs-next, or what's the best way to try
it out in Fedora without breaking other things?

Thanks for your quick response, and sorry for not responding sooner
(I've been interrupted by a few phone calls).

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: btrfs stability

2013-01-25 Thread Andrew McNabb
On Fri, Jan 25, 2013 at 03:53:22PM -0500, Josef Bacik wrote:
 
 Actually for this one, how did you remove the disk?  Did you just yank it out
 while the box was running?  Did you mount -o degraded and then delete the
 device and then remove it?  How exactly did you get to this situation?  Thanks,

I've moved my answer over to IRC to reduce the latency in the
conversation.  Thanks again for all the help.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868