btrstress caused kernel oops after 8-ish days.

2010-04-27 Thread Sean Reifschneider
I ported my zfsstress program over to btrfs, and started running it on
a test machine a few weeks ago.  See here for more information and a link
to the program:

   http://www.tummy.com/journals/entries/jafo_20100418_124309

It looks like after around 8 days of running, there were some issues, as
shown in dmesg (below).

The system is a 64-bit Atom 330 with 2GB RAM, and a single 250GB hard
drive.  btrfs has 200GB of that.  The OS is Fedora 13 Beta with kernel
2.6.33.1-24.fc13.x86_64.

I had started btrstress and let it run a day or so.  Then I went in and
deleted the subvolume that btrstress puts everything into, then started it
again.  A few days later, I did the same.  I also tried turning on
compression with mount -o remount,compress /data.  Around 6 hours later,
it looks like btrstress was no longer working.

The primary issue seems to be that file deletions aren't freeing up space.
btrstress will fill the file-system up, but disables any write operations
if the df output shows more than 95% full.  So normally it would clear up
some snapshots or files until it gets back down to 95% or less, and start
doing writes again.
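
As a rough sketch of that gating logic (this is not btrstress's actual
code; the function names and the round-up rule are assumptions chosen to
mimic the Use% column that df prints):

```python
import math
import shutil

WRITE_CUTOFF_PERCENT = 95  # threshold quoted in the paragraph above


def use_percent(used, avail):
    """Approximate df's Use%: used / (used + avail), rounded up."""
    denom = used + avail
    if denom == 0:
        return 0
    return math.ceil(100 * used / denom)


def writes_allowed(used, avail, cutoff=WRITE_CUTOFF_PERCENT):
    """Return True while the filesystem is at or below the cutoff."""
    return use_percent(used, avail) <= cutoff


def writes_allowed_on(path, cutoff=WRITE_CUTOFF_PERCENT):
    """Same check against a mounted path, using shutil.disk_usage()."""
    usage = shutil.disk_usage(path)
    return writes_allowed(usage.used, usage.free, cutoff)
```

With 189G used and 9.9G available, use_percent() gives 96, so a check
like this keeps writes disabled until enough space is actually freed.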

However, after the oops, it looks like removals of files and snapshots
continued to succeed, but df is no longer reflecting the freed space.  For
example:

   [r...@btrtest btrstress-lZ6C7txz3n]# df -h
   Filesystem            Size  Used Avail Use% Mounted on
   /dev/sda1              29G   13G   16G  45% /
   tmpfs                 991M     0  991M   0% /dev/shm
   /dev/sda4             200G  189G  9.9G  96% /data
   [r...@btrtest btrstress-lZ6C7txz3n]# find /data
   /data
   /data/btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# btrfs subvolume list /data
   ID 28423 top level 5 path btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# du -sh /data
   4.0K    /data
   [r...@btrtest btrstress-lZ6C7txz3n]#
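
The mismatch between df and du above can be put into numbers with a small
script.  This is only a sketch: the helper names are made up for
illustration, and it sums apparent file sizes rather than allocated
blocks, so it will not match du exactly.

```python
import os


def regular_file_bytes(root):
    """Sum st_size of regular files under root (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            # S_IFREG check without importing stat: regular files only
            if (st.st_mode & 0o170000) == 0o100000:
                total += st.st_size
    return total


def filesystem_used_bytes(path):
    """Bytes the filesystem reports as used, from statvfs()."""
    vfs = os.statvfs(path)
    return (vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize


def unaccounted_bytes(used_bytes, tree_bytes):
    """Space reported as used that no visible file explains."""
    return used_bytes - tree_bytes
```

Calling unaccounted_bytes(filesystem_used_bytes("/data"),
regular_file_bytes("/data")) on the test system would be expected to come
out at roughly the 189G that df reports, since the tree is nearly empty.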

I've left the test system as it is; let me know if there's anything you'd
like me to try on the system before I wipe it and start again.

Also, let me know if this sort of report helps.

Note that after enabling compression, but before the oops, dmesg reported a
bunch of messages like:

   btrfs: relocating block group 11840520192 flags 1
   btrfs: relocating block group 10766778368 flags 1
   btrfs: relocating block group 9693036544 flags 1
   btrfs: relocating block group 8619294720 flags 1
   btrfs: relocating block group 7545552896 flags 1
   btrfs: relocating block group 6471811072 flags 1

Note that the block group numbers started at 212630241280 and decreased by
around a billion for every line (exactly 1073741824, i.e. 1 GiB, between
the lines shown above).
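
The spacing can be checked directly from the numbers quoted above.  A
short sketch (the variable names are only for illustration):

```python
GIB = 1024 ** 3  # 1073741824 bytes

# Block group numbers from the dmesg lines quoted above, in order.
offsets = [
    11840520192,
    10766778368,
    9693036544,
    8619294720,
    7545552896,
    6471811072,
]
first_reported = 212630241280  # first block group number mentioned

# Difference between each line and the next one.
steps = [a - b for a, b in zip(offsets, offsets[1:])]

# Number of 1 GiB steps between the first reported group and the first
# line quoted, and whether that distance is a whole number of steps.
steps_from_first, remainder = divmod(first_reported - offsets[0], GIB)

print(steps, steps_from_first, remainder)
```

Every step comes out as exactly 1 GiB, and the first reported number sits
a whole number of 1 GiB steps above the first line quoted.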

dmesg output of oops below.

BUG: unable to handle kernel NULL pointer dereference at 0075
IP: [810e380f] page_cache_sync_readahead+0x15/0x3a
PGD 7a937067 PUD 3310c067 PMD 0
Oops:  [#1] SMP
last sysfs file: /sys/devices/pci:00/:00:1e.0/:04:00.1/irq
CPU 0
Pid: 30242, comm: btrfs Not tainted 2.6.33.1-24.fc13.x86_64 #1 D945GCLF2/
RIP: 0010:[810e380f]  [810e380f] page_cache_sync_readahead+0x15/0x3a
RSP: 0018:88003309fac8  EFLAGS: 00010206
RAX:  RBX: 880046476940 RCX: 
RDX:  RSI: 88007ac840d0 RDI: 880046476b70
RBP: 88003309fac8 R08: 3f6a R09: 0246
R10: 88003309f8d8 R11:  R12: 880077422968
R13:  R14: 880046476608 R15: 
FS:  7f893574d740() GS:880004a0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0075 CR3: 33004000 CR4: 06f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process btrfs (pid: 30242, threadinfo 88003309e000, task 8800777a8000)
Stack:
 88003309fb68 a0364899 88003309fae8 000181c1
0 880046476a30 880046476608 88003309fb28 3f69
0  88007ac840d0 3f6a 000181c0
Call Trace:
 [a0364899] relocate_file_extent_cluster+0x18f/0x399 [btrfs]
 [a0364b46] relocate_data_extent+0xa3/0xbb [btrfs]
 [a0364e1a] relocate_block_group+0x2bc/0x384 [btrfs]
 [a036506f] btrfs_relocate_block_group+0x18d/0x312 [btrfs]
 [a034dfe7] btrfs_relocate_chunk+0x6c/0x4c2 [btrfs]
 [a033e051] ? btrfs_item_offset+0xbb/0xcb [btrfs]
 [a034c81b] ? btrfs_item_key_to_cpu+0x2a/0x46 [btrfs]
 [a034ea24] btrfs_balance+0x1ce/0x21b [btrfs]
 [811f02b0] ? inode_has_perm+0xaa/0xce
 [a0355cec] btrfs_ioctl+0x6f9/0x871 [btrfs]
 [81071226] ? sched_clock_cpu+0xc3/0xce
 [8107ba94] ? trace_hardirqs_off+0xd/0xf
 [81071274] ? cpu_clock+0x43/0x5e
 [8112c054] vfs_ioctl+0x32/0xa6
 [8112c5d4] do_vfs_ioctl+0x490/0x4d6
 [8112c670] sys_ioctl+0x56/0x79
 [81009c72] system_call_fastpath+0x16/0x1b
Code: 47 48 48 85 c0 74 04 31 f6 ff d0 48 83 c4 28 

Re: btrstress caused kernel oops after 8-ish days.

2010-04-27 Thread Sean Reifschneider
On 04/27/2010 05:46 AM, Chris Mason wrote:
> This oops is fixed in later kernels, and it's why things stopped.

Thanks for the reply.  I'm not sure I have the time to spend on following
the trunk kernel right now.  If the btrfs project
doesn't have test machines that could be set up for longer-term testing of
something like btrstress, let me know and I'll look at it when I have some
more time in the future.

Thanks,
Sean
-- 
Sean Reifschneider, Member of Technical Staff j...@tummy.com
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability


