Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-08-05 Thread Peter Waller
My current interpretation of this problem is that it is a pathological condition caused by not rebalancing while being nearly out of space for allocating more metadata, which is why it is rarely seen by anyone else (most users rebalance regularly). See this thread for details
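To check whether a volume is close to the "nearly out of metadata space" condition described above, one could read the Metadata line of `btrfs filesystem df`. The sketch below assumes the `Kind, profile: total=X, used=Y` line format printed by the btrfs-progs versions I am familiar with (other versions may differ). It only measures how full the already-allocated metadata chunks are. Whether new metadata chunks can still be allocated depends on unallocated device space, which this output does not show.

```python
import re

# Assumed `btrfs filesystem df` line format, e.g.
#   "Metadata, DUP: total=4.00GiB, used=3.80GiB"
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[A-Za-z]+), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[A-Za-z]+)$"
)

# Binary unit suffixes used in the human-readable output.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40, "PiB": 2**50}


def to_bytes(value, unit):
    """Convert a value such as ("3.80", "GiB") to bytes."""
    return float(value) * UNITS[unit]


def metadata_usage(df_output):
    """Return used/total for the first Metadata line of `btrfs filesystem df` output."""
    for line in df_output.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("kind") == "Metadata":
            total = to_bytes(m.group("total"), m.group("tunit"))
            used = to_bytes(m.group("used"), m.group("uunit"))
            return used / total if total else 0.0
    raise ValueError("no Metadata line found")
```

For example, feeding it the text `Metadata, DUP: total=4.00GiB, used=3.80GiB` gives a usage of about 0.95. What threshold counts as "nearly out" is a judgment call, not something stated in the thread.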

Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-08-01 Thread Peter Waller
I've reproduced these issues on a single-core machine. It doesn't appear to become completely unresponsive even after 12 hours of copying (the other machines, by contrast, seem to deadlock after 5-10 minutes), but it does spend the vast majority of the time at 100% SYS CPU with no IO traffic. (In fact, c
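The "100% SYS CPU" symptom can be quantified by sampling the aggregate `cpu` line of `/proc/stat` twice and comparing the deltas. This is a minimal sketch, assuming the standard field order `user nice system idle iowait irq softirq steal`. Note that a low iowait share is only consistent with "no IO traffic"; it does not measure disk throughput directly (that would need `/proc/diskstats`).

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (first eight fields).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Fraction of elapsed CPU time spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    return {k: v / total for k, v in delta.items()} if total else {}


def sample_cpu_shares(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return per-state shares."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return cpu_shares(before, after)
```

On an affected machine one would expect `sample_cpu_shares()["system"]` to sit near 1.0 while `"iowait"` stays near zero.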

Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-07-31 Thread Peter Waller
I should add that I have reproduced this even after mounting with `mount -o clear_cache /dev/... /mnt/...`, unmounting, and remounting with `-o space_cache`. After the lockup and a reboot, warnings of this form appear:

> [ 117.288248] BTRFS warning (device dm-0): block group 694165700608 has wrong
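To see how many distinct block groups are flagged after a reboot, one could scan the kernel log for warnings like the one quoted above. This sketch matches only the visible prefix of the message (the rest of the line is cut off in the quote), and the function name `flagged_block_groups` is hypothetical.

```python
import re
from collections import defaultdict

# Matches only the prefix shown in the quoted warning; the message tail is not assumed.
WARN_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): block group (?P<bg>\d+) has wrong"
)


def flagged_block_groups(log_text):
    """Map each device name to the set of block group offsets named in such warnings."""
    found = defaultdict(set)
    for m in WARN_RE.finditer(log_text):
        found[m.group("dev")].add(int(m.group("bg")))
    return dict(found)
```

Using a set per device means a block group that is warned about repeatedly is counted once.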

Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-07-31 Thread Peter Waller
I've now reproduced this on 3.15.7-031507-generic and 3.16.0-031600rc7-generic, and I have a test case that reliably causes the crash after about 30 seconds of disk activity. It simply takes a directory tree of ~400GB of files and copies every file to a new one with .new on
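The copy workload described above could be approximated as follows. This is a sketch under an assumption: the post is cut off after ".new on", so appending `.new` to the end of each filename is a guess, and `copy_tree_to_new` is a hypothetical name. It is meant for a scratch volume, since the point of the workload is to trigger the lockup.

```python
import os
import shutil


def copy_tree_to_new(root, suffix=".new"):
    """Copy every regular file under `root` to a sibling file with `suffix` appended.

    Assumption: the suffix goes at the end of the filename. Symlinks and files
    that already end in the suffix are skipped. Returns the number of copies made.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                continue  # do not copy the copies from an earlier run
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            shutil.copyfile(src, src + suffix)
            copied += 1
    return copied
```

Because `os.walk` lists each directory before its files are processed, the `.new` files created during a pass are not picked up again in the same pass.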

Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-07-30 Thread Peter Waller
The crashes became more frequent: the time before lockup went from ~12 days to ~7 days, ~2 days, ~6 hours, and then ~1 hour. We then upgraded to 3.15.7-031507-generic on the advice of #ubuntu-kernel and #btrfs on IRC, and the machine has been stable for the 19.5 hours since. I dug around more in our logs and realised t

Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-07-29 Thread Peter Waller
Someone on IRC suggested that I clear the free space cache:

> sudo mount -o remount,clear_cache /path/to/dev /path/to/mount
> sudo mount -o remount,space_cache /path/to/dev /path/to/mount

The former command printed `btrfs: disk space caching is enabled` and the latter printed the same message, making me think that
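One way to see which options are actually listed for the mount after the two remount commands above is to read `/proc/mounts`. This is a minimal sketch; `mount_options` is a hypothetical helper, and whether `clear_cache` or `space_cache` shows up there after a remount depends on the kernel, so it only reports what the kernel lists.

```python
def mount_options(mounts_text, mountpoint):
    """Return the option set for `mountpoint` from /proc/mounts-style text.

    Each line is "device mountpoint fstype options dump pass". If the same
    mountpoint appears more than once, the last (topmost) entry wins.
    Octal escapes (used for spaces in paths) are not decoded here.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = set(fields[3].split(","))
    if options is None:
        raise KeyError(mountpoint)
    return options
```

Usage: `with open("/proc/mounts") as f: print(mount_options(f.read(), "/path/to/mount"))`.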

Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04

2014-07-29 Thread Peter Waller
Hi All,

I've reported a bug with Ubuntu here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711

The machine in question has one BTRFS volume, which is 87% full and lives on a Logical Volume Manager (LVM) block device on top of a single Amazon Elastic Block Store (EBS) device. We have other