Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote: > On 2018-03-14 14:39, Goffredo Baroncelli wrote: >> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote: >> [...] In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn
On 2018-03-14 14:39, Goffredo Baroncelli wrote: On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote: [...] In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote: [...] >> >> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a >> conventional filesystem (or a btrfs filesystem w/o datasum) there is no >> checksum, so this problem doesn't exist. >> >> I am curious how ZFS solves

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn
On 2018-03-13 15:36, Goffredo Baroncelli wrote: On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote: On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote: Unfortunately no, the likelihood might be 100%: there are some patterns which trigger this problem quite easily. See the link

Re: Ongoing Btrfs stability issues

2018-03-13 Thread Christoph Anton Mitterer
On Tue, 2018-03-13 at 20:36 +0100, Goffredo Baroncelli wrote: > A checksum mismatch is returned as -EIO by a read() syscall. This is > an event handled badly by most programs. Then these programs must simply be fixed... otherwise they'll also fail under normal circumstances with
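
The point about -EIO can be made concrete with a minimal sketch of a reader that handles the error explicitly instead of crashing or silently ignoring it. This is only an illustration: the helper name `read_block` and its return convention are assumptions made for this example, not code from the thread. From user space, EIO alone does not say whether the cause was a checksum mismatch or some other I/O failure, so the message stays deliberately vague.

```python
import errno
import os


def read_block(path, offset, length):
    """Read up to `length` bytes at `offset`.

    Returns (data, None) on success or (None, reason) if the kernel
    reported EIO. `read_block` is a hypothetical helper for illustration.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset), None
    except OSError as exc:
        if exc.errno == errno.EIO:
            # On btrfs with data checksums enabled, a checksum mismatch
            # surfaces here as EIO, like any other failed read.
            return None, "I/O error (possibly a checksum mismatch)"
        raise  # Other errors are not handled in this sketch.
    finally:
        os.close(fd)
```

A caller can then decide what to do with the affected range (retry, restore from backup, report) rather than treating the whole file as unreadable.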

Re: Ongoing Btrfs stability issues

2018-03-13 Thread Goffredo Baroncelli
On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote: > On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote: >> Unfortunately no, the likelihood might be 100%: there are some >> patterns which trigger this problem quite easily. See the link which >> I posted in my previous email. There

Re: Ongoing Btrfs stability issues

2018-03-13 Thread Patrik Lundquist
On 9 March 2018 at 20:05, Alex Adriaanse wrote: > > Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load > on these machines. Dump the databases and recreate them with --data-checksums and Btrfs No_COW attribute. You can add this to

Re: Ongoing Btrfs stability issues

2018-03-12 Thread Christoph Anton Mitterer
On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote: > Unfortunately no, the likelihood might be 100%: there are some > patterns which trigger this problem quite easily. See the link which > I posted in my previous email. There was a program which creates a > bad checksum (in COW+DATASUM

Re: Ongoing Btrfs stability issues

2018-03-12 Thread Goffredo Baroncelli
On 03/11/2018 11:37 PM, Christoph Anton Mitterer wrote: > On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote: >> >> COW is needed to properly checksum the data. Otherwise it is not >> possible to ensure the coherency between data and checksum (however I >> have to point out that BTRFS fails

Re: Ongoing Btrfs stability issues

2018-03-11 Thread Christoph Anton Mitterer
On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote: > > COW is needed to properly checksum the data. Otherwise it is not > possible to ensure the coherency between data and checksum (however I > have to point out that BTRFS fails even in this case [*]). > We could rearrange this sentence,

Re: Ongoing Btrfs stability issues

2018-03-11 Thread Goffredo Baroncelli
On 03/10/2018 03:29 PM, Christoph Anton Mitterer wrote: > On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote: >> So for OLTP workloads you definitely want nodatacow enabled, bear in >> mind this also disables crc checksumming, but your db engine should >> already have such functionality

Re: Ongoing Btrfs stability issues

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote: > So for OLTP workloads you definitely want nodatacow enabled, bear in > mind this also disables crc checksumming, but your db engine should > already have such functionality implemented in it. Unlike repeated claims made here on the list

Re: Ongoing Btrfs stability issues

2018-03-10 Thread Nikolay Borisov
On 9.03.2018 21:05, Alex Adriaanse wrote: > Am I correct to understand that nodatacow doesn't really avoid CoW when > you're using snapshots? In a filesystem that's snapshotted Yes, so nodatacow won't interfere with how snapshots operate. For more information on that topic check the

Re: Ongoing Btrfs stability issues

2018-03-09 Thread Alex Adriaanse
On Mar 9, 2018, at 3:54 AM, Nikolay Borisov wrote: > >> Sorry, I clearly missed that one. I have applied the patch you referenced >> and rebooted the VM in question. This morning we had another FS failure on >> the same machine that caused it to go into readonly mode. This

Re: Ongoing Btrfs stability issues

2018-03-09 Thread Nikolay Borisov
> Sorry, I clearly missed that one. I have applied the patch you referenced and > rebooted the VM in question. This morning we had another FS failure on the > same machine that caused it to go into readonly mode. This happened after > that device was experiencing 100% I/O utilization for some

Re: Ongoing Btrfs stability issues

2018-03-08 Thread Alex Adriaanse
On Mar 2, 2018, at 11:29 AM, Liu Bo wrote: > On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote: >> On 1.03.2018 21:04, Alex Adriaanse wrote: >>> Thanks so much for the suggestions so far, everyone. I wanted to report >>> back on this. Last Friday I made the

Re: Ongoing Btrfs stability issues

2018-03-02 Thread Liu Bo
On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote: > > > On 1.03.2018 21:04, Alex Adriaanse wrote: > > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn > > wrote: ... > > > [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: > >

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Qu Wenruo
On 2018年03月02日 03:04, Alex Adriaanse wrote: > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn > wrote: >> I would suggest changing this to eliminate the balance with '-dusage=10' >> (it's redundant with the '-dusage=20' one unless your filesystem is in >>

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Nikolay Borisov
On 1.03.2018 21:04, Alex Adriaanse wrote: > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn > wrote: >> I would suggest changing this to eliminate the balance with '-dusage=10' >> (it's redundant with the '-dusage=20' one unless your filesystem is in >>

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Alex Adriaanse
On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn wrote: > I would suggest changing this to eliminate the balance with '-dusage=10' > (it's redundant with the '-dusage=20' one unless your filesystem is in > pathologically bad shape), and adding equivalent filters for
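
The redundancy of '-dusage=10' next to '-dusage=20' follows from how the usage filter selects block groups: it picks those whose usage is under the given percentage, so everything matched at 10 is also matched at 20. A toy model of that selection is sketched below. The block group names and usage figures are made up for illustration, and the function `selected_by_usage` is hypothetical, not part of any btrfs tool.

```python
def selected_by_usage(block_groups, threshold_percent):
    """Return the names of block groups with usage under the threshold.

    Toy model of the balance usage filter; assumes a strict "under"
    comparison. Not an interface of btrfs-progs.
    """
    return {
        name
        for name, used_percent in block_groups.items()
        if used_percent < threshold_percent
    }


# Hypothetical data block groups and their usage percentages.
block_groups = {"bg0": 3, "bg1": 9, "bg2": 15, "bg3": 42, "bg4": 97}

at_10 = selected_by_usage(block_groups, 10)
at_20 = selected_by_usage(block_groups, 20)

# Every block group matched at 10% is also matched at 20%.
print(at_10 <= at_20)
```

The model ignores free-space constraints. As the quoted advice notes, the separate low-threshold pass only matters when the filesystem is in pathologically bad shape.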

Re: Ongoing Btrfs stability issues

2018-02-17 Thread Shehbaz Jaffer
>First of all, the ssd mount option does not have anything to do with having single or DUP metadata. Sorry about that, I agree with you. -nossd would not help in increasing reliability in any way. One alternative would be to format and force duplication of metadata during filesystem creation on

Re: Ongoing Btrfs stability issues

2018-02-17 Thread Hans van Kranenburg
On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote: >> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS >> volumes are all SSD > > I have recently done some SSD corruption experiments on small set of > workloads, so I thought I would share my experience. > > While creating

Re: Ongoing Btrfs stability issues

2018-02-16 Thread Shehbaz Jaffer
>It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS >volumes are all SSD I have recently done some SSD corruption experiments on small set of workloads, so I thought I would share my experience. While creating btrfs using mkfs.btrfs command for SSDs, by default the

Re: Ongoing Btrfs stability issues

2018-02-16 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as excerpted: > This will probably sound like an odd question, but does BTRFS think your > storage devices are SSD's or not? Based on what you're saying, it > sounds like you're running into issues resulting from the >

Re: Ongoing Btrfs stability issues

2018-02-16 Thread Austin S. Hemmelgarn
mode. We've spent an enormous amount of time trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days. We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov
On 16.02.2018 06:54, Alex Adriaanse wrote: > >> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov wrote: >> >> On 15.02.2018 21:41, Alex Adriaanse wrote: >>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov wrote: So in all of the cases you are

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse
> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov wrote: > > On 15.02.2018 21:41, Alex Adriaanse wrote: >> >>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov wrote: >>> >>> So in all of the cases you are hitting some form of premature enospc. >>> There was a

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov
On 15.02.2018 21:41, Alex Adriaanse wrote: > >> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov wrote: >> >> So in all of the cases you are hitting some form of premature enospc. >> There was a fix that landed in 4.15 that should have fixed a rather >> long-standing issue with

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse
> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov wrote: > > So in all of the cases you are hitting some form of premature enospc. > There was a fix that landed in 4.15 that should have fixed a rather > long-standing issue with the way metadata reservations are satisfied, >

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov
e filesystem > going into readonly mode. We've spent an enormous amount of time trying to > recover corrupted filesystems, and the time that servers were down as a > result of Btrfs instability has accumulated to many days. > > We've made many changes to try to improve Btrfs stab

Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse
trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days. We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances, setting up monitoring to ensure our

Re: btrfs stability

2016-05-26 Thread Roman Mamedov
On Fri, 27 May 2016 00:42:07 +0200 Diego Torres wrote: > Btrfs is the only fs that can add drives one by one to an existing raid > setup, and use the new space immediately, without replacing all the drives. Ext4, XFS, JFS or pretty much any FS which can be resized

btrfs stability

2016-05-26 Thread Diego Torres
Hi there, I've been using btrfs with a raid5 configuration with 3 disks for 6 months, and then with 4 disks for a couple of months more. I run a weekly scrub, and a monthly balance. Btrfs is the only fs that can add drives one by one to an existing raid setup, and use the new space immediately,

Re: btrfs stability

2013-01-28 Thread Josef Bacik
On Sat, Jan 26, 2013 at 01:27:11PM -0700, Andrew McNabb wrote: Here's an update. I tried the new kernel, and I seem to be having some new (possibly worse) problems. In my ssh session, I'm seeing many errors of this sort: Message from syslogd@guru at Jan 26 13:13:14 ... kernel:[

Re: btrfs stability

2013-01-26 Thread Andrew McNabb
Here's an update. I tried the new kernel, and I seem to be having some new (possibly worse) problems. In my ssh session, I'm seeing many errors of this sort: Message from syslogd@guru at Jan 26 13:13:14 ... kernel:[ 308.223834] BUG: soft lockup - CPU#0 stuck for 23s! [btrfs-endio-wri:2073]

btrfs stability

2013-01-25 Thread Andrew McNabb
I tried creating a multi-device btrfs filesystem for the first time (on Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems. I had heard that btrfs is now reasonably stable, and though I expected to possibly see a problem here or there, I was a little surprised at just how many

Re: btrfs stability

2013-01-25 Thread Josef Bacik
On Fri, Jan 25, 2013 at 01:05:14PM -0700, Andrew McNabb wrote: I tried creating a multi-device btrfs filesystem for the first time (on Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems. I had heard that btrfs is now reasonably stable, and though I expected to possibly see a

Re: btrfs stability

2013-01-25 Thread Andrew McNabb
On Fri, Jan 25, 2013 at 03:37:17PM -0500, Josef Bacik wrote: https://bugzilla.redhat.com/show_bug.cgi?id=903794 This one is just an allocator warning because the relocator doesn't do the right accounting for relocation. It's just complaining, we need to fix it but it won't keep it from

Re: btrfs stability

2013-01-25 Thread Andrew McNabb
On Fri, Jan 25, 2013 at 03:53:22PM -0500, Josef Bacik wrote: Actually for this one, how did you remove the disk? Did you just yank it out while the box was running? Did you mount -o degraded and then delete the device and then remove it? How exactly did you get to this situation?