Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-14 14:39, Goffredo Baroncelli wrote:
>> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
>> [...]

>>>> In btrfs, a checksum mismatch causes an -EIO error on read. In a
>>>> conventional filesystem (or a btrfs filesystem w/o datasum) there is no
>>>> checksum, so this problem doesn't exist.
>>>>
>>>> I am curious how ZFS solves this problem.
>>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
>>> the problem in the first place.
>>
>> I would like to perform some tests; however, I think that you are right. If
>> you use a "double buffering" approach (copy the data into the page cache,
>> compute the checksum, then write the data to disk), the mismatch should not
>> happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT
>> is only a prerequisite for the "double buffering"; on its own it wouldn't be
>> sufficient. What about mmap? Are we sure that it does double buffering?
> There's a whole lot of applications that would be showing some pretty serious 
> issues if checksumming didn't work correctly with mmap(), so I think it does 
> work correctly given that we don't have hordes of angry users and sysadmins 
> beating down the doors.

I tried updating a page and writing it out in parallel from different threads; I
was unable to reproduce a checksum mismatch, so it seems that mmap is safe
from this point of view.
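
For reference, a rough sketch of such a test (not the exact program I used,
just the idea: one thread keeps dirtying an mmap()ed page while another
repeatedly forces writeback; file name, sizes and iteration counts are
arbitrary):

/* Sketch: one thread keeps dirtying an mmap()ed page while another thread
 * forces writeback with msync().  Afterwards drop the page cache and read
 * the file back; a csum failure would show up as -EIO on that read.
 * File name, sizes and iteration counts are arbitrary. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ 4096

static char *map;
static volatile int stop;

static void *dirty_loop(void *arg)
{
	(void)arg;
	while (!stop)
		map[rand() % SZ] = (char)rand();   /* keep modifying the page */
	return NULL;
}

int main(void)
{
	int fd = open("testfile", O_CREAT | O_RDWR, 0644);
	if (fd < 0 || ftruncate(fd, SZ) < 0) { perror("setup"); return 1; }

	map = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) { perror("mmap"); return 1; }

	pthread_t t;
	pthread_create(&t, NULL, dirty_loop, NULL);

	for (int i = 0; i < 10000; i++)            /* force writeback repeatedly */
		msync(map, SZ, MS_SYNC);

	stop = 1;
	pthread_join(t, NULL);
	munmap(map, SZ);
	close(fd);
	return 0;
}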

>>
>> I would prefer that btrfs doesn't allow O_DIRECT with COW files. I would
>> prefer this to the checksum mismatch bug.
> This is only reasonable if you are writing to the files.  Checksums appear to 
> be checked on O_DIRECT reads, and outside of databases and VM's, read-only 
> access accounts for a significant percentage of O_DIRECT usage, partly 
> because it is needed for AIO support (nginx for example can serve files using 
> AIO and O_DIRECT and gets a pretty serious performance boost on heavily 
> loaded systems by doing so).
> 

So O_DIRECT should be unsupported/ignored only for writing? That could be a
good compromise...

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn

On 2018-03-14 14:39, Goffredo Baroncelli wrote:

> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
> [...]

>>> In btrfs, a checksum mismatch causes an -EIO error on read. In a
>>> conventional filesystem (or a btrfs filesystem w/o datasum) there is no
>>> checksum, so this problem doesn't exist.
>>>
>>> I am curious how ZFS solves this problem.
>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the
>> problem in the first place.


> I would like to perform some tests; however, I think that you are right. If you use a "double
> buffering" approach (copy the data into the page cache, compute the checksum, then write the
> data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but
> disabling O_DIRECT is only a prerequisite for the "double buffering"; on its own it wouldn't be
> sufficient. What about mmap? Are we sure that it does double buffering?
There's a whole lot of applications that would be showing some pretty 
serious issues if checksumming didn't work correctly with mmap(), so I 
think it does work correctly given that we don't have hordes of angry 
users and sysadmins beating down the doors.


> I would prefer that btrfs doesn't allow O_DIRECT with COW files. I would
> prefer this to the checksum mismatch bug.
This is only reasonable if you are writing to the files.  Checksums 
appear to be checked on O_DIRECT reads, and outside of databases and 
VM's, read-only access accounts for a significant percentage of O_DIRECT 
usage, partly because it is needed for AIO support (nginx for example 
can serve files using AIO and O_DIRECT and gets a pretty serious 
performance boost on heavily loaded systems by doing so).



Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]
>>
>> In btrfs, a checksum mismatch causes an -EIO error on read. In a
>> conventional filesystem (or a btrfs filesystem w/o datasum) there is no
>> checksum, so this problem doesn't exist.
>>
>> I am curious how ZFS solves this problem.
> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
> the problem in the first place.

I would like to perform some tests; however, I think that you are right. If you
use a "double buffering" approach (copy the data into the page cache, compute
the checksum, then write the data to disk), the mismatch should not happen. Of
course this is incompatible with O_DIRECT; but disabling O_DIRECT is only a
prerequisite for the "double buffering"; on its own it wouldn't be sufficient.
What about mmap? Are we sure that it does double buffering?

I would prefer that btrfs doesn't allow O_DIRECT with COW files. I would prefer
this to the checksum mismatch bug.


>>
>> However, I have to point out that this problem is not solved by COW. COW
>> only solves the problem of an interrupted filesystem commit, where the
>> data has been updated in place (so it is visible to the user) but the
>> metadata has not.
> COW is irrelevant if you're bypassing it.  It's only enforced for metadata so 
> that you don't have to check the FS every time you mount it (because the way 
> BTRFS uses it guarantees consistency of the metadata).
>>
>>>
>>> Even if not... it should only be a problem in case of a crash during
>>> that,... and then I'd still prefer to get the false positive than bad
>>> data.
>>
>> How can you know whether it is "bad data" or a "bad checksum"?
> You can't directly.  Just like you can't know which copy in a two-device MD 
> RAID1 array is bad when they mismatch.
> 
> That's part of why I'm not all that fond of the idea of having checksums 
> without COW, you need to verify the data using secondary means anyway, so why 
> exactly should you waste time verifying it twice?

This is true

>>
>>>
>>> Anyway... it's not going to happen so the discussion is pointless.
>>> I think people can probably use dm-integrity (which btw: does no CoW
>>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>>> their data is valid.
>>> Not nice, but since it won't change on btrfs, a possible alternative.
>>
>> Even in this case, I am curious how dm-integrity would solve this issue.
> dm-integrity uses journaling, and actually based on the testing I've done, 
> will typically have much worse performance than the overhead of just enabling 
> COW on files on BTRFS and manually defragmenting them on a regular basis.

Good to know
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn

On 2018-03-13 15:36, Goffredo Baroncelli wrote:

> On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
>> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>>> Unfortunately no, the likelihood might be 100%: there are some
>>> patterns which trigger this problem quite easily. See the link which
>>> I posted in my previous email. There was a program which created a
>>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
>> But that rather seems like a plain bug?!
>
> You are right; unfortunately it seems that it is catalogued as WONT-FIX :(


>> No reason that would conceptually make checksumming+notdatacow
>> impossible.
>>
>> AFAIU, the conceptual thing would be about:
>> - data is written in nodatacow
>>    => thus a checksum must be written as well, so write it
>> - what can then of course happen is
>>    - both csum and data are written => fine
>>    - csum is written but data not and then some crash => csum will show
>>      that => fine
>>    - data is written but csum not and then some crash => csum will give
>>      false positive
>>
>> Still, a few false positives are better than many unnoticed data corruptions
>> and no true raid repair.


> A checksum mismatch is returned as -EIO by a read() syscall. This is an event
> that most programs handle badly.
> E.g. suppose that a page of a VM RAM image file has a wrong checksum. When the
> VM starts, it tries to read the page, gets -EIO and aborts. It may not even
> print which page is corrupted. In this case, how does the user understand the
> problem, and what can he do?
Check the kernel log on the host system, which should have an error 
message saying which block failed.  If the VM itself actually gets to 
the point of booting into an OS (and properly propagates things like 
-EIO to the guest environment like it should), that OS should also log 
where the error was.


Most of the reason user applications don't tell you where the error was 
is because the kernel already does it on any sensible system, and the 
kernel tells you _exactly_ where the error was (exact block and device 
that threw the error), which user applications can't really do (they 
generally can't get sufficiently low-level information to give you all 
the info the kernel does).





>>> Again, you are assuming that the likelihood of having a bad checksum
>>> is low. Unfortunately this is not true. There are patterns which
>>> exploit this bug with a likelihood of 100%.
>>
>> Okay, I don't understand why this would be so and wouldn't assume that
>> the IO pattern can affect it heavily... but I'm not really a btrfs
>> expert.
>>
>> My blind assumption would have been that writing an extent of data
>> takes much longer to complete than writing the corresponding checksum.


> The problem is the following: there is a time window between the checksum
> computation and the writing of the data to disk (which is done at the lower
> level via a DMA channel); if the data is updated during that window, the
> checksum will mismatch. This happens if we have two threads, where the first
> commits the data to disk, and the second one updates the data (I think that
> both VMs and databases could behave this way).
Though it only matters if you use O_DIRECT or the files in question are 
NOCOW.


> In btrfs, a checksum mismatch causes an -EIO error on read. In a
> conventional filesystem (or a btrfs filesystem w/o datasum) there is no
> checksum, so this problem doesn't exist.
>
> I am curious how ZFS solves this problem.
It doesn't support disabling COW or the O_DIRECT flag, so it just never 
has the problem in the first place.


> However, I have to point out that this problem is not solved by COW. COW
> only solves the problem of an interrupted filesystem commit, where the data
> has been updated in place (so it is visible to the user) but the metadata
> has not.
COW is irrelevant if you're bypassing it.  It's only enforced for 
metadata so that you don't have to check the FS every time you mount it 
(because the way BTRFS uses it guarantees consistency of the metadata).




>> Even if not... it should only be a problem in case of a crash during
>> that,... and then I'd still prefer to get the false positive than bad
>> data.


> How can you know whether it is "bad data" or a "bad checksum"?
You can't directly.  Just like you can't know which copy in a two-device 
MD RAID1 array is bad when they mismatch.


That's part of why I'm not all that fond of the idea of having checksums 
without COW, you need to verify the data using secondary means anyway, 
so why exactly should you waste time verifying it twice?




>> Anyway... it's not going to happen so the discussion is pointless.
>> I think people can probably use dm-integrity (which btw: does no CoW
>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>> their data is valid.
>> Not nice, but since it won't change on btrfs, a possible alternative.


> Even in this case, I am curious how dm-integrity would solve this issue.
dm-integrity uses journaling, and actually based on the testing I've 
done, will typically have much worse 

Re: Ongoing Btrfs stability issues

2018-03-13 Thread Christoph Anton Mitterer
On Tue, 2018-03-13 at 20:36 +0100, Goffredo Baroncelli wrote:
> A checksum mismatch is returned as -EIO by a read() syscall. This is
> an event that most programs handle badly.
Then these programs must simply be fixed... otherwise they'll also fail
under normal circumstances with btrfs, if there is any corruption.


> The problem is the following: there is a time window between the
> checksum computation and the writing of the data to disk (which is
> done at the lower level via a DMA channel); if the data is updated
> during that window, the checksum will mismatch. This happens if we
> have two threads, where the first commits the data to disk, and the
> second one updates the data (I think that both VMs and databases
> could behave this way).
Well that's clear... but isn't that time frame also there if the extent
is just written without CoW (regardless of checksumming)?
Obviously there would need to be some protection here anyway, so that
such data is taken e.g. from RAM, before the write has completed, so
that the read wouldn't take place while the write has only half
finished?!
So I'd naively assume one could just enlarge that protection to the
completion of checksum writing,...



> In btrfs, a checksum mismatch causes an -EIO error on read. In a
> conventional filesystem (or a btrfs filesystem w/o
> datasum) there is no checksum, so this problem doesn't exist.
If ext writes an extent (can't that be up to 128MiB there?), then I'm
sure it cannot write that atomically (in terms of hardware)... so there
is likely some protection around this operation, that there are no
concurrent reads of that particular extent from the disk, while the
write hasn't finished yet.



> > Even if not... it should only be a problem in case of a crash during
> > that,... and then I'd still prefer to get the false positive than
> > bad
> > data.
> 
> How can you know whether it is "bad data" or a "bad checksum"?
Well as I've said, in my naive thinking this should only be a problem
in case of a crash... and then, yes, one cannot say whether it's bad
data or checksum (that's exactly what I'm saying)... but I'd rather
prefer to know that something might be fishy, than not knowing anything
and perhaps even getting good data "RAID-repaired" with bad one...


Cheers,
Chris.


Re: Ongoing Btrfs stability issues

2018-03-13 Thread Goffredo Baroncelli
On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>> Unfortunately no, the likelihood might be 100%: there are some
>> patterns which trigger this problem quite easily. See the link which
>> I posted in my previous email. There was a program which created a
>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
> But that rather seems like a plain bug?!

You are right; unfortunately it seems that it is catalogued as WONT-FIX :(

> No reason that would conceptually make checksumming+notdatacow
> impossible.
> 
> AFAIU, the conceptual thing would be about:
> - data is written in nodatacow
>   => thus a checksum must be written as well, so write it
> - what can then of course happen is
>   - both csum and data are written => fine
>   - csum is written but data not and then some crash => csum will show
> that => fine
>   - data is written but csum not and then some crash => csum will give
> false positive
> 
> Still, a few false positives are better than many unnoticed data corruptions
> and no true raid repair.

A checksum mismatch is returned as -EIO by a read() syscall. This is an event
that most programs handle badly.
E.g. suppose that a page of a VM RAM image file has a wrong checksum. When the
VM starts, it tries to read the page, gets -EIO and aborts. It may not even
print which page is corrupted. In this case, how does the user understand the
problem, and what can he do?


[...]

> 
>> Again, you are assuming that the likelihood of having a bad checksum
>> is low. Unfortunately this is not true. There are patterns which
>> exploit this bug with a likelihood of 100%.
> 
> Okay, I don't understand why this would be so and wouldn't assume that
> the IO pattern can affect it heavily... but I'm not really a btrfs
> expert.
> 
> My blind assumption would have been that writing an extent of data
> takes much longer to complete than writing the corresponding checksum.

The problem is the following: there is a time window between the checksum
computation and the writing of the data to disk (which is done at the lower
level via a DMA channel); if the data is updated during that window, the
checksum will mismatch. This happens if we have two threads, where the first
commits the data to disk, and the second one updates the data (I think that
both VMs and databases could behave this way).
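
A rough sketch of such a pattern (not the reproducer from the thread linked
below, just an illustration of the race: one thread issues O_DIRECT writes of
a shared buffer on a datasum file while another thread keeps updating that
buffer; file name and iteration counts are arbitrary):

/* Sketch: thread A writes an aligned buffer with O_DIRECT while thread B
 * keeps modifying the same buffer.  With datasum enabled the checksum can
 * be computed over bytes that differ from what the DMA transfer finally
 * puts on disk, so a later (cold-cache) read may return -EIO.
 * Not the original reproducer; file name and counts are arbitrary. */
#define _GNU_SOURCE         /* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096             /* one block; O_DIRECT needs aligned size/offset */

static char *buf;
static volatile int stop;

static void *modify_loop(void *arg)
{
	(void)arg;
	while (!stop)
		buf[rand() % SZ]++;     /* update the data under the write */
	return NULL;
}

int main(void)
{
	int fd = open("testfile", O_CREAT | O_RDWR | O_DIRECT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	if (posix_memalign((void **)&buf, SZ, SZ)) return 1;
	memset(buf, 'A', SZ);

	pthread_t t;
	pthread_create(&t, NULL, modify_loop, NULL);

	for (int i = 0; i < 10000; i++)
		if (pwrite(fd, buf, SZ, 0) != SZ)
			perror("pwrite");

	stop = 1;
	pthread_join(t, NULL);
	close(fd);
	return 0;
}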

In btrfs, a checksum mismatch causes an -EIO error on read. In a
conventional filesystem (or a btrfs filesystem w/o datasum) there is no
checksum, so this problem doesn't exist.

I am curious how ZFS solves this problem.

However, I have to point out that this problem is not solved by COW. COW
only solves the problem of an interrupted filesystem commit, where the data
has been updated in place (so it is visible to the user) but the metadata
has not.


> 
> Even if not... it should only be a problem in case of a crash during
> that,... and then I'd still prefer to get the false positive than bad
> data.

How can you know whether it is "bad data" or a "bad checksum"?


> 
> 
> Anyway... it's not going to happen so the discussion is pointless.
> I think people can probably use dm-integrity (which btw: does no CoW
> either (IIRC) and still can provide integrity... ;-) ) to see whether
> their data is valid.
> Not nice, but since it won't change on btrfs, a possible alternative.

Even in this case, I am curious how dm-integrity would solve this issue.

> 
> 
> Cheers,
> Chris.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-13 Thread Patrik Lundquist
On 9 March 2018 at 20:05, Alex Adriaanse  wrote:
>
> Yes, we have PostgreSQL databases running on these VMs that put a heavy I/O load
> on these machines.

Dump the databases and recreate them with --data-checksums and the Btrfs
No_COW attribute.

You can add this to /etc/postgresql-common/createcluster.conf in
Debian/Ubuntu if you use pg_createcluster:
initdb_options = '--data-checksums'
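
Note that the No_COW attribute only takes effect for files created after it is
set, so set it (chattr +C) on the empty data directory before initdb populates
it; programmatically it corresponds to the FS_NOCOW_FL inode flag. A rough
sketch (the path below is only an example):

/* Sketch: set the btrfs No_COW attribute (FS_NOCOW_FL) on an empty
 * directory so that files created inside it inherit nodatacow.
 * Equivalent to "chattr +C <dir>"; the path below is only an example. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	const char *dir = "/var/lib/postgresql/10/main";   /* example path */
	int fd = open(dir, O_RDONLY | O_DIRECTORY);
	if (fd < 0) { perror("open"); return 1; }

	int flags;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("getflags"); return 1; }

	flags |= FS_NOCOW_FL;       /* only honoured for new/empty files */
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("setflags"); return 1; }

	close(fd);
	return 0;
}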


Re: Ongoing Btrfs stability issues

2018-03-12 Thread Christoph Anton Mitterer
On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
> Unfortunately no, the likelihood might be 100%: there are some
> patterns which trigger this problem quite easily. See the link which
> I posted in my previous email. There was a program which created a
> bad checksum (in COW+DATASUM mode), and the file became unreadable.
But that rather seems like a plain bug?!

No reason that would conceptually make checksumming+notdatacow
impossible.

AFAIU, the conceptual thing would be about:
- data is written in nodatacow
  => thus a checksum must be written as well, so write it
- what can then of course happen is
  - both csum and data are written => fine
  - csum is written but data not and then some crash => csum will show
that => fine
  - data is written but csum not and then some crash => csum will give
false positive

Still, a few false positives are better than many unnoticed data corruptions
and no true raid repair.


> If you cannot know if a checksum is bad or the data is bad, the
> checksum is not useful at all!
Why not? It's anyway only uncertain in the case of a crash... and it at
least tells you that something is fishy.
A program which cares about its data will have its own journaling
mechanisms and can simply recover with those... or users could then just roll
in a backup.
Or one could provide some API/userland tool to recompute the csums of
the affected file (and possibly live with bad data).


> If I read what you wrote correctly, it seems that you consider the fact
> that the checksum is not correct a "minor issue". If you accept the
> possibility that a checksum might be wrong, you won't trust the checksum
> anymore; so the checksum becomes useless.
There's simply no disadvantage compared to not having checksumming at all
in the nodatacow case.
Cause then you never have any idea whether your data is correct or
not... the case with checksumming + notdatacow, which can give a false
positive on a crash when data was written correctly, but not the
checksum, covers at least the other cases of data corruption (silent
data corruption; csum written, but data not or only partially in case
of a crash).


> Again, you are assuming that the likelihood of having a bad checksum
> is low. Unfortunately this is not true. There are patterns which
> exploit this bug with a likelihood of 100%.

Okay, I don't understand why this would be so and wouldn't assume that
the IO pattern can affect it heavily... but I'm not really a btrfs
expert.

My blind assumption would have been that writing an extent of data
takes much longer to complete than writing the corresponding checksum.

Even if not... it should only be a problem in case of a crash during
that,... and then I'd still prefer to get the false positive than bad
data.


Anyway... it's not going to happen so the discussion is pointless.
I think people can probably use dm-integrity (which btw: does no CoW
either (IIRC) and still can provide integrity... ;-) ) to see whether
their data is valid.
Not nice, but since it won't change on btrfs, a possible alternative.


Cheers,
Chris.


Re: Ongoing Btrfs stability issues

2018-03-12 Thread Goffredo Baroncelli
On 03/11/2018 11:37 PM, Christoph Anton Mitterer wrote:
> On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
>>
>> COW is needed to properly checksum the data. Otherwise it is not
>> possible to ensure coherency between data and checksum (however I
>> have to point out that BTRFS fails even in this case [*]).
>> We could rearrange this sentence and say: if you want checksums,
>> you need COW...
> 
> No,... not really... the meta-data is anyway always CoWed... so if you
> do checksum *and* notdatacow,..., the only thing that could possibly
> happen (in the worst case) is, that data that actually made it
> correctly to the disk is falsely determined bad, as the metadata (i.e.
> the checksums) weren't updated correctly.
> 
> That however is probably much less likely than the other way round,..
> i.e. bad data went to disk and would be detected with checksumming.

Unfortunately no, the likelihood might be 100%: there are some patterns which
trigger this problem quite easily. See the link which I posted in my previous
email. There was a program which created a bad checksum (in COW+DATASUM mode),
and the file became unreadable.

> 
> 
> I had lots of discussions about this here on the list, and no one ever
> brought up a real argument against it... I also had an off-list
> discussion with Chris Mason who IIRC confirmed that it would actually
> work as I imagine it... with the only two problems:
> - good data possibly be marked bad because of bad checksums
> - reads giving back EIO where people would rather prefer bad data

If you cannot know whether the checksum is bad or the data is bad, the checksum
is not useful at all!

If I read what you wrote correctly, it seems that you consider the fact that
the checksum is not correct a "minor issue". If you accept the possibility that
a checksum might be wrong, you won't trust the checksum anymore; so the checksum
becomes useless.
 

> (not really sure if this were really his two arguments,... I'd have to
> look it up, so don't nail me down).
> 
> 
> Long story short:
> 
> In any case, I think giving back bad data without EIO is unacceptable.
> If someone really doesn't care (e.g. because he has higher level
> checksumming and possibly even repair) he could still manually disable
> checksumming.
> 
> The little chance of having a false positive weighs IMO far less than
> having very large amounts of data (DBs, VM images are our typical cases)
> completely unprotected.

Again, you are assuming that the likelihood of having a bad checksum is low.
Unfortunately this is not true. There are patterns which exploit this bug with
a likelihood of 100%.

> 
> And not having checksumming with notdatacow breaks any safe raid repair
> (so in that case "repair" may even overwrite good data),... which is
> IMO also unacceptable.
> And the typical use cases for nodatacow (VMs, DBs) are in turn not so
> uncommon to want RAID.
> 
> 
> I really like btrfs,... and it's not that other fs (which typically
> have no checksumming at all) would perform better here... but not
> having it for these major use case is a big disappointment for me.
> 
> 
> Cheers,
> Chris.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-11 Thread Christoph Anton Mitterer
On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
> 
> COW is needed to properly checksum the data. Otherwise it is not
> possible to ensure coherency between data and checksum (however I
> have to point out that BTRFS fails even in this case [*]).
> We could rearrange this sentence and say: if you want checksums,
> you need COW...

No,... not really... the meta-data is anyway always CoWed... so if you
do checksum *and* notdatacow,..., the only thing that could possibly
happen (in the worst case) is, that data that actually made it
correctly to the disk is falsely determined bad, as the metadata (i.e.
the checksums) weren't updated correctly.

That however is probably much less likely than the other way round,..
i.e. bad data went to disk and would be detected with checksumming.


I had lots of discussions about this here on the list, and no one ever
brought up a real argument against it... I also had an off-list
discussion with Chris Mason who IIRC confirmed that it would actually
work as I imagine it... with the only two problems:
- good data possibly be marked bad because of bad checksums
- reads giving back EIO where people would rather prefer bad data
(not really sure if this were really his two arguments,... I'd have to
look it up, so don't nail me down).


Long story short:

In any case, I think giving back bad data without EIO is unacceptable.
If someone really doesn't care (e.g. because he has higher level
checksumming and possibly even repair) he could still manually disable
checksumming.

The little chance of having a false positive weighs IMO far less than
having very large amounts of data (DBs, VM images are our typical cases)
completely unprotected.

And not having checksumming with notdatacow breaks any safe raid repair
(so in that case "repair" may even overwrite good data),... which is
IMO also unacceptable.
And the typical use cases for nodatacow (VMs, DBs) are in turn not so
uncommon to want RAID.


I really like btrfs,... and it's not that other fs (which typically
have no checksumming at all) would perform better here... but not
having it for these major use case is a big disappointment for me.


Cheers,
Chris.


Re: Ongoing Btrfs stability issues

2018-03-11 Thread Goffredo Baroncelli
On 03/10/2018 03:29 PM, Christoph Anton Mitterer wrote:
> On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
>> So for OLTP workloads you definitely want nodatacow enabled, bear in
>> mind this also disables crc checksumming, but your db engine should
>> already have such functionality implemented in it.
> 
> Unlike repeated claims made here on the list and other places... I
> wouldn't know of *any* DB system which actually does this by default and/or
> in a way that would be comparable to filesystem-level checksumming.
> 

I agree with you; also, nobody warns that without checksums, in the case of a RAID
filesystem, BTRFS is no longer capable of checking whether a stripe is correct or not.

> 
> Look back in the archives... when I've asked several times for
> checksumming support *with* nodatacow, I evaluated the existing status
> for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
> this either not enabled per default, not at all, or requiring special
> support for the program using the DB.
> 
> 
> Similar btw: no single VM image type I've evaluated back then had any
> form of checksumming integrated.
> 
> 
> Still, one of the major deficiencies (not in comparison to other fs,
> but in comparison to how it should be) of btrfs unfortunately :-(

COW is needed to properly checksum the data. Otherwise it is not possible to
ensure coherency between data and checksum (however I have to point out
that BTRFS fails even in this case [*]).
We could rearrange this sentence and say: if you want checksums, you need
COW...

> 
> 
> Cheers,
> Chris.
> 

[*] https://www.spinics.net/lists/linux-btrfs/msg69185.html

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
> So for OLTP workloads you definitely want nodatacow enabled, bear in
> mind this also disables crc checksumming, but your db engine should
> already have such functionality implemented in it.

Unlike repeated claims made here on the list and other places... I
wouldn't know of *any* DB system which actually does this by default and/or
in a way that would be comparable to filesystem-level checksumming.


Look back in the archives... when I've asked several times for
checksumming support *with* nodatacow, I evaluated the existing status
for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
this either not enabled per default, not at all, or requiring special
support for the program using the DB.


Similar btw: no single VM image type I've evaluated back then had any
form of checksumming integrated.


Still, one of the major deficiencies (not in comparison to other fs,
but in comparison to how it should be) of btrfs unfortunately :-(


Cheers,
Chris.


Re: Ongoing Btrfs stability issues

2018-03-10 Thread Nikolay Borisov


On  9.03.2018 21:05, Alex Adriaanse wrote:
> Am I correct to understand that nodatacow doesn't really avoid CoW when 
> you're using snapshots? In a filesystem that's snapshotted 

Yes, so nodatacow won't interfere with how snapshots operate. For more
information on that topic check the following mailing list thread:
https://www.spinics.net/lists/linux-btrfs/msg62715.html

> every 15 minutes, is there a difference between normal CoW and nodatacow
> when (in the case of Postgres) you update a small portion of a 1GB file
> many times per minute? Do you anticipate us seeing a benefit in
> stability and performance if we set nodatacow for the
So regarding this, you can check:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Essentially every small, random postgres update in the db file
will cause a CoW operation + checksum IO, which causes, and I quote,
"thrashing on HDDs and excessive multi-second spikes of CPU load on
systems with an SSD or large amount of RAM."

So for OLTP workloads you definitely want nodatacow enabled, bear in
mind this also disables crc checksumming, but your db engine should
already have such functionality implemented in it.

> entire FS while retaining snapshots? Does nodatacow increase the chance
> of corruption in a database like Postgres, i.e. are writes still
> properly ordered/sync'ed when flushed to disk?

Well most modern DB already implement some sort of a WAL, so the
reliability responsibility is shifted on the db engine.


Re: Ongoing Btrfs stability issues

2018-03-09 Thread Alex Adriaanse
On Mar 9, 2018, at 3:54 AM, Nikolay Borisov  wrote:
> 
>> Sorry, I clearly missed that one. I have applied the patch you referenced 
>> and rebooted the VM in question. This morning we had another FS failure on 
>> the same machine that caused it to go into readonly mode. This happened 
>> after that device was experiencing 100% I/O utilization for some time. No 
>> balance was running at the time; last balance finished about 6 hours prior 
>> to the error.
>> 
>> Kernel messages:
>> [211238.262683] use_block_rsv: 163 callbacks suppressed
>> [211238.262683] BTRFS: block rsv returned -28
>> [211238.266718] [ cut here ]
>> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 
>> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
>> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
>> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
>> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
>> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
>> ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
>> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
>> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
>> drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
>> drm ena scsi_mod i2c_piix4 button
>> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: GW
>>4.14.13 #3
>> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>> [211238.330742] task: 9cb43abb70c0 task.stack: b234c3b58000
>> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.340454] RSP: 0018:b234c3b5b958 EFLAGS: 00010282
>> [211238.344782] RAX: 001d RBX: 9cb43bdea128 RCX: 
>> 
>> [211238.350562] RDX:  RSI: 9cb440a166f8 RDI: 
>> 9cb440a166f8
>> [211238.356066] RBP: 4000 R08: 0001 R09: 
>> 7d81
>> [211238.361649] R10: 0001 R11: 7d81 R12: 
>> 9cb43bdea000
>> [211238.367304] R13: 9cb437f2c800 R14: 0001 R15: 
>> ffe4
>> [211238.372658] FS:  () GS:9cb440a0() 
>> knlGS:
>> [211238.379048] CS:  0010 DS:  ES:  CR0: 80050033
>> [211238.384681] CR2: 7f90a6677000 CR3: 0003cea0a006 CR4: 
>> 001606f0
>> [211238.391380] DR0:  DR1:  DR2: 
>> 
>> [211238.398050] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [211238.404730] Call Trace:
>> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
>> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
>> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
>> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
>> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
>> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
>> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
>> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
>> [211238.442899]  ? remove_wait_queue+0x60/0x60
>> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
>> [211238.450578]  kthread+0xfc/0x130
>> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
>> [211238.458381]  ? kthread_create_on_node+0x70/0x70
>> [211238.462225]  ? do_group_exit+0x3a/0xa0
>> [211238.465586]  ret_from_fork+0x1f/0x30
>> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 
>> f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 
>> <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
>> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
>> [211238.486524] BTRFS info (device xvdc): space_info 4 has 
>> 18446744073258958848 free, is not full
>> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, 
>> used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, 
>> readonly=65536
> 
> Ok so the numbers here are helpful, they show that we have enough space
> to allocate a chunk. I've also looked at the logic in 4.14.13 and all
> the necessary patches are there. Unfortunately none of this matters due
> to the fact that reserve_metadata_bytes is being called with
> BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
> any effort to flush anything at all.
> 
> Can you tell again what the workload is - is it some sort of a database,
> constantly writing to its files?

Yes, we have PostgreSQL databases running on these VMs that put a heavy I/O load
on these machines. We also have snapshots being deleted and created every 15 
minutes. Looking at historical atop data for the two most recent crashes:

1. Right 

Re: Ongoing Btrfs stability issues

2018-03-09 Thread Nikolay Borisov

> Sorry, I clearly missed that one. I have applied the patch you referenced and 
> rebooted the VM in question. This morning we had another FS failure on the 
> same machine that caused it to go into readonly mode. This happened after 
> that device was experiencing 100% I/O utilization for some time. No balance 
> was running at the time; last balance finished about 6 hours prior to the 
> error.
> 
> Kernel messages:
> [211238.262683] use_block_rsv: 163 callbacks suppressed
> [211238.262683] BTRFS: block rsv returned -28
> [211238.266718] [ cut here ]
> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 
> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
> drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
> drm ena scsi_mod i2c_piix4 button
> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: GW 
>   4.14.13 #3
> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [211238.330742] task: 9cb43abb70c0 task.stack: b234c3b58000
> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.340454] RSP: 0018:b234c3b5b958 EFLAGS: 00010282
> [211238.344782] RAX: 001d RBX: 9cb43bdea128 RCX: 
> 
> [211238.350562] RDX:  RSI: 9cb440a166f8 RDI: 
> 9cb440a166f8
> [211238.356066] RBP: 4000 R08: 0001 R09: 
> 7d81
> [211238.361649] R10: 0001 R11: 7d81 R12: 
> 9cb43bdea000
> [211238.367304] R13: 9cb437f2c800 R14: 0001 R15: 
> ffe4
> [211238.372658] FS:  () GS:9cb440a0() 
> knlGS:
> [211238.379048] CS:  0010 DS:  ES:  CR0: 80050033
> [211238.384681] CR2: 7f90a6677000 CR3: 0003cea0a006 CR4: 
> 001606f0
> [211238.391380] DR0:  DR1:  DR2: 
> 
> [211238.398050] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [211238.404730] Call Trace:
> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [211238.442899]  ? remove_wait_queue+0x60/0x60
> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
> [211238.450578]  kthread+0xfc/0x130
> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [211238.458381]  ? kthread_create_on_node+0x70/0x70
> [211238.462225]  ? do_group_exit+0x3a/0xa0
> [211238.465586]  ret_from_fork+0x1f/0x30
> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 
> f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 
> <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
> [211238.486524] BTRFS info (device xvdc): space_info 4 has 
> 18446744073258958848 free, is not full
> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, 
> used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, 
> readonly=65536

Ok so the numbers here are helpful, they show that we have enough space
to allocate a chunk. I've also looked at the logic in 4.14.13 and all
the necessary patches are there. Unfortunately none of this matters due
to the fact that reserve_metadata_bytes is being called with
BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
any effort to flush anything at all.

Can you tell again what the workload is - is it some sort of a database,
constantly writing to its files? If so, btrfs is not really suited for
rewrite-heavy workloads since this causes excessive CoW. In such cases
you really ought to set nodatacow on the specified files. For more
information:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

The other thing that comes to mind is to try and tune the default commit
interval. Currently this is 30 seconds, meaning a 

Re: Ongoing Btrfs stability issues

2018-03-08 Thread Alex Adriaanse
On Mar 2, 2018, at 11:29 AM, Liu Bo  wrote:
> On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
>> On  1.03.2018 21:04, Alex Adriaanse wrote:
>>> Thanks so much for the suggestions so far, everyone. I wanted to report 
>>> back on this. Last Friday I made the following changes per suggestions from 
>>> this thread:
>>> 
>>> 1. Change the nightly balance to the following:
>>> 
>>>btrfs balance start -dusage=20 
>>>btrfs balance start -dusage=40,limit=10 
>>>btrfs balance start -musage=30 
>>> 
>>> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD 
>>> space allocation fix.
>>> 
>>> 3. Boot Linux with the elevator=noop option
>>> 
>>> 4. Change /sys/block/xvd*/queue/scheduler to "none"
>>> 
>>> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.
>> 
>> So that's good, however you didn't apply the out-of-tree patch (it has
>> already been merged into the for-next so will likely land in 4.17) I
>> pointed you at. As a result, when you hit your ENOSPC error there is no extra
>> information being printed, so we can't really reason about what might be
>> going wrong in the metadata flushing algorithms.

Sorry, I clearly missed that one. I have applied the patch you referenced and 
rebooted the VM in question. This morning we had another FS failure on the same 
machine that caused it to go into readonly mode. This happened after that 
device was experiencing 100% I/O utilization for some time. No balance was 
running at the time; last balance finished about 6 hours prior to the error.

Kernel messages:
[211238.262683] use_block_rsv: 163 callbacks suppressed
[211238.262683] BTRFS: block rsv returned -28
[211238.266718] [ cut here ]
[211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 
btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
drm ena scsi_mod i2c_piix4 button
[211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: GW   
4.14.13 #3
[211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[211238.330742] task: 9cb43abb70c0 task.stack: b234c3b58000
[211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.340454] RSP: 0018:b234c3b5b958 EFLAGS: 00010282
[211238.344782] RAX: 001d RBX: 9cb43bdea128 RCX: 

[211238.350562] RDX:  RSI: 9cb440a166f8 RDI: 
9cb440a166f8
[211238.356066] RBP: 4000 R08: 0001 R09: 
7d81
[211238.361649] R10: 0001 R11: 7d81 R12: 
9cb43bdea000
[211238.367304] R13: 9cb437f2c800 R14: 0001 R15: 
ffe4
[211238.372658] FS:  () GS:9cb440a0() 
knlGS:
[211238.379048] CS:  0010 DS:  ES:  CR0: 80050033
[211238.384681] CR2: 7f90a6677000 CR3: 0003cea0a006 CR4: 
001606f0
[211238.391380] DR0:  DR1:  DR2: 

[211238.398050] DR3:  DR6: fffe0ff0 DR7: 
0400
[211238.404730] Call Trace:
[211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[211238.442899]  ? remove_wait_queue+0x60/0x60
[211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
[211238.450578]  kthread+0xfc/0x130
[211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[211238.458381]  ? kthread_create_on_node+0x70/0x70
[211238.462225]  ? do_group_exit+0x3a/0xa0
[211238.465586]  ret_from_fork+0x1f/0x30
[211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 
d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff 
e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
[211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
[211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 
free, is not full

Re: Ongoing Btrfs stability issues

2018-03-02 Thread Liu Bo
On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
> 
> 
> On  1.03.2018 21:04, Alex Adriaanse wrote:
> > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn  
> > wrote:
...
> 
> > [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: 
> > errno=-28 No space left
> > [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: 
> > errno=-28 No space left
> > [496003.641996] BTRFS info (device xvdc): forced readonly
> > [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: 
> > errno=-28 No space left
> > [496003.642060] BUG: unable to handle kernel NULL pointer dereference at
> >(null)
> > [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> > [496003.642087] PGD 8005fe08c067 P4D 8005fe08c067 PUD 3bd2f4067 PMD 0
> > [496003.642091] Oops:  [#1] SMP PTI
> > [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
> > nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
> > iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
> > br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul 
> > crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc 
> > parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress 
> > zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata 
> > xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper 
> > cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> > [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: GW   
> > 4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> > [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> > [496003.642130] task: 8fbffb8dd080 task.stack: 9e81c7b8c000
> > [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
> 
> 
> if you happen to have the vmlinux of that kernel can you run the
> following from the kernel source directory:
> 
> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>

I thought this was fixed by bb166d7 ("btrfs: fix NULL pointer dereference from
free_reloc_roots()").
Alex, do you mind checking if it's included in your kernel?

You can also check if the following change is merged in kernel-src deb.

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3a49a3c..9841fae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
 	while (!list_empty(list)) {
 		reloc_root = list_entry(list->next, struct btrfs_root,
 					root_list);
+		__del_reloc_root(reloc_root);
 		free_extent_buffer(reloc_root->node);
 		free_extent_buffer(reloc_root->commit_root);
 		reloc_root->node = NULL;
 		reloc_root->commit_root = NULL;
-		__del_reloc_root(reloc_root);
 	}
 }


Thanks,

-liubo

> 
> > [496003.642151] RSP: 0018:9e81c7b8fab0 EFLAGS: 00010286
> > [496003.642153] RAX:  RBX: 8fb90a10a3c0 RCX: 
> > ca5d1fda5a5f
> > [496003.642154] RDX: 0001 RSI: 8fc05eae62c0 RDI: 
> > 8fbc4fd87d70
> > [496003.642154] RBP: 8fbbb5139000 R08:  R09: 
> > 
> > [496003.642155] R10: 8fc05eae62c0 R11: 01bc R12: 
> > 8fc0fbeac000
> > [496003.642156] R13: 8fbc4fd87d70 R14: 8fbc4fd87800 R15: 
> > ffe4
> > [496003.642157] FS:  7f64196708c0() GS:8fc100a4() 
> > knlGS:
> > [496003.642159] CS:  0010 DS:  ES:  CR0: 80050033
> > [496003.642160] CR2:  CR3: 00069b972004 CR4: 
> > 001606e0
> > [496003.642162] DR0:  DR1:  DR2: 
> > 
> > [496003.642163] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [496003.642164] Call Trace:
> > [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> > [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> > [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> > [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> > [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> > [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> > [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> > [496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
> > [496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
> > [496003.642315]  ? __handle_mm_fault+0x640/0x10e0
> > [496003.642318]  ? do_vfs_ioctl+0x9f/0x600
> > [496003.642319]  do_vfs_ioctl+0x9f/0x600
> > [496003.642321]  ? handle_mm_fault+0xc6/0x1b0
> > [496003.642325]  ? __do_page_fault+0x289/0x500
> > [496003.642327]  SyS_ioctl+0x74/0x80
> > [496003.642330]  system_call_fast_compare_end+0xc/0x6f
> > [496003.642332] RIP: 

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Qu Wenruo


On 2018年03月02日 03:04, Alex Adriaanse wrote:
> On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn  
> wrote:
>> I would suggest changing this to eliminate the balance with '-dusage=10' 
>> (it's redundant with the '-dusage=20' one unless your filesystem is in 
>> pathologically bad shape), and adding equivalent filters for balancing 
>> metadata (which generally goes pretty fast).
>>
>> Unless you've got a huge filesystem, you can also cut down on that limit 
>> filter.  100 data chunks that are 40% full is up to 40GB of data to move on 
>> a normally sized filesystem, or potentially up to 200GB if you've got a 
>> really big filesystem (I forget what point BTRFS starts scaling up chunk 
>> sizes at, but I'm pretty sure it's in the TB range).
> 
> Thanks so much for the suggestions so far, everyone. I wanted to report back 
> on this. Last Friday I made the following changes per suggestions from this 
> thread:
> 
> 1. Change the nightly balance to the following:
> 
> btrfs balance start -dusage=20 
> btrfs balance start -dusage=40,limit=10 
> btrfs balance start -musage=30 
> 
> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD 
> space allocation fix.
> 
> 3. Boot Linux with the elevator=noop option
> 
> 4. Change /sys/block/xvd*/queue/scheduler to "none"
> 
> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.
> 
> 6. I did NOT add the "nossd" flag because I didn't think it'd make much of a 
> difference after that SSD space allocation fix.
> 
> 7. After applying the above changes, ran a full balance on all the Btrfs 
> filesystems. I also have not experimented with autodefrag yet.
> 
> 
> Despite the changes above, we just experienced another crash this morning. 
> Kernel message (with enospc_debug turned on for the given mountpoint):

Would you please try to use "btrfs check" to check the filesystem offline?
I'm wondering if the extent tree or free space cache got corrupted and is
making the kernel confused about its space allocation.


I'm not completely sure, but it may also be something wrong with the
space cache.

So either mounting it with the nospace_cache option or using "btrfs check
--clear-space-cache v1" may help.

Thanks,
Qu

> 
> [496003.170278] use_block_rsv: 46 callbacks suppressed
> [496003.170279] BTRFS: block rsv returned -28
> [496003.173875] [ cut here ]
> [496003.177186] WARNING: CPU: 2 PID: 362 at 
> /build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:8458 
> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [496003.185369] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev 
> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
> aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena 
> psmouse drm scsi_mod i2c_piix4 button
> [496003.218484] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: GW 
>   4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.224618] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.228702] task: 8fc0fb6bd0c0 task.stack: 9e81c3ac
> [496003.233081] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [496003.237220] RSP: 0018:9e81c3ac3958 EFLAGS: 00010282
> [496003.241404] RAX: 001d RBX: 8fc0fbeac128 RCX: 
> 
> [496003.248004] RDX:  RSI: 8fc100a966f8 RDI: 
> 8fc100a966f8
> [496003.253896] RBP: 4000 R08: 0001 R09: 
> 0001667b
> [496003.258508] R10: 0001 R11: 0001667b R12: 
> 8fc0fbeac000
> [496003.264759] R13: 8fc0fac22800 R14: 0001 R15: 
> ffe4
> [496003.271203] FS:  () GS:8fc100a8() 
> knlGS:
> [496003.278169] CS:  0010 DS:  ES:  CR0: 80050033
> [496003.283917] CR2: 7efe00f36000 CR3: 000102a0a001 CR4: 
> 001606e0
> [496003.290309] DR0:  DR1:  DR2: 
> 
> [496003.296985] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [496003.303335] Call Trace:
> [496003.307113]  ? __pagevec_lru_add_fn+0x270/0x270
> [496003.312126]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [496003.316995]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [496003.321568]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [496003.326684]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [496003.332724]  ? set_extent_bit+0x19/0x20 [btrfs]
> [496003.337991]  __btrfs_free_extent.isra.61+0xf5/0xd30 

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Nikolay Borisov


On  1.03.2018 21:04, Alex Adriaanse wrote:
> On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn  
> wrote:
>> I would suggest changing this to eliminate the balance with '-dusage=10' 
>> (it's redundant with the '-dusage=20' one unless your filesystem is in 
>> pathologically bad shape), and adding equivalent filters for balancing 
>> metadata (which generally goes pretty fast).
>>
>> Unless you've got a huge filesystem, you can also cut down on that limit 
>> filter.  100 data chunks that are 40% full is up to 40GB of data to move on 
>> a normally sized filesystem, or potentially up to 200GB if you've got a 
>> really big filesystem (I forget what point BTRFS starts scaling up chunk 
>> sizes at, but I'm pretty sure it's in the TB range).
> 
> Thanks so much for the suggestions so far, everyone. I wanted to report back 
> on this. Last Friday I made the following changes per suggestions from this 
> thread:
> 
> 1. Change the nightly balance to the following:
> 
> btrfs balance start -dusage=20 
> btrfs balance start -dusage=40,limit=10 
> btrfs balance start -musage=30 
> 
> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD 
> space allocation fix.
> 
> 3. Boot Linux with the elevator=noop option
> 
> 4. Change /sys/block/xvd*/queue/scheduler to "none"
> 
> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.

So that's good; however, you didn't apply the out-of-tree patch I pointed you
at (it has already been merged into for-next, so it will likely land in 4.17).
As a result, when you hit your ENOSPC error there is no extra information
being printed, so we can't really reason about what might be going wrong in
the metadata flushing algorithms.
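
For reference, a rough sketch of backporting it (the commit id is the one from
the branch linked in my earlier mail; the package rebuild steps are placeholders
for however you build your Debian kernel):

# in a checkout of the source tree your Debian kernel package is built from
git remote add kdave https://github.com/kdave/btrfs-devel.git
git fetch kdave
git cherry-pick 1b816c23e91f70603c532af52cccf17e68393682   # enospc_debug dump patch
# then rebuild the kernel package, install it, and reboot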




> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: 
> errno=-28 No space left
> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: 
> errno=-28 No space left
> [496003.641996] BTRFS info (device xvdc): forced readonly
> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: 
> errno=-28 No space left
> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at  
>  (null)
> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> [496003.642087] PGD 8005fe08c067 P4D 8005fe08c067 PUD 3bd2f4067 PMD 0
> [496003.642091] Oops:  [#1] SMP PTI
> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev 
> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
> aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena 
> psmouse drm scsi_mod i2c_piix4 button
> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: GW   
> 4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.642130] task: 8fbffb8dd080 task.stack: 9e81c7b8c000
> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]


If you happen to have the vmlinux of that kernel, can you run the
following from the kernel source directory:

./scripts/faddr2line vmlinux __del_reloc_root+0x3c/0x100


> [496003.642151] RSP: 0018:9e81c7b8fab0 EFLAGS: 00010286
> [496003.642153] RAX:  RBX: 8fb90a10a3c0 RCX: 
> ca5d1fda5a5f
> [496003.642154] RDX: 0001 RSI: 8fc05eae62c0 RDI: 
> 8fbc4fd87d70
> [496003.642154] RBP: 8fbbb5139000 R08:  R09: 
> 
> [496003.642155] R10: 8fc05eae62c0 R11: 01bc R12: 
> 8fc0fbeac000
> [496003.642156] R13: 8fbc4fd87d70 R14: 8fbc4fd87800 R15: 
> ffe4
> [496003.642157] FS:  7f64196708c0() GS:8fc100a4() 
> knlGS:
> [496003.642159] CS:  0010 DS:  ES:  CR0: 80050033
> [496003.642160] CR2:  CR3: 00069b972004 CR4: 
> 001606e0
> [496003.642162] DR0:  DR1:  DR2: 
> 
> [496003.642163] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [496003.642164] Call Trace:
> [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> [496003.642309]  

Re: Ongoing Btrfs stability issues

2018-03-01 Thread Alex Adriaanse
On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn  wrote:
> I would suggest changing this to eliminate the balance with '-dusage=10' 
> (it's redundant with the '-dusage=20' one unless your filesystem is in 
> pathologically bad shape), and adding equivalent filters for balancing 
> metadata (which generally goes pretty fast).
> 
> Unless you've got a huge filesystem, you can also cut down on that limit 
> filter.  100 data chunks that are 40% full is up to 40GB of data to move on a 
> normally sized filesystem, or potentially up to 200GB if you've got a really 
> big filesystem (I forget what point BTRFS starts scaling up chunk sizes at, 
> but I'm pretty sure it's in the TB range).

Thanks so much for the suggestions so far, everyone. I wanted to report back on 
this. Last Friday I made the following changes per suggestions from this thread:

1. Change the nightly balance to the following:

btrfs balance start -dusage=20 
btrfs balance start -dusage=40,limit=10 
btrfs balance start -musage=30 

2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD 
space allocation fix.

3. Boot Linux with the elevator=noop option

4. Change /sys/block/xvd*/queue/scheduler to "none"

5. Mount all our Btrfs filesystems with the "enospc_debug" option.

6. I did NOT add the "nossd" flag because I didn't think it'd make much of a 
difference after that SSD space allocation fix.

7. After applying the above changes, ran a full balance on all the Btrfs 
filesystems. I also have not experimented with autodefrag yet.
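
For reference, changes 1 and 4 boil down to roughly the following on each VM
(the mount point and device names are placeholders):

#!/bin/sh
# nightly-balance.sh -- run from cron against each filesystem
MNT=/data                                   # placeholder mount point
btrfs balance start -dusage=20 "$MNT"
btrfs balance start -dusage=40,limit=10 "$MNT"
btrfs balance start -musage=30 "$MNT"

# change 4, applied at boot for each EBS device
echo none > /sys/block/xvdc/queue/scheduler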


Despite the changes above, we just experienced another crash this morning. 
Kernel message (with enospc_debug turned on for the given mountpoint):

[496003.170278] use_block_rsv: 46 callbacks suppressed
[496003.170279] BTRFS: block rsv returned -28
[496003.173875] [ cut here ]
[496003.177186] WARNING: CPU: 2 PID: 362 at 
/build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:8458 
btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[496003.185369] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev 
ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena 
psmouse drm scsi_mod i2c_piix4 button
[496003.218484] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: GW   
4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
[496003.224618] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[496003.228702] task: 8fc0fb6bd0c0 task.stack: 9e81c3ac
[496003.233081] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[496003.237220] RSP: 0018:9e81c3ac3958 EFLAGS: 00010282
[496003.241404] RAX: 001d RBX: 8fc0fbeac128 RCX: 

[496003.248004] RDX:  RSI: 8fc100a966f8 RDI: 
8fc100a966f8
[496003.253896] RBP: 4000 R08: 0001 R09: 
0001667b
[496003.258508] R10: 0001 R11: 0001667b R12: 
8fc0fbeac000
[496003.264759] R13: 8fc0fac22800 R14: 0001 R15: 
ffe4
[496003.271203] FS:  () GS:8fc100a8() 
knlGS:
[496003.278169] CS:  0010 DS:  ES:  CR0: 80050033
[496003.283917] CR2: 7efe00f36000 CR3: 000102a0a001 CR4: 
001606e0
[496003.290309] DR0:  DR1:  DR2: 

[496003.296985] DR3:  DR6: fffe0ff0 DR7: 
0400
[496003.303335] Call Trace:
[496003.307113]  ? __pagevec_lru_add_fn+0x270/0x270
[496003.312126]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[496003.316995]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[496003.321568]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[496003.326684]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[496003.332724]  ? set_extent_bit+0x19/0x20 [btrfs]
[496003.337991]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[496003.343436]  ? btrfs_merge_delayed_refs+0x8f/0x560 [btrfs]
[496003.349322]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[496003.355157]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[496003.360707]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[496003.366022]  ? remove_wait_queue+0x60/0x60
[496003.370898]  transaction_kthread+0x195/0x1b0 [btrfs]
[496003.376411]  kthread+0xfc/0x130
[496003.380741]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[496003.386404]  ? kthread_create_on_node+0x70/0x70
[496003.391287]  ? do_group_exit+0x3a/0xa0
[496003.396201]  ret_from_fork+0x1f/0x30
[496003.400779] Code: ff 

Re: Ongoing Btrfs stability issues

2018-02-17 Thread Shehbaz Jaffer
> First of all, the ssd mount option does not have anything to do with
> having single or DUP metadata.

Sorry about that, I agree with you. -nossd would not help increase
reliability in any way. One alternative would be to force duplication of
metadata when creating the filesystem on the SSD. But again, as you described,
there is the likelihood of consecutive writes of the original and the copy of
the metadata going to the same cell, which may not end up giving us good
reliability.
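
(Concretely, forcing duplicated metadata at mkfs time is just a matter of
overriding the default profile; the device name is a placeholder:)

mkfs.btrfs -m dup -d single /dev/xvdc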

> Also, instead of physically damaging flash cells inside your SSD, you
> are writing data to a perfectly working one. This is a different failure
> scenario.

By writing data to a working SSD, I am emulating byte and block
corruptions, which is a valid failure scenario. In this case, the read
operation on the SSD succeeds (no EIO from the device), but the blocks that
are read back are corrupted. Here, btrfs detects the checksum failures and
tries to correct them using the scrubber, but fails to do so on the SSD.

For the scenario of physically damaged flash cells that you mentioned,
I am currently performing experiments where I inject -EIO at the places
where btrfs tries reading or writing a block. This is to see how btrfs
handles failed block accesses to a damaged cell. Would that cover the
failure scenario you described? If not, could you elaborate on other
alternatives to emulate physically damaged flash cells?
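
(One coarser, device-level alternative I have been considering is dm-flakey,
which makes a device fail all I/O periodically; the device name and intervals
below are placeholders, not my exact setup:)

DEV=/dev/xvdc
SECTORS=$(blockdev --getsz "$DEV")
# the device behaves normally for 60s, then errors all I/O for 5s, repeatedly
dmsetup create flaky-btrfs --table "0 $SECTORS flakey $DEV 0 60 5"
mkfs.btrfs /dev/mapper/flaky-btrfs
mount /dev/mapper/flaky-btrfs /mnt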

> In any case, using DUP instead of single obviously increases the chance
> of recovery in case of failures that corrupt one copy of the data when
> it's travelling between system memory and disk, while you're sending two
> of them right after each other, so you're totally right that it's better
> to enable.

Yes, DUP is better than single; however, as you correctly pointed out, it
may not be a perfect solution to the problem.


On Sat, Feb 17, 2018 at 10:18 AM, Hans van Kranenburg
 wrote:
> On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote:
>>> It's hosted on an EBS volume; we don't use ephemeral storage at all. The 
>>> EBS volumes are all SSD
>>
>> I have recently done some SSD corruption experiments on small set of
>> workloads, so I thought I would share my experience.
>>
>> While creating btrfs using mkfs.btrfs command for SSDs, by default the
>> metadata duplication option is disabled. this renders btrfs-scrubbing
>> ineffective, as there are no redundant metadata to restore corrupted
>> metadata from.
>> So if there are any errors during read operation on SSD, unlike HDD
>> where the corruptions would be handled by btrfs scrub on the fly while
>> detecting checksum error, for SSD the read would fail as uncorrectable
>> error.
>
> First of all, the ssd mount option does not have anything to do with
> having single or DUP metadata.
>
> Well, both the things that happen by default (mkfs using single, mount
> enabling the ssd option) are happening because of the lookup result on
> the rotational flag, but that's all.
>
>> Could you confirm if metadata DUP is enabled for your system by
>> running the following cmd:
>>
>> $btrfs fi df /mnt # mount is the mount point
>> Data, single: total=8.00MiB, used=64.00KiB
>> System, single: total=4.00MiB, used=16.00KiB
>> Metadata, single: total=168.00MiB, used=112.00KiB
>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>
>> If metadata is single in your case as well (and not DUP), that may be
>> the problem for btrfs-scrub not working effectively on the fly
>> (mid-stream bit-rot correction), causing reliability issues. A couple
>> of such bugs that are observed specifically for SSDs is reported here:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=198463
>> https://bugzilla.kernel.org/show_bug.cgi?id=198807
>
> Here you show that when you have 'single' metadata, there's no copy to
> recover from. This is expected.
>
> Also, instead of physically damaging flash cells inside your SSD, you
> are writing data to a perfectly working one. This is a different failure
> scenario.
>
> One of the reasons to turn off DUP for metadata by default on SSD is
> (from man mkfs.btrfs):
>
> "The controllers may put data written in a short timespan into the
> same physical storage unit (cell, block etc). In case this unit dies,
> both copies are lost. BTRFS does not add any artificial delay between
> metadata writes." .. "The traditional rotational hard drives usually
> fail at the sector level."
>
> And, of course, in case of EBS, you don't have any idea at all where the
> data actually ends up, since it's talking to a black box service, and
> not an SSD.
>
> In any case, using DUP instead of single obviously increases the chance
> of recovery in case of failures that corrupt one copy of the data when
> it's travelling between system memory and disk, while you're sending two
> of them right after each other, so you're totally right that it's better
> to enable.
>
>> These do not occur for HDD, and I believe should not occur when
>> filesystem is mounted with nossd mode.
>
> So to 

Re: Ongoing Btrfs stability issues

2018-02-17 Thread Hans van Kranenburg
On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote:
>> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
>> volumes are all SSD
> 
> I have recently done some SSD corruption experiments on small set of
> workloads, so I thought I would share my experience.
> 
> While creating btrfs using mkfs.btrfs command for SSDs, by default the
> metadata duplication option is disabled. this renders btrfs-scrubbing
> ineffective, as there are no redundant metadata to restore corrupted
> metadata from.
> So if there are any errors during read operation on SSD, unlike HDD
> where the corruptions would be handled by btrfs scrub on the fly while
> detecting checksum error, for SSD the read would fail as uncorrectable
> error.

First of all, the ssd mount option does not have anything to do with
having single or DUP metadata.

Well, both the things that happen by default (mkfs using single, mount
enabling the ssd option) are happening because of the lookup result on
the rotational flag, but that's all.

> Could you confirm if metadata DUP is enabled for your system by
> running the following cmd:
> 
> $btrfs fi df /mnt # mount is the mount point
> Data, single: total=8.00MiB, used=64.00KiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=168.00MiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> If metadata is single in your case as well (and not DUP), that may be
> the problem for btrfs-scrub not working effectively on the fly
> (mid-stream bit-rot correction), causing reliability issues. A couple
> of such bugs that are observed specifically for SSDs is reported here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198463
> https://bugzilla.kernel.org/show_bug.cgi?id=198807

Here you show that when you have 'single' metadata, there's no copy to
recover from. This is expected.

Also, instead of physically damaging flash cells inside your SSD, you
are writing data to a perfectly working one. This is a different failure
scenario.

One of the reasons to turn off DUP for metadata by default on SSD is
(from man mkfs.btrfs):

"The controllers may put data written in a short timespan into the
same physical storage unit (cell, block etc). In case this unit dies,
both copies are lost. BTRFS does not add any artificial delay between
metadata writes." .. "The traditional rotational hard drives usually
fail at the sector level."

And, of course, in case of EBS, you don't have any idea at all where the
data actually ends up, since it's talking to a black box service, and
not an SSD.

In any case, using DUP instead of single obviously increases the chance
of recovery from failures that corrupt one copy of the data while it's
travelling between system memory and disk, even though you're sending the two
copies right after each other, so you're totally right that it's better
to enable.

> These do not occur for HDD, and I believe should not occur when
> filesystem is mounted with nossd mode.

So to reiterate, mounting nossd does not make your metadata writes DUP.

> On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
>> excerpted:
>>
>>> This will probably sound like an odd question, but does BTRFS think your
>>> storage devices are SSD's or not?  Based on what you're saying, it
>>> sounds like you're running into issues resulting from the
>>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>>> recently.
>>>
>>> You can verify if this is what's causing your problems or not by either
>>> upgrading to a recent mainline kernel version (I know the changes are in
>>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>>> think they are), or by adding 'nossd' to your mount options, and then
>>> seeing if you still have the problems or not (I suspect this is only
>>> part of it, and thus changing this will reduce the issues, but not
>>> completely eliminate them).  Make sure and run a full balance after
>>> changing either item, as the aforementioned 'optimizations' have an
>>> impact on how data is organized on-disk (which is ultimately what causes
>>> the issues), so they will have a lingering effect if you don't balance
>>> everything.
>>
>> According to the wiki, 4.14 does indeed have the ssd changes.
>>
>> According to the bug, he's running 4.13.x on one server and 4.14.x on
>> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>>
>> However, without a full balance it /will/ take some time to settle down
>> (again, assuming btrfs was using ssd mode), so the lingering effect could
>> still be creating problems on the 4.14 kernel servers for the moment.

-- 
Hans van Kranenburg


Re: Ongoing Btrfs stability issues

2018-02-16 Thread Shehbaz Jaffer
>It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
>volumes are all SSD

I have recently done some SSD corruption experiments on small set of
workloads, so I thought I would share my experience.

When creating btrfs with the mkfs.btrfs command on SSDs, the metadata
duplication option is disabled by default. This renders btrfs scrubbing
ineffective, as there is no redundant copy of the metadata to restore
corrupted metadata from.
So if there are any errors during a read operation on an SSD, the read fails
as an uncorrectable error, unlike on an HDD where the corruption would be
handled by btrfs scrub on the fly when the checksum error is detected.

Could you confirm whether metadata DUP is enabled for your system by
running the following cmd:

$ btrfs fi df /mnt    # /mnt is the mount point
Data, single: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=168.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

If metadata is single in your case as well (and not DUP), that may be
why btrfs scrub is not working effectively on the fly
(mid-stream bit-rot correction), causing reliability issues. A couple
of such bugs observed specifically for SSDs are reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=198463
https://bugzilla.kernel.org/show_bug.cgi?id=198807
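
(If it does turn out to be single, existing metadata can be converted to DUP
online with a convert balance; the mount point is a placeholder:)

btrfs balance start -mconvert=dup /mnt
btrfs fi df /mnt      # Metadata should now show "DUP"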

These do not occur on HDDs, and I believe they should not occur when the
filesystem is mounted in nossd mode.

On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
> excerpted:
>
>> This will probably sound like an odd question, but does BTRFS think your
>> storage devices are SSD's or not?  Based on what you're saying, it
>> sounds like you're running into issues resulting from the
>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>> recently.
>>
>> You can verify if this is what's causing your problems or not by either
>> upgrading to a recent mainline kernel version (I know the changes are in
>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>> think they are), or by adding 'nossd' to your mount options, and then
>> seeing if you still have the problems or not (I suspect this is only
>> part of it, and thus changing this will reduce the issues, but not
>> completely eliminate them).  Make sure and run a full balance after
>> changing either item, as the aforementioned 'optimizations' have an
>> impact on how data is organized on-disk (which is ultimately what causes
>> the issues), so they will have a lingering effect if you don't balance
>> everything.
>
> According to the wiki, 4.14 does indeed have the ssd changes.
>
> According to the bug, he's running 4.13.x on one server and 4.14.x on
> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>
> However, without a full balance it /will/ take some time to settle down
> (again, assuming btrfs was using ssd mode), so the lingering effect could
> still be creating problems on the 4.14 kernel servers for the moment.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>



-- 
Shehbaz Jaffer


Re: Ongoing Btrfs stability issues

2018-02-16 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not?  Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
> 
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them).  Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on 
two.  So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down 
(again, assuming btrfs was using ssd mode), so the lingering effect could 
still be creating problems on the 4.14 kernel servers for the moment.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Ongoing Btrfs stability issues

2018-02-16 Thread Austin S. Hemmelgarn

On 2018-02-15 11:18, Alex Adriaanse wrote:

> We've been using Btrfs in production on AWS EC2 with EBS devices for over 2
> years. There is so much I love about Btrfs: CoW snapshots, compression,
> subvolumes, flexibility, the tools, etc. However, lack of stability has been a
> serious ongoing issue for us, and we're getting to the point that it's becoming
> hard to justify continuing to use it unless we make some changes that will get
> it stable. The instability manifests itself mostly in the form of the VM
> completely crashing, I/O operations freezing, or the filesystem going into
> readonly mode. We've spent an enormous amount of time trying to recover
> corrupted filesystems, and the time that servers were down as a result of Btrfs
> instability has accumulated to many days.
>
> We've made many changes to try to improve Btrfs stability: upgrading to newer
> kernels, setting up nightly balances, setting up monitoring to ensure our
> filesystems stay under 70% utilization, etc. This has definitely helped quite a
> bit, but even with these things in place it's still unstable. Take
> https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created
> yesterday: we've had 4 VMs (out of 20) go down over the past week alone because
> of Btrfs errors. Thankfully, no data was lost, but I did have to copy
> everything over to a new filesystem.
>
> Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O
> utilization is often pegged at 100%). The filesystems that get little I/O seem
> pretty stable, but the ones that undergo a lot of I/O activity are the ones
> that suffer from the most instability problems. We run the following balances
> on every filesystem every night:
>
> btrfs balance start -dusage=10 
> btrfs balance start -dusage=20 
> btrfs balance start -dusage=40,limit=100 
I would suggest changing this to eliminate the balance with '-dusage=10' 
(it's redundant with the '-dusage=20' one unless your filesystem is in 
pathologically bad shape), and adding equivalent filters for balancing 
metadata (which generally goes pretty fast).


Unless you've got a huge filesystem, you can also cut down on that limit 
filter.  100 data chunks that are 40% full is up to 40GB of data to move 
on a normally sized filesystem, or potentially up to 200GB if you've got 
a really big filesystem (I forget what point BTRFS starts scaling up 
chunk sizes at, but I'm pretty sure it's in the TB range).


> We also use the following btrfs-snap cronjobs to implement rotating snapshots,
> with short-term snapshots taking place every 15 minutes and less frequent ones
> being retained for up to 3 days:
>
> 0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r  23
> 15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r  15m 3
> 0 0 * * * /opt/btrfs-snap/btrfs-snap -r  daily 3
>
> Our filesystems are mounted with the "compress=lzo" option.
>
> Are we doing something wrong? Are there things we should change to improve
> stability? I wouldn't be surprised if eliminating snapshots would stabilize
> things, but if we do that we might as well be using a filesystem like XFS. Are
> there fixes queued up that will solve the problems listed in the Bugzilla
> ticket referenced above? Or is our I/O-intensive workload just not a good fit
> for Btrfs?


This will probably sound like an odd question, but does BTRFS think your 
storage devices are SSD's or not?  Based on what you're saying, it 
sounds like you're running into issues resulting from the 
over-aggressive SSD 'optimizations' that were done by BTRFS until very 
recently.


You can verify if this is what's causing your problems or not by either 
upgrading to a recent mainline kernel version (I know the changes are in 
4.15, I don't remember for certain if they're in 4.14 or not, but I 
think they are), or by adding 'nossd' to your mount options, and then 
seeing if you still have the problems or not (I suspect this is only 
part of it, and thus changing this will reduce the issues, but not 
completely eliminate them).  Make sure and run a full balance after 
changing either item, as the aforementioned 'optimizations' have an 
impact on how data is organized on-disk (which is ultimately what causes 
the issues), so they will have a lingering effect if you don't balance 
everything.
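
(Concretely, that would be something like the following; the mount point is a
placeholder, and a full balance can take a long time on a busy filesystem:)

mount -o remount,nossd /data
btrfs balance start /data      # full balance, no filters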


'autodefrag' is the other mount option that I would try toggling (turn 
it off if you've got it on, or on if you've got it off).  I doubt it 
will have much impact, but it does change how things end up on disk.


In addition to all that, make sure your monitoring isn't just looking 
at the regular `df` command's output; it's woefully insufficient for 
monitoring space usage on BTRFS.  If you want to check things properly, 
you want to be looking at the data in /sys/fs/btrfs//allocation, 
more specifically checking the following percentages:


1. The sum of the values in /sys/fs/btrfs/relative to the sum total of the size of the block devices for the 
filesystem.
2. The ratio of 
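
(A rough sketch of reading those values; the UUID is a placeholder, and the
exact thresholds are up to you:)

UUID=00000000-0000-0000-0000-000000000000    # placeholder filesystem UUID
A=/sys/fs/btrfs/$UUID/allocation
for t in data metadata system; do
    used=$(cat "$A/$t/bytes_used")
    total=$(cat "$A/$t/total_bytes")
    echo "$t: $((100 * used / total))% of allocated chunks in use"
done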

Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov


On 16.02.2018 06:54, Alex Adriaanse wrote:
> 
>> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov  wrote:
>>
>> On 15.02.2018 21:41, Alex Adriaanse wrote:
>>>
 On Feb 15, 2018, at 12:00 PM, Nikolay Borisov  wrote:

 So in all of the cases you are hitting some form of premature enospc.
 There was a fix that landed in 4.15 that should have fixed a rather
 long-standing issue with the way metadata reservations are satisfied,
 namely:

 996478ca9c46 ("btrfs: change how we decide to commit transactions during
 flushing").

 That commit was introduced in 4.14.3 stable kernel. Since you are not
 using upstream kernel I'd advise you check whether the respective commit
 is contained in the kernel versions you are using.

 Other than that in the reports you mentioned there is one crash in
 __del_reloc_root which looks rather interesting, at the very least it
 shouldn't crash...
>>>
>>> I checked the Debian source code that's used for building the kernels that 
>>> we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 
>>> contain the changes associated with the commit you referenced. So crash 
>>> instances #2, #3, and #4 at 
>>> https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels 
>>> that contain this fix already.
>>>
>>> Could it be that some on-disk data structures got (silently) corrupted 
>>> while we were running pre-4.14.7 kernels, and the aforementioned fix 
>>> doesn't address anything relating to damage that has already been done? If 
>>> so, is there a way to detect and/or repair this for existing filesystems 
>>> other than running a "btrfs check --repair" or rebuilding filesystems (both 
>>> of which require a significant amount of downtime)?
>>
>> From the logs provided I can see only a single crash, the others are
>> just ENOSPC which can cause corruption due to delayed refs (in majority
>> of examples) not finishing. Is btrfs hosted on the EBS volume or on the
>> ephemeral storage of the instance? Is the EBS an ssd? If it's ssd are
>> you using an io scheduler for those ebs devices? You ca check what the
>> io scheduler for a device is by reading the following sysfs file:
>>
>> /sys/block//queue/scheduler
> 
> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
> volumes are all SSD. We didn't change the default schedulers on the VMs and 
> it looks like it's using mq-deadline:
> 
> $ cat /sys/block/xvdc/queue/scheduler
> [mq-deadline] none

So one thing I can advise testing is setting the scheduler for that xvdc
device to none. Next, I'd advise you to backport the following patch to your kernel:
https://github.com/kdave/btrfs-devel/commit/1b816c23e91f70603c532af52cccf17e68393682

then mount the filesystem with -o enospc_debug. The next time an
ENOSPC occurs, additional info should be printed in dmesg with the state
of the space_info structure.
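
(Roughly, with xvdc and a placeholder mount point:)

echo none > /sys/block/xvdc/queue/scheduler
mount -o remount,enospc_debug /data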

> 
> Alex
> 


Re: Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse

> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov  wrote:
> 
> On 15.02.2018 21:41, Alex Adriaanse wrote:
>> 
>>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov  wrote:
>>> 
>>> So in all of the cases you are hitting some form of premature enospc.
>>> There was a fix that landed in 4.15 that should have fixed a rather
>>> long-standing issue with the way metadata reservations are satisfied,
>>> namely:
>>> 
>>> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
>>> flushing").
>>> 
>>> That commit was introduced in 4.14.3 stable kernel. Since you are not
>>> using upstream kernel I'd advise you check whether the respective commit
>>> is contained in the kernel versions you are using.
>>> 
>>> Other than that in the reports you mentioned there is one crash in
>>> __del_reloc_root which looks rather interesting, at the very least it
>>> shouldn't crash...
>> 
>> I checked the Debian source code that's used for building the kernels that 
>> we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 
>> contain the changes associated with the commit you referenced. So crash 
>> instances #2, #3, and #4 at 
>> https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels 
>> that contain this fix already.
>> 
>> Could it be that some on-disk data structures got (silently) corrupted while 
>> we were running pre-4.14.7 kernels, and the aforementioned fix doesn't 
>> address anything relating to damage that has already been done? If so, is 
>> there a way to detect and/or repair this for existing filesystems other than 
>> running a "btrfs check --repair" or rebuilding filesystems (both of which 
>> require a significant amount of downtime)?
> 
> From the logs provided I can see only a single crash, the others are
> just ENOSPC which can cause corruption due to delayed refs (in majority
> of examples) not finishing. Is btrfs hosted on the EBS volume or on the
> ephemeral storage of the instance? Is the EBS an ssd? If it's ssd are
> you using an io scheduler for those ebs devices? You ca check what the
> io scheduler for a device is by reading the following sysfs file:
> 
> /sys/block//queue/scheduler

It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
volumes are all SSD. We didn't change the default schedulers on the VMs and it 
looks like it's using mq-deadline:

$ cat /sys/block/xvdc/queue/scheduler
[mq-deadline] none

Alex


Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov


On 15.02.2018 21:41, Alex Adriaanse wrote:
> 
>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov  wrote:
>>
>> So in all of the cases you are hitting some form of premature enospc.
>> There was a fix that landed in 4.15 that should have fixed a rather
>> long-standing issue with the way metadata reservations are satisfied,
>> namely:
>>
>> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
>> flushing").
>>
>> That commit was introduced in 4.14.3 stable kernel. Since you are not
>> using upstream kernel I'd advise you check whether the respective commit
>> is contained in the kernel versions you are using.
>>
>> Other than that in the reports you mentioned there is one crash in
>> __del_reloc_root which looks rather interesting, at the very least it
>> shouldn't crash...
> 
> I checked the Debian source code that's used for building the kernels that we 
> run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain 
> the changes associated with the commit you referenced. So crash instances #2, 
> #3, and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all 
> running kernels that contain this fix already.
> 
> Could it be that some on-disk data structures got (silently) corrupted while 
> we were running pre-4.14.7 kernels, and the aforementioned fix doesn't 
> address anything relating to damage that has already been done? If so, is 
> there a way to detect and/or repair this for existing filesystems other than 
> running a "btrfs check --repair" or rebuilding filesystems (both of which 
> require a significant amount of downtime)?

From the logs provided I can see only a single crash; the others are
just ENOSPC, which can cause corruption due to delayed refs (in the majority
of examples) not finishing. Is btrfs hosted on the EBS volume or on the
ephemeral storage of the instance? Is the EBS volume an SSD? If it's an SSD, are
you using an I/O scheduler for those EBS devices? You can check what the
I/O scheduler for a device is by reading the following sysfs file:

/sys/block//queue/scheduler


> 
> Thanks,
> 
> Alex
> 


Re: Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse

> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov  wrote:
> 
> So in all of the cases you are hitting some form of premature enospc.
> There was a fix that landed in 4.15 that should have fixed a rather
> long-standing issue with the way metadata reservations are satisfied,
> namely:
> 
> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
> flushing").
> 
> That commit was introduced in 4.14.3 stable kernel. Since you are not
> using upstream kernel I'd advise you check whether the respective commit
> is contained in the kernel versions you are using.
> 
> Other than that in the reports you mentioned there is one crash in
> __del_reloc_root which looks rather interesting, at the very least it
> shouldn't crash...

I checked the Debian source code that's used for building the kernels that we 
run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain the 
changes associated with the commit you referenced. So crash instances #2, #3, 
and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running 
kernels that contain this fix already.

Could it be that some on-disk data structures got (silently) corrupted while we 
were running pre-4.14.7 kernels, and the aforementioned fix doesn't address 
anything relating to damage that has already been done? If so, is there a way 
to detect and/or repair this for existing filesystems other than running a 
"btrfs check --repair" or rebuilding filesystems (both of which require a 
significant amount of downtime)?

Thanks,

Alex


Re: Ongoing Btrfs stability issues

2018-02-15 Thread Nikolay Borisov


On 15.02.2018 18:18, Alex Adriaanse wrote:
> We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 
> years. There is so much I love about Btrfs: CoW snapshots, compression, 
> subvolumes, flexibility, the tools, etc. However, lack of stability has been 
> a serious ongoing issue for us, and we're getting to the point that it's 
> becoming hard to justify continuing to use it unless we make some changes 
> that will get it stable. The instability manifests itself mostly in the form 
> of the VM completely crashing, I/O operations freezing, or the filesystem 
> going into readonly mode. We've spent an enormous amount of time trying to 
> recover corrupted filesystems, and the time that servers were down as a 
> result of Btrfs instability has accumulated to many days.
> 
> We've made many changes to try to improve Btrfs stability: upgrading to newer 
> kernels, setting up nightly balances, setting up monitoring to ensure our 
> filesystems stay under 70% utilization, etc. This has definitely helped quite 
> a bit, but even with these things in place it's still unstable. Take 
> https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I 
> created yesterday: we've had 4 VMs (out of 20) go down over the past week 
> alone because of Btrfs errors. Thankfully, no data was lost, but I did have 
> to copy everything over to a new filesystem.

So in all of the cases you are hitting some form of premature enospc.
There was a fix that landed in 4.15 that should have fixed a rather
long-standing issue with the way metadata reservations are satisfied,
namely:

996478ca9c46 ("btrfs: change how we decide to commit transactions during
flushing").

That commit was introduced in the 4.14.3 stable kernel. Since you are not
using an upstream kernel, I'd advise you to check whether the respective commit
is contained in the kernel versions you are using.
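
(One quick way to check against the stable tree, since stable backports carry
the upstream SHA in their commit message; the tag is whatever version you run:)

# in a linux-stable checkout
git log --oneline v4.14..v4.14.13 --grep=996478ca9c46
# the changelog of your distro kernel package is the other obvious place to look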

Other than that in the reports you mentioned there is one crash in
__del_reloc_root which looks rather interesting, at the very least it
shouldn't crash...

> Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O 
> utilization is often pegged at 100%). The filesystems that get little I/O 
> seem pretty stable, but the ones that undergo a lot of I/O activity are the 
> ones that suffer from the most instability problems. We run the following 
> balances on every filesystem every night:
> 
> btrfs balance start -dusage=10 
> btrfs balance start -dusage=20 
> btrfs balance start -dusage=40,limit=100 
> 
> We also use the following btrfs-snap cronjobs to implement rotating 
> snapshots, with short-term snapshots taking place every 15 minutes and less 
> frequent ones being retained for up to 3 days:
> 
> 0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r  23
> 15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r  15m 3
> 0 0 * * * /opt/btrfs-snap/btrfs-snap -r  daily 3
> 
> Our filesystems are mounted with the "compress=lzo" option.
> 
> Are we doing something wrong? Are there things we should change to improve 
> stability? I wouldn't be surprised if eliminating snapshots would stabilize 
> things, but if we do that we might as well be using a filesystem like XFS. 
> Are there fixes queued up that will solve the problems listed in the Bugzilla 
> ticket referenced above? Or is our I/O-intensive workload just not a good fit 
> for Btrfs?
> 
> Thanks,
> 
> Alex
> 


Ongoing Btrfs stability issues

2018-02-15 Thread Alex Adriaanse
We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 
years. There is so much I love about Btrfs: CoW snapshots, compression, 
subvolumes, flexibility, the tools, etc. However, lack of stability has been a 
serious ongoing issue for us, and we're getting to the point that it's becoming 
hard to justify continuing to use it unless we make some changes that will get 
it stable. The instability manifests itself mostly in the form of the VM 
completely crashing, I/O operations freezing, or the filesystem going into 
readonly mode. We've spent an enormous amount of time trying to recover 
corrupted filesystems, and the time that servers were down as a result of Btrfs 
instability has accumulated to many days.

We've made many changes to try to improve Btrfs stability: upgrading to newer 
kernels, setting up nightly balances, setting up monitoring to ensure our 
filesystems stay under 70% utilization, etc. This has definitely helped quite a 
bit, but even with these things in place it's still unstable. Take 
https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created 
yesterday: we've had 4 VMs (out of 20) go down over the past week alone because 
of Btrfs errors. Thankfully, no data was lost, but I did have to copy 
everything over to a new filesystem.

Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O 
utilization is often pegged at 100%). The filesystems that get little I/O seem 
pretty stable, but the ones that undergo a lot of I/O activity are the ones 
that suffer from the most instability problems. We run the following balances 
on every filesystem every night:

btrfs balance start -dusage=10 
btrfs balance start -dusage=20 
btrfs balance start -dusage=40,limit=100 

We also use the following btrfs-snap cronjobs to implement rotating snapshots, 
with short-term snapshots taking place every 15 minutes and less frequent ones 
being retained for up to 3 days:

0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r  23
15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r  15m 3
0 0 * * * /opt/btrfs-snap/btrfs-snap -r  daily 3

Our filesystems are mounted with the "compress=lzo" option.

Are we doing something wrong? Are there things we should change to improve 
stability? I wouldn't be surprised if eliminating snapshots would stabilize 
things, but if we do that we might as well be using a filesystem like XFS. Are 
there fixes queued up that will solve the problems listed in the Bugzilla 
ticket referenced above? Or is our I/O-intensive workload just not a good fit 
for Btrfs?

Thanks,

Alex