Re: Problem with file system

2017-11-08 Thread Austin S. Hemmelgarn

On 2017-11-08 13:31, Chris Murphy wrote:

On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn
 wrote:

On 2017-11-08 12:54, Chris Murphy wrote:


On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills  wrote:


On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:


On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
 wrote:


It definitely does fix ups during normal operations. During reads, if
there's a UNC or there's corruption detected, Btrfs gets the good
copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
don't just happen with scrubbing. Even raid56 supports these kinds of
passive fixups back to disk.



I could have sworn it didn't rewrite the data on-disk during normal
usage.
I mean, I know for certain that it will return the correct data to
userspace
if at all possible, but I was under the impression it will just log the
error during normal operation.



No, everything except raid56 has had it for a long time; I can't
even remember how far back, maybe even before 3.0. Raid56 got it
in 4.12.



 Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).



Yeah, around the original code for multiple devices I think. Anyway,
this is what the fixups look like between scrub and normal read on
raid1. Hilariously the error reporting is radically different.

These are the kernel messages from a scrub detecting and repairing
data file corruption. This was 5120 bytes corrupted, so all of one
block and part of another.


[244964.589522] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
offset 0, length 4096, links 1 (path: test.bin)
[244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[244964.650239] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/vg-2
[244964.650612] BTRFS warning (device dm-6): checksum error at logical
1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
offset 4096, length 4096, links 1 (path: test.bin)
[244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 2, gen 0
[244964.683586] BTRFS error (device dm-6): fixed up error at logical
1103630336 on dev /dev/mapper/vg-2
[root@f26s test]#


Exact same corruption (same device and offset), but normal read of the
file.

[245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.638901] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
[245721.639608] BTRFS info (device dm-6): read error corrected: ino
257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
[245747.280718]


scrub considers the fixup an error, normal read considers it info; but
there's more useful information in the scrub output I think. I'd
really like to see the warning make it clear whether this is metadata
or data corruption though. From the above you have to infer it,
because of the inode reference.


OK, that actually explains why I had this incorrect assumption.  I've not
delved all that deep into that code, so I have no reference there, but
looking at the two messages, the scrub message makes it very clear that the
error was fixed, whereas the phrasing in the case of a normal read is kind
of ambiguous (as I see it, 'read error corrected' could mean that it was
actually repaired (fixed as scrub says), or that the error was corrected in
BTRFS by falling back to the old copy, and I assumed the second case given
the context).

As far as the whole warning versus info versus error thing, I actually think
_that_ makes some sense.  If things got fixed, it's not exactly an error,
even though it would be nice to have some consistency there.  For scrub
however, it makes sense to have it all be labeled as an 'error' because
otherwise the log entries will be incomplete if dmesg is not set to report
anything less than an error (and the three lines are functionally _one_
entry).  I can also kind of understand scrub reporting error counts, but
regular reads not doing so (scrub is a diagnostic and repair tool, regular
reads aren't).
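
To make that concrete (just a sketch of how the level filtering plays out,
using util-linux dmesg; the exact output depends on the logging setup):
with the messages above, filtering on severity keeps the scrub counter and
'fixed up' lines but drops the rest.

    # only 'err' and above: the scrub counter and 'fixed up error' lines
    # survive, but the 'checksum error' / 'csum failed' warnings and the
    # normal-read 'read error corrected' info lines are filtered out
    dmesg --level=err

    # include warn and info to see the complete picture for both paths
    dmesg --level=err,warn,info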



I just did those corruptions as a test, and following the normal read
fixup, a subsequent scrub finds no problems. And in both cases
debug-tree shows pretty much identical metadata, at least the same
chunks are intact and the tree the file is located in has the same

Re: Problem with file system

2017-11-08 Thread Chris Murphy
On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn
 wrote:
> On 2017-11-08 12:54, Chris Murphy wrote:
>>
>> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills  wrote:
>>>
>>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:

 On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
  wrote:

>> It definitely does fix ups during normal operations. During reads, if
>> there's a UNC or there's corruption detected, Btrfs gets the good
>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>> don't just happen with scrubbing. Even raid56 supports these kinds of
>> passive fixups back to disk.
>
>
> I could have sworn it didn't rewrite the data on-disk during normal
> usage.
> I mean, I know for certain that it will return the correct data to
> userspace
> if at all possible, but I was under the impression it will just log the
> error during normal operation.


 No, everything except raid56 has had it for a long time; I can't
 even remember how far back, maybe even before 3.0. Raid56 got it
 in 4.12.
>>>
>>>
>>> Yes, I'm pretty sure it's been like that ever since I've been using
>>> btrfs (somewhere around the early neolithic).
>>>
>>
>> Yeah, around the original code for multiple devices I think. Anyway,
>> this is what the fixups look like between scrub and normal read on
>> raid1. Hilariously the error reporting is radically different.
>>
>> These are the kernel messages from a scrub detecting and repairing
>> data file corruption. This was 5120 bytes corrupted, so all of one
>> block and part of another.
>>
>>
>> [244964.589522] BTRFS warning (device dm-6): checksum error at logical
>> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
>> offset 0, length 4096, links 1 (path: test.bin)
>> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>> wr 0, rd 0, flush 0, corrupt 1, gen 0
>> [244964.650239] BTRFS error (device dm-6): fixed up error at logical
>> 1103626240 on dev /dev/mapper/vg-2
>> [244964.650612] BTRFS warning (device dm-6): checksum error at logical
>> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
>> offset 4096, length 4096, links 1 (path: test.bin)
>> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>> wr 0, rd 0, flush 0, corrupt 2, gen 0
>> [244964.683586] BTRFS error (device dm-6): fixed up error at logical
>> 1103630336 on dev /dev/mapper/vg-2
>> [root@f26s test]#
>>
>>
>> Exact same corruption (same device and offset), but normal read of the
>> file.
>>
>> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>> [245721.638901] BTRFS info (device dm-6): read error corrected: ino
>> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
>> [245721.639608] BTRFS info (device dm-6): read error corrected: ino
>> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
>> [245747.280718]
>>
>>
>> scrub considers the fixup an error, normal read considers it info; but
>> there's more useful information in the scrub output I think. I'd
>> really like to see the warning make it clear whether this is metadata
>> or data corruption though. From the above you have to infer it,
>> because of the inode reference.
>
> OK, that actually explains why I had this incorrect assumption.  I've not
> delved all that deep into that code, so I have no reference there, but
> looking at the two messages, the scrub message makes it very clear that the
> error was fixed, whereas the phrasing in the case of a normal read is kind
> of ambiguous (as I see it, 'read error corrected' could mean that it was
> actually repaired (fixed as scrub says), or that the error was corrected in
> BTRFS by falling back to the old copy, and I assumed the second case given
> the context).
>
> As far as the whole warning versus info versus error thing, I actually think
> _that_ makes some sense.  If things got fixed, it's not exactly an error,
> even though it would be nice to have some consistency there.  For scrub
> however, it makes sense to have it all be labeled as an 'error' because
> otherwise the log entries will be incomplete if dmesg is not set to report
> anything less than an error (and the three lines are functionally _one_
> entry).  I can also kind of understand scrub reporting error counts, but
> regular reads not doing so (scrub is a diagnostic and repair tool, regular
> reads 

Re: Problem with file system

2017-11-08 Thread Austin S. Hemmelgarn

On 2017-11-08 12:54, Chris Murphy wrote:

On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills  wrote:

On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:

On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
 wrote:


It definitely does fix ups during normal operations. During reads, if
there's a UNC or there's corruption detected, Btrfs gets the good
copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
don't just happen with scrubbing. Even raid56 supports these kinds of
passive fixups back to disk.


I could have sworn it didn't rewrite the data on-disk during normal usage.
I mean, I know for certain that it will return the correct data to userspace
if at all possible, but I was under the impression it will just log the
error during normal operation.


No, everything except raid56 has had it for a long time; I can't
even remember how far back, maybe even before 3.0. Raid56 got it
in 4.12.


Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).



Yeah, around the original code for multiple devices I think. Anyway,
this is what the fixups look like between scrub and normal read on
raid1. Hilariously the error reporting is radically different.

These are the kernel messages from a scrub detecting and repairing
data file corruption. This was 5120 bytes corrupted, so all of one
block and part of another.


[244964.589522] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
offset 0, length 4096, links 1 (path: test.bin)
[244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[244964.650239] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/vg-2
[244964.650612] BTRFS warning (device dm-6): checksum error at logical
1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
offset 4096, length 4096, links 1 (path: test.bin)
[244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 2, gen 0
[244964.683586] BTRFS error (device dm-6): fixed up error at logical
1103630336 on dev /dev/mapper/vg-2
[root@f26s test]#


Exact same corruption (same device and offset), but normal read of the file.

[245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.638901] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
[245721.639608] BTRFS info (device dm-6): read error corrected: ino
257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
[245747.280718]


scrub considers the fixup an error, normal read considers it info; but
there's more useful information in the scrub output I think. I'd
really like to see the warning make it clear whether this is metadata
or data corruption though. From the above you have to infer it,
because of the inode reference.
OK, that actually explains why I had this incorrect assumption.  I've 
not delved all that deep into that code, so I have no reference there, 
but looking at the two messages, the scrub message makes it very clear 
that the error was fixed, whereas the phrasing in the case of a normal 
read is kind of ambiguous (as I see it, 'read error corrected' could 
mean that it was actually repaired (fixed as scrub says), or that the 
error was corrected in BTRFS by falling back to the old copy, and I 
assumed the second case given the context).


As far as the whole warning versus info versus error thing, I actually 
think _that_ makes some sense.  If things got fixed, it's not exactly an 
error, even though it would be nice to have some consistency there.  For 
scrub however, it makes sense to have it all be labeled as an 'error' 
because otherwise the log entries will be incomplete if dmesg is not set 
to report anything less than an error (and the three lines are 
functionally _one_ entry).  I can also kind of understand scrub 
reporting error counts, but regular reads not doing so (scrub is a 
diagnostic and repair tool, regular reads aren't).



Re: Problem with file system

2017-11-08 Thread Chris Murphy
On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills  wrote:
> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>> >> It definitely does fix ups during normal operations. During reads, if
>> >> there's a UNC or there's corruption detected, Btrfs gets the good
>> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>> >> don't just happen with scrubbing. Even raid56 supports these kinds of
>> >> passive fixups back to disk.
>> >
>> > I could have sworn it didn't rewrite the data on-disk during normal usage.
>> > I mean, I know for certain that it will return the correct data to 
>> > userspace
>> > if at all possible, but I was under the impression it will just log the
>> > error during normal operation.
>>
>> No, everything except raid56 has had it for a long time; I can't
>> even remember how far back, maybe even before 3.0. Raid56 got it
>> in 4.12.
>
>Yes, I'm pretty sure it's been like that ever since I've been using
> btrfs (somewhere around the early neolithic).
>

Yeah, around the original code for multiple devices I think. Anyway,
this is what the fixups look like between scrub and normal read on
raid1. Hilariously the error reporting is radically different.

These are the kernel messages from a scrub detecting and repairing
data file corruption. This was 5120 bytes corrupted, so all of one
block and part of another.


[244964.589522] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
offset 0, length 4096, links 1 (path: test.bin)
[244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[244964.650239] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/vg-2
[244964.650612] BTRFS warning (device dm-6): checksum error at logical
1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
offset 4096, length 4096, links 1 (path: test.bin)
[244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 2, gen 0
[244964.683586] BTRFS error (device dm-6): fixed up error at logical
1103630336 on dev /dev/mapper/vg-2
[root@f26s test]#


Exact same corruption (same device and offset), but normal read of the file.

[245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.638901] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
[245721.639608] BTRFS info (device dm-6): read error corrected: ino
257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
[245747.280718]


scrub considers the fixup an error, normal read considers it info; but
there's more useful information in the scrub output I think. I'd
really like to see the warning make it clear whether this is metadata
or data corruption though. From the above you have to infer it,
because of the inode reference.
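
(For anyone who wants to reproduce a test like this, the following is only
a sketch: it uses the device and sector from the log above plus a
hypothetical /mnt/test mount point, and it deliberately corrupts one
mirror, so do it only on a throwaway raid1 filesystem.)

    # overwrite 5120 bytes (10 x 512-byte sectors, assuming the 'sector' in
    # the log is a 512-byte unit) on ONE mirror, behind btrfs' back
    dd if=/dev/urandom of=/dev/mapper/vg-2 bs=512 seek=2116608 count=10

    # drop the page cache so the next read really goes to disk
    echo 3 > /proc/sys/vm/drop_caches

    # either scrub in the foreground...
    btrfs scrub start -B /mnt/test
    # ...or just read the file and let the passive fixup happen
    cat /mnt/test/test.bin > /dev/null

    # the per-device counters from the 'errs:' lines persist and can be read
    btrfs device stats /mnt/test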


-- 
Chris Murphy


Re: Problem with file system

2017-11-08 Thread Hugo Mills
On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>  wrote:
> 
> >> It definitely does fix ups during normal operations. During reads, if
> >> there's a UNC or there's corruption detected, Btrfs gets the good
> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
> >> don't just happen with scrubbing. Even raid56 supports these kinds of
> >> passive fixups back to disk.
> >
> > I could have sworn it didn't rewrite the data on-disk during normal usage.
> > I mean, I know for certain that it will return the correct data to userspace
> > if at all possible, but I was under the impression it will just log the
> > error during normal operation.
> 
> No, everything except raid56 has had it for a long time; I can't
> even remember how far back, maybe even before 3.0. Raid56 got it
> in 4.12.

   Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).

   Hugo.

-- 
Hugo Mills | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4  | Page 129 is loosed upon the world.   Zarf




Re: Problem with file system

2017-11-08 Thread Chris Murphy
On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
 wrote:

>> It definitely does fix ups during normal operations. During reads, if
>> there's a UNC or there's corruption detected, Btrfs gets the good
>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>> don't just happen with scrubbing. Even raid56 supports these kinds of
>> passive fixups back to disk.
>
> I could have sworn it didn't rewrite the data on-disk during normal usage.
> I mean, I know for certain that it will return the correct data to userspace
> if at all possible, but I was under the impression it will just log the
> error during normal operation.

No, everything except raid56 has had it for a long time; I can't
even remember how far back, maybe even before 3.0. Raid56 got it
in 4.12.



-- 
Chris Murphy


Re: Problem with file system

2017-11-08 Thread Austin S. Hemmelgarn

On 2017-11-07 23:50, Chris Murphy wrote:

On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn
 wrote:


* Optional automatic correction of errors detected during normal usage.
Right now, you have to run a scrub to correct errors. Such a design makes
sense with MD and LVM, where you don't know which copy is correct, but BTRFS
does know which copy is correct (or how to rebuild the correct data), and it
therefore makes sense to have an option to automatically rebuild data that
is detected to be incorrect.


?

It definitely does fix ups during normal operations. During reads, if
there's a UNC or there's corruption detected, Btrfs gets the good
copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
don't just happen with scrubbing. Even raid56 supports these kinds of
passive fixups back to disk.
I could have sworn it didn't rewrite the data on-disk during normal 
usage.  I mean, I know for certain that it will return the correct data 
to userspace if at all possible, but I was under the impression it will 
just log the error during normal operation.



Re: Problem with file system

2017-11-07 Thread Chris Murphy
On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn
 wrote:

> * Optional automatic correction of errors detected during normal usage.
> Right now, you have to run a scrub to correct errors. Such a design makes
> sense with MD and LVM, where you don't know which copy is correct, but BTRFS
> does know which copy is correct (or how to rebuild the correct data), and it
> therefore makes sense to have an option to automatically rebuild data that
> is detected to be incorrect.

?

It definitely does fix ups during normal operations. During reads, if
there's a UNC or there's corruption detected, Btrfs gets the good
copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
don't just happen with scrubbing. Even raid56 supports these kinds of
passive fixups back to disk.


-- 
Chris Murphy


Re: Problem with file system

2017-11-07 Thread Austin S. Hemmelgarn

On 2017-11-07 02:01, Dave wrote:

On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy  wrote:


On Sat, Nov 4, 2017 at 1:26 AM, Dave  wrote:

On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy  wrote:


That is not a general purpose file system. It's a file system for admins who 
understand where the bodies are buried.


I'm not sure I understand your comment...

Are you saying BTRFS is not a general purpose file system?


I'm suggesting that any file system that burdens the user with more
knowledge to stay out of trouble than the widely considered general
purpose file systems of the day, is not a general purpose file system.

And yes, I'm suggesting that Btrfs is at risk of being neither general
purpose nor meeting its design goals as stated in Btrfs
documentation. It is not easy to admin *when things go wrong*. It's
great before then. It's a butt ton easier to resize, replace devices,
take snapshots, and so on. But when it comes to fixing it when it goes
wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
more complicated than any other file system I'm aware of.


It sounds like a large part of that could be addressed with better
documentation. I know that documentation such as what you are
suggesting would be really valuable to me!
Documentation would help, but most of it is a lack of automation of 
things that could be automated (and are reasonably expected to be based 
on how LVM and ZFS work), including but not limited to:
* Handling of device failures.  In particular, BTRFS has absolutely zero 
hot-spare support currently (though there are patches to add this), 
which is considered a mandatory feature in almost all large scale data 
storage situations.
* Handling of chunk-level allocation exhaustion.  Ideally, when we can't 
allocate a chunk, we should try to free up space from the other chunk 
type through repacking of data.  Handling this better would 
significantly improve things around one of the biggest pitfalls with 
BTRFS, namely filling up a filesystem completely (which many end users 
seem to think is perfectly fine, despite that not being the case for 
pretty much any filesystem).
* Optional automatic correction of errors detected during normal usage. 
Right now, you have to run a scrub to correct errors. Such a design 
makes sense with MD and LVM, where you don't know which copy is correct, 
but BTRFS does know which copy is correct (or how to rebuild the correct 
data), and it therefore makes sense to have an option to automatically 
rebuild data that is detected to be incorrect.
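
(For reference, the manual versions of those last two points look roughly
like this today; just a sketch, with /mnt standing in for the mount point.)

    # repack mostly-empty data and metadata chunks to give space back to the
    # chunk allocator
    btrfs balance start -dusage=50 -musage=50 /mnt

    # walk all data and metadata and rewrite any copy that fails its checksum
    btrfs scrub start -B /mnt
    btrfs scrub status /mnt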



If btrfs isn't able to serve as a general purpose file system for
Linux going forward, which file system(s) would you suggest can fill
that role? (I can't think of any that are clearly all-around better
than btrfs now, or that will be in the next few years.)


ext4 and XFS are clearly the file systems to beat. They almost always
recover from crashes with just a normal journal replay at mount time,
file system repair is not often needed. When it is needed, it usually
works, and there is just the one option to repair and go with it.
Btrfs has piles of repair options, mount time options, btrfs check has
options, btrfs rescue has options, it's a bit nutty honestly. And
there's zero guidance in the available docs on what order to try things
in, not least because some of these repair tools are still considered
dangerous, at least by the man page text, and the right order depends on
the failure. The user is burdened with way too much.


Neither one of those file systems offers snapshots. (And when I
compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
is the clear winner.)

Snapshots and volumes have a lot of value to me and I would not enjoy
going back to a file system without those features.
While that is true, that's not exactly the point Chris was trying to 
make.  The point is that if you install a system with XFS, you don't 
have to do pretty much anything to keep the filesystem running 
correctly, and ext4 is almost as good about not needing user 
intervention (repairs for ext4 are a bit more involved, and you have to 
watch inode usage because it uses static inode tables).  In contrast, 
you have to essentially treat BTRFS like a small child and keep an eye 
on it almost constantly to make sure it works correctly.



Even as much as I know about Btrfs having used it since 2008 and my
list activity, I routinely have WTF moments when people post problems,
what order to try to get things going again. Easy to admin? Yeah for
the most part. But stability is still a problem, and it's coming up on
a 10 year anniversary soon.

If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
use ZoL hands down.


Might it be the case that if you were equally familiar with ZFS, you
would become aware of more of its pitfalls? And that greater knowledge
could always lead to a different decision (such as favoring BTRFS)?

Re: Problem with file system

2017-11-06 Thread Dave
On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy  wrote:
>
> On Sat, Nov 4, 2017 at 1:26 AM, Dave  wrote:
> > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy  
> > wrote:
> >>
> >> That is not a general purpose file system. It's a file system for admins 
> >> who understand where the bodies are buried.
> >
> > I'm not sure I understand your comment...
> >
> > Are you saying BTRFS is not a general purpose file system?
>
> I'm suggesting that any file system that burdens the user with more
> knowledge to stay out of trouble than the widely considered general
> purpose file systems of the day, is not a general purpose file system.
>
> And yes, I'm suggesting that Btrfs is at risk of being neither general
> purpose nor meeting its design goals as stated in Btrfs
> documentation. It is not easy to admin *when things go wrong*. It's
> great before then. It's a butt ton easier to resize, replace devices,
> take snapshots, and so on. But when it comes to fixing it when it goes
> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
> more complicated than any other file system I'm aware of.

It sounds like a large part of that could be addressed with better
documentation. I know that documentation such as what you are
suggesting would be really valuable to me!

> > If btrfs isn't able to serve as a general purpose file system for
> > Linux going forward, which file system(s) would you suggest can fill
> > that role? (I can't think of any that are clearly all-around better
> > than btrfs now, or that will be in the next few years.)
>
> ext4 and XFS are clearly the file systems to beat. They almost always
> recover from crashes with just a normal journal replay at mount time,
> file system repair is not often needed. When it is needed, it usually
> works, and there is just the one option to repair and go with it.
> Btrfs has piles of repair options, mount time options, btrfs check has
> options, btrfs rescue has options, it's a bit nutty honestly. And
> there's zero guidance in the available docs on what order to try things
> in, not least because some of these repair tools are still considered
> dangerous, at least by the man page text, and the right order depends on
> the failure. The user is burdened with way too much.

Neither one of those file systems offers snapshots. (And when I
compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
is the clear winner.)

Snapshots and volumes have a lot of value to me and I would not enjoy
going back to a file system without those features.

> Even as much as I know about Btrfs having used it since 2008 and my
> list activity, I routinely have WTF moments when people post problems,
> what order to try to get things going again. Easy to admin? Yeah for
> the most part. But stability is still a problem, and it's coming up on
> a 10 year anniversary soon.
>
> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
> use ZoL hands down.

Might it be the case that if you were equally familiar with ZFS, you
would become aware of more of its pitfalls? And that greater knowledge
could always lead to a different decision (such as favoring BTRFS)?
In my experience the grass is always greener when I am less familiar
with the field.


Re: Problem with file system

2017-11-06 Thread Austin S. Hemmelgarn

On 2017-11-06 13:45, Chris Murphy wrote:

On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn
 wrote:



With ATA devices (including SATA), except on newer SSD's, TRIM commands
can't be queued,


SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products
on the market claiming to do queued trim. Some of them fuck up, and
have been blacklisted in the kernel for queued trim.

Yes, but some still work, and they are invariably very new devices by 
most people's definitions.

Anyway right now I consider discard mount option fundamentally broken
on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
broken there too.


For LVM thinp, discard there deallocates the blocks, and unallocated regions
read back as zeroes, just like in a sparse file (in fact, if you just think
of LVM thinp as a sparse file with reflinking for snapshots, you get
remarkably close to how it's actually implemented from a semantic
perspective), so it is broken there.  In fact, it's guaranteed broken on any
block device that has the discard_zeroes_data flag set, and theoretically
broken on many things that don't have that flag (although block devices that
don't have that flag are inherently broken from a security perspective
anyway, but that's orthogonal to this discussion).
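
(As an aside, whether a thin pool passes discards down to the underlying
device is configurable; the following is only a sketch, with vg/pool0 as a
placeholder name.  Note that 'nopassdown' still deallocates the blocks
inside the pool, so reads still return zeroes; only 'ignore' keeps the old
blocks readable.)

    # show the pool's discard policy: ignore, nopassdown or passdown
    lvs -o +discards vg/pool0

    # stop honouring discards entirely (changing this may require the pool
    # to be inactive)
    lvchange --discards ignore vg/pool0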


So this is really only solvable by having Btrfs delay, possibly
substantially, the discarding of metadata blocks. Aside from physical
device trim, there are benefits in thin provisioning for trim and some
use cases will require file system discard, being unable to rely on
periodic fstrim.
Yes.  However, from a simplicity of implementation perspective, it makes 
more sense to keep some number of old trees instead of keeping old trees 
for some amount of time.  That would remove the need to track timing 
info in the filesystem, provide sufficient protection, and probably be a 
bit easier to explain in the documentation.  Such logic could also be 
applied to regular block devices that don't support discard to provide a 
better guarantee that you won't overwrite old trees that might be useful 
for recovery.



Re: Problem with file system

2017-11-06 Thread Chris Murphy
On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn
 wrote:

>
> With ATA devices (including SATA), except on newer SSD's, TRIM commands
> can't be queued,

SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products
on the market claiming to do queued trim. Some of them fuck up, and
have been blacklisted in the kernel for queued trim.








>>
>>
>> Anyway right now I consider discard mount option fundamentally broken
>> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
>> broken there too.
>
> For LVM thinp, discard there deallocates the blocks, and unallocated regions
> read back as zeroes, just like in a sparse file (in fact, if you just think
> of LVM thinp as a sparse file with reflinking for snapshots, you get
> remarkably close to how it's actually implemented from a semantic
> perspective), so it is broken there.  In fact, it's guaranteed broken on any
> block device that has the discard_zeroes_data flag set, and theoretically
> broken on many things that don't have that flag (although block devices that
> don't have that flag are inherently broken from a security perspective
> anyway, but that's orthogonal to this discussion).

So this is really only solvable by having Btrfs delay, possibly
substantially, the discarding of metadata blocks. Aside from physical
device trim, there are benefits in thin provisioning for trim and some
use cases will require file system discard, being unable to rely on
periodic fstrim.



-- 
Chris Murphy


Re: Problem with file system

2017-11-06 Thread Austin S. Hemmelgarn

On 2017-11-04 13:14, Chris Murphy wrote:

On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski  wrote:

On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:

On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
 wrote:


If you're running on an SSD (or thinly provisioned storage, or something
else which supports discards) and have the 'discard' mount option enabled,
then there is no backup metadata tree (this issue was mentioned on the list
a while ago, but nobody ever replied),



This is a really good point. I've been running discard mount option
for some time now without problems, in a laptop with Samsung
Electronics Co Ltd NVMe SSD Controller SM951/PM951.

However, just trying btrfs-debug-tree -b on a specific block address
for any of the backup root trees listed in the super, only the current
one returns a valid result.  All others fail with checksum errors. And
even the good one fails with checksum errors within seconds as a new
tree is created, the super updated, and Btrfs considers the old root
tree disposable and subject to discard.

So absolutely if I were to have a problem, probably no rollback for
me. This seems to totally obviate a fundamental part of Btrfs design.


How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.


If it's not an issue, then no one should ever need those backup slots
in the super and we should just remove them.

But in fact, we know people end up in situations where they're needed for
either automatic recovery at mount time or explicitly calling
--usebackuproot. And in some cases we're seeing users using discard
who have a borked root tree, and none of the backup roots are present
so they're fucked. Their file system is fucked.

Now again, maybe this means the hardware is misbehaving, and honored
the discard out of order, and did that and wrote the new supers before
it had completely committed all the metadata? I have no idea, but the
evidence is present in the list that some people run into this and
when they do the file system is beyond repair even though it can
usually be scraped with btrfs restore.
With ATA devices (including SATA), except on newer SSD's, TRIM commands 
can't be queued, so by definition they can't become unordered (the 
kernel ends up having to flush the device queue prior to the discard and 
then flush the write cache, so it's functionally equivalent to a write 
barrier, just more expensive, which is why inline discard performance 
sucks in most cases).  I'm not sure about SCSI (I'm pretty sure UNMAP 
can be queued and is handled just like any other write in terms of 
ordering), MMC/SD (though I'm also not sure if the block layer and the
MMC driver properly handle discard BIOs on MMC devices), or NVMe (which
I think handles things similarly to SCSI).
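
(If anyone wants to check what a particular device advertises, a couple of
read-only checks; device names here are placeholders.)

    # discard alignment/granularity/max and whether discarded ranges read
    # back as zeroes
    lsblk --discard /dev/sda
    cat /sys/block/sda/queue/discard_max_bytes

    # on SATA, hdparm shows whether TRIM is supported at all and whether
    # reads after TRIM are deterministic/zeroed
    hdparm -I /dev/sda | grep -i trim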




Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.


File systems that overwrite in place hint in the journal at what's
about to happen. So if there's a partial overwrite of metadata, it's
fine: the journal can help recover. But Btrfs has no journal, so when a
major piece of information required to bootstrap the file system at
mount time is damaged and every backup has been discarded, it actually
makes Btrfs more fragile than other file systems in the same situation.

Indeed.

Unless I'm seriously misunderstanding the code, there's a pretty high 
chance that any given old metadata block will get overwritten reasonably 
soon on an active filesystem.  I'm not 100% certain about this, but I'm 
pretty sure that BTRFS will avoid allocating new chunks to write into 
just to preserve old copies of metadata, which in turn means that it 
will overwrite things pretty fast if the metadata chunks are mostly full.


For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.


Yeah well snapshots do not cause root trees to stick around.





  because it's already been discarded.

This is ideally something which should be addressed (we need some sort of
discard queue for handling in-line discards), but it's not easy to address.


Discard data extents, don't discard metadata extents? Or put them on a
substantial delay.


Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes,
than lose an entire file system. That's why.
And outside of very specific use cases, this is something 

Re: Problem with file system

2017-11-04 Thread Chris Murphy
On Sat, Nov 4, 2017 at 1:26 AM, Dave  wrote:
> On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy  wrote:
>>
>> That is not a general purpose file system. It's a file system for admins who 
>> understand where the bodies are buried.
>
> I'm not sure I understand your comment...
>
> Are you saying BTRFS is not a general purpose file system?

I'm suggesting that any file system that burdens the user with more
knowledge to stay out of trouble than the widely considered general
purpose file systems of the day, is not a general purpose file system.

And yes, I'm suggesting that Btrfs is at risk of being neither general
purpose nor meeting its design goals as stated in Btrfs
documentation. It is not easy to admin *when things go wrong*. It's
great before then. It's a butt ton easier to resize, replace devices,
take snapshots, and so on. But when it comes to fixing it when it goes
wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
more complicated than any other file system I'm aware of.


> If btrfs isn't able to serve as a general purpose file system for
> Linux going forward, which file system(s) would you suggest can fill
> that role? (I can't think of any that are clearly all-around better
> than btrfs now, or that will be in the next few years.)

ext4 and XFS are clearly the file systems to beat. They almost always
recover from crashes with just a normal journal replay at mount time,
file system repair is not often needed. When it is needed, it usually
works, and there is just the one option to repair and go with it.
Btrfs has piles of repair options, mount time options, btrfs check has
options, btrfs rescue has options, it's a bit nutty honestly. And
there's zero guidance in the available docs on what order to try things
in, not least because some of these repair tools are still considered
dangerous, at least by the man page text, and the right order depends on
the failure. The user is burdened with way too much.

Even as much as I know about Btrfs having used it since 2008 and my
list activity, I routinely have WTF moments when people post problems,
what order to try to get things going again. Easy to admin? Yeah for
the most part. But stability is still a problem, and it's coming up on
a 10 year anniversary soon.

If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
use ZoL hands down. But I'm not, I'm much more familiar with Btrfs and
where the bodies are buried, so I continue to use Btrfs.



-- 
Chris Murphy


Re: Problem with file system

2017-11-04 Thread Chris Murphy
On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski  wrote:
> On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
>> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>> > If you're running on an SSD (or thinly provisioned storage, or something
>> > else which supports discards) and have the 'discard' mount option enabled,
>> > then there is no backup metadata tree (this issue was mentioned on the list
>> > a while ago, but nobody ever replied),
>>
>>
>> This is a really good point. I've been running discard mount option
>> for some time now without problems, in a laptop with Samsung
>> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
>>
>> However, just trying btrfs-debug-tree -b on a specific block address
>> for any of the backup root trees listed in the super, only the current
>> one returns a valid result.  All others fail with checksum errors. And
>> even the good one fails with checksum errors within seconds as a new
>> tree is created, the super updated, and Btrfs considers the old root
>> tree disposable and subject to discard.
>>
>> So absolutely if I were to have a problem, probably no rollback for
>> me. This seems to totally obviate a fundamental part of Btrfs design.
>
> How is this an issue?  Discard is issued only once we're positive there's no
> reference to the freed blocks anywhere.  At that point, they're also open
> for reuse, thus they can be arbitrarily scribbled upon.

If it's not an issue, then no one should ever need those backup slots
in the super and we should just remove them.

But in fact, we know people end up in situations where they're needed for
either automatic recovery at mount time or explicitly calling
--usebackuproot. And in some cases we're seeing users using discard
who have a borked root tree, and none of the backup roots are present
so they're fucked. Their file system is fucked.

Now again, maybe this means the hardware is misbehaving, and honored
the discard out of order, and did that and wrote the new supers before
it had completely committed all the metadata? I have no idea, but the
evidence is present in the list that some people run into this and
when they do the file system is beyond repair even though it can
usually be scraped with btrfs restore.


> Unless your hardware is seriously broken (such as lying about barriers,
> which is nearly-guaranteed data loss on btrfs anyway), there's no way the
> filesystem will ever reference such blocks.  The corpses of old trees that
> are left lying around with no discard can at most be used for manual
> forensics, but whether a given block will have been overwritten or not is
> a matter of pure luck.

File systems that overwrite in place hint in the journal at what's
about to happen. So if there's a partial overwrite of metadata, it's
fine: the journal can help recover. But Btrfs has no journal, so when a
major piece of information required to bootstrap the file system at
mount time is damaged and every backup has been discarded, it actually
makes Btrfs more fragile than other file systems in the same situation.



>
> For rollbacks, there are snapshots.  Once a transaction has been fully
> committed, the old version is considered gone.

Yeah well snapshots do not cause root trees to stick around.


>
>>  because it's already been discarded.
>> > This is ideally something which should be addressed (we need some sort of
>> > discard queue for handling in-line discards), but it's not easy to address.
>>
>> Discard data extents, don't discard metadata extents? Or put them on a
>> substantial delay.
>
> Why would you special-case metadata?  Metadata that points to overwritten or
> discarded blocks is of no use either.

I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes,
than lose an entire file system. That's why.

Anyway right now I consider discard mount option fundamentally broken
on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
broken there too.

Even fstrim leaves a tiny window open for a few minutes every time it
gets called, where if the root tree is corrupted for any reason,
you're fucked because all the backup roots are already gone.

-- 
Chris Murphy


Re: Problem with file system

2017-11-04 Thread Marat Khalili
>How is this an issue?  Discard is issued only once we're positive there's no
>reference to the freed blocks anywhere.  At that point, they're also open
>for reuse, thus they can be arbitrarily scribbled upon.

The point was: how about keeping this reference for some time period?

>Unless your hardware is seriously broken (such as lying about barriers,
>which is nearly-guaranteed data loss on btrfs anyway), there's no way the
>filesystem will ever reference such blocks.

Buggy hardware happens. So do buggy filesystems ;) Besides, most filesystems
let the user recover most data after losing just one sector; it would be a
pity if BTRFS with all its COW coolness didn't.

>Why would you special-case metadata?  Metadata that points to overwritten or
>discarded blocks is of no use either.

It takes significant time to overwrite a noticeable portion of the data on
disk, but a loss of metadata makes it all gone in a moment. Moreover, a user
is usually prepared to lose some recently changed data in a crash, but not
data that they didn't even touch.
-- 

With Best Regards,
Marat Khalili


Re: Problem with file system

2017-11-04 Thread Dave
On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy  wrote:
>
> That is not a general purpose file system. It's a file system for admins who 
> understand where the bodies are buried.

I'm not sure I understand your comment...

Are you saying BTRFS is not a general purpose file system?

If btrfs isn't able to serve as a general purpose file system for
Linux going forward, which file system(s) would you suggest can fill
that role? (I can't think of any that are clearly all-around better
than btrfs now, or that will be in the next few years.)

Or maybe you meant something else?


Re: Problem with file system

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>  wrote:
> 
> > If you're running on an SSD (or thinly provisioned storage, or something
> > else which supports discards) and have the 'discard' mount option enabled,
> > then there is no backup metadata tree (this issue was mentioned on the list
> > a while ago, but nobody ever replied),
> 
> 
> This is a really good point. I've been running discard mount option
> for some time now without problems, in a laptop with Samsung
> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
> 
> However, just trying btrfs-debug-tree -b on a specific block address
> for any of the backup root trees listed in the super, only the current
> one returns a valid result.  All others fail with checksum errors. And
> even the good one fails with checksum errors within seconds as a new
> tree is created, the super updated, and Btrfs considers the old root
> tree disposable and subject to discard.
> 
> So absolutely if I were to have a problem, probably no rollback for
> me. This seems to totally obviate a fundamental part of Btrfs design.

How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.

Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.

For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.

>  because it's already been discarded.
> > This is ideally something which should be addressed (we need some sort of
> > discard queue for handling in-line discards), but it's not easy to address.
> 
> Discard data extents, don't discard metadata extents? Or put them on a
> substantial delay.

Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: Problem with file system

2017-11-03 Thread Chris Murphy
On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
 wrote:

> If you're running on an SSD (or thinly provisioned storage, or something
> else which supports discards) and have the 'discard' mount option enabled,
> then there is no backup metadata tree (this issue was mentioned on the list
> a while ago, but nobody ever replied),


This is a really good point. I've been running discard mount option
for some time now without problems, in a laptop with Samsung
Electronics Co Ltd NVMe SSD Controller SM951/PM951.

However, just trying btrfs-debug-tree -b on a specific block address
for any of the backup root trees listed in the super, only the current
one returns a valid result.  All others fail with checksum errors. And
even the good one fails with checksum errors within seconds as a new
tree is created, the super updated, and Btrfs considers the old root
tree disposable and subject to discard.

So absolutely if I were to have a problem, probably no rollback for
me. This seems to totally obviate a fundamental part of Btrfs design.
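
(For anyone who wants to repeat that check, roughly: pull the backup root
addresses out of the superblock and probe each one.  This is only a sketch;
/dev/sdX1 is a placeholder and the exact field names can differ between
btrfs-progs versions.)

    # dump the superblock, including the four backup root slots
    btrfs inspect-internal dump-super -f /dev/sdX1 | grep backup_tree_root

    # then try to read each reported bytenr directly
    btrfs-debug-tree -b <bytenr> /dev/sdX1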


 because it's already been discarded.
> This is ideally something which should be addressed (we need some sort of
> discard queue for handling in-line discards), but it's not easy to address.

Discard data extents, don't discard metadata extents? Or put them on a
substantial delay.


-- 
Chris Murphy


Re: Problem with file system

2017-11-03 Thread Austin S. Hemmelgarn

On 2017-11-03 03:42, Kai Krakow wrote:

Am Tue, 31 Oct 2017 07:28:58 -0400
schrieb "Austin S. Hemmelgarn" :


On 2017-10-31 01:57, Marat Khalili wrote:

On 31/10/17 00:37, Chris Murphy wrote:

But off hand it sounds like hardware was sabotaging the expected
write ordering. How to test a given hardware setup for that, I
think, is really overdue. It affects literally every file system,
and Linux storage technology.

It kinda sounds like to me something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible
root trees either haven't actually been written yet (still) or
they've been overwritten, even though the super is updated. But
again, it's speculation, we don't actually know why your system
was no longer mountable.

Just a detached view: I know hardware should respect
ordering/barriers and such, but how hard is it really to avoid
overwriting at least one complete metadata tree for half an hour
(even better, yet another one for a day)? Just metadata, not data
extents.

If you're running on an SSD (or thinly provisioned storage, or
something else which supports discards) and have the 'discard' mount
option enabled, then there is no backup metadata tree (this issue was
mentioned on the list a while ago, but nobody ever replied), because
it's already been discarded.  This is ideally something which should
be addressed (we need some sort of discard queue for handling in-line
discards), but it's not easy to address.

Otherwise, it becomes a question of space usage on the filesystem,
and this is just another reason to keep some extra slack space on the
FS (though that doesn't help _much_, it does help).  This, in theory,
could be addressed, but it probably can't be applied across mounts of
a filesystem without an on-disk format change.


Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Indeed.


Without knowledge of the internals, I guess discards could be added to
a queue within a new tree in btrfs, and only added to that queue when
dropped from the last backup root referencing it. But this will
probably add some bad performance spikes.

Inline discards can already cause bad performance spikes.


I wonder how a regular fstrim run through cron applies to this problem?
You still functionally lose any old (freed) trees; they just get kept
around until you call fstrim.
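
For example, a weekly trim is plenty for that purpose.  The details below
are just an illustration (fstrim.timer is only there on distros that ship
the util-linux unit):

  # either enable the util-linux timer, where available
  systemctl enable --now fstrim.timer

  # or drop a trivial script into /etc/cron.weekly, e.g.:
  #!/bin/sh
  fstrim -av   # trim all mounted filesystems that support discard, verbosely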




Re: Problem with file system

2017-11-03 Thread Kai Krakow
Am Tue, 31 Oct 2017 07:28:58 -0400
schrieb "Austin S. Hemmelgarn" :

> On 2017-10-31 01:57, Marat Khalili wrote:
> > On 31/10/17 00:37, Chris Murphy wrote:  
> >> But off hand it sounds like hardware was sabotaging the expected
> >> write ordering. How to test a given hardware setup for that, I
> >> think, is really overdue. It affects literally every file system,
> >> and Linux storage technology.
> >>
> >> It kinda sounds like to me something other than supers is being
> >> overwritten too soon, and that's why it's possible for none of the
> >> backup roots to find a valid root tree, because all four possible
> >> root trees either haven't actually been written yet (still) or
> >> they've been overwritten, even though the super is updated. But
> >> again, it's speculation, we don't actually know why your system
> >> was no longer mountable.  
> > Just a detached view: I know hardware should respect
> > ordering/barriers and such, but how hard is it really to avoid
> > overwriting at least one complete metadata tree for half an hour
> > (even better, yet another one for a day)? Just metadata, not data
> > extents.  
> If you're running on an SSD (or thinly provisioned storage, or
> something else which supports discards) and have the 'discard' mount
> option enabled, then there is no backup metadata tree (this issue was
> mentioned on the list a while ago, but nobody ever replied), because
> it's already been discarded.  This is ideally something which should
> be addressed (we need some sort of discard queue for handling in-line
> discards), but it's not easy to address.
> 
> Otherwise, it becomes a question of space usage on the filesystem,
> and this is just another reason to keep some extra slack space on the
> FS (though that doesn't help _much_, it does help).  This, in theory,
> could be addressed, but it probably can't be applied across mounts of
> a filesystem without an on-disk format change.

Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Without knowledge of the internals, I guess discards could be added to
a queue within a new tree in btrfs, and only added to that queue when
dropped from the last backup root referencing it. But this will
probably add some bad performance spikes.

I wonder how a regular fstrim run through cron applies to this problem?


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Problem with file system

2017-10-31 Thread Austin S. Hemmelgarn

On 2017-10-31 01:57, Marat Khalili wrote:

On 31/10/17 00:37, Chris Murphy wrote:

But off hand it sounds like hardware was sabotaging the expected write
ordering. How to test a given hardware setup for that, I think, is
really overdue. It affects literally every file system, and Linux
storage technology.

It kinda sounds like to me something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible root
trees either haven't actually been written yet (still) or they've been
overwritten, even though the super is updated. But again, it's
speculation, we don't actually know why your system was no longer
mountable.
Just a detached view: I know hardware should respect ordering/barriers 
and such, but how hard is it really to avoid overwriting at least one 
complete metadata tree for half an hour (even better, yet another one 
for a day)? Just metadata, not data extents.
If you're running on an SSD (or thinly provisioned storage, or something 
else which supports discards) and have the 'discard' mount option 
enabled, then there is no backup metadata tree (this issue was mentioned 
on the list a while ago, but nobody ever replied), because it's already 
been discarded.  This is ideally something which should be addressed (we 
need some sort of discard queue for handling in-line discards), but it's 
not easy to address.


Otherwise, it becomes a question of space usage on the filesystem, and 
this is just another reason to keep some extra slack space on the FS 
(though that doesn't help _much_, it does help).  This, in theory, could 
be addressed, but it probably can't be applied across mounts of a 
filesystem without an on-disk format change.



Re: Problem with file system

2017-10-30 Thread Marat Khalili

On 31/10/17 00:37, Chris Murphy wrote:

But off hand it sounds like hardware was sabotaging the expected write
ordering. How to test a given hardware setup for that, I think, is
really overdue. It affects literally every file system, and Linux
storage technology.

It kinda sounds like to me something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible root
trees either haven't actually been written yet (still) or they've been
overwritten, even though the super is updated. But again, it's
speculation, we don't actually know why your system was no longer
mountable.
Just a detached view: I know hardware should respect ordering/barriers 
and such, but how hard is it really to avoid overwriting at least one 
complete metadata tree for half an hour (even better, yet another one 
for a day)? Just metadata, not data extents.


--

With Best Regards,
Marat Khalili


Re: Problem with file system

2017-10-30 Thread Duncan
Dave posted on Sun, 29 Oct 2017 23:31:57 -0400 as excerpted:

> It's all part of the process of gaining critical experience with BTRFS.
> Whether or not BTRFS is ready for production use is (it seems to me)
> mostly a question of how knowledgeable and experienced are the people
> administering it.
> 
> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current maturity
> level of BTRFS, I think that's the wrong focus. The right focus is on
> how production-ready is the admin person or team (with respect to their
> BTRFS knowledge and experience). When a filesystem has been around for
> decades, most of the critical admin issues become fairly common
> knowledge, fairly widely known and easy to find. When a filesystem is
> newer, far fewer people understand the gotchas. Also, in older or widely
> used filesystems, when someone hits a gotcha, the response isn't "that
> filesystem is not ready for production". Instead the response is, "you
> should have known not to do that."

That's a view I hadn't seen before, but it seems reasonable and I like it.

Indeed, there were/are a few reasonably widely known caveats with both 
ext3 and reiserfs, for instance, and certainly some that apply to fat/
vfat/fat32 (the three filesystems other than btrfs I know most about), 
and if anything those are past their prime, /not/ "still maturing" as 
btrfs is typically described.  For example, setting either ext3 or 
reiserfs to writeback journaling and then losing data gets a response 
along the lines of "you should have known not to do that unless you were 
prepared for the risk, as it's a well known one."
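
(For reference, "writeback journaling" here means the data=writeback 
mount option, which both ext3 and reiserfs support; the device and 
mountpoint are just examples:

  mount -o data=writeback /dev/sdXn /mnt

It journals metadata only, so files written shortly before a crash can 
come back containing stale or garbage data.)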

That was more or less my own reaction when Linus and the other powers 
that be decided to make writeback journaling the ext3 default for a few 
kernel cycles.  Having lived through that on reiserfs, I /knew/ where 
/that/ was headed, and sure enough...

Similarly, ext3's performance problems with fsync, because it effectively 
forces a full filesystem sync not just a file sync, are well known, as 
are the risks of storing a reiserfs in a loopback file on reiserfs and 
then trying to run a tree restore on the host, since it's known to mix up 
the two filesystems in that case.

It's thus a reasonable viewpoint to consider some of the btrfs quirks to 
be in the same category.  Of course, btrfs being the first COW-based 
filesystem most people will have had experience with, and the first most 
will have used that handles raid, snapshotting, etc., it's definitely 
rather different and more complex than the filesystems most people are 
familiar with, and can only be expected to have rather different and 
more complex caveats as well.

OTOH, there's definitely some known low-hanging fruit in terms of ease of 
use that remains to be implemented, though I'd argue we've reached the 
point where general stability has allowed the focus to gradually tilt 
toward implementing some of it over the last year or so, and we're 
beginning to see the loose ends tied up in the documentation, for 
instance.  I'd say we're getting close, and your viewpoint is a definite 
argument in support of that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Problem with file system

2017-10-30 Thread Chris Murphy
On Mon, Oct 30, 2017 at 4:31 AM, Dave  wrote:
> This is a very helpful thread. I want to share an interesting related story.
>
> We have a machine with 4 btrfs volumes and 4 Snapper configs. I
> recently discovered that Snapper timeline cleanup had been turned off for
> 3 of those volumes. In the Snapper configs I found this setting:
>
> TIMELINE_CLEANUP="no"
>
> Normally that would be set to "yes". So I corrected the issue and set
> it to "yes" for the 3 volumes where it had not been set correctly.
>
> I suppose it was turned off temporarily and then somebody forgot to
> turn it back on.
>
> What I did not know, and what I did not realize was a critical piece
> of information, was how long timeline cleanup had been turned off and
> how many snapshots had accumulated on each volume in that time.
>
> I naively re-enabled Snapper timeline cleanup. The instant I started
> the  snapper-cleanup.service  the system was hosed. The ssh session
> became unresponsive, no other ssh sessions could be established and it
> was impossible to log into the system at the console.
>
> My subsequent investigation showed that the root filesystem volume
> accumulated more than 3000 btrfs snapshots. The two other affected
> volumes also had very large numbers of snapshots.
>
> Deleting a single snapshot in that situation would likely require
> hours. (I set up a test, but I ran out of patience before I was able
> to delete even a single snapshot.) My guess is that if we had been
> patient enough to wait for all the snapshots to be deleted, the
> process would have finished in some number of months (or maybe a
> year).
>
> We did not know most of this at the time, so we did what we usually do
> when a system becomes totally unresponsive -- we did a hard reset. Of
> course, we could never get the system to boot up again.
>
> Since we had backups, the easiest option became to replace that system
> -- not unlike what the OP decided to do. In our case, the hardware was
> not old, so we simply reformatted the drives and reinstalled Linux.
>
> That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
> TIMELINE_CLEANUP="yes" in the snapper config.


Without a complete autopsy on the file system, it's unclear whether it
was fixable with the available tools, why it wouldn't mount normally, or
why it couldn't, if necessary, do its own autorecovery from one of the
available backup roots.

But off hand it sounds like hardware was sabotaging the expected write
ordering. How to test a given hardware setup for that, I think, is
really overdue. It affects literally every file system, and Linux
storage technology.

It kinda sounds like to me something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible root
trees either haven't actually been written yet (still) or they've been
overwritten, even though the super is updated. But again, it's
speculation, we don't actually know why your system was no longer
mountable.



>
> It's all part of the process of gaining critical experience with
> BTRFS. Whether or not BTRFS is ready for production use is (it seems
> to me) mostly a question of how knowledgeable and experienced are the
> people administering it.

"Btrfs is a copy on write filesystem for Linux aimed at implementing advanced
features while focusing on fault tolerance, repair and easy administration."

That has been the descriptive text at
Documentation/filesystems/btrfs.txt for some time now.


> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current
> maturity level of BTRFS, I think that's the wrong focus. The right
> focus is on how production-ready is the admin person or team (with
> respect to their BTRFS knowledge and experience). When a filesystem
> has been around for decades, most of the critical admin issues become
> fairly common knowledge, fairly widely known and easy to find. When a
> filesystem is newer, far fewer people understand the gotchas. Also, in
> older or widely used filesystems, when someone hits a gotcha, the
> response isn't "that filesystem is not ready for production". Instead
> the response is, "you should have known not to do that."



That is not a general purpose file system. It's a file system for
admins who understand where the bodies are buried.




-- 
Chris Murphy


Re: Problem with file system

2017-10-29 Thread Dave
This is a very helpful thread. I want to share an interesting related story.

We have a machine with 4 btrfs volumes and 4 Snapper configs. I
recently discovered that Snapper timeline cleanup had been turned off for
3 of those volumes. In the Snapper configs I found this setting:

TIMELINE_CLEANUP="no"

Normally that would be set to "yes". So I corrected the issue and set
it to "yes" for the 3 volumes where it had not been set correctly.

I suppose it was turned off temporarily and then somebody forgot to
turn it back on.

What I did not know, and what I did not realize was a critical piece
of information, was how long timeline cleanup had been turned off and
how many snapshots had accumulated on each volume in that time.

I naively re-enabled Snapper timeline cleanup. The instant I started
the  snapper-cleanup.service  the system was hosed. The ssh session
became unresponsive, no other ssh sessions could be established and it
was impossible to log into the system at the console.

My subsequent investigation showed that the root filesystem volume
accumulated more than 3000 btrfs snapshots. The two other affected
volumes also had very large numbers of snapshots.

Deleting a single snapshot in that situation would likely require
hours. (I set up a test, but I ran out of patience before I was able
to delete even a single snapshot.) My guess is that if we had been
patient enough to wait for all the snapshots to be deleted, the
process would have finished in some number of months (or maybe a
year).

We did not know most of this at the time, so we did what we usually do
when a system becomes totally unresponsive -- we did a hard reset. Of
course, we could never get the system to boot up again.

Since we had backups, the easiest option became to replace that system
-- not unlike what the OP decided to do. In our case, the hardware was
not old, so we simply reformatted the drives and reinstalled Linux.

That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
TIMELINE_CLEANUP="yes" in the snapper config.

It's all part of the process of gaining critical experience with
BTRFS. Whether or not BTRFS is ready for production use is (it seems
to me) mostly a question of how knowledgeable and experienced are the
people administering it.

In the various online discussions on this topic, all the focus is on
whether or not BTRFS itself is production-ready. At the current
maturity level of BTRFS, I think that's the wrong focus. The right
focus is on how production-ready is the admin person or team (with
respect to their BTRFS knowledge and experience). When a filesystem
has been around for decades, most of the critical admin issues become
fairly common knowledge, fairly widely known and easy to find. When a
filesystem is newer, far fewer people understand the gotchas. Also, in
older or widely used filesystems, when someone hits a gotcha, the
response isn't "that filesystem is not ready for production". Instead
the response is, "you should have known not to do that."

On Wed, Apr 26, 2017 at 12:43 PM, Fred Van Andel  wrote:
> Yes I was running qgroups.
> Yes the filesystem is highly fragmented.
> Yes I have way too many snapshots.
>
> I think it's clear that the problem is on my end. I simply placed too
> many demands on the filesystem without fully understanding the
> implications.  Now I have to deal with the consequences.
>
> It was decided today to replace this computer due to its age.  I will
> use the recover command to pull the needed data off this system and
> onto the new one.
>
>
> Thank you everyone for your assistance and the education.
>
> Fred


Re: Problem with file system

2017-04-26 Thread Fred Van Andel
Yes I was running qgroups.
Yes the filesystem is highly fragmented.
Yes I have way too many snapshots.

I think it's clear that the problem is on my end. I simply placed too
many demands on the filesystem without fully understanding the
implications.  Now I have to deal with the consequences.

It was decided today to replace this computer due to its age.  I will
use the recover command to pull the needed data off this system and
onto the new one.


Thank you everyone for your assistance and the education.

Fred


Re: Problem with file system

2017-04-25 Thread Qu Wenruo



At 04/25/2017 01:33 PM, Marat Khalili wrote:

On 25/04/17 03:26, Qu Wenruo wrote:
IIRC qgroup for subvolume deletion will cause a full subtree rescan, 
which can consume tons of memory.
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?


The memory is used for 2 reasons.

1) Record which extents are needed to trace
   Freed at transaction commit.

   We need a better idea for handling them. Maybe create a new tree so that we
   can write it to disk?
   Or another qgroup rework?

2) Record current roots referring to this extent
   Only after v4.10 IIRC.

The memory allocated is not swappable.

How much memory it uses depends on the number of extents in that subvolume.

It's 56 bytes per extent, for both tree blocks and data extents.
To use up 16G of RAM, that's about 300 million extents.
For a 5.6T volume, that works out to an average extent size of about 20K.
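
Back-of-the-envelope check (numbers rounded, assuming exactly 56 bytes
per tracked extent):

  extents=$(( 16 * 1024**3 / 56 )); echo $extents   # ~307 million extents fit in 16G
  echo $(( 5734 * 1024**3 / extents ))              # ~20000 bytes avg (5.6T ~= 5734G)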

It seems that your volume is highly fragmented though.

If that's the problem, disabling qgroup may be the best workaround.

Thanks,
Qu



Haven't read about RAM limitations for running qgroups before, only 
about CPU load (which importantly only requires patience, does not crash 
servers).


--

With Best Regards,
Marat Khalili


Re: Problem with file system

2017-04-24 Thread Marat Khalili

On 25/04/17 03:26, Qu Wenruo wrote:
IIRC qgroup for subvolume deletion will cause a full subtree rescan, 
which can consume tons of memory.
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?


Haven't read about RAM limitations for running qgroups before, only 
about CPU load (which importantly only requires patience, does not crash 
servers).


--

With Best Regards,
Marat Khalili


Re: Problem with file system

2017-04-24 Thread Duncan
Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted:

> On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel 
> wrote:
>> I have a btrfs file system with a few thousand snapshots.  When I
>> attempted to delete 20 or so of them the problems started.
>>
>> The disks are being read but except for the first few minutes there are
>> no writes.
>>
>> Memory usage keeps growing until all the memory (24 Gb) is used in a
>> few hours. Eventually the system will crash with out of memory errors.

In addition to what CMurphy and QW suggested (both valid), I have a 
couple other suggestions/pointers.  They won't help you get out of the 
current situation, but they might help you stay out of it in the future.

1) A "few thousand snapshots", but no mention of how many subvolumes 
those snapshots are of, or how many per subvolume.

As CMurphy says, but I'll expand on it here: taking a snapshot is nearly 
free, just a bit of metadata to write, because btrfs is COW-based and all 
a snapshot does is lock down a copy of everything in the subvolume as it 
currently exists, which the filesystem is already tracking.  Removal, by 
contrast, is expensive, because btrfs must go through and check everything 
to see if it can actually be deleted (no other snapshot referencing the 
block) or not (something else still referencing it).

Obviously, then, this checking gets much more complicated the more 
snapshots of the same subvolume that exist.  IOW, it's a scaling issue.

The same scaling issue applies to various other btrfs maintenance tasks, 
including btrfs check (aka btrfsck), and btrfs balance (and thus btrfs 
device remove, which does an implicit balance).  Both of these take *far* 
longer if the number of snapshots per subvolume is allowed to get out of 
hand.

Due to this scaling issue, the recommendation is no more than 200-300 
snapshots per subvolume, and keeping it down to 50-100 max is even 
better, if you can do it reasonably.  That helps keep scaling issues and 
thus time for any necessary maintenance manageable.  Otherwise... well, 
we've had reports of device removes (aka balances) that would take 
/months/ to finish at the rate they were going.  Obviously, well before 
it gets to that point it's far faster to simply blow away the filesystem 
and restore from backups.[1]

It follows that if you have an automated system doing the snapshots, it's 
equally important to have an automated system doing snapshot thinning as 
well, keeping the number of snapshots per subvolume within manageable 
scaling limits.

So if that's "a few thousand snapshots", I hope they're spread across (at 
least) a double-digit number of subvolumes, keeping the number of 
snapshots per subvolume under 300, and under 100 if your snapshot 
rotation schedule will allow it.
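
(A quick way to check where a filesystem stands; the mountpoint is just 
an example:

  btrfs subvolume list -s /mnt | wc -l   # total snapshots on the filesystem
)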

2) As Qu suggests, btrfs quotas increase the scaling issues significantly.

Additionally, there have been and continue to be accuracy issues with 
certain quota corner-cases, so they can't be entirely relied upon anyway.

Generally, people using btrfs quotas fall into three categories:

a) Those who know the problems and are working with Qu and the other devs 
to report and trace issues so they will eventually work well, ideally 
with less of a scaling issue as well.

Bless them!  Keep it up! =:^)

b) Those who have a use-case that really depends on quotas.

Because btrfs quotas are buggy and not entirely reliable now, not to 
mention the scaling issues, these users are almost certainly better 
served using more mature filesystems with mature and dependable quotas.

c) Those who don't really care about quotas specifically, and are just 
using them because it's a nice feature.  This likely includes some who 
are simply running distros that enable quotas.

My recommendation for these users is to simply turn btrfs quotas off for 
now, as they're presently in general more trouble than they're worth, due 
to both the accuracy and scaling issues.  Hopefully quotas will be stable 
in a couple years, and with developer and tester hard work perhaps the 
scaling issues will have been reduced as well, and that recommendation 
can change.  But for now, if you don't really need them, leaving quotas 
off will significantly reduce scaling issues.  And if you do need them, 
they're not yet reliable on btrfs anyway, so better off using something 
more mature where they actually work.
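
(Turning them off is a one-liner; the mountpoint is just an example:

  btrfs quota disable /mnt
)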

3) Similarly (though unlikely to apply in your case), beware of the 
scaling implications of the various reflink-based copying and dedup 
utilities, which work via the same copy-on-write and reflinking 
technology that's behind snapshotting.
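
(For context, the reflink copies in question are the kind cp can make; 
the file names are just examples:

  cp --reflink=always bigfile bigfile.copy   # shares extents rather than duplicating data
)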

Snapshotting is effectively reflinking /everything/ in the subvolume, 
though, so the scaling issues compound much faster there than they will 
with a more trivial level of reflinking.  Of course, when it comes to 
dedup, a more trivial level of reflinking also means less benefit from 
doing the dedup in the first place, so there's a limit to the 
effectiveness of dedup before it starts 

Re: Problem with file system

2017-04-24 Thread Qu Wenruo



At 04/24/2017 11:27 PM, Fred Van Andel wrote:

I have a btrfs file system with a few thousand snapshots.  When I
attempted to delete 20 or so of them the problems started.

The disks are being read but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 Gb) is used in a
few hours. Eventually the system will crash with out of memory errors.


Are you using qgroup/quota?

IIRC qgroup for subvolume deletion will cause a full subtree rescan, which 
can consume tons of memory.


Thanks,
Qu



The CPU load is low (<5%) but iowait is around 30 to 50%

The drives are mounted but any process that attempts to access them
will just hang so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot once you mount the drives.

I tried to zero the log hoping it wouldn't restart after a reboot but
that didn't work

I am assuming that the attempt to remove the snapshots caused this
problem.  How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

#   btrfs --version
btrfs-progs v4.9.1

#   btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
 Total devices 6 FS bytes used 5.65TiB
 devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
 devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
 devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
 devid    4 size 1.82TiB used 1.29TiB path /dev/sde
 devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
 devid    6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB


Re: Problem with file system

2017-04-24 Thread Chris Murphy
On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanan...@gmail.com> wrote:
> I have a btrfs file system with a few thousand snapshots.  When I
> attempted to delete 20 or so of them the problems started.
>
> The disks are being read but except for the first few minutes there
> are no writes.
>
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Boot with these boot parameters
log_buf_len=1M

I find it easier to remotely login with another computer to capture
problems in case of a crash and I can't save things locally. So on the
remote computer use 'journalctl -kf -o short-monotonic'

Either on the 1st computer, or from an additional ssh connection from the 2nd:

echo 1 >/proc/sys/kernel/sysrq
btrfs fi show   #you need the UUID for the volume you're going to
mount, best to have it in advance

mount the file system normally, and once it's starting to have the
problem (I guess it happens pretty quickly?)

echo t > /proc/sysrq-trigger
grep . -IR /sys/fs/btrfs/UUID/allocation/

Paste in the UUID from fi show. If the computer is hanging due to
running out of memory, each of these commands can take a while to
complete. So it's best to have them all ready to go before you mount
the file system, and the problem starts happening. Best if you can
issue the commands more than once as the problem gets worse, if you
can keep them all organized and labeled.

Then attach them (rather than pasting them into the message).
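
Something like this rough capture loop keeps it organized; the UUID,
output directory, and timings are placeholders to adjust:

  #!/bin/bash
  uuid=YOUR-FS-UUID      # from 'btrfs fi show'
  out=/root/btrfs-debug
  mkdir -p "$out"
  echo 1 > /proc/sys/kernel/sysrq
  for i in 1 2 3; do
      echo t > /proc/sysrq-trigger                    # dump task states to the kernel log
      dmesg > "$out/sysrq-t.$i.txt"
      grep . -IR /sys/fs/btrfs/$uuid/allocation/ > "$out/allocation.$i.txt"
      sleep 60
  done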


> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work

Yeah don't just start randomly hitting the fs with a hammer like
zeroing the log tree. That's for a specific problem and this isn't it.


> I am assuming that the attempt to remove the snapshots caused this
> problem.  How do I interrupt the process so I can access the
> filesystem again?

Snapshot creation is essentially free. Snapshot removal is expensive.
There's no way to answer your questions because your email doesn't
even include a call trace. So a developer will need at least the call
trace, but there might be some other useful information in a sysrq +
t, as well as the allocation states.



> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB

Later, after this problem is solved, you'll want to get rid of that
single system chunk; it isn't being used, but it might cause a problem
in the event of a device failure.

sudo btrfs balance start -mconvert=raid1,soft 


-- 
Chris Murphy


Problem with file system

2017-04-24 Thread Fred Van Andel
I have a btrfs file system with a few thousand snapshots.  When I
attempted to delete 20 or so of them the problems started.

The disks are being read but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 Gb) is used in a
few hours. Eventually the system will crash with out of memory errors.

The CPU load is low (<5%) but iowait is around 30 to 50%

The drives are mounted but any process that attempts to access them
will just hang so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot once you mount the drives.

I tried to zero the log hoping it wouldn't restart after a reboot but
that didn't work

I am assuming that the attempt to remove the snapshots caused this
problem.  How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

#   btrfs --version
btrfs-progs v4.9.1

#   btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
Total devices 6 FS bytes used 5.65TiB
devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
devid    4 size 1.82TiB used 1.29TiB path /dev/sde
devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
devid    6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB