[Mount time bug bounty?] was: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
Le 03/12/2018 à 23:22, Hans van Kranenburg a écrit :
> [...]
> Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982
>
> What the code is doing here is starting at the beginning of the extent
> tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
> is not that far away), and then based on the information in it, computes
> where the next one will be (just after the end of the vaddr+length of
> it), and then jumps over all normal extent items and searches again near
> where the next block group item has to be. So, yes, that means that they
> depend on each other.
>
> Two possible ways to improve this:
>
> 1. Instead, walk the chunk tree (which has all related items packed
> together) instead to find out at which locations in the extent tree the
> block group items are located and then start getting items in parallel.
> If you have storage with a lot of rotating rust that can deliver much
> more random reads if you ask for more of them at the same time, then
> this can already cause a massive speedup.
>
> 2. Move the block group items somewhere else, where they can nicely be
> grouped together, so that the amount of metadata pages that has to be
> looked up is minimal. Quoting from the link below, "slightly tricky
> [...] but there are no fundamental obstacles".
>
> https://www.spinics.net/lists/linux-btrfs/msg71766.html
>
> I think the main obstacle here is finding a developer with enough
> experience and time to do it. :)
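
(To make the lookup pattern described above concrete, here is a toy,
self-contained C model of it. This is not the kernel code, only the
access pattern over made-up items: block group items are sparse among
ordinary extent items in the same key space, and each one tells you
where the next one can possibly start.)

#include <stdio.h>

struct item { unsigned long long vaddr, length; int is_block_group; };

/* stand-in for a btree search: first item with vaddr >= key */
static int search_from(const struct item *t, int n, unsigned long long key)
{
        int i = 0;

        while (i < n && t[i].vaddr < key)
                i++;
        return i;
}

int main(void)
{
        struct item tree[] = {          /* toy "extent tree" */
                { 0, 1024, 1 },    { 16, 8, 0 },   { 512, 64, 0 },
                { 1024, 2048, 1 }, { 1100, 4, 0 }, { 2000, 16, 0 },
                { 3072, 4096, 1 }, { 3500, 8, 0 },
        };
        int n = sizeof(tree) / sizeof(tree[0]);
        int i = search_from(tree, n, 0);

        while (i < n) {
                if (tree[i].is_block_group) {
                        printf("block group: vaddr=%llu length=%llu\n",
                               tree[i].vaddr, tree[i].length);
                        /* the next block group item can only start past
                         * the end of this one, so jump the search there
                         * and skip the extent items in between */
                        i = search_from(tree, n,
                                        tree[i].vaddr + tree[i].length);
                } else {
                        i++;    /* only happens before the first one */
                }
        }
        return 0;
}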

I would definitely be interested in sponsoring at least a part of the
needed time through my company (we are too small to hire kernel
developers full-time but we can make a one-time contribution for
something as valuable to us as shorter mount times).

If needed it could be split into two steps with separate bounties:
- providing a patch for the latest LTS kernel with a substantial
decrease in mount time in our case (ideally less than a minute instead
of 15 minutes, but <5 minutes would already be worth it),
- having it integrated in mainline.

I don't have any experience with company sponsorship/bounties but I'm
willing to learn (don't hesitate to make suggestions). I'll have to
discuss it with our accountant to make sure we do it correctly.

Is this the right place to discuss this kind of subject or should I take
the discussion elsewhere?

Best regards,

Lionel


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
Le 04/12/2018 à 03:52, Chris Murphy a écrit :
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
>  wrote:
>> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the mount of IO requests
>> Sent too quickly: I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data written"
>>
>>>  by half on
>>> one server in production though although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests it might be possible for further
> optimization, to help mitigate the wandering trees impact.

Note, the other major changes were :
- the kernel upgrade from 4.9 to 4.14,
- using multi-queue aware bfq instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel



Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

>  by half on
> one server in production though although more tests are needed to
> isolate the cause).




Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
Hi,

Le 03/12/2018 à 19:20, Wilson, Ellis a écrit :
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files.
I didn't check whether snapshots matter; I suspect they do, but not as
much as the number of "original" files, at least if you don't heavily
modify existing files but mostly create new ones, as we do.
As an example, we have a filesystem with 20TB of used space and 4
subvolumes hosting several million files/directories (probably 10-20
million total; I didn't check the exact number recently as simply
counting the files is a very long process) and 40 snapshots for each
subvolume. Mount takes about 15 minutes.
We have virtual machines that we don't reboot as often as we would like
because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25 x 10GB,
create 2,500 x 100MB and 250,000 x 1MB files between each run and compare
with the original result; a small generator sketch follows below),
- graph the delay vs the number of snapshots (probably starting with a
large number of files in the initial subvolume so that you begin with a
non-trivial mount delay).
You may also want to study the impact of the differences between
snapshots by comparing snapshots taken without modifications and
snapshots made at various stages of your subvolume's growth.
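
(To generate one of these data sets, here is a minimal C sketch; it is
only an illustration, the file names are arbitrary and the content is
zero-filled, which should not matter for mount-time measurements as long
as compression is not enabled.)

#include <stdio.h>
#include <stdlib.h>

/* usage: ./mkfiles <count> <size_in_MiB>, e.g. 2500 100 or 250000 1 */
int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <count> <size_in_MiB>\n", argv[0]);
                return 1;
        }

        long count = atol(argv[1]), mib = atol(argv[2]);
        char *chunk = calloc(1, 1 << 20);       /* 1 MiB of zeroes */

        if (!chunk)
                return 1;
        for (long i = 0; i < count; i++) {
                char name[64];
                snprintf(name, sizeof(name), "file-%08ld.dat", i);

                FILE *f = fopen(name, "wb");
                if (!f)
                        return 1;
                for (long j = 0; j < mib; j++)
                        fwrite(chunk, 1, 1 << 20, f);
                fclose(f);
        }
        free(chunk);
        return 0;
}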

Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the mount of IO requests by half on
one server in production though although more tests are needed to
isolate the cause).
I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and by how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on the previous ones and prevents io-schedulers from optimizing
anything).

Best regards,

Lionel


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Lionel Bouton
Hi,

On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700
>> Marc MERLIN  wrote:
>>
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.

I've not yet tried send/receive, but I feel the pain of rsyncing
millions of files (I had to use lsyncd to limit the problem to the times
the origin servers reboot, which is a relatively rare event), so this
thread piqued my attention. Looking at the whole thread I wonder if you
could get a more manageable solution by splitting the filesystem.

If, instead of using a single BTRFS filesystem, you used LVM volumes
(maybe with thin provisioning and monitoring of the volume group's free
space) for each of the servers you back up, with one BTRFS filesystem
per volume, you would have fewer snapshots per filesystem and would
isolate problems in case of corruption. If you eventually decide to
start from scratch again this might help a lot in your case.

Lionel


Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting

2017-11-21 Thread Lionel Bouton
Le 21/11/2017 à 23:04, Andy Leadbetter a écrit :
> I have a 4 disk array on top of 120GB bcache setup, arranged as follows
[...]
> Upgraded today to 4.14.1 from their PPA and the

4.14 and 4.14.1 have a nasty bug affecting bcache users. See for
example:
https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/

Lionel


Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-06 Thread Lionel Bouton
Le 06/07/2017 à 13:59, Austin S. Hemmelgarn a écrit :
> On 2017-07-05 20:25, Nick Terrell wrote:
>> On 7/5/17, 12:57 PM, "Austin S. Hemmelgarn" 
>> wrote:
>>> It's the slower compression speed that has me arguing for the
>>> possibility of configurable levels on zlib.  11MB/s is painfully slow
>>> considering that most decent HDD's these days can get almost 5-10x that
>>> speed with no compression.  There are cases (WORM pattern archival
>>> storage for example) where slow writes to that degree may be
>>> acceptable,
>>> but for most users they won't be, and zlib at level 9 would probably be
>>> a better choice.  I don't think it can beat zstd at level 15 for
>>> compression ratio, but if they're even close, then zlib would still
>>> be a
>>> better option at that high of a compression level most of the time.
>>
>> I don't imagine the very high zstd levels would be useful to too many
>> btrfs users, except in rare cases. However, lower levels of zstd should
>> outperform zlib level 9 in all aspects except memory usage. I would
>> expect
>> zstd level 7 would compress as well as or better than zlib 9 with faster
>> compression and decompression speed. It's worth benchmarking to
>> ensure that
>> it holds for many different workloads, but I wouldn't expect zlib 9 to
>> compress better than zstd 7 often. zstd up to level 12 should
>> compress as
>> fast as or faster than zlib level 9. zstd levels 12 and beyond allow
>> stronger compression than zlib, at the cost of slow compression and more
>> memory usage.
> While I generally agree that most people probably won't use zstd
> levels above 3, it shouldn't be hard to support them if we're going to
> have configurable compression levels, so I would argue that it's still
> worth supporting anyway.

One use case for the higher compression levels would be manual
defragmentation with recompression for a subset of the data (typically
files that won't be updated and are stored for long periods). The
filesystem would be mounted with a low level to keep latencies low for
general usage, and the subset of files would be recompressed
asynchronously with a high level.

Best regards,

Lionel


Re: Btrfs Compression

2017-07-06 Thread Lionel Bouton
Le 06/07/2017 à 13:51, Austin S. Hemmelgarn a écrit :
>
> Additionally, when you're referring to extent size, I assume you mean
> the huge number of 128k extents that the FIEMAP ioctl (and at least
> older versions of `filefrag`) shows for compressed files?  If that's
> the case, then it's important to understand that that's due to an
> issue with FIEMAP, it doesn't understand compressed extents in BTRFS
> correctly, so it shows one extent per compressed _block_ instead, even
> if they are internally an extent in BTRFS.  You can verify the actual
> number of extents by checking how many runs of continuous 128k
> 'extents' there are.

This is in fact the problem: compressed extents are far less likely to
be contiguous than uncompressed extents (even after compensating for the
FIEMAP limitations). When calling defrag on these files BTRFS is likely
to ignore the fragmentation too: I got this surprise when I modeled the
cost of reading a file as stored versus the ideal cost if it were in one
single contiguous block. Uncompressed files can be fully defragmented
most of the time, while compressed files usually keep a fragmentation
cost of approximately 1.5x to 2.5x the ideal case after defragmentation
(it seems to depend on how the whole filesystem is used).
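
(For illustration, a minimal C sketch of the shape of that cost model,
with an assumed per-extent seek penalty, an assumed sequential
throughput and a made-up extent layout; the numbers are not the ones I
actually used.)

#include <stdio.h>

#define SEEK_MS     8.0         /* assumed average seek + rotation, ms */
#define MB_PER_SEC  120.0       /* assumed sequential throughput */

/* cost of reading a file laid out as n extents of the given sizes */
static double read_cost_ms(const double *extent_mb, int n)
{
        double cost = 0.0;

        for (int i = 0; i < n; i++)
                cost += SEEK_MS + extent_mb[i] / MB_PER_SEC * 1000.0;
        return cost;
}

int main(void)
{
        /* a 4 MiB compressed file split into 32 extents of 128 KiB,
         * compared with the ideal single contiguous extent */
        double fragmented[32], whole[1] = { 4.0 };

        for (int i = 0; i < 32; i++)
                fragmented[i] = 0.125;

        printf("read cost: %.1fx the ideal case\n",
               read_cost_ms(fragmented, 32) / read_cost_ms(whole, 1));
        return 0;
}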

Lionel


Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-20 Thread Lionel Bouton
Le 19/05/2017 à 23:15, Timofey Titovets a écrit :
> 2017-05-19 23:19 GMT+03:00 Lionel Bouton
> <lionel-subscript...@bouton.name>:
>> I was too focused on other problems and having a fresh look at what I
>> wrote I'm embarrassed by what I read. Used pages for a given amount
>> of data should be (amount / PAGE_SIZE) + ((amount % PAGE_SIZE) == 0 ?
>> 0 : 1) this seems enough of a common thing to compute that the kernel
>> might have a macro defined for this. 
> If i understand the code correctly, the logic of comparing the size of
> input/output by bytes is enough (IMHO)

As I suspected, I missed context: the name of the function makes it
clear it is supposed to work on whole pages, so you are right about the
comparison.

What I'm still unsure about is whether the test is at the right spot.
The inner loop seems to work on chunks of at most
in_len = min(len, PAGE_SIZE)
bytes of data; for example, on anything with len >= 4 * PAGE_SIZE and
PAGE_SIZE = 4096 it seems to me there's a problem.

if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE)

tot_in > 8192 is true starting at the 3rd page being processed in my example.

If the first 3 pages don't manage to free one full page (i.e. the
function only reaches, at best, a 2/3 compression ratio), the modified
second condition is true and the compression is aborted. This happens
even if continuing the compression would eventually free one page
(tot_out is expected to grow more slowly than tot_in on compressible
data, so the difference could keep growing and reach a full PAGE_SIZE).
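
(A quick numeric check of that scenario, assuming a steady 0.70
output/input ratio, a made-up figure chosen to be just worse than 2/3,
and feeding the loop PAGE_SIZE-sized chunks:)

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long tot_in = 0, tot_out = 0;

        for (int page = 1; page <= 16; page++) {
                tot_in += PAGE_SIZE;
                tot_out += (unsigned long)(PAGE_SIZE * 0.70);

                /* the patched abort test */
                int give_up = tot_in > 8192 && tot_in < tot_out + PAGE_SIZE;

                printf("page %2d: in=%6lu out=%6lu saved=%6lu give_up=%d\n",
                       page, tot_in, tot_out, tot_in - tot_out, give_up);
                if (give_up)
                        break;  /* gives up on page 3 at this ratio, even
                                 * though 16 pages would save ~4.8 pages */
        }
        return 0;
}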

Am I still confused by something ?

Best regards,

Lionel


Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Lionel Bouton
Le 19/05/2017 à 16:17, Lionel Bouton a écrit :
> Hi,
>
> Le 19/05/2017 à 15:38, Timofey Titovets a écrit :
>> If data compression didn't free at least one PAGE_SIZE, it useless to store 
>> that compressed extent
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/lzo.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index bd0b0938..637ef1b0 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -207,7 +207,7 @@ static int lzo_compress_pages(struct list_head *ws,
>>  }
>>  
>>  /* we're making it bigger, give up */
>> -if (tot_in > 8192 && tot_in < tot_out) {
>> +if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE) {
>>  ret = -E2BIG;
>>  goto out;
>>  }
> I'm not familiar with this code but I was surprised by the test : you
> would expect compression having a benefit when you are freeing an actual
> page not reducing data by a page size. So unless I don't understand the
> context shouldn't it be something like :
>
> if (tot_in > 8192 && ((tot_in % PAGE_SIZE) <= (tot_out % PAGE_SIZE)))
>
> but looking at the code I see that this is in a while loop and there's
> another test just after the loop in the existing code :
>
> if (tot_out > tot_in)
> goto out;
>
> There's a couple of things I don't understand but isn't this designed to
> stream data in small chunks through compression before writing it in the
> end ? So isn't this later test the proper location to detect if
> compression was beneficial ?
>
> You might not save a page early on in the while loop working on a subset
> of the data to compress but after enough data being processed you could
> save a page. It seems odd that your modification could abort compression
> early on although the same condition would become true after enough loops.
>
> Isn't what you want something like :
>
> if (tot_out % PAGE_SIZE >= tot_in % PAGE_SIZE)
> goto out;
>
> after the loop ?
> The >= instead of > would avoid decompression in the case where the
> compressed data is smaller but uses the same space on disk.

I was too focused on other problems, and having a fresh look at what I
wrote I'm embarrassed by what I read.
The number of pages used for a given amount of data should be
(amount / PAGE_SIZE) + ((amount % PAGE_SIZE) == 0 ? 0 : 1); this seems
like a common enough thing to compute that the kernel might have a macro
defined for it.
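
(The kernel does have such a macro, DIV_ROUND_UP(); a tiny userspace
check that it matches the formula above, with the macro re-declared
locally for the test:)

#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
/* same definition as the kernel's DIV_ROUND_UP() */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        for (size_t amount = 0; amount <= 4 * PAGE_SIZE; amount++) {
                size_t pages = amount / PAGE_SIZE +
                               (amount % PAGE_SIZE == 0 ? 0 : 1);

                assert(DIV_ROUND_UP(amount, PAGE_SIZE) == pages);
        }
        return 0;
}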


Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Lionel Bouton
Hi,

Le 19/05/2017 à 15:38, Timofey Titovets a écrit :
> If data compression didn't free at least one PAGE_SIZE, it useless to store 
> that compressed extent
>
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/lzo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index bd0b0938..637ef1b0 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -207,7 +207,7 @@ static int lzo_compress_pages(struct list_head *ws,
>   }
>  
>   /* we're making it bigger, give up */
> - if (tot_in > 8192 && tot_in < tot_out) {
> + if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE) {
>   ret = -E2BIG;
>   goto out;
>   }
I'm not familiar with this code but I was surprised by the test: you
would expect compression to have a benefit when you are freeing an
actual page, not when reducing the data by a page size. So unless I
misunderstand the context, shouldn't it be something like:

if (tot_in > 8192 && ((tot_in % PAGE_SIZE) <= (tot_out % PAGE_SIZE)))

but looking at the code I see that this is in a while loop and there's
another test just after the loop in the existing code :

if (tot_out > tot_in)
goto out;

There are a couple of things I don't understand, but isn't this designed
to stream data in small chunks through compression before writing it at
the end? So isn't this later test the proper location to detect whether
compression was beneficial?

You might not save a page early on in the while loop working on a subset
of the data to compress but after enough data being processed you could
save a page. It seems odd that your modification could abort compression
early on although the same condition would become true after enough loops.

Isn't what you want something like :

if (tot_out % PAGE_SIZE >= tot_in % PAGE_SIZE)
goto out;

after the loop ?
The >= instead of > would avoid decompression in the case where the
compressed data is smaller but uses the same space on disk.

Best regards,

Lionel


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-15 Thread Lionel Bouton
Le 15/05/2017 à 10:14, Hugo Mills a écrit :
> [...]
>> As for limit= I'm not sure if it would be helpful since I run this
>> nightly. Anything that doesn't get done tonight due to limit, would be
>> done tomorrow?
>I'm suggesting limit= on its own. It's a fixed amount of work
> compared to usage=, which may not do anything at all. For example,
> it's perfectly possible to have a filesystem which is, say, 30% full,
> and yet is still fully-allocated filesystem with more than 20% of
> every chunk used. In that case your usage= wouldn't balance anything,
> and you'd still be left in the situation of risking ENOSPC from
> running out of metadata.

Hugo, as I don't have any feedback on my approach to this problem, could
you have a look at my script, or simply at the principle: is there any
drawback, versus using limit=, in calling balance multiple times with an
increasing usage= value (using the same value for data and metadata)
until you get enough free space?

For reference :

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb


Lionel


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Lionel Bouton
Le 14/05/2017 à 23:30, Kai Krakow a écrit :
> Am Sun, 14 May 2017 22:57:26 +0200
> schrieb Lionel Bouton <lionel-subscript...@bouton.name>:
>
>> I've coded one Ruby script which tries to balance between the cost of
>> reallocating group and the need for it.[...]
>> Given its current size, I should probably push it on github...
> Yes, please... ;-)

Most of our BTRFS filesystems are used by Ceph OSD, so here it is :

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb

Best regards,

Lionel


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Lionel Bouton
Le 14/05/2017 à 22:15, Marc MERLIN a écrit :
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
>> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
>>> Kernel 4.11, btrfs-progs v4.7.3
>>>
>>> I run scrub and balance every night, been doing this for 1.5 years on this
>>> filesystem.
>> What are the exact commands you run every day?
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20

usage=20 is pretty low: it means you don't try to reallocate and regroup
block groups that are more than 20% full. Constantly using this setting
has left lots of allocated block groups on your filesystem that are
mostly empty (a little more than 20% used).

The rebalance subject is a bit complex. With an empty filesystem you
almost don't need it as group creation is sparse and it's OK to have
mostly empty groups. When your filesystem begins to fill up you have to
raise the usage target to be able to reclaim space (as the fs fills up
most of your groups do too) so that new block creation can happen.

I've coded a Ruby script which tries to balance the cost of reallocating
block groups against the need for it. The basic idea is that it tries to
keep below a threshold the proportion of free space "wasted" by being
allocated but not used. It will bring this proportion down enough
through balance that minor reallocation won't trigger a new balance
right away. It should handle pathological conditions as well as possible
and it won't spend more than 2 hours working on a single filesystem by
default. We deploy this as a daily cron script through Puppet on all our
systems and it works very well (I haven't had to use balance manually to
manage free space since we did that).
Note that by default it sleeps a random amount of time to avoid IO
spikes on VMs running on the same host. You can either edit it or pass
it "0", which will be used as the maximum amount of time to sleep,
bypassing this precaution.
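
(A toy C sketch of that idea, not the actual Ruby script: treat space
that is allocated to block groups but not used as "wasted", and only run
balance, with a progressively higher usage filter, while that proportion
stays above a target. The figures and the reclaim estimate are made up.)

#include <stdio.h>

int main(void)
{
        /* made-up figures, in GiB, as btrfs filesystem usage reports them */
        double size = 1000.0, allocated = 900.0, used = 600.0;
        double target = 0.10;   /* tolerate 10% allocated-but-unused space */
        double wasted = (allocated - used) / size;

        for (int usage = 10; usage <= 90 && wasted > target; usage += 10) {
                printf("wasted %.0f%%, would run: "
                       "btrfs balance start -dusage=%d -musage=%d /mnt\n",
                       wasted * 100.0, usage, usage);
                /* pretend each pass reclaims half of the unused allocated
                 * space (purely illustrative) */
                allocated -= (allocated - used) / 2.0;
                wasted = (allocated - used) / size;
        }
        printf("done, wasted down to %.0f%%\n", wasted * 100.0);
        return 0;
}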

Here is the latest version : https://pastebin.com/Rrw1GLtx
Given its current size, I should probably push it on github...

I've seen other maintenance scripts mentioned on this list, so you might
find something simpler or better targeted to your needs by browsing
through the list's history.

Best regards,

Lionel


Re: help : "bad tree block start" -> btrfs forced readonly

2017-03-17 Thread Lionel Bouton
Hi,

some news from the coal mine...

Le 17/03/2017 à 11:03, Lionel Bouton a écrit :
> [...]
> I'm considering trying to use a 4 week old snapshot of the device to
> find out if it was corrupted or not instead. It will still be a pain if
> it works but rsync for less than a month of data is at least an order of
> magnitude faster than a full restore.

btrfs check -p /dev/sdb is running on this 4-week-old snapshot. The
extents check passed without any error; it is currently checking the
free space (which just finished while I was writing this, it is now
doing fs roots).

I'm not sure of the list of checks it performs. I assume the free
space^H... fs roots check can't take much longer than the rest (on a
filesystem with ~13TB of 20TB used, ~10 million files and half a dozen
subvolumes). It took less than an hour to check extents. I'll give it
another hour and stop it if it's not done: it's already passing stages
that the live data couldn't get to.

I may be wrong but I suspect Ceph is innocent of any wrongdoing here: I
think there's a high probability that if Ceph could corrupt its data in
our configuration, the snapshot would have been corrupted too (most of
its data is shared with the live data). I wonder if QEMU or the VM
kernel managed to transform IO timeouts (which clearly happened below
Ceph and were passed to the VM in many instances) into garbage reads
which ended up as garbage writes. If it isn't in QEMU and happened in
the kernel, this was with 4.1.15, so it might be a since-corrected
kernel bug in either the block or fs layers. I'm not especially ecstatic
at the prospect of testing this behavior again, but I will automate more
Ceph snapshots in the future (and the VM is now on 4.9.6).

Best regards,

Lionel


Re: help : "bad tree block start" -> btrfs forced readonly

2017-03-17 Thread Lionel Bouton
Le 17/03/2017 à 10:51, Roman Mamedov a écrit :
> On Fri, 17 Mar 2017 10:27:11 +0100
> Lionel Bouton <lionel-subscript...@bouton.name> wrote:
>
>> Hi,
>>
>> Le 17/03/2017 à 09:43, Hans van Kranenburg a écrit :
>>> btrfs-debug-tree -b 3415463870464
>> Here is what it gives me back :
>>
>> btrfs-debug-tree -b 3415463870464 /dev/sdb
>> btrfs-progs v4.6.1
>> checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
>> checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
>> bytenr mismatch, want=3415463870464, have=72340172838076673
>> ERROR: failed to read 3415463870464
>>
>> Is there a way to remove part of the tree and keep the rest ? It could
>> help minimize the time needed to restore data.
> If you are able to experiment with writable snapshots, you could try using
> "btrfs-corrupt-block" to kill the bad block, and see what btrfsck makes out of
> the rest. In a similar case I got little to no damage to the overall FS.
> http://www.spinics.net/lists/linux-btrfs/msg53061.html
>
I've launched btrfs check in read-only mode :

btrfs check -p /dev/sdb
Checking filesystem on /dev/sdb
UUID: dbbde1f0-d8a0-4c7c-a7b8-17237e98e525
checksum verify failed on 3415463755776 found A85405B7 wanted 01010101
checksum verify failed on 3415463755776 found A85405B7 wanted 01010101
bytenr mismatch, want=3415463755776, have=72340172838076673
checksum verify failed on 3415464001536 found A85405B7 wanted 01010101
checksum verify failed on 3415464001536 found A85405B7 wanted 01010101
bytenr mismatch, want=3415464001536, have=72340172838076673
checksum verify failed on 3415464640512 found A85405B7 wanted 01010101
checksum verify failed on 3415464640512 found A85405B7 wanted 01010101
bytenr mismatch, want=3415464640512, have=72340172838076673

This goes on for pages... I probably missed some output and then there
are lots of errors like this one :

ref mismatch on [3415470456832 16384] extent item 1, found 0
Backref 3415470456832 root 3420 not referenced back 0x268013d0
Incorrect global backref count on 3415470456832 found 1 wanted 0
backpointer mismatch on [3415470456832 16384]
owner ref check failed [3415470456832 16384]

...

Followed by lots of this :

ref mismatch on [11010388205568 278528] extent item 1, found 0
checksum verify failed on 3415464869888 found A85405B7 wanted 01010101
checksum verify failed on 3415464869888 found A85405B7 wanted 01010101
bytenr mismatch, want=3415464869888, have=72340172838076673
Incorrect local backref count on 11010388205568 root 257 owner 7487206
offset 0 found 0 wanted 1 back 0x72335670
Backref disk bytenr does not match extent record, bytenr=11010388205568,
ref bytenr=0
backpointer mismatch on [11010388205568 278528]
owner ref check failed [11010388205568 278528]

...

I stopped there: am I correct in thinking that it will take ages to try
to salvage this, without any guarantee that I'll get back a substantial
number of the 10 million files on this filesystem?

I'm considering instead trying to use a 4-week-old snapshot of the
device to find out whether it was corrupted or not. It will still be a
pain if it works, but rsyncing less than a month of data is at least an
order of magnitude faster than a full restore.

Lionel


Re: help : "bad tree block start" -> btrfs forced readonly

2017-03-17 Thread Lionel Bouton
Hi,

Le 17/03/2017 à 09:43, Hans van Kranenburg a écrit :
> btrfs-debug-tree -b 3415463870464

Here is what it gives me back :

btrfs-debug-tree -b 3415463870464 /dev/sdb
btrfs-progs v4.6.1
checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
bytenr mismatch, want=3415463870464, have=72340172838076673
ERROR: failed to read 3415463870464

Is there a way to remove part of the tree and keep the rest ? It could
help minimize the time needed to restore data.

Lionel


Re: help : "bad tree block start" -> btrfs forced readonly

2017-03-17 Thread Lionel Bouton
Le 17/03/2017 à 05:32, Lionel Bouton a écrit :
> Hi,
>
> [...]
> I'll catch some sleep right now (it's 5:28 AM here) but I'll be able to
> work on this in 3 or 4 hours.

I woke up to this :

Mar 17 06:56:30 fileserver kernel: btree_readpage_end_io_hook: 104476
callbacks suppressed
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464
Mar 17 06:56:30 fileserver kernel: BTRFS error (device sdb): bad tree
block start 72340172838076673 3415463870464

and the server was unusable.

I just moved the client to a read-only backup server and we are trying
to find out whether we can salvage this or whether we should start the
full restore procedure.

Help ?

Lionel


help : "bad tree block start" -> btrfs forced readonly

2017-03-16 Thread Lionel Bouton
Hi,

our largest BTRFS filesystem is damaged but I'm unclear if it is
recoverable or not. This is a 20TB filesystem with ~13TB used in a
virtual machine using virtio-scsi backed by Ceph (Firefly 0.8.10).
The following messages have become more frequent :

fileserver kernel: sd 0:0:1:0: [sdb] tag# abort

This can sometimes happen under heavy IO load and I didn't immediately
spot a new cause for them : a failing disk. Then I saw this after a
failed monthly scrub :

Mar 13 03:49:01 fileserver kernel: BTRFS: checksum error at logical
13373533028352 on dev /dev/sdb, sector 26004838336, root 257, inode
8155339, offset 131072, length 4096, links 1 (path: )

This was surprising as I thought Ceph would not give back bad data. I
saw this kind of error too:

Mar  7 18:33:53 fileserver kernel: BTRFS warning (device sdb): csum
failed ino 8155339 off 1073152 csum 1108896639 expected csum 1374028982

The csum was always 1108896639 for different chunks, so I suspect this
is the csum of a zero-filled block of data. So in case of timeouts maybe
virtio-scsi just returns a block full of zeros. I actually tried to read
the affected files and saw Ceph OSD timeouts on the disk I suspected of
failing at the same time I got the IO error.
The disk is confirmed to have relocated ~40 sectors in the same period
the problems appeared; it is behind an HP SATA/SAS controller so it
isn't easy to get the full SMART info.

I restored all the affected files and launched another full scrub, which
passed successfully, but unfortunately the damage got worse shortly
after:

Mar 16 23:30:09 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:09 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:09 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:09 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:10 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:10 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:10 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:10 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:10 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:20 fileserver kernel: BTRFS (device sdb): bad tree block
start 72340172838076673 3415463870464
Mar 16 23:30:20 fileserver kernel: [ cut here ]
Mar 16 23:30:20 fileserver kernel: WARNING: CPU: 2 PID: 3556 at
fs/btrfs/super.c:260 __btrfs_abort_transaction+0x46/0x110()
Mar 16 23:30:20 fileserver kernel: BTRFS: Transaction aborted (error -5)
Mar 16 23:30:20 fileserver kernel: Modules linked in: nfsd auth_rpcgss
oid_registry nfs_acl ipv6 binfmt_misc mousedev 8250 processor
crc32c_intel psmouse thermal_sys serial_core button dm_zero dm_thin_pool
dm_persistent_data dm_bio_prison dm_service_time dm_round_robin
dm_queue_length dm_multipath dm_log_userspace dm_delay virtio_console
xts gf128mul aes_x86_64 cbc sha512_generic sha256_generic sha1_generic
scsi_transport_iscsi fuse overlay xfs libcrc32c nfs lockd grace sunrpc
fscache jfs reiserfs multipath linear raid10 raid1 raid0 dm_raid raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod
dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log usbhid
xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd usb_storage ehci_pci
ehci_hcd usbcore usb_common sr_mod cdrom sg virtio_net
Mar 16 23:30:20 fileserver kernel: CPU: 2 PID: 3556 Comm:
btrfs-transacti Not tainted 4.1.15-gentoo-r1 #2
Mar 16 23:30:20 fileserver kernel: Hardware name: QEMU Standard PC
(i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Mar 16 23:30:20 fileserver kernel:   8163e153
81518242 88082c1ebd28
Mar 16 23:30:20 fileserver kernel:  8104ab7c 88042f73b600
fffb 88082c68c800
Mar 16 23:30:20 fileserver kernel:  8154f9d0 04a4
8104abf5 81636528
Mar 16 23:30:20 fileserver kernel: Call Trace:
Mar 16 23:30:20 fileserver kernel:  [] ?
dump_stack+0x40/0x50
Mar 16 23:30:20 fileserver kernel:  [] ?
warn_slowpath_common+0x7c/0xb0
Mar 16 23:30:20 fileserver kernel:  [] ?
warn_slowpath_fmt+0x45/0x50
Mar 16 23:30:20 fileserver kernel:  [] ?
__btrfs_abort_transaction+0x46/0x110
Mar 16 23:30:20 fileserver kernel:  [] ?
__btrfs_run_delayed_items+0xde/0x1d0
Mar 16 23:30:20 fileserver kernel:  [] ?
btrfs_commit_transaction+0x2b8/0xa60
Mar 16 23:30:20 fileserver kernel:  [] ?
start_transaction+0x8b/0x5a0
Mar 16 23:30:20 fileserver kernel:  [] ?
transaction_kthread+0x1cd/0x240
Mar 16 23:30:20 fileserver kernel:  [] ?
btrfs_cleanup_transaction+0x530/0x530
Mar 16 23:30:20 

Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> On 2017-02-07 15:36, Kai Krakow wrote:
>> Am Tue, 7 Feb 2017 09:13:25 -0500
>> schrieb Peter Zaitsev :
>>
>>> Hi Hugo,
>>>
>>> For the use case I'm looking for I'm interested in having snapshot(s)
>>> open at all time.  Imagine  for example snapshot being created every
>>> hour and several of these snapshots  kept at all time providing quick
>>> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
>>> think you also describe)  nodatacow  does not provide any advantage.
>>
>> Out of curiosity, I see one problem here:
>>
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
> Correct.
>>
>> I think I've read that btrfs snapshots do not guarantee single point in
>> time snapshots - the snapshot may be smeared across a longer period of
>> time while the kernel is still writing data. So parts of your writes
>> may still end up in the snapshot after issuing the snapshot command,
>> instead of in the working copy as expected.
> Also correct AFAICT, and this needs to be better documented (for most
> people, the term snapshot implies atomicity of the operation).

Atomicity can be a relative term. If the snapshot atomicity is relative
to barriers but not to individual writes between barriers, then AFAICT
it's fine, because the filesystem doesn't make any promise it won't
keep, even in the context of its snapshots.
Consider a power loss: the filesystem's atomicity guarantees can't go
beyond what the hardware guarantees, which means not all in-flight
writes will reach the disk and partial writes can happen. Modern
filesystems will remain consistent though, and if an application using
them makes use of f*sync it can provide its own guarantees too. The same
should apply to snapshots: the in-flight writes may or may not complete
on disk before the snapshot; what matters is that both the snapshot and
these writes will be completed after the next barrier (and any robust
application will ignore the in-flight writes it finds in the snapshot if
they were part of a batch that should have been atomically committed).

This is why, AFAIK, PostgreSQL or MySQL with their default ACID-compliant
configuration will recover from a BTRFS snapshot in the same way they
recover from a power loss.
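
(A minimal C sketch of the application-side pattern this relies on: a
change only counts as committed once the corresponding fsync() has
returned, so a snapshot, like a power loss, may contain the data write
without the commit record but never the commit record without its data.
The file names are hypothetical and error handling is omitted.)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int data = open("table.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int log  = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        const char *payload = "row 42 -> v2\n";
        const char *commit  = "COMMIT 42\n";

        /* 1. write the data and make it durable */
        write(data, payload, strlen(payload));
        fsync(data);

        /* 2. only then write and flush the commit record: anything that
         * interrupts us here (power loss, snapshot) sees either no
         * commit record or a commit record whose data is already on disk */
        write(log, commit, strlen(commit));
        fsync(log);

        close(log);
        close(data);
        return 0;
}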

Lionel


Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
Le 07/02/2017 à 21:36, Kai Krakow a écrit :
> [...]
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.


I don't think so, for three reasons:
- such behavior is so far from an admin's expectations that someone
would have documented it in "man btrfs-subvolume",
- the CoW nature of Btrfs makes this trivial: it only has to keep the
old versions of the data and the corresponding tree instead of unlinking
them for it to work,
- the backup server I referred to has restarted a PostgreSQL system from
snapshots about one thousand times now without a single problem, while
being almost continuously updated by streaming replication.

Lionel


Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
Hi Peter,

Le 07/02/2017 à 15:13, Peter Zaitsev a écrit :
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is
> there any autodefrag documentation available about how is it expected
> to work and if it can be tuned in any way

There's not much that can be done if the same file is modified in 2
different subvolumes (typically the original and a R/W snapshot). You
either break the reflink around the modification to limit the amount of
fragmentation (which will use disk space and write I/O) or get
fragmentation on at least one subvolume (which will add seeks).
So the only options are either to flatten the files (which can be done
incrementally by defragmenting them on both sides when they change) or
only defragment the most used volume (especially if the other is a
relatively short-lived snapshot where performance won't degrade much
until it is removed and won't matter much).

I just modified our defragmenter scheduler to be aware of multiple
subvolumes and support ignoring some of them. The previous version (not
tagged, sorry) was battle tested on a Ceph cluster and was designed for
it. Autodefrag didn't work with Ceph with our workload (latency went
through the roof, OSDs were timing out requests, ...) and our scheduler
with some simple Ceph BTRFS related tunings gave us even better
performance than XFS (which is usually the recommended choice with
current Ceph versions).

The current version is probably still rough around the edges as it is
brand new (most of the work was done last Sunday) and only running on a
backup server with a situation not much different from yours: a large
PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a
daily snapshot used to start a PostgreSQL instance for "tests on real
data" purposes, plus a copy of a <10TB NFS server with similar snapshots
in place. All of this is on a single 13-14TB RAID10 BTRFS filesystem.
In our case, using autodefrag on this slowly degraded performance to the
point where off-site backups became slow enough to warrant preventive
measures.
The current scheduler looks for the mountpoints of top BTRFS volumes (so
you have to mount the top volume somewhere), and defragments them avoiding :
- read-only snapshots,
- all data below configurable subdirs (including read-write subvolumes
even if they are mounted elsewhere), see README.md for instructions.

It slowly walks all files eligible for defragmentation and in parallel
detects writes to the same filesystem, including writes to read-write
subvolumes mounted elsewhere, to trigger defragmentation. The scheduler
uses an estimated "cost" for each file to prioritize defragmentation
tasks, and with default settings it tries to keep I/O activity low
enough that it doesn't slow down other tasks too much. However, it
defragments files whole, which might put some strain on huge ibdata*
files if you didn't switch to file-per-table. In our case defragmenting
1GB files is OK and doesn't have a major impact.
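
(A very rough C sketch of the prioritization idea, not the actual Ruby
scheduler: every candidate file carries an estimated cost, and each pass
only defragments the most expensive ones within a small budget so the
background I/O stays bounded. The paths, costs and budget are made up.)

#include <stdio.h>
#include <stdlib.h>

struct candidate {
        const char *path;
        double cost;    /* estimated read cost vs the defragmented file */
};

static int by_cost_desc(const void *a, const void *b)
{
        double ca = ((const struct candidate *)a)->cost;
        double cb = ((const struct candidate *)b)->cost;

        return (ca < cb) - (ca > cb);
}

int main(void)
{
        struct candidate files[] = {    /* made-up snapshot of the queue */
                { "pg/base/16384/2619",    4.7 },
                { "nfs/projects/a.tar",    1.1 },
                { "pg/pg_xlog/0000000A",   6.3 },
                { "nfs/homes/b/mail.mbox", 2.0 },
        };
        int n = sizeof(files) / sizeof(files[0]);
        int budget = 2;                 /* files defragmented per pass */

        qsort(files, n, sizeof(files[0]), by_cost_desc);
        for (int i = 0; i < n && i < budget; i++)
                printf("would defragment %s (cost %.1f)\n",
                       files[i].path, files[i].cost);
        return 0;
}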

We are already seeing better performance (our total daily backup time is
below worrying levels again) even though the scheduler hasn't finished
walking the whole filesystem yet (there are approximately 8 million
files and it is configured to evaluate them over a week). This is
probably because it follows the most write-active files (which are in
the PostgreSQL slave directory) and defragmented most of them early.

Note that it is tuned for filesystems using ~2TB 7200rpm drives (there
are some options that will adapt it to subsystems with more I/O
capacity). Using drives with different capacities shouldn't need tuning,
but it probably will not work well on SSD (it should be configured to
speed up significantly).

See https://github.com/jtek/ceph-utils ; the script you want is
btrfs-defrag-scheduler.rb.

Some parameters are available (start it with --help). You should
probably start it with --verbose at least until you are comfortable with
it to get a list of which files are defragmented and many debug messages
you probably want to ignore (or you'll probably have to read the Ruby
code to fully understand what they mean).

I don't provide any warranty for it but the worst I believe can happen
is no performance improvements or performance degradation until you stop
it. If you don't blacklist read-write snapshots with the .no-defrag file
(see README.md) defragmentation will probably eat more disk space than
usual. Space usage will go up rapidly during defragmentation if you have
snapshots, it is supposed to go down after all snapshots referring to
fragmented files are removed and replaced by new snapshots (where
fragmentation should be more stable).

Best regards,

Re: missing checksums on reboot

2016-12-02 Thread Lionel Bouton
Hi,

Le 02/12/2016 à 20:07, Blake Lewis a écrit :
> Hi, all, this is my first posting to the mailing list.  I am a
> long-time file system guy who is just starting to take a serious
> interest in btrfs.
>
> My company's product uses btrfs for its backing storage.  We
> maintain a log file to let us synchronize after reboots.  In
> testing, we find that when the system panics and we read the
> file after coming back up, we intermittently (but fairly often)
> get "no csum found for inode X start Y" messages and from our
> point of view, the log is corrupt.
>
> Here are a few pertinent details:
>
> 1) When we see this, the device is always an SSD.
> 2) We reproduce it easily with 3.10 kernels 

Wow. That's ancient and certainly full of various bugs fixed since its
release.

> but we have not
> been able to reproduce it in 4.8.
> 3) The log file is opened with O_SYNC|O_DIRECT; its size is 128MB
> and we are appending to it.
> 4) No other activity in the file system except a generated sequential
> write workload
> 5) Panics are induced with "echo c > /proc/sysrq-trigger".
>
> We filed a bug (https://bugzilla.kernel.org/show_bug.cgi?id=188051)
> but I wanted to see if anyone here recognized these symptoms and
> could point me in the right direction, especially since the problem
> seems to have gone away in more recent releases.  We can't realistically
> make our customers run newer kernels,

Why? If the kernel has a bug they have to update it to get the fix;
there's no way around it. Unless they use exotic software (typically
proprietary kernel modules) which won't work with later kernels, whether
you apply a simple patch or move to a much newer kernel doesn't make
much of a difference (it may be hard to package a recent kernel for an
ancient distribution, but it's definitely doable and usually transparent
for its users).

Btrfs and old kernels don't mix *at all*. I wouldn't advise using it in
any environment where updating the kernel to the latest mainline isn't
possible.

AFAIK from my reading of this mailing list:
- btrfs developers don't backport patches, at least not to anything but
the latest stable kernel version (currently 4.4.x),
- distributions aren't known to backport patches either (you should ask
your distribution's support for specifics to make sure). Note: I'm not
sure why they compile in btrfs support, making users think it's OK to
use it as-is, when they don't actually support it.

I think 3.10 is pretty much unmaintained btrfs-wise. So much work has
been done on btrfs in the last 3 years (3.10 is more than 3 years old)
that applying patches to a 3.10 distribution kernel is probably orders
of magnitude more complex than packaging a recent kernel for an old
distribution.

Best regards,

Lionel


Re: Convert from RAID 5 to 10

2016-11-29 Thread Lionel Bouton
Hi,

Le 29/11/2016 à 18:20, Florian Lindner a écrit :
> [...]
>
> * Any other advice? ;-)

Don't rely on RAID too much... The degraded mode is unstable even for
RAID10: you can corrupt data simply by writing to a degraded RAID10. I
could reliably reproduce this on a 6-device RAID10 BTRFS filesystem with
a missing device. It affected even a 4.8.4 kernel, where our PostgreSQL
clusters got frequent write errors (on the fs itself but not on the 5
working devices) and managed to corrupt their data. Have backups; you
will probably need them.

With Btrfs RAID, if you have a failing device, replace it early (monitor
the devices and don't wait for them to fail if you get transient errors
or see worrying SMART values). If you have a failed device, don't
actively use the filesystem in degraded mode: replace, or delete/add,
before writing to the filesystem again.

Best regards,

Lionel


replace panic solved with add/balance/delete was: Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-11-12 Thread Lionel Bouton
Hi,

here's how I managed to recover from a BTRFS replace panic which
happened even on 4.8.4.

The kernel didn't seem to handle our raid10 filesystem with a missing
device correctly (even though it passed a precautionary scrub before
removing the device) :
- replace didn't work and triggered a kernel panic,
- we saw PostgreSQL corruption (duplicate entries in indexes and write
errors), both for database clusters using NoCoW and CoW (we run several
clusters on this filesystem and configure them differently based on our
needs).

What finally worked was adding devices to the filesystem, balancing
(I added skip_balance in fstab in case the balance would trigger a panic
like replace did), which moved away the data allocated to the missing
device, and then deleting it.
I didn't dare delete without balancing first, as I couldn't get
confirmation that skip_balance would let me stop the balance triggered
by the delete (which could have meant a panic each time we tried to
mount the filesystem). In the end it seems that balancing before
deleting does the same work: balance correctly detects that it shouldn't
use the missing device and reallocates all the data properly.

The sad result is that we are currently forced to check/restore most of
the data just because we had to replace a single disk: clearly BTRFS
can't handle itself properly until the missing device is completely
removed. That's not what I expected when using raid10 :-(

Best regards,

Lionel


Re: Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-10-28 Thread Lionel Bouton
Hi,

as I don't have much time to handle a long backup recovery, I didn't try
the delete/add combination, to avoid any risk.
What I tried, though, was fatal_errors=bug. As I don't have any console,
I thought it might at least help log the problem instead of causing the
usual kernel panic.

No luck: the problem still made the kernel panic. Unless someone comes
up with a somewhat safe way to recover from this situation, I'll leave
the filesystem as is (we are building a new platform where redundancy
will be handled by Ceph anyway).

Lionel

Le 27/10/2016 à 18:07, Lionel Bouton a écrit :
> Hi,
>
> Le 27/10/2016 à 02:50, Lionel Bouton a écrit :
>> [...]
>> I'll stop for tonight and see what happens during the day. I'd like to
>> try a device add / delete next but I'm worried I could end up with a
>> completely unusable filesystem if the device delete hits the same
>> problem than replace.
>> If the replace resuming on mount crashes the system I can cancel it but
>> there's no way to do so with a device delete. Or is by any chance the
>> skip_balance mount option a way to cancel a delete ?
> Can anyone just confirm (or infirm) that skip_balance will indeed
> effectively cancel a device delete before I try something that could
> force me to resort to backups ?
> Lionel



Re: Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-10-27 Thread Lionel Bouton
Hi,

Le 27/10/2016 à 02:50, Lionel Bouton a écrit :
> [...]
> I'll stop for tonight and see what happens during the day. I'd like to
> try a device add / delete next but I'm worried I could end up with a
> completely unusable filesystem if the device delete hits the same
> problem as replace.
> If the replace resuming on mount crashes the system I can cancel it but
> there's no way to do so with a device delete. Or is the skip_balance
> mount option by any chance a way to cancel a delete?

Can anyone just confirm (or deny) that skip_balance will indeed
effectively cancel a device delete before I try something that could
force me to resort to backups ?

Lionel


Re: Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-10-26 Thread Lionel Bouton
Hi,

Le 27/10/2016 à 01:54, Lionel Bouton a écrit :
>
> I'll post the final result of the btrfs replace later (it's currently at
> 5.6% after 45 minutes).

Result: kernel panic (so 4.8.4 didn't solve my main problem).
Unfortunately I don't have a remote KVM anymore so I couldn't capture
this one. panic=60 however did its job twice (I tried to mount the
filesystem again), confirming that a panic occurred.

It seems the problem may be tied to a precise location on disk. Yesterday
the last replace resume logged:
Oct 26 00:40:56 zagreus kernel: BTRFS info (device sdb2): continuing
dev_replace from <missing disk> (devid 7) to /dev/sdb2 @12%

And today I was switching between screen terminal windows when the crash
happened and the replace was at 12.6%. A mount triggers the crash in ~20
seconds, which is similar to what happened yesterday on the last try.

I've successfully canceled the replace to get back a usable system.
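
For reference, the cancel itself is a single command (mount point is an
example):

btrfs replace cancel /mnt/btrfs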

I'll stop for tonight and see what happens during the day. I'd like to
try a device add / delete next but I'm worried I could end up with a
completely unusable filesystem if the device delete hits the same
problem as replace.
If the replace resuming on mount crashes the system I can cancel it but
there's no way to do so with a device delete. Or is the skip_balance
mount option by any chance a way to cancel a delete?

Best regards,

Lionel


Re: Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-10-26 Thread Lionel Bouton
Hi,

Le 26/10/2016 à 02:57, Lionel Bouton a écrit :
> Hi,
>
> I'm currently trying to recover from a disk failure on a 6-drive Btrfs
> RAID10 filesystem. A "mount -o degraded" auto-resumes a current
> btrfs-replace from a missing dev to a new disk. This eventually triggers
> a kernel panic (and the panic  seemed faster on each new boot). I
> managed to cancel the replace, hoping to get a usable (although in
> degraded state) system this way.

The system didn't crash during the day (yay). I did have some
PostgreSQL slave servers getting I/O errors (these are still on btrfs
because we snapshot them), but PostgreSQL is quite resilient: it aborts
and restarts automatically most of the time.
I've just rebooted with Gentoo's 4.8.4 and started a new btrfs replace.
The only problem so far (the night is young) is this :

Oct 27 00:36:57 zagreus kernel: BTRFS info (device sdc2): dev_replace
from <missing disk> (devid 7) to /dev/sdb2 started
Oct 27 00:43:01 zagreus kernel: BTRFS: decompress failed
Oct 27 01:06:59 zagreus kernel: BTRFS: decompress failed

This is the first time I've seen the "decompress failed" message, so
clearly 4.8.4 has changes that detect some kind of corruption that
happened on this system with compressed extents.
I've not seen any sign of a process getting an IO error (which it should
get according to lzo.c, where I found the 2 possible printk calls for
this message) so I don't have a clue which file might be corrupted. It's
probably a very old extent: this filesystem uses compress=zlib now, lzo
was only used a long time ago.

Some more information with the current state:

uname -a :
Linux zagreus. 4.8.4-gentoo #1 SMP Wed Oct 26 03:39:19 CEST 2016
x86_64 Intel(R) Core(TM) i7 CPU 975 @ 3.33GHz GenuineIntel GNU/Linux

btrfs --version :
btrfs-progs v4.6.1

btrfs fi show :
Label: 'raid10_btrfs'  uuid: c67683fd-8fe3-4966-8b0a-063de25ac44c
Total devices 7 FS bytes used 6.03TiB
devid 0 size 2.72TiB used 2.14TiB path /dev/sdb2
devid 2 size 2.72TiB used 2.14TiB path /dev/sda2
devid 5 size 2.72TiB used 2.14TiB path /dev/sdc2
devid 6 size 2.72TiB used 2.14TiB path /dev/sde2
devid 8 size 2.72TiB used 2.14TiB path /dev/sdg2
devid 9 size 2.72TiB used 2.14TiB path /dev/sdd2
*** Some devices missing

btrfs fi df /mnt/btrfs :
Data, RAID10: total=6.42TiB, used=6.02TiB
System, RAID10: total=288.00MiB, used=656.00KiB
Metadata, RAID10: total=13.41GiB, used=11.49GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I'll post the final result of the btrfs replace later (it's currently at
5.6% after 45 minutes).

Best regards,

Lionel

>
> This is a hosted system and I just managed to have a basic KVM connected
> to the rescue system where I could capture the console output after the
> system stopped working.
> This is on a 4.6.x kernel (I didn't have the opportunity to note down
> the exact version yet) and I got this :
>
> http://imgur.com/a/D10z6
>
> The following elements in the stack trace caught my attention because I
> remembered seeing some problems with compression and recovery reported
> here :
> clean_io_failure, btrfs_submit_compressed_read, btrfs_map_bio
>
> I found discussions on similar cases (involving clean_io_failure,
> btrfs_submit_compressed_read, btrfs_map_bio) but it isn't clear to me if :
> - the filesystem is damaged to the point where my best choice is
> restoring backups and generating data again (a several days process but
> I can manage to bring back the most important data in less than a day),
> - a simple kernel upgrade can work around this (I currently run 4.4.6
> with the default Gentoo patchset which probably triggers the same kind of
> problem although I don't have a kernel panic screenshot yet to prove it).
>
> Other miscellaneous information:
>
> Another problem is that corruption happened at least 2 times on the
> single subvolume hosting only nodatacow files (a PostgreSQL server). I'm
> currently restoring backups for this data on mdadm raid10 + ext4 as it
> is the most used service of this system...
>
> The filesystem is quite old (it probably began its life with 3.19 kernels).
>
> It passed a full scrub with flying colors a few hours ago.
>
> A btrfs check in the rescue environment found this :
> checking extents
> checking free space cache
> checking fs roots
> root 4485 inode 608 errors 400, nbytes wrong
> found 3136342732761 bytes used err is 1
> total csum bytes: 6403620384
> total tree bytes: 12181405696
> total fs tree bytes: 2774007808
> total extent tree bytes: 1459339264
> btree space waste bytes: 2186016312
> file data blocks allocated: 7061947838464
>  referenced 6796179566592
> Btrfs v3.17
>
> The subvolume 4485 inode 608 was a simple text file. I saved a copy,
> truncated/deleted it and restored it. btrfs check didn't complain at all
> after this.

Compression and device replace on raid10 kernel panic on 4.4.6 and 4.6.x

2016-10-25 Thread Lionel Bouton
Hi,

I'm currently trying to recover from a disk failure on a 6-drive Btrfs
RAID10 filesystem. A "mount -o degraded" auto-resumes a current
btrfs-replace from a missing dev to a new disk. This eventually triggers
a kernel panic (and the panic  seemed faster on each new boot). I
managed to cancel the replace, hoping to get a usable (although in
degraded state) system this way.

This is a hosted system and I just managed to have a basic KVM connected
to the rescue system where I could capture the console output after the
system stopped working.
This is on a 4.6.x kernel (I didn't have the opportunity to note down
the exact version yet) and I got this :

http://imgur.com/a/D10z6

The following elements in the stack trace caught my attention because I
remembered seeing some problems with compression and recovery reported
here :
clean_io_failure, btrfs_submit_compressed_read, btrfs_map_bio

I found discussions on similar cases (involving clean_io_failure,
btrfs_submit_compressed_read, btrfs_map_bio) but it isn't clear to me if :
- the filesystem is damaged to the point where my best choice is
restoring backups and generating data again (a several days process but
I can manage to bring back the most important data in less than a day),
- a simple kernel upgrade can work around this (I currently run 4.4.6
with the default Gentoo patchset which probably triggers the same kind of
problem, although I don't have a kernel panic screenshot yet to prove it).

Other miscellaneous information:

Another problem is that corruption happened at least 2 times on the
single subvolume hosting only nodatacow files (a PostgreSQL server). I'm
currently restoring backups for this data on mdadm raid10 + ext4 as it
is the most used service of this system...

The filesystem is quite old (it probably began its life with 3.19 kernels).

It passed a full scrub with flying colors a few hours ago.

A btrfs check in the rescue environment found this :
checking extents
checking free space cache
checking fs roots
root 4485 inode 608 errors 400, nbytes wrong
found 3136342732761 bytes used err is 1
total csum bytes: 6403620384
total tree bytes: 12181405696
total fs tree bytes: 2774007808
total extent tree bytes: 1459339264
btree space waste bytes: 2186016312
file data blocks allocated: 7061947838464
 referenced 6796179566592
Btrfs v3.17

The subvolume 4485 inode 608 was a simple text file. I saved a copy,
truncated/deleted it and restored it. btrfs check didn't complain at all
after this.

Currently compiling a 4.8.4 kernel with Gentoo patches. I can easily try
4.9-rc2 mainline or even a git tree if needed.
I can use this system without trying to replace the drive for a few days
if it can work reliably in this state. If I'm stuck with replace not
working, another solution I can try is adding one drive and then deleting
the missing one, if that works and is the only known way around this.

I have the opportunity to do some (non destructive) tests between 00:00
and 03:00 (GMT+2), or more if I don't fall asleep at the keyboard. The
filesystem has 6+TB of data and a total of 41 subvolumes (most of them
snapshots).

Best regards,

Lionel


Re: Is stability a joke?

2016-09-12 Thread Lionel Bouton
Hi,

On 12/09/2016 14:59, Michel Bouissou wrote:
>  [...]
> I never had problems with lzo compression, although I suspect that it (in 
> conjuction with snapshots) adds much fragmentation that may relate to the 
> extremely bad performance I get over time with mechanical HDs.

I had about 30 btrfs filesystems on 2TB drives for a Ceph cluster with
compress=lzo and a background process which detected files recently
written to and defragmented/recompressed them using zlib when they
reached an arbitrary fragmentation level (so the fs was a mix of lzo,
zlib and "normal" extents).

With our usage pattern, our Ceph cluster is faster with compress=zlib
than with the lzo-then-zlib mechanism (which tried to make writes
faster but was in fact counterproductive), so we made the switch to
compress=zlib this winter.

On these compress=lzo filesystems, at least 12 were often (up to
several times a week) corrupted by defective hardware controllers. I
never had any crash related to BTRFS under these conditions (at the time
with late 3.19 and 4.1.5 + Gentoo patches kernels). Is there a bug open
somewhere listing the affected kernel versions and the kind of usage that
could reproduce any lzo-specific problem (or any problem made worse by
lzo)?

Lionel


Re: btrfstune settings

2016-08-28 Thread Lionel Bouton
Hi,

happy borgbackup user here. This is probably off-topic for most but as
many users probably are evaluating send/receive versus other backup
solutions, I'll keep linux-btrfs in the loop.

On 28/08/2016 20:10, Oliver Freyermuth wrote:
>> Try borgbackup, I'm using it very successfully. It is very fast,
>> supports very impressive deduplication and compression, retention
>> policies, and remote backups - and it is available as a single binary
>> version so you can more easily use it for disaster recovery. One
>> downside: while it seems to restore nocow attributes, it seems to do it
>> in a silly way (it first creates the file, then sets the attributes,
>> which of course won't work for nocow). I have not checked that
>> extensively, only had to restore once yet.
> Wow - this looks like the holy grail I've been waiting for, not sure how
> I have missed that up to now. 
> Especially the deduplication across several backupped systems on the backup 
> target
> is interesting, I originally planned to do that using duperemove on the 
> backup target
> to dedupe across the readonly snapshots.

Note that only one backup can happen at a given time on a single repository.
You'll probably have to both schedule the backups to avoid collisions
and use "--lock-wait" with a large enough parameter to avoid backup
failures.
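
For illustration, the client side then looks something like this (a hedged
example; the repository URL, the paths, the compression choice and the
2-hour lock wait are placeholders):

borg create --lock-wait 7200 --compression lz4 \
    ssh://backup@backuphost/srv/borg/repo::'{hostname}-{now}' \
    /etc /home /srv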

There's another twist: borgbackup maintains a local index of the
repository's content; when it detects that this index is out of sync
(which will happen if several systems use the same repository) it has to
update it from the remote.
I'm not sure how heavy this update can be (it seems to use some form of
delta). I have a ~user/.cache/borg of ~2GB for a user with ~9TB of data
to back up in ~8M files, but I don't share repositories so I'm not
affected by this.

Best regards,

Lionel


Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Lionel Bouton
Le 21/06/2016 15:17, Graham Cobb a écrit :
> On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
>> The scrub design works, but the whole state file thing has some rather
>> irritating side effects and other implications, and developed out of
>> requirements that aren't present for balance (it might be nice to check
>> how many chunks actually got balanced after the fact, but it's not
>> absolutely necessary).
> Actually, that would be **really** useful.  I have been experimenting
> with cancelling balances after a certain time (as part of my
> "balance-slowly" script).  I have got it working, just using bash
> scripting, but it means my script does not know whether any work has
> actually been done by the balance run which was cancelled (if no work
> was done, but it timed out anyway, there is probably no point trying
> again with the same timeout later!).

I have the exact same use case.

To prevent possible ENOSPC events, we trigger balances when we detect
that the free space is mostly allocated but unused. A balance on busy
disks can slow other I/Os down, so we try to limit balances in time (in
our use case 15 to 30 min max is mostly OK).
Trying to emulate this by using [d|v]range was a possibility too, but I
thought it could be hard to get right. We actually inspect the allocated
space before and after to report the difference, but we don't know if
this difference is caused by the aborted balance or by other activity (we
have to read the kernel logs to find out).
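
For illustration, the time-limited part can be approximated with something
like this (a hedged sketch; the usage filter, the 30 minute timeout and the
mount point are arbitrary examples):

MNT=/srv/data
btrfs fi df $MNT                             # allocation before
( sleep 1800; btrfs balance cancel $MNT ) &  # watchdog: cancel if still running after ~30 min
WATCHDOG=$!
btrfs balance start -dusage=50 $MNT
kill $WATCHDOG 2>/dev/null                   # stop the watchdog if the balance finished in time
btrfs fi df $MNT                             # allocation after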

Lionel


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Lionel Bouton
Hi,

Le 09/05/2016 16:53, Niccolò Belli a écrit :
> On domenica 8 maggio 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also in the past week I've alwyas been on AC.
>
> On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well.  My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make).  GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building kernel with 4 concurrent threads is not an issue for my
> system, in fact I do compile a lot and I never had any issue.

Note: I once had a server which would pass memtest86 and repeated kernel
compilations maxing out the CPU threads, but couldn't at the same time
reliably compile a kernel and copy large amounts of data.
I think I lost my little automated test suite (I should definitely look
for it again or code it from scratch) but what I did on new servers since
then was the following (a shell sketch is included below):

1/ create a file larger than the system's RAM (this makes sure you will
read and write all data from disk and not only caches and might catch
controller hardware problems too) with dd if=/dev/urandom (several
gigabytes of random data exercise many different patterns, far more than
what memtest86 would test), compute its md5 checksum
2/ launch a subprocess repeatedly compiling the kernel with more jobs
than available CPU threads, stopping as soon as the make exit code
is != 0.
3/ launch another subprocess repeatedly copying the random file to
another location, exiting when the md5 checksum doesn't match the source.

Let it run as a burn-in test for as long as you can afford (from
experience, after 24 hours if it's still running the probability that the
test will find a problem becomes negligible).
If one of the subprocesses stops by itself, your hardware is not stable.

This actually caught a few unstable systems before it could go into
production for me.
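
Putting the three steps above together, a minimal sketch could look like
this (hedged: the paths, the file size and the kernel tree location are
examples; a real version should also log timestamps and clean up after
itself):

#!/bin/bash
# burn-in test: concurrent kernel builds + large checksummed copies
TESTFILE=/var/tmp/burnin.random
SIZE_MB=40960                       # must be larger than the installed RAM
KERNEL_SRC=/usr/src/linux           # assumes a configured kernel tree

# 1/ random reference file + checksum
dd if=/dev/urandom of="$TESTFILE" bs=1M count=$SIZE_MB
REF=$(md5sum "$TESTFILE" | cut -d' ' -f1)

# 2/ repeated kernel builds with more jobs than CPU threads
(
  cd "$KERNEL_SRC" || exit 1
  while make -s clean && make -j$(( $(nproc) * 2 )); do :; done
  echo "BUILD FAILED: hardware is not stable"
) &

# 3/ repeated copies of the reference file, checked against the checksum
(
  while cp "$TESTFILE" "$TESTFILE.copy" &&
        [ "$(md5sum "$TESTFILE.copy" | cut -d' ' -f1)" = "$REF" ]; do :; done
  echo "COPY CORRUPTED: hardware is not stable"
) &

wait    # let it run for as long as you can afford (e.g. 24 hours)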

Lionel


Re: RAID5 Unable to remove Failing HD

2016-04-19 Thread Lionel Bouton
Hi,

Le 19/04/2016 11:13, Anand Jain a écrit :
>
>>> # btrfs device delete 3 /mnt/store/
>>> ERROR: device delete by id failed: Inappropriate ioctl for device
>>>
>>> Were the patch sets above for btrfs-progs or for the kernel ?
>> [...]
>
>  By the way, For Lionel issue, delete missing should work ?
>  which does not need any additional patch.

Delete missing works with 4.1.15 and btrfs-progs 4.5.1 (see later), but
the device can't be marked missing online, so there's no way to maintain
redundancy without downtime. I was a little surprised: I half-expected
something like this because, reading this list, RAID recovery still seems
to be a pain point, but this isn't documented anywhere, and after looking
around the relevant information seems to only be in this thread (and many
people come from md and don't read this list, so they won't expect this
behavior at all).
While I was waiting for directions the system crashed with a kernel
panic (clearly linked to IO errors according to the panic message, but I
couldn't get the whole stacktrace) and the system wasn't able to boot
properly (kernel panic shortly after it mounted the filesystem on each
boot) until I removed the faulty drive (apparently it was somehow
readable enough to be recognized, but not enough to be usable).
After removing the faulty drive, delete missing worked and a balance is
currently running. By the way, it seems the drive bay itself was faulty:
the drive was not firmly fixed and its cage could move a bit around in
the chassis. It was the only bay like this; I didn't expect it and from
experience it's probably a factor in the hardware failure.

There may have been fixes since 4.1.15 to prevent the kernel panic
(there was only one device with IO errors, so ideally it shouldn't be
able to bring down the kernel), so it may not be worth further analysis.
That said, I'll have 2 new drives next week (one replacement, one spare)
and I have a chassis lying around where I could try to replicate
failures with various kernels on a RAID1 filesystem built with a brand
new drive and the faulty drive (until the faulty drive completely dies,
which they usually do in my experience). So if someone wants some tests
done with 4.6-rcX or even 4.6-rcX + patches, I can spend some time on it
next week.

Lionel


Re: RAID5 Unable to remove Failing HD

2016-04-18 Thread Lionel Bouton
Le 18/04/2016 10:59, Lionel Bouton a écrit :
> [...]
> So the obvious thing to do in this circumstance is to delete the drive,
> forcing the filesystem to create the missing replicas in the process and
> only reboot if needed (no hotplug). Unfortunately I'm not sure of the
> conditions where this is possible (which kernel version supports this if
> any ?). If there is a minimum kernel version where device delete works,
> can https://btrfs.wiki.kernel.org/index.php/Gotchas be updated ? I don't
> have a wiki account yet but I'm willing to do it myself if I can get
> reliable information.

Note that whatever the best course of action is, I think the wiki should
probably be updated with clear instructions on it. I'm willing to
document this myself, and probably other Gotchas too (like how to fix a
4-device RAID10 filesystem when one of the devices fails, based on the
recent discussion I've seen here), but I'm not sure I know all the
details and wouldn't want to put incomplete information in the wiki, so
I'll wait for answers before starting to work on this.

The data on this filesystem isn't critical and I have backups for the
most important files so I can live with a "degraded" state for a while
until I'm sure of the best way to proceed.

Best regards,

Lionel Bouton


Re: RAID5 Unable to remove Failing HD

2016-04-18 Thread Lionel Bouton
Hi,

Le 10/02/2016 10:00, Anand Jain a écrit :
>
>
> Rene,
>
> Thanks for the report. Fixes are in the following patch sets
>
>  concern1:
>  Btrfs to fail/offline a device for write/flush error:
>[PATCH 00/15] btrfs: Hot spare and Auto replace
>
>  concern2:
>  User should be able to delete a device when device has failed:
>[PATCH 0/7] Introduce device delete by devid
>
>  If you were able to tryout these patches, pls lets know.

Just found this thread after digging into a problem similar to mine.

I just got the same error when trying to delete a failed hard drive on a
RAID1 filesystem with a total of 4 devices.

# btrfs device delete 3 /mnt/store/
ERROR: device delete by id failed: Inappropriate ioctl for device

Were the patch sets above for btrfs-progs or for the kernel ?
Currently the kernel is 4.1.15-r1 from Gentoo. I used btrfs-progs-4.3.1
(the Gentoo stable version) but it didn't support delete by devid so I
upgraded to btrfs-progs-4.5.1 which supports it but got the same
"inappropriate ioctl for device" error when I used the devid.

I don't have any drive available right now for replacing this one (so no
btrfs dev replace is possible right now). The filesystem's data could fit
on only 2 of the 4 drives (in fact I just added 2 old drives that were
previously used with md and rebalanced, which is most probably what
triggered the failure of one of the new drives). So I can't use replace
and would prefer not to lose redundancy while waiting for new drives to
get here.

So the obvious thing to do in this circumstance is to delete the drive,
forcing the filesystem to create the missing replicas in the process and
only reboot if needed (no hotplug). Unfortunately I'm not sure of the
conditions where this is possible (which kernel version supports this if
any ?). If there is a minimum kernel version where device delete works,
can https://btrfs.wiki.kernel.org/index.php/Gotchas be updated ? I don't
have a wiki account yet but I'm willing to do it myself if I can get
reliable information.

I can reboot this system and I expect the current drive to appear
missing (it doesn't even respond to smartctl), and I suppose "device
delete missing" will work then. But should I (or must I) upgrade the
kernel to avoid this problem in the future, and if so, which version(s)
support deleting a failed device?
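
For reference, the fallback I have in mind after the reboot is roughly (a
hedged sketch; the device and mount point are examples):

mount -o degraded /dev/sda2 /mnt/store
btrfs device delete missing /mnt/store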

Best regards,

Lionel


Re: "/tmp/mnt.", and not honouring compression

2016-03-31 Thread Lionel Bouton
Le 31/03/2016 22:49, Chris Murray a écrit :
> Hi,
>
> I'm trying to troubleshoot a ceph cluster which doesn't seem to be
> honouring BTRFS compression on some OSDs. Can anyone offer some help? Is
> it likely to be a ceph issue or a BTRFS one? Or something else? I've
> asked on ceph-users already, but not received a response yet.
>
> Config is set to mount with "noatime,nodiratime,compress-force=lzo"
>
> Some OSDs have been getting much more full than others though, which I
> think is something to do with these 'tmp' mounts e.g. below:

Note that there are other reasons for unbalanced storage on Ceph OSDs.
The main one is too few PGs (there's a calculator on ceph.com, google
for it).
These tmp mounts aren't normal though, you should find out what is causing them.

So it might be a Ceph issue (too few PGs) or a system issue (some
component trying to use your filesystems for its own purposes).
You might have more luck on the ceph-users list (post your Ceph version,
the result of ceph osd tree, df output for all OSDs, and hunt for the
process creating these mounts on your systems).

It's probably not a Btrfs issue (I run a Ceph on Btrfs cluster in
production and I've never seen this kind of problem).

Lionel


Re: btrfs raid1 filesystem on sdcard corrupted

2016-02-25 Thread Lionel Bouton
Hi,

Le 25/02/2016 18:44, Hegner Robert a écrit :
> Am 25.02.2016 um 18:34 schrieb Hegner Robert:
>> Hi all!
>>
>> I'm working on a embedded system (ARM) running from a SDcard.

From experience, most SD cards are not to be trusted. They are not
designed for storing an operating system and application data but for
storing pictures and videos written on a VFAT...

>> Recently I
>> switched to a btrfs-raid1 configuration, hoping to make my system more
>> resistant against power failures and flash-memory specific problems.

Note that there's no gain against power failures with RAID1.

>>
>> However today one of my devices wouldn't mount my root filesystem as rw
>> anymore.
>>
>> The main reason I'm asking in this mailing list is not that I want to
>> restory data. But I'd like to understand what happened and, even more
>> importantly, find out what I have to do so that something like this will
>> never happen again.
>>
>> Here is some info about my system:
>>
>> root@ObserverOne:~# uname -a
>> Linux ObserverOne 3.16.0-4-armmp #1 SMP Debian 3.16.7-ckt11-1+deb8u6
>> (2015-11-09) armv7l GNU/Linux

This is a very old kernel considering BTRFS code is moving fast. But in
this instance this is not your problem.

>>
>> root@ObserverOne:~# btrfs --version
>> Btrfs v3.17
>>
>> root@ObserverOne:~# btrfs fi show
>> Label: none  uuid: eef07fbf-77cb-427a-b118-bf5295f25b66
>>  Total devices 2 FS bytes used 816.80MiB
>>  devid1 size 3.45GiB used 3.02GiB path /dev/mmcblk0p2
>>  devid2 size 3.45GiB used 3.02GiB path /dev/mmcblk0p3

You use RAID1 on the same device: it could protect you against localized
errors, but "localized" is difficult to define on a device which can
remap its address space to various locations: nothing will prevent a
flash failure from affecting both of your partitions. In this case RAID1
is useless.
In fact, using RAID1 on two partitions of the same physical device will
probably end up causing corruption earlier than without it: you are
writing twice as much to the same device, generating bad blocks twice as
fast.

> [...]

> [   12.021717] sunxi-mmc 1c0f000.mmc: smc 0 err, cmd 25, WR EBE !!
> [   12.027695] sunxi-mmc 1c0f000.mmc: data error, sending stop command
> [   12.035780] mmcblk0: timed out sending r/w cmd command, card status
> 0x900
> [   12.042640] end_request: I/O error, dev mmcblk0, sector 12386304
> [   12.048680] end_request: I/O error, dev mmcblk0, sector 12386312
> [   12.054708] end_request: I/O error, dev mmcblk0, sector 12386320
> [   12.060725] end_request: I/O error, dev mmcblk0, sector 12386328
> [   12.066744] BTRFS: bdev /dev/mmcblk0p3 errs: wr 1, rd 0, flush 0,
> corrupt 0, gen 0

Errors on one partition (/dev/mmcblk0p3).

> [   12.074324] end_request: I/O error, dev mmcblk0, sector 12386336
> [   12.080339] end_request: I/O error, dev mmcblk0, sector 12386344
> [   12.086353] end_request: I/O error, dev mmcblk0, sector 12386352
> [   12.092378] end_request: I/O error, dev mmcblk0, sector 12386360
> [   12.098393] BTRFS: bdev /dev/mmcblk0p3 errs: wr 2, rd 0, flush 0,
> corrupt 0, gen 0
> [   12.688370] sunxi-mmc 1c0f000.mmc: smc 0 err, cmd 25, WR EBE !!
> [   12.694342] sunxi-mmc 1c0f000.mmc: data error, sending stop command
> [   12.702553] mmcblk0: timed out sending r/w cmd command, card status
> 0x900
> [   12.709448] end_request: I/O error, dev mmcblk0, sector 2019328
> [   12.715393] end_request: I/O error, dev mmcblk0, sector 2019336
> [   12.721333] BTRFS: bdev /dev/mmcblk0p2 errs: wr 1, rd 0, flush 0,
> corrupt 0, gen 0

Errors on the other partition (/dev/mmcblk0p2) too.
So both are unreliable: RAID1 can't help, game over.

Lionel


Re: Major HDD performance degradation on btrfs receive

2016-02-23 Thread Lionel Bouton
Le 23/02/2016 19:30, Marc MERLIN a écrit :
> On Tue, Feb 23, 2016 at 07:01:52PM +0100, Lionel Bouton wrote:
>> Why don't you use autodefrag ? If you have writable snapshots and do
>> write to them heavily it would not be a good idea (depending on how
>> BTRFS handles this in most cases you would probably either break the
>> reflinks or fragment a snapshot to defragment another) but if you only
>> have read-only snapshots it may work for you (it does for me).
>  
> It's not a stupid question, I had issues with autodefrag in the past,
> and turned it off, but it's been a good 2 years, so maybe it works well
> enough now.
>
>> The only BTRFS filesystems where I disabled autodefrag where Ceph OSDs
>> with heavy in-place updates. Another option would have been to mark
>> files NoCoW but I didn't want to abandon BTRFS checksumming.
> Right. I don't have to worry about COW for virtualbox images there, and
> the snapshots are read only (well, my script makes read-write snapshots
> too, but I almost never use them. Hopefully their presence isn't a
> problem, right?)

I believe autodefrag only triggers defragmentation on access (write
access only, according to the wiki) and uses a queue of limited length
for the defragmentation tasks to perform. So the snapshots by themselves
won't cause problems. Even if you access files, the defragmentation
should be focused mainly on the versions of the files you access the
most. The real problems probably happen when you access the same file
from several snapshots with lots of internal modifications between the
versions in these snapshots: either autodefrag will break the reflinks
between them, or it will attempt to optimize the 2 file versions at
roughly the same time, which won't give any benefit but will waste I/O.

Lionel


Re: Major HDD performance degradation on btrfs receive

2016-02-23 Thread Lionel Bouton
Le 23/02/2016 18:34, Marc MERLIN a écrit :
> On Tue, Feb 23, 2016 at 09:26:35AM -0800, Marc MERLIN wrote:
>> Label: 'dshelf2'  uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd
>> Total devices 1 FS bytes used 4.25TiB
>> devid1 size 7.28TiB used 4.44TiB path /dev/mapper/dshelf2
>>
>> btrfs fi df /mnt/btrfs_pool2/
>> Data, single: total=4.29TiB, used=4.18TiB
>> System, DUP: total=64.00MiB, used=512.00KiB
>> Metadata, DUP: total=77.50GiB, used=73.31GiB
>> GlobalReserve, single: total=512.00MiB, used=31.22MiB
>>
>> Currently, it's btrfs on top of dmcrpyt on top of swraid5
> Sorry, I forgot to give the mount options:
> /dev/mapper/dshelf2 on /mnt/dshelf2/backup type btrfs 
> (rw,noatime,compress=lzo,space_cache,skip_balance,subvolid=257,subvol=/backup)

Why don't you use autodefrag? If you have writable snapshots and write
to them heavily it would not be a good idea (depending on how BTRFS
handles this, in most cases you would probably either break the reflinks
or fragment a snapshot to defragment another), but if you only have
read-only snapshots it may work for you (it does for me).

The only BTRFS filesystems where I disabled autodefrag were Ceph OSDs
with heavy in-place updates. Another option would have been to mark the
files NoCoW, but I didn't want to abandon BTRFS checksumming.

Lionel


Auto-rebalancing script

2016-02-14 Thread Lionel Bouton
Hi,

I'm using this Ruby script to maintain my BTRFS filesystems and try to
avoid them getting into a position where they can't allocate space even
though there is still plenty of it.

http://pastebin.com/39567Dun

It seems to work well (it maintains dozens of BTRFS filesystems, running
balance on them on occasion while avoiding too much I/O stress most of
the time), but there might be bugs or inefficiencies, so I publish it in
case it can be useful to others or benefit from criticism.
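
The core idea boils down to something like this (a hedged shell sketch,
not the actual script; the usage threshold and mount point are arbitrary):

MNT=/srv/data
btrfs fi df $MNT                      # compare total (allocated) vs used space
# reclaim nearly-empty data chunks so unallocated space grows again
btrfs balance start -dusage=20 $MNT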

Lionel


Re: Fi corruption on RAID1, generation doesn't match

2016-02-07 Thread Lionel Bouton
Hi,

Le 07/02/2016 14:15, Andreas Hild a écrit :
> Dear All,
>
> The file system on a RAID1 Debian server seems corrupted in a major
> way, with 99% of the files not found. This was the result of a
> precarious shutdown after a crash that was preceded by an accidental
> misconfiguration in /etc/fstab; it pointed "/" and "/tmp" to one and
> the same UUID by omitting a subvol entry.
>
> Is there any way to repair or recover a substantial part of this RAID?

I don't think the RAID is damaged: most distributions including Debian
remove nearly all files from /tmp at boot.
If /tmp and / were as you described the same filesystem your server most
probably did what amounts to "rm -rf /". You would probably have got the
same result with any filesystem as mounting the same filesystem at
several points in the VFS is not BTRFS-specific.

Unless you can restore a snapshot or there is a way to debug the
filesystem to restore a previous state, I'm afraid there's nothing to be
done.

Best regards,

Lionel


Re: device removal seems to be very slow (kernel 4.1.15)

2016-01-05 Thread Lionel Bouton
Le 05/01/2016 14:04, David Goodwin a écrit :
> Using btrfs progs 4.3.1 on a Vanilla kernel.org 4.1.15 kernel.
>
> time btrfs device delete /dev/xvdh /backups
>
> real13936m56.796s
> user0m0.000s
> sys 1351m48.280s
>
>
> (which is about 9 days).
>
> Where :
>
> /dev/xvdh was 120gb in size.
>

That's very slow. Last week, with a 4.1.12 kernel, I deleted a 3TB
SATA 7200rpm device with ~1.5TB used on a RAID10 filesystem (reduced
from 6 3TB devices to 5 in the process) in approximately 38 hours. This
was without virtualisation, though, and there were some damaged sectors
to handle along the way which should have slowed the delete down a bit,
and it had more than 10 times the data to move compared to your
/dev/xvdh.

Note about the damaged sectors:
we use 7 disks for this BTRFS RAID10 array, but to reduce the risk of
having to restore huge backups (see the recent discussion about BTRFS
RAID10 not protecting against 2-device failures at all), as soon as
numerous damaged sectors appear on a drive we delete it from the RAID10
and add it to an MD RAID1 array which is itself one of the devices of the
BTRFS RAID10 (right now we have 5 devices in the RAID10, one of them
being a 3-way md RAID1 made of disks with these numerous reallocated
sectors). So the reads from the deleted device had some errors to handle,
and the writes on the md RAID1 device triggered some sector relocations
too.
Ideally I would replace at least 2 of the disks in the md RAID1 because I
know from experience that they will fail in the near future (my estimate
is between right now and 6 months at best given the current rate of
reallocated sectors), but replacing a working drive with damaged sectors
costs us some downtime and a one-time fee (unlike a drive which is either
unreadable or doesn't pass SMART tests anymore). We can live with both
the occasional slowdowns (the SATA errors generated when the drives
detect new damaged sectors usually block IOs for a handful of seconds)
and the minor risk this causes: until now this has worked OK for this
server, and the md RAID1 array acts as a buffer for disks that are slowly
dying (the monthly BTRFS scrub + md raid check helps getting the worst
ones to the point where they fail, fast enough to avoid accumulating too
many bad drives in this array for long periods of time).

>
> /backups is a single / "raid 0" volume that now looks like :
>
> Label: 'BACKUP_BTRFS_SNAPS'  uuid: 6ee08c31-f310-4890-8424-b88bb77186ed
> Total devices 3 FS bytes used 301.09GiB
> devid1 size 100.00GiB used 90.00GiB path /dev/xvdg
> devid3 size 220.00GiB used 196.06GiB path /dev/xvdi
> devid4 size 221.00GiB used 59.06GiB path /dev/xvdj
>
>
> There are about 400 snapshots on it.

I'm not sure the number of snapshots can impact the device delete
operation: the slow part of device delete is relocating block groups,
which (AFAIK) happens one level down in the stack and shouldn't even
know about snapshots. If, however, you create or delete snapshots during
the delete operation, you could probably slow it down.

Best regards,

Lionel


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Lionel Bouton
Le 15/12/2015 02:49, Duncan a écrit :
> Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as
> excerpted:
>
>> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
>>
>>> I use noatime and nodiratime
>> FYI: noatime implies nodiratime :-)
> Was going to post that myself.  Is there some reason you:
>
> a) use nodiratime when noatime is already enabled, despite the fact that 
> the latter already includes the former, or

I don't (and haven't for some time). I didn't check for nodiratime on all
the systems I admin, so there could be some left around, but as they are
harmless I only remove them when I happen to stumble on them.

>
> b) didn't sufficiently research the option (at least the current mount 
> manpage documents that noatime includes nodiratime under both the noatime 
> and nodiratime options,

I just checked: this has only been made crystal-clear in the latest
man-pages version, 4.03, released 10 days ago.

The mount(8) page of Gentoo's current stable man-pages (4.02 release in
August) which is installed on my systems states for noatime:
"Do not update inode access times on this filesystem (e.g., for faster
access on the news spool to speed up news servers)."

This is prone to misinterpretation: directories are inodes, but that may
not be self-explanatory for everyone. At least it could leave me with a
doubt if I wasn't absolutely certain of the behavior (see below): I'm
not even sure myself that there isn't a difference between a VFS inode
(the in-memory structure) and an on-disk structure called an inode, which
some filesystems may not have (I may be mistaken, but IIRC ReiserFS left
me with the impression that it wasn't storing directory entries in
inodes, or didn't call them that).

In fact I remember that when I read statements about noatime implying
nodiratime, I had to check fs/inode.c to make sure of the behavior, after
finding a random discussion on the subject mentioning that the proof was
in the code.


>  and at least some hint of that has been in the 
> manpage for years as I recall reading it when I first read of nodiratime 
> and checked whether my noatime options included it) before standardizing 
> on it, or
>
> c) might have actually been talking in general, and there's some mounts 
> you don't actually choose to make noatime, but still want nodiratime, or

I probably used this combination for testing purposes (but I don't
remember a case where it was useful to me).
The expression I used was not meant to describe the exact flags in fstab
on my systems but the general idea of avoiding atime updates for both
files and directories, as by using noatime I'm implicitly using
nodiratime too. Sorry for the confusion (I've been confused about the
subject for a long time, which probably didn't help me express myself
clearly).

Best regards,

Lionel


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Lionel Bouton
Le 14/12/2015 21:27, Austin S. Hemmelgarn a écrit :
> AFAIUI, the _only_ reason that that is still the default is because of
> Mutt, and that won't change as long as some of the kernel developers
> are using Mutt for e-mail and the Mutt developers don't realize that
> what they are doing is absolutely stupid.
>

Mutt is often used as an example but tmpwatch uses atime by default too
and it's quite useful.

If you have a local cache of remote files for which you want a good hit
ratio and don't care too much about its exact size (you should have
Nagios/Zabbix/... alerting you when a filesystem reaches a %free limit
if you value your system's availability anyway), maintaining it with
tmpwatch and cron is only a single line away and does the job. As an
example of this particular case, on Gentoo the /usr/portage/distfiles
directory is used in one of the tasks you can uncomment to activate in
the cron.daily file provided when installing tmpwatch.
Using tmpwatch/cron is far more convenient than using a dedicated cache
(which might get tricky if the remote isn't HTTP-based, like an
rsync/ftp/nfs/... server, or doesn't support HTTP IMS requests for
example).
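
For illustration, an atime-based cleanup of such a cache can be a single
cron line like this (a hedged example; the retention and path are
arbitrary and the exact flags depend on the tmpwatch version):

/usr/sbin/tmpwatch --atime 30d /usr/portage/distfiles
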
Some http frameworks put sessions in /tmp: in this case if you want
sessions to expire based on usage and not creation time, using tmpwatch
or similar with atime is the only way to clean these files. This can
even become a performance requirement: I've seen some servers slowing
down with tens/hundreds of thousands of session files in /tmp because it
was only cleaned at boot and the systems were almost never rebooted...

I use noatime and nodiratime on some BTRFS filesystems for performance
reasons: Ceph OSDs, heavily snapshotted first-level backup servers and
filesystems dedicated to database server files (in addition to
nodatacow) come to mind, but the cases where these options are really
useful, even with BTRFS, don't seem to be the common ones.

Finally, Linus Torvalds has been quite vocal and consistent on the
general subject of the kernel not breaking user-space APIs no matter
what, so I wouldn't have much hope for changes to the default kernel
mount options...

Lionel


Re: Scrub: no space left on device

2015-12-08 Thread Lionel Bouton
Le 08/12/2015 16:06, Marc MERLIN a écrit :
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)
>
> /etc/cron.daily/btrfs-scrub:
> btrfs scrub start -Bd /dev/mapper/cryptroot
> scrub device /dev/mapper/cryptroot (id 1) done
>   scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>   total bytes scrubbed: 130.84GiB with 0 errors
> btrfs scrub start -Bd /dev/mapper/pool1
> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on 
> device)
> scrub device /dev/mapper/pool1 (id 1) canceled

I can't be sure (not a dev), but one possibility that comes to mind is
that if an error is detected, writes must be done on the device. The
repair might not be done in-place but with CoW, and even if the error is
not repaired due to a lack of redundancy, IIRC each device tracks the
number of errors detected, so I assume this is written somewhere (system
or metadata chunks most probably).

Best regards,

Lionel


Re: Scrub: no space left on device

2015-12-08 Thread Lionel Bouton
Le 08/12/2015 16:37, Holger Hoffstätte a écrit :
> On 12/08/15 16:06, Marc MERLIN wrote:
>> Howdy,
>>
>> Why would scrub need space and why would it cancel if there isn't enough of
>> it?
>> (kernel 4.3)
>>
>> /etc/cron.daily/btrfs-scrub:
>> btrfs scrub start -Bd /dev/mapper/cryptroot
>> scrub device /dev/mapper/cryptroot (id 1) done
>>  scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>>  total bytes scrubbed: 130.84GiB with 0 errors
>> btrfs scrub start -Bd /dev/mapper/pool1
>> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on 
>> device)
>> scrub device /dev/mapper/pool1 (id 1) canceled
> Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
> can lead to temporary metadata expansion (stuff gets COWed around); it's
> a bit surprising but makes sense if you think about it.

How long must I think about it until it makes sense? :-)

Sorry, I'm not sure why metadata is rewritten if no error is detected.
I have several theories but lack information: is the fact that no error
has been detected stored somewhere? Is scrub using some kind of internal
temporary snapshot(s) to avoid interfering with other operations? Or is
there another reason I didn't think about?

Lionel


Re: RAID6 stable enough for production?

2015-10-14 Thread Lionel Bouton
Le 14/10/2015 22:23, Donald Pearson a écrit :
> I would not use Raid56 in production.  I've tried using it a few
> different ways but have run in to trouble with stability and
> performance.  Raid10 has been working excellently for me.

Hi, could you elaborate on the stability and performance problems you
had? Which kernels were you using at the time you were testing?

I'm interested because I have some 7-disk RAID10 installations which
don't need much write performance (large backup servers with few clients
and few updates but very large datasets) that I plan to migrate to RAID6
when they approach their storage capacity (at least theoretically, with 7
disks this will give better read performance and better protection
against disk failures). 3.19 brought full RAID5/6 support and, from what
I remember, there were some initial quirks, but I'm unaware of any big
RAID5/6 problem in 4.1+ kernels.

Best regards,

Lionel


Re: RAID6 stable enough for production?

2015-10-14 Thread Lionel Bouton
Le 14/10/2015 22:53, Donald Pearson a écrit :
> I've used it from 3.8 something to current, it does not handle drive
> failure well at all, which is the point of parity raid. I had a 10disk
> Raid6 array on 4.1.1 and a drive failure put the filesystem in an
> irrecoverable state.  Scrub speeds are also an order of magnitude or
> more slower in my own experience.  The issue isn't filesystem
> read/write performance, it's maintenance and operation.

Thanks, I'll proceed with caution...
When 3.19 got out I tried various tests with loopback devices in RAID6
(for example dd'ing from /dev/random into the middle of one loopback
device guaranteed to contain file data, while the filesystem was in use)
and didn't manage to break it, but these were arguably simple situations
(either a missing device or corrupted data on a device, not something
behaving really erratically like failing hardware).
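
For reference, that kind of loopback test looked roughly like this (a
hedged sketch; sizes, paths, loop device names and the corruption offset
are arbitrary):

# build a 4-device RAID6 filesystem on sparse files
for i in 1 2 3 4; do
    truncate -s 2G /var/tmp/raid6-disk$i
    losetup /dev/loop$i /var/tmp/raid6-disk$i
done
mkfs.btrfs -f -d raid6 -m raid6 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
mount /dev/loop1 /mnt/test
cp -a /usr/src/linux /mnt/test/ && sync      # put some file data on it

# corrupt the middle of one member while the filesystem is mounted
dd if=/dev/urandom of=/dev/loop4 bs=1M seek=512 count=64 conv=notrunc oflag=direct

# a scrub should detect the corruption and repair it from parity
btrfs scrub start -B /mnt/test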

Lionel


Re: btrfs says no errors, but booting gives lots of errors

2015-10-10 Thread Lionel Bouton
Le 10/10/2015 16:41, cov...@ccs.covici.com a écrit :
> Holger Hoffstätte  wrote:
>
>> On 10/10/15 14:46, cov...@ccs.covici.com wrote:
>>> Hi.  I am having lots of btrfs troubles  -- I am using a 4.1.9 kernel
>> Just FYI, both 4.1.9 and .10 have serious regressions in the network layer
>> that *will*  lock up the whole machine, either after a few minutes or a
>> few hours (when idle). Try 4.2.x or (also more btrfs fixes) or 4.1.8 (OK).

If I'm not mistaken, as the OP uses Gentoo, gentoo-sources-4.1.9-r1
should have distribution patches for this (4.1 is LTS, 4.2 is not, so you
might want to prefer the 4.1 series).

Lionel


Re: btrfs says no errors, but booting gives lots of errors

2015-10-10 Thread Lionel Bouton
Le 11/10/2015 01:32, cov...@ccs.covici.com a écrit :
> [...]
> I don't know if the file in question had the correct data, I only did a
> directory listing, but this makes no sense -- I did an rsync just before
> booting and got all kinds of errors and the only difference is the file
> system, this is what I am saying.

What makes no sense is that the same filesystem shows the file both as
present and as missing. If there was data corruption or buggy behaviour,
at least it should be somehow consistent.
What is more likely is that the rsync was incomplete and didn't transfer
some data needed by systemd: did you transfer (off the top of my head)
extended attributes, special files, device files? By default rsync
doesn't do that.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs says no errors, but booting gives lots of errors

2015-10-10 Thread Lionel Bouton
Le 10/10/2015 18:55, cov...@ccs.covici.com a écrit :
> [...]
> But do you folks have any idea about my original question, this leads me
> to think that btrfs is too new or something.

I've seen a recent report of a problem with btrfs-progs 4.2 confirmed as
a bug in mkfs. As you created the filesystem with it, it could be the
problem.
Note that btrfs-progs 4.2 is marked ~amd64 on Gentoo: when you live on
the bleeding edge you shouldn't be surprised to bleed sometimes ;-)

You might have more luck by describing the errors better. Your title
mentions lots of errors, but there's only one log extract in a zip file,
for a filesystem being mounted, and it's only a warning about lock
contention which, from an educated guess, seems unlikely to make programs
crash.
I'm not familiar with the 203 exit codes you mention. This seems to be a
systemd thing with an unclear meaning from a quick Google search, so it
isn't really helpful unless there are kernel oopses or panics for these
errors too.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs says no errors, but booting gives lots of errors

2015-10-10 Thread Lionel Bouton
Le 11/10/2015 01:02, cov...@ccs.covici.com a écrit :
> Lionel Bouton <lionel+c...@bouton.name> wrote:
>
>> Le 10/10/2015 18:55, cov...@ccs.covici.com a écrit :
>>> [...]
>>> But do you folks have any idea about my original question, this leads me
>>> to think that btrfs is too new or something.
>> I've seen a recent report of a problem with btrfs-progs 4.2 confirmed as
>> a bug in mkfs. As you created the filesystem with it, it could be the
>> problem.
>> Note that btrfs-progs 4.2 is marked ~amd64 on Gentoo: when you live on
>> the bleeding edge you shouldn't be surprised to bleed sometimes ;-)
>>
>> You might have more luck by better describing the errors. Your title
>> mentions lots of errors, but there's only one log extract in a zip file
>> for a filesystem being mounted and it's only a warning about lock
>> contention which from an educated guess seems unlikely to make programs
>> crash.
>> I'm not familiar with the 203 exit codes you mention. This seems a
>> systemd thing with unclear meaning from a quick Google search so it
>> isn't really helpful unless there are kernel oops or panics for these
>> errors too.
> These errors are not kernel panicks, they are just that systemd units
> are not starting and the programs that  are executed in the unit files
> are returning these errors such as the 203 with no other explanation.  I
> tried for instance to run /usr/bin/postgresql-9.4-check-db-dir and it
> said that postgresql.conf was missing, but I could do an ls on that file
> and got the name.

If you can list files and read them, your problems probably have nothing
to do with the filesystem itself.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Lionel Bouton
Hi,

Le 04/10/2015 14:03, Lionel Bouton a écrit :
> [...]
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.

So if you base your analysis on the Phoronix tests, maybe md could
perform better when serving large files to a few clients. In all other
cases BTRFS RAID1 seems to be a better place to start if you want
performance. According to the "bad performance -> unstable" logic, md
would then be the less stable RAID1 implementation, which doesn't make
sense to me.

I'm not even saying that BTRFS performs better than md for most
real-world scenarios (these are only benchmarks), but arguing that
BTRFS is not stable because it has performance issues still doesn't make
sense to me. Even synthetic benchmarks aren't enough to find the best
fit for real-world scenarios, so you could always find a very
restrictive situation where any filesystem, RAID implementation or volume
manager, even the most robust ones, could look bad.

Of course if BTRFS RAID1 were always slower than md RAID1 the logic might
make more sense. But clearly there were design decisions and performance
tuning in BTRFS that led to better or similar performance in several
scenarios; if the remaining scenarios don't get attention it may be
because they represent a niche (at least from the point of view of the
developers), not a lack of polishing.

Best regards,

Lionel


Re: BTRFS as image store for KVM?

2015-10-04 Thread Lionel Bouton
Hi,

On 04/10/2015 04:09, Duncan wrote:
> Russell Coker posted on Sat, 03 Oct 2015 18:32:17 +1000 as excerpted:
>
>> Last time I checked a BTRFS RAID-1 filesystem would assign each process
>> to read from one disk based on its PID.  Every RAID-1 implementation
>> that has any sort of performance optimisation will allow a single
>> process that's reading to use both disks to some extent.
>>
>> When the BTRFS developers spend some serious effort optimising for
>> performance it will be useful to compare BTRFS and ZFS.
> This is the example I use as to why btrfs isn't really stable, as well.  
> Devs tend to be very aware of the dangers of premature optimization, 
> because done too early, it either means throwing that work away when a 
> rewrite comes, or it severely limits options as to what can be rewritten, 
> if necessary, in order to avoid throwing all that work that went into 
> optimization away.
>
> So at least for devs that have been around awhile, that don't have some 
> boss that's paying the bills saying optimize now, an actually really good 
> mark of when the /devs/ consider something stable, is when they start 
> focusing on that optimization.

This focus on single reader RAID1 performance surprises me.

1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
you need 2 processes to read from 2 devices at once) and I've never seen
anyone arguing that the current md code is unstable.

2/ I'm not familiar with implementations taking advantage of several
disks for single process reads but clearly they'll have more problems
with seeks on rotating devices to solve. So are there really
implementations with better performance across the spectrum or do they
have to pay a performance penalty in the multiple readers case to
optimize the (arguably less frequent/important) single reader case ?
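
To make the single reader point concrete, here is a toy illustration
(Python, not the actual kernel code, and the details of the real policy
may differ) of a pid-based mirror selection for a 2-device RAID1: a given
process always reads from the same device, so a single sequential reader
never uses both disks, and even two readers only spread across both
devices if their PIDs happen to differ in parity.

# Toy illustration of pid-based RAID1 read balancing (not the kernel code).
NUM_MIRRORS = 2

def pick_mirror(pid, num_mirrors=NUM_MIRRORS):
    # A given process always maps to the same device.
    return pid % num_mirrors

if __name__ == "__main__":
    print("single reader (pid 4242) -> device", pick_mirror(4242))
    for pids in [(1000, 1002), (1000, 1001)]:
        devices = sorted({pick_mirror(p) for p in pids})
        print("readers with pids %s -> devices %s" % (pids, devices))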

Best regards,

Lionel


Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation

2015-09-29 Thread Lionel Bouton
On 27/09/2015 17:34, Lionel Bouton wrote:
> [...]
> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
> another process trying to use the file. I assume basic reading and
> writing is OK but there might be restrictions on unlinking/locking/using
> other ioctls... Are there any I should be aware of and should look for
> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
> our storage network : 2 are running a 4.0.5 kernel and 3 are running
> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
> 4.0.5 (or better if we have the time to test a more recent kernel before
> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).

Apparently this isn't the problem : we just had another similar Ceph OSD
crash without any concurrent defragmentation going on.

Best regards,

Lionel Bouton


Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation

2015-09-29 Thread Lionel Bouton
On 29/09/2015 16:49, Lionel Bouton wrote:
> On 27/09/2015 17:34, Lionel Bouton wrote:
>> [...]
>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network : 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
> Apparently this isn't the problem : we just had another similar Ceph OSD
> crash without any concurrent defragmentation going on.

However the Ceph developers confirmed that BTRFS returned an EIO while
reading data from disk. Is there a known bug in kernel 3.18.9 (sorry
for the initial typo) that could lead to that? I couldn't find any on
the wiki.
The last crash was on a filesystem mounted with these options:

rw,noatime,nodiratime,compress=lzo,space_cache,recovery,autodefrag

Some of the extents have been recompressed to zlib (though at the time
of the crash there was no such activity as I disabled it 2 days before
to simplify diagnostics).

Best regards,

Lionel Bouton


Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation

2015-09-28 Thread Lionel Bouton
On 28/09/2015 22:52, Duncan wrote:
> Lionel Bouton posted on Mon, 28 Sep 2015 11:55:15 +0200 as excerpted:
>
>> From what I understood, filefrag doesn't know the length of each extent
>> on disk but should have its position. This is enough to have a rough
>> estimation of how badly fragmented the file is : it doesn't change the
>> result much when computing what a rotating disk must do (especially how
>> many head movements) to access the whole file.
> AFAIK, it's the number of extents reported that's the problem with 
> filefrag and btrfs compression.  Multiple 128 KiB compression blocks can 
> be right next to each other, forming one longer extent on-device, but due 
> to the compression, filefrag sees and reports them as one extent per 
> compression block, making the file look like it has perhaps thousands or 
> tens of thousands of extents when in actuality it's only a handful, 
> single or double digits.

Yes but that's not a problem for our defragmentation scheduler: we
compute the time needed to read the file based on a model of the disk
where reading consecutive compressed blocks has no seek cost, only the
same revolution cost as reading the larger block they form. The cost of
fragmentation is defined as the ratio between this time and the time
computed with our model if the blocks were purely sequential.
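
To make that definition concrete, here is a small sketch of the cost
computation (the disk parameters are illustrative placeholders, not our
production values, and this is a stripped-down version, not our actual
code): given the physical start and length of each extent in blocks, it
compares the modelled read time including seeks with the modelled time
for a purely sequential layout.

# Sketch of the fragmentation cost model described above.
BLOCK_SIZE = 4096
SEEK_S = 0.008          # one head movement (placeholder value)
TRANSFER_BPS = 120e6    # sequential transfer rate (placeholder value)

def fragmentation_cost(extents):
    """extents: list of (physical_start_block, length_in_blocks)."""
    total_bytes = sum(length for _, length in extents) * BLOCK_SIZE
    sequential = SEEK_S + total_bytes / TRANSFER_BPS   # one seek, then stream
    actual, prev_end = 0.0, None
    for start, length in extents:
        if prev_end is None or start != prev_end:      # contiguous blocks: no seek
            actual += SEEK_S
        actual += length * BLOCK_SIZE / TRANSFER_BPS
        prev_end = start + length
    return actual / sequential

# Example: three extents, the last two contiguous on disk.
print(fragmentation_cost([(1000, 256), (9000, 256), (9256, 256)]))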

>
> In that regard, length or position neither one matter, filefrag will 
> simply report a number of extents orders of magnitude higher than what's 
> actually there, on-device.

Yes, but filefrag -v reports the length and position of each extent, and
we can then find out, based purely on the positions, whether extents are
sequential or random.
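
As an illustrative sketch again (the regular expression assumes the
e2fsprogs filefrag -v column layout and may need adjusting, and this is
not our actual code), this is the kind of parsing involved, including the
"encoded" flag check used to treat compressed files separately. The
(start, length) pairs can then be fed into a cost function like the one
sketched earlier in this message.

import re
import subprocess

EXTENT_RE = re.compile(
    r'^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+):(?:\s*\d+:)?\s*(.*)$')

def filefrag_extents(path):
    """Return [(physical_start, length_in_blocks, flags), ...] from filefrag -v."""
    out = subprocess.check_output(['filefrag', '-v', path], text=True)
    extents = []
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            start, _end, length, flags = m.groups()
            extents.append((int(start), int(length), flags))
    return extents

def looks_compressed(extents):
    # Treat a file as compressed when more than half of its extents
    # carry the "encoded" flag.
    return sum(1 for _, _, f in extents if 'encoded' in f) > len(extents) / 2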

If people are interested in the details I can discuss them in a separate
thread (or a subthread with a different title). One thing in particular
surprised me and could be an interesting separate discussion: according
to the extent positions reported by filefrag -v, defragmentation can
leave extents in several sequences at different positions on the disk
leading to an average fragmentation cost for compressed files of 2.7x to
3x compared to the ideal case (note that this is an approximation: we
consider files compressed if more than half of their extents are
compressed by checking for "encoded" in the extent flags). This is
completely different for uncompressed files: here defragmentation is
completely effective and we get a single extent most of the time. So
there are at least 3 possibilities : an error in the positions reported by
filefrag (and the file really is defragmented), a good reason to leave
these files fragmented, or an opportunity for optimization.

But let's remember our real problem: I'm still not sure if calling btrfs
fi defrag <file> can interfere with any concurrent operation on <file>,
leading to an I/O error. As this has the potential to bring our platform
down in our current setup, I'd really appreciate an answer and I hope
this will catch the attention of someone familiar with the technical
details of btrfs fi defrag.

Best regards,

Lionel Bouton


Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation

2015-09-28 Thread Lionel Bouton
> against this sort of thing, perhaps there's a bug, because...
>
> I know for sure that btrfs itself is not intended for distributed access, 
> from more than one system/kernel at a time.  Which assuming my ceph 
> illiteracy isn't negatively affecting my reading of the above, seems to 
> be more or less what you're suggesting happened, and I do know that *if* 
> it *did* happen, it could indeed trigger all sorts of havoc!

No: Ceph OSDs are normal local processes using a filesystem for storage
(and optionally a dedicated journal outside of the filesystem), as are the
btrfs fi defrag commands run on the same host. What I'm interested in is
how the btrfs fi defrag <file> command could interfere with any other
process accessing <file> simultaneously. The answer could very well be
"it never will" (for example because it doesn't use any operation that
can interfere before calling the defrag ioctl, which is guaranteed not to
interfere with other file operations either). I just need to know if
there's a possibility so I can decide if these defragmentations are an
operational risk or not in my context and if I found the cause for my
slightly frightening morning.

>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network : 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
> It's worth keeping in mind that the explicit warnings about btrfs being 
> experimental weren't removed until 3.12, and while current status is no 
> longer experimental or entirely unstable, it remains, as I characterize 
> it, as "maturing and stabilizing, not yet entirely stable and mature."
>
> So 3.8 is very much still in btrfs-experimental land!  And so many bugs 
> have been fixed since then that... well, just get off of it ASAP, which 
> it seems you're already doing.

Oops, that was a typo : I meant 3.18.9, sorry :-(

> [...]
>
>
> Tying up a couple loose ends...
>
> Regarding nocow...
>
> Given that you had apparently missed much of the general list and wiki 
> wisdom above (while at the same time eventually coming to many of the 
> same conclusions on your own),

In fact I was initially aware of the (no)CoW/defragmentation/snapshots
performance gotchas (I had already used BTRFS for hosting PostgreSQL
slaves, for example...).
But Ceph is filesystem-aware: its OSDs detect whether they are running on
XFS/BTRFS and automatically activate some filesystem features. So even
though I was aware of the problems that can happen on a CoW filesystem,
I preferred to do actual testing with the default Ceph settings and
filesystem mount options before tuning.

Best regards,

Lionel Bouton


btrfs fi defrag interfering (maybe) with Ceph OSD operation

2015-09-27 Thread Lionel Bouton
Hi,

we use BTRFS for Ceph filestores (after much tuning and testing over
more than a year). One of the problems we've had to face was the slow
decrease in performance caused by fragmentation.

Here's a small recap of the history for context.
Initially we used internal journals on the few OSDs where we tested
BTRFS, which meant constantly overwriting 10GB files (which is obviously
bad for CoW). Before using NoCoW and eventually moving the journals to
raw SSD partitions, we understood autodefrag was not being effective :
the initial performance on a fresh, recently populated OSD was great and
slowly degraded over time without access patterns and filesystem sizes
changing significantly.
My idea was that autodefrag might focus its efforts on files not useful
to defragment in the long term. The obvious one was the journal
(constant writes but only read again when restarting an OSD) but I
couldn't find any description of the algorithms/heuristics used by
autodefrag so I decided to disable it and develop our own
defragmentation scheduler. It is based on both a slow walk through the
filesystem (which acts as a safety net over a one-week period) and a
fatrace pipe (used to detect recent fragmentation). Fragmentation is
computed from filefrag detailed outputs and it learns how much it can
defragment files with calls to filefrag after defragmentation (we
learned compressed files and uncompressed files don't behave the same
way in the process so we ended up treating them separately).
Simply excluding the journal from defragmentation and using some basic
heuristics (don't defragment recently written files but keep them in a
pool then queue them, and don't defragment files below a given
fragmentation "cost" where defragmentation becomes ineffective) gave us
usable performance in the long run. Then we successively moved the
journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
snapshots which were too costly (removing snapshots generated 120MB of
writes to the disks and this was done every 30s on our configuration).
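
To give an idea of the shape of this scheduler, here is a heavily
simplified sketch (the thresholds, paths and the fatrace line format
below are illustrative assumptions, and the crude extent-count check
stands in for the real filefrag -v based cost model; the real scheduler
also does the slow one-week walk): it follows a fatrace pipe, keeps
recently written files in a cool-down pool, skips the journal, and only
calls btrfs fi defrag on files that still look fragmented once their
cool-down has expired.

#!/usr/bin/env python3
# Heavily simplified sketch of the fatrace-driven part of the scheduler.
import re
import subprocess
import time

COOLDOWN_S = 3600            # leave recently written files alone for a while
MIN_EXTENTS = 16             # crude stand-in for the filefrag -v cost model
FATRACE_RE = re.compile(r'^\S+\(\d+\): (\S+) (/.+)$')  # assumed line format

def extent_count(path):
    # filefrag without -v prints "<path>: N extents found".
    try:
        out = subprocess.check_output(['filefrag', path], text=True)
    except subprocess.CalledProcessError:
        return 0
    m = re.search(r': (\d+) extents? found', out)
    return int(m.group(1)) if m else 0

def main(mountpoint='/var/lib/ceph/osd'):        # placeholder mount point
    pool = {}                                    # path -> last write timestamp
    fatrace = subprocess.Popen(['fatrace'], stdout=subprocess.PIPE, text=True)
    for line in fatrace.stdout:
        m = FATRACE_RE.match(line.strip())
        if not m:
            continue
        events, path = m.groups()
        if 'W' in events and path.startswith(mountpoint) and 'journal' not in path:
            pool[path] = time.time()             # (re)start its cool-down
        # Promote files whose cool-down has expired and which look fragmented.
        now = time.time()
        for p, last_write in list(pool.items()):
            if now - last_write < COOLDOWN_S:
                continue
            del pool[p]
            if extent_count(p) >= MIN_EXTENTS:
                subprocess.call(['btrfs', 'fi', 'defrag', p])

if __name__ == '__main__':
    main()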

In the end we had a very successful experience, migrated everything to
BTRFS filestores that were noticeably faster than XFS (according to Ceph
metrics), detected silent corruption and compressed data. Everything
worked well until this morning.

I woke up to a text message signalling VM freezes all over our platform.
2 Ceph OSDs died at the same time on two of our servers (20s apart)
which for durability reasons freezes writes on the data chunks shared by
these two OSDs.
The errors we got in the OSD logs seem to point to an IO error (at least
IIRC we got a similar crash on an OSD where we had invalid csum errors
logged by the kernel) but we couldn't find any kernel error and btrfs
scrubs finished on the filesystems without finding any corruption. I've
yet to get an answer for the possible contexts and exact IO errors. If
people familiar with Ceph read this, here's the error on Ceph 0.80.9
(more logs available on demand) :

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

Given that the defragmentation scheduler treats file accesses the same
on all replicas to decide when triggering a call to "btrfs fi defrag
<file>", I suspect this manual call to defragment could have happened on
the 2 OSDs affected for the same file at nearly the same time and caused
the near simultaneous crashes.

It's not clear to me that "btrfs fi defrag <file>" can't interfere with
another process trying to use the file. I assume basic reading and
writing is OK but there might be restrictions on unlinking/locking/using
other ioctls... Are there any I should be aware of and should look for
in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
our storage network : 2 are running a 4.0.5 kernel and 3 are running
3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
4.0.5 (or better if we have the time to test a more recent kernel before
rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).

Best regards,

Lionel Bouton