Re: btrfs restore fails because of NO SPACE

2016-05-20 Thread Duncan
Wolf Bublitz posted on Fri, 20 May 2016 19:38:55 +0200 as excerpted:

> Hello,
> 
> I have a confusing problem with my btrfs raid. Currently I am using the
> following setup:
> 
> btrfs fi show
> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>   Total devices 2 FS bytes used 2.56TiB
>   devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>   devid 4 size 2.73TiB used 2.71TiB path /dev/sdb
> 
> As you can see both disks are full.
> 
> Actually I cannot mount my raid, even with recovery options enabled:
> 
> mount /dev/sda /mnt/Data -t btrfs
> -onospace_cache,clear_cache,enospc_debug,nodatacow

[generic mount error]

> dmesg shows:
> 
> [ 1066.243813] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[...]
> [ 1066.273234] BTRFS: failed to read chunk root on sda

None of those options are likely to help, there.

What /might/ help is the "usebackuproot" mount option, if your kernel is 
reasonably current, or the "recovery" mount option, if it's a bit older.  
Btrfs mount options are documented in the btrfs (5) manpage (not the 
btrfs (8) manpage, specify the 5), tho again, usebackuproot will only 
appear in the manpage if you're running a reasonably current btrfs-progs 
version (recovery should be listed in both new and old, but it's listed 
as deprecated in new, referring readers to the usebackuproot entry).  Or, 
as an alternative to the manpage, you can check the mount options listing 
on the wiki.
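
For example, the first attempts would look something like the following 
(untested here, using the device and mountpoint from your post; which 
spelling works depends on your kernel and btrfs-progs versions):

mount -t btrfs -o ro,usebackuproot /dev/sda /mnt/Data
# or, on somewhat older kernels that predate usebackuproot:
mount -t btrfs -o ro,recovery /dev/sda /mnt/Data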

> After spending some time with Google I found a possible solution for my
> problem by running:
> 
> btrfs restore -v /dev/sda /mnt/Data
> 
> Actually this operation fails silently (the computer freezes). After
> examining the kernel logs I found out that the operation fails because of
> „NO SPACE LEFT ON DEVICE“. Can anybody please give me a solution for
> this problem?

You don't explicitly say what you expect btrfs restore to do, but given 
the specific command you used, I suspect that you misunderstand what it 
does.  It is actually working, but you are running out of space because 
that misunderstanding has you using restore incorrectly.

What btrfs restore does is provide you a read-only method to try to 
restore your files from a filesystem that won't mount, by rewriting what 
it can recover to an entirely different location on an entirely 
different, mounted, filesystem, which of course must contain enough space 
to hold a new copy of all those restored files.

And if the filesystem in question isn't mounted to its usual mountpoint,
/mnt/Data, that means you're trying to write all the recovered files to 
whatever filesystem actually contains the mountpoint itself, almost 
certainly the root filesystem (/), in this case.

And I'll place money on a bet that whatever your root filesystem is, it 
doesn't have the terabytes of free space that are likely to be necessary 
to restore all of that multi-device multi-TB per device btrfs!  
Otherwise, you /likely/ wouldn't be running the separate btrfs in the 
first place, but storing it on your main filesystem, instead.  So when 
btrfs restore runs out of room on / ... everything freezes.


IOW, in order to successfully use btrfs restore, you have to have a 
filesystem with enough free space available, mounted somewhere, for 
btrfs restore to write the restored files to!  If you don't, 
yes, you /will/ run into problems! =:^)
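
As a sketch, with a made-up spare device /dev/sdX1 standing in for whatever 
disk or filesystem you actually have with enough free space:

mkfs.ext4 /dev/sdX1                    # any filesystem with a few TiB free will do
mount /dev/sdX1 /mnt/rescue
btrfs restore -v /dev/sda /mnt/rescue  # the recovered files land on the *other* filesystem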

That said, it is possible to use pattern matching to tell btrfs restore 
to restore only, say, one directory at a time.  If you don't have enough 
space for everything but are willing to give up some of what was stored, 
you can use that to restore only the vitally important stuff that will 
fit, and not bother trying to restore the less important stuff that 
won't.  Again, see the btrfs-restore manpage for that and other details.

And, unless you want those restored files all written as root, you'll 
probably want to use the restore metadata option as well, to restore 
timestamps and owner/perms information.  Similarly, there's an option to 
restore symlinks as well, without which they'll be missing.  So you 
probably do want to check that manpage.  Just sayin'. =:^)
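
Put together, the eventual invocation probably looks something like this 
(a sketch only; check the manpage of your btrfs-progs version for the 
exact spelling of the metadata and symlink options it ships):

btrfs restore -v --metadata --symlink /dev/sda /mnt/rescue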

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs restore fails because of NO SPACE

2016-05-20 Thread Duncan
Chris Murphy posted on Fri, 20 May 2016 15:53:07 -0600 as excerpted:

>> btrfs fi show
>> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>> Total devices 2 FS bytes used 2.56TiB
>> devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>> devid 4 size 2.73TiB used 2.71TiB path /dev/sdb
> 
> 
> OK so why does it only list two devices? This is a three drive or four
> drive raid10? This is Btrfs raid10 specifically? I'm confused now about
> the setup and why btrfs fi show isn't saying there are missing devices,
> there is no such thing as two drive btrfs raid10.

I'm confused about where you got that it was a raid10.  I don't see it in 
anything he posted, at least in what made it here to gmane.  In fact, I see 
only his initial thread-root post, and it doesn't mention raid10 at all 
that I can see.

So given that fi show says two devices, none missing, I'd say it can't be 
a raid10, and further, given that he didn't specify the raid type, the 
btrfs default for a two-device btrfs must be assumed, which is raid1 
system and metadata, single mode data.
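
(If/once the filesystem mounts again, the actual profiles can be confirmed 
with something like the below; the output lines here are only illustrative 
of the default two-device layout, not taken from his system:)

btrfs filesystem df /mnt/Data
# Data, single: total=..., used=...
# System, RAID1: total=..., used=...
# Metadata, RAID1: total=..., used=...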

But I think that's beside the point in terms of the original question.  I 
think the problem is with his understanding of restore.  See the reply 
directly to his post, that I'll be making after this one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Hot data tracking / hybrid storage

2016-05-20 Thread Henk Slager
 Converting to bcache protective superblocks is a one-time procedure which can be done
 online. The bcache devices act as normal HDD if not attached to a
 caching SSD. It's really less pain than you may think. And it's a
 solution available now. Converting back later is easy: Just detach the
 HDDs from the SSDs and use them for some other purpose if you feel so
 later. Having the bcache protective superblock still in place doesn't
 hurt then. Bcache is a no-op without caching device attached.
>>>
>>>
>>> No, bcache is _almost_ a no-op without a caching device.  From a
>>> userspace
>>> perspective, it does nothing, but it is still another layer of
>>> indirection
>>> in the kernel, which does have a small impact on performance.  The same
>>> is
>>> true of using LVM with a single volume taking up the entire partition, it
>>> looks almost no different from just using the partition, but it will
>>> perform
>>> worse than using the partition directly.  I've actually done profiling of
>>> both to figure out base values for the overhead, and while bcache with no
>>> cache device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
>>> is).
>>>
>>> You also lose the ability to mount that filesystem directly on a kernel
>>> without bcache support (this may or may not be an issue for you).
>>
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDD's use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of
>> course, also first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
>
> My argument about the overhead was not about the superblock, it was about
> the bcache layer itself.  It isn't practical to just access the data
> directly if you plan on adding a cache device, because then you couldn't do
> so online unless you're going through bcache.  This extra layer of
> indirection in the kernel does add overhead, regardless of the on-disk
> format.

Yes, sorry, I took a shortcut in the discussion and jumped to a
method for avoiding this 0.5-2% slowdown that you mention. (Or a
kernel crashing in bcache code due to corrupt SB on a backing device
or corrupted caching device contents).
I am actually a bit surprised that there is a measurable slowdown,
considering that it is basically just one 8KiB offset on a certain
layer in the kernel stack, but I haven't looked at that code.

> Secondarily, having a HDD with just one partition is not a typical use case,
> and that argument about the slack space resulting from the 1M alignment only
> holds true if you're using an MBR instead of a GPT layout (or for that
> matter, almost any other partition table format), and you're not booting
> from that disk (because GRUB embeds itself there). It's also fully possible
> to have an MBR formatted disk which doesn't have any spare space there too
> (which is how most flash drives get formatted).

I don't know tables other than MBR and GPT, but this bcache SB
'insertion' works with both. Indeed, if GRUB is involved, it can get
complicated; I have avoided that. If there is less than 8KiB of slack
space on a HDD, I would worry about alignment/performance first, as
there is then likely a reason to fully rewrite the HDD with a standard
1M alignment.
If there are more partitions and there is one in front of the partition
you would like to be bcached, I personally would shrink that one by 8KiB
(be it NTFS, swap or ext4) if that saves me terabytes of data transfers.

> This also doesn't change the fact that without careful initial formatting
> (it is possible on some filesystems to embed the bcache SB at the beginning
> of the FS itself, many of them have some reserved space at the beginning of
> the partition for bootloaders, and this space doesn't have to exist when
> mounting the FS) or manual alteration of the partition, it's not possible to
> mount the FS on a system without bcache support.

If we consider a non-bootable single HDD btrfs FS, are you then
suggesting that the bcache SB could be placed in the 

Re: btrfs restore fails because of NO SPACE

2016-05-20 Thread Chris Murphy
> btrfs fi show
> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>   Total devices 2 FS bytes used 2.56TiB
>   devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>   devid 4 size 2.73TiB used 2.71TiB path /dev/sdb


OK so why does it only list two devices? This is a three drive or four
drive raid10? This is Btrfs raid10 specifically? I'm confused now
about the setup and why btrfs fi show isn't saying there are missing
devices, there is no such thing as two drive btrfs raid10.


Chris Murphy


Re: Hot data tracking / hybrid storage

2016-05-20 Thread Henk Slager
On Fri, May 20, 2016 at 7:59 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>>
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
>
> If you have GRUB installed on all 4, then you don't actually have the full
> 2047 sectors between the MBR and the partition free, as GRUB is embedded in
> that space.  I forget exactly how much space it takes up, but I know it's
> not the whole 1023.5K.  I would not suggest risking usage of the final 8k
> there though.  You could however convert to raid1 temporarily, and then for
> each device, delete it, reformat for bcache, then re-add it to the FS.  This
> may take a while, but should be safe (of course, it's only an option if
> you're already using a kernel with bcache support).

There is more than enough space in that 2047-sector area for
inserting a bcache SB, but initially I also found it risky and was not
so sure. I don't want GRUB in the MBR anyhow, but in the filesystem/OS
partition that it should boot; otherwise multi-OS on the same SSD or
HDD gets into trouble.

For the described system, assuming a few minutes offline or
'maintenance' mode is acceptable, I personally would just shrink the
swap by 8KiB, lower its end-sector by 16 and also lower the
start-sector of the btrfs partition by 16 and then add bcache. The
location of GRUB should not matter actually.
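
In command form that is roughly the following; a sketch only, with
hypothetical partition numbers (sda2 = swap, sda3 = btrfs), and the
offsets must be double-checked against the real partition table before
writing anything:

swapoff /dev/sda2
# in gdisk: set alignment to 16, move the end of sda2 back by 16 sectors,
# move the start of sda3 back by 16 sectors (8 KiB), write the table, then:
partprobe /dev/sda
make-bcache -B /dev/sda3   # writes the backing superblock into the freed 8 KiB gap;
                           # the existing filesystem data is untouched and shows up
                           # behind /dev/bcache0
mkswap /dev/sda2           # recreate the slightly smaller swap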

>> In this configuration nothing would beat btrfs if I could just add 2
>> SSD's to the pool that would be clever enough to be paired in RAID1 and
>> would be preferred for small (<1GB) file writes. Then balance should be
>> able to move not often used files to the HDD.
>>
>> None of the methods mentioned here sound easy or quick to do, or even
>> well tested.

I agree that all the methods are actually quite complicated,
especially if compared to ZFS and its tools. Adding an ARC is as
simple and easy as you want and describe.

The statement I wanted to make is that adding bcache for a (btrfs)
file-system can be done without touching the FS itself, provided that
one can allow some offline time for the FS.

> It really depends on what you're used to.  I would consider most of the
> options easy, but one of the areas I'm strongest with is storage management,
> and I've repaired damaged filesystems and partition tables by hand with a
> hex editor before, so I'm not necessarily a typical user.  If I was going to
> suggest something specifically, it would be dm-cache, because it requires no
> modification to the backing store at all, but that would require running on
> LVM if you want it to be easy to set up (it's possible to do it without LVM,
> but you need something to call dmsetup before mounting the filesystem, which
> is not easy to configure correctly), and if you're on an enterprise distro,
> it may not be supported.
>
> If you wanted to, it's possible, and not all that difficult, to convert a
> BTRFS system to BTRFS on top of LVM online, but you would probably have to
> split out the boot subvolume to a separate partition (depends on which
> distro you're on, some have working LVM support in GRUB, some don't).  If
> you're on a distro which does have LVM support in GRUB, the procedure would
> be:
> 1. Convert the BTRFS array to raid1. This lets you run with only 3 disks
> instead of 4.
> 2. Delete one of the disks from the array.
> 3. Convert the disk you deleted from the array to a LVM PV and add it to a
> VG.
> 4. Create a new logical volume occupying almost all of the PV you just added
> (having a little slack space is usually a good thing).
> 5. Use btrfs replace to add the LV to the BTRFS array while deleting one
> of the others.
> 6. Repeat from step 3-5 for each disk, but stop at step 4 when you have
> exactly one disk that isn't on LVM (so for four disks, stop at step four
> when you have 2 with BTRFS+LVM, one with just the LVM logical volume, and
> one with just BTRFS).
> 7. Reinstall GRUB (it should pull in LVM support now).
> 8. Use BTRFS replace to move the final BTRFS disk to the empty LVM volume.
> 9. Convert the now empty final disk to LVM using steps 3-4
> 10. Add the LV to the BTRFS array and rebalance to raid10.
> 11. Reinstall GRUB again (just to be certain).
>
> I've done essentially the same thing on numerous occasions when
> reprovisioning for various reasons, and it's actually one of the things
> outside of the xfstests that I check with my regression testing (including
> simulating a couple of the common failure modes).  It takes a while
> (especially for big arrays with lots of data), but it works, and is
> relatively safe (you are guaranteed to be able to rebuild a raid1 array of 3
> disks from just 2, so losing the disk in the process of copying it will not
> result in data loss unless you hit a kernel bug).

[PATCH] Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

2016-05-20 Thread Omar Sandoval
From: Omar Sandoval 

Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

while make tinyconfig && make -j4 && git clean -dqfx; do
:
done

along with a bunch of parallel finds in another shell:

while true; do
for ((i=0; i<4; i++)); do
find . >/dev/null &
done
wait
done

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/delayed-inode.c | 27 ++-
 fs/btrfs/delayed-inode.h | 10 ++
 fs/btrfs/inode.c | 10 ++
 3 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 6cef0062f929..d60cd17ea66b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1606,15 +1606,23 @@ int btrfs_inode_delayed_dir_index_count(struct inode *inode)
return 0;
 }
 
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-struct list_head *del_list)
+bool btrfs_readdir_get_delayed_items(struct inode *inode,
+struct list_head *ins_list,
+struct list_head *del_list)
 {
struct btrfs_delayed_node *delayed_node;
struct btrfs_delayed_item *item;
 
delayed_node = btrfs_get_delayed_node(inode);
if (!delayed_node)
-   return;
+   return false;
+
+   /*
+* We can only do one readdir with delayed items at a time because of
+* item->readdir_list.
+*/
+   inode_unlock_shared(inode);
+   inode_lock(inode);
 
mutex_lock(&delayed_node->mutex);
item = __btrfs_first_delayed_insertion_item(delayed_node);
@@ -1641,10 +1649,13 @@ void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 * requeue or dequeue this delayed node.
 */
atomic_dec(&delayed_node->refs);
+
+   return true;
 }
 
-void btrfs_put_delayed_items(struct list_head *ins_list,
-struct list_head *del_list)
+void btrfs_readdir_put_delayed_items(struct inode *inode,
+struct list_head *ins_list,
+struct list_head *del_list)
 {
struct btrfs_delayed_item *curr, *next;
 
@@ -1659,6 +1670,12 @@ void btrfs_put_delayed_items(struct list_head *ins_list,
if (atomic_dec_and_test(&curr->refs))
kfree(curr);
}
+
+   /*
+* The VFS is going to do up_read(), so we need to downgrade back to a
+* read lock.
+*/
+   downgrade_write(&inode->i_rwsem);
 }
 
 int btrfs_should_delete_dir_index(struct list_head *del_list,
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 0167853c84ae..2495b3d4075f 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -137,10 +137,12 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 
 /* Used for readdir() */
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-struct list_head *del_list);
-void btrfs_put_delayed_items(struct list_head *ins_list,
-struct list_head *del_list);
+bool btrfs_readdir_get_delayed_items(struct inode *inode,
+struct list_head *ins_list,
+struct list_head *del_list);
+void btrfs_readdir_put_delayed_items(struct inode *inode,
+struct list_head *ins_list,
+struct list_head *del_list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
  u64 index);
 int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6b7fe291a174..6ab6ca195f2f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5733,6 +5733,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
int name_len;
int is_curr = 0;/* ctx->pos points to the current index? */
bool emitted;
+   bool put = false;
 
/* FIXME, use a real flag for deciding about the key type */
if (root->fs_info->tree_root == root)
@@ -5750,7 +5751,8 @@ 

Re: btrfs restore fails because of NO SPACE

2016-05-20 Thread Chris Murphy
What versions for kernel and btrfs-progs?

Have you tried only '-o ro,recovery' ? What kernel messages do you get for this?

Failure to read chunk tree message is usually bad. If you have a
recent enough btrfs-progs, try 'btrfs check' on the volume without
--repair and post the results; recent would be 4.4.1 or better.

Also a good idea is to post the output of btrfs-show-super -f; there's a
scant possibility that there's more than one backup chunk root and that
it's possible to explicitly use an older one (but btrfs check --chunk-root
is only in v4.5 of btrfs-progs).
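
For reference, that would be along the lines of (device node taken from
the original post):

btrfs check /dev/sda          # read-only by default, no --repair
btrfs-show-super -f /dev/sda  # dump the superblock, including backup roots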



Chris Murphy


btrfs restore fails because of NO SPACE

2016-05-20 Thread Wolf Bublitz
Hello,

I have a confusing problem with my btrfs raid. Currently I am using the 
following setup:

btrfs fi show
Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
Total devices 2 FS bytes used 2.56TiB
devid 1 size 2.73TiB used 2.71TiB path /dev/sda
devid 4 size 2.73TiB used 2.71TiB path /dev/sdb

As you can see both disks are full.

Actually I cannot mount my raid, even with recovery options enabled:

mount /dev/sda /mnt/Data -t btrfs 
-onospace_cache,clear_cache,enospc_debug,nodatacow
mount: wrong fs type, bad option, bad superblock on /dev/sda,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

dmesg shows:

[ 1066.221696] BTRFS info (device sda): disabling disk space caching
[ 1066.227990] BTRFS info (device sda): force clearing of disk cache
[ 1066.234331] BTRFS info (device sda): setting nodatacow, compression disabled
[ 1066.243813] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.253672] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.263450] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.273234] BTRFS: failed to read chunk root on sda
[ 1066.279675] BTRFS warning (device sda): page private not zero on page 9676657786880
[ 1066.287482] BTRFS warning (device sda): page private not zero on page 9676657790976
[ 1066.295361] BTRFS warning (device sda): page private not zero on page 9676657795072
[ 1066.303204] BTRFS warning (device sda): page private not zero on page 9676657799168
[ 1066.369266] BTRFS: open_ctree failed

After spending some time with Google I found a possible solution for my problem 
by running:

btrfs restore -v /dev/sda /mnt/Data

Actually this operation fails silently (the computer freezes). After examining the 
kernel logs I found out that the operation fails because of „NO SPACE 
LEFT ON DEVICE“. Can anybody please give me a solution for this problem?

Greetings

Wolf Bublitz


Re: Hot data tracking / hybrid storage

2016-05-20 Thread Austin S. Hemmelgarn

On 2016-05-20 13:02, Ferry Toth wrote:

We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
partitions are in the same pool, which is in btrfs RAID10 format. /boot
is in subvolume @boot.
If you have GRUB installed on all 4, then you don't actually have the 
full 2047 sectors between the MBR and the partition free, as GRUB is 
embedded in that space.  I forget exactly how much space it takes up, 
but I know it's not the whole 1023.5K.  I would not suggest risking usage 
of the final 8k there, though.  You could however convert to raid1 
temporarily, and then for each device, delete it, reformat for bcache, 
then re-add it to the FS.  This may take a while, but should be safe (of 
course, it's only an option if you're already using a kernel with bcache 
support).

In this configuration nothing would beat btrfs if I could just add 2
SSD's to the pool that would be clever enough to be paired in RAID1 and
would be preferred for small (<1GB) file writes. Then balance should be
able to move not often used files to the HDD.

None of the methods mentioned here sound easy or quick to do, or even
well tested.
It really depends on what you're used to.  I would consider most of the 
options easy, but one of the areas I'm strongest with is storage 
management, and I've repaired damaged filesystems and partition tables 
by hand with a hex editor before, so I'm not necessarily a typical user. 
 If I was going to suggest something specifically, it would be 
dm-cache, because it requires no modification to the backing store at 
all, but that would require running on LVM if you want it to be easy to 
set up (it's possible to do it without LVM, but you need something to 
call dmsetup before mounting the filesystem, which is not easy to 
configure correctly), and if you're on an enterprise distro, it may not 
be supported.
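
For reference, the LVM-managed flavour of dm-cache boils down to something 
like this (a sketch with made-up names, assuming the SSD partition has been 
added to the same VG as the existing origin LV 'data'):

pvcreate /dev/sdX1 && vgextend vg0 /dev/sdX1                          # SSD partition
lvcreate -L 100G -n cache0 vg0 /dev/sdX1
lvcreate -L 1G -n cache0meta vg0 /dev/sdX1
lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
lvconvert --type cache --cachepool vg0/cache0 vg0/data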


If you wanted to, it's possible, and not all that difficult, to convert 
a BTRFS system to BTRFS on top of LVM online, but you would probably 
have to split out the boot subvolume to a separate partition (depends on 
which distro you're on, some have working LVM support in GRUB, some 
don't).  If you're on a distro which does have LVM support in GRUB, the 
procedure would be:
1. Convert the BTRFS array to raid1. This lets you run with only 3 disks 
instead of 4.

2. Delete one of the disks from the array.
3. Convert the disk you deleted from the array to a LVM PV and add it to 
a VG.
4. Create a new logical volume occupying almost all of the PV you just 
added (having a little slack space is usually a good thing).
5. Use btrfs replace to add the LV to the BTRFS array while deleting 
one of the others.
6. Repeat from step 3-5 for each disk, but stop at step 4 when you have 
exactly one disk that isn't on LVM (so for four disks, stop at step four 
when you have 2 with BTRFS+LVM, one with just the LVM logical volume, 
and one with just BTRFS).

7. Reinstall GRUB (it should pull in LVM support now).
8. Use BTRFS replace to move the final BTRFS disk to the empty LVM volume.
9. Convert the now empty final disk to LVM using steps 3-4
10. Add the LV to the BTRFS array and rebalance to raid10.
11. Reinstall GRUB again (just to be certain).
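
A rough command-level sketch of steps 1-5 for the first disk (device, VG 
and LV names are made up; adjust for the real layout and mountpoint):

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt    # step 1
btrfs device delete /dev/sdd3 /mnt                          # step 2
pvcreate /dev/sdd3 && vgcreate vg0 /dev/sdd3                # step 3
lvcreate -l 95%FREE -n btrfs0 vg0                           # step 4
btrfs replace start /dev/sdc3 /dev/vg0/btrfs0 /mnt          # step 5
# ...repeat for the remaining disks, reinstall GRUB, then finally (step 10):
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt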

I've done essentially the same thing on numerous occasions when 
reprovisioning for various reasons, and it's actually one of the things 
outside of the xfstests that I check with my regression testing 
(including simulating a couple of the common failure modes).  It takes a 
while (especially for big arrays with lots of data), but it works, and 
is relatively safe (you are guaranteed to be able to rebuild a raid1 
array of 3 disks from just 2, so losing the disk in the process of 
copying it will not result in data loss unless you hit a kernel bug).



Re: Hot data tracking / hybrid storage

2016-05-20 Thread Ferry Toth
On Fri, 20 May 2016 08:03:12 -0400, Austin S. Hemmelgarn wrote:

> On 2016-05-19 19:23, Henk Slager wrote:
>> On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
>>  wrote:
>>> On 2016-05-19 14:09, Kai Krakow wrote:

 On Wed, 18 May 2016 22:44:55 + (UTC), Ferry Toth wrote:

> On Tue, 17 May 2016 20:33:35 +0200, Kai Krakow wrote:
 Bcache is actually low maintenance, no knobs to turn. Converting to
 bcache protective superblocks is a one-time procedure which can be
 done online. The bcache devices act as normal HDD if not attached to
 a caching SSD. It's really less pain than you may think. And it's a
 solution available now. Converting back later is easy: Just detach
 the HDDs from the SSDs and use them for some other purpose if you
 feel so later. Having the bcache protective superblock still in place
 doesn't hurt then. Bcache is a no-op without caching device attached.
>>>
>>> No, bcache is _almost_ a no-op without a caching device.  From a
>>> userspace perspective, it does nothing, but it is still another layer
>>> of indirection in the kernel, which does have a small impact on
>>> performance.  The same is true of using LVM with a single volume
>>> taking up the entire partition, it looks almost no different from just
>>> using the partition, but it will perform worse than using the
>>> partition directly.  I've actually done profiling of both to figure
>>> out base values for the overhead, and while bcache with no cache
>>> device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing
>>> storage is).
>>>
>>> You also lose the ability to mount that filesystem directly on a
>>> kernel without bcache support (this may or may not be an issue for
>>> you).
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDD's use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of
>> course, also first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
> My argument about the overhead was not about the superblock, it was
> about the bcache layer itself.  It isn't practical to just access the
> data directly if you plan on adding a cache device, because then you
> couldn't do so online unless you're going through bcache.  This extra
> layer of indirection in the kernel does add overhead, regardless of the
> on-disk format.
> 
> Secondarily, having a HDD with just one partition is not a typical use
> case, and that argument about the slack space resulting from the 1M
> alignment only holds true if you're using an MBR instead of a GPT layout
> (or for that matter, almost any other partition table format), and
> you're not booting from that disk (because GRUB embeds itself there).
> It's also fully possible to have an MBR formatted disk which doesn't
> have any spare space there too (which is how most flash drives get
> formatted).

We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4, 
then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs 
partitions are in the same pool, which is in btrfs RAID10 format. /boot 
is in subvolume @boot.

In this configuration nothing would beat btrfs if I could just add 2 
SSD's to the pool that would be clever enough to be paired in RAID1 and 
would be preferred for small (<1GB) file writes. Then balance should be 
able to move not often used files to the HDD.

None of the methods mentioned here sound easy or quick to do, or even 
well tested.

> This also doesn't change the fact that without careful initial
> formatting (it is possible on some filesystems to embed the bcache SB at
> the beginning of the FS itself, many of them have some reserved space at
> the beginning of the partition for bootloaders, and this space doesn't
> have to exist when mounting the FS) or manual alteration of the
> partition, it's not possible to mount the FS on a system without bcache
> support.
>>
>> There is also a tool doing the conversion in-place 

Re: Amount of scrubbed data goes from 15.90GiB to 26.66GiB after defragment -r -v -clzo on a fs always mounted with compress=lzo

2016-05-20 Thread Niccolò Belli

On venerdì 13 maggio 2016 08:11:27 CEST, Duncan wrote:
In theory the various btrfs dedup solutions out there should work as 
well, while letting you keep the snapshots (at least to the extent 
they're either writable snapshots so can be reflink modified


Unfortunately as you said dedup doesn't work with read-only snapshots (I 
only use read-only snapshots with snapper) :(


Does bedup's dedup-syscall branch 
(https://github.com/g2p/bedup/tree/wip/dedup-syscall), which uses the new 
batch deduplication ioctl merged in Linux 3.12, fix this? Unfortunately 
the latest commit is from September :(



Re: [PATCH 6/6] Btrfs: fix race between device replace and chunk allocation

2016-05-20 Thread Filipe Manana
On Fri, May 20, 2016 at 4:30 PM, Josef Bacik  wrote:
> On Fri, May 20, 2016 at 12:45 AM,   wrote:
>> From: Filipe Manana 
>>
>> While iterating and copying extents from the source device, the device
>> replace code keeps adjusting a left cursor that is used to make sure that
>> once we finish processing a device extent, any future writes to extents
>> from the corresponding block group will get into both the source and
>> target devices. This left cursor is also used for resuming the device
>> replace operation at mount time.
>>
>> However using this left cursor to decide whether writes go into both
>> devices or only the source device is not enough to guarantee we don't
>> miss copying extents into the target device. There are two cases where
>> the current approach fails. The first one is related to when there are
>> holes in the device and they get allocated for new block groups while
>> the device replace operation is iterating the device extents (more on
>> this explained below). The second one is that when that loop over the
>> device extents finishes, we start delalloc, wait for all ordered extents
>> and then commit the current transaction, and by then we might have got new
>> block groups allocated that are now using a device extent that has an offset
>> greater than or equal to the value of the left cursor, in which case
>> writes to extents belonging to these new block groups will get issued
>> only to the source device.
>>
>> For the first case where the current approach of using a left cursor
>> fails, consider the source device currently has the following layout:
>>
>>   [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
>>   3Gb 4Gb 5Gb
>>
>> While we are iterating the device extents from the source device using
>> the commit root of the device tree, the following happens:
>>
>> CPU 1CPU 2
>>
>>   
>>
>>   scrub_enumerate_chunks()
>> --> searches the device tree for
>> extents belonging to the source
>> device using the device tree's
>> commit root
>> --> 1st iteration finds extent belonging to
>> block group A
>>
>> --> sets block group A to RO mode
>> (btrfs_inc_block_group_ro)
>>
>> --> sets cursor left to found_key.offset
>> which is 3Gb
>>
>> --> scrub_chunk() starts
>> copies all allocated extents from
>> block group's A stripe at source
>> device into target device
>>
>>
>> btrfs_alloc_chunk()
>>  --> allocates 
>> device extent
>>  in the 
>> range [4Gb, 5Gb[
>>  from the 
>> source device for
>>  a new block 
>> group C
>>
>>extent allocated 
>> from block
>>group C for a 
>> direct IO,
>>buffered write or 
>> btree node/leaf
>>
>>extent is written 
>> to, perhaps
>>in response to a 
>> writepages()
>>call from the VM 
>> or directly
>>through direct IO
>>
>>the write is made 
>> only against
>>the source device 
>> and not against
>>the target device 
>> because the
>>extent's offset 
>> is in the interval
>>[4Gb, 5Gb[ which 
>> is larger then
>>the value of 
>> cursor_left (3Gb)
>>
>> --> scrub_chunks() finishes
>>
>> --> updates left cursor from 3Gb to
>> 4Gb
>>
>> --> btrfs_dec_block_group_ro() sets
>> block group A back to RW mode
>>
>>  
>>
>> --> 2nd iteration finds extent belonging to
>> block group B - it did not find the new
>> extent in the range [4Gb, 5Gb[ for block
>> group C because we are using the device
>> tree's commit root or even because the
>> block group's items are not all yet
>> inserted in the respective btrees, that is,
>> the block group is still attached to some
>> 

Re: [PATCH 6/6] Btrfs: fix race between device replace and chunk allocation

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:45 AM,   wrote:
> From: Filipe Manana 
>
> While iterating and copying extents from the source device, the device
> replace code keeps adjusting a left cursor that is used to make sure that
> once we finish processing a device extent, any future writes to extents
> from the corresponding block group will get into both the source and
> target devices. This left cursor is also used for resuming the device
> replace operation at mount time.
>
> However using this left cursor to decide whether writes go into both
> devices or only the source device is not enough to guarantee we don't
> miss copying extents into the target device. There are two cases where
> the current approach fails. The first one is related to when there are
> holes in the device and they get allocated for new block groups while
> the device replace operation is iterating the device extents (more on
> this explained below). The second one is that when that loop over the
> device extents finishes, we start delalloc, wait for all ordered extents
> and then commit the current transaction, and by then we might have got new
> block groups allocated that are now using a device extent that has an offset
> greater than or equal to the value of the left cursor, in which case
> writes to extents belonging to these new block groups will get issued
> only to the source device.
>
> For the first case where the current approach of using a left cursor
> fails, consider the source device currently has the following layout:
>
>   [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
>   3Gb 4Gb 5Gb
>
> While we are iterating the device extents from the source device using
> the commit root of the device tree, the following happens:
>
> CPU 1CPU 2
>
>   
>
>   scrub_enumerate_chunks()
> --> searches the device tree for
> extents belonging to the source
> device using the device tree's
> commit root
> --> 1st iteration finds extent belonging to
> block group A
>
> --> sets block group A to RO mode
> (btrfs_inc_block_group_ro)
>
> --> sets cursor left to found_key.offset
> which is 3Gb
>
> --> scrub_chunk() starts
> copies all allocated extents from
> block group's A stripe at source
> device into target device
>
>btrfs_alloc_chunk()
>  --> allocates 
> device extent
>  in the range 
> [4Gb, 5Gb[
>  from the 
> source device for
>  a new block 
> group C
>
>extent allocated 
> from block
>group C for a 
> direct IO,
>buffered write or 
> btree node/leaf
>
>extent is written 
> to, perhaps
>in response to a 
> writepages()
>call from the VM 
> or directly
>through direct IO
>
>the write is made 
> only against
>the source device 
> and not against
>the target device 
> because the
>extent's offset is 
> in the interval
>[4Gb, 5Gb[ which 
> is larger then
>the value of 
> cursor_left (3Gb)
>
> --> scrub_chunks() finishes
>
> --> updates left cursor from 3Gb to
> 4Gb
>
> --> btrfs_dec_block_group_ro() sets
> block group A back to RW mode
>
>  
>
> --> 2nd iteration finds extent belonging to
> block group B - it did not find the new
> extent in the range [4Gb, 5Gb[ for block
> group C because we are using the device
> tree's commit root or even because the
> block group's items are not all yet
> inserted in the respective btrees, that is,
> the block group is still attached to some
> transaction handle's new_bgs list and
> btrfs_create_pending_block_groups() was
> not called yet against that transaction
> handle, so the device extent items 

Re: [PATCH 5/6] Btrfs: fix race setting block group back to RW mode during device replace

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:45 AM,   wrote:
> From: Filipe Manana 
>
> After it finishes processing a device extent, the device replace code sets
> back the block group to RW mode and then after that it sets the left cursor
> to match the logical end address of the block group, so that future writes
> into extents belonging to the block group go both the source (old) and
> target (new) devices. However from the moment we turn the block group
> back to RW mode we have a short time window, that lasts until we update
> the left cursor's value, where extents can be allocated from the block
> group and written to, in which case they will not be copied/written to
> the target (new) device. Fix this by updating the left cursor's value
> before turning the block group back to RW mode.
>
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 4/6] Btrfs: fix unprotected assignment of the left cursor for device replace

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:44 AM,   wrote:
> From: Filipe Manana 
>
> We were assigning new values to fields of the device replace object
> without holding the respective lock after processing each device extent.
> This is important for the left cursor field which can be accessed by a
> concurrent task running __btrfs_map_block (which, correctly, takes the
> device replace lock).
> So change these fields while holding the device replace lock.
>
> Signed-off-by: Filipe Manana 

Eesh, thanks,

Reviewed-by: Josef Bacik 


Re: [PATCH 3/6] Btrfs: fix race setting block group readonly during device replace

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:44 AM,   wrote:
> From: Filipe Manana 
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode some concurrent task might have already
> allocated an extent from it or decided it could perform a nocow write
> into one of its extents, which can make the device replace process
> miss copying an extent since it uses the extent tree's commit root to
> search for extents and only once it finishes searching for all extents
> belonging to the block group it does set the left cursor to the logical
> end address of the block group - this is a problem if the respective
> ordered extents finish while we are searching for extents using the
> extent tree's commit root and no transaction commit happens while we
> are iterating the tree, since it's the delayed references created by the
> ordered extents (when they complete) that insert the extent items into
> the extent tree (using the non-commit root of course).
> Example:
>
>   CPU 1CPU 2
>
>  btrfs_dev_replace_start()
>btrfs_scrub_dev()
>  scrub_enumerate_chunks()
>--> finds device extent belonging
>to block group X
>
>
>
>   starts buffered write
>   against some inode
>
>   writepages is run 
> against
>   that inode forcing 
> dellaloc
>   to run
>
>   btrfs_writepages()
> extent_writepages()
>   
> extent_write_cache_pages()
> 
> __extent_writepage()
>   
> writepage_delalloc()
> 
> run_delalloc_range()
>   
> cow_file_range()
> 
> btrfs_reserve_extent()
>   --> 
> allocates an extent
>   
> from block group X
>   
> (which is not yet
>in 
> RO mode)
> 
> btrfs_add_ordered_extent()
>   --> 
> creates ordered extent Y
> flush_epd_write_bio()
>   --> bio against the 
> extent from
>   block group X 
> is submitted
>
>btrfs_inc_block_group_ro(bg X)
>  --> sets block group X to readonly
>
>scrub_chunk(bg X)
>  scrub_stripe(device extent from srcdev)
>--> keeps searching for extent items
>belonging to the block group using
>the extent tree's commit root
>--> it never blocks due to
>fs_info->scrub_pause_req as no
>one tries to commit transaction N
>--> copies all extents found from the
>source device into the target device
>--> finishes search loop
>
> bio completes
>
> ordered extent Y 
> completes
> and creates delayed 
> data
> reference which will 
> add an
> extent item to the 
> extent
> tree when run 
> (typically
> at transaction commit 
> time)
>
>   --> so the task 
> doing the
>   scrub/device 
> replace
>   at CPU 1 misses 
> this
>  

Re: [PATCH 2/6] Btrfs: fix race between device replace and block group removal

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:44 AM,   wrote:
> From: Filipe Manana 
>
> When it's finishing, the device replace code iterates all extent maps
> representing block group and for each one that has a stripe that refers
> to the source device, it replaces its device with the target device.
> However when it replaces the source device with the target device, the
> target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID); its ID is
> only changed to match the one from the source device later on.
> This leads to races with the chunk removal code that can temporarily see
> a device with an ID of 0ULL and then attempt to use that ID to remove
> items from the device tree and fail, causing a transaction abort:
>
> [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) 
> to /dev/sde finished
> [ 9238.594377] [ cut here ]
> [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 
> btrfs_remove_chunk+0x2e5/0x793 [btrfs]
> [ 9238.594403] BTRFS: Transaction aborted (error 1)
> [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor 
> tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 
> evdev sg i2c_core se
> rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom 
> sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio 
> e1000 scsi_mod fl
> oppy [last unloaded: btrfs]
> [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 
> 4.6.0-rc7-btrfs-next-29+ #1
> [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by 
> qemu-project.org 04/01/2014
> [ 9238.594421]   88017f1dbc60 8126b42c 
> 88017f1dbcb0
> [ 9238.594422]   88017f1dbca0 81052b14 
> 0ad37f1dbd18
> [ 9238.594423]  0001 88018068a558 88005c4b9c00 
> 880233f60db0
> [ 9238.594424] Call Trace:
> [ 9238.594428]  [] dump_stack+0x67/0x90
> [ 9238.594430]  [] __warn+0xc2/0xdd
> [ 9238.594432]  [] warn_slowpath_fmt+0x4b/0x53
> [ 9238.594434]  [] ? kmem_cache_free+0x128/0x188
> [ 9238.594450]  [] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
> [ 9238.594452]  [] ? arch_local_irq_save+0x9/0xc
> [ 9238.594464]  [] btrfs_delete_unused_bgs+0x317/0x382 
> [btrfs]
> [ 9238.594476]  [] cleaner_kthread+0x1ad/0x1c7 [btrfs]
> [ 9238.594489]  [] ? btree_invalidatepage+0x8e/0x8e [btrfs]
> [ 9238.594490]  [] kthread+0xd4/0xdc
> [ 9238.594494]  [] ret_from_fork+0x22/0x40
> [ 9238.594495]  [] ? kthread_stop+0x286/0x286
> [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
>
> The sequence of steps leading to this is like the following:
>
>   CPU 1   CPU 2
>
>  btrfs_dev_replace_finishing()
>
>at this point
>dev_replace->tgtdev->devid ==
>BTRFS_DEV_REPLACE_DEVID (0ULL)
>
>...
>
>btrfs_start_transaction()
>btrfs_commit_transaction()
>
>  btrfs_delete_unused_bgs()
>btrfs_remove_chunk()
>
>  looks up for the 
> extent map
>  corresponding to the 
> chunk
>
>  lock_chunks() 
> (chunk_mutex)
>  check_system_chunk()
>  unlock_chunks() 
> (chunk_mutex)
>
>locks fs_info->chunk_mutex
>
>btrfs_dev_replace_update_device_in_mapping_tree()
>  --> iterates fs_info->mapping_tree and
>  replaces the device in every extent
>  map's map->stripes[] with
>  dev_replace->tgtdev, which still has
>  an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
>
>  iterates over all 
> stripes from
>  the extent map
>
>--> calls 
> btrfs_free_dev_extent()
>passing it the 
> target device
>that still has 
> an ID of 0ULL
>
>--> 
> btrfs_free_dev_extent() fails
>  --> aborts 
> current transaction
>
>finishes setting up the target device,
>namely it sets tgtdev->devid to the value
>of srcdev->devid (which is necessarily > 0)
>
>frees the srcdev
>
>unlocks fs_info->chunk_mutex
>
> So fix this by taking the device list mutex while processing the stripes
> for the chunk's extent map. This is similar to the race between device
> replace and block group creation that was fixed by commit 50460e37186a
> ("Btrfs: fix 

Re: [PATCH 1/6] Btrfs: fix race between readahead and device replace/removal

2016-05-20 Thread Josef Bacik
On Fri, May 20, 2016 at 12:44 AM,   wrote:
> From: Filipe Manana 
>
> The list of devices is protected by the device_list_mutex and the device
> replace code, in its finishing phase correctly takes that mutex before
> removing the source device from that list. However the readahead code was
> iterating that list without acquiring the respective mutex leading to
> crashes later on due to invalid memory accesses:
>
> [125671.831036] general protection fault:  [#1] PREEMPT SMP
> [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor 
> raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
> processor ser
> [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: GW   
> 4.6.0-rc7-btrfs-next-29+ #1
> [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> by qemu-project.org 04/01/2014
> [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
> [125671.834973] task: 8801ac520540 ti: 8801ac918000 task.ti: 
> 8801ac918000
> [125671.834973] RIP: 0010:[]  [] 
> __radix_tree_lookup+0x6a/0x105
> [125671.834973] RSP: 0018:8801ac91bc28  EFLAGS: 00010206
> [125671.834973] RAX:  RBX: 6b6b6b6b6b6b6b6a RCX: 
> 
> [125671.834973] RDX:  RSI: 000c1bff RDI: 
> 88002ebd62a8
> [125671.834973] RBP: 8801ac91bc70 R08: 0001 R09: 
> 
> [125671.834973] R10: 8801ac91bc70 R11:  R12: 
> 88002ebd62a8
> [125671.834973] R13:  R14:  R15: 
> 000c1bff
> [125671.834973] FS:  () GS:88023fd4() 
> knlGS:
> [125671.834973] CS:  0010 DS:  ES:  CR0: 80050033
> [125671.834973] CR2: 0073cae4 CR3: b7723000 CR4: 
> 06e0
> [125671.834973] Stack:
> [125671.834973]   8801422d5600 8802286bbc00 
> 
> [125671.834973]  0001 8802286bbc00 000c1bff 
> 
> [125671.834973]  88002e639eb8 8801ac91bc80 81270541 
> 8801ac91bcb0
> [125671.834973] Call Trace:
> [125671.834973]  [] radix_tree_lookup+0xd/0xf
> [125671.834973]  [] reada_peer_zones_set_lock+0x3e/0x60 
> [btrfs]
> [125671.834973]  [] reada_pick_zone+0x29/0x103 [btrfs]
> [125671.834973]  [] reada_start_machine_worker+0x129/0x2d3 
> [btrfs]
> [125671.834973]  [] btrfs_scrubparity_helper+0x185/0x3aa 
> [btrfs]
> [125671.834973]  [] btrfs_readahead_helper+0xe/0x10 [btrfs]
> [125671.834973]  [] process_one_work+0x271/0x4e9
> [125671.834973]  [] worker_thread+0x1eb/0x2c9
> [125671.834973]  [] ? rescuer_thread+0x2b3/0x2b3
> [125671.834973]  [] kthread+0xd4/0xdc
> [125671.834973]  [] ret_from_fork+0x22/0x40
> [125671.834973]  [] ? kthread_stop+0x286/0x286
>
> So fix this by taking the device_list_mutex in the readahead code. We
> can't use here the lighter approach of using a rcu_read_lock() and
> rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
> because we end up doing calls to sleeping functions (kzalloc()) in the
> respective code path.
>
> Signed-off-by: Filipe Manana 

I think it might be time to change this to a rwsem as well, since we use
it in a bunch of places that are read only, like statfs and readahead.
But this works for now.

Reviewed-by: Josef Bacik 

Thanks,

Josef


[PATCH 4/6] Btrfs: fix unprotected assignment of the left cursor for device replace

2016-05-20 Thread fdmanana
From: Filipe Manana 

We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/scrub.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a181b52..a58e0ae 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3640,9 +3640,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
break;
}
 
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
dev_replace->cursor_right = found_key.offset + length;
dev_replace->cursor_left = found_key.offset;
dev_replace->item_needs_writeback = 1;
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
  found_key.offset, cache, is_dev_replace);
 
@@ -3716,8 +3718,10 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
break;
}
 
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
dev_replace->cursor_left = dev_replace->cursor_right;
dev_replace->item_needs_writeback = 1;
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
 skip:
key.offset = found_key.offset + length;
btrfs_release_path(path);
-- 
2.7.0.rc3



[PATCH 5/6] Btrfs: fix race setting block group back to RW mode during device replace

2016-05-20 Thread fdmanana
From: Filipe Manana 

After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/scrub.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a58e0ae..c4c09a8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3680,6 +3680,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
scrub_pause_off(fs_info);
 
+   btrfs_dev_replace_lock(_info->dev_replace, 1);
+   dev_replace->cursor_left = dev_replace->cursor_right;
+   dev_replace->item_needs_writeback = 1;
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
+
if (ro_set)
btrfs_dec_block_group_ro(root, cache);
 
@@ -3717,11 +3722,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
ret = -ENOMEM;
break;
}
-
-   btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
-   dev_replace->cursor_left = dev_replace->cursor_right;
-   dev_replace->item_needs_writeback = 1;
-   btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
 skip:
key.offset = found_key.offset + length;
btrfs_release_path(path);
-- 
2.7.0.rc3



[PATCH 6/6] Btrfs: fix race between device replace and chunk allocation

2016-05-20 Thread fdmanana
From: Filipe Manana 

While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group go to both the source and target
devices. This left cursor is also used for resuming the device replace
operation at mount time.

However using this left cursor to decide whether writes go to both
devices or only to the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one happens when there are holes
in the device and they get allocated for new block groups while the
device replace operation is iterating the device extents (more on this
explained below). The second one is that when the loop over the device
extents finishes, we start delalloc, wait for all ordered extents and
then commit the current transaction; by then we might have new block
groups allocated that use a device extent with an offset greater than or
equal to the value of the left cursor, in which case writes to extents
belonging to these new block groups will be issued only to the source
device.
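
To make the left cursor's role concrete, the decision the write path has
to make is conceptually something like the following. This is a
hypothetical, heavily simplified sketch of the rule described above, not
the actual __btrfs_map_block() code, and the helper name is made up:

/*
 * Hypothetical helper, made up for illustration: extents at offsets
 * below cursor_left belong to block groups the replace operation has
 * already finished copying, so new writes to them must also go to the
 * target device.  Extents at or above cursor_left are expected to be
 * copied later, which is exactly the expectation the two cases above
 * break.
 */
static bool write_also_goes_to_target(const struct btrfs_dev_replace *dev_replace,
				      u64 physical_offset_on_srcdev)
{
	return physical_offset_on_srcdev < dev_replace->cursor_left;
}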

For the first case where the current approach of using a left cursor
fails, consider that the source device currently has the following layout:

  [ extent bg A ] [ hole, unallocated space ] [ extent bg B ]
  3Gb             4Gb                         5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

 CPU 1                                      CPU 2

  scrub_enumerate_chunks()
    --> searches the device tree for
        extents belonging to the source
        device using the device tree's
        commit root
    --> 1st iteration finds extent
        belonging to block group A

    --> sets block group A to RO mode
        (btrfs_inc_block_group_ro)

    --> sets cursor left to
        found_key.offset, which is 3Gb

    --> scrub_chunk() starts and copies
        all allocated extents from block
        group A's stripe at the source
        device into the target device

                                            btrfs_alloc_chunk()
                                              --> allocates a device extent
                                                  in the range [4Gb, 5Gb[
                                                  from the source device for
                                                  a new block group C

                                            extent allocated from block
                                            group C for a direct IO,
                                            buffered write or btree
                                            node/leaf

                                            extent is written to, perhaps
                                            in response to a writepages()
                                            call from the VM or directly
                                            through direct IO

                                            the write is made only against
                                            the source device and not
                                            against the target device,
                                            because the extent's offset is
                                            in the interval [4Gb, 5Gb[,
                                            which is larger than the value
                                            of cursor_left (3Gb)

    --> scrub_chunk() finishes

    --> updates left cursor from 3Gb to
        4Gb

    --> btrfs_dec_block_group_ro() sets
        block group A back to RW mode

    --> 2nd iteration finds extent
        belonging to block group B - it
        did not find the new extent in
        the range [4Gb, 5Gb[ for block
        group C because we are using the
        device tree's commit root, or
        even because the block group's
        items are not all yet inserted
        in the respective btrees, that
        is, the block group is still
        attached to some transaction
        handle's new_bgs list and
        btrfs_create_pending_block_groups()
        was not called yet against that
        transaction handle, so the
        device extent items were not yet
        inserted into the devices tree

    --> so we end up not copying anything
        from the newly allocated device
        extent on the source device to
        the target device

So fix this by making 

[PATCH 3/6] Btrfs: fix race setting block group readonly during device replace

2016-05-20 Thread fdmanana
From: Filipe Manana 

When we do a device replace, for each device extent we find on the
source device we set the corresponding block group to readonly mode, to
prevent writes into it while we are copying the device extent from the
source to the target device. However, just before we set the block group
to readonly mode, some concurrent task might have already allocated an
extent from it, or decided it could perform a nocow write into one of
its extents. This can make the device replace process miss copying an
extent, since it uses the extent tree's commit root to search for
extents and only sets the left cursor to the logical end address of the
block group once it finishes searching for all extents belonging to it.
That is a problem if the respective ordered extents finish while we are
searching for extents using the extent tree's commit root and no
transaction commit happens while we are iterating the tree, since it is
the delayed references created by the ordered extents (when they
complete) that insert the extent items into the extent tree (using the
non-commit root, of course).
Example:

 CPU 1                                      CPU 2

  btrfs_dev_replace_start()
    btrfs_scrub_dev()
      scrub_enumerate_chunks()
        --> finds device extent belonging
            to block group X

                                            starts buffered write
                                            against some inode

                                            writepages is run against
                                            that inode, forcing delalloc
                                            to run

                                            btrfs_writepages()
                                             extent_writepages()
                                              extent_write_cache_pages()
                                               __extent_writepage()
                                                writepage_delalloc()
                                                 run_delalloc_range()
                                                  cow_file_range()
                                                   btrfs_reserve_extent()
                                                     --> allocates an extent
                                                         from block group X
                                                         (which is not yet
                                                         in RO mode)
                                                   btrfs_add_ordered_extent()
                                                     --> creates ordered
                                                         extent Y
                                            flush_epd_write_bio()
                                              --> bio against the extent
                                                  from block group X is
                                                  submitted

        btrfs_inc_block_group_ro(bg X)
          --> sets block group X to readonly

        scrub_chunk(bg X)
          scrub_stripe(device extent from srcdev)
            --> keeps searching for extent
                items belonging to the block
                group using the extent
                tree's commit root
            --> it never blocks due to
                fs_info->scrub_pause_req as
                no one tries to commit
                transaction N
            --> copies all extents found
                from the source device into
                the target device
            --> finishes search loop

                                            bio completes

                                            ordered extent Y completes
                                            and creates a delayed data
                                            reference which will add an
                                            extent item to the extent
                                            tree when run (typically at
                                            transaction commit time)

                                              --> so the task doing the
                                                  scrub/device replace
                                                  at CPU 1 misses this
                                                  and does not copy this
                                                  extent into the
                                                  new/target device

        btrfs_dec_block_group_ro(bg X)
          --> turns block group X back to RW mode

[PATCH 2/6] Btrfs: fix race between device replace and block group removal

2016-05-20 Thread fdmanana
From: Filipe Manana 

When it's finishing, the device replace code iterates all extent maps
representing block groups and, for each one that has a stripe that refers
to the source device, it replaces that stripe's device with the target
device. However, at the moment it replaces the source device with the
target device, the target device still has an ID of 0ULL
(BTRFS_DEV_REPLACE_DEVID); only later is its ID changed to match the one
from the source device. This leads to races with the chunk removal code,
which can temporarily see a device with an ID of 0ULL and then attempt to
use that ID to remove items from the device tree and fail, causing a
transaction abort:

[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to 
/dev/sde finished
[ 9238.594377] [ cut here ]
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 
btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis
tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg
i2c_core serio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod
cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring
virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 
4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by 
qemu-project.org 04/01/2014
[ 9238.594421]   88017f1dbc60 8126b42c 
88017f1dbcb0
[ 9238.594422]   88017f1dbca0 81052b14 
0ad37f1dbd18
[ 9238.594423]  0001 88018068a558 88005c4b9c00 
880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428]  [] dump_stack+0x67/0x90
[ 9238.594430]  [] __warn+0xc2/0xdd
[ 9238.594432]  [] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434]  [] ? kmem_cache_free+0x128/0x188
[ 9238.594450]  [] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452]  [] ? arch_local_irq_save+0x9/0xc
[ 9238.594464]  [] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476]  [] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489]  [] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490]  [] kthread+0xd4/0xdc
[ 9238.594494]  [] ret_from_fork+0x22/0x40
[ 9238.594495]  [] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---

The sequence of steps leading to this is like the following:

  CPU 1                                     CPU 2

  btrfs_dev_replace_finishing()

    at this point
    dev_replace->tgtdev->devid ==
    BTRFS_DEV_REPLACE_DEVID (0ULL)

    ...

    btrfs_start_transaction()
    btrfs_commit_transaction()

                                            btrfs_delete_unused_bgs()
                                              btrfs_remove_chunk()

                                                looks up the extent map
                                                corresponding to the chunk

                                                lock_chunks() (chunk_mutex)
                                                check_system_chunk()
                                                unlock_chunks() (chunk_mutex)

    locks fs_info->chunk_mutex

    btrfs_dev_replace_update_device_in_mapping_tree()
      --> iterates fs_info->mapping_tree
          and replaces the device in every
          extent map's map->stripes[] with
          dev_replace->tgtdev, which still
          has an id of 0ULL
          (BTRFS_DEV_REPLACE_DEVID)

                                                iterates over all stripes
                                                from the extent map

                                                  --> calls
                                                      btrfs_free_dev_extent()
                                                      passing it the target
                                                      device, which still
                                                      has an ID of 0ULL

                                                  --> btrfs_free_dev_extent()
                                                      fails
                                                    --> aborts the current
                                                        transaction

    finishes setting up the target device,
    namely it sets tgtdev->devid to the
    value of srcdev->devid (which is
    necessarily > 0)

    frees the srcdev

    unlocks fs_info->chunk_mutex

So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e37186a
("Btrfs: fix race when finishing dev replace leading to transaction abort").

Signed-off-by: Filipe Manana 
---
 fs/btrfs/volumes.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bd0f45f..683e2bd 100644
--- 

[PATCH 1/6] Btrfs: fix race between readahead and device replace/removal

2016-05-20 Thread fdmanana
From: Filipe Manana 

The list of devices is protected by the device_list_mutex, and the device
replace code, in its finishing phase, correctly takes that mutex before
removing the source device from that list. However the readahead code was
iterating that list without acquiring the respective mutex, leading to
crashes later on due to invalid memory accesses:

[125671.831036] general protection fault:  [#1] PREEMPT SMP
[125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor 
raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
processor ser
[125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: GW 
  4.6.0-rc7-btrfs-next-29+ #1
[125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by 
qemu-project.org 04/01/2014
[125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
[125671.834973] task: 8801ac520540 ti: 8801ac918000 task.ti: 
8801ac918000
[125671.834973] RIP: 0010:[]  [] 
__radix_tree_lookup+0x6a/0x105
[125671.834973] RSP: 0018:8801ac91bc28  EFLAGS: 00010206
[125671.834973] RAX:  RBX: 6b6b6b6b6b6b6b6a RCX: 

[125671.834973] RDX:  RSI: 000c1bff RDI: 
88002ebd62a8
[125671.834973] RBP: 8801ac91bc70 R08: 0001 R09: 

[125671.834973] R10: 8801ac91bc70 R11:  R12: 
88002ebd62a8
[125671.834973] R13:  R14:  R15: 
000c1bff
[125671.834973] FS:  () GS:88023fd4() 
knlGS:
[125671.834973] CS:  0010 DS:  ES:  CR0: 80050033
[125671.834973] CR2: 0073cae4 CR3: b7723000 CR4: 
06e0
[125671.834973] Stack:
[125671.834973]   8801422d5600 8802286bbc00 

[125671.834973]  0001 8802286bbc00 000c1bff 

[125671.834973]  88002e639eb8 8801ac91bc80 81270541 
8801ac91bcb0
[125671.834973] Call Trace:
[125671.834973]  [] radix_tree_lookup+0xd/0xf
[125671.834973]  [] reada_peer_zones_set_lock+0x3e/0x60 
[btrfs]
[125671.834973]  [] reada_pick_zone+0x29/0x103 [btrfs]
[125671.834973]  [] reada_start_machine_worker+0x129/0x2d3 
[btrfs]
[125671.834973]  [] btrfs_scrubparity_helper+0x185/0x3aa 
[btrfs]
[125671.834973]  [] btrfs_readahead_helper+0xe/0x10 [btrfs]
[125671.834973]  [] process_one_work+0x271/0x4e9
[125671.834973]  [] worker_thread+0x1eb/0x2c9
[125671.834973]  [] ? rescuer_thread+0x2b3/0x2b3
[125671.834973]  [] kthread+0xd4/0xdc
[125671.834973]  [] ret_from_fork+0x22/0x40
[125671.834973]  [] ? kthread_stop+0x286/0x286

So fix this by taking the device_list_mutex in the readahead code. We
can't use here the lighter approach of using a rcu_read_lock() and
rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
because we end up doing calls to sleeping functions (kzalloc()) in the
respective code path.
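
As a side note, the reason the lighter RCU approach does not work here is
that an RCU read-side critical section must not sleep, while a GFP_KERNEL
allocation may. A minimal illustration of the problematic pattern follows;
this is hypothetical code, not taken from btrfs:

/*
 * Hypothetical illustration only: kzalloc(..., GFP_KERNEL) may sleep,
 * and sleeping is not allowed inside an RCU read-side critical section,
 * so this pattern is a bug.
 */
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/list.h>

struct example_device {
	struct list_head dev_list;
};

static void broken_iteration(struct list_head *devices)
{
	struct example_device *device;

	rcu_read_lock();
	list_for_each_entry_rcu(device, devices, dev_list) {
		/* BUG: may sleep while inside rcu_read_lock() */
		void *zone = kzalloc(128, GFP_KERNEL);

		kfree(zone);
	}
	rcu_read_unlock();
}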

Signed-off-by: Filipe Manana 
---
 fs/btrfs/reada.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 298631ea..8428db7 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -761,12 +761,14 @@ static void __reada_start_machine(struct btrfs_fs_info *fs_info)
 
do {
enqueued = 0;
+   mutex_lock(&fs_devices->device_list_mutex);
list_for_each_entry(device, &fs_devices->devices, dev_list) {
if (atomic_read(&device->reada_in_flight) <
MAX_IN_FLIGHT)
enqueued += reada_start_machine_dev(fs_info,
device);
}
+   mutex_unlock(&fs_devices->device_list_mutex);
total += enqueued;
} while (enqueued && total < 10000);
 
-- 
2.7.0.rc3



kernel 4.5.5 & space_cache=v2 early enospc, forced read-only

2016-05-20 Thread E V
Just trying space_cache=v2 on my big backup btrfs, mounted via
space_cache=v2,enospc_debug,nofail,noatime,compress=zlib. Looks like
something got confused during an rsync, which then quickly propagated
up to forcing the fs read-only in the long stack traces below. I'll be
happy to test the new ENOSPC ticket system patches once they seem
ready, if it will help. btrfs file usage:
Overall:
Device size:         249.22TiB
Device allocated:    211.45TiB
Device unallocated:   37.77TiB
Device missing:          0.00B
Used:                210.90TiB
Free (estimated):     38.31TiB  (min: 19.43TiB)
Data ratio:               1.00
Metadata ratio:           2.00
Global reserve:      512.00MiB  (used: 0.00B)

Data,single: Size:210.92TiB, Used:210.37TiB
   /dev/sdb   25.23TiB
   /dev/sdc   25.22TiB
   /dev/sdd   35.23TiB
   /dev/sde   35.23TiB
   /dev/sdf   22.50TiB
   /dev/sdg   22.50TiB
   /dev/sdh   22.50TiB
   /dev/sdi   22.50TiB

Metadata,RAID10: Size:274.00GiB, Used:272.47GiB
   /dev/sdb   34.25GiB
   /dev/sdc   34.25GiB
   /dev/sdd   34.25GiB
   /dev/sde   34.25GiB
   /dev/sdf   34.25GiB
   /dev/sdg   34.25GiB
   /dev/sdh   34.25GiB
   /dev/sdi   34.25GiB

System,RAID10: Size:64.00MiB, Used:28.09MiB
   /dev/sdb8.00MiB
   /dev/sdc8.00MiB
   /dev/sdd8.00MiB
   /dev/sde8.00MiB
   /dev/sdf8.00MiB
   /dev/sdg8.00MiB
   /dev/sdh8.00MiB
   /dev/sdi8.00MiB

Unallocated:
   /dev/sdb4.75TiB
   /dev/sdc4.75TiB
   /dev/sdd4.75TiB
   /dev/sde4.75TiB
   /dev/sdf4.75TiB
   /dev/sdg4.75TiB
   /dev/sdh4.75TiB
   /dev/sdi4.75TiB

[20581.396634] WARNING: CPU: 6 PID: 4639 at
fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0xed/0x3e6
[btrfs]()
[20581.396684] BTRFS: block rsv returned -28
[20581.396686] Modules linked in: ipmi_si mpt3sas raid_class
scsi_transport_sas dell_rbu nfsv3 nfsv4 nfsd auth_rpcgss oid_registry
nfs_acl nfs lockd grace fscache sunrpc ext2 intel_powerclamp coretemp
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac
drbg aesni_intel aes_x86_64 glue_helper lrw gf128mul joydev
ablk_helper cryptd iTCO_wdt iTCO_vendor_support evdev ipmi_devintf
dcdbas serio_raw pcspkr lpc_ich ipmi_msghandler mfd_core i7core_edac
edac_core acpi_power_meter button processor loop autofs4 ext4 crc16
mbcache jbd2 btrfs xor raid6_pq hid_generic usbhid hid sg sd_mod
crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas ixgbe
mdio usbcore ptp usb_common pps_core scsi_mod bnx2 [last unloaded:
ipmi_si]
[20581.397148] CPU: 6 PID: 4639 Comm: kworker/u65:6 Tainted: G
 I 4.5.5 #1
[20581.397260] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[20581.397293]  0006 811f4e82 88081851fa80
0009
[20581.397346]  810459cb a016f717 88081e8c4150
88081851fad8
[20581.397400]  88041e3d2800 8806c7022840 81045a23
a01e5f8c
[20581.397453] Call Trace:
[20581.397481]  [] ? dump_stack+0x46/0x59
[20581.397512]  [] ? warn_slowpath_common+0x94/0xa9
[20581.397558]  [] ? btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]
[20581.397589]  [] ? warn_slowpath_fmt+0x43/0x4b
[20581.397632]  [] ? unlock_up+0x89/0x103 [btrfs]
[20581.397678]  [] ? btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]
[20581.397728]  [] ? btrfs_tree_read_unlock+0x5c/0x5e [btrfs]
[20581.397774]  [] ? __btrfs_cow_block+0xda/0x45f [btrfs]
[20581.397819]  [] ? btrfs_cow_block+0xdd/0x144 [btrfs]
[20581.397863]  [] ? btrfs_search_slot+0x285/0x6d7 [btrfs]
[20581.397911]  [] ? btrfs_lookup_csum+0x39/0xcc [btrfs]
[20581.397959]  [] ? btrfs_csum_file_blocks+0x6a/0x4e0 [btrfs]
[20581.398010]  [] ?
add_pending_csums.isra.42+0x42/0x5b [btrfs]
[20581.398075]  [] ?
btrfs_finish_ordered_io+0x331/0x4cb [btrfs]
[20581.398141]  [] ? normal_work_helper+0xe2/0x21f [btrfs]
[20581.398174]  [] ? process_one_work+0x177/0x2a9
[20581.398203]  [] ? worker_thread+0x1e9/0x292
[20581.398232]  [] ? rescuer_thread+0x2a5/0x2a5
[20581.398262]  [] ? kthread+0xa7/0xaf
[20581.398289]  [] ? kthread_parkme+0x16/0x16
[20581.398321]  [] ? ret_from_fork+0x3f/0x70
[20581.398349]  [] ? kthread_parkme+0x16/0x16
[20581.398377] ---[ end trace 9a28cf840837b232 ]---
[20655.152170] use_block_rsv: 773 callbacks suppressed
[20655.152263] WARNING: CPU: 11 PID: 4814 at
fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0xed/0x3e6
[btrfs]()
[20655.152313] BTRFS: block rsv returned -28
[20655.152808] CPU: 11 PID: 4814 Comm: kworker/u66:7 Tainted: G
W I 4.5.5 #1
[20655.152921] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[20655.152953]  0006 811f4e82 8801ad383a80
0009
[20655.153006]  810459cb a016f717 88081e8c4150
8801ad383ad8
[20655.153060]  88041e3d2800 

Re: Hot data tracking / hybrid storage

2016-05-20 Thread Austin S. Hemmelgarn

On 2016-05-19 19:23, Henk Slager wrote:

On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
 wrote:

On 2016-05-19 14:09, Kai Krakow wrote:


On Wed, 18 May 2016 22:44:55 +0000 (UTC), Ferry Toth wrote:


On Tue, 17 May 2016 20:33:35 +0200, Kai Krakow wrote:

Bcache is actually low maintenance, no knobs to turn. Converting to
bcache protective superblocks is a one-time procedure which can be done
online. The bcache devices act as normal HDD if not attached to a
caching SSD. It's really less pain than you may think. And it's a
solution available now. Converting back later is easy: Just detach the
HDDs from the SSDs and use them for some other purpose if you feel so
later. Having the bcache protective superblock still in place doesn't
hurt then. Bcache is a no-op without caching device attached.


No, bcache is _almost_ a no-op without a caching device.  From a userspace
perspective, it does nothing, but it is still another layer of indirection
in the kernel, which does have a small impact on performance.  The same is
true of using LVM with a single volume taking up the entire partition, it
looks almost no different from just using the partition, but it will perform
worse than using the partition directly.  I've actually done profiling of
both to figure out base values for the overhead, and while bcache with no
cache device is not as bad as the LVM example, it can still be a roughly
0.5-2% slowdown (it gets more noticeable the faster your backing storage
is).

You also lose the ability to mount that filesystem directly on a kernel
without bcache support (this may or may not be an issue for you).


The bcache (protective) superblock is in an 8KiB block in front of the
file system device. In case the current, non-bcached HDDs use modern
partitioning, you can do a 5-minute remove or add of bcache, without
moving/copying filesystem data. So in case you have a bcache-formatted
HDD that had just 1 primary partition (512 byte logical sectors), the
partition start is at sector 2048 and the filesystem start is at 2064.
Hard removing bcache (so making sure the module is not
needed/loaded/used the next boot) can be done by changing the
start-sector of the partition from 2048 to 2064. In gdisk one has to
change the alignment to 16 first, otherwise it refuses. And of course,
also first flush+stop+de-register bcache for the HDD.

The other way around is also possible, i.e. changing the start-sector
from 2048 to 2032. So that makes adding bcache to an existing
filesystem a 5 minute action and not a GBs- or TBs copy action. It is
not online of course, but just one reboot is needed (or just umount,
gdisk, partprobe, add bcache etc).
For RAID setups, one could just do 1 HDD first.
My argument about the overhead was not about the superblock, it was 
about the bcache layer itself.  It isn't practical to just access the 
data directly if you plan on adding a cache device, because then you 
couldn't do so online unless you're going through bcache.  This extra 
layer of indirection in the kernel does add overhead, regardless of the 
on-disk format.


Secondly, having a HDD with just one partition is not a typical use 
case, and that argument about the slack space resulting from the 1M 
alignment only holds true if you're using an MBR instead of a GPT layout 
(or, for that matter, almost any other partition table format), and 
you're not booting from that disk (because GRUB embeds itself there). 
It's also fully possible to have an MBR-formatted disk which doesn't 
have any spare space there either (which is how most flash drives get 
formatted).


This also doesn't change the fact that without careful initial 
formatting (it is possible on some filesystems to embed the bcache SB at 
the beginning of the FS itself, since many of them have some reserved 
space at the beginning of the partition for bootloaders, and this space 
doesn't have to exist when mounting the FS) or manual alteration of the 
partition, it's not possible to mount the FS on a system without bcache 
support.


There is also a tool doing the conversion in-place (I haven't used it
myself, my python(s) had trouble; I could do the partition table edit
much faster/easier):
https://github.com/g2p/blocks#bcache-conversion


I actually hadn't known about this tool, thanks for mentioning it.


Re: Hot data tracking / hybrid storage

2016-05-20 Thread Austin S. Hemmelgarn

On 2016-05-19 17:01, Kai Krakow wrote:

On Thu, 19 May 2016 14:51:01 -0400, "Austin S. Hemmelgarn" wrote:


For a point of reference, I've
got a pair of 250GB Crucial MX100's (they cost less than 0.50 USD per
GB when I got them and provide essentially the same power-loss
protections that the high end Intel SSD's do) which have seen more
than 2.5TB of data writes over their lifetime, combined from at least
three different filesystem formats (BTRFS, FAT32, and ext4), swap
space, and LVM management, and the wear-leveling indicator on each
still says they have 100% life remaining, and the similar 500GB one I
just recently upgraded in my laptop had seen over 50TB of writes and
was still saying 95% life remaining (and had been for months).
Correction: I hadn't checked recently; the 250G ones have seen about 
6.336TB of writes (I hadn't checked for multiple months) and report 90% 
remaining life, with about 240 days of power-on time.  This overall 
equates to about 775MB of writes per hour, and assuming similar write 
rates for the remaining life of the SSDs, I can still expect roughly 9 
years of service from these, which means about 10 years of life given my 
usage - well beyond what I typically get from a traditional hard disk 
for the same price, and far exceeding the typical usable life of most 
desktops, laptops, and even some workstation computers.
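
For what it's worth, the 9-year figure roughly checks out as a
back-of-the-envelope estimate, assuming the wear indicator scales
linearly with bytes written:

  implied total endurance:  6.336TB / 0.10             ~= 63TB
  write rate:               775MB/h * 24 * 365         ~= 6.8TB/year
  remaining life:           (63TB - 6.3TB) / 6.8TB/yr  ~= 8-9 years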


And you also have to keep in mind that this 775MB/hour of writes is 
coming from a system that is running:
* BOINC distributed computing applications (regularly downloading big 
files, and almost constantly writing data)

* Dropbox
* Software builds for almost a dozen different systems (I use Gentoo, so 
_everything_ is built locally)

* Regression testing for BTRFS
* Basic network services (DHCP, DNS, and similar things)
* A tor entry node
* A local mail server (store and forward only, I just use it for 
monitoring messages)
And all of that (except the BTRFS regression testing) is running 24/7, 
and that's just the local VMs; it doesn't include the file sharing or 
SAN services.  Root filesystems for all of these VMs are on the SSDs, as 
is the host's root filesystem and swap partition, and many of the data 
partitions.  And I haven't really done any write optimization, and it's 
still less than 1GB/hour of writes to the SSD.  The typical user 
(including many types of server systems) will be writing much less than 
that most of the time.


The smaller Crucials are much worse at that: The MX100 128GB version I
had was specified for 85TB writes which I hit after about 12 months (97%
lifetime used according to smartctl) due to excessive write patterns.
I'm not sure how long it would have lasted but I decided to swap it for
a Samsung 500GB drive, and reconfigure my system for much less write
patterns.

What should I say: I liked the Crucial more, first: It has an easy
lifetime counter in smartctl, Samsung doesn't. And it had powerloss
protection which Samsung doesn't explicitly mention (tho I think it has
it).

At least, according to endurance tests, my Samsung SSD should take
about 1 PB of writes. I've already written 7 TB if I can trust the
smartctl raw value.

But I think you cannot compare specification values to a real endurance
test... I think it says 150TBW for 500GB 850 EVO.

The point was more that wear-out is less of an issue for a lot of people 
than many individuals make it out to be, not me trying to make Crucial 
sound like an amazing brand.  Yes, one of the Crucial MX100's may not 
last as long as a Samsung EVO in a busy mail server or something similar, 
but for a majority of people, they will probably outlast the usefulness 
of the computer.



Re: sharing page cache pages between multiple mappings

2016-05-20 Thread Miklos Szeredi
On Fri, May 20, 2016 at 1:48 AM, Dave Chinner  wrote:
> On Thu, May 19, 2016 at 12:17:14PM +0200, Miklos Szeredi wrote:
>> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko  wrote:
>> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
>> >> Has anyone thought about sharing pages between multiple files?
>> >>
>> >> The obvious application is for COW filesytems where there are
>> >> logically distinct files that physically share data and could easily
>> >> share the cache as well if there was infrastructure for it.
>> >
>> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
>> > session so cannot tell you any details but the LWN article covers it at
>> > least briefly.
>>
>> Cool, so it's not such a crazy idea.
>
> Oh, it most certainly is crazy. :P
>
>> Darrick, would you mind briefly sharing your ideas regarding this?
>
> The current line of thought is that we'll only attempt this in XFS on
> inodes that are known to share underlying physical extents. i.e.
> files that have blocks that have been reflinked or deduped.  That
> way we can overload the breaking of reflink blocks (via copy on
> write) with unsharing the pages in the page cache for that inode.
> i.e. shared pages can propagate upwards in overlay if it uses
> reflink for copy-up and writes will then break the sharing with the
> underlying source without overlay having to do anything special.
>
> Right now I'm not sure what mechanism we will use - we want to
> support files that have a mix of private and shared pages, so that
> implies we are not going to be sharing mappings but sharing pages
> instead.  However, we've been looking at this as being completely
> encapsulated within the filesystem because it's tightly linked to
> changes in the physical layout of the filesystem, not as general
> "share this mapping between two unrelated inodes" infrastructure.
> That may change as we dig deeper into it...
>
>> The use case I have is fixing overlayfs weird behavior. The following
>> may result in "buf" not matching "data":
>>
>> int fr = open("foo", O_RDONLY);
>> int fw = open("foo", O_RDWR);
>> write(fw, data, sizeof(data));
>> read(fr, buf, sizeof(data));
>>
>> The reason is that "foo" is on a read-only layer, and opening it for
>> read-write triggers copy-up into a read-write layer.  However the old,
>> read-only open still refers to the unmodified file.
>>
>> Fixing this properly requires that when opening a file, we don't
>> delegate operations fully to the underlying file, but rather allow
>> sharing of pages from underlying file until the file is copied up.  At
>> that point we switch to sharing pages with the read-write copy.
>
> Unless I'm missing something here (quite possible!), I'm not sure
> we can fix that problem with page cache sharing or reflink. It
> implies we are sharing pages in a downwards direction - private
> overlay pages/mappings from multiple inodes would need to be shared
> with a single underlying shared read-only inode, and I lack the
> imagination to see how that works...

Indeed, reflink doesn't make this work.

We could reflink-up on any open (or on lookup), not just on write; it's
a trivial change in overlayfs.  The drawback is a slower first
open/lookup and the space used by duplicate trees even without
modification on the overlay.  Not sure if that's a problem in
practice.

I'll think about the generic downwards sharing.  For overlayfs it
doesn't need to be per-page, so that might make it somewhat simpler
problem.

Thanks,
Miklos