Re: RAID-1 refuses to balance large drive

2016-03-25 Thread Duncan
Henk Slager posted on Fri, 25 Mar 2016 15:35:52 +0100 as excerpted:

> For the original OP situation, with chunks all filled up with extents
> and devices all filled up with chunks, 'integrating' a new 6TB drive
> into a 4TB+3TB+2TB raid1 array could probably be done in a somewhat
> unusual way in order to avoid immediate balancing needs:

> - 'plug-in' the 6TB
> - btrfs-replace  4TB by 6TB
> - btrfs fi resize max 6TB_devID
> - btrfs-replace  2TB by 4TB
> - btrfs fi resize max 4TB_devID
> - 'unplug' the 2TB

Way to think outside the box, Henk!  I'll have to remember this as it's
a very clever and rather useful method-tool to have in the ol' admin
toolbox (aka brain). =:^)

I only wish I'd thought of it myself, as it seems perfectly obvious...
now that you've described it!

Greatly appreciated, in any case! =:^)
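
For the record, here's roughly what that works out to as actual
commands.  This is a sketch with hypothetical device paths and devids,
not a tested recipe; check 'btrfs filesystem show' for the real devids
first, and note the old 4TB will still carry a btrfs signature when
it's reused as the replacement target, so 'btrfs replace start -f' (or
a prior 'wipefs -a') may be needed for the second step:

# btrfs replace start 1 /dev/new6tb /mnt    (1 = devid of the 4TB)
# btrfs filesystem resize 1:max /mnt
# btrfs replace start 3 /dev/old4tb /mnt    (3 = devid of the 2TB)
# btrfs filesystem resize 3:max /mnt

After the second replace finishes, the 2TB is no longer part of the
filesystem and can be unplugged.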

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Possible Raid Bug

2016-03-25 Thread Duncan
Chris Murphy posted on Fri, 25 Mar 2016 15:34:11 -0600 as excerpted:

> Basically you get one chance to mount rw,degraded and you have to fix
> the problem at that time. And you have to balance away any phantom
> single chunks that have appeared. For what it's worth it's not the
> reboot that degraded it further, it's the unmount and then attempt to
> mount rw,degraded a 2nd time that's not allowed due to this bug.

As CMurphy says here (but without mentioning the patch), as Alexander F 
says in a sibling to CMurphy's reply, and as I said in my longer 
explanation further upthread, this is a known bug.  There's a patch in 
the pipeline that really should have made it into 4.5 but didn't, as it 
was part of a larger patch set that apparently wasn't considered ready, 
and unfortunately the fix wasn't cherry-picked.

So right now, yes, known bug.  You get one chance at a degraded-writable 
mount to rebuild the array.  If you crash after writing but before the 
rebuild is complete, too bad, so sad: now you can only mount degraded-
readonly, and your only possibility of saving the data (other than 
rebuilding your kernel with the appropriate patch) is to do just that, 
mount degraded-readonly, and copy the data off to elsewhere.

But there's a patch that has been demonstrated to fix the bug, not only 
in tests, but in live deployments where people found themselves with a 
degraded-readonly mount until they rebuilt with the patch.  Hopefully that 
patch will hit the 4.6 development kernel with a CC to stable, and be 
backported as necessary there, but I'm not sure it will be in 4.6 at this 
point, tho it should hit mainline /eventually/.  Meanwhile, the patch can 
still be applied manually if necessary, and I suppose some distros may 
already be applying it to their shipped versions as it's certainly a fix 
worth having.

I'll simply refer you to previous discussion on the list for the patch, 
as that's where I'd have to look for it if I needed it myself before it 
gets mainlined.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Possible Raid Bug

2016-03-25 Thread Anand Jain



On 03/26/2016 04:09 AM, Alexander Fougner wrote:
> 2016-03-25 20:57 GMT+01:00 Patrik Lundquist :
>> On 25 March 2016 at 18:20, Stephen Williams  wrote:
>>>
>>> Your information below was very helpful and I was able to recreate the
>>> Raid array. However my initial question still stands - what if the
>>> drive dies completely? I work in a data center and we see this quite a
>>> lot where a drive is beyond dead - the OS will literally not detect it.
>>
>> That's currently a weakness of Btrfs. I don't know how people deal
>> with it in production. I think Anand Jain is working on improving it.

 We need this issue to be fixed for real production usage.

 The hot spare patch set contains the fix for this. Currently I am
 fixing an issue (#5) which Yauhen reported that's related to the
 auto replace. A refreshed v2 will be out soon.

Thanks, Anand


[remainder of quoted message trimmed; Patrik's full test session appears
unabridged in his original message further down]

[PATCH] btrfs: Cleanup compress_file_range()

2016-03-25 Thread Ashish Samant
Remove unnecessary checks in compress_file_range().

Signed-off-by: Ashish Samant 
---
 fs/btrfs/inode.c | 79 +++-
 1 file changed, 38 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..a528ce7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -581,13 +581,30 @@ cont:
 * win, compare the page count read with the blocks on disk
 */
total_in = ALIGN(total_in, PAGE_CACHE_SIZE);
-   if (total_compressed >= total_in) {
+   if (total_compressed >= total_in)
will_compress = 0;
-   } else {
+   else {
num_bytes = total_in;
+   *num_added += 1;
+
+   /* the async work queues will take care of doing actual
+* allocation on disk for these compressed pages,
+* and will submit them to the elevator.
+*/
+   add_async_extent(async_cow, start, num_bytes,
+   total_compressed, pages, nr_pages_ret,
+   compress_type);
+
+   if (start + num_bytes < end) {
+   start += num_bytes;
+   pages = NULL;
+   cond_resched();
+   goto again;
+   }
+   return;
}
}
-   if (!will_compress && pages) {
+   if (pages) {
/*
 * the compression code ran but failed to make things smaller,
 * free any pages it allocated and our page pointer array
@@ -603,47 +620,27 @@ cont:
 
/* flag the file so we don't compress in the future */
if (!btrfs_test_opt(root, FORCE_COMPRESS) &&
-   !(BTRFS_I(inode)->force_compress)) {
+   !(BTRFS_I(inode)->force_compress))
BTRFS_I(inode)->flags |= BTRFS_INODE_NOCOMPRESS;
-   }
}
-   if (will_compress) {
-   *num_added += 1;
-
-   /* the async work queues will take care of doing actual
-* allocation on disk for these compressed pages,
-* and will submit them to the elevator.
-*/
-   add_async_extent(async_cow, start, num_bytes,
-total_compressed, pages, nr_pages_ret,
-compress_type);
-
-   if (start + num_bytes < end) {
-   start += num_bytes;
-   pages = NULL;
-   cond_resched();
-   goto again;
-   }
-   } else {
 cleanup_and_bail_uncompressed:
-   /*
-* No compression, but we still need to write the pages in
-* the file we've been given so far.  redirty the locked
-* page if it corresponds to our extent and set things up
-* for the async work queue to run cow_file_range to do
-* the normal delalloc dance
-*/
-   if (page_offset(locked_page) >= start &&
-   page_offset(locked_page) <= end) {
-   __set_page_dirty_nobuffers(locked_page);
-   /* unlocked later on in the async handlers */
-   }
-   if (redirty)
-   extent_range_redirty_for_io(inode, start, end);
-   add_async_extent(async_cow, start, end - start + 1,
-0, NULL, 0, BTRFS_COMPRESS_NONE);
-   *num_added += 1;
-   }
+   /*
+* No compression, but we still need to write the pages in
+* the file we've been given so far.  redirty the locked
+* page if it corresponds to our extent and set things up
+* for the async work queue to run cow_file_range to do
+* the normal delalloc dance
+*/
+   if (page_offset(locked_page) >= start &&
+   page_offset(locked_page) <= end)
+   __set_page_dirty_nobuffers(locked_page);
+   /* unlocked later on in the async handlers */
+
+   if (redirty)
+   extent_range_redirty_for_io(inode, start, end);
+   add_async_extent(async_cow, start, end - start + 1, 0, NULL, 0,
+BTRFS_COMPRESS_NONE);
+   *num_added += 1;
 
return;
 
-- 
1.9.1



Re: RAID Assembly with Missing Empty Drive

2016-03-25 Thread Chris Murphy
[let me try keeping the list cc'd]

On Fri, Mar 25, 2016 at 7:21 PM, John Marrett  wrote:
> Chris,
>
>> Quite honestly I don't understand how a Btrfs raid1 volume with two
>> missing devices even permits you to mount it degraded,rw in the first
>> place.
>
> I think you missed my previous post. It's simple: I patched the kernel
> to bypass the check for missing devices on rw mounts. I did this
> because one of my missing devices has no data on it, which is
> confirmed by the mount output, as you can see here:
>

Yeah too many emails today, and I'm skimming too much.



>
> ubuntu@btrfs-recovery:~$ sudo btrfs filesystem show
> Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 7 FS bytes used 5.47TiB
> devid1 size 1.81TiB used 1.71TiB path /dev/sde
> devid2 size 1.81TiB used 1.71TiB path /dev/sda
> devid3 size 1.82TiB used 1.72TiB path /dev/sdc
> devid4 size 1.82TiB used 1.72TiB path /dev/sdd
> devid5 size 2.73TiB used 2.62TiB path /dev/sdf
> devid6 size 2.73TiB used 2.62TiB path
> devid7 size 2.73TiB used 0.00 path
>
>> Anyway, maybe it's possible there are no dual-missing metadata chunks,
>> although I find it hard to believe.
>
> Considering the above, do you still think I may have missing metadata?

Post 'btrfs fi usage' for the filesystem. That may give some insight
into what's expected to be on the missing drives.

>
>> Because there are two devices missing, I doubt this matters, but I
>> think you're better off using 'btrfs replace' for this rather than
>> 'device add' followed by 'device remove'. The two catches with
>
> I'll try btrfs replace for the second device (with data) after
> removing the first.
>
> Do you think my chances are better moving data off the array in read only 
> mode?

My expectation is that whether you're copying everything off or using
replace, if either process hits a point where no metadata copies are
found, it's going to stop whatever it's doing. The only question is how
that manifests.


-- 
Chris Murphy


Re: RAID Assembly with Missing Empty Drive

2016-03-25 Thread John Marrett
Chris,

> Quite honestly I don't understand how a Btrfs raid1 volume with two
> missing devices even permits you to mount it degraded,rw in the first
> place.

I think you missed my previous post. It's simple: I patched the kernel
to bypass the check for missing devices on rw mounts. I did this
because one of my missing devices has no data on it, which is
confirmed by the mount output, as you can see here:

ubuntu@btrfs-recovery:~$ sudo btrfs filesystem show
Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 7 FS bytes used 5.47TiB
devid1 size 1.81TiB used 1.71TiB path /dev/sde
devid2 size 1.81TiB used 1.71TiB path /dev/sda
devid3 size 1.82TiB used 1.72TiB path /dev/sdc
devid4 size 1.82TiB used 1.72TiB path /dev/sdd
devid5 size 2.73TiB used 2.62TiB path /dev/sdf
devid6 size 2.73TiB used 2.62TiB path
devid7 size 2.73TiB used 0.00 path

> Anyway, maybe it's possible there are no dual-missing metadata chunks,
> although I find it hard to believe.

Considering the above, do you still think I may have missing metadata?

> Because there are two devices missing, I doubt this matters, but I
> think you're better off using 'btrfs replace' for this rather than
> 'device add' followed by 'device remove'. The two catches with

I'll try btrfs replace for the second device (with data) after
removing the first.

Do you think my chances are better moving data off the array in read only mode?

-JohnF


Re: RAID Assembly with Missing Empty Drive

2016-03-25 Thread Chris Murphy
On Fri, Mar 25, 2016 at 4:31 PM, John Marrett  wrote:
> Continuing with my recovery efforts I've built overlay mounts of each
> of the block devices supporting my btrfs filesystem as well as the new
> disk I'm trying to introduce. I have patched the kernel to disable the
> check for multiple missing devices. I then exported the overlayed
> devices using iSCSI to a second system to attempt the recovery.
>
> I am able to mount the device rw, then I can remove missing devices
> which removes the missing empty disk. I can add in a new device to the
> filesystem and then attempt to remove the second missing disk (which
> has 2.7 TB of content on it).
>
> Unfortunately this removal fails as follows:
>
> ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt
> ERROR: error removing the device 'missing' - Input/output error


Quite honestly I don't understand how a Btrfs raid1 volume with two
missing devices even permits you to mount it degraded,rw in the first
place. That's rather mystifying considering the other thread, where a
4-disk raid10 with one missing device is allowed an rw,degraded mount
only once; after that, further attempts to mount it rw,degraded are
disallowed.

Anyway, maybe it's possible there are no dual-missing metadata chunks,
although I find it hard to believe. But OK, maybe it works for a while
and you can copy some stuff off the drives where there's at least one
data copy. If there are dual-missing data copies but still at least one
metadata copy, the file system will just spit out noisy error messages.
But if some metadata ends up dual-missing, I expect a crash, or the
file system going read-only, or maybe even an unmount; I'm not sure.
Once there are zero copies of some metadata, I don't see how the file
system can correct for that.

Because there are two devices missing, I doubt this matters, but I
think you're better off using 'btrfs replace' for this rather than
'device add' followed by 'device remove'. The two catches with
replace: the replacement device must be at least as big as the one
being replaced, and you have to do a resize on the replacement device,
using 'fi resize devid:max', to use all the space if the new one is
bigger than the old one. But I suspect either the first or the second
replacement will also fail; it's too many missing devices.
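
For reference, replacing a missing device is addressed by devid, so
with devid 6 missing here it would look roughly like this (the target
path is hypothetical; take the real devid from 'btrfs filesystem
show'):

# btrfs replace start 6 /dev/sdg /mnt
# btrfs filesystem resize 6:max /mnt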

So what can happen, if there are zero copies of some metadata, is that
you might not get everything off the drives before you hit the
zero-copies problem and the ensuing face plant. In that case you might
have to depend on btrfs restore, and it could be really tedious to find
out what can be scraped. But I still think you're better off here than
with any other file system, because they wouldn't even mount with two
mirrors lost.


-- 
Chris Murphy


Re: RAID Assembly with Missing Empty Drive

2016-03-25 Thread John Marrett
Continuing with my recovery efforts, I've built overlay mounts of each
of the block devices backing my btrfs filesystem, as well as of the new
disk I'm trying to introduce. I have patched the kernel to disable the
check for multiple missing devices. I then exported the overlaid
devices over iSCSI to a second system to attempt the recovery.

I am able to mount the device rw, then I can remove missing devices
which removes the missing empty disk. I can add in a new device to the
filesystem and then attempt to remove the second missing disk (which
has 2.7 TB of content on it).

Unfortunately this removal fails as follows:

ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt
ERROR: error removing the device 'missing' - Input/output error

The kernel shows:

[ 2772.000680] BTRFS warning (device sdd): csum failed ino 257 off
695730176 csum 2566472073 expected csum 2706136415
[ 2772.000724] BTRFS warning (device sdd): csum failed ino 257 off
695734272 csum 2566472073 expected csum 2558511802
[ 2772.000736] BTRFS warning (device sdd): csum failed ino 257 off
695746560 csum 2566472073 expected csum 3360772439
[ 2772.000742] BTRFS warning (device sdd): csum failed ino 257 off
695750656 csum 2566472073 expected csum 1205516886
[...]

Can anyone offer any advice as to how I should proceed from here?

One safe option is recreating the array. Now that I have discovered I
can mount the filesystem in degraded,ro mode, I could purchase another
new disk; that would give me enough free disk space to copy all the
data off this array onto a new non-redundant array. I could then add
all the old drives to the new array and convert it back to RAID1.
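
Spelled out, that plan would go roughly like this, with hypothetical
device names and the wiping of the old drives glossed over:

# mount -o degraded,ro /dev/sda /mnt
# mkfs.btrfs /dev/sdnew
# mount /dev/sdnew /mnt2
# rsync -aHAX /mnt/ /mnt2/
  (wipe the old drives, then grow the new array and convert)
# btrfs device add /dev/sdX /dev/sdY /mnt2
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2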

Here's a full breakdown of the commands I ran in the process described
above; my patch only allows a remount with a missing device, so it's
not very significant:

ubuntu@btrfs-recovery:~$ sudo mount -o degraded,ro /dev/sda /mnt
ubuntu@btrfs-recovery:~$ sudo mount -o remount,rw /mnt

Here we see the two missing devices:

ubuntu@btrfs-recovery:~$ sudo btrfs filesystem show
Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 7 FS bytes used 5.47TiB
devid1 size 1.81TiB used 1.71TiB path /dev/sde
devid2 size 1.81TiB used 1.71TiB path /dev/sda
devid3 size 1.82TiB used 1.72TiB path /dev/sdc
devid4 size 1.82TiB used 1.72TiB path /dev/sdd
devid5 size 2.73TiB used 2.62TiB path /dev/sdf
devid6 size 2.73TiB used 2.62TiB path
devid7 size 2.73TiB used 0.00 path

I remove the first missing device:

ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt

The unused missing device is removed:

ubuntu@btrfs-recovery:~$ sudo btrfs filesystem show
Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 6 FS bytes used 5.47TiB
devid1 size 1.81TiB used 1.71TiB path /dev/sde
devid2 size 1.81TiB used 1.71TiB path /dev/sda
devid3 size 1.82TiB used 1.72TiB path /dev/sdc
devid4 size 1.82TiB used 1.72TiB path /dev/sdd
devid5 size 2.73TiB used 2.62TiB path /dev/sdf
devid6 size 2.73TiB used 2.62TiB path

I add a new device:

ubuntu@btrfs-recovery:~$ sudo btrfs device add /dev/sdb /mnt
ubuntu@btrfs-recovery:~$ sudo btrfs filesystem show
Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 7 FS bytes used 5.47TiB
devid1 size 1.81TiB used 1.71TiB path /dev/sde
devid2 size 1.81TiB used 1.71TiB path /dev/sda
devid3 size 1.82TiB used 1.72TiB path /dev/sdc
devid4 size 1.82TiB used 1.72TiB path /dev/sdd
devid5 size 2.73TiB used 2.62TiB path /dev/sdf
devid6 size 2.73TiB used 2.62TiB path
devid7 size 2.73TiB used 0.00 path /dev/sdb

Here are some more details on the techniques needed to get to this
point, in the hope that others can benefit from them. I will also
update the apparently broken parallels scripts on the mdadm wiki.

To create overlay mounts, use the following script; it creates an
overlay for each device in the device list, backed by a sparse 512 MB
file at /home/ubuntu/$device-overlay (the size passed to truncate).

# Run as root. Creates a dm-snapshot overlay per device so that all
# writes land in the sparse overlay file, never on the original disk.
for device in sda3 sdb3 sdc1 sdd1 sde1 sdf1; do
  dev=/dev/$device
  ovl=/home/ubuntu/$device-overlay
  truncate -s512M "$ovl"                # sparse copy-on-write backing file
  newdevname=$device
  size=$(blockdev --getsize "$dev")     # device size in 512-byte sectors
  loop=$(losetup -f --show -- "$ovl")
  echo "Setting up loop for $dev using overlay $ovl on loop $loop for target $newdevname"
  # dm snapshot target: <start> <length> snapshot <origin> <COW device> P <chunksize>
  printf '%s\n' "0 $size snapshot $dev $loop P 8" | dmsetup create "$newdevname"
done
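
To tear the overlays down afterwards, something like this should do
(a sketch assuming the same device list; note that 'losetup -D'
detaches all loop devices, so use 'losetup -d <loop>' selectively if
other loops are in use):

for device in sda3 sdb3 sdc1 sdd1 sde1 sdf1; do
  dmsetup remove "$device"
done
losetup -D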

I used iscsitarget to export the block devices from the server; the
configuration files are as follows (on Ubuntu):

Install

sudo apt install iscsitarget

Enable

/etc/default/iscsitarget
ISCSITARGET_ENABLE=true

Exports

/etc/iet/ietd.conf

Target iqn.2001-04.com.example:storage.lun1
IncomingUser
OutgoingUser
Lun 0 Path=/dev/mapper/sda3,Type=fileio
Alias LUN1

Target iqn.2001-04.com.e

Re: Possible Raid Bug

2016-03-25 Thread Chris Murphy
On Fri, Mar 25, 2016 at 1:57 PM, Patrik Lundquist  wrote:

>
> Only errors on the device formerly known as /dev/sde, so why won't it
> mount degraded,rw? Now I'm stuck like Stephen.
>
> # btrfs device usage /mnt
> /dev/sdb, ID: 1
>Device size: 2.00GiB
>Data,single:   624.00MiB   <<--
>Data,RAID10:   102.38MiB
>Metadata,RAID10:   102.38MiB
>System,RAID10:   4.00MiB
>Unallocated: 1.19GiB
>
> /dev/sdc, ID: 2
>Device size: 2.00GiB
>Data,RAID10:   102.38MiB
>Metadata,RAID10:   102.38MiB
>System,single:  32.00MiB   <<--
>System,RAID10:   4.00MiB
>Unallocated: 1.76GiB
>
> /dev/sdd, ID: 3
>Device size: 2.00GiB
>Data,RAID10:   102.38MiB
>Metadata,single:   256.00MiB   <<--
>Metadata,RAID10:   102.38MiB
>System,RAID10:   4.00MiB
>Unallocated: 1.55GiB
>
> missing, ID: 4
>Device size:   0.00B
>Data,RAID10:   102.38MiB
>Metadata,RAID10:   102.38MiB
>System,RAID10:   4.00MiB
>Unallocated: 1.80GiB
>
> The data written while mounted degraded is in profile 'single' and
> will have to be converted to 'raid10' once the filesystem is whole
> again.
>
> So what do I do now? Why did it degrade further after a reboot?

You're hosed. The file system is read-only and can't be fixed. It's an
old bug. It's not a data loss bug, but it is a major time-loss bug,
because now the volume has to be rebuilt; that makes it totally
unworkable for production use.

While the appearance of the single chunks is one bug that shouldn't
happen, the worse bug is the truly bogus one claiming there aren't
enough drives for an rw degraded mount. Those single chunks aren't on
the missing drive; they're on the three remaining ones. So the rw
failure is just a bad bug: a PITA, but at least not data loss.

Basically you get one chance to mount rw,degraded and you have to fix
the problem at that time. And you have to balance away any phantom
single chunks that have appeared. For what it's worth it's not the
reboot that degraded it further, it's the unmount and then attempt to
mount rw,degraded a 2nd time that's not allowed due to this bug.
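
Concretely, the one-shot fix would look something like this (a sketch
only; /dev/sdnew is hypothetical, devid 4 is the missing disk in this
thread, and the 'soft' filter converts only the chunks that don't
already have the target profile, i.e. the phantom single chunks):

# mount -o degraded /dev/sdb /mnt
# btrfs replace start 4 /dev/sdnew /mnt
# btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft /mnt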


-- 
Chris Murphy


Re: Possible Raid Bug

2016-03-25 Thread Alexander Fougner
2016-03-25 20:57 GMT+01:00 Patrik Lundquist :
> On 25 March 2016 at 18:20, Stephen Williams  wrote:
>>
>> Your information below was very helpful and I was able to recreate the
>> Raid array. However my initial question still stands - what if the
>> drive dies completely? I work in a data center and we see this quite a
>> lot where a drive is beyond dead - the OS will literally not detect it.
>
> That's currently a weakness of Btrfs. I don't know how people deal
> with it in production. I think Anand Jain is working on improving it.
>
>> At this point would the Raid10 array be beyond repair? As you need the
>> drive present in order to mount the array in degraded mode.
>
> Right... let's try it again but a little bit differently.
>
> # mount /dev/sdb /mnt
>
> Let's drop the disk.
>
> # echo 1 >/sys/block/sde/device/delete
>
> [ 3669.024256] sd 5:0:0:0: [sde] Synchronizing SCSI cache
> [ 3669.024934] sd 5:0:0:0: [sde] Stopping disk
> [ 3669.037028] ata6.00: disabled
>
> # touch /mnt/test3
> # sync
>
> [ 3845.960839] BTRFS error (device sdb): bdev /dev/sde errs: wr 1, rd
> 0, flush 0, corrupt 0, gen 0
> [ 3845.961525] BTRFS error (device sdb): bdev /dev/sde errs: wr 2, rd
> 0, flush 0, corrupt 0, gen 0
> [ 3845.962738] BTRFS error (device sdb): bdev /dev/sde errs: wr 3, rd
> 0, flush 0, corrupt 0, gen 0
> [ 3845.963038] BTRFS error (device sdb): bdev /dev/sde errs: wr 4, rd
> 0, flush 0, corrupt 0, gen 0
> [ 3845.963422] BTRFS error (device sdb): bdev /dev/sde errs: wr 4, rd
> 0, flush 1, corrupt 0, gen 0
> [ 3845.963686] BTRFS warning (device sdb): lost page write due to IO
> error on /dev/sde
> [ 3845.963691] BTRFS error (device sdb): bdev /dev/sde errs: wr 5, rd
> 0, flush 1, corrupt 0, gen 0
> [ 3845.963932] BTRFS warning (device sdb): lost page write due to IO
> error on /dev/sde
> [ 3845.963941] BTRFS error (device sdb): bdev /dev/sde errs: wr 6, rd
> 0, flush 1, corrupt 0, gen 0
>
> # umount /mnt
>
> [ 4095.276831] BTRFS error (device sdb): bdev /dev/sde errs: wr 7, rd
> 0, flush 1, corrupt 0, gen 0
> [ 4095.278368] BTRFS error (device sdb): bdev /dev/sde errs: wr 8, rd
> 0, flush 1, corrupt 0, gen 0
> [ 4095.279152] BTRFS error (device sdb): bdev /dev/sde errs: wr 8, rd
> 0, flush 2, corrupt 0, gen 0
> [ 4095.279373] BTRFS warning (device sdb): lost page write due to IO
> error on /dev/sde
> [ 4095.279377] BTRFS error (device sdb): bdev /dev/sde errs: wr 9, rd
> 0, flush 2, corrupt 0, gen 0
> [ 4095.279609] BTRFS warning (device sdb): lost page write due to IO
> error on /dev/sde
> [ 4095.279612] BTRFS error (device sdb): bdev /dev/sde errs: wr 10, rd
> 0, flush 2, corrupt 0, gen 0
>
> # mount -o degraded /dev/sdb /mnt
>
> [ 4608.113751] BTRFS info (device sdb): allowing degraded mounts
> [ 4608.113756] BTRFS info (device sdb): disk space caching is enabled
> [ 4608.113757] BTRFS: has skinny extents
> [ 4608.116557] BTRFS info (device sdb): bdev /dev/sde errs: wr 6, rd
> 0, flush 1, corrupt 0, gen 0
>
> # touch /mnt/test4
> # sync
>
> Writing to the filesystem works while the device is missing.
> No new errors in dmesg after re-mounting degraded. Reboot to get back 
> /dev/sde.
>
> [4.329852] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
> devid 4 transid 26 /dev/sde
> [4.330157] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
> devid 3 transid 31 /dev/sdd
> [4.330511] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
> devid 2 transid 31 /dev/sdc
> [4.330865] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
> devid 1 transid 31 /dev/sdb
>
> /dev/sde transid is lagging behind, of course.
>
> # wipefs -a /dev/sde
> # btrfs device scan
>
> # mount -o degraded /dev/sdb /mnt
>
> [  507.248621] BTRFS info (device sdb): allowing degraded mounts
> [  507.248626] BTRFS info (device sdb): disk space caching is enabled
> [  507.248628] BTRFS: has skinny extents
> [  507.252815] BTRFS info (device sdb): bdev /dev/sde errs: wr 6, rd
> 0, flush 1, corrupt 0, gen 0
> [  507.252919] BTRFS: missing devices(1) exceeds the limit(0),

The single/dup profiles have zero tolerance for missing devices; only
an ro mount is allowed in that case.

> writeable mount is not allowed
> [  507.278277] BTRFS: open_ctree failed
>
> Well, that was unexpected! Reboot again.
>
> # mount -o degraded /dev/sdb /mnt
>
> [   94.368514] BTRFS info (device sdd): allowing degraded mounts
> [   94.368519] BTRFS info (device sdd): disk space caching is enabled
> [   94.368521] BTRFS: has skinny extents
> [   94.370909] BTRFS warning (device sdd): devid 4 uuid
> 8549a275-f663-4741-b410-79b49a1d465f is missing
> [   94.372170] BTRFS info (device sdd): bdev (null) errs: wr 6, rd 0,
> flush 1, corrupt 0, gen 0
> [   94.372284] BTRFS: missing devices(1) exceeds the limit(0),
> writeable mount is not allowed
> [   94.395021] BTRFS: open_ctree failed
>
> No go.
>
> # mount -o degraded,ro /dev/sdb /mnt
> # btrfs device stats /mnt
> [/dev/sdb].write_io_errs   0
> [/dev/sdb].read_io_errs    0
> [/dev/sdb].flush_io_errs   0

Re: Possible Raid Bug

2016-03-25 Thread Patrik Lundquist
On 25 March 2016 at 18:20, Stephen Williams  wrote:
>
> Your information below was very helpful and I was able to recreate the
> Raid array. However my initial question still stands - what if the
> drive dies completely? I work in a data center and we see this quite a
> lot where a drive is beyond dead - the OS will literally not detect it.

That's currently a weakness of Btrfs. I don't know how people deal
with it in production. I think Anand Jain is working on improving it.

> At this point would the Raid10 array be beyond repair? As you need the
> drive present in order to mount the array in degraded mode.

Right... let's try it again but a little bit differently.

# mount /dev/sdb /mnt

Let's drop the disk.

# echo 1 >/sys/block/sde/device/delete

[ 3669.024256] sd 5:0:0:0: [sde] Synchronizing SCSI cache
[ 3669.024934] sd 5:0:0:0: [sde] Stopping disk
[ 3669.037028] ata6.00: disabled

# touch /mnt/test3
# sync

[ 3845.960839] BTRFS error (device sdb): bdev /dev/sde errs: wr 1, rd
0, flush 0, corrupt 0, gen 0
[ 3845.961525] BTRFS error (device sdb): bdev /dev/sde errs: wr 2, rd
0, flush 0, corrupt 0, gen 0
[ 3845.962738] BTRFS error (device sdb): bdev /dev/sde errs: wr 3, rd
0, flush 0, corrupt 0, gen 0
[ 3845.963038] BTRFS error (device sdb): bdev /dev/sde errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 3845.963422] BTRFS error (device sdb): bdev /dev/sde errs: wr 4, rd
0, flush 1, corrupt 0, gen 0
[ 3845.963686] BTRFS warning (device sdb): lost page write due to IO
error on /dev/sde
[ 3845.963691] BTRFS error (device sdb): bdev /dev/sde errs: wr 5, rd
0, flush 1, corrupt 0, gen 0
[ 3845.963932] BTRFS warning (device sdb): lost page write due to IO
error on /dev/sde
[ 3845.963941] BTRFS error (device sdb): bdev /dev/sde errs: wr 6, rd
0, flush 1, corrupt 0, gen 0

# umount /mnt

[ 4095.276831] BTRFS error (device sdb): bdev /dev/sde errs: wr 7, rd
0, flush 1, corrupt 0, gen 0
[ 4095.278368] BTRFS error (device sdb): bdev /dev/sde errs: wr 8, rd
0, flush 1, corrupt 0, gen 0
[ 4095.279152] BTRFS error (device sdb): bdev /dev/sde errs: wr 8, rd
0, flush 2, corrupt 0, gen 0
[ 4095.279373] BTRFS warning (device sdb): lost page write due to IO
error on /dev/sde
[ 4095.279377] BTRFS error (device sdb): bdev /dev/sde errs: wr 9, rd
0, flush 2, corrupt 0, gen 0
[ 4095.279609] BTRFS warning (device sdb): lost page write due to IO
error on /dev/sde
[ 4095.279612] BTRFS error (device sdb): bdev /dev/sde errs: wr 10, rd
0, flush 2, corrupt 0, gen 0

# mount -o degraded /dev/sdb /mnt

[ 4608.113751] BTRFS info (device sdb): allowing degraded mounts
[ 4608.113756] BTRFS info (device sdb): disk space caching is enabled
[ 4608.113757] BTRFS: has skinny extents
[ 4608.116557] BTRFS info (device sdb): bdev /dev/sde errs: wr 6, rd
0, flush 1, corrupt 0, gen 0

# touch /mnt/test4
# sync

Writing to the filesystem works while the device is missing.
No new errors in dmesg after re-mounting degraded. Reboot to get back /dev/sde.

[4.329852] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
devid 4 transid 26 /dev/sde
[4.330157] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
devid 3 transid 31 /dev/sdd
[4.330511] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
devid 2 transid 31 /dev/sdc
[4.330865] BTRFS: device fsid 75737bea-d76c-42f5-b0e6-7d346e38610d
devid 1 transid 31 /dev/sdb

/dev/sde transid is lagging behind, of course.

# wipefs -a /dev/sde
# btrfs device scan

# mount -o degraded /dev/sdb /mnt

[  507.248621] BTRFS info (device sdb): allowing degraded mounts
[  507.248626] BTRFS info (device sdb): disk space caching is enabled
[  507.248628] BTRFS: has skinny extents
[  507.252815] BTRFS info (device sdb): bdev /dev/sde errs: wr 6, rd
0, flush 1, corrupt 0, gen 0
[  507.252919] BTRFS: missing devices(1) exceeds the limit(0),
writeable mount is not allowed
[  507.278277] BTRFS: open_ctree failed

Well, that was unexpected! Reboot again.

# mount -o degraded /dev/sdb /mnt

[   94.368514] BTRFS info (device sdd): allowing degraded mounts
[   94.368519] BTRFS info (device sdd): disk space caching is enabled
[   94.368521] BTRFS: has skinny extents
[   94.370909] BTRFS warning (device sdd): devid 4 uuid
8549a275-f663-4741-b410-79b49a1d465f is missing
[   94.372170] BTRFS info (device sdd): bdev (null) errs: wr 6, rd 0,
flush 1, corrupt 0, gen 0
[   94.372284] BTRFS: missing devices(1) exceeds the limit(0),
writeable mount is not allowed
[   94.395021] BTRFS: open_ctree failed

No go.

# mount -o degraded,ro /dev/sdb /mnt
# btrfs device stats /mnt
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdc].write_io_errs   0
[/dev/sdc].read_io_errs    0
[/dev/sdc].flush_io_errs   0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[(null)].write_io_errs   6
[(null)].read_io_errs    0
[(null)].flush_io_errs   1
[(null)].corruption_errs 0
[(null)].generation_errs 0

Re: [PATCH] Btrfs: fix crash/invalid memory access on fsync when using overlayfs

2016-03-25 Thread Filipe Manana
On Fri, Mar 25, 2016 at 6:49 PM, Chris Mason  wrote:
> On Mon, Mar 21, 2016 at 05:52:44PM +, Filipe Manana wrote:
>> On Mon, Mar 21, 2016 at 5:51 PM, Chris Mason  wrote:
>> > On Mon, Mar 21, 2016 at 05:38:44PM +, fdman...@kernel.org wrote:
>> >> From: Filipe Manana 
>> >>
>> >> If the lower or upper directory of an overlayfs mount belong to a btrfs
>> >> file system and we fsync the file through the overlayfs' merged directory
>> >> we ended up accessing an inode that didn't belong to btrfs as if it were
>> >> a btrfs inode at btrfs_sync_file() resulting in a crash like the 
>> >> following:
>> >
>> > Thanks Filipe, I'll put this in a second pull this week, even if its the
>> > only patch in it ;)
>>
>> Well we need to wait for Miklos patch to be merged first.
>
> I'll cut this one on top of rc1 in a dedicated branch instead.  That way
> people can still use the rest of our btrfs pull against v4.5

At the moment, the patch it depends on is not yet in Linus' tree,
but it is in Miklos' VFS tree:
https://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/log/?h=overlayfs-next

>
> -chris


Re: [PATCH] Btrfs: fix crash/invalid memory access on fsync when using overlayfs

2016-03-25 Thread Chris Mason
On Mon, Mar 21, 2016 at 05:52:44PM +, Filipe Manana wrote:
> On Mon, Mar 21, 2016 at 5:51 PM, Chris Mason  wrote:
> > On Mon, Mar 21, 2016 at 05:38:44PM +, fdman...@kernel.org wrote:
> >> From: Filipe Manana 
> >>
> >> If the lower or upper directory of an overlayfs mount belong to a btrfs
> >> file system and we fsync the file through the overlayfs' merged directory
> >> we ended up accessing an inode that didn't belong to btrfs as if it were
> >> a btrfs inode at btrfs_sync_file() resulting in a crash like the following:
> >
> > Thanks Filipe, I'll put this in a second pull this week, even if its the
> > only patch in it ;)
> 
> Well we need to wait for Miklos patch to be merged first.

I'll cut this one on top of rc1 in a dedicated branch instead.  That way
people can still use the rest of our btrfs pull against v4.5

-chris


Re: [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents

2016-03-25 Thread Liu Bo
On Fri, Mar 25, 2016 at 01:25:49PM -0400, Josef Bacik wrote:
> There are a few races in the metadata reservation stuff.  First we add
> the bytes to the block_rsv well after we've set the bit on the inode
> saying that we have space for it and after we've reserved the bytes.
> So use the normal btrfs_block_rsv_add helper for this case.  Secondly
> we can flush delalloc extents when we try to reserve space for our
> write, which means that we could have used up the space for the inode
> and we wouldn't know because we only check before the reservation.  So
> instead make sure we are always reserving space for the inode update,
> and then if we don't need it, release those bytes afterward.
> Thanks,

Looks fine.

Reviewed-by: Liu Bo 

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 35 +--
>  1 file changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 06f4e7b..157a0b6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5653,12 +5653,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>   u64 to_reserve = 0;
>   u64 csum_bytes;
>   unsigned nr_extents = 0;
> - int extra_reserve = 0;
>   enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>   int ret = 0;
>   bool delalloc_lock = true;
>   u64 to_free = 0;
>   unsigned dropped;
> + bool release_extra = false;
>  
>   /* If we are a free space inode we need to not flush since we will be in
>* the middle of a transaction commit.  We also don't need the delalloc
> @@ -5684,24 +5684,15 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>BTRFS_MAX_EXTENT_SIZE - 1,
>BTRFS_MAX_EXTENT_SIZE);
>   BTRFS_I(inode)->outstanding_extents += nr_extents;
> - nr_extents = 0;
>  
> + nr_extents = 0;
>   if (BTRFS_I(inode)->outstanding_extents >
>   BTRFS_I(inode)->reserved_extents)
> - nr_extents = BTRFS_I(inode)->outstanding_extents -
> + nr_extents += BTRFS_I(inode)->outstanding_extents -
>   BTRFS_I(inode)->reserved_extents;
>  
> - /*
> -  * Add an item to reserve for updating the inode when we complete the
> -  * delalloc io.
> -  */
> - if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -   &BTRFS_I(inode)->runtime_flags)) {
> - nr_extents++;
> - extra_reserve = 1;
> - }
> -
> - to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
> + /* We always want to reserve a slot for updating the inode. */
> + to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents + 1);
>   to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
>   csum_bytes = BTRFS_I(inode)->csum_bytes;
>   spin_unlock(&BTRFS_I(inode)->lock);
> @@ -5713,18 +5704,16 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>   goto out_fail;
>   }
>  
> - ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
> + ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
>   if (unlikely(ret)) {
>   btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
>   goto out_fail;
>   }
>  
>   spin_lock(&BTRFS_I(inode)->lock);
> - if (extra_reserve) {
> - set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> - &BTRFS_I(inode)->runtime_flags);
> - nr_extents--;
> - }
> + if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> +  &BTRFS_I(inode)->runtime_flags))
> + release_extra = true;
>   BTRFS_I(inode)->reserved_extents += nr_extents;
>   spin_unlock(&BTRFS_I(inode)->lock);
>  
> @@ -5734,8 +5723,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>   if (to_reserve)
>   trace_btrfs_space_reservation(root->fs_info, "delalloc",
> btrfs_ino(inode), to_reserve, 1);
> - block_rsv_add_bytes(block_rsv, to_reserve, 1);
> -
> + if (release_extra)
> + btrfs_block_rsv_release(root, block_rsv,
> + btrfs_calc_trans_metadata_size(root,
> +1));
>   return 0;
>  
>  out_fail:
> -- 
> 2.5.0
> 


Re: [PATCH 14/14] Btrfs: don't do nocow check unless we have to

2016-03-25 Thread Liu Bo
On Fri, Mar 25, 2016 at 01:26:00PM -0400, Josef Bacik wrote:
> Before we write into prealloc/nocow space we have to make sure that
> there are no references to the extents we are writing into, which means
> checking the extent tree and csum tree in the case of nocow.  So we
> don't want to do the nocow dance unless we can't reserve data space,
> since it's a serious drag on performance.
> With the following sequence
> 
> fallocate -l10737418240 /mnt/btrfs-test/file
> cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
> fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
>   --end_fsync=1
> 
> we get the worst-case scenario where we have to fall back to doing the
> check anyway.
> 
> Without this patch
> lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
> write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
> 
> With this patch
> lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
> write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
> 
> We get twice the throughput, half of the runtime, and half of the average
> latency.  Thanks,

I've submitted a similar one, but it looks like this one is cleaner;
I forgot to remove the goto reserve_metadata.

Thanks,

-liubo

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/file.c | 44 ++--
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0ce4bb3..7c80208 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1534,30 +1534,30 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>   reserve_bytes = round_up(write_bytes + sector_offset,
>   root->sectorsize);
>  
> - if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
> -   BTRFS_INODE_PREALLOC)) &&
> - check_can_nocow(inode, pos, &write_bytes) > 0) {
> - /*
> -  * For nodata cow case, no need to reserve
> -  * data space.
> -  */
> - only_release_metadata = true;
> - /*
> -  * our prealloc extent may be smaller than
> -  * write_bytes, so scale down.
> -  */
> - num_pages = DIV_ROUND_UP(write_bytes + offset,
> -  PAGE_CACHE_SIZE);
> - reserve_bytes = round_up(write_bytes + sector_offset,
> - root->sectorsize);
> - goto reserve_metadata;
> - }
> -
>   ret = btrfs_check_data_free_space(inode, pos, write_bytes);
> - if (ret < 0)
> - break;
> + if (ret < 0) {
> + if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
> +   BTRFS_INODE_PREALLOC)) &&
> + check_can_nocow(inode, pos, &write_bytes) > 0) {
> + /*
> +  * For nodata cow case, no need to reserve
> +  * data space.
> +  */
> + only_release_metadata = true;
> + /*
> +  * our prealloc extent may be smaller than
> +  * write_bytes, so scale down.
> +  */
> + num_pages = DIV_ROUND_UP(write_bytes + offset,
> +  PAGE_CACHE_SIZE);
> + reserve_bytes = round_up(write_bytes +
> +  sector_offset,
> +  root->sectorsize);
> + } else {
> + break;
> + }
> + }
>  
> -reserve_metadata:
>   ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
>   if (ret) {
>   if (!only_release_metadata)
> -- 
> 2.5.0
> 


[PATCH 13/14] Btrfs: don't bother kicking async if there's nothing to reclaim

2016-03-25 Thread Josef Bacik
We do this check when we start the async reclaimer thread; we might as well
check before we kick it off and save ourselves some cycles.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 273e18d..4b5a517 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4871,6 +4871,9 @@ static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
return 0;
 
+   if (!btrfs_calc_reclaim_metadata_size(fs_info->fs_root, space_info))
+   return 0;
+
return (used >= thresh && !btrfs_fs_closing(fs_info) &&
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
-- 
2.5.0



[PATCH 14/14] Btrfs: don't do nocow check unless we have to

2016-03-25 Thread Josef Bacik
Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
--end_fsync=1

we get the worst-case scenario where we have to fall back to doing the
check anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c | 44 ++--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0ce4bb3..7c80208 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1534,30 +1534,30 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
reserve_bytes = round_up(write_bytes + sector_offset,
root->sectorsize);
 
-   if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
- BTRFS_INODE_PREALLOC)) &&
-   check_can_nocow(inode, pos, &write_bytes) > 0) {
-   /*
-* For nodata cow case, no need to reserve
-* data space.
-*/
-   only_release_metadata = true;
-   /*
-* our prealloc extent may be smaller than
-* write_bytes, so scale down.
-*/
-   num_pages = DIV_ROUND_UP(write_bytes + offset,
-PAGE_CACHE_SIZE);
-   reserve_bytes = round_up(write_bytes + sector_offset,
-   root->sectorsize);
-   goto reserve_metadata;
-   }
-
ret = btrfs_check_data_free_space(inode, pos, write_bytes);
-   if (ret < 0)
-   break;
+   if (ret < 0) {
+   if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) &&
+   check_can_nocow(inode, pos, &write_bytes) > 0) {
+   /*
+* For nodata cow case, no need to reserve
+* data space.
+*/
+   only_release_metadata = true;
+   /*
+* our prealloc extent may be smaller than
+* write_bytes, so scale down.
+*/
+   num_pages = DIV_ROUND_UP(write_bytes + offset,
+PAGE_CACHE_SIZE);
+   reserve_bytes = round_up(write_bytes +
+sector_offset,
+root->sectorsize);
+   } else {
+   break;
+   }
+   }
 
-reserve_metadata:
ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
if (ret) {
if (!only_release_metadata)
-- 
2.5.0



[PATCH 06/14] Btrfs: add tracepoint for adding block groups

2016-03-25 Thread Josef Bacik
I'm writing a tool to visualize the enospc system inside btrfs, and I need
this tracepoint in order to keep track of the block groups in the system.
Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c   |  2 ++
 include/trace/events/btrfs.h | 40 
 2 files changed, 42 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90ac821..0db4319 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9849,6 +9849,7 @@ int btrfs_read_block_groups(struct btrfs_root *root)
goto error;
}
 
+   trace_btrfs_add_block_group(root->fs_info, cache, 0);
ret = update_space_info(info, cache->flags, found_key.offset,
btrfs_block_group_used(&cache->item),
cache->bytes_super, &space_info);
@@ -10019,6 +10020,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 * Now that our block group has its ->space_info set and is inserted in
 * the rbtree, update the space info's counters.
 */
+   trace_btrfs_add_block_group(root->fs_info, cache, 1);
ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
cache->bytes_super, &cache->space_info);
if (ret) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d866f21..3e61deb8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -440,6 +440,46 @@ TRACE_EVENT(btrfs_sync_fs,
TP_printk("wait = %d", __entry->wait)
 );
 
+TRACE_EVENT(btrfs_add_block_group,
+
+   TP_PROTO(struct btrfs_fs_info *fs_info,
+struct btrfs_block_group_cache *block_group, int create),
+
+   TP_ARGS(fs_info, block_group, create),
+
+   TP_STRUCT__entry(
+   __array(u8, fsid,   BTRFS_UUID_SIZE )
+   __field(u64,offset  )
+   __field(u64,size)
+   __field(u64,flags   )
+   __field(u64,bytes_used  )
+   __field(u64,bytes_super )
+   __field(int,create  )
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE);
+   __entry->offset = block_group->key.objectid;
+   __entry->size   = block_group->key.offset;
+   __entry->flags  = block_group->flags;
+   __entry->bytes_used =
+   btrfs_block_group_used(&block_group->item);
+   __entry->bytes_super= block_group->bytes_super;
+   __entry->create = create;
+   ),
+
+   TP_printk("%pU: block_group offset = %llu, size = %llu, "
+ "flags = %llu(%s), bytes_used = %llu, bytes_super = %llu, "
+ "create = %d", __entry->fsid,
+ (unsigned long long)__entry->offset,
+ (unsigned long long)__entry->size,
+ (unsigned long long)__entry->flags,
+ __print_flags((unsigned long)__entry->flags, "|",
+   BTRFS_GROUP_FLAGS),
+ (unsigned long long)__entry->bytes_used,
+ (unsigned long long)__entry->bytes_super, __entry->create)
+);
+
 #define show_ref_action(action)
\
__print_symbolic(action,\
{ BTRFS_ADD_DELAYED_REF,"ADD_DELAYED_REF" },\
-- 
2.5.0



[PATCH 05/14] Btrfs: warn_on for unaccounted spaces

2016-03-25 Thread Josef Bacik
These were hidden behind enospc_debug, which isn't helpful, as they indicate
actual bugs, unlike the rest of the enospc_debug stuff, which really is just
debug information.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 157a0b6..90ac821 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9633,13 +9633,15 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
space_info = list_entry(info->space_info.next,
struct btrfs_space_info,
list);
-   if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
-   if (WARN_ON(space_info->bytes_pinned > 0 ||
+
+   /*
+* Do not hide this behind enospc_debug, this is actually
+* important and indicates a real bug if this happens.
+*/
+   if (WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
-   space_info->bytes_may_use > 0)) {
-   dump_space_info(space_info, 0, 0);
-   }
-   }
+   space_info->bytes_may_use > 0))
+   dump_space_info(space_info, 0, 0);
list_del(&space_info->list);
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
struct kobject *kobj;
-- 
2.5.0



[PATCH 12/14] Btrfs: fix release reserved extents trace points

2016-03-25 Thread Josef Bacik
We were doing trace_btrfs_release_reserved_extent() in pin_down_extent(),
which isn't quite right, because we will go through and free that extent
later when we unpin, so it messes up apps that are accounting for the
reservation space.  We were also doing it unconditionally in
__btrfs_free_reserved_extent(), when it should only happen if we actually
free the reservation instead of pinning the extent.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ecceea..273e18d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6253,8 +6253,6 @@ static int pin_down_extent(struct btrfs_root *root,
  cache->space_info->flags, num_bytes, 1);
set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
-   if (reserved)
-   trace_btrfs_reserved_extent_free(root, bytenr, num_bytes);
return 0;
 }
 
@@ -7877,12 +7875,10 @@ static int __btrfs_free_reserved_extent(struct btrfs_root *root,
ret = btrfs_discard_extent(root, start, len, NULL);
btrfs_add_free_space(cache, start, len);
btrfs_update_reserved_bytes(cache, len, RESERVE_FREE, delalloc);
+   trace_btrfs_reserved_extent_free(root, start, len);
}
 
btrfs_put_block_group(cache);
-
-   trace_btrfs_reserved_extent_free(root, start, len);
-
return ret;
 }
 
-- 
2.5.0



[PATCH 04/14] Btrfs: change delayed reservation fallback behavior

2016-03-25 Thread Josef Bacik
We reserve space for the inode update when we first reserve space for writing to
a file.  However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents.  Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve.  The problem with this is we can easily
just return ENOSPC and fall back to updating the inode item directly.  In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through; however, if we just fall back and update the inode
directly we'd only steal 4k * BTRFS_MAX_LEVEL in the worst case, which is 32kib.
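
For illustration only (not part of the patch), here is the arithmetic behind
those two numbers as a tiny standalone C program.  The exact formula for
btrfs_calc_trans_metadata_size() is an assumption here, reconstructed from the
numbers above (4k nodesize, BTRFS_MAX_LEVEL == 8):

#include <stdio.h>

#define NODESIZE        4096ULL /* assumed 4k nodesize */
#define BTRFS_MAX_LEVEL 8       /* btree height limit */

int main(void)
{
	/* full per-item reservation: cow a full path, doubled for splits */
	unsigned long long full = NODESIZE * 2 * BTRFS_MAX_LEVEL;
	/* direct inode update: cow one path down to the inode's leaf */
	unsigned long long direct = NODESIZE * BTRFS_MAX_LEVEL;

	printf("steal via fallthrough:   %llu KiB\n", full >> 10);   /* 64 */
	printf("steal via direct update: %llu KiB\n", direct >> 10); /* 32 */
	return 0;
}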

We would also have just added the extent item for the inode, so we will likely
have already cow'ed down most of the way to the leaf containing the inode item,
meaning more often than not we need only one or two nodesizes' worth of
reservation.  Given that the reservation for the extent itself is also a worst
case, we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-inode.c | 64 +---
 1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index d3cda0f..1c77103 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -598,6 +598,29 @@ static int btrfs_delayed_inode_reserve_metadata(
num_bytes = btrfs_calc_trans_metadata_size(root, 1);
 
/*
+* If our block_rsv is the delalloc block reserve then check and see if
+* we have our extra reservation for updating the inode.  If not fall
+* through and try to reserve space quickly.
+*
+* We used to try and steal from the delalloc block rsv or the global
+* reserve, but we'd steal a full reservation, which isn't kind.  We are
+* here through delalloc which means we've likely just cowed down close
+* to the leaf that contains the inode, so we would steal less just
+* doing the fallback inode update, so if we do end up having to steal
+* from the global block rsv we hopefully only steal one or two blocks
+* worth which is less likely to hurt us.
+*/
+   if (src_rsv && src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
+   spin_lock(&BTRFS_I(inode)->lock);
+   if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+  &BTRFS_I(inode)->runtime_flags))
+   release = true;
+   else
+   src_rsv = NULL;
+   spin_unlock(&BTRFS_I(inode)->lock);
+   }
+
+   /*
 * btrfs_dirty_inode will update the inode under btrfs_join_transaction
 * which doesn't reserve space for speed.  This is a problem since we
 * still need to reserve space for this update, so try to reserve the
@@ -626,51 +649,10 @@ static int btrfs_delayed_inode_reserve_metadata(
  num_bytes, 1);
}
return ret;
-   } else if (src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
-   spin_lock(&BTRFS_I(inode)->lock);
-   if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-  &BTRFS_I(inode)->runtime_flags)) {
-   spin_unlock(&BTRFS_I(inode)->lock);
-   release = true;
-   goto migrate;
-   }
-   spin_unlock(&BTRFS_I(inode)->lock);
-
-   /* Ok we didn't have space pre-reserved.  This shouldn't happen
-* too often but it can happen if we do delalloc to an existing
-* inode which gets dirtied because of the time update, and then
-* isn't touched again until after the transaction commits and
-* then we try to write out the data.  First try to be nice and
-* reserve something strictly for us.  If not be a pain and try
-* to steal from the delalloc block rsv.
-*/
-   ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
- BTRFS_RESERVE_NO_FLUSH);
-   if (!ret)
-   goto out;
-
-   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
-   if (!ret)
-   goto out;
-
-   if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
-   btrfs_debug(root->fs_info,
-   "block rsv migrate returned %d", ret);
-   

[PATCH 10/14] Btrfs: add tracepoints for flush events

2016-03-25 Thread Josef Bacik
We want to track when we trigger flushing from our reservation code, and which
flushing action is being run once flushing starts.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  9 +
 fs/btrfs/extent-tree.c   | 22 ++--
 include/trace/events/btrfs.h | 82 
 3 files changed, 103 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7437c8a..55a24c5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3611,6 +3611,15 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
 };
 
+enum btrfs_flush_state {
+   FLUSH_DELAYED_ITEMS_NR  =   1,
+   FLUSH_DELAYED_ITEMS =   2,
+   FLUSH_DELALLOC  =   3,
+   FLUSH_DELALLOC_WAIT =   4,
+   ALLOC_CHUNK =   5,
+   COMMIT_TRANS=   6,
+};
+
 int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1221c07..0ecceea 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4762,15 +4762,6 @@ commit:
return btrfs_commit_transaction(trans, root);
 }
 
-enum flush_state {
-   FLUSH_DELAYED_ITEMS_NR  =   1,
-   FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
-};
-
 struct reserve_ticket {
u64 bytes;
int error;
@@ -4828,6 +4819,8 @@ static int flush_space(struct btrfs_root *root,
break;
}
 
+   trace_btrfs_flush_space(root->fs_info, space_info->flags, num_bytes,
+   orig_bytes, state, ret);
return ret;
 }
 
@@ -5105,6 +5098,10 @@ static int __reserve_metadata_bytes(struct btrfs_root 
*root,
list_add_tail(&ticket.list, &space_info->tickets);
if (!space_info->flush) {
space_info->flush = 1;
+   trace_btrfs_trigger_flush(root->fs_info,
+ space_info->flags,
+ orig_bytes, flush,
+ "enospc");
queue_work(system_unbound_wq,
   &root->fs_info->async_reclaim_work);
}
@@ -5121,9 +5118,14 @@ static int __reserve_metadata_bytes(struct btrfs_root 
*root,
 */
if (!root->fs_info->log_root_recovering &&
need_do_async_reclaim(space_info, root->fs_info, used) &&
-   !work_busy(&root->fs_info->async_reclaim_work))
+   !work_busy(&root->fs_info->async_reclaim_work)) {
+   trace_btrfs_trigger_flush(root->fs_info,
+ space_info->flags,
+ orig_bytes, flush,
+ "preempt");
queue_work(system_unbound_wq,
   &root->fs_info->async_reclaim_work);
+   }
}
spin_unlock(&space_info->lock);
if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 3e61deb8..6c192dc 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -784,6 +784,88 @@ TRACE_EVENT(btrfs_space_reservation,
  __entry->bytes)
 );
 
+#define show_flush_action(action)					\
+	__print_symbolic(action,					\
+	{ BTRFS_RESERVE_NO_FLUSH,	"BTRFS_RESERVE_NO_FLUSH"},	\
+	{ BTRFS_RESERVE_FLUSH_LIMIT,	"BTRFS_RESERVE_FLUSH_LIMIT"},	\
+	{ BTRFS_RESERVE_FLUSH_ALL,	"BTRFS_RESERVE_FLUSH_ALL"})
+
+TRACE_EVENT(btrfs_trigger_flush,
+
+   TP_PROTO(struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
+int flush, char *reason),
+
+   TP_ARGS(fs_info, flags, bytes, flush, reason),
+
+   TP_STRUCT__entry(
+   __array(u8, fsid,   BTRFS_UUID_SIZE )
+   __field(u64,flags   )
+   __field(u64,bytes   )
+   __field(int,flush   )
+   __string(   reason, reason  )
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE);
+   __entry->flags  = flags;
+

[PATCH 11/14] Btrfs: add fsid to some tracepoints

2016-03-25 Thread Josef Bacik
When tracing enospc problems on a box with multiple file systems mounted, I need
to be able to differentiate between them.  Most of the important trace points
I'm looking at already have an fsid, but the reserved extent trace points do
not, so add one there as well to make it possible to tell which trace point
belongs to which file system.  Thanks,

Signed-off-by: Josef Bacik 
---
 include/trace/events/btrfs.h | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 6c192dc..b0f555e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -873,18 +873,21 @@ DECLARE_EVENT_CLASS(btrfs__reserved_extent,
TP_ARGS(root, start, len),
 
TP_STRUCT__entry(
-   __field(u64,  root_objectid )
-   __field(u64,  start )
-   __field(u64,  len   )
+   __array(u8, fsid,   BTRFS_UUID_SIZE )
+   __field(u64,root_objectid   )
+   __field(u64,start   )
+   __field(u64,len )
),
 
TP_fast_assign(
+   memcpy(__entry->fsid, root->fs_info->fsid, BTRFS_UUID_SIZE);
__entry->root_objectid  = root->root_key.objectid;
__entry->start  = start;
__entry->len= len;
),
 
-   TP_printk("root = %llu(%s), start = %llu, len = %llu",
+   TP_printk("%pU: root = %llu(%s), start = %llu, len = %llu",
+ __entry->fsid,
  show_root_type(__entry->root_objectid),
  (unsigned long long)__entry->start,
  (unsigned long long)__entry->len)
@@ -941,6 +944,7 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
TP_ARGS(root, block_group, start, len),
 
TP_STRUCT__entry(
+   __array(u8, fsid,   BTRFS_UUID_SIZE )
__field(u64,root_objectid   )
__field(u64,bg_objectid )
__field(u64,flags   )
@@ -949,6 +953,7 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
),
 
TP_fast_assign(
+   memcpy(__entry->fsid, root->fs_info->fsid, BTRFS_UUID_SIZE);
__entry->root_objectid  = root->root_key.objectid;
__entry->bg_objectid= block_group->key.objectid;
__entry->flags  = block_group->flags;
@@ -956,8 +961,8 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
__entry->len= len;
),
 
-   TP_printk("root = %Lu(%s), block_group = %Lu, flags = %Lu(%s), "
- "start = %Lu, len = %Lu",
+   TP_printk("%pU: root = %Lu(%s), block_group = %Lu, flags = %Lu(%s), "
+ "start = %Lu, len = %Lu", __entry->fsid,
  show_root_type(__entry->root_objectid), __entry->bg_objectid,
  __entry->flags, __print_flags((unsigned long)__entry->flags,
"|", BTRFS_GROUP_FLAGS),
-- 
2.5.0



[PATCH 08/14] Btrfs: trace pinned extents

2016-03-25 Thread Josef Bacik
Pinned extents are an important metric to keep track of for enospc.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1673365..26f7a9d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6168,6 +6168,9 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
 
+   trace_btrfs_space_reservation(root->fs_info, "pinned",
+ cache->space_info->flags,
+ num_bytes, 1);
set_extent_dirty(info->pinned_extents,
 bytenr, bytenr + num_bytes - 1,
 GFP_NOFS | __GFP_NOFAIL);
@@ -6242,6 +6245,8 @@ static int pin_down_extent(struct btrfs_root *root,
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
 
+   trace_btrfs_space_reservation(root->fs_info, "pinned",
+ cache->space_info->flags, num_bytes, 1);
set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
if (reserved)
@@ -6549,6 +6554,9 @@ static int unpin_extent_range(struct btrfs_root *root, 
u64 start, u64 end,
spin_lock(&cache->lock);
cache->pinned -= len;
space_info->bytes_pinned -= len;
+
+   trace_btrfs_space_reservation(fs_info, "pinned",
+ space_info->flags, len, 0);
space_info->max_extent_size = 0;
percpu_counter_add(&space_info->total_bytes_pinned, -len);
if (cache->ro) {
-- 
2.5.0



[PATCH 03/14] Btrfs: always reserve metadata for delalloc extents

2016-03-25 Thread Josef Bacik
There are a few races in the metadata reservation stuff.  First, we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it, and after we've reserved the bytes.  So use the normal
btrfs_block_rsv_add helper for this case.  Secondly, we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know it, because we only
check before the reservation.  So instead make sure we are always reserving
space for the inode update, and then release those bytes afterward if we don't
need them.
Thanks,
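
To make the new flow concrete, a condensed sketch (illustration only, with
simplified control flow and no error handling; the real code is in the diff
below):

	/* always include one extra item slot for the inode update */
	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents + 1);
	ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);

	spin_lock(&BTRFS_I(inode)->lock);
	/* slot already reserved earlier? then we over-reserved by one item */
	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
			     &BTRFS_I(inode)->runtime_flags))
		release_extra = true;
	spin_unlock(&BTRFS_I(inode)->lock);

	if (release_extra)
		btrfs_block_rsv_release(root, block_rsv,
					btrfs_calc_trans_metadata_size(root, 1));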

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 35 +--
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 06f4e7b..157a0b6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5653,12 +5653,12 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
u64 to_reserve = 0;
u64 csum_bytes;
unsigned nr_extents = 0;
-   int extra_reserve = 0;
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
int ret = 0;
bool delalloc_lock = true;
u64 to_free = 0;
unsigned dropped;
+   bool release_extra = false;
 
/* If we are a free space inode we need to not flush since we will be in
 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5684,24 +5684,15 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
 BTRFS_MAX_EXTENT_SIZE - 1,
 BTRFS_MAX_EXTENT_SIZE);
BTRFS_I(inode)->outstanding_extents += nr_extents;
-   nr_extents = 0;
 
+   nr_extents = 0;
if (BTRFS_I(inode)->outstanding_extents >
BTRFS_I(inode)->reserved_extents)
-   nr_extents = BTRFS_I(inode)->outstanding_extents -
+   nr_extents += BTRFS_I(inode)->outstanding_extents -
BTRFS_I(inode)->reserved_extents;
 
-   /*
-* Add an item to reserve for updating the inode when we complete the
-* delalloc io.
-*/
-   if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
- &BTRFS_I(inode)->runtime_flags)) {
-   nr_extents++;
-   extra_reserve = 1;
-   }
-
-   to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
+   /* We always want to reserve a slot for updating the inode. */
+   to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents + 1);
to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
csum_bytes = BTRFS_I(inode)->csum_bytes;
spin_unlock(&BTRFS_I(inode)->lock);
@@ -5713,18 +5704,16 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
goto out_fail;
}
 
-   ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
+   ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
if (unlikely(ret)) {
btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
goto out_fail;
}
 
spin_lock(&BTRFS_I(inode)->lock);
-   if (extra_reserve) {
-   set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-   &BTRFS_I(inode)->runtime_flags);
-   nr_extents--;
-   }
+   if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+&BTRFS_I(inode)->runtime_flags))
+   release_extra = true;
BTRFS_I(inode)->reserved_extents += nr_extents;
spin_unlock(&BTRFS_I(inode)->lock);
 
@@ -5734,8 +5723,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, 
u64 num_bytes)
if (to_reserve)
trace_btrfs_space_reservation(root->fs_info, "delalloc",
  btrfs_ino(inode), to_reserve, 1);
-   block_rsv_add_bytes(block_rsv, to_reserve, 1);
-
+   if (release_extra)
+   btrfs_block_rsv_release(root, block_rsv,
+   btrfs_calc_trans_metadata_size(root,
+  1));
return 0;
 
 out_fail:
-- 
2.5.0



[PATCH 02/14] Btrfs: fix callers of btrfs_block_rsv_migrate

2016-03-25 Thread Josef Bacik
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only that, but it unconditionally changes the size of the block_rsv.  This
isn't a bug strictly speaking, but it makes the truncate block rsv look funny
because every time we migrate bytes over, its size grows, even though we only
want it to be a specific size.  So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size, for
consistency's sake.  Thanks,
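
For example, after this patch callers state explicitly whether the destination
rsv's size should grow (a sketch of the two call styles, not quoted verbatim
from the tree):

	/* delayed-inode style: migrate bytes and grow the destination's size */
	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);

	/* truncate/evict style: migrate bytes, keep the rsv at its set size */
	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 0);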

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  4 ++--
 fs/btrfs/delayed-inode.c |  8 
 fs/btrfs/extent-tree.c   | 18 ++
 fs/btrfs/file.c  |  4 ++--
 fs/btrfs/inode.c |  7 +++
 fs/btrfs/relocation.c|  2 +-
 6 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..b675066 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3646,8 +3646,8 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
   struct btrfs_block_rsv *block_rsv, u64 min_reserved,
   enum btrfs_reserve_flush_enum flush);
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
-   struct btrfs_block_rsv *dst_rsv,
-   u64 num_bytes);
+   struct btrfs_block_rsv *dst_rsv, u64 num_bytes,
+   int update_size);
 int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *dest, u64 num_bytes,
 int min_factor);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 6cef006..d3cda0f 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -553,7 +553,7 @@ static int btrfs_delayed_item_reserve_metadata(struct 
btrfs_trans_handle *trans,
dst_rsv = &root->fs_info->delayed_block_rsv;
 
num_bytes = btrfs_calc_trans_metadata_size(root, 1);
-   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
if (!ret) {
trace_btrfs_space_reservation(root->fs_info, "delayed_item",
  item->key.objectid,
@@ -649,7 +649,7 @@ static int btrfs_delayed_inode_reserve_metadata(
if (!ret)
goto out;
 
-   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
if (!ret)
goto out;
 
@@ -663,12 +663,12 @@ static int btrfs_delayed_inode_reserve_metadata(
 * since this really shouldn't happen that often.
 */
ret = btrfs_block_rsv_migrate(&root->fs_info->global_block_rsv,
- dst_rsv, num_bytes);
+ dst_rsv, num_bytes, 1);
goto out;
}
 
 migrate:
-   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+   ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 
 out:
/*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c357c96..06f4e7b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5192,8 +5192,9 @@ static void block_rsv_release_bytes(struct btrfs_fs_info 
*fs_info,
}
 }
 
-static int block_rsv_migrate_bytes(struct btrfs_block_rsv *src,
-  struct btrfs_block_rsv *dst, u64 num_bytes)
+int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
+   struct btrfs_block_rsv *dst, u64 num_bytes,
+   int update_size)
 {
int ret;
 
@@ -5201,7 +5202,7 @@ static int block_rsv_migrate_bytes(struct btrfs_block_rsv 
*src,
if (ret)
return ret;
 
-   block_rsv_add_bytes(dst, num_bytes, 1);
+   block_rsv_add_bytes(dst, num_bytes, update_size);
return 0;
 }
 
@@ -5308,13 +5309,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
-int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
-   struct btrfs_block_rsv *dst_rsv,
-   u64 num_bytes)
-{
-   return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
-}
-
 void btrfs_block_rsv_release(struct btrfs_root *root,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes)
@@ -5494,7 +5488,7 @@ int btrfs_orphan_reserve_metadata(struct 
btrfs_trans_handle *trans,
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
trace_btrfs_space_reservation(root->fs_info, "orphan",
  btrfs_ino(inode), num_bytes, 1);
-   return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
+   return btrfs_block_rsv_migrate(src_rs

[PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure

2016-03-25 Thread Josef Bacik
Our enospc flushing sucks.  It is born from a time when we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation.  This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush: since the async flusher
could be doing anything, they need to avoid deadlocks.  These guys get priority
for having their reservation made, and will still do manual flushing themselves
in case the async flusher isn't running.
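
A minimal sketch of the ticket flow (illustration only; names, locking and the
flusher hand-off are simplified, the real structures are in the diff below):

struct reserve_ticket {
	u64 bytes;			/* space still owed to this waiter */
	int error;
	struct list_head list;
	wait_queue_head_t wait;
};

/* reservation side: queue a ticket in order and sleep until satisfied */
static int reserve_sketch(struct btrfs_space_info *si, u64 bytes, bool can_flush)
{
	struct reserve_ticket t = { .bytes = bytes, .error = 0 };

	INIT_LIST_HEAD(&t.list);
	init_waitqueue_head(&t.wait);
	list_add_tail(&t.list, can_flush ? &si->tickets : &si->priority_tickets);
	/* ...kick the async flusher here, then wait... */
	wait_event(t.wait, t.bytes == 0 || t.error);
	return t.error;
}

/* freeing side: hand returned space to the oldest tickets first */
static void add_bytes_sketch(struct btrfs_space_info *si, u64 num_bytes)
{
	struct reserve_ticket *t, *n;

	list_for_each_entry_safe(t, n, &si->tickets, list) {
		u64 grant = min(t->bytes, num_bytes);

		t->bytes -= grant;
		num_bytes -= grant;
		if (!t->bytes) {
			list_del_init(&t->list);
			wake_up(&t->wait);
		}
		if (!num_bytes)
			break;
	}
}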

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |   2 +
 fs/btrfs/extent-tree.c | 524 +++--
 2 files changed, 375 insertions(+), 151 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b675066..7437c8a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1229,6 +1229,8 @@ struct btrfs_space_info {
struct list_head list;
/* Protected by the spinlock 'lock'. */
struct list_head ro_bgs;
+   struct list_head priority_tickets;
+   struct list_head tickets;
 
struct rw_semaphore groups_sem;
/* for block groups in our same type */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0db4319..1673365 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -111,6 +111,16 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv 
*block_rsv,
   u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, int reserved);
+static int __reserve_metadata_bytes(struct btrfs_root *root,
+   struct btrfs_space_info *space_info,
+   u64 orig_bytes,
+   enum btrfs_reserve_flush_enum flush);
+static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
+struct btrfs_space_info *space_info,
+u64 num_bytes);
+static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
+struct btrfs_space_info *space_info,
+u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -3867,6 +3877,8 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
found->bytes_readonly += bytes_readonly;
if (total_bytes > 0)
found->full = 0;
+   space_info_add_new_bytes(info, found, total_bytes -
+bytes_used - bytes_readonly);
spin_unlock(&found->lock);
*space_info = found;
return 0;
@@ -3901,6 +3913,8 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
found->flush = 0;
init_waitqueue_head(&found->wait);
INIT_LIST_HEAD(&found->ro_bgs);
+   INIT_LIST_HEAD(&found->tickets);
+   INIT_LIST_HEAD(&found->priority_tickets);
 
ret = kobject_init_and_add(&found->kobj, &space_info_ktype,
info->space_info_kobj, "%s",
@@ -4514,12 +4528,19 @@ static int can_overcommit(struct btrfs_root *root,
  struct btrfs_space_info *space_info, u64 bytes,
  enum btrfs_reserve_flush_enum flush)
 {
-   struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
-   u64 profile = btrfs_get_alloc_profile(root, 0);
+   struct btrfs_block_rsv *global_rsv;
+   u64 profile;
u64 space_size;
u64 avail;
u64 used;
 
+   /* Don't overcommit when in mixed mode. */
+   if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
+   return 0;
+
+   BUG_ON(root->fs_info == NULL);
+   global_rsv = &root->fs_info->global_block_rsv;
+   profile = btrfs_get_alloc_profile(root, 0);
used = space_info->bytes_used + space_info->bytes_reserved +
space_info->bytes_pinned + space_info->bytes_readonly;
 
@@ -4669,6 +4690,11 @@ skip_async:
spin_unlock(&space_info->lock);

[PATCH 00/14] Enospc rework

2016-03-25 Thread Josef Bacik
The current enospc flushing scheme has a few problems:

1) Huge latency spikes.  One guy starts flushing; he doesn't wake up until the
flushers are finished doing work, and only then checks to see if he can continue.
Meanwhile everybody is backed up waiting for that guy to finish getting his
reservation.

2) The flushers flush everything.  They have no idea when to stop, so they just
flush all of delalloc or all of the delayed inodes.  At first they try to
flush a little bit and hope they can get away with it, but the tighter you get
on space the more it becomes flush the world and hope for the best.

3) Some of the flushing isn't async, yay more latency.

The new approach introduces the idea of tickets for reservations.  If you cannot
make your reservation immediately you initialize a ticket with how much space
you need and put yourself on a list.  If you cannot flush anything (things
like dirtying an inode) then you add yourself to the priority queue and wait
for a little bit.  If you can flush then you add yourself to the normal queue
and wait for flushing to happen.  Each ticket has its own waitqueue, so as we
add space back into the system we can satisfy reservations and wake the waiters
immediately, which greatly reduces latencies.

I've been testing these patches for a while and will be building on them from
here, but the results are pretty excellent so far.  In the fs_mark test with all
metadata here are the results (on an empty file system)

Without Patch
Average Files/sec: 212897.2
p50 Files/sec: 207495
p90 Files/sec: 196709
p99 Files/sec: 189682

Creat Max Latency in usec
p50: 264665
p90: 456347.2
p99: 659489.32
max: 1001413

With Patch
Average Files/sec: 238613.4  
p50 Files/sec: 235764  
p90 Files/sec: 223308  
p99 Files/sec: 216291 

Creat Max Latency in usec
p50: 206771.5
p90: 355430.6
p99: 469634.98
max: 512389

So as you can see, both latency and throughput improve overall (roughly 12%
higher average files/sec, with max create latency cut about in half).  There
will be more work as I test the worst case scenarios and get the worst
latencies down further, but this is the initial work.  Thanks,

Josef



[PATCH 09/14] Btrfs: fix delalloc reservation amount tracepoint

2016-03-25 Thread Josef Bacik
We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 26f7a9d..1221c07 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5922,8 +5922,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, 
u64 num_bytes)
 
spin_lock(&BTRFS_I(inode)->lock);
if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-&BTRFS_I(inode)->runtime_flags))
+&BTRFS_I(inode)->runtime_flags)) {
+   to_reserve -= btrfs_calc_trans_metadata_size(root, 1);
release_extra = true;
+   }
BTRFS_I(inode)->reserved_extents += nr_extents;
spin_unlock(&BTRFS_I(inode)->lock);
 
-- 
2.5.0



[PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once

2016-03-25 Thread Josef Bacik
For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info.  This creates a tiny race where we
could over-reserve space because we haven't yet taken the bytes_readonly
amount out.  Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once.  Thanks,
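
Schematically, the window being closed (illustration only; the timeline is
reconstructed from the description and the diff below):

	/*
	 * Before:
	 *   update_space_info(info, flags, total, used, &space_info);
	 *   <-- window: bytes_readonly does not yet include bytes_super,
	 *       so a concurrent reservation can overcommit -->
	 *   spin_lock(&space_info->lock);
	 *   space_info->bytes_readonly += cache->bytes_super;
	 *   spin_unlock(&space_info->lock);
	 *
	 * After: bytes_super is passed into update_space_info() and is
	 * accounted under the same lock as the other counters.
	 */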

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..c357c96 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3843,6 +3843,7 @@ static const char *alloc_name(u64 flags)
 
 static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 u64 total_bytes, u64 bytes_used,
+u64 bytes_readonly,
 struct btrfs_space_info **space_info)
 {
struct btrfs_space_info *found;
@@ -3863,6 +3864,7 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
found->disk_total += total_bytes * factor;
found->bytes_used += bytes_used;
found->disk_used += bytes_used * factor;
+   found->bytes_readonly += bytes_readonly;
if (total_bytes > 0)
found->full = 0;
spin_unlock(&found->lock);
@@ -3890,7 +3892,7 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
found->disk_used = bytes_used * factor;
found->bytes_pinned = 0;
found->bytes_reserved = 0;
-   found->bytes_readonly = 0;
+   found->bytes_readonly = bytes_readonly;
found->bytes_may_use = 0;
found->full = 0;
found->max_extent_size = 0;
@@ -4400,7 +4402,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
space_info = __find_space_info(extent_root->fs_info, flags);
if (!space_info) {
ret = update_space_info(extent_root->fs_info, flags,
-   0, 0, &space_info);
+   0, 0, 0, &space_info);
BUG_ON(ret); /* -ENOMEM */
}
BUG_ON(!space_info); /* Logic error */
@@ -9862,7 +9864,7 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 
ret = update_space_info(info, cache->flags, found_key.offset,
btrfs_block_group_used(&cache->item),
-   &space_info);
+   cache->bytes_super, &space_info);
if (ret) {
btrfs_remove_free_space_cache(cache);
spin_lock(&info->block_group_cache_lock);
@@ -9875,9 +9877,6 @@ int btrfs_read_block_groups(struct btrfs_root *root)
}
 
cache->space_info = space_info;
-   spin_lock(&cache->space_info->lock);
-   cache->space_info->bytes_readonly += cache->bytes_super;
-   spin_unlock(&cache->space_info->lock);
 
__link_block_group(space_info, cache);
 
@@ -9969,7 +9968,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
int ret;
struct btrfs_root *extent_root;
struct btrfs_block_group_cache *cache;
-
extent_root = root->fs_info->extent_root;
 
btrfs_set_log_full_commit(root->fs_info, trans);
@@ -10015,7 +10013,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
 * assigned to our block group, but don't update its counters just yet.
 * We want our bg to be added to the rbtree with its ->space_info set.
 */
-   ret = update_space_info(root->fs_info, cache->flags, 0, 0,
+   ret = update_space_info(root->fs_info, cache->flags, 0, 0, 0,
&cache->space_info);
if (ret) {
btrfs_remove_free_space_cache(cache);
@@ -10035,7 +10033,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
 * the rbtree, update the space info's counters.
 */
ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
-   &cache->space_info);
+   cache->bytes_super, &cache->space_info);
if (ret) {
btrfs_remove_free_space_cache(cache);
spin_lock(&root->fs_info->block_group_cache_lock);
@@ -10048,16 +10046,11 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
}
update_global_block_rsv(root->fs_info);
 
-   spin_lock(&cache->space_info->lock);
-   cache->space_info->bytes_readonly += cache->bytes_super;
-   spin_unlock(&cache->space_info->lock);
-
__link_block_group(cache->space_info, cache);
 
list_add_tail(&cache->bg_list, &trans->new_bgs);
 
set_avail_alloc_bits(extent_root->f

Re: Possible Raid Bug

2016-03-25 Thread Stephen Williams
Hi Patrik,

[root@Xen ~]# uname -r
4.4.5-1-ARCH

[root@Xen ~]# pacman -Q btrfs-progs
btrfs-progs 4.4.1-1

Your information below was very helpful and I was able to recreate the
Raid array. However my initial question still stands - what if the
drive dies completely? I work in a data center and we see this quite a
lot: a drive is beyond dead and the OS will literally not detect it.
At that point would the Raid10 array be beyond repair, since you need the
drive present in order to mount the array in degraded mode?

-- 
  Stephen Williams
  steph...@veryfast.biz

On Fri, Mar 25, 2016, at 02:57 PM, Patrik Lundquist wrote:
> On Debian Stretch with Linux 4.4.6, btrfs-progs 4.4 in VirtualBox
> 5.0.16 with 4*2GB VDIs:
> 
> # mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
> 
> # mount /dev/sdb /mnt
> # touch /mnt/test
> # umount /mnt
> 
> Everything fine so far.
> 
> # wipefs -a /dev/sde
> 
> *reboot*
> 
> # mount /dev/sdb /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>missing codepage or helper program, or other error
> 
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
> 
> # dmesg | tail
> [   85.979655] BTRFS info (device sdb): disk space caching is enabled
> [   85.979660] BTRFS: has skinny extents
> [   85.982377] BTRFS: failed to read the system array on sdb
> [   85.996793] BTRFS: open_ctree failed
> 
> Not very informative! An information regression?
> 
> # mount -o degraded /dev/sdb /mnt
> 
> # dmesg | tail
> [  919.899071] BTRFS info (device sdb): allowing degraded mounts
> [  919.899075] BTRFS info (device sdb): disk space caching is enabled
> [  919.899077] BTRFS: has skinny extents
> [  919.903216] BTRFS warning (device sdb): devid 4 uuid
> 8549a275-f663-4741-b410-79b49a1d465f is missing
> 
> # touch /mnt/test2
> # ls -l /mnt/
> total 0
> -rw-r--r-- 1 root root 0 mar 25 15:17 test
> -rw-r--r-- 1 root root 0 mar 25 15:42 test2
> 
> # btrfs device remove missing /mnt
> ERROR: error removing device 'missing': unable to go below four
> devices on raid10
> 
> As expected.
> 
> # btrfs replace start -B missing /dev/sde /mnt
> ERROR: source device must be a block device or a devid
> 
> Would have been nice if missing worked here too. Maybe it does in
> btrfs-progs 4.5?
> 
> # btrfs replace start -B 4 /dev/sde /mnt
> 
> # dmesg | tail
> [ 1618.170619] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde started
> [ 1618.184979] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde finished
> 
> Repaired!
> 
> # umount /mnt
> # mount /dev/sdb /mnt
> # dmesg | tail
> [ 1729.917661] BTRFS info (device sde): disk space caching is enabled
> [ 1729.917665] BTRFS: has skinny extents
> 
> All in all it works just fine with Linux 4.4.6.


Re: [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedupe

2016-03-25 Thread Chris Mason
On Fri, Mar 25, 2016 at 09:44:31AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2016/03/24 16:35 -0400:
> >On Tue, Mar 22, 2016 at 09:35:50AM +0800, Qu Wenruo wrote:
> >>From: Wang Xiaoguang 
> >>
> >>The basic idea is to also calculate the hash before compression, and add the
> >>members dedupe needs to record a compressed file extent.
> >>
> >>Since dedupe supports a dedupe_bs larger than 128K, which is the upper limit
> >>of a compressed file extent, in that case we will skip dedupe and prefer
> >>compression, as at that size the dedupe rate is low and the compression gain
> >>is more obvious.
> >>
> >>The current implementation is far from elegant.  The most elegant one would
> >>split every data processing method into its own independent function, and
> >>have a unified function to coordinate them.
> >
> >I'd leave this one out for now; it looks like we need to refine the
> >pipeline from dedup -> compression, and this is just more to carry around
> >until the initial support is in.  Can you just decline to dedup
> >compressed extents for now?
> 
> Yes, no problem at all.
> Although this patch seems to work well, I have also planned to rework the
> current run_delalloc_range() to make it more flexible and clear.
> 
> So the main objective of the patch is more about raising attention to such
> further rework.
> 
> And now it has achieved its goal.

Thanks, I do really like how you had compression in mind all along.

-chris



Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-03-25 Thread Chris Mason
On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2016/03/24 16:58 -0400:
> >Are you storing the entire hash, or just the parts not represented in
> >the key?  I'd like to keep the on-disk part as compact as possible for
> >this part.
> 
> Currently, it's the entire hash.
> 
> More details can be found in another mail.
> 
> Although it's OK with me to truncate the last duplicated 8 bytes (64 bits),
> I still quite like the current implementation, as one memcpy() is simpler.

[ sorry, FB makes urls look ugly, so I delete them from replies ;) ]

Right, I saw that but wanted to reply to the specific patch.  One of the
lessons learned from the extent allocation tree and file extent items is
that they are just too big.  Let's save those bytes, it'll add up.

> 
> >
> >>+
> >>+/*
> >>+ * Objectid: bytenr
> >>+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
> >>+ * offset: Last 64 bit of the hash
> >>+ *
> >>+ * Used for bytenr <-> hash search (for free_extent)
> >>+ * all its content is hash.
> >>+ * So no special item struct is needed.
> >>+ */
> >>+
> >
> >Can we do this instead with a backref from the extent?  It'll save us a
> >huge amount of IO as we delete things.
> 
> That's the original implementation from Liu Bo.
> 
> The problem is that it changes the data backref rules (originally, only an
> EXTENT_DATA item can cause a data backref), and would make dedupe INCOMPAT
> instead of the current RO_COMPAT.
> So I really don't like changing the data backref rule.

Let me reread this part; the cost of maintaining the second index is
dramatically higher than adding a backref.  I do agree that it's nice
to be able to delete the dedup trees without impacting the rest, but
over the long term I think we'll regret the added balances.

> 
> If we only want to reduce on-disk space, just trashing the hash and making
> DEDUPE_BYTENR_ITEM have no data would be good enough.
> 
> As (bytenr, DEDUPE_BYTENR_ITEM) can locate the hash uniquely.

For the second index, the big problem is the cost of the btree
operations.  We're already pretty expensive in terms of the cost of
deleting an extent; with dedup it's 2x higher, and with dedup + an extra
index it's 3x higher.

> 
> In fact no code really checks the hash for the dedupe bytenr item; they all
> just swap objectid and offset, reset the type and search for
> DEDUPE_HASH_ITEM.
> 
> So it's OK to omit the hash.

If we have to go with the second index, I do agree here.

-chris


Re: Possible Raid Bug

2016-03-25 Thread Patrik Lundquist
On Debian Stretch with Linux 4.4.6, btrfs-progs 4.4 in VirtualBox
5.0.16 with 4*2GB VDIs:

# mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# mount /dev/sdb /mnt
# touch /mnt/test
# umount /mnt

Everything fine so far.

# wipefs -a /dev/sde

*reboot*

# mount /dev/sdb /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

# dmesg | tail
[   85.979655] BTRFS info (device sdb): disk space caching is enabled
[   85.979660] BTRFS: has skinny extents
[   85.982377] BTRFS: failed to read the system array on sdb
[   85.996793] BTRFS: open_ctree failed

Not very informative! An information regression?

# mount -o degraded /dev/sdb /mnt

# dmesg | tail
[  919.899071] BTRFS info (device sdb): allowing degraded mounts
[  919.899075] BTRFS info (device sdb): disk space caching is enabled
[  919.899077] BTRFS: has skinny extents
[  919.903216] BTRFS warning (device sdb): devid 4 uuid
8549a275-f663-4741-b410-79b49a1d465f is missing

# touch /mnt/test2
# ls -l /mnt/
total 0
-rw-r--r-- 1 root root 0 mar 25 15:17 test
-rw-r--r-- 1 root root 0 mar 25 15:42 test2

# btrfs device remove missing /mnt
ERROR: error removing device 'missing': unable to go below four
devices on raid10

As expected.

# btrfs replace start -B missing /dev/sde /mnt
ERROR: source device must be a block device or a devid

Would have been nice if missing worked here too. Maybe it does in
btrfs-progs 4.5?

# btrfs replace start -B 4 /dev/sde /mnt

# dmesg | tail
[ 1618.170619] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde started
[ 1618.184979] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde finished

Repaired!

# umount /mnt
# mount /dev/sdb /mnt
# dmesg | tail
[ 1729.917661] BTRFS info (device sde): disk space caching is enabled
[ 1729.917665] BTRFS: has skinny extents

All in all it works just fine with Linux 4.4.6.


[PATCH] delete obsolete function btrfs_print_tree()

2016-03-25 Thread Holger Hoffstätte
Dan Carpenter's static checker recently found missing IS_ERR handling
in print-tree.c:btrfs_print_tree(). While looking into this I found that
this function is no longer called anywhere and was moved to btrfs-progs
long ago. It can simply be removed.

Reported-by: Dan Carpenter 
Signed-off-by: Holger Hoffstätte 
---
 fs/btrfs/print-tree.c | 38 --
 fs/btrfs/print-tree.h |  1 -
 2 files changed, 39 deletions(-)

diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 147dc6c..dc28db8 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -328,41 +328,3 @@ void btrfs_print_leaf(struct btrfs_root *root, struct 
extent_buffer *l)
};
}
 }
-
-void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *c)
-{
-   int i; u32 nr;
-   struct btrfs_key key;
-   int level;
-
-   if (!c)
-   return;
-   nr = btrfs_header_nritems(c);
-   level = btrfs_header_level(c);
-   if (level == 0) {
-   btrfs_print_leaf(root, c);
-   return;
-   }
-	btrfs_info(root->fs_info, "node %llu level %d total ptrs %d free spc %u",
-   btrfs_header_bytenr(c), level, nr,
-   (u32)BTRFS_NODEPTRS_PER_BLOCK(root) - nr);
-   for (i = 0; i < nr; i++) {
-   btrfs_node_key_to_cpu(c, &key, i);
-   printk(KERN_INFO "\tkey %d (%llu %u %llu) block %llu\n",
-  i, key.objectid, key.type, key.offset,
-  btrfs_node_blockptr(c, i));
-   }
-   for (i = 0; i < nr; i++) {
-   struct extent_buffer *next = read_tree_block(root,
-   btrfs_node_blockptr(c, i),
-   btrfs_node_ptr_generation(c, i));
-   if (btrfs_is_leaf(next) &&
-  level != 1)
-   BUG();
-   if (btrfs_header_level(next) !=
-  level - 1)
-   BUG();
-   btrfs_print_tree(root, next);
-   free_extent_buffer(next);
-   }
-}
diff --git a/fs/btrfs/print-tree.h b/fs/btrfs/print-tree.h
index 7faddfa..9dd56b9 100644
--- a/fs/btrfs/print-tree.h
+++ b/fs/btrfs/print-tree.h
@@ -19,5 +19,4 @@
 #ifndef __PRINT_TREE_
 #define __PRINT_TREE_
 void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l);
-void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *c);
 #endif
-- 
2.7.4





Re: btrfs ways to travel back in time

2016-03-25 Thread Alexander Fougner
2016-03-23 23:31 GMT+01:00 Vytautas D :
>
>> atime). Also, this might break some configurations not expecting the
>> set-default method
>
> I have never seen this before. Can you expand on this or provide a link so I
> can read more about such a limitation?
>>

Ubuntu, for instance. Layout as /@ for root FS and /@home for /home.
Set-default will break this.

>> >  2. reboot
>> > b) always mount into snapshot
>> >  1. mount -o subvol=/.current $disk # at initrd
>> >  2. btrfs subvol del /.current
>> >  3. btrfs subvol snapshot /.snapshotA /.current
>> >  4. reboot
>> > c) rsync
>> >  1. rsync $options /.snapshotA /.current
>> >  2. reboot
>>
>> Depending on how broken the setup is, I'd probably go for the btrfs
>> sub snap ./__snapshots/@oldsnap ./@current approach.
> Is there a technical reason behind this?
>

I'm not sure I understand what "this" refers to, but another way to do
it would be to just move the snapshot and set it as rw. But that consumes
the snapshot, leaving it unavailable for further use.


>> If it is very broken (as in not bootable), then a temporary boot into
>> a readonly snapshot might be required. This is quite easy to do in
>> the grub menu, changing the boot parameter. I've also heard of
>> symlinking to the actual subvolume you want to use, and resymlinking it
>> when an older snapshot is desired.
>
> Just to make it clear, I don't have a broken system and have full control. I
> am interested in strategies and reasoning.

What I meant was that there are different levels of "breaking".
Depending on how bad the failure is, a rollback on the live FS might
be sufficient.

mv @ @broken
btrfs sub snap __snap/@_$date @
reboot

>>
>> >
>> > - Vytas Dauksa


Re: Possible Raid Bug

2016-03-25 Thread Duncan
Patrik Lundquist posted on Fri, 25 Mar 2016 13:48:08 +0100 as excerpted:

> On 25 March 2016 at 12:49, Stephen Williams 
> wrote:
>>
>> So catch 22, you need all the drives otherwise it won't let you mount,
>> But what happens if a drive dies and the OS doesn't detect it? BTRFS
>> wont allow you to mount the raid volume to remove the bad disk!
> 
> Version of Linux and btrfs-progs?

Yes, please.  This can be very critical information, as many bugs known to 
exist in older versions are fixed in newer ones, and occasionally new bugs 
are introduced as well that don't affect older versions.

> You can't have a raid10 with fewer than 4 devices, so you need to add a
> new device before deleting the missing one. That is of course still a
> problem with a read-only fs.
> 
> btrfs replace is also the recommended way to replace a failed device
> nowadays. The wiki is outdated.

In theory, what it's supposed to do in a missing device situation that 
takes it below the minimum (four devices for a raid10) for a given raid 
mode, is allow writable mounting, unless the number of missing devices is 
too high (more than one missing on raid10) to allow functional degraded 
operation.

What it will often end up doing in that case, since it can't write the 
full raid10, is this: once the current raid10 chunks fill up and it needs 
to create more, lacking enough devices for raid10, it degrades to 
creating them in raid1 mode.

The problem, however, is that on subsequent mounts, btrfs will see that 
single chunk in addition to the raid10 chunks, and will see the missing 
device, and knowing single mode is broken with /any/ missing devices, 
will at that point only mount read-only.

That's a currently known bug, which effectively means you may well get 
only one read-write mount to fix the problem, before btrfs will see that 
new single chunk created in the first degraded writable mount, and will 
refuse to mount writable again.

There are patches available that will fix this known bug by changing this 
detection to per-chunk, instead of per-filesystem.  The degraded-writable 
mount will still degrade to writing single chunks, but btrfs will see 
that all single chunks are accounted for, and all raid10 chunks only have 
one device missing and thus can still be used, and the filesystem will 
thus continue to be write mountable, unless of course another device 
fails.

But AFAIK, those patches were part of a patch set (the hot-spare patches) 
that as a whole wasn't picked for 4.5, tho by rights the per-chunk 
checking patches should have been cherry-picked as ready and fixing an 
existing bug, but weren't.  So as of 4.5, AFAIK, they still have to be 
applied separately before build.  Hopefully they'll be in 4.6.

However, while lack of the per-chunk checking patch would be expected to 
allow exactly one degraded-writable mount before further ones are refused, 
it looks like you may have a possibly related, but definitely more severe, 
bug, as it appears you aren't even being allowed that expected one-shot 
degraded-writable mount.  That is, unless you got it to work once and 
didn't mention it, and unless that btrfs fi usage output was from before 
that writable mount, as it doesn't show the single-mode chunk that would 
then prevent further writable mounts.

And without that, as mentioned, you have a problem, since you have to 
have a writable mount to repair the filesystem, and it's not allowing you 
even that one-shot writable mount that should be possible even with that 
known bug.

Assuming you're using a current kernel and post that information, it's 
quite likely the dev working on the other bug will be interested.  He'll 
probably have you build a kernel with those patches to see if that alone 
fixes it, and if it doesn't, have you try various debugging patches to 
home in on the problem, so he can hopefully duplicate the problem himself 
and ultimately come up with a fix.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID-1 refuses to balance large drive

2016-03-25 Thread Henk Slager
On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
 wrote:
> On 23 March 2016 at 20:33, Chris Murphy  wrote:
>>
>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton  wrote:
>> >
>> > I am surprised to hear it said that having the mixed sizes is an odd
>> > case.
>>
>> Not odd as in wrong, just uncommon compared to other arrangements being 
>> tested.
>
> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
> where you replace an old smaller drive with the latest and largest
> when you need more storage.
>
> My raid1 currently consists of 6TB+3TB+3*2TB.

For the original OP situation, with chunks all filled up with extents
and devices all filled up with chunks, 'integrating' a new 6TB drive
into a 4TB+3TB+2TB raid1 array could probably be done in a somewhat
unusual way in order to avoid immediate balancing needs:
- 'plug-in' the 6TB
- btrfs-replace  4TB by 6TB
- btrfs fi resize max 6TB_devID
- btrfs-replace  2TB by 4TB
- btrfs fi resize max 4TB_devID
- 'unplug' the 2TB

So then there would be 2 devices with roughly 2TB of space available
each, good for continued btrfs raid1 writes.

An offline variant with dd instead of btrfs-replace could also be done
(I used to do that sometimes before btrfs-replace was implemented).
My experience is that btrfs-replace runs at roughly maximum speed (i.e.
the hard disk's magnetic media transfer speed) during the whole replace
process, and it does what you actually want in a more direct way.  So in
total the device replace/upgrade is usually much faster than with the
add+delete method, and raid1 redundancy stays active the whole time.  Of
course it means first making sure the system runs an up-to-date/latest
kernel+tools.


scrub: Tree block spanning stripes, ignored

2016-03-25 Thread Ivan P
Hello,

using kernel 4.4.5 and btrfs-progs 4.4.1, today I ran a scrub on my
2x1TB btrfs raid1 array; it finished with 36 unrecoverable errors
[1], all blaming tree block 741942071296. Running "btrfs check
--readonly" on one of the devices lists that extent as corrupted [2].

How can I recover, how much did I really lose, and how can I prevent
it from happening again?
If you need me to provide more info, do tell.

[1] http://cwillu.com:8080/188.110.141.36/1
[2] http://pastebin.com/xA5zezqw

Regards,
Soukyuu

P.S.: please add me to CC when replying as I did not subscribe to the
mailing list. Majordomo won't let me use my hotmail address and I
don't want that much traffic on this address.


Re: [PATCH v2] fstests: add btrfs test for fsync after snapshot deletion

2016-03-25 Thread Eryu Guan
On Thu, Mar 24, 2016 at 08:08:36PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we delete a snapshot, delete its parent directory, create
> another directory with the same name as that parent and then fsync either
> the new directory or a file inside the new directory, the fsync succeeds,
> the fsync log is replayable and produces a correct result.
> 
> This is motivated by a bug that is fixed by the following patch for
> btrfs (linux kernel):
> 
>   Btrfs: fix unreplayable log after snapshot deletion and parent
>   re-creation
> 
> Signed-off-by: Filipe Manana 

Looks good to me; the test failed on a v4.5 kernel and passed with the above
patch (and its dependencies) applied.

Reviewed-by: Eryu Guan 


[PATCH] Btrfs: don't use src fd for printk

2016-03-25 Thread Josef Bacik
The fd we pass in may not be on a btrfs file system, so don't try to do
BTRFS_I() on it.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8bbecda..b0d1345 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1644,7 +1644,7 @@ static noinline int 
btrfs_ioctl_snap_create_transid(struct file *file,
 
src_inode = file_inode(src.file);
if (src_inode->i_sb != file_inode(file)->i_sb) {
-   btrfs_info(BTRFS_I(src_inode)->root->fs_info,
+   btrfs_info(BTRFS_I(file_inode(file))->root->fs_info,
   "Snapshot src from another FS");
ret = -EXDEV;
} else if (!inode_owner_or_capable(src_inode)) {
-- 
2.5.0



Re: [PATCH] fstests: add btrfs test for fsync after snapshot deletion

2016-03-25 Thread Eryu Guan
On Fri, Mar 25, 2016 at 12:58:52PM +0100, Holger Hoffstätte wrote:
> On 03/25/16 04:53, Eryu Guan wrote:
> > Test fails on v4.5 kernel as expected, but I failed to compile btrfs
> > after applying this patch, seems btrfs_must_commit_transaction was not
> > defined anywhere (I did grep it through the kernel tree, nothing showed
> > up), did I miss anything?
> > 
> > fs/btrfs/tree-log.c: In function ‘check_parent_dirs_for_sync’:
> > fs/btrfs/tree-log.c:4836:4: error: implicit declaration of function 
> > ‘btrfs_must_commit_transaction’ [-Werror=implicit-function-declaration]
> > if (btrfs_must_commit_transaction(trans, inode))
> > ^
> > cc1: some warnings being treated as errors
> > make[1]: *** [fs/btrfs/tree-log.o] Error 1
> > make: *** [_module_fs/btrfs] Error 2
> 
> It was defined in a previous patch for 4.6:
> 
> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.6&id=2be63d5ce929603d4e7cedabd9e992eb34a0ff95

I was testing against the 4.5 kernel. Thanks for pointing it out, it
compiles now!

Thanks,
Eryu


Re: RAID-1 refuses to balance large drive

2016-03-25 Thread Patrik Lundquist
On 23 March 2016 at 20:33, Chris Murphy  wrote:
>
> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton  wrote:
> >
> > I am surprised to hear it said that having the mixed sizes is an odd
> > case.
>
> Not odd as in wrong, just uncommon compared to other arrangements being 
> tested.

I think support for mixed drive sizes in raid1 is a killer feature for
a home NAS, where you replace an old, smaller drive with the latest and
largest when you need more storage.

My raid1 currently consists of 6TB+3TB+3*2TB.
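
As a rule of thumb for sizing such an array (my own back-of-envelope
arithmetic, so double-check it): btrfs raid1 puts each chunk on the two
devices with the most free space, so usable capacity is roughly half
the raw total as long as the largest device is no bigger than all the
others combined. Here 6+3+2+2+2 = 15TB raw gives about 7.5TB usable,
since 6TB <= 3+2+2+2 = 9TB.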


Re: Aw: cannot repair raid6 volume rescue zero-log crashed

2016-03-25 Thread Duncan
Jan Koester posted on Fri, 25 Mar 2016 12:02:29 +0100 as excerpted:

> with btrfs tools 4.5 i got this message:

Unfortunately this isn't going to be a lot of direct help in regard to 
your specific situation as I'm simply a btrfs using admin and list 
regular, not a dev, and I don't use btrfs raid56 mode here at all, both 
because it doesn't fit my use-case (I use raid1 mode, with backups =:^), 
and because btrfs raid56 mode isn't yet appropriately mature enough to 
handle my use-case even if parity raid would otherwise be an appropriate 
choice.  However, as I don't see any other answers, here's some rather 
generic notes:

These first points are btrfs generic, not specific to raid56 mode.

1) Btrfs in general is considered stabilizing, but not yet fully stable 
and mature.  As such, backups are extremely strongly recommended, more so 
than with fully stable and mature filesystems, unless you are using 
purely testing data that's trivial enough you simply don't care if it 
dies.

2) Additionally, given the speed at which btrfs is still changing and the 
fact that this list is mainline focused, not specific distro focused, 
list-recommended kernels are in two tracks based on mainline kernels, 
current track and LTS.  On the current track, the latest two current 
kernel series are recommended and best supported.  With 4.5 out, that's 
kernel series 4.5 and 4.4.

On the LTS track, until recently it was again the latest two, but 
mainline LTS kernel versions, which would be the 4.4 and 4.1 LTS kernel 
series.  However, as btrfs stabilizes and because the previous LTS 
kernel, 3.18, was relatively stable as well, while newer is recommended, 
we do recognize that more conservative users may wish to stay a bit 
further back, and as such LTS series 3.18 remains supported to some 
extent as well.

3) While this list is mainline focused, we do recognize that various 
distros support btrfs on kernels outside the above recommended mainline 
current and LTS track versions.  However, as we're mainline focused, we 
don't track what patches they may or may not have backported to whatever 
kernels they are running, and thus, while we'll do our best to help, 
often that "best" is going to be asking that you try with something newer 
and report back the results from that, if need be.

Alternatively, you may of course turn to the support your distro is 
providing for btrfs on that kernel, as they're better positioned to know 
what exactly they've backported and what they haven't, which would then 
make it a matter between you and your distro, rather than between you and 
the list.

It can be noted here that kernel 4.2, specifically, is not a mainline LTS 
track kernel, which means it's subject to the current track kernel 
upgrade rules, and support for mainline 4.2 series is now expired with no 
further patches being backported to it.  Therefore, the recommendation, 
both from a general mainline kernel perspective and from the btrfs 
specific perspective, would be to upgrade to something within current 
support scope, presently 4.4 and 4.5, or switch to the LTS track, as 
mentioned, 4.4, 4.1 and 3.18, the alternative being looking to your 
distro for longer term support if they've chosen to provide it for 4.2.

4) In terms of the btrfs-progs userspace, during normal runtime, most 
commands simply invoke kernel code, so userspace code isn't as critical.  
However, once you're dealing with a filesystem that's failing to mount, 
and trying to repair it using btrfs check and other userspace tools, or 
retrieve files from the unmounted filesystem using btrfs restore, then 
it's actually userspace code doing the work, and it's at that point that 
the userspace version becomes critical, as newer versions have the newest 
repair and restore code to best deal with problems only recently 
understood.

In this regard you're current, as you're now running btrfs-progs 4.5.
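
As a rough sketch of that escalation path (device and target paths here
are placeholders, and --repair is a last resort best run only on list
advice):

  btrfs check /dev/sdX                  # read-only diagnosis first
  btrfs restore /dev/sdX /mnt/backup    # copy files off the unmounted fs
  btrfs check --repair /dev/sdX         # may make things worse; last resort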


Those are the generic points applying to btrfs in general.  For btrfs 
raid56 mode more specifically...

5) Btrfs raid56 mode, while nominally complete with kernel 3.19, had
show-stopper-critical bugs into the 4.1 development cycle, and while those
were fixed by the 4.1 and later 4.2 releases, btrfs raid56 mode code remains
somewhat less stable than btrfs code in general.  As such, using the very 
latest code, kernel 4.5 and its matching 4.5 userspace, is extremely 
strongly recommended.

6) In addition, while there are no specific show-stopper-level bugs in
the raid56 code that I know of, there remains one known in-practice
critical bug that hasn't been tracked down: the fact that in some cases,
device replacement and array rebuild can be /extremely/ slow, to the
point where it can take weeks to return to full undegraded mode.
Unfortunately, the entire filesystem is at risk during that extended
rebuild, due to the real risk of further loss of devices while the
filesystem is already degraded.  With the length of that high-risk
rebuild time so extended, the risk of further device loss before the
rebuild finishes goes up accordingly.

Re: Possible Raid Bug

2016-03-25 Thread Patrik Lundquist
On 25 March 2016 at 12:49, Stephen Williams  wrote:
>
> So catch 22, you need all the drives otherwise it won't let you mount,
> But what happens if a drive dies and the OS doesn't detect it? BTRFS
> wont allow you to mount the raid volume to remove the bad disk!

Version of Linux and btrfs-progs?

You can't have a raid10 with fewer than 4 devices, so you need to add a
new device before deleting the missing one. That is of course still a
problem with a read-only fs.

btrfs replace is also the recommended way to replace a failed device
nowadays. The wiki is outdated.
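
Roughly, with hypothetical device names (say the dead disk was devid 4
and /dev/sdf is its replacement), and modulo the writable-degraded-mount
bug discussed elsewhere in this thread:

  mount -o degraded /dev/sdb /mnt
  btrfs replace start 4 /dev/sdf /mnt
  btrfs replace status /mnt

rather than the old 'device add' plus 'device delete missing' dance.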


[PATCH v2] fstests: add btrfs test for fsync after snapshot deletion

2016-03-25 Thread fdmanana
From: Filipe Manana 

Test that if we delete a snapshot, delete its parent directory, create
another directory with the same name as that parent and then fsync either
the new directory or a file inside the new directory, the fsync succeeds,
the fsync log is replayable and produces a correct result.

This is motivated by a bug that is fixed by the following patch for
btrfs (linux kernel):

  Btrfs: fix unreplayable log after snapshot deletion and parent
  re-creation

Signed-off-by: Filipe Manana 
---

V2: Renamed local function _populate_fs to populate_testdir.

 tests/btrfs/120 | 91 +
 tests/btrfs/120.out | 12 +++
 tests/btrfs/group   |  1 +
 3 files changed, 104 insertions(+)
 create mode 100755 tests/btrfs/120
 create mode 100644 tests/btrfs/120.out

diff --git a/tests/btrfs/120 b/tests/btrfs/120
new file mode 100755
index 000..329b45c
--- /dev/null
+++ b/tests/btrfs/120
@@ -0,0 +1,91 @@
+#! /bin/bash
+# FSQA Test No. 120
+#
+# Test that if we delete a snapshot, delete its parent directory, create
+# another directory with the same name as that parent and then fsync either
+# the new directory or a file inside the new directory, the fsync succeeds,
+# the fsync log is replayable and produces a correct result.
+#
+#-----------------------------------------------------------------------
+#
+# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+populate_testdir()
+{
+   _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
+   $SCRATCH_MNT/testdir/snap
+   _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap
+   rmdir $SCRATCH_MNT/testdir
+   mkdir $SCRATCH_MNT/testdir
+}
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+mkdir $SCRATCH_MNT/testdir
+populate_testdir
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
+_flakey_drop_and_remount
+
+echo "Filesystem contents after the first log replay:"
+ls -R $SCRATCH_MNT | _filter_scratch
+
+# Now do the same as before but instead of doing an fsync against the directory,
+# do an fsync against a file inside the directory.
+
+populate_testdir
+touch $SCRATCH_MNT/testdir/foobar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foobar
+_flakey_drop_and_remount
+
+echo "Filesystem contents after the second log replay:"
+ls -R $SCRATCH_MNT | _filter_scratch
+
+_unmount_flakey
+status=0
+exit
diff --git a/tests/btrfs/120.out b/tests/btrfs/120.out
new file mode 100644
index 000..4210bfa
--- /dev/null
+++ b/tests/btrfs/120.out
@@ -0,0 +1,12 @@
+QA output created by 120
+Filesystem contents after the first log replay:
+SCRATCH_MNT:
+testdir
+
+SCRATCH_MNT/testdir:
+Filesystem contents after the second log replay:
+SCRATCH_MNT:
+testdir
+
+SCRATCH_MNT/testdir:
+foobar
diff --git a/tests/btrfs/group b/tests/btrfs/group
index d312874..13aa1e5 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -120,3 +120,4 @@
 117 auto quick send clone
 118 auto quick snapshot metadata
 119 auto quick snapshot metadata qgroup
+120 auto quick snapshot metadata
-- 
2.7.0.rc3
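
(For anyone wanting to try it: assuming an fstests checkout configured
with TEST_DEV/SCRATCH_DEV as usual, the test should run with

  sudo ./check btrfs/120

from the fstests top-level directory.)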



Re: [PATCH] fstests: add btrfs test for fsync after snapshot deletion

2016-03-25 Thread Holger Hoffstätte
On 03/25/16 04:53, Eryu Guan wrote:
> Test fails on v4.5 kernel as expected, but I failed to compile btrfs
> after applying this patch, seems btrfs_must_commit_transaction was not
> defined anywhere (I did grep it through the kernel tree, nothing showed
> up), did I miss anything?
> 
> fs/btrfs/tree-log.c: In function ‘check_parent_dirs_for_sync’:
> fs/btrfs/tree-log.c:4836:4: error: implicit declaration of function 
> ‘btrfs_must_commit_transaction’ [-Werror=implicit-function-declaration]
> if (btrfs_must_commit_transaction(trans, inode))
> ^
> cc1: some warnings being treated as errors
> make[1]: *** [fs/btrfs/tree-log.o] Error 1
> make: *** [_module_fs/btrfs] Error 2

It was defined in a previous patch for 4.6:

https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.6&id=2be63d5ce929603d4e7cedabd9e992eb34a0ff95

-h



Possible Raid Bug

2016-03-25 Thread Stephen Williams
Hi,

Find instructions on how to recreate below -

I have a BTRFS raid 10 setup in VirtualBox (I'm getting to grips with
the filesystem).
I have the raid mounted to /mnt like so -
 
[root@Xen ~]# btrfs filesystem show /mnt/
Label: none  uuid: ad1d95ee-5cdc-420f-ad30-bd16158ad8cb
Total devices 4 FS bytes used 1.00GiB
        devid 1 size 2.00GiB used 927.00MiB path /dev/sdb
        devid 2 size 2.00GiB used 927.00MiB path /dev/sdc
        devid 3 size 2.00GiB used 927.00MiB path /dev/sdd
        devid 4 size 2.00GiB used 927.00MiB path /dev/sde
And -
[root@Xen ~]# btrfs filesystem usage /mnt/
Overall:
    Device size:                   8.00GiB
    Device allocated:              3.62GiB
    Device unallocated:            4.38GiB
    Device missing:                  0.00B
    Used:                          2.00GiB
    Free (estimated):              2.69GiB      (min: 2.69GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               16.00MiB      (used: 0.00B)

Data,RAID10: Size:1.50GiB, Used:1.00GiB
   /dev/sdb  383.50MiB
   /dev/sdc  383.50MiB
   /dev/sdd  383.50MiB
   /dev/sde  383.50MiB

Metadata,RAID10: Size:256.00MiB, Used:1.16MiB
   /dev/sdb   64.00MiB
   /dev/sdc   64.00MiB
   /dev/sdd   64.00MiB
   /dev/sde   64.00MiB

System,RAID10: Size:64.00MiB, Used:16.00KiB
   /dev/sdb   16.00MiB
   /dev/sdc   16.00MiB
   /dev/sdd   16.00MiB
   /dev/sde   16.00MiB

Unallocated:
   /dev/sdb1.55GiB
   /dev/sdc1.55GiB
   /dev/sdd1.55GiB
   /dev/sde1.55GiB

Right, so everything looks good, and I stuck some dummy files in there
too -
[root@Xen ~]# ls -lh /mnt/
total 1.1G
-rw-r--r-- 1 root root 1.0G May 30  2008 1GB.zip
-rw-r--r-- 1 root root   28 Mar 24 15:16 hello
-rw-r--r-- 1 root root6 Mar 24 16:12 niglu
-rw-r--r-- 1 root root4 Mar 24 15:32 test

The bug appears to happen when you try to test out its ability to
handle a dead drive.
If you follow the instructions here:
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Replacing_failed_devices
it tells you to mount the drive with the 'degraded' option; however,
this just does not work. Allow me to show -

1) I power off the VM and remove one of the drives (Simulating a drive
being pulled from a machine)
2) Power on the VM
3) Check DMESG - Everything looks good
4) Check how BTRFS is feeling -

Label: none  uuid: ad1d95ee-5cdc-420f-ad30-bd16158ad8cb
Total devices 4 FS bytes used 1.00GiB
        devid 1 size 2.00GiB used 1.31GiB path /dev/sdb
        devid 2 size 2.00GiB used 1.31GiB path /dev/sdc
        devid 3 size 2.00GiB used 1.31GiB path /dev/sdd
*** Some devices missing

So far so good, /dev/sde is missing and BTRFS has detected this.
5) Try and mount it as per the wiki so I can remove the bad drive and
replace it with a good one -

[root@Xen ~]# mount -o degraded /dev/sdb /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

Ok, this is not good, I check DMESG -

[root@Xen ~]# dmesg | tail
[4.416445] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: RX
[4.416672] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready
[4.631812] snd_intel8x0 :00:05.0: white list rate for 1028:0177
is 48000
[7.091047] floppy0: no floppy controllers found
[   27.488345] BTRFS info (device sdb): allowing degraded mounts
[   27.488348] BTRFS info (device sdb): disk space caching is enabled
[   27.488349] BTRFS: has skinny extents
[   27.489794] BTRFS warning (device sdb): devid 4 uuid
ebcd53d9-5956-41d9-b0ef-c59d08e5830f is missing
[   27.491465] BTRFS: missing devices(1) exceeds the limit(0), writeable
mount is not allowed
[   27.520231] BTRFS: open_ctree failed

So here lies the problem - BTRFS needs you to have all the devices
present in order to mount it as writeable; however, if a drive dies
spectacularly (as they can do), you can't satisfy that. As a result you
cannot mount the remaining drives writeable and fix the problem. Now you
ARE able to mount it read-only, but then you can't issue the fix that is
recommended on the wiki, see here -

[root@Xen ~]# mount -o ro,degraded /dev/sdb /mnt/
[root@Xen ~]# btrfs device delete missing /mnt/
ERROR: error removing device 'missing': Read-only file system

So catch-22: you need all the drives, otherwise it won't let you mount
writeable. But what happens if a drive dies and the OS doesn't detect
it? BTRFS won't allow you to mount the raid volume to remove the bad
disk!


Aw: cannot repair raid6 volume rescue zero-log crashed

2016-03-25 Thread Jan Koester

 
With btrfs tools 4.5 I got this message:

Starting program: /root/btrfs-progs/btrfsck --repair --init-extent-tree 
/dev/disk/by-uuid/73d4dc77-6ff3-412f-9b0a-0d11458faf32
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
enabling repair mode
repair mode will force to clear out log tree, Are you sure? [y/N]: y
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs check(btrfs_reserve_extent+0x99e)[0x4500e6]
btrfs check(btrfs_alloc_free_block+0x60)[0x450479]
btrfs check(__btrfs_cow_block+0x1a7)[0x43f394]
btrfs check(btrfs_cow_block+0x102)[0x43fe19]
btrfs check[0x4458a2]
btrfs check(btrfs_commit_transaction+0xec)[0x447758]
btrfs check(cmd_check+0x633)[0x42c2a9]
btrfs check(main+0x155)[0x40a193]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x76f8ea40]
btrfs check(_start+0x29)[0x409d89]
[Inferior 1 (process 5460) exited with code 01]

Starting program: /root/btrfs-progs/btrfs rescue zero-log 
/dev/disk/by-uuid/73d4dc77-6ff3-412f-9b0a-0d11458faf32
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
parent transid verify failed on 2280450637824 wanted 861168 found 860380
parent transid verify failed on 2280450637824 wanted 861168 found 860380
checksum verify failed on 2280450637824 found BF5F5D16 wanted AE725F92
checksum verify failed on 2280450637824 found BF5F5D16 wanted AE725F92
bytenr mismatch, want=2280450637824, have=15938376490240
Clearing log on /dev/disk/by-uuid/73d4dc77-6ff3-412f-9b0a-0d11458faf32, 
previous log_root 2280534142976, level 0
parent transid verify failed on 2280260939776 wanted 861166 found 860368
parent transid verify failed on 2280260939776 wanted 861166 found 860368
checksum verify failed on 2280260939776 found 816E966C wanted CB60A223
checksum verify failed on 2280260939776 found 816E966C wanted CB60A223
bytenr mismatch, want=2280260939776, have=15937230354176
[the same parent transid / checksum verify / bytenr mismatch lines
repeat several more times; output truncated]

Re: btrfs_destroy_inode WARN_ON.

2016-03-25 Thread Markus Trippelsdorf
On 2016.03.24 at 18:54 -0400, Dave Jones wrote:
> Just hit this on a tree from earlier this morning, v4.5-11140 or so.
> 
> WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 
> btrfs_destroy_inode+0x389/0x3f0 [btrfs]
> CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
>  c039baf9 ef721ef0 88025966fc08 8957bcdb
>    88025966fc50 890b41f1
>  88045d918040 242d4eed6048 88024eed6048 88024eed6048
> Call Trace:
>  [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
>  [] dump_stack+0x68/0x9d
>  [] __warn+0x111/0x130
>  [] warn_slowpath_null+0x1d/0x20
>  [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
>  [] destroy_inode+0x67/0x90
>  [] evict+0x1b7/0x240
>  [] iput+0x3ae/0x4e0
>  [] ? dput+0x20e/0x460
>  [] do_unlinkat+0x256/0x440
>  [] ? do_rmdir+0x350/0x350
>  [] ? syscall_trace_enter_phase1+0x87/0x260
>  [] ? enter_from_user_mode+0x50/0x50
>  [] ? __lock_is_held+0x25/0xd0
>  [] ? mark_held_locks+0x22/0xc0
>  [] ? syscall_trace_enter_phase2+0x12d/0x3d0
>  [] ? SyS_rmdir+0x20/0x20
>  [] SyS_unlinkat+0x1b/0x30
>  [] do_syscall_64+0xf4/0x240
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> ---[ end trace a48ce4e6a1b5e409 ]---
> 
> 
> That's WARN_ON(BTRFS_I(inode)->csum_bytes);
> 
> *maybe* it's a bad disk, but there's no indication in dmesg of anything awry.
> Spinning rust on SATA, nothing special.

Same thing here:

Mar 24 10:37:27 x4 kernel: [ cut here ]
Mar 24 10:37:27 x4 kernel: WARNING: CPU: 3 PID: 11838 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: CPU: 3 PID: 11838 Comm: rm Not tainted 
4.5.0-11787-ga24e3d414e59-dirty #64
Mar 24 10:37:27 x4 kernel: Hardware name: System manufacturer System Product 
Name/M4A78T-E, BIOS 3503 04/13/2011
Mar 24 10:37:27 x4 kernel:  813c0d1a 81b8bb84 
812ffd0b
Mar 24 10:37:27 x4 kernel: 81099a9a  880149b86088 
88021585f000
Mar 24 10:37:27 x4 kernel: 812ffd0b  88005f526000 

Mar 24 10:37:27 x4 kernel: Call Trace:
Mar 24 10:37:27 x4 kernel: [] ? dump_stack+0x46/0x6c
Mar 24 10:37:27 x4 kernel: [] ? 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: [] ? warn_slowpath_null+0x5a/0xe0
Mar 24 10:37:27 x4 kernel: [] ? 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: [] ? do_unlinkat+0x13c/0x3e0
Mar 24 10:37:27 x4 kernel: [] ? 
entry_SYSCALL_64_fastpath+0x13/0x8f
Mar 24 10:37:27 x4 kernel: ---[ end trace e9bae5be848e7a9e ]---

-- 
Markus


Re: systemd : Timed out waiting for device dev-disk-by…

2016-03-25 Thread Qu Wenruo

Hi,

Although the post is almost one year old, I'm quite interested in the
long mount time.


Any info about the fs other than that it's a 12 x 4TB raid10?

We're investigating such long mount times, but unfortunately we haven't
found a good way to reproduce them (we don't have 12 devices, though).


Thanks,
Qu

Vincent Olivier wrote on 2015/07/24 14:41 -0400:

Hi,

(Sorry if this gets sent twice: one of my mail relays is misbehaving today)

50% of the time when booting, the system goes into safe mode because my
12x 4TB RAID10 btrfs is taking too long to mount from fstab.

When I comment it out from fstab and mount it manually, it’s all good.

I don’t like that. Is there a way to increase the timer or something?

Thanks,

Vincent
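
(For the quoted question, the usual answer - a sketch, assuming a
systemd-based distro and a hypothetical mount point - is to give the
device a longer timeout in fstab, e.g.:

  UUID=<fs-uuid>  /mnt/raid  btrfs  defaults,x-systemd.device-timeout=10min  0  0

so systemd waits longer for the array before dropping to emergency
mode.)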




