[PATCH v3] Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent

2015-02-09 Thread Forrest Liu
btrfs_release_extent_buffer_page() can't properly handle dummy extent
buffers that are allocated by btrfs_clone_extent_buffer(). That is because
the reference count of pages allocated by btrfs_clone_extent_buffer()
is 2: one from alloc_page(), and another from attach_extent_buffer_page().
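
The reference counting described above can be illustrated with a small userspace model. This is only a sketch: `toy_page`, `toy_alloc_page()`, `toy_attach()`, `toy_release()` and `toy_dummy_leftover()` are hypothetical names invented for the illustration, not kernel APIs, and the model ignores locking and everything else the real function does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the fields the illustration needs. */
struct toy_page {
	int refcount;      /* models the page reference count */
	bool has_private;  /* models PagePrivate(page) */
};

/* Models alloc_page(): a fresh page starts with one reference. */
void toy_alloc_page(struct toy_page *page)
{
	page->refcount = 1;
	page->has_private = false;
}

/* Models attach_extent_buffer_page(): attaching takes a second reference. */
void toy_attach(struct toy_page *page)
{
	assert(!page->has_private);
	page->has_private = true;
	page->refcount++;
}

/*
 * Models the release path. With fixed == false the private reference is
 * dropped only for mapped buffers; with fixed == true it is dropped
 * whenever the page is still attached. Returns the reference count left
 * behind; anything above 0 means the page is leaked.
 */
int toy_release(struct toy_page *page, bool mapped, bool fixed)
{
	if ((mapped || fixed) && page->has_private) {
		page->has_private = false;
		page->refcount--;	/* one for the page private */
	}
	page->refcount--;		/* one for when the page was allocated */
	return page->refcount;
}

/* Allocates, attaches and releases one page of an unmapped (dummy) buffer. */
int toy_dummy_leftover(bool fixed)
{
	struct toy_page page;

	toy_alloc_page(&page);
	toy_attach(&page);
	return toy_release(&page, false, fixed);
}
```

In this model `toy_dummy_leftover(false)` leaves one reference behind, while `toy_dummy_leftover(true)` drops both.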

Running the following command repeatedly reproduces this memory leak:

btrfs inspect-internal inode-resolve 256 /mnt/btrfs

Signed-off-by: Chien-Kuan Yeh c...@synology.com
Signed-off-by: Forrest Liu forre...@synology.com
---
V2: do not call PagePrivate if page is NULL
V3: add reproducing step in commit message

 fs/btrfs/extent_io.c | 51 ++-
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 790dbae..9de93ee 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4554,36 +4554,37 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 	do {
 		index--;
 		page = eb->pages[index];
-		if (page && mapped) {
+		if (!page)
+			continue;
+		if (mapped)
 			spin_lock(&page->mapping->private_lock);
+		/*
+		 * We do this since we'll remove the pages after we've
+		 * removed the eb from the radix tree, so we could race
+		 * and have this page now attached to the new eb.  So
+		 * only clear page_private if it's still connected to
+		 * this eb.
+		 */
+		if (PagePrivate(page) &&
+		    page->private == (unsigned long)eb) {
+			BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+			BUG_ON(PageDirty(page));
+			BUG_ON(PageWriteback(page));
 			/*
-			 * We do this since we'll remove the pages after we've
-			 * removed the eb from the radix tree, so we could race
-			 * and have this page now attached to the new eb.  So
-			 * only clear page_private if it's still connected to
-			 * this eb.
+			 * We need to make sure we haven't be attached
+			 * to a new eb.
 			 */
-			if (PagePrivate(page) &&
-			    page->private == (unsigned long)eb) {
-				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-				BUG_ON(PageDirty(page));
-				BUG_ON(PageWriteback(page));
-				/*
-				 * We need to make sure we haven't be attached
-				 * to a new eb.
-				 */
-				ClearPagePrivate(page);
-				set_page_private(page, 0);
-				/* One for the page private */
-				page_cache_release(page);
-			}
-			spin_unlock(&page->mapping->private_lock);
-
-		}
-		if (page) {
-			/* One for when we alloced the page */
+			ClearPagePrivate(page);
+			set_page_private(page, 0);
+			/* One for the page private */
 			page_cache_release(page);
 		}
+
+		if (mapped)
+			spin_unlock(&page->mapping->private_lock);
+
+		/* One for when we alloced the page */
+		page_cache_release(page);
 	} while (index != 0);
 }
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group

2015-02-09 Thread Forrest Liu
Removing a large number of block groups in a single transaction may trigger
the BUG_ON() in btrfs_orphan_add(). That is because
btrfs_orphan_reserve_metadata() grabs its metadata reservation from the
transaction handle, and btrfs_delete_unused_bgs() did not reserve metadata
for the transaction handle when deleting an unused block group.

The problem can be reproduced with the following script:

mntpath=/btrfs
loopdev=/dev/loop0
filepath=/home/forrest/image

umount $mntpath
losetup -d $loopdev
truncate --size 1000g $filepath
losetup $loopdev $filepath
mkfs.btrfs -f $loopdev
mount $loopdev $mntpath

for j in `seq 1 1 1000`; do
    fallocate -l 1g $mntpath/$j
done
# wait for the cleaner thread to remove unused block groups
sleep 300

The call trace that results from the BUG_ON() is:

[  613.093084] [ cut here ]
[  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
[  613.105855] invalid opcode:  [#1] SMP
[  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) 
ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) 
aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) 
ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) 
snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) 
vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) 
i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) 
hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) 
mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
[  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: GE  
3.19.0-rc7-custom #2
[  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/31/2013
[  613.152694] task: 880035cdb1a0 ti: 880039cf4000 task.ti: 
880039cf4000
[  613.154969] RIP: 0010:[a01441c2]  [a01441c2] 
btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.157780] RSP: 0018:880039cf7c48  EFLAGS: 00010286
[  613.159560] RAX: ffe4 RBX: 88003bd981a0 RCX: 88003c9e4000
[  613.161904] RDX: 2244 RSI: 0004 RDI: 88003c9e4138
[  613.164264] RBP: 880039cf7c88 R08: 60ffc850 R09: 
[  613.166507] R10: 88003bc4b7a0 R11: eaeb6740 R12: 88003c9c
[  613.168681] R13: 88003c102160 R14: 88003c9c0458 R15: 0001
[  613.170932] FS:  () GS:88003f60() 
knlGS:
[  613.173316] CS:  0010 DS:  ES:  CR0: 80050033
[  613.175227] CR2: 7f6343537000 CR3: 36329000 CR4: 000407f0
[  613.177554] Stack:
[  613.178712]  880039cf7c88 a0182a54 88003c9e4b04 
88003c9c7800
[  613.181297]  88003bc4b7a0 88003bd981a0 88003c8db200 
88003c2fcc60
[  613.183782]  880039cf7d18 a012da97 88003bc4b7a4 
88003bc4b7a0
[  613.186171] Call Trace:
[  613.187493]  [a0182a54] ? lookup_free_space_inode+0x44/0x100 
[btrfs]
[  613.189801]  [a012da97] btrfs_remove_block_group+0x137/0x740 
[btrfs]
[  613.192126]  [a0166912] btrfs_remove_chunk+0x672/0x780 [btrfs]
[  613.194267]  [a012e2ff] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
[  613.196567]  [a0135e4c] cleaner_kthread+0x12c/0x190 [btrfs]
[  613.198687]  [a0135d20] ? check_leaf+0x350/0x350 [btrfs]
[  613.200758]  [8108f232] kthread+0xd2/0xf0
[  613.202616]  [8108f160] ? kthread_create_on_node+0x180/0x180
[  613.204738]  [8175dabc] ret_from_fork+0x7c/0xb0
[  613.206652]  [8108f160] ? kthread_create_on_node+0x180/0x180
[  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 
df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff 0f 0b 
66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
[  613.216562] RIP  [a01441c2] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.218828]  RSP 880039cf7c48
[  613.220382] ---[ end trace 71073106deb8a457 ]---

This patch replaces btrfs_join_transaction() with btrfs_start_transaction() in
btrfs_delete_unused_bgs() to prevent the BUG_ON() in btrfs_orphan_add().
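
The difference between the two calls can be sketched with a toy model of a transaction handle that carries its own metadata reservation. Everything here is an assumption made for illustration: `toy_trans`, `TOY_ITEM_BYTES` and the `toy_*` functions are hypothetical names, the item size is arbitrary, and the real code only runs out of space after many block groups are removed in one transaction, as reported above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-item reservation size; the real value depends on the fs. */
#define TOY_ITEM_BYTES 4096UL

/* Toy stand-in for a transaction handle: only its reserved metadata bytes. */
struct toy_trans {
	unsigned long bytes_reserved;
};

/* Models joining a running transaction: nothing is reserved for the handle. */
struct toy_trans toy_join_transaction(void)
{
	struct toy_trans trans = { .bytes_reserved = 0 };

	return trans;
}

/* Models starting a transaction with room for num_items metadata items. */
struct toy_trans toy_start_transaction(unsigned int num_items)
{
	struct toy_trans trans = {
		.bytes_reserved = num_items * TOY_ITEM_BYTES,
	};

	return trans;
}

/*
 * Models the orphan reservation: it takes one item's worth of space out of
 * the handle and fails with -ENOSPC when the handle has nothing to give.
 */
int toy_orphan_reserve(struct toy_trans *trans)
{
	assert(trans != NULL);
	if (trans->bytes_reserved < TOY_ITEM_BYTES)
		return -ENOSPC;
	trans->bytes_reserved -= TOY_ITEM_BYTES;
	return 0;
}
```

With a handle from `toy_join_transaction()` the orphan reservation fails, while a handle from `toy_start_transaction(1)` has exactly the one item it needs.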

Signed-off-by: Forrest Liu forre...@synology.com
---
V2: add reproducing step and call trace in commit message

 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a684086..63b974f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9611,7 +9611,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 * Want to do this before we do anything else so we can recover
 		 * properly if we fail to join the transaction.
 		 */
-		trans = btrfs_join_transaction(root);
+		/* 1 for btrfs_orphan_reserve_metadata() */
+   trans = 

[PATCH v3] Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

2015-02-09 Thread Forrest Liu
If the device tree has a hole, find_free_dev_extent() cannot find an
available address properly.

The problem can be reproduced with the following script:

mntpath=/btrfs
loopdev=/dev/loop0
filepath=/home/forrest/image

umount $mntpath
losetup -d $loopdev
truncate --size 100g $filepath
losetup $loopdev $filepath
mkfs.btrfs -f $loopdev
mount $loopdev $mntpath

# make device tree with one big hole
for i in `seq 1 1 100`; do
    fallocate -l 1g $mntpath/$i
done
sync
for i in `seq 1 1 95`; do
    rm $mntpath/$i
done
sync

# wait for the cleaner thread to remove unused block groups
sleep 300

fallocate -l 1g $mntpath/aaa

# failed to allocate new chunk
fallocate -l 1g $mntpath/bbb

The script above creates a device tree with one big hole. Only one chunk can
then be allocated from that hole within a transaction, so allocating a new
chunk for $mntpath/bbb fails. The device extent items around the hole are:

item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 106292051968 length 1073741824
item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 103108575232 length 1073741824
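
The hole handling described above can be sketched in isolation. This is a simplified model with a single pending stripe and abstract units; `toy_pending` and `toy_hole_size()` are hypothetical names for the illustration and the function is not the kernel's find_free_dev_extent().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One pending chunk stripe on the device, covering [start, start + len). */
struct toy_pending {
	uint64_t start;
	uint64_t len;
};

/*
 * Models one step of the hole search. The hole initially spans
 * [*search_start, key_offset). If the pending stripe overlaps it,
 * *search_start is moved past the stripe. With fixed == false the hole is
 * then discarded (size 0); with fixed == true its size is recomputed from
 * the new start. Returns the usable hole size.
 */
uint64_t toy_hole_size(uint64_t *search_start, uint64_t key_offset,
		       const struct toy_pending *pending, bool fixed)
{
	uint64_t hole_size;

	assert(key_offset >= *search_start);
	hole_size = key_offset - *search_start;

	/* No overlap: the whole hole is usable. */
	if (pending->start >= *search_start + hole_size ||
	    pending->start + pending->len <= *search_start)
		return hole_size;

	*search_start = pending->start + pending->len;
	if (!fixed)
		return 0;
	return key_offset >= *search_start ? key_offset - *search_start : 0;
}
```

For a hole spanning units 3 to 97 with a pending stripe at unit 3 of length 1, the unfixed variant reports no usable space, while the fixed variant reports the 93 units that remain after the stripe.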

Signed-off-by: Forrest Liu forre...@synology.com
Reviewed-by: Liu Bo bo.li@oracle.com
---
V2: fix typo key_offset
    replace WARN_ON with WARN_ON_ONCE
V3: add missing {} to stick to kernel coding style
    add reproducing step in commit message

 fs/btrfs/volumes.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 50c5a87..ddda8a0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1060,6 +1060,7 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 	struct extent_map *em;
 	struct list_head *search_list = &trans->transaction->pending_chunks;
 	int ret = 0;
+	u64 physical_start = *start;
 
 again:
 	list_for_each_entry(em, search_list, list) {
@@ -1070,9 +1071,9 @@ again:
 		for (i = 0; i < map->num_stripes; i++) {
 			if (map->stripes[i].dev != device)
 				continue;
-			if (map->stripes[i].physical >= *start + len ||
+			if (map->stripes[i].physical >= physical_start + len ||
 			    map->stripes[i].physical + em->orig_block_len <=
-			    *start)
+			    physical_start)
 				continue;
 			*start = map->stripes[i].physical +
 				em->orig_block_len;
@@ -1195,8 +1196,14 @@ again:
 			 */
 			if (contains_pending_extent(trans, device,
 						    &search_start,
-						    hole_size))
-				hole_size = 0;
+						    hole_size)) {
+				if (key.offset >= search_start) {
+					hole_size = key.offset - search_start;
+				} else {
+					WARN_ON_ONCE(1);
+					hole_size = 0;
+				}
+			}
 
 			if (hole_size > max_hole_size) {
 				max_hole_start = search_start;
-- 
1.9.1



Re: Accepting discard to free space from disk images

2015-02-09 Thread Roman Mamedov
On Mon, 09 Feb 2015 10:26:33 -0500
Devon B. devo...@virtualcomplete.com wrote:

> If you don't mind me asking, what version kernel are you running and are
> you using any special mount options?

Well actually I did not claim I have working discard through 'loop', but your
post made me curious.

$ sudo dd if=/dev/zero of=100g bs=1M seek=10 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00221052 s, 474 MB/s

$ sudo mkfs.ext4 100g
[...]

$ du -hsc 100g 
133M	100g
133M	total

$ sudo mount -o loop 100g /mnt/tmp1/

(then in a new terminal window):
$ cd /mnt/tmp1/
$ df -h .
Filesystem  Size  Used Avail Use% Mounted on
/dev/loop0   96G   60M   92G   1% /mnt/tmp1
$ sudo dd if=/dev/zero of=zerofile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.944377 s, 1.1 GB/s
$ sync

(back to the original one):
$ du -hsc 100g 
1.2G	100g
1.2G	total

(2nd window):
$ sudo fstrim .

(back to the original one):
$ du -hsc 100g 
133M	100g
133M	total

So it does work for me just fine even with 'loop'.
Kernel version 3.14.32, mount options
rw,noatime,nodiratime,compress=zlib,space_cache,inode_cache.

-- 
With respect,
Roman


Re: Accepting discard to free space from disk images

2015-02-09 Thread Roman Mamedov
On Mon, 9 Feb 2015 20:42:56 +0500
Roman Mamedov r...@romanrm.net wrote:

> On Mon, 09 Feb 2015 10:26:33 -0500
> Devon B. devo...@virtualcomplete.com wrote:
>
> > If you don't mind me asking, what version kernel are you running and are
> > you using any special mount options?
>
> Well actually I did not claim I have working discard through 'loop', but your
> post made me curious.
>
> $ sudo dd if=/dev/zero of=100g bs=1M seek=10 count=1
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.00221052 s, 474 MB/s
>
> $ sudo mkfs.ext4 100g
> [...]
>
> $ du -hsc 100g
> 133M  100g
> 133M  total
>
> $ sudo mount -o loop 100g /mnt/tmp1/
>
> (then in a new terminal window):
> $ cd /mnt/tmp1/
> $ df -h .
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/loop0   96G   60M   92G   1% /mnt/tmp1
> $ sudo dd if=/dev/zero of=zerofile bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 0.944377 s, 1.1 GB/s
> $ sync
>
> (back to the original one):
> $ du -hsc 100g
> 1.2G  100g
> 1.2G  total
>
> (2nd window):

Forgot to add I also did 'rm zerofile' here, of course.

> $ sudo fstrim .
>
> (back to the original one):
> $ du -hsc 100g
> 133M  100g
> 133M  total
>
> So it does work for me just fine even with 'loop'.
> Kernel version 3.14.32, mount options
> rw,noatime,nodiratime,compress=zlib,space_cache,inode_cache.
>


-- 
With respect,
Roman


Re: Repair broken btrfs raid6?

2015-02-09 Thread Duncan
Tobias Holst posted on Mon, 09 Feb 2015 23:45:21 +0100 as excerpted:

> So a short summary:
> - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
> - does not mount at boot up, open_ctree failed (disk 3)
> - mounts successfully after bootup
> - randomly checksum verify failed (disk 5)
> - balance and scrub crash after some time
> - after a while the volume gets unreadable, saying parent transid
> verify failed (disk 4 or 5)
>
> And it looks like there still is no way to btrfsck a raid6.
>
> Any ideas how to repair this filesystem?

(As a btrfs user/sysadmin and a list regular, not a dev, and not yet 
brave enough to try raid5/6 modes here...)

Btrfs raid6 should indeed be generally working in 3.19, including repair, 
yes.  Certainly, it's much closer to working than anything previous.

However, that code, while it actually exists now and is I believe in 
theory complete, is still very VERY new, and thus, it can be expected to 
be still quite buggy.  I've been telling people not to expect it to 
actually work for another kernel cycle (3.20), and even then, don't 
expect it to be as stable as the raid0/1/10 code, which after all has 
been in actual use for (well) over a year now, and thus has had a chance 
to have even many of the not immediately obvious bugs show up and get 
worked out.  That'll take several more kernel cycles -- I've been 
suggesting that people not consider the raid56 code as stable as the 
earlier raid forms for another two cycles (3.22) at least.

HOWEVER, without claiming to speak for the devs working on it themselves, 
now that the code is actually there and it's time to start exterminating 
bugs in it, I expect they'll be very interested in your bug report, and 
if you're prepared to spend the time working thru it with them, applying 
patches, etc, you could well find your bugs fixed and be back operational 
before 3.20 or whatever. =:^)

Meanwhile, there's actually an integration branch with even newer code 
that hasn't hit release yet.  Given the still very new state of the 
btrfs56 mode code, if you're already brave enough to be running raid6 
mode and are having problems, your chances with integration are likely to 
be even better than with current release.  Of course it could break 
things worse too, but if you're already running raid56 mode I guess 
you're already prepared for that, and are either testing with throw-away 
data or data that's already well backed up, such that you're prepared to 
lose the btrfs raid6 copy of it in any case, so you might as well try 
integration...

See the wiki or other posts for the integration branch repos.  (As I said 
above I'm not brave enough to try raid56 yet, nor have I tried 
integration, so I don't have the links handy.)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread Duncan
constantine posted on Tue, 10 Feb 2015 00:54:56 + as excerpted:

> Could you please answer two questions?:
>
> 1.  I am testing various files and all seem readable. Is there a way to
> list every file that resides on a particular device (like /dev/sdc1?) so
> as to check them?

I don't know of such a way, but there are folks here that know way more 
than me about it.

> There are a handful of files that seem corrupted,
> since I get from scrub:
>
> BTRFS: checksum error at logical 10792783298560 on dev /dev/sdc1,
> sector 737159648, root 5, inode 1376754, offset 175428419584, length
> 4096, links 1 (path: long/path/file.img) ,
> but are these the only files that could be corrupted?

Assuming you don't have any missing metadata, AFAIK that should be all 
of them.  With raid1 data and metadata, you would have had two copies of 
each chunk for both data and metadata, and if there's metadata where one 
copy existed on the missing device and the other copy is corrupted on the 
problem device... but then there'd be errors where the parents of the 
missing metadata didn't check out.  If all the scrub errors you're seeing 
can be matched to files, then you are lucky and should have at least one 
good copy of all metadata, which means only the files that scrub shows as 
corrupt should be corrupt.

>
> 2. Chris mentioned:
>
> A. On Mon, Feb 9, 2015 at 12:31 AM, Chris Murphy
> li...@colorremedies.com wrote:
> [[[try # btrfs device delete /dev/sdc1 /mnt/mountpoint]]]. Just realize
> that any data that's on both the failed drive and sdc1 will be lost
>
> and later
>
> B. On Mon, Feb 9, 2015 at 1:34 AM, Chris Murphy
> li...@colorremedies.com wrote:
> So now I have a 4 device raid1 mounted degraded. And I can still device
> delete another device.
> So one device missing and one device removed.
>
> So when I do the # btrfs device delete /dev/sdc1 /mnt/mountpoint the
> normal behavior would for the files that are located in /dev/sdc1 (and
> also were on the missing/failed drive) to be transferred to other drives
> and not lose them, right? (Does B. hold and contradict A.?)

Normally you'd device delete missing first, then device delete the other 
failing one (sdc1).

If it'll even let you delete a second device with one missing, if you're 
lucky, there will be at least one valid copy of everything on the device 
you're trying to delete and it'll just work.  However, as we already 
know, there's some corrupted files, so as long as they are there it'll 
probably error out in some way part way thru, where the one copy was on 
the missing device and the other copy is corrupted on the device you're 
trying to delete.

What you may be able to do, however, is delete the corrupted files.  Once 
they're gone and a scrub doesn't show any further corruption, you should, 
with luck, be able to device delete the failing device.

Alternatively, once you've gotten valid files for everything you can, you 
can try Chris's checksum reset suggestion, which will reset the checksum 
on all files including the bad ones.  Assuming they're stable enough on 
the failing device for the faked checksum to hold long enough to read 
them, you can then copy them to backup and test to see if they're garbage 
or at least some data worth saving is left in them.  After which you can 
of course delete them and proceed as above.

The other alternative is to try using restore on the unmounted filesystem 
for just those files, using the regex option to confine it to just those 
files (perhaps one at a time if they don't combine into a nice regex, as 
is likely).  That's last-ditch and may not work either, particularly if 
the problem device is returning different random garbage every time an 
attempt to read the corrupted blocks is made, such that the checksum 
reset doesn't work.


Personally, what I'd do if it were me, is get all the data off I could, 
physically remove that problem device, and then call that filesystem 
toast and start over with a new filesystem with what remains, giving up 
on what's there.  Then I'd restore from the backup to the new filesystem 
(or newly designed layout, however you do it).  I'd not even worry about 
trying to repair what's there, beyond backing up what I could before 
wiping it.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



btrfs features

2015-02-09 Thread Tobias Holst
Hi

I am just looking at the features enabled on my btrfs volume.
 ls /sys/fs/btrfs/[UUID]/features/
shows the following output:
 big_metadata  compress_lzo  extended_iref  mixed_backref  raid56

So big_metadata means I am not using skinny-metadata,
compress_lzo means I am using compression. raid56 means I am using
the experimental RAID-features of btrfs.



But the other two flags are a little bit unclear... I think
extended_iref is the extref feature of mkfs.btrfs - right?

I am not sure about the mixed_backref feature. What does it mean? Is
this the mixed-bg-feature of mkfs.btrfs?

I would also like to change these features. I am missing skinny extents;
this can be enabled with btrfstune -x [one device of my raid],
correct?

And how can I enable the missing no-holes-feature on my volume?

Regards,
Tobias


Re: [PATCH v2] Btrfs: fix race waiting for ordered extents at transaction commit

2015-02-09 Thread Filipe David Manana
On Mon, Feb 9, 2015 at 12:21 PM, Filipe Manana fdman...@suse.com wrote:
> There's a short time window where a race can happen between two or more
> tasks that hold a transaction handle for the same transaction and where
> one starts the transaction commit before the other tasks attempt to
> split their pending ordered extents list into the transaction's pending
> ordered extents lists. This results in the transaction commit not waiting
> for those ordered extents to complete, in memory leaks of ordered extent
> structures and therefore inode leaks too, since an iput for the ordered
> extent's inode is done only when the ordered extent's refcount drops to
> zero. This race is described by the following sequence diagram:
>
>         CPU 1                                   CPU 2
>
>  btrfs_start_transaction()
>    started transaction N with
>    trans->transaction->num_writers == 1
>    and trans->transaction->state ==
>      TRANS_STATE_RUNNING
>
>                                          btrfs_sync_file()
>                                            btrfs_start_transaction()
>                                              --> returns transaction
>                                                  handle pointing to
>                                                  transaction N
>                                              --> Now transaction N's
>                                                  num_writers == 2
>
>                                            btrfs_sync_log()
>
>  btrfs_commit_transaction()
>    btrfs_wait_pending_ordered()
>      --> transaction N's ->pending_ordered
>          processed and is now an empty list
>    set transaction state to TRANS_STATE_COMMIT_DOING
>    wait for trans->transaction->num_writers == 1
>
>                                              btrfs_wait_logged_extents()
>                                                --> adds ordered extents
>                                                    to trans->ordered list
>
>                                            btrfs_end_transaction()
>                                              --> trans->ordered list is spliced
>                                                  into transaction N's list
>                                                  pending_ordered
>                                              --> transaction N's num_writers
>                                                  becomes 1 now
>
>      wait finished, num_writers == 1
>      transaction is committed and it doesn't wait
>      for the ordered extents from CPU 2's task to
>      complete, nor does it decrement their last
>      reference, resulting in memory leaks and
>      inode leaks (the iput on the ordered extent's
>      inode is done only when the ordered extent's
>      refcount drops to zero)
>
> So fix this by processing the transaction's pending_ordered list again
> after the number of writers decreases to 1.
>
> I ran into this issue while running xfstests/generic/113 in a loop, which
> failed about 1 out of 10 runs with the following warning in dmesg:

> [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558
> free_fs_root+0x36/0x133 [btrfs]()
> [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
> auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop
> processor parport_pc parport psmouse thermal_sys i2c_piix4 serio_raw pcspkr
> evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod
> cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio
> floppy e1000 scsi_mod [last unloaded: btrfs]
> [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: GW
> 3.19.0-rc5-btrfs-next-4+ #1
> [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 2612.457709]  0009 8801342c3c78 8142425e
> 88023ec8f2d8
> [ 2612.459829]   8801342c3cb8 81045308
> 88004646
> [ 2612.461564]  a036da56 88003d07b000 88004646
> 880046460068
> [ 2612.463163] Call Trace:
> [ 2612.463719]  [8142425e] dump_stack+0x4c/0x65
> [ 2612.464789]  [81045308] warn_slowpath_common+0xa1/0xbb
> [ 2612.466026]  [a036da56] ? free_fs_root+0x36/0x133 [btrfs]
> [ 2612.467247]  [810453c5] warn_slowpath_null+0x1a/0x1c
> [ 2612.468416]  [a036da56] free_fs_root+0x36/0x133 [btrfs]
> [ 2612.469625]  [a036f2a7] btrfs_drop_and_free_fs_root+0x93/0x9b
> [btrfs]
> [ 2612.471251]  [a036f353] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
> [ 2612.472536]  [8142612e] ? wait_for_completion+0x24/0x26
> [ 2612.473742]  [a0370bbc] close_ctree+0x1f3/0x33c [btrfs]
> [ 2612.475477]  [81059d1d] ? destroy_workqueue+0x148/0x1ba
 [ 2612.476695]  

[PATCH v5] Btrfs: fix fsync race leading to ordered extent memory leaks

2015-02-09 Thread Filipe Manana
We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

         CPU 1                                     CPU 2

btrfs_sync_file()

                                             btrfs_sync_file()

  mutex_lock(&inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
      --> collects ordered extent X
      --> increments ordered
          extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(&inode->i_mutex)

                                               mutex_lock(&inode->i_mutex)
  btrfs_sync_log()
     btrfs_wait_logged_extents()
       --> list_del_init(&ordered->log_list)
                                               btrfs_log_inode()
                                                 btrfs_get_logged_extents()
                                                   --> Adds ordered extent X
                                                       to logged_list because
                                                       at this point:
                                                       list_empty(&ordered->log_list)
                                                       && test_bit(BTRFS_ORDERED_LOGGED,
                                                                   &ordered->flags) == 0
                                                   --> Increments ordered extent
                                                       X's refcount
       --> check if ordered extent's io is
           finished or not, start it if
           necessary and wait for it to finish
       --> sets bit BTRFS_ORDERED_LOGGED
           on ordered extent X's flags
           and adds it to trans->ordered
  btrfs_sync_log() finishes

                                                 btrfs_submit_logged_extents()
                                               btrfs_log_inode() finishes
                                               mutex_unlock(&inode->i_mutex)

btrfs_sync_file() finishes

                                               btrfs_sync_log()
                                                 btrfs_wait_logged_extents()
                                                   --> Sees ordered extent X has the
                                                       bit BTRFS_ORDERED_LOGGED set in
                                                       its flags
                                                   --> X's refcount is untouched
                                               btrfs_sync_log() finishes

                                             btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
        complete
    --> decrements ordered extent X's
        refcount by 1 only, corresponding
        to the increment done by the fsync
        task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.
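
The interleaving and the effect of the changed flag semantics can be replayed in a single-threaded toy model. This is a sketch under stated assumptions, not the patch itself: `toy_ordered` and the `toy_*` functions are hypothetical names, the steps are run sequentially in the order shown in the diagram, and real locking and I/O are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an ordered extent. */
struct toy_ordered {
	int refs;          /* references taken by fsync tasks */
	bool on_log_list;  /* models !list_empty(&ordered->log_list) */
	bool logged;       /* models the BTRFS_ORDERED_LOGGED bit */
};

/*
 * Models collecting the extent during fsync. Without the fix, the extent is
 * collected whenever it is off the log list and the flag is clear. With the
 * fix, the flag is tested and set at collection time, so only the first
 * fsync can claim the extent. Returns true if a reference was taken.
 */
bool toy_collect(struct toy_ordered *o, bool fixed)
{
	if (fixed) {
		if (o->logged)
			return false;
		o->logged = true;
	} else if (o->on_log_list || o->logged) {
		return false;
	}
	o->on_log_list = true;
	o->refs++;
	return true;
}

/* Models the start of the wait step: the extent leaves the log list. */
void toy_wait_begin(struct toy_ordered *o)
{
	assert(o->on_log_list);
	o->on_log_list = false;
}

/* Models the end of the wait step, where the unfixed code sets the flag. */
void toy_wait_end(struct toy_ordered *o)
{
	o->logged = true;
}

/* Replays the diagram's interleaving; returns the references taken. */
int toy_replay_race(bool fixed)
{
	struct toy_ordered x = { 0, false, false };

	toy_collect(&x, fixed);	/* CPU 1 collects extent X */
	toy_wait_begin(&x);	/* CPU 1 removes X from its log list */
	toy_collect(&x, fixed);	/* CPU 2 tries to collect X */
	toy_wait_end(&x);	/* CPU 1 finishes waiting on X */
	return x.refs;
}
```

In the model, the unfixed replay takes two references while the transaction commit drops only one, which corresponds to the leaked reference described above; with the flag claimed at collection time only one reference is taken.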

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

  Btrfs: make sure logged extents complete in the current transaction V3
  commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 
free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor 
parport_pc parport psmouse thermal_sys i2c_piix4 serio_raw pcspkr evdev 
microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom 
virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio floppy 
e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: 

[PATCH 08/16] Btrfs-progs: multi-thread btrfs-image restore

2015-02-09 Thread Josef Bacik
For some reason we only allow btrfs-image restore to have one thread, which is
incredibly slow with large images.  So allow us to do work with more than just
one thread.  This made my restore go from 16 minutes to 3 minutes.  Thanks,
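
As a side illustration, choosing a worker count from the number of online CPUs can be sketched as below. This is not the patch's logic: `toy_pick_num_threads()` is a hypothetical helper, and treating 0 as "not specified" is an assumption made for the example. Only sysconf(_SC_NPROCESSORS_ONLN), which the touched code also calls, is a real interface.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Returns the requested thread count if one was given (> 0); otherwise
 * falls back to the number of online CPUs, and to 1 if that is unknown.
 */
long toy_pick_num_threads(long requested)
{
	long online;

	assert(requested >= 0);
	if (requested > 0)
		return requested;

	online = sysconf(_SC_NPROCESSORS_ONLN);
	return online > 0 ? online : 1;
}
```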

Signed-off-by: Josef Bacik jba...@fb.com
---
 btrfs-image.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/btrfs-image.c b/btrfs-image.c
index aaff26d..ea85542 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -1922,7 +1922,6 @@ static int add_cluster(struct meta_cluster *cluster,
 	u32 i, nritems;
 	int ret;
 
-	BUG_ON(mdres->num_items);
 	mdres->compress_method = header->compress;
 
 	bytenr = le64_to_cpu(header->bytenr) + BLOCK_SIZE;
@@ -2433,7 +2432,7 @@ static int __restore_metadump(const char *input, FILE *out, int old_restore,
 		goto out;
 	}
 
-	while (1) {
+	while (!mdrestore.error) {
 		ret = fread(cluster, BLOCK_SIZE, 1, in);
 		if (!ret)
 			break;
@@ -2450,14 +2449,8 @@ static int __restore_metadump(const char *input, FILE *out, int old_restore,
 			fprintf(stderr, "Error adding cluster\n");
 			break;
 		}
-
-		ret = wait_for_worker(&mdrestore);
-		if (ret) {
-			fprintf(stderr, "One of the threads errored out %d\n",
-				ret);
-			break;
-		}
 	}
+	ret = wait_for_worker(&mdrestore);
 out:
 	mdrestore_destroy(&mdrestore, num_threads);
 failed_cluster:
@@ -2598,7 +2591,7 @@ int main(int argc, char *argv[])
 {
 	char *source;
 	char *target;
-	u64 num_threads = 0;
+	u64 num_threads = 1;
 	u64 compress_level = 0;
 	int create = 1;
 	int old_restore = 0;
@@ -2689,7 +2682,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (num_threads == 0 && compress_level > 0) {
+	if (num_threads == 1 && compress_level > 0) {
 		num_threads = sysconf(_SC_NPROCESSORS_ONLN);
 		if (num_threads <= 0)
 			num_threads = 1;
@@ -2708,7 +2701,7 @@ int main(int argc, char *argv[])
 		ret = create_metadump(source, out, num_threads,
 				      compress_level, sanitize, walk_trees);
 	} else {
-		ret = restore_metadump(source, out, old_restore, 1,
+		ret = restore_metadump(source, out, old_restore, num_threads,
 				       multi_devices);
 	}
 	if (ret) {
-- 
1.8.3.1



[PATCH 11/16] Btrfs-progs: remove global transaction from fsck

2015-02-09 Thread Josef Bacik
We hold a transaction open for the entirety of fixing extent refs.  This works
out ok most of the time but we can be tight on space and run out of space when
fixing things.  To get around this just push down the transaction starting dance
into the functions that actually fix things.  This keeps us from ending up with
ENOSPC because we pinned everything and allows the code to be a bit simpler.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c  | 230 +-
 ctree.h   |   1 +
 disk-io.c |   2 +
 extent-tree.c |   7 ++
 4 files changed, 140 insertions(+), 100 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ffdfbf2..5458c28 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3668,7 +3668,6 @@ static void free_extent_record_cache(struct btrfs_fs_info *fs_info,
 		if (!cache)
 			break;
 		rec = container_of(cache, struct extent_record, cache);
-		btrfs_unpin_extent(fs_info, rec->start, rec->max_size);
 		remove_cache_extent(extent_cache, cache);
 		free_all_extent_backrefs(rec);
 		free(rec);
@@ -3995,11 +3994,11 @@ again:
  * Attempt to fix basic block failures.  If we can't fix it for whatever reason
  * then just return -EIO.
  */
-static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
-				struct btrfs_root *root,
+static int try_to_fix_bad_block(struct btrfs_root *root,
 				struct extent_buffer *buf,
 				enum btrfs_tree_block_status status)
 {
+	struct btrfs_trans_handle *trans;
 	struct ulist *roots;
 	struct ulist_node *node;
 	struct btrfs_root *search_root;
@@ -4016,7 +4015,7 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -EIO;
 
-	ret = btrfs_find_all_roots(trans, root->fs_info, buf->start,
+	ret = btrfs_find_all_roots(NULL, root->fs_info, buf->start,
 				   0, &roots);
 	if (ret) {
 		btrfs_free_path(path);
@@ -4035,7 +4034,12 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 			break;
 		}
 
-		record_root_in_trans(trans, search_root);
+
+		trans = btrfs_start_transaction(search_root, 0);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
 
 		path->lowest_level = btrfs_header_level(buf);
 		path->skip_check_block = 1;
@@ -4046,23 +4050,26 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 		ret = btrfs_search_slot(trans, search_root, &key, path, 0, 1);
 		if (ret) {
 			ret = -EIO;
+			btrfs_commit_transaction(trans, search_root);
 			break;
 		}
 		if (status == BTRFS_TREE_BLOCK_BAD_KEY_ORDER)
 			ret = fix_key_order(trans, search_root, path);
 		else if (status == BTRFS_TREE_BLOCK_INVALID_OFFSETS)
 			ret = fix_item_offset(trans, search_root, path);
-		if (ret)
+		if (ret) {
+			btrfs_commit_transaction(trans, search_root);
 			break;
+		}
 		btrfs_release_path(path);
+		btrfs_commit_transaction(trans, search_root);
 	}
 	ulist_free(roots);
 	btrfs_free_path(path);
 	return ret;
 }
 
-static int check_block(struct btrfs_trans_handle *trans,
-		       struct btrfs_root *root,
+static int check_block(struct btrfs_root *root,
 		       struct cache_tree *extent_cache,
 		       struct extent_buffer *buf, u64 flags)
 {
@@ -4098,8 +4105,7 @@ static int check_block(struct btrfs_trans_handle *trans,
 
 	if (status != BTRFS_TREE_BLOCK_CLEAN) {
 		if (repair)
-			status = try_to_fix_bad_block(trans, root, buf,
-						      status);
+			status = try_to_fix_bad_block(root, buf, status);
 		if (status != BTRFS_TREE_BLOCK_CLEAN) {
 			ret = -EIO;
 			fprintf(stderr, "bad block %llu\n",
@@ -5678,8 +5684,7 @@ full_backref:
 	return 0;
 }
 
-static int run_next_block(struct btrfs_trans_handle *trans,
-			  struct btrfs_root *root,
+static int run_next_block(struct btrfs_root *root,
 			  struct block_info *bits,
 			  int bits_nr,
 			  u64 *last,
@@ -5797,7 +5802,7 @@ static int run_next_block(struct btrfs_trans_handle *trans,
 		owner = btrfs_header_owner(buf);
 	}
 
-   ret = check_block(trans, root, 

[PATCH 12/16] Btrfs-progs: unpin excluded extents as we fix things

2015-02-09 Thread Josef Bacik
We don't want to keep extent records pinned down if we fix stuff as we may need
the space and we can be pretty sure that these records are correct.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 5458c28..9c379e6 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7335,6 +7335,8 @@ static int check_extent_refs(struct btrfs_root *root,
 		return -EAGAIN;
 
 	while(1) {
+		int cur_err = 0;
+
 		fixed = 0;
 		recorded = 0;
 		cache = search_cache_extent(extent_cache, 0);
@@ -7345,6 +7347,7 @@ static int check_extent_refs(struct btrfs_root *root,
 			fprintf(stderr, "extent item %llu has multiple extent "
 				"items\n", (unsigned long long)rec->start);
 			err = 1;
+			cur_err = 1;
 		}
 
 		if (rec->refs != rec->extent_item_refs) {
@@ -7374,7 +7377,7 @@ static int check_extent_refs(struct btrfs_root *root,
 				}
 			}
 			err = 1;
-
+			cur_err = 1;
 		}
 		if (all_backpointers_checked(rec, 1)) {
 			fprintf(stderr, "backpointer mismatch on [%llu %llu]\n",
@@ -7388,6 +7391,7 @@ static int check_extent_refs(struct btrfs_root *root,
 					goto repair_abort;
 				fixed = 1;
 			}
+			cur_err = 1;
 			err = 1;
 		}
 		if (!rec->owner_ref_checked) {
@@ -7402,10 +7406,16 @@ static int check_extent_refs(struct btrfs_root *root,
 				fixed = 1;
 			}
 			err = 1;
+			cur_err = 1;
 		}
 
 		remove_cache_extent(extent_cache, cache);
 		free_all_extent_backrefs(rec);
+		if (!init_extent_tree && repair && (!cur_err || fixed))
+			clear_extent_dirty(root->fs_info->excluded_extents,
+					   rec->start,
+					   rec->start + rec->max_size - 1,
+					   GFP_NOFS);
 		free(rec);
 	}
 repair_abort:
-- 
1.8.3.1
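
The condition added at the end of the loop can be read as a small predicate. What follows is a minimal standalone sketch, not code from btrfs-progs: `should_unpin` and its parameters are hypothetical names that simply mirror `init_extent_tree`, `repair`, `cur_err` and `fixed` from the hunk above.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the condition in the patch: release the
 * excluded range only when we are repairing, are not rebuilding the
 * extent tree, and the record was either clean or fixed in this pass.
 */
static bool should_unpin(bool init_extent_tree, bool repair,
			 bool cur_err, bool fixed)
{
	return !init_extent_tree && repair && (!cur_err || fixed);
}
```

Under these assumptions a record that hit an error and was not fixed stays excluded, which matches the intent stated in the commit message of only releasing records we are fairly sure are correct.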


[PATCH 14/16] Btrfs-progs: make debug-tree spit out full_backref flag

2015-02-09 Thread Josef Bacik
Currently btrfs-debug-tree ignores the FULL_BACKREF flag which makes it hard to
figure out problems related to FULL_BACKREF.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 print-tree.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/print-tree.c b/print-tree.c
index 3a7c13c..931a321 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -312,6 +312,10 @@ static void extent_flags_to_str(u64 flags, char *ret)
 		}
 		strcat(ret, "TREE_BLOCK");
 	}
+	if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
+		strcat(ret, "|");
+		strcat(ret, "FULL_BACKREF");
+	}
 }
 
 void print_extent_item(struct extent_buffer *eb, int slot, int metadata)
-- 
1.8.3.1
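
To illustrate the kind of output this produces, here is a hedged standalone sketch of a flags-to-string helper. It is not the print-tree.c implementation: `flags_to_str`, `append_flag` and the `EX_FLAG_*` constants are hypothetical stand-ins, and unlike the patch this sketch only emits the `|` separator when something precedes it.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins; the real definitions live in ctree.h. */
#define EX_FLAG_DATA		(1ULL << 0)
#define EX_FLAG_TREE_BLOCK	(1ULL << 1)
#define EX_FLAG_FULL_BACKREF	(1ULL << 8)

/* Append a flag name to buf, preceded by '|' if buf is not empty. */
static void append_flag(char *buf, size_t len, const char *name)
{
	if (buf[0] != '\0')
		strncat(buf, "|", len - strlen(buf) - 1);
	strncat(buf, name, len - strlen(buf) - 1);
}

/* Render the set bits of flags into buf (len must be at least 1). */
static void flags_to_str(uint64_t flags, char *buf, size_t len)
{
	buf[0] = '\0';
	if (flags & EX_FLAG_DATA)
		append_flag(buf, len, "DATA");
	if (flags & EX_FLAG_TREE_BLOCK)
		append_flag(buf, len, "TREE_BLOCK");
	if (flags & EX_FLAG_FULL_BACKREF)
		append_flag(buf, len, "FULL_BACKREF");
}
```

With these assumed values, a tree block carrying the full-backref bit renders as `TREE_BLOCK|FULL_BACKREF`.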


[PATCH 09/16] Btrfs-progs: Introduce metadump_v2

2015-02-09 Thread Josef Bacik
The METADUMP super flag makes us skip doing the chunk tree reading which isn't
helpful for the new restore since we have a valid chunk tree.  But we still want
to have a way for the kernel to know that this is a metadump restore so it
doesn't do things like verify data checksums.  We also want to skip some of the
device extent checks in fsck since those will obviously not match.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 btrfs-image.c | 3 +++
 cmds-check.c  | 9 +++++++--
 ctree.h   | 1 +
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/btrfs-image.c b/btrfs-image.c
index ea85542..feb4a62 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -1455,6 +1455,7 @@ static int update_super(struct mdrestore_struct *mdres, u8 *buffer)
 	struct btrfs_chunk *chunk;
 	struct btrfs_disk_key *disk_key;
 	struct btrfs_key key;
+	u64 flags = btrfs_super_flags(super);
 	u32 new_array_size = 0;
 	u32 array_size;
 	u32 cur = 0;
@@ -1510,6 +1511,8 @@ static int update_super(struct mdrestore_struct *mdres, u8 *buffer)
 	if (mdres->clear_space_cache)
 		btrfs_set_super_cache_generation(super, 0);
 
+	flags |= BTRFS_SUPER_FLAG_METADUMP_V2;
+	btrfs_set_super_flags(super, flags);
 	btrfs_set_super_sys_array_size(super, new_array_size);
 	csum_block(buffer, BTRFS_SUPER_INFO_SIZE);
 
diff --git a/cmds-check.c b/cmds-check.c
index 2163823..ffdfbf2 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7426,6 +7426,7 @@ static int check_chunk_refs(struct chunk_record *chunk_rec,
 	u64 devid;
 	u64 offset;
 	u64 length;
+	int metadump_v2 = 0;
 	int i;
 	int ret = 0;
 
@@ -7438,7 +7439,8 @@ static int check_chunk_refs(struct chunk_record *chunk_rec,
 					       cache);
 	if (chunk_rec->length != block_group_rec->offset ||
 	    chunk_rec->offset != block_group_rec->objectid ||
-	    chunk_rec->type_flags != block_group_rec->flags) {
+	    (!metadump_v2 &&
+	     chunk_rec->type_flags != block_group_rec->flags)) {
 		if (!silent)
 			fprintf(stderr,
 				"Chunk[%llu, %u, %llu]: length(%llu), offset(%llu), type(%llu) mismatch with block group[%llu, %u, %llu]: offset(%llu), objectid(%llu), flags(%llu)\n",
@@ -7472,6 +7474,9 @@ static int check_chunk_refs(struct chunk_record *chunk_rec,
 		ret = 1;
 	}
 
+	if (metadump_v2)
+		return ret;
+
 	length = calc_stripe_length(chunk_rec->type_flags, chunk_rec->length,
 				    chunk_rec->num_stripes);
 	for (i = 0; i < chunk_rec->num_stripes; ++i) {
@@ -7538,7 +7543,7 @@ int check_chunks(struct cache_tree *chunk_cache,
 				 cache);
 		err = check_chunk_refs(chunk_rec, block_group_cache,
 				       dev_extent_cache, silent);
-		if (err)
+		if (err < 0)
 			ret = err;
 		if (err == 0 && good)
 			list_add_tail(&chunk_rec->list, good);
diff --git a/ctree.h b/ctree.h
index 2d2988b..be30cb6 100644
--- a/ctree.h
+++ b/ctree.h
@@ -309,6 +309,7 @@ static inline unsigned long btrfs_chunk_item_size(int num_stripes)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
 #define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
 #define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)
+#define BTRFS_SUPER_FLAG_METADUMP_V2	(1ULL << 34)
 
 #define BTRFS_BACKREF_REV_MAX		256
 #define BTRFS_BACKREF_REV_SHIFT		56
-- 
1.8.3.1
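
The flag handling above is a plain read-modify-write on a 64-bit field. A minimal sketch of that idea follows; the bit positions are taken from the ctree.h hunk, while `mark_metadump_v2` and `is_metadump_v2` are hypothetical helper names used only for illustration and do not exist in btrfs-progs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the ctree.h hunk above. */
#define SUPER_FLAG_METADUMP	(1ULL << 33)
#define SUPER_FLAG_METADUMP_V2	(1ULL << 34)

/* Hypothetical helper: return the super flags with METADUMP_V2 set. */
static uint64_t mark_metadump_v2(uint64_t super_flags)
{
	return super_flags | SUPER_FLAG_METADUMP_V2;
}

/* Hypothetical helper: test for the V2 bit without touching other bits. */
static bool is_metadump_v2(uint64_t super_flags)
{
	return (super_flags & SUPER_FLAG_METADUMP_V2) != 0;
}
```

Because the two flags occupy different bits, setting V2 leaves an existing METADUMP bit untouched, so a consumer can distinguish the old and new restore formats.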


[PATCH 15/16] Btrfs-progs: skip opening all devices with restore

2015-02-09 Thread Josef Bacik
When we go to fixup the dev items after a restore we scan all existing devices.
If you happen to be a btrfs developer you could possibly open up some random
device that you didn't just restore onto, which gives you weird errors and makes
you super cranky and waste a day trying to figure out what is failing.  This
will make it so that we use the fd we've already opened for opening our ctree.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 btrfs-find-root.c | 2 +-
 btrfs-image.c | 9 ++---
 chunk-recover.c   | 2 +-
 disk-io.c | 8 +---
 disk-io.h | 3 ++-
 super-recover.c   | 2 +-
 6 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/btrfs-find-root.c b/btrfs-find-root.c
index 3edb833..c6e6b82 100644
--- a/btrfs-find-root.c
+++ b/btrfs-find-root.c
@@ -79,7 +79,7 @@ static struct btrfs_root *open_ctree_broken(int fd, const char *device)
 		return NULL;
 	}
 
-	ret = btrfs_scan_fs_devices(fd, device, &fs_devices, 0, 1);
+	ret = btrfs_scan_fs_devices(fd, device, &fs_devices, 0, 1, 0);
 	if (ret)
 		goto out;
 
diff --git a/btrfs-image.c b/btrfs-image.c
index 3c78388..04ec473 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -2557,16 +2557,19 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
 	ret = wait_for_worker(&mdrestore);
 
 	if (!ret && !multi_devices && !old_restore) {
+		struct btrfs_root *root;
 		struct stat st;
 
-		info = open_ctree_fs_info(target, 0, 0,
+		root = open_ctree_fd(fileno(out), target, 0,
 					  OPEN_CTREE_PARTIAL |
-					  OPEN_CTREE_WRITES);
-		if (!info) {
+					  OPEN_CTREE_WRITES |
+					  OPEN_CTREE_NO_DEVICES);
+		if (!root) {
 			fprintf(stderr, "unable to open %s\n", target);
 			ret = -EIO;
 			goto out;
 		}
+		info = root->fs_info;
 
 		if (stat(target, &st)) {
 			fprintf(stderr, "statting %s failed\n", target);
diff --git a/chunk-recover.c b/chunk-recover.c
index 94efc43..832b3b1 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1520,7 +1520,7 @@ static int recover_prepare(struct recover_control *rc, char *path)
 		goto fail_free_sb;
 	}
 
-	ret = btrfs_scan_fs_devices(fd, path, &fs_devices, 0, 1);
+	ret = btrfs_scan_fs_devices(fd, path, &fs_devices, 0, 1, 0);
 	if (ret)
 		goto fail_free_sb;
 
diff --git a/disk-io.c b/disk-io.c
index ca39f17..0aec56e 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1006,7 +1006,8 @@ void btrfs_cleanup_all_caches(struct btrfs_fs_info *fs_info)
 
 int btrfs_scan_fs_devices(int fd, const char *path,
 			  struct btrfs_fs_devices **fs_devices,
-			  u64 sb_bytenr, int super_recover)
+			  u64 sb_bytenr, int super_recover,
+			  int skip_devices)
 {
 	u64 total_devs;
 	u64 dev_size;
@@ -1033,7 +1034,7 @@ int btrfs_scan_fs_devices(int fd, const char *path,
 		return ret;
 	}
 
-	if (total_devs != 1) {
+	if (!skip_devices && total_devs != 1) {
 		ret = btrfs_scan_lblkid();
 		if (ret)
 			return ret;
@@ -1114,7 +1115,8 @@ static struct btrfs_fs_info *__open_ctree_fd(int fp, const char *path,
 		fs_info->on_restoring = 1;
 
 	ret = btrfs_scan_fs_devices(fp, path, &fs_devices, sb_bytenr,
-				    (flags & OPEN_CTREE_RECOVER_SUPER));
+				    (flags & OPEN_CTREE_RECOVER_SUPER),
+				    (flags & OPEN_CTREE_NO_DEVICES));
 	if (ret)
 		goto out;
 
diff --git a/disk-io.h b/disk-io.h
index f963a96..53df8f0 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -33,6 +33,7 @@ enum btrfs_open_ctree_flags {
 	OPEN_CTREE_RESTORE		= (1 << 4),
 	OPEN_CTREE_NO_BLOCK_GROUPS	= (1 << 5),
 	OPEN_CTREE_EXCLUSIVE		= (1 << 6),
+	OPEN_CTREE_NO_DEVICES		= (1 << 7),
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
@@ -68,7 +69,7 @@ void btrfs_release_all_roots(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_all_caches(struct btrfs_fs_info *fs_info);
 int btrfs_scan_fs_devices(int fd, const char *path,
 			  struct btrfs_fs_devices **fs_devices, u64 sb_bytenr,
-			  int super_recover);
+			  int super_recover, int skip_devices);
 int btrfs_setup_chunk_tree_and_device_map(struct btrfs_fs_info *fs_info);
 
 struct btrfs_root *open_ctree(const char *filename, u64 sb_bytenr,
diff --git a/super-recover.c b/super-recover.c
index 197fc4b..e2c3129 100644
--- a/super-recover.c
+++ b/super-recover.c

[PATCH 13/16] Btrfs-progs: make restore update dev items

2015-02-09 Thread Josef Bacik
When we restore a multi disk image onto a single disk we need to update the dev
items used and total bytes so that fsck doesn't freak out and that we get normal
results from stuff like btrfs fi show.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 btrfs-image.c | 150 ++
 1 file changed, 131 insertions(+), 19 deletions(-)

diff --git a/btrfs-image.c b/btrfs-image.c
index feb4a62..3c78388 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -133,6 +133,7 @@ struct mdrestore_struct {
 	size_t num_items;
 	u32 leafsize;
 	u64 devid;
+	u64 alloced_chunks;
 	u64 last_physical_offset;
 	u8 uuid[BTRFS_UUID_SIZE];
 	u8 fsid[BTRFS_FSID_SIZE];
@@ -1856,6 +1857,7 @@ static int mdrestore_init(struct mdrestore_struct *mdres,
 	mdres->multi_devices = multi_devices;
 	mdres->clear_space_cache = 0;
 	mdres->last_physical_offset = 0;
+	mdres->alloced_chunks = 0;
 
 	if (!num_threads)
 		return 0;
@@ -2087,6 +2089,7 @@ static int read_chunk_block(struct mdrestore_struct *mdres, u8 *buffer,
 		    mdres->last_physical_offset)
 			mdres->last_physical_offset = fs_chunk->physical +
 				fs_chunk->bytes;
+		mdres->alloced_chunks += fs_chunk->bytes;
 		tree_insert(&mdres->chunk_tree, &fs_chunk->l, chunk_cmp);
 	}
 out:
@@ -2372,9 +2375,107 @@ static void remap_overlapping_chunks(struct mdrestore_struct *mdres)
 	}
 }
 
-static int __restore_metadump(const char *input, FILE *out, int old_restore,
-			      int num_threads, int fixup_offset,
-			      const char *target, int multi_devices)
+static int fixup_devices(struct btrfs_fs_info *fs_info,
+			 struct mdrestore_struct *mdres, off_t dev_size)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_dev_item *dev_item;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_root *root = fs_info->chunk_root;
+	struct btrfs_key key;
+	u64 devid, cur_devid;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		fprintf(stderr, "Error alloc'ing path\n");
+		return -ENOMEM;
+	}
+
+	trans = btrfs_start_transaction(fs_info->tree_root, 1);
+	if (IS_ERR(trans)) {
+		fprintf(stderr, "Error starting transaction %ld\n",
+			PTR_ERR(trans));
+		btrfs_free_path(path);
+		return PTR_ERR(trans);
+	}
+
+	dev_item = &fs_info->super_copy->dev_item;
+
+	devid = btrfs_stack_device_id(dev_item);
+
+	btrfs_set_stack_device_total_bytes(dev_item, dev_size);
+	btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
+
+	key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+	key.type = BTRFS_DEV_ITEM_KEY;
+	key.offset = 0;
+
+again:
+	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+	if (ret < 0) {
+		fprintf(stderr, "search failed %d\n", ret);
+		exit(1);
+	}
+
+	while (1) {
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0) {
+				fprintf(stderr, "Error going to next leaf "
+					"%d\n", ret);
+				exit(1);
+			}
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.type > BTRFS_DEV_ITEM_KEY)
+			break;
+		if (key.type != BTRFS_DEV_ITEM_KEY) {
+			path->slots[0]++;
+			continue;
+		}
+
+		dev_item = btrfs_item_ptr(leaf, path->slots[0],
+					  struct btrfs_dev_item);
+		cur_devid = btrfs_device_id(leaf, dev_item);
+		if (devid != cur_devid) {
+			ret = btrfs_del_item(trans, root, path);
+			if (ret) {
+				fprintf(stderr, "Error deleting item %d\n",
+					ret);
+				exit(1);
+			}
+			btrfs_release_path(path);
+			goto again;
+		}
+
+		btrfs_set_device_total_bytes(leaf, dev_item, dev_size);
+		btrfs_set_device_bytes_used(leaf, dev_item,
+					    mdres->alloced_chunks);
+		btrfs_mark_buffer_dirty(leaf);
+		path->slots[0]++;
+ 

[PATCH 05/16] Btrfs-progs: don't try to repair reloc roots

2015-02-09 Thread Josef Bacik
We have logic to fix the root locations for roots in response to a corruption
bug we had earlier.  However this work doesn't apply to reloc roots and can
screw things up worse, so make sure we skip any reloc roots that we find.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index e74fa0f..2b08c64 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8886,6 +8886,8 @@ again:
 
 		if (found_key.type != BTRFS_ROOT_ITEM_KEY)
 			goto next;
+		if (found_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
+			goto next;
 
 		ret = maybe_repair_root_item(info, path, &found_key,
 					     trans ? 0 : 1);
-- 
1.8.3.1


[PATCH 10/16] Btrfs-progs: only build space info's for the main flags

2015-02-09 Thread Josef Bacik
Hitting enospc problems with a really corrupt fs uncovered the fact that we
match any flag in a block group when creating space info's.  This is a problem
if we have a raid level set, we'll end up with only one space info that covers
metadata and data because they share a raid level.  We don't want this, we want
to separate out the data and metadata space infos, so mask off the raid level
and only use the main flags.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 extent-tree.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 1785e22..d42c572 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1789,11 +1789,11 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
 static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
 						  u64 flags)
 {
-	struct list_head *head = &info->space_info;
-	struct list_head *cur;
 	struct btrfs_space_info *found;
-	list_for_each(cur, head) {
-		found = list_entry(cur, struct btrfs_space_info, list);
+
+	flags &= BTRFS_BLOCK_GROUP_TYPE_MASK;
+
+	list_for_each_entry(found, &info->space_info, list) {
 		if (found->flags & flags)
 			return found;
 	}
@@ -1825,7 +1825,7 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		return -ENOMEM;
 
 	list_add(&found->list, &info->space_info);
-	found->flags = flags;
+	found->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
 	found->total_bytes = total_bytes;
 	found->bytes_used = bytes_used;
 	found->bytes_pinned = 0;
-- 
1.8.3.1
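
The effect of masking off the RAID bits can be shown with a small standalone sketch. This is not the extent-tree.c code: `struct space_info`, `find_any_flag`, `find_by_type` and the `BG_*` constants are hypothetical stand-ins chosen for illustration, and the stored entry deliberately keeps its RAID bit to show the pre-patch failure mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the block group flag bits. */
#define BG_DATA		(1ULL << 0)
#define BG_SYSTEM	(1ULL << 1)
#define BG_METADATA	(1ULL << 2)
#define BG_RAID1	(1ULL << 4)
#define BG_TYPE_MASK	(BG_DATA | BG_SYSTEM | BG_METADATA)

struct space_info {
	uint64_t flags;
};

/* Old behaviour: any shared bit counts as a match, RAID bits included. */
static const struct space_info *find_any_flag(const struct space_info *infos,
					      size_t n, uint64_t flags)
{
	for (size_t i = 0; i < n; i++)
		if (infos[i].flags & flags)
			return &infos[i];
	return NULL;
}

/* New behaviour: only the data/system/metadata type bits are compared. */
static const struct space_info *find_by_type(const struct space_info *infos,
					     size_t n, uint64_t flags)
{
	return find_any_flag(infos, n, flags & BG_TYPE_MASK);
}
```

With a single stored entry of `BG_DATA | BG_RAID1`, a lookup for `BG_METADATA | BG_RAID1` matches through the shared RAID bit in `find_any_flag` but finds nothing in `find_by_type`, so a separate metadata space info would be created.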


[PATCH 06/16] Btrfs-progs: don't check csums for data reloc root

2015-02-09 Thread Josef Bacik
The data reloc root is weird with its csums.  It'll copy an entire extent and
then log any csums it finds, which makes it look weird when it comes to prealloc
extents.  So just skip the data reloc tree, it's special and we just don't need
to worry about it.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 2b08c64..2163823 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1530,7 +1530,16 @@ static int process_file_extent(struct btrfs_root *root,
 	}
 	rec->extent_end = key->offset + num_bytes;
 
-	if (disk_bytenr > 0) {
+	/*
+	 * The data reloc tree will copy full extents into its inode and then
+	 * copy the corresponding csums.  Because the extent it copied could be
+	 * a preallocated extent that hasn't been written to yet there may be no
+	 * csums to copy, ergo we won't have csums for our file extent.  This is
+	 * ok so just don't bother checking csums if the inode belongs to the
+	 * data reloc tree.
+	 */
+	if (disk_bytenr > 0 &&
+	    btrfs_header_owner(eb) != BTRFS_DATA_RELOC_TREE_OBJECTID) {
 		u64 found;
 		if (btrfs_file_extent_compression(eb, fi))
 			num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
-- 
1.8.3.1
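
The new condition boils down to a two-part predicate. A minimal sketch, assuming the owner id of the data reloc tree is `(u64)-9` (an assumption made here for illustration; check ctree.h for the authoritative definition). `should_check_csums` is a hypothetical name and not a function in cmds-check.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for BTRFS_DATA_RELOC_TREE_OBJECTID. */
#define DATA_RELOC_TREE_OBJECTID ((uint64_t)-9)

/*
 * Hypothetical helper: csums are only expected for file extents that
 * point at a real disk location and do not belong to the data reloc tree.
 */
static bool should_check_csums(uint64_t disk_bytenr, uint64_t owner)
{
	return disk_bytenr > 0 && owner != DATA_RELOC_TREE_OBJECTID;
}
```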


Re: btrfs performance, sudden drop to 0 IOPs

2015-02-09 Thread Kai Krakow
P. Remek p.rem...@googlemail.com wrote:

> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with the fio
> utility, I noticed the following two things:
>
> 1) On the first run, when the target file doesn't exist yet, performance is
> about 8000 IOPs. On the second, and every other run, performance goes up
> to 7 IOPs. It's a massive difference. The target file is the one
> created during the first run.
>
> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds, then it goes back up again, and after a couple of
> seconds drops to 0 again. This is reproducible 100% of the time.
>
> Can somebody shed some light on what's happening?

I'm not an expert or dev but it's probably due to btrfs doing some
housekeeping under the hood. Could you check the output of "btrfs filesystem
usage /mountpoint" while running the test? I'd guess there's some pressure
on the global reserve during those times.

> Command: fio --randrepeat=1 --ioengine=libaio --direct=1
> --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
> --size=10G --numjobs=1 --readwrite=randwrite
>
> Environment:
>   CPU: dual socket: E5-2630 v2
>   RAM: 32 GB ram
>   OS: Ubuntu server 14.10
>   Kernel: 3.19.0-031900rc2-generic
>   btrfs tools: Btrfs v3.14.1
>   2x LSI 9300 HBAs - SAS3 12/Gbs
>   8x SSD Ultrastar SSD1600MM 400GB SAS3 12/Gbs
>
> Regards,
> Premek

-- 
Replies to list only preferred.


[PATCH 03/24] Btrfs: sysfs: fix, undo sysfs device links

2015-02-09 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

Theoretically we need to remove the device link attributes. Since the entire
device kobject was removed anyway, leaving them in place caused no actual
issue, but remove them explicitly so the teardown is done properly.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 fs/btrfs/sysfs.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 68dcd17..adfac3e 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -522,6 +522,7 @@ void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
 		kobject_del(fs_info->space_info_kobj);
 		kobject_put(fs_info->space_info_kobj);
 	}
+	btrfs_kobj_rm_device(fs_info, NULL);
 	kobject_del(fs_info->device_dir_kobj);
 	kobject_put(fs_info->device_dir_kobj);
 	addrm_unknown_feature_attrs(fs_info, false);
@@ -604,6 +605,8 @@ static void init_feature_attrs(void)
 	}
 }
 
+/* when one_device is NULL, it removes all device links */
+
 int btrfs_kobj_rm_device(struct btrfs_fs_info *fs_info,
 		struct btrfs_device *one_device)
 {
@@ -621,6 +624,20 @@ int btrfs_kobj_rm_device(struct btrfs_fs_info *fs_info,
 						disk_kobj->name);
 	}
 
+	if (one_device)
+		return 0;
+
+	list_for_each_entry(one_device,
+			&fs_info->fs_devices->devices, dev_list) {
+		if (!one_device->bdev)
+			continue;
+		disk = one_device->bdev->bd_part;
+		disk_kobj = &part_to_dev(disk)->kobj;
+
+		sysfs_remove_link(fs_info->device_dir_kobj,
+						disk_kobj->name);
+	}
+
 	return 0;
 }
 
 
-- 
2.0.0.153.g79d
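
The "NULL means all devices" convention can be sketched in userspace terms. This is an illustration only, with no kernel APIs involved: `struct device`, its fields and `remove_links` are hypothetical names standing in for `btrfs_device`, its `bdev` pointer and `btrfs_kobj_rm_device()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a device with an optional link. */
struct device {
	bool has_bdev;	/* mirrors the one_device->bdev check */
	bool linked;	/* whether a sysfs-style link currently exists */
};

/*
 * Remove the link of one device, or of every device when one is NULL.
 * Devices without a backing block device are skipped.
 */
static int remove_links(struct device *devs, size_t n, struct device *one)
{
	if (one) {
		if (one->has_bdev)
			one->linked = false;
		return 0;
	}

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].has_bdev)
			continue;
		devs[i].linked = false;
	}
	return 0;
}
```

Overloading a NULL argument this way keeps a single entry point for both the per-device and the whole-filesystem teardown, at the cost of a less self-documenting call site, which is presumably why the patch adds the comment above the function.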


[PATCH 06/24] Btrfs: sysfs: reorder the kobject creations

2015-02-09 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

As of now the order in which the kobjects are created
at btrfs_sysfs_add_one() is:
 fsid
 features
 unknown features (dynamic features)
 devices

Since we will move the fsid and device kobjects from the fs_info
structure to fs_devices, this patch reorders the kobject creation
as below:
 fsid
 devices
 features
 unknown features (dynamic features)

Hence btrfs_sysfs_remove_one() follows the same sequence in reverse
order, and the device kobject destroy can now be moved into the
function __btrfs_sysfs_remove_one().

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 fs/btrfs/sysfs.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 506f7e4..c3e7f06 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -510,6 +510,13 @@ static int addrm_unknown_feature_attrs(struct btrfs_fs_info *fs_info, bool add)
 
 static void __btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
 {
+	if (fs_info->device_dir_kobj) {
+		btrfs_kobj_rm_device(fs_info, NULL);
+		kobject_del(fs_info->device_dir_kobj);
+		kobject_put(fs_info->device_dir_kobj);
+		fs_info->device_dir_kobj = NULL;
+	}
+
 	kobject_del(&fs_info->super_kobj);
 	kobject_put(&fs_info->super_kobj);
 	wait_for_completion(&fs_info->kobj_unregister);
@@ -522,12 +529,6 @@ void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
 		kobject_del(fs_info->space_info_kobj);
 		kobject_put(fs_info->space_info_kobj);
 	}
-	if (fs_info->device_dir_kobj) {
-		btrfs_kobj_rm_device(fs_info, NULL);
-		kobject_del(fs_info->device_dir_kobj);
-		kobject_put(fs_info->device_dir_kobj);
-		fs_info->device_dir_kobj = NULL;
-	}
 	addrm_unknown_feature_attrs(fs_info, false);
 	sysfs_remove_group(&fs_info->super_kobj, &btrfs_feature_attr_group);
 	__btrfs_sysfs_remove_one(fs_info);
@@ -700,6 +701,12 @@ int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info)
 	if (error)
 		return error;
 
+	error = btrfs_kobj_add_device(fs_info, NULL);
+	if (error) {
+		__btrfs_sysfs_remove_one(fs_info);
+		return error;
+	}
+
 	error = sysfs_create_group(&fs_info->super_kobj,
 				   &btrfs_feature_attr_group);
 	if (error) {
@@ -711,10 +718,6 @@ int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info)
 	if (error)
 		goto failure;
 
-	error = btrfs_kobj_add_device(fs_info, NULL);
-	if (error)
-		goto failure;
-
 	fs_info->space_info_kobj = kobject_create_and_add("allocation",
 						  &fs_info->super_kobj);
 	if (!fs_info->space_info_kobj) {
-- 
2.0.0.153.g79d
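
The ordering rule described in the commit message (create in one order, destroy in reverse, and tolerate teardown of things that were never created) can be sketched outside the kernel. The following is a hypothetical userspace model, not the sysfs.c code: `struct sysfs_state`, `setup`, `teardown` and the `fail_at` parameter are names invented here for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the objects created by btrfs_sysfs_add_one(). */
struct sysfs_state {
	bool fsid;
	bool devices;
	bool features;
};

/* Tear down in reverse creation order; each step is a no-op if unset. */
static void teardown(struct sysfs_state *s)
{
	if (s->features)
		s->features = false;
	if (s->devices)
		s->devices = false;
	if (s->fsid)
		s->fsid = false;
}

/*
 * Create fsid, then devices, then features. fail_at simulates a failure
 * at step 1..3 (0 means no failure). On failure, everything created so
 * far is unwound through the same teardown path.
 */
static int setup(struct sysfs_state *s, int fail_at)
{
	if (fail_at == 1)
		goto fail;
	s->fsid = true;

	if (fail_at == 2)
		goto fail;
	s->devices = true;

	if (fail_at == 3)
		goto fail;
	s->features = true;
	return 0;

fail:
	teardown(s);
	return -1;
}
```

Because every teardown step checks whether its object exists, the same function serves both the normal removal path and the error path of a partially completed setup, which is the property the NULL check on `device_dir_kobj` provides in the patch.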


[PATCH 05/24] Btrfs: sysfs: fix, check if device_dir_kobj is init before destroy

2015-02-09 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

Since the failure path in btrfs_sysfs_add_one() can
call btrfs_sysfs_remove_one() even before device_dir_kobj
has been created, we need to check whether it is NULL.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 fs/btrfs/sysfs.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 15fead2..506f7e4 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -522,10 +522,12 @@ void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
 		kobject_del(fs_info->space_info_kobj);
 		kobject_put(fs_info->space_info_kobj);
 	}
-	btrfs_kobj_rm_device(fs_info, NULL);
-	kobject_del(fs_info->device_dir_kobj);
-	kobject_put(fs_info->device_dir_kobj);
-	fs_info->device_dir_kobj = NULL;
+	if (fs_info->device_dir_kobj) {
+		btrfs_kobj_rm_device(fs_info, NULL);
+		kobject_del(fs_info->device_dir_kobj);
+		kobject_put(fs_info->device_dir_kobj);
+		fs_info->device_dir_kobj = NULL;
+	}
 	addrm_unknown_feature_attrs(fs_info, false);
 	sysfs_remove_group(&fs_info->super_kobj, &btrfs_feature_attr_group);
 	__btrfs_sysfs_remove_one(fs_info);
-- 
2.0.0.153.g79d


Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread Chris Murphy
On Mon, Feb 9, 2015 at 5:54 PM, constantine costas.magn...@gmail.com wrote:

> 1.  I am testing various files and all seem readable. Is there a way
> to list every file that resides on a particular device (like
> /dev/sdc1?) so as to check them? There are a handful of files that
> seem corrupted, since I get from scrub:
> "
> BTRFS: checksum error at logical 10792783298560 on dev /dev/sdc1,
> sector 737159648, root 5, inode 1376754, offset 175428419584, length
> 4096, links 1 (path: long/path/file.img)
> ",
> but are these the only files that could be corrupted?

It should be true the only corrupt files are the listed ones. I don't
have a good suggestion for the first question, whether btrfs restore
can help or btrfs-debug-tree - assuming you want something independent
from mounting the filesystem and just using recursive ls or tree
commands.


> 2. Chris mentioned:
>
> A. On Mon, Feb 9, 2015 at 12:31 AM, Chris Murphy
> li...@colorremedies.com wrote:
> [[[try # btrfs device delete /dev/sdc1 /mnt/mountpoint]]]. Just realize
> that any data that's on both the failed drive and sdc1 will be lost
>
> and later
>
> B. On Mon, Feb 9, 2015 at 1:34 AM, Chris Murphy li...@colorremedies.com
> wrote:
> So now I have a 4 device
> raid1 mounted degraded. And I can still device delete another device.
> So one device missing and one device removed.
>
> So when I do the "# btrfs device delete /dev/sdc1 /mnt/mountpoint" the
> normal behavior would be for the files that are located on /dev/sdc1 (and
> also were on the missing/failed drive) to be transferred to other
> drives and not lose them, right? (Does B. hold and contradict A.?)

In the normal case, a non-degraded volume, a device delete will
successfully migrate data, and the volume remains non-degraded.

In the unusual case, a degraded volume, a device delete is suspiciously
permitted. I think this is risky and maybe ought to be disallowed, or
at least require the user to use --force. And the reason is, it's a
degraded array. The first order of business is to do a 'device
replace start' or, if enough devices exist, 'btrfs device delete
missing' to get the volume from degraded to normal state. And then do
any additional device deletes.

But in the even more unusual case, a degraded volume with a 2nd device
that's producing a huge pile of read, write and corruption errors,
Btrfs can't migrate any data off the dead/removed drive (obviously),
but it also has problems removing the data that now only exists on the
2nd device that is spitting out errors. I don't expect this device delete
to succeed.

The difference between cases A and B is that in B there isn't a 2nd
drive spitting out a pile of errors. The volume is merely degraded,
with a drive being deleted, and even that ended in a kernel panic for
me, which I've reproduced. However, as a followup, after rebooting, the
btrfs volume is mountable (degraded) without error. I can then run
btrfs device delete missing, and remount normally (not degraded). So
this is good.

After the whole process,
 I suppose I will have a more robust array structure the RED/RAID
 drives and appropriate cron jobs as indicated in the thread.

Sounds like a good ending.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Accepting discard to free space from disk images

2015-02-09 Thread Roman Mamedov
On Mon, 09 Feb 2015 12:07:18 -0500
Devon B. devo...@virtualcomplete.com wrote:

 Thanks for your testing.  I haven't tried 3.14.  I tried on CentOS 6 box 
 (2.6.32 - which is experimental) and Ubuntu 14.04 (3.13) and neither 
 worked.  So the question remains, what is the difference?  Possibly a 
 small difference between the 3.13 and 3.14 kernels, I don't think it is 
 any of the mount options.  I guess if anyone else has insight on this, 
 that would be great.  Otherwise, I'll see if I can get a newer kernel 
 loaded up and do some more testing.

Use "Reply to all" and not just "Reply", or no one else will see your
question.

BTW it is not a good idea to use Btrfs on 2.6.32.

-- 
With respect,
Roman


btrfs performance, sudden drop to 0 IOPs

2015-02-09 Thread P. Remek
Hello,

I am benchmarking Btrfs, and when benchmarking random writes with the
fio utility I noticed the following two things:

1) On the first run, when the target file doesn't exist yet, performance
is about 8000 IOPs. On the second and every subsequent run, performance
goes up to 7 IOPs. It's a massive difference. The target file is the
one created during the first run.

2) There are windows during the test where IOPs drop to 0 and stay at 0
for about 10 seconds, then recover, and after a couple of seconds drop
to 0 again. This is reproducible 100% of the time.

Can somebody shed some light on what's happening?


Command: fio --randrepeat=1 --ioengine=libaio --direct=1
--gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
--size=10G --numjobs=1 --readwrite=randwrite

Environment:
CPU: dual socket: E5-2630 v2
RAM: 32 GB
OS: Ubuntu server 14.10
Kernel: 3.19.0-031900rc2-generic
btrfs tools: Btrfs v3.14.1
2x LSI 9300 HBAs - SAS3 12 Gb/s
8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s

Regards,
Premek


Re: Accepting discard to free space from disk images

2015-02-09 Thread Devon B.


Thanks for your testing.  I haven't tried 3.14.  I tried on CentOS 6 box 
(2.6.32 - which is experimental) and Ubuntu 14.04 (3.13) and neither 
worked.  So the question remains, what is the difference?  Possibly a 
small difference between the 3.13 and 3.14 kernels, I don't think it is 
any of the mount options.  I guess if anyone else has insight on this, 
that would be great.  Otherwise, I'll see if I can get a newer kernel 
loaded up and do some more testing.



Roman Mamedov mailto:r...@romanrm.net
Monday, February 9, 2015 10:45 AM
On Mon, 9 Feb 2015 20:42:56 +0500
Roman Mamedovr...@romanrm.net  wrote:


On Mon, 09 Feb 2015 10:26:33 -0500
Devon B.devo...@virtualcomplete.com  wrote:


If you don't mind me asking, what version kernel are you running and are
you using any special mount options?

Well actually I did not claim I have working discard through 'loop', but your
post made me curious.

$ sudo dd if=/dev/zero of=100g bs=1M seek=10 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00221052 s, 474 MB/s

$ sudo mkfs.ext4 100g
[...]

$ du -hsc 100g
133M	100g
133M	total

$ sudo mount -o loop 100g /mnt/tmp1/

(then in a new terminal window):
$ cd /mnt/tmp1/
$ df -h .
Filesystem  Size  Used Avail Use% Mounted on
/dev/loop0   96G   60M   92G   1% /mnt/tmp1
$ sudo dd if=/dev/zero of=zerofile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.944377 s, 1.1 GB/s
$ sync

(back to the original one):
$ du -hsc 100g
1.2G	100g
1.2G	total



(2nd window):


Forgot to add I also did 'rm zerofile' here, of course.


$ sudo fstrim .

(back to the original one):
$ du -hsc 100g
133M	100g
133M	total

So it does work for me just fine even with 'loop'.
Kernel version 3.14.32, mount options
rw,noatime,nodiratime,compress=zlib,space_cache,inode_cache.




Roman Mamedov mailto:r...@romanrm.net
Monday, February 9, 2015 10:42 AM
On Mon, 09 Feb 2015 10:26:33 -0500

Well actually I did not claim I have working discard through 'loop', 
but your

post made me curious.

$ sudo dd if=/dev/zero of=100g bs=1M seek=10 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00221052 s, 474 MB/s

$ sudo mkfs.ext4 100g
[...]

$ du -hsc 100g
133M 100g
133M total

$ sudo mount -o loop 100g /mnt/tmp1/

(then in a new terminal window):
$ cd /mnt/tmp1/
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 96G 60M 92G 1% /mnt/tmp1
$ sudo dd if=/dev/zero of=zerofile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.944377 s, 1.1 GB/s
$ sync

(back to the original one):
$ du -hsc 100g
1.2G 100g
1.2G total

(2nd window):
$ sudo fstrim .

(back to the original one):
$ du -hsc 100g
133M 100g
133M total

So it does work for me just fine even with 'loop'.
Kernel version 3.14.32, mount options
rw,noatime,nodiratime,compress=zlib,space_cache,inode_cache.

Devon B. mailto:devo...@virtualcomplete.com
Monday, February 9, 2015 10:26 AM
If you don't mind me asking, what version kernel are you running and 
are you using any special mount options?


Here is a quick example:

# qemu-img create -f raw /btrfs/sub/raw.img 100G
Formatting '/btrfs/sub/raw.img', fmt=raw size=107374182400

# mkfs.ext4 /btrfs/sub/raw.img
...

# mount -o loop /btrfs/sub/raw.img /mnt/test

# du -hs /btrfs/sub/raw.img
1.7G	/btrfs/sub/raw.img

# fstrim -v /mnt/test
/mnt/test: 105492688896 bytes were trimmed

# du -hs /btrfs/sub/raw.img
1.7G	/btrfs/sub/raw.img

# dd if=/dev/zero of=/mnt/test/1GB count=1k bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.493057 s, 2.2 GB/s

# du -hs /btrfs/sub/raw.img
2.7G	/btrfs/sub/raw.img

# rm -f /mnt/test/1GB

# fstrim -v /mnt/test
/mnt/test: 1186967552 bytes were trimmed

# du -hs /btrfs/sub/raw.img
2.7G	/btrfs/sub/raw.img

# dd if=/dev/zero of=/mnt/test/1GB count=1k bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.467089 s, 2.3 GB/s

# du -hs /btrfs/sub/raw.img
3.7G	/btrfs/sub/raw.img

# rm -f /mnt/test/1GB

# fstrim -v /mnt/test
/mnt/test: 1203761152 bytes were trimmed

# du -hs /btrfs/sub/raw.img
3.7G	/btrfs/sub/raw.img

# du -hs /mnt/test
20K /mnt/test

So even though there is nothing in /mnt/test, the disk image is 
consuming 3.7GB of space.  Maybe you could test similarly with your 
server if you have time on your hands.
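
The du figures in this transcript count allocated blocks, not the apparent size of the image, which is why a 100G raw image can show up as a few gigabytes. A minimal Python sketch of that distinction follows; it assumes a POSIX system where st_blocks is counted in 512-byte units, and the file name and helper names are hypothetical. It does not exercise discard or fstrim at all, it only shows how the two sizes are read.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    st_size is what `ls -l` or `du --apparent-size` reports; st_blocks
    (512-byte units on POSIX systems) is what plain `du` adds up.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size_bytes):
    """Create a file with the given apparent size without writing data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the raw disk image in the transcript.
    image = os.path.join(tmp, "raw.img")
    make_sparse_file(image, 100 * 1024 * 1024)
    apparent, allocated = apparent_and_allocated(image)
    print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

On a filesystem that supports sparse files the allocated figure should stay far below the apparent one until data is written; whether it shrinks again after a trim is exactly the open question in this thread.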


Thanks!
Roman Mamedov mailto:r...@romanrm.net
Monday, February 9, 2015 1:40 AM
On Mon, 09 Feb 2015 00:17:49 -0500

I use KVM (QEMU) with discard pass-through from the VM guest 
(discard=unmap

option), with the VM images stored on Btrfs. It works just fine, the disk
space used for the image file does shrink when the guest OS issues 
discards on
its FS. Maybe there is a difference in how KVM and the 'loop' module 
submit

discards to Btrfs?

Devon B. mailto:devo...@virtualcomplete.com
Monday, February 9, 2015 12:17 AM
Looking to use btrfs with disk images that contain ext4, xfs, and other

Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread constantine
Thank you everybody for your support, care, cheerful comments and
understandable criticism. I am in the process of backing up every
file.

Could you please answer two questions?:

1.  I am testing various files and all seem readable. Is there a way
to list every file that resides on a particular device (like
/dev/sdc1?) so as to check them? There are a handful of files that
seem corrupted, since I get from scrub:

BTRFS: checksum error at logical 10792783298560 on dev /dev/sdc1,
sector 737159648, root 5, inode 1376754, offset 175428419584, length
4096, links 1 (path: long/path/file.img)
,
but are these the only files that could be corrupted?


2. Chris mentioned:

A. On Mon, Feb 9, 2015 at 12:31 AM, Chris Murphy
li...@colorremedies.com wrote:
 [[[try # btrfs device delete /dev/sdc1 /mnt/mountpoint]]]. Just realize that 
 any data that's on both the
 failed drive and sdc1 will be lost

and later

B. On Mon, Feb 9, 2015 at 1:34 AM, Chris Murphy li...@colorremedies.com wrote:
 So now I have a 4 device
 raid1 mounted degraded. And I can still device delete another device.
 So one device missing and one device removed.

So when I do the # btrfs device delete /dev/sdc1 /mnt/mountpoint the
normal behavior would be for the files that are located in /dev/sdc1 (and
also were on the missing/failed drive) to be transferred to other
drives and not be lost, right? (Does B. hold and contradict A.?)



Long PS: Obviously, I have backed-up critical data that I would
almost consider committing suicide if I lost them with services like
tarsnap/dropbox/etc. However, I did not do this for
non-critical-yet-important
data-that-would-make-me-depressed-if-I-lost-them-for-some-months
because of budget constraints.

For network stumblers: RAID-1 btrfs was working for me for a couple of
years and I had the sense that I was covered. I obviously was not, since
I neglected looking at dmesg after each scrub. Second, I rushed and
added both of the new 6TB drives to the array, instead of adding only
one of them and using the second for backing up my data. After the whole
process, I suppose I will have a more robust array structure with the
RED/RAID drives and appropriate cron jobs as indicated in the thread.


[PATCH] Btrfs: scrub, fix sleep in atomic context

2015-02-09 Thread Filipe Manana
My previous patch "Btrfs: fix scrub race leading to use-after-free"
introduced the possibility to sleep in an atomic context, which happens
when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
is called - this function can be called under an atomic context.
Chris ran into this in a debug kernel which gave the following trace:

[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: GW 3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207]  81a22cf8 881076e03b78 816b8dd9 881076e03b78
[ 1929.037267]  880d8e828710 881076e03ba8 810856c4 881076e03bc8
[ 1929.052315]   026d 81a22cf8 881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344]  IRQ  [816b8dd9] dump_stack+0x4f/0x6e
[ 1929.083968]  [810856c4] ___might_sleep+0x174/0x230
[ 1929.095352]  [810857d2] __might_sleep+0x52/0x90
[ 1929.106223]  [816bb68f] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951]  [810ab37d] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708]  [a05dc838] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370]  [a05dd0e0] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191]  [812fa603] bio_endio+0x53/0xa0
[ 1929.167382]  [a05f96bc] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161]  [a05f97ba] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318]  [812fa603] bio_endio+0x53/0xa0
[ 1929.204496]  [8130401b] blk_update_request+0x1eb/0x450
[ 1929.216569]  [81096e58] ? trigger_load_balance+0x78/0x500
[ 1929.229176]  [8144c74d] scsi_end_request+0x3d/0x1f0
[ 1929.240740]  [8144ccac] scsi_io_completion+0xac/0x5b0
[ 1929.252654]  [81441c50] scsi_finish_command+0xf0/0x150
[ 1929.264725]  [8144d317] scsi_softirq_done+0x147/0x170
[ 1929.276635]  [8130ace6] blk_done_softirq+0x86/0xa0
[ 1929.288014]  [8105d92e] __do_softirq+0xde/0x600
[ 1929.298885]  [8105df6d] irq_exit+0xbd/0xd0
(...)

Fix this by using a reference count on the scrub context structure
instead of locking the scrub_lock mutex.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/scrub.c | 39 +++
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d5d790c..ec57687 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -193,6 +193,15 @@ struct scrub_ctx {
 	 */
 	struct btrfs_scrub_progress stat;
 	spinlock_t		stat_lock;
+
+	/*
+	 * Use a ref counter to avoid use-after-free issues. Scrub workers
+	 * decrement bios_in_flight and workers_pending and then do a wakeup
+	 * on the list_wait wait queue. We must ensure the main scrub task
+	 * doesn't free the scrub context before or while the workers are
+	 * doing the wakeup() call.
+	 */
+	atomic_t		refs;
 };
 
 struct scrub_fixup_nodatasum {
@@ -297,26 +306,20 @@ static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 static void copy_nocow_pages_worker(struct btrfs_work *work);
 static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
 static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
+static void scrub_put_ctx(struct scrub_ctx *sctx);
 
 
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
 {
+	atomic_inc(&sctx->refs);
 	atomic_inc(&sctx->bios_in_flight);
 }
 
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
 {
-	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
-
-	/*
-	 * Hold the scrub_lock while doing the wakeup to ensure the
-	 * sctx (and its wait queue list_wait) isn't destroyed/freed
-	 * during the wakeup.
-	 */
-	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&sctx->bios_in_flight);
 	wake_up(&sctx->list_wait);
-	mutex_unlock(&fs_info->scrub_lock);
+	scrub_put_ctx(sctx);
 }
 
 static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
@@ -350,6 +353,7 @@ static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx)
 {
 	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
 
+	atomic_inc(&sctx->refs);
 	/*
 	 * increment scrubs_running to prevent cancel requests from
 	 * completing as long as a worker is running. we must also
@@ -388,15 +392,11 @@ static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx)
 	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&fs_info->scrubs_running);
 	atomic_dec(&fs_info->scrubs_paused);
+	mutex_unlock(&fs_info->scrub_lock);

Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread Kai Krakow
Brendan Hide bren...@swiftspirit.co.za schrieb:

 I have the following two lines in
 /etc/udev/rules.d/61-persistent-storage.rules for two old 250GB
 spindles. It sets the timeout to 120 seconds because these two disks
 don't support SCT ERC. This may very well apply without modification to
 other distros - but this is only tested in Arch:
 ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block",
 ENV{ID_SERIAL}="ST3250410AS_6RYF5NP7" RUN+="/bin/sh -c 'echo 120 >
 /sys$devpath/device/timeout'"
 ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block",
 ENV{ID_SERIAL}="ST3250820AS_9QE2CQWC" RUN+="/bin/sh -c 'echo 120 >
 /sys$devpath/device/timeout'"

Wouldn't it be easier and more efficient to use this:

ACTION=="add|change", KERNEL=="sd[a-z]", ENV{ID_SERIAL}=="...",
ATTR{device/timeout}="120"

Otherwise you always spawn a shell and additional file descriptors, and you 
could spare a variable interpolation. Tho it probably depends on your udev 
version...

I'm using this and it works setting the attributes (set deadline on SSD):

ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0",
ATTR{queue/scheduler}="deadline"

And, I think you missed the double-equal == behind ENV{}... Right? 
Otherwise you just assign a value. Tho, you could probably match on 
ATTR{device/model} instead to be more generic (the serial is probably too 
specific). You can get those from the /sys/block/sd* subtree.

-- 
Replies to list only preferred.



Sending as you requested

2015-02-09 Thread Олеся
Hello

Your request has been fulfilled; I am sending the offer.


praic-list.doc
Description: MS-Word document


btrfs raid5 with mixed disks

2015-02-09 Thread Rich Freeman
How does btrfs raid5 handle mixed-size disks?  The docs weren't
terribly clear on this.

Suppose I have 4x3TB and 1x1TB disks.  Using conventional lvm+mdadm in
raid5 mode I'd expect to be able to fit about 10TB of space on those
(2TB striped across 4 disks plus 1TB striped across 5 disks after
partitioning).  How much would btrfs be able to store in the same
configuration?  I did see something about being able to use fixed-size
stripes, and I'm not sure if this helps.  If it does, are there any
penalties, especially with future expansion of the array?

With raid1 mode btrfs is reasonably smart about mixed disk sizes, and
you usually end up with half of the total space available.
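
The arithmetic behind the 10TB lvm+mdadm figure above can be checked with a short Python sketch. It only models the layered layout described in this message (equal slices carved from the disks, each layer its own RAID5) plus the usual raid1 estimate; it says nothing about how btrfs raid5 actually allocates chunks, which is the open question here. The function names and the three-member minimum per layer are assumptions made for the sketch.

```python
def layered_raid5_capacity(disk_sizes_tb):
    """Usable space when disks are sliced into layers of equal height and
    each layer is its own RAID5 (one member's worth of each layer is parity).
    """
    remaining = sorted(disk_sizes_tb)
    consumed = 0  # height already carved from every remaining disk
    usable = 0
    # Assumption for this sketch: a RAID5 layer needs at least 3 members.
    while len(remaining) >= 3:
        layer = remaining[0] - consumed
        if layer > 0:
            usable += layer * (len(remaining) - 1)
            consumed = remaining[0]
        remaining.pop(0)
    return usable


def raid1_estimate(disk_sizes_tb):
    """Two-copy estimate: half the total, unless one disk is larger than
    all the others combined, in which case the others are the limit."""
    total = sum(disk_sizes_tb)
    return min(total / 2, total - max(disk_sizes_tb))


disks = [3, 3, 3, 3, 1]
print(layered_raid5_capacity(disks))  # 1TB x 4 data members + 2TB x 3 data members
print(raid1_estimate(disks))
```

For the 4x3TB + 1x1TB example this gives 4TB from the five-disk layer plus 6TB from the four-disk layer, matching the 10TB estimate, and 6.5TB for the raid1 rule of thumb.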

--
Rich


Re: btrfs performance, sudden drop to 0 IOPs

2015-02-09 Thread P. Remek
Not sure if it helps, but here it is:

root@lab1:/mnt/vol1# btrfs filesystem df /mnt/vol1/
Data, RAID10: total=116.00GiB, used=110.03GiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=2.00GiB, used=563.72MiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=192.00MiB, used=0.00

On Mon, Feb 9, 2015 at 8:56 PM, Kai Krakow hurikha...@gmail.com wrote:
 P. Remek p.rem...@googlemail.com schrieb:

 Hello,

 I am benchmarking Btrfs, and when benchmarking random writes with the
 fio utility I noticed the following two things:

 1) On the first run, when the target file doesn't exist yet, performance
 is about 8000 IOPs. On the second and every subsequent run, performance
 goes up to 7 IOPs. It's a massive difference. The target file is the
 one created during the first run.

 2) There are windows during the test where IOPs drop to 0 and stay at 0
 for about 10 seconds, then recover, and after a couple of seconds drop
 to 0 again. This is reproducible 100% of the time.

 Can somebody shed some light on what's happening?

 I'm not an expert or dev but it's probably due to btrfs doing some
 housekeeping under the hood. Could you check the output of btrfs filesystem
 usage /mountpoint while running the test? I'd guess there's some pressure
 on the global reserve during those times.

 Command: fio --randrepeat=1 --ioengine=libaio --direct=1
 --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
 --size=10G --numjobs=1 --readwrite=randwrite

 Environment:
 CPU: dual socket: E5-2630 v2
RAM: 32 GB ram
OS: Ubuntu server 14.10
Kernel: 3.19.0-031900rc2-generic
btrfs tools: Btrfs v3.14.1
2x LSI 9300 HBAs - SAS3 12/Gbs
8x SSD Ultrastar SSD1600MM 400GB SAS3 12/Gbs

 Regards,
 Premek

 --
 Replies to list only preferred.



Re: [PATCH] Btrfs: fix scrub race leading to use-after-free

2015-02-09 Thread Chris Mason



On Tue, Jan 27, 2015 at 11:11 AM, Filipe Manana fdman...@suse.com 
wrote:

While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:


This actually trades one bug for another:

[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: GW 3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207]  81a22cf8 881076e03b78 816b8dd9 881076e03b78
[ 1929.037267]  880d8e828710 881076e03ba8 810856c4 881076e03bc8
[ 1929.052315]   026d 81a22cf8 881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344]  IRQ  [816b8dd9] dump_stack+0x4f/0x6e
[ 1929.083968]  [810856c4] ___might_sleep+0x174/0x230
[ 1929.095352]  [810857d2] __might_sleep+0x52/0x90
[ 1929.106223]  [816bb68f] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951]  [810ab37d] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708]  [a05dc838] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370]  [a05dd0e0] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191]  [812fa603] bio_endio+0x53/0xa0
[ 1929.167382]  [a05f96bc] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161]  [a05f97ba] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318]  [812fa603] bio_endio+0x53/0xa0
[ 1929.204496]  [8130401b] blk_update_request+0x1eb/0x450
[ 1929.216569]  [81096e58] ? trigger_load_balance+0x78/0x500
[ 1929.229176]  [8144c74d] scsi_end_request+0x3d/0x1f0
[ 1929.240740]  [8144ccac] scsi_io_completion+0xac/0x5b0
[ 1929.252654]  [81441c50] scsi_finish_command+0xf0/0x150
[ 1929.264725]  [8144d317] scsi_softirq_done+0x147/0x170
[ 1929.276635]  [8130ace6] blk_done_softirq+0x86/0xa0
[ 1929.288014]  [8105d92e] __do_softirq+0xde/0x600
[ 1929.298885]  [8105df6d] irq_exit+0xbd/0xd0
[ 1929.308879]  [81034ea5] smp_call_function_single_interrupt+0x35/0x40
[ 1929.323455]  [816c126f] call_function_single_interrupt+0x6f/0x80
[ 1929.337270]  EOI  [811fc745] ? sync_inodes_sb+0x1b5/0x2a0
[ 1929.350261]  [811fc728] ? sync_inodes_sb+0x198/0x2a0
[ 1929.361991]  [816badcf] ? wait_for_completion+0xef/0x120
[ 1929.374423]  [812028d0] ? fdatawrite_one_bdev+0x20/0x20
[ 1929.386671]  [812028d0] ? fdatawrite_one_bdev+0x20/0x20
[ 1929.398930]  [812028ed] sync_inodes_one_sb+0x1d/0x30
[ 1929.410668]  [811cf4c6] iterate_supers+0xb6/0xf0
[ 1929.421712]  [81202935] sys_sync+0x35/0x90
[ 1929.431704]  [816bfed2] system_call_fastpath+0x12/0x17

So we'll have to either put in a refcount or a spinlock instead.
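
A minimal Python sketch of the refcount option, for illustration only: every holder takes a reference before handing work off, and whoever drops the last reference frees the object, so no sleeping lock is needed in the completion path. The class and names are hypothetical, and the threading.Lock merely stands in for an atomic counter; this is not the kernel code.

```python
import threading


class RefCounted:
    """Object released by whichever holder drops the last reference."""

    def __init__(self, on_release):
        self._refs = 1                  # reference owned by the creator
        self._lock = threading.Lock()   # stands in for an atomic counter
        self._on_release = on_release

    def get(self):
        with self._lock:
            assert self._refs > 0, "get() after the final put()"
            self._refs += 1

    def put(self):
        with self._lock:
            self._refs -= 1
            last = self._refs == 0
        if last:
            # Only the final put() runs the release callback.
            self._on_release()


released = []
ctx = RefCounted(lambda: released.append(True))

ctx.get()                     # worker takes a reference before I/O is submitted
ctx.put()                     # main task finishes first and drops its reference
still_alive = not released    # the worker's reference keeps the context alive
ctx.put()                     # worker completes and drops the last reference
```

The point of the ordering shown is that the main task can finish first without freeing the context out from under a worker that is still doing its wakeup.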

-chris



[GIT PULL] btrfsck and btrfs-image fixes

2015-02-09 Thread Josef Bacik
This series of patches fixes up btrfsck in lots of ways and adds some new
functionality.  These patches were required to fix Hugo's broken multi-disk fs
as well as fix fsck so it would actually pass all of the fsck tests.  This also
fixes a long standing btrfs-image problem where it wouldn't restore multi disk
images onto a single disk properly.  Dave, you can pull this from

https://github.com/josefbacik/btrfs-progs.git for-kdave

These patches all pass make test.  Thanks,

Josef


[PATCH 03/16] Btrfs-progs: handle -eagain properly

2015-02-09 Thread Josef Bacik
If we fix bad blocks during run_next_block we will return -EAGAIN to loop around
and start again.  The deal_with_roots work messed up this handling; this patch
fixes it.  With this patch we can properly deal with broken tree blocks.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 93 +---
 1 file changed, 64 insertions(+), 29 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ca40e35..e74fa0f 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7649,6 +7649,18 @@ static int add_root_item_to_list(struct list_head *head,
 	return 0;
 }
 
+static void free_root_item_list(struct list_head *list)
+{
+	struct root_item_record *ri_rec;
+
+	while (!list_empty(list)) {
+		ri_rec = list_first_entry(list, struct root_item_record,
+					  list);
+		list_del_init(&ri_rec->list);
+		free(ri_rec);
+	}
+}
+
 static int deal_root_from_list(struct list_head *list,
 			       struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root,
@@ -7846,50 +7858,49 @@ again:
 		path.slots[0]++;
 	}
 	btrfs_release_path(&path);
+
+	/*
+	 * check_block can return -EAGAIN if it fixes something, please keep
+	 * this in mind when dealing with return values from these functions, if
+	 * we get -EAGAIN we want to fall through and restart the loop.
+	 */
 	ret = deal_root_from_list(&normal_trees, trans, root,
 				  bits, bits_nr, &pending, &seen,
 				  &reada, &nodes, &extent_cache,
 				  &chunk_cache, &dev_cache, &block_group_cache,
 				  &dev_extent_cache);
-	if (ret < 0)
+	if (ret < 0) {
+		if (ret == -EAGAIN)
+			goto loop;
 		goto out;
+	}
 	ret = deal_root_from_list(&dropping_trees, trans, root,
 				  bits, bits_nr, &pending, &seen,
 				  &reada, &nodes, &extent_cache,
-				  &chunk_cache, &dev_cache, &block_group_cache,
+				  &chunk_cache, &dev_cache,
+				  &block_group_cache,
 				  &dev_extent_cache);
-	if (ret < 0)
+	if (ret < 0) {
+		if (ret == -EAGAIN)
+			goto loop;
 		goto out;
-	if (ret >= 0)
-		ret = check_extent_refs(trans, root, &extent_cache);
-	if (ret == -EAGAIN) {
-		ret = btrfs_commit_transaction(trans, root);
-		if (ret)
-			goto out;
-
-		trans = btrfs_start_transaction(root, 1);
-		if (IS_ERR(trans)) {
-			ret = PTR_ERR(trans);
-			goto out;
-		}
-
-		free_corrupt_blocks_tree(root->fs_info->corrupt_blocks);
-		free_extent_cache_tree(&seen);
-		free_extent_cache_tree(&pending);
-		free_extent_cache_tree(&reada);
-		free_extent_cache_tree(&nodes);
-		free_chunk_cache_tree(&chunk_cache);
-		free_block_group_tree(&block_group_cache);
-		free_device_cache_tree(&dev_cache);
-		free_device_extent_tree(&dev_extent_cache);
-		free_extent_record_cache(root->fs_info, &extent_cache);
-		goto again;
 	}
 
 	err = check_chunks(&chunk_cache, &block_group_cache,
 			   &dev_extent_cache, NULL, NULL, NULL, 0);
-	if (err && !ret)
-		ret = err;
+	if (err) {
+		if (err == -EAGAIN)
+			goto loop;
+		if (!ret)
+			ret = err;
+	}
+
+	ret = check_extent_refs(trans, root, &extent_cache);
+	if (ret < 0) {
+		if (ret == -EAGAIN)
+			goto loop;
+		goto out;
+	}
 
 	err = check_devices(&dev_cache, &dev_extent_cache);
 	if (err && !ret)
@@ -7917,6 +7928,30 @@ out:
 	free_extent_cache_tree(&reada);
 	free_extent_cache_tree(&nodes);
 	return ret;
+loop:
+	ret = btrfs_commit_transaction(trans, root);
+	if (ret)
+		goto out;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
+	free_corrupt_blocks_tree(root->fs_info->corrupt_blocks);
+	free_extent_cache_tree(&seen);
+	free_extent_cache_tree(&pending);
+	free_extent_cache_tree(&reada);
+	free_extent_cache_tree(&nodes);
+	free_chunk_cache_tree(&chunk_cache);
+	free_block_group_tree(&block_group_cache);
+	free_device_cache_tree(&dev_cache);
+	free_device_extent_tree(&dev_extent_cache);
+   

[PATCH] btrfs: remove unused chunk_tree argument in several functions

2015-02-09 Thread Zhaolei
From: Zhao Lei zhao...@cn.fujitsu.com

These functions have included an unused chunk_tree argument from the
beginning. It is time to remove it and clean up the related code that
prepares the value of this argument in the callers.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8a94642..270b401 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2485,8 +2485,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 }
 
 static int btrfs_free_chunk(struct btrfs_trans_handle *trans,
-			    struct btrfs_root *root,
-			    u64 chunk_tree, u64 chunk_objectid,
+			    struct btrfs_root *root, u64 chunk_objectid,
 			    u64 chunk_offset)
 {
 	int ret;
@@ -2578,7 +2577,6 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	struct map_lookup *map;
 	u64 dev_extent_len = 0;
 	u64 chunk_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
-	u64 chunk_tree = root->fs_info->chunk_root->objectid;
 	int i, ret = 0;
 
 	/* Just in case */
@@ -2632,8 +2630,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 			}
 		}
 	}
-	ret = btrfs_free_chunk(trans, root, chunk_tree, chunk_objectid,
-			       chunk_offset);
+	ret = btrfs_free_chunk(trans, root, chunk_objectid, chunk_offset);
 	if (ret) {
 		btrfs_abort_transaction(trans, root, ret);
 		goto out;
@@ -2662,8 +2659,8 @@ out:
 }
 
 static int btrfs_relocate_chunk(struct btrfs_root *root,
-			 u64 chunk_tree, u64 chunk_objectid,
-			 u64 chunk_offset)
+				u64 chunk_objectid,
+				u64 chunk_offset)
 {
 	struct btrfs_root *extent_root;
 	struct btrfs_trans_handle *trans;
@@ -2705,7 +2702,6 @@ static int btrfs_relocate_sys_chunks(struct btrfs_root *root)
 	struct btrfs_chunk *chunk;
 	struct btrfs_key key;
 	struct btrfs_key found_key;
-	u64 chunk_tree = chunk_root->root_key.objectid;
 	u64 chunk_type;
 	bool retried = false;
 	int failed = 0;
@@ -2742,7 +2738,7 @@ again:
 		btrfs_release_path(path);
 
 		if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
-			ret = btrfs_relocate_chunk(chunk_root, chunk_tree,
+			ret = btrfs_relocate_chunk(chunk_root,
 						   found_key.objectid,
 						   found_key.offset);
 			if (ret == -ENOSPC)
@@ -3253,7 +3249,6 @@ again:
 		}
 
 		ret = btrfs_relocate_chunk(chunk_root,
-					   chunk_root->root_key.objectid,
 					   found_key.objectid,
 					   found_key.offset);
 		if (ret && ret != -ENOSPC)
@@ -3955,7 +3950,6 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
 	u64 length;
-	u64 chunk_tree;
 	u64 chunk_objectid;
 	u64 chunk_offset;
 	int ret;
@@ -4025,13 +4019,11 @@ again:
 			break;
 		}
 
-		chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
 		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
 		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
 		btrfs_release_path(path);
 
-		ret = btrfs_relocate_chunk(root, chunk_tree, chunk_objectid,
-					   chunk_offset);
+		ret = btrfs_relocate_chunk(root, chunk_objectid, chunk_offset);
 		if (ret && ret != -ENOSPC)
 			goto done;
 		if (ret == -ENOSPC)
-- 
1.8.5.1



[PATCH v2] Btrfs: fix race waiting for ordered extents at transaction commit

2015-02-09 Thread Filipe Manana
There's a short time window where a race can happen between two or more
tasks that hold a transaction handle for the same transaction and where
one starts the transaction commit before the other tasks attempt to
split their pending ordered extents list into the transaction's pending
ordered extents lists. This results in the transaction commit not waiting
for those ordered extents to complete, in memory leaks of ordered extent
structures and therefore inode leaks too, since an iput for the ordered
extent's inode is done only when the ordered extent's refcount drops to
zero. This race is described by the following sequence diagram:

        CPU 1                                   CPU 2

btrfs_start_transaction()
   started transaction N with
   trans->transaction->num_writers == 1
   and trans->transaction->state ==
     TRANS_STATE_RUNNING

                                        btrfs_sync_file()

                                          btrfs_start_transaction()
                                            --> returns transaction
                                                handle pointing to
                                                transaction N
                                            --> Now transaction N's
                                                num_writers == 2

                                          btrfs_sync_log()

btrfs_commit_transaction()
   btrfs_wait_pending_ordered()
      --> transaction N's ->pending_ordered
          processed and is now an empty list
   set transaction state to TRANS_STATE_COMMIT_DOING
   wait for trans->transaction->num_writers == 1

                                            btrfs_wait_logged_extents()
                                              --> adds ordered extents
                                                  to trans->ordered list

                                          btrfs_end_transaction()
                                            --> trans->ordered list is
                                                spliced into transaction
                                                N's list pending_ordered
                                            --> transaction N's
                                                num_writers becomes 1 now

      wait finished, num_writers == 1
      transaction is committed and it doesn't wait
      for the ordered extents from CPU 2's task to
      complete, nor does it decrement their last
      reference, resulting in memory leaks and
      inode leaks (the iput on the ordered extent's
      inode is done only when the ordered extent's
      refcount drops to zero)

So fix this by processing the transaction's pending_ordered list again
after the number of writers decreases to 1.
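
As a rough illustration only (this is a toy Python model, not kernel
code), the interleaving above and the effect of the fix can be replayed
deterministically. The class and function names below are made up for
the sketch; only the ordering of the steps follows the diagram.

```python
class ToyTransaction:
    """Toy stand-in for transaction N; not the kernel structure."""

    def __init__(self):
        self.num_writers = 1          # CPU 1 holds the first handle
        self.pending_ordered = []     # transaction-wide list
        self.waited = []              # ordered extents the commit waited on

    def wait_pending_ordered(self):
        # Drain the list, "waiting" on every ordered extent found in it.
        while self.pending_ordered:
            self.waited.append(self.pending_ordered.pop(0))


def run_commit(apply_fix):
    """Replay the diagram; return (waited, leaked) ordered extents."""
    trans = ToyTransaction()

    # CPU 2: btrfs_start_transaction() joins transaction N.
    trans.num_writers += 1
    handle_ordered = []               # CPU 2's per-handle ordered list

    # CPU 1: commit starts and processes a still-empty pending list.
    trans.wait_pending_ordered()

    # CPU 2: logged extents land on the handle, then the handle ends
    # and its list is spliced into the transaction's pending list.
    handle_ordered.append("ordered extent A")
    trans.pending_ordered.extend(handle_ordered)
    trans.num_writers -= 1

    # CPU 1: the wait for num_writers == 1 finishes.
    assert trans.num_writers == 1
    if apply_fix:
        # The fix: process the pending list again at this point.
        trans.wait_pending_ordered()

    leaked = list(trans.pending_ordered)
    return trans.waited, leaked
```

Calling run_commit(False) leaves the ordered extent on the pending list
with nobody waiting on it, while run_commit(True) drains it before the
commit proceeds.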

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse thermal_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio floppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: GW  3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709]  0009 8801342c3c78 8142425e 
88023ec8f2d8
[ 2612.459829]   8801342c3cb8 81045308 
88004646
[ 2612.461564]  a036da56 88003d07b000 88004646 
880046460068
[ 2612.463163] Call Trace:
[ 2612.463719]  [8142425e] dump_stack+0x4c/0x65
[ 2612.464789]  [81045308] warn_slowpath_common+0xa1/0xbb
[ 2612.466026]  [a036da56] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247]  [810453c5] warn_slowpath_null+0x1a/0x1c
[ 2612.468416]  [a036da56] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625]  [a036f2a7] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251]  [a036f353] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536]  [8142612e] ? wait_for_completion+0x24/0x26
[ 2612.473742]  [a0370bbc] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477]  [81059d1d] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695]  [a034e3da] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911]  [81153e53] generic_shutdown_super+0x73/0xef
[ 2612.479106]  [811540e2] 

Repair broken btrfs raid6?

2015-02-09 Thread Tobias Holst
Hi

I'm having some trouble with my six-drive btrfs raid6 (each drive
encrypted with LUKS). First of all: yes, I do have backups, but it may
take at least days, maybe weeks or even some months to restore
everything from the (offsite) backups. So it is not essential to
recover the data, but it would be great ;-)

OS: Ubuntu 14.04
Kernel: 3.19.0
btrfs-progs: 3.19-rc2

When booting my server I am getting this in the syslog:
 [8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
 [8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
 [8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
 [8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
 [8.70] BTRFS info (device dm-3): force lzo compression
 [8.74] BTRFS info (device dm-3): disk space caching is enabled
 [8.556310] BTRFS: failed to read the system array on dm-3
 [8.592135] BTRFS: open_ctree failed
 [9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
 [9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
Looks like there is something wrong on drive 3, giving me open_ctree
failed. I have to press S to skip mounting of the btrfs volume. It
boots and with sudo mount --all I can successfully mount the btrfs
volume. Sometimes it takes one or two minutes but it will mount.

After a while I am sometimes/randomly getting this in the syslog:
 [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
Looks like something else is broken on dm-5... But shouldn't this be
repaired with the new raid56-repair-features of kernel 3.19?

After some more time I am getting this:
 [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
Then it is not possible to access the mounted volume anymore. I have
to umount -l to unmount it and then I can remount it. Until it
happens again (after some time)...

I also tried a balance and a scrub but they crash. Syslog is full of
messages like the following examples:
 [ 3355.523157] csum_tree_block: 53 callbacks suppressed
 [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
 [ 4006.935632] BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
and btrfs scrub status /[device] gives me the following output:
 scrub status for [UUID]
     scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
     total bytes scrubbed: 113.04GiB with 0 errors

So a short summary:
- btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
- does not mount at boot up, open_ctree failed (disk 3)
- mounts successfully after bootup
- randomly checksum verify failed (disk 5)
- balance and scrub crash after some time
- after a while the volume gets unreadable, saying parent transid
verify failed (disk 4 or 5)

And it looks like there still is no way to btrfsck a raid6.

Any ideas how to repair this filesystem?

Regards,
Tobias


Re: btrfs raid5 with mixed disks

2015-02-09 Thread Hugo Mills
On Mon, Feb 09, 2015 at 05:24:42PM -0500, Rich Freeman wrote:
> How does btrfs raid5 handle mixed-size disks?  The docs weren't
> terribly clear on this.
>
> Suppose I have 4x3TB and 1x1TB disks.  Using conventional lvm+mdadm in
> raid5 mode I'd expect to be able to fit about 10TB of space on those
> (2TB striped across 4 disks plus 1TB striped across 5 disks after
> partitioning).  How much would btrfs be able to store in the same
> configuration?  I did see something about being able to use fixed-size
> stripes, and I'm not sure if this helps.  If it does, are there any
> penalties, especially with future expansion of the array?
>
> With raid1 mode btrfs is reasonably smart about mixed disk sizes, and
> you usually end up with half of the total space available.

   http://carfax.org.uk/btrfs-usage/ may be useful here.
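
For a back-of-the-envelope check of the numbers in the question, here is
a small Python sketch. It rests on two assumptions that are not stated
in this thread: that raid5 chunks are striped across every device that
still has free space (minimum two devices, one device's worth of each
stripe going to parity), and that raid1 usable space is the smaller of
half the total and the total minus the largest device. The function
names are invented for the example; treat the output as an estimate,
not as what the allocator is guaranteed to do.

```python
def raid5_usable(sizes, min_devs=2):
    """Estimate raid5 usable space for mixed device sizes (same units in/out).

    Assumes each stripe spans all devices that still have free space and
    that one device's share of every stripe holds parity.
    """
    free = list(sizes)
    usable = 0
    while True:
        free = [f for f in free if f > 0]
        n = len(free)
        if n < min_devs:
            break
        # Fill until the smallest remaining device is exhausted.
        band = min(free)
        usable += band * (n - 1)
        free = [f - band for f in free]
    return usable


def raid1_usable(sizes):
    """Estimate raid1 usable space: every block needs two different devices."""
    total = sum(sizes)
    return min(total / 2, total - max(sizes))
```

With sizes [3, 3, 3, 3, 1] (in TB) the raid5 estimate comes out at 10,
matching the lvm+mdadm figure in the question: 1TB across five disks
gives 4TB, and the remaining 2TB across four disks gives 6TB.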

   Hugo.

-- 
Hugo Mills | It was half way to Rivendell when the drugs began
hugo@... carfax.org.uk | to take hold
http://carfax.org.uk/  |  Hunter S Tolkien
PGP: 65E74AC0  |Fear and Loathing in Barad Dûr



Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread Brendan Hide
On 2015/02/09 10:30 PM, Kai Krakow wrote:
> Brendan Hide <bren...@swiftspirit.co.za> wrote:
>
>> I have the following two lines in
>> /etc/udev/rules.d/61-persistent-storage.rules for two old 250GB
[snip]
> Wouldn't it be easier and more efficient to use this:
>
> ACTION=="add|change", KERNEL=="sd[a-z]", ENV{ID_SERIAL}=="...",
> ATTR{device/timeout}="120"
>
> Otherwise you always spawn a shell and additional file descriptors,
> and you could spare a variable interpolation. Tho it probably
> depends on your udev version...
>
> I'm using this and it works setting the attributes (set deadline on
> SSD):
>
> ACTION=="add|change", KERNEL=="sd[a-z]",
> ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
>
> And, I think you missed the double-equal "==" behind ENV{}...
> Right? Otherwise you just assign a value. Tho, you could probably
> match on ATTR{devices/model} instead to be more generic (the serial
> is probably too specific). You can get those from the
> /sys/block/sd* subtree.

It is certainly possible that it isn't 100% the right way - but it has
been working. Your suggestions certainly sound more
efficient/canonical. I was following what I found online until it
worked.  :)

I'll make the appropriate adjustments and test.

Thanks!

-- 
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: btrfs performance, sudden drop to 0 IOPs

2015-02-09 Thread Duncan
P. Remek posted on Mon, 09 Feb 2015 18:26:49 +0100 as excerpted:

> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with fio
> utility, I noticed following two things:
>
> 1) On first run when target file doesn't exist yet, performance is about
> 8000 IOPs. On second, and every other run, performance goes up to 70000
> IOPs. It's a massive difference. The target file is the one created during
> the first run.

You say a file size of 10 GiB with a block size of 4 KiB, but don't say 
whether you're using the autodefrag mount option, or whether you had set 
nocow on the file at creation (generally done by setting it on the 
directory, so new files inherit the option, chattr +C).

What I /suspect/ is happening, is that at the 10 GiB file size, on 
original file creation, btrfs is creating a large file of several 
comparatively large extents (possibly 1 GiB each, the nominal data chunk 
size, tho it can be larger on large enough filesystems).  Note that btrfs 
will normally wait to sync, accumulating further writes into the file 
before actually writing it.  By default it's 30 seconds, but there's a 
mount option to change that.  So btrfs is probably waiting, then writing 
out all changes for the last 30 seconds at once, allowing it to use 
fairly large extents when it does so.

Then when the file already exists, keeping in mind that btrfs is COW 
(copy-on-write) and that by default it keeps two copies of metadata (dup 
on a single device, or one each on two separate devices, on a multi-
device filesystem), one copy of data (single on a single device, I 
believe raid0 on multi-device), it's having to COW individual 4K blocks 
within the file as they are rewritten.

This is going to massively fragment the file, driving up IOPs 
tremendously.  On top of that, each time a data fragment is written, 
there's going to be two metadata updates due to the dup/raid1 metadata 
default, and while they won't be updated immediately, every commit (30 
seconds), those metadata changes are going to replicate up the metadata 
tree to its root.

So instead of having a few orderly GiB-ish size extents written, along 
with their metadata, as at file-create, now you're writing a new extent 
for each changed 4 KiB block, plus 2X metadata updates for each one, plus 
every commit, the updated metadata chain up to the root.

Those 70K IOPs are all the extra work the filesystem is doing in order 
to track those 4 KiB COWed writes!
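
To put rough numbers on that, here is a small Python calculation. It
uses only the figures mentioned above (a 10 GiB file, 4 KiB blocks) and
the guess that the initial extents are about 1 GiB each; that extent
size is an assumption carried over from the paragraph above, not a
measured value, and the worst case assumes every block ends up
rewritten once.

```python
KIB = 1024
GIB = 1024 ** 3

FILE_SIZE = 10 * GIB           # size of the fio target file
BLOCK_SIZE = 4 * KIB           # fio random-write block size
ASSUMED_EXTENT_SIZE = 1 * GIB  # guessed extent size at file creation

# Extents right after the file was first written out sequentially.
EXTENTS_AT_CREATE = FILE_SIZE // ASSUMED_EXTENT_SIZE

# Worst case after random rewrites: every 4 KiB block has been COWed
# into its own extent.
BLOCKS_REWRITTEN = FILE_SIZE // BLOCK_SIZE

# How many times more extents the filesystem then has to track.
GROWTH_FACTOR = BLOCKS_REWRITTEN // EXTENTS_AT_CREATE
```

That is 10 extents growing to as many as 2621440, before counting the
duplicated metadata updates for each of them.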

The autodefrag option will likely increase this even further, as it 
doesn't prevent the COWs, but instead, queues up any files it detects as 
fragmented, for later cleanup via autodefrag worker thread.  This is one 
reason this option isn't recommended for large (say quarter to half-gig-
plus) heavy-internal-rewrite-pattern use-cases (typically VM images or 
large database files), tho it works quite well for files up to a couple 
hundred MiB or so (typical of firefox sqlite database files, etc), since 
those get rewritten pretty fast.

The nocow file attribute can be used on these larger files, but it does 
have additional implications.  Nocow turns off btrfs compression for that 
file, if you had it enabled (mount option), and also turns off 
checksumming.  Turning off checksumming means btrfs will no longer detect 
file corruption, but many databases and vm tools have their own 
corruption detection and possibly correction schemes already, since they 
use them on filesystems such as ext* that don't have builtin 
checksumming, so turning off the btrfs checksumming and error detection 
for these files isn't as bad as it would otherwise seem, and in many 
cases prevents the filesystem duplicating work that the application is 
already doing.  (Also, on btrfs, nocow must be set at file creation, when 
it is still zero-sized.  As mentioned above, this is usually accomplished 
by setting it on the directory and letting new files and subdirs inherit 
the attribute.)
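
A minimal Python sketch of that directory-first approach follows. It
simply shells out to chattr +C as described above; the helper name and
its runner parameter are invented for the example, and it assumes a
btrfs mount and the chattr utility being available.

```python
import os
import subprocess


def make_nocow_dir(path, runner=subprocess.run):
    """Create a directory and mark it nocow so new files inherit the flag.

    `runner` is a hypothetical hook (defaults to subprocess.run) so the
    command can be inspected without touching a real filesystem.
    """
    os.makedirs(path, exist_ok=True)
    # chattr +C sets the no-copy-on-write attribute; it has to be in
    # place before any data is written to the files created here.
    runner(["chattr", "+C", path], check=True)
    return path
```

The benchmark file would then be created fresh inside that directory,
rather than setting the attribute on the already-written 10 GiB file.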

But with the nocow file attribute properly applied, these random rewrites 
will be done in-place, no cascading fragmentation and metadata updates, 
and my guess is that you'll see the IOPs on existing nocow files reduce 
to something far more sane as a result.

> 2) There are windows during the test where IOPs drop to 0 and stay 0
> about 10 seconds and then it goes back again, and after couple of
> seconds again to 0. This is reproducible 100% times.

I recall this periodic behavior coming up in at least one earlier thread 
as well, but I'm not a dev, just a btrfs user and list regular, and I 
don't recall what the explanation was, unless it was related to internal 
btrfs bookkeeping due to that 30-second commit cycle I mentioned above.

But I'm guessing that if you properly set nocow on the file, you'll 
probably see this go away as well, since you won't be overwhelming btrfs 
and the hardware with IOPs any longer.

Perhaps someone with a better understanding of the situation will jump in 
and explain this bit better than I can...

 Can