Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-05 Thread Qu Wenruo



Nicholas D Steeves wrote on 2016/04/05 23:47 -0400:

On 4 April 2016 at 12:55, David Sterba  wrote:

Not exactly. If we are using an unsafe hash, e.g. MD5, we will use MD5
alone for both the in-memory and on-disk backends. No SHA256 on top.


I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
murmur. As they're both orders of magnitude faster than sha1/md5, we can
actually hash with both to reduce collisions.


I don't quite like the idea of using 2 hashes rather than 1.
Yes, some programs like rsync use this method, but it also involves a
lot of details, like the order in which to store them on disk.


I'm considering fast-but-unsafe hashes for the in-memory backend, where
the speed matters and we cannot hide the slow sha256 calculations behind
the IO (i.e. no point saving microseconds if the IO is going to take
milliseconds).


In that case, on an MD5 hit, we will do a full byte-to-byte
comparison. It may be slow or fast, depending on the cache.


If the probability of hash collision is low, the number of needed
byte-to-byte comparisons is also low.
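
A minimal sketch of the flow being discussed here — a fast, unsafe hash as
the first filter, with a byte-to-byte comparison only on a hit (userspace
C for illustration; the names and the flat table are hypothetical, not the
actual btrfs dedupe API):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct dedup_entry {
	uint64_t fast_hash;	/* e.g. xxhash of the block */
	const uint8_t *data;	/* contents of the block already stored */
};

/* Return the matching entry, or NULL if the block is unique. */
static struct dedup_entry *
dedup_lookup(struct dedup_entry *table, size_t n,
	     const uint8_t *block, size_t len,
	     uint64_t (*fast_hash)(const void *, size_t))
{
	size_t i;
	uint64_t h = fast_hash(block, len);

	for (i = 0; i < n; i++) {
		if (table[i].fast_hash != h)
			continue;		/* cheap rejection path */
		if (memcmp(table[i].data, block, len) == 0)
			return &table[i];	/* verified duplicate */
		/* collision: same fast hash, different data; keep looking */
	}
	return NULL;
}

The fast hash filters out almost every block cheaply; the expensive
memcmp() (or a second, stronger hash) runs only on the rare candidates,
which is why a low collision probability keeps the number of byte-to-byte
comparisons low.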


It is unlikely that I will use dedupe, but I imagine your work will
apply to the following wishlist:

1. Allow disabling of the memory-backend hash via a kernel argument,
sysctl, or mount option for those of us who have ECC RAM.
 * page_cache never gets pushed to swap, so this should be safe, no?


Why not use the current ioctl to disable dedupe?

And why is it related to ECC RAM? To avoid memory corruption which would
finally lead to file corruption?

If so, it makes sense.

Also, I didn't get your point about page_cache.
For the hash pool, we don't use the page cache. We just use kmalloc,
which won't be swapped out.

The file page cache is not affected at all.
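
A rough sketch of the distinction, with hypothetical structures (the real
dedupe patchset's types differ):

/* hypothetical in-memory hash pool entry */
struct dedupe_hash_entry {
	u64 bytenr;		/* extent this digest describes */
	u8 hash[32];		/* SHA256 digest of the extent data */
	struct rb_node node;	/* membership in the in-memory lookup tree */
};

static struct dedupe_hash_entry *alloc_hash_entry(void)
{
	/*
	 * kmalloc() returns slab memory: kernel-owned and never written
	 * out to swap, unlike file-backed page cache pages.
	 */
	return kmalloc(sizeof(struct dedupe_hash_entry), GFP_NOFS);
}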



2. Implementing an intelligent cache so that it's possible to offset
the cost of hashing the most actively read data.  I'm guessing there's
already some sort of weighted cache eviction algorithm in place, but I
don't yet know how to look into it, let alone enough to leverage it...


I'm not quite a fan of such an intelligent but complicated cache design.
The main problem is that we would be putting policy into kernel space.

Currently, you either use the least-recently-used in-memory backend, or
the all-in on-disk backend.
Users who want more precise control over which files/dirs shouldn't go
through dedupe have the btrfs prop interface to set a per-file flag that
avoids dedupe.



 * on the topic of leaning on the cache, I've been thinking about
ways to optimize reads, while minimizing seeks on multi-spindle raid1
btrfs volumes.  I'm guessing that someone will commit a solution
before I manage to teach myself enough about filesystems to contribute
something useful.

That's it, in terms of features I want ;-)

It's probably a well-known fact, but sha512 is roughly 40 to 50%
faster than sha256, and 40 to 50% slower than sha1 on my 1200-series
Xeon v3 (Haswell), for 8192-byte blocks.


Sadly, I didn't know that until recently. :(
Otherwise I would have implemented the SHA512 hash algorithm instead of
SHA256.

Anyway, it's not that hard to add a new hash algorithm.

Thanks for your comments.
Qu



Wish I could do more right now!
Nicholas


Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk

2016-04-05 Thread Nicholas D Steeves
Hi Alex,

On 13 March 2016 at 05:51, Alex Lyakas  wrote:
> Nicholas,
>
> On Sat, Mar 12, 2016 at 12:19 AM, Nicholas D Steeves  
> wrote:
>> On 10 March 2016 at 06:10, Alex Lyakas  wrote:
>> Does this mean there is a good chance that everyone has corrupted
>> metadata?
> No, this definitely does not.
>
> The code that I added prevents btrfs from writing a metadata block, if
> it somehow got corrupted before being sent to disk. If it happens, it
> indicates a bug somewhere in the kernel. For example, if some other
> kernel module erroneously uses a page-cache entry, which does not
> belong to it (and contains btrfs metadata block or part of it).

Oh wow, I didn't know that was possible.  If I understand correctly,
this patch makes using bcache a little bit safer?  (I don't use it,
since I'm too short on free time to risk what is--I suspect--something
that radically increases the chances of having to restore from backup.)

>> Is there any way to verify/rebuild it without wipefs+mkfs+restore from 
>> backups?
> To verify btrfs metadata: unmount the filesystem and run "btrfs check
> ...". Do not specify the "repair" parameter. Another way to verify is
> to run "btrfs-debug-tree" and redirect its standard output to
> /dev/null. It should not print anything to standard error. But "btrfs
> check" is faster.

Ah, that's exactly what I was looking for!  Thank you.  It took
forever, and brought me back to what it was like to fsck large ext2
volumes.  Is btrfs check conceptually identical to a read-only fsck of
an ext2 volume?  If not, how does it differ?

Are the following sorts of errors still an issue?:
Extent back ref already exists for 2148837945344 parent 0 root 257
leaf parent key incorrect 504993210368
bad block 504993210368
( https://btrfs.wiki.kernel.org/index.php/Btrfsck )

Cheers,
Nicholas


Re: dstat shows unexpected result for two disk RAID1

2016-04-05 Thread Nicholas D Steeves
On 11 March 2016 at 20:20, Chris Murphy  wrote:
> On Fri, Mar 11, 2016 at 5:10 PM, Nicholas D Steeves  
> wrote:
>> P.S. Rather than parity, I mean instead of distributing into stripes, do a 
>> copy!
>
> raid56 is by definition parity based, so I'd say no; it's confusing to
> turn it into something it's not.

I just found the Multiple Device Support diagram.  I'm trying to
figure out how hard it's going to be for me to get up to speed, because
I've only ever casually and informally read about filesystems.  I worry
that because I didn't study filesystem design in school, and because
everything I worked on was in C++...well, the level of sophistication
and design might be beyond what I can learn.  What do you think?  Can
you recommend any books on file system design that will provide what
is necessary to understand btrfs?

Cheers,
Nicholas


Re: btrfs-progs 4.4 with linux-3.16.7 (with truncation of extents patch)

2016-04-05 Thread Nicholas D Steeves
Dear Duncan,

Gmail seems to have mangled the draft of my reply :-/  It's at the bottom.

On 06/02/16 12:57 AM, Duncan wrote:

Most of the rest of the userspace tools, in particular, btrfs scrub,
subvolume, balance, device, filesystem, send, receive, etc, work by
making kernel calls to do the actual work in any case, and they will
use the old calls if they need to. The compatibility discussion,
meanwhile, is on making mkfs.btrfs (and btrfs-convert) check the
running kernel and taking its defaults from what that kernel supports,
instead of choosing arbitrary defaults that may be better when
supported, but that older kernels don't actually support. Of course
there will still be options to set these as desired regardless of
defaults, just as there are now, so people who are, for instance, booted
to an old recovery kernel for system maintenance can still choose
whatever options that version of mkfs.btrfs supports if they know
they'll actually be mounting with a newer kernel, but the idea is
simply to have mkfs.btrfs act more sanely /by/ /default/ when run on
old kernels, so those same old kernels can actually mount a filesystem
created with defaults. Along that line, as a distro maintainer of the
btrfs-progs package, you may wish to patch the mkfs.btrfs defaults to
what your kernel supports.

Btrfs-progs will probably ship with kernel-sensitive defaults some
time in the future (userspace 4.5 release, probably), but it doesn't
do so yet...
--

Thank you very much for taking the time to write such a thorough
reply. I'm not the maintainer of Debian's btrfs-progs package, but I
am investigating the issues preventing the addition of btrfs-progs-4.4
to the backports repository. [ edit: sorry it took me so long to
reply, I've been swamped with work.  In the meantime, it seems v4.4
has made it into the backports without warnings or compatibility
checks, so I want to get my facts straight and patch the package
with some kind of notice/alert asap, if only through
debian/NEWS...since there isn't currently a way to depend on a
particular kernel series, or even a kernel version <= 4.4.0 ]

Cheers,
Nicholas


Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-05 Thread Nicholas D Steeves
On 4 April 2016 at 12:55, David Sterba  wrote:
>> >> Not exactly. If we are using an unsafe hash, e.g. MD5, we will use MD5
>> >> alone for both the in-memory and on-disk backends. No SHA256 on top.
>> >
>> > I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>> > murmur. As they're both orders of magnitude faster than sha1/md5, we can
>> > actually hash with both to reduce collisions.
>>
>> I don't quite like the idea of using 2 hashes rather than 1.
>> Yes, some programs like rsync use this method, but it also involves a
>> lot of details, like the order in which to store them on disk.
>
> I'm considering fast-but-unsafe hashes for the in-memory backend, where
> the speed matters and we cannot hide the slow sha256 calculations behind
> the IO (i.e. no point saving microseconds if the IO is going to take
> milliseconds).
>
>> >> In that case, on an MD5 hit, we will do a full byte-to-byte
>> >> comparison. It may be slow or fast, depending on the cache.
>> >
>> > If the probability of hash collision is low, the number of needed
>> > byte-to-byte comparisons is also low.

It is unlikely that I will use dedupe, but I imagine your work will
apply to the following wishlist:

1. Allow disabling of the memory-backend hash via a kernel argument,
sysctl, or mount option for those of us who have ECC RAM.
* page_cache never gets pushed to swap, so this should be safe, no?
2. Implementing an intelligent cache so that it's possible to offset
the cost of hashing the most actively read data.  I'm guessing there's
already some sort of weighted cache eviction algorithm in place, but I
don't yet know how to look into it, let alone enough to leverage it...
* on the topic of leaning on the cache, I've been thinking about
ways to optimize reads, while minimizing seeks on multi-spindle raid1
btrfs volumes.  I'm guessing that someone will commit a solution
before I manage to teach myself enough about filesystems to contribute
something useful.

That's it, in terms of features I want ;-)

It's probably a well-known fact, but sha512 is roughly 40 to 50%
faster than sha256, and 40 to 50% slower than sha1 on my 1200-series
Xeon v3 (Haswell), for 8192-byte blocks.

Wish I could do more right now!
Nicholas


Possible Double Freeing of dentry in check_parent_dirs_for_sync

2016-04-05 Thread Bastien Philbert
Greetings All,
After some tracing I am not certain if this is correct, due to being newer
to the btrfs codebase. However, if someone more experienced can show me
whether I am missing something in my traces, please let me know :)
Firstly here is the bug trace or the part that matters:
[ 7195.792492] [ cut here ]
[ 7195.792532] WARNING: CPU: 0 PID: 5352 at 
/home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 
[btrfs]
[ 7195.792535] Modules linked in: bnep binfmt_misc intel_rapl 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel samsung_laptop kvm 
irqbypass crct10dif_pclmul crc32_pclmul btusb ghash_clmulni_intel btrtl btbcm 
btintel cryptd snd_hda_codec_hdmi uvcvideo bluetooth snd_hda_codec_realtek 
videobuf2_vmalloc snd_hda_codec_generic videobuf2_memops arc4 videobuf2_v4l2 
snd_hda_intel input_leds videobuf2_core snd_hda_codec joydev snd_hda_core 
iwldvm serio_raw snd_hwdep videodev snd_pcm mac80211 media snd_seq_midi 
snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlwifi snd_timer 
cfg80211 snd lpc_ich mei_me soundcore shpchp mei dell_smo8800 mac_hid 
parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq hid_generic usbhid hid 
i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect psmouse sysimgblt 
fb_sys_fops
[ 7195.792593]  drm r8169 ahci libahci mii wmi video fjes
[ 7195.792602] CPU: 0 PID: 5352 Comm: aptitude Not tainted 
4.6.0-040600rc1-generic #201603261930
[ 7195.792604] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 
530U3C/530U4C/SAMSUNG_NP1234567890, BIOS P14AAJ 04/15/2013
[ 7195.792607]  0286 002cb880 8800c91f3d40 
813ee833
[ 7195.792611]    8800c91f3d80 
8108275b
[ 7195.792614]  242da71863c0 8800209bca58 8800209bca58 
880034eda000
[ 7195.792618] Call Trace:
[ 7195.792626]  [] dump_stack+0x63/0x90
[ 7195.792631]  [] __warn+0xcb/0xf0
[ 7195.792635]  [] warn_slowpath_null+0x1d/0x20
[ 7195.792658]  [] btrfs_destroy_inode+0x247/0x2c0 [btrfs]
[ 7195.792663]  [] destroy_inode+0x3b/0x60
[ 7195.792666]  [] evict+0x136/0x1a0
[ 7195.792670]  [] iput+0x1ba/0x240
[ 7195.792673]  [] __dentry_kill+0x18d/0x1e0
[ 7195.792676]  [] dput+0x12b/0x220
[ 7195.792680]  [] SyS_rename+0x2f4/0x3c0
[ 7195.792686]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[ 7195.792689] ---[ end trace e42100b57fd49606 ]---
[ 7464.416637] perf: interrupt took too long (3157 > 3146), lowering 
kernel.perf_event_max_sample_rate to 63250
[ 9697.609514] perf: interrupt took too long (3950 > 3946), lowering 
kernel.perf_event_max_sample_rate to 50500
Firstly, we start in the btrfs function for renaming, btrfs_rename, which
in turn calls btrfs_log_new_name, which in turn calls
check_parent_dirs_for_sync. This is where I got confused though and
wanted a more experienced viewpoint on this:
	parent = dget_parent(parent);
	dput(old_parent);
	old_parent = parent;
	inode = d_inode(parent);
}

Are we not double freeing and causing a NULL pointer dereference here?
	dput(old_parent);
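
For reference, the same pattern with the reference counting annotated
(assuming standard VFS dget/dput semantics — a sketch, not the btrfs
source):

	/* each iteration: take a reference on the next parent, then
	 * drop the reference still held from the previous iteration */
	parent = dget_parent(parent);	/* +1 on the new parent */
	dput(old_parent);		/* -1 on the previous parent */
	old_parent = parent;		/* remember which ref we now hold */
	inode = d_inode(parent);
}
/* after the loop, exactly one reference (the last old_parent) is
 * still held, and this final dput() drops it */
dput(old_parent);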
Sorry for the stupid question :(,
Bastien


Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread NeilBrown
On Wed, Apr 06 2016, Shaohua Li wrote:

> On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
>> blk_check_plugged() will return a pointer
>> to an object linked on current->plug->cb_list.
>> 
>> That list may "at any time" be implicitly cleared by
>> blk_flush_plug_list()
>>  flush_plug_callbacks()
>> either as a result of blk_finish_plug(),
>> or implicitly by schedule() [and maybe other implicit mechanisms?]
>> 
>> If there is no protection against an implicit unplug
>> between the call to blk_check_plug() and using its return value,
>> that implicit unplug may have already happened,
>> even before the plug is actually initialized or populated,
>> and we may be using a pointer to already free()d data.
>
> This isn't correct. flush plug is never called on preemption; it is
> designed to be called only when the task is going to sleep. See
> sched_submit_work. Am I missing anything?

Ahh yes, thanks.

Only two places call blk_schedule_flush_plug().
One is io_schedule_timeout() which must be called explicitly.
The other is, as you say, sched_submit_work.  It starts:

static inline void sched_submit_work(struct task_struct *tsk)
{
if (!tsk->state || tsk_is_pi_blocked(tsk))
return;

so if the task is runnable, then as
include/linux/sched.h:#define TASK_RUNNING  0

it will never call blk_schedule_flush_plug().

So I don't think you are missing anything, we were.

Lars:  have your concerns been relieved or do you still have reason to
think there is a problem?

Thanks,
NeilBrown




[PATCH] btrfs: qgroup: Fix qgroup accounting when creating snapshot

2016-04-05 Thread Qu Wenruo
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_account_extents() is only called
inside btrfs_commit_transaction(), just before commit_cowonly_roots().

However there is an exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() without any commit root switch.

In the case of creating a snapshot whose parent root is itself (creating
a snapshot of the fs tree), it will corrupt qgroups, as the following
trace shows:
(skipped unrelated data)
==
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots 
= 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 
0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 
16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots 
= 0, nr_new_roots = 0
==

The problem is that in the first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But in the second qgroup_account_extent(), its reference got decreased,
and between these two qgroup_account_extent() calls there was no root
switch. This leads to the same nr_old_roots, and this extent just gets
ignored by qgroup, which means this extent is wrongly accounted.

Fix it by calling commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with the needed preparation.

Reported-by: Mark Fasheh 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/transaction.c | 66 +-
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..a3eb8ac 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1516,6 +1516,55 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
goto fail;
}
 
+   /*
+* Account qgroups before inserting the dir item, as the dir item
+* insert will modify parent_root, which could be the src root. If we
+* don't do it now, wrong accounting may be inherited by the snapshot
+* qgroup.
+*
+* For the reason for locking tree_log_mutex, see the comment in
+* btrfs_commit_transaction()
+*/
+   mutex_lock(&root->fs_info->tree_log_mutex);
+
+   ret = commit_fs_roots(trans, root);
+   if (ret) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   goto fail;
+   }
+
+   btrfs_apply_pending_changes(root->fs_info);
+
+   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
+   if (ret < 0) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   goto fail;
+   }
+   ret = btrfs_qgroup_account_extents(trans, root->fs_info);
+   if (ret < 0) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   goto fail;
+   }
+   /*
+* Now qgroups are all updated, we can inherit them to the new qgroup
+*/
+   ret = btrfs_qgroup_inherit(trans, fs_info,
+  root->root_key.objectid,
+  objectid, pending->inherit);
+   if (ret < 0) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   goto fail;
+   }
+   /*
+* qgroup_account_extents() must be followed by a
+* switch_commit_roots(), or the next qgroup_account_extents() will
+* be corrupted
+*/
+   ret = commit_cowonly_roots(trans, root);
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   if (ret)
+   goto fail;
+
ret = btrfs_insert_dir_item(trans, parent_root,
dentry->d_name.name, dentry->d_name.len,
parent_inode, &key,
@@ -1559,23 +1608,6 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
goto fail;
}
 
-   /*
-* account qgroup counters before qgroup_inherit()
-*/
-   ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-   if (ret)
-   goto fail;
-   ret = btrfs_qgroup_account_extents(trans, fs_info);
-   if (ret)
-   goto fail;
-   ret = btrfs_qgroup_inherit(trans, fs_info,
-  root->root_key.objectid,
-  objectid, pending->inherit);
-   if (ret) {
-   btrfs_abort_transaction(trans, root, ret);
-   goto fail;
-   }
-
 fail:
pending->error = ret;
 dir_item_existed:
-- 
2.8.0





Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread Shaohua Li
On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
> 
> That list may "at any time" be implicitly cleared by
> blk_flush_plug_list()
>  flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]
> 
> If there is no protection against an implicit unplug
> between the call to blk_check_plug() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.

This isn't correct. flush plug is never called on preemption; it is designed
to be called only when the task is going to sleep. See sched_submit_work. Am I
missing anything?

Thanks,
Shaohua


Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread Chris Mason
On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
> 
> That list may "at any time" be implicitly cleared by
> blk_flush_plug_list()
>  flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]
> 
> If there is no protection against an implicit unplug
> between the call to blk_check_plug() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.
> 
> I suggest that both raid1 and raid10 can easily be fixed
> by moving the call to blk_check_plugged() inside the spinlock.
> 
> For md/raid5 and btrfs/raid56,
> I'm unsure how (if) this needs to be fixed.

I think you're right, digging in to see if there's something I missed.
But as Neil said, it looks like we just got saved by preemption being
off by default.

-chris


Re: Scrub priority, am I using it wrong?

2016-04-05 Thread Gareth Pye
Yeah, RAID5. I'm now doing pause and resume on it to let it take
multiple nights; the idle priority should let other processes complete
in reasonable time.

On Wed, Apr 6, 2016 at 3:34 AM, Henk Slager  wrote:
> On Tue, Apr 5, 2016 at 4:37 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:
>>
>>> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
>>> spinning disks. The server is moderately loaded with various regular
>>> tasks that use a fair bit of disk IO, but I've scheduled my weekly btrfs
>>> scrub for the best quiet time in the week.
>>>
>>> The command that is run is:
>>> /usr/local/bin/btrfs scrub start -Bd -c idle /data
>>>
>>> Which is my best attempt to try and get it to have a low impact on user
>>> operations
>>>
>>> But iotop shows me:
>>>
>>> 1765 be/4 root   14.84 M/s0.00 B/s  0.00 % 96.65 % btrfs scrub
>>> start -Bd -c idle /data
>>>  1767 be/4 root   14.70 M/s0.00 B/s  0.00 % 95.35 % btrfs
>>> scrub start -Bd -c idle /data
>>>  1768 be/4 root   13.47 M/s0.00 B/s  0.00 % 92.59 % btrfs
>>> scrub start -Bd -c idle /data
>>>  1764 be/4 root   12.61 M/s0.00 B/s  0.00 % 88.77 % btrfs
>>> scrub start -Bd -c idle /data
>>>  1766 be/4 root   11.24 M/s0.00 B/s  0.00 % 85.18 % btrfs
>>> scrub start -Bd -c idle /data
>>>  1763 be/4 root7.79 M/s0.00 B/s  0.00 % 63.30 % btrfs
>>> scrub start -Bd -c idle /data
>>> 28858 be/4 root0.00 B/s  810.50 B/s  0.00 % 61.32 % [kworker/u16:25]
>>>
>>>
>>> Which doesn't look like an idle priority to me. And the system sure
>>> feels like a system with a lot of heavy io going on. Is there something
>>> I'm doing wrong?
>
> The throughput numbers make me think that the filesystem is raid5 or
> raid6. On single or raid1 or raid10, one easily gets around 100M/s
> without noticing or feeling heavy IO ongoing, mostly independent of
> scrub options.
>
>> Two points:
>>
>> 1) It appears btrfs scrub start's -c option only takes numeric class, so
>> try -c3 instead of -c idle.
>
> Thanks to Duncan for pointing this out. I don't remember exactly, but
> I think I also had issues with this in the past, but did not realize it
> or look into it further.
>
>> Works for me with the numeric class (same results as you with spelled out
>> class), tho I'm on ssd with multiple independent btrfs on partitions, the
>> biggest of which is 24 GiB, 18.something GiB used, which scrubs in all of
>> 20 seconds, so I don't need and hadn't tried the -c option at all until
>> now.
>>
>> 2) What a difference an ssd makes!
>>
>> $$ sudo btrfs scrub start -c3 /p
>> scrub started on /p, [...]
>>
>> $$ sudo iotop -obn1
>> Total DISK READ : 626.53 M/s | Total DISK WRITE :   0.00 B/s
>> Actual DISK READ: 596.93 M/s | Actual DISK WRITE:   0.00 B/s
>>  TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN  IOCOMMAND
>>  872 idle root  268.40 M/s0.00 B/s  0.00 %  0.00 % btrfs scrub
>> start -c3 /p
>>  873 idle root  358.13 M/s0.00 B/s  0.00 %  0.00 % btrfs scrub
>> start -c3 /p
>>
>> CPU bound, 0% IOWait even at idle IO priority, in addition to the
>> hundreds of M/s values per thread/device, here.  You OTOH are showing
>> under 20 M/s per thread/device on spinning rust, with an IOWait near 90%,
>> thus making it IO bound.
>
> This low M/s and high IOWait is the kind of behavior I noticed with 3x
> 2TB raid5 when scrubbing or balancing (no bcache activated, kernel
> 4.3.3).



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


Re: [PATCH] Btrfs: fix missing s_id setting

2016-04-05 Thread Tsutomu Itoh

On 2016/04/05 17:52, Tsutomu Itoh wrote:

On 2016/04/05 16:56, Anand Jain wrote:

On 04/05/2016 08:08 AM, Tsutomu Itoh wrote:

When the device behind fs_devices->latest_bdev is deleted or replaced,
sb->s_id is not updated.
As a result, the deleted device name is displayed by btrfs_printk.

[before fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [  217.458249] BTRFS info (device sdc4): found 1 extents
  [  217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
  [  217.941284] BTRFS info (device sdc4): disk added /dev/sdb6

[after fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [   83.835072] BTRFS info (device sdc4): found 1 extents
  [   84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
  [   84.401951] BTRFS info (device sdc3): disk added /dev/sdb6



  [PATCH 05/13] Btrfs: fix fs logging for multi device

  any comments ?

  We would want to keep the logging prefix constant, so that
  troubleshooters with filters/scripts will find it helpful.


I think it is good to make the identifier constant for troubleshooting.
However, the fsid (uuid) is a little long for printing purposes, I think.
(But I haven't found a more appropriate value...)


BTW, leaving the deleted device name in sb->s_id
is not good.

Thanks,
Tsutomu



Thanks,
Tsutomu



Thanks, Anand



Signed-off-by: Tsutomu Itoh 
---
  fs/btrfs/dev-replace.c |  5 -
  fs/btrfs/volumes.c | 11 +--
  2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct 
btrfs_fs_info *fs_info,
  tgt_device->commit_bytes_used = src_device->bytes_used;
  if (fs_info->sb->s_bdev == src_device->bdev)
  fs_info->sb->s_bdev = tgt_device->bdev;
-if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
  fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+ tgt_device->bdev);
+}
  list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
  fs_info->fs_devices->rw_devices++;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
   struct btrfs_device, dev_list);
  if (device->bdev == root->fs_info->sb->s_bdev)
  root->fs_info->sb->s_bdev = next_device->bdev;
-if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
  root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+snprintf(root->fs_info->sb->s_id,
+ sizeof(root->fs_info->sb->s_id), "%pg",
+ next_device->bdev);
+}

  if (device->bdev) {
  device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct 
btrfs_fs_info *fs_info,
   struct btrfs_device, dev_list);
  if (tgtdev->bdev == fs_info->sb->s_bdev)
  fs_info->sb->s_bdev = next_device->bdev;
-if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
  fs_info->fs_devices->latest_bdev = next_device->bdev;
+snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+ next_device->bdev);
+}
  list_del_rcu(&tgtdev->dev_list);

  call_rcu(&tgtdev->rcu, free_device);




Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread NeilBrown
On Tue, Apr 05 2016, Lars Ellenberg wrote:

> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
>
> That list may "at any time" be implicitly cleared by
> blk_flush_plug_list()
>  flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]

I think the only risk here is preemption, so
  preempt_disable() / preempt_enable()
or, as you say, a spinlock, is sufficient protection.
I would suggest preempt_{dis,en}able for the raid5 code.
Maybe for raid1/raid10 too just for consistency.
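
A sketch of what that protection would look like (hypothetical names, not
a tested patch):

static void example_plugged_submit(struct mddev *mddev)
{
	struct blk_plug_cb *cb;

	preempt_disable();	/* no schedule(), so no implicit unplug */
	cb = blk_check_plugged(example_unplug, mddev, sizeof(*cb));
	if (cb) {
		/* cb cannot be flushed and freed while we use it here */
	}
	preempt_enable();
}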

>
> If there is no protection against an implicit unplug
> between the call to blk_check_plug() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.
>
> I suggest that both raid1 and raid10 can easily be fixed
> by moving the call to blk_check_plugged() inside the spinlock.
>
> For md/raid5 and btrfs/raid56,
> I'm unsure how (if) this needs to be fixed.
>
> The other current in-tree users of blk_check_plugged()
> are mm_check_plugged(), and mddev_check_plugged().
>
> mm_check_plugged() is already used safely inside a spinlock.
>
> with mddev_check_plugged() I'm unsure, at least on a preempt kernel.

I think this is only an issue on a preempt kernel, and in that case: yes
- mddev_check_plugged() needs protection.  Maybe preempt enable/disable
could be done in blk_check_plugged() so those calls which don't
dereference the pointer don't need further protection.
Or maybe blk_check_plugged should have WARN_ON_ONCE(!in_atomic());

>
> Did I overlook any magic that protects against such implicit unplug?

Just the fortunate lack of preemption probably.

>
> Also, why pretend that a custom plug struct (such as raid1_plug_cb)
> may have its member "struct blk_plug_cb cb" at an arbitrary offset?
> As it is, raid1_check_plugged() below is actually just a cast.

Fair point.  I generally prefer container_of to casts because it is more
obviously correct, and type checked.
However as blk_check_plugged performs the allocation, the blk_plug_cb
must be at the start of the containing structure, so the complex tests
for handling NULL are just noise.
I'd be happy for that to be changed.

Thanks,
NeilBrown



>
> Signed-off-by: Lars Ellenberg 
> ---
>  drivers/md/raid1.c  | 19 +--
>  drivers/md/raid10.c | 21 +
>  drivers/md/raid5.c  |  5 +
>  fs/btrfs/raid56.c   |  5 +
>  4 files changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 39fb21e..55dc960 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1044,6 +1044,18 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool 
> from_schedule)
>   kfree(plug);
>  }
>  
> +static struct raid1_plug_cb *raid1_check_plugged(struct mddev *mddev)
> +{
> + /* return (struct raid1_plug_cb*)blk_check_plugged(...); */
> + struct blk_plug_cb *cb;
> + struct raid1_plug_cb *plug = NULL;
> +
> + cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> + if (cb)
> + plug = container_of(cb, struct raid1_plug_cb, cb);
> + return plug;
> +}
> +
>  static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  {
>   struct r1conf *conf = mddev->private;
> @@ -1060,7 +1072,6 @@ static void raid1_make_request(struct mddev *mddev, 
> struct bio * bio)
> & (REQ_DISCARD | REQ_SECURE));
>   const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
>   struct md_rdev *blocked_rdev;
> - struct blk_plug_cb *cb;
>   struct raid1_plug_cb *plug = NULL;
>   int first_clone;
>   int sectors_handled;
> @@ -1382,12 +1393,8 @@ read_again:
>  
>   atomic_inc(_bio->remaining);
>  
> - cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> - if (cb)
> - plug = container_of(cb, struct raid1_plug_cb, cb);
> - else
> - plug = NULL;
>   spin_lock_irqsave(>device_lock, flags);
> + plug = raid1_check_plugged(mddev);
>   if (plug) {
>   bio_list_add(>pending, mbio);
>   plug->pending_cnt++;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e3fd725..d7d4397 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1052,6 +1052,18 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool 
> from_schedule)
>   kfree(plug);
>  }
>  
> +static struct raid10_plug_cb *raid10_check_plugged(struct mddev *mddev)
> +{
> + /* return (struct raid10_plug_cb*)blk_check_plugged(...); */
> + struct blk_plug_cb *cb;
> + struct raid10_plug_cb *plug = NULL;
> +
> + cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
> + if (cb)
> +  

Re: Qgroups wrong after snapshot create

2016-04-05 Thread Mark Fasheh
On Tue, Apr 05, 2016 at 03:16:54PM -0700, Mark Fasheh wrote:
> On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> > Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> > >Hi,
> > >
> > >Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> > >reproduce. From a fresh btrfs filesystem, simply enable qgroups and create 
> > >a
> > >snapshot. In this example we have mounted a newly created fresh filesystem
> > >and mounted it at /btrfs:
> > >
> > ># btrfs quota enable /btrfs
> > ># btrfs sub sna /btrfs/ /btrfs/snap1
> > ># btrfs qg show /btrfs
> > >
> > >qgroupid rfer excl
> > >  
> > >0/5  32.00KiB 32.00KiB
> > >0/25716.00KiB 16.00KiB
> > >
> > 
> > Also reproduced it.
> > 
> > My first idea is, old snapshot qgroup hack is involved.
> > 
> > Unlike btrfs_inc/dec_extent_ref(), snapshotting just uses a dirty
> > hack to handle it:
> > Copy rfer from the source subvolume, and directly set excl to nodesize.
> >
> > If such work happens before adding the snapshot inode into the src
> > subvolume, it may be the cause of the bug.
> 
> Ok, thanks very much for looking into this Qu.
> 
> 
> > >In the example above, the default subvolume (0/5) should read 16KiB
> > >referenced and 16KiB exclusive.
> > >
> > >A rescan fixes things, so we know the rescan process is doing the math
> > >right:
> > >
> > ># btrfs quota rescan /btrfs
> > ># btrfs qgroup show /btrfs
> > >qgroupid rfer excl
> > >  
> > >0/5  16.00KiB 16.00KiB
> > >0/25716.00KiB 16.00KiB
> > >
> > 
> > So the base of the qgroup code is not affected; otherwise we may need
> > another painful rework.
> 
> Yeah as far as I can tell the core algorithm is fine. We're just running the
> extents incorrectly somehow.

Btw, I should add - my biggest fear was an algorithm change which would have
made older versions of btrfsck incompatible. It seems, though, that we can
still use it for checking qgroups.
--Mark

--
Mark Fasheh


Re: Qgroups wrong after snapshot create

2016-04-05 Thread Mark Fasheh
On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> >Hi,
> >
> >Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> >reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
> >snapshot. In this example we have mounted a newly created fresh filesystem
> >and mounted it at /btrfs:
> >
> ># btrfs quota enable /btrfs
> ># btrfs sub sna /btrfs/ /btrfs/snap1
> ># btrfs qg show /btrfs
> >
> >qgroupid rfer excl
> >  
> >0/5  32.00KiB 32.00KiB
> >0/25716.00KiB 16.00KiB
> >
> 
> Also reproduced it.
> 
> My first idea is, old snapshot qgroup hack is involved.
> 
> Unlike btrfs_inc/dec_extent_ref(), snapshotting just uses a dirty
> hack to handle it:
> Copy rfer from the source subvolume, and directly set excl to nodesize.
>
> If such work happens before adding the snapshot inode into the src
> subvolume, it may be the cause of the bug.

Ok, thanks very much for looking into this Qu.


> >In the example above, the default subvolume (0/5) should read 16KiB
> >referenced and 16KiB exclusive.
> >
> >A rescan fixes things, so we know the rescan process is doing the math
> >right:
> >
> ># btrfs quota rescan /btrfs
> ># btrfs qgroup show /btrfs
> >qgroupid rfer excl
> >  
> >0/5  16.00KiB 16.00KiB
> >0/25716.00KiB 16.00KiB
> >
> 
> So the base of the qgroup code is not affected; otherwise we may need
> another painful rework.

Yeah as far as I can tell the core algorithm is fine. We're just running the
extents incorrectly somehow.


> >#                              _-----=> irqs-off
> >#                             / _----=> need-resched
> >#                            | / _---=> hardirq/softirq
> >#                            || / _--=> preempt-depth
> >#                            ||| /     delay
> >#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> >#              | |       |   ||||       |         |
> >btrfs-10233 [001]  260298.823339: 
> > btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, 
> > nr_old_roots = 1, nr_new_roots = 0
> >btrfs-10233 [001]  260298.823342: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
> >btrfs-10233 [001]  260298.823342: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
> >btrfs-10233 [001]  260298.823343: 
> > btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 0
> >btrfs-10233 [001]  260298.823345: 
> > btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 0
> >btrfs-10233 [001]  260298.823347: 
> > btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 1
> 
> Now, for extent 29786112, its nr_new_roots is 1.
> 
> >btrfs-10233 [001]  260298.823347: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> >btrfs-10233 [001]  260298.823348: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >btrfs-10233 [001]  260298.823421: 
> > btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 0
> 
> Now the problem is here: nr_old_roots should be 1, not 0.
> Just as the previous trace shows, we increased the extent ref on that
> extent, but now it has dropped back to 0.
>
> Since its old_root == new_root == 0, the qgroup code doesn't do anything
> with it. If its nr_old_roots were 1, qgroup would drop its excl/rfer to 0,
> and then accounting might go back to normal.

Ok, so we're fine with the numbers going to zero so long as it gets back to
where it should be. That also explains the 'strange' behavior I saw.
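
In other words, a simplified sketch of the decision Qu describes (not the
actual kernel code):

	/*
	 * If an extent's set of owning roots did not change between the
	 * old and new commit roots, qgroup leaves its counters alone.
	 * Here both counts are (wrongly) 0, so the earlier +16KiB to
	 * qgroup 5's rfer/excl is never undone.
	 */
	if (nr_old_roots == 0 && nr_new_roots == 0)
		return 0;	/* extent looks untouched; skip accounting */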


> >btrfs-10233 [001]  260298.823422: 
> > btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 0
> >btrfs-10233 [001]  260298.823425: 
> > btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, 
> > nr_old_roots = 0, nr_new_roots = 1
> >btrfs-10233 [001]  260298.823426: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >btrfs-10233 [001]  260298.823426: qgroup_update_counters: 
> > qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
> >
> >If you read through the whole log we do some... interesting... things - at
> >the start, we *subtract* from qgroup 5, making its count go to zero. I want
> >to say that this is kind of unexpected for a snapshot create, but perhaps
> >there's something I'm missing.
> >
> >Remember that I'm 

Re: good documentation on btrfs internals and on disk layout

2016-04-05 Thread Yauhen Kharuzhy
2016-04-05 11:56 GMT-07:00 Austin S. Hemmelgarn :
> On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
>>
>> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn :
>>>
>>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:

>>> In general, it isn't allowed, but we don't explicitly disallow it either.
>>> The worst case here is that the devices both get written to separately,
>>> and
>>> you end up with data not matching for correlated generation ID's.  The
>>> second scrub in this case shows no errors because the first one corrects
>>> them (even though they are reported as uncorrectable, which is a bug as
>>> far
>>> as I can tell), and from what I can tell from reading the code, it does
>>> this
>>> by just picking the highest generation ID and dropping the data from the
>>> lower generation.
>>
>>
>> Hmm... Sounds reasonable, but how do we detect whether the filesystem
>> should be checked by scrub after mounting? There is one way as I
>> understand it — to check kernel logs after mount for any btrfs errors —
>> and this is not a good way for any kind of automatic management.
>
> There really isn't any way that I know of.  Personally, I just scrub all my
> filesystems shortly after mount, but I also have pretty small filesystems
> (the biggest are 64G) on relatively fast storage.  In theory, it might be
> possible to parse the filesystems before mounting to check the device
> generation numbers, but that may be just as expensive as just scrubbing the
> filesystem (and you really should be scrubbing somewhat regularly anyway).

Yes, size matters — we have a massive 96TB filesystem, and scrubbing,
rebalancing, replacing, etc. can be expensive operations :)
In fact, I have already implemented some kind of filesystem status
reporting in the kernel & btrfs-progs, but still have had no time to
prepare it for sending as patches for comment. But it will be done soon,
I hope.


Re: good documentation on btrfs internals and on disk layout

2016-04-05 Thread Austin S. Hemmelgarn

On 2016-04-05 14:36, Yauhen Kharuzhy wrote:

2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn :

On 2016-04-05 13:53, Yauhen Kharuzhy wrote:


Hello,

I try to understand btrfs logic in mounting of multi-device filesystem
when device generations are different. All my questions are related to
RAID5/6 for system, metadata, and data case.

Kernel can mount FS with different device generations (if drive was
physically removed before last unmount and returned back after, for
example) now but scrub will report uncorrectable errors after this
(but second run doesn't show any errors). Does any documentation about
algorithm of multiple device handling in such case exist? Does the
case with different device generations is allowed in general and what
worst cases can be here?


In general, it isn't allowed, but we don't explicitly disallow it either.
The worst case here is that the devices both get written to separately, and
you end up with data not matching for correlated generation ID's.  The
second scrub in this case shows no errors because the first one corrects
them (even though they are reported as uncorrectable, which is a bug as far
as I can tell), and from what I can tell from reading the code, it does this
by just picking the highest generation ID and dropping the data from the
lower generation.


Hmm... Sounds reasonable, but how do we detect whether the filesystem
should be checked by scrub after mounting? There is one way as I
understand it — to check kernel logs after mount for any btrfs errors —
and this is not a good way for any kind of automatic management.
There really isn't any way that I know of.  Personally, I just scrub all 
my filesystems shortly after mount, but I also have pretty small 
filesystems (the biggest are 64G) on relatively fast storage.  In 
theory, it might be possible to parse the filesystems before mounting to 
check the device generation numbers, but that may be just as expensive 
as just scrubbing the filesystem (and you really should be scrubbing 
somewhat regularly anyway).



What should happen if a device was removed and returned after some
time while the filesystem is online? Should some kind of device
reopening be possible, or is the one way to guarantee FS consistency
to mark such a device as missing and replace it?


In this case, the device being removed (or some component between the device
and the processor failing, or the device itself erroneously reporting
failure) will force the FS read-only.  If the device reappears while the FS
is still online, it may just start working again (this is _really_ rare, and
requires that the device appear with the same device node as it had
previously, and this usually only happens when the device disappears for
only a very short period of time), or it may not work until the FS gets
remounted (this is usually the case), or the system may crash (thankfully
this almost never happens, and it's usually not because of BTRFS when it
does).  Regardless of what happens, you may still have to run a scrub to
make sure everything is consistent.


So, one right way, if we see a device reconnected as a new block device,
is to reject it and not include it in the device list again, am I right?
The existing code tries to 'reconnect' it under the new device name, but
this works completely wrong for a mounted FS (because the btrfs device is
only renamed; no real device reopening is performed), and I intend to
propose a patch based on Anand's 'global spare' patch series to handle
this properly.
In an ideal situation, you have nothing using the FS and can unmount, 
run a device scan, and then remount.  In most cases this won't work, and 
being able to re-add the device via a hot-spare type setup (or even just 
use device replace on it, which I've done before myself when dealing 
with filesystems on USB devices, and it works well) would be useful. 
Ideally, we should have the option to auto-detect such a situation and 
handle it, but that _really_ needs to be optional (there are just too 
many things that could go wrong).




Re: good documentation on btrfs internals and on disk layout

2016-04-05 Thread Yauhen Kharuzhy
2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn :
> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>
>> Hello,
>>
>> I'm trying to understand the btrfs logic for mounting a multi-device
>> filesystem when device generations are different. All my questions relate
>> to the RAID5/6 for system, metadata, and data case.
>>
>> The kernel can now mount an FS with different device generations (if a
>> drive was physically removed before the last unmount and returned
>> afterwards, for example), but scrub will report uncorrectable errors
>> after this (a second run doesn't show any errors). Does any documentation
>> exist about the algorithm for multiple-device handling in such a case? Is
>> the case with different device generations allowed in general, and what
>> worst cases can occur here?
>
> In general, it isn't allowed, but we don't explicitly disallow it either.
> The worst case here is that the devices both get written to separately, and
> you end up with data not matching for correlated generation ID's.  The
> second scrub in this case shows no errors because the first one corrects
> them (even though they are reported as uncorrectable, which is a bug as far
> as I can tell), and from what I can tell from reading the code, it does this
> by just picking the highest generation ID and dropping the data from the
> lower generation.

Hmm... Sounds reasonable, but how do we detect whether the filesystem
should be checked by scrub after mounting? There is one way as I
understand it — to check kernel logs after mount for any btrfs errors —
and this is not a good way for any kind of automatic management.

>> What should happen if a device was removed and returned after some
>> time while the filesystem is online? Should some kind of device
>> reopening be possible, or is the one way to guarantee FS consistency
>> to mark such a device as missing and replace it?
>
> In this case, the device being removed (or some component between the device
> and the processor failing, or the device itself erroneously reporting
> failure) will force the FS read-only.  If the device reappears while the FS
> is still online, it may just start working again (this is _really_ rare, and
> requires that the device appear with the same device node as it had
> previously, and this usually only happens when the device disappears for
> only a very short period of time), or it may not work until the FS gets
> remounted (this is usually the case), or the system may crash (thankfully
> this almost never happens, and it's usually not because of BTRFS when it
> does).  Regardless of what happens, you may still have to run a scrub to
> make sure everything is consistent.

So, one right way, if we see a device reconnected as a new block device,
is to reject it and not include it in the device list again, am I right?
The existing code tries to 'reconnect' it under the new device name, but
this works completely wrong for a mounted FS (because the btrfs device is
only renamed; no real device reopening is performed), and I intend to
propose a patch based on Anand's 'global spare' patch series to handle
this properly.


Re: good documentation on btrfs internals and on disk layout

2016-04-05 Thread Austin S. Hemmelgarn

On 2016-04-05 13:53, Yauhen Kharuzhy wrote:

Hello,

I'm trying to understand the btrfs logic for mounting a multi-device
filesystem when device generations are different. All my questions relate
to the RAID5/6 for system, metadata, and data case.

The kernel can now mount an FS with different device generations (if a
drive was physically removed before the last unmount and returned
afterwards, for example), but scrub will report uncorrectable errors
after this (a second run doesn't show any errors). Does any documentation
exist about the algorithm for multiple-device handling in such a case? Is
the case with different device generations allowed in general, and what
worst cases can occur here?
In general, it isn't allowed, but we don't explicitly disallow it 
either.  The worst case here is that the devices both get written to 
separately, and you end up with data not matching for correlated 
generation ID's.  The second scrub in this case shows no errors because 
the first one corrects them (even though they are reported as 
uncorrectable, which is a bug as far as I can tell), and from what I can 
tell from reading the code, it does this by just picking the highest 
generation ID and dropping the data from the lower generation.


What should happen if a device was removed and returned after some
time while the filesystem is online? Should some kind of device
reopening be possible, or is the one way to guarantee FS consistency
to mark such a device as missing and replace it?
In this case, the device being removed (or some component between the 
device and the processor failing, or the device itself erroneously 
reporting failure) will force the FS read-only.  If the device reappears 
while the FS is still online, it may just start working again (this is 
_really_ rare, and requires that the device appear with the same device 
node as it had previously, and this usually only happens when the device 
disappears for only a very short period of time), or it may not work 
until the FS gets remounted (this is usually the case), or the system 
may crash (thankfully this almost never happens, and it's usually not 
because of BTRFS when it does).  Regardless of what happens, you may 
still have to run a scrub to make sure everything is consistent.



Re: good documentation on btrfs internals and on disk layout

2016-04-05 Thread Yauhen Kharuzhy
Hello,

I'm trying to understand the btrfs logic for mounting a multi-device
filesystem when device generations are different. All my questions relate
to the RAID5/6 for system, metadata, and data case.

The kernel can now mount an FS with different device generations (if a
drive was physically removed before the last unmount and returned
afterwards, for example), but scrub will report uncorrectable errors
after this (a second run doesn't show any errors). Does any documentation
exist about the algorithm for multiple-device handling in such a case? Is
the case with different device generations allowed in general, and what
worst cases can occur here?

What should happen if a device was removed and returned after some
time while the filesystem is online? Should some kind of device
reopening be possible, or is the one way to guarantee FS consistency
to mark such a device as missing and replace it?

If any sources of such information exist (besides the btrfs code), please
point me to them.


Re: Scrub priority, am I using it wrong?

2016-04-05 Thread Henk Slager
On Tue, Apr 5, 2016 at 4:37 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:
>
>> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
>> spinning disks. The server is moderately loaded with various regular
>> tasks that use a fair bit of disk IO, but I've scheduled my weekly btrfs
>> scrub for the best quiet time in the week.
>>
>> The command that is run is:
>> /usr/local/bin/btrfs scrub start -Bd -c idle /data
>>
>> Which is my best attempt to try and get it to have a low impact on user
>> operations
>>
>> But iotop shows me:
>>
>>  1765 be/4 root   14.84 M/s    0.00 B/s  0.00 %  96.65 %  btrfs scrub start -Bd -c idle /data
>>  1767 be/4 root   14.70 M/s    0.00 B/s  0.00 %  95.35 %  btrfs scrub start -Bd -c idle /data
>>  1768 be/4 root   13.47 M/s    0.00 B/s  0.00 %  92.59 %  btrfs scrub start -Bd -c idle /data
>>  1764 be/4 root   12.61 M/s    0.00 B/s  0.00 %  88.77 %  btrfs scrub start -Bd -c idle /data
>>  1766 be/4 root   11.24 M/s    0.00 B/s  0.00 %  85.18 %  btrfs scrub start -Bd -c idle /data
>>  1763 be/4 root    7.79 M/s    0.00 B/s  0.00 %  63.30 %  btrfs scrub start -Bd -c idle /data
>> 28858 be/4 root    0.00 B/s  810.50 B/s  0.00 %  61.32 %  [kworker/u16:25]
>>
>>
>> Which doesn't look like an idle priority to me. And the system sure
>> feels like a system with a lot of heavy io going on. Is there something
>> I'm doing wrong?

The throughput numbers let me think that the filesystem is raid5 or
raid6. On single, raid1, or raid10 one easily gets around 100M/s
without noticing or feeling heavy IO ongoing, mostly independent of
scrub options.

> Two points:
>
> 1) It appears btrfs scrub start's -c option only takes numeric class, so
> try -c3 instead of -c idle.

Thanks to Duncan for pointing this out. I don't remember exactly, but
I think I also had issues with this in the past, and did not realize it
or look into it further.
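
For the record, the form that actually takes effect is the numeric one, 
e.g. (same volume as above):

  btrfs scrub start -Bd -c3 /data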

> Works for me with the numeric class (same results as you with spelled out
> class), tho I'm on ssd with multiple independent btrfs on partitions, the
> biggest of which is 24 GiB, 18.something GiB used, which scrubs in all of
> 20 seconds, so I don't need and hadn't tried the -c option at all until
> now.
>
> 2) What a difference an ssd makes!
>
> $$ sudo btrfs scrub start -c3 /p
> scrub started on /p, [...]
>
> $$ sudo iotop -obn1
> Total DISK READ : 626.53 M/s | Total DISK WRITE :   0.00 B/s
> Actual DISK READ: 596.93 M/s | Actual DISK WRITE:   0.00 B/s
>   TID  PRIO  USER  DISK READ    DISK WRITE  SWAPIN    IO     COMMAND
>   872  idle  root  268.40 M/s    0.00 B/s   0.00 %   0.00 %  btrfs scrub start -c3 /p
>   873  idle  root  358.13 M/s    0.00 B/s   0.00 %   0.00 %  btrfs scrub start -c3 /p
>
> CPU bound, 0% IOWait even at idle IO priority, in addition to the
> hundreds of M/s values per thread/device, here.  You OTOH are showing
> under 20 M/s per thread/device on spinning rust, with an IOWait near 90%,
> thus making it IO bound.

This low M/s and high IOWait is the kind of behavior I noticed with 3x
2TB raid5 when scrubbing or balancing (no bcache activated, kernel
4.3.3).


[PATCH] btrfs-progs: device stats: Print devid instead of null

2016-04-05 Thread Patrik Lundquist
Print e.g. "[devid:4].write_io_errs   6" instead of
"[(null)].write_io_errs   6" when device is missing.

Signed-off-by: Patrik Lundquist 
---
 cmds-device.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/cmds-device.c b/cmds-device.c
index b17b6c6..7616c43 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -447,6 +447,13 @@ static int cmd_device_stats(int argc, char **argv)
 
 		canonical_path = canonicalize_path((char *)path);
 
+		/* No path when device is missing. */
+		if (!canonical_path) {
+			canonical_path = malloc(32);
+			snprintf(canonical_path, 32,
+				 "devid:%llu", args.devid);
+		}
+
 		if (args.nr_items >= BTRFS_DEV_STAT_WRITE_ERRS + 1)
 			printf("[%s].write_io_errs   %llu\n",
 			       canonical_path,
-- 
2.8.0.rc3
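
With the patch applied, the stats for a missing device should look 
roughly like this (a sketch; devid 4 and the error count are just the 
example values from the commit message):

  # btrfs device stats /mnt
  [devid:4].write_io_errs   6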



[PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread Lars Ellenberg
blk_check_plugged() will return a pointer
to an object linked on current->plug->cb_list.

That list may "at any time" be implicitly cleared by
blk_flush_plug_list()
 flush_plug_callbacks()
either as a result of blk_finish_plug(),
or implicitly by schedule() [and maybe other implicit mechanisms?]

If there is no protection against an implicit unplug
between the call to blk_check_plugged() and using its return value,
that implicit unplug may have already happened,
even before the plug is actually initialized or populated,
and we may be using a pointer to already free()d data.

I suggest that both raid1 and raid10 can easily be fixed
by moving the call to blk_check_plugged() inside the spinlock.

For md/raid5 and btrfs/raid56,
I'm unsure how (if) this needs to be fixed.

The other current in-tree users of blk_check_plugged()
are mm_check_plugged(), and mddev_check_plugged().

mm_check_plugged() is already used safely inside a spinlock.

With mddev_check_plugged() I'm unsure, at least on a preempt kernel.

Did I overlook any magic that protects against such implicit unplug?

Also, why pretend that a custom plug struct (such as raid1_plug_cb)
may have its member "struct blk_plug_cb cb" at an arbitrary offset?
As it is, raid1_check_plugged() below is actually just a cast.

Signed-off-by: Lars Ellenberg 
---
 drivers/md/raid1.c  | 19 +--
 drivers/md/raid10.c | 21 +
 drivers/md/raid5.c  |  5 +
 fs/btrfs/raid56.c   |  5 +
 4 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 39fb21e..55dc960 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1044,6 +1044,18 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool 
from_schedule)
kfree(plug);
 }
 
+static struct raid1_plug_cb *raid1_check_plugged(struct mddev *mddev)
+{
+   /* return (struct raid1_plug_cb*)blk_check_plugged(...); */
+   struct blk_plug_cb *cb;
+   struct raid1_plug_cb *plug = NULL;
+
+   cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
+   if (cb)
+   plug = container_of(cb, struct raid1_plug_cb, cb);
+   return plug;
+}
+
 static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 {
struct r1conf *conf = mddev->private;
@@ -1060,7 +1072,6 @@ static void raid1_make_request(struct mddev *mddev, 
struct bio * bio)
  & (REQ_DISCARD | REQ_SECURE));
const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
struct md_rdev *blocked_rdev;
-   struct blk_plug_cb *cb;
struct raid1_plug_cb *plug = NULL;
int first_clone;
int sectors_handled;
@@ -1382,12 +1393,8 @@ read_again:
 
atomic_inc(&r1_bio->remaining);
 
-   cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
-   if (cb)
-   plug = container_of(cb, struct raid1_plug_cb, cb);
-   else
-   plug = NULL;
spin_lock_irqsave(&conf->device_lock, flags);
+   plug = raid1_check_plugged(mddev);
if (plug) {
bio_list_add(>pending, mbio);
plug->pending_cnt++;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e3fd725..d7d4397 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1052,6 +1052,18 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool 
from_schedule)
kfree(plug);
 }
 
+static struct raid10_plug_cb *raid10_check_plugged(struct mddev *mddev)
+{
+   /* return (struct raid10_plug_cb*)blk_check_plugged(...); */
+   struct blk_plug_cb *cb;
+   struct raid10_plug_cb *plug = NULL;
+
+   cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
+   if (cb)
+   plug = container_of(cb, struct raid10_plug_cb, cb);
+   return plug;
+}
+
 static void __make_request(struct mddev *mddev, struct bio *bio)
 {
struct r10conf *conf = mddev->private;
@@ -1066,7 +1078,6 @@ static void __make_request(struct mddev *mddev, struct 
bio *bio)
const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
unsigned long flags;
struct md_rdev *blocked_rdev;
-   struct blk_plug_cb *cb;
struct raid10_plug_cb *plug = NULL;
int sectors_handled;
int max_sectors;
@@ -1369,14 +1380,8 @@ retry_write:
 
atomic_inc(&r10_bio->remaining);
 
-   cb = blk_check_plugged(raid10_unplug, mddev,
-  sizeof(*plug));
-   if (cb)
-   plug = container_of(cb, struct raid10_plug_cb,
-   cb);
-   else
-   plug = NULL;
spin_lock_irqsave(&conf->device_lock, flags);
+   plug = raid10_check_plugged(mddev);
 

Re: Another ENOSPC situation

2016-04-05 Thread Austin S. Hemmelgarn

On 2016-04-02 01:43, Chris Murphy wrote:

On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:



[4/502]mh@swivel:~$ sudo btrfs fi usage /
Overall:
 Device size:         600.00GiB
 Device allocated:    600.00GiB
 Device unallocated:    1.00MiB


That's the problem right there.  The admin didn't do his job and spot the
near full allocation issue



I don't yet agree this is an admin problem. This is the 2nd or 3rd
case we've seen only recently where there's plenty of space in all
chunk types and yet ENOSPC happens, seemingly only because there's no
unallocated space remaining. I don't know that this is a regression
for sure, but it sure seems like one.
I personally don't think it's a regression.  I've hit this myself before 
(although I make a point not to anymore, having to jump through hoops to 
the degree I did to get the FS working again tends to provide a pretty 
big incentive to not let it happen again), I know a couple of other 
people who have and never reported it here or on IRC, and I'd be willing 
to bet that the reason we're seeing it recently is that more 'regular' 
users (in contrast to system administrators or developers) are using 
BTRFS, and they tend to be more likely to hit such issues (because 
they're not as likely to know about them in the first place, let alone 
how to avoid them).






Data,single: Size:553.93GiB, Used:405.73GiB
/dev/mapper/swivelbtr 553.93GiB

Metadata,DUP: Size:23.00GiB, Used:3.83GiB
/dev/mapper/swivelbtr  46.00GiB

System,DUP: Size:32.00MiB, Used:112.00KiB
/dev/mapper/swivelbtr  64.00MiB

Unallocated:
/dev/mapper/swivelbtr   1.00MiB
[5/503]mh@swivel:~$


Both data and metadata have several GiB free, data ~140 GiB free, and
metadata isn't into global reserve, so the system isn't totally wedged,
only partially, due to the lack of unallocated space.


Unallocated space alone hasn't ever caused this that I can remember.
It's most often been totally full metadata chunks, with free space in
allocated data chunks, with no unallocated space out of which to
create another metadata chunk to write out changes.

There should be plenty of space for either a -dusage=1 or -musage=1
balance to free up a bunch of partially allocated chunks. Offhand I
don't think the profiles filter is helpful in this case.

OK so where I could be wrong is that I'm expecting balance doesn't
require allocated space to work. I'd expect that it can COW extents
from one chunk into another existing chunk (of the same type) and then
once that's successful, free up that chunk, i.e. revert it back to
unallocated. If balance can only copy into newly allocated chunks,
that seems like a big problem. I thought that problems had been fixed
a very long time ago.
Balance has always allocated new chunks.  This is IMHO one of the big 
issues with the current implementation of it (the other being that it 
can't be made asynchronous without some creative userspace work).  If we 
aren't converting chunk types and we're on a single device FS, we should 
be tail-packing existing chunks before we try to allocate new ones.


And what we don't see from 'usage' that we will see from 'df' is the
GlobalReserve values. I'd like to see that.

Anyway, in the meantime there is a work around:

btrfs dev add

Just add a device, even if it's an 8GiB flash drive. It can also be
spare space on a partition, or a logical volume, or whatever you want.
That'll add some gigs of unallocated space. Now the balance will work,
or there's quite certainly a bug (and a new one, because this has
always worked in the past). After whatever filtered or full balance is
done, make sure to 'btrfs dev rem' and confirm the device is gone with
'btrfs fi show' before physically removing it. The volume has two
devices until that device is successfully removed, and it's in
something of a fragile state until then, because any loss of data on
that 2nd device has a good chance of face planting the file system.
If you can ensure with a relative degree of certainty that you won't 
lose power or crash, and you have lots of RAM, a small ramdisk (or even 
zram) works well for this too.  I wouldn't use either personally for a 
critical filesystem (I'd pull out the disk and hook it up internally to 
another system with spare disk space and handle things there), but both 
options should work fine.
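
Putting that workaround together, the sequence is something like this 
(a sketch; /dev/sdX and /mnt are hypothetical):

  # add a temporary device to regain unallocated space:
  btrfs dev add /dev/sdX /mnt
  # now a filtered balance has room to work:
  btrfs balance start -dusage=1 /mnt
  # remove the temporary device and confirm it is gone:
  btrfs dev rem /dev/sdX /mnt
  btrfs fi show /mnt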




Re: Qgroups wrong after snapshot create

2016-04-05 Thread Qu Wenruo



On 04/05/2016 07:06 AM, Mark Fasheh wrote:

Hi,

Making a snapshot gets us the wrong qgroup numbers. This is very easy to
reproduce: from a fresh btrfs filesystem, simply enable qgroups and create
a snapshot. In this example we have created a fresh filesystem and
mounted it at /btrfs:

# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
# btrfs qg show /btrfs

qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
0/257        16.00KiB     16.00KiB


In the example above, the default subvolume (0/5) should read 16KiB
referenced and 16KiB exclusive.

A rescan fixes things, so we know the rescan process is doing the math
right:

# btrfs quota rescan /btrfs
# btrfs qgroup show /btrfs
qgroupid         rfer         excl
--------         ----         ----
0/5          16.00KiB     16.00KiB
0/257        16.00KiB     16.00KiB



The last kernel to get this right was v4.1:

# uname -r
4.1.20
# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
Create a snapshot of '/btrfs/' in '/btrfs/snap1'
# btrfs qg show /btrfs
qgroupid         rfer         excl
--------         ----         ----
0/5          16.00KiB     16.00KiB
0/257        16.00KiB     16.00KiB


Which leads me to believe that this was a regression introduced by Qu's
rewrite as that is the biggest change to qgroups during that development
period.


Going back to upstream, I applied my tracing patch from this list
( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
couple changes - I'm printing the rfer/excl bytecounts in
qgroup_update_counters AND I print them twice - once before we make any
changes and once after the changes. If I enable tracing in
btrfs_qgroup_account_extent and qgroup_update_counters just before the
snapshot creation, we get the following trace:


# btrfs quota enable /btrfs
# 
# echo 1 > 
/sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
# echo 1 > /sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
# btrfs sub sna /btrfs/ /btrfs/snap2
Create a snapshot of '/btrfs/' in '/btrfs/snap2'
# btrfs qg show /btrfs
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
0/257        16.00KiB     16.00KiB
fstest1:~ # cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
btrfs-10233 [001]  260298.823339: btrfs_qgroup_account_extent: 
bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid 
= 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid 
= 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
btrfs-10233 [001]  260298.823343: btrfs_qgroup_account_extent: 
bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001]  260298.823345: btrfs_qgroup_account_extent: 
bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001]  260298.823347: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
btrfs-10233 [001]  260298.823347: qgroup_update_counters: qgid 
= 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
btrfs-10233 [001]  260298.823348: qgroup_update_counters: qgid 
= 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs-10233 [001]  260298.823421: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0


I think I caught the direct cause.

Here, bytenr 29786112 got quota-dirtied twice: once for the CoW to a 
new block (although I don't get why it got CoWed, as it is CoWed again 
later), and once for the de-ref.


But the problem is, qgroup accounting is run twice inside the same 
transaction.


The current qgroup design heavily depends on the transaction to get 
correct old and new roots.
It always gets old roots from the *COMMIT ROOT*, and gets new roots 
from the current root just before *SWITCHING the COMMIT ROOT*.


However, in this case we got new roots in the middle of a transaction 
and updated qgroup accounting there.
This means we get a wrong new_roots for an extent which will never be 
committed to disk.


In this case, it is bytenr 29786112. It doesn't reach disk, so it's not 
on disk in either the transaction before or the one after the snapshot.
Normally, its old and new roots should both 

Re: Scrub priority, am I using it wrong?

2016-04-05 Thread Austin S. Hemmelgarn

On 2016-04-05 00:19, Duncan wrote:

Gareth Pye posted on Tue, 05 Apr 2016 13:44:05 +1000 as excerpted:


On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:

1) It appears btrfs scrub start's -c option only takes numeric class,
so try -c3 instead of -c idle.



Does it count as a bug if it silently accepts the way I was doing it?

I've switched to -c3 and at least now the idle class listed in iotop is
idle, so I hope that means it will be more friendly to other processes.


I'd say yes, particularly given that the fact that the value must be 
the numeric class isn't documented in the manpage at all.

Whether the bug is then one of documentation (say it must be numeric) or
of implementation (take the class name as well) is then up for debate.
I'd call fixing either one a fix.  If it must be numeric, document that
(and optionally change the implementation to error in some way if a
numeric parameter isn't supplied for -c), else change the implementation
so the name can be taken as well (and optionally change the documentation
to explicitly mention that either one can be used).  Doesn't matter to me
which.


In general, I'd say:
1. The documentation needs to be updated.  Even if it's intended to 
accept class names eventually, it should match the implementation.
2. The implementation needs error checking added.  At the very least, it 
should say something to the effect of "Hey, this doesn't work the way 
you appear to think it does.", and possibly bail.


That said, I don't think there's a huge amount of value in adding the 
ability to accept symbolic names.  Nothing that I know of accepts them 
on the command line, because most things pass the integer directly to 
the syscall, thus allowing for people to use new classes (if any are 
ever introduced) with old tools.
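
For reference, the numeric values map to the kernel's I/O scheduling 
classes, the same ones ionice(1) uses: 1 is realtime, 2 is best-effort, 
and 3 is idle.  So the quiet scrub discussed here looks like this (a 
sketch; -n sets the priority within the class and only matters for the 
realtime and best-effort classes):

  # idle I/O class:
  btrfs scrub start -Bd -c3 /data
  # or lowest-priority best-effort:
  btrfs scrub start -Bd -c2 -n7 /data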



Re: [PATCH] Btrfs: fix missing s_id setting

2016-04-05 Thread Tsutomu Itoh

On 2016/04/05 16:56, Anand Jain wrote:

On 04/05/2016 08:08 AM, Tsutomu Itoh wrote:

When fs_devices->latest_bdev is deleted or is replaced, sb->s_id has
not been updated.
As a result, the deleted device name is displayed by btrfs_printk.

[before fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [  217.458249] BTRFS info (device sdc4): found 1 extents
  [  217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
  [  217.941284] BTRFS info (device sdc4): disk added /dev/sdb6

[after fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [   83.835072] BTRFS info (device sdc4): found 1 extents
  [   84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
  [   84.401951] BTRFS info (device sdc3): disk added /dev/sdb6



  [PATCH 05/13] Btrfs: fix fs logging for multi device

  any comments ?

  We would want to maintain the logging prefix as constant, so that
  troubleshooters with filters/scripts will find it helpful.


I think it is good to make the identifier constant for troubleshooting.
However, the fsid (UUID) is a little long for printing purposes, I think.
(But I haven't found a more appropriate value...)

Thanks,
Tsutomu



Thanks, Anand



Signed-off-by: Tsutomu Itoh 
---
  fs/btrfs/dev-replace.c |  5 -
  fs/btrfs/volumes.c | 11 +--
  2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 	if (fs_info->sb->s_bdev == src_device->bdev)
 		fs_info->sb->s_bdev = tgt_device->bdev;
-	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+	if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
 		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 tgt_device->bdev);
+	}
 	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
 	fs_info->fs_devices->rw_devices++;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 				 struct btrfs_device, dev_list);
 	if (device->bdev == root->fs_info->sb->s_bdev)
 		root->fs_info->sb->s_bdev = next_device->bdev;
-	if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+	if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
 		root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(root->fs_info->sb->s_id,
+			 sizeof(root->fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}

 	if (device->bdev) {
 		device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 				 struct btrfs_device, dev_list);
 	if (tgtdev->bdev == fs_info->sb->s_bdev)
 		fs_info->sb->s_bdev = next_device->bdev;
-	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
 		fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}
 	list_del_rcu(&tgtdev->dev_list);

 	call_rcu(&tgtdev->rcu, free_device);




Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-05 Thread David Sterba
On Tue, Apr 05, 2016 at 09:28:31AM +0800, Qu Wenruo wrote:
> 
> 
> David Sterba wrote on 2016/04/04 13:18 +0200:
> > On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> >>> After another look, why don't we use nodesize directly? Or stripesize
> >>> where applies. With max_size == 0 the test does not make sense, we ought
> >>> to know the alignment.
> >>>
> >>>
> >> Yes, my first though is also to use nodesize directly, which should be
> >> always correct.
> >>
> >> But the problem is, the related function call stack doesn't have any
> >> member to reach btrfs_root or btrfs_fs_info.
> >
> > JFYI, there's global_info avalaible, so it's not necessary to pass
> > fs_info down the callstacks.
> >
> >
> Oh, that's good news.
> 
> Do I need to re-submit the patch to use fs_info->tree_root->nodesize to 
> avoid the false alert?
> Or wait for your refactor?

No need to resend, the refactored code is now in devel.


Re: [PATCH] btrfs-progs: fix re-declared get_device_info()

2016-04-05 Thread David Sterba
On Tue, Apr 05, 2016 at 03:36:01PM +0800, Anand Jain wrote:
> The other get_device_info() is in the same file, 4 lines above.
> 
> Signed-off-by: Anand Jain 

Applied, thanks.


Re: [PATCH] Btrfs: fix missing s_id setting

2016-04-05 Thread Anand Jain



On 04/05/2016 08:08 AM, Tsutomu Itoh wrote:

When fs_devices->latest_bdev is deleted or is replaced, sb->s_id has
not been updated.
As a result, the deleted device name is displayed by btrfs_printk.

[before fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [  217.458249] BTRFS info (device sdc4): found 1 extents
  [  217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
  [  217.941284] BTRFS info (device sdc4): disk added /dev/sdb6

[after fix]
  # btrfs dev del /dev/sdc4 /mnt2
  # btrfs dev add /dev/sdb6 /mnt2

  [   83.835072] BTRFS info (device sdc4): found 1 extents
  [   84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
  [   84.401951] BTRFS info (device sdc3): disk added /dev/sdb6



 [PATCH 05/13] Btrfs: fix fs logging for multi device

 any comments ?

 We would want to maintain the logging prefix as constant, so that
 troubleshooters with filters/scripts will find it helpful.

Thanks, Anand



Signed-off-by: Tsutomu Itoh 
---
  fs/btrfs/dev-replace.c |  5 -
  fs/btrfs/volumes.c | 11 +--
  2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct 
btrfs_fs_info *fs_info,
tgt_device->commit_bytes_used = src_device->bytes_used;
if (fs_info->sb->s_bdev == src_device->bdev)
fs_info->sb->s_bdev = tgt_device->bdev;
-   if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+   if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+   snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+tgt_device->bdev);
+   }
	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
fs_info->fs_devices->rw_devices++;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
 struct btrfs_device, dev_list);
if (device->bdev == root->fs_info->sb->s_bdev)
root->fs_info->sb->s_bdev = next_device->bdev;
-   if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+   if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+   snprintf(root->fs_info->sb->s_id,
+sizeof(root->fs_info->sb->s_id), "%pg",
+next_device->bdev);
+   }

if (device->bdev) {
device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct 
btrfs_fs_info *fs_info,
 struct btrfs_device, dev_list);
if (tgtdev->bdev == fs_info->sb->s_bdev)
fs_info->sb->s_bdev = next_device->bdev;
-   if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+   if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
fs_info->fs_devices->latest_bdev = next_device->bdev;
+   snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+next_device->bdev);
+   }
	list_del_rcu(&tgtdev->dev_list);

	call_rcu(&tgtdev->rcu, free_device);




BTRFS warnings in dmesg / scrub errors - Transaction aborted ( error -95 )

2016-04-05 Thread Robert Munteanu
Hi,

I have a btrfs partition which exhibits some strange behaviours:

- btrfs scrub seems to force the filesystem in read-only mode
- transactions aborted with error -95

For context, this is a partition converted from ext4 which 'survived'
through the issues described at

  http://www.spinics.net/lists/linux-btrfs/msg44660.html

The errors that I receive at runtime, during regular usage are

[   49.297851] [ cut here ]
[   49.297905] WARNING: CPU: 0 PID: 72 at ../fs/btrfs/inode.c:2937
btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]()
[   49.297911] BTRFS warning (device sda1):
btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).
[   49.297916] BTRFS: Transaction aborted (error -95)
[   49.297919] Modules linked in: bnep af_packet vboxpci(O)
vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 nf_log_ipv6
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT
nf_reject_ipv4 xt_tcpudp xt_pkttype nf_log_ipv4 nf_log_common xt_LOG
xt_limit iptable_raw xt_CT iptable_filter vboxdrv(O) ip6table_mangle
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4
nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter
ip6_tables x_tables msr xfs libcrc32c snd_hda_codec_hdmi
snd_hda_codec_realtek snd_hda_codec_generic i915 raid1 joydev
intel_rapl drm_kms_helper fb_sys_fops x86_pkg_temp_thermal syscopyarea
intel_powerclamp sysfillrect sysimgblt i2c_algo_bit coretemp kvm_intel
i2c_designware_platform snd_hda_intel i2c_designware_core 8250_dw
md_mod ppdev kvm snd_hda_codec
[   49.297991]  snd_hda_core irqbypass snd_seq crct10dif_pclmul
snd_hwdep crc32_pclmul ghash_clmulni_intel aesni_intel snd_pcm
aes_x86_64 hci_uart lrw gf128mul btbcm glue_helper ablk_helper cryptd
snd_seq_device pcspkr r8169 btqca i2c_i801 btintel snd_timer mii
bluetooth parport_pc dm_mod rfkill battery 8250_fintek parport snd
idma64 virt_dma pinctrl_sunrisepoint pinctrl_intel intel_lpss_acpi
soundcore mei_me intel_lpss_pci intel_lpss mei acpi_als mfd_core
kfifo_buf industrialio shpchp tpm_tis thermal acpi_pad fan tpm fjes
hid_generic hid_logitech_hidpp hid_logitech_dj usbhid btrfs xor uas
usb_storage mxm_wmi raid6_pq crc32c_intel serio_raw nvidia_uvm(PO)
nvidia(PO) sr_mod cdrom xhci_pci xhci_hcd usbcore usb_common drm
i2c_hid wmi video button sg
[   49.298081] CPU: 0 PID: 72 Comm: kworker/u8:4 Tainted: P           O    4.5.0-2-default #1
[   49.298084] Hardware name: MSI MS-7971/H170A PC MATE (MS-7971),
BIOS B.60 02/23/2016
[   49.298131] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[   49.298135]   88045b837c98 81395921
88045b837ce0
[   49.298142]  a0d096be 88045b837cd0 8107d912
880089ab85c8
[   49.298147]  88045b9c5000 ffa1 88045dea5690
880459efd0c0
[   49.298153] Call Trace:
[   49.298170]  [] try_stack_unwind+0x175/0x190
[   49.298185]  [] dump_trace+0x69/0x3a0
[   49.298193]  [] show_trace_log_lvl+0x4b/0x60
[   49.298201]  [] show_stack_log_lvl+0x10c/0x180
[   49.298208]  [] show_stack+0x25/0x50
[   49.298218]  [] dump_stack+0x63/0x82
[   49.298230]  [] warn_slowpath_common+0x82/0xc0
[   49.298240]  [] warn_slowpath_fmt+0x4c/0x50
[   49.298281]  [] btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]
[   49.298326]  [] finish_ordered_fn+0x15/0x20 [btrfs]
[   49.298369]  [] btrfs_scrubparity_helper+0xcb/0x350 [btrfs]
[   49.298413]  [] btrfs_endio_write_helper+0xe/0x10 [btrfs]
[   49.298421]  [] process_one_work+0x15c/0x4b0
[   49.298428]  [] worker_thread+0x48/0x4e0
[   49.298436]  [] kthread+0xc9/0xe0
[   49.298445]  [] ret_from_fork+0x3f/0x70
[   49.302378] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

[   49.302383] Leftover inexact backtrace:

[   49.302392]  [] ? kthread_worker_fn+0x180/0x180
[   49.302488] ---[ end trace a383d63cdc6dbae3 ]---
[   49.302502] BTRFS warning (device sda1):
btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).

I also recently got an error related to scrubbing which (IIRC) caused
the fs to become read-only

Apr 05 10:13:06 mars kernel: [ cut here ]
Apr 05 10:13:06 mars kernel: WARNING: CPU: 0 PID: 72 at
../fs/btrfs/inode.c:2937 btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]()
Apr 05 10:13:06 mars kernel: BTRFS warning (device sda1):
btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).
Apr 05 10:13:06 mars kernel: BTRFS: Transaction aborted (error -95)
Apr 05 10:13:06 mars kernel: Modules linked in: bnep af_packet
vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6
nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT
nf_reject_ipv
Apr 05 10:13:06 mars kernel:  snd_hda_core irqbypass snd_seq
crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel
aesni_intel snd_pcm aes_x86_64 hci_uart lrw gf128mul btbcm glue_helper
ablk_helper cryptd
Apr 05 10:13:06 mars kernel: CPU: 0 PID: 72 Comm: kworker/u8:4
Tainted: P           O    4.5.0-2-default #1
Apr 05 10:13:06 mars kernel: Hardware name: 

[PATCH] btrfs-progs: fix re-declared get_device_info()

2016-04-05 Thread Anand Jain
The other get_device_info() is in the same file, 4 lines above.

Signed-off-by: Anand Jain 
---
 utils.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/utils.h b/utils.h
index ea54692aca86..e2ff41ad4ec1 100644
--- a/utils.h
+++ b/utils.h
@@ -187,8 +187,6 @@ int get_device_info(int fd, u64 devid,
struct btrfs_ioctl_dev_info_args *di_args);
 int test_uuid_unique(char *fs_uuid);
 u64 disk_size(const char *path);
-int get_device_info(int fd, u64 devid,
-   struct btrfs_ioctl_dev_info_args *di_args);
 u64 get_partition_size(const char *dev);
 
 int test_minimum_size(const char *file, u32 leafsize);
-- 
2.7.0



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-05 Thread Kai Krakow
Am Mon, 4 Apr 2016 17:09:14 -0600
schrieb Chris Murphy :

> > Why couldn't/shouldn't I remove snapshots before detaching the seed
> > device? I want to keep them on the seed but they are useless to me
> > on the sprout.  
> 
> You can remove snapshots before or after detaching the seed device, it
> doesn't matter, but such snapshot removal only affects the sprout. You
> wrote:
> 
> "remove all left-over snapshots from the seed"
> 
> The seed is read only, you can't modify the contents of the seed
> device.

Sorry, not a native speaker... What I actually meant was to remove the
snapshot that originated from the seed, and which I don't need in the
sprout.

-- 
Regards,
Kai

Replies to list-only preferred.
