Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Chris Murphy

On Sep 11, 2014, at 11:19 PM, Russell Coker  wrote:

> It would be nice if a file system mounted ro counted as ro snapshots for 
> btrfs send.
> 
> When a file system is so messed up it can't be mounted rw it should be 
> regarded as ro for all operations.

Yes it's come up before, and there's a question whether mount -o ro is reliably 
ro enough for this. Maybe a force option?

But then another one is a recursive btrfs send to go along with the above. I 
might want them all, or I might want all of the ones in two particular 
subvolumes, etc. And even combine the recursive ro snapshot and recursive send 
as a btrfs rescue option that would work even if the volume is mounted 
read-only.


Chris Murphy--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Duncan
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> Sure. But what's the next step? Given 260+ snapshots might mean well
> more than 350GB of data, depending on how deduplicated the fs is, it
> still probably would be faster to rsync this to a pile of drives in
> linear/concat+XFS than wait a month (?) for device delete to finish.

That was what I was getting at in my other just-finished short reply.  It 
may be time to give up on the btrfs specific solutions for the moment and 
go with tried and tested traditional solutions (tho I'd definitely *NOT* 
try rsync or the like with the delete still going, we know from other 
reports that rsync places its own stresses on btrfs and one major 
stressor, the delete-triggered rebalance, at a time, is bad enough).

> Alternatively, script some way to create 260+ ro snapshots to btrfs
> send/receive to a new btrfs volume; and turn it into a raid1 later.

No confirmation yet but I strongly suspect most of those subs are 
snapshots.  Assuming that's the case, it's very likely most of them can 
simply be eliminated as I originally suggested, a process that /should/ 
be fast, decomplexifying the situation dramatically.

> I'm curious if a sysrq+s followed by sysrq+u might leave the filessystem
> in a state where it could still be rw mountable. But I'm skeptical of
> anything interrupting the device delete before being fully prepared for
> the fs to be toast for rw mount. If only ro mount is possible, any
> chance of creating ro snapshots is out.

In theory, that is, barring bugs, interrupting the delete with normal 
shutdown to the extent possible, then sysrq+s, sysrq+u, should not be a 
problem.  The delete is basically a balance, going chunk by chunk, and 
either the chunk has been duplicated to the new device or it hasn't.  In 
either case, the existing chunk on the remaining old device shouldn't be 
affected.

So rebooting in that way in ordered to stop the delete temporarily 
/should/ have no bad effects.  Of course, that's barring bugs.  Btrfs is 
still not fully stabilized, and bugs do happen, so anything's possible.  
But I'd consider it safe enough to try here, certainly so if I had 
backups, as is still STRONGLY recommended for btrfs at this point, much 
more so than the routine sysadmin "if it's not backed up by definition 
it's not valuable to you" rule.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show

2014-09-11 Thread Satoru Takeuchi
Hi Gui,

(2014/09/12 10:15), Gui Hecheng wrote:
> For btrfs fi show, -d|--all-devices & -m|--mounted will
> overwrite each other, so if specified both, let the user
> know that he should not use them at the same time.
> 
> Signed-off-by: Gui Hecheng 
> ---
> changelog:
>   v1->v2: add option conflict descriptions to manpage and usage.
> ---
>   Documentation/btrfs-filesystem.txt |  9 ++---
>   cmds-filesystem.c  | 12 ++--
>   2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/btrfs-filesystem.txt 
> b/Documentation/btrfs-filesystem.txt
> index c9c0b00..d3d2dcc 100644
> --- a/Documentation/btrfs-filesystem.txt
> +++ b/Documentation/btrfs-filesystem.txt
> @@ -20,15 +20,18 @@ SUBCOMMAND
>   *df*  [...]::
>   Show space usage information for a mount point.
>   
> -*show* [--mounted|--all-devices]::
> +*show* [-m|--mounted|-d|--all-devices]::

This line seems to be too long. Please see also the
following thread.

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36270.html

Thanks,
Satoru


>   Show the btrfs filesystem with some additional info.
>   +
>   If no option nor ||| is passed, btrfs shows
>   information of all the btrfs filesystem both mounted and unmounted.
> -If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
> +If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted 
> btrfs
>   filesystem(s);
> -If '--all-devices' is passed, all the devices under /dev are scanned;
> +If '-d|--all-devices' is passed, all the devices under /dev are scanned;
>   otherwise the devices list is extracted from the /proc/partitions file.
> +Don't combine -m|--mounted and -d|--all-devices, because these two options
> +will overwrite each other, and only one scan way will be adopted,
> +probe the kernel to scan or scan devices under /dev.
>   
>   *sync* ::
>   Force a sync for the filesystem identified by .
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 69c1ca5..51c4c55 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
>   "-d|--all-devices   show only disks under /dev containing btrfs 
> filesystem",
>   "-m|--mounted   show only mounted btrfs",
>   "If no argument is given, structure of all present filesystems is 
> shown.",
> + "Don't combine -d|--all-devices and -m|--mounted, refer to manpage for 
> details.",
>   NULL
>   };
>   
> @@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
>   break;
>   switch (c) {
>   case 'd':
> - where = BTRFS_SCAN_PROC;
> + where &= ~BTRFS_SCAN_LBLKID;
> + where |= BTRFS_SCAN_PROC;
>   break;
>   case 'm':
> - where = BTRFS_SCAN_MOUNTED;
> + where &= ~BTRFS_SCAN_LBLKID;
> + where |= BTRFS_SCAN_MOUNTED;
>   break;
>   default:
>   usage(cmd_show_usage);
>   }
>   }
>   
> + if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
> + fprintf(stderr, "Don't use -d|--all-devices and -m|--mounted 
> options at the same time.\n");
> + usage(cmd_show_usage);
> + }
> +
>   if (check_argc_max(argc, optind + 1))
>   usage(cmd_show_usage);
>   
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Duncan
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> And then when I think about just creating a new fs, using btrfs
> send/receive, the snapshots need to be ro first.

FWIW, at this point I'd forget about send/receive and create the backup 
(assuming one doesn't exist already) using more normal methods.  At least 
if the original send/receive hasn't yet been done, so it'd be copying off 
all the data anyway.  Perhaps mount selected snapshots and back them up 
too (after the current case is backed up), but throw away most of the 
snapshots.

Of course if there's an existing relatively current sent/received base to 
build on, and no indication that send/receive is broken, definitely try 
that first as the amount of data to sync in that case should be MUCH 
lower, but if not...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Duncan
Russell Coker posted on Fri, 12 Sep 2014 15:19:04 +1000 as excerpted:

> It would be nice if a file system mounted ro counted as ro snapshots for
> btrfs send.
> 
> When a file system is so messed up it can't be mounted rw it should be
> regarded as ro for all operations.

Indeed, and that has been suggested before, but unfortunately it's not 
current behavior.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Russell Coker
It would be nice if a file system mounted ro counted as ro snapshots for btrfs 
send.

When a file system is so messed up it can't be mounted rw it should be regarded 
as ro for all operations.
-- 
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Chris Murphy

On Sep 11, 2014, at 5:51 PM, Duncan <1i5t5.dun...@cox.net> wrote:

> Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:
> 
>> On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> 
>>> I wouldn't try defragging now, but it might be worthwhile to stop the
>>> device delete (rebooting to do so since I don't think there's a cancel)
>> 
>> 'btrfs replace cancel' does exist, although I haven't tried it.
> 
> Btrfs replace cancel exists, yes, but does it work for btrfs device 
> delete, which is what the OP used?

Oops, right! I'm not sure what can do this safely.

And then when I think about just creating a new fs, using btrfs send/receive, 
the snapshots need to be ro first. So if there's any uncertainty about safely 
canceling the 'device delete' those ro snapshots need to be taken first, in the 
event only an ro mount is possible subsequently. And then there's some 
uncertainty how long 260+ ro snapshots will take (should be fast, but) and how 
much worse that makes the current situation. But probably worth the risk to 
take the snapshots and just wait a while before trying something like umount or 
sysrq+s followed by sysrq+u.



> 
>> Something isn't right though, because it's clearly neither reading nor
>> writing at anywhere close to 1/2 the drive read throughput. I'm curious
>> what 'iotop -d30 -o' shows (during the replace, before cancel), which
>> should be pretty consistent by averaging 30 seconds worth of io. And
>> then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet
>> there's a lot of nothing going on, with occasional spikes, rather than a
>> constant trickle.
>> 
>> And then the question is to find out what btrfs is thinking about while
>> nothing is reading or writing. Even though it's not 5000+ snapshots, I
>> wonder if the balance code (and hence btrfs replace) makes extensive use
>> of fiemap that's causing this to go catatonic.
>> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724
> 
> Not sure (some of that stuff's beyond me), but one thing we /do/ know is 
> that btrfs has so far been focused mostly on features and debugging, not 
> on optimization beyond the worst-cases, which themselves remain a big 
> enough problem, tho it's slowly getting better.

Sure. But what's the next step? Given 260+ snapshots might mean well more than 
350GB of data, depending on how deduplicated the fs is, it still probably would 
be faster to rsync this to a pile of drives in linear/concat+XFS than wait a 
month (?) for device delete to finish.

Alternatively, script some way to create 260+ ro snapshots to btrfs 
send/receive to a new btrfs volume; and turn it into a raid1 later.

I'm curious if a sysrq+s followed by sysrq+u might leave the filessystem in a 
state where it could still be rw mountable. But I'm skeptical of anything 
interrupting the device delete before being fully prepared for the fs to be 
toast for rw mount. If only ro mount is possible, any chance of creating ro 
snapshots is out.


Chris Murphy--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix a overflowing boundary writing in csum_tree_block

2014-09-11 Thread Li RongQing
Do you means we can handle it like below?  I think it is not better,
if that, csum size can not the expand, and the code in csum_tree_block
seems redundancy;

If you do not want to truncate in first patch, I think we can try to avoid it

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 22a3676..baed6c0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -879,6 +879,7 @@ static struct dentry *btrfs_mount(struct
file_system_type *fs_type, int flags,
u64 subvol_objectid = 0;
u64 subvol_rootid = 0;
int error = 0;
+   u16 csum_size;

if (!(flags & MS_RDONLY))
mode |= FMODE_WRITE;
@@ -974,6 +975,13 @@ static struct dentry *btrfs_mount(struct
file_system_type *fs_type, int flags,
return root;
}

+   csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+   if (csum_size > 4) {
+   printk(KERN_ERR "btrfs: csume size is larger than 4\n");
+   deactivate_locked_super(s);
+   return -EINVAL;
+   }
+
return root;

 error_close_devices:

On Thu, Sep 11, 2014 at 10:02 PM, Chris Mason  wrote:
> On 09/09/2014 05:19 AM, rongqing...@windriver.com wrote:
>> From: Li RongQing 
>>
>> It is impossible that csum_size is larger than sizeof(long), but the codes
>> still add the handler for this condition, like allocate new memory, for
>> extension. If it becomes true someday, copying csum_size size memory to local
>> 32bit variable found and val will overflow these two variables.
>>
>> Fix it by returning the max 4 byte checksum.
>
> Thanks for the patch.  I'd rather not silently truncate the copy down
> though.  How about a one time check at mount to make sure the value in
> the super is reasonable?
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show

2014-09-11 Thread Gui Hecheng
For btrfs fi show, -d|--all-devices & -m|--mounted will
overwrite each other, so if specified both, let the user
know that he should not use them at the same time.

Signed-off-by: Gui Hecheng 
---
changelog:
v1->v2: add option conflict descriptions to manpage and usage.
---
 Documentation/btrfs-filesystem.txt |  9 ++---
 cmds-filesystem.c  | 12 ++--
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt 
b/Documentation/btrfs-filesystem.txt
index c9c0b00..d3d2dcc 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -20,15 +20,18 @@ SUBCOMMAND
 *df*  [...]::
 Show space usage information for a mount point.
 
-*show* [--mounted|--all-devices]::
+*show* [-m|--mounted|-d|--all-devices]::
 Show the btrfs filesystem with some additional info.
 +
 If no option nor ||| is passed, btrfs shows
 information of all the btrfs filesystem both mounted and unmounted.
-If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
+If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
 filesystem(s);
-If '--all-devices' is passed, all the devices under /dev are scanned;
+If '-d|--all-devices' is passed, all the devices under /dev are scanned;
 otherwise the devices list is extracted from the /proc/partitions file.
+Don't combine -m|--mounted and -d|--all-devices, because these two options
+will overwrite each other, and only one scan way will be adopted,
+probe the kernel to scan or scan devices under /dev.
 
 *sync* ::
 Force a sync for the filesystem identified by .
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 69c1ca5..51c4c55 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
"-d|--all-devices   show only disks under /dev containing btrfs 
filesystem",
"-m|--mounted   show only mounted btrfs",
"If no argument is given, structure of all present filesystems is 
shown.",
+   "Don't combine -d|--all-devices and -m|--mounted, refer to manpage for 
details.",
NULL
 };
 
@@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
break;
switch (c) {
case 'd':
-   where = BTRFS_SCAN_PROC;
+   where &= ~BTRFS_SCAN_LBLKID;
+   where |= BTRFS_SCAN_PROC;
break;
case 'm':
-   where = BTRFS_SCAN_MOUNTED;
+   where &= ~BTRFS_SCAN_LBLKID;
+   where |= BTRFS_SCAN_MOUNTED;
break;
default:
usage(cmd_show_usage);
}
}
 
+   if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
+   fprintf(stderr, "Don't use -d|--all-devices and -m|--mounted 
options at the same time.\n");
+   usage(cmd_show_usage);
+   }
+
if (check_argc_max(argc, optind + 1))
usage(cmd_show_usage);
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Duncan
Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:

> On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> 
>> I wouldn't try defragging now, but it might be worthwhile to stop the
>> device delete (rebooting to do so since I don't think there's a cancel)
> 
> 'btrfs replace cancel' does exist, although I haven't tried it.

Btrfs replace cancel exists, yes, but does it work for btrfs device 
delete, which is what the OP used?

> Something isn't right though, because it's clearly neither reading nor
> writing at anywhere close to 1/2 the drive read throughput. I'm curious
> what 'iotop -d30 -o' shows (during the replace, before cancel), which
> should be pretty consistent by averaging 30 seconds worth of io. And
> then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet
> there's a lot of nothing going on, with occasional spikes, rather than a
> constant trickle.
> 
> And then the question is to find out what btrfs is thinking about while
> nothing is reading or writing. Even though it's not 5000+ snapshots, I
> wonder if the balance code (and hence btrfs replace) makes extensive use
> of fiemap that's causing this to go catatonic.
> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Not sure (some of that stuff's beyond me), but one thing we /do/ know is 
that btrfs has so far been focused mostly on features and debugging, not 
on optimization beyond the worst-cases, which themselves remain a big 
enough problem, tho it's slowly getting better.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Tomasz Chmielewski

After a disk died and was replaced, "btrfs device delete missing" is
taking more than 10 days on an otherwise idle server:


Something isn't right though, because it's clearly neither reading nor 
writing at \
anywhere close to 1/2 the drive read throughput. I'm curious what 
'iotop -d30 -o' \
shows (during the replace, before cancel), which should be pretty 
consistent by \
averaging 30 seconds worth of io. And then try 'iotop -d3 -o' and see 
if there are \
spikes. I'm willing to bet there's a lot of nothing going on, with 
occasional spikes, \

rather than a constant trickle.


That's more or less what I'm seeing with both. The numbers will go up or 
down slightly, but it's counted in kilobytes per second:


Total DISK READ:   0.00 B/s | Total DISK WRITE: 545.82 B/s
  TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>COMMAND
  940 be/3 root0.00 B/s  136.46 B/s  0.00 %  0.10 % [jbd2/md2-8]
 4714 be/4 root0.00 B/s  329.94 K/s  0.00 %  0.00 % 
[btrfs-transacti]
25534 be/4 root0.00 B/s  402.97 K/s  0.00 %  0.00 % 
[kworker/u16:0]



The bottleneck may be here - one CPU core is mostly 100% busy (kworker). 
Not sure what it's really busy with though:


  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
25546 root   20   0 0 0 0 R 93.0  0.0 18:22.94 
kworker/u16:7
14473 root   20   0 0 0 0 S  5.0  0.0 25:00.14 
kworker/0:0



[912979.063432] SysRq : Show Blocked State
[912979.063485]   taskPC stack   pid father
[912979.063545] btrfs   D 88083fa515c0 0  4793   4622 
0x
[912979.063601]  88061a29b878 0086  
88003683e040
[912979.063701]  000115c0 4000 880813e3 
88003683e040
[912979.063800]  88061a29b7e8 8105d8e9 88083fa4 
88083fa115c0

[912979.063899] Call Trace:
[912979.063951]  [] ? enqueue_task_fair+0x3e5/0x44f
[912979.064006]  [] ? resched_curr+0x47/0x57
[912979.064058]  [] ? check_preempt_curr+0x3e/0x6d
[912979.064111]  [] ? ttwu_do_wakeup+0x12/0x7f
[912979.064164]  [] ? 
ttwu_do_activate.constprop.74+0x57/0x5c

[912979.064220]  [] schedule+0x65/0x67
[912979.064272]  [] schedule_timeout+0x26/0x198
[912979.064324]  [] ? wake_up_process+0x31/0x35
[912979.064378]  [] ? wake_up_worker+0x1f/0x21
[912979.064431]  [] ? insert_work+0x87/0x94
[912979.064493]  [] ? free_block_list+0x1f/0x34 
[btrfs]

[912979.064548]  [] wait_for_common+0x10d/0x13e
[912979.064600]  [] ? try_to_wake_up+0x251/0x251
[912979.064653]  [] wait_for_completion+0x18/0x1a
[912979.064710]  [] 
btrfs_async_run_delayed_refs+0xc1/0xe4 [btrfs]
[912979.064814]  [] 
__btrfs_end_transaction+0x2bb/0x2e1 [btrfs]
[912979.064916]  [] 
btrfs_end_transaction_throttle+0xe/0x10 [btrfs]
[912979.065020]  [] relocate_block_group+0x2ad/0x4de 
[btrfs]
[912979.065079]  [] 
btrfs_relocate_block_group+0x158/0x278 [btrfs]
[912979.065184]  [] 
btrfs_relocate_chunk.isra.62+0x58/0x5f7 [btrfs]
[912979.065286]  [] ? 
btrfs_set_lock_blocking_rw+0x68/0x95 [btrfs]
[912979.065387]  [] ? 
btrfs_set_path_blocking+0x23/0x54 [btrfs]
[912979.065486]  [] ? btrfs_search_slot+0x7bc/0x816 
[btrfs]
[912979.065546]  [] ? free_extent_buffer+0x6f/0x7c 
[btrfs]
[912979.065605]  [] btrfs_shrink_device+0x23c/0x3a5 
[btrfs]
[912979.065679]  [] btrfs_rm_device+0x2a1/0x759 
[btrfs]

[912979.065747]  [] btrfs_ioctl+0xa52/0x227f [btrfs]
[912979.065811]  [] ? putname+0x23/0x2c
[912979.065863]  [] ? user_path_at_empty+0x60/0x90
[912979.065918]  [] ? avc_has_perm+0x2e/0xf7
[912979.065978]  [] ? __vm_enough_memory+0x25/0x13c
[912979.066032]  [] do_vfs_ioctl+0x3f2/0x43c
[912979.066084]  [] ? vfs_stat+0x16/0x18
[912979.066136]  [] SyS_ioctl+0x4e/0x7d
[912979.066188]  [] ? do_page_fault+0xc/0xf
[912979.066240]  [] system_call_fastpath+0x16/0x1b
[912979.066296] Sched Debug Version: v0.11, 3.17.0-rc3 #1
[912979.066347] ktime   : 
913460840.666210
[912979.066401] sched_clk   : 
912979066.295474
[912979.066454] cpu_clk : 
912979066.295485

[912979.066507] jiffies : 4386283381
[912979.066560] sched_clock_stable(): 1
[912979.066610]
[912979.066656] sysctl_sched
[912979.066703]   .sysctl_sched_latency: 24.00
[912979.066756]   .sysctl_sched_min_granularity: 3.00
[912979.066809]   .sysctl_sched_wakeup_granularity : 4.00
[912979.066861]   .sysctl_sched_child_runs_first   : 0
[912979.066913]   .sysctl_sched_features   : 11899
[912979.068196]   .sysctl_sched_tunable_scaling: 1 
(logaritmic)

[912979.068248]
[912979.068294] cpu#0, 3411.379 MHz
[912979.068342]   .nr_running: 0
[912979.068393]   .load  : 0
[912979.068443]   .nr_switches   : 374480347
[912979.068495]   .nr_load_updates   : 562

Re: [PATCH v2] btrfs: LLVMLinux: Remove VLAIS

2014-09-11 Thread Behan Webster

On 09/11/14 09:34, Chris Mason wrote:


On 09/05/2014 06:58 PM, beh...@converseincode.com wrote:

From: Vinícius Tinti 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This is the original VLAIS struct.

struct {
struct shash_desc shash;
char ctx[crypto_shash_descsize(tfm)];
} desc;

This patch instead allocates the appropriate amount of memory using an char
array.

The new code can be compiled with both gcc and clang.

struct shash_desc contains a flexible array member member ctx declared with
CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning
of the array declared after struct shash_desc with long long.

No trailing padding is required because it is not a struct type that can
be used in an array.

The CRYPTO_MINALIGN_ATTR is required so that desc is aligned with long long
as would be the case for a struct containing a member with
CRYPTO_MINALIGN_ATTR.

We copied this from one of the other crypto api users, and I'm sure
cooking up all these patches was not a great way to your afternoon.

But, can I talk you into making a helper macro of some kind for this and
putting it into the crypto api headers?  Honestly this setup seems
really error prone.
You're not the only person to ask for this. I've got one already coded 
up based on suggestions from tglx.


I've regenerated all the patches. I will resubmit soon.

Will fix.

Behan

--
Behan Webster
beh...@converseincode.com

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Chris Murphy

On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> 
> I wouldn't try defragging now, but it might be worthwhile to stop the
> device delete (rebooting to do so since I don't think there's a cancel)

'btrfs replace cancel' does exist, although I haven't tried it.

Something isn't right though, because it's clearly neither reading nor writing 
at anywhere close to 1/2 the drive read throughput. I'm curious what 'iotop 
-d30 -o' shows (during the replace, before cancel), which should be pretty 
consistent by averaging 30 seconds worth of io. And then try 'iotop -d3 -o' and 
see if there are spikes. I'm willing to bet there's a lot of nothing going on, 
with occasional spikes, rather than a constant trickle. 

And then the question is to find out what btrfs is thinking about while nothing 
is reading or writing. Even though it's not 5000+ snapshots, I wonder if the 
balance code (and hence btrfs replace) makes extensive use of fiemap that's 
causing this to go catatonic. 
http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Duncan
Tomasz Chmielewski posted on Thu, 11 Sep 2014 17:22:15 +0200 as excerpted:

> After a disk died and was replaced, "btrfs device delete missing" is 
> taking more than 10 days on an otherwise idle server:
> 
> # btrfs fi show /home
> Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
>  Total devices 3 FS bytes used 362.44GiB
>  devid2 size 1.71TiB used 365.03GiB path /dev/sdb4
>  devid3 size 1.71TiB used 58.00GiB path /dev/sda4
>  *** Some devices missing
> 
> Btrfs v3.16
> 
> So far, it has copied 58 GB out of 365 GB - and it took 10 days. At this 
> speed, the whole operation will take 2-3 months (assuming that the only 
> healthy disk doesn't die in the meantime).
> Is this expected time for btrfs RAID-1?

Device delete definitely takes time.  For the sub-GiB usage shown above,
10 days for 50 GiB out of 350+ does seem excessive, but there are extreme
cases where it isn't entirely out of line.  See below.

> There are no errors in dmesg/smart, performance of both disks is fine:

> # btrfs sub list /home | wc -l
> 260

> I've tried running this on the latest 3.16.x kernel earlier, but since 
> the progress was so slow, rebooted after about a week to see if the 
> latest RC will be any faster.

The good thing is that once a block group is copied over, it should be
fine and won't need re-copied if the process is stopped over a reboot
and restarted on a new kernel, etc.

The bad thing is that if I'm interpreting your report correctly, that
likely means 7+10=17 days for that 58 gig. =:^(

Questions:

* Presumably most of those 260 subvolumes are snapshots, correct?
What was your snapshotting schedule and did you have old snapshot
cleanup-deletion scheduled as well?

* Do you run with autodefrag or was the system otherwise regularly
defragged?

* Do you have large (GiB plus) database or virtual machine image files
on that filesystem?  If so, had you properly set the NOCOW file
attribute (chattr +C) on them and were they on dedicated subvolumes?


200+ snapshots is somewhat high and could be part of the issue, tho
it's nothing like the extremes (thousands) we've seen posted in the
past.  Were it me, I'd have tried deleting as many as possible before
the device delete missing, in ordered to simplify the process and
eliminate as much "extra" data as possible.

The real issue is going to be fragmentation, on spinning-rust drives.
Run filefrag on some of your gig-plus files that get written to
frequently (vm-images and database files are the classic cases) and
see how many extents are reported.  (Tho note that filefrag doesn't
understand btrfs compression and won't be accurate in that case, and
also that due to the btrfs data chunk size of 1 GiB, that's the
maximum extent size, so multi-gig files will typically be two extents
more than the number of gigs, filling up the current chunk,
N whole-gig chunks, the file tail.)  The nocow file attribute (which
must be set while the file is zero-sized to be effective, see
discussion elsewhere) can help with that, but snapshotting an
actively being rewritten nocow file more or less defeats the
purpose of nocow, since the snapshot locks in place the existing
data and the first rewrite to a block must then be cowed anyway.
But putting those files on dedicated subvolumes and then not
snapshotting those subvolumes is a workaround.


I wouldn't try defragging now, but it might be worthwhile to stop the
device delete (rebooting to do so since I don't think there's a cancel)
and delete as many snapshots as possible.  That should help matters.
Additionally, if you have recent backups of highly fragmented files
such as the VM-images and DBs I mentioned, you might consider simply
deleting them, thus eliminating that fragment processing from the
device delete.  I don't know that making a backup now would go much
faster than the device delete, however, so I don't know whether to
recommend that or not.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: set inode's logged_trans/last_log_commit after ranged fsync

2014-09-11 Thread Filipe Manana
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).

Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.

This is an incremental on top of the v4 version of the patch:

"Btrfs: fix fsync data loss after a ranged fsync"

which was added to its v5, but didn't make it on time.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/btrfs_inode.h | 13 +++--
 fs/btrfs/tree-log.c| 14 ++
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 74ff403..3511031 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -246,8 +246,17 @@ static inline int btrfs_inode_in_log(struct inode *inode, 
u64 generation)
BTRFS_I(inode)->last_sub_trans <=
BTRFS_I(inode)->last_log_commit &&
BTRFS_I(inode)->last_sub_trans <=
-   BTRFS_I(inode)->root->last_log_commit)
-   return 1;
+   BTRFS_I(inode)->root->last_log_commit) {
+   /*
+* After a ranged fsync we might have left some extent maps
+* (that fall outside the fsync's range). So return false
+* here if the list isn't empty, to make sure btrfs_log_inode()
+* will be called and process those extent maps.
+*/
+   smp_mb();
+   if (list_empty(&BTRFS_I(inode)->extent_tree.modified_extents))
+   return 1;
+   }
return 0;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 82db14f..d7c1459 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4090,18 +4090,8 @@ log_extents:
}
}
 
-   write_lock(&em_tree->lock);
-   /*
-* If we're doing a ranged fsync and there are still modified extents
-* in the list, we must run on the next fsync call as it might cover
-* those extents (a full fsync or an fsync for other range).
-*/
-   if (list_empty(&em_tree->modified_extents)) {
-   BTRFS_I(inode)->logged_trans = trans->transid;
-   BTRFS_I(inode)->last_log_commit =
-   BTRFS_I(inode)->last_sub_trans;
-   }
-   write_unlock(&em_tree->lock);
+   BTRFS_I(inode)->logged_trans = trans->transid;
+   BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->last_sub_trans;
 out_unlock:
if (unlikely(err))
btrfs_put_logged_extents(&logged_list);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs listing is wrong

2014-09-11 Thread Duncan
Mark Murawski posted on Thu, 11 Sep 2014 10:45:58 -0400 as excerpted:

> Label: 'Root'  uuid: d71404d4-468e-47d5-8f06-3b65fa7776aa
>  Total devices 2 FS bytes used 7.46GiB
>  devid1 size 9.31GiB used 8.06GiB path /dev/sdh6
>  devid3 size 9.31GiB used 8.06GiB path
> /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
> 
> # ls -al /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
> lrwxrwxrwx 1 root root 10 Sep 11 10:43
> /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa -> ../../sdh6
> 
> devid 3 should really  be showing /dev/sde6

I believe that's an instance of a known bug (or depending on how you look 
at it possibly two) with a recently posted patch, but I doubt it has made 
it to mainline just yet.

What they're going to do is resolve to the canonical paths.  They don't 
do that at this point, and the paths can change, sometimes showing the 
wrong one for certain devices.

I'm not sure the fix will make it into this kernel cycle, however.  It 
should make it into 3.18 in any case, and will likely be stable-backported 
after that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deadlock with 3.15.10

2014-09-11 Thread Clemens Eisserer
Hi Liu,

>> I've recently run into a deadlock on 3.15.10, don't know if the kernel
>> stack-trace is useful: http://pastebin.com/guQNDAMX
>
> FYI, this's been fixed by https://patchwork.kernel.org/patch/4727711/

Thanks for letting me know.

- Clemens
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/18] Btrfs: Fix misuse of chunk mutex

2014-09-11 Thread Chris Mason


On 09/03/2014 09:35 AM, Miao Xie wrote:
> There were several problems about chunk mutex usage:
> - Lock chunk mutex when updating metadata. It would cause the nested
>   deadlock because updating metadata might need allocate new chunks
>   that need acquire chunk mutex. We remove chunk mutex at this case,
>   because b-tree lock and other lock mechanism can help us.
> - ABBA deadlock occured between device_list_mutex and chunk_mutex.
>   When we update device status, we must acquire device_list_mutex at the
>   beginning, and then we might get chunk_mutex during the device status
>   update because we need allocate new chunks for metadata COW. But at
>   most place, we acquire chunk_mutex at first and then acquire device list
>   mutex. We need change the lock order.
> - Some place we needn't acquire chunk_mutex. For example we needn't get
>   chunk_mutex when we free a empty seed fs_devices structure.
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9f22398d..357f911 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> 
> @@ -2619,10 +2615,23 @@ static int btrfs_relocate_chunk(struct btrfs_root 
> *root,
>   map = (struct map_lookup *)em->bdev;
>  
>   for (i = 0; i < map->num_stripes; i++) {
> - ret = btrfs_free_dev_extent(trans, map->stripes[i].dev,
> - map->stripes[i].physical);
> + device = map->stripes[i].dev;
> + ret = btrfs_free_dev_extent(trans, device,
> + map->stripes[i].physical,
> + &dev_extent_len);
>   BUG_ON(ret);

gcc is worried that dev_extent_len may be used uninitialized here.  The
BUG_ON makes it unlikely we'll notice dev_extent_len, but I set it to
zero in my version here.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs listing is wrong

2014-09-11 Thread Mark Murawski

Label: 'Root'  uuid: d71404d4-468e-47d5-8f06-3b65fa7776aa
Total devices 2 FS bytes used 7.46GiB
devid1 size 9.31GiB used 8.06GiB path /dev/sdh6
devid3 size 9.31GiB used 8.06GiB path 
/dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa


# ls -al /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
lrwxrwxrwx 1 root root 10 Sep 11 10:43 
/dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa -> ../../sdh6


devid 3 should really  be showing /dev/sde6
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


how long should "btrfs device delete missing ..." take?

2014-09-11 Thread Tomasz Chmielewski
After a disk died and was replaced, "btrfs device delete missing" is 
taking more than 10 days on an otherwise idle server:


# btrfs fi show /home
Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
Total devices 3 FS bytes used 362.44GiB
devid2 size 1.71TiB used 365.03GiB path /dev/sdb4
devid3 size 1.71TiB used 58.00GiB path /dev/sda4
*** Some devices missing

Btrfs v3.16



So far, it has copied 58 GB out of 365 GB - and it took 10 days. At this 
speed, the whole operation will take 2-3 months (assuming that the only 
healthy disk doesn't die in the meantime).

Is this expected time for btrfs RAID-1?

There are no errors in dmesg/smart, performance of both disks is fine:

# hdparm -t /dev/sda /dev/sdb

/dev/sda:
 Timing buffered disk reads: 442 MB in  3.01 seconds = 146.99 MB/sec

/dev/sdb:
 Timing buffered disk reads: 402 MB in  3.39 seconds = 118.47 MB/sec


# btrfs fi df /home
Data, RAID1: total=352.00GiB, used=351.02GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=13.00GiB, used=11.38GiB
unknown, single: total=512.00MiB, used=67.05MiB

# btrfs sub list /home | wc -l
260

# uptime
 17:21:53 up 10 days,  6:01,  2 users,  load average: 3.22, 3.53, 3.55


I've tried running this on the latest 3.16.x kernel earlier, but since 
the progress was so slow, rebooted after about a week to see if the 
latest RC will be any faster.



--
Tomasz Chmielewski
http://www.sslrack.com



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: LLVMLinux: Remove VLAIS

2014-09-11 Thread Chris Mason


On 09/05/2014 06:58 PM, beh...@converseincode.com wrote:
> From: Vinícius Tinti 
> 
> Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
> compliant equivalent. This is the original VLAIS struct.
> 
> struct {
>   struct shash_desc shash;
>   char ctx[crypto_shash_descsize(tfm)];
> } desc;
> 
> This patch instead allocates the appropriate amount of memory using an char
> array.
> 
> The new code can be compiled with both gcc and clang.
> 
> struct shash_desc contains a flexible array member member ctx declared with
> CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning
> of the array declared after struct shash_desc with long long.
> 
> No trailing padding is required because it is not a struct type that can
> be used in an array.
> 
> The CRYPTO_MINALIGN_ATTR is required so that desc is aligned with long long
> as would be the case for a struct containing a member with
> CRYPTO_MINALIGN_ATTR.

We copied this from one of the other crypto api users, and I'm sure
cooking up all these patches was not a great way to your afternoon.

But, can I talk you into making a helper macro of some kind for this and
putting it into the crypto api headers?  Honestly this setup seems
really error prone.

-chris

> 
> Signed-off-by: Jan-Simon Möller 
> Signed-off-by: Behan Webster 
> Signed-off-by: Vinícius Tinti 
> Signed-off-by: Mark Charlebois 
> ---
>  fs/btrfs/hash.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/hash.c b/fs/btrfs/hash.c
> index 85889aa..a62743f 100644
> --- a/fs/btrfs/hash.c
> +++ b/fs/btrfs/hash.c
> @@ -33,18 +33,18 @@ void btrfs_hash_exit(void)
>  
>  u32 btrfs_crc32c(u32 crc, const void *address, unsigned int length)
>  {
> - struct {
> - struct shash_desc shash;
> - char ctx[crypto_shash_descsize(tfm)];
> - } desc;
> + char desc[sizeof(struct shash_desc) +
> + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
> + struct shash_desc *shash = (struct shash_desc *)desc;
> + u32 *ctx = (u32 *)shash_desc_ctx(shash);
>   int err;
>  
> - desc.shash.tfm = tfm;
> - desc.shash.flags = 0;
> - *(u32 *)desc.ctx = crc;
> + shash->tfm = tfm;
> + shash->flags = 0;
> + *ctx = crc;
>  
> - err = crypto_shash_update(&desc.shash, address, length);
> + err = crypto_shash_update(shash, address, length);
>   BUG_ON(err);
>  
> - return *(u32 *)desc.ctx;
> + return *ctx;
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix a overflowing boundary writing in csum_tree_block

2014-09-11 Thread Chris Mason
On 09/09/2014 05:19 AM, rongqing...@windriver.com wrote:
> From: Li RongQing 
> 
> It is impossible that csum_size is larger than sizeof(long), but the codes
> still add the handler for this condition, like allocate new memory, for
> extension. If it becomes true someday, copying csum_size size memory to local
> 32bit variable found and val will overflow these two variables.
> 
> Fix it by returning the max 4 byte checksum.

Thanks for the patch.  I'd rather not silently truncate the copy down
though.  How about a one time check at mount to make sure the value in
the super is reasonable?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mkdir and fsync

2014-09-11 Thread Chris Mason
On 09/10/2014 04:55 PM, Samer Al-Kiswany wrote:
> Hi,
> 
> Thank you for help.
> 
> I am seeing a strange behavior when fsync()ing a directory.
> 
> Here is what I do
> 
> for (i=0; i < 100,000, i++){
>   .
>   mkdir(p/child_i)
>   fsync(p)
> }
> 
> Btrfs seems to achieve around 100k fsycs/second, which makes me believe it
> is not touching the disk during these fsyncs.
> After looking at the code, it seems indeed that fsync adds the inode to the
> current transaction but does not sync the transaction to disk.
> 
> Is this the intended behavior for metadata fsync or is this a bug?
> Is this POSIX compliant?

Which kernel and hardware?  We had some dir fsync handling bugs in the
past which may have been related.

I just did a test here, and we're definitely doing the IO.  Christoph is
right about the requirements for fsync being sloppy.  For btrfs, we do
put directory changes into the log during an fsync, but we may end up
logging only what you fsync.

So this will get child_i:

mkdir(p/child_i)
fsync(p)

This will not:

mkdir(p/child_i)
fsync(some_other_directory_that_isn't_p)

(This is different from ext34)

-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: No space on empty, degraded raid10

2014-09-11 Thread Hugo Mills
On Thu, Sep 11, 2014 at 08:06:21AM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 07:38, Hugo Mills wrote:
> > On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
> >> On 2014-09-11 02:40, Russell Coker wrote:
> >>> Also it would be nice if there was a N-way mirror option for system data. 
> >>>  As 
> >>> such data is tiny (32MB on the 120G filesystem in my workstation) the 
> >>> space 
> >>> used by having a copy on every disk in the array shouldn't matter.
> >>
> >> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
> >> is ready, mkfs should default to one copy per disk in the filesystem.
> > 
> >Why change the default from 2-copies, which it's been for years?
> 
> Sorry about the ambiguity in my statement, I meant that the default for
> system chunks should be one copy per disk in the filesystem.  If you
> don't have a copy of the system chunks, then you essentially don't have
> a filesystem, and that means that BTRFS RAID6 can't provide true
> resilience against 2 disks failing catastrophically unless there are at
> least 3 copies of the system chunks.

   Aah, OK. That makes perfect sense, then.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Some days, it's just not worth gnawing through the straps ---


signature.asc
Description: Digital signature


Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 07:38, Hugo Mills wrote:
> On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
>> On 2014-09-11 02:40, Russell Coker wrote:
>>> Also it would be nice if there was a N-way mirror option for system data.  
>>> As 
>>> such data is tiny (32MB on the 120G filesystem in my workstation) the space 
>>> used by having a copy on every disk in the array shouldn't matter.
>>
>> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
>> is ready, mkfs should default to one copy per disk in the filesystem.
> 
>Why change the default from 2-copies, which it's been for years?

Sorry about the ambiguity in my statement, I meant that the default for
system chunks should be one copy per disk in the filesystem.  If you
don't have a copy of the system chunks, then you essentially don't have
a filesystem, and that means that BTRFS RAID6 can't provide true
resilience against 2 disks failing catastrophically unless there are at
least 3 copies of the system chunks.



smime.p7s
Description: S/MIME Cryptographic Signature


Re: No space on empty, degraded raid10

2014-09-11 Thread Hugo Mills
On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 02:40, Russell Coker wrote:
> > Also it would be nice if there was a N-way mirror option for system data.  
> > As 
> > such data is tiny (32MB on the 120G filesystem in my workstation) the space 
> > used by having a copy on every disk in the array shouldn't matter.
> 
> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
> is ready, mkfs should default to one copy per disk in the filesystem.

   Why change the default from 2-copies, which it's been for years?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Ceci est un travail pour l'Australien. ---  


signature.asc
Description: Digital signature


Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 02:40, Russell Coker wrote:
> On Mon, 8 Sep 2014, Austin S Hemmelgarn  wrote:
>> Also, I've found out the hard way that system chunks really should be
>> RAID1, NOT RAID10, otherwise it's very likely that the filesystem
>> won't mount at all if you lose 2 disks.
> 
> Why would that be different?
> 
> In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be 
> different?
That's still the case, but in a RAID1 with four disks, of the six
different pairs of two disks you could lose, only one will make the
filesystem un-mountable, whereas for a four disk RAID10, there are two
different pairs of two disks you could lose to make the filesystem
un-mountable.  In haven't run the numbers for higher numbers of disks,
but things are likely not better, because if you lose both copies of the
same stripe, things will fail.
> 
> Also it would be nice if there was a N-way mirror option for system data.  As 
> such data is tiny (32MB on the 120G filesystem in my workstation) the space 
> used by having a copy on every disk in the array shouldn't matter.
> 
N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
is ready, mkfs should default to one copy per disk in the filesystem.



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Duncan
Bob Williams posted on Thu, 11 Sep 2014 10:56:14 +0100 as excerpted:

> So if a RAID1/two disk system uses the disks symmetrically, why did my
> balance command take 22 hours? That's what puzzles me, as my
> understanding of RAID1 is that the disk use *is* symmetrical.

What you're missing is what balance actually /does/.

Balance will take every chunk it sees, data or metadata (with metadata 
including system as well), and rewrite it to a new location.  In simplest 
conception that's /all/ it does.

So your 22 hours was the time it took to rewrite-shift, effectively, the 
entire filesystem, one chunk at a time, from one location to another.

Now it so happens that in the process balance does a bunch of other stuff 
too, like combine partially empty blocks of the same type during the 
rewrite, filling them up such that the rewritten version likely takes 
fewer chunks than the original, thus having the effect of freeing the 
extra chunks back to unallocated space that's now again free to be used 
for data or metadata instead of tied up in chunks that are one or the 
other but can't be switched.

And after adding/deleting devices, that rewrite process balances out 
usage between devices.

And with the convert option (used with -d or -m, below) that rewrite can 
be used to convert the rewritten chunks to some other raid layout than 
the original.

And with the -d and -m options (along with -s), you can limit the chunks 
balance looks at to data or metadata (the latter including system as 
well, -s) instead of all chunks.

And with the usage option (along with -d or -m, above), you can limit the 
chunks looked at to those under a particular percentage fill, thus 
allowing to do the chunk consolidation more efficiently without taking 
time to do ALL chunks of that type as it would otherwise do.

But bottom line, balance is a chunk rewriter, and you told it to rewrite 
everything on the filesystem, so that's exactly what it did.  And with 
nearly a TB of data on spinning rust, that took awhile, about 22 hours 
"awhile"!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Bob Williams
On 10/09/14 19:43, Goffredo Baroncelli wrote:
> On 09/10/2014 02:27 PM, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
>> data and metadata. Last night I started
>>
>> # btrfs filesystem balance 
> 
> 
> May be that I am missing something obvious, however I have to ask which 
> would be the purpose to balance a two disks RAID1 system.
> The balance command should move the data between the disks in order to
> avoid some disk full and other empty; but this assume that there is a
> not symmetrical uses of the disks. Which is not the case for a RAID1/two
> disks system.
> 
> If the disk were more than two the situation would be completely different.
> But Bob reports that the system is compose by two disks only.
> 
>>
>> and it is still running 18 hours later. This suggests that most stuff
>> only gets written to one physical device, which in turn suggests that
>> there is a risk of lost data if one physical device fails. Or is
>> there something clever about btrfs raid that I've missed? I've used
>> linux software raid (mdraid) before, and it appeared to write to both
>> devices simultaneously.
>>

So if a RAID1/two disk system uses the disks symmetrically, why did my
balance command take 22 hours? That's what puzzles me, as my
understanding of RAID1 is that the disk use *is* symmetrical.

Bob




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add missing compression property remove in btrfs_ioctl_setflags

2014-09-11 Thread Filipe David Manana
On Thu, Sep 11, 2014 at 5:41 AM, Satoru Takeuchi
 wrote:
> Hi Filipe,
>
> (2014/09/11 0:10), Filipe Manana wrote:
>> The behaviour of a 'chattr -c' consists of getting the current flags,
>> clearing the FS_COMPR_FL bit and then sending the result to the set
>> flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
>> passed to the ioctl. This results in the compression property not being
>> cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
>> was set in the received flags.
>>
>> Reproducer:
>>
>>  $ mkfs.btrfs -f /dev/sdd
>>  $ mount /dev/sdd /mnt && cd /mnt
>>  $ mkdir a
>>  $ chattr +c a
>>  $ touch a/file
>>  $ lsattr a/file
>>  c--- a/file
>>  $ chattr -c a
>>  $ touch a/file2
>>  $ lsattr a/file2
>>  c--- a/file2
>>  $ lsattr -d a
>>   a
>>
>> Reported-by: Andreas Schneider 
>> Signed-off-by: Filipe Manana 
>> ---
>>   fs/btrfs/ioctl.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index a010c44..8e6950c 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -333,6 +333,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
>> __user *arg)
>>
>>   } else {
>>   ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
>> + ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
>
> IMHO, this patch is going on a wrong way since this patch
> changes the meaning of chattr. Here is the correct behavior.
>
> flag | BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS |
> =+==++
> +c   |set   | unset  |
> -c   |   unset  | unset  |
>
> However, by your change, chattr -c results in unset
> BTRFS_INODE_COMPRESS and *set* BTRFS_INODE_NOCOMPRESS.

Right, for the reason mentioned below it's not a big deal, but
nevertheless better to not break the expectations and behaviour from
pre-properties era.

> It's because btrfs_set_prop() will finally lead to
> prop_compression_apply() with setting its value to "".
> It's the behavior of calling ioctl() with FS_NOCOMP_FL.
>
> Please note that inode flag without both BTRFS_INODE_COMPRESS
> nor BTRFS_INODE_NOCOMPRESS has its own meaning: whether file
> is compress or not is decided by "compress" mount option.

Right, which will set NOCOMPRESS if compression of the first write is
worthless (compressed size >= uncompressed size).
The fs has the right of setting NOCOMPRESS/FS_NOCOMP_FL if it wishes
to (due to lack of any specification that forbids it).

>
> My suggestion is as follows and I'll write a patch based
> on it soon.
>
>  - Change the meaning of btrfs.compression == "" to mean
>unset both BTRFS_INODE_COMPRESS and BTRFS_INODE_NOCOMPRESS,
>for chattr -c.

Do that and you're breaking existing semantics - existing apps or
users might already depend on the current semantics/apis (i.e. not a
good idea).
However you can add a new value that implements those semantics.

>  - Add new value of "btrfs.compression" property, for example
>"no" or "off", to mean unset BTRFS_INODE_COMPRESS and set
>BTRFS_INODE_NOCOMPRESS. It's for ioctl() with FS_NOCOMP_FL.
>  - Unify duplicated code of changing inode flags to props.c.
>
> Thanks,
> Satoru
>
>> + if (ret && ret != -ENODATA)
>> + goto out_drop;
>>   }
>>
>>   trans = btrfs_start_transaction(root, 1);
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] xfstests: btrfs: add test regarding clearing compression flag/property

2014-09-11 Thread Filipe Manana
Regression test for btrfs where removing the flag FS_COMPR_FL
(chattr -c) from an inode wouldn't clear its compression property.
This was fixed in the following linux kernel patch:

  Btrfs: add missing compression property remove in btrfs_ioctl_setflags

Signed-off-by: Filipe Manana 
---
 tests/btrfs/059 | 85 +
 tests/btrfs/059.out | 11 +++
 tests/btrfs/group   |  1 +
 3 files changed, 97 insertions(+)
 create mode 100755 tests/btrfs/059
 create mode 100644 tests/btrfs/059.out

diff --git a/tests/btrfs/059 b/tests/btrfs/059
new file mode 100755
index 000..3379ead
--- /dev/null
+++ b/tests/btrfs/059
@@ -0,0 +1,85 @@
+#! /bin/bash
+# FS QA Test No. btrfs/059
+#
+# Regression test for btrfs where removing the flag FS_COMPR_FL (chattr -c)
+# from an inode wouldn't clear its compression property.
+# This was fixed in the following linux kernel patch:
+#
+# Btrfs: add missing compression property remove in btrfs_ioctl_setflags
+#
+#---
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_btrfs "property"
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/testdir
+echo "Setting compression flag in the directory..."
+chattr +c $SCRATCH_MNT/testdir
+echo "Directory compression property value:"
+$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir compression
+
+touch $SCRATCH_MNT/testdir/file1
+echo "file1 compression property value:"
+$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
+
+echo "Clearing compression flag from directory..."
+chattr -c $SCRATCH_MNT/testdir
+echo "Directory compression property value:"
+$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir compression
+
+touch $SCRATCH_MNT/testdir/file2
+echo "file2 compression property value:"
+$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file2 compression
+
+touch $SCRATCH_MNT/testdir/file1
+echo "file1 compression property value:"
+$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
+
+status=0
+exit
diff --git a/tests/btrfs/059.out b/tests/btrfs/059.out
new file mode 100644
index 000..9ec9a53
--- /dev/null
+++ b/tests/btrfs/059.out
@@ -0,0 +1,11 @@
+QA output created by 059
+Setting compression flag in the directory...
+Directory compression property value:
+compression=zlib
+file1 compression property value:
+compression=zlib
+Clearing compression flag from directory...
+Directory compression property value:
+file2 compression property value:
+file1 compression property value:
+compression=zlib
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 3fa9778..68b5c79 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -61,3 +61,4 @@
 056 auto quick
 057 auto quick
 058 auto quick
+059 auto quick
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: add missing compression property remove in btrfs_ioctl_setflags

2014-09-11 Thread Filipe Manana
The behaviour of a 'chattr -c' consists of getting the current flags,
clearing the FS_COMPR_FL bit and then sending the result to the set
flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
passed to the ioctl. This results in the compression property not being
cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
was set in the received flags.

Reproducer:

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt && cd /mnt
$ mkdir a
$ chattr +c a
$ touch a/file
$ lsattr a/file
c--- a/file
$ chattr -c a
$ touch a/file2
$ lsattr a/file2
c--- a/file2
$ lsattr -d a
 a

Reported-by: Andreas Schneider 
Signed-off-by: Filipe Manana 
---

V2: Ensure BTRFS_INODE_NOCOMPRESS isn't set (unless the bit FS_NOCOMP_FL is 
set).

 fs/btrfs/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a010c44..a46c169 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -332,6 +332,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
goto out_drop;
 
} else {
+   ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
+   if (ret && ret != -ENODATA)
+   goto out_drop;
ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
}
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Bob Williams
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/09/14 05:30, Zygo Blaxell wrote:
> On Wed, Sep 10, 2014 at 01:27:36PM +0100, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring
>> both data and metadata. Last night I started
>> 
>> # btrfs filesystem balance 
>> 
[...]
> 
>> As a rough guide, how often should one perform
>> 
>> a) balance
> 
> I have a cron job that runs 'btrfs balance resume' or 'btrfs
> balance start' (depending on whether a balance is already in
> progress) nightly at 1AM.  Another cron job comes along at 6AM to
> run 'btrfs balance pause' on my headless servers.  On my desktops
> and laptops I have a daemon that detects keyboard/mouse input and
> does 'btrfs balance pause' when some is detected (the balance
> remains paused until the next day at 1AM, as it is really heavy and
> takes a long time to come to a stop).
> 
[...]

Many thanks to everyone who has contributed to this thread. I have
learnt a lot, and now have weekly cronjobs to balance and scrub.

Bob
- -- 
Bob Williams
System:  Linux 3.11.10-21-desktop
Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform: 4.14.0
Uptime:  06:00am up 6 days 15:04, 4 users, load average: 0.02, 0.02, 0.05
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.22 (GNU/Linux)

iEYEARECAAYFAlQRZp4ACgkQ0Sr7eZJrmU7YcACgpMcD4w3J8IV8m4MpYSG8jl1/
kLEAoITiw3EcpvnELw2KlRaa3GlYEWUV
=ICmv
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html