Re: [PATCH] btrfs: properly track when rescan worker is running

2016-08-15 Thread Qu Wenruo



At 08/16/2016 12:10 AM, Jeff Mahoney wrote:

The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused.  If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.

This patch uses a separate flag to indicate when the worker is
running.  The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.

Cc:  # v4.4+
Signed-off-by: Jeff Mahoney 


Reviewed-by: Qu Wenruo 

Looks good to me.

Would you mind submitting a test case for it?

Thanks,
Qu
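
A rough reproducer sketch, based only on the description above (device, mount
point, and the fill step are placeholders; the fill just needs to make the
rescan run long enough to interrupt):

  mkfs.btrfs -f /dev/sdX
  mount /dev/sdX /mnt
  # ... create enough files/subvolumes that a quota rescan takes a while ...
  btrfs quota enable /mnt
  btrfs quota rescan /mnt
  umount /mnt                  # rescan is paused, the on-disk RESCAN flag stays set
  mount -o ro /dev/sdX /mnt    # read-only mount: the rescan is not resumed
  umount /mnt                  # without the fix, this hangs in btrfs_qgroup_wait_for_completion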

---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/qgroup.c  |9 -
 3 files changed, 10 insertions(+), 1 deletion(-)

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1771,6 +1771,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *qgroup_rescan_workers;
struct completion qgroup_rescan_completion;
struct btrfs_work qgroup_rescan_work;
+   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */

/* filesystem state */
unsigned long fs_state;
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2275,6 +2275,7 @@ static void btrfs_init_qgroup(struct btr
fs_info->quota_enabled = 0;
fs_info->pending_quota_state = 0;
fs_info->qgroup_ulist = NULL;
+   fs_info->qgroup_rescan_running = false;
mutex_init(&fs_info->qgroup_rescan_lock);
 }

--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2302,6 +2302,10 @@ static void btrfs_qgroup_rescan_worker(s
int err = -ENOMEM;
int ret = 0;

+   mutex_lock(&fs_info->qgroup_rescan_lock);
+   fs_info->qgroup_rescan_running = true;
+   mutex_unlock(&fs_info->qgroup_rescan_lock);
+
path = btrfs_alloc_path();
if (!path)
goto out;
@@ -2368,6 +2372,9 @@ out:
}

 done:
+   mutex_lock(&fs_info->qgroup_rescan_lock);
+   fs_info->qgroup_rescan_running = false;
+   mutex_unlock(&fs_info->qgroup_rescan_lock);
complete_all(&fs_info->qgroup_rescan_completion);
 }

@@ -2494,7 +2501,7 @@ int btrfs_qgroup_wait_for_completion(str

mutex_lock(&fs_info->qgroup_rescan_lock);
spin_lock(&fs_info->qgroup_lock);
-   running = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN;
+   running = fs_info->qgroup_rescan_running;
spin_unlock(&fs_info->qgroup_lock);
mutex_unlock(&fs_info->qgroup_rescan_lock);







Re: About minimal device number for RAID5/6

2016-08-15 Thread Qu Wenruo



At 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:

On 2016-08-15 10:08, Anand Jain wrote:




IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.


Based on looking at the code, we do in fact support 2/3 devices for
raid5/6 respectively.

Personally, I agree that we should warn when trying to do this, but I
absolutely don't think we should stop it from happening.



 How does 2 disks RAID5 work ?

One disk is your data, the other is your parity.  In essence, it works
like a really computationally expensive version of RAID1 with 2 disks,
which is why it's considered a degenerate configuration.


I totally agree with the fact that 2 disk raid5 is just a slow raid1.


 Three disks in
RAID6 is similar, but has a slight advantage at the moment in BTRFS
because it's the only way to configure three disks so you can lose two
and not lose any data as we have no support for higher order replication
than 2 copies yet.


It's true that btrfs doesn't support any other RAID level that can 
provide two parities.


But using a 3-disk raid6 just to gain the ability to lose two disks 
seems more like a trick than a normal use case.


Either in the mkfs man page or as a warning at mkfs time (while still 
allowing it), IMHO it's better to tell the user "yes, you can do it, 
but it's not really a good idea".


Thanks,
Qu




Re: btrfs quota issues

2016-08-15 Thread Qu Wenruo



At 08/16/2016 03:11 AM, Rakesh Sankeshi wrote:

yes, subvol level.

qgroupid       rfer       excl   max_rfer  max_excl  parent  child
--------       ----       ----   --------  --------  ------  -----
0/5        16.00KiB   16.00KiB       none      none     ---    ---
0/258     119.48GiB  119.48GiB  200.00GiB      none     ---    ---
0/259      92.57GiB   92.57GiB  200.00GiB      none     ---    ---


although I have 200GB limit on 2 subvols, running into issue at about
120 and 92GB itself


1) About workload
Would you mind describing the write pattern of your workload?

Just dd data with LZO compression?
For the compression part, it's a little complicated, as the reserved 
data size and the on-disk extent size are different.


It's possible that somewhere in the code we leaked some reserved data space.


2) Behavior after EDQUOT
After EDQUOT happens, can you still write data into the subvolume?
If you can still write a lot of data (at least several gigabytes), it's 
probably something related to temporarily reserved space.


If not, and you can't even remove any file due to EDQUOT, then it's 
almost certain we have underflowed the reserved data.

In that case, unmounting and mounting again will be the only workaround
(in fact, not much of a workaround at all).

3) Behavior without compression

If it's OK for you, would you mind testing it without compression?
Currently we mostly rely on the assumption that the on-disk extent size 
is the same as the in-memory extent size (no compression).


So qgroup + compression hasn't been exercised much before and may well be buggy.

If qgroup works sanely without compression, at least we can be sure 
that the cause is the combination of qgroup and compression.
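
A minimal way to retest without compression, assuming the mount point and device
from the report below, and that compression was only enabled via the compress=lzo
mount option:

  umount /test_lzo
  mount /dev/xvdc /test_lzo            # remounted without compress=lzo
  # rerun the same write workload, then re-check the counters:
  btrfs qgroup show -prce /test_lzo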


Thanks,
Qu




On Sun, Aug 14, 2016 at 7:11 PM, Qu Wenruo  wrote:



At 08/12/2016 01:32 AM, Rakesh Sankeshi wrote:


I set 200GB limit to one user and 100GB to another user.

as soon as I reached 139GB and 53GB each, hitting the quota errors.
anyway to workaround quota functionality on btrfs LZO compressed
filesystem?



Please paste "btrfs qgroup show -prce " output if you are using btrfs
qgroup/quota function.

And, AFAIK btrfs qgroup is applied to subvolume, not user.

So did you mean limit it to one subvolume belongs to one user?

Thanks,
Qu




4.7.0-040700-generic #201608021801 SMP

btrfs-progs v4.7


Label: none  uuid: 66a78faf-2052-4864-8a52-c5aec7a56ab8

Total devices 2 FS bytes used 150.62GiB

devid    1 size 1.00TiB used 78.01GiB path /dev/xvdc

devid    2 size 1.00TiB used 78.01GiB path /dev/xvde


Data, RAID0: total=150.00GiB, used=149.12GiB

System, RAID1: total=8.00MiB, used=16.00KiB

Metadata, RAID1: total=3.00GiB, used=1.49GiB

GlobalReserve, single: total=512.00MiB, used=0.00B


Filesystem  Size  Used Avail Use% Mounted on

/dev/xvdc   2.0T  153G  1.9T   8% /test_lzo













Re: BTRFS constantly reports "No space left on device" even with a huge unallocated space

2016-08-15 Thread Chris Murphy
On Mon, Aug 15, 2016 at 5:12 PM, Ronan Chagas  wrote:
> Hi guys!
>
> It happened again. The computer was completely unusable. The only useful
> message I saw was this one:
>
> http://img.ctrlv.in/img/16/08/16/57b24b0bb2243.jpg
>
> Does it help?
>
> I decided to format and reinstall tomorrow. This is a production machine and
> I have to fix this ASAP.

Looks similar to this:
https://lkml.org/lkml/2016/3/28/230

Can you describe the workload happening at the time?


-- 
Chris Murphy


Re: About minimal device number for RAID5/6

2016-08-15 Thread Henk Slager
On Mon, Aug 15, 2016 at 8:30 PM, Hugo Mills  wrote:
> On Mon, Aug 15, 2016 at 10:32:25PM +0800, Anand Jain wrote:
>>
>>
>> On 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:
>> >On 2016-08-15 10:08, Anand Jain wrote:
>> >>
>> >>
>> IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.
>> 
>> Any comment is welcomed.
>> 
>> >>>Based on looking at the code, we do in fact support 2/3 devices for
>> >>>raid5/6 respectively.
>> >>>
>> >>>Personally, I agree that we should warn when trying to do this, but I
>> >>>absolutely don't think we should stop it from happening.

About a year ago I had a raid5 array in an disk upgrade situation from
5x 2TB to 4x 4TB. As intermediate I had 2x 2TB + 2x 4TB situation for
several weeks. The 2x 2TB were getting really full and the fs was
slow. just wondering if an enospc would happen, I started an filewrite
task doing several 100 GB's and it simply did work to my surprise. At
some point, chunks only occupying the 4TB disks must have been
created. I also saw the expected write rate on the 4TB disks. CPU load
was not especially high as far as I remember, like a raid1 fs as far
as I remember.

So it is good that in such a situation, one can still use the fs. I
don't remember how the allocated/free space accounting was, probably
not correct, but I did not fill up the whole fs to see/experience
that.

I have no strong opinion whether we should warn about amount of
devices at mkfs time for raid56. It's just that the other known issues
with raid56 draw more attention.

>> >> How does 2 disks RAID5 work ?
>> >One disk is your data, the other is your parity.
>>
>>
>> >In essence, it works
>> >like a really computationally expensive version of RAID1 with 2 disks,
>> >which is why it's considered a degenerate configuration.
>>
>>How do you generate parity with only one data ?
>
>For plain parity calculations, parity is the value p which solves
> the expression:
>
> x_1 XOR x_2 XOR ... XOR x_n XOR p = 0
>
> for corresponding bits in the n data volumes. With one data volume,
> n=1, and hence p = x_1.
>
>What's the problem? :)
>
>Hugo.
>
>> -Anand
>>
>>
>> > Three disks in
>> >RAID6 is similar, but has a slight advantage at the moment in BTRFS
>> >because it's the only way to configure three disks so you can lose two
>> >and not lose any data as we have no support for higher order replication
>> >than 2 copies yet.
>
> --
> Hugo Mills | I always felt that as a C programmer, I was becoming
> hugo@... carfax.org.uk | typecast.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |


Re: Huge load on btrfs subvolume delete

2016-08-15 Thread Daniel Caillibaud
On 15/08/16 at 10:16, "Austin S. Hemmelgarn"  wrote:
ASH> With respect to databases, you might consider backing them up separately 
ASH> too.  In many cases for something like an SQL database, it's a lot more 
ASH> flexible to have a dump of the database as a backup than it is to have 
ASH> the database files themselves, because it decouples it from the 
ASH> filesystem level layout.

With mysql|mariadb, getting a consistent dump requires locking tables during the
dump, which is not acceptable on production servers.

Even with specialised hot-dump tools, doing the dump on prod servers is too heavy
on I/O (I have huge DBs, and writing the dump is expensive and slow).

I used to have a slave just for the dump (easy to stop the slave, dump, and start
the slave again), but after a while it couldn't keep up with the writes over the
whole day (prod was on SSD and the slave wasn't; the dump HDD was 100% busy all
day long), so for me it's really easier to rsync the raw files once a day to a
cheap host and do the dump there.

(Of course, I need to flush & lock tables during the snapshot, before the rsync,
but that's just one or two seconds, which is still acceptable.)

-- 
Daniel


Re: Extents for a particular subvolume

2016-08-15 Thread Graham Cobb
On 03/08/16 22:55, Graham Cobb wrote:
> On 03/08/16 21:37, Adam Borowski wrote:
>> On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:
>>> Are there any btrfs commands (or APIs) to allow a script to create a
>>> list of all the extents referred to within a particular (mounted)
>>> subvolume?  And is it a reasonably efficient process (i.e. doesn't
>>> involve backrefs and, preferably, doesn't involve following directory
>>> trees)?

In case anyone else is interested in this, I ended up creating some
simple scripts to allow me to do this.  They are slow because they walk
the directory tree and they use filefrag to get the extent data, but
they do let me answer questions like:

* How much space am I wasting by keeping historical snapshots?
* How much data is being shared between two subvolumes?
* How much of the data in my latest snapshot is unique to that snapshot?
* How much data would I actually free up if I removed (just) these
particular subvolumes?

If they are useful to anyone else you can find them at:

https://github.com/GrahamCobb/extents-lists

If anyone knows of more efficient ways to get this information please
let me know. And, of course, feel free to suggest improvements/bugfixes!
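
For anyone who only needs the raw data the scripts are built on, a minimal sketch
(the path is a placeholder; filefrag -v prints the extent list the scripts parse,
and -xdev keeps find from descending into nested subvolumes):

  find /path/to/subvolume -xdev -type f -exec filefrag -v {} + > extents.txt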




Re: btrfs quota issues

2016-08-15 Thread Rakesh Sankeshi
yes, subvol level.

qgroupid       rfer       excl   max_rfer  max_excl  parent  child
--------       ----       ----   --------  --------  ------  -----
0/5        16.00KiB   16.00KiB       none      none     ---    ---
0/258     119.48GiB  119.48GiB  200.00GiB      none     ---    ---
0/259      92.57GiB   92.57GiB  200.00GiB      none     ---    ---


although I have 200GB limit on 2 subvols, running into issue at about
120 and 92GB itself


On Sun, Aug 14, 2016 at 7:11 PM, Qu Wenruo  wrote:
>
>
> At 08/12/2016 01:32 AM, Rakesh Sankeshi wrote:
>>
>> I set 200GB limit to one user and 100GB to another user.
>>
>> as soon as I reached 139GB and 53GB each, hitting the quota errors.
>> anyway to workaround quota functionality on btrfs LZO compressed
>> filesystem?
>>
>
> Please paste "btrfs qgroup show -prce " output if you are using btrfs
> qgroup/quota function.
>
> And, AFAIK btrfs qgroup is applied to subvolume, not user.
>
> So did you mean limit it to one subvolume belongs to one user?
>
> Thanks,
> Qu
>
>>
>>
>> 4.7.0-040700-generic #201608021801 SMP
>>
>> btrfs-progs v4.7
>>
>>
>> Label: none  uuid: 66a78faf-2052-4864-8a52-c5aec7a56ab8
>>
>> Total devices 2 FS bytes used 150.62GiB
>>
>> devid    1 size 1.00TiB used 78.01GiB path /dev/xvdc
>>
>> devid    2 size 1.00TiB used 78.01GiB path /dev/xvde
>>
>>
>> Data, RAID0: total=150.00GiB, used=149.12GiB
>>
>> System, RAID1: total=8.00MiB, used=16.00KiB
>>
>> Metadata, RAID1: total=3.00GiB, used=1.49GiB
>>
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>
>> Filesystem  Size  Used Avail Use% Mounted on
>>
>> /dev/xvdc   2.0T  153G  1.9T   8% /test_lzo
>>
>>
>
>


Re: About minimal device number for RAID5/6

2016-08-15 Thread Hugo Mills
On Mon, Aug 15, 2016 at 10:32:25PM +0800, Anand Jain wrote:
> 
> 
> On 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:
> >On 2016-08-15 10:08, Anand Jain wrote:
> >>
> >>
> IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.
> 
> Any comment is welcomed.
> 
> >>>Based on looking at the code, we do in fact support 2/3 devices for
> >>>raid5/6 respectively.
> >>>
> >>>Personally, I agree that we should warn when trying to do this, but I
> >>>absolutely don't think we should stop it from happening.
> >>
> >>
> >> How does 2 disks RAID5 work ?
> >One disk is your data, the other is your parity.
> 
> 
> >In essence, it works
> >like a really computationally expensive version of RAID1 with 2 disks,
> >which is why it's considered a degenerate configuration.
> 
>How do you generate parity with only one data ?

   For plain parity calculations, parity is the value p which solves
the expression:

x_1 XOR x_2 XOR ... XOR x_n XOR p = 0

for corresponding bits in the n data volumes. With one data volume,
n=1, and hence p = x_1.
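
A quick sanity check of the degenerate case, with arbitrary byte values:

   $ printf '0x%02x\n' $(( 0x12 ^ 0x34 ^ 0x56 ^ 0x78 ))  # parity over a 4-device stripe
   0x08
   $ printf '0x%02x\n' $(( 0xd6 ))                        # n=1: the parity is just a copy of the data
   0xd6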

   What's the problem? :)

   Hugo.

> -Anand
> 
> 
> > Three disks in
> >RAID6 is similar, but has a slight advantage at the moment in BTRFS
> >because it's the only way to configure three disks so you can lose two
> >and not lose any data as we have no support for higher order replication
> >than 2 copies yet.

-- 
Hugo Mills | I always felt that as a C programmer, I was becoming
hugo@... carfax.org.uk | typecast.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [GIT PULL] [PATCH v4 00/26] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-08-15 Thread Greg KH
On Sat, Aug 13, 2016 at 03:48:12PM -0700, Deepa Dinamani wrote:
> The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> macros.
> The macros are not y2038 safe. There is no plan to transition them into being
> y2038 safe.
> ktime_get_* api's can be used in their place. And, these are y2038 safe.

Who are you execting to pull this huge patch series?

Why not just introduce the new api call, wait for that to be merged, and
then push the individual patches through the different subsystems?
After half of those get ignored, then provide a single set of patches
that can go through Andrew or my trees.

thanks,

greg k-h


[PATCH] btrfs: properly track when rescan worker is running

2016-08-15 Thread Jeff Mahoney
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused.  If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.

This patch uses a separate flag to indicate when the worker is
running.  The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.

Cc:  # v4.4+
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/qgroup.c  |9 -
 3 files changed, 10 insertions(+), 1 deletion(-)

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1771,6 +1771,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *qgroup_rescan_workers;
struct completion qgroup_rescan_completion;
struct btrfs_work qgroup_rescan_work;
+   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
 
/* filesystem state */
unsigned long fs_state;
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2275,6 +2275,7 @@ static void btrfs_init_qgroup(struct btr
fs_info->quota_enabled = 0;
fs_info->pending_quota_state = 0;
fs_info->qgroup_ulist = NULL;
+   fs_info->qgroup_rescan_running = false;
mutex_init(&fs_info->qgroup_rescan_lock);
 }
 
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2302,6 +2302,10 @@ static void btrfs_qgroup_rescan_worker(s
int err = -ENOMEM;
int ret = 0;
 
+   mutex_lock(&fs_info->qgroup_rescan_lock);
+   fs_info->qgroup_rescan_running = true;
+   mutex_unlock(&fs_info->qgroup_rescan_lock);
+
path = btrfs_alloc_path();
if (!path)
goto out;
@@ -2368,6 +2372,9 @@ out:
}
 
 done:
+   mutex_lock(&fs_info->qgroup_rescan_lock);
+   fs_info->qgroup_rescan_running = false;
+   mutex_unlock(&fs_info->qgroup_rescan_lock);
complete_all(&fs_info->qgroup_rescan_completion);
 }
 
@@ -2494,7 +2501,7 @@ int btrfs_qgroup_wait_for_completion(str
 
mutex_lock(&fs_info->qgroup_rescan_lock);
spin_lock(&fs_info->qgroup_lock);
-   running = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN;
+   running = fs_info->qgroup_rescan_running;
spin_unlock(&fs_info->qgroup_lock);
mutex_unlock(&fs_info->qgroup_rescan_lock);
 

-- 
Jeff Mahoney
SUSE Labs


Re: About minimal device number for RAID5/6

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 10:32, Anand Jain wrote:



On 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:

On 2016-08-15 10:08, Anand Jain wrote:




IMHO it's better to warn user about 2 devices RAID5 or 3 devices
RAID6.

Any comment is welcomed.


Based on looking at the code, we do in fact support 2/3 devices for
raid5/6 respectively.

Personally, I agree that we should warn when trying to do this, but I
absolutely don't think we should stop it from happening.



 How does 2 disks RAID5 work ?

One disk is your data, the other is your parity.




In essence, it works
like a really computationally expensive version of RAID1 with 2 disks,
which is why it's considered a degenerate configuration.


   How do you generate parity with only one data ?
You treat the data as a stripe of width 1.  That's really all there is 
to it, it's just the same as using 3 or 4 or 5 disks, just with a 
smaller stripe size.


In other systems, 4 is the minimum disk count for RAID5.  I'm not sure 
why they usually disallow 3 disks (it's perfectly legitimate usage, it's 
just almost never seen in practice (largely because nothing supports it 
and erasure coding only makes sense from an economic perspective when 
dealing with lots of data)), but they disallow 2 because it gives no 
benefit over RAID1 with 2 copies and gives worse performance, not 
because the math doesn't work with 2 disks.



Re: About minimal device number for RAID5/6

2016-08-15 Thread Anand Jain



On 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:

On 2016-08-15 10:08, Anand Jain wrote:




IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.


Based on looking at the code, we do in fact support 2/3 devices for
raid5/6 respectively.

Personally, I agree that we should warn when trying to do this, but I
absolutely don't think we should stop it from happening.



 How does 2 disks RAID5 work ?

One disk is your data, the other is your parity.




In essence, it works
like a really computationally expensive version of RAID1 with 2 disks,
which is why it's considered a degenerate configuration.


   How do you generate parity with only one data device?

-Anand



 Three disks in
RAID6 is similar, but has a slight advantage at the moment in BTRFS
because it's the only way to configure three disks so you can lose two
and not lose any data as we have no support for higher order replication
than 2 copies yet.



Re: Huge load on btrfs subvolume delete

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 10:06, Daniel Caillibaud wrote:

On 15/08/16 at 08:32, "Austin S. Hemmelgarn"  wrote:

ASH> On 2016-08-15 06:39, Daniel Caillibaud wrote:
ASH> > I'm newbie with btrfs, and I have pb with high load after each btrfs 
subvolume delete
[…]

ASH> Before I start explaining possible solutions, it helps to explain what's
ASH> actually happening here.
[…]

Thanks a lot for these clear and detailed explanations.

Glad I could help.


ASH> > Is there a better way to do so ?

ASH> While there isn't any way I know of to do so, there are ways you can
ASH> reduce the impact by reducing how much your backing up:

Thanks for these clues too !

I'll use --commit-after, in order to wait for complete deletion before starting 
rsync the next
snapshot, and I keep in mind the benefit of putting /var/log outside the main 
subvolume of the
vm (but I guess my main pb is about databases, because their datadir are the 
ones with most
writes).

With respect to databases, you might consider backing them up separately 
too.  In many cases for something like an SQL database, it's a lot more 
flexible to have a dump of the database as a backup than it is to have 
the database files themselves, because it decouples it from the 
filesystem level layout.  Most good databases should be able to give you 
a stable dump (assuming of course that the application using the 
databases is sanely written) a whole lot faster than you could back up 
the files themselves.  For the couple of databases we use internally 
where I work, we actually back them up separately not only to retain 
this flexibility, but also because we have them on a separate backup 
schedule from the rest of the systems because they change a lot more 
frequently than anything else.
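
For an SQL database such as MySQL/MariaDB, a minimal sketch of such a separate
dump-based backup (path and schedule are placeholders; --single-transaction gives
a consistent dump of InnoDB tables without holding table locks for the duration,
though MyISAM tables would still need locking):

  mysqldump --single-transaction --all-databases | gzip > /backup/mysql-$(date +%F).sql.gz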



Re: About minimal device number for RAID5/6

2016-08-15 Thread Anand Jain



Have a look at this..

http://www.spinics.net/lists/linux-btrfs/msg54779.html

--
RAID5&6 devs_min values are in the context of degraded volume.
RAID1&10.. devs_min values are in the context of healthy volume.

RAID56 is correct. We already have devs_max to know the number
of devices in a healthy volume. RAID1's devs_min is wrong, so
it ended up being the same as devs_max.
--

Any comments?
Also you may use the btrfs-raid-cal simulator tool to verify.
https://github.com/asj/btrfs-raid-cal/blob/master/state-table


Thanks, Anand



On 08/15/2016 03:50 PM, Qu Wenruo wrote:

Hi,

Recently I found that manpage of mkfs is saying minimal device number
for RAID5 and RAID6 is 2 and 3.

Personally speaking, although I understand that RAID5/6 only requires
1/2 devices for parity stripe, it is still quite strange behavior.

Under most case, user use raid5/6 for striping AND parity. For 2 devices
RAID5, it's just a more expensive RAID1.

IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.

Thanks,
Qu





Re: About minimal device number for RAID5/6

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 10:08, Anand Jain wrote:




IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.


Based on looking at the code, we do in fact support 2/3 devices for
raid5/6 respectively.

Personally, I agree that we should warn when trying to do this, but I
absolutely don't think we should stop it from happening.



 How does 2 disks RAID5 work ?
One disk is your data, the other is your parity.  In essence, it works 
like a really computationally expensive version of RAID1 with 2 disks, 
which is why it's considered a degenerate configuration.  Three disks in 
RAID6 is similar, but has a slight advantage at the moment in BTRFS 
because it's the only way to configure three disks so you can lose two 
and not lose any data as we have no support for higher order replication 
than 2 copies yet.



Re: About minimal device number for RAID5/6

2016-08-15 Thread Anand Jain




IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.


Based on looking at the code, we do in fact support 2/3 devices for
raid5/6 respectively.

Personally, I agree that we should warn when trying to do this, but I
absolutely don't think we should stop it from happening.



 How does 2 disks RAID5 work ?

-Anand



Re: Huge load on btrfs subvolume delete

2016-08-15 Thread Daniel Caillibaud
On 15/08/16 at 08:32, "Austin S. Hemmelgarn"  wrote:

ASH> On 2016-08-15 06:39, Daniel Caillibaud wrote:
ASH> > I'm newbie with btrfs, and I have pb with high load after each btrfs 
subvolume delete
[…]

ASH> Before I start explaining possible solutions, it helps to explain what's 
ASH> actually happening here.
[…]

Thanks a lot for these clear and detailed explanations.

ASH> > Is there a better way to do so ?

ASH> While there isn't any way I know of to do so, there are ways you can 
ASH> reduce the impact by reducing how much your backing up:

Thanks for these clues too !

I'll use --commit-after, in order to wait for the complete deletion before
starting to rsync the next snapshot, and I'll keep in mind the benefit of putting
/var/log outside the main subvolume of the VM (but I guess my main problem is the
databases, because their datadirs are the ones with the most writes).

-- 
Daniel


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 09:39, Martin wrote:

That really is the case, there's currently no way to do this with BTRFS.
You have to keep in mind that the raid5/6 code only went into the mainline
kernel a few versions ago, and it's still pretty immature as far as kernel
code goes.  I don't know when (if ever) such a feature might get put in, but
it's definitely something to add to the list of things that would be nice to
have.

For the moment, the only option to achieve something like this is to set up
a bunch of separate 8 device filesystems, but I would be willing to bet that
the way you have it configured right now is closer to what most people would
be doing in a regular deployment, and therefore is probably more valuable
for testing.



I see.

Right now on our +500TB zfs filesystems we used raid6 with a 6 disk
vdev, which is often in the zfs world, and for btrfs I would be the
same when stable/possible.

A while back there was talk of implementing a system where you could 
specify any arbitrary number of replicas, stripes or parity (for 
example, if you had 16 devices, you could tell it to do two copies with 
double parity using full width stripes), and in theory, it would be 
possible there (parity level of 2 with a stripe width of 6 or 8 
depending on how it's implemented), but I don't think it's likely that 
that functionality will exist any time soon.  Implementing such a system 
would pretty much require re-writing most of the allocation code (which 
probably would be a good idea for other reasons now too), and that's not 
likely to happen given the amount of coding that went into the raid5/6 
support.



Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Chris Murphy
On Mon, Aug 15, 2016 at 7:38 AM, Martin  wrote:
>> Looking at the kernel log itself, you've got a ton of write errors on
>> /dev/sdap.  I would suggest checking that particular disk with smartctl, and
>> possibly checking the other hardware involved (the storage controller and
>> cabling).
>>
>> I would kind of expect BTRFS to crash with that many write errors regardless
>> of what profile is being used, but we really should get better about
>> reporting errors to user space in a sane way (making people dig through
>> kernel logs to figure out their having issues like this is not particularly
>> user friendly).
>
> Interesting!
>
> Why does it speak of "device sdq" and /dev/sdap ?
>
> [337411.703937] BTRFS error (device sdq): bdev /dev/sdap errs: wr
> 36973, rd 0, flush 1, corrupt 0, gen 0
> [337411.704658] BTRFS warning (device sdq): lost page write due to IO
> error on /dev/sdap
>
> /dev/sdap doesn't exist.

OK well
journalctl -b | grep -A10 -B10 "sdap"

See in what other context it appears. And also 'btrfs fi show' and see
if it appears associated with this Btrfs volume.

-- 
Chris Murphy


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 09:38, Martin wrote:

Looking at the kernel log itself, you've got a ton of write errors on
/dev/sdap.  I would suggest checking that particular disk with smartctl, and
possibly checking the other hardware involved (the storage controller and
cabling).

I would kind of expect BTRFS to crash with that many write errors regardless
of what profile is being used, but we really should get better about
reporting errors to user space in a sane way (making people dig through
kernel logs to figure out their having issues like this is not particularly
user friendly).


Interesting!

Why does it speak of "device sdq" and /dev/sdap ?

[337411.703937] BTRFS error (device sdq): bdev /dev/sdap errs: wr
36973, rd 0, flush 1, corrupt 0, gen 0
[337411.704658] BTRFS warning (device sdq): lost page write due to IO
error on /dev/sdap

/dev/sdap doesn't exist.

I'm not quite certain, something in the kernel might have been confused, 
but it's hard to be sure.



Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Martin
> That really is the case, there's currently no way to do this with BTRFS.
> You have to keep in mind that the raid5/6 code only went into the mainline
> kernel a few versions ago, and it's still pretty immature as far as kernel
> code goes.  I don't know when (if ever) such a feature might get put in, but
> it's definitely something to add to the list of things that would be nice to
> have.
>
> For the moment, the only option to achieve something like this is to set up
> a bunch of separate 8 device filesystems, but I would be willing to bet that
> the way you have it configured right now is closer to what most people would
> be doing in a regular deployment, and therefore is probably more valuable
> for testing.
>

I see.

Right now on our +500TB zfs filesystems we use raid6 with a 6-disk
vdev, which is common in the zfs world, and for btrfs I would do the
same when it's stable/possible.


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Martin
> Looking at the kernel log itself, you've got a ton of write errors on
> /dev/sdap.  I would suggest checking that particular disk with smartctl, and
> possibly checking the other hardware involved (the storage controller and
> cabling).
>
> I would kind of expect BTRFS to crash with that many write errors regardless
> of what profile is being used, but we really should get better about
> reporting errors to user space in a sane way (making people dig through
> kernel logs to figure out their having issues like this is not particularly
> user friendly).

Interesting!

Why does it speak of "device sdq" and /dev/sdap ?

[337411.703937] BTRFS error (device sdq): bdev /dev/sdap errs: wr
36973, rd 0, flush 1, corrupt 0, gen 0
[337411.704658] BTRFS warning (device sdq): lost page write due to IO
error on /dev/sdap

/dev/sdap doesn't exist.


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Chris Murphy
On Mon, Aug 15, 2016 at 6:19 AM, Martin  wrote:

>
> I have now had the first crash, can you take a look if I have provided
> the needed info?
>
> https://bugzilla.kernel.org/show_bug.cgi?id=153141

[337406.626175] BTRFS warning (device sdq): lost page write due to IO
error on /dev/sdap

Anytime there's I/O related errors that you'd need to go back farther
in the log to find out what really happened. You can play around with
'journalctl --since' for this. It'll accept things like -1m or -2h for
"back one minute or back two hours" or also "today" "yesterday" or by
explicit date and time.



-- 
Chris Murphy


Re: [PATCH v4 10/26] fs: btrfs: Use ktime_get_real_ts for root ctime

2016-08-15 Thread David Sterba
On Sat, Aug 13, 2016 at 03:48:22PM -0700, Deepa Dinamani wrote:
> btrfs_root_item maintains the ctime for root updates.
> This is not part of vfs_inode.
> 
> Since current_time() uses struct inode* as an argument
> as Linus suggested, this cannot be used to update root
> times unless, we modify the signature to use inode.
> 
> Since btrfs uses nanosecond time granularity, it can also
> use ktime_get_real_ts directly to obtain timestamp for
> the root. It is necessary to use the timespec time api
> here because the same btrfs_set_stack_timespec_*() apis
> are used for vfs inode times as well. These can be
> transitioned to using timespec64 when btrfs internally
> changes to use timespec64 as well.
> 
> Signed-off-by: Deepa Dinamani 
> Acked-by: David Sterba 
> Reviewed-by: Arnd Bergmann 
> Cc: Chris Mason 
> Cc: David Sterba 

Acked-by: David Sterba 


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 08:19, Martin wrote:

I'm not sure what Arch does any differently to their kernels from
kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
Fedora drop down for identifying the kernel source tree.


IIRC, they're pretty close to mainline kernels.  I don't think they have any
patches in the filesystem or block layer code at least, but I may be wrong,
it's been a long time since I looked at an Arch kernel.


Perhaps I should use Arch then, as Fedora rawhide kernel wouldn't boot
on my hw, so I am running the stock Fedora 24 kernel right now for the
tests...


If I want to compile a mainline kernel. Are there anything I need to
tune?



Fedora kernels do not have these options set.

# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set

The sanity and integrity tests are both compile time and mount time
options, i.e. it has to be compiled enabled for the mount option to do
anything. I can't recall any thread where a developer asked a user to
set any of these options for testing though.



FWIW, I actually have the integrity checking code built in on most kernels I
build.  I don't often use it, but it has near zero overhead when not
enabled, and it's helped me track down lower-level storage configuration
issues on occasion.


I'll give that a shot tomorrow.


When I do the tests, how do I log the info you would like to see, if I
find a bug?



bugzilla.kernel.org for tracking, and then reference the URL for the
bug with a summary in an email to list is how I usually do it. The
main thing is going to be the exact reproduce steps. It's also better,
I think, to have complete dmesg (or journalctl -k) attached to the bug
report because not all problems are directly related to Btrfs, they
can have contributing factors elsewhere. And various MTAs, or more
commonly MUAs, have a tendancy to wrap such wide text as found in
kernel or journald messages.


Aside from kernel messages, the other general stuff you want to have is:
1. Kernel version and userspace tools version (`uname -a` and `btrfs
--version`)
2. Any underlying storage configuration if it's not just plain a SSD/HDD or
partitions (for example, usage of dm-crypt, LVM, mdadm, and similar things).
3. Output from `btrfs filesystem show` (this can be trimmed to the
filesystem that's having the issue).
4. If you can still mount the filesystem, `btrfs filesystem df` output can
be helpful.
5. If you can't mount the filesystem, output from `btrfs check` run without
any options will usually be asked for.


I have now had the first crash, can you take a look if I have provided
the needed info?

https://bugzilla.kernel.org/show_bug.cgi?id=153141

How long should I keep the host untouched? Or is all interesting idea provided?

Looking at the kernel log itself, you've got a ton of write errors on 
/dev/sdap.  I would suggest checking that particular disk with smartctl, 
and possibly checking the other hardware involved (the storage 
controller and cabling).
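
A quick way to check the suspect disk with smartmontools, assuming the device
node from the log is actually valid on your system:

  smartctl -a /dev/sdap        # overall health, error counters, reallocated sectors
  smartctl -t long /dev/sdap   # start a long self-test; check the result later with -a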


I would kind of expect BTRFS to crash with that many write errors 
regardless of what profile is being used, but we really should get 
better about reporting errors to user space in a sane way (making people 
dig through kernel logs to figure out they're having issues like this is 
not particularly user friendly).



Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 08:19, Martin wrote:

The smallest disk of the 122 is 500GB. Is it possible to have btrfs
see each disk as only e.g. 10GB? That way I can corrupt and resilver
more disks over a month.


Well, at least you can easily partition the devices for that to happen.


Can it be done with btrfs or should I do it with gdisk?
With gdisk.  BTRFS includes some volume management features, but it 
doesn't handle partitioning itself.
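
A minimal sketch with sgdisk (gdisk's scriptable sibling); the device globs and
the 10GiB size are placeholders, so make sure they only match the test disks:

  for dev in /dev/sd[b-z] /dev/sda[a-z]; do
      sgdisk --zap-all "$dev"          # wipe any old partition table
      sgdisk -n 1:0:+10G "$dev"        # one 10GiB partition per disk
  done
  mkfs.btrfs -f -d raid6 -m raid6 /dev/sd[b-z]1 /dev/sda[a-z]1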



However, I would also suggest that would it be more useful use of the
resource to run many arrays in parallel? Ie. one 6-device raid6, one
20-device raid6, and then perhaps use the rest of the devices for a very
large btrfs filesystem? Or if you have been using partitioning the large
btrfs volume can also be composed of all the 122 devices; in fact you
could even run multiple 122-device raid6s and use different kind of
testing on each. For performance testing you might only excert one of
the file systems at a time, though.


Very interesting idea, which leads me to the following question:

For the past weeks have I had all 122 disks in one raid6 filesystem,
and since I didn't entered any vdev (zfs term) size, I suspect only 2
of the 122 disks are parity.

If, how can I make the filesystem, so for every 6 disks, 2 of them are parity?

Reading the mkfs.btrfs man page gives me the impression that it can't
be done, which I find hard to believe.
That really is the case, there's currently no way to do this with BTRFS. 
 You have to keep in mind that the raid5/6 code only went into the 
mainline kernel a few versions ago, and it's still pretty immature as 
far as kernel code goes.  I don't know when (if ever) such a feature 
might get put in, but it's definitely something to add to the list of 
things that would be nice to have.


For the moment, the only option to achieve something like this is to set 
up a bunch of separate 8 device filesystems, but I would be willing to 
bet that the way you have it configured right now is closer to what most 
people would be doing in a regular deployment, and therefore is probably 
more valuable for testing.




Re: Huge load on btrfs subvolume delete

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 06:39, Daniel Caillibaud wrote:

Hi,

I'm newbie with btrfs, and I have pb with high load after each btrfs subvolume 
delete

I use snapshots on lxc hosts under debian jessie with
- kernel 4.6.0-0.bpo.1-amd64
- btrfs-progs 4.6.1-1~bpo8

For backup, I have each day, for each subvolume

btrfs subvolume snapshot -r $subvol $snap
# then later
ionice -c3 btrfs subvolume delete $snap

but ionice doesn't seems to have any effect here and after a few minutes the 
load grows up
quite high (30~40), and I don't know how to make this deletion nicer with I/O
Before I start explaining possible solutions, it helps to explain what's 
actually happening here.  When you create a snapshot, BTRFS just scans 
down the tree for the subvolume in question and creates new references 
to everything in that subvolume in a separate tree.  This is usually 
insanely fast because all that needs to be done is updating metadata. 
When you delete a snapshot however, it has to remove any remaining 
references within the snapshot to the parent subvolume, and also has to 
process any changed data that is now different from the parent subvolume 
for deletion just like it would for deleting a file.  As a result of 
this, the work to create a snapshot only depends on the complexity of 
the directory structure within the subvolume, while the work to delete 
it depends on both that and how much the snapshot has changed from the 
parent subvolume.


The spike in load your seeing is the filesystem handling all that 
internal accounting in the background, and I'd be willing to bet that it 
varies based on how fast things are changing in the parent subvolume. 
Setting idle I/O scheduling priority on the command to delete the 
snapshot does nothing because all that command does is tell the kernel 
to delete the snapshot, the actual deletion is handled in the filesystem 
driver.  While it won't help with the spike in load, you probably want 
to add `--commit-after` to that subvolume deletion command.  That will 
cause the spike to happen almost immediately, and the command won't 
return until the filesystem is finished with the accounting and thus the 
load should be back to normal when it returns.
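
Concretely, the backup sequence quoted above would become something like (paths
as in the original script):

  btrfs subvolume snapshot -r $subvol $snap
  # ... back up from $snap ...
  btrfs subvolume delete --commit-after $snap   # per the above, returns once the accounting is done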


Is there a better way to do so ?
While there isn't any way I know of to do so, there are ways you can 
reduce the impact by reducing how much your backing up:
1. You almost certainly don't need to back up the logs, and if you do, 
they should probably be backed up independently from the rest of the 
system image.  In most cases, logs just add extra size to a backup, and 
have little value when you restore a backup, so it makes little sense in 
most cases to include them in a backup.  The simplest way to exclude 
them in your case is to make /var/log in the LXC containers be a 
separate subvolume.  This will exclude it from the snapshot for the 
backup, which will both speed up the backup, and reduce the amount of 
changes from the parent that occur while creating the backup.
2. Assuming you're using a distribution compliant with the filesystem 
hierarchy standard, there are a couple of directories you can safely 
exclude from all backups simply because portable programs are designed 
to handle losing data from these directories gracefully.  Such 
directories include /tmp, /var/tmp, and /var/cache, and they can be 
excluded the same way as /var/log.
3. Similar arguments apply to $HOME/.cache, which is essentially a 
per-user /var/cache.  This is less likely to have an impact if you don't 
have individual users doing things on these systems.
4. Look for other similar areas you may be able to safely exclude.  For 
example, I use Gentoo, and I build all my packages with external 
debugging symbols which get stored in /usr/lib/debug.  I only have this 
set up for convenience, so there's no point in me backing it up because 
I can just rebuild the package to regenerate the debugging symbols if I 
need them after restoring from a backup.  Similarly, I also exclude any 
VCS repositories that I have copies of elsewhere, simply because I can 
just clone that copy if I need it.


Is it a bad idea to set ionice -c3 on the btrfs-transacti process which seems 
the one doing a
lot of I/O ?
Yes, it's always a bad idea to mess with any scheduling properties other 
than CPU affinity for kernel threads (and even messing with CPU affinity 
is usually a bad idea too).  The btrfs-transaction kthread (the name 
gets cut off by the length limits built into the kernel) is a 
particularly bad one to mess with, because it handles committing updates 
to the filesystem.  Setting an idle scheduling priority on it would 
probably put you at severe risk of data loss or cause your system to 
lock up.


Actually my io priority on btrfs process are

ps x|awk '/[b]trfs/ {printf("%20s ", $NF); system("ionice -p" $1)}'
  [btrfs-worker] none: prio 4
   [btrfs-worker-hi] none: prio 4
[btrfs-delalloc] none: prio 4
   [btrfs-flush_del] none: prio 4
   [btrfs-cache] none: 

Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Martin
>> The smallest disk of the 122 is 500GB. Is it possible to have btrfs
>> see each disk as only e.g. 10GB? That way I can corrupt and resilver
>> more disks over a month.
>
> Well, at least you can easily partition the devices for that to happen.

Can it be done with btrfs or should I do it with gdisk?

> However, I would also suggest that would it be more useful use of the
> resource to run many arrays in parallel? Ie. one 6-device raid6, one
> 20-device raid6, and then perhaps use the rest of the devices for a very
> large btrfs filesystem? Or if you have been using partitioning the large
> btrfs volume can also be composed of all the 122 devices; in fact you
> could even run multiple 122-device raid6s and use different kind of
> testing on each. For performance testing you might only excert one of
> the file systems at a time, though.

Very interesting idea, which leads me to the following question:

For the past weeks I have had all 122 disks in one raid6 filesystem,
and since I didn't enter any vdev (zfs term) size, I suspect only 2
of the 122 disks are parity.

If so, how can I make the filesystem so that for every 6 disks, 2 of them are parity?

Reading the mkfs.btrfs man page gives me the impression that it can't
be done, which I find hard to believe.


Re: How to stress test raid6 on 122 disk array

2016-08-15 Thread Martin
>> I'm not sure what Arch does any differently to their kernels from
>> kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
>> Fedora drop down for identifying the kernel source tree.
>
> IIRC, they're pretty close to mainline kernels.  I don't think they have any
> patches in the filesystem or block layer code at least, but I may be wrong,
> it's been a long time since I looked at an Arch kernel.

Perhaps I should use Arch then, as Fedora rawhide kernel wouldn't boot
on my hw, so I am running the stock Fedora 24 kernel right now for the
tests...

>>> If I want to compile a mainline kernel. Are there anything I need to
>>> tune?
>>
>>
>> Fedora kernels do not have these options set.
>>
>> # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
>> # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
>> # CONFIG_BTRFS_DEBUG is not set
>> # CONFIG_BTRFS_ASSERT is not set
>>
>> The sanity and integrity tests are both compile time and mount time
>> options, i.e. it has to be compiled enabled for the mount option to do
>> anything. I can't recall any thread where a developer asked a user to
>> set any of these options for testing though.

> FWIW, I actually have the integrity checking code built in on most kernels I
> build.  I don't often use it, but it has near zero overhead when not
> enabled, and it's helped me track down lower-level storage configuration
> issues on occasion.

I'll give that a shot tomorrow.

>>> When I do the tests, how do I log the info you would like to see, if I
>>> find a bug?
>>
>>
>> bugzilla.kernel.org for tracking, and then reference the URL for the
>> bug with a summary in an email to list is how I usually do it. The
>> main thing is going to be the exact reproduce steps. It's also better,
>> I think, to have complete dmesg (or journalctl -k) attached to the bug
>> report because not all problems are directly related to Btrfs, they
>> can have contributing factors elsewhere. And various MTAs, or more
>> commonly MUAs, have a tendancy to wrap such wide text as found in
>> kernel or journald messages.
>
> Aside from kernel messages, the other general stuff you want to have is:
> 1. Kernel version and userspace tools version (`uname -a` and `btrfs
> --version`)
> 2. Any underlying storage configuration if it's not just plain a SSD/HDD or
> partitions (for example, usage of dm-crypt, LVM, mdadm, and similar things).
> 3. Output from `btrfs filesystem show` (this can be trimmed to the
> filesystem that's having the issue).
> 4. If you can still mount the filesystem, `btrfs filesystem df` output can
> be helpful.
> 5. If you can't mount the filesystem, output from `btrfs check` run without
> any options will usually be asked for.

I have now had the first crash, can you take a look if I have provided
the needed info?

https://bugzilla.kernel.org/show_bug.cgi?id=153141

How long should I keep the host untouched? Or has all the interesting information already been provided?


Re: About minimal device number for RAID5/6

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-15 03:50, Qu Wenruo wrote:

Hi,

Recently I found that manpage of mkfs is saying minimal device number
for RAID5 and RAID6 is 2 and 3.

Personally speaking, although I understand that RAID5/6 only requires
1/2 devices for parity stripe, it is still quite strange behavior.

Under most case, user use raid5/6 for striping AND parity. For 2 devices
RAID5, it's just a more expensive RAID1.

IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.

Any comment is welcomed.

Based on looking at the code, we do in fact support 2/3 devices for 
raid5/6 respectively.


Personally, I agree that we should warn when trying to do this, but I 
absolutely don't think we should stop it from happening.
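
For reference, mkfs currently accepts the degenerate layouts directly (devices
are placeholders):

  mkfs.btrfs -f -d raid5 -m raid5 /dev/sdb /dev/sdc             # 2-device raid5
  mkfs.btrfs -f -d raid6 -m raid6 /dev/sdb /dev/sdc /dev/sdd    # 3-device raid6

A warning printed at this point, as suggested, would at least make the trade-off
explicit.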




Re: checksum error in metadata node - best way to move root fs to new drive?

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-12 11:06, Duncan wrote:

Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
excerpted:


On a file server?  No, I'd ensure proper physical security is
established and make sure it's properly secured against network based
attacks and then not worry about it.  Unless you have things you want to
hide from law enforcement or your government (which may or may not be
legal where you live) or can reasonably expect someone to steal the
system, you almost certainly don't actually need whole disk encryption.
There are two specific exceptions to this though:
1. If your employer requires encryption on this system, that's their
call.
2. Encrypted swap is a good thing regardless, because it prevents
security credentials from accidentally being written unencrypted to
persistent storage.


In the US, medical records are pretty well protected under penalty of law
(HIPPA, IIRC?).  Anyone storing medical records here would do well to
have full filesystem encryption for that reason.

Of course financial records are sensitive as well, or even just forum
login information, and then there's the various industrial spies from
various countries (China being the one most frequently named) that would
pay good money for unencrypted devices from the right sources.

Medical and even financial records really fall under my first exception, 
but it's still no substitute for proper physical security.  As far as 
user account information, that depends on what your legal or PR 
department promised, but in many cases there, there's minimal 
improvement in security when using full disk encryption in place of just 
encrypting the database file used to store the information.


In either case though, it's still a better investment in terms of both 
time and money to properly secure the network and physical access to the 
hardware.  All that disk encryption protects is data at rest, and for a 
_server_ system, the data is almost always online, and therefore lack of 
protection of the system as a whole is usually more of a security issue 
in general than lack of protection for a single disk that's powered off.



Huge load on btrfs subvolume delete

2016-08-15 Thread Daniel Caillibaud
Hi,

I'm a newbie with btrfs, and I have a problem with high load after each btrfs
subvolume delete.

I use snapshots on lxc hosts under debian jessie with
- kernel 4.6.0-0.bpo.1-amd64
- btrfs-progs 4.6.1-1~bpo8

For backup, I have each day, for each subvolume

btrfs subvolume snapshot -r $subvol $snap
# then later
ionice -c3 btrfs subvolume delete $snap

but ionice doesn't seem to have any effect here, and after a few minutes the
load grows quite high (30~40), and I don't know how to make this deletion nicer
on I/O.

Is there a better way to do so ?

Is it a bad idea to set ionice -c3 on the btrfs-transacti process, which seems
to be the one doing a lot of I/O?

Actually the I/O priorities on my btrfs processes are:

ps x|awk '/[b]trfs/ {printf("%20s ", $NF); system("ionice -p" $1)}'
  [btrfs-worker] none: prio 4
   [btrfs-worker-hi] none: prio 4
[btrfs-delalloc] none: prio 4
   [btrfs-flush_del] none: prio 4
   [btrfs-cache] none: prio 4
  [btrfs-submit] none: prio 4
   [btrfs-fixup] none: prio 4
   [btrfs-endio] none: prio 4
   [btrfs-endio-met] none: prio 4
   [btrfs-endio-met] none: prio 4
   [btrfs-endio-rai] none: prio 4
   [btrfs-endio-rep] none: prio 4
 [btrfs-rmw] none: prio 4
   [btrfs-endio-wri] none: prio 4
   [btrfs-freespace] none: prio 4
   [btrfs-delayed-m] none: prio 4
   [btrfs-readahead] none: prio 4
   [btrfs-qgroup-re] none: prio 4
   [btrfs-extent-re] none: prio 4
 [btrfs-cleaner] none: prio 0
   [btrfs-transacti] none: prio 0



Thanks

-- 
Daniel


About minimal device number for RAID5/6

2016-08-15 Thread Qu Wenruo

Hi,

Recently I found that the mkfs manpage says the minimal device number 
for RAID5 and RAID6 is 2 and 3.


Personally speaking, although I understand that RAID5/6 only requires 
1/2 devices for the parity stripe, this is still quite strange behavior.


In most cases, users use raid5/6 for striping AND parity. A 2-device 
RAID5 is just a more expensive RAID1.


IMHO it's better to warn the user about a 2-device RAID5 or a 3-device RAID6.

Any comment is welcomed.

Thanks,
Qu




Re: [PATCH] code cleanup

2016-08-15 Thread Omar Sandoval
On Sun, Aug 14, 2016 at 04:11:31PM -0400, Harinath Nampally wrote:
> This patch checks the ret value and jumps to cleanup in case the
>  btrfs_add_system_chunk call fails
> 
> Signed-off-by: Harinath Nampally 
> ---
>  fs/btrfs/volumes.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 366b335..fedb301 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4880,12 +4880,15 @@ int btrfs_finish_chunk_alloc(struct 
> btrfs_trans_handle *trans,
>  
>   ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
>   if (ret == 0 && map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
> - /*
> -  * TODO: Cleanup of inserted chunk root in case of
> -  * failure.
> -  */
>   ret = btrfs_add_system_chunk(chunk_root, &key, chunk,
>    item_size);
> + if (ret) {
> + /*
> +  * Cleanup of inserted chunk root in case of
> +  * failure.
> +  */
> + goto out;
> + }
>   }
>  
>  out:

NAK. This patch doesn't do anything. That's just jumping to the exact
same location that we were previously returning to anyways.

-- 
Omar