Re: [PATCH 4/6] btrfs: fix ncopies raid_attr for RAID56

2018-10-05 Thread Nikolay Borisov



On  5.10.2018 00:24, Hans van Kranenburg wrote:
> RAID5 and RAID6 profiles store one copy of the data, not 2 or 3. These
> values are, incidentally, not used anywhere.
> 
> Signed-off-by: Hans van Kranenburg 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/volumes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7045814fc98d..d82b3d735ebe 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -96,7 +96,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .devs_min   = 2,
>   .tolerated_failures = 1,
>   .devs_increment = 1,
> - .ncopies= 2,
> + .ncopies= 1,
>   .raid_name  = "raid5",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID5,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
> @@ -108,7 +108,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .devs_min   = 3,
>   .tolerated_failures = 2,
>   .devs_increment = 1,
> - .ncopies= 3,
> + .ncopies= 1,
>   .raid_name  = "raid6",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID6,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
> 


[PATCH 4/6] btrfs: fix ncopies raid_attr for RAID56

2018-10-04 Thread Hans van Kranenburg
RAID5 and RAID6 profiles store one copy of the data, not 2 or 3. These
values are, incidentally, not used anywhere.

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7045814fc98d..d82b3d735ebe 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -96,7 +96,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.devs_min   = 2,
.tolerated_failures = 1,
.devs_increment = 1,
-   .ncopies= 2,
+   .ncopies= 1,
.raid_name  = "raid5",
.bg_flag= BTRFS_BLOCK_GROUP_RAID5,
.mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
@@ -108,7 +108,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.devs_min   = 3,
.tolerated_failures = 2,
.devs_increment = 1,
-   .ncopies= 3,
+   .ncopies= 1,
.raid_name  = "raid6",
.bg_flag= BTRFS_BLOCK_GROUP_RAID6,
.mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
-- 
2.19.0.329.g76f2f5c1e3
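For readers skimming the table: ncopies counts how many complete copies of each block a profile keeps, and parity blocks are not copies. A small userspace model (Python; the table values below are illustrative stand-ins, not the kernel's actual struct) shows why 1 is the right value for the parity profiles:

```python
# Illustrative model of the ncopies attribute; names loosely mirror
# btrfs_raid_array, but this dict is NOT the kernel's data structure.
raid_attr = {
    "single": {"ncopies": 1, "parity_devs": 0},
    "raid1":  {"ncopies": 2, "parity_devs": 0},  # two full copies
    "raid10": {"ncopies": 2, "parity_devs": 0},
    "raid5":  {"ncopies": 1, "parity_devs": 1},  # one copy + one parity
    "raid6":  {"ncopies": 1, "parity_devs": 2},  # one copy + two parities
}

def directly_readable_copies(profile):
    # Only duplicated copies can be read back as-is; parity blocks need
    # reconstruction math, so they don't count as copies.
    return raid_attr[profile]["ncopies"]

print(directly_readable_copies("raid5"))  # 1, not the old value of 2
```

So RAID5 losing a device still leaves recoverable data, but through parity reconstruction, not through a second stored copy.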



Re: [PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing

2018-06-29 Thread David Sterba
On Fri, Jun 29, 2018 at 10:57:07AM +0200, David Sterba wrote:
> This is called either at the end of the mount or if the mount fails.
> In both cases, there's nothing running that can use the table so the
> lock is pointless.

And then lockdep says no. The umount path frees the table but there's
some unfinished bio that wants to use the table from the interrupt
context. And this is puzzling, there should be no IO in flight, all
workers should be stopped.
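For reference, the inconsistency lockdep reports below can be modeled in a few lines: once a lock has ever been taken from hardirq context, taking it again with interrupts enabled is flagged, because an interrupt arriving while the lock is held could deadlock on it. This is a toy model (real lockdep tracks far more states), not the kernel's implementation:

```python
# Toy model of lockdep's usage-state check behind this warning.
# A lock seen in hardirq context (IN-HARDIRQ-W) must never later be
# acquired with interrupts enabled (HARDIRQ-ON-W).
lock_states = {}

def acquire(lock, in_hardirq, irqs_enabled):
    states = lock_states.setdefault(lock, set())
    if in_hardirq:
        states.add("IN-HARDIRQ-W")
    elif irqs_enabled:
        states.add("HARDIRQ-ON-W")
    # Inconsistent once both usages have been seen for the same lock.
    return not ("IN-HARDIRQ-W" in states and "HARDIRQ-ON-W" in states)

# unlock_stripe() takes the lock from irq completion context...
assert acquire("h->lock", in_hardirq=True, irqs_enabled=False)
# ...so a plain spin_lock() with irqs on, as in the umount path, trips it.
assert not acquire("h->lock", in_hardirq=False, irqs_enabled=True)
```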

btrfs/011:

[ 1339.169842] 
[ 1339.171891] WARNING: inconsistent lock state
[ 1339.173724] 4.17.0-rc7-default+ #168 Not tainted
[ 1339.175661] 
[ 1339.177479] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1339.179758] umount/4029 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1339.182008] 96eee2cd (&(h->lock)->rlock){?.-.}, at: __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.183982] {IN-HARDIRQ-W} state was registered at:
[ 1339.184819]   _raw_spin_lock_irqsave+0x43/0x52
[ 1339.185433]   unlock_stripe+0x78/0x3d0 [btrfs]
[ 1339.186041]   rbio_orig_end_io+0x41/0xd0 [btrfs]
[ 1339.186648]   blk_update_request+0xd7/0x330
[ 1339.187205]   blk_mq_end_request+0x18/0x70
[ 1339.187752]   flush_smp_call_function_queue+0x83/0x120
[ 1339.188405]   smp_call_function_single_interrupt+0x43/0x270
[ 1339.189094]   call_function_single_interrupt+0xf/0x20
[ 1339.189733]   native_safe_halt+0x2/0x10
[ 1339.190255]   default_idle+0x1f/0x190
[ 1339.190759]   do_idle+0x217/0x240
[ 1339.191229]   cpu_startup_entry+0x6f/0x80
[ 1339.191768]   start_secondary+0x192/0x1e0
[ 1339.192385]   secondary_startup_64+0xa5/0xb0
[ 1339.193145] irq event stamp: 783817
[ 1339.193701] hardirqs last  enabled at (783817): [] _raw_spin_unlock_irqrestore+0x4d/0x60
[ 1339.194896] hardirqs last disabled at (783816): [] _raw_spin_lock_irqsave+0x20/0x52
[ 1339.196010] softirqs last  enabled at (777902): [] __do_softirq+0x397/0x502
[ 1339.197435] softirqs last disabled at (777879): [] irq_exit+0xc1/0xd0
[ 1339.198797]
[ 1339.198797] other info that might help us debug this:
[ 1339.200011]  Possible unsafe locking scenario:
[ 1339.200011]
[ 1339.201066]CPU0
[ 1339.201464]
[ 1339.201845]   lock(&(h->lock)->rlock);
[ 1339.202372]   <Interrupt>
[ 1339.202768]     lock(&(h->lock)->rlock);
[ 1339.203313]
[ 1339.203313]  *** DEADLOCK ***
[ 1339.203313]
[ 1339.204215] 1 lock held by umount/4029:
[ 1339.204727]  #0: 7c3dd992 (&type->s_umount_key#26){}, at: deactivate_super+0x43/0x50
[ 1339.205822]
[ 1339.205822] stack backtrace:
[ 1339.206660] CPU: 2 PID: 4029 Comm: umount Not tainted 4.17.0-rc7-default+ #168
[ 1339.207875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 1339.209357] Call Trace:
[ 1339.209742]  dump_stack+0x85/0xc0
[ 1339.210204]  print_usage_bug.cold.57+0x1aa/0x1e4
[ 1339.210790]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.211680]  mark_lock+0x530/0x630
[ 1339.212291]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.213180]  __lock_acquire+0x549/0xf60
[ 1339.213855]  ? flush_work+0x24a/0x280
[ 1339.214485]  ? __lock_is_held+0x4f/0x90
[ 1339.215176]  lock_acquire+0x9e/0x1d0
[ 1339.215850]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.216755]  _raw_spin_lock+0x2c/0x40
[ 1339.217303]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218005]  __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218687]  btrfs_free_stripe_hash_table+0x2a/0x50 [btrfs]
[ 1339.219393]  close_ctree+0x1d7/0x330 [btrfs]
[ 1339.219961]  generic_shutdown_super+0x64/0x100
[ 1339.220659]  kill_anon_super+0xe/0x20
[ 1339.221245]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 1339.221840]  deactivate_locked_super+0x29/0x60
[ 1339.222416]  cleanup_mnt+0x3b/0x70
[ 1339.222892]  task_work_run+0x9b/0xd0
[ 1339.223388]  exit_to_usermode_loop+0x99/0xa0
[ 1339.223949]  do_syscall_64+0x16c/0x170
[ 1339.224603]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1339.225457] RIP: 0033:0x7ffb9180d017
[ 1339.226063] RSP: 002b:7fff9cb72a28 EFLAGS: 0246 ORIG_RAX: 00a6
[ 1339.227289] RAX:  RBX: 55b804721970 RCX: 7ffb9180d017
[ 1339.228373] RDX: 0001 RSI:  RDI: 55b804721b50
[ 1339.229451] RBP:  R08: 55b804721b70 R09: 7fff9cb71290
[ 1339.230524] R10:  R11: 0246 R12: 7ffb91d291c4
[ 1339.231604] R13: 55b804721b50 R14:  R15: 7fff9cb72c98
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing

2018-06-29 Thread David Sterba
This is called either at the end of the mount or if the mount fails.
In both cases, there's nothing running that can use the table so the
lock is pointless.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 272acd9b1192..0840b054e4b7 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -407,19 +407,16 @@ static void remove_rbio_from_cache(struct btrfs_raid_bio *rbio)
 static void btrfs_clear_rbio_cache(struct btrfs_fs_info *info)
 {
struct btrfs_stripe_hash_table *table;
-   unsigned long flags;
struct btrfs_raid_bio *rbio;
 
table = info->stripe_hash_table;
 
-   spin_lock_irqsave(&table->cache_lock, flags);
	while (!list_empty(&table->stripe_cache)) {
rbio = list_entry(table->stripe_cache.next,
  struct btrfs_raid_bio,
  stripe_cache);
__remove_rbio_from_cache(rbio);
}
-   spin_unlock_irqrestore(&table->cache_lock, flags);
 }
 
 /*
-- 
2.17.1



[PATCH 11/14] btrfs: raid56: use new helper for async_scrub_parity

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f9b349171d61..339cce0878d1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -170,7 +170,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check);
-static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+static void scrub_parity_work(struct btrfs_work *work);
 
 static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t work_func)
 {
@@ -812,7 +812,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
start_async_work(next, rmw_work);
} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
steal_rbio(rbio, next);
-   async_scrub_parity(next);
+   start_async_work(next, scrub_parity_work);
}
 
goto done_nolock;
@@ -2703,18 +2703,10 @@ static void scrub_parity_work(struct btrfs_work *work)
raid56_parity_scrub_stripe(rbio);
 }
 
-static void async_scrub_parity(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper,
-		   scrub_parity_work, NULL, NULL);
-
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 {
if (!lock_stripe_add(rbio))
-   async_scrub_parity(rbio);
+   start_async_work(rbio, scrub_parity_work);
 }
 
 /* The following code is used for dev replace of a missing RAID 5/6 device. */
-- 
2.17.1



[PATCH 09/14] btrfs: raid56: use new helper for async_rmw_stripe

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f30d847baf07..96a7d3445623 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_work(struct btrfs_work *work);
 static void read_rebuild_work(struct btrfs_work *work);
-static void async_rmw_stripe(struct btrfs_raid_bio *rbio);
 static void async_read_rebuild(struct btrfs_raid_bio *rbio);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
@@ -811,7 +810,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
async_read_rebuild(next);
} else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
-   async_rmw_stripe(next);
+   start_async_work(next, rmw_work);
} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
steal_rbio(rbio, next);
async_scrub_parity(next);
@@ -1501,12 +1500,6 @@ static void raid_rmw_end_io(struct bio *bio)
rbio_orig_end_io(rbio, BLK_STS_IOERR);
 }
 
-static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper, rmw_work, NULL, NULL);
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 {
btrfs_init_work(>work, btrfs_rmw_helper,
@@ -1645,7 +1638,7 @@ static int partial_stripe_write(struct btrfs_raid_bio *rbio)
 
ret = lock_stripe_add(rbio);
if (ret == 0)
-   async_rmw_stripe(rbio);
+   start_async_work(rbio, rmw_work);
return 0;
 }
 
-- 
2.17.1



[PATCH 12/14] btrfs: raid56: merge rbio_is_full helpers

2018-06-29 Thread David Sterba
There's only one call site of the unlocked helper so it can be folded
into the caller.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 339cce0878d1..272acd9b1192 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -507,32 +507,21 @@ static void run_xor(void **pages, int src_cnt, ssize_t len)
 }
 
 /*
- * returns true if the bio list inside this rbio
- * covers an entire stripe (no rmw required).
- * Must be called with the bio list lock held, or
- * at a time when you know it is impossible to add
- * new bios into the list
+ * Returns true if the bio list inside this rbio covers an entire stripe (no
+ * rmw required).
  */
-static int __rbio_is_full(struct btrfs_raid_bio *rbio)
+static int rbio_is_full(struct btrfs_raid_bio *rbio)
 {
+   unsigned long flags;
unsigned long size = rbio->bio_list_bytes;
int ret = 1;
 
+   spin_lock_irqsave(&rbio->bio_list_lock, flags);
if (size != rbio->nr_data * rbio->stripe_len)
ret = 0;
-
BUG_ON(size > rbio->nr_data * rbio->stripe_len);
-   return ret;
-}
-
-static int rbio_is_full(struct btrfs_raid_bio *rbio)
-{
-   unsigned long flags;
-   int ret;
-
-   spin_lock_irqsave(&rbio->bio_list_lock, flags);
-   ret = __rbio_is_full(rbio);
	spin_unlock_irqrestore(&rbio->bio_list_lock, flags);
+
return ret;
 }
 
-- 
2.17.1
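The merged helper reduces to a single size comparison under the lock: the rbio is full when the queued bio bytes equal nr_data * stripe_len. A userspace sketch (Python; 64 KiB is a typical btrfs stripe length, but the geometry here is only illustrative):

```python
# Userspace model of the full-stripe test: a stripe needs no
# read-modify-write when the queued bytes cover every data device.
def rbio_is_full(bio_list_bytes, nr_data, stripe_len):
    # Mirrors the BUG_ON: the bio list can never exceed the stripe.
    assert bio_list_bytes <= nr_data * stripe_len
    return bio_list_bytes == nr_data * stripe_len

STRIPE_LEN = 64 * 1024  # 64 KiB per device (illustrative)

print(rbio_is_full(2 * STRIPE_LEN, nr_data=2, stripe_len=STRIPE_LEN))  # True
print(rbio_is_full(1 * STRIPE_LEN, nr_data=2, stripe_len=STRIPE_LEN))  # False: rmw needed
```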



[PATCH 14/14] btrfs: raid56: catch errors from full_stripe_write

2018-06-29 Thread David Sterba
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0840b054e4b7..84889d10d5b0 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1683,8 +1683,11 @@ static void run_plug(struct btrfs_plug_cb *plug)
		list_del_init(&cur->plug_list);
 
if (rbio_is_full(cur)) {
+   int ret;
+
/* we have a full stripe, send it down */
-   full_stripe_write(cur);
+   ret = full_stripe_write(cur);
+   BUG_ON(ret);
continue;
}
if (last) {
-- 
2.17.1



[PATCH 08/14] btrfs: raid56: add new helper for starting async work

2018-06-29 Thread David Sterba
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1a1b7d6c44cb..f30d847baf07 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -174,6 +174,12 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check);
 static void async_scrub_parity(struct btrfs_raid_bio *rbio);
 
+static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t work_func)
+{
+   btrfs_init_work(&rbio->work, btrfs_rmw_helper, work_func, NULL, NULL);
+   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
+}
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
-- 
2.17.1
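The pattern being introduced can be sketched in userspace: one generic scheduler that binds a work function, instead of one wrapper per operation. The function names below mirror the patch, but everything else is an illustrative stand-in, not kernel code:

```python
# Userspace sketch of the consolidation: rather than one wrapper per
# operation (async_rmw_stripe, async_read_rebuild, ...), a single
# helper queues any work function onto the rmw work queue.
import queue

rmw_workers = queue.Queue()

def start_async_work(rbio, work_func):
    rmw_workers.put((work_func, rbio))

def rmw_work(rbio):
    return f"rmw {rbio}"

def read_rebuild_work(rbio):
    return f"rebuild {rbio}"

start_async_work("rbio-A", rmw_work)
start_async_work("rbio-B", read_rebuild_work)

# Drain the queue the way a worker thread would.
results = []
while not rmw_workers.empty():
    fn, rbio = rmw_workers.get()
    results.append(fn(rbio))
print(results)  # ['rmw rbio-A', 'rebuild rbio-B']
```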



[PATCH 10/14] btrfs: raid56: use new helper for async_read_rebuild

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 96a7d3445623..f9b349171d61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_work(struct btrfs_work *work);
 static void read_rebuild_work(struct btrfs_work *work);
-static void async_read_rebuild(struct btrfs_raid_bio *rbio);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void __free_raid_bio(struct btrfs_raid_bio *rbio);
@@ -804,10 +803,10 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
		spin_unlock_irqrestore(&h->lock, flags);
 
if (next->operation == BTRFS_RBIO_READ_REBUILD)
-   async_read_rebuild(next);
+   start_async_work(next, read_rebuild_work);
		else if (next->operation == BTRFS_RBIO_REBUILD_MISSING) {
steal_rbio(rbio, next);
-   async_read_rebuild(next);
+   start_async_work(next, read_rebuild_work);
} else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
start_async_work(next, rmw_work);
@@ -1500,14 +1499,6 @@ static void raid_rmw_end_io(struct bio *bio)
rbio_orig_end_io(rbio, BLK_STS_IOERR);
 }
 
-static void async_read_rebuild(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper,
-		   read_rebuild_work, NULL, NULL);
-
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 /*
  * the stripe must be locked by the caller.  It will
  * unlock after all the writes are done
@@ -2765,5 +2756,5 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio)
 {
if (!lock_stripe_add(rbio))
-   async_read_rebuild(rbio);
+   start_async_work(rbio, read_rebuild_work);
 }
-- 
2.17.1



Re: RAID56

2018-06-20 Thread Duncan
Gandalf Corvotempesta posted on Wed, 20 Jun 2018 11:15:03 +0200 as
excerpted:

> Il giorno mer 20 giu 2018 alle ore 10:34 Duncan <1i5t5.dun...@cox.net>
> ha scritto:
>> Parity-raid is certainly nice, but is it mandatory, especially when
>> there are already other parity solutions (both hardware and software)
>> available that btrfs can be run on top of?  Should a parity-raid
>> solution be /that/ necessary?
> 
> You can't be serious. hw raid has many more flaws than any sw raid.

I didn't say /good/ solutions, I said /other/ solutions.
FWIW, I'd go for mdraid at the lower level, were I to choose, here.

But for a 4-12-ish device solution, I'd probably go btrfs raid1 on a pair 
of mdraid-0s.  That gets you btrfs raid1 data integrity and recovery from 
its other mirror, while also being faster than the still-unoptimized 
btrfs raid10.  Beyond about a dozen devices, six per "side" of the btrfs 
raid1, the risk of multi-device breakdown before recovery starts to get 
too high for comfort.  But six 8 TB devices in raid0 gives you up to 48 TB 
to work with, and more than that arguably should be broken down into 
smaller blocks anyway, because otherwise you're simply dealing with so 
much data that it'll take you unreasonably long to do much of anything 
non-incremental with it, from any sort of fsck or btrfs maintenance, to 
trying to copy or move the data anywhere (including for backup/restore 
purposes), to ... whatever.

Actually, I'd argue that point is reached well before 48 TB, but the 
point remains, at some point it's just too much data to do much of 
anything with, too much to risk losing all at once, too much to backup 
and restore all at once as it just takes too much time to do it, just too 
much...  And that point's well within ordinary raid sizes with a dozen 
devices or less, mirrored, these days.

Which is one of the reasons I'm so skeptical about parity-raid being 
mandatory "nowadays".  Maybe it was in the past, when disks were (say) 
half a TB or less and mirroring a few TB of data was resource-
prohibitive, but now?

Of course we've got a guy here who works with CERN and deals with their 
annual 50ish petabytes of data (49 in 2016, see wikipedia's CERN 
article), but that's simply problems on a different scale.

Even so, I'd say it needs to be broken up into manageable chunks, and 50 PB 
is "only" a bit over 1000 48 TB filesystems' worth.  OK, say 2000, so you're 
not filling them all absolutely full.

Meanwhile, I'm actually an N-way-mirroring proponent, here, as opposed to 
a parity-raid proponent.  And at that sort of scale, you /really/ don't 
want to have to restore from backups, so 3-way or even 4-5 way mirroring 
makes a lot of sense.  Hmm... 2.5 dozen for 5-way-mirroring, 2000 times, 
2.5*12*2000=... 60K devices!  That's a lot of hard drives!  And a lot of 
power to spin them.  But I guess it's a rounding error compared to what 
CERN uses for the LHC.
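The back-of-envelope numbers above check out; a quick sketch (all figures taken from the text, round decimal units assumed):

```python
# Sanity-check the device-count estimate above.
TB = 1
fs_size = 6 * 8 * TB        # six 8 TB devices in raid0 per mirror side
assert fs_size == 48        # 48 TB per filesystem, as stated

filesystems = 2000          # ~50 PB split up, not filled completely full
devices_per_fs = 6 * 5      # 6-device raid0, 5-way mirrored = 2.5 dozen
total_devices = filesystems * devices_per_fs
print(total_devices)        # 60000 drives
```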

FWIW, N-way-mirroring has been on the btrfs roadmap, since at least 
kernel 3.6, for "after raid56".  I've been waiting awhile too; no sign of 
it yet so I guess I'll be waiting awhile longer.  So as they say, 
"welcome to the club!"  I'm 51 now.  Maybe I'll see it before I die.  
Imagine, I'm in my 80s in the retirement home and get the news btrfs 
finally has N-way-mirroring in mainline.  I'll be jumping up and down and 
cause a ruckus when I break my hip!  Well, hoping it won't be /that/ 
long, but... =;^]

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56

2018-06-20 Thread Gandalf Corvotempesta
Il giorno mer 20 giu 2018 alle ore 10:34 Duncan <1i5t5.dun...@cox.net>
ha scritto:
> Parity-raid is certainly nice, but is it mandatory, especially when
> there are already other parity solutions (both hardware and software)
> available that btrfs can be run on top of?  Should a parity-raid
> solution be /that/ necessary?

You can't be serious. hw raid has many more flaws than any sw raid.
Current CPUs are much more performant than any hw raid chipset, and
there is no longer a performance loss in using sw raid vs. hw raid.

The biggest difference is that you are not locked in to a single vendor.
When you have to move disks between servers you can do so safely without
having to use the same hw raid controller (with the same firmware). Almost
all raid controllers only support one-way upgrades: if your raid was created
with an older model, you can upgrade to a newer one, but then it's impossible
to move it back. If you have issues with the new controller, you can't use
the previous one.
Almost all server vendors don't support old-gen controllers on new-gen
servers (at least Dell), so you are forced to upgrade the raid controller
when you have to upgrade the whole server or move disks between servers.
I could continue for hours; no, you can't compare any modern software raid
to any hw raid.


Re: RAID56

2018-06-20 Thread Nikolay Borisov



On 20.06.2018 10:34, Gandalf Corvotempesta wrote:
> Il giorno mer 20 giu 2018 alle ore 02:06 waxhead
>  ha scritto:
>> First of all: I am not a BTRFS developer, but I follow the mailing list
>> closely and I too have a particular interest in the "RAID"5/6 feature
>> which realistically is probably about 3-4 years (if not more) in the future.
> 
> Ok.
> 
> [cut]
> 
>> Now keep in mind that this is just a humble users analysis of the
>> situation based on whatever I have picked up from the mailing list which
>> may or may not be entirely accurate so take it for what it is!
> 
> I wasn't aware of all of these "restrictions".
> If this is true, now I understand why Red Hat lost interest in BTRFS.
> 3-4 more years for a "working" RAID56 is absolutely too much; in that
> case, ZFS support for RAID-Z expansion/reduction (actively being worked
> on) will be released much earlier (probably a working test version
> later this year and a stable version next year).
> 
> RAID-Z single-disk expansion/removal is probably the real missing
> feature in ZFS keeping it from being considered a general-purpose FS.
> 
> Device removal was added some months ago and is now possible (so, if
> you add a single disk to a mirrored vdev, you don't have to destroy
> the whole pool to remove the accidentally-added disk).
> 
> In 3-4 years, maybe Oracle will release ZFS as GPL-compatible (Solaris
> is dying, the latest release was 3 years ago, so there is no need to
> keep a FS open-source-compatible only with a dead OS).
> 
> Keep in mind that I'm not a ZFS fan (honestly, I don't like it), but
> with these 2 features added, and tons of restrictions in BTRFS,
> there is no other choice.

Of course btrfs is open source and new contributors are always welcome.



Re: RAID56

2018-06-20 Thread Gandalf Corvotempesta
Il giorno mer 20 giu 2018 alle ore 02:06 waxhead
 ha scritto:
> First of all: I am not a BTRFS developer, but I follow the mailing list
> closely and I too have a particular interest in the "RAID"5/6 feature
> which realistically is probably about 3-4 years (if not more) in the future.

Ok.

[cut]

> Now keep in mind that this is just a humble users analysis of the
> situation based on whatever I have picked up from the mailing list which
> may or may not be entirely accurate so take it for what it is!

I wasn't aware of all of these "restrictions".
If this is true, now I understand why Red Hat lost interest in BTRFS.
3-4 more years for a "working" RAID56 is absolutely too much; in that
case, ZFS support for RAID-Z expansion/reduction (actively being worked
on) will be released much earlier (probably a working test version
later this year and a stable version next year).

RAID-Z single-disk expansion/removal is probably the real missing
feature in ZFS keeping it from being considered a general-purpose FS.

Device removal was added some months ago and is now possible (so, if
you add a single disk to a mirrored vdev, you don't have to destroy
the whole pool to remove the accidentally-added disk).

In 3-4 years, maybe Oracle will release ZFS as GPL-compatible (Solaris
is dying, the latest release was 3 years ago, so there is no need to
keep a FS open-source-compatible only with a dead OS).

Keep in mind that I'm not a ZFS fan (honestly, I don't like it), but
with these 2 features added, and tons of restrictions in BTRFS,
there is no other choice.


Re: RAID56

2018-06-20 Thread Duncan
Gandalf Corvotempesta posted on Tue, 19 Jun 2018 17:26:59 +0200 as
excerpted:

> Another kernel release was made.
> Any improvements in RAID56?

Btrfs feature improvements come in "btrfs time".  Think long term, 
multiple releases, even multiple years (5 releases per year).

In fact, btrfs raid56 is a good example.  Originally it was supposed to 
be in kernel 3.6 (or even before, but 3.5 is when I really started 
getting into btrfs enough to know), but for various reasons, primarily 
involving the complexity of the feature as well as of btrfs itself and 
the number of devs actually working on btrfs, even partial raid56 support 
didn't get added until 3.9.  Still-buggy full support for raid56 scrub 
and device replace wasn't there until 3.19, with 4.3 fixing some bugs 
while others remained hidden for many releases until they were finally 
fixed in 4.12.

Since 4.12, btrfs raid56 mode, as such, has the known major bugs fixed 
and is ready for "still cautious use"[1], but for rather technical 
reasons discussed below, may not actually meet people's general 
expectations for what btrfs raid56 should be in reliability terms.

And that's the long term 3+ years out bit that waxhead was talking about.

> I didn't see any changes in that sector, is something still being worked
> on or it's stuck waiting for something ?

Actually, if you look on the wiki page, there were indeed raid56 changes 
in 4.17.

https://btrfs.wiki.kernel.org/index.php/Changelog#v4.17_.28Jun_2018.29


* raid56:
** make sure target is identical to source when raid56 rebuild fails 
after dev-replace
** faster rebuild during scrub, batch by stripes and not block-by-block
** make more use of cached data when rebuilding from a missing device


Tho that's actually the small stuff, ignoring the "elephant in the room": 
the raid56 reliability expectations mentioned earlier, which will likely 
take years to deal with.

As for those long term issues...

The "elephant in the room" problem is simply the parity-raid "write hole" 
common to all parity-raid systems, unless they've taken specific measures 
to work around the issue in one way or another.


In simple terms, the "write hole" problem is just that parity-raid makes 
the assumption that an update to a stripe including its parity is atomic, 
it happens all at once, so that it's impossible for the parity to be out 
of sync with the data actually written on all the other stripe-component 
devices.  In "real life", that's an invalid assumption.  Should the 
system crash at the wrong time, in the middle of a stripe update, it's 
quite possible that the parity will not match what's actually written to 
the data devices in the stripe, because either the parity will have been 
updated while at least one data device was still writing at the time of 
the crash, or the data will be updated but the parity device won't have 
finished writing yet at the time of the crash.  Either way, the parity 
doesn't match the data that's actually in the stripe, and should a device 
be/go missing so the parity is actually needed to recover the missing 
data, that missing data will be calculated incorrectly because the parity 
doesn't match what the data actually was.
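The write hole is easy to demonstrate with XOR parity. A minimal sketch (Python, made-up block values): the crash lands between the data write and the parity write, and a later rebuild of a lost stripe member then fabricates data:

```python
# XOR-parity model of the write hole.  A crash between the data write
# and the parity write leaves parity stale; reconstruction of a lost
# device from that stale parity then produces fabricated data.
def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [0b1010, 0b0110]   # one stripe across two data devices
p = parity(data)          # parity device holds 0b1100

data[0] = 0b1111          # crash here: data[0] updated, parity write lost

# Later, the device holding data[1] dies; rebuild it from parity.
rebuilt = p ^ data[0]
print(bin(rebuilt))       # 0b11, not the real 0b110
assert rebuilt != 0b0110  # stale parity reconstructs the wrong data
```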

Now as I already stated, that's a known problem common to parity-raid in 
general, so it's not unique at all to btrfs.

The problem specific to btrfs, however, is that in general it's copy-on-
write, with checksumming to guard against invalid data, so in general, it 
provides higher guarantees of data integrity than does a normal update-in-
place filesystem, and it'd be quite reasonable for someone to expect 
those guarantees to extend to btrfs raid56 mode as well, but they don't.

They don't, because while btrfs in general is copy-on-write and thus 
atomic update (in the event of a crash you get either the data as it was 
before the write or the completely written data, not some unpredictable 
mix of before and after), btrfs parity-raid stripes are *NOT* copy-on-
write, they're update-in-place, meaning the write-hole problem applies, 
and in the event of a crash when the parity-raid was already degraded, 
the integrity of the data or metadata being parity-raid written at the 
time of the crash is not guaranteed, nor at present, with the current 
raid56 implementation, /can/ it be guaranteed.
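The copy-on-write vs. update-in-place distinction can be sketched the same way: COW writes the new data elsewhere and commits it with a single pointer switch, so a crash leaves either the old stripe or the new one, never a mix. (Illustrative model only; btrfs extent handling is of course far more involved.)

```python
# Update-in-place: the old bytes are destroyed as the new ones land,
# so a crash mid-update leaves a mixed old/new stripe.
stripe = [0b1010, 0b0110]
stripe[0] = 0b1111            # crash now -> inconsistent stripe on disk

# Copy-on-write: write the new stripe to a new location first, then
# commit with one atomic pointer switch; a crash before the switch
# still leaves the old stripe fully intact.
extents = {"old": [0b1010, 0b0110]}
extents["new"] = [0b1111, 0b0110]   # new location written first
current = "old"                     # crash here: readers still see "old"
current = "new"                     # the atomic commit
print(extents[current])             # [15, 6]: the complete new stripe
```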

But as I said, the write hole problem is common to parity-raid in 
general, so for people that understand the problem and are prepared to 
deal with the reliability implications it implies[3], btrfs raid56 mode 
should be reasonably ready for still cautious use, even tho it doesn't 
carry the same data integrity and reliability guarantees that btrfs in 
general does.

As for working around or avoiding the write-hole problem entirely, 
there's (at least) four possible solutions, each with their own drawbacks.

The arguably "most proper" but also longest term solution would be to 
rewrite 

Re: RAID56

2018-06-19 Thread waxhead

Gandalf Corvotempesta wrote:

Another kernel release was made.
Any improvements in RAID56?

I didn't see any changes in that sector; is something still being
worked on, or is it stuck waiting for something?

Based on official BTRFS status page, RAID56 is the only "unstable"
item marked in red.
No interest from Suse in fixing that?

I think it's the real missing part for a feature-complete filesystem.
Nowadays parity raid is mandatory, we can't only rely on mirroring.


First of all: I am not a BTRFS developer, but I follow the mailing list 
closely and I too have a particular interest in the "RAID"5/6 feature 
which realistically is probably about 3-4 years (if not more) in the future.


From what I am able to understand the pesky write hole is one of the 
major obstacles for having BTRFS "RAID"5/6 work reliably. There were 
patches to fix this a while ago, but whether these patches are to be 
classified as a workaround or actually as "the darn thing done right" is 
perhaps up for discussion.


In general there seems to be a lot more momentum on the "RAID"5/6 
feature now compared to earlier. There also seems to be a lot of focus on 
fixing bugs and running tests as well. This is why I am guessing that 
3-4 years ahead is an absolute minimum until "RAID"5/6 might be somewhat 
reliable and usable.


There are a few other basics missing that may be acceptable for you as 
long as you know about them. For example, as far as I know BTRFS still 
does not use the "device-id" or "BTRFS internal number" to keep track of 
storage devices.


This means that if you have a multi storage device filesystem with for 
example /dev/sda /dev/sdb /dev/sdc etc... and /dev/sdc disappears and 
shows up again as /dev/sdx then BTRFS would not recognize this and 
happily try to continue to write to /dev/sdc even if it does not exist.


...and perhaps even worse - I can imagine that if you swap device 
ordering and a different device takes /dev/sdc's place then BTRFS 
*could* overwrite data on this device - possibly making a real mess of 
things. I am not sure if this holds true, but if it does it's for sure a 
real nugget of basic functionality missing right there.


BTRFS also so far has no automatic "drop device" function, e.g. it will 
not automatically kick out a storage device that is throwing lots of 
errors and causing delays etc. There may be benefits to keeping this 
design of course, but for some, dropping the device might be desirable.


And no hot-spare "or hot-(reserved-)space" (which would be more accurate 
in BTRFS terms) is implemented either, and that is one good reason to 
keep an eye on your storage pool.


What you *might* consider is to have your metadata in "RAID"1 or 
"RAID"10 and your data in "RAID5" or even "RAID6" so that if you run 
into problems then you might in the worst case lose some data, but since 
"RAID"1/10 is beginning to be rather mature then it is likely that your 
filesystem might survive a disk failure.


So if you are prepared to perhaps lose a file or two, but want to feel 
confident that your filesystem is surviving and will give you a report 
about what file(s) are toast then this may be acceptable for you as you 
can always restore from backups (because you do have backups right? If 
not, read 'any' of Duncan's posts - he explains better than most people 
why you need and should have backups!)


Now keep in mind that this is just a humble users analysis of the 
situation based on whatever I have picked up from the mailing list which 
may or may not be entirely accurate so take it for what it is!

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RAID56

2018-06-19 Thread Gandalf Corvotempesta
Another kernel release was made.
Any improvements in RAID56?

I didn't see any changes in that sector; is something still being
worked on, or is it stuck waiting for something?

Based on official BTRFS status page, RAID56 is the only "unstable"
item marked in red.
No interest from Suse in fixing that?

I think it's the real missing part for a feature-complete filesystem.
Nowadays parity raid is mandatory, we can't only rely on mirroring.


Re: RAID56 - 6 parity raid

2018-05-03 Thread Goffredo Baroncelli
On 05/03/2018 02:47 PM, Alberto Bursi wrote:
> 
> 
> On 01/05/2018 23:57, Gandalf Corvotempesta wrote:
>> Hi to all
>> I've found some patches from Andrea Mazzoleni that adds support up to 6
>> parity raid.
>> Why weren't these merged?
>> With modern disk size, having something greater than 2 parity, would be
>> great.
> 
> His patch was about a generic library to do RAID6, it wasn't directly 
> for btrfs.
> 
> To actually use that for btrfs someone would have to actually port btrfs 
> to that library.

In the past Andrea ported this library to btrfs too

https://lwn.net/Articles/588106/

> -Alberto

G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RAID56 - 6 parity raid

2018-05-03 Thread Goffredo Baroncelli
On 05/03/2018 01:26 PM, Austin S. Hemmelgarn wrote:
>> My intention was to highlight that the parity-checksum is not related to the 
>> reliability and safety of raid5/6.
> It may not be related to the safety, but it is arguably indirectly related to 
> the reliability, dependent on your definition of reliability.  Spending less 
> time verifying the parity means you're spending less time in an indeterminate 
> state of usability, which arguably does improve the reliability of the 
> system.  However, that does still have nothing to do with the write hole.

If you start a scrub once per week, the fact that the scrub requires 1 hr or 1 day 
doesn't impact the reliability, because in any case you have 1 week of 
un-scrubbed data.


BR 
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RAID56 - 6 parity raid

2018-05-03 Thread Alberto Bursi


On 01/05/2018 23:57, Gandalf Corvotempesta wrote:
> Hi to all
> I've found some patches from Andrea Mazzoleni that adds support up to 6
> parity raid.
> Why weren't these merged?
> With modern disk size, having something greater than 2 parity, would be
> great.

His patch was about a generic library to do RAID6, it wasn't directly 
for btrfs.

To actually use that for btrfs someone would have to actually port btrfs 
to that library.

-Alberto


Re: RAID56 - 6 parity raid

2018-05-03 Thread Austin S. Hemmelgarn

On 2018-05-03 04:11, Andrei Borzenkov wrote:

On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
 wrote:
...


Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
you 40TB of usable space).  You're storing roughly 20TB of data on it, using
a 16kB block size, and it sees about 1GB of writes a day, with no partial
stripe writes.  You, for reasons of argument, want to scrub it every week,
because the data in question matters a lot to you.

With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
can compute the parity at a rate of 1.25G/s (the ratio here is about the
average across the almost 50 systems I have quick access to check, including
a number of server and workstation systems less than a year old, though the
numbers themselves are artificially low to accentuate the point here).

At this rate, scrubbing by computing parity requires processing:

* Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
seconds, or 222 minutes, or about 3.7 hours.
* Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
seconds, or 267 minutes, or roughly 4.4 hours.

So, over a week, you would be spending 8.1 hours processing data solely for
data integrity, or roughly 4.8214% of your time.

Now assume instead that you're doing checksummed parity:

* Scrubbing data is the same, 3.7 hours.
* Scrubbing parity turns into computing checksums for 4TB of data, which
would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.


Scrubbing must compute parity and compare with stored value to detect
write hole. Otherwise you end up with parity having good checksum but
not matching rest of data.
Yes, but that assumes we don't do anything to deal with the write hole, 
and it's been pretty much decided by the devs that they're going to try 
and close the write hole.
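For what it's worth, the arithmetic in the quoted example checks out; here is a quick editorial verification (SI units, as in the post; note the 3200 s figure for 4 TB of parity corresponds to the 1.25 GB/s rate):

```python
TB, GB = 10**12, 10**9

data = 20 * TB
csum_rate, parity_rate = 1.5 * GB, 1.25 * GB

csum_secs = data / csum_rate          # ~13333 s, i.e. ~222 min or ~3.7 h
parity_secs = data / parity_rate      # 16000 s, i.e. ~267 min or ~4.4 h
weekly_pct = 100 * 8.1 / (7 * 24)     # 8.1 h out of a 168 h week

print(round(csum_secs), round(parity_secs), round(weekly_pct, 4))
# -> 13333 16000 4.8214

# With checksummed parity, scrubbing the 4 TB of parity instead takes:
print(round(4 * TB / parity_rate))    # -> 3200 seconds (~53 min, ~0.88 h)
```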




Re: RAID56 - 6 parity raid

2018-05-03 Thread Austin S. Hemmelgarn

On 2018-05-02 16:40, Goffredo Baroncelli wrote:

On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:

On 2018-05-02 13:25, Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, nothing. In any case the data is checksummed so it is 
impossible to return corrupted data (modulo bug :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
checksum the parity there is no way to verify that the data (parity) you 
use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because the data is 
always checksummed. And in any case you must check the data against its 
checksum.

My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...

The only gain is to avoid to try to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case !).

So from one side you have a *cost every time* (the write amplification), to 
other side you have a gain (cpu-time) *only in case* of the parity is corrupted 
and you need it (eg. scrub or corrupted data)).

IMHO the costs are far higher than the gain, and the likelihood of the gain is 
far lower compared to the likelihood (=100%, or always) of the cost.

You do realize that a write is already rewriting checksums elsewhere? It would 
be pretty trivial to make sure that the checksums for every part of a stripe 
end up in the same metadata block, at which point the only cost is computing 
the checksum (because when a checksum gets updated, the whole block it's in 
gets rewritten, period, because that's how CoW works).

Looking at this another way (all the math below uses SI units):


[...]
Good point: precomputing the checksum of the parity saves a lot of time for the 
scrub process. You can see this in a simpler way by saying that the parity 
calculation (which is dominated by the memory bandwidth) is like O(n) (where n 
is the number of disks); the parity checking (which again is dominated by the 
memory bandwidth) against a checksum is like O(1). And when the data written is 
two orders of magnitude less than the data stored, the effort required to 
precompute the checksum is negligible.
Excellent point about the computational efficiency, I had not thought of 
framing things that way.


Anyway, my "rant" started when Duncan put the missing parity checksum alongside 
the write hole. The first might be a performance problem. Instead the write hole 
could lead to losing data. My intention was to highlight that the parity-checksum is 
not related to the reliability and safety of raid5/6.
It may not be related to the safety, but it is arguably indirectly 
related to the reliability, dependent on your definition of reliability. 
 Spending less time verifying the parity means you're spending less 
time in an indeterminate state of usability, which arguably does improve 
the reliability of the system.  However, that does still have nothing to 
do with the write hole.




So, lets look at data usage:

1GB of data translates to 62500 16kB blocks of data, which equates to an 
additional 15625 blocks for parity.  Adding parity checksums adds a 25% 
overhead to checksums being written, but that actually doesn't translate to a 
huge increase in the number of _blocks_ of checksums written.  One 16k block 
can hold roughly 500 checksums, so it would take 125 blocks worth of checksums 
without parity, and 157 (technically 156.25, but you can't write a quarter 
block) with parity checksums. Thus, without parity checksums, writing 1GB of 
data involves writing 78250 blocks, while doing the same with parity checksums 
involves writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.
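The block counts above can be reproduced as follows (an editorial sketch; the 15625 figure implies 4 data blocks per parity block, and "roughly 500 checksums" per 16k block is taken at face value):

```python
import math

# SI units as in the post: 1 GB = 10**9 B, one block = 16 kB = 16000 B.
data_blocks = 10**9 // 16000                # 62500 data blocks per GB written
parity_blocks = data_blocks // 4            # 15625 (4 data blocks per parity block)
csums_per_block = 500                       # "roughly 500 checksums" per 16k block

csum_blocks = math.ceil(data_blocks / csums_per_block)                      # 125
csum_blocks_p = math.ceil((data_blocks + parity_blocks) / csums_per_block)  # 157

total = data_blocks + parity_blocks + csum_blocks        # without parity csums
total_p = data_blocks + parity_blocks + csum_blocks_p    # with parity csums
overhead_pct = 100 * (total_p - total) / total
print(total, total_p, round(overhead_pct, 4))
# -> 78250 78282 0.0409
```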


How would you store the checksum?
I asked that because I am not sure that we could use the "standard" btrfs 
metadata to store the checksum of the parity. Doing so you could face some 
pathological effects like:
- update a block(1) in a stripe(1)
- update the parity of stripe(1) containing block(1)
- update the checksum of parity stripe (1), which is contained in another 
stripe(2) [**]

- update the parity of stripe (2) containing the checksum of parity stripe(1)
- update the checksum of parity stripe (2), which is contained in another 
stripe(3)

and so on...

[**] pay attention that the checksum and the parity *have* to be in different 
stripe, otherwise you have the egg/chicken problem: compute the parity, then 
update the checksum, then update the parity again because the checksum is 
changed
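The cascade in the list above amounts to a chain of dirtied stripes. A purely hypothetical model (not btrfs's actual metadata layout): suppose the checksum of stripe n's parity is stored in some other stripe; with in-place stripe updates, one block write then dirties every stripe along the chain:

```python
# Hypothetical: csum_home[n] names the stripe that stores the checksum
# of stripe n's parity.  With RMW (non-CoW) stripes, updating a block in
# the starting stripe cascades down the whole chain.
def stripes_dirtied(start, csum_home):
    dirtied, s = [], start
    while s is not None and s not in dirtied:
        dirtied.append(s)          # rewrite this stripe's parity + its checksum
        s = csum_home.get(s)       # ...whose checksum lives in another stripe
    return dirtied

# Checksum of stripe n's parity lives in stripe n+1 (the last has none):
chain = {0: 1, 1: 2, 2: 3, 3: None}
print(stripes_dirtied(0, chain))   # -> [0, 1, 2, 3]: one write, four stripes
```

This is only a model of the write-amplification concern being raised, not a claim about how btrfs would actually place such checksums.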

Re: RAID56 - 6 parity raid

2018-05-03 Thread Andrei Borzenkov
On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
 wrote:
...
>
> Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
> you 40TB of usable space).  You're storing roughly 20TB of data on it, using
> a 16kB block size, and it sees about 1GB of writes a day, with no partial
> stripe writes.  You, for reasons of argument, want to scrub it every week,
> because the data in question matters a lot to you.
>
> With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
> can compute the parity at a rate of 1.25G/s (the ratio here is about the
> average across the almost 50 systems I have quick access to check, including
> a number of server and workstation systems less than a year old, though the
> numbers themselves are artificially low to accentuate the point here).
>
> At this rate, scrubbing by computing parity requires processing:
>
> * Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
> seconds, or 222 minutes, or about 3.7 hours.
> * Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
> seconds, or 267 minutes, or roughly 4.4 hours.
>
> So, over a week, you would be spending 8.1 hours processing data solely for
> data integrity, or roughly 4.8214% of your time.
>
> Now assume instead that you're doing checksummed parity:
>
> * Scrubbing data is the same, 3.7 hours.
> * Scrubbing parity turns into computing checksums for 4TB of data, which
> would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.

Scrubbing must compute parity and compare with stored value to detect
write hole. Otherwise you end up with parity having good checksum but
not matching rest of data.


Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Goffredo Baroncelli posted on Wed, 02 May 2018 22:40:27 +0200 as
excerpted:

> Anyway, my "rant" started when Duncan put the missing parity
> checksum alongside the write hole. The first might be a performance problem.
> Instead the write hole could lead to losing data. My intention was to
> highlight that the parity-checksum is not related to the reliability and
> safety of raid5/6.

Thanks for making that point... and to everyone else for the vigorous 
thread debating it, as I'm learning quite a lot! =:^)

From your first reply:

>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the
>> thread reached the conclusion that... it is not a problem.

I must have missed those threads, or at least, missed that conclusion 
from them (maybe believing they were about something rather narrower, or 
conflating... for instance), because AFAICT, this is the first time I've 
seen the practical merits of checksummed parity actually debated, at 
least in terms I as a non-dev can reasonably understand.  To my mind it 
was settled (or I'd have worded my original claim rather differently) and 
only now am I learning different.

And... to my credit... given the healthy vigor of the debate, it seems 
I'm not the only one that missed them...

But I'm surely learning of it now, and indeed, I had somewhat conflated 
parity-checksumming with the in-place-stripe-read-modify-write atomicity 
issue.  I'll leave the parity-checksumming debate (now that I know it at 
least remains debatable) to those more knowledgeable than myself, but in 
addition to what I've learned of it, I've definitely learned that I can't 
properly conflate it with the in-place stripe-rmw atomicity issue, so 
thanks!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Gandalf Corvotempesta posted on Wed, 02 May 2018 19:25:41 + as
excerpted:

> On 05/02/2018 03:47 AM, Duncan wrote:
>> Meanwhile, have you looked at zfs? Perhaps they have something like
>> that?
> 
> Yes, i've looked at ZFS and I'm using it on some servers but I don't
> like it too much for multiple reasons, in example:
> 
> 1) is not officially in kernel, we have to build a module every time
> with DKMS

FWIW zfs is excluded from my choice domain as well, due to the well known 
license issues.  Regardless of strict legal implications, because Oracle 
has copyrights they could easily solve that problem and the fact that 
they haven't strongly suggests they have no interest in doing so.  That 
in turn means they have no interest in people like me running zfs, which 
means I have no interest in it either.

But because it does remain effectively the nearest to btrfs features and 
potential features "working now" solution out there, for those who simply 
_must_ have it and/or find it a more acceptable solution than cobbling 
together a multi-layer solution out of a standard filesystem on top of 
device-mapper or whatever, it's what I and others point to when people 
wonder about missing or unstable btrfs features.

> I'm new to BTRFS (if fact, i'm not using it) and I've seen in the status
> page that "it's almost ready".
> The only real missing part is a stable, secure and properly working
> RAID56,
> so i'm thinking why most effort aren't directed to fix RAID56 ?

Well, they are.  But finding and fixing corner-case bugs takes time and 
early-adopter deployments, and btrfs doesn't have the engineering 
resources to simply assign to the problem that Sun had with zfs.

Despite that, as I stated, the current btrfs raid56 code is, to the best of my/
list knowledge, now reasonably ready, tho it'll take another year or two 
without serious bug reports to actually test that, but it simply has the 
well known write hole that applies to all parity-raid unless they've taken 
specific measures such as partial-stripe-write 
logging (slow), writing a full stripe even if it's partially empty 
(wastes space and needs periodic maintenance to reclaim it), or variable-
stripe-widths (needs periodic maintenance and more complex than always 
writing full stripes even if they're partially empty) (both of the latter 
avoiding the problem by avoiding in-place read-modify-write cycle 
entirely).

So to a large degree what's left is simply time for testing to 
demonstrate stability on the one hand, and a well known problem with 
parity-raid in general on the other.  There's the small detail that said 
well-known write hole has additional implementation-detail implications 
on btrfs, but at it's root it's the same problem all parity-raid has, and 
people choosing parity-raid as a solution are already choosing to either 
live with it or ameliorate it in some other way (tho some parity-raid 
solutions have that amelioration built-in).

> There are some environments where a RAID1/10 is too expensive and a
> RAID6 is mandatory,
> but with the current state of RAID56, BTRFS can't be used for valuable
> data

Not entirely true.  Btrfs, even btrfs raid56 mode, _can_ be used for 
"valuable" data, it simply requires astute /practical/ definitions of 
"valuable", as opposed to simple claims that don't actually stand up in 
practice.

Here's what I mean:  The sysadmin's first rule of backups defines 
"valuable data" by the number of backups it's worth making of that data.  
If there's no backups, then by definition the data is worth less than the 
time/hassle/resources necessary to have that backup, because it's not a 
question of if, but rather when, something's going to go wrong with the 
working copy and it won't be available any longer.

Additional layers of backup and whether one keeps geographically 
separated off-site backups as well are simply extensions of the first-
level-backup case/rule.  The more valuable the data, the more backups 
it's worth having of it, and the more effort is justified in ensuring 
that single or even multiple disasters aren't going to leave no working 
backup.

With this view, it's perfectly fine to use btrfs raid56 mode for 
"valuable" data, because that data is backed up and that backup can be 
used as a fallback if necessary.  True, the "working copy" might not be 
as reliable as it is in some cases, but statistically, that simply brings 
the 50% chance of failure rate (or whatever other percentage chance you 
choose) closer, to say once a year, or once a month, rather than perhaps 
once or twice a decade.  Working copy failure is GOING to happen in any 
case, it's just a matter of playing the chance game as to when, and using 
a not yet fully demonstrated reliable filesystem mode simply brings up 
the chances a bit.

But if the data really *is* defined as &quo

Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 11:20 PM, waxhead wrote:
> 
> 
[...]
> 
> Ok, before attempting an answer I have to admit that I do not know enough 
> about how RAID56 is laid out on disk in BTRFS terms. Is data checksummed per 
> stripe or per disk? Is parity calculated on the data only or is it calculated 
> on the data+checksum ?!
> 

Data is checksummed per block. The parity is invisible to the checksum. The 
parity are allocated in an "address space" parallel to the data "address space" 
exposed by the BG.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead



Andrei Borzenkov wrote:

02.05.2018 21:17, waxhead wrote:

Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve?
To the best of my knowledge, nothing. In any case the data is
checksummed so it is impossible to return corrupted data (modulo bug
:-) ).


I am not a BTRFS dev, but this should be quite easy to answer.
Unless you checksum the parity there is no way to verify that
the data (parity) you use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because
the data is always checksummed. And in any case you must check the
data against its checksum.


What if you lost an entire disk?


How does it matter exactly? RAID is per chunk anyway.

It does not matter. I was wrong, got bitten by thinking about BTRFS 
"RAID5" as normal RAID5. Again a good reason to change the naming for it 
I think...



or had corruption for both data AND checksum?


By the same logic you may have corrupted parity and its checksum.


Yup. Indeed


How do you plan to safely reconstruct that without checksummed
parity?



Define "safely". The main problem of current RAID56 implementation is
that stripe is not updated atomically (at least, that is what I
understood from the past discussions) and this is not solved by having
extra parity checksum. So how exactly "safety" is improved here? You
still need overall checksum to verify result of reconstruction, what
exactly extra parity checksum buys you?


> [...]




Again - please describe when having parity checksum will be beneficial
over current implementation. You do not reconstruct anything as long as
all data strips are there, so parity checksum will not be used. If one
data strip fails (including checksum) it will be reconstructed and
verified. If parity itself is corrupted, checksum verification fails
(hopefully). How is it different from verifying parity checksum before
reconstructing? In both cases data cannot be reconstructed, end of story.


Ok, before attempting an answer I have to admit that I do not know 
enough about how RAID56 is laid out on disk in BTRFS terms. Is data 
checksummed per stripe or per disk? Is parity calculated on the data 
only or is it calculated on the data+checksum ?!



Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-02 13:25, Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:

So again, which problem would having the parity checksummed solve? To the 
best of my knowledge, nothing. In any case the data is checksummed so it is 
impossible to return corrupted data (modulo bug :-) ).

>>> I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
>>> checksum the parity there is no way to verify that the data (parity) 
>>> you use to reconstruct other data is correct.
>>
>> In any case you could catch that the computed data is wrong, because the data 
>> is always checksummed. And in any case you must check the data against its 
>> checksum.
>>
>> My point is that storing the checksum is a cost that you pay *every time*. 
>> Every time you update a part of a stripe you need to update the parity, and 
>> then in turn the parity checksum. It is not a problem of space occupied nor 
>> a computational problem. It is a problem of write amplification...
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the 
>> data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write amplification), to 
>> other side you have a gain (cpu-time) *only in case* of the parity is 
>> corrupted and you need it (eg. scrub or corrupted data)).
>>
>> IMHO the costs are far higher than the gain, and the likelihood of the gain is 
>> far lower compared to the likelihood (=100%, or always) of the cost.
> You do realize that a write is already rewriting checksums elsewhere? It 
> would be pretty trivial to make sure that the checksums for every part of a 
> stripe end up in the same metadata block, at which point the only cost is 
> computing the checksum (because when a checksum gets updated, the whole block 
> it's in gets rewritten, period, because that's how CoW works).
> 
> Looking at this another way (all the math below uses SI units):
> 
[...]
Good point: precomputing the checksum of the parity saves a lot of time for the 
scrub process. You can see this in a simpler way by saying that the parity 
calculation (which is dominated by the memory bandwidth) is like O(n) (where n 
is the number of disks); the parity checking (which again is dominated by the 
memory bandwidth) against a checksum is like O(1). And when the data written is 
two orders of magnitude less than the data stored, the effort required to 
precompute the checksum is negligible.

Anyway, my "rant" started when Duncan put the missing parity checksum alongside 
the write hole. The first might be a performance problem. Instead the write 
hole could lead to losing data. My intention was to highlight that the 
parity-checksum is not related to the reliability and safety of raid5/6.

> 
> So, lets look at data usage:
> 
> 1GB of data translates to 62500 16kB blocks of data, which equates to an 
> additional 15625 blocks for parity.  Adding parity checksums adds a 25% 
> overhead to checksums being written, but that actually doesn't translate to a 
> huge increase in the number of _blocks_ of checksums written.  One 16k block 
> can hold roughly 500 checksums, so it would take 125 blocks worth of 
> checksums without parity, and 157 (technically 156.25, but you can't write a 
> quarter block) with parity checksums. Thus, without parity checksums, writing 
> 1GB of data involves writing 78250 blocks, while doing the same with parity 
> checksums involves writing 78282 blocks, a net change of only 32 blocks, or 
> **0.0409%**.

How would you store the checksum?
I asked that because I am not sure that we could use the "standard" btrfs 
metadata to store the checksum of the parity. Doing so you could face some 
pathological effects like:
- update a block(1) in a stripe(1)
- update the parity of stripe(1) containing block(1)
- update the checksum of parity stripe (1), which is contained in another 
stripe(2) [**]

- update the parity of stripe (2) containing the checksum of parity stripe(1)
- update the checksum of parity stripe (2), which is contained in another 
stripe(3)

and so on...

[**] note that the checksum and the parity *have* to be in different 
stripes, otherwise you have a chicken/egg problem: compute the parity, then 
update the checksum, then update the parity again because the checksum 
changed

In order to avoid that, I fear that you can't store the checksums over a 
raid5/6 BG whose parity is itself checksummed.

It is a bit late and I am a bit tired, so maybe I am wrong; however, I fear 
that the above "write amplification problem" may be a big problem; a 
possible solution would be storing the checksums in an N-mirror BG (where N is 
1 for raid5, 2 for raid6)

BR
G.Baroncelli

-- 

Re: RAID56 - 6 parity raid

2018-05-02 Thread Austin S. Hemmelgarn

On 2018-05-02 13:25, Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
checksum the parity there is no way to verify that the data (parity) you 
use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because the data 
is always checksummed. And in any case you must check the data against its 
checksum.

My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...

The only gain is to avoid trying to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case!).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (cpu-time) *only in case* the parity is corrupted 
and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain 
is much lower compared to the likelihood (=100%, or always) of the cost.
You do realize that a write is already rewriting checksums elsewhere? 
It would be pretty trivial to make sure that the checksums for every 
part of a stripe end up in the same metadata block, at which point the 
only cost is computing the checksum (because when a checksum gets 
updated, the whole block it's in gets rewritten, period, because that's 
how CoW works).


Looking at this another way (all the math below uses SI units):

Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which 
gives you 40TB of usable space).  You're storing roughly 20TB of data on 
it, using a 16kB block size, and it sees about 1GB of writes a day, with 
no partial stripe writes.  You, for reasons of argument, want to scrub 
it every week, because the data in question matters a lot to you.


With a decent CPU, let's say you can compute 1.5GB/s worth of checksums, 
and can compute parity at a rate of 1.25GB/s (the ratio here is about 
the average across the almost 50 systems I have quick access to check, 
including a number of server and workstation systems less than a year 
old, though the numbers themselves are artificially low to accentuate 
the point here).


At this rate, scrubbing by computing parity requires processing:

* Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333 
seconds, or 222 minutes, or about 3.7 hours.
* Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000 
seconds, or 267 minutes, or roughly 4.4 hours.


So, over a week, you would be spending 8.1 hours processing data solely 
for data integrity, or roughly 4.8214% of your time.


Now assume instead that you're doing checksummed parity:

* Scrubbing data is the same, 3.7 hours.
* Scrubbing parity turns into computing checksums for 4TB of data, which 
would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.
* Computing parity for the 7GB of data you write each week takes 5.6 
_SECONDS_.


So, over a week, you would spend just over 4.58 hours processing data 
solely for data integrity, or roughly 2.7262% of your time.


So, in terms of just time spent, it's almost twice as fast to use 
checksummed parity (roughly 43% faster to be more specific).


So, let's look at data usage:

1GB of data translates to 62500 16kB blocks of data, which equates to 
an additional 15625 blocks for parity.  Adding parity checksums adds a 
25% overhead to checksums being written, but that actually doesn't 
translate to a huge increase in the number of _blocks_ of checksums 
written.  One 16k block can hold roughly 500 checksums, so it would take 
125 blocks worth of checksums without parity, and 157 (technically 
156.25, but you can't write a quarter block) with parity checksums. 
Thus, without parity checksums, writing 1GB of data involves writing 
78250 blocks, while doing the same with parity checksums involves 
writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.


Note that the difference in the amount of checksums written is a simple 
linear function, directly proportional to the amount of data being 
written, provided that all rewrites only rewrite full stripes (because 
that's equivalent, for this purpose, to just adding new data).  In other words, 
even if we were to increase the total amount of data that array was 
getting in a day, the net change from having parity checksumming would 
still stay within the range of 

Re: RAID56 - 6 parity raid

2018-05-02 Thread Gandalf Corvotempesta
On 05/02/2018 03:47 AM, Duncan wrote:
> Meanwhile, have you looked at zfs? Perhaps they have something like that?

Yes, I've looked at ZFS and I'm using it on some servers, but I don't like
it too much, for multiple reasons; for example:

1) it is not officially in the kernel; we have to build a module every time
with DKMS
2) it does not forgive: if you add the wrong device to a pool, you are
stuck, you can't remove it without migrating all data and creating a new
pool from scratch. If, by mistake, you add a single device to a RAID-Z3,
you totally lose the whole redundancy, and so on.
3) it doesn't support expanding a RAID-Z one disk at a time. If you want to
expand a RAIDZ, you have to create another pool and then stripe over it.

I'm new to BTRFS (in fact, I'm not using it) and I've seen on the status
page that "it's almost ready".
The only real missing part is a stable, secure and properly working RAID56,
so I'm wondering why most efforts aren't directed at fixing RAID56.

There are some environments where a RAID1/10 is too expensive and a RAID6
is mandatory,
but with the current state of RAID56, BTRFS can't be used for valuable data.

Also, I've seen that to fix the write hole, a dedicated disk is needed? Is
this true?
Can't I create a 6-disk RAID6 with only 6 disks and no write hole, like
with ZFS?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 08:17 PM, waxhead wrote:
> Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:

 So again, which problem would solve having the parity checksummed ? On the 
 best of my knowledge nothing. In any case the data is checksummed so it is 
 impossible to return corrupted data (modulo bug :-) ).

>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you 
>>> checksum the parity there is no way to verify that that the data (parity) 
>>> you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because the data 
>> is always checksummed. And in any case you must check the data against their 
>> checksum.
>>
> What if you lost an entire disk? or had corruption for both data AND 
> checksum? How do you plan to safely reconstruct that without checksummed 
> parity?

As a general rule, the returned data is *always* checked against its checksum. 
So in any case wrong data is never returned. Let me say it in other words: 
having the parity checksummed doesn't increase the reliability or the 
safety of the RAID rebuilding. I want to repeat it again: even if the parity 
is corrupted, the rebuilt (wrong) data fails the check against its checksum 
and it is not returned!

Back to your questions:

1) Losing 1 disk -> 1 fault
2) Losing both data and checksum -> 2 faults

RAID5 is single-fault proof. So if case #2 happens, raid5 can't protect you. 
However, BTRFS is capable of detecting that the data is wrong, thanks to the 
checksum.
In case #1 there is no problem, because for each stripe you have enough data 
to rebuild the missing part.

Because I have read several times that checksumming the parity would increase 
the reliability and/or the safety of the raid5/6 profiles, let me explain the 
logic:

read_from_disk() {
    data = read_data()
    if (data != ERROR && check_checksum(data))
        return data;

    /* checksum mismatch or data is missing */
    full_stripe = read_full_stripe()
    if (raid5_profile) {
        /* raid5 has only one way of rebuilding data */
        data = rebuild_raid5_data(full_stripe)
        if (data != ERROR && check_checksum(data)) {
            rebuild_stripe(data, full_stripe)
            return data;
        }
        /* parity and/or another data block is corrupted/missing */
        return ERROR
    }

    for_each raid6_rebuilding_strategies(full_stripe) {
        /*
         * raid6 might have more than one way of rebuilding data,
         * depending on the degree of failure
         */
        data = rebuild_raid6_data(full_stripe)
        if (data != ERROR && check_checksum(data)) {
            rebuild_stripe(data, full_stripe)
            /* data is good, return it */
            return data;
        }
    }
    return ERROR
}

In case of raid5 there is only one way of rebuilding the data. In case of 
raid6 with one fault there are several ways of rebuilding the data (however, 
in case of two failures, there is only one way). So more possibilities have 
to be tested when rebuilding the data.
If the parity is corrupted, the rebuilt data is corrupted too, and it fails 
the check against its checksum.


> 
>> My point is that storing the checksum is a cost that you pay *every time*. 
>> Every time you update a part of a stripe you need to update the parity, and 
>> then in turn the parity checksum. It is not a problem of space occupied nor 
>> a computational problem. It is a problem of write amplification...
> How much of a problem is this? no benchmarks have been run since the feature 
> is not yet there I suppose.

It is simple: for each stripe touched you need to update the parity(1); then 
you need to update the parity(1) checksum (which in turn requires an update 
of the parity(2) of the stripe where the parity(1) checksum is stored, which 
in turn requires updating the parity(2) checksum... and so on)

> 
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
> I'm not sure I can make out your argument here , but with RAID5/6 you don't 
> have another copy to restore from. You *have* to use the parity to 
> reconstruct data and it is a good thing if this data is trusted.
I never said the opposite

> 
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the 
>> data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write amplification), to 
>> other side you have a gain (cpu-time) *only in case* of the parity is 
>> corrupted and you need it (eg. scrub or corrupted data)).
>>
>> IMHO the cost are very higher than the gain, and the likelihood the gain is 
>> very lower compared to 

Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead

Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
checksum the parity there is no way to verify that the data (parity) you 
use to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because the data 
is always checksummed. And in any case you must check the data against its 
checksum.

What if you lost an entire disk? or had corruption for both data AND 
checksum? How do you plan to safely reconstruct that without checksummed 
parity?



My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...
How much of a problem is this? No benchmarks have been run, since the 
feature is not there yet, I suppose.




The only gain is to avoid trying to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
I'm not sure I can make out your argument here, but with RAID5/6 you 
don't have another copy to restore from. You *have* to use the parity to 
reconstruct data, and it is a good thing if this data is trusted.



and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case!).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (cpu-time) *only in case* the parity is corrupted 
and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain 
is much lower compared to the likelihood (=100%, or always) of the cost.

Then run benchmarks, and consider making parity checksums optional 
(but pretty please dipped in syrup with sugar on top - keep it on by 
default).




Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 06:55 PM, waxhead wrote:
>>
>> So again, which problem would solve having the parity checksummed ? On the 
>> best of my knowledge nothing. In any case the data is checksummed so it is 
>> impossible to return corrupted data (modulo bug :-) ).
>>
> I am not a BTRFS dev , but this should be quite easy to answer. Unless you 
> checksum the parity there is no way to verify that that the data (parity) you 
> use to reconstruct other data is correct.

In any case you could catch that the computed data is wrong, because the data 
is always checksummed. And in any case you must check the data against its 
checksum.

My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...

The only gain is to avoid trying to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case!).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (cpu-time) *only in case* the parity is corrupted 
and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain 
is much lower compared to the likelihood (=100%, or always) of the cost.


BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RAID56 - 6 parity raid

2018-05-02 Thread Austin S. Hemmelgarn

On 2018-05-02 12:55, waxhead wrote:

Goffredo Baroncelli wrote:

Hi
On 05/02/2018 03:47 AM, Duncan wrote:

Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:


Hi to all, I've found some patches from Andrea Mazzoleni that add
support for up to 6-parity raid.
Why weren't these merged?
With modern disk sizes, having something greater than 2 parities would be
great.

1) [...] the parity isn't checksummed, 


Why is the fact that the parity is not checksummed a problem?
I have read several times that this is a problem. However, each time, the 
thread reached the conclusion that... it is not a problem.


So again, which problem would having the parity checksummed solve? To 
the best of my knowledge, none. In any case the data is checksummed, 
so it is impossible to return corrupted data (modulo bugs :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless 
you checksum the parity there is no way to verify that the data 
(parity) you use to reconstruct other data is correct.
While this is the biggest benefit (and it's a _huge_ one, because it 
means you don't have to waste time doing the parity reconstruction if 
you know the result won't be right), there's also a rather nice benefit 
for scrubbing the array, namely that you don't have to recompute parity 
to check if it's right or not (and thus can avoid wasting time 
recomputing it for every stripe in the common case of almost every 
stripe being correct).


On the other side, having the parity checksummed would increase both the 
code complexity and the write amplification, because every time a part of 
the stripe is touched, not only does the parity have to be updated, but 
the checksum does too.
Which is a good thing. BTRFS' main selling point is that you can feel 
pretty confident that whatever you put in is exactly what you get out.



Re: RAID56 - 6 parity raid

2018-05-02 Thread waxhead

Goffredo Baroncelli wrote:

Hi
On 05/02/2018 03:47 AM, Duncan wrote:

Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:


Hi to all, I've found some patches from Andrea Mazzoleni that add
support for up to 6-parity raid.
Why weren't these merged?
With modern disk sizes, having something greater than 2 parities would be
great.

1) [...] the parity isn't checksummed, 


Why is the fact that the parity is not checksummed a problem?
I have read several times that this is a problem. However, each time, the 
thread reached the conclusion that... it is not a problem.

So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).

I am not a BTRFS dev, but this should be quite easy to answer. Unless 
you checksum the parity there is no way to verify that the data 
(parity) you use to reconstruct other data is correct.



On the other side, having the parity checksummed would increase both the code 
complexity and the write amplification, because every time a part of the 
stripe is touched, not only does the parity have to be updated, but the 
checksum does too.
Which is a good thing. BTRFS' main selling point is that you can feel 
pretty confident that whatever you put in is exactly what you get out.



Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
Hi
On 05/02/2018 03:47 AM, Duncan wrote:
> Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
> excerpted:
> 
>> Hi to all I've found some patches from Andrea Mazzoleni that adds
>> support up to 6 parity raid.
>> Why these are wasn't merged ?
>> With modern disk size, having something greater than 2 parity, would be
>> great.
> 1) [...] the parity isn't checksummed, 

Why is the fact that the parity is not checksummed a problem?
I have read several times that this is a problem. However, each time, the 
thread reached the conclusion that... it is not a problem.

So again, which problem would having the parity checksummed solve? To the best 
of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).

On the other side, having the parity checksummed would increase both the code 
complexity and the write amplification, because every time a part of the 
stripe is touched, not only does the parity have to be updated, but the 
checksum does too.


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RAID56 - 6 parity raid

2018-05-01 Thread Duncan
Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:

> Hi to all I've found some patches from Andrea Mazzoleni that adds
> support up to 6 parity raid.
> Why these are wasn't merged ?
> With modern disk size, having something greater than 2 parity, would be
> great.

1) Btrfs parity-raid was known to be seriously broken until quite 
recently (and still has the common parity-raid write-hole, which is more 
serious on btrfs because btrfs otherwise goes to some lengths to ensure 
data/metadata integrity via checksumming and verification, and the parity 
isn't checksummed, risking even old data due to the write hole, but there 
are a number of proposals to fix that), and piling even more not well 
tested patches on top was _not_ the way toward a solution.

2) Btrfs features in general have taken longer to merge and stabilize 
than one might expect, and parity-raid has been a prime example, with the 
original roadmap calling for parity-raid merge back in the 3.5 timeframe 
or so... partial/runtime (not full recovery) code was finally merged ~3 
years later in (IIRC) 3.19, took several development cycles for the 
initial critical bugs to be worked out but by 4.2 or so was starting to 
look good, then more bugs were found and reported, that took several more 
years to fix, tho IIRC LTS-4.14 has them.

Meanwhile, consider that N-way-mirroring was fast-path roadmapped for 
"right after raid56 mode" (because some of its code depends on that), so it 
was originally expected in 3.6 or so...  As someone who had been wanting 
to use /that/, I personally know the pain of "still waiting".

And that was "fast-pathed".

So even if the multi-way-parity patches were on the "fast" path, it's 
only "now" (for relative values of now, for argument say by 4.20/5.0 or 
whatever it ends up being called) that such a thing could be reasonably 
considered.


3) AFAIK none of the btrfs devs have flat rejected the idea, but btrfs 
remains development opportunity rich and implementing dev poor... there's 
likely 20 years or more of "good" ideas out there.  And the N-way-parity-
raid patches haven't hit any of the current devs' (or their employers') 
"personal itch that needs to be scratched" interest points, so while it 
certainly does remain a "nice idea", given the implementation timeline 
history for even 'fast-pathed" ideas, realistically we're looking at at 
least a decade out.  But with the practical projection horizon no more 
than 5-7 years out (beyond that other, unpredicted, developments, are 
likely to change things so much that projection is effectively 
impossible), in practice, a decade out is "bluesky", aka "it'd be nice to 
have someday, but it's not a priority, and with current developer 
manpower, it's unlikely to happen any time in the practically projectable 
future."

4) Of course all that's subject to no major new btrfs developer (or 
sponsor) making it a high priority, but even should such a developer (and/
or sponsor) appear, they'd probably need to spend at least two years 
coming up to speed with the code first, fixing normal bugs and improving 
the existing code quality, then post the updated and rebased N-way-parity 
patches for discussion, and get them roadmapped for merge probably some 
years later due to other then-current project feature dependencies.

So even if the N-way-parity patches became some new developer's (or 
sponsor's) personal itch to scratch, by the time they came up to speed 
and the code was actually merged, there's no realistic projection that it 
would be in under 5 years, plus another couple to stabilize, so at least 
7 years to properly usable stability.  So even then, we're already at the 
5-7 years practical projectability limit.


Meanwhile, have you looked at zfs?  Perhaps they have something like 
that?  And there's also a new(?) one, stratis, AFAIK commercially 
sponsored and device-mapper based, that I saw an article on recently, tho 
I've seen/heard no kernel-community discussion on it (there's a good 
chance followup here will change that if it's worth discussing, as 
there's several folks here for whom knowing about such things is part of 
their job) and no other articles (besides the pt 1 of the series 
mentioned below), so for all I know it's pie-in-the-sky or still new 
enough it'd be 5-7 years before it can be used in practice, as well.  But 
assuming it's a viable project, presumably it would get support if device-
mapper did/has.

The stratis article I saw (apparently part 2 in a series):
https://opensource.com/article/18/4/stratis-lessons-learned

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



RAID56 - 6 parity raid

2018-05-01 Thread Gandalf Corvotempesta
Hi to all,
I've found some patches from Andrea Mazzoleni that add support for up to
6-parity raid.
Why weren't these merged?
With modern disk sizes, having something greater than 2 parities would be
great.


Re: [PATCH v2] Btrfs: scrub: batch rebuild for raid56

2018-03-09 Thread David Sterba
On Wed, Mar 07, 2018 at 12:08:09PM -0700, Liu Bo wrote:
> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> process may be triggered on every block on a same stripe.
> 
> A typical example would be that when we're replacing a disappeared disk,
> all reads on the disks get -EIO, every block (size is 4K if blocksize is
> 4K) would go thru these,
> 
> scrub_handle_errored_block
>   scrub_recheck_block # re-read pages one by one
>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> page by page
> 
> Although with raid56 stripe cache most of reads during rebuild can be
> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> be done $(BTRFS_STRIPE_LEN / blocksize) times.
> 
> This makes it smarter by doing raid56 scrub/replace on stripe length.
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
> v2: - Place bio allocation in code statement.
> - Get rid of bio_set_op_attrs.
> - Add SOB.

Added to next, thanks.


Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-09 Thread David Sterba
On Wed, Mar 07, 2018 at 05:22:08PM +0200, Nikolay Borisov wrote:
> 
> 
> On  7.03.2018 16:43, David Sterba wrote:
> > On Tue, Mar 06, 2018 at 11:22:21AM -0700, Liu Bo wrote:
> >> On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
> >>> On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> >>>> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> >>>> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> >>>> process may be triggered on every block on a same stripe.
> >>>>
> >>>> A typical example would be that when we're replacing a disappeared disk,
> >>>> all reads on the disks get -EIO, every block (size is 4K if blocksize is
> >>>> 4K) would go thru these,
> >>>>
> >>>> scrub_handle_errored_block
> >>>>   scrub_recheck_block # re-read pages one by one
> >>>>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> >>>> page by page
> >>>>
> >>>> Although with raid56 stripe cache most of reads during rebuild can be
> >>>> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> >>>> be done $(BTRFS_STRIPE_LEN / blocksize) times.
> >>>>
> >>>> This makes it less stupid by doing raid56 scrub/replace on stripe length.
> >>>
> >>> missing s-o-b
> >>
> >> I'm surprised that checkpatch.pl didn't complain.
> 
> I have written a python script that can scrape the mailing list and run
> checkpatch (and any other software deemed appropriate) on posted patches
> and reply back with results. However, I haven't really activated it, I
> guess if people think there is merit in it I could hook it up to the
> mailing list :)

If we and checkpatch agree on the issues to report then yes, but I think
it would be confusing when we'd have to tell people to ignore some of
the warnings. I think we'd need to tune checkpatch for our needs, I
would not mind testing it locally. I'd start with catching typos in the
changelogs and comments, that's something that happens all the time and
can be automated. Next, implement AI that will try to understand the
changelog and tell people what's missing.


[PATCH v2] Btrfs: scrub: batch rebuild for raid56

2018-03-07 Thread Liu Bo
In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
as unit, however, scrub_extent() sets blocksize as unit, so rebuild
process may be triggered on every block on a same stripe.

A typical example would be that when we're replacing a disappeared disk,
all reads on the disks get -EIO, every block (size is 4K if blocksize is
4K) would go thru these,

scrub_handle_errored_block
  scrub_recheck_block # re-read pages one by one
  scrub_recheck_block # rebuild by calling raid56_parity_recover()
page by page

Although with raid56 stripe cache most of reads during rebuild can be
avoided, the parity recover calculation(xor or raid6 algorithms) needs to
be done $(BTRFS_STRIPE_LEN / blocksize) times.

This makes it smarter by doing raid56 scrub/replace on stripe length.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
v2: - Place bio allocation in code statement.
- Get rid of bio_set_op_attrs.
- Add SOB.

 fs/btrfs/scrub.c | 79 +++-
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec56f33..3ccabad 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1718,6 +1718,45 @@ static int scrub_submit_raid56_bio_wait(struct 
btrfs_fs_info *fs_info,
return blk_status_to_errno(bio->bi_status);
 }
 
+static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
+ struct scrub_block *sblock)
+{
+   struct scrub_page *first_page = sblock->pagev[0];
+   struct bio *bio;
+   int page_num;
+
+   /* All pages in sblock belong to the same stripe on the same device. */
+   ASSERT(first_page->dev);
+   if (!first_page->dev->bdev)
+   goto out;
+
+   bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);
+   bio_set_dev(bio, first_page->dev->bdev);
+
+   for (page_num = 0; page_num < sblock->page_count; page_num++) {
+   struct scrub_page *page = sblock->pagev[page_num];
+
+   WARN_ON(!page->page);
+   bio_add_page(bio, page->page, PAGE_SIZE, 0);
+   }
+
+   if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
+   bio_put(bio);
+   goto out;
+   }
+
+   bio_put(bio);
+
+   scrub_recheck_block_checksum(sblock);
+
+   return;
+out:
+   for (page_num = 0; page_num < sblock->page_count; page_num++)
+   sblock->pagev[page_num]->io_error = 1;
+
+   sblock->no_io_error_seen = 0;
+}
+
 /*
  * this function will check the on disk data for checksum errors, header
  * errors and read I/O errors. If any I/O errors happen, the exact pages
@@ -1733,6 +1772,10 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 
sblock->no_io_error_seen = 1;
 
+   /* short cut for raid56 */
+   if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
+   return scrub_recheck_block_on_raid56(fs_info, sblock);
+
for (page_num = 0; page_num < sblock->page_count; page_num++) {
struct bio *bio;
struct scrub_page *page = sblock->pagev[page_num];
@@ -1748,19 +1791,12 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
bio_set_dev(bio, page->dev->bdev);
 
bio_add_page(bio, page->page, PAGE_SIZE, 0);
-   if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
-   if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
-   page->io_error = 1;
-   sblock->no_io_error_seen = 0;
-   }
-   } else {
-   bio->bi_iter.bi_sector = page->physical >> 9;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio->bi_iter.bi_sector = page->physical >> 9;
+   bio->bi_opf = REQ_OP_READ;
 
-   if (btrfsic_submit_bio_wait(bio)) {
-   page->io_error = 1;
-   sblock->no_io_error_seen = 0;
-   }
+   if (btrfsic_submit_bio_wait(bio)) {
+   page->io_error = 1;
+   sblock->no_io_error_seen = 0;
}
 
bio_put(bio);
@@ -2728,7 +2764,8 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u8 *csum)
 }
 
 /* scrub extent tries to collect up to 64 kB for each bio */
-static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
+static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
+   u64 logical, u64 len,
u64 physical, struct btrfs_device *dev, u64 flags,
u64 gen, int mirror_num, u64 phy

Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-07 Thread Nikolay Borisov


On  7.03.2018 16:43, David Sterba wrote:
> On Tue, Mar 06, 2018 at 11:22:21AM -0700, Liu Bo wrote:
>> On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
>>> On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
>>>> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
>>>> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
>>>> process may be triggered on every block on a same stripe.
>>>>
>>>> A typical example would be that when we're replacing a disappeared disk,
>>>> all reads on the disks get -EIO, every block (size is 4K if blocksize is
>>>> 4K) would go thru these,
>>>>
>>>> scrub_handle_errored_block
>>>>   scrub_recheck_block # re-read pages one by one
>>>>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
>>>> page by page
>>>>
>>>> Although with raid56 stripe cache most of reads during rebuild can be
>>>> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
>>>> be done $(BTRFS_STRIPE_LEN / blocksize) times.
>>>>
>>>> This makes it less stupid by doing raid56 scrub/replace on stripe length.
>>>
>>> missing s-o-b
>>
>> I'm surprised that checkpatch.pl didn't complain.

I have written a python script that can scrape the mailing list and run
checkpatch (and any other software deemed appropriate) on posted patches
and reply back with results. However, I haven't really activated it, I
guess if people think there is merit in it I could hook it up to the
mailing list :)

> 
> Never mind, I'm your true checkpatch.cz
> 
> (http://checkpatch.pl actually exists)


Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-07 Thread David Sterba
On Tue, Mar 06, 2018 at 11:22:21AM -0700, Liu Bo wrote:
> On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
> > On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> > > In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> > > as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> > > process may be triggered on every block on a same stripe.
> > > 
> > > A typical example would be that when we're replacing a disappeared disk,
> > > all reads on the disks get -EIO, every block (size is 4K if blocksize is
> > > 4K) would go thru these,
> > > 
> > > scrub_handle_errored_block
> > >   scrub_recheck_block # re-read pages one by one
> > >   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> > > page by page
> > > 
> > > Although with raid56 stripe cache most of reads during rebuild can be
> > > avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> > > be done $(BTRFS_STRIPE_LEN / blocksize) times.
> > > 
> > > This makes it less stupid by doing raid56 scrub/replace on stripe length.
> > 
> > missing s-o-b
> 
> I'm surprised that checkpatch.pl didn't complain.

Never mind, I'm your true checkpatch.cz

(http://checkpatch.pl actually exists)


Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-06 Thread Liu Bo
On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
> On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> > In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> > as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> > process may be triggered on every block on a same stripe.
> > 
> > A typical example would be that when we're replacing a disappeared disk,
> > all reads on the disks get -EIO, every block (size is 4K if blocksize is
> > 4K) would go thru these,
> > 
> > scrub_handle_errored_block
> >   scrub_recheck_block # re-read pages one by one
> >   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> > page by page
> > 
> > Although with raid56 stripe cache most of reads during rebuild can be
> > avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> > be done $(BTRFS_STRIPE_LEN / blocksize) times.
> > 
> > This makes it less stupid by doing raid56 scrub/replace on stripe length.
> 
> missing s-o-b
>

I'm surprised that checkpatch.pl didn't complain.

> > ---
> >  fs/btrfs/scrub.c | 78 
> > +++-
> >  1 file changed, 60 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 9882513..e3203a1 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -1718,6 +1718,44 @@ static int scrub_submit_raid56_bio_wait(struct 
> > btrfs_fs_info *fs_info,
> > return blk_status_to_errno(bio->bi_status);
> >  }
> >  
> > +static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
> > + struct scrub_block *sblock)
> > +{
> > +   struct scrub_page *first_page = sblock->pagev[0];
> > +   struct bio *bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);
> 
> nontrivial initializations (variable to variable) are better put into
> the statement section.
>

OK.

> > +   int page_num;
> > +
> > +   /* All pages in sblock belongs to the same stripe on the same device. */
> > +   ASSERT(first_page->dev);
> > +   if (first_page->dev->bdev == NULL)
> > +   goto out;
> > +
> > +   bio_set_dev(bio, first_page->dev->bdev);
> > +
> > +   for (page_num = 0; page_num < sblock->page_count; page_num++) {
> > +   struct scrub_page *page = sblock->pagev[page_num];
> > +
> > +   WARN_ON(!page->page);
> > +   bio_add_page(bio, page->page, PAGE_SIZE, 0);
> > +   }
> > +
> > +   if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
> > +   bio_put(bio);
> > +   goto out;
> > +   }
> > +
> > +   bio_put(bio);
> > +
> > +   scrub_recheck_block_checksum(sblock);
> > +
> > +   return;
> > +out:
> > +   for (page_num = 0; page_num < sblock->page_count; page_num++)
> > +   sblock->pagev[page_num]->io_error = 1;
> > +
> > +   sblock->no_io_error_seen = 0;
> > +}
> > +
> >  /*
> >   * this function will check the on disk data for checksum errors, header
> >   * errors and read I/O errors. If any I/O errors happen, the exact pages
> > @@ -1733,6 +1771,10 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> > *fs_info,
> >  
> > sblock->no_io_error_seen = 1;
> >  
> > +   /* short cut for raid56 */
> > +   if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
> > +   return scrub_recheck_block_on_raid56(fs_info, sblock);
> > +
> > for (page_num = 0; page_num < sblock->page_count; page_num++) {
> > struct bio *bio;
> > struct scrub_page *page = sblock->pagev[page_num];
> > @@ -1748,19 +1790,12 @@ static void scrub_recheck_block(struct 
> > btrfs_fs_info *fs_info,
> > bio_set_dev(bio, page->dev->bdev);
> >  
> > bio_add_page(bio, page->page, PAGE_SIZE, 0);
> > -   if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
> > -   if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
> > -   page->io_error = 1;
> > -   sblock->no_io_error_seen = 0;
> > -   }
> > -   } else {
> > -   bio->bi_iter.bi_sector = page->physical >> 9;
> > -   bio_set_op_attrs(bio, REQ_OP_READ, 0);
> > +   bio->bi_iter.bi

Re: [PATCH] Btrfs: dev-replace: make sure target is identical to source when raid56 rebuild fails

2018-03-06 Thread David Sterba
On Fri, Mar 02, 2018 at 04:10:41PM -0700, Liu Bo wrote:
> In the last step of scrub_handle_error_block, we try to combine good
> copies on all possible mirrors, this works fine for raid1 and raid10,
> but not for raid56 as it's doing parity rebuild.
> 
> If parity rebuild doesn't get back with correct data which matches its
> checksum, in case of replace we'd rather write what is stored in the
> source device than the data calculated from parity.
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>

Added to next, thanks.


Re: [PATCH] Btrfs: raid56: remove redundant async_missing_raid56

2018-03-06 Thread David Sterba
On Fri, Mar 02, 2018 at 04:10:39PM -0700, Liu Bo wrote:
> async_missing_raid56() is identical to async_read_rebuild().
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-06 Thread David Sterba
On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> process may be triggered on every block on a same stripe.
> 
> A typical example would be that when we're replacing a disappeared disk,
> all reads on the disks get -EIO, every block (size is 4K if blocksize is
> 4K) would go thru these,
> 
> scrub_handle_errored_block
>   scrub_recheck_block # re-read pages one by one
>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> page by page
> 
> Although with raid56 stripe cache most of reads during rebuild can be
> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> be done $(BTRFS_STRIPE_LEN / blocksize) times.
> 
> This makes it less stupid by doing raid56 scrub/replace on stripe length.

missing s-o-b

> ---
>  fs/btrfs/scrub.c | 78 
> +++-
>  1 file changed, 60 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 9882513..e3203a1 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1718,6 +1718,44 @@ static int scrub_submit_raid56_bio_wait(struct 
> btrfs_fs_info *fs_info,
>   return blk_status_to_errno(bio->bi_status);
>  }
>  
> +static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
> +   struct scrub_block *sblock)
> +{
> + struct scrub_page *first_page = sblock->pagev[0];
> + struct bio *bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);

nontrivial initializations (variable to variable) are better put into
the statement section.

> + int page_num;
> +
> + /* All pages in sblock belongs to the same stripe on the same device. */
> + ASSERT(first_page->dev);
> + if (first_page->dev->bdev == NULL)
> + goto out;
> +
> + bio_set_dev(bio, first_page->dev->bdev);
> +
> + for (page_num = 0; page_num < sblock->page_count; page_num++) {
> + struct scrub_page *page = sblock->pagev[page_num];
> +
> + WARN_ON(!page->page);
> + bio_add_page(bio, page->page, PAGE_SIZE, 0);
> + }
> +
> + if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
> + bio_put(bio);
> + goto out;
> + }
> +
> + bio_put(bio);
> +
> + scrub_recheck_block_checksum(sblock);
> +
> + return;
> +out:
> + for (page_num = 0; page_num < sblock->page_count; page_num++)
> + sblock->pagev[page_num]->io_error = 1;
> +
> + sblock->no_io_error_seen = 0;
> +}
> +
>  /*
>   * this function will check the on disk data for checksum errors, header
>   * errors and read I/O errors. If any I/O errors happen, the exact pages
> @@ -1733,6 +1771,10 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> *fs_info,
>  
>   sblock->no_io_error_seen = 1;
>  
> + /* short cut for raid56 */
> + if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
> + return scrub_recheck_block_on_raid56(fs_info, sblock);
> +
>   for (page_num = 0; page_num < sblock->page_count; page_num++) {
>   struct bio *bio;
>   struct scrub_page *page = sblock->pagev[page_num];
> @@ -1748,19 +1790,12 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> *fs_info,
>   bio_set_dev(bio, page->dev->bdev);
>  
>   bio_add_page(bio, page->page, PAGE_SIZE, 0);
> - if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
> - if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
> - page->io_error = 1;
> - sblock->no_io_error_seen = 0;
> - }
> - } else {
> - bio->bi_iter.bi_sector = page->physical >> 9;
> - bio_set_op_attrs(bio, REQ_OP_READ, 0);
> + bio->bi_iter.bi_sector = page->physical >> 9;
> + bio_set_op_attrs(bio, REQ_OP_READ, 0);

https://elixir.bootlin.com/linux/latest/source/include/linux/blk_types.h#L270

bio_set_op_attrs should not be used

>  
> - if (btrfsic_submit_bio_wait(bio)) {
> - page->io_error = 1;
> - sblock->no_io_error_seen = 0;
> - }
> + if (btrfsic_submit_bio_wait(bio)) {
> + page->io_error = 1;
> +   

[PATCH] Btrfs: dev-replace: make sure target is identical to source when raid56 rebuild fails

2018-03-02 Thread Liu Bo
In the last step of scrub_handle_error_block, we try to combine good
copies on all possible mirrors, this works fine for raid1 and raid10,
but not for raid56 as it's doing parity rebuild.

If parity rebuild doesn't get back with correct data which matches its
checksum, in case of replace we'd rather write what is stored in the
source device than the data calculated from parity.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/scrub.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1b5ce2f..f449dc6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1412,8 +1412,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
if (!page_bad->io_error && !sctx->is_dev_replace)
continue;
 
-   /* try to find no-io-error page in mirrors */
-   if (page_bad->io_error) {
+   if (scrub_is_page_on_raid56(sblock_bad->pagev[0])) {
+   /*
+* In case of dev replace, if raid56 rebuild process
+* didn't work out correct data, then copy the content
+* in sblock_bad to make sure target device is identical
+* to source device, instead of writing garbage data in
+* sblock_for_recheck array to target device.
+*/
+   sblock_other = NULL;
+   } else if (page_bad->io_error) {
+   /* try to find no-io-error page in mirrors */
for (mirror_index = 0;
 mirror_index < BTRFS_MAX_MIRRORS &&
 sblocks_for_recheck[mirror_index].page_count > 0;
-- 
2.9.4



[PATCH] Btrfs: raid56: remove redundant async_missing_raid56

2018-03-02 Thread Liu Bo
async_missing_raid56() is identical to async_read_rebuild().

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 18 +-
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index bb8a3c5..efb42dc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2766,24 +2766,8 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
return rbio;
 }
 
-static void missing_raid56_work(struct btrfs_work *work)
-{
-   struct btrfs_raid_bio *rbio;
-
-   rbio = container_of(work, struct btrfs_raid_bio, work);
-   __raid56_parity_recover(rbio);
-}
-
-static void async_missing_raid56(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper,
-   missing_raid56_work, NULL, NULL);
-
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio)
 {
if (!lock_stripe_add(rbio))
-   async_missing_raid56(rbio);
+   async_read_rebuild(rbio);
 }
-- 
2.9.4



[PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-02 Thread Liu Bo
In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
as unit, however, scrub_extent() sets blocksize as unit, so rebuild
process may be triggered on every block on a same stripe.

A typical example would be that when we're replacing a disappeared disk,
all reads on the disks get -EIO, every block (size is 4K if blocksize is
4K) would go thru these,

scrub_handle_errored_block
  scrub_recheck_block # re-read pages one by one
  scrub_recheck_block # rebuild by calling raid56_parity_recover()
page by page

Although with raid56 stripe cache most of reads during rebuild can be
avoided, the parity recovery calculation (xor or raid6 algorithms) needs to
be done $(BTRFS_STRIPE_LEN / blocksize) times.

This makes it less stupid by doing raid56 scrub/replace on stripe length.
---
 fs/btrfs/scrub.c | 78 +++-
 1 file changed, 60 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9882513..e3203a1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1718,6 +1718,44 @@ static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
return blk_status_to_errno(bio->bi_status);
 }
 
+static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
+ struct scrub_block *sblock)
+{
+   struct scrub_page *first_page = sblock->pagev[0];
+   struct bio *bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);
+   int page_num;
+
+   /* All pages in sblock belong to the same stripe on the same device. */
+   ASSERT(first_page->dev);
+   if (first_page->dev->bdev == NULL)
+   goto out;
+
+   bio_set_dev(bio, first_page->dev->bdev);
+
+   for (page_num = 0; page_num < sblock->page_count; page_num++) {
+   struct scrub_page *page = sblock->pagev[page_num];
+
+   WARN_ON(!page->page);
+   bio_add_page(bio, page->page, PAGE_SIZE, 0);
+   }
+
+   if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
+   bio_put(bio);
+   goto out;
+   }
+
+   bio_put(bio);
+
+   scrub_recheck_block_checksum(sblock);
+
+   return;
+out:
+   for (page_num = 0; page_num < sblock->page_count; page_num++)
+   sblock->pagev[page_num]->io_error = 1;
+
+   sblock->no_io_error_seen = 0;
+}
+
 /*
  * this function will check the on disk data for checksum errors, header
  * errors and read I/O errors. If any I/O errors happen, the exact pages
@@ -1733,6 +1771,10 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 
sblock->no_io_error_seen = 1;
 
+   /* short cut for raid56 */
+   if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
+   return scrub_recheck_block_on_raid56(fs_info, sblock);
+
for (page_num = 0; page_num < sblock->page_count; page_num++) {
struct bio *bio;
struct scrub_page *page = sblock->pagev[page_num];
@@ -1748,19 +1790,12 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
bio_set_dev(bio, page->dev->bdev);
 
bio_add_page(bio, page->page, PAGE_SIZE, 0);
-   if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
-   if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
-   page->io_error = 1;
-   sblock->no_io_error_seen = 0;
-   }
-   } else {
-   bio->bi_iter.bi_sector = page->physical >> 9;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio->bi_iter.bi_sector = page->physical >> 9;
+   bio_set_op_attrs(bio, REQ_OP_READ, 0);
 
-   if (btrfsic_submit_bio_wait(bio)) {
-   page->io_error = 1;
-   sblock->no_io_error_seen = 0;
-   }
+   if (btrfsic_submit_bio_wait(bio)) {
+   page->io_error = 1;
+   sblock->no_io_error_seen = 0;
}
 
bio_put(bio);
@@ -2728,7 +2763,8 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u8 *csum)
 }
 
 /* scrub extent tries to collect up to 64 kB for each bio */
-static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
+static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
+   u64 logical, u64 len,
u64 physical, struct btrfs_device *dev, u64 flags,
u64 gen, int mirror_num, u64 physical_for_dev_replace)
 {
@@ -2737,13 +2773,19 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
u32 blocksize;

Re: [PATCH] Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all

2018-01-18 Thread David Sterba
On Fri, Jan 12, 2018 at 06:07:01PM -0700, Liu Bo wrote:
> Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it
> will never be a BIO_CLONED bio, and since this is called by end_io
> functions, bio->bi_iter.bi_size is zero, we mustn't use
> bio_for_each_segment() as that is a no-op if bi_size is zero.
> 
> Fixes: 6592e58c6b68e61f003a01ba29a3716e7e2e9484 ("Btrfs: fix write corruption 
> due to bio cloning on raid5/6")
> Cc: <sta...@vger.kernel.org> # v4.12-rc6+
> Signed-off-by: Liu Bo <bo.li@oracle.com>

Tested and added to 4.16 queue, thanks.


[PATCH] Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all

2018-01-12 Thread Liu Bo
Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it
will never be a BIO_CLONED bio, and since this is called by end_io
functions, bio->bi_iter.bi_size is zero, we mustn't use
bio_for_each_segment() as that is a no-op if bi_size is zero.

Fixes: 6592e58c6b68e61f003a01ba29a3716e7e2e9484 ("Btrfs: fix write corruption due to bio cloning on raid5/6")
Cc: <sta...@vger.kernel.org> # v4.12-rc6+
Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 765af6a..56ae5bd 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1442,14 +1442,13 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
  */
 static void set_bio_pages_uptodate(struct bio *bio)
 {
-   struct bio_vec bvec;
-   struct bvec_iter iter;
+   struct bio_vec *bvec;
+   int i;
 
-   if (bio_flagged(bio, BIO_CLONED))
-   bio->bi_iter = btrfs_io_bio(bio)->iter;
+   ASSERT(!bio_flagged(bio, BIO_CLONED));
 
-   bio_for_each_segment(bvec, bio, iter)
-   SetPageUptodate(bvec.bv_page);
+   bio_for_each_segment_all(bvec, bio, i)
+   SetPageUptodate(bvec->bv_page);
 }
 
 /*
-- 
2.9.4



[PATCH v2] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2018-01-09 Thread Liu Bo
Before rbio_orig_end_io() goes to free the rbio, the rbio may get merged
with more bios from other rbios and rbio->bio_list becomes non-empty;
in that case, those newly merged bios don't end properly.

Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.

It should only happen in error-out cases, the normal path of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merge before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.

Reported-by: Jérôme Carretero <cj...@zougloub.eu>
Signed-off-by: Liu Bo <bo.li@oracle.com>
---
v2: - Remove the usage of spin_lock as there is a chance of deadlock in
  interrupt context; it's reported by lockdep, although it'd never
  happen because we've taken care of it by saving irq flags at all
  places.
- Update commit log and comments of code to explain the new idea.
- This has been tested against btrfs/011 for 50 times.

 fs/btrfs/raid56.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7747323..b2b426d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -864,10 +864,17 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
kfree(rbio);
 }
 
-static void free_raid_bio(struct btrfs_raid_bio *rbio)
+static void rbio_endio_bio_list(struct bio *cur, blk_status_t err)
 {
-   unlock_stripe(rbio);
-   __free_raid_bio(rbio);
+   struct bio *next;
+
+   while (cur) {
+   next = cur->bi_next;
+   cur->bi_next = NULL;
+   cur->bi_status = err;
+   bio_endio(cur);
+   cur = next;
+   }
 }
 
 /*
@@ -877,20 +884,26 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
 static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 {
struct bio *cur = bio_list_get(&rbio->bio_list);
-   struct bio *next;
+   struct bio *extra;
 
if (rbio->generic_bio_cnt)
btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
 
-   free_raid_bio(rbio);
+   /*
+* At this moment, rbio->bio_list is empty, however since rbio does not
+* always have RBIO_RMW_LOCKED_BIT set and rbio is still linked on the
+* hash list, rbio may be merged with others so that rbio->bio_list
+* becomes non-empty.
+* Once unlock_stripe() is done, rbio->bio_list will not be updated any
+* more and we can call bio_endio() on all queued bios.
+*/
+   unlock_stripe(rbio);
+   extra = bio_list_get(&rbio->bio_list);
+   __free_raid_bio(rbio);
 
-   while (cur) {
-   next = cur->bi_next;
-   cur->bi_next = NULL;
-   cur->bi_status = err;
-   bio_endio(cur);
-   cur = next;
-   }
+   rbio_endio_bio_list(cur, err);
+   if (extra)
+   rbio_endio_bio_list(extra, err);
 }
 
 /*
-- 
2.9.4



Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2018-01-05 Thread Filipe Manana
On Wed, Jan 3, 2018 at 3:39 PM, Timofey Titovets <nefelim...@gmail.com> wrote:
> 2018-01-03 14:40 GMT+03:00 Filipe Manana <fdman...@gmail.com>:
>> On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets <nefelim...@gmail.com> 
>> wrote:
>>> Insertion sort generally performs better than bubble sort,
>>> with fewer iterations on average.
>>> This version also tries to place each element at its right position
>>> instead of doing raw swaps.
>>>
>>> I'm not sure how many stripes per raid56 bio
>>> btrfs tries to store (and tries to sort).
>>
>> If you don't know it, besides unlikely to be doing the best possible
>> thing here, you might actually make things worse or not offering any
>> benefit. IOW, you should know it for sure before submitting such
>> changes.
>>
>> You should know if the number of elements to sort is big enough such
>> that an insertion sort is faster than a bubble sort, and more
>> importantly, measure it and mention it in the changelog.
>> As it is, you are showing lack of understanding of the code and
>> component you are touching, and leaving many open questions such as
>> how faster this is, why insertion sort and not a
>> quick/merge/heap/whatever sort, etc.
>> --
>> Filipe David Manana,
>>
>> “Whether you think you can, or you think you can't — you're right.”
>
> Sorry, you are right,
> I must do some tests and investigation before sending a patch.
> (I just tried to believe in some magic math.)
>
> Input size depends on the number of devices,
> so on small arrays like 3-5 there is no meaningful difference.
>
> Example: raid6 (with 4 disks) produce many stripe line addresses like:
> 1. 4641783808 4641849344 4641914880 18446744073709551614
> 2. 4641652736 4641718272 18446744073709551614 4641587200
> 3. 18446744073709551614 4636475392 4636540928 4636606464
> 4. 4641521664 18446744073709551614 4641390592 4641456128
>
> For that count of elements any sorting algo will work fast enough.
>
> Let's consider these addresses as random non-repeating numbers.
>
> We can use tool like Sound Of Sorting (SoS) to make some
> easy to interpret tests of algorithms.

Nack.
My point was about testing in the btrfs code and not somewhere else.
We can all get estimations from CS books, websites, etc for multiple
algorithms for different input sizes. And these are typically
comparing the average case, and while some algorithms perform better
than others in the average case, things can get reversed in the worst
case (heap sort vs quick sort iirc, better in worst case but usually
worse in the average case).
What matters is in the btrfs context - that where things have to be measured.


>
> (Sorry, no script to reproduce, as SoS does not provide a CLI;
> results were gathered by hand by running SoS with different params.)
>
> Table (also attached, with source data points); columns are Disk_num:
>
> Sort_algo | Metric         | 3   | 4    | 6    | 8    | 10    | 12    | 14    | AVG
> Bubble    | Comparisons    | 3   | 6    | 15   | 28   | 45    | 66    | 91    | 36.3
> Bubble    | Array_Accesses | 7.8 | 18.2 | 45.8 | 81.8 | 133.4 | 192   | 268.6 | 106.8
> Insertion | Comparisons    | 2.8 | 5    | 11.6 | 17   | 28.6  | 39.4  | 55.2  | 22.8
> Insertion | Array_Accesses | 8.4 | 13.6 | 31   | 48.8 | 80.4  | 109.6 | 155.8 | 63.9
>
> i.e. at sizes like 3-4 there is not much difference;
> insertion sort will work faster on bigger arrays (up to 1.7x for a
> 14-disk array).
>
> Does that make sense?
> I think yes; in any case those are several dozen machine instructions
> which can be used elsewhere.
>
> P.S. As for heap sort, which is also available in the kernel via sort():
> that adds too much overhead for such a small number of devices,
> i.e. heap sort shows a profit over insertion sort only at 16+ cells in the array.
>
> /* Snob mode on */
> P.S.S.
> Heap sort & other like, need additional memory,

Yes... My point in listing the heap sort and other algorithms was not
meant to propose using any of them but rather for you to explain why
insertion sort and not something else.
And I think you are confusing heap sort with merge sort. Merge sort is
the one that requires extra memory.

> so it is useless to compare them in our case,
> but they would work faster, of course.
> /* Snob mode off */
>
> Thanks.
> --
> Have a nice day,
> Timofey.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v6 04/16] btrfs-progs: scrub: Introduce structures to support offline scrub for RAID56

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo <quwen...@cn.fujitsu.com>

Introduce new local structures, scrub_full_stripe and scrub_stripe, for
incoming offline RAID56 scrub support.

For pure stripe/mirror based profiles, like raid0/1/10/dup/single, we
will follow the original bytenr and mirror number based iteration, so
they don't need any extra structures for these profiles.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
Signed-off-by: Gu Jinxiang <g...@cn.fujitsu.com>
---
 Makefile |   3 +-
 scrub.c  | 119 +++
 2 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile b/Makefile
index ab45ab7f..fa3ebc86 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o 
csum.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o 
csum.o \
+ scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index ..41c40108
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,119 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profile (RAID56)
+ * Mirror/stripe based profiles won't need this. They are iterated by bytenr and
+ * mirror number.
+ */
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   u64 physical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum (nodatasum) */
+   unsigned int csum_missing:1;
+
+   /* Device fd, to write correct data back to disk */
+   int fd;
+
+   char *data;
+};
+
+/*
+ * RAID56 full stripe (data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Missing devices */
+   u32 err_missing_devs;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* Corrupted stripe index */
+   int corrupted_index[2];
+
+   int nr_corrupted_stripes;
+
+   /* Already recovered once? */
+   unsigned int recovered:1;
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + sizeof(unsigned long *) +
+   nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
-- 
2.14.3





Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2018-01-03 Thread Timofey Titovets
2018-01-03 14:40 GMT+03:00 Filipe Manana <fdman...@gmail.com>:
> On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets <nefelim...@gmail.com> 
> wrote:
>> Insertion sort generally performs better than bubble sort,
>> by having fewer iterations on average.
>> This version also tries to place each element at its right position
>> instead of doing raw swaps.
>>
>> I'm not sure how many stripes per raid56 bio
>> btrfs tries to store (and tries to sort).
>
> If you don't know it, besides unlikely to be doing the best possible
> thing here, you might actually make things worse or not offering any
> benefit. IOW, you should know it for sure before submitting such
> changes.
>
> You should know if the number of elements to sort is big enough such
> that an insertion sort is faster than a bubble sort, and more
> importantly, measure it and mention it in the changelog.
> As it is, you are showing lack of understanding of the code and
> component you are touching, and leaving many open questions such as
> how faster this is, why insertion sort and not a
> quick/merge/heap/whatever sort, etc.
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”

Sorry, you are right;
I must do some tests and investigation before sending a patch.
(I just tried to believe in some magic math things.)

Input size depends on the number of devices,
so on small arrays (3-5 elements) there is no meaningful difference.

Example: raid6 (with 4 disks) produce many stripe line addresses like:
1. 4641783808 4641849344 4641914880 18446744073709551614
2. 4641652736 4641718272 18446744073709551614 4641587200
3. 18446744073709551614 4636475392 4636540928 4636606464
4. 4641521664 18446744073709551614 4641390592 4641456128

For that count of elements any sorting algo will work fast enough.

Let's consider those addresses as random non-repeating numbers.

We can use a tool like Sound Of Sorting (SoS) to make some
easy-to-interpret tests of the algorithms.

(Sorry, no script to reproduce, as SoS does not provide a CLI;
the results were gathered by hand, running SoS with different params).

Table (also in attach with source data points):
Sort_algo |Disk_num       |3   |4    |6    |8    |10    |12    |14    |AVG
Bubble    |Comparisons    |3   |6    |15   |28   |45    |66    |91    |36.2857142857143
Bubble    |Array_Accesses |7.8 |18.2 |45.8 |81.8 |133.4 |192   |268.6 |106.8
Insertion |Comparisons    |2.8 |5    |11.6 |17   |28.6  |39.4  |55.2  |22.8
Insertion |Array_Accesses |8.4 |13.6 |31   |48.8 |80.4  |109.6 |155.8 |63.9428571428571

i.e. at sizes like 3-4 there is not much difference;
insertion sort will work faster on bigger arrays (up to 1.7x for a 14-disk array).

Does that make a sense?
I think yes, i.e. in any case we save several dozen machine instructions,
which can be used elsewhere.

P.S. As for heap sort, which is also available in the kernel via sort():
that adds too much overhead for such a small number of devices,
i.e. heap sort only shows a profit over insertion sort at 16+ cells in an array.

/* Snob mode on */
P.S.S.
Heap sort & other algorithms like it need additional memory,
so it is useless to compare them in our case,
but they would work faster, of course.
/* Snob mode off */

Thanks.
-- 
Have a nice day,
Timofey.


Bubble_vs_Insertion.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2018-01-03 Thread Filipe Manana
On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets <nefelim...@gmail.com> wrote:
> Insertion sort generally performs better than bubble sort,
> by having fewer iterations on average.
> This version also tries to place each element at its right position
> instead of doing raw swaps.
>
> I'm not sure how many stripes per raid56 bio
> btrfs tries to store (and tries to sort).

If you don't know it, besides unlikely to be doing the best possible
thing here, you might actually make things worse or not offering any
benefit. IOW, you should know it for sure before submitting such
changes.

You should know if the number of elements to sort is big enough such
that an insertion sort is faster than a bubble sort, and more
importantly, measure it and mention it in the changelog.
As it is, you are showing lack of understanding of the code and
component you are touching, and leaving many open questions such as
how faster this is, why insertion sort and not a
quick/merge/heap/whatever sort, etc.

>
> So, this is a bit shorter, just in the name of great justice.
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/volumes.c | 29 ++++++++++++-----------------
>  1 file changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 98bc2433a920..7195fc8c49b1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5317,29 +5317,24 @@ static inline int parity_smaller(u64 a, u64 b)
> return a > b;
>  }
>
> -/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
> +/* Insertion-sort the stripe set to put the parity/syndrome stripes last */
>  static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes)
>  {
> struct btrfs_bio_stripe s;
> -   int i;
> +   int i, j;
> u64 l;
> -   int again = 1;
>
> -   while (again) {
> -   again = 0;
> -   for (i = 0; i < num_stripes - 1; i++) {
> -   if (parity_smaller(bbio->raid_map[i],
> -  bbio->raid_map[i+1])) {
> -   s = bbio->stripes[i];
> -   l = bbio->raid_map[i];
> -   bbio->stripes[i] = bbio->stripes[i+1];
> -   bbio->raid_map[i] = bbio->raid_map[i+1];
> -   bbio->stripes[i+1] = s;
> -   bbio->raid_map[i+1] = l;
> -
> -   again = 1;
> -   }
> +   for (i = 1; i < num_stripes; i++) {
> +   s = bbio->stripes[i];
> +   l = bbio->raid_map[i];
> +   for (j = i - 1; j >= 0; j--) {
> +   if (!parity_smaller(bbio->raid_map[j], l))
> +   break;
> +   bbio->stripes[j+1]  = bbio->stripes[j];
> +   bbio->raid_map[j+1] = bbio->raid_map[j];
> }
> +   bbio->stripes[j+1]  = s;
> +   bbio->raid_map[j+1] = l;
> }
>  }
>
> --
> 2.15.1



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2017-12-28 Thread Timofey Titovets
Insertion sort generally performs better than bubble sort,
by having fewer iterations on average.
This version also tries to place each element at its right position
instead of doing raw swaps.

I'm not sure how many stripes per raid56 bio
btrfs tries to store (and tries to sort).

So, this is a bit shorter, just in the name of great justice.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/volumes.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 98bc2433a920..7195fc8c49b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5317,29 +5317,24 @@ static inline int parity_smaller(u64 a, u64 b)
return a > b;
 }
 
-/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
+/* Insertion-sort the stripe set to put the parity/syndrome stripes last */
 static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes)
 {
struct btrfs_bio_stripe s;
-   int i;
+   int i, j;
u64 l;
-   int again = 1;
 
-   while (again) {
-   again = 0;
-   for (i = 0; i < num_stripes - 1; i++) {
-   if (parity_smaller(bbio->raid_map[i],
-  bbio->raid_map[i+1])) {
-   s = bbio->stripes[i];
-   l = bbio->raid_map[i];
-   bbio->stripes[i] = bbio->stripes[i+1];
-   bbio->raid_map[i] = bbio->raid_map[i+1];
-   bbio->stripes[i+1] = s;
-   bbio->raid_map[i+1] = l;
-
-   again = 1;
-   }
+   for (i = 1; i < num_stripes; i++) {
+   s = bbio->stripes[i];
+   l = bbio->raid_map[i];
+   for (j = i - 1; j >= 0; j--) {
+   if (!parity_smaller(bbio->raid_map[j], l))
+   break;
+   bbio->stripes[j+1]  = bbio->stripes[j];
+   bbio->raid_map[j+1] = bbio->raid_map[j];
}
+   bbio->stripes[j+1]  = s;
+   bbio->raid_map[j+1] = l;
}
 }
 
-- 
2.15.1


Re: [PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 04:02:35PM -0700, Liu Bo wrote:
> We're not allowed to take any new bios to rbio->bio_list in
> rbio_orig_end_io(), otherwise we may get merged with more bios and
> rbio->bio_list is not empty.
> 
> This should only happen in error-out cases; the normal paths of
> recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
> disable merging before doing IO.
> 
> Reported-by: Jérôme Carretero 
> Signed-off-by: Liu Bo 

Added to next, thanks.


Re: [PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-11 Thread Liu Bo
On Sat, Dec 09, 2017 at 03:32:18PM +0200, Nikolay Borisov wrote:
> 
> 
> On  9.12.2017 01:02, Liu Bo wrote:
> > We're not allowed to take any new bios to rbio->bio_list in
> > rbio_orig_end_io(), otherwise we may get merged with more bios and
> > rbio->bio_list is not empty.
> > 
> > This should only happen in error-out cases; the normal paths of
> > recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
> > disable merging before doing IO.
> > 
> > Reported-by: Jérôme Carretero <cj...@zougloub.eu>
> > Signed-off-by: Liu Bo <bo.li@oracle.com>
> > ---
> >  fs/btrfs/raid56.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index 5aa9d22..127c782 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -859,12 +859,23 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
> >   */
> >  static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
> >  {
> > -   struct bio *cur = bio_list_get(&rbio->bio_list);
> > +   struct bio *cur;
> > struct bio *next;
> >  
> > +   /*
> > +* We're not allowed to take any new bios to rbio->bio_list
> > +* from now on, otherwise we may get merged with more bios and
> > +* rbio->bio_list is not empty.
> > +*/
> > +   spin_lock(&rbio->bio_list_lock);
> > +   set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
> > +   spin_unlock(&rbio->bio_list_lock);
> 
> do we really need the spinlock, bit operations are atomic?
> 

Thanks for the question.

Atomicity doesn't really matter here, set_bit() needs to be done in
the critical section so that merge_rbio() can do right things if
merge_rbio() comes after rbio_orig_end_io().

thanks,
-liubo

> > +
> > if (rbio->generic_bio_cnt)
> > btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
> >  
> > +   cur = bio_list_get(&rbio->bio_list);
> > +
> > free_raid_bio(rbio);
> >  
> > while (cur) {
> > 


Re: [PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-09 Thread Nikolay Borisov


On  9.12.2017 01:02, Liu Bo wrote:
> We're not allowed to take any new bios to rbio->bio_list in
> rbio_orig_end_io(), otherwise we may get merged with more bios and
> rbio->bio_list is not empty.
> 
> This should only happen in error-out cases; the normal paths of
> recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
> disable merging before doing IO.
> 
> Reported-by: Jérôme Carretero <cj...@zougloub.eu>
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  fs/btrfs/raid56.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 5aa9d22..127c782 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -859,12 +859,23 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
>   */
>  static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>  {
> - struct bio *cur = bio_list_get(&rbio->bio_list);
> + struct bio *cur;
>   struct bio *next;
>  
> + /*
> +  * We're not allowed to take any new bios to rbio->bio_list
> +  * from now on, otherwise we may get merged with more bios and
> +  * rbio->bio_list is not empty.
> +  */
> + spin_lock(&rbio->bio_list_lock);
> + set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
> + spin_unlock(&rbio->bio_list_lock);

do we really need the spinlock, bit operations are atomic?

> +
>   if (rbio->generic_bio_cnt)
>   btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
>  
> + cur = bio_list_get(&rbio->bio_list);
> +
>   free_raid_bio(rbio);
>  
>   while (cur) {
> 


Re: [PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-08 Thread Liu Bo
(Add Jérôme Carretero.)

Thanks,

-liubo

On Fri, Dec 08, 2017 at 04:02:35PM -0700, Liu Bo wrote:
> We're not allowed to take any new bios to rbio->bio_list in
> rbio_orig_end_io(), otherwise we may get merged with more bios and
> rbio->bio_list is not empty.
> 
> This should only happen in error-out cases; the normal paths of
> recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
> disable merging before doing IO.
> 
> Reported-by: Jérôme Carretero <cj...@zougloub.eu>
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  fs/btrfs/raid56.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 5aa9d22..127c782 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -859,12 +859,23 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
>   */
>  static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>  {
> - struct bio *cur = bio_list_get(&rbio->bio_list);
> + struct bio *cur;
>   struct bio *next;
>  
> + /*
> +  * We're not allowed to take any new bios to rbio->bio_list
> +  * from now on, otherwise we may get merged with more bios and
> +  * rbio->bio_list is not empty.
> +  */
> + spin_lock(&rbio->bio_list_lock);
> + set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
> + spin_unlock(&rbio->bio_list_lock);
> +
>   if (rbio->generic_bio_cnt)
>   btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
>  
> + cur = bio_list_get(&rbio->bio_list);
> +
>   free_raid_bio(rbio);
>  
>   while (cur) {
> -- 
> 2.9.4
> 


[PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-08 Thread Liu Bo
We're not allowed to take any new bios to rbio->bio_list in
rbio_orig_end_io(), otherwise we may get merged with more bios and
rbio->bio_list is not empty.

This should only happen in error-out cases; the normal paths of
recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
disable merging before doing IO.

Reported-by: Jérôme Carretero <cj...@zougloub.eu>
Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5aa9d22..127c782 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -859,12 +859,23 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
  */
 static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 {
-   struct bio *cur = bio_list_get(&rbio->bio_list);
+   struct bio *cur;
struct bio *next;
 
+   /*
+* We're not allowed to take any new bios to rbio->bio_list
+* from now on, otherwise we may get merged with more bios and
+* rbio->bio_list is not empty.
+*/
+   spin_lock(&rbio->bio_list_lock);
+   set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
+   spin_unlock(&rbio->bio_list_lock);
+
if (rbio->generic_bio_cnt)
btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
 
+   cur = bio_list_get(&rbio->bio_list);
+
free_raid_bio(rbio);
 
while (cur) {
-- 
2.9.4



Re: [PATCH 3/7] Btrfs: retry write for non-raid56

2017-11-28 Thread Liu Bo
On Wed, Nov 22, 2017 at 04:41:10PM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.11.2017 02:35, Liu Bo wrote:
> > If the underlying protocol doesn't support retry and there are some
> > transient errors happening somewhere in our IO stack, we'd like to
> > give an extra chance for IO.
> > 
> > In btrfs, read retry is handled in bio_readpage_error() with the retry
> > unit being page size. For write retry, however, we're going to do it
> > in a different way: as a write may consist of several writes onto
> > different stripes, a write retry needs to be done right after the IO on
> > each stripe completes and reaches endio.
> > 
> > As write errors are really corner cases, performance impact is not
> > considered.
> > 
> > This adds code to retry write on errors _sector by sector_ in order to
> > find out which sector is really bad so that we can mark it somewhere.
> > And since endio is running in interrupt context, the real retry work
> > is scheduled in system_unbound_wq in order to get a normal context.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/volumes.c | 162 
> > +
> >  fs/btrfs/volumes.h |   3 +
> >  2 files changed, 142 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 853f9ce..c11db0b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
> > }
> >  }
> >  
> > -static void btrfs_end_bio(struct bio *bio)
> > +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
> >  {
> > struct btrfs_bio *bbio = bio->bi_private;
> > -   int is_orig_bio = 0;
> > +   unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> > +
> > +   BUG_ON(stripe_index >= bbio->num_stripes);
> > +   return bbio->stripes[stripe_index].dev;
> > +}
> > +
> > +/*
> > + * return 1 if every sector retry returns successful.
> > + * return 0 if one or more sector retries fails.
> > + */
> > +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +   u64 sectors_to_write;
> > +   u64 offset;
> > +   u64 orig;
> > +   u64 unit;
> > +   u64 block_sectors;
> > +   int ok = 1;
> > +   struct bio *wbio;
> > +
> > +   /* offset and unit are bytes aligned, not 512-bytes aligned. */
> > +   sectors_to_write = io_bio->iter.bi_size >> 9;
> > +   orig = io_bio->iter.bi_sector;
> > +   offset = 0;
> > +   block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> > +   unit = block_sectors;
> > +   ASSERT(unit == 1);
> > +
> > +   while (1) {
> > +   if (!sectors_to_write)
> > +   break;
> > +   /*
> > +* LIUBO: I don't think unit > sectors_to_write could
> > +* happen, sectors_to_write should be aligned to PAGE_SIZE
> > +* which is > unit.  Just in case.
> > +*/
> > +   if (unit > sectors_to_write) {
> > +   WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", 
> > unit, sectors_to_write);
> > +   unit = sectors_to_write;
> > +   }
> > +
> > +   /* write @unit bytes at @offset */
> > +   /* this would never fail, check btrfs_bio_clone(). */
> > +   wbio = btrfs_bio_clone(bio);
> > +   wbio->bi_opf = REQ_OP_WRITE;
> > +   wbio->bi_iter = io_bio->iter;
> > +
> > +   bio_trim(wbio, offset, unit);
> > +   bio_copy_dev(wbio, bio);
> > +
> > +   /* submit in sync way */
> > +   /*
> > +* LIUBO: There is an issue, if this bio is quite
> > +* large, say 1M or 2M, and sector size is just 512,
> > +* then this may take a while.
> > +*
> > +* May need to schedule the job to workqueue.
> > +*/
> > +   if (submit_bio_wait(wbio) < 0) {
> > +   ok = 0 && ok;
> > +   /*
> > +* This is not correct if badblocks is enabled
> > +* as we need to record every bad sector by
> > +* trying sectors one by one.
> > +*/
> > +   break;
> > +   }
> > +
> > +   bio_put(wbio);
> > +   offset += unit;
> > +   sectors_to_write -= unit;
> > +   unit = block_sectors;
> > +   }
> > +   return ok;
> > +}
> > +
> > +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +   if (bio->bi_status == BLK_STS_IOERR ||
> > +   bio->bi_status == BLK_STS_TARGET) {
> > +   if (dev->bdev) {
> > +   if (bio_data_dir(bio) == WRITE)
> > +   btrfs_dev_stat_inc(dev,
> > +  BTRFS_DEV_STAT_WRITE_ERRS);
> > +   else
> > +   btrfs_dev_stat_inc(dev,
> > +  

Re: WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-22 Thread Jérôme Carretero
Hi,

On Wed, 22 Nov 2017 15:35:35 -0800
Liu Bo <bo.li@oracle.com> wrote:

> On Mon, Nov 20, 2017 at 02:00:07AM -0500, Jérôme Carretero wrote:
> > [ cut here ] [633254.461294] WARNING: CPU:
> > 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848
> > __free_raid_bio+0x8e/0xa0  
> 
> 
> The vanilla 4.14.0 shows it is
> WARN_ON(!bio_list_empty(&rbio->bio_list)); but we just emptied
> rbio->bio_list two lines above, i.e. struct bio *cur =
> bio_list_get(&rbio->bio_list);
> 
> Either we have some weird race, or the line number is misleading me.
> 
> Can you please check the code which warning fs/btrfs/raid56.c:848
> points to?

Same code as yours:
WARN_ON(!bio_list_empty(&rbio->bio_list));

So yeah, at least git is not broken, now it could be a very weird
compiler bug or a less-weird race indeed...


Regards,

-- 
Jérôme


Re: WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-22 Thread Liu Bo
On Mon, Nov 20, 2017 at 02:00:07AM -0500, Jérôme Carretero wrote:
> Hi,
> 
> 
> 
> This was while doing a "userspace scrub" with "tar c":
> 
> [633250.707455] btrfs_print_data_csum_error: 14608 callbacks suppressed
> [633250.707459] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0xb8c194fb expected csum 0xb3680c88 mirror 2
> [633250.707465] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707470] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0xa5db59eb expected csum 0xb3680c88 mirror 2
> [633250.707473] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x5d244234 expected csum 0xb3680c88 mirror 2
> [633250.707475] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707478] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0xc0a71540 expected csum 0x904f75bc mirror 2
> [633250.707480] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707483] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707484] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0x0abd2cac expected csum 0x904f75bc mirror 2
> [633250.707488] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0x0d046c34 expected csum 0x904f75bc mirror 2
> [633250.888501] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230948864 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.937373] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230952960 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.949808] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230957056 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.961703] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230961152 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.973827] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230965248 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.986271] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230969344 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.998517] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230973440 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.010537] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230977536 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.022767] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230981632 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.034990] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230985728 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633254.456570] [ cut here ]
> [633254.461294] WARNING: CPU: 3 PID: 20953 at 
> /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0


The vanilla 4.14.0 shows it is WARN_ON(!bio_list_empty(&rbio->bio_list));
but we just emptied rbio->bio_list two lines above, i.e.
struct bio *cur = bio_list_get(&rbio->bio_list);

Either we have some weird race, or the line number is misleading me.

Can you please check the code which warning fs/btrfs/raid56.c:848 points to?

thanks,
-liubo

> [633254.470863] Modules linked in: bfq twofish_avx_x86_64 twofish_x86_64_3way 
> xts twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic lrw 
> gf128mul ablk_helper algif_skcipher af_alg nfnetlink_queue nfnetlink_log 
> nfnetlink cfg80211 rfkill usbmon fuse usb_storage dm_crypt dm_mod dax 
> coretemp hwmon intel_rapl snd_hda_codec_realtek x86_pkg_temp_thermal 
> snd_hda_codec_generic iTCO_wdt kvm_intel iTCO_vendor_support snd_hda_intel 
> kvm snd_hda_codec irqbypass snd_hwdep aesni_intel snd_hda_core aes_x86_64 
> snd_pcm xhci_pci snd_timer ehci_pci crypto_simd xhci_hcd cryptd ehci_hcd 
> sdhci_pci glue_helper pcspkr snd usbcore sdhci soundcore lpc_ich mmc_core 
> usb_common mfd_core bnx2 bonding autofs4 [last unloaded: i2c_dev]
> [633254.533987] CPU: 3 PID: 20953 Comm: kworker/u16:18 Tainted: GW
>4.14.0-Vantage-dirty #14
> [633254.543298] Hardware name: LENOVO 056851U/LENOVO, BIOS A0KT56AUS 
> 02/01/2016
> [633254.550365] Workqueue: btrfs-endio btrfs_endio_helper
> [633254.08] task: 880859523b00 task.stack: c90006164000
> [633254.561528] RIP: 0010:__free_raid_bio+0x8e/0xa0
> [633254.566143] RSP: 0018:c90006167bc8 EFLAGS: 00010282
> [633254.571457] RAX: 88052540d010 RBX: 8801ffd02800 RCX: 

Re: [PATCH 3/7] Btrfs: retry write for non-raid56

2017-11-22 Thread Nikolay Borisov


On 22.11.2017 02:35, Liu Bo wrote:
> If the underlying protocol doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO.
> 
> In btrfs, read retry is handled in bio_readpage_error() with the retry
> unit being page size; for write retry, however, we're going to do it
> in a different way: as a write may consist of several writes onto
> different stripes, retry write needs to be done right after the IO on
> each stripe completes and reaches endio.
> 
> As write errors are really corner cases, performance impact is not
> considered.
> 
> This adds code to retry write on errors _sector by sector_ in order to
> find out which sector is really bad so that we can mark it somewhere.
> And since endio is running in interrupt context, the real retry work
> is scheduled in system_unbound_wq in order to get a normal context.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/volumes.c | 162 +
>  fs/btrfs/volumes.h |   3 +
>  2 files changed, 142 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 853f9ce..c11db0b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
>   }
>  }
>  
> -static void btrfs_end_bio(struct bio *bio)
> +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
>  {
>   struct btrfs_bio *bbio = bio->bi_private;
> - int is_orig_bio = 0;
> + unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> +
> + BUG_ON(stripe_index >= bbio->num_stripes);
> + return bbio->stripes[stripe_index].dev;
> +}
> +
> +/*
> + * return 1 if every sector retry returns successful.
> + * return 0 if one or more sector retries fails.
> + */
> +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> +{
> + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> + u64 sectors_to_write;
> + u64 offset;
> + u64 orig;
> + u64 unit;
> + u64 block_sectors;
> + int ok = 1;
> + struct bio *wbio;
> +
> + /* offset and unit are bytes aligned, not 512-bytes aligned. */
> + sectors_to_write = io_bio->iter.bi_size >> 9;
> + orig = io_bio->iter.bi_sector;
> + offset = 0;
> + block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> + unit = block_sectors;
> + ASSERT(unit == 1);
> +
> + while (1) {
> + if (!sectors_to_write)
> + break;
> + /*
> +  * LIUBO: I don't think unit > sectors_to_write could
> +  * happen, sectors_to_write should be aligned to PAGE_SIZE
> +  * which is > unit.  Just in case.
> +  */
> + if (unit > sectors_to_write) {
> + WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", 
> unit, sectors_to_write);
> + unit = sectors_to_write;
> + }
> +
> + /* write @unit bytes at @offset */
> + /* this would never fail, check btrfs_bio_clone(). */
> + wbio = btrfs_bio_clone(bio);
> + wbio->bi_opf = REQ_OP_WRITE;
> + wbio->bi_iter = io_bio->iter;
> +
> + bio_trim(wbio, offset, unit);
> + bio_copy_dev(wbio, bio);
> +
> + /* submit in sync way */
> + /*
> +  * LIUBO: There is an issue, if this bio is quite
> +  * large, say 1M or 2M, and sector size is just 512,
> +  * then this may take a while.
> +  *
> +  * May need to schedule the job to workqueue.
> +  */
> + if (submit_bio_wait(wbio) < 0) {
> + ok = 0 && ok;
> + /*
> +  * This is not correct if badblocks is enabled
> +  * as we need to record every bad sector by
> +  * trying sectors one by one.
> +  */
> + break;
> + }
> +
> + bio_put(wbio);
> + offset += unit;
> + sectors_to_write -= unit;
> + unit = block_sectors;
> + }
> + return ok;
> +}
> +
> +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> +{
> + if (bio->bi_status == BLK_STS_IOERR ||
> + bio->bi_status == BLK_STS_TARGET) {
> + if (dev->bdev) {
> + if (bio_data_dir(bio) == WRITE)
> + btrfs_dev_stat_inc(dev,
> +BTRFS_DEV_STAT_WRITE_ERRS);
> + else
> + btrfs_dev_stat_inc(dev,
> +BTRFS_DEV_STAT_READ_ERRS);
> + if (bio->bi_opf & REQ_PREFLUSH)
> + btrfs_dev_stat_inc(dev,
> +   

[PATCH 3/7] Btrfs: retry write for non-raid56

2017-11-21 Thread Liu Bo
If the underlying protocol doesn't support retry and there are some
transient errors happening somewhere in our IO stack, we'd like to
give an extra chance for IO.

In btrfs, read retry is handled in bio_readpage_error() with the retry
unit being page size; for write retry, however, we're going to do it
in a different way: as a write may consist of several writes onto
different stripes, retry write needs to be done right after the IO on
each stripe completes and reaches endio.

As write errors are really corner cases, performance impact is not
considered.

This adds code to retry write on errors _sector by sector_ in order to
find out which sector is really bad so that we can mark it somewhere.
And since endio is running in interrupt context, the real retry work
is scheduled in system_unbound_wq in order to get a normal context.

Signed-off-by: Liu Bo 
---
 fs/btrfs/volumes.c | 162 +
 fs/btrfs/volumes.h |   3 +
 2 files changed, 142 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 853f9ce..c11db0b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
}
 }
 
-static void btrfs_end_bio(struct bio *bio)
+static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
 {
struct btrfs_bio *bbio = bio->bi_private;
-   int is_orig_bio = 0;
+   unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
+
+   BUG_ON(stripe_index >= bbio->num_stripes);
+   return bbio->stripes[stripe_index].dev;
+}
+
+/*
+ * return 1 if every sector retry returns successful.
+ * return 0 if one or more sector retries fails.
+ */
+int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
+{
+   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   u64 sectors_to_write;
+   u64 offset;
+   u64 orig;
+   u64 unit;
+   u64 block_sectors;
+   int ok = 1;
+   struct bio *wbio;
+
+   /* offset and unit are bytes aligned, not 512-bytes aligned. */
+   sectors_to_write = io_bio->iter.bi_size >> 9;
+   orig = io_bio->iter.bi_sector;
+   offset = 0;
+   block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
+   unit = block_sectors;
+   ASSERT(unit == 1);
+
+   while (1) {
+   if (!sectors_to_write)
+   break;
+   /*
+* LIUBO: I don't think unit > sectors_to_write could
+* happen, sectors_to_write should be aligned to PAGE_SIZE
+* which is > unit.  Just in case.
+*/
+   if (unit > sectors_to_write) {
+   WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", 
unit, sectors_to_write);
+   unit = sectors_to_write;
+   }
+
+   /* write @unit bytes at @offset */
+   /* this would never fail, check btrfs_bio_clone(). */
+   wbio = btrfs_bio_clone(bio);
+   wbio->bi_opf = REQ_OP_WRITE;
+   wbio->bi_iter = io_bio->iter;
+
+   bio_trim(wbio, offset, unit);
+   bio_copy_dev(wbio, bio);
+
+   /* submit in sync way */
+   /*
+* LIUBO: There is an issue, if this bio is quite
+* large, say 1M or 2M, and sector size is just 512,
+* then this may take a while.
+*
+* May need to schedule the job to workqueue.
+*/
+   if (submit_bio_wait(wbio) < 0) {
+   ok = 0 && ok;
+   /*
+* This is not correct if badblocks is enabled
+* as we need to record every bad sector by
+* trying sectors one by one.
+*/
+   break;
+   }
+
+   bio_put(wbio);
+   offset += unit;
+   sectors_to_write -= unit;
+   unit = block_sectors;
+   }
+   return ok;
+}
+
+void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
+{
+   if (bio->bi_status == BLK_STS_IOERR ||
+   bio->bi_status == BLK_STS_TARGET) {
+   if (dev->bdev) {
+   if (bio_data_dir(bio) == WRITE)
+   btrfs_dev_stat_inc(dev,
+  BTRFS_DEV_STAT_WRITE_ERRS);
+   else
+   btrfs_dev_stat_inc(dev,
+  BTRFS_DEV_STAT_READ_ERRS);
+   if (bio->bi_opf & REQ_PREFLUSH)
+   btrfs_dev_stat_inc(dev,
+  BTRFS_DEV_STAT_FLUSH_ERRS);
+   btrfs_dev_stat_print_on_error(dev);
+   }
+   

[PATCH 6/7] Btrfs: retry write for raid56

2017-11-21 Thread Liu Bo
Retry writes on raid56's final full stripe writes in order to get over
some transient errors in our IO stack.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index aebc849..2e182f8 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -909,14 +909,49 @@ static void __raid_write_end_io(struct bio *bio)
rbio_orig_end_io(rbio, err);
 }
 
+struct btrfs_device *get_raid_device_from_bio(struct bio *bio)
+{
+   struct btrfs_raid_bio *rbio = bio->bi_private;
+   unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
+
+   BUG_ON(stripe_index >= rbio->bbio->num_stripes);
+   return rbio->bbio->stripes[stripe_index].dev;
+}
+
+static void raid_handle_write_error(struct work_struct *work)
+{
+   struct btrfs_io_bio *io_bio;
+   struct bio *bio;
+   struct btrfs_device *dev;
+
+   io_bio = container_of(work, struct btrfs_io_bio, work);
+   bio = &io_bio->bio;
+   dev = get_raid_device_from_bio(bio);
+
+   if (!btrfs_narrow_write_error(bio, dev)) {
+   fail_bio_stripe(bio);
+   btrfs_record_bio_error(bio, dev);
+   }
+
+   __raid_write_end_io(bio);
+}
+
+static void raid_reschedule_bio(struct bio *bio)
+{
+   INIT_WORK(&btrfs_io_bio(bio)->work, raid_handle_write_error);
+   queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
+}
+
 /*
  * end io function used by finish_rmw.  When we finally
  * get here, we've written a full stripe
  */
 static void raid_write_end_io(struct bio *bio)
 {
-   if (bio->bi_status)
-   fail_bio_stripe(bio);
+   if (bio->bi_status) {
+   raid_reschedule_bio(bio);
+   return;
+   }
 
__raid_write_end_io(bio);
 }
@@ -1108,6 +1143,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
bio->bi_iter.bi_size = 0;
bio_set_dev(bio, stripe->dev->bdev);
bio->bi_iter.bi_sector = disk_start >> 9;
+   btrfs_io_bio(bio)->stripe_index = stripe_nr;
 
bio_add_page(bio, page, PAGE_SIZE, 0);
bio_list_add(bio_list, bio);
@@ -1324,6 +1360,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
bio->bi_private = rbio;
bio->bi_end_io = raid_write_end_io;
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   btrfs_io_bio(bio)->iter = bio->bi_iter;
 
submit_bio(bio);
}
@@ -2452,6 +2489,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
bio->bi_private = rbio;
bio->bi_end_io = raid_write_end_io;
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   btrfs_io_bio(bio)->iter = bio->bi_iter;
 
submit_bio(bio);
}
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-19 Thread Jérôme Carretero
On Mon, 20 Nov 2017 02:00:07 -0500
Jérôme Carretero  wrote:

> [ cut here ]

It should be noted that the filesystem doesn't want to be unmounted now.


Regards,

-- 
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-19 Thread Jérôme Carretero
Hi,



This was while doing a "userspace scrub" with "tar c":

[633250.707455] btrfs_print_data_csum_error: 14608 callbacks suppressed
[633250.707459] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0xb8c194fb expected csum 0xb3680c88 mirror 2
[633250.707465] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
[633250.707470] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0xa5db59eb expected csum 0xb3680c88 mirror 2
[633250.707473] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0x5d244234 expected csum 0xb3680c88 mirror 2
[633250.707475] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
[633250.707478] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530301440 csum 0xc0a71540 expected csum 0x904f75bc mirror 2
[633250.707480] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
[633250.707483] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
[633250.707484] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530301440 csum 0x0abd2cac expected csum 0x904f75bc mirror 2
[633250.707488] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
3530301440 csum 0x0d046c34 expected csum 0x904f75bc mirror 2
[633250.888501] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230948864 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.937373] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230952960 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.949808] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230957056 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.961703] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230961152 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.973827] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230965248 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.986271] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230969344 (dev /dev/mapper/I8U2-4 sector 1373688)
[633250.998517] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230973440 (dev /dev/mapper/I8U2-4 sector 1373688)
[633251.010537] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230977536 (dev /dev/mapper/I8U2-4 sector 1373688)
[633251.022767] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230981632 (dev /dev/mapper/I8U2-4 sector 1373688)
[633251.034990] BTRFS info (device dm-18): read error corrected: ino 1376 off 
230985728 (dev /dev/mapper/I8U2-4 sector 1373688)
[633254.456570] [ cut here ]
[633254.461294] WARNING: CPU: 3 PID: 20953 at 
/usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0
[633254.470863] Modules linked in: bfq twofish_avx_x86_64 twofish_x86_64_3way 
xts twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic lrw 
gf128mul ablk_helper algif_skcipher af_alg nfnetlink_queue nfnetlink_log 
nfnetlink cfg80211 rfkill usbmon fuse usb_storage dm_crypt dm_mod dax coretemp 
hwmon intel_rapl snd_hda_codec_realtek x86_pkg_temp_thermal 
snd_hda_codec_generic iTCO_wdt kvm_intel iTCO_vendor_support snd_hda_intel kvm 
snd_hda_codec irqbypass snd_hwdep aesni_intel snd_hda_core aes_x86_64 snd_pcm 
xhci_pci snd_timer ehci_pci crypto_simd xhci_hcd cryptd ehci_hcd sdhci_pci 
glue_helper pcspkr snd usbcore sdhci soundcore lpc_ich mmc_core usb_common 
mfd_core bnx2 bonding autofs4 [last unloaded: i2c_dev]
[633254.533987] CPU: 3 PID: 20953 Comm: kworker/u16:18 Tainted: GW  
 4.14.0-Vantage-dirty #14
[633254.543298] Hardware name: LENOVO 056851U/LENOVO, BIOS A0KT56AUS 02/01/2016
[633254.550365] Workqueue: btrfs-endio btrfs_endio_helper
[633254.08] task: 880859523b00 task.stack: c90006164000
[633254.561528] RIP: 0010:__free_raid_bio+0x8e/0xa0
[633254.566143] RSP: 0018:c90006167bc8 EFLAGS: 00010282
[633254.571457] RAX: 88052540d010 RBX: 8801ffd02800 RCX: 
0001
[633254.578683] RDX: 88052540d010 RSI: 0246 RDI: 
88052540d000
[633254.585912] RBP: 88052540d000 R08:  R09: 
0001
[633254.593131] R10: 8805ad3a2e60 R11: 0006 R12: 
000a
[633254.600378] R13: 0004 R14: 0001 R15: 
880537d1c000
[633254.607604] FS:  () GS:88087fcc() 
knlGS:
[633254.615804] CS:  0010 DS:  ES:  CR0: 80050033
[633254.621635] CR2: 7f13db554000 CR3: 01e09002 CR4: 
000606e0
[633254.628870] Call Trace:
[633254.631421]  rbio_orig_end_io+0x42/0x80
[633254.635352]  __raid56_parity_recover+0x17a/0x1f0
[633254.640078]  raid56_parity_recover+0x193/

Re: [PATCH] Btrfs: fix memory leak in raid56

2017-09-24 Thread David Sterba
On Fri, Sep 22, 2017 at 12:11:18PM -0600, Liu Bo wrote:
> The local bio_list may have pending bios when doing cleanup, it can
> end up with memory leak if they don't get free'd.

I was wondering if we could make a common helper that would call
rbio_orig_end_io and while (..) put_bio(), but __raid56_parity_recover
does not fit the pattern. Still might be worth, but the patch is ok as
is.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix memory leak in raid56

2017-09-22 Thread Liu Bo
The local bio_list may have pending bios when doing cleanup, it can
end up with memory leak if they don't get free'd.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 2cf6ba4..063a2a0 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1325,6 +1325,9 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
 cleanup:
rbio_orig_end_io(rbio, BLK_STS_IOERR);
+
+   while ((bio = bio_list_pop(&bio_list)))
+   bio_put(bio);
 }
 
 /*
@@ -1580,6 +1583,10 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
 cleanup:
rbio_orig_end_io(rbio, BLK_STS_IOERR);
+
+   while ((bio = bio_list_pop(&bio_list)))
+   bio_put(bio);
+
return -EIO;
 
 finish:
@@ -2105,6 +2112,10 @@ static int __raid56_parity_recover(struct btrfs_raid_bio 
*rbio)
if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
rbio->operation == BTRFS_RBIO_REBUILD_MISSING)
rbio_orig_end_io(rbio, BLK_STS_IOERR);
+
+   while ((bio = bio_list_pop(&bio_list)))
+   bio_put(bio);
+
return -EIO;
 }
 
@@ -2452,6 +2463,9 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
 
 cleanup:
rbio_orig_end_io(rbio, BLK_STS_IOERR);
+
+   while ((bio = bio_list_pop(&bio_list)))
+   bio_put(bio);
 }
 
 static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
@@ -2561,12 +2575,12 @@ static void raid56_parity_scrub_stripe(struct 
btrfs_raid_bio *rbio)
int stripe;
struct bio *bio;
 
+   bio_list_init(&bio_list);
+
ret = alloc_rbio_essential_pages(rbio);
if (ret)
goto cleanup;
 
-   bio_list_init(&bio_list);
-
atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
@@ -2634,6 +2648,10 @@ static void raid56_parity_scrub_stripe(struct 
btrfs_raid_bio *rbio)
 
 cleanup:
rbio_orig_end_io(rbio, BLK_STS_IOERR);
+
+   while ((bio = bio_list_pop(&bio_list)))
+   bio_put(bio);
+
return;
 
 finish:
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/14] Btrfs: raid56: add raid56 log via add_dev v2 ioctl

2017-08-02 Thread Nikolay Borisov


On  1.08.2017 19:14, Liu Bo wrote:
> This introduces add_dev_v2 ioctl to add a device as raid56 journal
> device.  With the help of a journal device, raid56 is able to get
> rid of potential write holes.
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  fs/btrfs/ctree.h|  6 ++
>  fs/btrfs/ioctl.c| 48 ++++-
>  fs/btrfs/raid56.c   | 42 ++++
>  fs/btrfs/raid56.h   |  1 +
>  fs/btrfs/volumes.c  | 26 --
>  fs/btrfs/volumes.h  |  3 ++-
>  include/uapi/linux/btrfs.h  |  3 +++
>  include/uapi/linux/btrfs_tree.h |  4 
>  8 files changed, 125 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 643c70d..d967627 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -697,6 +697,7 @@ struct btrfs_stripe_hash_table {
>  void btrfs_init_async_reclaim_work(struct work_struct *work);
>  
>  /* fs_info */
> +struct btrfs_r5l_log;
>  struct reloc_control;
>  struct btrfs_device;
>  struct btrfs_fs_devices;
> @@ -1114,6 +1115,9 @@ struct btrfs_fs_info {
>   u32 nodesize;
>   u32 sectorsize;
>   u32 stripesize;
> +
> + /* raid56 log */
> + struct btrfs_r5l_log *r5log;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2932,6 +2936,8 @@ static inline int btrfs_need_cleaner_sleep(struct 
> btrfs_fs_info *fs_info)
>  
>  static inline void free_fs_info(struct btrfs_fs_info *fs_info)
>  {
> + if (fs_info->r5log)
> + kfree(fs_info->r5log);
>   kfree(fs_info->balance_ctl);
>   kfree(fs_info->delayed_root);
>   kfree(fs_info->extent_root);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index e176375..3d1ef4d 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2653,6 +2653,50 @@ static int btrfs_ioctl_defrag(struct file *file, void 
> __user *argp)
>   return ret;
>  }
>  
> +/* identical to btrfs_ioctl_add_dev, but this is with flags */
> +static long btrfs_ioctl_add_dev_v2(struct btrfs_fs_info *fs_info, void 
> __user *arg)
> +{
> + struct btrfs_ioctl_vol_args_v2 *vol_args;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> + return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> +
> + mutex_lock(&fs_info->volume_mutex);
> + vol_args = memdup_user(arg, sizeof(*vol_args));
> + if (IS_ERR(vol_args)) {
> + ret = PTR_ERR(vol_args);
> + goto out;
> + }
> +
> + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG &&
> + fs_info->r5log) {
> + ret = -EEXIST;
> + btrfs_info(fs_info, "r5log: attempting to add another log 
> device!");
> + goto out_free;
> + }
> +
> + vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
> + ret = btrfs_init_new_device(fs_info, vol_args->name, vol_args->flags);
> + if (!ret) {
> + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG) {
> + ASSERT(fs_info->r5log);
> + btrfs_info(fs_info, "disk added %s as raid56 log", 
> vol_args->name);
> + } else {
> + btrfs_info(fs_info, "disk added %s", vol_args->name);
> + }
> + }
> +out_free:
> + kfree(vol_args);
> +out:
> + mutex_unlock(&fs_info->volume_mutex);
> + clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
> + return ret;
> +}
> +
>  static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user 
> *arg)
>  {
>   struct btrfs_ioctl_vol_args *vol_args;
> @@ -2672,7 +2716,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info 
> *fs_info, void __user *arg)
>   }
>  
>   vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
> - ret = btrfs_init_new_device(fs_info, vol_args->name);
> + ret = btrfs_init_new_device(fs_info, vol_args->name, 0);
>  
>   if (!ret)
>   btrfs_info(fs_info, "disk added %s", vol_args->name);
> @@ -5539,6 +5583,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>   return btrfs_ioctl_resize(file, argp);
>   case BTRFS_IOC_ADD_DEV:
>   return btrfs_ioctl_add_dev(fs_info, argp);
> + case BTRFS_IOC_ADD_DEV_V2:
> + return btrfs_ioctl_add_dev_v2(fs_info, argp);
>   case BTRFS_IOC_RM_DEV:
>   return btrfs_ioc

[PATCH 12/14] Btrfs: raid56: fix error handling while adding a log device

2017-08-01 Thread Liu Bo
Currently there is a memory leak if we have an error while adding a
raid5/6 log.  Moreover, it didn't abort the transaction as others do,
so this is fixing the broken error handling by applying two steps on
initializing the log, step #1 is to allocate memory, check if it has a
proper size, and step #2 is to assign the pointer in %fs_info.  And by
running step #1 ahead of starting transaction, we can gracefully bail
out on errors now.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c  | 48 +---
 fs/btrfs/raid56.h  |  5 +
 fs/btrfs/volumes.c | 36 ++--
 3 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 8bc7ba4..0bfc97a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -3711,30 +3711,64 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio 
*rbio)
async_missing_raid56(rbio);
 }
 
-int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device)
+struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info 
*fs_info, struct btrfs_device *device, struct block_device *bdev)
 {
-   struct btrfs_r5l_log *log;
-
-   log = kzalloc(sizeof(*log), GFP_NOFS);
+   int num_devices = fs_info->fs_devices->num_devices;
+   u64 dev_total_bytes;
+   struct btrfs_r5l_log *log = kzalloc(sizeof(struct btrfs_r5l_log), 
GFP_NOFS);
if (!log)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
+
+   ASSERT(device);
+   ASSERT(bdev);
+   dev_total_bytes = i_size_read(bdev->bd_inode);
 
/* see find_free_dev_extent for 1M start offset */
log->data_offset = 1024ull * 1024;
-   log->device_size = btrfs_device_get_total_bytes(device) - 
log->data_offset;
+   log->device_size = dev_total_bytes - log->data_offset;
log->device_size = round_down(log->device_size, PAGE_SIZE);
+
+   /*
+* when device has been included in fs_devices, do not take
+* into account this device when checking log size.
+*/
+   if (device->in_fs_metadata)
+   num_devices--;
+
+   if (log->device_size < BTRFS_STRIPE_LEN * num_devices * 2) {
+   btrfs_info(fs_info, "r5log log device size (%llu < %llu) is too 
small", log->device_size, BTRFS_STRIPE_LEN * num_devices * 2);
+   kfree(log);
+   return ERR_PTR(-EINVAL);
+   }
+
log->dev = device;
log->fs_info = fs_info;
ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE);
log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid));
mutex_init(&log->io_mutex);
 
+   return log;
+}
+
+void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, struct 
btrfs_r5l_log *log)
+{
cmpxchg(&fs_info->r5log, NULL, log);
ASSERT(fs_info->r5log == log);
 
 #ifdef BTRFS_DEBUG_R5LOG
-   trace_printk("r5log: set a r5log in fs_info,  alloc_range 0x%llx 
0x%llx",
+   trace_printk("r5log: set a r5log in fs_info,  alloc_range 0x%llx 
0x%llx\n",
 log->data_offset, log->data_offset + log->device_size);
 #endif
+}
+
+int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device)
+{
+   struct btrfs_r5l_log *log;
+
+   log = btrfs_r5l_init_log_prepare(fs_info, device, device->bdev);
+   if (IS_ERR(log))
+   return PTR_ERR(log);
+
+   btrfs_r5l_init_log_post(fs_info, log);
    return 0;
 }
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 569cec8..f6d6f36 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -134,6 +134,11 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio 
*rbio);
 
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
+struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info 
*fs_info,
+ struct btrfs_device *device,
+ struct block_device *bdev);
+void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info,
+struct btrfs_r5l_log *log);
 int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device 
*device);
 int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp);
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ac64d93..851c001 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2327,6 +2327,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
int seeding_dev = 0;
int ret = 0;
bool is_r5log = (flags & BTRFS_DEVICE_RAID56_LOG);
+   struct btrfs_r5l_log *r5log = NULL;
 
if (is_r5log)
ASSERT(!fs_info->fs_devices->seeding);
@

[PATCH 10/14] Btrfs: raid56: use the readahead helper to get page

2017-08-01 Thread Liu Bo
This updates recovery code to use the readahead helper.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 24f7cbb..8f47e56 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1608,7 +1608,9 @@ static int btrfs_r5l_recover_load_meta(struct 
btrfs_r5l_recover_ctx *ctx)
 {
struct btrfs_r5l_meta_block *mb;
 
-   btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, 
ctx->meta_page, REQ_OP_READ);
+   ret = btrfs_r5l_recover_read_page(ctx, ctx->meta_page, ctx->pos);
+   if (ret)
+   return ret;
 
mb = kmap(ctx->meta_page);
 #ifdef BTRFS_DEBUG_R5LOG
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/14] Btrfs: raid56: add reclaim support

2017-08-01 Thread Liu Bo
The log space is limited, so reclaim is necessary when there is not enough 
space to use.

By recording the largest position we've written to the log disk and
flushing all disks' cache and the superblock, we can be sure that data
and parity before this position have the identical copy in the log and
raid5/6 array.

Also we need to take care of the case when IOs get reordered.  A list
is used to keep the order right.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/ctree.h   | 10 +++-
 fs/btrfs/raid56.c  | 63 --
 fs/btrfs/transaction.c |  2 ++
 3 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d967627..9235643 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -244,8 +244,10 @@ struct btrfs_super_block {
__le64 cache_generation;
__le64 uuid_tree_generation;
 
+   /* r5log journal tail (where recovery starts) */
+   __le64 journal_tail;
/* future expansion */
-   __le64 reserved[30];
+   __le64 reserved[29];
u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 } __attribute__ ((__packed__));
@@ -2291,6 +2293,8 @@ BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct 
btrfs_super_block,
 log_root_transid, 64);
 BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block,
 log_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_journal_tail, struct btrfs_super_block,
+journal_tail, 64);
 BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block,
 total_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block,
@@ -3284,6 +3288,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
 
+/* raid56.c */
+void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info);
+
+
 static inline __printf(2, 3)
 void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
 {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 007ba63..60010a6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -191,6 +191,8 @@ struct btrfs_r5l_log {
u64 data_offset;
u64 device_size;
 
+   u64 next_checkpoint;
+
u64 last_checkpoint;
u64 last_cp_seq;
u64 seq;
@@ -1231,11 +1233,14 @@ static void btrfs_r5l_log_endio(struct bio *bio)
bio_put(bio);
 
 #ifdef BTRFS_DEBUG_R5LOG
-   trace_printk("move data to disk\n");
+   trace_printk("move data to disk(current log->next_checkpoint %llu (will 
be %llu after writing to RAID\n", log->next_checkpoint, io->log_start);
 #endif
/* move data to RAID. */
btrfs_write_rbio(io->rbio);
 
+   /* After stripe data has been flushed into raid, set ->next_checkpoint. 
*/
+   log->next_checkpoint = io->log_start;
+
if (log->current_io == io)
log->current_io = NULL;
btrfs_r5l_free_io_unit(log, io);
@@ -1473,6 +1478,42 @@ static bool btrfs_r5l_has_free_space(struct 
btrfs_r5l_log *log, u64 size)
 }
 
 /*
+ * writing super with log->next_checkpoint
+ *
+ * This is protected by log->io_mutex.
+ */
+static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp)
+{
+   int ret;
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("r5l writing super to reclaim space, cp %llu\n", cp);
+#endif
+
+   btrfs_set_super_journal_tail(fs_info->super_for_commit, cp);
+
+   /*
+* flush all disk cache so that all data prior to
+* %next_checkpoint lands on raid disks(recovery will start
+* from %next_checkpoint).
+*/
+   ret = write_all_supers(fs_info, 1);
+   ASSERT(ret == 0);
+}
+
+/* this is called by commit transaction and it's followed by writing super. */
+void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info)
+{
+   if (fs_info->r5log) {
+   u64 cp = READ_ONCE(fs_info->r5log->next_checkpoint);
+
+   trace_printk("journal_tail %llu\n", cp);
+   btrfs_set_super_journal_tail(fs_info->super_copy, cp);
+   WRITE_ONCE(fs_info->r5log->last_checkpoint, cp);
+   }
+}
+
+/*
  * return 0 if data/parity are written into log and it will move data
  * to RAID in endio.
  *
@@ -1535,7 +1576,25 @@ static int btrfs_r5l_write_stripe(struct btrfs_raid_bio *rbio)
btrfs_r5l_log_stripe(log, data_pages, parity_pages, rbio);
do_submit = true;
} else {
-   ; /* XXX: reclaim */
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("r5log: no space log->last_checkpoint %llu 
log->log

[PATCH 01/14] Btrfs: raid56: add raid56 log via add_dev v2 ioctl

2017-08-01 Thread Liu Bo
This introduces the add_dev_v2 ioctl to add a device as raid56 journal
device.  With the help of a journal device, raid56 is able to get
rid of potential write holes.
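The flow above can be exercised from userspace roughly as in the sketch
below.  Note that the real ioctl number, the exact
btrfs_ioctl_vol_args_v2 layout and the BTRFS_DEVICE_RAID56_LOG bit live
in this series' uapi headers (not fully quoted here), so the values used
are stand-ins for illustration only.

```c
/*
 * Userspace sketch of driving the add_dev_v2 flow described above.
 * The struct layout and the BTRFS_DEVICE_RAID56_LOG bit are assumed
 * stand-ins; the real definitions come from the series' uapi headers.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_PATH_NAME_MAX 4087
#define BTRFS_DEVICE_RAID56_LOG (1ULL << 3)	/* assumed flag bit */

struct btrfs_ioctl_vol_args_v2 {
	int64_t fd;
	uint64_t transid;
	uint64_t flags;
	uint64_t unused[4];
	char name[BTRFS_PATH_NAME_MAX + 1];
};

/* fill the args the way btrfs_ioctl_add_dev_v2() expects them */
static void prepare_add_log_dev(struct btrfs_ioctl_vol_args_v2 *args,
				const char *path)
{
	memset(args, 0, sizeof(*args));
	args->flags = BTRFS_DEVICE_RAID56_LOG;
	strncpy(args->name, path, BTRFS_PATH_NAME_MAX);
	/* the kernel side NUL-terminates defensively; mirror it */
	args->name[BTRFS_PATH_NAME_MAX] = '\0';
}
```

With the real headers, the filled args would then be handed to
ioctl(fd, BTRFS_IOC_ADD_DEV_V2, &args) on a mounted filesystem.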

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/ctree.h|  6 ++
 fs/btrfs/ioctl.c| 48 -
 fs/btrfs/raid56.c   | 42 
 fs/btrfs/raid56.h   |  1 +
 fs/btrfs/volumes.c  | 26 --
 fs/btrfs/volumes.h  |  3 ++-
 include/uapi/linux/btrfs.h  |  3 +++
 include/uapi/linux/btrfs_tree.h |  4 
 8 files changed, 125 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 643c70d..d967627 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -697,6 +697,7 @@ struct btrfs_stripe_hash_table {
 void btrfs_init_async_reclaim_work(struct work_struct *work);
 
 /* fs_info */
+struct btrfs_r5l_log;
 struct reloc_control;
 struct btrfs_device;
 struct btrfs_fs_devices;
@@ -1114,6 +1115,9 @@ struct btrfs_fs_info {
u32 nodesize;
u32 sectorsize;
u32 stripesize;
+
+   /* raid56 log */
+   struct btrfs_r5l_log *r5log;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
@@ -2932,6 +2936,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info)
 
 static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 {
+   if (fs_info->r5log)
+   kfree(fs_info->r5log);
kfree(fs_info->balance_ctl);
kfree(fs_info->delayed_root);
kfree(fs_info->extent_root);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..3d1ef4d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2653,6 +2653,50 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
return ret;
 }
 
+/* identical to btrfs_ioctl_add_dev, but this is with flags */
+static long btrfs_ioctl_add_dev_v2(struct btrfs_fs_info *fs_info, void __user *arg)
+{
+   struct btrfs_ioctl_vol_args_v2 *vol_args;
+   int ret;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
+   return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
+
+   mutex_lock(&fs_info->volume_mutex);
+   vol_args = memdup_user(arg, sizeof(*vol_args));
+   if (IS_ERR(vol_args)) {
+   ret = PTR_ERR(vol_args);
+   goto out;
+   }
+
+   if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG &&
+   fs_info->r5log) {
+   ret = -EEXIST;
+   btrfs_info(fs_info, "r5log: attempting to add another log device!");
+   goto out_free;
+   }
+
+   vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
+   ret = btrfs_init_new_device(fs_info, vol_args->name, vol_args->flags);
+   if (!ret) {
+   if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG) {
+   ASSERT(fs_info->r5log);
+   btrfs_info(fs_info, "disk added %s as raid56 log", vol_args->name);
+   } else {
+   btrfs_info(fs_info, "disk added %s", vol_args->name);
+   }
+   }
+out_free:
+   kfree(vol_args);
+out:
+   mutex_unlock(&fs_info->volume_mutex);
+   clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+   return ret;
+}
+
 static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
 {
struct btrfs_ioctl_vol_args *vol_args;
@@ -2672,7 +2716,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
}
 
vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
-   ret = btrfs_init_new_device(fs_info, vol_args->name);
+   ret = btrfs_init_new_device(fs_info, vol_args->name, 0);
 
if (!ret)
btrfs_info(fs_info, "disk added %s", vol_args->name);
@@ -5539,6 +5583,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_resize(file, argp);
case BTRFS_IOC_ADD_DEV:
return btrfs_ioctl_add_dev(fs_info, argp);
+   case BTRFS_IOC_ADD_DEV_V2:
+   return btrfs_ioctl_add_dev_v2(fs_info, argp);
case BTRFS_IOC_RM_DEV:
        return btrfs_ioctl_rm_dev(file, argp);
    case BTRFS_IOC_RM_DEV_V2:
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d8ea0eb..2b91b95 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -177,6 +177,25 @@ struct btrfs_raid_bio {
unsigned long *dbitmap;
 };
 
+/* raid56 log */
+struct btrfs_r5l_log {
+   /* protect this struct and log io */
+   struct mutex io_mutex;
+
+   /* r5log device */
+   struct btrfs_device *dev;
+
+   /* allocation range for log entries */
+   u64 data_offset;
+   u64 device_size;
+
+   u6

[PATCH 09/14] Btrfs: raid56: add readahead for recovery

2017-08-01 Thread Liu Bo
While doing recovery, blocks are read from the raid5/6 log one by
one, so this adds readahead so that we can read up to 256 contiguous
blocks in one read IO.
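The window logic used by the recovery readahead can be modeled in a few
lines.  This is only a sketch of the hit/refill rule described above;
the 4K page size and the 256-page pool (mirroring BIO_MAX_PAGES) are
assumptions.

```c
/*
 * Minimal model of the recovery readahead window used by
 * btrfs_r5l_recover_read_page()/btrfs_r5l_recover_read_ra(): a read
 * hits the cached window iff it falls inside
 * [start_offset, start_offset + valid * PAGE_SIZE); otherwise the
 * window is refilled starting at the requested offset.
 */
#include <assert.h>
#include <stdint.h>

#define RA_PAGE_SIZE 4096ULL	/* assumed 4K pages */
#define RA_POOL_SIZE 256	/* mirrors BIO_MAX_PAGES */

struct ra_window {
	uint64_t start_offset;	/* log offset of ra_pages[0] */
	int valid;		/* pages currently cached */
};

static int ra_hits(const struct ra_window *w, uint64_t offset)
{
	return offset >= w->start_offset &&
	       offset < w->start_offset + (uint64_t)w->valid * RA_PAGE_SIZE;
}

/* index into ra_pages[] for a page-aligned offset inside the window */
static int ra_index(const struct ra_window *w, uint64_t offset)
{
	return (int)((offset - w->start_offset) / RA_PAGE_SIZE);
}
```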

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 114 +++---
 1 file changed, 109 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index dea33c4..24f7cbb 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1530,15 +1530,81 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos,
return ret;
 }
 
+#define BTRFS_R5L_RECOVER_IO_POOL_SIZE BIO_MAX_PAGES
 struct btrfs_r5l_recover_ctx {
u64 pos;
u64 seq;
u64 total_size;
struct page *meta_page;
struct page *io_page;
+
+   struct page *ra_pages[BTRFS_R5L_RECOVER_IO_POOL_SIZE];
+   struct bio *ra_bio;
+   int total;
+   int valid;
+   u64 start_offset;
+
+   struct btrfs_r5l_log *log;
 };
 
-static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx)
+static int btrfs_r5l_recover_read_ra(struct btrfs_r5l_recover_ctx *ctx, u64 offset)
+{
+   bio_reset(ctx->ra_bio);
+   ctx->ra_bio->bi_bdev = ctx->log->dev->bdev;
+   ctx->ra_bio->bi_opf = REQ_OP_READ;
+   ctx->ra_bio->bi_iter.bi_sector = (ctx->log->data_offset + offset) >> 9;
+
+   ctx->valid = 0;
+   ctx->start_offset = offset;
+
+   while (ctx->valid < ctx->total) {
+   bio_add_page(ctx->ra_bio, ctx->ra_pages[ctx->valid++], PAGE_SIZE, 0);
+
+   offset = btrfs_r5l_ring_add(ctx->log, offset, PAGE_SIZE);
+   if (offset == 0)
+   break;
+   }
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("to read %d pages starting from 0x%llx\n", ctx->valid, ctx->log->data_offset + ctx->start_offset);
+#endif
+   return submit_bio_wait(ctx->ra_bio);
+}
+
+static int btrfs_r5l_recover_read_page(struct btrfs_r5l_recover_ctx *ctx, struct page *page, u64 offset)
+{
+   struct page *tmp;
+   int index;
+   char *src;
+   char *dst;
+   int ret;
+
+   if (offset < ctx->start_offset || offset >= (ctx->start_offset + ctx->valid * PAGE_SIZE)) {
+   ret = btrfs_r5l_recover_read_ra(ctx, offset);
+   if (ret)
+   return ret;
+   }
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("offset 0x%llx start->offset 0x%llx ctx->valid %d\n", offset, ctx->start_offset, ctx->valid);
+#endif
+
+   ASSERT(IS_ALIGNED(ctx->start_offset, PAGE_SIZE));
+   ASSERT(IS_ALIGNED(offset, PAGE_SIZE));
+
+   index = (offset - ctx->start_offset) >> PAGE_SHIFT;
+   ASSERT(index < ctx->valid);
+
+   tmp = ctx->ra_pages[index];
+   src = kmap(tmp);
+   dst = kmap(page);
+   memcpy(dst, src, PAGE_SIZE);
+   kunmap(page);
+   kunmap(tmp);
+   return 0;
+}
+
+static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_recover_ctx *ctx)
 {
struct btrfs_r5l_meta_block *mb;
 
@@ -1642,6 +1708,42 @@ static int btrfs_r5l_recover_flush_log(struct btrfs_r5l_log *log, struct btrfs_r
}
 
return ret;
+
+static int btrfs_r5l_recover_allocate_ra(struct btrfs_r5l_recover_ctx *ctx)
+{
+   struct page *page;
+   ctx->ra_bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
+
+   ctx->total = 0;
+   ctx->valid = 0;
+   while (ctx->total < BTRFS_R5L_RECOVER_IO_POOL_SIZE) {
+   page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   if (!page)
+   break;
+
+   ctx->ra_pages[ctx->total++] = page;
+   }
+
+   if (ctx->total == 0) {
+   bio_put(ctx->ra_bio);
+   return -ENOMEM;
+   }
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("readahead: %d allocated pages\n", ctx->total);
+#endif
+   return 0;
+}
+
+static void btrfs_r5l_recover_free_ra(struct btrfs_r5l_recover_ctx *ctx)
+{
+   int i;
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("readahead: %d to free pages\n", ctx->total);
+#endif
+   for (i = 0; i < ctx->total; i++)
+   __free_page(ctx->ra_pages[i]);
+   bio_put(ctx->ra_bio);
 }
 
 static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp);
@@ -1655,6 +1757,7 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log)
ctx = kzalloc(sizeof(*ctx), GFP_NOFS);
ASSERT(ctx);
 
+   ctx->log = log;
ctx->pos = log->last_checkpoint;
ctx->seq = log->last_cp_seq;
ctx->meta_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
@@ -1662,10 +1765,10 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log)
ctx->io_page =

[PATCH 14/14] Btrfs: raid56: maintain IO order on raid5/6 log

2017-08-01 Thread Liu Bo
A typical write to the raid5/6 log needs three steps:

1) collect data/parity pages into the bio in io_unit;
2) submit the bio in io_unit;
3) writeback data/parity to raid array in end_io.

1) and 2) are protected within log->io_mutex, while 3) is not.

Since recovery needs to know the checkpoint offset where the highest
successful writeback is, we cannot allow IO to be reordered.  This is
adding a list in which IO order is maintained properly.
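The ordering rule above can be modeled in a few lines: io units may
complete out of order, but the checkpoint only advances across the
longest completed prefix of the submission order.  This is a toy model
(an array stands in for the kernel list, STRIPE_END for
BTRFS_R5L_STRIPE_END), not the kernel code.

```c
/*
 * Toy model of btrfs_r5l_finish_io(): io units are kept in
 * submission order; completions may arrive out of order, but the
 * checkpoint only advances over the fully-completed prefix.
 */
#include <assert.h>
#include <stdint.h>

#define STRIPE_END 1

struct io_unit {
	uint64_t log_start;	/* where the next log entry would begin */
	int status;
};

/* pop completed units from the head only; return the new checkpoint */
static uint64_t advance_checkpoint(struct io_unit *ios, int n,
				   int *head, uint64_t cp)
{
	while (*head < n && ios[*head].status == STRIPE_END)
		cp = ios[(*head)++].log_start;
	return cp;
}
```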

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 42 ++
 fs/btrfs/raid56.h |  5 +
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b771d7d..ceca415 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -183,6 +183,9 @@ struct btrfs_r5l_log {
/* protect this struct and log io */
struct mutex io_mutex;
 
+   spinlock_t io_list_lock;
+   struct list_head io_list;
+
/* r5log device */
struct btrfs_device *dev;
 
@@ -1205,6 +1208,7 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *l
 static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io)
 {
__free_page(io->meta_page);
+   ASSERT(list_empty(&io->list));
kfree(io);
 }
 
@@ -1225,6 +1229,27 @@ static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_
io->need_split_bio = true;
 }
 
+/* the IO order is maintained in log->io_list. */
+static void btrfs_r5l_finish_io(struct btrfs_r5l_log *log)
+{
+   struct btrfs_r5l_io_unit *io, *next;
+
+   spin_lock(&log->io_list_lock);
+   list_for_each_entry_safe(io, next, &log->io_list, list) {
+   if (io->status != BTRFS_R5L_STRIPE_END)
+   break;
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("current log->next_checkpoint %llu (will be %llu after writing to RAID)\n", log->next_checkpoint, io->log_start);
+#endif
+
+   list_del_init(&io->list);
+   log->next_checkpoint = io->log_start;
+   btrfs_r5l_free_io_unit(log, io);
+   }
+   spin_unlock(&log->io_list_lock);
+}
+
 static void btrfs_write_rbio(struct btrfs_raid_bio *rbio);
 
 static void btrfs_r5l_log_endio(struct bio *bio)
@@ -1234,18 +1259,12 @@ static void btrfs_r5l_log_endio(struct bio *bio)
 
bio_put(bio);
 
-#ifdef BTRFS_DEBUG_R5LOG
-   trace_printk("move data to disk (current log->next_checkpoint %llu, will be %llu after writing to RAID)\n", log->next_checkpoint, io->log_start);
-#endif
/* move data to RAID. */
btrfs_write_rbio(io->rbio);
 
+   io->status = BTRFS_R5L_STRIPE_END;
 /* After stripe data has been flushed into raid, set ->next_checkpoint. */
-   log->next_checkpoint = io->log_start;
-
-   if (log->current_io == io)
-   log->current_io = NULL;
-   btrfs_r5l_free_io_unit(log, io);
+   btrfs_r5l_finish_io(log);
 }
 
 static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log)
@@ -1299,6 +1318,11 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log)
bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
 
btrfs_r5l_reserve_log_entry(log, io);
+
+   INIT_LIST_HEAD(&io->list);
+   spin_lock(&log->io_list_lock);
+   list_add_tail(&io->list, &log->io_list);
+   spin_unlock(&log->io_list_lock);
return io;
 }
 
@@ -3760,6 +3784,8 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info,
ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE);
log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid));
 mutex_init(&log->io_mutex);
+   spin_lock_init(&log->io_list_lock);
+   INIT_LIST_HEAD(&log->io_list);
 
    return log;
 }
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 2cc64a3..fc4ff20 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -43,11 +43,16 @@ static inline int nr_data_stripes(struct map_lookup *map)
 struct btrfs_r5l_log;
 #define BTRFS_R5LOG_MAGIC 0x6433c509
 
+#define BTRFS_R5L_STRIPE_END 1
+
 /* one meta block + several data + parity blocks */
 struct btrfs_r5l_io_unit {
struct btrfs_r5l_log *log;
struct btrfs_raid_bio *rbio;
 
+   struct list_head list;
+   int status;
+
/* store meta block */
struct page *meta_page;
 
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/14] Btrfs: raid56: load r5log

2017-08-01 Thread Liu Bo
A raid5/6 log can be loaded while mounting a btrfs which already has
a disk set up as raid5/6 log, or while setting up a disk as raid5/6
log for the first time.

It gets %journal_tail from the super_block, reads the 4K block at that
position and runs sanity checks on it.  If the block is valid, it goes
on to check whether anything needs to be replayed; otherwise it creates
a new empty block at the beginning of the disk and new writes will
append to it.
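The sanity check described above boils down to something like the
following sketch.  The field layout is simplified; on disk these fields
are little-endian and the kernel code goes through
le32_to_cpu()/le64_to_cpu() first, so the sketch assumes already
converted values.  BTRFS_R5LOG_MAGIC is the value from raid56.h in this
series.

```c
/*
 * Sketch of the meta block validity test done at load time: the magic
 * must match and the block must claim the position it was read from.
 */
#include <assert.h>
#include <stdint.h>

#define BTRFS_R5LOG_MAGIC 0x6433c509u

struct r5l_meta_block {
	uint32_t magic;
	uint32_t meta_size;
	uint64_t seq;
	uint64_t position;
};

static int meta_block_valid(const struct r5l_meta_block *mb, uint64_t pos)
{
	return mb->magic == BTRFS_R5LOG_MAGIC && mb->position == pos;
}
```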

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/disk-io.c |  16 +++
 fs/btrfs/raid56.c  | 128 +
 fs/btrfs/raid56.h  |   1 +
 3 files changed, 145 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8685d67..c2d8697 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2987,6 +2987,22 @@ int open_ctree(struct super_block *sb,
fs_info->generation = generation;
fs_info->last_trans_committed = generation;
 
+   if (fs_info->r5log) {
+   u64 cp = btrfs_super_journal_tail(fs_info->super_copy);
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("%s: get journal_tail %llu\n", __func__, cp);
+#endif
+   /* If the data is not replayed, data and parity on
+    * disk are still consistent, so we can move on.
+    *
+    * About fsync: since fsync makes sure data is
+    * flushed onto disk and only metadata is kept in the
+    * write-ahead log, the fsync'd data will never end
+    * up being replayed by the raid56 log.
+    */
+   btrfs_r5l_load_log(fs_info, cp);
+   }
+
ret = btrfs_recover_balance(fs_info);
if (ret) {
btrfs_err(fs_info, "failed to recover balance: %d", ret);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 60010a6..5d7ea235 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1477,6 +1477,134 @@ static bool btrfs_r5l_has_free_space(struct btrfs_r5l_log *log, u64 size)
return log->device_size > (used_size + size);
 }
 
+static int btrfs_r5l_sync_page_io(struct btrfs_r5l_log *log,
+ struct btrfs_device *dev, sector_t sector,
+ int size, struct page *page, int op)
+{
+   struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
+   int ret;
+
+   bio->bi_bdev = dev->bdev;
+   bio->bi_opf = op;
+   if (dev == log->dev)
+   bio->bi_iter.bi_sector = (log->data_offset >> 9) + sector;
+   else
+   bio->bi_iter.bi_sector = sector;
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("%s: op %d bi_sector 0x%llx\n", __func__, op, (bio->bi_iter.bi_sector << 9));
+#endif
+
+   bio_add_page(bio, page, size, 0);
+   submit_bio_wait(bio);
+   ret = !bio->bi_error;
+   bio_put(bio);
+   return ret;
+}
+
+static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, u64 seq)
+{
+   struct page *page;
+   struct btrfs_r5l_meta_block *mb;
+   int ret = 0;
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("%s: pos %llu seq %llu\n", __func__, pos, seq);
+#endif
+
+   page = alloc_page(GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO);
+   ASSERT(page);
+
+   mb = kmap(page);
+   mb->magic = cpu_to_le32(BTRFS_R5LOG_MAGIC);
+   mb->meta_size = cpu_to_le32(sizeof(struct btrfs_r5l_meta_block));
+   mb->seq = cpu_to_le64(seq);
+   mb->position = cpu_to_le64(pos);
+   kunmap(page);
+
+   if (!btrfs_r5l_sync_page_io(log, log->dev, (pos >> 9), PAGE_SIZE, page, REQ_OP_WRITE | REQ_FUA)) {
+   ret = -EIO;
+   }
+
+   __free_page(page);
+   return ret;
+}
+
+static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp);
+
+static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log)
+{
+   return 0;
+}
+
+/* return 0 if success, otherwise return errors */
+int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp)
+{
+   struct btrfs_r5l_log *log = fs_info->r5log;
+   struct page *page;
+   struct btrfs_r5l_meta_block *mb;
+   bool create_new = false;
+
+   ASSERT(log);
+
+   page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   ASSERT(page);
+
+   if (!btrfs_r5l_sync_page_io(log, log->dev, (cp >> 9), PAGE_SIZE, page,
+   REQ_OP_READ)) {
+   __free_page(page);
+   return -EIO;
+   }
+
+   mb = kmap(page);
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("r5l: mb->pos %llu cp %llu mb->seq %llu\n", le64_to_cpu(mb->position), cp, le64_to_cpu(mb->seq));
+#endif
+
+   if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC) {
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("magic not match: create new r5l\n");
+#endif
+   

[PATCH 05/14] Btrfs: raid56: add stripe log for raid5/6

2017-08-01 Thread Liu Bo
This is adding the ability to use a disk as raid5/6's stripe log (aka
journal).  The primary goal is to fix the write hole issue that is
inherent in a raid56 setup.

In a typical raid5/6 setup, both full stripe write and a partial
stripe write will generate parity at the very end of writing, so after
parity is generated, it's the right time to issue writes.

Now with raid5/6's stripe log, every write will be put into the stripe
log prior to being written to raid5/6 array, so that we have
everything to rewrite all 'not-yet-on-disk' data/parity if a power
loss happens while writing data/parity to different disks in raid5/6
array.

A metadata block is used here to manage the information of data and
parity and it's placed ahead of data and parity on stripe log.  Right
now such metadata block is limited to one page size and the structure
is defined as

{metadata block} + {a few payloads}

- 'metadata block' contains a magic code, a sequence number and the
  start position on the stripe log.

- 'payload' contains the information about data and parity, e.g. the
 physical offset and device id where data/parity is supposed to be.

Each data block has a payload while each set of parity has a payload
(e.g. for raid6, parity p and q has their own payload respectively).

And we treat data and parity differently because btrfs always prepares
the whole stripe length (64k) of parity, but data may only come from a
partial stripe write.

This metadata block is written to the raid5/6 stripe log together with
data/parity in a single bio (it could be two bios; more than two bios
are not supported).
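Under the layout above, the space one log entry occupies can be sketched
as follows.  This is a toy walk, assuming 4K pages and the 64K
BTRFS_STRIPE_LEN: the meta block takes one page, each data payload is
backed by a single page in the io area, and each parity payload by a
full stripe.

```c
/*
 * Toy walk of the {metadata block} + {payloads} layout described
 * above.  Sizes are assumptions (4K pages, 64K stripes); only the
 * shape of the accounting matters.
 */
#include <assert.h>
#include <stdint.h>

#define R5L_PAGE_SIZE 4096u
#define R5L_STRIPE_LEN (64u * 1024u)	/* BTRFS_STRIPE_LEN */

enum r5l_payload_type { R5L_PAYLOAD_DATA, R5L_PAYLOAD_PARITY };

/* total log space one entry occupies for the given payload list */
static uint32_t log_entry_size(const enum r5l_payload_type *types, int n)
{
	uint32_t size = R5L_PAGE_SIZE;	/* the meta block itself */

	for (int i = 0; i < n; i++)
		size += (types[i] == R5L_PAYLOAD_DATA) ?
			R5L_PAGE_SIZE : R5L_STRIPE_LEN;
	return size;
}
```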

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 512 +++---
 fs/btrfs/raid56.h |  65 +++
 2 files changed, 513 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c75766f..007ba63 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -185,6 +185,8 @@ struct btrfs_r5l_log {
/* r5log device */
struct btrfs_device *dev;
 
+   struct btrfs_fs_info *fs_info;
+
/* allocation range for log entries */
u64 data_offset;
u64 device_size;
@@ -1179,6 +1181,445 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 spin_unlock_irq(&rbio->bio_list_lock);
 }
 
+/* r5log */
+/* XXX: this allocation may be done earlier, eg. when allocating rbio */
+static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *log)
+{
+   struct btrfs_r5l_io_unit *io;
+   gfp_t gfp = GFP_NOFS;
+
+   io = kzalloc(sizeof(*io), gfp);
+   ASSERT(io);
+   io->log = log;
+   /* need to use kmap. */
+   io->meta_page = alloc_page(gfp | __GFP_HIGHMEM | __GFP_ZERO);
+   ASSERT(io->meta_page);
+
+   return io;
+}
+
+static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io)
+{
+   __free_page(io->meta_page);
+   kfree(io);
+}
+
+static u64 btrfs_r5l_ring_add(struct btrfs_r5l_log *log, u64 start, u64 inc)
+{
+   start += inc;
+   if (start >= log->device_size)
+   start = start - log->device_size;
+   return start;
+}
+
+static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io)
+{
+   log->log_start = btrfs_r5l_ring_add(log, log->log_start, PAGE_SIZE);
+   io->log_end = log->log_start;
+
+   if (log->log_start == 0)
+   io->need_split_bio = true;
+}
+
+static void btrfs_write_rbio(struct btrfs_raid_bio *rbio);
+
+static void btrfs_r5l_log_endio(struct bio *bio)
+{
+   struct btrfs_r5l_io_unit *io = bio->bi_private;
+   struct btrfs_r5l_log *log = io->log;
+
+   bio_put(bio);
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("move data to disk\n");
+#endif
+   /* move data to RAID. */
+   btrfs_write_rbio(io->rbio);
+
+   if (log->current_io == io)
+   log->current_io = NULL;
+   btrfs_r5l_free_io_unit(log, io);
+}
+
+static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log)
+{
+   /* this allocation will not fail. */
+   struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
+
+   /* We need to make sure data/parity are settled down on the log disk. */
+   bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
+   bio->bi_bdev = log->dev->bdev;
+
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("log->data_offset 0x%llx log->log_start 0x%llx\n", log->data_offset, log->log_start);
+#endif
+   bio->bi_iter.bi_sector = (log->data_offset + log->log_start) >> 9;
+
+   return bio;
+}
+
+static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log)
+{
+   struct btrfs_r5l_io_unit *io;
+   struct btrfs_r5l_meta_block *block;
+
+   io = btrfs_r5l_alloc_io_unit(log);
+   ASSERT(io);
+
+   block = kmap(io->meta_page);
+  

[PATCH 08/14] Btrfs: raid56: log recovery

2017-08-01 Thread Liu Bo
This is adding recovery on raid5/6 log.

We've set a %journal_tail in super_block, which indicates the position
from where we need to replay data.  So we scan the log and replay
valid meta/data/parity pairs until finding an invalid one.  By
replaying, it simply reads data/parity from the raid5/6 log and issues
writes to the raid disks where they should be.  Please note that the
whole meta/data/parity pair can be discarded if it fails the sanity
check in the meta block.

After recovery, we also append an empty meta block and update the
%journal_tail in super_block in order to avoid a situation, where the
layout on the raid5/6 log is

[valid A][invalid B][valid C],

so block A is the only one we should replay.

Then the recovery ends up pointing to block A as block B is invalid,
and some new writes come in and append to block A so that block B is
now overwritten to be a valid meta/data/parity.  If a power loss
happens, the new recovery starts again from block A, and since block B
is now valid, it may replay block C as well, which has become stale.
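The scan described above can be modeled as a cursor that advances the
sequence by one and the position by the pair's total size for each valid
meta/data/parity pair, wrapping at the end of the log device the same
way btrfs_r5l_ring_add() does.  This is a toy model of the loop shape,
not the kernel code.

```c
/*
 * Toy replay cursor for the recovery scan: ring_add() mirrors the
 * wrap-around arithmetic of btrfs_r5l_ring_add().
 */
#include <assert.h>
#include <stdint.h>

struct replay_cursor {
	uint64_t pos;
	uint64_t seq;
};

static uint64_t ring_add(uint64_t start, uint64_t inc, uint64_t dev_size)
{
	start += inc;
	if (start >= dev_size)
		start -= dev_size;
	return start;
}

/* one valid meta/data/parity pair consumed: bump seq, advance pos */
static void replay_advance(struct replay_cursor *c, uint64_t total_size,
			   uint64_t dev_size)
{
	c->pos = ring_add(c->pos, total_size, dev_size);
	c->seq++;
}
```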

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5d7ea235..dea33c4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1530,10 +1530,161 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos,
return ret;
 }
 
+struct btrfs_r5l_recover_ctx {
+   u64 pos;
+   u64 seq;
+   u64 total_size;
+   struct page *meta_page;
+   struct page *io_page;
+};
+
+static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx)
+{
+   struct btrfs_r5l_meta_block *mb;
+
+   btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, ctx->meta_page, REQ_OP_READ);
+
+   mb = kmap(ctx->meta_page);
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("ctx->pos %llu ctx->seq %llu pos %llu seq %llu\n", ctx->pos, ctx->seq, le64_to_cpu(mb->position), le64_to_cpu(mb->seq));
+#endif
+
+   if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC ||
+   le64_to_cpu(mb->position) != ctx->pos ||
+   le64_to_cpu(mb->seq) != ctx->seq) {
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("%s: mismatch magic %u default %u\n", __func__, le32_to_cpu(mb->magic), BTRFS_R5LOG_MAGIC);
+#endif
+   return -EINVAL;
+   }
+
+   ASSERT(le32_to_cpu(mb->meta_size) <= PAGE_SIZE);
+   kunmap(ctx->meta_page);
+
+   /* meta_block */
+   ctx->total_size = PAGE_SIZE;
+
+   return 0;
+}
+
+static int btrfs_r5l_recover_load_data(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx)
+{
+   u64 offset;
+   struct btrfs_r5l_meta_block *mb;
+   u64 meta_size;
+   u64 io_offset;
+   struct btrfs_device *dev;
+
+   mb = kmap(ctx->meta_page);
+
+   io_offset = PAGE_SIZE;
+   offset = sizeof(struct btrfs_r5l_meta_block);
+   meta_size = le32_to_cpu(mb->meta_size);
+
+   while (offset < meta_size) {
+   struct btrfs_r5l_payload *payload = (void *)mb + offset;
+
+   /* read data from log disk and write to payload->location */
+#ifdef BTRFS_DEBUG_R5LOG
+   trace_printk("payload type %d flags %d size %d location 0x%llx devid %llu\n", le16_to_cpu(payload->type), le16_to_cpu(payload->flags), le32_to_cpu(payload->size), le64_to_cpu(payload->location), le64_to_cpu(payload->devid));
+#endif
+
+   dev = btrfs_find_device(log->fs_info, le64_to_cpu(payload->devid), NULL, NULL);
+   if (!dev || dev->missing) {
+   ASSERT(0);
+   }
+
+   if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_DATA) {
+   ASSERT(le32_to_cpu(payload->size) == 1);
+           btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos + io_offset) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_READ);
+           btrfs_r5l_sync_page_io(log, dev, le64_to_cpu(payload->location) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_WRITE);
+   io_offset += PAGE_SIZE;
+   } else if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_PARITY) {
+   int i;
+   ASSERT(le32_to_cpu(payload->size) == 16);
+   for (i = 0; i < le32_to_cpu(payload->size); i++) {
+   /* liubo: parity are guaranteed to be
+* contiguous, use just one bio to
+* hold all pages and flush them. */
u64 parity_off = le64_to_cpu(payload->location) + i * PAGE_SIZE;
+   btrfs_r5l_sync_page_io(log,

[PATCH 04/14] Btrfs: raid56: add verbose debug

2017-08-01 Thread Liu Bo
Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c  | 2 ++
 fs/btrfs/volumes.c | 7 ++-
 fs/btrfs/volumes.h | 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 2b91b95..c75766f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2753,7 +2753,9 @@ int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device)
 cmpxchg(&fs_info->r5log, NULL, log);
ASSERT(fs_info->r5log == log);
 
+#ifdef BTRFS_DEBUG_R5LOG
 trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx",
 log->data_offset, log->data_offset + log->device_size);
+#endif
return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a17a488..ac64d93 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4731,8 +4731,13 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 
if (!device->in_fs_metadata ||
device->is_tgtdev_for_dev_replace ||
-   (device->type & BTRFS_DEV_RAID56_LOG))
+   (device->type & BTRFS_DEV_RAID56_LOG)) {
+#ifdef BTRFS_DEBUG_R5LOG
+   if (device->type & BTRFS_DEV_RAID56_LOG)
+           btrfs_info(info, "skip a r5log when alloc chunk");
+#endif
continue;
+   }
 
if (device->total_bytes > device->bytes_used)
total_avail = device->total_bytes - device->bytes_used;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 60e347a..44cc3fa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -26,6 +26,10 @@
 
 extern struct mutex uuid_mutex;
 
+#ifdef CONFIG_BTRFS_DEBUG
+#define BTRFS_DEBUG_R5LOG
+#endif
+
 #define BTRFS_STRIPE_LEN   SZ_64K
 
 struct buffer_head;
-- 
2.9.4



[PATCH 13/14] Btrfs: raid56: initialize raid5/6 log after adding it

2017-08-01 Thread Liu Bo
We need to initialize the raid5/6 log after adding it, but we don't
want to race with concurrent writes.  So we initialize it before
assigning the log pointer in %fs_info.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/disk-io.c |  2 +-
 fs/btrfs/raid56.c  | 18 --
 fs/btrfs/raid56.h  |  3 ++-
 fs/btrfs/volumes.c |  2 ++
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c2d8697..3fbd347 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3000,7 +3000,7 @@ int open_ctree(struct super_block *sb,
 * write-ahead log, the fsync'd data will never end
 * up being replayed by the raid56 log.
 */
-   btrfs_r5l_load_log(fs_info, cp);
+   btrfs_r5l_load_log(fs_info, NULL, cp);
}
 
ret = btrfs_recover_balance(fs_info);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0bfc97a..b771d7d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1943,14 +1943,28 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log)
 }
 
 /* return 0 if success, otherwise return errors */
-int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp)
+int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *r5log, u64 cp)
 {
-   struct btrfs_r5l_log *log = fs_info->r5log;
+   struct btrfs_r5l_log *log;
struct page *page;
struct btrfs_r5l_meta_block *mb;
bool create_new = false;
int ret;
 
+   if (r5log)
+   ASSERT(fs_info->r5log == NULL);
+   if (fs_info->r5log)
+   ASSERT(r5log == NULL);
+
+   if (fs_info->r5log)
+   log = fs_info->r5log;
+   else
+   /*
+* this only happens when adding the raid56 log for
+* the first time.
+*/
+   log = r5log;
+
ASSERT(log);
 
page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index f6d6f36..2cc64a3 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -140,5 +140,6 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info,
 void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info,
 struct btrfs_r5l_log *log);
 int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device);
-int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp);
+int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info,
+  struct btrfs_r5l_log *r5log, u64 cp);
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 851c001..7f848d7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2521,6 +2521,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
}
 
if (is_r5log) {
+   /* initialize r5log with cp == 0. */
+   btrfs_r5l_load_log(fs_info, r5log, 0);
btrfs_r5l_init_log_post(fs_info, r5log);
}
 
-- 
2.9.4



[PATCH 03/14] Btrfs: raid56: detect raid56 log on mount

2017-08-01 Thread Liu Bo
We've put the flag BTRFS_DEV_RAID56_LOG in device->type, so we can
recognize the journal device of raid56 while reading the chunk tree.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/volumes.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5c50df7..a17a488 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6696,6 +6696,18 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
}
 
fill_device_from_item(leaf, dev_item, device);
+
+   if (device->type & BTRFS_DEV_RAID56_LOG) {
+   ret = btrfs_set_r5log(fs_info, device);
+   if (ret) {
+   btrfs_err(fs_info, "error %d on loading r5log", ret);
+   return ret;
+   }
+
+   btrfs_info(fs_info, "devid %llu uuid %pU is raid56 log",
+  device->devid, device->uuid);
+   }
+
device->in_fs_metadata = 1;
if (device->writeable && !device->is_tgtdev_for_dev_replace) {
device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.9.4



[PATCH 02/14] Btrfs: raid56: do not allocate chunk on raid56 log

2017-08-01 Thread Liu Bo
The journal device (aka raid56 log) is not for chunk allocation, let's
skip it.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dafc541..5c50df7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4730,7 +4730,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
}
 
if (!device->in_fs_metadata ||
-   device->is_tgtdev_for_dev_replace)
+   device->is_tgtdev_for_dev_replace ||
+   (device->type & BTRFS_DEV_RAID56_LOG))
continue;
 
if (device->total_bytes > device->bytes_used)
-- 
2.9.4



[PATCH 11/14] Btrfs: raid56: add csum support

2017-08-01 Thread Liu Bo
This adds checksums for the meta/data/parity blocks resident on the
raid5/6 log, so recovery can now verify the checksums to see whether
anything inside a meta/data/parity block has been changed.

If anything is wrong in a meta block, we stop replaying data/parity at
that position, while if anything is wrong in a data/parity block, we
just skip that meta/data/parity pair and move on to the next one.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/raid56.c | 235 --
 fs/btrfs/raid56.h |   4 +
 2 files changed, 197 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 8f47e56..8bc7ba4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -43,6 +43,7 @@
 #include "async-thread.h"
 #include "check-integrity.h"
 #include "rcu-string.h"
+#include "hash.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT1
@@ -197,6 +198,7 @@ struct btrfs_r5l_log {
u64 last_cp_seq;
u64 seq;
u64 log_start;
+   u32 uuid_csum;
struct btrfs_r5l_io_unit *current_io;
 };
 
@@ -1309,7 +1311,7 @@ static int btrfs_r5l_get_meta(struct btrfs_r5l_log *log, 
struct btrfs_raid_bio *
return 0;
 }
 
-static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, 
u64 location, u64 devid)
+static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, 
u64 location, u64 devid, u32 csum)
 {
struct btrfs_r5l_io_unit *io = log->current_io;
struct btrfs_r5l_payload *payload;
@@ -1326,11 +1328,11 @@ static void btrfs_r5l_append_payload_meta(struct 
btrfs_r5l_log *log, u16 type, u
payload->size = cpu_to_le32(16); /* stripe_len / PAGE_SIZE */
payload->devid = cpu_to_le64(devid);
payload->location = cpu_to_le64(location);
+   payload->csum = cpu_to_le32(csum);
kunmap(io->meta_page);
 
-   /* XXX: add checksum later */
io->meta_offset += sizeof(*payload);
-   //io->meta_offset += sizeof(__le32);
+
 #ifdef BTRFS_DEBUG_R5LOG
trace_printk("io->meta_offset %d\n", io->meta_offset);
 #endif
@@ -1380,6 +1382,10 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log 
*log, int data_pages, int
int meta_size;
int stripe, pagenr;
struct page *page;
+   char *kaddr;
+   u32 csum;
+   u64 location;
+   u64 devid;
 
/*
 * parity pages are contiguous on disk, thus only one
@@ -1394,8 +1400,6 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log 
*log, int data_pages, int
/* add data blocks which need to be written */
for (stripe = 0; stripe < rbio->nr_data; stripe++) {
for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) {
-   u64 location;
-   u64 devid;
if (stripe < rbio->nr_data) {
page = page_in_rbio(rbio, stripe, pagenr, 1);
if (!page)
@@ -1406,7 +1410,11 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log 
*log, int data_pages, int
 #ifdef BTRFS_DEBUG_R5LOG
trace_printk("data: stripe %d pagenr %d 
location 0x%llx devid %llu\n", stripe, pagenr, location, devid);
 #endif
-   btrfs_r5l_append_payload_meta(log, 
R5LOG_PAYLOAD_DATA, location, devid);
+   kaddr = kmap(page);
+   csum = btrfs_crc32c(log->uuid_csum, kaddr, 
PAGE_SIZE);
+   kunmap(page);
+
+   btrfs_r5l_append_payload_meta(log, 
R5LOG_PAYLOAD_DATA, location, devid, csum);
btrfs_r5l_append_payload_page(log, page);
}
}
@@ -1414,17 +1422,26 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log 
*log, int data_pages, int
 
/* add the whole parity blocks */
for (; stripe < rbio->real_stripes; stripe++) {
-   u64 location = btrfs_compute_location(rbio, stripe, 0);
-   u64 devid = btrfs_compute_devid(rbio, stripe);
+   location = btrfs_compute_location(rbio, stripe, 0);
+   devid = btrfs_compute_devid(rbio, stripe);
 
 #ifdef BTRFS_DEBUG_R5LOG
trace_printk("parity: stripe %d location 0x%llx devid %llu\n", 
stripe, location, devid);
 #endif
-   btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY, 
location, devid);
for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) {
page = rbio_stripe_page(rbio, stripe, pagenr);
+
+   kaddr = kmap(page);
+   if (pagenr == 0)
+   csum = btrfs_crc32c(log->uuid_csum, kaddr, 
PAGE_SIZE);

[PATCH v6 04/15] btrfs-progs: scrub: Introduce structures to support offline scrub for RAID56

2017-07-20 Thread Gu Jinxiang
From: Qu Wenruo <quwen...@cn.fujitsu.com>

Introduce new local structures, scrub_full_stripe and scrub_stripe, for
incoming offline RAID56 scrub support.

For pure stripe/mirror based profiles, like raid0/1/10/dup/single, we
will follow the original bytenr and mirror number based iteration, so
they don't need any extra structures for these profiles.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
Signed-off-by: Gu Jinxiang <g...@cn.fujitsu.com>
---
 Makefile |   2 +-
 scrub.c  | 119 +++
 2 files changed, 120 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile b/Makefile
index 6f734a6..8275539 100644
--- a/Makefile
+++ b/Makefile
@@ -96,7 +96,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index 000..41c4010
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,119 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profile (RAID56)
+ * Mirror/stripe based profiles won't need this. They are iterated by bytenr and
+ * mirror number.
+ */
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   u64 physical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum (nodatasum) */
+   unsigned int csum_missing:1;
+
+   /* Device fd, to write correct data back to disc */
+   int fd;
+
+   char *data;
+};
+
+/*
+ * RAID56 full stripe (data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Missing devices */
+   u32 err_missing_devs;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* corrupted stripe index */
+   int corrupted_index[2];
+
+   int nr_corrupted_stripes;
+
+   /* Already recovered once? */
+   unsigned int recovered:1;
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + sizeof(unsigned long *) +
+   nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
-- 
2.9.4





[PATCH v5 04/15] btrfs-progs: scrub: Introduce structures to support offline scrub for RAID56

2017-07-15 Thread Gu Jinxiang
From: Qu Wenruo <quwen...@cn.fujitsu.com>

Introduce new local structures, scrub_full_stripe and scrub_stripe, for
incoming offline RAID56 scrub support.

For pure stripe/mirror based profiles, like raid0/1/10/dup/single, we
will follow the original bytenr and mirror number based iteration, so
they don't need any extra structures for these profiles.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
Signed-off-by: Gu Jinxiang <g...@cn.fujitsu.com>
---
 Makefile |   2 +-
 scrub.c  | 124 +++
 2 files changed, 125 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile b/Makefile
index 6f734a6..8275539 100644
--- a/Makefile
+++ b/Makefile
@@ -96,7 +96,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index 000..cdb311b
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,124 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profile (RAID56)
+ * Mirror/stripe based profiles won't need this. They are iterated by bytenr and
+ * mirror number.
+ */
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   u64 physical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum (nodatasum) */
+   unsigned int csum_missing:1;
+
+   /* Device fd, to write correct data back to disc */
+   int fd;
+
+   char *data;
+};
+
+/*
+ * RAID56 full stripe (data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Missing devices */
+   u32 err_missing_devs;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* corrupted stripe index */
+   int corrupted_index[2];
+
+   int nr_corrupted_stripes;
+
+   /* Already recovered once? */
+   unsigned int recovered:1;
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + sizeof(unsigned long *) +
+   nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
-- 
2.9

Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data

2017-05-31 Thread Qu Wenruo



At 05/31/2017 09:52 PM, David Sterba wrote:

On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote:

Introduce a wrapper to recover raid56 data.

The logical is the same with kernel one, but with different interfaces,
since kernel ones cares the performance while in btrfs we don't care
that much.


Can you please rephrase this paragraph? I'll commit the patch as-is for
now but want to replace the changelog with an improved one.




I'm not sure which direction I should change the paragraph to.

Should I explain more about the wrapper, or remove the unrelated
differences between the kernel and btrfs-progs parts?


Or why we need the wrapper?

Thanks,
Qu




Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data

2017-05-31 Thread David Sterba
On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote:
> Introduce a wrapper to recover raid56 data.
> 
> The logical is the same with kernel one, but with different interfaces,
> since kernel ones cares the performance while in btrfs we don't care
> that much.

Can you please rephrase this paragraph? I'll commit the patch as-is for
now but want to replace the changelog with an improved one. 


Re: [PATCH v4 01/20] btrfs-progs: raid56: Introduce raid56 header for later recovery usage

2017-05-31 Thread David Sterba
On Thu, May 25, 2017 at 02:21:46PM +0800, Qu Wenruo wrote:
> @@ -0,0 +1,28 @@
> +/*
> + * Copyright (C) 2017 Fujitsu.  All rights reserved.

Please use just plain GPL v2 header in newly created files. The history
of who modified what in the given file is stored in git. I can't simply
remove the copyright notice from existing files without checking first,
but I can stop adding more.

> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#ifndef _BTRFS_PROGS_RAID56_H
> +#define _BTRFS_PROGS_RAID56_H

The pattern for the ifndef macro name is:

"__BTRFS_" + descriptive header name + "_H__"

just look to other files.


[PATCH v4 01/20] btrfs-progs: raid56: Introduce raid56 header for later recovery usage

2017-05-25 Thread Qu Wenruo
Introduce a new header, kernel-lib/raid56.h, for later raid56 works.

It contains 2 functions, from original btrfs-progs code:
void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);

It will be expanded later, and some parts of it (the RAID6 recovery
part) may be kept in sync with the kernel.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 Makefile|  4 ++--
 disk-io.h   |  4 
 kernel-lib/raid56.h | 28 
 volumes.c   |  1 +
 4 files changed, 31 insertions(+), 6 deletions(-)
 create mode 100644 kernel-lib/raid56.h

diff --git a/Makefile b/Makefile
index 81598df1..92063a90 100644
--- a/Makefile
+++ b/Makefile
@@ -108,8 +108,8 @@ libbtrfs_objects = send-stream.o send-utils.o 
kernel-lib/rbtree.o btrfs-list.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h kernel-lib/rbtree.h 
btrfs-list.h \
   kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
-  kernel-lib/radix-tree.h kernel-lib/sizes.h extent-cache.h \
-  extent_io.h ioctl.h ctree.h btrfsck.h version.h
+  kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
+  extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h version.h
 convert_objects = convert/main.o convert/common.o convert/source-fs.o \
  convert/source-ext2.o
 mkfs_objects = mkfs/main.o mkfs/common.o
diff --git a/disk-io.h b/disk-io.h
index cd4fe929..ad8efb43 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -201,8 +201,4 @@ int write_tree_block(struct btrfs_trans_handle *trans,
 struct extent_buffer *eb);
 int write_and_map_eb(struct btrfs_root *root, struct extent_buffer *eb);
 
-/* raid56.c */
-void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
-int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
-
 #endif
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
new file mode 100644
index ..fa8fa260
--- /dev/null
+++ b/kernel-lib/raid56.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright (C) 2017 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef _BTRFS_PROGS_RAID56_H
+#define _BTRFS_PROGS_RAID56_H
+/*
+ * Headers for RAID5/6 operations.
+ * Original headers from original RAID5/6 codes, not from kernel header.
+ */
+
+void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
+#endif
diff --git a/volumes.c b/volumes.c
index b350e259..8c2ffd92 100644
--- a/volumes.c
+++ b/volumes.c
@@ -28,6 +28,7 @@
 #include "print-tree.h"
 #include "volumes.h"
 #include "utils.h"
+#include "kernel-lib/raid56.h"
 
 struct stripe {
struct btrfs_device *dev;
-- 
2.13.0





[PATCH v4 09/20] btrfs-progs: scrub: Introduce structures to support offline scrub for RAID56

2017-05-25 Thread Qu Wenruo
Introduce new local structures, scrub_full_stripe and scrub_stripe, for
incoming offline RAID56 scrub support.

For pure stripe/mirror based profiles, like raid0/1/10/dup/single, we
will follow the original bytenr and mirror number based iteration, so
they don't need any extra structures for these profiles.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 Makefile |   2 +-
 scrub.c  | 126 +++
 2 files changed, 127 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile b/Makefile
index e6d7c187..b3c70e04 100644
--- a/Makefile
+++ b/Makefile
@@ -95,7 +95,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o csum.o scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index ..a757dff6
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright (C) 2017 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profile (RAID56)
+ * Mirror/stripe based profiles won't need this. They are iterated by bytenr and
+ * mirror number.
+ */
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   u64 physical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum(nodatasum) */
+   unsigned int csum_missing:1;
+
+   /* Device fd, to write correct data back to disc */
+   int fd;
+
+   char *data;
+};
+
+/*
+ * RAID56 full stripe(data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Missing devices */
+   u32 err_missing_devs;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* corrupted stripe index */
+   int corrupted_index[2];
+
+   int nr_corrupted_stripes;
+
+   /* Already recovered once? */
+   unsigned int recovered:1;
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + sizeof(unsigned long *) +
+   nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
-- 
2.13.0




[PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data

2017-05-25 Thread Qu Wenruo
Introduce a wrapper to recover raid56 data.

The logical is the same with kernel one, but with different interfaces,
since kernel ones cares the performance while in btrfs we don't care
that much.

And the interface is more caller friendly inside btrfs-progs.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 kernel-lib/raid56.c | 77 +
 kernel-lib/raid56.h | 11 
 2 files changed, 88 insertions(+)

diff --git a/kernel-lib/raid56.c b/kernel-lib/raid56.c
index e078972b..e3a9339e 100644
--- a/kernel-lib/raid56.c
+++ b/kernel-lib/raid56.c
@@ -280,3 +280,80 @@ int raid6_recov_datap(int nr_devs, size_t stripe_len, int 
dest1, void **data)
}
return 0;
 }
+
+/* Original raid56 recovery wrapper */
+int raid56_recov(int nr_devs, size_t stripe_len, u64 profile, int dest1,
+int dest2, void **data)
+{
+   int min_devs;
+   int ret;
+
+   if (profile & BTRFS_BLOCK_GROUP_RAID5)
+   min_devs = 2;
+   else if (profile & BTRFS_BLOCK_GROUP_RAID6)
+   min_devs = 3;
+   else
+   return -EINVAL;
+   if (nr_devs < min_devs)
+   return -EINVAL;
+
+   /* Nothing to recover */
+   if (dest1 == -1 && dest2 == -1)
+   return 0;
+
+   /* Reorder dest1/2, so only dest2 can be -1  */
+   if (dest1 == -1) {
+   dest1 = dest2;
+   dest2 = -1;
+   } else if (dest2 != -1 && dest1 != -1) {
+   /* Reorder dest1/2, ensure dest2 > dest1 */
+   if (dest1 > dest2) {
+   int tmp;
+
+   tmp = dest2;
+   dest2 = dest1;
+   dest1 = tmp;
+   }
+   }
+
+   if (profile & BTRFS_BLOCK_GROUP_RAID5) {
+   if (dest2 != -1)
+   return 1;
+   return raid5_gen_result(nr_devs, stripe_len, dest1, data);
+   }
+
+   /* RAID6 one dev corrupted case */
+   if (dest2 == -1) {
+   /* Regenerate P/Q */
+   if (dest1 == nr_devs - 1 || dest1 == nr_devs - 2) {
+   raid6_gen_syndrome(nr_devs, stripe_len, data);
+   return 0;
+   }
+
+   /* Regenerate data from P */
+   return raid5_gen_result(nr_devs - 1, stripe_len, dest1, data);
+   }
+
+   /* P/Q both corrupted */
+   if (dest1 == nr_devs - 2 && dest2 == nr_devs - 1) {
+   raid6_gen_syndrome(nr_devs, stripe_len, data);
+   return 0;
+   }
+
+   /* 2 Data corrupted */
+   if (dest2 < nr_devs - 2)
+   return raid6_recov_data2(nr_devs, stripe_len, dest1, dest2,
+data);
+   /* Data and P */
+   if (dest2 == nr_devs - 1)
+   return raid6_recov_datap(nr_devs, stripe_len, dest1, data);
+
+   /*
+* Final case, Data and Q, recover data first then regenerate Q
+*/
+   ret = raid5_gen_result(nr_devs - 1, stripe_len, dest1, data);
+   if (ret < 0)
+   return ret;
+   raid6_gen_syndrome(nr_devs, stripe_len, data);
+   return 0;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index 83ac39a1..e06c3ffb 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -44,4 +44,15 @@ int raid6_recov_data2(int nr_devs, size_t stripe_len, int 
dest1, int dest2,
  void **data);
 /* Recover data and P */
 int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data);
+
+/*
+ * Recover raid56 data
+ * @dest1/2 can be -1 to indicate correct data
+ *
+ * Return >0 for unrecoverable case.
+ * Return 0 for recoverable case, And recovered data will be stored into @data
+ * Return <0 for fatal error
+ */
+int raid56_recov(int nr_devs, size_t stripe_len, u64 profile, int dest1,
+int dest2, void **data);
 #endif
-- 
2.13.0





[PATCH v4 04/20] btrfs-progs: raid56: Allow raid6 to recover data and p

2017-05-25 Thread Qu Wenruo
Copied from kernel lib/raid6/recov.c.

Minor modifications include:
- Rename from raid6_datap_recov_intx1() to raid6_recov_datap()
- Rename parameter from faila to dest1
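For reference, the recovery this function implements, with one data stripe $D_a$ and the parity $P$ both lost, in the standard RAID6 syndrome notation over $\mathrm{GF}(2^8)$ with generator $g = \{02\}$:

```latex
P  = \bigoplus_{i} D_i, \qquad
Q  = \bigoplus_{i} g^{i} D_i
\qquad \text{(stored syndromes)}
\\[4pt]
Q' = \bigoplus_{i \neq a} g^{i} D_i
\qquad \text{(syndrome recomputed with } D_a := 0\text{)}
\\[4pt]
D_a = g^{-a}\,(Q \oplus Q'), \qquad
P   = D_a \oplus \bigoplus_{i \neq a} D_i
```

In the code, qmul is the lookup table for multiplication by $g^{-a}$ (raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]]]), and the loop `*p++ ^= *dq = qmul[*q ^ *dq];` computes $D_a$ byte by byte and folds it into the freshly regenerated $P$ in a single pass.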

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 kernel-lib/raid56.c | 41 +
 kernel-lib/raid56.h |  2 ++
 2 files changed, 43 insertions(+)

diff --git a/kernel-lib/raid56.c b/kernel-lib/raid56.c
index dca8f8d4..e078972b 100644
--- a/kernel-lib/raid56.c
+++ b/kernel-lib/raid56.c
@@ -239,3 +239,44 @@ int raid6_recov_data2(int nr_devs, size_t stripe_len, int 
dest1, int dest2,
free(zero_mem2);
return ret;
 }
+
+/*
+ * Raid 6 recover code copied from kernel lib/raid6/recov.c
+ * - rename from raid6_datap_recov_intx1()
+ * - parameter changed from faila to dest1
+ */
+int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data)
+{
+   u8 *p, *q, *dq;
+   const u8 *qmul; /* Q multiplier table */
+   char *zero_mem;
+
+   p = (u8 *)data[nr_devs - 2];
+   q = (u8 *)data[nr_devs - 1];
+
+   zero_mem = calloc(1, stripe_len);
+   if (!zero_mem)
+   return -ENOMEM;
+
+   /* Compute syndrome with zero for the missing data page
+  Use the dead data page as temporary storage for delta q */
+   dq = (u8 *)data[dest1];
+   data[dest1] = (void *)zero_mem;
+   data[nr_devs - 1] = dq;
+
+   raid6_gen_syndrome(nr_devs, stripe_len, data);
+
+   /* Restore pointer table */
+   data[dest1]   = dq;
+   data[nr_devs - 1] = q;
+
+   /* Now, pick the proper data tables */
+   qmul  = raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]]];
+
+   /* Now do it... */
+   while ( stripe_len-- ) {
+   *p++ ^= *dq = qmul[*q ^ *dq];
+   q++; dq++;
+   }
+   return 0;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index 8d64256f..83ac39a1 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -42,4 +42,6 @@ extern const u8 raid6_gfexi[256]  __attribute__((aligned(256)));
 /* Recover raid6 with 2 data corrupted */
 int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
  void **data);
+/* Recover data and P */
+int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data);
 #endif
-- 
2.13.0
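[Editor's note] The algebra behind raid6_recov_datap() above can be sketched in plain Python. This is an illustration, not code from the patch: all names below are invented, and it assumes the standard RAID6 field GF(2^8) with polynomial 0x11d and generator g = 2, as the kernel tables use. Regenerating the syndrome with the lost stripe zeroed gives Q ^ Q' = g^x * D_x, so multiplying by g^-x recovers the data, and P' (the xor of the surviving stripes) plus the recovered stripe rebuilds P.

```python
def gfmul(a, b):
    """GF(2^8) multiply with polynomial 0x11d, mirroring gfmul() from
    mktables.c; the 0xFF mask replaces C's uint8_t wrap-around."""
    v = 0
    while b:
        if b & 1:
            v ^= a
        a = ((a << 1) ^ (0x1D if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return v

def gfpow(a, n):
    """a ** n in GF(2^8); exponents reduce mod 255 (g^255 == 1)."""
    v = 1
    for _ in range(n % 255):
        v = gfmul(v, a)
    return v

def gen_pq(data):
    """P = xor of all data stripes, Q = xor of g^x * D_x (g = 2)."""
    n = len(data[0])
    p, q = [0] * n, [0] * n
    for x, d in enumerate(data):
        coef = gfpow(2, x)
        for i in range(n):
            p[i] ^= d[i]
            q[i] ^= gfmul(coef, d[i])
    return p, q

def recov_datap(data, q, x):
    """Rebuild data stripe x in place and return the rebuilt P stripe,
    given Q and the surviving data stripes (both D_x and P are lost)."""
    n = len(q)
    data[x] = [0] * n
    pp, qq = gen_pq(data)             # P', Q' with stripe x zeroed
    coef = gfpow(2, 255 - (x % 255))  # g^-x
    dx = [gfmul(coef, q[i] ^ qq[i]) for i in range(n)]
    data[x] = dx
    # P' already xors every surviving stripe, so P = P' ^ D_x,
    # matching the C loop  *p++ ^= *dq = qmul[*q ^ *dq];
    return [pp[i] ^ dx[i] for i in range(n)]
```

Usage: wipe one stripe, hand recov_datap() the Q stripe, and compare the result against the original P and data.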





[PATCH v4 02/20] btrfs-progs: raid56: Introduce tables for RAID6 recovery

2017-05-25 Thread Qu Wenruo
Use kernel RAID6 galois tables for later RAID6 recovery.

Galois tables file, kernel-lib/tables.c is generated by user space
program, mktable.

Galois field tables declaration, in kernel-lib/raid56.h, is completely
copied from kernel.

The mktables.c is copied from kernel with minor header/macro
modification, to ensure the generated tables.c works well in
btrfs-progs.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 .gitignore|   2 +
 Makefile  |  13 -
 kernel-lib/mktables.c | 148 ++
 kernel-lib/raid56.h   |  12 
 4 files changed, 173 insertions(+), 2 deletions(-)
 create mode 100644 kernel-lib/mktables.c

diff --git a/.gitignore b/.gitignore
index 43c0ed88..7d7a5482 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,6 +35,8 @@ btrfs-select-super
 btrfs-calc-size
 btrfs-crc
 btrfstune
+mktables
+kernel-lib/tables.c
 libbtrfs.a
 libbtrfs.so
 libbtrfs.so.0
diff --git a/Makefile b/Makefile
index 92063a90..ba73357b 100644
--- a/Makefile
+++ b/Makefile
@@ -95,7 +95,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
   kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o
+ fsfeatures.o kernel-lib/tables.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
@@ -314,6 +314,14 @@ version.h: version.sh version.h.in configure.ac
@echo "[SH] $@"
$(Q)bash ./config.status --silent $@
 
+mktables: kernel-lib/mktables.c
+   @echo "[CC] $@"
+   $(Q)$(CC) $(CFLAGS) $< -o $@
+
+kernel-lib/tables.c: mktables
+   @echo "[TABLE]  $@"
+   $(Q)./mktables > $@ || ($(RM) -f $@ && exit 1)
+
 libbtrfs: $(libs_shared) $(lib_links)
 
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
@@ -503,11 +511,12 @@ clean: $(CLEANDIRS)
$(Q)$(RM) -f -- $(progs) *.o *.o.d \
kernel-lib/*.o kernel-lib/*.o.d \
kernel-shared/*.o kernel-shared/*.o.d \
+   kernel-lib/tables.c\
image/*.o image/*.o.d \
convert/*.o convert/*.o.d \
mkfs/*.o mkfs/*.o.d \
  dir-test ioctl-test quick-test library-test library-test-static \
- btrfs.static mkfs.btrfs.static fssum \
+  mktables btrfs.static mkfs.btrfs.static fssum \
  $(check_defs) \
  $(libs) $(lib_links) \
  $(progs_static) $(progs_extra)
diff --git a/kernel-lib/mktables.c b/kernel-lib/mktables.c
new file mode 100644
index ..85f621fe
--- /dev/null
+++ b/kernel-lib/mktables.c
@@ -0,0 +1,148 @@
+/* -*- linux-c -*- --- *
+ *
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
+ *
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
+ *
+ * --- */
+
+/*
+ * mktables.c
+ *
+ * Make RAID-6 tables.  This is a host user space program to be run at
+ * compile time.
+ */
+
+/*
+ * Btrfs-progs port, with following minor fixes:
+ * 1) Use "kerncompat.h"
+ * 2) Get rid of __KERNEL__ related macros
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static uint8_t gfmul(uint8_t a, uint8_t b)
+{
+   uint8_t v = 0;
+
+   while (b) {
+   if (b & 1)
+   v ^= a;
+   a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
+   b >>= 1;
+   }
+
+   return v;
+}
+
+static uint8_t gfpow(uint8_t a, int b)
+{
+   uint8_t v = 1;
+
+   b %= 255;
+   if (b < 0)
+   b += 255;
+
+   while (b) {
+   if (b & 1)
+   v = gfmul(v, a);
+   a = gfmul(a, a);
+   b >>= 1;
+   }
+
+   return v;
+}
+
+int main(int argc, char *argv[])
+{
+   int i, j, k;
+   uint8_t v;
+   uint8_t exptbl[256], invtbl[256];
+
+   printf("#include \"kerncompat.h\"\n");
+
+   /* Compute multiplication table */
+   printf("\nconst u8  __attribute__((aligned(256)))\n"
+   "raid6_gfmul[256][256] =\n"
+   "{\n");
+   for (i = 0; i < 256; i++) {
+   printf("\t{\n");
+   for (j = 0; j < 256; j += 8) {
+   printf("\t\t");
+
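[Editor's note: the archived message is truncated above.] For readers without the kernel tree handy, the tables mktables emits can be sketched in a few lines of Python (an illustration, not code from the patch; variable names are invented). It assumes g = 2 is primitive in GF(2^8)/0x11d, which is why every nonzero byte appears in the exp table; raid6_gfexi, used by the pbmul table in 2-data recovery, is the inverse of g^i ^ 1.

```python
def gfmul(a, b):
    """GF(2^8) multiply with polynomial 0x11d, as in mktables.c's gfmul();
    the 0xFF mask stands in for C's uint8_t wrap-around."""
    v = 0
    while b:
        if b & 1:
            v ^= a
        a = ((a << 1) ^ (0x1D if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return v

# exp table: gfexp[i] = g^i with generator g = 2
gfexp = []
v = 1
for _ in range(256):
    gfexp.append(v)
    v = gfmul(v, 2)

# inverse table: inv(g^i) = g^(255-i); g primitive => all nonzero bytes hit
gfinv = [0] * 256
for i in range(255):
    gfinv[gfexp[i]] = gfexp[(255 - i) % 255]

# gfexi[i] = (g^i ^ 1)^-1, the constant behind pbmul in raid6_recov_data2()
gfexi = [gfinv[gfexp[i] ^ 1] for i in range(256)]
```

A quick sanity check is that every nonzero byte times its table inverse is 1, and that g^255 wraps back to 1.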

[PATCH v4 03/20] btrfs-progs: raid56: Allow raid6 to recover 2 data stripes

2017-05-25 Thread Qu Wenruo
Copied from kernel lib/raid6/recov.c raid6_2data_recov_intx1() function.
With the following modification:
- Rename to raid6_recov_data2() for shorter name
- s/kfree/free/g modification

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 Makefile|  4 +--
 raid56.c => kernel-lib/raid56.c | 69 +
 kernel-lib/raid56.h |  5 +++
 3 files changed, 76 insertions(+), 2 deletions(-)
 rename raid56.c => kernel-lib/raid56.c (71%)

diff --git a/Makefile b/Makefile
index ba73357b..df584672 100644
--- a/Makefile
+++ b/Makefile
@@ -92,10 +92,10 @@ CHECKER_FLAGS := -include $(check_defs) -D__CHECKER__ \
 objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
-     qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
+ qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
   kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/raid56.c b/kernel-lib/raid56.c
similarity index 71%
rename from raid56.c
rename to kernel-lib/raid56.c
index 8c79c456..dca8f8d4 100644
--- a/raid56.c
+++ b/kernel-lib/raid56.c
@@ -28,6 +28,7 @@
 #include "disk-io.h"
 #include "volumes.h"
 #include "utils.h"
+#include "kernel-lib/raid56.h"
 
 /*
  * This is the C data type to use
@@ -170,3 +171,71 @@ int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data)
}
return 0;
 }
+
+/*
+ * Raid 6 recovery code copied from kernel lib/raid6/recov.c.
+ * With modifications:
+ * - rename from raid6_2data_recov_intx1
+ * - kfree/free modification for btrfs-progs
+ */
+int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
+ void **data)
+{
+   u8 *p, *q, *dp, *dq;
+   u8 px, qx, db;
+   const u8 *pbmul;/* P multiplier table for B data */
+   const u8 *qmul; /* Q multiplier table (for both) */
+   char *zero_mem1, *zero_mem2;
+   int ret = 0;
+
+   /* Early check */
+   if (dest1 < 0 || dest1 >= nr_devs - 2 ||
+   dest2 < 0 || dest2 >= nr_devs - 2 || dest1 >= dest2)
+   return -EINVAL;
+
+   zero_mem1 = calloc(1, stripe_len);
+   zero_mem2 = calloc(1, stripe_len);
+   if (!zero_mem1 || !zero_mem2) {
+   free(zero_mem1);
+   free(zero_mem2);
+   return -ENOMEM;
+   }
+
+   p = (u8 *)data[nr_devs - 2];
+   q = (u8 *)data[nr_devs - 1];
+
+   /* Compute syndrome with zero for the missing data pages
+  Use the dead data pages as temporary storage for
+  delta p and delta q */
+   dp = (u8 *)data[dest1];
+   data[dest1] = (void *)zero_mem1;
+   data[nr_devs - 2] = dp;
+   dq = (u8 *)data[dest2];
+   data[dest2] = (void *)zero_mem2;
+   data[nr_devs - 1] = dq;
+
+   raid6_gen_syndrome(nr_devs, stripe_len, data);
+
+   /* Restore pointer table */
+   data[dest1]   = dp;
+   data[dest2]   = dq;
+   data[nr_devs - 2] = p;
+   data[nr_devs - 1] = q;
+
+   /* Now, pick the proper data tables */
+   pbmul = raid6_gfmul[raid6_gfexi[dest2 - dest1]];
+   qmul  = raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]^raid6_gfexp[dest2]]];
+
+   /* Now do it... */
+   while ( stripe_len-- ) {
+   px= *p ^ *dp;
+   qx= qmul[*q ^ *dq];
+   *dq++ = db = pbmul[px] ^ qx; /* Reconstructed B */
+   *dp++ = db ^ px; /* Reconstructed A */
+   p++; q++;
+   }
+
+   free(zero_mem1);
+   free(zero_mem2);
+   return ret;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index 030b0afb..8d64256f 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -37,4 +37,9 @@ extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]  __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]  __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]  __attribute__((aligned(256)));
+
+
+/* Recover raid6 with 2 data corrupted */
+int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
+ void **data);
 #endif
-- 
2.13.0
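[Editor's note] The pbmul/qmul tables in raid6_recov_data2() above encode a small linear system, which can be sketched in plain Python (an illustration, not code from the patch; names are invented). With stripes a < b lost, P ^ P' = D_a ^ D_b and Q ^ Q' = g^a*D_a ^ g^b*D_b are two equations in two unknowns over GF(2^8)/0x11d; the kernel's tables are just the constant factors precomputed.

```python
def gfmul(a, b):
    """GF(2^8) multiply with polynomial 0x11d (0xFF mask = uint8_t wrap)."""
    v = 0
    while b:
        if b & 1:
            v ^= a
        a = ((a << 1) ^ (0x1D if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return v

def gfpow(a, n):
    v = 1
    for _ in range(n % 255):
        v = gfmul(v, a)
    return v

def gfinv(a):
    return gfpow(a, 254)  # a^254 == a^-1, since a^255 == 1 for a != 0

def gen_pq(data):
    """P = xor of data stripes, Q = xor of g^x * D_x (g = 2)."""
    n = len(data[0])
    p, q = [0] * n, [0] * n
    for x, d in enumerate(data):
        coef = gfpow(2, x)
        for i in range(n):
            p[i] ^= d[i]
            q[i] ^= gfmul(coef, d[i])
    return p, q

def recov_data2(data, p, q, a, b):
    """Rebuild lost data stripes a < b in place from P and Q."""
    n = len(p)
    data[a], data[b] = [0] * n, [0] * n
    pp, qq = gen_pq(data)          # syndromes with the lost stripes zeroed
    ga, gb = gfpow(2, a), gfpow(2, b)
    inv = gfinv(ga ^ gb)           # the factor inside the kernel's qmul
    da, db = [0] * n, [0] * n
    for i in range(n):
        px = p[i] ^ pp[i]          # D_a ^ D_b
        qx = q[i] ^ qq[i]          # g^a*D_a ^ g^b*D_b
        db[i] = gfmul(inv, gfmul(ga, px) ^ qx)   # solve for D_b
        da[i] = px ^ db[i]                       # then D_a
    data[a], data[b] = da, db
```

Substituting D_a = px ^ D_b into the Q equation gives (g^a ^ g^b)*D_b = g^a*px ^ qx, which is the single multiply the loop performs per byte.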




Re: [PATCH v6 2/2] btrfs: scrub: Fix RAID56 recovery race condition

2017-04-27 Thread Goffredo Baroncelli
On 2017-04-26 02:13, Qu Wenruo wrote:
> 
> 
> At 04/26/2017 01:58 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>> 
>> I tested these two patches on top of 4.10.12; however when I
>> corrupt disk1, It seems that BTRFS is still unable to rebuild
>> parity.
>> 
>> Because in the past the patches set V4 was composed by 5 patches
>> and this one (V5) is composed by only 2 patches, are these 2
>> sufficient to solve all known bugs of raid56 ? Or I have to cherry
>> pick other two ones ?
>> 
>> BR G.Baroncelli
> 
> These 2 patches are for David to merge.
> 
> The rest 3 patches are reviewed by Liu Bo and has nothing to be
> modified.
> 
> To test the full ability to recovery, please try latest mainline
> master, which doesn't only include my RAID56 scrub patches, but also
> patches from Liu Bo to do scrub recovery correctly.

You are right; I tested the 4.11-rc8 and it is able to detect and correct the 
defects

BR
G.Baroncelli
> 
> Thanks, Qu
> 
>> 
>> On 2017-04-14 02:35, Qu Wenruo wrote:
>>> When scrubbing a RAID5 which has recoverable data corruption
>>> (only one data stripe is corrupted), sometimes scrub will report
>>> more csum errors than expected. Sometimes even unrecoverable
>>> error will be reported.
>>> 
>>> The problem can be easily reproduced by the following steps:
>>> 1) Create a btrfs with RAID5 data profile with 3 devs
>>> 2) Mount it with nospace_cache or space_cache=v2
>>>    To avoid extra data space usage.
>>> 3) Create a 128K file and sync the fs, unmount it
>>>    Now the 128K file lies at the beginning of the data chunk
>>> 4) Locate the physical bytenr of data chunk on dev3
>>>    Dev3 is the 1st data stripe.
>>> 5) Corrupt the first 64K of the data chunk stripe on dev3
>>> 6) Mount the fs and scrub it
>>> 
>>> The correct csum error number should be 16(assuming using
>>> x86_64). Larger csum error number can be reported in a 1/3
>>> chance. And unrecoverable error can also be reported in a 1/10
>>> chance.
>>> 
>>> The root cause of the problem is RAID5/6 recover code has race 
>>> condition, due to the fact that full scrub is initiated per
>>> device.
>>> 
>>> While for other mirror based profiles, each mirror is independent
>>> with each other, so race won't cause any big problem.
>>> 
>>> For example:
>>>         Corrupted       |       Correct          |      Correct        |
>>> |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
>>> ------------------------------------------------------------------------
>>> Read out D1             |Read out D2             |Read full stripe     |
>>> Check csum              |Check csum              |Check parity         |
>>> Csum mismatch           |Csum match, continue    |Parity mismatch      |
>>> handle_errored_block    |                        |handle_errored_block |
>>> Read out full stripe    |                        |Read out full stripe |
>>> D1 csum error(err++)    |                        |D1 csum error(err++) |
>>> Recover D1              |                        |Recover D1           |
>>> 
>>> So D1's csum error is accounted twice, just because 
>>> handle_errored_block() doesn't have enough protection, and the race can
>>> happen.
>>> 
>>> On even worse case, for example D1's recovery code is re-writing 
>>> D1/D2/P, and P's recovery code is just reading out full stripe,
>>> then we can cause unrecoverable error.
>>> 
>>> This patch will use previously introduced lock_full_stripe() and 
>>> unlock_full_stripe() to protect the whole
>>> scrub_handle_errored_block() function for RAID56 recovery. So no
>>> extra csum error nor unrecoverable error.
>>> 
>>> Reported-by: Goffredo Baroncelli <kreij...@libero.it>
>>> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/scrub.c | 22 ++
>>>  1 file changed, 22 insertions(+)
>>> 
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index 980deb8aae47..7d8ce57fd08a 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -1113,6 +1113,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  	int mirror_index;
>>>  	int page_num;
>>>  	int success;
>>> +	bool full_stripe_locked;
>>>  	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
>>>  				      DEFAULT_RATELIMIT_BURST);
>>> @@ -1138,6 +1139,24 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  	have_csum =
>>> 
