Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-28 Thread Masami Hiramatsu
On Thu, 28 Dec 2017 17:11:31 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 11:51 PM, Masami Hiramatsu wrote:
> >
> > Then what happens if the user sets an invalid retval for those functions?
> > Even if we limit the injectable functions, it can cause a problem.
> >
> > For example:
> >
> >  obj = func_return_object();
> >  if (!obj) {
> >         handling_error...;
> >  }
> >  obj->field = x;
> >
> > In this case, obviously func_return_object() must return NULL if there is
> > an error, not -ENOMEM. But without the correct retval information, how would
> > you check that the BPF code doesn't cause trouble?
> > Currently it seems you are expecting only functions which return an error
> > code:
> >
> >  ret = func_return_state();
> >  if (ret < 0) {
> >         handling_error...;
> >  }
> >
> > But how can we distinguish those?
> >
> > If we have the error range for each function, we can ensure what is a
> > *correct* error code: NULL, errno, or any other error number. :)
> 
> messing up return values may cause problems and a range check is
> not going to magically help.
> The caller may handle only a certain set of errors or interpret
> some of them, like EBUSY, as a signal to retry.
> It's plain impossible to make sure that the kernel will be functional
> after error injection has been made.

Hmm, if so, why do we need this injectable table at all?
If we cannot guarantee the safety of error injection (and of course we
cannot), why limit error injection to such a restricted set of functions?
I think we don't need it anymore. Any function could be injectable, with
no attempt to guarantee safety.

Thank you,

> Like kmalloc() unconditionally returning NULL will be deadly
> for the kernel, hence this patch 4/4 has very limited practical
> use. The bpf program needs to make intelligent decisions about when
> to return an error and what kind of error to return.
> Doing a blanket range check adds a false sense of additional safety.
> More so, it wastes kilobytes of memory to do this check, hence nack.
> 


-- 
Masami Hiramatsu 


Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Qu Wenruo


On 2017年12月29日 11:35, Stirling Westrup wrote:
> On Thu, Dec 28, 2017 at 9:08 PM, Qu Wenruo  wrote:
>>
>>
> 
>>
>> I strongly recommend doing a raw byte search for the magic number
>> "5f42 4852 6653 5f4d" to locate the real offset (if it's an offset
>> problem, not a toasted image)
>>
> I don't understand. How would I do such a search for that signature?
> 
The most stupid idea is to use xxd and grep.

Something like:

# xxd /dev/sde | grep 5f42 -C1
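
Since those eight bytes are just the ASCII string "_BHRfS_M", GNU grep can
also report byte offsets directly (a hedged alternative: xxd groups the dump
into 2-byte columns, so a match straddling a column boundary can slip past
the pattern above):

# grep -oba '_BHRfS_M' /dev/sde | head

Each hit prints as <byte offset>:_BHRfS_M. The magic sits 0x40 bytes into a
superblock, so an intact primary super should show a hit near byte 65600
(0x10040); any constant shift from that is the offset you are looking for.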





Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Stirling Westrup
On Thu, Dec 28, 2017 at 9:08 PM, Qu Wenruo  wrote:
>
>

>
> I strongly recommend doing a raw byte search for the magic number
> "5f42 4852 6653 5f4d" to locate the real offset (if it's an offset
> problem, not a toasted image)
>
I don't understand. How would I do such a search for that signature?


[PATCH v2] Btrfs: enhance raid1/10 balance heuristic

2017-12-28 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer balances requests across mirrors
based on pid % num of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as optimal
 - if another mirror has a shorter queue than the optimal one,
   repick that mirror

To avoid round-robin request bouncing, round the queue length down
(e.g. in-flight counts of 13 and 9 both round down to 8, so the choice
stays stable):
 - to a multiple of 8 for rotational devs
 - to a multiple of 2 when all devs are non-rotational

Changes:
  v1 -> v2:
- Use the helper part_in_flight() from genhd.c
  to get the queue length
- Move the guessing code to guess_optimal()
- Change the balancer logic: use pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded

Signed-off-by: Timofey Titovets 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 116 -
 2 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..a77426a7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 			atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9a04245003ab..1c84534df9a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5216,6 +5217,112 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, large for HDD and small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try to prevent switching for every sneeze
+* by rounding the count down to a multiple of round_down
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return the guessed optimal mirror
+ *
+ * Optimal is expected to be pid % num_stripes
+ *
+ * That's generally OK for spreading load
+ * Add some balancing based on the queue length of each device
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *    so if the drives' loads are equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick non-rotational
+ *    as optimal, and repick if another dev has a "significantly" shorter queue
+ *  - Repick the optimal mirror if another mirror's queue is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int num = map->num_stripes;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with the computation
+* if only one of the two bdevs is accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For the mixed case, pick the non-rotational dev as optimal */
+   if (all_bdev_rotate == all_bdev_nonrot) {
+   for (i = 0; i < num; i++) {
+   if (is_nonrot[i])
+   optimal = i;
+   }
+   }
+
+   for (i = 0; i < num; i++) {
+   if (qlen[optimal] > qlen[i])
+

Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Qu Wenruo


On 2017年12月29日 09:41, Stirling Westrup wrote:
> Okay, I ran the command 'btrfs ins dump-super -fa' on each of the four
> drives of the array, which are currently sda, sdb, sdc, and sde, and
> attached the results as log files.
> 
> As you'll note, the one superblock for sde is an exact copy of the one
> for sdc, as I copied the first 4M of sdc to sde before starting the
> recovery of the bad drive (sde is as much of that drive as I could
> copy, which all my tools claim is close to 99.99% of the original).

Well, from the result of e.log, there are no backup supers at all.

So either there is an offset in the recovered data, or you have lost most
of your data.

The good news is that, according to the correct supers of devid 1/3/4,
your system and metadata profiles are at least RAID1, so the fs should
still be mountable RO and degraded.


Yes, this means you could get the needed device UUID and hand-craft a
superblock.
But I really doubt it will succeed.

If you really want to do that, here are the needed steps:

1) Get device info from your existing fs
   # btrfs ins dump-tree -t chunk 
   And looking for the following thing:
--
   item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16185 itemsize 98
devid 2 total_bytes 10737418240 bytes_used 289406976
io_align 4096 io_width 4096 sector_size 4096 type 0
generation 0 start_offset 0 dev_group 0
seek_speed 0 bandwidth 0
uuid f1d9b288-7865-463f-a65c-ca8b1fbde09b
fsid 1dd513fb-45f8-404f-ae23-979e3acb78ad
--
   Look for the key (DEV_ITEMS DEV_ITEM 2) and grab the "uuid",
   "total_bytes" and "bytes_used" fields (the other fields are mostly fixed)

2) Fill in the dev_item fields of a good superblock.
   If that feels hard, I could help do it if you provide a binary
   dump of any valid superblock along with the above tree dump info.
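
   For reference, the raw mechanics could look like this (a hedged sketch;
   offsets are from the on-disk format, double-check against btrfs-progs):
   the primary super lives at 64KiB and is 4KiB long, so it can be pulled
   out, edited and written back with dd:

   # dd if=/dev/sdX of=super.bin bs=4096 skip=16 count=1
     (edit the dev_item fields in a hex editor)
   # dd if=super.bin of=/dev/sdX bs=4096 seek=16 count=1 conv=notrunc

   Note that the crc32c checksum stored at the very start of the superblock
   covers everything after the csum field (bytes 0x20 through 0xfff), so it
   must be recomputed after any edit or the kernel will reject the block.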

But as I mentioned before, the disk seems to be heavily damaged or to have
an unexpected offset.

Recovery using such a method can easily lead to csum errors, and most (if
not all) RAID0-based data will be unreadable.

I strongly recommend doing a raw byte search for the magic number
"5f42 4852 6653 5f4d" to locate the real offset (if it's an offset
problem, not a toasted image)

Thanks,
Qu

> 
> 
> On Thu, Dec 28, 2017 at 7:22 PM, Qu Wenruo  wrote:
>>
>>
>> On 2017年12月29日 07:09, Stirling Westrup wrote:
>>> Using "btrfs rescue super-recover" is of no use since there are no
>>> valid superblocks on the disk I need to fix.
>>
>> Btrfs normally has 1 primary superblock and 1 or 2 backup superblocks.
>>
>> super-recover is going to read the backup superblocks and use them as
>> the base to recover the primary superblock.
>>
>> If super-recover can't even find the backups, then the disk is more
>> damaged than you expected.
>>
>>> In fact, it's even worse,
>>> because the only even partly valid superblock is a copy of the one
>>> from drive sdd, which is a perfectly valid drive. What I need to do
>>> (as far as I can tell) is:
>>>
>>> 1) Patch the UUID_SUB and device number of sdf to make it distinct
>>> from sdd. Or just generate an entirely new superblock for sdf which
>>> indicates it is device 2 in the 4-device BTRFS (rather than device 1
>>> which it now thinks it is).
>>
>> You need your device UUID, which can be found in the device tree.
>> (If you can still mount the fs RO and degraded, you can read it
>> out from there.)
>>
>> You're looking for this part of "btrfs ins dump-super" output:
>> --
>> ...
>> cache_generation8
>> uuid_tree_generation8
>> dev_item.uuid   f1d9b288-7865-463f-a65c-ca8b1fbde09b <
>> dev_item.fsid   1dd513fb-45f8-404f-ae23-979e3acb78ad [match]
>> dev_item.type   0
>> dev_item.total_bytes10737418240
>> ...
>> --
>>
>>>
>>> 2) Recover (somehow) whatever other information from the superblock
>>> that is missing.
>>>
>>
>> Just as I said, if your backup super is also corrupted, there is little
>> chance of recovery.
>>
>> To verify if the backups are still alive, please paste the output of
>> "btrfs ins dump-super -fa".
>> (Even if you think super-recover is of no use, the output can still help.)
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>> On Thu, Dec 28, 2017 at 7:11 AM, Qu Wenruo  wrote:


 On 2017年12月28日 19:41, Nikolay Borisov wrote:
>
>
> On 28.12.2017 03:53, Qu Wenruo wrote:
>>
>>
>> On 2017年12月28日 09:46, Stirling Westrup wrote:
>>> Here's my situation: I have a network file server containing a 12TB
>>> BTRFS spread out over four devices (sda-sdd) which I am trying to
>>> recover. I do have a backup, but it's about 3 months old, and while I
>>> could certainly rebuild everything from that if I really had to, I
>>> would far rather not have to rerip my latest DVDs. So, I am willing to
>>> experiment if it might save me a few hundred hours of reconstruction.
>>> I don't currently have another 12 TB of space anywhere for making a

Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Stirling Westrup
Okay, I ran the command 'btrfs ins dump-super -fa' on each of the four
drives of the array, which are currently sda, sdb, sdc, and sde, and
attached the results as log files.

As you'll note, the one superblock for sde is an exact copy of the one
for sdc, as I copied the first 4M of sdc to sde before starting the
recovery of the bad drive (sde is as much of that drive as I could
copy, which all my tools claim is close to 99.99% of the original).


On Thu, Dec 28, 2017 at 7:22 PM, Qu Wenruo  wrote:
>
>
> On 2017年12月29日 07:09, Stirling Westrup wrote:
>> Using "btrfs rescue super-recover" is of no use since there are no
>> valid superblocks on the disk I need to fix.
>
> Btrfs normally has 1 primary superblock and 1 or 2 backup superblocks.
>
> super-recover is going to read the backup superblocks and use them as
> the base to recover the primary superblock.
>
> If super-recover can't even find the backups, then the disk is more
> damaged than you expected.
>
>> In fact, it's even worse,
>> because the only even partly valid superblock is a copy of the one
>> from drive sdd, which is a perfectly valid drive. What I need to do
>> (as far as I can tell) is:
>>
>> 1) Patch the UUID_SUB and device number of sdf to make it distinct
>> from sdd. Or just generate an entirely new superblock for sdf which
>> indicates it is device 2 in the 4-device BTRFS (rather than device 1
>> which it now thinks it is).
>
> You need your device UUID, which can be found in the device tree.
> (If you can still mount the fs RO and degraded, you can read it
> out from there.)
>
> You're looking for this part of "btrfs ins dump-super" output:
> --
> ...
> cache_generation8
> uuid_tree_generation8
> dev_item.uuid   f1d9b288-7865-463f-a65c-ca8b1fbde09b <
> dev_item.fsid   1dd513fb-45f8-404f-ae23-979e3acb78ad [match]
> dev_item.type   0
> dev_item.total_bytes10737418240
> ...
> --
>
>>
>> 2) Recover (somehow) whatever other information from the superblock
>> that is missing.
>>
>
> Just as I said, if your backup super is also corrupted, there is little
> chance of recovery.
>
> To verify if the backups are still alive, please paste the output of
> "btrfs ins dump-super -fa".
> (Even if you think super-recover is of no use, the output can still help.)
>
> Thanks,
> Qu
>
>>
>>
>> On Thu, Dec 28, 2017 at 7:11 AM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2017年12月28日 19:41, Nikolay Borisov wrote:


 On 28.12.2017 03:53, Qu Wenruo wrote:
>
>
> On 2017年12月28日 09:46, Stirling Westrup wrote:
>> Here's my situation: I have a network file server containing a 12TB
>> BTRFS spread out over four devices (sda-sdd) which I am trying to
>> recover. I do have a backup, but it's about 3 months old, and while I
>> could certainly rebuild everything from that if I really had to, I
>> would far rather not have to rerip my latest DVDs. So, I am willing to
>> experiment if it might save me a few hundred hours of reconstruction.
>> I don't currently have another 12 TB of space anywhere for making a
>> scratch copy.
>>
>> A few days ago sdb developed hard errors and I can no longer mount the
>> filesystem. sdb is no longer even recognized as a valid btrfs drive.
>> However, when I ran ddrescue over the drive I managed to make a clone
>> (sdf) which contains all but 12K of the original drive. However, those
>> missing 12K are all in the various superblocks, so the cloned drive is
>> still unreadable.
>>
>> In the hopes that I was only missing a few bits of the superblocks, I
>> started out by dd-ing the first 4M of sdd into sdf in the hopes that
>> ddrescue would overwrite much of the superblocks, and the final bits
>> from sdd would make things usable.
>>
>> No such luck. I now have a drive sdf which claims to be identical to
>> sdd but which is a clone of sdb. In case it matters, sda and sdc are
>> each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
>> and sdf is a 2TB clone of sdb.
>>
>> What I need to do is to somehow patch sdf's primary superblock so it
>> contains the correct device number and UUID_SUB for sdb, so that I can
>> attempt some sort of recovery. Right now my linux is (understandably)
>> quite confused by the situation:
>
> Did you try "btrfs rescue super-recover"?
>
> Remember to use the devel branch from git, as there is a small bug
> preventing it from reporting the correct result.

 Unfortunately my patchset which fixes super-recover is still not merged,
 so he needs to grab the patches from the mailing list and compile the
 btrfs tools himself. The patch in question can be found here:

 https://patchwork.kernel.org/patch/10092471/
>>>
>>> And just in case, "btrfs insp dump-super -fa" output could greatly help
>>> us to check if the backup superblocks are really good.
>>>
>>
>>
>



-- 
Stirling Westrup
Programmer, Entrepreneur.

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-28 Thread Alexei Starovoitov

On 12/27/17 11:51 PM, Masami Hiramatsu wrote:
> Then what happens if the user sets an invalid retval for those functions?
> Even if we limit the injectable functions, it can cause a problem.
>
> For example:
>
>  obj = func_return_object();
>  if (!obj) {
>         handling_error...;
>  }
>  obj->field = x;
>
> In this case, obviously func_return_object() must return NULL if there is
> an error, not -ENOMEM. But without the correct retval information, how would
> you check that the BPF code doesn't cause trouble?
> Currently it seems you are expecting only functions which return an error
> code:
>
>  ret = func_return_state();
>  if (ret < 0) {
>         handling_error...;
>  }
>
> But how can we distinguish those?
>
> If we have the error range for each function, we can ensure what is a
> *correct* error code: NULL, errno, or any other error number. :)

messing up return values may cause problems and a range check is
not going to magically help.
The caller may handle only a certain set of errors or interpret
some of them, like EBUSY, as a signal to retry.
It's plain impossible to make sure that the kernel will be functional
after error injection has been made.
Like kmalloc() unconditionally returning NULL will be deadly
for the kernel, hence this patch 4/4 has very limited practical
use. The bpf program needs to make intelligent decisions about when
to return an error and what kind of error to return.
Doing a blanket range check adds a false sense of additional safety.
More so, it wastes kilobytes of memory to do this check, hence nack.
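
To make that concrete, here is a hypothetical caller (illustration only,
not code from the patch): even an injected value that passes a "valid
negative errno" range check can wedge it, because -EBUSY means "retry":

#include <errno.h>

/* Stub standing in for an injectable function; imagine error injection
 * forcing it to fail with -EBUSY on every call. */
static int claim_resource(void)
{
	return -EBUSY;
}

/* The caller treats -EBUSY as transient and retries, so a range check
 * that merely proves the retval is a plausible -errno cannot save it. */
int setup_device(void)
{
	int ret;

	do {
		ret = claim_resource();
	} while (ret == -EBUSY);	/* spins forever under such injection */

	return ret;
}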



Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-28 Thread Alexei Starovoitov

On 12/28/17 12:20 AM, Masami Hiramatsu wrote:
> On Wed, 27 Dec 2017 20:32:07 -0800
> Alexei Starovoitov  wrote:
>
>> On 12/27/17 8:16 PM, Steven Rostedt wrote:
>>> On Wed, 27 Dec 2017 19:45:42 -0800
>>> Alexei Starovoitov  wrote:
>>>
>>>> I don't think that's the case. My reading of current
>>>> trace_kprobe_ftrace() -> arch_check_ftrace_location()
>>>> is that it will not be true for old mcount case.
>>>
>>> In the old mcount case, you can't use ftrace to return without calling
>>> the function. That is, no modification of the return ip, unless you
>>> created a trampoline that could handle arbitrary stack frames, and
>>> remove them from the stack before returning back to the function.
>>
>> correct. I was saying that trace_kprobe_ftrace() won't let us do
>> bpf_override_return with old mcount.
>
> No, trace_kprobe_ftrace() just checks whether the given address is
> managed by ftrace; see arch_check_ftrace_location() in kernel/kprobes.c.
>
> FYI, CONFIG_KPROBES_ON_FTRACE depends on DYNAMIC_FTRACE_WITH_REGS, and
> DYNAMIC_FTRACE_WITH_REGS doesn't depend on CC_USING_FENTRY.
> This means that if you compile the kernel with an old gcc and enable
> DYNAMIC_FTRACE, kprobes uses ftrace on the mcount address, which is NOT
> the entry point of the target function.

ok. fair enough. I think we can gate the feature to !mcount only.

> On the other hand, the IP-changing feature was originally implemented
> by kprobes with int3 (sw breakpoint). This means that with a kprobe at
> the correct address (the entry address of the function) you can hijack
> the function, as jprobe did.
>
>>>> As far as the rest of your arguments it very much puzzles me that
>>>> you claim that this patch suppose to work based on historical
>>>> reasoning whereas you did NOT test it.
>>>
>>> I believe that Masami is saying that the modification of the IP from
>>> kprobes has been very well tested. But I'm guessing that you still want
>>> a test case for using kprobes in this particular instance. It's not the
>>> implementation of modifying the IP that you are worried about, but the
>>> implementation of BPF using it in this case. Right?
>>
>> exactly. No doubt that old code works.
>> But it doesn't mean that bpf_override_return() will continue to
>> work in kprobes that are not ftrace based.
>> I suspect Josef's existing test case will cover this situation.
>> Probably only a special .config is needed to disable ftrace, so the
>> "kprobe on entry but not ftrace" check will kick in.
>
> Right. If you need to test it, you can run Josef's test case without
> CONFIG_DYNAMIC_FTRACE.

It should be obvious that the person who submits the patch
must run the tests.

>> But I didn't get an impression that this situation was tested.
>> Instead I see only logical reasoning that it's _supposed_ to work.
>> That's not enough.
>
> OK, so would you just ask me to run samples/bpf ?

Please run Josef's test in the !ftrace setup.



Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Qu Wenruo


On 2017年12月29日 07:09, Stirling Westrup wrote:
> Using "btrfs rescue super-recover" is of no use since there are no
> valid superblocks on the disk I need to fix.

Btrfs normally has 1 primary superblock and 1 or 2 backup superblocks.

super-recover is going to read the backup superblocks and use them as
the base to recover the primary superblock.

If super-recover can't even find the backups, then the disk is more
damaged than you expected.

> In fact, it's even worse,
> because the only even partly valid superblock is a copy of the one
> from drive sdd, which is a perfectly valid drive. What I need to do
> (as far as I can tell) is:
> 
> 1) Patch the UUID_SUB and device number of sdf to make it distinct
> from sdd. Or just generate an entirely new superblock for sdf which
> indicates it is device 2 in the 4-device BTRFS (rather than device 1
> which it now thinks it is).

You need your device UUID, which can be found in the device tree.
(If you can still mount the fs RO and degraded, you can read it
out from there.)

You're looking for this part of "btrfs ins dump-super" output:
--
...
cache_generation8
uuid_tree_generation8
dev_item.uuid   f1d9b288-7865-463f-a65c-ca8b1fbde09b <
dev_item.fsid   1dd513fb-45f8-404f-ae23-979e3acb78ad [match]
dev_item.type   0
dev_item.total_bytes10737418240
...
--

> 
> 2) Recover (somehow) whatever other information from the superblock
> that is missing.
> 

Just as I said, if your backup super is also corrupted, there is little
chance of recovery.

To verify if the backups are still alive, please paste the output of
"btrfs ins dump-super -fa".
(Even if you think super-recover is of no use, the output can still help.)

Thanks,
Qu

> 
> 
> On Thu, Dec 28, 2017 at 7:11 AM, Qu Wenruo  wrote:
>>
>>
>> On 2017年12月28日 19:41, Nikolay Borisov wrote:
>>>
>>>
>>> On 28.12.2017 03:53, Qu Wenruo wrote:


 On 2017年12月28日 09:46, Stirling Westrup wrote:
> Here's my situation: I have a network file server containing a 12TB
> BTRFS spread out over four devices (sda-sdd) which I am trying to
> recover. I do have a backup, but it's about 3 months old, and while I
> could certainly rebuild everything from that if I really had to, I
> would far rather not have to rerip my latest DVDs. So, I am willing to
> experiment if it might save me a few hundred hours of reconstruction.
> I don't currently have another 12 TB of space anywhere for making a
> scratch copy.
>
> A few days ago sdb developed hard errors and I can no longer mount the
> filesystem. sdb is no longer even recognized as a valid btrfs drive.
> However, when I ran ddrescue over the drive I managed to make a clone
> (sdf) which contains all but 12K of the original drive. However, those
> missing 12K are all in the various superblocks, so the cloned drive is
> still unreadable.
>
> In the hopes that I was only missing a few bits of the superblocks, I
> started out by dd-ing the first 4M of sdd into sdf in the hopes that
> ddrescue would overwrite much of the superblocks, and the final bits
> from sdd would make things usable.
>
> No such luck. I now have a drive sdf which claims to be identical to
> sdd but which is a clone of sdb. In case it matters, sda and sdc are
> each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
> and sdf is a 2TB clone of sdb.
>
> What I need to do is to somehow patch sdf's primary superblock so it
> contains the correct device number and UUID_SUB for sdb, so that I can
> attempt some sort of recovery. Right now my linux is (understandably)
> quite confused by the situation:

 Did you try "btrfs rescue super-recover"?

 Remember to use the devel branch from git, as there is a small bug
 preventing it from reporting the correct result.
>>>
>>> Unfortunately my patchset which fixes super-recover is still not merged,
>>> so he needs to grab the patches from the mailing list and compile the
>>> btrfs tools himself. The patch in question can be found here:
>>>
>>> https://patchwork.kernel.org/patch/10092471/
>>
>> And just in case, "btrfs insp dump-super -fa" output could greatly help
>> us to check if the backup superblocks are really good.
>>
> 
> 





Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Stirling Westrup
Using "btrfs rescue super-recover" is of no use since there are no
valid superblocks on the disk I need to fix. In fact, it's even worse,
because the only even partly valid superblock is a copy of the one
from drive sdd, which is a perfectly valid drive. What I need to do
(as far as I can tell) is:

1) Patch the UUID_SUB and device number of sdf to make it distinct
from sdd. Or just generate an entirely new superblock for sdf which
indicates it is device 2 in the 4-device BTRFS (rather than device 1
which it now thinks it is).

2) Recover (somehow) whatever other information from the superblock
that is missing.



On Thu, Dec 28, 2017 at 7:11 AM, Qu Wenruo  wrote:
>
>
> On 2017年12月28日 19:41, Nikolay Borisov wrote:
>>
>>
>> On 28.12.2017 03:53, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年12月28日 09:46, Stirling Westrup wrote:
 Here's my situation: I have a network file server containing a 12TB
 BTRFS spread out over four devices (sda-sdd) which I am trying to
 recover. I do have a backup, but it's about 3 months old, and while I
 could certainly rebuild everything from that if I really had to, I
 would far rather not have to rerip my latest DVDs. So, I am willing to
 experiment if it might save me a few hundred hours of reconstruction.
 I don't currently have another 12 TB of space anywhere for making a
 scratch copy.

 A few days ago sdb developed hard errors and I can no longer mount the
 filesystem. sdb is no longer even recognized as a valid btrfs drive.
 However, when I ran ddrescue over the drive I managed to make a clone
 (sdf) which contains all but 12K of the original drive. However, those
 missing 12K are all in the various superblocks, so the cloned drive is
 still unreadable.

 In the hopes that I was only missing a few bits of the superblocks, I
 started out by dd-ing the first 4M of sdd into sdf in the hopes that
 ddrescue would overwrite much of the superblocks, and the final bits
 from sdd would make things usable.

 No such luck. I now have a drive sdf which claims to be identical to
 sdd but which is a clone of sdb. In case it matters, sda and sdc are
 each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
 and sdf is a 2TB clone of sdb.

 What I need to do is to somehow patch sdf's primary superblock so it
 contains the correct device number and UUID_SUB for sdb, so that I can
 attempt some sort of recovery. Right now my linux is (understandably)
 quite confused by the situation:
>>>
>>> Did you try "btrfs rescue super-recover"?
>>>
>>> Remember to use the devel branch from git, as there is a small bug
>>> preventing it from reporting the correct result.
>>
>> Unfortunately my patchset which fixes super-recover is still not merged,
>> so he needs to grab the patches from the mailing list and compile the
>> btrfs tools himself. The patch in question can be found here:
>>
>> https://patchwork.kernel.org/patch/10092471/
>
> And just in case, "btrfs insp dump-super -fa" output could greatly help
> us to check if the backup superblocks are really good.
>


-- 
Stirling Westrup
Programmer, Entrepreneur.
https://www.linkedin.com/e/fpf/77228
http://www.linkedin.com/in/swestrup
http://technaut.livejournal.com
http://sourceforge.net/users/stirlingwestrup


Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices

2017-12-28 Thread Timofey Titovets
2017-12-28 11:06 GMT+03:00 Dmitrii Tcvetkov :
> On Thu, 28 Dec 2017 01:39:31 +0300
> Timofey Titovets  wrote:
>
>> Currently the btrfs raid1/10 balancer balances requests across mirrors,
>> based on pid % num of mirrors.
>>
>> Update the logic to make it aware of whether the underlying devices are
>> non-rotational.
>>
>> If one of the mirrors is non-rotational, then all read requests will be
>> moved to the non-rotational device.
>>
>> If both mirrors are non-rotational, calculate the sum of pending and
>> in-flight requests for each bdev's queue and use the device with the
>> shortest queue.
>>
>> P.S.
>> Inspired by md-raid1 read balancing
>>
>> Signed-off-by: Timofey Titovets 
>> ---
>>  fs/btrfs/volumes.c | 59 ++
>>  1 file changed, 59 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 9a04245003ab..98bc2433a920 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>> 	return ret;
>>  }
>>
>> +static inline int bdev_get_queue_len(struct block_device *bdev)
>> +{
>> + int sum = 0;
>> + struct request_queue *rq = bdev_get_queue(bdev);
>> +
>> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
>> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
>> +
>
> This won't work as expected if the bdev is controlled by blk-mq; these
> counters will be zero. AFAIK, to get this info in a block-layer-agnostic
> way, part_in_flight() [1] has to be used. It extracts these counters
> appropriately.
>
> But it needs to be EXPORT_SYMBOL()'ed in block/genhd.c so we can continue
> to build btrfs as a module.
>
>> + /*
>> +  * Try to prevent switching for every sneeze
>> +  * by rounding the output up to a multiple of 2
>> +  */
>> + return ALIGN(sum, 2);
>> +}
>> +
>>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   struct map_lookup *map, int first, int num,
>>   int optimal, int dev_replace_is_ongoing)
>>  {
>>   int i;
>>   int tolerance;
>> + struct block_device *bdev;
>>   struct btrfs_device *srcdev;
>> + bool all_bdev_nonrot = true;
>>
>>   if (dev_replace_is_ongoing &&
>>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>> 	else
>>   srcdev = NULL;
>>
>> + /*
>> +  * Optimal is expected to be pid % num.
>> +  * That's generally OK for spinning rust drives,
>> +  * but if one of the mirrors is non-rotational,
>> +  * that bdev can show better performance.
>> +  *
>> +  * If one of the disks is non-rotational:
>> +  *  - set optimal to the non-rotational device
>> +  * If both disks are non-rotational:
>> +  *  - set optimal to the bdev with the shortest queue
>> +  * If both disks are spinning rust:
>> +  *  - keep the old pid % num choice
>> +  */
>> + for (i = 0; i < num; i++) {
>> + bdev = map->stripes[i].dev->bdev;
>> + if (!bdev)
>> + continue;
>> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
>> + optimal = i;
>> + else
>> + all_bdev_nonrot = false;
>> + }
>> +
>> + if (all_bdev_nonrot) {
>> + int qlen;
>> + /* Force the following logic's choice by initializing
>> +  * with some big number */
>> + int optimal_dev_rq_count = 1 << 24;
>
> Probably better to use the INT_MAX macro instead.
>
> [1] https://elixir.free-electrons.com/linux/v4.15-rc5/source/block/genhd.c#L68
>

Thank you very much!

-- 
Have a nice day,
Timofey.


Re: Project idea: reduce boot time/RAM usage: option to disable/delay raid6_pq and xor kmod

2017-12-28 Thread David Disseldorp
On Sun, 24 Dec 2017 13:31:40 +0100, Ceriel Jacobs wrote:

> Saving:
> 1. ± 0.4 seconds of boot time (10% of boot until root)
> 2. ± 150k of RAM
> 3. ± 75k of disk space

Thanks for bringing this up - I'm also particularly frustrated by the
boot delay caused by the raid6 algorithm benchmark (1).

> New kernel command-line parameters?
> a.) disable, like:
>  - btrfs=noraid6_pq
>  - btrfs=noraid (=no xor at all)
> b.) delay raid6_pq and xor module loading, for cases where root mount 
> doesn't need raid6_pq and/or xor.

c) It might not help with (2) or (3), but I'd be happy with an option to
preselect the raid6 algorithm, so that the benchmark wouldn't run on each
boot; a rough sketch follows.
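
Something along these lines, perhaps (a hypothetical sketch against
lib/raid6/algos.c, not an existing parameter; raid6_algos and struct
raid6_calls are the real tables there, the rest is made up):

/* Skip the boot-time benchmark when the user preselects an algorithm,
 * e.g. raid6_pq.algo=avx2x4 on the kernel command line. */
static char *raid6_algo_name;
module_param_named(algo, raid6_algo_name, charp, 0444);
MODULE_PARM_DESC(algo, "preselected raid6 algorithm (skips benchmark)");

static const struct raid6_calls *raid6_choose_preselected(void)
{
	const struct raid6_calls *const *algo;

	if (!raid6_algo_name)
		return NULL;

	for (algo = raid6_algos; *algo; algo++)
		if (!strcmp((*algo)->name, raid6_algo_name) &&
		    (!(*algo)->valid || (*algo)->valid()))
			return *algo;

	return NULL;	/* unknown/unusable name: fall back to the benchmark */
}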

Cheers, David


[PATCH] Btrfs: replace raid56 stripe bubble sort with insertion sort

2017-12-28 Thread Timofey Titovets
Insertion sort generally performs better than bubble sort
by doing fewer iterations on average.
This version also shifts each element toward its final position
instead of doing raw swaps.

I'm not sure how many stripes per bio btrfs raid56
tries to store (and tries to sort).

Either way, it's a bit shorter, in the name of a great justice.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/volumes.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 98bc2433a920..7195fc8c49b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5317,29 +5317,24 @@ static inline int parity_smaller(u64 a, u64 b)
return a > b;
 }
 
-/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
+/* Insertion-sort the stripe set to put the parity/syndrome stripes last */
 static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes)
 {
struct btrfs_bio_stripe s;
-   int i;
+   int i, j;
u64 l;
-   int again = 1;
 
-   while (again) {
-   again = 0;
-   for (i = 0; i < num_stripes - 1; i++) {
-   if (parity_smaller(bbio->raid_map[i],
-  bbio->raid_map[i+1])) {
-   s = bbio->stripes[i];
-   l = bbio->raid_map[i];
-   bbio->stripes[i] = bbio->stripes[i+1];
-   bbio->raid_map[i] = bbio->raid_map[i+1];
-   bbio->stripes[i+1] = s;
-   bbio->raid_map[i+1] = l;
-
-   again = 1;
-   }
+   for (i = 1; i < num_stripes; i++) {
+   s = bbio->stripes[i];
+   l = bbio->raid_map[i];
+   for (j = i - 1; j >= 0; j--) {
+   if (!parity_smaller(bbio->raid_map[j], l))
+   break;
+   bbio->stripes[j+1]  = bbio->stripes[j];
+   bbio->raid_map[j+1] = bbio->raid_map[j];
}
+   bbio->stripes[j+1]  = s;
+   bbio->raid_map[j+1] = l;
}
 }
 
-- 
2.15.1


Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices

2017-12-28 Thread waxhead



Timofey Titovets wrote:
> Currently the btrfs raid1/10 balancer balances requests across mirrors,
> based on pid % num of mirrors.
>
> Update the logic to make it aware of whether the underlying devices are
> non-rotational.
>
> If one of the mirrors is non-rotational, then all read requests will be
> moved to the non-rotational device.

And this would make reads always end up on the fastest device regardless
of the PID, which sounds sane enough, but scrubbing will be even more
important, since there is less chance that a "random PID" will check the
other copy every now and then.

> If both mirrors are non-rotational, calculate the sum of pending and
> in-flight requests for each bdev's queue and use the device with the
> shortest queue.

I think this should be tried out on rotational disks as well. I am happy
to test this out for you on a 7-disk server if you want.
Note: I have no experience with compiling kernels and applying patches
(but I do code a bit in C every now and then), so a pre-compiled kernel
would be required (I believe you are on Debian as well).
For rotational disks it would perhaps not be wise to use another mirror
unless its queue is significantly shorter than the current one's; see the
sketch after this paragraph. Again, I am happy to test if tunables are
provided.
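
To make that idea concrete, a minimal sketch (rot_hysteresis is a
hypothetical tunable, not something in the patch):

	/* On spinning rust, repick a mirror only when its queue is
	 * shorter than the current choice by a clear margin. */
	if (qlen[i] + rot_hysteresis < qlen[optimal])
		optimal = i;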



> P.S.
> Inspired by md-raid1 read balancing
>
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/volumes.c | 59 ++
>  1 file changed, 59 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..98bc2433a920 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
> 	return ret;
>  }
>
> +static inline int bdev_get_queue_len(struct block_device *bdev)
> +{
> +   int sum = 0;
> +   struct request_queue *rq = bdev_get_queue(bdev);
> +
> +   sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
> +   sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
> +
> +   /*
> +* Try to prevent switching for every sneeze
> +* by rounding the output up to a multiple of 2
> +*/
> +   return ALIGN(sum, 2);
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
> 	struct map_lookup *map, int first, int num,
> 	int optimal, int dev_replace_is_ongoing)
>  {
> 	int i;
> 	int tolerance;
> +   struct block_device *bdev;
> 	struct btrfs_device *srcdev;
> +   bool all_bdev_nonrot = true;
>
> 	if (dev_replace_is_ongoing &&
> 	fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> 	else
> 		srcdev = NULL;
>
> +   /*
> +* Optimal is expected to be pid % num.
> +* That's generally OK for spinning rust drives,
> +* but if one of the mirrors is non-rotational,
> +* that bdev can show better performance.
> +*
> +* If one of the disks is non-rotational:
> +*  - set optimal to the non-rotational device
> +* If both disks are non-rotational:
> +*  - set optimal to the bdev with the shortest queue
> +* If both disks are spinning rust:
> +*  - keep the old pid % num choice
> +*/
> +   for (i = 0; i < num; i++) {
> +   bdev = map->stripes[i].dev->bdev;
> +   if (!bdev)
> +   continue;
> +   if (blk_queue_nonrot(bdev_get_queue(bdev)))
> +   optimal = i;
> +   else
> +   all_bdev_nonrot = false;
> +   }
> +
> +   if (all_bdev_nonrot) {
> +   int qlen;
> +   /* Force the following logic's choice by initializing
> +    * with some big number */
> +   int optimal_dev_rq_count = 1 << 24;
> +
> +   for (i = 0; i < num; i++) {
> +   bdev = map->stripes[i].dev->bdev;
> +   if (!bdev)
> +   continue;
> +
> +   qlen = bdev_get_queue_len(bdev);
> +
> +   if (qlen < optimal_dev_rq_count) {
> +   optimal = i;
> +   optimal_dev_rq_count = qlen;
> +   }
> +   }
> +   }
> +
> 	/*
> 	 * try to avoid the drive that is the source drive for a
> 	 * dev-replace procedure, only choose it if no other non-missing


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Qu Wenruo


On 2017年12月28日 19:41, Nikolay Borisov wrote:
> 
> 
> On 28.12.2017 03:53, Qu Wenruo wrote:
>>
>>
>> On 2017年12月28日 09:46, Stirling Westrup wrote:
>>> Here's my situation: I have a network file server containing a 12TB
>>> BTRFS spread out over four devices (sda-sdd) which I am trying to
>>> recover. I do have a backup, but it's about 3 months old, and while I
>>> could certainly rebuild everything from that if I really had to, I
>>> would far rather not have to rerip my latest DVDs. So, I am willing to
>>> experiment if it might save me a few hundred hours of reconstruction.
>>> I don't currently have another 12 TB of space anywhere for making a
>>> scratch copy.
>>>
>>> A few days ago sdb developed hard errors and I can no longer mount the
>>> filesystem. sdb is no longer even recognized as a valid btrfs drive.
>>> However, when I ran ddrescue over the drive I managed to make a clone
>>> (sdf) which contains all but 12K of the original drive. However, those
>>> missing 12K are all in the various superblocks, so the cloned drive is
>>> still unreadable.
>>>
>>> In the hopes that I was only missing a few bits of the superblocks, I
>>> started out by dd-ing the first 4M of sdd into sdf in the hopes that
>>> ddrescue would overwrite much of the superblocks, and the final bits
>>> from sdd would make things usable.
>>>
>>> No such luck. I now have a drive sdf which claims to be identical to
>>> sdd but which is a clone of sdb. In case it matters, sda and sdc are
>>> each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
>>> and sdf is a 2TB clone of sdb.
>>>
>>> What I need to do is to somehow patch sdf's primary superblock so it
>>> contains the correct device number and UUID_SUB for sdb, so that I can
>>> attempt some sort of recovery. Right now my linux is (understandably)
>>> quite confused by the situation:
>>
>> Did you try "btrfs rescue super-recover"?
>>
>> Remember to use the devel branch from git, as there is a small bug
>> preventing it from reporting the correct result.
> 
> Unfortunately my patchset which fixes super-recover is still not merged,
> so he needs to grab the patches from the mailing list and compile the
> btrfs tools himself. The patch in question can be found here:
> 
> https://patchwork.kernel.org/patch/10092471/

And just in case, "btrfs insp dump-super -fa" output could greatly help
us to check if the backup superblocks are really good.

Thanks,
Qu
> 
>>
>> super-recover will try to use the backup superblock to recover the
>> primary one.
>>
>> Thanks,
>> Qu
>>
>>>
>>> videon:~ # uname -a
>>> Linux videon 4.4.103-18.41-default #1 SMP Wed Dec 13 14:06:33 UTC 2017
>>> (f66c68c) x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> videon:~ # btrfs --version
>>> btrfs-progs v4.5.3+20160729
>>>
>>> videon:~ # btrfs fi show
>>> Label: 'Storage'  uuid: 33d2890d-f07d-4ba8-b1fc-7b4f14463b1f
>>> Total devices 4 FS bytes used 10.69TiB
>>> devid1 size 1.82TiB used 1.82TiB path /dev/sdd
>>> devid3 size 3.64TiB used 3.54TiB path /dev/sdc
>>> devid4 size 3.64TiB used 3.54TiB path /dev/sda
>>> *** Some devices missing
>>>
>>> Any suggestions on how to proceed would be appreciated.
>>>
>>





Re: Hand Patching a BTRFS Superblock?

2017-12-28 Thread Nikolay Borisov


On 28.12.2017 03:53, Qu Wenruo wrote:
> 
> 
> On 2017年12月28日 09:46, Stirling Westrup wrote:
>> Here's my situation: I have a network file server containing a 12TB
>> BTRFS spread out over four devices (sda-sdd) which I am trying to
>> recover. I do have a backup, but it's about 3 months old, and while I
>> could certainly rebuild everything from that if I really had to, I
>> would far rather not have to rerip my latest DVDs. So, I am willing to
>> experiment if it might save me a few hundred hours of reconstruction.
>> I don't currently have another 12 TB of space anywhere for making a
>> scratch copy.
>>
>> A few days ago sdb developed hard errors and I can no longer mount the
>> filesystem. sdb is no longer even recognized as a valid btrfs drive.
>> However, when I ran ddrescue over the drive I managed to make a clone
>> (sdf) which contains all but 12K of the original drive. However, those
>> missing 12K are all in the various superblocks, so the cloned drive is
>> still unreadable.
>>
>> In the hopes that I was only missing a few bits of the superblocks, I
>> started out by dd-ing the first 4M of sdd into sdf in the hopes that
>> ddrescue would overwrite much of the superblocks, and the final bits
>> from sdd would make things usable.
>>
>> No such luck. I now have a drive sdf which claims to be identical to
>> sdd but which is a clone of sdb. In case it matters, sda and sdc are
>> each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
>> and sdf is a 2TB clone of sdb.
>>
>> What I need to do is to somehow patch sdf's primary superblock so it
>> contains the correct device number and UUID_SUB for sdb, so that I can
>> attempt some sort of recovery. Right now my linux is (understandably)
>> quite confused by the situation:
> 
> Did you try "btrfs rescue super-recover"?
> 
> Remember to use the devel branch from git, as there is a small bug
> preventing it from reporting the correct result.

Unfortunately my patchset which fixes super-recover is still not merged,
so he needs to grab the patches from the mailing list and compile the
btrfs tools himself. The patch in question can be found here:

https://patchwork.kernel.org/patch/10092471/

> 
> super-recover will try to use the backup superblock to recover the
> primary one.
> 
> Thanks,
> Qu
> 
>>
>> videon:~ # uname -a
>> Linux videon 4.4.103-18.41-default #1 SMP Wed Dec 13 14:06:33 UTC 2017
>> (f66c68c) x86_64 x86_64 x86_64 GNU/Linux
>>
>> videon:~ # btrfs --version
>> btrfs-progs v4.5.3+20160729
>>
>> videon:~ # btrfs fi show
>> Label: 'Storage'  uuid: 33d2890d-f07d-4ba8-b1fc-7b4f14463b1f
>> Total devices 4 FS bytes used 10.69TiB
>> devid1 size 1.82TiB used 1.82TiB path /dev/sdd
>> devid3 size 3.64TiB used 3.54TiB path /dev/sdc
>> devid4 size 3.64TiB used 3.54TiB path /dev/sda
>> *** Some devices missing
>>
>> Any suggestions on how to proceed would be appreciated.
>>
> 


Re: btrfs balance problems

2017-12-28 Thread Nikolay Borisov


On 23.12.2017 13:19, James Courtier-Dutton wrote:
> Hi,
> 
> During a btrfs balance, the process hogs all CPU.
> Or, to be exact, any other program that wishes to use the SSD during a
> btrfs balance is blocked for long periods. Long periods being more
> than 5 seconds.
> Is there any way to multiplex SSD access while btrfs balance is
> operating, so that other applications can still access the SSD with
> relatively low latency?
> 
> My guess is that btrfs is doing a transaction with a large number of
> SSD blocks at a time, and thus blocking other applications.
> 
> This makes for atrocious user interactivity as well as applications
> failing because they cannot access the disk in a relatively low-latency
> manner.
> For, example, this is causing a High Definition network CCTV
> application to fail.
> 
> What I would really like, is for some way to limit SSD bandwidths to
> applications.
> For example the CCTV app always gets the bandwidth it needs, and all
> other applications can still access the SSD, but are rate limited.
> This would fix my particular problem.
> We have rate limiting for network applications; why not for disk access too?

So how are you running btrfs balance? Are you using any filters
whatsoever? The documentation
[https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-balance] has the
following warning:

Warning: running balance without filters will take a lot of time as it
basically rewrites the entire filesystem and needs to update all block
pointers.
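
For example (illustrative thresholds), usage filters restrict the balance
to chunks at most N% full, which touches far less data than a full rewrite:

# btrfs balance start -dusage=50 -musage=50 /mnt

This rewrites only data and metadata chunks that are at most 50% used,
instead of every chunk in the filesystem.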


> 
> Kind Regards
> 
> James


Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-28 Thread Masami Hiramatsu
On Wed, 27 Dec 2017 20:32:07 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 8:16 PM, Steven Rostedt wrote:
> > On Wed, 27 Dec 2017 19:45:42 -0800
> > Alexei Starovoitov  wrote:
> >
> >> I don't think that's the case. My reading of current
> >> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> >> is that it will not be true for old mcount case.
> >
> > In the old mcount case, you can't use ftrace to return without calling
> > the function. That is, no modification of the return ip, unless you
> > created a trampoline that could handle arbitrary stack frames, and
> > remove them from the stack before returning back to the function.
> 
> correct. I was saying that trace_kprobe_ftrace() won't let us do
> bpf_override_return with old mcount.

No, trace_kprobe_ftrace() just checks whether the given address is
managed by ftrace; see arch_check_ftrace_location() in kernel/kprobes.c.

FYI, CONFIG_KPROBES_ON_FTRACE depends on DYNAMIC_FTRACE_WITH_REGS, and
DYNAMIC_FTRACE_WITH_REGS doesn't depend on CC_USING_FENTRY.
This means that if you compile the kernel with an old gcc and enable
DYNAMIC_FTRACE, kprobes uses ftrace on the mcount address, which is NOT
the entry point of the target function.

On the other hand, the IP-changing feature was originally implemented
by kprobes with int3 (sw breakpoint). This means that with a kprobe at
the correct address (the entry address of the function) you can hijack
the function, as jprobe did.

> >> As far as the rest of your arguments it very much puzzles me that
> >> you claim that this patch suppose to work based on historical
> >> reasoning whereas you did NOT test it.
> >
> > I believe that Masami is saying that the modification of the IP from
> > kprobes has been very well tested. But I'm guessing that you still want
> > a test case for using kprobes in this particular instance. It's not the
> > implementation of modifying the IP that you are worried about, but the
> > implementation of BPF using it in this case. Right?
> 
> exactly. No doubt that old code works.
> But it doesn't mean that bpf_override_return() will continue to
> work in kprobes that are not ftrace based.
> I suspect Josef's existing test case will cover this situation.
> Probably only a special .config is needed to disable ftrace, so the
> "kprobe on entry but not ftrace" check will kick in.

Right. If you need to test it, you can run Josef's test case without
CONFIG_DYNAMIC_FTRACE.

> But I didn't get an impression that this situation was tested.
> Instead I see only logical reasoning that it's _supposed_ to work.
> That's not enough.

OK, so would you just ask me to run samples/bpf ?

Thanks,

-- 
Masami Hiramatsu 


Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices

2017-12-28 Thread Dmitrii Tcvetkov
On Thu, 28 Dec 2017 01:39:31 +0300
Timofey Titovets  wrote:

> Currently the btrfs raid1/10 balancer balances requests across mirrors,
> based on pid % num of mirrors.
> 
> Update the logic to make it aware of whether the underlying devices are
> non-rotational.
> 
> If one of the mirrors is non-rotational, then all read requests will be
> moved to the non-rotational device.
> 
> If both mirrors are non-rotational, calculate the sum of pending and
> in-flight requests for each bdev's queue and use the device with the
> shortest queue.
> 
> P.S.
> Inspired by md-raid1 read balancing
> 
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/volumes.c | 59 ++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..98bc2433a920 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
> 	return ret;
>  }
>  
> +static inline int bdev_get_queue_len(struct block_device *bdev)
> +{
> + int sum = 0;
> + struct request_queue *rq = bdev_get_queue(bdev);
> +
> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
> +

This won't work as expected if the bdev is controlled by blk-mq; these
counters will be zero. AFAIK, to get this info in a block-layer-agnostic
way, part_in_flight() [1] has to be used. It extracts these counters
appropriately.

But it needs to be EXPORT_SYMBOL()'ed in block/genhd.c so we can continue
to build btrfs as a module.

> + /*
> +  * Try to prevent switching for every sneeze
> +  * by rounding the output up to a multiple of 2
> +  */
> + return ALIGN(sum, 2);
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first, int num,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
>   int tolerance;
> + struct block_device *bdev;
>   struct btrfs_device *srcdev;
> + bool all_bdev_nonrot = true;
>  
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> 	else
>   srcdev = NULL;
>  
> + /*
> +  * Optimal is expected to be pid % num.
> +  * That's generally OK for spinning rust drives,
> +  * but if one of the mirrors is non-rotational,
> +  * that bdev can show better performance.
> +  *
> +  * If one of the disks is non-rotational:
> +  *  - set optimal to the non-rotational device
> +  * If both disks are non-rotational:
> +  *  - set optimal to the bdev with the shortest queue
> +  * If both disks are spinning rust:
> +  *  - keep the old pid % num choice
> +  */
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
> + optimal = i;
> + else
> + all_bdev_nonrot = false;
> + }
> +
> + if (all_bdev_nonrot) {
> + int qlen;
> + /* Force the following logic's choice by initializing
> +  * with some big number */
> + int optimal_dev_rq_count = 1 << 24;

Probably better to use the INT_MAX macro instead.

[1] https://elixir.free-electrons.com/linux/v4.15-rc5/source/block/genhd.c#L68
