Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-02 Thread Damien Le Moal
Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
> 
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>> 
>>> Can you please integrate this with Hannes' series so that it uses
>>> his cache of the zone information?
>> 
>> Adding Hannes and Damien to Cc.
>> 
>> Christoph,
>> 
>> I can make a patch that marshals Hannes' RB-Tree into a block report, which
>> is quite simple. I can even have the open/close/reset zone commands update
>> the RB-Tree .. the non-private parts anyway. I would prefer to do this around
>> the CONFIG_SD_ZBC support, offering the existing type of patch for setups
>> that do not need the RB-Tree to function with zoned media.
>> 
>> I do still have concerns with the approach which I have shared in smaller
>> forums but perhaps I have to bring them to this group.
>> 
>> First is the memory consumption. This isn't really much of a concern for
>> large servers with few drives but I think the embedded NAS market will
>> grumble as well as the large data pods trying to stuff 300+ drives in a
>> chassis.
>> 
>> As of now the RB-Tree needs to hold ~30,000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>> around 3.5 MB per zoned drive attached.
>> Which is fine if it is really needed, but most of it is fixed information
>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>> in an array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is mostly
>> out of date because it is updated only after the write completes; zones
>> being actively written to must work off of the last location / size that was
>> submitted, not completed. The workaround is for that tracking to be handled
>> in the private_data member. I am not saying that updating the wp on
>> completing a write isn't important, I am saying that the bi_end_io hook is
>> the existing hook that works just fine.
>> 
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submitting the I/O (on the reasoning that most of the time
> I/O will succeed) and re-read the zone information if an I/O failed.
> (Which I'll have to do anyway as after an I/O failure the write pointer
> status is not clearly defined.)
> 
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing code so that this can be handled is
certainly the right way of doing things. However, if we decide to go forward
with mapping the RESET WRITE POINTER command to DISCARD, then at least a
constant zone size (minus the last zone, as you said) must be assumed, and
that information can be removed from the entries in the RB tree (it will be
saved in the sysfs "zone_size" file anyway). Adding a little code to handle
an eventual last runt zone with a different size is not a big problem.
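Under the constant-zone-size assumption discussed above, a condensed per-zone representation can indeed be very small. The following is a purely illustrative userspace sketch (none of these structure or function names come from the actual patches): an 8-byte packed entry per zone, with zone lookup reduced to a shift and a possible runt last zone handled as one special case.

```c
#include <stdint.h>

/*
 * Illustrative only: with a constant, power-of-two zone size, the fixed
 * per-zone fields (start LBA, length) need not be stored at all. A packed
 * 8-byte entry holding just the write pointer offset and a few state bits
 * is enough, and zone lookup becomes a shift.
 */
struct packed_zone {
	uint64_t wp_offset : 40;	/* write pointer, sectors from zone start */
	uint64_t cond      :  4;	/* zone condition (empty/open/full/...) */
	uint64_t type      :  4;	/* conventional / seq-write-required ... */
	uint64_t reserved  : 16;
};

/* Map an absolute sector to a zone index; zone_shift = ilog2(zone_sectors). */
static inline unsigned int zone_index(uint64_t sector, unsigned int zone_shift)
{
	return (unsigned int)(sector >> zone_shift);
}

/* A runt last zone simply means the last index covers fewer sectors. */
static inline uint64_t zone_len(unsigned int idx, unsigned int nr_zones,
				uint64_t zone_sectors, uint64_t capacity)
{
	if (idx == nr_zones - 1)
		return capacity - (uint64_t)idx * zone_sectors;
	return zone_sectors;
}
```

With 256 MiB zones (524288 sectors, shift 19) this comes to 8 bytes per zone instead of 120, roughly a 15x reduction in the per-drive footprint.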

> 
> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.

The RB-tree here is most useful for reads over or after the write pointer, as
these can have different behavior on different drives (URSWRZ bit). The RB-tree
allows us to hide these differences from upper layers and simplify support at
those levels.

I too agree that the write pointer value is not very useful to upper layers
(e.g. FS). What matters most of the time for these layers is an "allocation
pointer" or "submission pointer" which indicates the LBA to use for the next
BIO submission. That LBA will most of the time be in advance of the zone WP
because of request queuing in the block scheduler.
Note that from what we have done already, in many cases, upper layers do not
even need this (e.g. F2FS, btrfs) and work fine without needing *any*
zone->private_data because that "allocation pointer" information generally
already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot
more code to be added to file systems for SMR support.
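As a rough illustration of the point that the "allocation pointer" already exists in a different form: for a sequentially written zone, the next LBA to submit can be recovered by scanning an allocation bitmap for the first clear bit. This is a hypothetical userspace sketch, not code from F2FS or btrfs:

```c
#include <stdint.h>

/*
 * Hypothetical sketch (not F2FS/btrfs code): in a sequentially written
 * zone the allocated blocks form a prefix of the zone, so the next LBA
 * to submit -- the "allocation pointer" -- can be recovered by finding
 * the first clear bit in the zone's allocation bitmap.
 */
static uint64_t next_alloc_sector(const uint64_t *bitmap,
				  unsigned int nr_blocks,
				  uint64_t zone_start,
				  unsigned int sectors_per_block)
{
	for (unsigned int b = 0; b < nr_blocks; b++) {
		if (!(bitmap[b / 64] & (1ULL << (b % 64))))
			return zone_start + (uint64_t)b * sectors_per_block;
	}
	/* Zone fully allocated: the pointer sits at the end of the zone. */
	return zone_start + (uint64_t)nr_blocks * sectors_per_block;
}
```

Nothing beyond the bitmap the filesystem already maintains is needed, which is why these users get by without zone->private_data.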

> 
>> This all tails into domain responsibility. With the RB-Tree doing half of the
>> work and the ‘responsible’ domain handling the active path via private_data,
>> why have the split at all? It seems to be double work to have a second object
>> tracking the first so that I/O scheduling can function.

[PATCH 0039/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Baole Ni
I find that developers often specify a numeric value when calling a macro
that takes an access-permission parameter. These numeric permission values
have corresponding macros, and using the macros improves the robustness and
readability of the code, so I suggest replacing the numeric parameters with
the macros.

Signed-off-by: Chuansheng Liu 
Signed-off-by: Baole Ni 
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9f42526..1079e623 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1807,7 +1807,7 @@ static const struct kernel_param_ops disk_events_dfl_poll_msecs_param_ops = {
 #define MODULE_PARAM_PREFIX"block."
 
 module_param_cb(events_dfl_poll_msecs, &disk_events_dfl_poll_msecs_param_ops,
-   &disk_events_dfl_poll_msecs, 0644);
+   &disk_events_dfl_poll_msecs, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
 
 /*
  * disk_{alloc|add|del|release}_events - initialize and destroy disk_events.
-- 
2.9.2
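For reference, the symbolic and octal spellings are identical values, so the patch is a pure readability change. A quick check (plain userspace C, illustrative only):

```c
#include <sys/stat.h>

/*
 * The symbolic spelling expands to exactly the same value as the octal
 * literal it replaces, so the patch changes readability only:
 *   S_IRUSR = 0400, S_IWUSR = 0200, S_IRGRP = 0040, S_IROTH = 0004
 */
static const unsigned int mode_0644 = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
static const unsigned int mode_0444 = S_IRUSR | S_IRGRP | S_IROTH;
```

The 0444 case from the subject line follows the same pattern, just without the owner-write bit.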

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fixup direct bi_rw modifiers

2016-08-02 Thread Jens Axboe

On 08/02/2016 05:32 AM, Christoph Hellwig wrote:
> On Mon, Aug 01, 2016 at 01:55:36PM -0600, Jens Axboe wrote:
>>
>> Set of three patches, where the target one is an actual bug fix...
>> Temporary branch, I'll rebase it once -rc1 is out, if more
>> changes/fixups need to be made in the next week until that happens.
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.8/bi_rwf
>
> Thanks.  Any chance to have a slightly better name for the field?
> E.g. bi_opf or bi_op_flags if we want to be a bit more verbose.

bi_opf is fine with me as well, don't care too strongly about bi_rwf or
bi_opf. I don't like the longer variants.



--
Jens Axboe



[PATCH 0310/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Baole Ni
I find that developers often specify a numeric value when calling a macro
that takes an access-permission parameter. These numeric permission values
have corresponding macros, and using the macros improves the robustness and
readability of the code, so I suggest replacing the numeric parameters with
the macros.

Signed-off-by: Chuansheng Liu 
Signed-off-by: Baole Ni 
---
 drivers/lightnvm/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 160c1a6..8dec614 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1032,7 +1032,7 @@ static const struct kernel_param_ops nvm_configure_by_str_event_param_ops = {
 #define MODULE_PARAM_PREFIX"lnvm."
 
 module_param_cb(configure_debug, &nvm_configure_by_str_event_param_ops, NULL,
-   0644);
+   S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
 
 #endif /* CONFIG_NVM_DEBUG */
 
-- 
2.9.2



Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-02 Thread Jens Axboe

On 08/02/2016 06:58 AM, Jinpu Wang wrote:
> Hi Jens,
>
> I found that in blk_mq_register_disk, we call blk_mq_disable_hotplug, which
> in turn does mutex_lock(&all_q_mutex);
>
>   queue_for_each_hw_ctx(q, hctx, i) {
>   ret = blk_mq_register_hctx(hctx);
>   if (ret)
>   break;  /* if we error out, we call unregister below */
>   }
>
>   if (ret)
>   blk_mq_unregister_disk(disk);
>
> In blk_mq_unregister_disk, we will try to disable hotplug again, which
> leads to a deadlock.
>
> Did I miss anything?


Nope, your analysis looks correct. This should fix it:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a


--
Jens Axboe



[RFC PATCH] blk-mq: Prevent round-robin from scheduling dead cpus

2016-08-02 Thread Gabriel Krisman Bertazi
Hi,

I'm not completely sure I got the cause for this one completely right.
Still, it does look like the correct fix and a good improvement overall,
so I'm making it an RFC for now to gather some feedback.

Let me hear your thoughts.

-- >8 --

When notifying blk-mq about CPU removals while running IO, we risk
racing the hctx->cpumask update with blk_mq_hctx_next_cpu, and end up
scheduling a dead cpu to execute hctx->run_{,delayed_}work.  As a
result, kblockd_schedule_delayed_work_on() may schedule another cpu
outside of hctx->cpumask, which triggers the following warning at
__blk_mq_run_hw_queue:

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

This patch makes the issue much less likely to happen, as it makes
blk_mq_hctx_next_cpu aware of dead cpus and triggers the round-robin
code regardless of the remaining batch processing time.  Thus, if we
offline a cpu in the middle of its batch processing time, we no longer
waste time scheduling on it, and just move on to the next cpu in
the mask.

The warning may still be triggered, though, since this is not the only
case that can cause the queue to schedule on a dead cpu.  But this fixes
the common case: a cpu that dies with batch processing time remaining.

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c27bb37..a2cb64c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -858,7 +858,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
-   if (--hctx->next_cpu_batch <= 0) {
+   if (--hctx->next_cpu_batch <= 0 ||
+   !cpumask_test_cpu(hctx->next_cpu, cpu_online_mask)) {
int cpu = hctx->next_cpu, next_cpu;
 
next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
@@ -868,7 +869,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 
-   return cpu;
+   return (cpumask_test_cpu(cpu, cpu_online_mask)) ?
+   cpu : blk_mq_hctx_next_cpu(hctx);
}
 
return hctx->next_cpu;
-- 
2.7.4
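To make the control flow of the patch easier to follow, here is a hedged userspace model of the proposed blk_mq_hctx_next_cpu logic, using plain bitmasks in place of struct cpumask (all names here are illustrative, not kernel API):

```c
/*
 * Userspace model of the patch's scheduling logic: round-robin over the
 * hctx cpumask, forcing an advance when the current cpu is no longer
 * online even if batch time remains, and recursing past a cpu that died
 * before it could be returned. Plain 32-bit masks stand in for
 * struct cpumask; purely illustrative.
 */
#define BATCH 8

struct hctx_model {
	unsigned int cpumask;      /* cpus that may serve this hw queue */
	unsigned int next_cpu;
	int next_cpu_batch;
};

static int next_set_bit(unsigned int mask, unsigned int from)
{
	for (unsigned int i = from; i < 32; i++)
		if (mask & (1u << i))
			return (int)i;
	return 32;
}

static int model_next_cpu(struct hctx_model *h, unsigned int online)
{
	if (--h->next_cpu_batch <= 0 ||
	    !(online & (1u << h->next_cpu))) {
		int cpu = (int)h->next_cpu;
		int next = next_set_bit(h->cpumask, h->next_cpu + 1);

		if (next >= 32)                     /* wrap around the mask */
			next = next_set_bit(h->cpumask, 0);
		h->next_cpu = (unsigned int)next;
		h->next_cpu_batch = BATCH;

		/* If the cpu we were about to return died, recurse onward. */
		return (online & (1u << cpu)) ? cpu : model_next_cpu(h, online);
	}
	return (int)h->next_cpu;
}
```

The recursion terminates after one extra hop here, since the freshly reset batch count keeps the next call on the fast path.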



Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-02 Thread Hannes Reinecke
On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>
>> Can you please integrate this with Hannes' series so that it uses
>> his cache of the zone information?
> 
> Adding Hannes and Damien to Cc.
> 
> Christoph,
> 
> I can make a patch that marshals Hannes' RB-Tree into a block report, which is
> quite simple. I can even have the open/close/reset zone commands update the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
> not need the RB-Tree to function with zoned media.
> 
> I do still have concerns with the approach which I have shared in smaller
> forums but perhaps I have to bring them to this group.
> 
> First is the memory consumption. This isn't really much of a concern for large
> servers with few drives but I think the embedded NAS market will grumble as
> well as the large data pods trying to stuff 300+ drives in a chassis.
> 
> As of now the RB-Tree needs to hold ~30,000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
> around 3.5 MB per zoned drive attached.
> Which is fine if it is really needed, but most of it is fixed information
> and it can be significantly condensed (I have proposed 8 bytes per zone held
> in an array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is mostly
> out of date because it is updated only after the write completes; zones
> being actively written to must work off of the last location / size that was
> submitted, not completed. The workaround is for that tracking to be handled
> in the private_data member. I am not saying that updating the wp on
> completing a write isn't important, I am saying that the bi_end_io hook is
> the existing hook that works just fine.
> 
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submitting the I/O (on the reasoning that most of the time
I/O will succeed) and re-read the zone information if an I/O failed.
(Which I'll have to do anyway as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which
the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.

> This all tails into domain responsibility. With the RB-Tree doing half of the
> work and the ‘responsible’ domain handling the active path via private_data,
> why have the split at all? It seems to be double work to have a second object
> tracking the first so that I/O scheduling can function.
> 
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O errors
during boot, as e.g. with new drives we inevitably will try to read beyond
the write pointer, which in turn will generate I/O errors on certain drives.
I do agree that there is no strict need to setup an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds
up device initialisation; issuing a blkdev_discard() for the entire
drive will take _ages_ without it.

> Finally, there is the error handling path: when the RB-Tree encounters an
> error it attempts to requery the drive topology, virtually guaranteeing that
> the private_data is now out-of-sync with the RB-Tree. Again this is something
> that can be better encapsulated in the bi_end_io to be informed of the
> failed I/O and schedule the appropriate recovery (including re-querying the
> zone information of the affected zone(s)).
> 
Well, if we have an RB tree with write pointers of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so they need to take care of updating any information in ->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB

Re: Fixup direct bi_rw modifiers

2016-08-02 Thread Christoph Hellwig
On Mon, Aug 01, 2016 at 01:55:36PM -0600, Jens Axboe wrote:
> 
> Set of three patches, where the target one is an actual bug fix...
> Temporary branch, I'll rebase it once -rc1 is out, if more
> changes/fixups need to be made in the next week until that happens.
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.8/bi_rwf

Thanks.  Any chance to have a slightly better name for the field?
E.g. bi_opf or bi_op_flags if we want to be a bit more verbose.


[PATCH 1/1] blkio: update cgroup's document path

2016-08-02 Thread seokhoon.yoon
The cgroup documentation path changed to "cgroup-v1"; update the reference.

Signed-off-by: seokhoon.yoon 
---
 block/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 0363cd7..5665ffe 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -111,7 +111,7 @@ config BLK_DEV_THROTTLING
one needs to mount and use blkio cgroup controller for creating
cgroups and specifying per device IO rate policies.
 
-   See Documentation/cgroups/blkio-controller.txt for more information.
+   See Documentation/cgroup-v1/blkio-controller.txt for more information.
 
 config BLK_CMDLINE_PARSER
bool "Block device command line partition parser"
-- 
1.9.1



[BUG] Deadlock in blk_mq_register_disk error path

2016-08-02 Thread Jinpu Wang
Hi Jens,

I found that in blk_mq_register_disk, we call blk_mq_disable_hotplug, which
in turn does mutex_lock(&all_q_mutex);

  queue_for_each_hw_ctx(q, hctx, i) {
  ret = blk_mq_register_hctx(hctx);
  if (ret)
  break;  /* if we error out, we call unregister below */
  }

  if (ret)
  blk_mq_unregister_disk(disk);

In blk_mq_unregister_disk, we will try to disable hotplug again, which
leads to a deadlock.

Did I miss anything?

-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.


Re: [PATCH linux-4.7-rc7] blk_stack_limits() setting discard_granularity

2016-08-02 Thread Florian-Ewald Müller
Hi Martin,

I totally agree that having a common block layer infrastructure to handle
such discard misfit cases would be better.
But, for now, I do not have a good idea of how to aggregate discard chunks
(< discard_granularity) in the block layer and later issue only a big one
(== discard_granularity) to the underlying block device in a generic and
persistent fashion.

For me, the current handling of discards by the block layer
[blk_stack_limits() + blk_bio_discard_split()] seems to be
inconsistent with the handling of normal (rd/wr) IO.
It makes the life of block driver developers harder, as they cannot
rely on blk_queue_split() doing its job on discard bio's.

Regards,
Florian

On Tue, Aug 2, 2016 at 4:08 AM, Martin K. Petersen wrote:
>> Florian-Ewald Müller  writes:
>
> Florian-Ewald,
>
>> Now my experiments show that, at least, dm-cache doesn't complain nor
>> rejects those smaller discards than its discard_granularity, but
>> possibly turning them into noop (?).
>
> Correct. Anything smaller than (an aligned) multiple of the discard
> granularity will effectively be ignored.
>
> In practice this means that your device should allocate things in
> aligned units of the underlying device's discard granularity.
>
>> May be that the needed functionality of accumulating small discards to
>> a big one covering its own granularity (similar to SSDs block erasure)
>> should be done at that driver level.
>
> Do you allocate blocks in a predictable pattern between your nodes?
>
> For MD RAID0, for instance, we issue many small discard requests. But
> for I/Os that are bigger than the stripe width we'll wrap around and do
> merging so that for instance blocks 0, n, 2*n, 3*n, etc. become part of
> the same discard request sent to the device.
>
> If you want discards smaller than the underlying granularity to have an
> effect then I'm afraid the burden is on you to maintain a bitmap of each
> granularity sized region. And then issue a deferred discard when all
> blocks inside that region have been discarded by the application or
> filesystem above.
>
> If you want to pursue partial block tracking it would be good to come up
> with a common block layer infrastructure for it. dm-thin could benefit
> from it as well...
>
> --
> Martin K. Petersen  Oracle Linux Engineering
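Martin's bitmap suggestion can be sketched as follows; this is a hypothetical userspace illustration (names and layout invented), tracking one bit per block within a granularity-sized region and signalling when a deferred full-granularity discard can be issued:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the bitmap scheme Martin describes: one bit
 * per logical block inside each discard_granularity-sized region.
 * Small discards are recorded; once every block in a region has been
 * discarded, the caller can issue one deferred full-granularity discard
 * to the underlying device. Names and layout are invented.
 */
#define BLOCKS_PER_REGION 64	/* discard_granularity / logical block size */

struct region {
	uint64_t discarded;	/* bit b set => block b has been discarded */
};

/* Record a sub-granularity discard; returns 1 when the region is complete. */
static int region_discard(struct region *r, unsigned int first, unsigned int nr)
{
	for (unsigned int b = first; b < first + nr && b < BLOCKS_PER_REGION; b++)
		r->discarded |= 1ULL << b;
	return r->discarded == ~0ULL;	/* all blocks in the region discarded */
}
```

The bitmap would need to be persisted alongside the driver's own metadata for the "persistent fashion" Florian asks about; that part is out of scope here.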



-- 
Florian-Ewald Mueller

Architecture Board


ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


Tel:   +49 30 577 008 331
Fax:  +49 30 577 008 598
Email:   florian-ewald.muel...@profitbricks.com
URL: http://www.profitbricks.de

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.
