Re: [PATCH 01/19] bcache: Fix leak of bdev reference

2017-09-05 Thread Coly Li
On 2017/9/5 2:43 PM, Christoph Hellwig wrote:
> On Tue, Sep 05, 2017 at 01:30:04AM +0800, Coly Li wrote:
>>
>> When you mentioned "whole chunk of code", do you mean the following
>> block of code ?
>>
>>
>> 1960         if (IS_ERR(bdev)) {
>> ====== start of whole chunk of code ======
>> 1961                 if (bdev == ERR_PTR(-EBUSY)) {
>> 1962                         bdev = lookup_bdev(strim(path));
>> 1963                         mutex_lock(&bch_register_lock);
>> 1964                         if (!IS_ERR(bdev) && bch_is_open(bdev))
>> 1965                                 err = "device already registered";
>> 1966                         else
>> 1967                                 err = "device busy";
>> 1968                         mutex_unlock(&bch_register_lock);
>> 1969                         if (!IS_ERR(bdev))
>> 1970                                 bdput(bdev);
>> 1971                         if (attr == &ksysfs_register_quiet)
>> 1972                                 goto out;
>> 1973                 }
>> ====== end of whole chunk of code ======
>> 1974                 goto err;
>> 1975         }
>>
>> I don't mind removing it; I just want to double-check that I haven't
>> misunderstood what you meant.
> 
> Yes, that's the problematic block.
> 
Hi Christoph,

I tested the code, both with and without my patch that removes the above
block. The result is that I feel the above chunk of code is useful: when
I tried to register a device that was already registered, without the
above code I only saw "failed to open device" and didn't realize it was
because the device was registered already. After adding the code back, I
knew where the problem was.

The above chunk of code improves the user experience by providing more
detailed diagnostic information, so it is useful. I therefore suggest we
keep the code here and pick up Jan's patch.

Thanks.

-- 
Coly Li


Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-05 Thread Joseph Qi
Hi Shaohua,

On 17/9/6 05:02, Shaohua Li wrote:
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>>
>>> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
>>>
>>> Hi,
>>>
>>> cgroup still lacks a good I/O controller. CFQ works well for hard
>>> disks, but not so much for SSDs. This patch set tries to add a
>>> conservative limit to blk-throttle. It isn't proportional scheduling,
>>> but it can help prioritize cgroups. There are several advantages to
>>> choosing blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both
>>>   bio-based and request-based queues.
>>> - blk-throttle is lightweight in general. It still takes the queue
>>>   lock, but it's not hard to implement a per-cpu cache and remove the
>>>   lock contention.
>>> - blk-throttle doesn't use the 'idle disk' mechanism, which is used
>>>   by CFQ/BFQ. That mechanism is proven to harm performance on fast
>>>   SSDs.
>>>
>>> The patch set adds a new io.low limit to blk-throttle. It's only for
>>> cgroup2. The existing io.max is hard-limit throttling: a cgroup with
>>> a max limit never dispatches more IO than its max limit. io.low, by
>>> contrast, is best-effort throttling: cgroups with a 'low' limit can
>>> run above their 'low' limit at appropriate times. Specifically, if
>>> all cgroups reach their 'low' limit, all cgroups can run above their
>>> 'low' limit. If any cgroup runs under its 'low' limit, all other
>>> cgroups will run according to their 'low' limit. So the 'low' limit
>>> plays two roles: it lets cgroups use free bandwidth, and it protects
>>> cgroups up to their 'low' limit.
>>>
>>> An example usage is a high-prio cgroup with a high 'low' limit and a
>>> low-prio cgroup with a low 'low' limit. If the high-prio cgroup isn't
>>> running, the low-prio cgroup can run above its 'low' limit, so we
>>> don't waste bandwidth. When the high-prio cgroup runs and is below
>>> its 'low' limit, the low-prio cgroup will run under its 'low' limit.
>>> This protects the high-prio cgroup's access to resources.
>>>
>>
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a long
>>   time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>>
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free to use all the bandwidth.
>>
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, are there maybe other, better mechanisms to saturate
>> the bandwidth in the above scenario?
> 
> I assume that's answered by 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth gradually, though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device's max bandwidth varies based on
> request size and read/write ratio. We don't know when the max bandwidth is
> reached. Also, I think we must consider the case where the workloads never
> use the full bandwidth of a disk, which is pretty common for SSDs (at least
> in our environment).
> 
I have a question on the base latency tracking.
From my tests on SSD, write latency is much lower than read latency when
doing mixed read/write, but currently we only track read requests and use
their average as the base latency. In other words, we don't distinguish
reads and writes now. As a result, every write request's latency will
always be considered good. So I think we have to track read and write
latency separately. Or 
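
For illustration, tracking the two directions separately could look like
the following self-contained sketch (hypothetical structure and names,
EWMA weight chosen arbitrarily; this is not the blk-throttle code):

/*
 * Hypothetical sketch of per-direction latency tracking: keep one
 * exponentially weighted moving average (EWMA) per direction instead
 * of judging every request against the read average.
 */
#include <stdio.h>

enum op_dir { OP_READ, OP_WRITE };

struct lat_track {
	double avg[2];	/* per-direction EWMA baseline, in usec */
	long samples[2];
};

static void lat_sample(struct lat_track *t, enum op_dir dir, double usec)
{
	if (t->samples[dir]++ == 0)
		t->avg[dir] = usec;	/* first sample seeds the average */
	else
		t->avg[dir] += (usec - t->avg[dir]) / 8; /* 1/8 weight */
}

static int lat_is_good(struct lat_track *t, enum op_dir dir, double usec)
{
	/* compare against the matching direction's baseline, so a write
	 * is no longer judged against the (higher) read average */
	return usec <= 2 * t->avg[dir];
}

int main(void)
{
	struct lat_track t = { { 0.0, 0.0 }, { 0, 0 } };

	lat_sample(&t, OP_READ, 400.0);
	lat_sample(&t, OP_WRITE, 50.0);
	printf("read avg %.1f us, write avg %.1f us\n",
	       t.avg[OP_READ], t.avg[OP_WRITE]);
	printf("is a 200 us write good? %d\n",
	       lat_is_good(&t, OP_WRITE, 200.0));
	return 0;
}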

Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Omar Sandoval
On Wed, Sep 06, 2017 at 09:24:49AM +1000, Dave Chinner wrote:
> On Tue, Sep 05, 2017 at 03:18:51PM -0700, Omar Sandoval wrote:
> > On Wed, Sep 06, 2017 at 08:06:31AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 05, 2017 at 09:00:26AM -0600, Jens Axboe wrote:
> > > > On 09/05/2017 01:37 AM, Chandan Rajendra wrote:
> > > > > On Tuesday, September 5, 2017 12:12:08 PM IST Omar Sandoval wrote:
> > > > >> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
> > > > >>> On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
> > > >  Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
> > > >  physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
> > > >  physical sector size of loop devices. On ppc64, this causes loop
> > > >  devices to have 64k as the physical sector size.
> > > > >>>
> > > > >>> Eek.  We'll need to revert the loop change ASAP!
> > > > >>
> > > > >> Most annoying patch series ever. It hasn't made it to Linus' tree yet,
> > > > >> right? We can revert (although the later change depends on that), fold
> > > > >> in a fix, or apply a fix on top of it, whatever Jens prefers.
> > > > >>
> > > > >>
> > > > > 
> > > > > My bad. 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 is the commit id from
> > > > > Linux-next. I don't see this commit in Linus's git tree.
> > > > 
> > > > Right, it's only queued up, and scheduled for the 2nd part of the
> > > > block changes for 4.14. It should have been PAGE_CACHE_SIZE, but
> > > > we don't have that anymore...
> > > 
> > > But PAGE_CACHE_SIZE was equal to PAGE_SIZE, so that would have been
> > > wrong, too.
> > > 
> > > I just don't see why this is necessary, given that buffered IO
> > > through the upper filesystem will already be doing page
> > > sized/aligned IO where possible (because that's what the page cache
> > > does!). And for direct IO the loop device should just export the
> > > underlying host filesystem logical/physical sector sizes, which
> > > should be optimal for the backing storage to begin with.
> > > 
> > > Someone want to enlighten me as to what problem is being solved
> > > here?
> > 
> > I already sent a patch to fix this:
> > https://marc.info/?l=linux-block&m=150464670510301&w=2
> > 
> > Loop was using the default physical block size of 512, which is a lie
> > according to the definition of the physical block size:
> 
> Loop devices are special. They /have to lie/ because they are the
> only mechanism we have for mounting image files with sector sizes
> smaller than the host storage sizes.  i.e. loop devices require
> compatibility and flexibility first and so need to default to 
> "most compatible" behaviour, not "most performant".

And that's why the logical block size is 512 by default. The physical
block size doesn't have any bearing on what you can do with the device;
it's a _hint_ for the application. But again, I'm just being overly
pedantic; I should've checked how tools were actually using the physical
block size.
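
For reference, this is how an application would read both values, using
the long-standing BLKSSZGET (logical) and BLKPBSZGET (physical) ioctls;
the loop device path is only an example:

/* Read a block device's logical and physical block sizes the way
 * userspace tools do. BLKSSZGET takes an int, BLKPBSZGET an unsigned
 * int, both from <linux/fs.h>. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/loop0";
	int logical = 0;
	unsigned int physical = 0;
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKSSZGET, &logical) < 0 ||
	    ioctl(fd, BLKPBSZGET, &physical) < 0) {
		perror("ioctl");
		close(fd);
		return 1;
	}
	/* mkfs tools read exactly these hints when picking a sector size */
	printf("%s: logical %d bytes, physical %u bytes\n",
	       dev, logical, physical);
	close(fd);
	return 0;
}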

> If you're really "sick of these stupid changes" then perhaps you
> should consider cc'ing linux-fsdevel for loop device changes so
> filesystem developers have a chance to catch these filesystem side
> problems in new loopdev code before it's merged anywhere

Yup, loop changes haven't been going outside of linux-block but at least
this one should have gone to fsdevel, sorry.

> FWIW, if you're going to set some kind of optimal default for the
> loop devices then - like all other stacked block devices - the
> defaults need to be pulled from the underlying storage (i.e. the
> filesystem that hosts the file) rather than making them up out of
> thin air.

Not worth the trouble; I'll just leave this alone for now.
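
(For reference, pulling the defaults from the underlying storage, as Dave
suggests, might look roughly like the sketch below - untested and purely
illustrative, assuming the backing file sits on a filesystem with a real
block device; inherit_backing_block_sizes() is a hypothetical helper, not
an existing function:)

/* Untested sketch, kernel context (<linux/fs.h>, <linux/blkdev.h>):
 * inherit block sizes from the device backing the loop file's
 * filesystem instead of making values up out of thin air. */
static void inherit_backing_block_sizes(struct request_queue *q,
					struct file *backing_file)
{
	struct super_block *sb = file_inode(backing_file)->i_sb;

	if (sb->s_bdev) {
		/* mirror the host filesystem's device */
		blk_queue_logical_block_size(q,
				bdev_logical_block_size(sb->s_bdev));
		blk_queue_physical_block_size(q,
				bdev_physical_block_size(sb->s_bdev));
		blk_queue_io_min(q, bdev_physical_block_size(sb->s_bdev));
	}
}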


Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Omar Sandoval
On Wed, Sep 06, 2017 at 08:06:31AM +1000, Dave Chinner wrote:
> On Tue, Sep 05, 2017 at 09:00:26AM -0600, Jens Axboe wrote:
> > On 09/05/2017 01:37 AM, Chandan Rajendra wrote:
> > > On Tuesday, September 5, 2017 12:12:08 PM IST Omar Sandoval wrote:
> > >> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
> > >>> On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
> >  Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
> >  physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
> >  physical sector size of loop devices. On ppc64, this causes loop
> >  devices to have 64k as the physical sector size.
> > >>>
> > >>> Eek.  We'll need to revert the loop change ASAP!
> > >>
> > >> Most annoying patch series ever. It hasn't made it to Linus' tree yet,
> > >> right? We can revert (although the later change depends on that), fold
> > >> in a fix, or apply a fix on top of it, whatever Jens prefers.
> > >>
> > >>
> > > 
> > > My bad. 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 is the commit id from 
> > > Linux-next. I don't see this commit in Linus's git tree.
> > 
> > Right, it's only queued up, and scheduled for the 2nd part of the
> > block changes for 4.14. It should have been PAGE_CACHE_SIZE, but
> > we don't have that anymore...
> 
> But PAGE_CACHE_SIZE was equal to PAGE_SIZE, so that would have been
> wrong, too.
> 
> I just don't see why this is necessary, given that buffered IO
> through the upper filesystem will already be doing page
> sized/aligned IO where possible (because that's what the page cache
> does!). And for direct IO the loop device should just export the
> underlying host filesystem logical/physical sector sizes, which
> should be optimal for the backing storage to begin with.
> 
> Someone want to enlighten me as to what problem is being solved
> here?

I already sent a patch to fix this:
https://marc.info/?l=linux-block&m=150464670510301&w=2

Loop was using the default physical block size of 512, which is a lie
according to the definition of the physical block size:

/**
 * blk_queue_physical_block_size - set physical block size for the queue
 * @q:  the request queue for the device
 * @size:  the physical block size, in bytes
 *
 * Description:
 *   This should be set to the lowest possible sector size that the
 *   hardware can operate on without reverting to read-modify-write
 *   operations.
 */

Clearly for buffered loop devices, this unit is a page. The change was
pedantic, so whatever, my patch above makes things backwards compatible.


Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Eric Sandeen


On 9/5/17 5:10 PM, Dave Chinner wrote:
> On Tue, Sep 05, 2017 at 09:17:42AM -0500, Eric Sandeen wrote:
>>
>>
>> On 9/5/17 1:44 AM, Dave Chinner wrote:
>>> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
 On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
> Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
> physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
> physical sector size of loop devices. On ppc64, this causes loop devices
> to have 64k as the physical sector size.

 Eek.  We'll need to revert the loop change ASAP!
>>>
>>> And, FWIW, making the warning go away is probably a bad idea,
>>> because XFS only supports devices with sector sizes up
>>> to 32k:
>>
>> Well, TBH removing this warning was my suggestion, because it's
>> automatically fixing values that weren't specified by the user in
>> the first place.  First preference is the physical sector size, with
>> fallback to the logical size, but it doesn't need to be noisy about it.
>>
>>> #define XFS_MIN_SECTORSIZE_LOG  9   /* i.e. 512 bytes */
>>> #define XFS_MAX_SECTORSIZE_LOG  15  /* i.e. 32768 bytes */
>>>
>>> And so it should be warning about devices trying to tell it to use
>>> something larger
>>
>> As long as the logical sector size is small enough, it seems like a
>> silent adjustment is probably ok, no?
> 
> Think 512e drives. Doing 512 byte sector IO is possible, but slow.
> Someone might actually want to avoid that by having the filesystem
> use 4k sector sizes. However, if for some reason, mkfs selects 512
> byte sectors (the logical size) rather than 4k sector size, then
> shouldn't we be telling the user we're doing something that has a
> "for-the-life-of-the-filesystem" performance impact?

Well, sure, but it'll only select 512 if it /has/ to, i.e. if the
block size is < 4k.

So for the simple case of a 512e drive, our default block size is
4k, physical size is 4k, and everything is happy and fine.

If the user /specifies/ a 1k block size on such a device, how much
of a nanny do we really want to be about telling them this is suboptimal?

There are a lot of suboptimal things you can specify on the
mkfs commandline, but we don't generally choose to warn about them...

-Eric

> Cheers,
> 
> Dave.
> 


Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Dave Chinner
On Tue, Sep 05, 2017 at 09:00:26AM -0600, Jens Axboe wrote:
> On 09/05/2017 01:37 AM, Chandan Rajendra wrote:
> > On Tuesday, September 5, 2017 12:12:08 PM IST Omar Sandoval wrote:
> >> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
> >>> On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
>  Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
>  physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
>  physical sector size of loop devices. On ppc64, this causes loop devices
>  to have 64k as the physical sector size.
> >>>
> >>> Eek.  We'll need to revert the loop change ASAP!
> >>
> >> Most annoying patch series ever. It hasn't made it to Linus' tree yet,
> >> right? We can revert (although the later change depends on that), fold
> >> in a fix, or apply a fix on top of it, whatever Jens prefers.
> >>
> >>
> > 
> > My bad. 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 is the commit id from 
> > Linux-next. I don't see this commit in Linus's git tree.
> 
> Right, it's only queued up, and scheduled for the 2nd part of the
> block changes for 4.14. It should have been PAGE_CACHE_SIZE, but
> we don't have that anymore...

But PAGE_CACHE_SIZE was equal to PAGE_SIZE, so that would have been
wrong, too.

I just don't see why this is necessary, given that buffered IO
through the upper filesystem will already be doing page
sized/aligned IO where possible (because that's what the page cache
does!). And for direct IO the loop device should just export the
underlying host filesystem logical/physical sector sizes, which
should be optimal for the backing storage to begin with.

Someone want to enlighten me as to what problem is being solved
here?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] loop: set physical block size to logical block size

2017-09-05 Thread Omar Sandoval
From: Omar Sandoval 

Commit 6c6b6f28b333 ("loop: set physical block size to PAGE_SIZE")
caused mkfs.xfs to barf on ppc64 [1]. Always using PAGE_SIZE as the
physical block size still makes the most sense semantically, but let's
just lie and always set it to the same value as the logical block size
(same goes for io_min). In the future we might want to at least bump up
io_min to PAGE_SIZE but I'm sick of these stupid changes so let's play
it safe.

1: https://marc.info/?l=linux-xfs&m=150459024723753&w=2

Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 78c47c4b584d..85de67334695 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1036,6 +1036,8 @@ static int loop_clr_fd(struct loop_device *lo)
 	memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
 	memset(lo->lo_file_name, 0, LO_NAME_SIZE);
 	blk_queue_logical_block_size(lo->lo_queue, 512);
+	blk_queue_physical_block_size(lo->lo_queue, 512);
+	blk_queue_io_min(lo->lo_queue, 512);
 	if (bdev) {
 		bdput(bdev);
 		invalidate_bdev(bdev);
@@ -1330,6 +1332,8 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 	blk_mq_freeze_queue(lo->lo_queue);
 
 	blk_queue_logical_block_size(lo->lo_queue, arg);
+	blk_queue_physical_block_size(lo->lo_queue, arg);
+	blk_queue_io_min(lo->lo_queue, arg);
 	loop_update_dio(lo);
 
 	blk_mq_unfreeze_queue(lo->lo_queue);
@@ -1777,8 +1781,6 @@ static int loop_add(struct loop_device **l, int i)
 	}
 	lo->lo_queue->queuedata = lo;
 
-	blk_queue_physical_block_size(lo->lo_queue, PAGE_SIZE);
-
 	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
 
 	/*
-- 
2.14.1
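
For anyone who wants to exercise loop_set_block_size() from userspace, a
minimal sketch using the LOOP_SET_BLOCK_SIZE ioctl added earlier in this
series (assumes an already-attached /dev/loop0; the fallback define is
only for headers that predate the series):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

#ifndef LOOP_SET_BLOCK_SIZE
#define LOOP_SET_BLOCK_SIZE 0x4C09	/* from this patch series */
#endif

int main(void)
{
	int fd = open("/dev/loop0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* with the patch above, this sets the logical and physical block
	 * sizes and io_min to 4096 in one go */
	if (ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096UL) < 0) {
		perror("LOOP_SET_BLOCK_SIZE");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}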



Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-05 Thread Shaohua Li
On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> 
> > On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> > 
> > Hi,
> > 
> > cgroup still lacks a good I/O controller. CFQ works well for hard
> > disks, but not so much for SSDs. This patch set tries to add a
> > conservative limit to blk-throttle. It isn't proportional scheduling,
> > but it can help prioritize cgroups. There are several advantages to
> > choosing blk-throttle:
> > - blk-throttle resides early in the block stack. It works for both
> >   bio-based and request-based queues.
> > - blk-throttle is lightweight in general. It still takes the queue
> >   lock, but it's not hard to implement a per-cpu cache and remove the
> >   lock contention.
> > - blk-throttle doesn't use the 'idle disk' mechanism, which is used
> >   by CFQ/BFQ. That mechanism is proven to harm performance on fast
> >   SSDs.
> > 
> > The patch set adds a new io.low limit to blk-throttle. It's only for
> > cgroup2. The existing io.max is hard-limit throttling: a cgroup with
> > a max limit never dispatches more IO than its max limit. io.low, by
> > contrast, is best-effort throttling: cgroups with a 'low' limit can
> > run above their 'low' limit at appropriate times. Specifically, if
> > all cgroups reach their 'low' limit, all cgroups can run above their
> > 'low' limit. If any cgroup runs under its 'low' limit, all other
> > cgroups will run according to their 'low' limit. So the 'low' limit
> > plays two roles: it lets cgroups use free bandwidth, and it protects
> > cgroups up to their 'low' limit.
> > 
> > An example usage is a high-prio cgroup with a high 'low' limit and a
> > low-prio cgroup with a low 'low' limit. If the high-prio cgroup isn't
> > running, the low-prio cgroup can run above its 'low' limit, so we
> > don't waste bandwidth. When the high-prio cgroup runs and is below
> > its 'low' limit, the low-prio cgroup will run under its 'low' limit.
> > This protects the high-prio cgroup's access to resources.
> > 
> 
> Hi Shaohua,

Hi,

Sorry for the late response.
> I would like to ask you some questions, to make sure I fully
> understand how the 'low' limit and the idle-group detection work in
> your above scenario.  Suppose that: the drive has a random-I/O peak
> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> the low prio group has a 'low' limit of 10 MB/s.  If
> - the high prio process happens to do, say, only 5 MB/s for a long
>   time
> - the low prio process constantly does greedy I/O
> - the idle-group detection is not being used
> then the low prio process is limited to 10 MB/s during all this time
> interval.  And only 10% of the device bandwidth is utilized.
> 
> To recover lost bandwidth through idle-group detection, we need to set
> a target IO latency for the high-prio group.  The high prio group
> should happen to be below the threshold, and thus to be detected as
> idle, leaving the low prio group free to use all the bandwidth.
> 
> Here are my questions:
> 1) Is all I wrote above correct?

Yes
> 2) In particular, are there maybe other, better mechanisms to saturate
> the bandwidth in the above scenario?

I assume that's answered by 4) below.
> If what I wrote above is correct:
> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> full bandwidth, the latency threshold of the high prio group may be
> overcome, causing the high prio group to not be considered idle any
> longer, and thus the low prio group to be limited again; this in turn
> will cause the threshold to not be overcome any longer, and so on.

That's true. We try to mitigate the fluctuation by increasing the low prio
cgroup bandwidth gradually, though.

> 4) Is there a way to compute an appropriate target latency of the high
> prio group, if it is a generic group, for which the latency
> requirements of the processes it contains are only partially known or
> completely unknown?  By appropriate target latency, I mean a target
> latency that enables the framework to fully utilize the device
> bandwidth while the high prio group is doing less I/O than its limit.

Not sure how we can do this. The device's max bandwidth varies based on
request size and read/write ratio. We don't know when the max bandwidth is
reached. Also, I think we must consider the case where the workloads never
use the full bandwidth of a disk, which is pretty common for SSDs (at least
in our environment).

Thanks,
Shaohua
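
As a concrete version of Paolo's scenario, configuring the proposed
limits from userspace might look like the sketch below (cgroup2 with
this series applied; the key=value syntax mirrors io.max, and the
rbps/latency/idle keys and values are assumptions, not a documented
interface):

/* Hypothetical io.low setup for a high-prio (90 MB/s) and a low-prio
 * (10 MB/s) cgroup on device 8:16; paths and keys are assumptions. */
#include <stdio.h>

static int set_io_low(const char *cgroup, const char *line)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/io.low", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", line);
	return fclose(f);
}

int main(void)
{
	if (set_io_low("high-prio", "8:16 rbps=94371840 latency=2000 idle=1000"))
		perror("high-prio");
	if (set_io_low("low-prio", "8:16 rbps=10485760"))
		perror("low-prio");
	return 0;
}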


Re: [PATCH V7 00/10] mmc: Add Command Queue support

2017-09-05 Thread Ulf Hansson
On 5 September 2017 at 10:10, Adrian Hunter  wrote:
> On 05/09/17 10:24, Ulf Hansson wrote:
>> [...]
>>
>>>>>
>>>>> I can send blk-mq support for legacy requests in a few days if you
>>>>> like, but I want to hear a better explanation of why you are
>>>>> delaying CQE support.
>>>>
>>>> That would be very nice; however, be aware that we are in the merge
>>>> window, so I am not picking up new material for 4.14 from this point.
>>>> I assume you understand why.
>>>
>>> Nope.  This is new functionality - doesn't affect anyone who doesn't have a
>>> command queue engine.  Next to no chance of regressions.  Tested by several
>>> in the community.  Substantially unchanged since February.  It is not even
>>> very much code in the block driver.
>>
>> Let me make it clear, once more - I don't want to maintain more hacks
>> in mmc block layer code.
>>
>> This series adds blk-mq support, using a method (which may be considered
>> intermediate) via a new change in patch 1 - but only for the new CQE
>> path. That means the old legacy mmc block path is still there. So, for
>> the reason stated above - no thanks!
>
> And where is your alternative?  When I pointed out that you need a way to
> arbitrate between internal partitions, you went silent again.
>
> "Can't have CQE without blk-mq, but can't have blk-mq because you don't
> understand it" is hardly acceptable.

Adrian, this discussion seems to lead nowhere. Can we please stop and
be constructive instead!

Regarding the arbitration issue: we have been moving forward,
refactoring the mmc block driver code, and soon also solving the problem
for the rpmb internal partition [1]. Maybe the background to why Linus
is working on the mmc block refactoring hasn't been entirely clear;
anyway, it's exactly about moving closer to addressing these issues.

Even if the problems certainly become a step harder to resolve for
the boot and general-purpose partitions, it's still a path we
should try to find a solution for. Yeah, that may mean we need to
suggest changes to the generic block layer, to teach it to better
deal with these kinds of devices.

Finally, I have never said the arbitration issue *must* be solved
before converting to blk-mq - only that we should avoid performance
regressions, which of course applies to whatever changes we make.

[...]

Kind regards
Uffe

[1]
https://patchwork.kernel.org/patch/9911463/


Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Jens Axboe
On 09/05/2017 01:37 AM, Chandan Rajendra wrote:
> On Tuesday, September 5, 2017 12:12:08 PM IST Omar Sandoval wrote:
>> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
>>> On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
 Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
 physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
 physical sector size of loop devices. On ppc64, this causes loop devices
 to have 64k as the physical sector size.
>>>
>>> Eek.  We'll need to revert the loop change ASAP!
>>
>> Most annoying patch series ever. It hasn't made it to Linus' tree yet,
>> right? We can revert (although the later change depends on that), fold
>> in a fix, or apply a fix on top of it, whatever Jens prefers.
>>
>>
> 
> My bad. 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 is the commit id from 
> Linux-next. I don't see this commit in Linus's git tree.

Right, it's only queued up, and scheduled for the 2nd part of the
block changes for 4.14. It should have been PAGE_CACHE_SIZE, but
we don't have that anymore...

Omar, are you sending a patch to fix this up?

-- 
Jens Axboe



[GIT PULL] First set of block changes for 4.14-rc1

2017-09-05 Thread Jens Axboe
Hi Linus,

This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This pull request contains:

- Fix for a registration race in loop, from Anton Volkov.

- Overflow complaint fix from Arnd for DAC960.

- Series of drbd changes from the usual suspects.

- Conversion of the stec/skd driver to blk-mq. From Bart.

- A few BFQ improvements/fixes from Paolo.

- CFQ improvement from Ritesh, allowing idling for group idle.

- A few fixes found by Dan's smatch, courtesy of Dan.

- A warning fixup for a race between changing the IO scheduler and
  device removal. From David Jeffery.

- A few nbd fixes from Josef.

- Support for cgroup info in blktrace, from Shaohua.

- Also from Shaohua, new features in the null_blk driver to allow it to
  actually hold data, among other things.

- Various corner cases and error handling fixes from Weiping Zhang.

- Improvements to the IO stats tracking for blk-mq from me. Can
  drastically improve performance for fast devices and/or big machines.

- Series from Christoph removing bi_bdev as being needed for IO
  submission, in preparation for nvme multipathing code.

- Series from Bart, including various cleanups and fixes for switch fall
  through case complaints.

Note that you'll hit a conflict in block/bio-integrity.c and
mm/page_io.c, since we had fixes later in the 4.13 series for both of
those that ended up conflicting with new changes. Both are trivial to
fix up, I've included my resolution at the end of this email.

Please pull!


  git://git.kernel.dk/linux-block.git for-4.14/block



Anton Volkov (1):
  loop: fix to a race condition due to the early registration of device

Arnd Bergmann (1):
  block: DAC960: shut up format-overflow warning

Baoyou Xie (1):
  drbd: mark symbols static where possible

Bart Van Assche (78):
  block: Fix two comments that refer to .queue_rq() return values
  block: Unexport blk_queue_end_tag()
  blk-mq: Make blk_mq_reinit_tagset() calls easier to read
  blk-mq-debugfs: Declare a local symbol static
  genhd: Annotate all part and part_tbl pointer dereferences
  ide-floppy: Use blk_rq_is_scsi()
  virtio_blk: Use blk_rq_is_scsi()
  xen-blkback: Fix indentation
  xen-blkback: Avoid that gcc 7 warns about fall-through when building with 
W=1
  xen-blkfront: Avoid that gcc 7 warns about fall-through when building 
with W=1
  block: Relax a check in blk_start_queue()
  skd: Avoid that module unloading triggers a use-after-free
  skd: Submit requests to firmware before triggering the doorbell
  skd: Switch to GPLv2
  skd: Update maintainer information
  skd: Remove unneeded #include directives
  skd: Remove ESXi code
  skd: Remove unnecessary blank lines
  skd: Avoid that gcc 7 warns about fall-through when building with W=1
  skd: Fix spelling in a source code comment
  skd: Fix a function name in a comment
  skd: Remove set-but-not-used local variables
  skd: Remove a set-but-not-used variable from struct skd_device
  skd: Remove useless barrier() calls
  skd: Switch from the pr_*() to the dev_*() logging functions
  skd: Fix endianness annotations
  skd: Document locking assumptions
  skd: Introduce the symbolic constant SKD_MAX_REQ_PER_MSG
  skd: Introduce SKD_SKCOMP_SIZE
  skd: Fix size argument in skd_free_skcomp()
  skd: Reorder the code in skd_process_request()
  skd: Simplify the code for deciding whether or not to send a FIT msg
  skd: Simplify the code for allocating DMA message buffers
  skd: Use a structure instead of hardcoding structure offsets
  skd: Check structure sizes at build time
  skd: Use __packed only when needed
  skd: Make the skd_isr() code more brief
  skd: Use ARRAY_SIZE() where appropriate
  skd: Simplify the code for handling data direction
  skd: Remove superfluous initializations from skd_isr_completion_posted()
  skd: Drop second argument of skd_recover_requests()
  skd: Use for_each_sg()
  skd: Remove a redundant init_timer() call
  skd: Remove superfluous occurrences of the 'volatile' keyword
  skd: Use kcalloc() instead of kzalloc() with multiply
  skd: Use symbolic names for SCSI opcodes
  skd: Move a function definition
  skd: Rework request failing code path
  skd: Convert explicit skd_request_fn() calls
  skd: Remove SG IO support
  skd: Remove dead code
  skd: Initialize skd_special_context.req.n_sg to one
  skd: Enable request tags for the block layer queue
  skd: Convert several per-device scalar variables into atomics
  skd: Introduce skd_process_request()
  skd: Split skd_recover_requests()
  skd: Move skd_free_sg_list() up
  skd: Coalesce struct request and struct skd_request_context

Archiving linux-block

2017-09-05 Thread Bart Van Assche
Hello,

Since gmane.org is no longer operational, linux-block messages are no longer
archived. I think it's useful to archive linux-block messages because it
makes it possible to review past discussions, especially for people who are
not subscribed to the list. How about activating archival by mail-archive.com
and spinics.net? As far as I know this can only be done by someone who is a
list administrator, but I don't know who the linux-block list administrators are.

Thanks,

Bart.

Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Eric Sandeen


On 9/5/17 1:44 AM, Dave Chinner wrote:
> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
>> On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
>>> Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
>>> physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
>>> physical sector size of loop devices. On ppc64, this causes loop devices
>>> to have 64k as the physical sector size.
>>
>> Eek.  We'll need to revert the loop change ASAP!
> 
> And, FWIW, making the warning go away is probably a bad idea,
> because XFS only supports devices with sector sizes up
> to 32k:

Well, TBH removing this warning was my suggestion, because it's
automatically fixing values that weren't specified by the user in
the first place.  First preference is the physical sector size, with
fallback to the logical size, but it doesn't need to be noisy about it.

> #define XFS_MIN_SECTORSIZE_LOG  9   /* i.e. 512 bytes */
> #define XFS_MAX_SECTORSIZE_LOG  15  /* i.e. 32768 bytes */
> 
> And so it should be warning about devices trying to tell it to use
> something larger

As long as the logical sector size is small enough, it seems like a
silent adjustment is probably ok, no?
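
(For illustration, the policy under discussion amounts to roughly the
following sketch - not the actual mkfs.xfs code:)

/* Simplified sketch of the sector-size fallback: prefer the physical
 * size, silently fall back to the logical size when the physical size
 * is outside what XFS supports (e.g. a 64k loop device on ppc64). */
#define XFS_MIN_SECTORSIZE	(1 << 9)	/* 512 bytes */
#define XFS_MAX_SECTORSIZE	(1 << 15)	/* 32768 bytes */

static int pick_sector_size(int logical, int physical)
{
	if (physical >= XFS_MIN_SECTORSIZE && physical <= XFS_MAX_SECTORSIZE)
		return physical;
	return logical;
}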

-Eric

> Cheers,
> 
> Dave.
> 


Re: BFQ + dm-mpath

2017-09-05 Thread Bart Van Assche
On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
> Ok, my suspicions seem confirmed: the path dm_mq_queue_rq -> map_request
> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
> invoking e->type->ops.mq.prepare_request for the target elevator e.
> The cloned request is therefore not initialized for the scheduler, but
> it is nevertheless inserted into the scheduler by
> blk_mq_sched_insert_request.  This seems to be an error for any scheduler
> that needs to initialize fields in the incoming request, or in general
> needs to take some preliminary action.
> 
> Am I missing something here?

(+Mike Snitzer)

Mike, do you perhaps have the time to look into this memory leak?

Thanks,

Bart.

Re: [PATCH V7 00/10] mmc: Add Command Queue support

2017-09-05 Thread Adrian Hunter
On 05/09/17 10:24, Ulf Hansson wrote:
> [...]
> 

>>>> I can send blk-mq support for legacy requests in a few days if you
>>>> like, but I want to hear a better explanation of why you are
>>>> delaying CQE support.
>>>
>>> That would be very nice; however, be aware that we are in the merge
>>> window, so I am not picking up new material for 4.14 from this point.
>>> I assume you understand why.
>>
>> Nope.  This is new functionality - doesn't affect anyone who doesn't have a
>> command queue engine.  Next to no chance of regressions.  Tested by several
>> in the community.  Substantially unchanged since February.  It is not even
>> very much code in the block driver.
> 
> Let me make it clear, once more - I don't want to maintain more hacks
> in mmc block layer code.
> 
> This series adds blk-mq support, using a method (which may be considered
> intermediate) via a new change in patch 1 - but only for the new CQE
> path. That means the old legacy mmc block path is still there. So, for
> the reason stated above - no thanks!

And where is your alternative?  When I pointed out that you need a way to
arbitrate between internal partitions, you went silent again.

"Can't have CQE without blk-mq, but can't have blk-mq because you don't
understand it" is hardly acceptable.

> 
>>
>>>
>>> Still, big changes are always nice to queue up early in a release
>>> cycle. Let's aim for that!
>>
>> You said that in February.  Never happened.  You said you wanted blk-mq, so
>> I waited to re-base on top, but it never appeared.
> 
> Yes, I want blk-mq - and I believe I have explained why several times by now.
>
> Unfortunately, blk-mq support just doesn't appear by itself; we have to
> work on it - together.

If we are working on it together, how come you have never taken the time to
find out how blk-mq works?

> 
>>
>>> Moreover, I am not delaying CQE, but really want it to be merged asap!
>>> However, I also have the role of a maintainer and the duties that
>>> come with it. For example, I would like the community to reach
>>> consensus around how to move forward with CQE, before I decide to pick
>>> it up.
>>
>> It has been more than 6 months.  That is enough time to wait for "consensus".
> 
> Normally that would be more than enough; on the other hand, it has
> turned out this was more complex than we first thought.

Nonsense.  I have raised the issues time and again but there have never been
any replies.


Re: [PATCH] mkfs: Remove messages printed when blocksize < physical sectorsize

2017-09-05 Thread Chandan Rajendra
On Tuesday, September 5, 2017 12:12:08 PM IST Omar Sandoval wrote:
> On Mon, Sep 04, 2017 at 11:31:39PM -0700, Christoph Hellwig wrote:
> > On Tue, Sep 05, 2017 at 11:14:42AM +0530, Chandan Rajendra wrote:
> > > Linux kernel commit 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 (loop: set
> > > physical block size to PAGE_SIZE) now sets PAGE_SIZE as the default
> > > physical sector size of loop devices. On ppc64, this causes loop devices
> > > to have 64k as the physical sector size.
> > 
> > Eek.  We'll need to revert the loop change ASAP!
> 
> Most annoying patch series ever. It hasn't made it to Linus' tree yet,
> right? We can revert (although the later change depends on that), fold
> in a fix, or apply a fix on top of it, whatever Jens prefers.
> 
> 

My bad. 6c6b6f28b3335fd85ec833ee0005d9c9dca6c003 is the commit id from 
Linux-next. I don't see this commit in Linus's git tree.

-- 
chandan



Re: [PATCH V7 00/10] mmc: Add Command Queue support

2017-09-05 Thread Ulf Hansson
[...]

>>>
>>> I can send blk-mq support for legacy requests in a few days if you like, but
>>> I want to hear a better explanation of why you are delaying CQE support.
>>
>> That would be very nice; however, be aware that we are in the merge
>> window, so I am not picking up new material for 4.14 from this point.
>> I assume you understand why.
>
> Nope.  This is new functionality - doesn't affect anyone who doesn't have a
> command queue engine.  Next to no chance of regressions.  Tested by several
> in the community.  Substantially unchanged since February.  It is not even
> very much code in the block driver.

Let me make it clear, once more - I don't want to maintain more hacks
in mmc block layer code.

This series adds blk-mq support, using a method (which may be considered
intermediate) via a new change in patch 1 - but only for the new CQE
path. That means the old legacy mmc block path is still there. So, for
the reason stated above - no thanks!

>
>>
>> Still, big changes are always nice to queue up early in a release
>> cycle. Let's aim for that!
>
> You said that in February.  Never happened.  You said you wanted blk-mq, so
> I waited to re-base on top, but it never appeared.

Yes, I want blk-mq - and I believe I have explained why several times by now.

Unfortunately, blk-mq support just doesn't appear by itself; we have to
work on it - together.

>
>> Moreover, I am not delaying CQE, but really want it to be merged asap!
>> However, I also have the role of a maintainer and the duties that
>> come with it. For example, I would like the community to reach
>> consensus around how to move forward with CQE, before I decide to pick
>> it up.
>
> It has been more than 6 months.  That is enough time to wait for "consensus".

Normally that would be more than enough; on the other hand, it has
turned out this was more complex than we first thought.

Kind regards
Uffe


Re: [PATCH 01/19] bcache: Fix leak of bdev reference

2017-09-05 Thread Coly Li
On 2017/9/5 2:43 PM, Christoph Hellwig wrote:
> On Tue, Sep 05, 2017 at 01:30:04AM +0800, Coly Li wrote:
>>
>> When you mentioned "whole chunk of code", do you mean the following
>> block of code ?
>>
>>
>> 1960         if (IS_ERR(bdev)) {
>> ====== start of whole chunk of code ======
>> 1961                 if (bdev == ERR_PTR(-EBUSY)) {
>> 1962                         bdev = lookup_bdev(strim(path));
>> 1963                         mutex_lock(&bch_register_lock);
>> 1964                         if (!IS_ERR(bdev) && bch_is_open(bdev))
>> 1965                                 err = "device already registered";
>> 1966                         else
>> 1967                                 err = "device busy";
>> 1968                         mutex_unlock(&bch_register_lock);
>> 1969                         if (!IS_ERR(bdev))
>> 1970                                 bdput(bdev);
>> 1971                         if (attr == &ksysfs_register_quiet)
>> 1972                                 goto out;
>> 1973                 }
>> ====== end of whole chunk of code ======
>> 1974                 goto err;
>> 1975         }
>>
>> I don't mind removing it; I just want to double-check that I haven't
>> misunderstood what you meant.
> 
> Yes, that's the problematic block.
> 
Understood, I will send out a patch; hopefully it can catch the 4.14
merge window.

Thanks for the hint.

-- 
Coly Li


Re: [PATCH 01/19] bcache: Fix leak of bdev reference

2017-09-05 Thread Christoph Hellwig
On Tue, Sep 05, 2017 at 01:30:04AM +0800, Coly Li wrote:
> 
> When you mentioned "whole chunk of code", do you mean the following
> block of code ?
> 
> 
> 1960         if (IS_ERR(bdev)) {
> ====== start of whole chunk of code ======
> 1961                 if (bdev == ERR_PTR(-EBUSY)) {
> 1962                         bdev = lookup_bdev(strim(path));
> 1963                         mutex_lock(&bch_register_lock);
> 1964                         if (!IS_ERR(bdev) && bch_is_open(bdev))
> 1965                                 err = "device already registered";
> 1966                         else
> 1967                                 err = "device busy";
> 1968                         mutex_unlock(&bch_register_lock);
> 1969                         if (!IS_ERR(bdev))
> 1970                                 bdput(bdev);
> 1971                         if (attr == &ksysfs_register_quiet)
> 1972                                 goto out;
> 1973                 }
> ====== end of whole chunk of code ======
> 1974                 goto err;
> 1975         }
> 
> I don't mind removing it; I just want to double-check that I haven't
> misunderstood what you meant.

Yes, that's the problematic block.