Re: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-27 Thread Linus Walleij
On Fri, Oct 27, 2017 at 2:59 PM, Adrian Hunter  wrote:
> On 27/10/17 14:25, Linus Walleij wrote:

>> It is indeed tough to juggle this with the pressure to "upstream
>> first" the BFQ scheduler policy that we are working on in Linaro
>> to increase interactivity. We need to enable this on devices
>> pronto and that means migrating MMC/SD to MQ and MQ only.
>> I have shared this motivation since the start, so it should come
>> as no surprise.
>
> IMHO BFQ is just another example of unnecessary delay.

I do not see it as a delay to anything, it is a motivation for
my work. I am telling you why I am still working on my patch
set, what is driving and motivating it.

I guess CQE is driving and motivating your work?

>> So I also have some pressure to "Get This Feature In Now".
>
> It has nothing to do with pressure.  It is about what is reasonable.
> Features should go in as soon as they are ready.  Ideally queued up in the
> same release cycle they are submitted.  If the code doesn't work right, then
> it can't go in straight away, but fake reasons for delaying things needs to
> stop.

I don't understand who you are addressing or accusing.

Nobody wants to delay CQE if that is what you are implying,
I want to see it supported as much as you do.

I just prefer to see MQ happen first, and now you say your patch
set does that and that is great, so I just need to review the code
better I guess?

Yours,
Linus Walleij


Re: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-27 Thread Adrian Hunter
On 27/10/17 14:25, Linus Walleij wrote:
> On Thu, Oct 26, 2017 at 9:27 PM, Hunter, Adrian  
> wrote:
>> On Thu, Oct 26, 2017 at 3:34 PM, Adrian Hunter 
> 
>>> What I mean is that the CQE code will likely look better on top of these
>>> refactorings.
>>>
>>> But as I say it is a matter of taste. I just love the looks of my own code 
>>> as
>>> much as the next programmer so I can't judge that. Let's see what the
>>> reviewers say.
>>
>> It doesn't look ready.  There seems still to be 2 task switches between
>> each transfer.
> 
> IIUC there is a task in blk-mq submitting the requests, and that goes
> all the way to starting it on the host. Then as an interrupt comes back
> I kick the worker and that reports back to the block layer. So that
> is one task switch per request and then one switch back to the MQ
> layer.
> 
> But I am experimenting as we speak to get rid of the worker
> for all simple cases that don't require retune or BKOPS and that's
> pretty cool if it can be made to work :)

Well the CQE patches already do that.

> 
>> mmc_blk_rw_done_error() is still using the messy error
>> handling and doesn’t handle retries e.g. 'retry' is a local variable so it 
>> can't
>> count the number of retries now that there is no loop.
> 
> Right! Nice catch.
> 
> I will update the patch and put that in the struct mmc_async_req so
> it survives the retry rounds.
> 
> It sounds simple but I bet this drives a truck through Adrians patch
> series. Sorry. :(

 I waited a long time for your patches but I had to give up waiting
 when Ulf belated insisted on blk-mq before CQE.  I am not sure what
 you are expecting now it seems too late.
>>>
>>> Too late for what? It's just a patch set, I don't really have a deadline 
>>> for this or
>>> anything. As I explained above I have been working on this all the time, the
>>> problem was that I was/am not smart enough to find that solution for host
>>> claiming by context.
>>
>> Too late to go before CQE.  All the blk-mq work is now in the CQE patchset.
> 
> You seem to have an either/or stance.
> 
> Mine if more of a both/and.
> 
> It is not even necessary to have one set of these patches on
> top of each other, they can also be mixed in some order.
> 
> A lot of factors influence this I think, like structure of code and
> maintainability, performance, block layer interaction semantics,
> etc etc.
> 
> We definately need input from Ulf and Bartlomiej (who was actually
> the first to work on MQ for MMC/SD).
> 
>>> The host claiming by context was merged a month ago and now I have
>>> understood that I can use that and rebased my patches on it. Slow learner, I
>>> guess.
>>>
>>> If you feel it is ungrateful that you have put in so much work and things 
>>> are
>>> not getting merged, and you feel your patches deserve to be merged first
>>> (because of human nature reasons) I can empathize with that. It's sad that
>>> your patches are at v12. Also I see that patch 4 bears the signoffs of a
>>> significant team at CodeAurora, so they are likely as impatient.
>>
>> It is important that you understand that this has nothing to do with
>> "human nature reasons".
> 
> You do come across as a bit hard-headed.
> 
> But I think it is better to focus on the code per se.
> 
> I would suggest we go and review each others patch series to
> learn from each codebase what was done in a good way for the
> MMC/SD stack and what was not, you tossed out a nice review
> comment above for example.

I can make a few more comments about what else is broken.  Have you tried
suspend / resume?  At a glance, it looks like you are trying to use
blk_stop_queue() which is not a blk-mq function.

> 
>>  Linux distributions use upstream kernels.
>> ChromeOS  has an "upstream first" policy.  Delaying features for long
>> periods has real-world consequences.  When people ask, what kernel
>> should they use, we expect to reply, just use mainline.
> 
> We are in violent agreement.
> 
> I take it that you are working on ChromeOS context and that since
> they have this policy, they, through their influence over Intel as a
> supplier is putting heavy pressure on you personally to get this
> merged.
> 
> Is that correctly understood?

No.  We just expect to base everything on mainline.

> 
> That would explain your increasing pushing to get this
> upstream pretty well, especially if you have tech leads and
> managers hovering over your shoulder every week asking how
> the CQE upstream work is progressing.
> 
> It is indeed tough to juggle this with the pressure to "upstream
> first" the BFQ scheduler policy that we are working on in Linaro
> to increase interactivity. We need to enable this on devices
> pronto and that means migrating MMC/SD to MQ and MQ only.
> I have shared this motivation since the start, so it should come
> as no surprise.

IMHO BFQ is just another example of unnecessary delay.

> 
> So I also have some pressure to "Get This 

Re: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-27 Thread Linus Walleij
On Thu, Oct 26, 2017 at 9:27 PM, Hunter, Adrian  wrote:
> On Thu, Oct 26, 2017 at 3:34 PM, Adrian Hunter 

>> What I mean is that the CQE code will likely look better on top of these
>> refactorings.
>>
>> But as I say it is a matter of taste. I just love the looks of my own code as
>> much as the next programmer so I can't judge that. Let's see what the
>> reviewers say.
>
> It doesn't look ready.  There seems still to be 2 task switches between
> each transfer.

IIUC there is a task in blk-mq submitting the requests, and that goes
all the way to starting it on the host. Then as an interrupt comes back
I kick the worker and that reports back to the block layer. So that
is one task switch per request and then one switch back to the MQ
layer.

But I am experimenting as we speak to get rid of the worker
for all simple cases that don't require retune or BKOPS and that's
pretty cool if it can be made to work :)

> mmc_blk_rw_done_error() is still using the messy error
> handling and doesn’t handle retries e.g. 'retry' is a local variable so it 
> can't
> count the number of retries now that there is no loop.

Right! Nice catch.

I will update the patch and put that in the struct mmc_async_req so
it survives the retry rounds.

>> >> It sounds simple but I bet this drives a truck through Adrians patch
>> >> series. Sorry. :(
>> >
>> > I waited a long time for your patches but I had to give up waiting
>> > when Ulf belated insisted on blk-mq before CQE.  I am not sure what
>> > you are expecting now it seems too late.
>>
>> Too late for what? It's just a patch set, I don't really have a deadline for 
>> this or
>> anything. As I explained above I have been working on this all the time, the
>> problem was that I was/am not smart enough to find that solution for host
>> claiming by context.
>
> Too late to go before CQE.  All the blk-mq work is now in the CQE patchset.

You seem to have an either/or stance.

Mine if more of a both/and.

It is not even necessary to have one set of these patches on
top of each other, they can also be mixed in some order.

A lot of factors influence this I think, like structure of code and
maintainability, performance, block layer interaction semantics,
etc etc.

We definately need input from Ulf and Bartlomiej (who was actually
the first to work on MQ for MMC/SD).

>> The host claiming by context was merged a month ago and now I have
>> understood that I can use that and rebased my patches on it. Slow learner, I
>> guess.
>>
>> If you feel it is ungrateful that you have put in so much work and things are
>> not getting merged, and you feel your patches deserve to be merged first
>> (because of human nature reasons) I can empathize with that. It's sad that
>> your patches are at v12. Also I see that patch 4 bears the signoffs of a
>> significant team at CodeAurora, so they are likely as impatient.
>
> It is important that you understand that this has nothing to do with
> "human nature reasons".

You do come across as a bit hard-headed.

But I think it is better to focus on the code per se.

I would suggest we go and review each others patch series to
learn from each codebase what was done in a good way for the
MMC/SD stack and what was not, you tossed out a nice review
comment above for example.

>  Linux distributions use upstream kernels.
> ChromeOS  has an "upstream first" policy.  Delaying features for long
> periods has real-world consequences.  When people ask, what kernel
> should they use, we expect to reply, just use mainline.

We are in violent agreement.

I take it that you are working on ChromeOS context and that since
they have this policy, they, through their influence over Intel as a
supplier is putting heavy pressure on you personally to get this
merged.

Is that correctly understood?

That would explain your increasing pushing to get this
upstream pretty well, especially if you have tech leads and
managers hovering over your shoulder every week asking how
the CQE upstream work is progressing.

It is indeed tough to juggle this with the pressure to "upstream
first" the BFQ scheduler policy that we are working on in Linaro
to increase interactivity. We need to enable this on devices
pronto and that means migrating MMC/SD to MQ and MQ only.
I have shared this motivation since the start, so it should come
as no surprise.

So I also have some pressure to "Get This Feature In Now".

Yours,
Linus Walleij


RE: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-26 Thread Hunter, Adrian
> -Original Message-
> From: Linus Walleij [mailto:linus.wall...@linaro.org]
> Sent: Thursday, October 26, 2017 5:20 PM
> To: Hunter, Adrian <adrian.hun...@intel.com>
> Cc: linux-...@vger.kernel.org; Ulf Hansson <ulf.hans...@linaro.org>;
> linux-block <linux-block@vger.kernel.org>; Jens Axboe <ax...@kernel.dk>;
> Christoph Hellwig <h...@lst.de>; Arnd Bergmann <a...@arndb.de>;
> Bartlomiej Zolnierkiewicz <b.zolnier...@samsung.com>; Paolo Valente
> <paolo.vale...@linaro.org>; Avri Altman <avri.alt...@sandisk.com>
> Subject: Re: [PATCH 00/12 v4] multiqueue for MMC/SD
> 
> On Thu, Oct 26, 2017 at 3:34 PM, Adrian Hunter <adrian.hun...@intel.com>
> wrote:
> > On 26/10/17 15:57, Linus Walleij wrote:
> >> In my opinion this is also a better fit for command queueuing.
> >
> > Not true.  CQE support worked perfectly before blk-mq and did not
> > depend on blk-mq in any way.  Obviously the current CQE patch set
> > actually implements the CQE requirements for blk-mq - which this patch set
> does not.
> 
> What I mean is that the CQE code will likely look better on top of these
> refactorings.
> 
> But as I say it is a matter of taste. I just love the looks of my own code as
> much as the next programmer so I can't judge that. Let's see what the
> reviewers say.

It doesn't look ready.  There seems still to be 2 task switches between
each transfer.  mmc_blk_rw_done_error() is still using the messy error
handling and doesn’t handle retries e.g. 'retry' is a local variable so it can't
count the number of retries now that there is no loop.

> >> It sounds simple but I bet this drives a truck through Adrians patch
> >> series. Sorry. :(
> >
> > I waited a long time for your patches but I had to give up waiting
> > when Ulf belated insisted on blk-mq before CQE.  I am not sure what
> > you are expecting now it seems too late.
> 
> Too late for what? It's just a patch set, I don't really have a deadline for 
> this or
> anything. As I explained above I have been working on this all the time, the
> problem was that I was/am not smart enough to find that solution for host
> claiming by context.

Too late to go before CQE.  All the blk-mq work is now in the CQE patchset.

> 
> The host claiming by context was merged a month ago and now I have
> understood that I can use that and rebased my patches on it. Slow learner, I
> guess.
> 
> If you feel it is ungrateful that you have put in so much work and things are
> not getting merged, and you feel your patches deserve to be merged first
> (because of human nature reasons) I can empathize with that. It's sad that
> your patches are at v12. Also I see that patch 4 bears the signoffs of a
> significant team at CodeAurora, so they are likely as impatient.

It is important that you understand that this has nothing to do with
"human nature reasons".Linux distributions use upstream kernels. 
ChromeOS  has an "upstream first" policy.  Delaying features for long
periods has real-world consequences.  When people ask, what kernel
should they use, we expect to reply, just use mainline.



Re: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-26 Thread Linus Walleij
On Thu, Oct 26, 2017 at 3:34 PM, Adrian Hunter  wrote:
> On 26/10/17 15:57, Linus Walleij wrote:

>> I have now worked on it for more than a year. I was side
>> tracked to clean up some code, move request allocation to
>> be handled by the block layer, delete bounce buffer handling
>> and refactoring the RPMB support. With the changes to request
>> allocation, the patch set is a better fit and has shrunk
>> from 16 to 12 patches as a result.
>
> None of which was necessary for blk-mq support.

I was not smart enough to realize that it was possible to do what
you did in
commit d685f5d5fcf75c30ef009771d3067f7438cd8baf
"mmc: core: Introduce host claiming by context"
this simple and clever solution simply didn't occur to me
at all.

And now it uses that solution, as you can see :)

But since I didn't have that simple solution, the other
solution was to get rid of the lock altogether (which we should
anyways...) getting rid of the RPMB "partition" for example
removes some locks. (I guess I still will have to go on and
find a solution for the boot and generic partitions but it's no
blocker for MQ anymore.)

My patch set was dependent on solving that. As I already wrote
to you on sep 13:
https://marc.info/?l=linux-mmc=150607944524752=2

My patches for allocating the struct mmc_queue_req from the
block layer was actually originally a part of this series
so the old patches
  mmc: queue: issue struct mmc_queue_req items
  mmc: queue: get/put struct mmc_queue_req
was doing a stupid homebrewn solution to what the block
layer already can do, mea culpa. (Yeah I was educating
myself in the block layer too...)

Anyways, all of this happened in the context of moving
forward with my MQ patch set, not as random activity.

Now it looks like I'm defending myself from a project leader,
haha :D

Well for better or worse, this was how I was working.

>> We use the trick to set the queue depth to 2 to get two
>> parallel requests pushed down to the host. I tried to set this
>> to 4, the code survives it, the queue just have three items
>> waiting to be submitted all the time.
>
> The queue depth also sets the number of requests, so you are strangling the
> I/O scheduler.

Yup. Just did it to see if it survives.

>> In my opinion this is also a better fit for command queueuing.
>
> Not true.  CQE support worked perfectly before blk-mq and did not depend on
> blk-mq in any way.  Obviously the current CQE patch set actually implements
> the CQE requirements for blk-mq - which this patch set does not.

What I mean is that the CQE code will likely look better on top
of these refactorings.

But as I say it is a matter of taste. I just love the looks of my own code
as much as the next programmer so I can't judge that. Let's see what
the reviewers say.

>> Handling command queueing needs to happen in the asynchronous
>> submission codepath, so instead of waiting on a pending
>> areq, we just stack up requests in the command queue.
>
> That is how CQE has always worked.  It worked that way just fine without 
> blk-mq.

Okay nice.

>> It sounds simple but I bet this drives a truck through Adrians
>> patch series. Sorry. :(
>
> I waited a long time for your patches but I had to give up waiting when Ulf
> belated insisted on blk-mq before CQE.  I am not sure what you are expecting
> now it seems too late.

Too late for what? It's just a patch set, I don't really have a deadline
for this or anything. As I explained above I have
been working on this all the time, the problem was that I was/am not
smart enough to find that solution for host claiming by context.

The host claiming by context was merged a month ago and now I have
understood that I can use that and rebased my patches on it. Slow
learner, I guess.

If you feel it is ungrateful that you have put in so much work and things
are not getting merged, and you feel your patches deserve to be
merged first (because of human nature reasons) I can empathize with
that. It's sad that your patches are at v12. Also I see that patch 4 bears
the signoffs of a significant team at CodeAurora, so they are likely
as impatient.

I would just rebase my remaining work on top of the CQE patches if
they end up being merged first, no big deal, just work.

Yours,
Linus Walleij


Re: [PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-26 Thread Adrian Hunter
On 26/10/17 15:57, Linus Walleij wrote:
> This switches the MMC/SD stack over to unconditionally
> using the multiqueue block interface for block access.
> This modernizes the MMC/SD stack and makes it possible
> to enable BFQ scheduling on these single-queue devices.
> 
> This is the v4 version of this v3 patch set from february:
> https://marc.info/?l=linux-mmc=148665788227015=2
> 
> The patches are available in a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson.git/log/?h=mmc-mq-v4.14-rc4
> 
> You can pull it to a clean kernel tree like this:
> git checkout -b mmc-test v4.14-rc4
> git pull 
> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson.git 
> mmc-mq-v4.14-rc4
> 
> I have now worked on it for more than a year. I was side
> tracked to clean up some code, move request allocation to
> be handled by the block layer, delete bounce buffer handling
> and refactoring the RPMB support. With the changes to request
> allocation, the patch set is a better fit and has shrunk
> from 16 to 12 patches as a result.

None of which was necessary for blk-mq support.

> 
> It is still quite invasive. Yet it is something I think would
> be nice to merge for v4.16...
> 
> The rationale for this approach was Arnd's suggestion to try to
> switch the MMC/SD stack around so as to complete requests as
> quickly as possible when they return from the device driver
> so that new requests can be issued. We are doing this now:
> the polling loop that was pulling NULL out of the request
> queue and driving the pipeline with a loop is gone with
> the next-to last patch ("block: issue requests in massive
> parallel"). This sets the stage for MQ to go in and hammer
> requests on the asynchronous issuing layer.
> 
> We use the trick to set the queue depth to 2 to get two
> parallel requests pushed down to the host. I tried to set this
> to 4, the code survives it, the queue just have three items
> waiting to be submitted all the time.

The queue depth also sets the number of requests, so you are strangling the
I/O scheduler.

> 
> In my opinion this is also a better fit for command queueuing.

Not true.  CQE support worked perfectly before blk-mq and did not depend on
blk-mq in any way.  Obviously the current CQE patch set actually implements
the CQE requirements for blk-mq - which this patch set does not.

> Handling command queueing needs to happen in the asynchronous
> submission codepath, so instead of waiting on a pending
> areq, we just stack up requests in the command queue.

That is how CQE has always worked.  It worked that way just fine without blk-mq.

> 
> It sounds simple but I bet this drives a truck through Adrians
> patch series. Sorry. :(

I waited a long time for your patches but I had to give up waiting when Ulf
belated insisted on blk-mq before CQE.  I am not sure what you are expecting
now it seems too late.


[PATCH 00/12 v4] multiqueue for MMC/SD

2017-10-26 Thread Linus Walleij
This switches the MMC/SD stack over to unconditionally
using the multiqueue block interface for block access.
This modernizes the MMC/SD stack and makes it possible
to enable BFQ scheduling on these single-queue devices.

This is the v4 version of this v3 patch set from february:
https://marc.info/?l=linux-mmc=148665788227015=2

The patches are available in a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson.git/log/?h=mmc-mq-v4.14-rc4

You can pull it to a clean kernel tree like this:
git checkout -b mmc-test v4.14-rc4
git pull 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson.git 
mmc-mq-v4.14-rc4

I have now worked on it for more than a year. I was side
tracked to clean up some code, move request allocation to
be handled by the block layer, delete bounce buffer handling
and refactoring the RPMB support. With the changes to request
allocation, the patch set is a better fit and has shrunk
from 16 to 12 patches as a result.

It is still quite invasive. Yet it is something I think would
be nice to merge for v4.16...

The rationale for this approach was Arnd's suggestion to try to
switch the MMC/SD stack around so as to complete requests as
quickly as possible when they return from the device driver
so that new requests can be issued. We are doing this now:
the polling loop that was pulling NULL out of the request
queue and driving the pipeline with a loop is gone with
the next-to last patch ("block: issue requests in massive
parallel"). This sets the stage for MQ to go in and hammer
requests on the asynchronous issuing layer.

We use the trick to set the queue depth to 2 to get two
parallel requests pushed down to the host. I tried to set this
to 4, the code survives it, the queue just have three items
waiting to be submitted all the time.

In my opinion this is also a better fit for command queueuing.
Handling command queueing needs to happen in the asynchronous
submission codepath, so instead of waiting on a pending
areq, we just stack up requests in the command queue.

It sounds simple but I bet this drives a truck through Adrians
patch series. Sorry. :(

We are not issueing new requests from interrupt context: I still
have to post a work on a workqueue for it. Since there is the
retune and background operations that need to be checked after
every command and yeah, it needs to happen in blocking context
as far as I know.

I might make a hack trying to strip out the retune (etc) and
instead run request until something fail and report requests
back to the block layer in interrupt context. It would be an
interesting experiment, but for later.

We have parallelism in pre/post hooks also with multiqueue.
All asynchronous optimization that was there for the old block
layer is now also there for multiqueue.

Last time I followed up with some open questions
https://marc.info/?l=linux-mmc=149075698610224=2
I think these are now resolved.

As a result, the last patch is no longer in RFC state. I
think this works. (Famous last words, OK there WILL be
regressions but hey, we need to do this.)
You can see there are three steps:

- I do some necessary refactoring and need to move postprocessing
  to after the requests have been completed. This clearly, as you
  can see, introduce a performance regression in the dd test with
  the patch:
  "mmc: core: move the asynchronous post-processing"
  It seems the random seek with find isn't much affected.

- I continue the refactoring and get to the point of issueing
  requests immediately after every successful transfer, and the
  dd performance is restored with patch
  "mmc: queue: issue requests in massive parallel"

- Then I add multiqueue on top of the cake. So before the change
  we have the nice performance we want so we can study the effect
  of just introducing multiqueueing in the last patch
  "mmc: switch MMC/SD to use blk-mq multiqueueing v4"


PERFORMANCE BEFORE AND AFTER:

BEFORE this patch series, on Ulf's next branch ending with
commit cf653c788a29fa70e07b86492a7599c471c705de (mmc-next)
Merge: 4dda8e1f70f8 eb701ce16a45 ("Merge branch 'fixes' into next")

sync
echo 3 > /proc/sys/vm/drop_caches
sync
time dd if=/dev/mmcblk3 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.0GB) copied, 23.966583 seconds, 42.7MB/s
real0m 23.97s
user0m 0.01s
sys 0m 3.74s

mount /dev/mmcblk3p1 /mnt/
cd /mnt/
sync
echo 3 > /proc/sys/vm/drop_caches
sync
time find . > /dev/null
real0m 3.24s
user0m 0.22s
sys 0m 1.23s

sync
echo 3 > /proc/sys/vm/drop_caches
sync
iozone -az -i0 -i1 -i2 -s 20m -I -f /mnt/foo.test

   randomrandom
   kB  reclenwrite  rewritereadrereadread write
20480   4 1598 1559 6782 6740 6751  536
20480   8 2134 2281114491144911407 1145
20480  16 3695 4171176761767717638 1234
20480  32 5751