Re: Proposal for a scalable SCSI midlayer

2014-02-26 Thread Bart Van Assche
On 02/23/14 21:10, James Bottomley wrote:
 Right ... my ideal here if we can achieve it would be lockless threaded
 models, where we could make guarantees like single thread of execution
 per command, so all command state could be lockless.

This approach sounds interesting but could be challenging to implement.
With it, accessing the SCSI command state from interrupt or tasklet
context would no longer be safe. The I/O completion path would therefore
have to change: instead of using an IPI to invoke a tasklet on the CPU
that submitted the SCSI command, a new mechanism would be needed that
runs the I/O completion code directly in the context of the thread that
submitted the command.
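
To make the data-handoff half of that concrete, here is a minimal
userspace sketch in C11, not kernel code. The "interrupt" side only
publishes a finished command on a per-thread list, and the submitting
thread is the only one that ever touches the command state. All names
(sketch_cmd, done_list, done_list_push, done_list_drain) are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_cmd {
	struct sketch_cmd *next;  /* link, used only while on a done list */
	int result;               /* handoff slot, written before the push */
	int state;                /* owned exclusively by the submitting thread */
};

/* One list per submitting thread (assumption of this sketch). */
struct done_list {
	_Atomic(struct sketch_cmd *) head;
};

/* "Interrupt" side: publish the command and touch nothing else. */
static void done_list_push(struct done_list *dl, struct sketch_cmd *cmd,
			   int result)
{
	struct sketch_cmd *old;

	cmd->result = result;
	old = atomic_load_explicit(&dl->head, memory_order_relaxed);
	do {
		cmd->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&dl->head, &old, cmd,
							memory_order_release,
							memory_order_relaxed));
}

/*
 * Owning thread: detach the whole list in one exchange and finish each
 * command.  Returns the number of commands completed.
 */
static int done_list_drain(struct done_list *dl)
{
	struct sketch_cmd *cmd = atomic_exchange_explicit(
		&dl->head, (struct sketch_cmd *)NULL, memory_order_acquire);
	int n = 0;

	while (cmd) {
		struct sketch_cmd *next = cmd->next;

		/* No lock needed: only this thread ever writes ->state. */
		cmd->state = cmd->result ? -1 : 1;
		cmd = next;
		n++;
	}
	return n;
}
```

This leaves out the hard part, namely how the submitting thread gets
woken up and scheduled promptly enough to drain its list.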

Bart.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal for a scalable SCSI midlayer

2014-02-23 Thread James Bottomley

On Wed, 2014-02-05 at 04:39 -0800, Christoph Hellwig wrote:
 We've run into many issues where the SCSI layer simply does not scale to
 keep up with today's hardware, be that in simple single-thread IOPs, or
 in lock contention when using multiple LUNs or targets under a single
 SCSI host.  This proposal tries to chart a path to fixing this properly
 and to avoid workarounds where various drivers that speak a SCSI command
 set are implemented at the block layer because of these issues.
 
 After the dramatic improvements that the scsi-mq prototype from
 Nic Bellinger showed, it is clear that the block multiqueue
 infrastructure will play a big role in this effort, but the effort goes
 much further than that code base.
 
 As an important goal of this project I want to replace the whole I/O
 path in the SCSI midlayer, and not create largely parallel code paths
 for small and fast devices.  We will have to find out whether this is
 actually feasible for all cases, but I'd like to get as broad a set of
 drivers as possible to use the new I/O path, and avoid API differences
 if we have to keep the two paths around.

If we can do this, that would be great, because it cuts down on the
maintenance burden for all of us and gives some benefits at least to
non-MQ hardware.

 A specific non-goal is support for multiple hardware queues.  While
 we will have to support this soon, the improvements from just using the
 blk-mq code and fixing the obvious scalability issues in the SCSI midlayer
 are large enough to tackle as a first step, and to postpone problems
 related to queue synchronization into the near future.
 
 
 1) Summary of the scalability issues
 
 The biggest problem in the current SCSI midlayer is the old block layer
 request model in general, with its large number of lock round trips on
 the queue_lock for every request, and a large number of touched cache lines.

Agree with this.

 The way the old request code is used by the SCSI midlayer makes this even
 worse by using the queue_lock to protect additional internal state, and
 round tripping on a host-wide lock multiple times for each command.

That's largely a result of the above.  The premise was that, since block
already induces heavy lock shuffling, using the same locks to protect
internal state at the points where we've already acquired them is
essentially free ... obviously that hasn't quite worked out.

 Even when avoiding the host lock by replacing it with atomic counters we'd
 run into multiple host or target-wide shared cache lines for each I/O
 submission or completion.

Right ... my ideal here if we can achieve it would be lockless threaded
models, where we could make guarantees like single thread of execution
per command, so all command state could be lockless.  Even a CPU
dedicated to a single device would give us lockless device state and
the necessary cache hotness (although this may not be feasible).

 2) Suggested way forward
 
 I suggest attacking the problems from two sides:
 
 a) fixing the easy-to-hit scalability issues in the SCSI layer where we
    can, in small patch sets, even if they are overshadowed by the block
    layer ones.

Fine with this.  Anywhere we can obviously remove a lock or an atomic is
great with me.

 b) gradually moving the whole SCSI layer to be backed by blk-mq.  This
    is a different approach from Nic's current scsi-mq tree, in that it
    keeps all the per-device/target/shost accounting and fairness code in
    the SCSI midlayer in place for now, and uses the same APIs to talk
    to the LLDDs.  While this is certain to get less stellar results than
    a hard cut, it will make a full move to the new infrastructure much
    easier, and avoid long term maintenance of parallel code paths.
    Additional optimization can and should be implemented on top of this
    baseline work.

Yes, but I would like to see a clearer picture of how this would be
achieved and what it would entail.

 3) Current status
 
 I will send the first batch of patches implementing easy optimizations
 in the SCSI midlayer after this RFC, as well as a very early prototype
 of the blk-mq work based on that, as well as performance numbers.  We'll
 need to work from there to improve it to be generally usable, mostly by
 adding missing features to the blk-mq core.
 
 4) Major TODO items
 
  - add support for partial completions, as the SCSI drivers might
    complete only part of a request for a given I/O completion.

Agreed, it's required at least for bad sector handling.
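
As a rough illustration of what a partial completion means for the
request bookkeeping, here is a minimal userspace sketch, not blk-mq code.
The struct and function names are hypothetical.  The bad sector case
shows up as a completion that accounts for fewer bytes than the request
asked for, so the request has to stay alive for the remainder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: tracks how much of the transfer is finished. */
struct sketch_request {
	size_t total_bytes;
	size_t done_bytes;
};

/*
 * Account for nbytes of successfully transferred data.
 * Returns true if bytes are still outstanding (the request must stay
 * alive, e.g. to be retried or failed), false once it is fully done.
 */
static bool sketch_complete_partial(struct sketch_request *rq, size_t nbytes)
{
	size_t left;

	assert(rq->done_bytes <= rq->total_bytes);
	left = rq->total_bytes - rq->done_bytes;
	if (nbytes > left)
		nbytes = left;  /* never complete more than was requested */
	rq->done_bytes += nbytes;
	return rq->done_bytes < rq->total_bytes;
}
```

For a four-block read that hits a medium error on the third block, the
driver would complete the first two blocks' worth of bytes and get true
back; what happens to the remaining bytes is then up to error handling.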

  - either make the blk-mq tag allocator usable on a per-host basis for
    those drivers that currently use host-wide tagging, or find a way
    that they can use their own per-host tagging without getting in the
    way of blk-mq.

Agree.
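
For reference, host-wide tagging boils down to one tag space shared by
every device queue under a host.  A minimal userspace sketch follows,
assuming a depth of at most 32 so that one word can serve as the bitmap.
This is not the blk-mq tag allocator; host_tagset, host_tag_get and
host_tag_put are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define HOST_TAGS 32  /* hypothetical host-wide queue depth, <= 32 here */

/* One tag space per host, shared by all device queues under it. */
struct host_tagset {
	atomic_uint busy;  /* bit n set => tag n is in flight */
};

/* Returns a free tag in [0, HOST_TAGS), or -1 if the host is saturated. */
static int host_tag_get(struct host_tagset *ts)
{
	for (int tag = 0; tag < HOST_TAGS; tag++) {
		unsigned int bit = 1u << tag;

		/* fetch_or returns the old mask: bit clear means we own it. */
		if (!(atomic_fetch_or(&ts->busy, bit) & bit))
			return tag;
	}
	return -1;
}

static void host_tag_put(struct host_tagset *ts, int tag)
{
	assert(tag >= 0 && tag < HOST_TAGS);
	atomic_fetch_and(&ts->busy, ~(1u << tag));
}
```

Note that the bitmap word is itself a host-wide shared cache line, which
is exactly the kind of thing the proposal is worried about; a real
allocator would need to spread that load somehow.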

  - implement BIDI support in blk-mq.  This is currently missing entirely
    and will be needed to support the OSD2 protocol, as well as a few
    SBC commands through sg_io.

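Ambivalent.  We need bidi support for OSD arrays and some of the more
esoteric commands, but I'm not convinced they're necessary to the
functioning of the stack.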

Re: Proposal for a scalable SCSI midlayer

2014-02-23 Thread Christoph Hellwig
On Sun, Feb 23, 2014 at 02:10:18PM -0600, James Bottomley wrote:
 If we can do this, that would be great, because it cuts down on the
 maintenance burden for all of us and gives some benefits at least to
 non-MQ hardware.

So far this seems to work out great, and I think we will be able to
stick to it.

  Even when avoiding the host lock by replacing it with atomic counters we'd
  run into multiple host or target-wide shared cache lines for each I/O
  submission or completion.
 
 Right ... my ideal here if we can achieve it would be lockless threaded
 models, where we could make guarantees like single thread of execution
 per command, so all command state could be lockless.  Even a CPU
 dedicated to a single device would give us lockless device state and
 the necessary cache hotness (although this may not be feasible).

That doesn't fit the current blk-mq model, which I'd much prefer to
follow for now.  As Jens pointed out in the previous discussion, blk-mq
tries to map to cpu-local queues as much as possible, but there's no
hard guarantee.

   - implement BIDI support in blk-mq.  This is currently missing entirely
     and will be needed to support the OSD2 protocol, as well as a few
     SBC commands through sg_io.
 
 Ambivalent.  We need bidi support for OSD arrays and some of the more
 esoteric commands, but I'm not convinced they're necessary to the
 functioning of the stack.

It's needed so that we get a full replacement of the old code, so we'll
have to tackle it eventually.  I wish we could simply avoid it, but life
ain't that easy.



Proposal for a scalable SCSI midlayer

2014-02-05 Thread Christoph Hellwig
We've run into many issues where the SCSI layer simply does not scale to
keep up with today's hardware, be that in simple single-thread IOPs, or
in lock contention when using multiple LUNs or targets under a single
SCSI host.  This proposal tries to chart a path to fixing this properly
and to avoid workarounds where various drivers that speak a SCSI command
set are implemented at the block layer because of these issues.

After the dramatic improvements that the scsi-mq prototype from
Nic Bellinger showed, it is clear that the block multiqueue
infrastructure will play a big role in this effort, but the effort goes
much further than that code base.

As an important goal of this project I want to replace the whole I/O
path in the SCSI midlayer, and not create largely parallel code paths
for small and fast devices.  We will have to find out whether this is
actually feasible for all cases, but I'd like to get as broad a set of
drivers as possible to use the new I/O path, and avoid API differences
if we have to keep the two paths around.

A specific non-goal is support for multiple hardware queues.  While
we will have to support this soon, the improvements from just using the
blk-mq code and fixing the obvious scalability issues in the SCSI midlayer
are large enough to tackle as a first step, and to postpone problems
related to queue synchronization into the near future.


1) Summary of the scalability issues

The biggest problem in the current SCSI midlayer is the old block layer
request model in general, with its large number of lock round trips on
the queue_lock for every request, and a large number of touched cache lines.

The way the old request code is used by the SCSI midlayer makes this even
worse by using the queue_lock to protect additional internal state, and
round tripping on a host-wide lock multiple times for each command.

Even when avoiding the host lock by replacing it with atomic counters we'd
run into multiple host or target-wide shared cache lines for each I/O
submission or completion.
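
To illustrate the shared cache line problem, here is a minimal userspace
sketch in C11, not kernel code.  A single host-wide atomic counter makes
every submitting CPU write the same cache line, while per-CPU slots keep
the hot path local at the cost of a slower read of the total.  The names
(shared_busy, busy_slot, NR_SLOTS) and the 64 byte line size are
assumptions made for this sketch only.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS   4   /* hypothetical number of CPUs */
#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Variant 1: one host-wide counter, written by every CPU. */
static atomic_int shared_busy;

static void shared_inc(void) { atomic_fetch_add(&shared_busy, 1); }
static void shared_dec(void) { atomic_fetch_sub(&shared_busy, 1); }

/* Variant 2: one counter per CPU, each on its own cache line. */
struct busy_slot {
	_Alignas(CACHE_LINE) atomic_int count;
};
static struct busy_slot percpu_busy[NR_SLOTS];

/* Hot path: touches only the calling CPU's slot. */
static void percpu_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	atomic_fetch_add_explicit(&percpu_busy[cpu].count, 1,
				  memory_order_relaxed);
}

/*
 * A command can complete on a different CPU than it was submitted on,
 * so a single slot may go negative; only the sum is meaningful.
 */
static void percpu_dec(int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	atomic_fetch_sub_explicit(&percpu_busy[cpu].count, 1,
				  memory_order_relaxed);
}

/* Slow path: reading the total has to walk every slot. */
static int percpu_sum(void)
{
	int sum = 0;

	for (int cpu = 0; cpu < NR_SLOTS; cpu++)
		sum += atomic_load_explicit(&percpu_busy[cpu].count,
					    memory_order_relaxed);
	return sum;
}
```

The catch is that queue-depth checks want the total on every
submission, which is why simply swapping the lock for counters does not
make the shared cache lines go away.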


2) Suggested way forward

I suggest attacking the problems from two sides:

a) fixing the easy-to-hit scalability issues in the SCSI layer where we
   can, in small patch sets, even if they are overshadowed by the block
   layer ones.

b) gradually moving the whole SCSI layer to be backed by blk-mq.  This
   is a different approach from Nic's current scsi-mq tree, in that it
   keeps all the per-device/target/shost accounting and fairness code in
   the SCSI midlayer in place for now, and uses the same APIs to talk
   to the LLDDs.  While this is certain to get less stellar results than
   a hard cut, it will make a full move to the new infrastructure much
   easier, and avoid long term maintenance of parallel code paths.
   Additional optimization can and should be implemented on top of this
   baseline work.

3) Current status

I will send the first batch of patches implementing easy optimizations
in the SCSI midlayer after this RFC, as well as a very early prototype
of the blk-mq work based on that, as well as performance numbers.  We'll
need to work from there to improve it to be generally usable, mostly by
adding missing features to the blk-mq core.

4) Major TODO items

 - add support for partial completions, as the SCSI drivers might
   complete only part of a request for a given I/O completion.

 - either make the blk-mq tag allocator usable on a per-host basis for
   those drivers that currently use host-wide tagging, or find a way
   that they can use their own per-host tagging without getting in the
   way of blk-mq.

 - implement BIDI support in blk-mq.  This is currently missing entirely
   and will be needed to support the OSD2 protocol, as well as a few
   SBC commands through sg_io.

 - fix the tag allocation for sequenced FLUSH commands.
