Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-13 Thread Sagi Grimberg

On 1/12/2015 2:56 PM, Bart Van Assche wrote:

On 01/11/15 10:40, Sagi Grimberg wrote:

I would say there is no need for specific coordination from iSCSI PoV.
This is exactly what flow steering is designed for. As I see it, in
order to get the TX/RX to match rings, the user can attach 5-tuple rules
(using standard ethtool) to steer packets to the right rings.


Hello Sagi,

Can the 5-tuple rules be chosen such that it is guaranteed that the
sockets used to implement per-CPU queues are spread evenly over MSI-X
completion vectors? If not, would it help to add a socket option to the
Linux network stack that allows selecting the TX ring explicitly, just
like ib_create_cq() in the Linux RDMA stack allows selecting a
completion vector explicitly? My concerns are as follows:
- If the number of queues exceeds the number of MSI-X vectors then I
   expect that it will be much easier to guarantee even spreading by
   selecting tx queues explicitly instead of relying on a hashing scheme.
- On multi-socket systems it is important to process completion
   interrupts on the CPU socket from where the I/O was initiated. I'm
   not sure it is possible to guarantee this when using a hashing
   algorithm to select the TX ring.



Hey Bart,

Your concerns are valid. Flow steering rules will guarantee that each
socket gets a different TX/RX ring, but not necessarily the correct
TX/RX ring. These issues have already been addressed in the networking
subsystem, though.

Thinking out loud on this some more:

There is the TX challenge: getting the HW queue selection to match the
TX ring selection (which might not be the same according to the flow
hash). The first thing that comes to mind is XPS (Transmit Packet
Steering).


From Documentation/networking/scaling.txt:
Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is
recorded. The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set.
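
To make that concrete, a minimal sketch (untested; eth0, the queue
count and the 1:1 queue-to-CPU layout are just assumptions on my part)
of pinning each TX queue to one CPU via the XPS sysfs knobs:

/* Sketch: bind TX queue i of eth0 to CPU i via XPS. */
#include <stdio.h>

static int set_xps(const char *dev, int txq, unsigned long cpu_mask)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/queues/tx-%d/xps_cpus", dev, txq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lx\n", cpu_mask);	/* hex CPU mask, like irq smp_affinity */
	return fclose(f);
}

int main(void)
{
	int cpu, nr_cpus = 4;		/* assumption: 4 CPUs, 4 TX queues */

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (set_xps("eth0", cpu, 1UL << cpu))
			perror("xps_cpus");
	return 0;
}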

As for the RX challenge, I think RFS (Receive Flow Steering) is
probably the best fit here, since RX packets will be steered to the
CPU where the application is running.

From Documentation/networking/scaling.txt:
The goal of RFS is to increase datacache hitrate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU. In RFS, packets are not forwarded directly by the value of their
hash, but the hash is used as index into a flow lookup table. This
table maps flows to the CPUs where those flows are being processed.
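
And again just to make the knobs concrete, a minimal sketch (untested;
eth0, the table sizes and the RX queue count are assumptions) of
turning RFS on:

/* Sketch: enable RFS globally and per RX queue on eth0. */
#include <stdio.h>

static int write_val(const char *path, unsigned int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%u\n", val);
	return fclose(f);
}

int main(void)
{
	const unsigned int entries = 32768;	/* global flow table size */
	const int nr_rxq = 4;			/* assumption: 4 RX queues */
	char path[256];
	int q;

	write_val("/proc/sys/net/core/rps_sock_flow_entries", entries);
	for (q = 0; q < nr_rxq; q++) {
		snprintf(path, sizeof(path),
			 "/sys/class/net/eth0/queues/rx-%d/rps_flow_cnt", q);
		write_val(path, entries / nr_rxq);
	}
	return 0;
}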

This definitely needs some more thinking. CC'ing Or Gerlitz, who has
a lot of experience in the networking stack...

Sagi.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-13 Thread Sagi Grimberg

On 1/12/2015 10:05 PM, Mike Christie wrote:

On 01/11/2015 03:23 AM, Sagi Grimberg wrote:

On 1/9/2015 8:00 PM, Michael Christie wrote:
SNIP




Session wide command sequence number synchronization isn't something to
be removed as part of the MQ work.  It's an iSCSI/iSER protocol
requirement.

That is, the expected + maximum sequence numbers are returned as part of
every response PDU, which the initiator uses to determine when the
command sequence number window is open so new non-immediate commands may
be sent to the target.

So, given some manner of session wide synchronization is required
between different contexts for the existing single connection case to
update the command sequence number and check when the window opens, it's
a fallacy to claim MC/S adds some type of new initiator specific
synchronization overhead vs. single connection code.


I think you are assuming we are leaving the iscsi code as it is today.

For the non-MCS mq session per CPU design, we would be allocating and
binding the session and its resources to specific CPUs. They would
only be accessed by the threads on that one CPU, so we get our
serialization/synchronization from that. That is why we are saying we
do not need something like atomic_t/spin_locks for the sequence number
handling for this type of implementation.

If we just tried to do this with the old code where the session could
be accessed on multiple CPUs then you are right, we need locks/atomics
like how we do in the MCS case.



I don't think we will want to restrict ourselves to a session per CPU.
There is a tradeoff question of system resources. We might want to
allow a user to configure multiple HW queues while still not using too
much of the system resources. So the session locks would still be used,
but they would definitely be less congested...


Are you talking specifically about the session per CPU, or also about
MCS and doing a connection per CPU?


This applies to both.



Based on the srp work, how bad do you think it will be to do a
session/connection per CPU? What are you thinking will be more common?
Session per 4 CPUs? 2 CPUs? 8?


This is a matter of degree, which demonstrates why we need to let the
user choose. I don't think there is a magic number here; there is a
tradeoff between performance and memory footprint.
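
Purely for illustration (nothing like this exists in the iscsi code
today, and the names are made up), the knob could be as simple as a
module parameter that caps the number of MQ sessions/queues per target:

/* Hypothetical knob: cap MQ sessions (hw queues) per target,
 * defaulting to one per 4 CPUs.  Illustrative only. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/cpumask.h>

static unsigned int nr_mq_sessions;	/* 0 = auto */
module_param(nr_mq_sessions, uint, 0444);
MODULE_PARM_DESC(nr_mq_sessions,
		 "iSCSI sessions (hw queues) per target (0 = one per 4 CPUs)");

static unsigned int pick_nr_sessions(void)
{
	if (nr_mq_sessions)
		return min_t(unsigned int, nr_mq_sessions, num_online_cpus());
	return max_t(unsigned int, 1, num_online_cpus() / 4);
}

static int __init nr_sessions_demo_init(void)
{
	pr_info("would create %u sessions per target\n", pick_nr_sessions());
	return 0;
}
module_init(nr_sessions_demo_init);
MODULE_LICENSE("GPL");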



There is also multipath to take into account here. We could do an mq/MCS
session/connection per CPU (or group of CPUs) and then also one of those
per transport path. We could also do an mq/MCS session/connection per
transport path, then bind those to specific CPUs. Or something in between.



Is it a good idea to tie the iSCSI implementation to multipath? I've
seen deployments where multipath was not used for HA (NIC bonding was
used for that).

The srp implementation allows the user to choose the number of
channels per target, and the default was chosen based on empirical
results (Bart, please correct me if I'm wrong here).

Sagi.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-12 Thread Mike Christie
On 01/11/2015 03:23 AM, Sagi Grimberg wrote:
 On 1/9/2015 8:00 PM, Michael Christie wrote:
 SNIP


 Session wide command sequence number synchronization isn't something to
 be removed as part of the MQ work.  It's an iSCSI/iSER protocol
 requirement.

 That is, the expected + maximum sequence numbers are returned as part of
 every response PDU, which the initiator uses to determine when the
 command sequence number window is open so new non-immediate commands may
 be sent to the target.

 So, given some manner of session wide synchronization is required
 between different contexts for the existing single connection case to
 update the command sequence number and check when the window opens, it's
 a fallacy to claim MC/S adds some type of new initiator specific
 synchronization overhead vs. single connection code.

 I think you are assuming we are leaving the iscsi code as it is today.

 For the non-MCS mq session per CPU design, we would be allocating and
 binding the session and its resources to specific CPUs. They would
 only be accessed by the threads on that one CPU, so we get our
 serialization/synchronization from that. That is why we are saying we
 do not need something like atomic_t/spin_locks for the sequence number
 handling for this type of implementation.

 If we just tried to do this with the old code where the session could
 be accessed on multiple CPUs then you are right, we need locks/atomics
 like how we do in the MCS case.

 
 I don't think we will want to restrict ourselves to a session per CPU.
 There is a tradeoff question of system resources. We might want to
 allow a user to configure multiple HW queues while still not using too
 much of the system resources. So the session locks would still be used,
 but they would definitely be less congested...

Are you talking specifically about the session per CPU, or also about
MCS and doing a connection per CPU?

Based on the srp work, how bad do you think it will be to do a
session/connection per CPU? What are you thinking will be more common?
Session per 4 CPUs? 2 CPUs? 8?

There is also multipath to take into account here. We could do an mq/MCS
session/connection per CPU (or group of CPUs) and then also one of those
per transport path. We could also do an mq/MCS session/connection per
transport path, then bind those to specific CPUs. Or something in between.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-12 Thread Mike Christie
On 01/11/2015 03:40 AM, Sagi Grimberg wrote:
 On 1/9/2015 10:19 PM, Mike Christie wrote:
 On 01/09/2015 12:28 PM, Hannes Reinecke wrote:
 On 01/09/2015 07:00 PM, Michael Christie wrote:

 On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger
 n...@linux-iscsi.org wrote:

 On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
 On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:

 SNIP

 The point is that a simple session wide counter for command sequence
 number assignment is significantly less overhead than all of the
 overhead associated with running a full multipath stack atop
 multiple
 sessions.

 I don't see how that's relevant to issue speed, which was the
 measure we
 were using: The layers above are just a hopper.  As long as they're
 loaded, the MQ lower layer can issue at full speed.  So as long as
 the
 multipath hopper is efficient enough to keep the queues loaded
 there's
 no speed degradation.

 The problem with a sequence point inside the MQ issue layer is
 that it
 can cause a stall that reduces the issue speed, so the counter
 sequence
 point causes a degraded issue speed over the multipath hopper
 approach
 above even if the multipath approach has a higher CPU overhead.

 Now, if the system is close to 100% cpu already, *then* the multipath
 overhead will try to take CPU power we don't have and cause a
 stall, but
 it's only in the flat out CPU case.

 Not to mention that our iSCSI/iSER initiator is already taking a
 session
 wide lock when sending outgoing PDUs, so adding a session wide
 counter
 isn't adding any additional synchronization overhead vs. what's
 already
 in place.

 I'll leave it up to the iSER people to decide whether they're redoing
 this as part of the MQ work.


 Session wide command sequence number synchronization isn't
 something to
 be removed as part of the MQ work.  It's an iSCSI/iSER protocol
 requirement.

 That is, the expected + maximum sequence numbers are returned as
 part of
 every response PDU, which the initiator uses to determine when the
 command sequence number window is open so new non-immediate
 commands may
 be sent to the target.

 So, given some manner of session wide synchronization is required
 between different contexts for the existing single connection case to
 update the command sequence number and check when the window opens,
 it's
 a fallacy to claim MC/S adds some type of new initiator specific
 synchronization overhead vs. single connection code.

 I think you are assuming we are leaving the iscsi code as it is today.

 For the non-MCS mq session per CPU design, we would be allocating and
 binding the session and its resources to specific CPUs. They would only
 be accessed by the threads on that one CPU, so we get our
 serialization/synchronization from that. That is why we are saying we
 do not need something like atomic_t/spin_locks for the sequence number
 handling for this type of implementation.

 Wouldn't that need to be coordinated with the networking layer?

 Yes.

 Doesn't it do the same thing, matching TX/RX queues to CPUs?

 Yes.

 
 Hey Hannes, Mike,
 
 I would say there is no need for specific coordination from iSCSI PoV.
 This is exactly what flow steering is designed for. As I see it, in
 order to get the TX/RX to match rings, the user can attach 5-tuple rules
 (using standard ethtool) to steer packets to the right rings.
 
 Sagi.
 
 If so, wouldn't we decrease bandwidth by restricting things to one CPU?

 We have a session or connection per CPU though, so we end up hitting the
 same problem you talked about last year where one hctx (iscsi session or
 connection's socket or nic hw queue) could get overloaded. This is what
 I meant in my original mail where iscsi would rely on whatever blk/mq
 load balancers we end up implementing at that layer to balance requests
 across hctxs.

 
 I'm not sure I understand,
 
 The submission flow is CPU bound. In the current single queue model
 both CPU X and CPU Y will end up using a single socket. In the
 multi-queue solution, CPU X will go to socket X and CPU Y will go to
 socket Y. This is equal to what we have today (if only CPU X is active)
 or better (if more CPUs are active).
 
 Am I missing something?

I did not take Hannes's comment as comparing what we have today vs the
proposal. I thought he was referring to the problem he talked about at
LSF last year, and saying there could be cases where we want to spread
IO across CPUs/queues and some cases where we would want to execute on
the CPU the I/O was originally submitted on. I was just saying the iscsi
layer would not control that and would rely on the blk/mq layer to
handle this, or to tell us what to do, similar to what we do for the
rq_affinity setting.
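
(rq_affinity here being the existing block layer knob under
/sys/block/<dev>/queue/rq_affinity: 1 completes the request on a CPU in
the submitting CPU's group, 2 forces completion onto the exact
submitting CPU. A throwaway sketch of poking it, with the device name
being an assumption:)

/* Sketch: force block completions back onto the submitting CPU. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/sda/queue/rq_affinity", "w");

	if (!f) {
		perror("rq_affinity");
		return 1;
	}
	fputs("2\n", f);	/* 0 = off, 1 = same CPU group, 2 = same CPU */
	return fclose(f);
}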


Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-11 Thread Sagi Grimberg

On 1/9/2015 10:19 PM, Mike Christie wrote:

On 01/09/2015 12:28 PM, Hannes Reinecke wrote:

On 01/09/2015 07:00 PM, Michael Christie wrote:


On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger n...@linux-iscsi.org wrote:


On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:

On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:

On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:

On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:


SNIP


The point is that a simple session wide counter for command sequence
number assignment is significantly less overhead than all of the
overhead associated with running a full multipath stack atop multiple
sessions.


I don't see how that's relevant to issue speed, which was the measure we
were using: The layers above are just a hopper.  As long as they're
loaded, the MQ lower layer can issue at full speed.  So as long as the
multipath hopper is efficient enough to keep the queues loaded there's
no speed degradation.

The problem with a sequence point inside the MQ issue layer is that it
can cause a stall that reduces the issue speed, so the counter sequence
point causes a degraded issue speed over the multipath hopper approach
above even if the multipath approach has a higher CPU overhead.

Now, if the system is close to 100% cpu already, *then* the multipath
overhead will try to take CPU power we don't have and cause a stall, but
it's only in the flat out CPU case.


Not to mention that our iSCSI/iSER initiator is already taking a session
wide lock when sending outgoing PDUs, so adding a session wide counter
isn't adding any additional synchronization overhead vs. what's already
in place.


I'll leave it up to the iSER people to decide whether they're redoing
this as part of the MQ work.



Session wide command sequence number synchronization isn't something to
be removed as part of the MQ work.  It's an iSCSI/iSER protocol
requirement.

That is, the expected + maximum sequence numbers are returned as part of
every response PDU, which the initiator uses to determine when the
command sequence number window is open so new non-immediate commands may
be sent to the target.

So, given some manner of session wide synchronization is required
between different contexts for the existing single connection case to
update the command sequence number and check when the window opens, it's
a fallacy to claim MC/S adds some type of new initiator specific
synchronization overhead vs. single connection code.


I think you are assuming we are leaving the iscsi code as it is today.

For the non-MCS mq session per CPU design, we would be allocating and
binding the session and its resources to specific CPUs. They would only
be accessed by the threads on that one CPU, so we get our
serialization/synchronization from that. That is why we are saying we
do not need something like atomic_t/spin_locks for the sequence number
handling for this type of implementation.


Wouldn't that need to be coordinated with the networking layer?


Yes.


Doesn't it do the same thing, matching TX/RX queues to CPUs?


Yes.



Hey Hannes, Mike,

I would say there is no need for specific coordination from iSCSI PoV.
This is exactly what flow steering is designed for. As I see it, in
order to get the TX/RX to match rings, the user can attach 5-tuple rules
(using standard ethtool) to steer packets to the right rings.

Sagi.


If so, wouldn't we decrease bandwidth by restricting things to one CPU?


We have a session or connection per CPU though, so we end up hitting the
same problem you talked about last year where one hctx (iscsi session or
connection's socket or nic hw queue) could get overloaded. This is what
I meant in my original mail where iscsi would rely on whatever blk/mq
load balancers we end up implementing at that layer to balance requests
across hctxs.



I'm not sure I understand,

The submission flow is CPU bound. In the current single queue model
both CPU X and CPU Y will end up using a single socket. In the
multi-queue solution, CPU X will go to socket X and CPU Y will go to
socket Y. This is equal to what we have today (if only CPU X is active)
or better (if more CPUs are active).
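
(In code terms, the mapping I have in mind is roughly the following --
the names are hypothetical, this is not existing code:)

/* Hypothetical per-CPU connection (socket) selection on submission. */
#include <linux/smp.h>

struct iscsi_conn;			/* opaque here */

struct iscsi_mq_session {
	unsigned int nr_conns;		/* one connection/socket per queue */
	struct iscsi_conn **conns;
};

static struct iscsi_conn *iscsi_mq_pick_conn(struct iscsi_mq_session *s)
{
	/* CPU X submits on connection/socket X (modulo the queue count). */
	return s->conns[raw_smp_processor_id() % s->nr_conns];
}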

Am I missing something?

Sagi.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-11 Thread Sagi Grimberg

On 1/9/2015 8:00 PM, Michael Christie wrote:
SNIP




Session wide command sequence number synchronization isn't something to
be removed as part of the MQ work.  It's an iSCSI/iSER protocol
requirement.

That is, the expected + maximum sequence numbers are returned as part of
every response PDU, which the initiator uses to determine when the
command sequence number window is open so new non-immediate commands may
be sent to the target.

So, given some manner of session wide synchronization is required
between different contexts for the existing single connection case to
update the command sequence number and check when the window opens, it's
a fallacy to claim MC/S adds some type of new initiator specific
synchronization overhead vs. single connection code.


I think you are assuming we are leaving the iscsi code as it is today.

For the non-MCS mq session per CPU design, we would be allocating and binding 
the session and its resources to specific CPUs. They would only be accessed by 
the threads on that one CPU, so we get our serialization/synchronization from 
that. That is why we are saying we do not need something like 
atomic_t/spin_locks for the sequence number handling for this type of 
implementation.

If we just tried to do this with the old code where the session could be 
accessed on multiple CPUs then you are right, we need locks/atomics like how we 
do in the MCS case.



I don't think we will want to restrict ourselves to a session per CPU.
There is a tradeoff question of system resources. We might want to
allow a user to configure multiple HW queues while still not using too
much of the system resources. So the session locks would still be used,
but they would definitely be less congested...

Sagi.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-09 Thread James Bottomley
On Fri, 2015-01-09 at 19:28 +0100, Hannes Reinecke wrote:
[...]
  I think you are assuming we are leaving the iscsi code as it is today.
  
  For the non-MCS mq session per CPU design, we would be allocating and
  binding the session and its resources to specific CPUs. They would only
  be accessed by the threads on that one CPU, so we get our
  serialization/synchronization from that. That is why we are saying we
  do not need something like atomic_t/spin_locks for the sequence number
  handling for this type of implementation.
  
 Wouldn't that need to be coordinated with the networking layer?
 Doesn't it do the same thing, matching TX/RX queues to CPUs?
 If so, wouldn't we decrease bandwidth by restricting things to one CPU?

So this is actually one of the fascinating questions on multi-queue.
Long ago, when I worked for the NCR OS group and we were bringing up the
first SMP systems, we actually found that the SCSI stack went faster
when bound to a single CPU.  The problem in those days was lock
granularity and contention, so single CPU binding eliminated that
overhead.  However, nowadays with modern multi-tiered caching and huge
latencies for cache line bouncing, we're approaching the point where the
fineness of our lock granularity is hurting performance, so it's worth
re-asking the question of whether just dumping all the lock latency by
single CPU binding is a worthwhile exercise.

James



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-09 Thread Michael Christie

On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger n...@linux-iscsi.org wrote:

 On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
 On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
 
 SNIP
 
 The point is that a simple session wide counter for command sequence
 number assignment is significantly less overhead than all of the
 overhead associated with running a full multipath stack atop multiple
 sessions.
 
 I don't see how that's relevant to issue speed, which was the measure we
 were using: The layers above are just a hopper.  As long as they're
 loaded, the MQ lower layer can issue at full speed.  So as long as the
 multipath hopper is efficient enough to keep the queues loaded there's
 no speed degradation.
 
 The problem with a sequence point inside the MQ issue layer is that it
 can cause a stall that reduces the issue speed, so the counter sequence
 point causes a degraded issue speed over the multipath hopper approach
 above even if the multipath approach has a higher CPU overhead.
 
 Now, if the system is close to 100% cpu already, *then* the multipath
 overhead will try to take CPU power we don't have and cause a stall, but
 it's only in the flat out CPU case.
 
 Not to mention that our iSCSI/iSER initiator is already taking a session
 wide lock when sending outgoing PDUs, so adding a session wide counter
 isn't adding any additional synchronization overhead vs. what's already
 in place.
 
 I'll leave it up to the iSER people to decide whether they're redoing
 this as part of the MQ work.
 
 
 Session wide command sequence number synchronization isn't something to
 be removed as part of the MQ work.  It's an iSCSI/iSER protocol
 requirement.
 
 That is, the expected + maximum sequence numbers are returned as part of
 every response PDU, which the initiator uses to determine when the
 command sequence number window is open so new non-immediate commands may
 be sent to the target.
 
 So, given some manner of session wide synchronization is required
 between different contexts for the existing single connection case to
 update the command sequence number and check when the window opens, it's
 a fallacy to claim MC/S adds some type of new initiator specific
 synchronization overhead vs. single connection code.

I think you are assuming we are leaving the iscsi code as it is today.

For the non-MCS mq session per CPU design, we would be allocating and binding 
the session and its resources to specific CPUs. They would only be accessed by 
the threads on that one CPU, so we get our serialization/synchronization from 
that. That is why we are saying we do not need something like 
atomic_t/spin_locks for the sequence number handling for this type of 
implementation.

If we just tried to do this with the old code where the session could be 
accessed on multiple CPUs then you are right, we need locks/atomics like how we 
do in the MCS case.
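
To make the difference concrete, here is a rough sketch (all names are
hypothetical, this is not the current libiscsi code) of the two cases:

/* Per-CPU session: CmdSN state is only touched from the owning CPU's
 * context, so no lock or atomic is needed. */
#include <linux/types.h>
#include <linux/spinlock.h>

struct iscsi_cpu_session {
	u32 cmdsn;		/* next CmdSN to assign */
	u32 exp_cmdsn;		/* from the last response PDU */
	u32 max_cmdsn;		/* from the last response PDU */
};

static bool cpu_session_queue_pdu(struct iscsi_cpu_session *s, u32 *sn)
{
	if ((s32)(s->max_cmdsn - s->cmdsn) < 0)
		return false;		/* window closed, requeue */
	*sn = s->cmdsn++;
	return true;
}

/* Shared session (MC/S or the old single-queue code): multiple CPUs
 * race on the same counters, so the window check and the increment
 * have to be done under a lock (or with atomics). */
struct iscsi_shared_session {
	spinlock_t lock;
	u32 cmdsn, exp_cmdsn, max_cmdsn;
};

static bool shared_session_queue_pdu(struct iscsi_shared_session *s, u32 *sn)
{
	bool open;

	spin_lock_bh(&s->lock);
	open = (s32)(s->max_cmdsn - s->cmdsn) >= 0;
	if (open)
		*sn = s->cmdsn++;
	spin_unlock_bh(&s->lock);
	return open;
}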



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-09 Thread Mike Christie
On 01/09/2015 12:28 PM, Hannes Reinecke wrote:
 On 01/09/2015 07:00 PM, Michael Christie wrote:

 On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger n...@linux-iscsi.org 
 wrote:

 On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
 On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
 On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:

 SNIP

 The point is that a simple session wide counter for command sequence
 number assignment is significantly less overhead than all of the
 overhead associated with running a full multipath stack atop multiple
 sessions.

 I don't see how that's relevant to issue speed, which was the measure we
 were using: The layers above are just a hopper.  As long as they're
 loaded, the MQ lower layer can issue at full speed.  So as long as the
 multipath hopper is efficient enough to keep the queues loaded there's
 no speed degradation.

 The problem with a sequence point inside the MQ issue layer is that it
 can cause a stall that reduces the issue speed, so the counter sequence
 point causes a degraded issue speed over the multipath hopper approach
 above even if the multipath approach has a higher CPU overhead.

 Now, if the system is close to 100% cpu already, *then* the multipath
 overhead will try to take CPU power we don't have and cause a stall, but
 it's only in the flat out CPU case.

 Not to mention that our iSCSI/iSER initiator is already taking a session
 wide lock when sending outgoing PDUs, so adding a session wide counter
 isn't adding any additional synchronization overhead vs. what's already
 in place.

 I'll leave it up to the iSER people to decide whether they're redoing
 this as part of the MQ work.


 Session wide command sequence number synchronization isn't something to
 be removed as part of the MQ work.  It's an iSCSI/iSER protocol
 requirement.

 That is, the expected + maximum sequence numbers are returned as part of
 every response PDU, which the initiator uses to determine when the
 command sequence number window is open so new non-immediate commands may
 be sent to the target.

 So, given some manner of session wide synchronization is required
 between different contexts for the existing single connection case to
 update the command sequence number and check when the window opens, it's
 a fallacy to claim MC/S adds some type of new initiator specific
 synchronization overhead vs. single connection code.

 I think you are assuming we are leaving the iscsi code as it is today.

 For the non-MCS mq session per CPU design, we would be allocating and
 binding the session and its resources to specific CPUs. They would only
 be accessed by the threads on that one CPU, so we get our
 serialization/synchronization from that. That is why we are saying we
 do not need something like atomic_t/spin_locks for the sequence number
 handling for this type of implementation.

 Wouldn't that need to be coordinated with the networking layer?

Yes.

 Doesn't it do the same thing, matching TX/RX queues to CPUs?

Yes.

 If so, wouldn't we decrease bandwidth by restricting things to one CPU?

We have a session or connection per CPU though, so we end up hitting the
same problem you talked about last year where one hctx (iscsi session or
connection's socket or nic hw queue) could get overloaded. This is what
I meant in my original mail where iscsi would rely on whatever blk/mq
load balancers we end up implementing at that layer to balance requests
across hctxs.



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-08 Thread James Bottomley
On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
 On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
  On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
   On Thu, 2015-01-08 at 08:50 +0100, Bart Van Assche wrote:
On 01/07/15 22:39, Mike Christie wrote:
 On 01/07/2015 10:57 AM, Hannes Reinecke wrote:
 On 01/07/2015 05:25 PM, Sagi Grimberg wrote:
 Hi everyone,

 Now that scsi-mq is fully included, we need an iSCSI initiator that
 would use it to achieve scalable performance. The need is even greater
 for iSCSI offload devices and transports that support multiple HW
 queues. As iSER maintainer I'd like to discuss the way we would choose
 to implement that in iSCSI.

 My measurements show that the iSER initiator can scale up to ~2.1M IOPs
 with multiple sessions but only ~630K IOPs with a single session, where
 the most significant bottleneck is the (single) core processing
 completions.

 In the existing single connection per session model, given that command
 ordering must be preserved session-wide, we end up with serial command
 execution over a single connection, which is basically a single queue
 model. The best fit seems to be plugging in iSCSI MCS as a multi-queued
 scsi LLDD. In this model, a hardware context will have a 1x1 mapping
 with an iSCSI connection (TCP socket or a HW queue).

 iSCSI MCS and its role in the presence of the dm-multipath layer have
 been discussed several times in the past decade(s). The basic need for
 MCS is implementing a multi-queue data path, so perhaps we may want to
 avoid doing any type of link aggregation or load balancing so as not to
 overlap dm-multipath. For example we can implement ERL=0 (which is
 basically the scsi-mq ERL) and/or restrict a session to a single portal.

 As I see it, the todo's are:
 1. Getting MCS to work (kernel + user-space) with ERL=0 and a
    round-robin connection selection (per scsi command execution).
 2. Plug into scsi-mq - exposing num_connections as nr_hw_queues and
    using blk-mq based queue (conn) selection.
 3. Rework iSCSI core locking scheme to avoid session-wide locking
    as much as possible.
 4. Use blk-mq pre-allocation and tagging facilities.

 I've recently started looking into this. I would like the community to
 agree (or debate) on this scheme and also talk about implementation
 with anyone who is also interested in this.

 Yes, that's a really good topic.

 I've pondered implementing MC/S for iscsi/TCP, but then I figured my
 network implementation knowledge doesn't stretch that far.
 So yeah, a discussion here would be good.

 Mike? Any comments?

 I have been working under the assumption that people would be ok with
 MCS upstream if we are only using it to handle the issue where we want
 to do something like have a tcp/iscsi connection per CPU and then map
 the connection to a blk_mq_hw_ctx. In this more limited MCS
 implementation there would be no iscsi layer code to do something like
 load balance across ports or transport paths the way dm-multipath does,
 so there would be no feature/code duplication. For balancing across
 hctxs, the iscsi layer would also leave that up to whatever we end up
 with in upper layers, so again no feature/code duplication with upper
 layers.

 So pretty non-controversial, I hope :)

 If people want to add something like round-robin connection selection
 in the iscsi layer, then I think we want to leave that for after the
 initial merge, so people can argue about that separately.

Hello Sagi and Mike,

I agree with Sagi that adding scsi-mq support in the iSER initiator
would help iSER users, because that would allow these users to
configure a single iSER target and use the multiqueue feature instead
of having to configure multiple iSER targets to spread the workload
over multiple cpus at the target side.

And I agree with Mike that implementing scsi-mq support in the iSER
initiator as multiple independent connections is probably a better
choice than MC/S. RFC 3720 namely requires that iSCSI command numbering
is session-wide. This means maintaining a single counter shared by all
connections in an MC/S session. Such a counter would be a contention
point. I'm afraid that because of that counter, performance on a
multi-socket initiator system with a scsi-mq implementation based on
MC/S could be worse than with the approach of multiple iSER targets.
Hence my preference for an approach based on multiple independent iSER
connections instead of MC/S.
   
   The idea that a simple session wide counter for command sequence number
   assignment adds such a degree of contention that it 

Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-08 Thread Jan Kara
On Wed 07-01-15 09:22:13, Lee Duncan wrote:
 On 01/07/2015 08:25 AM, Sagi Grimberg wrote:
  Hi everyone,
  
  Now that scsi-mq is fully included, we need an iSCSI initiator that
  would use it to achieve scalable performance. The need is even greater
  for iSCSI offload devices and transports that support multiple HW
  queues. As iSER maintainer I'd like to discuss the way we would choose
  to implement that in iSCSI.
  
  My measurements show that the iSER initiator can scale up to ~2.1M IOPs
  with multiple sessions but only ~630K IOPs with a single session, where
  the most significant bottleneck is the (single) core processing
  completions.
  
  In the existing single connection per session model, given that command
  ordering must be preserved session-wide, we end up in a serial command
  execution over a single connection which is basically a single queue
  model. The best fit seems to be plugging iSCSI MCS as a multi-queued
  scsi LLDD. In this model, a hardware context will have a 1x1 mapping
  with an iSCSI connection (TCP socket or a HW queue).
  
  iSCSI MCS and its role in the presence of the dm-multipath layer have
  been discussed several times in the past decade(s). The basic need for
  MCS is implementing a multi-queue data path, so perhaps we may want to
  avoid doing any type of link aggregation or load balancing so as not to overlap
  dm-multipath. For example we can implement ERL=0 (which is basically the
  scsi-mq ERL) and/or restrict a session to a single portal.
  
  As I see it, the todo's are:
  1. Getting MCS to work (kernel + user-space) with ERL=0 and a
 round-robin connection selection (per scsi command execution).
  2. Plug into scsi-mq - exposing num_connections as nr_hw_queues and
 using blk-mq based queue (conn) selection.
  3. Rework iSCSI core locking scheme to avoid session-wide locking
 as much as possible.
  4. Use blk-mq pre-allocation and tagging facilities.
  
  I've recently started looking into this. I would like the community to
  agree (or debate) on this scheme and also talk about implementation
  with anyone who is also interested in this.
  
  Cheers,
  Sagi.
 
 I started looking at this last year (at Hannes' suggestion), and would
 love to join the discussion.
 
 Please add me to the list of those that wish to attend.
  For that please send a separate email with an attend request, as described in
the call for proposals. Thanks!

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR



Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-08 Thread James Bottomley
On Thu, 2015-01-08 at 21:03 -0800, Nicholas A. Bellinger wrote:
 On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
  On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
   On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
 
 SNIP
 
   The point is that a simple session wide counter for command sequence
   number assignment is significantly less overhead than all of the
   overhead associated with running a full multipath stack atop multiple
   sessions.
  
  I don't see how that's relevant to issue speed, which was the measure we
  were using: The layers above are just a hopper.  As long as they're
  loaded, the MQ lower layer can issue at full speed.  So as long as the
  multipath hopper is efficient enough to keep the queues loaded there's
  no speed degradation.
  
  The problem with a sequence point inside the MQ issue layer is that it
  can cause a stall that reduces the issue speed, so the counter sequence
  point causes a degraded issue speed over the multipath hopper approach
  above even if the multipath approach has a higher CPU overhead.
  
  Now, if the system is close to 100% cpu already, *then* the multipath
  overhead will try to take CPU power we don't have and cause a stall, but
  it's only in the flat out CPU case.
  
   Not to mention that our iSCSI/iSER initiator is already taking a session
   wide lock when sending outgoing PDUs, so adding a session wide counter
   isn't adding any additional synchronization overhead vs. what's already
   in place.
  
  I'll leave it up to the iSER people to decide whether they're redoing
  this as part of the MQ work.
  
 
 Session wide command sequence number synchronization isn't something to
 be removed as part of the MQ work.  It's an iSCSI/iSER protocol
 requirement.

The sequence number is a requirement of the session.  Multiple separate
sessions mean no SN correlation between the different connections, so
no global requirement for a SN counter across the queues ... that's what
Mike was saying about implementing multipath not using MCS.  With MCS we
have a single session for all the queues and thus have to correlate the
sequence number across all the connections and hence all the queues;
without it we don't.  That's why the sequence number becomes a potential
stall point in an MQ implementation of MCS, which can be obviated if we use
a separate session per queue.
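
To put the correlation point in code form, a rough sketch (hypothetical
names, not the actual initiator code): under MC/S every response PDU,
on whichever connection it arrives, updates the one session-wide window
that all queues allocate CmdSNs from, so every queue's RX path ends up
touching the same state:

/* Sketch only: session-wide CmdSN window shared by all connections. */
#include <linux/types.h>
#include <linux/spinlock.h>

struct mcs_session {
	spinlock_t lock;	/* shared by every connection/queue */
	u32 exp_cmdsn;		/* ExpCmdSN from the latest response PDU */
	u32 max_cmdsn;		/* MaxCmdSN from the latest response PDU */
};

/* Called from each connection's RX path for every response PDU. */
static void mcs_update_window(struct mcs_session *s, u32 exp, u32 max)
{
	spin_lock_bh(&s->lock);
	if ((s32)(exp - s->exp_cmdsn) > 0)	/* only move forward */
		s->exp_cmdsn = exp;
	if ((s32)(max - s->max_cmdsn) > 0)
		s->max_cmdsn = max;
	spin_unlock_bh(&s->lock);
}

With a separate session per queue, each queue has its own instance of
this state and the update never crosses queues.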

James

