Re: [PATCH v1 1/3] IB/srp: Fix crash when unmapping data loop

2014-02-24 Thread Sebastian Riemer
On 24.02.2014 15:30, Sagi Grimberg wrote:
 When unmapping request data, it is unsafe automatically
 decrement req->nfmr regardless of it's value. This may
 happen since IO and reconnect flow may run concurrently
 resulting in req->nfmr = -1 and falsely call ib_fmr_pool_unmap.

The wording is still a bit strange here. What about the following:
"unsafe to decrement req->nfmr automatically"

"its" instead of "it's"

and "calling ib_fmr_pool_unmap falsely"

 Fix the loop condition to be greater than zero (which
 explicitly means that FMRs were used on this request)
 and only increment when needed.
 
 This crash is easily reproduceable with ConnectX VFs OR
 Connect-IB (where FMRs are not supported)
 
 Signed-off-by: Sagi Grimberg sa...@mellanox.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 529b6bc..0e20bfb 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -766,8 +766,11 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
   return;
  
   pfmr = req->fmr_list;
 - while (req->nfmr--)
 +
 + while (req->nfmr > 0) {
   ib_fmr_pool_unmap(*pfmr++);
 + req->nfmr--;
 + }
  
   ib_dma_unmap_sg(ibdev, scsi_sglist(scmnd), scsi_sg_count(scmnd),
   scmnd->sc_data_direction);
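
The difference between the two loop forms can be seen in a small user-space sketch. This is not the ib_srp code: struct fake_req, unmap_old() and unmap_new() are hypothetical names, the counter plays the role of req->nfmr, and each counted iteration stands for one ib_fmr_pool_unmap() call. The old form gets an artificial cap so that the runaway case terminates.

```c
#include <assert.h>

/* Hypothetical stand-in for the request; nfmr mirrors req->nfmr. */
struct fake_req {
	int nfmr;
};

/*
 * Old loop form: "while (req->nfmr--)". Returns the number of unmap
 * calls it would have made. The cap exists only so that this sketch
 * terminates when nfmr is already negative on entry.
 */
static int unmap_old(struct fake_req *req, int cap)
{
	int calls = 0;

	assert(cap > 0);
	while (req->nfmr-- && calls < cap)
		calls++;
	return calls;
}

/* New loop form: only iterate, and only decrement, while nfmr > 0. */
static int unmap_new(struct fake_req *req)
{
	int calls = 0;

	while (req->nfmr > 0) {
		calls++;
		req->nfmr--;
	}
	return calls;
}
```

With the old form the counter is left at -1 after the first pass, so a second pass over the same request (for example from a concurrent reconnect) keeps looping until the cap. With the new form the second pass does nothing.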
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] scsi_transport_srp: Fix two kernel-doc warnings

2014-02-20 Thread Sebastian Riemer
On 20.02.2014 11:51, Bart Van Assche wrote:
 This patch fixes the following two kernel-doc warnings:
 
 Warning(drivers/scsi/scsi_transport_srp.c:819): No description found for 
 parameter 'rport'
 Warning(include/scsi/scsi_transport_srp.h:75): Excess 
 struct/union/enum/typedef member 'deleted' description in 'srp_rport'
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Reported-by: Masanari Iida standby2...@gmail.com
 Cc: Sagi Grimberg sa...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 Cc: James Bottomley jbottom...@parallels.com
 Cc: Roland Dreier rol...@kernel.org
 ---
  drivers/scsi/scsi_transport_srp.c | 1 +
  include/scsi/scsi_transport_srp.h | 1 -
  2 files changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/scsi/scsi_transport_srp.c 
 b/drivers/scsi/scsi_transport_srp.c
 index d47ffc8..13e8983 100644
 --- a/drivers/scsi/scsi_transport_srp.c
 +++ b/drivers/scsi/scsi_transport_srp.c
 @@ -810,6 +810,7 @@ EXPORT_SYMBOL_GPL(srp_remove_host);
  
  /**
   * srp_stop_rport_timers - stop the transport layer recovery timers
 + * @rport: SRP remote port for which to stop the timers.
   *
   * Must be called after srp_remove_host() and scsi_remove_host(). The caller
   * must hold a reference on the rport (rport->dev) and on the SCSI host
 diff --git a/include/scsi/scsi_transport_srp.h 
 b/include/scsi/scsi_transport_srp.h
 index b11da5c..cdb05dd 100644
 --- a/include/scsi/scsi_transport_srp.h
 +++ b/include/scsi/scsi_transport_srp.h
 @@ -41,7 +41,6 @@ enum srp_rport_state {
   * @mutex: Protects against concurrent rport reconnect /
   * fast_io_fail / dev_loss_tmo activity.
   * @state: rport state.
 - * @deleted:   Whether or not srp_rport_del() has already been invoked.
   * @reconnect_delay:   Reconnect delay in seconds.
   * @failed_reconnects: Number of failed reconnect attempts.
   * @reconnect_work:Work structure used for scheduling reconnect attempts.
 

This is trivial. Thanks!
Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com


IB/srp: merge fixes from MLNX_OFED

2014-02-18 Thread Sebastian Riemer
Hi Sagi,

is that /mswg/git/mlnx_ofed/mlnx-ofed-2.x-kernel.git tree from
MLNX_OFED public by any chance?

It includes fixes that are relevant for mainline. It would be strange if
I sent the patches, since somebody at Mellanox discovered and fixed
the issues.

I've hit a kernel panic today during testing caused by the loop around
ib_fmr_pool_unmap(). The loop has been fixed in MLNX_OFED. So there
should be a patch sent for it to the linux-rdma mailing list.

I've also noticed the added target locking around target->free_tx
handling in srp_rport_reconnect(). There are cases, e.g. in
srp_queuecommand(), where holding the rport mutex isn't enough to protect
it. So this looks right to me.

Then, in srp_create_target() I've noticed the check of the return value
of ib_query_gid(). It makes complete sense to check it.

Please send patches for such obvious fixes to the mailing list! There is a
very good chance that they get accepted.

Cheers,
Sebastian


Re: SRP initiator driver maintainership

2014-01-21 Thread Sebastian Riemer
On 21.01.2014 11:03, Sagi Grimberg wrote:
 On 1/20/2014 7:37 PM, Bart Van Assche wrote:
 On 01/03/14 22:16, David Dillow wrote:
 Today was my last day at ORNL, and my future endeavors will leave even
 less time to maintain the SRP initiator.

 My thanks especially go to Bart, for keeping the pressure to improve
 alive, and for driving so many of those improvements.

 diff --git a/MAINTAINERS b/MAINTAINERS
 index 6c20792..a36f1b5 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -7466,7 +7466,6 @@ S:Maintained
   F:drivers/scsi/sr*
 SCSI RDMA PROTOCOL (SRP) INITIATOR
 -M:David Dillow dillo...@ornl.gov
   L:linux-rdma@vger.kernel.org
   S:Supported
   W:http://www.openfabrics.org
 (replying to an e-mail of two weeks ago)

 Hello Dave,

 Thanks for all the time you have spent reviewing and testing SRP
 initiator patches. Such maintainer work is unglamorous but important -
 it is due to the combined effort of all kernel maintainers that the
 Linux kernel earned its high quality reputation.

 Roland, what is your preference with regard to maintainership of the SRP
 initiator driver ? My plan is to continue contributing patches to the
 SRP initiator driver at about the same pace as I had done in the past.
 Do you prefer to take over maintainership of this driver yourself or is
 it okay for you that I become the official maintainer for this driver ?

 Thanks,

 Bart.
 
 
 Bart,
 
 Your contribution to SRP was and still is important!
 You led the efforts improving and stabilizing SRP driver and
 adding the fast-failover logic which was needed for so long.
 
 Roland,
 I collaborated with Bart on SRP enhancement in the past year or so
 and I think Bart is a perfect match for SRP maintainership.
 
 Sagi.

+1 from me for Bart! Thanks for the collaboration!

Cheers,
Sebastian



OpenSM 3.3.16 at 100% CPU load, console off

2013-10-09 Thread Sebastian Riemer
Hi Hal,

we've encountered an issue with OpenSM 3.3.16 and the config option
"console off".
OpenSM processes are at 100% CPU load.

From strace:
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)   = 0
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)   = 0
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)   = 0

As far as I've seen in the code, the function osm_console() from
opensm/osm_console.c is the only function which uses poll().

Is this issue already known or perhaps already fixed?

Thanks,
Sebastian


Re: OpenSM 3.3.16 at 100% CPU load, console off

2013-10-09 Thread Sebastian Riemer
On 09.10.2013 15:30, David Dillow wrote:
 On Wed, 2013-10-09 at 09:28 -0400, Hal Rosenstock wrote:
 From strace:
 poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
 read(0, "", 4096)   = 0
 poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
 read(0, "", 4096)   = 0
 poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
 read(0, "", 4096)   = 0

 So this doesn't block for 1 second and that's why the CPU is 100% ?
 
 Looks like it is spinning on a closed socket (or stdin) -- calling
 poll() on such will return immediately...
 

Thanks for the responses!

I've seen in the code that the local console is initialized but is not
released correctly. Should be done in osm_console_exit().

Something like this:

   if (p_oct->in_fd >= 0) {
           p_oct->in = NULL;
           p_oct->out = NULL;
           p_oct->in_fd = -1;
           p_oct->out_fd = -1;
   }

I guess what happened was that "console local" was set, then changed in
the config to "console off", and the service was restarted. Restarting
the service again didn't help.

It is strange that the console_init_flag is still set. The function
osm_console() returns 0 if poll() fails. If it returned something
else, then the console_init_flag would be set to 0 again and there would
be no issue anymore, I suppose.
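
To illustrate the spinning described above, here is a minimal user-space sketch of detecting the end-of-file condition after poll(). It is not the OpenSM code; console_fd_at_eof() is a hypothetical helper. It only shows that a descriptor which poll() reports as ready but from which read() returns 0 should no longer be polled.

```c
#include <poll.h>
#include <unistd.h>
#include <assert.h>

/*
 * Poll fd once. Returns 1 if the descriptor is at end-of-file (poll()
 * reports it readable or hung up, but read() returns 0 bytes), 0 if
 * data arrived or the timeout expired, and -1 on error.
 */
static int console_fd_at_eof(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char buf[4096];
	ssize_t n;
	int rc;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;	/* timeout: nothing to read */
	if (!(pfd.revents & (POLLIN | POLLHUP)))
		return -1;	/* POLLERR or POLLNVAL */

	n = read(fd, buf, sizeof(buf));
	if (n < 0)
		return -1;
	return n == 0;		/* 0 bytes after "ready" means EOF */
}
```

A caller that gets 1 back would stop polling that descriptor (for example by setting its in_fd to -1) instead of calling poll() again, which returns immediately every time and produces the 100% CPU load seen in the strace output.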

Cheers,
Sebastian


Re: OpenSM 3.3.16 at 100% CPU load, console off

2013-10-09 Thread Sebastian Riemer
On 09.10.2013 16:00, Hal Rosenstock wrote:

 Do you recall the sequence to get to this ?
 
 Was console option changed to off and then OpenSM SIGHUP'd ? Something
 else ?
 
 Is this reproducible ?

Yes, now I can reproduce it. The opensm was initially started with
"console off"; I then activate "console local" and restart the service.
CPU load is at 100% immediately. I set "console off" again and restart
the service and CPU load is low again.

I did this three times in a row now. The third time it even
remained at 100% load in the "off" state. I set "local" and "off"
again and CPU load was low again.

Cheers,
Sebastian



Re: OpenSM 3.3.16 at 100% CPU load, console off

2013-10-09 Thread Sebastian Riemer
On 09.10.2013 17:15, Hal Rosenstock wrote:
 What does service restart do in terms of OpenSM ?
 
 Note that the console parameter is _not_ changeable on the fly right
 now so if OpenSM is being SIGHUP'd by service restart then this is a
 current limitation (and is clearly not detected/protected against in the
 current code base). It sounds like that may be what is going on.

Yes, it emits SIGHUP. Thanks for the information! The opensm is a
critical component. So IMHO it needs to be fixed in a way that it either
protects itself against such changes by ignoring them on the fly or it
needs to support these changes.
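
The first of the two alternatives (ignoring such changes on the fly) could look roughly like the following sketch. The struct and its fields are hypothetical and much smaller than the real OpenSM option structure; the point is only that an option which cannot change at runtime keeps its current value when the configuration is re-read on SIGHUP.

```c
#include <string.h>
#include <assert.h>

/* Hypothetical subset of the options; not the real OpenSM structure. */
struct sm_opts {
	char console[16];	/* "off", "local", ... */
	int sweep_interval;	/* example of an option that may change live */
};

/*
 * Apply freshly parsed options on SIGHUP. Options that cannot be
 * changed at runtime keep their current value. Returns the number of
 * requested changes that were ignored.
 */
static int apply_reloaded_opts(struct sm_opts *cur,
			       const struct sm_opts *parsed)
{
	int ignored = 0;

	if (strcmp(cur->console, parsed->console) != 0)
		ignored++;	/* console mode stays fixed until a full restart */

	cur->sweep_interval = parsed->sweep_interval;
	return ignored;
}
```

A non-zero return value would be a natural place to log a warning that the console setting requires a full restart.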

The current situation is not really acceptable and the opensm stability
is crucial. So I'll think about fixing it.
Are you interested in patches in this regard?

Cheers,
Sebastian




Re: [PATCH] IB/srp: Let srp_abort() return FAST_IO_FAIL if TL offline

2013-07-10 Thread Sebastian Riemer
Hi Bart,

my patch looks very similar. I was in a company meeting so I couldn't
send it fast enough.

Can be applied that way! Thanks!

Cheers,
Sebastian


On 10.07.2013 17:36, Bart Van Assche wrote:
 If the transport layer is offline it is more appropriate to let
 srp_abort() return FAST_IO_FAIL instead of SUCCESS.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Reported-by: Sebastian Riemer sebastian.rie...@profitbricks.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Roland Dreier rol...@purestorage.com
 Cc: Vu Pham v...@mellanox.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 9d8b46e..f93baf8 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1753,8 +1753,7 @@ static int srp_abort(struct scsi_cmnd *scmnd)
   if (!req || !srp_claim_req(target, req, scmnd))
   return FAILED;
   if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 -   SRP_TSK_ABORT_TASK) == 0 ||
 - target->transport_offline)
 +   SRP_TSK_ABORT_TASK) == 0)
   ret = SUCCESS;
   else if (target->transport_offline)
   ret = FAST_IO_FAIL;
 



Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline

2013-07-02 Thread Sebastian Riemer
On 28.06.2013 14:49, Bart Van Assche wrote:
 If reconnecting failed we know that no command completion will
 be received anymore. Hence let the SCSI error handler fail such
 commands immediately.

Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com


Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline

2013-07-01 Thread Sebastian Riemer
On 28.06.2013 14:49, Bart Van Assche wrote:
 If reconnecting failed we know that no command completion will
 be received anymore. Hence let the SCSI error handler fail such
 commands immediately.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 Cc: Vu Pham v...@mellanox.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 8c95262..5c91521 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd)
   if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 SRP_TSK_ABORT_TASK) == 0)
   ret = SUCCESS;
 + else if (target->transport_offline)
 + ret = FAST_IO_FAIL;
   else
   ret = FAILED;
   srp_free_req(target, req, scmnd, 0);
 

This doesn't give us much speed advantage IMHO. The check for
target->transport_offline should be before calling srp_send_tsk_mgmt().

This way it would also match the patch description better.

Cheers,
Sebastian


Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline

2013-07-01 Thread Sebastian Riemer
On 28.06.2013 14:49, Bart Van Assche wrote:
 If reconnecting failed we know that no command completion will
 be received anymore. Hence let the SCSI error handler fail such
 commands immediately.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 Cc: Vu Pham v...@mellanox.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 8c95262..5c91521 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd)
   if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 SRP_TSK_ABORT_TASK) == 0)
   ret = SUCCESS;
 + else if (target->transport_offline)
 + ret = FAST_IO_FAIL;
   else
   ret = FAILED;
   srp_free_req(target, req, scmnd, 0);
 

I'm also missing the concept for srp_reset_device(). There is a very
common case that the SCSI error handling and the transport layer error
handling run in parallel: Congestion.

In congestion some LUNs are blocked while others can still transmit. A
little bit later the QP timeout triggers in the middle of the SCSI error
handling in srp_abort(), srp_reset_device() or less likely in
srp_reset_host().

Cheers,
Sebastian


Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline

2013-07-01 Thread Sebastian Riemer
On 01.07.2013 13:33, Bart Van Assche wrote:
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd)
   if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 SRP_TSK_ABORT_TASK) == 0)
   ret = SUCCESS;
 +else if (target->transport_offline)
 +ret = FAST_IO_FAIL;
   else
   ret = FAILED;
   srp_free_req(target, req, scmnd, 0);


 This doesn't give us much speed advantage IMHO. The check for
 target->transport_offline should be before calling srp_send_tsk_mgmt().

 This way it would also match the patch description better.
 
 Hello Sebastian,
 
 Had you perhaps overlooked the following code at the start of
 srp_send_tsk_mgmt() ?
 
 if (!target->connected || target->qp_in_error)
 return -1;
 
 Given this I don't think it matters whether the transport_offline check
 occurs before or after the srp_send_tsk_mgmt() call.

Hi Bart,

okay, right. So you get an error due to the connected and qp_in_error
state first. Yes, I've overlooked that. Thanks!
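
The point can be sketched in user space as follows. This is only a model: the enum values are arbitrary stand-ins for the SCSI error handler return codes, and struct fake_target, fake_send_tsk_mgmt() and fake_abort() are hypothetical names that mirror only the fields and checks quoted in this thread. It assumes that an offline transport is also seen as not connected or in QP error.

```c
#include <stdbool.h>
#include <assert.h>

/* Stand-ins for the SCSI EH return codes; the values are arbitrary. */
enum eh_result { EH_SUCCESS, EH_FAILED, EH_FAST_IO_FAIL };

/* Hypothetical model of the target state fields named in the thread. */
struct fake_target {
	bool connected;
	bool qp_in_error;
	bool transport_offline;
	int tsk_mgmt_sent;	/* counts requests that would reach the wire */
};

/* Mirrors the quoted early return at the start of srp_send_tsk_mgmt(). */
static int fake_send_tsk_mgmt(struct fake_target *t)
{
	if (!t->connected || t->qp_in_error)
		return -1;
	t->tsk_mgmt_sent++;
	return 0;	/* this sketch assumes the target answers */
}

/* Same decision order as in the patch under discussion. */
static enum eh_result fake_abort(struct fake_target *t)
{
	if (fake_send_tsk_mgmt(t) == 0)
		return EH_SUCCESS;
	else if (t->transport_offline)
		return EH_FAST_IO_FAIL;
	return EH_FAILED;
}
```

Under that assumption nothing is sent and nothing is waited for when the transport is offline, so checking transport_offline before or after the call gives the same result at practically the same cost.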

Cheers,
Sebastian



Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline

2013-07-01 Thread Sebastian Riemer
On 01.07.2013 13:38, Bart Van Assche wrote:
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd)
   if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 SRP_TSK_ABORT_TASK) == 0)
   ret = SUCCESS;
 +else if (target->transport_offline)
 +ret = FAST_IO_FAIL;
   else
   ret = FAILED;
   srp_free_req(target, req, scmnd, 0);

 I'm also missing the concept for srp_reset_device(). There is a very
 common case that the SCSI error handling and the transport layer error
 handling run in parallel: Congestion.
 
 Can you explain this comment further, and also how this comment relates
 to patch 04/15 ?

Sorry, found it. Even if only one srp_reset_device() fails, then
srp_reset_host() is called anyway. So there this check + returning
FAST_IO_FAIL doesn't make so much sense.

 In congestion some LUNs are blocked while others can still transmit. A
 little bit later the QP timeout triggers in the middle of the SCSI error
 handling in srp_abort(), srp_reset_device() or less likely in
 srp_reset_host().
 
 I am aware this can result in concurrent srp_reconnect_rport() calls.
 However, such concurrent calls are serialized via rport->mutex.

I put my comment regarding this to patch 10.

Cheers,
Sebastian


Re: [PATCH 01/14] IB/srp: Fix remove_one crash due to resource exhaustion

2013-06-28 Thread Sebastian Riemer
On 28.06.2013 01:45, Roland Dreier wrote:
 On Thu, Jun 27, 2013 at 2:01 PM, David Dillow dillo...@ornl.gov wrote:
 On Wed, 2013-06-12 at 15:20 +0200, Bart Van Assche wrote:
 If the add_one callback fails during driver load no resources are
 allocated so there isn't a need to release any resources. Trying
 to clean the resource may lead to the following kernel panic:

 Acked-by: David Dillow dillo...@ornl.gov
 
 Thanks, I've queued up the 1, 3, and 4/14 patches that Dave acked so far.

Hi Roland,

did you queue 3 without the target->transport_offline check?

Otherwise I can't agree on that.
1 and 4 are also okay for me.

Cheers,
Sebastian


Re: [PATCH v2 02/15] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()

2013-06-28 Thread Sebastian Riemer
On 28.06.2013 14:48, Bart Van Assche wrote:
 Avoid that srp_claim_command() can claim a command while
 srp_queuecommand() is still busy queueing the same command.
 Found this via source reading.

Nice, that's much less re-acquiring of the target lock in error case in
srp_queuecommand().
But if we have to change that many locations for srp_put_tx_iu() anyway,
wouldn't it make sense to rename it into __srp_put_tx_iu() as well?

Then we can also put a little description to it and it looks familiar
compared to __srp_get_tx_iu().

The description could look as follows:
/*
 * Return an IU and possible credit to the free pool
 *
 * Must be called with target->lock held to protect free_tx.
 */

I'm not sure if we still need that lockdep_assert_held() then. There is
no other location with lock debugging in ib_srp.

Cheers,
Sebastian


Re: [PATCH v2 02/15] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()

2013-06-28 Thread Sebastian Riemer
On 28.06.2013 16:51, Bart Van Assche wrote:
 Nice, that's much less re-acquiring of the target lock in error case in
 srp_queuecommand().
 But if we have to change that many locations for srp_put_tx_iu() anyway,
 wouldn't it make sense to rename it into __srp_put_tx_iu() as well?

 Then we can also put a little description to it and it looks familiar
 compared to __srp_get_tx_iu().

 The description could look like follows:
 /*
   * Return an IU and possible credit to the free pool
   *
   * Must be called with target->lock held to protect free_tx.
   */

 I'm not sure if we still need that lockdep_assert_held() then. There is
 no other location with lock debugging in ib_srp.
 
 Hello Sebastian,
 
 I don't have a strong opinion about either of these two topics.
 
 If a function like __srp_get_tx_iu() would be introduced that would
 allow to drop only two spin_lock/spin_unlock call pairs. So introducing
 that function would probably add more lines of code than adding the
 spin_lock/spin_unlock pairs. Hence my choice not to introduce
 __srp_get_tx_iu().
 
 Regarding the lockdep_assert_held() statement: the reason I introduced
 it instead of adding a comment above the function telling which locking
 is required is because a lockdep_assert_held() statement is verified at
 runtime on a system with a kernel in which lockdep support has been
 enabled.

Hi Bart,

I just meant a rename into __srp_put_tx_iu() to show that locking is
required and not introducing a further wrapper function.

The other function of that kind __srp_get_tx_iu() also doesn't have a
wrapper function srp_get_tx_iu().

For me it doesn't make much difference how it is marked that locking is
required. I just wanted to point out that this method is new to ib_srp.
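
The two ways of marking the locking requirement can be put side by side in a user-space sketch. Everything here is hypothetical: toy_lock is not a spinlock, it only records whether it is held, and the assert() plays the role that lockdep_assert_held() plays in a lockdep-enabled kernel. The leading double underscore only mimics the kernel naming convention discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only records whether it is held; not a real spinlock. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct toy_target {
	struct toy_lock lock;
	int free_tx;	/* number of IUs on the free list */
	int req_lim;	/* available credits */
};

/*
 * Return an IU and, optionally, a credit to the free pool.
 *
 * Must be called with target->lock held to protect free_tx. The comment
 * documents the rule; the assert checks it at runtime.
 */
static void __toy_put_tx_iu(struct toy_target *target, bool return_credit)
{
	assert(target->lock.held);
	target->free_tx++;
	if (return_credit)
		target->req_lim++;
}
```

The comment and the name are visible to the reader of the code, while the assertion catches a caller that forgot the lock when the check is compiled in.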

Cheers,
Sebastian


Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus

2013-06-17 Thread Sebastian Riemer
On 14.06.2013 19:07, Vu Pham wrote:
[...]
 For what do you need the same target with multiple pkeys on the same
 local SRP port?
   
 There is no need, it's just a gray area that you can choose to have
 multiple connections to same target using different pkeys (same as dgid)
 Which other SRP targets exist?
   
 
 Netapp/LSI/Engenio, DDN, TexasMemorySystem Ramsan (IBM), Nimbus, Violin
 Memory, StreamScale
 The last three may be derived from SCST base target.
 
 I only know SCST, Solaris COMSTAR and that broken LIO stuff.
 Does SCST still not support to set the pkey?

   
 Yes, I think so
 
 Why should we check the dgid?

   
 If you want to have multiple connections/qps to same target, but as I
 said above, it's a gray area.
 
 Doesn't make any sense to me to connect both target ports to the same
 local port. 
 What if a target always expose single consistent and unique SRP port
 with tuple id_ext, ioc_guid, the ioc_guid part is not derived from any
 of its local HCA's GUID, then you can connect to this target thru
 different HCA ports (different dgid) as different paths to same target.

Do you have an example for a target which does it like this or a use
case where this makes sense?

I guess you're proposing here to use a driver global list of target
connections instead of handling this per local SRP port. This would
result in bigger changes which I wouldn't do without a good reason.

 If you do so, the multipath-tools will crash. Note: This
 function is called per local SRP port. Perhaps, a note should be added
 to that function that it only has to be called per local SRP port.



Re: [PATCH 07/14] scsi_transport_srp: Add transport layer error handling

2013-06-17 Thread Sebastian Riemer
On 17.06.2013 09:29, Bart Van Assche wrote:
 On 06/17/13 09:14, Hannes Reinecke wrote:
 On 06/17/2013 09:04 AM, Bart Van Assche wrote:
 I agree that the value of fast_io_fail_tmo should be kept small.
 Although as you explained changing the SCSI device state into
 SDEV_BLOCK doesn't help for I/O that has already been queued on a
 failed path, I think it's still useful for I/O that is queued after
 the fast_io_fail timer has been started and before that timer has
 expired.

 Why, but of course.

 The typical scenario would be:
 - detect link-loss
 - call scsi_block_request()
 - start dev_loss_tmo and fast_io_fail_tmo

 - When fast_io_fail_tmo triggers:
 - Abort all outstanding requests

 - When dev_loss_tmo triggers:
 - Abort all outstanding requests
 - Remove/disable the I_T nexus
 - call scsi_unblock_request()
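
The timeline in the list above can be sketched as a pure function of the time since link loss. The state names are toy stand-ins loosely modelled on enum srp_rport_state, and treating a negative timeout as "disabled" is an assumption of this sketch rather than something stated in the thread.

```c
#include <assert.h>

enum toy_rport_state { TOY_RUNNING, TOY_BLOCKED, TOY_FAIL_FAST, TOY_LOST };

/*
 * State of the remote port 'elapsed' seconds after link loss was
 * detected. A negative 'elapsed' means no link loss has happened.
 * A negative timeout means that timer is disabled (assumption).
 */
static enum toy_rport_state toy_state_after(int elapsed,
					    int fast_io_fail_tmo,
					    int dev_loss_tmo)
{
	if (elapsed < 0)
		return TOY_RUNNING;
	if (dev_loss_tmo >= 0 && elapsed >= dev_loss_tmo)
		return TOY_LOST;	/* I_T nexus removed */
	if (fast_io_fail_tmo >= 0 && elapsed >= fast_io_fail_tmo)
		return TOY_FAIL_FAST;	/* outstanding requests aborted */
	return TOY_BLOCKED;		/* requests held back, timers running */
}
```

With fast_io_fail_tmo = 15 and dev_loss_tmo = 60, for example, I/O is held back for the first 15 seconds, failed fast between 15 and 60 seconds, and the port is given up after 60 seconds.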

 However, if and whether multipath detects SDEV_BLOCK doesn't
 guarantee a fast failover; in fact is was only added rather recently
 as it's not a big win in most cases.
 
 Even if setting the state SDEV_BLOCK doesn't help much with improving
 failover time, it still has the advantage over using
 scsi_block_requests() that it can be overridden by a user via sysfs.

In my opinion SDEV_BLOCK can help the reconnect. The only reason
for high fast_io_fail_tmo is that you don't use multipath at all and
hope that the connection becomes available again before that timeout.
You place the reconnects in between so that there is a chance that the
reconnect succeeds and the transport layer error work can be canceled.

But I have to look at all of your patches first to see how you
implemented the big picture.


Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus

2013-06-14 Thread Sebastian Riemer

On 14.06.2013 01:27, Vu Pham wrote:
 Bart Van Assche wrote:
 On 06/13/13 19:50, Vu Pham wrote:
 Hello Bart,
 
 +/**
 + * srp_conn_unique() - check whether the connection to a target is unique
 + */
 +static bool srp_conn_unique(struct srp_host *host,
 +			    struct srp_target_port *target)
 +{
 +	struct srp_target_port *t;
 +	bool ret = false;
 +
 +	if (target->state == SRP_TARGET_REMOVED)
 +		goto out;
 +
 +	ret = true;
 +
 +	spin_lock(&host->target_lock);
 +	list_for_each_entry(t, &host->target_list, list) {
 +		if (t != target &&
 +		    target->id_ext == t->id_ext &&
 
 Targets may advertise/expose on different pkeys. You can have
 multiple connections (or paths/scsi hosts) to same target with
 different pkeys. We need extra check to detect the uniqueness:
 target->path.pkey == t->path.pkey &&
 
 Hello Vu,
 
 Thanks for the feedback. This is something I have already been
 thinking about myself. Unfortunately I have not found any
 requirements in the T10 SRP standard with regard to InfiniBand
 partitions. However, in that document there is a section about
 single RDMA channel operation. In that section it is explained
 that an SRP target must log out established sessions upon receipt
 of a new login request. What I'm not sure about is whether only
 sessions with the same P_Key must be logged out or all
 established sessions if a new login request is received. I assume
 the latter since otherwise that would mean that an SRP target 
 would be required to maintain multiple sessions if it allows 
 connections with more than one P_Key to a target port ? My
 concern about adding a pkey comparison in the function
 srp_conn_unique() is that if a target allows an initiator to
 choose which partition to use when logging in, that this could
 result in the undesired SRP initiator ping-pong effect this patch
 tries to avoid.
 
 Bart.
 
 Hello Bart,
 
 Yes, you pointed out the unclear/undefined area.
 
 If we stick to single RDMA channel per IT Nexus with unique tuple
 Initiator Port Identifier - Target Port Identifier then newly
 created connection with same tuple (I_port_id, T_port_id) but with
 different P_Key or different DGID is not unique.
 
 Sticking to this rule by excluding P_Key and DGID out of rdma
 channel identity, your srp_conn_unique() checking is ok; however,
 some SRP target implementations may include DGID as part of rdma
 channel identifier. I'm not sure about different p_key part.
 
 -vu
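
The two possible definitions of "unique" discussed above can be compared in a small sketch. It is not the patch code: struct toy_target_port and toy_conn_unique() are hypothetical, only the id_ext, ioc_guid and pkey fields mentioned in this thread are modelled, and an array replaces the kernel list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical, reduced view of a target port on one local SRP port. */
struct toy_target_port {
	uint64_t id_ext;
	uint64_t ioc_guid;
	uint16_t pkey;
};

/*
 * Return true if no other entry in 'list' refers to the same target
 * port as 'target'. With compare_pkey set, connections that differ
 * only in their P_Key count as distinct.
 */
static bool toy_conn_unique(const struct toy_target_port *list, size_t n,
			    const struct toy_target_port *target,
			    bool compare_pkey)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_target_port *t = &list[i];

		if (t == target)
			continue;
		if (t->id_ext != target->id_ext ||
		    t->ioc_guid != target->ioc_guid)
			continue;
		if (compare_pkey && t->pkey != target->pkey)
			continue;
		return false;	/* another connection to the same target port */
	}
	return true;
}
```

Two logins to the same id_ext and ioc_guid with different P_Keys are duplicates under the first rule and distinct under the second, which is exactly the gray area in question.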

Hi Vu!

For what do you need the same target with multiple pkeys on the same
local SRP port?

Which other SRP targets exist?

I only know SCST, Solaris COMSTAR and that broken LIO stuff.
Does SCST still not support to set the pkey?

Why should we check the dgid?

Doesn't make any sense to me to connect both target ports to the same
local port. If you do so, the multipath-tools will crash. Note: This
function is called per local SRP port. Perhaps, a note should be added
to that function that it only has to be called per local SRP port.

Cheers,
Sebastian


Re: [PATCH 03/14] IB/srp: Avoid that srp_reset_host() is skipped after a TL error

2013-06-13 Thread Sebastian Riemer
On 12.06.2013 15:23, Bart Van Assche wrote:
 The SCSI error handler assumes that the transport layer is
 operational if an eh_abort_handler() returns SUCCESS. Hence let
 srp_abort() only return SUCCESS if sending the ABORT TASK task
 management function succeeded. This patch avoids that the SCSI
 error handler skips the srp_reset_host() call after a transport
 layer error.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Vu Pham v...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |   11 ---
  1 file changed, 8 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 9c638dd..fb37b47 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1742,18 +1742,23 @@ static int srp_abort(struct scsi_cmnd *scmnd)
  {
   struct srp_target_port *target = host_to_target(scmnd->device->host);
   struct srp_request *req = (struct srp_request *) scmnd->host_scribble;
 + int ret;
  
   shost_printk(KERN_ERR, target->scsi_host, "SRP abort called\n");
  
   if (!req || !srp_claim_req(target, req, scmnd))
   return FAILED;
 - srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 -   SRP_TSK_ABORT_TASK);
 + if (srp_send_tsk_mgmt(target, req->index, scmnd->device->lun,
 +   SRP_TSK_ABORT_TASK) == 0 ||
 + target->transport_offline)
 + ret = SUCCESS;

Here you are hiding a little trick. Returning SUCCESS when
target->transport_offline is true is perhaps not the best way. I guess
you are trying to fail IO fast here, but up to this point
target->transport_offline = true is only set in srp_reset_host().

Please explain why you need that in this patch!

Furthermore, returning FAST_IO_FAIL sounds better to me in this situation.

 + else
 + ret = FAILED;
   srp_free_req(target, req, scmnd, 0);
   scmnd->result = DID_ABORT << 16;
   scmnd->scsi_done(scmnd);
  
 - return SUCCESS;
 + return ret;
  }
  
  static int srp_reset_device(struct scsi_cmnd *scmnd)
 

The rest is okay.
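
The return value suggested above for srp_abort() can be sketched in user space as follows. This is only a sketch of the review comment, not the code that was posted or merged: the enum and the function name are hypothetical stand-ins, and the enum values are not the kernel's numeric codes.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the SCSI error handler return codes. */
enum toy_eh_result { TOY_SUCCESS, TOY_FAILED, TOY_FAST_IO_FAIL };

/*
 * tsk_mgmt_rc:       0 if sending the ABORT TASK request succeeded.
 * transport_offline: true if the transport layer is known to be down.
 */
enum toy_eh_result toy_abort_result(int tsk_mgmt_rc, bool transport_offline)
{
    if (transport_offline)
        return TOY_FAST_IO_FAIL;   /* fail fast instead of claiming SUCCESS */
    if (tsk_mgmt_rc == 0)
        return TOY_SUCCESS;        /* the abort really reached the target */
    return TOY_FAILED;             /* let the error handler escalate */
}
```

With this ordering, SUCCESS is only reported when the task management function was actually sent, which is what the patch description asks for.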

Cheers,
Sebastian


Re: [PATCH 04/14] IB/srp: Skip host settle delay

2013-06-13 Thread Sebastian Riemer
On 12.06.2013 15:24, Bart Van Assche wrote:
 The SRP initiator implements host reset by reconnecting to the SRP
 target. That means that communication with the target is possible
 as soon as host reset finished. Hence skip the host settle delay.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Vu Pham v...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index fb37b47..be12780 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1949,6 +1949,7 @@ static struct scsi_host_template srp_template = {
   .eh_abort_handler   = srp_abort,
   .eh_device_reset_handler= srp_reset_device,
   .eh_host_reset_handler  = srp_reset_host,
 + .skip_settle_delay  = true,
   .sg_tablesize   = SRP_DEF_SG_TABLESIZE,
   .can_queue  = SRP_CMD_SQ_SIZE,
   .this_id= -1,
 

Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com
Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com
Tested-by: Sebastian Riemer sebastian.rie...@profitbricks.com
Reviewed-by: Sebastian Riemer sebastian.rie...@profitbricks.com
Reviewed-by: Christoph Hellwig h...@infradead.org

Choose something, I totally agree. Adding Christoph in CC as he has
reviewed this as well.

Cheers,
Sebastian


Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus

2013-06-13 Thread Sebastian Riemer
Hi Bart,

thanks for picking up the idea not to use this 'add_target' file for
manual reconnects. I have only small remarks but basically you've got my
Acked-by and Tested-by.

Please find the remarks in-line.

Cheers,
Sebastian

On 12.06.2013 15:25, Bart Van Assche wrote:
 An SRP target is required to maintain a single connection between
 initiator and target. This means that if the 'add_target' attribute
 is used to create a second connection to a target that the first
 connection will be logged out and that the SCSI error handler will
 kick in. The SCSI error handler will cause the SRP initiator to
 reconnect, which will cause I/O over the second connection to fail.
 Avoid such ping-pong behavior by disabling relogins. Note: if
 reconnecting manually is necessary, that is possible by deleting
 and recreating an rport via sysfs.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@kernel.org
 Cc: David Dillow dillo...@ornl.gov
 Cc: Vu Pham v...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |   38 
 +++
  1 file changed, 38 insertions(+)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index be12780..1a73b24 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -556,6 +556,36 @@ static void srp_rport_delete(struct srp_rport *rport)
   srp_queue_remove_work(target);
  }
  
 +/**
 + * srp_conn_unique() - check whether the connection to a target is unique
 + */
 +static bool srp_conn_unique(struct srp_host *host,
 + struct srp_target_port *target)
 +{
 + struct srp_target_port *t;
 + bool ret = false;
 +
 + if (target->state == SRP_TARGET_REMOVED)
 + goto out;
 +
 + ret = true;
 +
 + spin_lock(&host->target_lock);
 + list_for_each_entry(t, &host->target_list, list) {
 + if (t != target &&
 + target->id_ext == t->id_ext &&
 + target->ioc_guid == t->ioc_guid &&
 + target->initiator_ext == t->initiator_ext) {
 + ret = false;
 + break;
 + }
 + }
 + spin_unlock(&host->target_lock);
 +
 +out:
 + return ret;
 +}
 +

You've only changed the style of this function. The functionality is still
the same. Fine with me.

But why do you put it that high up in the source code?
Do you (still) need it for something else?

I would put it directly in front of srp_create_target() or even in front
of the option parsing code for correct bottom-up ordering.
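
For readers without the kernel tree at hand, the comparison done by srp_conn_unique() can be sketched in plain C. This is a simplified user-space model under stated assumptions: the struct and function names are hypothetical, an array replaces the kernel list, and the locking and the SRP_TARGET_REMOVED check of the real function are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the fields that identify a connection. */
struct toy_target {
    uint64_t id_ext;
    uint64_t ioc_guid;
    uint64_t initiator_ext;
};

/*
 * Returns true if no other entry in targets[] has the same
 * (id_ext, ioc_guid, initiator_ext) tuple as *candidate.
 */
bool toy_conn_unique(const struct toy_target *targets, size_t n,
                     const struct toy_target *candidate)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_target *t = &targets[i];

        if (t != candidate &&
            t->id_ext == candidate->id_ext &&
            t->ioc_guid == candidate->ioc_guid &&
            t->initiator_ext == candidate->initiator_ext)
            return false;   /* a second login to the same target */
    }
    return true;
}
```

Note that neither the pkey nor the dgid is part of the compared tuple, which is the point raised elsewhere in this thread.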

  static int srp_connect_target(struct srp_target_port *target)
  {
   int retries = 3;
 @@ -2261,6 +2291,14 @@ static ssize_t srp_create_target(struct device *dev,
   if (ret)
   goto err;
  
 + if (!srp_conn_unique(target->srp_host, target)) {
 + shost_printk(KERN_INFO, target->scsi_host,
 +  PFX "Already connected to target port %.*s\n",
 +  (int)count, buf);
 + ret = -EEXIST;
 + goto err;
 + }
 +

Yes, this looks good! Nice idea to print the connection string!
Would be even cooler without trailing '\n' from within 'buf' but that's
okay.

I was a little bit afraid of overflows here so I did security testing.
But srp_parse_options() already rejected my evil connection strings. :-)

I've tried things like this:
id_ext=0002c903004ed0b2,\
ioc_guid=0002c903004ed0b2,\
dgid=fe82c903004ed0b4,\
pkey=,service_id=0002c903004ed0b2,\
x... until 4096 chars

id_ext=0002c903004ed0b2,\
ioc_guid=0002c903004ed0b2,\
dgid=fe82c903004ed0b4,\
pkey=,service_id=0002c903004ed0b2,\
id_ext=0002c903004ed0b2,\
ioc_guid=0002c903004ed0b2,\
dgid=fe82c903004ed0b4,\
pkey=,service_id=0002c903004ed0b2,\
... until 4096 chars

This string looked kind of funny. Also in the kernel message it was a
little bit longer than usual but the parsing detected that I have too
many parameters. So everything is fine in terms of security.
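
The reason why printing the user-supplied buffer is safe can be shown with a small user-space sketch. It assumes standard C snprintf() instead of shost_printk(), and both function names are hypothetical: "%.*s" reads at most the given number of bytes, so the buffer does not need to be NUL-terminated, and a single trailing newline can be dropped by shortening that length.

```c
#include <stddef.h>
#include <stdio.h>

/* Length of buf without one trailing newline, if there is one. */
size_t toy_len_without_newline(const char *buf, size_t count)
{
    if (count > 0 && buf[count - 1] == '\n')
        return count - 1;
    return count;
}

/*
 * Formats at most `count` bytes of buf into out. buf need not be
 * NUL-terminated because "%.*s" stops after the given precision.
 * Returns what snprintf() returns.
 */
int toy_format_already_connected(char *out, size_t out_size,
                                 const char *buf, size_t count)
{
    size_t len = toy_len_without_newline(buf, count);

    return snprintf(out, out_size,
                    "Already connected to target port %.*s",
                    (int)len, buf);
}
```

The output buffer size bounds the result in any case, so even a 4096 character connection string cannot overflow it.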

   if (!host->srp_dev->fmr_pool && !target->allow_ext_sg &&
   target->cmd_sg_cnt < target->sg_tablesize) {
   pr_warn("No FMR pool and no external indirect descriptors, limiting sg_tablesize to cmd_sg_cnt\n");
 



Re: [PATCH] IB/srp: Maintain a single connection per I_T nexus

2013-06-13 Thread Sebastian Riemer
Bart's version also has the printing of the connection string if the
double login fails.

So forget about this version here.

On 12.06.2013 13:51, Sebastian Riemer wrote:
 Hi all,
 
 as proposed by Or, let's discuss this on the mailing list.
 
 This is a fundamental change required for everything related to
 multipathing. It influences automatic reconnect patches which will
 follow. So let's agree on the right solution here first before looking
 at other patches.
 
 In my opinion the 'add_target' sysfs attribute shouldn't be used for
 manual reconnects either. This is why my patch rejects the double login
 attempt instead of reconnecting an existing connection.
 This can help to find scripting issues and similar problems. We can't
 expect that all users are using the srp-tools.
 
 Please compare with Bart's version and let's discuss this here.
 https://github.com/bvanassche/ib_srp-backport/commit/7d8774ff58d489858b1c046b2bf01b4e84e8dd9b
 
 Cheers,
 Sebastian
 
 
 On 12.06.2013 13:29, Sebastian Riemer wrote:
 The sysfs attribute 'add_target' may not be used for multiple logins to
 the same target. If doing so with multipathing, this crashes the
 multipath-tools. Furthermore, we want to prevent the possibility of data
 corruption here. If manual reconnect is necessary, then the user can use
 the 'delete' sysfs attribute of the remote port before connecting.

 Note: The function srp_conn_unique() has been taken from Bart Van Assche.
 
 



Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus

2013-06-13 Thread Sebastian Riemer
On 13.06.2013 17:07, Bart Van Assche wrote:
[...]
 The %.*s should only copy the data provided by the user, even if it
 is not '\0' terminated. Stripping the trailing newline is probably
 possible with something like the (untested) code below (will only work
 if there is only one newline in the input string and if it's at the
 end):
   shost_printk(KERN_INFO, target->scsi_host,
PFX "Already connected to target port %.*s\n",
(int)count - (memchr(buf, '\n', count) ==
  buf + count - 1), buf);

I was thinking more of something like this existing message (as the input
string from the user is possibly long, with a lot of configuration options):

shost_printk(KERN_DEBUG, target->scsi_host, PFX
 "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
 "service_id %016llx dgid %pI6\n",
(unsigned long long) be64_to_cpu(target->id_ext),
(unsigned long long) be64_to_cpu(target->ioc_guid),
be16_to_cpu(target->path.pkey),
(unsigned long long) be64_to_cpu(target->service_id),
target->path.dgid.raw);

But this takes a lot of code lines. Perhaps this string formatting
should be put into a macro or inline function then.
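
Such a formatting helper could look roughly like the following user-space sketch. Everything here is an assumption for illustration: the struct and function names are hypothetical, the values are taken to be in host byte order already (the kernel message converts with be64_to_cpu() and be16_to_cpu()), and the dgid is omitted because %pI6 only exists in the kernel's printk.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the identity fields, in host byte order. */
struct toy_target_id {
    uint64_t id_ext;
    uint64_t ioc_guid;
    uint64_t service_id;
    uint16_t pkey;
};

/* Formats the identity once so that every message can reuse it. */
int toy_format_target_id(char *out, size_t out_size,
                         const struct toy_target_id *id)
{
    return snprintf(out, out_size,
                    "id_ext %016" PRIx64 " ioc_guid %016" PRIx64
                    " pkey %04x service_id %016" PRIx64,
                    id->id_ext, id->ioc_guid,
                    (unsigned int)id->pkey, id->service_id);
}
```

Both the "new target" message and the "Already connected" message could then print the same short string instead of the raw user input.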

Cheers,
Sebastian


Re: [PATCH] IB/srp: Maintain a single connection per I_T nexus

2013-06-12 Thread Sebastian Riemer
Hi all,

as proposed by Or, let's discuss this on the mailing list.

This is a fundamental change required for everything related to
multipathing. It influences automatic reconnect patches which will
follow. So let's agree on the right solution here first before looking
at other patches.

In my opinion the 'add_target' sysfs attribute shouldn't be used for
manual reconnects either. This is why my patch rejects the double login
attempt instead of reconnecting an existing connection.
This can help to find scripting issues and similar problems. We can't
expect that all users are using the srp-tools.

Please compare with Bart's version and let's discuss this here.
https://github.com/bvanassche/ib_srp-backport/commit/7d8774ff58d489858b1c046b2bf01b4e84e8dd9b

Cheers,
Sebastian


On 12.06.2013 13:29, Sebastian Riemer wrote:
 The sysfs attribute 'add_target' may not be used for multiple logins to
 the same target. If doing so with multipathing, this crashes the
 multipath-tools. Furthermore, we want to prevent the possibility of data
 corruption here. If manual reconnect is necessary, then the user can use
 the 'delete' sysfs attribute of the remote port before connecting.
 
 Note: The function srp_conn_unique() has been taken from Bart Van Assche.




Re: [PATCH 01/14] IB/srp: Fix remove_one crash due to resource exhaustion

2013-06-12 Thread Sebastian Riemer
On 12.06.2013 15:38, Bart Van Assche wrote:
 On 06/12/13 15:20, Bart Van Assche wrote:
 If the add_one callback fails during driver load no resources are
 allocated so there isn't a need to release any resources. Trying
 to clean the resource may lead to the following kernel panic:

 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [a0132331] srp_remove_one+0x31/0x240 [ib_srp]
 RIP: 0010:[a0132331]  [a0132331]
 srp_remove_one+0x31/0x240 [ib_srp]
 Process rmmod (pid: 4562, threadinfo 8800dd738000, task
 8801167e60c0)
 Call Trace:
   [a024500e] ib_unregister_client+0x4e/0x120 [ib_core]
   [a01361bd] srp_cleanup_module+0x15/0x71 [ib_srp]
   [810ac6a4] sys_delete_module+0x194/0x260
   [8100b0f2] system_call_fastpath+0x16/0x1b

 [bvanassche: Shortened patch description]
 Signed-off-by: Dotan Barak dot...@dev.mellanox.co.il
 Reviewed-by: Eli Cohen e...@mellanox.co.il
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Vu Pham v...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 ---
   drivers/infiniband/ulp/srp/ib_srp.c |2 ++
   1 file changed, 2 insertions(+)

 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 7ccf328..368d160 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -2507,6 +2507,8 @@ static void srp_remove_one(struct ib_device
 *device)
   struct srp_target_port *target;

   srp_dev = ib_get_client_data(device, &srp_client);
 +if (!srp_dev)
 +return;

   list_for_each_entry_safe(host, tmp_host, &srp_dev->dev_list,
 list) {
   device_unregister(&host->dev);

 
 Please note that this patch was authored by Dotan Barak, so I should
 have mentioned:
 
 From: Dotan Barak dot...@dev.mellanox.co.il

Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com


Re: [PATCH 02/14] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()

2013-06-12 Thread Sebastian Riemer
Wait a minute, so you've changed this commit to also hold the target
lock in the following functions in the error case:

srp_unmap_data(),
srp_put_tx_iu()

This is different from:
https://github.com/bvanassche/ib_srp-backport/commit/6ce0e30dbb69973926df84292239f0c20f6a2d6c

srp_unmap_data() calls ib_fmr_pool_unmap(), which uses its own spin lock
(pool->pool_lock).

srp_put_tx_iu() acquires the target lock as well (target->lock). That's
the same spin lock taken inside itself. I would say that this deadlocks.
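
Why taking the same non-recursive lock twice is a problem can be modelled in user space. The sketch below is a toy built on C11 atomic_flag and not the kernel spinlock API; all names are hypothetical. It uses a trylock so that the failure is visible as a return value, where a real spin_lock() would simply spin forever.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy non-recursive lock; a stand-in for a kernel spinlock. */
struct toy_lock {
    atomic_flag held;
};

bool toy_trylock(struct toy_lock *l)
{
    /* test_and_set returns the previous state: false means we got it. */
    return !atomic_flag_test_and_set(&l->held);
}

void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Models a helper that takes the lock itself, as srp_put_tx_iu() does. */
bool toy_helper_can_run(struct toy_lock *l)
{
    if (!toy_trylock(l))
        return false;   /* a real spin_lock() would spin here forever */
    toy_unlock(l);
    return true;
}

/* The caller already holds the lock and then calls the helper. */
bool toy_nested_call_succeeds(void)
{
    struct toy_lock l = { ATOMIC_FLAG_INIT };
    bool ok;

    toy_trylock(&l);              /* caller takes the lock */
    ok = toy_helper_can_run(&l);  /* helper tries to take it again */
    toy_unlock(&l);
    return ok;
}
```

toy_nested_call_succeeds() returns false, which is the user-space equivalent of the deadlock described above.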

I like the other version more.

Cheers,
Sebastian


On 12.06.2013 15:21, Bart Van Assche wrote:
 Avoid that srp_claim_command() can claim a command while
 srp_queuecommand() is still busy queueing the same command.
 Found this via source reading.
 
 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: Roland Dreier rol...@purestorage.com
 Cc: David Dillow dillo...@ornl.gov
 Cc: Vu Pham v...@mellanox.com
 Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)
 
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index 368d160..9c638dd 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -1367,7 +1367,6 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
 struct scsi_cmnd *scmnd)
  
   req = list_first_entry(&target->free_reqs, struct srp_request, list);
   list_del(&req->list);
 - spin_unlock_irqrestore(&target->lock, flags);
  
   dev = target->srp_host->srp_dev->dev;
   ib_dma_sync_single_for_cpu(dev, iu->dma, target->max_iu_len,
 @@ -1401,6 +1400,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
 struct scsi_cmnd *scmnd)
   shost_printk(KERN_ERR, target->scsi_host, PFX "Send failed\n");
   goto err_unmap;
   }
 + spin_unlock_irqrestore(&target->lock, flags);
  
   return 0;
  
 @@ -1409,8 +1409,6 @@ err_unmap:
  
  err_iu:
   srp_put_tx_iu(target, iu, SRP_IU_CMD);
 -
 - spin_lock_irqsave(&target->lock, flags);
   list_add(&req->list, &target->free_reqs);
  
  err_unlock:
 



Re: How to do replication right with SRP or remote storage?

2013-06-10 Thread Sebastian Riemer
On 08.06.2013 04:31, Bruce McKenzie wrote:
 Hi Bart.
 
 any advice on using this fix with MD RAID 1? A guide or site you know of?
 
 I've compiled Ubuntu 13.04 with kernel 3.6.11 and OFED 2 from Mellanox, and it
 works OK; performance is a little better with SRP. Some packages don't seem
 to work, i.e. with srptools and IB-diags some commands fail, which looks like those
 tools haven't been tested with 3.6.11 or updated.
 
 I've tried using DRBD with Pacemaker, STONITH etc. (which also works on 3.6.11),
 but it only works with iSCSI over IPoIB, i.e. a virtual NIC with a mounted LVM
 using SCST to present file I/O, and Pacemaker to fail over the VIP to node
 2. But OFED 2 doesn't seem to support SDP, so I have to replicate via IPoIB, which is
 slow even over a dedicated IB_IPOIB NIC, i.e. DRBD replication is 200 MB/s.
 
 Any help or direction would be appreciated.
 Cheers
 Bruce McKenzie
 

(changed subject into something I think is more appropriate)

Hi Bruce,

thanks for contacting me privately in parallel. I can answer your
replication questions. To share the experience with others, I am replying
here again.

Please evaluate the ib_srp fixes from Bart and from me as well and send
us your feedback!

We are still negotiating how to do fast IO failing and the automatic
reconnect right, also together with the Mellanox SRP guys Sagi Grimberg,
Vu Pham, Oren Duer and others.

You need these patches in order to fail IO to the upper layers within the
time you configure, so that dm-multipath can fail over the path first while
ib_srp continuously tries to reconnect the failed path. If the other
path also fails, then very likely the storage server is down, so you
fail the IO further up to MD RAID-1 so that it can fail that replica.
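
The escalation order described above can be written down as a small decision function. This is only a model of the layering under the assumption of exactly two paths per storage server; the enum and function names are hypothetical and do not exist in dm-multipath, ib_srp or MD.

```c
#include <stdbool.h>

/* Hypothetical action names for the layering described above. */
enum toy_io_action {
    TOY_USE_ACTIVE_PATH,    /* normal operation */
    TOY_FAIL_OVER_PATH,     /* dm-multipath switches to the other path */
    TOY_FAIL_IO_TO_RAID1,   /* no path left: MD RAID-1 fails the replica */
};

enum toy_io_action toy_next_action(bool active_path_ok, bool other_path_ok)
{
    if (active_path_ok)
        return TOY_USE_ACTIVE_PATH;
    if (other_path_ok)
        return TOY_FAIL_OVER_PATH;
    return TOY_FAIL_IO_TO_RAID1;
}
```

In all cases where a path is down, ib_srp would keep trying to reconnect it in the background; that part is not modelled here.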

For replication, the last slide of my talk at LinuxTag this year could be
interesting for you:

http://www.slideshare.net/SebastianRiemer/infini-band-rdmaforstoragesrpvsiser-21791250

That slide caused a lot of discussion afterwards. The thing is that
replication of remote storage is best done on the initiator (a single kernel
manages all replicas, parallel network paths, symmetric latency,...).

The bad news is that replication of virtual/remote storage with MD
RAID-1 is a use case which basically works but has some issues which
Neil Brown doesn't want to have fixed in mainline. So you need a kernel
developer for some cool features like e.g. safe VM live migration.

Perhaps, I should collect all guys who require MD RAID-1 for remote
storage replication in order to put some pressure on Neil. At least some
things of this use case are easy to merge with mainline behavior like
e.g. letting MD assembly scale right (mdadm searches the whole /dev
without a need). I was surprised that he will make the data offset
settable again so that you can set it to 4 MiB (1 LV extent). We already
have that by custom patches on top of mdadm 3.2.6.

DRBD with iSCSI is already crap. 200 MB/s with IB sounds familiar. I had
250 MB/s in a primary/secondary setup with DRBD during evaluation. Those are
store-and-forward writes to the secondary, which is slow. Chained network
paths! With Ethernet that hurts even more. People report 70 MB/s with
that. I've taught them how to use blktrace and it became obvious that
they were trapped in latency.

I can also recommend Vasiliy Tolstov v.tols...@selfip.ru to you. He also
uses SRP with MD RAID-1. He was able to convince Neil to fix the MD data
offset. Open source is all about the right allies.

Cheers,
Sebastian



Re: How to do replication right with SRP or remote storage?

2013-06-10 Thread Sebastian Riemer
On 10.06.2013 14:44, Bart Van Assche wrote:
 On 06/10/13 14:05, Sebastian Riemer wrote:
 Perhaps, I should collect all guys who require MD RAID-1 for remote
 storage replication in order to put some pressure on Neil.
 
 If I remember correctly one of the things Neil is trying to explain to
 md users is that when md is used without write-intent bitmap there is a
 risk of triggering a so-called write hole after a power failure ?

I'm not sure. Haven't seen something like this on the mailing list. Do
you have a reference from the archives?

I think this is handled by superblock writes in the correct order by
now. As far as I know, the main reason for the write-intent bitmap remains
that without it you need a full resync if a component device is down even
for a short moment, because it becomes faulty.
If you know that there can't be a hardware issue (e.g. virtual storage),
you can remove the faulty device and re-add it to the array.

If a device was faulty, then it assembles again. There is an error
counter in /sys/block/mdX/md/ sysfs and a maximum read error count
(usually 20) after which the faulty device doesn't assemble again.

/sys/block/mdX/md/dev-Y/errors
/sys/block/mdX/md/max_read_errors

Cheers,
Sebastian


Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-21 Thread Sebastian Riemer
On 17.05.2013 16:16, Jack Wang wrote:
 unable to handle kernel paging request

Hi Jack,

this should be related to the list corruption in IPoIB, as list_del()
sets the LIST_POISON1 and LIST_POISON2 pointers.
Dereferencing these results in page faults according to the documentation
in the code.
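
The poisoning mechanism itself can be shown with a small user-space model of a doubly linked list. This is a sketch and not the kernel's list.h: the names are hypothetical and the two poison values are illustrative assumptions; the kernel's actual LIST_POISON1 and LIST_POISON2 values depend on the kernel version and architecture.

```c
#include <stdint.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* Illustrative poison values; not the kernel's actual constants. */
#define TOY_POISON1 ((struct toy_list_head *)(uintptr_t)0x100)
#define TOY_POISON2 ((struct toy_list_head *)(uintptr_t)0x200)

void toy_list_init(struct toy_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Inserts entry right after head. */
void toy_list_add(struct toy_list_head *entry, struct toy_list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlinks entry and poisons its pointers, as list_del() does. */
void toy_list_del(struct toy_list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = TOY_POISON1;
    entry->prev = TOY_POISON2;
}
```

Any code that walks or deletes such an entry a second time dereferences a poison pointer, which is meant to fault immediately instead of silently corrupting memory.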

Cheers,
Sebastian


Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches

2013-05-15 Thread Sebastian Riemer
On 15.05.2013 07:12, Vasiliy Tolstov wrote:
 2013/5/14 Bart Van Assche bvanass...@acm.org:
 The ability to close a session from the initiator side went upstream in
 kernel 3.8 (/sys/class/srp_remote_ports/port-h:n/delete). Regarding
 faster reconnects: please keep in mind that after a cable pull it can easily
 take 20 seconds before link training and initialization by the subnet
 manager have finished. It's not possible to make an initiator reconnect in
 less time than what the hardware and subnet manager need to bring the link
 back.
 
 
 Thanks. What about closing the session from the target side? For example, I need
 to close the SRP session and block all access from a specific initiator.

AFAIK the session is blocked as long as an initiator is connected. The
only possibility besides disconnecting the initiators is to disable the
target completely. Then, it sends a DREQ (disconnection request) to the
initiators. They then know that the target is disconnected and send a
DREP (disconnection reply). In our patches we also activate the
reconnect in this situation as we do that to orderly reboot a storage
system (e.g. due to an issue). The storage system comes up again,
exports the same volumes and the initiators can reconnect.

Cheers,
Sebastian



Re: tune ib stack

2013-05-14 Thread Sebastian Riemer
On 14.05.2013 12:02, Vasiliy Tolstov wrote:
 Sorry for bumping an old thread; I've solved my problems with new firmware.
 I have Supermicro servers that rebrand the Mellanox firmware (recompile
 and change some bits).
 Now all works fine; I have 40 Gb/s QDR instead of 10 Gb/s.
 
Thanks, sharing lessons learned is never wrong. Especially as
there aren't many IB specialists in the world.


Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches

2013-05-14 Thread Sebastian Riemer
Of course, QLogic HCAs can be used as well.
But please note that my patches contain no back-port code, in order to keep
them more readable. If you like my patches, then we can talk about how to
back-port them to a specific kernel version.

On 14.05.2013 12:00, Vasiliy Tolstov wrote:
 If I need faster reconnects and the ability to close a session from the
 initiator side with QLogic hardware, is that possible? Or do these
 patches only cover Mellanox cards?
 
 2013/5/8 Sebastian Riemer sebastian.rie...@profitbricks.com:
 FYI: I've released version 0.6 of my SRP patches today.

 The automatic reconnect is included now. The tests for that will follow
 in the next version. But we already did quite intensive testing for that.

 Hard reboot and also soft reboot of the target are possible with that
 reconnect. It just reconnects and everything is fine again.

 With soft reboot I mean: disabling the target, removing the exports,
 rebooting, exporting the same LUNs, re-enabling the target.

 It also has an automatic mechanism to reduce the possibility of a DDoS
 attack reconnect. It automatically reconnects at different intervals.

 Check it out:
 https://github.com/sriemer/ib_srp

 Cheers,
 Sebastian
 
 
 



Re: Infiniband HA

2013-05-08 Thread Sebastian Riemer
Hi Gandalf,

just build up two separate fabrics. This means that you don't
interconnect both switches.
Otherwise, issues on one port also affect the other port.

What do you use for storage? SRP?
This requires dm-multipath and fast IO failing + automatic reconnect
patches from Bart or from me.

All other traffic like IPoIB for example also has to be able to switch
the port.

Cheers,
Sebastian


On 08.05.2013 12:06, Gandalf Corvotempesta wrote:
 Hi to all
 I'm new to InfiniBand/RDMA and probably this is not the right place to ask.
 I'm planning a new InfiniBand infrastructure with dual-port HBAs and I
 have a question:
 
 How can I achieve fault tolerance with multiple switches? Should I
 connect everything like a standard Ethernet infrastructure? (port1 to
 switch1, port2 to switch2, and both switches interconnected with a
 cable?)
 
 Thank you.





Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches

2013-05-08 Thread Sebastian Riemer
FYI: I've released version 0.6 of my SRP patches today.

The automatic reconnect is included now. The tests for that will follow
in the next version. But we already did quite intensive testing for that.

Hard reboot and also soft reboot of the target are possible with that
reconnect. It just reconnects and everything is fine again.

With soft reboot I mean: disabling the target, removing the exports,
rebooting, exporting the same LUNs, re-enabling the target.

It also has an automatic mechanism to reduce the possibility of a DDoS-like
reconnect storm: it automatically reconnects at different intervals.
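
One way to spread reconnects over time is a per-initiator offset added to a base delay. The sketch below only illustrates that idea; it is an assumption and not the mechanism implemented in the published patches, and the function name, the parameters and the hash constant are all chosen for illustration.

```c
#include <stdint.h>

/*
 * Returns a reconnect delay in [base_ms, base_ms + spread_ms).
 * The offset is derived from an initiator identifier, so that many
 * initiators do not hit a rebooted target at the same moment.
 */
uint32_t toy_reconnect_delay_ms(uint32_t base_ms, uint32_t spread_ms,
                                uint32_t initiator_id)
{
    /* Simple multiplicative hash; wraps around modulo 2^32. */
    uint32_t h = initiator_id * 2654435761u;

    if (spread_ms == 0)
        return base_ms;
    return base_ms + h % spread_ms;
}
```

The result is deterministic per initiator, which keeps the behaviour reproducible in tests while still giving different initiators different intervals.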

Check it out:
https://github.com/sriemer/ib_srp

Cheers,
Sebastian


[ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches

2013-04-12 Thread Sebastian Riemer
Hello everyone,

I'm very proud to announce that we finally publish our SRP initiator
patches we've been working on for quite some time now.

In the first step we publish our way of failing IO fast as we've noticed
that the way Bart Van Assche does that in his GitHub repository doesn't
match our requirements completely.

His repo: https://github.com/bvanassche/ib_srp-backport
Our repo: https://github.com/sriemer/ib_srp

We want to fail IO fast in exactly the time we configure. With our
patches this works (or please tell us why not). We provide you with full
test descriptions and related shell scripts. Everything is done with as
few dependencies as possible.

The shell scripts can also be very useful to show how to configure and
use SRP with sysfs only. This is why I've added the scst-devel mailing
list here. We want to be as close as possible to the kernel.

We want to combine efforts here and to get valuable feedback from you
all. Evaluation, testing, criticism, comments, etc more than welcome!
Hopefully, we can get a really cool solution into the mainline together!
This would make my job as the maintainer for the ProfitBricks host
kernels a lot easier! ;-)

You'll notice: We've already adapted patches from Bart and parts of his
patches. So it is only fair to publish our patches as well. :-)

It is the same as with Bart's patches: This can't be used for
production, yet. Our patches don't have the reconnect for now. Ideas how
to implement that on top are welcome. Just send your patches directly to
me! :-)

Please further notice: There is a major bug in the upstream
multipath-tools. They read sysfs files through a cache, which leads to IO on
offline devices. We've fixed it for us and publish the fix for you as
well. :-)

Git repo:
https://github.com/sriemer/multipath-tools

Thank you so much for your help in the past and in the future as well!
Thanks for the patience and reading this!

We'll continue publishing our SRP patches relevant for the mainline.
If you want to meet me or ProfitBricks in person, we'll have a booth at
LinuxTag in Berlin/Germany. I'll give a technical talk there about SRP:

http://www.linuxtag.org/2013/en/program/thursday-may-23-2013.html?eventid=208

Cheers,
Sebastian

-- 
Sebastian Riemer
Linux Kernel Developer - Storage

ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany
www.profitbricks.com • sebastian.rie...@profitbricks.com

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Andreas Gauger, Achim Weiss
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 13:51, Vasiliy Tolstov wrote:
 Something like this:
 echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
 
 After doing this all srp connections down and port is down. I need to
 restart openibd

Sorry for that! It's much easier to set the IP MTU. Managed switches
support setting the RDMA MTU. So it could be possible that it is a
setting in the SM config. But I'm not sure.

$ man opensm
says that it can be set in the partitions.conf

 You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so
 that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to
 be the default case if a rate isn't supported.
 
 Yes, in older card with ConnectX i see this, but in case of ConnectX-3 only 10 Gb

The kernel version is okay. It depends on the user space.
There is a support note in OFED 3.5:
- ConnectX-3 (fw-ConnectX3 Rev 2.11.0500) (FDR and FDR10 Modes are
Supported)

Before OFED 3.5 these HCAs aren't supported. A look at the related
source code could be worth a try.

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
 Why max_mtu differs with active_mtu? 
 
 What does peer port say for max MTU ?
 
 How can i set active mtu?
 
 SM sets active MTU to min of peer ports max MTUs.

So with peer port max MTU do you mean this file?:

/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I've seen that it can be set as well. I've got two ConnectX-2 machines
connected back-to-back. In general these have 4K max and active.

So let's try something:

Host1:
$ echo 2048 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
# Port is not active, let's reactivate it.
$ echo 1 > /sys/class/infiniband/mlx4_0/device/enable

ibv_devinfo Host1:
max_mtu:    2048 (4)
active_mtu: 2048 (4)

Host2:
max_mtu:    4096 (5)
active_mtu: 2048 (4)

Both had 4096 (5) before everywhere.
So that's the recommended way to reduce the MTU?
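
For reference: the number in parentheses in the ibv_devinfo output is the IB
MTU code (1 = 256 bytes up to 5 = 4096 bytes), and per Hal's explanation the
active MTU ends up as the minimum of the two peer ports' max MTUs. A minimal
shell sketch of that rule follows. The function names are made up for
illustration and nothing here talks to real hardware.

```shell
#!/bin/sh
# Map an InfiniBand MTU code (the number ibv_devinfo prints in
# parentheses) to bytes: 1=256, 2=512, 3=1024, 4=2048, 5=4096.
ib_mtu_bytes() {
    case "$1" in
        1|2|3|4|5) echo $((128 << $1)) ;;
        *) echo "unknown MTU code: $1" >&2; return 1 ;;
    esac
}

# Active MTU code as the minimum of the two peer ports' max MTU codes.
ib_active_mtu_code() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Values from the back-to-back test: Host1 max 2048 (4), Host2 max 4096 (5).
code=$(ib_active_mtu_code 4 5)
echo "active_mtu: $(ib_mtu_bytes "$code") ($code)"
```

With the values from the test above this prints "active_mtu: 2048 (4)",
which matches what both hosts reported.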

I've heard that reducing the MTU in a fabric can help fighting
congestion issues. As congestion control doesn't work yet, could this
help against congestion?

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 15:34, Hal Rosenstock wrote:
 On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
 On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
 Why max_mtu differs with active_mtu? 

 What does peer port say for max MTU ?

 How can i set active mtu?

 SM sets active MTU to min of peer ports max MTUs.

 So with peer port max MTU do you mean this file?:

 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
 
 I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
 supported MTU.

So these values are exactly the same as in ibv_devinfo and can be set
in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

I've found the PortInfo with the command
smpquery portinfo -C mlx4_0 3 1
where I'm using the first HCA to contact the SM. I tell the SM the
destination LID ('3' here in my case) and the destination port ('1').

Is there another method to set the max MTU?

I know that switches can also set the max MTU for their switch ports
where most of them use 2048 as default.
How to change these switch port MTUs for unmanaged switches?

On managed switches this can be done over the web front-end.

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 16:23, Hal Rosenstock wrote:
 So these values are exactly the same as in ibv_devinfo and can be set
 in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

 I've found the PortInfo with the command
 smpquery portinfo -C mlx4_0 3 1
 where I'm using the first HCA to contact the SM. I tell the SM the
 destination LID ('3' here in my case) and the destination port ('1').

 Is there another method to set the max MTU?
 
 That doesn't set max MTU (MTUCap) but merely reads it (for that port).

Sorry, copy and paste error. I meant the mlx4 file:
/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

But you've already answered that: it is vendor specific. Thanks for the
valuable information!

Most interesting for us would be whether the MTU can be changed live
without any service disruption. Looks like the mlx4 driver can't provide
that. Perhaps switches can do that.

Cheers,
Sebastian



[RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout

2013-03-19 Thread Sebastian Riemer
Hi Bart,

SRP is my priority again now.

I've also noticed that your ib_srp-backport doesn't fail the IO fast
enough. The fast_io_fail_tmo only comes into play after the QP is
already in timeout and the terminate_rport_io function is missing.

My idea is to use the QP retry count directly for fast IO failing. It is
at 7 by default and the QP timeout is at approx. 2s. The overall QP
timeout is at approx. 35s already ((1+7) tries * 2s * 2, I guess). Using
only 3 retries I'm at approx 18s.
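
A quick sanity check of these numbers as a shell sketch. It assumes the IB
local ACK timeout encoding of 4.096 us * 2^exponent and an exponent of 19
(about 2.1s). The exponent is an assumption chosen to match the roughly 2s
above, and the factor 2 is the guess from the formula above, not a value
taken from the spec. The function names are made up for illustration.

```shell
#!/bin/sh
# Local ACK timeout in milliseconds for a timeout exponent:
# 4.096 us * 2^exp, computed in nanoseconds as 4096 << exp.
qp_ack_timeout_ms() {
    echo $(( (4096 << $1) / 1000000 ))
}

# Overall time until the QP gives up:
# (1 initial try + retry_cnt retries) * per-try timeout * factor.
# The factor is the "* 2" guessed above, not a documented constant.
qp_overall_timeout_ms() {
    per_try_ms=$(qp_ack_timeout_ms "$1")
    echo $(( (1 + $2) * per_try_ms * $3 ))
}

# Assumed exponent 19, retry counts 7 (default) and 3, guessed factor 2.
echo "retry_cnt=7: $(qp_overall_timeout_ms 19 7 2) ms"
echo "retry_cnt=3: $(qp_overall_timeout_ms 19 3 2) ms"
```

Under these assumptions the sketch gives 34352 ms and 17176 ms, which is
close to the approx. 35s and 18s measured.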

My patches introduce that parameter as a module parameter, as it is quite
difficult to set the QP from RTS to RTR again. Only there can the QP
timeout parameters be set.

My patch series isn't complete yet as paths aren't reconnected - they
are only failed fast bound to the overall QP timeout. But it should give
you an idea what I'm trying to do here.

What are your thoughts regarding this?

Attached patches:
ib_srp: register srp_fail_rport_io as terminate_rport_io
ib_srp: be quiet when failing SCSI commands
scsi_transport_srp: disable the fast_io_fail_tmo parameter
ib_srp: show the QP timeout and retry count in srp_host sysfs files
ib_srp: introduce qp_retry_cnt module parameter

Cheers,
Sebastian


Btw.: Before, I've hacked MD RAID-1 for high-performance replication as
DRBD is crap for our purposes. But that's worthless without a reliably
working transport.
From c101d00fe529d845192dd6d5930a1b9c16c99b81 Mon Sep 17 00:00:00 2001
From: Sebastian Riemer sebastian.rie...@profitbricks.com
Date: Wed, 13 Mar 2013 16:16:28 +0100
Subject: [PATCH 1/5] ib_srp: register srp_fail_rport_io as terminate_rport_io

We need to fail the IO fast in the selected time. So register the
missing terminate_rport_io function.

Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com
---
 drivers/infiniband/ulp/srp/ib_srp.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index dc49dc8..64644c5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -756,6 +756,29 @@ static void srp_reset_req(struct srp_target_port *target, struct srp_request *re
 	}
 }
 
+static void srp_fail_req(struct srp_target_port *target, struct srp_request *req)
+{
+	struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL);
+
+	if (scmnd) {
+		srp_free_req(target, req, scmnd, 0);
+		scmnd->result = DID_TRANSPORT_FAILFAST << 16;
+		scmnd->scsi_done(scmnd);
+	}
+}
+
+static void srp_fail_rport_io(struct srp_rport *rport)
+{
+	struct srp_target_port *target = rport->lld_data;
+	int i;
+
+	for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
+		struct srp_request *req = &target->req_ring[i];
+		if (req->scmnd)
+			srp_fail_req(target, req);
+	}
+}
+
 static int srp_reconnect_target(struct srp_target_port *target)
 {
 	struct Scsi_Host *shost = target->scsi_host;
@@ -2700,6 +2723,7 @@ static void srp_remove_one(struct ib_device *device)
 
 static struct srp_function_template ib_srp_transport_functions = {
 	.rport_delete        = srp_rport_delete,
+	.terminate_rport_io  = srp_fail_rport_io,
 };
 
 static int __init srp_init_module(void)
-- 
1.7.9.5

From 06c3cc832a672856c416fee72705ea0448f23855 Mon Sep 17 00:00:00 2001
From: Sebastian Riemer sebastian.rie...@profitbricks.com
Date: Wed, 13 Mar 2013 16:46:44 +0100
Subject: [PATCH 2/5] ib_srp: be quiet when failing SCSI commands

Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com
---
 drivers/infiniband/ulp/srp/ib_srp.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 64644c5..0607e5a 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -750,6 +750,7 @@ static void srp_reset_req(struct srp_target_port *target, struct srp_request *re
 	struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL);
 
 	if (scmnd) {
+		scmnd->request->cmd_flags |= REQ_QUIET;
 		srp_free_req(target, req, scmnd, 0);
 		scmnd->result = DID_RESET << 16;
 		scmnd->scsi_done(scmnd);
@@ -761,6 +762,7 @@ static void srp_fail_req(struct srp_target_port *target, struct srp_request *req
 	struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL);
 
 	if (scmnd) {
+		scmnd->request->cmd_flags |= REQ_QUIET;
 		srp_free_req(target, req, scmnd, 0);
 		scmnd->result = DID_TRANSPORT_FAILFAST << 16;
 		scmnd->scsi_done(scmnd);
@@ -1526,6 +1528,7 @@ static int SRP_QUEUECOMMAND(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 	int len;
 
 	if (unlikely(target->transport_offline)) {
+		scmnd->request->cmd_flags |= REQ_QUIET;
 		scmnd->result = DID_NO_CONNECT << 16;
 		scmnd

Re: [RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout

2013-03-19 Thread Sebastian Riemer
On 19.03.2013 12:22, Or Gerlitz wrote:
 On 19/03/2013 12:16, Sebastian Riemer wrote:
 Hi Bart,

 now I've got my priority on SRP again.
 
 Hi Sebastian,
 
 Are these patches targeted to upstream or backports to some OS/kernel?
 if the former, can you please
 send them inline so we can have proper review?
 
 Or.

Hi Or,

the patches are targeted to the stuff Bart is doing on GitHub.
https://github.com/bvanassche/ib_srp-backport

If I've seen that right, fast IO failing hasn't been accepted into
mainline yet.

So I didn't want to spam you all with multiple mails of patches which
don't apply to upstream.
I wanted to introduce my idea first. The patches are not a
final solution to the problem. They should only show what I'm trying to
do here.

Cheers,
Sebastian


Re: [RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout

2013-03-19 Thread Sebastian Riemer
On 19.03.2013 12:45, Bart Van Assche wrote:
 On 03/19/13 11:16, Sebastian Riemer wrote:

 What are your thought regarding this?

 Attached patches:
 ib_srp: register srp_fail_rport_io as terminate_rport_io
 ib_srp: be quiet when failing SCSI commands
 scsi_transport_srp: disable the fast_io_fail_tmo parameter
 ib_srp: show the QP timeout and retry count in srp_host sysfs files
 ib_srp: introduce qp_retry_cnt module parameter
 
 Hello Sebastian,
 
 Patches 1 and 2 make sense to me. Patch 3 makes it impossible to disable
 fast_io_fail_tmo and also disables the fast_io_fail_tmo timer - was that
 intended ?

I had a patch that completely threw out the fast_io_fail_tmo
parameter for ib_srp v1.2, because in my tests with dm-multipath it
achieved nothing except an even longer wait until IO can be failed. If
there is a connection issue, then all SCSI disks from that target are
affected and not only a single SCSI device. Today I've seen that you are
at v1.3 already and that patch didn't apply anymore. So I thought
disabling only the functionality shows what I'm trying to do here.

Can you please explain to me what your intention was with that
fast_io_fail_tmo?
What I want to have is a calculable timeout for IO failing. If the QP
retries are at 7 I can't get any lower than 35 seconds.

 Regarding patches 4 and 5: I'm not sure whether reducing the
 QP retry count will work well in large fabrics.

For me it is already a mystery why I measure 35 seconds at 2s QP timeout
and 7 retries. If the maximum is at 2s * 7 retries * 4, then I'm at almost
60 seconds. That's simply too long. The fast_io_fail_tmo comes on top of
that. How else should I reduce the overall timeout until I see in iostat
that the other path is taken?

 The iSCSI initiator
 follows another approach to realize quick failover, namely by
 periodically checking the transport layer and by triggering the
 fast_io_fail timer if that check fails. Unfortunately the SRP spec does
 not define an operation suited as a transport layer test. But maybe a
 zero-length RDMA write can be used to verify the transport layer ?

Hmmm, how do you want to implement that? This write would run into
(overall) QP timeout as well, I guess. The dm-multipath checks paths
with directio reads by polling every 5 seconds by default. IMHO this
does exactly that.

 I think the IB specification allows such operations. A quote from page 439:
 
 C9-88: For an HCA responder using Reliable Connection service, for
 each zero-length RDMA READ or WRITE request, the R_Key shall not be
 validated, even if the request includes Immediate data.

And this isn't bound to the (overall) QP timeout? Can you send me a
proof of concept for this?

 Note: I'm still working on transforming the patches present in the
 ib_srp-backport repository such that these become acceptable for
 upstream inclusion.

I know that and I appreciate that. But I'm running out of time. Perhaps,
we can combine some efforts to implement something working first.
Doesn't have to be clean and shiny. For me also hacky is okay as long as
it works in the data center.

Yes, I have to admit that patches 4 and 5 are hacky. Perhaps I can
report to you soon how reducing the retry count behaves in a large
setup. ;-)

Cheers,
Sebastian


Re: [PATCH/RFC] IPoIB: Free ipoib neigh on path record failure so path rec queries are retried

2013-02-27 Thread Sebastian Riemer
On 26.02.2013 17:55, Roland Dreier wrote:
[...]
 In fact I bet this is why the bug has been there as long as it has
 been: almost no one is using IPv6 on IPoIB seriously, and IPv4 should
 work OK as you point out.

Thanks a lot! Unfortunately, we are using IPoIB with IPv6 in
production for the inter-VM and the internet gateway traffic. We had an
issue last Friday where one of our gateway machines wasn't reachable
anymore and pings didn't come through. We had to reload the ib_ipoib
modules of nearly every server connected to that gateway. I'm so glad
that we use SRP - all SRP connections survived.

I don't have the time to care for IPoIB as well at ProfitBricks so I
will encourage our networking teams to get their hands on IPoIB and
linux-rdma communication.

How to verify if we've hit this bug? How to reproduce this bug?

Cheers,
Sebastian


Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time

2013-02-08 Thread Sebastian Riemer
On 08.02.2013 10:24, Sagi Grimberg wrote:
 On 2/8/2013 12:42 AM, Vu Pham wrote:
 Hello Bart,

 Thank you for taking the initiative.
 Mellanox think that this should be discussed. We'd be happy to attend.

 We also would like to discuss:
 * How and how fast does SRP detect a path failure besides RC error?
 * Role of srp_daemon, how often srp_daemon scan fabric for new/old
 targets, how-to scale srp_daemon discovery, traps.

 -vu
 Hey Bart,
 
 I agree with Vu that this issue should be discussed. We'd be happy to
 attend.
 
 -- 
 Sagi

Wow, also thanks to Mellanox for spending resources on SRP as well! Last
year in June we came across a very different situation.

Cheers,
Sebastian and the ProfitBricks storage team


Re: Virtual ibnetdiscover command fails

2013-02-06 Thread Sebastian Riemer
On 06.02.2013 10:22, Or Gerlitz wrote:
 On 06/02/2013 11:17, Mathis GAVILLON wrote:
 Ok. But what is it possible to do with Infiniband VFs if QP0 is not
 available ?
 
 EVERYTHING, e.g run IPoIB, iSER, RDS, MPI, etc, etc - except for what
 requires QP0, such as running SM or issuing SMPs for
 discovery/diagnostics purposes

But I've heard that SRP isn't available with SR-IOV. Is it just a matter
of software or is it a matter of firmware/hardware?



Re: Virtual ibnetdiscover command fails

2013-02-06 Thread Sebastian Riemer
On 06.02.2013 11:20, Or Gerlitz wrote:
 On 06/02/2013 12:04, Mathis GAVILLON wrote:
 Just a last question : is that possible VFs lid to be different from
 PF one ?
 
 NO, we've implemented a shared port model, so all functions on the
 same IB port use the same lid, each function has its own
 virtual GUID though.

So if I don't use the unmaintained srptools to get the SRP connection
strings but instead send them directly to the initiator to connect to
the SRP target, then also SRP should be possible with the virtual GUID.
Am I right?

Cheers,
Sebastian


Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time

2013-02-04 Thread Sebastian Riemer
Hi Bart,

thanks for approaching this! We're not the best mainline developers so I
guess we won't be there. But we have the big SRP setups and our
sysadmins really don't like reconnecting SRP hosts manually and
laboriously adding their devices to the related dm-multipath devices again.

Think about > 200 SRP devices per server (already filtered by initiator
groups). We also consider the srptools as unmaintained, unreliable and
slow. It is possible that the srptools commands don't return. Therefore,
we send the SRP connection strings directly to the initiator within our
mapping jobs.
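
Roughly, sending a connection string directly to the initiator looks like
the sketch below. The add_target file and the id_ext, ioc_guid, dgid, pkey
and service_id keys are the ib_srp sysfs interface. The helper name, the
srp-mlx4_0-1 host name and all values are placeholders for illustration
only.

```shell
#!/bin/sh
# Build an ib_srp connection string from its five mandatory fields.
srp_target_string() {
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values only; real ones come from the target configuration.
conn=$(srp_target_string 0002c90300000001 0002c90300000001 \
    fe800000000000000002c90300000002 ffff 0002c90300000001)
echo "$conn"

# On a host with ib_srp loaded (needs root), the string would be written
# to the add_target file of the local HCA port, for example:
#   echo "$conn" > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
```

Done this way, no srptools command is involved at connect time, so a
hanging discovery tool cannot block the mapping job.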

It would also be great not to end up with a reconnect storm that acts like
a DDoS attack, as happens with open-iscsi. Rebooting the whole cluster to
fix this isn't fun. There must be a possibility to configure different
reconnect intervals.

Btw.: With iSER we even had the case that the IPoIB stuff reconnected but
the RDMA part didn't. It was so broken then that we couldn't disconnect or
reconnect anymore. A hard reboot was the only option.

So you know our point of view and we are already developing it that way
for ourselves. I'm looking forward to the outcome of the discussion. At the
current state it's difficult to nag our bosses to publish what we have so far.

On 01.02.2013 14:43, Bart Van Assche wrote:
 It is known that it takes about two to three minutes before the upstream
 SRP initiator fails over from a failed path to a working path. This is
 not only considered longer than acceptable but is also longer than other
 Linux SCSI initiators (e.g. iSCSI and FC). Progress so far with
 improving the fail-over SRP initiator has been slow. This is because the
 discussion about candidate patches occurred at two different levels: not
 only the patches itself were discussed but also the approach that should
 be followed. That last aspect is easier to discuss in a meeting than
 over a mailing list. Hence the proposal to discuss SRP initiator
 failover behavior during the LSF/MM summit. The topics that need further
 discussion are:
 * If a path fails, remove the entire SCSI host or preserve the SCSI
   host and only remove the SCSI devices associated with that host ?

Preserve SCSI hosts and SCSI devices unless they are removed explicitly
by disconnect request. Rescanning SCSI devices with - - - like
iscsiadm -R does for example may reorder the device names (sda becomes
sdb, etc.).

 * Which software component should test the state of a path and should
   reconnect to an SRP target if a path is restored ? Should that be
   done by the user space process srp_daemon or by the SRP initiator
   kernel module ?

By the SRP kernel module. This is exactly the big advantage of SRP so
far: It is simple, it is RDMA and kernel only.

 * How should the SRP initiator behave after a path failure has been
   detected ? Should the behavior be similar to the FC initiator with
   its fast_io_fail_tmo and dev_loss_tmo parameters ?

Fine for us as long as it is possible to configure such times and the
behavior at all. For dm-multipath we need fast IO failing and that the
SRP initiator tries to automatically reconnect that path.

Cheers,
Sebastian


Re: [ANNOUNCE] OFED-3.5-rc2 is available

2012-10-04 Thread Sebastian Riemer
Hi Vladimir,

why do you put OFED together for a kernel nobody uses? Perhaps SLES and
Red Hat do it like this but nobody else.

Have a look at http://en.wikipedia.org/wiki/Linux_kernel - 3.0, 3.2 and
3.4 are the long-term stable releases.

This approach is worse than the approach before IMHO. Since 1.5.4.1
there is no real stable release. No wonder that everyone puts his own
OFED together.

I'm so glad that we don't need too much OFED user space and we just use
the IB stuff from the mainline kernel. We need to surf close to mainline
kernel development anyway.

The only thing that we would need is a list of which package in which
version matches our mainline kernel. Updating it from Git directly from
a tag or branch with no external patches or SRPM/tar.gz stuff would make
it much easier - also for other distributions. Things are getting better
for us with the switch-over to Gentoo. There, we can leave out the
packages that we don't actually need.

Cheers,
Sebastian


On 03.10.2012 17:54, Vladimir Sokolovsky wrote:
 Hi,
 OFED 3.5-rc2 is available.

 The tarball is available on:
 http://www.openfabrics.org/downloads/OFED/ofed-3.5/OFED-3.5-rc2.tgz

 To get BUILD_ID run ofed_info

 Please report any issues in bugzilla https://bugs.openfabrics.org/ for
 OFED 3.5

 Regards,
 Vladimir


 OFED-3.5-rc2 Main Changes from OFED 3.5-rc1
 ---

 compat-rdma: Add SRP backport
 compat-rdma: /etc/init.d/openibd: Fix LSB header
 compat-rdma: IB/qib: linux-3.6 patches backported
 compat-rdma: iw_cxgb4: Fix bug 2369 in OFED bugzilla
 compat-rdma: IB/qib: fix compliance regression in 3.5
 compat-rdma: RDMA/nes: Added linux-next-pending patches
 compat-rdma: RDMA/nes: Updated backports
 compat-rdma: NFSRDMA RHEL6.3 backport
 compat-rdma: NFSRDMA SLES11SP2 backport
 compat-rdma: linux-next-cherry-picks: RDMA/ucma.c: Different fix for
 ucma context uid=0, causing iWarp RDMA applications to fail in
 connection establishment

 Updated packages:
 infinipath-psm-3.0.1-115.1015_open
 perftest-1.4.0-0.80.gd1763bd
 qperf-0.4.7-0.2.gf3f7001

 Supported Platforms and Operating Systems
 -
   o   CPU architectures:
 - x86_64
 - x86
 - ppc64
 - ia64

   o   Linux Operating Systems:
 - RedHat EL6.2  2.6.32-220.el6
 - RedHat EL6.3  2.6.32-279.el6
 - SLES11 SP2    3.0.13-0.27-default
 - kernel.org    3.5*

   * Minimal QA for these versions.


 OFED_release_notes.txt

 Note: See the release notes of each component for additional issues.



Re: [PATCH 11/20] ib_srp: Make srp_disconnect_target() wait for IB completions

2012-08-23 Thread Sebastian Riemer
Hi Bart,

we've triggered the WARN_ON() in srp_wait_last_send_wqe() by connecting
to a disabled SCST SRP target.

I would remove that one.

Cheers,
Sebastian

 
On 09.08.2012 17:53, Bart Van Assche wrote:
 Modify srp_disconnect_target() such that it waits until it is
 sure that no new IB completions will be received anymore.

 Signed-off-by: Bart Van Assche bvanass...@acm.org
 Cc: David Dillow dillo...@ornl.gov
 Cc: Roland Dreier rol...@purestorage.com
 ---
  drivers/infiniband/ulp/srp/ib_srp.c |  104 
 ++-
  drivers/infiniband/ulp/srp/ib_srp.h |6 ++
  2 files changed, 95 insertions(+), 15 deletions(-)

 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
 index 0e7825a..4de7c46 100644
 --- a/drivers/infiniband/ulp/srp/ib_srp.c
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -40,7 +40,7 @@
  #include <linux/parser.h>
  #include <linux/random.h>
  #include <linux/jiffies.h>
 -
 +#include <linux/delay.h>
  #include <linux/atomic.h>
  
  #include <scsi/scsi.h>
 @@ -229,14 +229,16 @@ static int srp_create_target_ib(struct srp_target_port *target)
  		return -ENOMEM;
  
  	target->recv_cq = ib_create_cq(target->srp_host->srp_dev->dev,
 -				       srp_recv_completion, NULL, target, SRP_RQ_SIZE, 0);
 +				       srp_recv_completion, NULL, target,
 +				       SRP_RQ_SIZE + 1, 0);
  	if (IS_ERR(target->recv_cq)) {
  		ret = PTR_ERR(target->recv_cq);
  		goto err;
  	}
  
  	target->send_cq = ib_create_cq(target->srp_host->srp_dev->dev,
 -				       srp_send_completion, NULL, target, SRP_SQ_SIZE, 0);
 +				       srp_send_completion, NULL, target,
 +				       SRP_SQ_SIZE + 1, 0);
  	if (IS_ERR(target->send_cq)) {
  		ret = PTR_ERR(target->send_cq);
  		goto err_recv_cq;
 @@ -245,8 +247,8 @@ static int srp_create_target_ib(struct srp_target_port *target)
  	ib_req_notify_cq(target->recv_cq, IB_CQ_NEXT_COMP);
  
  	init_attr->event_handler       = srp_qp_event;
 -	init_attr->cap.max_send_wr     = SRP_SQ_SIZE;
 -	init_attr->cap.max_recv_wr     = SRP_RQ_SIZE;
 +	init_attr->cap.max_send_wr     = SRP_SQ_SIZE + 1;
 +	init_attr->cap.max_recv_wr     = SRP_RQ_SIZE + 1;
  	init_attr->cap.max_recv_sge    = 1;
  	init_attr->cap.max_send_sge    = 1;
  	init_attr->sq_sig_type         = IB_SIGNAL_ALL_WR;
 @@ -460,11 +462,69 @@ static bool srp_change_conn_state(struct srp_target_port *target,
  	return changed;
  }
  
 +static void srp_wait_last_recv_wqe(struct srp_target_port *target)
 +{
 +	static struct ib_recv_wr wr = {
 +		.wr_id = SRP_LAST_RECV,
 +	};
 +	struct ib_recv_wr *bad_wr;
 +	int ret;
 +
 +	if (target->last_recv_wqe)
 +		return;
 +
 +	ret = ib_post_recv(target->qp, &wr, &bad_wr);
 +	if (ret < 0) {
 +		shost_printk(KERN_ERR, target->scsi_host,
 +			     "ib_post_recv() failed (%d)\n", ret);
 +		return;
 +	}
 +
 +	ret = wait_event_timeout(target->qp_wq, target->last_recv_wqe,
 +				 target->rq_tmo_jiffies);
 +	WARN(ret <= 0, "Timeout while waiting for last recv WQE (ret = %d)\n",
 +	     ret);
 +}
 +
 +static void srp_wait_last_send_wqe(struct srp_target_port *target)
 +{
 +	static struct ib_send_wr wr = {
 +		.wr_id = SRP_LAST_SEND,
 +	};
 +	struct ib_send_wr *bad_wr;
 +	unsigned long deadline = jiffies + target->rq_tmo_jiffies;
 +	int ret;
 +
 +	if (target->last_send_wqe)
 +		return;
 +
 +	ret = ib_post_send(target->qp, &wr, &bad_wr);
 +	if (ret < 0) {
 +		shost_printk(KERN_ERR, target->scsi_host,
 +			     "ib_post_send() failed (%d)\n", ret);
 +		return;
 +	}
 +
 +	while (!target->last_send_wqe && time_before(jiffies, deadline)) {
 +		srp_send_completion(target->send_cq, target);
 +		msleep(20);
 +	}
 +
 +	WARN_ON(!target->last_send_wqe);

<-- here it is - remove it

 +}
 +
  static void srp_disconnect_target(struct srp_target_port *target)
  {
 +	static struct ib_qp_attr qp_attr = {
 +		.qp_state = IB_QPS_ERR
 +	};
 +	int ret;
 +
  	if (srp_change_conn_state(target, false)) {
  		/* XXX should send SRP_I_LOGOUT request */
  
 +		BUG_ON(!target->cm_id);
 +
  		init_completion(&target->done);
  		if (ib_send_cm_dreq(target->cm_id, NULL, 0)) {
  			shost_printk(KERN_DEBUG, target->scsi_host,
 @@ -473,6 +533,20 @@ static void srp_disconnect_target(struct srp_target_port *target)
  			wait_for_completion(&target->done);
  		}
  	}
 +
 +	if (target->cm_id) {
 +		ib_destroy_cm_id(target->cm_id);
 +		target->cm_id = NULL;
 +	}
 +
 +	if

Basics of congestion control?

2012-07-31 Thread Sebastian Riemer
Hi all,

could someone please explain what I can do with the new congestion control?

Do I understand it right that I can influence the flow control (e.g.
amount of credits) with it so that I can avoid disruption (XmitWait,
XmitDiscardedPackets) caused by congestion?

This is at least what we need. ;-)

Cheers,
Sebastian

-- 
Sebastian Riemer
Linux Kernel Developer

ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany
www.profitbricks.com • sebastian.rie...@profitbricks.com
Tel.: +49 - 30 - 60 98 56 991 - 915

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Andreas Gauger, Achim Weiss



Re: Basics of congestion control?

2012-07-31 Thread Sebastian Riemer
On 31.07.2012 13:08, Alex Netes wrote:
 Congestion control isn't a credit based mechanism. While InfiniBand flow
 control is defined between two ports of the same link, congestion control is
 working across the fabric between a congestion point (a switch) and a reaction
 point (source node). Reaction point implements a Congestion Control Table that
 contains an array of values of injection rate delay used to control
 congestion. You can find more information in the IBTA LWG Errata document
 3Q2010.

Nice, thank you very much!

I've found the IBTA spec and the errata.

Cheers,
Sebastian


Re: mlx4_ib_create_qp failed - OOM with call trace

2012-07-20 Thread Sebastian Riemer
On 19.07.2012 22:31, Roland Dreier wrote:
 I have to think about the best way to fix this.  We could just
 convert to vmalloc() here but I'm not thrilled about consuming
 vmalloc() space (on modern 64-bit architectures it's a non-issue
 but it's going to cause issues for people on smaller systems).

This is at least something we can implement and test for ourselves, as we
only have modern server systems.

Thank you very much for the information and your help!

Cheers,
Sebastian


mlx4_ib_create_qp failed - OOM with call trace

2012-07-18 Thread Sebastian Riemer
:80kB high:96kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15644kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[5416523.204866] lowmem_reserve[]: 0 3495 16095 16095
[5416523.204868] Node 0 DMA32 free:77448kB min:14664kB low:18328kB
high:21996kB active_anon:0kB inactive_anon:104kB active_file:1491480kB
inactive_file:1526412kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:3579648kB mlocked:0kB dirty:1176kB
writeback:0kB mapped:180kB shmem:0kB slab_reclaimable:301820kB
slab_unreclaimable:97804kB kernel_stack:20768kB pagetables:308kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[5416523.204875] lowmem_reserve[]: 0 0 12600 12600
[5416523.204877] Node 0 Normal free:72596kB min:52852kB low:66064kB
high:79276kB active_anon:4636kB inactive_anon:17712kB
active_file:5848236kB inactive_file:5906316kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:12902400kB mlocked:0kB
dirty:12500kB writeback:4kB mapped:4196kB shmem:6584kB
slab_reclaimable:258216kB slab_unreclaimable:187976kB
kernel_stack:15456kB pagetables:1340kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:36 all_unreclaimable? no
[5416523.204883] lowmem_reserve[]: 0 0 0 0
[5416523.204885] Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 2*64kB 1*128kB
1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15900kB
[5416523.204890] Node 0 DMA32: 1923*4kB 2433*8kB 3088*16kB 16*32kB
3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 78164kB
[5416523.204895] Node 0 Normal: 14590*4kB 1263*8kB 56*16kB 1*32kB 1*64kB
0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 73296kB
[5416523.204900] 3698022 total pagecache pages
[5416523.204901] 3249 pages in swap cache
[5416523.204903] Swap cache stats: add 2384957, delete 2381708, find
8582137/8898669
[5416523.204904] Free swap = 3902556kB
[5416523.204904] Total swap = 3926012kB
[5416523.238116] 4194288 pages RAM
[5416523.238117] 87642 pages reserved
[5416523.238118] 3673827 pages shared
[5416523.238119] 363584 pages non-shared
[5416523.238323] ib_srpt: ***ERROR***: failed to create_qp ret= -12
[5416523.238381] ib_srpt: ***ERROR***: rejected SRP_LOGIN_REQ because
creating a new RDMA channel failed.
[5416523.238393] ib_srpt: Rejecting login with reason 0x10001
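For anyone skimming the dump above: the per-order free lists are the part that matters for a failed kernel allocation. Below is a minimal Python sketch that re-adds the "Node 0 Normal" line and shows how little of the free memory sits in larger contiguous blocks. The helper name `parse_buddy_line` and the 64 kB cut-off are my own illustrative choices, and I am only assuming that the failing request needed a large contiguous chunk.

```python
import errno
import re


def parse_buddy_line(line):
    """Return {block_size_kB: count} for one 'Node N Zone: a*4kB b*8kB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


# The "Node 0 Normal" line from the log above, with the mail line wrap removed.
NORMAL = ("Node 0 Normal: 14590*4kB 1263*8kB 56*16kB 1*32kB 1*64kB "
          "0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 73296kB")

blocks = parse_buddy_line(NORMAL)

# Total free memory in the zone; this should match the "= 73296kB" in the log.
total_kb = sum(size * count for size, count in blocks.items())

# Free memory held in blocks of at least 64 kB (arbitrary illustrative threshold).
large_kb = sum(size * count for size, count in blocks.items() if size >= 64)

print(total_kb, large_kb)

# "ret= -12" in the ib_srpt message is the negated errno value for ENOMEM.
print(errno.errorcode[12])
```

Most of the roughly 73 MB free in that zone is in 4 kB and 8 kB pieces, so a zone can look far from empty and still have very few large blocks to hand out.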

Cheers,
Sebastian

-- 
Sebastian Riemer
Linux Kernel Developer

ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany
www.profitbricks.com • sebastian.rie...@profitbricks.com

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Andreas Gauger, Achim Weiss



Re: OFED 1.5.4.1 on Ubuntu 10.04 with Mellanox cards?

2012-06-25 Thread Sebastian Riemer
Hi Chet,

On 22/06/12 21:02, Chet Murthy wrote:
 
 Sebastian,
 
 Thank you for taking the time to explain these things!  It's a little
 confusing 
 
 Here a simple list of matching code:
 OFED-1.5.4  --- kernel 3.2.x
 OFED-1.5.4.1 --- kernel 3.3.x
 
 (1) Is there a more-exhaustive list of the right kernel to use with
 each OFED release?  I was going by the OFED docs (e.g. release notes),
 which seemed to indicate that for 1.5.4.1, the right range of
 kernels was (kernel.org: 2.6.30 - 3.1), and specific kernel versions
 for various distros.

Unfortunately, there is no more exhaustive list for matching the kernel
code with the OFED user space. It's a matter of comparing dates - kernel
release and OFED release.
O.K., here is how they put the OFA kernel code into OFED:
- kernel developers develop for the latest kernel release cycle (here 3.3)
- OFED packagers use an older kernel as the basis (2.6.30) and forward-port
the OFA kernel stuff to the current kernel release (here 3.3) by patches
for kernels (2.6.30..3.1) - this leaves room for errors (e.g. that
they don't port the open-iscsi kernel code correctly)
- this is why they say that they don't support the mainline kernels
completely

We at ProfitBricks need the latest kernels anyway. This is why we take the
matching code from upstream (OFA kernel stuff from kernel.org). That way we
don't have to build the OFA kernel modules out-of-tree, which simplifies our
kernel build chain. We have OFED-1.5.4 with OFA kernel code from kernel
3.2 at the moment.

But there is also a new OFED release approach:
Perhaps you've seen the OFED-3.2 already?! This is the OFED especially
for kernel 3.2. This makes it easier to match OFED user space and kernel
code. Here they just backport the OFA kernel stuff e.g. from 3.4 to 3.2.
Looks promising, but I have no experience with that, yet.

 (2) I'm pretty familiar with administering Debian systems and building
 debian packages, hacking their insides, alienizing, hacking that
 process, etc.
 
 (I -think- ;-) The only real question for me is, which versions, with
 which patches, of the various bits, will work together with this RoCEE
 card.

Your issue could lie in the shell scripts, in the kernel-code-to-user-space
matching, or simply in opensm not running. Without a running instance of
a subnet manager your card will get no LID assigned, no partition key,
etc. IPoIB, MPI, iSER, SRP, etc. won't work.
Check with ibdiagnet -r if your master subnet manager is running. IB
is self-managed by the subnet manager. Make sure that your opensm
configuration is correct.
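To make the "is the port up and does it have a LID" check a bit more concrete, here is a small Python sketch that parses ibstatus-style text. The field names are copied from the ibstatus output quoted in this thread; the function names `parse_ibstatus` and `ports_not_active` are made up for illustration, and the sample is abbreviated.

```python
import re

HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
FIELD = re.compile(r"^\s*([a-z_ ]+?):\s*(.*)$")


def parse_ibstatus(text):
    """Return {(device, port): {field: value}} from ibstatus-style output."""
    ports = {}
    current = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            current = {}
            ports[(header.group(1), int(header.group(2)))] = current
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2).strip()
    return ports


def ports_not_active(ports):
    """List the (device, port) pairs whose logical state is not ACTIVE."""
    return [key for key, fields in sorted(ports.items())
            if "ACTIVE" not in fields.get("state", "")]


SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
    base lid:    0x0
    sm lid:      0x0
    state:       1: DOWN
    phys state:  3: Disabled
    link_layer:  Ethernet

Infiniband device 'mlx4_0' port 2 status:
    base lid:    0x0
    sm lid:      0x0
    state:       4: ACTIVE
    phys state:  5: LinkUp
    link_layer:  Ethernet
"""

ports = parse_ibstatus(SAMPLE)
print(ports_not_active(ports))
```

Note that a LID of 0x0 only points to a missing subnet manager on ports whose link_layer is InfiniBand; in the sample, port 2 is ACTIVE with LID 0x0 on an Ethernet link layer.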

We have big deployments and don't want to have rpm installed on Debian
systems. This is why we've taken the OFED-1.5.2 stuff from Debian
experimental from pkg-ofed. We've converted the SVN stuff into git
repos for OFED, imported the OFED-1.5.4 upstream code and adapted the
modifications by Debian (e.g. shell code changes). Now, we can build
OFED with git-buildpackage and upload the deb packages to our Debian
repository.

 (3) I'm -not at all- familiar with the workflow/process that Debian
 Developers use.  For instance, I don't really understand what you mean
 below:
 
 But you'll have to ensure that the kernel code matches the OFED user
 space. The kernel stuff included in OFED doesn't support latest kernels
 and is based on an older code base (e.g. OFED 1.5.4 kernel stuff is
 based on 2.6.30).
 
 Do you mean that the kernel-ib RPM in 1.5.4 is the code from the
 2.6.30 kernel?  But then the list below doesn't seem to make sense.
 
 Here a simple list of matching code:
 OFED-1.5.4  --- kernel 3.2.x
 OFED-1.5.4.1 --- kernel 3.3.x

I've explained this above.

 (4) I think what you're saying here
 
 the trick is to check out the latest pkg-ofed source from debian SVN
 (svn://svn.debian.org/svn/pkg-ofed/) and to update the upstream source
 by merging the stuff by extracting the source RPMs or even better by
 importing the source directly from the git repos of the OFED user space.
 In the debian directory there are some patches e.g. which change some
 stuff in shell scripts for the dash. These need to be adopted.
 
 is:
 
   (a) check out the stuff from svn.debian.org
 
   (b) pull source from the OFED repos user-space
 
   (c) -copy- that (latest) OFED source into the tree I checked-out
   from debian
 
   (d) make sure that the patches in the debian directories apply
   properly to the various shellscripts
 
   (e) build debian packages per usual
 
 And per your instructions above, I believe you're saying I should be
 using a 3.3.x kernel?

Yes, this is exactly what I would suggest to you if you want to have a
really working solution without rpm. You should at least have a look
at this or try it to see if this fixes your issues and if this gives you
advantages.

Cheers,
Sebastian


Re: OFED 1.5.4.1 on Ubuntu 10.04 with Mellanox cards?

2012-06-22 Thread Sebastian Riemer
Hi Chet,

the trick is to check out the latest pkg-ofed source from debian SVN
(svn://svn.debian.org/svn/pkg-ofed/) and to update the upstream source
by merging the stuff by extracting the source RPMs or even better by
importing the source directly from the git repos of the OFED user space.
In the debian directory there are some patches which, e.g., change some
stuff in shell scripts for dash. These need to be adapted.

But you'll have to ensure that the kernel code matches the OFED user
space. The kernel stuff included in OFED doesn't support latest kernels
and is based on an older code base (e.g. OFED 1.5.4 kernel stuff is
based on 2.6.30). I hope that you don't need iSER. The open-iscsi kernel
stuff in there is also based on 2.6.30 which means that you would need
old open-iscsi user space.

This is why we've decided to follow what they call upstream in this
list. This means: Use the OFED kernel code from the matching vanilla
kernel from kernel.org.

Here is a simple list of matching code:
OFED-1.5.4  --- kernel 3.2.x
OFED-1.5.4.1 --- kernel 3.3.x

I've attached the IB user space HOWTO from Or Gerlitz for the git repos.
Some of the git repos already have a debian directory.

Do you know how to build Debian packages?

Cheers,
Sebastian


On 22/06/12 02:46, Chet Murthy wrote:
 
 Hi,
 
 A long while ago, I got OFED 1.5.2 working on Ubuntu 10.04 (Lucid) on
 Opterons with Mellanox DDR cards.  It was a little messy, getting the
 RPMs compiled, but it was pretty straightforward.  Basically, I (a)
 built a kernel with neither infiniband nor mellanox ethernet drivers,
 and (b) ran the OFED install.pl with some minor modifications to
 convert the RPMs into DEBs as they were built.  And everything worked,
 smooth as a whistle.
 
 Today, I tried to do the same thing with OFED 1.5.4.1, and while the
 process of -building- was straightforward, once I get done, the card's
 state is all zeroes:
 
 chet@memstore3:~$ sudo ibstatus
 Infiniband device 'mlx4_0' port 1 status:
 default gid: :::::::
 base lid:0x0
 sm lid:  0x0
 state:   1: DOWN
 phys state:  3: Disabled
 rate:2.5 Gb/sec (1X)
 link_layer:  Ethernet
 
 Infiniband device 'mlx4_0' port 2 status:
 default gid: :::::::
 base lid:0x0
 sm lid:  0x0
 state:   1: DOWN
 phys state:  3: Disabled
 rate:2.5 Gb/sec (1X)
 link_layer:  Ethernet
 
 The card's a modern ConnectX
 
 1f:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 
 10GigE, PCIe 2.0 5GT/s] (rev b0)
 
 and on identical RedHat machines, the card's status is quite
 different:
 
 
 [root@memstore4 chet]# ibstatus
 Infiniband device 'mlx4_0' port 1 status:
 default gid: fe80::::0202:c9ff:fe4b:5890
 base lid:0x0
 sm lid:  0x0
 state:   1: DOWN
 phys state:  3: Disabled
 rate:10 Gb/sec (1X QDR)
 link_layer:  Ethernet
 
 Infiniband device 'mlx4_0' port 2 status:
 default gid: fe80::::0202:c9ff:fe4b:5891
 base lid:0x0
 sm lid:  0x0
 state:   4: ACTIVE
 phys state:  5: LinkUp
 rate:10 Gb/sec (1X QDR)
 link_layer:  Ethernet
 
 I'm not even sure how to go about debugging this.  Has anybody gotten
 OFED to work on Ubuntu with such modern cards?
 
 Thanks,
 --chet--
 
 
   IB user space HOWTO

June 2012

Or Gerlitz ogerl...@mellanox.com

This little note attempts to get you through how to get the upstream 
user-space IB packages, specifically libibverbs/libmlx4/librdmacm and/or 
opensm and the IB diags.

Under Fedora / RHEL, installing the INBOX user-space IB/RDMA offering is as
easy as

# yum groupinstall "Infiniband Support"

The IB service is called rdma (vs. openibd which used to be the name in older 
RHEL/Fedora
releases) and there is an rpm named rdma with various scripts. Note that this 
will 
not install opensm/diags (see below).

If you are seeking the latest RELEASE done by the maintainers, it's also
trivial:
the releases are provided in the form of tarballs which you plug into
rpmbuild -ts and you have a fresh source RPM to build and later install.

Going more hackish, you would need to build the sources from the maintainers'
git. The git trees contain spec files, so the process would be to create
the tarballs and then repeat the rpmbuild exercise.

See below links to where there are tarball releases and the git trees where 
here gitweb links are provided, they have the git 

Re: IB/iSER problems with Linux 3.0

2012-01-19 Thread Sebastian Riemer
On 17/01/12 15:56, Or Gerlitz wrote:
 could you try and patch your 3.0.15 kernel with commit
 52439540ea30396982b69662dd21aede6b336288 IB/iser: DMA unmap TX bufs
 used for iSCSI/iSER headers from upstream, this could help here.

Hi Or,

unfortunately, just cherry-picking that commit didn't do the job.
Therefore, I've backported the whole ib_iser code from 3.2.1 to 3.0.15.
Now it works fine. I've attached the git diff.

Should I test anything further?

Thanks for your time again!

Cheers,
Sebastian
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 8db008d..daf293c 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -57,6 +57,7 @@
 #include <linux/scatterlist.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/module.h>
 
 #include <net/sock.h>
 
@@ -101,13 +102,17 @@ iscsi_iser_recv(struct iscsi_conn *conn,
 
 	/* verify PDU length */
 	datalen = ntoh24(hdr->dlength);
-	if (datalen != rx_data_len) {
-		printk(KERN_ERR "iscsi_iser: datalen %d (hdr) != %d (IB) \n",
-		   datalen, rx_data_len);
+	if (datalen > rx_data_len || (datalen + 4) < rx_data_len) {
+		iser_err("wrong datalen %d (hdr), %d (IB)\n",
+			datalen, rx_data_len);
 		rc = ISCSI_ERR_DATALEN;
 		goto error;
 	}
 
+	if (datalen != rx_data_len)
+		iser_dbg("aligned datalen (%d) hdr, %d (IB)\n",
+			datalen, rx_data_len);
+
 	/* read AHS */
 	ahslen = hdr->hlength * 4;
 
@@ -147,7 +152,6 @@ int iser_initialize_task_headers(struct iscsi_task *task,
 	tx_desc->tx_sg[0].length = ISER_HEADERS_LEN;
 	tx_desc->tx_sg[0].lkey   = device->mr->lkey;
 
-	iser_task->headers_initialized	= 1;
 	iser_task->iser_conn		= iser_conn;
 	return 0;
 }
@@ -162,8 +166,7 @@ iscsi_iser_task_init(struct iscsi_task *task)
 {
 	struct iscsi_iser_task *iser_task = task->dd_data;
 
-	if (!iser_task->headers_initialized)
-		if (iser_initialize_task_headers(task, iser_task->desc))
+	if (iser_initialize_task_headers(task, iser_task->desc))
 			return -ENOMEM;
 
 	/* mgmt task */
@@ -274,6 +277,13 @@ iscsi_iser_task_xmit(struct iscsi_task *task)
 static void iscsi_iser_cleanup_task(struct iscsi_task *task)
 {
 	struct iscsi_iser_task *iser_task = task->dd_data;
+	struct iser_tx_desc	*tx_desc = iser_task->desc;
+
+	struct iscsi_iser_conn *iser_conn = task->conn->dd_data;
+	struct iser_device *device= iser_conn->ib_conn->device;
+
+	ib_dma_unmap_single(device->ib_device,
+		tx_desc->dma_addr, ISER_HEADERS_LEN, DMA_TO_DEVICE);
 
 	/* mgmt tasks do not need special cleanup */
 	if (!task->sc)
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 2f02ab0..db7ea37 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -45,6 +45,7 @@
 #include <scsi/libiscsi.h>
 #include <scsi/scsi_transport_iscsi.h>
 
+#include <linux/interrupt.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/list.h>
@@ -88,7 +89,7 @@
 	} while (0)
 
 #define SHIFT_4K	12
-#define SIZE_4K	(1UL << SHIFT_4K)
+#define SIZE_4K	(1ULL << SHIFT_4K)
 #define MASK_4K	(~(SIZE_4K-1))
 
 	/* support up to 512KB in one RDMA */
@@ -256,7 +257,8 @@ struct iser_conn {
 	struct list_head	 conn_list;   /* entry in ig conn list */
 
 	char  			 *login_buf;
-	u64 			 login_dma;
+	char			 *login_req_buf, *login_resp_buf;
+	u64			 login_req_dma, login_resp_dma;
 	unsigned int 		 rx_desc_head;
 	struct iser_rx_desc	 *rx_descs;
 	struct ib_recv_wr	 rx_wr[ISER_MIN_POSTED_RX];
@@ -276,7 +278,6 @@ struct iscsi_iser_task {
 	struct iser_regd_buf rdma_regd[ISER_DIRS_NUM];/* regd rdma buf */
 	struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data des*/
 	struct iser_data_buf data_copy[ISER_DIRS_NUM];/* contig. copy  */
-	int  headers_initialized;
 };
 
 struct iser_page_vec {
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 95a08a8..b53b04d 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -221,8 +221,14 @@ void iser_free_rx_descriptors(struct iser_conn *ib_conn)
 	struct iser_device *device = ib_conn->device;
 
 	if (ib_conn->login_buf) {
-		ib_dma_unmap_single(device->ib_device, ib_conn->login_dma,
-			ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE);
+		if (ib_conn->login_req_dma)
+			ib_dma_unmap_single(device->ib_device,
+				ib_conn->login_req_dma,
+				ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE);
+		if (ib_conn->login_resp_dma)
+			ib_dma_unmap_single(device->ib_device,
+				ib_conn->login_resp_dma,
+				ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE);
 		kfree(ib_conn->login_buf);
 	}
 	}
 
@@ -394,6 +400,7 @@ int iser_send_control(struct iscsi_conn *conn,
 	unsigned long data_seg_len;
 	int err = 0;
 	struct iser_device *device;
+	struct iser_conn *ib_conn = iser_conn->ib_conn;
 
 	/* build the tx desc regd header and add it to the tx desc dto */
 	mdesc->type = 

Solved: IB/iSER problems with Linux 3.0

2012-01-19 Thread Sebastian Riemer
On 19/01/12 13:18, Or Gerlitz wrote:
 [...]
 Or Gerlitz (4):
   IB/iser: Fix wrong mask when sizeof (dma_addr_t) > sizeof
 (unsigned long)
   IB/iser: Support iSCSI PDU padding
   IB/iser: Use separate buffers for the login request/response
   IB/iser: DMA unmap TX bufs used for iSCSI/iSER headers
 [...]
 could you try only the four of them on top of 3.0.15 and then if it
 works okay, find out which one of them does the job?

I've applied them one by one and at the following commit it worked:

IB/iser: Use separate buffers for the login request/response

Then, I've tried to apply that commit only to 3.0.15, but automatic
cherry-picking failed. I had to apply the following commit first:

IB/iser: Support iSCSI PDU padding

So, these two commits are the winners for our Solaris 11 COMSTAR targets.

Without them I always had to reboot the system because it wasn't
possible to logout or to unload the ib_iser module.
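As a side note on the padding commit: iSCSI pads data segments to a 4-byte boundary, so the length seen on the IB side can be slightly larger than the length in the PDU header. The Python sketch below mirrors the length test as I read it in the backport diff posted earlier in this thread. The archive stripped the comparison operators there, so the `>` and `<` are my reconstruction, and the function names are illustrative only.

```python
def padded_len(datalen):
    """Round a data segment length up to the next multiple of 4 bytes."""
    return (datalen + 3) & ~3


def datalen_acceptable(datalen, rx_data_len):
    """Accept when the received length covers the header length plus a few pad bytes.

    Mirrors the reconstructed kernel condition:
    reject if datalen > rx_data_len or (datalen + 4) < rx_data_len.
    """
    return not (datalen > rx_data_len or (datalen + 4) < rx_data_len)


# A 13-byte payload arrives padded to 16 bytes: accepted by the new check,
# although the old strict "datalen != rx_data_len" test would have rejected it.
print(padded_len(13), datalen_acceptable(13, padded_len(13)))

# Header claims more data than was received: still rejected.
print(datalen_acceptable(17, 16))
```

The window is slightly wider than the 0 to 3 pad bytes that padding alone can produce (a difference of exactly 4 also passes), which is simply what the condition in the diff allows.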

Cheers,

Sebastian


Re: IB/iSER problems with Linux 3.0

2012-01-17 Thread Sebastian Riemer
On 16/01/12 22:16, Or Gerlitz wrote:
 Sebastian, I asked for the **iser** (ib_iser) and not mlx4_core debug_level=2
   

Yes, I did! I've enabled that additionally. And I've checked these
settings in /sys/module/*/parameters. They were set. The libiscsi from
OFED had only the option debug_libiscsi but this was too verbose, so
this was the only thing I didn't activate there.

 1. yes, the logs (correct ones, please!) from success login on the
 very same kernel would help
   
Yes, I've sent you the correct logs. The only difference is:
1. in-tree vs. ofa-kernel-modules from OFED-1.5.4
2. open-iscsi 2.0.872 vs. open-iscsi 2.0.869 from OFED

In the log from working iSER there is the RDMA mapping debug message at
the position of the error in the other log.

Cheers,

Sebastian


Re: IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-16 Thread Sebastian Riemer
On 12/01/12 17:14, Or Gerlitz wrote:
 
 you didn't send the kernel logs from the failure after opening the iser 
 (debug_level=2) and libiscsi (debug_libiscsi_session=1
 debug_libiscsi_conn=1) debug prints

OK, I've also set mlx4_core debug_level=2 and have verified in
/sys/module that the parameters are really set.
Please find attached the relevant part of the kernel log from the login
attempt.

I've also attached the log from the working stuff from OFED-1.5.4 for
comparison and the settings from discovery.

I'll try to backport the IB + iSCSI kernel code from 3.2.1 to 3.0.15 next.

Cheers,

Sebastian


iser_dbg_dmesg.log.gz
Description: GNU Zip compressed data


iser_ofa_dbg_dmesg.log.gz
Description: GNU Zip compressed data
node.name = iqn.2010-03.com.profitbricks:cloud:customers:storage200
node.tpgt = 2
node.startup = manual
iface.hwaddress = default
iface.iscsi_ifacename = iser1
iface.net_ifacename = ib1.
iface.transport_name = iser
node.discovery_address = 10.1.24.204
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 4
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.auth.authmethod = None
node.session.timeo.replacement_timeout = 30
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.1.24.204
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072
node.conn[0].iscsi.HeaderDigest = None
node.conn[0].iscsi.DataDigest = None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No



Re: IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-12 Thread Sebastian Riemer
On 12/01/12 10:29, Or Gerlitz wrote:
 If you  have build the kernel IB user space support (uverbs) and the
 IB libs, do ibv_devinfo if not, just ossi cat
 /sys/class/infiniband/mlx4_0/* and send the output. To be clear, iser
 does work for you on the productive servers but not on this server?

Yes, we've got consistent OFED-1.5.4 user-space. ibv_devinfo reports a
mismatch between the kernel and the userspace libraries - kernel does
not support XRC.. ibverbs-driver-mlx4 is at version
1.0.1-1.20.g6771d22 and libibverbs is at version 1.1.4-1.24.gb89d4d7.

But O.K., the other method shows firmware version 2.9.1000.

iSER only works on production servers, because we use the OFA kernel
modules from OFED for them at the moment (with 3.0 ported *iscsi*
drivers). But there the IPoIB traffic is too slow for us.
We connect customer VMs with IPv6 between different servers via IB.

And yes, we could also test kernel 3.2 on our iSER test server.

Regards,
Sebastian




Re: IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-12 Thread Sebastian Riemer
On 12/01/12 11:16, Sebastian Riemer wrote:
 On 12/01/12 10:29, Or Gerlitz wrote:
   
 If you  have build the kernel IB user space support (uverbs) and the
 IB libs, do ibv_devinfo if not, just ossi cat
 /sys/class/infiniband/mlx4_0/* and send the output. To be clear, iser
 does work for you on the productive servers but not on this server?
 
 Yes, we've got consistent OFED-1.5.4 user-space. ibv_devinfo reports a
 mismatch between the kernel and the userspace libraries - kernel does
 not support XRC.. ibverbs-driver-mlx4 is at version
 1.0.1-1.20.g6771d22 and libibverbs is at version 1.1.4-1.24.gb89d4d7.

 But O.K., the other method shows firmware version 2.9.1000.

   

I've found out that we have two single port MHQH19B-XTR InfiniBand HCAs.

lspci output:
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0
5GT/s - IB QDR / 10GigE] (rev b0)
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0
5GT/s - IB QDR / 10GigE] (rev b0)

The first one is ib1. And the second is ib0.
/sys/devices/pci:00/:00:0c.0/:03:00.0/net/ib1
/sys/devices/pci:00/:00:0b.0/:04:00.0/net/ib0

The iSER traffic is on ib1 (the HCA which reported the error) and ib0 is
for IPoIB traffic. I don't know if the mlx4 driver has a problem with
that hardware config.

Here is the requested data:
mlx4_0:
board_id   MT_0D90110009
fw_ver 2.9.1000
hca_type   MT26428
hw_rev b0
node_desc  pserver214 HCA-1 (mlx4_0 - MT26428)
node_guid  0002:c903:000f:5f76
node_type  1: CA
sys_image_guid 0002:c903:000f:5f79
uevent NAME=mlx4_0

mlx4_1:
board_id   MT_0D90110009
fw_ver 2.9.1000
hca_type   MT26428
hw_rev b0
node_desc  pserver214 HCA-2 (mlx4_1 - MT26428)
node_guid  0002:c903:000f:5f26
node_type  1: CA
sys_image_guid 0002:c903:000f:5f29
uevent NAME=mlx4_1

Both are connected to the storage but in different subnets and without
multipathing.

How do I find out if ib1 is on mlx4_1 or mlx4_0?
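One way to answer this, sketched in Python: both /sys/class/net/<netdev>/device and /sys/class/infiniband/<hca>/device are symlinks to the underlying PCI function, so the IPoIB interface and the HCA that resolve to the same PCI device belong together. The function name `hca_for_netdev` and the `sysfs_root` parameter (there so it can be tried against a fake tree) are my own; treat this as a sketch under those assumptions, not a tested tool.

```python
import os


def hca_for_netdev(netdev, sysfs_root="/sys"):
    """Return the IB device that shares a PCI function with netdev, or None."""
    net_pci = os.path.realpath(
        os.path.join(sysfs_root, "class", "net", netdev, "device"))
    ib_dir = os.path.join(sysfs_root, "class", "infiniband")
    for hca in sorted(os.listdir(ib_dir)):
        hca_pci = os.path.realpath(os.path.join(ib_dir, hca, "device"))
        if hca_pci == net_pci:
            return hca
    return None


if __name__ == "__main__" and os.path.isdir("/sys/class/infiniband"):
    for name in ("ib0", "ib1"):
        if os.path.exists(os.path.join("/sys/class/net", name)):
            print(name, "->", hca_for_netdev(name))
```

With two single-port HCAs as described above, each ibN should map to exactly one mlx4_N; on a dual-port card two netdevs would resolve to the same HCA.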

Cheers,
Sebastian


IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-11 Thread Sebastian Riemer
Hi list, hi Or,

at the moment we use our kernel 3.0 ported OFA kernel modules from
OFED-1.5.4 to make iSER work together with our kernel 3.0 ported
open-iscsi 2.0.869 and Solaris 11 COMSTAR targets.

Now, we've got the problem that the ib_iser from there ignores its
debug_level parameter. /sys/module/ib_iser/parameters/debug_level
shows 0 but it spams the whole kernel log with debug messages. Seems
to be a code bug.

We've also tested a 3.0.15 mainline kernel with in-tree IB modules
together with the OFED-1.5.4 user-space and this has much better IPoIB
performance than the kernel stuff from OFED. So, we want to use them
instead, but there is the same problem with the iSER debug messages and
iSER doesn't work together with our Solaris 11 COMSTAR targets.

We've tested this with open-iscsi (2.0.872, commit
4323e342d2c9fb8ed7233ce855001c189ec55b23) user-space.
TCP is O.K. but iSER reports an error during the login attempt:
iscsiadm: initiator reported error (11 - iSCSI PDU timed out)

The sent PDUs from iscsid debugging are the same, but there is an IO
page fault in the kernel log. I've attached the relevant part and the
iscsid log.

This looks interesting:
iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57

Or, could you please investigate/explain?

It is a pain that we need both: working iSER and IPoIB traffic with good
performance.

Cheers,

Sebastian


On 19/12/11 10:14, Sebastian Riemer wrote:
 Hi list,

 I've already sent this to the open-iscsi mailing list, but I guess
 this is more relevant for linux-rdma.

 Finally I've got IB/iSER running on Debian Squeeze with Linux kernel 3.0
 smoothly.

 The problem was that we did not have the suitable OFED for our kernel
 and we did not use the open-iscsi from OFED. Kernel 3.0 is supported
 since OFED-1.5.4 from 2011-12-05.

 So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've
 updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't
 build ib_iser in the OFA kernel source and that they don't build the
 open-iscsi kernel/user-space code - I made it do so.

 The next problem was that open-iscsi kernel code in OFED-1.5.4 is for =
 2.6.32 based RedHat distributions. I had to port the source from 2.6.30
 to 3.0 due to kernel API changes. OFA even forgot libiscsi_tcp.[ch] in
 OFED-1.5.4. So, I had to import it from 2.6.30 mainline.
 I did so, because we wanted to compare TCP and iSER speed over
 InfiniBand. Our Solaris COMSTAR targets provide both.

 After fixing the kernel, there was still a problem in the open-iscsi
 2.0.869 user-space from OFED. Some sysfs magic has changed - so that the
 iSCSI host number couldn't be found.

 After fixing that, it worked for me.

 Cheers,

 Sebastian
   

Jan 11 12:53:25 pserver214 kernel: [  716.518372] SCSI subsystem initialized
Jan 11 12:53:25 pserver214 kernel: [  716.521146] Loading iSCSI transport class v2.0-870.
Jan 11 12:53:25 pserver214 kernel: [  716.528756] iscsi: registered transport (tcp)
Jan 11 12:53:30 pserver214 kernel: [  721.903544] iscsi: registered transport (iser)
Jan 11 12:54:46 pserver214 kernel: [  797.537439] iser: iser_connect:connecting to: 10.1.24.204, port 0xbc0c
Jan 11 12:54:46 pserver214 kernel: [  797.563158] iser: iser_cma_handler:event 0 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.566402] iser: iser_cma_handler:event 2 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.579704] iser: iser_create_ib_conn_res:setting conn 880807b17a80 cma_id 880807594400: fmr_pool 88082426b400 qp 8807ed22aa00
Jan 11 12:54:46 pserver214 kernel: [  797.586557] iser: iser_cma_handler:event 9 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.787932] iser: iscsi_iser_ep_poll:ib conn 880807b17a80 rc = 1
Jan 11 12:54:46 pserver214 kernel: [  797.788137] scsi0 : iSCSI Initiator over iSER, v.0.1
Jan 11 12:54:46 pserver214 kernel: [  797.794249] iser: iscsi_iser_conn_bind:binding iscsi/iser conn 8808058deab8 8808058decc8 to ib_conn 880807b17a80
Jan 11 12:54:46 pserver214 kernel: [  797.794710] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488000 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.794919] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488200 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.794998] iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57
Jan 11 12:54:46 pserver214 kernel: [  797.795006]  connection1:0: detected conn error (1011)
Jan 11 12:54:46 pserver214 kernel: [  797.795338] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488100 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.795535] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488040 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.795730] AMD-Vi
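Two details in the log above are easy to miss, so here is a short Python sketch that pulls them out. The regular expression and the names `page_faults` and `swap16` are my own. That iser_connect prints the port in network byte order is an assumption on my side; it is suggested by the swapped value being 3260, the port configured in the node settings.

```python
import re

FAULT = re.compile(
    r"IO_PAGE_FAULT device=(?P<device>\S+) domain=(?P<domain>\S+) "
    r"address=(?P<address>\S+) flags=(?P<flags>[^\]\s]+)")


def page_faults(log_text):
    """Return one dict per AMD-Vi IO_PAGE_FAULT event found in the text."""
    return [m.groupdict() for m in FAULT.finditer(log_text)]


def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | (value >> 8)


LOG = """\
[  797.794710] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488000 flags=0x0050]
[  797.794919] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488200 flags=0x0050]
"""

faults = page_faults(LOG)

# Which PCI functions raised faults? Here only 03:00.0, the HCA behind ib1.
print(sorted({f["device"] for f in faults}))

# "port 0xbc0c" from the iser_connect line, byte-swapped.
print(swap16(0xBC0C))
```

All faults come from PCI function 03:00.0, which the lspci output earlier in the thread lists as one of the two ConnectX HCAs, and the fault addresses sit close together in a single 4 kB page.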

Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-21 Thread Sebastian Riemer
 you wrote long emails, I'm asking for one concrete example for that enum
 crunching  of adding entries
 not at the end, can you, please?

I meant e.g. the iscsi tasks in libiscsi.h between 2.6.30 and
2.6.32. But I meant this for OFED and not the mainline kernel.

2.6.30:
enum {
 ISCSI_TASK_COMPLETED,
 ISCSI_TASK_PENDING,
 ISCSI_TASK_RUNNING,
};

2.6.32:
enum {
 ISCSI_TASK_FREE,
 ISCSI_TASK_COMPLETED,
 ISCSI_TASK_PENDING,
 ISCSI_TASK_RUNNING,
 ISCSI_TASK_ABRT_TMF,/* aborted due to TMF */
 ISCSI_TASK_ABRT_SESS_RECOV, /* aborted due to session recovery */
};
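To spell out why an entry added at the front matters: C numbers enumerators from 0 in declaration order, so inserting ISCSI_TASK_FREE first shifts every later value by one. A small Python sketch of the two numberings (IntEnum is used here only to imitate the C numbering; the Python class names are made up):

```python
from enum import IntEnum

# Enumerator order as in the 2.6.30 header quoted above.
Task_2_6_30 = IntEnum(
    "Task_2_6_30",
    ["ISCSI_TASK_COMPLETED", "ISCSI_TASK_PENDING", "ISCSI_TASK_RUNNING"],
    start=0)

# Enumerator order as in the 2.6.32 header quoted above.
Task_2_6_32 = IntEnum(
    "Task_2_6_32",
    ["ISCSI_TASK_FREE", "ISCSI_TASK_COMPLETED", "ISCSI_TASK_PENDING",
     "ISCSI_TASK_RUNNING", "ISCSI_TASK_ABRT_TMF",
     "ISCSI_TASK_ABRT_SESS_RECOV"],
    start=0)

# A value written by code built against the old header ...
old_value = int(Task_2_6_30.ISCSI_TASK_COMPLETED)

# ... is read as a different state by code built against the new header.
print(old_value, Task_2_6_32(old_value).name)
```

So code built against the 2.6.30 numbering and code built against the 2.6.32 numbering disagree about every state, which is the kind of mismatch I meant.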

 I want to double check I'm with you - so when you said that iser didn't work
 e.g TCP worked very well. I've also updated from git to latest 2.0.872
 (latest change 2011-11-01) for testing. TCP always worked and iSER was
 always unusable. you actually wanted to say iser from ofed and not iser
 from this or that upstream kernel?


I've tried both. The iser from OFED oopsed (because it is 2.6.30 based
- didn't match the 3.0 open-iscsi in-tree) and everything from
upstream kernel 3.0.4 was pretty unstable (mentioned connection aborts
after 5s). And I guess the IB connection was that unstable because of
the OFED-1.4 user-space from Squeeze. The OFED user-space must match
the kernel code, of course. Before I took over kernel maintenance at
ProfitBricks, only a few people in the company knew about that problem.
So, I thought making everything OFED-1.5.4 was the right approach.


 as life, mainline isn't perfect, but that doesn't mean that ofed is perfect
 nor that it's in any way better than mainline. You may know, and if you
 don't, here is the news: the ofa community has decided to stop producing
 ofed in the way it was done over the years; namely, from now (Jan 2012)
 onward, ofed will be only backports from mainline, no additions, so this
 false betterness claim can't even be stated anymore. Now, even this
 backporting-only mode has to be defined - take, for example, the iscsi case...
 except for iser, ofed will not provide the iscsi modules nor tools, so it's
 not clear how trivial it is for someone to take (say) iser from 3.2 and
 backport it to (say) 2.6.35 in a manner that it will be operable with 2.6.35
 and an unknown version of the tools.


As I wrote, I like that new approach. If OFED-3.2 matches mainline
3.2, this would be great, but then you'll also have to provide the
open-iscsi user-space which you've used for testing.
Or couldn't you at least provide a list of the tools, OFA user-space
etc. which you've tested (e.g. like that BUILD_ID file in OFED)?
I really hope that this makes things better/easier for InfiniBand and
iSER users. I'm looking forward to testing that.

My question was more like: "How do you currently ensure that kernel
and user-space code match?", and I did not want to read the
developer's favorite "With the next release everything will be
better." ;-)

Sebastian


Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-21 Thread Sebastian Riemer
2011/12/21 Or Gerlitz ogerl...@mellanox.com:
 I tested the upstream kernel iser against the upstream iscsi tools  from
 git://github.com/mikechristie/open-iscsi
 (commit 4323e342d2c9fb8ed7233ce855001c189ec55b23), it works


To bring this to an end: I believe you. Most likely I had that much
trouble because the OFED-1.4 user-space did not match the kernel
code.

Since my question of how to find the matching user-space for a given
upstream kernel is still unanswered, I will just use what I have now
(it should work like on RedHat) and look forward to the new OFED
release approach.

Thanks for taking the time to make things clear to me.

Sebastian


Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-20 Thread Sebastian Riemer
2011/12/20 Or Gerlitz ogerl...@mellanox.com:

 Beep, I'd like to better/understand the problem before looking on your
 struggle for solution...
 I understand that your Debian system runs kernel 3.0 - however, you didn't
 say what version of the iscsi initiator utils is provided with that distro
 nor what were the problems to make it work/well with iser, could you
 elaborate on that?

 Or.


Ah, O.K. - I wrote that on the open-iscsi list. Debian Squeeze (in
general 2.6.32 based) comes with open-iscsi 2.0.871.3-2squeeze1. We
used that version together with the in-tree OFA kernel modules of
mainline kernel 3.0 and the Debian Squeeze OFED-1.4 user-space. But
there were lots of iSER connection aborts (and even log-outs) after
only 5s of connection loss instead of the 120s
node.session.timeo.replacement_timeout. The many connection losses
were also caused by the lack of a suitable OFED.

After installing the OFA kernel modules (without the open-iscsi
modules) from OFED-1.5.4, the kernel oopsed in ib_iser. Therefore,
the suitable open-iscsi code had to be found (in OFED). And because
it didn't support 3.0 kernels, it also had to be ported.
There were many ABI and API changes in the mainline open-iscsi kernel
code between 2.6.30 and 3.0.

I've adapted the open-iscsi code from the OFA kernel source in
OFED-1.5.4 to the following kernel API changes:
- kfifo API >= 2.6.33
- scsi_host API >= 2.6.33
- scsi_host API >= 2.6.37

Before that, I added the libiscsi_tcp code from 2.6.30 and hooked it
into the build.
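
A minimal sketch of how such changes are usually guarded, assuming the
three entries above mean "the API changed as of that release". All
SKETCH_* names and the api_generation enum are hypothetical. The macro
mirrors the encoding of KERNEL_VERSION() in <linux/version.h>; real
module code would compare against LINUX_VERSION_CODE instead:

```c
#include <assert.h>

/* Same encoding as the kernel's KERNEL_VERSION(a, b, c), renamed so
 * that this sketch compiles in user space. */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical build target; kernel code would use LINUX_VERSION_CODE. */
#ifndef SKETCH_VERSION_CODE
#define SKETCH_VERSION_CODE SKETCH_KERNEL_VERSION(3, 0, 4)
#endif

enum api_generation {
	API_BEFORE_2_6_33,	/* what the 2.6.30 based code expects */
	API_SINCE_2_6_33,	/* kfifo and first scsi_host change */
	API_SINCE_2_6_37,	/* second scsi_host change */
};

/* Run-time view: which API generation a given version code falls into. */
enum api_generation api_generation_for(int version_code)
{
	if (version_code >= SKETCH_KERNEL_VERSION(2, 6, 37))
		return API_SINCE_2_6_37;
	if (version_code >= SKETCH_KERNEL_VERSION(2, 6, 33))
		return API_SINCE_2_6_33;
	return API_BEFORE_2_6_33;
}

/* Compile-time view: the form such guards take inside ported code. */
#if SKETCH_VERSION_CODE >= SKETCH_KERNEL_VERSION(2, 6, 37)
#define SKETCH_BUILD_GENERATION API_SINCE_2_6_37
#elif SKETCH_VERSION_CODE >= SKETCH_KERNEL_VERSION(2, 6, 33)
#define SKETCH_BUILD_GENERATION API_SINCE_2_6_33
#else
#define SKETCH_BUILD_GENERATION API_BEFORE_2_6_33
#endif

/* Both views must agree for the configured build target. */
enum api_generation build_generation(void)
{
	assert(api_generation_for(SKETCH_VERSION_CODE) == SKETCH_BUILD_GENERATION);
	return SKETCH_BUILD_GENERATION;
}
```

A 3.0 kernel lands in the newest generation, so code written for 2.6.30
has to pass through both guards to build there.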

After stress testing the storage on a test machine with that fixed
OFED + iSER, all other machines on that IB switch had IB connection
losses. So we decided to roll out OFED-1.5.4 with the fixed open-iscsi
code to all machines in our data center. And this works very well
now. General network performance also doubled.

Btw.: We need such a new kernel because of some cool virtualization,
cgroups and performance features.

I wrote to this mailing list in order to show that open-iscsi in
OFED-1.5.4 isn't suitable for the 3.0 kernel, that libiscsi_tcp is
missing, and that we at ProfitBricks have a good test case with our
IaaS Cloud Computing.

Btw.: I like the proposed approach for new OFED releases. Version
checks in the code between kernel and user-space (like DRBD does)
would be great.
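
A minimal sketch of what such a check could look like. This is not
DRBD's actual code and not an existing OFED or open-iscsi interface;
struct abi_range and abi_compatible() are hypothetical names. The
assumption is that each side advertises the range of interface versions
it can speak and that the two ranges must overlap:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: range of interface versions one side can speak. */
struct abi_range {
	unsigned int min;	/* oldest version still understood */
	unsigned int max;	/* newest version implemented */
};

/*
 * Kernel and user-space are compatible if their ranges share at least
 * one version. A tool would run this once at startup and refuse to
 * continue instead of failing later in some obscure way.
 */
bool abi_compatible(struct abi_range kernel, struct abi_range user)
{
	assert(kernel.min <= kernel.max);
	assert(user.min <= user.max);
	return kernel.min <= user.max && user.min <= kernel.max;
}
```

With such a check, a mismatch between kernel modules and user-space
shows up as one clear error message at startup.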

Cheers,

Sebastian


Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-20 Thread Sebastian Riemer
2011/12/20 Or Gerlitz ogerl...@mellanox.com:

 Beep(2), so your system has distro which is based on kernel 2.6.32 and iscsi
 initiator tools version 2.0.871 and per your needs, you've booted it with
 kernel 3.0 .

 At this point should you have stop and make sure that this combo works,
 iscsi wise (simpler to test iscsi/tcp... no need in rdma knowledge), did you
 do this validation?

Yes, of course I did - TCP worked very well. I've also updated from git
to the latest 2.0.872 (latest change 2011-11-01) for testing. TCP always
worked and iSER was always unusable.


 If this combo isn't working, you have to update the iscsi tools to a version
 which supports your kernel.


Yes, I even took the kernel code from open-iscsi git (which supports
kernels up to 2.6.35) and changed the Makefile to make it compile
against my kernel. With that I ensured that open-iscsi kernel code and
user-space match. TCP O.K. - no iSER with that!

 If this combo is working iscsi/tcp wise but not ib_iser wise, its seems to
 be a bug, and I'd be happy to help with finding the root cause, etc.

Have you ever developed/tested the ib_iser module for/with > 2.6.30
kernels? I've seen that there were lots of changes in the whole
open-iscsi kernel stack between 2.6.30 and 2.6.32. The whole ABI has
changed in libiscsi. They added entries, e.g., at the first position
in enums. If ib_iser isn't aware of such changes, lots of crap can
happen ...and it did happen to me while testing, by the way.


 But this way or another, OFA isn't an iscsi tools factory, nor have anyone
 that can/want to support iscsi tools, we (folks from the rdma vendors
 community that deal with iscsi) are working with the upstream iscsi
 maintainer to address iscsi issues. The fact that OFA ships iscsi code
 except for ib_iser/cxgb4i/etc modules is a bug, BTW, I'll act to change
 that.


ib_iser has tight dependencies on open-iscsi code (see attached). In
my opinion an ib_iser developer should work just as closely together
with the open-iscsi guys. They should inform you about all ABI and API
changes in libiscsi so that you can react to them. As a user of
IB/iSER it is really confusing that this isn't the case and that OFA
only provides 2.6.30 based open-iscsi stuff which only works for iSER
due to the missing libiscsi_tcp in there. At least this could be fixed
easily.

Now I know why everybody has his own OFED - so do we now. E.g. the
QLogic stuff isn't even compilable without the QLogic OFED, because
they only put their patches in there. Luckily, we have only Mellanox
HCAs in our production environment.

Would it help, if we provide our patches for open-iscsi and IB/iSER 
2.6.32 to bring that into mainline OFED?

Sebastian
attachment: open-iscsi.png

Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-20 Thread Sebastian Riemer

 Would it help, if we provide our patches for open-iscsi and IB/iSER
 2.6.32 to bring that into mainline OFED?

 As Or notes, OFED is providing the kernel modules more than the iscsi code 
 drop.  Would be better for all (cough cough) to push changes back to the 
 iscsi initiator maintainer (Mike Christie I think).

No, I don't think so. Mike's open-iscsi works for many kernels and he
provides libiscsi. OFA decided to make open-iscsi and ib_iser 2.6.30
based in OFED-1.5.4, but they say that they also support 3.0 in OFED.
Now, they could provide two more open-iscsi and ib_iser versions
(>=2.6.33, >=2.6.37) or just patch the version they already have. API
changes are no magic, and everyone can see in mainline how the code
was adapted when an API changed.

Sebastian


Re: IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-20 Thread Sebastian Riemer
2011/12/20 Or Gerlitz or.gerl...@gmail.com:

 horses, please, stay at home, or at least run a little bit slower,
 just for you - from 2 minutes
 ago - iser works well with 3.2.0-rc5 (its say -dirty b/c its a
 development system and the kernel has some patches, but not iser ones)
 and iscsi-initiator-utils of 6.2.0.872-21.el6, I will try tomorrow
 with upstream iscsi-initiator-utils and see if there's a problem
 there.


O.K., sorry - I just wanted to know how this is developed. Even if the
kernel code matches 100%, how do I know which is the matching tested
user-space code for that?


 I don't use ofed at all, work only with upstream or distro code.
 AFAIK, the upstream kernel is functional with iser at all times,
 again, I will do the validation with the iscsi  tools and if the
 upstream iscsi tools aren't functional with the upstream iser code but
 are functional with the upstream iscsi/tcp code, we will (the iscsi
 maintainer and myself) fix that, and thanks for this possible heads
 up.


Thanks for that info. I thought that with OFED I would have everything
needed, all matching each other. Due to the new kernel we have our own
distribution extensions, package repo, build server,... So, for OFED
we're the distribution.


 no way, OFA isn't iscsi factory, we can't support the iscsi kernel
 modules except for iser, nor any of the iscsi user space tools - and
 its a historic bug that someone with wrong ambitions added iscsi
 modules/tools info the ofa stack. The OFA stack should be compatible
 with the kernel/distro it is running on. As you can see in the
 maintainers file, I act as the iser maintainer, and I do work closely
 with the iscsi maintainer, maybe should work closer if indeed you
 stepped on a problem with the upstream iscsi tools, as for the iscsi
 tools provided with debian, I am not sure what was the problem, send
 me tgz with the sources to my @mellanox address and I can try look on
 that.


Perhaps, I really stepped on a rare case where this was broken. As
I've already asked: How do I find the matching, tested open-iscsi and
OFA user-space code for a mainline kernel?

Sorry again, I didn't want to provoke. I just want to understand and
make things work.

Sebastian


IB/iSER with Linux 3.0 and Debian: Lesson learned

2011-12-19 Thread Sebastian Riemer
Hi list,

I've already sent this to the open-iscsi mailing list, but I guess
this is more relevant for linux-rdma.

I've finally got IB/iSER running smoothly on Debian Squeeze with Linux
kernel 3.0.

The problem was that we did not have the suitable OFED for our kernel
and we did not use the open-iscsi from OFED. Kernel 3.0 is supported
since OFED-1.5.4 from 2011-12-05.

So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've
updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't
build ib_iser in the OFA kernel source and that they don't build the
open-iscsi kernel/user-space code - I made it do so.

The next problem was that the open-iscsi kernel code in OFED-1.5.4 is
for <= 2.6.32 based RedHat distributions. I had to port the source from
2.6.30 to 3.0 due to kernel API changes. OFA even forgot
libiscsi_tcp.[ch] in OFED-1.5.4. So, I had to import it from 2.6.30
mainline.
I did so, because we wanted to compare TCP and iSER speed over
InfiniBand. Our Solaris COMSTAR targets provide both.

After fixing the kernel, there was still a problem in the open-iscsi
2.0.869 user-space from OFED. Some sysfs magic has changed - so that the
iSCSI host number couldn't be found.

After fixing that, it worked for me.

Cheers,

Sebastian


--
Sebastian Riemer
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
10405 Berlin, Germany

Tel.:  +49 - 30 - 51 64 09 20
Fax:   +49 - 30 - 51 64 09 22
Email: sebastian.rie...@profitbricks.com
Web:   http://www.profitbricks.com/

Registered office: Berlin
Register court: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Andreas Gauger, Achim Weiss