Re: [PATCH v1 1/3] IB/srp: Fix crash when unmapping data loop
On 24.02.2014 15:30, Sagi Grimberg wrote: When unmapping request data, it is unsafe automatically decrement req-nfmr regardless of it's value. This may happen since IO and reconnect flow may run concurrently resulting in req-nfmr = -1 and falsely call ib_fmr_pool_unmap. Something is still strange here. What about the following: unsafe to decrement req-nfmr automatically its instead of it's and calling ib_fmr_pool_unmap falsely Fix the loop condition to be greater than zero (which explicitly means that FMRs were used on this request) and only increment when needed. This crash is easily reproduceable with ConnectX VFs OR Connect-IB (where FMRs are not supported) Signed-off-by: Sagi Grimberg sa...@mellanox.com --- drivers/infiniband/ulp/srp/ib_srp.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 529b6bc..0e20bfb 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -766,8 +766,11 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd, return; pfmr = req-fmr_list; - while (req-nfmr--) + + while (req-nfmr 0) { ib_fmr_pool_unmap(*pfmr++); + req-nfmr--; + } ib_dma_unmap_sg(ibdev, scsi_sglist(scmnd), scsi_sg_count(scmnd), scmnd-sc_data_direction); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] scsi_transport_srp: Fix two kernel-doc warnings
On 20.02.2014 11:51, Bart Van Assche wrote: This patch fixes the following two kernel-doc warnings: Warning(drivers/scsi/scsi_transport_srp.c:819): No description found for parameter 'rport' Warning(include/scsi/scsi_transport_srp.h:75): Excess struct/union/enum/typedef member 'deleted' description in 'srp_rport' Signed-off-by: Bart Van Assche bvanass...@acm.org Reported-by: Masanari Iida standby2...@gmail.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com Cc: James Bottomley jbottom...@parallels.com Cc: Roland Dreier rol...@kernel.org --- drivers/scsi/scsi_transport_srp.c | 1 + include/scsi/scsi_transport_srp.h | 1 - 2 files changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c index d47ffc8..13e8983 100644 --- a/drivers/scsi/scsi_transport_srp.c +++ b/drivers/scsi/scsi_transport_srp.c @@ -810,6 +810,7 @@ EXPORT_SYMBOL_GPL(srp_remove_host); /** * srp_stop_rport_timers - stop the transport layer recovery timers + * @rport: SRP remote port for which to stop the timers. * * Must be called after srp_remove_host() and scsi_remove_host(). The caller * must hold a reference on the rport (rport-dev) and on the SCSI host diff --git a/include/scsi/scsi_transport_srp.h b/include/scsi/scsi_transport_srp.h index b11da5c..cdb05dd 100644 --- a/include/scsi/scsi_transport_srp.h +++ b/include/scsi/scsi_transport_srp.h @@ -41,7 +41,6 @@ enum srp_rport_state { * @mutex: Protects against concurrent rport reconnect / * fast_io_fail / dev_loss_tmo activity. * @state: rport state. - * @deleted: Whether or not srp_rport_del() has already been invoked. * @reconnect_delay: Reconnect delay in seconds. * @failed_reconnects: Number of failed reconnect attempts. * @reconnect_work:Work structure used for scheduling reconnect attempts. This is trivial. Thanks! Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IB/srp: merge fixes from MLNX_OFED
Hi Sagi, is that /mswg/git/mlnx_ofed/mlnx-ofed-2.x-kernel.git tree from the MLNX_OFED public by any chance? There are fixes included relevant for the mainline. Would be strange if I would send the patches as somebody at Mellanox discovered and fixed the issues. I've hit a kernel panic today during testing caused by the loop around ib_fmr_pool_unmap(). The loop has been fixed in MLNX_OFED. So there should be a patch sent for it to the linux-rdma mailing list. I've also noticed the added target locking around target-free_tx handling in srp_rport_reconnect(). There are cases e.g. in srp_queuecommand() where holding the rport mutex isn't enough to protect it. So for me this looks right. Then, in srp_create_target() I've noticed the check of the return value of ib_query_gid(). Makes completely sense to check it. Please send patches for so obvious fixes to the mailing list! There is a very good chance that they get accepted. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SRP initiator driver maintainership
On 21.01.2014 11:03, Sagi Grimberg wrote: On 1/20/2014 7:37 PM, Bart Van Assche wrote: On 01/03/14 22:16, David Dillow wrote: Today was my last day at ORNL, and my future endeavors will leave even less time to maintain the SRP initiator. My thanks especially go to Bart, for keeping the pressure to improve alive, and for driving so many of those improvements. diff --git a/MAINTAINERS b/MAINTAINERS index 6c20792..a36f1b5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7466,7 +7466,6 @@ S:Maintained F:drivers/scsi/sr* SCSI RDMA PROTOCOL (SRP) INITIATOR -M:David Dillow dillo...@ornl.gov L:linux-rdma@vger.kernel.org S:Supported W:http://www.openfabrics.org (replying to an e-mail of two weeks ago) Hello Dave, Thanks for all the time you have spent reviewing and testing SRP initiator patches. Such maintainer work is unglamorous but important - it is due to the combined effort of all kernel maintainers that the Linux kernel earned its high quality reputation. Roland, what is your preference with regard to maintainership of the SRP initiator driver ? My plan is to continue contributing patches to the SRP initiator driver at about the same pace as I had done in the past. Do you prefer to take over maintainership of this driver yourself or is it okay for you that I become the official maintainer for this driver ? Thanks, Bart. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Bart, Your contribution to SRP was and still is important! You led the efforts improving and stabilizing SRP driver and adding the fast-failover logic which was needed for so long. Roland, I collaborated with Bart on SRP enhancement in the past year or so and I think Bart is a perfect match for SRP maintainership. Sagi. +1 from me for Bart! Thanks for the collaboration! Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
OpenSM 3.3.16 at 100% CPU load, console off
Hi Hal, we've encountered an issue with OpenSM 3.3.16 and the config option console off. OpenSM processes are at 100% CPU load. From strace: poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 As far as I've seen in the code, the function osm_console() from opensm/osm_console.c is the only function which uses poll(). Is this issue already known or perhaps already fixed? Thanks, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpenSM 3.3.16 at 100% CPU load, console off
On 09.10.2013 15:30, David Dillow wrote: On Wed, 2013-10-09 at 09:28 -0400, Hal Rosenstock wrote: From strace: poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 poll([{fd=0, events=POLLIN}], 1, 1000) = 1 ([{fd=0, revents=POLLIN}]) read(0, , 4096) = 0 So this doesn't block for 1 second and that's why the CPU is 100% ? Looks like it is spinning on a closed socket (or stdin) -- calling poll() on such will return immediately... Thanks for the responses! I've seen in the code that the local console is initialized but is not released correctly. Should be done in osm_console_exit(). Something like this: if (p_oct-in_fd = 0) { p_oct-in = NULL; p_oct-out = NULL; p_oct-in_fd = -1; p_oct-out_fd = -1; } I guess what happened was that console local was set, changed in the config to console off and the service has been restarted. Restarting the service again didn't help. It is strange that the console_init_flag is still set. The function osm_console() returns 0 if poll() fails. If it would return something else, then the console_init_flag would be set to 0 again and there would be no issue anymore I suppose. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpenSM 3.3.16 at 100% CPU load, console off
On 09.10.2013 16:00, Hal Rosenstock wrote: Do you recall the sequence to get to this ? Was console option changed to off and then OpenSM SIGHUP'd ? Something else ? Is this reproducible ? Yes, now I can reproduce it. The opensm has been initially started with console off and I activate console local and restart the service. CPU load is at 100% immediately. I set console off again and restart the service and CPU load is low again. I did this three times in a row, now. And the third time it even remained at 100% load in the off state. I've set local and off again and CPU load was low again. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpenSM 3.3.16 at 100% CPU load, console off
On 09.10.2013 17:15, Hal Rosenstock wrote: What does service restart do in terms of OpenSM ? Note that the console parameter is _not_ changeable on the fly right now so if OpenSM is being SIGHUP'd by service restart then this is a current limitation (and is clearly not detected/protected against in the current code base). It sounds like that may be what is going on. Yes, it emits SIGHUP. Thanks for the information! The opensm is a critical component. So IMHO it needs to be fixed in a way that it either protects itself against such changes by ignoring them on the fly or it needs to support these changes. The current situation is not really acceptable and the opensm stability is crucial. So I'll think about fixing it. Are you interested in patches in this regard? Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/srp: Let srp_abort() return FAST_IO_FAIL if TL offline
Hi Bart, my patch looks very similar. I was in a company meeting so I couldn't send it fast enough. Can be applied that way! Thanks! Cheers, Sebastian On 10.07.2013 17:36, Bart Van Assche wrote: If the transport layer is offline it is more appropriate to let srp_abort() return FAST_IO_FAIL instead of SUCCESS. Signed-off-by: Bart Van Assche bvanass...@acm.org Reported-by: Sebastian Riemer sebastian.rie...@profitbricks.com Cc: David Dillow dillo...@ornl.gov Cc: Roland Dreier rol...@purestorage.com Cc: Vu Pham v...@mellanox.com --- drivers/infiniband/ulp/srp/ib_srp.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 9d8b46e..f93baf8 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1753,8 +1753,7 @@ static int srp_abort(struct scsi_cmnd *scmnd) if (!req || !srp_claim_req(target, req, scmnd)) return FAILED; if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, - SRP_TSK_ABORT_TASK) == 0 || - target-transport_offline) + SRP_TSK_ABORT_TASK) == 0) ret = SUCCESS; else if (target-transport_offline) ret = FAST_IO_FAIL; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline
On 28.06.2013 14:49, Bart Van Assche wrote: If reconnecting failed we know that no command completion will be received anymore. Hence let the SCSI error handler fail such commands immediately. Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline
On 28.06.2013 14:49, Bart Van Assche wrote: If reconnecting failed we know that no command completion will be received anymore. Hence let the SCSI error handler fail such commands immediately. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Sebastian Riemer sebastian.rie...@profitbricks.com Cc: Vu Pham v...@mellanox.com --- drivers/infiniband/ulp/srp/ib_srp.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8c95262..5c91521 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd) if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, SRP_TSK_ABORT_TASK) == 0) ret = SUCCESS; + else if (target-transport_offline) + ret = FAST_IO_FAIL; else ret = FAILED; srp_free_req(target, req, scmnd, 0); This doesn't give us much speed advantage IMHO. The check for target-transport_offline should be before calling srp_send_tsk_mgmt(). This way it would also match the patch description better. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline
On 28.06.2013 14:49, Bart Van Assche wrote: If reconnecting failed we know that no command completion will be received anymore. Hence let the SCSI error handler fail such commands immediately. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Sebastian Riemer sebastian.rie...@profitbricks.com Cc: Vu Pham v...@mellanox.com --- drivers/infiniband/ulp/srp/ib_srp.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8c95262..5c91521 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd) if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, SRP_TSK_ABORT_TASK) == 0) ret = SUCCESS; + else if (target-transport_offline) + ret = FAST_IO_FAIL; else ret = FAILED; srp_free_req(target, req, scmnd, 0); I'm also missing the concept for srp_reset_device(). There is a very common case that the SCSI error handling and the transport layer error handling run in parallel: Congestion. In congestion some LUNs are blocked while others can still transmit. A little bit later the QP timeout triggers in the middle of the SCSI error handling in srp_abort(), srp_reset_device() or less likely in srp_reset_host(). Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline
On 01.07.2013 13:33, Bart Van Assche wrote: --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd) if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, SRP_TSK_ABORT_TASK) == 0) ret = SUCCESS; +else if (target-transport_offline) +ret = FAST_IO_FAIL; else ret = FAILED; srp_free_req(target, req, scmnd, 0); This doesn't give us much speed advantage IMHO. The check for target-transport_offline should be before calling srp_send_tsk_mgmt(). This way it would also match the patch description better. Hello Sebastian, Had you perhaps overlooked the following code at the start of srp_send_tsk_mgmt() ? if (!target-connected || target-qp_in_error) return -1; Given this I don't think it matters whether the transport_offline check occurs before or after the srp_send_tsk_mgmt() call. Hi Bart, okay, right. So you get an error due to the connected and qp_in_error state first. Yes, I've overlooked that. Thanks! Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/15] IB/srp: Fail I/O fast if target offline
On 01.07.2013 13:38, Bart Van Assche wrote: --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1755,6 +1755,8 @@ static int srp_abort(struct scsi_cmnd *scmnd) if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, SRP_TSK_ABORT_TASK) == 0) ret = SUCCESS; +else if (target-transport_offline) +ret = FAST_IO_FAIL; else ret = FAILED; srp_free_req(target, req, scmnd, 0); I'm also missing the concept for srp_reset_device(). There is a very common case that the SCSI error handling and the transport layer error handling run in parallel: Congestion. Can you explain this comment further, and also how this comment relates to patch 04/15 ? Sorry, found it. Even if only one srp_reset_device() fails, then srp_reset_host() is called anyway. So there this check + returning FAST_IO_FAIL doesn't make so much sense. In congestion some LUNs are blocked while others can still transmit. A little bit later the QP timeout triggers in the middle of the SCSI error handling in srp_abort(), srp_reset_device() or less likely in srp_reset_host(). I am aware this can result in concurrent srp_reconnect_rport() calls. However, such concurrent calls are serialized via rport-mutex. I put my comment regarding this to patch 10. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/14] IB/srp: Fix remove_one crash due to resource exhaustion
On 28.06.2013 01:45, Roland Dreier wrote: On Thu, Jun 27, 2013 at 2:01 PM, David Dillow dillo...@ornl.gov wrote: On Wed, 2013-06-12 at 15:20 +0200, Bart Van Assche wrote: If the add_one callback fails during driver load no resources are allocated so there isn't a need to release any resources. Trying to clean the resource may lead to the following kernel panic: Acked-by: David Dillow dillo...@ornl.gov Thanks, I've queued up the 1, 3, and 4/14 patches that Dave acked so far. Hi Roland, did you queue 3 without the target-transport_offline check? Otherwise I can't agree on that. 1 and 4 are also for me okay. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 02/15] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()
On 28.06.2013 14:48, Bart Van Assche wrote: Avoid that srp_claim_command() can claim a command while srp_queuecommand() is still busy queueing the same command. Found this via source reading. Nice, that's much less re-acquiring of the target lock in error case in srp_queuecommand(). But if we have to change that many locations for srp_put_tx_iu() anyway, wouldn't it make sense to rename it into __srp_put_tx_iu() as well? Then we can also put a little description to it and it looks familiar compared to __srp_get_tx_iu(). The description could look like follows: /* * Return an IU and possible credit to the free pool * * Must be called with target-lock held to protect free_tx. */ I'm not sure if we still need that lockdep_assert_held() then. There is no other location with lock debugging in ib_srp. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 02/15] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()
On 28.06.2013 16:51, Bart Van Assche wrote: Nice, that's much less re-acquiring of the target lock in error case in srp_queuecommand(). But if we have to change that many locations for srp_put_tx_iu() anyway, wouldn't it make sense to rename it into __srp_put_tx_iu() as well? Then we can also put a little description to it and it looks familiar compared to __srp_get_tx_iu(). The description could look like follows: /* * Return an IU and possible credit to the free pool * * Must be called with target-lock held to protect free_tx. */ I'm not sure if we still need that lockdep_assert_held() then. There is no other location with lock debugging in ib_srp. Hello Sebastian, I don't have a strong opinion about either of these two topics. If a function like __srp_get_tx_iu() would be introduced that would allow to drop only two spin_lock/spin_unlock call pairs. So introducing that function would probably add more lines of code than adding the spin_lock/spin_unlock pairs. Hence my choice not to introduce __srp_get_tx_iu(). Regarding the lockdep_assert_held() statement: the reason I introduced it instead of adding a comment above the function telling which locking is required is because a lockdep_assert_held() statement is verified at runtime on a system with a kernel in which lockdep support has been enabled. Hi Bart, I just meant a rename into __srp_put_tx_iu() to show that locking is required and not introducing a further wrapper function. The other function of that kind __srp_get_tx_iu() also doesn't have a wrapper function srp_get_tx_iu(). For me it doesn't make much difference how it is marked that locking is required. I just wanted to point out that this method is new to ib_srp. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus
On 14.06.2013 19:07, Vu Pham wrote: [...] For what do you need the same target with multiple pkeys on the same local SRP port? There is no need, it's just a gray area that you can choose to have multiple connections to same target using different pkeys (same as dgid) Which other SRP targets exist? Netapp/LSI/Engenio, DDN, TexasMemorySystem Ramsan (IBM), Nimbus, Violin Memory, StreamScale The last three may be derived from SCST base target. I only know SCST, Solaris COMSTAR and that broken LIO stuff. Does SCST still not support to set the pkey? Yes, I think so Why should we check the dgid? If you want to have multiple connections/qps to same target, but as I said above, it's a gray area. Doesn't make any sense to me to connect both target ports to the same local port. What if a target always expose single consistent and unique SRP port with tuple id_ext, ioc_guid, the ioc_guid part is not derived from any of its local HCA's GUID, then you can connect to this target thru different HCA ports (different dgid) as different paths to same target. Do you have an example for a target which does it like this or a use case where this makes sense? I guess you're proposing here to use a driver global list of target connections instead of handling this per local SRP port. This would result in bigger changes which I wouldn't do without a good reason. If you do so, the multipath-tools will crash. Note: This function is called per local SRP port. Perhaps, a note should be added to that function that it only has to be called per local SRP port. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/14] scsi_transport_srp: Add transport layer error handling
On 17.06.2013 09:29, Bart Van Assche wrote: On 06/17/13 09:14, Hannes Reinecke wrote: On 06/17/2013 09:04 AM, Bart Van Assche wrote: I agree that the value of fast_io_fail_tmo should be kept small. Although as you explained changing the SCSI device state into SDEV_BLOCK doesn't help for I/O that has already been queued on a failed path, I think it's still useful for I/O that is queued after the fast_io_fail timer has been started and before that timer has expired. Why, but of course. The typical scenario would be: - detect link-loss - call scsi_block_request() - start dev_loss_tmo and fast_io_fail_tmo - When fast_io_fail_tmo triggers: - Abort all outstanding requests - When dev_loss_tmo triggers: - Abort all outstanding requests - Remove/disable the I_T nexus - call scsi_unblock_request() However, if and whether multipath detects SDEV_BLOCK doesn't guarantee a fast failover; in fact is was only added rather recently as it's not a big win in most cases. Even if setting the state SDEV_BLOCK doesn't help much with improving failover time, it still has the advantage over using scsi_block_requests() that it can be overridden by a user via sysfs. In my opinion that SDEV_BLOCK can help the reconnect. The only reason for high fast_io_fail_tmo is that you don't use multipath at all and hope that the connection becomes available again before that timeout. You place the reconnects in between so that there is a chance that the reconnect succeeds and the transport layer error work can be canceled. But I have to look at all of your patches first to see how you implemented the big picture. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 14.06.2013 01:27, Vu Pham wrote: Bart Van Assche wrote: On 06/13/13 19:50, Vu Pham wrote: Hello Bart, +/** + * srp_conn_unique() - check whether the connection to a target is unique + */ +static bool srp_conn_unique(struct srp_host *host, +struct srp_target_port *target) +{ +struct srp_target_port *t; +bool ret = false; + +if (target-state == SRP_TARGET_REMOVED) + goto out; + +ret = true; + + spin_lock(host-target_lock); +list_for_each_entry(t, host-target_list, list) { +if (t != target + target-id_ext == t-id_ext Targets may advertise/expose on different pkeys You can have multiple connections (or paths/scsi hosts) to same target with different pkeys. We need extra check to detect the uniqueness: target-path.pkey == t-path.pkey Hello Vu, Thanks for the feedback. This is something I have already thinking about myself. Unfortunately I have not found any requirements in the T10 SRP standard with regard to InfiniBand partitions. However, in that document there is a section about single RDMA channel operation. In that section it is explained that an SRP target must log out established sessions upon receipt of a new login request. What I'm not sure about is whether only sessions with the same P_Key must be logged out or all established sessions if a new login request is received. I assume the latter since otherwise that would mean that an SRP target would be required to maintain multiple sessions if it allows connections with more than one P_Key to a target port ? My concern about adding a pkey comparison in the function srp_conn_unique() is that if a target allows an initiator to choose which partition to use when logging in, that this could result in the undesired SRP initiator ping-pong effect this patch tries to avoid. Bart. Hello Bart, Yes, you pointed out the unclear/undefined area. If we stick to single RDMA channel per IT Nexus with unique tuple Initiator Port Identier - Target Port Indentifier then newly created connection with same tuple (I_port_id, T_port_id) but with different P_Key or different DGID is not unique. Sticking to this rule by excluding P_Key and DGID out of rdma channel indentity, your srp_conn_unique() checking is ok; however, some SRP target implementations may include DGID as part of rdma channel identifier. I'm not sure about different p_key part. -vu Hi Vu! For what do you need the same target with multiple pkeys on the same local SRP port? Which other SRP targets exist? I only know SCST, Solaris COMSTAR and that broken LIO stuff. Does SCST still not support to set the pkey? Why should we check the dgid? Doesn't make any sense to me to connect both target ports to the same local port. If you do so, the multipath-tools will crash. Note: This function is called per local SRP port. Perhaps, a note should be added to that function that it only has to be called per local SRP port. Cheers, Sebastian -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJRuuSCAAoJEH4DRb7WXajZcFcH+gKsSs64Js/CUqMSyPeFPQ7u 7jKHvLr2wqHqSMIg5rEeZxZJpE9rL+wi8k5TMAMBrV+Povdwr8tWHgdq7mh5N1xO V517YTgdzrwPIFy9e2uktxx4VYpsFGrV8iw3rdAzXRmcYa5U8feXhiD1VZyKjs4p 3//wvGAR0po7Pm0WgU9Q+h0arQos8CmeHkpoaNp/nNINXpXlTX21WVvHjwQrMFhC Kr8zoCOTd0Sn+WoSs+CT/7Y4oTknukwR5vh6wfKgz2W74YkMKpD658QZozlafyK/ rwdajV19YYvi8YRTjUXuY5TN0qshYOGDxJDtNFkRGbx+IxIqFkGyyFCp0LPCfto= =nlf2 -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/14] IB/srp: Avoid that srp_reset_host() is skipped after a TL error
On 12.06.2013 15:23, Bart Van Assche wrote: The SCSI error handler assumes that the transport layer is operational if an eh_abort_handler() returns SUCCESS. Hence let srp_abort() only return SUCCESS if sending the ABORT TASK task management function succeeded. This patch avoids that the SCSI error handler skips the srp_reset_host() call after a transport layer error. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Vu Pham v...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 9c638dd..fb37b47 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1742,18 +1742,23 @@ static int srp_abort(struct scsi_cmnd *scmnd) { struct srp_target_port *target = host_to_target(scmnd-device-host); struct srp_request *req = (struct srp_request *) scmnd-host_scribble; + int ret; shost_printk(KERN_ERR, target-scsi_host, SRP abort called\n); if (!req || !srp_claim_req(target, req, scmnd)) return FAILED; - srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, - SRP_TSK_ABORT_TASK); + if (srp_send_tsk_mgmt(target, req-index, scmnd-device-lun, + SRP_TSK_ABORT_TASK) == 0 || + target-transport_offline) + ret = SUCCESS; Here you try to hide a little trick. Returning success upon (target-transport_offline == true) is perhaps not the best way. I guess you try to fail IO fast here but up to this point target-transport_offline = true is only set in srp_reset_host(). Please explain for what you need that in this patch! Furthermore, returning FAST_IO_FAIL sounds better to me in this situation. + else + ret = FAILED; srp_free_req(target, req, scmnd, 0); scmnd-result = DID_ABORT 16; scmnd-scsi_done(scmnd); - return SUCCESS; + return ret; } static int srp_reset_device(struct scsi_cmnd *scmnd) The rest is okay. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/14] IB/srp: Skip host settle delay
On 12.06.2013 15:24, Bart Van Assche wrote: The SRP initiator implements host reset by reconnecting to the SRP target. That means that communication with the target is possible as soon as host reset finished. Hence skip the host settle delay. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Vu Pham v...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c |1 + 1 file changed, 1 insertion(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index fb37b47..be12780 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1949,6 +1949,7 @@ static struct scsi_host_template srp_template = { .eh_abort_handler = srp_abort, .eh_device_reset_handler= srp_reset_device, .eh_host_reset_handler = srp_reset_host, + .skip_settle_delay = true, .sg_tablesize = SRP_DEF_SG_TABLESIZE, .can_queue = SRP_CMD_SQ_SIZE, .this_id= -1, Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com Tested-by: Sebastian Riemer sebastian.rie...@profitbricks.com Reviewed-by: Sebastian Riemer sebastian.rie...@profitbricks.com Reviewed-by: Christoph Hellwig h...@infradead.org Choose something, I totally agree. Adding Christoph in CC as he has reviewed this as well. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus
Hi Bart, thanks for picking up the idea not to use this 'add_target' file for manual reconnects. I have only small remarks but basically you've got my Acked-by and Tested-by. Please find the remarks in-line. Cheers, Sebastian On 12.06.2013 15:25, Bart Van Assche wrote: An SRP target is required to maintain a single connection between initiator and target. This means that if the 'add_target' attribute is used to create a second connection to a target that the first connection will be logged out and that the SCSI error handler will kick in. The SCSI error handler will cause the SRP initiator to reconnect, which will cause I/O over the second connection to fail. Avoid such ping-pong behavior by disabling relogins. Note: if reconnecting manually is necessary, that is possible by deleting and recreating an rport via sysfs. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@kernel.org Cc: David Dillow dillo...@ornl.gov Cc: Vu Pham v...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c | 38 +++ 1 file changed, 38 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index be12780..1a73b24 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -556,6 +556,36 @@ static void srp_rport_delete(struct srp_rport *rport) srp_queue_remove_work(target); } +/** + * srp_conn_unique() - check whether the connection to a target is unique + */ +static bool srp_conn_unique(struct srp_host *host, + struct srp_target_port *target) +{ + struct srp_target_port *t; + bool ret = false; + + if (target-state == SRP_TARGET_REMOVED) + goto out; + + ret = true; + + spin_lock(host-target_lock); + list_for_each_entry(t, host-target_list, list) { + if (t != target + target-id_ext == t-id_ext + target-ioc_guid == t-ioc_guid + target-initiator_ext == t-initiator_ext) { + ret = false; + break; + } + } + spin_unlock(host-target_lock); + +out: + return ret; +} + You've only changed the style of this function. Functionality is still the same. Fine for me. But why do you put it that high in the source code? Do you (still) need it for something else? I would put it directly in front of srp_create_target() or even in front of that option parsing stuff for correct bottom-up. static int srp_connect_target(struct srp_target_port *target) { int retries = 3; @@ -2261,6 +2291,14 @@ static ssize_t srp_create_target(struct device *dev, if (ret) goto err; + if (!srp_conn_unique(target-srp_host, target)) { + shost_printk(KERN_INFO, target-scsi_host, + PFX Already connected to target port %.*s\n, + (int)count, buf); + ret = -EEXIST; + goto err; + } + Yes, this looks good! Nice idea to print the connection string! Would be even cooler without trailing '\n' from within 'buf' but that's okay. I was a little bit afraid of overflows here so I did security testing. But srp_parse_options() already rejected my evil connection strings. :-) I've tried things like this: id_ext=0002c903004ed0b2,\ ioc_guid=0002c903004ed0b2,\ dgid=fe82c903004ed0b4,\ pkey=,service_id=0002c903004ed0b2,\ x... until 4096 chars id_ext=0002c903004ed0b2,\ ioc_guid=0002c903004ed0b2,\ dgid=fe82c903004ed0b4,\ pkey=,service_id=0002c903004ed0b2,\ id_ext=0002c903004ed0b2,\ ioc_guid=0002c903004ed0b2,\ dgid=fe82c903004ed0b4,\ pkey=,service_id=0002c903004ed0b2,\ ... until 4096 chars This string looked kind of funny. Also in the kernel message it was a little bit longer than usual but the parsing detected that I have too many parameters. So everything is fine in terms of security. if (!host-srp_dev-fmr_pool !target-allow_ext_sg target-cmd_sg_cnt target-sg_tablesize) { pr_warn(No FMR pool and no external indirect descriptors, limiting sg_tablesize to cmd_sg_cnt\n); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/srp: Maintain a single connection per I_T nexus
Bart's version also has the printing of the connection string if the double login fails. So forget about this version here. On 12.06.2013 13:51, Sebastian Riemer wrote: Hi all, as proposed by Or, let's discuss this on the mailing list. This is a fundamental change required for everything related to multipathing. It influences automatic reconnect patches which will follow. So let's agree on the right solution here first before looking at other patches. In my opinion the 'add_target' sysfs attribute shouldn't be used for any manual reconnect as well. This is why my patch rejects the double login attempt instead of reconnecting an existing connection. This can help to find scripting issues and things like this. We can't expect that all users are using the srp-tools. Please compare with Bart's version and let's discuss this here. https://github.com/bvanassche/ib_srp-backport/commit/7d8774ff58d489858b1c046b2bf01b4e84e8dd9b Cheers, Sebastian On 12.06.2013 13:29, Sebastian Riemer wrote: The sysfs attribute 'add_target' may not be used for multiple logins to the same target. If doing so with multipathing, this crashes the multipath-tools. Furthermore, we want to prevent the possibility of data corruption here. If manual reconnect is necessary, then the user can use the 'delete' sysfs attribute of the remote port before connecting. Note: The function srp_conn_unique() has been taken from Bart Van Assche. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/14] IB/srp: Maintain a single connection per I_T nexus
On 13.06.2013 17:07, Bart Van Assche wrote: [...] The %.*s should only copy the data provided by the user, even if it is not '\0' terminated. Stripping the trailing newline is probably possible with something like the (untested) code below (will only work if there is only one newline in the input string and if it's at the end): shost_printk(KERN_INFO, target-scsi_host, PFX Already connected to target port %.*s\n, (int)count - (memchr(buf, '\n', count) == buf + count - 1), buf); I thought more like this existing message (as the input string by the user is possibly long with a lot of configuration options): shost_printk(KERN_DEBUG, target-scsi_host, PFX new target: id_ext %016llx ioc_guid %016llx pkey %04x service_id %016llx dgid %pI6\n, (unsigned long long) be64_to_cpu(target-id_ext), (unsigned long long) be64_to_cpu(target-ioc_guid), be16_to_cpu(target-path.pkey), (unsigned long long) be64_to_cpu(target-service_id), target-path.dgid.raw); But this thing takes a lot of code lines. Perhaps this string formatting should be put into a macro/inline function then. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/srp: Maintain a single connection per I_T nexus
Hi all, as proposed by Or, let's discuss this on the mailing list. This is a fundamental change required for everything related to multipathing. It influences automatic reconnect patches which will follow. So let's agree on the right solution here first before looking at other patches. In my opinion the 'add_target' sysfs attribute shouldn't be used for any manual reconnect as well. This is why my patch rejects the double login attempt instead of reconnecting an existing connection. This can help to find scripting issues and things like this. We can't expect that all users are using the srp-tools. Please compare with Bart's version and let's discuss this here. https://github.com/bvanassche/ib_srp-backport/commit/7d8774ff58d489858b1c046b2bf01b4e84e8dd9b Cheers, Sebastian On 12.06.2013 13:29, Sebastian Riemer wrote: The sysfs attribute 'add_target' may not be used for multiple logins to the same target. If doing so with multipathing, this crashes the multipath-tools. Furthermore, we want to prevent the possibility of data corruption here. If manual reconnect is necessary, then the user can use the 'delete' sysfs attribute of the remote port before connecting. Note: The function srp_conn_unique() has been taken from Bart Van Assche. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/14] IB/srp: Fix remove_one crash due to resource exhaustion
On 12.06.2013 15:38, Bart Van Assche wrote: On 06/12/13 15:20, Bart Van Assche wrote: If the add_one callback fails during driver load no resources are allocated so there isn't a need to release any resources. Trying to clean the resource may lead to the following kernel panic: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [a0132331] srp_remove_one+0x31/0x240 [ib_srp] RIP: 0010:[a0132331] [a0132331] srp_remove_one+0x31/0x240 [ib_srp] Process rmmod (pid: 4562, threadinfo 8800dd738000, task 8801167e60c0) Call Trace: [a024500e] ib_unregister_client+0x4e/0x120 [ib_core] [a01361bd] srp_cleanup_module+0x15/0x71 [ib_srp] [810ac6a4] sys_delete_module+0x194/0x260 [8100b0f2] system_call_fastpath+0x16/0x1b [bvanassche: Shortened patch description] Signed-off-by: Dotan Barak dot...@dev.mellanox.co.il Reviewed-by: Eli Cohen e...@mellanox.co.il Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Vu Pham v...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 7ccf328..368d160 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -2507,6 +2507,8 @@ static void srp_remove_one(struct ib_device *device) struct srp_target_port *target; srp_dev = ib_get_client_data(device, srp_client); +if (!srp_dev) +return; list_for_each_entry_safe(host, tmp_host, srp_dev-dev_list, list) { device_unregister(host-dev); Please note that this patch was authored by Dotan Barak, so I should have mentioned: From: Dotan Barak dot...@dev.mellanox.co.il Acked-by: Sebastian Riemer sebastian.rie...@profitbricks.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/14] IB/srp: Fix race between srp_queuecommand() and srp_claim_req()
Wait a minute, so you've changed this commit to also hold that target lock in the following functions in error case: srp_unmap_data(), srp_put_tx_iu() This is different from: https://github.com/bvanassche/ib_srp-backport/commit/6ce0e30dbb69973926df84292239f0c20f6a2d6c srp_unmap_data() calls ib_fmr_pool_unmap() which uses an own spin lock (pool-pool_lock). srp_put_tx_iu() acquires the target lock as well (target-lock). That's spin lock in spin lock. I would say that this dead locks. I like the other version more. Cheers, Sebastian On 12.06.2013 15:21, Bart Van Assche wrote: Avoid that srp_claim_command() can claim a command while srp_queuecommand() is still busy queueing the same command. Found this via source reading. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: Roland Dreier rol...@purestorage.com Cc: David Dillow dillo...@ornl.gov Cc: Vu Pham v...@mellanox.com Cc: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 368d160..9c638dd 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1367,7 +1367,6 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd) req = list_first_entry(target-free_reqs, struct srp_request, list); list_del(req-list); - spin_unlock_irqrestore(target-lock, flags); dev = target-srp_host-srp_dev-dev; ib_dma_sync_single_for_cpu(dev, iu-dma, target-max_iu_len, @@ -1401,6 +1400,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd) shost_printk(KERN_ERR, target-scsi_host, PFX Send failed\n); goto err_unmap; } + spin_unlock_irqrestore(target-lock, flags); return 0; @@ -1409,8 +1409,6 @@ err_unmap: err_iu: srp_put_tx_iu(target, iu, SRP_IU_CMD); - - spin_lock_irqsave(target-lock, flags); list_add(req-list, target-free_reqs); err_unlock: -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to do replication right with SRP or remote storage?
On 08.06.2013 04:31, Bruce McKenzie wrote: Hi Bart. any advice on using this fix with MD raid 1? a guide or site you know of? ive compiled ubuntu 13.04 to kernel 3.6.11 with OFED 2 from Mellanox, and it works ok, performance is a little better with SRP. Some packages dont seem to work, ie srptools and IB-diags some commands fail, which looks like those tools havenet been tested with 3.6.11? or updated. Ive tried using DRBD with pacemaker Stonith etc (which also works on 3.6.11) but it only works with iSCSI over IPOIB. ie virtual nic with mounted LVM using scst to present file i/o. and pacemaker to fail over the VIP to node 2. But OFED 2 doesnt seem to support SDP to have to rep via IPOIB which is slow even over dedicated IB_IPOIB nic. IE DRBD rep is 200MB/s Any help or direction would be greatfull. Cheers Bruce McKenzie (changed subject into something I think is more appropriate) Hi Bruce, thanks for contacting me privately in parallel. I can answer you the replication questions. In order to share experience for others I reply here again. Please evaluate the ib_srp fixes from Bart and from me as well and send us your feedback! We are still negotiating how to do fast IO failing and the automatic reconnect right, also together with the Mellanox SRP guys Sagi Grimberg, Vu Pham, Oren Duer and others. You need these patches in order to fail IO in the time you want to the upper layers so that dm-multipath can fail over the path first and ib_srp continuously tries to reconnect the failed path. If the other path also fails, then very likely the storage server is down, so you fail the IO further up to MD RAID-1 so that it can fail that replica. For replication the last slide of my talk on LinuxTag this year could be interesting for you: http://www.slideshare.net/SebastianRiemer/infini-band-rdmaforstoragesrpvsiser-21791250 That slide caused a lot of discussion afterwards. The thing is that replication of remote storage is best on the initiator (a single kernel manages all replica, parallel network paths, symmetric latency,...). The bad news is that replication of virtual/remote storage with MD RAID-1 is a use case which basically works but has some issues which Neil Brown doesn't want to have fixed in mainline. So you need a kernel developer for some cool features like e.g. safe VM live migration. Perhaps, I should collect all guys who require MD RAID-1 for remote storage replication in order to put some pressure on Neil. At least some things of this use case are easy to merge with mainline behavior like e.g. letting MD assembly scale right (mdadm searches the whole /dev without a need). I was surprised that he will make the data offset settable again so that you can set it to 4 MiB (1 LV extent). We already have that by custom patches on top of mdadm 3.2.6. DRBD is already with iSCSI crap. 200 MB/s with IB sounds familiar. I had 250 MB/s in primary/secondary setup with DRBD during evaluation. That's storeforward writes to the secondary which is slow. Chained network paths! With Ethernet that hurts even more. People report 70 MB/s with that. I've taught them how to use blktrace and it became obvious that they were trapped in latency. I can also recommend you Vasiliy Tolstov v.tols...@selfip.ru. He also uses SRP with MD RAID-1. He could convince Neil to fix the MD data offet. OpenSource is all about the right allies, Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to do replication right with SRP or remote storage?
On 10.06.2013 14:44, Bart Van Assche wrote: On 06/10/13 14:05, Sebastian Riemer wrote: Perhaps, I should collect all guys who require MD RAID-1 for remote storage replication in order to put some pressure on Neil. If I remember correctly one of the things Neil is trying to explain to md users is that when md is used without write-intent bitmap there is a risk of triggering a so-called write hole after a power failure ? I'm not sure. Haven't seen something like this on the mailing list. Do you have a reference from the archives? I think this is handled by superblock writes in the correct order by now. The main reason for the write-intent bitmap remains from my knowledge that you need a full resync without it if a component device is down for a short moment in time. It becomes faulty. If you know that there can't be a hardware issue (e.g. virtual storage), you can remove the faulty device and re-add it to the array. If a device was faulty, then it assembles again. There is an error counter in /sys/block/mdX/md/ sysfs and a maximum read error count (usually 20) after which the faulty device doesn't assemble again. /sys/block/mdX/md/dev-Y/errors /sys/block/mdX/md/max_read_errors Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB
On 17.05.2013 16:16, Jack Wang wrote: unable to handle kernel paging request Hi Jack, this should be related to the list corruption in IPoIB as list_del() sets the LIST_POISON1 and LIST_POISON2 pointers. Referencing these results in page faults according to the documentation in the code. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches
On 15.05.2013 07:12, Vasiliy Tolstov wrote: 2013/5/14 Bart Van Assche bvanass...@acm.org: The ability to close a session from the initiator side went upstream in kernel 3.8 (/sys/class/srp_remote_ports/port-h:n/delete). Regarding faster reconnects: please keep in mind that after a cable pull it can easily takes 20 seconds before link training and initialization by the subnet manager have finished. It's not possible to make an initiator reconnect in less time than what the hardware and subnet manager need to bring the link back. Thanks. What about close session from target side? For example i need to close the srp session and block all access from specific initiator? AFAIK the session is blocked as long as an initiator is connected. The only possibility besides disconnecting the initiators is to disable the target completely. Then, it sends a DREQ (disconnection request) to the initiators. These know then that the target is disconnected and send a DREP (disconnection reply). In our patches we also activate the reconnect in this situation as we do that to orderly reboot a storage system (e.g. due to an issue). The storage system comes up again, exports the same volumes and the initiators can reconnect. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 14.05.2013 12:02, Vasiliy Tolstov wrote: Sorry for bumping old thread, i'm solve my problems with new firmware. I have supermicro servers that rebrand mellanox firmware (recompile and change some bits) Now all works fine i have 40 gb/s QDR instead of 10 Gb/s Thanks, sharing lesson learned experience is never wrong. Especially as there aren't many IB specialists in the world. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches
Of cause, also Qlogic HCAs can be used. But please note that there is no back-port in my patches to make them better readable. If you like my patches, then we can talk about how to back-port them to a specific kernel version. On 14.05.2013 12:00, Vasiliy Tolstov wrote: if i need faster reconnects and ability to close session from initiator side under qlogic hardware, does it possible? Or this patches only covers mallanox cards? 2013/5/8 Sebastian Riemer sebastian.rie...@profitbricks.com: FYI: I've released version 0.6 of my SRP patches today. The automatic reconnect is included now. The tests for that will follow in the next version. But we already did quite intensive testing for that. Hard reboot and also soft reboot of the target are possible with that reconnect. It just reconnects and everything is fine again. With soft reboot I mean: disabling the target, removing the exports, rebooting, exporting the same LUNs, re-enabling the target. It also has an automatic mechanism to reduce the possibility of a DDoS attack reconnect. It automatically reconnects at different intervals. Check it out: https://github.com/sriemer/ib_srp Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Infiniband HA
Hi Gandalf, just build up two separate fabrics. This means that you don't interconnect both switches. Otherwise, issues on one port also affect the other port. What do you use for storage? SRP? This requires dm-multipath and fast IO failing + automatic reconnect patches from Bart or from me. All other traffic like IPoIB for example also has to be able to switch the port. Cheers, Sebastian On 08.05.2013 12:06, Gandalf Corvotempesta wrote: Hi to all I'm new to Infiniband/RMDA and probabily this is not the right place to ask. I'm planning a new infiniband infrastructure with dual port HBA and I have a question: How can I archieve fault tollerance with multiple switches? Should I connect all like a standard ethernet infrastructure ? (port1 to switch1, port2 to switch2, and both switches interconnected with a cable?) Thank you. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches
FYI: I've released version 0.6 of my SRP patches today. The automatic reconnect is included now. The tests for that will follow in the next version. But we already did quite intensive testing for that. Hard reboot and also soft reboot of the target are possible with that reconnect. It just reconnects and everything is fine again. With soft reboot I mean: disabling the target, removing the exports, rebooting, exporting the same LUNs, re-enabling the target. It also has an automatic mechanism to reduce the possibility of a DDoS attack reconnect. It automatically reconnects at different intervals. Check it out: https://github.com/sriemer/ib_srp Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ANNOUNCE] SRP: ProfitBricks publishes its SRP Initiator patches
Hello everyone, I'm very proud to announce that we finally publish our SRP initiator patches we've been working on for quite some time now. In the first step we publish our way of failing IO fast as we've noticed that the way Bart Van Assche does that in his GitHub repository doesn't match our requirements completely. His repo: https://github.com/bvanassche/ib_srp-backport Our repo: https://github.com/sriemer/ib_srp We want to fail IO fast in exactly the time we configure. With our patches this works (or please tell us why not). We provide you with full test descriptions and related shell scripts. Everything is done with as little dependencies as possible. The shell scripts can also be very useful to show how to configure and use SRP with sysfs only. This is why I've added the scst-devel mailing list here. We want to be as close as possible to the kernel. We want to combine efforts here and to get valuable feedback from you all. Evaluation, testing, criticism, comments, etc more than welcome! Hopefully, we can get a really cool solution into the mainline together! This would make my job as the maintainer for the ProfitBricks host kernels a lot easier! ;-) You'll notice: We've already adapted patches from Bart and parts of his patches. So it is only fair to publish our patches as well. :-) It is the same as with Bart's patches: This can't be used for production, yet. Our patches don't have the reconnect for now. Ideas how to implement that on top are welcome. Just send your patches directly to me! :-) Please further notice: There is a major bug in the upstream multipath-tools. These read sysfs files cached which leads to IO on offline devices. We've fixed it for us and publish the fix for you as well. :-) Git repo: https://github.com/sriemer/multipath-tools Thank you so much for your help in the past and in the future as well! Thanks for the patience and reading this! We'll continue publishing our SRP patches relevant for the mainline. If you want to meet me or ProfitBricks in person, we'll have a booth on LinuxTag in Berlin/Germany. I'll have a technical talk there about SRP: http://www.linuxtag.org/2013/en/program/thursday-may-23-2013.html?eventid=208 Cheers, Sebastian -- Sebastian Riemer Linux Kernel Developer - Storage ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany www.profitbricks.com • sebastian.rie...@profitbricks.com Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Andreas Gauger, Achim Weiss -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 09.04.2013 13:51, Vasiliy Tolstov wrote: Something like this: echo 4096 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu After doing this all srp connections down and port is down. I need to restart openibd Sorry for that! It's much easier to set the IP MTU. Managed switches support setting the RDMA MTU. So it could be possible that it is a setting in the SM config. But I'm not sure. $ man opensm says that it can be set in the partitions.conf You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to be the default case if a rate isn't supported. Yes, in older card with ConnecX i see this, but in case of ConnectX-3 only 10 Gb The kernel version is okay. It depends on the user space. There is a support note in OFED 3.5: - ConnectX-3 (fw-ConnectX3 Rev 2.11.0500) (FDR and FDR10 Modes are Supported) Before OFED 3.5 these HCAs aren't supported. A look at the related source code could be worth a try. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 09.04.2013 14:49, Hal Rosenstock wrote: On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote: Hello. I have some servers, with mellanox ConnectX-3 and have some questions: Why max_mtu differs with active_mtu? What does peer port say for max MTU ? How can i set active mtu? SM sets active MTU to min of peer ports max MTUs. So with peer port max MTU do you mean this file?: /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu I've seen that it can be set as well. I've got two ConnectX-2 machines connected back2back. In general these have 4K max and active. So let's try something: Host1: $ echo 2048 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu # Port is not active, let's reactivate it. $ echo 1 /sys/class/infiniband/mlx4_0/device/enable ibv_devinfo Host1: max_mtu:2048 (4) active_mtu: 2048 (4) Host2: max_mtu:4096 (5) active_mtu: 2048 (4) Both had 4096 (5) before everywhere. So that's the recommended way to reduce the MTU? I've heard that reducing the MTU in a fabric can help fighting congestion issues. As congestion control doesn't work yet, could this help against congestion? Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 09.04.2013 15:34, Hal Rosenstock wrote: On 4/9/2013 9:16 AM, Sebastian Riemer wrote: On 09.04.2013 14:49, Hal Rosenstock wrote: On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote: Hello. I have some servers, with mellanox ConnectX-3 and have some questions: Why max_mtu differs with active_mtu? What does peer port say for max MTU ? How can i set active mtu? SM sets active MTU to min of peer ports max MTUs. So with peer port max MTU do you mean this file?: /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu I meant NeighborMTU from PortInfo as active MTU and MTUCap there is supported MTU. So these values are exactly the same as in ibv_devinfo and can be set in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu. I've found the PortInfo with the command smpquery portinfo -C mlx4_0 3 1 where I'm using the first HCA to contact the SM. I tell the SM the destination LID ('3' here in my case) and the destination port ('1'). Is there another method to set the max MTU? I know that switches can also set the max MTU for their switch ports where most of them use 2048 as default. How to change these switch port MTUs for unmanaged switches? On managed switches this can be done over the web front-end. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 09.04.2013 16:23, Hal Rosenstock wrote: So these values are exactly the same as in ibv_devinfo and can be set in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu. I've found the PortInfo with the command smpquery portinfo -C mlx4_0 3 1 where I'm using the first HCA to contact the SM. I tell the SM the destination LID ('3' here in my case) and the destination port ('1'). Is there another method to set the max MTU? That doesn't set max MTU (MTUCap) but merely reads it (for that port). Sorry, copy and paste error. I've meant the mlx4 file: /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu But you've answered that by vendor specific. Thanks for the valuable information! For us most interesting would be if the MTU can be changed live without any service disruption. Looks like the mlx4 driver can't provide that. Perhaps switches can do that. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout
Hi Bart, now I've got my priority on SRP again. I've also noticed that your ib_srp-backport doesn't fail the IO fast enough. The fast_io_fail_tmo only comes into play after the QP is already in timeout and the terminate_rport_io function is missing. My idea is to use the QP retry count directly for fast IO failing. It is at 7 by default and the QP timeout is at approx. 2s. The overall QP timeout is at approx. 35s already (1+7 tries * 2s * 2, I guess). Using only 3 retries I'm at approx 18s. My patches introduce that parameter as module parameter as it is quite difficult to set the QP from RTS to RTR again. Only there the QP timeout parameters can be set. My patch series isn't complete yet as paths aren't reconnected - they are only failed fast bound to the overall QP timeout. But it should give you an idea what I'm trying to do here. What are your thought regarding this? Attached patches: ib_srp: register srp_fail_rport_io as terminate_rport_io ib_srp: be quiet when failing SCSI commands scsi_transport_srp: disable the fast_io_fail_tmo parameter ib_srp: show the QP timeout and retry count in srp_host sysfs files ib_srp: introduce qp_retry_cnt module parameter Cheers, Sebastian Btw.: Before, I've hacked MD RAID-1 for high-performance replication as DRBD is crap for our purposes. But that's worthless without a reliably working transport. From c101d00fe529d845192dd6d5930a1b9c16c99b81 Mon Sep 17 00:00:00 2001 From: Sebastian Riemer sebastian.rie...@profitbricks.com Date: Wed, 13 Mar 2013 16:16:28 +0100 Subject: [PATCH 1/5] ib_srp: register srp_fail_rport_io as terminate_rport_io We need to fail the IO fast in the selected time. So register the missing terminate_rport_io function. Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c | 24 1 file changed, 24 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index dc49dc8..64644c5 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -756,6 +756,29 @@ static void srp_reset_req(struct srp_target_port *target, struct srp_request *re } } +static void srp_fail_req(struct srp_target_port *target, struct srp_request *req) +{ + struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL); + + if (scmnd) { + srp_free_req(target, req, scmnd, 0); + scmnd-result = DID_TRANSPORT_FAILFAST 16; + scmnd-scsi_done(scmnd); + } +} + +static void srp_fail_rport_io(struct srp_rport *rport) +{ + struct srp_target_port *target = rport-lld_data; + int i; + + for (i = 0; i SRP_CMD_SQ_SIZE; ++i) { + struct srp_request *req = target-req_ring[i]; + if (req-scmnd) + srp_fail_req(target, req); + } +} + static int srp_reconnect_target(struct srp_target_port *target) { struct Scsi_Host *shost = target-scsi_host; @@ -2700,6 +2723,7 @@ static void srp_remove_one(struct ib_device *device) static struct srp_function_template ib_srp_transport_functions = { .rport_delete= srp_rport_delete, + .terminate_rport_io = srp_fail_rport_io, }; static int __init srp_init_module(void) -- 1.7.9.5 From 06c3cc832a672856c416fee72705ea0448f23855 Mon Sep 17 00:00:00 2001 From: Sebastian Riemer sebastian.rie...@profitbricks.com Date: Wed, 13 Mar 2013 16:46:44 +0100 Subject: [PATCH 2/5] ib_srp: be quiet when failing SCSI commands Signed-off-by: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/ulp/srp/ib_srp.c |4 1 file changed, 4 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 64644c5..0607e5a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -750,6 +750,7 @@ static void srp_reset_req(struct srp_target_port *target, struct srp_request *re struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL); if (scmnd) { + scmnd-request-cmd_flags |= REQ_QUIET; srp_free_req(target, req, scmnd, 0); scmnd-result = DID_RESET 16; scmnd-scsi_done(scmnd); @@ -761,6 +762,7 @@ static void srp_fail_req(struct srp_target_port *target, struct srp_request *req struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL); if (scmnd) { + scmnd-request-cmd_flags |= REQ_QUIET; srp_free_req(target, req, scmnd, 0); scmnd-result = DID_TRANSPORT_FAILFAST 16; scmnd-scsi_done(scmnd); @@ -1526,6 +1528,7 @@ static int SRP_QUEUECOMMAND(struct Scsi_Host *shost, struct scsi_cmnd *scmnd) int len; if (unlikely(target-transport_offline)) { + scmnd-request-cmd_flags |= REQ_QUIET; scmnd-result = DID_NO_CONNECT 16; scmnd
Re: [RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout
On 19.03.2013 12:22, Or Gerlitz wrote: On 19/03/2013 12:16, Sebastian Riemer wrote: Hi Bart, now I've got my priority on SRP again. Hi Sebastian, Are these patches targeted to upstream or backports to some OS/kernel? if the former, can you please send them inline so we can have proper review? Or. Hi Or, the patches are targeted to the stuff Bart is doing on GitHub. https://github.com/bvanassche/ib_srp-backport If I've seen that right, fast IO failing hasn't been accepted to the mainline, yet. So I didn't want to spam you all with multiple mails of patches which don't apply to upstream. I want to introduce my idea in the first place. The patches are not a final solution to the problem. They should only show what I'm trying to do here. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC ib_srp-backport] ib_srp: bind fast IO failing to QP timeout
On 19.03.2013 12:45, Bart Van Assche wrote: On 03/19/13 11:16, Sebastian Riemer wrote: What are your thought regarding this? Attached patches: ib_srp: register srp_fail_rport_io as terminate_rport_io ib_srp: be quiet when failing SCSI commands scsi_transport_srp: disable the fast_io_fail_tmo parameter ib_srp: show the QP timeout and retry count in srp_host sysfs files ib_srp: introduce qp_retry_cnt module parameter Hello Sebastian, Patches 1 and 2 make sense to me. Patch 3 makes it impossible to disable fast_io_fail_tmo and also disables the fast_io_fail_tmo timer - was that intended ? I had a patch which has completely thrown out that fast_io_fail_tmo parameter for ib_srp v1.2 as in my tests with dm-multipath it didn't make any sense but having even longer to wait until IO can be failed. If there is a connection issue, then all SCSI disks from that target are affected and not only a single SCSI device. Today I've seen that you are at v1.3 already and that patch didn't apply anymore. So I thought disabling only the functionality shows what I'm trying to do here. Can you please explain me what your intention was with that fast_io_fail_tmo? What I want to have is a calculateable timeout for IO failing. If the QP retries are at 7 I can't get any lower than 35 seconds. Regarding patches 4 and 5: I'm not sure whether reducing the QP retry count will work well in large fabrics. For me it is already a mystery why I measure 35 seconds at 2s QP timeout and 7 retries. If the maximum is at 2s * 7 retries * 4, then I'm at 60 seconds. That's plain too long. The fast_io_fail_tmo comes on top of that. How else should I reduce the overall timeout until I see in iostat that the other path is taken? The iSCSI initiator follows another approach to realize quick failover, namely by periodically checking the transport layer and by triggering the fast_io_fail timer if that check fails. Unfortunately the SRP spec does not define an operation suited as a transport layer test. But maybe a zero-length RDMA write can be used to verify the transport layer ? Hmmm, how do you want to implement that? This write would run into (overall) QP timeout as well, I guess. The dm-multipath checks paths with directio reads by polling every 5 seconds by default. IMHO this does exactly that. I think the IB specification allows such operations. A quote from page 439: C9-88: For an HCA responder using Reliable Connection service, for each zero-length RDMA READ or WRITE request, the R_Key shall not be validated, even if the request includes Immediate data. And this isn't bound on the (overall) QP timeout? Can you send me a proof of concept for this? Note: I'm still working on transforming the patches present in the ib_srp-backport repository such that these become acceptable for upstream inclusion. I know that and I appreciate that. But I'm running out of time. Perhaps, we can combine some efforts to implement something working first. Doesn't have to be clean and shiny. For me also hacky is okay as long as it works in the data center. Yes, I have to admit that the patches 4 and 5 are hacky. Perhaps, I can report you soon how it behaves reducing the retry count in a large setup. ;-) Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] IPoIB: Free ipoib neigh on path record failure so path rec queries are retried
On 26.02.2013 17:55, Roland Dreier wrote: [...] In fact I bet this is why the bug has been there as long as it has been: almost no one is using IPv6 on IPoIB seriously, and IPv4 should work OK as you point out. Thanks a lot, Unfortunately, we are using IPoIB with IPv6 in production for the inter-VM and the internet gateway traffic. We had an issue last Friday were one of our gateway machines wasn't reachable anymore and pings didn't come through. We had to reload the ib_ipoib modules of nearly every server connected to that gateway. I'm so glad that we use SRP - all SRP connections survived. I don't have the time to care for IPoIB as well at ProfitBricks so I will encourage our networking teams to get their hands on IPoIB and linux-rdma communication. How to verify if we've hit this bug? How to reproduce this bug? Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
On 08.02.2013 10:24, Sagi Grimberg wrote: On 2/8/2013 12:42 AM, Vu Pham wrote: Hello Bart, Thank you for taking the initiative. Mellanox think that this should be discussed. We'd be happy to attend. We also would like to discuss: * How and how fast does SRP detect a path failure besides RC error? * Role of srp_daemon, how often srp_daemon scan fabric for new/old targets, how-to scale srp_daemon discovery, traps. -vu Hey Bart, I agree with Vu that this issue should be discussed. We'd be happy to attend. -- Sagi Wow, also thanks to Mellanox for spending resources on SRP as well! Last year in June we came across a very different situation. Cheers, Sebastian and the ProfitBricks storage team -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Virtual ibnetdiscover command fails
On 06.02.2013 10:22, Or Gerlitz wrote: On 06/02/2013 11:17, Mathis GAVILLON wrote: Ok. But what is it possible to do with Infiniband VFs if QP0 is not available ? EVERYTHING, e.g run IPoIB, iSER, RDS, MPI, etc, etc - except for what requires QP0, such as running SM or issuing SMPs for discovery/diagnostics purposes But SRP isn't provided with SR-IOV I've heared. Is it just a matter of software or is it a matter of firmware/hardware? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Virtual ibnetdiscover command fails
On 06.02.2013 11:20, Or Gerlitz wrote: On 06/02/2013 12:04, Mathis GAVILLON wrote: Just a last question : is that possible VFs lid to be different from PF one ? NO, we've implemented a shared port model, so all functions on the same IB port use the same lid, each function has its own virtual GUID though. So if I don't use the unmaintained srptools to get the SRP connection strings but instead send them directly to the initiator to connect to the SRP target, then also SRP should be possible with the virtual GUID. Am I right? Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
Hi Bart, thanks for approaching this! We're not the best mainline developers so I guess we won't be there. But we have the big SRP setups and our sysadmins really don't like reconnecting SRP hosts manually and putting their devices complicated to the related dm-multipath devices again. Think about 200 SRP devices per server (already filtered by initiator groups). We also consider the srptools as unmaintained, unreliable and slow. It is possible that the srptools commands don't return. Therefore, we send the SRP connection strings directly to the initiator within our mapping jobs. It would also be great not to develop a DDoS attack reconnect like open-iscsi does. Rebooting the whole cluster to fix this isn't fun. There must be a possibility to configure different reconnect intervals. Btw.: We even had the case that the IPoIB stuff reconnected but the RDMA part didn't with iSER. It was so broken then, that we couldn't disconnect or reconnect anymore - only chance hard reboot. So you know our point of view and we already develop it that way for us. I'm looking forward what's the output of the discussion. At the current state it's difficult to nag our bosses to publish what we have so far. On 01.02.2013 14:43, Bart Van Assche wrote: It is known that it takes about two to three minutes before the upstream SRP initiator fails over from a failed path to a working path. This is not only considered longer than acceptable but is also longer than other Linux SCSI initiators (e.g. iSCSI and FC). Progress so far with improving the fail-over SRP initiator has been slow. This is because the discussion about candidate patches occurred at two different levels: not only the patches itself were discussed but also the approach that should be followed. That last aspect is easier to discuss in a meeting than over a mailing list. Hence the proposal to discuss SRP initiator failover behavior during the LSF/MM summit. The topics that need further discussion are: * If a path fails, remove the entire SCSI host or preserve the SCSI host and only remove the SCSI devices associated with that host ? Preserve SCSI hosts and SCSI devices unless they are removed explicitly by disconnect request. Rescanning SCSI devices with - - - like iscsiadm -R does for example may reorder the device names (sda becomes sdb, etc.). * Which software component should test the state of a path and should reconnect to an SRP target if a path is restored ? Should that be done by the user space process srp_daemon or by the SRP initiator kernel module ? By the SRP kernel module. This is exactly the big advantage of SRP so far: It is simple, it is RDMA and kernel only. * How should the SRP initiator behave after a path failure has been detected ? Should the behavior be similar to the FC initiator with its fast_io_fail_tmo and dev_loss_tmo parameters ? Fine for us as long as it is possible to configure such times and the behavior at all. For dm-multipath we need fast IO failing and that the SRP initiator tries to automatically reconnect that path. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] OFED-3.5-rc2 is available
Hi Vladimir, why do you put OFED together for a kernel nobody uses? Perhaps SLES and Red Hat do it like this but nobody else. Have a look at http://en.wikipedia.org/wiki/Linux_kernel - 3.0, 3.2 and 3.4 are the long-term stable releases. This approach is worse than the approach before IMHO. Since 1.5.4.1 there is no real stable release. No wonder that everyone puts his own OFED together. I'm so glad that we don't need too much OFED user space and we just use the IB stuff from the mainline kernel. We need to surf close to mainline kernel development anyway. The only thing that we would need is a list which packet with which version matches our mainline kernel. Updating it from Git directly from a tag or branch with no external patches or SRPM/tar.gz stuff would make it much easier - also for other distributions. Things are getting better for us with the switch-over to Gentoo. There, we can leave out the packages that we don't actually need. Cheers, Sebastian On 03.10.2012 17:54, Vladimir Sokolovsky wrote: Hi, OFED 3.5-rc2 is available. The tarball is available on: http://www.openfabrics.org/downloads/OFED/ofed-3.5/OFED-3.5-rc2.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 3.5 Regards, Vladimir OFED-3.5-rc2 Main Changes from OFED 3.5-rc1 --- compat-rdma: Add SRP backport compat-rdma: /etc/init.d/openibd: Fix LSB header compat-rdma: IB/qib: linux-3.6 patches backported compat-rdma: iw_cxgb4: Fix bug 2369 in OFED bugzilla compat-rdma: IB/qib: fix compliance regression in 3.5 compat-rdma: RDMA/nes: Added linux-next-pending patches compat-rdma: RDMA/nes: Updated backports compat-rdma: NFSRDMA RHEL6.3 backport compat-rdma: NFSRDMA SLES11SP2 backport compat-rdma: linux-next-cherry-picks: RDMA/ucma.c: Different fix for ucma context uid=0, causing iWarp RDMA applications to fail in connection establishment Updated packages: infinipath-psm-3.0.1-115.1015_open perftest-1.4.0-0.80.gd1763bd qperf-0.4.7-0.2.gf3f7001 Supported Platforms and Operating Systems - o CPU architectures: - x86_64 - x86 - ppc64 - ia64 o Linux Operating Systems: - RedHat EL6.2 2.6.32-220.el6 - RedHat EL6.3 2.6.32-279.el6 - SLES11 SP23.0.13-0.27-default - kernel.org3.5* * Minimal QA for these versions. OFED_release_notes.txt Note: See the release notes of each component for additional issues. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/20] ib_srp: Make srp_disconnect_target() wait for IB completions
Hi Bart, we've triggered the WARN_ON() in srp_wait_last_send_wqe() by connecting to a disabled SCST SRP target. I would remove that one. Cheers, Sebastian On 09.08.2012 17:53, Bart Van Assche wrote: Modify srp_disconnect_target() such that it waits until it is sure that no new IB completions will be received anymore. Signed-off-by: Bart Van Assche bvanass...@acm.org Cc: David Dillow dillo...@ornl.gov Cc: Roland Dreier rol...@purestorage.com --- drivers/infiniband/ulp/srp/ib_srp.c | 104 ++- drivers/infiniband/ulp/srp/ib_srp.h |6 ++ 2 files changed, 95 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 0e7825a..4de7c46 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -40,7 +40,7 @@ #include linux/parser.h #include linux/random.h #include linux/jiffies.h - +#include linux/delay.h #include linux/atomic.h #include scsi/scsi.h @@ -229,14 +229,16 @@ static int srp_create_target_ib(struct srp_target_port *target) return -ENOMEM; target-recv_cq = ib_create_cq(target-srp_host-srp_dev-dev, -srp_recv_completion, NULL, target, SRP_RQ_SIZE, 0); +srp_recv_completion, NULL, target, +SRP_RQ_SIZE + 1, 0); if (IS_ERR(target-recv_cq)) { ret = PTR_ERR(target-recv_cq); goto err; } target-send_cq = ib_create_cq(target-srp_host-srp_dev-dev, -srp_send_completion, NULL, target, SRP_SQ_SIZE, 0); +srp_send_completion, NULL, target, +SRP_SQ_SIZE + 1, 0); if (IS_ERR(target-send_cq)) { ret = PTR_ERR(target-send_cq); goto err_recv_cq; @@ -245,8 +247,8 @@ static int srp_create_target_ib(struct srp_target_port *target) ib_req_notify_cq(target-recv_cq, IB_CQ_NEXT_COMP); init_attr-event_handler = srp_qp_event; - init_attr-cap.max_send_wr = SRP_SQ_SIZE; - init_attr-cap.max_recv_wr = SRP_RQ_SIZE; + init_attr-cap.max_send_wr = SRP_SQ_SIZE + 1; + init_attr-cap.max_recv_wr = SRP_RQ_SIZE + 1; init_attr-cap.max_recv_sge= 1; init_attr-cap.max_send_sge= 1; init_attr-sq_sig_type = IB_SIGNAL_ALL_WR; @@ -460,11 +462,69 @@ static bool srp_change_conn_state(struct srp_target_port *target, return changed; } +static void srp_wait_last_recv_wqe(struct srp_target_port *target) +{ + static struct ib_recv_wr wr = { + .wr_id = SRP_LAST_RECV, + }; + struct ib_recv_wr *bad_wr; + int ret; + + if (target-last_recv_wqe) + return; + + ret = ib_post_recv(target-qp, wr, bad_wr); + if (ret 0) { + shost_printk(KERN_ERR, target-scsi_host, + ib_post_recv() failed (%d)\n, ret); + return; + } + + ret = wait_event_timeout(target-qp_wq, target-last_recv_wqe, + target-rq_tmo_jiffies); + WARN(ret = 0, Timeout while waiting for last recv WQE (ret = %d)\n, + ret); +} + +static void srp_wait_last_send_wqe(struct srp_target_port *target) +{ + static struct ib_send_wr wr = { + .wr_id = SRP_LAST_SEND, + }; + struct ib_send_wr *bad_wr; + unsigned long deadline = jiffies + target-rq_tmo_jiffies; + int ret; + + if (target-last_send_wqe) + return; + + ret = ib_post_send(target-qp, wr, bad_wr); + if (ret 0) { + shost_printk(KERN_ERR, target-scsi_host, + ib_post_send() failed (%d)\n, ret); + return; + } + + while (!target-last_send_wqe time_before(jiffies, deadline)) { + srp_send_completion(target-send_cq, target); + msleep(20); + } + + WARN_ON(!target-last_send_wqe); -- here it is - remove it +} + static void srp_disconnect_target(struct srp_target_port *target) { + static struct ib_qp_attr qp_attr = { + .qp_state = IB_QPS_ERR + }; + int ret; + if (srp_change_conn_state(target, false)) { /* XXX should send SRP_I_LOGOUT request */ + BUG_ON(!target-cm_id); + init_completion(target-done); if (ib_send_cm_dreq(target-cm_id, NULL, 0)) { shost_printk(KERN_DEBUG, target-scsi_host, @@ -473,6 +533,20 @@ static void srp_disconnect_target(struct srp_target_port *target) wait_for_completion(target-done); } } + + if (target-cm_id) { + ib_destroy_cm_id(target-cm_id); + target-cm_id = NULL; + } + + if
Basics of congestion control?
Hi all, could someone please explain what I can do with the new congestion control? Do I understand it right that I can influence the flow control (e.g. amount of credits) with it so that I can avoid disruption (XmitWait, XmitDiscardedPackets) caused by congestion? This is at least what we need. ;-) Cheers, Sebastian -- Sebastian Riemer Linux Kernel Developer ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany www.profitbricks.com • sebastian.rie...@profitbricks.com Tel.: +49 - 30 - 60 98 56 991 - 915 Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Andreas Gauger, Achim Weiss -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Basics of congestion control?
On 31.07.2012 13:08, Alex Netes wrote: Congestion control isn't a credit based mechanism. While InfiniBand flow control is defined between two ports of the same link, congestion control is working across the fabric between a congestion point (a switch) and a reaction point (source node). Reaction point implements a Congestion Control Table that contains an array of values of injection rate delay used to control congestion. You can find more information in the IBTA LWG Errata document 3Q2010. Nice, thank you very much! I've found the IBTA spec and the errata. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mlx4_ib_create_qp failed - OOM with call trace
On 19.07.2012 22:31, Roland Dreier wrote: I have to think about the best way to fix this. We could just convert to vmalloc() here but I'm not thrilled about consuming vmalloc() space (on modern 64-bit architectures it's a non-issue but it's going to cause issues for people on smaller systems). This is at least something we can implement and test for us as we only have modern server systems. Thank you very much for the information and your help! Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
mlx4_ib_create_qp failed - OOM with call trace
:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15644kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes [5416523.204866] lowmem_reserve[]: 0 3495 16095 16095 [5416523.204868] Node 0 DMA32 free:77448kB min:14664kB low:18328kB high:21996kB active_anon:0kB inactive_anon:104kB active_file:1491480kB inactive_file:1526412kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3579648kB mlocked:0kB dirty:1176kB writeback:0kB mapped:180kB shmem:0kB slab_reclaimable:301820kB slab_unreclaimable:97804kB kernel_stack:20768kB pagetables:308kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no [5416523.204875] lowmem_reserve[]: 0 0 12600 12600 [5416523.204877] Node 0 Normal free:72596kB min:52852kB low:66064kB high:79276kB active_anon:4636kB inactive_anon:17712kB active_file:5848236kB inactive_file:5906316kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12902400kB mlocked:0kB dirty:12500kB writeback:4kB mapped:4196kB shmem:6584kB slab_reclaimable:258216kB slab_unreclaimable:187976kB kernel_stack:15456kB pagetables:1340kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:36 all_unreclaimable? no [5416523.204883] lowmem_reserve[]: 0 0 0 0 [5416523.204885] Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15900kB [5416523.204890] Node 0 DMA32: 1923*4kB 2433*8kB 3088*16kB 16*32kB 3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 78164kB [5416523.204895] Node 0 Normal: 14590*4kB 1263*8kB 56*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 73296kB [5416523.204900] 3698022 total pagecache pages [5416523.204901] 3249 pages in swap cache [5416523.204903] Swap cache stats: add 2384957, delete 2381708, find 8582137/8898669 [5416523.204904] Free swap = 3902556kB [5416523.204904] Total swap = 3926012kB [5416523.238116] 4194288 pages RAM [5416523.238117] 87642 pages reserved [5416523.238118] 3673827 pages shared [5416523.238119] 363584 pages non-shared [5416523.238323] ib_srpt: ***ERROR***: failed to create_qp ret= -12 [5416523.238381] ib_srpt: ***ERROR***: rejected SRP_LOGIN_REQ because creating a new RDMA channel failed. [5416523.238393] ib_srpt: Rejecting login with reason 0x10001 Cheers, Sebastian -- Sebastian Riemer Linux Kernel Developer ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany www.profitbricks.com • sebastian.rie...@profitbricks.com Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Andreas Gauger, Achim Weiss -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OFED 1.5.4.1 on Ubuntu 10.04 with Mellanox cards?
Hi Chet, On 22/06/12 21:02, Chet Murthy wrote: Sebastian, Thank you for taking the time to explain these things! It's a little confusing Here a simple list of matching code: OFED-1.5.4 --- kernel 3.2.x OFED-1.5.4.1 --- kernel 3.3.x (1) Is there a more-exhaustive list of the right kernel to use with each OFED release? I was going by the OFED docs (e.g. release notes), which seemed to indicate that for 1.5.4.1, the right range of kernels was (kernel.org: 2.6.30 - 3.1), and specific kernel versions for various distros. Unfortunately, there is no more-exhaustive list for matching the kernel code with the OFED user space. It's a matter of comparing dates - kernel release and OFED release. O.K., here is how they put the OFA kernel code into OFED: - kernel developers develop for the latest kernel release cycle (here 3.3) - OFED packagers use an older kernel as basis (2.6.30) and forward port the OFA kernel stuff to the current kernel release (here 3.3) by patches for kernels (2.6.30..3.1) - this leaves space for failures (e.g. that they don't port the open-iscsi kernel code correctly) - this is why they say that they don't support the mainline kernels completely We at ProfitBricks need latest kernels anyway. This is why we match it from upstream (OFA kernel stuff from kernel.org). And we don't have to build the OFA kernel modules from out-of-tree which simplifies our kernel build chain. We have OFED-1.5.4 with OFA kernel code from kernel 3.2 at the moment. But there is also a new OFED release approach: Perhaps you've seen the OFED-3.2 already?! This is the OFED especially for kernel 3.2. This makes it easier to match OFED user space and kernel code. Here they just backport the OFA kernel stuff e.g. from 3.4 to 3.2. Looks promising, but I have no experience with that, yet. (2) I'm pretty familiar with adminstering Debian systems and building debian packages, hacking their insides, alienizing, hacking that process, etc. (I -think- ;-) The only real question for me is, which versions, with which patches, of the various bits, will work together with this RoCEE card. Your issue can be something with the shell scripts, kernel code to user space matching or plain that you don't have the opensm running. Without a running instance of a subnet manager your card will get no LID assigned, no partition key, etc. IPoIB, MPI, iSER, SRP, etc. won't work. Check with ibdiagnet -r if your master subnet manager is running. IB is self-managed by the subnet manager. Make sure that your opensm configuration is correct. We have big deployments and don't want to have rpm installed on Debian systems. This is why we've taken OFED-1.5.2 stuff from debian experimental from pkg-ofed. We've converted the SVN stuff into git repos for OFED, imported the OFED-1.5.4 upstream code and adopted the modifications by Debian (e.g. shell code changes). Now, we can build OFED with git-buildpackage and upload the deb packages to our debian repository. (3) I'm -not at all- familiar with the workflow/process that Debian Developers use. For instance, I don't really understand what you men below: But you'll have to ensure that the kernel code matches the OFED user space. The kernel stuff included in OFED doesn't support latest kernels and is based on an older code base (e.g. OFED 1.5.4 kernel stuff is based on 2.6.30). Do you mean that the kernel-ib RPM in 1.5.4 is the code form the 2.6.30 kernel? But then the list below doesn't seem to make sense. Here a simple list of matching code: OFED-1.5.4 --- kernel 3.2.x OFED-1.5.4.1 --- kernel 3.3.x I've explained this above. (4) I think what you're saying here the trick is to check out the latest pkg-ofed source from debian SVN (svn://svn.debian.org/svn/pkg-ofed/) and to update the upstream source by merging the stuff by extracting the source RPMs or even better by importing the source directly from the git repos of the OFED user space. In the debian directory there are some patches e.g. which change some stuff in shell scripts for the dash. These need to be adopted. is: (a) check out the stuff from svn.debian.org (b) pull source from the OFED repos user-space (c) -copy- that (latest) OFED source into the tree I checked-out from debian (d) make sure that the patches in the debian directories apply properly to the various shellscripts (e) build debian packages per usual And per your instructions above, I believe you're saying I should be using a 3.3.x kernel? Yes, this is exactly what I would suggest to you if you want to have a really working solution without rpm. You should at least have a look at this or try it to see if this fixes your issues and if this gives you advantages. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OFED 1.5.4.1 on Ubuntu 10.04 with Mellanox cards?
Hi Chet, the trick is to check out the latest pkg-ofed source from debian SVN (svn://svn.debian.org/svn/pkg-ofed/) and to update the upstream source by merging the stuff by extracting the source RPMs or even better by importing the source directly from the git repos of the OFED user space. In the debian directory there are some patches e.g. which change some stuff in shell scripts for the dash. These need to be adopted. But you'll have to ensure that the kernel code matches the OFED user space. The kernel stuff included in OFED doesn't support latest kernels and is based on an older code base (e.g. OFED 1.5.4 kernel stuff is based on 2.6.30). I hope that you don't need iSER. The open-iscsi kernel stuff in there is also based on 2.6.30 which means that you would need old open-iscsi user space. This is why we've decided to follow what they call upstream in this list. This means: Use the OFED kernel code from the matching vanilla kernel from kernel.org. Here a simple list of matching code: OFED-1.5.4 --- kernel 3.2.x OFED-1.5.4.1 --- kernel 3.3.x I've attached the IB user space HOWTO from Or Gerlitz for the git repos. Some of the git repos already have a debinan directory. Do you know how to build Debian packages? Cheers, Sebastian On 22/06/12 02:46, Chet Murthy wrote: Hi, A long while ago, I got OFED 1.5.2 working on Ubuntu 10.04 (Lucid) on Opterons with Mellanox DDR cards. It was a little messy, getting the RPMs compiled, but it was pretty straightforward. Basically, I (a) built a kernel with neither infiniband nor mellanox ethernet drivers, and (b) ran the OFED install.pl with some minor modifications to convert the RPMs into DEBs as they were built. And everything worked, smooth as a whistle. Today, I tried to do the same thing with OFED 1.5.4.1, and while the process of -building- was straightforward, once I get done, the card's state is all zeroes: chet@memstore3:~$ sudo ibstatus Infiniband device 'mlx4_0' port 1 status: default gid: ::::::: base lid:0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate:2.5 Gb/sec (1X) link_layer: Ethernet Infiniband device 'mlx4_0' port 2 status: default gid: ::::::: base lid:0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate:2.5 Gb/sec (1X) link_layer: Ethernet The card's a modern ConnectX 1f:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) and on identical RedHat machines, the card's status is quite different: [root@memstore4 chet]# ibstatus Infiniband device 'mlx4_0' port 1 status: default gid: fe80::::0202:c9ff:fe4b:5890 base lid:0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate:10 Gb/sec (1X QDR) link_layer: Ethernet Infiniband device 'mlx4_0' port 2 status: default gid: fe80::::0202:c9ff:fe4b:5891 base lid:0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate:10 Gb/sec (1X QDR) link_layer: Ethernet I'm not even sure how to go about debugging this. Has anybody gotten OFED to work on Ubuntu with such modern cards? Thanks, --chet-- -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html IB user space HOWTO June 2012 Or Gerlitz ogerl...@mellanox.com This little note attempts to get you through how to get the upstream user-space IB packages, specifically libibverbs/libmlx4/librdmacm and/or opensm and the IB diags. Under Fedora / RHEL, installing the INBOX user-space IB/RDMA offering is easy as # yum groupinstall Infiniband Support The IB service is called rdma (vs. openibd which used to be the name in older RHEL/Fedora releases) and there is an rpm named rdma with various scripts. Note that this will not install opensm/diags (see below). If you are seeking the latest RELEASE done by the maintainers, its also trivial, the releases are provided in the form of tar balls which you plug into rpmbuild -ts and you have fresh source RPM to build and later install. Going more hackish, you would need to build the sources from the maintainers git, the git trees contain spec files, so the process would be to create the tarballs and then repeat the rpmbuild excercise. See below links to where there are tarball releases and the git trees where here gitweb links are provided, they have the git
Re: IB/iSER problems with Linux 3.0
On 17/01/12 15:56, Or Gerlitz wrote: could you try and patch your 3.0.15 kernel with commit 52439540ea30396982b69662dd21aede6b336288 IB/iser: DMA unmap TX bufs used for iSCSI/iSER headers from upstream, this could help here. Hi Or, unfortunately, just cherry-picking that commit didn't do the job. Therefore, I've backported the whole ib_iser code from 3.2.1 to 3.0.15. Now it works fine. I've attached the git diff. Should I test anything further? Thanks for your time again! Cheers, Sebastian diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index 8db008d..daf293c 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -57,6 +57,7 @@ #include linux/scatterlist.h #include linux/delay.h #include linux/slab.h +#include linux/module.h #include net/sock.h @@ -101,13 +102,17 @@ iscsi_iser_recv(struct iscsi_conn *conn, /* verify PDU length */ datalen = ntoh24(hdr-dlength); - if (datalen != rx_data_len) { - printk(KERN_ERR iscsi_iser: datalen %d (hdr) != %d (IB) \n, - datalen, rx_data_len); + if (datalen rx_data_len || (datalen + 4) rx_data_len) { + iser_err(wrong datalen %d (hdr), %d (IB)\n, + datalen, rx_data_len); rc = ISCSI_ERR_DATALEN; goto error; } + if (datalen != rx_data_len) + iser_dbg(aligned datalen (%d) hdr, %d (IB)\n, + datalen, rx_data_len); + /* read AHS */ ahslen = hdr-hlength * 4; @@ -147,7 +152,6 @@ int iser_initialize_task_headers(struct iscsi_task *task, tx_desc-tx_sg[0].length = ISER_HEADERS_LEN; tx_desc-tx_sg[0].lkey = device-mr-lkey; - iser_task-headers_initialized = 1; iser_task-iser_conn = iser_conn; return 0; } @@ -162,8 +166,7 @@ iscsi_iser_task_init(struct iscsi_task *task) { struct iscsi_iser_task *iser_task = task-dd_data; - if (!iser_task-headers_initialized) - if (iser_initialize_task_headers(task, iser_task-desc)) + if (iser_initialize_task_headers(task, iser_task-desc)) return -ENOMEM; /* mgmt task */ @@ -274,6 +277,13 @@ iscsi_iser_task_xmit(struct iscsi_task *task) static void iscsi_iser_cleanup_task(struct iscsi_task *task) { struct iscsi_iser_task *iser_task = task-dd_data; + struct iser_tx_desc *tx_desc = iser_task-desc; + + struct iscsi_iser_conn *iser_conn = task-conn-dd_data; + struct iser_device *device= iser_conn-ib_conn-device; + + ib_dma_unmap_single(device-ib_device, + tx_desc-dma_addr, ISER_HEADERS_LEN, DMA_TO_DEVICE); /* mgmt tasks do not need special cleanup */ if (!task-sc) diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index 2f02ab0..db7ea37 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -45,6 +45,7 @@ #include scsi/libiscsi.h #include scsi/scsi_transport_iscsi.h +#include linux/interrupt.h #include linux/wait.h #include linux/sched.h #include linux/list.h @@ -88,7 +89,7 @@ } while (0) #define SHIFT_4K 12 -#define SIZE_4K (1UL SHIFT_4K) +#define SIZE_4K (1ULL SHIFT_4K) #define MASK_4K (~(SIZE_4K-1)) /* support up to 512KB in one RDMA */ @@ -256,7 +257,8 @@ struct iser_conn { struct list_head conn_list; /* entry in ig conn list */ char *login_buf; - u64 login_dma; + char *login_req_buf, *login_resp_buf; + u64 login_req_dma, login_resp_dma; unsigned int rx_desc_head; struct iser_rx_desc *rx_descs; struct ib_recv_wr rx_wr[ISER_MIN_POSTED_RX]; @@ -276,7 +278,6 @@ struct iscsi_iser_task { struct iser_regd_buf rdma_regd[ISER_DIRS_NUM];/* regd rdma buf */ struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data des*/ struct iser_data_buf data_copy[ISER_DIRS_NUM];/* contig. copy */ - int headers_initialized; }; struct iser_page_vec { diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index 95a08a8..b53b04d 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -221,8 +221,14 @@ void iser_free_rx_descriptors(struct iser_conn *ib_conn) struct iser_device *device = ib_conn-device; if (ib_conn-login_buf) { - ib_dma_unmap_single(device-ib_device, ib_conn-login_dma, - ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE); + if (ib_conn-login_req_dma) + ib_dma_unmap_single(device-ib_device, +ib_conn-login_req_dma, +ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE); + if (ib_conn-login_resp_dma) + ib_dma_unmap_single(device-ib_device, +ib_conn-login_resp_dma, +ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE); kfree(ib_conn-login_buf); } @@ -394,6 +400,7 @@ int iser_send_control(struct iscsi_conn *conn, unsigned long data_seg_len; int err = 0; struct iser_device *device; + struct iser_conn *ib_conn = iser_conn-ib_conn; /* build the tx desc regd header and add it to the tx desc dto */ mdesc-type =
Solved: IB/iSER problems with Linux 3.0
On 19/01/12 13:18, Or Gerlitz wrote: [...] Or Gerlitz (4): IB/iser: Fix wrong mask when sizeof (dma_addr_t) sizeof (unsigned long) IB/iser: Support iSCSI PDU padding IB/iser: Use separate buffers for the login request/response IB/iser: DMA unmap TX bufs used for iSCSI/iSER headers [...] could you try only the four of them on top of 3.0.15 and then if it works okay, find out which one of them does the job? I've applied them one by one and at the following commit it worked: IB/iser: Use separate buffers for the login request/response Then, I've tried to apply that commit only to 3.0.15, but automatic cherry-picking failed. I had to apply the following commit first: IB/iser: Support iSCSI PDU padding So, these two commits are the winners, for our Solaris 11 COMSTAR targets. Without them I always had to reboot the system because it wasn't possible to logout or to unload the ib_iser module. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER problems with Linux 3.0
On 16/01/12 22:16, Or Gerlitz wrote: Sebastian, I asked for the **iser** (ib_iser) and not mlx4_core debug_level=2 Yes, I did! I've enabled that additionally. And I've checked these settings in /sys/module/*/parameters. They were set. The libiscsi from OFED had only the option debug_libiscsi but this was too verbose, so this was the only thing I didn't activate there. 1. yes, the logs (correct ones, please!) from success login on the very same kernel would help Yes, I've sent you the correct logs. The only difference is: 1. in-tree vs. ofa-kernel-modules from OFED-1.5.4 2. open-iscsi 2.0.872 vs. open-iscsi 2.0.869 from OFED In the log from working iSER there is the RDMA mapping debug message at the position of the error in the other log. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER major problems with Linux 3.0 and Solaris targets
On 12/01/12 17:14, Or Gerlitz wrote: you didn't send the kernel logs from the failure after opening the iser (debug_level=2) and libiscsi (debug_libiscsi_session=1 debug_libiscsi_conn=1) debug prints OK, I've also set mlx4_core debug_level=2 and have verified in /sys/module that the parameters are really set. Please find attached the relevant part of the kernel log while login attempt. I've also attached the log from the working stuff from OFED-1.5.4 for a compare and the settings from discovery. I'll try to backport the IB + iSCSI kernel code from 3.2.1 to 3.0.15 next. Cheers, Sebastian iser_dbg_dmesg.log.gz Description: GNU Zip compressed data iser_ofa_dbg_dmesg.log.gz Description: GNU Zip compressed data node.name = iqn.2010-03.com.profitbricks:cloud:customers:storage200 node.tpgt = 2 node.startup = manual iface.hwaddress = default iface.iscsi_ifacename = iser1 iface.net_ifacename = ib1. iface.transport_name = iser node.discovery_address = 10.1.24.204 node.discovery_port = 3260 node.discovery_type = send_targets node.session.initial_cmdsn = 0 node.session.initial_login_retry_max = 4 node.session.cmds_max = 128 node.session.queue_depth = 32 node.session.auth.authmethod = None node.session.timeo.replacement_timeout = 30 node.session.err_timeo.abort_timeout = 15 node.session.err_timeo.lu_reset_timeout = 20 node.session.err_timeo.host_reset_timeout = 60 node.session.iscsi.FastAbort = Yes node.session.iscsi.InitialR2T = No node.session.iscsi.ImmediateData = Yes node.session.iscsi.FirstBurstLength = 262144 node.session.iscsi.MaxBurstLength = 16776192 node.session.iscsi.DefaultTime2Retain = 0 node.session.iscsi.DefaultTime2Wait = 2 node.session.iscsi.MaxConnections = 1 node.session.iscsi.MaxOutstandingR2T = 1 node.session.iscsi.ERL = 0 node.conn[0].address = 10.1.24.204 node.conn[0].port = 3260 node.conn[0].startup = manual node.conn[0].tcp.window_size = 524288 node.conn[0].tcp.type_of_service = 0 node.conn[0].timeo.logout_timeout = 15 node.conn[0].timeo.login_timeout = 15 node.conn[0].timeo.auth_timeout = 45 node.conn[0].timeo.noop_out_interval = 5 node.conn[0].timeo.noop_out_timeout = 5 node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072 node.conn[0].iscsi.HeaderDigest = None node.conn[0].iscsi.DataDigest = None node.conn[0].iscsi.IFMarker = No node.conn[0].iscsi.OFMarker = No
Re: IB/iSER major problems with Linux 3.0 and Solaris targets
On 12/01/12 10:29, Or Gerlitz wrote: If you have build the kernel IB user space support (uverbs) and the IB libs, do ibv_devinfo if not, just ossi cat /sys/class/infiniband/mlx4_0/* and send the output. To be clear, iser does work for you on the productive servers but not on this server? Yes, we've got consistent OFED-1.5.4 user-space. ibv_devinfo reports a mismatch between the kernel and the userspace libraries - kernel does not support XRC.. ibverbs-driver-mlx4 is at version 1.0.1-1.20.g6771d22 and libibverbs is at version 1.1.4-1.24.gb89d4d7. But O.K., the other method shows firmware version 2.9.1000. iSER only works on productive servers, because we use the OFA kernel modules from OFED for them at the moment (with 3.0 ported *iscsi* drivers). But there the IPoIB traffic is too slow for us. We connect customer VMs with IPv6 between different servers via IB. And yes, we could also test kernel 3.2 on our iSER test server. Regards, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER major problems with Linux 3.0 and Solaris targets
On 12/01/12 11:16, Sebastian Riemer wrote: On 12/01/12 10:29, Or Gerlitz wrote: If you have build the kernel IB user space support (uverbs) and the IB libs, do ibv_devinfo if not, just ossi cat /sys/class/infiniband/mlx4_0/* and send the output. To be clear, iser does work for you on the productive servers but not on this server? Yes, we've got consistent OFED-1.5.4 user-space. ibv_devinfo reports a mismatch between the kernel and the userspace libraries - kernel does not support XRC.. ibverbs-driver-mlx4 is at version 1.0.1-1.20.g6771d22 and libibverbs is at version 1.1.4-1.24.gb89d4d7. But O.K., the other method shows firmware version 2.9.1000. I've found out that we have two single port MHQH19B-XTR InfiniBand HCAs. lspci output: 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0) 04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0) The first one is ib1. And the second is ib0. /sys/devices/pci:00/:00:0c.0/:03:00.0/net/ib1 /sys/devices/pci:00/:00:0b.0/:04:00.0/net/ib0 The iSER traffic is on ib1 (the HCA which reported the error) and ib0 is for IPoIB traffic. I don't know if the mlx4 driver has a problem with that hardware config. Here is the requested data: mlx4_0: board_id MT_0D90110009 fw_ver 2.9.1000 hca_type MT26428 hw_rev b0 node_desc pserver214 HCA-1 (mlx4_0 - MT26428) node_guid 0002:c903:000f:5f76 node_type 1: CA sys_image_guid 0002:c903:000f:5f79 uevent NAME=mlx4_0 mlx4_1: board_id MT_0D90110009 fw_ver 2.9.1000 hca_type MT26428 hw_rev b0 node_desc pserver214 HCA-2 (mlx4_1 - MT26428) node_guid 0002:c903:000f:5f26 node_type 1: CA sys_image_guid 0002:c903:000f:5f29 uevent NAME=mlx4_1 Both are connected to the storage but in different subnets and without multipathing. How do I find out if ib1 is on mlx4_1 or mlx4_0? Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IB/iSER major problems with Linux 3.0 and Solaris targets
Hi list, hi Or, at the moment we use our kernel 3.0 ported OFA kernel modules from OFED-1.5.4 to make iSER work together with our kernel 3.0 ported open-iscsi 2.0.869 and Solaris 11 COMSTAR targets. Now, we've got the problem that ib_iser from in there ignores its debug_level parameter. /sys/module/ib_iser/parameters/debug_level shows 0 but it spams the whole kernel log with debug messages. Seems to be a code bug. We've also tested a 3.0.15 mainline kernel with in-tree IB modules together with the OFED-1.5.4 user-space and this has much better IPoIB performance than the kernel stuff from OFED. So, we want to use them instead, but there is the same problem with the iSER debug messages and iSER doesn't work together with our Solaris 11 COMSTAR targets. We've tested this with open-iscsi (2.0.872, commit 4323e342d2c9fb8ed7233ce855001c189ec55b23) user-space. TCP is O.K. but iSER reports an error while login attempt: iscsiadm: initiator reported error (11 - iSCSI PDU timed out) The sent PDUs from iscsid debugging are the same, but there is an IO page fault in the kernel log. I've attached the relevant part and the iscsid log. This looks interesting: iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57 Or, could you please investigate/explain? It is a pain that we need both: working iSER and IPoIB traffic with good performance. Cheers, Sebastian On 19/12/11 10:14, Sebastian Riemer wrote: Hi list, I've already sent this to the open-iscsi mailing list, but I guess this is more relevant for linux-rdma. Finally I've got IB/iSER running on Debian Squeeze with Linux kernel 3.0 smoothly. The problem was that we did not have the suitable OFED for our kernel and we did not use the open-iscsi from OFED. Kernel 3.0 is supported since OFED-1.5.4 from 2011-12-05. So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't build ib_iser in the OFA kernel source and that they don't build the open-iscsi kernel/user-space code - I made it do so. The next problem was that open-iscsi kernel code in OFED-1.5.4 is for = 2.6.32 based RedHat distributions. I had to port the source from 2.6.30 to 3.0 due to kernel API changes. OFA even forgot libiscsi_tcp.[ch] in OFED-1.5.4. So, I had to import it from 2.6.30 mainline. I did so, because we wanted to compare TCP and iSER speed over InfiniBand. Our Solaris COMSTAR targets provide both. After fixing the kernel, there was still a problem in the open-iscsi 2.0.869 user-space from OFED. Some sysfs magic has changed - so that the iSCSI host number couldn't be found. After fixing that, it worked for me. Cheers, Sebastian Jan 11 12:53:25 pserver214 kernel: [ 716.518372] SCSI subsystem initialized Jan 11 12:53:25 pserver214 kernel: [ 716.521146] Loading iSCSI transport class v2.0-870. Jan 11 12:53:25 pserver214 kernel: [ 716.528756] iscsi: registered transport (tcp) Jan 11 12:53:30 pserver214 kernel: [ 721.903544] iscsi: registered transport (iser) Jan 11 12:54:46 pserver214 kernel: [ 797.537439] iser: iser_connect:connecting to: 10.1.24.204, port 0xbc0c Jan 11 12:54:46 pserver214 kernel: [ 797.563158] iser: iser_cma_handler:event 0 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.566402] iser: iser_cma_handler:event 2 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.579704] iser: iser_create_ib_conn_res:setting conn 880807b17a80 cma_id 880807594400: fmr_pool 88082426b400 qp 8807ed22aa00 Jan 11 12:54:46 pserver214 kernel: [ 797.586557] iser: iser_cma_handler:event 9 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.787932] iser: iscsi_iser_ep_poll:ib conn 880807b17a80 rc = 1 Jan 11 12:54:46 pserver214 kernel: [ 797.788137] scsi0 : iSCSI Initiator over iSER, v.0.1 Jan 11 12:54:46 pserver214 kernel: [ 797.794249] iser: iscsi_iser_conn_bind:binding iscsi/iser conn 8808058deab8 8808058decc8 to ib_conn 880807b17a80 Jan 11 12:54:46 pserver214 kernel: [ 797.794710] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488000 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.794919] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488200 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.794998] iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57 Jan 11 12:54:46 pserver214 kernel: [ 797.795006] connection1:0: detected conn error (1011) Jan 11 12:54:46 pserver214 kernel: [ 797.795338] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488100 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.795535] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488040 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.795730] AMD-Vi
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
you wrote long emails, I'm asking for one concrete example for that enum crunching of adding entries not at the end, can you, please? I've meant e.g. the iscsi tasks in libiscsi.h between 2.6.30 and 2.6.32. But I've meant this for OFED and not the mainline kernel. 2.6.30: enum { ISCSI_TASK_COMPLETED, ISCSI_TASK_PENDING, ISCSI_TASK_RUNNING, }; 2.6.32: enum { ISCSI_TASK_FREE, ISCSI_TASK_COMPLETED, ISCSI_TASK_PENDING, ISCSI_TASK_RUNNING, ISCSI_TASK_ABRT_TMF,/* aborted due to TMF */ ISCSI_TASK_ABRT_SESS_RECOV, /* aborted due to session recovery */ }; I want to double check I'm with you - so when you said that iser didn't work e.g TCP worked very well. I've also updated from git to latest 2.0.872 (latest change 2011-11-01) for testing. TCP always worked and iSER was always unusable. you actually wanted to say iser from ofed and not iser from this or that upstream kernel? I've tried both. The iser from OFED oopsed (because it is 2.6.30 based - didn't match the 3.0 open-iscsi in-tree) and everything from upstream kernel 3.0.4 was pretty unstable (mentioned connection aborts after 5s). And I guess because of the OFED-1.4 user-space from Squeeze the IB connection was that unstable. The OFED user-space must match the kernel code of cause. Before I took over the kernel maintaining at ProfitBricks only few knew about that problem in the company. So, I thought making everything OFED-1.5.4 is the right approach of doing that. as life, mainline isn't perfect, but it doesn't say that ofed is perfect nor that by any bit its better then mainline, you may know and if you don't here are the news: the ofa community has to decided to stop producing ofed in the way it was done over the years, namely from now (Jan 2012) and onward, ofed will be only backports provided from mainline, no additions, so this false betterness claim can't even be stated anymore. Now, even this backporting only new mode has to be defined - since for example, is the iscsi case... except for iser, ofed will not provide the iscsi modules nor tools, so its not clear/how trivial/for someone takes (say) iser from 3.2 and backport it to (say) 3.6.35 in manner that it will be operable with 3.6.35 and unknown version of the tools. As I wrote, I like that new approach. If OFED-3.2 will match mainline 3.2 this would be great, but then you'll also have to provide the open-iscsi user-space which you've used for testing in there. Or/and can't you just provide a list of tools, OFA user-space etc. which you've tested (e.g. like that BUILD_ID file in OFED)? I really hope that this makes things better/easier for InfiniBand and iSER users. I'm looking forward to test that. My question was more like: How was it tried to ensure to match kernel and user-space code right now? and I did not want to read the developer's favorite With the next release everything will be better. ;-) Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
2011/12/21 Or Gerlitz ogerl...@mellanox.com: I tested the upstream kernel iser against the upstream iscsi tools from git://github.com/mikechristie/open-iscsi (commit 4323e342d2c9fb8ed7233ce855001c189ec55b23), it works To bring this to an end: I believe you. Most likely I had that much trouble because of the OFED-1.4 user-space which did not match the kernel code. As you don't answer my question how I can find out what's the matching user-space for a given upstream kernel - I will just use what I have now (should work like on RedHat) and will be looking forward to the new OFED release approach. Thanks for taking the time to make things clear to me. Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
2011/12/20 Or Gerlitz ogerl...@mellanox.com: Beep, I'd like to better/understand the problem before looking on your struggle for solution... I understand that your Debian system runs kernel 3.0 - however, you didn't say what version of the iscsi initiator utils is provided with that distro nor what were the problems to make it work/well with iser, could you elaborate on that? Or. Ah, O.K. - I wrote that on the open-iscsi list. Debian Squeeze (in general 2.6.32 based) comes with open-iscsi 2.0.871.3-2squeeze1. We've used that version together with the in-tree mainline kernel 3.0 OFA kernel modules and Debian Squeeze OFED-1.4 user-space. But there were lots of iSER connection aborts (and even log-outs) after only 5s connection loss instead of 120s node.session.timeo.replacement_timeout. The many connection losses where also caused by the missing of a suitable OFED. After installation of the OFA kernel modules (without open-iscsi modules) from OFED-1.5.4 the kernel had oopses in ib_iser. Therefore, the suitable open-iscsi code had to be found (in OFED). And due to the fact that it didn't support 3.0 kernels it also had to be ported. There where many ABI and API changes in mainline open-iscsi kernel code between 2.6.30 and 3.0. I've fixed the following kernel API changes in the open-iscsi code from OFA kernel source from OFED-1.5.4: - kfifo API = 2.6.33 - scsi_host API = 2.6.33 - scsi_host API = 2.6.37 Before that I've added the code and compilation of libiscsi_tcp from 2.6.30. After stress testing the storage on a test machine with that fixed OFED + iSER all other machines on that IB switch had IB connection losses. So, we decided to roll out OFED-1.5.4 with fixed open-iscsi code to all machines in our data center. And this works very well, now. General network performance also doubled up. Btw.: We need such a new kernel because of some cool virtualization, cgroups and performance features. I wrote the mail in this mailing list in order to show that open-iscsi in OFED-1.5.4 isn't suitable for 3.0 kernel, that libiscsi_tcp is missing and that we at ProfitBricks have a good test case with our IaaS Cloud Computing. Btw.: I like the proposed approach for new OFED releases. Version checks in the code between kernel and user-space (like DRBD does) would be great. Cheers, Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
2011/12/20 Or Gerlitz ogerl...@mellanox.com: Beep(2), so your system has distro which is based on kernel 2.6.32 and iscsi initiator tools version 2.0.871 and per your needs, you've booted it with kernel 3.0 . At this point should you have stop and make sure that this combo works, iscsi wise (simpler to test iscsi/tcp... no need in rdma knowledge), did you do this validation? Yes, of cause I did - TCP worked very well. I've also updated from git to latest 2.0.872 (latest change 2011-11-01) for testing. TCP always worked and iSER was always unusable. If this combo isn't working, you have to update the iscsi tools to a version which supports your kernel. Yes, I even took the max. 2.6.35 supported kernel stuff from open-iscsi git and have changed the Makefile to make it compile against my kernel. With that I ensured that open-iscsi kernel code and user-space match. TCP O.K. - no iSER with that! If this combo is working iscsi/tcp wise but not ib_iser wise, its seems to be a bug, and I'd be happy to help with finding the root cause, etc. Have you ever developed/tested the ib_iser module for/with 2.6.30 kernels? I've seen that there were lots of changes in the whole open-iscsi kernel stack between 2.6.30 and 2.6.32. The whole ABI has changed in libiscsi. They added stuff e.g. at the first position in enums. If ib_iser isn't aware of such changes lots of crap can happen. ...and happened to me while testing by the way. But this way or another, OFA isn't an iscsi tools factory, nor have anyone that can/want to support iscsi tools, we (folks from the rdma vendors community that deal with iscsi) are working with the upstream iscsi maintainer to address iscsi issues. The fact that OFA ships iscsi code except for ib_iser/cxgb4i/etc modules is a bug, BTW, I'll act to change that. ib_iser has tight dependencies to open-iscsi code (see attached). In my opinion an ib_iser developer should work that tight together with the open-iscsi guys. They should inform you about all ABI and API changes in libiscsi so that you can react on that. As an user of IB/iSER it is really confusing that this isn't the case and that OFA only provides 2.6.30 based open-iscsi stuff which only works for iSER due to missing libiscsi_tcp in there. At least this could be fixed easily. Now I know why everybody has his own OFED - so do we now. E.g. the QLogic stuff isn't even compilable without the QLogic OFED, because they only put their patches in there. Luckily, we have only Mellanox HCAs in our productive environment. Would it help, if we provide our patches for open-iscsi and IB/iSER 2.6.32 to bring that into mainline OFED? Sebastian attachment: open-iscsi.png
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
Would it help, if we provide our patches for open-iscsi and IB/iSER 2.6.32 to bring that into mainline OFED? As Or notes, OFED is providing the kernel modules more than the iscsi code drop. Would be better for all (cough cough) to push changes back to the iscsi initiator maintainer (Mike Christie I think). No, I don't think so. Mike's open-iscsi works for many kernels and he provides libiscsi. OFA decided to make open-iscsi and ib_iser 2.6.30 based in OFED-1.5.4, but they say that they also support 3.0 in OFED. Now, they could provide two more open-iscsi and ib_iser versions (=2.6.33, =2.6.37) or just patch the version they already have. API changes are no magic and everyone can see what they do in the mainline if the API changes. Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER with Linux 3.0 and Debian: Lesson learned
2011/12/20 Or Gerlitz or.gerl...@gmail.com: horses, please, stay at home, or at least run a little bit slower, just for you - from 2 minutes ago - iser works well with 3.2.0-rc5 (its say -dirty b/c its a development system and the kernel has some patches, but not iser ones) and iscsi-initiator-utils of 6.2.0.872-21.el6, I will try tomorrow with upstream iscsi-initiator-utils and see if there's a problem there. O.K., sorry - I just wanted to know how this is developed. Even if the kernel code matches 100%, how do I know which is the matching tested user-space code for that? I don't use ofed at all, work only with upstream or distro code. AFAIK, the upstream kernel is functional with iser at all times, again, I will do the validation with the iscsi tools and if the upstream iscsi tools aren't functional with the upstream iser code but are functional with the upstream iscsi/tcp code, we will (the iscsi maintainer and myself) fix that, and thanks for this possible heads up. Thanks for that info. In OFED I thought I would have everything needed and matching together. Due to the new kernel we have our own distribution extends, packet repo, build server,... So, for OFED we're the distribution. no way, OFA isn't iscsi factory, we can't support the iscsi kernel modules except for iser, nor any of the iscsi user space tools - and its a historic bug that someone with wrong ambitions added iscsi modules/tools info the ofa stack. The OFA stack should be compatible with the kernel/distro it is running on. As you can see in the maintainers file, I act as the iser maintainer, and I do work closely with the iscsi maintainer, maybe should work closer if indeed you stepped on a problem with the upstream iscsi tools, as for the iscsi tools provided with debian, I am not sure what was the problem, send me tgz with the sources to my @mellanox address and I can try look on that. Perhaps, I really stepped on a rare case where this was broken. As I've already asked: How do I find the matching, tested open-iscsi and OFA user-space code for a mainline kernel? Sorry again, I didn't want provoke. I just want to understand and make things work. Sebastian -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IB/iSER with Linux 3.0 and Debian: Lesson learned
Hi list, I've already sent this to the open-iscsi mailing list, but I guess this is more relevant for linux-rdma. Finally I've got IB/iSER running on Debian Squeeze with Linux kernel 3.0 smoothly. The problem was that we did not have the suitable OFED for our kernel and we did not use the open-iscsi from OFED. Kernel 3.0 is supported since OFED-1.5.4 from 2011-12-05. So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't build ib_iser in the OFA kernel source and that they don't build the open-iscsi kernel/user-space code - I made it do so. The next problem was that open-iscsi kernel code in OFED-1.5.4 is for = 2.6.32 based RedHat distributions. I had to port the source from 2.6.30 to 3.0 due to kernel API changes. OFA even forgot libiscsi_tcp.[ch] in OFED-1.5.4. So, I had to import it from 2.6.30 mainline. I did so, because we wanted to compare TCP and iSER speed over InfiniBand. Our Solaris COMSTAR targets provide both. After fixing the kernel, there was still a problem in the open-iscsi 2.0.869 user-space from OFED. Some sysfs magic has changed - so that the iSCSI host number couldn't be found. After fixing that, it worked for me. Cheers, Sebastian -- Sebastian Riemer Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 10405 Berlin, Germany Tel.: +49 - 30 - 51 64 09 20 Fax: +49 - 30 - 51 64 09 22 Email: sebastian.rie...@profitbricks.com Web: http://www.profitbricks.com/ Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Andreas Gauger, Achim Weiss -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html