Re: [PATCH] opensm: Fix sl2vl configuration
Hal Rosenstock wrote: Hi Eli, On Mon, Aug 23, 2010 at 3:37 PM, Eli Dorfman dorfman@gmail.com wrote: Hi Hal, On Mon, Aug 23, 2010 at 5:25 PM, Hal Rosenstock hal.rosenst...@gmail.com wrote: On Wed, Jul 28, 2010 at 12:26 PM, Eli Dorfman (Voltaire) dorfman@gmail.com wrote: Subject: [PATCH] Fix sl2vl configuration For non-optimized sl2vl configuration in and out ports were reversed. Are you sure these are reversed ? Any idea which commit introduced this reversal of in and out ports ? I'm sure they are reversed. I looked at it some more and the ports look reversed to me too. This patch was also tested by Jim Schutt. That only means it works in his environment rather than correctness. It was tested in mine enviornment as well. Usually I test the patch in my environment to verify its correctness (in addition to code review). I assume you do the same. Do you expect me to test the patch in your environment as well? I didn't check which commit introduced this - why is it important? I'd like to understand which patch introduced the reversal. I don't see it but might have missed it. I think it's important to know which versions are broken. I think that the following commit added the bug: commit 051a1dd514e63f3a971afad38e377932efca5e18 Author: Sasha Khapyorsky sas...@voltaire.com Date: Mon Jan 4 21:06:19 2010 +0200 opensm/osm_qos.c: split switch external and end ports setup This splits QoS related port parameters setup over switch external ports and end ports. Such separation leaves us with simpler code and saves some repeated flows in case of switch external ports (actually required per switch and not per port). Another advantages: Optimized Sl2VL Mapping procedure can be implemented easier using this model. A low level port QoS related parameters setup infrastructure is ready for supporting per port QoS related configuration (which hopefully will be implemented some days). For optimal sl2vl added override of default ALL settting with port's sl2vl when operational VL was other than the default port. What is the motivation to override when the operational VLs is different ? Why is that better than what is done currently ? The idea was to apply the default policy - set sl2vl modulo operational VL. When applying ALL settings using port 1 we still want to override this setting for ports with different operational VL. What makes the default policy modulo operational VLs ? This is how its implemented in sl2vl_update_table() vl_mask = (1 (ib_port_info_get_op_vls(p-port_info) - 1)) - 1; for (i = 0; i IB_MAX_NUM_VLS / 2; i++) { vl1 = sl2vl_table-raw_vl_by_sl[i] 4; vl2 = sl2vl_table-raw_vl_by_sl[i] 0xf; if (vl1 != 15) vl1 = vl_mask; if (vl2 != 15) vl2 = vl_mask; Default startup switches configuration uses the same policy. Is this really a policy issue ? IMO there are two separate issues in this patch and they should be in separate patches (for better git bisection). Maybe, but I still think this patch fixes wrong setting for sl2vl using optimized and non optimized methods. I'm not sure this should be divided to 2 separate patches. It's one thought per patch and to me this is two different thoughts. If this is the only issue, I can split this patch to 2 separate patches. Eli Also, a couple of (possibly related) questions below. It seems that you are referring to patch v1 which was modified according to Jim's comments. Please check the v2 patch . I see my questions are moot in terms of the v2 patch. -- Hal Thanks, Eli Signed-off-by: Eli Dorfman e...@voltaire.com --- opensm/opensm/osm_qos.c | 25 - 1 files changed, 20 insertions(+), 5 deletions(-) diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index a571370..de0ae23 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -182,7 +182,7 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p, tbl.raw_vl_by_sl[i] = (vl1 4) | vl2; } - if (!force_update (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) + if (!force_update in_port (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) !memcmp(p_tbl, tbl, sizeof(tbl))) return IB_SUCCESS; Why exclude port 0 here ? Is it related to the change noted below ? @@ -209,6 +209,7 @@ static int qos_extports_setup(osm_sm_t * sm, osm_node_t *node, unsigned num_ports = osm_node_get_num_physp(node); int ret = 0; unsigned i, j; + uint8_t op_vl1; for (i = 1; i num_ports; i++) { p = osm_node_get_physp_ptr(node, i); @@ -225,17 +226,31 @@ static int qos_extports_setup(osm_sm_t * sm, osm_node_t *node, if (ib_switch_info_get_opt_sl2vlmapping(node-sw-switch_info)
srp target - write to raw block device like /dev/sda ist fragmented into very small requests
Hello, we employ Infiniband SRP storage. For basic tests I just did write/read with dd to the raw block device like /dev/sda. I tell dd to use 1M sized blocks, but I see very small few KB sized requests only. To make this tests of real value I would need to read and write with large 1MB sized requests. More important I'm not sure whether this points to some misconfiguration. Example: Read 1M sized blocks from /dev/sdh - iostat shows 8*512byte average request size, which is very small, I would expect 1M sized requests: # dd if=/dev/sdh of=/dev/null bs=1M # iostat -xk /dev/sdh 2 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdh 0.00 0.00 8253.23 0.00 33012.94 0.00 8.00 0.900.11 0.11 89.95 Performance counters on the srp target show the same small request size. Is this expected behaviour if I write or read from/to a srp block device file? Does anything limit the request size if I access the storage via the block device file like /dev/sdh? Using a mounted filesystem on top of the block devices I get perfectly 1M sized i/o requests, hence I think that the storage (DDN S2A9900) an the settings for the block device are o.k.. We currently run Scientific Linux 5.4 with a rebuild vendor kernel 2.6.18-194.8.1.el5 and the included ofed 1.3. I know that this is no plain vanilla kernel.org kernel and no very recent OFED version. Kind regards, Heiner Billich Some more details # ibstat CA 'mlx4_0' CA type: MT26428 Number of ports: 2 Firmware version: 2.7.100 Hardware version: b0 Node GUID: 0x0002c90300088dc6 System image GUID: 0x0002c90300088dc9 Port 1: State: Active Physical state: LinkUp Rate: 40 Base lid: 14 LMC: 0 SM lid: 1 Capability mask: 0x02510868 Port GUID: 0x0002c90300088dc7 ib_srp parameters: mellanox_workarounds 1 srp_dev_loss_tmo 60 srp_sg_tablesize 255 topspin_workarounds 1 modinfo ib_srp filename: /lib/modules/2.6.18-194.8.1.el5/kernel/drivers/infiniband/ulp/srp/ib_srp .ko license:Dual BSD/GPL description:InfiniBand SCSI RDMA Protocol initiator v0.2 (November 1, 2005) author: Roland Dreier srcversion: DCFC6574D959F8EC7B4AE72 depends:ib_core,scsi_mod,ib_cm,ib_sa vermagic: 2.6.18-194.8.1.el5 SMP mod_unload gcc-4.1 parm: srp_sg_tablesize:Max number of gather/scatter entries per I/O (default is 12, max 255) (int) parm: topspin_workarounds:Enable workarounds for Topspin/Cisco SRP target bugs if != 0 (int) parm: mellanox_workarounds:Enable workarounds for Mellanox SRP target bugs if != 0 (int) parm: srp_dev_loss_tmo:Default number of seconds that srp transport should insulate the lost of a remote port (default is 60 secs (int) module_sig: 883f3504c2cf87cab955e73ffd495d2112149509f45aa2bcee8c81463e247bd36d55c5dd bc2634d509e2b3eb9c09799f2ea4e951c973d0fe3725ddbc -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IPoIB CM connection establishment protocol question
Hi stan, Here's a short summary: An IPoIB interface encodes it capability of supporting CM mode in the hardware adress it publishes. Once the stack hands an SKB to IPoIB for transmission (unicast), IPoIB needs to resolve the hardware address to IB adress information (LID, SL etc.). Once resolution is complete, it will create a RC connection with the other host using IB CM. This process is symetrical - that is, for two communicating hosts there are two RC connections so that each connection works half duplex. Hope that helps. On Thu, Aug 05, 2010 at 10:25:27AM -0700, Smith, Stan wrote: Could someone who is OFED IPoIB knowledgeable help me understand which IPoIB side starts the RC connection establishment (Tx REQ) to an ARP unknown remote host? I tried reading the Linux IPoIB code and realized it's been too many years since writing Linux network drivers (2001) and found myself lost in the bits w.r.t. CM connection mgmt. and ARP handling. From limited observation, it looks like the ACTIVE side (ttcp -t) of an IPoIB network connection broadcasts an IB ARP-REQ which indicates CM capability in the IB ARP flags byte. When the PASSIVE IPoIB (ttcp -r) receives the IB ARP-REQ, it sends back an IB ARP-REPLY which indicates it's CM capability among other items. Now the confusing part: which IPoIB CM side starts the CM-RC connection establishment? Does the PASSIVE IPoIB start the RC connection (Tx REQ) prior to sending the IB ARP-REPLY? Or does the ACTIVE IPoIB side receive the IB ARP-REPLY and then start an ACTIVE RC connection (Tx REQ) now that it knows the remote side is CM capable? thanks, stan. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: IPoIB: Broken IGMP processing
On Mon, 23 Aug 2010, Yossi Etigin wrote: Simplest then is to check if byte 24 of the packet is 0xff. (ie IN6_IS_ADDR_MULTICAST) No need to worry about if it is properly formed or anything, if it is a multicast DGID then it is a multicast packet at the link level. Sounds good to me Therefore this patch? Subject: [IB] Make igmp processing work with IPOIB IGMP processing is broken because the IPOIB does not set the skb-pkt_type the right way for Multicast traffic. All incoming packets are set to PACKET_HOST which means that the igmp_recv() function will ignore the IGMP broadcasts/multicasts. This in turn means that the IGMP timers are firing and are sending information about multicast subscriptions unnecessarily. In a large private network this can cause traffic spikes. Signed-off-by: Christoph Lameter c...@linux.com --- drivers/infiniband/ulp/ipoib/ipoib_ib.c |7 +-- include/linux/in6.h |3 +++ 2 files changed, 8 insertions(+), 2 deletions(-) Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c === --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2010-08-23 16:04:38.0 -0500 +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2010-08-25 09:43:01.0 -0500 @@ -281,8 +281,11 @@ static void ipoib_ib_handle_rx_wc(struct dev-stats.rx_bytes += skb-len; skb-dev = dev; - /* XXX get correct PACKET_ type here */ - skb-pkt_type = PACKET_HOST; + if (IN6_IS_ADDR_MULTICAST(skb-head + 24)) + + skb-pkt_type = PACKET_MULTICAST; + else + skb-pkt_type = PACKET_HOST; if (test_bit(IPOIB_FLAG_CSUM, priv-flags) likely(wc-csum_ok)) skb-ip_summed = CHECKSUM_UNNECESSARY; Index: linux-2.6/include/linux/in6.h === --- linux-2.6.orig/include/linux/in6.h 2010-08-25 09:39:40.0 -0500 +++ linux-2.6/include/linux/in6.h 2010-08-25 09:40:22.0 -0500 @@ -53,6 +53,9 @@ extern const struct in6_addr in6addr_lin extern const struct in6_addr in6addr_linklocal_allrouters; #define IN6ADDR_LINKLOCAL_ALLROUTERS_INIT \ { { { 0xff,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2 } } } + +#define IN6_IS_ADDR_MULTICAST(a) (((const __u8 *) (a))[0] == 0xff) + #endif struct sockaddr_in6 { -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] complib/cl_timer.c: fixing cl_timer calculation
On 15:01 Tue 24 Aug , Yevgeny Kliteynik wrote: BTW, any idea what are the (obviously historical) reasons for these lines in the code? 323: /* do not do 0 wait ! */ 324: /* if (delta_time 1000.0) {delta_time = 1000;} */ Have no idea. This is from day zero of git history. Sasha -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] complib/cl_timer.c: fixing cl_timer calculation
On 16:21 Tue 24 Aug , Yevgeny Kliteynik wrote: When calculating p_timer-timeout.tv_sec and p_timer-timeout.tv_nsec, the carry was ignored, resulting in wrong value in p_timer-timeout.tv_sec, and value 10^9 in p_timer-timeout.tv_nsec (illegal value). Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il Applied. Thanks. Tiny comment is below. --- V2: - using macros instead of int types - fixed the same problem in cl_timer_trim() opensm/complib/cl_timer.c | 48 ++-- 1 files changed, 24 insertions(+), 24 deletions(-) diff --git a/opensm/complib/cl_timer.c b/opensm/complib/cl_timer.c index 2acdb51..3b3c7b0 100644 --- a/opensm/complib/cl_timer.c +++ b/opensm/complib/cl_timer.c @@ -294,12 +294,32 @@ static cl_status_t __cl_timer_find(IN const cl_list_item_t * const p_list_item, return (CL_NOT_FOUND); } +/* + * Calculate 'struct timespec' value that is the + * current time plus the 'time_ms' milliseconds. + */ +static __inline void __cl_timer_calculate(IN const uint32_t time_ms, + OUT struct timespec * const p_timer) +{ + struct timeval curtime, deltatime, endtime; + +#ifndef timerclear +#define timerclear(tvp) (tvp)-tv_sec = (time_t)0, (tvp)-tv_usec = 0L +#endif + timerclear(curtime); + gettimeofday(curtime, NULL); I don't think that curtime initialization does something useful before gettimeofday() call. Sasha + + deltatime.tv_sec = time_ms / 1000; + deltatime.tv_usec = (time_ms % 1000) * 1000; + timeradd(curtime, deltatime, endtime); + p_timer-tv_sec = endtime.tv_sec; + p_timer-tv_nsec = endtime.tv_usec * 1000; +} + cl_status_t cl_timer_start(IN cl_timer_t * const p_timer, IN const uint32_t time_ms) { - struct timeval curtime; cl_list_item_t *p_list_item; - uint32_t delta_time = time_ms; CL_ASSERT(p_timer); CL_ASSERT(p_timer-state == CL_INITIALIZED); @@ -313,20 +333,7 @@ cl_status_t cl_timer_start(IN cl_timer_t * const p_timer, cl_qlist_remove_item(gp_timer_prov-queue, p_timer-list_item); - /* Get the current time */ -#ifndef timerclear -#define timerclear(tvp) (tvp)-tv_sec = (time_t)0, (tvp)-tv_usec = 0L -#endif - timerclear(curtime); - gettimeofday(curtime, NULL); - - /* do not do 0 wait ! */ - /* if (delta_time 1000.0) {delta_time = 1000;} */ - - /* Calculate the timeout. */ - p_timer-timeout.tv_sec = curtime.tv_sec + (delta_time / 1000); - p_timer-timeout.tv_nsec = - (curtime.tv_usec + ((delta_time % 1000) * 1000)) * 1000; + __cl_timer_calculate(time_ms, p_timer-timeout); /* Add the timer to the queue. */ if (cl_is_qlist_empty(gp_timer_prov-queue)) { @@ -385,7 +392,6 @@ void cl_timer_stop(IN cl_timer_t * const p_timer) cl_status_t cl_timer_trim(IN cl_timer_t * const p_timer, IN const uint32_t time_ms) { - struct timeval curtime; struct timespec newtime; cl_status_t status; @@ -394,13 +400,7 @@ cl_status_t cl_timer_trim(IN cl_timer_t * const p_timer, pthread_mutex_lock(gp_timer_prov-mutex); - /* Get the current time */ - timerclear(curtime); - gettimeofday(curtime, NULL); - - /* Calculate the timeout. */ - newtime.tv_sec = curtime.tv_sec + (time_ms / 1000); - newtime.tv_nsec = (curtime.tv_usec + ((time_ms % 1000) * 1000)) * 1000; + __cl_timer_calculate(time_ms, newtime); if (p_timer-timer_state == CL_TIMER_QUEUED) { /* If the old time is earlier, do not trim it. Just return. */ -- 1.6.2.4 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: opensm/osm_opensm.c: no report when SM is exiting
On 16:38 Wed 18 Aug , Yevgeny Kliteynik wrote: Don't bother reporting events to plug-ins while SM is exiting. Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il Applied. Thanks. Sasha -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5 v6] IB/srp: IB/srp: Implement SRP_CRED_REQ and SRP_AER_REQ
On Wed, Aug 25, 2010 at 4:48 AM, David Dillow d...@thedillows.org wrote: On Sun, 2010-08-22 at 18:56 +0200, Bart Van Assche wrote: [ ... ] static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target, - enum srp_request_type req_type) + enum srp_tx_iu_type iu_type) { - s32 rsv = (req_type == SRP_REQ_TASK_MGMT) ? 0 : SRP_TSK_MGMT_SQ_SIZE; + s32 rsv = (iu_type == SRP_IU_REQ_TSK_MGMT) ? 0 : SRP_TSK_MGMT_SQ_SIZE; You check for credits even if this is an SRP_IU_RSP, so we'll fail to allocate an IU if we're low on credits. Not a common condition, but possible within the spec. The code below was present in version five of the patch series and has been removed accidentally. I'll resend the patch series. + if (tx_iu_type != SRP_TX_IU_RSP target-req_lim = rsv) { + ++target-zero_req_lim; + return NULL; + } Bart. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ibnetdiscover: add '-f' flag to show full information (ports' speed and width).
On 09:45 Tue 24 Aug , Hal Rosenstock wrote: What about the flag? do we still need it if we pass the output after the comment? I wouldn't think so. I also think we've made commentary changes to the ibnetdiscover output format like this before. If we wanted to be absolutely sure it wouldn't break anything, we'd keep the flag though. It's up to Sasha. The '-f' flag could make sense since those information actually duplicates previous 4xDDR, etc. and used by ibsim's parser only. Also it would be nice to make ibsim's parser to understand both possibilities: s=2 and speed=2 (Or even to parse *x*DR strings :)) Sasha -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofw] [Patch] [Tools][infiniband-diags]
On 20:19 Sun 01 Aug , Sasha Khapyorsky wrote: On 14:36 Wed 28 Jul , Irena Kruchkovsky wrote: A patch that fixes the GUID output in ibstat to work correctly in windows 2003. Index: D:/Windows/MLNX_VPI/tools/infiniband-diags/src/ibstat.c === --- D:/Windows/MLNX_VPI/tools/infiniband-diags/src/ibstat.c (revision 6199) +++ D:/Windows/MLNX_VPI/tools/infiniband-diags/src/ibstat.c (revision 6200) @@ -72,9 +72,9 @@ printf(\tNumber of ports: %d\n, ca-numports); printf(\tFirmware version: %s\n, ca-fw_ver); printf(\tHardware version: %s\n, ca-hw_ver); - printf(\tNode GUID: 0x%016llx\n, + printf(\tNode GUID: 0x%016I64x\n, Normally we are using PRI* macros with management code. What about this: I've commited this couple of days ago: commit e4b73c2081a9afc346adf98d0f4d348005c920f6 Author: Sasha Khapyorsky sas...@voltaire.com Date: Sun Aug 1 20:26:11 2010 +0300 infiniband-diags/ibstat: convert to PRIx64 macros GUID printing Convert to using PRIx64 macros for printing GUID values. Pointed out by Irena Kruchkovsky. Signed-off-by: Sasha Khapyorsky sas...@voltaire.com diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index c44d8c4..f655a13 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -72,10 +72,9 @@ static void ca_dump(umad_ca_t * ca) printf(\tNumber of ports: %d\n, ca-numports); printf(\tFirmware version: %s\n, ca-fw_ver); printf(\tHardware version: %s\n, ca-hw_ver); - printf(\tNode GUID: 0x%016llx\n, - (long long unsigned)ntohll(ca-node_guid)); - printf(\tSystem image GUID: 0x%016llx\n, - (long long unsigned)ntohll(ca-system_guid)); + printf(\tNode GUID: 0x%016 PRIx64 \n, ntohll(ca-node_guid)); + printf(\tSystem image GUID: 0x%016 PRIx64 \n, + ntohll(ca-system_guid)); } static char *port_state_str[] = { @@ -122,8 +121,7 @@ static int port_dump(umad_port_t * port, int alone) printf(%sLMC: %d\n, pre, port-lmc); printf(%sSM lid: %d\n, pre, port-sm_lid); printf(%sCapability mask: 0x%08x\n, pre, ntohl(port-capmask)); - printf(%sPort GUID: 0x%016llx\n, pre, - (long long unsigned)ntohll(port-port_guid)); + printf(%sPort GUID: 0x%016 PRIx64 \n, pre, ntohll(port-port_guid)); printf(%sLink layer: %s\n, pre, port-link_layer); return 0; } @@ -182,8 +180,7 @@ static int ports_list(char names[][UMAD_CA_NAME_LEN], int n) for (i = 0; i found; i++) if (guids[i]) - printf(0x%016llx\n, - (long long unsigned)ntohll(guids[i])); + printf(0x%016 PRIx64 \n, ntohll(guids[i])); return found; } , hope this solves your issues too. Sasha -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] opensm: Fix sl2vl configuration
On Wed, Aug 25, 2010 at 4:43 AM, Eli Dorfman (Voltaire) dorfman@gmail.com wrote: Hal Rosenstock wrote: Hi Eli, On Mon, Aug 23, 2010 at 3:37 PM, Eli Dorfman dorfman@gmail.com wrote: Hi Hal, On Mon, Aug 23, 2010 at 5:25 PM, Hal Rosenstock hal.rosenst...@gmail.com wrote: On Wed, Jul 28, 2010 at 12:26 PM, Eli Dorfman (Voltaire) dorfman@gmail.com wrote: Subject: [PATCH] Fix sl2vl configuration For non-optimized sl2vl configuration in and out ports were reversed. Are you sure these are reversed ? Any idea which commit introduced this reversal of in and out ports ? I'm sure they are reversed. I looked at it some more and the ports look reversed to me too. This patch was also tested by Jim Schutt. That only means it works in his environment rather than correctness. It was tested in mine enviornment as well. Usually I test the patch in my environment to verify its correctness (in addition to code review). I assume you do the same. Do you expect me to test the patch in your environment as well? Huh ? I didn't check which commit introduced this - why is it important? I'd like to understand which patch introduced the reversal. I don't see it but might have missed it. I think it's important to know which versions are broken. I think that the following commit added the bug: commit 051a1dd514e63f3a971afad38e377932efca5e18 Author: Sasha Khapyorsky sas...@voltaire.com Date: Mon Jan 4 21:06:19 2010 +0200 opensm/osm_qos.c: split switch external and end ports setup This splits QoS related port parameters setup over switch external ports and end ports. Such separation leaves us with simpler code and saves some repeated flows in case of switch external ports (actually required per switch and not per port). Another advantages: Optimized Sl2VL Mapping procedure can be implemented easier using this model. A low level port QoS related parameters setup infrastructure is ready for supporting per port QoS related configuration (which hopefully will be implemented some days). I thought it might be that one but wasn't sure. Thanks. For optimal sl2vl added override of default ALL settting with port's sl2vl when operational VL was other than the default port. What is the motivation to override when the operational VLs is different ? Why is that better than what is done currently ? The idea was to apply the default policy - set sl2vl modulo operational VL. When applying ALL settings using port 1 we still want to override this setting for ports with different operational VL. What makes the default policy modulo operational VLs ? This is how its implemented in sl2vl_update_table() vl_mask = (1 (ib_port_info_get_op_vls(p-port_info) - 1)) - 1; for (i = 0; i IB_MAX_NUM_VLS / 2; i++) { vl1 = sl2vl_table-raw_vl_by_sl[i] 4; vl2 = sl2vl_table-raw_vl_by_sl[i] 0xf; if (vl1 != 15) vl1 = vl_mask; if (vl2 != 15) vl2 = vl_mask; Default startup switches configuration uses the same policy. I view this as further modification on the configuration based on the operational VLs. Besides, any VL specified above the op VLs is a drop (same as indicating VL15). I think what you are doing is changing the default behavior and hence perhaps an additional policy switch is needed if this change really is needed and I'm unsure of that. Is this really a policy issue ? IMO there are two separate issues in this patch and they should be in separate patches (for better git bisection). Maybe, but I still think this patch fixes wrong setting for sl2vl using optimized and non optimized methods. I'm not sure this should be divided to 2 separate patches. It's one thought per patch and to me this is two different thoughts. If this is the only issue, I can split this patch to 2 separate patches. It's also the issue noted above. -- Hal Eli Also, a couple of (possibly related) questions below. It seems that you are referring to patch v1 which was modified according to Jim's comments. Please check the v2 patch . I see my questions are moot in terms of the v2 patch. -- Hal Thanks, Eli Signed-off-by: Eli Dorfman e...@voltaire.com --- opensm/opensm/osm_qos.c | 25 - 1 files changed, 20 insertions(+), 5 deletions(-) diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index a571370..de0ae23 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -182,7 +182,7 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p, tbl.raw_vl_by_sl[i] = (vl1 4) | vl2; } - if (!force_update (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) + if (!force_update in_port (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) !memcmp(p_tbl, tbl, sizeof(tbl))) return IB_SUCCESS; Why exclude
ibnetdiscover issue with multiported CA (or router) with multiple ports on same subnet
Sasha, I'm seeing an issue with ibnetdiscover from a CA port where it appears to extend a path at a remote CA port (it's actually another port on the same CA) to query NodeInfo of the next hop beyond it. I get the following error message: src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr 0x11:0) bad status 110; Connection timed out where smpquery -D nodeinfo of 0,1,20 is a CA which can also be seen from the topology. It appears to stem from the following code snippet from libibnetdisc/src/ibnetdisc.c:recv_port_info if (port_num mad_get_field(port-info, 0, IB_PORT_PHYS_STATE_F) == IB_PORT_PHYS_STATE_LINKUP ((node-type == IB_NODE_SWITCH port_num != local_port) || (node == fabric-from_node port_num == local_port))) { ib_portid_t path = smp-path; if (extend_dpath(engine, path, port_num) 0) query_node_info(engine, path, node); } that was introduced by: commit fcb8d5e7588e38508a8e354c37009d73c0a3889f Author: Sasha Khapyorsky sas...@voltaire.com Date: Sat Apr 10 02:43:24 2010 +0300 libibnetdisc: no backward NodeInfo queries Then switch is reached via port N we don't need to query back via this port - source node is discovered already. Finally this saves some amount of unnecessary MADs. Signed-off-by: Sasha Khapyorsky sas...@voltaire.com and subsequently modified by: commit 49d149c63a44d99259f516a15af53d8cf3f0e7c9 Author: Sasha Khapyorsky sas...@voltaire.com Date: Tue Apr 13 19:54:45 2010 +0300 libibnetdisc: don't try to cross discovery over CA When discovery is running from CA node it shouldn't try to cross over all ports, but only via local one (send over non-local ports will fail since CA doesn't route MADs). Signed-off-by: Sasha Khapyorsky sas...@voltaire.com due to the (node == fabric-from_node port_num == local_port) clause being TRUE. ibnetdiscover src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr 0x11:0) bad status 110; Connection timed out # # Topology file: generated on Wed Aug 25 18:52:16 2010 # # Initiated from node 0002c9020020ee0c port 0002c9020020ee0d vendid=0x2c9 devid=0xb924 sysimgguid=0xb8c00438b switchguid=0xb8c00438b(b8c00438b) Switch 24 S-000b8c00438b # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 4 lmc 0 [5] H-0002c90310e0[1](2c90310e1) # sw124 HCA-1 lid 5 4xDDR [6] H-0002c903d1c8[1](2c903d1c9) # sw123 HCA-1 lid 0 4xDDR [7] H-0002c9020020ee0c[1](2c9020020ee0d) # sw075 HCA-1 lid 2 4xDDR [20]H-0002c9020020ee0c[2](2c9020020ee0e) # sw075 HCA-1 lid 3 4xDDR ... vendid=0x2c9 devid=0x6278 sysimgguid=0x2c9020020ee0f caguid=0x2c9020020ee0c Ca 2 H-0002c9020020ee0c # sw075 HCA-1 [1](2c9020020ee0d) S-000b8c00438b[7] # lid 2 lmc 0 MT47396 Infiniscale-III Mellanox Technologies lid 4 4xDDR [2](2c9020020ee0e) S-000b8c00438b[20]# lid 3 lmc 0 MT47396 Infiniscale-III Mellanox Technologies lid 4 4xDDR smpquery -D nodeinfo 0,1,20 # Node info: DR path slid 65535; dlid 65535; 0,1,20 BaseVers:1 ClassVers:...1 NodeType:Channel Adapter NumPorts:2 SystemGuid:..0x0002c9020020ee0f Guid:0x0002c9020020ee0c PortGuid:0x0002c9020020ee0e PartCap:.64 DevId:...0x6278 Revision:0x00a0 LocalPort:...2 VendorId:0x0002c9 I don't think the local port part of the test above (node == fabric-from_node port_num == local_port) is correct where: local_port = (uint8_t) mad_get_field(port_info, 0, IB_PORT_LOCAL_PORT_F); Instead, shouldn't port_num be checked against the local port that initiated the ibnetdiscover (which in this case is port 1) ? If so, a from_portnum could be added/saved in the fabric structure and used for this check. Do you concur with this approach ? -- Hal -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] infiniband-diags/ibnetdiscover: Fix handling of CA ports in recv_port_info
When multiple ports on the same CA are connected to the same subnet and an ibnetdiscover is initiated from one of them, the discovery continues past the other CA port and an error occurs. The error is: src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr0x11:0) bad status 110; Connection timed out Fix this by saving the initiating port number in the fabric structure and using that rather than local_port for the comparison in recv_port_info. Signed-off-by: Hal Rosenstock h...@mellanox.com --- diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index cfd3bbe..935e427 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -1,6 +1,7 @@ /* * Copyright (c) 2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -149,6 +150,8 @@ typedef struct ibnd_fabric { * or by default the node you ar running on */ ibnd_node_t *from_node; + int from_portnum; + /* NULL term list of all nodes in the fabric */ ibnd_node_t *nodes; /* NULL terminated list of all chassis found in the fabric */ diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index f525d71..79dd98e 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2009 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * Copyright (c) 2008 Lawrence Livermore National Laboratory + * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -199,7 +200,7 @@ static int recv_port_info(smp_engine_t * engine, ibnd_smp_t * smp, if (port_num mad_get_field(port-info, 0, IB_PORT_PHYS_STATE_F) == IB_PORT_PHYS_STATE_LINKUP ((node-type == IB_NODE_SWITCH port_num != local_port) || - (node == fabric-from_node port_num == local_port))) { + (node == fabric-from_node port_num == fabric-from_portnum))) { ib_portid_t path = smp-path; if (extend_dpath(engine, path, port_num) 0) query_node_info(engine, path, node); @@ -324,9 +325,10 @@ static int recv_node_info(smp_engine_t * engine, ibnd_smp_t * smp, dump_endnode(smp-path, node_is_new ? new : known, node, port); - if (rem_node == NULL) /* this is the start node */ + if (rem_node == NULL) { /* this is the start node */ fabric-from_node = node; - else { + fabric-from_portnum = port_num; + } else { /* link ports... */ int rem_port_num = get_last_port(smp-path); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] complib/cl_timer.c: fixing cl_timer calculation
On 25-Aug-10 7:07 PM, Sasha Khapyorsky wrote: On 15:01 Tue 24 Aug , Yevgeny Kliteynik wrote: BTW, any idea what are the (obviously historical) reasons for these lines in the code? 323: /* do not do 0 wait ! */ 324: /* if (delta_time 1000.0) {delta_time = 1000;} */ Have no idea. This is from day zero of git history. Well, it's not there any more. -- Yevgeny Sasha -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: IPoIB CM connection establishment protocol question
Eli Cohen wrote: Hi stan, Here's a short summary: An IPoIB interface encodes it capability of supporting CM mode in the hardware address it publishes. Once the stack hands an SKB to IPoIB for transmission (unicast), IPoIB needs to resolve the hardware address to IB address information (LID, SL etc.). Once resolution is complete, it will create a RC connection with the other host using IB CM. This process is symmetrical - that is, for two communicating hosts there are two RC connections so that each connection works half duplex. Hope that helps. Hello, Thanks for responding. Indeed this clears the air; the original Windows IPoIB CM code assumed a single connection. stan. On Thu, Aug 05, 2010 at 10:25:27AM -0700, Smith, Stan wrote: Could someone who is OFED IPoIB knowledgeable help me understand which IPoIB side starts the RC connection establishment (Tx REQ) to an ARP unknown remote host? I tried reading the Linux IPoIB code and realized it's been too many years since writing Linux network drivers (2001) and found myself lost in the bits w.r.t. CM connection mgmt. and ARP handling. From limited observation, it looks like the ACTIVE side (ttcp -t) of an IPoIB network connection broadcasts an IB ARP-REQ which indicates CM capability in the IB ARP flags byte. When the PASSIVE IPoIB (ttcp -r) receives the IB ARP-REQ, it sends back an IB ARP-REPLY which indicates it's CM capability among other items. Now the confusing part: which IPoIB CM side starts the CM-RC connection establishment? Does the PASSIVE IPoIB start the RC connection (Tx REQ) prior to sending the IB ARP-REPLY? Or does the ACTIVE IPoIB side receive the IB ARP-REPLY and then start an ACTIVE RC connection (Tx REQ) now that it knows the remote side is CM capable? thanks, stan. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html