[ewg] New/latest OpenSM and ibsim releases for (some) upcoming OFED release

2019-05-01 Thread Hal Rosenstock
Hi,

On April 17, opensm 3.3.22 and ibsim 0.8 were released.

Can these new releases be picked up by OFED when appropriate? I know
it's late in the current OFED 4.17 RC1 cycle.

Thanks.

-- Hal
___
ewg mailing list
ewg@lists.openfabrics.org
https://lists.openfabrics.org/mailman/listinfo/ewg


[ewg] [ANNOUNCE] opensm 3.3.22 release

2019-04-17 Thread Hal Rosenstock
There is a new 3.3.22 release of opensm.

https://github.com/linux-rdma/opensm/releases/tag/3.3.22

Changes since 3.3.21:
Internal library reorganization to remove circular dependencies
Added a few command line options, previously supported only as configuration
options, to be consistent with the Red Hat/Fedora startup script
Internal improvements and bug fixes as noted below and in release notes

See 
https://github.com/linux-rdma/opensm/tree/master/doc/opensm_release_notes-3.3.txt

Full list of changes is below:

Honggang Li (36):
  complib: cl_vector_copy16/32/64 should be static functions
  Use precision specifier for scanf
  complib/cl_types.h: Cosmetic formatting changes
  complib/cl_types.h: Remove unimplemented function cl_panic
  complib/cl_heap.h: Replace 'shift_' with 'heap_' in the DESCRIPTION section
  complib/cl_heap.c: Remove redundant initialization statement
  include/complib: Delete documentation about 'p_nil'
  include/opensm: Remove some redundant includes
  complib/cl_dispatcher.c: Fix typo and delete one incorrect comment
  complib/cl_dispatcher.h: Improve comments
  complib/cl_event_wheel.h: Improve comment documentation
  complib/cl_event_wheel.c: Minor update to the sample test program
  opensm/osm_base.h: Delete unused header complib/cl_types.h
  opensm/osm_base.h: Delete comments about non-existent "Base" class
  opensm/osm_path.h: Delete comments for non-existent struct field and function parameters
  Delete unused header opensm/osm_attrib_req.h
  opensm/osm_router.h: Improve comments
  opensm/osm_node.h: Improve comments
  opensm/osm_vl15intf.h: Minor update of comments
  opensm/osm_mtree.h: Improve comments
  opensm/osm_remote_sm.h: Improve comments
  opensm/osm_sm_mad_ctrl.h: Improve comments
  opensm/osm_sa_mad_ctrl.h: Improve comments
  opensm/osm_multicast.h: Improve comments
  opensm/osm_ucast_mgr.h: Add comment for 'max_lid' field in osm_ucast_mgr structure
  opensm/osm_ucast_cache.h: Improve coding style and comments
  opensm/osm_remote_sm.c: Improve comment
  opensm/osm_mtree.h: Improve comment
  opensm/osm_mlnx_ext_port_info_rcv.c: Check the pointer of osm_sm_t before accessing it
  opensm/osm_service.h: Fix and add some comments
  osm_sa.c: Remove unneeded label in osm_sa_bind
  osm_subnet.c: Free per_module_logging_file in subn_opt_destroy
  osm_opensm.c: destroy_routing_engines should destroy the default routing engine
  Add '--subnet_prefix' and '--dump_files_dir' options
  osm_helper.c: Fix lsea_str_fixed_width OVERRUN issue
  main.c: Remove NO_EFFECT code

Hal Rosenstock (24):
  complib/cl_ptr_vector.h: Fixed cut 'n paste error in cl_ptr_vector_remove NOTES
  include/complib: Fix comments and documentation
  complib/cl_qcomppool.h: Improve max_objects comment
  complib/cl_debug.h: Cosmetic formatting changes to some macros
  osm_subnet.c: Add latest Bull device IDs to device white lists
  complib/cl_event_wheel.c: Some cosmetic changes
  complib/cl_event_wheel.h: Eliminate unneeded field in cl_event_wheel_reg_info_t
  complib/cl_event_wheel.c: Handle malloc failure in cl_event_wheel_reg
  complib/cl_[dispatcher event_wheel].h: Fix commentary typo
  Revert complib/cl_event_wheel.c: Fix memory leak in event_wheel mechanism
  osm_resp.c: No need to swap DR [D/S]LIDs in resp_make_resp_smp
  osm_opensm.c: Fix seg fault in destroy_routing_engines
  osm_opensm.c: No need to check context for default routing engine in destroy_routing_engines
  Eliminate circular dependencies in shared libraries
  man/opensm.8.in: Quiet some man warnings
  osm_helper.c: Make "50" string proper fixed width in lsea_str_fixed_width
  osm_[link_mgr trap_rcv].c: Check the return value of osm_get_port_by_guid
  gen_chlog.sh: Update script to use git describe rather than git cat-file tag
  Deprecate complib_init() due to use of exit() function
  libvendor/osm_vendor_mlx[_hca]_sim.c: Eliminate use of exit
  libvendor/osm_vendor_mlx_sim.c: In osmv_transport_init, fix memory leaks on error
  osmtest.c: Remove unused osmtest_get_node_rec routine
  Update shared (internal) library versions in accordance with changes since OpenSM 3.3.21
  configure.ac: Update package number for OpenSM to 3.3.22 for release

Nicolas Morey-Chaisemartin (7):
  osm_opensm.c: Fix static declaration
  osm_[congestion_control perfmgr].c: Fix signed vs unsigned comparison
  osmtest/main.c: Fix return type for getopt_long_only
  libvendor,osmtest: Use NULL instead of 0 in all places where it used as a pointer
  osmtest: Add missing static keywords
  osmtest/main.c: Fix show_usage declaration
  libvendor/osm_vendor_ibumad.c: Fix type of array passed to umad_get_ca_portguids in libibumad

Aleksandr Minchiu (2):
  osm_port_info_rcv.c: O

[ewg] [ANNOUNCE] ibsim 0.8 release

2019-04-17 Thread Hal Rosenstock
There is a new 0.8 release of ibsim.

https://github.com/linux-rdma/ibsim/releases/tag/0.8

New features since 0.7:
Extended speed support including HDR and FDR10
2x link support
Link speed, espeed, and width change support
Support for PortXmitWait field in PortCounters
Minor changes/bug fixes as noted below

All component versions are from recent master branch. Full list of
changes is below.

Honggang Li (18):
  Move sim_cmd_file function from ibsim.c to sim_cmd.c
  Return 'NULL' instead of '0' if function returns a pointer
  sim_net.c: Avoid copying garbage for vlarb in init_ports
  sim_cmd.c: Fix ‘orig’ may be used uninitialized in do_perf_counter_set
  sim_mad.c: Fix some clang compilation errors
  ibsim.spec.in: Append gcc into the BuildRequires tag
  sim_net.c: Delete placeholder 'new_hca' function
  sim_cmd.c: do_set_guid should dereference the char pointer
  sim_cmd.c: sim_cmd_file should check the char pointer before dereferencing
  sim_mad.c: Fix NO_EFFECT issue for do_linearforwtbl
  sim_net.c: Initialize the pointer sp with NULL for parse_port
  sim_cmd.c: Delete unused update of a value
  sim_net.c: Free pattern buffer in parse_port_connection_data
  sim_mad.c: Fix potential dereference of a null pointer in do_sl2vl
  sim_mad.c: Fix accessing of uninitialized memory in send_trap
  umad2sim.c: Fix a few potential buffer overflow issues
  umad2sim.c: make_path should check the return value of mkdir
  sim_cmd.c: Fix out of bound memory access in do_cmd

Hal Rosenstock (12):
  Support 2x link widths
  Add support for HDR extended link speed
  sim_mad.c: Fix compile warning in do_portinfo
  umad2sim/umad2sim.c: Fix ibsim for latest libibumad in rdma-core
  net-examples: Fix commentary typos
  README: Updated maintainer
  sim_mad.c: Combine linkwidth cases in do_portinfo
  sim_mad.c: Combine linkspeed/linkspeedext cases in do_portinfo
  sim_net.c: Reduce regfree calls in parse_port_connection_data
  sim_cmd.c: Fix CONSTANT_EXPRESSION_RESULT in dump_route
  sim_mad.c: Fix a CONSTANT_EXPRESSION_RESULT in switch_lookup
  ibsim/ibsim.c: Bump version to 0.8

Daniel Klein (2):
  Add extended link speed support
  FDR10 link speed support

Cyrille Verrier (1):
  Set PortCountersXmitWaitSupported flag

hnrose (1):
  Merge pull request #1 from cyrilleverrier/master

jecavil (1):
  Link speed, espeed, and width change support



[ewg] [ANNOUNCE] opensm 3.3.21 release

2018-09-24 Thread Hal Rosenstock
There is a new 3.3.21 release of opensm.

https://github.com/linux-rdma/opensm/releases/tag/3.3.21

Changes since 3.3.20:
Support for HDR link speed and 2x link width
Nue routing algorithm (EXPERIMENTAL)
Support for ignoring throttled links with DFSSSP
Support for long transaction timeout for SM class transactions
Internal improvements and bug fixes as noted below and in release notes

See 
https://github.com/linux-rdma/opensm/tree/master/doc/opensm_release_notes-3.3.txt

Full list of changes is below:

Alex Netes (1):
  osmtest.c: Close file before exit function osmtest_create_inventory_file

Benjamin Drung (1):
  Fix various typos

Dan Ben Yosef (1):
  osm_sa_mcmember_record.c: Fix use after free in mcmr_rcv_join_mgrp

Daniel Klein (5):
  osm_ucast_mgr.c: Fix minhop tables miscalculation due to variable wraparound
  osm_ucast_updn.c: Add memory allocation failure handling in updn_build_lid_matrices
  osm_subnet.c: Indicate that subnet prefix can't be changed at runtime
  ib_types.h: Add CapabilityMask2 bit definition for CPI CapabilityMask
  osm_sa_path_record.c: Check input parameters in osm_get_path_params

Hal Rosenstock (75):
  osm_torus.c: Cosmetic formatting change
  osm_state_mgr.c: Move subnet up event to occur after mkey related files are written
  osm_console.c: Remove redundant condition in __get_stats
  osm_ucast_ftree.c: Remove redundant condition in fabric_route_downgoing_by_going_up
  osm_sa_mcmember_record.c: Add MGID to 1B13 error message
  osm_switch.h: Fix commentary typo
  osm_sminfo_rcv.c: Use initial rather than return path in smi_rcv_process_get_response
  osm_[link lid]_mgr.c: Simplify link width comparisons
  osm_link_mgr.c: Simplify some link speed related comparisons
  osm_[link lid]_mgr.c: Simplify error threshold comparisons
  Makefile.am: Fix INCLUDES warnings
  configure.in: Update AM_INIT_AUTOMAKE to use subdir-objects
  configure.ac: Update configure.in to configure.ac
  gen_ver.sh: Change configure.in to configure.ac in comment
  osm_[helper.c base.h]: Add support for Bull OUI
  osm_subnet.c: Add support for Bull device ID to is_mlnx_ext_port_info_supported
  Add Bull device IDs to device white lists
  osm_subnet.c: Add Connect-X5 support to is_mlnx_ext_port_info_supported
  osm_sa_mad_ctrl.c: It's report response rather than repress
  ib_types.h: Fix some typos associated with IB_CLASS_RESP_TIME_MASK
  ib_types.h: Replace hard coded constant with define
  ib_types.h: Add optional QP1Dropped counter to PortCounters attribute
  ib_types.h: Add additional optional counters to PortCountersExtended
  ib_types.h: Fix bit for IB_PM_IS_ADDL_PORT_CTRS_EXT_SUP
  ib_types.h: Add IsPMKeySupported ClassPortInfo CapabilityMask2 bit
  [ib_types.h, osm_helper.c]: Change IB_PORT_CAP_RESV13 to IB_PORT_CAP_HAS_CABLE_INFO
  ib_types.h: Add additional PortInfo:CapabilityMask2 definitions
  osm_helper.c: Add support for dumping PortInfo:CapabilityMask2
  osm_port_info_rcv.c: Fix min_ca_rate determination in pi_rcv_process_endport
  osm_sa_inform_info.c: Use defines rather than hard coded constants in infr_rcv_process_set_method
  Add support for 2x link widths
  osm_switch.c: Fix commentary typo
  osm_madw.h: Remove unused bind_info in osm_madw structure
  ib_types.h: Add new rates to return values for ib_[multi]path_rec_rate
  Add timeout parameter for SM class set transactions
  Add timeout parameter for SM class get transactions
  Add initial policy for long transaction timeout
  osm_multicast.h: Fix some osm_mgrp_box structure field descriptions
  osmtest/osmt_multicast.c: Fix MC join with unrealistic rate
  osm_sa_mcmember_record.c: Use neighbor MTU rather than MTUCap in mgrp_request_is_realizable
  osm_prtn_config.c: Fix a couple of compile warnings with more recent gcc
  osm_qos.c: Better handling of VL arbitration tables when there is 1 data VL
  osm_sa.c: Cosmetic change to 4C05 error log message
  ib_types.h: mcast_pkey_trap_suppr in PortInfo attribute is 2 bits in IBA 1.3
  osm_subnet.c: Fix typo in generated configuration/options file
  Add support for HDR
  osm_helper.c: Add decode of HDR supported to dbg_get_capabilities2_str
  osm_ucast_dfsssp.c: Uniquify some error codes
  osm_subnet.c: Add additional ConnectX-5 device ID to is_mlnx_ext_port_info_supported
  Add option and support for only using the original extended SA rates
  doc/QoS_management_in_OpenSM.txt: Fix typo
  osm_subnet.c: Make formatting consistent in generated opensm.conf
  osm_[multi]path_record.c: Fix a couple of edge cases with new 2x/HDR SA rates
  Add support for additional Mellanox OUIs
  osm_vendor_ibumad.c: OpenSM no longer works with ibsim with latest libibumad
  osm_pkey.c: Fix comment in match_pkey
  Revert osm_vendor_ibumad.c

[ewg] [ANNOUNCE] ibsim git repo is now moved from OFA server to github.com linux-rdma

2018-06-12 Thread Hal Rosenstock
The ibsim git repo on the OFA server 
(http://git.openfabrics.org/?p=~halr/ibsim.git) is now obsolete
and has been moved to the linux-rdma project in github.com 
(https://github.com/linux-rdma/ibsim).

Original releases (tarballs) are available at 
http://downloads.openfabrics.org/management/

Last release was ibsim-0.7 on June 11, 2016.

Re: [ewg] maintainer for ibutils?

2016-11-11 Thread Hal Rosenstock
Hi David,

On 11/10/2016 5:40 PM, David Brean wrote:
> Hello,
> 
> The OFED maintainers file is on the web and it contains:
> 
> ibutils: NA
> http://www.openfabrics.org/downloads/ibutils
> git://git.openfabrics.org/~vlad/ibutils.git master

> If patches are submitted against the files in ibutils package, is there 
> someone to pick them up?

ibutils has been deprecated/unmaintained for several years now.

-- Hal

> -David
> 
> 
> 
> 

Re: [ewg] SoftRoCE

2015-05-05 Thread Hal Rosenstock
Hi,

On 5/5/2015 8:57 AM, Dinesh Kb wrote:
 Hi,
 
 Is a subnet manager mandatory for SoftRoCE in order to measure the
 benchmarks?

There's no SM with RoCE.

 While trying to run the benchmark tests, it throws the error "Is SM running".
 
 Got the following error while running opensm
 
 vvdn@vvdn:~$ sudo opensm
 -
 OpenSM 3.3.18
 Command Line Arguments:
  Log File: /var/log/opensm.log
 -
 OpenSM 3.3.18
 
 Using default GUID 0xf24da2fffe68b1df
 Entering DISCOVERING state
 
 
 Error from osm_opensm_bind (0x2A)
 Perhaps another instance of OpenSM is already running
 Exiting SM
 
 vvdn@vvdn:~$

Is GUID 0xf24da2fffe68b1df a RoCE or IB port? OpenSM won't come up on
a RoCE port.

-- Hal

 Is there any application specific to SoftRoCE for measuring the benchmarks
 effectively?
 
 
 with warm regards
 Dinesh.K.B
 
 VVDN Technologies Pvt Ltd.
 
 
 On Tue, Apr 21, 2015 at 7:35 PM, Dinesh Kb dinesh...@vvdntech.in wrote:
 
 Hi
 
 Can we implement SoftRoCE using OFED-3.12? Can this be done without
 special hardware?
 
 Kindly share your comments..
 
 with warm regards
 Dinesh.K.B
 
 VVDN Technologies Pvt Ltd.
 
 
 
 


Re: [ewg] Opensm for dual GUID

2015-02-24 Thread Hal Rosenstock
On 2/24/2015 5:19 AM, Atul Yadav wrote:
 Hi Team,
 
 We are trying to set up HCA and switch level failover.
 
 Operating System:- Centos 6.5
 
 Please guide us 

To run OpenSM on multiple ports/HCAs on the same machine for the same
subnet, multiple instances of OpenSM need to be invoked one per port/HCA
and need separate but similar configuration.

-- Hal

 
 Thank You
 
 Atul Yadav
 
 
 


Re: [ewg] Opensm for dual GUID

2015-02-24 Thread Hal Rosenstock
On 2/24/2015 7:33 AM, Atul Yadav wrote:
 Hi Rosenstock,
 
 Thanks for responding.
 
 As per our requirement we want to achieve IB bonding for HCA and switch
 level fail-over.
 
 bond0 (Active-Passive)
 Please provide the opensm configuration parameters.

You need to do something along the following lines:

First, create 2 config files (say opensm-qib0.conf and opensm-qib1.conf)
with the following variables changed as follows:

opensm-qib0.conf:
# The port GUID on which the OpenSM is running
guid 0x0011756f5f4c

# Log file to be used
log_file /var/log/opensm-qib0/opensm.log

opensm-qib1.conf:
# The port GUID on which the OpenSM is running
guid 0x0011756f5f4a

# Log file to be used
log_file /var/log/opensm-qib1/opensm.log


then make sure that the following directories exist:
/var/cache/opensm-qib0
/var/log/opensm-qib0
/var/cache/opensm-qib1
/var/log/opensm-qib1


and then:
export OSM_CACHE_DIR=/var/cache/opensm-qib0
export OSM_TMP_DIR=/var/log/opensm-qib0
opensm -F opensm-qib0.conf 

export OSM_CACHE_DIR=/var/cache/opensm-qib1
export OSM_TMP_DIR=/var/log/opensm-qib1
opensm -F opensm-qib1.conf 
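
The steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: prepare_sm_instance is a hypothetical name, and its confdir and root arguments are not from the original message (they exist so the sketch can be tried in a scratch directory). It writes the two changed variables into a per-port config file, creates the cache and log directories, and prints the launch command instead of starting opensm.

```shell
# Sketch only: prepare a per-port OpenSM config file and its directories.
# prepare_sm_instance NAME GUID [CONFDIR] [ROOT]
#   NAME    - instance suffix, e.g. qib0
#   GUID    - port GUID, e.g. as listed by "ibstat -p"
#   CONFDIR - where to write opensm-NAME.conf (assumption; defaults to ".")
#   ROOT    - optional prefix for /var/... (assumption; empty means real /var)
prepare_sm_instance() {
    name=$1
    guid=$2
    confdir=${3:-.}
    root=${4:-}

    mkdir -p "$confdir" \
             "$root/var/cache/opensm-$name" \
             "$root/var/log/opensm-$name" || return 1

    cat > "$confdir/opensm-$name.conf" <<EOF
# The port GUID on which the OpenSM is running
guid $guid

# Log file to be used
log_file $root/var/log/opensm-$name/opensm.log
EOF

    # Print, rather than run, the command that would start this instance.
    echo "OSM_CACHE_DIR=$root/var/cache/opensm-$name OSM_TMP_DIR=$root/var/log/opensm-$name opensm -F $confdir/opensm-$name.conf"
}

# Usage (illustrative):
#   prepare_sm_instance qib0 0x0011756f5f4c
#   prepare_sm_instance qib1 0x0011756f5f4a
```

Each printed command would be run in its own shell (or in the background), since one opensm instance per port has to stay running.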


A similar alternative configuration approach is described in:
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03557.html

-- Hal

 
 [root@SRDCB0970RTGMS opensm]# ibstat
 CA 'qib0'
         CA type: InfiniPath_QLE7342
         Number of ports: 2
         Firmware version:
         Hardware version: 2
         Node GUID: 0x0011756f5f4c
         System image GUID: 0x0011756f5f4c
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 40
                 Base lid: 1
                 LMC: 0
                 SM lid: 1
                 Capability mask: 0x0761086a
                 Port GUID: 0x0011756f5f4c
                 Link layer: InfiniBand
         Port 2:
                 State: Down
                 Physical state: Disabled
                 Rate: 10
                 Base lid: 65535
                 LMC: 0
                 SM lid: 65535
                 Capability mask: 0x07610868
                 Port GUID: 0x0011756f5f4d
                 Link layer: InfiniBand
 CA 'qib1'
         CA type: InfiniPath_QLE7342
         Number of ports: 2
         Firmware version:
         Hardware version: 2
         Node GUID: 0x0011756f5f4a
         System image GUID: 0x0011756f5f4c
         Port 1:
                 State: Initializing
                 Physical state: LinkUp
                 Rate: 40
                 Base lid: 65535
                 LMC: 0
                 SM lid: 65535
                 Capability mask: 0x07610868
                 Port GUID: 0x0011756f5f4a
                 Link layer: InfiniBand
         Port 2:
                 State: Down
                 Physical state: Disabled
                 Rate: 10
                 Base lid: 65535
                 LMC: 0
                 SM lid: 65535
                 Capability mask: 0x07610868
                 Port GUID: 0x0011756f5f4b
                 Link layer: InfiniBand
 
 [root@SRDCB0970RTGMS opensm]#
 
 [root@SRDCB0970RTGMS ~]# ibstat -p
 0x0011756f5f4c
 0x0011756f5f4d
 0x0011756f5f4a
 0x0011756f5f4b
 
 
 [root@SRDCB0970RTGMS ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
 DEVICE=bond0
 IPADDR=192.168.1.1
 NETMASK=255.255.255.0
 BROADCAST=192.168.1.255
 ONBOOT=yes
 BOOTPROTO=none
 USERCTL=no
 MTU=65520
 BONDING_OPTS="mode=1 primary=ib0 updelay=0 downdelay=0"
 [root@SRDCB0970RTGMS ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib*
 DEVICE=ib0
 USERCTL=no
 ONBOOT=yes
 MASTER=bond0
 SLAVE=yes
 BOOTPROTO=none
 TYPE=Infiniband
 PRIMARY=yes
 DEVICE=ib1
 USERCTL=no
 ONBOOT=yes
 MASTER=bond0
 SLAVE=yes
 BOOTPROTO=none
 TYPE=Infiniband
 DEVICE=ib2
 USERCTL=no
 ONBOOT=yes
 MASTER=bond0
 SLAVE=yes
 BOOTPROTO=none
 TYPE=Infiniband
 DEVICE=ib3
 USERCTL=no
 ONBOOT=yes
 MASTER=bond0
 SLAVE=yes
 BOOTPROTO=none
 TYPE=Infiniband
 [root@SRDCB0970RTGMS ~]#
 
 
 
 Thank You
 
 Atul Yadav
 
 
 
  
 
 On Tue, Feb 24, 2015 at 5:55 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote:
 
 On 2/24/2015 5:19 AM, Atul Yadav wrote:
  Hi Team,
 
  We are trying to setup the HCA and Switch level failover.
 
  Operating System:- Centos 6.5
 
  Please guide us
 
 To run OpenSM on multiple ports/HCAs on the same machine for the same
 subnet, multiple instances of OpenSM need to be invoked one per port/HCA
 and need separate but similar configuration.
 
 -- Hal
 
 
  Thank You
 
  Atul Yadav
 
 
 


Re: [ewg] Please add libibumad 1.3.10 release to OFED 3.18

2014-12-17 Thread Hal Rosenstock
Hi Vlad,

On 12/17/2014 4:14 AM, Vladimir Sokolovsky wrote:
 On 16/12/14 22:40, Hal Rosenstock wrote:
 Hi Vlad,

 Please incorporate libibumad 1.3.10 package into the upcoming OFED 3.18
 release.

 Thanks.

 -- Hal
 
 
 Hi Hal,
 This is done automatically, once you update the latest.txt.
 Please note that the resulting source RPM has an unknown release number:
 libibumad-1.3.10-unknown.src.rpm
 As I understand, you did not declare the RELEASE environment variable
 while building the tarball.
 
 From the configure.in:
 ...
 AC_SUBST(RELEASE, ${RELEASE:-unknown})
 AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz})
 ...
 
 So, maybe you want to rebuild the tarball with the correct release number.
 I would change the default value from unknown to 1 to avoid such
 issues in the future.

I fixed this in the libibumad-1.3.10.1 tarball just released.
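
The "unknown" suffix discussed above comes from the shell default-value expansion ${RELEASE:-unknown} quoted from configure.in. A minimal shell sketch of that behaviour follows; the function name release_tag is made up for illustration and is not part of the libibumad build.

```shell
# ${VAR:-default} expands to $VAR, or to "default" when VAR is unset or empty.
# release_tag is a hypothetical helper, not part of the libibumad build.
release_tag() {
    printf '%s\n' "${RELEASE:-unknown}"
}

# Usage (illustrative):
#   unset RELEASE; release_tag     # falls back to the default
#   RELEASE=1; release_tag         # uses the value that was set
```

Setting RELEASE before building the tarball, or changing the default in configure.in as suggested above, avoids the fallback value ending up in the package name.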

This should only be included for OFED 3.18 and not 3.12-1 (which I think
is not quite fully baked yet).

Thanks.

-- Hal

 
 Regards,
 Vladimir
 

  Original Message 
 Subject: [ANNOUNCE] libibumad 1.3.10 release
 Date: Tue, 16 Dec 2014 15:35:41 -0500
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To: linux-rdma (linux-r...@vger.kernel.org) linux-r...@vger.kernel.org

 There is a new 1.3.10 release of libibumad.

 Tarball is available in:
 http://www.openfabrics.org/downloads/management/
 (listed in http://www.openfabrics.org/downloads/management/latest.txt)

 md5sum:
 3a4046ddaaf43bbb82f124808232a294  libibumad-1.3.10.tar.gz

 All component versions are from recent master branch. Full list of
 changes is below.

 Alex Netes (1):
libibumad: Fix memory leak in resolve_ca_port

 Dan Ben Yosef (1):
umad.c: Buffer not null terminated

 Hal Rosenstock (14):
libibumad: Minor fixups for previous umad_str patch
libibumad: Fix issues causing const warnings for strings
libibumad: Add recent/missing SM/SA attributes
libibumad: Rename attributes UMAD_SM_ATTR_XXX rather than
 UMAD_SMP_ATTR_XXX
libibumad: update shared library version
libibumad: package version update for 1.3.9 release
umad_sm.h: Add Mellanox extended port info SM attribute ID to enum
umad_[sm sa].h: Add PortInfoExtended SM and PortInfoExtendedRecord
 SA attributes
umad_sa.h: Add some new (at IBA 1.3) SA CapabilityMask2 bit
 definitions
umad_cm.h: Add new SAP and SPR CM attributes
umad_str.c: Add strings for newly added attributes
Added/updated some Mellanox copyrights
libibumad.ver: Update shared library version
configure.in: package version update for 1.3.10 release

 Ilya Nelkenbaum (1):
libibumad/umad.c: In resolve_ca_port, skip ethernet link layer
 ports

 Ira Weiny (11):
libibumad: fix umad_register man page
libibumad: update umad_[send|recv] man pages to document how rmpp
 is handled
libibumad: change UMAD_METHOD_RESP to UMAD_METHOD_RESP_MASK
libibumad: add UMAD_SA_STATUS_PRI_SUGGESTED to SA status.
libibumad: add string functions for various enums
libibumad: document the setting of errno for umad_send and
 umad_recv
libibumad: add SA CAP MASK[2] definitions
libibumad: add ClassPortInfo struct
umad_types.h: fix status type in umad_hdr
Add support for new registration ioctl
Add make check with umad register tests

 Line Holen (1):
umad_sm.h Add SM trap definitions

 Nick Mills (1):
configure.in: Remove unused --disable-libcheck configure option

 Sean Hefty (10):
libibumad: Provide MAD definitions with libibumad
libibumad: Add SA MAD definitions to umad
libibumad: Add basic SM definitions to umad
libibumad: Add CM definitions to umad
libibumad: Add new umad header files to release
libibumad: Define ntohll/htonll
libibumad: Define data type to indicate values are in big-endian
libibumad/sa: Add SA specific status values
libibumad: Define well known QKEY
Fix export of umad_sa_mad_status_str

 sean.he...@intel.com (1):
Add UMAD_RMPP_FLAG_ACTIVE into umad_types.h



[ewg] Please add libibumad 1.3.10 release to OFED 3.18

2014-12-16 Thread Hal Rosenstock
Hi Vlad,

Please incorporate libibumad 1.3.10 package into the upcoming OFED 3.18
release.

Thanks.

-- Hal

 Original Message 
Subject: [ANNOUNCE] libibumad 1.3.10 release
Date: Tue, 16 Dec 2014 15:35:41 -0500
From: Hal Rosenstock h...@dev.mellanox.co.il
To: linux-rdma (linux-r...@vger.kernel.org) linux-r...@vger.kernel.org

There is a new 1.3.10 release of libibumad.

Tarball is available in:
http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

md5sum:
3a4046ddaaf43bbb82f124808232a294  libibumad-1.3.10.tar.gz

All component versions are from recent master branch. Full list of
changes is below.

Alex Netes (1):
  libibumad: Fix memory leak in resolve_ca_port

Dan Ben Yosef (1):
  umad.c: Buffer not null terminated

Hal Rosenstock (14):
  libibumad: Minor fixups for previous umad_str patch
  libibumad: Fix issues causing const warnings for strings
  libibumad: Add recent/missing SM/SA attributes
  libibumad: Rename attributes UMAD_SM_ATTR_XXX rather than UMAD_SMP_ATTR_XXX
  libibumad: update shared library version
  libibumad: package version update for 1.3.9 release
  umad_sm.h: Add Mellanox extended port info SM attribute ID to enum
  umad_[sm sa].h: Add PortInfoExtended SM and PortInfoExtendedRecord SA attributes
  umad_sa.h: Add some new (at IBA 1.3) SA CapabilityMask2 bit definitions
  umad_cm.h: Add new SAP and SPR CM attributes
  umad_str.c: Add strings for newly added attributes
  Added/updated some Mellanox copyrights
  libibumad.ver: Update shared library version
  configure.in: package version update for 1.3.10 release

Ilya Nelkenbaum (1):
  libibumad/umad.c: In resolve_ca_port, skip ethernet link layer ports

Ira Weiny (11):
  libibumad: fix umad_register man page
  libibumad: update umad_[send|recv] man pages to document how rmpp is handled
  libibumad: change UMAD_METHOD_RESP to UMAD_METHOD_RESP_MASK
  libibumad: add UMAD_SA_STATUS_PRI_SUGGESTED to SA status.
  libibumad: add string functions for various enums
  libibumad: document the setting of errno for umad_send and umad_recv
  libibumad: add SA CAP MASK[2] definitions
  libibumad: add ClassPortInfo struct
  umad_types.h: fix status type in umad_hdr
  Add support for new registration ioctl
  Add make check with umad register tests

Line Holen (1):
  umad_sm.h Add SM trap definitions

Nick Mills (1):
  configure.in: Remove unused --disable-libcheck configure option

Sean Hefty (10):
  libibumad: Provide MAD definitions with libibumad
  libibumad: Add SA MAD definitions to umad
  libibumad: Add basic SM definitions to umad
  libibumad: Add CM definitions to umad
  libibumad: Add new umad header files to release
  libibumad: Define ntohll/htonll
  libibumad: Define data type to indicate values are in big-endian
  libibumad/sa: Add SA specific status values
  libibumad: Define well known QKEY
  Fix export of umad_sa_mad_status_str

sean.he...@intel.com (1):
  Add UMAD_RMPP_FLAG_ACTIVE into umad_types.h



Re: [ewg] OPENSM cONFIGURATION

2014-04-12 Thread Hal Rosenstock
On 4/12/2014 11:29 AM, Atul Yadav wrote:
 Hi,
 
 Yes, I am able to ping all the nodes connected to the InfiniBand switch.
 For more details please go through the attachment.

OpenSM looks fine although it is very old (3.3.5). Is this SM host based
or embedded in one of your switches ?

I didn't see any output related to showing pings working but I'll take
your word for this. If pings work, I have no theory why this wouldn't work.

-- Hal

 
 
 
 Thanks
 Atul Yadav
 
 
 On Sat, Apr 12, 2014 at 7:28 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote:
 
 On 4/12/2014 6:59 AM, Atul Yadav wrote:
  HI,
 
  Thanks for replying
  In this architecture, when we run ibv_rc_pingpong between two nodes
  connected to the same switch we get a result. But when we use two
  nodes across 2 switches we get an error.
 
  Success:-
  [root@oss1 ~]# ibv_rc_pingpong
local address:  LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
remote address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
  8192000 bytes in 0.01 seconds = 6992.74 Mbit/sec
  1000 iters in 0.01 seconds = 9.37 usec/iter
  [root@oss1 ~]#
 
  [root@mds1 ~]# ibv_rc_pingpong 173.16.1.52
local address:  LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
remote address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
  8192000 bytes in 0.01 seconds = 7084.97 Mbit/sec
  1000 iters in 0.01 seconds = 9.25 usec/iter
  [root@mds1 ~]#
 
 
 
 
  Error
  [root@nalanda mvapich2-1.9]# ibv_rc_pingpong
local address:  LID 0x0001, QPN 0x56004e, PSN 0x704d51
remote address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2
 
  [root@mds1 ~]# ibv_rc_pingpong 173.16.1.1
local address:  LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2, GID ::
  client read: Success
  Couldn't read remote address
  [root@mds1 ~]#
 
 Looking at libibverbs/examples/rc_pingpong.c:
 
 static struct pingpong_dest *pp_client_exch_dest(const char *servername, int port,
                                                  const struct pingpong_dest *my_dest)
 {
 ...
         gid_to_wire_gid(&my_dest->gid, gid);
         sprintf(msg, "%04x:%06x:%06x:%s", my_dest->lid, my_dest->qpn,
                 my_dest->psn, gid);
         if (write(sockfd, msg, sizeof msg) != sizeof msg) {
                 fprintf(stderr, "Couldn't send local address\n");
                 goto out;
         }
 
 
         if (read(sockfd, msg, sizeof msg) != sizeof msg) {
                 perror("client read");
                 fprintf(stderr, "Couldn't read remote address\n");
                 goto out;
         }
 
 This read is failing for some reason. The message exchange takes place
 over a socket on an IP network (for example, IPoIB or Ethernet).
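
To make the exchange concrete: the message built by the sprintf format in the excerpt above (%04x:%06x:%06x:%s) is four colon-separated hex fields (LID, QPN, PSN, GID). A rough shell sketch that splits such a message is shown here; parse_pp_dest is a hypothetical helper, and the sample GID of 32 zero hex digits in the usage comment is an assumption about the wire format rather than something shown in this thread.

```shell
# Split a "lid:qpn:psn:gid" message into labelled fields.
# parse_pp_dest is a hypothetical helper for illustration only.
parse_pp_dest() {
    old_ifs=$IFS
    IFS=:
    # Word-split the message on colons into $1..$4.
    set -- $1
    IFS=$old_ifs
    if [ $# -ne 4 ]; then
        echo "malformed message" >&2
        return 1
    fi
    echo "lid=0x$1 qpn=0x$2 psn=0x$3 gid=$4"
}

# Usage (illustrative values taken from the "Success" output above):
#   parse_pp_dest 0022:20004a:7c9dc2:00000000000000000000000000000000
```

If this message never arrives, the problem lies with the IP path between the two hosts rather than with the IB fabric itself.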
 
 
  And how we test our ftree topology is working fine.
 
  Please go through the attachment.
 
 Looks like LIDs are assigned but can't tell about routing from info
 supplied but topology looks relatively simple (5 switches,
 homogeneous 4x QDR links). Is the OpenSM log clean? Any fat tree
 related messages? This is likely not an SM issue.
 
 The next issues are end node related (probably with IPoIB
 configuration). Can you ping between the nodes which fail
 rc_pingpong ? If not,
 
 -- Hal
 
 
  Thank You
  Atul Yadav
 
 
  On Sat, Apr 12, 2014 at 12:14 AM, Hal Rosenstock
  h...@dev.mellanox.co.il wrote:
 
  On 4/11/2014 2:21 PM, Atul Yadav wrote:
   Dear Team,
  
   We are trying to build Fat tree topology.
   The details are given below:
   Unmanaged switches 36 port  quantity 5
   As per some blogs we need to modify the opensm.conf file
   But we are unable to identify some parameters, like:
root_guid_file???
 
  Fat tree routing will try to autodetect the roots but this may not
  work and it is better to specify the root GUIDs. In your case, they
  are the GUIDs for switches A and B.
 
  The root GUID file is then provided to OpenSM either via the conf
  file or command line parameters. The command line parameter is
  [-a | --root_guid_file <path to file>]
 
  OpenSM man page says:
 
     -a, --root_guid_file <file name>
            Set the root nodes for the Up/Down or Fat-Tree routing
            algorithm to the guids provided in the given file (one to
            a line).
 
  It also says:
 
     If the root guid file is not provided ('-a' or '--root_guid_file

Re: [ewg] Can't snoop all kinds of mad packets

2014-04-08 Thread Hal Rosenstock
On 4/8/2014 12:13 PM, Yunzhao li wrote:
 We are working on the newest IB card: Mellanox Connect-IB. We pulled
 Mellanox OFED-2.1 into our environment. The IB nodes are connected
 through a Mellanox SX3036 switch. We followed Sean Hefty's madeye
 code exactly: using ib_register_mad_snoop() for registration and using
 ib_mad_snoop_handler() and ib_mad_recv_handler() to handle the sent and
 received MAD packets respectively. However, most of the captured
 packets are DevMgt (0x06), and we haven't received any class 0x81 or
 class 0x01 MADs.
 
 Does the snooping mad routine need the support of HCA hardware/firmware?
 Or, does it need the support of ibdump package?

I don't have access to Connect-IB so don't know for sure but my
understanding is that a Connect-IB port can be configured by an external
SM and that an SM can be run on the Connect-IB port so you should be
able to capture send and receive SM class packets.

Also, PMA should be supported and you can double check this with perfquery.

All of the above involves kernel interaction (even for SMA/PMA) so
snooping should work AFAIK.

What does ibstat say for your Connect-IB port ?

-- Hal

 
 Thanks! 
 
 
 



Re: [ewg] Can't snoop all kinds of mad packets

2014-04-08 Thread Hal Rosenstock
On 4/8/2014 12:41 PM, Yunzhao li wrote:
 The Connect-IB node is configured by the switch SM, and it can
 send/receive traffic with other connected Connect-IB nodes. We pulled
 OFED-2.1 into our VxWorks environment, 

You hadn't mentioned you were talking about VxWorks rather than Linux.

 so it's a little difficult to run
 linux commands for verification. For the Connect-IB port: the port state
 is defer and link state is active. 

LinkActDefer means that the physical layer has indicated a failure in
the link.

 



Re: [ewg] Can't snoop all kinds of mad packets

2014-04-07 Thread Hal Rosenstock

 We try to use IB MAD snoop to capture IB mad packets.

Note that there is an old kernel util module, madeye, written by Sean which
does this:

http://git.openfabrics.org/?p=ofed_1_5/linux-2.6.git;a=blob;f=drivers/infiniband/util/madeye.c;h=2c650a33a69c56d2b8a3274f63185214904abf3a;hb=967460824529719677d6a1d4600ec3e89a4538ab

which shows how to properly use the ib_register_mad_snoop API for both
SM and GS class MAD snooping.

 However, currently
 only the classes of DevMgt (0x06) and ComMgt (0x07) could be snooped. We
 tried to run tests based on class Perf (0x04) and class Subn (0x01) mad,
 and neither of them can be captured. Any suggestion? 

In the distant past, I've used madeye and captured SM, SA, and PerfMgt
packets.

Assuming your snoop registration(s) is/are correct:
Most SM packets are class 0x81 (directed route) and not class 0x01 (LID
routed) but this is SM dependent. Also, it might be that there is no
PerfMgt running to/from your node.
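
To check a capture against the classes discussed here, a small tally like the Python sketch below may help. It is only an illustration: it assumes the management class byte has already been extracted from each snooped MAD, the function names are made up for the example, and the table holds just the class values named in this thread plus SubnAdm (0x03) for SA traffic.

```python
from collections import Counter

# Management class values named in this thread, plus SubnAdm (0x03) for SA traffic.
MGMT_CLASS_NAMES = {
    0x01: "Subn (LID routed)",
    0x81: "Subn (directed route)",
    0x03: "SubnAdm",
    0x04: "Perf",
    0x06: "DevMgt",
    0x07: "ComMgt",
}


def summarize_classes(class_bytes):
    """Count captured MADs per management class name."""
    counts = Counter()
    for value in class_bytes:
        counts[MGMT_CLASS_NAMES.get(value, f"unknown (0x{value:02x})")] += 1
    return dict(counts)


def missing_classes(class_bytes, expected=(0x01, 0x81, 0x04)):
    """Return the expected class values that never showed up in the capture."""
    seen = set(class_bytes)
    return [value for value in expected if value not in seen]
```

If 0x81 is missing but 0x01 is not (or the reverse), that points at how the SM addresses the node rather than at the snoop registration.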

-- Hal

  
 
 Thanks in advance!
 
  
 
 Yunzhao


Re: [ewg] ibportstate of infiniband-diag 1.6.2 and after

2014-03-17 Thread Hal Rosenstock
On 3/14/2014 2:17 PM, Chen, Mei-Jen wrote:
 Hi Ira,

 The problem seems to depend on the HCA and the particular firmware version.

 I reran the tests on a Mellanox MCX353A-QCBT and an Intel/QLogic QLE7340.

 So far:

 Intel/QLogic QLE7340: ibportstate 1.6.1 OK, ibportstate 1.6.2 OK

 Mellanox MCX353A-QCBT, firmware 2.10.700: ibportstate 1.6.1 OK,
 ibportstate 1.6.2 NG, ibportstate 1.6.4 NG

 Mellanox MCX353A-QCBT, firmware 2.30.8000 (latest): ibportstate 1.6.1
 NG, ibportstate 1.6.2 NG, ibportstate 1.6.4 NG

 Please see the attached “mellanox_qdr.txt” and “qlogic_qdr.txt” for the
 output of the command.
 

At start of commands, QLogic port is ACTIVE whereas the Mellanox port is
DOWN.

IBA spec says:
C14-24.2.1: If PortInfo:Portstate=Down, then
• a SubnGet(PortInfo) shall produce valid data for PortInfo:PortState
and PortInfo:PortPhysicalState; whether any other component has
valid data is vendor-dependent.
• a SubnSet(PortInfo) shall make any changes it specifies to PortInfo:
PortPhysicalState; any other result is vendor-dependent.

so when the port is down, the only things in PortInfo you can trust are
PortState and PortPhysicalState, and you can't expect anything else to work.

Bring port up at least to INIT and retry.

-- Hal

 
 Thanks,

 Mei

 From: Weiny, Ira [mailto:ira.we...@intel.com]
 Sent: Thursday, March 13, 2014 5:03 PM
 To: Chen, Mei-Jen; ewg@lists.openfabrics.org
 Subject: RE: [ewg] ibportstate of infiniband-diag 1.6.2 and after

 There are not that many differences in ibportstate between 1.6.1 and 1.6.2.

 Do you have any output from the command which fails?  What HCAs are these?

 Could you run with the “-d” and “-v” options for both tools and send the
 output?

 Thanks,

 Ira

 From: ewg-boun...@lists.openfabrics.org
 [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Chen, Mei-Jen
 Sent: Thursday, March 13, 2014 12:45 PM
 To: ewg@lists.openfabrics.org
 Subject: [ewg] ibportstate of infiniband-diag 1.6.2 and after

 Hi,

 I have been using the “ibportstate” utility from the “infiniband-diag 1.6.1”
 package to force a lower link-up speed.

 It worked properly until I upgraded to the newer “infiniband-diag 1.6.2”
 or “infiniband-diag 1.6.4”. For example, “ibportstate 2 1 speed 1” does
 not seem to have any effect. Does anyone know the potential causes?

 Thanks,

 Mei
 



[ewg] [ANNOUNCE] libibumad 1.3.9 release

2014-02-05 Thread Hal Rosenstock
There is a new 1.3.9 release of libibumad.

Tarball is available in:
http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

md5sum:
52a81356906f4faf29a6cf9583161aa8  libibumad-1.3.9.tar.gz

Alex Netes (1):
  libibumad: Fix memory leak in resolve_ca_port

Hal Rosenstock (2):
  libibumad: update shared library version
  libibumad: package version update for 1.3.9 release

Ilya Nelkenbaum (1):
  libibumad/umad.c: In resolve_ca_port, skip ethernet link layer ports

Ira Weiny (3):
  libibumad: fix umad_register man page
  libibumad: update umad_[send|recv] man pages to document how rmpp is handled
  libibumad: document the setting of errno for umad_send and umad_recv


Re: [ewg] [ANNOUNCE] opensm 3.3.17 release

2014-01-31 Thread Hal Rosenstock
On 1/31/2014 9:18 AM, Or Gerlitz wrote:
 On Thu, Jan 30, 2014, Hal Rosenstock h...@dev.mellanox.co.il wrote:
 There is a new 3.3.17 release of OpenSM.

 Tarball is available in:
 http://www.openfabrics.org/downloads/management/
 (listed in http://www.openfabrics.org/downloads/management/latest.txt)

 md5sum:
 9c1b85e47ab495110c1944e0f4d634b7  opensm-3.3.17.tar.gz

 All component versions are from recent master branch.
 
 what components are you referring to?

opensm and its internal libraries.

 Full list of changes is below.
 [...]
 



[ewg] [ANNOUNCE] opensm 3.3.17 release

2014-01-30 Thread Hal Rosenstock
There is a new 3.3.17 release of OpenSM.

Tarball is available in:
http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

md5sum:
9c1b85e47ab495110c1944e0f4d634b7  opensm-3.3.17.tar.gz

All component versions are from recent master branch. Full list of
changes is below.

Albert Chu (4):
  opensm/osm_console.c: Support portstatus output for unenabled width/speed
  opensm/osm_console.c: Do not perform portstatus checks on down ports
  opensm/osm_ucast_ftree.c: Fix invalid debug output message
  opensm: Add configure output messages for several configure options

Alex Netes (35):
  opensm/osm_req.c: Better implementation of req_determine_mkey
  osm_lid_mgr.c: Fix duplicate LID assignment after SM port down
  opensm: Fix Q_Key, TClass and limited keys parsing warnings in partitions.conf
  opensm/osm_ucast_dfsssp.c: Fix unused variable in update_mcft()
  When SM fails to load/parse root_guids file use MinHop heuristics
  osm_trap_rcv.c: Fix crash in babbling port feature
  complib/cl_event_wheel.h: Some cosmetic fixes
  Fix test scenario in cl_event_wheel
  complib/cl_event_wheel.c: Fix duplicate error codes
  complib/cl_event_wheel.c: Add print of num_regs in cl_event_wheel_dump
  complib/cl_event_wheel.c: Roundup timeout to nearest msec
  opensm/osm_db_pack.c: Removed uneeded asserts
  osm_vendor_ibumad.c: Fix explicit null derefenced issue found by coverity
  osm_lid_mgr.c: Don't configure MTU and LMC for base SP0
  osm_db_files.c: Fix memory leak when deleting entries from osm db
  osm_trap_rcv.c: Fix locking in aging callback
  Add option to disable M_Key lookup
  Improve m_key lookup
  osm_guid_mgr.c: Fix GUIDInfo SET function
  osm_sa_mcmember_record.c: Fix incorrect comparison of IPv6 MGID when searching for SNM MLID
  Resend trap 144 when detecting remote MASTER SM with lower priority
  osm_ucast_cache.c: Fix memory leak in ucast_cache
  osm_drop_mgr.c: fix timeouts on Get Pkey from ext switch ports
  osm_link_mgr.c: fix uninitialized variable usage
  osm_link_mgr.c: Fix uninitialized value (physp0)
  osm_mcast_mgr.c: Fix wrong comparison in mcast_mgr_subdivide()
  osm_state_mgr.c: Fix error print in state_mgr_check_tbl_consistency()
  Fix minhop population in fabric with duplicate lids
  osm_mtree.c: Cosmetic change in osm_mtree_destroy function
  osm_trap_rcv.c: fix race condition during sweep
  osm_port_info_rcv.c: Reset client reregister bit only on a response to SET
  osm_trap_rcv.c: fix locking in trap_rcv_process_request()
  osm_trap_rcv.c: Removed unneeded lock when disabling port
  osm_opensm.c: Add missing ERR number
  osm_drop_mgr.c: Add missing assert

Ammar Haj Hamad (1):
  opensm: fix dfsssp uninitialized value

Bernd Schubert (2):
  reduce log level for missing partition configuration file.
  Try default partition config if parsing partitions.conf fails

Dan Ben Yosef (14):
  opensm/osm_qos_policy.c: fix memory leak when parsing policy file
  opensm/osmtest.c: half_world_query when creating inventory file
  osm_ucast_dfsssp.c: Fix memory leak in dfsssp_do_dijkstra_routing
  opensm: fix possible double free in osm_ucast_ftree.c
  osm_ucast_ftree.c : fix dereferencing null variable
  osm_ucast_ftree.c: fix dereferencing null variable
  osm_ucast_cache.c : Fix dereference null return value
  osm_sa_sminfo_record.c: fix sminfo sa query returns all sminfo records when filtering by LID in osm_smir_rcv_process
  osm_db_files.c : Fix resource leak guid2lid parser
  osm_helper.c: Fix out-of-bounds read
  Fix handling of get P_KeyTable on timeout
  Fix possible use of lid 0 when sending set PortInfo after failure of the first PortInfo set
  Fix timeout handling for pkeyGet for sw port 0
  Fix dropping node after setPkey mad

Daniel Klein (7):
  libvendor/osm_vendor_ibumad.c: fix mad validation in case of multipath record response
  update man page and usage explanation for --lfts_file
  osm_pkey.c: Fix find common pkey bug fix
  osm_mcast_mgr.c: fixed missing error message number
  osm_sm_state_mgr.c: Fix race condition during sm_state_mgr_send_master_sm_info_req
  osm_mcast_mgr.c: Removed mcast_mgr_purge_tree_node due to code duplication
  osm_ucast_ftree.c: replace assert with error return value

Eitan Zahavi (1):
  libvendor/osm_pkt_randomizer.c: Fix broken compilation with vendor sim

Hal Rosenstock (65):
  opensm/ib_types.h: Add missing IB_PR_COMPMASK_SERVICEID define
  opensm/osm_torus.c: Cosmetic formatting changes
  opensm/ib_types.h: Add missing IB_MPR_COMPMASK_SERVICEID define
  opensm/osm_qos_policy.c: Fix source  destination GUID policy check
  opensm/osm_dump.c: Fix enhanced switch port 0 handling in print_node_report
  opensm

[ewg] [ANNOUNCE] ibsim 0.6 release

2014-01-30 Thread Hal Rosenstock
There is a new 0.6 release of ibsim.

Tarball is available in:
http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

md5sum:
d08e196d980e7c88066b3e5e25bf5432  ibsim-0.6.tar.gz

All component versions are from recent master branch. Full list of
changes is below.

Al Chu (4):
  Fix PerformanceSet parsing corner case
  Output error on bad input to PerformanceSet
  ibsim: Do not return allportselect for non-switches
  ibsim: Remove parse corner case with full ibnetdiscover output

Albert Chu (1):
  ibsim: Fix typo in help

Alex Netes (1):
  ibsim: fix double slash in include the path

David McMillen (1):
  umad2sim: Segmentation fault fix

Doron Shoham (1):
  ibsim: support xmitwait counters

Hal Rosenstock (14):
  ibsim/sim_mad.c: Add sim support for PerfMgt ClassPortInfo
  ibsim: Eliminate unused modified variable
  ibsim: Change lid print format to unsigned
  ibsim/sim.h: Better portinfo alignment in Port struct
  ibsim: Handle sim_init_net errors better
  ibsim/sim_net.c: Eliminate IsLEDInfoSupported PortInfo CapabilityMask bit
  ibsim/sim_mad.c: Eliminate a couple of unneeded blank lines
  ibsim/sim_mad.c: Better error handling in send_trap
  ibsim: Some cosmetic changes
  ibsim/sim_cmd.c: Cosmetic change to error message
  ibsim/sim_cmd.c: Fix help for error command
  ibsim/sim_cmd.c: Allow error command on attributes for switch port 0
  sim_client.c: Set issm flag on connect when SIM_SET_ISSM set
  ibsim/ibsim.c: Bump version to 0.6

Huang, Parry W (2):
  ibsim/ibsim.c: Fixed bug regarding undefined commands
  ibsim/sim_cmd.c: Fixed help output for dump command

Huang, Perry (3):
  ibsim/sim.h: Add support for optional performance attributes.
  ibsim/sim_mad.c: Add read/reset functions for optional performance attributes.
  ibsim/sim_cmd: Add command to set perf. counters.

Ira Weiny (1):
  tests/subnet_discover: And I found the other reason (Re: Found one reason libibnetdisc is slower than subnet_discover)

Jon Stanley (1):
  ibsim/sim_cmd: Fix compile errors

Nicolas Morey Chaisemartin (1):
  Socket name can be forced by exporting IBSIM_SOCKNAME before starting ibsim and/or preloading umad2sim so multiple simulator can run on the same system at the same time

Nicolas Morey-Chaisemartin (1):
  ibsim: Fixed custom release in SPEC file

Sasha Khapyorsky (23):
  tests/mcast_storm: code consolidation
  tests/mcast_storm: add mcmember parameters
  ibsim: remove libibcommon dependencies
  ibsim/sim_cmd.c: fix printf format
  ibsim: fix port initial state
  ibsim: fix LocalPortNum in PortInfo response
  tests/mcast_storm.c: fix gid file parser
  ibsim: fix C99 non complaint types
  mcast_storm.c: migrate to newer libibmad API
  tests/mcast_storm.c: fix numeric GID parser
  tests/mcast_storm.c: fix uninitialized variable use
  tests/mcast_storm.c: fix mcmember parameters initialization
  tests/mcast_storm: fix MGID command line parser
  tests/mcast_storm.c: support comments in gid file
  tests/mcast_storm.c: report error response status
  tests/subnet_discover: discover test utility
  tests/query_many: simple utility to send many SMP queries
  tests/subnet_discover: limit possible number of hops
  tests/subnet_discover: add --help option
  tests/subnet_discover: --maxsmps (-n) option
  tests/subnet_discover.c: print useful information
  tests/subnet_discover: report unresponded transactions
  tests/subnet_discover: verbose node discovery printout

hnr...@comcast.net (4):
  ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes
  ibsim/sim_client.c: In sim_client_init, return -1 on error
  ibsim: Eliminate unneeded argument in sim_client_init
  ibsim/sim_client.c: Eliminate unneeded qp param from sim_init

sebastien dugue (1):
  ibsim - Fix umad2sim build with glibc = 2.10




Re: [ewg] where can I find librsocket?

2013-08-21 Thread Hal Rosenstock
On 8/21/2013 4:34 AM, Richard Croucher wrote:
 Where can I obtain rsockets?

 I can't see it in OFED.

 Now that SDP has been dropped, I need an alternative means of accessing
 RDMA, particularly where I don't have access to source code.

 Sean's presentations look interesting but where's the code? Is it
 buried within rdma_cm? I could only find rsocket.h in my version.
 If so, which release did it appear in?

 I understood it was all user space code so I was expecting to find it in
 OFED rather than kernel.org. However, neither site seems to have it,
 although I did find a few patches referencing it on the latter.

 It seems to be so well buried that even Google can't find it anywhere.

rsockets is part of librdmacm. It first appeared in v1.0.16 (7/12/12)
but there are bug fixes in v1.0.17 (3/6/13) and some beyond that but not
yet released.

I think v1.0.17 is probably the minimum version to use if not the latest
from Sean's git tree (git.openfabrics.org/~shefty/librdmacm.git).
Also, there are release tarballs in
http://www.openfabrics.org/downloads/rdmacm/.

-- Hal


Re: [ewg] OpenSM 3.3.16 will not start

2013-04-12 Thread Hal Rosenstock
Hi again Mads,

On 4/12/2013 2:25 AM, Mads Boye wrote:
 Our worker nodes PXE boot from an image on a local administration machine.
 OpenSM has been compiled and installed into this image directly
 with no errors during ./configure, make, or make install.

Are you saying that worker nodes run OpenSM ?

Is the PXE booting over IB (or some other network) ?

-- Hal


Re: [ewg] OpenSM 3.3.16 will not start

2013-04-12 Thread Hal Rosenstock
Hi Mads,

On 4/12/2013 2:57 AM, Mads Boye wrote:
 Hi again.
 I fixed the issue myself.
 The issue was caused by the fact that I had libopensm.so.5.0.0 and a symlink
 libopensm.so.5 in /usr/lib64.
 I removed them and created a symlink in /usr/lib64 with
 ln -s /usr/local/lib/libopensm.so.5.1.1 libopensm.so.5, and now I am able
 to start opensm 3.3.16.

 My system is Ubuntu 12.04 x86_64.

Your system already had an older OpenSM on it and by default, opensm is
installed in /usr/local/... (lib, etc.). I suspect your library path had
/usr/lib64 ahead of /usr/lib or /usr/local/lib.

There are a number of ways around this. One is the way you found.
Another is to configure so that the libraries go into /usr/lib64.
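
One way to spot this kind of shadowing is to list every copy of the soname across the library directories. The Python sketch below does only that; it is not how the dynamic loader resolves libraries (the loader also consults its cache and environment settings), and the directory list in the example is an assumption taken from the paths mentioned in this thread.

```python
import os


def find_library_copies(soname, search_dirs):
    """Return (path, resolved target) for every copy of soname found, in search order."""
    copies = []
    for directory in search_dirs:
        candidate = os.path.join(directory, soname)
        # lexists() also reports dangling symlinks, which are worth seeing here.
        if os.path.lexists(candidate):
            copies.append((candidate, os.path.realpath(candidate)))
    return copies


if __name__ == "__main__":
    dirs = ["/usr/lib64", "/usr/lib", "/usr/local/lib"]
    for path, target in find_library_copies("libopensm.so.5", dirs):
        print(f"{path} -> {target}")
```

More than one line of output means an older library may be picked up ahead of the freshly built one.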

-- Hal

 Still, if this is not the correct mailing list please let me know, so I can
 post the solution in the right place.
 
 Mads Boye
 HPC  VMware Administrator  |  Information Service Technology
 
 Phone: (+45) 9940 3453  | Email:m...@ist.aau.dk |  Webhomes.ist.aau.dk/mb/
 Aalborg University | Selma Lagerløfs Vej 300 - 4.2.46 | Aalborg East |
 
 
 From: ewg-boun...@lists.openfabrics.org [ewg-boun...@lists.openfabrics.org]
 on behalf of Mads Boye [m...@its.aau.dk]
 Sent: 12 April 2013 08:27
 To: ewg@lists.openfabrics.org
 Subject: Re: [ewg] OpenSM 3.3.16 will not start
 
 Sorry, I forgot one thing.
 This is the output I get when I try to run opensm:
 
 mb@munchkin1:~$ sudo /usr/local/sbin/opensm
 [sudo] password for mb:
 -
 OpenSM 3.3.16
 Command Line Arguments:
  Log File: /var/log/opensm.log
 -
 /usr/local/sbin/opensm: relocation error: /usr/local/sbin/opensm: symbol osm_log_v2, version OPENSM_1.5 not defined in file libopensm.so.5 with link time reference
 
 Thank you.
 
 Mads Boye
 HPC  VMware Administrator  |  Information Service Technology
 
 Phone: (+45) 9940 3453  | Email:m...@ist.aau.dk |  Webhomes.ist.aau.dk/mb/
 Aalborg University | Selma Lagerløfs Vej 300 - 4.2.46 | Aalborg East |
 
 
 From: ewg-boun...@lists.openfabrics.org [ewg-boun...@lists.openfabrics.org]
 on behalf of Mads Boye [m...@its.aau.dk]
 Sent: 12 April 2013 08:25
 To: ewg@lists.openfabrics.org
 Subject: [ewg] OpenSM 3.3.16 will not start
 
 Hi.
 
 I'm trying to configure OFED in our cluster, and have therefore been
 installing OpenSM from source.
 The INSTALL file says to start OpenSM from /usr/local/sbin/opensm and, if
 this fails, to send an email with /var/log/opensm.log, so I hope this is the
 right mailing list; otherwise I'm sorry and hope that you can point me to
 the proper list.

 Our worker nodes PXE boot from an image on a local administration machine.
 OpenSM has been compiled and installed into this image directly with no
 errors during ./configure, make, or make install.
 
 I've rebooted the node.
 
 OFED is also installed directly in the image without any installation errors. 
 openibd starts correctly at boot.
 
 Hope that you are able to help me.
 
 Please let me know if you need further information.
 
 Best Regards
 
 Mads Boye.
 



Re: [ewg] OpenSM 3.3.16 will not start

2013-04-12 Thread Hal Rosenstock
Hi Mads,

On 4/12/2013 10:09 AM, Mads Boye wrote:
 Hi Hal.
 Yes, maybe I understood it incorrectly, but don't all nodes need OpenSM
 running in order for IB to work?

To clarify, a subnet needs to have at least 1 SM running. It's best if
OpenSM runs on a small number of admin nodes which are always present.

-- Hal

 The PXE booting is happening over an admin ethernet.

 Mads Boye
 HPC  VMware Administrator  |  Information Service Technology
 
 Phone: (+45) 9940 3453  | Email:m...@ist.aau.dk |  Webhomes.ist.aau.dk/mb/
 Aalborg University | Selma Lagerløfs Vej 300 - 4.2.46 | Aalborg East |
 
 
 



Re: [ewg] OpenSM 3.3.16 will not start

2013-04-12 Thread Hal Rosenstock
On 4/12/2013 11:17 AM, Mads Boye wrote:
 Hi Hal.
 Thank you for the clarification. The admin nodes must be connected to the
 InfiniBand network as well, I assume?

Yes.

-- Hal

 
 Best regards,
 
 Mads Boye.
 
 



Re: [ewg] [RESEND] Multiple fabrics and OpenSM

2013-03-06 Thread Hal Rosenstock
Hi Pavan,

On 3/6/2013 1:19 AM, pavan tc wrote:
 Hi,
 
 [Please reply to my mail ID since I am not part of the ewg mailing list]
 
 I am trying to use OpenSM to manage a simple two-node, back-to-back
 connected IB network.
 For more bandwidth I have two single-port cards in each node and have
 connected them as below:

 +----------+               +----------+
 |          |   IB LINK 1   |          |
 |          |---------------|          |
 |  Node 1  |               |  Node 2  |
 |          |   IB LINK 2   |          |
 |          |---------------|          |
 |          |               |          |
 +----------+               +----------+
 
 If I start OpenSM with no specific conf files, it binds to the first
 port and the other one cannot be used.
 I need to manually start it on the other with '-g' option if I have to
 use it. Most solutions I found on the internet are some flavour of
 manually starting the instances. I would like to avoid writing an init
 script that starts the required number of opensm instances with the
 right parameters.
 
 I wanted to automate this on every boot. Does opensm offer such
 mechanisms via some conf file settings?

There is /etc/init.d/opensmd script but it needs some mods to support
this dual subnet/OpenSM configuration.
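
As a sketch of what such a modification would need to do, the Python below builds one opensm command line per local port GUID using `-g`. The function name, GUID values and log file naming are made up for the example; `-B` (daemon mode) and `-f` (log file) are used so the instances do not share a terminal or a log, and other per-instance settings may also be needed in practice.

```python
def opensm_commands(port_guids, log_dir="/var/log"):
    """Build one daemonized opensm command line per local port GUID."""
    commands = []
    for index, guid in enumerate(port_guids, start=1):
        commands.append([
            "opensm",
            "-B",                                       # run in the background
            "-g", f"0x{guid:016x}",                     # bind this instance to one port
            "-f", f"{log_dir}/opensm-port{index}.log",  # separate log per instance
        ])
    return commands


if __name__ == "__main__":
    # Placeholder port GUIDs; take the real ones from ibstat on the node.
    for command in opensm_commands([0x0002C90300000001, 0x0002C90300000011]):
        print(" ".join(command))
```

A boot script could then run each printed command once, one per back-to-back link.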

What distro are you using ?

-- Hal

 Thanks,
 Pavan
 
 



Re: [ewg] strange value of IB port rate

2013-02-20 Thread Hal Rosenstock
On 2/20/2013 6:51 AM, Mahesh Chaudhari wrote:
 Hi All,
 I have a Mellanox dual port IB card installed in my machine.
 When I do:
  cat /sys/class/infiniband/mthca0/ports/1/rate
 it shows 20 Gb/sec (4X DDR)

 while on port 2
  cat /sys/class/infiniband/mthca0/ports/2/rate
 it shows 8.5 Gb/sec (4X)

 When I looked at the source code, I found an equation to calculate the rate:

     rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed;

     return sprintf(buf, "%d%s Gb/sec (%dX%s)\n",
                    rate / 10, rate % 10 ? ".5" : "",
                    ib_width_enum_to_int(attr.active_width), speed);

 where
  ib_width_enum_to_int(attr.active_width) = 1 | 4 | 8 | 12 | -1 (error)
  attr.active_speed = 1 | 2 | 4

 I am wondering how it is possible to get such an odd value (8.5 Gb/sec).
 
 /usr/ofed/sbin/ibstatus utility also shows :
 
 Infiniband device 'mthca0' port 1 status:
 default gid: fe80::::001a:4bff:ff0c:96e5
 base lid: 0x6
 sm lid: 0x1
 state: 4: ACTIVE
 phys state: 5: LinkUp
 rate: 20 Gb/sec (4X DDR)
 link_layer: InfiniBand
 
 Infiniband device 'mthca0' port 2 status:
 default gid: fe80::::001a:4bff:ff0c:96e6
 base lid: 0x0
 sm lid: 0x0
 state: 1: DOWN
 phys state: 2: Polling
 rate: 8.5 Gb/sec (4X)
 link_layer: InfiniBand
 
 
 
 
 Any Clue ???

In the case of DOWN ports, rate is meaningless and should be ignored.
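
For reference, the quoted formula can be worked through for valid width and speed values with the Python sketch below. It keeps the rate in tenths of Gb/sec as the C code does (25 means 2.5 Gb/sec per lane); the speed suffix table is an assumption based on the "4X DDR" and "4X" strings in the output above.

```python
# Suffix per active_speed value; SDR prints no suffix (assumed from the output above).
SPEED_SUFFIX = {1: "", 2: " DDR", 4: " QDR"}


def format_rate(width, speed):
    """Mirror the quoted formula: rate is held in tenths of Gb/sec."""
    rate = 25 * width * speed
    fraction = ".5" if rate % 10 else ""
    return f"{rate // 10}{fraction} Gb/sec ({width}X{SPEED_SUFFIX.get(speed, '')})"


if __name__ == "__main__":
    for width, speed in [(1, 1), (4, 1), (4, 2)]:
        print(format_rate(width, speed))
```

With width 4 and speed 2 this gives the "20 Gb/sec (4X DDR)" seen on port 1.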

-- Hal

 
 



Re: [ewg] strange value of IB port rate

2013-02-20 Thread Hal Rosenstock
On 2/20/2013 7:41 AM, Mahesh Chaudhari wrote:
 We have a cluster of 16 nodes. All of the other 15 nodes show the same
 value, i.e. 10 Gb/sec (4X); only this node differs.

What's different about this node ? Same HCAs ? Same firmware ? Same
kernel ? Same user space packages ?

IB spec says:
C14-24.2.1: If PortInfo:Portstate=Down, then
• a SubnGet(PortInfo) shall produce valid data for PortInfo:PortState
and PortInfo:PortPhysicalState; whether any other component has
valid data is vendor-dependent

The sysfs rate file and ibstatus rely on those vendor-dependent values.
Some additional checking could be introduced into those utilities to avoid
this confusion, but they're largely just pretty printers for the returned
information.

-- Hal
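
As an illustration of the kind of additional checking mentioned above, a printer could refuse to show a rate unless the port state makes it meaningful. This is only a sketch under assumptions: `print_rate_checked` and the enum names are hypothetical and are not part of ibstatus or the kernel; the numeric state values match the "1: DOWN" and "4: ACTIVE" lines in the ibstatus output quoted earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>

/* PortInfo:PortState values; the enumerator names are illustrative. */
enum port_state { PORT_DOWN = 1, PORT_INIT = 2, PORT_ARMED = 3, PORT_ACTIVE = 4 };

/*
 * Writes the rate only when it can be trusted.
 * Returns 1 if a rate was written, 0 if a placeholder was written instead.
 */
int print_rate_checked(char *buf, size_t len, enum port_state state, int rate_x10)
{
    if (state == PORT_DOWN || rate_x10 < 0) {
        snprintf(buf, len, "n/a (port down)");
        return 0;
    }
    snprintf(buf, len, "%d%s Gb/sec", rate_x10 / 10, rate_x10 % 10 ? ".5" : "");
    return 1;
}
```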

 
 


Re: [ewg] Changing min_rnr_timer and timeout attributes of QP

2013-02-07 Thread Hal Rosenstock
On 2/7/2013 1:29 AM, Hefty, Sean wrote:
 I am trying to change the min_rnr_timer and timeout attributes of a QP from
 the Linux kernel (RHEL 6.3). The QP is created using rdma_create_qp() of the
 rdma_cm module and not ib_create_qp(), as I want generic code that works
 over IB as well as RoCEE.
 
 These cannot be changed directly.  The min_rnr_timeout is set to 0, which is
 the maximum.  The timeout value is determined based on the packet lifetime
 reported by the SA.  You may be able to configure the SM to set a
 higher/lower packet lifetime.

If you are using OpenSM, packet lifetime comes from the subnet timeout
as long as advanced QoS is not configured, so in that case you can set
the subnet_timeout option. The default is 0x12, which is
4.096 usec * 2^18 ~ 1 sec.

-- Hal
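
The arithmetic behind the default can be written out as a small helper. This is a sketch, not OpenSM code: `ib_timeout_ns` is a hypothetical name, and the only facts used are the 4.096 usec * 2^n formula quoted above and the assumption that the exponent is a 5-bit field.

```c
#include <assert.h>
#include <stdint.h>

/* Timeout in nanoseconds for an IB timeout exponent: 4.096 usec * 2^exp. */
uint64_t ib_timeout_ns(unsigned int exp)
{
    assert(exp <= 31);            /* assumed 5-bit field */
    return (uint64_t)4096 << exp; /* 4.096 usec == 4096 ns */
}
```

With the default exponent 0x12 this gives 1073741824 ns, i.e. about 1.07 seconds.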

 - Sean


Re: [ewg] OpenSM - GID_AVAIL/GID_UNAVAIL GID_IN/GID_OUT events

2012-07-25 Thread Hal Rosenstock
Hi Pavan,

 Hi,
 
 Kindly include my mail ID in the response since I am not on the ewg
 mailing list.
 
 I would like to know if OpenSM supports GID_UNAVAILABLE and
 GID_AVAILABLE asynchronous events.
 ( Not sure if they are known by any other event name - GID_IN/GID_OUT ?? )

Yes, OpenSM supports these (traps 64/65).

 If I subscribe to subnet manager notices, would I be able to detect
 another node going down/coming up in the subnet?

Yes, if you subscribe for all LIDs in subnet.

-- Hal
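
For readers matching the two event names to the trap numbers mentioned above, a minimal lookup might look like the following. The function name `sm_trap_name` is hypothetical; the descriptive strings follow the usual naming of traps 64 and 65 in the InfiniBand specification and are not OpenSM identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a descriptive name for the SM trap numbers discussed above, or NULL. */
const char *sm_trap_name(unsigned int trap_num)
{
    switch (trap_num) {
    case 64: return "GID in service";      /* a GID became available */
    case 65: return "GID out of service";  /* a GID is no longer available */
    default: return NULL;                  /* not covered by this sketch */
    }
}
```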

 TIA,
 Pavan
 


Re: [ewg] OSM vendor directory

2012-06-18 Thread Hal Rosenstock
Hi Hector,

On 6/18/2012 4:14 PM, Hector Abrach wrote:
 Hello,
 
 I have a generic question about the vendor/ folder.
 
 It contains some osm_vendor_mlx_* files.
 
 1. Would someone tell me what these are used for?
 2. Are these unique to MLX4 devices?
 3. How do they relate to osm_vendor_ibumad*.c?
 
 Thank you very much for your support!!!

They're mainly historical/old vendor layers (mainly for reference at
this point).

OpenFabrics uses osm_vendor_ibumad.c (and osm_vendor_ibumad_sa.c) and
that is what's used on any driver that libibumad works over which
includes mlx4 and mthca.

-- Hal

 Hector Abrach


Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-22 Thread Hal Rosenstock
Hector,

On 12/21/2011 2:16 PM, Hector Abrach wrote:
 Hal,
 
 When an SMP times out how does the Linux kernel know it timed out?

The kernel MAD module maintains a timer list for outstanding
transactions and if no response is received before the timer expires, it
knows that transaction timed out. If the matching response is received,
it removes that transaction from that list. See timeout_sends in
drivers/infiniband/core/mad.c
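
The mechanism described above (a list of outstanding transactions, entries removed on a matching response or expired by a timer) can be sketched in a few lines. This is an illustrative model only, not the code in drivers/infiniband/core/mad.c; every name in it (`txn_table`, `txn_add`, `txn_complete`, `txn_expire`) is hypothetical, and time is an abstract counter.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8

struct txn {
    uint64_t tid;       /* transaction ID used to match a response */
    uint64_t deadline;  /* absolute time at which the send times out */
    int      in_use;
};

struct txn_table {
    struct txn slot[MAX_OUTSTANDING];
};

/* Records a sent request; returns 0 on success, -1 if the table is full. */
int txn_add(struct txn_table *t, uint64_t tid, uint64_t now, uint64_t timeout)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i] = (struct txn){ tid, now + timeout, 1 };
            return 0;
        }
    }
    return -1;
}

/* A matching response arrived: drop the entry. Returns 1 if one was found. */
int txn_complete(struct txn_table *t, uint64_t tid)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (t->slot[i].in_use && t->slot[i].tid == tid) {
            t->slot[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}

/* Expires every entry whose deadline has passed; returns how many timed out. */
int txn_expire(struct txn_table *t, uint64_t now)
{
    int expired = 0;

    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (t->slot[i].in_use && t->slot[i].deadline <= now) {
            t->slot[i].in_use = 0; /* a real stack would report a send error here */
            expired++;
        }
    }
    return expired;
}
```

The point of the model is that every request leaves the table exactly once, either through `txn_complete` or through `txn_expire`, so the sender always hears about it.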

 When the Linux kernel determines it timed out how does it signal OpenSM
 the timeout/retry/send? Through what function calls does this signal go
 through?
 
 I was noticing that cl_event_wait_on in vl15_poller() has a parameter
 passed as EVENT_NO_TIMEOUT should this be a time out or should the time
 out occur somewhere else? This is where it stalls.

Yes, I've already responded about this several times. I'm reasonably
sure that this is due to erroneous QP0/VL15 accounting due to lack of
timeouts.

 Do you know somewhere I could read a little bit more about the Linux
 Kernel timeout and how it interacts with OpenSM?

In terms of the kernel, look at:
linux/Documentation/infiniband/user_mad.txt
and
include/rdma/ib_mad.h and ib_user_mad.h

OpenSM uses osm_vendor_ibumad.c which is layered on top of libibumad. In
osm_vendor_ibumad.c, the send error callback is invoked for transaction
timeout in umad_receiver. For libibumad, see umad_status and umad_send
man pages.

-- Hal

 Thank you for the help and your insight.
 
 Hector Abrach
 
 

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-16 Thread Hal Rosenstock
Hector,

On 12/16/2011 11:59 AM, Hector Abrach wrote:
 Hal,
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 Based on the inherent nature of the QNX kernel I don't believe we have a
 timeout/retry/send on it. This may be the reason I see the bootup
 freeze. If it is, I may have to implement this somehow.

I think that's the OpenSM side of the failure as a timed out transaction
never times out so the MAD accounting is wrong, etc. It breaks that
fundamental assumption.

There may also be some issue with the SMA implementation on your QNX
nodes which is the root cause. Of course, SMPs are unreliable so
timeout/retries can be needed...
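
Why a missing timeout path leaves the counter stuck can be shown with a toy model of the bookkeeping. This is not OpenSM code; the structure and function names are hypothetical, and the model only assumes what is said above: each sent SMP is counted as outstanding until either a response or a send error (timeout) is delivered.

```c
#include <assert.h>

/* Illustrative model of the SM-side bookkeeping; not OpenSM code. */
struct smp_stats {
    int outstanding;  /* requests sent but not yet accounted for */
};

void smp_sent(struct smp_stats *s)       { s->outstanding++; }
void smp_response(struct smp_stats *s)   { s->outstanding--; }
void smp_send_error(struct smp_stats *s) { s->outstanding--; } /* timeout path */

/*
 * Sends 'sent' requests, of which 'dropped' never get a response.
 * If has_timeouts is zero, dropped requests are never reported back.
 * Returns the outstanding count once all events have been delivered.
 */
int simulate_sweep(int sent, int dropped, int has_timeouts)
{
    struct smp_stats s = { 0 };

    assert(dropped >= 0 && dropped <= sent);
    for (int i = 0; i < sent; i++)
        smp_sent(&s);
    for (int i = 0; i < sent - dropped; i++)
        smp_response(&s);
    if (has_timeouts)
        for (int i = 0; i < dropped; i++)
            smp_send_error(&s);
    return s.outstanding;
}
```

With four SMPs sent and one dropped, the model ends at 0 outstanding when timeouts are delivered and at 1 when they are not, which matches the symptom of the counter stalling at 1.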

 However, for the time being at least, I believe that setting
 OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
 works reliably. But it would be nice to know why it freezes anyway, maybe
 because of the above.
 
 Thus far I've been unsuccessful in failing with debug property -D 0x23
 but I'll keep trying.

That slows things down enough to make it work as does 1 SMP outstanding.
It appears when SMPs are pipelined, some get dropped...

-- Hal

 Thank you
 
 Hector Abrach
 
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Date: 12/15/2011 01:21 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,

 I managed to get it to fail with Debug information -D 0x08. Attached is
 the log file.
 I'll dig deeper it seems is pkey related maybe...
 
 Yes, I saw signs of that last night from the log you sent where it
 stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
 was that or not. I didn't check how many pairs of the pkey tables you
 got back here to validate whether every port responded with the proper
 number of pkey table blocks.
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 -- Hal
 
 Once again thank you for your support.



 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/14/2011 08:29 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 



 Hector,

 On 12/14/2011 5:49 PM, Hector Abrach wrote:
 Hal,

 I got the system to fail with verbose enabled after 25 reboots. Please
 find attached the log file.


 I can see the responses but not the requests. What verbosity level did
 you use ?

 I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
 boot process in multi-switch systems and make the boot process faster
 correct?

 It's multinode not just multiswitch and this configuration is 8 nodes (1
 switch + 7 CAs). It's not boot process but discovery/initialization
 which is pipelined.

 Since my system is a single switch system I do not need to have
 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.

 You can run with 1 if that suits your needs. It's just not the default.

 Maybe the pipelined SMP's are confusing the switch some how.

 Even if it did, there's nothing that should stop the SM from
 working/proceeding. From the log, it looks like the SM does get stuck.

 -- Hal

 Thanks again for your help.

 Hector Abrach



Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-15 Thread Hal Rosenstock
Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:
 Hal,
 
 Thank you for the response. To address your questions:
 
 So the switch stays up and the servers (including the one OpenSM is on)
 is rebooted, right ?
 
 Right.
 
 Do the servers run QNX rather than Linux ? Are you saying all OpenSM
 code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
 
 Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
 changes I had to make were made to some #define libraries.
 The big changes were made for the driver, not so much OpenSM. 

I would think there are also changes for porting of complib to QNX. Do
you use osm_vendor_ibumad.c as the OpenSM vendor layer ?

 I'm using IBNet 1.3. 

What's IBNet 1.3 ? I'm not familiar with that.

 OpenSM always runs on the same one server, the others don't
 run it.

Understood.

 Is the topology the 7 servers and the 1 switch and if you use other
 switches you don't see this issue ?
 
 That's correct, the topology is 7 servers and 1 switch. We typically use
 less servers (4) for our application but the problem is more easily
 reproducible with more servers so we have a 7 server setup with 1
 switch. We don't have a great selection of switches but I know our
 previous switch did not cause this problem. Our intention is to go to
 production with this new switch but we can't release until we find an
 acceptable solution.
 
 I can see the responses but not the requests. What verbosity level did
 you use ?
 
 I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to
 do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...
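
The -D value is a bit mask, so it helps to decode it. The sketch below assumes the bit assignments listed for -D in the opensm man page (0x01 error, 0x02 info, 0x04 verbose, 0x08 debug, 0x10 funcs, 0x20 frames, 0x40 routing); the table and the function name `describe_log_flags` are hypothetical and are not part of OpenSM.

```c
#include <assert.h>
#include <stdio.h>

struct log_bit { unsigned int bit; const char *name; };

/* Assumed bit assignments, following the opensm man page description of -D. */
static const struct log_bit log_bits[] = {
    { 0x01, "error" }, { 0x02, "info" }, { 0x04, "verbose" },
    { 0x08, "debug" }, { 0x10, "funcs" }, { 0x20, "frames" },
    { 0x40, "routing" },
};

/* Writes a comma-separated list of enabled levels; returns how many were written. */
int describe_log_flags(unsigned int flags, char *buf, size_t len)
{
    int count = 0;
    size_t used = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof log_bits / sizeof log_bits[0]; i++) {
        if (!(flags & log_bits[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? "," : "", log_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated; stop appending */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Under those assumptions, 0x23 decodes to error, info and frames, which is the combination suggested above for seeing the SMPs themselves.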

 -
 
 In summary:
 1. The system gets stuck in osm_vendor_ibumad.c -
    umad_receiver() - for(;;) but keeps running properly in
    main.c - osm_manager_loop().
 2. If I use -D 0xFF the problem is completely fixed.
 3. If I use an OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
    value the problem is completely fixed.
 4. The failure always occurs with a qp0_mads_outstanding of 1
    remaining.
 What do you think could be wrong?
 Do you think the driver could be the problem?

Yes. A likely suspect, which may be missing and causing this issue, is the
timeout/retry code for MAD transactions (built into the kernel MAD layer
in Linux), which triggers a send error (callback) if the timeout/retries
are exhausted. Is that implemented ?

However, I don't have a good explanation for why you see this now and
not before with your other switches but maybe that's not important.

 What debug command should I use to see the sent requests?

See above.

-- Hal

 Thank you
 
 Hector Abrach
 
 
 
 

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,
 
 I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM ?

What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox MT1118X03342 switch.
 The problem occurs during the discovery phase within state_mgr_sweep_hop_1.
 
 However, I discovered that the actual cause is that
 qp0_mads_outstanding stalls at 1 occasionally.

Is it stuck or after timeout/retry does this get updated properly ?

 Within file osm_vl15intf.c, function vl15_poller checks the
 rfifo and, if the qlist still has items, calls vl15_send_mad,
 which later on triggers the signal.
 With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
 noticed that cl_qlist_end reaches zero before
 stats->qp0_mads_outstanding does. This causes a stall in
 cl_event_wait_on. The rfifo always reaches 0 when there are 4
 qp0_mads_outstanding; however, when it fails, it always fails when there is
 1 qp0_mads_outstanding.

Is some (request) SMP that OpenSM sent timing out (not being responded to) ?

 Have you seen this failure? By the way, I see this failure once every 15
 reboots approximately.
 
 I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
 problem.

What do you mean exactly by fixes the problem ? I'm not sure I
understand what the problem is yet.

-- Hal

 My guess is that there is a race condition when the switch sends 4 SMPs
 in parallel. Also, this failure only appears to occur at reboot. Another
 workaround, which is not acceptable, is adding a delay in the process;
 then the failure goes away. It is as if the switch needed more time to do
 something.
 
 I would really appreciate your help and insight.
 Thank you
 
 Hector Abrach


Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hector,

On 12/14/2011 12:14 PM, Hector Abrach wrote:
 Hal,
 
 Thank you very much for the support, I am the same person from the gmail
 account so I will respond through here.
 
 Attached is a picture of the switch serial number:

OK; I see now; that's an 8 port unmanaged QDR switch.

 I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
 system which I reboot via a script over and over again. Technically
 speaking the switch is not being powered off or physically rebooted. My
 server system is what is being rebooted.

So the switch stays up and the servers (including the one OpenSM is on)
is rebooted, right ?

 I am running OpenSM on one of
 the 7 servers. This means I'm constantly shutting down and rebooting
 OpenSM. I am running OpenSM on QNX but we have not had this problem
 until we decided to upgrade to this switch.

Do the servers run QNX rather than Linux ? Are you saying all OpenSM
code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?

Is the topology the 7 servers and the 1 switch and if you use other
switches you don't see this issue ?

 The problem is that in roughly 1 out of every 15 of these remote reboots,
 OpenSM stalls or times out because stats->qp0_mads_outstanding did not reach
 zero. Please excuse my ignorance as I'm relatively new at this, but how
 do I verify whether it is a timeout problem or a stall?
 
 You also mentioned that you'd like to see the verbose output of OpenSM;
 however, when I run in verbose mode I don't see the problem. It appears
 as if the verbose output delays things long enough to give the switch time
 to do whatever it needs to do, and hence the problem does not occur. But
 this is the last I see when the problem occurs:
 
 
 
 -
 OpenSM 3.3.12
 Command Line Arguments:
  Log file max size is 5 MBytes
  Log File: /tmp/opensm.log
 -
 OpenSM 3.3.12
 
 Entering DISCOVERING state
 
 Using default GUID 0x2c9020023277d
 

Is there anything interesting in the log file (when running normally not
with verbosity on) ?

 The problem occurs in function osm_vl15intf.c - vl15_poller in the else
 statement.
 
 if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
                 "Servicing p_madw = %p\n", p_madw);
         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
                 osm_dump_dr_smp(p_vl->p_log,
                                 osm_madw_get_smp_ptr(p_madw),
                                 OSM_LOG_FRAMES);
 
         vl15_send_mad(p_vl, p_madw);
 } else
         /*
            The VL15 FIFO is empty, so we have nothing left to do.
          */
         status = cl_event_wait_on(&p_vl->signal,
                                   EVENT_NO_TIMEOUT, TRUE);

 It won't move forward from the cl_event_wait_on in this line of code.

So it's stuck here forever and never gets past this ?

 However, there are other locations such as wait_for_pending_transactions
 in the do_sweep function that won't move forward from. But I believe
 this to be a side effect of the problem I'm mentioning.
 
 When you mention what is my timeout, I'm guessing you refer to
 max_smps_timeout which is used in the second while loop within
 vl15_poller? For this setting I am using the default which is defined in
 osm_subnet.c as:
 
 p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
 p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
 p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
                           * p_opt->transaction_retries;
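
The expression quoted above multiplies a per-transaction timeout in milliseconds by the retry count and by 1000, which suggests the result is in microseconds. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the example inputs in the note below (200 ms, 3 retries) are assumptions for illustration, not values taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the quoted expression: 1000 * timeout (ms) * retries. */
uint32_t max_smps_timeout_usec(uint32_t transaction_timeout_ms, uint32_t retries)
{
    return 1000u * transaction_timeout_ms * retries;
}
```

For example, an assumed 200 ms transaction timeout with 3 retries gives 600000, i.e. 0.6 seconds if the unit is microseconds.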
 
 Would you explain to me what are the advantages or disadvantages of
 OSM_DEFAULT_SMP_MAX_ON_WIRE? 

It allows for more SMPs to be outstanding on the IB wire which helps
with subnet discovery/initialization, etc. So limiting SMPs to 1 will
slow things down but maybe that doesn't matter in your subnet.

 Does this parameter change my bandwidth
 performance at all?

It's a minor amount of bandwidth and is used to limit the SMPs which
unlike other VLs are not flow controlled so you can overflow the
dedicated buffers for those if OpenSM or diag tools send too quickly.
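
The effect of the limit can be pictured as a simple send window: no more than a fixed number of SMPs are on the wire at once, and a new one is sent only when an earlier one has been accounted for. The sketch below is a hypothetical model, not the vl15_poller implementation; `vl15_window`, `window_pump` and `window_done` are names invented for the example.

```c
#include <assert.h>

/* Hypothetical sender state: queued requests and requests on the wire. */
struct vl15_window {
    int max_on_wire;   /* cap on unacknowledged SMPs, e.g. 4 or 1 */
    int on_wire;       /* currently outstanding */
    int queued;        /* waiting in the FIFO */
};

/* Sends as many queued SMPs as the window allows; returns how many were sent. */
int window_pump(struct vl15_window *w)
{
    int sent = 0;

    while (w->queued > 0 && w->on_wire < w->max_on_wire) {
        w->queued--;
        w->on_wire++;
        sent++;
    }
    return sent;
}

/* A response or send error frees one slot in the window. */
void window_done(struct vl15_window *w)
{
    assert(w->on_wire > 0);
    w->on_wire--;
}
```

With a window of 4 the sender pipelines requests; with a window of 1 it degenerates to strict request/response, which is slower but never has more than one SMP in flight.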

-- Hal

 I noticed that when using the default setting of 4 I get into the else
 of the above if statement when there are 4 qp0_mads_outstanding. I
 noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
 the failure I'm mentioning at all. Partly (I think) because I don't
 enter the else in the if statement until there is 1 qp0_mads_outstanding.
 
 I hope this explains the problem well enough and it may be a time out
 problem but I'd like to understand why the problem is occurring.
 Thank you very much,
 
 Hector Abrach
 
 

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:
 Hal,
 
 Sorry for the multiple emails, but I was thinking how it may be a
 freeze /stall rather than a time out.  One reason is that it doesn't
 send an error message, is as if the log completely dies.

So nothing interesting in the log...

 However, in
 file osm_vendor_ibumad.c under function umad_receiver there is an
 infinite loop for(;;) which seems to die when I get to that previously
 discussed vl15_poller. I checked to see if it breaks out of the loop but
 it doesn't seem to. 

It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.

-- Hal

 I'm not sure if this may be an additional hint.
 Thank you
 
 Hector Abrach
 
 

Re: [ewg] [Q] how to setup /etc/opensm/partitions.conf??

2011-10-27 Thread Hal Rosenstock
On Thu, Oct 27, 2011 at 7:05 AM, Hiroyuki Sato hiroys...@gmail.com wrote:

 Hello Hal and Richard.

 I found what is the problem.

 Q: Can I create a subinterface with pkey 0x0000 (0x8000)?


That uses an invalid pkey so this is not a valid configuration.

-- Hal
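
The relation between the partition number in partitions.conf and the child interface number can be made explicit. In this sketch the helper names are hypothetical; it assumes the usual P_Key layout, where the top bit marks full membership and the low 15 bits are the partition number, and treats a key whose low 15 bits are zero as invalid, which is the case the answer above refers to.

```c
#include <assert.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u  /* top bit: full membership */
#define PKEY_BASE_MASK   0x7fffu  /* low 15 bits: partition number */

/* Child interface pkey for a full member of the given partition. */
uint16_t pkey_full(uint16_t base)
{
    return (uint16_t)((base & PKEY_BASE_MASK) | PKEY_FULL_MEMBER);
}

/* A pkey whose low 15 bits are zero is not a valid partition key. */
int pkey_is_valid(uint16_t pkey)
{
    return (pkey & PKEY_BASE_MASK) != 0;
}
```

So partition 0x0001 corresponds to child interface 0x8001, while 0x8000 corresponds to partition 0x0000 and fails the validity check.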



 My test results:

  partitions.conf

   Default=0x7fff,: ALL=full  ;
   Net0=0x0000, ipoib : ALL=full ;
   Net1=0x0001, ipoib : ALL=full ;
   Net2=0x0002, ipoib : ALL=full ;

  1) Server1 ib.8000 - Server1 ib.8000
Ping NG

  2) Server1 ib.8001 - Server1 ib.8001

  The difference is the subinterface number.

 Yesterday, I always tested subinterface 0x8000.



 And about my previous post:

 The partitions.conf which I posted a few hours ago
 was missing semicolons.

 * partitions.conf was the following
   Default=0x7fff,  ipoib : ALL=full
   Net0=0x0001, ipoib : ALL=full

 It should be

   Default=0x7fff,  ipoib : ALL=full ;
   Net0=0x0001, ipoib : ALL=full ;

 Thank you again.


 2011/10/27 Hiroyuki Sato hiroys...@gmail.com:
  Hello Hal
 
  Thank you for your information.
 
  I simplified my test environment
  The environment is the following.
 
  I'm not sure what is wrong Test2 (Partition test)
  And Could you please tell me how to check the problem??
  (tool, logfile, etc.. )
 
  * Environment
 
   OS: Scientific Linux6.1
   OFED: 1.5.3.
 
  * Diagram
 
   +----------+ ib0      ib0 +----------+
   | Server1  |--------------| Server2  |
   +----------+              +----------+
 
 
 
 
  1) Test1 (Simple IPoIB no partition Ping test)
 
   Server1
 
 a) /sbin/ifconfig ib0 inet 192.168.1.1/24
 b) /sbin/service opensmd start
 * no /etc/opensm/partitions.conf
 
   Server2
 /sbin/ifconfig ib0 inet 192.168.1.2/24
 
   Test from server1(192.168.1.1) to server2(192.168.1.2)
 ping 192.168.1.2
 The result was OK.
 
  2) Test2 (Add partition by hand and Ping test)
 
   reboot Server 1 and Server 2
 
   Server 1
 1) echo 0x8001 > /sys/class/net/ib0/create_child
 2) /sbin/ifconfig ib0.8001 inet 192.168.0.1/24
 3) /sbin/service opensmd start
 
 * partitions.conf is the following
   Default=0x7fff,  ipoib : ALL=full
   Net0=0x0001, ipoib : ALL=full
 
   Server2
 1) echo 0x8001 > /sys/class/net/ib0/create_child
 2) /sbin/ifconfig ib0.8001 inet 192.168.0.2/24
 
   Test from server1(ib0.8001/192.168.0.1) to server2(ib0.8001/192.168.0.2)
 
 ping 192.168.0.2
 no response.
 
  3) Test2 Log
 
   I enabled the debug_level parameter in the ib_ipoib module.
 
   cd /sys/module/ib_ipoib/parameters
   # cat debug_level
   1
 
   Could you please tell me what is wrong??
 
   ib0: enabling connected mode will cause multicast packet drops
   ib0: mtu  2044 will cause multicast packet drops.
   ib0: mtu  2044 will cause multicast packet drops.
   ib1: enabling connected mode will cause multicast packet drops
   ib1: mtu  2044 will cause multicast packet drops.
   ib1: mtu  2044 will cause multicast packet drops.
   ib0: Event 17 on device mthca0 port 1
   ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
   ib0: Event 11 on device mthca0 port 1
   ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
   ib0: Event 9 on device mthca0 port 1
   ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
   ib0.8001: max_srq_sge=27
   ib0.8001: max_cm_mtu = 0xfff0, num_frags=16
   ib0.8001: bringing up interface
   ib0.8001: IPOIB_FLAG_OPER_UP not set
   ib0.8001: IPOIB_FLAG_OPER_UP not set
   ADDRCONF(NETDEV_UP): ib0.8001: link is not ready
   ib0.8001: IPOIB_FLAG_OPER_UP not set
 
  regards.
 
  2011/10/27 Hal Rosenstock hal.rosenst...@gmail.com:
 
 
  On Wed, Oct 26, 2011 at 10:03 AM, Hiroyuki Sato hiroys...@gmail.com
 wrote:
 
  Dear members.
 
  I have some question about Infiniband Partitions.
 
  I would like to build Linux box with IPoIB Router.
 
  * Questions
 
   My Question is the following.
 
   a) Can I create the following IPoIB network ??
 
   b) If so, how to setup /etc/opensm/partitions.conf??
 
  * Problem.
 
   Ping IB Client to Router : I got the following error.
 
   Network is unreachable.
 
  * Environment
 
   OFED: OFED-1.5.3.2
   OS:   Scientific Linux 6.1
 
  * Linux box (Router)
 
   1 ethernet interface  (eth0)
   1 Infiniband interface (ib0)
   Run subnet manager.
 
  * Ethernet setup
 
   a) add VLAN ID 120 eth0.120 (192.168.120.0/24)
   b) add VLAN ID 130 eth0.130 (192.168.130.0/24)
 
  * Infiniband Setup
   a) add subinterface 0 ib0.8000 (192.168.0.0/24)
   b) add subinterface 1 ib0.8001 (192.168.1.0/24)
   c) Run opensm
 
 
   I created the subinterfaces with the following commands.
 
   echo 0 > /sys/class/net/ib0/create_child
   echo 1 > /sys/class/net/ib0/create_child
 
 
  echo 0x8001 > /sys/class/net/ib0/create_child
  echo 0x8002 > /sys/class/net/ib0/create_child
 
 
   and assign ip address with ifconfig
 
  * Logical Diagram
 
  +--- 192.168.120.0/24

Re: [ewg] [Q] how to setup /etc/opensm/partitions.conf??

2011-10-27 Thread Hal Rosenstock
On Thu, Oct 27, 2011 at 7:48 AM, Hiroyuki Sato hiroys...@gmail.com wrote:

 Hello Hal

 Thank you for your information.

 I read this documents.  PP 45.

 http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_user_manual_1_5_1.pdf

  Decide on the PKey to be used in the subnet.
  Valid values are 0-255.


Valid values for pkey are 1-32767 (1-0x7fff).


  The actual PKey used is a 16-bit number with
  the most significant bit set.
  For example, a value of 0 will give a
  PKey with the value 0x8000.

 Is this wrong??


Pkey 0/0x8000 is invalid so this is wrong. What it's showing about turning
on the full membership bit in the pkey is correct. A better example would be
showing a value of 1 being transformed into 0x8001 pkey.
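
The transformation described above can be sketched as follows. This is an illustrative helper only: the function names are hypothetical, the valid range is the 1-32767 stated earlier in this reply, and the interface naming simply mirrors the ib0.8001 example from the thread:

```python
FULL_MEMBER_BIT = 0x8000  # most significant bit of the 16-bit pkey


def full_member_pkey(base):
    """Return the pkey with the full-membership bit set; reject invalid base values."""
    if not 1 <= base <= 0x7FFF:
        raise ValueError(f"invalid pkey base value: {base:#06x}")
    return base | FULL_MEMBER_BIT


def child_interface_name(parent, base):
    """Name of the IPoIB child interface as it appears in this thread, e.g. ib0.8001."""
    return f"{parent}.{full_member_pkey(base):04x}"
```

A base value of 1 gives 0x8001 and hence ib0.8001, while a base value of 0 is rejected instead of silently producing 0x8000.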

-- Hal




 2011/10/27 Hal Rosenstock hal.rosenst...@gmail.com:
 
 
  On Thu, Oct 27, 2011 at 7:05 AM, Hiroyuki Sato hiroys...@gmail.com
 wrote:
 
  Hello Hal and Richard.
 
  I found what is the problem.
 
  Q: Can I create subinterface with 0 (0x8000)???
 
 
  That uses an invalid pkey so this is not a valid configuration.
 
  -- Hal
 
 
  My Test result.
 
   partitions.conf
 
Default=0x7fff,: ALL=full  ;
Net0=0x0000, ipoib : ALL=full ;
Net1=0x0001, ipoib : ALL=full ;
Net2=0x0002, ipoib : ALL=full ;
 
   1) Server1 ib.8000 - Server1 ib.8000
 Ping NG
 
   2) Server1 ib.8001 - Server1 ib.8001
 
   The difference is the subinterface number.
 
  Yesterday,  I always tested subinterface 0x8000.
 
 
 
  And about previous post,
 
  The partitions.conf which I posted a few hours ago
  was missing the semicolons.
 
  * partitions.conf is the following
Default=0x7fff,  ipoib : ALL=full
Net0=0x0001, ipoib : ALL=full
 
  Should be
 
Default=0x7fff,  ipoib : ALL=full ;
Net0=0x0001, ipoib : ALL=full ;
 
  Thank you again.
 
 
  2011/10/27 Hiroyuki Sato hiroys...@gmail.com:
   Hello Hal
  
   Thank you for your information.
  
   I simplified my test environment
   The environment is the following.
  
   I'm not sure what is wrong Test2 (Partition test)
   And Could you please tell me how to check the problem??
   (tool, logfile, etc.. )
  
   * Environment
  
OS: Scientific Linux6.1
OFED: 1.5.3.
  
   * Diagram
  
+----------+ ib0      ib0 +----------+
| Server1  |--------------| Server2  |
+----------+              +----------+
  
  
  
  
   1) Test1 (Simple IPoIB no partition Ping test)
  
Server1
  
  a) /sbin/ifconfig ib0 inet 192.168.1.1/24
  b) /sbin/service opensmd start
  * no /etc/opensm/partitions.conf
  
Server2
  /sbin/ifconfig ib0 inet 192.168.1.2/24
  
Test from server1(192.168.1.1) to server2(192.168.1.2)
  ping 192.168.1.2
  The result was OK.
  
   2) Test2 (Add partition by hand and Ping test)
  
reboot Server 1 and Server 2
  
Server 1
  1) echo 0x8001 > /sys/class/net/ib0/create_child
  2) /sbin/ifconfig ib0.8001 inet 192.168.0.1/24
  3) /sbin/service opensmd start
  
  * partitions.conf is the following
Default=0x7fff,  ipoib : ALL=full
Net0=0x0001, ipoib : ALL=full
  
Server2
  1) echo 0x8001 > /sys/class/net/ib0/create_child
  2) /sbin/ifconfig ib0.8001 inet 192.168.0.2/24
  
Test from server1(ib0.8001/192.168.0.1) to
   server2(ib0.8001/192.168.0.2)
  
  ping 192.168.0.2
  no response.
  
   3) Test2 Log
  
I enabled the debug_level parameter in the ib_ipoib module.
  
cd /sys/module/ib_ipoib/parameters
# cat debug_level
1
  
Could you please tell me what is wrong??
  
ib0: enabling connected mode will cause multicast packet drops
ib0: mtu  2044 will cause multicast packet drops.
ib0: mtu  2044 will cause multicast packet drops.
ib1: enabling connected mode will cause multicast packet drops
ib1: mtu  2044 will cause multicast packet drops.
ib1: mtu  2044 will cause multicast packet drops.
ib0: Event 17 on device mthca0 port 1
ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
ib0: Event 11 on device mthca0 port 1
ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
ib0: Event 9 on device mthca0 port 1
ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
ib0.8001: max_srq_sge=27
ib0.8001: max_cm_mtu = 0xfff0, num_frags=16
ib0.8001: bringing up interface
ib0.8001: IPOIB_FLAG_OPER_UP not set
ib0.8001: IPOIB_FLAG_OPER_UP not set
ADDRCONF(NETDEV_UP): ib0.8001: link is not ready
ib0.8001: IPOIB_FLAG_OPER_UP not set
  
   regards.
  
   2011/10/27 Hal Rosenstock hal.rosenst...@gmail.com:
  
  
   On Wed, Oct 26, 2011 at 10:03 AM, Hiroyuki Sato hiroys...@gmail.com
 
   wrote:
  
   Dear members.
  
   I have some question about Infiniband Partitions.
  
   I would like to build Linux box with IPoIB Router.
  
   * Questions
  
My Question is the following.
  
a) Can I create the following IPoIB network ??
  
b) If so, how

Re: [ewg] [Q] how to setup /etc/opensm/partitions.conf??

2011-10-26 Thread Hal Rosenstock
On Wed, Oct 26, 2011 at 10:03 AM, Hiroyuki Sato hiroys...@gmail.com wrote:

 Dear members.

 I have some question about Infiniband Partitions.

 I would like to build Linux box with IPoIB Router.

 * Questions

  My Question is the following.

  a) Can I create the following IPoIB network ??

  b) If so, how to setup /etc/opensm/partitions.conf??

 * Problem.

  Ping IB Client to Router : I got the following error.

  Network is unreachable.

 * Environment

  OFED: OFED-1.5.3.2
  OS:   Scientific Linux 6.1

 * Linux box (Router)

  1 ethernet interface  (eth0)
  1 Infiniband interface (ib0)
  Run subnet manager.

 * Ethernet setup

  a) add VLAN ID 120 eth0.120 (192.168.120.0/24)
  b) add VLAN ID 130 eth0.130 (192.168.130.0/24)

 * Infiniband Setup
  a) add subinterface 0 ib0.8000 (192.168.0.0/24)
  b) add subinterface 1 ib0.8001 (192.168.1.0/24)
  c) Run opensm


  I created the subinterfaces with the following commands.

  echo 0 > /sys/class/net/ib0/create_child
  echo 1 > /sys/class/net/ib0/create_child


echo 0x8001 > /sys/class/net/ib0/create_child
echo 0x8002 > /sys/class/net/ib0/create_child


  and assign ip address with ifconfig

 * Logical Diagram

               +------- 192.168.120.0/24 -------
               |
               |    +------- 192.168.130.0/24
      eth0.120 |    | eth0.130
            +----------+
            |  Router  |
            +----------+
      ib0.8000 |    | ib0.8001
   192.168.0.1 |    +------- 192.168.1.0/24
               |
               +------- 192.168.0.0/24 --- IB Client (ib0.8000)
                                            (192.168.0.2)

 * /etc/opensm/partitions.conf

  Default=0x7fff   : ALL=full
  Net0=0x00, ipoib : ALL=full
  Net1=0x01, ipoib : ALL=full


Default=0x7fff: ALL=full
Net0=0x0001, ipoib: ALL=full
Net1=0x0002, ipoib: ALL=full




 Could you please tell me what is wrong??


The one thing in the above I'm not sure of is whether the default partition
also needs the ipoib flag above so you might need to add that too if the
above doesn't work.

-- Hal



 Sincerely.

 --
 Hiroyuki Sato.


Re: [ewg] OFA Management maintainership

2011-02-09 Thread Hal Rosenstock
Sasha,

On Wed, Feb 9, 2011 at 2:13 PM, Sasha Khapyorsky sashakv...@gmail.com wrote:
 Hi,

 I'm finishing my work for Voltaire these days and wish to transfer
 my role as OFA management packages maintainer to
 Alex Netes ale...@voltaire.com, whom I have known for many years as a
 greatly experienced engineer and a very good and positive person.

 So starting from today his trees should be considered as master
 development trees:

        git://git.openfabrics.org/~alexnetes/libibumad
        git://git.openfabrics.org/~alexnetes/opensm
        git://git.openfabrics.org/~alexnetes/libibmad
        git://git.openfabrics.org/~alexnetes/infiniband-diags
        git://git.openfabrics.org/~alexnetes/ibsim

 It is also likely that in the near future maintainership of
 libibumad and infiniband-diags will be taken over by
 Ira Weiny wei...@llnl.gov.

Do you mean libibmad rather than libibumad ?

-- Hal


 I would like to wish Alex and Ira a lot of success in their roles.

 I would also like to thank the whole community for the good working time.

 I will still be reachable at my email address sashakv...@gmail.com, so feel
 free to contact me if you have any questions.

 Sasha
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] user SA notifications, redux

2010-10-27 Thread Hal Rosenstock

On 10/13/2010 10:17 AM, Mike Heinz wrote:


Way back in May I proposed this prototype for adding SA notifications to the 
verbs API, but no one ever said yes or no. Vlad - I'm not even sure you were 
part of the conversation at the time, it was originally about a new API, but 
you should be aware of it since the conversation changed to adding the 
user-space capability to libibverbs.

Now that 1.5.2 is out the door, can we revisit this and try to get this and the 
matching kernel changes into the next release?

===

API for Proposal for adding ib_usa to the Linux Infiniband Subsystem
Mike Heinz
Mon, 24 May 2010 12:31:16 -0700

I spent the weekend thinking about your feedback Friday, and I'm concerned that
it widens the scope too far beyond what the current code is meant to do.

ib_usa isn't meant to be a general GSI interface, it's meant to be a user API
for accessing the existing functionality of the existing ib_sa module. In
particular, ib_sa and ib_usa provide a mechanism for other processes to share
SA/SM notices and traps.

As I mentioned earlier, the reason ib_sa acts as a single access point for
SA/SM traps and notices is because traps and notices are sent to ports, not to
queue pairs and not to processes. That means only one entity can be subscribed
for notices and traps at any particular time, and must manage them, sharing
them out among all processes that are interested in them.


Is this intended to handle multiple applications 
subscribing/unsubscribing for the same report ?



Generalizing that to include other types of notices and traps would involve
non-trivial changes to the ib_sa and might impact other parts of the infiniband
subsystem, including the SM, since they would have to be rewritten to deal with
the possibility that another component is now managing all notices and traps.

Below you will find a proposed API for accessing the notifications
functionality of the existing ib_sa and ib_usa modules. This is pretty much
exactly what we are currently using, but since Sean has suggested rdma_cm is
better suited for multi-casting, they have been omitted.

Now, given that this API is stand-alone right now, it could still be added to
either libibumad or to libibverbs - but I like Sean's suggestion that it be
added to verbs, since the current security model restricts libibumad to root
access and because the existing API already makes use of libibverbs'
ibv_context data structure.

-- current ib_usa API 

/* InformInfo:TrapNumber */
enum {
 IBV_SA_SM_TRAP_GID_IN_SERVICE              = __constant_cpu_to_be16(64),
 IBV_SA_SM_TRAP_GID_OUT_OF_SERVICE          = __constant_cpu_to_be16(65),
 IBV_SA_SM_TRAP_CREATE_MC_GROUP             = __constant_cpu_to_be16(66),
 IBV_SA_SM_TRAP_DELETE_MC_GROUP             = __constant_cpu_to_be16(67),
 IBV_SA_SM_TRAP_PORT_CHANGE_STATE           = __constant_cpu_to_be16(128),
 IBV_SA_SM_TRAP_LINK_INTEGRITY              = __constant_cpu_to_be16(129),
 IBV_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN    = __constant_cpu_to_be16(130),
 IBV_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = __constant_cpu_to_be16(131),


Why aren't traps 144 and 145 also defined ?


 IBV_SA_SM_TRAP_BAD_M_KEY                   = __constant_cpu_to_be16(256),
 IBV_SA_SM_TRAP_BAD_P_KEY                   = __constant_cpu_to_be16(257),
 IBV_SA_SM_TRAP_BAD_Q_KEY                   = __constant_cpu_to_be16(258),
 IBV_SA_SM_TRAP_ALL                         = __constant_cpu_to_be16(0xffff)
};
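
The trap numbers in the enum are stored in network (big-endian) byte order. A small illustration, assuming nothing beyond the standard meaning of a cpu-to-be16 conversion; the Python function names are made up for this sketch:

```python
import sys


def wire_bytes(value):
    """The two bytes a 16-bit value occupies on the wire (big-endian)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    return value.to_bytes(2, "big")


def cpu_to_be16(value):
    """Integer whose in-memory layout on this host equals the big-endian bytes."""
    return int.from_bytes(wire_bytes(value), sys.byteorder)
```

On a little-endian host `cpu_to_be16(64)` is a different integer from 64, so a caller who passes a plain host-order trap number would register for the wrong trap.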

struct ibv_sa_event_channel;
struct ibv_sa_event;
struct ibv_sa_id;

/**
  * ibv_sa_create_event_channel - Open a channel used to report events.
  */
struct ibv_sa_event_channel *ibv_sa_create_event_channel();

/**
  * ibv_sa_destroy_event_channel - Close the event channel.
  * @channel: The channel to destroy.
  */
void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel);

/**
  * ibv_sa_get_event - Retrieves the next pending event; if no event is
  *   pending, waits for an event.
  * @channel: Event channel to check for events.
  * @event: Allocated information about the next event.
  *   Event should be freed using ibv_sa_ack_event().
  */
int ibv_sa_get_event(struct ibv_sa_event_channel *channel,
  struct ibv_sa_event **event);

/**
  * ibv_sa_ack_event - Free an event.
  * @event: Event to be released.
  *
  * All events which are allocated by ibv_sa_get_event() must be released,
  * there should be a one-to-one correspondence between successful gets
  * and acks.
  */
int ibv_sa_ack_event(struct ibv_sa_event *event);

/**
  * ibv_sa_register_inform_info - Registers to receive notice events.
  * @channel: Event channel to issue query on.
  * @device: Device associated with record.
  * @port_num: Port number of record.
  * @trap_number: InformInfo trap number to register for, in network byte
  *   order.


Nit: If trap_number is in 

Re: [ewg] OFED 1.5.2: libibumad.so version bump

2010-09-03 Thread Hal Rosenstock
On Thu, Sep 2, 2010 at 5:05 PM, Ira Weiny wei...@llnl.gov wrote:
 On Thu, 2 Sep 2010 13:35:44 -0700
 Hal Rosenstock hal.rosenst...@gmail.com wrote:

 Hi Ira,

 On Tue, Aug 31, 2010 at 3:49 PM, Ira Weiny wei...@llnl.gov wrote:
  On Tue, 31 Aug 2010 02:27:29 -0700
  Yevgeny Kliteynik klit...@gmail.com wrote:
 
  Hi all,
 
  In order to support RoCEE, a while ago I've added
  a new field to umad, thus introduced an ABI change.
 
  There already was a discussion on the linux-rdma list,
  but due to the proximity of the upcoming OFED 1.5.2
  release these concerns were raised again.
 
  So my question is, other than *general* concerns about
  changing ABI, is anybody aware of the *actual* problem
  that will be caused by this? Any customer/3rd party
  solution that would be affected by this?
 
  Because our MVAPICH depends on umad, libibumad.so.1 to be exact.[*]  These
  ABI changes (to v2 and v3) would have forced our users to recompile their
  codes.  We are maintaining the old ABI here until our next major release of
  CHAOS[#] to prevent this.
 
  I think the thing to remember is that many people are using Open Fabrics
  software, but are _not_ using OFED.  What is tested with OFED is not the
  only thing which might be using these libraries.  Our version of MVAPICH
  is a good example.

  I am certainly not the expert in this area and I know that many people have
  tried to make this point in the past, but I will say it here again.  Each
  of these Open Fabrics packages _must_ be maintained to stand on their own.
  Roland did this a long time ago with ibverbs.

  I think now is a good time to start discussing breaking up the management
  git tree so that these libraries can live on their own.

 How does breaking up the management git tree help with this issue ?

 Creating and tracking of branches to maintain ABI would be easier with
 separate git trees.  Furthermore, separate trees will help force the use of
 consistent ABI's and interfaces.

 For example, if I currently want OpenSM version 3.3.6 I get a management tree
 with version libibumad 1.3.5.  But this last ABI change to umad was only
 required for the latest infiniband-diags (ibstat utility).  Why do I get all
 this cruft when pulling the latest OpenSM?

That's a totally different issue than which packages a particular OFED
release picks up.


 To me, that's the admin part and is separate from the ABI issue
 raised.

 Yes it is separate.  That is why I created another thread to discuss those
 issues.


 The ABI compatibility is not achieved by administrative means
 (separate repos, etc.) but rather by review and discipline,
 treating it as an immutable goal.

 I agree that ABI compatibility will require more discipline.  That is what
 made me think of the separate git trees.  I feel it will be _easier_ to
 maintain this discipline when the trees are separate.

Call me a skeptic but I think the same thing would've occurred with
separate git trees.

I have no real preference one way or the other. In fact, this was
discussed in the early days which are long forgotten. The libraries
are small and umad is a lot more stable than mad. It would just mean a
lot of busy work for everyone with internal trees.

-- Hal


 Ira


 -- Hal

  I will write a separate email regarding this.
 
  Ira
 
  [*] We are looking into removing the dependency.
  [#] Shameless plug:
  http://code.google.com/p/chaos-release/wiki/CHAOS_Description
 
 
  -- Yevgeny
 
 
 
  --
  Ira Weiny
  Math Programmer/Computer Scientist
  Lawrence Livermore National Lab
  925-423-8008
  wei...@llnl.gov
 



 --
 Ira Weiny
 Math Programmer/Computer Scientist
 Lawrence Livermore National Lab
 925-423-8008
 wei...@llnl.gov


Re: [ewg] OFED 1.5.2: libibumad.so version bump

2010-09-02 Thread Hal Rosenstock
Hi Ira,

On Tue, Aug 31, 2010 at 3:49 PM, Ira Weiny wei...@llnl.gov wrote:
 On Tue, 31 Aug 2010 02:27:29 -0700
 Yevgeny Kliteynik klit...@gmail.com wrote:

 Hi all,

 In order to support RoCEE, a while ago I've added
 a new field to umad, thus introduced an ABI change.

 There already was a discussion on the linux-rdma list,
 but due to the proximity of the upcoming OFED 1.5.2
 release these concerns were raised again.

 So my question is, other than *general* concerns about
 changing ABI, is anybody aware of the *actual* problem
 that will be caused by this? Any customer/3rd party
 solution that would be affected by this?

 Because our MVAPICH depends on umad, libibumad.so.1 to be exact.[*]  These ABI
 changes (to v2 and v3) would have forced our users to recompile their codes.
 We are maintaining the old ABI here until our next major release of CHAOS[#]
 to prevent this.
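
The dependency described above hinges on the major number in the soname. As a rough sketch (the helper names are hypothetical and the rule is simplified to "same library name and same major number"), the check could look like this:

```python
import re


def soname_major(soname):
    """Split a soname such as 'libibumad.so.1' into (library name, major version)."""
    match = re.fullmatch(r"(lib[A-Za-z0-9_+-]+)\.so\.(\d+)", soname)
    if match is None:
        raise ValueError(f"not a versioned soname: {soname!r}")
    return match.group(1), int(match.group(2))


def needs_rebuild(built_against, installed):
    """True when a binary linked against one soname cannot load the installed one."""
    return soname_major(built_against) != soname_major(installed)
```

A binary linked against libibumad.so.1 therefore needs a rebuild once only libibumad.so.3 is installed, which is the situation the paragraph above is trying to avoid.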

 I think the thing to remember is that many people are using Open Fabrics
 software, but are _not_ using OFED.  What is tested with OFED is not the
 only thing which might be using these libraries.  Our version of MVAPICH is a
 good example.

 I am certainly not the expert in this area and I know that many people have
 tried to make this point in the past, but I will say it here again.  Each of
 these Open Fabrics packages _must_ be maintained to stand on their own.
 Roland did this a long time ago with ibverbs.

 I think now is a good time to start discussing breaking up the management
 git tree so that these libraries can live on their own.

How does breaking up the management git tree help with this issue ?
To me, that's the admin part and is separate from the ABI issue
raised. The ABI compatibility is not achieved by administrative means
(separate repos, etc.) but rather by review and discipline,
treating it as an immutable goal.

-- Hal

 I will write a separate email regarding this.

 Ira

 [*] We are looking into removing the dependency.
 [#] Shameless plug: 
 http://code.google.com/p/chaos-release/wiki/CHAOS_Description


 -- Yevgeny



 --
 Ira Weiny
 Math Programmer/Computer Scientist
 Lawrence Livermore National Lab
 925-423-8008
 wei...@llnl.gov


Re: [ewg] RFC: Splitting of the management git tree in Open Fabrics

2010-09-02 Thread Hal Rosenstock
Ira,

On Tue, Aug 31, 2010 at 3:51 PM, Ira Weiny wei...@llnl.gov wrote:
 As I briefly mentioned in an email to Yevgeny regarding libibumad ABI's; I
 believe it is time to break up the management git tree.

I think these are separate issues. What are you trying to achieve by
doing this ?

-- Hal

 With GA of OFED 1.5.2 scheduled for Sept 13, I would like to request comments
 from the community about the following split after that GA.

 On openfabrics.org/git split management.git into the following trees.

 openfabrics.org/git/infiniband-diags.git
 openfabrics.org/git/libibumad.git
 openfabrics.org/git/libibmad.git
 openfabrics.org/git/opensm.git

 Sasha can populate those from the current management tree.  We believe there
 are git commands which will do this without losing any history from the git
 trees.

 Vlad, what changes would you have to make in the OFED build to accommodate
 these packages being in separate git trees?

 Any other concerns or comments?

 Thanks,
 Ira

 --
 Ira Weiny
 Math Programmer/Computer Scientist
 Lawrence Livermore National Lab
 925-423-8008
 wei...@llnl.gov


[ewg] DHCP over InfiniBand Update

2010-08-31 Thread Hal Rosenstock
Hi,

There appear to be two basic approaches to supporting DHCP (over
InfiniBand) in Linux. There's LPF support (4.1.1 based) and older
(3.0.4 based) socket support.

The 4.1.1 LPF patches are:
http://lists.openfabrics.org/pipermail/ewg/2010-May/015265.html
http://lists.openfabrics.org/pipermail/ewg/2010-May/015266.html
http://lists.openfabrics.org/pipermail/ewg/2010-May/015264.html
The last being Matthieu Hautreux's matthieu.hautreux at cea.fr
improved XID generation (same as
https://lists.isc.org/mailman/htdig/dhcp-hackers/2009-January/001773.html).

AFAICT an LPF based approach will only work on older kernels (due to
elimination of CONFIG_FILTER support). Is this accurate ?

OFED has two patches for 3.0.4 for a socket approach in
http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=tree;f=dhcp;h=aec68a2905559c8ed91f1157fa11d78cccb266cd;hb=ofed_1_5
dhcp-3.0.4.patch
0001-Make-DHCP-server-print-HW-info.patch

I've been forward-porting those to a 4.x based DHCP and have a fundamental
question which occurs even with the 3.0.4 socket based version. On the
client machine, the DHCPOFFER in response to the DHCPDISCOVER is
received (seen with tcpdump) but never seems to make it to the
dhclient application. I can't see any kernel stack error counters
incremented so I'm mystified as to what could be going wrong. I've
also tried this on a number of different kernels. Any idea on why this
might be or how to figure out where that packet is going ? I do see
the dhcp client port with netstat -a --udp -n
udp        0      0 0.0.0.0:68              0.0.0.0:*
udp        0      0 0.0.0.0:68              0.0.0.0:*
Any idea on what I'm missing ?

Also, is any of this work making its way into a released DHCP ?
What's the process for this ? Is there some branch in a source
repository where this work is available ?

Thanks in advance for any pointers on all this.

-- Hal


Re: [ewg] Infiniband Interoperability

2010-07-08 Thread Hal Rosenstock
On Wed, Jul 7, 2010 at 8:04 PM, David Brean david.br...@oracle.com wrote:
 Correct, a SM hasn't been released for OpenSolaris, yet.

 Looks like a very unusual multicast address because it doesn't have the IPoIB 
 or Subnet Administrator signature.

Yes, it's something in Windows that does that. Not sure what it's used
for. Sean asked about it last week but there has been no response as
yet.

-- Hal


 -David

 On 7/7/10 6:37 PM, Matt Breitbach wrote:
 We disconnected one port on the IB card that was a dual port card.  The
 second port was not configured, so I can't imagine it caused problems,
 but it is completely disconnected now.

 As for a Subnet Manager on OpenSolaris - there isn't one. I believe they
 do have one for Solaris, but I do not believe that it's been released to
 OpenSolaris, and I can't find it anywhere on our system.

 

 *From:* rich...@informatix-sol.com [mailto:rich...@informatix-sol.com]
 *Sent:* Thursday, July 01, 2010 12:54 AM
 *To:* Matt Breitbach; ewg@lists.openfabrics.org
 *Subject:* Re: [ewg] Infiniband Interoperability

 When I had multiple SM's running none reported it as a problem.
 Sun developed their own for Solaris. I can't recall now what they called it.

 The other possibility I've seen cause problems with IPoIB is having 2
 ports on the same IP subnet. Either bond them or disable ARP responses
 on one port. This is due to the broadcast simulation across multicast.


 Richard

 - Reply message -
 From: Matt Breitbach matth...@flash.shanje.com
 Date: Wed, Jun 30, 2010 19:32
 Subject: [ewg] Infiniband Interoperability
 To: rich...@informatix-sol.com, ewg@lists.openfabrics.org

 The Mellanox switch as far as I can tell does not have any SM running. It
 is a pretty dumb switch and there really isn't much to configure on it.



 LID 6 is the LID that OpenSM is running on - which is our CentOS 5.5 blade.
 I believe that it's reporting the issue since it's the Subnet Manager.



 The only other possibility is that there is a subnet manager running on the
 OpenSolaris box, but I have not been able to find one to blame this on. I
 would also think that in the OpenSM.log file I would find some reports of an
 election of some sort if there were multiple SM's running.



 LID listing :



 LID 1 - SuperMicro 4U running OpenSolaris (InfiniHost EX III PCI-E card w/
 128MB RAM)

 LID 2 - Blade Server currently running CentOS 5.5 and Xen (ConnectX
 Mezzanine card)

 LID 3 - InfiniScale III Switch

 LID 4 - SuperMicro 4U running OpenSolaris (InfiniHost EX III PCI-E card w/
 128MB RAM - 2nd port)

 LID 5 - Blade Server running Windows 2008R2 (ConnectX Mezzanine card)

 LID 6 - Blade Server running CentOS 5.5 and OpenSM (ConnectX Mezzanine card)

 LID 7 - Blade Server running Windows 2008 (InfiniHost EX III Mezzanine card)



 As for toggling the enable state - according to ibdiagnet the lowest
 connected rate member is at 20Gbps, but the network is only operating at
 10Gbps. I'm not sure which system I would toggle the enable state for.



 -Matt Breitbach

 _

 From: rich...@informatix-sol.com [mailto:rich...@informatix-sol.com]
 Sent: Wednesday, June 30, 2010 1:14 PM
 To: Matt Breitbach; ewg@lists.openfabrics.org
 Subject: Re: [ewg] Infiniband Interoperability



 I'm still suspicious that you have more than one SM running. Mellanox
 switches have it enabled by default.
 It's common that ARP requests, as caused by ping, will result in multicast
 group activity.
 Infiniband creates these on demand and tears them down if there are no
 current members. There is no broadcast address. It uses a dedicated MC
 group.
 They all seem to originate to LID 6 so you can trace the source.

 If you have ports at non optimal speeds, try toggling their enable state.
 This often fixes it.

 Richard

 - Reply message -
 From: Matt Breitbach matth...@flash.shanje.com
 Date: Wed, Jun 30, 2010 15:33
 Subject: [ewg] Infiniband Interoperability
 To: ewg@lists.openfabrics.org

 Well, let me throw out a little about the environment :



 We are running one SuperMicro 4U system with a Mellanox InfiniHost III EX
 card w/ 128MB RAM. This box is the OpenSolaris box. It's running the
 OpenSolaris Infiniband stack, but no SM. Both ports are cabled to the IB
 Switch to ports 1 and 2.



 The other systems are in a SuperMicro Bladecenter. The switch in the
 BladeCenter is an InfiniScale III switch with 10 internal ports and 10
 external ports.



 3 blades are connected with Mellanox ConnectX Mezzanine cards. 1 blade is
 connected with an InfiniHost III EX Mezzanine card.



 One of the blades is running CentOS and the 1.5.1 OFED release. OpenSM is
 running on that system, and is the only SM running on the network. This
 blade is using a ConnectX Mezzanine card.



 One blade is running Windows 2008 with the latest OFED drivers installed.
 It is using an InfiniHost III EX Mezzanine card.



 One blade is running Windows 2008 R2 

Re: [ewg] [ANNOUNCE] management tarballs release

2010-07-08 Thread Hal Rosenstock
Hi Sasha,

On Sat, May 22, 2010 at 5:43 PM, Sasha Khapyorsky sas...@voltaire.com wrote:
 Hi,

 There is a new release of the management (OpenSM and infiniband
 diagnostics) tarballs available in:

 http://www.openfabrics.org/downloads/management/

 (listed in http://www.openfabrics.org/downloads/management/latest.txt)

 md5sum:

 d3586e7a17bca99fd384a943f00e259e  libibumad-1.3.5.tar.gz
 754d93f567393d3b9987a65326f40917  libibmad-1.3.5.tar.gz
 5c94d6ee49e9c51c801f6634823b5ad5  opensm-3.3.6.tar.gz
 ba28f6b5323e6067ca019a999eeaf907  infiniband-diags-1.5.6.tar.gz

Shouldn't these versions be labeled/tagged in your management git tree
? Would you do that ?

Thanks.

-- Hal

 All component versions are from recent master branch. Full list of
 changes is below.

 Sasha

Re: [ewg] git tree for ofed docs

2010-07-02 Thread Hal Rosenstock
On Fri, Jul 2, 2010 at 6:15 PM, Hefty, Sean sean.he...@intel.com wrote:
 Can someone point me to the git tree that contains the release notes/docs 
 that get pulled into the OFED releases?

http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=summary

The latest (for OFED 1.5) is:
http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=shortlog;h=ofed_1_5

-- Hal


Re: [ewg] Allowing ib dignostics to be run without being logged in as root.

2010-05-26 Thread Hal Rosenstock
On Tue, May 25, 2010 at 7:21 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

If you really want any user to do this, is changing umad permissions
sufficient ? This is less of a security hole than setuid but does open
things up for malicious users.

-- Hal

 I wanted to avoid doing this as it would allow some malicious user to
 just open /dev/umad and send random mads and cause big problems with the 
 fabric.

 I was thinking that if the applications like perfquery are trusted
 to not allow someone to do anything malicious, then having them
 run as setuid root would not open a security hole ?

I don't know exactly how setuid programs are exploited to obtain
general root access but I've heard this.

 sudo sounds like it would allow them to run any command as root ID,
 which I think is a larger security hole than just setting the one
 or few trusted applications to setuid root. But then, I am not a
 security expert so I may not know all of the possible issues with
 setting a command to setuid root.

sudo can be configured for specific commands to be allowed to specific users.

-- Hal
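
As a rough illustration of the per-command sudo approach, here is a minimal Python sketch that renders a sudoers line limiting one user to a fixed set of diagnostic tools. The user name `ibdiag` and the helper name `sudoers_rule` are hypothetical, and the tool paths are simply the ones mentioned in this thread; any generated rule should be checked (for example with `visudo -c -f <file>`) before it is installed.

```python
def sudoers_rule(user, commands, nopasswd=True):
    """Render one sudoers line letting `user` run only `commands` as root."""
    for cmd in commands:
        # sudoers command specs should be absolute paths, so reject anything else.
        if not cmd.startswith("/"):
            raise ValueError(f"command must be an absolute path: {cmd!r}")
    tag = "NOPASSWD: " if nopasswd else ""
    return f"{user} ALL=(root) {tag}{', '.join(commands)}"


# Hypothetical user "ibdiag"; tool paths taken from the thread.
print(sudoers_rule("ibdiag", ["/usr/sbin/perfquery", "/usr/sbin/ibnetdiscover"]))
```

Unlike setuid root on the binaries, this keeps the privilege decision in one admin-controlled file and limits it to named users.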


 woody





Re: [ewg] Allowing ib dignostics to be run without being logged in as root.

2010-05-26 Thread Hal Rosenstock
On Wed, May 26, 2010 at 12:29 PM, Informatix solutions
rich...@informatix-sol.com wrote:
 The issue is that it is entirely dependent on the security integrity of the
 application with the setuid bit set.
 If someone can insert code, or swap a dynamically linked library with their
 own alternative, it becomes possible to have your own code executed as root.
 The system is then completely compromised.

The IB diags do use dynamically linked libs (libibmad and libibumad).

-- Hal


 -Original Message-
 From: ewg-boun...@lists.openfabrics.org
 [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Woodruff, Robert J
 Sent: 26 May 2010 17:19
 To: Hal Rosenstock
 Cc: EWG
 Subject: Re: [ewg] Allowing ib dignostics to be run without being logged in
 as root.

 Hal wrote,

sudo can be configured for specific commands to be allowed to specific
 users.

 Then perhaps that is a safer way to do it, but it would put more work
 on the system admin to set it up for people, but if setting the permissions
 of the commands to setuid root opens up a security hole, we would not want
 that.

 Does anyone know if setting the permissions to setuid root does actually
 open up a security hole ?

 woody






Re: [ewg] Allowing ib dignostics to be run without being logged in as root.

2010-05-25 Thread Hal Rosenstock
On Tue, May 25, 2010 at 4:51 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:

 Hi Sasha,

 Some people were asking me if it would be possible to
 allow some of the IB diagnostic tools to be run without
 requiring being logged in as root. Would there be
 any problem in changing the installation to set their
 permissions to setuid root to allow this, i.e.,

 chmod +s /usr/sbin/ibnetdiscover
 chmod +s /usr/sbin/ibaddr
 chmod +s /usr/sbin/smpquery
 chmod +s /usr/sbin/perfquery

If you really want any user to do this, is changing umad permissions
sufficient ? This is less of a security hole than setuid but does open
things up for malicious users.

-- Hal
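
For anyone wanting to check whether an installation has already applied the `chmod +s` change, a minimal Python sketch follows. It only inspects file mode bits; the helper names `has_setuid` and `setuid_root_files` are made up for this example, and the path list is just the four tools named above.

```python
import os
import stat


def has_setuid(mode):
    """Return True if the setuid bit is set in a st_mode value."""
    return bool(mode & stat.S_ISUID)


def setuid_root_files(paths):
    """Return the subset of `paths` that exist, are owned by root, and are setuid."""
    found = []
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue  # missing or unreadable: nothing to report
        if st.st_uid == 0 and has_setuid(st.st_mode):
            found.append(path)
    return found


tools = ["/usr/sbin/ibnetdiscover", "/usr/sbin/ibaddr",
         "/usr/sbin/smpquery", "/usr/sbin/perfquery"]
print(setuid_root_files(tools))
```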

 woody



Re: [ewg] Allowing ib dignostics to be run without being logged in as root.

2010-05-25 Thread Hal Rosenstock
On Tue, May 25, 2010 at 5:51 PM, Hal Rosenstock
hal.rosenst...@gmail.com wrote:
 On Tue, May 25, 2010 at 4:51 PM, Woodruff, Robert J
 robert.j.woodr...@intel.com wrote:

 Hi Sasha,

 Some people were asking me if it would be possible to
 allow some of the IB diagnostic tools to be run without
 requiring being logged in as root. Would there be
 any problem in changing the installation to set their
 permissions to setuid root to allow this, i.e.,

 chmod +s /usr/sbin/ibnetdiscover
 chmod +s /usr/sbin/ibaddr
 chmod +s /usr/sbin/smpquery
 chmod +s /usr/sbin/perfquery

 If you really want any user to do this, is changing umad permissions
 sufficient ? This is less of a security hole than setuid but does open
 things up for malicious users.

IMO a better approach is to use sudo;

-- Hal


 -- Hal

 woody




Re: [ewg] Is management package is ready for OFED 1.5 alpha?

2010-03-12 Thread Hal Rosenstock
On Sun, Jul 19, 2009 at 10:01 AM, Sasha Khapyorsky sas...@voltaire.com wrote:
 On 06:35 Thu 16 Jul     , Hal Rosenstock wrote:
 
  I have couple of changes on the queue, then we will release.

 Are you going to handle your patch backlog before releasing ?

 Yes. And also note that we will have couple of management releases
 during OFED 1.5 cycle.

There are a non trivial number of outstanding patches which have not
even been commented on/considered which were submitted with more than
sufficient time for several review cycles to make OFED 1.5 (and some
even earlier as there are several outstanding for over a year now).

Any idea when the outstanding patches will be considered/commented on
so we can start to move them forward ?

-- Hal

 Sasha


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 6:02 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:

 I have 2 systems that have Mellanox dual port connectx
 cards with one of the ports connected to an SDR switch
 and the other port direct connected.

 With today's OFED-1.5.1 daily build, the OpenSM does
 not seem to transition the port all the way up.
 If I use OFED-1.5.1-rc1, it works fine.

Has there been any change between those two in the management space ?


 [r...@woody-10 woody]# /etc/init.d/opensmd start
 Starting IB Subnet Manager.                                [  OK  ]

Based on the below, I'm presuming OpenSM runs on port 1.

 [r...@woody-10 woody]# /usr/sbin/ibstat
 CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x0002c90300044fa8
        System image GUID: 0x0002c90300044fab
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x0002c90300044fa9

What state is the peer port in ? Any interesting OpenSM log messages ?

-- Hal

        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c90300044faa
 [r...@woody-10 woody]#
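
A quick way to spot ports stuck below Active in output like the above is to pull out the per-port logical state. The following is a minimal Python sketch that assumes the `Port N:` / `State:` layout shown in this message; the function name `port_states` is made up, and the sample is an abbreviated copy of the output above.

```python
import re


def port_states(ibstat_text):
    """Map port number -> logical state from ibstat-style text."""
    states = {}
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = int(m.group(1))
        elif port is not None and line.startswith("State:"):
            # "Physical state:" lines do not start with "State:", so they are skipped.
            states[port] = line.split(":", 1)[1].strip()
    return states


SAMPLE = """\
CA 'mlx4_0'
        Number of ports: 2
        Port 1:
                State: Armed
                Physical state: LinkUp
        Port 2:
                State: Initializing
                Physical state: LinkUp
"""

not_active = {p: s for p, s in port_states(SAMPLE).items() if s != "Active"}
print(not_active)
```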

Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Has there been any change between those two in the management space ?

 I am not sure on that, but there must be some changes because it
 works with RC1 but fails with today's daily build.

Could it be changes to mlx driver ?

What state is the peer port in ? Any interesting OpenSM log messages ?

 The peer port on the other node is in the Initializing state.

And that's an SDR switch port ?

 Here is the tail of the opensm log file.


 Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
 Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on 
 port 0x0002c90300044fa9
 Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
 Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x123b
 Using default GUID 0x2c90300044fa9
 Entering MASTER state

 Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
 SUBNET UP

 Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
 Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
 Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1240
 Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
 Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
 Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1244

Looks like the switch SMA is not responding ? Can you try some smpquerys to it ?

Is this reproducible in this environment ?
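
To see at a glance which error codes dominate a log tail like the one quoted above, the `ERR nnnn:` markers can be tallied. This is a minimal Python sketch, assuming only the `ERR <4 hex digits>:` pattern visible in these lines; the function name `count_errors` is made up, and the sample lines are unwrapped copies of lines from the quoted log.

```python
import re
from collections import Counter

# Matches the "ERR 5411:" style markers seen in the quoted log.
ERR_RE = re.compile(r"\bERR ([0-9A-F]{4}):")


def count_errors(log_text):
    """Count occurrences of each OpenSM error code in log text."""
    return Counter(ERR_RE.findall(log_text))


SAMPLE = """\
Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
"""

for code, n in count_errors(SAMPLE).most_common():
    print(code, n)
```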

Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:08 PM, Hal Rosenstock
hal.rosenst...@gmail.com wrote:
 On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J
 robert.j.woodr...@intel.com wrote:
 Hal wrote,

Has there been any change between those two in the management space ?

 I am not sure on that, but there must be some changes because it
 works with RC1 but fails with today's daily build.

 Could it be changes to mlx driver ?

What state is the peer port in ? Any interesting OpenSM log messages ?

 The peer port on the other node is in the Initializing state.

 And that's an SDR switch port ?

 Here is the tail of the opensm log file.


 Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
 Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on 
 port 0x0002c90300044fa9
 Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
 Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x123b
 Using default GUID 0x2c90300044fa9
 Entering MASTER state

 Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
 SUBNET UP

 Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
 Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
 Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1240
 Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
 Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
 Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1244

 Looks like the switch SMA is not responding ? Can you try some smpquerys to 
 it ?

Also, try rebooting that switch.


 Is this reproducible in this environment ?


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:16 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Could it be changes to mlx driver ?

 Guess we need to look at what has changed since RC1.

And that's an SDR switch port ?

 Yes, this is a very very old 8 port Mellanox SDR switch.

Looks like the switch SMA is not responding ? Can you try some smpquerys to 
it ?

 I re-loaded the OFED-1.5.1-rc1 code, it seems to work fine, so I do not 
 suspect the switch,

Makes sense.

 unless the latest OpenSM or MLX driver is sending some MAD to the switch SMA 
 that it
 does not understand.

Is this reproducible in this environment ?

 Yes. happens every time.

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily
build ? I suspect that will work and would show it's mlx4 as opposed
to management code but maybe I'll eat my words.


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:49 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily
build ? I suspect that will work and would show it's mlx4 as opposed
to management code but maybe I'll eat my words.


 I tried using the OpenSM from today's daily build on the core and driver
 from RC1 and it seems to work OK.

 Also, I was wrong when I said that port 1 was connected to an old SDR
 switch, my bad, in fact these 2 systems are direct connected. Perhaps
 the SMA in the driver got broken between RC1 and today's build ?

That's consistent with the timeouts in the OpenSM log. The problem is
likely in the mlx4 specific part of the SMA as I think there are
changes going on there (and not in the core SMA). Best to ask Mellanox
about this.


Re: [ewg] (no subject)

2009-10-18 Thread Hal Rosenstock
On Sat, Oct 17, 2009 at 2:49 PM, Mahmoud Hanafi mhan...@csc.com wrote:
 Hal Rosenstock hal.rosenst...@gmail.com wrote on 10/17/2009 04:34:55 AM:

 Hal Rosenstock hal.rosenst...@gmail.com
 10/17/2009 04:34 AM

 To

 Mahmoud Hanafi/DEF/c...@csc

 cc

 ewg@lists.openfabrics.org

 Subject

 Re: [ewg] (no subject)

 On Fri, Oct 16, 2009 at 2:20 PM, Mahmoud Hanafi mhan...@csc.com wrote:
  We have a linux cluster running RH5.3 with ofed1.4 using Mellanox
  MT25418.
  The cluster is attached to a sun solaris10.7 thumper box. The thumper box
  exports a zfs filesystem via NFS. Linux clients mount the filesystem via
  IPoIB.
 
  Under filesystem I/O load the subnet manager gets repeated path record
  requests from the sun solaris box.

 Do the path records all look the same or different in terms of
 destinations (and sources) ?

 There are path record lookups repeating for all the nodes. What I have
 posted below repeats, in addition to the lookups for all the other nodes.


 Is the source GUID (0x0003ba000100d0a5) the Solaris thumper port GUID
 (00-03-BA (hex) Sun Microsystems Inc.) ? The destination appears to be
 some HP device (00-23-7D (hex) Hewlett Packard).

 That is correct: 0x0003ba000100d0a5 is the Solaris thumper, and the
 destinations are HP blades. See ibnetdiscover output below.


  This can bring the SM and the fabric down.

 Are you referring to the load due to path requests or something else ?

 The Sun IB interface does so many path record lookups that it can't service
 NFS requests. This causes the nodes to lock up.

It looked like the right status if using Get or 1 record if GetTable
is returned at least in the OpenSM case. Not sure why the thumper
continues to rerequest the same path records.

 In addition, OpenSM sees a
 high load trying to service the requests. We were running the SM on the
 switch, where it was seeing a load of 3. That's when we moved it to the head node.
 It uses 30% of a core when servicing the requests.

This is normal in that it takes CPU to respond to the SA queries for
PathRecords to the thumper. The amount of CPU is dependent on the
request load.

 Running the OpenSM at the logging level you appear to be using would
 certainly slow things down greatly so I presume that was only done to
 look further into what was going on.

  Has anyone else had issues with Solaris IB - Linux IB?

 I haven't run Solaris - Linux IB in several years now but this used
 to work but there have been a lot of changes.

  Any insight into what could be causing the issue?

 Could you elaborate on the below ? I see one PathRecord response trace
 and an ibdiagnet run which shows a bad link at direct route 1,11,23
 from where that was run. You might want to debug the issue with that
 link.
 We have a second cluster that does the same thing. It doesn't have any bad
 links.

 What does the IPoIB subnet check warning mean?

-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps  grouprate:10Gbps

It means you should be able to increase the rate of your IPoIB broadcast group.
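
For reference, the `mtu = 4, rate = 6` values in the quoted OpenSM trace are encoded values rather than bytes or Gbps. Below is a minimal Python sketch of the decoding, assuming the standard InfiniBand MTU and static-rate encodings; the table and function names are illustrative and not taken from OpenSM.

```python
# InfiniBand MTU encoding -> bytes (assumed standard encoding).
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# InfiniBand static rate encoding -> Gbps (assumed standard encoding).
IB_RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}


def decode_path(mtu_code, rate_code):
    """Translate encoded path MTU and rate into (bytes, Gbps)."""
    return IB_MTU_BYTES[mtu_code], IB_RATE_GBPS[rate_code]


mtu_bytes, rate_gbps = decode_path(4, 6)
print(mtu_bytes, rate_gbps)
```

Under these assumptions, `decode_path(4, 6)` gives 2048 bytes and 20 Gbps, which lines up with the MTU and lowest member rate in the ibdiagnet warning; a 10 Gbps group rate would correspond to rate code 3.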

-- Hal


 -- Hal

 
  Thanks,
  Mahmoud
 
  
 
  Oct 15 19:37:59 952368 [41E02960] 0x08 - PathRecord dump:
  service id ..0x
  dgid  0xfe80 : 0x00237d949819
  sgid  0xfe80 : 0x0003ba000100d0a5
  dlid  0
  slid  0
  hop_flow_raw 0x0
  tclass .. 0x0
  num_path_revers. 0x81
  pkey  0x0
  qos_class ... 0x0
  sl ..0x0
  mtu .0x0
  rate  0x0
  pkt_life 0x0
  preference .. 0x0
  resv2 ... 0x0
  resv3 ... 0x0
  Oct 15 19:37:59 952376 [41E02960] 0x08 - osm_pr_rcv_process: Unicast destination requested
  Oct 15 19:37:59 952382 [41E02960] 0x08 - _osm_pr_rcv_get_port_pair_paths: Src port 0x0003ba000100d0a5, Dst port 0x00237d949819
  Oct 15 19:37:59 952388 [41E02960] 0x08 - _osm_pr_rcv_get_port_pair_paths: Src LIDs [2-2], Dest LIDs [67-67]
  Oct 15 19:37:59 952393 [41E02960] 0x08 - _osm_pr_rcv_get_lid_pair_path: Src LID 2, Dest LID 67
  Oct 15 19:37:59 952399 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
  Oct 15 19:37:59 952408 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x, sl = 0
  Oct 15 19:37:59 952417 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
  Oct 15 19:37:59 952423 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x, sl = 0
  Oct 15 19:37:59 952428 [41E02960] 0x08 - osm_sa_respond: Returning 1 records
  Oct 15 19:37:59 952433 [41E02960] 0x08

Re: [ewg] (no subject)

2009-10-17 Thread Hal Rosenstock
On Fri, Oct 16, 2009 at 2:20 PM, Mahmoud Hanafi mhan...@csc.com wrote:
 We have a linux cluster running RH5.3 with ofed1.4 using Mellanox MT25418.
 The cluster is attached to a sun solaris10.7 thumper box. The thumper box
 exports a zfs filesystem via NFS. Linux clients mount the filesystem via
 IPoIB.

 Under filesystem I/O load the subnet manager gets repeated path record
 requests from the sun solaris box.

Do the path records all look the same or different in terms of
destinations (and sources) ?

Is the source GUID (0x0003ba000100d0a5) the Solaris thumper port GUID
(00-03-BA (hex) Sun Microsystems Inc.) ? The destination appears to be
some HP device (00-23-7D (hex) Hewlett Packard).
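
The vendor guess above comes from the top 24 bits of the 64-bit port GUID, which carry the IEEE OUI. A minimal Python sketch of that extraction (the function name `guid_oui` is made up for this example):

```python
def guid_oui(guid):
    """Return the OUI (top three bytes) of a 64-bit GUID as 'XX-XX-XX'."""
    return "-".join(f"{(guid >> shift) & 0xFF:02X}" for shift in (56, 48, 40))


# Source port GUID from the PathRecord trace in this thread.
print(guid_oui(0x0003ba000100d0a5))
```

The resulting prefix can then be looked up in the IEEE OUI registry to identify the manufacturer.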

 This can bring the SM and the fabric down.

Are you referring to the load due to path requests or something else ?
Running the OpenSM at the logging level you appear to be using would
certainly slow things down greatly so I presume that was only done to
look further into what was going on.

 Has anyone else had issues with Solaris IB - Linux IB?

I haven't run Solaris - Linux IB in several years now but this used
to work but there have been a lot of changes.

 Any insight into what could be causing the issue?

Could you elaborate on the below ? I see one PathRecord response trace
and an ibdiagnet run which shows a bad link at direct route 1,11,23
from where that was run. You might want to debug the issue with that
link.

-- Hal


 Thanks,
 Mahmoud

 

 Oct 15 19:37:59 952368 [41E02960] 0x08 - PathRecord dump:
 service id ..0x
 dgid  0xfe80 : 0x00237d949819
 sgid  0xfe80 : 0x0003ba000100d0a5
 dlid  0
 slid  0
 hop_flow_raw 0x0
 tclass .. 0x0
 num_path_revers. 0x81
 pkey  0x0
 qos_class ... 0x0
 sl ..0x0
 mtu .0x0
 rate  0x0
 pkt_life 0x0
 preference .. 0x0
 resv2 ... 0x0
 resv3 ... 0x0
 Oct 15 19:37:59 952376 [41E02960] 0x08 - osm_pr_rcv_process: Unicast destination requested
 Oct 15 19:37:59 952382 [41E02960] 0x08 - _osm_pr_rcv_get_port_pair_paths: Src port 0x0003ba000100d0a5, Dst port 0x00237d949819
 Oct 15 19:37:59 952388 [41E02960] 0x08 - _osm_pr_rcv_get_port_pair_paths: Src LIDs [2-2], Dest LIDs [67-67]
 Oct 15 19:37:59 952393 [41E02960] 0x08 - _osm_pr_rcv_get_lid_pair_path: Src LID 2, Dest LID 67
 Oct 15 19:37:59 952399 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
 Oct 15 19:37:59 952408 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x, sl = 0
 Oct 15 19:37:59 952417 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
 Oct 15 19:37:59 952423 [41E02960] 0x08 - _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x, sl = 0
 Oct 15 19:37:59 952428 [41E02960] 0x08 - osm_sa_respond: Returning 1 records
 Oct 15 19:37:59 952433 [41E02960] 0x08 - osm_vendor_get: Acquiring UMAD for p_madw = 0x2a9567f2c8, size = 120
 Oct 15 19:37:59 952439 [41E02960] 0x08 - osm_vendor_get: Acquired UMAD 0x2a9567f390, size = 120
 Oct 15 19:37:59 952455 [41E02960] 0x08 - osm_vendor_put: Retiring UMAD 0x2a9567f390
 Oct 15 19:37:59 952460 [41E02960] 0x08 - osm_vendor_send: Completed sending response or unsolicited p_madw = 0x2a9567f2b0
 Oct 15 19:37:59 952466 [41E02960] 0x08 - osm_vendor_put: Retiring UMAD 0x724520

 ===

 Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
 -W- Topology file is not specified.
     Reports regarding cluster links will use direct routes.
 Loading IBDM from: /usr/lib64/ibdm1.2
 -I- Using port 1 as the local port.
 -I- Discovering ... 103 nodes (7 Switches & 96 CA-s) discovered.

 -I---
 -I- Bad Guids/LIDs Info
 -I---
 -I- No bad Guids were found

 -I---
 -I- Links With Logical State = INIT
 -I---
 -I- No bad Links (with logical state = INIT) were found

 -I---
 -I- PM Counters Info
 -I---
 -I- No illegal PM counters values were found

 -I---
 -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
 -I---
 -I- PKey:0x7fff Hosts:97 full:97 partial:0

 -I---
 -I- IPoIB Subnets Check
 -I---

 -I- Subnet: IPv4 

Re: [ewg] Re: [PATCH] perftest: Make rdma_lat, rdma_bw, and clock_test executable names rdma neutral

2009-10-05 Thread Hal Rosenstock
Hi again Ido,

On Mon, Oct 5, 2009 at 8:20 AM, Ido Shamai i...@dev.mellanox.co.il wrote:

 Hey Hal,

 Just applied the 2 patches (version 2 of the executable_names and
 executable permissions)

Thanks. What's the plan to apply the ofed_1_5 branch changes to the master ?

-- Hal


 Regards,
 Ido Shaay



 Hal Rosenstock wrote:

 Hi Ido,

 On Sun, Oct 4, 2009 at 6:29 AM, Ido Shamai i...@dev.mellanox.co.il
 wrote:


 Hey Hal ,

 Sorry about the delay on response , I've waited for instructions for the
 patch.
 All is good except the change - ib_clock_test  - to - rdma_clock_test,
 my manager says clock_test is not an rdma benchmark. So if you can just
 remove this change I'll be grateful


 I just submitted v2 of this as requested. Also, there's a minor
 outstanding patch on changing the file permissions in the perftest git
 repo.

 Thanks.

 -- Hal



 Regards
 Ido



 Hal Rosenstock wrote:


 Since rdma_lat and rdma_bw use RDMA CM, they can be used with both IB
 and
 iWARP so make their executable names neutral (by removing ib_)

 IB only tests only require linking with libibverbs

 Also, spec file change for executable name changes

 Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
 ---
 diff --git a/Makefile b/Makefile
 index 8042531..83c22c3 100755
 --- a/Makefile
 +++ b/Makefile
 @@ -1,7 +1,8 @@
 -TESTS = write_bw_postlist rdma_lat rdma_bw send_lat send_bw write_lat
 write_bw read_lat read_bw
 +RDMACM_TESTS = rdma_lat rdma_bw
 +TESTS = write_bw_postlist send_lat send_bw write_lat write_bw read_lat
 read_bw
  UTILS = clock_test
  -all: ${TESTS} ${UTILS}
 +all: ${RDMACM_TESTS} ${TESTS} ${UTILS}
  CFLAGS += -Wall -g -D_GNU_SOURCE -O2
  EXTRA_FILES = get_clock.c
 @@ -10,11 +11,18 @@ EXTRA_HEADERS = get_clock.h
  LOADLIBES +=  LDFLAGS +=
  -${TESTS}: LOADLIBES += -libverbs -lrdmacm
 +${RDMACM_TESTS} ${UTILS}: LOADLIBES += -libverbs -lrdmacm
 +${TESTS}: LOADLIBES += -libverbs
  -${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
 +${RDMACM_TESTS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
 +       $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES}
 $(LOADLIBES) $(LDLIBS) -o $@
 +${TESTS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
       $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES}
 $(LOADLIBES) $(LDLIBS) -o ib_$@
 +${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
 +       $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES}
 $(LOADLIBES) $(LDLIBS) -o rdma_$@
 +
  clean:
 -       $(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname})
 +       $(foreach fname,${RDMACM_TESTS} ${UTILS}, rm -f ${fname})
 +       $(foreach fname,${TESTS}, rm -f ib_${fname})
  .DELETE_ON_ERROR:
  .PHONY: all clean
 diff --git a/perftest.spec b/perftest.spec
 index bd234e1..81ca90a 100755
 --- a/perftest.spec
 +++ b/perftest.spec
 @@ -23,8 +23,8 @@ export CFLAGS=$RPM_OPT_FLAGS
  chmod -x runme
  %install
 -install -D -m 0755 ib_rdma_lat $RPM_BUILD_ROOT%{_bindir}/ib_rdma_lat
 -install -D -m 0755 ib_rdma_bw $RPM_BUILD_ROOT%{_bindir}/ib_rdma_bw
 +install -D -m 0755 rdma_lat $RPM_BUILD_ROOT%{_bindir}/rdma_lat
 +install -D -m 0755 rdma_bw $RPM_BUILD_ROOT%{_bindir}/rdma_bw
  install -D -m 0755 ib_write_lat $RPM_BUILD_ROOT%{_bindir}/ib_write_lat
  install -D -m 0755 ib_write_bw $RPM_BUILD_ROOT%{_bindir}/ib_write_bw
  install -D -m 0755 ib_send_lat $RPM_BUILD_ROOT%{_bindir}/ib_send_lat
 @@ -32,7 +32,7 @@ install -D -m 0755 ib_send_bw
 $RPM_BUILD_ROOT%{_bindir}/ib_send_bw
  install -D -m 0755 ib_read_lat $RPM_BUILD_ROOT%{_bindir}/ib_read_lat
  install -D -m 0755 ib_read_bw $RPM_BUILD_ROOT%{_bindir}/ib_read_bw
  install -D -m 0755 ib_write_bw_postlist
 $RPM_BUILD_ROOT%{_bindir}/ib_write_bw_postlist
 -install -D -m 0755 ib_clock_test
 $RPM_BUILD_ROOT%{_bindir}/ib_clock_test
 +install -D -m 0755 rdma_clock_test
 $RPM_BUILD_ROOT%{_bindir}/rdma_clock_test
  %clean
  rm -rf ${RPM_BUILD_ROOT}
 @@ -43,6 +43,8 @@ rm -rf ${RPM_BUILD_ROOT}
  %_bindir/*
  %changelog
 +* Sat Apr 18 2009 - hal.rosenst...@gmail.com
 +- Change executable names for rdma_lat, rdma_bw, and clock_test
  * Mon Jul 09 2007 - hvo...@suse.de
  - Use correct version
  * Wed Jul 04 2007 - hvo...@suse.de












[ewg] Re: Possible process deadlock in RMPP flow

2009-09-23 Thread Hal Rosenstock
On Wed, Sep 23, 2009 at 12:08 PM, Sean Hefty sean.he...@intel.com wrote:

 ibnetdiscover D 80149b8d 0 26968  26544
 (L-TLB)
  8102c900bd88 0046 81037e8e 81037e8e02e8
  8102c900bd78 000a 8102c5b50820 81038a929820
  011837bf6105 0ede 8102c5b50a08 0001
 Call Trace:
  [80064207] wait_for_completion+0x79/0xa2
  [8008b4cc] default_wake_function+0x0/0xe
  [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
  [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
  [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
  [80012e22] __fput+0xae/0x198
  [80023de6] filp_close+0x5c/0x64
  [800393df] put_files_struct+0x63/0xae
  [80015b26] do_exit+0x31c/0x911
  [8004971a] cpuset_exit+0x0/0x6c
  [8005e116] system_call+0x7e/0x83
 
 From the dump it seems that the process waits on the call to
 flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
 OFED 1.4.2.

 Roland just submitted a patch in this area yesterday.  I don't know if the
 patch
 would fix their issue, but it may be worth trying.  What kernel does 1.4.2
 map
 to?

 What RMPP messages does ibnetdiscover use?


None AFAIK.

-- Hal


 If the program is completing successfully, there may be a different race
 with the rmpp cleanup.  I'll see if anything else stands out in that area.

 - Sean

 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ewg] Re: [ofa-general] [PATCH] opensm: use mgrp pointer in port mcm_info

2009-09-16 Thread Hal Rosenstock
On Tue, Sep 15, 2009 at 7:26 AM, Hal Rosenstock hal.rosenst...@gmail.com wrote:

 On Tue, Sep 15, 2009 at 6:08 AM, Sasha Khapyorsky sas...@voltaire.com wrote:

 On 08:45 Mon 14 Sep , Hal Rosenstock wrote:
 
  Does this mean consolidate_ipv6_snm_req does not work now ?

 No, it doesn't. As you may remember 'consolidate_ipv6_snm_req'
 workaround does nothing with MGIDs to MLID mapping, but instead
 enforces all IPv6 SNM matching requests to join a single multicast
 group (MGID).


  Is consolidate_ipv6_snm_req working for you ?


Never mind; My bad. It's working...

-- Hal



 -- Hal



 Sasha




Re: [ewg] Perftest Maintainer.

2009-09-14 Thread Hal Rosenstock
Hi Ido,

On Sun, Sep 13, 2009 at 8:56 AM, Ido Shamai i...@dev.mellanox.co.il wrote:

 Hello and a good day to all of the ewg community,

 My name is Ido Shamay, username in OFA - shamoya, employee of Mellanox
 Technologies LTD.,
 and I've just become the new perftest maintainer on the OFA website.

Welcome!



 Patches, comments, remarks and more are to be sent to my mail -
 i...@dev.mellanox.co.il.
 The perftests can be downloaded at -
 http://www.openfabrics.org/downloads/perftest/
 Or cloned from the git repository -
 git://git.openfabrics.org/~shamoya/perftest.git.


When I try to clone your perftest repo, I get:
fatal: Not a git repository

-- Hal



 I'll update you on new features and every other change of the perftests.

 Regards ,
 Ido.



[ewg] Re: [ofa-general] [PATCH] opensm: use mgrp pointer in port mcm_info

2009-09-14 Thread Hal Rosenstock
On Sun, Sep 6, 2009 at 11:49 AM, Sasha Khapyorsky sas...@voltaire.com wrote:


 A port needs to access the multicast groups it has joined. This is currently
 implemented by keeping a list of mcm_info elements, each of which stores the
 MLID of one multicast group. Obviously this assumes a one-to-one MGID to
 MLID mapping model.



Does this mean consolidate_ipv6_snm_req does not work now ? If so, did OFED
1.5 Beta go out this way ? Also, what is the plan/timeframe to restore this
functionality ?

-- Hal



 This patch changes this so that mcm_info stores a pointer to the multicast
 group object (mgrp) instead of the MLID. Such a model makes it possible to
 have MGIDs to MLID compression.

 Signed-off-by: Sasha Khapyorsky sas...@voltaire.com
 ---
  opensm/include/opensm/osm_mcm_info.h |   13 +++--
  opensm/include/opensm/osm_port.h |   13 +++--
  opensm/opensm/osm_drop_mgr.c |   10 +++---
  opensm/opensm/osm_mcm_info.c |8 
  opensm/opensm/osm_port.c |   10 +-
  opensm/opensm/osm_sm.c   |6 +++---
  6 files changed, 29 insertions(+), 31 deletions(-)

 diff --git a/opensm/include/opensm/osm_mcm_info.h
 b/opensm/include/opensm/osm_mcm_info.h
 index dec607f..62ae326 100644
 --- a/opensm/include/opensm/osm_mcm_info.h
 +++ b/opensm/include/opensm/osm_mcm_info.h
 @@ -47,6 +47,7 @@
  #include <iba/ib_types.h>
  #include <complib/cl_qlist.h>
  #include <opensm/osm_base.h>
 +#include <opensm/osm_multicast.h>

  #ifdef __cplusplus
  #  define BEGIN_C_DECLS extern "C" {
 @@ -73,15 +74,15 @@ BEGIN_C_DECLS
  */
  typedef struct osm_mcm_info {
cl_list_item_t list_item;
 -   ib_net16_t mlid;
 +   osm_mgrp_t *mgrp;
  } osm_mcm_info_t;
  /*
  * FIELDS
  *  list_item
  *  Linkage structure for cl_qlist.  MUST BE FIRST MEMBER!
  *
 -*  mlid
 -*  MLID of this multicast group.
 +*  mgrp
 +*  The pointer to multicast group where this port is member of
  *
  * SEE ALSO
  */
 @@ -95,11 +96,11 @@ typedef struct osm_mcm_info {
  *
  * SYNOPSIS
  */
 -osm_mcm_info_t *osm_mcm_info_new(IN const ib_net16_t mlid);
 +osm_mcm_info_t *osm_mcm_info_new(IN osm_mgrp_t *mgrp);
  /*
  * PARAMETERS
 -*  mlid
 -*  [in] MLID value for this multicast group.
 +*  mgrp
 +*  [in] the pointer to multicast group.
  *
  * RETURN VALUES
  *  Pointer to an initialized tree node.
 diff --git a/opensm/include/opensm/osm_port.h
 b/opensm/include/opensm/osm_port.h
 index 7079e74..0e0d3d2 100644
 --- a/opensm/include/opensm/osm_port.h
 +++ b/opensm/include/opensm/osm_port.h
 @@ -65,6 +65,7 @@ BEGIN_C_DECLS
  */
  struct osm_port;
  struct osm_node;
 +struct osm_mgrp;

  /h* OpenSM/Physical Port
  * NAME
 @@ -1420,14 +1421,14 @@ osm_get_port_by_base_lid(IN const osm_subn_t *
 const p_subn,
  * SYNOPSIS
  */
  ib_api_status_t
 -osm_port_add_mgrp(IN osm_port_t * const p_port, IN const ib_net16_t mlid);
 +osm_port_add_mgrp(IN osm_port_t * const p_port, IN struct osm_mgrp *mgrp);
  /*
  * PARAMETERS
  *  p_port
  *  [in] Pointer to an osm_port_t object.
  *
 -*  mlid
 -*  [in] MLID of the multicast group.
 +*  mgrp
 +*  [in] Pointer to the multicast group.
  *
  * RETURN VALUES
  *  IB_SUCCESS
 @@ -1449,14 +1450,14 @@ osm_port_add_mgrp(IN osm_port_t * const p_port, IN
 const ib_net16_t mlid);
  * SYNOPSIS
  */
  void
 -osm_port_remove_mgrp(IN osm_port_t * const p_port, IN const ib_net16_t
 mlid);
 +osm_port_remove_mgrp(IN osm_port_t * const p_port, IN struct osm_mgrp
 *mgrp);
  /*
  * PARAMETERS
  *  p_port
  *  [in] Pointer to an osm_port_t object.
  *
 -*  mlid
 -*  [in] MLID of the multicast group.
 +*  mgrp
 +*  [in] Pointer to the multicast group.
  *
  * RETURN VALUES
  *  None.
 diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
 index c9a4f33..4891bb8 100644
 --- a/opensm/opensm/osm_drop_mgr.c
 +++ b/opensm/opensm/osm_drop_mgr.c
 @@ -158,7 +158,6 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN
 osm_port_t * p_port)
osm_port_t *p_port_check;
cl_qmap_t *p_sm_guid_tbl;
osm_mcm_info_t *p_mcm;
 -   osm_mgrp_t *p_mgrp;
cl_ptr_vector_t *p_port_lid_tbl;
uint16_t min_lid_ho;
uint16_t max_lid_ho;
 @@ -212,12 +211,9 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN
 osm_port_t * p_port)

   p_mcm = (osm_mcm_info_t *) cl_qlist_remove_head(&p_port->mcm_list);
   while (p_mcm != (osm_mcm_info_t *) cl_qlist_end(&p_port->mcm_list))
 {
 -   p_mgrp = osm_get_mgrp_by_mlid(sm->p_subn, p_mcm->mlid);
 -   if (p_mgrp) {
 -   osm_mgrp_delete_port(sm->p_subn, sm->p_log,
 -p_mgrp, p_port->guid);
 -   osm_mcm_info_delete((osm_mcm_info_t *) p_mcm);
 -   }
 +   osm_mgrp_delete_port(sm->p_subn, sm->p_log, p_mcm->mgrp,
 

Re: [ewg] Perftest Maintainer.

2009-09-14 Thread Hal Rosenstock
Ido,

On Mon, Sep 14, 2009 at 8:46 AM, Jeff Squyres jsquy...@cisco.com wrote:

 On Sep 13, 2009, at 8:56 AM, Ido Shamai wrote:

 I'll update you on new features and every other change of the perftests.


 Are there any plans to make all the perftest tools use RDMA CM for
 connections?


 Related to this is that rdma_lat and rdma_bw currently do use RDMA CM but
their names are ib_xxx. Shouldn't that be changed ? I had sent a patch
related to this some time ago. I can resend this if of interest once I get
in sync with your git repo.

-- Hal




 --
 Jeff Squyres
 jsquy...@cisco.com



Re: [ewg] Perftest Maintainer.

2009-09-14 Thread Hal Rosenstock
On Mon, Sep 14, 2009 at 10:46 AM, Ido Shamai i...@dev.mellanox.co.il wrote:

 Hey Hal,

 I've just tested it again from several machines and from several user names,
 and it worked.
 The problem could be the dot ( . ) at the end of the link I sent to my
 git repository.
 So I write again - git clone git://git.openfabrics.org/~shamoya/perftest.git

 Please notify me if it works or not.


My bad. It's fine.

-- Hal



 Ido







 Hal Rosenstock wrote:

 Hi Ido,

 On Sun, Sep 13, 2009 at 8:56 AM, Ido Shamai i...@dev.mellanox.co.il wrote:

 Hello and a good day to all of the ewg community,

 My name is Ido Shamay, username in OFA - shamoya, employee of
 Mellanox Technologies LTD.,
 and I've just become the new perftest maintainer on the OFA website.

 Welcome!


 Patches, comments, remarks and more are to be sent to my mail -
 i...@dev.mellanox.co.il.
 The perftests can be downloaded at -
 http://www.openfabrics.org/downloads/perftest/
 Or cloned from the git repository -
 git://git.openfabrics.org/~shamoya/perftest.git.


  When I try to clone your perftest repo, I get:
 fatal: Not a git repository
  -- Hal


I'll update you on new features and every other change of the
perftests.

Regards ,
Ido.



Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD

2009-08-27 Thread Hal Rosenstock
On 8/27/09, Joachim Fenkes fen...@de.ibm.com wrote:

 Hal Rosenstock hal.rosenst...@gmail.com wrote on 26.08.2009 17:15:03:

  Thanks for doing this. It looks sane to me. The only issue I recall that
  appears to be remaining is a better setting of ClassPortInfo:RespTimeValue
  rather than hardcoding. Perhaps using the value from PortInfo is the way
  to go (ideally it would be that value from the port to which the requester
  is being redirected, but that might not be so easy to get from this port).

 I don't think that effort will be necessary or even legal. The requestor
 will react to the redirection with another Get(ClassPortInfo) to the
 redirection target, which will reply with its own RespTimeValue, so our
 driver should speak for itself.


I overreached with my comment on how this works.

 Since we don't know when our MAD
 processing and sending of the response is going to be scheduled (we're not
 running on real-time constraints here), we play it safe and return 18,
 which amounts to roughly a second.

 Make sense?


I don't think it should be hard coded. IMO it would be better to default to
18 and somehow able to be adjusted (via a (dynamic) module parameter ?).
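
For reference, the "roughly a second" figure follows from the IBA encoding of RespTimeValue as 4.096 us * 2^RespTimeValue. The following is a minimal userspace sketch of that arithmetic, together with one way a tunable with a default of 18 could be range-checked. All names here (the `sketch_` functions and the default macro) are hypothetical and are not taken from the ehca driver or from the patch under discussion.

```c
#include <stdint.h>

/* Default suggested in the thread; the macro name is hypothetical. */
#define SKETCH_DEFAULT_RESP_TIME_VALUE 18

/*
 * Accept an administrator-supplied value only if it fits the 5-bit
 * RespTimeValue field (0..31); otherwise fall back to the default.
 */
uint8_t sketch_pick_resp_time_value(int requested)
{
	if (requested < 0 || requested > 31)
		return SKETCH_DEFAULT_RESP_TIME_VALUE;
	return (uint8_t)requested;
}

/* Response time in microseconds: 4.096 us * 2^RespTimeValue. */
double sketch_resp_time_usec(uint8_t resp_time_value)
{
	/* Mask to the 5-bit field width before shifting. */
	return 4.096 * (double)(1ULL << (resp_time_value & 0x1f));
}
```

With a value of 18 this gives 4.096 us * 262144, which is a little over 1.07 seconds.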

-- Hal


 Regards
 Joachim


[ewg] Latest OFED 1.5 Schedule

2009-08-07 Thread Hal Rosenstock
Hi,

Is the below the latest OFED 1.5 schedule ?

http://www.openfabrics.org/txt/woody/roadmap.txt

OFED 1.5:
==
Schedule:
-
- Feature freeze: July 15
- Alpha: July 20
- Beta: July 30
- RC1: Aug 12
- RC2-RCx: About every 2 weeks as needed
  We usually have ~6 RCs
- GA: Oct 30
Where are we in terms of these steps ? Has alpha been reached ?

-- Hal

Re: [ewg] [PATCH 2/8 v3] ib_core: RDMAoE support only QP1

2009-07-14 Thread Hal Rosenstock
On Mon, Jul 13, 2009 at 4:58 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Eli Cohen wrote,

Since RDMAoE is using Ethernet as its link layer, there is no need for QP0.
QP1 is still needed since it handles communications between CM agents. This
patch will create only QP1 for RDMAoE ports.


 Trying to emulate IB for mad services is a total hack and not how this
 new transport should be added into the core. It should be its own transport
 type, just like iWarp was added.
 You should start with adding a new transport type to ib_verbs.h,
 e.g.,


 --- ib_verbs.h  2009-07-13 09:06:10.0 -0400
 +++ ib_verbs_new.h      2009-07-14 03:00:23.0 -0400
 @@ -64,12 +64,14 @@ enum rdma_node_type {
        RDMA_NODE_IB_CA         = 1,
        RDMA_NODE_IB_SWITCH,
        RDMA_NODE_IB_ROUTER,
 -       RDMA_NODE_RNIC
 +       RDMA_NODE_RNIC,
 +       RDMA_NODE_IBXOE
  };

  enum rdma_transport_type {
        RDMA_TRANSPORT_IB,
 -       RDMA_TRANSPORT_IWARP
 +       RDMA_TRANSPORT_IWARP,
 +       RDMA_TRANSPORT_IBXOE
  };

  enum rdma_transport_type

Unfortunately I don't think it's this simple although I wish it were.
IBXOE is on a per port rather than a per node basis which is a
different model than we've used for IB or iWARP.

-- Hal


Re: [ewg] [PATCH 2/8 v3] ib_core: RDMAoE support only QP1

2009-07-14 Thread Hal Rosenstock
On Tue, Jul 14, 2009 at 3:46 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
 On Mon, Jul 13, 2009 at 03:26:34PM -0400, Hal Rosenstock wrote:
 On Mon, Jul 13, 2009 at 2:14 PM, Eli Cohen e...@mellanox.co.il wrote:
  Since RDMAoE is using Ethernet as its link layer, there is no need for
  QP0. QP1 is still needed since it handles communications between CM
  agents. This patch will create only QP1 for RDMAoE ports.

 What happens with other QP1 traffic (other than CM and SA) ?
 I think it should work but I haven't tried that.

Would you ? You could try tools from infiniband-diags or ibdiagnet.

 Userspace
 can access QP1 (and QP0).
 QP0 is not accessible since ib_register_mad_agent() will fail for QP0
 because of this:

        if (!port_priv->qp_info[qp_type].qp)
                return NULL;

 QP1 should work in the same way

So what happens with things like PerfMgt class ? I think it ends up
timing out if no receiver consumer is present.

 Does QP0 error out ? What about QP1 ? Does
 it just timeout ? If so, a direct error would be better.


 See above - you can't access QP0. Do you know of a utility from
 userspace which sends/receives MADs on QP0 or QP1?

Yes, opensm, infiniband-diags (various), and ibutils (ibdiagnet, etc).

-- Hal


Re: [ofa-general] Re: [ewg] [PATCH 1/8 v3] ib_core: Add API to support RDMAoE

2009-07-14 Thread Hal Rosenstock
On Tue, Jul 14, 2009 at 2:35 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
 On Mon, Jul 13, 2009 at 03:26:06PM -0400, Hal Rosenstock wrote:
 
  +enum ib_port_link_type ib_get_port_link_type(struct ib_device *device, u8 port_num)
  +{
  +       return device->get_port_link_type ?
  +               device->get_port_link_type(device, port_num) : PORT_LINK_IB;

 So do iWARP devices return PORT_LINK_IB ? If so, that seems a little
 weird to me.

 -- Hal


 Maybe it's more appropriate to make this function mandatory and
 require all drivers to report the correct port type. What do you
 think?

That seems better to me; another alternative would be to require this
routine for either all iWARP or IB devices as well but that might be
error prone. Maybe someone else has a better idea on this.
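
The "make it mandatory" option can be sketched in plain C as follows. This is only an illustration of the idea, not code from the patch: the struct, enum, and function names below are hypothetical stand-ins, and the choice of -ENOSYS for a missing callback is an assumption borrowed from how ib_get_mac() in the same patch treats a missing driver hook.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types discussed above. */
enum sketch_link_type { SKETCH_LINK_IB, SKETCH_LINK_ETH };

struct sketch_device {
	/* Optional in the posted patch; treated as mandatory here. */
	enum sketch_link_type (*get_port_link_type)(struct sketch_device *dev,
						    unsigned char port_num);
};

/*
 * Instead of silently defaulting to the IB link type when the driver
 * does not provide the callback, report an error so that a device of
 * another transport is never mislabelled.
 */
int sketch_get_port_link_type(struct sketch_device *dev,
			      unsigned char port_num,
			      enum sketch_link_type *out)
{
	if (!dev || !out || !dev->get_port_link_type)
		return -ENOSYS;

	*out = dev->get_port_link_type(dev, port_num);
	return 0;
}

/* Example driver callback: every port reports an Ethernet link. */
enum sketch_link_type sketch_eth_link_type(struct sketch_device *dev,
					   unsigned char port_num)
{
	(void)dev;
	(void)port_num;
	return SKETCH_LINK_ETH;
}
```

The trade-off is that every existing driver would need to supply the callback before the core could rely on it.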

-- Hal


Re: [ewg] [PATCH 2/8 v3] ib_core: RDMAoE support only QP1

2009-07-14 Thread Hal Rosenstock
On Tue, Jul 14, 2009 at 9:38 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
 On Tue, Jul 14, 2009 at 07:15:44AM -0400, Hal Rosenstock wrote:
 On Tue, Jul 14, 2009 at 3:46 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
  On Mon, Jul 13, 2009 at 03:26:34PM -0400, Hal Rosenstock wrote:
  On Mon, Jul 13, 2009 at 2:14 PM, Eli Cohen e...@mellanox.co.il wrote:
   Since RDMAoE is using Ethernet as its link layer, there is no need for
   QP0. QP1 is still needed since it handles communications between CM
   agents. This patch will create only QP1 for RDMAoE ports.
 
  What happens with other QP1 traffic (other than CM and SA) ?
  I think it should work but I haven't tried that.

 Would you ? You could try tools from infiniband-diags or ibdiagnet.
 Yes I would try that. But I need something that will not fail because
 it could not open QP0.

So opensm, ibdiagnet, and smpquery fail (error out) ?

 For example, something that uses only QP1. Are
 there any in ibutils?

In infiniband-diags, there are perfquery, saquery, vendstat, ibping,
and ibsysstat which only use QP1. The latter two are client/server
and can take GUID (not GID) as an argument.

-- Hal


  Userspace
  can access QP1 (and QP0).
  QP0 is not accessible since ib_register_mad_agent() will fail for QP0
  because of this:
 
         if (!port_priv->qp_info[qp_type].qp)
                 return NULL;
 
  QP1 should work in the same way

 So what happens with things like PerfMgt class ? I think it ends up
 timing out if no receiver consumer is present.

  Does QP0 error out ? What about QP1 ? Does
  it just timeout ? If so, a direct error would be better.
 
 
  See above - you can't access QP0. Do you know of a utility from
  userspace which sends/receives MADs on QP0 or QP1?

 Yes, opensm, infiniband-diags (various), and ibutils (ibdiagnet, etc).

 -- Hal



Re: [ewg] Wish to remove local sa patches from OFED 1.5

2009-07-14 Thread Hal Rosenstock
On Tue, Jul 14, 2009 at 9:58 AM, Tziporet Koren
tzipo...@dev.mellanox.co.il wrote:
 Jack Morgenstein wrote:

 Hello all,

 We wish to remove the local sa patches from OFED 1.5.  The local SA is
 disabled by default,
 and to the best of our knowledge, no one is using it, though it has been
 around since OFED 1.3.
 It has also never been accepted into the mainline kernel.

 We wish therefore to remove it from OFED 1.5.

 This includes, under kernel_patches/fixes, the following patches:
        sean_local_sa_1_notifications.patch
        sean_local_sa_2_cache.patch
        sean_local_sa_3_disable.patch
        sean_local_sa_4_fix_hang.patch

 If anyone objects to this removal, please let me know ASAP.



 Since Qlogic are using some of the APIs in these files it was decided not to
 remove them in 1.5
 However Qlogic were requested to approach Sean and see if they can move
 their implementation to the new SA API he is developing now

Has this new SA API been proposed to the list as yet (and I missed it :-() ?

Thanks.

-- Hal

 so eventually we will be able to remove them

 Tziporet



Re: [ewg] Wish to remove local sa patches from OFED 1.5

2009-07-14 Thread Hal Rosenstock
On Tue, Jul 14, 2009 at 3:58 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Sorry, I did not explain it clearly,

 What I meant to say was that the new userspace
 module could work with the rdma_cm to provide
 better scaling than the local sa cache module
 that is in the kernel, and if it does, the
 local sa cache feature might not be needed
 anymore.

What new userspace module are you referring to ?

-- Hal

 woody


 -Original Message-
 From: ewg-boun...@lists.openfabrics.org 
 [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Tziporet Koren
 Sent: Tuesday, July 14, 2009 12:46 PM
 To: Hefty, Sean
 Cc: michael.bro...@qlogic.com; amar.mudran...@qlogic.com; 
 ewg@lists.openfabrics.org
 Subject: Re: [ewg] Wish to remove local sa patches from OFED 1.5

 Sean Hefty wrote:


 I am working on a userspace app that should help with scaling for some
 topologies, but I doubt it will work for all routing algorithms.  I'm at
 least a couple weeks away from posting anything.


 I guess I didn't quite understand what Woody explained in the meeting.
 Sorry about that

 Tziporet


Re: [ewg] [PATCH 1/8 v3] ib_core: Add API to support RDMAoE

2009-07-13 Thread Hal Rosenstock
On Mon, Jul 13, 2009 at 2:13 PM, Eli Cohen e...@mellanox.co.il wrote:
 Add two API functions needed for RDMAoE.

 ib_get_port_link_type() returns the link type supported by the given device's
 port. It can be either PORT_LINK_IB for IB link layer or PORT_LINK_ETH for
 Ethernet links. Link type is reported in the query_port verb.

 ib_get_mac() will return the Ethernet MAC address leading to the port whose
 GID is specified. This function is exported to userspace applications.

 ABI version is incremented from 6 to 7.

 Signed-off-by: Eli Cohen e...@mellanox.co.il
 ---
  drivers/infiniband/core/uverbs.h      |    1 +
  drivers/infiniband/core/uverbs_cmd.c  |   33 
 +
  drivers/infiniband/core/uverbs_main.c |    1 +
  drivers/infiniband/core/verbs.c       |   17 +
  include/rdma/ib_user_verbs.h          |   21 ++---
  include/rdma/ib_verbs.h               |   22 ++
  6 files changed, 92 insertions(+), 3 deletions(-)

 diff --git a/drivers/infiniband/core/uverbs.h 
 b/drivers/infiniband/core/uverbs.h
 index b3ea958..e69b04c 100644
 --- a/drivers/infiniband/core/uverbs.h
 +++ b/drivers/infiniband/core/uverbs.h
 @@ -194,5 +194,6 @@ IB_UVERBS_DECLARE_CMD(create_srq);
  IB_UVERBS_DECLARE_CMD(modify_srq);
  IB_UVERBS_DECLARE_CMD(query_srq);
  IB_UVERBS_DECLARE_CMD(destroy_srq);
 +IB_UVERBS_DECLARE_CMD(get_mac);

  #endif /* UVERBS_H */
 diff --git a/drivers/infiniband/core/uverbs_cmd.c 
 b/drivers/infiniband/core/uverbs_cmd.c
 index 56feab6..eefc414 100644
 --- a/drivers/infiniband/core/uverbs_cmd.c
 +++ b/drivers/infiniband/core/uverbs_cmd.c
 @@ -452,6 +452,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
        resp.active_width    = attr.active_width;
        resp.active_speed    = attr.active_speed;
        resp.phys_state      = attr.phys_state;
 +       resp.link_type       = attr.link_type;

        if (copy_to_user((void __user *) (unsigned long) cmd.response,
                         &resp, sizeof resp))
 @@ -1824,6 +1825,38 @@ err:
        return ret;
  }

 +ssize_t ib_uverbs_get_mac(struct ib_uverbs_file *file,
 +                         const char __user *buf, int in_len,
 +                         int out_len)
 +{
 +       struct ib_uverbs_get_mac        cmd;
 +       struct ib_uverbs_get_mac_resp   resp;
 +       int              ret;
 +       struct ib_pd    *pd;
 +
 +       if (out_len < sizeof resp)
 +               return -ENOSPC;
 +
 +       if (copy_from_user(&cmd, buf, sizeof cmd))
 +               return -EFAULT;
 +
 +       pd = idr_read_pd(cmd.pd_handle, file->ucontext);
 +       if (!pd)
 +               return -EINVAL;
 +
 +       ret = ib_get_mac(pd->device, cmd.port, cmd.gid, resp.mac);
 +       put_pd_read(pd);
 +       if (!ret) {
 +               if (copy_to_user((void __user *) (unsigned long) cmd.response,
 +                                &resp, sizeof resp)) {
 +                       return -EFAULT;
 +               }
 +               return in_len;
 +       }
 +
 +       return ret;
 +}
 +
  ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
                             const char __user *buf, int in_len, int out_len)
  {
 diff --git a/drivers/infiniband/core/uverbs_main.c 
 b/drivers/infiniband/core/uverbs_main.c
 index eb36a81..b2f148f 100644
 --- a/drivers/infiniband/core/uverbs_main.c
 +++ b/drivers/infiniband/core/uverbs_main.c
 @@ -108,6 +108,7 @@ static ssize_t (*uverbs_cmd_table[])(struct 
 ib_uverbs_file *file,
        [IB_USER_VERBS_CMD_MODIFY_SRQ]          = ib_uverbs_modify_srq,
        [IB_USER_VERBS_CMD_QUERY_SRQ]           = ib_uverbs_query_srq,
        [IB_USER_VERBS_CMD_DESTROY_SRQ]         = ib_uverbs_destroy_srq,
 +       [IB_USER_VERBS_CMD_GET_MAC]             = ib_uverbs_get_mac
  };

  static struct vfsmount *uverbs_event_mnt;
 diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
 index a7da9be..bde5b0d 100644
 --- a/drivers/infiniband/core/verbs.c
 +++ b/drivers/infiniband/core/verbs.c
 @@ -94,6 +94,13 @@ rdma_node_get_transport(enum rdma_node_type node_type)
  }
  EXPORT_SYMBOL(rdma_node_get_transport);

 +enum ib_port_link_type ib_get_port_link_type(struct ib_device *device, u8 port_num)
 +{
 +       return device->get_port_link_type ?
 +               device->get_port_link_type(device, port_num) : PORT_LINK_IB;

So do iWARP devices return PORT_LINK_IB ? If so, that seems a little
weird to me.

-- Hal

 +}
 +EXPORT_SYMBOL(ib_get_port_link_type);
 +
  /* Protection domains */

  struct ib_pd *ib_alloc_pd(struct ib_device *device)
 @@ -904,3 +911,13 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
        return qp->device->detach_mcast(qp, gid, lid);
  }
  EXPORT_SYMBOL(ib_detach_mcast);
 +
 +int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac)
 +{
 +       if (!device->get_mac)
 +               return -ENOSYS;
 +
 +       return device->get_mac(device, port, gid, mac);
 +}
 

Re: [ewg] [PATCH 2/8 v3] ib_core: RDMAoE support only QP1

2009-07-13 Thread Hal Rosenstock
On Mon, Jul 13, 2009 at 2:14 PM, Eli Cohen e...@mellanox.co.il wrote:
 Since RDMAoE is using Ethernet as its link layer, there is no need for QP0.
 QP1 is still needed since it handles communications between CM agents. This
 patch will create only QP1 for RDMAoE ports.

What happens with other QP1 traffic (other than CM and SA) ? Userspace
can access QP1 (and QP0). Does QP0 error out ? What about QP1 ? Does
it just timeout ? If so, a direct error would be better.

-- Hal

 Signed-off-by: Eli Cohen e...@mellanox.co.il
 ---
  drivers/infiniband/core/agent.c |   12 ++---
  drivers/infiniband/core/mad.c   |   48 
 ++-
  2 files changed, 45 insertions(+), 15 deletions(-)

 diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
 index ae7c288..c3f2048 100644
 --- a/drivers/infiniband/core/agent.c
 +++ b/drivers/infiniband/core/agent.c
 @@ -48,6 +48,8 @@
  struct ib_agent_port_private {
        struct list_head port_list;
        struct ib_mad_agent *agent[2];
 +       struct ib_device    *device;
 +       u8                   port_num;
  };

  static DEFINE_SPINLOCK(ib_agent_port_list_lock);
 @@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int port_num)
  {
        struct ib_agent_port_private *entry;

 -       list_for_each_entry(entry, &ib_agent_port_list, port_list) {
 -               if (entry->agent[0]->device == device &&
 -                   entry->agent[0]->port_num == port_num)
 +       list_for_each_entry(entry, &ib_agent_port_list, port_list)
 +               if (entry->device == device && entry->port_num == port_num)
                        return entry;
 -       }
 +
        return NULL;
  }

 @@ -175,6 +176,9 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
                goto error3;
        }

 +       port_priv->device = device;
 +       port_priv->port_num = port_num;
 +
        spin_lock_irqsave(&ib_agent_port_list_lock, flags);
        list_add_tail(&port_priv->port_list, &ib_agent_port_list);
        spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
 diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
 index de922a0..3d5449f 100644
 --- a/drivers/infiniband/core/mad.c
 +++ b/drivers/infiniband/core/mad.c
 @@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct 
 ib_device *device,
        unsigned long flags;
        u8 mgmt_class, vclass;

 +       /* Validate device and port */
 +       port_priv = ib_get_mad_port(device, port_num);
 +       if (!port_priv) {
 +               ret = ERR_PTR(-ENODEV);
 +               goto error1;
 +       }
 +
 +       if (!port_priv->qp_info[qp_type].qp)
 +               return NULL;
 +
        /* Validate parameters */
        qpn = get_spl_qp_index(qp_type);
        if (qpn == -1)
 @@ -260,13 +270,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct 
 ib_device *device,
                        goto error1;
        }

 -       /* Validate device and port */
 -       port_priv = ib_get_mad_port(device, port_num);
 -       if (!port_priv) {
 -               ret = ERR_PTR(-ENODEV);
 -               goto error1;
 -       }
 -
        /* Allocate structures */
        mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL);
        if (!mad_agent_priv) {
 @@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent 
 *mad_agent)
        struct ib_mad_agent_private *mad_agent_priv;
        struct ib_mad_snoop_private *mad_snoop_priv;

 +       if (!mad_agent)
 +               return 0;
 +
        /* If the TID is zero, the agent can only snoop. */
        if (mad_agent->hi_tid) {
                mad_agent_priv = container_of(mad_agent,
 @@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info 
 *qp_info)
        struct ib_mad_private *recv;
        struct ib_mad_list_head *mad_list;

 +       if (!qp_info->qp)
 +               return;
 +
        while (!list_empty(&qp_info->recv_queue.list)) {

                mad_list = list_entry(qp_info->recv_queue.list.next,
 @@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)

        for (i = 0; i < IB_MAD_QPS_CORE; i++) {
                qp = port_priv->qp_info[i].qp;
 +               if (!qp)
 +                       continue;
 +
                /*
                 * PKey index for QP1 is irrelevant but
                 * one is needed for the Reset to Init transition
 @@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)
        }

        for (i = 0; i < IB_MAD_QPS_CORE; i++) {
 +               if (!port_priv->qp_info[i].qp)
 +                       continue;
 +
                ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL);
                if (ret) {
                        printk(KERN_ERR PFX "Couldn't post receive WRs\n");
 @@ -2762,6 +2777,9 @@ error:

  static void destroy_mad_qp(struct ib_mad_qp_info *qp_info)
  {
 +       if (!qp_info->qp)
 +               return;
 +
        

[ewg] Re: [PATCH v4] libibmad: Handle MAD redirection

2009-07-08 Thread Hal Rosenstock
On 7/8/09, Joachim Fenkes fen...@de.ibm.com wrote:
 Hal Rosenstock hal.rosenst...@gmail.com wrote on 07.07.2009 17:23:18:

  +static int redirect_port(ib_portid_t *port, uint8_t *mad)
  +{
  +   port->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
  +   if (!port->lid) {
  +   IBWARN("GID-based redirection is not supported");
  +   return -1;
  +   }

 I hate to keep beating this horse but the lack of a LID certainly
 means GID based redirection when the GID is not 0, IMO this LID check
 is insufficient in general.

 If the LID is given, my code does the right thing by redirecting regardless
 of any GID, as the spec requires. If no LID is given, but a GID is, my code
 bails with an error stating that GID-based redirection is not supported. If
 both GID and LID are 0, that's an error and my code bails with an error
 message (which may or may not be misleading depending on your perspective,
 but frankly I couldn't care less about broken agents).

 Which of those three reactions do you think is insufficient?

It looks to me like both GID and LID are allowed to be specified in
the redirect and if so, there is the possibility of GID based
redirection there (as well as when LID is 0) and it is the requester
which decides on GRH inclusion or not.
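
The three reactions Joachim describes can be written down as a small classification, which also makes the case raised here visible: when both a LID and a non-zero GID are present, the patch (and this sketch) follows the LID and never considers the GID. This is a userspace illustration only; the enum and function names are hypothetical and do not come from libibmad.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names; they mirror the three cases described in the thread. */
enum sketch_redirect_kind {
	SKETCH_REDIRECT_BY_LID,          /* LID present: follow it */
	SKETCH_REDIRECT_GID_UNSUPPORTED, /* only a GID present: bail out */
	SKETCH_REDIRECT_INVALID          /* neither present: broken agent */
};

enum sketch_redirect_kind sketch_classify_redirect(uint16_t redirect_lid,
						   const uint8_t redirect_gid[16])
{
	static const uint8_t zero_gid[16];

	/* A non-zero LID wins regardless of the GID contents. */
	if (redirect_lid)
		return SKETCH_REDIRECT_BY_LID;

	if (memcmp(redirect_gid, zero_gid, sizeof zero_gid) != 0)
		return SKETCH_REDIRECT_GID_UNSUPPORTED;

	return SKETCH_REDIRECT_INVALID;
}
```

Supporting GID-based redirection when both fields are set would mean adding a fourth outcome and letting the requester decide whether to include a GRH.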

 I suppose this can be fixed down the road.

 Is that an ACK? ;)

Indeed, a qualified ACK :-) The case that concerns me ends up
entangled in the yet-to-be-specified multiple IB subnet case (router
spec).

-- Hal

 Cheers,
   Joachim

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH v4] libibmad: Handle MAD redirection

2009-07-08 Thread Hal Rosenstock
On 7/8/09, Joachim Fenkes fen...@de.ibm.com wrote:
 Hal Rosenstock hal.rosenst...@gmail.com wrote on 08.07.2009 15:24:53:

  I suppose this can be fixed down the road.
 
  Is that an ACK? ;)

 Indeed, a qualified ACK :-)

 Cool, thanks!

 This patch should make its way into OFED 1.5... so who should pull it?
 You? Vlad? Someone not on CC? Whoever, please apply for OFED 1.5 --
 thanks!

Sasha is the management maintainer. Userspace trees for OFED 1.5
haven't been created and I think this aspect is in transition.

-- Hal


 Cheers
   Joachim



[ewg] Re: [PATCH v4] libibmad: Handle MAD redirection

2009-07-07 Thread Hal Rosenstock
On Tue, Jul 7, 2009 at 10:20 AM, Joachim Fenkesfen...@de.ibm.com wrote:
 Previously, libibmad reacted to GSI MAD responses with a redirect status
 by throwing an error. IBM eHCA adapters use redirection, so most
 infiniband_diags tools didn't work against eHCA.

 Fix: Modify mad_rpc() so that it resends the request to the redirection
 target if a redirect GS response is received. This is repeated until no
 redirect response is received, allowing for multiple levels of
 indirection.

 The dport argument is updated with the redirection target, so subsequent
 MADs will not go through the redirection process again but reach the target
 directly.
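
The retry behaviour described above can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the patch itself: `struct dest`, `send_fn`, `rpc_follow_redirects()` and the `demo_send()` transport are hypothetical stand-ins, and the `MAX_REDIRECTS` cap is an addition of this sketch (the patch as described repeats until no redirect response is received).

```c
#include <stdint.h>

#define STS_OK        0
#define STS_REDIRECT  (1 << 1)	/* mirrors IB_MAD_STS_REDIRECT in the patch */
#define MAX_REDIRECTS 8		/* hypothetical cap, not part of the patch */

/* Hypothetical, reduced stand-in for ib_portid_t. */
struct dest {
	uint16_t lid;
	uint32_t qp;
};

/* Hypothetical transport: sends one request to *to and returns the MAD
 * status; on a redirect status it fills *redirect with the new target. */
typedef int (*send_fn)(const struct dest *to, struct dest *redirect);

/* Resend to the redirection target until a non-redirect status arrives.
 * *dport is updated, so later calls reach the final target directly. */
int rpc_follow_redirects(struct dest *dport, send_fn send)
{
	int hops;

	for (hops = 0; hops <= MAX_REDIRECTS; hops++) {
		struct dest target = *dport;
		int status = send(dport, &target);

		/* exact match: any other status is returned to the caller */
		if (status != STS_REDIRECT)
			return status;
		if (!target.lid)
			return -1;	/* GID-based redirection unsupported */
		*dport = target;
	}
	return -1;			/* too many levels of indirection */
}

/* Demo transport: LID 1 redirects to LID 7 / QP 42, which answers OK. */
int demo_calls;

int demo_send(const struct dest *to, struct dest *redirect)
{
	demo_calls++;
	if (to->lid == 1) {
		redirect->lid = 7;
		redirect->qp = 42;
		return STS_REDIRECT;
	}
	return STS_OK;
}
```

With the demo transport, the first call costs two sends (the redirect plus the real request), while a second call with the same, now updated, destination costs only one.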

 Tested using perfquery between ehca, mlx4 and mthca in all possible
 combinations.

 Signed-off-by: Joachim Fenkes fen...@de.ibm.com
 ---

 After all has been said and done, here's the hopefully last iteration of the
 patch, with the hex display of the redirect LID replaced by decimal.

 Any objections against this patch?

See below for comment.


 Regards,
  Joachim

  libibmad/include/infiniband/mad.h |    9 +
  libibmad/src/gs.c                 |    6 ++-
  libibmad/src/rpc.c                |   65 
  3 files changed, 63 insertions(+), 17 deletions(-)

 diff --git a/libibmad/include/infiniband/mad.h 
 b/libibmad/include/infiniband/mad.h
 index aa27eb5..bdf5158 100644
 --- a/libibmad/include/infiniband/mad.h
 +++ b/libibmad/include/infiniband/mad.h
 @@ -115,6 +115,8 @@ enum MAD_ATTR_ID {

  enum MAD_STATUS {
        IB_MAD_STS_OK                        = (0 << 2),
 +       IB_MAD_STS_BUSY                      = (1 << 0),
 +       IB_MAD_STS_REDIRECT                  = (1 << 1),
        IB_MAD_STS_BAD_BASE_VER_OR_CLASS     = (1 << 2),
        IB_MAD_STS_METHOD_NOT_SUPPORTED      = (2 << 2),
        IB_MAD_STS_METHOD_ATTR_NOT_SUPPORTED = (3 << 2),
 @@ -783,8 +785,15 @@ MAD_EXPORT int madrpc_set_timeout(int timeout);
  MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
                        int *mgmt_classes, int num_classes);
  MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
 +
 +/*
 + * On redirection, the dport argument is updated with the redirection target,
 + * so subsequent MADs will not go through the redirection process again but
 + * reach the target directly.
 + */
  MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc,
                        ib_portid_t * dport, void *payload, void *rcvdata);
 +
  MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * 
 rpc,
                              ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
                              void *data);
 diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
 index f3d245e..c7e4ff6 100644
 --- a/libibmad/src/gs.c
 +++ b/libibmad/src/gs.c
 @@ -70,7 +70,8 @@ uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, 
 int port,
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;

 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;
        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 @@ -109,7 +110,8 @@ uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
        rpc.timeout = timeout;
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;
 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;
        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
 index 07b623d..efea1d3 100644
 --- a/libibmad/src/rpc.c
 +++ b/libibmad/src/rpc.c
 @@ -183,33 +183,68 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int 
 agentid, int len,
        return -1;
  }

 +static int redirect_port(ib_portid_t *port, uint8_t *mad)
 +{
 +       port->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
 +       if (!port->lid) {
 +               IBWARN("GID-based redirection is not supported");
 +               return -1;
 +       }

I hate to keep beating this horse but the lack of a LID certainly
means GID based redirection when the GID is not 0, IMO this LID check
is insufficient in general. I suppose this can be fixed down the road.

-- Hal

 +
 +       port->qp = mad_get_field(mad, 64, IB_CPI_REDIRECT_QP_F);
 +       port->qkey = mad_get_field(mad, 64, IB_CPI_REDIRECT_QKEY_F);
 +       port->sl = mad_get_field(mad, 64, IB_CPI_REDIRECT_SL_F);
 +
 +       /* TODO: Reverse map redirection P_Key to P_Key index */
 +
 +       if (ibdebug)
 +               IBWARN("redirected to lid %d, qp 0x%x, qkey 0x%x, sl 0x%x",
 +                      port->lid, port->qp, port->qkey, port->sl);
 +
 +       return 0;
 +}
 +
  void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
              ib_portid_t * dport, void *payload, void *rcvdata)
  {
        int status, len;
        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
        int timeout, retries;
 +       int redirect = 1;

 -       len = 0;
 -       memset(sndbuf, 0, umad_size() + 

[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 9:34 AM, Joachim Fenkesfen...@de.ibm.com wrote:
 Previously, libibmad reacted to GSI MAD responses with a redirect status
 by throwing an error. IBM eHCA adapters use redirection, so most
 infiniband_diags tools didn't work against eHCA.

 Fix: Modify mad_rpc() so that it resends the request to the redirection
 target if a redirect GS response is received. This is repeated until no
 redirect response is received, allowing for multiple levels of
 indirection.

 The dport argument is updated with the redirection target, so subsequent
 MADs will not go through the redirection process again but reach the target
 directly.

 Tested using perfquery between ehca, mlx4 and mthca in all possible
 combinations.

 Signed-off-by: Joachim Fenkes fen...@de.ibm.com
 ---

 Hi, Hal and Jason,

 here's an updated patch that will bail on GID-routed redirection. Also, I
 moved the redirection itself into its own function so it can easily be
 included into RMPP as well.

 Of course, I tested this again using ehca, mthca and mlx4.

 If you have nothing to add to this patch, please queue it for OFED 1.5.

 Thanks and regards,
  Joachim


  libibmad/include/infiniband/mad.h |    9 +
  libibmad/src/gs.c                 |    6 ++-
  libibmad/src/rpc.c                |   65 
  3 files changed, 63 insertions(+), 17 deletions(-)

 diff --git a/libibmad/include/infiniband/mad.h 
 b/libibmad/include/infiniband/mad.h
 index aa27eb5..bdf5158 100644
 --- a/libibmad/include/infiniband/mad.h
 +++ b/libibmad/include/infiniband/mad.h
 @@ -115,6 +115,8 @@ enum MAD_ATTR_ID {

  enum MAD_STATUS {
        IB_MAD_STS_OK                        = (0 << 2),
 +       IB_MAD_STS_BUSY                      = (1 << 0),
 +       IB_MAD_STS_REDIRECT                  = (1 << 1),
        IB_MAD_STS_BAD_BASE_VER_OR_CLASS     = (1 << 2),
        IB_MAD_STS_METHOD_NOT_SUPPORTED      = (2 << 2),
        IB_MAD_STS_METHOD_ATTR_NOT_SUPPORTED = (3 << 2),
 @@ -783,8 +785,15 @@ MAD_EXPORT int madrpc_set_timeout(int timeout);
  MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
                        int *mgmt_classes, int num_classes);
  MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
 +
 +/*
 + * On redirection, the dport argument is updated with the redirection target,
 + * so subsequent MADs will not go through the redirection process again but
 + * reach the target directly.
 + */
  MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc,
                        ib_portid_t * dport, void *payload, void *rcvdata);
 +
  MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * 
 rpc,
                              ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
                              void *data);
 diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
 index f3d245e..c7e4ff6 100644
 --- a/libibmad/src/gs.c
 +++ b/libibmad/src/gs.c
 @@ -70,7 +70,8 @@ uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, 
 int port,
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;

 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;
        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 @@ -109,7 +110,8 @@ uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
        rpc.timeout = timeout;
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;
 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;
        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
 index 07b623d..7364940 100644
 --- a/libibmad/src/rpc.c
 +++ b/libibmad/src/rpc.c
 @@ -183,33 +183,68 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int 
 agentid, int len,
        return -1;
  }

 +static int redirect_port(ib_portid_t *port, uint8_t *mad)
 +{
 +       port->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
 +       if (!port->lid) {
 +               IBWARN("GID-based redirection is not supported");
 +               return -1;
 +       }

Sorry for the confusion: determination of GID redirection should be
based on a comparison of the RedirectGID to 0. It's valid to supply
both a non zero RedirectGID and RedirectLID.

 +
 +       port->qp = mad_get_field(mad, 64, IB_CPI_REDIRECT_QP_F);
 +       port->qkey = mad_get_field(mad, 64, IB_CPI_REDIRECT_QKEY_F);
 +       port->sl = mad_get_field(mad, 64, IB_CPI_REDIRECT_SL_F);
 +
 +       /* TODO: Reverse map redirection P_Key to P_Key index */
 +
 +       if (ibdebug)
 +               IBWARN("redirected to lid 0x%x, qp 0x%x, qkey 0x%x, sl 0x%x",
 +                      port->lid, port->qp, port->qkey, port->sl);

Unicast LIDs should be displayed in decimal rather than hex.

-- Hal

 +
 +       return 0;
 +}
 +
  void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
              ib_portid_t * dport, void *payload, void *rcvdata)
  {
        int status, 

[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 11:14 AM, Jason
Gunthorpejguntho...@obsidianresearch.com wrote:
 On Wed, Jul 01, 2009 at 09:59:41AM -0400, Hal Rosenstock wrote:

  +static int redirect_port(ib_portid_t *port, uint8_t *mad)
  +{
  +       port->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
  +       if (!port->lid) {
  +               IBWARN("GID-based redirection is not supported");
  +               return -1;
  +       }

 Sorry for the confusion: determination of GID redirection should be
 based on a comparison of the RedirectGID to 0. It's valid to supply
 both a non zero RedirectGID and RedirectLID.

 The above is correct. As I said, RedirectGID is not allowed to be 0.

I think it depends on the interpretation of "If redirection is not being
performed, this shall be set to zero." in the RedirectGID description
as to whether it is referring to redirection in general or just GID
redirection.

Furthermore, RedirectLID can be non-zero while GID redirection is still
being used, as indicated by the RedirectLID description stating that
a non-zero RedirectLID will in general not be valid.

-- Hal


 Jason



[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 11:17 AM, Joachim Fenkesfen...@de.ibm.com wrote:
 Hal Rosenstock hal.rosenst...@gmail.com wrote on 01.07.2009 15:59:41:

  +static int redirect_port(ib_portid_t *port, uint8_t *mad)
  +{
  +       port->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
  +       if (!port->lid) {
  +               IBWARN("GID-based redirection is not supported");
  +               return -1;
  +       }

 Sorry for the confusion: determination of GID redirection should be
 based on a comparison of the RedirectGID to 0. It's valid to supply
 both a non zero RedirectGID and RedirectLID.

 Are you sure?

No; I'm not sure. See previous post to Jason.

 About the Redirection GID, the spec says "If redirection is not being
 performed, this shall be set to zero", so if redirection _is_ being
 performed, the GID may or may not be zero without any explicit
 implication.

Agreed but Jason doesn't appear to agree.

 For the LID, it says "If this value is zero, the redirect requires the
 requester to use the supplied RedirectGID to request further path
 resolution from subnet administration." To me, this explicitly states
 that a zero LID means that the GID must be used.

 If both LID and GID are non-zero, it is not specified whether the
 requester should use the LID or the GID, so I choose to always use the
 LID as long as it's non-zero, because that's what the code supports.

It does say that the LID might not be valid even though non-zero. I'm
thinking of the more general case (future) rather than just IBM eHCA
usage.

 Am I talking crazy or does this make sense to you?

  +
  +       port->qp = mad_get_field(mad, 64, IB_CPI_REDIRECT_QP_F);
  +       port->qkey = mad_get_field(mad, 64, IB_CPI_REDIRECT_QKEY_F);
  +       port->sl = mad_get_field(mad, 64, IB_CPI_REDIRECT_SL_F);
  +
  +       /* TODO: Reverse map redirection P_Key to P_Key index */
  +
  +       if (ibdebug)
  +               IBWARN("redirected to lid 0x%x, qp 0x%x, qkey 0x%x, sl 0x%x",
  +                      port->lid, port->qp, port->qkey, port->sl);

 Unicast LIDs should be displayed in decimal rather than hex.

 Couldn't you have noticed this in the first patch? ;)

Somehow I missed it :-( Sorry.

I'll change it.

Thanks.

-- Hal

 Cheers,
  Joachim



[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 12:20 PM, Joachim Fenkesfen...@de.ibm.com wrote:
 Hal Rosenstock hal.rosenst...@gmail.com wrote on 01.07.2009 17:54:41:

 It does say that the LID might not be valid even though non-zero.

 Can you elaborate on that? I can't seem to find that in the spec.
 What I do find, though, is this:

 "If this value is non-zero, it is the DLID a requester _shall_ use to
 access the class services." -- v1.2, p736, line 7

Yes, that's the same as v1.2.1 p.743 line 14.


 Which sounds to me like: If it's non-zero, it's valid and you must use it.

v1.2.1 p.743 line 17 goes on to say:
"The RedirectGID, the RedirectQP and RedirectP_Key from this redirect
response are all valid, but the RedirectSL, RedirectFL, RedirectTC, and
RedirectLID will in general not be valid; they must be replaced using a
PathRecord obtained from the SA."
That doesn't look like a change from v1.2, as there are no change bars.

-- Hal

 I'm
 thinking of the more general case (future) rather than just IBM eHCA
 usage.

 I stopped thinking eHCA only two patches ago, don't worry ;)

 BTW, I'm going to be out of the office starting now and returning next
 Tuesday,
 so let's continue this discussion next week ;)

 Cheers,
  Joachim



[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 12:50 PM, Jason
Gunthorpejguntho...@obsidianresearch.com wrote:
 On Wed, Jul 01, 2009 at 11:54:13AM -0400, Hal Rosenstock wrote:

 I think it depends on the interpretation of "If redirection is not being
 performed, this shall be set to zero." in the RedirectGID description
 as to whether it is referring to redirection in general or just GID
 redirection.

 ClassPortInfo is used for a lot of things,

Of course.

 I take that to mean that
 when it is used in non-redirection contexts that RedirectGID can be 0.

I took it to mean something different, as there's some conflicting text in
the RedirectLID description.

 Clearly the only sane way this can work is if the GID is always
 filled in for the redirection case.

Why is that ? Why must the redirector provide GRH info when it's not
required for subnet local cases ?

 Furthermore, RedirectLID can be non-zero while GID redirection is still
 being used, as indicated by the RedirectLID description stating that
 a non-zero RedirectLID will in general not be valid.

 The spec says if it is not zero the requester shall use it. I don't
 see an ambiguity here.

To me, the ambiguity is several lines below it, where it states that
the RedirectLID might not be valid and says to obtain a PathRecord
when RedirectGID is supplied rather than relying on the RedirectLID
being non-zero.

-- Hal

 Jason


[ewg] Re: [PATCH v3] libibmad: Handle MAD redirection

2009-07-01 Thread Hal Rosenstock
On Wed, Jul 1, 2009 at 4:00 PM, Jason
Gunthorpejguntho...@obsidianresearch.com wrote:
 On Wed, Jul 01, 2009 at 03:39:01PM -0400, Hal Rosenstock wrote:

  Clearly the only sane way this can work is if the GID is always
  filled in for the redirection case.

 Why is that ? Why must the redirector provide GRH info when it's not
 required for subnet local cases ?

 Because the redirector doesn't know what the initiator will do. It
 could include a GRH, or maybe not. It must include the GID to cover
 both cases.

It could restrict what the initiator can do by doing this. Nothing
wrong with that AFAICT. I agree that this is not what you'd want if the
requester were not subnet local. I'm only talking about the subnet
local case.

  Furthermore, RedirectLID can be non-zero while GID redirection is still
  being used, as indicated by the RedirectLID description stating that
  a non-zero RedirectLID will in general not be valid.
 
  The spec says if it is not zero the requester shall use it. I don't
  see an ambiguity here.

 To me, the ambiguity is several lines below it, where it states that
 the RedirectLID might not be valid and says to obtain a PathRecord
 when RedirectGID is supplied rather than relying on the RedirectLID
 being non-zero.

 Whoever authored this should not have mixed 'will in general not be
 valid' and 'they must be replaced' in the same sentence - but I think
 the meaning is still clear. With a 0 RedirectLID only the RedirectGID,
 QP and P_Key are to be used by the receiver. When RedirectLID is not 0
 then all of the Redirect fields must contain correct data and should
 be used as necessary by the receiver.

 It never says to obtain a Path Record when a GID is supplied. It says
 to obtain a path record when RedirectLID is 0.

In looking at this some more, I agree with you on this part now since
all that text is part of the RedirectLID 0 paragraph.

-- Hal

 Jason



Re: [ewg] [PATCH] libibmad: Handle MAD redirection

2009-06-30 Thread Hal Rosenstock
On Tue, Jun 30, 2009 at 8:04 AM, Joachim Fenkesfen...@de.ibm.com wrote:
 On Tuesday 30 June 2009 00:01, Hal Rosenstock wrote:
 On Mon, Jun 29, 2009 at 8:10 AM, Joachim Fenkesfen...@de.ibm.com wrote:

  Previously, libibmad reacted to GSI MAD responses with a redirect status
  by throwing an error. IBM eHCA adapters use redirection, so most
  infiniband_diags tools didn't work against eHCA.

 Are there GS classes other than PerfMgt which would be redirected by eHCA ?

 Not right now, no. If you're interested in the details of how and when the
 eHCA driver redirects, please have a look at 
 drivers/infiniband/hw/ehca/ehca_sqp.c.

SL is always set to 0 and RespTimeValue is hardcoded.

  --- a/libibmad/src/gs.c
  +++ b/libibmad/src/gs.c
  @@ -70,7 +70,8 @@ uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, 
  int port,
         rpc.datasz = IB_PC_DATA_SZ;
         rpc.dataoffs = IB_PC_DATA_OFFS;
 
  -       dest->qp = 1;
  +       if (!dest->qp)
  +               dest->qp = 1;

 Is this change part of this patch or unrelated/separate ?

 Part of the patch. Without this change, pma_query_via() would overwrite the
 redirected QP with QP1 again, and the MAD would never arrive at the right
 destination.
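
A minimal sketch of the "default only if unset" pattern described here, assuming a reduced stand-in for `ib_portid_t`. The struct and the `apply_gs_defaults()` helper are hypothetical; the Q_Key constant is the well-known QP1 value that `IB_DEFAULT_QP1_QKEY` refers to.

```c
#include <stdint.h>

#define DEFAULT_GS_QP    1		/* GSI traffic normally targets QP1 */
#define DEFAULT_QP1_QKEY 0x80010000u	/* well-known QP1 Q_Key */

/* Hypothetical, reduced stand-in for ib_portid_t. */
struct gs_dest {
	uint32_t qp;
	uint32_t qkey;
};

/* Fill in QP and Q_Key only when the caller (or an earlier redirect)
 * has not already set them, so a redirected QP is preserved. */
void apply_gs_defaults(struct gs_dest *dest)
{
	if (!dest->qp)
		dest->qp = DEFAULT_GS_QP;
	if (!dest->qkey)
		dest->qkey = DEFAULT_QP1_QKEY;
}
```

An unconditional `dest->qp = 1` would reset a destination that a previous redirect had pointed at another QP, which is the failure mode described above.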

  +               /* check for exact match instead of only the redirect bit;
  +                * that way, weird statuses cause an error, too */
  +               if (status == IB_MAD_STS_REDIRECT) {
  +                       /* update dport for next request and retry */
  +                       dport->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
  +                       dport->qp = mad_get_field(mad, 64, IB_CPI_REDIRECT_QP_F);
  +                       dport->qkey = mad_get_field(mad, 64, IB_CPI_REDIRECT_QKEY_F);

 Are those the only 3 fields which eHCA changes on a redirect ? There
 may be others we would want to add in here (PKey, SL, ...) ?

 Yeah, I agree on the SL, I can add it to the patch.

 At first, I also tried to set the PKey, but ClassPortInfo specifies a PKey
 while ib_portid_t needs a PKey Index, and I found no way of converting
 between the two,

It's available via libibumad. Note that umad version 5 is needed for
pkey index support.
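
For illustration, the reverse mapping from a P_Key value to a P_Key index could look like the sketch below. It assumes the port's P_Key table has already been read into an array by some other means; `pkey_to_index()` is a hypothetical helper, not a libibumad call, and comparing only the low 15 bits (ignoring the membership bit) is a design assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define PKEY_BASE_MASK 0x7fff	/* low 15 bits; the top bit is membership */

/* Return the index of the first table entry whose base P_Key matches
 * pkey, or -1 if there is none. Entries with a zero base are treated
 * as invalid and skipped. */
int pkey_to_index(const uint16_t *table, size_t n, uint16_t pkey)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint16_t base = table[i] & PKEY_BASE_MASK;

		if (base && base == (pkey & PKEY_BASE_MASK))
			return (int)i;
	}
	return -1;
}
```

A caller would place the returned index into the destination's P_Key index field and fall back to index 0 (the current behaviour) when no match is found.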

 so I left it at zero. Incidentally, there isn't a single
 code line in management.git that actually changes the pkey_index from its
 init value of 0, so I figured that omission couldn't be too bad.

Agreed. I think you're referring to infiniband-diags and not opensm.

 Then there's the GRH stuff, but I refrained from coding that because I
 wouldn't be able to test it

That's fine for now IMO. I think there's only some minimal GRH support
now elsewhere in the diags and this needs more work in general.

 -- InfiniBand isn't going to evolve beyond a
 single subnet any time soon, is it?

There's ongoing work in this area but not sure how soon is soon...

 Also, are the offsets above correct ?

 Yes, they are, I tested. The ClassPortInfo data starts at offset 64 in the
 MAD, and I didn't find a constant for this in mad.h.
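
As a rough sketch of what reading a field at that offset involves, the following extracts a big-endian 16-bit value relative to the start of ClassPortInfo. The names are hypothetical, and the byte offset used for RedirectLID within ClassPortInfo (28) is an assumption based on the usual attribute layout rather than something stated in this thread; the real code uses `mad_get_field()` with the `IB_CPI_REDIRECT_*_F` field IDs instead.

```c
#include <stdint.h>

#define CPI_DATA_OFFS         64 /* ClassPortInfo start in the MAD (per above) */
#define CPI_REDIRECT_LID_OFFS 28 /* assumed byte offset inside ClassPortInfo */

/* Read a big-endian (network order) 16-bit value from a byte buffer. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Extract the RedirectLID from a raw GS MAD holding a ClassPortInfo. */
uint16_t cpi_redirect_lid(const uint8_t *mad)
{
	return get_be16(mad + CPI_DATA_OFFS + CPI_REDIRECT_LID_OFFS);
}
```

A named constant such as `CPI_DATA_OFFS` would also replace the literal 64 that the patch passes to `mad_get_field()`.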

 Depending on which GS classes are to be supported for redirection, we
 may want to do something similar to the rmpp equivalent of this
 routine too.

 The spec says in 13.5.2 that "The SA as well as each GSA may individually
 support this mechanism or not", so we should probably be prepared for any GS
 class to redirect. I don't care much about RMPP, though, so I left it alone.

Understood.

-- Hal

 Regards,
  Joachim






Re: [ofa-general] Re: [ewg] [PATCH] libibmad: Handle MAD redirection

2009-06-30 Thread Hal Rosenstock
On Tue, Jun 30, 2009 at 2:00 PM, Jason
Gunthorpejguntho...@obsidianresearch.com wrote:
 On Tue, Jun 30, 2009 at 02:04:03PM +0200, Joachim Fenkes wrote:
 On Tuesday 30 June 2009 00:01, Hal Rosenstock wrote:
  On Mon, Jun 29, 2009 at 8:10 AM, Joachim Fenkesfen...@de.ibm.com wrote:

   Previously, libibmad reacted to GSI MAD responses with a redirect 
   status
   by throwing an error. IBM eHCA adapters use redirection, so most
   infiniband_diags tools didn't work against eHCA.
 
  Are there GS classes other than PerfMgt which would be redirected by eHCA ?

 Not right now, no. If you're interested in the details of how and when the
 eHCA driver redirects, please have a look at 
 drivers/infiniband/hw/ehca/ehca_sqp.c.

 Hmm.. That definitely doesn't look right overall. You are not forming
 the redirect reply in a way that will work with all possible
 fabrics. You can't just return a 0 SL and the default PKey and assume
 things will work out.

 It looks like all you want to do is redirect to a different QPN? If so
 I recommend you copy all the values from the incoming MAD's LRH and,
 if present, GRH into the ClassPortInfo reply. Copy the PKey too.

 This way you have the best chance of sending back the right information.

Agreed.

 If there is no GRH then you can use GID index 0 and a 0 TC and a 0
 FL. According to the spec returning the port GID is NOT optional.

These are not needed when using LID based redirection (see
ClassPortInfo RedirectLID description).

-- Hal

   +            /* update dport for next request and retry */
   +            dport->lid = mad_get_field(mad, 64, IB_CPI_REDIRECT_LID_F);
   +            dport->qp = mad_get_field(mad, 64, IB_CPI_REDIRECT_QP_F);
   +            dport->qkey = mad_get_field(mad, 64, IB_CPI_REDIRECT_QKEY_F);

 This code should also check for 0 LID and bail.

 Jason



Re: [ofa-general] Re: [ewg] [PATCH] libibmad: Handle MAD redirection

2009-06-30 Thread Hal Rosenstock
On Tue, Jun 30, 2009 at 2:53 PM, Jason
Gunthorpejguntho...@obsidianresearch.com wrote:
 On Tue, Jun 30, 2009 at 02:37:26PM -0400, Hal Rosenstock wrote:

  If there is no GRH then you can use GID index 0 and a 0 TC and a 0
  FL. According to the spec returning the port GID is NOT optional.

 These are not needed when using LID based redirection (see
 ClassPortInfo RedirectLID description).

 Hmm?

 RedirectGID RO 128 64 The GID a requester shall use as the destination
                      GID in the GRH of messages used to access redirected
                      class services. If redirection is not being
                      performed, this shall be set to zero.

 They *might* not be used when the LID is returned, but they still
 must be set.

Sure; they can all be set to 0 as eHCA is doing now. That was all I
meant to say.

-- Hal


 --
 Jason Gunthorpe jguntho...@obsidianresearch.com        (780)4406067x832
 Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada



Re: [ewg] [PATCH] libibmad: Handle MAD redirection

2009-06-29 Thread Hal Rosenstock
On Mon, Jun 29, 2009 at 8:10 AM, Joachim Fenkesfen...@de.ibm.com wrote:
 Previously, libibmad reacted to GSI MAD responses with a redirect status
 by throwing an error. IBM eHCA adapters use redirection, so most
 infiniband_diags tools didn't work against eHCA.

Are there GS classes other than PerfMgt which would be redirected by eHCA ?

 Fix: Modify mad_rpc() so that it resends the request to the redirection
 target if a redirect GS response is received. This is repeated until no
 redirect response is received, allowing for multiple levels of
 indirection.

 The dport argument is updated with the redirection target, so subsequent
 MADs will not go through the redirection process again but reach the target
 directly.

 Tested using perfquery between ehca, mlx4 and mthca in all possible
 combinations.

 Signed-off-by: Joachim Fenkes fen...@de.ibm.com
 ---

 Vlad, please queue this patch for OFED 1.5 if Hal doesn't object -- thanks!

See other comments below.

-- Hal

 Regards,
  Joachim

  libibmad/include/infiniband/mad.h |    9 +++
  libibmad/src/gs.c                 |    6 +++-
  libibmad/src/rpc.c                |   47 +---
  3 files changed, 45 insertions(+), 17 deletions(-)

 diff --git a/libibmad/include/infiniband/mad.h 
 b/libibmad/include/infiniband/mad.h
 index aa27eb5..bdf5158 100644
 --- a/libibmad/include/infiniband/mad.h
 +++ b/libibmad/include/infiniband/mad.h
 @@ -115,6 +115,8 @@ enum MAD_ATTR_ID {

  enum MAD_STATUS {
        IB_MAD_STS_OK                        = (0 << 2),
 +       IB_MAD_STS_BUSY                      = (1 << 0),
 +       IB_MAD_STS_REDIRECT                  = (1 << 1),
        IB_MAD_STS_BAD_BASE_VER_OR_CLASS     = (1 << 2),
        IB_MAD_STS_METHOD_NOT_SUPPORTED      = (2 << 2),
        IB_MAD_STS_METHOD_ATTR_NOT_SUPPORTED = (3 << 2),
 @@ -783,8 +785,15 @@ MAD_EXPORT int madrpc_set_timeout(int timeout);
  MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
                        int *mgmt_classes, int num_classes);
  MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
 +
 +/*
 + * On redirection, the dport argument is updated with the redirection target,
 + * so subsequent MADs will not go through the redirection process again but
 + * reach the target directly.
 + */
  MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc,
                        ib_portid_t * dport, void *payload, void *rcvdata);
 +
  MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * 
 rpc,
                              ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
                              void *data);
 diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
 index f3d245e..c7e4ff6 100644
 --- a/libibmad/src/gs.c
 +++ b/libibmad/src/gs.c
 @@ -70,7 +70,8 @@ uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, 
 int port,
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;

 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;

Is this change part of this patch or unrelated/separate ?

        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 @@ -109,7 +110,8 @@ uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t 
 * dest,
        rpc.timeout = timeout;
        rpc.datasz = IB_PC_DATA_SZ;
        rpc.dataoffs = IB_PC_DATA_OFFS;
 -       dest->qp = 1;
 +       if (!dest->qp)
 +               dest->qp = 1;

Same as above.

        if (!dest->qkey)
                dest->qkey = IB_DEFAULT_QP1_QKEY;

 diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
 index 07b623d..c809060 100644
 --- a/libibmad/src/rpc.c
 +++ b/libibmad/src/rpc.c
 @@ -189,27 +189,44 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * 
 rpc,
        int status, len;
        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
        int timeout, retries;
 +       int redirect = 1;

 -       len = 0;
 -       memset(sndbuf, 0, umad_size() + IB_MAD_SIZE);
 +       while (redirect) {
 +               len = 0;
 +               memset(sndbuf, 0, umad_size() + IB_MAD_SIZE);

 -       if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 -               return 0;
 +               if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 +                       return 0;

 -       timeout = rpc->timeout ? rpc->timeout :
 -           port->timeout ? port->timeout : madrpc_timeout;
 -       retries = port->retries ? port->retries : madrpc_retries;
 +               timeout = rpc->timeout ? rpc->timeout :
 +                       port->timeout ? port->timeout : madrpc_timeout;
 +               retries = port->retries ? port->retries : madrpc_retries;

 -       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 -                             port->class_agents[rpc->mgtclass],
 -                             len, timeout, retries)) < 0) {
 -               IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 -               return 0;
 -       }
 +               if ((len = 

[ewg] Re: [PATCH] perftest: Add command line SL support

2009-06-28 Thread Hal Rosenstock
Oren,

On Sun, Jun 28, 2009 at 9:11 AM, orenmeron <orenme...@dev.mellanox.co.il> wrote:
 Signed-off-by: Oren Meron <orenme...@dev.mellanox.co.il>

Shouldn't this patch add your signoff to mine ?

Thanks.

-- Hal
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1

2009-06-17 Thread Hal Rosenstock
On Wed, Jun 17, 2009 at 7:10 AM, Eli Cohen <e...@dev.mellanox.co.il> wrote:
 On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
 Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports ?
 You can't send anything over QP0 because it is not created and so
 there are no data structs corresponding to it.

Yes, I understand that's the intention but I didn't see where a MAD
posted to QP0 returns an error. Does that occur ? Or is it just
silently dropped ?

 What QP1 sends are allowed ?
 Basically, all QP1 sends are allowed without any changes - QP1
 functions as normal.
 However, RDMAoE will initially support only the CM. In the future, we
 can support additional QP1 services.

So what happens with other QP1 sends now ? Do they go into hyperspace
and then timeout ?

 Is it only SA sends which are faked ?
 Yes

 How
 are others handled ? These questions are important to what happens to
 the IB management/diag tools when invoked on an RDMAoE port. We need
 to be able to handle that scenario cleanly.
 You should be able to send MADs over QP1 from the kernel. I did not
 make any tests as for sending MADs from userspace but I can't think of
 a reason why this would be a problem.

There are a set of tools (infiniband-diags and ibutils) which do send
MADs from userspace. I'm concerned that if someone tries these, the
right thing will happen.

-- Hal


 Is something similar needed in ib_mad_port_close for handling no QP0
 on RDMAoE ports in terms of destroy_mad_qp/cleanup_recv_queue calls ?

 No, because it is handled inside destroy_mad_qp():

        if (!qp_info->qp)
                return;


___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1

2009-06-17 Thread Hal Rosenstock
On Wed, Jun 17, 2009 at 7:57 AM, Eli Cohen <e...@dev.mellanox.co.il> wrote:
 On Wed, Jun 17, 2009 at 07:14:56AM -0400, Hal Rosenstock wrote:
 On Wed, Jun 17, 2009 at 7:10 AM, Eli Cohen <e...@dev.mellanox.co.il> wrote:
  On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
  Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports ?
  You can't send anything over QP0 because it is not created and so
  there are no data structs corresponding to it.

 Yes, I understand that's the intention but I didn't see where a MAD
 posted to QP0 returns an error. Does that occur ? Or is it just
 silently dropped ?
 But you don't have a struct ib_qp * for QP0 that you could use to
 post MADs to QP0...

Understood.

Doesn't an error need to be returned for certain cases of invoking
ib_post_send_mad on this new port type (at least qp0 and maybe some
qp1 things) ? Look at user_mad.c:umad_write calling
ib_post_send_mad().



  What QP1 sends are allowed ?
  Basically, all QP1 sends are allowed without any changes - QP1
  functions as normal.
  However, RDMAoE will initially support only the CM. In the future, we
  can support additional QP1 services.

 So what happens with other QP1 sends now ? Do they go into hyperspace
 and then timeout ?
 SA joins and SA path queries are terminated at the driver. Otherwise,
 post sends on QP1 should be sent on the wire.

So these would just timeout if there's nothing there to consume them ?
Seems better to error them out at the sender.


 There are a set of tools (infiniband-diags and ibutils) which do send
 MADs from userspace. I'm concerned that if someone tries these, the
 right thing will happen.

 What exactly do you mean?

opensm and any IB diag (relying on QP1 and/or QP0) should error out at
the sender.
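
One way to read "error out at the sender" is a check in the post-send path
that fails immediately when the target special QP was never created, rather
than letting the request time out. A minimal sketch, assuming hypothetical
stand-in types and names (none of them come from the patch or from ib_mad):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the structures discussed in this thread. */
struct sketch_qp {
        int qpn;
};

struct sketch_port {
        /* Index 0 is QP0, index 1 is QP1; NULL means "never created". */
        struct sketch_qp *qp[2];
};

/* Hypothetical send path: report an error at the sender instead of
 * dropping the MAD silently or waiting for a timeout. */
static int sketch_post_send(const struct sketch_port *port, int qp_index)
{
        if (qp_index < 0 || qp_index > 1)
                return -EINVAL;
        if (!port->qp[qp_index])
                return -ENODEV;         /* e.g. QP0 on an RDMAoE port */
        return 0;                       /* the MAD would be queued here */
}

/* Self-check: a port that has QP1 but no QP0. */
static void sketch_post_send_demo(void)
{
        struct sketch_qp qp1 = { 1 };
        struct sketch_port rdmaoe = { { NULL, &qp1 } };

        assert(sketch_post_send(&rdmaoe, 0) == -ENODEV);
        assert(sketch_post_send(&rdmaoe, 1) == 0);
}
```

Whether -ENODEV is the right error code, and whether the check belongs in
ib_post_send_mad() or in umad_write(), is exactly the open question above;
the sketch only shows the shape of the check.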

From your patches, I'm also not sure how user space sees this port
type (umad needs to know about these port types). Maybe this needs
ioctl exposure (and another change to umad API if it's done this way
(rather than some other way which is what I think Sean was getting at
in his email on node type)); sigh :-(

-- Hal
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1

2009-06-15 Thread Hal Rosenstock
On Mon, Jun 15, 2009 at 3:34 AM, Eli Cohen <e...@mellanox.co.il> wrote:
 Since RDMAoE is using Ethernet there is no need for QP0. This patch will
 create only QP1 for RDMAoE ports.

Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports ?
What QP1 sends are allowed ? Is it only SA sends which are faked ? How
are others handled ? These questions are important to what happens to
the IB management/diag tools when invoked on an RDMAoE port. We need
to be able to handle that scenario cleanly.

Also, one minor comment below.

 Signed-off-by: Eli Cohen <e...@mellanox.co.il>
 ---
  drivers/infiniband/core/agent.c |   16 -
  drivers/infiniband/core/mad.c   |   48 
 ++-
  2 files changed, 47 insertions(+), 17 deletions(-)

 diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
 index ae7c288..658a278 100644
 --- a/drivers/infiniband/core/agent.c
 +++ b/drivers/infiniband/core/agent.c
 @@ -46,8 +46,10 @@
  #define SPFX "ib_agent: "

  struct ib_agent_port_private {
 -       struct list_head port_list;
 -       struct ib_mad_agent *agent[2];
 +       struct list_head        port_list;
 +       struct ib_mad_agent    *agent[2];
 +       struct ib_device       *device;
 +       u8                      port_num;
  };

  static DEFINE_SPINLOCK(ib_agent_port_list_lock);
 @@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int 
 port_num)
  {
        struct ib_agent_port_private *entry;

 -       list_for_each_entry(entry, &ib_agent_port_list, port_list) {
 -               if (entry->agent[0]->device == device &&
 -                   entry->agent[0]->port_num == port_num)
 +       list_for_each_entry(entry, &ib_agent_port_list, port_list)
 +               if (entry->device == device && entry->port_num == port_num)
                        return entry;
 -       }
 +
        return NULL;
  }

 @@ -175,6 +176,9 @@ int ib_agent_port_open(struct ib_device *device, int 
 port_num)
                goto error3;
        }

 +       port_priv->device = device;
 +       port_priv->port_num = port_num;
 +
        spin_lock_irqsave(&ib_agent_port_list_lock, flags);
        list_add_tail(&port_priv->port_list, &ib_agent_port_list);
        spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
 diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
 index de922a0..aadf396 100644
 --- a/drivers/infiniband/core/mad.c
 +++ b/drivers/infiniband/core/mad.c
 @@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct 
 ib_device *device,
        unsigned long flags;
        u8 mgmt_class, vclass;

 +       /* Validate device and port */
 +       port_priv = ib_get_mad_port(device, port_num);
 +       if (!port_priv) {
 +               ret = ERR_PTR(-ENODEV);
 +               goto error1;
 +       }
 +
 +       if (!port_priv->qp_info[qp_type].qp)
 +               return NULL;
 +
        /* Validate parameters */
        qpn = get_spl_qp_index(qp_type);
        if (qpn == -1)
 @@ -260,13 +270,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct 
 ib_device *device,
                        goto error1;
        }

 -       /* Validate device and port */
 -       port_priv = ib_get_mad_port(device, port_num);
 -       if (!port_priv) {
 -               ret = ERR_PTR(-ENODEV);
 -               goto error1;
 -       }
 -
        /* Allocate structures */
        mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL);
        if (!mad_agent_priv) {
 @@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent 
 *mad_agent)
        struct ib_mad_agent_private *mad_agent_priv;
        struct ib_mad_snoop_private *mad_snoop_priv;

 +       if (!mad_agent)
 +               return 0;
 +
        /* If the TID is zero, the agent can only snoop. */
        if (mad_agent->hi_tid) {
                mad_agent_priv = container_of(mad_agent,
 @@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info 
 *qp_info)
        struct ib_mad_private *recv;
        struct ib_mad_list_head *mad_list;

 +       if (!qp_info->qp)
 +               return;
 +
        while (!list_empty(&qp_info->recv_queue.list)) {

                mad_list = list_entry(qp_info->recv_queue.list.next,
 @@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private 
 *port_priv)

        for (i = 0; i < IB_MAD_QPS_CORE; i++) {
                qp = port_priv->qp_info[i].qp;
 +               if (!qp)
 +                       continue;
 +
                /*
                 * PKey index for QP1 is irrelevant but
                 * one is needed for the Reset to Init transition
 @@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private 
 *port_priv)
        }

        for (i = 0; i < IB_MAD_QPS_CORE; i++) {
 +               if (!port_priv->qp_info[i].qp)
 +                       continue;
 +
                ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL);
                if (ret) {
                        printk(KERN_ERR 

Re: [ewg] [ANNOUNCE] Yevgeny K is taking the maintenance of the ibutils package

2009-03-19 Thread Hal Rosenstock
On Thu, Mar 19, 2009 at 8:12 AM, Oren Kladnitsky
<or...@dev.mellanox.co.il> wrote:
 Hi.

 Yevgeny Kliteynik <klit...@dev.mellanox.co.il> is taking the maintenance of
 the ibutils package from me.

Will he be picking up the outstanding ibutils patches or do they need
to be resubmitted ?

-- Hal

 Ibutils git:
 git://git.openfabrics.org/~kliteyn/ibutils.git

 Thanks,
 Oren

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [ANNOUNCE] Yevgeny K is taking the maintenance of the ibutils package

2009-03-19 Thread Hal Rosenstock
Hi Yevgeny,

On Thu, Mar 19, 2009 at 9:40 AM, Yevgeny Kliteynik
<klit...@dev.mellanox.co.il> wrote:
 Hi Hal,

 Hal Rosenstock wrote:

 On Thu, Mar 19, 2009 at 8:12 AM, Oren Kladnitsky
 <or...@dev.mellanox.co.il> wrote:

 Hi.

 Yevgeny Kliteynik <klit...@dev.mellanox.co.il> is taking the maintenance of
 the ibutils package from me.

 Will he be picking up the outstanding ibutils patches or do they need
 to be resubmitted ?

 Are there any other outstanding ibutils patches?

Not AFAIK. Have the changes been pushed to the server ? Thanks.

-- Hal

 -- Yevgeny

 -- Hal

 Ibutils git:
 git://git.openfabrics.org/~kliteyn/ibutils.git

 Thanks,
 Oren




___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation

2009-02-26 Thread Hal Rosenstock
On Thu, Feb 26, 2009 at 11:38 AM, Ramachandra K
<ramachandra.kuchiman...@qlogic.com> wrote:
 Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.
 In the 8-bit field of the RMPP header, the first 5 bits
 are RRespTime and next 3 bits are RMPPFlags. Hence to retain
 the first 5 bits, the mask should be 0xF8 instead of 0xF1.

 Signed-off-by: Ramachandra K <ramachandra.kuchiman...@qlogic.com>
 ---

 diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
 index 5f6c40f..1a0f409 100644
 --- a/include/rdma/ib_mad.h
 +++ b/include/rdma/ib_mad.h
 @@ -290,7 +290,7 @@ static inline void ib_set_rmpp_resptime(struct 
 ib_rmpp_hdr *rmpp_hdr, u8 rtime)
  */
  static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags)
  {
 -       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) |
 +       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF8) |

Looks right to me. Sean ?

-- Hal

                                     (flags & 0x7);
  }
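
To see why the mask matters, here is a small standalone sketch (not the
kernel code; the macro and helper names are made up for illustration)
comparing the old 0xF1 mask with the corrected 0xF8 mask. It assumes the
layout given in the patch description: the upper 5 bits of the byte are
RRespTime and the lower 3 bits are RMPPFlags.

```c
#include <assert.h>
#include <stdint.h>

/* Layout from the patch description: bits 7..3 hold RRespTime,
 * bits 2..0 hold RMPPFlags. Names below are illustrative only. */
#define RTIME_MASK 0xF8u
#define FLAGS_MASK 0x07u

/* Corrected behaviour: keep all 5 RRespTime bits, replace the 3 flag bits. */
static uint8_t set_flags_fixed(uint8_t rtime_flags, uint8_t flags)
{
        return (uint8_t)((rtime_flags & RTIME_MASK) | (flags & FLAGS_MASK));
}

/* Old behaviour: 0xF1 clears bit 3 (the low bit of RRespTime) and
 * keeps bit 0 (a stale flag bit). */
static uint8_t set_flags_old(uint8_t rtime_flags, uint8_t flags)
{
        return (uint8_t)((rtime_flags & 0xF1u) | (flags & FLAGS_MASK));
}

/* Self-check: RRespTime = 0x1F, old flags = 0x1, new flags = 0x2. */
static void rmpp_mask_demo(void)
{
        uint8_t start = (uint8_t)((0x1Fu << 3) | 0x1u);  /* 0xF9 */

        /* Fixed mask: RRespTime stays 0x1F, flags become 0x2. */
        assert(set_flags_fixed(start, 0x2) == 0xFA);
        /* Old mask: RRespTime drops to 0x1E, flags end up 0x3. */
        assert(set_flags_old(start, 0x2) == 0xF3);
}
```

With the old mask both fields come out wrong in this case: one bit of
RRespTime is lost and the previous flag bit 0 leaks into the new flags.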





___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: Agenda for the OFED meeting today (Jan 5, 09)

2009-01-05 Thread Hal Rosenstock
Jeff,

On Mon, Jan 5, 2009 at 3:26 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
 I chatted with John and Todd from QL on the phone today -- we basically came
 to the same conclusion:

 - need to beef-up opensm to be able to scalably handle lots of incoming path
 record lookups

This is the most obvious SA scalability issue but there are some
others which may be important (related to SA caching rather than SA
distribution as an approach).

 - need to beef-up the CM clients on the host (maybe; this work might already
 be done?)
 - need to see the current status of the SA caching stuff / re-open that
 discussion to see if the work can be completed, etc.

IMO this will aggravate other SA scalability issues as well as there
being other limitations with this approach.

Don't get me wrong; I'm all for improving the SA scalability; there's
no quick solution to this AFAIK.

It would be interesting to see an apples to apples comparison of
OpenSM and proprietary SMs in terms of running on the same hardware
and the transaction rate for various things.

I think this warrants an open discussion if people are serious about
working on this issue.

 It might also be worthwhile to start a whole new discussion about making a
 better CM (at least from the ULP perspective). One that offers simple
 mechanisms for those who don't need/care about the details, but also offers
 complex/detailed mechanisms (perhaps remarkably like today's mechanisms).

I've heard similar comments before but this too will take significant
wherewithal IMO.

-- Hal
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Have you opened bugs to OFED 1.4

2008-10-22 Thread Hal Rosenstock
Tziporet,

On Sun, Oct 19, 2008 at 8:48 AM, Tziporet Koren
[EMAIL PROTECTED] wrote:
 I mean the bugs you explained in the last OFED meeting.

Please see that bug 1290 "2 port switches are treated as CAs for
fabric qualities in ibdiagnet" is resolved in the next spin of OFED 1.4.
Proposed patch is attached to bug and also posted as:

http://lists.openfabrics.org/pipermail/general/2008-October/054873.html

Thanks.

-- Hal


 Thanks
 Tziporet

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

