Re: [ewg] EWG/OFED meeting minutes for July 24, 2012
> OFED 3.5:
> =========
> 1. Kernel base: The move to kernel 3.5 GA will be done this week.
>
> 2. Backports:
> RHEL 6.2, 6.3 and SLES 11 SP2 - available today
> Low level drivers: mlx4 (core & ib), nes
> Missing: mlx4_en - Mellanox, cxgb - Chelsio, qib - Intel
>
> 3. RC1:
> If everyone provides backports by Tue, July 31, we will be able to release
> RC1 on Aug 2.
> - Mellanox is committed.
> - Need answers from Intel (Tom) and Chelsio (Steve)
>
> 4. User space:
> A new uDAPL package is in the latest OFED-3.5 build.
> Need to include the new librdmacm-1.0.16-1.src.rpm and ibacm-1.0.7-1.src.rpm
> packages.
> Management - Alex - is what we have OK?

There will be another OpenSM release on Wed, Aug 1, which will include the
latest bug fixes and some new contributed features, such as:
- Per Module Logging support
- Congestion Control support
- Perf_mgr extensions

> Diagnostic tools - Ira - is what we have OK?
>
> 5. Release schedule:
> Will decide in next meeting - assuming RC1 will be at end of next week
> and testing will start.
>
> Tziporet

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hi Hector,

Few more questions.
Does this happen to you only when you try to shut down OpenSM on reboot?
What is the host CPU architecture? x86/x86_64/ppc?

> -Original Message-
> From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Hal Rosenstock
> Sent: Thursday, December 15, 2011 9:06 PM
> To: Hector Abrach
> Cc: ewg@lists.openfabrics.org
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>
> Hector,
>
> On 12/15/2011 12:49 PM, Hector Abrach wrote:
> > Hal,
> >
> > Thank you for the response. To address your questions:
> >
> >> So the switch stays up and the servers (including the one OpenSM is
> >> on) are rebooted, right ?
> >
> > Right.
> >
> >> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
> >> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
> >
> > Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
> > changes I had to make were made to some #define libraries.
> > The big changes were made for the driver, not so much OpenSM.
>
> I would think there are also changes for porting of complib to QNX. Do you
> use osm_vendor_ibumad.c as the OpenSM vendor layer ?
>
> > I'm using IBNet 1.3.
>
> What's IBNet 1.3 ? I'm not familiar with that.
>
> > OpenSM always runs on the same one server, the others don't run it.
>
> Understood.
>
> >> Is the topology the 7 servers and the 1 switch and if you use other
> >> switches you don't see this issue ?
> >
> > That's correct, the topology is 7 servers and 1 switch. We typically
> > use fewer servers (4) for our application but the problem is more
> > easily reproducible with more servers, so we have a 7 server setup with
> > 1 switch. We don't have a great selection of switches but I know our
> > previous switch did not cause this problem. Our intention is to go to
> > production with this new switch but we can't release until we find an
> > acceptable solution.
> >
> >> I can see the responses but not the requests. What verbosity level did
> >> you use ?
> >
> > I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want
> > to do -D 0xFF because I know this fixes the problem for sure.
>
> I think -D 0x23 (error, info, frames) would do the trick...
>
> > In summary:
> > 1. Knowing that the system gets stuck in osm_vendor_ibumad.c ->
> > umad_receiver() -> "for(;;)" but keeps running properly in
> > main.c -> osm_manager_loop().
> > 2. If I use -D 0xFF the problem is completely fixed.
> > 3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
> > value the problem is completely fixed.
> > 4. The failure always occurs with qp0_mads_outstanding of 1
> > remaining.
> > What do you think could be wrong?
> > Do you think the driver could be the problem?
>
> Yes; the thing that I think is a likely suspect and may be missing and causing
> this issue is the (built in to kernel MAD in Linux) timeout retry code for MAD
> transactions which, if the timeout/retries are exhausted, triggers a send error
> (callback). Is that implemented ?
>
> However, I don't have a good explanation for why you see this now and not
> before with your other switches but maybe that's not important.
>
> > What debug command should I use to see the sent requests?
>
> See above.
>
> -- Hal
>
> > Thank you
> >
> > Hector Abrach
> >
> > From: Hal Rosenstock
> > To: Hector Abrach
> > Cc: ewg@lists.openfabrics.org
> > Date: 12/14/2011 08:23 PM
> > Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> >
> > Hector,
> >
> > On 12/14/2011 1:41 PM, Hector Abrach wrote:
> >> Hal,
> >>
> >> Sorry for the multiple emails, but I was thinking how it may be a
> >> "freeze /stall" rather than a time out. One reason is that it
> >> doesn't send an error message; it is as if the log completely dies.
> >
> > So nothing interesting in the log...
> >
> >> However, in
> >> file osm_vendor_ibumad.c under function umad_receiver there is an
> >> infinite loop "for(;;)" which seems to die when I get to that
> >> previously discussed vl15_poller. I checked to see if it breaks out
> >> of the loop but it doesn't seem to.
> >
> > It never breaks out of that loop except when OpenSM is shutting down.
> > That's the basic receive loop.
> >
> > -- Hal
> >
> >> I'm not sure if this may be an additional hint.
> >> Thank you
> >>
> >> Hector Abrach
> >>
> >> From: Hector Abrach
> >> To: Hal Rosenstock
> >> Cc: ewg@lists.openfabrics.org
> >> Date: 12/14/2011 11:15 AM
> >> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> >> Sent by: ewg-boun...@lists.openfabrics.org
> >>
> >> Hal,
> >>
> >> Thank yo
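Hal's suspicion is that the QNX driver may lack the MAD timeout/retry handling that the Linux kernel MAD layer provides, so a response that never arrives never produces a send error, and the outstanding count (qp0_mads_outstanding in Hector's report) never drains back to zero. The following is a minimal, hypothetical sketch of that bookkeeping, assuming a simple tick-based clock; struct mad_txn, mad_txn_start, mad_txn_complete and mad_txn_poll are illustrative names only, not OpenSM, libibumad, or driver APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one outstanding SMP. These names are
 * illustrative and do not come from OpenSM or any driver. */
struct mad_txn {
    bool     active;        /* still waiting for a response */
    unsigned retries_left;  /* resends still allowed */
    unsigned timeout;       /* ticks allowed per attempt */
    unsigned deadline;      /* tick at which the current attempt expires */
};

/* Send-error hook, analogous in spirit to the callback Hal describes. */
typedef void (*send_err_cb)(struct mad_txn *txn, void *ctx);

/* Record a new send and count it as outstanding. */
static void mad_txn_start(struct mad_txn *txn, unsigned now, unsigned timeout,
                          unsigned retries, unsigned *outstanding)
{
    txn->active = true;
    txn->retries_left = retries;
    txn->timeout = timeout;
    txn->deadline = now + timeout;
    (*outstanding)++;
}

/* A response arrived: retire the transaction normally. */
static void mad_txn_complete(struct mad_txn *txn, unsigned *outstanding)
{
    if (txn->active) {
        txn->active = false;
        (*outstanding)--;
    }
}

/* Periodic timeout check. While retries remain, the MAD would be resent
 * (not modeled here) and the deadline pushed out. Once retries are
 * exhausted, the transaction is retired and the send-error callback
 * fires, so the outstanding count cannot stay stuck above zero. */
static void mad_txn_poll(struct mad_txn *txn, unsigned now,
                         unsigned *outstanding, send_err_cb on_err, void *ctx)
{
    if (!txn->active || now < txn->deadline)
        return;
    if (txn->retries_left > 0) {
        txn->retries_left--;
        txn->deadline = now + txn->timeout;
        return;
    }
    txn->active = false;
    (*outstanding)--;
    if (on_err != NULL)
        on_err(txn, ctx);
}
```

Starting a transaction and then only polling, never calling mad_txn_complete, models a lost response: once the retries are used up, the outstanding count drops back to 0 and the error hook runs exactly once. If the timeout path is missing entirely, the count stays at 1, which matches the symptom reported above.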
Re: [ewg] EWG/Meeting agenda for today - 28-Mar, 2011
On 16:14 Mon 28 Mar, Tziporet Koren wrote:
>
> EWG/OFED Agenda for today:
>
> 1. OFED 1.6 schedule:
> ---------------------
> - Move to kernel 2.6.38 (since it is already GA, and we have not started
> backports)
> - Ongoing work on backports - during Q2
> - First RC - end of June
> RCs every 2 weeks
> - GA - End of Aug
>
> 2. OFED 1.6 main features:
>
> - Mellanox: CX3 support
> - SRIOV support for mlx4 with CX2 & CX3
> - FDR support
> - New OSes support: As usual the latest OSes will be supported
> - Remove MPI packages from OFED
> - New management package: Alex - please send details

OpenSM main improvements:
1. Torus-2QoS routing engine
2. Performance Manager improvements: improved redirection and extended
   counters support
3. Additional port balancing options for routing
4. More bug fixes

Alex.
[ewg] [ANNOUNCE] opensm tarballs release
Hi,

There is a new release of the OpenSM tarball available in:

http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

5e9b461073f7cfbafe0207e014796f9f opensm-3.3.9.tar.gz

All component versions are from the recent master branch. The full list of
changes is below.

OpenSM:

Alex Netes (1):
      opensm: fixed memory leak in multicast spanning tree calculation
Re: [ewg] OFA management tree separation
On 15:59 Sun 13 Feb, Jason Gunthorpe wrote:
> On Sun, Feb 13, 2011 at 09:55:50AM +0200, Alex Netes wrote:
> >
> > We finished the management tree separation.
> > From now on, Ira Weiny takes responsibility for
> > maintaining libibmad and infiniband-diags. His trees are:
> >
> > git://git.openfabrics.org/~iraweiny/libibmad
> > git://git.openfabrics.org/~iraweiny/infiniband-diags
> >
> > The libibumad, opensm and ibsim trees stay under my responsibility:
> >
> > git://git.openfabrics.org/~alexnetes/libibumad
> > git://git.openfabrics.org/~alexnetes/opensm
> > git://git.openfabrics.org/~alexnetes/ibsim
>
> Can you please include the OpenSM 3.2.6 and related branch and tags in
> your repository?

Done. OpenSM 3.2.6 resides on the opensm-3.2 branch, as before.
[ewg] [ANNOUNCE] management tarballs release
Hi,

There is a new release of the management (OpenSM and infiniband diagnostics)
tarballs available in:

http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

c0b24a1053ae8b0b3caf5950b3ede6dc infiniband-diags-1.5.8.tar.gz
c2755aa360d3f29d04865ba4e2454a98 libibmad-1.3.7.tar.gz
c7575b7620615d7dfa1c7fdbbd310ec7 libibumad-1.3.7.tar.gz
df051f5f0192d369b0b904147cb045a8 opensm-3.3.8.tar.gz

All component versions are from the recent master branch. The full list of
changes is below.

OpenSM:
=======

Alex Netes (1):
      opensm: fixed getline pointer allocation free in osm_console_io

Eli Dorfman (Voltaire) (1):
      Wrong handling of MC create and delete traps

Hal Rosenstock (6):
      opensm/osm_state_mgr.c: Don't signal DISCOVER to SM state machine when already DISCOVERING
      opensm: Fix some typos
      osmtest/osmt_service.c: In osmt_run_service_records_flow, add missing status
      opensm/osm_ucast_ftree: When roots are not connected, update hop count but not lft
      opensm/osm_trap_rcv.c: No need to check for sweep for trap 145
      opensm: Add support for SwitchInfo:MulticastFDBTop

Ira Weiny (1):
      Add node/port/qos information to some error messages

Jason Gunthorpe (1):
      Fix autotools to include the necessary M4 files

Sasha Khapyorsky (3):
      opensm/sa: simplify osm_mcmr_rcv_find_or_create_new_mgrp() function call
      opensm/osm_node_info_rcv.c: move p_physp declaration under code block
      opensm/osm_db_files.c: malloc() return value run-time check

Stan C. Smith (2):
      replace (long*)(long) casting with transportable data type (uintptr_t)
      replace (long*)(long) casting with transportable data type (uintptr_t)

Yevgeny Kliteynik (28):
      opensm/osm_qos_policy.c: change a log message
      opensm/osm_prtn.c: removing TopSpin hack
      libvendor/osm_vendor_ibumad_sa.c: remove useless "if" statement
      libvendor/osm_vendor_mlx_sa.c: remove useless "if" statement
      opensm/osm_mtree.c: removing useless 'if' statement
      opensm/osm_sminfo_rcv.c: removing unused variable
      opensm/osm_pkey.c: removing unused function
      opensm/osm_sa_pkey_record.c: removing unused variable
      opensm/osm_sa_vlarb_record.c: removed unused variable
      opensm/osm_node_info_rcv.c: remove useless code line
      osmtest/osmtest.c: handle timeouts in PR stress test
      opensm/osm_helper.c: fix potential overrun of the array
      opensm/osm_helper.c: cosmetics - move define closer to the relevant code
      opensm/osm_mesh.c: fixing a bug in compare_switches()
      opensm/osm_subnet.c: fixing small bug in error path
      opensm/osm_db_files.c: fix small memory leak
      osmtest/osmt_slvl_vl_arb.c: handling fopen() failure
      opensm/osm_helper.c: use ARR_SIZE macro instead of hardcoded values
      osm_vl15intf.c: fixing use-after-free coredump
      opensm/osm_trap_rcv.c: fix possible core dump
      opensm/osm_ucast_ftree.c: fix small memory leak in error path
      opensm/osm_ucast_ftree.c: fixing another memory leak at error path
      opensm/osm_ucast_lash.c: small bug in calculating allocated size
      opensm/osm_pkey_mgr.c: fixing small memory leak
      opensm/osm_ucast_file.c: closing file descriptor in error path
      opensm/osm_qos_parser_y.y: fixing bunch of memory leaks on invalid values
      opensm/osm_console.c: fix memory and file descriptor leaks
      opensm/st.c: fix potential core dumps

libibumad:
==========

Jason Gunthorpe (1):
      Fix autotools to include the necessary M4 files

Mike Heinz (1):
      FW: [PATCH] umad_send.3 (man page)

Yevgeny Kliteynik (1):
      umad.{c,h}: moving stdlib.h include from C to H file

libibmad:
=========

Ira Weiny (1):
      libibmad/fields.c: Change all PortCounter names to match the Specification

Jason Gunthorpe (1):
      Fix autotools to include the necessary M4 files

infiniband-diags:
=================

Albert Chu (4):
      add --diff support to iblinkinfo
      support --diffcheck in iblinkinfo
      Add lid and node description diff options for --diffcheck in iblinkinfo
      support --filterdownports in iblinkinfo

Alex Netes (3):
      Makefile: ChangeLog and version generation script path fix
      infiniband-diags: update shared library versions
      infiniband-diags: package versions update

Eli Dorfman (Voltaire) (2):
      infiniband-diags: Do not exit when unexpected node found
      inifiband-diags: Support Voltaire switch ISR4200

Hal Rosenstock (3):
      infiniband-diags/ibtracert: Eliminate direct route (-D) option
      infiniband-diags/saquery.c: In dump_one_mcmember_record, fix flow label endian
      infiniband-diags/iblinkinfo.c: Limit some queries to switches

Ira Weiny (4):
      libibmad/fields.c: Change all PortCounter names to match the Specification
      infiniband-diags: Verify timeout value specified to diagnostics
      Further timeout paramater verifica
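The two Stan C. Smith commits in the OpenSM section replace (long*)(long) casts with uintptr_t. A minimal sketch of the general idea, assuming the usual portability motivation (the commit subjects alone do not say which platform prompted the change): on LLP64 targets such as 64-bit Windows, long is 32 bits while pointers are 64 bits, so pushing a pointer through long can truncate it, whereas uintptr_t from <stdint.h> can hold any converted void *. The helper names ptr_to_handle and handle_to_ptr are hypothetical and do not come from the OpenSM sources.

```c
#include <stdint.h>

/* Hypothetical helpers: stash a pointer in an integer-typed field (for
 * example an opaque "context" slot) and recover it later. */
static inline uintptr_t ptr_to_handle(void *p)
{
    /* uintptr_t is optional in C99/C11 but provided on mainstream
     * platforms; when present it is wide enough for any void *. */
    return (uintptr_t)p;
}

static inline void *handle_to_ptr(uintptr_t h)
{
    /* Converting back yields a pointer that compares equal to the
     * original void *. */
    return (void *)h;
}

/* Casting through long instead, as the (long*)(long) pattern in the commit
 * subjects suggests, is only safe where long is at least pointer-sized;
 * on LLP64 targets the upper half of the address can be lost. */
```

The round trip handle_to_ptr(ptr_to_handle(p)) == p is the guarantee the standard gives for uintptr_t; no such guarantee exists for long.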
[ewg] OFA management tree separation
We finished the management tree separation.
From now on, Ira Weiny takes responsibility for maintaining libibmad and
infiniband-diags. His trees are:

git://git.openfabrics.org/~iraweiny/libibmad
git://git.openfabrics.org/~iraweiny/infiniband-diags

The libibumad, opensm and ibsim trees stay under my responsibility:

git://git.openfabrics.org/~alexnetes/libibumad
git://git.openfabrics.org/~alexnetes/opensm
git://git.openfabrics.org/~alexnetes/ibsim

Alex.
Re: [ewg] Current administrator for git accounts on openfabrics.org
Hi Ira,

Who would I contact for a git account on git.openfabrics.org/git?

> Ken Strandberg is sysadmin in openfabrics.org and he
> would be happy to assist you.

Thanks,
Alex.