Re: [openib-general] opensm issue

2007-02-26 Thread Hal Rosenstock
Hi Ashish, On Mon, 2007-02-26 at 16:04, Batwara, Ashish wrote: > Hi, > I am trying to bring up opensm, but it is not letting me. When I look at > the /var/log/messages, I see that it becomes UP for a moment and then > again it goes down. Look for " SUBNET UP " in below logs. Can anyone > know what t

Re: [openib-general] OpenSM core dump - file size exceeded

2007-01-11 Thread Sean Hefty
> Looking at the log file, the problem appears to be related to: > > http://openib.org/pipermail/openib-general/2006-December/029962.html This should be fixed in my rdma-dev tree. The problem was that a patch got lost moving between svn and git that caused failed multicast requests to be retrie

Re: [openib-general] OpenSM core dump - file size exceeded

2007-01-10 Thread Hal Rosenstock
On Fri, 2006-12-15 at 17:05, Woodruff, Robert J wrote: > Hal wrote, > >Any idea what filled up the log ? but that's a side issue. > > Yes we were getting a bunch of multicast errors, Sean is investigating > this. > > >This has been discussed on the list before. This is one option which > can > >

Re: [openib-general] OpenSM log growing too big

2006-11-16 Thread Venkatesh Babu
Hal Rosenstock wrote: > Yes, LID 5 is a switch LID and there is a port which is flapping. Bad > cable ? > > When this port is disconnected the OpenSM stops logging these messages. It could have been a bad connection. > The code is reducing the messages which are similar (approx 128 traps).

Re: [openib-general] opensm problem

2006-11-16 Thread Steve Wise
> IGMP turned on where? > | > Not sure what turns this on. I think IP multicast needs to be > configured in the kernel. I don't think it is automatic although that > might be the default config. Also, using IP multicast (via Sean's > multicast code) likely causes IGMP to be used so the routers kn

Re: [openib-general] opensm problem

2006-11-16 Thread Hal Rosenstock
Steve, See ... embedded comments below. -- Hal From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Thu 11/16/2006 4:51 PM To: Hal Rosenstock Cc: openib-general@openib.org Subject: RE: [openib-general] opensm problem On Thu, 2006-11-16 at 23:37 +0200, Hal

Re: [openib-general] opensm problem

2006-11-16 Thread Steve Wise
On Thu, 2006-11-16 at 23:37 +0200, Hal Rosenstock wrote: > Steve, > > Did you configure the kernel differently ? Is IGMP turned on somehow ? > (I haven't run with Sean's multicast code.) IGMP turned on where? > > BTW, as I mentioned, this can be solved on the client side equally as > well a

Re: [openib-general] OpenSM log growing too big

2006-11-16 Thread Hal Rosenstock
Hi Venkat, See embedded ... comments below. -- Hal From: Venkatesh Babu [mailto:[EMAIL PROTECTED] Sent: Thu 11/16/2006 1:39 PM To: Hal Rosenstock Cc: openib-general@openib.org Subject: Re: [openib-general] OpenSM log growing too big Hal Rosenstock wrote

Re: [openib-general] opensm problem

2006-11-16 Thread Hal Rosenstock
lto:[EMAIL PROTECTED] Sent: Thu 11/16/2006 4:32 PM To: Hal Rosenstock Cc: openib-general@openib.org Subject: RE: [openib-general] opensm problem On Thu, 2006-11-16 at 23:28 +0200, Hal Rosenstock wrote: > Steve, > > Those messages mean that you are joining a MC group which is not > alr

Re: [openib-general] opensm problem

2006-11-16 Thread Steve Wise
On Thu, 2006-11-16 at 23:28 +0200, Hal Rosenstock wrote: > Steve, > > Those messages mean that you are joining a MC group which is not > already created. The MGID of 0xff12401b : 0x0016 > is for 224.0.0.22. That is for IGMP on your IPoIB subnet. The group > either needs to be

Re: [openib-general] opensm problem

2006-11-16 Thread Hal Rosenstock
Steve, Those messages mean that you are joining a MC group which is not already created. The MGID of 0xff12401b : 0x0016 is for 224.0.0.22. That is for IGMP on your IPoIB subnet. The group either needs to be preconfigured or the "first" joiner needs to create the group (wh
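As a rough illustration of how an IPv4 multicast address such as 224.0.0.22 ends up as an IPoIB MGID, here is a minimal sketch that follows the RFC 4391 layout (0xff, flags/scope, the 0x401b IPv4 signature, the P_Key, then the low 28 bits of the group address). The function names are made up for this example, and the default P_Key of 0xffff and link-local scope of 0x2 are assumptions; the MGID quoted in the message above is partly elided in the archive, so this is not claimed to reproduce it exactly.

```python
import ipaddress

# IPv4-over-InfiniBand signature field, as defined in RFC 4391.
IPOIB_IPV4_SIGNATURE = 0x401B


def ipoib_ipv4_mgid(group: str, pkey: int = 0xFFFF, scope: int = 0x2) -> bytes:
    """Build the 16-byte IPoIB MGID for an IPv4 multicast group.

    pkey and scope defaults are assumptions (default partition, link-local).
    """
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low28 = int(addr) & 0x0FFFFFFF  # only the low 28 bits are carried
    return (
        bytes([0xFF, 0x10 | (scope & 0x0F)])       # 0xff, flags=0x1, scope
        + IPOIB_IPV4_SIGNATURE.to_bytes(2, "big")  # 0x401b
        + (pkey & 0xFFFF).to_bytes(2, "big")       # partition key
        + bytes(6)                                 # 48 zero bits
        + low28.to_bytes(4, "big")                 # 4 zero bits + 28-bit group
    )


def format_mgid(mgid: bytes) -> str:
    """Render an MGID as eight colon-separated 16-bit hex groups."""
    return ":".join(mgid[i:i + 2].hex() for i in range(0, 16, 2))


if __name__ == "__main__":
    print(format_mgid(ipoib_ipv4_mgid("224.0.0.22")))
```

Under these assumptions the leading ff12:401b and trailing 0016 line up with the fragments visible in the quoted MGID.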

Re: [openib-general] OpenSM log growing too big

2006-11-16 Thread Venkatesh Babu
Hal Rosenstock wrote: > Not sure what question you are asking exactly. > > Is it what do those messages mean or the file getting large or both ? > > Both. The message looks like LID 5 is generating too many events. The log file grows a few MBs a second. Whatever the problem with the port it

Re: [openib-general] OpenSM log growing too big

2006-11-16 Thread Hal Rosenstock
Not sure what question you are asking exactly. Is it what do those messages mean or the file getting large or both ? What options are you using on OpenSM startup ? Also, any chance you can move forward on a more recent and better OpenSM ? -- Hal From: [EMA

Re: [openib-general] opensm crash with topspin HCA

2006-11-02 Thread Hal Rosenstock
On Thu, 2006-11-02 at 13:33, Viswanath Krishnamurthy wrote: > > When we run opensm (OFED) release and if a Topspin HCA is in the IB > network, opensm crashes in umad_receiver with NULL pointer exception. > The transaction ID is zero in the MADs from the Topspin HCA on Windows. > The crash seems to

Re: [openib-general] opensm crash with topspin HCA

2006-11-02 Thread Sasha Khapyorsky
On 10:33 Thu 02 Nov , Viswanath Krishnamurthy wrote: > When we run opensm (OFED) release and if a Topspin HCA is in the IB network, > opensm crashes in umad_receiver with NULL pointer exception. Do you have any logs, gdb backtrace or any other details? Sasha > The > transaction ID is zero i

Re: [openib-general] OpenSM unneeded/no longer used header files

2006-10-31 Thread Sasha Khapyorsky
On 13:35 Tue 31 Oct , Hal Rosenstock wrote: > The following OpenSM header files appear to be unused: > > 183 osm_errors.h > 230 osm_ft_config_ctrl.h > 291 osm_mcast_config_ctrl.h > 289 osm_pi_config_ctrl.h > 289 osm_pkey_config_ctrl.h > 297 osm_sm_info_get_ctrl.h > 290 osm_subnet

Re: [openib-general] OpenSM -> 'open /dev/infiniband/umad1 failed'

2006-09-26 Thread Sasha Khapyorsky
Hi, On 04:21 Wed 27 Sep , [EMAIL PROTECTED] wrote: > I'm trying to setup OpenSM on one of our boxes. I've installed the RPMs from > ofed-1.0-sles10-rpms_i686.tar.gz and updated the firmware on our Mellanox > card. > When I try to start opensm I get the following error message: > 'umad_open

Re: [openib-general] OpenSM -> 'open /dev/infiniband/umad1 failed'

2006-09-26 Thread Hal Rosenstock
Hi, Do you have udev installed and configured ? You may want to refer to the wiki (https://openib.org/tiki/tiki-index.php) for more troubleshooting info. There's some info in the cheat sheet (https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet) which may help. -- Hal _

Re: [openib-general] OpenSM Multiple HCA cards on the same host

2006-09-12 Thread Hal Rosenstock
Hi Michael, On Tue, 2006-09-12 at 07:20, Michael Arndt wrote: > Hi, > > in the osm/docs Which doc ? BTW, what version of OpenSM are you using ? > is mentioned that at the next release multiple HCA cards on > the same host will be supported. If I understand your question correctly, OpenIB Op

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-08 Thread Michael S. Tsirkin
Quoting r. Doug Ledford <[EMAIL PROTECTED]>: > What you just argued for is the opposite of both of those accepted and > commonly used practices. Not only that, but since all the old code > *could* be made to work with the new API using nothing more than a macro > in a header file, to argue for an

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-07 Thread Doug Ledford
On Thu, 2006-09-07 at 09:22 +0300, Michael S. Tsirkin wrote: > Quoting r. Doug Ledford <[EMAIL PROTECTED]>: > > Subject: Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather > > than polluting namespace > > > > On Wed, 2006-09-06 at 18:1

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Doug Ledford <[EMAIL PROTECTED]>: > Subject: Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather > than polluting namespace > > On Wed, 2006-09-06 at 18:16 +0300, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Doug Ledford
On Wed, 2006-09-06 at 18:16 +0300, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than polluting > > namespace > > > > On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > > > Quoting r. Hal Rosenstock

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than > polluting namespace > > On Wed, 2006-09-06 at 12:10, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > > > > (which, as I understand it, is really only

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Hal Rosenstock
On Wed, 2006-09-06 at 12:10, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > > > (which, as I understand it, is really only an issue because opensm can > > > > log so much), > > > > which is what this entire patch series was designed to > > > > address. They are two

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > > (which, as I understand it, is really only an issue because opensm can > > > log so much), > > > which is what this entire patch series was designed to > > > address. They are two different problem spaces. > > > > So ... wouldn't it be better t

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Hal Rosenstock
On Wed, 2006-09-06 at 11:27, Michael S. Tsirkin wrote: > Quoting r. Doug Ledford <[EMAIL PROTECTED]>: > > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than polluting > > namespace > > > > On Wed, 2006-09-06 at 10:14 -0400, Hal Rosenstock wrote: > > > On Wed, 2006-09-06 at 09:42, Mi

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > > It is an upward compatible change so is low risk. > > > > Not sure what do you mean by upward compatible. This API change does not > > seem to be backward compatible - won't it break building dependent > > applications? > > We are talking about

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Hal Rosenstock
On Wed, 2006-09-06 at 11:16, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than polluting > > namespace > > > > On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > > > Quoting r. Hal Rosenstock <[EMA

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Doug Ledford <[EMAIL PROTECTED]>: > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than polluting > namespace > > On Wed, 2006-09-06 at 10:14 -0400, Hal Rosenstock wrote: > > On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > > > > Nor is this feature uncontroversia

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > Subject: Re: OpenSM/osm_log API: Use symbol versions rather than polluting > namespace > > On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > > Subject: OpenSM/osm_log API: Use symbol versi

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Doug Ledford
On Wed, 2006-09-06 at 10:14 -0400, Hal Rosenstock wrote: > On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > > Nor is this feature uncontroversial. Would not support for log rotation > > be better? If you are just going to do log rotation, then no need to change opensm, just add an appropr

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Hal Rosenstock
On Wed, 2006-09-06 at 09:42, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > > Subject: OpenSM/osm_log API: Use symbol versions rather than polluting > > namespace > > > > OpenSM/osm_log API: Rather than polluting the namespace with needless > > symbols, use symbol ve

Re: [openib-general] OpenSM/osm_log API: Use symbol versions rather than polluting namespace

2006-09-06 Thread Michael S. Tsirkin
Quoting r. Hal Rosenstock <[EMAIL PROTECTED]>: > Subject: OpenSM/osm_log API: Use symbol versions rather than polluting > namespace > > OpenSM/osm_log API: Rather than polluting the namespace with needless > symbols, use symbol versions and have a versioned osm_log_init rather > than adding osm_l

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Eitan Zahavi
2006 4:18 PM > To: Leonid Arsh > Cc: openib-general@openib.org > Subject: Re: [openib-general] OpenSM - guid2lid cache file questions > > Leonid, > > On Tue, 2006-09-05 at 09:13, Leonid Arsh wrote: > > Thanks, > > > > On 05 Sep 2006 08:46:22 -0400, Hal Ros

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Hal Rosenstock
Leonid, On Tue, 2006-09-05 at 09:13, Leonid Arsh wrote: > Thanks, > > On 05 Sep 2006 08:46:22 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote: > > > I have a problem when OpenSM, being started, reads an out-of-date > > > guid2lid file. > > > OpenSM changes LIDs in this case. > > > > How do you k

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Leonid Arsh
Thanks, On 05 Sep 2006 08:46:22 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote: > > I have a problem when OpenSM, being started, reads an out-of-date guid2lid > > file. > > OpenSM changes LIDs in this case. > > How do you know the file is "out of date" ? > Actually, the LIDs were assigned by ano

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Hal Rosenstock
Hi Leonid, On Tue, 2006-09-05 at 08:11, Leonid Arsh wrote: > Hi Hal, > > Thank you for your reply. > > Probably I wasn't clear. > > I have a problem when OpenSM, being started, reads an out-of-date guid2lid > file. > OpenSM changes LIDs in this case. How do you know the file is "out of date

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Leonid Arsh
Hi Hal, Thank you for your reply. Probably I wasn't clear. I have a problem when OpenSM, being started, reads an out-of-date guid2lid file. OpenSM changes LIDs in this case. I don't want the LIDs to be changed. As I understand it, the '-r' option, on the contrary, causes the SM to reassign al

Re: [openib-general] OpenSM - guid2lid cache file questions

2006-09-05 Thread Hal Rosenstock
Hi Leonid, On Tue, 2006-09-05 at 03:30, Leonid Arsh wrote: > Hi list, > > I have a question regarding the guid2lid cache file. > > The file is read by OpenSM on the start up. > OpenSM may reassign LIDs according to the LIDs saved in this file. > It isn't always acceptable. > > Is it a ri

Re: [openib-general] OpenSM partition Management

2006-08-25 Thread Sasha Khapyorsky
Hi, On 15:02 Fri 25 Aug , Venkatesh Babu wrote: > > The document OpenSM_PKey_Mgr.txt under link > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_PKey_Mgr.txt > > describes the roadmap for OpenSM partition management. It discusses two > phase implementation. > >

Re: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table)

2006-08-15 Thread Hal Rosenstock
On Mon, 2006-08-14 at 07:36, Dotan Barak wrote: > Hi. > > I noticed that the behavior of the openSM was changed in the latest driver: > > in the past, every HCA was configured (by the FW) with 0x in the first > entry. > today, Just as an FYI: I think that Anafas have this in the second entr

Re: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table)

2006-08-14 Thread Dotan Barak
On Monday 14 August 2006 16:09, Sasha Khapyorsky wrote: > > > > Why doesn't the SM print that this file was found? > > Yes, some prints may be helpful. Do you mean just log file or would prefer > the message on stdout too? I believe that most of the users don't look at the log file, so a message

Re: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table)

2006-08-14 Thread Sasha Khapyorsky
On 15:36 Mon 14 Aug , Dotan Barak wrote: > Thanks for the quick response. > > On Monday 14 August 2006 15:17, Sasha Khapyorsky wrote: > > Hi Dotan, > > > > On 14:36 Mon 14 Aug , Dotan Barak wrote: > > > Hi. > > > > > > I noticed that the behavior of the openSM was changed in the latest

Re: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table)

2006-08-14 Thread Dotan Barak
Thanks for the quick response. On Monday 14 August 2006 15:17, Sasha Khapyorsky wrote: > Hi Dotan, > > On 14:36 Mon 14 Aug , Dotan Barak wrote: > > Hi. > > > > I noticed that the behavior of the openSM was changed in the latest driver: > > > > in the past, every HCA was configured (by the F

Re: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table)

2006-08-14 Thread Sasha Khapyorsky
Hi Dotan, On 14:36 Mon 14 Aug , Dotan Barak wrote: > Hi. > > I noticed that the behavior of the openSM was changed in the latest driver: > > in the past, every HCA was configured (by the FW) with 0x in the first > entry. > today, the PKey table is being configured by the openSM: the fir

Re: [openib-general] openSM failover / failback issue?

2006-07-13 Thread Hal Rosenstock
On Wed, 2006-07-12 at 21:45, Hal Rosenstock wrote: [snip...] > > I don't know if this is an HCA firmware issues, switch issue, or openSM > > issue. > > I don't think it's related to my changes or osmtest at this point. > > I'll see if I can reproduce this tomorrow. I've followed your scenario

Re: [openib-general] openSM failover / failback issue?

2006-07-12 Thread Sean Hefty
>> I don't know if this is an HCA firmware issues, switch issue, or openSM >issue. >> I don't think it's related to my changes or osmtest at this point. > >I'll see if I can reproduce this tomorrow. > >Also, can you send me the guid2lid files from the 3 SMs ? I'll send this tomorrow. Before reloa

Re: [openib-general] openSM failover / failback issue?

2006-07-12 Thread Hal Rosenstock
On Wed, 2006-07-12 at 18:36, Sean Hefty wrote: > Hal Rosenstock wrote: > > With the default sminfo_polling_timeout of 10 seconds and default > > polling_retry_number of 4, so the total handoff time should be around 40 > > seconds. I just did that experiment with 2 SMs and saw that as well. > > Oka
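The 40-second figure quoted above is simply the polling timeout multiplied by the retry count. A minimal sketch of that arithmetic follows; the function name is invented for illustration, the values are taken in seconds as stated in the message, and the unit used in an actual OpenSM options file may differ.

```python
def estimated_handoff_seconds(sminfo_polling_timeout_s: float = 10,
                              polling_retry_number: int = 4) -> float:
    """Approximate time before a standby SM gives up on the master.

    Assumes the standby polls SMInfo once per timeout interval and takes
    over after polling_retry_number consecutive failures, as described in
    the message above.
    """
    return sminfo_polling_timeout_s * polling_retry_number


if __name__ == "__main__":
    # Defaults quoted in the thread: 10 s timeout, 4 retries.
    print(estimated_handoff_seconds())
```

Lowering either value shortens the handoff at the cost of more polling traffic or a greater chance of a spurious takeover.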

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-12 Thread Hal Rosenstock
On Wed, 2006-07-12 at 09:13, yipeeyipee yipeeyipee wrote: > --- Hal Rosenstock <[EMAIL PROTECTED]> wrote: > > [snip] > Should this IS_SM bit in port attributes be supported > in the switch hardware? If you are running an SM on your switch, the IS_SM bit would be on for port 0. Otherwise not. > >

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-12 Thread yipeeyipee yipeeyipee
--- Hal Rosenstock <[EMAIL PROTECTED]> wrote: [snip] Should this IS_SM bit in port attributes be supported in the switch hardware? > Yes (I'm pretty sure). The user_mad API has not > changed in quite some > time now. What ABI version is 2.6.14 ? I don't know where to check this. _

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-12 Thread Hal Rosenstock
On Tue, 2006-07-11 at 09:27, yipee wrote: > Hal Rosenstock voltaire.com> writes: > [snip] > > It's not the setting which is failing. You are likely not using an SM > > which supports this (it is an enhanced capability defined in a 1.2 > > erratum). Are you running a recent OpenSM or something els

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-11 Thread yipee
Hal Rosenstock voltaire.com> writes: [snip] > It's not the setting which is failing. You are likely not using an SM > which supports this (it is an enhanced capability defined in a 1.2 > erratum). Are you running a recent OpenSM or something else ? > I'm running a 1.1 openSM on a 2.6.14 kernel.

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-11 Thread Hal Rosenstock
On Tue, 2006-07-11 at 03:24, yipee wrote: > Hi, > > On one of my IB setups I get the following error from openSM: > osm_vendor_set_sm: ERR 5431: setting IS_SM capability mask failed; errno 2 > > what's this IS_SM capability mask? what might cause its setting to fail? It's not the setting which i

Re: [openib-general] openSM - IS_SM capability mask problem

2006-07-11 Thread Eitan Zahavi
You probably have another SM already running on your machine. The error means that OpenSM failed to set the local port IS_SM capability mask bit (which says there is an SM running on that port). If you do not have another SM running on the port you should probably restart the driver as the ref cou
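To check whether the IS_SM bit is currently set on a port, one can test bit 1 of the PortInfo CapabilityMask. The sketch below is an illustration only: the helper name is invented, the sample mask values are made up, and the sysfs path in the comment is an assumption about a typical Linux layout rather than something taken from this thread.

```python
# IsSM is bit 1 of the PortInfo CapabilityMask.
IS_SM_BIT = 1 << 1


def is_sm_set(cap_mask_text: str) -> bool:
    """Return True if the IsSM bit is set in a hex capability mask string."""
    return bool(int(cap_mask_text.strip(), 16) & IS_SM_BIT)


if __name__ == "__main__":
    # On Linux the mask is typically readable from a file such as
    # /sys/class/infiniband/<device>/ports/<port>/cap_mask (assumed path).
    # The two values below are invented samples, not real output.
    for sample in ("0x02510a6a", "0x02510a68"):
        print(sample, is_sm_set(sample))
```

If the bit is already set while no SM process is running, that matches the stale reference count situation described above.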

Re: [openib-general] opensm and NPTL

2006-06-13 Thread Hal Rosenstock
On Tue, 2006-06-13 at 12:56, Viswanath Krishnamurthy wrote: > I am using the trunk. Should I be using 1.0 ? No; I didn't check but if my memory serves me correctly, the trunk may have some fixes 1.0 doesn't towards this but I'm not 100% sure right now and since you are using the trunk, I'm not g

Re: [openib-general] opensm and NPTL

2006-06-13 Thread Viswanath Krishnamurthy
I am using the trunk. Should I be using 1.0 ? -Viswa On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote: On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > Yes.. I want to test waters again and see if the issues went away. Are you using the trunk or 1.0 ? -- Hal > -V

Re: [openib-general] opensm and NPTL

2006-06-13 Thread Hal Rosenstock
On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > Yes.. I want to test waters again and see if the issues went away. Are you using the trunk or 1.0 ? -- Hal > -Viswa > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock <[EMAIL PROTECTED]> > wrote: > Hi Viswa, > >

Re: [openib-general] opensm and NPTL

2006-06-13 Thread Viswanath Krishnamurthy
Yes.. I want to test waters again and see if the issues went away. -Viswa On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote: Hi Viswa, On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > There were some issues with opensm running with NPTL (thread > library). Has the

Re: [openib-general] opensm and NPTL

2006-06-13 Thread Hal Rosenstock
Hi Viswa, On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > There were some issues with opensm running with NPTL (thread > library). Has the issues been resolved ? There were some fixes to the signal handling which went in back in the Feb/early March time frame. OpenSM should be bett

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-30 Thread Hal Rosenstock
Hi Paul, On Tue, 2006-05-30 at 11:06, Paul wrote: > Hi All, > I will be working on this as time permits this week. > Unfortunately my employer is not crazy about giving out remote access, > so I will have to be your hands on this. If you want me to do > something just tell me what it is. I kn

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-30 Thread Hal Rosenstock
Don, On Tue, 2006-05-30 at 10:55, [EMAIL PROTECTED] wrote: > Hal, > > With your patch to OpenSM, I think everything is ok on the local node. That patch with one minor change (elimination of the CL_ASSERT) will be part of the upcoming RC6. > The remote node is definitely having some problems,

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-30 Thread Paul
Hi All, I will be working on this as time permits this week. Unfortunately my employer is not crazy about giving out remote access, so I will have to be your hands on this. If you want me to do something just tell me what it is. I know it's a pain; I have been there myself. Regards. On 5/30/06, [E

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-30 Thread Don . Albert
Hal, With your patch to OpenSM, I think everything is ok on the local node. The remote node is definitely having some problems, resulting in not responding to the MAD packets. I have entered a separate message on the problems with the "ib0" interface on that machine. > > On Fri, 2006-05-26 at

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-27 Thread Sasha Khapyorsky
Hi Paul, On 12:14 Fri 26 May , Paul wrote: > No, I figured all of that out, ppc64 was not supported/working in RC4. > Either way, here is what I see with opensm: > > [EMAIL PROTECTED] ~]# /etc/init.d/opensmd start > *** glibc detected *** realloc(): invalid next size: 0x100ab1e0 ***

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-27 Thread Hal Rosenstock
Hi Paul, On Sat, 2006-05-27 at 02:26, Paul wrote: > Hi Hal, > My lab is undergoing maintenance this weekend so I won't be able > to get you any results til Tuesday, however the results are readily > reproducible. Everything is 64bit. Unfortunately I don't have access to a PPC64 machine on whi

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Paul
Hi Hal, My lab is undergoing maintenance this weekend so I won't be able to get you any results til Tuesday, however the results are readily reproducible. Everything is 64bit. Regards. On 26 May 2006 12:46:01 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote: Hi again Paul, On Fri, 2006-05-26 at 12:

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Don, On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote: > > What next, coach? > > Can you turn on madeye on the remote node and see what packets are > received and sent ? Let me know if you need help with that. I think you > said you were running OFED, right ? I don't think madeye is part of OFE

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Don, On Fri, 2006-05-26 at 17:32, [EMAIL PROTECTED] wrote: > Hal, > > I rebuilt the opensm executable with the patch you provided. The > patch fixes (or avoids) the segmentation fault and opensm comes up and > runs. Thanks for trying this out. > However, the link is still not becoming opera

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
Hal, I rebuilt the opensm executable with the patch you provided. The patch fixes (or avoids) the segmentation fault and opensm comes up and runs. However, the link is still not becoming operational. On the local side it goes to ARMED, and on the remote side it goes to INIT. The osm.log s

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
Hal, > > One more thing on the remote side, try: > > smpquery nodeinfo -D 0 > Here is the smpquery on the remote (system "jatoba") side > [jatoba] (ib) ib> smpquery nodeinfo -D 0 # Node info: DR path [0] BaseVers:1 ClassVers:...1 Node

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Don, On Fri, 2006-05-26 at 14:35, [EMAIL PROTECTED] wrote: > Hal, > > > Yes, that is very useful. I had been working on trying to come up > with > > what the problem was but this narrows it down to something I was > > thinking might be going on. > > > > It looks like you are running back to bac

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
Hal,   > Yes, that is very useful. I had been working on trying to come up with > what the problem was but this narrows it down to something I was > thinking might be going on. > > It looks like you are running back to back HCAs, right ? Yes, the HCAs are 4X DDR, connected back to back. > > It

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Hi Don, On Fri, 2006-05-26 at 13:34, [EMAIL PROTECTED] wrote: > Hal, > > > Hi again Paul, > > Since your last message was addressed to Paul, and you said my problem > was completely different, I don't know if a backtrace would help in my > case, but here it is anyway, just in case. (See below.)

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
Hal, > Hi again Paul, Since your last message was addressed to Paul, and you said my problem was completely different, I don't know if a backtrace would help in my case, but here it is anyway, just in case. (See below.) > > Would you rebuild OpenSM with debug: > ./configure --enable-debug && m

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Hi again Paul, On Fri, 2006-05-26 at 12:14, Paul wrote: > No, I figured all of that out, ppc64 was not supported/working in RC4. > Either way, here is what I see with opensm: > > [EMAIL PROTECTED] ~]# /etc/init.d/opensmd start > *** glibc detected *** realloc(): invalid next size: > 0x100

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Paul
No, I figured all of that out, ppc64 was not supported/working in RC4. Either way, here is what I see with opensm: [EMAIL PROTECTED] ~]# /etc/init.d/opensmd start *** glibc detected *** realloc(): invalid next size: 0x100ab1e0 *** /etc/init.d/opensmd: line 330: 7854 Donee

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Hal Rosenstock
Hi Paul, On Fri, 2006-05-26 at 11:35, Paul wrote: > I am having a similar issue on my ppc64 systems. Take a look at the > email I sent to the list last night. I have not been able to figure > out much regarding why its dying, Are you referring to your mail on compile flags and then OpenMPI ? I sa

Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Paul
I am having a similar issue on my ppc64 systems. Take a look at the email I sent to the list last night. I have not been able to figure out much regarding why it's dying; I wonder if it might be tied to some other issues I am having. On 5/26/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: I

RE: [openib-general] opensm segfault?

2006-05-17 Thread Hal Rosenstock
l Message- > > From: [EMAIL PROTECTED] [mailto:openib-general- > > [EMAIL PROTECTED] On Behalf Of Sasha Khapyorsky > > Sent: Wednesday, May 17, 2006 2:11 AM > > To: Troy Benjegerdes > > Cc: openib-general@openib.org > > Subject: Re: [openib-general] opensm segfault

Re: [openib-general] opensm segfault?

2006-05-16 Thread Jason Gunthorpe
On Wed, May 17, 2006 at 09:10:11AM +0300, Eitan Zahavi wrote: > cl_memcpy should have some debug capabilities on top of memcpy ... > cl memory management provide means to track all memory allocations, etc. There are a huge number of canned solutions that provide a way to debug memory problems wit

RE: [openib-general] opensm segfault?

2006-05-16 Thread Eitan Zahavi
ISRAEL > -Original Message- > From: [EMAIL PROTECTED] [mailto:openib-general- > [EMAIL PROTECTED] On Behalf Of Sasha Khapyorsky > Sent: Wednesday, May 17, 2006 2:11 AM > To: Troy Benjegerdes > Cc: openib-general@openib.org > Subject: Re: [openib-general] opensm segfault?

Re: [openib-general] opensm segfault?

2006-05-16 Thread Sasha Khapyorsky
Hi Troy, On 14:41 Tue 16 May , Troy Benjegerdes wrote: > I got this after an indeterminate amount of time running opensm.. Is this reproducible, or is it a completely random failure? > (gdb) bt > #0 0x2b90b0dbebf3 in cl_memcpy (p_dest=0x2ac88850, p_src=0x0, > count=64) at cl_m

Re: [openib-general] opensm segfault?

2006-05-16 Thread Hal Rosenstock
On Tue, 2006-05-16 at 16:10, Roland Dreier wrote: > Troy> And why the heck is "cl_memcpy" just a call to 'memcpy' > Troy> anyway? This just seems like excessive unneeded abstraction. > > Hal> It's part of the component library, which is an OS > Hal> abstraction layer. > > memcpy()

Re: [openib-general] opensm segfault?

2006-05-16 Thread Roland Dreier
Troy> And why the heck is "cl_memcpy" just a call to 'memcpy' Troy> anyway? This just seems like excessive unneeded abstraction. Hal> It's part of the component library, which is an OS Hal> abstraction layer. memcpy() is specified by the ISO C standard, so it seems pretty silly to

Re: [openib-general] opensm segfault?

2006-05-16 Thread Hal Rosenstock
Hi Troy, On Tue, 2006-05-16 at 15:41, Troy Benjegerdes wrote: > I got this after an indeterminate amount of time running opensm.. > > > (gdb) bt > #0 0x2b90b0dbebf3 in cl_memcpy (p_dest=0x2ac88850, p_src=0x0, ^ This i

Re: [openib-general] opensm issues on 64 node RHEL4 cluster?

2006-04-13 Thread Hal Rosenstock
Hi Troy, On Thu, 2006-04-13 at 15:35, Troy Benjegerdes wrote: > We just moved a cluster over to the latest redhat release, and opensm > seems to be having issues. > > This is running the redhat provided kernel and opensm packages > > [EMAIL PROTECTED] troy]# uname -r > 2.6.9-34.ELsmp > [EMAIL PR

Re: [openib-general] OpenSM realloc error

2006-02-17 Thread Hal Rosenstock
Hi Owen, On Fri, 2006-02-17 at 00:01, Owen Stampflee wrote: > Of course, I need to get things working first, then we can deal with the > 64-bit issues (gotta please the boss, and if shipping 32-bit binaries and > both 32/64 bit libraries provides a working udapl, ipoib, and 32+64-bit > mpi, I can m

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Owen Stampflee
Of course, I need to get things working first, then we can deal with the 64-bit issues (gotta please the boss, and if shipping 32-bit binaries and both 32/64 bit libraries provides a working udapl, ipoib, and 32+64-bit mpi, I can meet my deadline (Monday)). I'm suspecting some glibc issues on our en

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Hal Rosenstock
On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote: > A 32-bit build of 5411 gets the link to become active Glad to hear this. That is what I would expect and would like to confirm the tid patch is missing from the FC5 package as well as getting to the bottom of the 64 bit issues if you have some ti

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Hal Rosenstock
Hi again Owen, On Thu, 2006-02-16 at 20:01, Owen Stampflee wrote: > http://cvs.terraplex.com/~owen/osm.log > > I'm currently using the Fedora FC5 packages that have been rebuilt, I'm not sure what svn the FC5 package corresponds to. Can you check the following in osm/libvendor/osm_vendor_ibumad.

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Sasha Khapyorsky
On 13:27 Thu 16 Feb , Owen Stampflee wrote: > > Commenting out the cl_log_event in osm_log results in this backtrace: > > (gdb) bt > #0 0x0080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > #1 0x0080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > #2 0x0080b974e860

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Owen Stampflee
A 32-bit build of 5411 gets the link to become active and ibv_rc_pingpong works, but I can't bring up ipoib... dmesg says this (tried both ib0 and ib1 to ensure ports weren't swapped) ADDRCONF(NETDEV_UP): ib0: link is not ready ADDRCONF(NETDEV_UP): ib1: link is not ready At least we're making progre

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Owen Stampflee
http://cvs.terraplex.com/~owen/osm.log I'm currently using the Fedora FC5 packages that have been rebuilt, I'm going to try the OpenIB 5411 source.

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Hal Rosenstock
Hi Owen, On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote: > So, here is the back trace with no code modifications... > > 0x0080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > #1 0x0080b971b89c in .__

Re: [openib-general] OpenSM realloc error

2006-02-16 Thread Owen Stampflee
So, here is the back trace with no code modifications... 0x0080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 (gdb) bt #0 0x0080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 #1 0x0080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 #2 0x0080b974e860 in .__libc_m

Re: [openib-general] OpenSM realloc error

2006-02-15 Thread Hal Rosenstock
On Wed, 2006-02-15 at 13:47, Sasha Khapyorsky wrote: > On 09:41 Wed 15 Feb , Owen Stampflee wrote: > > This doesn't help much... at all... no new info to report. > > > > [EMAIL PROTECTED] ~]# rm /var/log/osm.log > > [EMAIL PROTECTED] ~]# opensm -v > > ---

Re: [openib-general] OpenSM realloc error

2006-02-15 Thread Sasha Khapyorsky
On 09:41 Wed 15 Feb , Owen Stampflee wrote: > This doesn't help much... at all... no new info to report. > > [EMAIL PROTECTED] ~]# rm /var/log/osm.log > [EMAIL PROTECTED] ~]# opensm -v > - > OpenSM Rev:openib-1.1.0 > Command Line Arguments: > Ver

Re: [openib-general] OpenSM realloc error

2006-02-15 Thread Owen Stampflee
This doesn't help much... at all... no new info to report. [EMAIL PROTECTED] ~]# rm /var/log/osm.log [EMAIL PROTECTED] ~]# opensm -v - OpenSM Rev:openib-1.1.0 Command Line Arguments: Verbose option -v (log flags = 0x7) Log File: /var/log/osm.log ---

Re: [openib-general] OpenSM realloc error

2006-02-15 Thread Hal Rosenstock
Hi Owen, On Wed, 2006-02-15 at 11:43, Owen Stampflee wrote: > > Can you strace it and provide the output ? Thanks. > > > > -- Hal > http://cvs.terraplex.com/~owen/opensm.strace I can see the initial write to send a MAD here and it fails after that. One more try: can you send an osm.log from ope
