Re: [openib-general] [openfabrics-ewg] Link Initialization problem and hangs in MTHCA on OFED-1.0

2006-07-05 Thread Don . Albert
question is an Intel SE7525GP2 motherboard (7525 chip set). Updating to version P10 of the BIOS firmware fixed the problem. Thanks to Hal Rosenstock for his advice and suggestions. -Don Albert- ___ openib-general mailing list openib-general

[openib-general] Link Initialization problem and hangs in MTHCA on OFED-1.0

2006-06-23 Thread Don . Albert
re generated when the hangs occur. The machines are both EM64T but are not identical.  The "koa" side has the HCA on PCI "06:00.0",  and the "jatoba" side has the HCA on "03:00.0".  The two machines are:    koa (the working on

Re: [openib-general] Stopping Infiniband kernel modules from loading at system boot

2006-06-23 Thread Don . Albert
Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting that I add an entry to /etc/hotplug/blacklist, but I thought that the "/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev" functionality. Is this not true?

[openib-general] Stopping Infiniband kernel modules from loading at system boot

2006-06-23 Thread Don . Albert
b_mthca ib_core                45952  2 ib_mthca,ib_mad What have I missed?         -Don Albert- ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openfabrics-ewg] Re: [openib-general] Re: NOP problem in ib_mthca on OFED RC4

2006-05-30 Thread Don . Albert
p as well. > >  - R. I also tried this.  I didn't see any output on my terminal.  Where does all this "copious output" go?   -Don Albert-

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-30 Thread Don . Albert
o try ? > > 2. Can you completely shutdown and repower the remote node and see if it > starts responding ? > It is difficult for me to debug this sort of thing, since I telecommute from Tucson and the machines are located in Phoenix.  But I can get someone there to power the machine

[openib-general] Re: NOP problem in ib_mthca on OFED RC4

2006-05-30 Thread Don . Albert
k set dev ib0 down I tried using gdb to "attach" to process 7031 to see its stack, but that hung too, as well as an attempt to see what the status of the interface was with "/sbin/ifconfig".   It is rather difficult for me t

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
ort # # Topology file: generated on Fri May 26 14:24:20 2006 # # Max of 1 hops discovered # Initiated from node 0002c90200216dc4 port 0002c90200216dc5 vendid=0x2c9 devid=0x6274 sysimgguid=0x2c90200216dc7 caguid=0x2c90200216dc4 Ca      1 "H-0002c90200216dc4"          # koa HCA-1 W

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
>>>>>>>>>>>>> > Do you also have an iPath adapter ? If not, no need to load those > modules. > We do not have an iPath adapter.  I just did a "build all packages" in the OFED install.sh script, and it included it.  I did a "modprobe -r

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
loppy                 67400  0 >>>>>>>>>>>>>>>>>>> > > Can you try this patch to see if it gets you further and let me know ? > Note that this is just a potential workaround right now. > I will try rebuilding with the pa

Re: [openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

2006-05-26 Thread Don . Albert
re --enable-debug && make clean && make && make install > > and then run opensm under gdb and provide the backtrace after the > failure? > > Thanks. > > -- Hal I can also rebuild with --enable-debug if it would be useful.         -Don Albert- Backtr

[openib-general] Re: NOP problem in ib_mthca on OFED RC4

2006-05-26 Thread Don . Albert
sion, full lspci -v output, etc. > -- > MST In case the problem comes back again with RC6, below is some information on the machine that had the problem.   -Don Albert- MODEL x86_64   [type=x86_64] CPU   4 x Intel(R) Xeon(TM) CPU 3.00GHz, 64 bits  2992.628 Mhz MEM   2055516 kB  real memory

[openib-general] Re: NOP problem in ib_mthca on OFED RC4

2006-05-09 Thread Don . Albert
>         Node GUID: 0x0002c90200216dc4 > >         System image GUID: 0x0002c90200216dc7 > > > >   -Don Albert- > > > > Yes, that's the latest revision. Hmm. > What about the other thing I mentioned in my first message:  the "lspci" command co

[openib-general] Re: NOP problem in ib_mthca on OFED RC4

2006-05-09 Thread Don . Albert
one else > seen this problem? > > Which FW revision do you have? > The "ibstat" command shows:         CA type: MT25204         Number of ports: 1         Firmware version: 1.0.800         Hardware version: a0         Node GUID: 0

[openib-general] NOP problem in ib_mthca on OFED RC4

2006-05-08 Thread Don . Albert
slot that the HCA is in.  Here is the message:    pcilib: Resource 2 in /sys/bus/pci/devices/:03:00.0/resource has a 64-bit address, ignoring        03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Do I need a new version of pcilib?  I currently have p

RE: [openib-general] RHEL4ASU3 question

2006-04-11 Thread Don . Albert
e some sort of "length error".    This could be coming from the driver or the card, I suppose?   That's as far as I have gotten so far. Does this sound like any of the "issues" you referred to above relative to RHEL4 U3 and th

Re: [openib-general] Problems running MPI jobs with MVAPICH

2006-04-06 Thread Don . Albert
obs,  including "cpi", "mping" and other benchmark tests. There seems to be a problem with "USE_MPD_RING".    Have you seen this before?   Should I try with "USE_MPD_BASIC" instead?         -Don Albert-

Re: [openib-general] Problems running MPI jobs with MVAPICH

2006-03-29 Thread Don . Albert
t the script is not in the "mvapich-gen2" directory.  There is a script "make.mvapich2.tcp" under the "mvapich2-gen2" directory.  I will try building MVAPICH2 for TCP and see if that works.         -Don Albert-

Re: [openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

2006-03-22 Thread Don . Albert
these two machines a bit.  I know that "koa" at least had both RedHat and Suse distributions installed at one time or another,  but I am not sure about "jatoba". You are also correct that I could not get any version of mpi to run between the two machines. Thanks, again!   -Don

Re: [openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

2006-03-22 Thread Don . Albert
Matthew, Thanks for the response.  I can well believe that the problem is some error in the setup of my environment.  I just can't figure out what as yet.  As requested, some more information is below on the two systems 'koa' and 'jatoba'. > > Since the MPD daemons seem to have been started pr

[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

2006-03-22 Thread Don . Albert
of rank 1: killed by signal 9 rank 0 in job 5  koa.az05.bull.com_60194   caused collective abort of all ranks   exit status of rank 0: killed by signal 9 Does anyone have any ideas about this? -Don Albert- Bull HN Information Systems

Re: [openib-general] ib_mthca "NOP command failed to generate interrupt (IRQ 169), aborting."

2006-03-16 Thread Don . Albert
tion. > > > > > >  - R. I copied over the firmware and the tvflash utility from our other system that has the same type HCA card and flashed the new version (4.7.400) of the firmware. After rebooting the system, the mthca driver now loads successfully without using the option

Re: [openib-general] ib_mthca "NOP command failed to generate interrupt (IRQ 169), aborting."

2006-03-14 Thread Don . Albert
ox firmware support site and tried to determine what firmware I needed based on the information I did have (i.e. Lion Cub, PCI Express, 25208) and I think that I need to load the following firmware: fw-25208-4_7_600-MHEL-CF128-T.bin.gz Michael, can you confirm this, based on what I have described a

Re: [openib-general] ib_mthca "NOP command failed to generate interrupt (IRQ 169), aborting."

2006-03-14 Thread Don . Albert
ssfully. I will look into updating the firmware as well,  but for now I can proceed with other testing. >  - Try a binary search of svn revisions between 3929 and 5685 to >    figure out when the driver stopped working. > >  - R. Don Albert Bull HN Information Systems

[openib-general] ib_mthca "NOP command failed to generate interrupt (IRQ 169), aborting."

2006-03-13 Thread Don . Albert
ing problem? ACPI: PCI interrupt for device :06:00.0 disabled ib_mthca: probe of :06:00.0 failed with error -16 I see that the HCA firmware is old,  but is that the problem here,  or is there some other change that could be causing this?   -Don Albert- Bull