question is an Intel SE7525GP2 motherboard (7525 chipset).
Updating to version P10 of the BIOS firmware fixed the problem.
Thanks to Hal Rosenstock for his advice and suggestions.
-Don Albert-
___
openib-general mailing list
openib-general
re generated when the hangs occur.
The machines are both EM64T but are not identical. The "koa" side has
the HCA on PCI "06:00.0", and the "jatoba" side has the HCA on "03:00.0".
The two machines are:
koa (the working on
Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting
that I add an entry to /etc/hotplug/blacklist, but I thought that the
"/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev"
functionality. Is this not true?
b_mthca
ib_core
45952 2 ib_mthca,ib_mad
What have I missed?
-Don Albert-
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
p as well.
>
> - R.
I also tried this. I didn't see any output on my terminal. Where does
all this "copious output" go?
-Don Albert-
o try ?
>
> 2. Can you completely shut down and repower the remote node and see if it
> starts responding?
>
It is difficult for me to debug this sort of thing, since I telecommute
from Tucson and the machines are located in Phoenix.
But I can get someone there to power the machine
k set dev ib0 down
I tried using gdb to "attach" to process 7031 to see its stack, but that
hung too, as did an attempt to check the status of the interface with
"/sbin/ifconfig".
It is rather difficult for me t
ort
#
# Topology file: generated on Fri May 26 14:24:20 2006
#
# Max of 1 hops discovered
# Initiated from node 0002c90200216dc4 port 0002c90200216dc5
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200216dc7
caguid=0x2c90200216dc4
Ca 1 "H-0002c90200216dc4"
# koa HCA-1
W
>>>>>>>>>>>>>
> Do you also have an iPath adapter ? If not, no need to load those
> modules.
>
We do not have an iPath adapter. I just did a "build all packages" in
the OFED install.sh script, and it included it. I did a "modprobe -r
loppy
67400 0
>>>>>>>>>>>>>>>>>>>
>
> Can you try this patch to see if it gets you further and let me know?
> Note that this is just a potential workaround right now.
>
I will try rebuilding with the pa
re --enable-debug && make clean && make && make install
>
> and then run opensm under gdb and provide the backtrace after the
> failure?
>
> Thanks.
>
> -- Hal
I can also rebuild with --enable-debug if it would be useful.
-Don Albert-
Backtr
sion, full lspci -v output, etc.
> --
> MST
In case the problem comes back again with RC6, below is some information
on the machine that had the problem.
-Don Albert-
MODEL x86_64 [type=x86_64]
CPU 4 x Intel(R) Xeon(TM) CPU 3.00GHz, 64 bits 2992.628 Mhz
MEM 2055516 kB real memory
> Node GUID: 0x0002c90200216dc4
> > System image GUID: 0x0002c90200216dc7
> >
> > -Don Albert-
> >
>
> Yes, that's the latest revision. Hmm.
>
What about the other thing I mentioned in my first
message: the "lspci" command co
one else
> seen this problem?
>
> Which FW revision do you have?
>
The "ibstat" command shows:
CA type: MT25204
Number of ports: 1
Firmware version: 1.0.800
Hardware version: a0
Node GUID: 0
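For readers who want to compare firmware revisions across machines, the "Key: value" fields quoted above can be pulled into a dictionary. This is only a minimal sketch: it assumes each field sits on a single line in "Key: value" form (real "ibstat" output is indented and may carry more fields), and the names `parse_fields` and `SAMPLE` are hypothetical, not part of any OpenIB tool.

```python
from typing import Dict


def parse_fields(text: str) -> Dict[str, str]:
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # partition() splits on the first colon only, so values that
        # themselves contain colons are kept intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the fields quoted in the message (assumed layout).
SAMPLE = """\
CA type: MT25204
Number of ports: 1
Firmware version: 1.0.800
Hardware version: a0
"""

if __name__ == "__main__":
    info = parse_fields(SAMPLE)
    print(info["Firmware version"])
```

In practice the text would come from capturing the command's output rather than from a hard-coded string.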
slot that the HCA is in. Here is the message:
pcilib: Resource 2 in /sys/bus/pci/devices/:03:00.0/resource has a 64-bit address, ignoring
03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
Do I need a new version of pcilib?
I currently have p
e some sort of "length error". This could be coming from the driver or
the card, I suppose? That's as far as I have gotten so far.
Does this sound like any of the "issues" you referred to above relative
to RHEL4 U3 and th
obs, including "cpi", "mping" and other benchmark tests.
There seems to be a problem with "USE_MPD_RING". Have you seen this
before? Should I try with "USE_MPD_BASIC" instead?
-Don Albert-
t the script is not in the "mvapich-gen2" directory. There is a script
"make.mvapich2.tcp" under the "mvapich2-gen2" directory.
I will try building MVAPICH2 for TCP and see if that works.
-Don Albert-
these two machines a bit. I know that "koa" at least had both RedHat
and Suse distributions installed at one time or another, but I am not
sure about "jatoba".
You are also correct that I could not get any version of mpi to run
between the two machines.
Thanks, again!
-Don
Matthew,
Thanks for the response. I can well believe that the problem is some
error in the setup of my environment. I just can't figure out what as
yet. As requested, some more information is below on the two systems
'koa' and 'jatoba'.
>
> Since the MPD daemons seem to have been started pr
of rank 1: killed by signal 9
rank 0 in job 5 koa.az05.bull.com_60194 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Does anyone have any ideas about this?
-Don Albert-
Bull HN Information Systems
tion.
> > >
> > > - R.
I copied over the firmware and the tvflash utility from our other system
that has the same type of HCA card and flashed the new version (4.7.400)
of the firmware.
After rebooting the system, the mthca driver now loads successfully without
using the option &
ox firmware support site and tried to determine what firmware I needed
based on the information I did have (i.e. Lion Cub, PCI Express, 25208),
and I think that I need to load the following firmware:
fw-25208-4_7_600-MHEL-CF128-T.bin.gz
Michael, can you confirm this, based on what I have
described a
ssfully.
I will look into updating the firmware as well, but for now I can
proceed with other testing.
> - Try a binary search of svn revisions between 3929 and 5685 to
> figure out when the driver stopped working.
>
> - R.
Don Albert
Bull HN Information Systems
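The binary search over svn revisions suggested in the quoted reply can be sketched roughly as follows. This is a hedged illustration, not the procedure anyone on the list actually ran: `first_bad_revision` and the `is_bad` callback are hypothetical names, and the sketch assumes that revision 3929 works, revision 5685 fails, and that once the driver broke it stayed broken. A real `is_bad` would check out, build, and load the driver at the given revision.

```python
from typing import Callable


def first_bad_revision(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the lowest revision in (good, bad] for which is_bad() is true.

    Assumes `good` is known to work, `bad` is known to fail, and every
    revision after the first failing one also fails.
    """
    if good >= bad:
        raise ValueError("good must be lower than bad")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # breakage is at mid or earlier
        else:
            good = mid  # breakage is after mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for "build and test the driver at this revision";
    # the break point below is made up purely for the demonstration.
    pretend_break = 4500
    print(first_bad_revision(3929, 5685, lambda rev: rev >= pretend_break))
```

With 1756 revisions between the two endpoints, this needs about 11 build-and-test cycles rather than one per revision.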
ing problem?
ACPI: PCI interrupt for device :06:00.0 disabled
ib_mthca: probe of :06:00.0 failed with error -16
I see that the HCA firmware is old, but is that the problem here, or is
there some other change that could be causing this?
-Don Albert-
Bull