On Fri, Apr 4, 2014 at 12:56 PM, Hal Rosenstock <[email protected]> wrote:
> Hi Florent,
>
> On 4/2/2014 5:43 PM, Florent Parent wrote:
>> Hi,
>>
>> We experienced constant crashes from opensm 3.3.15 (3.3.15-1.el6.cq5)
>> after a recent upgrade. We compiled and installed 3.3.17 and the
>> problem went away.
>>
>> OpenSM server: CentOS 6.5 w/ stock RDMA. OpenSM 3.3.15 was from the
>> CentOS repository.
>>
>> A behaviour that may help diagnose this: an unusually large number of
>> messages were filling up the opensm.log file:
>>
>> Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
>> Received MAD with error status = 0x1C
>>                         SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
>>                         Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
>>
>> 80 of these messages occur periodically. smpquery on the paths shows
>> that they all point to the Sun QNEM switches (80 I4 chips).
>> Setting "use_mfttop FALSE" eliminated these messages.
>
> Yes, this is caused by bad firmware. The best fix is to upgrade the
> firmware on the devices indicated by the DR paths. There's also the
> workaround on the OpenSM side that you are using.
>
> This is orthogonal to the crashes below.

ok
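
In case it helps anyone else triaging the same flood: a minimal sketch
(not part of OpenSM) for tallying which DR paths the ERR 3111 entries
point at. It assumes the three-line log layout quoted above, with the
continuation lines indented; the function name count_err3111_paths is
made up for illustration.

```python
import re
from collections import Counter

# Matches the continuation line that carries the directed-route paths.
PATH_RE = re.compile(r"Initial path:\s*([\d,]+)\s+Return path:\s*([\d,]+)")


def count_err3111_paths(lines):
    """Count ERR 3111 entries per initial DR path.

    Assumes each entry starts with a line containing "ERR 3111" and is
    followed by indented continuation lines, one of which holds the paths.
    """
    counts = Counter()
    pending = False  # True while inside an ERR 3111 entry without a path yet
    for line in lines:
        if "ERR 3111" in line:
            pending = True
        elif pending and not line[:1].isspace():
            # A new, unindented log entry began before a path line was seen.
            pending = False
        if pending:
            match = PATH_RE.search(line)
            if match:
                counts[match.group(1)] += 1
                pending = False
    return counts
```

Passing it an open handle on opensm.log and printing
counts.most_common() gives one row per distinct path, which can then
be checked with smpquery.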

>
>> Florent
>>
>>
>> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
>> linked list corrupted: 0x00007f9b3c4352a0 ***
>> ======= Backtrace: =========
>> /lib64/libc.so.6(+0x76166)[0x7f9b56279166]
>> /lib64/libc.so.6(+0x79f1f)[0x7f9b5627cf1f]
>> /lib64/libc.so.6(__libc_malloc+0x71)[0x7f9b5627d991]
>> /usr/sbin/opensm[0x4216f3]
>> /usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
>> /usr/sbin/opensm[0x446efb]
>> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
>> /usr/sbin/opensm[0x4422bb]
>> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f9b56ddb5fe]
>> /lib64/libpthread.so.0(+0x79d1)[0x7f9b5659e9d1]
>> /lib64/libc.so.6(clone+0x6d)[0x7f9b562ebb6d]
>>
>> *** glibc detected *** /usr/sbin/opensm: double free or corruption
>> (out): 0x00007fe2f42e1830 ***
>
> Are you using partitions? Any idea what the scenario is here?
>
> I can isolate the patch (beyond 3.3.15) that fixes this if needed.

No partitions. We installed 3.3.15 during a maintenance window. The
crashes started to occur only once the scheduler began dispatching jobs.

Since we're not seeing any issues so far with 3.3.17, this patch is
not required for us. I just thought it was good practice to report any
crash.

>
>> ======= Backtrace: =========
>> /lib64/libc.so.6(+0x76166)[0x7fe30ec9d166]
>> /lib64/libc.so.6(+0x78c93)[0x7fe30ec9fc93]
>> /usr/sbin/opensm[0x449cf6]
>> /usr/sbin/opensm(osm_subn_rescan_conf_files+0x194)[0x44af14]
>> /usr/sbin/opensm[0x447260]
>> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
>> /usr/sbin/opensm[0x4422bb]
>> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7fe30f7ff5fe]
>> /lib64/libpthread.so.0(+0x79d1)[0x7fe30efc29d1]
>> /lib64/libc.so.6(clone+0x6d)[0x7fe30ed0fb6d]
>>
>> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
>> linked list corrupted: 0x00007f200838ede0 ***
>
> This is one I'm unfamiliar with and will need to investigate further.
> Did this one also go away with 3.3.17?

Yes it did.

Thanks
Florent
