On Wed, Jul 08, 2009 at 11:12:15AM +0300, Amir Vadai wrote:
> Lars Hi,
>
> I opened a bug in our bugzilla
> (https://bugs.openfabrics.org/show_bug.cgi?id=1672).
>
> I couldn't reproduce it on my setup: SLES 10SP2, stock kernel, same ofed git
> version.
> will try now to install 2.6.27 kernel and check again.

With a "normal" kernel config, I needed to run full-load bi-directional
network traffic on IPoIB as well as SDP, with multiple stream sockets,
to eventually trigger it after a few minutes
(several hundred megabytes per second).

With the "debug" kernel config, I was able to reproduce it with only one
socket, within milliseconds.

My .config is attached.

> BTW, what type of servers do you use? Are they low/high end server?

This is the second cluster that shows this bug.

I first experienced it when using SDP sockets from within kernel space.
I was able to reproduce it in userland, which I thought might make it
easier for you to reproduce.

The current test cluster is a slightly aged 2U Supermicro, dual quad-core,
4 GB RAM, and has proved to be very reliable hardware in all tests up to now.
It may be a little slow on interrupts.

tail of /proc/cpuinfo:
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU E5310 @ 1.60GHz
stepping        : 7
cpu MHz         : 1599.984
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips        : 3201.35
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

The IB setup is a direct link; lspci says:
09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe 2.0 5GT/s] (rev a0)

Because IPoIB works just fine, I don't think we have issues with
the IB setup or the hardware in general.
Only when using SDP is it broken: it "forgets" bytes or corrupts data.

What I do "differently" from the (assumed to be) typical SDP user is
sending large-ish messages at once (up to ~32 kB), possibly unaligned,
which apparently is a mode that SDP has not exercised much yet;
otherwise the recently fixed page leak would have been noticed
by someone much earlier.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

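For illustration, the access pattern described in the message above (one stream socket, messages of varying size up to ~32 kB, checking on the far side for lost or corrupted bytes) could be approximated in userland roughly as follows. This is only a sketch under stated assumptions, not the test program used on the cluster: it runs over plain TCP on loopback, it varies message sizes but does not control buffer alignment, and the function names, the byte pattern and the seed are all made up for the example. How the sockets would be pointed at SDP is deliberately left out.

```python
import random
import socket
import threading

MAX_MSG = 32 * 1024  # upper bound on message size (the ~32 kB mentioned above)


def pattern(offset, length):
    """Bytes expected at stream positions offset .. offset+length-1.

    Hypothetical pattern: it depends only on the absolute stream offset,
    so a dropped or corrupted byte shows up as a mismatch.
    """
    return bytes((offset + i) * 31 % 251 for i in range(length))


def send_stream(sock, total, seed=0):
    """Send `total` bytes as messages of random, mostly odd-sized lengths."""
    rng = random.Random(seed)
    sent = 0
    while sent < total:
        size = min(rng.randint(1, MAX_MSG), total - sent)
        sock.sendall(pattern(sent, size))
        sent += size
    sock.shutdown(socket.SHUT_WR)  # signal end of stream to the receiver
    return sent


def verify_stream(sock):
    """Read until EOF; return (bytes_received, first_bad_offset or None)."""
    received = 0
    first_bad = None
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        if first_bad is None:
            expected = pattern(received, len(chunk))
            if chunk != expected:
                first_bad = received + next(
                    i for i, (a, b) in enumerate(zip(chunk, expected)) if a != b
                )
        received += len(chunk)
    return received, first_bad


def run_loopback(total):
    """Run sender and receiver against each other over TCP on 127.0.0.1."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    result = {}

    def receiver():
        conn, _ = server.accept()
        with conn:
            result["received"], result["first_bad"] = verify_stream(conn)

    t = threading.Thread(target=receiver)
    t.start()
    with socket.create_connection(server.getsockname()) as client:
        send_stream(client, total)
        t.join()
    server.close()
    return result
```

A received byte count below the sent total would correspond to "forgotten" bytes, and a non-None first_bad offset to corrupted data.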
config-ofed-1.4-d02.gz
Description: GNU Zip compressed data
