Re: [ewg] OFED 1.2.5.3
Tziporet Koren wrote:
> Or Gerlitz wrote:
>
> In one of our tests that does up/down of the driver we see a deadlock when this patch is used, and it does not happen when it is removed. Thus we decided to remove this patch until we get to the root cause of the problem. Note that we do not think the issue is with the patch itself, but it seems to bring the issue up to the surface.

I see. Note that even though internal review makes you think the patch is not the cause of the hang, it is still possible that external reviewers will come to a different conclusion. That is a big part of the story behind open source development and processes.

I understand that this patch does not exist in OFED 1.3, correct?

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] OFED 1.2.5.4 is ready on the ofa server
OFED-1.2.5.4 is ready:
http://www.openfabrics.org/downloads/OFED/ofed-1.2.5/OFED-1.2.5.4.tgz

Changes since OFED 1.2.5

- RDS:
  - Performance enhancements
  - GA for Oracle 11
- IPoIB:
  - Use NAPI by default
  - For small received packets, allocate a new, smaller SKB to relieve accounting on the socket.
- mlx4:
  - Enable changing default max HCA resource limits using module options.
  - Support opening more resources than the default by increasing the command timeout for INIT_HCA to 10 seconds
- PPC64 support:
  - Fixed compilation problems on SLES10 SP1

Changes from OFED 1.2.5.3:
==========================
- Low level drivers update:
  - cxgb3: Pull in latest fixes.
  - ipath: Pull in latest fixes.
- OSes support:
  - Added support for SLES9 SP4 (no QA was done)
  - Added support for RHEL5 up1 (no QA was done)
- IPoIB:
  - Removed the usage of unsignalled QP in Tx due to deadlock.
- RDS:
  - Relax the header consistency check on fragment reassembly

Tziporet & Vlad

Tziporet Koren
Software Director
Mellanox Technologies
mailto: [EMAIL PROTECTED]
Tel +972-4-9097200, ext 380
Re: [ewg] OFED 1.2.5.3
Or Gerlitz wrote:
> I understand that this patch does not exist in OFED 1.3, correct?

Correct - but don't worry about the bugs: we see new oopses even with vanilla IPoIB from 2.6.24-rc3 :-(

Tziporet
[ewg] OFED Dec 3 meeting summary on beta release status
Meeting Summary:
1. We must get bugzilla fixed ASAP to track OFED 1.3 bugs - Jeff B.
2. RC1 is delayed to next week since there were no builds for almost a week
3. Beta release testing status:

   * Voltaire: See issues with:
     o TCP performance of ConnectX (30% worse than Arbel in UD mode)
       Note: I verified that interrupt moderation was not activated on ConnectX as it should be. Should be fixed today in the daily build.
       Also LRO is not yet enabled - to be done by RC1.
     o Performance is harmed when working with IPoIB partitioning (different PKey)
     o iSCSI over IPoIB
   * Cisco:
     o Tested so far only x86_64; RHEL4, RHEL5, SLES10
     o Focus on MPI testing: Intel MPI 3.1; HP MPI 2.5.1
     o Will test new compilers: Intel 10.1 and PGI 7.1
   * Intel:
     o Beta is working fine on a 16-node cluster
     o Also tests a small ia64 cluster - see a problem with MVAPICH compilation
   * Qlogic:
     o See SDP issues
     o Mainly test basic verbs (libibverbs)
     o Will have a code update in 1-2 weeks
   * Mellanox:
     o Covers x86, x86_64, PPC, all OSes on the matrix
     o SDP is still broken - should be fixed by end of this week
     o Test Open MPI and MVAPICH
     o See issues with IPoIB performance - being worked on

Tziporet
Re: [ewg] OFED Dec 3 meeting summary on beta release status
Hello Tziporet!

This is our test status:
* Tested ehca, ehca2 on SLES10 ppc64 and upstream kernel
* RHEL4.5, RHEL5.1 and other backport test will be next
* Build process works great (basic, custom, 32/64-bit libs)

Thanks
Nam

[EMAIL PROTECTED] wrote on 04.12.2007 10:34:55:
> Meeting Summary:
> [...]
Re: [ewg] Problems building mvapich on ia64 on RedHat EL5.1
Hi,

I don't have access to such platform in Mellanox and from the log I don't really understand the problem. Can I get remote access to the machine?

Regards,
Pasha.

Woodruff, Robert J wrote:
> I get the following build error trying to build mvapich on ia64 on RedHat EL5.1 using today's OFED daily build.
>
> gcc -DHAVE_CONFIG_H -I. -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/include -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/include -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/util -DMPID_DEVICE_CODE -DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1 -DMPID_DEBUG_NONE -DMPID_STAT_NONE -fPIC -O3 -fno-strict-aliasing -g -D_GNU_SOURCE -DCH_GEN2 -DMEMORY_SCALE -D_AFFINITY_ -DCOMPAT_MODE -Wall -D_SMP_ -D_SMP_RNDV_ -DVIADEV_RPUT_SUPPORT -DEARLY_SEND_COMPLETION -DLAZY_MEM_UNREGISTER -D_IA64_ -I/usr/include -DHAVE_MPICHCONF_H -D_GNU_SOURCE -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I.-c viainit.c
> viainit.c: In function 'viainit_setaffinity':
> viainit.c:140: warning: passing argument 3 of 'sched_setaffinity' from incompatible pointer type
> viainit.c: In function 'viainit_exchange':
> viainit.c:752: warning: unused variable 'other_qp_list'
> viainit.c: In function 'MPID_VIA_Init':
> viainit.c:952: warning: implicit declaration of function 'init_apm_lock'
> viainit.c:788: warning: unused variable 'smpi_ptr'
> viainit.c:784: warning: unused variable 'j'
> viainit.c:784: warning: unused variable 'i'
> viainit.c: In function 'ib_qp_enable':
> viainit.c:1379: warning: implicit declaration of function 'reload_alternate_path'
> viainit.c: In function 'ib_rank_lid_table':
> viainit.c:1917: warning: pointer targets in assignment differ in signedness
> viainit.c: At top level:
> viainit.c:1907: warning: 'ib_rank_lid_table' defined but not used
> gcc -DHAVE_CONFIG_H -I. -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/include -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/include -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/util -DMPID_DEVICE_CODE -DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1 -DMPID_DEBUG_NONE -DMPID_STAT_NONE -fPIC -O3 -fno-strict-aliasing -g -D_GNU_SOURCE -DCH_GEN2 -DMEMORY_SCALE -D_AFFINITY_ -DCOMPAT_MODE -Wall -D_SMP_ -D_SMP_RNDV_ -DVIADEV_RPUT_SUPPORT -DEARLY_SEND_COMPLETION -DLAZY_MEM_UNREGISTER -D_IA64_ -I/usr/include -DHAVE_MPICHCONF_H -D_GNU_SOURCE -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639 -I/var/tmp/OFED_topdir/BUILD/mvapich-1.0.0-1639/mpid/ch_gen2 -I.-c viasend.c
> In file included from /usr/include/netdb.h:28,
>                  from cm.h:25,
>                  from viapriv.h:44,
>                  from viasend.c:29:
> /usr/include/netinet/in.h:355: error: expected ')' before '__netshort'
> /usr/include/netinet/in.h:355: error: expected ')' before '>>' token
> /usr/include/netinet/in.h:355: error: expected ')' before '&' token
Re: [ewg] OFED 1.2.5.3
Tziporet Koren wrote:
> Or Gerlitz wrote:
>> I understand that this patch does not exist in OFED 1.3, correct?
>
> correct - but don't worry for the bugs we see new oops even with IPoIB vanilla from 2.6.24-rc3 :-(

I see. How about sending the oops trace to the general list?

Or.
Re: [ewg] Problems building mvapich on ia64 on RedHat EL5.1
It also might be useful to see if a simple program including netdb.h can compile on the system. That would tell us whether the error is really in the system header files or due to some other interaction.

> #include <netdb.h>
>
> int foo(int i) {
>     return i > 1 ? i * foo(i - 1) : 1;
> }

Pavel Shamis (Pasha) wrote:
> Hi,
>
> I don't have access to such platform in Mellanox and from the log I don't really understand the problem. Can I get remote access to the machine?
>
> Regards,
> Pasha.
>
> Woodruff, Robert J wrote:
>> I get the following build error trying to build mvapich on ia64 on RedHat EL5.1 using today's OFED daily build.
>> [...]

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
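A slightly fuller variant of the header check suggested above, as a minimal sketch only. The file and function names (roundtrip16, hostent_size) are made up for illustration and are not part of mvapich. It uses ntohs()/htons(), which are declared in <netinet/in.h>, the header the failing build complains about, so it exercises the same declarations.

```c
/* Hypothetical stand-alone check; names are illustrative, not from mvapich. */
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Round-trips a 16-bit value through network byte order, which forces
 * the compiler to parse the ntohs()/htons() declarations. */
uint16_t roundtrip16(uint16_t host_value)
{
    return ntohs(htons(host_value));
}

/* Touches a <netdb.h> type so that header is really used. */
size_t hostent_size(void)
{
    return sizeof(struct hostent);
}
```

One way to use it: compile the file with plain `gcc -c` first, then again with the `-D` options copied from the failing command line. If only the second build fails, the problem is more likely an interaction with one of the build's defines than the system header on its own.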
[ewg] Re: [ofa-general] [PATCH 0/6] nes: Cosmetic changes; support virtual WQs and PPC
OK, I just pushed out a few more small cleanups (running unifdef, fixing signedness warnings, and fixing a locking bug on an error path).

One question: what is the point of the monkeying with SPIN_BUG_ON in nes.c?

 - R.
[ewg] Re: [ofa-general] [PATCH 5/5] nes: napi interface fix
 > > #ifdef NES_NAPI
 >
 > Is #ifdef napi sprinkled throughout the code common for most drivers? Is there a better way to handle this? (Is this OFED only for backports, or for upstream?)

Is there any reason why we want the upstream kernel to have both NAPI and non-NAPI support? If so, then this should probably be settable through Kconfig rather than having to edit the Makefile to change the NES_NAPI define.

However, what almost always seems to happen is that no one uses the non-default code and it ends up bitrotting to the point of not compiling. So I would strongly suggest just having the NAPI code and getting rid of the NES_NAPI tests entirely. Is there any reason not to do that?

 - R.
[ewg] RE: [ofa-general] [PATCH 0/6] nes: Cosmetic changes; support virtual WQs and PPC
> OK, I just pushed out a few more small cleanups (running unifdef, fixing signedness warnings, and fixing a locking bug on an error path). One question: what is the point of the monkeying with SPIN_BUG_ON in nes.c?

Probably just some leftover debugging. I can remove it.

Thanks for letting me know that you've pushed content to your branch. I'll pick it up. Btw, I'll be preparing another set of patches that should be ready very soon.

Glenn.
[ewg] Re: [ofa-general] OFED 1.2.5.4 is ready on the ofa server
It looks like there is something corrupt with the tarball.

13:29:50 > tar xzf OFED-1.2.5.4.tgz
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Ira

On Tue, 4 Dec 2007 10:34:48 +0200 "Tziporet Koren" <[EMAIL PROTECTED]> wrote:
> OFED-1.2.5.4 is ready:
> http://www.openfabrics.org/downloads/OFED/ofed-1.2.5/OFED-1.2.5.4.tgz
> [...]
[ewg] RE: [ofa-general] OFED 1.3 Beta release is available
Here is an issue we have:

    struct ibv_context {
        struct ibv_device *device;
        struct ibv_context_ops ops;
        int cmd_fd;
        int async_fd;
        int num_comp_vectors;
        pthread_mutex_t mutex;
        void *abi_compat;
    };

The binary is compiled with OFED 1.2 header files. When it tries to set async_fd to non-blocking, I get the error: Bad file descriptor. If I compile the binary with OFED-1.3-beta header files (with XRC changes), it works fine.

Is this the expected behavior, or will there be a fix?

Thanks.
--CQ Tang

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tziporet Koren
Sent: Thursday, November 22, 2007 9:46 AM
To: ewg@lists.openfabrics.org
Cc: [EMAIL PROTECTED]
Subject: [ofa-general] OFED 1.3 Beta release is available

Hi,

OFED 1.3 Beta release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.3/OFED-1.3-beta2.tgz
To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/

The RC1 release is expected on December 5

Tziporet & Vlad

Release information:

OS support:
Novell:
- SLES10
- SLES10 SP1 and up1
Redhat:
- Redhat EL4 up4 and up5
- Redhat EL5 and up1
kernel.org:
- 2.6.23 and 2.6.24-rc2

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED 1.3-alpha
* Kernel code based on 2.6.24-rc2
* New packages:
  * SRP target
  * qperf test from Qlogic
  * ibsim package
  * uDAPL 2.0 library (1.0 and 2.0 coexist)
* New OSes Support:
  * RHEL 5 up1
  * SLES10 SP1 up1
* Compilation issues resolved:
  * Open MPI compilation on SLES10 SP1
  * ibutils compiles on SLES10 PPC64 (64 bits)
  * Apply patches that fix warnings of backport patches
* Prefix is now supported properly
* RDS implementation for API version 2 was updated from the 1.2.5 branch
* Fix binary compatibility of libibverbs caused by XRC implementation
* Uninstall is now working properly
* ib-bonding update to release 19
* MPI packages update:
  * mvapich-1.0.0-1625.src.rpm
  * mvapich2-1.0.1-1.src.rpm
  * openmpi-1.2.4-1.src.rpm

Mlx4 driver specific changes:
* Enable changing the default of HCA resource limits with module parameters
* Default number of maximum QPs is now 128K (was 64K)
* Fixing max_cqe's (not adding an extra cqe)
* Fix state check in mlx4_qp_modify
* Sanity check userspace send queue sizes
* Several bug fixes in XRC

Tasks that should be completed for the beta release:
1. 32-bit libraries to be supported on SLES10 SP1 Update1.
2. Fix SDP stability issues
3. IPoIB performance improvements for small messages
4. Fix bugs
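For reference, the operation that fails in the report above can be sketched with plain POSIX calls. This is a minimal sketch that does not use libibverbs: set_nonblocking and nonblocking_pipe_check are hypothetical helper names, and a pipe stands in for the context's async_fd. It shows that fcntl() reports EBADF ("Bad file descriptor") whenever the integer it is given is not an open descriptor, which is what an application would see if it read async_fd from the wrong place in the structure.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: sets O_NONBLOCK on fd.
 * Returns 0 on success, or the errno value from fcntl()
 * (EBADF when fd is not an open descriptor). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return errno;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return errno;
    return 0;
}

/* Hypothetical self-check: a pipe stands in for a valid async_fd.
 * Returns 0 if the read end ends up non-blocking, non-zero otherwise. */
int nonblocking_pipe_check(void)
{
    int fds[2];
    int rc;

    if (pipe(fds) == -1)
        return errno;
    rc = set_nonblocking(fds[0]);
    if (rc == 0 && !(fcntl(fds[0], F_GETFL) & O_NONBLOCK))
        rc = -1;
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Calling set_nonblocking(-1) returns EBADF, while nonblocking_pipe_check() returns 0 on a working system.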
[ewg] Re: [ofa-general] OFED 1.3 Beta release is available
 > Here is an issue we have:
 >
 > struct ibv_context {
 >     struct ibv_device *device;
 >     struct ibv_context_ops ops;
 >     int cmd_fd;
 >     int async_fd;
 >     int num_comp_vectors;
 >     pthread_mutex_t mutex;
 >     void *abi_compat;
 > };
 >
 > The binary is compiled with OFED 1.2 header files, it tries to set async_fd to non-blocking, I get error:
 > Bad file descriptor. If I compile the binary with OFED-1.3-beta header files (with XRC changes), it works fine.
 >
 > Is this the expected behavior, or there will be a fix ?

Unfortunately the XRC patches were put into OFED 1.3 before they went into the upstream libibverbs tree, so I have not reviewed them in detail. If XRC support requires an ABI change, then we'll have to create a new ABI and provide versioned symbols for backwards compatibility.

However your problem seems quite strange: I don't see any change to struct ibv_context caused by the XRC patches. So I don't understand exactly what is causing the problem you see. Can you debug further to see which structure layout change is the real issue?

 - R.
Re: [ewg] Re: [ofa-general] OFED 1.3 Beta release is available
BTW, sifting through the OFED 1.3 libibverbs tree, I do see that the commit to add max_xrc_domains to struct ibv_device_attr did break things by adding the member in the middle of the structure (so that an app compiled against the old header will see bogus values for local_ca_ack_delay and phys_port_count).

Actually looking at the commit again, it's worse than that... anything compiled against the old header that calls ibv_query_device() may get memory corrupted, because the new ibv_query_device() writes to a bigger structure than the app passes in.

The perils of not reviewing properly I guess...

 - R.
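The two effects described above can be illustrated with a small sketch. The structures below are heavily reduced, hypothetical stand-ins for the old and new struct ibv_device_attr (only three or four members, with max_ah as an arbitrary leading field), not the real definitions. The sketch assumes a 4-byte int.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for the layout an old application was compiled against. */
struct attr_old {
    int     max_ah;
    uint8_t local_ca_ack_delay;
    uint8_t phys_port_count;
};

/* Same structure with a member inserted in the middle. */
struct attr_new {
    int     max_ah;
    int     max_xrc_domains;   /* new member pushes later fields down */
    uint8_t local_ca_ack_delay;
    uint8_t phys_port_count;
};

/* Bytes a library using attr_new would write past an attr_old buffer. */
size_t extra_bytes_written(void)
{
    return sizeof(struct attr_new) - sizeof(struct attr_old);
}

/* The "library" fills the new layout; the "old application" interprets
 * the first sizeof(struct attr_old) bytes with its own layout. */
uint8_t ack_delay_seen_by_old_app(uint8_t real_delay)
{
    struct attr_new filled;
    struct attr_old view;

    memset(&filled, 0, sizeof filled);
    filled.max_xrc_domains    = 0x2A2A2A2A; /* same byte everywhere, endian-neutral */
    filled.local_ca_ack_delay = real_delay;
    filled.phys_port_count    = 1;

    memcpy(&view, &filled, sizeof view);
    return view.local_ca_ack_delay;         /* actually a byte of max_xrc_domains */
}
```

With these definitions the old view of local_ca_ack_delay lands on a byte of max_xrc_domains, and extra_bytes_written() is non-zero, which corresponds to the overflow of the caller's buffer.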
Re: [ewg] Re: [ofa-general] OFED 1.3 Beta release is available
oops, sorry... I see that the very next OFED 1.3 commit reverted that change, so things aren't as bad as I thought. Never mind.

 - R.
[ewg] RE: [ofa-general] OFED 1.3 Beta release is available
I think the problem is that sizeof "struct ibv_context_ops" has changed, so the new driver returns a bigger "struct ibv_context", while an app compiled with the older header file has a smaller "struct ibv_context" and uses the old offsets to find the fields after "ops".

--CQ

> -----Original Message-----
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 04, 2007 6:18 PM
> To: Tang, Changqing
> Cc: Tziporet Koren; ewg@lists.openfabrics.org; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] OFED 1.3 Beta release is available
>
> [...]
>
> However your problem seems quite strange: I don't see any change to struct ibv_context caused by the XRC patches. So I don't understand exactly what is causing the problem you see. Can you debug further to see which structure layout change is the real issue?
>
> - R.
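The diagnosis above can be illustrated with mock structures. This is only a sketch: ops_old, ops_new, ctx_old and ctx_new are hypothetical, reduced stand-ins for the real libibverbs types, and the extra slot name is invented. It shows that when an embedded ops table grows, every member after it moves, so code using the old offset of async_fd reads something else.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced ops tables: the new one has one extra slot. */
struct ops_old { void *query_device; void *create_qp; };
struct ops_new { void *query_device; void *create_qp; void *extra_xrc_slot; };

/* Context as seen by an old application and as built by a new library. */
struct ctx_old { void *device; struct ops_old ops; int cmd_fd; int async_fd; };
struct ctx_new { void *device; struct ops_new ops; int cmd_fd; int async_fd; };

/* The "library" fills a ctx_new; the "old application" reads async_fd
 * at the offset it was compiled with. */
int async_fd_seen_by_old_app(int real_fd)
{
    struct ctx_new ctx;
    int seen;

    memset(&ctx, 0, sizeof ctx);
    /* Recognisable byte pattern standing in for function pointers. */
    memset(&ctx.ops, 0x7f, sizeof ctx.ops);
    ctx.cmd_fd   = 3;
    ctx.async_fd = real_fd;

    memcpy(&seen, (const char *)&ctx + offsetof(struct ctx_old, async_fd),
           sizeof seen);
    return seen;
}
```

The value returned is whatever happens to sit at the old offset (part of a function pointer or cmd_fd, depending on the platform), not the real descriptor, which is consistent with fcntl() failing with "Bad file descriptor".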
[ewg] Re: [ofa-general] OFED 1.3 Beta release is available
 > I think the problem is that sizeof "struct ibv_context_ops" has changed, so the new driver returns a big "struct ibv_context", app compiled with older header file has a smaller "struct ibv_context" and use the old offset to find fields after "ops".

Oh crud, you're obviously right. For some reason I kept missing that when I looked over the code.

I think the only alternative we have to preserve backwards compatibility is to leave struct ibv_context_ops alone and change the structure to:

    struct ibv_context {
        struct ibv_device *device;
        struct ibv_context_ops ops;
        int cmd_fd;
        int async_fd;
        int num_comp_vectors;
        pthread_mutex_t mutex;
        void *abi_compat;
        struct ibv_xrc_op *xrc_ops;
    };

with xrc_ops added at the end. It's my fault for not making the ops member a pointer I guess.

Tziporet/Jack/whoever -- please fix up the libibverbs you ship for OFED 1.3 to resolve this.

We can clean this up for libibverbs 1.2 when the ABI can change, if/when we have something worth breaking the ABI for.

 - R.
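A quick way to convince oneself that appending a member keeps the old layout intact is to compare member offsets. The sketch below uses hypothetical mock types (ops_v1, xrc_ops_sketch, ctx_v1, ctx_v2) in place of the real libibverbs definitions. It relies on the assumption that the library, not the application, allocates the context, so old binaries only ever dereference a pointer to it.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real libibverbs types. */
struct ops_v1 { void *query_device; void *create_qp; };
struct xrc_ops_sketch { void *placeholder; };

/* Layout old applications were compiled against. */
struct ctx_v1 {
    void *device;
    struct ops_v1 ops;
    int cmd_fd;
    int async_fd;
    int num_comp_vectors;
    pthread_mutex_t mutex;
    void *abi_compat;
};

/* Proposed layout: ops untouched, new pointer appended at the end. */
struct ctx_v2 {
    void *device;
    struct ops_v1 ops;
    int cmd_fd;
    int async_fd;
    int num_comp_vectors;
    pthread_mutex_t mutex;
    void *abi_compat;
    struct xrc_ops_sketch *xrc_ops;
};

#define SAME_OFFSET(member) \
    (offsetof(struct ctx_v1, member) == offsetof(struct ctx_v2, member))

/* Returns 1 if every member known to old applications kept its offset. */
int layout_is_backward_compatible(void)
{
    return SAME_OFFSET(device) && SAME_OFFSET(ops) && SAME_OFFSET(cmd_fd) &&
           SAME_OFFSET(async_fd) && SAME_OFFSET(num_comp_vectors) &&
           SAME_OFFSET(mutex) && SAME_OFFSET(abi_compat) &&
           sizeof(struct ctx_v2) >= sizeof(struct ctx_v1);
}
```

The same kind of offset comparison could be kept as a build-time check so that a later change to the structure cannot silently move an existing member.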