Re: [OMPI devel] IBCM error
Thanks for the update.

Sean Hefty wrote:
> I've committed a patch to my libibcm git tree with the values
>
> IB_CM_ASSIGN_SERVICE_ID
> IB_CM_ASSIGN_SERVICE_ID_MASK
>
> these will be in libibcm release 1.0.3, which will shortly...
>
> - Sean
Re: [OMPI devel] IBCM error
I've committed a patch to my libibcm git tree with the values IB_CM_ASSIGN_SERVICE_ID IB_CM_ASSIGN_SERVICE_ID_MASK these will be in libibcm release 1.0.3, which will shortly... - Sean
Re: [OMPI devel] IBCM error
Sean Hefty wrote:
>> It is not zero, it should be:
>> #define IB_CM_ASSIGN_SERVICE_ID __cpu_to_be64(0x0200ULL)
>>
>> Unfortunately the value is defined in the kernel-level IBCM and is not exposed to user level.
>> Can you please expose it to user level (infiniband/cm.h)?
>
> Oops - good catch. I will add the assign ID and mask values to the header file for the next release. Until then, can you try using the values given in the kernel header file and let me know if it solves the problem?

I have already prepared a patch for OMPI that defines the value. A few people have already reported that the patch works (https://svn.open-mpi.org/trac/ompi/ticket/1388).

Pasha
Re: [OMPI devel] IBCM error
>It is not zero, it should be:
>#define IB_CM_ASSIGN_SERVICE_ID __cpu_to_be64(0x0200ULL)
>
>Unfortunately the value is defined in the kernel-level IBCM and is not
>exposed to user level.
>Can you please expose it to user level (infiniband/cm.h)?

Oops - good catch. I will add the assign ID and mask values to the header file for the next release. Until then, can you try using the values given in the kernel header file and let me know if it solves the problem?

- Sean
Re: [OMPI devel] IBCM error
Jeff Squyres wrote:
> On Jul 16, 2008, at 11:07 AM, Don Kerr wrote:
>
>>> Pasha added configure switches for this about a week ago:
>>>
>>> --en|disable-openib-ibcm
>>> --en|disable-openib-rdmacm
>>
>> I like these flags but I thought there was going to be a run time check for cases where Open MPI is built on a system that has ibcm support but is later run on a system without ibcm support.
>
> Yes, there are.
>
> - if the /dev/infiniband/ucm* files aren't there, we silently return "not supported" and skip ibcm
> - if ib_cm_open_device() (the first function call) fails, we assume that IBCM simply isn't supported on this platform and silently return "not supported" and skip ibcm

Right now we skip IBCM all the time. IBCM is used only if the user requests it explicitly via the include/exclude interface.

Pasha
Re: [OMPI devel] IBCM error
Sean Hefty wrote:
> If you don't care what the service ID is, you can specify 0, and the kernel will assign one. The assigned value can be retrieved by calling ib_cm_attr_id(). (I'm assuming that you communicate the IDs out of band somehow.)

It is not zero, it should be:

#define IB_CM_ASSIGN_SERVICE_ID __cpu_to_be64(0x0200ULL)

Unfortunately the value is defined in the kernel-level IBCM and is not exposed to user level.
Can you please expose it to user level (infiniband/cm.h)?

Regards,
Pasha
Re: [OMPI devel] IBCM error
On Jul 16, 2008, at 11:07 AM, Don Kerr wrote:

>> Pasha added configure switches for this about a week ago:
>>
>> --en|disable-openib-ibcm
>> --en|disable-openib-rdmacm
>
> I like these flags but I thought there was going to be a run time check for cases where Open MPI is built on a system that has ibcm support but is later run on a system without ibcm support.

Yes, there are.

- if the /dev/infiniband/ucm* files aren't there, we silently return "not supported" and skip ibcm
- if ib_cm_open_device() (the first function call) fails, we assume that IBCM simply isn't supported on this platform and silently return "not supported" and skip ibcm

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
Jeff Squyres wrote: On Jul 15, 2008, at 7:30 AM, Ralph Castain wrote: Minor clarification: we did not test RDMACM on RoadRunner. Just for further clarification - I did, and it wasn't a particularly good experience. Encountered several problems, none of them overwhelming, hence my comments. Ah -- I didn't know this. What went wrong? We need to fix it if there are problems. RDMACM, on the other hand, is *necessary* for iWARP connections. We know it won't scale well because of ARP issues, to which the iWARP vendors are publishing their own solutions (pre-populating ARP caches, etc.). Even when built and installed, RDMACM will not be used by default for IB hardware (you have to specifically ask for it). Since it's necessary for iWARP, I think we need to build and install it by default. Most importantly: production IB users won't be disturbed. If it is necessary for iWARP, then fine - so long as it is only used if specifically requested. However, I would also ask that we be able to -not- build it upon request so we can be certain a user doesn't attempt to use it by mistake ("gee, that looks interesting - let Mikey try it!"). Ditto for ibcm support. Pasha added configure switches for this about a week ago: --en|disable-openib-ibcm --en|disable-openib-rdmacm I like these flags but I thought there was going to be a run time check for cases where Open MPI is built on a system that has ibcm support but is later run on a system without ibcm support. -DON
Re: [OMPI devel] IBCM error
> Guess what - we don't always put them out there because - tada - we don't use them! What goes out on the backend is a stripped down version of libraries we require. Given the huge number of libraries people provide (looking at the bigger, beyond OMPI picture), it consumes a lot of limited disk space to install every library on every node. So sometimes we build our own rpm's to pick up only what we need.
>
> As long as --without-rdmacm --without-ibcm are present, then we are happy.

FYI I recently added options that allow enabling/disabling all the *cm stuff:

  --enable-openib-ibcm    Enable Open Fabrics IBCM support in openib BTL (default: enabled)
  --enable-openib-rdmacm  Enable Open Fabrics RDMACM support in openib BTL (default: enabled)
Re: [OMPI devel] IBCM error
On 7/15/08 5:05 AM, "Jeff Squyres"wrote: > On Jul 14, 2008, at 3:04 PM, Ralph H. Castain wrote: > >> I've been quietly following this discussion, but now feel a need to >> jump >> in here. I really must disagree with the idea of building either >> IBCM or >> RDMACM support by default. Neither of these has been proven to >> reliably >> work, or to be advantageous. Our own experiences in testing them >> have been >> slightly negative at best. When the did work, they were slower, didn't >> scale well, and unreliable. > > Minor clarification: we did not test RDMACM on RoadRunner. Just for further clarification - I did, and it wasn't a particularly good experience. Encountered several problems, none of them overwhelming, hence my comments. > > We only tested IBCM at scale (not RDMACM) and ran into a variety of > issues -- most of which were bugs in Open MPI's use of IBCM -- that > culminated in the ib_cm_listen() problem. That problem is currently > unsolved, and I agree that it unfortunately currently makes OMPI's > IBCM support fairly useless. Bonk. > > IBCM was thought to be a nice thing: a cheap/fast way to make IB > connections that would get OOB out of the picture. If the > ib_cm_listen() problem is fixed, it may still be (Sean had an > interesting suggestion; we'll see where it goes). But I totally agree > that it is somewhat of an unknown quantity at this point. I also > agree that the IBCM support in OMPI is not *necessary* because OOB > works just fine (especially with the scalability improvements in v1.3). > > RDMACM, on the other hand, is *necessary* for iWARP connections. We > know it won't scale well because of ARP issues, to which the iWARP > vendors are publishing their own solutions (pre-populating ARP caches, > etc.). Even when built and installed, RDMACM will not be used by > default for IB hardware (you have to specifically ask for it). Since > it's necessary for iWARP, I think we need to build and install it by > default. 
Most importantly: production IB users won't be disturbed. If it is necessary for iWARP, then fine - so long as it is only used if specifically requested. However, I would also ask that we be able to -not- build it upon request so we can be certain a user doesn't attempt to use it by mistake ("gee, that looks interesting - let Mikey try it!"). Ditto for ibcm support. This way, we can experiment with it and continue to learn the problems without forcing our production people to deal with problem tickets because a user tried something that has known problems. > >> I'm not trying to rain on anyone's parade. These are worthwhile in the >> long term. However, they clearly need further work to be "ready for >> prime >> time". >> >> Accordingly, I would recommend that they -only- be built if >> specifically >> requested. Remember, most of our users just build blindly. It makes no >> sense to have them build support for what can only be classed as an >> experimental capability at this time. > > I defer to Mellanox for a decision about the IBCM CPC. > > But for the RDMACM, per above, I am still in favor of building and > installing it by default. Like I said, no problem - but give me a configure option so I can -not- build it too. > >> Also, note that the OFED install is less-than-reliable wrt IBCM and >> RDMACM. > > True; the OFED install is less-than-reliable w.r.t. IBCM per the > previously-discussed issue of not necessarily creating the /dev/ > infiniband/ucm* devices. There's a ticket open on the OpenFabrics > bugzilla about it. I wish it would get fixed. :-) > > But I've not seen any problems with OFED's RDMACM installation. > > The only issue I've seen with RDMACM is when sites consciously chose > to put the RDMACM libraries and/or modules on the head node (and > therefore OMPI built support for it), but then chose not put them out > on back-end compute nodes. 
Keep in mind that this is *not* the > default OFED installation pattern -- a human has to go manually modify > a file to make it do that (I don't believe that there's even a menu > option for that installation mode; you have to go hand-edit an OFED > installation configuration file or simply choose not to put / remove > certain RPMs out on back-end nodes). Guess what - we don't always put them out there because - tada - we don't use them! What goes out on the backend is a stripped down version of libraries we require. Given the huge number of libraries people provide (looking at the bigger, beyond OMPI picture), it consumes a lot of limited disk space to install every library on every node. So sometimes we build our own rpm's to pick up only what we need. As long as --without-rdmacm --without-ibcm are present, then we are happy. > >> We have spent considerable time chasing down installation problems >> that allowed the system to build, but then caused it to crash-and- >> burn if >> we attempted to use it. We have gained experience at knowing when/ >> where
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 3:04 PM, Ralph H. Castain wrote: I've been quietly following this discussion, but now feel a need to jump in here. I really must disagree with the idea of building either IBCM or RDMACM support by default. Neither of these has been proven to reliably work, or to be advantageous. Our own experiences in testing them have been slightly negative at best. When the did work, they were slower, didn't scale well, and unreliable. Minor clarification: we did not test RDMACM on RoadRunner. We only tested IBCM at scale (not RDMACM) and ran into a variety of issues -- most of which were bugs in Open MPI's use of IBCM -- that culminated in the ib_cm_listen() problem. That problem is currently unsolved, and I agree that it unfortunately currently makes OMPI's IBCM support fairly useless. Bonk. IBCM was thought to be a nice thing: a cheap/fast way to make IB connections that would get OOB out of the picture. If the ib_cm_listen() problem is fixed, it may still be (Sean had an interesting suggestion; we'll see where it goes). But I totally agree that it is somewhat of an unknown quantity at this point. I also agree that the IBCM support in OMPI is not *necessary* because OOB works just fine (especially with the scalability improvements in v1.3). RDMACM, on the other hand, is *necessary* for iWARP connections. We know it won't scale well because of ARP issues, to which the iWARP vendors are publishing their own solutions (pre-populating ARP caches, etc.). Even when built and installed, RDMACM will not be used by default for IB hardware (you have to specifically ask for it). Since it's necessary for iWARP, I think we need to build and install it by default. Most importantly: production IB users won't be disturbed. I'm not trying to rain on anyone's parade. These are worthwhile in the long term. However, they clearly need further work to be "ready for prime time". Accordingly, I would recommend that they -only- be built if specifically requested. 
Remember, most of our users just build blindly. It makes no sense to have them build support for what can only be classed as an experimental capability at this time. I defer to Mellanox for a decision about the IBCM CPC. But for the RDMACM, per above, I am still in favor of building and installing it by default. Also, note that the OFED install is less-than-reliable wrt IBCM and RDMACM. True; the OFED install is less-than-reliable w.r.t. IBCM per the previously-discussed issue of not necessarily creating the /dev/ infiniband/ucm* devices. There's a ticket open on the OpenFabrics bugzilla about it. I wish it would get fixed. :-) But I've not seen any problems with OFED's RDMACM installation. The only issue I've seen with RDMACM is when sites consciously chose to put the RDMACM libraries and/or modules on the head node (and therefore OMPI built support for it), but then chose not put them out on back-end compute nodes. Keep in mind that this is *not* the default OFED installation pattern -- a human has to go manually modify a file to make it do that (I don't believe that there's even a menu option for that installation mode; you have to go hand-edit an OFED installation configuration file or simply choose not to put / remove certain RPMs out on back-end nodes). We have spent considerable time chasing down installation problems that allowed the system to build, but then caused it to crash-and- burn if we attempted to use it. We have gained experience at knowing when/ where to look now, but that doesn't lessen the reputation impact OMPI is getting as a "buggy, cantankerous beast" according to our sys admins. Isn't the whole point of pre-release test versions is to find and fix such bugs? ;-) Not a reputation we should be encouraging. Turning this off by default allows those more adventurous souls to explore this capability, while letting our production-oriented customers install and run in peace. 
Pasha was recommending that IBCM be built by default *but not used by default*. So production users would still be able to run in peace -- OOB will still be the default. I see it pretty much like SLURM support: it's built by default, but it won't activate itself unless relevant. But like I said above, I defer to Mellanox for IBCM. :-) Just my $0.002... -- Jeff Squyres Cisco Systems
Re: [OMPI devel] IBCM error
>> I need to check on this. You may want to look at section A3.2.3 of the spec. If you set the first byte (network order) to 0x00, and the 2nd byte to 0x01, then you hit a 'reserved' range that probably isn't being used currently.
>>
>> If you don't care what the service ID is, you can specify 0, and the kernel will assign one. The assigned value can be retrieved by calling ib_cm_attr_id(). (I'm assuming that you communicate the IDs out of band somehow.)
>
> Ok; we'll need to check into this. I don't remember the ordering -- we might actually be communicating the IDs before calling ib_cm_listen() (since we were simply using the PIDs, we could do that).
>
> Thanks for the tip! Pasha -- can you look into this?

It looks like we prepare the modex message during the query stage, so the order looks ok.
Unfortunately, on my machines the ibcm module does not create "/dev/infiniband/ucm*", so I cannot test the functionality.

Regards,
Pasha.
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 5:48 PM, Sean Hefty wrote:

>> Is there a service ID range that is guaranteed to be available for user apps?
>
> I need to check on this. You may want to look at section A3.2.3 of the spec. If you set the first byte (network order) to 0x00, and the 2nd byte to 0x01, then you hit a 'reserved' range that probably isn't being used currently.
>
> If you don't care what the service ID is, you can specify 0, and the kernel will assign one. The assigned value can be retrieved by calling ib_cm_attr_id(). (I'm assuming that you communicate the IDs out of band somehow.)

Ok; we'll need to check into this. I don't remember the ordering -- we might actually be communicating the IDs before calling ib_cm_listen() (since we were simply using the PIDs, we could do that).

Thanks for the tip! Pasha -- can you look into this?

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
>Ah! I did not realize that there were other services on the machine
>that were using / reserving IBCM service ID's.

Intel MPI hit a similar problem a long, long time ago.

>Is there a service ID range that is guaranteed to be available for
>user apps?

I need to check on this. You may want to look at section A3.2.3 of the spec. If you set the first byte (network order) to 0x00, and the 2nd byte to 0x01, then you hit a 'reserved' range that probably isn't being used currently.

If you don't care what the service ID is, you can specify 0, and the kernel will assign one. The assigned value can be retrieved by calling ib_cm_attr_id(). (I'm assuming that you communicate the IDs out of band somehow.)

- Sean
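To make the byte layout in the suggestion above concrete, here is a minimal sketch of building an 8-byte service ID in network order with the first byte 0x00 and the second byte 0x01. The helper names are hypothetical, and packing a 32-bit local value (such as a PID) into the last four bytes is an assumption made purely for illustration; the thread does not say how the remaining bytes should be used.

```c
#include <stdint.h>
#include <string.h>

/* Build an 8-byte service ID laid out in network (big-endian) order.
 * Byte 0 is 0x00 and byte 1 is 0x01, as suggested in the thread.
 * Storing `local_value` in bytes 4..7 is an assumption of this sketch. */
static void build_service_id(uint32_t local_value, uint8_t out[8])
{
    memset(out, 0, 8);
    out[0] = 0x00;
    out[1] = 0x01;
    out[4] = (uint8_t)(local_value >> 24);
    out[5] = (uint8_t)(local_value >> 16);
    out[6] = (uint8_t)(local_value >> 8);
    out[7] = (uint8_t)(local_value);
}

/* Copy the bytes into a 64-bit integer without reordering them, so the
 * in-memory representation stays in network order regardless of the
 * host's endianness. */
static uint64_t service_id_as_be64(const uint8_t bytes[8])
{
    uint64_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

Building the ID byte by byte sidesteps the host-versus-network-order question raised elsewhere in the thread, because the layout never depends on the machine's endianness.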
Re: [OMPI devel] IBCM error
>The service ID that it uses is its PID and the mask is always 0.
>There will only be one call to ib_cm_listen() per device per MPI
>process.
>
>Open MPI certainly could be buggy with IBCM, of course -- but it's
>fishy that the same exact "mpirun ..." command line works one time and
>fails the next (it's kinda random when the problem occurs).

I just want to make sure that service ID collision isn't the issue. (It may be unlikely, but it could happen.) Using the PID is random, and could cause conflicts with other services, depending on the value that's used. I know SDP reserves ranges of service ID values.

Is the service ID specified in host or network order? Do you know the range of PIDs? I can see if any well known apps might collide.

- Sean
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 1:17 PM, Sean Hefty wrote:

>> I talked to Sean Hefty about it, but we never figured out a definitive cause or solution. My best guess is that there is something wonky about multiple processes simultaneously interacting with the IBCM kernel driver from userspace; but I don't know jack about kernel stuff, so that's a total SWAG.
>
> The only reason I can think of why ib_cm_listen() fails is if there's a conflict with the service_id and/or service_mask from multiple threads. What does OMPI pass in for these parameters?

The service ID that it uses is its PID and the mask is always 0. There will only be one call to ib_cm_listen() per device per MPI process.

Open MPI certainly could be buggy with IBCM, of course -- but it's fishy that the same exact "mpirun ..." command line works one time and fails the next (it's kinda random when the problem occurs).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
I've been quietly following this discussion, but now feel a need to jump in here. I really must disagree with the idea of building either IBCM or RDMACM support by default. Neither of these has been proven to reliably work, or to be advantageous. Our own experiences in testing them have been slightly negative at best. When they did work, they were slower, didn't scale well, and were unreliable.

I'm not trying to rain on anyone's parade. These are worthwhile in the long term. However, they clearly need further work to be "ready for prime time".

Accordingly, I would recommend that they -only- be built if specifically requested. Remember, most of our users just build blindly. It makes no sense to have them build support for what can only be classed as an experimental capability at this time.

Also, note that the OFED install is less-than-reliable wrt IBCM and RDMACM. We have spent considerable time chasing down installation problems that allowed the system to build, but then caused it to crash-and-burn if we attempted to use it. We have gained experience at knowing when/where to look now, but that doesn't lessen the reputation impact OMPI is getting as a "buggy, cantankerous beast" according to our sys admins.

Not a reputation we should be encouraging. Turning this off by default allows those more adventurous souls to explore this capability, while letting our production-oriented customers install and run in peace.

Ralph

> On Jul 14, 2008, at 9:21 AM, Pavel Shamis (Pasha) wrote:
>
>>> Should we not even build support for it?
>> I think the IBCM CPC build should be enabled by default. IBCM is
>> supplied with OFED, so there should not be any problem during install.
>
> Ok. But remember that there are at least some OS's where /dev/ucm* do
> *not* get created by default for some unknown reason (even though IBCM
> is installed).
>
>>> PRO: don't even allow the possibility of running with it, because
>>> we know that there are issues with the ibcm userspace library
>>> (i.e., reduce problem reports from users)
>>>
>>> PRO: users don't have to have libibcm installed on compute nodes
>>> (we've actually gotten some complaints about this)
>> We got complaints only for the case where ompi was built on a platform
>> with IBCM and then run on a platform without IBCM. Also, we
>> did not have an option to disable
>> ibcm during compilation, so there was actually no way to install
>> OMPI on the compute node. We added the option and the problem was
>> resolved.
>> In most cases the OFED install is the same on all nodes and there
>> should not be any problem building IBCM support by default.
>
>
> Ok, sounds good.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
Re: [OMPI devel] IBCM error
>I talked to Sean Hefty about it, but we never figured out a definitive
>cause or solution. My best guess is that there is something wonky
>about multiple processes simultaneously interacting with the IBCM
>kernel driver from userspace; but I don't know jack about kernel
>stuff, so that's a total SWAG.

The only reason I can think of why ib_cm_listen() fails is if there's a conflict with the service_id and/or service_mask from multiple threads. What does OMPI pass in for these parameters?

- Sean
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 9:21 AM, Pavel Shamis (Pasha) wrote:

>> Should we not even build support for it?
>
> I think the IBCM CPC build should be enabled by default. IBCM is supplied with OFED, so there should not be any problem during install.

Ok. But remember that there are at least some OS's where /dev/ucm* do *not* get created by default for some unknown reason (even though IBCM is installed).

>> PRO: don't even allow the possibility of running with it, because we know that there are issues with the ibcm userspace library (i.e., reduce problem reports from users)
>>
>> PRO: users don't have to have libibcm installed on compute nodes (we've actually gotten some complaints about this)
>
> We got complaints only for the case where ompi was built on a platform with IBCM and then run on a platform without IBCM. Also, we did not have an option to disable ibcm during compilation, so there was actually no way to install OMPI on the compute node. We added the option and the problem was resolved.
> In most cases the OFED install is the same on all nodes and there should not be any problem building IBCM support by default.

Ok, sounds good.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
> Should we not even build support for it?

I think the IBCM CPC build should be enabled by default. IBCM is supplied with OFED, so there should not be any problem during install.

> PRO: don't even allow the possibility of running with it, because we know that there are issues with the ibcm userspace library (i.e., reduce problem reports from users)
>
> PRO: users don't have to have libibcm installed on compute nodes (we've actually gotten some complaints about this)

We got complaints only for the case where ompi was built on a platform with IBCM and then run on a platform without IBCM. Also, we did not have an option to disable ibcm during compilation, so there was actually no way to install OMPI on the compute node. We added the option and the problem was resolved.
In most cases the OFED install is the same on all nodes and there should not be any problem building IBCM support by default.

Pasha
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 7:55 AM, Pavel Shamis (Pasha) wrote:

> I can add something like this at the head of the query function:
>
>     if (!mca_btl_openib_component.cpc_explicitly_defined)
>         return OMPI_ERR_NOT_SUPPORTED;

That sounds reasonable until the ibcm userspace library issues can be sorted out. Then perhaps this check can be removed.

Should we not even build support for it?

PRO: don't even allow the possibility of running with it, because we know that there are issues with the ibcm userspace library (i.e., reduce problem reports from users)

PRO: users don't have to have libibcm installed on compute nodes (we've actually gotten some complaints about this)

CON: OMPI is not release-synchronized with OFED; OFED could be released with a fixed ibcm userspace library, but it still wouldn't be built by default in OMPI

CON: users already have to have librdmacm installed on the compute nodes for the RDMA CM (e.g., probably mainly for iWARP support); adding ibcm and rdmacm user libs at the same time might actually be better (rather than rdmacm in v1.3 and ibcm in a future version)

Thoughts?

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
I can add something like this at the head of the query function:

    if (!mca_btl_openib_component.cpc_explicitly_defined)
        return OMPI_ERR_NOT_SUPPORTED;

Jeff Squyres wrote:
> On Jul 14, 2008, at 3:59 AM, Lenny Verkhovsky wrote:
>
>> Seems to be fixed.
>
> Well, it's "fixed" in that Pasha turned off the error message. But the same issue is undoubtedly happening.
>
> I was asking for something a little stronger: perhaps we should actually have IBCM not try to be used unless it's specifically asked for. Or maybe it shouldn't even build itself unless specifically asked for (which would obviously take care of the run-time issues as well).
>
> The whole point of doing IBCM was to have a simple and fast mechanism for IB wireup. But with these two problems (IBCM not properly installed on all systems, and ib_cm_listen() fails periodically), it more or less makes it unusable. Therefore we shouldn't recommend it to production customers, and per precedent elsewhere in the code base, we should either not build it by default and/or not use it unless specifically asked for.
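As a rough, self-contained illustration of the guard proposed at the top of this message: the struct, the function name and the `GATE_*` codes below are hypothetical stand-ins, not Open MPI's real types. Only the field name `cpc_explicitly_defined` is taken from the snippet in the message.

```c
#include <stdbool.h>

/* Hypothetical return codes standing in for Open MPI's constants. */
enum { GATE_SUCCESS = 0, GATE_ERR_NOT_SUPPORTED = -1 };

/* Hypothetical, reduced stand-in for the openib component state. */
struct sketch_component {
    bool cpc_explicitly_defined; /* true if the user named a CPC explicitly */
};

/* Query entry point for the sketch: bail out before touching IBCM at
 * all unless the user asked for a specific connection method. */
static int sketch_ibcm_query(const struct sketch_component *component)
{
    if (!component->cpc_explicitly_defined) {
        return GATE_ERR_NOT_SUPPORTED;
    }
    /* A real query would go on to its device checks and
     * ib_cm_listen() from here; the sketch just reports success. */
    return GATE_SUCCESS;
}
```

With a guard like this at the very top, a default run never reaches the IBCM code path, so a failing ib_cm_listen() would only ever be seen by users who opted in.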
Re: [OMPI devel] IBCM error
On Jul 14, 2008, at 3:59 AM, Lenny Verkhovsky wrote:

> Seems to be fixed.

Well, it's "fixed" in that Pasha turned off the error message. But the same issue is undoubtedly happening.

I was asking for something a little stronger: perhaps we should actually have IBCM not try to be used unless it's specifically asked for. Or maybe it shouldn't even build itself unless specifically asked for (which would obviously take care of the run-time issues as well).

The whole point of doing IBCM was to have a simple and fast mechanism for IB wireup. But with these two problems (IBCM not properly installed on all systems, and ib_cm_listen() fails periodically), it more or less makes it unusable. Therefore we shouldn't recommend it to production customers, and per precedent elsewhere in the code base, we should either not build it by default and/or not use it unless specifically asked for.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] IBCM error
Right about when Brad and I discovered that issue, I ran out of time. This made IBCM more-or-less unusable for many installations -- we were kinda hoping for an OpenFabrics fix...

On Jul 13, 2008, at 12:43 PM, Pavel Shamis (Pasha) wrote:

> Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897
> Are there any other known IBCM issues?
>
> Regards,
> Pasha
>
> Jeff Squyres wrote:
>> I think you said opposite things: Lenny's command line did not specifically ask for ibcm, but it was used anyway. Lenny -- did you explicitly request it somewhere else (e.g., env var or MCA param file)?
>>
>> I suspect that you did not; I suspect (without looking at the code again) that ibcm tried to select itself and failed on the ibcm_listen() call, so it fell back to oob. This might have to be another workaround in OMPI, perhaps something like this:
>>
>>     if (ibcm_listen() fails)
>>         if (ibcm explicitly requested)
>>             print_warning()
>>         fail to use ibcm
>>
>> Has this been filed as a bug at openfabrics.org? I don't think that I filed it when Brad and I were testing on RoadRunner -- it would probably be good if someone filed it.
>>
>> On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:
>>
>>> Pasha is right, I didn't disable it.
>>>
>>> On 7/13/08, Pavel Shamis (Pasha) wrote:
>>>> Jeff Squyres wrote:
>>>>> Brad and I did some scale testing of IBCM and saw this error sometimes. It seemed to happen with higher frequency when you increased the number of processes on a single node.
>>>>>
>>>>> I talked to Sean Hefty about it, but we never figured out a definitive cause or solution. My best guess is that there is something wonky about multiple processes simultaneously interacting with the IBCM kernel driver from userspace; but I don't know jack about kernel stuff, so that's a total SWAG.
>>>>>
>>>>> Thanks for reminding me of this issue; I admit that I had forgotten about it. :-( Pasha -- should IBCM not be the default?
>>>>
>>>> It is not the default. I guess Lenny configured it explicitly, didn't he?
>>>>
>>>> Pasha.
>>>>
>>>>> On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am getting this error sometimes.
>>>>>>
>>>>>> /home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile /home/USERS/lenny/TESTS/COMPILERS/hostfile /home/USERS/lenny/TESTS/COMPILERS/hello
>>>>>> [witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] failed to ib_cm_listen 10 times: rc=-1, errno=22
>>>>>> Hello world! I'm 0 of 100 on witch2
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> Lenny.

--
Jeff Squyres
Cisco Systems
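The workaround outlined in pseudocode in the quoted text could look roughly like the following in C. This is a sketch under stated assumptions, not the actual change in the changeset referenced above: `handle_listen_result`, the `CPC_*` codes and the warning text are all hypothetical, and the listen result is passed in as a plain integer so the example does not depend on libibcm.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical return codes for this sketch only. */
enum { CPC_OK = 0, CPC_NOT_SUPPORTED = -1 };

/* `listen_rc` stands in for the result of the ib_cm_listen() attempt
 * (nonzero means failure). A warning is printed only when the user
 * explicitly asked for ibcm; in either case a failed listen means
 * ibcm is not used, so the caller can fall back to another method. */
static int handle_listen_result(int listen_rc, bool ibcm_explicitly_requested)
{
    if (listen_rc != 0) {
        if (ibcm_explicitly_requested) {
            fprintf(stderr,
                    "warning: ib_cm_listen failed (rc=%d); not using ibcm\n",
                    listen_rc);
        }
        return CPC_NOT_SUPPORTED;
    }
    return CPC_OK;
}
```

The point of the shape is that a default run stays quiet and falls back, while a user who requested ibcm is told why it is not being used.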
Re: [OMPI devel] IBCM error
Seems to be fixed.

On 7/14/08, Lenny Verkhovsky wrote:
> ../configure --with-memory-manager=ptmalloc2 --with-openib
>
> I guess not. I always use the same configure line, and I only recently
> started to see this error.
Re: [OMPI devel] IBCM error
../configure --with-memory-manager=ptmalloc2 --with-openib

I guess not. I always use the same configure line, and I only recently started to see this error.

On 7/13/08, Jeff Squyres wrote:
> I think you said opposite things: Lenny's command line did not specifically
> ask for ibcm, but it was used anyway. Lenny -- did you explicitly request
> it somewhere else (e.g., env var or MCA param file)?
>
> I suspect that you did not; I suspect (without looking at the code again)
> that ibcm tried to select itself and failed on the ibcm_listen() call, so it
> fell back to oob. This might have to be another workaround in OMPI, perhaps
> something like this:
>
>   if (ibcm_listen() fails)
>       if (ibcm explicitly requested)
>           print_warning()
>       fail to use ibcm
>
> Has this been filed as a bug at openfabrics.org? I don't think that I
> filed it when Brad and I were testing on RoadRunner -- it would probably be
> good if someone filed it.
Re: [OMPI devel] IBCM error
Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897

Are there any other known IBCM issues?

Regards,
Pasha

Jeff Squyres wrote:
> I think you said opposite things: Lenny's command line did not specifically
> ask for ibcm, but it was used anyway. Lenny -- did you explicitly request
> it somewhere else (e.g., env var or MCA param file)?
>
> I suspect that you did not; I suspect (without looking at the code again)
> that ibcm tried to select itself and failed on the ibcm_listen() call, so it
> fell back to oob. This might have to be another workaround in OMPI, perhaps
> something like this:
>
>   if (ibcm_listen() fails)
>       if (ibcm explicitly requested)
>           print_warning()
>       fail to use ibcm
>
> Has this been filed as a bug at openfabrics.org? I don't think that I
> filed it when Brad and I were testing on RoadRunner -- it would probably be
> good if someone filed it.
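For readers following the thread, the workaround Jeff outlines in pseudocode above could look roughly like the C below. This is only an illustrative sketch and is not the code from changeset 18897. Every name in it (choose_ibcm, listen_fn_t, the stub listeners) is hypothetical and stands in for the real Open MPI and libibcm calls. It assumes the reading of the pseudocode in which the warning is printed only when IBCM was explicitly requested, while IBCM is skipped in either case. The retry count of 10 comes from the error message in the original report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* All names below are hypothetical; none come from Open MPI or libibcm. */

#define MAX_LISTEN_ATTEMPTS 10 /* mirrors "failed to ib_cm_listen 10 times" */

typedef enum { IBCM_SELECTED, IBCM_SKIPPED } ibcm_choice_t;

/* Stand-in for the real listen call: returns 0 on success, -1 with errno set. */
typedef int (*listen_fn_t)(void);

ibcm_choice_t choose_ibcm(listen_fn_t try_listen, bool explicitly_requested)
{
    int last_errno = 0;

    assert(try_listen != NULL);

    /* Retry the listen call a bounded number of times. */
    for (int attempt = 0; attempt < MAX_LISTEN_ATTEMPTS; ++attempt) {
        if (try_listen() == 0) {
            return IBCM_SELECTED;
        }
        last_errno = errno;
    }

    /* Only complain when the user asked for IBCM by name. */
    if (explicitly_requested) {
        fprintf(stderr,
                "warning: IBCM was requested but listen failed %d times (errno=%d)\n",
                MAX_LISTEN_ATTEMPTS, last_errno);
    }

    /* Either way, report "not usable" so the caller can pick another method. */
    return IBCM_SKIPPED;
}

/* Stub listeners for trying the sketch without InfiniBand hardware. */
int stub_listen_fails(void) { errno = EINVAL; return -1; }
int stub_listen_ok(void) { return 0; }
```

Calling choose_ibcm(stub_listen_fails, false) returns IBCM_SKIPPED without printing anything, which corresponds to the silent fallback to oob described in the thread. Passing true instead prints the warning before returning the same result.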
Re: [OMPI devel] IBCM error
Pasha is right, I didn't disable it.

On 7/13/08, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
>> Pasha -- should IBCM not be the default?
>
> It is not the default. I guess Lenny configured it explicitly, didn't he?
>
> Pasha.
Re: [OMPI devel] IBCM error
Jeff Squyres wrote:
> Brad and I did some scale testing of IBCM and saw this error sometimes.
> It seemed to happen with higher frequency when you increased the number of
> processes on a single node.
>
> I talked to Sean Hefty about it, but we never figured out a definitive
> cause or solution. My best guess is that there is something wonky about
> multiple processes simultaneously interacting with the IBCM kernel driver
> from userspace; but I don't know jack about kernel stuff, so that's a total
> SWAG.
>
> Thanks for reminding me of this issue; I admit that I had forgotten about
> it. :-( Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.
Re: [OMPI devel] IBCM error
Brad and I did some scale testing of IBCM and saw this error sometimes. It seemed to happen with higher frequency when you increased the number of processes on a single node.

I talked to Sean Hefty about it, but we never figured out a definitive cause or solution. My best guess is that there is something wonky about multiple processes simultaneously interacting with the IBCM kernel driver from userspace; but I don't know jack about kernel stuff, so that's a total SWAG.

Thanks for reminding me of this issue; I admit that I had forgotten about it. :-( Pasha -- should IBCM not be the default?

On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:
> Hi,
>
> I am getting this error sometimes.
>
> /home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile /home/USERS/lenny/TESTS/COMPILERS/hostfile /home/USERS/lenny/TESTS/COMPILERS/hello
> [witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] failed to ib_cm_listen 10 times: rc=-1, errno=22
> Hello world! I'm 0 of 100 on witch2
>
> Best Regards
> Lenny.

--
Jeff Squyres
Cisco Systems
[OMPI devel] IBCM error
Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile /home/USERS/lenny/TESTS/COMPILERS/hostfile /home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] failed to ib_cm_listen 10 times: rc=-1, errno=22
Hello world! I'm 0 of 100 on witch2

Best Regards
Lenny.
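A note on reading the report above: on Linux, errno 22 is EINVAL ("Invalid argument"), so the kernel rejected something passed to the listen call. The number alone does not say which argument was rejected. The small helper below is a hypothetical sketch (format_errno is not an Open MPI function) showing how a log line could carry the text from strerror next to the raw number, which makes such reports easier to triage.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "errno=N (description)" into buf.
 * Returns the number of characters snprintf would have written. */
int format_errno(int err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "errno=%d (%s)", err, strerror(err));
}
```

Called as format_errno(EINVAL, buf, sizeof buf) on Linux, the buffer begins with "errno=22"; the description text that follows depends on the C library.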