Re: [OMPI devel] [OMPI users] Adding a new BTL
Hello Gilles

Thank you; the build is successful now.

I do have a generic, unrelated question, though: I really appreciate how the principles of object-oriented design have been used throughout the OMPI architecture, and implemented in a language that is not object oriented. It is a textbook example in software engineering. However, I see that functions that are referenced only indirectly, via a structure of function pointers, have been declared 'extern' in header files. This, to me, goes against the principle of encapsulation. Making these 'static' and removing them from the headers does not break the build, of course, and does not even generate warnings. As an example, the following functions need not be externalized, except, perhaps, for debugging:

mca_btl_template_module_t mca_btl_template_module = {
    .super = {
        .btl_component = &mca_btl_template_component.super,
        .btl_add_procs = mca_btl_template_add_procs,
        .btl_del_procs = mca_btl_template_del_procs,
        .btl_register = mca_btl_template_register,
        .btl_finalize = mca_btl_template_finalize,
        .btl_alloc = mca_btl_template_alloc,
        .btl_free = mca_btl_template_free,
        .btl_prepare_src = mca_btl_template_prepare_src,
        .btl_send = mca_btl_template_send,
        .btl_put = mca_btl_template_put,
        .btl_get = mca_btl_template_get,
        .btl_register_mem = mca_btl_template_register_mem,
        .btl_deregister_mem = mca_btl_template_deregister_mem,
        .btl_ft_event = mca_btl_template_ft_event
    }
};

Is there any reason why it is done this way? If I made them 'static' in my own BTL code, would I get into trouble down the road?

Thanks
Durga

Life is complex. It has real and imaginary parts.

On Thu, Feb 25, 2016 at 7:02 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> on master/v2.x, you also have to
>
> rm -f opal/mca/btl/lf/.opal_ignore
>
> (and this file would have been .ompi_ignore on v1.10)
>
> Cheers,
>
> Gilles
>
> On Fri, Feb 26, 2016 at 7:44 AM, dpchoudh . wrote:
> > Hello Jeff and other developers:
> >
> > Attached are five files:
> > 1-2: Full output from autogen.pl and configure, captured with: ./ 2>&1 | tee .log
> > 3. Makefile.am of the specific BTL directory
> > 4. configure.m4 of the same directory
> > 5. config.log, as generated internally by autotools
> >
> > Thank you
> > Durga
> >
> > Life is complex. It has real and imaginary parts.
> >
> > On Thu, Feb 25, 2016 at 5:15 PM, Jeff Squyres (jsquyres) wrote:
> >>
> >> Can you send the full output from autogen and configure?
> >>
> >> Also, this is probably better suited for the devel list, since we're talking about OMPI internals.
> >>
> >> Sent from my phone. No type good.
> >>
> >> On Feb 25, 2016, at 2:06 PM, dpchoudh . wrote:
> >>
> >> Hello Gilles
> >>
> >> Thank you very much for your advice. Yes, I copied the templates from the master branch to the 1.10.2 release, since the release does not have them. And yes, changing the Makefile.am as you suggest did make the autogen error go away.
> >>
> >> However, in the master branch, the autotools seem to be ignoring the new btl directory altogether; i.e. I do not get a Makefile.in from the Makefile.am.
> >>
> >> In the 1.10.2 release, doing an identical sequence of steps does create a Makefile.in from Makefile.am (via autogen) and a Makefile from Makefile.in (via configure), but of course, the new BTL does not build because the include paths in master and 1.10.2 are different.
> >>
> >> My Makefile.am and configure.m4 are as follows. Any thoughts on what it would take in the master branch to hook up the new BTL directory?
> >>
> >> opal/mca/btl/lf/configure.m4
> >> #
> >> AC_DEFUN([MCA_opal_btl_lf_CONFIG],[
> >>     AC_CONFIG_FILES([opal/mca/btl/lf/Makefile])
> >> ])dnl
> >>
> >> opal/mca/btl/lf/Makefile.am ---
> >> amca_paramdir = $(AMCA_PARAM_SETS_DIR)
> >> dist_amca_param_DATA = netpipe-btl-lf.txt
> >>
> >> sources = \
> >>     btl_lf.c \
> >>     btl_lf.h \
> >>     btl_lf_component.c \
> >>     btl_lf_endpoint.c \
> >>     btl_lf_endpoint.h \
> >>     btl_lf_frag.c \
> >>     btl_lf_frag.h \
> >>     btl_lf_proc.c \
> >>     btl
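For concreteness, the linkage question at the top of this message is plain C: a function with internal linkage can still be reached through an exported table of function pointers, which is all the module struct needs. A minimal, self-contained sketch (generic names, not the actual OMPI types or signatures):

#include <stdio.h>

struct foo_ops {
    int (*send)(int dst, const void *buf, int len);
};

/* Internal linkage: the symbol is invisible outside this file... */
static int foo_send(int dst, const void *buf, int len)
{
    (void)buf;
    printf("sending %d bytes to rank %d\n", len, dst);
    return 0;
}

/* ...yet callers still reach it through the exported pointer table,
   the same way the BTL module struct reaches its functions. */
struct foo_ops foo_module = { .send = foo_send };

int main(void)
{
    return foo_module.send(1, "hi", 2);
}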
[OMPI devel] component progress function optional?
Hello all

(As you might know), I am working on implementing a new BTL for a proprietary fabric and, taking the path of least effort, copying and pasting code from various existing BTLs as appropriate for our hardware. My question is: is there any guidance on which of the functions must be implemented and which are optional (i.e. depend on the underlying hardware)? As a specific example, I see that mca_btl_tcp_component_progress() is never implemented, although similar functions in other BTLs are.

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.
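For context, a BTL component progress function is conventionally a zero-argument callback that drains any completions the hardware has posted and returns how many it handled (0 when nothing progressed). A self-contained sketch, with a stub standing in for the device's completion-queue poll (the foo_* names are hypothetical, not a real driver API):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for polling a hardware completion queue. */
static bool foo_poll_cq(void)
{
    return false;   /* nothing completed in this toy */
}

/* Return the number of completions handled on this call; the event
   loop calls this repeatedly, so returning 0 just means "try later". */
static int mca_btl_foo_component_progress(void)
{
    int count = 0;
    while (foo_poll_cq()) {
        /* run the fragment's completion callback, recycle it, etc. */
        count++;
    }
    return count;
}

int main(void)
{
    printf("progressed %d completions\n", mca_btl_foo_component_progress());
    return 0;
}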
[OMPI devel] Network atomic operations
Hello all

Here is a 101 level question: OpenMPI supports many transports out of the box, and can be extended to support those which it does not. Some of these transports, such as InfiniBand, provide hardware atomic operations on remote memory, whereas others, such as iWARP, do not.

My question is: how (and where in the code base) does OpenMPI use this feature on hardware that supports it? What is the penalty, in terms of additional code, runtime performance and all other considerations, on hardware that does not support it?

Thanks in advance.
Durga

Life is complex. It has real and imaginary parts.
Re: [OMPI devel] Network atomic operations
Hello Nathan, Mike and all

Thank you for your responses. Let me rephrase them to make sure I understood them correctly, and please correct me if I didn't:

1. Atomics are (have been) used in OSHMEM in the current (v1) release.
2. They are (will be) used in the MPI RMA in the v2 release, which has not happened yet.

I am sorry if I sound like I am nitpicking, but the reason I ask is that I am trying to implement a new BTL that I am supposed to demo on a customer's existing OMPI code base (which is obviously v1), but I am doing the development out of the master branch (to avoid porting later), so I am in a bit of a spaghetti situation.

Thank you
Durga

Life is complex. It has real and imaginary parts.

On Fri, Mar 4, 2016 at 11:06 AM, Nathan Hjelm wrote:
>
> On Thu, Mar 03, 2016 at 05:26:45PM -0500, dpchoudh . wrote:
> >    Hello all
> >
> >    Here is a 101 level question:
> >
> >    OpenMPI supports many transports, out of the box, and can be extended to
> >    support those which it does not. Some of these transports, such as
> >    infiniband, provide hardware atomic operations on remote memory, whereas
> >    others, such as iWARP, do not.
> >
> >    My question is: how (and where in the code base) does openMPI use this
> >    feature, on those hardware that support it? What is the penalty, in terms
> >    of additional code, runtime performance and all other considerations, on a
> >    hardware that does not support it?
>
> Network atomics are used for oshmem (see Mike's email) and MPI RMA. For
> RMA they are exposed through the BTL 3.0 interface on the v2.x branch
> and master. So far we have only really implemented compare-and-swap,
> atomic add, and atomic fetch-and-add. Compare-and-swap and fetch-and-add
> are required by our optimized RMA component (ompi/mca/osc/rdma).
>
> -Nathan
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/03/18688.php
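As an aside for readers unfamiliar with the operations Nathan lists, the semantics of fetch-and-add and compare-and-swap can be demonstrated locally with C11 atomics (this illustrates only the operations themselves, not how a NIC or OMPI exposes them):

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    _Atomic unsigned long x = 10;

    /* fetch-and-add: atomically add, returning the value *before* the add */
    unsigned long old = atomic_fetch_add(&x, 5);    /* old == 10, x == 15 */

    /* compare-and-swap: store 99 only if x still holds the expected value */
    unsigned long expected = 15;
    _Bool swapped = atomic_compare_exchange_strong(&x, &expected, 99);

    printf("old=%lu swapped=%d x=%lu\n",
           old, (int)swapped, (unsigned long)atomic_load(&x));
    return 0;
}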
[OMPI devel] Thread safety in the BTL layer
Hello all

Sorry for asking too many 101 questions; hopefully someone won't mind answering.

It looks like, as of the current release, some BTLs (e.g. openib) are not thread safe, and the code explicitly bails out if it finds that MPI_Init() was called with THREAD_MULTIPLE. Then there are some BTLs, such as TCP, that can handle THREAD_MULTIPLE. Here are the questions:

1. There must be global (shared) variables that the BTL layer is accessing, which give rise to the thread-safety issues. Is there a list of such variables, the code paths in which they are accessed, and/or any documentation on them (including any past mailing list posts)?

2. Browsing through the mailing list (I have been a subscriber to the *user* list for quite a while), it looks like a lot of people have stumbled onto the issue that the openib BTL is not thread safe. Given that it is, I'd presume, the most popular BTL, since InfiniBand-like fabrics hold the lion's share of the HPC interconnect market, it must be quite difficult to make it thread safe. Any comments on the level of work it would take to make sure a new BTL is thread safe? Something along the lines of a 'do-this' or 'don't-do-that' would be greatly appreciated.

3. It looks like the openib BTL bailing out if called with THREAD_MULTIPLE has been removed in the master branch (at least from a cursory look). Does that mean that the openib BTL is now thread safe, or is it that the check has simply been moved to another location?

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.
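To make question 2 concrete: 'thread safe' for a BTL mostly means that any state two application threads can touch concurrently (pending-fragment lists, credit counters, registration caches) is accessed under a lock or via atomics. Inside OMPI this is typically done with the OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK wrappers, which compile down to no-ops when threading is disabled; the underlying idea, shown with plain pthreads so the snippet stands alone:

#include <pthread.h>
#include <stdio.h>

/* Shared BTL-like state: a counter standing in for a pending-send list. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int pending;

/* Every path that can run concurrently (send, progress, completion)
   must serialize its access to the shared state. */
static void enqueue_send(void)
{
    pthread_mutex_lock(&lock);
    pending++;
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    enqueue_send();
    printf("pending = %d\n", pending);
    return 0;
}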
[OMPI devel] How to 'hook' a new BTL to OMPI call chain?
Hello all

Sorry about asking too many 101 level questions, but here is another one:

I have BTL code, called 'lf', that is ready for unit testing; it compiles with the OMPI tool chain (by doing a ./configure; make from the top level directory) and has the basic data transport calls implemented.

How do I 'hook up' the BTL to the OMPI call chain?

If I do the following:
mpirun -np 2 --hostfile ~/hostfile -mca btl lf,self ./NPmpi

it fails to run, and the gist of the failure is that it does not even attempt connecting with the 'lf' BTL (the error says: 'BTLs attempted: self').

The 'lf' NIC, an RDMA-capable card, also has a TCP/IP interface, and replacing 'lf' with 'tcp' in the above command *does* work.

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.
Re: [OMPI devel] How to 'hook' a new BTL to OMPI call chain?
Hi all

Anyone willing to help? :-)

I now have a follow-up question: I was trying to figure this out myself by taking backtraces from the BTLs that do work, and found that, since most of the internal functions are not exported, the backtraces contain just addresses, which are not all that helpful (even after building with --enable-debug). This goes back to a question that I myself asked recently, and I am now finding out the answer the hard way! Is there any way that all the internal functions, those not explicitly declared 'static', can be made visible?

Thanks
Durga

Life is complex. It has real and imaginary parts.

On Wed, Mar 16, 2016 at 12:52 PM, dpchoudh . wrote:
> Hello all
>
> Sorry about asking too many 101 level questions, but here is another one:
>
> I have BTL code, called 'lf', that is ready for unit testing; it compiles with the OMPI tool chain (by doing a ./configure; make from the top level directory) and has the basic data transport calls implemented.
>
> How do I 'hook up' the BTL to the OMPI call chain?
>
> If I do the following:
> mpirun -np 2 --hostfile ~/hostfile -mca btl lf,self ./NPmpi
>
> it fails to run, and the gist of the failure is that it does not even attempt connecting with the 'lf' BTL (the error says: 'BTLs attempted: self').
>
> The 'lf' NIC, an RDMA-capable card, also has a TCP/IP interface, and replacing 'lf' with 'tcp' in the above command *does* work.
>
> Thanks in advance
> Durga
>
> Life is complex. It has real and imaginary parts.
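One build-time knob that may be relevant here (stated with some hedging, since the exact behavior depends on compiler support): Open MPI normally builds internal symbols with hidden visibility, which is what strips their names from backtraces even in a debug build. Configuring with visibility disabled keeps the names of the non-static internal functions around:

./configure --enable-debug --disable-visibility ...

This does not affect functions declared 'static' (those have internal linkage regardless), but it stops the remaining internal symbols from being hidden via -fvisibility=hidden.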
[OMPI devel] mca_btl__prepare_dst
Hello developers

It looks like, in the trunk, the routine mca_btl__prepare_dst is no longer implemented, at least in the TCP and openib BTLs. Up until 1.10.2, it did exist. Is this a new MPI-3 related change? What is the reason behind it?

Thanks
Durga

Life is complex. It has real and imaginary parts.
[OMPI devel] IP address to verb interface mapping
Hello all

(Newbie warning! Sorry :-( )

Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for control traffic and some kind of raw verbs (or anything else, such as SRIO) interface for data transfer. Let's say my host file chooses 4 out of the 7 nodes for an MPI job, based on the IP addresses, which are assigned to the Ethernet interfaces.

My question is: where in the code is this mapping from IP address to whatever interface is used for MPI_Send/Recv determined, so that only those chosen nodes receive traffic over the verbs interface?

Thanks in advance
Durga

We learn from history that we never learn from history.
Re: [OMPI devel] IP address to verb interface mapping
Hi Gilles

Thanks for responding quickly; however, I am afraid I did not explain my question clearly enough; my apologies.

What I am trying to understand is this:

My cluster has (say) 7 nodes. I use IP-over-Ethernet for orted (for job launch and control traffic); this is not used for MPI messaging. Let's say that the IP addresses are 192.168.1.2-192.168.1.9. They are all in the same IP subnet.

The MPI messaging is done using some other interconnect, such as InfiniBand. All 7 nodes are connected to the same InfiniBand switch and hence are in the same (InfiniBand) subnet as well.

In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.7

My question is, how does OpenMPI pick the 4 InfiniBand interfaces that match the IP addresses? Put another way, the ranks of the launched jobs are (I presume) set up by orted by some mechanism. When I do an MPI_Send() to a given rank, the message goes to the InfiniBand interface with a particular LID. How does this IP-to-InfiniBand LID mapping happen?

Thanks
Durga

We learn from history that we never learn from history.

On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet wrote:
> Hi,
>
> the hostnames (or their IPs) are only used to ssh orted.
>
> if you use only the tcp btl:
>
> TCP *MPI* communications (vs OOB management communications) are handled by btl/tcp.
> by default, all usable interfaces are used, then messages are split (iirc, by the ob1 pml) and the "fragments" are sent using all interfaces.
>
> each interface has a latency and bandwidth that is used to split messages into fragments.
> (assuming it is correctly configured, 90% of a large message is sent over the 10GbE interface, and 10% is sent over the GbE interface)
>
> you can explicitly include/exclude interfaces:
> mpirun --mca btl_tcp_if_include ...
> or
> mpirun --mca btl_tcp_if_exclude ...
>
> (see ompi_info --all for the syntax)
>
> but if you use several btls (for example tcp and openib), the btl(s) with the lower exclusivity are not used.
> (for example, a large message is *not* split and sent using native ib, IPoIB and GbE, because the openib btl has a higher exclusivity than the tcp btl)
>
> did this answer your question?
>
> Cheers,
>
> Gilles
>
> On 4/8/2016 12:24 PM, dpchoudh . wrote:
>
> Hello all
>
> (Newbie warning! Sorry :-( )
>
> Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for control traffic and some kind of raw verbs (or anything else, such as SRIO) interface for data transfer. Let's say my host file chooses 4 out of the 7 nodes for an MPI job, based on the IP addresses, which are assigned to the Ethernet interfaces.
>
> My question is: where in the code is this mapping from IP address to whatever interface is used for MPI_Send/Recv determined, so that only those chosen nodes receive traffic over the verbs interface?
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.
>
> ___
> devel mailing list de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18746.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18747.php
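Gilles's 90%/10% figure above is simply a bandwidth-proportional split; a toy calculation of the same arithmetic (illustrative only, not the actual ob1 scheduling code):

#include <stdio.h>

int main(void)
{
    double bw[2] = { 10000.0, 1000.0 };   /* nominal Mb/s: 10GbE, GbE */
    double total = bw[0] + bw[1];
    size_t msg = 11u * 1024 * 1024;       /* an 11 MiB message */

    for (int i = 0; i < 2; i++) {
        /* each interface carries a share proportional to its bandwidth */
        printf("interface %d carries %zu bytes\n",
               i, (size_t)(msg * bw[i] / total));
    }
    return 0;
}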
Re: [OMPI devel] IP address to verb interface mapping
Thank you very much, Gilles. That is exactly the information I was looking for.

Best regards
Durga

We learn from history that we never learn from history.

On Fri, Apr 8, 2016 at 12:52 AM, Gilles Gouaillardet wrote:
> At init time, each task invokes btl_openib_component_init(), which invokes btl_openib_modex_send().
> basically, it collects the infiniband info (port, subnet, lid, ...) and "pushes" it to orted via the modex mechanism.
>
> When a connection is created, the remote information is retrieved via the modex mechanism in mca_btl_openib_proc_get_locked().
>
> Cheers,
>
> Gilles
>
> On 4/8/2016 1:30 PM, dpchoudh . wrote:
>
> Hi Gilles
>
> Thanks for responding quickly; however, I am afraid I did not explain my question clearly enough; my apologies.
>
> What I am trying to understand is this:
>
> My cluster has (say) 7 nodes. I use IP-over-Ethernet for orted (for job launch and control traffic); this is not used for MPI messaging. Let's say that the IP addresses are 192.168.1.2-192.168.1.9. They are all in the same IP subnet.
>
> The MPI messaging is done using some other interconnect, such as InfiniBand. All 7 nodes are connected to the same InfiniBand switch and hence are in the same (InfiniBand) subnet as well.
>
> In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.7
>
> My question is, how does OpenMPI pick the 4 InfiniBand interfaces that match the IP addresses? Put another way, the ranks of the launched jobs are (I presume) set up by orted by some mechanism. When I do an MPI_Send() to a given rank, the message goes to the InfiniBand interface with a particular LID. How does this IP-to-InfiniBand LID mapping happen?
>
> Thanks
> Durga
>
> We learn from history that we never learn from history.
>
> On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet wrote:
>
>> Hi,
>>
>> the hostnames (or their IPs) are only used to ssh orted.
>>
>> if you use only the tcp btl:
>>
>> TCP *MPI* communications (vs OOB management communications) are handled by btl/tcp.
>> by default, all usable interfaces are used, then messages are split (iirc, by the ob1 pml) and the "fragments" are sent using all interfaces.
>>
>> each interface has a latency and bandwidth that is used to split messages into fragments.
>> (assuming it is correctly configured, 90% of a large message is sent over the 10GbE interface, and 10% is sent over the GbE interface)
>>
>> you can explicitly include/exclude interfaces:
>> mpirun --mca btl_tcp_if_include ...
>> or
>> mpirun --mca btl_tcp_if_exclude ...
>>
>> (see ompi_info --all for the syntax)
>>
>> but if you use several btls (for example tcp and openib), the btl(s) with the lower exclusivity are not used.
>> (for example, a large message is *not* split and sent using native ib, IPoIB and GbE, because the openib btl has a higher exclusivity than the tcp btl)
>>
>> did this answer your question?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 4/8/2016 12:24 PM, dpchoudh . wrote:
>>
>> Hello all
>>
>> (Newbie warning! Sorry :-( )
>>
>> Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for control traffic and some kind of raw verbs (or anything else, such as SRIO) interface for data transfer. Let's say my host file chooses 4 out of the 7 nodes for an MPI job, based on the IP addresses, which are assigned to the Ethernet interfaces.
>>
>> My question is: where in the code is this mapping from IP address to whatever interface is used for MPI_Send/Recv determined, so that only those chosen nodes receive traffic over the verbs interface?
>>
>> Thanks in advance
>> Durga
>>
>> We learn from history that we never learn from history.
>>
>> ___
>> devel mailing list de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18746.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18747.php
>
> ___
> devel mailing list de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18748.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18749.php
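In code, the mechanism Gilles describes is a publish/fetch pair over the runtime's key-value store. The fragment below is patterned on the OPAL_MODEX_SEND/OPAL_MODEX_RECV calls that appear verbatim later in this thread; the payload struct and the foo names are illustrative, and the surrounding component boilerplate is omitted:

/* What each process publishes at init time; a real BTL would put
   whatever addressing its peers need (LID, QP number, ...) here. */
typedef struct { uint16_t lid; uint8_t port; } mca_btl_foo_modex_t;

int rc;
size_t size = 0;
mca_btl_foo_modex_t me = { .lid = my_lid, .port = my_port };
mca_btl_foo_modex_t *peer = NULL;

/* publish the local blob, keyed by this component's version... */
OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                &mca_btl_foo_component.super.btl_version,
                &me, sizeof(me));

/* ...and, when building an endpoint, fetch the peer's blob */
OPAL_MODEX_RECV(rc, &mca_btl_foo_component.super.btl_version,
                &opal_proc->proc_name, (uint8_t **)&peer, &size);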
Re: [OMPI devel] Common symbol warnings in tarballs (was: make install warns about 'common symbols')
Dear all

Just to clarify: I was doing a build (after adding code to support a new transport) from code pulled from git (a 'git clone') when I came across this warning, so I suppose this would be a 'developer build'. I know I am not a real MPI developer (I am doing OMPI internal development for the second time in my whole career), but if my vote counts, I'd vote for leaving the warning in. In my opinion, it encourages good coding practice, which should matter to everyone, not just 'core developers'. However, I agree that the phrasing of the warning is confusing, and adding a URL pointing to an appropriate page should be enough to prevent future questions like this in the support forum.

Thanks
Durga

1% of the executables have 99% of CPU privilege! Userspace code! Unite!! Occupy the kernel!!!

On Wed, Apr 20, 2016 at 1:41 PM, Ralph Castain wrote:
>
> > On Apr 20, 2016, at 10:24 AM, Dave Goodell (dgoodell) <dgood...@cisco.com> wrote:
> >
> > On Apr 20, 2016, at 9:14 AM, Jeff Squyres (jsquyres) wrote:
> >>
> >> I was under the impression that this warning script only ran for developer builds. But it looks like it's unconditionally run at the end of "make install" (on master only -- so far).
> >>
> >> Should we make this only run for developer builds? (e.g., check for $srcdir/.git, or somesuch) I think it's our goal to have zero common symbols, but that may not always be the case, and we don't want this potentially alarming warning showing up for users, right?
> >
> > IMO, this is basically just another warning flag. If you enable most compiler warnings for non-developer builds, I don't see why you would go out of your way to disable this particular one. You could always tweak the output to point to a wiki page that explains what the warning means, so concerned users can hopefully be assuaged.
>
> I guess this is where I differ. I see no benefit in warning a user about something they cannot control and that has no impact on them. These warnings were intended solely for developers as a reminder/suggestion that they follow a specific policy regarding common variables. Thus, they convey nothing of interest or use to a user.
>
> So I fail to see why we should include this warning in a non-developer build. As for other warnings, we have a stated policy (and proactive effort) to always stamp them out, so I don't think the user is actually seeing many (or any) of them. Remember, we turn off pedantic and other levels when doing non-developer builds.
>
> Seems like this warning falls into the same category to me.
>
> >
> > -Dave
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18794.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18795.php
[OMPI devel] modex receive
Hello all

I have been struggling with this issue for the last few days and thought it would be prudent to ask for help from people who have way more experience than I do.

There are two questions, interrelated in my mind, but maybe not so in reality. Question 2 is the issue I am struggling with, and question 1 sort of leads to it.

1. I see that in both the openib and tcp BTLs (the two kinds of hardware I have access to) a modex send happens, but a matching modex receive never happens. Is it because of some kind of optimization? (In my case, both IP NICs are in the same IP subnet and both IB NICs are in the same IB subnet.) Or am I not understanding something? How do the processes figure out their peer information without a modex receive?

The place in the code where the modex receive is called is in btl_add_procs(). However, it looks like in both the above BTLs, this method is never called. Is that expected?

2. The real question is this:
I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code) that has no routing capability in the protocol, and hence no concept of subnets. An HCA simply needs to be plugged in to the switch and it can see the whole network. However, there is a VLAN-like partitioning (similar to IB partitions.) Given this (and, as a first cut, every node is in the same partition, so even this complexity is eliminated), there is not much use for a modex exchange, but I added one anyway with just the partition key.

What I see is that the component open, register and init are all successful, but the r2 bml still does not choose this network, and thus OMPI aborts because of lack of full reachability.

This is my command line:
sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2 -mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100 ./mpitest

('mpitest' is a trivial 'hello world' program plus ONE MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required because currently the driver requires root permission; I was told that this will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with a back-to-back connection on this 'lf' HCA.)

The output of this command is as follows; I have added my comments to explain it a bit.
[b-2:21062] mca: base: components_register: registering framework bml components
[b-2:21062] mca: base: components_register: found loaded component r2
[b-2:21062] mca: base: components_register: component r2 register function successful
[b-2:21062] mca: base: components_open: opening bml components
[b-2:21062] mca: base: components_open: found loaded component r2
[b-2:21062] mca: base: components_open: component r2 open function successful
[b-2:21062] mca: base: components_register: registering framework btl components
[b-2:21062] mca: base: components_register: found loaded component self
[b-2:21062] mca: base: components_register: component self register function successful
[b-2:21062] mca: base: components_register: found loaded component lf
[b-2:21062] mca: base: components_register: component lf register function successful
[b-2:21062] mca: base: components_open: opening btl components
[b-2:21062] mca: base: components_open: found loaded component self
[b-2:21062] mca: base: components_open: component self open function successful
[b-2:21062] mca: base: components_open: found loaded component lf
lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
[b-2:21062] mca: base: components_open: component lf open function successful
[b-2:21062] select: initializing btl component self
[b-2:21062] select: init of component self returned success
[b-2:21062] select: initializing btl component lf
Created group on b-2
[b-2:21062] select: init of component lf returned success
[b-3:07672] mca: base: components_register: registering framework bml components
[b-3:07672] mca: base: components_register: found loaded component r2
[b-3:07672] mca: base: components_register: component r2 register function successful
[b-3:07672] mca: base: components_open: opening bml components
[b-3:07672] mca: base: components_open: found loaded component r2
[b-3:07672] mca: base: components_open: component r2 open function successful
[b-3:07672] mca: base: components_register: registering framework btl components
[b-3:07672] mca: base: components_register: found loaded component self
[b-3:07672] mca: base: components_register: component self register function successful
[b-3:07672] mca: base: components_register: found loaded component lf
[b-3:07672] mca: base: components_register: component lf register function successful
[b-3:07672] mca: base: components_open: opening btl components
[b-3:07672] mca: base: components_open: found loaded component self
[b-3:07672] mca: base: components_open: component self open function successful
[b-3:07672] mca: base: components_open: found loaded component lf
[b-3:07672] mca: base: components_open: component lf open function successful
[b-3:07672] select: initializing btl component self
[b-3:07672] select: init o
[OMPI devel] Why is floating point number used for locality
Hello all

I am wondering about the rationale for using floating point numbers to calculate 'distances' in the openib BTL. Is it because some distances can be infinite and there is no (conventional) way to represent infinity using integers?

Thanks for your comments
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.
Re: [OMPI devel] modex receive
Hello Gilles

You are absolutely right:

1. Adding --mca pml_base_verbose 100 does show that it is the cm PML that is being picked by default (even for TCP).
2. Adding --mca pml ob1 does cause add_procs() and related BTL friends to be invoked.

With a command line of

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca btl_base_verbose 100 -mca pml_base_verbose 100 ./mpitest

the output shows (among many other lines) the following:

[smallMPI:49178] select: init returned priority 30
[smallMPI:49178] select: initializing pml component ob1
[smallMPI:49178] select: init returned priority 20
[smallMPI:49178] select: component v not in the include list
[smallMPI:49178] selected cm best priority 30
[smallMPI:49178] select: component ob1 not selected / finalized
[smallMPI:49178] select: component cm selected

which shows that the cm PML was selected. Replacing 'tcp' above with 'openib' shows very similar results. (The openib BTL methods are not invoked, either.)

However, I was under the impression that the cm PML can only handle MTLs (and ob1 can only handle BTLs). So why is cm being selected for TCP?

Thank you
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Thu, Apr 28, 2016 at 2:34 AM, Gilles Gouaillardet wrote:
> the add_procs subroutine of the btl should be called.
>
> /* i added a printf in mca_btl_tcp_add_procs and it *is* invoked */
>
> can you try again with --mca pml ob1 --mca pml_base_verbose 100 ?
>
> maybe the add_procs subroutine is not invoked because openmpi uses cm instead of ob1
>
> Cheers,
>
> Gilles
>
> On 4/28/2016 3:07 PM, dpchoudh . wrote:
>
> Hello all
>
> I have been struggling with this issue for the last few days and thought it would be prudent to ask for help from people who have way more experience than I do.
>
> There are two questions, interrelated in my mind, but maybe not so in reality. Question 2 is the issue I am struggling with, and question 1 sort of leads to it.
>
> 1. I see that in both the openib and tcp BTLs (the two kinds of hardware I have access to) a modex send happens, but a matching modex receive never happens. Is it because of some kind of optimization? (In my case, both IP NICs are in the same IP subnet and both IB NICs are in the same IB subnet.) Or am I not understanding something? How do the processes figure out their peer information without a modex receive?
>
> The place in the code where the modex receive is called is in btl_add_procs(). However, it looks like in both the above BTLs, this method is never called. Is that expected?
>
> 2. The real question is this:
> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code) that has no routing capability in the protocol, and hence no concept of subnets. An HCA simply needs to be plugged in to the switch and it can see the whole network. However, there is a VLAN-like partitioning (similar to IB partitions.) Given this (and, as a first cut, every node is in the same partition, so even this complexity is eliminated), there is not much use for a modex exchange, but I added one anyway with just the partition key.
>
> What I see is that the component open, register and init are all successful, but the r2 bml still does not choose this network, and thus OMPI aborts because of lack of full reachability.
>
> This is my command line:
> sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2 -mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100 ./mpitest
>
> ('mpitest' is a trivial 'hello world' program plus ONE MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required because currently the driver requires root permission; I was told that this will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with a back-to-back connection on this 'lf' HCA.)
>
> The output of this command is as follows; I have added my comments to explain it a bit.
>
> [b-2:21062] mca: base: components_register: registering framework bml components
> [b-2:21062] mca: base: components_register: found loaded component r2
> [b-2:21062] mca: base: components_register: component r2 register function successful
> [b-2:21062] mca: base: components_open: opening bml components
> [b-2:21062] mca: base: components_open: found loaded component r2
> [b-2:21062] mca: base: components_open: component r2 open function successful
> [b-2:21062] mca: base: components_register: registering framework btl components
> [b-2:21062] mca: base: components_register: found loaded component self
> [b-2:21062] mca: base: components_register: component self register function successful
> [b-2:21062] mca: base: component
Re: [OMPI devel] modex receive
Hello Ralph and Gilles

Thanks for the clarification. My understanding was that if a BTL was specified to mpirun, then only that BTL (and, therefore, the ob1 PML) would be used. However, I always saw that this is not the case, and now I know why.

I do have PSM-capable cards (QLogic IB) in my nodes, and this time the link was up (however, as I reported earlier, this behaviour happens even with the PSM link down), so obviously the PSM MTL was chosen.

Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Thu, Apr 28, 2016 at 11:41 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> my basic understanding is that ob1 works with btl, and cm works with mtl
> (please someone corrects me if I am wrong)
> an other way to put this is cm cannot use the tcp btl.
>
> so I can only guess one mtl (PSM ?) is available, and so cm is preferred over ob1.
>
> what if you
> mpirun --mca mtl ^psm ...
> is cm selected over ob1 ?
>
> note PSM does not disqualify itself if there is no link, and this is now being investigated at intel.
>
> Cheers,
>
> Gilles
>
> On Friday, April 29, 2016, dpchoudh . wrote:
>
>> Hello Gilles
>>
>> You are absolutely right:
>>
>> 1. Adding --mca pml_base_verbose 100 does show that it is the cm PML that is being picked by default (even for TCP).
>> 2. Adding --mca pml ob1 does cause add_procs() and related BTL friends to be invoked.
>>
>> With a command line of
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca btl_base_verbose 100 -mca pml_base_verbose 100 ./mpitest
>>
>> the output shows (among many other lines) the following:
>>
>> [smallMPI:49178] select: init returned priority 30
>> [smallMPI:49178] select: initializing pml component ob1
>> [smallMPI:49178] select: init returned priority 20
>> [smallMPI:49178] select: component v not in the include list
>> [smallMPI:49178] selected cm best priority 30
>> [smallMPI:49178] select: component ob1 not selected / finalized
>> [smallMPI:49178] select: component cm selected
>>
>> which shows that the cm PML was selected. Replacing 'tcp' above with 'openib' shows very similar results. (The openib BTL methods are not invoked, either.)
>>
>> However, I was under the impression that the cm PML can only handle MTLs (and ob1 can only handle BTLs). So why is cm being selected for TCP?
>>
>> Thank you
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit ageing.
>>
>> On Thu, Apr 28, 2016 at 2:34 AM, Gilles Gouaillardet wrote:
>>
>>> the add_procs subroutine of the btl should be called.
>>>
>>> /* i added a printf in mca_btl_tcp_add_procs and it *is* invoked */
>>>
>>> can you try again with --mca pml ob1 --mca pml_base_verbose 100 ?
>>>
>>> maybe the add_procs subroutine is not invoked because openmpi uses cm instead of ob1
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 4/28/2016 3:07 PM, dpchoudh . wrote:
>>>
>>> Hello all
>>>
>>> I have been struggling with this issue for the last few days and thought it would be prudent to ask for help from people who have way more experience than I do.
>>>
>>> There are two questions, interrelated in my mind, but maybe not so in reality. Question 2 is the issue I am struggling with, and question 1 sort of leads to it.
>>>
>>> 1. I see that in both the openib and tcp BTLs (the two kinds of hardware I have access to) a modex send happens, but a matching modex receive never happens. Is it because of some kind of optimization? (In my case, both IP NICs are in the same IP subnet and both IB NICs are in the same IB subnet.) Or am I not understanding something? How do the processes figure out their peer information without a modex receive?
>>>
>>> The place in the code where the modex receive is called is in btl_add_procs(). However, it looks like in both the above BTLs, this method is never called. Is that expected?
>>>
>>> 2. The real question is this:
>>> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code) that has no routing capability in the protocol, and hence no concept of subnets. An HCA simply needs to be plugged in to the switch and it can see the whole network. However, the
[OMPI devel] Question about 'progress function'
Hi all

Apologies for a 101 level question again, but here it is:

A new BTL layer I am implementing hangs in MPI_Send(). Please keep in mind that at this stage, I am simply desperate to make MPI data move through this fabric in any way possible, so I have thrown all good programming practice out of the window and in the process might have added bugs.

The test code basically has a single call to MPI_Send() with 8 bytes of data, the smallest amount the HCA can DMA. I have a very simple mca_btl_component_progress() method that returns 0 if called before mca_btl_endpoint_send() and returns 1 if called after. I use a static variable to keep track of whether endpoint_send() has been called.

With this, the MPI process hangs with the following stack:

(gdb) bt
#0  0x7f7518c60b7d in poll () from /lib64/libc.so.6
#1  0x7f75183e79f6 in poll_dispatch (base=0x19cf480, tv=0x7f75177efe80) at poll.c:165
#2  0x7f75183df690 in opal_libevent2022_event_base_loop (base=0x19cf480, flags=1) at event.c:1630
#3  0x7f75183613d4 in progress_engine (obj=0x19cedd8) at runtime/opal_progress_threads.c:105
#4  0x7f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
#5  0x7f7518c6b1ad in clone () from /lib64/libc.so.6

I am using code from the master branch for this work.

Obviously I am not doing the progress handling right, and I don't even understand how it should work, as the TCP btl does not even provide a component progress function.

Any relevant pointer on how this should be done is highly appreciated.

Thanks
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.
Re: [OMPI devel] Question about 'progress function'
George

Thanks for your help. But what should the progress function return, so that the event is signalled? Right now I am returning 1 when data has been transmitted and 0 otherwise, but that does not seem to work. Also, please keep in mind that the transport I am working on supports unreliable datagrams only, so there is no ack from the recipient to wait for.

Thanks again
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Thu, May 5, 2016 at 11:33 PM, George Bosilca wrote:
> Durga,
>
> TCP doesn't need a specialized progress function because we are tied directly with libevent. In your case you should provide a BTL progress function, function that will be called at the end of libevent base loop regularly.
>
> George.
>
> On Thu, May 5, 2016 at 11:30 PM, dpchoudh . wrote:
>
>> Hi all
>>
>> Apologies for a 101 level question again, but here it is:
>>
>> A new BTL layer I am implementing hangs in MPI_Send(). Please keep in mind that at this stage, I am simply desperate to make MPI data move through this fabric in any way possible, so I have thrown all good programming practice out of the window and in the process might have added bugs.
>>
>> The test code basically has a single call to MPI_Send() with 8 bytes of data, the smallest amount the HCA can DMA. I have a very simple mca_btl_component_progress() method that returns 0 if called before mca_btl_endpoint_send() and returns 1 if called after. I use a static variable to keep track of whether endpoint_send() has been called.
>>
>> With this, the MPI process hangs with the following stack:
>>
>> (gdb) bt
>> #0  0x7f7518c60b7d in poll () from /lib64/libc.so.6
>> #1  0x7f75183e79f6 in poll_dispatch (base=0x19cf480, tv=0x7f75177efe80) at poll.c:165
>> #2  0x7f75183df690 in opal_libevent2022_event_base_loop (base=0x19cf480, flags=1) at event.c:1630
>> #3  0x7f75183613d4 in progress_engine (obj=0x19cedd8) at runtime/opal_progress_threads.c:105
>> #4  0x7f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
>> #5  0x7f7518c6b1ad in clone () from /lib64/libc.so.6
>>
>> I am using code from the master branch for this work.
>>
>> Obviously I am not doing the progress handling right, and I don't even understand how it should work, as the TCP btl does not even provide a component progress function.
>>
>> Any relevant pointer on how this should be done is highly appreciated.
>>
>> Thanks
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit ageing.
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/05/18919.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/05/18920.php
[OMPI devel] Process connectivity map
Hello all I have been struggling with this issue for a while and figured it might be a good idea to ask for help. Where (in the code path) is the connectivity map created? I can see that it is *used* in mca_bml_r2_endpoint_add_btl(), but obviously I am not setting it up right, because this routine is not finding the BTL corresponding to my interconnect. Thanks in advance Durga The surgeon general advises you to eat right, exercise regularly and quit ageing.
[OMPI devel] Misleading error messages?
In the file ompi/mca/bml/r2/bml_r2.c, it seems like the function name is incorrect in some error messages (it looks like a case of unchecked copy-paste) in:

1. Function mca_bml_r2_allocate_endpoint(), line 154
2. Function mca_bml_r2_endpoint_add_btl(), lines 200 and 206

This is on the master branch.

Thanks
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.
Re: [OMPI devel] Process connectivity map
Hello Gilles

Thanks for jumping in to help again. Actually, I had already tried some of your suggestions before asking for help.

I have several interconnects that can run both the openib and tcp BTLs. To simplify things, I explicitly specified TCP:

mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest

where mpitest is a small program that does MPI_Send()/MPI_Recv() on a small string, and then does an MPI_Barrier(). The program does work as expected.

I put a printf on the last line of mca_btl_tcp_add_procs() to print the value of 'reachable'. What I saw was that the value was always 0 when it was invoked for Send()/Recv(), and the pointer itself was NULL when invoked for Barrier().

Next I looked at pml_ob1_add_procs(), where the call chain starts, and found that it initializes and passes an opal_bitmap_t reachable down the call chain, but the resulting value is not used later in the code (the memory is simply freed later).

That, coupled with the fact that I am trying to imitate what the other BTL implementations are doing, yet in mca_bml_r2_endpoint_add_btl() my BTL is not being picked up, left me puzzled. Please note that the interconnect that I am developing for is on a different cluster (than where I ran the above test for the TCP BTL.)

Thanks again
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Sun, May 15, 2016 at 10:20 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> did you check the add_procs callbacks ?
> (e.g. mca_btl_tcp_add_procs() for the tcp btl)
> this is where the reachable bitmap is set, and I guess this is what you are looking for.
>
> keep in mind that if several btl can be used, the one with the higher exclusivity is used
> (e.g. tcp is never used if openib is available)
> you can simply force your btl and self, and the ob1 pml, so you do not have to worry about other btl exclusivity.
>
> Cheers,
>
> Gilles
>
> On Sunday, May 15, 2016, dpchoudh . wrote:
>
>> Hello all
>>
>> I have been struggling with this issue for a while and figured it might be a good idea to ask for help.
>>
>> Where (in the code path) is the connectivity map created?
>>
>> I can see that it is *used* in mca_bml_r2_endpoint_add_btl(), but obviously I am not setting it up right, because this routine is not finding the BTL corresponding to my interconnect.
>>
>> Thanks in advance
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit ageing.
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/05/18975.php
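For what it's worth, the bml/r2 side consumes exactly that bitmap: bit i set means "this BTL can reach proc i", and a BTL whose add_procs() never sets any bits never gets picked. A toy model of the bookkeeping (opal_bitmap_set_bit() is the real call; everything else here is simplified):

#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for opal_bitmap_set_bit(reachable, i). */
static void set_reachable(uint64_t *bm, int i)
{
    bm[i / 64] |= 1ull << (i % 64);
}

int main(void)
{
    uint64_t reachable[1] = { 0 };
    int nprocs = 2;

    for (int i = 0; i < nprocs; i++) {
        /* in a real add_procs(): only after endpoint creation succeeds */
        set_reachable(reachable, i);
    }
    printf("reachable = 0x%llx\n", (unsigned long long)reachable[0]);
    return 0;
}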
Re: [OMPI devel] Process connectivity map
Hello Gilles

Setting -mca mpi_add_procs_cutoff 1024 indeed makes a difference to the output, as follows:

With -mca mpi_add_procs_cutoff 1024:
reachable = 0x1
(Note that add_procs() was called once and the value of 'reachable' is correct.)

Without -mca mpi_add_procs_cutoff 1024:
reachable = 0x0
reachable = NULL
reachable = NULL
(Note that add_procs() was called three times and the value of 'reachable' seems wrong.)

The program does run correctly in either case. The program listing is below (note that I have removed output from the program itself in the above reporting.)

The code that prints 'reachable' is as follows:

if (reachable == NULL)
    printf("reachable = NULL\n");
else
{
    int i;
    printf("reachable = ");
    for (i = 0; i < reachable->array_size; i++)
        printf("\t0x%llx", (unsigned long long) reachable->bitmap[i]);
    printf("\n\n");
}
return OPAL_SUCCESS;

And the code for the test program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
    if (world_rank == 1)
    {
        MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s received %s, rank %d\n", hostname, buf, world_rank);
    }
    else
    {
        strcpy(buf, "haha!");
        MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Sun, May 15, 2016 at 10:49 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> At first glance, that seems a bit odd...
> are you sure you correctly print the reachable bitmap ?
> I would suggest you add some instrumentation to understand what happens
> (e.g., printf before opal_bitmap_set_bit() and other places that prevent this from happening)
>
> one more thing ...
> now, master default behavior is
> mpirun --mca mpi_add_procs_cutoff 0 ...
> you might want to try
> mpirun --mca mpi_add_procs_cutoff 1024 ...
> and see if things make more sense.
> if it helps, and iirc, there is a parameter so a btl can report it does not support cutoff.
>
> Cheers,
>
> Gilles
>
> On Sunday, May 15, 2016, dpchoudh . wrote:
>
>> Hello Gilles
>>
>> Thanks for jumping in to help again. Actually, I had already tried some of your suggestions before asking for help.
>>
>> I have several interconnects that can run both the openib and tcp BTLs. To simplify things, I explicitly specified TCP:
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest
>>
>> where mpitest is a small program that does MPI_Send()/MPI_Recv() on a small string, and then does an MPI_Barrier(). The program does work as expected.
>>
>> I put a printf on the last line of mca_btl_tcp_add_procs() to print the value of 'reachable'. What I saw was that the value was always 0 when it was invoked for Send()/Recv(), and the pointer itself was NULL when invoked for Barrier().
>>
>> Next I looked at pml_ob1_add_procs(), where the call chain starts, and found that it initializes and passes an opal_bitmap_t reachable down the call chain, but the resulting value is not used later in the code (the memory is simply freed later).
>>
>> That, coupled with the fact that I am trying to imitate what the other BTL implementations are doing, yet in mca_bml_r2_endpoint_add_btl() my BTL is not being picked up, left me puzzled. Please note that the interconnect that I am developing for is on a different cluster (than where I ran the above test for the TCP BTL.)
>>
>> Thanks again
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit ageing.
>>
>> On Sun, May 15, 2016 at 10:20 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>>> did you check the add_procs callbacks ?
>>> (e.g. mca_btl_tcp_add_procs() for the tcp btl)
>>> this is where the reachable bitmap is set, and I guess this is what you are looking for.
>>>
>>> keep in mind that if several btl can be used, the one with the higher
Re: [OMPI devel] Process connectivity map
Sorry, I accidentally pressed 'Send' before I was done writing the last mail. What I wanted to ask is: what is the parameter mpi_add_procs_cutoff, and why does adding it make a difference in the code path but not in the end result of the program? How would it help me debug my problem?

Thank you
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Sun, May 15, 2016 at 11:17 AM, dpchoudh . wrote:
> Hello Gilles
>
> Setting -mca mpi_add_procs_cutoff 1024 indeed makes a difference to the output, as follows:
>
> With -mca mpi_add_procs_cutoff 1024:
> reachable = 0x1
> (Note that add_procs() was called once and the value of 'reachable' is correct.)
>
> Without -mca mpi_add_procs_cutoff 1024:
> reachable = 0x0
> reachable = NULL
> reachable = NULL
> (Note that add_procs() was called three times and the value of 'reachable' seems wrong.)
>
> The program does run correctly in either case. The program listing is below (note that I have removed output from the program itself in the above reporting.)
>
> The code that prints 'reachable' is as follows:
>
> if (reachable == NULL)
>     printf("reachable = NULL\n");
> else
> {
>     int i;
>     printf("reachable = ");
>     for (i = 0; i < reachable->array_size; i++)
>         printf("\t0x%llx", (unsigned long long) reachable->bitmap[i]);
>     printf("\n\n");
> }
> return OPAL_SUCCESS;
>
> And the code for the test program is as follows:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>     }
>     else
>     {
>         strcpy(buf, "haha!");
>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> The surgeon general advises you to eat right, exercise regularly and quit ageing.
>
> On Sun, May 15, 2016 at 10:49 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> At first glance, that seems a bit odd...
>> are you sure you correctly print the reachable bitmap ?
>> I would suggest you add some instrumentation to understand what happens
>> (e.g., printf before opal_bitmap_set_bit() and other places that prevent this from happening)
>>
>> one more thing ...
>> now, master default behavior is
>> mpirun --mca mpi_add_procs_cutoff 0 ...
>> you might want to try
>> mpirun --mca mpi_add_procs_cutoff 1024 ...
>> and see if things make more sense.
>> if it helps, and iirc, there is a parameter so a btl can report it does not support cutoff.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, May 15, 2016, dpchoudh . wrote:
>>
>>> Hello Gilles
>>>
>>> Thanks for jumping in to help again. Actually, I had already tried some of your suggestions before asking for help.
>>>
>>> I have several interconnects that can run both the openib and tcp BTLs. To simplify things, I explicitly specified TCP:
>>>
>>> mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest
>>>
>>> where mpitest is a small program that does MPI_Send()/MPI_Recv() on a small string, and then does an MPI_Barrier(). The program does work as expected.
>>>
>>> I put a printf on the last line of mca_btl_tcp_add_procs() to print the value of 'reachable'. What I saw was that the value was always 0 when it was invoked for Send()/Recv(), and the pointer itself was NULL when invoked for Barrier().
>>>
>>> Next I looked at pml_ob1_add_procs(), where the call chain starts, and found that it initializes and passes an opal_bitmap_t reachable down the call chain, but the resulting value is not used later in the code (the memory i
[OMPI devel] modex getting corrupted
Hello all

I have a naive question:

My 'cluster' consists of two nodes, connected back to back with a proprietary link as well as GbE (over a switch). I am calling OPAL_MODEX_SEND() and the modex consists of just this:

struct modex
{
    char name[20];
    unsigned mtu;
};

The mtu field is not currently being used. I bzero() the struct and have verified that the value being written to the 'name' field (this is similar to a PKEY for InfiniBand; the driver will translate this to a unique integer) is correct at the sending end.

When I do an OPAL_MODEX_RECV(), the value is completely corrupted. However, the size of the modex message is still correct (24 bytes).
What could I be doing wrong? (Both nodes are little-endian x86_64 machines.)

Thanks in advance
Durga

We learn from history that we never learn from history.
Re: [OMPI devel] modex getting corrupted
Hello Ralph

Thanks for your input. The routine that does the send is this:

static int btl_lf_modex_send(lfgroup lfgroup)
{
    char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
    btl_lf_modex_t lf_modex;
    int rc;

    strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
    OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                    &mca_btl_lf_component.super.btl_version,
                    (char *)&lf_modex, sizeof(lf_modex));
    return rc;
}

This routine is called from the component init routine (mca_btl_lf_component_init()). I have verified that the values in the modex (lf_modex) are correct.

The receive happens in proc_create, and I call it like this:

OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                &opal_proc->proc_name,
                (uint8_t **)&module_proc->proc_modex, &size);

In here, I get junk values in proc_modex. If I pass a buffer that was malloc()'ed in place of module_proc->proc_modex, I still get bad data.

Thanks again for your help.

Durga

We learn from history that we never learn from history.

On Sat, May 21, 2016 at 8:38 PM, Ralph Castain wrote:
> Please provide the exact code used for both send/recv - you likely have an error in the syntax
>
> On May 20, 2016, at 9:36 PM, dpchoudh . wrote:
>
> Hello all
>
> I have a naive question:
>
> My 'cluster' consists of two nodes, connected back to back with a proprietary link as well as GbE (over a switch).
> I am calling OPAL_MODEX_SEND() and the modex consists of just this:
>
> struct modex
> {
>     char name[20];
>     unsigned mtu;
> };
>
> The mtu field is not currently being used. I bzero() the struct and have verified that the value being written to the 'name' field (this is similar to a PKEY for InfiniBand; the driver will translate this to a unique integer) is correct at the sending end.
>
> When I do an OPAL_MODEX_RECV(), the value is completely corrupted. However, the size of the modex message is still correct (24 bytes).
> What could I be doing wrong? (Both nodes are little-endian x86_64 machines.)
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/05/19012.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/05/19019.php
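One incidental hazard in the snippet above, worth noting when debugging corrupted strings: strncpy() does not NUL-terminate the destination when the source is as long as the limit. A defensive pattern (generic C, not OMPI-specific):

#include <stdio.h>
#include <string.h>

/* Copy at most cap-1 bytes and always terminate, so a maximum-length
   group name cannot leave the field unterminated. */
static void copy_name(char *dst, const char *src, size_t cap)
{
    strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
}

int main(void)
{
    char name[8];
    copy_name(name, "a-very-long-group-name", sizeof(name));
    printf("%s\n", name);   /* prints "a-very-" */
    return 0;
}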
Re: [OMPI devel] modex getting corrupted
Hello Ralph and all

Please ignore this mail. It was indeed a syntax error in my code. Sorry for the noise; I'll be more careful with my homework from now on.

Best regards
Durga

We learn from history that we never learn from history.

On Mon, May 23, 2016 at 2:13 AM, dpchoudh . wrote:
> Hello Ralph
>
> Thanks for your input. The routine that does the send is this:
>
> static int btl_lf_modex_send(lfgroup lfgroup)
> {
>     char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
>     btl_lf_modex_t lf_modex;
>     int rc;
>     strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
>     OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
>                     &mca_btl_lf_component.super.btl_version,
>                     (char *)&lf_modex, sizeof(lf_modex));
>     return rc;
> }
>
> This routine is called from the component init routine
> (mca_btl_lf_component_init()). I have verified that the values in the modex
> (lf_modex) are correct.
>
> The receive happens in proc_create, and I call it like this:
>
> OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
>                 &opal_proc->proc_name, (uint8_t **)&module_proc->proc_modex, &size);
>
> Here I get junk values in proc_modex. If I pass a buffer that was
> malloc()'ed in place of module_proc->proc_modex, I still get bad data.
>
> Thanks again for your help.
>
> Durga
>
> We learn from history that we never learn from history.
>
> On Sat, May 21, 2016 at 8:38 PM, Ralph Castain wrote:
>
>> Please provide the exact code used for both send/recv - you likely have
>> an error in the syntax
>>
>> On May 20, 2016, at 9:36 PM, dpchoudh . wrote:
>>
>> Hello all
>>
>> I have a naive question:
>>
>> My 'cluster' consists of two nodes, connected back to back with a
>> proprietary link as well as GbE (over a switch).
>> I am calling OPAL_MODEX_SEND() and the modex consists of just this:
>>
>> struct modex
>> { char name[20]; unsigned mtu; };
>>
>> The mtu field is not currently being used. I bzero() the struct and have
>> verified that the value being written to the 'name' field (this is similar
>> to a PKEY for InfiniBand; the driver will translate it to a unique
>> integer) is correct at the sending end.
>>
>> When I do an OPAL_MODEX_RECV(), the value is completely corrupted,
>> although the size of the modex message is still correct (24 bytes).
>> What could I be doing wrong? (Both nodes are little-endian x86_64
>> machines.)
>>
>> Thanks in advance
>> Durga
>>
>> We learn from history that we never learn from history.
[OMPI devel] mpirun fails with the latest git pull
Hello all

With a git pull from roughly 4 PM EDT (US), which had a .m4 file (something to do with MXM) in the change set, mpirun does not work any more. The failure is like this:

[durga@b-2 ~]$ sudo /usr/local/bin/mpirun --allow-run-as-root -np 2 -hostfile ~/hostfile -mca btl lf,self -mca btl_base_verbose 100 ./mpitest
[b-2:1] [[2440,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 619
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Doing a make clean && make && sudo make install did not help. Should I re-run the autogen.pl script and start from scratch?

Thanks
Durga

We learn from history that we never learn from history.
[OMPI devel] Porting the underlying fabric interface
Hi developers

I am trying to add support for a new (proprietary) RDMA-capable fabric to Open MPI and have the following question: As I understand it, some networks are supported as PML components and some as BTL components; there is even some overlap, as Myrinet seems to exist in both. My question is: what is the difference between these two frameworks? When adding support for a new fabric, what factors should one consider when choosing one type of framework over the other?

And, with apologies for asking a summary question: is there any kind of documentation and/or book that explains the internal details of the implementation (which looks a little like voodoo to a newcomer like me)?

Thanks for your help.

Durga Choudhury

Life is complex. It has real and imaginary parts.
Re: [OMPI devel] [OMPI users] Adding a new BTL
Hello Jeff and other developers:

Attached are five files:
1-2: Full output from autogen.pl and configure, captured with: ./<command> 2>&1 | tee <command>.log
3. Makefile.am of the specific BTL directory
4. configure.m4 of the same directory
5. config.log, as generated internally by autotools

Thank you
Durga

Life is complex. It has real and imaginary parts.

On Thu, Feb 25, 2016 at 5:15 PM, Jeff Squyres (jsquyres) wrote:
> Can you send the full output from autogen and configure?
>
> Also, this is probably better suited for the devel list, since we're
> talking about OMPI internals.
>
> Sent from my phone. No type good.
>
> On Feb 25, 2016, at 2:06 PM, dpchoudh . wrote:
>
> Hello Gilles
>
> Thank you very much for your advice. Yes, I copied the templates from the
> master branch to the 1.10.2 release, since the release does not have them.
> And yes, changing the Makefile.am as you suggest did make the autogen error
> go away.
>
> However, in the master branch, the autotools seem to be ignoring the new
> btl directory altogether; i.e. I do not get a Makefile.in from the
> Makefile.am.
>
> In the 1.10.2 release, an identical sequence of steps does create a
> Makefile.in from Makefile.am (via autogen) and a Makefile from Makefile.in
> (via configure), but of course the new BTL does not build because the
> include paths in master and 1.10.2 are different.
>
> My Makefile.am and configure.m4 are as follows. Any thoughts on what it
> would take in the master branch to hook up the new BTL directory?
>
> --- opal/mca/btl/lf/configure.m4 ---
> AC_DEFUN([MCA_opal_btl_lf_CONFIG],[
>     AC_CONFIG_FILES([opal/mca/btl/lf/Makefile])
> ])dnl
>
> --- opal/mca/btl/lf/Makefile.am ---
> amca_paramdir = $(AMCA_PARAM_SETS_DIR)
> dist_amca_param_DATA = netpipe-btl-lf.txt
>
> sources = \
>     btl_lf.c \
>     btl_lf.h \
>     btl_lf_component.c \
>     btl_lf_endpoint.c \
>     btl_lf_endpoint.h \
>     btl_lf_frag.c \
>     btl_lf_frag.h \
>     btl_lf_proc.c \
>     btl_lf_proc.h
>
> # Make the output library in this directory, and name it either
> # mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
> # (for static builds).
>
> if MCA_BUILD_opal_btl_lf_DSO
> lib =
> lib_sources =
> component = mca_btl_lf.la
> component_sources = $(sources)
> else
> lib = libmca_btl_lf.la
> lib_sources = $(sources)
> component =
> component_sources =
> endif
>
> mcacomponentdir = $(opallibdir)
> mcacomponent_LTLIBRARIES = $(component)
> mca_btl_lf_la_SOURCES = $(component_sources)
> mca_btl_lf_la_LDFLAGS = -module -avoid-version
>
> noinst_LTLIBRARIES = $(lib)
> libmca_btl_lf_la_SOURCES = $(lib_sources)
> libmca_btl_lf_la_LDFLAGS = -module -avoid-version
>
> Life is complex. It has real and imaginary parts.
>
> On Thu, Feb 25, 2016 at 3:10 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Did you copy the template from the master branch into the v1.10 branch?
>> If so, replacing MCA_BUILD_opal_btl_lf_DSO with
>> MCA_BUILD_ompi_btl_lf_DSO will likely solve your issue.
>> You do need a configure.m4 (otherwise your btl will not be built), but
>> you do not need AC_MSG_FAILURE.
>>
>> As far as I am concerned, I would develop in the master branch, and
>> then back-port it into the v1.10 branch when it is ready.
>>
>> FWIW, btl used to be in ompi/mca/btl (still the case in v1.10) and
>> has been moved into opal/mca/btl since v2.x, so it is quite common
>> that a bit of porting is required; most of the time it consists of
>> replacing OMPI-like macros with OPAL-like macros.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Feb 25, 2016 at 3:54 PM, dpchoudh . wrote:
>> > Hello all
>> >
>> > I am not sure whether this question belongs on the user list or the
>> > developer list, but because it is a simpler question I am trying the
>> > user list first.
>> >
>> > I am trying to add a new BTL for a proprietary transport.
>> >
>> > As step #0, I copied the BTL template, renamed 'template' to
>> > something else, and ran autogen.sh at the top-level directory (of
>> > Open MPI 1.10.2). The Makefile.am is identical to what is provided in
>> > the template, except that every 'template' has been substituted with
>> > 'lf', the name of the fabric.
>> >
>> > With that, I get the following error:
>> >
>> > autoreconf: running: /usr/bin/autoconf --include=config --