Re: [OMPI devel] [OMPI users] Adding a new BTL

2016-02-25 Thread dpchoudh .
Hello Gilles

Thank you; the build is successful now.

I do have a generic, unrelated question, though:

I really appreciate how the principles of object-oriented design have been
used throughout the OMPI architecture and implemented in a language that is
not object oriented. It is a textbook example of software engineering.

However, I see that functions that are only referenced indirectly, via a
structure of function pointers, have been declared 'extern' in header
files. This, to me, goes against the principle of information hiding.
Making these 'static' and removing them from the headers does not break
the build, of course, and does not even generate warnings. As an example,
the following functions need not be externally visible, except, perhaps,
for debugging:

mca_btl_template_module_t mca_btl_template_module = {
    .super = {
        .btl_component = &mca_btl_template_component.super,
        .btl_add_procs = mca_btl_template_add_procs,
        .btl_del_procs = mca_btl_template_del_procs,
        .btl_register = mca_btl_template_register,
        .btl_finalize = mca_btl_template_finalize,
        .btl_alloc = mca_btl_template_alloc,
        .btl_free = mca_btl_template_free,
        .btl_prepare_src = mca_btl_template_prepare_src,
        .btl_send = mca_btl_template_send,
        .btl_put = mca_btl_template_put,
        .btl_get = mca_btl_template_get,
        .btl_register_mem = mca_btl_template_register_mem,
        .btl_deregister_mem = mca_btl_template_deregister_mem,
        .btl_ft_event = mca_btl_template_ft_event
    }
};

Is there any reason why it is done this way? If I made them 'static' in my
own BTL code, would I get into trouble down the road?
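
(For concreteness, here is a self-contained sketch of the 'static'
alternative I mean; this is illustrative C with made-up names, not actual
OMPI code:)

#include <stdio.h>

/* A generic ops table, mirroring how OMPI wires up BTL entry points. */
typedef struct {
    int (*send)(const char *buf, int len);
    int (*finalize)(void);
} example_ops_t;

/* File-local implementations: they are referenced only through the struct
 * below, so they need no 'extern' declarations in any header. */
static int example_send(const char *buf, int len)
{
    printf("sending %d bytes: %s\n", len, buf);
    return 0;
}

static int example_finalize(void)
{
    return 0;
}

/* Only this symbol needs external linkage. */
example_ops_t example_module = {
    .send = example_send,
    .finalize = example_finalize,
};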

Thanks
Durga



Life is complex. It has real and imaginary parts.

On Thu, Feb 25, 2016 at 7:02 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> on master/v2.x, you also have to
>
> rm -f opal/mca/btl/lf/.opal_ignore
>
> (and this file would have been .ompi_ignore on v1.10)
>
> Cheers,
>
> Gilles
>
> On Fri, Feb 26, 2016 at 7:44 AM, dpchoudh .  wrote:
> > Hello Jeff and other developers:
> >
> > Attached are five files:
> > 1-2: Full output from autogen.pl and configure, captured with:
> > ./<script> 2>&1 | tee <script>.log
> > 3. Makefile.am of the specific BTL directory
> > 4. configure.m4 of the same directory
> > 5. config.log, as generated internally by autotools
> >
> > Thank you
> > Durga
> >
> >
> > Life is complex. It has real and imaginary parts.
> >
> > On Thu, Feb 25, 2016 at 5:15 PM, Jeff Squyres (jsquyres)
> >  wrote:
> >>
> >> Can you send the full output from autogen and configure?
> >>
> >> Also, this is probably better suited for the Devel list, since we're
> >> talking about OMPI internals.
> >>
> >> Sent from my phone. No type good.
> >>
> >> On Feb 25, 2016, at 2:06 PM, dpchoudh .  wrote:
> >>
> >> Hello Gilles
> >>
> >> Thank you very much for your advice. Yes, I copied the templates from
> the
> >> master branch to the 1.10.2 release, since the release does not have
> them.
> >> And yes, changing the Makefile.am as you suggest did make the autogen
> error
> >> go away.
> >>
> >> However, in the master branch, the autotools seem to be ignoring the new
> >> btl directory altogether; i.e. I do not get a Makefile.in from the
> >> Makefile.am.
> >>
> >> In the 1.10.2 release, doing an identical sequence of steps does create a
> >> Makefile.in from Makefile.am (via autogen) and a Makefile from
> Makefile.in
> >> (via configure), but of course, the new BTL does not build because the
> >> include paths in master and 1.10.2 are different.
> >>
> >> My Makefile.am and configure.m4 are as follows. Any thoughts on what it
> >> would take in the master branch to hook up the new BTL directory?
> >>
> >> opal/mca/btl/lf/configure.m4
> >> # 
> >> AC_DEFUN([MCA_opal_btl_lf_CONFIG],[
> >> AC_CONFIG_FILES([opal/mca/btl/lf/Makefile])
> >> ])dnl
> >>
> >> opal/mca/btl/lf/Makefile.am---
> >> amca_paramdir = $(AMCA_PARAM_SETS_DIR)
> >> dist_amca_param_DATA = netpipe-btl-lf.txt
> >>
> >> sources = \
> >> btl_lf.c \
> >> btl_lf.h \
> >> btl_lf_component.c \
> >> btl_lf_endpoint.c \
> >> btl_lf_endpoint.h \
> >> btl_lf_frag.c \
> >> btl_lf_frag.h \
> >> btl_lf_proc.c \
> >> btl

[OMPI devel] component progress function optional?

2016-03-01 Thread dpchoudh .
Hello all

(As you might know,) I am working on implementing a new BTL for a
proprietary fabric and, taking the path of least effort, copying and
pasting code from various existing BTLs as appropriate for our hardware.
My question is: is there any guidance on which of the functions must be
implemented and which are optional (i.e., depend on the underlying
hardware)?

As a specific example, I see that mca_btl_tcp_component_progress() is never
implemented although similar functions in other BTLs are.
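
(To make the question concrete, here is a minimal sketch of what I
understand a component progress function to be; the completion-queue type
and poll routine are made-up stand-ins for a real fabric's API, not
anything in OMPI:)

#include <stddef.h>

/* Stand-ins for a polled fabric's completion queue; illustration only. */
typedef struct {
    void (*cbfunc)(void *ctx);   /* completion callback */
    void *ctx;
} fake_completion_t;

static int fake_poll_cq(fake_completion_t *out)
{
    (void) out;
    return 0;   /* stub: pretend nothing has completed */
}

/* A progress function is only needed when the transport must be polled;
 * it reaps finished work, fires the associated callbacks, and returns how
 * many events it handled (0 when nothing completed). */
static int example_component_progress(void)
{
    int count = 0;
    fake_completion_t comp;

    while (fake_poll_cq(&comp)) {
        comp.cbfunc(comp.ctx);
        ++count;
    }
    return count;
}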

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.


[OMPI devel] Network atomic operations

2016-03-03 Thread dpchoudh .
Hello all

Here is a 101 level question:

OpenMPI supports many transports, out of the box, and can be extended to
support those which it does not. Some of these transports, such as
infiniband, provide hardware atomic operations on remote memory, whereas
others, such as iWARP, do not.

My question is: how (and where in the code base) does OpenMPI use this
feature on hardware that supports it? What is the penalty, in terms of
additional code, runtime performance, and all other considerations, on
hardware that does not support it?
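
(To make the "penalty" part concrete, here is a hedged sketch of how a
fetch-and-add could be emulated on top of a bare compare-and-swap; the
remote_cas() stand-in is imaginary, but the retry loop illustrates the
extra round trips a fabric without native atomics would pay:)

#include <stdint.h>
#include <stdbool.h>

/* Stand-in for a one-sided remote compare-and-swap; a real fabric would
 * do this over the wire.  Here it is just the local GCC builtin. */
static bool remote_cas(volatile int64_t *addr, int64_t expected,
                       int64_t desired)
{
    return __sync_bool_compare_and_swap(addr, expected, desired);
}

/* Fetch-and-add emulated with a CAS retry loop: every failed attempt is
 * another remote round trip, which is the runtime cost on hardware
 * without a native fetch-and-add. */
static int64_t emulated_fetch_add(volatile int64_t *addr, int64_t delta)
{
    int64_t old;
    do {
        old = *addr;   /* on a real fabric, a remote read */
    } while (!remote_cas(addr, old, old + delta));
    return old;
}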

Thanks in advance.
Durga

Life is complex. It has real and imaginary parts.


Re: [OMPI devel] Network atomic operations

2016-03-04 Thread dpchoudh .
Hello Nathan, Mike and all

Thank you for your responses. Let me rephrase them to make sure I
understood them correctly, and please correct me if I didn't:

1. Atomics are (have been) used in OSHMEM in the current (v1) release
2. They are (will be) used in the MPI RMA in v2 release, which has not
happened yet

I am sorry if I sound like I am nitpicking, but the reason I ask is that I
am trying to implement a new BTL that I am supposed to demo on a customer's
existing OMPI code base (which is obviously v1), but I am doing the
development out of the master branch (to avoid porting later), so I am in a
bit of a spaghetti situation.

Thank you
Durga

Life is complex. It has real and imaginary parts.

On Fri, Mar 4, 2016 at 11:06 AM, Nathan Hjelm  wrote:

>
> On Thu, Mar 03, 2016 at 05:26:45PM -0500, dpchoudh . wrote:
> >Hello all
> >
> >Here is a 101 level question:
> >
> >OpenMPI supports many transports, out of the box, and can be extended
> to
> >support those which it does not. Some of these transports, such as
> >infiniband, provide hardware atomic operations on remote memory,
> whereas
> >others, such as iWARP, do not.
> >
> >My question is: how (and where in the code base) does openMPI use this
> >feature, on those hardware that support it? What is the penalty, in
> terms
> >of additional code, runtime performance and all other considerations,
> on a
> >hardware that does not support it?
>
> Network atomics are used for oshmem (see Mike's email) and MPI RMA. For
> RMA they are exposed through the BTL 3.0 interface on the v2.x branch
> and master. So far we have only really implemented compare-and-swap,
> atomic add, and atomic fetch-and-add. Compare-and-swap and fetch-and-add
> are required by our optimized RMA component (ompi/mca/osc/rdma).
>
> -Nathan
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18688.php
>


[OMPI devel] Thread safety in the BTL layer

2016-03-06 Thread dpchoudh .
Hello all

Sorry for asking too many 101 questions; hopefully someone won't mind
answering.

It looks like, as of the current release, some BTLs (e.g. openib) are not
thread safe, and the code explicitly bails out if it finds that MPI_Init()
was called with THREAD_MULTIPLE. Then there are some BTLs, such as TCP,
that can handle THREAD_MULTIPLE. Here are the questions:

1. There must be global (shared) variables that the BTL layer is accessing,
which is giving rise to the thread-safety issues. Is there a list of such
variables, the code paths in which they are accessed, and/or any
documentation on them (including any past mailing list posts)?

2. Browsing through the mailing list (I have been a subscriber to the
*user* list for quite a while), it looks like a lot of people have stumbled
onto the issue that the openib BTL is not thread safe. Given that it is,
I'd presume, the most popular BTL, since infiniband-like fabrics hold the
lion's share of the HPC interconnect market, it must be quite difficult to
make it thread safe. Any comments on the level of work it would take to
make sure a new BTL is thread safe? Something along the lines of a
'do-this' or 'don't-do-that' would be greatly appreciated (see the sketch
after question 3 for the kind of hazard I have in mind).

3. It looks like the openib BTL bailing out if called with THREAD_MULTIPLE
has been removed in the master branch (at least from a cursory look). Does
that mean that the openib BTL is now thread safe, or is it that the check
has simply been moved to another location?
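
(The sketch I promised under question 2: the kind of shared state I am
worried about, protected the naive way. Raw pthreads is used purely for
illustration; OMPI itself has its own opal_mutex_t / OPAL_THREAD_LOCK
wrappers, if I read the code right:)

#include <pthread.h>
#include <stddef.h>

/* A typical shared-state hazard: a free list of fragments popped by
 * every thread that sends. */
typedef struct frag {
    struct frag *next;
} frag_t;

static frag_t         *free_list;
static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

static frag_t *frag_alloc(void)
{
    frag_t *f;

    pthread_mutex_lock(&free_list_lock);   /* an unguarded pop would race */
    f = free_list;
    if (NULL != f) {
        free_list = f->next;
    }
    pthread_mutex_unlock(&free_list_lock);
    return f;
}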

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.


[OMPI devel] How to 'hook' a new BTL to OMPI call chain?

2016-03-16 Thread dpchoudh .
Hello all

Sorry about asking too many 101 level questions, but here is another one:

I have BTL layer code, called 'lf', that is ready for unit testing; it
compiles with the OMPI tool chain (by doing a ./configure; make from the
top-level directory) and has the basic data transport calls implemented.
How do I 'hook up' the BTL to the OMPI call chain?

If I do the following:
mpirun -np 2 --hostfile ~/hostfile -mca btl lf,self ./NPmpi

it fails to run, and the gist of the failure is that it does not even
attempt to connect with the 'lf' BTL (the error says: 'BTLs attempted:
self').

The 'lf' NIC, an RDMA-capable card, also has a TCP/IP interface, and
replacing 'lf' with 'tcp' in the above command *does* work.

Thanks in advance
Durga

Life is complex. It has real and imaginary parts.


Re: [OMPI devel] How to 'hook' a new BTL to OMPI call chain?

2016-03-16 Thread dpchoudh .
Hi all

Anyone willing to help?  :-)

I now have a follow-up question:
I was trying to figure this out myself by taking backtraces from the BTLs
that do work, and found that, since most of the internal functions are not
exported, the backtraces contain just addresses, which are not all that
helpful (this is even after building with --enable-debug). This is going
back to a question that I myself asked recently, and I am now finding out
the answer the hard way!

Is there any way that all the internal functions not explicitly declared
'static' can be made visible?

Thanks
Durga

Life is complex. It has real and imaginary parts.

On Wed, Mar 16, 2016 at 12:52 PM, dpchoudh .  wrote:

> Hello all
>
> Sorry about asking too many 101 level question, but here is another one:
>
> I have a BTL layer code, called 'lf' that is ready for unit testing; it
> compiles with the OMPI tool chain (by doing a ./configure; make from the
> top level directory) and have the basic data transport calls implemented.
>
> How do I 'hook up' the BTL to the OMPI call chain?
>
> If I do the following:
> mpirun -np 2 --hostfile ~/hostfile -mca btl lf,self ./NPmpi
>
> it fails to run and the gist of the failure is that it does not even
> attempt connecting with the 'lf' BTL (the error says: 'BTLs attempted:
> self')
>
> The 'lf' NIC, an RDMA-capable card, also has a TCP/IP interface, and
> replacing 'lf' with 'tcp' in the above command *does* work.
>
> Thanks in advance
> Durga
>
> Life is complex. It has real and imaginary parts.
>


[OMPI devel] mca_btl_<transport>_prepare_dst

2016-03-18 Thread dpchoudh .
Hello developers

It looks like in the trunk, the routine mca_btl_<transport>_prepare_dst is
no longer implemented, at least in the TCP and openib BTLs. Up until
1.10.2, it does exist.

Is it a new MPI-3 related thing? What is the reason behind this?

Thanks
Durga

Life is complex. It has real and imaginary parts.


[OMPI devel] IP address to verb interface mapping

2016-04-07 Thread dpchoudh .
Hello all

(Newbie warning! Sorry :-(  )

Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for
control traffic and some kind of raw verbs (or anything else such as SRIO)
interface for data transfer. Let's say my host file chooses 4 out of the 7
nodes for an MPI job, based on the IP addresses, which are assigned to the
Ethernet interfaces.

My question is: where in the code is this mapping between the IP address
and whatever interface is used for MPI_Send/Recv determined, such that
only those chosen nodes receive traffic over the verbs interface?

Thanks in advance
Durga

We learn from history that we never learn from history.


Re: [OMPI devel] IP address to verb interface mapping

2016-04-08 Thread dpchoudh .
Hi Gilles

Thanks for responding quickly; however, I am afraid I did not explain my
question clearly enough; my apologies.

What I am trying to understand is this:

My cluster has (say) 7 nodes. I use IP-over-Ethernet for Orted (for job
launch and control traffic); this is not used for MPI messaging. Let's say
that the IP addresses are 192.168.1.2-192.168.1.9. They are all in the same
IP subnet.

The MPI messaging uses some other interconnect, such as Infiniband. All 7
nodes are connected to the same Infiniband switch and hence are in the
same (infiniband) subnet as well.

In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.7

My question is, how does OpenMPI pick the 4 Infiniband interfaces that
match the IP addresses? Put another way, the ranks of the launched jobs
are (I presume) set up by orted by some mechanism. When I do an MPI_Send()
to a given rank, the message goes to the Infiniband interface with a
particular LID. How does this IP-to-Infiniband-LID mapping happen?

Thanks
Durga

We learn from history that we never learn from history.

On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet 
wrote:

> Hi,
>
> the hostnames (or their IPs) are only used to ssh orted.
>
>
> if you use only the tcp btl :
>
> TCP *MPI* communications (vs OOB management communications) are handled by
> btl/tcp
> by default, all usable interfaces are used, then messages are split (iirc,
> by ob1 pml) and then "fragments"
> are sent using all interfaces.
>
> each interface has a latency and bandwidth that is used to split message
> into fragments.
> (assuming it is correctly configured, 90% of a large message is sent over
> the 10GbE interface, and 10% is sent over the GbE interface)
>
> you can explicitly list/blacklist interfaces:
> mpirun --mca btl_tcp_if_include ...
> or
> mpirun --mca btl_tcp_if_exclude ...
>
> (see ompi_info --all for the syntax)
>
>
> but if you use several btls (for example tcp and openib), the btl(s) with
> the lower exclusivity are not used.
> (for example, a large message is *not* split and sent using native ib,
> IPoIB and GbE because the openib btl
> has a higher exclusivity than the tcp btl)
>
>
> did this answer your question ?
>
> Cheers,
>
> Gilles
>
>
>
> On 4/8/2016 12:24 PM, dpchoudh . wrote:
>
> Hello all
>
> (Newbie warning! Sorry :-(  )
>
> Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for
> control traffic and some kind of raw verbs (or anything else such as SRIO)
> interface for data transfer. Let's say my host file chooses 4 out of the 7
> nodes for an MPI job, based on the IP address, which are assigned to the
> Ethernet interfaces.
>
> My question is: where in the code does this mapping between
> IP-to-whatever_interface_is_used_for_MPI_Send/Recv is determined, such as
> only those chosen nodes receive traffic over the verbs interface?
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.
>
>
> ___
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18746.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18747.php
>
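
(To make sure I follow the bandwidth-weighted split Gilles describes
above, a toy calculation; the numbers are made up and this is not ob1's
actual code:)

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    const double bw[] = { 10000.0, 1000.0 };   /* 10GbE and GbE, in Mbps */
    const size_t msg_size = 1000000;           /* a 1 MB message */
    double total = 0.0;
    size_t i;

    for (i = 0; i < 2; i++) {
        total += bw[i];
    }
    for (i = 0; i < 2; i++) {
        /* interface 0 gets ~90%, interface 1 gets ~10% */
        printf("interface %zu gets %zu bytes\n",
               i, (size_t)(msg_size * bw[i] / total));
    }
    return 0;
}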


Re: [OMPI devel] IP address to verb interface mapping

2016-04-08 Thread dpchoudh .
Thank you very much, Gilles. That is exactly the information I was looking
for.

Best regards
Durga

We learn from history that we never learn from history.

On Fri, Apr 8, 2016 at 12:52 AM, Gilles Gouaillardet 
wrote:

> At init time, each task invokes btl_openib_component_init(), which invokes
> btl_openib_modex_send():
> basically, it collects infiniband info (port, subnet, lid, ...) and
> "pushes" it to orted via the modex mechanism.
>
> When a communication is created, the remote information is retrieved via
> the modex mechanism in mca_btl_openib_proc_get_locked()
>
> Cheers,
>
> Gilles
>
>
> On 4/8/2016 1:30 PM, dpchoudh . wrote:
>
> Hi Gilles
>
> Thanks for responding quickly; however, I am afraid I did not explain my
> question clearly enough; my apologies.
>
> What I am trying to understand is this:
>
> My cluster has (say) 7 nodes. I use IP-over-Ethernet for Orted (for job
> launch and control traffic); this is not used for MPI messaging. Let's say
> that the IP addresses are 192.168.1.2-192.168.1.9. They are all in the same
> IP subnet.
>
> The MPI messaging is used using some other interconnects, such as
> Infiniband. All 7 nodes are connected to the same Infiniband switch and
> hence are in the same (infiniband) subnet as well.
>
> In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.7
>
> My question is, how does OpenMPI pick the 4 Infiniband interfaces that
> matches the IP addresses? Put another way, the ranks of each launched jobs
> are (I presume) setup by orted by some mechanism. When I do an MPI_Send()
> to a given rank, the message goes to the Infiniband interface with a
> particular LID. How does this IP-to-Infiniband LID mapping happen?
>
> Thanks
> Durga
>
> We learn from history that we never learn from history.
>
> On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet 
> wrote:
>
>> Hi,
>>
>> the hostnames (or their IPs) are only used to ssh orted.
>>
>>
>> if you use only the tcp btl :
>>
>> TCP *MPI* communications (vs OOB management communications) are handled
>> by btl/tcp
>> by default, all usable interfaces are used, then messages are split
>> (iirc, by ob1 pml) and then "fragments"
>> are sent using all interfaces.
>>
>> each interface has a latency and bandwidth that is used to split message
>> into fragments.
>> (assuming it is correctly configured, 90% of a large message is sent over
>> the 10GbE interface, and 10% is sent over the GbE interface)
>>
>> you can explicitly list/blacklist interfaces:
>> mpirun --mca btl_tcp_if_include ...
>> or
>> mpirun --mca btl_tcp_if_exclude ...
>>
>> (see ompi_info --all for the syntax)
>>
>>
>> but if you use several btls (for example tcp and openib), the btl(s) with
>> the lower exclusivity are not used.
>> (for example, a large message is *not* split and sent using native ib,
>> IPoIB and GbE because the openib btl
>> has a higher exclusivity than the tcp btl)
>>
>>
>> did this answer your question ?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>>
>> On 4/8/2016 12:24 PM, dpchoudh . wrote:
>>
>> Hello all
>>
>> (Newbie warning! Sorry :-(  )
>>
>> Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for
>> control traffic and some kind of raw verbs (or anything else such as SRIO)
>> interface for data transfer. Let's say my host file chooses 4 out of the 7
>> nodes for an MPI job, based on the IP address, which are assigned to the
>> Ethernet interfaces.
>>
>> My question is: where in the code does this mapping between
>> IP-to-whatever_interface_is_used_for_MPI_Send/Recv is determined, such as
>> only those chosen nodes receive traffic over the verbs interface?
>>
>> Thanks in advance
>> Durga
>>
>> We learn from history that we never learn from history.
>>
>>
>> ___
>> devel mailing listde...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2016/04/18746.php
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/04/18747.php
>>
>
>
>
> ___
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18748.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18749.php
>


Re: [OMPI devel] Common symbol warnings in tarballs (was: make install warns about 'common symbols')

2016-04-20 Thread dpchoudh .
Dear all

Just to clarify, I was doing a build (after adding code to support a new
transport) from code pulled from git (a 'git clone') when I came across
this warning, so I suppose this would be a 'developer build'.

I know I am not a real MPI developer (I am doing OMPI internal development
for the second time in my whole career), but if my vote counts, I'd vote
for leaving the warning in. It, in my opinion, encourages good coding
practice, which should matter to everyone, not just 'core developers'.
However, I agree that the phrasing of the warning is confusing, and adding
a URL there to an appropriate page should be enough to prevent future
questions like this in the support forum.

Thanks
Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Wed, Apr 20, 2016 at 1:41 PM, Ralph Castain  wrote:

>
> > On Apr 20, 2016, at 10:24 AM, Dave Goodell (dgoodell) <
> dgood...@cisco.com> wrote:
> >
> > On Apr 20, 2016, at 9:14 AM, Jeff Squyres (jsquyres) 
> wrote:
> >>
> >> I was under the impression that this warning script only ran for
> developer builds.  But it looks like it's unconditionally run at the end of
> "make install" (on master only -- so far).
> >>
> >> Should we make this only run for developer builds?  (e.g., check for
> $srcdir/.git, or somesuch)  I think it's our goal to have zero common
> symbols, but that may not always be the case, and we don't want this
> potentially alarming warning showing up for users, right?
> >
> > IMO, this is basically just another warning flag.  If you enable most
> compiler warnings for non-developer builds, I don't see why you would go
> out of your way to disable this particular one.  You could always tweak the
> output to point to a wiki page that explains what the warning means, so
> concerned users can hopefully be assuaged.
>
> I guess this is where I differ. I see no benefit in warning a user about
> something they cannot control and that has no impact on them. These
> warnings were intended solely for developers as a reminder/suggestion that
> they follow a specific policy regarding common variables. Thus, they convey
> nothing of interest or use to a user.
>
> So I fail to see why we should include this warning in a non-developer
> build. As for other warnings, we have a stated policy (and proactive
> effort) to always stamp them out, so I don’t think the user is actually
> seeing many (or any) of them. Remember, we turn off pedantic and other
> levels when doing non-developer builds.
>
> Seems like this warning falls into the same category to me.
>
> >
> > -Dave
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18794.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18795.php


[OMPI devel] modex receive

2016-04-28 Thread dpchoudh .
Hello all

I have been struggling with this issue for the last few days and thought
it would be prudent to ask for help from people who have way more
experience than I do.

There are two questions, interrelated in my mind, but may not be so in
reality. Question 2 is the issue I am struggling with, and question 1 sort
of leads to it.

1. I see that in both the openib and tcp BTLs (the two kinds of hardware I
have access to) a modex send happens, but a matching modex receive never
happens. Is it because of some kind of optimization? (In my case, both IP
NICs are in the same IP subnet and both IB NICs are in the same IB
subnet.) Or am I not understanding something? How do the processes figure
out their peer information without a modex receive?

The place in the code where the modex receive is called is
btl_add_procs(). However, it looks like in both the above BTLs, this
method is never called. Is that expected?

2. The real question is this:
I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code)
that has no routing capability in the protocol, and hence no concept of
subnets. An HCA simply needs to be plugged in to the switch and it can see
the whole network. However, there is a VLAN-like partitioning (similar to
IB partitions). Given this (and as a first cut, every node is in the same
partition, so even this complexity is eliminated), there is not much use
for a modex exchange, but I added one anyway, just with the partition key.

What I see is that the component open, register and init are all
successful, but the r2 bml still does not choose this network, and thus
OMPI aborts because of lack of full reachability.

This is my command line:
sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2
-mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100
./mpitest

('mpitest' is a trivial 'hello world' program plus ONE
MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required
because currently the driver requires root permission; I was told that this
will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with
back-to-back connection on this 'lf' HCA)

The output of this command is as follows; I have added my comments to
explain it a bit.


[b-2:21062] mca: base: components_register: registering framework bml
components
[b-2:21062] mca: base: components_register: found loaded component r2
[b-2:21062] mca: base: components_register: component r2 register function
successful
[b-2:21062] mca: base: components_open: opening bml components
[b-2:21062] mca: base: components_open: found loaded component r2
[b-2:21062] mca: base: components_open: component r2 open function
successful
[b-2:21062] mca: base: components_register: registering framework btl
components
[b-2:21062] mca: base: components_register: found loaded component self
[b-2:21062] mca: base: components_register: component self register
function successful
[b-2:21062] mca: base: components_register: found loaded component lf
[b-2:21062] mca: base: components_register: component lf register function
successful
[b-2:21062] mca: base: components_open: opening btl components
[b-2:21062] mca: base: components_open: found loaded component self
[b-2:21062] mca: base: components_open: component self open function
successful
[b-2:21062] mca: base: components_open: found loaded component lf


lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)


[b-2:21062] mca: base: components_open: component lf open function
successful
[b-2:21062] select: initializing btl component self
[b-2:21062] select: init of component self returned success
[b-2:21062] select: initializing btl component lf


Created group on b-2


[b-2:21062] select: init of component lf returned success
[b-3:07672] mca: base: components_register: registering framework bml
components
[b-3:07672] mca: base: components_register: found loaded component r2
[b-3:07672] mca: base: components_register: component r2 register function
successful
[b-3:07672] mca: base: components_open: opening bml components
[b-3:07672] mca: base: components_open: found loaded component r2
[b-3:07672] mca: base: components_open: component r2 open function
successful
[b-3:07672] mca: base: components_register: registering framework btl
components
[b-3:07672] mca: base: components_register: found loaded component self
[b-3:07672] mca: base: components_register: component self register
function successful
[b-3:07672] mca: base: components_register: found loaded component lf
[b-3:07672] mca: base: components_register: component lf register function
successful
[b-3:07672] mca: base: components_open: opening btl components
[b-3:07672] mca: base: components_open: found loaded component self
[b-3:07672] mca: base: components_open: component self open function
successful
[b-3:07672] mca: base: components_open: found loaded component lf
[b-3:07672] mca: base: components_open: component lf open function
successful
[b-3:07672] select: initializing btl component self
[b-3:07672] select: init o

[OMPI devel] Why is floating point number used for locality

2016-04-28 Thread dpchoudh .
Hello all

I am wondering about the rationale of using floating point numbers for
calculating 'distances' in the openib BTL. Is it because some distances can
be infinite and there is no (conventional) way to represent infinity using
integers?

Thanks for your comments

Durga


The surgeon general advises you to eat right, exercise regularly and quit
ageing.


Re: [OMPI devel] modex receive

2016-04-28 Thread dpchoudh .
Hello Gilles

You are absolutely right:

1. Adding --mca pml_base_verbose 100 does show that it is the cm PML that
is being picked by default (even for TCP)
2. Adding --mca pml ob1 does cause add_procs() and related BTL friends to
be invoked.


With a command line of

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp  -mca btl_base_verbose
100 -mca pml_base_verbose 100 ./mpitest

The output shows (among many other lines) the following:

[smallMPI:49178] select: init returned priority 30
[smallMPI:49178] select: initializing pml component ob1
[smallMPI:49178] select: init returned priority 20
[smallMPI:49178] select: component v not in the include list
[smallMPI:49178] selected cm best priority 30

[smallMPI:49178] select: component ob1 not selected / finalized
[smallMPI:49178] select: component cm selected

Which shows that the cm PML was selected. Replacing 'tcp' above with
'openib' shows very similar results. (The openib BTL methods are not
invoked, either)

However, I was under the impression that the CM PML can only handle MTLs
(and ob1 can only handle BTLs). So why is cm being selected for TCP?

Thank you
Durga



The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Thu, Apr 28, 2016 at 2:34 AM, Gilles Gouaillardet 
wrote:

> the add_procs subroutine of the btl should be called.
>
> /* i added a printf in mca_btl_tcp_add_procs and it *is* invoked */
>
> can you try again with --mca pml ob1 --mca pml_base_verbose 100 ?
>
> maybe the add_procs subroutine is not invoked because openmpi uses cm
> instead of ob1
>
>
> Cheers,
>
>
> Gilles
>
> On 4/28/2016 3:07 PM, dpchoudh . wrote:
>
> Hello all
>
> I am struggling with this issue for last few days and thought it would be
> prudent to ask for help from people who have way more experience than I do.
>
> There are two questions, interrelated in my mind, but may not be so in
> reality. Question 2 is the issue I am struggling with, and question 1 sort
> of leads to it.
>
> 1. I see that both in openib and tcp BTL (the two kind of hardware I have
> access to) a modex send happens, but a matching modex receive never
> happens. Is it because of some kind of optimization? (In my case, both IP
> NICs are in the same IP subnet and both IB NICs are in the same IB subnet)
> Or am I not understanding something? How do the processes figure out their
> peer information without a modex receive?
>
> The place in code where the modex receive is called is in btl_add_procs().
> However, it looks like in both the above BTLs, this method is never called.
> Is that expected?
>
> 2. The real question is this:
> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code)
> that has no routing capability in protocol, and hence no concept of
> subnets. An HCA simply needs to be plugged in to the switch and it can see
> the whole network. However, there is a VLAN like partition (similar to IB
> partitions)
> Given this (and as a first cut, every node is in the same partition, so
> even this complexity is eliminated), there is not much use for a modex
> exchange, but I added one anyway just with the partition key.
>
> What I see is that the component open, register and init are all
> successful, but r2 bml still does not choose this network and thus OMPI
> aborts because of lack of full reachability.
>
> This is my command line:
> sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2
> -mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100
> ./mpitest
>
> ('mpitest' is a trivial 'hello world' program plus ONE
> MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required
> because currently the driver requires root permission; I was told that this
> will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with
> back-to-back connection on this 'lf' HCA)
>
> The output of this command is as follows; I have added my comments to
> explain it a bit.
>
> 
> [b-2:21062] mca: base: components_register: registering framework bml
> components
> [b-2:21062] mca: base: components_register: found loaded component r2
> [b-2:21062] mca: base: components_register: component r2 register function
> successful
> [b-2:21062] mca: base: components_open: opening bml components
> [b-2:21062] mca: base: components_open: found loaded component r2
> [b-2:21062] mca: base: components_open: component r2 open function
> successful
> [b-2:21062] mca: base: components_register: registering framework btl
> components
> [b-2:21062] mca: base: components_register: found loaded component self
> [b-2:21062] mca: base: components_register: component self register
> function successful
> [b-2:21062] mca: base: component

Re: [OMPI devel] modex receive

2016-04-29 Thread dpchoudh .
Hello Ralph and Gilles

Thanks for the clarification. My understanding was that if a BTL was
specified to mpirun, then only that BTL (and, therefore, the ob1 PML)
would be used. However, I always saw that this is not the case, and now I
know why.

I do have PSM-capable cards (QLogic IB) in my nodes, and this time the
link was up (however, as I reported earlier, this behaviour happens even
with the PSM link down), so obviously the PSM MTL was chosen.


Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Thu, Apr 28, 2016 at 11:41 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> my basic understanding is that ob1 works with btl, and cm works with mtl
> (please someone corrects me if I am wrong)
> an other way to put this is cm cannot use the tcp btl.
>
> so I can only guess one mtl (PSM ?) is available, and so cm is preferred
> over ob1.
>
> what if you
> mpirun --mca mtl ^psm ...
> is cm selected over ob1 ?
>
> note PSM does not disqualify itself if there is no link, and this is
> now being investigated at intel.
>
> Cheers,
>
> Gilles
>
> On Friday, April 29, 2016, dpchoudh .  wrote:
>
>> Hello Gilles
>>
>> You are absolutely right:
>>
>> 1. Adding --mca pml_base_verbose 100 does show that it is the cm PML that
>> is being picked by default (even for TCP)
>> 2. Adding --mca pml ob1 does cause add_procs() and related BTL friends to
>> be invoked.
>>
>>
>> With a command line of
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp  -mca
>> btl_base_verbose 100 -mca pml_base_verbose 100 ./mpitest
>>
>> The output shows (among many other lines) the following:
>>
>> [smallMPI:49178] select: init returned priority 30
>> [smallMPI:49178] select: initializing pml component ob1
>> [smallMPI:49178] select: init returned priority 20
>> [smallMPI:49178] select: component v not in the include list
>> [smallMPI:49178] selected cm best priority 30
>>
>> [smallMPI:49178] select: component ob1 not selected / finalized
>> [smallMPI:49178] select: component cm selected
>>
>> Which shows that the cm PML was selected. Replacing 'tcp' above with
>> 'openib' shows very similar results. (The openib BTL methods are not
>> invoked, either)
>>
>> However, I was under the impression that the CM PML can only handle MTLs
>> (and ob1 can only handle BTLs). So why is cm being selected for TCP?
>>
>> Thank you
>> Durga
>>
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Thu, Apr 28, 2016 at 2:34 AM, Gilles Gouaillardet 
>> wrote:
>>
>>> the add_procs subroutine of the btl should be called.
>>>
>>> /* i added a printf in mca_btl_tcp_add_procs and it *is* invoked */
>>>
>>> can you try again with --mca pml ob1 --mca pml_base_verbose 100 ?
>>>
>>> maybe the add_procs subroutine is not invoked because openmpi uses cm
>>> instead of ob1
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 4/28/2016 3:07 PM, dpchoudh . wrote:
>>>
>>> Hello all
>>>
>>> I am struggling with this issue for last few days and thought it would
>>> be prudent to ask for help from people who have way more experience than I
>>> do.
>>>
>>> There are two questions, interrelated in my mind, but may not be so in
>>> reality. Question 2 is the issue I am struggling with, and question 1 sort
>>> of leads to it.
>>>
>>> 1. I see that both in openib and tcp BTL (the two kind of hardware I
>>> have access to) a modex send happens, but a matching modex receive never
>>> happens. Is it because of some kind of optimization? (In my case, both IP
>>> NICs are in the same IP subnet and both IB NICs are in the same IB subnet)
>>> Or am I not understanding something? How do the processes figure out their
>>> peer information without a modex receive?
>>>
>>> The place in code where the modex receive is called is in
>>> btl_add_procs(). However, it looks like in both the above BTLs, this method
>>> is never called. Is that expected?
>>>
> >>> 2. The real question is this:
>>> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code)
>>> that has no routing capability in protocol, and hence no concept of
>>> subnets. An HCA simply needs to be plugged in to the switch and it can see
>>> the whole network. However, the

[OMPI devel] Question about 'progress function'

2016-05-05 Thread dpchoudh .
Hi all

Apologies for a 101 level question again, but here it is:

A new BTL layer I am implementing hangs in MPI_Send(). Please keep in mind
that at this stage, I am simply desperate to make MPI data move through
this fabric in any way possible, so I have thrown all good programming
practice out of the window and in the process might have added bugs.

The test code basically has a single call to MPI_Send() with 8 bytes of
data, the smallest amount the HCA can DMA. I have a very simple
mca_btl_component_progress() method that returns 0 if called before
mca_btl_endpoint_send() and returns 1 if called after. I use a static
variable to keep track of whether endpoint_send() has been called.

With this, the MPI process hangs with the following stack:

(gdb) bt
#0  0x7f7518c60b7d in poll () from /lib64/libc.so.6
#1  0x7f75183e79f6 in poll_dispatch (base=0x19cf480, tv=0x7f75177efe80)
at poll.c:165
#2  0x7f75183df690 in opal_libevent2022_event_base_loop
(base=0x19cf480, flags=1) at event.c:1630
#3  0x7f75183613d4 in progress_engine (obj=0x19cedd8) at
runtime/opal_progress_threads.c:105
#4  0x7f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
#5  0x7f7518c6b1ad in clone () from /lib64/libc.so.6

I am using code from master branch for this work.

Obviously I am not doing the progress handling right, and I don't even
understand how it should work, as the TCP btl does not even provide a
component progress function.

Any relevant pointer on how this should be done is highly appreciated.

Thanks
Durga


The surgeon general advises you to eat right, exercise regularly and quit
ageing.


Re: [OMPI devel] Question about 'progress function'

2016-05-06 Thread dpchoudh .
George

Thanks for your help. But what should the progress function return, so that
the event is signalled? Right now I am returning a 1 when data has been
transmitted and 0 otherwise, but that does not seem to work. Also, please
keep in mind that the transport I am working on supports unreliable
datagrams only, so there is no ack from the recipient to wait for.
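
(To show what I mean, here is my current, quite possibly wrong,
understanding in code form: the return value is only a count of handled
events; what actually unblocks the upper layers is invoking the
descriptor's completion callback from inside progress. The field names
below follow my reading of mca_btl_base_descriptor_t on master; treat
them as assumptions:)

#include "opal/mca/btl/btl.h"

/* Called from component_progress() for each fragment the hardware has
 * finished sending.  Without the callback, ob1 never learns the send
 * completed, no matter what progress() returns. */
static int example_complete_one(mca_btl_base_module_t *btl,
                                struct mca_btl_base_endpoint_t *ep,
                                mca_btl_base_descriptor_t *des)
{
    if (NULL != des->des_cbfunc) {
        des->des_cbfunc(btl, ep, des, OPAL_SUCCESS);  /* signal the PML */
    }
    return 1;   /* one event handled; sum these and return from progress */
}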

Thanks again
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Thu, May 5, 2016 at 11:33 PM, George Bosilca  wrote:

> Durga,
>
> TCP doesn't need a specialized progress function because we are tied
> directly with libevent. In your case you should provide a BTL progress
> function, a function that will be called regularly at the end of the
> libevent base loop.
>
>   George.
>
>
> On Thu, May 5, 2016 at 11:30 PM, dpchoudh .  wrote:
>
>> Hi all
>>
>> Apologies for a 101 level question again, but here it is:
>>
>> A new BTL layer I am implementing hangs in MPI_Send(). Please keep in
>> mind that at this stage, I am simply desperate to make MPI data move
>> through this fabric in any way possible, so I have thrown all good
>> programming practice out of the window and in the process might have added
>> bugs.
>>
>> The test code basically has a single call to MPI_Send() with 8 bytes of
>> data, the smallest amount the HCA can DMA. I have a very simple
>> mca_btl_component_progress() method that returns 0 if called before
>> mca_btl_endpoint_send() and returns 1 if called after. I use a static
>> variable to keep track of whether endpoint_send() has been called.
>>
>> With this, the MPI process hangs with the following stack:
>>
>> (gdb) bt
>> #0  0x7f7518c60b7d in poll () from /lib64/libc.so.6
>> #1  0x7f75183e79f6 in poll_dispatch (base=0x19cf480,
>> tv=0x7f75177efe80) at poll.c:165
>> #2  0x7f75183df690 in opal_libevent2022_event_base_loop
>> (base=0x19cf480, flags=1) at event.c:1630
>> #3  0x7f75183613d4 in progress_engine (obj=0x19cedd8) at
>> runtime/opal_progress_threads.c:105
>> #4  0x7f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
>> #5  0x7f7518c6b1ad in clone () from /lib64/libc.so.6
>>
>> I am using code from master branch for this work.
>>
>> Obviously I am not doing the progress handling right, and I don't even
>> understand how it should work, as the TCP btl does not even provide a
>> component progress function.
>>
>> Any relevant pointer on how this should be done is highly appreciated.
>>
>> Thanks
>> Durga
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/05/18919.php
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/05/18920.php
>


[OMPI devel] Process connectivity map

2016-05-14 Thread dpchoudh .
Hello all

I have been struggling with this issue for a while and figured it might be
a good idea to ask for help.

Where (in the code path) is the connectivity map created?

I can see that it is *used* in mca_bml_r2_endpoint_add_btl(), but obviously
I am not setting it up right, because this routine is not finding the BTL
corresponding to my interconnect.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.


[OMPI devel] Misleading error messages?

2016-05-14 Thread dpchoudh .
In the file ompi/mca/bml/r2/bml_r2.c, it seems like the function name is
incorrect in some error messages (seems like a case of unchecked
copy-paste) in:

1. Function mca_bml_r2_allocate_endpoint(), line 154
2. Function mca_bml_r2_endpoint_add_btl(), lines 200 and 206

This is on master branch.

Thanks
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.


Re: [OMPI devel] Process connectivity map

2016-05-15 Thread dpchoudh .
Hello Gilles

Thanks for jumping in to help again. Actually, I had already tried some of
your suggestions before asking for help.

I have several interconnects that can run both openib and tcp BTL. To
simplify things, I explicitly mentioned TCP:

mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest

where mpitest is a small program that does MPI_Send()/MPI_Recv() on a small
string, and then does an MPI_Barrier(). The program does work as expected.

I put a printf on the last line of mca_btl_tcp_add_procs() to print the value
of 'reachable'. What I saw was that the value was always 0 when it was
invoked for Send()/Recv() and the pointer itself was NULL when invoked for
Barrier()

Next I looked at pml_ob1_add_procs(), where the call chain starts, and
found that it initializes and passes an opal_bitmap_t reachable down the
call chain, but the resulting value is not used later in the code (the
memory is simply freed later).

That, coupled with the fact that I am trying to imitate what the other BTL
implementations are doing, yet in mca_bml_r2_endpoint_add_btl() my BTL is
not being picked up, left me puzzled. Please note that the interconnect
that I am developing for is on a different cluster (than where I ran the
above test for the TCP BTL).
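
(For reference, the shape of add_procs as I currently understand it: mark
the bit for every peer this BTL can reach, and stash an endpoint for it.
create_endpoint() is a hypothetical helper; the rest follows the
add_procs signature on master as I read it:)

#include "opal/class/opal_bitmap.h"
#include "opal/mca/btl/btl.h"

/* hypothetical helper, definition not shown */
static struct mca_btl_base_endpoint_t *
create_endpoint(struct opal_proc_t *proc);

static int example_add_procs(struct mca_btl_base_module_t *btl,
                             size_t nprocs, struct opal_proc_t **procs,
                             struct mca_btl_base_endpoint_t **peers,
                             opal_bitmap_t *reachable)
{
    size_t i;

    for (i = 0; i < nprocs; i++) {
        peers[i] = create_endpoint(procs[i]);
        if (NULL != peers[i] && NULL != reachable) {
            /* this is the bit that bml/r2 later tests for reachability */
            opal_bitmap_set_bit(reachable, (int) i);
        }
    }
    return OPAL_SUCCESS;
}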

Thanks again
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Sun, May 15, 2016 at 10:20 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> did you check the add_procs callbacks ?
> (e.g. mca_btl_tcp_add_procs() for the tcp btl)
> this is where the reachable bitmap is set, and I guess this is what you
> are looking for.
>
> keep in mind that if several btl can be used, the one with the higher
> exclusivity is used
> (e.g. tcp is never used if openib is available)
> you can simply force your btl and self, and the ob1 pml, so you do not
> have to worry about other btl exclusivity.
>
> Cheers,
>
> Gilles
>
>
> On Sunday, May 15, 2016, dpchoudh .  wrote:
>
>> Hello all
>>
>> I have been struggling with this issue for a while and figured it might
>> be a good idea to ask for help.
>>
>> Where (in the code path) is the connectivity map created?
>>
>> I can see that it is *used* in mca_bml_r2_endpoint_add_btl(), but
>> obviously I am not setting it up right, because this routine is not finding
>> the BTL corresponding to my interconnect.
>>
>> Thanks in advance
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/05/18975.php
>


Re: [OMPI devel] Process connectivity map

2016-05-15 Thread dpchoudh .
Hello Gilles

Setting -mca mpi_add_procs_cutoff 1024 indeed makes a difference to the
output, as follows:

With -mca mpi_add_procs_cutoff 1024:
reachable = 0x1
(Note that add_procs() was called once and the value of 'reachable' is
correct.)

Without -mca mpi_add_procs_cutoff 1024
reachable = 0x0
reachable = NULL
reachable = NULL
(Note that add_procs() was called three times and the value of 'reachable'
seems wrong.)

The program does run correctly in either case. The program listing is as
below (note that I have removed output from the program itself in the above
reporting.)

The code that prints 'reachable' is as follows:

if (reachable == NULL)
    printf("reachable = NULL\n");
else
{
    int i;
    printf("reachable = ");
    for (i = 0; i < reachable->array_size; i++)
        printf("\t0x%llx", (unsigned long long) reachable->bitmap[i]);
    printf("\n\n");
}
return OPAL_SUCCESS;

And the code for the test program is as follows:

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           hostname, world_rank, world_size);
    if (world_rank == 1)
    {
        MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s received %s, rank %d\n", hostname, buf, world_rank);
    }
    else
    {
        strcpy(buf, "haha!");
        MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}



The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Sun, May 15, 2016 at 10:49 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> At first glance, that seems a bit odd...
> are you sure you correctly print the reachable bitmap ?
> I would suggest you add some instrumentation to understand what happens
> (e.g., printf before opal_bitmap_set_bit() and other places that prevent
> this from happening)
>
> one more thing ...
> now, master default behavior is
> mpirun --mca mpi_add_procs_cutoff 0 ...
> you might want to try
> mpirun --mca mpi_add_procs_cutoff 1024 ...
> and see if things make more sense.
> if it helps, and iirc, there is a parameter so a btl can report it does
> not support cutoff.
>
>
> Cheers,
>
> Gilles
>
> On Sunday, May 15, 2016, dpchoudh .  wrote:
>
>> Hello Gilles
>>
>> Thanks for jumping in to help again. Actually, I had already tried some
>> of your suggestions before asking for help.
>>
>> I have several interconnects that can run both openib and tcp BTL. To
>> simplify things, I explicitly mentioned TCP:
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest
>>
>> where mpitest is a small program that does MPI_Send()/MPI_Recv() on a
>> small string, and then does an MPI_Barrier(). The program does work as
>> expected.
>>
>> I put a printf on the last line of mca_btl_tcp_add_procs() to print the value
>> of 'reachable'. What I saw was that the value was always 0 when it was
>> invoked for Send()/Recv() and the pointer itself was NULL when invoked for
>> Barrier()
>>
>> Next I looked at pml_ob1_add_procs(), where the call chain starts, and
>> found that it initializes and passes an opal_bitmap_t reachable down the
>> call chain, but the resulting value is not used later in the code (the
>> memory is simply freed later).
>>
>> That, coupled with the fact that I am trying to imitate what the other
>> BTL implementations are doing, yet in mca_bml_r2_endpoint_add_btl() my BTL
>> is not being picked up, left me puzzled. Please note that the interconnect
>> that I am developing for is on a different cluster (than where I ran the
>> above test for TCP BTL.)
>>
>> Thanks again
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Sun, May 15, 2016 at 10:20 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> did you check the add_procs callbacks ?
>>> (e.g. mca_btl_tcp_add_procs() for the tcp btl)
>>> this is where the reachable bitmap is set, and I guess this is what you
>>> are looking for.
>>>
>>> keep in mind that if several btl can be used, the one with the higher
>&g

Re: [OMPI devel] Process connectivity map

2016-05-15 Thread dpchoudh .
Sorry, I accidentally pressed 'Send' before I was done writing the last
mail. What I wanted to ask was: what is the parameter mpi_add_procs_cutoff,
and why does adding it seem to make a difference in the code path but not
in the end result of the program? How would it help me debug my problem?

Thank you
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Sun, May 15, 2016 at 11:17 AM, dpchoudh .  wrote:

> Hello Gilles
>
> Setting -mca mpi_add_procs_cutoff 1024 indeed makes a difference to the
> output, as follows:
>
> With -mca mpi_add_procs_cutoff 1024:
> reachable = 0x1
> (Note that add_procs() was called once and the value of 'reachable' is
> correct.)
>
> Without -mca mpi_add_procs_cutoff 1024
> reachable = 0x0
> reachable = NULL
> reachable = NULL
> (Note that add_procs() was called three times and the value of
> 'reachable' seems wrong.)
>
> The program does run correctly in either case. The program listing is as
> below (note that I have removed output from the program itself in the above
> reporting.)
>
> The code that prints 'reachable' is as follows:
>
> if (reachable == NULL)
> printf("reachable = NULL\n");
> else
> {
> int i;
> printf("reachable = ");
> for (i = 0; i < reachable->array_size; i++)
> printf("\t0x%llx", (unsigned long long) reachable->bitmap[i]);
> printf("\n\n");
> }
> return OPAL_SUCCESS;
>
> And the code for the test program is as follows:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
> int world_size, world_rank, name_len;
> char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
> MPI_Init(&argc, &argv);
> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> MPI_Get_processor_name(hostname, &name_len);
> printf("Hello world from processor %s, rank %d out of %d
> processors\n", hostname, world_rank, world_size);
> if (world_rank == 1)
> {
> MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> printf("%s received %s, rank %d\n", hostname, buf, world_rank);
> }
> else
> {
> strcpy(buf, "haha!");
> MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
> printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
> }
> MPI_Barrier(MPI_COMM_WORLD);
> MPI_Finalize();
> return 0;
> }
>
>
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Sun, May 15, 2016 at 10:49 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> At first glance, that seems a bit odd...
>> are you sure you correctly print the reachable bitmap ?
>> I would suggest you add some instrumentation to understand what happens
>> (e.g., printf before opal_bitmap_set_bit() and other places that prevent
>> this from happening)
>>
>> one more thing ...
>> now, master default behavior is
>> mpirun --mca mpi_add_procs_cutoff 0 ...
>> you might want to try
>> mpirun --mca mpi_add_procs_cutoff 1024 ...
>> and see if things make more sense.
>> if it helps, and iirc, there is a parameter so a btl can report it does
>> not support cutoff.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, May 15, 2016, dpchoudh .  wrote:
>>
>>> Hello Gilles
>>>
>>> Thanks for jumping in to help again. Actually, I had already tried some
>>> of your suggestions before asking for help.
>>>
>>> I have several interconnects that can run both openib and tcp BTL. To
>>> simplify things, I explicitly mentioned TCP:
>>>
>>> mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp
>>> ./mpitest
>>>
>>> where mpitest is a small program that does MPI_Send()/MPI_Recv() on a
>>> small string, and then does an MPI_Barrier(). The program does work as
>>> expected.
>>>
>>> I put a printf on the last line of mca_btl_tcp_add_procs() to print the
>>> value of 'reachable'. What I saw was that the value was always 0 when it
>>> was invoked for Send()/Recv() and the pointer itself was NULL when invoked
>>> for Barrier()
>>>
>>> Next I looked at pml_ob1_add_procs(), where the call chain starts, and
>>> found that it initializes and passes an opal_bitmap_t reachable down the
>>> call chain, but the resulting value is not used later in the code (the
>>> memory i

[OMPI devel] modex getting corrupted

2016-05-21 Thread dpchoudh .
Hello all

I have a naive question:

My 'cluster' consists of two nodes, connected back to back with a
proprietary link as well as GbE (over a switch).
I am calling OPAL_MODEX_SEND() and the modex consists of just this:

struct modex
{
    char name[20];
    unsigned mtu;
};

The mtu field is not currently being used. I bzero() the struct and have
verified that the value being written to the 'name' field (this is similar
to a PKEY for infiniband; the driver will translate this to a unique
integer) is correct at the sending end.

When I do an OPAL_MODEX_RECV(), the value is completely corrupted.
However, the size of the modex message is still correct (24 bytes).
What could I be doing wrong? (Both nodes are little-endian x86_64
machines.)

Thanks in advance
Durga

We learn from history that we never learn from history.


Re: [OMPI devel] modex getting corrupted

2016-05-23 Thread dpchoudh .
Hello Ralph

Thanks for your input. The routine that does the send is this:

static int btl_lf_modex_send(lfgroup lfgroup)
{
char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
btl_lf_modex_t lf_modex;
int rc;
strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
&mca_btl_lf_component.super.btl_version,
(char *)&lf_modex, sizeof(lf_modex));
return rc;
}

This routine is called from the component init routine
(mca_btl_lf_component_init()). I have verified that the values in the modex
(lf_modex) are correct.

The receive happens in proc_create, and I call it like this:

OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                &opal_proc->proc_name,
                (uint8_t **)&module_proc->proc_modex, &size);

Here, I get junk values in proc_modex. If I pass a malloc()'ed buffer in
place of module_proc->proc_modex, I still get bad data.


Thanks again for your help.

Durga

We learn from history that we never learn from history.

On Sat, May 21, 2016 at 8:38 PM, Ralph Castain  wrote:

> Please provide the exact code used for both send/recv - you likely have an
> error in the syntax
>
>
> On May 20, 2016, at 9:36 PM, dpchoudh .  wrote:
>
> Hello all
>
> I have a naive question:
>
> My 'cluster' consists of two nodes, connected back to back with a
> proprietary link as well as GbE (over a switch).
> I am calling OPAL_MODEX_SEND() and the modex consists of just this:
>
> struct modex
> { char name[20]; unsigned mtu; };
>
> The mtu field is not currently being used. I bzero() the struct and have
> verified that the value being written to the 'name' field (this is similar
> to a PKEY for InfiniBand; the driver will translate this to a unique
> integer) is correct at the sending end.
>
> When I do an OPAL_MODEX_RECV(), the value is completely corrupted. However,
> the size of the modex message is still correct (24 bytes).
> What could I be doing wrong? (Both nodes are little endian x86_64 machines)
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.


Re: [OMPI devel] modex getting corrupted

2016-05-23 Thread dpchoudh .
Hello Ralph and all

Please ignore this mail. It is indeed due to a syntax error in my code.
Sorry for the noise; I'll be more careful with my homework from now on.

Best regards
Durga

We learn from history that we never learn from history.

On Mon, May 23, 2016 at 2:13 AM, dpchoudh .  wrote:

> Hello Ralph
>
> Thanks for your input. The routine that does the send is this:
>
> static int btl_lf_modex_send(lfgroup lfgroup)
> {
> char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
> btl_lf_modex_t lf_modex;
> int rc;
> strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
> OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
> &mca_btl_lf_component.super.btl_version,
> (char *)&lf_modex, sizeof(lf_modex));
> return rc;
> }
>
> This routine is called from the component init routine
> (mca_btl_lf_component_init()). I have verified that the values in the modex
> (lf_modex) are correct.
>
> The receive happens in proc_create, and I call it like this:
>
> OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
>                 &opal_proc->proc_name,
>                 (uint8_t **)&module_proc->proc_modex, &size);
>
> Here, I get junk values in proc_modex. If I pass a malloc()'ed buffer in
> place of module_proc->proc_modex, I still get bad data.
>
>
> Thanks again for your help.
>
> Durga
>
> We learn from history that we never learn from history.
>

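For reference, a defensively written version of the send/receive pair from
this thread; a minimal sketch based on the code above, not the actual fix.
The send side zeroes the struct before filling it (so no uninitialized
padding or tail bytes go over the wire) and keeps the name NUL-terminated;
the receive side checks the size that comes back. The 'lf' names follow the
thread; the error handling is illustrative:

#include <string.h>

static int btl_lf_modex_send(lfgroup lfgroup)
{
    btl_lf_modex_t lf_modex;
    int rc;
    char *grp_name = lf_get_group_name(lfgroup, NULL, 0);

    /* Zero everything first: strncpy() alone leaves the bytes after
     * the NUL, plus any struct padding, uninitialized. */
    memset(&lf_modex, 0, sizeof(lf_modex));
    strncpy(lf_modex.grp_name, grp_name, sizeof(lf_modex.grp_name) - 1);

    OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                    &mca_btl_lf_component.super.btl_version,
                    (char *)&lf_modex, sizeof(lf_modex));
    return rc;
}

/* Receive side (fragment, e.g. inside proc_create, assuming rc, size,
 * opal_proc and module_proc as in the posts above): the macro hands
 * back an allocated buffer and its size; verify both before use. */
OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                &opal_proc->proc_name,
                (uint8_t **)&module_proc->proc_modex, &size);
if (OPAL_SUCCESS != rc || sizeof(btl_lf_modex_t) != size) {
    return OPAL_ERROR;      /* missing or mismatched modex entry */
}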

[OMPI devel] mpirun fails with the latest git pull

2016-05-26 Thread dpchoudh .
Hello all

After a git pull at roughly 4 PM EDT (US), whose change set included a .m4
file (something to do with MXM), mpirun does not work anymore. The failure
is like this:

[durga@b-2 ~]$ sudo /usr/local/bin/mpirun --allow-run-as-root -np 2
-hostfile ~/hostfile -mca btl lf,self -mca btl_base_verbose 100 ./mpitest
[b-2:1] [[2440,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c
at line 619
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

Doing a make clean && make && sudo make install did not help. Should I
re-run the autogen.pl script and start from scratch?

Thanks
Durga

We learn from history that we never learn from history.


[OMPI devel] Porting the underlying fabric interface

2016-02-04 Thread dpchoudh .
Hi developers

I am trying to add support for a new (proprietary) RDMA-capable fabric
in Open MPI and have the following question:

As I understand it, some networks are supported as a PML framework and
some as a BTL framework; there even seems to be overlap, as Myrinet
appears in both.

My question is: what is the difference between these two frameworks?
When adding support for a new fabric, what factors should one consider
when choosing one type of framework over the other?

And, with apologies for asking a summary question: is there any kind
of documentation and/or book that explains all the internal details of
the implementation (which looks a little like voodoo to a newcomer like
me)?

Thanks for your help.

Durga Choudhury

Life is complex. It has real and imaginary parts.


Re: [OMPI devel] [OMPI users] Adding a new BTL

2016-02-25 Thread dpchoudh .
Hello Jeff and other developers:

Attached are five files:
1-2: Full output from autogen.pl and configure, captured with: ./ 2>&1
| tee .log
3. Makefile.am of the specific BTL directory
4. configure.m4 of the same directory
5. config.log, as generated internally by autotools

Thank you
Durga


Life is complex. It has real and imaginary parts.

On Thu, Feb 25, 2016 at 5:15 PM, Jeff Squyres (jsquyres)  wrote:

> Can you send the full output from autogen and configure?
>
> Also, this is probably better suited for the Devel list, since we're
> talking about OMPI internals.
>
> Sent from my phone. No type good.
>
> On Feb 25, 2016, at 2:06 PM, dpchoudh .  wrote:
>
> Hello Gilles
>
> Thank you very much for your advice. Yes, I copied the templates from the
> master branch to the 1.10.2 release, since the release does not have them.
> And yes, changing the Makefile.am as you suggest did make the autogen error
> go away.
>
> However, in the master branch, the autotools seem to be ignoring the new
> btl directory altogether; i.e. I do not get a Makefile.in from the
> Makefile.am.
>
> In the 1.10.2 release, the identical sequence of steps does create a
> Makefile.in from Makefile.am (via autogen) and a Makefile from Makefile.in
> (via configure), but of course the new BTL does not build because the
> include paths in master and 1.10.2 are different.
>
> My Makefile.am and configure.m4 are as follows. Any thoughts on what it
> would take in the master branch to hook up the new BTL directory?
>
> opal/mca/btl/lf/configure.m4
> # 
> AC_DEFUN([MCA_opal_btl_lf_CONFIG],[
> AC_CONFIG_FILES([opal/mca/btl/lf/Makefile])
> ])dnl
>
> opal/mca/btl/lf/Makefile.am---
> amca_paramdir = $(AMCA_PARAM_SETS_DIR)
> dist_amca_param_DATA = netpipe-btl-lf.txt
>
> sources = \
> btl_lf.c \
> btl_lf.h \
> btl_lf_component.c \
> btl_lf_endpoint.c \
> btl_lf_endpoint.h \
> btl_lf_frag.c \
> btl_lf_frag.h \
> btl_lf_proc.c \
> btl_lf_proc.h
>
> # Make the output library in this directory, and name it either
> # mca__.la (for DSO builds) or libmca__.la
> # (for static builds).
>
> if MCA_BUILD_opal_btl_lf_DSO
> lib =
> lib_sources =
> component = mca_btl_lf.la
> component_sources = $(sources)
> else
> lib = libmca_btl_lf.la
> lib_sources = $(sources)
> component =
> component_sources =
> endif
>
> mcacomponentdir = $(opallibdir)
> mcacomponent_LTLIBRARIES = $(component)
> mca_btl_lf_la_SOURCES = $(component_sources)
> mca_btl_lf_la_LDFLAGS = -module -avoid-version
>
> noinst_LTLIBRARIES = $(lib)
> libmca_btl_lf_la_SOURCES = $(lib_sources)
> libmca_btl_lf_la_LDFLAGS = -module -avoid-version
>
> -
>
> Life is complex. It has real and imaginary parts.
>
> On Thu, Feb 25, 2016 at 3:10 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Did you copy the template from the master branch into the v1.10 branch?
>> if so, replacing MCA_BUILD_opal_btl_lf_DSO with
>> MCA_BUILD_ompi_btl_lf_DSO will likely solve your issue.
>> you do need a configure.m4 (otherwise your btl will not be built) but
>> you do not need AC_MSG_FAILURE
>>
>> as far as I am concerned, I would develop in the master branch, and
>> then backport it into the v1.10 branch when it is ready.
>>
>> fwiw, BTLs used to be in ompi/mca/btl (still the case in v1.10) and
>> have been moved into opal/mca/btl since v2.x,
>> so it is quite common that a bit of porting is required; most of the
>> time it consists of replacing OMPI-like macros with OPAL-like macros.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Feb 25, 2016 at 3:54 PM, dpchoudh .  wrote:
>> > Hello all
>> >
>> > I am not sure if this question belongs in the user list or the
>> > developer list, but because it is a simpler question I am trying the
>> > user list first.
>> >
>> > I am trying to add a new BTL for a proprietary transport.
>> >
>> > As step #0, I copied the BTL template, renamed the 'template' to
>> > something else, and ran autogen.pl at the top-level directory (of
>> > Open MPI 1.10.2). The Makefile.am is identical to what is provided in
>> > the template except that every 'template' has been substituted with
>> > 'lf', the name of the fabric.
>> >
>> > With that, I get the following error:
>> >
>> > 
>> >
>> > autoreconf: running: /usr/bin/autoconf --include=config --
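
To make Gilles' porting note above concrete (replacing OMPI-style macros and
types with their OPAL equivalents when moving a BTL from v1.10 to master),
here is a sketch of the mechanical rename for one BTL entry point. The 'lf'
names follow this thread; the function body is elided:

/* master / v2.x style: BTLs live under opal/ and use OPAL symbols */
#include "opal/mca/btl/btl.h"           /* was ompi/mca/btl/btl.h in v1.10 */
#include "opal/constants.h"

static int mca_btl_lf_del_procs(struct mca_btl_base_module_t *btl,
                                size_t nprocs,
                                struct opal_proc_t **procs,  /* was ompi_proc_t */
                                struct mca_btl_base_endpoint_t **peers)
{
    /* ... tear down the per-peer endpoints here ... */
    return OPAL_SUCCESS;                /* was OMPI_SUCCESS in v1.10 */
}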