Re: [OMPI devel] Setting AUTOMAKE_JOBS

2010-09-24 Thread Jeff Squyres (jsquyres)
Also to clarify:

- did autogen set am-jobs to 2 in your case?  (it should do that if lstopo is 
not found - it also limits itself to 4 at max)

- in the same scenario, what happens if you manually set am-jobs to 1 and run 
autogen?  I.e., do you get the same heat/sluggishness?  I have seen VMs 
cause this kind of behavior just by running, because of the CPU and 
memory pressure they create. 
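
The job-count rule described above (fall back to 2 when lstopo is not found, never go above 4) can be sketched roughly as follows. This is only an illustrative Python model, not the actual autogen code: the names pick_am_jobs, lstopo_available, MAX_JOBS and FALLBACK_JOBS are hypothetical, and using the core count as the starting value is an assumption.

```python
import shutil

MAX_JOBS = 4        # upper limit mentioned in the message above
FALLBACK_JOBS = 2   # value used when lstopo is not found


def lstopo_available():
    """Return True if an lstopo executable is found on PATH."""
    return shutil.which("lstopo") is not None


def pick_am_jobs(core_count, lstopo_found):
    """Pick a job count: 2 without lstopo, else the core count capped at 4."""
    if not lstopo_found or core_count is None:
        return FALLBACK_JOBS
    return max(1, min(core_count, MAX_JOBS))
```

A caller would pass in something like os.cpu_count() and lstopo_available(). Note that nothing in this rule looks at the current load, which is the concern raised in the quoted message below.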

Sent from my PDA. No type good. 

On Sep 24, 2010, at 12:49 AM, "Ralph Castain"  wrote:

> Sent to both for reference (see below)
> 
> Just to clarify. It wasn't a deadlock situation, but rather that the machine 
> was overloaded and running so hard that the response to keystrokes was 
> multiple seconds. Thus, there was no way to shut it down from the keyboard or 
> screen. Even a ctrl-c was just getting ignored for a very long time due to 
> the overload.
> 
> I was running vmware on my machine, and doing a heavy compile/build in it. On 
> top of this, I had email, editor, and browsers running - and then kicked off 
> a fresh build in a terminal window. With Jeff's default settings, this latter 
> build thought it would be running alone on the machine, and promptly 
> generated a number of threads equal to all the processors. Since they were 
> already loaded, this drove the machine into the ground.
> 
> My point is just that it is unwise to assume that the OMPI build can utilize 
> all available processors. I'm sure it's fine for the MTT runs, especially on 
> Jeff's machines as they are dedicated to that purpose - just not a good 
> general assumption.
> 
> 
> HTH
> Ralph
> 
> 
> Output of "perl -V":
> 
> Summary of my perl5 (revision 5 version 8 subversion 9) configuration:
>   Platform:
> osname=darwin, osvers=10.2.0, archname=darwin-2level
> uname='darwin sjc-rcastain-87111.cisco.com 10.2.0 darwin kernel version 
> 10.2.0: tue nov 3 10:37:10 pst 2009; root:xnu-1486.2.11~1release_i386 i386 '
> config_args='-des -D prefix=/opt/local -D scriptdir=/opt/local/bin -D 
> cppflags=-I/opt/local/include -D ccflags=-O2 -arch x86_64 -D 
> ldflags=-L/opt/local/lib -D vendorprefix=/opt/local -D man1ext=1pm -D 
> man3ext=3pm -D cc=/usr/bin/gcc-4.2 -D ld=/usr/bin/gcc-4.2 -D 
> man1dir=/opt/local/share/man/man1p -D man3dir=/opt/local/share/man/man3p -D 
> siteman1dir=/opt/local/share/man/man1 -D 
> siteman3dir=/opt/local/share/man/man3 -D 
> vendorman1dir=/opt/local/share/man/man1 -D 
> vendorman3dir=/opt/local/share/man/man3 -D inc_version_list=5.8.8 
> 5.8.8/darwin-2level -U i_bind -U i_gdbm -U i_db'
> hint=recommended, useposix=true, d_sigaction=define
> usethreads=undef use5005threads=undef useithreads=undef 
> usemultiplicity=undef
> useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
> use64bitint=define use64bitall=define uselongdouble=undef
> usemymalloc=n, bincompat5005=undef
>   Compiler:
> cc='/usr/bin/gcc-4.2', ccflags ='-O2 -arch x86_64 -fno-common 
> -DPERL_DARWIN -I/opt/local/include -no-cpp-precomp -fno-strict-aliasing -pipe 
> -I/usr/local/include -I/opt/local/include',
> optimize='-O3',
> cppflags='-I/opt/local/include -no-cpp-precomp -O2 -arch x86_64 
> -fno-common -DPERL_DARWIN -I/opt/local/include -no-cpp-precomp 
> -fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include'
> ccversion='', gccversion='4.2.1 (Apple Inc. build 5646) (dot 1)', 
> gccosandvers=''
> intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
> d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
> ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
> lseeksize=8
> alignbytes=8, prototype=define
>   Linker and Libraries:
> ld='env MACOSX_DEPLOYMENT_TARGET=10.3 /usr/bin/gcc-4.2', ldflags 
> ='-L/opt/local/lib -L/usr/local/lib'
> libpth=/usr/local/lib /opt/local/lib /usr/lib
> libs=-ldbm -ldl -lm -lutil -lc
> perllibs=-ldl -lm -lutil -lc
> libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
> gnulibc_version=''
>   Dynamic Linking:
> dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
> cccdlflags=' ', lddlflags='-L/opt/local/lib -bundle -undefined 
> dynamic_lookup -L/usr/local/lib'
> 
> 
> Characteristics of this binary (from libperl): 
>   Compile-time options: PERL_MALLOC_WRAP USE_64_BIT_ALL USE_64_BIT_INT
> USE_FAST_STDIO USE_LARGE_FILES USE_PERLIO
>   Built under darwin
>   Compiled at Feb 13 2010 13:19:33
>   @INC:
> /opt/local/lib/perl5/site_perl/5.8.9/darwin-2level
> /opt/local/lib/perl5/site_perl/5.8.9
> /opt/local/lib/perl5/site_perl
> /opt/local/lib/perl5/vendor_perl/5.8.9/darwin-2level
> /opt/local/lib/perl5/vendor_perl/5.8.9
> /opt/local/lib/perl5/vendor_perl
> /opt/local/lib/perl5/5.8.9/darwin-2level
> /opt/local/lib/perl5/5.8.9
> .
> 
> On Thu, Sep 23, 2010 at 10:26 PM, Ralf Wildenhues  
> wrote:
> Hello Ralph,
> 
> wow, that's no

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r23936

2010-10-26 Thread Jeff Squyres (jsquyres)
Btw it strikes me that we could put the old libevent back as a separate 
component for comparisons. 

Sent from my PDA. No type good. 

On Oct 26, 2010, at 6:20 AM, "Jeff Squyres"  wrote:

> On Oct 25, 2010, at 9:29 PM, George Bosilca wrote:
> 
>> 1. Not all processes deadlock in btl_sm_add_procs. The process that sets up 
>> the shared memory area goes forward and blocks later in a barrier.
> 
> Yes, I'm seeing the same thing (I didn't include all details like this in my 
> post, sorry). I was running with -np 2 on a local machine and saw vpid=0 get 
> stuck in opal_progress (because the first time through, seg_inited < 
> n_local_procs).  vpid=1 increments seg_inited and therefore doesn't enter the 
> loop that calls opal_progress(), and therefore continues on.
> 
>> 2. All other processes loop around opal_progress until they get a 
>> message from all other processes. The variable used for counting is somehow 
>> updated correctly, but we still call opal_progress. I couldn't figure out if 
>> we loop more than we should, or if opal_progress doesn't return. However, 
>> both of these possibilities look very unlikely to me: the loop in 
>> sm_add_procs is pretty straightforward, and I couldn't find any loops in 
>> opal_progress. I wonder if some of the messages get lost in the exchange.
> 
> I had this problem, too, until I tried to use padb to get stack traces.  I 
> noticed that when I ran padb, my blocked process un-blocked itself and 
> continued.  After more digging, I determined that my blocked process was, in 
> fact, blocked in poll() with an infinite timeout.  padb (or any signal at 
> all) caused it to unblock and therefore continue.
> 
>> 3. If I unblock the situation by hand, everything goes back to normal. 
>> NetPIPE runs to completion, but the performance is __really__ bad. On my 
>> test machine I get around 2000 Mb/s, when the expected value is at least 10 
>> times more. Similar finding on the latency side: we're now at 1.65 
>> microseconds, up from the usual 0.35 we had before.
> 
> It's a feature!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] === CREATE FAILURE (trunk) ===

2010-11-03 Thread Jeff Squyres (jsquyres)
Yep - I get these mails, too. 

My only comment is: %}%%€<>~|%>€€!!!

I swear I actually do test these things and they *do* work before I commit 
them. There must be some difference between my env and the nightly creation 
env. I'll investigate...

Sent from my PDA. No type good. 

On Nov 3, 2010, at 2:12 AM, "Mike Dubman"  wrote:

> 
> Hi,
> ompi/trunk (r23985) build still fails with compilation errors (attached).
> 
> Regards
> M
> 
> On Mon, Nov 1, 2010 at 11:10 PM, Jeff Squyres  wrote:
> Sorry for the delay on this -- the issue was quite subtle and the holiday 
> weekend got in the way.
> 
> I have a fix that will be committed a little after 6pm US Eastern.  It seems 
> to allow a fresh SVN checkout (with my patch applied) to pass "make 
> distcheck".  Hopefully we'll finally get a new trunk tarball tonight.
> 
> 
> On Oct 31, 2010, at 9:16 PM, MPI Team wrote:
> 
> >
> > ERROR: Command returned a non-zero exist status (trunk):
> >   make distcheck
> >
> > Start time: Sun Oct 31 21:00:12 EDT 2010
> > End time:   Sun Oct 31 21:16:33 EDT 2010
> >
> > ===
> > [... previous lines snipped ...]
> > checking for OPAL CXXFLAGS... -pthread
> > checking for OPAL CXXFLAGS_PREFIX...
> > checking for OPAL LDFLAGS...
> > checking for OPAL LIBS... -ldl   -Wl,--export-dynamic -lrt -lnsl -lutil -lm 
> > -ldl
> > checking for OPAL extra include dirs...
> > checking for ORTE CPPFLAGS...
> > checking for ORTE CXXFLAGS... -pthread
> > checking for ORTE CXXFLAGS_PREFIX...
> > checking for ORTE CFLAGS... -pthread
> > checking for ORTE CFLAGS_PREFIX...
> > checking for ORTE LDFLAGS...
> > checking for ORTE LIBS...  -ldl   -Wl,--export-dynamic -lrt -lnsl -lutil 
> > -lm -ldl
> > checking for ORTE extra include dirs...
> > checking for OMPI CPPFLAGS...
> > checking for OMPI CFLAGS... -pthread
> > checking for OMPI CFLAGS_PREFIX...
> > checking for OMPI CXXFLAGS... -pthread
> > checking for OMPI CXXFLAGS_PREFIX...
> > checking for OMPI FFLAGS... -pthread
> > checking for OMPI FFLAGS_PREFIX...
> > checking for OMPI FCFLAGS... -pthread
> > checking for OMPI FCFLAGS_PREFIX...
> > checking for OMPI LDFLAGS...
> > checking for OMPI LIBS...   -ldl   -Wl,--export-dynamic -lrt -lnsl -lutil 
> > -lm -ldl
> > checking for OMPI extra include dirs...
> >
> > *** Final output
> > configure: creating ./config.status
> > config.status: creating ompi/include/ompi/version.h
> > config.status: creating orte/include/orte/version.h
> > config.status: creating opal/include/opal/version.h
> > config.status: creating opal/mca/backtrace/Makefile
> > config.status: creating opal/mca/backtrace/printstack/Makefile
> > config.status: creating opal/mca/backtrace/execinfo/Makefile
> > config.status: creating opal/mca/backtrace/darwin/Makefile
> > config.status: creating opal/mca/backtrace/none/Makefile
> > config.status: creating opal/mca/carto/Makefile
> > config.status: creating opal/mca/carto/auto_detect/Makefile
> > config.status: creating opal/mca/carto/file/Makefile
> > config.status: creating opal/mca/compress/Makefile
> > config.status: creating opal/mca/compress/gzip/Makefile
> > config.status: creating opal/mca/compress/bzip/Makefile
> > config.status: creating opal/mca/crs/Makefile
> > config.status: creating opal/mca/crs/none/Makefile
> > config.status: creating opal/mca/crs/self/Makefile
> > config.status: creating opal/mca/crs/blcr/Makefile
> > config.status: creating opal/mca/event/Makefile
> > config.status: creating opal/mca/event/libevent207/Makefile
> > config.status: error: cannot find input file: 
> > `opal/mca/event/libevent207/libevent/include/event2/event-config.h.in'
> > make: *** [distcheck] Error 1
> > ===
> >
> > Your friendly daemon,
> > Cyrador
> > ___
> > testing mailing list
> > test...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/testing
> 
> 


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Sorry for the delay in replying - many of us were at SC last week. 

Admittedly, I'm looking at your code on a PDA, so I might be missing some 
things. But I have 2 q's:

1. Your send routine doesn't seem to protect against sending to yourself. Correct?

2. You're not using nonblocking sends, which, if I understand your code right, 
can lead to deadlock. Right?  E.g., proc A sends to proc B and blocks until B 
receives. But B is blocking waiting for its own send to complete, etc. 

I think with your random destinations (which may even be yourself, in which 
case the blocking send will never complete because you didn't prepost a 
nonblocking receive) and blocking sends, you can end up with deadlock. 
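
To make the failure mode concrete, here is a small thread-based toy model of it. It is not MPI code: Rank, isend, recv and exchange are hypothetical names, every send is treated as a rendezvous send (the case above the eager limit), and a timeout stands in for "hangs forever". In real MPI the nonblocking variant would correspond to something like MPI_Isend or a preposted MPI_Irecv.

```python
import queue
import threading


class Rank:
    """Toy stand-in for an MPI rank: an inbox of (payload, done) pairs."""

    def __init__(self):
        self.inbox = queue.Queue()


def isend(dest, payload):
    """Nonblocking send: enqueue, return an event set once it is received."""
    done = threading.Event()
    dest.inbox.put((payload, done))
    return done


def recv(rank, timeout):
    """Receive one message and tell its sender that the send completed."""
    payload, done = rank.inbox.get(timeout=timeout)
    done.set()
    return payload


def exchange(blocking, timeout=0.5):
    """Two ranks send to each other; report whether each send completed."""
    ranks = {"A": Rank(), "B": Rank()}
    completed = {}

    def run(name, peer):
        done = isend(ranks[peer], name)
        if blocking:
            # Blocking rendezvous send: wait here before ever receiving.
            completed[name] = done.wait(timeout)
            return
        recv(ranks[name], timeout)            # service the peer's message first
        completed[name] = done.wait(timeout)  # then wait for our own send

    threads = [threading.Thread(target=run, args=pair)
               for pair in (("A", "B"), ("B", "A"))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed
```

With blocking=True neither rank ever reaches its receive, so neither send completes within the timeout. With blocking=False each rank receives before waiting on its own send, and both complete. A rank sending to itself with a blocking send is the same situation with only one rank involved.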

Sent from my PDA. No type good. 

On Nov 16, 2010, at 5:21 PM, Sébastien Boisvert 
 wrote:

> Dear awesome community,
> 
> 
> Over the last months, I closely followed the evolution of bug 2043,
> entitled 'sm BTL hang with GCC 4.4.x'.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2043
> 
> The reason is that I am developing MPI-based software, and I use
> Open-MPI as it is the only implementation I am aware of that sends
> messages eagerly (powerful feature, that is).
> 
> http://denovoassembler.sourceforge.net/
> 
> I believe that this very pesky bug remains in Open-MPI 1.4.3, and
> enclosed to this communication are scientific proofs of my claim, or at
> least I think they are ;).
> 
> 
> Each byte transfer layer has its own default limit for sending a message
> eagerly. With shared memory (sm), the value is 4096 bytes. At least it
> is according to ompi_info.
> 
> 
> To verify this limit, I implemented a very simple test. The source code
> is test4096.cpp, which basically just sends a single message of 4096
> bytes from one rank to another (rank 1 to 0).
> 
> The test was conclusive: the limit is 4096 bytes (see
> mpirun-np-2-Simple.txt).
> 
> 
> 
> Then, I implemented a simple program (103 lines) that makes Open-MPI
> 1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each
> rank sends a message to a randomly selected destination. A rank polls for
> new messages with MPI_Iprobe. Each rank prints the current time once per
> second for 30 seconds. Using this simple code, I ran 4 test cases,
> each with a different outcome (use the Makefile if you want to reproduce
> the bug).
> 
> Before I describe these cases, I will describe the testing hardware. 
> 
> I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz). 
> The computer has 128 GB of physical memory (see
> cat-proc-meminfo.txt.gz).
> It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see
> dmesg.txt.gz & uname.txt).
> Default kernel parameters are utilized at runtime (see
> sudo-sysctl-a.txt.gz).
> 
> The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see
> g++--version.txt).
> 
> 
> I compiled Open-MPI 1.4.3 myself (see config.out.gz, make.out.gz,
> make-install.out.gz).
> Finally, I use Open-MPI 1.4.3 with defaults (see ompi_info.txt.gz).
> 
> 
> 
> 
> Now I can describe the cases.
> 
> 
> Case 1: 30 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-30-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
> 
> 
> Case 2: 30 MPI ranks, message size is 1 byte
> 
> File: mpirun-np-30-Program-1.txt.gz
> Outcome: It runs just fine.
> 
> 
> 
> 
> Case 3: 2 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-2-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
> 
> 
> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
> disabled
> 
> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> Outcome: It runs just fine.
> 
> 
> 
> 
> 
> A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.
> 
> 
> 
> 
> Thank you !
> 



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Ya, it sounds like we should fix this eager limit help text so that others 
aren't misled. We did say "attempt", but that's probably a bit too subtle. 

Eugene - iirc: this is in the btl base (or some other central location) because 
it's shared between all btls. 

Sent from my PDA. No type good. 

On Nov 23, 2010, at 5:54 PM, "Eugene Loh"  wrote:

> George Bosilca wrote:
> 
>> Moreover, eager send can improve performance if and only if the matching 
>> receives are already posted on the peer. If not, the data will become 
>> unexpected, and there will be one additional memcpy.
>> 
> I don't think the first sentence is strictly true.  There is a cost 
> associated with eager messages, but whether there is an overall improvement 
> or not depends on lots of factors.



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Beware that MPI_Request_free on active buffers is valid but evil. You CANNOT be 
sure when the buffer is available for reuse. 

There was a sentence or paragraph added to MPI 2.2 describing exactly this 
case. 
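
The hazard is easy to see with a toy model of the ring allocator mentioned in the quoted message below. This is a Python sketch with hypothetical names (RingAllocator, first_reused_slot), not code from any MPI library or from the application: once the request has been freed, the allocator has no completion information, so it recycles slots purely by position.

```python
class RingAllocator:
    """Hands out fixed-size buffers round-robin, with no completion tracking."""

    def __init__(self, slots, size):
        self.buffers = [bytearray(size) for _ in range(slots)]
        self.next_slot = 0

    def allocate(self):
        index = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.buffers)
        return index, self.buffers[index]


def first_reused_slot(ring_slots, sends_in_flight):
    """Return the first slot handed out while still in flight, or None."""
    ring = RingAllocator(ring_slots, size=8)
    in_flight = set()
    for _ in range(sends_in_flight):
        index, _buffer = ring.allocate()
        if index in in_flight:
            return index  # the allocator just recycled a buffer still in use
        in_flight.add(index)
    return None
```

As long as the number of sends that can be outstanding at once stays below the ring size, no slot is recycled early. A protocol that waits for a reply before sending the next message bounds that number, but the bound comes from the application protocol, not from MPI.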

Sent from my PDA. No type good. 

On Nov 23, 2010, at 5:36 PM, Sébastien Boisvert 
 wrote:

> Le mardi 23 novembre 2010 à 17:28 -0500, George Bosilca a écrit :
>> Sebastien,
>> 
>> Using MPI_Isend doesn't guarantee asynchronous progress. As you might be 
>> aware, the non-blocking communications are guaranteed to progress only when 
>> the application is in the MPI library. Currently very few MPI 
>> implementations progress asynchronously (and unfortunately Open MPI is not 
>> one of them).
>> 
> 
> Regardless, I just need the non-blocking behavior.
> I call MPI_Request_free just after MPI_Isend, and I use a ring allocator
> to allocate message buffers.
> 
> Message recipients just reply with another message to the source, using
> a NULL buffer.
> 
> The sender waits for the reply before sending the next message.
> 
> And it works for assembling bacterial genomes on many MPI ranks:
> 
> ...
> Rank 0: 162 contigs/4576725 nucleotides
> 
> Rank 0 reports the elapsed time, Tue Nov 23 01:35:48 2010
> ---> Step: Collection of fusions
>  Elapsed time: 0 seconds
>  Since beginning: 17 minutes, 33 seconds
> 
> Elapsed time for each step, Tue Nov 23 01:35:48 2010
> 
> Beginning of computation: 1 seconds
> Distribution of sequence reads: 7 minutes, 49 seconds
> Distribution of vertices: 19 seconds
> Calculation of coverage distribution: 1 seconds
> Distribution of edges: 29 seconds
> Indexing of sequence reads: 1 seconds
> Computation of seeds: 2 minutes, 33 seconds
> Computation of library sizes: 1 minutes, 47 seconds
> Extension of seeds: 3 minutes, 34 seconds
> Computation of fusions: 59 seconds
> Collection of fusions: 0 seconds
> Completion of the assembly: 17 minutes, 33 seconds
> 
> Rank 0 wrote Ecoli-THEONE.CoverageDistribution.txt
> Rank 0 wrote Ecoli-THEONE.fasta
> Rank 0 wrote Ecoli-THEONE.ReceivedMessages.txt
> Rank 0 wrote Ecoli-THEONE.Library0.txt
> Rank 0 wrote Ecoli-THEONE.Library1.txt
> 
> Au revoir !
> 
> 
>>  george.
>> 
>> On Nov 23, 2010, at 17:17 , Sébastien Boisvert wrote:
>> 
>>> I now use MPI_Isend, so the problem is no more.
>> 
>> 
> 
> -- 
> M. Sébastien Boisvert
> Étudiant au doctorat en physiologie-endocrinologie à l'Université Laval
> Boursier des Instituts de recherche en santé du Canada
> Équipe du Professeur Jacques Corbeil
> 
> Centre de recherche en infectiologie de l'Université Laval
> Local R-61B
> 2705, boulevard Laurier
> Québec, Québec
> Canada G1V 4G2
> Téléphone: 418 525  46342
> 
> Courriel: s...@boisvert.info
> Web: http://boisvert.info
> 
> "Innovation comes only from an assault on the unknown" -Sydney Brenner
> 



Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Jeff Squyres (jsquyres)
I'd guess the same thing as George - a race condition in the shutdown of the 
async thread...?  I haven't looked at that code in a long, long time, so I 
don't remember how it tried to defend against the race condition. 

Sent from my PDA. No type good. 

On Jan 3, 2011, at 2:31 PM, "Eugene Loh"  wrote:

> George Bosilca wrote:
> 
>> Eugene,
>> 
>> This error indicates that somehow we're accessing the QP while the QP is in 
>> "down" state. As the asynchronous thread is the one that sees this error, I 
>> wonder if it isn't looking for some information about a QP that has been 
>> destroyed by the main thread (as this only occurs in MPI_Finalize).
>> 
>> Can you look in the syslog to see if there is any additional info related to 
>> this issue there?
>> 
> Not much.  A one-liner like this:
> 
> Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE 
> local access violation
> 
>> On Dec 30, 2010, at 20:43, Eugene Loh  wrote:
>> 
>>> I was running a bunch of np=4 test programs over two nodes.  Occasionally, 
>>> *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during 
>>> MPI_Finalize().  I traced the code and ran another program that mimicked 
>>> the particular MPI calls made by that program.  This other program, too, 
>>> would occasionally trigger this error.  I never saw the problem with other 
>>> tests.  Rate of incidence could go from consecutive runs (I saw this once) 
>>> to 1:100s (more typically) to even less frequently -- I've had 1000s of 
>>> consecutive runs with no problems.  (The tests run a few seconds apiece.)  
>>> The traffic pattern is sends from non-zero ranks to rank 0, with root-0 
>>> gathers, and lots of Allgathers.  The largest messages are 1000 bytes.  It 
>>> appears the problem is always seen on rank 3.
>>> 
>>> Now, I wouldn't mind someone telling me, based on that little information, 
>>> what the problem is here, but I guess I don't expect that.  What I am 
>>> asking is what IBV_EVENT_QP_ACCESS_ERR means.  Again, it's seen during 
>>> MPI_Finalize.  The async thread is seeing this.  What is this error trying 
>>> to tell me?
>>>   



Re: [OMPI devel] Building Open MPI components outside of the sourcetree

2011-01-19 Thread Jeff Squyres (jsquyres)
I'd rather not set up another SVN repo. Where should it go in the current OMPI 
SVN?

Sent from my PDA. No type good. 

On Jan 19, 2011, at 5:01 PM, "George Bosilca"  wrote:

> 
> On Jan 19, 2011, at 16:44 , Jeff Squyres wrote:
> 
>> Where should it be on the main web site?  
> 
> The Documentation section looks like a good place to me.
> 
>> It needs to be in a repo somewhere; it may change over time.
> 
> The source code can be hosted at Indiana in the same way ompi-tests and 
> ompi-docs are hosted. However, I don't expect this code to change drastically 
> every other day, so providing a tar on a webpage should be good enough. To be 
> more precise on this point, as we only allow big modifications of the build 
> system between major releases, I expect to maintain only 3 templates (stable, 
> unstable and trunk).
> 
>  george.
> 
>> 
>> 
>> On Jan 19, 2011, at 4:38 PM, George Bosilca wrote:
>> 
>>> This stuff should be directly on the main Open MPI website. Not as a link 
>>> to bitbucket, but as a webpage and 3 tars.
>>> 
>>> george.
>>> 
>>> On Jan 19, 2011, at 15:43 , Jeff Squyres wrote:
>>> 
>>>> Over the years, a few parties have wanted to be able to build Open MPI 
>>>> components outside of the official source tree (e.g., they are developing 
>>>> their own components outside of OMPI's SVN).  We've typically said "use 
>>>> --with-devel-headers", but a) never really provided a full example of how 
>>>> to do this, and b) never acknowledged that using --with-devel-headers is 
>>>> somewhat of a pain.
>>>> 
>>>> That ends now.  :-)
>>>> 
>>>> I am publishing a bitbucket repo of three example "tcp2" BTL components.  
>>>> They are almost exact copies of the real TCP BTL component, but have had 
>>>> their configury updated to enable them to be built outside of the Open MPI 
>>>> source tree:
>>>> 
>>>> 1. A component for the v1.4 Open MPI tree
>>>> 2. A component for the v1.5/v1.6 Open MPI tree
>>>> 3. A component for the trunk/v1.7 (as of r24265) Open MPI tree
>>>> 
>>>> Each of these example components supports the --with-devel-headers method 
>>>> as well as a new method: --with-openmpi-source=DIR (i.e., where you 
>>>> specify the corresponding Open MPI source directory, and the component 
>>>> builds against that).  
>>>> 
>>>> There are three different components because the configury for each of 
>>>> them is a bit different.  Look at the configure.ac in the version that 
>>>> you care about to see examples of how to get the relevant CPPFLAGS / 
>>>> CFLAGS that you need to build your component.
>>>> 
>>>> Here's the bitbucket repo:
>>>> 
>>>> https://bitbucket.org/jsquyres/build-ompi-components-outside-of-source-tree
>>>> 
>>>> There's a top-level README.txt file in the repo that explains a bit more.
>>>> 
>>>> Enjoy!
 
>>> 
>>> 
>> 
>> 
> 
> 



Re: [OMPI devel] Old Linux kernels

2011-03-16 Thread Jeff Squyres (jsquyres)
Is there a version in a pthreads header file that can be checked?

You're right that I am currently checking the Linux kernel version, not the 
pthread version. Note that this is *only* in cross-compiling environments; in 
non-cross-compiling situations, we actually test the behavior to see if threads 
have different PIDs or not. 
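
The idea behind that behavioral test can be sketched as follows. This is a Python illustration only, not the actual .m4 test (which is not shown in this thread); the function name threads_share_pid is hypothetical. It starts a few threads and checks whether each one reports the same PID as the main thread.

```python
import os
import threading


def threads_share_pid(thread_count=4):
    """Return True if every thread reports the same PID as the main thread."""
    main_pid = os.getpid()
    seen = []

    def record():
        # Each worker records the PID it observes for itself.
        seen.append(os.getpid())

    workers = [threading.Thread(target=record) for _ in range(thread_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return all(pid == main_pid for pid in seen)
```

A check like this has to run on the target system, which is exactly what is not possible when cross-compiling. Hence the fallback to the kernel version in that case.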

Sent from my phone. No type good. 

On Mar 15, 2011, at 7:41 PM, "Ralph Castain"  wrote:

> My point was just that we support the current implementation of pthreads - 
> not any old one.
> 
> Also, to clarify: Jeff actually tests to see what the thread library does. We 
> only use the Linux kernel version when cross-compiling since we cannot, in 
> that case, actually test the support. We know that old Linux kernels have the 
> old implementation, so we exclude them. Anything else is hit-or-miss when 
> cross-compiling.
> 
> 
> On Mar 15, 2011, at 4:46 PM, Paul H. Hargrove wrote:
> 
>> Sorry, I stated my facts backwards.
>> CORRECTED facts:
>> 
>> +The old "LinuxThreads" implementation is the one that gave DIFFERENT pids 
>> to each pthread.
>> + "NPTL" is the current implementation of Pthreads for Linux, and the one 
>> giving a single pid shared by all pthreads.
>> 
>> So, I hope Ralph's statement is similarly reversed, because "LinuxThreads" 
>> has not been maintained in years.
>> 
>> -Paul
>> 
>> On 3/15/2011 3:40 PM, Ralph Castain wrote:
>>> I believe the test is intended strictly for Linux threads. I don't believe 
>>> we have ever (intentionally) supported any other thread library in such 
>>> environments.
>>> 
>>> I'll leave it to Jeff to decide if he feels this is an issue.
>>> 
>>> 
>>> On Mar 15, 2011, at 4:27 PM, Paul H. Hargrove wrote:
>>> 
>>>> I'd like to point out that it is libpthread and the arguments it passes to 
>>>> clone(), NOT the Linux kernel version, that is the determining factor (at 
>>>> least if you have a 2.6.x kernel).  The "LinuxThreads" implementation of 
>>>> Pthreads will give the one-pid-to-rule-them-all behavior, while the NPTL 
>>>> implementation gives unique pids under any 2.6.x kernel and even w/ some 
>>>> 2.4.x kernels from Red Hat.
>>>> 
>>>> I have encountered systems on which dynamic linking gave NPTL and static 
>>>> linking gave LinuxThreads.  That is a "gotcha" that I am not certain Jeff 
>>>> and Ralph have taken into account.
>>>> 
>>>> Note that I have no objection to "we don't support this", but fear that 
>>>> detection of that situation may be flawed.
>>>> 
>>>> -Paul
>>>> 
>>>> On 3/15/2011 2:14 PM, Ralph Castain wrote:
>>>>> Hi folks
>>>>> 
>>>>> Jeff and I encountered a problem when cross-compiling OMPI for Linux. 
>>>>> Turned out that we had an old test in the code that looked for threads to 
>>>>> have different pids. Since it couldn't be tested when cross-compiling, 
>>>>> the test simply assumed this was the case for Linux under those 
>>>>> conditions - which broke the build for current Linux kernels.
>>>>> 
>>>>> Different pids for threads was last seen in the old RH 4 series (kernel 
>>>>> 2.6.9 or so). Some code (e.g., waitpid) was also provided to support this 
>>>>> unusual situation - this code was in fact broken when we updated the 
>>>>> event library. So even if we were in an old kernel, the code base would 
>>>>> neither compile nor run.
>>>>> 
>>>>> Rather than trying to continue to support these old kernels, we have 
>>>>> removed all the stale code that was covered by 
>>>>> OPAL_THREADS_HAVE_DIFFERENT_PIDS. This removed some complexity from a few 
>>>>> PLM modules and removed the broken code.
>>>>> 
>>>>> Jeff modified the corresponding .m4 test so we now detect an older 
>>>>> kernel, print out a nice "we don't support this" message (along with 
>>>>> noting that earlier versions of OMPI do), and then abort the build.
>>>>> 
>>>>> If you know of some reason to restore support for old Linux kernels, and 
>>>>> someone willing to do the work to "refresh" that support, please let us 
>>>>> know.
>>>>> 
>>>>> Ralph & Jeff
>>>>> 
>>>>> 
>>>> -- 
>>>> Paul H. Hargrove  phhargr...@lbl.gov
>>>> Future Technologies Group
>>>> HPC Research Department   Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
 
>>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> HPC Research Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> 

Re: [OMPI devel] trunk not compiling for btl_openib_connect_oob.c

2011-03-16 Thread Jeff Squyres (jsquyres)
K. When Ralph and I removed that code, it was on the educated guess that no one 
was using it (because it hasn't compiled right in a while). If we were wrong, 
it can be put back, but someone will need to update it, and Ralph and I don't 
have access to machines to test that behavior. 

Sent from my phone. No type good. 

On Mar 16, 2011, at 6:32 AM, "Terry Dontje"  wrote:

> On 03/16/2011 06:21 AM, Jeff Squyres wrote:
>> 
>> On Mar 16, 2011, at 5:51 AM, Terry Dontje wrote:
>> 
>>> I've seen this with the following:
>>> 
>>> RH 4.6 / OFED 1.3.6
>> Errr... did you look at 
>> http://www.open-mpi.org/community/lists/devel/2011/03/9068.php?
> Yes I did, and I will be talking with my group about this, this afternoon.  
> We might be able to remove that dependency.
>> 
>>> CentOS 5.2 / OFED 1.3.6 
>>> SLES 10.1 /  OFED 1.3.6
>>> 
>>> I know the above is pretty darn old but it would be nice to know what is 
>>> the oldest s/w we can be using?  Note things have been building up until 
>>> now.
>> 
> BTW, I am now trying to compile on a system with ofed 1.4.4.
>> I'll look at my MTT runs later this morning.
>> 
> 
> 
> -- 
> 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 


Re: [OMPI devel] F90 open-mpi module bug

2011-05-21 Thread Jeff Squyres (jsquyres)
Nick's right - changing your test program to use ierr instead of 0 makes it 
compile on OMPI for me.  Hence, the F90 module is actually doing exactly what 
it is supposed to do: tell you when you have a compile time error in your code. 
:)

I'm not sure why it compiles for you on MPICH - perhaps they don't have an 
explicit F90 interface for MPI_ABORT...?

Sent from my phone. No type good. 

On May 21, 2011, at 6:14 AM, "N.M. Maclaren"  wrote:

> On May 21 2011, Dan Reynolds wrote:
>> 
>> ./test_driver.F90:12.39: call mpi_abort(MPI_COMM_WORLD, -1, 0)
> 
> It's unlikely to provoke that particular error, but that call is erroneous.
> It should be something like:
> 
>   integer :: ierror
>   call mpi_abort(MPI_COMM_WORLD, 1, ierror)
> 
> Negative error numbers aren't forbidden, but aren't advisable.  However,
> passing a constant to an INTENT(OUT) argument is a serious no-no.
> 
> I can imagine compilers where it might provoke that error, but I doubt
> that it is the cause.  It's worth fixing and retrying, anyway.
> 
> 
> Regards,
> Nick Maclaren.
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] Fwd: === CREATE FAILURE (trunk) ===

2011-06-22 Thread Jeff Squyres (jsquyres)
VT guys - please fix. 

Sent from my phone. No type good. 

Begin forwarded message:

> From: MPI Team 
> Date: June 22, 2011 9:42:42 PM EDT
> To: test...@open-mpi.org
> Subject: === CREATE FAILURE (trunk) ===
> Reply-To: de...@open-mpi.org
> 

> 
> ERROR: Command returned a non-zero exist status (trunk):
>   make distcheck
> 
> Start time: Wed Jun 22 21:00:02 EDT 2011
> End time:   Wed Jun 22 21:42:42 EDT 2011
> 
> ===
> [... previous lines snipped ...]
> make[7]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/opari/doc'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/opari'
> (cd vtcpcavail && make  top_distdir=../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtcpcavail
>  \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[6]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtcpcavail'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtcpcavail'
> (cd vtdyn && make  top_distdir=../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtdyn \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[6]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtdyn'
> (cd dynattlib && make  top_distdir=../../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtdyn/dynattlib
>  \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[7]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtdyn/dynattlib'
> make[7]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtdyn/dynattlib'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtdyn'
> (cd vtfilter && make  top_distdir=../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtfilter
>  \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[6]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtfilter'
> (cd mpi && make  top_distdir=../../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtfilter/mpi
>  \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[7]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtfilter/mpi'
> make[7]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtfilter/mpi'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtfilter'
> (cd vtjava && make  top_distdir=../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtjava 
> \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[6]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtjava'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtjava'
> (cd vtlibwrapgen && make  top_distdir=../../../../../../openmpi-1.7a1r24808 
> distdir=../../../../../../openmpi-1.7a1r24808/ompi/contrib/vt/vt/tools/vtlibwrapgen
>  \
> am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[6]: Entering directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtlibwrapgen'
> make[6]: Leaving directory 
> `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r24808/ompi/openmpi-1.7a1r24808/_build/ompi/contrib/vt/vt/tools/vtlibwrapgen'
> (cd vtrun && make  to

Re: [OMPI devel] Trunk problem: VT breakage

2011-07-02 Thread Jeff Squyres (jsquyres)
Automake, I guess - that's what does the deps. 

Sent from my phone. No type good. 

On Jun 30, 2011, at 10:28 AM, "Ralph Castain"  wrote:

> I'm surprised that autogen/configure wouldn't catch this, yet it clearly 
> doesn't. I guess it's because the file moved?
> 
> Seems like a bug in the autotools or libtool, perhaps?
> 
> 
> On Jun 30, 2011, at 8:28 AM, Jeff Squyres wrote:
> 
>> I'm betting that this is a problem in the .deps directory; you could
>> 
>> foreach file (`ls ompi/contrib/vt/vt/tools/vtfilter/.deps/*.Po`)
>>    echo $file
>>    rm $file
>>    touch $file
>> end
>> 
>> and then it builds fine (I just tried it).
>> 
>> ...or remove *.Po in that .deps directory and re-autogen/configure/build.
>> 
>> ...or get a fresh checkout, autogen/configure/build.  :-)
>> 
>> 
>> On Jun 30, 2011, at 10:18 AM, Yevgeny Kliteynik wrote:
>> 
>>> Same here:
>>> 
>>> ...
>>>   CXX    vtfilter-vt_filter.o
>>>   CXX    vtfilter-vt_filter_common.o
>>>   CXX    vtfilter-vt_filter_gen.o
>>>   CXX    vtfilter-vt_filter_trc.o
>>>   CXX    vtfilter-vt_filter_trc_hdlr.o
>>>   CXX    vtfilter-vt_filterc.o
>>> make[7]: *** No rule to make target `vt_filthandler.cc', needed by 
>>> `vtfilter-vt_filthandler.o'.  Stop.
>>> 
>>> Note that this happens only on 'svn up' on existing repository.
>>> When doing fresh 'svn co && autogen && configure && make', everything works 
>>> fine.
>>> 
>>> -- YK
>>> 
>>> 
>>> On 30-Jun-11 4:37 PM, Jeff Squyres wrote:
>>>> FWIW, I get the same error as Ralph.  I'm on my laptop battery atm, so I 
>>>> don't want to do a fresh checkout/build.  This is from an "svn up" with a 
>>>> fresh automake/configure:
>>>> 
>>>> [9:32] rtp-jsquyres-8714:~/svn/ompi/ompi/contrib/vt/vt/tools/vtfilter % 
>>>> make
>>>> Making all in .
>>>> make[1]: *** No rule to make target `vt_filthandler.cc', needed by 
>>>> `vtfilter-vt_filthandler.o'.  Stop.
>>>> make: *** [all-recursive] Error 1
>>>> [9:32] rtp-jsquyres-8714:~/svn/ompi/ompi/contrib/vt/vt/tools/vtfilter %
>>>> 
>>>> Is there some file that we need to delete to make the tree build?
>>>> 
>>>> Is the problem in the corresponding .deps file?
>>>> 
>>>> 
>>>> On Jun 30, 2011, at 9:05 AM, Matthias Jurenz wrote:
>>>> 
>>>>> It seems to me that something went wrong during your last update. Since
>>>>> r24803 the source file 'vt_filthandler.cc' is moved to the subdirectory
>>>>> 'old', so if the source file doesn't exist the error message should
>>>>> actually be:
>>>>> 
>>>>> No rule to make target `old/vt_filthandler.cc', needed by
>>>>> `vtfilter-vt_filthandler.o'.  Stop.
>>>>> 
>>>>> (I just tested it by removing the source file by hand)
>>>>> 
>>>>> Does the error occur also with a completely new checkout of the trunk?
>>>>> 
>>>>> Matthias
>>>>> 
>>>>> On Thursday 30 June 2011 03:01:10 Ralph Castain wrote:
>>>>>> It appears I cannot build the trunk on Mac - I hit this issue when I
>>>>>> updated from the trunk and rebuilt from autogen this evening:
>>>>>> 
>>>>>> make[7]: *** No rule to make target `vt_filthandler.cc', needed by
>>>>>> `vtfilter-vt_filthandler.o'.  Stop.
>>>>>> 
>>>>>> Vanilla configure - I didn't turn VT off like I usually do.
>>>>>> 
>>>>>> Any help would be appreciated.
>>>>>> 
>>>>>> 
>>>>>> ___
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> 
>>>>> -- 
>>>>> Matthias Jurenz
>>>>> 
>>>>> Technische Universität Dresden
>>>>> Center for Information Services and High Performance Computing (ZIH)
>>>>> 01062 Dresden, Germany
>>>>> Phone: +49 (351) 463-31945
>>>>> Fax: +49 (351) 463-37773
>>>>> E-Mail: matthias.jur...@tu-dresden.de
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-02 Thread Jeff Squyres (jsquyres)
Do you know which object it is that is being constructed?  When you compile with 
debugging enabled, there are strings in the object struct that identify the file 
and line where the object was created. 
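
For illustration only, here is a minimal sketch of that debug pattern: a
creation macro captures the caller's file and line and stores them in the
object.  Frame #6 of the backtrace quoted below shows opal_obj_new_debug
receiving file="pml_ob1.c" and line=182 in the same spirit.  All names in
the sketch (sketch_object_t, SKETCH_NEW, and so on) are hypothetical and
are not the actual OPAL definitions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object header; NOT OPAL's real struct. */
typedef struct sketch_object {
    int refcount;
    const char *init_file;  /* source file that created the object */
    int init_line;          /* source line that created the object */
} sketch_object_t;

/* Allocate an object and remember where the request came from. */
static sketch_object_t *sketch_new_debug(const char *file, int line)
{
    sketch_object_t *obj = malloc(sizeof(*obj));
    if (NULL == obj) {
        return NULL;
    }
    obj->refcount = 1;
    obj->init_file = file;
    obj->init_line = line;
    return obj;
}

/* The macro expands at the call site, so __FILE__/__LINE__ name the caller. */
#define SKETCH_NEW() sketch_new_debug(__FILE__, __LINE__)

/* Print the recorded origin; in a debugger one would print the struct. */
static void sketch_print_origin(const sketch_object_t *obj)
{
    printf("object %p created at %s:%d\n",
           (void *)obj, obj->init_file, obj->init_line);
}
```

In a debugger, printing the object that is being constructed in the
failing frame should then show those two fields, assuming the build has
debugging enabled.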

Sent from my phone. No type good. 

On Jun 29, 2011, at 8:48 AM, "Xin He"  wrote:

> Hi,
> 
> As I advanced in my implementation of TIPC BTL, I added the component and 
> tried to run hello_c program to test.
> 
> Then I got this segmentation fault. It seemed to happen after the call 
> "mca_btl_tipc_add_procs".
> 
> The error message displayed:
> 
> [oak:23192] *** Process received signal ***
> [oak:23192] Signal: Segmentation fault (11)
> [oak:23192] Signal code:  (128)
> [oak:23192] Failing at address: (nil)
> [oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
> [oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
> [oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
> [oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
> [oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
> [oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
> [oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
> [oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
> [oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
> [oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
> [oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
> [oak:23192] [11] hello_i(main+0x22) [0x400936]
> [oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
> [oak:23192] [13] hello_i() [0x400859]
> [oak:23192] *** End of error message ***
> 
> I used gdb to check the stack:
> (gdb) bt
> #0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
>at ../opal/class/opal_object.h:427
> #1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
> class/opal_list.c:88
> #2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
>at ../../../../opal/class/opal_object.h:427
> #3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
>at pml_ob1_comm.c:55
> #4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
>at ../../../../opal/class/opal_object.h:427
> #5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
>at ../../../../opal/class/opal_object.h:477
> #6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
>file=0x72d62840 "pml_ob1.c", line=182)
>at ../../../../opal/class/opal_object.h:252
> #7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at 
> pml_ob1.c:182
> #8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, 
> requested=0,
>provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
> #9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, argv=0x7fffde50)
>at pinit.c:84
> #10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17
> 
> It seems the error happened when an object is constructed. Any idea why this 
> is happening?
> 
> Thanks.
> 
> Best regards,
> Xin
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-07-02 Thread Jeff Squyres (jsquyres)
Does your LLP send path honor MPI matching ordering?  E.g., if some prior isend 
is already queued, could the LLP send overtake it?
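
To make the ordering concern concrete, here is a rough sketch of a
receiver-side check, assuming ob1-style per-peer sequence numbers like
the hdr_seq field set in the snippet quoted below.  The names
(peer_state_t, try_match) are hypothetical and this is not the actual
ob1 matching code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-peer receive state; names are illustrative only. */
typedef struct peer_state {
    uint16_t expected_seq;  /* next sequence number that may be matched */
} peer_state_t;

/* Returns true if the message may be matched now, or false if it has to
 * be held until earlier messages from the same peer have arrived. */
static bool try_match(peer_state_t *peer, uint16_t hdr_seq)
{
    if (hdr_seq != peer->expected_seq) {
        return false;       /* arrived early: it must not overtake */
    }
    peer->expected_seq++;   /* uint16_t wraps around naturally */
    return true;
}
```

With a check like this, a fast-path message stamped with sequence 2 that
arrives before a queued message with sequence 1 is held back rather than
matched first.  Whether the LLP receive side does something equivalent
is exactly the open question.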

Sent from my phone. No type good. 

On Jun 29, 2011, at 8:27 AM, "Kawashima"  wrote:

> Hi Jeff,
> 
>>> First, we created a new BTL component, 'tofu BTL'. It's nothing special,
>>> just dedicated to our Tofu interconnect. But its latency was not low
>>> enough for us.
>>> 
>>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
>>> It bypasses request object creation in PML and BML/BTL, and sends
>>> a message immediately if possible.
>> 
>> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send immediate")  
>> This call was designed to be part of a latency reduction mechanism.  I 
>> forget offhand what we don't do before calling sendi, but the rationale was 
>> that if the message was small enough, we could skip some steps in the 
>> sending process and "just send it."
> 
> I know sendi, but its latency was not sufficient for us.
> To reach the sendi call, we must:
>  - allocate send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
>  - initialize send request (MCA_PML_OB1_SEND_REQUEST_INIT)
>  - select BTL module (mca_pml_ob1_send_request_start)
>  - select protocol (mca_pml_ob1_send_request_start_btl)
> We want to eliminate these overheads and send more immediately.
> 
> Here is a code snippet:
> 
> 
> 
> #if OMPI_ENABLE_LLP
> static inline int mca_pml_ob1_call_llp_send(void *buf,
>                                             size_t size,
>                                             int dst,
>                                             int tag,
>                                             ompi_communicator_t *comm)
> {
>     int rc;
>     mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
>     mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> 
>     match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
>     match->hdr_common.hdr_flags = 0;
>     match->hdr_ctx = comm->c_contextid;
>     match->hdr_src = comm->c_my_rank;
>     match->hdr_tag = tag;
>     match->hdr_seq = proc->send_sequence + 1;
> 
>     rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
>                            (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
>                            ompi_comm_peer_lookup(comm, dst),
>                            MCA_PML_OB1_HDR_TYPE_MATCH));
> 
>     if (rc == OMPI_SUCCESS) {
>         /* NOTE this is not thread safe */
>         OPAL_THREAD_ADD32(&proc->send_sequence, 1);
>     }
> 
>     return rc;
> }
> #endif
> 
> int mca_pml_ob1_send(void *buf,
>                      size_t count,
>                      ompi_datatype_t * datatype,
>                      int dst,
>                      int tag,
>                      mca_pml_base_send_mode_t sendmode,
>                      ompi_communicator_t * comm)
> {
>     int rc;
>     mca_pml_ob1_send_request_t *sendreq;
> 
> #if OMPI_ENABLE_LLP
>     /* try to send message via LLP if
>      *   - one of LLP modules is available, and
>      *   - datatype is basic, and
>      *   - data is small, and
>      *   - communication mode is standard, buffered, or ready, and
>      *   - destination is not myself
>      */
>     if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
>         (datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
>         (sendmode == MCA_PML_BASE_SEND_STANDARD ||
>          sendmode == MCA_PML_BASE_SEND_BUFFERED ||
>          sendmode == MCA_PML_BASE_SEND_READY) &&
>         (dst != comm->c_my_rank)) {
>         rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst, tag,
>                                        comm);
>         if (rc != OMPI_ERR_NOT_AVAILABLE) {
>             /* successfully sent out via LLP or unrecoverable error occurred */
>             return rc;
>         }
>     }
> #endif
> 
>     MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
>     if (rc != OMPI_SUCCESS)
>         return rc;
> 
>     MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
>                                   buf,
>                                   count,
>                                   datatype,
>                                   dst, tag,
>                                   comm, sendmode, false);
> 
>     PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
>                              &(sendreq)->req_send.req_base,
>                              PERUSE_SEND);
> 
>     MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
>     if (rc != OMPI_SUCCESS) {
>         MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
>         return rc;
>     }
> 
>     ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> 
>     rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
>     ompi_request_free( (ompi_request_t**)&sendreq );
>     return rc;
> }
> 
> 
> 
> mca_pml_ob1_send is the body of MPI_Send in Open MPI. The regions guarded
> by OMPI_ENABLE_LLP were added by us.
> 
> We don't have to use a send request if we could "send immediately".
> So we try to se

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24830

2011-07-02 Thread Jeff Squyres (jsquyres)
Were all the issues with this code fixed?  There were m4 issues and Solaris 
issues, IIRC. 

Sent from my phone. No type good. 

On Jun 28, 2011, at 9:28 AM, "klit...@osl.iu.edu"  wrote:

> Author: kliteyn
> Date: 2011-06-28 10:28:29 EDT (Tue, 28 Jun 2011)
> New Revision: 24830
> URL: https://svn.open-mpi.org/trac/ompi/changeset/24830
> 
> Log:
> Supporting dynamic SL (#2674)
> 
> - Added enable/disable configuration parameter for dynamic SL
> - All the dynamic SL code is conditionalized
> - Removed libibmad dependency
> - Using only one include - ib_types.h (part of opensm-devel package)
> - Removed all the macro and data types definitions, using the
>   existing definitions from ib_types.h instead
> - general cleaning here and there
> 
> The async mode is not implemented yet - stay tuned...
> 
> 
> Text files modified: 
>   trunk/ompi/config/ompi_check_openib.m4 |38  
>
>   trunk/ompi/mca/btl/openib/btl_openib.h | 5  
>
>   trunk/ompi/mca/btl/openib/btl_openib_mca.c |10  
>
>   trunk/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c |   309 
> +-- 
>   4 files changed, 182 insertions(+), 180 deletions(-)
> 
> Modified: trunk/ompi/config/ompi_check_openib.m4
> ==
> --- trunk/ompi/config/ompi_check_openib.m4(original)
> +++ trunk/ompi/config/ompi_check_openib.m42011-06-28 10:28:29 EDT (Tue, 
> 28 Jun 2011)
> @@ -155,11 +155,21 @@
>  [$ompi_cv_func_ibv_create_cq_args],
>  [Number of arguments to 
> ibv_create_cq])])])
> 
> +#
> +# OpenIB dynamic SL
> +#
> +AC_ARG_ENABLE([openib-dynamic-sl],
> +[AC_HELP_STRING([--enable-openib-dynamic-sl],
> +[Enable openib BTL to query Subnet Manager for IB SL 
> (default: enabled)])],
> +[enable_openib_dynamic_sl="$enableval"],
> +[enable_openib_dynamic_sl="yes"])
> +
> # Set these up so that we can do an AC_DEFINE below
> # (unconditionally)
> $1_have_xrc=0
> $1_have_rdmacm=0
> $1_have_ibcm=0
> +$1_have_dynamic_sl=0
> 
> # If we have the openib stuff available, find out what we've got
> AS_IF([test "$ompi_check_openib_happy" = "yes"],
> @@ -176,6 +186,19 @@
>AC_CHECK_FUNCS([ibv_create_xrc_rcv_qp], [$1_have_xrc=1])
>fi
> 
> +   if test "$enable_openib_dynamic_sl" = "yes"; then
> +   # We need ib_types.h file, which is installed with 
> opensm-devel
> +   # package. However, ib_types.h has a bad include directive,
> +   # which will cause AC_CHECK_HEADER to fail.
> +   # So instead, we will look for another file that is also
> +   # installed as part of opensm-devel package and included in
> +   # ib_types.h, but it doesn't include any other IB-related 
> files.
> +   AC_CHECK_HEADER([infiniband/complib/cl_types_osd.h],
> +   [$1_have_dynamic_sl=1],
> +   [AC_MSG_ERROR([opensm-devel package not found 
> - please install it or disable dynamic SL support with 
> \"--disable-openib-dynamic-sl\"])],
> +   [])
> +   fi
> +
># Do we have a recent enough RDMA CM?  Need to have the
># rdma_get_peer_addr (inline) function (originally appeared
># in OFED v1.3).
> @@ -244,6 +267,15 @@
> else
> AC_MSG_RESULT([no])
> fi
> +
> +AC_MSG_CHECKING([if dynamic SL is enabled])
> +AC_DEFINE_UNQUOTED([OMPI_ENABLE_DYNAMIC_SL], [$$1_have_dynamic_sl],
> +[Enable features required for dynamic SL support])
> +if test "1" = "$$1_have_dynamic_sl"; then
> +AC_MSG_RESULT([yes])
> +else
> +AC_MSG_RESULT([no])
> +fi
> 
> AC_MSG_CHECKING([if OpenFabrics RDMACM support is enabled])
> AC_DEFINE_UNQUOTED([OMPI_HAVE_RDMACM], [$$1_have_rdmacm],
> @@ -267,7 +299,11 @@
> AC_MSG_RESULT([no])
> fi
> 
> -CPPFLAGS="$ompi_check_openib_$1_save_CPPFLAGS"
> +AS_IF([test -z "$ompi_check_openib_dir"],
> +  [openib_include_dir="/usr/include"],
> +  [openib_include_dir="$ompi_check_openib_dir/include"])
> +
> +CPPFLAGS="$ompi_check_openib_$1_save_CPPFLAGS 
> -I$openib_include_dir/infiniband"
> LDFLAGS="$ompi_check_openib_$1_save_LDFLAGS"
> LIBS="$ompi_check_openib_$1_save_LIBS"
> 
> 
> Modified: trunk/ompi/mca/btl/openib/btl_openib.h
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib.h(original)
> +++ trunk/ompi/mca/btl/openib/btl_openib.h2011-06-28 10:28:29 EDT (Tue, 
> 28 Jun 2011)
> @@ -52,6 +52,7 @@
> BEGIN_C_DECLS
>

Re: [OMPI devel] RFC: CUDA register sm and openib host memory

2011-08-02 Thread Jeff Squyres (jsquyres)
Rolf -

Can you send a cumulative SVN diff against the SVN HEAD?

Sent from my phone. No type good. 

On Jul 28, 2011, at 5:52 PM, "Rolf vandeVaart"  wrote:

> WHAT: Add CUDA registration of host memory in sm and openib BTLs.
> 
> TIMEOUT: 8/4/2011
> 
> DETAILS: In order to improve performance of sending GPU device memory,
> we need to register the host memory with the CUDA framework.  These
> changes allow that to happen.  These changes are somewhat different
> from what I proposed a while ago and I think a lot cleaner.  There is
> a new memory pool flag that indicates whether a piece of memory
> should be registered.  This allows us to register the sm memory and
> the pre-posted openib memory.
> 
> The CUDA specific code is in the ompi/mca/common/cuda directory.
> 
> Do not look at the configure.m4 code, as that is still not done.
> 
> Here is a link to the proposed changes:
> https://bitbucket.org/rolfv/ompi-cuda-register
> 
> Here is a list of files that would change.
> M   VERSION
> M   configure.ac
> M   ompi/mca/btl/openib/btl_openib_component.c
> M   ompi/mca/btl/openib/Makefile.am
> M   ompi/mca/mpool/sm/Makefile.am
> M   ompi/mca/mpool/sm/mpool_sm_module.c
> M   ompi/mca/mpool/mpool.h
> M   ompi/mca/pml/ob1/pml_ob1_sendreq.h
> A   ompi/mca/common/cuda
> A   ompi/mca/common/cuda/configure.m4
> A   ompi/mca/common/cuda/common_cuda.c
> A   ompi/mca/common/cuda/help-mpi-common-cuda.txt
> A   ompi/mca/common/cuda/Makefile.am
> A   ompi/mca/common/cuda/common_cuda.h
> M   ompi/class/ompi_free_list.c
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24977

2011-08-02 Thread Jeff Squyres (jsquyres)
The question that needs to be answered in the README is: when should one use 
openib/ob1 vs. MXM?  Users will need to know this. 

Also see the part in the README about the different PMLs - you might want to 
write more there. 

Sent from my phone. No type good. 

On Aug 2, 2011, at 10:30 AM, "mi...@osl.iu.edu"  wrote:

> Author: miked
> Date: 2011-08-02 10:30:11 EDT (Tue, 02 Aug 2011)
> New Revision: 24977
> URL: https://svn.open-mpi.org/trac/ompi/changeset/24977
> 
> Log:
> code and readme  updates, some refactoring
> Text files modified: 
>   trunk/NEWS   | 1
>  
>   trunk/README | 5 ++ 
>  
>   trunk/ompi/mca/mtl/mxm/mtl_mxm_cancel.c  | 4 +- 
>  
>   trunk/ompi/mca/mtl/mxm/mtl_mxm_probe.c   |16    
>  
>   trunk/ompi/mca/mtl/mxm/mtl_mxm_recv.c|54 
> ++--
>   trunk/ompi/mca/mtl/mxm/mtl_mxm_request.h | 2
>  
>   trunk/ompi/mca/mtl/mxm/mtl_mxm_send.c|77 
> --- 
>   7 files changed, 74 insertions(+), 85 deletions(-)
> 
> Modified: trunk/NEWS
> ==
> --- trunk/NEWS(original)
> +++ trunk/NEWS2011-08-02 10:30:11 EDT (Tue, 02 Aug 2011)
> @@ -62,6 +62,7 @@
>   OPAL levels - intended for use when configuring without MPI support
> - Modified paffinity system to provide warning when bindings result in
>   being "bound to all", which is equivalent to "not bound"
> +- Added Mellanox MTL layer implementation (mxm)
> 
> 
> 1.5.3
> 
> Modified: trunk/README
> ==
> --- trunk/README(original)
> +++ trunk/README2011-08-02 10:30:11 EDT (Tue, 02 Aug 2011)
> @@ -509,6 +509,9 @@
> or
> shell$ mpirun --mca pml cm ...
> 
> +- MXM MTL is an transport layer utilizing various Mellanox proprietary
> +  technologies and providing better scalability and performance for large 
> scale jobs
> +
> - Myrinet MX (and Open-MX) support is shared between the 2 internal
>   devices, the MTL and the BTL.  The design of the BTL interface in
>   Open MPI assumes that only naive one-sided communication
> @@ -707,7 +710,7 @@
> --with-mxm=
>   Specify the directory where the Mellanox MXM library and
>   header files are located.  This option is generally only necessary
> -  if the InfiniPath headers and libraries are not in default
> +  if the MXM headers and libraries are not in default
>   compiler/linker search paths.
> 
>   MXM is the support library for Mellanox network adapters.
> 
> Modified: trunk/ompi/mca/mtl/mxm/mtl_mxm_cancel.c
> ==
> --- trunk/ompi/mca/mtl/mxm/mtl_mxm_cancel.c(original)
> +++ trunk/ompi/mca/mtl/mxm/mtl_mxm_cancel.c2011-08-02 10:30:11 EDT (Tue, 
> 02 Aug 2011)
> @@ -18,9 +18,9 @@
> mxm_error_t err;
> mca_mtl_mxm_request_t *mtl_mxm_request = (mca_mtl_mxm_request_t*) 
> mtl_request;
> 
> -err = mxm_req_cancel(&mtl_mxm_request->mxm_request);
> +err = mxm_req_cancel(mtl_mxm_request->mxm_base_request);
> if (MXM_OK == err) {
> -err = mxm_req_test(&mtl_mxm_request->mxm_request);
> +err = mxm_req_test(mtl_mxm_request->mxm_base_request);
> if (MXM_OK == err) {
> mtl_request->ompi_req->req_status._cancelled = true;
> 
> mtl_mxm_request->super.completion_callback(&mtl_mxm_request->super);
> 
> Modified: trunk/ompi/mca/mtl/mxm/mtl_mxm_probe.c
> ==
> --- trunk/ompi/mca/mtl/mxm/mtl_mxm_probe.c(original)
> +++ trunk/ompi/mca/mtl/mxm/mtl_mxm_probe.c2011-08-02 10:30:11 EDT (Tue, 
> 02 Aug 2011)
> @@ -18,21 +18,21 @@
> int *flag, struct ompi_status_public_t *status)
> {
> mxm_error_t err;
> -mxm_req_t req;
> +mxm_recv_req_t req;
> 
> -req.state  = MXM_REQ_NEW;
> -req.mq = (mxm_mq_h)comm->c_pml_comm;
> -req.tag= tag;
> -req.tag_mask   = (tag == MPI_ANY_TAG) ? 0 : 0xU;
> -req.conn   = (src == MPI_ANY_SOURCE) ? NULL : 
> ompi_mtl_mxm_conn_lookup(comm, src);
> +req.base.state  = MXM_REQ_NEW;
> +req.base.mq = (mxm_mq_h)comm->c_pml_comm;
> +req.tag= tag;
> +req.tag_mask   = (tag == MPI_ANY_TAG) ? 0 : 0xU;
> +req.base.conn   = (src == MPI_ANY_SOURCE) ? NULL : 
> ompi_mtl_mxm_conn_lookup(comm, src);
> 
> err = mxm_req_probe(&req);
> if (MXM_OK == err) {
> *flag = 1;
> if (MPI_STATUS_IGNORE != status) {
> -status->MPI_SOURCE = *(int *)mxm_conn_get_context(req.conn);
> +status->MPI_SOURCE = *(int *)mxm_c

Re: [OMPI devel] [OMPI bugs] [Open MPI] #2888: base.h inclusion breaks Solaris build

2011-10-18 Thread Jeff Squyres (jsquyres)
Terry -

Did #2887 fix this already?

Sent from my phone. No type good. 

On Oct 18, 2011, at 6:19 AM, "Open MPI"  wrote:

> #2888: base.h inclusion breaks Solaris build
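> ---------------------+------------------------------------------------
> Reporter:  tdd       |       Owner:  tdd
>     Type:  defect    |      Status:  new
> Priority:  blocker   |   Milestone:  Open MPI 1.5.5
>  Version:  trunk     |    Keywords:
> ---------------------+------------------------------------------------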
> #2887 breaks the Solaris build because opal/sys/timer.h and
> opal/mca/timer/base/base.h cause a redeclaration error for opal_timer_t.
> This is a similar issue we saw with r25157 that r25170 fixed.
> 
> -- 
> Ticket URL: 
> Open MPI 
> 
> ___
> bugs mailing list
> b...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/bugs


Re: [OMPI devel] [OMPI bugs] [Open MPI] #2888: base.h inclusion breaks Solaris build

2011-10-18 Thread Jeff Squyres (jsquyres)
Never mind; I just read your text more carefully - #2887 caused the problem. 

Sent from my phone. No type good. 

On Oct 18, 2011, at 6:19 AM, "Open MPI"  wrote:

> #2888: base.h inclusion breaks Solaris build
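> ---------------------+------------------------------------------------
> Reporter:  tdd       |       Owner:  tdd
>     Type:  defect    |      Status:  new
> Priority:  blocker   |   Milestone:  Open MPI 1.5.5
>  Version:  trunk     |    Keywords:
> ---------------------+------------------------------------------------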
> #2887 breaks the Solaris build because opal/sys/timer.h and
> opal/mca/timer/base/base.h cause a redeclaration error for opal_timer_t.
> This is a similar issue we saw with r25157 that r25170 fixed.
> 
> -- 
> Ticket URL: 
> Open MPI 
> 
> ___
> bugs mailing list
> b...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/bugs


Re: [OMPI devel] autogen.sh generates broken configure on FreeBSD-8.2

2011-12-21 Thread Jeff Squyres (jsquyres)
Paul -

Are you running autogen from the tarballs in your testing?  You probably 
shouldn't - we have users just run configure and make. We also bootstrap the 
tarballs with the most recent config.sub and config.guess (i.e., more recent 
than what comes with the most recent Autotools). 
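
The "keep the NEWER of the two config.guess files" idea quoted below
could be sketched roughly as follows.  This is an illustration only: a
real fix would live in autogen.sh as shell, and the helper name
pick_newer is hypothetical.  It assumes each script's timestamp='...'
value has already been extracted as an ISO YYYY-MM-DD string, which
sorts correctly as plain text.

```c
#include <string.h>

/* Return the path whose timestamp is newer; on a tie, keep the first.
 * Assumes both stamps are ISO "YYYY-MM-DD" strings, so that plain
 * lexicographic comparison matches chronological order. */
static const char *pick_newer(const char *path_a, const char *stamp_a,
                              const char *path_b, const char *stamp_b)
{
    return (strcmp(stamp_b, stamp_a) > 0) ? path_b : path_a;
}
```

Given the 2003-07-02 timestamp reported below for the system copy, any
bundled copy with a later date would be the one kept.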

Sent from my phone. No type good. 

On Dec 20, 2011, at 9:55 PM, "Paul H. Hargrove"  wrote:

> While dealing w/ GNU-vs-Berkeley Make issues, I mentioned in passing that I 
> wasn't able to autogen on my FreeBSD tester because the resulting configure 
> failed.  The specific failure I encountered was:
>> configure: error: No atomic primitives available for amd64-unknown-freebsd8.2
> 
> The problem boils down to the difference in the following:
> 
>> $ /usr/local/share/autoconf-2.68/config.guess
>> amd64-unknown-freebsd8.2
>> $ openmpi-1.5.5rc1/config/config.guess
>> x86_64-unknown-freebsd8.2
> 
> These differ in the arch identifier, which then causes (at least) 
> opal/config/opal_config_asm.m4 to decide there is no atomics support for the 
> (unknown) architecture.  The included hwloc also appears unhappy w/ 
> arch=amd64, but at least that is non-fatal.  I cannot (yet?) say what else is 
> broken due to this disagreement in system tuple.  I can say that adding 
> "|amd64-*" in the appropriate spot in opal/config/opal_config_asm.m4 is 
> sufficient to get past the configure failure.
> 
> The basic problem is that this system's config.guess is ancient 
> (timestamp='2003-07-02') despite the recent autoconf-2.68.
> I suggest that autogen.sh should include logic to keep the NEWER of the 
> config/config.guess and the one that "automake --copy" wishes to install.
> 
> While looking into this I also noted something "odd" in autogen.sh:
> Why is ompi_autoconf_version="2.59" when there is ALSO a check for 2.60 or 
> later?
> 
> Note that I don't think this is worth fixing for 1.5.5.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] autogen.sh generates broken configure onFreeBSD-8.2

2011-12-21 Thread Jeff Squyres (jsquyres)
You don't need to re-run autogen if you edit a Makefile.am. 

To avoid older config.foo files, you might be able to edit configure directly, 
or upgrade Autotools...?  I am specifically wondering if the config.guess 
issues you ran into are from the results that we return from our config.foo 
files or the ones from your Autotools. 

Sent from my phone. No type good. 

On Dec 21, 2011, at 8:41 AM, "Paul H. Hargrove"  wrote:

> I only ran autogen after I had edited a Makefile.am or a .m4 file.
> 
> -Paul
> 
> On 12/21/2011 4:58 AM, Jeff Squyres (jsquyres) wrote:
>> Paul -
>> 
>> Are you running autogen from the tarballs in your testing?  You probably 
>> shouldn't - we have users just run configure and make. We also bootstrap the 
>> tarballs w the most recent config.sub and .guess (i.e., more recent than 
>> what comes w the most recent Autotools).
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] RFC: Add "virbr0" to [btl|oob]_tcp_if_exclude?

2012-02-10 Thread Jeff Squyres (jsquyres)
Check out #220 now; I updated it. 

Sent from my phone. No type good. 

On Feb 10, 2012, at 4:46 PM, "Jeff Squyres"  wrote:

> On Feb 10, 2012, at 3:32 PM, Paul H. Hargrove wrote:
> 
>> The point of the question isn't related to WHY eth8 is useless - just assume 
>> it is.
>> Assume it is UP, but useless for whatever reasons motivated writing FAQ #220.
>> It could be Terry's example of a port connected to the service processor.
>> 
>> The concern is what happens in this situation when the user, following the 
>> advice in the FAQ, passes an explicit setting for btl_tcp_if_exclude, which 
>> DOES NOT include virbr0?
>> They don't know it was there before, or that it needs to be there (the FAQ 
>> states that lo MUST be included).
>> So, by following the FAQ they don't resolve their problem.
>> OMPI ceases any attempts use of eth8 (or whatever), but loss of the implicit 
>> virbr0 from the exclude list results in their system attempting to use 
>> virbr0 (and thus continue to fail).  Right?
>> 
>> Maybe the FAQ just needs an update to address my concern.
> 
> Got it.  Sure, I can update the faq to be a bit more loose in the definition 
> of what must be excluded.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
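
To make Paul's concern concrete: an explicit btl_tcp_if_exclude value replaces the implicit default rather than adding to it, so the user has to repeat the default entries. Below is a small sketch of building such a value. The default list in it is an assumption for illustration (the real default depends on the Open MPI version, and virbr0 is only proposed in this RFC), and the helper names are hypothetical.

```python
# Assumed implicit default, for illustration only; check your Open MPI
# version for the real one (the RFC in this thread proposes adding virbr0).
ASSUMED_DEFAULT_EXCLUDE = ["lo", "virbr0"]

def exclude_value(extra_interfaces, defaults=ASSUMED_DEFAULT_EXCLUDE):
    """Build a comma-separated if_exclude value that keeps the defaults."""
    merged = list(defaults)
    for name in extra_interfaces:
        if name not in merged:
            merged.append(name)
    return ",".join(merged)

def mpirun_args(extra_interfaces):
    """Return mpirun arguments that exclude the defaults plus the extras."""
    return ["--mca", "btl_tcp_if_exclude", exclude_value(extra_interfaces)]

print(" ".join(mpirun_args(["eth8"])))
```

The point of merging instead of overwriting is exactly the failure mode described above: excluding only eth8 would silently put virbr0 back in play.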



Re: [OMPI devel] non-portable code in examples/Makefile

2012-02-21 Thread Jeff Squyres (jsquyres)
That is truly bizarre "make" behavior. 

Heads up that in the upcoming Fortran revamp, we *only* use FC. I.e., there's 
only one wrapper compiler, mpifort (mpif77 and mpif90 still exist, but only as 
symlinks to mpifort, signifying that mpifort is the way of the future). 

This was done because there have been no f77 compilers for decades (literally), 
and no f90 compilers for 10+ years. All the fortran compiler vendors have 
long-since moved to a single compiler executable name (e.g., ifort, gfortran), 
so mpifort just reflects that. 

Sent from my phone. No type good. 

On Feb 21, 2012, at 5:01 AM, "Paul H. Hargrove"  wrote:

> Thanks, Ralph.
> Excellent point about not needing to use the "FC" name with its special 
> (absurd?) behavior.
> 
> -Paul
> 
> On 2/21/2012 1:52 AM, Ralph Castain wrote:
>> 
>> I went ahead and applied this, with a tweak. There is no reason to call our 
>> flag "FC" as all we use it for is to call the right wrapper. So I renamed it 
>> to something less problematic.
>> 
>> On Feb 21, 2012, at 1:20 AM, Paul H. Hargrove wrote:
>> 
>>> And while we are looking at examples/Makefile on Solaris-10, why are the 
>>> F77 examples getting built w/ mpif90?
>>> Because w/ the Solaris make setting FC also silently sets F77 (yes, I am 
>>> NOT kidding)!
>>> So, reordering the F77= and FC= lines in Makefile resolves that 
>>> mis-behavior.
>>> 
>>> Attached is my patch to fix both F77/FC and the "better" ompi_info queries 
>>> mentioned in my previous post.
>>> This REPLACES the patch in the previous post.
>>> 
>>> -Paul
>>> 
>>> On 2/20/2012 11:36 PM, Paul H. Hargrove wrote:
 
>>>> The addition on Monday of the Java cases to examples/Makefile has shown 
>>>> that the default "make" in Solaris-10 will stop on the first failed grep 
>>>> command in the "all" rule: 
>>>>> $ make 
>>>>> mpicc -g   -o hello_c hello_c.c 
>>>>> mpicc -g   -o ring_c ring_c.c 
>>>>> mpicc -g   -o connectivity_c connectivity_c.c 
>>>>> mpic++ -g   -o hello_cxx hello_cxx.cc 
>>>>> mpic++ -g   -o ring_cxx ring_cxx.cc 
>>>>> mpif90 -g hello_f77.f -o hello_f77 
>>>>> mpif90 -g ring_f77.f -o ring_f77 
>>>>> mpif90 -g hello_f90.f90 -o hello_f90 
>>>>> mpif90 -g ring_f90.f90 -o ring_f90 
>>>>> *** Error code 1 
>>>>> The following command caused the error: 
>>>>> if test "`ompi_info --parsable | grep bindings:java:yes`" != ""; then \ 
>>>>> make Hello.class; \ 
>>>>> fi 
>>>>> make: Fatal error: Command failed for target `all' 
>>>> 
>>>> The addition of java did NOT break anything, but exposed a pre-existing 
>>>> problem which was not evident in my prior testing because all language 
>>>> bindings were being built prior to adding java. 
>>>> 
>>>> The attached patch resolves the problem in my (admittedly minimal) testing 
>>>> with the smallest possible change. 
>>>> However, an entirely different approach avoids both "test" and "true" and 
>>>> simply looks like: 
>>>> @ if ompi_info --parsable | grep bindings:cxx:yes >/dev/null; then 
>>>> I have *also* tested that approach, and it works fine too. 
>>>> 
>>>> I *did* warn that the introduction of the java bindings would bring 
>>>> collateral damage. 
>>>> I just didn't anticipate encountering it personally. 
>>>> 
>>>> -Paul 
>>> 
>>> -- 
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Future Technologies Group
>>> HPC Research Department   Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] Agenda item

2012-04-16 Thread Jeff Squyres (jsquyres)
Terry -

Please add the Fortran revamp to the agenda tomorrow. Thanks. 

Sent from my phone. No type good. 


Re: [OMPI devel] algorithm selection in open mpi

2012-05-07 Thread Jeff Squyres (jsquyres)
George will have to answer that in detail, but note that if you modify the 
tuned coll module source code, you can simply "make install" in 
ompi/mca/coll/tuned.  That will re-build the coll tuned module and install it 
in the plugin directory.  You don't even need to recompile your MPI app, since 
all of the coll tuned module is dynamically opened at run-time.

That's *much* faster than re-compiling/re-installing the whole of Open MPI.  
w00t. :-)


On May 7, 2012, at 4:24 AM, roswan ismail wrote:

> hi all..
> i already got the results from all algorithm used in open mpi for bcast. If i 
> want to modify binomial algorithm for example, there is a simpler way to do 
> that? or i just need to modify "ompi_coll_tuned_bcast_intra_binomial" 
> function, then recompile and force the system to broadcast the data using a 
> modified binomial?? is it the right way?? thanks
>  
> 
> Roswan Ismail,
> FSKIK,
> Universiti Pendidikan Sultan Idris,
> Tanjong Malim, Perak, Malaysia.
> iewa...@gmail.com
> ros...@fskik.upsi.edu.my
> 
> From: George Bosilca 
> To: Open MPI Developers  
> Sent: Tuesday, April 3, 2012 9:06 PM
> Subject: Re: [OMPI devel] algorithm selection in open mpi
> 
> Of course !!! ;)
> 
> First set coll_tuned_use_dynamic_rules to 1, and then use 
> coll_tuned_dynamic_rules_filename to specify a file containing the selection 
> logic. This is kind of tricky to write, so we don't advertise it too widely. I 
> added an example below; contact me privately if you need more info.
> 
>   Thanks,
> george.
> 
> 
> 1 # num of collectives
> 3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
> 1 # number of com sizes
> 64 # comm size 8
> 2 # number of msg sizes
> 0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
> 8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
> # end of first collective
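
Since the rules file is described as tricky to write, here is a rough reader for the layout in George's example, which can help sanity-check a file before handing it to Open MPI. The grammar is only my reading of the comments in that example (collective count, then per collective: ID, communicator-size count, then per size: message-rule count and four integers per rule); it is not taken from the tuned component's source. Note that the example's comment says "comm size 8" while the value is 64; the sketch simply takes the number as written.

```python
def tokens(text):
    """Yield integer tokens, ignoring '#' comments and blank lines."""
    for line in text.splitlines():
        for word in line.split("#", 1)[0].split():
            yield int(word)

def parse_rules(text):
    """Parse the layout as read from the example's comments.

    Returns {collective_id: {comm_size: [(msg_size, algorithm, topo, segsize)]}}.
    """
    it = tokens(text)
    rules = {}
    for _ in range(next(it)):            # number of collectives
        coll_id = next(it)
        per_comm = rules.setdefault(coll_id, {})
        for _ in range(next(it)):        # number of communicator sizes
            comm_size = next(it)
            entries = per_comm.setdefault(comm_size, [])
            for _ in range(next(it)):    # number of message-size rules
                entries.append((next(it), next(it), next(it), next(it)))
    return rules

EXAMPLE = """\
1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # for message size 0
8192 2 0 0 # 8k+
# end of first collective
"""

print(parse_rules(EXAMPLE))
```

A file with too few numbers makes the generator run out, which surfaces as an exception instead of a silently truncated rule set.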
> 
> 
> On Apr 3, 2012, at 09:01 , Pavel Mezentsev wrote:
> 
>> Is there a way to specify collective depending on the size of the message 
>> and number of processes?
>> 
>> Regards,
>> Pavel Mezentsev
>> 
>> 2012/4/3 George Bosilca 
>> Roswan,
>> 
>> There are simpler solutions to achieve this. We have a built-in mechanism 
>> to select a specific collective implementation. Here is what you have to add 
>> in your .openmpi/mca-params.conf (or as MCA argument on the command line):
>> 
>> coll_tuned_use_dynamic_rules = 1 
>> coll_tuned_bcast_algorithm = 6
>> 
>> The first one activates the dynamic selection of collective algorithms, while 
>> the second one forces all broadcasts to be of type 6 (binomial tree). Btw, 
>> once you set the first one, do a quick "ompi_info --param coll tuned" to see 
>> the list of all possible options for the collective algorithm selection.
>> 
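
The two settings can be given either in mca-params.conf or on the mpirun command line. The sketch below renders one set of parameters both ways; the helper names are hypothetical, and only the parameter names and values come from the message.

```python
def conf_lines(params):
    """Render parameters as 'name = value' lines for mca-params.conf."""
    return "\n".join(f"{name} = {value}" for name, value in params.items())

def mpirun_mca_args(params):
    """Render the same parameters as '--mca name value' arguments."""
    args = []
    for name, value in params.items():
        args += ["--mca", name, str(value)]
    return args

# Values from the message: enable dynamic rules, force bcast algorithm 6.
params = {
    "coll_tuned_use_dynamic_rules": 1,
    "coll_tuned_bcast_algorithm": 6,
}
print(conf_lines(params))
print(" ".join(mpirun_mca_args(params)))
```

Keeping the parameters in one dictionary makes it harder to set the algorithm while forgetting the use_dynamic_rules switch that activates it.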
>>   george.
>>   
>> On Apr 2, 2012, at 23:10 , roswan ismail wrote:
>> 
>>> Hi all..
>>>  
>>> I am Roswan Ismail from Malaysia. I am focusing on MPI communication 
>>> performance on quad-core cluster at my university. I used Open MPI-1.4.3 
>>> and measurements were done using scampi benchmark.
>>>  
>>> As I know, open MPI used multiple algorithms to broadcast data (MPI_BCAST) 
>>> such as binomial, pipeline, binary tree, basic linear and split binary 
>>> tree. All these algorithms will be used based on message size and 
>>> communicator size. For example, binomial is used when message size to be 
>>> broadcasted is small while pipeline used for broadcasting a large message.
>>>  
>>> What I want to do now is, to use fixed algorithm i.e binomial for all 
>>> message size. I want to see and compare the results with the default 
>>> results. So, I was modified coll_tuned_decision_fixed.c which is located in 
>>> open mpi-1.4.3/ompi/mca/coll/tuned by returning binomial algorithm for all 
>>> condition. Then I recompile the files but the problem is, the results 
>>> obtained is same as default. It seems I do not do any changes to the codes.
>>>  
>>> So could you guys tell me the right way to do that.
>>>  
>>> Many thanks
>>>  
>>> Roswan Binti Ismail,
>>> FTMK,
>>> Univ. Pend. Sultan Idris,
>>> Tg Malim, Perak.
>>> Pej: 05-4505173
>>> H/P: 0123588047
>>> iewa...@gmail.com
>>> ros...@ftmk.upsi.edu.my


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http:/

[OMPI devel] The Architecture of Open Source Applications (vol 2)

2012-05-08 Thread Jeff Squyres (jsquyres)
I wrote a chapter about Open MPI in "The Architecture of Open Source
Applications, volume 2", which was just made available in dead tree form
today:


http://blogs.cisco.com/performance/the-architecture-of-open-source-applications-volume-ii/

All royalties from this book go to Amnesty International (I don't get
anything).

There's lots to learn from this book (and volume 1, too!).  The designs of
successful, open source software packages that you'll recognize are in
these books, including (in no particular order): Open MPI, Bash, GDB,
Puppet, Eclipse, Mailman, the Hadoop distributed filesystem, LLVM, Git,
Sendmail, ...and many others.

I'm told that PDFs will be available soon.  Additionally, in good open
source form, all the content from this book will be freely available at
aosabook.org in a week or so (all the content from volume 1 is already
there).


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/







Re: [OMPI devel] Unable to set flags using platform files in the 1.6 release

2012-05-23 Thread Jeff Squyres (jsquyres)
Can you send some output showing that those flags aren't passed through, like 
some output from "make V=1" and/or from config.log?

Offhand, I don't know if we ever formally supported setting env variables other 
than enable and with flag variables in the platform files...?

Sent from my phone. No type good. 

On May 23, 2012, at 12:49 PM, "Gunter, David O"  wrote:

> I am trying to set LDFLAGS, CFLAGS, etc, in a platform file but the 1.6 
> release does not seem to pick these up.
> 
> Here's the tail end of one of our platform files, for building with the 
> latest PGI compilers:
> 
> LDFLAGS="-nomp -lnuma"
> CFLAGS="-I/opt/panfs/include"
> CXXFLAGS="-I/opt/panfs/include"
> FCFLAGS="-I/opt/panfs/include"
> FFLAGS="-I/opt/panfs/include"
> CCASFLAGS="-I/opt/panfs/include"
> 
> The same platform file will configure the 1.4.5 release just fine but does 
> not work with 1.6. If I set these variables in my environment and then run 
> configure, it works just fine - as expected.
> 
> Has anyone else noticed this behavior?
> 
> -david
> --
> David Gunter
> HPC-3: Infrastructure Team
> Los Alamos National Laboratory



Re: [OMPI devel] RFC: OMPI git mirror on github.com

2012-08-18 Thread Jeff Squyres (jsquyres)
FWIW: Ralph, I think Mike is proposing that we use the built in github SVN 
functionality. I.E., you can use git or SVN - both can read or write to the 
same backend repo. Pretty clever of github, actually. See the github blog entry 
he referenced, if you care.

But I agree: although dvcs are very nice and have many upsides, this would be a 
large change and there are downsides, too. Would definitely require more 
discussion, developer buy in, and planning, at a minimum.

Sent from my phone. No type good.

On Aug 18, 2012, at 11:28 AM, "Ralph Castain" <r...@open-mpi.org> wrote:


On Aug 18, 2012, at 8:21 AM, Mike Dubman <mike.o...@gmail.com> wrote:

re item (5):

The current svn tree can be set as read-only and serve as a reference for old 
commit numbers.
It is rarely used anyway to search through historic commit numbers, and that 
can be done in a read-only historic tree.

I use it a lot for old commits, but agree it is read-only for that purpose.


All other items can use the svn interface of github and stay w/o any change.

Yeah, we've had experience with svn to git - no thanks!


It is pretty minor change (mostly mental) and pretty big gain

Guess we can agree to disagree - I found git to be awkward and a royal pain, 
especially when someone commits without doing a rebase (which happens a lot). 
No thanks.






On Sat, Aug 18, 2012 at 3:46 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
On Aug 18, 2012, at 8:27 AM, Jeff Squyres wrote:

> That's pretty clever, actually (SVN and git effectively together in the same 
> repo).  Cool!
>
> However, migrating to git has all the same problems that I mentioned in the 
> prior email to you.  Is Mellanox volunteering to do all the work for 
> conversion?


I guess I should clarify -- here's what I previously sent to Mike in an 
off-list email about converting our main SVN repo to something else (e.g., 
Mercurial or Git).  #3 is probably moot if we entirely move to github, but it 
would be replaced with "migrate all existing users to github" (which is a fair 
amount of work, too).

-
We had *many* discussions a year or two ago about making Mercurial the primary 
repo, not SVN, and ultimately rejected it.  There are many issues involved:

1. developer learning curve
 --> certainly not the biggest factor, but definitely a factor
 --> "rebase" would certainly be a big deal (so that people don't put back a 
million intermediate commits)

2. adapting all of OMPI's current scripting to use hg (or git)
 --> this is a fair amount of work

3. getting IU to host git instead of SVN
 --> they have a whole management system for SVN: users, permissions, etc.  No 
such thing exists for git.

4. integrating Trac with git.  Or migrating to a whole new bug tracker that 
supports git.
 --> this is an entire conversation in itself.  Note that everyone hates 
bugzilla.

5. re-writing the SVN history to find all references to "rXXX" in commit 
messages and replace them with the relevant hg (git) unique commit hash
 --> someone would have to figure out how to script that
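
As a starting point for item 5, here is a minimal sketch of the text-rewriting step only. It assumes a revision-to-hash mapping has already been produced by whatever tool does the conversion; the mapping, the hashes, and the sample message below are made up for illustration.

```python
import re

# Matches "r" followed by digits as a standalone word, e.g. r27601.
SVN_REF = re.compile(r"\br(\d+)\b")

def rewrite_refs(message, rev_to_hash):
    """Replace rNNN with the mapped hash; leave unknown revisions untouched."""
    def swap(match):
        return rev_to_hash.get(int(match.group(1)), match.group(0))
    return SVN_REF.sub(swap, message)

# Hypothetical mapping and commit message, for illustration only.
mapping = {27601: "0123abc", 27615: "89def01"}
print(rewrite_refs("Follow-up to r27601; see also r99.", mapping))
```

Leaving unmapped references alone is a deliberate choice here: a stale "rXXX" can still be looked up in a read-only SVN tree, whereas a wrong hash cannot be recovered. Actually rewriting history with the result is a separate, much bigger job.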

So conversion would be a significant amount of work.  Instead, we opted for our 
current modes of operation, which seem to be working well enough:

- use the hg+svn or git+svn combo mechanisms to do actual development in hg/git 
and then push back up to svn when done
- provide hg (and now git) official mirrors so that people can branch/clone 
from there, and then provide patches to commit when done with development

In short -- I agree with you: moving to 100% hg/git would be nice.  But it 
would be a lot of work that no one was willing to spend the time to do.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] making Fortran MPI_Status components public

2012-09-27 Thread Jeff Squyres (jsquyres)
Fwiw, we have put in many hours of engineering to "that obscene hack" *because* 
compilers all have differing degrees of compatibility suck. It's going to be 
years before compilers fully support f08, for example, so we have no choice but 
to test for various compiler characteristics at configure time. 

I don't remember which ones offhand didn't support private (and I'm not near a 
computer to check), but we had to put that check in because there were 
problems. 

In short, the compilers themselves aren't providing fully portable behavior. So 
we have to check. 

And I would *much* rather prevent users from accidentally using the private 
status members in the status than have them ask us why some status member 
works in MPICH but doesn't work in OMPI. If it's marked as private, it's obvious 
that the user should not touch it. Yes, users should know that there are only 3 
public fields in the status, but plenty of users don't read the spec and just 
read the header files instead. 

So I'm ok propagating "that obscene hack" when it protects users and prevents 
wasting time on mailing list questions. 

Case closed. 

Sent from my phone. No type good. 

On Sep 27, 2012, at 9:48 AM, "N.M. Maclaren"  wrote:

> On Sep 27 2012, Jeff Squyres wrote:
>> On Sep 27, 2012, at 7:30 AM, Paul Hargrove wrote:
>> 
>>> PUBLIC should be a standard part of F95 (no configure probe required).
>> 
>> Good.
>> 
>>> However, the presence of "OMPI_PRIVATE" suggests you already have a 
>>> configure probe for the "PRIVATE" keyword.
>> 
>> Yes, we do, because not all compilers support it (yet?).
> 
> All serious compilers do, and have for a long time.  I should be interested
> to know which ones are actually used that don't.  There are a couple of
> Fortran compilers around that aren't really maintained and haven't been
> upgraded in a decade, but it is ridiculous to attempt to support such
> things.  It's like attempting to support K&R C compilers!
> 
> There is a high chance that these portability problems are actually
> caused by the configuration mechanism - indeed, I would give three to
> one on that being the cause.  I use PUBLIC and PRIVATE in my course,
> and I am pretty sure that I tested with Sun ONE Studio, so Oracle
> Studio is probably OK :-)
> 
> A VERY strong recommendation for portability is to minimise the use of
> that obscene hack (by which I mean configure).  OpenMPI probably can't
> avoid using it, but its use should be minimised - it is FAR better to
> clean up code and make it fully portable than to use it to hack around
> a problem.  And far too few configuration scripts get the software
> engineering and maintenance that they need.
> 
> Here, for example, there is absolutely NO point in supporting anything
> beyond the pure Fortran 77 interfaces for any compiler that isn't a
> full Fortran 95 one.
> 
> 
> Regards,
> Nick Maclaren.



Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk

2012-11-15 Thread Jeff Squyres (jsquyres)
Wait. 

Why did we just add a version check for m4?

Sent from my phone. No type good. 

On Nov 15, 2012, at 9:43 AM, "Hjelm, Nathan T"  wrote:

> Committed as r27615. Let me know if there are any more issues.
> 
> -Nathan
> 
> 
> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] on behalf of 
> Ralph Castain [r...@open-mpi.org]
> Sent: Thursday, November 15, 2012 8:53 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk
> 
> Looks fine to me. I would only add one further refinement - I think we should 
> check m4, but add a check in autogen.pl so that if we get nothing useful back 
> from -v (or whatever), then output a warning that we couldn't validate the 
> version and assume it is okay.
> 
> I believe the tool will return a non-zero status if the option isn't 
> supported, so we should be able to do this - yes?
> 
> 
> On Nov 15, 2012, at 7:48 AM, "Hjelm, Nathan T"  wrote:
> 
>> Since the version of m4 that comes with Solaris likely works with all our 
>> .m4 files and there is no way to check the version (no --version, -v, -V, or 
>> anything from what I can tell) I guess we have no choice but to not check 
>> the m4 version.
>> 
>> flex on the other hand we can check. How about this for the new regex (for 
>> reference the old one is $version =~ m/\s([\d\w\.]+)$/m; -- matching a 
>> version at the end of the line):
>> 
>> $version =~ m/\s([\d\.]+\w?)/m;
>> 
>> It works with Apple's flex and still works with glibtoolize, autoconf, and 
>> automake.
>> 
>>   Searching for autoconf
>>     Found autoconf version 2.69; checking version...
>>       Found version component 2 -- need 2
>>       Found version component 69 -- need 65
>>     ==> ACCEPTED
>>   Searching for libtoolize
>>     libtoolize not found
>>   Searching for glibtoolize
>>     Found glibtoolize version 2.4.2; checking version...
>>       Found version component 2 -- need 2
>>       Found version component 4 -- need 2
>>     ==> ACCEPTED
>>   Searching for automake
>>     Found automake version 1.12.2; checking version...
>>       Found version component 1 -- need 1
>>       Found version component 12 -- need 11
>>     ==> ACCEPTED
>>   Searching for flex
>>     Found flex version 2.5.35; checking version...
>>       Found version component 2 -- need 2
>>       Found version component 5 -- need 5
>>       Found version component 35 -- need 35
>>     ==> ACCEPTED
>>   Searching for m4
>>     Found m4 version 1.4.6; checking version...
>>       Found version component 1 -- need 1
>>       Found version component 4 -- need 4
>>       Found version component 6 -- need 16
>>     ==> Too low!  Skipping this version
>>   Searching for gm4
>>     Found gm4 version 1.4.16; checking version...
>>       Found version component 1 -- need 1
>>       Found version component 4 -- need 4
>>       Found version component 16 -- need 16
>>     ==> ACCEPTED
>> 
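
To see why the new pattern copes with Apple's flex while the old one does not, the two regexes can be tried side by side. This is a Python transliteration of the Perl patterns quoted above, plus a component-wise comparison in the spirit of the "Found version component" output; it is not the autogen.pl code itself. The sample --version strings are assumptions about what the tools print.

```python
import re

OLD = re.compile(r"\s([\d\w\.]+)$", re.M)   # version must end the line
NEW = re.compile(r"\s([\d\.]+\w?)", re.M)   # first digits-and-dots run after whitespace

def find_version(pattern, output):
    """Return the captured version string, or None if nothing matched."""
    m = pattern.search(output)
    return m.group(1) if m else None

def components(version):
    """Split '1.4.16' into [1, 4, 16]; a trailing letter ('2.2.6b') is dropped."""
    return [int(n) for n in re.findall(r"\d+", version)]

def new_enough(found, needed):
    """Component-wise comparison, so 1.4.6 is older than 1.4.16."""
    return components(found) >= components(needed)

# Assumed sample outputs; real tools may print something different.
samples = {
    "autoconf": "autoconf (GNU Autoconf) 2.69",
    "flex": "flex 2.5.35 Apple(flex-31)",
}
for tool, text in samples.items():
    print(tool, find_version(OLD, text), find_version(NEW, text))
print(new_enough("1.4.6", "1.4.16"), new_enough("1.4.16", "1.4.16"))
```

Comparing integer lists rather than strings is what makes 1.4.6 come out lower than 1.4.16, matching the "Too low!" line in the output above.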
>> 
>> -Nathan
>> 
>> 
>> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] on behalf of 
>> Paul Hargrove [phhargr...@lbl.gov]
>> Sent: Wednesday, November 14, 2012 7:37 PM
>> To: Larry Baker
>> Cc: Open MPI Developers
>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk
>> 
>> Larry,
>> 
>> I just wanted to speak up quickly to be sure nobody used your example to 
>> "fix" the Mac OS problem and thereby break Solaris instead.  No personal 
>> attack/affront was intended.
>> 
>> -Paul
>> 
>> On Wed, Nov 14, 2012 at 7:10 PM, Larry Baker <ba...@usgs.gov> wrote:
>> Paul,
>> 
>> 1) I wasn't trying to solve the --version issue, only the parsing of the 
>> response.
>> 2) I assumed from the initial e-mail that the broken parser was in a Perl 
>> script.  I'm not a Perl person, so I wrote the example regular expression 
>> parser in sed.
>> 
>> These commands were done on my Mac OS X 10.6 system.  I have no idea where 
>> the apps came from.  I know the sed, at least, does not recognize regular 
>> expressions documented for GNU sed (such as \< \> for begin/end word).  
>> Maybe it is a BSD sed?
>> 
>> I was just trying to illustrate how to fix the broken parsing of Ralph's 
>> "flex --version".  Assuming the RE parser I wrote is satisfactory, it would 
>> have to be adapted to fit in the framework, i.e., it has to be portable.
>> 
>> Larry Baker
>> US Geological Survey
>> 650-329-5608
>> ba...@usgs.gov
>> 
>> 
>> 
>> On 14 Nov 2012, at 5:41 PM, Paul Hargrove wrote:
>> 
>> On Wed, Nov 14, 2012 at 6:26 PM, Larry Baker <ba...@usgs.gov> wrote:
>> m4 --version | sed -n -E -e 
>> '1s/^.*[^A-Za-z0-9_-]?([0-9]+[.][0-9]+[.][0-9]+)[^A-Za-z0-9_-]?.*$/\1/p'
>> 
>> 
>> There are STILL problems with this approach as it is TWICE specific to GNU 
>> software:
>> 
>> 1) M4 on OpenBSD (may

Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk

2012-11-15 Thread Jeff Squyres (jsquyres)
We only call out the version of m4 because the Autotools we require need that m4 
version (which is not always already installed). We don't need that version of 
m4 for OMPI itself. 

Sent from my phone. No type good. 

On Nov 15, 2012, at 10:04 AM, "Ralph Castain"  wrote:

> Only because we call out a minimum required version in our HACKING file, but 
> we never check for it
> 
> If we don't require a min version, then we shouldn't check - but if we do, 
> then we should

Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk

2012-11-15 Thread Jeff Squyres (jsquyres)
No issue, I guess. It's just new and I wondered why it was done. 

Sent from my phone. No type good. 

On Nov 15, 2012, at 11:34 AM, "Ralph Castain"  wrote:

> Sooo...what's the issue with checking for it then? Isn't it "required" by 
> association?
> 
> 
> On Nov 15, 2012, at 10:27 AM, "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> We only call out te version of m4 because the Autotools we require need that 
>> m4 version (which is not always already installed). We don't need that 
>> version of m4 for OMPI itself. 
>> 
>> Sent from my phone. No type good. 
>> 
>> On Nov 15, 2012, at 10:04 AM, "Ralph Castain"  wrote:
>> 
>>> Only because we call out a minimum required version in our HACKING file, 
>>> but we never check for it
>>> 
>>> If we don't require a min version, then we shouldn't check - but if we do, 
>>> then we should
>>> 
>>> On Nov 15, 2012, at 9:00 AM, "Jeff Squyres (jsquyres)"  
>>> wrote:
>>> 
>>>> Wait. 
>>>> 
>>>> Why did we just add a version check for m4?
>>>> 
>>>> Sent from my phone. No type good. 
>>>> 
>>>> On Nov 15, 2012, at 9:43 AM, "Hjelm, Nathan T"  wrote:
>>>> 
>>>>> Committed as r27615. Let me know if there are any more issues.
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>> 
>>>>> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] on behalf 
>>>>> of Ralph Castain [r...@open-mpi.org]
>>>>> Sent: Thursday, November 15, 2012 8:53 AM
>>>>> To: Open MPI Developers
>>>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r27601 - trunk
>>>>> 
>>>>> Looks fine to me. I would only add one further refinement - I think we 
>>>>> should check m4, but add a check in autogen.pl so that if we get nothing 
>>>>> useful back from -v (or whatever), then output a warning that we couldn't 
>>>>> validate the version and assume it is okay.
>>>>> 
>>>>> I believe the tool will return a non-zero status if the option isn't 
>>>>> supported, so we should be able to do this - yes?
>>>>> 
>>>>> 
>>>>> On Nov 15, 2012, at 7:48 AM, "Hjelm, Nathan T"  wrote:
>>>>> 
>>>>>> Since the version of m4 that comes with Solaris likely works with all 
>>>>>> our .m4 files and there is no way to check the version (no --version, 
>>>>>> -v, -V, or anything from what I can tell) I guess we have no choice but 
>>>>>> to not check the m4 version.
>>>>>> 
>>>>>> flex on the other hand we can check. How about this for the new regex 
>>>>>> (for reference the old one is $version =~ m/\s([\d\w\.]+)$/m; -- 
>>>>>> matching a version at the end of the line):
>>>>>> 
>>>>>> $version =~ m/\s([\d\.]+\w?)/m;
>>>>>> 
>>>>>> It works with Apple's flex and still works with glibtoolize, autoconf, 
>>>>>> and automake.
>>>>>> 
>>>>>> Searching for autoconf
>>>>>> Found autoconf version 2.69; checking version...
>>>>>>  Found version component 2 -- need 2
>>>>>>  Found version component 69 -- need 65
>>>>>> ==> ACCEPTED
>>>>>> Searching for libtoolize
>>>>>> libtoolize not found
>>>>>> Searching for glibtoolize
>>>>>> Found glibtoolize version 2.4.2; checking version...
>>>>>>  Found version component 2 -- need 2
>>>>>>  Found version component 4 -- need 2
>>>>>> ==> ACCEPTED
>>>>>> Searching for automake
>>>>>> Found automake version 1.12.2; checking version...
>>>>>>  Found version component 1 -- need 1
>>>>>>  Found version component 12 -- need 11
>>>>>> ==> ACCEPTED
>>>>>> Searching for flex
>>>>>> Found flex version 2.5.35; checking version...
>>>>>>  Found version component 2 -- need 2
>>>>>>  Found version component 5 -- need 5
>>>>>>  Found version component 35 -- need 35
>>>>>> ==> ACCEPTED
>>>>>> Searching for m4
>>>&g
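
A minimal sketch of the version check discussed above, written in Python
rather than the Perl that autogen.pl uses.  The two patterns are the ones
quoted in the message; the banner strings, the function names, and the
warn-and-accept fallback for tools that print no version (such as the Solaris
m4 case) are illustrative assumptions, not the code committed in r27615.

```python
import re

# Patterns quoted in the thread (Perl's /m flag maps to re.MULTILINE).
OLD_VERSION_RE = re.compile(r"\s([\d\w.]+)$", re.MULTILINE)
NEW_VERSION_RE = re.compile(r"\s([\d.]+\w?)", re.MULTILINE)


def extract_version(banner, pattern=NEW_VERSION_RE):
    """Return the first version-like token in a tool banner, or None."""
    match = pattern.search(banner)
    return match.group(1) if match else None


def version_ok(found, need):
    """Compare dotted versions component by component (e.g. 2.69 vs 2.65)."""
    def components(version):
        return [int(part) for part in re.findall(r"\d+", version)]
    return components(found) >= components(need)


def check_tool(name, banner, need):
    """Accept a tool if its parsed version is new enough.

    Assumption: if no version can be parsed at all, warn and assume the
    tool is okay, as suggested earlier in the thread.
    """
    found = extract_version(banner)
    if found is None:
        print(f"WARNING: could not validate the version of {name}; assuming okay")
        return True
    return version_ok(found, need)
```

With a banner such as "flex 2.5.35 Apple(flex-31)" (an assumed sample string),
the old end-of-line pattern finds nothing because the line does not end in a
version, while the new pattern picks up 2.5.35.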

[OMPI devel] OMPI trunk: MPI C++ bindings no longer build by default

2013-01-07 Thread Jeff Squyres (jsquyres)
Per discussion and RFC, on the trunk (i.e., what will someday be OMPI v1.9), 
the MPI C++ bindings are no longer built by default.

You can enable them via the configure switch --enable-mpi-cxx.  

Those who are running MTT, you probably want to add --enable-mpi-cxx to your 
OMPI configuration so that the C++ tests will still run.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] use of stat() during malloc initialization

2013-01-09 Thread Jeff Squyres (jsquyres)
Greetings Phil.  Good analysis.

You can thank OFED for this horribleness, BTW.  :-)  Since OFED hardware 
requires memory registration, and since that registration is expensive, MPI 
implementations cache registered memory to alleviate the re-registration costs 
for repeated memory usage.  But MPI doesn't allocate user buffers, so MPI 
doesn't get notified when users free their buffers, meaning that MPI's internal 
cache gets out of sync with reality.  Hence, MPI implementations are forced to 
do horrid workarounds like the one you found to find out when applications free buffers 
that may be cached.  Ugh.  Go knock your local OFED developer and tell them to 
give us a notification mechanism so that we don't have to do these horrid 
workarounds.  :-)

Regardless, I think your suggestion is fine (replace stat with access).

Can you confirm that the attached patch works for you?


On Jan 9, 2013, at 10:49 AM, Phil Carns 
 wrote:

> Hi,
> 
> I am a developer on the Darshan project (http://www.mcs.anl.gov/darshan), 
> which provides a set of lightweight wrappers to characterize the I/O access 
> patterns of MPI applications.  Darshan can operate on static or dynamic 
> executables.  As you might expect, it uses the LD_PRELOAD mechanism to 
> intercept I/O calls like open(), read(), write() and stat() on dynamic 
> executables.
> 
> We recently received an unusual bug report (courtesy of Myriam Botalla) when 
> Darshan is used in LD_PRELOAD mode with Open MPI 1.6.3, however. When Darshan 
> intercepts a function call via LD_PRELOAD, it must use dlsym() to locate the 
> "real" underlying function to invoke.  dlsym() in turn uses the calloc() 
> function internally.  In most cases this is fine, but Open MPI actually makes 
> its first stat() call within the malloc initialization hook 
> (opal_memory_linux_malloc_init_hook()) before the malloc() and its related 
> functions have been configured.  Darshan therefore (indirectly) triggers a 
> segfault because it intercepts those stat() calls but can't find the real 
> stat() function without using malloc.
> 
> There is some more detailed information about this issue, including a stack 
> trace, in this mailing list thread:
> 
> http://lists.mcs.anl.gov/pipermail/darshan-users/2013-January/000131.html
> 
> Looking a little more closely at the opal_memory_linux_malloc_init_hook() 
> function, it looks like the struct stat output argument from stat() is being 
> ignored in all cases.  Open MPI is just checking the stat() return code to 
> determine if the files in question exist or not.  Taking that into account, 
> would it be possible to make a minor change in Open MPI to replace these 
> instances:
> 
>     stat("some_filename", &st)
> 
> with:
> 
>     access("some_filename", F_OK)
> 
> in the opal_memory_linux_malloc_init_hook() function?  There is a slight 
> technical advantage to the change in that access() is lighter weight than 
> stat() on some systems (and it might arguably make the intent  of the calls a 
> little clearer), but of course my main motivation here is to have Open MPI 
> use a function that is less likely to be intercepted by I/O tracing tools 
> before a malloc implementation has been initialized.  Technically it is 
> possible to work around this in Darshan itself by checking the arguments 
> passed in to stat() and using a workaround path for this case, but this isn't 
> a very safe solution in the long run.
> 
> Thanks in advance for your time and consideration,
> -Phil
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


stat-to-access.diff
Description: stat-to-access.diff
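
To illustrate the idea behind the patch: when only existence matters, the
struct stat result is never read, and access(path, F_OK) answers the same
question.  The sketch below uses Python's os wrappers around those two calls;
the helper names are hypothetical, and the real change lives in the C function
opal_memory_linux_malloc_init_hook(), not in anything shown here.

```python
import os


def exists_via_stat(path):
    """Existence check that fetches a full stat result and then ignores it."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def exists_via_access(path):
    """Same question asked with access(path, F_OK); no stat buffer involved."""
    return os.access(path, os.F_OK)
```

For a plain "is this file there?" test the two helpers agree, which is why the
swap is safe in the init hook while also avoiding a call that I/O tracing
tools commonly intercept.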


Re: [OMPI devel] [OMPI users] Backward Compatibility of MPI Java Binding

2013-01-10 Thread Jeff Squyres (jsquyres)
On Jan 9, 2013, at 10:30 PM, Yoshiki SATO 
 wrote:

> The 1.7 Java implementation under ompi/mpi/java seems to be buildable 
> independently.  Do you think we can build just that and run it (via 
> prunjava?) with our custom OpenMPI build based on 1.6?


Yes -- IIRC, the Java interface isn't really dependent upon anything specific 
in the back-end C implementation of Open MPI.  So I'm guessing/assuming that if 
you can build it, it should work against the 1.6 OMPI C engine just fine.

Is MPI-for-Java of interest to you?  I ask because Ralph and I have been trying 
to figure out how to get the cycles to do more with the Java interface (e.g., 
make it more like a 1:1 set of bindings).  Is this something you'd be willing 
to work on / contribute, perchance?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] use of stat() during malloc initialization

2013-01-10 Thread Jeff Squyres (jsquyres)
Committed in https://svn.open-mpi.org/trac/ompi/changeset/27785, and I filed 
CMRs to get this fix in 1.6.4 and 1.7.


On Jan 10, 2013, at 9:23 AM, Phil Carns 
 wrote:

> Thanks Jeff.  I tested the patch just now using Open MPI SVN trunk revision 
> 27784.  I was able to instrument an application without any trouble at all, 
> and the patch looks great.
> 
> I definitely understand the memory registration cache pain.  I've dabbled in 
> network abstractions for file systems in the past, and it's disappointing but 
> not terribly surprising that this is still the state of affairs :)  
> 
> Thanks for addressing this so quickly.  This will definitely make life easier 
> for some Darshan and Open MPI users in the future.
> 
> -Phil
> 
> On 01/09/2013 04:24 PM, Jeff Squyres (jsquyres) wrote:
>> Greetings Phil.  Good analysis.
>> 
>> You can thank OFED for this horribleness, BTW.  :-)  Since OFED hardware 
>> requires memory registration, and since that registration is expensive, MPI 
>> implementations cache registered memory to alleviate the re-registration 
>> costs for repeated memory usage.  But MPI doesn't allocate user buffers, so 
>> MPI doesn't get notified when users free their buffers, meaning that MPI's 
>> internal cache gets out of sync with reality.  Hence, MPI implementations 
>> are forced to do horrid workarounds like the one you found to find out when 
>> applications free buffers that may be cached.  Ugh.  Go knock your local 
>> OFED developer and tell them to give us a notification mechanism so that we 
>> don't have to do these horrid workarounds.  :-)
>> 
>> Regardless, I think your suggestion is fine (replace stat with access).
>> 
>> Can you confirm that the attached patch works for you?
>> 
>> 
>> On Jan 9, 2013, at 10:49 AM, Phil Carns wrote:
>> 
>> 
>>> Hi,
>>> 
>>> I am a developer on the Darshan project (http://www.mcs.anl.gov/darshan), 
>>> which provides a set of lightweight wrappers to characterize the I/O 
>>> access patterns of MPI applications.  Darshan can operate on static or 
>>> dynamic executables.  As you might expect, it uses the LD_PRELOAD mechanism 
>>> to intercept I/O calls like open(), read(), write() and stat() on dynamic 
>>> executables.
>>> 
>>> We recently received an unusual bug report (courtesy of Myriam Botalla) 
>>> when Darshan is used in LD_PRELOAD mode with Open MPI 1.6.3, however. When 
>>> Darshan intercepts a function call via LD_PRELOAD, it must use dlsym() to 
>>> locate the "real" underlying function to invoke.  dlsym() in turn uses the 
>>> calloc() function internally.  In most cases this is fine, but Open MPI 
>>> actually makes its first stat() call within the malloc initialization hook 
>>> (opal_memory_linux_malloc_init_hook()) before the malloc() and its related 
>>> functions have been configured.  Darshan therefore (indirectly) triggers a 
>>> segfault because it intercepts those stat() calls but can't find the real 
>>> stat() function without using malloc.
>>> 
>>> There is some more detailed information about this issue, including a stack 
>>> trace, in this mailing list thread:
>>> 
>>> 
>>> http://lists.mcs.anl.gov/pipermail/darshan-users/2013-January/000131.html
>>> 
>>> 
>>> Looking a little more closely at the opal_memory_linux_malloc_init_hook() 
>>> function, it looks like the struct stat output argument from stat() is 
>>> being ignored in all cases.  Open MPI is just checking the stat() return 
>>> code to determine if the files in question exist or not.  Taking that into 
>>> account, would it be possible to make a minor change in Open MPI to replace 
>>> these instances:
>>> 
>>>     stat("some_filename", &st)
>>> 
>>> with:
>>> 
>>>     access("some_filename", F_OK)
>>> 
>>> in the opal_memory_linux_malloc_init_hook() function?  There is a slight 
>>> technical advantage to the change in that access() is lighter weight than 
>>> stat() on some systems (and it might arguably make the intent  of the calls 
>>> a little clearer), but of course my main motivation here is to have Open 
>>> MPI use a function that is less likely to be intercepted by I/O tracing 
>>> tools before a malloc implementation has been initialized.  Technically it 
>>> is possible to work around this in Darshan itself by checking the arguments 
>>> passed 

Re: [OMPI devel] Build open MPI

2013-01-10 Thread Jeff Squyres (jsquyres)
Check the HACKING file in the top-level directory if you need some assistance 
on how to upgrade your Autoconf/Automake/Libtool.


On Jan 9, 2013, at 9:27 PM, Ralph Castain 
 wrote:

> I'm pretty sure we are at autoconf 2.69 now. You might want to upgrade it, 
> and ensure your m4 is correspondingly updated. Also, automake should probably 
> be at 1.12.x (avoid 1.13.x as it has bugs). I think libtool looks pretty old 
> too.
> 
> Sent from my iPad
> 
> On Jan 9, 2013, at 5:37 PM, "Ding, Boxiong"  wrote:
> 
>> Hi,
>> 
>> I am trying to build the code that Ralph has put here: 
>> https://boxd...@bitbucket.org/rhc/hdmon, but failed. It is a modified open 
>> MPI code. Can someone help?
>> 
>> [root@aesaroyp1d1c hdmon]# cat /etc/redhat-release 
>> Red Hat Enterprise Linux Server release 6.1 (Santiago)
>> 
>> I have manually installed m4/autoconf/automake/libtool on my local directory 
>> and the versions match those specified in HACKING.
>> [root@aesaroyp1d1c lib]# pwd
>> /root/local/lib
>> [root@aesaroyp1d1c lib]# ls
>> autoconf-2.68  automake-1.11.1  libtool-2.2.8  m4-1.4.13
>> 
>> [root@aesaroyp1d1c lib]# which m4
>> /root/local/lib/m4-1.4.13/bin/m4
>> [root@aesaroyp1d1c lib]# which autoconf
>> /root/local/lib/autoconf-2.68/bin/autoconf
>> [root@aesaroyp1d1c lib]# which automake
>> /root/local/lib/automake-1.11.1/bin/automake
>> [root@aesaroyp1d1c lib]# which libtool
>> /root/local/lib/libtool-2.2.8/bin/libtool
>> 
>> When I run autogen.pl I got the following error:
>> 
>> 6. Processing autogen.subdirs directories
>> 
>> === Processing subdir: 
>> /root/workspace/hdmon/opal/mca/event/libevent2019/libevent
>> --- Found autogen.sh; running...
>> Running: ./autogen.sh
>> autoreconf: Entering directory `.'
>> autoreconf: configure.in: not using Gettext
>> autoreconf: running: aclocal -I .. --force -I m4
>> autoreconf: configure.in: tracing
>> autoreconf: configure.in: not using Libtool
>> autoreconf: running: /root/local/lib/autoconf-2.68/bin/autoconf --include=.. 
>> --force
>> configure.in:146: error: possibly undefined macro: AC_PROG_LIBTOOL
>>   If this token and others are legitimate, please use m4_pattern_allow.
>>   See the Autoconf documentation.
>> autoreconf: /root/local/lib/autoconf-2.68/bin/autoconf failed with exit 
>> status: 1
>> Command failed: ./autogen.sh
>> 
>> 
>> Thanks,
>> Boxiong


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] New warning in btl_openib_endpoint.h

2013-01-11 Thread Jeff Squyres (jsquyres)
It looks like a recent change to btl_openib_endpoint.h is resulting in warnings 
-- the code is a bit ambiguous:

-
if (ep->qps[qp].qp->sd_wqe <= 0  ||
size + sizeof(mca_btl_openib_header_t) + (rdma ? 
sizeof(mca_btl_openib_footer_t) : 0) > ep->qps[qp].ib_inline_max ||
 !BTL_OPENIB_QP_TYPE_PP(qp) && 
ep->endpoint_btl->qps[qp].u.srq_qp.sd_credits <= 0) {
-

Results in the following warning:

./btl_openib_endpoint.h:307: warning: suggest parentheses around '&&' within 
'||'

*** Mellanox: did you mean to put parentheses around the || or the && ?
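
For reference, C parses a || b || !c && d as a || b || (!c && d), because &&
binds more tightly than ||; the warning only asks for the intended grouping to
be spelled out.  Python's "and"/"or" have the same relative precedence, so the
small sketch below shows how the two candidate groupings differ.  The boolean
parameter names are hypothetical stand-ins for the three sub-conditions, not
names from btl_openib_endpoint.h.

```python
def as_written(no_wqe, too_big, is_pp, no_credits):
    # Same shape as the condition in the header: a || b || !c && d
    return no_wqe or too_big or not is_pp and no_credits


def and_grouped(no_wqe, too_big, is_pp, no_credits):
    # What the compiler already does: a || b || (!c && d)
    return no_wqe or too_big or ((not is_pp) and no_credits)


def or_grouped(no_wqe, too_big, is_pp, no_credits):
    # The other possible intent: (a || b || !c) && d
    return (no_wqe or too_big or (not is_pp)) and no_credits
```

The first two functions agree on every input.  The third differs whenever one
of the first two conditions is true but no_credits is false, which is the case
that decides which set of parentheses was meant.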

I think the fix for this will need to go to v1.6 and v1.7, because this new 
code has already been pushed to those branches (bad reviewing! :-( :-( :-( ).

(this code is already in v1.7; there's a CMR waiting for the gk to apply it to 
v1.6)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Backward Compatibility of MPI Java Binding

2013-01-11 Thread Jeff Squyres (jsquyres)
On Jan 10, 2013, at 10:33 AM, Yoshiki SATO  wrote:

>> Yes -- IIRC, the Java interface isn't really dependent upon anything 
>> specific in the back-end C implementation of Open MPI.  So I'm 
>> guessing/assuming that if you can build it, it should work against the 1.6 
>> OMPI C engine just fine.
> 
> Sounds good :-)  I'm going to try to build it anyway, and let you know if 
> I get stuck.

Cool.

>> Is MPI-for-Java of interest to you?  I ask because Ralph and I have been 
>> trying to figure out how to get the cycles to do more with the Java 
>> interface (e.g., make it more like a 1:1 set of bindings).  Is this 
>> something you'd be willing to work on / contribute, perchance?
> 
> Yes.  This is because one of our roles is enabling Java applications to run 
> on top of the K computer (Fujitsu), which prevents user processes from TCP/IP 
> communications directly.  It only allows communications via customized 
> openmpi (scheduler) between inter node communications.  

Gotcha.

> BTW, I don't fully understand what a 1:1 set of bindings means.  I believe 
> that the interfaces defined in mpiJava were carefully designed to match the 
> MPI spec, and so should already be a 1:1 set of bindings.  Otherwise, I'm 
> willing to contribute something to make it better ;-)

I think they're *somewhat* close; I don't think that they're exactly a 
one-to-one mapping to the C bindings.  I know that they added the "offset" 
argument for choice buffers, and I can see why that would be necessary.  So 
that should probably stay.

But extras like MPI.OBJECT should probably disappear, and be replaced with 
proper support for Java-isms (e.g., N-dimensional arrays -- see posts from 
Siegmar on the users list about how N-dimensional arrays don't work because 
OMPI is assuming that they'll be contiguous in memory, but Java allocates, for 
example, a 2D array as a series of 1D arrays).  And any other interfaces that 
aren't nearly identical to what the MPI C++ bindings are (were) should probably 
also go, unless there are strong reasons to *need* them.

I've set up a Bitbucket for this Java work, if you would like to contribute.  
It's a branch from the Open MPI SVN trunk; I can keep it up to date.  If you 
send me your Bitbucket ID, I can give you write permissions.

https://bitbucket.org/jsquyres/ompi-java-revamped

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.7rc6 now posted

2013-01-15 Thread Jeff Squyres (jsquyres)
In the usual location:

http://www.open-mpi.org/software/ompi/v1.7/

I don't have an easy list of things that have changed since the previous rc; *many* things 
have changed / been fixed.  Please test.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-01-17 Thread Jeff Squyres (jsquyres)
Not that I'm aware of; that would be great.

Unlike George, however, I'm not concerned about converting to linear operations 
for attributes.

Attributes are not used often, but when they are:

a) there aren't many of them (so a linear penalty is trivial)
b) they're expected to be low performance

So if it makes the code simpler, I certainly don't mind linear operations.



On Jan 17, 2013, at 9:32 AM, KAWASHIMA Takahiro 
 wrote:

> George,
> 
> Your idea makes sense.
> Is anyone working on it? If not, I'll try.
> 
> Regards,
> KAWASHIMA Takahiro
> 
>> Takahiro,
>> 
>> Thanks for the patch. I deplore the loss of the hash table in the attribute 
>> management, as the potential of transforming all attribute operations to 
>> linear complexity is not very appealing.
>> 
>> As you already took the decision C, it means that at the communicator 
>> destruction stage the hash table is not relevant anymore. Thus, I would have 
>> converted the hash table to an ordered list (ordered by the creation index, 
>> a global entity atomically updated every time an attribute is created), and 
>> proceeded to destroy the attributes in the desired order. Thus instead of 
>> having a linear operation for every operation on attributes, we only have a 
>> single linear operation per communicator (and this during the destruction 
>> stage).
>> 
>>  George.
>> 
>> On Jan 16, 2013, at 16:37 , KAWASHIMA Takahiro  
>> wrote:
>> 
>>> Hi,
>>> 
>>> I've implemented ticket #3123 "MPI-2.2: Ordering of attribution deletion
>>> callbacks on MPI_COMM_SELF".
>>> 
>>> https://svn.open-mpi.org/trac/ompi/ticket/3123
>>> 
>>> As this ticket says, attributes were stored in an unordered hash.
>>> So I've replaced opal_hash_table_t with opal_list_t and made necessary
>>> modifications for it. And I've also fixed some multi-threaded concurrent
>>> (get|set|delete)_attr call issues.
>>> 
>>> This modification introduces the following behavior changes.
>>> 
>>> (A) MPI_(Comm|Type|Win)_(get|set|delete)_attr functions may be slower
>>> for MPI objects that have many attributes attached.
>>> (B) When the user-defined delete callback function is called, the
>>> attribute is already removed from the list. In other words,
>>> if MPI_(Comm|Type|Win)_get_attr is called by the user-defined
>>> delete callback function for the same attribute key, it returns
>>> flag = false.
>>> (C) Even if the user-defined delete callback function returns a
>>> non-MPI_SUCCESS value, the attribute is not restored to the list.
>>> 
>>> (A) is due to a sequential list search instead of a hash. See find_value
>>> function for its implementation.
>>> (B) and (C) are due to an atomic deletion of the attribute to allow
>>> multi-threaded concurrent (get|set|delete)_attr call in MPI_THREAD_MULTIPLE.
>>> See ompi_attr_delete function for its implementation. I think this does
>>> not matter because MPI standard doesn't specify behavior in such cases.
>>> 
>>> The patch for Open MPI trunk is attached. If you like it, take in
>>> this patch.
>>> 
>>> Though I'm an employee of a company, this is my independent and private
>>> work at my home. No intellectual property from my company is involved.
>>> If needed, I'll sign the Individual Contributor License Agreement.
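
A rough sketch of the approach George describes above: keep the hash table for
get/set/delete, stamp each attribute with a creation index, and pay for a
single ordering pass only when the communicator is destroyed.  This is Python
with hypothetical names (AttributeTable, destroy), not Open MPI's
opal_hash_table_t code.  Destroying newest-first is an assumption about the
"desired order", and replacing an existing key is not handled specially.

```python
import itertools

# Global creation counter; the real code would need an atomic increment.
_creation_index = itertools.count()


class AttributeTable:
    """Hash-based attribute storage that still remembers creation order."""

    def __init__(self):
        self._attrs = {}  # key -> (creation index, value, delete callback)

    def set_attr(self, key, value, delete_fn=None):
        self._attrs[key] = (next(_creation_index), value, delete_fn)

    def get_attr(self, key):
        entry = self._attrs.get(key)
        return (True, entry[1]) if entry else (False, None)

    def delete_attr(self, key):
        # Remove first, then run the callback (behavior (B) in the thread).
        _, value, delete_fn = self._attrs.pop(key)
        if delete_fn is not None:
            delete_fn(key, value)

    def destroy(self, newest_first=True):
        """Sort once by creation index, then delete in that order."""
        ordered = sorted(self._attrs, key=lambda k: self._attrs[k][0],
                         reverse=newest_first)
        for key in ordered:
            self.delete_attr(key)
```

Lookups stay constant-time on average, and the only ordering work happens once
per object at destruction, which is the trade-off George is arguing for.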


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] sanity check on 1.6.4 .so versions

2013-01-17 Thread Jeff Squyres (jsquyres)
Given that we've screwed this up before, could someone please sanity check the 
.so versions I'm planning on using for the 1.6.4 release.  Only 2 libraries 
changed: libmpi and libopen-pal (most other changes were in VT and various 
components).  No interfaces changed.

Is this right?

## libmpi changed
# was: libmpi_so_version=1:6:0
libmpi_so_version=1:7:0
libmpi_cxx_so_version=1:1:0
libmpi_f77_so_version=1:6:0
libmpi_f90_so_version=4:0:3
libopen_rte_so_version=4:3:0
##  opal changed
# was: libopen_pal_so_version=4:3:0
libopen_pal_so_version=4:4:0
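
As a cross-check, the update rules for current:revision:age from the libtool
manual can be written down directly.  The sketch below is Python with a
hypothetical helper name; it only encodes those documented rules and applies
them to the "code changed, no interfaces changed" case described above.

```python
def bump_so_version(version, code_changed=False, interfaces_added=False,
                    interfaces_removed_or_changed=False):
    """Apply the libtool current:revision:age update rules to a version."""
    current, revision, age = (int(field) for field in version.split(":"))
    interface_changed = interfaces_added or interfaces_removed_or_changed

    # Source changed at all since the last release: bump revision.
    if code_changed or interface_changed:
        revision += 1
    # Any interface added, removed, or changed: bump current, reset revision.
    if interface_changed:
        current += 1
        revision = 0
    # Interfaces added: bump age.
    if interfaces_added:
        age += 1
    # Interfaces removed or changed: reset age.
    if interfaces_removed_or_changed:
        age = 0
    return f"{current}:{revision}:{age}"
```

Under these rules, a code-only change takes libmpi from 1:6:0 to 1:7:0 and
libopen-pal from 4:3:0 to 4:4:0, which matches the values proposed above.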

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6.4rc1 has been posted

2013-01-17 Thread Jeff Squyres (jsquyres)
In the usual location:

http://www.open-mpi.org/software/ompi/v1.6/

Here's a list of changes since 1.6.3:

- Added performance improvements to the OpenIB (OpenFabrics) BTL.
- Improved error message when process affinity fails.
- Fixed MPI_MINLOC on man pages for MPI_REDUCE(_LOCAL).  Thanks to Jed
  Brown for noticing the problem and supplying a fix.
- Made malloc hooks more friendly to IO interposers.  Thanks to the
  bug report and suggested fix from Darshan maintainer Phil Carns.
- Restored ability to direct launch under SLURM without PMI support.
- Fixed MPI datatype issues on OpenBSD.
- Major VT update to 5.14.2.
- Support FCA v3.0+.
- Fixed header file problems on OpenBSD.
- Fixed issue with MPI_TYPE_CREATE_F90_REAL.
- Fix an issue with using external libltdl installations.  Thanks to
  opolawski for identifying the problem.
- Fixed MPI_IN_PLACE case for MPI_ALLGATHER for FCA.
- Allow SLURM PMI support to look in lib64 directories.  Thanks to
  Guillaume Papaure for the patch.
- Restore "use mpi" ABI compatibility with the rest of the 1.5/1.6
  series (except for v1.6.3, where it was accidentally broken).
- Fix a very old error in opal_path_access(). Thanks to Marco Atzeri
  for chasing it down.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] 1.6.4rc1 has been posted

2013-01-17 Thread Jeff Squyres (jsquyres)
Sweet. :)

Sent from my phone. No type good.

On Jan 17, 2013, at 6:59 PM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:

On Thu, Jan 17, 2013 at 2:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
[snip]
The BAD news is a new failure (SEGV in orted at exit) on OpenBSD-5.2/amd64, 
which I will report in a separate email once I've completed some triage.
[snip]

You can disregard the "BAD news" above.
Everything was fine with gcc, but fails with llvm-gcc.
Looking deeper (details upon request) the SEGV appears to be caused by a bug in 
llvm-gcc.

-Paul

--
Paul H. Hargrove  
phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] MPI-2.2 status #2223, #3127

2013-01-18 Thread Jeff Squyres (jsquyres)
On Jan 17, 2013, at 8:43 PM, "Kawashima, Takahiro" 
 wrote:

> Fujitsu is interested in completing MPI-2.2 on Open MPI and Open MPI
> -based Fujitsu MPI.
> 
> We've read wiki and tickets. These two tickets seem to be almost done
> but need testing and bug fixing.
> 
>  https://svn.open-mpi.org/trac/ompi/ticket/2223
>  MPI-2.2: MPI_Dist_graph_* functions missing
> 
>  https://svn.open-mpi.org/trac/ompi/ticket/3127
>  MPI-2.2: Add reduction support for MPI_C_*COMPLEX and MPI::*COMPLEX
> 
> My colleagues are planning to work on these. They will write test codes
> and try to fix bugs. Test codes and patches can be contributed to the
> community. If they cannot fix some bugs, we will report details. They
> are planning to complete them in around March.

Great!

> Are the latest statuses written in these ticket comments correct?
> Is there any more progress?
> 
> Where is the latest code?
> Ticket #2223 says it is on Jeff's ompi-topo-fixes bitbucket branch.
>  https://bitbucket.org/jsquyres/ompi-topo-fixes
> But Jeff seems to have one more branch with a similar name.
>  https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed

You are correct; I should update #2223 -- the "-fixed" one is the right one.

I think the original one got hosed somehow, and I created the "-fixed" one for 
all future commits.

George and I did a bunch of work on this ticket, but AFAIK, it never got 
finished.  It might be within a delta of debugging, though.  I notice that some 
of the IBM topology tests are failing (i.e., not Dist_graph stuff -- normal 
topology stuff).  So something possibly got a little hosed in the topo base in 
the revamp.

And then, as mentioned on the ticket, there's no Dist_graph tests to know what 
the status is of the implementation of those functions.

> Ticket #3127 says it is on Jeff's mpi22-c-complex bitbucket branch.
> But there is no such branch now.
>  https://bitbucket.org/jsquyres/mpi22-c-complex

Oops -- I put it back.

It looks like I did one commit that is annotated "First cut -- doesn't work 
yet."  I think the last status on that ticket is probably accurate.

I updated both bitbuckets this morning to be at the head of the SVN trunk, so 
they're good and recent.

Let me know if you want to fork/do pull requests, or if you just want write 
access to those repos.

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] MPI-2.2 status #2223, #3127

2013-01-18 Thread Jeff Squyres (jsquyres)
George --

Should I pull from your repo into 
https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed?  Or did you effectively 
fork, and you guys will put back to SVN when you're done?


On Jan 18, 2013, at 5:47 AM, George Bosilca 
 wrote:

> Takahiro,
> 
> The MPI_Dist_graph effort is happening in 
> ssh://h...@bitbucket.org/bosilca/ompi-topo. I would definitely be interested 
> in seeing some test cases, and giving this branch a tough test.
> 
>  George.
> 
> On Jan 18, 2013, at 02:43 , "Kawashima, Takahiro" 
>  wrote:
> 
>> Hi,
>> 
>> Fujitsu is interested in completing MPI-2.2 on Open MPI and Open MPI
>> -based Fujitsu MPI.
>> 
>> We've read wiki and tickets. These two tickets seem to be almost done
>> but need testing and bug fixing.
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/2223
>> MPI-2.2: MPI_Dist_graph_* functions missing
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/3127
>> MPI-2.2: Add reduction support for MPI_C_*COMPLEX and MPI::*COMPLEX
>> 
>> My colleagues are planning to work on these. They will write test codes
>> and try to fix bugs. Test codes and patches can be contributed to the
>> community. If they cannot fix some bugs, we will report details. They
>> are planning to complete them in around March.
>> 
>> With that, two questions.
>> 
>> Are the latest statuses written in these ticket comments correct?
>> Is there any more progress?
>> 
>> Where is the latest code?
>> Ticket #2223 says it is on Jeff's ompi-topo-fixes bitbucket branch.
>> https://bitbucket.org/jsquyres/ompi-topo-fixes
>> But Jeff seems to have one more branch with a similar name.
>> https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed
>> Ticket #3127 says it is on Jeff's mpi22-c-complex bitbucket branch.
>> But there is no such branch now.
>> https://bitbucket.org/jsquyres/mpi22-c-complex
>> 
>> Best regards,
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] MPI-2.2 status #2223, #3127

2013-01-18 Thread Jeff Squyres (jsquyres)
Ok.  If it contains everything you put on the original topo-fixes (and 
topo-fixes-fixed), I might as well kill those two repos and put your repo URL 
on the ticket.

So -- before I whack those two -- can you absolutely confirm that you've got 
everything from the topo-fixes-fixed repo?  IIRC, there was some other 
fixes/updates to the topo base in there, not just the new dist_graph 
improvements.


On Jan 18, 2013, at 11:06 AM, George Bosilca 
 wrote:

> It's a fork from the official ompi (well the hg version of it). We will push 
> back once we're done.
> 
>  George.
> 
> On Jan 18, 2013, at 15:42 , "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> George --
>> 
>> Should I pull from your repo into 
>> https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed?  Or did you 
>> effectively fork, and you guys will put back to SVN when you're done?
>> 
>> 
>> On Jan 18, 2013, at 5:47 AM, George Bosilca 
>> wrote:
>> 
>>> Takahiro,
>>> 
>>> The MPI_Dist_graph effort is happening in 
>>> ssh://h...@bitbucket.org/bosilca/ompi-topo. I would definitely be 
>>> interested in seeing some test cases, and giving this branch a tough test.
>>> 
>>> George.
>>> 
>>> On Jan 18, 2013, at 02:43 , "Kawashima, Takahiro" 
>>>  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Fujitsu is interested in completing MPI-2.2 on Open MPI and Open MPI
>>>> -based Fujitsu MPI.
>>>> 
>>>> We've read wiki and tickets. These two tickets seem to be almost done
>>>> but need testing and bug fixing.
>>>> 
>>>> https://svn.open-mpi.org/trac/ompi/ticket/2223
>>>> MPI-2.2: MPI_Dist_graph_* functions missing
>>>> 
>>>> https://svn.open-mpi.org/trac/ompi/ticket/3127
>>>> MPI-2.2: Add reduction support for MPI_C_*COMPLEX and MPI::*COMPLEX
>>>> 
>>>> My colleagues are planning to work on these. They will write test codes
>>>> and try to fix bugs. Test codes and patches can be contributed to the
>>>> community. If they cannot fix some bugs, we will report details. They
>>>> are planning to complete them in around March.
>>>> 
>>>> With that, two questions.
>>>> 
>>>> Are the latest statuses written in these ticket comments correct?
>>>> Is there any more progress?
>>>> 
>>>> Where is the latest code?
>>>> Ticket #2223 says it is on Jeff's ompi-topo-fixes bitbucket branch.
>>>> https://bitbucket.org/jsquyres/ompi-topo-fixes
>>>> But Jeff seems to have one more branch with a similar name.
>>>> https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed
>>>> Ticket #3127 says it is on Jeff's mpi22-c-complex bitbucket branch.
>>>> But there is no such branch now.
>>>> https://bitbucket.org/jsquyres/mpi22-c-complex
>>>> 
>>>> Best regards,
>>>> Takahiro Kawashima,
>>>> MPI development team,
>>>> Fujitsu
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-18 Thread Jeff Squyres (jsquyres)
Done -- thank you!

On Jan 11, 2013, at 3:52 AM, "Kawashima, Takahiro"  
wrote:

> Hi Open MPI core members and Rayson,
> 
> I've confirmed with the authors and created the bibtex reference.
> Could you make a page in the "Open MPI Publications" page that
> links to Fujitsu's PDF file? The attached file contains information
> of title, authors, abstract, link URL, and bibtex reference.
> 
> Best regards,
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
>> Sorry for not replying sooner.
>> I'm talking with the authors (they are not on this list) and
>> will request linking the PDF soon if they allow it.
>> 
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu
>> 
>>> Our policy so far was that adding a paper to the list of publications on the 
>>> Open MPI website was a discretionary action at the authors' request. I 
>>> don't see any compelling reason to change. Moreover, Fujitsu being a 
>>> contributor to the Open MPI community, there is no obstacle to adding a 
>>> link to their paper -- at their request.
>>> 
>>>  George.
>>> 
>>> On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
>>> 
>>>> Hi Ralph,
>>>> 
>>>> Since the whole journal is available online, and is reachable by
>>>> Google, I don't believe we can get into copyright issues by providing
>>>> a link to it (but then, I also know that there are countries that have
>>>> more crazy web page linking rules!).
>>>> 
>>>> http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
>>>> 
>>>> Rayson
>>>> 
>>>> ==
>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>> http://gridscheduler.sourceforge.net/
>>>> 
>>>> Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
>>>> http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
>>>> 
>>>> 
>>>> On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
>>>>> I'm unaware of any formal criteria. The papers currently located there 
>>>>> are those written by members of the OMPI community, but we can certainly 
>>>>> link to something written by someone else, so long as we don't get into 
>>>>> copyright issues.
>>>>> 
>>>>> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
>>>>> 
>>>>>> I found this paper recently, "MPI Library and Low-Level Communication
>>>>>> on the K computer", available at:
>>>>>> 
>>>>>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
>>>>>> 
>>>>>> What are the criteria for adding papers to the "Open MPI Publications" 
>>>>>> page?
>>>>>> 
>>>>>> Rayson
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New ARM patch

2013-01-22 Thread Jeff Squyres (jsquyres)
Leif --

We talked about this a bit on our weekly call today. 

Just to be sure: are you saying that George's patches are *functionally 
correct* for ARM5/6/7 (and broken for ARM 4), but it would be better to 
organize the code a bit better?

If that is correct, was ARM4 working before?

If ARM4 was working before, how important is it?  I.e., would it be ok to 
accept George's stuff for 1.7.0, and then accept any 
improvements/reshuffle/etc. from you for 1.7.1?



On Jan 21, 2013, at 12:15 PM, Leif Lindholm  wrote:

> Hi George,
> 
> Any chance of r27882 being reverted?
> 
> As I told the Fedora guys when that patch originally surfaced[1],
> I'm not overly fond of
> - copying source files around as part of the configure step
> - having separate source files for ARMv6 and ARMv7, when those differences
>  should be easily separated through macros (and would be reusable for 32-bit
>  ARMv8).
> 
> Also, I might have mentioned that bit only on a separate thread on the Fedora 
> list, but the ARMv4 support isn't actually correct (the ASM uses ARMv5-only 
> operations).
> 
> My alternate solution, the basic idea of which I posted over there [2] was to 
> separate ARMv5 and earlier from ARM. Effectively separating the atomics 
> implementation at the boundary where the ARM architecture got 
> load-linked/store-conditional, rather than having a separate source file for 
> every architecture version.
> 
> [1] https://lists.fedoraproject.org/pipermail/arm/2012-November/004434.html
> [2] https://lists.fedoraproject.org/pipermail/arm/2012-November/004460.html
> 
> Best Regards,
> 
> Leif
> 
> -- IMPORTANT NOTICE: The contents of this email and any attachments are 
> confidential and may also be privileged. If you are not the intended 
> recipient, please notify the sender immediately and do not disclose the 
> contents to any other person, use it for any purpose, or store or copy the 
> information in any medium.  Thank you.
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
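
For readers following along, here is a minimal sketch of the macro-based split Leif describes above: one source file, with the architecture boundary expressed through the preprocessor. It assumes a GCC- or Clang-compatible compiler that predefines __arm__ and the ACLE macro __ARM_ARCH (older toolchains may not define the latter). The names DEMO_ARM_HAS_LLSC, demo_mb and demo_publish are hypothetical and are not Open MPI symbols. The non-ARM fallback exists only so that the sketch builds on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature macro: 1 when building for 32-bit ARM at ARMv6 or
 * later, i.e. where load-linked/store-conditional (LDREX/STREX) exists. */
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 6)
#  define DEMO_ARM_HAS_LLSC 1
#else
#  define DEMO_ARM_HAS_LLSC 0
#endif

/* Full memory barrier, selected per architecture version in one place
 * instead of in one source file per version. */
static inline void demo_mb(void)
{
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    /* ARMv7 and 32-bit ARMv8: dedicated barrier instruction. */
    __asm__ __volatile__ ("dmb sy" : : : "memory");
#elif DEMO_ARM_HAS_LLSC
    /* ARMv6: no DMB instruction, use the CP15 barrier operation
     * (assumes ARM state, not Thumb-1). */
    __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" : : "r" (0) : "memory");
#else
    /* ARMv5 and earlier, or a non-ARM host: defer to the compiler builtin. */
    __sync_synchronize();
#endif
}

/* Publish a value: write the payload, issue the barrier, then set the flag. */
static void demo_publish(int *data, volatile int *flag, int value)
{
    assert(NULL != data && NULL != flag);
    *data = value;
    demo_mb();
    *flag = 1;
}
```

The same pattern would let a 32-bit ARMv8 build reuse the ARMv7 branch without a new file, which is the reuse Leif mentions.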


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-22 Thread Jeff Squyres (jsquyres)
George --

Is there any reason not to CMR this to v1.6 and v1.7?


On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: bosilca (George Bosilca)
> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> New Revision: 27880
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> 
> Log:
> My understanding is that an MPI_WAIT() on an inactive request should
> return the empty status (MPI 3.0 page 52 line 46).
> 
> Text files modified: 
>   trunk/ompi/request/req_wait.c | 3 +++   
>   
>   1 files changed, 3 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/request/req_wait.c
> ==
> --- trunk/ompi/request/req_wait.c Sat Jan 19 19:33:42 2013(r27879)
> +++ trunk/ompi/request/req_wait.c 2013-01-21 06:35:42 EST (Mon, 21 Jan 
> 2013)  (r27880)
> @@ -61,6 +61,9 @@
> }
> if( req->req_persistent ) {
> if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> +if (MPI_STATUS_IGNORE != status) {
> +*status = ompi_status_empty;
> +}
> return OMPI_SUCCESS;
> }
> req->req_state = OMPI_REQUEST_INACTIVE;
> ___
> svn-full mailing list
> svn-f...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27881 - trunk/ompi/mca/btl/tcp

2013-01-22 Thread Jeff Squyres (jsquyres)
George --

Similar question on this one: should it be CMR'ed to v1.7?  (I kinda doubt it's 
appropriate for v1.6)


On Jan 21, 2013, at 6:41 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: bosilca (George Bosilca)
> Date: 2013-01-21 06:41:08 EST (Mon, 21 Jan 2013)
> New Revision: 27881
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27881
> 
> Log:
> Make the TCP BTL really fail-safe. It now triggers the error callback on
> all pending fragments when the destination goes down. This allows the PML
> to recalibrate its behavior, either find an alternate route or just give up.
> 
> Text files modified: 
>   trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c |29 
> +++--   
>   trunk/ompi/mca/btl/tcp/btl_tcp_frag.c | 7 ++-   
>   
>   trunk/ompi/mca/btl/tcp/btl_tcp_proc.c | 2 +-
>   
>   3 files changed, 34 insertions(+), 4 deletions(-)
> 
> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
> ==
> --- trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c Mon Jan 21 06:35:42 2013
> (r27880)
> +++ trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c 2013-01-21 06:41:08 EST (Mon, 
> 21 Jan 2013)  (r27881)
> @@ -2,7 +2,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  * University Research and Technology
>  * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2008 The University of Tennessee and The University
> + * Copyright (c) 2004-2013 The University of Tennessee and The University
>  * of Tennessee Research Foundation.  All rights
>  * reserved.
>  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
> @@ -295,6 +295,7 @@
> if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
> opal_socket_errno != EWOULDBLOCK) {
> BTL_ERROR(("send() failed: %s (%d)",
>strerror(opal_socket_errno), opal_socket_errno));
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -359,6 +360,7 @@
> mca_btl_tcp_endpoint_close(btl_endpoint);
> btl_endpoint->endpoint_sd = sd;
> if(mca_btl_tcp_endpoint_send_connect_ack(btl_endpoint) != 
> OMPI_SUCCESS) {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_send_lock);
> OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_recv_lock);
> @@ -389,7 +391,6 @@
> {
> if(btl_endpoint->endpoint_sd < 0)
> return;
> -btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
> btl_endpoint->endpoint_retries++;
> opal_event_del(&btl_endpoint->endpoint_recv_event);
> opal_event_del(&btl_endpoint->endpoint_send_event);
> @@ -401,6 +402,24 @@
> btl_endpoint->endpoint_cache_pos= NULL;
> btl_endpoint->endpoint_cache_length = 0;
> #endif  /* MCA_BTL_TCP_ENDPOINT_CACHE */
> +/**
> + * If we keep failing to connect to the peer let the caller know about
> + * this situation by triggering all the pending fragments callback and
> + * reporting the error.
> + */
> +if( MCA_BTL_TCP_FAILED == btl_endpoint->endpoint_state ) {
> +mca_btl_tcp_frag_t* frag = btl_endpoint->endpoint_send_frag;
> +if( NULL == frag ) 
> +frag = 
> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
> +while(NULL != frag) {
> +frag->base.des_cbfunc(&frag->btl->super, frag->endpoint, 
> &frag->base, OMPI_ERR_UNREACH);
> +
> +frag = 
> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
> +}
> +} else {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
> +}
> +
> }
> 
> /*
> @@ -444,6 +463,7 @@
> 
> /* remote closed connection */
> if(retval == 0) {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -453,6 +473,7 @@
> if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
> opal_socket_errno != EWOULDBLOCK) {
> BTL_ERROR(("recv(%d) failed: %s (%d)",
>btl_endpoint->endpoint_sd, 
> strerror(opal_socket_errno), opal_socket_errno));
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -589,6 +610,7 @@
> address,
>btl_endpoint->endpoint_addr->addr_port, 
> strerror(opal_socket_errno) ) );
> }
> +  
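
The core idea of the (truncated) diff above can be sketched in isolation: when an endpoint is closed in a failed state, every fragment still waiting on it gets its completion callback invoked with an error, so the upper layer can react. The following is a simplified stand-in, not the Open MPI code. All demo_* types, names and the DEMO_ERR_UNREACH value are hypothetical, the pending queue is a plain singly linked list rather than an opal_list_t, and unlike the diff the sketch also clears the in-flight fragment pointer.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERR_UNREACH (-12)   /* arbitrary stand-in for OMPI_ERR_UNREACH */

typedef enum { DEMO_EP_CLOSED, DEMO_EP_CONNECTED, DEMO_EP_FAILED } demo_ep_state_t;

typedef struct demo_frag {
    struct demo_frag *next;                          /* next pending fragment */
    void (*cbfunc)(struct demo_frag *frag, int rc);  /* completion callback */
    int rc;                                          /* last status reported */
} demo_frag_t;

typedef struct {
    demo_ep_state_t state;
    demo_frag_t *send_frag;   /* fragment currently being sent, or NULL */
    demo_frag_t *pending;     /* queue of fragments waiting to be sent */
} demo_endpoint_t;

/* Example callback: record the reported status in the fragment. */
static void demo_record_rc(demo_frag_t *frag, int rc)
{
    frag->rc = rc;
}

/* Detach and return the first pending fragment, or NULL if none is left. */
static demo_frag_t *demo_pop_pending(demo_endpoint_t *ep)
{
    demo_frag_t *frag = ep->pending;
    if (NULL != frag) {
        ep->pending = frag->next;
        frag->next = NULL;
    }
    return frag;
}

/* Close the endpoint. On a failed endpoint, report DEMO_ERR_UNREACH to the
 * in-flight fragment (if any) and to every queued fragment. Returns the
 * number of fragments that were notified. */
static int demo_endpoint_close(demo_endpoint_t *ep)
{
    int notified = 0;
    demo_frag_t *frag;

    assert(NULL != ep);
    if (DEMO_EP_FAILED != ep->state) {
        ep->state = DEMO_EP_CLOSED;   /* ordinary close: nothing to report */
        return 0;
    }
    frag = ep->send_frag;
    ep->send_frag = NULL;
    if (NULL == frag) {
        frag = demo_pop_pending(ep);
    }
    while (NULL != frag) {
        frag->cbfunc(frag, DEMO_ERR_UNREACH);
        ++notified;
        frag = demo_pop_pending(ep);
    }
    return notified;
}
```

Leaving the state at "failed" rather than resetting it to "closed" mirrors the else-branch in the diff.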

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27881 - trunk/ompi/mca/btl/tcp

2013-01-23 Thread Jeff Squyres (jsquyres)
Are you going to develop anything further with regards to this functionality, 
and target that stuff for v1.7?  Or should all of this just wait until 1.9?

(I don't really care either way; I'm asking out of curiosity)


On Jan 22, 2013, at 7:24 PM, George Bosilca  wrote:

> Nobody cared about error cases so far, I don't personally see any incentive 
> to push this patch in the 1.7 right now. But I won't be against as it is not 
> hurting either.
> 
>  George.
> 
> 
> On Jan 22, 2013, at 16:28 , "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> George --
>> 
>> Similar question on this one: should it be CMR'ed to v1.7?  (I kinda doubt 
>> it's appropriate for v1.6)
>> 
>> 
>> On Jan 21, 2013, at 6:41 AM, svn-commit-mai...@open-mpi.org wrote:
>> 
>>> Author: bosilca (George Bosilca)
>>> Date: 2013-01-21 06:41:08 EST (Mon, 21 Jan 2013)
>>> New Revision: 27881
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27881
>>> 
>>> Log:
>>> Make the TCP BTL really fail-safe. It now triggers the error callback on
>>> all pending fragments when the destination goes down. This allows the PML
>>> to recalibrate its behavior, either find an alternate route or just give up.
>>> 
>>> Text files modified: 
>>> trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c |29 
>>> +++--   
>>> trunk/ompi/mca/btl/tcp/btl_tcp_frag.c | 7 ++-   
>>>   
>>> trunk/ompi/mca/btl/tcp/btl_tcp_proc.c | 2 +-
>>>   
>>> 3 files changed, 34 insertions(+), 4 deletions(-)
>>> 
>>> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
>>> ==
>>> --- trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c   Mon Jan 21 06:35:42 
>>> 2013(r27880)
>>> +++ trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c   2013-01-21 06:41:08 EST 
>>> (Mon, 21 Jan 2013)  (r27881)
>>> @@ -2,7 +2,7 @@
>>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>>> * University Research and Technology
>>> * Corporation.  All rights reserved.
>>> - * Copyright (c) 2004-2008 The University of Tennessee and The University
>>> + * Copyright (c) 2004-2013 The University of Tennessee and The University
>>> * of Tennessee Research Foundation.  All rights
>>> * reserved.
>>> * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
>>> @@ -295,6 +295,7 @@
>>>   if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
>>> opal_socket_errno != EWOULDBLOCK) {
>>>   BTL_ERROR(("send() failed: %s (%d)",
>>>  strerror(opal_socket_errno), opal_socket_errno));
>>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
>>>   mca_btl_tcp_endpoint_close(btl_endpoint);
>>>   return -1;
>>>   }
>>> @@ -359,6 +360,7 @@
>>>   mca_btl_tcp_endpoint_close(btl_endpoint);
>>>   btl_endpoint->endpoint_sd = sd;
>>>   if(mca_btl_tcp_endpoint_send_connect_ack(btl_endpoint) != 
>>> OMPI_SUCCESS) {
>>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
>>>   mca_btl_tcp_endpoint_close(btl_endpoint);
>>>   OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_send_lock);
>>>   OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_recv_lock);
>>> @@ -389,7 +391,6 @@
>>> {
>>>   if(btl_endpoint->endpoint_sd < 0)
>>>   return;
>>> -btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
>>>   btl_endpoint->endpoint_retries++;
>>>   opal_event_del(&btl_endpoint->endpoint_recv_event);
>>>   opal_event_del(&btl_endpoint->endpoint_send_event);
>>> @@ -401,6 +402,24 @@
>>>   btl_endpoint->endpoint_cache_pos= NULL;
>>>   btl_endpoint->endpoint_cache_length = 0;
>>> #endif  /* MCA_BTL_TCP_ENDPOINT_CACHE */
>>> +/**
>>> + * If we keep failing to connect to the peer let the caller know about
>>> + * this situation by triggering all the pending fragments callback and
>>> + * reporting the error.
>>> + */
>>> +if( MCA_BTL_TCP_FAILED == btl_endpoint->endpoin

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27881 - trunk/ompi/mca/btl/tcp

2013-01-23 Thread Jeff Squyres (jsquyres)
On Jan 23, 2013, at 10:27 AM, George Bosilca  wrote:

> While we always strive to improve this functionality, it was available as a 
> separate software package for quite some time.

What separate software package are you referring to?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New ARM patch

2013-01-23 Thread Jeff Squyres (jsquyres)
On Jan 23, 2013, at 9:55 AM, Leif Lindholm  wrote:

> To summarize the out-of-line assembler changes of this patch:
> - The patch is functionally correct for ARMv7 (which we know, because the code
> - It also appears to be functionally correct for ARMv6, given reports of
> - It *might* be functionally correct for ARMv5, although I have seen no
> - It is not functionally correct on ARMv4.

Thanks for the summary (snipped, above).

> [snip]
> Basic point is - this is an insufficiently validated patch referred to as
> "an ugly kludge" by the original author (Jon Masters@Red Hat), who created
> it to be able to include it in the Fedora ARMv5 port. I have previously
> provided suggestions for improvements, but it has still been submitted to
> the Open MPI users list without any of those suggestions being acted on.
> 
> I admit to being slightly miffed with it being accepted and applied without
> ever being mentioned on the Open MPI developers list

It was done by one of the core committers (George); it's in our community's 
culture to go commit without discussion on the devel list for many kinds of 
things.  

FWIW: Since we all know each other pretty well, we do a lot of communication 
via IM and telephone in addition to the public mailing list discussions.  This 
is not because we're discussing secret things -- it's just that you can get a 
lot more accomplished in a 10 minute phone call than 15 back-n-forth, 10-page, 
highly detailed emails.

> - only on the users list.

All the above being said: you're absolutely right.  We have not been careful 
about what gets discussed on the users' list vs. the devel list.  You're right 
that this was discussed over on the users' list (because of a bug report; the 
conversation turned to devel-like topics, but stayed on the users' list).  
George committed a fix and then said "how's this?" (on the users' list).  And 
he didn't consult you, the primary maintainer of this section of code.

> A list to which I now find myself subscribed to without having asked
> for or being told about - miffed again.

Sorry about that; this was my fault.  I interpreted your off-post mails to me 
about not being able to post to the users list as an ask to be subscribed 
(since we don't allow posts from unsubscribed users).  

Rather than unsubscribe you, though, I just marked you as "nomail" on the 
users' list.  So you won't receive any further mail from that list, but you're 
still subscribed, so you can post.

> If the main purpose of accepting this patch is to provide a stopgap measure
> for something better, I would much prefer simply incorporating your
> CCASFLAGS workaround into the configure script - removing the out-of-line asm
> implementations of the atomics, but still providing a functional library for
> the most common use-cases.
> 
> Something like:

I tested this patch in v1.6 and v1.7 on my Pi, and it seems to work just fine.  
"make check" passes all the ASM tests.

To be clear: I consider you to be the primary author and maintainer of this 
code, and you're certainly more of an ARM expert than any of us.  George may 
not have realized that someone from ARM was still an active part of the 
community; I'm not sure.

But I, too, vote that we should back out his changes from the trunk and apply 
your suggested patch (his patch did not make it over to v1.6 or v1.7, because I 
was waiting for your response).

We actually do try to get consensus for these kinds of things, so let's give 
George a little time to respond before backing it out.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-24 Thread Jeff Squyres (jsquyres)
Many thanks for the summary!

Can you file tickets about this stuff against 1.7?  Include your patches, etc. 

These are pretty obscure issues and I'm ok not fixing them in the 1.6 branch 
(unless someone has a burning desire to get them fixed in 1.6). 

But we should properly track and fix these in the 1.7 series. I'd mark them as 
"critical" so that they don't get lost in the wilderness of other bugs. 

Sent from my phone. No type good. 

On Jan 22, 2013, at 8:57 PM, "Kawashima, Takahiro"  
wrote:

> George,
> 
> I reported the bug three months ago.
> Your commit r27880 resolved one of the bugs I reported,
> using a different approach.
> 
>  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> 
> But other bugs are still open.
> 
> "(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE."
> from my previous mail is not fixed yet. This can be fixed by my patch
> (ompi/mpi/c/wait.c and ompi/request/request.c parts only) attached
> to another mail of mine.
> 
>  http://www.open-mpi.org/community/lists/devel/2012/10/11561.php
> 
> "(2) MPI_Status for an inactive request must be an empty status."
> from my previous mail is partially fixed. MPI_Wait is fixed by your
> r27880, but MPI_Waitall and MPI_Testall still need to be fixed.
> Code similar to your r27880 should be inserted into
> ompi_request_default_wait_all and ompi_request_default_test_all.
> 
> You can confirm the fixes with the test program status.c attached to
> my previous mail. Run with -n 2. 
> 
>  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> 
> Regards,
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
>> To be honest it was hanging in one of my repos for some time. If I'm not 
>> mistaken it is somehow related to one active ticket (but I couldn't find the 
>> info). It might be good to push it upstream.
>> 
>>  George.
>> 
>> On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)"  
>> wrote:
>> 
>>> George --
>>> 
>>> Is there any reason not to CMR this to v1.6 and v1.7?
>>> 
>>> 
>>> On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
>>> 
>>>> Author: bosilca (George Bosilca)
>>>> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
>>>> New Revision: 27880
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
>>>> 
>>>> Log:
>>>> My understanding is that an MPI_WAIT() on an inactive request should
>>>> return the empty status (MPI 3.0 page 52 line 46).
>>>> 
>>>> Text files modified: 
>>>> trunk/ompi/request/req_wait.c | 3 +++  
>>>>
>>>> 1 files changed, 3 insertions(+), 0 deletions(-)
>>>> 
>>>> Modified: trunk/ompi/request/req_wait.c
>>>> ==
>>>> --- trunk/ompi/request/req_wait.cSat Jan 19 19:33:42 2013(r27879)
>>>> +++ trunk/ompi/request/req_wait.c2013-01-21 06:35:42 EST (Mon, 21 Jan 
>>>> 2013)(r27880)
>>>> @@ -61,6 +61,9 @@
>>>>   }
>>>>   if( req->req_persistent ) {
>>>>   if( req->req_state == OMPI_REQUEST_INACTIVE ) {
>>>> +if (MPI_STATUS_IGNORE != status) {
>>>> +*status = ompi_status_empty;
>>>> +}
>>>>   return OMPI_SUCCESS;
>>>>   }
>>>>   req->req_state = OMPI_REQUEST_INACTIVE;
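
To make Takahiro's points (1) and (2) concrete, here is a small self-contained model of the behavior being asked for in the wait-all path: null and inactive requests yield the empty status, which in MPI terms means source MPI_ANY_SOURCE, tag MPI_ANY_TAG, error MPI_SUCCESS and a zero count. This is only a sketch under stated assumptions, not the ompi_request_default_wait_all code. Every demo_* name is hypothetical, a NULL pointer stands in for MPI_REQUEST_NULL and for MPI_STATUSES_IGNORE, and active requests are assumed to have completed already, so nothing actually blocks.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ANY_SOURCE (-1)   /* stand-in for MPI_ANY_SOURCE */
#define DEMO_ANY_TAG    (-1)   /* stand-in for MPI_ANY_TAG */
#define DEMO_SUCCESS      0    /* stand-in for MPI_SUCCESS */

typedef struct {
    int source;
    int tag;
    int error;
    size_t count;
} demo_status_t;

/* The "empty status" returned for null and inactive requests. */
static const demo_status_t demo_status_empty = {
    DEMO_ANY_SOURCE, DEMO_ANY_TAG, DEMO_SUCCESS, 0
};

typedef enum { DEMO_REQ_INACTIVE, DEMO_REQ_COMPLETE } demo_req_state_t;

typedef struct {
    int persistent;           /* nonzero for persistent requests */
    demo_req_state_t state;
    demo_status_t status;     /* status of the last completed operation */
} demo_request_t;

/* Model of a wait-all: reqs[i] == NULL plays the role of MPI_REQUEST_NULL,
 * statuses == NULL plays the role of MPI_STATUSES_IGNORE. */
static int demo_wait_all(size_t n, demo_request_t **reqs, demo_status_t *statuses)
{
    size_t i;

    assert(NULL != reqs || 0 == n);
    for (i = 0; i < n; ++i) {
        demo_request_t *req = reqs[i];
        if (NULL == req || DEMO_REQ_INACTIVE == req->state) {
            /* Null or inactive request: report the empty status. */
            if (NULL != statuses) {
                statuses[i] = demo_status_empty;
            }
            continue;
        }
        if (NULL != statuses) {
            statuses[i] = req->status;
        }
        if (req->persistent) {
            req->state = DEMO_REQ_INACTIVE;   /* persistent: stays allocated */
        } else {
            reqs[i] = NULL;                   /* becomes the null request */
        }
    }
    return DEMO_SUCCESS;
}
```

The status.c test program mentioned above remains the real check; this model only shows which fields a caller should expect to see.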



Re: [OMPI devel] New ARM patch

2013-01-24 Thread Jeff Squyres (jsquyres)
On Jan 24, 2013, at 8:18 AM, Leif Lindholm  wrote:

> OK. In which case I probably _should_ be on that list.
> *cough* might I however suggest that a statement to that effect is added
> to http://www.open-mpi.org/community/lists/ompi.php ?

Fair point.  Done.

>> I tested this patch in v1.6 and v1.7 on my Pi, and it seems to work
>> just fine.  "make check" passes all the ASM tests.
> 
> Just to be perfectly clear: it wouldn't on ARMv5 though, and the ARMv6
> ASM test executed with NOPs for barriers, although it would correctly
> pass all other tests.

Mmm.  Ok.  So is this a correct list of what is supported right now (i.e., in 
v1.6 with your patch)

ARM4: no
ARM5: no
ARM6: sorta (not multi-core, or anywhere we would need barriers)
ARM7: yes

?

How would George's patch have changed that list?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New ARM patch

2013-01-25 Thread Jeff Squyres (jsquyres)
On Jan 25, 2013, at 7:28 AM, Leif Lindholm  wrote:

>> Mmm.  Ok.  So is this a correct list of what is supported right now (i.e., 
>> in v1.6 with your patch)
>> ARM4: no
>> ARM5: no
>> ARM6: sorta (not multi-core, or anywhere we would need barriers)
>> ARM7: yes
> 
> Correct, that is what is supported with out-of-line assembler functions
> - i.e. when explicitly building with -DOMPI_DISABLE_INLINE_ASM.
> They are all supported (and correctly using barriers) otherwise.

Here's what I have done:

1. Committed your patch to v1.6.  George's patch was not committed to v1.6.

2. I opened https://svn.open-mpi.org/trac/ompi/ticket/3481 to track your 
proposal of re-implementing/revamping the ARM ASM code.  

Do you have a timeline for when that can be done?

3. Since no one is currently MTT testing Open MPI on ARM, I added the following 
statement in the v1.6 README file under "Other systems have been lightly (but 
not fully tested):"

  - ARM4, ARM5, ARM6, ARM7 (when using non-inline assembly; only ARM7
is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).

--> Is this correct?

--> Do you think you'll be able to setup some MTT on ARM platforms?

4. I also added the following to v1.6 NEWS:

- Automatically provide compiler flags that compile properly on some
  types of ARM systems.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6.4rc2 released

2013-01-26 Thread Jeff Squyres (jsquyres)
In the usual location:

http://www.open-mpi.org/software/ompi/v1.6/

Changes since rc1:

- Automatically provide compiler flags that compile properly on some
  types of ARM systems.
- Fix slot_list behavior when multiple sockets are specified.  Thanks
  to Siegmar Gross for reporting the problem.
- Fixed memory leak in one-sided operations.  Thanks to Victor
  Vysotskiy for letting us know about this one.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Looking for a replacement call for repeated call to MPI_IPROBE

2013-01-26 Thread Jeff Squyres (jsquyres)
First off, 1.4.4 is fairly ancient.  You might want to try upgrading to 1.6.3.

Second, you might want to use non-blocking receives for B such that you can 
MPI_WAITALL, or perhaps MPI_WAITSOME or MPI_WAITANY to wait for some/all of the 
values to arrive in B.  This keeps any looping down in MPI (i.e., as close to 
the hardware as possible).


On Jan 25, 2013, at 3:21 PM, Jeremy McCaslin  wrote:

> Hello,
> 
> I am trying to figure out the most appropriate MPI calls for a certain 
> portion of my code.  I will describe the situation here:
> 
> Each cell (i,j) of my array A is being updated by a calculation that depends 
> on the values of 1 or 2 of the 4 possible neighbors A(i+1,j), A(i-1,j), 
> A(i,j+1), and A(i,j-1).  Say, for example, A(i,j)=A(i-1,j)*A(i,j-1).  The 
> thing is, the values of the neighbors A(i-1,j) and A(i,j-1) cannot be used 
> until an auxiliary array B has been updated from 0 to 1.  The values B(i-1,j) 
> and B(i,j-1) are changed from 0 -> 1 after the values A(i-1,j) and A(i,j-1) 
> have been communicated to the proc that contains cell (i,j), as cells (i-1,j) 
> and (i,j-1) belong to different procs.  Here is pseudocode for how I have the 
> algorithm implemented (in fortran):
> 
> do while (B(ii,jj,kk).eq.0)
>  if (probe_for_message(i0,j0,k0,this_sc)) then
>   my_ibuf(1)=my_ibuf(1)+1
>   A(i0,j0,k0)=this_sc
>   B(i0,j0,k0)=1
>  end if
> end do
> 
> The function 'probe_for_message' uses an 'MPI_IPROBE' to see if 
> 'MPI_ANY_SOURCE' has a message for my current proc.  If there is a message, 
> the function returns a true logical and calls 'MPI_RECV', receiving 
> (i0,j0,k0,this_sc) from the proc that has the message.  This works!  My 
> concern is that I am probing repeatedly inside the while loop until I receive 
> a message from a proc such that ii=i0, jj=j0, kk=k0.  I could potentially 
> call MPI_IPROBE many many times before this happens... and I'm worried that 
> this is a messy way of doing this.  Could I "break" the mpi probe call?  Are 
> there MPI routines that would allow me to accomplish the same thing in a more 
> formal or safer way?  Maybe a persistent communication or something?  For 
> very large computations with many procs, I am observing a hanging situation 
> which I suspect may be due to this.  I observe it when using openmpi-1.4.4, 
> and the hanging seems to disappear if I use mvapich.  Any 
> suggestions/comments would be greatly appreciated.  Thanks so much!
> 
> -- 
> JM


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Jeff Squyres (jsquyres)
I'll +1 what Brian said: we *really* don't want to have to link Open MPI with a 
C++ compiler.

Can't you rpath in whatever support libraries you need (e.g., the g++ libraries 
with the cxx_personality symbol), such that when we -ltorque, it just pulls in 
whatever other dependencies it needs?

(I'm assuming that you're extern "C"'ing all the tm_*() function calls so that 
they can be called from C code, not C++ code)
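
As an illustration of the extern "C" point, a header shared between a C++ implementation and C callers usually wraps its declarations in a linkage guard like the one below. This is a generic sketch and not the TORQUE tm.h; demo_tm_init and its parameter are hypothetical. The stand-in definition is included only so the sketch links on its own. As far as I understand it, a configure-time "checking for tm_init in -ltorque" test typically links a tiny C program against the library, so it needs both unmangled symbol names and resolvable C++ runtime dependencies.

```c
#include <assert.h>
#include <stddef.h>

/* Header part: when included from C++, request C linkage so the symbol is
 * not name-mangled; when included from C, the guard disappears. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical init call: stores the node count and returns 0 on success. */
int demo_tm_init(int *nnodes);

#ifdef __cplusplus
}
#endif

/* Stand-in definition. In a library implemented in C++, this body would
 * live in a C++ source file that includes the header above, so that the
 * exported symbol is still the plain C name "demo_tm_init". */
int demo_tm_init(int *nnodes)
{
    assert(NULL != nnodes);
    *nnodes = 1;
    return 0;
}
```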


On Jan 28, 2013, at 2:14 PM, "Barrett, Brian W"  wrote:

> On 1/28/13 11:54 AM, "David Beer"  wrote:
> 
>> checking for tm_init in -ltorque... no
>> configure: error: TM support requested but not found.  Aborting
>> 
>> Oddly enough, if you have already configured with an older version of 
>> TORQUE, you can build open-mpi with TORQUE 4.2 installed, so it can find the 
>> function definitions when compiling, its just for some reason it doesn't 
>> find them in the configure script. This is why I think that something in the 
>> configure script is assuming that libtorque was compiled with gcc.
> 
> Right, the configure output to stdout/stderr isn't very useful in diagnosing 
> why a test failed.  The config.log file generated by configure will have much 
> more information.
> 
> Brian
> 
> --
>   Brian W. Barrett
>   Scalable System Software Group
>   Sandia National Laboratories


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Jeff Squyres (jsquyres)
You're basically telling your build system to use a C++ compiler as the linker 
when creating libtorque.  This probably does more-or-less what I suggested: 
rpath'ing in whatever dependencies you need such that when we link against 
libtorque, all of the (C++) dependencies that you need are automatically pulled 
along.

Glad it got resolved!


On Jan 28, 2013, at 4:55 PM, David Beer 
 wrote:

> 
> On Mon, Jan 28, 2013 at 12:14 PM, Barrett, Brian W  wrote:
>> On 1/28/13 11:54 AM, "David Beer"  wrote:
>> 
>>> checking for tm_init in -ltorque... no
>>> configure: error: TM support requested but not found.  Aborting
>>> 
>>> Oddly enough, if you have already configured with an older version of TORQUE, 
>>> you can build open-mpi with TORQUE 4.2 installed, so it can find the function 
>>> definitions when compiling, it's just that for some reason it doesn't find 
>>> them in the configure script. This is why I think that something in the 
>>> configure script is assuming that libtorque was compiled with gcc.
>> 
>> Right, the configure output to stdout/stderr isn't very useful in diagnosing 
>> why a test failed.  The config.log file generated by configure will have much 
>> more information.
> 
> All,
> 
> Thanks for your help. I found a way to resolve this by changing OpenMPI's 
> configure script, but then someone who knows a bit more about these things 
> showed me that we can solve this by defining some more things on our end, 
> namely adding:
> 
> +LT_LANG([C++])
> +AC_SUBST([LIBTOOL_DEPS])
> 
> and 
> 
> +CCLD="$CXX"
> +AC_SUBST([CCLD])
> +LIBTOOLFLAGS="--tag=CXX"
> +AC_SUBST([LIBTOOLFLAGS])
> 
> to our configure.ac and 
> 
> +LIBTOOL_DEPS = @LIBTOOL_DEPS@
> +libtool: $(LIBTOOL_DEPS)
> + $(SHELL) ./config.status --recheck
> 
> to our Makefile.am. I'm going to try to look some things up and see why this 
> makes a difference, but I'm guessing that we previously had an incomplete 
> definition that confused OpenMPI's configure script. Thanks for all of your 
> help and I'm glad we could resolve this by fixing TORQUE.
> 
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Looking for a replacement call for repeated call to MPI_IPROBE

2013-01-28 Thread Jeff Squyres (jsquyres)
Is there a reason you're using buffered sends?  They're generally pretty evil:

 
http://blogs.cisco.com/performance/top-10-reasons-why-buffered-sends-are-evil/

FWIW, you can probably install Open MPI 1.6.3 yourself -- you can just install 
it under $HOME, or some other directory that is available on all compute nodes. 
 Then just set your PATH and LD_LIBRARY_PATH to point to your install instead 
of the system install.

This would at least let you know if upgrading to 1.6.3 will fix your issue.



On Jan 28, 2013, at 12:22 PM, Jeremy McCaslin  wrote:

> Thank you for the feedback.  I actually just changed the repeated probing for 
> a message to a blocking MPI_RECV, as the processor waiting to receive does 
> nothing but repeatedly probe until the message is there anyway.  This also 
> works, and it makes more sense to do it this way.  However, this did not fix 
> my hanging issue.  I am wondering if it has something to do with the size of 
> my buffer used in MPI_BUFFER_ATTACH.  I believe I am following the proper 
> MPI_BSEND_OVERHEAD protocol.  I am waiting on the admins to install 
> openmpi-1.6.3, and hoping that maybe this will fix my issue.
> 
> On Sat, Jan 26, 2013 at 7:32 AM, Jeff Squyres (jsquyres)  
> wrote:
> First off, 1.4.4 is fairly ancient.  You might want to try upgrading to 1.6.3.
> 
> Second, you might want to use non-blocking receives for B such that you can 
> MPI_WAITALL, or perhaps MPI_WAITSOME or MPI_WAITANY to wait for some/all of 
> the values to arrive in B.  This keeps any looping down in MPI (i.e., as 
> close to the hardware as possible).
> 
> 
> On Jan 25, 2013, at 3:21 PM, Jeremy McCaslin  wrote:
> 
> > Hello,
> >
> > I am trying to figure out the most appropriate MPI calls for a certain 
> > portion of my code.  I will describe the situation here:
> >
> > Each cell (i,j) of my array A is being updated by a calculation that 
> > depends on the values of 1 or 2 of the 4 possible neighbors A(i+1,j), 
> > A(i-1,j), A(i,j+1), and A(i,j-1).  Say, for example, 
> > A(i,j)=A(i-1,j)*A(i,j-1).  The thing is, the values of the neighbors 
> > A(i-1,j) and A(i,j-1) cannot be used until an auxiliary array B has been 
> > updated from 0 to 1.  The values B(i-1,j) and B(i,j-1) are changed from 0 
> > -> 1 after the values A(i-1,j) and A(i,j-1) have been communicated to the 
> > proc that contains cell (i,j), as cells (i-1,j) and (i,j-1) belong to 
> > different procs.  Here is pseudocode for how I have the algorithm 
> > implemented (in fortran):
> >
> > do while (B(ii,jj,kk).eq.0)
> >  if (probe_for_message(i0,j0,k0,this_sc)) then
> >   my_ibuf(1)=my_ibuf(1)+1
> >   A(i0,j0,k0)=this_sc
> >   B(i0,j0,k0)=1
> >  end if
> > end do
> >
> > The function 'probe_for_message' uses an 'MPI_IPROBE' to see if 
> > 'MPI_ANY_SOURCE' has a message for my current proc.  If there is a message, 
> > the function returns a true logical and calls 'MPI_RECV', receiving 
> > (i0,j0,k0,this_sc) from the proc that has the message.  This works!  My 
> > concern is that I am probing repeatedly inside the while loop until I 
> > receive a message from a proc such that ii=i0, jj=j0, kk=k0.  I could 
> > potentially call MPI_IPROBE many many times before this happens... and I'm 
> > worried that this is a messy way of doing this.  Could I "break" the mpi 
> > probe call?  Are there MPI routines that would allow me to accomplish the 
> > same thing in a more formal or safer way?  Maybe a persistent communication 
> > or something?  For very large computations with many procs, I am observing 
> > a hanging situation which I suspect may be due to this.  I observe it when 
> > using openmpi-1.4.4, and the hanging seems to disappear if I use mvapich.  
> > Any suggestions/comments would be greatly appreciated.  Thanks so much!
> >
> > --
> > JM
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> 
> 
> -- 
> JM


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-29 Thread Jeff Squyres (jsquyres)
Thanks Josh.

Steve -- if you can confirm that this fixes your problem in the v1.6 series, 
we'll go ahead and commit the patch.

FWIW: the OpenFabrics startup code got a little cleanup/revamp on the 
trunk/v1.7 -- I suspect that's why you're not seeing the problem on trunk/v1.7 
(e.g., look at the utility routines that were abstracted out to 
ompi/mca/common/verbs).



On Jan 29, 2013, at 2:41 AM, Joshua Ladd  wrote:

> So, we (Mellanox) have observed this ourselves when no suitable CPC can be 
> found. Seems the BTL associated with this port is not destroyed and the ref 
> count is not decreased.  Not sure why you don't see the problem in 1.7. But 
> we have a patch that I'll CMR today. Please review our symptoms, diagnosis, 
> and proposed change. Ralph, maybe I can list you as a reviewer of the patch? 
> I've reviewed myself and it looks fine, but wouldn't mind having another set 
> of eyes on it since I don't want to be responsible for breaking the OpenIB 
> BTL.
> 
> Thanks,
> 
> Josh Ladd
> 
> 
> Reported by Yossi:
> Hi,
> 
> There is a bug in open mpi (openib component) when one of the active ports is 
> Ethernet.
> The fix is attached, probably needs to be reviewed and submitted to ompi
> 
> Error flow:
> 1.  Openib component creates a btl instance for every active port 
> (including Ethernet)
> 2.  Every btl holds a reference count to the device 
> (mca_btl_openib_device_t::btls)
> 3.  Openib tries to create a "connection module" for every btl
> 4.  It fails to create connection module for the Ethernet port
> 5.  The btl for Ethernet port is not returned by openib component, in the 
> list of btl modules
> 6.  The btl for Ethernet port is not destroyed during openib component 
> finalize
> 7.  The device is not destroyed, because of the reference count
> 8.  The memory pool created by the device is not destroyed
> 9.  Later, rdma mpool module cleans up remaining pools during its finalize
> 10. The memory pool created by openib is destroyed by rdma mpool component 
> finalize
> 11. The memory pool points to a function (openib_dereg_mr) which is already 
> unloaded from memory (because mca_btl_openib.so was unloaded)
> 12. Segfault because of a call to invalid function
> 
> The fix:  If a btl module is not going to be returned from openib component 
> init, destroy it.
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Monday, January 28, 2013 8:35 PM
> To: Steve Wise
> Cc: Open MPI Developers
> Subject: Re: [OMPI devel] openib unloaded before last mem dereg
> 
> Out of curiosity, could you tell us how you configured OMPI?
> 
> 
> On Jan 28, 2013, at 12:46 PM, Steve Wise  wrote:
> 
>> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>>> On Jan 28, 2013, at 11:55 AM, Steve Wise  
>>> wrote:
>>> 
>>>> Do you know if the rdmacm CPC is really being used for your connection 
>>>> setup (vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  
>>>> Maybe that's the difference?
>>> Dunno for certain, but I expect it is using the OOB cm since I didn't 
>>> direct it to do anything different. Like I said, I suspect the problem is 
>>> that the cluster doesn't have iWARP on it.
>> 
>> Definitely, or it could be the different CPC used for IW vs IB is tickling 
>> the issue.
>> 
>>>> Steve.
>>>> 
>>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>>> Nope - still works just fine. I didn't receive that warning at all, and 
>>>>> it ran to completion without problem.
>>>>> 
>>>>> I suspect the problem is that the system I can use just isn't 
>>>>> configured like yours, and so I can't trigger the problem. Afraid I 
>>>>> can't be of help after all... :-(
>>>>> 
>>>>> 
>>>>> On Jan 28, 2013, at 11:25 AM, Steve Wise  
>>>>> wrote:
>>>>> 
>>>>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>>>>>>> Hmmm...afraid I cannot replicate this using the current state of the 
>>>>>>> 1.6 branch (which is the 1.6.4rcN) on the only IB-based cluster I can 
>>>>>>> access.
>>>>>>> 
>>>>>>> Can you try it with a 1.6.4 tarball and see if you still see the 
>>>>>>> problem? Could be someone already fixed it.
>>>>>> I still hit it on 1.6.4rc2.
>>>>>> 
>>>>>> Note iWARP != IB so you may not have this issue on IB systems for 
>>>>>> various reasons.  Did you use the same mpirun line? Namely using this:
>>>>>> 
>>>>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>>>>> 
>>>>>> (adjusted to your network config).
>>>>>> 
>>>>>> Because if I don't use ipaddr_include, then I don't see this issue on my 
>>>>>> setup.
>>>>>> 
>>>>>> Also, did you see these logged:
>>>>>> 
>>>>>> Right after starting the job:
>>>>>> 
>>>>>> --
>>>>>>  No OpenFabrics connection schemes reported that they were 
>>>>>> able to be used on a specific port.  As such, the openib BTL 
>>>>>> (OpenFabrics
>>>>>> support) will be disabled for this port.
>>>>>> 
>>>>>>>

[OMPI devel] RFC: Remove (broken) heterogeneous support

2013-01-29 Thread Jeff Squyres (jsquyres)
WHAT: Remove the configure command line option to enable heterogeneous support

WHY: The heterogeneous conversion code isn't working, very few people use this 
feature

WHERE: README and config/opal_configure_options.m4.  See attached patch.

TIMEOUT: Next Tuesday teleconf, 5 Feb, 2013

MORE DETAIL:

The heterogeneous code has been broken for a while.  The assumption is that 
this is a minor bug that can fairly easily be fixed, but a) no one has taken 
the time to do so, b) very few people use this functionality, and c) many OMPI 
developers don't even have hardware on which to test this scenario (e.g., big and 
little endian systems).

As such, a suggestion was made to remove the --enable-heterogeneous configure 
CLI switch so that users don't try to enable it.  If someone ever fixes the 
heterogeneous code, the configure CLI switch can be put back.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


remove-configure-enable-heterogeneous-switch.diff
Description: remove-configure-enable-heterogeneous-switch.diff


Re: [OMPI devel] New ARM patch

2013-01-29 Thread Jeff Squyres (jsquyres)
On Jan 28, 2013, at 8:46 AM, Leif Lindholm  wrote:

> But giving some flexibility for roadblocks, can we say "this quarter"?

Cool.

> Apart from our *cough* convoluted architecture vs. processor naming scheme... 
> It should be ARMv4, ARMv5, ARMv6 and ARMv7.

Fixed in the README; thanks.

>> --> Do you think you'll be able to setup some MTT on ARM platforms?
> 
> I hope so.

Ok.  Ping me whenever you're ready; since our initial discussions about MTT 
were quite a while ago, I assume our prior discussions about MTT have fallen 
out of cache since then.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-29 Thread Jeff Squyres (jsquyres)
It's on the ticket that I just assigned to you.  :-)


On Jan 29, 2013, at 10:03 AM, Steve Wise  wrote:

> Will do...once I get a patch.
> 
> STeve
> On 1/29/2013 7:40 AM, Jeff Squyres (jsquyres) wrote:
>> Thanks Josh.
>> 
>> Steve -- if you can confirm that this fixes your problem in the v1.6 series, 
>> we'll go ahead and commit the patch.
>> 
>> FWIW: the OpenFabrics startup code got a little cleanup/revamp on the 
>> trunk/v1.7 -- I suspect that's why you're not seeing the problem on 
>> trunk/v1.7 (e.g., look at the utility routines that were abstracted out to 
>> ompi/mca/common/verbs).
>> 
>> 
>> 
>> On Jan 29, 2013, at 2:41 AM, Joshua Ladd  wrote:
>> 
>>> So, we (Mellanox) have observed this ourselves when no suitable CPC can be 
>>> found. Seems the BTL associated with this port is not destroyed and the ref 
>>> count is not decreased.  Not sure why you don't see the problem in 1.7. But 
>>> we have a patch that I'll CMR today. Please review our symptoms, diagnosis, 
>>> and proposed change. Ralph, maybe I can list you as a reviewer of the 
>>> patch? I've reviewed myself and it looks fine, but wouldn't mind having 
>>> another set of eyes on it since I don't want to be responsible for breaking 
>>> the OpenIB BTL.
>>> 
>>> Thanks,
>>> 
>>> Josh Ladd
>>> 
>>> 
>>> Reported by Yossi:
>>> Hi,
>>> 
>>> There is a bug in open mpi (openib component) when one of the active ports 
>>> is Ethernet.
>>> The fix is attached, probably needs to be reviewed and submitted to ompi
>>> 
>>> Error flow:
>>> 1.  Openib component creates a btl instance for every active port 
>>> (including Ethernet)
>>> 2.  Every btl holds a reference count to the device 
>>> (mca_btl_openib_device_t::btls)
>>> 3.  Openib tries to create a "connection module" for every btl
>>> 4.  It fails to create connection module for the Ethernet port
>>> 5.  The btl for Ethernet port is not returned by openib component, in the 
>>> list of btl modules
>>> 6.  The btl for Ethernet port is not destroyed during openib component 
>>> finalize
>>> 7.  The device is not destroyed, because of the reference count
>>> 8.  The memory pool created by the device is not destroyed
>>> 9.  Later, rdma mpool module cleans up remaining pools during its finalize
>>> 10. The memory pool created by openib is destroyed by rdma mpool component 
>>> finalize
>>> 11. The memory pool points to a function (openib_dereg_mr) which is already 
>>> unloaded from memory (because mca_btl_openib.so was unloaded)
>>> 12. Segfault because of a call to invalid function
>>> 
>>> The fix:  If a btl module is not going to be returned from openib component 
>>> init, destroy it.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -Original Message-
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
>>> Behalf Of Ralph Castain
>>> Sent: Monday, January 28, 2013 8:35 PM
>>> To: Steve Wise
>>> Cc: Open MPI Developers
>>> Subject: Re: [OMPI devel] openib unloaded before last mem dereg
>>> 
>>> Out of curiosity, could you tell us how you configured OMPI?
>>> 
>>> 
>>> On Jan 28, 2013, at 12:46 PM, Steve Wise  
>>> wrote:
>>> 
>>>> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>>>>> On Jan 28, 2013, at 11:55 AM, Steve Wise  
>>>>> wrote:
>>>>> 
>>>>>> Do you know if the rdmacm CPC is really being used for your connection 
>>>>>> setup (vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  
>>>>>> Maybe that's the difference?
>>>>> Dunno for certain, but I expect it is using the OOB cm since I didn't 
>>>>> direct it to do anything different. Like I said, I suspect the problem is 
>>>>> that the cluster doesn't have iWARP on it.
>>>> Definitely, or it could be the different CPC used for IW vs IB is tickling 
>>>> the issue.
>>>> 
>>>>>> Steve.
>>>>>> 
>>>>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>>>>> Nope - still works just fine. I didn't receive that warning at all, and 
>>>>>>> it ran to completion without problem.
>>>>>>> 
>

Re: [OMPI devel] RFC: opal_list iteration macros

2013-01-29 Thread Jeff Squyres (jsquyres)
Agreed.  I like the idea, and recognize that it is inspired by Linux kernel 
macros.  But I would prefer them to be upper case to match our conventions.

Also, please be sure to put in good comments explaining their use in the .h 
file.

Thanks!


On Jan 29, 2013, at 12:18 PM, Ralph Castain  wrote:

> Ja, I've considered a similar addition on occasion. +1 from me
> 
> Only comment: you should change these to match our convention by making the 
> macros be capital letters: e.g., OPAL_LIST_FOREACH
> 
> On Jan 29, 2013, at 9:08 AM, Nathan Hjelm  wrote:
> 
>> What: Add two new macros to opal_list.h:
>> 
>> #define opal_list_foreach(item, list, type) \
>> for (item = (type *) (list)->opal_list_sentinel.opal_list_next ;  \
>>  item != (type *) &(list)->opal_list_sentinel ;   \
>>  item = (type *) ((opal_list_item_t *) (item))->opal_list_next)
>> 
>> #define opal_list_foreach_safe(item, next, list, type)  \
>> for (item = (type *) (list)->opal_list_sentinel.opal_list_next,   \
>>next = (type *) ((opal_list_item_t *) (item))->opal_list_next ;\
>>  item != (type *) &(list)->opal_list_sentinel ;   \
>>  item = next, next = (type *) ((opal_list_item_t *) (item))->opal_list_next)
>> 
>> The first macro provides a simple iterator over an unchanging list and the 
>> second macro is safe for opal_list_item_remove(item).
>> 
>> Why: These macros provide a clean way to do the following:
>> 
>> for (item = opal_list_get_first (list) ;
>>item != opal_list_get_end (list) ;
>>item = opal_list_get_next (item)) {
>>  some_class_t *foo = (some_class_t *) item;
>>  ...
>> }
>> 
>> becomes:
>> 
>> some_class_t *foo;
>> 
>> opal_list_foreach(foo, list, some_class_t) {
>>  ...
>> }
>> 
>> When: This is a very simple addition but I wanted to give a heads up on the 
>> devel list because these macros are different from what we usually provide 
>> (though they should look familiar to those familiar with the Linux kernel). 
>> I intend to commit these macros to the trunk (and CMR for 1.7.1) tomorrow 
>> (Wed 01/29/13) around 12:00 PM MST.
>> 
>> Thoughts? Comments?
>> 
>> -Nathan Hjelm
>> HPC-3, LANL
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove (broken) heterogeneous support

2013-01-30 Thread Jeff Squyres (jsquyres)
On Jan 30, 2013, at 7:36 AM, Siegmar Gross 
 wrote:

> Hi, I have no problem with the option --enable-heterogeneous, when I build
> Open MPI, but Open MPI will not work in a heterogeneous environment
> with little and big endian machines,

Right -- that's the issue: --enable-heterogeneous is broken (and has been for a 
long time).  No one has stepped up to fix it, so we might as well disable the 
option so that users don't think that we support it.

> while LAM MPI can handle such
> environments. You wanted to solve this problem.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/3430

Understood.  But the reality is that this is a very uncommon feature, and we 
apparently don't have the resources to fix it.  :-\

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove (broken) heterogeneous support

2013-01-30 Thread Jeff Squyres (jsquyres)
On Jan 30, 2013, at 9:40 AM, Andreas Schäfer  wrote:

> But isn't heterogeneity the main reason for having MPI datatypes in
> the first place? Otherwise I could always use MPI_CHAR and sizeof(Foo).

Heterogeneity was a much bigger issue back in the 90s.  Nowadays, most people 
have pretty homogeneous clusters.

>> Understood.  But the reality is that this is a very uncommon feature, and we 
>> apparently don't have the resources to fix it.  :-\
> 
> Could you give a rough estimate on how much effort this would be?


Unfortunately, no.  No one has even looked into this.

I'm *guessing* that it's not a difficult issue to fix -- that something is just 
broken down in the datatype handling of heterogeneous machines.  But that's an 
assumption/guess.  Tracking that down, however, will likely take a little 
effort, especially for someone unfamiliar with the code base.  I can give hints 
where to start looking, but that's about it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6.4rc3 released

2013-01-30 Thread Jeff Squyres (jsquyres)
In the usual place:

http://www.open-mpi.org/software/ompi/v1.6/

Changes since rc2:

- Fix a seg fault in the openib BTL when no connection method can be found.

This is looking pretty stable; it could well be the last rc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] "pml_ob1_sendreq.c:188 FATAL" errors

2013-01-31 Thread Jeff Squyres (jsquyres)
I'm seeing a LOT of these errors on the trunk:

pml_ob1_sendreq.c:188 FATAL

The job then hangs.  I see this starting at np=6 across 2 nodes, using only the 
TCP and SM BTLs.  This is not happening on v1.6 or v1.7.  Line 188 in 
pml_ob1_sendreq.c is when someone calls mca_pml_ob1_match_completion_free() 
with a non-OMPI_SUCCESS status.

*** Is anyone else seeing this?

My configure is very straightforward:

./configure --prefix=/home/jsquyres/bogus --disable-dlopen --disable-vt

I notice that this is only happening in optimized builds; it is not happening 
when I do a normal developer / debug build.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] "pml_ob1_sendreq.c:188 FATAL" errors

2013-01-31 Thread Jeff Squyres (jsquyres)
The show help bit doesn't look right -- opal_output on stream 0 will put the 
hostname and PID as the prefix.


On Jan 31, 2013, at 6:13 PM, Ralph Castain 
 wrote:

> I fixed it so that "abort" really aborts the job - see r28004
> 
> On Jan 31, 2013, at 2:02 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> I'm seeing a LOT of these errors on the trunk:
>> 
>>   pml_ob1_sendreq.c:188 FATAL
>> 
>> The job then hangs.  I see this starting at np=6 across 2 nodes, using only 
>> the TCP and SM BTLs.  This is not happening on v1.6 or v1.7.  Line 188 in 
>> pml_ob1_sendreq.c is when someone calls mca_pml_ob1_match_completion_free() 
>> with a non-OMPI_SUCCESS status.
>> 
>> *** Is anyone else seeing this?
>> 
>> My configure is very straightforward:
>> 
>>   ./configure --prefix=/home/jsquyres/bogus --disable-dlopen --disable-vt
>> 
>> I notice that this is only happening in optimized builds; it is not 
>> happening when I do a normal developer / debug build.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: shiny new variable subsystem

2013-02-01 Thread Jeff Squyres (jsquyres)
+1.

Nathan and I have been talking about this for quite a while.  Note that this is 
the first of several updates that we will have for the MCA param system -- we 
have a roadmap that can be easily described as two groups of things:

1. Changes that are intended for v1.7.x: everything to support MPI-3's MPI_T 
interface
2. Changes that can wait until 1.9: long-standing MCA param feature requests and 
breaking backwards compatibility

Like Nathan said, we'll talk through this next Tuesday.  He highlighted most of 
the important parts, but I wanted to explain one point a little further: as 
Nathan said, we took the opportunity to greenfield the MCA param API.  Meaning: 
we designed a whole new OPAL MCA param API, taking into account all that we 
have learned from years of experience with the current MCA param API.

This is worth explaining a bit:

- The new API will be pushed to v1.7.
- But *the old API will also exist in v1.7* (as a thin shim compatibility layer 
to the new API)
- We will update all components and frameworks in both trunk and v1.7 to use 
the new API
- Any off-trunk development can still use the old API for the duration of the 
v1.7/v1.8 series
- After a while, *we will remove the old API/compatibility shim on the 
trunk/v1.9* (dates TBD)
*** Off-trunk development will therefore need to be updated for trunk/v1.9

The new MCA param API is cleaner, simpler, and provides for a lot of 
functionality that we don't currently have, like:

- un-overrideable system-level MCA params (as Nathan described)
- much shorter default ompi_info output (e.g., by default, only showing users 
the MCA params they really care about, not all 6 billion of them)
- ...and others



On Jan 31, 2013, at 8:22 PM, Nathan Hjelm  wrote:

> What: Introduce the MCA variable system. This system is meant as a 
> replacement for the MCA parameter system. It offers a number of improvements 
> over the old system including:
> 
>  - Cleaner, expandable API. Instead of providing multiple variable 
> registration functions (reg_int, reg_int_name, reg_string, reg_string_name) 
> there are only two (three when the framework system is introduced): 
> mca_base_var_register() and mca_base_component_var_register().
> 
>  - Support for boolean variables.
> 
>  - Support for true/false values for integer and boolean variables. ex: 
> setting OMPI_MCA_mca_component_show_load_errors=true will work the same as 
> OMPI_MCA_mca_component_show_load_errors=1.
> 
>  - Support for integer enumerations. Example: create an integer variable foo 
> with possible values 0:none, 1:error, 2:warning then any of 
> OMPI_MCA_foo=none OMPI_MCA_foo=0 OMPI_MCA_foo=warning etc will work. A 
> warning is printed for values not enumerated.
> 
>  - Support for system-administrator forced variables through the use of 
> etc/openmpi-mca-params-override.conf. A warning is printed (which can be 
> suppressed) for any attempt to override one of these values.
> 
>  - Support for variable scopes (constant, read-only, local, all, group, etc). 
> Equivalent to MPI_T scopes.
> 
>  - Support for setting verbosity levels for each parameter. This will enable 
> us to add a --level option to ompi_info to reduce the number of variables 
> shown. Equivalent to MPI_T verbosity.
> 
>  - Renamed the read_only attribute for parameters to default only. This name 
> better reflects the meaning of these variables.
> 
>  - Variables are now broken down by group. A group is a project, framework, 
> or component. Ex: opal, opal_shmem, opal_shmem_mmap. Groups are automatically 
> registered when a variable is registered. You can set a description for a 
> group by calling mca_base_var_group_register before registering any 
> variables. Groups are equivalent to MPI_T categories.
> 
>  - Variables must be registered with a backing store that must live at least 
> as long as the variable (no stack variables-- unless of course they are 
> deregistered before return). This means changes to a variable made with 
> mca_base_var_set_value() will be immediately visible in the registered 
> storage. There is no need to "lookup" the value.
> 
>  - Environment and file values are only looked up at registration time. After 
> registration a variable can change by either: 1) the registree changes the 
> value, or 2) the value is changed with mca_base_var_set_value().
> 
>  - File values are preserved in mca_base_var_file_values so there is no 
> longer a need to recache files. The values are still stored in an 
> opal_list_t. Since the list is only referenced at registration time this 
> shouldn't be an issue.
> 
>  - etc
> 
> Why: This RFC is one of a number of RFCs that will eventually bring full 
> support for the MPI tool interface to Open MPI. This change is intended to 
> support the entirety of the MPI tools control variable API (except 
> MPI_T_cvar_write-- that will be supported by a future update). A quick 
> background for this change: the MCA parameter system needed to be augmented 
> to supp

Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-01 Thread Jeff Squyres (jsquyres)
On Feb 1, 2013, at 6:28 PM, George Bosilca  wrote:

> So far, all interfaces specified via MCA parameters for the BTL TCP
> are required to exist. Otherwise an error message is printed and an
> error returned to the upper level, with the intent that no BTLs of
> this type will be enabled (as an example btl_tcp_component.c:682).

Actually, it doesn't -- that's why I made this one match the other behavior.  

For example, if I exclude an interface that doesn't exist (on v1.6 HEAD):

-
[15:40] savbu-usnic:~/svn/ompi-1.6/examples % mpirun -np 2 --mca 
btl_tcp_if_exclude lo,bogus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
[15:40] savbu-usnic:~/svn/ompi-1.6/examples % 
-

Or if I include an interface that doesn't exist (although this one warns):

-
[15:40] savbu-usnic:~/svn/ompi-1.6/examples % mpirun -np 2 --mca 
btl_tcp_if_include eth0,bogus ring_c
[savbu-usnic][[7221,1],0][btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
 invalid interface "bogus"
[savbu-usnic][[7221,1],1][btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
 invalid interface "bogus"
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
[15:42] savbu-usnic:~/svn/ompi-1.6/examples % 
-

Are there other cases that I'm missing where we *do* abort?

If so, we should probably be consistent: pick one way (abort or not abort) and 
do that in all cases.  I don't think I have much of an opinion here on which 
way we should go; I can see multiple arguments:

- We should abort: we have a large precedent in many other places in OMPI that 
if a human asks for something OMPI can't deliver, we abort and make the human 
figure it out.

- We should warn/not abort: this is the behavior we've had for a long time.  
Changing it may break backwards compatibility.



> If I correctly understand your commit, it changes this [so far
> consistent] behavior for a single one of our TCP MCA parameters (if_seq)
> to: print an error message and then continue. As you set
> the mca_btl_tcp_component.tcp_if_seq to NULL this is as if this
> argument was never provided.
> 
> I prefer the old behavior for its corrective meaning (you fix it and
> then it works), as well as for its consistency with the other BTL TCP
> parameters.
> 
>  George.
> 
> 
> 
> On Fri, Feb 1, 2013 at 3:17 PM,   wrote:
>> Author: jsquyres (Jeff Squyres)
>> Date: 2013-02-01 15:17:43 EST (Fri, 01 Feb 2013)
>> New Revision: 28016
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/28016
>> 
>> Log:
>> As the help message states, it's not an ''error'' if the specified
>> interface is not found.  It should just be skipped.
>> 
>> Text files modified:
>>   trunk/ompi/mca/btl/tcp/btl_tcp_component.c | 8 +---
>>   1 files changed, 5 insertions(+), 3 deletions(-)
>> 
>> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_component.c
>> ==
>> --- trunk/ompi/mca/btl/tcp/btl_tcp_component.c  Fri Feb  1 09:27:37 2013 
>>(r28015)
>> +++ trunk/ompi/mca/btl/tcp/btl_tcp_component.c  2013-02-01 15:17:43 EST 
>> (Fri, 01 Feb 2013)  (r28016)
>> @@ -314,10 +314,12 @@
>>ompi_process_info.nodename,
>>mca_btl_tcp_component.tcp_if_seq,
>>"Interface does not exist");
>> -return OMPI_ERR_BAD_PARAM;
>> +free(mca_btl_tcp_component.tcp_if_seq);
>> +mca_btl_tcp_component.tcp_if_seq = NULL;
>> +} else {
>> +BTL_VERBOSE(("Node rank %d using TCP interface %s",
>> + node_rank, mca_btl_tcp_component.tcp_if_seq));
>> }
>> -BTL_VERBOSE(("Node rank %d using TCP interface %s",
>> - node_rank, mca_btl_tcp_component.tcp_if_seq));
>> }
>> }
>> 
>> ___
>> svn mailing list
>> s...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/abo

Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-01 Thread Jeff Squyres (jsquyres)
On Feb 1, 2013, at 7:09 PM, George Bosilca  wrote:

> I did not say we abort, I say we prevent BTL TCP from being used.

Ah.

> In your example, I guess the TCP is disabled but the PML finds another
> available interface and keeps going. If I try the same thing with
> "--mca btl tcp,self" it does abort on my cluster.
> 
> ---
> mpirun -np 2 --mca btl tcp,self --mca btl_tcp_if_include eth3 ./ring_c
> [dancer02][[48001,1],1][../../../../../ompi/ompi/mca/btl/tcp/btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
> invalid interface "eth3"

Good point.

But it looks like that behavior doesn't occur for btl_tcp_if_exclude:

--
[16:57] savbu-usnic:~/svn/ompi-1.6/examples % mpirun --host node001,node002 
--mca btl tcp,self --mca btl_tcp_if_exclude lo,bogus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
[16:58] savbu-usnic:~/svn/ompi-1.6/examples % 
--

So it sounds like I should:

1. put if_seq back the way it was
2. fix the if_seq show_help message to say that TCP won't be used (right now it 
just says that the value will be ignored -- which is one of the other reasons I 
changed the behavior to ignore the value)
3. make btl_tcp_if_exclude exhibit the same behavior (if a bogus interface is 
specified, disable TCP)
4. make all error cases nice-nice with show_help instead of BTL_VERBOSE :-)

Agree?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-04 Thread Jeff Squyres (jsquyres)
On Feb 1, 2013, at 9:59 PM, "Barrett, Brian W"  wrote:

> I don't think this is right either. Excluding a device that doesn't exist has 
> many use cases. Such as disabling a network that only exists on part of the 
> cluster.  I'm not sure about what to do with seq; it's more like include than 
> exclude.

Hmm.  I've now given this quite a bit of thought.  Here's what I think:

1. Just like there might be good reasons to exclude non-existent interfaces 
(e.g., networks that only exist on part of the cluster), the same argument 
could be made for *including* non-existent interfaces.

2. It seems odd to me to have different behavior for non-existent interfaces 
between include, exclude, and/or seq.

3. We have a very strong precedent throughout OMPI that if a human asks for 
something that OMPI can't deliver, OMPI should error.  According to this, and 
according to the Law of Least Surprise, I would think that if I typo an exclude 
interface name, OMPI should error and make a human figure it out.

4. If someone wants different includes/excludes in different parts of the 
cluster, then they should have per-node values for these MCA params.

5. That being said, #4 is not always feasible.  Concrete example (which is why 
this whole thing started, incidentally): in my MTT cluster at Cisco, I have 
*some* nodes with back-to-back interfaces.  I can't think of a good way to have 
per-node MCA params in an MTT run that is SLURM-queued and may end up on random 
nodes in my cluster -- that may or may not include nodes with loopback 
interfaces.

So how about this compromise:

If an invalid include, exclude, or if_seq interface is specified:
- If that interface is prefaced with "nowarn:", silently ignore that token
- Otherwise, display a show_help message and ignore the TCP BTL

For example:

mpirun --mca btl_tcp_if_include nowarn:eth5,eth6

- If eth5 doesn't exist, the job will continue just as if eth5 wasn't specified
- If eth6 doesn't exist, the TCP BTL will disqualify itself
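
A minimal sketch of the proposed "nowarn:" semantics, for illustration only. It is not Open MPI code: the helper names (if_exists, check_if_list) and the idea of passing in the list of interfaces present on the node are assumptions made for this example, and the real BTL would call show_help rather than fprintf.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: is `name` one of the interfaces present on this node? */
static bool if_exists(const char *name, const char *const *present, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (0 == strcmp(name, present[i])) {
            return true;
        }
    }
    return false;
}

/*
 * Walk a comma-separated interface list such as "nowarn:eth5,eth6".
 * Returns true if every token either exists or carries the "nowarn:" prefix.
 * Returns false (meaning "the TCP BTL should disqualify itself") as soon as
 * a missing interface without the prefix is found.
 * Note: uses strtok(), so this sketch is not thread-safe.
 */
static bool check_if_list(const char *list, const char *const *present, size_t n)
{
    static const char prefix[] = "nowarn:";
    char buf[256];

    /* Work on a copy because strtok() modifies its argument. */
    if (strlen(list) >= sizeof(buf)) {
        return false;
    }
    strcpy(buf, list);

    for (char *tok = strtok(buf, ","); NULL != tok; tok = strtok(NULL, ",")) {
        bool quiet = (0 == strncmp(tok, prefix, sizeof(prefix) - 1));
        const char *name = quiet ? tok + (sizeof(prefix) - 1) : tok;

        if (if_exists(name, present, n)) {
            continue;   /* valid interface: nothing to report */
        }
        if (quiet) {
            continue;   /* missing, but excused by "nowarn:" */
        }
        fprintf(stderr, "invalid interface \"%s\"\n", name);
        return false;   /* missing and not excused */
    }
    return true;
}
```

With "nowarn:eth5,eth6", a node that has eth6 but not eth5 passes the check, while a node that lacks eth6 fails it, which matches the two bullets above.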

(BTW: yes, I'm volunteering to code up whatever we agree on)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-04 Thread Jeff Squyres (jsquyres)
On Feb 4, 2013, at 2:03 PM, George Bosilca  wrote:

> The two behaviors you describe for include and exclude do not look 
> conflicting to me. Inclusion is a strong request: the user enforces the usage 
> of a specific interface. If the interface is not available, then we have a 
> problem. Exclude, on the other hand, must enforce that a specific interface is 
> not in use, which is quite simple to guarantee if the interface is not available.

I still maintain that it's equally disastrous if you don't exclude the correct 
interfaces (I lost 2 nights of MTT because of this!).

> I'm not a fan of the nowarn option. Seems like a lot of code with limited 
> interest, especially if we only plan to support it in TCP.

This is a good point -- I wonder what openib (and others?) do who support 
*_if_include and *_if_exclude notation.  Do they warn / error if you specify an 
invalid interface?

> If you need specialized arguments for some of your nodes here is what I do: 
> rename the binaries to .orig, and use the original name to create a sh script 
> that will change the value of mca_param_files to something based on the host 
> name (if such a file exists) and then call the .orig executable. Works like a 
> charm, even when a batch scheduler is used.

That will still be quite difficult to do in MTT.  Remember: all the tests that 
are run in MTT are shared across all of us via the ompi-tests SVN repo.  Are 
you suggesting that I alias every test in the ompi-tests SVN with a public 
script that you should run that should look for some site-specific MCA override 
param file?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn] svn:open-mpi r28029 - trunk/opal/class

2013-02-04 Thread Jeff Squyres (jsquyres)
I think the point is that there are many cases throughout the OMPI code base 
where we do exactly the things listed in these macros.

You certainly don't have to use them, but they can save a little effort when 
you use them.
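
For readers without the OPAL headers at hand, here is a toy sketch of the "remove each item, release it, then get rid of the list" pattern that the macros wrap. All types and function names below (toy_item_t, toy_list_drain, and so on) are made up for this example and are not the real opal_list_t API. The sketch also assumes the list owns exactly one reference per item, which is precisely the assumption George questions in the quoted message below.

```c
#include <stdlib.h>

/* Toy reference-counted list item; NOT the real opal_list_item_t. */
typedef struct toy_item {
    struct toy_item *next;
    int refcount;
} toy_item_t;

typedef struct toy_list {
    toy_item_t *head;
} toy_list_t;

/* Allocate an item holding a single reference. */
static toy_item_t *toy_item_new(void)
{
    toy_item_t *it = calloc(1, sizeof(*it));
    if (NULL != it) {
        it->refcount = 1;
    }
    return it;
}

/* Drop one reference and free the item at zero; returns the remaining count. */
static int toy_item_release(toy_item_t *it)
{
    int left = --it->refcount;
    if (0 == left) {
        free(it);
    }
    return left;
}

/* Adding to the list does not touch the refcount. */
static void toy_list_prepend(toy_list_t *list, toy_item_t *it)
{
    it->next = list->head;
    list->head = it;
}

/* Unlink and return the first item, or NULL if the list is empty. */
static toy_item_t *toy_list_remove_first(toy_list_t *list)
{
    toy_item_t *it = list->head;
    if (NULL != it) {
        list->head = it->next;
        it->next = NULL;
    }
    return it;
}

/* The drain pattern: remove every item and release it once.
 * Returns the number of items removed. */
static int toy_list_drain(toy_list_t *list)
{
    int removed = 0;
    toy_item_t *it;
    while (NULL != (it = toy_list_remove_first(list))) {
        toy_item_release(it);
        ++removed;
    }
    return removed;
}
```

An item that someone else has retained survives the drain with its count reduced by one; an item whose only reference was the implicit one is freed.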


On Feb 4, 2013, at 2:13 PM, George Bosilca  wrote:

> Ralph,
> 
> There are valid reasons why we decided not to add such macros.
> 
> Adding elements to a list does not increase the element ref count.
> Similarly, removing an element from a list does not decrease its
> refcount either. Thus, there is no obvious link between the refcount
> of the elements in a list and the list itself. As a result, we can not
> make the assumption that decreasing the refcount by one is correct,
> and this even when we plan to get rid of one of our lists.
> 
> In addition, the list can contain elements that have been
> OBJ_CONSTRUCT in which case this macro will lead to unexpected
> behaviors.
> 
>  George.
> 
> 
> On Mon, Feb 4, 2013 at 2:42 PM,   wrote:
>> Author: rhc (Ralph Castain)
>> Date: 2013-02-04 14:42:57 EST (Mon, 04 Feb 2013)
>> New Revision: 28029
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/28029
>> 
>> Log:
>> The opal_list_t destructor doesn't release the items on the list prior to 
>> destructing or releasing it. Provide two convenience macros for doing so.
>> 
>> Text files modified:
>>   trunk/opal/class/opal_list.h |26 ++
>>   1 files changed, 26 insertions(+), 0 deletions(-)
>> 
>> Modified: trunk/opal/class/opal_list.h
>> ==
>> --- trunk/opal/class/opal_list.hMon Feb  4 12:36:55 2013
>> (r28028)
>> +++ trunk/opal/class/opal_list.h2013-02-04 14:42:57 EST (Mon, 04 Feb 
>> 2013)  (r28029)
>> @@ -160,6 +160,32 @@
>>  */
>> typedef struct opal_list_t opal_list_t;
>> 
>> +/** Cleanly destruct a list
>> + *
>> + * The opal_list_t destructor doesn't release the items on the
>> + * list - so provide two convenience macros that do so and then
>> + * destruct/release the list object itself
>> + *
>> + * @param[in] list List to destruct or release
>> + */
>> +#define OPAL_LIST_DESTRUCT(list)\
>> +do {\
>> +opal_list_item_t *it;   \
>> +while (NULL != (it = opal_list_remove_first(list))) {   \
>> +OBJ_RELEASE(it);\
>> +}   \
>> +OBJ_DESTRUCT(list); \
>> +} while(0);
>> +
>> +#define OPAL_LIST_RELEASE(list) \
>> +do {\
>> +opal_list_item_t *it;   \
>> +while (NULL != (it = opal_list_remove_first(list))) {   \
>> +OBJ_RELEASE(it);\
>> +}   \
>> +OBJ_RELEASE(list);  \
>> +} while(0);
>> +
>> 
>> /**
>>  * Loop over a list.
>> ___
>> svn mailing list
>> s...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Re: [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-05 Thread Jeff Squyres (jsquyres)
I had a typo in my btl_tcp_if_exclude such that it was effectively

  mpirun --mca btl_tco_if_exclude bogus ...

instead of ignoring the actual interface I wanted to ignore.  And since I 
wasn't ignoring the special loopback device that I have on some machines, every 
single MPI job hung because they tried to use those interfaces to communicate 
with processes on other nodes that that interface could not reach.



On Feb 4, 2013, at 5:56 PM, "Barrett, Brian W"  wrote:

> I'm confused; why is it disastrous to have an interface in if_exclude that 
> doesn't exist?  I can see it being a problem if we don't exclude something in 
> the list, but the other way is (in my opinion) harmless but with a useful use 
> case...
> 
> Brian
> 
> 
> 
> Sent with Good (www.good.com)
> 
> 
> -Original Message-
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Monday, February 04, 2013 06:47 PM Mountain Standard Time
> To:   Open MPI Developers
> Subject:  [EXTERNAL] Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 - 
> trunk/ompi/mca/btl/tcp
> 
> On Feb 4, 2013, at 2:03 PM, George Bosilca  wrote:
> 
>> The two behaviors you describe for include and exclude do not look 
>> conflicting to me. Inclusion is a strong request: the user enforces the usage 
>> of a specific interface. If the interface is not available, then we have a 
>> problem. Exclude, on the other hand, must enforce that a specific interface 
>> is not in use, which is quite simple to guarantee if the interface is not 
>> available.
> 
> I still maintain that it's equally disastrous if you don't exclude the 
> correct interfaces (I lost 2 nights of MTT because of this!).
> 
>> I'm not a fan of the nowarn option. Seems like a lot of code with limited 
>> interest, especially if we only plan to support it in TCP.
> 
> This is a good point -- I wonder what openib (and others?) do who support 
> *_if_include and *_if_exclude notation.  Do they warn / error if you specify 
> an invalid interface?
> 
>> If you need specialized arguments for some of your nodes here is what I do: 
>> rename the binaries to .orig, and use the original name to create a sh 
>> script that will change the value of mca_param_files to something based on 
>> the host name (if such a file exists) and then call the .orig executable. 
>> Works like a charm, even when a batch scheduler is used.
> 
> That will still be quite difficult to do in MTT.  Remember: all the tests 
> that are run in MTT are shared across all of us via the ompi-tests SVN repo.  
> Are you suggesting that I alias every test in the ompi-tests SVN with a 
> public script that you should run that should look for some site-specific MCA 
> override param file?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Re: [OMPI svn] svn:open-mpi r28016 - trunk/ompi/mca/btl/tcp

2013-02-05 Thread Jeff Squyres (jsquyres)
Yeah, that's the quandary: I can see both use cases.

That's why I proposed the "nowarn:" syntax that George hated.  :-)

Got any other suggestion on how to handle both use cases?



On Feb 5, 2013, at 7:25 AM, "Barrett, Brian W"  wrote:

> I guess I can see that, but I have the opposite use case; I have a device
> on some nodes and not others that I want to ignore, so I set
> btl_tcp_if_exclude to include that device.  It would be totally
> counter-intuitive to have a giant warning because of that.
> 
> Brian
> 
> On 2/5/13 6:46 AM, "Jeff Squyres (jsquyres)"  wrote:
> 
>> I had a typo in my btl_tcp_if_exclude such that it was effectively
>> 
>> mpirun --mca btl_tco_if_exclude bogus ...
>> 
>> instead of ignoring the actual interface I wanted to ignore.  And since I
>> wasn't ignoring the special loopback device that I have on some machines,
>> every single MPI job hung because they tried to use those interfaces to
>> communicate with processes on other nodes that that interface could not
>> reach.
>> 
>> 
>> 
>> On Feb 4, 2013, at 5:56 PM, "Barrett, Brian W"  wrote:
>> 
>>> I'm confused; why is it disastrous to have an interface in if_exclude
>>> that doesn't exist?  I can see it being a problem if we don't exclude
>>> something in the list, but the other way is (in my opinion) harmless but
>>> with a useful use case...
>>> 
>>> Brian
>>> 
>>> 
>>> 
>>> Sent with Good (www.good.com)
>>> 
>>> 
>>> -Original Message-
>>> From:   Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
>>> Sent:   Monday, February 04, 2013 06:47 PM Mountain Standard Time
>>> To: Open MPI Developers
>>> Subject:[EXTERNAL] Re: [OMPI devel] [OMPI svn] svn:open-mpi r28016 -
>>> trunk/ompi/mca/btl/tcp
>>> 
>>> On Feb 4, 2013, at 2:03 PM, George Bosilca  wrote:
>>> 
>>>> The two behaviors you describe for include and exclude do not look
>>>> conflicting to me. Inclusion is a strong request: the user enforces the
>>>> usage of a specific interface. If the interface is not available, then
>>>> we have a problem. Exclude, on the other hand, must enforce that a
>>>> specific interface is not in use, which is quite simple to guarantee if
>>>> the interface is not available.
>>> 
>>> I still maintain that it's equally disastrous if you don't exclude the
>>> correct interfaces (I lost 2 nights of MTT because of this!).
>>> 
>>>> I'm not a fan of the nowarn option. Seems like a lot of code with
>>>> limited interest, especially if we only plan to support it in TCP.
>>> 
>>> This is a good point -- I wonder what openib (and others?) do who
>>> support *_if_include and *_if_exclude notation.  Do they warn / error if
>>> you specify an invalid interface?
>>> 
>>>> If you need specialized arguments for some of your nodes here is what
>>>> I do: rename the binaries to .orig, and use the original name to create
>>>> a sh script that will change the value of mca_param_files to something
>>>> based on the host name (if such a file exists) and then call the .orig
>>>> executable. Works like a charm, even when a batch scheduler is used.
>>> 
>>> That will still be quite difficult to do in MTT.  Remember: all the
>>> tests that are run in MTT are shared across all of us via the ompi-tests
>>> SVN repo.  Are you suggesting that I alias every test in the ompi-tests
>>> SVN with a public script that you should run that should look for some
>>> site-specific MCA override param file?
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> 
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] hwloc using libpci: GPL issue

2013-02-06 Thread Jeff Squyres (jsquyres)
BEFORE YOU PANIC: this only affects Open MPI v1.7 (which is not yet released) 
and the OMPI SVN trunk (which is also, obviously, not released).  ***OMPI 
v1.6.x is unaffected/not GPL tainted***

Here's the full details:

It was just discovered yesterday that libpci, which hwloc links against for PCI 
device detection, is GPL (not LGPL).  IANAL / this is not legal advice, but my 
humble understanding is that this introduces GPL taint to hwloc.  And since 
OMPI links in hwloc, it is also tainted.  This is problematic for vendors who 
want to ship binaries linked against Open MPI.

 * The as-yet-unreleased OMPI v1.7 (and trunk) embeds hwloc v1.5.1, and 
utilizes hwloc PCI device detection (thereby linking in libpci).  Bad.
 * The OMPI v1.6 series embeds hwloc v1.3.2, and explicitly disables hwloc's 
PCI device detection (thereby NOT linking in libpci) because of compatibility 
problems with Solaris.  Good.

Hence, from a released-version perspective, I think OMPI is in the clear.  
However, we can't release 1.7 until this is fixed.  

The good news is that within hours of discovering the issue, the hwloc guys 
issued a preliminary patch to change hwloc to use libpciaccess (vs. libpci), 
which is BSD-licensed.  They are working on firming up this patch in order to 
release new versions of hwloc to remove the default-build options of libpci/GPL 
taint.

I will update OMPI's SVN trunk and submit a v1.7 CMR when this is ready.  I 
imagine it will take at least a few days.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] hwloc using libpci: GPL issue

2013-02-06 Thread Jeff Squyres (jsquyres)
On Feb 6, 2013, at 6:44 AM, Brice Goglin  wrote:

> Do you already use hwloc's PCI objects in OMPI v1.7 ?

We enable the functionality; I'm not sure if anyone is actually looking at 
specific PCI devices in the resulting hwloc tree.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r28048 - in trunk: . config ...etc.

2013-02-12 Thread Jeff Squyres (jsquyres)
Mellanox --

Was this commit a mistake?



Author: miked (Mike Dubman)
List-Post: devel@lists.open-mpi.org
Date: 2013-02-12 10:33:21 EST (Tue, 12 Feb 2013)
New Revision: 28048
URL: https://svn.open-mpi.org/trac/ompi/changeset/28048

Log:
Added new project: oshmem.

Added:
  trunk/README-SHMEM-WITH-VALGRIND.txt
  trunk/README-SHMEM.txt
...etc.



Re: [OMPI devel] RFC: Remove solaris thread support

2013-02-14 Thread Jeff Squyres (jsquyres)
+1

On Feb 14, 2013, at 12:38 PM, "Barrett, Brian W"  wrote:

> Hi all -
> 
> I'd like to propose that we remove the support for Solaris threads in the
> trunk.  Solaris provides a pthread implementation which is of equivalent
> performance and supporting two thread libraries is kind of a pain.
> Pthreads also supports static initializers, which will be nice going
> forward.  Assuming no one complains, I'll remove the solaris threads
> support from the trunk on Wednesday, February 20th.
> 
> Brian
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6.4rc4 posted

2013-02-15 Thread Jeff Squyres (jsquyres)
http://www.open-mpi.org/software/ompi/v1.6/

Changes since rc3:

- Fix Cygwin shared memory and debugger plugin support.  Thanks to
  Marco Atzeri for reporting the issue and providing initial patches.
- Fix to obtaining the correct available nodes when a rankfile is
  providing the allocation.  Thanks to Siegmar Gross for reporting the
  problem.
- Fix process binding issue on Solaris.  Thanks to Siegmar Gross for
  reporting the problem.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] RFC: Remove windows support

2013-02-18 Thread Jeff Squyres (jsquyres)
WHAT: Remove all Windows code from the trunk.

WHY: This issue keeps coming up over and over and over...

WHERE: Throughout the tree.

WHEN: Timeout: next Tuesday teleconf: 26 Feb, 2013

More detail:

It seems like every week, a new issue related to "what do we do about the 
Windows code?" comes up.  It came up again in the BTL discussions in Knoxville 
last week.  It keeps coming up in various ORTE discussions.  And so on.

So let's just do it -- let's cut the cord.  We have no Windows maintainer any 
more, and we don't test on Windows.  So let's not pretend that we do.  There 
are two levels of Windows support that we can remove:

1. Remove all the CMAKE stuff.  This is a no-brainer, IMNSHO -- it's broken and 
unmaintained on the trunk; it doesn't support all the new Fortran stuff, for 
example.  Who knows what else has bit rotted?

  ==> Removing this code can probably be done in a single SVN commit.

2. Remove all Windows code.  This involves some wholesale removing of 
components as well as a bunch of #if code throughout the code base.

  ==> Removing this code can probably be done in multiple SVN commits:

2a. Removing Windows-only components (which, given the rate of change that we 
are planning for the trunk, may well need to be re-written if they are ever 
re-introduced into the tree).
2b. Removing "#if WINDOWS" code (e.g., in opal/util/*, etc.).  This code may 
not be changing as much as the rest of the trunk, and may be suitable for svn 
reverting someday.

This does kill Cygwin support, too.  I realize we have a downstream packager 
for Cygwin, but the fact that we can't get any developer support for Windows -- 
despite multiple appeals -- seems to imply that the Windows Open MPI audience 
is very, very small.  So while it feels a bit sad to kill it, it may still be 
the Right Thing to do.

This is a proposal, and is open for discussion.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] mpi/java question

2013-02-20 Thread Jeff Squyres (jsquyres)
I guess the question is whether a java "long" is equivalent to a C "long", 
"long long", or "long int"...

Do you know?  (I'm not much of a Java guy)


On Feb 19, 2013, at 7:22 PM, Steve Angelovich  wrote:

> All,
> 
> We ran into a  problem using openmpi from java with a Java data type of long 
> when doing bcast and reduce operations.
> 
> *** An error occurred in MPI_Allreduce: the reduction operation MPI_MIN is 
> not defined on the MPI_LONG_INT datatype
> *** reported by process [211105480705,0]
> *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
> *** MPI_ERR_OP: invalid reduce operation
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> 3 more processes have sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error 
> messages
> 
> Looking at ompi/mpi/java/c/mpi_Datatype.c it looks like an MPI_LONG_INT type 
> is being used.  It seems this should be an MPI_LONG_LONG.  If I change this 
> data type I'm able to do bcast and reduce operations via the java interface.  
> Does this look like a bug or am I missing something else?
> 
> 
> 
> --- openmpi-1.7rc6/ompi/mpi/java/c/mpi_Datatype.c   2013-02-19 
> 15:44:13.299046000 -0600
> +++ openmpi-1.9a1r28069/ompi/mpi/java/c/mpi_Datatype.c  2013-02-17 
> 20:00:14.0 -0600
> @@ -60,7 +60,7 @@
> 
> MPI_Datatype Dts[] = { MPI_DATATYPE_NULL, MPI_BYTE,  MPI_SHORT,
> MPI_SHORT, MPI_BYTE,  MPI_INT,
> -   MPI_LONG_LONG,  MPI_FLOAT, MPI_DOUBLE,
> +   MPI_LONG_INT,  MPI_FLOAT, MPI_DOUBLE,
> MPI_PACKED,MPI_LB,MPI_UB,
> MPI_BYTE };
> 
> 
> Thanks,
> Steve
> 
> --
> This e-mail, including any attached files, may contain confidential and 
> privileged information for the sole use of the intended recipient.  Any 
> review, use, distribution, or disclosure by others is strictly prohibited.  
> If you are not the intended recipient (or authorized to receive information 
> for the intended recipient), please contact the sender by reply e-mail and 
> delete all copies of this message.
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6.4rc5: final rc

2013-02-20 Thread Jeff Squyres (jsquyres)
All MTT testing looks good for 1.6.4.  There seems to be an MPI dynamics 
problem when --enable-sparse-groups is used, but this does not look like a 
regression to me.

I put out a final rc, because there was one more minor change to accommodate an 
MXM API change; it's in the usual place:

http://www.open-mpi.org/software/ompi/v1.6/

Unless something disastrous happens, I plan to release this as the final 1.6.4 
tomorrow.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] mpi/java question

2013-02-20 Thread Jeff Squyres (jsquyres)
On Feb 20, 2013, at 12:37 PM, Ralph Castain  wrote:

>> In Java, a long is always 64 bits. In C and Objective-C, a long might be 64 
>> bits, or it might be 32 bits, or (in less common cases) it might be 
>> something else entirely; the C standard doesn't specify an exact bit width.
> 
> So we may need a configure test to map the Java "long" data type to the right 
> thing to get int64_t?

I think we might have to end up doing what we did for fortran: a bunch of 
configure tests to map Java types to their corresponding C types.

I have no idea how to write such configure tests, though, because it involves 
writing java code. The way it works in the Fortran tests is that we write a 
simple program (that's usually a combination of Fortran and C) to figure out 
the size of a given fortran datatype.  Then we have shell/m4 code to match that 
with a corresponding C type.

If someone could write some generic java code to figure out the size of a java 
type (and either printf it out, or write it to a file, or otherwise be able to 
give that value to a shell script), that would be a good start.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] mpi/java question

2013-02-20 Thread Jeff Squyres (jsquyres)
I didn't misspeak in my email.  :-)

That being said:

1. If the Java sizes are fixed, great.  It should make writing configury to 
find matching C types easier (because we know what the Java sizes are).

2. George raises a good point: we support the MPI_INTx_T datatypes now, which 
probably obviates the need for any extra configury (since the Java sizes are 
fixed).
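
As an illustration of the matching step in point 1, here is a small sketch that, given a fixed Java width in bits, finds a standard C integer type of the same width on the build platform. The function name c_int_type_for_bits and the JAVA_*_BITS constants are invented for this example; this is not how OMPI's configure does it, and with the fixed-width MPI_INTx_T datatypes from point 2 the lookup may not be needed at all.

```c
#include <limits.h>
#include <stddef.h>

/* Java primitive integer widths are fixed by the language. */
enum {
    JAVA_BYTE_BITS  = 8,
    JAVA_SHORT_BITS = 16,
    JAVA_INT_BITS   = 32,
    JAVA_LONG_BITS  = 64
};

/*
 * Return the name of the first standard C signed integer type whose width
 * on this platform equals `bits`, or NULL if none matches.
 */
static const char *c_int_type_for_bits(int bits)
{
    static const struct {
        const char *name;
        size_t size;
    } candidates[] = {
        { "signed char", sizeof(signed char) },
        { "short",       sizeof(short) },
        { "int",         sizeof(int) },
        { "long",        sizeof(long) },
        { "long long",   sizeof(long long) },
    };

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
        if (candidates[i].size * CHAR_BIT == (size_t)bits) {
            return candidates[i].name;
        }
    }
    return NULL;
}
```

Which name comes back for JAVA_LONG_BITS depends on the platform ("long" where long is 64 bits, "long long" where long is 32 bits), which is exactly why a C "long" cannot be assumed to match a Java "long".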


On Feb 20, 2013, at 3:44 PM, Ralph Castain  wrote:

> Might be just fine - need to see how many of the types have issues, how best 
> to correct them
> 
> On Feb 20, 2013, at 12:32 PM, George Bosilca  wrote:
> 
>> What is wrong with MPI_INT64_T? (MPI 3.0 standard page 26.)
>> 
>> George.
>> 
>> On Feb 20, 2013, at 21:12 , Ralph Castain  wrote:
>> 
>>> 
>>> On Feb 20, 2013, at 12:08 PM, Dmitri Gribenko  wrote:
>>> 
>>>> On Wed, Feb 20, 2013 at 10:05 PM, Ralph Castain  wrote:
>>>>> 
>>>>> On Feb 20, 2013, at 11:39 AM, Dmitri Gribenko  wrote:
>>>>> 
>>>>>> On Wed, Feb 20, 2013 at 9:34 PM, Jeff Squyres (jsquyres)
>>>>>>  wrote:
>>>>>>> If someone could write some generic java code to figure out the size of 
>>>>>>> a java type (and either printf it out, or write it to a file, or 
>>>>>>> otherwise be able to give that value to a shell script), that would be 
>>>>>>> a good start.
>>>>>> 
>>>>>> No need for that -- type sizes in Java are fixed.
>>>>>> 
>>>>>> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
>>>>> 
>>>>> True - but the ones on the C-side are not, and that's the problem.
>>>> 
>>>> My point was that there is no need to write java code to detect type
>>>> sizes.  About C types -- don't we already check those anyway?  Sure,
>>>> we need to match these with java side, but there's no need to write
>>>> new code to check type sizes.
>>> 
>>> 
>>> I think you misunderstood - we are talking about writing build-system code 
>>> that matches the discovered C-type sizes to the corresponding known Java 
>>> type. This is the source of the reported problem.
>>> 
>>> And yes - Jeff misspoke in his note. I've straightened him out over the 
>>> phone. :-)
>>> 
>>>> 
>>>> Dmitri
>>>> 
>>>> -- 
>>>> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
>>>> (j){printf("%d\n",i);}}} /*Dmitri Gribenko */
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] 1.6.4rc5: final rc

2013-02-20 Thread Jeff Squyres (jsquyres)
If someone wants to submit a patch in the immediate future (i.e., within the 
next hour), great.

Otherwise, I'm still going to release 1.6.4 as-is.

If someone wants to submit a patch after 1.6.4 is out, that's fine -- if we 
ever do 1.6.5, it can go in there.


On Feb 20, 2013, at 4:09 PM, Nathan Hjelm  wrote:

> On Wed, Feb 20, 2013 at 10:28:56AM -0800, Eugene Loh wrote:
>> On 02/20/13 07:54, Jeff Squyres (jsquyres) wrote:
>>> All MTT testing looks good for 1.6.4.  There seems to be an MPI dynamics 
>>> problem when --enable-sparse-groups is used, but this does not look like a 
>>> regression to me.
>>> 
>>> I put out a final rc, because there was one more minor change to 
>>> accommodate an MXM API change; it's in the usual place:
>>> 
>>>   http://www.open-mpi.org/software/ompi/v1.6/
>>> 
>>> Unless something disastrous happens, I plan to release this as the final 
>>> 1.6.4 tomorrow.
>> 
>> I don't think this qualifies as "disastrous", but...
>> 
>> I've been trying to do some 1.6 testing on Solaris.  (Solaris 11,
>> Oracle Studio compilers, both SPARC and x86)  Results generally look
>> good.  The main issue appears to be:
>> 
>> - SPARC
>>  *AND*
>> - compile with "-m32 -xmemalign=8s" (the latter means assume at most 8-byte 
>> alignment, with sigbus for misalignment)
>>  *AND*
>> - openib
>> 
>> There is a sigbus during MPI_Init.  Specifically, if I go to 
>> btl_openib_frag.h out_constructor(), I see:
>> 
>>    frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;
>> 
>> and the left-hand side is on a 4-byte (but not 8-byte) boundary.  How hard 
>> would it be to get openib frags on 8-byte boundaries?
> 
> Very easy. Just adjust the parameters given to ompi_free_list_init(). There 
> are arguments for frag alignment and data alignment. Looking at 
> btl_openib_component.c a number of free lists have the alignment set at 2. 
> Change those to 8 and see if that fixes the problem.
> 
> Anyone know why these were set with an alignment of 2 in the first place? I 
> would have expected 8 or opal_cache_line_size.
> 
> -Nathan


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
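
For readers following the sigbus report above: a minimal sketch of the alignment arithmetic involved, not the openib code itself. The helper names is_aligned() and align_up() are hypothetical, and the address 0x1004 is an invented example. It shows why an address on a 4-byte (but not 8-byte) boundary is a problem for a 64-bit store under "-xmemalign=8s", and where the next 8-byte boundary would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of align.
 * align must be a power of two. */
bool is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: round addr up to the next multiple of align.
 * align must be a power of two. */
uintptr_t align_up(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + (align - 1)) & ~(uintptr_t)(align - 1);
}
```

With these, is_aligned(0x1004, 4) holds but is_aligned(0x1004, 8) does not, which matches the failing case described for the wr_id assignment; align_up(0x1004, 8) gives 0x1008.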




[OMPI devel] openib fragment alignment

2013-02-20 Thread Jeff Squyres (jsquyres)
Arrgh.  I think you're going to make me eat my words 
(http://www.open-mpi.org/community/lists/devel/2013/02/12143.php).  

I just recently lost my access to InfiniBand test gear, so I can't test this 
myself.  Hypothetically, it should be fine.  But throwing in an untested change 
literally right before a release without IB vendor say-so really, really gives 
me pause...

Mellanox?


On Feb 20, 2013, at 4:27 PM, Open MPI  wrote:

> #3519: Move r28083 to v1.6 branch
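> -----------------------------------+----------------------------
> Reporter:  hjelmn                  |      Owner:  hjelmn
>     Type:  changeset move request  |     Status:  new
> Priority:  major                   |  Milestone:  Open MPI 1.6
>  Version:  trunk                   |
> -----------------------------------+----------------------------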
> (In [28083]) btl/openib: don't align fragments on 2 byte boundaries
> (changed to 8)
> 
> cmr:v1.6,v1.7
> 
> -- 
> Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/3519>
> Open MPI <http://www.open-mpi.org/>
> 
> ___
> bugs mailing list
> b...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/bugs


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] openib fragment alignment

2013-02-20 Thread Jeff Squyres (jsquyres)
I waffled on this issue a bit (and talked w/ Nathan about it in IM), but with 
my RM hat on, I'm giving a final ruling: no.

This is too "last second", and it's for an incredibly small set of platforms 
and configuration options.  

I see that the risk is pretty small for this commit, but history is littered 
with "but that should have worked!".  I'd rather be conservative and have a 
good 1.6.4 release.  Since this has been committed on the trunk already, we can 
see what happens (likely: it'll cause no problems), and someday move it over to 
1.6.5 if anyone cares.

- Grouchy old RM



On Feb 20, 2013, at 4:51 PM, Nathan Hjelm  wrote:

> I talked to Pasha about the change. He suggests fragments are 2-byte aligned 
> to save space. I suspect that on 64-bit platforms the fragment size is 
> already a multiple of 8 bytes so this change will likely only affect 32-bit 
> systems (which is where the bus error is occurring).
> 
> -Nathan
> 
> On Wed, Feb 20, 2013 at 09:39:09PM +, Joshua Ladd wrote:
>> I would hold off, if possible, until I can investigate the issue. I don't, 
>> off-hand, know why the 2-byte alignment, although I would suspect it's for 
>> performance reasons.   
>> 
>> 
>> Josh
>> 
>> 
>> -----Original Message-----
>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
>> Behalf Of Jeff Squyres (jsquyres)
>> Sent: Wednesday, February 20, 2013 4:35 PM
>> To: 
>> Subject: [OMPI devel] openib fragment alignment
>> Importance: High
>> 
>> Arrgh.  I think you're going to make me eat my words 
>> (http://www.open-mpi.org/community/lists/devel/2013/02/12143.php).  
>> 
>> I just recently lost my access to InfiniBand test gear, so I can't test this 
>> myself.  Hypothetically, it should be fine.  But throwing in an untested 
>> change literally right before a release without IB vendor say-so really, 
>> really gives me pause...
>> 
>> Mellanox?
>> 
>> 
>> On Feb 20, 2013, at 4:27 PM, Open MPI  wrote:
>> 
>>> #3519: Move r28083 to v1.6 branch
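>>> -----------------------------------+----------------------------
>>> Reporter:  hjelmn                  |      Owner:  hjelmn
>>>     Type:  changeset move request  |     Status:  new
>>> Priority:  major                   |  Milestone:  Open MPI 1.6
>>>  Version:  trunk                   |
>>> -----------------------------------+----------------------------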
>>> (In [28083]) btl/openib: don't align fragments on 2 byte boundaries 
>>> (changed to 8)
>>> 
>>> cmr:v1.6,v1.7
>>> 
>>> --
>>> Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/3519>
>>> Open MPI <http://www.open-mpi.org/>
>>> 
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
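
To illustrate Nathan's point that the change likely only matters on 32-bit builds: a rough sketch, assuming fragments are packed back to back from an 8-byte-aligned base with a stride equal to the fragment size rounded up to the requested alignment. The helper names and the sizes 96 and 92 are invented for illustration; they are not the real openib fragment sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pad an element size up to a power-of-two alignment.
 * The result is the distance between consecutive elements in the pool. */
size_t padded_stride(size_t elem_size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (elem_size + (align - 1)) & ~(align - 1);
}

/* Offset of element idx from the (8-byte-aligned) start of the pool. */
size_t elem_offset(size_t idx, size_t elem_size, size_t align)
{
    return idx * padded_stride(elem_size, align);
}
```

If the size is already a multiple of 8 (say 96), a 2-byte alignment still leaves every element on an 8-byte boundary. If it is only a multiple of 4 (say 92), elem_offset(1, 92, 2) is 92, which is 4 mod 8, so the second element is misaligned for a uint64_t store; with an alignment of 8 the stride becomes 96 and the problem goes away.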




Re: [OMPI devel] mpi/java question

2013-02-20 Thread Jeff Squyres (jsquyres)
I committed a fix to the trunk to use the fixed size datatypes.

I don't know offhand if the reduction type you need is defined on 64 bit 
types...?


On Feb 20, 2013, at 5:41 PM, Steve Angelovich  wrote:

> Sorry I lost track of all the comments in the thread.  Does this mean it 
> is fixed or will be fixed?
> 
> Thanks,
> Steve
> 
> On 02/20/2013 02:15 PM, Jeff Squyres (jsquyres) wrote:
>> I didn't misspeak in my email.  :-)
>> 
>> That being said:
>> 
>> 1. If the Java sizes are fixed, great.  It should make writing configury to 
>> find matching C types easier (because we know what the Java sizes are).
>> 
>> 2. George raises a good point: we support the MPI_INTx_T datatypes now, 
>> which probably obviates the need for any extra configury (since the Java 
>> sizes are fixed).
>> 
>> 
>> On Feb 20, 2013, at 3:44 PM, Ralph Castain  wrote:
>> 
>>> Might be just fine - need to see how many of the types have issues, how 
>>> best to correct them
>>> 
>>> On Feb 20, 2013, at 12:32 PM, George Bosilca  wrote:
>>> 
>>>> What is wrong with MPI_INT64_T? (MPI 3.0 standard page 26.)
>>>> 
>>>> George.
>>>> 
>>>> On Feb 20, 2013, at 21:12 , Ralph Castain  wrote:
>>>> 
>>>>> On Feb 20, 2013, at 12:08 PM, Dmitri Gribenko  wrote:
>>>>> 
>>>>>> On Wed, Feb 20, 2013 at 10:05 PM, Ralph Castain  
>>>>>> wrote:
>>>>>>> On Feb 20, 2013, at 11:39 AM, Dmitri Gribenko  
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> On Wed, Feb 20, 2013 at 9:34 PM, Jeff Squyres (jsquyres)
>>>>>>>>  wrote:
>>>>>>>>> If someone could write some generic java code to figure out the size 
>>>>>>>>> of a java type (and either printf it out, or write it to a file, or 
>>>>>>>>> otherwise be able to give that value to a shell script), that would 
>>>>>>>>> be a good start.
>>>>>>>> No need for that -- type sizes in Java are fixed.
>>>>>>>> 
>>>>>>>> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
>>>>>>> True - but the ones on the C-side are not, and that's the problem.
>>>>>> My point was that there is no need to write java code to detect type
>>>>>> sizes.  About C types -- don't we already check those anyway?  Sure,
>>>>>> we need to match these with java side, but there's no need to write
>>>>>> new code to check type sizes.
>>>>> 
>>>>> I think you misunderstood - we are talking about writing build-system 
>>>>> code that matches the discovered C-type sizes to the corresponding known 
>>>>> Java type. This is the source of the reported problem.
>>>>> 
>>>>> And yes - Jeff misspoke in his note. I've straightened him out over the 
>>>>> phone. :-)
>>>>> 
>>>>>> Dmitri
>>>>>> 
>>>>>> -- 
>>>>>> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if(j){printf("%d\n",i);}}} /*Dmitri Gribenko */
>> 
> 
> --
> This e-mail, including any attached files, may contain confidential and 
> privileged information for the sole use of the intended recipient.  Any 
> review, use, distribution, or disclosure by others is strictly prohibited.  
> If you are not the intended recipient (or authorized to receive information 
> for the intended recipient), please contact the sender by reply e-mail and 
> delete all copies of this message.
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
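
As a sketch of the matching discussed in this thread (not the fix that was committed to the trunk): Java's integer primitive sizes are fixed, so each can be paired with a C fixed-width type and the corresponding MPI fixed-size datatype name without probing anything on the Java side. The table and the function mpi_type_for_java() are hypothetical; the MPI names are kept as strings so the example builds without mpi.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row per Java integer primitive whose size is fixed by the language. */
struct java_c_match {
    const char *java_type;  /* Java primitive name */
    size_t java_bytes;      /* size fixed by the Java language */
    size_t c_bytes;         /* size of the candidate C type */
    const char *mpi_type;   /* name of the MPI fixed-size datatype */
};

static const struct java_c_match matches[] = {
    { "byte",  1, sizeof(int8_t),   "MPI_INT8_T"   },
    { "short", 2, sizeof(int16_t),  "MPI_INT16_T"  },
    { "char",  2, sizeof(uint16_t), "MPI_UINT16_T" },
    { "int",   4, sizeof(int32_t),  "MPI_INT32_T"  },
    { "long",  8, sizeof(int64_t),  "MPI_INT64_T"  },
};

/* Returns the MPI datatype name for a Java primitive, or NULL if the type
 * is not in the table or the C-side size does not match. */
const char *mpi_type_for_java(const char *java_type)
{
    assert(java_type != NULL);
    for (size_t i = 0; i < sizeof(matches) / sizeof(matches[0]); i++) {
        if (strcmp(matches[i].java_type, java_type) == 0 &&
            matches[i].java_bytes == matches[i].c_bytes) {
            return matches[i].mpi_type;
        }
    }
    return NULL;
}
```

Because the C types here are fixed-width, the size check always passes; it is kept only to show where a configure-time comparison against plain int or long would otherwise have been needed.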




Re: [OMPI devel] v1.7.0rc7

2013-02-25 Thread Jeff Squyres (jsquyres)
Marco -- 

Is it just these 2 patches:

r28059 [[BR]]
Patch for Cygwin support: use correct DSO/shared library prefix and
suffix.  Thanks to Marco Atzeri for reporting the issue and providing
an initial patch.

r28060 [[BR]]
Patch for Cygwin support: Use S_IRWXU for shmget() and include
.  Thanks to Marco Atzeri for reporting the issue and
providing an initial patch.


On Feb 25, 2013, at 4:40 PM, marco atzeri  wrote:

> On 2/23/2013 11:45 PM, Ralph Castain wrote:
>> This release candidate is the last one we expect to have before release, so 
>> please test it. Can be downloaded from the usual place:
>> 
>> http://www.open-mpi.org/software/ompi/v1.7/
>> 
>> Latest changes include:
>> 
>> * update of the alps/lustre configure code
>> * fixed solaris hwloc code
>> * various mxm updates
>> * removed java bindings (delayed until later release)
>> * improved the --report-bindings output
>> * a variety of minor cleanups
>> 
> 
> any reason to not include the cygwin patches added to 1.6.4 ?
> 
> Marco
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
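
For context on the second patch description above, a minimal sketch of a System V shared memory round trip that passes S_IRWXU as the permission bits to shmget(). This is not the r28060 change itself; the function names are hypothetical, and the use of <sys/stat.h> here reflects where POSIX defines S_IRWXU rather than anything stated in the patch summary.

```c
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>

/* Hypothetical helper: creation flags with owner read/write/execute
 * permission expressed through the POSIX S_IRWXU constant. */
int shm_create_flags(void)
{
    return IPC_CREAT | IPC_EXCL | S_IRWXU;
}

/* Create a private segment, attach, touch it, detach, and remove it.
 * Returns 0 on success and -1 if any step fails. */
int shm_roundtrip(size_t size)
{
    int id = shmget(IPC_PRIVATE, size, shm_create_flags());
    if (id < 0) {
        return -1;
    }
    void *addr = shmat(id, NULL, 0);
    /* Mark the segment for removal right away so it cannot leak;
     * it stays usable until the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (addr == (void *)-1) {
        return -1;
    }
    memset(addr, 0, size);
    return shmdt(addr) == 0 ? 0 : -1;
}
```

A call such as shm_roundtrip(4096) should return 0 on a system where System V shared memory is available to the caller.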




Re: [OMPI devel] v1.7.0rc7

2013-02-25 Thread Jeff Squyres (jsquyres)
On Feb 25, 2013, at 6:30 PM, Pavel Mezentsev  wrote:

> I've tried to build it but got different errors with different compilers.
> 
> With Intel (2011.5.220) and pgi (13.2) I get the following error:
> CC   bcol_iboffload_module.lo
> bcol_iboffload_module.c(37): catastrophic error: cannot open source file 
> "ompi/mca/common/netpatterns/common_netpatterns.h"
>   #include "ompi/mca/common/netpatterns/common_netpatterns.h"

This is a clear error.

Pasha?

> I failed to find that file anywhere among the sources.
> 
> With pathscale (4.0.12.1) I get the following:
>   PPFC mpi-f08-interfaces-callbacks.lo
> 
> module mpi_f08_interfaces_callbacks
>^
> pathf95-855 pathf95: ERROR MPI_F08_INTERFACES_CALLBACKS, File = 
> mpi-f08-interfaces-callbacks.F90, Line = 9, Column = 8 

I don't have access to the Pathscale compiler.  Without more detail, it's hard 
to say what's wrong here.

I've pinged my pathscale contact; perhaps he can shed some light on this...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove windows support

2013-02-26 Thread Jeff Squyres (jsquyres)
No other issues were raised about this, and today was the timeout.

On the call today, Ralph volunteered to do the work:

- svn rm the windows-specific components
- remove all the #if Windows-specific code

He'll be doing that over the next week or so.



On Feb 18, 2013, at 1:34 PM, Ralph Castain  wrote:

> Thanks Marco - I was hoping that would be the case!
> 
> 
> On Feb 18, 2013, at 8:42 AM, marco atzeri  wrote:
> 
>> On 2/18/2013 5:10 PM, Jeff Squyres (jsquyres) wrote:
>>> WHAT: Remove all Windows code from the trunk.
>>> 
>>> WHY: This issue keeps coming up over and over and over...
>>> 
>> [cut]
>>> 2. Remove all Windows code.  This involves some wholesale removing of 
>>> components as well as a bunch of #if code throughout the code base.
>>> 
>>>  ==> Removing this code can probably be done in multiple SVN commits:
>>> 
>>> 2a. Removing Windows-only components (which, given the rate of change that 
>>> we are planning for the trunk, may well need to be re-written if they are 
>>> ever re-introduced into the tree).
>> 
>> Cygwin does not use them. I'm currently building the trunk packages with
>> 
>> --enable-mca-no-build=paffinity,installdirs-windows,timer-windows,shmem-sysv,if-windows,shmem-windows
>> 
>> to specifically exclude them
>> 
>>> 2b. Removing "#if WINDOWS" code (e.g., in opal/util/*, etc.).  This code 
>>> may not be changing as much as the rest of the trunk, and may be suitable 
>>> for svn reverting someday.
>>> 
>>> This does kill Cygwin support, too.  I realize we have a downstream 
>>> packager for Cygwin, but the fact that we can't get any developer support 
>>> for Windows -- despite multiple appeals -- seems to imply that the Windows 
>>> Open MPI audience is very, very small.  So while it feels a bit sad to kill 
>>> it, it may still be the Right Thing to do.
>> 
>> I assume it is __WINDOWS__
>> That is not defined on cygwin, so the build should survive
>> 
>>> 
>>> This is a proposal, and is open for discussion.
>>> 
>> 
>> Regards
>> Marco
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
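
A small sketch of the preprocessor situation Marco describes, assuming the Windows-only code is guarded by __WINDOWS__ as he guesses. The function names are hypothetical. Cygwin compilers define __CYGWIN__, so if __WINDOWS__ is not defined there, a Cygwin build takes the same path as any other POSIX build once the "#if" blocks are removed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: report which preprocessor branch this build took. */
const char *build_flavor(void)
{
#if defined(__WINDOWS__)
    return "native-windows";   /* the branch slated for removal */
#elif defined(__CYGWIN__)
    return "cygwin";           /* POSIX layer on Windows */
#else
    return "posix";
#endif
}

/* True when the build does not depend on the Windows-only branch. */
int survives_windows_removal(void)
{
    const char *flavor = build_flavor();
    assert(flavor != NULL);
    return strcmp(flavor, "native-windows") != 0;
}
```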




Re: [OMPI devel] v1.7.0rc7

2013-02-27 Thread Jeff Squyres (jsquyres)
On Feb 25, 2013, at 10:27 PM, marco atzeri  wrote:

> plus the additional ones
> 
>   ERROR.patch : ERROR is already defined, so another label
> is needed for "goto ERROR"

Snipped.

I finally filed a ticket about this: 
https://svn.open-mpi.org/trac/ompi/ticket/3527

We talked about this on the weekly call yesterday.  The RMs said they would 
evaluate the combined patch and see how much risk it posed this close to a 
release.  If it doesn't make 1.7.0, it'll go into 1.7.1.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
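
To make the ERROR.patch item concrete: a minimal sketch, assuming a platform header already defines ERROR as a macro (simulated below with a local #define so the example builds anywhere). In that case "goto ERROR;" no longer names a label after preprocessing, so the cleanup label needs a different name. The function fill_buffer() and the label name are invented for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a platform header that already defines ERROR.
 * Assumption: simulated here; the real definition comes from the platform. */
#ifndef ERROR
#define ERROR 0
#endif

/* With ERROR defined as above, "goto ERROR;" would expand to "goto 0;" and
 * fail to compile, so the cleanup label is called "cleanup" instead. */
int fill_buffer(size_t n, int fail)
{
    int rc = -1;
    char *buf = NULL;

    if (n == 0) {
        goto cleanup;
    }
    buf = malloc(n);
    if (buf == NULL || fail) {
        goto cleanup;
    }
    buf[0] = 'x';
    rc = 0;

cleanup:
    free(buf);   /* free(NULL) is a no-op */
    return rc;
}
```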




[OMPI devel] 1.7rc8 is posted

2013-02-27 Thread Jeff Squyres (jsquyres)
The goal is to release 1.7 (final) by the end of this week.  New rc posted with 
fairly small changes:

http://www.open-mpi.org/software/ompi/v1.7/

- Fix wrong header file / compilation error in bcol
- Support MXM STREAM for isend and irecv
- Make sure "mpirun " fails with $status!=0
- Bunches of cygwin minor fixes
- Make sure the fortran compiler supports BIND(C) with LOGICAL for the F08 
bindings
- Fix --disable-mpi-io with the F08 bindings

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



