Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Kawashima, Takahiro
> > As other people said, Fujitsu MPI used in K is based on old
> > Open MPI (v1.6.3 with bug fixes). 
> 
> I guess the obvious question is: will vanilla Open MPI work on K?

Unfortunately, no. Support for Tofu and the Fujitsu resource manager
is not included in Open MPI.

Takahiro Kawashima,
MPI development team,
Fujitsu



Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Kawashima, Takahiro
Samuel,

I am a developer of Fujitsu MPI. Thanks for using the K computer.
For official support, please consult with the helpdesk of K,
as Gilles said. The helpdesk may have information based on past
inquiries. If not, the inquiry will be forwarded to our team.

As other people said, Fujitsu MPI used on K is based on an old
Open MPI (v1.6.3 with bug fixes). We don't have a plan to
update it to a newer version because the system software is in
a maintenance phase. At first glance, I also suspect the cost
of the multiple allreduce calls.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Some of my collaborators have had issues with one of my benchmarks at high 
> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses 
> OpenMPI and the issue has been tracked to the time in MPI_Comm_dup/Comm_split 
> increasing quadratically with process concurrency.  At 82K processes, each 
> call to dup/split is taking 15s to complete.  These high times restrict 
> comm_split/dup to be used statically (at the beginning) and not dynamically 
> in an application.
> 
> I had a similar issue a few years ago on ANL/Mira/MPICH where they called 
> qsort to split the ranks.  Although qsort/quicksort has an ideal computational 
> complexity of O(P log P) [P is the number of MPI ranks], it can have a worst-case 
> complexity of O(P^2)... at 82K, P/log P is a ~5000x slowdown.  
> 
> Can you confirm whether qsort (or the like) is (still) used in these routines 
> in OpenMPI?  It seems mergesort (worst-case complexity of O(P log P)) would be a 
> more scalable approach.  I have not observed this issue in the Cray MPICH 
> implementation, and the Mira MPICH issue has since been resolved.
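
For readers not familiar with the internals: MPI_Comm_split implementations
typically allgather a (color, key, rank) triple from every process and sort
the P triples to derive the new ranks, so the choice of sorting algorithm
matters at this scale. A minimal sketch of that step (illustrative only, not
Open MPI's actual code):

/* Hedged sketch, not Open MPI's code: the sorting step an MPI_Comm_split
 * implementation typically performs after allgathering the
 * (color, key, old-rank) triple of all P processes.  A quicksort with an
 * unlucky pivot degrades to O(P^2) comparisons here; a mergesort-style
 * sort keeps the worst case at O(P log P). */
#include <stdlib.h>

struct split_entry {
    int color;   /* color argument passed to MPI_Comm_split */
    int key;     /* key argument passed to MPI_Comm_split */
    int rank;    /* rank in the parent communicator, used as a tie-breaker */
};

static int compare_entries(const void *a, const void *b)
{
    const struct split_entry *x = a, *y = b;
    if (x->color != y->color) return (x->color > y->color) - (x->color < y->color);
    if (x->key   != y->key)   return (x->key   > y->key)   - (x->key   < y->key);
    return (x->rank > y->rank) - (x->rank < y->rank);
}

/* Replacing qsort() here with a sort that guarantees O(P log P), e.g. a
 * bottom-up mergesort, is the change suggested above. */
static void sort_split_entries(struct split_entry *entries, size_t nprocs)
{
    qsort(entries, nprocs, sizeof(*entries), compare_entries);
}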



Re: [OMPI devel] [2.1.2rc2] CMA build failure on Linux/SPARC64

2017-08-21 Thread Kawashima, Takahiro
Paul,

Thank you.

I created an issue and PRs (v2.x and v2.0.x).

  https://github.com/open-mpi/ompi/issues/4122
  https://github.com/open-mpi/ompi/pull/4123
  https://github.com/open-mpi/ompi/pull/4124

Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> This is a Debian/Sid system w/ glibc-2.24.
> 
> The patch you pointed me at does appear to fix the problem!
> I will note this in your PRs.
> 
> -Paul
> 
> On Mon, Aug 21, 2017 at 9:17 PM, Kawashima, Takahiro <
> t-kawash...@jp.fujitsu.com> wrote:
> 
> > Paul,
> >
> > Did you upgrade glibc or something? I suspect newer glibc
> > supports process_vm_readv and process_vm_writev and output
> > of configure script changed. My Linux/SPARC64 with old glibc
> > can compile Open MPI 2.1.2rc2 (CMA is disabled).
> >
> > To fix this, we need to cherry-pick d984b4b. Could you test the
> > d984b4b patch? I cannot test it because I cannot update glibc.
> > If it is fine, I'll create a PR for v2.x branch.
> >
> >   https://github.com/open-mpi/ompi/commit/d984b4b
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> > > Two things to note:
> > >
> > > 1) This is *NOT* present in 3.0.0rc2, though I don't know what has
> > changed.
> > >
> > > 2) Here are the magic numbers:
> > > /usr/include/sparc64-linux-gnu/asm/unistd.h:#define
> > __NR_process_vm_readv
> > > 338
> > > /usr/include/sparc64-linux-gnu/asm/unistd.h:#define
> > __NR_process_vm_writev
> > >  339
> > >
> > > -Paul
> > >
> > > On Mon, Aug 21, 2017 at 6:56 PM, Paul Hargrove <phhargr...@lbl.gov>
> > wrote:
> > >
> > > > Both the v9 and v8+ ABIs on a Linux/SPARC64 system are failing "make
> > all"
> > > > with the error below.
> > > >
> > > > -Paul
> > > >
> > > > make[2]: Entering directory '/home/phargrov/OMPI/openmpi-
> > > > 2.1.2rc2-linux-sparcv9/BLD/opal/mca/btl/sm'
> > > >   CC   mca_btl_sm_la-btl_sm.lo
> > > > In file included from /home/phargrov/OMPI/openmpi-2.
> > > > 1.2rc2-linux-sparcv9/openmpi-2.1.2rc2/opal/mca/btl/sm/btl_sm.c:45:0:
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h:101:2: error: #error "Unsupported
> > > > architecture for process_vm_readv and process_vm_writev syscalls"
> > > >  #error "Unsupported architecture for process_vm_readv and
> > > > process_vm_writev syscalls"
> > > >   ^
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h: In function  process_vm_readv:
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h:113:18: error:
> > > > __NR_process_vm_readv undeclared (first use in this function); did you
> > > > mean process_vm_readv?
> > > >return syscall(__NR_process_vm_readv, pid, lvec, liovcnt, rvec,
> > > > riovcnt, flags);
> > > >   ^
> > > >   process_vm_readv
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h:113:18: note: each undeclared
> > > > identifier is reported only once for each function it appears in
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h: In function  process_vm_writev:
> > > > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > > > 2.1.2rc2/opal/include/opal/sys/cma.h:124:18: error:
> > > > __NR_process_vm_writev undeclared (first use in this function); did you
> > > > mean process_vm_writev?
> > > >return syscall(__NR_process_vm_writev, pid, lvec, liovcnt, rvec,
> > > > riovcnt, flags);
> > > >   ^~
> > > >   process_vm_writev
> > > > Makefile:1838: recipe for target 'mca_btl_sm_la-btl_sm.lo' failed



Re: [OMPI devel] [2.1.2rc2] CMA build failure on Linux/SPARC64

2017-08-21 Thread Kawashima, Takahiro
Paul,

Did you upgrade glibc or something? I suspect a newer glibc
supports process_vm_readv and process_vm_writev and the output
of the configure script changed. My Linux/SPARC64 machine with
an old glibc can compile Open MPI 2.1.2rc2 (CMA is disabled).

To fix this, we need to cherry-pick d984b4b. Could you test the
d984b4b patch? I cannot test it because I cannot update glibc.
If it works, I'll create a PR for the v2.x branch.

  https://github.com/open-mpi/ompi/commit/d984b4b
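
For context, a minimal sketch of the glibc-wrapper vs. raw-syscall fallback
involved (illustrative only, not the actual d984b4b change;
HAVE_PROCESS_VM_READV stands in for whatever the real configure test defines):

/* Hedged sketch of the kind of fallback the referenced commit deals with. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline ssize_t cma_vm_readv(pid_t pid,
                                   const struct iovec *lvec, unsigned long liovcnt,
                                   const struct iovec *rvec, unsigned long riovcnt,
                                   unsigned long flags)
{
#if defined(HAVE_PROCESS_VM_READV)
    /* glibc >= 2.15 exposes process_vm_readv() directly, so the build no
     * longer depends on a per-architecture __NR_* number. */
    return process_vm_readv(pid, lvec, liovcnt, rvec, riovcnt, flags);
#elif defined(__NR_process_vm_readv)
    /* Older glibc: fall back to the raw syscall number from the kernel headers. */
    return syscall(__NR_process_vm_readv, pid, lvec, liovcnt, rvec, riovcnt, flags);
#else
    /* Neither available: this mirrors the #error Paul hit in opal/include/opal/sys/cma.h. */
#error "process_vm_readv is not available on this architecture/libc combination"
#endif
}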

Takahiro Kawashima,
MPI development team,
Fujitsu

> Two things to note:
> 
> 1) This is *NOT* present in 3.0.0rc2, though I don't know what has changed.
> 
> 2) Here are the magic numbers:
> /usr/include/sparc64-linux-gnu/asm/unistd.h:#define __NR_process_vm_readv
> 338
> /usr/include/sparc64-linux-gnu/asm/unistd.h:#define __NR_process_vm_writev
>  339
> 
> -Paul
> 
> On Mon, Aug 21, 2017 at 6:56 PM, Paul Hargrove  wrote:
> 
> > Both the v9 and v8+ ABIs on a Linux/SPARC64 system are failing "make all"
> > with the error below.
> >
> > -Paul
> >
> > make[2]: Entering directory '/home/phargrov/OMPI/openmpi-
> > 2.1.2rc2-linux-sparcv9/BLD/opal/mca/btl/sm'
> >   CC   mca_btl_sm_la-btl_sm.lo
> > In file included from /home/phargrov/OMPI/openmpi-2.
> > 1.2rc2-linux-sparcv9/openmpi-2.1.2rc2/opal/mca/btl/sm/btl_sm.c:45:0:
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h:101:2: error: #error "Unsupported
> > architecture for process_vm_readv and process_vm_writev syscalls"
> >  #error "Unsupported architecture for process_vm_readv and
> > process_vm_writev syscalls"
> >   ^
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h: In function  process_vm_readv:
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h:113:18: error:
> > __NR_process_vm_readv undeclared (first use in this function); did you
> > mean process_vm_readv?
> >return syscall(__NR_process_vm_readv, pid, lvec, liovcnt, rvec,
> > riovcnt, flags);
> >   ^
> >   process_vm_readv
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h:113:18: note: each undeclared
> > identifier is reported only once for each function it appears in
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h: In function  process_vm_writev:
> > /home/phargrov/OMPI/openmpi-2.1.2rc2-linux-sparcv9/openmpi-
> > 2.1.2rc2/opal/include/opal/sys/cma.h:124:18: error:
> > __NR_process_vm_writev undeclared (first use in this function); did you
> > mean process_vm_writev?
> >return syscall(__NR_process_vm_writev, pid, lvec, liovcnt, rvec,
> > riovcnt, flags);
> >   ^~
> >   process_vm_writev
> > Makefile:1838: recipe for target 'mca_btl_sm_la-btl_sm.lo' failed



Re: [OMPI devel] [3.0.0rc1] ppc64/gcc-4.8.3 check failure (regression).

2017-07-04 Thread Kawashima, Takahiro
It might be related to https://github.com/open-mpi/ompi/issues/3697 .
I added a comment to the issue.

Takahiro Kawashima,
Fujitsu

> On a PPC64LE w/ gcc-7.1.0 I see opal_fifo hang instead of failing.
> 
> -Paul
> 
> On Mon, Jul 3, 2017 at 4:39 PM, Paul Hargrove  wrote:
> 
> > On a PPC64 host with gcc-4.8.3 I have configured with
> >
> >  --prefix=[...] --enable-debug \
> > CFLAGS=-m64 --with-wrapper-cflags=-m64 \
> > CXXFLAGS=-m64 --with-wrapper-cxxflags=-m64 \
> > FCFLAGS=-m64 --with-wrapper-fcflags=-m64
> >
> > I see "make check" report a failure from opal_fifo.
> > Previous testing of Open MPI 2.1.1rc1 did not fail this test.
> >
> >
> > I also noticed the following warnings from building opal_lifo and
> > opal_fifo tests, but have found that adding the volatile qualifier in
> > opal_fifo.c did *not* resolve the failure.
> >
> >   CC   opal_lifo.o
> > /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-ppc64-gcc/
> > openmpi-3.0.0rc1/test/class/opal_lifo.c: In function
> > 'check_lifo_consistency':
> > /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-ppc64-gcc/
> > openmpi-3.0.0rc1/test/class/opal_lifo.c:72:26: warning: assignment
> > discards 'volatile' qualifier from pointer target type [enabled by default]
> >>  for (count = 0, item = lifo->opal_lifo_head.data.item ; item !=
> >> &lifo->opal_lifo_ghost ;
> >   ^
> >   CCLD opal_lifo
> >   CC   opal_fifo.o
> > /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-ppc64-gcc/
> > openmpi-3.0.0rc1/test/class/opal_fifo.c: In function
> > 'check_fifo_consistency':
> > /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-ppc64-gcc/
> > openmpi-3.0.0rc1/test/class/opal_fifo.c:109:26: warning: assignment
> > discards 'volatile' qualifier from pointer target type [enabled by default]
> >>  for (count = 0, item = fifo->opal_fifo_head.data.item ; item !=
> >> &fifo->opal_fifo_ghost ;


Re: [OMPI devel] bug in MPI_Comm_accept?

2017-04-04 Thread Kawashima, Takahiro
I filed a PR against v1.10.7 though v1.10.7 may not be released.

  https://github.com/open-mpi/ompi/pull/3276

I'm not aware of the v2.1.x issue, sorry. Other developers may be
able to answer.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Bullseye!
> 
> Thank you, Takahiro, for your quick answer. Brief tests with 1.10.6 show 
> that this did indeed solve the problem! I will look at this in more 
> detail, but it looks really good now.
> 
> About MPI_Comm_accept in 2.1.x. I've seen a thread here by Adam 
> Sylvester, where it essentially says that it is not working now, nor in 
> 2.0.x. I've checked the master, and it also does not work there. Is 
> there any time line for this?
> 
> Thanks a lot!
> 
> Marcin
> 
> 
> 
> On 04/04/2017 11:03 AM, Kawashima, Takahiro wrote:
> > Hi,
> >
> > I encountered a similar problem using MPI_COMM_SPAWN last month.
> > Your problem may be the same.
> >
> > The problem was fixed by commit 0951a34 in Open MPI master and
> > backported to v2.1.x v2.0.x but not backported to v1.8.x and
> > v1.10.x.
> >
> >https://github.com/open-mpi/ompi/commit/0951a34
> >
> > Please try the attached patch. It was backported for v1.10 branch.
> >
> > The problem exists in the memory registration limit calculation
> > in openib BTL and processes loop forever in OMPI_FREE_LIST_WAIT_MT
> > when connecting to other ORTE jobs because openib_reg_mr returns
> > OMPI_ERR_OUT_OF_RESOURCE. It probably affects MPI_COMM_SPAWN,
> > MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Dear Developers,
> >>
> >> This is an old problem, which I described in an email to the users list
> >> in 2015, but I continue to struggle with it. In short, MPI_Comm_accept /
> >> MPI_Comm_disconnect combo causes any communication over openib btl
> >> (e.g., also a barrier) to hang after a few clients connect and
> >> disconnect from the server. I've noticed that the number of successful
> >> connects depends on the number of server ranks, e.g., if my server has
> >> 32 ranks, then the communication hangs already for the second connecting
> >> client.
> >>
> >> I have now checked that the problem exists also in 1.10.6. As far as I
> >> could tell, MPI_Comm_accept is not working in 2.0 and 2.1 at all, so I
> >> could not test those versions. My previous investigations have shown
> >> that the problem was introduced in 1.8.4.
> >>
> >> I wonder, will this be addressed in OpenMPI, or is this part of the MPI
> >> functionality considered less important than the core? Should I file a
> >> bug report?
> >>
> >> Thanks!
> >>
> >> Marcin Krotkiewski
> >>
> >>
> >> On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
> >>> I have run into a freeze / potential bug when using MPI_Comm_accept in
> >>> a simple client / server implementation. I have attached two simplest
> >>> programs I could produce:
> >>>
> >>>   1. mpi-receiver.c opens a port using MPI_Open_port, saves the port
> >>> name to a file
> >>>
> >>>   2. mpi-receiver enters infinite loop and waits for connections using
> >>> MPI_Comm_accept
> >>>
> >>>   3. mpi-sender.c connects to that port using MPI_Comm_connect, sends
> >>> one MPI_UNSIGNED_LONG, calls barrier and disconnects using
> >>> MPI_Comm_disconnect
> >>>
> >>>   4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier
> >>> and disconnects using MPI_Comm_disconnect and goes to point 2 -
> >>> infinite loop
> >>>
> >>> All works fine, but only exactly 5 times. After that the receiver
> >>> hangs in MPI_Recv, after exit from MPI_Comm_accept. That is 100%
> >>> repeatable. I have tried with Intel MPI - no such problem.
> >>>
> >>> I execute the programs using OpenMPI 1.10 as follows
> >>>
> >>> mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
> >>>
> >>>
> >>> Do you have any clues what could be the reason? Am I doing sth wrong,
> >>> or is it some problem with internal state of OpenMPI?
> >>>
> >>> Thanks a lot!
> >>>
> >>> Marcin


Re: [OMPI devel] bug in MPI_Comm_accept?

2017-04-04 Thread Kawashima, Takahiro
Hi,

I encountered a similar problem using MPI_COMM_SPAWN last month.
Your problem may be the same.

The problem was fixed by commit 0951a34 in Open MPI master and
backported to v2.1.x and v2.0.x, but not to v1.8.x and
v1.10.x.

  https://github.com/open-mpi/ompi/commit/0951a34

Please try the attached patch. It is a backport for the v1.10 branch.

The problem exists in the memory registration limit calculation
in openib BTL and processes loop forever in OMPI_FREE_LIST_WAIT_MT
when connecting to other ORTE jobs because openib_reg_mr returns
OMPI_ERR_OUT_OF_RESOURCE. It probably affects MPI_COMM_SPAWN,
MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Dear Developers,
> 
> This is an old problem, which I described in an email to the users list 
> in 2015, but I continue to struggle with it. In short, MPI_Comm_accept / 
> MPI_Comm_disconnect combo causes any communication over openib btl 
> (e.g., also a barrier) to hang after a few clients connect and 
> disconnect from the server. I've noticed that the number of successful 
> connects depends on the number of server ranks, e.g., if my server has 
> 32 ranks, then the communication hangs already for the second connecting 
> client.
> 
> I have now checked that the problem exists also in 1.10.6. As far as I 
> could tell, MPI_Comm_accept is not working in 2.0 and 2.1 at all, so I 
> could not test those versions. My previous investigations have shown 
> that the problem was introduced in 1.8.4.
> 
> I wonder, will this be addressed in OpenMPI, or is this part of the MPI 
> functionality considered less important than the core? Should I file a 
> bug report?
> 
> Thanks!
> 
> Marcin Krotkiewski
> 
> 
> On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
> > I have run into a freeze / potential bug when using MPI_Comm_accept in 
> > a simple client / server implementation. I have attached two simplest 
> > programs I could produce:
> >
> >  1. mpi-receiver.c opens a port using MPI_Open_port, saves the port 
> > name to a file
> >
> >  2. mpi-receiver enters infinite loop and waits for connections using 
> > MPI_Comm_accept
> >
> >  3. mpi-sender.c connects to that port using MPI_Comm_connect, sends 
> > one MPI_UNSIGNED_LONG, calls barrier and disconnects using 
> > MPI_Comm_disconnect
> >
> >  4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier 
> > and disconnects using MPI_Comm_disconnect and goes to point 2 - 
> > infinite loop
> >
> > All works fine, but only exactly 5 times. After that the receiver 
> > hangs in MPI_Recv, after exit from MPI_Comm_accept. That is 100% 
> > repeatable. I have tried with Intel MPI - no such problem.
> >
> > I execute the programs using OpenMPI 1.10 as follows
> >
> > mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
> >
> >
> > Do you have any clues what could be the reason? Am I doing sth wrong, 
> > or is it some problem with internal state of OpenMPI?
> >
> > Thanks a lot!
> >
> > Marcin
diff --git a/ompi/mca/btl/openib/btl_openib.c b/ompi/mca/btl/openib/btl_openib.c
index 7030aa1..38790af 100644
--- a/ompi/mca/btl/openib/btl_openib.c
+++ b/ompi/mca/btl/openib/btl_openib.c
@@ -1074,7 +1074,7 @@ int mca_btl_openib_add_procs(
 }
 
 openib_btl->local_procs += local_procs;
-openib_btl->device->mem_reg_max /= openib_btl->local_procs;
+openib_btl->device->mem_reg_max = openib_btl->device->mem_reg_max_total / openib_btl->local_procs;
 
 return mca_btl_openib_size_queues(openib_btl, nprocs);
 }
diff --git a/ompi/mca/btl/openib/btl_openib.h b/ompi/mca/btl/openib/btl_openib.h
index a3b1f87..38d9d6f 100644
--- a/ompi/mca/btl/openib/btl_openib.h
+++ b/ompi/mca/btl/openib/btl_openib.h
@@ -417,7 +417,7 @@ typedef struct mca_btl_openib_device_t {
 /* Maximum value supported by this device for max_inline_data */
 uint32_t max_inline_data;
 /* Registration limit and current count */
-uint64_t mem_reg_max, mem_reg_active;
+uint64_t mem_reg_max, mem_reg_max_total, mem_reg_active;
 /* Device is ready for use */
 bool ready_for_use;
 } mca_btl_openib_device_t;
diff --git a/ompi/mca/btl/openib/btl_openib_component.c b/ompi/mca/btl/openib/btl_openib_component.c
index 40831f2..06ff9d4 100644
--- a/ompi/mca/btl/openib/btl_openib_component.c
+++ b/ompi/mca/btl/openib/btl_openib_component.c
@@ -1549,7 +1549,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
 }
 
 device->mem_reg_active = 0;
-device->mem_reg_max= calculate_max_reg(ibv_get_device_name(ib_dev));
+device->mem_reg_max_total = calculate_max_reg(ibv_get_device_name(ib_dev));
+device->mem_reg_max = device->mem_reg_max_total;
 
 device->ib_dev = ib_dev;
 device->ib_dev_context = dev_context;
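
To see why the one-line change above matters, here is a small standalone
illustration (the numbers are invented; only the arithmetic mirrors the patch):

/* Illustration with made-up numbers: the old code divides the already-divided
 * limit again on every add_procs call (e.g. each time a connecting job adds
 * local processes), driving it toward zero, while the fix always divides the
 * saved total. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t mem_reg_max_total = 64ULL << 30;   /* assume 64 GiB registrable */
    uint64_t mem_reg_max = mem_reg_max_total;
    int local_procs = 0;

    for (int call = 1; call <= 3; ++call) {
        local_procs += 16;                      /* 16 more local procs each call */
        uint64_t buggy = mem_reg_max / local_procs;        /* old: mem_reg_max /= local_procs */
        uint64_t fixed = mem_reg_max_total / local_procs;  /* new: divide the saved total */
        mem_reg_max = buggy;
        printf("add_procs #%d: old limit = %llu MiB, fixed limit = %llu MiB\n",
               call, (unsigned long long)(buggy >> 20), (unsigned long long)(fixed >> 20));
    }
    return 0;
}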

Re: [OMPI devel] MCA Component Development: Function Pointers

2017-01-18 Thread Kawashima, Takahiro
Hi,

I created a pull request to add the persistent collective
communication request feature to Open MPI. Though it's
incomplete and will not be merged into Open MPI soon,
you can experiment with your collective algorithms on top of my work.

  https://github.com/open-mpi/ompi/pull/2758
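
For readers new to the feature: the user-visible pattern is the Start/Wait
cycle already available for persistent point-to-point requests (Gilles
describes the same scheduling idea further down in the quoted thread). A
minimal standard-MPI sketch of that pattern, not using the experimental
MPIX_ API itself:

/* Persistent requests: create once, then Start + Wait many times.  A
 * persistent collective would follow the same cycle.  Assumes comm has
 * exactly two ranks. */
#include <mpi.h>

void persistent_exchange(double *buf, int count, MPI_Comm comm)
{
    int rank;
    MPI_Request req;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        MPI_Send_init(buf, count, MPI_DOUBLE, 1, 0, comm, &req);
    else
        MPI_Recv_init(buf, count, MPI_DOUBLE, 0, 0, comm, &req);

    for (int iter = 0; iter < 100; ++iter) {
        MPI_Start(&req);                    /* reuse the same request every iteration */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}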

Takahiro Kawashima,
MPI development team,
Fujitsu

> Bradley,
> 
> 
> good to hear that !
> 
> 
> What Jeff meant in his previous email, is that since persistent 
> collectives are not (yet) part of the standard, user visible functions
> 
> (Pbcast_init, Pcoll_start, ...) should be part of an extension (e.g. 
> ompi/mpiext/pcoll) and should be named with the MPIX_ prefix
> 
> (e.g. MPIX_Pbcast_init)
> 
> 
> if you can make your source code available (e.g. github, bitbucket, 
> email, ...), then we'll get some more chances to review it and guide you.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 8/1/2016 12:41 PM, Bradley Morgan wrote:
> >
> > Gilles, Nathan, Jeff, George, and the OMPI Developer Community,
> >
> > Thank you all for your kind and helpful responses.
> >
> > I have been gathering your advice and trying to put the various pieces 
> > together.
> >
> > Currently, I have managed to graft a new function MPI_LIBPNBC_Start at 
> > the MPI level with a corresponding pointer into 
> > mca->coll->libpnbc->mca_coll_libpnbc_start() and I can get it to fire 
> > from my test code.  This required good deal of hacking on some of the 
> > core files in trunk/ompi/mpi/c/… and trunk/ompi/mca/coll/… Not ideal, 
> > I’m sure, but for my purposes (and level of familiarity) just getting 
> > this to fire is a breakthrough.
> >
> > I will delve into some of the cleaner looking methods that you all 
> > have provided―I still need much more familiarity with the codebase, as 
> > I often find myself way out in the woods :)
> >
> > Thanks again to all of you for your help.  It is nice to find a 
> > welcoming community of developers.  I hope to be in touch soon with 
> > some more useful findings for you.
> >
> >
> > Best Regards,
> >
> > -Bradley
> >
> >
> >
> >
> >> On Jul 31, 2016, at 5:28 PM, George Bosilca wrote:
> >>
> >> Bradley,
> >>
> >> We had similar needs in one of our projects and as a quick hack we 
> >> extended the GRequest interface to support persistent requests. There 
> >> are cleaner ways, but we decided that highjacking 
> >> the OMPI_REQUEST_GEN was good enough for a proof-of-concept. Then add 
> >> a start member to the ompi_grequest_t in request/grequest.h, and then 
> >> do what Nathan suggested by extending the switch in the 
> >> ompi/mpi/c/start.c (and startall), and directly call your own start 
> >> function.
> >>
> >> George.
> >>
> >>
> >> On Sat, Jul 30, 2016 at 6:29 PM, Jeff Squyres (jsquyres) wrote:
> >>
> >> Also be aware of the Open MPI Extensions framework, explicitly
> >> intended for adding new/experimental APIs to mpi.h and the
> >> Fortran equivalents.  See ompi/mpiext.
> >>
> >>
> >> > On Jul 29, 2016, at 11:16 PM, Gilles Gouaillardet wrote:
> >> >
> >> > For a proof-of-concept, I'd rather suggest you add
> >> MPI_Pcoll_start(), and add a pointer in mca_coll_base_comm_coll_t.
> >> > If you add MCA_PML_REQUEST_COLL, then you have to update all
> >> pml components (fastidious), if you update start.c (quite
> >> simple), then you also need to update start_all.c (less trivial)
> >> > If the future standard mandates the use of MPI_Start and
> >> MPI_Startall, then we will reconsider this.
> >> >
> >> > From a performance point of view, that should not change much.
> >> > IMHO, non blocking collectives come with a lot of overhead, so
> >> shaving a few nanoseconds here and then will unlikely change the
> >> big picture.
> >> >
> >> > If I oversimplify libnbc, it basically schedule MPI_Isend,
> >> MPI_Irecv and MPI_Wait (well, MPI_Test since this is on blocking,
> >> but let's keep it simple)
> >> > My intuition is your libpnbc will post MPI_Send_init,
> >> MPI_Recv_init, and schedule MPI_Start and MPI_Wait.
> >> > Because of the overhead, I would only expect marginal
> >> performance improvement, if any.
> >> >
> >> > Cheers,
> >> >
> >> > Gilles
> >> >
> >> > On Saturday, July 30, 2016, Bradley Morgan wrote:
> >> >
> >> > Hello Gilles,
> >> >
> >> > Thank you very much for your response.
> >> >
> >> > My understanding is yes, this might be part of the future
> >> standard―but probably not from my work alone.  I’m currently just
> >> trying get a proof-of-concept and some performance metrics.
> >> >
> >> > I have item one of your list completed, but not the others.  I
> >> will look into adding 

Re: [OMPI devel] Open MPI 2.0.0: Fortran with NAG compiler (nagfor)

2016-08-23 Thread Kawashima, Takahiro
Gilles, Jeff,

In the Open MPI 1.6 days, MPI_ARGVS_NULL and MPI_STATUSES_IGNORE
were defined as double precision, and the MPI_Comm_spawn_multiple
and MPI_Waitall etc. interfaces had two subroutines each.

  https://github.com/open-mpi/ompi-release/blob/v1.6/ompi/include/mpif-common.h#L148
  https://github.com/open-mpi/ompi-release/blob/v1.6/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh#L9568
  https://github.com/open-mpi/ompi-release/blob/v1.6/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh#L8639

This situation changed in Open MPI 1.8, but perhaps the
use-mpi-tkr interface was not correctly adapted.
Though I don't know the background, the files that caused
the compilation errors may be unnecessary.

> Jeff,
> 
> so it seems NAG uses the use-mpi-tkr interface.
> do you have any recollection of why
> - in mpi_waitall_f90.f90, MPI_Waitall and friends have two implementations
> (e.g. MPI_WaitallI and MPI_WaitallS)
> especially, MPI_WaitallI declares array_of_statuses as a double precision
> 
> - in mpi-f90-interfaces.h, MPI_Waitall has only one prototype (no double
> precision)
> 
> Cheers,
> 
> Gilles
> 
> 
> On Monday, August 15, 2016, Franz-Joseph Barthold wrote:
> 
> > Dear developers,
> >
> > I actually compiled Open MPI 2.0.0 using the NAG compiler (nagfor) in its
> > recent version.
> >
> > The following error occur:
> >
> > make[2]: Entering directory '/home/barthold/devel/prgs/ext
> > ernal/openmpi-2.0.0/
> > ompi/mpi/fortran/use-mpi-tkr'
> >   FC   mpi_comm_spawn_multiple_f90.lo
> > NAG Fortran Compiler Release 6.1(Tozai) Build 6106
> > Error: mpi_comm_spawn_multiple_f90.f90: Argument 3 to
> > MPI_COMM_SPAWN_MULTIPLE
> > has data type DOUBLE PRECISION in reference from MPI_COMM_SPAWN_MULTIPLEN
> > and
> > CHARACTER in reference from MPI_COMM_SPAWN_MULTIPLEA
> > [NAG Fortran Compiler error termination, 1 error]
> > Makefile:1896: recipe for target 'mpi_comm_spawn_multiple_f90.lo' failed
> > make[2]: [mpi_comm_spawn_multiple_f90.lo] Error 1 (ignored)
> >   FC   mpi_testall_f90.lo
> > NAG Fortran Compiler Release 6.1(Tozai) Build 6106
> > Error: mpi_testall_f90.f90: Argument 4 to MPI_TESTALL has data type DOUBLE
> > PRECISION in reference from MPI_TESTALLI and INTEGER in reference from
> > MPI_TESTALLS
> > [NAG Fortran Compiler error termination, 1 error]
> > Makefile:1896: recipe for target 'mpi_testall_f90.lo' failed
> > make[2]: [mpi_testall_f90.lo] Error 1 (ignored)
> >   FC   mpi_testsome_f90.lo
> > NAG Fortran Compiler Release 6.1(Tozai) Build 6106
> > Error: mpi_testsome_f90.f90: Argument 5 to MPI_TESTSOME has data type
> > DOUBLE
> > PRECISION in reference from MPI_TESTSOMEI and INTEGER in reference from
> > MPI_TESTSOMES
> > [NAG Fortran Compiler error termination, 1 error]
> > Makefile:1896: recipe for target 'mpi_testsome_f90.lo' failed
> > make[2]: [mpi_testsome_f90.lo] Error 1 (ignored)
> >   FC   mpi_waitall_f90.lo
> > NAG Fortran Compiler Release 6.1(Tozai) Build 6106
> > Error: mpi_waitall_f90.f90: Argument 3 to MPI_WAITALL has data type DOUBLE
> > PRECISION in reference from MPI_WAITALLI and INTEGER in reference from
> > MPI_WAITALLS
> > [NAG Fortran Compiler error termination, 1 error]
> > Makefile:1896: recipe for target 'mpi_waitall_f90.lo' failed
> > make[2]: [mpi_waitall_f90.lo] Error 1 (ignored)
> >   FC   mpi_waitsome_f90.lo
> > NAG Fortran Compiler Release 6.1(Tozai) Build 6106
> > Error: mpi_waitsome_f90.f90: Argument 5 to MPI_WAITSOME has data type
> > DOUBLE
> > PRECISION in reference from MPI_WAITSOMEI and INTEGER in reference from
> > MPI_WAITSOMES
> > [NAG Fortran Compiler error termination, 1 error]
> > Makefile:1896: recipe for target 'mpi_waitsome_f90.lo' failed
> >
> > Other compiler (Absoft 16.0, Gfortran 6.1, Intel 16 ifort) are less
> > sensitive
> > and compile. But may fail during runtime.
> >
> > My experience is that NAG (nagfor) is the most sensitive Fortran compiler.
> > Thus, I recommend its usage within the development phase.


Re: [OMPI devel] Class information in OpenMPI

2016-07-07 Thread KAWASHIMA Takahiro
FWIW, I have my private notes on process- and datatype-related structs.

  https://rivis.github.io/doc/openmpi/openmpi-source-reading.en.xhtml

They are created by my hands with the help of Autodia.

  http://www.aarontrevena.co.uk/opensource/autodia/

> I want to know if there is “class diagram” for OpenMPI code base that shows 
> existing classes and dependencies/associations. Are there any available tools 
> to extract and visualize this information.

Thanks,
KAWASHIMA Takahiro


Re: [OMPI devel] Missing support for 2 types in MPI_Sizeof()

2016-04-15 Thread Kawashima, Takahiro
> I just checked MPICH 3.2, and they *do* include MPI_SIZEOF interfaces for 
> CHARACTER and LOGICAL, but they are missing many of the other MPI_SIZEOF 
> interfaces that we have in OMPI.  Meaning: OMPI and MPICH already diverge 
> wildly on MPI_SIZEOF.  :-\

And OMPI 1.6 also had MPI_SIZEOF interfaces for CHARACTER and LOGICAL.  :-)

  
https://github.com/open-mpi/ompi-release/blob/v1.6/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh#L27


> Nadia --
> 
> I believe that the character and logical types are not in this script already 
> because the description of MPI_SIZEOF in MPI-3.1 says that the input choice 
> buffer parameter is:
> 
> IN x a Fortran variable of numeric intrinsic type (choice)
> 
> As I understand it (and my usual disclaimer here: I am *not* a Fortran 
> expert), CHARACTER and LOGICAL types are not numeric in Fortran.
> 
> However, we could add such interfaces as an extension.
> 
> I just checked MPICH 3.2, and they *do* include MPI_SIZEOF interfaces for 
> CHARACTER and LOGICAL, but they are missing many of the other MPI_SIZEOF 
> interfaces that we have in OMPI.  Meaning: OMPI and MPICH already diverge 
> wildly on MPI_SIZEOF.  :-\
> 
> I guess I don't have a strong opinion here.  If you file a PR for this patch, 
> I won't object.  :-)
> 
> 
> > On Apr 15, 2016, at 3:22 AM, DERBEY, NADIA  wrote:
> > 
> > Hi,
> > 
> > The following trivial example doesn't compile because of 2 missing types 
> > in the MPI_SIZEOF subroutines (in mpi_sizeof.f90).
> > 
> > [derbeyn@btp0 test]$ cat mpi_sizeof.f90
> >   program main
> > !use mpi
> >   include 'mpif.h'
> > 
> >   integer ierr, sz, mpisize
> >   real r1
> >   integer i1
> >   character ch1
> >   logical l1
> > 
> >   call MPI_INIT(ierr)
> >   call MPI_SIZEOF(r1, sz, ierr)
> >   call MPI_SIZEOF(i1, sz, ierr)
> >   call MPI_SIZEOF(l1, sz, ierr)
> >   call MPI_SIZEOF(ch1, sz, ierr)
> >   call MPI_FINALIZE(ierr)
> > 
> >   end
> > [derbeyn@btp0 test]$ mpif90 -o mpi_sizeof mpi_sizeof.f90
> > mpi_sizeof.f90(14): error #6285: There is no matching specific 
> > subroutine for this generic subroutine call.   [MPI_SIZEOF]
> >   call MPI_SIZEOF(ch1, sz, ierr)
> > -^
> > mpi_sizeof.f90(15): error #6285: There is no matching specific
> > subroutine for this generic subroutine call.   [MPI_SIZEOF]
> >   call MPI_SIZEOF(l1, sz, ierr)
> > -^
> > compilation aborted for mpi_sizeof.f90 (code 1)
> > 
> > 
> > This problem happens both on master and v2.x. The following patch seems
> > to solve the issue:
> > 
> > diff --git a/ompi/mpi/fortran/base/gen-mpi-sizeof.pl
> > b/ompi/mpi/fortran/base/gen-mpi-sizeof.pl
> > index 5ea3dca3..a2a99924 100755
> > --- a/ompi/mpi/fortran/base/gen-mpi-sizeof.pl
> > +++ b/ompi/mpi/fortran/base/gen-mpi-sizeof.pl
> > @@ -145,6 +145,9 @@ sub generate {
> ># Main
> > 
> > #
> > 
> > +queue_sub("character", "char", "character_kinds");
> > +queue_sub("logical", "logical", "logical_kinds");
> > +
> >for my $size (qw/8 16 32 64/) {
> >queue_sub("integer(int${size})", "int${size}", "int${size}");
> >}


Re: [OMPI devel] Fwd: [OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Kawashima, Takahiro
Gilles,

I see. Thanks!

Takahiro Kawashima,
MPI development team,
Fujitsu

> Kawashima-san,
> 
> we always duplicate the communicator, and use the CID of the duplicated 
> communicator, so bottom line,
> there cannot be more than one window per communicator.
> 
> i will double check about using PID. if a broadcast is needed, i would 
> rather use the process name of rank 0 in order to avoid a broadcast.
> 
> Cheers,
> 
> Gilles
> 
> On 2/3/2016 8:40 AM, Kawashima, Takahiro wrote:
> > Nathan,
> >
> > Is it sufficient?
> > Multiple windows can be created on a communicator.
> > So I think PID + CID is not sufficient.
> >
> > Possible fixes:
> > - The root process creates a filename with a random number
> >and broadcast it in the communicator.
> > - Use per-communicator counter and use it in the filename.
> >
> > Regards,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Hmm, I think you are correct. There may be instances where two different
> >> local processes may use the same CID for different communicators. It
> >> should be sufficient to add the PID of the current process to the
> >> filename to ensure it is unique.
> >>
> >> -Nathan
> >>
> >> On Tue, Feb 02, 2016 at 09:33:29PM +0900, Gilles Gouaillardet wrote:
> >>> Nathan,
> >>> the sm osc component uses communicator CID to name the file that will 
> >>> be
> >>> used to create shared memory segments.
> >>> if I understand and correctly, two different communicators coming 
> >>> from the
> >>> same MPI_Comm_split might share the same CID, so CID (alone) cannot be
> >>> used to generate a unique per communicator file name
> >>> Makes sense ?
> >>> Cheers,
> >>> Gilles
> >>>
> >>> -- Forwarded message --
> >>> From: Peter Wind <peter.w...@met.no>
> >>> Date: Tuesday, February 2, 2016
> >>> Subject: [OMPI users] shared memory under fortran, bug?
> >>> To: us...@open-mpi.org
> >>>
> >>> Enclosed is a short (< 100 lines) fortran code example that uses 
> >>> shared
> >>> memory.
> >>> It seems to me it behaves wrongly if openmpi is used.
> >>> Compiled with SGI/mpt , it gives the right result.
> >>>
> >>> To fail, the code must be run on a single node.
> >>> It creates two groups of 2 processes each. Within each group memory is
> >>> shared.
> >>> The error is that the two groups get the same memory allocated, but 
> >>> they
> >>> should not.
> >>>
> >>> Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, 
> >>> intel
> >>> 14.0
> >>> all fail.
> >>>
> >>> The call:
> >>>call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
> >>> comm_group, cp1, win, ierr)
> >>>
> >>> Should allocate memory only within the group. But when the other group
> >>> allocates memory, the pointers from the two groups point to the same
> >>> address in memory.
> >>>
> >>> Could you please confirm that this is the wrong behaviour?
> >>>
> >>> Best regards,
> >>> Peter Wind
> >>> program shmem_mpi
> >>>
> >>> !
> >>> ! in this example two groups are created, within each group memory is 
> >>> shared.
> >>> ! Still the other group get allocated the same adress space, which it 
> >>> shouldn't.
> >>> !
> >>> ! Run with 4 processes, mpirun -np 4 a.out
> >>>
> >>>
> >>> use mpi
> >>>
> >>> use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer
> >>>
> >>> implicit none
> >>> !   include 'mpif.h'
> >>>
> >>> integer, parameter :: nsize = 100
> >>> integer, pointer   :: array(:)
> >>> integer:: num_procs
> >>> integer:: ierr
> >>> integer:: irank, irank_group
> >>> integer:: win
> >>> integer:: comm = MPI_COMM_WORLD
> >>> integer:: disp_unit
> >>

Re: [OMPI devel] Fwd: [OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Kawashima, Takahiro
Nathan,

Is it sufficient?
Multiple windows can be created on a communicator,
so I think PID + CID is not sufficient.

Possible fixes:
- The root process creates a filename with a random number
  and broadcasts it in the communicator.
- Use per-communicator counter and use it in the filename.
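
A minimal sketch of how such a unique name could be built (illustrative only;
the path, helper and parameters are assumptions, not Open MPI's actual osc/sm
code): broadcast rank 0's PID so the whole communicator agrees on it, and add
a per-communicator window counter to disambiguate multiple windows.

/* Hedged sketch: build a segment name that stays unique when two local
 * processes reuse a CID and when several windows are created on one
 * communicator. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static void make_segment_name(MPI_Comm comm, int cid, int win_count,
                              char *name, size_t len)
{
    long root_pid = (long) getpid();
    MPI_Bcast(&root_pid, 1, MPI_LONG, 0, comm);   /* everyone now holds rank 0's PID */
    snprintf(name, len, "/dev/shm/osc_sm_segment.%ld.%d.%d",
             root_pid, cid, win_count);
}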

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Hmm, I think you are correct. There may be instances where two different
> local processes may use the same CID for different communicators. It
> should be sufficient to add the PID of the current process to the
> filename to ensure it is unique.
> 
> -Nathan
> 
> On Tue, Feb 02, 2016 at 09:33:29PM +0900, Gilles Gouaillardet wrote:
> >Nathan,
> >the sm osc component uses communicator CID to name the file that will be
> >used to create shared memory segments.
> >if I understand and correctly, two different communicators coming from 
> > the
> >same MPI_Comm_split might share the same CID, so CID (alone) cannot be
> >used to generate a unique per communicator file name
> >Makes sense ?
> >Cheers,
> >Gilles
> > 
> >-- Forwarded message --
> >From: Peter Wind 
> >Date: Tuesday, February 2, 2016
> >Subject: [OMPI users] shared memory under fortran, bug?
> >To: us...@open-mpi.org
> > 
> >Enclosed is a short (< 100 lines) fortran code example that uses shared
> >memory.
> >It seems to me it behaves wrongly if openmpi is used.
> >Compiled with SGI/mpt , it gives the right result.
> > 
> >To fail, the code must be run on a single node.
> >It creates two groups of 2 processes each. Within each group memory is
> >shared.
> >The error is that the two groups get the same memory allocated, but they
> >should not.
> > 
> >Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel
> >14.0
> >all fail.
> > 
> >The call:
> >   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
> >comm_group, cp1, win, ierr)
> > 
> >Should allocate memory only within the group. But when the other group
> >allocates memory, the pointers from the two groups point to the same
> >address in memory.
> > 
> >Could you please confirm that this is the wrong behaviour?
> > 
> >Best regards,
> >Peter Wind
> 
> > program shmem_mpi
> > 
> >!
> >! in this example two groups are created, within each group memory is 
> > shared.
> >! Still the other group get allocated the same adress space, which it 
> > shouldn't.
> >!
> >! Run with 4 processes, mpirun -np 4 a.out
> > 
> > 
> >use mpi
> > 
> >use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer
> > 
> >implicit none
> > !   include 'mpif.h'
> > 
> >integer, parameter :: nsize = 100
> >integer, pointer   :: array(:)
> >integer:: num_procs
> >integer:: ierr
> >integer:: irank, irank_group
> >integer:: win
> >integer:: comm = MPI_COMM_WORLD
> >integer:: disp_unit
> >type(c_ptr):: cp1
> >type(c_ptr):: cp2
> >integer:: comm_group
> > 
> >integer(MPI_ADDRESS_KIND) :: win_size
> >integer(MPI_ADDRESS_KIND) :: segment_size
> > 
> >call MPI_Init(ierr)
> >call MPI_Comm_size(comm, num_procs, ierr)
> >call MPI_Comm_rank(comm, irank, ierr)
> > 
> >disp_unit = sizeof(1)
> >call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
> >call MPI_Comm_rank(comm_group, irank_group, ierr)
> > !   print *, 'irank=', irank, ' group rank=', irank_group
> > 
> >if (irank_group == 0) then
> >   win_size = nsize*disp_unit
> >else
> >   win_size = 0
> >endif
> > 
> >call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, 
> > comm_group, cp1, win, ierr)
> >call MPI_Win_fence(0, win, ierr)
> > 
> >call MPI_Win_shared_query(win, 0, segment_size, disp_unit, cp2, ierr)
> > 
> >call MPI_Win_fence(0, win, ierr)
> >CALL MPI_BARRIER(comm, ierr)! allocations finished
> > !   print *, 'irank=', irank, ' size ', segment_size
> > 
> >call c_f_pointer(cp2, array, [nsize])
> > 
> >array(1)=0;array(2)=0
> >CALL MPI_BARRIER(comm, ierr)!
> > 77 format(4(A,I3))
> >if(irank < num_procs/2)then
> >   if (irank_group == 0)array(1)=11
> >   CALL MPI_BARRIER(comm, ierr)
> >   print 77, 'Group 0, rank', irank, ':  array ', array(1), ' ',array(2)
> >   CALL MPI_BARRIER(comm, ierr)!Group 1 not yet start writing
> >   CALL MPI_BARRIER(comm, ierr)!Group 1 finished writing
> >   print 77, 'Group 0, rank', irank, ':  array ', array(1),' ',array(2) 
> >   if(array(1)==11.and.array(2)==0)then
> >  print *,irank,' correct result'
> >   else
> >  print *,irank,' wrong result'
> >   endif
> >else
> >   CALL MPI_BARRIER(comm, ierr)
> >   CALL MPI_BARRIER(comm, ierr)!Group 0 finished writing
> 

Re: [OMPI devel] Please test: v1.10.1rc3

2015-10-30 Thread Kawashima, Takahiro
`configure && make && make install && make check` and
running some sample MPI programs succeeded with 1.10.1rc3
on my SPARC-V9/Linux/GCC machine (Fujitsu PRIMEHPC FX10).

No @SET_MAKE@ appears in any Makefiles, of course.

> > For the first time I was also able to (attempt to) test SPARC64 via QEMU.
> > I have a "very odd" failure on this system in which "@SET_MAKE@" appears
> > un-expanded in several generated Makefiles.
> > For that reason the testing on this platform did not finish.
> > I am still investigating, but currently am assuming this is some issue
> > like sed crashing (due to bad emulation?) rather than anything in Open MPI.
> >
> [...]
> 
> Each time I run config.status in the build directory, a *different* set of
> random Makefiles end up with unexpanded instances of "@SET_MAKE@".
> I don't know what other configure substitutions might be passing through
> unexpanded.
> Anyway, I cannot conceive of any way in which this behavior could be Open
> MPI's fault.
> So I am going to discard the emulated SPARC64 system as grossly unreliable.


Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-04 Thread Kawashima, Takahiro
Brice,

I'm a developer of Fujitsu MPI for K computer and Fujitsu
PRIMEHPC FX10/FX100 (SPARC-based CPU).

Though I'm not familiar with the hwloc code and didn't know about
the issue reported by Gilles, I would also be able to help
you fix it.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Thanks Brice,
> 
> bottom line, even if hwloc is not fully ported, it should build and ompi 
> should get something usable.
> in this case, i have no objection removing the --without-hwloc configure 
> option.
> 
> you can contact me off-list regarding the FX10 specific issue
> 
> Cheers,
> 
> Gilles
> 
> On 9/4/2015 2:31 PM, Brice Goglin wrote:
> > Le 04/09/2015 00:36, Gilles Gouaillardet a écrit :
> >> Ralph,
> >>
> >> just to be clear, your proposal is to abort if openmpi is configured 
> >> with --without-hwloc, right ?
> >> ( the --with-hwloc option is not removed because we want to keep the 
> >> option of using an external hwloc library )
> >>
> >> if I understand correctly, Paul's point is that if openmpi is ported 
> >> to a new architecture for which hwloc has not been ported yet 
> >> (embedded hwloc or external hwloc), then the very first step is to 
> >> port hwloc before ompi can be built.
> >>
> >> did I get it right Paul ?
> >>
> >> Brice, what would happen in such a case ?
> >> embedded hwloc cannot be built ?
> >> hwloc returns little or no information ?
> >
> > If it's a new operating system and it supports at least things like 
> > sysconf, you will get a Machine object with one PUs per logical processor.
> >
> > If it's a new platform running Linux, they are supposed to tell Linux 
> > at least package/core/thread information. That's what we have for ARM 
> > for instance.
> >
> > Missing topology detection can be worked around easily (with XML and 
> > synthetic description, what we did for BlueGene/Q before adding manual 
> > support for that specific processor). Binding support can't.
> > And once you get binding, you get x86-topology even if the operating 
> > system isn't supported (using cpuid).
> >
> >> for example, on Fujitsu FX10 node (single socket, 16 cores), hwloc 
> >> reports 16 sockets with one core each and no cache. though this is 
> >> not correct, that can be seen as equivalent to the real config by 
> >> ompi, so this is not really an issue for ompi.
> >
> > Can you help fixing this?
> >
> > The issue is indeed with supercomputers with uncommon architectures 
> > like this one.

Re: [OMPI devel] OpenMPI 1.8 Bug Report

2015-08-27 Thread Kawashima, Takahiro
Oh, I also noticed it yesterday and was about to report it.

And one more: the base parameter of MPI_Win_detach is missing the const attribute as well.

Regards,
Takahiro Kawashima

> Dear OpenMPI developers,
> 
> I noticed a bug in the definition of the 3 MPI-3 RMA functions
> MPI_Compare_and_swap, MPI_Fetch_and_op and MPI_Raccumulate.
> 
> According to the MPI standard, the origin_addr and compare_addr
> parameters of these functions have a const attribute, which is missing
> in OpenMPI's mpi.h (OpenMPI 1.8.x and 1.10.0).
> 
> Regards,
> 
> Michael
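
For reference, the MPI-3.1 C bindings with the const qualifiers in question
(including the MPI_Win_detach base parameter mentioned above):

#include <mpi.h>

/* MPI-3.1 C bindings: the origin/compare buffers, and the base passed to
 * MPI_Win_detach, are const-qualified in the standard. */
int MPI_Compare_and_swap(const void *origin_addr, const void *compare_addr,
                         void *result_addr, MPI_Datatype datatype,
                         int target_rank, MPI_Aint target_disp, MPI_Win win);
int MPI_Fetch_and_op(const void *origin_addr, void *result_addr,
                     MPI_Datatype datatype, int target_rank,
                     MPI_Aint target_disp, MPI_Op op, MPI_Win win);
int MPI_Raccumulate(const void *origin_addr, int origin_count,
                    MPI_Datatype origin_datatype, int target_rank,
                    MPI_Aint target_disp, int target_count,
                    MPI_Datatype target_datatype, MPI_Op op, MPI_Win win,
                    MPI_Request *request);
int MPI_Win_detach(MPI_Win win, const void *base);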


Re: [OMPI devel] v1.10.0rc1 available for testing

2015-07-17 Thread Kawashima, Takahiro
Hi folks,

`configure && make && make install && make test` and
running some sample MPI programs succeeded with 1.10.0rc1
on my SPARC-V9/Linux/GCC machine (Fujitsu PRIMEHPC FX10).

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi folks
> 
> Now that 1.8.7 is out the door, we need to switch our attention to releasing 
> 1.10.0, which has been patiently waiting for quite some time. There is a 
> blocker on its release due to a USNIC issue, but that will only affect Cisco 
> at this time. So I’d like to begin the release cycle so the rest of the code 
> gets tested.
> 
> http://www.open-mpi.org/software/ompi/v1.10/ 
> 


Re: [OMPI devel] RFC: standardize verbosity values

2015-06-08 Thread KAWASHIMA Takahiro
> static const char* const priorities[] = {
> "ERROR",
> "WARN",
> "INFO",
> "DEBUG",
> "TRACE"
> };

+1

I usually use these levels.

Typical usage:

ERROR:
  Print an error message on returning a value other than
  OMPI_SUCCESS (and OMPI_ERR_TEMP_OUT_OF_RESOURCE etc.).

WARN:
  This does not indicate an error, but users/developers should
  be aware of it when debugging/tuning. For example, a network-level
  timeout, a full hardware queue, or buggy code.
  Often used with OMPI_ERR_TEMP_OUT_OF_RESOURCE.

INFO:
  Information that may be useful for users and developers.
  Not so verbose. Output only on initialization or
  object creation etc.

DEBUG:
  Information that is useful only for developers.
  Not so verbose. Output once per MPI routine call.

TRACE:
  Information that is useful only for developers.
  Verbose. Output more than once per MPI routine call.
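
A minimal sketch of how these names could map onto the existing 0-100 range
(the numeric mapping is an assumption, following the none/low/med/high/max =
0/25/50/75/100 steps proposed earlier in the thread):

/* Hedged sketch of the proposed enumerator: accept either one of the named
 * levels above or a raw 0-100 value. */
#include <stdlib.h>
#include <strings.h>

static const struct { const char *name; int value; } verb_levels[] = {
    { "error", 0 }, { "warn", 25 }, { "info", 50 }, { "debug", 75 }, { "trace", 100 },
};

static int parse_verbosity(const char *str)
{
    for (size_t i = 0; i < sizeof(verb_levels) / sizeof(verb_levels[0]); ++i) {
        if (0 == strcasecmp(str, verb_levels[i].name)) {
            return verb_levels[i].value;
        }
    }
    return atoi(str);   /* fall back to an explicit numeric value in [0, 100] */
}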

Regards,
KAWASHIMA Takahiro

> so what about :
> 
> static const char* const priorities[] = {
> "ERROR",
> "WARN",
> "INFO",
> "DEBUG",
> "TRACE"
> };
> 
> and merge debug and trace if there should be only 4
> 
> Cheers,
> 
> Gilles
> 
> 
> On Monday, June 8, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > Could we maybe narrow it down some? If we are going to do it, let’s not
> > make the mistake of the MCA param system and create so many levels. Nobody
> > can figure out the right gradation as it is just too fine grained.
> >
> > I think Nathan’s proposal is the max that makes sense.
> >
> > I’d also like to see us apply the same logic to the MCA param system.
> > Let’s just define ~4 named levels and get rid of the fine grained numbering.
> >
> >
> > On Jun 8, 2015, at 2:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> >  Nathan,
> >
> > i think it is a good idea to use names vs numeric values for verbosity.
> >
> > what about using "a la" log4c verbosity names ?
> > http://sourceforge.net/projects/log4c/
> >
> > static const char* const priorities[] = {
> > "FATAL",
> > "ALERT",
> > "CRIT",
> > "ERROR",
> > "WARN",
> > "NOTICE",
> > "INFO",
> > "DEBUG",
> > "TRACE",
> > "NOTSET",
> > "UNKNOWN"
> > };
> >
> > Cheers,
> >
> > Gilles
> >
> > On 5/30/2015 1:32 AM, Nathan Hjelm wrote:
> >
> > At the moment we have a loosely enforced standard for verbosity
> > values. In general frameworks accept anything in the range 0 - 100 with
> > few exceptions. I am thinking about adding an enumerator for verbosities
> > that will accept values in this range and certain named constants which
> > will match with specific verbosity levels. One possible set: none - 0,
> > low - 25, med - 50, high - 75, max - 100. I am open to any set of named
> > verbosities.
> >
> > Thoughts?


Re: [OMPI devel] c_accumulate

2015-04-20 Thread Kawashima, Takahiro
Gilles,

Sorry for confusing you.

My understanding is:

MPI_WIN_FENCE has four roles regarding access/exposure epochs.

  - end access epoch
  - end exposure epoch
  - start access epoch
  - start exposure epoch

In order to end access/exposure epochs, a barrier is not needed
in the MPI implementation for MPI_MODE_NOPRECEDE.
But in order to start access/exposure epochs, synchronization
is still needed in the MPI implementation even for MPI_MODE_NOPRECEDE.

This synchronization (the latter case above) is not necessarily
a barrier. A peer-to-peer synchronization for the origin/target
pair is sufficient. But an easy implementation is using a barrier.

Thanks,
Takahiro Kawashima,

> Kawashima-san,
> 
> i am confused ...
> 
> as you wrote :
> 
> > In the MPI_MODE_NOPRECEDE case, a barrier is not necessary
> > in the MPI implementation to end access/exposure epochs.
> 
> 
> and the test case calls MPI_Win_fence with MPI_MODE_NOPRECEDE.
> 
> are you saying Open MPI implementation of MPI_Win_fence should perform
> a barrier in this case (e.g. MPI_MODE_NOPRECEDE) ?
> 
> Cheers,
> 
> Gilles
> 
> On 4/21/2015 11:08 AM, Kawashima, Takahiro wrote:
> > Hi Gilles, Nathan,
> >
> > No, my conclusion is that the MPI program does not need a MPI_Barrier
> > but MPI implementations need some synchronizations.
> >
> > Thanks,
> > Takahiro Kawashima,
> >
> >> Kawashima-san,
> >>
> >> Nathan reached the same conclusion (see the github issue) and i fixed
> >> the test
> >> by manually adding a MPI_Barrier.
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 4/21/2015 10:20 AM, Kawashima, Takahiro wrote:
> >>> Hi Gilles, Nathan,
> >>>
> >>> I read the MPI standard but I think the standard doesn't
> >>> require a barrier in the test program.
> >>>
> >>> >From the standards (11.5.1 Fence) :
> >>>
> >>>   A fence call usually entails a barrier synchronization:
> >>> a process completes a call to MPI_WIN_FENCE only after all
> >>> other processes in the group entered their matching call.
> >>> However, a call to MPI_WIN_FENCE that is known not to end
> >>> any epoch (in particular, a call with assert equal to
> >>> MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.
> >>>
> >>> This sentence is misleading.
> >>>
> >>> In the non-MPI_MODE_NOPRECEDE case, a barrier is necessary
> >>> in the MPI implementation to end access/exposure epochs.
> >>>
> >>> In the MPI_MODE_NOPRECEDE case, a barrier is not necessary
> >>> in the MPI implementation to end access/exposure epochs.
> >>> Also, a *global* barrier is not necessary in the MPI
> >>> implementation to start access/exposure epochs. But some
> >>> synchronizations are still needed to start an exposure epoch.
> >>>
> >>> For example, let's assume all ranks call MPI_WIN_FENCE(MPI_MODE_NOPRECEDE)
> >>> and then rank 0 calls MPI_PUT to rank 1. In this case, rank 0
> >>> can access the window on rank 1 before rank 2 or others
> >>> call MPI_WIN_FENCE. (But rank 0 must wait rank 1's MPI_WIN_FENCE.)
> >>> I think this is the intent of the sentence in the MPI standard
> >>> cited above.
> >>>
> >>> Thanks,
> >>> Takahiro Kawashima
> >>>
> >>>> Hi Rolf,
> >>>>
> >>>> yes, same issue ...
> >>>>
> >>>> i attached a patch to the github issue ( the issue might be in the test).
> >>>>
> >>>>From the standards (11.5 Synchronization Calls) :
> >>>> "TheMPI_WIN_FENCE collective synchronization call supports a simple
> >>>> synchroniza-
> >>>> tion pattern that is often used in parallel computations: namely a
> >>>> loosely-synchronous
> >>>> model, where global computation phases alternate with global
> >>>> communication phases."
> >>>>
> >>>> as far as i understand (disclaimer, i am *not* good at reading standards
> >>>> ...) this is not
> >>>> necessarily an MPI_Barrier, so there is a race condition in the test
> >>>> case that can be avoided
> >>>> by adding an MPI_Barrier after initializing RecvBuff.
> >>>>
> >>>> could someone (Jeff ? George ?) please double check this before i push a
> >>>> fix 

Re: [OMPI devel] c_accumulate

2015-04-20 Thread Kawashima, Takahiro
Hi Gilles, Nathan,

No, my conclusion is that the MPI program does not need an MPI_Barrier,
but MPI implementations need some synchronization.

Thanks,
Takahiro Kawashima,

> Kawashima-san,
> 
> Nathan reached the same conclusion (see the github issue) and i fixed 
> the test
> by manually adding a MPI_Barrier.
> 
> Cheers,
> 
> Gilles
> 
> On 4/21/2015 10:20 AM, Kawashima, Takahiro wrote:
> > Hi Gilles, Nathan,
> >
> > I read the MPI standard but I think the standard doesn't
> > require a barrier in the test program.
> >
> > >From the standards (11.5.1 Fence) :
> >
> >  A fence call usually entails a barrier synchronization:
> >a process completes a call to MPI_WIN_FENCE only after all
> >other processes in the group entered their matching call.
> >However, a call to MPI_WIN_FENCE that is known not to end
> >any epoch (in particular, a call with assert equal to
> >MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.
> >
> > This sentence is misleading.
> >
> > In the non-MPI_MODE_NOPRECEDE case, a barrier is necessary
> > in the MPI implementation to end access/exposure epochs.
> >
> > In the MPI_MODE_NOPRECEDE case, a barrier is not necessary
> > in the MPI implementation to end access/exposure epochs.
> > Also, a *global* barrier is not necessary in the MPI
> > implementation to start access/exposure epochs. But some
> > synchronizations are still needed to start an exposure epoch.
> >
> > For example, let's assume all ranks call MPI_WIN_FENCE(MPI_MODE_NOPRECEDE)
> > and then rank 0 calls MPI_PUT to rank 1. In this case, rank 0
> > can access the window on rank 1 before rank 2 or others
> > call MPI_WIN_FENCE. (But rank 0 must wait rank 1's MPI_WIN_FENCE.)
> > I think this is the intent of the sentence in the MPI standard
> > cited above.
> >
> > Thanks,
> > Takahiro Kawashima
> >
> >> Hi Rolf,
> >>
> >> yes, same issue ...
> >>
> >> i attached a patch to the github issue ( the issue might be in the test).
> >>
> >>   From the standards (11.5 Synchronization Calls) :
> >> "TheMPI_WIN_FENCE collective synchronization call supports a simple
> >> synchroniza-
> >> tion pattern that is often used in parallel computations: namely a
> >> loosely-synchronous
> >> model, where global computation phases alternate with global
> >> communication phases."
> >>
> >> as far as i understand (disclaimer, i am *not* good at reading standards
> >> ...) this is not
> >> necessarily an MPI_Barrier, so there is a race condition in the test
> >> case that can be avoided
> >> by adding an MPI_Barrier after initializing RecvBuff.
> >>
> >> could someone (Jeff ? George ?) please double check this before i push a
> >> fix into ompi-tests repo ?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 4/20/2015 10:19 PM, Rolf vandeVaart wrote:
> >>> Hi Gilles:
> >>>
> >>> Is your failure similar to this ticket?
> >>>
> >>> https://github.com/open-mpi/ompi/issues/393
> >>>
> >>> Rolf
> >>>
> >>> *From:*devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Gilles
> >>> Gouaillardet
> >>> *Sent:* Monday, April 20, 2015 9:12 AM
> >>> *To:* Open MPI Developers
> >>> *Subject:* [OMPI devel] c_accumulate
> >>>
> >>> Folks,
> >>>
> >>> i (sometimes) get some failure with the c_accumulate test from the ibm
> >>> test suite on one host with 4 mpi tasks
> >>>
> >>> so far, i was only able to observe this on linux/sparc with the vader btl
> >>>
> >>> here is a snippet of the test :
> >>>
> >>> MPI_Win_create(&RecvBuff, sizeOfInt, 1, MPI_INFO_NULL,
> >>>   MPI_COMM_WORLD, &Win);
> >>>
> >>> SendBuff = rank + 100;
> >>> RecvBuff = 0;
> >>>
> >>> /* Accumulate to everyone, just for the heck of it */
> >>>
> >>> MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
> >>> for (i = 0; i < size; ++i)
> >>>   MPI_Accumulate(&SendBuff, 1, MPI_INT, i, 0, 1, MPI_INT, MPI_SUM, Win);
> >>> MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOSUCCEED), Win);
> >>>
> >>> when the test fails, RecvBuff in (rank+100) instead of the accumulated
> >>> value (100 * nprocs + (nprocs -1)*nprocs/2
> >>>
> >>> i am not familiar with onesided operations nor MPI_Win_fence.
> >>>
> >>> that being said, i found suspicious RecvBuff is initialized *after*
> >>> MPI_Win_create ...
> >>>
> >>> does MPI_Win_fence implies MPI_Barrier ?
> >>>
> >>> if not, i guess RecvBuff should be initialized *before* MPI_Win_create.
> >>>
> >>> makes sense ?
> >>>
> >>> (and if it does make sense, then this issue is not related to sparc,
> >>> and vader is not the root cause)


Re: [OMPI devel] c_accumulate

2015-04-20 Thread Kawashima, Takahiro
Hi Gilles, Nathan,

I read the MPI standard but I think the standard doesn't
require a barrier in the test program.

From the standards (11.5.1 Fence):

A fence call usually entails a barrier synchronization:
  a process completes a call to MPI_WIN_FENCE only after all
  other processes in the group entered their matching call.
  However, a call to MPI_WIN_FENCE that is known not to end
  any epoch (in particular, a call with assert equal to
  MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

This sentence is misleading.

In the non-MPI_MODE_NOPRECEDE case, a barrier is necessary
in the MPI implementation to end access/exposure epochs.

In the MPI_MODE_NOPRECEDE case, a barrier is not necessary
in the MPI implementation to end access/exposure epochs.
Also, a *global* barrier is not necessary in the MPI
implementation to start access/exposure epochs. But some
synchronizations are still needed to start an exposure epoch.

For example, let's assume all ranks call MPI_WIN_FENCE(MPI_MODE_NOPRECEDE)
and then rank 0 calls MPI_PUT to rank 1. In this case, rank 0
can access the window on rank 1 before rank 2 or others
call MPI_WIN_FENCE. (But rank 0 must wait rank 1's MPI_WIN_FENCE.)
I think this is the intent of the sentence in the MPI standard
cited above.
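
For reference, Gilles' workaround amounts to the following sketch of the
test (reconstructed from the snippet quoted below, with the extra barrier
added after initializing RecvBuff; it is not the actual ibm test source):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int SendBuff, RecvBuff, size, rank, i;
        int sizeOfInt = (int)sizeof(int);
        MPI_Win Win;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(&RecvBuff, sizeOfInt, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &Win);

        SendBuff = rank + 100;
        RecvBuff = 0;

        /* Gilles' workaround: make sure every process has initialized
         * RecvBuff before any origin may start accumulating after the
         * first fence. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
        for (i = 0; i < size; ++i)
            MPI_Accumulate(&SendBuff, 1, MPI_INT, i, 0, 1, MPI_INT,
                           MPI_SUM, Win);
        MPI_Win_fence(MPI_MODE_NOPUT | MPI_MODE_NOSUCCEED, Win);

        printf("%d: RecvBuff = %d\n", rank, RecvBuff);

        MPI_Win_free(&Win);
        MPI_Finalize();
        return 0;
    }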

Thanks,
Takahiro Kawashima

> Hi Rolf,
> 
> yes, same issue ...
> 
> i attached a patch to the github issue ( the issue might be in the test).
> 
>  From the standards (11.5 Synchronization Calls) :
> "TheMPI_WIN_FENCE collective synchronization call supports a simple 
> synchroniza-
> tion pattern that is often used in parallel computations: namely a 
> loosely-synchronous
> model, where global computation phases alternate with global 
> communication phases."
> 
> as far as i understand (disclaimer, i am *not* good at reading standards 
> ...) this is not
> necessarily an MPI_Barrier, so there is a race condition in the test 
> case that can be avoided
> by adding an MPI_Barrier after initializing RecvBuff.
> 
> could someone (Jeff ? George ?) please double check this before i push a 
> fix into ompi-tests repo ?
> 
> Cheers,
> 
> Gilles
> 
> On 4/20/2015 10:19 PM, Rolf vandeVaart wrote:
> >
> > Hi Gilles:
> >
> > Is your failure similar to this ticket?
> >
> > https://github.com/open-mpi/ompi/issues/393
> >
> > Rolf
> >
> > *From:*devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Gilles 
> > Gouaillardet
> > *Sent:* Monday, April 20, 2015 9:12 AM
> > *To:* Open MPI Developers
> > *Subject:* [OMPI devel] c_accumulate
> >
> > Folks,
> >
> > i (sometimes) get some failure with the c_accumulate test from the ibm 
> > test suite on one host with 4 mpi tasks
> >
> > so far, i was only able to observe this on linux/sparc with the vader btl
> >
> > here is a snippet of the test :
> >
> > MPI_Win_create(&RecvBuff, sizeOfInt, 1, MPI_INFO_NULL,
> >  MPI_COMM_WORLD, &Win);
> >   
> >SendBuff = rank + 100;
> >RecvBuff = 0;
> >   
> >/* Accumulate to everyone, just for the heck of it */
> >   
> >MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
> >for (i = 0; i < size; ++i)
> >  MPI_Accumulate(&SendBuff, 1, MPI_INT, i, 0, 1, MPI_INT, MPI_SUM, Win);
> >MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOSUCCEED), Win);
> >
> > when the test fails, RecvBuff in (rank+100) instead of the accumulated 
> > value (100 * nprocs + (nprocs -1)*nprocs/2
> >
> > i am not familiar with onesided operations nor MPI_Win_fence.
> >
> > that being said, i found suspicious RecvBuff is initialized *after* 
> > MPI_Win_create ...
> >
> > does MPI_Win_fence implies MPI_Barrier ?
> >
> > if not, i guess RecvBuff should be initialized *before* MPI_Win_create.
> >
> > makes sense ?
> >
> > (and if it does make sense, then this issue is not related to sparc, 
> > and vader is not the root cause)
> >
> > Cheers,
> >
> > Gilles


Re: [OMPI devel] Opal atomics question

2015-03-26 Thread Kawashima, Takahiro
Yes, Fujitsu MPI is running on a sparcv9-compatible CPU.
Though we currently use only the stable series (v1.6, v1.8),
they work fine.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Nathan,
> 
> Fujitsu MPI is openmpi based and is running on their sparcv9 like proc.
> 
> Cheers,
> 
> Gilles
> 
> On Friday, March 27, 2015, Nathan Hjelm  wrote:
> 
> >
> > As a follow-on. How many of our supported architectures should we
> > continue to support. The current supported list is:
> >
> > alpha
> > amd64*
> > arm*
> > ia32*
> > ia64
> > mips
> > osx*
> > powerpc*
> > sparcv9
> > sync_builtin*
> >
> > * - known to be in-use.
> >
> > Additionally, should we continue to support the atomics in opal/asm?
> > Some of those are known to be wrong and most compilers support in-line
> > assembly.
> >
> > -Nathan
> >
> > On Thu, Mar 26, 2015 at 09:22:39AM -0600, Nathan Hjelm wrote:
> > >
> > > I am working on cleaning up the atomics in opal and I noticed something
> > > odd. We define opal_atomic_sub_32 and opal_atomic_sub_64 yet only use
> > > opal_atomic_sub_32 once:
> > >
> > > ./opal/runtime/opal_progress.c:val =
> > opal_atomic_sub_32(_event_users, 1);
> > >
> > > This could easily be changed to:
> > >
> > > val = opal_atomic_add_32(_event_users, -1);
> > >
> > > And then we could remove all both opal_atomic_sub_32 and
> > > opal_atomic_sub_64. Is there a reason to leave these functions in opal?
> > >
> > >
> > > -Nathan


Re: [OMPI devel] Problem on MPI_Type_create_resized and multiple BTL modules

2014-11-30 Thread Kawashima, Takahiro
Thanks!

> Takahiro,
> 
> Sorry for the delay in answering. Thanks for the bug report and the patch.
> I applied you patch, and added some tougher tests to make sure we catch
> similar issues in the future.
> 
> Thanks,
>   George.
> 
> 
> On Mon, Sep 29, 2014 at 8:56 PM, Kawashima, Takahiro <
> t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi George,
> >
> > Thank you for attending the meeting at Kyoto. As we talked
> > at the meeting, my colleague suffers from a datatype problem.
> >
> > See attached create_resized.c. It creates a datatype with an
> > LB marker using MPI_Type_create_struct and MPI_Type_create_resized.
> >
> > Expected contents of the output file (received_data) is:
> > 
> > 0: t1 = 0.1, t2 = 0.2
> > 1: t1 = 1.1, t2 = 1.2
> > 2: t1 = 2.1, t2 = 2.2
> > 3: t1 = 3.1, t2 = 3.2
> > 4: t1 = 4.1, t2 = 4.2
> > ... snip ...
> > 1995: t1 = 1995.1, t2 = 1995.2
> > 1996: t1 = 1996.1, t2 = 1996.2
> > 1997: t1 = 1997.1, t2 = 1997.2
> > 1998: t1 = 1998.1, t2 = 1998.2
> > 1999: t1 = 1999.1, t2 = 1999.2
> > 
> >
> > But if you run the program many times with multiple BTL modules
> > and with their small eager_limit and small max_send_size,
> > you'll see on some run:
> > 
> > 0: t1 = 0.1, t2 = 0.2
> > 1: t1 = 1.1, t2 = 1.2
> > 2: t1 = 2.1, t2 = 2.2
> > 3: t1 = 3.1, t2 = 3.2
> > 4: t1 = 4.1, t2 = 4.2
> > ... snip ...
> > 470: t1 = 470.1, t2 = 470.2
> > 471: t1 = 471.1, t2 = 471.2
> > 472: t1 = 472.1, t2 = 472.2
> > 473: t1 = 473.1, t2 = 473.2
> > 474: t1 = 474.1, t2 = 0<-- broken!
> > 475: t1 = 0, t2 = 475.1
> > 476: t1 = 0, t2 = 476.1
> > 477: t1 = 0, t2 = 477.1
> > ... snip ...
> > 1995: t1 = 0, t2 = 1995.1
> > 1996: t1 = 0, t2 = 1996.1
> > 1997: t1 = 0, t2 = 1997.1
> > 1998: t1 = 0, t2 = 1998.1
> > 1999: t1 = 0, t2 = 1999.1
> > 
> >
> > The index of the array at which data start to break (474 in the
> > above case) may change on every run.
> > Same result appears on both trunk and v1.8.3.
> >
> > You can reproduce this with the following options if you have
> > multiple IB HCAs.
> >
> >   -n 2
> >   --mca btl self,openib
> >   --mca btl_openib_eager_limit 256
> >   --mca btl_openib_max_send_size 384
> >
> > Or if you don't have multiple NICs, with the following options.
> >
> >   -n 2
> >   --host localhost
> >   --mca btl self,sm,vader
> >   --mca btl_vader_exclusivity 65536
> >   --mca btl_vader_eager_limit 256
> >   --mca btl_vader_max_send_size 384
> >   --mca btl_sm_exclusivity 65536
> >   --mca btl_sm_eager_limit 256
> >   --mca btl_sm_max_send_size 384
> >
> > My colleague found that OPAL convertor on the receiving process
> > seems to add the LB value twice for out-of-order arrival of
> > fragments when computing the receive buffer write-offset.
> >
> > He created the patch below. Our program works fine with
> > this patch but we don't know this is a correct fix.
> > Could you see this issue?
> >
> > Index: opal/datatype/opal_convertor.c
> > ===
> > --- opal/datatype/opal_convertor.c  (revision 32807)
> > +++ opal/datatype/opal_convertor.c  (working copy)
> > @@ -362,11 +362,11 @@
> >  if( OPAL_LIKELY(0 == count) ) {
> >  pStack[1].type = pElems->elem.common.type;
> >  pStack[1].count= pElems->elem.count;
> > -pStack[1].disp = pElems->elem.disp;
> > +pStack[1].disp = 0;
> >  } else {
> >  pStack[1].type  = OPAL_DATATYPE_UINT1;
> >  pStack[1].count = pData->size - count;
> > -pStack[1].disp  = pData->true_lb + count;
> > +pStack[1].disp  = count;
> >  }
> >  pStack[1].index= 0;  /* useless */


[OMPI devel] Problem on MPI_Type_create_resized and multiple BTL modules

2014-09-29 Thread Kawashima, Takahiro
Hi George,

Thank you for attending the meeting at Kyoto. As we discussed
at the meeting, my colleague has run into a datatype problem.

See attached create_resized.c. It creates a datatype with an
LB marker using MPI_Type_create_struct and MPI_Type_create_resized.

Expected contents of the output file (received_data) is:

0: t1 = 0.1, t2 = 0.2
1: t1 = 1.1, t2 = 1.2
2: t1 = 2.1, t2 = 2.2
3: t1 = 3.1, t2 = 3.2
4: t1 = 4.1, t2 = 4.2
... snip ...
1995: t1 = 1995.1, t2 = 1995.2
1996: t1 = 1996.1, t2 = 1996.2
1997: t1 = 1997.1, t2 = 1997.2
1998: t1 = 1998.1, t2 = 1998.2
1999: t1 = 1999.1, t2 = 1999.2


But if you run the program many times with multiple BTL modules
and with their small eager_limit and small max_send_size,
you'll see on some run:

0: t1 = 0.1, t2 = 0.2
1: t1 = 1.1, t2 = 1.2
2: t1 = 2.1, t2 = 2.2
3: t1 = 3.1, t2 = 3.2
4: t1 = 4.1, t2 = 4.2
... snip ...
470: t1 = 470.1, t2 = 470.2
471: t1 = 471.1, t2 = 471.2
472: t1 = 472.1, t2 = 472.2
473: t1 = 473.1, t2 = 473.2
474: t1 = 474.1, t2 = 0<-- broken!
475: t1 = 0, t2 = 475.1
476: t1 = 0, t2 = 476.1
477: t1 = 0, t2 = 477.1
... snip ...
1995: t1 = 0, t2 = 1995.1
1996: t1 = 0, t2 = 1996.1
1997: t1 = 0, t2 = 1997.1
1998: t1 = 0, t2 = 1998.1
1999: t1 = 0, t2 = 1999.1


The index of the array at which the data start to break (474 in the
above case) may change on every run.
The same result appears on both trunk and v1.8.3.

You can reproduce this with the following options if you have
multiple IB HCAs.

  -n 2
  --mca btl self,openib
  --mca btl_openib_eager_limit 256
  --mca btl_openib_max_send_size 384

Or if you don't have multiple NICs, with the following options.

  -n 2
  --host localhost
  --mca btl self,sm,vader
  --mca btl_vader_exclusivity 65536
  --mca btl_vader_eager_limit 256
  --mca btl_vader_max_send_size 384
  --mca btl_sm_exclusivity 65536
  --mca btl_sm_eager_limit 256
  --mca btl_sm_max_send_size 384
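
Put together on one command line (the binary name ./create_resized is
just an assumption for the compiled attachment):

  mpiexec -n 2 --host localhost --mca btl self,sm,vader \
      --mca btl_vader_exclusivity 65536 --mca btl_vader_eager_limit 256 \
      --mca btl_vader_max_send_size 384 --mca btl_sm_exclusivity 65536 \
      --mca btl_sm_eager_limit 256 --mca btl_sm_max_send_size 384 \
      ./create_resized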

My colleague found that the OPAL convertor on the receiving process
seems to add the LB value twice for out-of-order arrival of
fragments when computing the receive buffer write-offset.

He created the patch below. Our program works fine with
this patch, but we don't know whether this is the correct fix.
Could you look into this issue?

Index: opal/datatype/opal_convertor.c
===
--- opal/datatype/opal_convertor.c  (revision 32807)
+++ opal/datatype/opal_convertor.c  (working copy)
@@ -362,11 +362,11 @@
 if( OPAL_LIKELY(0 == count) ) {
 pStack[1].type = pElems->elem.common.type;
 pStack[1].count= pElems->elem.count;
-pStack[1].disp = pElems->elem.disp;
+pStack[1].disp = 0;
 } else {
 pStack[1].type  = OPAL_DATATYPE_UINT1;
 pStack[1].count = pData->size - count;
-pStack[1].disp  = pData->true_lb + count;
+pStack[1].disp  = count;
 }
 pStack[1].index= 0;  /* useless */


Best regards,
Takahiro Kawashima,
MPI development team,
Fujitsu
/* np=2 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

struct structure {
double not_transfered;
double transfered_1;
double transfered_2;
};

int main(int argc, char *argv[])
{
int i, n = 2000, myrank;
struct structure *data;
MPI_Datatype struct_type, temp_type;
MPI_Datatype types[2] = {MPI_DOUBLE, MPI_DOUBLE};
int blocklens[2] = {1, 1};
MPI_Aint disps[3];

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

data = malloc(sizeof(data[0]) * n);

if (myrank == 0) {
for (i = 0; i < n; i++) {
data[i].transfered_1 = i + 0.1;
data[i].transfered_2 = i + 0.2;
}
}

MPI_Get_address(&data[0].transfered_1, &disps[0]);
MPI_Get_address(&data[0].transfered_2, &disps[1]);
MPI_Get_address(&data[0], &disps[2]);
disps[1] -= disps[2]; /* 16 */
disps[0] -= disps[2]; /*  8 */
MPI_Type_create_struct(2, blocklens, disps, types, &temp_type);
MPI_Type_create_resized(temp_type, 0, sizeof(data[0]), &struct_type);
MPI_Type_commit(&struct_type);

if (myrank == 0) {
MPI_Send(data, n, struct_type, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1) {
MPI_Recv(data, n, struct_type, 0, 0, MPI_COMM_WORLD,
 MPI_STATUS_IGNORE);
}

MPI_Type_free(&temp_type);
MPI_Type_free(&struct_type);

if (myrank == 1) {
FILE *fp;
fp = fopen("received_data", "w");
for (i = 0; i < n; i++) {
fprintf(fp, "%d: t1 = %g, t2 = %g\n",
i, data[i].transfered_1, data[i].transfered_2);
}
fclose(fp);
}

free(data);
MPI_Finalize();

return 0;
}
Index: opal/datatype/opal_convertor.c
===
--- opal/datatype/opal_convertor.c	(revision 32807)
+++ opal/datatype/opal_convertor.c	(working copy)
@@ -362,11 +362,11 @@
 if( OPAL_LIKELY(0 == count) ) {
 

Re: [OMPI devel] 1.8.3rc2 available

2014-09-26 Thread Kawashima, Takahiro
just FYI:
configure && make && make install && make test
succeeded on my SPARC64/Linux/GCC (both enable-debug=yes and no).

Takahiro Kawashima,
MPI development team,
Fujitsu

> Usual place:
> 
> http://www.open-mpi.org/software/ompi/v1.8/
> 
> Please beat it up as we want to release on Fri, barring discovery of a blocker
> Ralph


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2 and gcc-4.9.0

2014-09-02 Thread Kawashima, Takahiro
Hi Siegmar, Ralph,

I forgot to follow up on the previous report, sorry.
The patch I suggested is not included in Open MPI 1.8.2.
The backtrace Siegmar reported points to the problem that I fixed
in the patch.

  http://www.open-mpi.org/community/lists/users/2014/08/24968.php

Siegmar:
Could you try my patch again?

Ralph (or another committer):
Open MPI 1.8 needs the custom patch that I posted. See my previous mail.
Could you review it and commit it to the v1.8 branch?

Regards,
Takahiro

> Hi,
> 
> yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc
> (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64
> (linpc0)) with gcc-4.9.0. A small program works on some machines,
> but breaks with a bus error on Solaris 10 Sparc.
> 
> 
> tyr small_prog 118 which mpicc
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
> tyr small_prog 119 ompi_info | grep MPI:
> Open MPI: 1.8.2
> tyr small_prog 120 mpiexec -np 1 --host linpc0 init_finalize
> Hello!
> tyr small_prog 121 mpiexec -np 1 --host sunpc0 init_finalize
> Hello!
> tyr small_prog 122 mpiexec -np 1 --host tyr init_finalize
> [tyr:28081] *** Process received signal ***
> [tyr:28081] Signal: Bus Error (10)
> [tyr:28081] Signal code: Invalid address alignment (1)
> [tyr:28081] Failing at address: 7fffd304
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> [tyr:28081] *** End of error message ***
> --
> mpiexec noticed that process rank 0 with PID 28081 on node tyr exited on 
> signal 10 (Bus Error).
> --
> tyr small_prog 123 
> 
> 
> 
> gdb shows the following backtrace.
> 
> tyr small_prog 123 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec  
> GNU gdb (GDB) 7.6.1
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "sparc-sun-solaris2.10".
> For bug reporting instructions, please see:
> ...
> Reading symbols from 
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done.
> (gdb) run -np 1 --host tyr init_finalize
> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 --host 
> tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP2]
> [tyr:28099] *** Process received signal ***
> [tyr:28099] Signal: Bus Error (10)
> [tyr:28099] Signal code: Invalid address alignment (1)
> [tyr:28099] Failing at address: 7fffd244
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> 

Re: [OMPI devel] bus error with openmpi-1.8.2rc4r32485 and gcc-4.9.0

2014-08-11 Thread Kawashima, Takahiro
Hi Ralph,

Your commit r32459 fixed the bus error by correcting
opal/dss/dss_copy.c. It's OK for trunk because mca_dstore_hash
calls dss to copy data. But it's insufficient for v1.8 because
mca_db_hash doesn't call dss and copies data itself.

The attached patch is the minimum patch to fix it in v1.8.
My fix doesn't call dss but uses memcpy. I have confirmed it on
SPARC64/Linux.

Sorry to respond so late.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Siegmar, Ralph,
> 
> I'm sorry to respond so late since last week.
> 
> Ralph fixed the problem in r32459 and it was merged to v1.8
> in r32474. But in v1.8 an additional custom patch is needed
> because the db/dstore source codes are different between trunk
> and v1.8.
> 
> I'm preparing and testing the custom patch right now.
> Please wait a moment.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Hi,
> > 
> > thank you very much to everybody who tried to solve my bus
> > error problem on Solaris 10 Sparc. I thought that you found
> > and fixed it, so that I installed openmpi-1.8.2rc4r32485 on
> > my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> > openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> > program works on my x86_64 architectures, but still breaks
> > with a bus error on my Sparc system.
> > 
> > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > Hello!
> > linpc1 fd1026 106 exit
> > logout
> > tyr small_prog 113 ssh sunpc1
> > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > Hello!
> > sunpc1 fd1026 102 exit
> > logout
> > tyr small_prog 114 mpiexec -np 1 init_finalize
> > [tyr:21109] *** Process received signal ***
> > [tyr:21109] Signal: Bus Error (10)
> > ...
> > 
> > 
> > gdb shows the following backtrace.
> > 
> > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> > /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > GNU gdb (GDB) 7.6.1
> > ...
> > (gdb) run -np 1 init_finalize
> > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 
> > init_finalize
> > [Thread debugging using libthread_db enabled]
> > [New Thread 1 (LWP 1)]
> > [New LWP2]
> > [tyr:21158] *** Process received signal ***
> > [tyr:21158] Signal: Bus Error (10)
> > [tyr:21158] Signal code: Invalid address alignment (1)
> > [tyr:21158] Failing at address: 7fffd224
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > /lib/sparcv9/libc.so.1:0xd8b98
> > /lib/sparcv9/libc.so.1:0xcc70c
> > /lib/sparcv9/libc.so.1:0xcc918
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> >  [ Signal 10 (BUS)]
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > [tyr:21158] *** End of error message ***
> > --
> > mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on 
> > signal 10 (Bus Error).
> > --
> > [LWP2 exited]
> > [New Thread 2]
> > [Switching to Thread 1 (LWP 1)]
> > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> > satisfy query
> > (gdb) bt
> > #0  0x7f6173d0 in rtld_db_dlactivity () from 
> > /usr/lib/sparcv9/ld.so.1
> > #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > #8  0x7ec7748c in vm_close () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #9  0x7ec74a6c in lt_dlclose () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #10 0x7ec99b90 in ri_destructor (obj=0x1001ead30)
> > at 
> > 

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc4r32485 and gcc-4.9.0

2014-08-11 Thread Kawashima, Takahiro
Siegmar, Ralph,

I'm sorry to respond so late since last week.

Ralph fixed the problem in r32459 and it was merged to v1.8
in r32474. But in v1.8 an additional custom patch is needed
because the db/dstore source codes are different between trunk
and v1.8.

I'm preparing and testing the custom patch right now.
Please wait a moment.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi,
> 
> thank you very much to everybody who tried to solve my bus
> error problem on Solaris 10 Sparc. I thought that you found
> and fixed it, so that I installed openmpi-1.8.2rc4r32485 on
> my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> program works on my x86_64 architectures, but still breaks
> with a bus error on my Sparc system.
> 
> linpc1 fd1026 106 mpiexec -np 1 init_finalize
> Hello!
> linpc1 fd1026 106 exit
> logout
> tyr small_prog 113 ssh sunpc1
> sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> Hello!
> sunpc1 fd1026 102 exit
> logout
> tyr small_prog 114 mpiexec -np 1 init_finalize
> [tyr:21109] *** Process received signal ***
> [tyr:21109] Signal: Bus Error (10)
> ...
> 
> 
> gdb shows the following backtrace.
> 
> tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 1 init_finalize
> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 
> init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP2]
> [tyr:21158] *** Process received signal ***
> [tyr:21158] Signal: Bus Error (10)
> [tyr:21158] Signal code: Invalid address alignment (1)
> [tyr:21158] Failing at address: 7fffd224
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> [tyr:21158] *** End of error message ***
> --
> mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on 
> signal 10 (Bus Error).
> --
> [LWP2 exited]
> [New Thread 2]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> satisfy query
> (gdb) bt
> #0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> #8  0x7ec7748c in vm_close () from 
> /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #9  0x7ec74a6c in lt_dlclose () from 
> /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #10 0x7ec99b90 in ri_destructor (obj=0x1001ead30)
> at 
> ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> #11 0x7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
> at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> #12 0x7ec9940c in mca_base_component_repository_release (
> component=0x7b023df0 )
> at 
> ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> #13 0x7ec9b754 in mca_base_component_unload (
> component=0x7b023df0 , output_id=-1)
> at 
> ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> #14 0x7ec9b7e8 in mca_base_component_close (
> component=0x7b023df0 , 

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles,

I applied your patch to v1.8 and it ran successfully
on my SPARC machines.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Kawashima-san and all,
> 
> Here is attached a one off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
> 
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
> 
> the same issue might also be in other parts of the code :-(
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> > Gilles, George,
> >
> > The problem is the one Gilles pointed out.
> > I temporarily modified the code below and the bus error disappeared.
> >
> > --- orte/util/nidmap.c  (revision 32447)
> > +++ orte/util/nidmap.c  (working copy)
> > @@ -885,7 +885,7 @@
> >  orte_proc_state_t state;
> >  orte_app_idx_t app_idx;
> >  int32_t restarts;
> > -orte_process_name_t proc, dmn;
> > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
> >  char *hostname;
> >  uint8_t flag;
> >  opal_buffer_t *bptr;
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Kawashima-san,
> >>
> >> This is interesting :-)
> >>
> >> proc is in the stack and has type orte_process_name_t
> >>
> >> with
> >>
> >> typedef uint32_t orte_jobid_t;
> >> typedef uint32_t orte_vpid_t;
> >> struct orte_process_name_t {
> >> orte_jobid_t jobid; /**< Job number */
> >> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> >> };
> >> typedef struct orte_process_name_t orte_process_name_t;
> >>
> >>
> >> so there is really no reason to align this on 8 bytes...
> >> but later, proc is casted into an uint64_t ...
> >> so proc should have been aligned on 8 bytes but it is too late,
> >> and hence the glory SIGBUS
> >>
> >>
> >> this is loosely related to
> >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> >> (see heterogeneous.v2.patch)
> >> if we make opal_process_name_t an union of uint64_t and a struct of two
> >> uint32_t, the compiler
> >> will align this on 8 bytes.
> >> note the patch is not enough (and will not apply on the v1.8 branch 
> >> anyway),
> >> we could simply remove orte_process_name_t and ompi_process_name_t and
> >> use only
> >> opal_process_name_t (and never declare variables with type
> >> opal_proc_name_t otherwise alignment might be incorrect)
> >>
> >> as a workaround, you can declare an opal_process_name_t (for alignment),
> >> and cast it to an orte_process_name_t
> >>
> >> i will write a patch (i will not be able to test on sparc ...)
> >> please note this issue might be present in other places
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> >>> Hi,
> >>>
> >>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> >>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
> >>> I've finally reproduced the bus error in my SPARC environment.
> >>>
> >>> #0 0x00db4740 (__waitpid_nocancel + 0x44) 
> >>> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct 
> >>> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 
> >>> in ../sigattach.c 
> >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
> >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> >>> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 
> >>> 252 in db_hash.c
> >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
> >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> >>> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at 
> >>> line 49 in db_base_fns.c
> >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
> >>> 0x00281d70) at line 975 in nidmap.c
> >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
> >>> opal_buffer_t *) 0x00241fc0)

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles, George,

The problem is the one Gilles pointed out.
I temporarily modified the code below and the bus error disappeared.

--- orte/util/nidmap.c  (revision 32447)
+++ orte/util/nidmap.c  (working copy)
@@ -885,7 +885,7 @@
 orte_proc_state_t state;
 orte_app_idx_t app_idx;
 int32_t restarts;
-orte_process_name_t proc, dmn;
+orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
 char *hostname;
 uint8_t flag;
 opal_buffer_t *bptr;
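
For reference, the hazard itself can be shown without Open MPI at all;
here is a minimal standalone sketch (the struct name two_u32 is made up,
only its shape matches orte_process_name_t) of why the 8-byte alignment
matters on SPARC:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* same shape as orte_process_name_t: two uint32_t, 4-byte alignment */
    struct two_u32 {
        uint32_t jobid;
        uint32_t vpid;
    };

    int main(void)
    {
        struct two_u32 proc = { 1, 2 };  /* may sit on a 4-byte boundary */
        uint64_t v;

        /* v = *(uint64_t *)&proc;        <- may raise SIGBUS on SPARC   */

        /* Safe: force 8-byte alignment (as the patch above does) or copy
         * byte-wise instead of doing a single 8-byte load:              */
        memcpy(&v, &proc, sizeof(v));
        printf("%llu\n", (unsigned long long)v);
        return 0;
    }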

Takahiro Kawashima,
MPI development team,
Fujitsu

> Kawashima-san,
> 
> This is interesting :-)
> 
> proc is in the stack and has type orte_process_name_t
> 
> with
> 
> typedef uint32_t orte_jobid_t;
> typedef uint32_t orte_vpid_t;
> struct orte_process_name_t {
> orte_jobid_t jobid; /**< Job number */
> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> };
> typedef struct orte_process_name_t orte_process_name_t;
> 
> 
> so there is really no reason to align this on 8 bytes...
> but later, proc is casted into an uint64_t ...
> so proc should have been aligned on 8 bytes but it is too late,
> and hence the glory SIGBUS
> 
> 
> this is loosely related to
> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> (see heterogeneous.v2.patch)
> if we make opal_process_name_t an union of uint64_t and a struct of two
> uint32_t, the compiler
> will align this on 8 bytes.
> note the patch is not enough (and will not apply on the v1.8 branch anyway),
> we could simply remove orte_process_name_t and ompi_process_name_t and
> use only
> opal_process_name_t (and never declare variables with type
> opal_proc_name_t otherwise alignment might be incorrect)
> 
> as a workaround, you can declare an opal_process_name_t (for alignment),
> and cast it to an orte_process_name_t
> 
> i will write a patch (i will not be able to test on sparc ...)
> please note this issue might be present in other places
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> > Hi,
> >
> >>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> >>>>>> 10 Sparc and I receive a bus error, if I run a small program.
> > I've finally reproduced the bus error in my SPARC environment.
> >
> > #0 0x00db4740 (__waitpid_nocancel + 0x44) 
> > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct 
> > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in 
> > ../sigattach.c 
> > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 
> > 252 in db_hash.c
> > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 
> > 49 in db_base_fns.c
> > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
> > 0x00281d70) at line 975 in nidmap.c
> > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
> > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
> > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
> > 0x,pargv=(char ***) 0x,flags=32) at line 
> > 148 in orte_init.c
> > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
> > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 
> > 464 in ompi_mpi_init.c
> > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
> > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
> > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 
> > 0x07fef348) at line 8 in mpiinitfinalize.c
> > #11 0x00d2b81c (__libc_start_main + 0x194) 
> > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> > #12 0x0010094c (_start + 0x2c) ()
> >
> > The line 252 in opal/mca/db/hash/db_hash.c is:
> >
> > case OPAL_UINT64:
> > if (NULL == data) {
> > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
> > return OPAL_ERR_BAD_PARAM;
> > }
> > kv->type = OPAL_UINT64;
> > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> > break;
> >
> > My environment is:
> >
> >   Op

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi,

> > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> > >>> 10 Sparc and I receive a bus error, if I run a small program.

I've finally reproduced the bus error in my SPARC environment.

#0 0x00db4740 (__waitpid_nocancel + 0x44) 
(0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
#1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 
0x07fed100,p=(void *) 0x07fed100) at line 277 in ../sigattach.c 

#2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
"opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 
in db_hash.c
#3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
"opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 49 
in db_base_fns.c
#4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
0x00281d70) at line 975 in nidmap.c
#5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
#6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
#7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
0x,pargv=(char ***) 0x,flags=32) at line 148 in 
orte_init.c
#8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 464 
in ompi_mpi_init.c
#9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
#10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 0x07fef348) 
at line 8 in mpiinitfinalize.c
#11 0x00d2b81c (__libc_start_main + 0x194) 
(0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
#12 0x0010094c (_start + 0x2c) ()

The line 252 in opal/mca/db/hash/db_hash.c is:

case OPAL_UINT64:
if (NULL == data) {
OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
return OPAL_ERR_BAD_PARAM;
}
kv->type = OPAL_UINT64;
kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
break;

My environment is:

  Open MPI v1.8 branch r32447 (latest)
  configure --enable-debug
  SPARC-V9 (Fujitsu SPARC64 IXfx)
  Linux (custom)
  gcc 4.2.4

I could not reproduce it with Open MPI trunk or with the Fujitsu compiler.

Can this information help?

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi,
> 
> I'm sorry once more to answer late, but the last two days our mail
> server was down (hardware error).
> 
> > Did you configure this --enable-debug?
> 
> Yes, I used the following command.
> 
> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>   JAVA_HOME=/usr/local/jdk1.8.0 \
>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>   CC="gcc" CXX="g++" FC="gfortran" \
>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   CPPFLAGS="" CXXCPPFLAGS="" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-threads=posix \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-std=c11 -m64" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
> 
> 
> 
> > If so, you should get a line number in the backtrace
> 
> I got them for gdb (see below), but not for "dbx".
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> > 
> > 
> > On Aug 5, 2014, at 2:59 AM, Siegmar Gross 
>  wrote:
> > 
> > > Hi,
> > > 
> > > I'm sorry to answer so late, but last week I didn't have Internet
> > > access. In the meantime I've installed openmpi-1.8.2rc3 and I get
> > > the same error.
> > > 
> > >> This looks like the typical type of alignment error that we used
> > >> to see when testing regularly on SPARC.  :-\
> > >> 
> > >> It looks like the error was happening in mca_db_hash.so.  Could
> > >> you get a stack trace / file+line number where it was failing
> > >> in mca_db_hash?  (i.e., the actual bad code will likely be under
> > >> opal/mca/db/hash somewhere)
> > > 
> > > Unfortunately I don't get a file+line number from a file in
> > > opal/mca/db/Hash.
> > > 
> > > 
> > > 
> > > tyr small_prog 102 ompi_info | grep MPI:
> > >Open MPI: 1.8.2rc3
> > > tyr small_prog 103 which mpicc
> > > /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
> > > tyr small_prog 104 mpicc init_finalize.c 
> > > tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec 
> > > For information about new features see `help changes'
> > > To remove this message, put `dbxenv suppress_startup_message 

Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-01 Thread Kawashima, Takahiro
George,

I compiled trunk with your patch for SPARCV9/Linux/GCC.
I see the following warnings/errors.


In file included from opal/include/opal/sys/atomic.h:175,
 from opal/asm/asm.c:21:
opal/include/opal/sys/sparcv9/atomic.h:213:1: warning: 
"OPAL_HAVE_ATOMIC_SWAP_64" redefined
opal/include/opal/sys/sparcv9/atomic.h:47:1: warning: this is the location of 
the previous definition

In file included from opal/asm/asm.c:21:
opal/include/opal/sys/atomic.h:369: error: conflicting types for 
'opal_atomic_cmpset_acq_64'
opal/include/opal/sys/sparcv9/atomic.h:175: error: previous definition of 
'opal_atomic_cmpset_acq_64' was here
opal/include/opal/sys/atomic.h:375: error: conflicting types for 
'opal_atomic_cmpset_rel_64'
opal/include/opal/sys/sparcv9/atomic.h:187: error: previous definition of 
'opal_atomic_cmpset_rel_64' was here


The attached patch fixes these warnings/errors.

I ran the test programs only in the test/asm directory manually
because 'make check' doesn't run under my cross-compiling
environment. They all passed correctly.

P.S.
I cannot reply until next week if you need anything more from me
because it's COB in Japan now, sorry.

Takahiro Kawashima,
MPI development team,
Fujitsu

> In case someone else want to play with the new atomics here is the most
> up-to-date patch.
> 
>   George.
> 
> 
> 
> On Thu, Jul 31, 2014 at 10:26 PM, Paul Hargrove  wrote:
> 
> > George:
> >
> > Have a failure with your patch applied on PPC64/Linux and gcc-4.4.6:
> >
> > Making all in asm
> > make[2]: Entering directory
> > `/home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/BLD/opal/asm'
> >   CC   asm.lo
> > In file included from
> > /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/opal/asm/asm.c:21:0:
> > /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/opal/include/opal/sys/atomic.h:374:9:
> > error: conflicting types for 'opal_atomic_cmpset_rel_64'
> > /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/opal/include/opal/sys/powerpc/atomic.h:214:19:
> > note: previous definition of 'opal_atomic_cmpset_rel_64' was here
> > /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/opal/include/opal/sys/atomic.h:374:9:
> > warning: 'opal_atomic_cmpset_rel_64' used but never defined [enabled by
> > default]
> > make[2]: *** [asm.lo] Error 1
> >
> >
> > BTW: the patch applied cleanly to trunk except the portion
> > changing opal/include/opal/sys/osx/atomic.h, which does not exist.
> >
> > -Paul
> >
> >
> > On Thu, Jul 31, 2014 at 4:25 PM, George Bosilca 
> > wrote:
> >
> >> Awesome, thanks Paul. When the results will be in we will fix whatever is
> >> needed for these less common architectures.
> >>
> >>   George.
> >>
> >>
> >>
> >> On Thu, Jul 31, 2014 at 7:24 PM, Paul Hargrove 
> >> wrote:
> >>
> >>>
> >>>
> >>> On Thu, Jul 31, 2014 at 4:22 PM, Paul Hargrove 
> >>> wrote:
> >>>
> 
>  On Thu, Jul 31, 2014 at 4:13 PM, George Bosilca 
>  wrote:
> 
> > Paul, I know you have a pretty diverse range computers. Can you try to
> > compile and run a “make check” with the following patch?
> 
> 
>  I will see what I can do for ARMv7, MIPS, PPC and IA64 (or whatever
>  subset of those is still supported).
>  The ARM and MIPS system are emulators and take forever to build OMPI.
>  However, I am not even sure how soon I'll get to start this testing.
> 
> >>>
> >>>
> >>> Add SPARC (v8plus and v9) to that list.
--- opal/include/opal/sys/sparcv9/atomic.h.george	2014-08-01 17:33:25.874189000 +0900
+++ opal/include/opal/sys/sparcv9/atomic.h	2014-08-01 18:21:25.316102730 +0900
@@ -170,8 +170,8 @@
 
 #endif /* OPAL_ASSEMBLY_ARCH == OPAL_SPARCV9_64 */
 
-static inline int opal_atomic_cmpset_acq_64( volatile int64_t *addr,
- int64_t oldval, int64_t newval)
+static inline int64_t opal_atomic_cmpset_acq_64( volatile int64_t *addr,
+ int64_t oldval, int64_t newval)
 {
int rc;
 
@@ -182,8 +182,8 @@
 }
 
 
-static inline int opal_atomic_cmpset_rel_64( volatile int64_t *addr,
- int64_t oldval, int64_t newval)
+static inline int64_t opal_atomic_cmpset_rel_64( volatile int64_t *addr,
+ int64_t oldval, int64_t newval)
 {
opal_atomic_wmb();
return opal_atomic_cmpset_64(addr, oldval, newval);
@@ -210,6 +210,7 @@
 
 #if OPAL_ASSEMBLY_ARCH == OPAL_SPARCV9_64
 
+#undef OPAL_HAVE_ATOMIC_SWAP_64
 #define OPAL_HAVE_ATOMIC_SWAP_64 1
 
 static inline int64_t


Re: [OMPI devel] MPI_T SEGV on DSO

2014-07-29 Thread KAWASHIMA Takahiro
Nathan,

Thanks for your response.

Yes. My previous mail was already the result with that code uncommented.
Now I have also pulled the latest varList source code, which uncomments
the section you mentioned, but the result was the same.

If MPI_T_cvar_get_info should return MPI_T_ERR_INVALID_INDEX
for variables of unloaded components, then not returning
MPI_T_ERR_INVALID_INDEX is the problem.

I ran varList under GDB and found that MPI_T_cvar_get_info returns
MPI_T_ERR_INVALID_INDEX for shmem_sysv_priority (this is sane).
But it returns MPI_SUCCESS for shmem_sysv_major_version.
The difference is the mbv_flags value. mbv_flags is 0x44 for
shmem_sysv_priority at the MPI_T_cvar_get_info call, so the
mca_base_var_get function in opal/mca/base/mca_base_var.c
returns OPAL_ERR_NOT_FOUND. But mbv_flags is 0x10003 for
shmem_sysv_major_version, so mca_base_var_get
returns OPAL_SUCCESS.

Perhaps control variables for unloaded components are not
deregistered completely?

I can track it down further when I have time.
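
For what it's worth, the loop varList needs is essentially the following
(a minimal sketch, not the actual varList source; the point is that
MPI_T_ERR_INVALID_INDEX from MPI_T_cvar_get_info must be checked and the
variable skipped):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int provided, ncvar, i, err;

        MPI_Init(&argc, &argv);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_T_cvar_get_num(&ncvar);

        for (i = 0; i < ncvar; i++) {
            char name[256], desc[256];
            int namelen = sizeof(name), desclen = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dt;
            MPI_T_enum enumtype;

            err = MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dt,
                                      &enumtype, desc, &desclen, &bind,
                                      &scope);
            if (MPI_T_ERR_INVALID_INDEX == err)
                continue;      /* variable of an unloaded component */
            if (MPI_SUCCESS != err)
                continue;
            printf("%d: %s\n", i, name);
        }

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }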

My environment:
  OS: Debian GNU/Linux wheezy
  CPU: x86_64
  Run: mpiexec -n 1 varList
  Open MPI source: trunk r32338 (almost latest)
  Open MPI configure:
enable_picky=yes
enable_debug=yes
enable_mem_debug=yes
enable_mem_profile=yes
enable_memchecker=no

enable_mca_no_build=btl-elan,btl-gm,btl-mx,btl-ofud,btl-portals,btl-sctp,btl-template,btl-udapl,common-mx,common-portals,ess-alps,ess-cnos,ess-lsf,ess-portals_utcp,ess-singleton,ess-slurm,grpcomm-cnos,mpool-fake,mtl,notifier,plm-alps,plm-ccp,plm-lsf,plm-process,plm-slurm,plm-submit,plm-tm,plm-xgrid,pml-cm,pml-csum,pml-example,pml-v,ras
enable_contrib_no_build=vt
enable_mpi_cxx=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_ipv6=no
enable_mpi_io=no
with_devel_headers=no
with_wrapper_cflags=-g
with_wrapper_cxxflags=-g
with_wrapper_fflags=-g
with_wrapper_fcflags=-g

Regards,
KAWASHIMA Takahiro

> The problem is the code in question does not check the return code of
> MPI_T_cvar_handle_alloc . We are returning an error and they still try
> to use the handle (which is stale). Uncomment this section of the code:
> 
> 
> //if (MPI_T_ERR_INVALID_INDEX == err)// { NOTE TZI: This 
> variable is not recognized by Mvapich. It is OpenMPI specific.
> //  continue;
> 
> 
> Note that MPI_T_ERR_INVALID_INDEX is in the MPI-3 standard but mvapich
> must not have implemented it (and thus should not claim to be MPI 3.0).
> 
> -Nathan
> 
> On Wed, Jul 30, 2014 at 12:04:55AM +0900, KAWASHIMA Takahiro wrote:
> > Hi,
> > 
> > I encountered the same SEGV reported on the users list when
> > running varList program.
> > 
> >   http://www.open-mpi.org/community/lists/users/2014/07/24792.php
> > 
> > mpiexec -n 1 ./varList:
> > 
> > ... snip ...
> > event U/D-2 CHAR   n/a  ALL
> > event_base_verboseD/D-8 INTn/a  
> > LOCAL0
> > event_libevent2021_event_include  U/A-3 CHAR   n/a  
> > LOCALpoll
> > opal_event_includeU/A-3 CHAR   n/a  
> > LOCALpoll
> > event_libevent2021_major_version  D/A-9 INTn/a  
> > UNKNOWN  1
> > event_libevent2021_minor_version  D/A-9 INTn/a  
> > UNKNOWN  9
> > event_libevent2021_release_versionD/A-9 INTn/a  
> > UNKNOWN  0
> > shmem U/D-2 CHAR   n/a  ALL
> > shmem_base_verboseD/D-8 INTn/a  
> > LOCAL0
> > shmem_base_RUNTIME_QUERY_hint D/A-9 CHAR   n/a  
> > ALL-EQ
> > shmem_mmap_priority   U/A-3 INTn/a  ALL 
> >  50
> > shmem_mmap_enable_nfs_warning D/A-9 INTn/a  
> > LOCALtrue
> > shmem_mmap_relocate_backing_file  D/A-9 INTn/a  ALL 
> >  0
> > shmem_mmap_backing_file_base_dir  D/A-9 CHAR   n/a  ALL 
> >  /dev/shm
> > shmem_mmap_major_version  D/A-9 INTn/a  
> > UNKNOWN  1
> > shmem_mmap_minor_version  D/A-9 INTn/a  
> > UNKNOWN  9
> > shmem_mmap_release_versionD/A-9 INTn/a  
> > UNKNOWN  0
> > shmem_posix_major_version D/A-9 INTn/a  
> > UNKNOWN  1201644720
> > shmem_posix_minor_version D/A-9 INTn/a  
> > UNKNOWN  32756
> > shmem_posix_release_version   D/A-9 INTn/a  

[OMPI devel] MPI_T SEGV on DSO

2014-07-29 Thread KAWASHIMA Takahiro
Hi,

I encountered the same SEGV reported on the users list when
running varList program.

  http://www.open-mpi.org/community/lists/users/2014/07/24792.php

mpiexec -n 1 ./varList:

... snip ...
event U/D-2 CHAR   n/a  ALL
event_base_verboseD/D-8 INTn/a  LOCAL   
 0
event_libevent2021_event_include  U/A-3 CHAR   n/a  LOCAL   
 poll
opal_event_includeU/A-3 CHAR   n/a  LOCAL   
 poll
event_libevent2021_major_version  D/A-9 INTn/a  UNKNOWN 
 1
event_libevent2021_minor_version  D/A-9 INTn/a  UNKNOWN 
 9
event_libevent2021_release_versionD/A-9 INTn/a  UNKNOWN 
 0
shmem U/D-2 CHAR   n/a  ALL
shmem_base_verboseD/D-8 INTn/a  LOCAL   
 0
shmem_base_RUNTIME_QUERY_hint D/A-9 CHAR   n/a  ALL-EQ
shmem_mmap_priority   U/A-3 INTn/a  ALL 
 50
shmem_mmap_enable_nfs_warning D/A-9 INTn/a  LOCAL   
 true
shmem_mmap_relocate_backing_file  D/A-9 INTn/a  ALL 
 0
shmem_mmap_backing_file_base_dir  D/A-9 CHAR   n/a  ALL 
 /dev/shm
shmem_mmap_major_version  D/A-9 INTn/a  UNKNOWN 
 1
shmem_mmap_minor_version  D/A-9 INTn/a  UNKNOWN 
 9
shmem_mmap_release_versionD/A-9 INTn/a  UNKNOWN 
 0
shmem_posix_major_version D/A-9 INTn/a  UNKNOWN 
 1201644720
shmem_posix_minor_version D/A-9 INTn/a  UNKNOWN 
 32756
shmem_posix_release_version   D/A-9 INTn/a  UNKNOWN 
 6
[ppc:12688] *** Process received signal ***
[ppc:12688] Signal: Segmentation fault (11)
[ppc:12688] Signal code: Invalid permissions (2)
[ppc:12688] Failing at address: 0x7ff4479f83d8
[ppc:12688] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x325c0)[0x7ff4493015c0]
[ppc:12688] [ 1] 
/home/rivis/opt/openmpi-trunk-debug/lib/libmpi.so.0(PMPI_T_cvar_read+0xbc)[0x7ff44970abb7]
[ppc:12688] [ 2] ./varlist(list_cvars+0x56a)[0x4029bc]
[ppc:12688] [ 3] ./varlist(main+0x42b)[0x403598]
[ppc:12688] [ 4] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7ff4492edeed]
[ppc:12688] [ 5] ./varlist[0x4016c9]
[ppc:12688] *** End of error message ***


I tracked this error down and found that it seems related to DSO.

The error occurs when accessing value->intval for the
control variable shmem_sysv_major_version in MPI_T_cvar_read.

  https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mpi/tool/cvar_read.c

The 'value' was obtained by mca_base_var_get_value and it points to
mca_shmem_sysv_component.super.base_version.mca_component_major_version,
which was dlclose'd in MPI_INIT in the DSO case.
(component mmap is selected in my environment)

The abnormal shmem_posix_{major,minor,release}_version values in
my output above have the same cause. SEGV occurs if the memory
was returned to the kernel, and abnormal values are printed
if it was not.

So this SEGV doesn't occur if I configure Open MPI with the
--disable-dlopen option. I think that's the reason why Nathan
doesn't see this error.

Regards,
KAWASHIMA Takahiro


[OMPI devel] [patch] man and FUNC_NAME corrections

2014-07-09 Thread Kawashima, Takahiro
Hi,

The attached patch corrects trivial typos in man files and
FUNC_NAME variables in ompi/mpi/c/*.c files.

One note which may not be trivial:
Before MPI-2.1, the MPI standard said MPI_PACKED should be used for
MPI_{Pack,Unpack}_external. But in MPI-2.1, it was changed to
use MPI_BYTE. See 'B.3 Changes from Version 2.0 to Version 2.1'
(page 766) in MPI-3.0.

Though my patch is for OMPI trunk, I want to see these
corrections in the 1.8 series.

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: ompi/mpi/c/message_c2f.c
===
--- ompi/mpi/c/message_c2f.c	(revision 32173)
+++ ompi/mpi/c/message_c2f.c	(working copy)
@@ -35,7 +35,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Message_f2c";
+static const char FUNC_NAME[] = "MPI_Message_c2f";
 
 
 MPI_Fint MPI_Message_c2f(MPI_Message message) 
Index: ompi/mpi/c/get_accumulate.c
===
--- ompi/mpi/c/get_accumulate.c	(revision 32173)
+++ ompi/mpi/c/get_accumulate.c	(working copy)
@@ -41,7 +41,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Get_accumlate";
+static const char FUNC_NAME[] = "MPI_Get_accumulate";
 
 int MPI_Get_accumulate(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
void *result_addr, int result_count, MPI_Datatype result_datatype,
Index: ompi/mpi/c/rget_accumulate.c
===
--- ompi/mpi/c/rget_accumulate.c	(revision 32173)
+++ ompi/mpi/c/rget_accumulate.c	(working copy)
@@ -42,7 +42,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Rget_accumlate";
+static const char FUNC_NAME[] = "MPI_Rget_accumulate";
 
 int MPI_Rget_accumulate(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
 void *result_addr, int result_count, MPI_Datatype result_datatype,
Index: ompi/mpi/c/request_c2f.c
===
--- ompi/mpi/c/request_c2f.c	(revision 32173)
+++ ompi/mpi/c/request_c2f.c	(working copy)
@@ -35,7 +35,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Request_f2c";
+static const char FUNC_NAME[] = "MPI_Request_c2f";
 
 
 MPI_Fint MPI_Request_c2f(MPI_Request request) 
Index: ompi/mpi/c/raccumulate.c
===
--- ompi/mpi/c/raccumulate.c	(revision 32173)
+++ ompi/mpi/c/raccumulate.c	(working copy)
@@ -41,7 +41,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Accumlate";
+static const char FUNC_NAME[] = "MPI_Raccumulate";
 
 int MPI_Raccumulate(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
int target_rank, MPI_Aint target_disp, int target_count,
Index: ompi/mpi/c/unpack_external.c
===
--- ompi/mpi/c/unpack_external.c	(revision 32173)
+++ ompi/mpi/c/unpack_external.c	(working copy)
@@ -37,7 +37,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Unpack_external ";
+static const char FUNC_NAME[] = "MPI_Unpack_external";
 
 
 int MPI_Unpack_external (const char datarep[], const void *inbuf, MPI_Aint insize,
Index: ompi/mpi/c/comm_size.c
===
--- ompi/mpi/c/comm_size.c	(revision 32173)
+++ ompi/mpi/c/comm_size.c	(working copy)
@@ -35,7 +35,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_comm_size";
+static const char FUNC_NAME[] = "MPI_Comm_size";
 
 
 int MPI_Comm_size(MPI_Comm comm, int *size) 
Index: ompi/mpi/c/get_library_version.c
===
--- ompi/mpi/c/get_library_version.c	(revision 32173)
+++ ompi/mpi/c/get_library_version.c	(working copy)
@@ -31,7 +31,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_get_library_version";
+static const char FUNC_NAME[] = "MPI_Get_library_version";
 
 
 int MPI_Get_library_version(char *version, int *resultlen) 
Index: ompi/mpi/c/ireduce_scatter_block.c
===
--- ompi/mpi/c/ireduce_scatter_block.c	(revision 32173)
+++ ompi/mpi/c/ireduce_scatter_block.c	(working copy)
@@ -39,7 +39,7 @@
 #include "ompi/mpi/c/profile/defines.h"
 #endif
 
-static const char FUNC_NAME[] = "MPI_Reduce_scatter_block";
+static const char FUNC_NAME[] = "MPI_Ireduce_scatter_block";
 
 
 int MPI_Ireduce_scatter_block(const void *sendbuf, void *recvbuf, int recvcount,
Index: ompi/mpi/man/man3/MPI_Pack_external.3in
===
--- ompi/mpi/man/man3/MPI_Pack_external.3in	

[OMPI devel] [patch] async-signal-safe signal handler

2013-12-11 Thread Kawashima, Takahiro
Hi,

Open MPI's signal handler (show_stackframe function defined in
opal/util/stacktrace.c) calls non-async-signal-safe functions
and it causes a problem.

See the attached mpisigabrt.c. Passing corrupted memory to realloc(3)
causes SIGABRT, and the show_stackframe function is invoked.
But the invoked show_stackframe function deadlocks in backtrace_symbols(3)
on some systems, because backtrace_symbols(3) calls malloc(3)
internally and a deadlock on the realloc/malloc mutex occurs.

The attached mpisigabrt.gstack.txt shows the stack trace obtained
by gdb in this deadlock situation on Ubuntu 12.04 LTS (precise)
x86_64. I could not reproduce this behavior on RHEL 5/6, but
I can reproduce it on the K computer and its successor PRIMEHPC FX10.
Passing non-heap memory to free(3) and a double free also cause
this deadlock.
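
For reference, the trigger is roughly of this shape (only a minimal
sketch; the actual attached mpisigabrt.c may differ in its details):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char *p;

    MPI_Init(&argc, &argv);
    p = malloc(16);
    memset(p, 0xff, 32);   /* heap overrun: corrupts the allocator's bookkeeping */
    p = realloc(p, 64);    /* glibc detects the corruption, raises SIGABRT, and
                              Open MPI's show_stackframe handler is invoked */
    free(p);
    MPI_Finalize();
    return 0;
}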

malloc (and backtrace_symbols) is not marked as async-signal-safe
in POSIX or in current glibc, though it seems to have been so marked
in old glibc. So we should not call it in a signal handler now.

  
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
  http://cygwin.com/ml/libc-help/2013-06/msg5.html

I wrote a patch to address this issue. See the attached
async-signal-safe-stacktrace.patch.

This patch calls backtrace_symbols_fd(3) instead of backtrace_symbols(3).
Though backtrace_symbols_fd is not declared async-signal-safe either,
its man page states that it does not call malloc internally, so it
should be safer.
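
To make the idea concrete, here is a minimal sketch of the handler-side
approach (independent of the attached patch; the handler name and the
output text are only illustrative):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void safe_handler(int sig)
{
    void *trace[32];
    int n = backtrace(trace, 32);
    static const char msg[] = "caught signal, stack trace follows:\n";

    (void) sig;
    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
    backtrace_symbols_fd(trace, n, STDERR_FILENO);   /* no malloc inside */
    _exit(1);                /* do not return into the corrupted context */
}

int main(void)
{
    signal(SIGABRT, safe_handler);
    /* ... application code ... */
    return 0;
}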

The output format of the show_stackframe function is not changed by
this patch, but the opal_backtrace_print function (backtrace
framework) interface is changed to keep the output format compatible.
This requires changes in some additional files (ompi_mpi_abort.c
etc.).

This patch also removes unnecessary fflush(3) calls, which are
meaningless for the write(2) system call but might cause a similar
problem.

What do you think about this patch?

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: opal/mca/backtrace/backtrace.h
===
--- opal/mca/backtrace/backtrace.h	(revision 29841)
+++ opal/mca/backtrace/backtrace.h	(working copy)
@@ -34,11 +34,12 @@
 
 
 /*
- * print back trace to FILE file
+ * Print back trace to FILE file with a prefix for each line.
+ * The first 'strip' frames are not printed.
  *
  * \note some attempts made to be signal safe.
  */
-OPAL_DECLSPEC void opal_backtrace_print(FILE *file);
+OPAL_DECLSPEC int opal_backtrace_print(FILE *file, char *prefix, int strip);
 
 /*
  * Return back trace in buffer.  buffer will be allocated by the
Index: opal/mca/backtrace/execinfo/backtrace_execinfo.c
===
--- opal/mca/backtrace/execinfo/backtrace_execinfo.c	(revision 29841)
+++ opal/mca/backtrace/execinfo/backtrace_execinfo.c	(working copy)
@@ -20,6 +20,10 @@
 #include "opal_config.h"
 
 #include <stdio.h>
+#include <string.h>
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
 #ifdef HAVE_EXECINFO_H
 #include <execinfo.h>
 #endif
@@ -27,23 +31,31 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
-int i;
+int i, fd, len;
 int trace_size;
 void * trace[32];
-char ** messages = (char **)NULL;
+char buf[6];
 
+fd = fileno (file);
+if (-1 == fd) {
+return OPAL_ERR_BAD_PARAM;
+}
+
 trace_size = backtrace (trace, 32);
-messages = backtrace_symbols (trace, trace_size);
 
-for (i = 0; i < trace_size; i++) {
-fprintf(file, "[%d] func:%s\n", i, messages[i]);
-fflush(file);
+for (i = strip; i < trace_size; i++) {
+if (NULL != prefix) {
+write (fd, prefix, strlen (prefix));
+}
+len = snprintf (buf, sizeof(buf), "[%2d] ", i - strip);
+write (fd, buf, len);
+backtrace_symbols_fd (&trace[i], 1, fd);
 }
 
-free(messages);
+return OPAL_SUCCESS;
 }
 
 
Index: opal/mca/backtrace/printstack/backtrace_printstack.c
===
--- opal/mca/backtrace/printstack/backtrace_printstack.c	(revision 29841)
+++ opal/mca/backtrace/printstack/backtrace_printstack.c	(working copy)
@@ -24,10 +24,12 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
 printstack(fileno(file));
+
+return OPAL_SUCCESS;
 }
 
 
Index: opal/mca/backtrace/none/backtrace_none.c
===
--- opal/mca/backtrace/none/backtrace_none.c	(revision 29841)
+++ opal/mca/backtrace/none/backtrace_none.c	(working copy)
@@ -23,9 +23,10 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
+return 

Re: [OMPI devel] 1.6.5 large matrix test doesn't pass (decode) ?

2013-10-04 Thread KAWASHIMA Takahiro
It is a bug in the test program, test/datatype/ddt_raw.c, and it was
fixed at r24328 in trunk.

  https://svn.open-mpi.org/trac/ompi/changeset/24328

I've confirmed the failure occurs with plain v1.6.5 and it doesn't
occur with patched v1.6.5.

Thanks,
KAWASHIMA Takahiro

> Not sure if this is important, or expected, but I ran a make check out
> of interest after seeing recent emails and saw the final one of these
> tests be reported as "NOT PASSED" (it seems to be the only failure).
> 
> No idea if this is important or not.  The text I see is:
> 
>  #
>  * TEST UPPER MATRIX
>  #
> 
> test upper matrix
> complete raw in 7 microsec
> decode [NOT PASSED]
> 
> 
> This happens on both our Nehalem and SandyBridge clusters and we are
> building with the system GCC.  I've attached the full log from our
> Nehalem cluster (RHEL 6.4).
> 
> 
> Our configure script is:
> 
> #!/bin/bash
> 
> BASE=`basename $PWD | sed -e s,-,/,`
> 
> module purge
> 
> ./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
> --enable-static  --enable-shared
> 
> make -j
> 
> 
> I'm away on leave next week (first break for a year, yay!) but back
> the week after..
> 
> All the best,
> Chris


Re: [OMPI devel] [patch] MPI_IN_PLACE for MPI_ALLTOALL(V|W)

2013-09-17 Thread Kawashima, Takahiro
Thanks!

Takahiro Kawashima,
MPI development team,
Fujitsu

> Pushed in r29187.
> 
>   George.
> 
> 
> On Sep 17, 2013, at 12:03 , "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > George,
> > 
> > Copyright-added patch is attached.
> > I don't have my svn account so want someone to commit it.
> > 
> > All my reported issues are in the ALLTOALL(V|W) MPI_IN_PLACE code,
> > which was implemented two months ago for MPI-2.2 conformance.
> > Not so surprising.
> > 
> > P.S. Fujitsu does not yet signed the contribution agreement.
> > I must talk with the legal department again to sign it, sigh
> > This patch is very trivial and so no issues will arise.
> > 
> > Thanks,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> >> Takahiro,
> >> 
> >> Good catches. It's absolutely amazing that some of these errors lasted for 
> >> so long before being discovered (especially the extent issue in the 
> >> MPI_ALLTOALL). Please feel free to apply your patch and add the correct 
> >> copyright at the beginning of all altered files.
> >> 
> >>  Thanks,
> >>George.
> >> 
> >> 
> >> 
> >> On Sep 17, 2013, at 07:36 , "Kawashima, Takahiro" 
> >> <t-kawash...@jp.fujitsu.com> wrote:
> >> 
> >>> Hi,
> >>> 
> >>> My colleague tested MPI_IN_PLACE for MPI_ALLTOALL, MPI_ALLTOALLV,
> >>> and MPI_ALLTOALLW, which was implemented two months ago in Open MPI
> >>> trunk. And he found three bugs and created a patch.
> >>> 
> >>> Found bugs are:
> >>> 
> >>> (A) Missing MPI_IN_PLACE support in self COLL component
> >>> 
> >>>   The attached alltoall-self-inplace.c fails with MPI_ERR_ARG.
> >>>   self COLL component also must support MPI_IN_PLACE.
> >>> 
> >>> (B) Incorrect rcount[] index
> >>> 
> >>>   A trivial bug in the following code.
> >>> 
> >>>   for (i = 0, max_size = 0 ; i < size ; ++i) {
> >>>   size_t size = ext * rcounts[rank]; // should be rcounts[i]
> >>> 
> >>>   max_size = size > max_size ? size : max_size;
> >>>   }
> >>> 
> >>>   This causes SEGV or something.
> >>> 
> >>> (C) For MPI_ALLTOALLV, the unit of displacements is extent, not byte
> >>> 
> >>>   Though the unit of displacements is byte for MPI_ALLTOALLW,
> >>>   the unit of displacements is extent for MPI_ALLTOALLV.
> >>> 
> >>>   MPI-2.2 (page 171) says:
> >>> 
> >>> The outcome is as if each process sent a message to every
> >>> other process with,
> >>>   MPI_Send(sendbuf + sdispls[i] · extent(sendtype),
> >>>sendcounts[i], sendtype, i, ...),
> >>> and received a message from every other process with a call to
> >>>   MPI_Recv(recvbuf + rdispls[i] · extent(recvtype),
> >>>recvcounts[i], recvtype, i, ...).
> >>> 
> >>> I attached his patch (alltoall-inplace.patch) to fix these three bugs.

Re: [OMPI devel] [patch] MPI_IN_PLACE for MPI_ALLTOALL(V|W)

2013-09-17 Thread Kawashima, Takahiro
George,

Copyright-added patch is attached.
I don't have my svn account so want someone to commit it.

All my reported issues are in the ALLTOALL(V|W) MPI_IN_PLACE code,
which was implemented two months ago for MPI-2.2 conformance.
Not so surprising.

P.S. Fujitsu has not yet signed the contribution agreement.
I must talk with the legal department again to sign it, sigh.
This patch is very trivial, so no issues will arise.

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> Good catches. It's absolutely amazing that some of these errors lasted for so 
> long before being discovered (especially the extent issue in the 
> MPI_ALLTOALL). Please feel free to apply your patch and add the correct 
> copyright at the beginning of all altered files.
> 
>   Thanks,
> George.
> 
> 
> 
> On Sep 17, 2013, at 07:36 , "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi,
> > 
> > My colleague tested MPI_IN_PLACE for MPI_ALLTOALL, MPI_ALLTOALLV,
> > and MPI_ALLTOALLW, which was implemented two months ago in Open MPI
> > trunk. And he found three bugs and created a patch.
> > 
> > Found bugs are:
> > 
> > (A) Missing MPI_IN_PLACE support in self COLL component
> > 
> >The attached alltoall-self-inplace.c fails with MPI_ERR_ARG.
> >self COLL component also must support MPI_IN_PLACE.
> > 
> > (B) Incorrect rcount[] index
> > 
> >A trivial bug in the following code.
> > 
> >for (i = 0, max_size = 0 ; i < size ; ++i) {
> >size_t size = ext * rcounts[rank]; // should be rcounts[i]
> > 
> >max_size = size > max_size ? size : max_size;
> >}
> > 
> >This causes SEGV or something.
> > 
> > (C) For MPI_ALLTOALLV, the unit of displacements is extent, not byte
> > 
> >Though the unit of displacements is byte for MPI_ALLTOALLW,
> >the unit of displacements is extent for MPI_ALLTOALLV.
> > 
> >MPI-2.2 (page 171) says:
> > 
> >  The outcome is as if each process sent a message to every
> >  other process with,
> >MPI_Send(sendbuf + sdispls[i] · extent(sendtype),
> > sendcounts[i], sendtype, i, ...),
> >  and received a message from every other process with a call to
> >MPI_Recv(recvbuf + rdispls[i] · extent(recvtype),
> > recvcounts[i], recvtype, i, ...).
> > 
> > I attached his patch (alltoall-inplace.patch) to fix these three bugs.
Index: ompi/mca/coll/self/coll_self_alltoall.c
===
--- ompi/mca/coll/self/coll_self_alltoall.c	(revision 29185)
+++ ompi/mca/coll/self/coll_self_alltoall.c	(working copy)
@@ -9,6 +9,7 @@
  * University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
+ * Copyright (c) 2013  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -37,6 +38,10 @@
  struct ompi_communicator_t *comm,
  mca_coll_base_module_t *module)
 {
+if (MPI_IN_PLACE == sbuf) {
+return MPI_SUCCESS;
+}
+
 return ompi_datatype_sndrcv(sbuf, scount, sdtype,
rbuf, rcount, rdtype);
 }
Index: ompi/mca/coll/self/coll_self_alltoallv.c
===
--- ompi/mca/coll/self/coll_self_alltoallv.c	(revision 29185)
+++ ompi/mca/coll/self/coll_self_alltoallv.c	(working copy)
@@ -9,6 +9,7 @@
  * University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
+ * Copyright (c) 2013  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -40,6 +41,11 @@
 {
 int err;
 ptrdiff_t lb, rextent, sextent;
+
+if (MPI_IN_PLACE == sbuf) {
+return MPI_SUCCESS;
+}
+
 err = ompi_datatype_get_extent(sdtype, &lb, &sextent);
 if (OMPI_SUCCESS != err) {
 return OMPI_ERROR;
Index: ompi/mca/coll/self/coll_self_alltoallw.c
===
--- ompi/mca/coll/self/coll_self_alltoallw.c	(revision 29185)
+++ ompi/mca/coll/self/coll_self_alltoallw.c	(working copy)
@@ -9,6 +9,7 @@
  * University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
+ * Copyright (c) 2013  FUJITSU LIMITED.  All rights reserved.
  * $COPYR

[OMPI devel] [patch] MPI_IN_PLACE for MPI_ALLTOALL(V|W)

2013-09-17 Thread Kawashima, Takahiro
Hi,

My colleague tested MPI_IN_PLACE for MPI_ALLTOALL, MPI_ALLTOALLV,
and MPI_ALLTOALLW, which were implemented two months ago in Open MPI
trunk. He found three bugs and created a patch.

Found bugs are:

(A) Missing MPI_IN_PLACE support in self COLL component

The attached alltoall-self-inplace.c fails with MPI_ERR_ARG.
self COLL component also must support MPI_IN_PLACE.

(B) Incorrect rcount[] index

A trivial bug in the following code.

for (i = 0, max_size = 0 ; i < size ; ++i) {
size_t size = ext * rcounts[rank]; // should be rcounts[i]

max_size = size > max_size ? size : max_size;
}

This causes SEGV or something.

(C) For MPI_ALLTOALLV, the unit of displacements is extent, not byte

Though the unit of displacements is byte for MPI_ALLTOALLW,
the unit of displacements is extent for MPI_ALLTOALLV.

MPI-2.2 (page 171) says:

  The outcome is as if each process sent a message to every
  other process with,
MPI_Send(sendbuf + sdispls[i] · extent(sendtype),
 sendcounts[i], sendtype, i, ...),
  and received a message from every other process with a call to
MPI_Recv(recvbuf + rdispls[i] · extent(recvtype),
 recvcounts[i], recvtype, i, ...).

I attached his patch (alltoall-inplace.patch) to fix these three bugs.
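
For illustration, an in-place call looks like this from the user side
(a minimal sketch of mine, not the attached alltoall-self-inplace.c;
note that the rdispls values are counted in units of the recvtype
extent, not in bytes):

#include <mpi.h>
#include <stdio.h>

#define MAXPROCS 64

int main(int argc, char *argv[])
{
    int rank, size, i;
    int buf[MAXPROCS], rcounts[MAXPROCS], rdispls[MAXPROCS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumed to be <= MAXPROCS */

    for (i = 0; i < size; i++) {
        buf[i] = rank * 100 + i;
        rcounts[i] = 1;
        rdispls[i] = i;   /* element offsets; multiplied by the extent inside the library */
    }

    /* sendbuf, sendcounts, sdispls and sendtype are ignored with MPI_IN_PLACE */
    MPI_Alltoallv(MPI_IN_PLACE, NULL, NULL, MPI_DATATYPE_NULL,
                  buf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: buf[0] = %d\n", rank, buf[0]);
    MPI_Finalize();
    return 0;
}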

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: ompi/mca/coll/self/coll_self_alltoall.c
===
--- ompi/mca/coll/self/coll_self_alltoall.c	(revision 28967)
+++ ompi/mca/coll/self/coll_self_alltoall.c	(working copy)
@@ -37,6 +37,10 @@
  struct ompi_communicator_t *comm,
  mca_coll_base_module_t *module)
 {
+if (MPI_IN_PLACE == sbuf) {
+return MPI_SUCCESS;
+}
+
 return ompi_datatype_sndrcv(sbuf, scount, sdtype,
rbuf, rcount, rdtype);
 }
Index: ompi/mca/coll/self/coll_self_alltoallv.c
===
--- ompi/mca/coll/self/coll_self_alltoallv.c	(revision 28967)
+++ ompi/mca/coll/self/coll_self_alltoallv.c	(working copy)
@@ -40,6 +40,11 @@
 {
 int err;
 ptrdiff_t lb, rextent, sextent;
+
+if (MPI_IN_PLACE == sbuf) {
+return MPI_SUCCESS;
+}
+
 err = ompi_datatype_get_extent(sdtype, &lb, &sextent);
 if (OMPI_SUCCESS != err) {
 return OMPI_ERROR;
Index: ompi/mca/coll/self/coll_self_alltoallw.c
===
--- ompi/mca/coll/self/coll_self_alltoallw.c	(revision 28967)
+++ ompi/mca/coll/self/coll_self_alltoallw.c	(working copy)
@@ -39,6 +39,11 @@
 {
 int err;
 ptrdiff_t lb, rextent, sextent;
+
+if (MPI_IN_PLACE == sbuf) {
+return MPI_SUCCESS;
+}
+
 err = ompi_datatype_get_extent(sdtypes[0], &lb, &sextent);
 if (OMPI_SUCCESS != err) {
 return OMPI_ERROR;
Index: ompi/mca/coll/basic/coll_basic_alltoallv.c
===
--- ompi/mca/coll/basic/coll_basic_alltoallv.c	(revision 28967)
+++ ompi/mca/coll/basic/coll_basic_alltoallv.c	(working copy)
@@ -56,7 +57,7 @@
 /* Find the largest receive amount */
 ompi_datatype_type_extent (rdtype, &ext);
 for (i = 0, max_size = 0 ; i < size ; ++i) {
-size_t size = ext * rcounts[rank];
+size_t size = ext * rcounts[i];
 
 max_size = size > max_size ? size : max_size;
 }
@@ -76,11 +77,11 @@
 if (i == rank && rcounts[j]) {
 /* Copy the data into the temporary buffer */
 err = ompi_datatype_copy_content_same_ddt (rdtype, rcounts[j],
-   tmp_buffer, (char *) rbuf + rdisps[j]);
+   tmp_buffer, (char *) rbuf + rdisps[j] * ext);
 if (MPI_SUCCESS != err) { goto error_hndl; }
 
 /* Exchange data with the peer */
-err = MCA_PML_CALL(irecv ((char *) rbuf + rdisps[j], rcounts[j], rdtype,
+err = MCA_PML_CALL(irecv ((char *) rbuf + rdisps[j] * ext, rcounts[j], rdtype,
   j, MCA_COLL_BASE_TAG_ALLTOALLV, comm, preq++));
 if (MPI_SUCCESS != err) { goto error_hndl; }
 
@@ -91,11 +92,11 @@
 } else if (j == rank && rcounts[i]) {
 /* Copy the data into the temporary buffer */
 err = ompi_datatype_copy_content_same_ddt (rdtype, rcounts[i],
-   tmp_buffer, (char *) rbuf + rdisps[i]);
+   tmp_buffer, (char *) rbuf + rdisps[i] * ext);
 if (MPI_SUCCESS != err) { goto error_hndl; }
 
 /* Exchange data with the peer */
-err = MCA_PML_CALL(irecv ((char *) rbuf + rdisps[i], 

Re: [OMPI devel] [bug] One-sided communication with a duplicated datatype

2013-07-15 Thread KAWASHIMA Takahiro
George,

Thanks. I've confirmed your patch.
I wrote a simple program to test your patch and no problems were found.
The test program is attached to this mail.

Regards,
KAWASHIMA Takahiro

> Takahiro,
> 
> Please find below another patch, this time hopefully fixing all issues. The 
> problem with my original patch and with yours was that they try to address 
> the packing of the data representation without fixing the computation of the 
> required length. As a result the length on the packer and unpacker differs 
> and the unpacking of the subsequent data is done from a wrong location.
> 
> I changed the code to force the preparation of the packed data representation 
> before returning the length the first time. This way we can compute exactly 
> how many bytes we need, including the potential alignment requirements. As a 
> result the amount on both sides (the packer and the unpacker) are now 
> identical, and the entire process works flawlessly (or so I hope).
> 
> Let me know if you still notice issues with this patch. I'll push the 
> tomorrow in the trunk, so it can soak for a few days before propagation to 
> the branches.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static MPI_Datatype types[100];
static MPI_Win win;
static int obuf[1], tbuf[100];

static void do_put(int expected, int index, char *description)
{
int i;

for (i = 1; types[i] != MPI_DATATYPE_NULL; i++);
i--;

if (i != 0) {
        MPI_Type_commit(&types[i]);
}

memset(tbuf, 0, sizeof(tbuf));

MPI_Win_fence(0, win);
MPI_Put(obuf, 1, types[0], 0, 0, 1, types[i], win);
MPI_Win_fence(0, win);

if (tbuf[index] != expected) {
printf("NG %s (expected: %d, actual: %d, index: %d)\n",
   description, expected, tbuf[index], index);
} else {
printf("OK %s\n", description);
}

for (i = 1; types[i] != MPI_DATATYPE_NULL; i++) {
        MPI_Type_free(&types[i]);
}
}

int main(int argc, char *argv[])
{
int i;
int displs[] = {1};
int blens[] = {1};

types[0] = MPI_INT;
for (i = 1; i < sizeof(types) / sizeof(types[0]); i++) {
types[i] = MPI_DATATYPE_NULL;
}

obuf[0] = 77;

    MPI_Init(&argc, &argv);
    MPI_Win_create(tbuf, sizeof(tbuf[0]), sizeof(tbuf) / sizeof(tbuf[0]),
                   MPI_INFO_NULL, MPI_COMM_SELF, &win);

do_put(77, 0, "predefined");

    MPI_Type_dup(types[0], &types[1]);
    do_put(77, 0, "dup");

    MPI_Type_contiguous(1, types[0], &types[1]);
    do_put(77, 0, "contiguous");

    MPI_Type_vector(1, 1, 1, types[0], &types[1]);
    do_put(77, 0, "vector");

    MPI_Type_indexed(1, blens, displs, types[0], &types[1]);
    do_put(77, 1, "indexed");

    MPI_Type_contiguous(1, types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    do_put(77, 0, "contiguous+dup");

    MPI_Type_dup(types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    do_put(77, 0, "dup+contiguous");

    MPI_Type_indexed(1, blens, displs, types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    do_put(77, 1, "indexed+contiguous");

    MPI_Type_contiguous(1, types[0], &types[1]);
    MPI_Type_indexed(1, blens, displs, types[1], &types[2]);
    do_put(77, 1, "contiguous+indexed");

    MPI_Type_contiguous(1, types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    MPI_Type_contiguous(1, types[2], &types[3]);
    do_put(77, 0, "contiguous+dup+contiguous");

    MPI_Type_dup(types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    do_put(77, 0, "dup+contiguous+dup");

    MPI_Type_dup(types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    do_put(77, 0, "dup+dup+dup");

    MPI_Type_indexed(1, blens, displs, types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    MPI_Type_vector(1, 1, 1, types[2], &types[3]);
    do_put(77, 1, "indexed+contiguous+vector");

    MPI_Type_dup(types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    MPI_Type_dup(types[3], &types[4]);
    do_put(77, 0, "dup+contiguous+dup+dup");

    MPI_Type_contiguous(1, types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    MPI_Type_dup(types[3], &types[4]);
    do_put(77, 0, "contiguous+dup+dup+dup");

    MPI_Type_dup(types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    MPI_Type_contiguous(1, types[3], &types[4]);
    do_put(77, 0, "dup+dup+dup+contiguous");

    MPI_Type_indexed(1, blens, displs, types[0], &types[1]);
    MPI_Type_dup(types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    MPI_Type_contiguous(1, types[3], &types[4]);
    do_put(77, 1, "indexed+dup+dup+contiguous");

    MPI_Type_indexed(1, blens, displs, types[0], &types[1]);
    MPI_Type_contiguous(1, types[1], &types[2]);
    MPI_Type_dup(types[2], &types[3]);
    MPI_Type_dup(types[3], &types[4]);
do_put(77, 

Re: [OMPI devel] [bug] One-sided communication with a duplicated datatype

2013-07-14 Thread KAWASHIMA Takahiro
George,

An improved patch is attached. The latter half is the same as your patch.
But again, I'm not sure this is a correct solution.

It works correctly for my attached put_dup_type_3.c.
Run it as "mpiexec -n 1 ./put_dup_type_3".
It will print seven OKs if it succeeds.

Regards,
KAWASHIMA Takahiro

> No. My patch doesn't work for a more simple case,
> just a duplicate of MPI_INT.
> 
> Datatype is too complex for me ...
> 
> Regards,
> KAWASHIMA Takahiro
> 
> > George,
> > 
> > Thanks. But no, your patch does not work correctly.
> > 
> > The assertion failure disappeared by your patch but the value of the
> > target buffer of MPI_Put is not a correct one.
> > 
> > In rdma OSC (and pt2pt OSC), the following data are packed into
> > the send buffer in ompi_osc_rdma_sendreq_send function on the
> > origin side.
> > 
> >   - header
> >   - datatype description
> >   - user data
> > 
> > User data are written at the offset of
> > (sizeof(ompi_osc_rdma_send_header_t) + total_pack_size).
> > 
> > In the case of my program attached in my previous mail, total_pack_size
> > is 32 because ompi_datatype_set_args set 8 for MPI_COMBINER_DUP and
> > 24 for MPI_COMBINER_CONTIGUOUS. See the following code.
> > 
> > 
> > int32_t ompi_datatype_set_args(... snip ...)
> > {
> > ... snip ...
> > switch(type){
> > ... snip ...
> > case MPI_COMBINER_DUP:
> > /* Recompute the data description packed size based on the 
> > optimization
> >  * for MPI_COMBINER_DUP.
> >  */
> > pArgs->total_pack_size = 2 * sizeof(int);  total_pack_size = 8
> > break;
> > ... snip ...
> > }
> > ...
> > for( pos = 0; pos < cd; pos++ ) {
> > ... snip ...
> > if( !(ompi_datatype_is_predefined(d[pos])) ) {
> > ... snip ...
> > pArgs->total_pack_size += 
> > ((ompi_datatype_args_t*)d[pos]->args)->total_pack_size;  
> > total_pack_size += 24
> > ... snip ...
> > }
> > ... snip ...
> > }
> > ... snip ...
> > }
> > 
> > 
> > But on the target side, user data are read at the offset of
> > (sizeof(ompi_osc_rdma_send_header_t) + 24)
> > because ompi_osc_base_datatype_create function, which is called
> > by ompi_osc_rdma_sendreq_recv_put function, progress the offset
> > only 24 bytes. Not 32 bytes.
> > 
> > So the wrong data are written to the target buffer.
> > 
> > We need to take care of total_pack_size in the origin side.
> > 
> > I modified ompi_datatype_set_args function as a trial.
> > 
> > Index: ompi/datatype/ompi_datatype_args.c
> > ===
> > --- ompi/datatype/ompi_datatype_args.c  (revision 28778)
> > +++ ompi/datatype/ompi_datatype_args.c  (working copy)
> > @@ -129,7 +129,7 @@
> >  /* Recompute the data description packed size based on the 
> > optimization
> >   * for MPI_COMBINER_DUP.
> >   */
> > -pArgs->total_pack_size = 2 * sizeof(int);
> > +pArgs->total_pack_size = 0;
> >  break;
> >  
> >  case MPI_COMBINER_CONTIGUOUS:
> > 
> > This patch in addition to your patch works correctly for my program.
> > But I'm not sure this is a correct solution.
> > 
> > Regards,
> > KAWASHIMA Takahiro
> > 
> > > Takahiro,
> > > 
> > > Nice catch. That particular code was an over-optimizations … that failed. 
> > > Please try with the patch below.
> > > 
> > > Let me know if it's working as expected, I will push it in the trunk once 
> > > confirmed.
> > > 
> > >   George.
> > > 
> > > 
> > > Index: ompi/datatype/ompi_datatype_args.c
> > > ===
> > > --- ompi/datatype/ompi_datatype_args.c(revision 28787)
> > > +++ ompi/datatype/ompi_datatype_args.c(working copy)
> > > @@ -449,9 +449,10 @@
> > >  }
> > >  /* For duplicated datatype we don't have to store all the 
> > > information */
> > >  if( MPI_COMBINER_DUP == args->create_type ) {
> > > -position[0] = args->create_type;
> > > -position[1] = args->d[0]->id; /*

Re: [OMPI devel] [bug] One-sided communication with a duplicated datatype

2013-07-14 Thread KAWASHIMA Takahiro
No. My patch doesn't work for an even simpler case:
just a duplicate of MPI_INT.

Datatype is too complex for me ...

Regards,
KAWASHIMA Takahiro

> George,
> 
> Thanks. But no, your patch does not work correctly.
> 
> The assertion failure disappeared by your patch but the value of the
> target buffer of MPI_Put is not a correct one.
> 
> In rdma OSC (and pt2pt OSC), the following data are packed into
> the send buffer in ompi_osc_rdma_sendreq_send function on the
> origin side.
> 
>   - header
>   - datatype description
>   - user data
> 
> User data are written at the offset of
> (sizeof(ompi_osc_rdma_send_header_t) + total_pack_size).
> 
> In the case of my program attached in my previous mail, total_pack_size
> is 32 because ompi_datatype_set_args set 8 for MPI_COMBINER_DUP and
> 24 for MPI_COMBINER_CONTIGUOUS. See the following code.
> 
> 
> int32_t ompi_datatype_set_args(... snip ...)
> {
> ... snip ...
> switch(type){
> ... snip ...
> case MPI_COMBINER_DUP:
> /* Recompute the data description packed size based on the 
> optimization
>  * for MPI_COMBINER_DUP.
>  */
> pArgs->total_pack_size = 2 * sizeof(int);  total_pack_size = 8
> break;
> ... snip ...
> }
> ...
> for( pos = 0; pos < cd; pos++ ) {
> ... snip ...
> if( !(ompi_datatype_is_predefined(d[pos])) ) {
> ... snip ...
> pArgs->total_pack_size += 
> ((ompi_datatype_args_t*)d[pos]->args)->total_pack_size;  total_pack_size 
> += 24
> ... snip ...
> }
> ... snip ...
> }
> ... snip ...
> }
> 
> 
> But on the target side, user data are read at the offset of
> (sizeof(ompi_osc_rdma_send_header_t) + 24)
> because ompi_osc_base_datatype_create function, which is called
> by ompi_osc_rdma_sendreq_recv_put function, progress the offset
> only 24 bytes. Not 32 bytes.
> 
> So the wrong data are written to the target buffer.
> 
> We need to take care of total_pack_size in the origin side.
> 
> I modified ompi_datatype_set_args function as a trial.
> 
> Index: ompi/datatype/ompi_datatype_args.c
> ===
> --- ompi/datatype/ompi_datatype_args.c  (revision 28778)
> +++ ompi/datatype/ompi_datatype_args.c  (working copy)
> @@ -129,7 +129,7 @@
>  /* Recompute the data description packed size based on the 
> optimization
>   * for MPI_COMBINER_DUP.
>   */
> -pArgs->total_pack_size = 2 * sizeof(int);
> +pArgs->total_pack_size = 0;
>  break;
>  
>  case MPI_COMBINER_CONTIGUOUS:
> 
> This patch in addition to your patch works correctly for my program.
> But I'm not sure this is a correct solution.
> 
> Regards,
> KAWASHIMA Takahiro
> 
> > Takahiro,
> > 
> > Nice catch. That particular code was an over-optimizations … that failed. 
> > Please try with the patch below.
> > 
> > Let me know if it's working as expected, I will push it in the trunk once 
> > confirmed.
> > 
> >   George.
> > 
> > 
> > Index: ompi/datatype/ompi_datatype_args.c
> > ===
> > --- ompi/datatype/ompi_datatype_args.c  (revision 28787)
> > +++ ompi/datatype/ompi_datatype_args.c  (working copy)
> > @@ -449,9 +449,10 @@
> >  }
> >  /* For duplicated datatype we don't have to store all the information 
> > */
> >  if( MPI_COMBINER_DUP == args->create_type ) {
> > -position[0] = args->create_type;
> > -position[1] = args->d[0]->id; /* On the OMPI - layer, copy the 
> > ompi_datatype.id */
> > -return OMPI_SUCCESS;
> > +ompi_datatype_t* temp_data = args->d[0];
> > +return __ompi_datatype_pack_description(temp_data,
> > +packed_buffer,
> > +next_index );
> >  }
> >  position[0] = args->create_type;
> >  position[1] = args->ci;
> > 
> > 
> > 
> > On Jul 14, 2013, at 14:30 , KAWASHIMA Takahiro <rivis.kawash...@nifty.com> 
> > wrote:
> > 
> > > Hi,
> > > 
> > > I encountered an assertion failure in Open MPI trunk and found a bug.
> > > 
> > > See the attached program. This program can be run with mpiexec -n 1.
> > 

Re: [OMPI devel] [bug] One-sided communication with a duplicated datatype

2013-07-14 Thread KAWASHIMA Takahiro
George,

Thanks. But no, your patch does not work correctly.

The assertion failure disappeared with your patch, but the value in the
target buffer of MPI_Put is not correct.

In rdma OSC (and pt2pt OSC), the following data are packed into
the send buffer in ompi_osc_rdma_sendreq_send function on the
origin side.

  - header
  - datatype description
  - user data

User data are written at the offset of
(sizeof(ompi_osc_rdma_send_header_t) + total_pack_size).

In the case of my program attached in my previous mail, total_pack_size
is 32 because ompi_datatype_set_args set 8 for MPI_COMBINER_DUP and
24 for MPI_COMBINER_CONTIGUOUS. See the following code.


int32_t ompi_datatype_set_args(... snip ...)
{
... snip ...
switch(type){
... snip ...
case MPI_COMBINER_DUP:
/* Recompute the data description packed size based on the optimization
 * for MPI_COMBINER_DUP.
 */
pArgs->total_pack_size = 2 * sizeof(int);  total_pack_size = 8
break;
... snip ...
}
...
for( pos = 0; pos < cd; pos++ ) {
... snip ...
if( !(ompi_datatype_is_predefined(d[pos])) ) {
... snip ...
pArgs->total_pack_size += 
((ompi_datatype_args_t*)d[pos]->args)->total_pack_size;  total_pack_size += 
24
... snip ...
}
... snip ...
}
... snip ...
}


But on the target side, user data are read at the offset of
(sizeof(ompi_osc_rdma_send_header_t) + 24)
because the ompi_osc_base_datatype_create function, which is called
by the ompi_osc_rdma_sendreq_recv_put function, advances the offset
by only 24 bytes, not 32 bytes.

So the wrong data are written to the target buffer.

We need to take care of total_pack_size in the origin side.

I modified ompi_datatype_set_args function as a trial.

Index: ompi/datatype/ompi_datatype_args.c
===
--- ompi/datatype/ompi_datatype_args.c  (revision 28778)
+++ ompi/datatype/ompi_datatype_args.c  (working copy)
@@ -129,7 +129,7 @@
 /* Recompute the data description packed size based on the optimization
  * for MPI_COMBINER_DUP.
  */
-pArgs->total_pack_size = 2 * sizeof(int);
+pArgs->total_pack_size = 0;
 break;

 case MPI_COMBINER_CONTIGUOUS:

This patch in addition to your patch works correctly for my program.
But I'm not sure this is a correct solution.

Regards,
KAWASHIMA Takahiro

> Takahiro,
> 
> Nice catch. That particular code was an over-optimizations … that failed. 
> Please try with the patch below.
> 
> Let me know if it's working as expected, I will push it in the trunk once 
> confirmed.
> 
>   George.
> 
> 
> Index: ompi/datatype/ompi_datatype_args.c
> ===
> --- ompi/datatype/ompi_datatype_args.c(revision 28787)
> +++ ompi/datatype/ompi_datatype_args.c(working copy)
> @@ -449,9 +449,10 @@
>  }
>  /* For duplicated datatype we don't have to store all the information */
>  if( MPI_COMBINER_DUP == args->create_type ) {
> -position[0] = args->create_type;
> -position[1] = args->d[0]->id; /* On the OMPI - layer, copy the 
> ompi_datatype.id */
> -return OMPI_SUCCESS;
> +ompi_datatype_t* temp_data = args->d[0];
> +return __ompi_datatype_pack_description(temp_data,
> +packed_buffer,
> +next_index );
>  }
>  position[0] = args->create_type;
>  position[1] = args->ci;
> 
> 
> 
> On Jul 14, 2013, at 14:30 , KAWASHIMA Takahiro <rivis.kawash...@nifty.com> 
> wrote:
> 
> > Hi,
> > 
> > I encountered an assertion failure in Open MPI trunk and found a bug.
> > 
> > See the attached program. This program can be run with mpiexec -n 1.
> > This program calls MPI_Put and writes one int value to the target side.
> > The target side datatype is equivalent to MPI_INT, but is a derived
> > datatype created by MPI_Type_contiguous and MPI_Type_Dup.
> > 
> > This program aborts with the following output.
> > 
> > ==
> >  dt1 (0x2626160) 
> > type 2 count ints 1 count disp 0 count datatype 1
> > ints: 1 
> > types:MPI_INT 
> >  dt2 (0x2626340) 
> > type 1 count ints 0 count disp 0 count datatype 1
> > types:0x2626160 
> > put_dup_type: ../../../ompi/datatype/ompi_datatype_args.c:565: 
> > __ompi_datatype_create_from_packed_desc

Re: [OMPI devel] RFC MPI 2.2 Dist_graph addition

2013-07-01 Thread Kawashima, Takahiro
George,

My colleague was working on your ompi-topo Bitbucket repository,
but that work was not completed. However, he found bugs in the patch
attached to your previous mail and created a fix. See the attached
patch, which applies against Open MPI trunk + your patch.

His test programs are also attached. test_1 and test_2 can run
with nprocs=5, and test_3 and test_4 can run with nprocs>=3.

Though I'm not sure about the contents of the patch and the test
programs, I can ask him if you have any questions.
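
For what it is worth, the weighted adjacent-create path that his fix
touches can be exercised with something like the following (a rough
sketch of mine, not one of the attached test_N programs):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, indegree, outdegree, weighted;
    int sources[2], sourceweights[2], destinations[2], destweights[2];
    MPI_Comm ring;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ring neighbors with distinct weights, so a wrong inw/outw copy is visible */
    sources[0] = destinations[0] = (rank + size - 1) % size;
    sources[1] = destinations[1] = (rank + 1) % size;
    sourceweights[0] = 1; sourceweights[1] = 2;
    destweights[0]   = 2; destweights[1]   = 1;

    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, 2, sources, sourceweights,
                                   2, destinations, destweights,
                                   MPI_INFO_NULL, 0, &ring);
    MPI_Dist_graph_neighbors_count(ring, &indegree, &outdegree, &weighted);
    printf("rank %d: indegree=%d outdegree=%d weighted=%d\n",
           rank, indegree, outdegree, weighted);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}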

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> WHAT:Support for MPI 2.2 dist_graph
> 
> WHY: To become [almost entierly] MPI 2.2 compliant
> 
> WHEN:Monday July 1st
> 
> As discussed during the last phone call, a missing functionality of the MPI 
> 2.2 standard (the distributed graph topology) is ready for prime-time. The 
> attached patch provide a minimal version (no components supporting 
> reordering), that will complete the topology support in Open MPI.
> 
> It is somehow a major change compared with what we had before and it reshape 
> the way we deal with topologies completely. Where our topologies were mainly 
> storage components (they were not capable of creating the new communicator as 
> an example), the new version is built around a [possibly] common 
> representation (in mca/topo/topo.h), but the functions to attach and retrieve 
> the topological information are specific to each component. As a result the 
> ompi_create_cart and ompi_create_graph functions become useless and have been 
> removed.
> 
> In addition to adding the internal infrastructure to manage the topology 
> information, it updates the MPI interface, and the debuggers support and 
> provides all Fortran interfaces. From a correctness point of view it passes 
> all the tests we have in ompi-tests for the cart and graph topology, and some 
> tests/applications for the dist_graph interface.
> 
> I don't think there is a need for a long wait on this one so I would like to 
> propose a short deadline, a week from now on Monday July 1st. A patch based 
> on Open MPI trunk r28670 is attached below.
diff -u -r openmpi-trunk2/ompi/communicator/comm.c openmpi-trunk/ompi/communicator/comm.c
--- openmpi-trunk2/ompi/communicator/comm.c	2013-06-25 16:36:42.0 +0900
+++ openmpi-trunk/ompi/communicator/comm.c	2013-06-25 18:52:17.0 +0900
@@ -1446,6 +1446,8 @@
 opal_output(0," topo-cart,");
 if ( OMPI_COMM_IS_GRAPH(comm))
 opal_output(0," topo-graph");
+if ( OMPI_COMM_IS_DIST_GRAPH(comm))
+opal_output(0," topo-dist-graph");
 opal_output(0,"\n");
 
 if (OMPI_COMM_IS_INTER(comm)) {
diff -u -r openmpi-trunk2/ompi/communicator/communicator.h openmpi-trunk/ompi/communicator/communicator.h
--- openmpi-trunk2/ompi/communicator/communicator.h	2013-06-25 16:36:42.0 +0900
+++ openmpi-trunk/ompi/communicator/communicator.h	2013-06-25 17:19:31.0 +0900
@@ -46,7 +46,7 @@
 #define OMPI_COMM_INVALID  0x0020
 #define OMPI_COMM_CART 0x0100
 #define OMPI_COMM_GRAPH0x0200
-#define OMPI_COMM_DIST_GRAPH   0x0300
+#define OMPI_COMM_DIST_GRAPH   0x0400
 #define OMPI_COMM_PML_ADDED0x1000
 #define OMPI_COMM_EXTRA_RETAIN 0x4000
 
--- openmpi-trunk2/ompi/mca/topo/base/topo_base_dist_graph_create_adjacent.c	2013-06-25 16:36:38.0 +0900
+++ openmpi-trunk/ompi/mca/topo/base/topo_base_dist_graph_create_adjacent.c	2013-06-26 11:44:40.0 +0900
@@ -47,18 +47,18 @@
 topo->out = topo->outw = NULL;
 topo->indegree = indegree;
 topo->outdegree = outdegree;
-topo->weighted = !((sourceweights == MPI_UNWEIGHTED) || (destweights == MPI_UNWEIGHTED));
+topo->weighted = !((MPI_UNWEIGHTED == sourceweights) && (MPI_UNWEIGHTED == destweights));
 topo->in = (int*)malloc(sizeof(int) * topo->indegree);
 if( NULL == topo->in ) {
 goto bail_out;
 }
 memcpy( topo->in, sources, sizeof(int) * topo->indegree );
-if( sourceweights == MPI_UNWEIGHTED ) {
+if( MPI_UNWEIGHTED != sourceweights ) {
 topo->inw = (int*)malloc(sizeof(int) * topo->indegree);
 if( NULL == topo->inw ) {
 goto bail_out;
 }
-memcpy( topo->in, sourceweights, sizeof(int) * topo->indegree );
+memcpy( topo->inw, sourceweights, sizeof(int) * topo->indegree );
 }
 topo->out = (int*)malloc(sizeof(int) * topo->outdegree);
 if( NULL == topo->out ) {
@@ -66,12 +66,12 @@
 }
 memcpy( topo->out, destinations, sizeof(int) * topo->outdegree );
 topo->outw = NULL;
-if( destweights == MPI_UNWEIGHTED ) {
+if( MPI_UNWEIGHTED != destweights ) {
 topo->outw = (int*)malloc(sizeof(int) * topo->outdegree);
 if( NULL == topo->outw ) {
 goto bail_out;
 }
-memcpy( topo->out, destweights, sizeof(int) * topo->outdegree );
+memcpy( topo->outw, destweights, sizeof(int) * topo->outdegree );
 }
 

Re: [OMPI devel] Datatype initialization bug?

2013-05-22 Thread Kawashima, Takahiro
If so, we should use an OMPI index in
OMPI_DATATYPE_INIT_DESC_PREDEFINED.

But in the else-block, desc[0].elem.common.type is set to an OMPI
datatype index. And it seems that this 'type' is treated as an
OPAL datatype index in other parts. 
# OMPI_DATATYPE_MPI_CHARACTER and OPAL_DATATYPE_COMPLEX8 have the
# same value (0x13), so how do we distinguish them?

I wonder whether Fortran datatypes really need separate desc.
Though OPAL does not have *identical* datatypes, it always has
*corresponding* datatypes. It is obvious because we currently
translate an OMPI Fortran datatype to a corresponding OPAL
datatype index in OMPI_DATATYPE_INIT_DESC_PREDEFINED as
"OPAL_DATATYPE_ ## TYPE ## SIZE".

If Fortran datatypes don't need separate desc, this issue
may be fixed by my attached datatype-init-1.patch.
It also fixes the opt_desc issue described first.

Furthermore, do we need to copy desc for OMPI datatypes?
If not, use my attached datatype-init-2.patch instead.
It doesn't copy desc, and the OMPI desc points to the OPAL desc.
I'm not sure this is a correct solution.

The attached result-after.txt is the output of the attached
show_ompi_datatype.c with my patch. I think this output is
correct.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> Nice catch, I really wonder how this one survived for soo long. I pushed a 
> patch in r28535 addressing this issue. It is not the best solution, but it 
> provide an easy way to address the issue.
> 
> A little bit of history. A datatype is composed by (let's keep it short) 2 
> component, a high-level description containing among others the size and the 
> name of the datatype and a low level description (the desc_t part) containing 
> the basic predefined elements in the datatype. As most of the predefined 
> datatypes defined in the MPI layer are synonyms to some basic predefined 
> datatypes (such as the equivalent POSIX types MPI_INT32_T), the design of the 
> datatype allowed for the sharing of the desc_t part between datatypes. This 
> approach allows us to have similar datatypes (MPI_INT and MPI_INT32_T) with 
> different names but with the same backend internal description. However, when 
> we split the datatype engine in two, we duplicate this common description (in 
> OPAL and OMPI). The OMPI desc_t was pointing to OPAL desc_t for almost 
> everything … except the datatypes that were not defined by OPAL such as the 
> Fortran one. This turned the management of the common desc_t into a nightmare 
> … with the effect you noticed few days ago. Too bad for the optimization 
> part. I now duplicate the desc_t between the two layers, and all OMPI 
> datatypes have now their own desc_t.
> 
> Thanks for finding and analyzing so deeply this issue.
>   George.
> 
> 
> 
> 
> On May 16, 2013, at 12:04 , KAWASHIMA Takahiro <rivis.kawash...@nifty.com> 
> wrote:
> 
> > Hi,
> > 
> > I'm reading the datatype code in Open MPI trunk and have a question.
> > A bit long.
> > 
> > See the following program.
> > 
> > 
> > #include <stdio.h>
> > #include <mpi.h>
> > 
> > struct opal_datatype_t;
> > extern int opal_init(int *pargc, char ***pargv);
> > extern int opal_finalize(void);
> > extern void opal_datatype_dump(struct opal_datatype_t *type);
> > extern struct opal_datatype_t opal_datatype_int8;
> > 
> > int main(int argc, char **argv)
> > {
> >opal_init(NULL, NULL);
> >    opal_datatype_dump(&opal_datatype_int8);
> >    MPI_Init(NULL, NULL);
> >    opal_datatype_dump(&opal_datatype_int8);
> >MPI_Finalize();
> >opal_finalize();
> >return 0;
> > }
> > 
> > 
> > All variables/functions declared as 'extern' are defined in OPAL.
> > opal_datatype_dump() function outputs internal data of a datatype.
> > I expect the same output on two opal_datatype_dump() calls.
> > But when I run it on an x86_64 machine, I get the following output.
> > 
> > 
> > ompi-trunk/opal-datatype-dump && ompiexec -n 1 ompi-trunk/opal-datatype-dump
> > [ppc.rivis.jp:27886] Datatype 0x600c60[OPAL_INT8] size 8 align 8 id 7 
> > length 1 used 1
> > true_lb 0 true_ub 8 (true_extent 8) lb 0 ub 8 (extent 8)
> > nbElems 1 loops 0 flags 136 (commited contiguous )-cC---P-DB-[---][---]
> >   contain OPAL_INT8
> > --C---P-D--[---][---]  OPAL_INT8 count 1 disp 0x0 (0) extent 8 (size 8)
> > No optimized description
> > 
> > [ppc.rivis.jp:27886] Datatype 0x600c60[OPAL_INT8] size 8 align 8 id 7 
> > length 1 used 1
> > true_lb 0 true_ub 8 (true_extent 8) lb 0 u

[OMPI devel] Datatype initialization bug?

2013-05-16 Thread KAWASHIMA Takahiro
Hi,

I'm reading the datatype code in Open MPI trunk and have a question.
A bit long.

See the following program.


#include <stdio.h>
#include <mpi.h>

struct opal_datatype_t;
extern int opal_init(int *pargc, char ***pargv);
extern int opal_finalize(void);
extern void opal_datatype_dump(struct opal_datatype_t *type);
extern struct opal_datatype_t opal_datatype_int8;

int main(int argc, char **argv)
{
opal_init(NULL, NULL);
    opal_datatype_dump(&opal_datatype_int8);
    MPI_Init(NULL, NULL);
    opal_datatype_dump(&opal_datatype_int8);
MPI_Finalize();
opal_finalize();
return 0;
}


All variables/functions declared as 'extern' are defined in OPAL.
opal_datatype_dump() function outputs internal data of a datatype.
I expect the same output on two opal_datatype_dump() calls.
But when I run it on an x86_64 machine, I get the following output.


ompi-trunk/opal-datatype-dump && ompiexec -n 1 ompi-trunk/opal-datatype-dump
[ppc.rivis.jp:27886] Datatype 0x600c60[OPAL_INT8] size 8 align 8 id 7 length 1 
used 1
true_lb 0 true_ub 8 (true_extent 8) lb 0 ub 8 (extent 8)
nbElems 1 loops 0 flags 136 (commited contiguous )-cC---P-DB-[---][---]
   contain OPAL_INT8
--C---P-D--[---][---]  OPAL_INT8 count 1 disp 0x0 (0) extent 8 (size 8)
No optimized description

[ppc.rivis.jp:27886] Datatype 0x600c60[OPAL_INT8] size 8 align 8 id 7 length 1 
used 1
true_lb 0 true_ub 8 (true_extent 8) lb 0 ub 8 (extent 8)
nbElems 1 loops 0 flags 136 (commited contiguous )-cC---P-DB-[---][---]
   contain OPAL_INT8
--C---P-D--[---][---]   count 1 disp 0x0 (0) extent 8 (size 8971008)
No optimized description


The former output is what I expected. But the latter one is not
identical to the former one and its content datatype has no name
and a very large size.

This line is output in opal_datatype_dump_data_desc() function in
opal/datatype/opal_datatype_dump.c file. It refers
opal_datatype_basicDatatypes[pDesc->elem.common.type]->name and
opal_datatype_basicDatatypes[pDesc->elem.common.type]->size for
the content datatype.

In this case, pDesc->elem.common.type is
opal_datatype_int8.desc.desc[0].elem.common.type and is initialized to 7
in opal_datatype_init() function in opal/datatype/opal_datatype_module.c
file, which is called during opal_init() function.
opal_datatype_int8.desc.desc points &opal_datatype_predefined_elem_desc[7*2].

But if we call MPI_Init() function, the value is overwritten.
ompi_datatype_init() function in ompi/datatype/ompi_datatype_module.c
file, which is called during MPI_Init() function, has similar
procedure to initialize OMPI datatypes.

On initializing ompi_mpi_aint in it, ompi_mpi_aint.dt.super.desc.desc
points &opal_datatype_predefined_elem_desc[7*2], which is also pointed
by opal_datatype_int8, because ompi_mpi_aint is defined by
OMPI_DATATYPE_INIT_PREDEFINED_BASIC_TYPE macro and it uses
OPAL_DATATYPE_INITIALIZER_INT8 macro. So
opal_datatype_int8.desc.desc[0].elem.common.type is overwritten
to 37.

Therefore in the second opal_datatype_dump() function call in my
program, opal_datatype_basicDatatypes[37] is accessed.
But the array length of opal_datatype_basicDatatypes is 25.

Summarize:

  static initializer:
opal_datatype_predefined_elem_desc[25] = {{0, ...}, ...};
    opal_datatype_int8.desc.desc = &opal_datatype_predefined_elem_desc[7*2];
    ompi_mpi_aint.dt.super.desc.desc = &opal_datatype_predefined_elem_desc[7*2];

  opal_init:
opal_datatype_int8.desc.desc.elem.common.type = 7;

  MPI_Init:
ompi_mpi_aint.dt.super.desc.desc.elem.common.type = 37;

  opal_datatype_dump:
access to opal_datatype_predefined_elem_desc[37]

While opal_datatype_dump() function might not be called from
user's programs, breaking opal_datatype_predefined_elem_desc
array in ompi_datatype_init() function is not good.

Though the above is described for opal_datatype_int8 and ompi_mpi_aint,
the same thing happens to other datatypes.

Though I tried to fix this problem, I could not figure out the
correct solution.

  - The first loop in ompi_datatype_init() function should be removed?
But OMPI Fortran datatypes should be initialized in it?

  - All OMPI datatypes should point ompi_datatype_predefined_elem_desc
array? But having same 'type' value in OPAL datatypes and OMPI
datatypes is allowed?

Regards,
KAWASHIMA Takahiro


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-05-01 Thread KAWASHIMA Takahiro
George,

As I wrote in the ticket a few minutes ago, your patch looks good and
it passed my test. My previous patch didn't care about generalized
requests so your patch is better.

Thanks,
Takahiro Kawashima,
from my home

> Takahiro,
> 
> I went over this ticket and attached a new patch. Basically I went over all 
> the possible cases, both in test and wait, and ensure the behavior is always 
> consistent. Please give it a try, and let us know of the outcome.
> 
>   Thanks,
> George.
> 
> 
> 
> On Jan 25, 2013, at 00:53 , "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Jeff,
> > 
> > I've filed the ticket.
> > https://svn.open-mpi.org/trac/ompi/ticket/3475
> > 
> > Thanks,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> >> Many thanks for the summary!
> >> 
> >> Can you file tickets about this stuff against 1.7?  Included your patches, 
> >> etc. 
> >> 
> >> These are pretty obscure issues and I'm ok not fixing them in the 1.6 
> >> branch (unless someone has a burning desire to get them fixed in 1.6). 
> >> 
> >> But we should properly track and fix these in the 1.7 series. I'd mark 
> >> them as "critical" so that they don't get lost in the wilderness of other 
> >> bugs. 
> >> 
> >> Sent from my phone. No type good. 
> >> 
> >> On Jan 22, 2013, at 8:57 PM, "Kawashima, Takahiro" 
> >> <t-kawash...@jp.fujitsu.com> wrote:
> >> 
> >>> George,
> >>> 
> >>> I reported the bug three months ago.
> >>> Your commit r27880 resolved one of the bugs reported by me,
> >>> in another approach.
> >>> 
> >>> http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> >>> 
> >>> But other bugs are still open.
> >>> 
> >>> "(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE."
> >>> in my previous mail is not fixed yet. This can be fixed by my patch
> >>> (ompi/mpi/c/wait.c and ompi/request/request.c part only) attached
> >>> in my another mail.
> >>> 
> >>> http://www.open-mpi.org/community/lists/devel/2012/10/11561.php
> >>> 
> >>> "(2) MPI_Status for an inactive request must be an empty status."
> >>> in my previous mail is partially fixed. MPI_Wait is fixed by your
> >>> r27880. But MPI_Waitall and MPI_Testall should be fixed.
> >>> Codes similar to your r27880 should be inserted to
> >>> ompi_request_default_wait_all and ompi_request_default_test_all.
> >>> 
> >>> You can confirm the fixes by the test program status.c attached in
> >>> my previous mail. Run with -n 2. 
> >>> 
> >>> http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> >>> 
> >>> Regards,
> >>> Takahiro Kawashima,
> >>> MPI development team,
> >>> Fujitsu
> >>> 
> >>>> To be honest it was hanging in one of my repos for some time. If I'm not 
> >>>> mistaken it is somehow related to one active ticket (but I couldn't find 
> >>>> the info). It might be good to push it upstream.
> >>>> 
> >>>> George.
> >>>> 
> >>>> On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)" 
> >>>> <jsquy...@cisco.com> wrote:
> >>>> 
> >>>>> George --
> >>>>> 
> >>>>> Is there any reason not to CMR this to v1.6 and v1.7?
> >>>>> 
> >>>>> 
> >>>>> On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
> >>>>> 
> >>>>>> Author: bosilca (George Bosilca)
> >>>>>> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> >>>>>> New Revision: 27880
> >>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> >>>>>> 
> >>>>>> Log:
> >>>>>> My understanding is that an MPI_WAIT() on an inactive request should
> >>>>>> return the empty status (MPI 3.0 page 52 line 46).
> >>>>>> 
> >>>>>> Text files modified: 
> >>>>>> trunk/ompi/request/req_wait.c | 3 +++  
> >>>>>>
> >>>>>> 1 files changed, 3 insertions(+), 0 deletions(-)
> >>>>>> 
> >>>>>> Modified: trunk/ompi/request/req_wait.c
> >>>>>> ==
> >>>>>> --- trunk/ompi/request/req_wait.cSat Jan 19 19:33:42 2013
> >>>>>> (r27879)
> >>>>>> +++ trunk/ompi/request/req_wait.c2013-01-21 06:35:42 EST (Mon, 21 
> >>>>>> Jan 2013)(r27880)
> >>>>>> @@ -61,6 +61,9 @@
> >>>>>>  }
> >>>>>>  if( req->req_persistent ) {
> >>>>>>  if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> >>>>>> +if (MPI_STATUS_IGNORE != status) {
> >>>>>> +*status = ompi_status_empty;
> >>>>>> +}
> >>>>>>  return OMPI_SUCCESS;
> >>>>>>  }
> >>>>>>  req->req_state = OMPI_REQUEST_INACTIVE;


Re: [OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-04-29 Thread KAWASHIMA Takahiro
Hi,

This MPI-2.2 feature does not seem to be implemented in trunk yet.
How about my patches posted 3 months ago? They can be applied to
the latest trunk. If you don't like them, I can improve them.
I've attached the same patches to this mail again: one for the
implementation of this MPI-2.2 feature and another for bug fixes,
as described in my previous mail.
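
For review, the user-visible behaviour the ticket asks for is the one
below (my reading of MPI-2.2; this small test is not part of the attached
patches): the delete callbacks for attributes set on MPI_COMM_SELF should
run at MPI_Finalize in the reverse order of the MPI_Comm_set_attr calls.

#include <mpi.h>
#include <stdio.h>

static int delete_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    printf("deleting attribute \"%s\"\n", (char *) attr_val);
    return MPI_SUCCESS;
}

int main(int argc, char *argv[])
{
    int key1, key2;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &key1, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &key2, NULL);
    MPI_Comm_set_attr(MPI_COMM_SELF, key1, (void *) "first");
    MPI_Comm_set_attr(MPI_COMM_SELF, key2, (void *) "second");

    /* with the ordering implemented, "second" is printed before "first" */
    MPI_Finalize();
    return 0;
}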

Regards,
KAWASHIMA Takahiro

> Jeff, George,
> 
> I've implemented George's idea for ticket #3123 "MPI-2.2: Ordering of
> attribution deletion callbacks on MPI_COMM_SELF". See attached
> delete-attr-order.patch.
> 
> It is implemented by creating a temporal array of ordered attribute_value_t
> pointers at ompi_attr_delete_all() call using attribute creation sequence
> numbers. It requires linear cost only at the communicator destruction
> stage and its implementation is rather simpler than my previous patch.
> 
> And apart from this MPI-2.2 ticket, I found some minor bugs and typos
> in attribute.c and attribute.h. They can be fixed by the attached
> attribute-bug-fix.patch. All fixes are assembled into one patch file.
> 
> I've pushed my modifications to Bitbucket.
>   
> https://bitbucket.org/rivis/openmpi-delattrorder/src/49bf3dc7cdbc/?at=sequence
> Note that my modifications are in "sequence" branch, not "default" branch.
> I had committed each implementation/fixes independently that are
> assembled in two patches attached to this mail. So you can see
> comment/diff of each modification on Bitbucket.
>   https://bitbucket.org/rivis/openmpi-delattrorder/commits/all
> Changesets eaa2432 and ace994b are for ticket #3123,
> and other 7 latest changesets are for bug/typo-fixes.
> 
> Regards,
> KAWASHIMA Takahiro
> 
> > Jeff,
> > 
> > OK. I'll try implementing George's idea and then you can compare which
> > one is simpler.
> > 
> > Regards,
> > KAWASHIMA Takahiro
> > 
> > > Not that I'm aware of; that would be great.
> > > 
> > > Unlike George, however, I'm not concerned about converting to linear 
> > > operations for attributes.
> > > 
> > > Attributes are not used often, but when they are:
> > > 
> > > a) there aren't many of them (so a linear penalty is trivial)
> > > b) they're expected to be low performance
> > > 
> > > So if it makes the code simpler, I certainly don't mind linear operations.
> > > 
> > > 
> > > 
> > > On Jan 17, 2013, at 9:32 AM, KAWASHIMA Takahiro 
> > > <rivis.kawash...@nifty.com>
> > >  wrote:
> > > 
> > > > George,
> > > > 
> > > > Your idea makes sense.
> > > > Is anyone working on it? If not, I'll try.
> > > > 
> > > > Regards,
> > > > KAWASHIMA Takahiro
> > > > 
> > > >> Takahiro,
> > > >> 
> > > >> Thanks for the patch. I deplore the lost of the hash table in the 
> > > >> attribute management, as the potential of transforming all attributes 
> > > >> operation to a linear complexity is not very appealing.
> > > >> 
> > > >> As you already took the decision C, it means that at the communicator 
> > > >> destruction stage the hash table is not relevant anymore. Thus, I 
> > > >> would have converted the hash table to an ordered list (ordered by the 
> > > >> creation index, a global entity atomically updated every time an 
> > > >> attribute is created), and proceed to destroy the attributed in the 
> > > >> desired order. Thus instead of having a linear operation for every 
> > > >> operation on attributes, we only have a single linear operation per 
> > > >> communicator (and this during the destruction stage).
> > > >> 
> > > >>  George.
> > > >> 
> > > >> On Jan 16, 2013, at 16:37 , KAWASHIMA Takahiro 
> > > >> <rivis.kawash...@nifty.com> wrote:
> > > >> 
> > > >>> Hi,
> > > >>> 
> > > >>> I've implemented ticket #3123 "MPI-2.2: Ordering of attribution 
> > > >>> deletion
> > > >>> callbacks on MPI_COMM_SELF".
> > > >>> 
> > > >>> https://svn.open-mpi.org/trac/ompi/ticket/3123
> > > >>> 
> > > >>> As this ticket says, attributes had been stored in unordered hash.
> > > >>> So I've replaced opal_hash_table_t with opal_list_t and made necessary
> > > >>> modifications for it. And I've also fixed s

Re: [OMPI devel] RFC: opal_list iteration macros

2013-01-30 Thread KAWASHIMA Takahiro
I don't care about the macro names. Either one is OK for me.

Thanks, 
KAWASHIMA Takahiro

> Hmm, maybe something like:
> 
> OPAL_LIST_FOREACH, OPAL_LISTFOREACH_REV, OPAL_LIST_FOREACH_SAFE, 
> OPAL_LIST_FOREACH_REV_SAFE?
> 
> -Nathan
> 
> On Thu, Jan 31, 2013 at 12:36:29AM +0900, KAWASHIMA Takahiro wrote:
> > Hi,
> > 
> > Agreed.
> > But how about backward traversal in addition to forward traversal?
> > e.g. OPAL_LIST_FOREACH_FW, OPAL_LIST_FOREACH_FW_SAFE,
> >  OPAL_LIST_FOREACH_BW, OPAL_LIST_FOREACH_BW_SAFE
> > We sometimes search an item from the end of a list.
> > 
> > Thanks, 
> > KAWASHIMA Takahiro
> > 
> > > What: Add two new macros to opal_list.h:
> > > 
> > > #define opal_list_foreach(item, list, type) \
> > >   for (item = (type *) (list)->opal_list_sentinel.opal_list_next ;  \
> > >item != (type *) &(list)->opal_list_sentinel ;   \
> > >item = (type *) ((opal_list_item_t *) (item))->opal_list_next)
> > > 
> > > #define opal_list_foreach_safe(item, next, list, type)  \
> > >   for (item = (type *) (list)->opal_list_sentinel.opal_list_next,   \
> > >  next = (type *) ((opal_list_item_t *) (item))->opal_list_next ;\
> > >item != (type *) &(list)->opal_list_sentinel ;   \
> > >item = next, next = (type *) ((opal_list_item_t *) 
> > > (item))->opal_list_next)
> > > 
> > > The first macro provides a simple iterator over an unchanging list and 
> > > the second macro is safe for opal_list_item_remove(item).
> > > 
> > > Why: These macros provide a clean way to do the following:
> > > 
> > > for (item = opal_list_get_first (list) ;
> > >  item != opal_list_get_end (list) ;
> > >  item = opal_list_get_next (item)) {
> > >some_class_t *foo = (some_class_t *) item;
> > >...
> > > }
> > > 
> > > becomes:
> > > 
> > > some_class_t *foo;
> > > 
> > > opal_list_foreach(foo, list, some_class_t) {
> > >...
> > > }
> > > 
> > > When: This is a very simple addition but I wanted to give a heads up on 
> > > the devel list because these macros are different from what we usually 
> > > provide (though they should look familiar to those familiar with the 
> > > Linux kernel). I intend to commit these macros to the trunk (and CMR for 
> > > 1.7.1) tomorrow (Wed 01/29/13) around 12:00 PM MST.
> > > 
> > > Thoughts? Comments?
> > > 
> > > -Nathan Hjelm
> > > HPC-3, LANL


Re: [OMPI devel] RFC: opal_list iteration macros

2013-01-30 Thread KAWASHIMA Takahiro
Hi,

Agreed.
But how about backward traversal in addition to forward traversal?
e.g. OPAL_LIST_FOREACH_FW, OPAL_LIST_FOREACH_FW_SAFE,
 OPAL_LIST_FOREACH_BW, OPAL_LIST_FOREACH_BW_SAFE
We sometimes search an item from the end of a list.
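As a concrete illustration of what the backward variants could look like, here is a minimal sketch that simply mirrors the forward macros quoted below. It assumes the backward link in opal_list_item_t is named opal_list_prev; this is only a sketch, not a committed implementation:

#define OPAL_LIST_FOREACH_REV(item, list, type)                         \
  for (item = (type *) (list)->opal_list_sentinel.opal_list_prev ;      \
       item != (type *) &(list)->opal_list_sentinel ;                   \
       item = (type *) ((opal_list_item_t *) (item))->opal_list_prev)

#define OPAL_LIST_FOREACH_REV_SAFE(item, prev, list, type)              \
  for (item = (type *) (list)->opal_list_sentinel.opal_list_prev,       \
         prev = (type *) ((opal_list_item_t *) (item))->opal_list_prev ;\
       item != (type *) &(list)->opal_list_sentinel ;                   \
       item = prev, prev = (type *) ((opal_list_item_t *) (item))->opal_list_prev)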

Thanks, 
KAWASHIMA Takahiro

> What: Add two new macros to opal_list.h:
> 
> #define opal_list_foreach(item, list, type) \
>   for (item = (type *) (list)->opal_list_sentinel.opal_list_next ;  \
>item != (type *) &(list)->opal_list_sentinel ;   \
>item = (type *) ((opal_list_item_t *) (item))->opal_list_next)
> 
> #define opal_list_foreach_safe(item, next, list, type)  \
>   for (item = (type *) (list)->opal_list_sentinel.opal_list_next,   \
>  next = (type *) ((opal_list_item_t *) (item))->opal_list_next ;\
>item != (type *) &(list)->opal_list_sentinel ;   \
>item = next, next = (type *) ((opal_list_item_t *) 
> (item))->opal_list_next)
> 
> The first macro provides a simple iterator over an unchanging list and the 
> second macro is safe for opal_list_item_remove(item).
> 
> Why: These macros provide a clean way to do the following:
> 
> for (item = opal_list_get_first (list) ;
>  item != opal_list_get_end (list) ;
>  item = opal_list_get_next (item)) {
>some_class_t *foo = (some_class_t *) item;
>...
> }
> 
> becomes:
> 
> some_class_t *foo;
> 
> opal_list_foreach(foo, list, some_class_t) {
>...
> }
> 
> When: This is a very simple addition but I wanted to give a heads up on the 
> devel list because these macros are different from what we usually provide 
> (though they should look familiar to those familiar with the Linux kernel). I 
> intend to commit these macros to the trunk (and CMR for 1.7.1) tomorrow (Wed 
> 01/29/13) around 12:00 PM MST.
> 
> Thoughts? Comments?
> 
> -Nathan Hjelm
> HPC-3, LANL


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-24 Thread Kawashima, Takahiro
Jeff,

I've filed the ticket.
https://svn.open-mpi.org/trac/ompi/ticket/3475

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Many thanks for the summary!
> 
> Can you file tickets about this stuff against 1.7?  Included your patches, 
> etc. 
> 
> These are pretty obscure issues and I'm ok not fixing them in the 1.6 branch 
> (unless someone has a burning desire to get them fixed in 1.6). 
> 
> But we should properly track and fix these in the 1.7 series. I'd mark them 
> as "critical" so that they don't get lost in the wilderness of other bugs. 
> 
> Sent from my phone. No type good. 
> 
> On Jan 22, 2013, at 8:57 PM, "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > George,
> > 
> > I reported the bug three months ago.
> > Your commit r27880 resolved one of the bugs I reported,
> > using a different approach.
> > 
> >  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> > 
> > But other bugs are still open.
> > 
> > "(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE."
> > in my previous mail is not fixed yet. This can be fixed by my patch
> > (ompi/mpi/c/wait.c and ompi/request/request.c parts only) attached
> > to another mail of mine.
> > 
> >  http://www.open-mpi.org/community/lists/devel/2012/10/11561.php
> > 
> > "(2) MPI_Status for an inactive request must be an empty status."
> > in my previous mail is partially fixed. MPI_Wait is fixed by your
> > r27880. But MPI_Waitall and MPI_Testall still need to be fixed.
> > Code similar to your r27880 should be added to
> > ompi_request_default_wait_all and ompi_request_default_test_all.
> > 
> > You can confirm the fixes by the test program status.c attached in
> > my previous mail. Run with -n 2. 
> > 
> >  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php
> > 
> > Regards,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> >> To be honest it was hanging in one of my repos for some time. If I'm not 
> >> mistaken it is somehow related to one active ticket (but I couldn't find 
> >> the info). It might be good to push it upstream.
> >> 
> >>  George.
> >> 
> >> On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> >> wrote:
> >> 
> >>> George --
> >>> 
> >>> Is there any reason not to CMR this to v1.6 and v1.7?
> >>> 
> >>> 
> >>> On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
> >>> 
> >>>> Author: bosilca (George Bosilca)
> >>>> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> >>>> New Revision: 27880
> >>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> >>>> 
> >>>> Log:
> >>>> My understanding is that an MPI_WAIT() on an inactive request should
> >>>> return the empty status (MPI 3.0 page 52 line 46).
> >>>> 
> >>>> Text files modified: 
> >>>> trunk/ompi/request/req_wait.c | 3 +++
> >>>>  
> >>>> 1 files changed, 3 insertions(+), 0 deletions(-)
> >>>> 
> >>>> Modified: trunk/ompi/request/req_wait.c
> >>>> ==
> >>>> --- trunk/ompi/request/req_wait.cSat Jan 19 19:33:42 2013(r27879)
> >>>> +++ trunk/ompi/request/req_wait.c2013-01-21 06:35:42 EST (Mon, 21 
> >>>> Jan 2013)(r27880)
> >>>> @@ -61,6 +61,9 @@
> >>>>   }
> >>>>   if( req->req_persistent ) {
> >>>>   if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> >>>> +if (MPI_STATUS_IGNORE != status) {
> >>>> +*status = ompi_status_empty;
> >>>> +}
> >>>>   return OMPI_SUCCESS;
> >>>>   }
> >>>>   req->req_state = OMPI_REQUEST_INACTIVE;


Re: [OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-01-24 Thread KAWASHIMA Takahiro
Jeff, George,

I've implemented George's idea for ticket #3123 "MPI-2.2: Ordering of
attribution deletion callbacks on MPI_COMM_SELF". See attached
delete-attr-order.patch.

It is implemented by creating a temporary array of attribute_value_t
pointers, ordered by attribute creation sequence numbers, inside the
ompi_attr_delete_all() call. It adds linear cost only at the communicator
destruction stage, and its implementation is rather simpler than my previous patch.
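In outline, the approach looks roughly like the sketch below. The names (attr_entry_t, delete_fn_t) are hypothetical and this is not the actual code in the branch; it only illustrates the single linear pass: copy the entries into a temporary array, sort once by creation sequence, and invoke the delete callbacks in that order.

#include <stdlib.h>

typedef int (*delete_fn_t)(int keyval, void *value);

typedef struct {
    int         sequence;   /* global counter value captured when the attribute was created */
    int         keyval;
    void       *value;
    delete_fn_t delete_fn;
} attr_entry_t;

/* Newest first, which gives the reverse-of-creation (LIFO) order that
   MPI-2.2 requires for the delete callbacks on MPI_COMM_SELF. */
static int newest_first(const void *a, const void *b)
{
    const attr_entry_t *x = *(const attr_entry_t *const *) a;
    const attr_entry_t *y = *(const attr_entry_t *const *) b;
    return y->sequence - x->sequence;
}

void delete_all_attributes(attr_entry_t **entries, size_t n)
{
    qsort(entries, n, sizeof(*entries), newest_first);
    for (size_t i = 0; i < n; i++) {
        entries[i]->delete_fn(entries[i]->keyval, entries[i]->value);
    }
}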

And apart from this MPI-2.2 ticket, I found some minor bugs and typos
in attribute.c and attribute.h. They can be fixed by the attached
attribute-bug-fix.patch. All fixes are assembled into one patch file.

I've pushed my modifications to Bitbucket.
  https://bitbucket.org/rivis/openmpi-delattrorder/src/49bf3dc7cdbc/?at=sequence
Note that my modifications are in "sequence" branch, not "default" branch.
I committed each implementation/fix independently; together they are
assembled into the two patches attached to this mail, so you can see
the comment/diff of each modification on Bitbucket.
  https://bitbucket.org/rivis/openmpi-delattrorder/commits/all
Changesets eaa2432 and ace994b are for ticket #3123,
and the other 7 latest changesets are for bug/typo fixes.

Regards,
KAWASHIMA Takahiro

> Jeff,
> 
> OK. I'll try implementing George's idea and then you can compare which
> one is simpler.
> 
> Regards,
> KAWASHIMA Takahiro
> 
> > Not that I'm aware of; that would be great.
> > 
> > Unlike George, however, I'm not concerned about converting to linear 
> > operations for attributes.
> > 
> > Attributes are not used often, but when they are:
> > 
> > a) there aren't many of them (so a linear penalty is trivial)
> > b) they're expected to be low performance
> > 
> > So if it makes the code simpler, I certainly don't mind linear operations.
> > 
> > 
> > 
> > On Jan 17, 2013, at 9:32 AM, KAWASHIMA Takahiro <rivis.kawash...@nifty.com>
> >  wrote:
> > 
> > > George,
> > > 
> > > Your idea makes sense.
> > > Is anyone working on it? If not, I'll try.
> > > 
> > > Regards,
> > > KAWASHIMA Takahiro
> > > 
> > >> Takahiro,
> > >> 
> > >> Thanks for the patch. I deplore the loss of the hash table in the 
> > >> attribute management, as the potential of transforming all attribute 
> > >> operations to linear complexity is not very appealing.
> > >> 
> > >> As you already took the decision C, it means that at the communicator 
> > >> destruction stage the hash table is not relevant anymore. Thus, I would 
> > >> have converted the hash table to an ordered list (ordered by the 
> > >> creation index, a global entity atomically updated every time an 
> > >> attribute is created), and proceed to destroy the attributes in the 
> > >> desired order. Thus instead of having a linear operation for every 
> > >> operation on attributes, we only have a single linear operation per 
> > >> communicator (and this during the destruction stage).
> > >> 
> > >>  George.
> > >> 
> > >> On Jan 16, 2013, at 16:37 , KAWASHIMA Takahiro 
> > >> <rivis.kawash...@nifty.com> wrote:
> > >> 
> > >>> Hi,
> > >>> 
> > >>> I've implemented ticket #3123 "MPI-2.2: Ordering of attribution deletion
> > >>> callbacks on MPI_COMM_SELF".
> > >>> 
> > >>> https://svn.open-mpi.org/trac/ompi/ticket/3123
> > >>> 
> > >>> As this ticket says, attributes had been stored in unordered hash.
> > >>> So I've replaced opal_hash_table_t with opal_list_t and made necessary
> > >>> modifications for it. And I've also fixed some multi-threaded concurrent
> > >>> (get|set|delete)_attr call issues.
> > >>> 
> > >>> By this modification, following behavior changes are introduced.
> > >>> 
> > >>> (A) MPI_(Comm|Type|Win)_(get|set|delete)_attr function may be slower
> > >>> for MPI objects that has many attributes attached.
> > >>> (B) When the user-defined delete callback function is called, the
> > >>> attribute is already removed from the list. In other words,
> > >>> if MPI_(Comm|Type|Win)_get_attr is called by the user-defined
> > >>> delete callback function for the same attribute key, it returns
> > >>> flag = false.
> > >>> (C) Even if the user-defined delete callback function returns non-
> > >>> MPI_SUCCESS value, the attribute is not reverted to the list.

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-22 Thread Kawashima, Takahiro
George,

I reported the bug three months ago.
Your commit r27880 resolved one of the bugs I reported,
using a different approach.

  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php

But other bugs are still open.

"(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE."
in my previous mail is not fixed yet. This can be fixed by my patch
(ompi/mpi/c/wait.c and ompi/request/request.c parts only) attached
to another mail of mine.

  http://www.open-mpi.org/community/lists/devel/2012/10/11561.php

"(2) MPI_Status for an inactive request must be an empty status."
in my previous mail is partially fixed. MPI_Wait is fixed by your
r27880. But MPI_Waitall and MPI_Testall still need to be fixed.
Code similar to your r27880 should be added to
ompi_request_default_wait_all and ompi_request_default_test_all.

You can confirm the fixes by the test program status.c attached in
my previous mail. Run with -n 2. 

  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> To be honest it was hanging in one of my repos for some time. If I'm not 
> mistaken it is somehow related to one active ticket (but I couldn't find the 
> info). It might be good to push it upstream.
> 
>   George.
> 
> On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)"  
> wrote:
> 
> > George --
> > 
> > Is there any reason not to CMR this to v1.6 and v1.7?
> > 
> > 
> > On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
> > 
> >> Author: bosilca (George Bosilca)
> >> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> >> New Revision: 27880
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> >> 
> >> Log:
> >> My understanding is that an MPI_WAIT() on an inactive request should
> >> return the empty status (MPI 3.0 page 52 line 46).
> >> 
> >> Text files modified: 
> >>  trunk/ompi/request/req_wait.c | 3 +++ 
> >> 
> >>  1 files changed, 3 insertions(+), 0 deletions(-)
> >> 
> >> Modified: trunk/ompi/request/req_wait.c
> >> ==
> >> --- trunk/ompi/request/req_wait.c  Sat Jan 19 19:33:42 2013(r27879)
> >> +++ trunk/ompi/request/req_wait.c  2013-01-21 06:35:42 EST (Mon, 21 Jan 
> >> 2013)  (r27880)
> >> @@ -61,6 +61,9 @@
> >>}
> >>if( req->req_persistent ) {
> >>if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> >> +if (MPI_STATUS_IGNORE != status) {
> >> +*status = ompi_status_empty;
> >> +}
> >>return OMPI_SUCCESS;
> >>}
> >>req->req_state = OMPI_REQUEST_INACTIVE;


Re: [OMPI devel] MPI-2.2 status #2223, #3127

2013-01-20 Thread Kawashima, Takahiro
Jeff, George,

Thanks for your replies. I'll notify my colleagues of these mails.
Please tell me (or write on the ticket) which repo to use for topo
after you take a look.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Long story short. It is freshly forked from the OMPI trunk, patched with 
> topo-fixes (and not topo-fixes-fixed for some reason). Don't whack them yet, 
> let me take a look more in details.
> 
>   George.
> 
> 
> On Jan 18, 2013, at 17:10 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> wrote:
> 
> > Ok.  If it contains everything you put on the original topo-fixes (and 
> > topo-fixes-fixed), I might as well kill those two repos and put your repo 
> > URL on the ticket.
> > 
> > So -- before I whack those two -- can you absolutely confirm that you've 
> > got everything from the topo-fixes-fixed repo?  IIRC, there was some other 
> > fixes/updates to the topo base in there, not just the new dist_graph 
> > improvements.
> > 
> > 
> > On Jan 18, 2013, at 11:06 AM, George Bosilca <bosi...@icl.utk.edu>
> > wrote:
> > 
> >> It's a fork from the official ompi (well the hg version of it). We will 
> >> push back once we're done.
> >> 
> >> George.
> >> 
> >> On Jan 18, 2013, at 15:42 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> >> wrote:
> >> 
> >>> George --
> >>> 
> >>> Should I pull from your repo into 
> >>> https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed?  Or did you 
> >>> effectively fork, and you guys will put back to SVN when you're done?
> >>> 
> >>> 
> >>> On Jan 18, 2013, at 5:47 AM, George Bosilca <bosi...@icl.utk.edu>
> >>> wrote:
> >>> 
> >>>> Takahiro,
> >>>> 
> >>>> The MPI_Dist_graph effort is happening in 
> >>>> ssh://h...@bitbucket.org/bosilca/ompi-topo. I would definitely be 
> >>>> interested in seeing some test cases, and giving this branch a tough 
> >>>> test.
> >>>> 
> >>>> George.
> >>>> 
> >>>> On Jan 18, 2013, at 02:43 , "Kawashima, Takahiro" 
> >>>> <t-kawash...@jp.fujitsu.com> wrote:
> >>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> Fujitsu is interested in completing MPI-2.2 on Open MPI and Open MPI
> >>>>> -based Fujitsu MPI.
> >>>>> 
> >>>>> We've read wiki and tickets. These two tickets seem to be almost done
> >>>>> but need testing and bug fixing.
> >>>>> 
> >>>>> https://svn.open-mpi.org/trac/ompi/ticket/2223
> >>>>> MPI-2.2: MPI_Dist_graph_* functions missing
> >>>>> 
> >>>>> https://svn.open-mpi.org/trac/ompi/ticket/3127
> >>>>> MPI-2.2: Add reduction support for MPI_C_*COMPLEX and MPI::*COMPLEX
> >>>>> 
> >>>>> My colleagues are planning to work on these. They will write test codes
> >>>>> and try to fix bugs. Test codes and patches can be contributed to the
> >>>>> community. If they cannot fix some bugs, we will report details. They
> >>>>> are planning to complete them in around March.
> >>>>> 
> >>>>> With that, two questions.
> >>>>> 
> >>>>> Are the latest statuses written in these ticket comments correct?
> >>>>> Has there been any more progress?
> >>>>> 
> >>>>> Where are the latest codes?
> >>>>> Ticket #2223 says it is on Jeff's ompi-topo-fixes bitbucket branch.
> >>>>> https://bitbucket.org/jsquyres/ompi-topo-fixes
> >>>>> But Jeff seems to have one more branch with a similar name.
> >>>>> https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed
> >>>>> Ticket #3127 says it is on Jeff's mpi22-c-complex bitbucket branch.
> >>>>> But there is no such branch now.
> >>>>> https://bitbucket.org/jsquyres/mpi22-c-complex


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-20 Thread Kawashima, Takahiro
I've confirmed. Thanks.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Done -- thank you!
> 
> On Jan 11, 2013, at 3:52 AM, "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi Open MPI core members and Rayson,
> > 
> > I've confirmed with the authors and created the BibTeX reference.
> > Could you add an entry on the "Open MPI Publications" page that
> > links to Fujitsu's PDF file? The attached file contains the
> > title, authors, abstract, link URL, and BibTeX reference.
> > 
> > Best regards,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> >> Sorry for not replying sooner.
> >> I'm talking with the authors (they are not on this list) and
> >> will request linking the PDF soon if they allow it.
> >> 
> >> Takahiro Kawashima,
> >> MPI development team,
> >> Fujitsu
> >> 
> >>> Our policy so far was that adding a paper to the list of publications on 
> >>> the Open MPI website was a discretionary action at the authors' request. 
> >>> I don't see any compelling reason to change. Moreover, Fujitsu being a 
> >>> contributor of the Open MPI community, there is no obstacle of adding a 
> >>> link to their paper -- at their request.
> >>> 
> >>>  George.
> >>> 
> >>> On Jan 10, 2013, at 00:15 , Rayson Ho <raysonlo...@gmail.com> wrote:
> >>> 
> >>>> Hi Ralph,
> >>>> 
> >>>> Since the whole journal is available online, and is reachable by
> >>>> Google, I don't believe we can get into copyright issues by providing
> >>>> a link to it (but then, I also know that there are countries that have
> >>>> more crazy web page linking rules!).
> >>>> 
> >>>> http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> >>>> 
> >>>> Rayson
> >>>> 
> >>>> ==
> >>>> Open Grid Scheduler - The Official Open Source Grid Engine
> >>>> http://gridscheduler.sourceforge.net/
> >>>> 
> >>>> Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> >>>> http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> >>>> 
> >>>> 
> >>>> On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>> I'm unaware of any formal criteria. The papers currently located there 
> >>>>> are those written by members of the OMPI community, but we can 
> >>>>> certainly link to something written by someone else, so long as we 
> >>>>> don't get into copyright issues.
> >>>>> 
> >>>>> On Sep 19, 2012, at 11:57 PM, Rayson Ho <raysonlo...@gmail.com> wrote:
> >>>>> 
> >>>>>> I found this paper recently, "MPI Library and Low-Level Communication
> >>>>>> on the K computer", available at:
> >>>>>> 
> >>>>>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> >>>>>> 
> >>>>>> What are the criteria for adding papers to the "Open MPI Publications" 
> >>>>>> page?
> >>>>>> 
> >>>>>> Rayson


[OMPI devel] MPI-2.2 status #2223, #3127

2013-01-17 Thread Kawashima, Takahiro
Hi,

Fujitsu is interested in completing MPI-2.2 on Open MPI and Open MPI
-based Fujitsu MPI.

We've read wiki and tickets. These two tickets seem to be almost done
but need testing and bug fixing.

  https://svn.open-mpi.org/trac/ompi/ticket/2223
  MPI-2.2: MPI_Dist_graph_* functions missing

  https://svn.open-mpi.org/trac/ompi/ticket/3127
  MPI-2.2: Add reduction support for MPI_C_*COMPLEX and MPI::*COMPLEX

My colleagues are planning to work on these. They will write test codes
and try to fix bugs. Test codes and patches can be contributed to the
community. If they cannot fix some bugs, we will report details. They
are planning to complete them in around March.
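For #2223, a minimal test of the kind we have in mind would exercise the adjacent constructor and the neighbor query on a simple ring, roughly as follows (a sketch only, not the test code we will contribute):

#include <mpi.h>
#include <assert.h>

int main(int argc, char **argv)
{
    int rank, size, indeg, outdeg, weighted;
    MPI_Comm dgraph;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int src = (rank + size - 1) % size;   /* receive from the left neighbor */
    int dst = (rank + 1) % size;          /* send to the right neighbor     */

    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   1, &src, MPI_UNWEIGHTED,
                                   1, &dst, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &dgraph);

    MPI_Dist_graph_neighbors_count(dgraph, &indeg, &outdeg, &weighted);
    assert(indeg == 1 && outdeg == 1 && !weighted);

    MPI_Comm_free(&dgraph);
    MPI_Finalize();
    return 0;
}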

With that, two questions.

Are the latest statuses written in these ticket comments correct?
Has there been any more progress?

Where are the latest codes?
Ticket #2223 says it is on Jeff's ompi-topo-fixes bitbucket branch.
  https://bitbucket.org/jsquyres/ompi-topo-fixes
But Jeff seems to have one more branch with a similar name.
  https://bitbucket.org/jsquyres/ompi-topo-fixes-fixed
Ticket #3127 says it is on Jeff's mpi22-c-complex bitbucket branch.
But there is no such branch now.
  https://bitbucket.org/jsquyres/mpi22-c-complex

Best regards,
Takahiro Kawashima,
MPI development team,
Fujitsu


Re: [OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-01-17 Thread KAWASHIMA Takahiro
Jeff,

OK. I'll try implementing George's idea and then you can compare which
one is simpler.

Regards,
KAWASHIMA Takahiro

> Not that I'm aware of; that would be great.
> 
> Unlike George, however, I'm not concerned about converting to linear 
> operations for attributes.
> 
> Attributes are not used often, but when they are:
> 
> a) there aren't many of them (so a linear penalty is trivial)
> b) they're expected to be low performance
> 
> So if it makes the code simpler, I certainly don't mind linear operations.
> 
> 
> 
> On Jan 17, 2013, at 9:32 AM, KAWASHIMA Takahiro <rivis.kawash...@nifty.com>
>  wrote:
> 
> > George,
> > 
> > Your idea makes sense.
> > Is anyone working on it? If not, I'll try.
> > 
> > Regards,
> > KAWASHIMA Takahiro
> > 
> >> Takahiro,
> >> 
> >> Thanks for the patch. I deplore the loss of the hash table in the 
> >> attribute management, as the potential of transforming all attribute 
> >> operations to linear complexity is not very appealing.
> >> 
> >> As you already took the decision C, it means that at the communicator 
> >> destruction stage the hash table is not relevant anymore. Thus, I would 
> >> have converted the hash table to an ordered list (ordered by the creation 
> >> index, a global entity atomically updated every time an attribute is 
> >> created), and proceed to destroy the attributes in the desired order. Thus 
> >> instead of having a linear operation for every operation on attributes, we 
> >> only have a single linear operation per communicator (and this during the 
> >> destruction stage).
> >> 
> >>  George.
> >> 
> >> On Jan 16, 2013, at 16:37 , KAWASHIMA Takahiro <rivis.kawash...@nifty.com> 
> >> wrote:
> >> 
> >>> Hi,
> >>> 
> >>> I've implemented ticket #3123 "MPI-2.2: Ordering of attribution deletion
> >>> callbacks on MPI_COMM_SELF".
> >>> 
> >>> https://svn.open-mpi.org/trac/ompi/ticket/3123
> >>> 
> >>> As this ticket says, attributes had been stored in unordered hash.
> >>> So I've replaced opal_hash_table_t with opal_list_t and made necessary
> >>> modifications for it. And I've also fixed some multi-threaded concurrent
> >>> (get|set|delete)_attr call issues.
> >>> 
> >>> By this modification, following behavior changes are introduced.
> >>> 
> >>> (A) MPI_(Comm|Type|Win)_(get|set|delete)_attr function may be slower
> > >>> for MPI objects that have many attributes attached.
> >>> (B) When the user-defined delete callback function is called, the
> >>> attribute is already removed from the list. In other words,
> >>> if MPI_(Comm|Type|Win)_get_attr is called by the user-defined
> >>> delete callback function for the same attribute key, it returns
> >>> flag = false.
> >>> (C) Even if the user-defined delete callback function returns non-
> >>> MPI_SUCCESS value, the attribute is not reverted to the list.
> >>> 
> >>> (A) is due to a sequential list search instead of a hash. See find_value
> >>> function for its implementation.
> >>> (B) and (C) are due to an atomic deletion of the attribute to allow
> >>> multi-threaded concurrent (get|set|delete)_attr call in 
> >>> MPI_THREAD_MULTIPLE.
> >>> See ompi_attr_delete function for its implementation. I think this does
> >>> not matter because MPI standard doesn't specify behavior in such cases.
> >>> 
> >>> The patch for Open MPI trunk is attached. If you like it, take in
> >>> this patch.
> >>> 
> > > >>> Though I'm an employee of a company, this is my independent and private
> > > >>> work done at home. No intellectual property from my company is involved. If needed,
> > > >>> I'll sign the Individual Contributor License Agreement.


Re: [OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-01-17 Thread KAWASHIMA Takahiro
George,

Your idea makes sense.
Is anyone working on it? If not, I'll try.

Regards,
KAWASHIMA Takahiro

> Takahiro,
> 
> Thanks for the patch. I deplore the loss of the hash table in the attribute 
> management, as the potential of transforming all attribute operations to 
> linear complexity is not very appealing.
> 
> As you already took the decision C, it means that at the communicator 
> destruction stage the hash table is not relevant anymore. Thus, I would have 
> converted the hash table to an ordered list (ordered by the creation index, a 
> global entity atomically updated every time an attribute is created), and 
> proceed to destroy the attributes in the desired order. Thus instead of 
> having a linear operation for every operation on attributes, we only have a 
> single linear operation per communicator (and this during the destruction 
> stage).
> 
>   George.
> 
> On Jan 16, 2013, at 16:37 , KAWASHIMA Takahiro <rivis.kawash...@nifty.com> 
> wrote:
> 
> > Hi,
> > 
> > I've implemented ticket #3123 "MPI-2.2: Ordering of attribution deletion
> > callbacks on MPI_COMM_SELF".
> > 
> >  https://svn.open-mpi.org/trac/ompi/ticket/3123
> > 
> > As this ticket says, attributes had been stored in unordered hash.
> > So I've replaced opal_hash_table_t with opal_list_t and made necessary
> > modifications for it. And I've also fixed some multi-threaded concurrent
> > (get|set|delete)_attr call issues.
> > 
> > By this modification, following behavior changes are introduced.
> > 
> >  (A) MPI_(Comm|Type|Win)_(get|set|delete)_attr function may be slower
> >  for MPI objects that have many attributes attached.
> >  (B) When the user-defined delete callback function is called, the
> >  attribute is already removed from the list. In other words,
> >  if MPI_(Comm|Type|Win)_get_attr is called by the user-defined
> >  delete callback function for the same attribute key, it returns
> >  flag = false.
> >  (C) Even if the user-defined delete callback function returns non-
> >  MPI_SUCCESS value, the attribute is not reverted to the list.
> > 
> > (A) is due to a sequential list search instead of a hash. See find_value
> > function for its implementation.
> > (B) and (C) are due to an atomic deletion of the attribute to allow
> > multi-threaded concurrent (get|set|delete)_attr call in MPI_THREAD_MULTIPLE.
> > See ompi_attr_delete function for its implementation. I think this does
> > not matter because MPI standard doesn't specify behavior in such cases.
> > 
> > The patch for Open MPI trunk is attached. If you like it, take in
> > this patch.
> > 
> > Though I'm an employee of a company, this is my independent and private
> > work done at home. No intellectual property from my company is involved. If needed,
> > I'll sign the Individual Contributor License Agreement.


[OMPI devel] [patch] MPI-2.2: Ordering of attribution deletion callbacks on MPI_COMM_SELF

2013-01-16 Thread KAWASHIMA Takahiro
Hi,

I've implemented ticket #3123 "MPI-2.2: Ordering of attribution deletion
callbacks on MPI_COMM_SELF".

  https://svn.open-mpi.org/trac/ompi/ticket/3123

As this ticket says, attributes had been stored in unordered hash.
So I've replaced opal_hash_table_t with opal_list_t and made necessary
modifications for it. And I've also fixed some multi-threaded concurrent
(get|set|delete)_attr call issues.

By this modification, following behavior changes are introduced.

  (A) MPI_(Comm|Type|Win)_(get|set|delete)_attr function may be slower
  for MPI objects that have many attributes attached.
  (B) When the user-defined delete callback function is called, the
  attribute is already removed from the list. In other words,
  if MPI_(Comm|Type|Win)_get_attr is called by the user-defined
  delete callback function for the same attribute key, it returns
  flag = false.
  (C) Even if the user-defined delete callback function returns non-
  MPI_SUCCESS value, the attribute is not reverted to the list.

(A) is due to a sequential list search instead of a hash. See find_value
function for its implementation.
(B) and (C) are due to an atomic deletion of the attribute to allow
multi-threaded concurrent (get|set|delete)_attr call in MPI_THREAD_MULTIPLE.
See ompi_attr_delete function for its implementation. I think this does
not matter because MPI standard doesn't specify behavior in such cases.
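For reference, the user-visible behavior that ticket #3123 targets can be seen with a small program like the one below (an illustration only, not part of the patch). Per MPI-2.2, the delete callbacks attached to MPI_COMM_SELF must run in the reverse order of setting when MPI_Finalize is called, so "second" must be printed before "first".

#include <mpi.h>
#include <stdio.h>

static int delete_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    printf("deleting %s\n", (char *) attr_val);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int key1, key2;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &key1, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &key2, NULL);
    MPI_Comm_set_attr(MPI_COMM_SELF, key1, (void *) "first");
    MPI_Comm_set_attr(MPI_COMM_SELF, key2, (void *) "second");
    MPI_Finalize();   /* delete callbacks on MPI_COMM_SELF fire here */
    return 0;
}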

The patch for Open MPI trunk is attached. If you like it, take in
this patch.

Though I'm an employee of a company, this is my independent and private
work done at home. No intellectual property from my company is involved. If needed,
I'll sign the Individual Contributor License Agreement.

Regards,
KAWASHIMA Takahiro


delete-attr-order.patch.gz
Description: Binary data


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-11 Thread Kawashima, Takahiro
Hi Open MPI core members and Rayson,

I've confirmed with the authors and created the BibTeX reference.
Could you add an entry on the "Open MPI Publications" page that
links to Fujitsu's PDF file? The attached file contains the
title, authors, abstract, link URL, and BibTeX reference.

Best regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Sorry for not replying sooner.
> I'm talking with the authors (they are not on this list) and
> will request linking the PDF soon if they allow it.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Our policy so far was that adding a paper to the list of publications on the 
> > Open MPI website was a discretionary action at the authors' request. I 
> > don't see any compelling reason to change. Moreover, Fujitsu being a 
> > contributor of the Open MPI community, there is no obstacle of adding a 
> > link to their paper -- at their request.
> > 
> >   George.
> > 
> > On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
> > 
> > > Hi Ralph,
> > > 
> > > Since the whole journal is available online, and is reachable by
> > > Google, I don't believe we can get into copyright issues by providing
> > > a link to it (but then, I also know that there are countries that have
> > > more crazy web page linking rules!).
> > > 
> > > http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> > > 
> > > Rayson
> > > 
> > > ==
> > > Open Grid Scheduler - The Official Open Source Grid Engine
> > > http://gridscheduler.sourceforge.net/
> > > 
> > > Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> > > http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> > > 
> > > 
> > > On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> > >> I'm unaware of any formal criteria. The papers currently located there 
> > >> are those written by members of the OMPI community, but we can certainly 
> > >> link to something written by someone else, so long as we don't get into 
> > >> copyright issues.
> > >> 
> > >> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
> > >> 
> > >>> I found this paper recently, "MPI Library and Low-Level Communication
> > >>> on the K computer", available at:
> > >>> 
> > >>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> > >>> 
> > >>> What are the criteria for adding papers to the "Open MPI Publications" 
> > >>> page?
> > >>> 
> > >>> Rayson

Title: MPI Library and Low-Level Communication on the K computer

Author(s): Naoyuki Shida, Shinji Sumimoto, Atsuya Uno

Abstract:

The key to raising application performance in a massively parallel system like the K computer is to increase the speed of communication between compute nodes. In the K computer, this inter-node communication is governed by the Message Passing Interface (MPI) communication library and low-level communication. This paper describes the implementation and performance of the MPI communication library, which exploits the new Tofu-interconnect architecture introduced in the K computer to enhance the performance of petascale applications, and low-level communication mechanism, which performs fine-grained control of the Tofu interconnect.

Paper: paper11.pdf (PDF)

Presented: FUJITSU Scientific & Technical Journal 2012-7 (Vol.48, No.3)

Bibtex reference:


 @Article{shida2012:mpi_kcomputer,
  author  = {Naoyuki Shida and Shinji Sumimoto and Atsuya Uno},
  title   = {{MPI} Library and Low-Level Communication on the {K computer}},
  journal = {FUJITSU Scientific \& Technical Journal},
  month   = {July},
  year= {2012},
  volume  = {48},
  number  = {3},
  pages   = {324--330}
} 







Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-10 Thread Kawashima, Takahiro
Hi,

Sorry for not replying sooner.
I'm talking with the authors (they are not on this list) and
will request linking the PDF soon if they allow it.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Our policy so far was that adding a paper to the list of publications on the 
> Open MPI website was a discretionary action at the authors' request. I don't 
> see any compelling reason to change. Moreover, Fujitsu being a contributor of 
> the Open MPI community, there is no obstacle of adding a link to their paper 
> -- at their request.
> 
>   George.
> 
> On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
> 
> > Hi Ralph,
> > 
> > Since the whole journal is available online, and is reachable by
> > Google, I don't believe we can get into copyright issues by providing
> > a link to it (but then, I also know that there are countries that have
> > more crazy web page linking rules!).
> > 
> > http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> > 
> > Rayson
> > 
> > ==
> > Open Grid Scheduler - The Official Open Source Grid Engine
> > http://gridscheduler.sourceforge.net/
> > 
> > Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> > http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> > 
> > 
> > On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> >> I'm unaware of any formal criteria. The papers currently located there are 
> >> those written by members of the OMPI community, but we can certainly link 
> >> to something written by someone else, so long as we don't get into 
> >> copyright issues.
> >> 
> >> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
> >> 
> >>> I found this paper recently, "MPI Library and Low-Level Communication
> >>> on the K computer", available at:
> >>> 
> >>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> >>> 
> >>> What are the criteria for adding papers to the "Open MPI Publications" 
> >>> page?
> >>> 
> >>> Rayson
> >>> 
> >>> ==
> >>> Open Grid Scheduler - The Official Open Source Grid Engine
> >>> http://gridscheduler.sourceforge.net/
> >>> 
> >>> 
> >>> On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  
> >>> wrote:
>  Dear Yuki and Takahiro,
>  
>  Thanks for the bug report and for the patch. I pushed a [nearly 
>  identical] patch in the trunk in 
>  https://svn.open-mpi.org/trac/ompi/changeset/25488. A special version 
>  for the 1.4 has been prepared and has been attached to the ticket #2916 
>  (https://svn.open-mpi.org/trac/ompi/ticket/2916).
>  
>  Thanks,
>  george.
>  
>  
>  On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
>  
> > Dear Open MPI community,
> > 
> > I'm a member of the MPI library development team at Fujitsu.
> > Takahiro Kawashima, who sent mail before, is my colleague.
> > We are starting to send our feedback.
> > 
> > First, we fixed a problem with MPI_LB/MPI_UB and data packing.
> > 
> > The program crashes when all of the following conditions are met:
> > a: The type of the data being sent is a contiguous derived type.
> > b: MPI_LB and/or MPI_UB is used in the data type.
> > c: The size of the data being sent is smaller than its extent (the data type has a gap).
> > d: The send count is greater than 1.
> > e: The total data size is larger than the "eager limit".
> > 
> > This problem occurs with the attached C program.
> > 
> > An invalid address access occurs
> > because "done" takes an unintended value and
> > the value of "max_allowed" becomes negative
> > in the following place in "ompi/datatype/datatype_pack.c" (in version 
> > 1.4.3).
> > 
> > 
> > (ompi/datatype/datatype_pack.c)
> > 188 packed_buffer = (unsigned char *) 
> > iov[iov_count].iov_base;
> > 189 done = pConv->bConverted - i * pData->size;  /* partial 
> > data from last pack */
> > 190 if( done != 0 ) {  /* still some data to copy from the 
> > last time */
> > 191 done = pData->size - done;
> > 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
> > pConv->pBaseBuf, pData, pConv->count );
> > 193 MEMCPY_CSUM( packed_buffer, user_memory, done, 
> > pConv );
> > 194 packed_buffer += done;
> > 195 max_allowed -= done;
> > 196 total_bytes_converted += done;
> > 197 user_memory += (extent - pData->size + done);
> > 198 }
> > 
> > This code assumes "done" is the size of the partial data from the last pack.
> > However, when the program crashes, "done" equals the total 
> > transmitted data size.
> > This makes "max_allowed" negative.
> > 
> 

Re: [OMPI devel] [patch] SEGV on processing unexpected messages

2012-10-18 Thread Kawashima, Takahiro
George, Brian,

I also think my patch is icky. George's patch may be nicer.

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> Nice catch. A nicer fix will be to check the type of the header, and copy the 
> header accordingly. Attached is a patch following this idea.
> 
>   Thanks,
> george.
> 
> 
> On Oct 18, 2012, at 03:06 , "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi Open MPI developers,
> >
> > I found another issue in Open MPI.
> >
> > In MCA_PML_OB1_RECV_FRAG_INIT macro in ompi/mca/pml/ob1/pml_ob1_recvfrag.h
> > file, we copy a PML header from an arrived message to another buffer,
> > as follows:
> >
> >frag->hdr = *(mca_pml_ob1_hdr_t*)hdr;
> >
> > On this copy, we cast hdr to mca_pml_ob1_hdr_t, which is a union
> > of all actual header structs such as mca_pml_ob1_match_hdr_t.
> > This means we copy a buffer of the size of the largest header
> > even if the arrived message is smaller than that. This can cause a
> > SEGV if the arrived message is small and lies at the very end
> > of a page. In fact, my tofu BTL, the BTL component of Fujitsu
> > MPI for the K computer, suffered from this.
> >
> > The attached patch is one possible fix for this issue.
> > This fix assumes that the arrived header has at least segs[0].seg_len
> > bytes. This is always true for the current Open MPI code because hdr
> > equals segs[0].seg_addr.pval. There may be a smarter fix.


[OMPI devel] [patch] SEGV on processing unexpected messages

2012-10-17 Thread Kawashima, Takahiro
Hi Open MPI developers,

I found another issue in Open MPI.

In MCA_PML_OB1_RECV_FRAG_INIT macro in ompi/mca/pml/ob1/pml_ob1_recvfrag.h
file, we copy a PML header from an arrived message to another buffer,
as follows:

frag->hdr = *(mca_pml_ob1_hdr_t*)hdr;

On this copy, we cast hdr to mca_pml_ob1_hdr_t, which is a union
of all actual header structs such as mca_pml_ob1_match_hdr_t.
This means we copy a buffer of the size of the largest header
even if the arrived message is smaller than that. This can cause a
SEGV if the arrived message is small and lies at the very end
of a page. In fact, my tofu BTL, the BTL component of Fujitsu
MPI for the K computer, suffered from this.

The attached patch is one possible fix for this issue.
This fix assumes that the arrived header has at least segs[0].seg_len
bytes. This is always true for the current Open MPI code because hdr
equals segs[0].seg_addr.pval. There may be a smarter fix.
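To illustrate the overread in plain C (generic structures, not the actual OMPI header types):

#include <string.h>

struct small_hdr { char type; char flags; };
struct big_hdr   { char type; char flags; long payload[8]; };

union any_hdr {
    struct small_hdr small;
    struct big_hdr   big;
};

void copy_header(union any_hdr *dst, const void *msg, size_t msg_len)
{
    /* Unsafe: a union assignment reads sizeof(union any_hdr) bytes from
       msg even if the sender only provided sizeof(struct small_hdr).
       If the small header ends exactly at a page boundary, the extra
       bytes read can fault. */
    /* *dst = *(const union any_hdr *) msg; */

    /* Safe: never read more than the message actually contains. */
    size_t len = msg_len < sizeof(*dst) ? msg_len : sizeof(*dst);
    memcpy(dst, msg, len);
}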

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: ompi/mca/pml/ob1/pml_ob1_recvfrag.h
===
--- ompi/mca/pml/ob1/pml_ob1_recvfrag.h	(revision 27446)
+++ ompi/mca/pml/ob1/pml_ob1_recvfrag.h	(working copy)
@@ -69,7 +69,9 @@
 unsigned char* _ptr = (unsigned char*)frag->addr;   \
 /* init recv_frag */\
 frag->btl = btl;\
-frag->hdr = *(mca_pml_ob1_hdr_t*)hdr;   \
+_size = (segs[0].seg_len < sizeof(mca_pml_ob1_hdr_t)) ? \
+segs[0].seg_len : sizeof(mca_pml_ob1_hdr_t);\
+memcpy( >hdr, hdr, _size);\
 frag->num_segments = 1; \
 _size = segs[0].seg_len;\
 for( i = 1; i < cnt; i++ ) {\
Copyright (c) 2012   FUJITSU LIMITED.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer listed
in this license in the documentation and/or other materials
provided with the distribution.

* Neither the name of the copyright holders nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

The copyright holders provide no reassurances that the source code
provided does not infringe any patent, copyright, or any other
intellectual property rights of third parties.  The copyright holders
disclaim any liability to any recipient for claims brought against
recipient by any third party for infringement of that parties
intellectual property rights.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.




Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-15 Thread Kawashima, Takahiro
George,

Thanks for your reply.

Yes, as you explained, the case of passing an inactive request to MPI_Test
is handled correctly. But the problem is the case of passing an inactive request to
MPI_Wait, MPI_Waitall, or MPI_Testall, as I explained in my first mail.

See code below:

/* make a inactive request */
MPI_Recv_init(, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, );
MPI_Start();
do {
MPI_Test(, , );
} while (completed == 0);

/* wait for an inactive request */
MPI_Wait(, );

In this code, the ompi_request_t object is marked as inactive in
MPI_Test (ompi_request_default_test) but its req_status field remains
unchanged.

Succeeding MPI_Wait (ompi_request_default_wait) sets a user-supplied
status object using req_status field of the request in req_wait.c
line 57.

if( MPI_STATUS_IGNORE != status ) {
status->MPI_TAG= req->req_status.MPI_TAG;
status->MPI_SOURCE = req->req_status.MPI_SOURCE;
status->_ucount= req->req_status._ucount;
status->_cancelled = req->req_status._cancelled;
}

Thus, the user will see a non-empty status after MPI_Wait in the code
above.

You can see this phenomenon with my status.c attached in my first mail.
Run with -n 2.

To avoid this problem, we have two options.
 A. Set req_status field to empty when we mark a request
as inactive, or
 B. Add if-statements for an inactive request in order to set
a user-supplied status object to empty in ompi_request_default_wait
etc.

For least astonishment, I think A. is better.
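Concretely, with option A the following check passes. This is a sketch in the spirit of the status.c test mentioned above, not the attached file itself (run with -n 2):

#include <mpi.h>
#include <assert.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, completed = 0, count;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv_init(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Start(&req);
        do {
            MPI_Test(&req, &completed, &status);
        } while (!completed);               /* request becomes inactive here */

        MPI_Wait(&req, &status);            /* wait on the inactive request */
        MPI_Get_count(&status, MPI_INT, &count);
        assert(status.MPI_SOURCE == MPI_ANY_SOURCE);   /* empty status */
        assert(status.MPI_TAG    == MPI_ANY_TAG);
        assert(count == 0);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}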

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> I fail to see the cases your patch addresses. I recognize I did not have the 
> time to look over all the instances where we deal with persistent inactive 
> requests, but at the first occurrence, the one in req_test.c line 68, the 
> case you exhibit there is already covered by the test "request->req_state == 
> OMPI_REQUEST_INACTIVE". I see similar checks in all the other test/wait 
> files. Basically, it doesn't matter that we leave the last returned error 
> code on an inactive request, as we always return MPI_STATUS_EMPTY in the 
> status for such requests.
> 
> Thanks,
>   george.
> 
>  
> On Oct 15, 2012, at 07:02 , "Kawashima, Takahiro" 
> <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi Open MPI developers,
> > 
> > How is my updated patch?
> > If there is another concern, I'll try to update it.
> > 
> >>>>> The bugs are:
> >>>>> 
> >>>>> (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.
> >>>>> 
> >>>>> (2) MPI_Status for an inactive request must be an empty status.
> >>>>> 
> >>>>> (3) Possible BUS errors on sparc64 processors.
> >>>>> 
> >>>>>  r23554 fixed possible BUS errors on sparc64 processors.
> >>>>>  But the fix seems to be insufficient.
> >>>>> 
> >>>>>  We should use OMPI_STATUS_SET macro for all user-supplied
> >>>>>  MPI_Status objects.
> >>>> Regarding #3, see also trac ticket 3218. I'm putting a fix back today. Sorry
> >>>> for the delay. One proposed solution was extending the use of the
> >>>> OMPI_STATUS_SET macros, but I think the consensus was to fix the problem
> >>>> in the Fortran layer. Indeed, the Fortran layer already routinely
> >>>> converts between Fortran and C statuses. The problem was that we started
> >>>> introducing optimizations to bypass the Fortran-to-C conversion and that
> >>>> optimization was employed too liberally (e.g., in situations that would
> >>>> introduce the alignment errors you're describing). My patch will clean
> >>>> that up. I'll try to put it back in the next few hours.
> >>> 
> >>> Sorry, I didn't notice the ticket 3218.
> >>> Now I've confirmed your commit r27403.
> >>> Your modification is better for my issue (3).
> >>> 
> >>> With r27403, my patch for issue (1) and (2) needs modification.
> >>> I'll re-send modified patch in a few hours.
> >> 
> >> The updated patch is attached.
> >> This patch addresses bugs (1) and (2) in my previous mail
> >> and fixes some typos in comments.


Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Kawashima, Takahiro
Hi Open MPI developers,

> > > The bugs are:
> > >
> > > (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.
> > >
> > > (2) MPI_Status for an inactive request must be an empty status.
> > >
> > > (3) Possible BUS errors on sparc64 processors.
> > >
> > >   r23554 fixed possible BUS errors on sparc64 processors.
> > >   But the fix seems to be insufficient.
> > >
> > >   We should use OMPI_STATUS_SET macro for all user-supplied
> > >   MPI_Status objects.
> > Regarding #3, see also trac ticket 3218. I'm putting a fix back today. Sorry
> > for the delay. One proposed solution was extending the use of the
> > OMPI_STATUS_SET macros, but I think the consensus was to fix the problem
> > in the Fortran layer. Indeed, the Fortran layer already routinely
> > converts between Fortran and C statuses. The problem was that we started
> > introducing optimizations to bypass the Fortran-to-C conversion and that
> > optimization was employed too liberally (e.g., in situations that would
> > introduce the alignment errors you're describing). My patch will clean
> > that up. I'll try to put it back in the next few hours.
> 
> Sorry, I didn't notice the ticket 3218.
> Now I've confirmed your commit r27403.
> Your modification is better for my issue (3).
> 
> With r27403, my patch for issue (1) and (2) needs modification.
> I'll re-send modified patch in a few hours.

The updated patch is attached.
This patch addresses bugs (1) and (2) in my previous mail
and fixes some typos in comments.

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: ompi/mpi/c/wait.c
===
--- ompi/mpi/c/wait.c	(revision 27414)
+++ ompi/mpi/c/wait.c	(working copy)
@@ -10,6 +10,7 @@
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 2006  Cisco Systems, Inc.  All rights reserved.
+ * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -54,7 +55,7 @@
 
 if (MPI_REQUEST_NULL == *request) {
 if (MPI_STATUS_IGNORE != status) {
-*status = ompi_request_empty.req_status;
+*status = ompi_status_empty;
 /*
  * Per MPI-1, the MPI_ERROR field is not defined for single-completion calls
  */
Index: ompi/request/request.c
===
--- ompi/request/request.c	(revision 27414)
+++ ompi/request/request.c	(working copy)
@@ -13,6 +13,7 @@
  * Copyright (c) 2006-2012 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2009  Sun Microsystems, Inc.  All rights reserved.
  * Copyright (c) 2012  Oak Ridge National Labs.  All rights reserved.
+ * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -111,7 +112,7 @@
 return OMPI_ERROR;
 }
 ompi_request_null.request.req_type = OMPI_REQUEST_NULL;
-ompi_request_null.request.req_status.MPI_SOURCE = MPI_PROC_NULL;
+ompi_request_null.request.req_status.MPI_SOURCE = MPI_ANY_SOURCE;
 ompi_request_null.request.req_status.MPI_TAG = MPI_ANY_TAG;
 ompi_request_null.request.req_status.MPI_ERROR = MPI_SUCCESS;
 ompi_request_null.request.req_status._ucount = 0;
Index: ompi/request/req_wait.c
===
--- ompi/request/req_wait.c	(revision 27414)
+++ ompi/request/req_wait.c	(working copy)
@@ -12,6 +12,7 @@
  * Copyright (c) 2006-2008 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2010-2012 Oracle and/or its affiliates.  All rights reserved.
  * Copyright (c) 2012  Oak Ridge National Labs.  All rights reserved.
+ * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -46,7 +47,7 @@
 #endif
 
 /* return status.  If it's a generalized request, we *have* to
-   invoke the query_fn, even if the user procided STATUS_IGNORE.
+   invoke the query_fn, even if the user provided STATUS_IGNORE.
MPI-2:8.2. */
 if (OMPI_REQUEST_GEN == req->req_type) {
 ompi_grequest_invoke_query(req, >req_status);
@@ -60,11 +61,15 @@
 status->_cancelled = req->req_status._cancelled;
 }
 if( req->req_persistent ) {
+int error = req->req_status.MPI_ERROR;
 if( req->req_state == OMPI_REQUEST_INACTIVE ) {
 return OMPI_SUCCESS;
 }
 req->req_state = OMPI_REQUEST_INACTIVE;
-return req->req_status.MPI_ERROR;
+/* Next time MPI_Wait* or MPI_Test* is called, we must return
+   empty status. */
+req->req_status = ompi_status_empty;
+return error;
 }
 
 /* If there was an error, don't free the request -- just return
@@ -188,6 +193,9 @@
 rc = request->req_status.MPI_ERROR;
 if( 

Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Kawashima, Takahiro
Hi Eugene,

> > I found some bugs in Open MPI and attach a patch to fix them.
> >
> > The bugs are:
> >
> > (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.
> >
> > (2) MPI_Status for an inactive request must be an empty status.
> >
> > (3) Possible BUS errors on sparc64 processors.
> >
> >   r23554 fixed possible BUS errors on sparc64 processors.
> >   But the fix seems to be insufficient.
> >
> >   We should use OMPI_STATUS_SET macro for all user-supplied
> >   MPI_Status objects.
> Regarding #3, see also trac ticket 3218. I'm putting a fix back today. Sorry
> for the delay. One proposed solution was extending the use of the
> OMPI_STATUS_SET macros, but I think the consensus was to fix the problem
> in the Fortran layer. Indeed, the Fortran layer already routinely
> converts between Fortran and C statuses. The problem was that we started
> introducing optimizations to bypass the Fortran-to-C conversion and that
> optimization was employed too liberally (e.g., in situations that would
> introduce the alignment errors you're describing). My patch will clean
> that up. I'll try to put it back in the next few hours.

Sorry, I didn't notice the ticket 3218.
Now I've confirmed your commit r27403.
Your modification is better for my issue (3).

With r27403, my patch for issue (1) and (2) needs modification.
I'll re-send modified patch in a few hours.

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu


[OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Kawashima, Takahiro
Hi Open MPI developers,

I found some bugs in Open MPI and attach a patch to fix them.

The bugs are:

(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.

  3.7.3 Communication Completion in MPI-3.0 (and also MPI-2.2)
  says an MPI_Status object returned by MPI_{Wait|Test}{|any|all}
  must be an empty status for a null request (MPI_REQUEST_NULL).
  And the MPI_SOURCE field of the empty status must be
  MPI_ANY_SOURCE.

  But MPI_Wait, MPI_Waitall, and MPI_Testall set MPI_PROC_NULL
  to the MPI_SOURCE field of such status object.

  This bug is caused by a use of an incorrect variable in
  ompi/mpi/c/wait.c (for MPI_Wait) and by an incorrect
  initialization of ompi_request_null in ompi/request/request.c
  (for MPI_Waitall and MPI_Testall).

(2) MPI_Status for an inactive request must be an empty status.

  3.7.3 Communication Completion in MPI-3.0 (and also MPI-2.2)
  says an MPI_Status object returned by MPI_{Wait|Test}{|any|all}
  must be an empty status for an inactive persistent request.

  But MPI_Wait, MPI_Waitall, and MPI_Testall return an old
  status (that was returned when the request was active) for
  an inactive persistent request.

  This bug is caused by not updating the req_status field of an
  inactive persistent request object in ompi/request/req_wait.c
  and ompi/request/req_test.c.

(3) Possible BUS errors on sparc64 processors.

  r23554 fixed possible BUS errors on sparc64 processors.
  But the fix seems to be insufficient.

  We should use OMPI_STATUS_SET macro for all user-supplied
  MPI_Status objects.

The attached patch is for Open MPI trunk and it also fixes some
typos in comments. A program to reproduce bugs (1) and (2) is
also attached.
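For bug (1), the user-visible check is as simple as the following sketch (not the attached reproducer): per MPI-2.2 3.7.3, waiting on MPI_REQUEST_NULL must return the empty status, whose MPI_SOURCE is MPI_ANY_SOURCE rather than MPI_PROC_NULL.

#include <mpi.h>
#include <assert.h>

int main(int argc, char **argv)
{
    MPI_Request req = MPI_REQUEST_NULL;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Wait(&req, &status);
    assert(status.MPI_SOURCE == MPI_ANY_SOURCE);  /* currently MPI_PROC_NULL */
    assert(status.MPI_TAG == MPI_ANY_TAG);
    MPI_Finalize();
    return 0;
}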

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: ompi/request/request.c
===
--- ompi/request/request.c	(revision 27388)
+++ ompi/request/request.c	(working copy)
@@ -13,6 +13,7 @@
  * Copyright (c) 2006-2012 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2009  Sun Microsystems, Inc.  All rights reserved.
  * Copyright (c) 2012  Oak Ridge National Labs.  All rights reserved.
+ * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -111,7 +112,7 @@
 return OMPI_ERROR;
 }
 ompi_request_null.request.req_type = OMPI_REQUEST_NULL;
-ompi_request_null.request.req_status.MPI_SOURCE = MPI_PROC_NULL;
+ompi_request_null.request.req_status.MPI_SOURCE = MPI_ANY_SOURCE;
 ompi_request_null.request.req_status.MPI_TAG = MPI_ANY_TAG;
 ompi_request_null.request.req_status.MPI_ERROR = MPI_SUCCESS;
 ompi_request_null.request.req_status._ucount = 0;
Index: ompi/request/req_wait.c
===
--- ompi/request/req_wait.c	(revision 27388)
+++ ompi/request/req_wait.c	(working copy)
@@ -12,6 +12,7 @@
  * Copyright (c) 2006-2008 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2010  Oracle and/or its affiliates.  All rights reserved.
  * Copyright (c) 2012  Oak Ridge National Labs.  All rights reserved.
+ * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -46,7 +47,7 @@
 #endif
 
 /* return status.  If it's a generalized request, we *have* to
-   invoke the query_fn, even if the user procided STATUS_IGNORE.
+   invoke the query_fn, even if the user provided STATUS_IGNORE.
MPI-2:8.2. */
 if (OMPI_REQUEST_GEN == req->req_type) {
 ompi_grequest_invoke_query(req, >req_status);
@@ -60,11 +61,15 @@
 status->_cancelled = req->req_status._cancelled;
 }
 if( req->req_persistent ) {
+int error = req->req_status.MPI_ERROR;
 if( req->req_state == OMPI_REQUEST_INACTIVE ) {
 return OMPI_SUCCESS;
 }
 req->req_state = OMPI_REQUEST_INACTIVE;
-return req->req_status.MPI_ERROR;
+/* Next time MPI_Wait* or MPI_Test* is called, we must return
+   empty status. */
+OMPI_STATUS_SET(>req_status, _status_empty);
+return error;
 }
 
 /* If there was an error, don't free the request -- just return
@@ -188,6 +193,9 @@
 rc = request->req_status.MPI_ERROR;
 if( request->req_persistent ) {
 request->req_state = OMPI_REQUEST_INACTIVE;
+/* Next time MPI_Wait* or MPI_Test* is called, we must return
+   empty status. */
+OMPI_STATUS_SET(>req_status, _status_empty);
 } else if (MPI_SUCCESS == rc) {
 /* Only free the request if there is no error on it */
 /* If there's an error while freeing the request,
@@ -254,7 +262,7 @@
 /*
  * confirm the status of the pending requests. We have to do it before
  * taking the condition or otherwise we can miss some requests completion 

Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag

2012-07-26 Thread Kawashima, Takahiro
George,

Thanks for review and commit!
I've confirmed your modification.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Takahiro,
> 
> Indeed we were way too lax on canceling the requests. I modified your patch to 
> correctly deal with the MEMCHECK macro (remove the call from the branch that 
> requires a completion function). The modified patch is attached below. I 
> will commit asap.
> 
>   Thanks,
> george.
> 
> 
> Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
> ===
> --- ompi/mca/pml/ob1/pml_ob1_recvreq.c(revision 26870)
> +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c(working copy)
> @@ -3,7 +3,7 @@
>   * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>   * University Research and Technology
>   * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2009 The University of Tennessee and The University
> + * Copyright (c) 2004-2012 The University of Tennessee and The University
>   * of Tennessee Research Foundation.  All rights
>   * reserved.
>   * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, 
> @@ -15,6 +15,7 @@
>   * Copyright (c) 2012  NVIDIA Corporation.  All rights reserved.
>   * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
>   * reserved.
> + * Copyright (c) 2012  FUJITSU LIMITED.  All rights reserved.
>   * $COPYRIGHT$
>   * 
>   * Additional copyrights may follow
> @@ -97,36 +98,26 @@
>  mca_pml_ob1_recv_request_t* request = 
> (mca_pml_ob1_recv_request_t*)ompi_request;
>  mca_pml_ob1_comm_t* comm = 
> request->req_recv.req_base.req_comm->c_pml_comm;
>  
> -if( true == ompi_request->req_complete ) { /* way to late to cancel this 
> one */
> -/*
> - * Receive request completed, make user buffer accessable.
> - */
> -MEMCHECKER(
> -memchecker_call(_memchecker_base_mem_defined,
> -request->req_recv.req_base.req_addr,
> -request->req_recv.req_base.req_count,
> -request->req_recv.req_base.req_datatype);
> -);
> +if( true == request->req_match_received ) { /* way to late to cancel 
> this one */
> +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not 
> matched isn't it */
>  return OMPI_SUCCESS;
>  }
>  
>  /* The rest should be protected behind the match logic lock */
>  OPAL_THREAD_LOCK(>matching_lock);
> -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match 
> has not been already done */
> -   if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> -  opal_list_remove_item( >wild_receives, 
> (opal_list_item_t*)request );
> -   } else {
> -  mca_pml_ob1_comm_proc_t* proc = comm->procs + 
> request->req_recv.req_base.req_peer;
> -  opal_list_remove_item(>specific_receives, 
> (opal_list_item_t*)request);
> -   }
> -   PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> -&(request->req_recv.req_base), PERUSE_RECV );
> -   /**
> -* As now the PML is done with this request we have to force the 
> pml_complete
> -* to true. Otherwise, the request will never be freed.
> -*/
> -   request->req_recv.req_base.req_pml_complete = true;
> +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> +opal_list_remove_item( >wild_receives, 
> (opal_list_item_t*)request );
> +} else {
> +mca_pml_ob1_comm_proc_t* proc = comm->procs + 
> request->req_recv.req_base.req_peer;
> +opal_list_remove_item(>specific_receives, 
> (opal_list_item_t*)request);
>  }
> +PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> + &(request->req_recv.req_base), PERUSE_RECV );
> +/**
> + * As now the PML is done with this request we have to force the 
> pml_complete
> + * to true. Otherwise, the request will never be freed.
> + */
> +request->req_recv.req_base.req_pml_complete = true;
>  OPAL_THREAD_UNLOCK(>matching_lock);
>  
>  OPAL_THREAD_LOCK(_request_lock);
> @@ -138,7 +129,7 @@
>  MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
>  OPAL_THREAD_UNLOCK(_request_lock);
>  /*
> - * Receive request cancelled, make user buffer accessable.
> + * Receive request cancelled, make user buffer 

[OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag

2012-07-26 Thread Kawashima, Takahiro
Hi Open MPI developers,

I found a small bug in Open MPI.

See attached program cancelled.c.
In this program, rank 1 tries to cancel an MPI_Irecv and calls an MPI_Recv
instead if the cancellation succeeds. This program should terminate whether
the cancellation succeeds or not. But it leads to a deadlock in MPI_Recv after
printing "MPI_Test_cancelled: 1".
I confirmed it works fine with MPICH2.

The problem is in the mca_pml_ob1_recv_request_cancel function in
ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
the request has been completed. I think it should not accept the
cancellation if the request has already been matched. If it wants to accept
the cancellation, it must push the recv frag back onto the unexpected
message queue and redo the matching. Furthermore, the receive buffer must
be reverted if the received message has already been partially written to
it under a pipeline protocol.

The attached patch, cancel-recv.patch, is a sample fix for this bug against
Open MPI trunk. Though the patch has 65 lines, the main modifications are
adding one if-statement and deleting one if-statement; the other lines are
just indentation changes.
I cannot confirm that the MEMCHECKER part is correct. Could anyone review it
before committing?

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu
#include 
#include 
#include 

/* rendezvous */
#define BUFSIZE1 (1024*1024)
/* eager */
#define BUFSIZE2 (8)

int main(int argc, char *argv[])
{
int myrank, cancelled;
void *buf1, *buf2;
MPI_Request request;
MPI_Status status;

MPI_Init(, );
MPI_Comm_rank(MPI_COMM_WORLD, );
buf1 = malloc(BUFSIZE1);
buf2 = malloc(BUFSIZE2);

if (myrank == 0) {

MPI_Isend(buf1, BUFSIZE1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, );
MPI_Send(buf2, BUFSIZE2, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
MPI_Wait(, );

} else if (myrank == 1) {

MPI_Irecv(buf1, BUFSIZE1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, );
MPI_Recv(buf2, BUFSIZE2, MPI_BYTE, 0, 1, MPI_COMM_WORLD, );
MPI_Cancel();
MPI_Wait(, );

MPI_Test_cancelled(, );
printf("MPI_Test_cancelled: %d\n", cancelled);
fflush(stdout);

if (cancelled) {
MPI_Recv(buf1, BUFSIZE1, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
 );
}

}

MPI_Finalize();
free(buf1);
free(buf2);

return 0;
}
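
(Note, not part of the original mail: the reproducer is typically built with
an MPI compiler wrapper and run with two processes, e.g. "mpicc cancelled.c
-o cancelled" followed by "mpirun -np 2 ./cancelled", since the program uses
ranks 0 and 1.)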
Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
===
--- ompi/mca/pml/ob1/pml_ob1_recvreq.c	(revision 26836)
+++ ompi/mca/pml/ob1/pml_ob1_recvreq.c	(working copy)
@@ -97,36 +97,36 @@
 mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request;
 mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm;
 
-if( true == ompi_request->req_complete ) { /* way to late to cancel this one */
-/*
- * Receive request completed, make user buffer accessable.
- */
-MEMCHECKER(
-memchecker_call(_memchecker_base_mem_defined,
-request->req_recv.req_base.req_addr,
-request->req_recv.req_base.req_count,
-request->req_recv.req_base.req_datatype);
-);
+if( true == request->req_match_received ) { /* way to late to cancel this one */
+if( true == ompi_request->req_complete ) {
+/*
+ * Receive request completed, make user buffer accessable.
+ */
+MEMCHECKER(
+memchecker_call(_memchecker_base_mem_defined,
+request->req_recv.req_base.req_addr,
+request->req_recv.req_base.req_count,
+request->req_recv.req_base.req_datatype);
+);
+}
 return OMPI_SUCCESS;
 }
 
 /* The rest should be protected behind the match logic lock */
 OPAL_THREAD_LOCK(>matching_lock);
-if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */
-   if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
-  opal_list_remove_item( >wild_receives, (opal_list_item_t*)request );
-   } else {
-  mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
-  opal_list_remove_item(>specific_receives, (opal_list_item_t*)request);
-   }
-   PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
-&(request->req_recv.req_base), PERUSE_RECV );
-   /**
-* As now the PML is done with this request we have to force the pml_complete
-* to true. Otherwise, the request will never be freed.
-*/
-   request->req_recv.req_base.req_pml_complete = true;
+if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
+   opal_list_remove_item( >wild_receives, (opal_list_item_t*)request );
+} else {
+