Dear Open MPI community,

I'm a member of the MPI library development team at Fujitsu.
Takahiro Kawashima, who mailed this list earlier, is my colleague.
We are starting to feed back our fixes.

First, we fixed a problem with MPI_LB/MPI_UB and data packing.

The program crashes when all of the following conditions are met:
a: The send datatype is a contiguous derived type.
b: MPI_LB and/or MPI_UB is used in the datatype.
c: The size of the send data is smaller than its extent (the datatype has a gap).
d: The send count is larger than 1.
e: The total data size is larger than the "eager limit".

This problem can be reproduced with the attached C program.
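
For quick reference, here is a condensed sketch of a datatype that meets
conditions a-e (the names and sizes are only illustrative; the actual test
is in the attached program):

    /* A block of MPI_INT followed by a gap up to MPI_UB, so size < extent. */
    enum { NINTS  = 1024,           /* assumed data block length (ints)  */
           EXTENT = 8192 };         /* assumed extent in bytes (> 4096)  */
    int          sendbuf[EXTENT];   /* large enough for two elements     */
    MPI_Datatype dtype[3] = { MPI_LB, MPI_INT, MPI_UB };    /* condition b */
    int          block[3] = { 1, NINTS, 1 };
    MPI_Aint     disp[3]  = { 0, 0, EXTENT };                /* condition c */
    MPI_Datatype newtype;

    MPI_Type_struct( 3, block, disp, dtype, &newtype );
    MPI_Type_commit( &newtype );

    /* count > 1 (condition d); the total message must also exceed the
       eager limit (condition e) for the crash to show up. */
    MPI_Send( sendbuf, 2, newtype, 1, 0, MPI_COMM_WORLD );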

An invalid memory access occurs because "done" gets an unintended value
and "max_allowed" becomes negative at the following place in
"ompi/datatype/datatype_pack.c" (in version 1.4.3).


(ompi/datatype/datatype_pack.c)
188             packed_buffer = (unsigned char *) iov[iov_count].iov_base;
189             done = pConv->bConverted - i * pData->size;  /* partial data from last pack */
190             if( done != 0 ) {  /* still some data to copy from the last time */
191                 done = pData->size - done;
192                 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, pConv->pBaseBuf, pData, pConv->count );
193                 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
194                 packed_buffer += done;
195                 max_allowed -= done;
196                 total_bytes_converted += done;
197                 user_memory += (extent - pData->size + done);
198             }

This code assumes that "done" is the size of the partial data left over
from the last pack. However, when the program crashes, "done" equals the
total size of the data transmitted so far, which makes "max_allowed"
negative.

We modified the code as follows and it passed our test suite.
However, we are not sure this fix is correct. Can anyone review it?
A patch (against the Open MPI 1.4 branch) is attached to this mail.

-            if( done != 0 ) {  /* still some data to copy from the last time */
+            if( (done + max_allowed) >= pData->size ) {  /* still some data to copy from the last time */

Best regards,

Yuki MATSUMOTO
MPI development team,
Fujitsu

(2011/06/28 10:58), Takahiro Kawashima wrote:
Dear Open MPI community,

I'm a member of the MPI library development team at Fujitsu. Shinji
Sumimoto, whose name appears in Jeff's blog, is one of our bosses.

As Rayson and Jeff noted, the K computer, the world's most powerful HPC
system, developed by RIKEN and Fujitsu, uses Open MPI as the base of its
MPI library. We at Fujitsu are pleased to announce this, and we give our
special thanks to the Open MPI community. We are sorry for the late
announcement!

Our MPI library is based on the Open MPI 1.4 series and has a new point-
to-point component (BTL) and new topology-aware collective communication
algorithms (COLL). It is also adapted to our runtime environment (ESS,
PLM, GRPCOMM, etc.).

The K computer connects 68,544 nodes with our custom interconnect.
Its runtime environment is our proprietary one, so we don't use orted.
We cannot disclose the start-up time yet because of disclosure restrictions, sorry.

We are impressed by the extensibility of Open MPI, and we have proved that
Open MPI is scalable to the 68,000-process level! It is a pleasure to use
such great open-source software.

We cannot disclose the details of our technology yet because of our
contract with RIKEN AICS; however, we plan to feed back our improvements
and bug fixes. We can contribute some bug fixes soon, but contributing our
improvements will have to wait until next year, subject to agreement with
the Open MPI community.

Best regards,

MPI development team,
Fujitsu


I got more information:

    http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/

Short version: yes, Open MPI is used on K and was used to power the 8PF runs.

w00t!



On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:

w00t!

OMPI powers 8 petaflops!
(at least I'm guessing that -- does anyone know if that's true?)


On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:

Interesting... page 11:

http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf

Open MPI based:

* Open Standard, Open Source, Multi-Platform including PC Cluster.
* Adding extension to Open MPI for "Tofu" interconnect

Rayson



Index: ompi/datatype/datatype_pack.c
===================================================================
--- ompi/datatype/datatype_pack.c       (revision 25474)
+++ ompi/datatype/datatype_pack.c       (working copy)
@@ -187,7 +187,7 @@
 
             packed_buffer = (unsigned char *) iov[iov_count].iov_base;
             done = pConv->bConverted - i * pData->size;  /* partial data from last pack */
-            if( done != 0 ) {  /* still some data to copy from the last time */
+            if( (done + max_allowed) >= pData->size ) {  /* still some data to copy from the last time */
                 done = pData->size - done;
                 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, pConv->pBaseBuf, pData, pConv->count );
                 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
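
For reference, the patch should apply from the top directory of an
Open MPI 1.4 checkout with the standard patch tool, for example (the file
name used here is only a placeholder for whatever the attachment is saved
as):

    patch -p0 < datatype_pack_fix.patch
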
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define MAXSIZE 1024*1024*2

int main( int argc, char *argv[] )
{
    int myrank, size;
    int *sendbuf,*recvbuf;
    int i;
    int count;
    int block[3];
    MPI_Aint disp[3];

    MPI_Status *stat;
    MPI_Request *request;
    MPI_Datatype newtype;
    MPI_Datatype dtype[3];

    MPI_Init( 0, 0 );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

    sendbuf = (int*)malloc(MAXSIZE);
    recvbuf = (int*)malloc(MAXSIZE);
    stat = (MPI_Status*)malloc(sizeof(MPI_Status)*size);
    request = (MPI_Request*)malloc(sizeof(MPI_Request)*size);

    for(i=0;i<MAXSIZE/4;i++){
        sendbuf[i] = 1;
        recvbuf[i] = 0;
    }

    count = 2;

    /* Derived type: MPI_LB at displacement 0, one block of MPI_INT, and
     * MPI_UB at MAXSIZE/count.  Its size is smaller than its extent
     * (there is a gap), and the position of the data block depends on
     * the rank. */
    dtype[0] = MPI_LB;
    dtype[1] = MPI_INT;
    dtype[2] = MPI_UB;

    block[0] = 1;
    block[1] = (MAXSIZE/count)/size/sizeof(int);
    block[2] = 1;

    disp[0] = 0;
    disp[1] = (MAXSIZE/count)/size*myrank;
    disp[2] = MAXSIZE/count;

    MPI_Type_struct( 3, block, disp, dtype, &newtype);
    MPI_Type_commit(&newtype);

    /* count = 2 elements; the total message is large enough to exceed
     * the eager limit */
    if(myrank == 0){
        MPI_Send( sendbuf, count, newtype, 1, 0, MPI_COMM_WORLD);
    }
    if(myrank == 1){
        MPI_Recv( recvbuf, count, newtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* On rank 1, the received pattern should be one gap/data pair per
     * element sent: zeros, ones, zeros, ones. */
    if(myrank == 1){
        for(i=0;i<block[1];i++){
            if(0 != recvbuf[i]){
                printf("MYRANK %d failed 1 recvbuf[%d] %d\n",myrank,i,recvbuf[i]);
                MPI_Finalize();
                exit(0);
            }
        }
        for(i = block[1] ; i<block[1]*2;i++){
            if(1 != recvbuf[i]){
                printf("MYRANK %d failed 2 recvbuf[%d] %d\n",myrank,i,recvbuf[i]);
                MPI_Finalize();
                exit(0);
            }
        }
        for(i = block[1]*2 ; i<block[1]*3;i++){
            if(0 != recvbuf[i]){
                printf("MYRANK %d failed 3 recvbuf[%d] %d\n",myrank,i,recvbuf[i]);
                MPI_Finalize();
                exit(0);
            }
        }
        for(i = block[1]*3 ; i<block[1]*4;i++){
            if(1 != recvbuf[i]){
                printf("MYRANK %d failed 4 recvbuf[%d] %d\n",myrank,i,recvbuf[i]);
                MPI_Finalize();
                exit(0);
            }
        }
    }


    MPI_Type_free(&newtype);
    free(sendbuf);
    free(recvbuf);

    MPI_Finalize();

    return 0;
}
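
The reproducer can be built and run in the usual way, for example (the
source file name below is only a placeholder for whatever the attachment
is saved as):

    mpicc -o lb_ub_pack lb_ub_pack.c
    mpirun -np 2 ./lb_ub_pack

On the unpatched 1.4 branch the sender side crashes during packing; with
the patch applied the program should run to completion without printing
any "failed" message.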
