Hi, Rafael,

Sorry for the delayed reply.
Let me share the program that reproduces the problem (attached below).

As you can see, the program prints "INVALID? : true" if the bulk copy 
returns data that does not match the source; otherwise it prints 
"INVALID? : false".
I get the error when I run the program on 2 locales with ibv-conduit 
(mpi-spawner). The input sizes are matrixSize = 2000 and tileSize = 200. 
Please let me know if you want the input file.
Note that I don't get the error when I run the program on 1 locale. In 
addition, I don't get the error with a smaller data size, even on 2 or more 
locales (e.g., a 10x10 matrix with a 2x2 tile size).
My guess is that the problem is triggered by the combination of ibv-conduit 
and transfers above a certain size.
FYI, the error does not show up with udp-conduit (amudprun).
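
As a rough illustration of the size difference between the failing and 
passing cases, here is a minimal sketch. It assumes each whole-tile copy moves 
tileSize x tileSize real(64) elements in a single transfer, which is 
consistent with the len=40000, elemSize=8 values in the debugBulkTransfer log 
quoted further down for tileSize=200. tileBytes is a hypothetical helper used 
only for illustration; it is not part of the attached program.

```chapel
// Hypothetical helper (not in the attached program): bytes moved when one
// tileSize x tileSize tile of real(64) elements is copied in one transfer.
proc tileBytes(tileSize: int): int {
  const elemSize = 8;  // bytes per real(64); the debug log reports elemSize=8
  return tileSize * tileSize * elemSize;
}

writeln("tileSize = 200: ", tileBytes(200), " bytes per tile copy");
writeln("tileSize = 2:   ", tileBytes(2), " bytes per tile copy");
```

Under that assumption, the failing configuration moves 320000 bytes per tile 
copy, against 32 bytes in the 2x2 case that works.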

Please let me know if you have any comments or questions.

Best,

Akihiro

--

use BlockDist;

config const matrixSize: int(32) = -1;
config const   tileSize: int(32) = -1;
config const     inFile: string = "m_2000.in";
const zero: int(32) = 0;
var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};

class Tile {
   var tile_array: [tile_array_indices] real;
}

proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
   var input_stream = open (fileName, iomode.r);
   var reader = input_stream.reader();
   var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
   var array: [matrix_index_2D] real;

   for ij in matrix_index_2D do {
       reader.read(array(ij));
   }
   // Close the reader before the file it was created from.
   reader.close();
   input_stream.close();
   // if (debug) { writeln("whole array: ",array); }
   return array;
}

proc main(): void {
   writeln("numLocales : ", numLocales);

   var numTiles: int(32) = matrixSize/tileSize;
   var numTiles_2: int(64) = matrixSize/tileSize;

   var whole_array = read_2D_array(inFile, matrixSize);

   var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
   var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
   var lkji_tiles: [ijk_space] Tile;

   for i in zero..numTiles-1 do {
       for j in zero..i do {
           on lkji_tiles(i,j,zero).locale do {
               var curr_tile: Tile = new Tile();
                for (ii,jj) in tile_array_indices do {
                   curr_tile.tile_array(ii,jj) = whole_array(i*tileSize+ii,j*tileSize+jj);
                }
               lkji_tiles(i,j,zero) = curr_tile;
           }
        }
   }
   var invalid : bool = false;
   for i in zero..numTiles-1 do {
        for iB in zero..tileSize-1 do {
           for j in zero..i do {
                // Whole-array copy: this is the bulk transfer under test.
                var temp = lkji_tiles(i,j,zero).tile_array;
                if(i != j) {
                   for jB in zero..tileSize-1 do {
                        if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
                           invalid = true;
                        }
                   }
                } else {
                   for jB in zero..iB do {
                        if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
                           invalid = true;
                        }
                   }
                }
           }
        }
   }
   writeln("INVALID? : ", invalid);

}
On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:

> Hi Rafael,
> 
> Thanks for your reply.
> 
> I inlined my comments below:
> 
>> May we have a simplified copy of your code (kinda the snippet provided below 
>> but with initial values for tileSize, numTiles_2, k, etc. i.e. something 
>> that compiles) so that we can also give it a go?
> Yes, a simplified version of the code would be better.
> Actually, I have been trying to write a simple program that reproduces 
> this problem for several weeks, and I finally managed to do so this morning.
> Let me ask my advisor if we can show you the code.
> 
>> Would you like to try also with these flags?:
>> 
>> -suseBulkTransferStride=true -suseBulkTransfer=false
> I tried these flags, but I still get the error.
> 
> I'll keep you updated.
> 
> Best,
> 
> Akihiro
> 
> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
> 
>> Hi Akihiro,
>> 
>> May we have a simplified copy of your code (kinda the snippet provided below 
>> but with initial values for tileSize, numTiles_2, k, etc. i.e. something 
>> that compiles) so that we can also give it a go?
>> 
>> Would you like to try also with these flags?:
>> 
>> -suseBulkTransferStride=true -suseBulkTransfer=false
>> 
>> Thank you,
>> 
>> Rafa.
>> 
>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>> 
>>> Dear Chapel developers,
>>> 
>>> This is Akihiro Hayashi, a postdoc at Rice University.
>>> I'm writing to ask about an array copy failure in Chapel.
>>> 
>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes, but I 
>>> get a strange error.
>>> Please note that I'm using an old version of the Chapel compiler (r21945) 
>>> with qthread-1.10 and GASNet-1.20.2 (infiniband-conduit, mpi-spawner) 
>>> because the latest version does not work.
>>> With the latest version of the Chapel compiler (r22568), qthread-1.10, and 
>>> GASNet-1.22.0 (infiniband-conduit, mpi-spawner), I get a SEGV when running 
>>> a simple program (coforall loc in Locales do on loc { writeln(loc); }) 
>>> across multiple nodes with the mpi spawner.
>>> That is a separate problem that I have not investigated yet. I'll 
>>> work on it later.
>>> 
>>> The following problem might be fixed in the latest version, but any 
>>> comments and suggestions are appreciated.
>>> Here is part of my code. 
>>> The main data structure is a 3-dimensional distributed array in which 
>>> each element refers to a 2-dimensional array.
>>> You can see the array copy statement (liBlock = 
>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this copy 
>>> statement because the Chapel compiler generates bulk transfer code for it, 
>>> which accelerates program execution.
>>> 
>>> // Code
>>> 1: const zero: int(32) = 0;
>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>> 3: class Tile {
>>> 4:    var tile_array: [tile_array_indices] real;
>>> 5: }
>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>> 8: var lkji_tiles: [ijk_space] Tile;
>>> ...
>>> 9: begin {
>>>  ...
>>> 10: var liBlock: [tile_array_indices] real;
>>> 11: liBlock = lkji_tiles(k,k,k+1).tile_array;
>>> 12: for (m,n) in tile_array_indices {
>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>> 14:        invalid = true;
>>> 15:     }
>>> 16:   }
>>> 17:   if (invalid) { writeln("Copy Failed");}
>>> 18:   ...
>>> 19: }
>>> ...
>>> 
>>> In my experiment, when running the program on 2 or more locales, the 
>>> program prints "Copy Failed", which means that "liBlock = 
>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed.
>>> This happens sometimes (not always), and I confirmed that the copy 
>>> succeeds if I replace the array copy in Line 11 with a copy loop.
>>> Additionally, I see the same behavior when I replace the array copy in 
>>> Line 11 with 
>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>> 
>>> Here is an output log at runtime when I compile the program with -s 
>>> debugBulkTransfer (tileSize=200):
>>> 
>>> -- Log starts here
>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), 
>>> len=40000, elemSize=8;
>>> -- End of Log
>>> 
>>> In both cases, the runtime internally calls the chpl_comm_get API (*), 
>>> which takes the above parameters.
>>> The parameters look correct to me.
>>> (*) Please take a look at the doiBulkTransfer function in 
>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>> 
>>> Any comments and suggestions are appreciated.
>>> 
>>> Best regards,
>>> 
>>> Akihiro
>>> _______________________________________________
>>> Chapel-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>> 
>> __
>> Rafael Asenjo Plaza
>> Dept. Arquitectura de Computadores      
>> Complejo Tecnologico Campus de Teatinos
>> E-29071 MALAGA (SPAIN)
>> Tel: +34 95 213 27 91
>> Fax: +34 95 213 27 90        
>> http://www.ac.uma.es/~asenjo
>> 
>> 
> 

