Hi Rafael,
Sorry for the delayed reply.
Below is a program that reproduces the problem.
It prints "INVALID? : true" if a bulk copy transfer fails; otherwise it
prints "INVALID? : false".
I get the error when I run the program on 2 locales with ibv-conduit
(mpi-spawner). The input size is matrixSize = 2000 and tileSize = 200.
Please let me know if you want the input file.
Note that I don't get the error when I run the program on 1 locale. In
addition, I don't get the error with smaller inputs, even on 2 or more
locales (e.g., a 10x10 matrix with 2x2 tiles).
My guess is that transferring a certain amount of data over ibv-conduit
triggers this problem.
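For a rough sense of scale (my own arithmetic, assuming `real` is 8 bytes, which matches elemSize=8 in the debugBulkTransfer log further down this thread), each tile copy moves far more data in the failing configuration than in the passing one:

$$
\begin{aligned}
\text{failing } (200 \times 200 \text{ tiles}):&\quad 40000 \text{ elements} \times 8\ \text{B} = 320000\ \text{B} \approx 312.5\ \text{KiB per tile}\\
\text{passing } (2 \times 2 \text{ tiles}):&\quad 4 \text{ elements} \times 8\ \text{B} = 32\ \text{B per tile}
\end{aligned}
$$

This only shows the size difference between the two cases; it does not pin down a threshold at which the error starts.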
FYI, using udp-conduit (amudprun) does not show the error.
Please let me know if you have any comments or questions.
Best,
Akihiro
--
use BlockDist;

config const matrixSize: int(32) = -1;   // e.g. --matrixSize=2000
config const tileSize: int(32) = -1;     // e.g. --tileSize=200
config const inFile: string = "m_2000.in";

const zero: int(32) = 0;
var tile_array_indices = {zero..tileSize-1, zero..tileSize-1};

// Each tile holds one tileSize x tileSize block of the matrix.
class Tile {
  var tile_array: [tile_array_indices] real;
}

// Read a matrixSize x matrixSize array of reals from fileName.
proc read_2D_array(fileName: string, matrixSize: int(32)) {
  var input_stream = open(fileName, iomode.r);
  var reader = input_stream.reader();
  var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
  var array: [matrix_index_2D] real;
  for ij in matrix_index_2D do {
    reader.read(array(ij));
  }
  // Close the reader before the file it reads from.
  reader.close();
  input_stream.close();
  // if (debug) { writeln("whole array: ", array); }
  return array;
}

proc main(): void {
  writeln("numLocales : ", numLocales);
  var numTiles: int(32) = matrixSize/tileSize;
  var numTiles_2: int(64) = matrixSize/tileSize;
  var whole_array = read_2D_array(inFile, matrixSize);

  // 3D index space of tiles, Block-distributed across the locales.
  var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
  var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
  var lkji_tiles: [ijk_space] Tile;

  // Create each lower-triangular tile on the locale that owns its slot
  // and fill it from the matrix that was read on locale 0.
  for i in zero..numTiles-1 do {
    for j in zero..i do {
      on lkji_tiles(i,j,zero).locale do {
        var curr_tile: Tile = new Tile();
        for (ii,jj) in tile_array_indices do {
          curr_tile.tile_array(ii,jj) = whole_array(i*tileSize+ii, j*tileSize+jj);
        }
        lkji_tiles(i,j,zero) = curr_tile;
      }
    }
  }

  // Copy each tile with whole-array initialization (the path expected to
  // use a bulk transfer) and compare the copy against the original.
  var invalid : bool = false;
  for i in zero..numTiles-1 do {
    for iB in zero..tileSize-1 do {
      for j in zero..i do {
        var temp = lkji_tiles(i,j,zero).tile_array;
        if (i != j) {
          // Off-diagonal tile: check the whole row iB.
          for jB in zero..tileSize-1 do {
            if (temp(iB,jB) != lkji_tiles(i,j,zero).tile_array(iB,jB)) {
              invalid = true;
            }
          }
        } else {
          // Diagonal tile: check only the lower triangle of row iB.
          for jB in zero..iB do {
            if (temp(iB,jB) != lkji_tiles(i,j,zero).tile_array(iB,jB)) {
              invalid = true;
            }
          }
        }
      }
    }
  }
  writeln("INVALID? : ", invalid);
}
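For comparison, here is a minimal standalone sketch (not the benchmark code; `tileDom`, `src`, `dst`, and the `useCopyLoop` flag are hypothetical names I made up for illustration). It copies one tile held on the last locale into an array on locale 0, either with whole-array assignment or with the element-wise copy loop that avoided the failure in my earlier tests. I have not confirmed that this smaller program triggers the error; it is only meant to isolate the two copy paths.

```chapel
// Hypothetical standalone sketch: compare whole-array assignment with an
// element-wise copy loop when the source array lives on another locale.
config const tileSize = 200;       // same tile size as the failing run
config const useCopyLoop = false;  // hypothetical flag: true = element-wise copy

const tileDom = {0..tileSize-1, 0..tileSize-1};
var invalid = false;

// Allocate the source tile on the last locale (locale 0 if only one exists).
on Locales[numLocales-1] {
  var src: [tileDom] real;
  for (ii, jj) in tileDom do
    src(ii, jj) = (ii * tileSize + jj): real;

  // Copy the tile to locale 0, which is remote when numLocales > 1.
  on Locales[0] {
    var dst: [tileDom] real;
    if useCopyLoop {
      // Element-wise copy: one remote read per element.
      for (ii, jj) in tileDom do
        dst(ii, jj) = src(ii, jj);
    } else {
      // Whole-array assignment: the path that may be lowered to a bulk transfer.
      dst = src;
    }
    // Compare the copy against the source, element by element.
    for (ii, jj) in tileDom do
      if dst(ii, jj) != src(ii, jj) then invalid = true;
  }
}

writeln("INVALID? : ", invalid);
```

Running it with --useCopyLoop=true and --useCopyLoop=false on the same locales would show whether only the whole-array assignment path misbehaves.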
On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
> Hi Rafael,
>
> Thanks for your reply.
>
> I inlined my comments below:
>
>> May we have a simplified copy of your code (something like the snippet
>> provided below, but with initial values for tileSize, numTiles_2, k, etc.,
>> i.e., something that compiles) so that we can also give it a go?
> Yes, a simplified version would be better.
> Actually, I have been trying to write a simple program that reproduces this
> problem for several weeks, and I finally managed it this morning.
> Let me ask my advisor whether I can share the code with you.
>
>> Would you also like to try these flags?
>>
>> -suseBulkTransferStride=true -suseBulkTransfer=false
> I tried these flags, but I still get the error.
>
> I'll keep you updated.
>
> Best,
>
> Akihiro
>
> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
>
>> Hi Akihiro,
>>
>> May we have a simplified copy of your code (something like the snippet
>> provided below, but with initial values for tileSize, numTiles_2, k, etc.,
>> i.e., something that compiles) so that we can also give it a go?
>>
>> Would you also like to try these flags?
>>
>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>
>> Thank you,
>>
>> Rafa.
>>
>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>>
>>> Dear Chapel developers,
>>>
>>> This is Akihiro Hayashi, a postdoc at Rice University.
>>> I'm writing to ask about an array copy failure in Chapel.
>>>
>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes, but I
>>> get a strange error.
>>> Please note that I'm using an old version of the Chapel compiler (r21945)
>>> with qthread-1.10 and GASNet-1.20.2 (infiniband-conduit, mpi-spawner),
>>> because the latest version does not work.
>>> With the latest version of the Chapel compiler (r22568) with qthread-1.10
>>> and GASNet-1.22.0 (infiniband-conduit, mpi-spawner), I get a SEGV when
>>> running a simple program (coforall loc in Locales do on loc { writeln(loc); })
>>> across multiple nodes with the mpi spawner.
>>> This is a separate problem that I have not investigated yet; I'll work on it
>>> later.
>>>
>>> The following problem might already be fixed in the latest version, but any
>>> comments and suggestions are appreciated.
>>> Here is part of my code.
>>> The main data structure is a Block-distributed 3-dimensional array, each of
>>> whose elements refers to a Tile object holding a 2-dimensional array.
>>> The array copy statement (liBlock = lkji_tiles(k,k,k+1).tile_array;) is on
>>> Line 11. I want to use this statement because the Chapel compiler generates
>>> bulk transfer code for it, which speeds up program execution.
>>>
>>> // Code
>>> 1: const zero: int(32) = 0;
>>> 2: var tile_array_indices = {zero..tileSize-1, zero..tileSize-1};
>>> 3: class Tile {
>>> 4:   var tile_array: [tile_array_indices] real;
>>> 5: }
>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>> 8: var lkji_tiles: [ijk_space] Tile;
>>> ...
>>> 9: begin {
>>>   ...
>>> 10:   var liBlock: [tile_array_indices] real;
>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
>>> 12:   for (m,n) in tile_array_indices {
>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>> 14:       invalid = true;
>>> 15:     }
>>> 16:   }
>>> 17:   if (invalid) { writeln("Copy Failed"); }
>>> 18:   ...
>>> 19: }
>>> ...
>>>
>>> In my experiment, when running the program on 2 or more locales, the
>>> program prints "Copy Failed", which means the copy "liBlock =
>>> lkji_tiles(k,k,k+1).tile_array;" on Line 11 failed.
>>> This happens sometimes (not always). I confirmed that the copy succeeds if
>>> I replace the array copy on Line 11 with an element-wise copy loop.
>>> I also see the same failure when I replace the array copy on Line 11 with
>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>>
>>> Here is the runtime output when I compile the program with
>>> -s debugBulkTransfer (tileSize=200):
>>>
>>> -- Log starts here
>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0),
>>> len=40000, elemSize=8;
>>> -- End of Log
>>>
>>> In both cases, the runtime internally calls the chpl_comm_get API(*) with
>>> the parameters above, which look correct to me.
>>> (*) Please take a look at the doiBulkTransfer function in
>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>>
>>> Any comments and suggestions are appreciated.
>>>
>>> Best regards,
>>>
>>> Akihiro
>>> _______________________________________________
>>> Chapel-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>
>> __
>> Rafael Asenjo Plaza
>> Dept. Arquitectura de Computadores
>> Complejo Tecnologico Campus de Teatinos
>> E-29071 MALAGA (SPAIN)
>> Tel: +34 95 213 27 91
>> Fax: +34 95 213 27 90
>> http://www.ac.uma.es/~asenjo
>>
>>
>