On Wednesday, 29 January 2014 16:09:18 Akihiro Hayashi wrote:
> Hi, Brad,
>
> Thank you for your suggestions.
>
> > I'm always glad when we can pass the blame to something other than Chapel.
> > :)
>
> Exactly :)
>
> I tried the --enable-debug flag at GASNet configuration time and I rebuilt my
> Chapel compiler and runtime. Then I compiled and ran the program with
> GASNET_BACKTRACE=1, but I don't get any message. Are some debug messages
> supposed to be shown during program execution if we configure GASNet with
> the --enable-debug flag?
>
> Please let me know if you have any suggestions.

First, you can check more things from the Chapel side by compiling with:

    --bounds-checks --local-checks --nil-checks --debug

It can also help to wrap the Chapel communications with:

    startVerboseComm();
    ...
    stopVerboseComm();

On the GASNet side, you can enable tracing, so that all operations are shown,
and you can also collect statistics. To enable them you must add some options
to the GASNet configuration, in the file:

    third-party/gasnet/Makefile

Add these options:

    CHPL_GASNET_CFG_OPTIONS += --enable-segment-$(CHPL_MAKE_COMM_SEGMENT) --enable-allow-gcc4 --enable-stats --enable-trace

Depending on the mxm version on your system, you may also need to add:

    --disable-mxm

After that, to see the trace and the stats you must define these variables:

    export GASNET_TRACEFILE=gas_tracefile.out
    export GASNET_STATSFILE=gas_statsfile.out

Each one contains the name of the file where the information will be written.

Then you can run your program once with parameters that give a correct result
and once with parameters that give a wrong result, and compare the GASNet
traces.

I tried to use the GASNet tests over Christmas, but I cannot remember how I
submitted them to the queue system. Basically it is the testvis test, which
can be compiled from the ibv_conduit using make test-seq. As I said, I did
that and it worked fine, although I didn't change the program to test other
array sizes.

Hope this helps,

Rafael

> Best,
>
> Akihiro
>
> On Jan 29, 2014, at 2:40 PM, Brad Chamberlain wrote:
> > Hi guys --
> >
> > While I'm sorry about the time spent on this issue, I'm always glad when
> > we can pass the blame to something other than Chapel. :)
> >
> > Something I'm wondering is whether, if GASNet is built with runtime checks
> > on (I think this is done using --enable-debug at GASNet configuration
> > time), will the ibv-conduit issue show up as a runtime assertion or
> > something other than a bug?
> > I think we may want to pass along a report
> > of this issue to the GASNet team, in which case this is undoubtedly the
> > first question they'll ask.
> >
> > It would also be great if we had a standalone C+GASNet program that
> > exhibited the issue (as they're not deeply invested in Chapel), but if
> > that isn't a 15-minute exercise, we can tell them how to reproduce in
> > Chapel.
> >
> > Thanks,
> > -Brad
> >
> > On Wed, 29 Jan 2014, Akihiro Hayashi wrote:
> >> Hi, Rafael,
> >>
> >> Thanks for your reply. I inlined my comments below:
> >>
> >>> I’ve been talking to Rafael Larrosa regarding the issue you are
> >>> reporting. He has conducted some experiments with your code on Titan
> >>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
> >>> are no communication problem there. However, we have here in Malaga a
> >>> cluster based on Infiniband (ibv-conduit) and the execution fails on
> >>> that platform. I’ve also confirmed that udp-conduit does not pose any
> >>> problem.
> >>
> >> I really appreciate your and Rafael Larrosa's experiments. I'm glad to
> >> hear that this is not a problem in Chapel compiler.
> >>
> >>> Rafael Larrosa told me that he faced the same issue last month and that
> >>> after spending more than two weeks tackling the problem he is almost
> >>> sure that there is a bug (or a maximum buffer size limitation) in the
> >>> ibv-conduit implementation of gasnet. For his code, he found a turning
> >>> point when the transferred buffer was 128MBytes: smaller communications
> >>> work fine, but larger always fail. He says it was tricky, because when
> >>> you try to isolate the problem (i.e. isolate the particular transfer
> >>> that fails by executing just this single communication) then the
> >>> problem vanish. So it will be challenging to chase this bug.
> >>
> >> Sure, now I understand there is a bug in the ibv-conduit implementation
> >> of gasnet. and yes, It seems fixing the bug is very difficult.
> >> Actually, an original benchmark I want to run spawns many tasks by begin
> >> statement and each task does bulk transfer. I can imagine the benchmark
> >> exceeds some limit like Rafael's code. I'm also wondering why the
> >> simplified code has this problem. There might be another problem.
> >>
> >>> You may want to circumvent the bug by:
> >>>
> >>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
> >>>     -suseBulkTransfer=false). -> Slower comms.
> >>> 2.- Implementing a version of bulkComms that splits big messages into
> >>>     smaller ones. -> wearisome tinkering
> >>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
> >>>     top500 list are based on ibv
> >>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
> >>>     research goal
> >>>
> >>> For the time being we are conducting all our experiments on Cray
> >>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
> >>> we are getting by with 3.
> >>
> >> Exactly, 4 is not my research goal, I'd choose 3 if a benchmark I would
> >> like to run use bulk transfer. Thanks for your suggestions.
> >>
> >>> If Rafael wants to chime in, he can probably give you more details and
> >>> advices, should you want to debug your code at a lower level.
> >>
> >> I would appreciate if he could give me more details. I think I should
> >> mention the bug in my paper or something.
> >>
> >> Best,
> >>
> >> Akihiro
> >>
> >> On Jan 29, 2014, at 4:46 AM, Rafael Asenjo Plaza wrote:
> >>> Hi Akihiro,
> >>>
> >>> I’ve been talking to Rafael Larrosa regarding the issue you are
> >>> reporting. He has conducted some experiments with your code on Titan
> >>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
> >>> are no communication problem there. However, we have here in Malaga a
> >>> cluster based on Infiniband (ibv-conduit) and the execution fails on
> >>> that platform. I’ve also confirmed that udp-conduit does not pose any
> >>> problem.
> >>>
> >>> Rafael Larrosa told me that he faced the same issue last month and that
> >>> after spending more than two weeks tackling the problem he is almost
> >>> sure that there is a bug (or a maximum buffer size limitation) in the
> >>> ibv-conduit implementation of gasnet. For his code, he found a turning
> >>> point when the transferred buffer was 128MBytes: smaller communications
> >>> work fine, but larger always fail. He says it was tricky, because when
> >>> you try to isolate the problem (i.e. isolate the particular transfer
> >>> that fails by executing just this single communication) then the
> >>> problem vanish. So it will be challenging to chase this bug.
> >>>
> >>> You may want to circumvent the bug by:
> >>>
> >>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
> >>>     -suseBulkTransfer=false). -> Slower comms.
> >>> 2.- Implementing a version of bulkComms that splits big messages into
> >>>     smaller ones. -> wearisome tinkering
> >>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
> >>>     top500 list are based on ibv
> >>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
> >>>     research goal
> >>>
> >>> For the time being we are conducting all our experiments on Cray
> >>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
> >>> we are getting by with 3.
> >>>
> >>> If Rafael wants to chime in, he can probably give you more details and
> >>> advices, should you want to debug your code at a lower level.
> >>>
> >>> Regards,
> >>>
> >>> Rafa.
> >>>
> >>> On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:
> >>>> Hi, Rafael,
> >>>>
> >>>> Sorry for the delayed reply.
> >>>> Let me share the program that reproduces the problem. (attached below)
> >>>>
> >>>> As you can see, the program prints "INVALID? : true" if we get bulk
> >>>> copy transfer error, otherwise it prints "INVALID? : false".
> >>>> I get the error when I run the program on 2 locales with ibv-conduit
> >>>> (mpi-spawner). The input data size is : matrixSize = 2000 and tileSize
> >>>> = 200. Please let me know if you want the input file. Note that I
> >>>> don't get the error when I run the program on 1 locale. In addition, I
> >>>> don't get the error with smaller data size even on 2 or more locales
> >>>> (e.g 10x10 matrix and 2x2 tile size). I'm guessing using ibv-conduit
> >>>> and transferring a certain amount of data incurs this problem. FYI,
> >>>> using udp-conduit (amudprun) does not show the error.
> >>>>
> >>>> Please let me know if you have any comments and questions.
> >>>>
> >>>> Best,
> >>>>
> >>>> Akihiro
> >>>>
> >>>> --
> >>>>
> >>>> use BlockDist;
> >>>>
> >>>> config const matrixSize: int(32) = -1;
> >>>> config const tileSize: int(32) = -1;
> >>>> config const inFile: string = "m_2000.in";
> >>>> const zero: int(32) = 0;
> >>>> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
> >>>>
> >>>> class Tile {
> >>>>   var tile_array: [tile_array_indices] real;
> >>>> }
> >>>>
> >>>> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
> >>>>   var input_stream = open (fileName, iomode.r);
> >>>>   var reader = input_stream.reader();
> >>>>   var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
> >>>>   var array: [matrix_index_2D] real;
> >>>>
> >>>>   for ij in matrix_index_2D do {
> >>>>     reader.read(array(ij));
> >>>>   }
> >>>>   input_stream.close();
> >>>>   reader.close();
> >>>>   // if (debug) { writeln("whole array: ",array); }
> >>>>   return array;
> >>>> }
> >>>>
> >>>> proc main(): void {
> >>>>   writeln("numLocales : ", numLocales);
> >>>>
> >>>>   var numTiles: int(32) = matrixSize/tileSize;
> >>>>   var numTiles_2: int(64) = matrixSize/tileSize;
> >>>>
> >>>>   var whole_array = read_2D_array(inFile, matrixSize);
> >>>>
> >>>>   var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
> >>>>   var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
> >>>>   var lkji_tiles: [ijk_space] Tile;
> >>>>
> >>>>   for i in zero..numTiles-1 do {
> >>>>     for j in zero..i do {
> >>>>       on lkji_tiles(i,j,zero).locale do {
> >>>>         var curr_tile: Tile = new Tile();
> >>>>         for (ii,jj) in tile_array_indices do {
> >>>>           curr_tile.tile_array(ii,jj) = whole_array(i*tileSize+ii,j*tileSize+jj);
> >>>>         }
> >>>>         lkji_tiles(i,j,zero) = curr_tile;
> >>>>       }
> >>>>     }
> >>>>   }
> >>>>   var invalid : bool = false;
> >>>>   for i in zero..numTiles-1 do {
> >>>>     for iB in zero..tileSize-1 do {
> >>>>       for j in zero..i do {
> >>>>         var temp = lkji_tiles(i,j,zero).tile_array;
> >>>>         if(i != j) {
> >>>>           for jB in zero..tileSize-1 do {
> >>>>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
> >>>>               invalid = true;
> >>>>             }
> >>>>           }
> >>>>         } else {
> >>>>           for jB in zero..iB do {
> >>>>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
> >>>>               invalid = true;
> >>>>             }
> >>>>           }
> >>>>         }
> >>>>       }
> >>>>     }
> >>>>   }
> >>>>   writeln("INVALID? : ", invalid);
> >>>> }
> >>>>
> >>>> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
> >>>>> Hi Rafael,
> >>>>>
> >>>>> Thanks for your reply.
> >>>>>
> >>>>> I inlined my comments below:
> >>>>>
> >>>>>> May we have a simplified copy of your code (kinda the snippet
> >>>>>> provided below but with initial values for tileSize, numTiles_2, k,
> >>>>>> etc. i.e. something that compiles) so that we can also give it a go?
> >>>>>
> >>>>> Yes, it would be better if we can have a simplified code.
> >>>>> Actually, I have been trying to make a simple code that reproduce this
> >>>>> problem for several weeks. finally I managed to make it this morning.
> >>>>> Let me ask my advisor if we can show you the code.
> >>>>>
> >>>>>> Would you like to try also with these flags?:
> >>>>>>
> >>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
> >>>>>
> >>>>> I tried these flags, but I still get the error.
> >>>>>
> >>>>> I'll keep you updated.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Akihiro
> >>>>>
> >>>>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
> >>>>>> Hi Akihiro,
> >>>>>>
> >>>>>> May we have a simplified copy of your code (kinda the snippet
> >>>>>> provided below but with initial values for tileSize, numTiles_2, k,
> >>>>>> etc. i.e. something that compiles) so that we can also give it a go?
> >>>>>>
> >>>>>> Would you like to try also with these flags?:
> >>>>>>
> >>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
> >>>>>>
> >>>>>> Thank you,
> >>>>>>
> >>>>>> Rafa.
> >>>>>>
> >>>>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
> >>>>>>> Dear Chapel developers,
> >>>>>>>
> >>>>>>> This is Akihiro Hayashi, postdoc at Rice University.
> >>>>>>> I'm writing this to ask array copy failure in chapel.
> >>>>>>>
> >>>>>>> I'm trying to evaluate some chapel benchmark across multiple nodes
> >>>>>>> but I get strange error. Please note that I'm using old version of
> >>>>>>> chapel compiler (r21945) with qthread-1.10 and
> >>>>>>> GASNet-1.20.2(infiniband-conduit, mpi-spawner) because the latest
> >>>>>>> version does not work. With the latest version of chapel compiler
> >>>>>>> (r22568) with qthread-1.10 and GASNet-1.22.0(infiniband-conduit,
> >>>>>>> mpi-spawner), I get SEGV when running simple program (coforall loc
> >>>>>>> in Locales do on loc { writeln(loc); }) across multiple nodes with
> >>>>>>> mpi spawner. This is another problem but I have not investigated
> >>>>>>> this problem yet. I'll work on this later.
> >>>>>>>
> >>>>>>> The following problem might be fixed in the latest version, but any
> >>>>>>> comments and suggestions are appreciated. Here is part of my code.
> >>>>>>> The main data structure is a 3-dimensional array, which is declared
> >>>>>>> as a distributed array that each of its element refers to a
> >>>>>>> 2-dimension array. You can see array copy statement (liBlock =
> >>>>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this
> >>>>>>> copy statement because the Chapel compiler generates bulk transfer
> >>>>>>> code, which accelerates program execution.
> >>>>>>>
> >>>>>>> // Code
> >>>>>>> 1: const zero: int(32) = 0;
> >>>>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
> >>>>>>> 3: class Tile {
> >>>>>>> 4:   var tile_array: [tile_array_indices] real;
> >>>>>>> 5: }
> >>>>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
> >>>>>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
> >>>>>>> 8: var lkji_tiles: [ijk_space] Tile;
> >>>>>>> ...
> >>>>>>> 9: begin {
> >>>>>>> ...
> >>>>>>> 10:   var liBlock: [tile_array_indices] real;
> >>>>>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
> >>>>>>> 12:   for (m,n) in tile_array_indices {
> >>>>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
> >>>>>>> 14:       invalid = true;
> >>>>>>> 15:     }
> >>>>>>> 16:   }
> >>>>>>> 17:   if (invalid) { writeln("Copy Failed");}
> >>>>>>> 18:   ...
> >>>>>>> 19: }
> >>>>>>> ...
> >>>>>>>
> >>>>>>> In my experiment, when running the program on 2 or more locales, the
> >>>>>>> program prints "Copy Failed" which means "liBlock =
> >>>>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed. This happens
> >>>>>>> sometime (not always). and I confirmed the copy is successfully
> >>>>>>> done if I replace the array copy in Line 11 with copy loop.
> >>>>>>> Additionally, I also see the same behavior when I replace the array
> >>>>>>> copy in Line 11 with
> >>>>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
> >>>>>>>
> >>>>>>> Here is an output log at runtime when I compile the program with -s
> >>>>>>> debugBulkTransfer (tileSize=200):
> >>>>>>>
> >>>>>>> -- Log starts here
> >>>>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), len=40000, elemSize=8;
> >>>>>>> -- End of Log
> >>>>>>>
> >>>>>>> In both cases, the runtime internally calls chpl_comm_get API(*) and
> >>>>>>> the API takes the above parameters. I think it looks good.
> >>>>>>> (*) Please take a look at doiBulkTransfer function in
> >>>>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
> >>>>>>>
> >>>>>>> Any comments and suggestions are appreciated.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>>
> >>>>>>> Akihiro
> >>>>>>> _______________________________________________
> >>>>>>> Chapel-developers mailing list
> >>>>>>> [email protected]
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
> >>>>>>
> >>>>>> __
> >>>>>> Rafael Asenjo Plaza
> >>>>>> Dept. Arquitectura de Computadores
> >>>>>> Complejo Tecnologico Campus de Teatinos
> >>>>>> E-29071 MALAGA (SPAIN)
> >>>>>> Tel: +34 95 213 27 91
> >>>>>> Fax: +34 95 213 27 90
> >>>>>> http://www.ac.uma.es/~asenjo

--
Rafael Larrosa Jiménez
Centro de Supercomputación y Bioinformática - http://www.scbi.uma.es
Universidad de Málaga
Edificio de Bioinnovación
C/ Severo Ochoa 34
Parque Tecnológico de Andalucía
29590 Málaga (SPAIN)
EMAIL: [email protected]
TELEF: +34951952788
FAX: +34951952792
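
The trace comparison suggested above (one run that gives a correct result, one that gives a wrong result) could be sketched roughly as follows. This is only an illustration, not part of GASNet or Chapel: the function names and the two file names in the usage note are hypothetical, and the code makes no assumption about the trace file format beyond it being line-oriented text.

```python
import difflib
from pathlib import Path


def diff_traces(good_lines, bad_lines, normalize=lambda line: line.rstrip("\n")):
    """Return unified-diff lines between two traces.

    `normalize` is applied to every line first; by default it only strips
    the trailing newline. Pass your own function to drop run-specific
    fields before comparing.
    """
    good = [normalize(line) for line in good_lines]
    bad = [normalize(line) for line in bad_lines]
    return list(
        difflib.unified_diff(good, bad, fromfile="good", tofile="bad", lineterm="")
    )


def diff_trace_files(good_path, bad_path, **kwargs):
    """Read two trace files from disk and diff them with diff_traces()."""
    good = Path(good_path).read_text(errors="replace").splitlines()
    bad = Path(bad_path).read_text(errors="replace").splitlines()
    return diff_traces(good, bad, **kwargs)
```

For example, if GASNET_TRACEFILE was set to gas_tracefile_ok.out for the passing run and gas_tracefile_bad.out for the failing run (both names are made up here), then `print("\n".join(diff_trace_files("gas_tracefile_ok.out", "gas_tracefile_bad.out")))` lists the lines that differ. The traces probably contain details that change from run to run, so a custom `normalize` that strips those fields may be needed before the diff is readable.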
