Hi Rafael,

Thanks for your information.
I'll generate gasnet traces and compare them.

> I have tried to use the gasnet tests, I did that on Christmas, but cannot
> remember how I sent it to the queue system... basically it is the testvis
> test, which can be compiled from the ibv_conduit using make test-seq.
>
> As I said I did that, and it worked fine, although I didn't change the
> program to test other array sizes.

Sure, I'll also try to use the gasnet tests on my environment.

Thanks,

Akihiro

On Jan 29, 2014, at 4:38 PM, Rafael Larrosa Jiménez wrote:

> On Wednesday, January 29, 2014 16:09:18, Akihiro Hayashi wrote:
>> Hi, Brad,
>>
>> Thank you for your suggestions.
>>
>>> I'm always glad when we can pass the blame to something other than Chapel.
>>> :)
>>
>> Exactly :)
>>
>> I tried the --enable-debug flag at GASNet configuration time and I rebuilt
>> my Chapel compiler and runtime. Then I compiled and ran the program with
>> GASNET_BACKTRACE=1, but I don't get any message. Are some debug messages
>> supposed to be shown during program execution if we configure GASNet with
>> the --enable-debug flag?
>>
>> Please let me know if you have any suggestions.
>
> First you can check more things from Chapel, by compiling with:
>
> --bounds-checks --local-checks --nil-checks --debug
>
> It can also help to wrap the Chapel communications with:
>
> startVerboseComm();
> ...
> stopVerboseComm();
>
> From the gasnet side, you can activate a gasnet trace, so that all
> operations are shown, and you can also get statistics. To activate them you
> must add these options to gasnet, in the file:
>
> third-party/gasnet/Makefile
>
> Add these options to gasnet (all on one line):
>
> CHPL_GASNET_CFG_OPTIONS += --enable-segment-$(CHPL_MAKE_COMM_SEGMENT) --enable-allow-gcc4 --enable-stats --enable-trace
>
> Depending upon the mxm version of your system, perhaps you need to add:
> --disable-mxm
>
> After that, to see the trace and the stats you must define these variables:
>
> export GASNET_TRACEFILE=gas_tracefile.out
> export GASNET_STATSFILE=gas_statsfile.out
>
> They contain the name of the file where the info will be written.
>
> Then you can execute your program once with parameters that give a right
> result and once with parameters that give a wrong result, and compare the
> gasnet traces.
>
> I have tried to use the gasnet tests, I did that on Christmas, but cannot
> remember how I sent it to the queue system... basically it is the testvis
> test, which can be compiled from the ibv_conduit using make test-seq.
>
> As I said I did that, and it worked fine, although I didn't change the
> program to test other array sizes.
>
> Hope this helps,
>
> Rafael
>
>> Best,
>>
>> Akihiro
>>
>> On Jan 29, 2014, at 2:40 PM, Brad Chamberlain wrote:
>>> Hi guys --
>>>
>>> While I'm sorry about the time spent on this issue, I'm always glad when
>>> we can pass the blame to something other than Chapel. :)
>>>
>>> Something I'm wondering is whether, if GASNet is built with runtime checks
>>> on (I think this is done using --enable-debug at GASNet configuration
>>> time), will the ibv-conduit issue show up as a runtime assertion or
>>> something other than a bug? I think we may want to pass along a report
>>> of this issue to the GASNet team, in which case this is undoubtedly the
>>> first question they'll ask.
>>>
>>> It would also be great if we had a standalone C+GASNet program that
>>> exhibited the issue (as they're not deeply invested in Chapel), but if
>>> that isn't a 15-minute exercise, we can tell them how to reproduce in
>>> Chapel.
>>>
>>> Thanks,
>>> -Brad
>>>
>>> On Wed, 29 Jan 2014, Akihiro Hayashi wrote:
>>>> Hi, Rafael,
>>>>
>>>> Thanks for your reply. I inlined my comments below:
>>>>
>>>>> I’ve been talking to Rafael Larrosa regarding the issue you are
>>>>> reporting. He has conducted some experiments with your code on Titan
>>>>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
>>>>> is no communication problem there. However, we have here in Malaga a
>>>>> cluster based on Infiniband (ibv-conduit) and the execution fails on
>>>>> that platform. I’ve also confirmed that udp-conduit does not pose any
>>>>> problem.
>>>>
>>>> I really appreciate your and Rafael Larrosa's experiments. I'm glad to
>>>> hear that this is not a problem in the Chapel compiler.
>>>>
>>>>> Rafael Larrosa told me that he faced the same issue last month and that
>>>>> after spending more than two weeks tackling the problem he is almost
>>>>> sure that there is a bug (or a maximum buffer size limitation) in the
>>>>> ibv-conduit implementation of gasnet. For his code, he found a turning
>>>>> point when the transferred buffer was 128MBytes: smaller communications
>>>>> work fine, but larger ones always fail. He says it was tricky, because
>>>>> when you try to isolate the problem (i.e. isolate the particular
>>>>> transfer that fails by executing just this single communication) then
>>>>> the problem vanishes. So it will be challenging to chase this bug.
>>>>
>>>> Sure, now I understand there is a bug in the ibv-conduit implementation
>>>> of gasnet, and yes, it seems fixing the bug is very difficult.
>>>> Actually, the original benchmark I want to run spawns many tasks with
>>>> begin statements and each task does a bulk transfer. I can imagine the
>>>> benchmark exceeds some limit like Rafael's code. I'm also wondering why
>>>> the simplified code has this problem. There might be another problem.
>>>>
>>>>> You may want to circumvent the bug by:
>>>>>
>>>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
>>>>>     -suseBulkTransfer=false). -> Slower comms.
>>>>> 2.- Implementing a version of bulkComms that splits big messages into
>>>>>     smaller ones. -> wearisome tinkering
>>>>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
>>>>>     top500 list are based on ibv
>>>>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
>>>>>     research goal
>>>>>
>>>>> For the time being we are conducting all our experiments on Cray
>>>>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
>>>>> we are getting by with 3.
>>>>
>>>> Exactly, 4 is not my research goal. I'd choose 3 if a benchmark I would
>>>> like to run uses bulk transfer. Thanks for your suggestions.
>>>>
>>>>> If Rafael wants to chime in, he can probably give you more details and
>>>>> advice, should you want to debug your code at a lower level.
>>>>
>>>> I would appreciate it if he could give me more details. I think I should
>>>> mention the bug in my paper or something.
>>>>
>>>> Best,
>>>>
>>>> Akihiro
>>>>
>>>> On Jan 29, 2014, at 4:46 AM, Rafael Asenjo Plaza wrote:
>>>>> Hi Akihiro,
>>>>>
>>>>> I’ve been talking to Rafael Larrosa regarding the issue you are
>>>>> reporting. He has conducted some experiments with your code on Titan
>>>>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
>>>>> is no communication problem there. However, we have here in Malaga a
>>>>> cluster based on Infiniband (ibv-conduit) and the execution fails on
>>>>> that platform. I’ve also confirmed that udp-conduit does not pose any
>>>>> problem.
>>>>>
>>>>> Rafael Larrosa told me that he faced the same issue last month and that
>>>>> after spending more than two weeks tackling the problem he is almost
>>>>> sure that there is a bug (or a maximum buffer size limitation) in the
>>>>> ibv-conduit implementation of gasnet. For his code, he found a turning
>>>>> point when the transferred buffer was 128MBytes: smaller communications
>>>>> work fine, but larger ones always fail. He says it was tricky, because
>>>>> when you try to isolate the problem (i.e. isolate the particular
>>>>> transfer that fails by executing just this single communication) then
>>>>> the problem vanishes. So it will be challenging to chase this bug.
>>>>>
>>>>> You may want to circumvent the bug by:
>>>>>
>>>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
>>>>>     -suseBulkTransfer=false). -> Slower comms.
>>>>> 2.- Implementing a version of bulkComms that splits big messages into
>>>>>     smaller ones. -> wearisome tinkering
>>>>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
>>>>>     top500 list are based on ibv
>>>>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
>>>>>     research goal
>>>>>
>>>>> For the time being we are conducting all our experiments on Cray
>>>>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
>>>>> we are getting by with 3.
>>>>>
>>>>> If Rafael wants to chime in, he can probably give you more details and
>>>>> advice, should you want to debug your code at a lower level.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Rafa.
>>>>>
>>>>> On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:
>>>>>> Hi, Rafael,
>>>>>>
>>>>>> Sorry for the delayed reply.
>>>>>> Let me share the program that reproduces the problem. (attached below)
>>>>>>
>>>>>> As you can see, the program prints "INVALID? : true" if we get a bulk
>>>>>> copy transfer error, otherwise it prints "INVALID? : false". I get the
>>>>>> error when I run the program on 2 locales with ibv-conduit
>>>>>> (mpi-spawner). The input data size is: matrixSize = 2000 and tileSize
>>>>>> = 200. Please let me know if you want the input file. Note that I
>>>>>> don't get the error when I run the program on 1 locale. In addition, I
>>>>>> don't get the error with a smaller data size even on 2 or more locales
>>>>>> (e.g. 10x10 matrix and 2x2 tile size). I'm guessing that using
>>>>>> ibv-conduit and transferring a certain amount of data incurs this
>>>>>> problem. FYI, using udp-conduit (amudprun) does not show the error.
>>>>>>
>>>>>> Please let me know if you have any comments and questions.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Akihiro
>>>>>>
>>>>>> --
>>>>>>
>>>>>> use BlockDist;
>>>>>>
>>>>>> config const matrixSize: int(32) = -1;
>>>>>> config const tileSize: int(32) = -1;
>>>>>> config const inFile: string = "m_2000.in";
>>>>>> const zero: int(32) = 0;
>>>>>> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>>>>
>>>>>> class Tile {
>>>>>>   var tile_array: [tile_array_indices] real;
>>>>>> }
>>>>>>
>>>>>> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
>>>>>>   var input_stream = open (fileName, iomode.r);
>>>>>>   var reader = input_stream.reader();
>>>>>>   var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
>>>>>>   var array: [matrix_index_2D] real;
>>>>>>
>>>>>>   for ij in matrix_index_2D do {
>>>>>>     reader.read(array(ij));
>>>>>>   }
>>>>>>   // close the reader before the file it was created from
>>>>>>   reader.close();
>>>>>>   input_stream.close();
>>>>>>   // if (debug) { writeln("whole array: ",array); }
>>>>>>   return array;
>>>>>> }
>>>>>>
>>>>>> proc main(): void {
>>>>>>   writeln("numLocales : ", numLocales);
>>>>>>
>>>>>>   var numTiles: int(32) = matrixSize/tileSize;
>>>>>>   var numTiles_2: int(64) = matrixSize/tileSize;
>>>>>>
>>>>>>   var whole_array = read_2D_array(inFile, matrixSize);
>>>>>>
>>>>>>   var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2,
>>>>>>                          zero..numTiles_2};
>>>>>>   var ijk_space = proto_ijk_space dmapped
>>>>>>                   Block(boundingBox=proto_ijk_space);
>>>>>>   var lkji_tiles: [ijk_space] Tile;
>>>>>>
>>>>>>   // fill the lower-triangular tiles on the locale that owns them
>>>>>>   for i in zero..numTiles-1 do {
>>>>>>     for j in zero..i do {
>>>>>>       on lkji_tiles(i,j,zero).locale do {
>>>>>>         var curr_tile: Tile = new Tile();
>>>>>>         for (ii,jj) in tile_array_indices do {
>>>>>>           curr_tile.tile_array(ii,jj) =
>>>>>>             whole_array(i*tileSize+ii,j*tileSize+jj);
>>>>>>         }
>>>>>>         lkji_tiles(i,j,zero) = curr_tile;
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   // bulk-copy each tile and compare the copy element by element
>>>>>>   var invalid : bool = false;
>>>>>>   for i in zero..numTiles-1 do {
>>>>>>     for iB in zero..tileSize-1 do {
>>>>>>       for j in zero..i do {
>>>>>>         var temp = lkji_tiles(i,j,zero).tile_array;
>>>>>>         if (i != j) {
>>>>>>           for jB in zero..tileSize-1 do {
>>>>>>             if (temp(iB,jB) !=
>>>>>>                 lkji_tiles(i,j,zero).tile_array(iB,jB)) {
>>>>>>               invalid = true;
>>>>>>             }
>>>>>>           }
>>>>>>         } else {
>>>>>>           for jB in zero..iB do {
>>>>>>             if (temp(iB,jB) !=
>>>>>>                 lkji_tiles(i,j,zero).tile_array(iB,jB)) {
>>>>>>               invalid = true;
>>>>>>             }
>>>>>>           }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>   writeln("INVALID? : ", invalid);
>>>>>> }
>>>>>>
>>>>>> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
>>>>>>> Hi Rafael,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> I inlined my comments below:
>>>>>>>
>>>>>>>> May we have a simplified copy of your code (kinda the snippet
>>>>>>>> provided below but with initial values for tileSize, numTiles_2, k,
>>>>>>>> etc. i.e. something that compiles) so that we can also give it a go?
>>>>>>>
>>>>>>> Yes, it would be better if we can have a simplified code.
>>>>>>> Actually, I have been trying to make a simple code that reproduces
>>>>>>> this problem for several weeks. Finally I managed to make it this
>>>>>>> morning. Let me ask my advisor if we can show you the code.
>>>>>>>
>>>>>>>> Would you like to try also with these flags?:
>>>>>>>>
>>>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>>>>
>>>>>>> I tried these flags, but I still get the error.
>>>>>>>
>>>>>>> I'll keep you updated.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Akihiro
>>>>>>>
>>>>>>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
>>>>>>>> Hi Akihiro,
>>>>>>>>
>>>>>>>> May we have a simplified copy of your code (kinda the snippet
>>>>>>>> provided below but with initial values for tileSize, numTiles_2, k,
>>>>>>>> etc. i.e. something that compiles) so that we can also give it a go?
>>>>>>>>
>>>>>>>> Would you like to try also with these flags?:
>>>>>>>>
>>>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Rafa.
>>>>>>>>
>>>>>>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>>>>>>>>> Dear Chapel developers,
>>>>>>>>>
>>>>>>>>> This is Akihiro Hayashi, postdoc at Rice University.
>>>>>>>>> I'm writing this to ask about an array copy failure in Chapel.
>>>>>>>>>
>>>>>>>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes
>>>>>>>>> but I get a strange error. Please note that I'm using an old version
>>>>>>>>> of the Chapel compiler (r21945) with qthread-1.10 and
>>>>>>>>> GASNet-1.20.2 (infiniband-conduit, mpi-spawner) because the latest
>>>>>>>>> version does not work. With the latest version of the Chapel
>>>>>>>>> compiler (r22568) with qthread-1.10 and GASNet-1.22.0
>>>>>>>>> (infiniband-conduit, mpi-spawner), I get a SEGV when running a
>>>>>>>>> simple program (coforall loc in Locales do on loc { writeln(loc); })
>>>>>>>>> across multiple nodes with the mpi spawner. This is another problem
>>>>>>>>> but I have not investigated it yet. I'll work on this later.
>>>>>>>>>
>>>>>>>>> The following problem might be fixed in the latest version, but any
>>>>>>>>> comments and suggestions are appreciated. Here is part of my code.
>>>>>>>>> The main data structure is a 3-dimensional array, which is declared
>>>>>>>>> as a distributed array in which each element refers to a
>>>>>>>>> 2-dimensional array. You can see the array copy statement (liBlock =
>>>>>>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this
>>>>>>>>> copy statement because the Chapel compiler generates bulk transfer
>>>>>>>>> code, which accelerates program execution.
>>>>>>>>>
>>>>>>>>> // Code
>>>>>>>>> 1: const zero: int(32) = 0;
>>>>>>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>>>>>>> 3: class Tile {
>>>>>>>>> 4:   var tile_array: [tile_array_indices] real;
>>>>>>>>> 5: }
>>>>>>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>>>>>>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>>>>>>>> 8: var lkji_tiles: [ijk_space] Tile;
>>>>>>>>> ...
>>>>>>>>> 9: begin {
>>>>>>>>> ...
>>>>>>>>> 10:   var liBlock: [tile_array_indices] real;
>>>>>>>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
>>>>>>>>> 12:   for (m,n) in tile_array_indices {
>>>>>>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>>>>>>>> 14:       invalid = true;
>>>>>>>>> 15:     }
>>>>>>>>> 16:   }
>>>>>>>>> 17:   if (invalid) { writeln("Copy Failed"); }
>>>>>>>>> 18:   ...
>>>>>>>>> 19: }
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> In my experiment, when running the program on 2 or more locales, the
>>>>>>>>> program prints "Copy Failed", which means "liBlock =
>>>>>>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed. This happens
>>>>>>>>> sometimes (not always), and I confirmed the copy is successfully
>>>>>>>>> done if I replace the array copy in Line 11 with a copy loop.
>>>>>>>>> Additionally, I also see the same behavior when I replace the array
>>>>>>>>> copy in Line 11 with
>>>>>>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>>>>>>>>
>>>>>>>>> Here is an output log at runtime when I compile the program with -s
>>>>>>>>> debugBulkTransfer (tileSize=200):
>>>>>>>>>
>>>>>>>>> -- Log starts here
>>>>>>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), len=40000, elemSize=8;
>>>>>>>>> -- End of Log
>>>>>>>>>
>>>>>>>>> In both cases, the runtime internally calls chpl_comm_get API(*) and
>>>>>>>>> the API takes the above parameters. I think it looks good.
>>>>>>>>> (*) Please take a look at doiBulkTransfer function in
>>>>>>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>>>>>>>>
>>>>>>>>> Any comments and suggestions are appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Akihiro
>>>>>>>>> _______________________________________________
>>>>>>>>> Chapel-developers mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>>>>>>
>>>>>>>> __
>>>>>>>> Rafael Asenjo Plaza
>>>>>>>> Dept. Arquitectura de Computadores
>>>>>>>> Complejo Tecnologico Campus de Teatinos
>>>>>>>> E-29071 MALAGA (SPAIN)
>>>>>>>> Tel: +34 95 213 27 91
>>>>>>>> Fax: +34 95 213 27 90
>>>>>>>> http://www.ac.uma.es/~asenjo
>
> --
> Rafael Larrosa Jiménez
> Centro de Supercomputación y Bioinformática - http://www.scbi.uma.es
> Universidad de Málaga
>
> EMAIL: [email protected]       Edificio de Bioinnovación
> TELEF: + 34951952788        C/ Severo Ochoa 34
> FAX  : +34951952792         Parque Tecnológico de Andalucía
>                             29590 Málaga (SPAIN)
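
A note on workaround 2 from the thread (splitting big messages into smaller ones): the sketch below only shows the offset arithmetic, in Python rather than Chapel so that it can be run anywhere. The helper name split_transfer and the constant ASSUMED_LIMIT are hypothetical; the 128 MB value is the turning point Rafael Larrosa observed for his own code, not a documented ibv-conduit limit. It also computes the size of the transfer from the debugBulkTransfer log (len=40000, elemSize=8).

```python
def split_transfer(total_bytes, max_chunk_bytes):
    """Return (offset, length) pairs covering total_bytes,
    with no piece larger than max_chunk_bytes (hypothetical helper)."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_chunk_bytes > 0")
    chunks = []
    offset = 0
    while offset < total_bytes:
        # the last piece may be shorter than the others
        length = min(max_chunk_bytes, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


# Assumption: use the 128 MB turning point reported in the thread as the cap.
ASSUMED_LIMIT = 128 * 1024 * 1024

# Transfer size from the log: len * elemSize
tile_bytes = 40000 * 8

if __name__ == "__main__":
    print("bytes per tile copy:", tile_bytes)
    print("pieces for one tile:", split_transfer(tile_bytes, ASSUMED_LIMIT))
    print("pieces for 300 MiB :", split_transfer(300 * 1024 * 1024, ASSUMED_LIMIT))
```

One tile copy is 320000 bytes, far below 128 MB, so under this assumed cap it would not be split at all. That is consistent with Akihiro's remark that the simplified program may be hitting a different condition than a single oversized transfer.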
