Note a race condition in MPI_Init was fixed yesterday in master. Can you please update your Open MPI and try again?
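In the meantime, a self-contained, deterministic check of just the hashcode part may help rule out the Java side (this is my own sketch, no MPI involved; the class name and the fill pattern are arbitrary choices, not from your program):

```java
// Deterministic, MPI-free check of the hashcode loop from the test program.
public class HashcodeCheck {

    // Same hash computation as in TestSendBigFiles
    static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            hash = hash * 7 + (int) bytearray[i];
        }
        return hash;
    }

    // Deterministic fill instead of java.util.Random, so every run is comparable
    static byte[] filled(int n) {
        byte[] a = new byte[n];
        for (int i = 0; i < n; i++) {
            a[i] = (byte) (i % 251);
        }
        return a;
    }

    public static void main(String[] args) {
        // Two identically filled 100 MB arrays must produce identical hashes
        int h1 = hashcode(filled(100_000_000));
        int h2 = hashcode(filled(100_000_000));
        System.out.println(h1 == h2 ? "deterministic" : "MISMATCH"); // prints "deterministic"
        System.out.println("hash = " + h1);
    }
}
```

If this always prints the same hash on your machine but the MPI run reports mismatching hashes for the same transferred data, the problem is on the transfer side rather than in the hash loop itself.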
Hopefully the hang will disappear. Can you reproduce the crash with a simpler (and ideally deterministic) version of your program? The crash occurs in hashcode, which makes little sense to me. Can you also update your JDK?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
> Hello Jason,
>
> thanks for your response! I think it is another problem. I try to send
> 100 MB of bytes, so there are not many tries (between 10 and 30). I realized
> that the execution of this code can result in 3 different errors:
>
> 1. Most often the posted error message occurs.
>
> 2. In <10% of the cases I have a livelock. I can see 3 Java processes, one
> with 200% and two with 100% processor utilization. After ~15 minutes
> without new system output, this error occurs:
>
> [thread 47499823949568 also had an error]
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
> # guarantee(PageArmed == 0) failed: invariant
> #
> # JRE version: 7.0_25-b15
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
> # Failed to write core dump. Core dumps have been disabled. To enable core
> # dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.sun.com/bugreport/crash.jsp
> #
> [titan01:24256] *** Process received signal ***
> [titan01:24256] Signal: Aborted (6)
> [titan01:24256] Signal code: (-6)
> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
> [titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
> [titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
> [titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
> [titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
> [titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
> [titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
> [titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
> [titan01:24256] *** End of error message ***
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> 3. In <10% of the cases I have a deadlock during MPI.init. This stays for
> more than 15 minutes without returning with an error message...
>
> Can I enable some debug flags to see what happens on the C / Open MPI side?
>
> Thanks in advance for your help!
> Gundram Leifert
>
> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>
> After reading your thread, it looks like it may be related to an issue I had
> a few weeks ago (I'm a novice though). Maybe my thread will be of help:
> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>
> When you say "After a specific number of repetitions the process either
> hangs up or returns with a SIGSEGV." do you mean that a single call
> hangs, or that at some point during the for loop a call hangs? If you mean
> the latter, then it might relate to my issue. Otherwise my thread probably
> won't be helpful.
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <
> gundram.leif...@uni-rostock.de> wrote:
>
>> Hello,
>>
>> I try to send many byte arrays via broadcast. After a specific number of
>> repetitions the process either hangs up or returns with a SIGSEGV.
>> Can anyone help me solve the problem?
>>
>> ########## The code:
>>
>> import java.util.Random;
>> import mpi.*;
>>
>> public class TestSendBigFiles {
>>
>>     public static void log(String msg) {
>>         try {
>>             System.err.println(String.format("%2d/%2d:%s",
>>                     MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>         } catch (MPIException ex) {
>>             System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
>>         }
>>     }
>>
>>     private static int hashcode(byte[] bytearray) {
>>         if (bytearray == null) {
>>             return 0;
>>         }
>>         int hash = 39;
>>         for (int i = 0; i < bytearray.length; i++) {
>>             byte b = bytearray[i];
>>             hash = hash * 7 + (int) b;
>>         }
>>         return hash;
>>     }
>>
>>     public static void main(String args[]) throws MPIException {
>>         log("start main");
>>         MPI.Init(args);
>>         try {
>>             log("initialized done");
>>             byte[] saveMem = new byte[100000000];
>>             MPI.COMM_WORLD.barrier();
>>             Random r = new Random();
>>             r.nextBytes(saveMem);
>>             if (MPI.COMM_WORLD.getRank() == 0) {
>>                 for (int i = 0; i < 1000; i++) {
>>                     saveMem[r.nextInt(saveMem.length)]++;
>>                     log("i = " + i);
>>                     int[] lengthData = new int[]{saveMem.length};
>>                     log("object hash = " + hashcode(saveMem));
>>                     log("length = " + lengthData[0]);
>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>                     MPI.COMM_WORLD.barrier();
>>                     MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
>>                     log("bcast data done");
>>                     MPI.COMM_WORLD.barrier();
>>                 }
>>                 MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>             } else {
>>                 while (true) {
>>                     int[] lengthData = new int[1];
>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>                     if (lengthData[0] == 0) {
>>                         break;
>>                     }
>>                     MPI.COMM_WORLD.barrier();
>>                     saveMem = new byte[lengthData[0]];
>>                     MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
>>                     log("bcast data done");
>>                     MPI.COMM_WORLD.barrier();
>>                     log("object hash = " + hashcode(saveMem));
>>                 }
>>             }
>>             MPI.COMM_WORLD.barrier();
>>         } catch (MPIException ex) {
>>             System.out.println("caught error." + ex);
>>             log(ex.getMessage());
>>         } catch (RuntimeException ex) {
>>             System.out.println("caught error." + ex);
>>             log(ex.getMessage());
>>         } finally {
>>             MPI.Finalize();
>>         }
>>     }
>> }
>>
>> ############ The Error (if it does not just hang up):
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>> #
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
>> # Problematic frame:
>> # #
>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>> #
>> # JRE version: 7.0_25-b15
>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> # core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
>> # Problematic frame:
>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> # core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:01172] *** Process received signal ***
>> [titan01:01172] Signal: Aborted (6)
>> [titan01:01172] Signal code: (-6)
>> [titan01:01173] *** Process received signal ***
>> [titan01:01173] Signal: Aborted (6)
>> [titan01:01173] Signal code: (-6)
>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>> [titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>> [titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>> [titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>> [titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>> [titan01:01172] *** End of error message ***
>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>> [titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>> [titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>> [titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>> [titan01:01173] [ 7] [0x2af69c0693a1]
>> [titan01:01173] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>>
>> ######## CONFIGURATION:
>>
>> I used the ompi master sources from github:
>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>> Author: Gilles Gouaillardet <gil...@rist.or.jp>
>> Date:   Tue Jul 5 13:47:50 2016 +0900
>>
>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso
>>
>> Thanks a lot for your help!
>> Gundram
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php