Note that a race condition in MPI_Init was fixed yesterday in master.
Can you please update your Open MPI and try again?

Hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic)
version of your program?
The crash occurs in hashcode, which makes little sense to me. Can you
also update your JDK?
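
As a possible starting point for such a deterministic version (a sketch
only, not run against Open MPI): the standalone program below reuses the
hash routine from the quoted test and feeds it from a fixed seed, with no
MPI calls at all. The class name, the seed and the buffer size are made up
for illustration. If this alone crashes in hashcode, that would suggest
the JVM rather than the MPI bindings.

```java
import java.util.Random;

public class DeterministicHashDemo {

    // Same rolling hash as hashcode() in the quoted TestSendBigFiles program.
    public static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (byte b : bytearray) {
            hash = hash * 7 + b; // int overflow wraps around, as in the original
        }
        return hash;
    }

    // Fills a buffer from a fixed seed so that every run sees identical bytes.
    public static byte[] makeBuffer(int size, long seed) {
        byte[] buf = new byte[size];
        new Random(seed).nextBytes(buf);
        return buf;
    }

    public static void main(String[] args) {
        int size = 1_000_000; // hypothetical; smaller than the original 100 MB
        long seed = 42L;      // hypothetical fixed seed
        int expected = hashcode(makeBuffer(size, seed));
        // Repeat the computation; with a fixed seed the hash must not change.
        for (int i = 0; i < 100; i++) {
            int actual = hashcode(makeBuffer(size, seed));
            if (actual != expected) {
                throw new IllegalStateException("hash mismatch at iteration " + i);
            }
        }
        System.out.println("hash = " + expected);
    }
}
```

With a fixed seed, the same hash can also be computed on every rank after
a bcast and compared, instead of only being logged.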

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de>
wrote:

> Hello Jason,
>
> thanks for your response! I think it is a different problem. I try to send
> 100 MB of data, so there are not many tries (between 10 and 30). I realized
> that running this code can result in 3 different errors:
>
> 1. Most often, the posted error message occurs.
>
> 2. In <10% of the cases I get a livelock. I can see 3 Java processes, one
> with 200% and two with 100% processor utilization. After ~15 minutes
> without new output, this error occurs:
>
>
> [thread 47499823949568 also had an error]
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
> #  guarantee(PageArmed == 0) failed: invariant
> #
> # JRE version: 7.0_25-b15
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
> linux-amd64 compressed oops)
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
> [titan01:24256] *** Process received signal ***
> [titan01:24256] Signal: Aborted (6)
> [titan01:24256] Signal code:  (-6)
> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
> [titan01:24256] [ 3]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
> [titan01:24256] [ 4]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
> [titan01:24256] [ 5]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
> [titan01:24256] [ 6]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
> [titan01:24256] [ 7]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
> [titan01:24256] [ 8]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
> [titan01:24256] [ 9]
> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
> [titan01:24256] *** End of error message ***
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
>
> 3. In <10% of the cases I get a deadlock in MPI.Init. It hangs for
> more than 15 minutes without returning an error message...
>
> Can I enable some debug flags to see what happens on the C / Open MPI side?
>
> Thanks in advance for your help!
> Gundram Leifert
>
>
> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>
> After reading your thread, it looks like it may be related to an issue I had
> a few weeks ago (I'm a novice though). Maybe my thread will be of help:
> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>
> When you say "After a specific number of repetitions the process either
> hangs up or returns with a SIGSEGV.", do you mean that a single call
> hangs, or that at some point during the for loop a call hangs? If you mean
> the latter, then it might relate to my issue. Otherwise my thread probably
> won't be helpful.
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert
> <gundram.leif...@uni-rostock.de> wrote:
>
>> Hello,
>>
>> I am trying to send many byte arrays via broadcast. After a specific number of
>> repetitions the process either hangs up or returns with a SIGSEGV. Can
>> anyone help me solve the problem?
>>
>> ########## The code:
>>
>> import java.util.Random;
>> import mpi.*;
>>
>> public class TestSendBigFiles {
>>
>>     public static void log(String msg) {
>>         try {
>>             System.err.println(String.format("%2d/%2d:%s",
>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>         } catch (MPIException ex) {
>>             System.err.println(String.format("%2s/%2s:%s", "?", "?",
>> msg));
>>         }
>>     }
>>
>>     private static int hashcode(byte[] bytearray) {
>>         if (bytearray == null) {
>>             return 0;
>>         }
>>         int hash = 39;
>>         for (int i = 0; i < bytearray.length; i++) {
>>             byte b = bytearray[i];
>>             hash = hash * 7 + (int) b;
>>         }
>>         return hash;
>>     }
>>
>>     public static void main(String args[]) throws MPIException {
>>         log("start main");
>>         MPI.Init(args);
>>         try {
>>             log("initialized done");
>>             byte[] saveMem = new byte[100000000];
>>             MPI.COMM_WORLD.barrier();
>>             Random r = new Random();
>>             r.nextBytes(saveMem);
>>             if (MPI.COMM_WORLD.getRank() == 0) {
>>                 for (int i = 0; i < 1000; i++) {
>>                     saveMem[r.nextInt(saveMem.length)]++;
>>                     log("i = " + i);
>>                     int[] lengthData = new int[]{saveMem.length};
>>                     log("object hash = " + hashcode(saveMem));
>>                     log("length = " + lengthData[0]);
>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>                     log("bcast length done (length = " + lengthData[0] +
>> ")");
>>                     MPI.COMM_WORLD.barrier();
>>                     MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>> MPI.BYTE, 0);
>>                     log("bcast data done");
>>                     MPI.COMM_WORLD.barrier();
>>                 }
>>                 MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>             } else {
>>                 while (true) {
>>                     int[] lengthData = new int[1];
>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>                     log("bcast length done (length = " + lengthData[0] +
>> ")");
>>                     if (lengthData[0] == 0) {
>>                         break;
>>                     }
>>                     MPI.COMM_WORLD.barrier();
>>                     saveMem = new byte[lengthData[0]];
>>                     MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>> MPI.BYTE, 0);
>>                     log("bcast data done");
>>                     MPI.COMM_WORLD.barrier();
>>                     log("object hash = " + hashcode(saveMem));
>>                 }
>>             }
>>             MPI.COMM_WORLD.barrier();
>>         } catch (MPIException ex) {
>>             System.out.println("caught error." + ex);
>>             log(ex.getMessage());
>>         } catch (RuntimeException ex) {
>>             System.out.println("caught error." + ex);
>>             log(ex.getMessage());
>>         } finally {
>>             MPI.Finalize();
>>         }
>>
>>     }
>>
>> }
>>
>>
>> ############ The Error (if it does not just hang up):
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>> #
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>> linux-amd64 compressed oops)
>> # Problematic frame:
>> # #
>> #  SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>> #
>> # JRE version: 7.0_25-b15
>> J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>> linux-amd64 compressed oops)
>> # Problematic frame:
>> # J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:01172] *** Process received signal ***
>> [titan01:01172] Signal: Aborted (6)
>> [titan01:01172] Signal code:  (-6)
>> [titan01:01173] *** Process received signal ***
>> [titan01:01173] Signal: Aborted (6)
>> [titan01:01173] Signal code:  (-6)
>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>> [titan01:01172] [ 3]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>> [titan01:01172] [ 4]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>> [titan01:01172] [ 5]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>> [titan01:01172] *** End of error message ***
>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>> [titan01:01173] [ 3]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>> [titan01:01173] [ 4]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>> [titan01:01173] [ 5]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>> [titan01:01173] [ 7] [0x2af69c0693a1]
>> [titan01:01173] *** End of error message ***
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>>
>>
>> ########CONFIGURATION:
>> I used the ompi master sources from github:
>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>> Author: Gilles Gouaillardet <gil...@rist.or.jp>
>> Date:   Tue Jul 5 13:47:50 2016 +0900
>>
>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>> --disable-dlopen --disable-mca-dso
>>
>> Thanks a lot for your help!
>> Gundram
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>
>
>