Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Howard Pritchard
Hello Nate,

As a first step to addressing this, could you please try using gcc rather
than the Intel compilers to build Open MPI?

We've been doing a lot of work recently on the java bindings, etc. but have
never tried using any compilers other
than gcc when working with the java bindings.

Thanks,

Howard


2015-08-03 17:36 GMT-06:00 Nate Chambers :

> We've been struggling with this error for a while, so we're hoping someone
> more knowledgeable can help!
>
> Our java MPI code exits with a segfault during its normal operation, *but
> the segfault occurs before our code ever uses MPI functionality like
> sending/receiving*. We've removed all message calls and any use of
> MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
> in our code, and does not if we comment that line out. Further vexing us,
> the crash doesn't happen at the point of the MPI.init call, but later on in
> the program. I don't have an easy-to-run example here because our non-MPI
> code is so large and complicated. We have run simpler test programs with
> MPI and the segfault does not occur.
>
> We have isolated the line where the segfault occurs. However, if we
> comment that line out, the program runs longer but then segfaults at a
> later point in the code that seems random yet is reproducible. Does anyone
> have tips on how to debug this? We have tried several flags with mpirun,
> but they gave us no useful clues.
>
> We have also tried several MPI versions, including stable 1.8.7 and the
> most recent 1.8.8rc1.
>
>
> ATTACHED
> - config.log from installation
> - output from `ompi_info -all`
>
>
> OUTPUT FROM RUNNING
>
> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
> ...
> some normal output from our code
> ...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27386.php
>
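A note for anyone debugging a crash like the one described above: one generic
way to get more detail is standard JVM/OS tooling rather than anything
MPI-specific, e.g. enabling core dumps and pointing the JVM's fatal-error log
somewhere easy to find (the flags below are illustrative; the application
arguments are the ones from the command line quoted above):

  ulimit -c unlimited
  mpirun -np 2 java -mx4g -XX:ErrorFile=./hs_err_%p.log FeaturizeDay datadir/ days.txt

The resulting hs_err file records the problematic frame, which indicates
whether the SIGSEGV originated inside the JVM itself or in a native library
such as the Open MPI JNI code.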


Re: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi compiler 15.4

2015-08-04 Thread Rolf vandeVaart
Hi Shahzeb:
I believe another colleague of mine may have helped you with this issue (I was 
not around last week).  However, to help me better understand the issue you are 
seeing, could you send me your config.log file  from when you did the 
configuration?  You can just send to rvandeva...@nvidia.com.
Thanks, Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Shahzeb
>Sent: Thursday, July 30, 2015 9:45 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi
>compiler 15.4
>
>Hello,
>
>I am getting errors during make and make install when building Open MPI with
>CUDA support using the PGI compiler. Please help me fix this problem.
>No clue why it is happening. We are using PGI 15.4.
>
>  ./configure --prefix=/usr/global/openmpi/pgi/1.8.7 CC=pgcc CXX=pgCC
>FC=pgfortran --with-cuda=/usr/global/cuda/7.0/include/
>
>
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/sm'
>Making all in mca/common/verbs
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>if test -z "libmca_common_verbs.la"; then \
>   rm -f "libmca_common_verbs.la"; \
>   ln -s "libmca_common_verbs_noinst.la" "libmca_common_verbs.la"; \
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>Making all in mca/common/cuda
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/cuda'
>   CC   common_cuda.lo
>PGC-S-0039-Use of undeclared variable mca_common_cuda_cumemcpy_async
>(common_cuda.c: 320)
>PGC-S-0039-Use of undeclared variable libcuda_handle (common_cuda.c: 396)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 396)
>PGC-S-0103-Illegal operand types for comparison operator (common_cuda.c: 397)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 441)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 441)
>[the same pair of warnings, PGC-W-0095 (type cast required) and PGC-W-0155
>(pointer value created from a nonlong integral type), repeats for
>common_cuda.c lines 442-455, 463-465, 469, and 470, at which point the
>quoted build output is cut off]

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Nate Chambers
Sure, I reran the configure with CC=gcc and then make install. I think
that's the proper way to do it. Attached is my config log. The behavior
when running our code appears to be the same. The output is the same error
I pasted in my email above. It occurs when calling MPI.init().

I'm not great at debugging this sort of stuff, but happy to try things out
if you need me to.

Nate


On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard 
wrote:

> Hello Nate,
>
> As a first step to addressing this, could you please try using gcc rather
> than the Intel compilers to build Open MPI?
>
> We've been doing a lot of work recently on the java bindings, etc. but
> have never tried using any compilers other
> than gcc when working with the java bindings.
>
> Thanks,
>
> Howard
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27389.php
>


config.log.bz2
Description: BZip2 compressed data


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Howard Pritchard
Hello Nate,

As a sanity check of your installation, could you try to compile the
examples/*.java codes using the mpijavac you've installed and see that
those run correctly?
I'd just be interested in Hello.java and Ring.java.
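(For reference, the usual pattern looks roughly like the following, assuming
Open MPI's bin directory is on the PATH; the -np values are arbitrary and
classpath handling can differ between versions, so treat this as a sketch:

  mpijavac Hello.java Ring.java
  mpirun -np 2 java Hello
  mpirun -np 2 java Ring

mpijavac is the wrapper compiler installed alongside mpicc and takes care of
putting mpi.jar on the compile-time classpath.)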

Howard







2015-08-04 14:34 GMT-06:00 Nate Chambers :

> Sure, I reran the configure with CC=gcc and then make install. I think
> that's the proper way to do it. Attached is my config log. The behavior
> when running our code appears to be the same. The output is the same error
> I pasted in my email above. It occurs when calling MPI.init().
>
> I'm not great at debugging this sort of stuff, but happy to try things out
> if you need me to.
>
> Nate
>
>
> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard 
> wrote:
>
>> Hello Nate,
>>
>> As a first step to addressing this, could you please try using gcc rather
>> than the Intel compilers to build Open MPI?
>>
>> We've been doing a lot of work recently on the java bindings, etc. but
>> have never tried using any compilers other
>> than gcc when working with the java bindings.
>>
>> Thanks,
>>
>> Howard
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/08/27389.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27391.php
>


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Nate Chambers
Sanity checks pass. Both Hello.java and Ring.java run correctly and produce
the expected output.

Does MPI.init(args) expect anything from those command-line args?
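For reference, a minimal sketch that isolates MPI.Init() and prints whatever
comes back from it (the class name InitOnly is made up for illustration, the
method casing follows the 1.8-era bundled examples, and the Init signature and
returned array should be checked against the installed javadoc):

  import mpi.*;

  class InitOnly {
      static public void main(String[] args) throws MPIException {
          // Init() may strip any MPI-specific arguments it recognizes; the
          // application's own arguments (e.g. "datadir/" and "days.txt")
          // should come back unchanged.
          String[] appArgs = MPI.Init(args);

          int rank = MPI.COMM_WORLD.getRank();
          for (String a : appArgs) {
              System.out.println("rank " + rank + " got arg: " + a);
          }

          MPI.Finalize();
      }
  }

Compiled with mpijavac and launched with mpirun exactly like Hello.java, this
exercises the same initialization path that MPI.init(args) takes in the larger
application, just without the rest of the program around it.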


Nate


On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard 
wrote:

> Hello Nate,
>
> As a sanity check of your installation, could you try to compile the
> examples/*.java codes using the mpijavac you've installed and see that
> those run correctly?
> I'd just be interested in Hello.java and Ring.java.
>
> Howard
>
>
>
>
>
>
>
> 2015-08-04 14:34 GMT-06:00 Nate Chambers :
>
>> Sure, I reran the configure with CC=gcc and then make install. I think
>> that's the proper way to do it. Attached is my config log. The behavior
>> when running our code appears to be the same. The output is the same error
>> I pasted in my email above. It occurs when calling MPI.init().
>>
>> I'm not great at debugging this sort of stuff, but happy to try things
>> out if you need me to.
>>
>> Nate
>>
>>
>> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard 
>> wrote:
>>
>>> Hello Nate,
>>>
>>> As a first step to addressing this, could you please try using gcc
>>> rather than the Intel compilers to build Open MPI?
>>>
>>> We've been doing a lot of work recently on the java bindings, etc. but
>>> have never tried using any compilers other
>>> than gcc when working with the java bindings.
>>>
>>> Thanks,
>>>
>>> Howard
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/08/27389.php
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/08/27391.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27392.php
>