Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
> On Oct 27, 2014, at 7:21 PM, Gilles Gouaillardet
> wrote:
>
> Ralph,
>
> On 2014/10/28 0:46, Ralph Castain wrote:
>> Actually, I propose to also remove that issue. Simple enough to use a
>> hash_table_32 to handle the jobids, and let that point to a
>> hash_table_32 of vpids. Since we rarely have more than one jobid
>> anyway, the memory overhead actually decreases with this model, and we
>> get rid of that annoying need to memcpy everything.
>
> sounds good to me.
> from an implementation/performance point of view, should we treat
> the local jobid differently?
> (e.g. use a special variable for the hash_table_32 of the vpids of the
> current jobid)

Not entirely sure - let’s see as we go. My initial thought is “no”, but since the use of dynamic jobs is so rare, it might make sense.

>>> as far as i am concerned, i am fine with your proposed suggestion to
>>> dump opal_identifier_t.
>>>
>>> about the patch, did you mean you have something ready i can apply to my
>>> PR?
>>> or do you expect me to do the changes (i am ok to do it if needed)

>> Why don’t I grab your branch, create a separate repo based on it (just to
>> keep things clean), push it to my area and give you write access? We can
>> then collaborate on the changes and create a PR from there. This way, you
>> don’t need to give me write access to your entire repo.
>>
>> Make sense?

> ok to work on another "somehow shared" repo for that issue.
> i am not convinced you should grab my branch since all the changes i
> made will no longer be valid.
> anyway, feel free to fork a repo from my branch or the master and i will
> work from here.

Okay, I’ll set something up tomorrow

> Cheers,
>
> Gilles
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25621.php
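[Editor's note] The two-level lookup Ralph proposes above — an outer 32-bit table keyed by jobid whose values are inner 32-bit tables keyed by vpid — can be sketched in plain C. The toy open-addressing tables below merely stand in for opal_hash_table_32, and all names are illustrative, not the real Open MPI API:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy fixed-size open-addressing table with 32-bit keys,
 * standing in for opal_hash_table_32 (hypothetical sketch). */
#define TBL_SIZE 64

typedef struct { uint32_t key; void *value; int used; } entry_t;
typedef struct { entry_t slots[TBL_SIZE]; } table_t;

static void tbl_set(table_t *t, uint32_t key, void *value) {
    for (uint32_t i = 0; i < TBL_SIZE; i++) {
        entry_t *e = &t->slots[(key + i) % TBL_SIZE];
        if (!e->used || e->key == key) {
            e->key = key; e->value = value; e->used = 1;
            return;
        }
    }
    abort();                /* toy table full */
}

static void *tbl_get(table_t *t, uint32_t key) {
    for (uint32_t i = 0; i < TBL_SIZE; i++) {
        entry_t *e = &t->slots[(key + i) % TBL_SIZE];
        if (!e->used) return NULL;
        if (e->key == key) return e->value;
    }
    return NULL;
}

/* Process lookup: jobid selects an inner vpid table.  No 64-bit
 * composite key is ever formed, hence no memcpy to realign one,
 * and with a single jobid there is exactly one inner table. */
static void *proc_lookup(table_t *jobids, uint32_t jobid, uint32_t vpid) {
    table_t *vpids = tbl_get(jobids, jobid);
    return vpids ? tbl_get(vpids, vpid) : NULL;
}
```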
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Ralph,

On 2014/10/28 0:46, Ralph Castain wrote:
> Actually, I propose to also remove that issue. Simple enough to use a
> hash_table_32 to handle the jobids, and let that point to a
> hash_table_32 of vpids. Since we rarely have more than one jobid
> anyway, the memory overhead actually decreases with this model, and we
> get rid of that annoying need to memcpy everything.

sounds good to me.
from an implementation/performance point of view, should we treat
the local jobid differently?
(e.g. use a special variable for the hash_table_32 of the vpids of the
current jobid)

>> as far as i am concerned, i am fine with your proposed suggestion to
>> dump opal_identifier_t.
>>
>> about the patch, did you mean you have something ready i can apply to my
>> PR?
>> or do you expect me to do the changes (i am ok to do it if needed)

> Why don’t I grab your branch, create a separate repo based on it (just to
> keep things clean), push it to my area and give you write access? We can then
> collaborate on the changes and create a PR from there. This way, you don’t
> need to give me write access to your entire repo.
>
> Make sense?

ok to work on another "somehow shared" repo for that issue.
i am not convinced you should grab my branch since all the changes i
made will no longer be valid.
anyway, feel free to fork a repo from my branch or the master and i will
work from here.

Cheers,

Gilles
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
> On Oct 26, 2014, at 11:12 PM, Gilles Gouaillardet
> wrote:
>
> Ralph,
>
> this is also a solution.
> the pro is it seems more lightweight than PR #249
> the two cons i can see are:
> - opal_process_name_t alignment goes from 64 to 32 bits
> - some functions (opal_hash_table_*) take a uint64_t as argument so we
>   still need to use memcpy in order to
>   * guarantee 64 bits alignment on some archs (such as sparc)
>   * avoid ugly casts such as uint64_t id = *(uint64_t *)&process_name;

Actually, I propose to also remove that issue. Simple enough to use a
hash_table_32 to handle the jobids, and let that point to a hash_table_32 of
vpids. Since we rarely have more than one jobid anyway, the memory overhead
actually decreases with this model, and we get rid of that annoying need to
memcpy everything.

> as far as i am concerned, i am fine with your proposed suggestion to
> dump opal_identifier_t.
>
> about the patch, did you mean you have something ready i can apply to my
> PR?
> or do you expect me to do the changes (i am ok to do it if needed)

Why don’t I grab your branch, create a separate repo based on it (just to keep
things clean), push it to my area and give you write access? We can then
collaborate on the changes and create a PR from there. This way, you don’t
need to give me write access to your entire repo.

Make sense?
Ralph

> Cheers,
>
> Gilles
>
> On 2014/10/27 11:04, Ralph Castain wrote:
>> Just took a glance thru 249 and have a few suggestions on it - will pass
>> them along tomorrow. I think the right solution is to (a) dump
>> opal_identifier_t in favor of using opal_process_name_t everywhere in the
>> opal layer, (b) typedef orte_process_name_t to opal_process_name_t, and (c)
>> leave ompi_process_name_t as typedef’d to the RTE component in the MPI
>> layer. This lets other RTEs decide for themselves how they want to handle it.
>>
>> If you add changes to your branch, I can pass you a patch with my suggested
>> alterations.
>> >>> On Oct 26, 2014, at 5:55 PM, Gilles Gouaillardet >>> wrote: >>> >>> No :-( >>> I need some extra work to stop declaring orte_process_name_t and >>> ompi_process_name_t variables. >>> #249 will make things much easier. >>> One option is to use opal_process_name_t everywhere or typedef orte and >>> ompi types to the opal one. >>> An other (lightweight but error prone imho) is to change variable >>> declaration only. >>> Any thought ? >>> >>> Ralph Castain wrote: Will PR#249 solve it? If so, we should just go with it as I suspect that is the long-term solution. > On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet > wrote: > > It looks like we faced a similar issue : > opal_process_name_t is 64 bits aligned wheteas orte_process_name_t is 32 > bits aligned. If you run an alignment sensitive cpu such as sparc and you > are not lucky (so to speak) you can run into this issue. > i will make a patch for this shortly > > Ralph Castain wrote: >> Afraid this must be something about the Sparc - just ran on a Solaris 11 >> x86 box and everything works fine. >> >> >>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross >>> wrote: >>> >>> Hi Gilles, >>> >>> I wanted to explore which function is called, when I call MPI_Init >>> in a C program, because this function should be called from a Java >>> program as well. Unfortunately C programs break with a Bus Error >>> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's >>> the reason why I get no useful backtrace for my Java program. >>> >>> tyr small_prog 117 mpicc -o init_finalize init_finalize.c >>> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec >>> ... 
>>> (gdb) run -np 1 init_finalize >>> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 >>> init_finalize >>> [Thread debugging using libthread_db enabled] >>> [New Thread 1 (LWP 1)] >>> [New LWP2] >>> [tyr:19240] *** Process received signal *** >>> [tyr:19240] Signal: Bus Error (10) >>> [tyr:19240] Signal code: Invalid address alignment (1) >>> [tyr:19240] Failing at address: 7bd1c10c >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04 >>> /lib/sparcv9/libc.so.1:0xd8b98 >>> /lib/sparcv9/libc.so.1:0xcc70c >>> /lib/sparcv9/libc.so.1:0xcc918 >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c >>> [ Signal 10 (BUS)] >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8 >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
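[Editor's note] The alignment hazard Gilles describes — and the "Invalid address alignment" SIGBUS in the backtrace above — comes from reading a 4-byte-aligned name through a uint64_t pointer on a strict-alignment CPU such as SPARC. A minimal sketch of the portable memcpy spelling (the struct and function names here are illustrative, not the actual Open MPI types):

```c
#include <stdint.h>
#include <string.h>

/* A process name made of two 32-bit fields is only guaranteed
 * 4-byte alignment, so on a strict-alignment CPU the cast
 *     uint64_t id = *(uint64_t *)&name;
 * traps with SIGBUS whenever &name is not 8-byte aligned.
 * memcpy is the portable spelling: the compiler is free to emit
 * byte-wise or properly aligned loads as the target requires. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} name32_t;                          /* illustrative, not the real type */

static uint64_t name_to_u64(const name32_t *name) {
    uint64_t id;
    memcpy(&id, name, sizeof(id));   /* alignment-safe on any arch */
    return id;
}
```

The resulting 64-bit value is endian-dependent, which is fine for use as a local hash key but not for wire transfer.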
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Ralph,

this is also a solution.
the pro is it seems more lightweight than PR #249
the two cons i can see are:
- opal_process_name_t alignment goes from 64 to 32 bits
- some functions (opal_hash_table_*) take a uint64_t as argument so we
  still need to use memcpy in order to
  * guarantee 64 bits alignment on some archs (such as sparc)
  * avoid ugly casts such as uint64_t id = *(uint64_t *)&process_name;

as far as i am concerned, i am fine with your proposed suggestion to
dump opal_identifier_t.

about the patch, did you mean you have something ready i can apply to my
PR?
or do you expect me to do the changes (i am ok to do it if needed)

Cheers,

Gilles

On 2014/10/27 11:04, Ralph Castain wrote:
> Just took a glance thru 249 and have a few suggestions on it - will pass them
> along tomorrow. I think the right solution is to (a) dump opal_identifier_t
> in favor of using opal_process_name_t everywhere in the opal layer, (b)
> typedef orte_process_name_t to opal_process_name_t, and (c) leave
> ompi_process_name_t as typedef’d to the RTE component in the MPI layer. This
> lets other RTEs decide for themselves how they want to handle it.
>
> If you add changes to your branch, I can pass you a patch with my suggested
> alterations.
>
>> On Oct 26, 2014, at 5:55 PM, Gilles Gouaillardet
>> wrote:
>>
>> No :-(
>> I need some extra work to stop declaring orte_process_name_t and
>> ompi_process_name_t variables.
>> #249 will make things much easier.
>> One option is to use opal_process_name_t everywhere or typedef orte and ompi
>> types to the opal one.
>> Another (lightweight but error prone imho) is to change variable
>> declaration only.
>> Any thought?
>>
>> Ralph Castain wrote:
>>> Will PR#249 solve it? If so, we should just go with it as I suspect that is
>>> the long-term solution.
>>> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet wrote:

It looks like we faced a similar issue:
opal_process_name_t is 64 bits aligned whereas orte_process_name_t is 32
bits aligned.
If you run an alignment sensitive cpu such as sparc and you are not lucky (so to speak) you can run into this issue. i will make a patch for this shortly Ralph Castain wrote: > Afraid this must be something about the Sparc - just ran on a Solaris 11 > x86 box and everything works fine. > > >> On Oct 26, 2014, at 8:22 AM, Siegmar Gross >> wrote: >> >> Hi Gilles, >> >> I wanted to explore which function is called, when I call MPI_Init >> in a C program, because this function should be called from a Java >> program as well. Unfortunately C programs break with a Bus Error >> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's >> the reason why I get no useful backtrace for my Java program. >> >> tyr small_prog 117 mpicc -o init_finalize init_finalize.c >> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec >> ... >> (gdb) run -np 1 init_finalize >> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 >> init_finalize >> [Thread debugging using libthread_db enabled] >> [New Thread 1 (LWP 1)] >> [New LWP2] >> [tyr:19240] *** Process received signal *** >> [tyr:19240] Signal: Bus Error (10) >> [tyr:19240] Signal code: Invalid address alignment (1) >> [tyr:19240] Failing at address: 7bd1c10c >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04 >> /lib/sparcv9/libc.so.1:0xd8b98 >> /lib/sparcv9/libc.so.1:0xcc70c >> /lib/sparcv9/libc.so.1:0xcc918 >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c >> [ Signal 10 (BUS)] >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8 >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c >> 
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374 >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8 >> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20 >> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c >> [tyr:19240] *** End of error message *** >> -- >> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on >> signal 10 (Bus Error). >> -- >> [LWP2 exited] >> [New Thread 2] >> [Switching to Thread 1 (LWP 1)] >> sol_thread_fetch_r
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Just took a glance thru 249 and have a few suggestions on it - will pass them
along tomorrow. I think the right solution is to (a) dump opal_identifier_t in
favor of using opal_process_name_t everywhere in the opal layer, (b) typedef
orte_process_name_t to opal_process_name_t, and (c) leave ompi_process_name_t
as typedef’d to the RTE component in the MPI layer. This lets other RTEs
decide for themselves how they want to handle it.

If you add changes to your branch, I can pass you a patch with my suggested
alterations.

> On Oct 26, 2014, at 5:55 PM, Gilles Gouaillardet
> wrote:
>
> No :-(
> I need some extra work to stop declaring orte_process_name_t and
> ompi_process_name_t variables.
> #249 will make things much easier.
> One option is to use opal_process_name_t everywhere or typedef orte and ompi
> types to the opal one.
> Another (lightweight but error prone imho) is to change variable declaration
> only.
> Any thought?
>
> Ralph Castain wrote:
>> Will PR#249 solve it? If so, we should just go with it as I suspect that is
>> the long-term solution.
>>
>>> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet
>>> wrote:
>>>
>>> It looks like we faced a similar issue:
>>> opal_process_name_t is 64 bits aligned whereas orte_process_name_t is 32
>>> bits aligned. If you run an alignment-sensitive cpu such as sparc and you
>>> are not lucky (so to speak) you can run into this issue.
>>> i will make a patch for this shortly
>>>
>>> Ralph Castain wrote:

Afraid this must be something about the Sparc - just ran on a Solaris 11 x86
box and everything works fine.

> On Oct 26, 2014, at 8:22 AM, Siegmar Gross
> wrote:
>
> Hi Gilles,
>
> I wanted to explore which function is called, when I call MPI_Init
> in a C program, because this function should be called from a Java
> program as well. Unfortunately C programs break with a Bus Error
> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
> the reason why I get no useful backtrace for my Java program.
> > tyr small_prog 117 mpicc -o init_finalize init_finalize.c > tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec > ... > (gdb) run -np 1 init_finalize > Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 > init_finalize > [Thread debugging using libthread_db enabled] > [New Thread 1 (LWP 1)] > [New LWP2] > [tyr:19240] *** Process received signal *** > [tyr:19240] Signal: Bus Error (10) > [tyr:19240] Signal code: Invalid address alignment (1) > [tyr:19240] Failing at address: 7bd1c10c > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04 > /lib/sparcv9/libc.so.1:0xd8b98 > /lib/sparcv9/libc.so.1:0xcc70c > /lib/sparcv9/libc.so.1:0xcc918 > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c > [ Signal 10 (BUS)] > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8 > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374 > /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8 > /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20 > /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c > [tyr:19240] *** End of error message *** > -- > mpiexec noticed that process rank 0 with PID 0 on node tyr exited on > signal 10 (Bus Error). 
> -- > [LWP2 exited] > [New Thread 2] > [Switching to Thread 1 (LWP 1)] > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to > satisfy query > (gdb) bt > #0 0x7f6173d0 in rtld_db_dlactivity () from > /usr/lib/sparcv9/ld.so.1 > #1 0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 > #2 0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 > #3 0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 > #4 0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1 > #5 0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1 > #6 0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1 > #7 0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 > #8 0x7ec87f60 in vm_close (loader_data=0x0, > module=0x7c9
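[Editor's note] The (a)/(b)/(c) layering Ralph outlines above can be sketched as a chain of typedefs: one concrete name type in the opal layer, with the orte and ompi names reduced to aliases of it. The field names and struct layout below are illustrative only:

```c
#include <stdint.h>

/* (a) one concrete process-name type lives in the opal layer;
 * the fields shown here are illustrative, not the real definition. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} opal_process_name_t;

/* (b) the orte name becomes a plain alias of the opal one ... */
typedef opal_process_name_t orte_process_name_t;

/* (c) ... and in the MPI layer the RTE component chooses what
 * ompi_process_name_t aliases, so other runtimes can substitute
 * their own definition without touching the opal layer. */
typedef orte_process_name_t ompi_process_name_t;
```

Because all three names denote the same type, passing a name between layers needs no conversion or memcpy.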
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
No :-(
I need some extra work to stop declaring orte_process_name_t and
ompi_process_name_t variables.
#249 will make things much easier.
One option is to use opal_process_name_t everywhere or typedef orte and ompi
types to the opal one.
Another (lightweight but error prone imho) is to change variable declaration
only.
Any thought?

Ralph Castain wrote:
> Will PR#249 solve it? If so, we should just go with it as I suspect that is
> the long-term solution.
>
>> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet
>> wrote:
>>
>> It looks like we faced a similar issue:
>> opal_process_name_t is 64 bits aligned whereas orte_process_name_t is 32
>> bits aligned. If you run an alignment-sensitive cpu such as sparc and you
>> are not lucky (so to speak) you can run into this issue.
>> i will make a patch for this shortly
>>
>> Ralph Castain wrote:
>>> Afraid this must be something about the Sparc - just ran on a Solaris 11
>>> x86 box and everything works fine.
>>>
>>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross wrote:

Hi Gilles,

I wanted to explore which function is called, when I call MPI_Init
in a C program, because this function should be called from a Java
program as well. Unfortunately C programs break with a Bus Error
once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
the reason why I get no useful backtrace for my Java program.

tyr small_prog 117 mpicc -o init_finalize init_finalize.c
tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
...
(gdb) run -np 1 init_finalize Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 init_finalize [Thread debugging using libthread_db enabled] [New Thread 1 (LWP 1)] [New LWP2] [tyr:19240] *** Process received signal *** [tyr:19240] Signal: Bus Error (10) [tyr:19240] Signal code: Invalid address alignment (1) [tyr:19240] Failing at address: 7bd1c10c /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04 /lib/sparcv9/libc.so.1:0xd8b98 /lib/sparcv9/libc.so.1:0xcc70c /lib/sparcv9/libc.so.1:0xcc918 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c [ Signal 10 (BUS)] /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8 /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20 /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c [tyr:19240] *** End of error message *** -- mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal 10 (Bus Error). 
-- [LWP2 exited] [New Thread 2] [Switching to Thread 1 (LWP 1)] sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query (gdb) bt #0 0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1 #1 0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 #2 0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 #3 0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 #4 0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1 #5 0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1 #6 0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1 #7 0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 #8 0x7ec87f60 in vm_close (loader_data=0x0, module=0x7c901fe0) at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212 #9 0x7ec85534 in lt_dlclose (handle=0x100189b50) at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982 #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0) at ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382 #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0) at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446 #12 0x7ecaa474 in mca_base_component_repository_release ( component=0x7b1236f0 ) at ../../../../openmpi-dev-124-g91e9686/o
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Will PR#249 solve it? If so, we should just go with it as I suspect that is
the long-term solution.

> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet
> wrote:
>
> It looks like we faced a similar issue:
> opal_process_name_t is 64 bits aligned whereas orte_process_name_t is 32 bits
> aligned. If you run an alignment-sensitive cpu such as sparc and you are not
> lucky (so to speak) you can run into this issue.
> i will make a patch for this shortly
>
> Ralph Castain wrote:
>> Afraid this must be something about the Sparc - just ran on a Solaris 11 x86
>> box and everything works fine.
>>
>>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross
>>> wrote:
>>>
>>> Hi Gilles,
>>>
>>> I wanted to explore which function is called, when I call MPI_Init
>>> in a C program, because this function should be called from a Java
>>> program as well. Unfortunately C programs break with a Bus Error
>>> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
>>> the reason why I get no useful backtrace for my Java program.
>>>
>>> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
>>> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>>> ...
>>> (gdb) run -np 1 init_finalize >>> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 >>> init_finalize >>> [Thread debugging using libthread_db enabled] >>> [New Thread 1 (LWP 1)] >>> [New LWP2] >>> [tyr:19240] *** Process received signal *** >>> [tyr:19240] Signal: Bus Error (10) >>> [tyr:19240] Signal code: Invalid address alignment (1) >>> [tyr:19240] Failing at address: 7bd1c10c >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04 >>> /lib/sparcv9/libc.so.1:0xd8b98 >>> /lib/sparcv9/libc.so.1:0xcc70c >>> /lib/sparcv9/libc.so.1:0xcc918 >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c >>> [ Signal 10 (BUS)] >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8 >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374 >>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8 >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20 >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c >>> [tyr:19240] *** End of error message *** >>> -- >>> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal >>> 10 (Bus Error). 
>>> -- >>> [LWP2 exited] >>> [New Thread 2] >>> [Switching to Thread 1 (LWP 1)] >>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to >>> satisfy query >>> (gdb) bt >>> #0 0x7f6173d0 in rtld_db_dlactivity () from >>> /usr/lib/sparcv9/ld.so.1 >>> #1 0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 >>> #2 0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 >>> #3 0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 >>> #4 0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1 >>> #5 0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1 >>> #6 0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1 >>> #7 0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 >>> #8 0x7ec87f60 in vm_close (loader_data=0x0, >>> module=0x7c901fe0) >>> at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212 >>> #9 0x7ec85534 in lt_dlclose (handle=0x100189b50) >>> at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982 >>> #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0) >>> at >>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382 >>> #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0) >>> at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446 >>> #12 0x7ecaa474 in mca_base_component_repository_release ( >>> component=0x7b1236f0 ) >>> at >>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240 >>> #13 0x7ecac774 in mca_base_component_unload ( >>> component=0x7b1236f0 , output_id=-1) >>> at >>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47 >>> #14 0x7ecac808 in mca_base_component_close ( >>> component=0x7b1236f0 , output_id=-1) >>> at >>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_clos
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
It looks like we faced a similar issue:
opal_process_name_t is 64 bits aligned whereas orte_process_name_t is 32 bits
aligned. If you run an alignment-sensitive cpu such as sparc and you are not
lucky (so to speak) you can run into this issue.
i will make a patch for this shortly

Ralph Castain wrote:
> Afraid this must be something about the Sparc - just ran on a Solaris 11 x86
> box and everything works fine.
>
>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross
>> wrote:
>>
>> Hi Gilles,
>>
>> I wanted to explore which function is called, when I call MPI_Init
>> in a C program, because this function should be called from a Java
>> program as well. Unfortunately C programs break with a Bus Error
>> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
>> the reason why I get no useful backtrace for my Java program.
>>
>> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
>> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>> ...
>> (gdb) run -np 1 init_finalize
>> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1
>> init_finalize
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP2]
>> [tyr:19240] *** Process received signal ***
>> [tyr:19240] Signal: Bus Error (10)
>> [tyr:19240] Signal code: Invalid address alignment (1)
>> [tyr:19240] Failing at address: 7bd1c10c
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
>> /lib/sparcv9/libc.so.1:0xd8b98
>> /lib/sparcv9/libc.so.1:0xcc70c
>> /lib/sparcv9/libc.so.1:0xcc918
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
>> [ Signal 10 (BUS)]
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>>
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374 >> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8 >> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20 >> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c >> [tyr:19240] *** End of error message *** >> -- >> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal >> 10 (Bus Error). >> -- >> [LWP2 exited] >> [New Thread 2] >> [Switching to Thread 1 (LWP 1)] >> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to >> satisfy query >> (gdb) bt >> #0 0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1 >> #1 0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 >> #2 0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 >> #3 0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 >> #4 0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1 >> #5 0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1 >> #6 0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1 >> #7 0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 >> #8 0x7ec87f60 in vm_close (loader_data=0x0, >> module=0x7c901fe0) >>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212 >> #9 0x7ec85534 in lt_dlclose (handle=0x100189b50) >>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982 >> #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0) >>at >> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382 >> #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0) >>at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446 >> #12 0x7ecaa474 in mca_base_component_repository_release ( >>component=0x7b1236f0 ) >>at >> 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240 >> #13 0x7ecac774 in mca_base_component_unload ( >>component=0x7b1236f0 , output_id=-1) >>at >> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47 >> #14 0x7ecac808 in mca_base_component_close ( >>component=0x7b1236f0 , output_id=-1) >>at >> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:60 >> #15 0x7ecac8dc in mca_base_components_close (output_id=-1, >>components=0x7f14ba58 , skip=0x0) >>at >> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:86 >> #16 0x7ecac844 in mca_base_fram