These are the backtraces from the two core files.

gdb hello core.14653
#0 0x000000300bc0cbc0 in ?? ()
#1 0x00002aaaab09d0fb in ?? ()
#2 0x00007fff6a782920 in ?? ()
#3 0x00002aaaaae3d348 in ?? ()
#4 0x00007fff6a7827b0 in ?? ()
#5 0x0000003806e6bcb4 in ?? ()
#6 0x0000000000000000 in ?? ()

gdb hello core.14654
#0 0x000000300bc0cbc0 in ?? ()
#1 0x00002aaaab09d0fb in ?? ()
#2 0x00007fff92eb3040 in ?? ()
#3 0x00002aaaaae3d348 in ?? ()
#4 0x00007fff92eb2ed0 in ?? ()
#5 0x0000003806e6bcb4 in ?? ()
#6 0x0000000000000000 in ?? ()

Please let me know if any other info is required.

On Thu, Oct 9, 2008 at 2:01 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> I cannot interpret the raw core files since they are specific to your system
> and setup. Can you run it through gdb and get a backtrace? "gdb hello
> core.1234" then use the 'bt' command from inside gdb.
>
> That will help me start to focus in on the problem.
>
> Cheers,
> Josh
>
> On Oct 8, 2008, at 10:22 PM, arun dhakne wrote:
>
>> I have configured with the additional flags (--enable-ft-thread
>> --enable-mpi-threads) but there is no change in behaviour; it still
>> gives a seg fault.
>> open mpi version:
>> Open MPI: 1.3a1r19685
>>
>> blcr version:
>> version 0.7.3
>>
>>
>> The core file is attached.
>> hello.c, the sample MPI program whose core was dumped, is also attached.
>>
>> ~]$ ompi-restart ompi_global_snapshot_11219.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 11288 on node
>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>> fault).
>> --------------------------------------------------------------------------
>> 2 total processes killed (some possibly by mpirun during cleanup)
>>
>>
>> Best,
>>
>>
>> On Mon, Oct 6, 2008 at 6:44 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>>
>>> The installation looks ok, though I'm not sure what is causing the
>>> segfault of the restarted process. Two things to try.
>>> First, can you send me a backtrace from the core file that is
>>> generated from the segmentation fault? That will provide insight
>>> into what is causing it.
>>>
>>> Second, you may try to enable the C/R thread, which allows a checkpoint
>>> to progress when an application is in a computation loop instead of only
>>> when it is in the MPI library. To do so, configure with these additional
>>> flags:
>>> --enable-ft-thread --enable-mpi-threads
>>>
>>> What version of Open MPI are you using? What version of BLCR?
>>>
>>> Best,
>>> Josh
>>>
>>> On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:
>>>
>>>> Hi all,
>>>>
>>>> This is the procedure I have followed to install Open MPI. Is there
>>>> some installation or environment setting problem here?
>>>> An Open MPI program with 4 processes is run across 2 dual-core Intel
>>>> machines, with 2 processes running on each machine.
>>>>
>>>> ompi-checkpoint is successful but ompi-restart fails with the following
>>>> error:
>>>>
>>>>
>>>> $:> ompi-restart ompi_global_snapshot_6045.ckpt
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 6372 on node
>>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>>> fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Open-mpi installation steps:
>>>> ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr
>>>> --with-blcr=/usr/lib64 --enable-debug
>>>> make
>>>> make install
>>>>
>>>>
>>>>
>>>> export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
>>>> export PATH=$HOME/.openmpi/bin:$PATH
>>>>
>>>> NOTE: blcr is installed as a module
>>>> $:> lsmod | grep blcr
>>>>
>>>> blcr 117892 0
>>>> blcr_vmadump 58264 1 blcr
>>>> blcr_imports 46080 2 blcr,blcr_vmadump
>>>>
>>>> Please let me know if there is a problem with the above procedure. Thanks a
>>>> lot for your time.
>>>>
>>>> Best.
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: arun dhakne <arundha...@gmail.com>
>>>> Date: Tue, Sep 30, 2008 at 12:52 AM
>>>> Subject: ompi-restart issue : ompi-restart doesn't work across nodes
>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> I went through some previous ompi-restart issues but couldn't
>>>> find anything similar to this problem.
>>>>
>>>> I have installed blcr and configured open-mpi 'openmpi-1.3a1r19645'.
>>>>
>>>> i) If the sample MPI program (say, np 4 on a single machine, that is,
>>>> without any hostfile) is run and I try to checkpoint it, the checkpoint
>>>> succeeds and even ompi-restart works in this case.
>>>>
>>>> ii) If the sample MPI program is run across, say, 2 different nodes, the
>>>> checkpoint succeeds BUT ompi-restart throws the following
>>>> error:
>>>>
>>>> $ ompi-restart ompi_global_snapshot_7604.ckpt
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 3 with PID 9590 on node
>>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>>> fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Please let me know if more information is needed.
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Arun U. Dhakne
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Arun U. Dhakne
>> Graduate Student
>> Computer Science and Engineering Dept.
>> State University of New York at Buffalo
>> <core.tar.gz><hello.c>
>

--
Thanks and Regards,
Arun U. Dhakne
Graduate Student
Computer Science and Engineering Dept.
State University of New York at Buffalo
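
The backtrace step quoted in this thread ("gdb hello core.1234", then 'bt') can also be run non-interactively, which is convenient when there is one core file per rank. What follows is only a minimal sketch: the function name bt_all and the BIN and GDB variables are hypothetical names chosen for illustration, and the binary is assumed to be ./hello in the current directory. It relies on gdb's -batch and -ex options.

```shell
#!/bin/sh
# Sketch: print a gdb backtrace for every core file named on the command line.
# BIN and GDB are hypothetical override variables, not part of Open MPI or BLCR.
BIN="${BIN:-./hello}"   # assumed location of the MPI binary that dumped core
GDB="${GDB:-gdb}"       # debugger command to invoke

bt_all() {
    for core in "$@"; do
        # Header line so the output for each core file is easy to tell apart.
        echo "== $core =="
        # -batch exits after running the commands; -ex bt runs 'bt' once.
        "$GDB" -batch -ex bt "$BIN" "$core"
    done
}

# With no arguments this loop body never runs, so the call is harmless.
bt_all "$@"
```

Saved as, for example, bt_all.sh (a made-up file name), it could be invoked as "sh bt_all.sh core.14653 core.14654" and the combined output pasted into a reply.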