That is interesting. I cannot think of any reason why this would cause a 
problem only in Open MPI. popen() is similar to fork()/system(), so you have 
to be careful with interconnects that do not play nice with fork(), like 
openib. But since it looks like you are excluding openib, that should not be 
the problem.

I wonder if this has something to do with the way we use BLCR (maybe we need 
to pass additional parameters to cr_checkpoint()). When the process fails, are 
there any messages in the system logs from BLCR indicating an issue it 
encountered? It is common for BLCR to post a 'socket open' warning, but that 
is expected/normal, since we leave TCP sockets open in most cases as an 
optimization. I am wondering if there is a warning about the popen'ed process.

Personally, I will not have an opportunity to look into this in more detail 
until probably mid-April. :/

Let me know what you find, and maybe we can sort out what is happening on the 
list.

-- Josh

On Mar 29, 2010, at 2:28 PM, Jean Potsam wrote:

> Hi Josh/All,
> I just tested a simple C application with BLCR and it worked 
> fine.
>  
> ##########################################
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <signal.h>
> 
> char * getprocessid() 
> {
>     FILE * read_fp;
>     char buffer[BUFSIZ + 1];
>     int chars_read;
>     char * buffer_data = "12345";
> 
>     memset(buffer, '\0', sizeof(buffer));
>     read_fp = popen("uname -a", "r");
>     /*
>      ...
>     */
>     return buffer_data;
> }
>  
> int main(int argc, char ** argv)
> {
>     int rank;
>     int size;
>     char * thedata;
>     int n = 0;
> 
>     thedata = getprocessid();
>     printf(" the data is %s", thedata);
> 
>     while (n < 10)
>     {
>         printf("value is %d\n", n);
>         n++;
>         sleep(1);
>     }
>     printf("bye\n");
> }
>  
>  
> jean@sun32:/tmp$ cr_run ./pipetest3 &
> [1] 31807
> jean@sun32:~$  the data is 12345value is 0
> value is 1
> value is 2
> ...
> value is 9
> bye
>  
> jean@sun32:/tmp$ cr_checkpoint 31807
>  
> jean@sun32:/tmp$ cr_restart context.31807
> value is 7
> value is 8
> value is 9
> bye
>  
> ##############################################
>  
>  
> It looks like it is more to do with Open MPI. Any ideas from your side?
>  
> Thank you.
>  
> Kind regards,
>  
> Jean.
>  
>  
>  
> 
> 
> --- On Mon, 29/3/10, Josh Hursey <jjhur...@open-mpi.org> wrote:
> 
> From: Josh Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] Segmentation fault (11)
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Monday, 29 March, 2010, 16:08
> 
> I wonder if this is a bug in BLCR (since the segv stack is in the BLCR 
> thread). Can you try a non-MPI version of this application that uses 
> popen(), and see if BLCR properly checkpoints/restarts it?
> 
> If so, we can start to see what Open MPI might be doing to confuse things, 
> but I suspect that this might be a bug with BLCR. Either way let us know what 
> you find out.
> 
> Cheers,
> Josh
> 
> On Mar 27, 2010, at 6:17 AM, jody wrote:
> 
> > I'm not sure if this is the cause of your problems:
> > You define the constant BUFFER_SIZE, but in the code you use a constant 
> > called BUFSIZ...
> > Jody
> > 
> > 
> > On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam <jeanpot...@yahoo.co.uk> 
> > wrote:
> > Dear All,
> > I am having a problem with Open MPI. I have installed Open MPI 1.4 and 
> > BLCR 0.8.1.
> > 
> > I have written a small mpi application as follows below:
> > 
> > #######################
> > #include <unistd.h>
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <fcntl.h>
> > #include <limits.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <mpi.h>
> > #include <signal.h>
> > 
> > #define BUFFER_SIZE PIPE_BUF
> > 
> > char * getprocessid()
> > {
> >     FILE * read_fp;
> >     char buffer[BUFSIZ + 1];
> >     int chars_read;
> >     char * buffer_data = "12345";
> > 
> >     memset(buffer, '\0', sizeof(buffer));
> >     read_fp = popen("uname -a", "r");
> >     /*
> >      ...
> >     */
> >     return buffer_data;
> > }
> > 
> > int main(int argc, char ** argv)
> > {
> >     MPI_Status status;
> >     int rank;
> >     int size;
> >     char * thedata;
> > 
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > 
> >     thedata = getprocessid();
> >     printf(" the data is %s", thedata);
> > 
> >     MPI_Finalize();
> > }
> > ############################
> > 
> > I get the following result:
> > 
> > #######################
> > jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
> > jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib  pipetest2
> > [sun32:19211] *** Process received signal ***
> > [sun32:19211] Signal: Segmentation fault (11)
> > [sun32:19211] Signal code: Address not mapped (1)
> > [sun32:19211] Failing at address: 0x4
> > [sun32:19211] [ 0] [0xb7f3c40c]
> > [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
> > [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) 
> > [0xb7a5925a]
> > [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
> > [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
> > [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
> > [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
> > [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
> > [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
> > [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
> > [sun32:19211] [10] pipetest2 [0x8048761]
> > [sun32:19211] *** End of error message ***
> > #####################################################
> > 
> > 
> > However, if I compile the application using gcc, it works fine. The 
> > problem arises with:
> >   read_fp = popen("uname -a", "r");
> > 
> > Does anyone have an idea how to resolve this problem?
> > 
> > Many thanks
> > 
> > Jean
> > 
> > 
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 

