[OMPI devel] Some questions about checkpoint/restart (3)
The 3rd question is as follows:

(3) If messages with the same attributes (tag, number of elements, data type, and communicator) exist in two or more lists, assert(need <= found) fails in the send_msg_details function.

I built Open MPI with the "--enable-debug" configure option.

Framework         : crcp
Component         : bkmrk
The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
The function name : send_msg_details, do_recv_msg_detail_check_drain

Here's the code that causes the problem:

    #define BLOCKNUM 1
    #define SLPTIM 60

    if (rank == 0) {
        MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
        MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
        MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        printf(" rank=%d sleep start \n",rank); fflush(stdout);
        sleep(SLPTIM); /** take checkpoint at this point **/
        printf(" rank=%d sleep end \n",rank); fflush(stdout);
        MPI_Wait(&sreq,&ssts);
    } else { /* rank 1 */
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        printf(" rank=%d sleep start \n",rank); fflush(stdout);
        sleep(SLPTIM); /** take checkpoint at this point **/
        printf(" rank=%d sleep end \n",rank); fflush(stdout);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
    }

* Take the checkpoint while Process 0 and Process 1 are both in the sleep function.

* Here are the tag, number of elements, data type, and communicator of the message:
  message tag=100, number of elements=1, data type=MPI_INT, communicator=MPI_COMM_WORLD

* Send side (Rank 0):
  Information about messages with these same attributes exists in both send_list and isend_list.

* Recv side (Rank 1):
  Information about the messages exists in irecv_list only.

I wonder whether there is a problem with message matching in the do_recv_msg_detail_check_drain function.

* Result

    rank=0 size=2
    rank=1 size=2
    rank=0 sleep start
    rank=1 sleep start
    rank=0 sleep end
    rank=1 sleep end
    t_mpi_question-3.out: ../../../../../ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c:5471: send_msg_details: Assertion `need <= found' failed.
    [camel0:24606] *** Process received signal ***
    [camel0:24606] Signal: Aborted (6)
    [camel0:24606] Signal code: (-6)

-bash-3.2$ cat t_mpi_question-3.c

    #include
    #include
    #include
    #include
    #include
    #include
    #include "mpi.h"

    #define BLOCKNUM 1
    #define SLPTIM 60

    int main(int ac,char **av)
    {
        int i;
        int rank,size;
        int *wbuf;
        int *rbuf;
        MPI_Status rsts,ssts;
        MPI_Request rreq,sreq;

        MPI_Init(&ac,&av);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        MPI_Comm_size(MPI_COMM_WORLD,&size);
        if (size != 2) { MPI_Abort(MPI_COMM_WORLD,-1); }

        rbuf= (int *)malloc(BLOCKNUM * sizeof(int));
        wbuf= (int *)malloc(BLOCKNUM * sizeof(int));
        if ((rbuf == NULL)||(wbuf == NULL)) { MPI_Abort(MPI_COMM_WORLD,-1); }

        printf(" rank=%d size=%d \n",rank,size); fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) {
            for (i=0;i
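The counting concern in question (3) can be illustrated with a minimal sketch. This is not the bkmrk code: `msg_sig_t`, `sig_equal`, `count_matches`, and `count_across_lists` are hypothetical names, and the idea that matches have to be accumulated over every list before being compared against `need` is an assumption about what the assertion expects, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message signature; these fields are illustrative and are
 * not the fields used by the bkmrk component. */
typedef struct {
    int peer;     /* destination rank */
    int tag;      /* message tag */
    int count;    /* number of elements */
    int comm_id;  /* stand-in for a communicator identifier */
} msg_sig_t;

/* Return 1 when two signatures describe "the same" message. */
static int sig_equal(const msg_sig_t *a, const msg_sig_t *b)
{
    return a->peer == b->peer && a->tag == b->tag &&
           a->count == b->count && a->comm_id == b->comm_id;
}

/* Count the entries of a single list that match sig. */
static size_t count_matches(const msg_sig_t *list, size_t len,
                            const msg_sig_t *sig)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (sig_equal(&list[i], sig)) {
            n++;
        }
    }
    return n;
}

/* Accumulate matches over several lists (for example a send list and an
 * isend list), so the total can be compared against a required count. */
static size_t count_across_lists(const msg_sig_t *const *lists,
                                 const size_t *lens, size_t nlists,
                                 const msg_sig_t *sig)
{
    size_t found = 0;
    assert(lists != NULL && lens != NULL && sig != NULL);
    for (size_t k = 0; k < nlists; k++) {
        found += count_matches(lists[k], lens[k], sig);
    }
    return found;
}
```

With three matching entries in one list and four in another, as in the test program, `count_across_lists` reports seven, while either list on its own reports fewer. A check that consults only one list could therefore come up short.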
[OMPI devel] Some questions about checkpoint/restart (4)
The 4th question is as follows:

(4) The communicator pointer variables in the ompi_crcp_bkmrk_pml_drain_message_ref_t structure and the ompi_crcp_bkmrk_pml_traffic_message_ref_t structure

Memory areas that were freed by the datatype framework are referenced in the bkmrk component.

Framework         : crcp
Component         : bkmrk
The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h
The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
The function name : do_send_msg_detail, etc.

Here's the code that may cause the problem:

    #define BLOCKNUM 1048576
    #define SLPTIM 60

    if (rank == 0) {
        MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Comm_free(&commforcomm);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
        MPI_Wait(&sreq,&ssts);
        MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD); /** take checkpoint at this point **/
    } else { /* rank 1 */
        MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Comm_free(&commforcomm);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
        printf(" rank=%d sleep start \n",rank); fflush(stdout);
        sleep(SLPTIM); /** take checkpoint at this point **/
        printf(" rank=%d sleep end \n",rank); fflush(stdout);
        MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
        MPI_Wait(&rreq,&rsts);
    }

* Take the checkpoint while Process 0 is in the MPI_Send function and Process 1 is in the sleep function.

* When the checkpoint is taken, the "commforcomm" communicator has already been freed.
  Even so, information about the "commforcomm" communicator is referenced through these structures during the checkpoint action.
  The memory pointed to by the "commforcomm" communicator pointer variable has already been freed, and the values at that address may already be corrupted.

    struct ompi_crcp_bkmrk_pml_drain_message_ref_t {
        .
        /** Communicator pointer */
        ompi_communicator_t* comm;
        .
    }

    struct ompi_crcp_bkmrk_pml_traffic_message_ref_t {
        .
        /** Communicator pointer */
        ompi_communicator_t* comm;
        .
    }

    static int do_send_msg_detail( ... )
    {
        .
        comm_my_rank = ompi_comm_rank(msg_ref->comm);
        .
    }

* I think these structures should hold the communicator information itself locally, that is, c_contextid, c_my_rank, etc.

-bash-3.2$ cat t_mpi_question-4.c

    #include
    #include
    #include
    #include
    #include
    #include
    #include "mpi.h"

    #define BLOCKNUM 1048576
    #define SLPTIM 60

    int main(int ac,char **av)
    {
        int i;
        int rank,size;
        int *wbuf;
        int *rbuf;
        MPI_Status rsts,ssts;
        MPI_Request rreq,sreq;
        MPI_Comm commforcomm;

        MPI_Init(&ac,&av);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        MPI_Comm_size(MPI_COMM_WORLD,&size);
        if (size != 2) { MPI_Abort(MPI_COMM_WORLD,-1); }

        rbuf= (int *)malloc(BLOCKNUM * sizeof(int));
        wbuf= (int *)malloc(BLOCKNUM * sizeof(int));
        if ((rbuf == NULL)||(wbuf == NULL)) { MPI_Abort(MPI_COMM_WORLD,-1); }

        printf(" rank=%d size=%d \n",rank,size); fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) {
            MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
            for (i=0;i
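The suggestion in question (4), keeping the communicator information locally instead of a pointer, could look roughly like the sketch below. It is a stand-alone illustration under stated assumptions: `fake_comm_t` is a stand-in for ompi_communicator_t that models only the two fields named in the message (their types here are guesses), and `msg_ref_snapshot_t`, `msg_ref_snapshot_init`, and `record_then_free` are hypothetical names, not part of Open MPI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ompi_communicator_t. Only c_contextid and c_my_rank are
 * modelled, and the field types are assumptions. */
typedef struct {
    uint32_t c_contextid;
    int      c_my_rank;
} fake_comm_t;

/* Hypothetical message reference that stores copies of the communicator
 * fields instead of a pointer to the communicator. */
typedef struct {
    uint32_t comm_contextid;
    int      comm_my_rank;
} msg_ref_snapshot_t;

/* Copy the needed fields while the communicator is still alive. */
static void msg_ref_snapshot_init(msg_ref_snapshot_t *ref,
                                  const fake_comm_t *comm)
{
    assert(ref != NULL && comm != NULL);
    ref->comm_contextid = comm->c_contextid;
    ref->comm_my_rank   = comm->c_my_rank;
}

/* Simulate the sequence from the test program: record a message on a
 * communicator, then free the communicator before the record is used. */
static msg_ref_snapshot_t record_then_free(uint32_t contextid, int my_rank)
{
    msg_ref_snapshot_t ref = {0, 0};
    fake_comm_t *comm = malloc(sizeof(*comm));
    if (comm == NULL) {
        return ref;
    }
    comm->c_contextid = contextid;
    comm->c_my_rank   = my_rank;

    msg_ref_snapshot_init(&ref, comm);
    free(comm);  /* plays the role of MPI_Comm_free in this sketch */

    /* ref no longer depends on the freed memory. */
    return ref;
}
```

After `record_then_free` returns, the snapshot still holds the context id and rank that were copied before the free, so a later checkpoint step reading those fields would not touch released memory. The trade-off is that the copies can go stale if a field could change during the communicator's lifetime.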
[OMPI devel] Build issue: mpi_portable_platform.h
I noticed the following build error on the OMPI trunk (r22821) on IU's Odin machine: make[3]: *** No rule to make target `mpi_portable_platform.h', needed by `all-am'. Stop. I took a quick pass through the svn commit log and did not see anything that would have broken this. Any thoughts on what could be causing this? -- Josh
Re: [OMPI devel] Build issue: mpi_portable_platform.h
Josh,

In r22619 mpi_portable_platform.h.in was replaced by mpi_portable_platform.h and Makefile.am changed accordingly. So, my best guess is that you might just need to rerun autogen.sh or that your checkout is somehow missing mpi_portable_platform.h

-Paul

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
Re: [OMPI devel] Build issue: mpi_portable_platform.h
Hi Josh,

this is caused by moving the mpi_portable_platform.h.in file in two steps from ompi/include to opal/include, so that it can be used by opal_info and orte_info.

You need to run autogen.sh again after svn up to at least r22789.

Hope this helps.

Best regards,
Rainer

--
Rainer Keller, PhD                  Tel: +1 (865) 241-6293
Oak Ridge National Lab              Fax: +1 (865) 241-4811
PO Box 2008 MS 6164                 Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008            AIM/Skype: rusraink
Re: [OMPI devel] Build issue: mpi_portable_platform.h
I think I figured it out. The error was coming from a Mercurial branch cloned from my internal HG+SVN branch. HG previously marked "mpi_portable_platform.h" as a file to not include in rev. control since it was auto-generated. Now that it is not auto-generated, it needs to be included in the rev. control.

The fix (in case anyone hits the same problem) is to remove "mpi_portable_platform.h" from the .hgignore in your HG+SVN, then 'hg addremove', 'hg commit'. Then things are better.

Thanks for the pointers to the rev #, that helped.

Cheers,
Josh
Re: [OMPI devel] Build issue: mpi_portable_platform.h
Josh --

Do you use the contrib/hg/build-hgignore.pl script? It examines all the svn:ignore files to build up a .hgignore file. I run this every time I svn up on my hg+svn tree.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Build issue: mpi_portable_platform.h
I use it, but I only ran it once when I set up the HG+SVN. I'll start refreshing it more frequently.

Thanks for the tip,
Josh