[OMPI devel] Some questions about checkpoint/restart (3)

2010-03-12 Thread Takayuki Seki

My third question is as follows:

(3) If messages with the same conditions (tag, count, datatype, communicator)
exist in two or more lists, assert(need <= found) fails in the send_msg_details
function.
I built Open MPI with the "--enable-debug" configure option.

Framework : crcp
Component : bkmrk
The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
The function name : send_msg_details,do_recv_msg_detail_check_drain

Here's the code that causes the problem:

#define BLOCKNUM 1
#define SLPTIM 60

  if (rank == 0) {
    MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
    MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
    MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    printf(" rank=%d sleep start \n",rank); fflush(stdout);
    sleep(SLPTIM); /** take checkpoint at this point **/
    printf(" rank=%d sleep end   \n",rank); fflush(stdout);
    MPI_Wait(&sreq,&ssts);
  }
  else {  /* rank 1 */
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    printf(" rank=%d sleep start \n",rank); fflush(stdout);
    sleep(SLPTIM); /** take checkpoint at this point **/
    printf(" rank=%d sleep end   \n",rank); fflush(stdout);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
  }

* Take a checkpoint while Process 0 and Process 1 are in the sleep function.

* Here are the tag, element count, datatype, and communicator of the message:
  message tag=100, number of elements=1, datatype=MPI_INT,
  communicator=MPI_COMM_WORLD

* Send side (Rank 0):
  Information about messages with the same conditions exists in both send_list
  and isend_list.

* Recv side (Rank 1):
  Information about the messages exists in irecv_list only.
  I suspect there is a problem with message matching in the
  do_recv_msg_detail_check_drain function.

* Result
 rank=0 size=2
 rank=1 size=2
 rank=0 sleep start
 rank=1 sleep start
 rank=0 sleep end
 rank=1 sleep end
t_mpi_question-3.out: ../../../../../ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c:5471: send_msg_details: Assertion `need <= found' failed.
[camel0:24606] *** Process received signal ***
[camel0:24606] Signal: Aborted (6)
[camel0:24606] Signal code:  (-6)
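
To make the reported condition concrete, here is a toy model in C. It is not the bkmrk code. It assumes (the post does not confirm this) that the bookkeeping keeps one entry per message signature with a message count, and that a check like need <= found can fail when only one of the lists is searched. All names (msg_sig_t, msg_entry_t, count_in_list, count_in_both) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message signature: the fields the post lists as identical. */
typedef struct {
    int tag;      /* message tag, e.g. 100 */
    int count;    /* number of elements, e.g. 1 */
    int peer;     /* destination rank */
    int comm_id;  /* stand-in for the communicator identity */
} msg_sig_t;

/* Hypothetical bookkeeping entry: a signature plus how many messages share it. */
typedef struct {
    msg_sig_t sig;
    int       num_msgs;
} msg_entry_t;

/* Two signatures match when every field is equal. */
int sig_equal(const msg_sig_t *a, const msg_sig_t *b)
{
    return a->tag == b->tag && a->count == b->count &&
           a->peer == b->peer && a->comm_id == b->comm_id;
}

/* Count the messages in ONE list that carry the given signature. */
int count_in_list(const msg_entry_t *list, size_t len, const msg_sig_t *sig)
{
    int found = 0;
    assert(sig != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (sig_equal(&list[i].sig, sig)) {
            found += list[i].num_msgs;
        }
    }
    return found;
}

/* Count across both the blocking and the nonblocking list, so that
 * messages with one signature are all accounted for. */
int count_in_both(const msg_entry_t *send_list, size_t send_len,
                  const msg_entry_t *isend_list, size_t isend_len,
                  const msg_sig_t *sig)
{
    return count_in_list(send_list, send_len, sig) +
           count_in_list(isend_list, isend_len, sig);
}
```

In the reproducer above, 3 MPI_Send calls and 4 MPI_Isend calls share one signature, so in this model a search limited to a single list cannot account for all 7 messages, while the combined count can.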


-bash-3.2$ cat t_mpi_question-3.c
#include 
#include 
#include 
#include 
#include 
#include 
#include "mpi.h"

#define BLOCKNUM 1
#define SLPTIM 60

int main(int ac,char **av)
{
  int i;
  int rank,size;
  int *wbuf;
  int *rbuf;
  MPI_Status rsts,ssts;
  MPI_Request rreq,sreq;

  MPI_Init(&ac,&av);

  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  if (size != 2) { MPI_Abort(MPI_COMM_WORLD,-1); }

  rbuf= (int *)malloc(BLOCKNUM * sizeof(int));
  wbuf= (int *)malloc(BLOCKNUM * sizeof(int));
  if ((rbuf == NULL)||(wbuf == NULL)) { MPI_Abort(MPI_COMM_WORLD,-1); }

  printf(" rank=%d size=%d \n",rank,size); fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);

  if (rank == 0) {
for (i=0;i

[OMPI devel] Some questions about checkpoint/restart (4)

2010-03-12 Thread Takayuki Seki

My fourth question is as follows:

(4) The communicator pointer variables in
the ompi_crcp_bkmrk_pml_drain_message_ref_t structure and
the ompi_crcp_bkmrk_pml_traffic_message_ref_t structure

The bkmrk component refers to memory areas that have already been freed by the
datatype framework.

Framework : crcp
Component : bkmrk
The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h

The source file   : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
The function name : do_send_msg_detail, etc..

Here's the code that may cause the problem:

#define BLOCKNUM 1048576
#define SLPTIM 60

  if (rank == 0) {
    MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,commforcomm,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Comm_free(&commforcomm);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Isend(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD,&sreq);
    MPI_Wait(&sreq,&ssts);
    MPI_Send(wbuf,BLOCKNUM,MPI_INT,1,100,MPI_COMM_WORLD); /** take checkpoint at this point **/
  }
  else {  /* rank 1 */
    MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,commforcomm,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Comm_free(&commforcomm);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
    printf(" rank=%d sleep start \n",rank); fflush(stdout);
    sleep(SLPTIM); /** take checkpoint at this point **/
    printf(" rank=%d sleep end   \n",rank); fflush(stdout);
    MPI_Irecv(rbuf,BLOCKNUM,MPI_INT,0,100,MPI_COMM_WORLD,&rreq);
    MPI_Wait(&rreq,&rsts);
  }

* Take a checkpoint while Process 0 is in the MPI_Send function and Process 1
  is in the sleep function.

* When the checkpoint is taken, the "commforcomm" communicator has already been
  freed. Nevertheless, the checkpoint action still refers to information about
  "commforcomm" through these structures.

  The memory pointed to by the "commforcomm" communicator pointer has already
  been freed, so the values at that address may already be corrupted.

struct ompi_crcp_bkmrk_pml_drain_message_ref_t {
   .
/** Communicator pointer */
ompi_communicator_t* comm;
   .

}

struct ompi_crcp_bkmrk_pml_traffic_message_ref_t {
   .
/** Communicator pointer */
ompi_communicator_t* comm;
   .
}

static int do_send_msg_detail( ... ) {
 .
comm_my_rank  = ompi_comm_rank(msg_ref->comm);
 .
}

* I think these structures should hold the communicator information itself
  locally, that is, their own copies of c_contextid, c_my_rank, etc.
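
The suggestion above could look roughly like the following. This is a sketch with stand-in types, not a patch against crcp_bkmrk_pml.h: mock_comm_t stands in for ompi_communicator_t, and comm_snapshot_t, mock_msg_ref_t, and msg_ref_init are hypothetical names. Only the field names c_contextid and c_my_rank come from the post; their types here are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ompi_communicator_t, reduced to the two fields named in
 * the post. The field types are assumptions. */
typedef struct {
    uint32_t c_contextid;
    int      c_my_rank;
} mock_comm_t;

/* Hypothetical by-value copy of the communicator fields that the
 * checkpoint code needs later. */
typedef struct {
    uint32_t contextid;
    int      my_rank;
} comm_snapshot_t;

/* Hypothetical message reference that stores the snapshot instead of a
 * pointer to the communicator. */
typedef struct {
    comm_snapshot_t comm_info;
    int             tag;
} mock_msg_ref_t;

/* Copy the needed fields while the communicator is still valid. */
mock_msg_ref_t msg_ref_init(const mock_comm_t *comm, int tag)
{
    mock_msg_ref_t ref;
    assert(comm != NULL);
    ref.comm_info.contextid = comm->c_contextid;
    ref.comm_info.my_rank   = comm->c_my_rank;
    ref.tag                 = tag;
    return ref;
}

/* Later lookups read the copy, so they do not depend on the lifetime
 * of the communicator object. */
int msg_ref_my_rank(const mock_msg_ref_t *ref)
{
    assert(ref != NULL);
    return ref->comm_info.my_rank;
}
```

With this layout, a call such as ompi_comm_rank(msg_ref->comm) in do_send_msg_detail would be replaced by a read of the stored copy, so freeing the communicator before the checkpoint would no longer leave the reference dangling. The cost is that the copy must be taken at the moment the message reference is created.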


-bash-3.2$ cat t_mpi_question-4.c
#include 
#include 
#include 
#include 
#include 
#include 
#include "mpi.h"

#define BLOCKNUM 1048576
#define SLPTIM 60

int main(int ac,char **av)
{
  int i;
  int rank,size;
  int *wbuf;
  int *rbuf;
  MPI_Status rsts,ssts;
  MPI_Request rreq,sreq;
  MPI_Comm commforcomm;

  MPI_Init(&ac,&av);

  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  if (size != 2) { MPI_Abort(MPI_COMM_WORLD,-1); }

  rbuf= (int *)malloc(BLOCKNUM * sizeof(int));
  wbuf= (int *)malloc(BLOCKNUM * sizeof(int));
  if ((rbuf == NULL)||(wbuf == NULL)) { MPI_Abort(MPI_COMM_WORLD,-1); }

  printf(" rank=%d size=%d \n",rank,size); fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);

  if (rank == 0) {
MPI_Comm_dup(MPI_COMM_WORLD,&commforcomm);
for (i=0;i

[OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Joshua Hursey
I noticed the following build error on the OMPI trunk (r22821) on IU's Odin 
machine:
  make[3]: *** No rule to make target `mpi_portable_platform.h', needed by 
`all-am'.  Stop.

I took a quick pass through the svn commit log and did not see anything that 
would have broken this. Any thoughts on what could be causing this?

-- Josh


Re: [OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Paul H. Hargrove

Josh,
 In r22619 mpi_portable_platform.h.in was replaced by 
mpi_portable_platform.h and Makefile.am changed accordingly.
 So, my best guess is that you might just need to rerun autogen.sh or 
that your checkout is somehow missing mpi_portable_platform.h

-Paul

Joshua Hursey wrote:

> I noticed the following build error on the OMPI trunk (r22821) on IU's Odin
> machine:
>   make[3]: *** No rule to make target `mpi_portable_platform.h', needed by
> `all-am'.  Stop.
>
> I took a quick pass through the svn commit log and did not see anything that
> would have broken this. Any thoughts on what could be causing this?
>
> -- Josh
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory 



Re: [OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Rainer Keller
Hi Josh,
this is caused by moving mpi_portable_platform.h.in file in two steps from 
ompi/include to opal/include -- in order to be used by opal_info and 
orte_info.

You need to rerun autogen.sh after svn up to at least r22789.

Hope this helps.

Best regards,
Rainer


On Friday 12 March 2010 04:17:41 pm Joshua Hursey wrote:
> I noticed the following build error on the OMPI trunk (r22821) on IU's Odin
>  machine: make[3]: *** No rule to make target `mpi_portable_platform.h',
>  needed by `all-am'.  Stop.
> 
> I took a quick pass through the svn commit log and did not see anything
>  that would have broken this. Any thoughts on what could be causing this?
> 
> -- Josh

-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink



Re: [OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Joshua Hursey
I think I figured it out. The error was coming from a Mercurial branch cloned 
from my internal HG+SVN branch. HG previously marked "mpi_portable_platform.h" 
as a file to not include in rev. control since it was auto-generated. Now that 
it is not auto-generated, it needs to be included in the rev. control.

The fix (in case anyone hits the same problem) is to remove 
"mpi_portable_platform.h" from the .hgignore in your HG+SVN, then 'hg 
addremove', 'hg commit'. Then things are better.

Thanks for the pointers to the rev #, that helped.

Cheers,
Josh


On Mar 12, 2010, at 4:42 PM, Rainer Keller wrote:

> Hi Josh,
> this is caused by moving mpi_portable_platform.h.in file in two steps from 
> ompi/include to opal/include -- in order to be used by opal_info and 
> orte_info.
> 
> You need to autogen.sh again after svn up to at least r22789.
> 
> Hope, this helps?
> 
> Best regards,
> RAiner




Re: [OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Jeff Squyres
Josh --

Do you use the contrib/hg/build-hgignore.pl script?  It examines all the 
svn:ignore files to build up a .hgignore file.  I run this every time I svn up 
on my hg+svn tree.


On Mar 12, 2010, at 3:06 PM, Joshua Hursey wrote:

> I think I figured it out. The error was coming from a Mercurial branch cloned 
> from my internal HG+SVN branch. HG previously marked 
> "mpi_portable_platform.h" as a file to not include in rev. control since it 
> was auto-generated. Now that it is not auto-generated, it needs to be 
> included in the rev. control.
> 
> The fix (in case anyone hits the same problem) is to remove 
> "mpi_portable_platform.h" from the .hgignore in your HG+SVN, then 'hg 
> addremove', 'hg commit'. Then things are better.
> 
> Thanks for the pointers to the rev #, that helped.
> 
> Cheers,
> Josh


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Build issue: mpi_portable_platform.h

2010-03-12 Thread Joshua Hursey
I use it, but I only ran it once when I setup the HG+SVN. I'll start refreshing 
it more frequently.

Thanks for the tip,
Josh

On Mar 12, 2010, at 6:19 PM, Jeff Squyres wrote:

> Josh --
> 
> Do you use the contrib/hg/build-hgignore.pl script?  It examines all the 
> svn:ignore files to build up a .hgignore file.  I run this every time I svn 
> up on my hg+svn tree.