Re: [OMPI devel] === CREATE FAILURE (trunk) ===

2009-03-23 Thread Ralph Castain
Wow, sometimes I even amaze myself! Two for two on create failures in  
a single night!!


:-)

Anyway, both are fixed or shortly will be. However, there will be no  
MTT runs tonight as neither branch successfully generated a tarball.


Ralph


On Mar 23, 2009, at 7:30 PM, MPI Team wrote:



ERROR: Command returned a non-zero exit status (trunk):
  make distcheck

Start time: Mon Mar 23 21:22:33 EDT 2009
End time:   Mon Mar 23 21:30:20 EDT 2009

==============================================================================
{ test ! -d openmpi-1.4a1r20848 || { find openmpi-1.4a1r20848 -type d ! -perm -200 -exec chmod u+w {} ';' && rm -fr openmpi-1.4a1r20848; }; }
test -d openmpi-1.4a1r20848 || mkdir openmpi-1.4a1r20848
list='config contrib opal orte ompi test'; for subdir in $list; do \
  if test "$subdir" = .; then :; else \
    test -d "openmpi-1.4a1r20848/$subdir" \
    || /bin/mkdir -p "openmpi-1.4a1r20848/$subdir" \
    || exit 1; \
    distdir=`CDPATH="${ZSH_VERSION+.}:" && cd openmpi-1.4a1r20848 && pwd`; \
    top_distdir=`CDPATH="${ZSH_VERSION+.}:" && cd openmpi-1.4a1r20848 && pwd`; \
    (cd $subdir && \
      make \
        top_distdir="$top_distdir" \
        distdir="$distdir/$subdir" \
        am__remove_distdir=: \
        am__skip_length_check=: \
        distdir) \
      || exit 1; \
  fi; \
done
make[1]: Entering directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/config'
make[1]: Leaving directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/config'
make[1]: Entering directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/contrib'
make[1]: *** No rule to make target `platform/lanl/rr-class/debug.conf', needed by `distdir'.  Stop.
make[1]: Leaving directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/contrib'

make: *** [distdir] Error 1
==============================================================================


Your friendly daemon,
Cyrador




Re: [OMPI devel] OMPI 1.3 - PERUSE peruse_comm_spec_t peer Negative Value

2009-03-23 Thread George Bosilca
You are absolutely right: the peer should never be set to -1 in any of
the PERUSE callbacks. I checked the code this morning and figured out
what the problem was. We were reporting the peer and the tag attached to
a request before setting the right values (some code had moved around).
I submitted a patch and created a "move request" to get this correction
into one of our stable releases as soon as possible. The move request
can be followed in our Trac system at
https://svn.open-mpi.org/trac/ompi/ticket/1845. If you want to try this
change, please update your Open MPI installation to a nightly build or a
fresh SVN checkout with at least revision 20844 (a nightly including
this change will be posted on our website tomorrow morning).


  Thanks,
george.
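
The ordering problem George describes -- the PERUSE event being reported
before the request's peer and tag are filled in -- can be illustrated with
a small, self-contained sketch. The names below (fake_recv_request_t,
report_event, post_receive_*) are hypothetical stand-ins for illustration
only, not the actual ob1 code:

/* Hypothetical, self-contained illustration of the ordering bug described
 * above -- NOT the actual ob1 code.  The event is reported before the
 * request's peer/tag fields are filled in, so the callback observes the
 * default value -1. */
#include <stdio.h>

typedef struct { int peer; int tag; } fake_recv_request_t;

static void report_event(const char *event, const fake_recv_request_t *req)
{
    printf("%s: peer=%d tag=%d\n", event, req->peer, req->tag);
}

static void post_receive_buggy(fake_recv_request_t *req, int peer, int tag)
{
    report_event("REMOVE_FROM_UNEX_Q", req);   /* fires too early: peer is still -1 */
    req->peer = peer;
    req->tag  = tag;
}

static void post_receive_fixed(fake_recv_request_t *req, int peer, int tag)
{
    req->peer = peer;                          /* set the real values first ... */
    req->tag  = tag;
    report_event("REMOVE_FROM_UNEX_Q", req);   /* ... then report the event */
}

int main(void)
{
    fake_recv_request_t a = { -1, -1 }, b = { -1, -1 };
    post_receive_buggy(&a, 0, 42);   /* prints peer=-1 */
    post_receive_fixed(&b, 0, 42);   /* prints peer=0  */
    return 0;
}

The patch George refers to effectively moves to the second ordering.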

On Mar 23, 2009, at 13:23 , Samuel K. Gutierrez wrote:


Hi Kiril,

Appreciate the quick response.


Hi Samuel,

On Sat, 21 Mar 2009 18:18:54 -0600 (MDT)
 "Samuel K. Gutierrez"  wrote:

Hi All,

I'm writing a simple profiling library which utilizes
PERUSE.  My callback


So am I :)


function counts communication events (see example code
below).  I noticed
that in OMPI v1.3 spec->peer is sometimes a negative
value (OMPI v1.2.6
did not exhibit this behavior).  I added some boundary
checks, but it
seems as if this is a bug?  I hope I'm not missing
something...


It took me quite some time to reproduce the error - I also


Sorry about that - I should have provided more information.


got peer value "-1" for the Peruse peruse_comm_spec_t
struct. I only managed to reproduce this with
communication of a process with itself, which is an
unusual scenario. Anyway, for all the tests I did, the
error happened only when:

-a process communicates with itself
-the MPI receive call is made
-the Peruse event "PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q" is
triggered


That's interesting... Nice work!




The file ompi/mca/pml/ob1/pml_ob1_recvreq.c seems to be
the place where the above event is called with a wrong
value of the peer attribute.

I will let you know if I find something.


I will also take a look.




Best regards,
Kiril



The peruse test provided in the OMPI v1.3 source
exhibits similar behavior:
mpirun -np 2 ./mpi_peruse | grep peer:-1

int callback(peruse_event_h event_h, MPI_Aint unique_id,
             peruse_comm_spec_t *spec, void *param)
{
    if (spec->peer == rank) {
        return MPI_SUCCESS;
    }
    rrCounts[spec->peer]++;
    return MPI_SUCCESS;
}
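
For completeness, one way the boundary check mentioned above might look,
reusing the same surrounding declarations (rank, rrCounts) as the snippet
just shown -- a defensive sketch, not a suggested permanent fix:

int callback(peruse_event_h event_h, MPI_Aint unique_id,
             peruse_comm_spec_t *spec, void *param)
{
    /* Guard against the bad peer values (e.g. -1) seen with OMPI v1.3 so
     * rrCounts is never indexed out of bounds; also skip self-communication. */
    if (spec->peer < 0 || spec->peer == rank) {
        return MPI_SUCCESS;
    }
    rrCounts[spec->peer]++;
    return MPI_SUCCESS;
}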


Any insight is greatly appreciated.

Thanks,

Samuel K. Gutierrez





Appreciate the help,

Samuel K. Gutierrez




Re: [OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread Timothy Hayes
That's a relief to know, although I'm still a bit concerned. I'm looking at
the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
following sequence:

mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait

so I'm guessing that unless the deadlock issue has been resolved for that
function, it will still fail non-deterministically. I'm quite eager to give
it a try, but my component doesn't compile as-is with the 1.3 source. Is it
trivial to convert it?

Or maybe you were suggesting that I go into the code of ob1 myself and
manually change every _wait to _get?

Kind regards
Tim

2009/3/23 George Bosilca 

> It is a known problem. When the freelist is empty, going into
> ompi_free_list_wait will block the process until at least one fragment
> becomes available. Since a fragment can become available only when it is
> returned by the BTL, this can lead to deadlocks in some cases. The
> workaround is to ban the use of the blocking _wait function and replace it
> with the non-blocking version, _get. The PML has all the required logic to
> deal with the cases where a fragment cannot be allocated. We changed most of
> the BTLs to use _get instead of _wait a few months ago.
>
>  Thanks,
>george.
>
>
> On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:
>
>> Hello,
>>
>> I'm working on an OpenMPI BTL component and am having a recurring problem;
>> I was wondering if anyone could shed some light on it. I have a component
>> that's quite straightforward: it uses a pair of lightweight sockets to take
>> advantage of being in a virtualised environment (specifically Xen). My code
>> is a bit messy and has lots of inefficiencies, but the logic seems sound
>> enough. I've been able to execute a few simple programs successfully using
>> the component, and they work most of the time.
>>
>> The problem I'm having is actually happening in higher layers,
>> specifically in my asynchronous receive handler, when I call the callback
>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait();
>> in this function there is a condition variable which should get set
>> eventually but just doesn't. I've stepped through it with GDB and I get a
>> backtrace of something like this:
>>
>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>> __ompi_free_list_wait -> opal_condition_wait
>>
>> and from there it just loops. Although this is happening in higher levels,
>> I haven't noticed anything like this happening in any of the other BTL
>> components, so chances are there's something in my code that's causing this.
>> I very much doubt that it's actually waiting for a list item to be returned,
>> since this infinite loop can occur non-deterministically and sometimes even
>> on the first receive callback.
>>
>> I'm really not too sure what else to include with this e-mail. I could
>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>> hoping that someone might have noticed this problem before or something
>> similar. Maybe I'm making a common mistake. Any advice would be really
>> appreciated!
>>
>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>>
>> Kind regards
>> Tim Hayes


Re: [OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread George Bosilca
It is a known problem. When the freelist is empty, going into
ompi_free_list_wait will block the process until at least one fragment
becomes available. Since a fragment can become available only when it is
returned by the BTL, this can lead to deadlocks in some cases. The
workaround is to ban the use of the blocking _wait function and replace
it with the non-blocking version, _get. The PML has all the required
logic to deal with the cases where a fragment cannot be allocated. We
changed most of the BTLs to use _get instead of _wait a few months ago.


  Thanks,
george.
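
To make the suggested substitution concrete, here is a self-contained
sketch of the pattern, using hypothetical names (freelist_t, freelist_get,
handle_incoming) rather than the real OMPI macros: try a non-blocking get
and defer the work when nothing is available, instead of blocking on a
condition variable that only the BTL can signal.

#include <stdio.h>
#include <stddef.h>

typedef struct { int free_count; } freelist_t;

/* Non-blocking get: hand back an item if one is free, otherwise NULL.
 * (Stand-in for the non-blocking _get path; the blocking _wait path
 * would sleep here until an item is returned.) */
static void *freelist_get(freelist_t *fl)
{
    if (fl->free_count == 0) {
        return NULL;            /* nothing available -- do NOT block */
    }
    fl->free_count--;
    return fl;                  /* stand-in for a real fragment */
}

static void handle_incoming(freelist_t *fl)
{
    void *frag = freelist_get(fl);
    if (NULL == frag) {
        /* Deadlock-safe path: queue the work and retry later instead of
         * waiting for a fragment that may never come back. */
        printf("no fragment available, deferring\n");
        return;
    }
    printf("got a fragment, processing\n");
}

int main(void)
{
    freelist_t fl = { 1 };
    handle_incoming(&fl);       /* succeeds */
    handle_incoming(&fl);       /* empty list: defers instead of deadlocking */
    return 0;
}

In the tree itself, the corresponding change is swapping the blocking
OMPI_FREE_LIST_WAIT call at the allocation site for the non-blocking
OMPI_FREE_LIST_GET and handling the case where no item comes back --
which, per George, the PML already knows how to do.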

On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:


Hello,

I'm working on an OpenMPI BTL component and am having a recurring problem;
I was wondering if anyone could shed some light on it. I have a component
that's quite straightforward: it uses a pair of lightweight sockets to take
advantage of being in a virtualised environment (specifically Xen). My code
is a bit messy and has lots of inefficiencies, but the logic seems sound
enough. I've been able to execute a few simple programs successfully using
the component, and they work most of the time.

The problem I'm having is actually happening in higher layers, specifically
in my asynchronous receive handler, when I call the callback function
(cbfunc) that was set by the PML in the BTL initialisation phase. It seems
to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
function there is a condition variable which should get set eventually but
just doesn't. I've stepped through it with GDB and I get a backtrace of
something like this:

mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
__ompi_free_list_wait -> opal_condition_wait

and from there it just loops. Although this is happening in higher levels,
I haven't noticed anything like this happening in any of the other BTL
components, so chances are there's something in my code that's causing
this. I very much doubt that it's actually waiting for a list item to be
returned, since this infinite loop can occur non-deterministically and
sometimes even on the first receive callback.

I'm really not too sure what else to include with this e-mail. I could
send my source code (a bit nasty right now) if it would be helpful, but
I'm hoping that someone might have noticed this problem before or
something similar. Maybe I'm making a common mistake. Any advice would be
really appreciated!

I'm using OpenMPI 1.2.9 from the SVN tag repository.

Kind regards
Tim Hayes




Re: [OMPI devel] OMPI 1.3 - PERUSE peruse_comm_spec_t peer Negative Value

2009-03-23 Thread Samuel K. Gutierrez
Hi Kiril,

Appreciate the quick response.

> Hi Samuel,
>
> On Sat, 21 Mar 2009 18:18:54 -0600 (MDT)
>   "Samuel K. Gutierrez"  wrote:
>> Hi All,
>>
>> I'm writing a simple profiling library which utilizes
>>PERUSE.  My callback
>
> So am I :)
>
>> function counts communication events (see example code
>>below).  I noticed
>> that in OMPI v1.3 spec->peer is sometimes a negative
>>value (OMPI v1.2.6
>> did not exhibit this behavior).  I added some boundary
>>checks, but it
>> seems as if this is a bug?  I hope I'm not missing
>>something...
>
> It took me quite some time to reproduce the error - I also

Sorry about that - I should have provided more information.

> got peer value "-1" for the Peruse peruse_comm_spec_t
> struct. I only managed to reproduce this with
> communication of a process with itself, which is an
> unusual scenario. Anyway, for all the tests I did, the
> error happened only when:
>
> -a process communicates with itself
> -the MPI receive call is made
> -the Peruse event "PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q" is
> triggered

That's interesting... Nice work!

>
>
> The file ompi/mca/pml/ob1/pml_ob1_recvreq.c seems to be
> the place where the above event is called with a wrong
> value of the peer attribute.
>
> I will let you know if I find something.

I will also take a look.

>
>
> Best regards,
> Kiril
>
>>
>> The peruse test provided in the OMPI v1.3 source
>>exhibits similar behavior:
>> mpirun -np 2 ./mpi_peruse | grep peer:-1
>>
>> int callback(peruse_event_h event_h, MPI_Aint unique_id,
>>              peruse_comm_spec_t *spec, void *param)
>> {
>>     if (spec->peer == rank) {
>>         return MPI_SUCCESS;
>>     }
>>     rrCounts[spec->peer]++;
>>     return MPI_SUCCESS;
>> }
>>
>>
>> Any insight is greatly appreciated.
>>
>> Thanks,
>>
>> Samuel K. Gutierrez
>
>

Appreciate the help,

Samuel K. Gutierrez


Re: [OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread Lenny Verkhovsky
Did you try it with the OpenMPI 1.3.1 version?

There have been a few changes and bug fixes (for example r20591, a fix in
the ob1 PML).

Lenny.

2009/3/23 Timothy Hayes 

> Hello,
>
> I'm working on an OpenMPI BTL component and am having a recurring problem;
> I was wondering if anyone could shed some light on it. I have a component
> that's quite straightforward: it uses a pair of lightweight sockets to take
> advantage of being in a virtualised environment (specifically Xen). My code
> is a bit messy and has lots of inefficiencies, but the logic seems sound
> enough. I've been able to execute a few simple programs successfully using
> the component, and they work most of the time.
>
> The problem I'm having is actually happening in higher layers, specifically
> in my asynchronous receive handler, when I call the callback function
> (cbfunc) that was set by the PML in the BTL initialisation phase. It seems
> to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
> function there is a condition variable which should get set eventually but
> just doesn't. I've stepped through it with GDB and I get a backtrace of
> something like this:
>
> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
> __ompi_free_list_wait -> opal_condition_wait
>
> and from there it just loops. Although this is happening in higher levels,
> I haven't noticed anything like this happening in any of the other BTL
> components, so chances are there's something in my code that's causing this.
> I very much doubt that it's actually waiting for a list item to be returned,
> since this infinite loop can occur non-deterministically and sometimes even
> on the first receive callback.
>
> I'm really not too sure what else to include with this e-mail. I could send
> my source code (a bit nasty right now) if it would be helpful, but I'm
> hoping that someone might have noticed this problem before or something
> similar. Maybe I'm making a common mistake. Any advice would be really
> appreciated!
>
> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>
> Kind regards
> Tim Hayes
>


[OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread Timothy Hayes
Hello,

I'm working on an OpenMPI BTL component and am having a recurring problem;
I was wondering if anyone could shed some light on it. I have a component
that's quite straightforward: it uses a pair of lightweight sockets to take
advantage of being in a virtualised environment (specifically Xen). My code
is a bit messy and has lots of inefficiencies, but the logic seems sound
enough. I've been able to execute a few simple programs successfully using
the component, and they work most of the time.

The problem I'm having is actually happening in higher layers, specifically
in my asynchronous receive handler, when I call the callback function
(cbfunc) that was set by the PML in the BTL initialisation phase. It seems
to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
function there is a condition variable which should get set eventually but
just doesn't. I've stepped through it with GDB and I get a backtrace of
something like this:

mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
__ompi_free_list_wait -> opal_condition_wait

and from there it just loops. Although this is happening in higher levels,
I haven't noticed anything like this happening in any of the other BTL
components, so chances are there's something in my code that's causing
this. I very much doubt that it's actually waiting for a list item to be
returned, since this infinite loop can occur non-deterministically and
sometimes even on the first receive callback.

I'm really not too sure what else to include with this e-mail. I could
send my source code (a bit nasty right now) if it would be helpful, but
I'm hoping that someone might have noticed this problem before or
something similar. Maybe I'm making a common mistake. Any advice would be
really appreciated!

I'm using OpenMPI 1.2.9 from the SVN tag repository.

Kind regards
Tim Hayes


Re: [OMPI devel] 1.3.1rc5

2009-03-23 Thread Ralph Castain

We have had one user hit it with 1.3.0 - haven't installed 1.3.1 yet.


On Mar 23, 2009, at 9:34 AM, Eugene Loh wrote:


Jeff Squyres wrote:


Looks good to cisco.  Ship it.

I'm still seeing a very low incidence of the sm segv during startup
(0.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in
Eugene's new sm code for 1.3.2.


For what it's worth, I just ran a start-up test... "main()  
{MPI_Init();MPI_Finalize();}" with 8 processes on a single node,  
200k times with no failures.  This is before my sm changes.  I  
wanted to check that my sm changes didn't make things worse, but I  
can't reproduce the behavior in the first place.





[OMPI devel] Updated Sonoma/OpenFabrics WebEx URLs

2009-03-23 Thread Jeff Squyres
It looks like the URLs I sent before were incorrect -- they ask for a
username/password.  Try these URLs instead:

Monday, 23 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&PW=1c8c7f352179
Tuesday, 24 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&PW=1c8c7f352179
Wednesday, 25 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&ICS=MRS3&LD=1&RD=2&ST=1&SHA2=DMxufaBEsjnPl/tw2SMp2/jewdU/PigedECYIcEou/Q=

You should be prompted for your name, email address, and the meeting
password.  The meeting password is "OFED" (without the quotes).

If I got any of this information wrong, check the full meeting details
posted here:

http://lists.openfabrics.org/pipermail/ewg/2009-March/012819.html

Enjoy!  (and sorry for the confusion)

--
{+} Jeff Squyres



Re: [OMPI devel] 1.3.1rc5

2009-03-23 Thread Eugene Loh

Jeff Squyres wrote:


Looks good to cisco.  Ship it.

I'm still seeing a very low incidence of the sm segv during startup
(0.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in
Eugene's new sm code for 1.3.2.


For what it's worth, I just ran a start-up test... "main() 
{MPI_Init();MPI_Finalize();}" with 8 processes on a single node, 200k 
times with no failures.  This is before my sm changes.  I wanted to 
check that my sm changes didn't make things worse, but I can't reproduce 
the behavior in the first place.
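
For reference, the start-up test described above is essentially the
following; the file name and the driving shell loop are illustrative, not
taken from Eugene's actual setup:

/* startup_test.c -- the trivial "MPI_Init(); MPI_Finalize();" start-up test. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);     /* exercises sm (shared-memory) startup */
    MPI_Finalize();
    return 0;
}

Built with mpicc and driven by something like
"for i in $(seq 1 200000); do mpirun -np 8 ./startup_test || break; done",
this reproduces the 8-process, 200k-iteration single-node run.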


Re: [OMPI devel] Next week: WebEx remote attendance of OpenFabricsSonoma conference

2009-03-23 Thread Jeff Squyres

On Mar 17, 2009, at 9:17 AM, Jeff Squyres (jsquyres) wrote:


Monday, 23 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862
Tuesday, 24 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862
Wednesday, 25 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762987

(yes, the URL is the same on Monday and Tuesday, and different for
Wednesday)



I believe you may need a password to join these WebEx meetings.  The  
password is "OFED" (without the quotes, of course).


See this URL for the full connection details:

http://lists.openfabrics.org/pipermail/ewg/2009-March/012819.html

--
Jeff Squyres
Cisco Systems