Re: [OMPI devel] === CREATE FAILURE (trunk) ===
Wow, sometimes I even amaze myself! Two for two on create failures in a single night!! :-) Anyway, both are fixed or shortly will be. However, there will be no MTT runs tonight, as neither branch successfully generated a tarball.

Ralph

On Mar 23, 2009, at 7:30 PM, MPI Team wrote:

> ERROR: Command returned a non-zero exit status (trunk): make distcheck
>
> Start time: Mon Mar 23 21:22:33 EDT 2009
> End time:   Mon Mar 23 21:30:20 EDT 2009
>
> =========================================================================
> { test ! -d openmpi-1.4a1r20848 || { find openmpi-1.4a1r20848 -type d ! -perm -200 -exec chmod u+w {} ';' && rm -fr openmpi-1.4a1r20848; }; }
> test -d openmpi-1.4a1r20848 || mkdir openmpi-1.4a1r20848
> list='config contrib opal orte ompi test'; for subdir in $list; do \
>   if test "$subdir" = .; then :; else \
>     test -d "openmpi-1.4a1r20848/$subdir" \
>     || /bin/mkdir -p "openmpi-1.4a1r20848/$subdir" \
>     || exit 1; \
>     distdir=`CDPATH="${ZSH_VERSION+.}:" && cd openmpi-1.4a1r20848 && pwd`; \
>     top_distdir=`CDPATH="${ZSH_VERSION+.}:" && cd openmpi-1.4a1r20848 && pwd`; \
>     (cd $subdir && \
>       make \
>         top_distdir="$top_distdir" \
>         distdir="$distdir/$subdir" \
>         am__remove_distdir=: \
>         am__skip_length_check=: \
>         distdir) \
>       || exit 1; \
>   fi; \
> done
> make[1]: Entering directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/config'
> make[1]: Leaving directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/config'
> make[1]: Entering directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/contrib'
> make[1]: *** No rule to make target `platform/lanl/rr-class/debug.conf', needed by `distdir'.  Stop.
> make[1]: Leaving directory `/home/mpiteam/openmpi/nightly-tarball-build-root/trunk/create-r20848/ompi/contrib'
> make: *** [distdir] Error 1
> =========================================================================
>
> Your friendly daemon,
> Cyrador
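The "No rule to make target ... needed by `distdir'" failure is the classic "make distcheck" symptom of a file that is named for distribution but missing from the source tree. A minimal sketch of the usual shape of the fix, assuming the file is listed in contrib/Makefile.am's EXTRA_DIST (the exact layout of that Makefile.am is an assumption, not taken from the tree):

    # Hypothetical excerpt from contrib/Makefile.am -- the real file may
    # differ.  "make distcheck" dies with "No rule to make target ...
    # needed by `distdir'" when a file named in EXTRA_DIST does not
    # exist in the checkout.
    EXTRA_DIST = \
            platform/lanl/rr-class/debug.conf

    # The fix is either to "svn add" the missing debug.conf, or to drop
    # the stale entry from EXTRA_DIST so distdir no longer depends on it.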
Re: [OMPI devel] OMPI 1.3 - PERUSE peruse_comm_spec_t peer Negative Value
You are absolutely right: the peer should never be set to -1 in any of the PERUSE callbacks. I checked the code this morning and figured out what the problem was. We were reporting the peer and the tag attached to a request before setting the right values (some code had moved around). I submitted a patch and created a "move request" to get this correction into one of our stable releases as soon as possible. The move request can be followed in our Trac system via the following link: https://svn.open-mpi.org/trac/ompi/ticket/1845

If you want to play with this change, please update your Open MPI installation to a nightly build or a fresh checkout from SVN with at least revision 20844 (a nightly including this change will be posted on our website tomorrow morning).

Thanks,
  george.

On Mar 23, 2009, at 13:23, Samuel K. Gutierrez wrote:

> Hi Kiril,
>
> Appreciate the quick response.
>
>> Hi Samuel,
>>
>> On Sat, 21 Mar 2009 18:18:54 -0600 (MDT), "Samuel K. Gutierrez" wrote:
>>> Hi All,
>>>
>>> I'm writing a simple profiling library which utilizes PERUSE. My callback
>>
>> So am I :)
>>
>>> function counts communication events (see example code below). I noticed
>>> that in OMPI v1.3 spec->peer is sometimes a negative value (OMPI v1.2.6
>>> did not exhibit this behavior). I added some boundary checks, but it
>>> seems as if this is a bug? I hope I'm not missing something...
>>
>> It took me quite some time to reproduce the error - I also
>
> Sorry about that - I should have provided more information.
>
>> got peer value "-1" for the Peruse peruse_comm_spec_t struct. I only
>> managed to reproduce this with communication of a process with itself,
>> which is an unusual scenario. Anyway, for all the tests I did, the
>> error happened only when:
>>
>> - a process communicates with itself
>> - the MPI receive call is made
>> - the Peruse event "PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q" is triggered
>
> That's interesting... Nice work!
>
>> The file ompi/mca/pml/ob1/pml_ob1_recvreq.c seems to be the place where
>> the above event is called with a wrong value of the peer attribute.
>>
>> I will let you know if I find something.
>
> I will also take a look.
>
>> Best regards,
>> Kiril
>
>>> The peruse test provided in the OMPI v1.3 source exhibits similar behavior:
>>> mpirun -np 2 ./mpi_peruse | grep peer:-1
>>>
>>> int callback(peruse_event_h event_h, MPI_Aint unique_id,
>>>              peruse_comm_spec_t *spec, void *param) {
>>>     if (spec->peer == rank) {
>>>         return MPI_SUCCESS;
>>>     }
>>>     rrCounts[spec->peer]++;
>>>     return MPI_SUCCESS;
>>> }
>>>
>>> Any insight is greatly appreciated.
>>>
>>> Thanks,
>>> Samuel K. Gutierrez
>
> Appreciate the help,
> Samuel K. Gutierrez
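Until a fixed build is installed, a defensive guard in the callback keeps the bogus peer from indexing out of bounds. A minimal sketch along the lines of the boundary checks Samuel mentioned; rank, nprocs, and rrCounts are assumed to be set up elsewhere in the profiling library:

    #include <mpi.h>
    #include <peruse.h>   /* installed by PERUSE-enabled Open MPI builds */

    /* Assumed to be initialized elsewhere in the profiling library. */
    extern int  rank, nprocs;
    extern int *rrCounts;   /* one counter per possible peer */

    int callback(peruse_event_h event_h, MPI_Aint unique_id,
                 peruse_comm_spec_t *spec, void *param)
    {
        /* Guard against the buggy -1 peer (and any other out-of-range
         * value) reported by builds older than r20844. */
        if (spec->peer < 0 || spec->peer >= nprocs || spec->peer == rank) {
            return MPI_SUCCESS;
        }
        rrCounts[spec->peer]++;
        return MPI_SUCCESS;
    }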
Re: [OMPI devel] Infinite Loop: ompi_free_list_wait
That's a relief to know, although I'm still a bit concerned. I'm looking at the code for the OpenMPI 1.3 trunk, and in the ob1 component I can see the following sequence:

mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list -> MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait

so I'm guessing that unless the deadlock issue has been resolved for that function, it will still fail non-deterministically. I'm quite eager to give it a try, but my component doesn't compile as-is with the 1.3 source. Is it trivial to convert it? Or maybe you were suggesting that I go into the code of ob1 myself and manually change every _wait to _get?

Kind regards
Tim

2009/3/23 George Bosilca

> It is a known problem. When the freelist is empty, going into
> ompi_free_list_wait will block the process until at least one fragment
> becomes available. As a fragment can become available only when returned by
> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
> the usage of the blocking _wait function and replace it with the
> non-blocking version, _get. The PML has all the required logic to deal with
> the cases where a fragment cannot be allocated. We changed most of the BTLs
> to use _get instead of _wait a few months ago.
>
> Thanks,
>   george.
>
> On Mar 23, 2009, at 11:58, Timothy Hayes wrote:
>
>> Hello,
>>
>> I'm working on an OpenMPI BTL component and am having a recurring problem;
>> I was wondering if anyone could shed some light on it. I have a component
>> that's quite straightforward: it uses a pair of lightweight sockets to take
>> advantage of being in a virtualised environment (specifically Xen). My code
>> is a bit messy and has lots of inefficiencies, but the logic seems sound
>> enough. I've been able to execute a few simple programs successfully using
>> the component, and they work most of the time.
>>
>> The problem I'm having is actually happening in higher layers,
>> specifically in my asynchronous receive handler, when I call the callback
>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait();
>> in this function there is a condition variable which should get set
>> eventually but just doesn't. I've stepped through it with GDB and I get a
>> backtrace of something like this:
>>
>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>> __ompi_free_list_wait -> opal_condition_wait
>>
>> and from there it just loops. Although this is happening in higher levels,
>> I haven't noticed something like this happening in any of the other BTL
>> components, so chances are there's something in my code that's causing this.
>> I very much doubt that it's actually waiting for a list item to be returned,
>> since this infinite loop can occur non-deterministically and sometimes even
>> on the first receive callback.
>>
>> I'm really not too sure what else to include with this e-mail. I could
>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>> hoping that someone might have noticed this problem before or something
>> similar. Maybe I'm making a common mistake. Any advice would be really
>> appreciated!
>>
>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>>
>> Kind regards
>> Tim Hayes
Re: [OMPI devel] Infinite Loop: ompi_free_list_wait
It is a known problem. When the freelist is empty, going into ompi_free_list_wait will block the process until at least one fragment becomes available. As a fragment can become available only when returned by the BTL, this can lead to deadlocks in some cases. The workaround is to ban the usage of the blocking _wait function and replace it with the non-blocking version, _get. The PML has all the required logic to deal with the cases where a fragment cannot be allocated. We changed most of the BTLs to use _get instead of _wait a few months ago.

Thanks,
  george.

On Mar 23, 2009, at 11:58, Timothy Hayes wrote:

> Hello,
>
> I'm working on an OpenMPI BTL component and am having a recurring problem;
> I was wondering if anyone could shed some light on it. I have a component
> that's quite straightforward: it uses a pair of lightweight sockets to take
> advantage of being in a virtualised environment (specifically Xen). My code
> is a bit messy and has lots of inefficiencies, but the logic seems sound
> enough. I've been able to execute a few simple programs successfully using
> the component, and they work most of the time.
>
> The problem I'm having is actually happening in higher layers, specifically
> in my asynchronous receive handler, when I call the callback function
> (cbfunc) that was set by the PML in the BTL initialisation phase. It seems
> to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
> function there is a condition variable which should get set eventually but
> just doesn't. I've stepped through it with GDB and I get a backtrace of
> something like this:
>
> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
> __ompi_free_list_wait -> opal_condition_wait
>
> and from there it just loops. Although this is happening in higher levels,
> I haven't noticed something like this happening in any of the other BTL
> components, so chances are there's something in my code that's causing this.
> I very much doubt that it's actually waiting for a list item to be returned,
> since this infinite loop can occur non-deterministically and sometimes even
> on the first receive callback.
>
> I'm really not too sure what else to include with this e-mail. I could send
> my source code (a bit nasty right now) if it would be helpful, but I'm
> hoping that someone might have noticed this problem before or something
> similar. Maybe I'm making a common mistake. Any advice would be really
> appreciated!
>
> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>
> Kind regards
> Tim Hayes
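The conversion George describes is mechanical. Below is a minimal sketch, assuming the 1.3-era free-list macros (OMPI_FREE_LIST_GET and OMPI_FREE_LIST_WAIT taking the list, an item, and a return code -- verify against ompi/class/ompi_free_list.h in your tree) and a hypothetical BTL allocation site:

    #include "ompi/class/ompi_free_list.h"
    #include "ompi/constants.h"

    /* Hedged sketch: frag_list is a hypothetical, already-constructed
     * ompi_free_list_t owned by the BTL.  The macro names and arity are
     * from the 1.3-era ompi/class/ompi_free_list.h and should be
     * checked against the actual source. */
    static int alloc_recv_frag(ompi_free_list_t *frag_list,
                               ompi_free_list_item_t **item_out)
    {
        ompi_free_list_item_t *item;
        int rc;

        /* Blocking variant (deadlock-prone when the list is exhausted):
         *   OMPI_FREE_LIST_WAIT(frag_list, item, rc);
         * Non-blocking replacement: */
        OMPI_FREE_LIST_GET(frag_list, item, rc);
        if (OMPI_SUCCESS != rc || NULL == item) {
            /* No fragment available right now; report the failure
             * upward.  The ob1 PML already contains the logic to defer
             * and retry when a fragment cannot be allocated. */
            return OMPI_ERR_OUT_OF_RESOURCE;
        }
        *item_out = item;
        return OMPI_SUCCESS;
    }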
Re: [OMPI devel] OMPI 1.3 - PERUSE peruse_comm_spec_t peer Negative Value
Hi Kiril,

Appreciate the quick response.

> Hi Samuel,
>
> On Sat, 21 Mar 2009 18:18:54 -0600 (MDT), "Samuel K. Gutierrez" wrote:
>> Hi All,
>>
>> I'm writing a simple profiling library which utilizes PERUSE. My callback
>
> So am I :)
>
>> function counts communication events (see example code below). I noticed
>> that in OMPI v1.3 spec->peer is sometimes a negative value (OMPI v1.2.6
>> did not exhibit this behavior). I added some boundary checks, but it
>> seems as if this is a bug? I hope I'm not missing something...
>
> It took me quite some time to reproduce the error - I also

Sorry about that - I should have provided more information.

> got peer value "-1" for the Peruse peruse_comm_spec_t struct. I only
> managed to reproduce this with communication of a process with itself,
> which is an unusual scenario. Anyway, for all the tests I did, the
> error happened only when:
>
> - a process communicates with itself
> - the MPI receive call is made
> - the Peruse event "PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q" is triggered

That's interesting... Nice work!

> The file ompi/mca/pml/ob1/pml_ob1_recvreq.c seems to be the place where
> the above event is called with a wrong value of the peer attribute.
>
> I will let you know if I find something.

I will also take a look.

> Best regards,
> Kiril

>> The peruse test provided in the OMPI v1.3 source exhibits similar behavior:
>> mpirun -np 2 ./mpi_peruse | grep peer:-1
>>
>> int callback(peruse_event_h event_h, MPI_Aint unique_id,
>>              peruse_comm_spec_t *spec, void *param) {
>>     if (spec->peer == rank) {
>>         return MPI_SUCCESS;
>>     }
>>     rrCounts[spec->peer]++;
>>     return MPI_SUCCESS;
>> }
>>
>> Any insight is greatly appreciated.
>>
>> Thanks,
>> Samuel K. Gutierrez

Appreciate the help,
Samuel K. Gutierrez
Re: [OMPI devel] Infinite Loop: ompi_free_list_wait
Did you try it with the Open MPI 1.3.1 version? There have been a few changes and bug fixes (for example r20591, a fix in the ob1 PML).

Lenny.

2009/3/23 Timothy Hayes

> Hello,
>
> I'm working on an OpenMPI BTL component and am having a recurring problem;
> I was wondering if anyone could shed some light on it. I have a component
> that's quite straightforward: it uses a pair of lightweight sockets to take
> advantage of being in a virtualised environment (specifically Xen). My code
> is a bit messy and has lots of inefficiencies, but the logic seems sound
> enough. I've been able to execute a few simple programs successfully using
> the component, and they work most of the time.
>
> The problem I'm having is actually happening in higher layers, specifically
> in my asynchronous receive handler, when I call the callback function
> (cbfunc) that was set by the PML in the BTL initialisation phase. It seems
> to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
> function there is a condition variable which should get set eventually but
> just doesn't. I've stepped through it with GDB and I get a backtrace of
> something like this:
>
> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
> __ompi_free_list_wait -> opal_condition_wait
>
> and from there it just loops. Although this is happening in higher levels,
> I haven't noticed something like this happening in any of the other BTL
> components, so chances are there's something in my code that's causing this.
> I very much doubt that it's actually waiting for a list item to be returned,
> since this infinite loop can occur non-deterministically and sometimes even
> on the first receive callback.
>
> I'm really not too sure what else to include with this e-mail. I could send
> my source code (a bit nasty right now) if it would be helpful, but I'm
> hoping that someone might have noticed this problem before or something
> similar. Maybe I'm making a common mistake. Any advice would be really
> appreciated!
>
> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>
> Kind regards
> Tim Hayes
[OMPI devel] Infinite Loop: ompi_free_list_wait
Hello,

I'm working on an OpenMPI BTL component and am having a recurring problem; I was wondering if anyone could shed some light on it. I have a component that's quite straightforward: it uses a pair of lightweight sockets to take advantage of being in a virtualised environment (specifically Xen). My code is a bit messy and has lots of inefficiencies, but the logic seems sound enough. I've been able to execute a few simple programs successfully using the component, and they work most of the time.

The problem I'm having is actually happening in higher layers, specifically in my asynchronous receive handler, when I call the callback function (cbfunc) that was set by the PML in the BTL initialisation phase. It seems to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this function there is a condition variable which should get set eventually but just doesn't. I've stepped through it with GDB and I get a backtrace of something like this:

mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv -> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match -> __ompi_free_list_wait -> opal_condition_wait

and from there it just loops. Although this is happening in higher levels, I haven't noticed something like this happening in any of the other BTL components, so chances are there's something in my code that's causing this. I very much doubt that it's actually waiting for a list item to be returned, since this infinite loop can occur non-deterministically and sometimes even on the first receive callback.

I'm really not too sure what else to include with this e-mail. I could send my source code (a bit nasty right now) if it would be helpful, but I'm hoping that someone might have noticed this problem before or something similar. Maybe I'm making a common mistake. Any advice would be really appreciated!

I'm using OpenMPI 1.2.9 from the SVN tag repository.

Kind regards
Tim Hayes
Re: [OMPI devel] 1.3.1rc5
We have had one user hit it with 1.3.0 - haven't installed 1.3.1 yet.

On Mar 23, 2009, at 9:34 AM, Eugene Loh wrote:

> Jeff Squyres wrote:
>
>> Looks good to cisco. Ship it.
>>
>> I'm still seeing a very low incidence of the sm segv during startup
>> (.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in
>> Eugene's new sm code for 1.3.2.
>
> For what it's worth, I just ran a start-up test... "main() {MPI_Init();MPI_Finalize();}"
> with 8 processes on a single node, 200k times with no failures. This is
> before my sm changes. I wanted to check that my sm changes didn't make
> things worse, but I can't reproduce the behavior in the first place.
[OMPI devel] Updated Sonoma/OpenFabrics WebEx URLs
It looks like the URLs I sent before were incorrect -- they ask for a username/password. Try these URLs instead:

Monday, 23 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&PW=1c8c7f352179

Tuesday, 24 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&PW=1c8c7f352179

Wednesday, 25 Mar 2009:
https://ciscosales.webex.com/ciscosales/j.php?ED=116762862&UID=0&ICS=MRS3&LD=1&RD=2&ST=1&SHA2=DMxufaBEsjnPl/tw2SMp2/jewdU/PigedECYIcEou/Q=

You should be prompted for your name, email address, and the meeting password. The meeting password is "OFED" (without the quotes).

If I got any of this information wrong, check the full meeting details posted here:

http://lists.openfabrics.org/pipermail/ewg/2009-March/012819.html

Enjoy! (and sorry for the confusion)

--
{+} Jeff Squyres
Re: [OMPI devel] 1.3.1rc5
Jeff Squyres wrote:

> Looks good to cisco. Ship it.
>
> I'm still seeing a very low incidence of the sm segv during startup
> (.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in
> Eugene's new sm code for 1.3.2.

For what it's worth, I just ran a start-up test... "main() {MPI_Init();MPI_Finalize();}" with 8 processes on a single node, 200k times with no failures. This is before my sm changes. I wanted to check that my sm changes didn't make things worse, but I can't reproduce the behavior in the first place.
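For anyone who wants to repeat the experiment, the quoted one-liner expands to a complete program along these lines (the 200k-iteration driver is an assumption; the actual harness wasn't posted):

    /* startup.c -- minimal init/finalize start-up test, per Eugene's
     * description: run with 8 processes on a single node, repeatedly. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Finalize();
        return 0;
    }

Built and driven with something like:

    mpicc startup.c -o startup
    i=0; while [ $i -lt 200000 ]; do mpirun -np 8 ./startup || break; i=$((i+1)); done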
Re: [OMPI devel] Next week: WebEx remote attendance of OpenFabricsSonoma conference
On Mar 17, 2009, at 9:17 AM, Jeff Squyres (jsquyres) wrote:

> Monday, 23 Mar 2009:
> https://ciscosales.webex.com/ciscosales/j.php?ED=116762862
>
> Tuesday, 24 Mar 2009:
> https://ciscosales.webex.com/ciscosales/j.php?ED=116762862
>
> Wednesday, 25 Mar 2009:
> https://ciscosales.webex.com/ciscosales/j.php?ED=116762987
>
> (yes, the URL is the same on Monday and Tuesday, and different for Wednesday)

I believe you may need a password to join these WebEx meetings. The password is "OFED" (without the quotes, of course). See this URL for the full connection details:

http://lists.openfabrics.org/pipermail/ewg/2009-March/012819.html

--
Jeff Squyres
Cisco Systems