Re: [OMPI devel] Duplicated modex issue.
Actually, if I reuse ids in equivalent calls like this:

   ...
   'modex' block;
   'modex' block;
   'modex' block;
   ...

or

   ...
   'barrier' block;
   'barrier' block;
   'barrier' block;
   ...

there is no hang. The hang only occurs if this "reuse" follows the use of
another collective id, in the way I wrote in the first letter:

   ...
   'modex' block;
   'barrier' block;
   'modex' block; <- hangs
   ...

or in this way:

   ...
   'barrier' block;
   'modex' block;
   'barrier' block; <- hangs
   ...

If I use a different collective id when calling modex (1, 2, ..., anything
but 0 == orte_process_info.peer_modex), that also doesn't work,
unfortunately.

On Thu, Dec 20, 2012 at 10:39 PM, Ralph Castain wrote:

> Yeah, that won't work. The ids cannot be reused, so you'd have to assign
> a different one in each case.
>
> On Dec 20, 2012, at 9:12 AM, Victor Kocheganov
> <victor.kochega...@itseez.com> wrote:
>
> In every 'modex' block I use the id coll->id = orte_process_info.peer_modex;
> and in every 'barrier' block I use the id
> coll->id = orte_process_info.peer_init_barrier;.
>
> P.S. In general (as I wrote in the first letter), I use the term 'modex'
> for the following code:
>
>     coll = OBJ_NEW(orte_grpcomm_collective_t);
>     coll->id = orte_process_info.peer_modex;
>     if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
>         error = "orte_grpcomm_modex failed";
>         goto error;
>     }
>     /* wait for modex to complete - this may be moved anywhere in mpi_init
>      * so long as it occurs prior to calling a function that needs
>      * the modex info!
>      */
>     while (coll->active) {
>         opal_progress();  /* block in progress pending events */
>     }
>     OBJ_RELEASE(coll);
>
> and 'barrier' for this:
>
>     coll = OBJ_NEW(orte_grpcomm_collective_t);
>     coll->id = orte_process_info.peer_init_barrier;
>     if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
>         error = "orte_grpcomm_barrier failed";
>         goto error;
>     }
>     /* wait for barrier to complete */
>     while (coll->active) {
>         opal_progress();  /* block in progress pending events */
>     }
>     OBJ_RELEASE(coll);
>
> On Thu, Dec 20, 2012 at 8:57 PM, Ralph Castain wrote:
>
>> On Dec 20, 2012, at 8:29 AM, Victor Kocheganov
>> <victor.kochega...@itseez.com> wrote:
>>
>> Thanks for the fast answer, Ralph.
>>
>> In my example I use different collective objects. I mean, in every
>> mentioned block I call coll = OBJ_NEW(orte_grpcomm_collective_t); and
>> OBJ_RELEASE(coll);, so all the grpcomm operations use unique collective
>> objects.
>>
>> How are the procs getting the collective id for those new calls? They
>> all have to match.
>>
>> On Thu, Dec 20, 2012 at 7:48 PM, Ralph Castain wrote:
>>
>>> Absolutely it will hang, as the collective object passed into any
>>> grpcomm operation (modex or barrier) is only allowed to be used once -
>>> any attempt to reuse it will fail.
>>>
>>> On Dec 20, 2012, at 6:57 AM, Victor Kocheganov
>>> <victor.kochega...@itseez.com> wrote:
>>>
>>> Hi.
>>>
>>> I have an issue with understanding the ompi_mpi_init() logic. Could you
>>> please tell me if you have any guesses about the following behavior.
>>>
>>> If I understand right, there is a block in the ompi_mpi_init() function
>>> for exchanging procs' information between processes (denote this block
>>> 'modex'; the code is quoted in full above), and several instructions
>>> after it there is a block for process synchronization (denote this
>>> block 'barrier'; also quoted above). So, initially, ompi_mpi_init() has
>>> the following structure:
>>>
>>>    ...
>>>    'modex' block;
>>>    ...
>>>    'barrier' block;
>>>    ...
>>>
>>> I made several experiments with this code, and the following one is of
>>> interest: if I add a sequence of two additional blocks, 'barrier' and
>>> 'modex', right after the 'modex' block, then ompi_mpi_init() hangs in
>>> opal_progress() of the last 'modex' block:
>>>
>>>    ...
>>>    'modex' block;
>>>    'barrier' block;
>>>    'modex' block; <- hangs
>>>    ...
Re: [OMPI devel] Duplicated modex issue.
Don't know how many times I can repeat it, but I'll try again: you are not
allowed to reuse a collective id. If it happens to work, it's by accident.

If you want to implement multiple modex/barrier operations, they each need
to have their own unique collective id.

On Dec 20, 2012, at 9:28 PM, Victor Kocheganov wrote:

> Actually, if I reuse ids in equivalent calls like this:
>
> ...
> 'modex' block;
> 'modex' block;
> 'modex' block;
> ...
>
> there is no hang. The hang only occurs if this "reuse" follows the use of
> another collective id, in the way I wrote in the first letter:
>
> ...
> 'modex' block;
> 'barrier' block;
> 'modex' block; <- hangs
> ...
>
> If I use a different collective id when calling modex (1, 2, ..., anything
> but 0 == orte_process_info.peer_modex), that also doesn't work,
> unfortunately.
>
> [...]
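(A minimal sketch of what "their own unique collective id" can look like in
practice. The counter next_coll_id and its seed value are assumptions for
illustration - the real ORTE code base has its own id-assignment machinery -
but the shape mirrors the 'modex'/'barrier' blocks quoted earlier in the
thread:)

    /* Illustration only: every process must compute the SAME id for the
     * SAME logical collective, so a deterministic counter works as long
     * as all processes execute the extra collectives in the same order.
     * next_coll_id and its seed are hypothetical, not part of ORTE. */
    static int32_t next_coll_id = 1000;  /* safely past the predefined ids */

    /* extra barrier: fresh id, fresh collective object */
    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = next_coll_id++;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
        error = "orte_grpcomm_barrier failed";
        goto error;
    }
    while (coll->active) {
        opal_progress();  /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

    /* extra modex: another fresh id - never the id used above */
    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = next_coll_id++;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
        error = "orte_grpcomm_modex failed";
        goto error;
    }
    while (coll->active) {
        opal_progress();  /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

(Since every process increments the counter in lock-step, each collective
gets a matching, never-reused id on every rank, which is the property the
reply above insists on.)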
[OMPI devel] openmpi-1.9a1r27710 on cygwin: patch and questions
Hi,

in addition to the patches used for building openmpi-1.7rc5 on cygwin, a
new one is needed to build openmpi-1.9a1r27710. See the attached patch for
the statfs usage. As config parameters, I added "if-windows,shmem-windows",
ending up with:

   --enable-mca-no-build=paffinity,installdirs-windows,timer-windows,shmem-sysv,if-windows,shmem-windows

Question 1: instead of a platform check, would it be better to check
whether statvfs or statfs is implemented on the platform?

Question 2: is there any specific reason for having reset the shared
library version numbers?

On openmpi-1.9a1r27710:

   ./usr/bin/cygmpi-0.dll
   ./usr/bin/cygmpi_cxx-0.dll
   ./usr/bin/cygmpi_mpifh-0.dll
   ./usr/bin/cygmpi_usempi-0.dll
   ./usr/bin/cygopen-pal-0.dll
   ./usr/bin/cygopen-rte-0.dll
   ./usr/lib/openmpi/cygompi_dbg_msgq.dll

On openmpi-1.7rc5:

   ./usr/bin/cygmpi-1.dll
   ./usr/bin/cygmpi_cxx-1.dll
   ./usr/bin/cygmpi_mpifh-2.dll
   ./usr/bin/cygmpi_usempi-1.dll
   ./usr/bin/cygopen-pal-5.dll
   ./usr/bin/cygopen-rte-5.dll
   ./usr/lib/openmpi/cygompi_dbg_msgq.dll

Question 3: is there an alternative way to exclude all the "*-windows" mca
components, instead of
--enable-mca-no-build=installdirs-windows,timer-windows,if-windows,shmem-windows?

Regards
Marco

--- origsrc/openmpi-1.9a1r27710/opal/util/path.c	2012-12-20 03:00:25.0 +0100
+++ src/openmpi-1.9a1r27710/opal/util/path.c	2012-12-21 14:34:15.432823000 +0100
@@ -547,7 +547,7 @@
 #if defined(__SVR4) && defined(__sun)
     struct statvfs buf;
 #elif defined(__linux__) || defined (__BSD) || \
-          (defined(__APPLE__) && defined(__MACH__))
+          (defined(__APPLE__) && defined(__MACH__)) || defined(__CYGWIN__)
     struct statfs buf;
 #endif
@@ -560,7 +560,7 @@
 #if defined(__SVR4) && defined(__sun)
     rc = statvfs(path, &buf);
 #elif defined(__linux__) || defined (__BSD) || \
-          (defined(__APPLE__) && defined(__MACH__))
+          (defined(__APPLE__) && defined(__MACH__)) || defined(__CYGWIN__)
     rc = statfs(path, &buf);
 #endif
     err = errno;
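(Regarding Question 1, a function check rather than a platform check could
look roughly like the sketch below. HAVE_STATVFS and HAVE_STATFS are
assumed to come from configure - e.g. AC_CHECK_FUNCS([statvfs statfs]) plus
matching header checks - and are not existing Open MPI macros:)

    /* Sketch only: gate on the detected functions, not on the OS.
     * HAVE_STATVFS / HAVE_STATFS are hypothetical configure results. */
    #ifdef HAVE_STATVFS
    #include <sys/statvfs.h>
    #elif defined(HAVE_STATFS)
    #include <sys/vfs.h>  /* Linux/Cygwin; BSD/OS X use <sys/param.h> + <sys/mount.h> */
    #endif

    static int query_filesystem(const char *path)
    {
        int rc = -1;
    #ifdef HAVE_STATVFS
        struct statvfs buf;
        rc = statvfs(path, &buf);
    #elif defined(HAVE_STATFS)
        struct statfs buf;
        rc = statfs(path, &buf);
    #endif
        return rc;  /* on failure, errno is set by statvfs()/statfs() */
    }

(Under that scheme the __CYGWIN__ addition in the patch would become
unnecessary: any platform providing either call is picked up automatically.)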
Re: [OMPI devel] Duplicated modex issue.
Thanks for the help. Everything works as you said.

On Fri, Dec 21, 2012 at 7:11 PM, Ralph Castain wrote:

> Don't know how many times I can repeat it, but I'll try again: you are
> not allowed to reuse a collective id. If it happens to work, it's by
> accident.
>
> If you want to implement multiple modex/barrier operations, they each
> need to have their own unique collective id.
>
> [...]
[OMPI devel] Open MPI planned outage
Our Indiana U. hosting providers will be doing some maintenance over the
holiday break. All Open MPI services -- web, trac, SVN, etc. -- will be
down on Wednesday, December 26th, 2012 during the following time period:

- 5:00am-11:00am Pacific US time
- 6:00am-12:00pm Mountain US time
- 7:00am-01:00pm Central US time
- 6:00am-02:00pm Eastern US time
- 11:00am-05:00pm GMT

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Open MPI planned outage
Oops! The times that were sent were wrong. Here are the correct times:

- 3:00am-09:00am Pacific US time
- 4:00am-10:00am Mountain US time
- 5:00am-11:00am Central US time
- 6:00am-12:00pm Eastern US time
- 11:00am-05:00pm GMT

On Dec 21, 2012, at 12:44 PM, Jeff Squyres wrote:

> Our Indiana U. hosting providers will be doing some maintenance over the
> holiday break.
>
> All Open MPI services -- web, trac, SVN, etc. -- will be down on
> Wednesday, December 26th, 2012 during the following time period:
>
> - 5:00am-11:00am Pacific US time
> - 6:00am-12:00pm Mountain US time
> - 7:00am-01:00pm Central US time
> - 6:00am-02:00pm Eastern US time
> - 11:00am-05:00pm GMT

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/