[OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Abhishek Kulkarni


==
[RFC 1/2]
==

WHAT: Merge improvements to the "notifier" framework from the OPAL SOS
 and the ORTE WDC mercurial branches into the SVN trunk.

WHY: Some improvements and interface changes were put into the ORTE
notifier framework during the development of the OPAL SOS[1] and
ORTE WDC[2] branches.

WHERE: Mostly restricted to ORTE notifier files and files using the
  notifier interface in OMPI.

TIMEOUT: The weekend of April 2-3.

REFERENCE MERCURIAL REPOS:
 * SOS development: http://bitbucket.org/jsquyres/opal-sos-fixed/
 * WDC development: http://bitbucket.org/derbeyn/orte-wdc-fixed/

==

BACKGROUND:

The notifier interface and its components underwent a host of
improvements and changes during the development of the SOS[1] and the
WDC[2] branches.  The ORTE WDC (Warning Data Capture) branch enables
accounting of events through the notifier interface, whereas OPAL SOS
uses the notifier interface by setting up callbacks to relay logged
events.

Some of the improvements include:

- added more severity levels.
- "ftb" notifier improvements.
- "command" notifier improvements.
- added "file" notifier component
- changes in the notifier modules selection
- activate only a subset of the callbacks
 (i.e. any combination of log, help, log_peer)
- define different output media for any given callback (e.g. log_peer
 can be redirected to the syslog and smtp, while the show_help can be
 sent to the hnp).
- ORTE_NOTIFIER_LOG_EVENT() (that accounts and warns about unusual
 events)
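
As a rough illustration of the accounting idea behind
ORTE_NOTIFIER_LOG_EVENT(), the sketch below counts occurrences of an
event and warns once a threshold is reached. The threshold policy and
all demo_ names are assumptions made for this example; they are not the
actual WDC interface, which is described on the wiki page [2].

```c
#include <stdio.h>

/* Hypothetical per-event accounting record (not the ORTE WDC types). */
typedef struct {
    const char *name;      /* human-readable event name */
    unsigned    count;     /* occurrences seen so far */
    unsigned    threshold; /* warn once this many occurrences are reached */
    int         warned;    /* nonzero after the warning has been emitted */
} demo_event_t;

/* Account for one occurrence; return 1 if this call emitted the warning. */
static int demo_log_event(demo_event_t *ev)
{
    ev->count++;
    if (!ev->warned && ev->count >= ev->threshold) {
        ev->warned = 1;
        fprintf(stderr, "warning: event '%s' occurred %u times\n",
                ev->name, ev->count);
        return 1;
    }
    return 0;
}
```

With a threshold of 3, the first two calls only count; the third call
emits the warning, and later calls keep counting silently.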

Much more information is available on these two wiki pages:

[1] http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
[2] http://svn.open-mpi.org/trac/ompi/wiki/ORTEWDC

NOTE: This is the first of a two-part RFC to bring the SOS and WDC
branches to the trunk. This part only brings in the "notifier" changes
from the SOS branch; the rest of the branch will be brought over after
the timeout of the second RFC.

==


[OMPI devel] RFC 2/2: merge the OPAL SOS development branch into trunk

2010-03-29 Thread Abhishek Kulkarni


==
[RFC 2/2]
==

WHAT: Merge the OPAL SOS development branch into the OMPI trunk.

WHY: Bring over some of the work done to enhance error reporting 
capabilities.


WHERE: opal/util/ and a few changes in the ORTE notifier.

TIMEOUT: April 6, Wednesday, COB.

REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/

==

BACKGROUND:

The OPAL SOS framework tries to meet the following objectives:

- Reduce the cascading error messages and the amount of code needed to
 print an error message.
- Build and aggregate stacks of encountered errors and associate
 related individual errors with each other.
- Allow registration of custom callbacks to intercept error events.

The SOS system provides an interface to log events of varying
severities.  These events are associated with an "encoded" error code
which can be used to refer to stacks of SOS events. When logging
events, they can also be transparently relayed to all the activated
notifier components.
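
To make the idea of an "encoded" error code concrete, here is a minimal
sketch, assuming a 32-bit layout that packs an event-table index, a
severity, and a base error code into one integer. The layout and all
demo_ names are illustrative assumptions, not the actual OPAL SOS
encoding.

```c
#include <stdint.h>

/* Assumed layout (hypothetical):
 *   bits  0-7  : base error code
 *   bits  8-11 : severity
 *   bits 12-31 : index of the event in an event table
 */
enum { DEMO_ERR_BITS = 8, DEMO_SEV_BITS = 4 };

/* Pack the three fields into one code. */
static uint32_t demo_encode(uint32_t index, uint32_t severity,
                            uint32_t base_err)
{
    return (index << (DEMO_ERR_BITS + DEMO_SEV_BITS))
         | ((severity & 0xFu) << DEMO_ERR_BITS)
         | (base_err & 0xFFu);
}

/* Recover the individual fields from an encoded code. */
static uint32_t demo_base_err(uint32_t code)
{
    return code & 0xFFu;
}

static uint32_t demo_severity(uint32_t code)
{
    return (code >> DEMO_ERR_BITS) & 0xFu;
}

static uint32_t demo_index(uint32_t code)
{
    return code >> (DEMO_ERR_BITS + DEMO_SEV_BITS);
}
```

A caller that only cares about the base error can mask it out, while
the index gives a handle for looking up the related events.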

The SOS system is described in detail on this wiki page:

   http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

Feel free to comment and/or provide suggestions.

==


Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Ralph Castain
Hi Abhishek

I'm confused by the WDC wiki page, specifically the part about the new 
ORTE_NOTIFIER_DEFINE_EVENT macro. Are you saying that I (as the developer) have 
to provide this macro with a unique notifier id? So that would mean that 
ORTE/OMPI would have to maintain a global notifier id counter to ensure it is 
unique?

If so, that seems really cumbersome. Could you please clarify?

Thanks
Ralph




Re: [OMPI devel] process migration on openmpi

2010-03-29 Thread Josh Hursey
The link that Jeff cited contains all of the public information about  
the current design and use of the C/R infrastructure in Open MPI. The  
rest of the design of Open MPI is largely in the source code at the  
moment.


If you have questions about the design or specific pieces of code,  
then the devel list is a good place to ask those questions.


-- Josh

On Mar 26, 2010, at 7:50 AM, Jeff Squyres wrote:

> Have a look at http://osl.iu.edu/research/ft/ompi-cr/.
>
> On Mar 25, 2010, at 8:51 PM, luyang dong wrote:
>
>> Dear teachers:
>> I am a graduate student, and my research is to achieve process
>> migration on Open MPI. But I find that there are few resources
>> about the internal design of Open MPI, and my work has nearly
>> stopped. Can you help me?
>>
>> Thanks a lot


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] The feature of openmpi1.5

2010-03-29 Thread Josh Hursey
Process migration is a feature that we are planning on adding to the  
1.5 series within the next year. Unfortunately I cannot provide any  
more details about the state of the implementation or availability  
schedule at the moment. Once it is publicly available then there will  
be an announcement on the Open MPI users and devel lists, so I would  
watch those lists for updates.


-- Josh

On Mar 26, 2010, at 10:18 AM, luyang dong wrote:

> Dear teachers:
> I want to know whether there is a plan to add the
> function of process migration to Open MPI?
>
> Thanks





Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Sylvain Jeaugey

Hi Ralph,

For now, I think that yes, this is a unique identifier. However, in my
opinion, this could be improved in the future by replacing it with a
unique string.


Something like:

#define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) {       \
    static int event = -1;                                            \
    if (OPAL_UNLIKELY(event == -1)) {                                 \
        event = opal_sos_create_new_event(eventstr, associated_text); \
    }                                                                 \
    ..                                                                \
}

This would move the event numbering to the OPAL layer, making it
transparent to the developer.


Just my 2 cents ...

Sylvain






Re: [OMPI devel] Changing BTLs at runtime

2010-03-29 Thread Josh Hursey
This line of work sounds interesting. Just wanted to add my 2 cents on  
one point below.


On Mar 26, 2010, at 9:46 AM, Christoph Konersmann wrote:

>>> The Background:
>>> I should give some background, why I'm implementing this. Changing
>>> the MPI communication from a high speed network to a network with
>>> flow control (openib->tcp) is necessary for checkpointing distributed
>>> applications in virtual machines. Ok, you are able to checkpoint
>>> through the FT-Framework and BLCR in Open MPI, but virtual machines
>>> already provide trivial functions for checkpointing. As you are not
>>> able to checkpoint the hardware information of e.g. openib you have
>>> to get rid of it in case of a checkpoint, and change back again on
>>> resume/continue.
>>
>> I'm not quite sure I understand.  I can see how the original model
>> of CRS and SNAPC don't quite fit that of VM's, but I don't quite
>> understand what switching openib -> tcp and then later tcp ->
>> openib gives you...?
>>
>> Can't you just quiesce the openib BTL, let the VM checkpoint, and
>> then resume with openib?  (or whatever other non TCP/sm BTL you want)
>
> I worked under the assumption that virtualization might support
> InfiniBand through SR-IOV. So every virtual machine has the
> possibility to use it at full speed. Just starving out the
> communication between InfiniBand devices would not help in case of
> migration when the underlying hardware and its configuration would
> change. Therefore I have to unload the desired BTL module. To make
> sure that absolutely no BML uses InfiniBand for transfer anymore, I
> change the communication to another device whose protocol is known
> to work with migrating virtual machines, like tcp.

A few papers have pointed out the difficulties of supporting InfiniBand
in a virtualization environment where migration is a wanted feature.
Most solutions involve shutting down the InfiniBand network, moving
the process, then restarting the communication. It's a neat idea to
shift the network load to the TCP network to allow the application to
continue communication (though at diminished performance) during the
migration to work around the InfiniBand issue.

> Checkpointing would work by just quiescing the communication if the
> InfiniBand hardware does not change.


Just wanted to mention that in Open MPI we have the ability to choose  
a new set of BTLs on restart in our current C/R infrastructure. So we  
can checkpoint process A which was communicating with process B over  
'openib', and then restart them on the same machine and have them  
transparently switch to 'sm'. Then we can move them apart and have  
them pick another set of BTLs for communication (either 'tcp' or back  
to 'openib' or something else entirely like 'mx').


-- Josh



> Kind regards,
> Christoph Konersmann
> --
> Paderborn Center for Parallel Computing - PC2
> University of Paderborn - Germany
> http://www.pc2.de




Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Ralph Castain
Hi Sylvain

I think something like that is really required. Having to manage event 
identifiers across OMPI layers is going to prove impractical otherwise.

Abhishek: I would suggest this be done prior to moving the branch into the 
trunk. Whether you use Sylvain's proposed solution or another is up to you. 
Frankly, I'm not entirely sure what this identifier really buys us, but if you 
believe it important, let's make it manageable.

Thanks
Ralph






Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Abhishek Kulkarni


On Mon, 29 Mar 2010, Ralph Castain wrote:

> Hi Abhishek
>
> I'm confused by the WDC wiki page, specifically the part
> about the new ORTE_NOTIFIER_DEFINE_EVENT macro. Are you saying that I
> (as the developer) have to provide this macro with a unique notifier id?
> So that would mean that ORTE/OMPI would have to maintain a global
> notifier id counter to ensure it is unique?




I was thinking more like having a list of predefined events in the
file orte/mca/notifier/notifier_event_types.h or adding to this
file when you define new events (analogous to defining error codes).
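
One way to picture such a predefined list, as a sketch only: a single
X-macro list generates both the compile-time ids and the matching text
table, much like an error-code header. The event names and every demo_
identifier here are hypothetical; they are not the contents of
notifier_event_types.h.

```c
/* Hypothetical central list of events: add one line per new event. */
#define DEMO_EVENT_LIST(X)                              \
    X(DEMO_EVENT_CONNECT_RETRY,   "connection retry")   \
    X(DEMO_EVENT_LINK_DOWN,       "link down")          \
    X(DEMO_EVENT_OUT_OF_RESOURCE, "out of resource")

/* Expand the list into compile-time event ids. */
#define DEMO_X_ENUM(id, text) id,
typedef enum {
    DEMO_EVENT_LIST(DEMO_X_ENUM)
    DEMO_EVENT_MAX
} demo_event_id_t;
#undef DEMO_X_ENUM

/* Expand the same list into the matching description table. */
#define DEMO_X_TEXT(id, text) text,
static const char *const demo_event_text[DEMO_EVENT_MAX] = {
    DEMO_EVENT_LIST(DEMO_X_TEXT)
};
#undef DEMO_X_TEXT

/* Per-event counters, indexed directly by the compile-time id. */
static unsigned demo_event_count[DEMO_EVENT_MAX];

/* Accounting is a single array increment; no lookup is needed. */
static void demo_account(demo_event_id_t id)
{
    demo_event_count[id]++;
}
```

The ids are unique because they all come from the one list, at the
price of editing a central file whenever an event is added.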


> If so, that seems really cumbersome. Could you please clarify?



It seems slightly cumbersome to me too. But then it saves on the
lookup cost. I am copying Nadia on this (since she's really done
all the WDC work)

Thanks,
Abhishek







Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Abhishek Kulkarni



On Mon, 29 Mar 2010, Sylvain Jeaugey wrote:


> Hi Ralph,
>
> For now, I think that yes, this is a unique identifier. However, in my
> opinion, this could be improved in the future by replacing it with a
> unique string.
>
> Something like:
>
> #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) {       \
>     static int event = -1;                                            \
>     if (OPAL_UNLIKELY(event == -1)) {                                 \
>         event = opal_sos_create_new_event(eventstr, associated_text); \
>     }                                                                 \
>     ..                                                                \
> }
>
> This would move the event numbering to the OPAL layer, making it
> transparent to the developer.




This is a good suggestion, but then I think we end up relying on run-time 
generation of the event numbers and have to pay the extra cost of looking 
up the event in a list/array/hash each time we log the event.


From what I understand, and from the discussions that took place when
this proposal was first put up on the devel list, the event tracing
hooks could lie in the critical path, so we want the overhead to be as
low as possible. By manually defining the unique identifiers, we can
generate the event tracing macro at compile-time and have a minimal
tracing impact.


My 2¢, of course.

Thanks
Abhishek








Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Ralph Castain

On Mar 29, 2010, at 5:53 PM, Abhishek Kulkarni wrote:

> 
> 
> On Mon, 29 Mar 2010, Sylvain Jeaugey wrote:
> 
>> Hi Ralph,
>> 
>> For now, I think that yes, this is a unique identifier. However, in my
>> opinion, this could be improved in the future by replacing it with a
>> unique string.
>>
>> Something like:
>>
>> #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) {       \
>>     static int event = -1;                                            \
>>     if (OPAL_UNLIKELY(event == -1)) {                                 \
>>         event = opal_sos_create_new_event(eventstr, associated_text); \
>>     }                                                                 \
>>     ..                                                                \
>> }
>>
>> This would move the event numbering to the OPAL layer, making it
>> transparent to the developer.
>> 
> 
> This is a good suggestion, but then I think we end up relying on run-time 
> generation of the event numbers and have to pay the extra cost of looking up 
> the event in a list/array/hash each time we log the event.

Since it is -solely- intended to be in an error path, I fail to see the concern 
here.

> 
> From what I understand, and from the discussions that took place when
> this proposal was first put up on the devel list, the event tracing
> hooks could lie in the critical path, so we want the overhead to be as
> low as possible. By manually defining the unique identifiers, we can
> generate the event tracing macro at compile-time and have a minimal
> tracing impact.

Surely you jest - yes?? The event tracing hooks should -never- be in the 
critical path. The notifier is intended -solely- to be called when an error (or 
some other critical event) has already been detected. The idea was that we 
detect an error, and then (if selected) notify someone about it.

The last thing we want to do, IMHO, is put the notifier in a critical path. If 
we do, I personally will regret having created it :-)

