Hi again,
Maybe I should give more specific information with some code snippets...
Currently I added
#define ORTE_DAEMON_BTL_CTL_CMD (orte_daemon_cmd_flag_t) 26
to odls_types.h to identify a request to trigger the BTL pause.
In process_commands() of orted/orted_comm.c this flag is processed first
by broadcasting it to all orteds with xcast of the grpcomm framework.
Second, it's forwarded with orte_odls.deliver_message to the local procs.
So every process should get the trigger. Or is there another, possibly
easier, way of delivering the trigger?
I extended mca_btl_base_module_t in btl/btl.h with an indicator of
whether pause is set.
struct mca_btl_base_module_t {
    [...]
    bool btl_paused;
    [...]
};
I then added a line to the initializer in every BTL component so that
btl_paused is false by default. E.g. in self/btl_self.c:
mca_btl_base_module_t mca_btl_self = {
    [...]
    false, /* btl_paused */
    [...]
};
Or did I forget something?
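To show the idea without touching every positional initializer, here is a minimal, self-contained sketch using C99 designated initializers. The struct demo_btl_module, its two placeholder fields, and the helper functions are hypothetical stand-ins, not the real mca_btl_base_module_t, and I am assuming the code base's compilers accept C99 designated initializers. Members that are not named in such an initializer are zero-initialized, so btl_paused would start out false without an extra line per component.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for mca_btl_base_module_t. */
struct demo_btl_module {
    size_t demo_eager_limit;  /* placeholder for the existing fields */
    int    demo_flags;        /* placeholder for the existing fields */
    bool   btl_paused;        /* the new pause indicator */
};

/* Designated initializers: members that are not named here are
 * zero-initialized, so btl_paused starts out false. */
struct demo_btl_module demo_btl_self = {
    .demo_eager_limit = 1024,
    .demo_flags       = 0,
};

/* Hypothetical helper to set the indicator on one module. */
void demo_btl_set_paused(struct demo_btl_module *btl, bool paused)
{
    assert(btl != NULL);
    btl->btl_paused = paused;
}

/* Hypothetical helper a send path could call before sending. */
bool demo_btl_is_paused(const struct demo_btl_module *btl)
{
    assert(btl != NULL);
    return btl->btl_paused;
}
```

With positional initializers, as in the existing components, the extra "false" line is still needed at the right position in each component.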
So my problem now is: once every process gets the trigger in the ORTE
project, how can I set btl->btl_paused to true in the OMPI project? ORTE
does not (and I know it should not) have access to the OMPI components. Is
there a way of implementing a libevent callback function in the BTL
modules? Or is there another way? I have already read the documentation on
your wiki, but for me it's not really trivial as I'm relatively new
to this.
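To make concrete what I mean by a callback, here is a rough, self-contained sketch of the pattern I have in mind: the lower layer only stores a function pointer, and the upper layer registers a function that flips the flag. All names (demo_rte_register_pause_cb, demo_rte_deliver_pause, demo_btl_pause_cb, struct demo_btl) are made up for illustration. None of them are existing ORTE, OMPI, or libevent functions, and I don't know whether a comparable hook already exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ---- lower layer (stands in for ORTE); knows nothing about BTLs ---- */
typedef void (*demo_pause_cb_t)(bool paused, void *cbdata);

static demo_pause_cb_t demo_pause_cb = NULL;
static void *demo_pause_cbdata = NULL;

/* Hypothetical registration hook exposed by the lower layer. */
void demo_rte_register_pause_cb(demo_pause_cb_t cb, void *cbdata)
{
    demo_pause_cb = cb;
    demo_pause_cbdata = cbdata;
}

/* Would be called where the pause command reaches the local process.
 * Does nothing if no upper layer has registered a callback. */
void demo_rte_deliver_pause(bool paused)
{
    if (NULL != demo_pause_cb) {
        demo_pause_cb(paused, demo_pause_cbdata);
    }
}

/* ---- upper layer (stands in for OMPI) ---- */
struct demo_btl {
    bool btl_paused;
};

/* Callback the upper layer registers; cbdata is the module to update. */
void demo_btl_pause_cb(bool paused, void *cbdata)
{
    struct demo_btl *btl = (struct demo_btl *)cbdata;
    assert(btl != NULL);
    btl->btl_paused = paused;
}
```

The upper layer would call demo_rte_register_pause_cb(demo_btl_pause_cb, &btl) once during its initialization. After that the lower layer can deliver the trigger without including any OMPI header, so the dependency still points in only one direction.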
One idea for making the connection to the OMPI project would be to use
the ft_event framework. To that end I added another opal_crs_state_type_t
value, OPAL_CRS_PAUSE, in crs/crs.h and tried to trigger the event in
orted_comm.c with:
if( NULL != orte_ess.ft_event ) {
    if( ORTE_SUCCESS != (ret = orte_ess.ft_event(OPAL_CRS_PAUSE))) {
        goto CLEANUP;
    }
}
But orte_ess.ft_event is NULL, so the call is never executed...
Any ideas? Any advice?
For me the performance impact of a solution is of no interest.
Thanks, and please excuse me if I bother you with this.
Christoph
Christoph Konersmann wrote:
Hi all,
I'm trying to implement a method to pause all BTLs from sending packets
to their destinations.
So far I have added a state variable to orte_process_info which will be
changed by an external program through process_commands() in
orte/orted/orted_comm.c (I hope it's processed globally, not locally).
While this state is set to something defined as PAUSE, I want the
send methods in the PML layer to be halted so that no network traffic is
generated. So far it's not working, because the PML layer does not see
the state change.
Another way would be to use a libevent thread on the bml/pml-level. I've
read that this library is already supported/implemented, or am I wrong?
How would I use libevent in this context? Does somebody have an example
or hint? Or should I use the fault tolerance framework for this purpose?
Any help would be appreciated. Thanks.
--
Paderborn Center for Parallel Computing - PC2
University of Paderborn - Germany
http://www.pc2.de
Christoph Konersmann <c...@upb.de>