Adding some more context: when I try to use the event logger (by using MPI_ANY_SOURCE), I get this error:
[clus9:28158] defining message event: ../../orte/runtime/orte_data_server.c 414
[clus9:28158] [[56904,0],0] data server: lookup on service ompi_ft_event_logger[0]
[clus9:28158] [[56904,0],0] data server: service ompi_ft_event_logger[0] not found
[clus5:7310] *** An error occurred in ../../../../../ompi/mca/vprotocol/pessimist/vprotocol_pessimist_eventlog.h: failed to connect to an Event Logger
[clus5:7310] *** on communicator MPI_COMM_NULL
[clus5:7310] *** MPI_ERR_INTERN: internal error
[clus5:7310] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

The event logger is not found, so of course the connection is not made.
Apparently the service ompi_ft_event_logger is not defined.

Thanks for the help.

Hugo

2012/1/31 Hugo Daniel Meyer <[email protected]>

> Hello again.
>
> I've found where the connection with the event logger takes place. I have
> some doubts about the following section of code:
>
>     rc = ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>     if(OMPI_SUCCESS != rc) {
>         ORTE_ERROR_LOG(rc);
>     }
>     /* Send Rank, receive max buffer size and max_clock back */
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     rc = mca_pml_v.host_pml.pml_send(&rank, 1, MPI_INTEGER, 0,
>                                      VPROTOCOL_PESSIMIST_EVENTLOG_NEW_CLIENT_CMD,
>                                      MCA_PML_BASE_SEND_STANDARD,
>                                      mca_vprotocol_receiver.el_comm);
>     if(OPAL_UNLIKELY(MPI_SUCCESS != rc))
>         OMPI_ERRHANDLER_INVOKE(mca_vprotocol_receiver.el_comm, rc,
>                                __FILE__ ": failed sending event logger handshake");
>     rc = mca_pml_v.host_pml.pml_recv(&connect_info, 2, MPI_UNSIGNED_LONG_LONG,
>                                      0, VPROTOCOL_PESSIMIST_EVENTLOG_NEW_CLIENT_CMD,
>                                      mca_vprotocol_receiver.el_comm, MPI_STATUS_IGNORE);
>     if(OPAL_UNLIKELY(MPI_SUCCESS != rc)) \
>         OMPI_ERRHANDLER_INVOKE(mca_vprotocol_receiver.el_comm, rc, \
>                                __FILE__ ": failed receiving event logger handshake");
>
> I understand that you make a connection using the dpm framework between the
> process 0 (the logger) and yourself (MPI_COMM_SELF). But then you send and
> receive messages with the pml. My question is: where is the recv of the
> event logger posted? I didn't find where in the code the event logger
> receives the rank and answers the handshake.
>
> Thanks for your help.
>
> Hugo Meyer
>
> 2012/1/30 Hugo Daniel Meyer <[email protected]>
>
> Hello Aurelien.
>
> 2012/1/27 Aurélien Bouteiller <[email protected]>
>
> Hugo,
>
> It seems you want to implement some sort of remote pessimistic logging
> -a la MPICH-V1- ?
> MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes -- George
> Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak,
> Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky,
> Frédéric Magniette, Vincent Néri, Anton Selikhov -- In proceedings of The
> IEEE/ACM SC2002 Conference, Baltimore USA, November 2002
>
> We could say that it is similar because I use a distributed logging
> mechanism, but it is a little different because my Memory Channels and
> Checkpoint Servers are the processing nodes; I don't have special nodes to
> take care of the message log and checkpoints.
>
> In the PML-V, unlike older designs, the payload of messages and the
> non-deterministic events follow a different path. The payload of messages
> is logged on the sender's volatile memory, while the non-deterministic
> events are sent to a stable event logger, before allowing the process to
> impact the state of others (the code you have found in the previous email).
> The best depiction of this distinction can be read in this paper
>
> @inproceedings{DBLP:conf/europar/BouteillerHBD11,
>   author    = {Aurelien Bouteiller and
>                Thomas H{\'e}rault and
>                George Bosilca and
>                Jack J. Dongarra},
>   title     = {Correlated Set Coordination in Fault Tolerant Message Logging
>                Protocols},
>   booktitle = {Euro-Par 2011 Parallel Processing - 17th International
>                Conference, Proceedings, Part II},
>   month     = {September},
>   year      = {2011},
>   pages     = {51-64},
>   publisher = {Springer},
>   series    = {Lecture Notes in Computer Science},
>   volume    = {6853},
>   isbn      = {978-3-642-23396-8},
>   doi       = {http://dx.doi.org/10.1007/978-3-642-23397-5_6},
> }
>
> I will take a look at this paper to clarify this distinction.
>
> If you intend to store both payload and message log on a remote node, I
> suggest you look at the "sender-based" hooks, as this is where the message
> payload is managed, and adapt from here. The event loggers can already
> manage only a subset of the processes (if you launch as many EL as
> processes, you get a 1-1 mapping), but they never handle message payload;
> you'll have to add all this yourself if it so pleases you.
>
> Yes, I need to store every message, not only the non-deterministic ones.
> In my case every node is an event logger. Let's say that I have 16
> processes on four nodes (four per node), so the messages received by all
> processes residing on node0 need to be stored on node1, the messages
> received by all processes residing on node1 need to be stored on node2,
> and so on.
>
> If I understand correctly, I have to modify the behavior in
> ompi/mca/vprotocol/pessimist to manage the message payload. Another
> question: is there a way to launch ELs on every node, or will I have to
> modify this too?
>
> Thanks a lot for your help Aurélien.
>
> Hugo
>
> On 27 Jan 2012, at 11:19, Hugo Daniel Meyer wrote:
>
> > Hello Aurélien.
> >
> > Thanks for the clarification. Considering what you've mentioned, I will
> > have to make some adaptations, because for me every single message has
> > to be logged. So a sender will not only be sending messages to the
> > receiver, but also to an event logger.
> > Are there any considerations that I have to take into account when
> > modifying the code? My initial idea is to use the el_comm with a group
> > of event loggers (because every node uses a different event logger in my
> > approach), and then send the messages to them as you do when using
> > MPI_ANY_SOURCE.
> >
> > Thanks for your help.
> >
> > Hugo Meyer
> >
> > 2012/1/27 Aurélien Bouteiller <[email protected]>
> > Hugo,
> >
> > Your program does not have non-deterministic events. Therefore, there
> > are no events to log. If you add MPI_ANY_SOURCE, you should see this
> > code being called. Please contact me again if you need more help.
> >
> > Aurelien
> >
> > On 27 Jan 2012, at 10:21, Hugo Daniel Meyer wrote:
> >
> > > Hello all.
> > >
> > > George, I'm using some pieces of the pessimist vprotocol. I've
> > > observed that when you do a send, you call
> > > vprotocol_receiver_event_flush, and there the macro
> > > __VPROTOCOL_RECEIVER_SEND_BUFFER is called. I've noticed that there
> > > you try to send a copy of the message to process 0 using the el_comm.
> > > This section of code is never executed, at least in my examples. So
> > > the message is never sent to the Event Logger; am I correct about
> > > this? I think this is happening because
> > > mca_vprotocol_pessimist.event_buffer_length is always 0.
> > >
> > > Is there something that I have to turn on, or will I have to modify
> > > this behavior manually to connect and send messages to the EL?
> > >
> > > Thanks in advance.
> > >
> > > Hugo Meyer
> > > _______________________________________________
> > > devel mailing list
> > > [email protected]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > --
> > * Dr. Aurélien Bouteiller
> > * Researcher at Innovative Computing Laboratory
> > * University of Tennessee
> > * 1122 Volunteer Boulevard, suite 350
> > * Knoxville, TN 37996
> > * 865 974 6321
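
For reference, here is a minimal sketch of the node-to-logger placement described in this thread (16 processes on four nodes, with messages received on node0 stored on node1, those received on node1 stored on node2, and so on). The helper names node_of_rank and logger_node_of_rank are hypothetical and are not part of Open MPI. The sketch also assumes that ranks are packed onto nodes in contiguous blocks and that the last node wraps around to node0; neither point is stated in the thread.

```c
#include <assert.h>

/* Hypothetical helper (not an Open MPI function): index of the node that
 * hosts `rank`, ASSUMING ranks are packed onto nodes in contiguous blocks
 * of `procs_per_node`. */
int node_of_rank(int rank, int procs_per_node)
{
    assert(rank >= 0 && procs_per_node > 0);
    return rank / procs_per_node;
}

/* Hypothetical helper: index of the node that stores the messages received
 * by `rank`. The thread only says node0 -> node1, node1 -> node2, "and so
 * on"; the wrap-around from the last node back to node0 is an ASSUMPTION. */
int logger_node_of_rank(int rank, int procs_per_node, int num_nodes)
{
    assert(num_nodes > 0);
    assert(rank < procs_per_node * num_nodes);
    return (node_of_rank(rank, procs_per_node) + 1) % num_nodes;
}
```

With four nodes and four processes per node, logger_node_of_rank(5, 4, 4) gives 2 (rank 5 lives on node1 and is logged on node2), and logger_node_of_rank(15, 4, 4) gives 0 under the wrap-around assumption.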
