Re: [OMPI devel] Rename "vader" BTL to "xpmem"

2011-11-24 Thread Leonardo Fialho
Maybe he is -10!'ing, which is worse than -10'ing! On Nov 23, 2011, at 7:52 PM, Jeff Squyres wrote: > Can you explain that a little more? Are you -10'ing the whole concept? Or > just renaming xpmem? Or ...? > > On Nov 22, 2011, at 11:37 AM, George Bosilca wrote: > >> -10! >> >> george. >>

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
ed due to lack of reason to > do so. > > Sorry for the confusion - old man brain fizzing out again. > > On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote: > >> Yes, I know the difference :) >> >> I'm trying to call orte_plm.signal_job from a PML c

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Yes, I know the difference :) I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM stayed resident after launching, but it does so only for mpirun and orted; you're right. On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote: > On 03/17/2010 10:10 AM, Leo

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Wow... orte_plm.signal_job points to zero. Is that correct from the PML point of view? Leonardo On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote: > To clarify a little bit more: I'm calling orte_plm.signal_job from a PML > component, I know that ORTE is below OMPI, but I thin

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
b. I didn't > see the message indicating it was sending the signal cmd out in your prior > debug output, and there isn't a printf in that code loop other than the debug > output. Can you attach to the process and get more info? > > On Mar 17, 2010, at 6:50 AM, Leonardo Fialh

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
g in a print statement, yet > there is no print statement in signal_job. Or did you run this with > plm_base_verbose set so that the verbose prints are trying to run (could be > we have a bug in one of them)? > > On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote: > >>

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
> > I don't currently know any way to do what you are trying to do. We could > extend the signal code to handle it, I would think...but I'm not sure how > soon that might happen. > > > On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote: > >> Yes... but

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
> line. >> >> Leonardo >> >> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote: >> >>> It's just the orte_process_name_t jobid field. So if you have an >>> orte_process_name_t *pname, then it would just be >>> >>> orte

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
have an > orte_process_name_t *pname, then it would just be > > orte_plm.signal_job(pname->jobid, sig) > > > On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote: > >> Hmm, and to signal a job the function is probably >> orte_plm.signal_job(jobid, signal); right?

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Castain wrote: > Afraid not - you can signal a job, but not a specific process. We used to > have such an API, but nobody ever used it. Easy to restore if someone has a > need. > > On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote: > >> Hi, >> >> Is there any

[OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Hi, Is there any function in Open MPI's frameworks to send a signal to another ORTE proc? For example, the ORTE process [[1234,1],1] wants to send a signal to process [[1234,1],4] located on another node. I'm looking for this kind of function, but I have only found functions to send a signal to all procs i

[OMPI devel] Silly question

2010-03-15 Thread Leonardo Fialho
I know that it should be uncommon, but why do I get an error when I try to run a "parallel" application with only one process? aopclf:ping fialho$ mpirun -np 1 ./ping [Fialho-2.local:02834] OPAL dss:unpack: got type 32 when expecting type 9 [Fialho-2.local:02834] [[57446,1],0] ORTE_ERROR_LOG: Pack d

Re: [OMPI devel] Missing Symbol

2010-03-06 Thread Leonardo Fialho
> Fixing this properly in libltdl is actually somewhat tricky -- which is >>> why it hasn't been fixed yet. But given that OMPI's use of libltdl is >>> pretty specific, we might be able to get away with a simple fix that works >>> just for OMPI (but wouldn'

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
> > george. > > On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote: > >> Yeah, probably ompi_request_null and opal_output are not good candidates. >> I'm trying with mca_pml_v. But I'm not familiar with this framework, >> although it is really s

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
ho/lib/openmpi/mca_vprotocol_receiver.so: error: >>> symbol lookup error: undefined symbol: mca_pml_v (fatal) >>> >>> Leonardo >>> >>> On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote: >>> >>> >>>> You said this component was a

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
me > the critical elements (e.g., component, module) inside it to avoid name > confusion? > > On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote: > >> I see... but it is really strange because this module is clean, it does not >> use anything. This is the output of the

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
e .so that is being loaded that >> cannot be resolved. >> --td >> Leonardo Fialho wrote: >>> Hi, >>> >>> I know that libtool does not help us to find the source of this error, but, >>> what can generate the following error? >>> >>

[OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
Hi, I know that libtool does not help us to find the source of this error, but, what can generate the following error? [aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open /home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing symbol, or compiled for a different ve

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi George, >> Hum... I'm really afraid about this. I understand your choice since it is >> really a good solution for fail/stop/restart behaviour, but looking from the >> fail/recovery side, can you envision some alternative for the orted's >> reconfiguration on the fly? > > I don't see why th

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph and Josh, >>> Regarding the schema represented by the picture, I didn't understand the >>> RecoS' behaviour in a node failure situation. >>> >>> In this case, will mpirun consider the daemon failure as a normal proc >>> failure? If that is correct, should mpirun update the global proc

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph, The "composite framework" idea is very interesting. Regarding the schema represented by the picture, I didn't understand the RecoS' behaviour in a node failure situation. In this case, will mpirun consider the daemon failure as a normal proc failure? If that is correct, should mpirun u

Re: [OMPI devel] Error in VT

2009-03-30 Thread Leonardo Fialho
Hi Jeff, There are... Thanks a lot, Leonardo Jeff Squyres wrote: Can you send all the information listed here: http://www.open-mpi.org/community/help/ On Mar 30, 2009, at 11:46 AM, Leonardo Fialho wrote: Hi, I'm experiencing the following errors while using Open MPI re

[OMPI devel] Error in VT

2009-03-30 Thread Leonardo Fialho
ve caused other processes in the application to be terminated by signals sent by mpirun (as reported here). -- [fialho@aoclsd gmwat]$ Across different executions the error occurs in different situations. Any help? Thanks, --

Re: [OMPI devel] Modex and others

2008-11-13 Thread Leonardo Fialho
ut I don't know if you can do what you describe or not - I'm not sure how the MPI layer will handle that situation. Ralph On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote: Jeff, I agree with your viewpoint, principally about the "reachability". But... Looking from the FT

[OMPI devel] RML OOB, What´s wrong?

2008-11-13 Thread Leonardo Fialho
progress thread. Why? Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

Re: [OMPI devel] Modex and others

2008-11-13 Thread Leonardo Fialho
't now. I think not. And what is the impact of an allgather modex while the MPI thread is delivering messages? The answers to these questions could suggest that uncoordinated C/R is not possible in Open MPI. Leonardo Fialho Jeff Squyres wrote: On Nov 7, 2008, at 10:18 AM, Leonardo Fi

Re: [OMPI devel] libevent

2008-11-07 Thread Leonardo Fialho
ched internally by the ompi library, but are not propagated until the next call to opal_progress. If you want to use alarms that trigger outside the opal_progress you will have to deal directly with the libevent (and not use ORTE_TIMER_EVENT). george. On Nov 7, 2008, at 1:32 PM, Leonardo Fialho w

[OMPI devel] libevent

2008-11-07 Thread Leonardo Fialho
after time seconds, no? In my tests it does not occur unless some communication occurs. Did I make a mistake? -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93

[OMPI devel] Modex and others

2008-11-07 Thread Leonardo Fialho
than the masters contact info. I think that it reduces the startup time, but increases the *first* communication between two peers. -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http

Re: [OMPI devel] Error after ompi-restart

2008-11-04 Thread Leonardo Fialho
ven't had time to focus on trying to find out what is going wrong. You may be right in your assessment below, I'll try to look into it this week. If you find that making this changes fixes your problem, let me know and I'll apply the patch. Thanks, Josh On Nov 4, 2008, at 10:16

Re: [OMPI devel] Error after ompi-restart

2008-11-04 Thread Leonardo Fialho
dmap(orte_process_info.sync_buf, &nidmap, *&jmap->pmap*))) { No? Leonardo Leonardo Fialho wrote: Hi All, I think there is an error in the trunk version when trying to restore a checkpoint. The function orte_util_decode_pidmap, while attempting to execute the following code /*

[OMPI devel] Error after ompi-restart

2008-11-03 Thread Leonardo Fialho
2:18027] Failing at address: (nil) I was trying to trace the problem and I think that it occurs in the line opal_value_array_set_item(procs, i, &pmap); Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio

[OMPI devel] Communications and it cache

2008-10-31 Thread Leonardo Fialho
to the faulty process is removed from the cache and a new request for the NS is performed. The process location and state are kept up to date on the HNP by my FT routines. What do you think about this? Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS

Re: [OMPI devel] Error while restarting a checkpoint

2008-10-31 Thread Leonardo Fialho
My suspicions were confirmed. After calling orte_iof_base_setup_child/parent the problem does not occur. Leonardo Leonardo Fialho wrote: Hi All, I´m trying to restart a process from a previous checkpoint. My (modified) orted is trying to do this. It uses the opal-restart command, but after

Re: [OMPI devel] OOB-TCP Retries

2008-10-30 Thread Leonardo Fialho
ry to address. Any volunteers?? Ralph On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote: Hi All, I´m doing some experiments and modifications in my heartbeat code, which uses the OOB-TCP communication channel. My modified orteds and orterun do not abort all processes when one orte

[OMPI devel] Error while restarting a checkpoint

2008-10-30 Thread Leonardo Fialho
How can I close these descriptors before the checkpoint? Does opal-restart open these descriptors too? What can I do to make it work? Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC

Re: [OMPI devel] Restarting processes on different node

2008-10-23 Thread Leonardo Fialho
lets you restart in cases where you might not otherwise be able to. The trick is to add --save-private or --save-all to the checkpoint command that OpenMPI uses to checkpoint the application processes. -Paul Leonardo Fialho wrote: Hi All, I´m trying to implement my FT architecture in Open

[OMPI devel] Restarting processes on different node

2008-10-22 Thread Leonardo Fialho
formation or can give me some help with this error, I´ll be grateful. Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

[OMPI devel] OOB-TCP Retries

2008-10-17 Thread Leonardo Fialho
nce an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it discards this peer, no? b) The message is removed from the queue with ORTE_ERR_UNREACH code, no? c) Why, after the retries are exceeded, does the orted continue to print this message? Thanks, -- Leonardo Fialho Computer Architecture and Operati

Re: [OMPI devel] Update orte_proc structure

2008-10-01 Thread Leonardo Fialho
rte_job_t array? This would not be a good idea as a significant amount of code in the system expects that array to only exist inside of mpirun. You could run into some really strange behavior in various scenarios. Ralph On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote: Hi All, I have a lit

Re: [OMPI devel] Update orte_proc structure

2008-10-01 Thread Leonardo Fialho
Forget it. I found the problem... a little patch to orte_dt_pack/unpack_fns solved my problem... Leonardo Leonardo Fialho wrote: Hi All, I have a small question about how to update the orte_proc structure. I have modified the orte_proc structure to include another field (orte_name_proc_t

[OMPI devel] Update orte_proc structure

2008-10-01 Thread Leonardo Fialho
correct information, and orte-ps doesn´t? Thanks -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

Re: [OMPI devel] autogen error

2008-06-19 Thread Leonardo Fialho
e GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by Rene' Seindal. Leonardo Ralf Wildenhues wrote: Hello Leonardo, * Leonardo F

Re: [OMPI devel] autogen error

2008-06-19 Thread Leonardo Fialho
(GNU libtool) 2.2.4 ... $ automake --version automake (GNU automake) 1.10.1 ... Leonardo Leonardo Fialho wrote: Hi Jeff, Yes, with a fresh checkout... well, it could be some error in my aclocal files; I just updated them today, but I think I did it correctly. Leonardo Jeff Squyres wrote: T

Re: [OMPI devel] autogen error

2008-06-19 Thread Leonardo Fialho
can't think offhand of how it could be bogus. If you have a fresh tree checkout and run autogen, is the problem repeatable? On Jun 19, 2008, at 10:29 AM, Leonardo Fialho wrote: Hi All, Does anybody know what this error is? Yes, I think that I'm using the latest version of M4, autoconf

Re: [OMPI devel] RML Send

2008-06-19 Thread Leonardo Fialho
ack this var has 33! I don't understand it... Thanks, Leonardo Fialho Ralph Castain wrote: On 6/17/08 3:35 PM, "Leonardo Fialho" wrote: Hi Ralph, 1) Yes, I'm using ORTE_RML_TAG_DAEMON with a new "command" that I defined in "odls_types.h". 2) I

[OMPI devel] autogen error

2008-06-19 Thread Leonardo Fialho
with exit status: 1 - It seems that the execution of "aclocal -I config" has failed. See above for the specific error message that caused it to abort. ----- -- Leonard

Re: [OMPI devel] RML Send

2008-06-17 Thread Leonardo Fialho
trying to use OPAL_NULL and OPAL_DATA_VALUE to send it but I had no success :( Thanks in advance, Leonardo Fialho Ralph H Castain wrote: I'm not sure exactly how you are trying to do this, but the usual procedure would be: 1. call opal_dss.pack(*buffer, *data, #data, data_type) f

[OMPI devel] RML Send

2008-06-17 Thread Leonardo Fialho
OPAL_DATA_VALUE but had no success... Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

Re: [OMPI devel] Communication between entities

2008-05-29 Thread Leonardo Fialho
using the direct routed module would not work. Can you provide some reason why the normal relay is unacceptable? And why the PML would want to communicate with a daemon, which, after all, is -not- an MPI process and has no idea what a PML is? On 5/29/08 7:41 AM, "Leonardo Fialho" wrote:

[OMPI devel] Communication between entities

2008-05-29 Thread Leonardo Fialho
between the application and the local ORTE daemon, but I don´t want to send the message to the local ORTE daemon and then have it send the same message to the remote ORTE daemon... Thanks, -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona

Re: [OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho
it all got consolidated down into plm. We need to update the FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge... Ralph's on vacation this week. A detailed answer to your question may not occur until he returns... On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wro

[OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho
Hi all, Where is the "old" orte\mca\smr? I can´t find it in orte/mca/plm... -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

[OMPI devel] SnapC

2008-01-31 Thread Leonardo Fialho
Hi all (and Josh), Why does ompi-checkpoint have to contact the HNP specifically? If I use another process to start the snapshot coordinator, it apparently works fine, no? PS: I prefer to send this message to the list... to keep it in the archive for future reference... -- Leonardo Fialho Computer

[OMPI devel] RES: v pml question

2008-01-23 Thread Leonardo Fialho
I'm testing the v protocol just now. Does anybody have plans to do a message wrapper mixing crcpw and v_protocol? Leonardo Fialho University Autonoma of Barcelona -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On behalf of Jeff Squyres Sent