Maybe he is -10!'ing, which is worse than -10'ing!
On Nov 23, 2011, at 7:52 PM, Jeff Squyres wrote:
> Can you explain that a little more? Are you -10'ing the whole concept? Or
> just renaming xpmem? Or ...?
>
> On Nov 22, 2011, at 11:37 AM, George Bosilca wrote:
>
>> -10!
>>
>> george.
>>
ed due to lack of reason to
> do so.
>
> Sorry for the confusion - old man brain fizzing out again.
>
> On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:
>
>> Yes, I know the difference :)
>>
>> I'm trying to call orte_plm.signal_job from a PML c
Yes, I know the difference :)
I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM stayed
resident after launching, but it does so only for mpirun and orted; you're right.
On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:
> On 03/17/2010 10:10 AM, Leo
Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of
view?
Leonardo
On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
> To clarify a little bit more: I'm calling orte_plm.signal_job from a PML
> component, I know that ORTE is below OMPI, but I thin
b. I didn't
> see the message indicating it was sending the signal cmd out in your prior
> debug output, and there isn't a printf in that code loop other than the debug
> output. Can you attach to the process and get more info?
>
> On Mar 17, 2010, at 6:50 AM, Leonardo Fialh
g in a print statement, yet
> there is no print statement in signal_job. Or did you run this with
> plm_base_verbose set so that the verbose prints are trying to run (could be
> we have a bug in one of them)?
>
> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>
>>
>
> I don't currently know any way to do what you are trying to do. We could
> extend the signal code to handle it, I would think...but I'm not sure how
> soon that might happen.
>
>
> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>
>> Yes... but
> line.
>>
>> Leonardo
>>
>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>
>>> It's just the orte_process_name_t jobid field. So if you have an
>>> orte_process_name_t *pname, then it would just be
>>>
>>> orte
have an
> orte_process_name_t *pname, then it would just be
>
> orte_plm.signal_job(pname->jobid, sig)
>
>
> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>
>> Hum and to signal a job probably the function is
>> orte_plm.signal_job(jobid, signal); right?
>
Castain wrote:
> Afraid not - you can signal a job, but not a specific process. We used to
> have such an API, but nobody ever used it. Easy to restore if someone has a
> need.
>
> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>
>> Hi,
>>
>> Is there any
Hi,
Is there any function in Open MPI's frameworks to send a signal to another ORTE
proc?
For example, the ORTE process [[1234,1],1] wants to send a signal to process
[[1234,1],4] located on another node. I'm looking for this kind of function, but I
have only found functions to send a signal to all procs i
I know this should be uncommon, but why do I get an error when I try to run a
"parallel" application with only one process?
aopclf:ping fialho$ mpirun -np 1 ./ping
[Fialho-2.local:02834] OPAL dss:unpack: got type 32 when expecting type 9
[Fialho-2.local:02834] [[57446,1],0] ORTE_ERROR_LOG: Pack d
> Fixing this properly in libltdl is actually somewhat tricky -- which is
>>> why it hasn't been fixed yet. But given that OMPI's use of libltdl is
>>> pretty specific, we might be able to get away with a simple fix that works
>>> just for OMPI (but wouldn
>
> george.
>
> On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote:
>
>> Yeah, probably ompi_request_null and opal_output are not good candidates.
>> I'm trying with mca_pml_v. But I'm not familiar with this framework
>> although it is really s
ho/lib/openmpi/mca_vprotocol_receiver.so: error:
>>> symbol lookup error: undefined symbol: mca_pml_v (fatal)
>>>
>>> Leonardo
>>>
>>> On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:
>>>
>>>
>>>> You said this component was a
me
> the critical elements (e.g., component, module) inside it to avoid name
> confusion?
>
> On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:
>
>> I see... but it is really strange because this module is clean, it does not
>> use anything. This is the output of the
e .so that is being loaded that
>> cannot be resolved.
>> --td
>> Leonardo Fialho wrote:
>>> Hi,
>>>
>>> I know that libtool does not help us to find the source of this error, but,
>>> what can generate the following error?
>>>
>>
Hi,
I know that libtool does not help us to find the source of this error, but,
what can generate the following error?
[aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open
/home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing symbol, or
compiled for a different ve
Hi George,
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can you envision some alternative for the orted's
>> reconfiguration on the fly?
>
> I don't see why th
Hi Ralph and Josh,
>>> Regarding the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global proc
Hi Ralph,
The "composite framework" idea is very interesting. Regarding the schema
represented by the picture, I didn't understand the RecoS' behaviour in a node
failure situation.
In this case, will mpirun consider the daemon failure as a normal proc failure?
If it is correct, should mpirun u
Hi Jeff,
There are...
Thanks a lot,
Leonardo
Jeff Squyres wrote:
Can you send all the information listed here:
http://www.open-mpi.org/community/help/
On Mar 30, 2009, at 11:46 AM, Leonardo Fialho wrote:
Hi,
I'm experiencing the following errors while using Open MPI re
ve caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[fialho@aoclsd gmwat]$
Across different executions, the error occurs in different situations.
Any help?
Thanks,
--
ut I don't know if you can do
what you describe or not - I'm not sure how the MPI layer will handle
that situation.
Ralph
On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:
Jeff,
I agree with your viewpoint, principally about the "reachability".
But...
Looking from the FT
progress thread. Why?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
't
now. I think not. And what is the impact of an allgather modex while the MPI thread
is delivering messages? The answers to these questions could suggest that
uncoordinated C/R is not possible in Open MPI.
Leonardo Fialho
Jeff Squyres wrote:
On Nov 7, 2008, at 10:18 AM, Leonardo Fi
ched internally by the
ompi library, but are not propagated until the next call to
opal_progress. If you want to use alarms that trigger outside the
opal_progress, you will have to deal directly with libevent (and
not use ORTE_TIMER_EVENT).
george.
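
The behaviour described above (expired events are noticed internally but only delivered on the next progress call) can be modelled with a toy timer table. This is a minimal sketch under stated assumptions, not Open MPI or libevent code: `sketch_timer_add`, `sketch_progress`, and `sketch_count_cb` are hypothetical names invented for illustration, and the current time is passed in explicitly instead of being read from a clock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model. NOT the Open MPI or libevent API. */
#define MAX_TIMERS 8

typedef void (*timer_cb_t)(void *arg);

typedef struct {
    double     deadline;  /* absolute time at which the timer expires */
    timer_cb_t cb;        /* callback to run once expiry is delivered */
    void      *arg;       /* opaque argument handed to the callback */
    int        active;    /* 1 while the timer is still pending */
} sketch_timer_t;

/* Zero-initialised, so every slot starts out inactive. */
static sketch_timer_t timers[MAX_TIMERS];

/* Register a timer; returns 0 on success, -1 if the table is full. */
int sketch_timer_add(double deadline, timer_cb_t cb, void *arg)
{
    assert(cb != NULL);
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i].active) {
            timers[i].deadline = deadline;
            timers[i].cb = cb;
            timers[i].arg = arg;
            timers[i].active = 1;
            return 0;
        }
    }
    return -1;
}

/* Stand-in for a progress call: expired timers are delivered only here.
 * Returns the number of callbacks that were run. */
int sketch_progress(double now)
{
    int fired = 0;
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (timers[i].active && timers[i].deadline <= now) {
            timers[i].active = 0;          /* one-shot timer */
            timers[i].cb(timers[i].arg);
            fired++;
        }
    }
    return fired;
}

/* Example callback: increments the int counter passed as arg. */
void sketch_count_cb(void *arg)
{
    (*(int *)arg)++;
}
```

In this model a timer registered for t=5 does nothing until `sketch_progress` is called with a time of 5 or later, which matches the observation in the thread that the callback only runs when something (such as communication) drives progress.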
On Nov 7, 2008, at 1:32 PM, Leonardo Fialho w
after time seconds, no? In my tests it does not occur unless some
communication occurs.
Did I make a mistake?
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93
than the masters contact info.
I think that it reduces the startup time, but increases the cost of the *first*
communication between two peers.
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http
ven't had time to focus on trying to
find out what is going wrong.
You may be right in your assessment below, I'll try to look into it
this week. If you find that making this change fixes your problem,
let me know and I'll apply the patch.
Thanks,
Josh
On Nov 4, 2008, at 10:16
dmap(orte_process_info.sync_buf, &nidmap,
*&jmap->pmap*))) {
No?
Leonardo
Leonardo Fialho wrote:
Hi All,
I think there is an error in the trunk version when trying to
restore a checkpoint.
The function orte_util_decode_pidmap, while attempting to execute the
following code
/*
2:18027] Failing at address: (nil)
I was trying to trace the problem and I think that it occurs in the line
opal_value_array_set_item(procs, i, &pmap);
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio
to the faulty process is removed from the cache and a new
request to the NS is performed. The process location and state are
kept up to date on the HNP by my FT routines. What do you think
about this?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
My suspicions were confirmed. After an orte_iof_base_setup_child/parent call the
problem does not occur.
Leonardo
Leonardo Fialho wrote:
Hi All,
I'm trying to restart a process from a previous checkpoint. My
(modified) orted is trying to do this. It uses the opal-restart
command, but after
ry to address.
Any volunteers??
Ralph
On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:
Hi All,
I'm doing some experiments and modifications in my heartbeat code,
which uses the OOB-TCP communication channel.
My modified orteds and orterun do not abort all processes when one
orte
How can I close these descriptors before the
checkpoint? Does opal-restart open these descriptors too? What can I do
to make it work?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC
lets you restart in cases where you might not otherwise be able
to. The trick is to add --save-private or --save-all to the
checkpoint command that OpenMPI uses to checkpoint the application
processes.
-Paul
Leonardo Fialho wrote:
Hi All,
I'm trying to implement my FT architecture in Open
formation or can give me some help with this
error, I'll be grateful.
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
nce an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it
discards this peer, no?
b) The message is removed from the queue with ORTE_ERR_UNREACH code, no?
c) Why, after the retries are exceeded, does the orted continue to print this message?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operati
rte_job_t array?
This would not be a good idea as a significant amount of code in the
system expects that array to only exist inside of mpirun. You could
run into some really strange behavior in various scenarios.
Ralph
On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote:
Hi All,
I have a lit
Forget it. I found the problem... a little patch to
orte_dt_pack/unpack_fns solved my problem...
Leonardo
Leonardo Fialho wrote:
Hi All,
I have a small question about how to update the orte_proc structure.
I have modified the orte_proc structure to include another field
(orte_name_proc_t
correct information, and the
orte-ps doesn't?
Thanks
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
e GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Rene' Seindal.
Leonardo
Ralf Wildenhues wrote:
Hello Leonardo,
* Leonardo F
(GNU libtool) 2.2.4
...
$ automake --version
automake (GNU automake) 1.10.1
...
Leonardo
Leonardo Fialho wrote:
Hi Jeff,
Yes, with a fresh checkout... well, it could be some error in my aclocal
files; I just updated them today, but I think I did it correctly.
Leonardo
Jeff Squyres wrote:
T
can't think offhand of how
it could be bogus.
If you have a fresh tree checkout and run autogen, is the problem
repeatable?
On Jun 19, 2008, at 10:29 AM, Leonardo Fialho wrote:
Hi All,
Does anybody know what this error is?
Yes, I think that I'm using the latest version of M4, autoconf
ack this var has 33! I
don't understand it...
Thanks,
Leonardo Fialho
Ralph Castain wrote:
On 6/17/08 3:35 PM, "Leonardo Fialho" wrote:
Hi Ralph,
1) Yes, I'm using ORTE_RML_TAG_DAEMON with a new "command" that I
defined in "odls_types.h".
2) I
with exit status: 1
-
It seems that the execution of "aclocal -I config" has failed. See
above for
the specific error message that caused it to abort.
-----
--
Leonard
trying to use OPAL_NULL and OPAL_DATA_VALUE to
send it, but I had no success :(
Thanks in advance,
Leonardo Fialho
Ralph H Castain wrote:
I'm not sure exactly how you are trying to do this, but the usual procedure
would be:
1. call opal_dss.pack(*buffer, *data, #data, data_type) f
OPAL_DATA_VALUE but had no success...
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
using the direct routed module would not work.
Can you provide some reason why the normal relay is unacceptable? And why
the PML would want to communicate with a daemon, which, after all, is -not-
an MPI process and has no idea what a PML is?
On 5/29/08 7:41 AM, "Leonardo Fialho" wrote:
between the application and the local ORTE
daemon, but I don't want to send the message to the local ORTE daemon and
then have it send the same message to the remote ORTE daemon...
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona
it all got consolidated down into plm. We need to update the
FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge...
Ralph's on vacation this week. A detailed answer to your question may
not occur until he returns...
On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wro
Hi all,
Where is the "old" orte\mca\smr? I can't find it in orte/mca/plm...
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
Hi all (and Josh),
Why does ompi-checkpoint have to contact the HNP specifically? If I use
another process to start the snapshot coordinator, it apparently
works fine, no?
PS: I prefer to send this message to the list... to keep it in the
history for future reference...
--
Leonardo Fialho
Computer
I'm testing the v protocol just now. Does anybody have plans to write a message
wrapper mixing crcpw and v_protocol?
Leonardo Fialho
University Autonoma of Barcelona
-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On behalf
of Jeff Squyres
Envia