date:20110325

Re: [OMPI devel] Build mca_sysinfo_linux module when /proc/cpuinfo doesn't exist

2011-03-25 Thread Ralph Castain

Yeah, that's probably the right soln for now. Like I said, it will be changed 
in the not-too-distant future anyway.

Thx!
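
For reference, the $target_os check suggested in the quoted message below
might look roughly like this configure-style fragment. It is only a sketch:
the helper name sysinfo_linux_wanted and the match pattern are illustrative
assumptions, not the actual Open MPI configure logic, and the argument stands
in for the $target_os value that AC_CANONICAL_TARGET would set.

```shell
#!/bin/sh
# Hypothetical helper: decide from the OS name alone whether to build the
# Linux sysinfo component. The argument stands in for configure's $target_os.
sysinfo_linux_wanted() {
    case "$1" in
        linux*) return 0 ;;   # Linux: build the component
        *)      return 1 ;;   # anything else (NetBSD, FreeBSD, OpenBSD, ...): skip
    esac
}

# Example use with a few assumed target_os values.
for os in linux-gnu netbsd5.1 freebsd8.1 openbsd4.8; do
    if sysinfo_linux_wanted "$os"; then
        echo "$os: build mca_sysinfo_linux"
    else
        echo "$os: skip mca_sysinfo_linux"
    fi
done
```

Keying on the OS name means the decision no longer depends on whether /proc
happens to be mounted on the build machine.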

On Mar 24, 2011, at 8:22 PM, Paul H. Hargrove wrote:

> Ralph,
> 
>  To be honest any joker can probably have a "/proc" under any non-Linux OS - 
> there is nothing sacred about the name.  So, would it not make the most sense 
> (both simple and robust) to just check $target_os and build exclusively for 
> Linux?
> 
> -Paul
> 
> On 3/24/2011 7:01 PM, Ralph Castain wrote:
>> Thanks Paul - very illuminating!
>> 
>> Looks to me like I'm okay for OpenBSD as I won't find /proc and so won't 
>> build the Linux module.
>> 
>> I have a problem with FreeBSD because /proc exists, but I won't find what 
>> I'm looking for, so I'll have to add a test for that case and not-build when 
>> FreeBSD is detected.
>> 
>> The "not-mounted" case for NetBSD is more problematic. For now, I think I'll 
>> just use the safe method and not-build if NetBSD is detected.
>> 
>> Remember, folks - this is -not- system critical to running OMPI. At the 
>> moment, the info isn't really even used for an MPI job. In the future this 
>> will change, and so the build logic will become more important - but in that 
>> future, the "sysinfo" framework disappears and is merged with other 
>> functionality that already knows how to resolve this.
>> 
>> So all we're trying to do here is help fill a temporary gap :-)
>> 
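The file-based test described above (build only if the expected /proc entries
are found) could be sketched as follows. The function name have_linux_procfs
and its optional root-directory argument are assumptions made for
illustration; this is not the existing configure test.

```shell
#!/bin/sh
# Hypothetical probe: succeed only if both files the Linux module reads are
# present and readable under the given proc root (default: /proc).
have_linux_procfs() {
    root="${1:-/proc}"
    [ -r "$root/cpuinfo" ] && [ -r "$root/meminfo" ]
}

if have_linux_procfs; then
    echo "cpuinfo and meminfo found: file probe passes"
else
    echo "cpuinfo or meminfo missing: file probe fails"
fi
```

A probe like this would fail on the OpenBSD and FreeBSD systems described in
this thread, but it would pass on NetBSD whenever /proc is mounted, so it
cannot by itself keep the component Linux-only.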
>> 
>> On Mar 24, 2011, at 7:52 PM, Paul H. Hargrove wrote:
>> 
>>> Silas,
>>> 
>>> FYI: openmpi-1.4.1 is in the package repo for NetBSD 5.1.  So, you might 
>>> not need to build from scratch at all, depending on your desired use.
>>> 
>>> Jeff,
>>> 
>>> When available (remember that unlike Linux /proc might not be mounted by 
>>> default) the /proc/cpuinfo and /proc/meminfo on NetBSD 5.1 are (nearly?) 
>>> identical to the Linux ones.  See below for an example.
>>> 
>>> To "prefetch" the next logical question:
>>> On a FreeBSD 8.1 system I find that /proc exists but does not contain 
>>> cpuinfo or meminfo
>>> On a OpenBSD 4.8 system I find that there is no /proc
>>> 
>>> -Paul
>>> 
>>> -bash-4.1$ uname -a
>>> NetBSD netbsd5-amd64.xen 5.1 NetBSD 5.1 (XEN3_DOMU) #0: Sat Nov  6 13:17:16 
>>> UTC 2010  
>>> bui...@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/amd64/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/amd64/compile/XEN3_DOMU
>>>  amd64
>>> -bash-4.1$ cat /proc/cpuinfo
>>> processor       : 0
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 7
>>> model name      : Intel(R) Xeon(R) CPU   E5410  @ 2.33GHz
>>> stepping        : 6
>>> cpu MHz         : 2333.42
>>> fdiv_bug        : no
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 10
>>> wp              : no
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
>>> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall mmxext 
>>> fxsr_opt rdtscp lm 3dnow recovery longrun lrti cxmmx cyrix_arr centaur_mcr 
>>> constant_tsc pni monitor ds_cpi vmx est tm2 cx16
>>> 
>>> -bash-4.1$ cat /proc/meminfo
>>>           total:      used:      free:    shared:   buffers:    cached:
>>> Mem:  1031933952  796835840  235098112          0  542756864  555749376
>>> Swap:  134213632          0  134213632
>>> MemTotal:   1007748 kB
>>> MemFree:     229588 kB
>>> MemShared:        0 kB
>>> Buffers:     530036 kB
>>> Cached:      542724 kB
>>> SwapTotal:   131068 kB
>>> SwapFree:    131068 kB
>>> 
>>> 
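Since the module looks for specific key / data pairs in files of this shape,
here is a minimal sketch of extracting one value by key. The helper name
proc_value is hypothetical, the sample lines are copied from the output above,
and the real component does its parsing in C rather than in shell.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "key : value" line whose
# key matches $1 exactly, reading from file $2.
proc_value() {
    awk -F':' -v key="$1" '
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim padding around the key
            if (k == key) {
                v = substr($0, index($0, ":") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim padding around the value
                print v
                exit
            }
        }' "$2"
}

# Example use on a small sample taken from the cpuinfo output in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
processor   : 0
vendor_id   : GenuineIntel
model name  : Intel(R) Xeon(R) CPU   E5410  @ 2.33GHz
cpu MHz     : 2333.42
EOF
proc_value "vendor_id" "$sample"
proc_value "cpu MHz" "$sample"
rm -f "$sample"
```

Matching on the trimmed key rather than on fixed column positions keeps the
lookup working whether the file pads keys with tabs or with spaces.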
>>> On 3/24/2011 6:07 PM, Jeff Squyres wrote:
>>>> Is the data the same in /proc between NetBSD and Linux?
>>>> 
>>>> We're currently looking in /proc/cpuinfo and /proc/meminfo for some 
>>>> specific key / data pairs.
>>>> 
>>>> On Mar 24, 2011, at 2:29 PM, Silas Silva wrote:
>>>> 
>>>>> Hello there,
>>>>> 
>>>>> I'm using OpenMPI for educational reasons.  It works pretty well under
>>>>> GNU/Linux.  I have both compiled it and downloaded it from the package
>>>>> management system with no problems.
>>>>> 
>>>>> But I have been trying to use it on other Unix systems as well.  On some
>>>>> of these systems (NetBSD, for instance) /proc is unmounted by default, so
>>>>> the ./configure script cannot stat /proc/cpuinfo (although it does exist
>>>>> in NetBSD if you manually mount /proc).  When it cannot stat
>>>>> /proc/cpuinfo, it just silently skips compilation of
>>>>> mca_sysinfo_linux.{so,la}.
>>>>> 
>>>>> Is this behaviour correct?  Or would it be a better idea for the
>>>>> configure script to fail with a "please check /proc/cpuinfo or specify
>>>>> --dont-build-sysinfo-linux"-like message?
>>>>> 
>>>>> Thank you very much.
>>>>> 
>>>>> -- 
>>>>> Silas Silva
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> -- 
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Future Technologies Group
>>> HPC Research Department   Tel: +1-510-49

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Hugo Meyer

>
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> required info to all local child processes so they can take appropriate
> action.
>
> Failure detection, re-launch, etc. have all been taken care of for you.
>


I looked at the code you mentioned and realized that I have two
possible options, which I'd like to share with you to get your opinion.

First, let me describe the current state of my implementation. I'm working
on a fault-tolerant system that uses uncoordinated checkpointing: I take
checkpoints of all my processes at different times and store them on the
machine where they reside, but I also send these checkpoints to another
node (let's call it the protector), so that if the original node fails, its
processes should be restarted on the protector that holds their checkpoints.

Right now I can detect the failure of a process, I know where the
process should be restarted, and I have the checkpoint on the
protector. I also have the child information, of course.

So, my options are:
*First Option*

I detect the failure, and then I use
orte_errmgr_hnp_base_global_update_state() with some modifications, along
with hnp_relocate, but changing the spawn so that it restarts from a
checkpoint. I suppose that with this approach the migration of the process
to another node will be recorded and everyone will know about it, because it
is the HNP that does this (is this OK?).

*Second Option*

Modify one of the spawn variations (probably the remote_spawn from rsh) in
the PLM framework and then use orted_comm to command a remote_spawn on
the protector, but here I don't know how to update the info so that everyone
knows about the change, or how this is managed.

I might be very wrong in what I said; my apologies if so.

Thanks a lot for all the help.

Best regards.

Hugo Meyer

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Ralph Castain


On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:

> From what you've described before, I suspect all you'll need to do is add 
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to 
> see if a process in the launch message is being relocated (the 
> construct_child_list code does that already), and then (b) sends the required 
> info to all local child processes so they can take appropriate action.
> 
> Failure detection, re-launch, etc. have all been taken care of for you.
> 
> 
> I looked at the code you mentioned and realized that I have two 
> possible options, which I'd like to share with you to get your opinion.
> 
> First, let me describe the current state of my implementation. I'm working 
> on a fault-tolerant system that uses uncoordinated checkpointing: I take 
> checkpoints of all my processes at different times and store them on the 
> machine where they reside, but I also send these checkpoints to another 
> node (let's call it the protector), so that if the original node fails, its 
> processes should be restarted on the protector that holds their checkpoints.
> 
> Right now I can detect the failure of a process, I know where the 
> process should be restarted, and I have the checkpoint on the protector. 
> I also have the child information, of course.
> 
> So, my options are:
> First Option
> 
> I detect the failure, and then I use 
> orte_errmgr_hnp_base_global_update_state() with some modifications, along 
> with hnp_relocate, but changing the spawn so that it restarts from a 
> checkpoint. I suppose that with this approach the migration of the process 
> to another node will be recorded and everyone will know about it, because it 
> is the HNP that does this (is this OK?).

This is the option I would use. The other one is much, much more work. In this 
option, you only have to:

(a) modify the mapper so you can specify the location of the proc being 
restarted. The resilient mapper module will be handling the restart - if you 
look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code doing 
the "replacement" and modify accordingly.

(b) add any required info about your checkpoint to the launch message. This 
gets created in orte/mca/odls/base/odls_base_default_fns.c, the 
"get_add_procs_data" function (at the top of the file).

(c) modify the launch code to handle your checkpoint, if required - see the 
file in (b), the "construct_child" and "launch" functions.

HTH
Ralph


> 
> Second Option
> 
> Modify one of the spawn variations (probably the remote_spawn from rsh) in the 
> PLM framework and then use orted_comm to command a remote_spawn on the 
> protector, but here I don't know how to update the info so that everyone knows 
> about the change, or how this is managed.
> 
> I might be very wrong in what I said; my apologies if so.
> 
> Thanks a lot for all the help.
> 
> Best regards.
> 
> Hugo Meyer
> 
