Re: [OMPI devel] Build mca_sysinfo_linux module when /proc/cpuinfo doesn't exist
Yeah, that's probably the right soln for now. Like I said, it will be
changed in the not-too-distant future anyway.

Thx!

On Mar 24, 2011, at 8:22 PM, Paul H. Hargrove wrote:

> Ralph,
>
> To be honest, any joker can probably have a "/proc" under any non-Linux
> OS - there is nothing sacred about the name. So, would it not make the
> most sense (both simple and robust) to just check $target_os and build
> exclusively for Linux?
>
> -Paul
>
> On 3/24/2011 7:01 PM, Ralph Castain wrote:
>> Thanks Paul - very illuminating!
>>
>> Looks to me like I'm okay for OpenBSD as I won't find /proc and so won't
>> build the Linux module.
>>
>> I have a problem with FreeBSD because /proc exists, but I won't find what
>> I'm looking for, so I'll have to add a test for that case and not-build
>> when FreeBSD is detected.
>>
>> The "not-mounted" case for NetBSD is more problematic. For now, I think
>> I'll just use the safe method and not-build if NetBSD is detected.
>>
>> Remember, folks - this is -not- system critical to running OMPI. At the
>> moment, the info isn't really even used for an MPI job. In the future
>> this will change, and so the build logic will become more important - but
>> in that future, the "sysinfo" framework disappears and is merged with
>> other functionality that already knows how to resolve this.
>>
>> So all we're trying to do here is help fill a temporary gap :-)
>>
>>
>> On Mar 24, 2011, at 7:52 PM, Paul H. Hargrove wrote:
>>
>>> Silas,
>>>
>>> FYI: openmpi-1.4.1 is in the package repo for NetBSD 5.1. So, you might
>>> not need to build from scratch at all, depending on your desired use.
>>>
>>> Jeff,
>>>
>>> When available (remember that, unlike Linux, /proc might not be mounted
>>> by default) the /proc/cpuinfo and /proc/meminfo on NetBSD 5.1 are
>>> (nearly?) identical to the Linux ones. See below for an example.
>>>
>>> To "prefetch" the next logical question:
>>> On a FreeBSD 8.1 system I find that /proc exists but does not contain
>>> cpuinfo or meminfo.
>>> On an OpenBSD 4.8 system I find that there is no /proc.
>>>
>>> -Paul
>>>
>>> -bash-4.1$ uname -a
>>> NetBSD netbsd5-amd64.xen 5.1 NetBSD 5.1 (XEN3_DOMU) #0: Sat Nov 6 13:17:16
>>> UTC 2010
>>> bui...@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/amd64/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/amd64/compile/XEN3_DOMU
>>> amd64
>>> -bash-4.1$ cat /proc/cpuinfo
>>> processor       : 0
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 7
>>> model name      : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
>>> stepping        : 6
>>> cpu MHz         : 2333.42
>>> fdiv_bug        : no
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 10
>>> wp              : no
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
>>> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall mmxext
>>> fxsr_opt rdtscp lm 3dnow recovery longrun lrti cxmmx cyrix_arr centaur_mcr
>>> constant_tsc pni monitor ds_cpi vmx est tm2 cx16
>>>
>>> -bash-4.1$ cat /proc/meminfo
>>>           total:      used:      free:  shared:   buffers:    cached:
>>> Mem:  1031933952  796835840  235098112        0  542756864  555749376
>>> Swap:  134213632          0  134213632
>>> MemTotal:   1007748 kB
>>> MemFree:     229588 kB
>>> MemShared:        0 kB
>>> Buffers:     530036 kB
>>> Cached:      542724 kB
>>> SwapTotal:   131068 kB
>>> SwapFree:    131068 kB
>>>
>>>
>>> On 3/24/2011 6:07 PM, Jeff Squyres wrote:
>>>> Is the data the same in /proc between NetBSD and Linux? We're currently
>>>> looking in /proc/cpuinfo and /proc/meminfo for some specific key / data
>>>> pairs.
>>>>
>>>> On Mar 24, 2011, at 2:29 PM, Silas Silva wrote:
>>>>
>>>>> Hello there,
>>>>>
>>>>> I'm using OpenMPI for educational reasons. It works pretty fine under
>>>>> GNU/Linux. I have both compiled it and downloaded it from the package
>>>>> management system with no problems.
>>>>>
>>>>> But I have been trying to use it on other Unix systems as well. On
>>>>> these systems (NetBSD, for instance) /proc is unmounted by default, so
>>>>> the ./configure script cannot stat /proc/cpuinfo (although it does
>>>>> exist in NetBSD if you manually mount /proc). When it cannot stat
>>>>> /proc/cpuinfo, it just silently skips compilation of
>>>>> mca_sysinfo_linux.{so,la}.
>>>>>
>>>>> Is this behaviour correct? Or would it be a better idea for the
>>>>> configure script to fail with a "please check /proc/cpuinfo or specify
>>>>> --dont-build-sysinfo-linux"-like message?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> --
>>>>> Silas Silva
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Future Technologies Group
>>> HPC Research Department                   Tel: +1-510-49
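
To make the two ideas in this thread concrete, here is a minimal sketch, not the actual Open MPI configure logic or sysinfo code. It shows (1) the rule Paul proposes, deciding from the target OS string alone rather than from whether /proc/cpuinfo happens to exist, and (2) the kind of "key : value" lookup Jeff describes for /proc/cpuinfo and /proc/meminfo. The function names and the example OS strings are assumptions made for illustration; the real check would live in the configure script and test $target_os there.

```python
def should_build_linux_module(target_os):
    """Build the Linux sysinfo module only when the target OS is Linux.

    Assumption: target_os is a string such as "linux-gnu"; the presence or
    absence of /proc is deliberately not consulted.
    """
    return target_os.lower().startswith("linux")


def parse_proc_keyvals(text):
    """Parse 'key : value' lines as found in /proc/cpuinfo or /proc/meminfo.

    Lines without a colon are skipped. When a key repeats (for example one
    'processor' block per CPU), the first occurrence is kept.
    """
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        key = key.strip()
        if key and key not in pairs:
            pairs[key] = value.strip()
    return pairs


if __name__ == "__main__":
    # Illustrative OS strings (assumed), matching the systems discussed above.
    for os_name in ("linux-gnu", "netbsd5.1", "freebsd8.1", "openbsd4.8"):
        print(os_name, should_build_linux_module(os_name))

    sample = (
        "processor : 0\n"
        "vendor_id : GenuineIntel\n"
        "model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz\n"
    )
    info = parse_proc_keyvals(sample)
    print(info.get("model name", "unknown"))
```

Deciding on the OS name covers all three BSD cases reported above with one test: no /proc (OpenBSD), /proc without cpuinfo (FreeBSD), and /proc not mounted at build time (NetBSD).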
Re: [OMPI devel] Add child to another parent.
>
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> required info to all local child processes so they can take appropriate
> action.
>
> Failure detection, re-launch, etc. have all been taken care of for you.
>

I looked at the code you pointed me to and I realize that I have two
possible options, which I'd like to share with you to get your opinion.

First of all, let me describe the current state of my implementation. I'm
working on a fault-tolerant system that uses uncoordinated checkpointing,
so I take checkpoints of all my processes at different times and store
them on the machine where they reside. I also send these checkpoints to
another node (let's call it the protector), so that if a node fails, its
processes can be restarted on the protector that holds their checkpoints.

Right now I'm detecting the failure of a process, I know where the process
should be restarted, and I have the checkpoint on the protector. I also
have the child information, of course.

So, my options are:

*First Option*

I detect the failure and then use
orte_errmgr_hnp_base_global_update_state() with some modifications,
together with hnp_relocate, but changing the spawn so that it restarts
from a checkpoint. I suppose that this way the migration of the process to
another node will be recorded and everyone will know about it, because it
is the HNP that does this (is this ok?).

*Second Option*

Modify one of the spawn variations (probably the remote_spawn from rsh) in
the PLM framework and then use orted_comm to command a remote_spawn on the
protector. But here I don't know how to update the info so that everyone
knows about the change, or how this is managed.

I might be very wrong in what I said; my apologies if so.

Thanks a lot for all the help.

Best regards.

Hugo Meyer
Re: [OMPI devel] Add child to another parent.
On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:

>> From what you've described before, I suspect all you'll need to do is
>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>> checks to see if a process in the launch message is being relocated (the
>> construct_child_list code does that already), and then (b) sends the
>> required info to all local child processes so they can take appropriate
>> action.
>>
>> Failure detection, re-launch, etc. have all been taken care of for you.
>
> I looked at the code you pointed me to and I realize that I have two
> possible options, which I'd like to share with you to get your opinion.
>
> First of all, let me describe the current state of my implementation. I'm
> working on a fault-tolerant system that uses uncoordinated checkpointing,
> so I take checkpoints of all my processes at different times and store
> them on the machine where they reside. I also send these checkpoints to
> another node (let's call it the protector), so that if a node fails, its
> processes can be restarted on the protector that holds their checkpoints.
>
> Right now I'm detecting the failure of a process, I know where the
> process should be restarted, and I have the checkpoint on the protector.
> I also have the child information, of course.
>
> So, my options are:
>
> First Option
>
> I detect the failure and then use
> orte_errmgr_hnp_base_global_update_state() with some modifications,
> together with hnp_relocate, but changing the spawn so that it restarts
> from a checkpoint. I suppose that this way the migration of the process
> to another node will be recorded and everyone will know about it, because
> it is the HNP that does this (is this ok?).

This is the option I would use. The other one is much, much more work. In
this option, you only have to:

(a) modify the mapper so you can specify the location of the proc being
    restarted. The resilient mapper module will be handling the restart -
    if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see
    the code doing the "replacement" and modify accordingly.

(b) add any required info about your checkpoint to the launch message.
    This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
    "get_add_procs_data" function (at the top of the file).

(c) modify the launch code to handle your checkpoint, if required - see
    the file in (b), the "construct_child" and "launch" functions.

HTH
Ralph

>
> Second Option
>
> Modify one of the spawn variations (probably the remote_spawn from rsh)
> in the PLM framework and then use orted_comm to command a remote_spawn on
> the protector. But here I don't know how to update the info so that
> everyone knows about the change, or how this is managed.
>
> I might be very wrong in what I said; my apologies if so.
>
> Thanks a lot for all the help.
>
> Best regards.
>
> Hugo Meyer
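
As a rough illustration of how steps (a) and (b) fit together, here is a small sketch of the planning logic only. It is not ORTE code and uses none of the ORTE data structures: every name below (ProcRecord, LaunchEntry, plan_relocation, protector_of, the checkpoint paths) is hypothetical. The assumption is that each node has one designated protector that already holds the checkpoints of that node's processes, as Hugo describes.

```python
from dataclasses import dataclass


@dataclass
class ProcRecord:
    """Hypothetical bookkeeping for one process."""
    rank: int
    node: str             # node the process is currently mapped to
    checkpoint_path: str  # where the protector stores its latest checkpoint


@dataclass
class LaunchEntry:
    """Hypothetical per-process entry carried in the launch message."""
    rank: int
    target_node: str   # step (a): location chosen for the restarted proc
    restart_from: str  # step (b): checkpoint info added to the message


def plan_relocation(failed_node, procs, protector_of):
    """Map every process of the failed node onto that node's protector.

    Processes on other nodes are left alone and produce no entry.
    """
    target = protector_of[failed_node]
    return [
        LaunchEntry(p.rank, target, p.checkpoint_path)
        for p in procs
        if p.node == failed_node
    ]


if __name__ == "__main__":
    procs = [
        ProcRecord(0, "node0", "/ckpt/rank0"),
        ProcRecord(1, "node0", "/ckpt/rank1"),
        ProcRecord(2, "node1", "/ckpt/rank2"),
    ]
    protector_of = {"node0": "node1", "node1": "node0"}
    for entry in plan_relocation("node0", procs, protector_of):
        print(entry)
```

Step (c) would then be the receiving side: the launch code on the target node sees a non-empty restart_from and restarts the process from that checkpoint instead of doing a fresh spawn.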