On 12/7/18, 8:46 AM, "Beowulf on behalf of Michael Di Domenico"
<[email protected] on behalf of [email protected]> wrote:
On Fri, Dec 7, 2018 at 11:35 AM John Hanks <[email protected]> wrote:
>
> But, putting it in a container wouldn't make my life any easier and
would, in fact, just add yet another layer of something to keep up to date.
i think the theory behind this is that containers allow the sysadmins
to kick the can down the road and put the onus of updates on the
container developer. but then you get into a circle-of-trust issue,
whereby now you have to trust that the container developers are doing
something sane, and in a timely manner.
a perfect example that we pitched to our security team (this was a
few years ago, mind you): what happens when someone embeds openssl
libraries in the container? who's responsible for updating them?
what happens when that container gets abandoned by the dev? and those
containers are running with some sort of docker/root privilege
menagerie. this was back when openssl had bugs coming up left and
right. yeah, that conversation stopped dead in its tracks and we put
a moratorium on docker.
but i don't think the theory lines up with the practice, and that's
why devs shouldn't be doing ops
This is a generic problem in areas other than HPC. Over the past few years, a
fair amount of the software I've been working with has been targeted at
spacecraft platforms, and we had an interesting exercise over the past couple
of years. I was
porting a standard orbit propagation package (SGP4, see
http://www.celestrak.com/ for the Pascal version from 2000), which is available
in many different languages. I happened to be implementing the C version in
RTEMS running on a SPARC V8 processor (the LEON2 and LEON3, as it happens).
The software itself is quite compact, has no dependencies other than math.h,
stdio.h, and stdlib.h, and derives from an original Fortran version. RTEMS is a
real-time operating system that exposes a POSIX API, so it's easy to work with.
What we did was create a wrapper for SGP that matches a standardized set of APIs
for software radios (Space Telecommunications Radio System, STRS).
But here's the problem: there are really four different target hardware
platforms, all theoretically the same, but not. In the space flight software
business, you choose a toolchain and development environment at the beginning
of the project (Phase A - Formulation) and you stay with it for the life of
the mission, unless there's a compelling reason to change. In the course of
the last 10 years, we've gone through 5 versions of RTEMS (4.8, 4.10, 4.11,
4.12, 5.0), 3 different source management tools (cvs, svn, git), an
IDE that came and went (Eclipse), not to mention a variety of versions of the
gcc toolchain. Each mission has its own set of all of this. And, a bunch of
homegrown makefiles and related build processes. And, of course, it's a
hodgepodge of CentOS, Scientific Linux, Ubuntu, Debian, and RH, depending on
what was the "most supported distro" at the time the mission picked it (which
might depend on who the SysAdmin on the project was).
10 years is *forever* in the software development world. I've not yet worked
with a developer who was born after the first version of the flight software
they're working on was created - but I know that other people at JPL have (when
it takes 7 years to get to where you're going, and the mission lasts 10-15
years after that...). And this is perfectly reasonable - SGP4, for instance,
basically implements the laws of physics as a numerical model - it worked fine
in 2000, it works fine now, it's going to work just fine in 2030, with no
significant changes. "The SGP4 and SDP4 models were published along with sample
code in FORTRAN IV in 1988 with refinements over the original model to handle
the larger number of objects in orbit since" (Wikipedia article on SGP)
So, "inheriting" the SGP4 propagator from one project into another is not just
a matter of moving the source code for SGP. You have to compile it with all the
other stuff, and there are myriad hidden dependencies: does this platform have
hardware floating point or software-emulated floating point, and if the latter,
which of several flavors? Where in the source tree (for that project) does it
sit? What's the permissions strategy? Where do you add it in the build process?
And then contemplate propagating a bug fix across all those platforms. You
might decide to propagate a change to some, but not all, platforms - maybe
the spacecraft you're contemplating is getting toward the end of its life, and
you'll never again use the function you developed 4 years ago. Do you put the
bug fix that addresses the incorrect gravitational parameter at Mars into the
systems that are orbiting Earth?
Yes - folks have said "put it in containers" and in the last few years, folks
have started spinning up VMs to manage this. Historically, we keep "systems
under glass" - once you've got the build PCs working, you preserve them for the
project forever. The problem is that PCs fail eventually. But whether it is
keeping half a dozen PCs on a shelf running, or half a dozen VMs running, it's
really the same administrative burden - they all need to have annual security
audits, perhaps have patches applied (if it's "on the network"). And you've
really not addressed the underlying problem of needing to support a remarkable
variety of heterogeneous platforms. You've basically saved the physical space
on a shelf for all those PCs.
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf