On 16/04/14 17:54, Mark Hahn wrote:
> I'm trying to understand this from a perspective of conventional
> HPC.
>
>> cop-out but we're not keen to reinvent the wheel. It provides
>> statekeeping and job queues in one package; replacing it wouldn't
>> be
>
> "statekeeping" is just tracking queued/running/done jobs, right?
That, and metadata around jobs - the number of times a task has been
retried, what machine ran what where, failure logs and traces, etc. It
also gives you some guarantees about message delivery and receipt,
timing, etc., which negates the need for an external process to handle
that. For example, job timeouts: if I expect a task to be done in 5
hours I can say so, and after that time it will issue a failure
decision so the workflow can decide what to do next, eg try again.

>> trivial but wouldn't be a massive task; the cost of using it is
>> tiny, though, and it made our life a lot easier. It's all written
>> in terms of deciders, which make decisions based on a list of
>> events associated with an event (eg a "finished activity" event
>> will have the details about the activity starting, being
>> scheduled, and being completed, output status etc),
>
> is the workflow complicated - a directed graph with complicated
> structure, rather than a series of discrete jobs, each a simple
> chain/pipeline in structure?

It's a simple pipeline structure _at present_, but making it more
complex (ie a directed graph with parallel processing and joins/locks
etc) would be relatively trivial to do; SWF does provide signalling
and lock 'primitives', so to speak. We've not found the need for this
just yet. You can easily take whole pipelines and run them as children
of a master pipeline (eg "run these 5 things against this input, then
combine them") to track overall success/failure - but we have a layer
above this which provides that sort of batch management, so it wasn't
needed for us.

>> maintained by passing JSON blobs around as messages; there'll be
>> a blog post or two explaining things on our website soonish and
>> I'll post them across if there's interest.
>
> a reference would be interesting.

Soonish, though we're not talking academic papers!
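To make the decider/timeout idea concrete, here's a minimal sketch of
the shape of that logic. This is purely illustrative - the event names,
dict structure, and retry policy are my assumptions for the example,
not the real SWF event history format (which you'd normally consume
via SWF's PollForDecisionTask / RespondDecisionTaskCompleted calls):

```python
# Hypothetical sketch of a decider: given the event history for a
# workflow (oldest first), decide what happens next. Event names and
# the retry limit are illustrative assumptions, not the SWF API.

MAX_RETRIES = 3  # assumed policy: retry a timed-out task up to 3 times

def decide(history):
    """Return the next decision for a simple one-task pipeline."""
    # If the activity ever completed, the workflow is done.
    if any(e["type"] == "ActivityTaskCompleted" for e in history):
        return {"decision": "CompleteWorkflow"}
    # Count SWF-issued timeout events (the "5 hours and it failed"
    # case described above) and give up past the retry limit.
    timeouts = sum(1 for e in history if e["type"] == "ActivityTaskTimedOut")
    if timeouts > MAX_RETRIES:
        return {"decision": "FailWorkflow", "reason": "too many timeouts"}
    # Otherwise schedule (or reschedule) the task with a 5-hour
    # start-to-close timeout, as in the example in the text.
    return {"decision": "ScheduleActivity",
            "timeout_seconds": 5 * 3600,
            "attempt": timeouts + 1}
```

The point is that the decider is a pure function of the event list:
SWF keeps the state and delivers the history; your code only has to
look at it and emit the next decision.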
>> It's being used in production on a regular basis and has had
>> quite a lot of content processed through it so far; these tasks
>> on average run for 2-6 hours and involve ~1GB of data going in
>> and a few megabytes out.
>
> that's unexceptional from an HPC perspective.

Absolutely; I wouldn't claim we're playing in the same ballpark as
traditional HPC in terms of data, throughput, or timing/latency
requirements! We're only a humble R&D department playing with small
datasets, since we're at quite an early stage of deploying this. We
have ~15PB of data to process with a number of tools once we've kicked
the tyres a bit.

>> The APIs are all simple HTTPS RESTful ones, storage can be cloud
>> provider storage or local shared drive storage.
>
> one premise usually found in HPC is that the job, at least the main
> part, should be compute-bound. how do you ensure that your compute
> resources are not idle or starved by external IO bottlenecks?

Generally we're loading in a few GB of data, which takes a few
minutes; beyond that it's hours of compute-bound work. We've got
monitoring on those machines to make sure they aren't stuck. We have
machines which load content from remote sources and preload it into a
(network-local) cache, so the IO bottlenecks are limited to local
network bandwidth. Generally we're running 4 or 8 workers per machine,
so a machine is only ever fully starved for a few minutes at the start
of each piece of work, which is an acceptable loss for us. There is no
remote data access after the initial chunk of time spent fetching the
content to the local store. The machines are automagically killed by a
very small script if they're idle for any significant amount of time,
so failure conditions where machines end up idle aren't really a
concern - idleness costs us nothing but time, and minor delays are
acceptable in our processing.
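For what it's worth, the idle-reaper logic is about this complicated.
The sketch below is my own illustration, not our actual script - the
threshold, sample window, and how you'd collect CPU readings or
trigger termination are all assumptions:

```python
# Hypothetical sketch of an idle-machine reaper: if CPU utilisation
# stays below a threshold for a sustained window, the instance should
# terminate itself. Numbers here are illustrative assumptions.

IDLE_THRESHOLD = 5.0  # percent CPU below which we call the box "idle"
IDLE_SAMPLES = 30     # eg 30 one-minute samples = 30 minutes of idle

def should_terminate(cpu_samples):
    """cpu_samples: recent CPU% readings, newest last.

    Returns True only when we have a full window of history and every
    reading in that window is below the idle threshold - so a brief
    dip, or a freshly booted machine, never triggers termination.
    """
    if len(cpu_samples) < IDLE_SAMPLES:
        return False  # not enough history yet; machine just started
    recent = cpu_samples[-IDLE_SAMPLES:]
    return all(s < IDLE_THRESHOLD for s in recent)
```

In practice you'd feed this from something like periodic load-average
readings and, on True, call your cloud provider's terminate-instance
API; since the workers are stateless between tasks, killing an idle
box loses nothing.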
>> interprocess communication performance is less important and
>> robustness and dynamic scalability plays a major role.
>
> well, I think that's a bit disingenuous, since HPC is highly tuned
> for robustness and dynamic scalability...

In a typical HPC setting you have n nodes, and n does not necessarily
change frequently - unless I've got hold of wholly the wrong end of
the stick - though that's not -always- the case, eg Condor. I know HPC
is focused on robustness; it's the same problem space, in the sense
that lots of machines means lots of failures. But where people get
twitchy about OpenMPI taking a few more microseconds in setting A
versus setting B, we don't have that concern - every unit of work is
isolated, stand-alone, and CPU-bound locally, with no external
dependencies until it is complete, and reporting results is a tiny
amount of network load.

This is a very lightweight system by most people's standards here, I'm
sure. The more interesting thing from our perspective than the 'HPC'
elements is that this is a generic system for our tasks: we've got a
quite complex image/build system that lets us just drop in new code -
even quite complex projects - with nearly no work, and run it all at
more or less arbitrary scale. The generic nature of the system and the
low barrier to entry is the fun bit.

--
Cheers,
James

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
