On 3/14/2014 12:13 AM, Sebastian Schelter wrote:
(1) Efficient execution of iterative programs.
In Hadoop, every iteration must be scheduled as a separate job, which
re-reads invariant data and materializes its result to HDFS. As a
result, iterative programs on Hadoop are an order of magnitude slower
than on systems with dedicated support for iteration.
Does H2O help here, or would we need to incorporate another system for
such tasks?
I'll just join Ted's voice here. For normal iteration, it's just "Plain
Old Java Code": in-memory, in-process, standard stuff you debug with an
IDE, etc.
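To make that concrete, here's a minimal sketch (hypothetical code, not
H2O's API) of what in-process iteration looks like when the data lives
in RAM: every pass is a plain method call, with no per-iteration job
scheduling and no HDFS round-trip.

    // Hypothetical sketch, not H2O code: plain-old-Java iteration over
    // in-memory data. Converges on the mean via gradient steps; each
    // pass is just a loop body, debuggable in any IDE.
    public class IterateInProcess {
      public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0}; // invariant data, loaded once
        double estimate = 0.0;
        for (int iter = 0; iter < 1000; iter++) {
          double grad = 0.0;
          for (double x : data) grad += (estimate - x);
          estimate -= 0.1 * grad / data.length; // in-memory update, no disk I/O
          if (Math.abs(grad) < 1e-12) break;    // converged; fall out of the loop
        }
        System.out.println("estimate = " + estimate); // ~2.5, the mean
      }
    }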
This means the process has to stick around to run the multiple
iterations, which brings us to the question of deployment.
H2O supports a bunch of deployment options:
* Single JVM, single process, typically local. You hack Java code,
  press the "compile & go" button in your IDE (or your shell-script
  launch, or whatever), and a single JVM comes up with H2O inside;
  your algo runs until you exit the process. Perfect for algo-dev,
  but also good for running H2O on anybody's lappy to do modeling or
  whatever. While the process is up, it holds data in the K/V store
  (i.e., an in-memory file system) and supports parallel execution of
  algorithms written in the F/J style, including all the light-weight
  Map/Reduce stuff (see the sketch after this list). In addition, you
  can do all the things in the following deployment options (batch,
  interactive, long-lived), just run on a single machine.
* Multiple JVMs, batch-style. This is more like the traditional
  Hadoop job. The multiple JVMs come up; cluster; do whatever batched
  commands they are given; shut down. While up, they support the
  notion of a persistent datastore across batch commands (that
  in-memory K/V store again), and *distributed* as well as parallel
  execution. This deployment model matches common production
  use-cases, where H2O handles, e.g., the analytics piece of a larger
  work-flow.
* Multiple JVMs, interactive. You start a cluster, which you don't
  shut down. While the cluster is up, it's available for interactive
  work: you load datasets, which persist in RAM; you munge the data,
  explore the results, etc. You access it via REST/JSON, which backs,
  e.g., the R-like interpreter and the web GUI. This is our typical
  interactive R/Python/Excel session model.
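Here's the sketch promised in the first bullet: the F/J-style
light-weight Map/Reduce pattern, loosely modeled on H2O's MRTask API
(class and method names here are version-dependent and from memory;
treat them as illustrative, not authoritative). map() runs per local
data chunk, in parallel; reduce() merges partial results, whether
across threads in one JVM or across nodes in a cluster.

    import water.MRTask;
    import water.fvec.Chunk;

    // Sum one column: map() sees a local chunk at a time, reduce()
    // folds the partial sums together. The same class runs single-JVM
    // parallel or fully distributed, unchanged.
    class SumTask extends MRTask<SumTask> {
      double _sum;                          // this task's partial result
      @Override public void map(Chunk c) {
        for (int row = 0; row < c._len; row++)
          _sum += c.atd(row);               // read from the in-memory K/V store
      }
      @Override public void reduce(SumTask other) {
        _sum += other._sum;                 // merge partials
      }
    }
    // Usage: double total = new SumTask().doAll(vec)._sum;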
(2) Efficient join implementations
This one's easy: the start of the "join" work was in my IDE prior to
the last few days' crazy work. We'll see a fast, generic join before
too long. Note that we've already got the converse, "ddply" (or
GroupBy), working at-scale.
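For a feel of how a GroupBy runs at-scale, here's a conceptual sketch
(plain Java, not the actual H2O implementation): each partition builds
a local per-group aggregate in a map phase, and the partial maps are
merged pairwise in a reduce phase; the distributed version has the
same shape, just over chunks on different nodes.

    import java.util.HashMap;
    import java.util.Map;

    public class GroupBySketch {
      // Map phase: aggregate one partition into key -> sum.
      static Map<String, Double> mapPartition(String[] keys, double[] vals) {
        Map<String, Double> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++)
          partial.merge(keys[i], vals[i], Double::sum);
        return partial;
      }
      // Reduce phase: fold one partial aggregate into another.
      static Map<String, Double> reduce(Map<String, Double> a,
                                        Map<String, Double> b) {
        for (Map.Entry<String, Double> e : b.entrySet())
          a.merge(e.getKey(), e.getValue(), Double::sum);
        return a;
      }
      public static void main(String[] args) {
        Map<String, Double> p1 =
            mapPartition(new String[]{"a", "b"}, new double[]{1, 2});
        Map<String, Double> p2 =
            mapPartition(new String[]{"a"}, new double[]{3});
        System.out.println(reduce(p1, p2)); // {a=4.0, b=2.0}
      }
    }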
Cliff
Thx,
Sebastian