On 3/14/2014 12:13 AM, Sebastian Schelter wrote:

(1) Efficient execution of iterative programs.

In Hadoop, every iteration must be scheduled as a separate job, re-reads invariant data, and materializes its result to HDFS. As a result, iterative programs on Hadoop are an order of magnitude slower than on systems with dedicated support for iteration.

Does H2O help here, or would we need to incorporate another system for such tasks?

I'll just join Ted's voice here. For normal iteration, it's just "Plain Old Java Code". In memory, in process, standard stuff you debug with an IDE, etc.
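To make that concrete, here is a minimal sketch of what "iteration as plain Java" looks like. This is not H2O's actual API; the gradient-descent loop, data, and learning rate are purely illustrative assumptions. The point is only that each iteration is another trip around an in-process loop over in-memory data:

    // Minimal sketch: iterative gradient descent over in-memory data.
    // Nothing is re-read from disk or re-scheduled between iterations --
    // each pass is just another trip around a plain Java loop.
    public class IterateInMemory {
      public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4, 5};              // invariant data, loaded once
        double[] ys = {2.1, 3.9, 6.2, 7.8, 10.1};
        double w = 0.0, rate = 0.01;
        for (int iter = 0; iter < 1000; iter++) {   // the "iterations"
          double grad = 0.0;
          for (int i = 0; i < xs.length; i++)       // one pass over the data
            grad += (w * xs[i] - ys[i]) * xs[i];
          w -= rate * grad / xs.length;             // update model state in RAM
        }
        System.out.println("fitted slope: " + w);   // ~2.0; set a breakpoint anywhere
      }
    }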

This means the process has to stick around to run the multiple iterations, which brings us to the question of deployment.

H2O supports a bunch of deployment options -

 * Single JVM, single process, typically local.  You hack Java code,
   you press the "compile & go" button in your IDE (or your
   shell-script launch or whatever), and a single JVM comes up with H2O
   inside; your algo runs until you exit the process.  Perfect for
   algo-dev, but also good for running H2O on anybody's laptop to do
   modeling or whatever.  While the process is up, it holds data in the
   K/V store (in effect an in-memory file system), and supports
   parallel execution of algorithms written to the Fork/Join (F/J)
   style, including all the light-weight Map/Reduce stuff (see the
   sketch after this list).  In addition, you can do all the things in
   the following deployment options (batch, interactive, long-lived),
   just run on a single machine.
 * Multiple JVMs, batch-style.  This is more like the traditional
   Hadoop Job.  The multiple JVMs come up, cluster, do whatever batched
   commands they are given, then shut down.  While up, they support the
   notion of a persistent datastore across batch commands (that
   in-memory K/V store again), and *distributed* as well as parallel
   execution.  This deployment model is similar to common production
   use-cases, where H2O is being used in a larger work-flow to do e.g.
   the analytics piece.
 * Multiple JVMs, interactive.  You start a cluster, which you don't
   shut down.  While the cluster is up, it's available for interactive
   work.  You load datasets, which persist in RAM.  You munge the data,
   explore the results, etc.  You access it with REST/JSON, which
   includes e.g. an R-like interpreter, and the web GUI.  This is our
   typical interactive R/Python/Excel session model.
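For a feel of the F/J-style Map/Reduce mentioned in the first option above, here is a minimal sketch using the stock java.util.concurrent Fork/Join framework. This is not H2O's actual task API; the sum-over-an-array job is an illustrative assumption, showing only the split/compute/combine shape:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Minimal sketch of F/J-style "light-weight Map/Reduce":
    // recursively split the range (map side), sum the halves (reduce side).
    public class SumTask extends RecursiveTask<Double> {
      static final int CUTOFF = 1 << 16;   // below this, just loop sequentially
      final double[] data; final int lo, hi;
      SumTask(double[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
      }

      @Override protected Double compute() {
        if (hi - lo <= CUTOFF) {           // leaf: plain sequential sum
          double s = 0;
          for (int i = lo; i < hi; i++) s += data[i];
          return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                       // run the left half in parallel
        double right = new SumTask(data, mid, hi).compute();
        return left.join() + right;        // reduce: combine the two halves
      }

      public static void main(String[] args) {
        double[] data = new double[1 << 20];
        java.util.Arrays.fill(data, 1.0);
        double sum = ForkJoinPool.commonPool()
                                 .invoke(new SumTask(data, 0, data.length));
        System.out.println(sum);           // 1048576.0
      }
    }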



(2) Efficient join implementations

This one's easy: the start of "join" work was in my IDE before the last few days' crazy work. We'll see a fast generic join before too long. Note that we've already got the converse "ddply" - or GroupBy - working at scale.
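For flavor, the core of a generic join is a hash join: build a hash index on one side, stream the other. Here is a minimal single-process sketch; the toy tables and key layout are illustrative assumptions, not H2O's design (the real work is making this fast and distributed):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of an inner hash join on a shared key:
    // build a hash index on the smaller side, then stream the larger side.
    public class HashJoinSketch {
      public static void main(String[] args) {
        // left(key, name) and right(key, score) -- illustrative toy tables
        String[][] left  = {{"1","alice"}, {"2","bob"}, {"3","carol"}};
        String[][] right = {{"2","0.9"}, {"3","0.4"}, {"4","0.7"}};

        // Build phase: index the left table by key.
        Map<String, String> byKey = new HashMap<>();
        for (String[] row : left) byKey.put(row[0], row[1]);

        // Probe phase: stream the right table, emit matching rows.
        for (String[] row : right) {
          String name = byKey.get(row[0]);
          if (name != null)
            System.out.println(row[0] + "," + name + "," + row[1]);
        }
        // prints: 2,bob,0.9 and 3,carol,0.4
      }
    }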

Cliff




Thx,
Sebastian
