On 3/14/2014 12:13 AM, Sebastian Schelter wrote:
(1) Efficient execution of iterative programs.
In Hadoop, every iteration must be scheduled as a separate job, which
re-reads invariant data and materializes its result to HDFS. As a
result, iterative programs on Hadoop are an order of magnitude slower
than on systems with dedicated support for iteration.
Does H2O help here, or would we need to incorporate another system for
such tasks?
I'll just join Ted's voice here. For normal iteration, it's just "Plain
Old Java Code": in-memory, in-process, standard stuff you debug with an
IDE, etc.
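To make that concrete, here's a minimal sketch (hypothetical code, not
H2O's API) of what in-process iteration looks like when the data lives
in RAM: every pass is a plain method call, with no per-iteration job
scheduling and no HDFS round-trip.

    // Hypothetical sketch, not H2O code: plain-old-Java iteration over
    // in-memory data. Converges on the mean via gradient steps; each
    // pass is just a loop body, debuggable in any IDE.
    public class IterateInProcess {
      public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0}; // invariant data, loaded once
        double estimate = 0.0;
        for (int iter = 0; iter < 1000; iter++) {
          double grad = 0.0;
          for (double x : data) grad += (estimate - x);
          estimate -= 0.1 * grad / data.length; // in-memory update, no disk I/O
          if (Math.abs(grad) < 1e-12) break;    // converged; fall out of the loop
        }
        System.out.println("estimate = " + estimate); // ~2.5, the mean
      }
    }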
This means the process has to stick around to run the multiple
iterations, which brings us to the question of deployment.
H2O supports a bunch of deployment options:
* Single JVM, single process, typically local. You hack Java code,
  press the "compile & go" button in your IDE (or your shell-script
  launch, or whatever), and a single JVM comes up with H2O inside;
  your algo runs until you exit the process. Perfect for algo-dev,
  but also good for running H2O on anybody's lappy to do modeling or
  whatever. While the process is up, it holds data in the K/V store
  (i.e., an in-memory file system) and supports parallel execution of
  algorithms written in the F/J style, including all the light-weight
  Map/Reduce stuff (see the sketch after this list). In addition, you
  can do all the things in the following deployment options (batch,
  interactive, long-lived), just run on a single machine.
* Multiple JVMs, batch-style. This is more like the traditional
  Hadoop job. The multiple JVMs come up; cluster; do whatever batched
  commands they are given; shut down. While up, they support the
  notion of a persistent datastore across batch commands (that
  in-memory K/V store again), and *distributed* as well as parallel
  execution. This deployment model matches common production
  use-cases, where H2O handles, e.g., the analytics piece of a larger
  work-flow.
* Multiple JVMs, interactive. You start a cluster, which you don't
  shut down. While the cluster is up, it's available for interactive
  work: you load datasets, which persist in RAM; you munge the data,
  explore the results, etc. You access it via REST/JSON, which backs,
  e.g., the R-like interpreter and the web GUI. This is our typical
  interactive R/Python/Excel session model.
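Here's the sketch promised in the first bullet: the F/J-style
light-weight Map/Reduce pattern, loosely modeled on H2O's MRTask API
(class and method names here are version-dependent and from memory;
treat them as illustrative, not authoritative). map() runs per local
data chunk, in parallel; reduce() merges partial results, whether
across threads in one JVM or across nodes in a cluster.

    import water.MRTask;
    import water.fvec.Chunk;

    // Sum one column: map() sees a local chunk at a time, reduce()
    // folds the partial sums together. The same class runs single-JVM
    // parallel or fully distributed, unchanged.
    class SumTask extends MRTask<SumTask> {
      double _sum;                          // this task's partial result
      @Override public void map(Chunk c) {
        for (int row = 0; row < c._len; row++)
          _sum += c.atd(row);               // read from the in-memory K/V store
      }
      @Override public void reduce(SumTask other) {
        _sum += other._sum;                 // merge partials
      }
    }
    // Usage: double total = new SumTask().doAll(vec)._sum;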
(2) Efficient join implementations
This one's easy: the start of the "join" work was in my IDE prior to
the last few days' crazy work. We'll see a fast, generic join before
too long. Note that we've already got the converse, "ddply" (or
GroupBy), working at-scale.
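For a feel of how a GroupBy runs at-scale, here's a conceptual sketch
(plain Java, not the actual H2O implementation): each partition builds
a local per-group aggregate in a map phase, and the partial maps are
merged pairwise in a reduce phase; the distributed version has the
same shape, just over chunks on different nodes.

    import java.util.HashMap;
    import java.util.Map;

    public class GroupBySketch {
      // Map phase: aggregate one partition into key -> sum.
      static Map<String, Double> mapPartition(String[] keys, double[] vals) {
        Map<String, Double> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++)
          partial.merge(keys[i], vals[i], Double::sum);
        return partial;
      }
      // Reduce phase: fold one partial aggregate into another.
      static Map<String, Double> reduce(Map<String, Double> a,
                                        Map<String, Double> b) {
        for (Map.Entry<String, Double> e : b.entrySet())
          a.merge(e.getKey(), e.getValue(), Double::sum);
        return a;
      }
      public static void main(String[] args) {
        Map<String, Double> p1 =
            mapPartition(new String[]{"a", "b"}, new double[]{1, 2});
        Map<String, Double> p2 =
            mapPartition(new String[]{"a"}, new double[]{3});
        System.out.println(reduce(p1, p2)); // {a=4.0, b=2.0}
      }
    }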
Cliff
Thx,
Sebastian