You can get a Hadoop job to finish in under a minute -- with tuning and care, not by default. I suppose I'd be surprised if broadly equivalent, well-written iterative processes on the same data took very different amounts of time on two different general-purpose infrastructures. (I'm sure specialized implementations could do better -- by not being purely iterative.) But I am sure Hadoop misses optimal by a wide margin.
I suppose I was wondering aloud earlier whether iterative/synchronous processes are just unsuitable in general for large-scale learning. (I don't think so.) But this is a different and interesting question: when (not if) will there be a better framework for iterative/synchronous processing? Hadoop is not optimal, but probably still worth building on now, given that people actually have Hadoop clusters. If you have a Hadoop-sized problem that runs for hours on a cluster, the extra 20 minutes of overhead over 20 iterations isn't game-changing enough to start over. I am really interested to see whether YARN is that next-gen fabric, or something else is.

It's also an interesting point that most people don't actually have large data after pruning and selection. Completely agree, and there's no particular reason not to use tools that run comfortably on one big machine if you're at that point. Simpler and cheaper. The only interesting thing to do with "Big Learning" is to take away the need to prune, filter, and select features. If you can offer a scalable way to magically squeeze more quality out of much more, lower-quality data, that's something interesting. That probably requires something distributed. But otherwise, if you've already cleaned and refined the data, probably not much is added by distributing.

(I also still think the real-time update and query aspect is a different, and hard, question not addressed by any of these parallel computation frameworks -- building the model is just half the battle!)

On Fri, Mar 8, 2013 at 2:15 PM, Sebastian Schelter <[email protected]> wrote:

> Well, my general experience is that a lot of datasets used for iterative
> computations are not that large after feature extraction is done. Hadoop
> includes a lot of design decisions that only make sense when you scale
> out to really large setups. I'm not convinced that machine learning
> (except for maybe Google or Facebook) really falls into this category.
>
> If you think of graph mining or collaborative filtering, even datasets
> with a few billion datapoints will need only a few dozen gigabytes and
> easily fit into the aggregate main memory of a small cluster.
>
> For example, in some recent experiments, I was able to conduct an
> iteration of PageRank on a dataset with 1B edges in approx 20 seconds on
> a small 26-node cluster using the Stratosphere [1] system. The authors
> of GraphLab report similar numbers in recent papers [2].
>
> I'm not sure that you can get Hadoop anywhere near that performance on a
> similar setup.
>
> [1] https://www.stratosphere.eu/
> [2] http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
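P.S. For anyone following along who hasn't looked at PageRank up close: each iteration is just a per-edge scatter of rank mass followed by a per-node sum, which is why per-iteration framework overhead dominates so easily. A minimal single-machine sketch (my own illustrative code, standard power iteration with a damping factor -- nothing to do with the Stratosphere or GraphLab implementations referenced above):

```python
# Minimal power-iteration PageRank over an edge list, single machine.
# Illustrative only: billion-edge graphs like those discussed above need
# the aggregate memory of a cluster (or out-of-core techniques).

def pagerank(edges, n, damping=0.85, iterations=20):
    """edges: list of (src, dst) pairs; nodes are numbered 0..n-1."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Scatter: each node sends rank/out_degree along each out-edge.
        contrib = [0.0] * n
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_degree[src]
        # Mass of dangling nodes (no out-edges) is spread uniformly.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        # Gather: new rank = teleport base + damped incoming contributions.
        ranks = [base + damping * c for c in contrib]
    return ranks

# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
```

Running this as a sequence of Hadoop jobs means paying the job-launch cost once per iteration, which is exactly the overhead being discussed.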
