You can get a Hadoop job to finish in under a minute -- with tuning and care, not by default. I suppose I'd be surprised if broadly equivalent, well-written iterative processes on the same data took very different amounts of time on two different general-purpose infrastructures. (I'm sure specialized implementations could do better -- by not being purely iterative.) But I am sure Hadoop misses optimal by a wide margin.
I suppose I was wondering aloud earlier whether iterative/synchronous processes are just unsuitable in general for large-scale learning. (I don't think so.) But this is a different and interesting question: when (not if) will there be a better framework for iterative/synchronous processing? Hadoop is not optimal, but probably still worth building on now, given that people actually have Hadoop clusters. If you have a Hadoop-sized problem that runs for hours on a cluster, the extra 20 minutes of overhead over 20 iterations isn't game-changing enough to start over. I am really interested to see whether YARN is that next-gen fabric, or something else is.

It's also an interesting point that most people don't actually have large data after pruning and selection. Completely agree, and there's no particular reason not to use tools that run comfortably on one big machine if you're at that point. Simpler and cheaper. The only interesting thing to do with "Big Learning" is to take away the need to prune, filter, and select features. If you can offer a scalable way to magically squeeze more quality out of much more, lower-quality data, that's something interesting. That probably requires something distributed. But otherwise, if you've already cleaned and refined the data, probably not much is added by distributing.

(I also still think the real-time update and query aspect is a different, and hard, question not addressed by any of these parallel computation frameworks -- building the model is just half the battle!)

On Fri, Mar 8, 2013 at 2:15 PM, Sebastian Schelter <[email protected]> wrote:

> Well, my general experience is that a lot of datasets used for iterative
> computations are not that large after feature extraction is done. Hadoop
> includes a lot of design decisions that only make sense when you scale
> out to really large setups. I'm not convinced that machine learning
> (except for maybe Google or Facebook) really falls into this category.
>
> If you think of graph mining or collaborative filtering, even datasets
> with a few billion datapoints will need only a few dozen gigabytes and
> easily fit into the aggregate main memory of a small cluster.
>
> For example, in some recent experiments, I was able to conduct an
> iteration of PageRank on a dataset with 1B edges in approx 20 seconds on
> a small 26-node cluster using the Stratosphere [1] system. The authors
> of GraphLab report similar numbers in recent papers [2].
>
> I'm not sure that you can get Hadoop anywhere near that performance on a
> similar setup.
>
> [1] https://www.stratosphere.eu/
> [2] http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
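P.S. For anyone following along who hasn't looked at PageRank up close: each iteration is just a per-edge scatter of rank mass followed by a per-node sum, which is why per-iteration framework overhead dominates so easily. A minimal single-machine sketch (my own illustrative code, standard power iteration with a damping factor -- nothing to do with the Stratosphere or GraphLab implementations referenced above):

```python
# Minimal power-iteration PageRank over an edge list, single machine.
# Illustrative only: billion-edge graphs like those discussed above need
# the aggregate memory of a cluster (or out-of-core techniques).

def pagerank(edges, n, damping=0.85, iterations=20):
    """edges: list of (src, dst) pairs; nodes are numbered 0..n-1."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Scatter: each node sends rank/out_degree along each out-edge.
        contrib = [0.0] * n
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_degree[src]
        # Mass of dangling nodes (no out-edges) is spread uniformly.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        # Gather: new rank = teleport base + damped incoming contributions.
        ranks = [base + damping * c for c in contrib]
    return ranks

# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
```

Running this as a sequence of Hadoop jobs means paying the job-launch cost once per iteration, which is exactly the overhead being discussed.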
