Also, you may have to adjust your algorithms. For instance, the conventional algorithm for SVD is a Lanczos iterative algorithm. Iteration in Hadoop is death because of job invocation time: each iteration has to pay the fixed cost of launching a job. What you wind up with is an algorithm that will handle big data, but with a slow-down factor so large that a single node performs at the same level as 100 Hadoop nodes or more. Scaling is irrelevant for iterative algorithms like this because of the enormous fixed cost.
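To make the fixed-cost argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name and all the numbers (30 s job startup, 60 s of single-node work per pass, 200 iterations versus 5 passes) are illustrative assumptions, not measurements, and the model assumes the per-pass work divides perfectly across nodes.

```python
def job_chain_seconds(num_jobs, startup_s, work_per_job_s, nodes):
    """Estimated wall-clock time for a chain of MapReduce jobs.

    Hypothetical model: every job pays a fixed startup cost, and its
    work is split evenly across the nodes (no skew, no stragglers).
    """
    return num_jobs * (startup_s + work_per_job_s / nodes)


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    startup_s = 30.0       # fixed cost to launch one job
    work_per_job_s = 60.0  # single-node compute time for one pass

    for nodes in (1, 10, 100, 1000):
        iterative = job_chain_seconds(200, startup_s, work_per_job_s, nodes)
        few_pass = job_chain_seconds(5, startup_s, work_per_job_s, nodes)
        print(f"{nodes:5d} nodes: 200 jobs = {iterative:8.0f} s, "
              f"5 jobs = {few_pass:6.0f} s")
```

Under this model the 200-job chain can never finish faster than 200 times the startup cost, no matter how many nodes are added, which is why the number of jobs matters far more than the cluster size.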
On the other hand, you can switch to some of the recently developed stochastic projection algorithms which give a non-iterative algorithm that requires 4-7 map-reduce steps (depending on which outputs you need). With these projection algorithms, Hadoop can out-run other techniques even with quite modest cluster sizes and will scale linearly.

On Thu, Jan 17, 2013 at 9:47 PM, Stephen Boesch <java...@gmail.com> wrote:

> Hi Thiago,
> Subjectively: there are a number of items to consider to achieve nearly
> linear scaling:
>
> - if the work is well balanced among the tasks - no skew
> - No skew in the association of tasks to nodes. Note: this skew
> actually happens by default if the number of tasks is less than the cluster
> capacity of slots. You will notice that on a cluster with 20 nodes, with
> each node set to 20 mapper tasks, if you launch a job with 20 maps it may
> well have all of them running on one node.
> - with higher number of tasks the risk of having stragglers affecting
> overall throughput/performance increases unless speculative execution were
> set properly
> - hadoop configuration settings come under more pressure with more
> - properly tuning the number of mappers and reducers to (a) your node
> and cluster characteristics and (b) the particular tasks has a large impact
> on performance. In my experience the settings are often set too
> conservatively / too low to take advantage of the node and cluster
> resources
>
> So in summary hadoop itself is capable of nearly linear scaling to low
> thousands of nodes, but configuring the cluster to really achieve that
> requires effort.
>
>
> 2013/1/17 Thiago Vieira <tpbvie...@gmail.com>
>
>> Hello!
>>
>> Is common to see this sentence: "Hadoop Scales Linearly". But, is there
>> any performance evaluation to confirm this?
>>
>> In my evaluations, Hadoop processing capacity scales linearly, but not
>> proportional to number of nodes, the processing capacity achieved with 20
>> nodes is not the double of the processing capacity achieved with 10 nodes.
>> Is there any evaluation about this?
>>
>> Thank you!
>>
>> --
>> Thiago Vieira
>>
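For readers unfamiliar with the stochastic projection approach mentioned at the top of this reply, the general idea can be sketched on a single machine with NumPy. This is a generic randomized SVD written for illustration, not a MapReduce implementation and not necessarily the exact algorithm referred to above; the function name `randomized_svd` and the `oversample` parameter are assumptions of this sketch. The point is that the large matrix is touched in a small, fixed number of passes (two here) instead of once per iteration.

```python
import numpy as np


def randomized_svd(a, rank, oversample=10, seed=0):
    """Approximate truncated SVD via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Project onto a few more directions than the target rank.
    k = min(rank + oversample, min(a.shape))
    omega = rng.standard_normal((a.shape[1], k))

    y = a @ omega            # pass 1 over A: sample its column space
    q, _ = np.linalg.qr(y)   # orthonormal basis for the sampled space
    b = q.T @ a              # pass 2 over A: project A onto that basis

    # b is small (k rows), so an ordinary dense SVD is cheap here.
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic test matrix of exact rank 5 (made-up data).
    a = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
    u, s, vt = randomized_svd(a, rank=5)
    exact = np.linalg.svd(a, compute_uv=False)[:5]
    print("max singular value error:", np.max(np.abs(s - exact)))
```

The two products with `a` are the only steps that need the full matrix, and each is the kind of bulk pass that distributes well, while the QR and the small SVD operate on much smaller matrices.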