Can you tell which nodes were doing the computation in each case?

Date: Wed, 27 Aug 2014 20:29:38 +0530
Subject: Execution time increasing with increase of cluster size
From: sarathchandra.jos...@algofusiontech.com
To: user@spark.apache.org
Hi,

I've written a simple Scala program which reads a file from HDFS (a delimited file with 100 fields and 1 million rows), splits each row on the delimiter, computes the hashcode of each field, builds new rows from these hashcodes, and writes those rows back to HDFS. Code attached.

When I run this on a Spark cluster of 2 nodes (these 2 nodes also act as the HDFS cluster), it takes about 35 seconds to complete. I then increased the cluster to 4 nodes (the additional nodes are not part of the HDFS cluster) and submitted the same job. I was expecting a decrease in execution time, but instead it took about 3 times longer (1.6 min) to complete. Snapshots of the execution summary are attached. Both times I set executor memory to 6GB, which is available on all the nodes.

What am I missing here? Do I need to do any additional configuration when increasing the cluster size?

~Sarath

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
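Since the attachment isn't available on the list archive, here is a minimal sketch of the per-row transformation the mail describes (split on a delimiter, replace each field with its hashcode, re-join). The delimiter `","` and the object/method names are assumptions, not from the original code:

```scala
// Hedged sketch of the job described above -- the actual attachment
// is not available, so names and the delimiter are assumptions.
object HashRows {
  // Split a row on the delimiter, replace each field with its
  // hashCode, and re-join into a new delimited row.
  def transform(row: String, delim: String = ","): String =
    row.split(delim, -1)            // -1 keeps trailing empty fields
       .map(_.hashCode.toString)
       .mkString(delim)

  def main(args: Array[String]): Unit = {
    // In the Spark job this would be roughly:
    //   sc.textFile("hdfs://.../input")
    //     .map(row => transform(row))
    //     .saveAsTextFile("hdfs://.../output")
    println(transform("a,b,c"))     // prints "97,98,99"
  }
}
```

Because the `map` here is purely CPU-bound per row, the job's runtime is dominated by where the tasks run relative to the HDFS blocks, which is why node placement matters in the question above.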