Hi,

I am building a distributed machine learning algorithm on top of Spark.
Datasets reside on HDFS in .*sv format and I build the RDDs using the
textFile method of SparkContext.
The number of partitions is usually a lot larger than the number of blocks on
HDFS (I usually aim to split the default 130 MB block into 10 partitions).
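For concreteness, here is the split arithmetic under the figures mentioned in this message (~100 GB dataset, 130 MB blocks, 10 partitions per block; these numbers come from the message itself, not from Spark or HDFS defaults):

```python
import math

# Figures taken from the message above (assumptions, not Spark defaults).
dataset_bytes = 100 * 1024**3      # ~100 GB dataset
block_bytes = 130 * 1024**2        # 130 MB HDFS block size
splits_per_block = 10              # aim: 10 partitions per block

num_blocks = math.ceil(dataset_bytes / block_bytes)
min_partitions = num_blocks * splits_per_block

print(num_blocks, min_partitions)  # 788 7880
# In Spark this target is passed as the second argument of
# SparkContext.textFile, e.g. sc.textFile(path, minPartitions=min_partitions).
```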
My mappers process a partition and emit a trained model (~KB range) to the
reducers for aggregation (thus, minimal network usage).
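The pattern described above (train a small model per partition, aggregate only the tiny models) can be sketched without Spark; the toy train/merge functions below are hypothetical stand-ins for the real learner, just to illustrate the data flow:

```python
from functools import reduce

# Toy stand-in for the per-partition trainer: the "model" is just
# (sum, count) for a running mean -- a few bytes, mirroring the
# ~KB-range models emitted to the reducers.
def train(partition):
    return (sum(partition), len(partition))

# Aggregating models is cheap: only the small tuples would cross the network.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]   # raw data stays "local"
total, count = reduce(merge, (train(p) for p in partitions))
print(total / count)  # global mean computed from per-partition models: 3.0
```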
When I benchmarked the application on AWS, I noticed that during the first
minute (while the RDDs are being built), the WHOLE dataset (~100 GB) was
shipped over the network.
I used ephemeral-hdfs, so the source is supposed to be the physically
attached instance store.

My guess is that the Partitioner assigns ranges of hash keys to the
nodes and that Spark then redistributes the partitions across the
cluster. But again, I am not an expert.
This is a huge overhead, as there is absolutely no reason at this stage not
to process partitions locally (from a machine learning standpoint,
stratification is useful, but I want to avoid it at this stage).

So, any ideas as to why this is happening and how it can be avoided?

Thanks,
Acco




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Vast-network-traffic-during-RDD-creation-tp12255.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
