Hi,

I am building a distributed machine learning algorithm on top of Spark. The datasets reside on HDFS in *sv format and I build the RDDs with SparkContext's textFile method. The number of partitions is usually much larger than the number of blocks on HDFS (I typically aim to split the default 130 MB block into 10 partitions). My mappers each process one partition and emit a trained model (~KB range) to the reducers for aggregation, so network usage should be minimal. When I benchmarked the application on AWS, I noticed that during the first minute (while the RDDs are being built) the WHOLE dataset (~100 GB) was shipped over the network. I used ephemeral-hdfs, so the source is supposed to be the physically attached instance store.
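For reference, this is roughly what the job looks like. The HDFS path, the minPartitions value, and the per-partition "model" (just a count and a sum here) are placeholders standing in for my real code:

import org.apache.spark.{SparkConf, SparkContext}

object LocalTrainingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("local-training-sketch"))

    // ~100 GB of *sv files on ephemeral HDFS; ask for roughly 10 partitions per block.
    val lines = sc.textFile("hdfs:///data/train/*.csv", minPartitions = 8000)

    // Each mapper consumes one partition and emits a tiny "model" (KB range).
    val partialModels = lines.mapPartitions { it =>
      var n = 0L
      var s = 0.0
      it.foreach { line =>
        n += 1
        s += line.split(',')(0).toDouble // toy stand-in for the real per-partition training
      }
      Iterator((n, s))
    }

    // The reduce step only aggregates the tiny models, so shuffle traffic should be minimal.
    val (count, sum) = partialModels.reduce { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 + s2) }
    println(s"count = $count, sum = $sum")

    sc.stop()
  }
}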
My guess is that the Partitioner assigns ranges of hash keys to the nodes and that Spark then redistributes the partitions across the cluster, but I am not an expert. This is a huge overhead, as there is absolutely no reason not to process local partitions at this stage (from a machine learning standpoint stratification is useful, but I want to avoid it for now). So, any ideas as to why this is happening and how it can be avoided?

Thanks,
Acco
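P.S. In case it helps, here is a minimal sketch of how I would check whether the scheduler even knows where the HDFS blocks live; the path, minPartitions value, and the spark.locality.wait setting are assumptions, not my actual configuration. If preferredLocations comes back empty, or reports hostnames that differ from the ones the executors register with, node-local scheduling can never kick in and every read goes over the network.

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheckSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("locality-check-sketch")
      // Give the scheduler more time to wait for a node-local slot
      // before it falls back to rack-local or any (the default is 3s).
      .set("spark.locality.wait", "30s")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("hdfs:///data/train/*.csv", minPartitions = 8000)

    // Print the hosts Hadoop reports as holding the first few input splits.
    lines.partitions.take(5).foreach { p =>
      println(s"partition ${p.index} prefers: " + lines.preferredLocations(p).mkString(", "))
    }

    sc.stop()
  }
}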