This is more of an aside, but why repartition this data instead of letting sortByKey define the partitions naturally? You will end up with a similar number of partitions either way.

On Mar 9, 2015 5:32 PM, "mingweili0x" <m...@spokeo.com> wrote:
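For context on why the natural partitioning is usually fine: sortByKey uses a RangePartitioner, which samples the keys and derives the partition boundaries from the data itself, so the sorted output partitions come out roughly balanced even for skewed key distributions. A minimal plain-Python sketch of that sampling idea (no Spark involved; `range_bounds` and `partition_of` are hypothetical helpers for illustration, not Spark API):

```python
import bisect
import random

def range_bounds(keys, num_partitions, sample_size=1000):
    """Approximate the RangePartitioner idea: sample the keys and take
    evenly spaced quantiles of the sample as partition boundaries."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(key, bounds):
    """Assign a key to the partition whose key range contains it."""
    return bisect.bisect_right(bounds, key)

random.seed(0)  # reproducible demo

# 100k heavily skewed keys: the partitions still come out roughly even,
# because the boundaries are derived from the data rather than being
# fixed-width slices of the key space.
keys = [int(random.expovariate(0.001)) for _ in range(100_000)]
bounds = range_bounds(keys, 8)
sizes = [0] * 8
for k in keys:
    sizes[partition_of(k, bounds)] += 1
print(sizes)  # eight counts, each near 100_000 / 8
```

The practical upshot is that an explicit repartition before (or after) the sort just adds a second full shuffle of the 160G without improving the balance the sort already provides.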
> I'm basically running a sorting using spark. The spark program will read
> from HDFS, sort on composite keys, and then save the partitioned result
> back to HDFS. Pseudo code is like this:
>
>     input = sc.textFile
>     pairs = input.mapToPair
>     sorted = pairs.sortByKey
>     values = sorted.values
>     values.saveAsTextFile
>
> Input size is ~ 160G, and I made 1000 partitions specified in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the
> job is split into two stages: saveAsTextFile and mapToPair. MapToPair
> finished in 8 mins, while saveAsTextFile took ~15 mins to reach
> (2366/2373) progress, and the last few tasks just took forever and never
> finished.
>
> Cluster setup:
> 8 nodes
> on each node: 15gb memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.