Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed?  Could it be that there are a few keys with a huge
number of records?  If so, the few tasks that write those keys' partitions
have far more output than the rest, which matches the long tail you're
seeing.  You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


You could do this with:

input = sc.textFile(...)
pairsCounts = input.map { x => (x, 1) }.reduceByKey(_ + _)
sorted = pairsCounts.sortByKey()
sorted.saveAsTextFile(...)
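
If you want to confirm the skew first, here's a quick sketch; `pairs` is
assumed to be the keyed RDD from your original job:

// Count records per key, then print the 10 heaviest keys
// (only 10 rows are collected to the driver).
val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
keyCounts.top(10)(Ordering.by(_._2)).foreach(println)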




Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Akhil Das
Don't you think 1000 partitions is too few for 160 GB of data? You could
also try using the KryoSerializer and enabling RDD compression.
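
For reference, a minimal sketch of those settings in Scala; the app name,
input path, and the 2000-partition figure are illustrative guesses, not
tuned values:

import org.apache.spark.{SparkConf, SparkContext}

// Kryo serialization + RDD compression, plus more (smaller) partitions.
val conf = new SparkConf()
  .setAppName("composite-key-sort")  // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

// ~160 GB over 2000 partitions is roughly 80 MB each, a starting point.
val input = sc.textFile("hdfs:///path/to/input", 2000)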

Thanks
Best Regards



Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Sean Owen
This is more of an aside, but why repartition this data instead of letting
it define partitions naturally? You will end up with a similar number.
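
For illustration, a sketch of what I mean in Scala. The path and the key
parsing are hypothetical placeholders; the point is that sortByKey reuses
the parent RDD's partition count by default, so there's no need to force
1000:

// Assuming the existing SparkContext `sc` from the original job.
val input = sc.textFile("hdfs:///path/to/input")            // ~1 partition per HDFS block
val pairs = input.map(line => (line.split('\t')(0), line))  // hypothetical key extraction
val sorted = pairs.sortByKey()  // defaults to the parent RDD's partition count
sorted.values.saveAsTextFile("hdfs:///path/to/output")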




saveAsTextFile extremely slow near finish

2015-03-09 Thread mingweili0x
I'm basically running a sort using Spark. The Spark program reads from
HDFS, sorts on composite keys, and then saves the partitioned result back to
HDFS.
Pseudocode is like this:

input = sc.textFile
pairs = input.mapToPair
sorted = pairs.sortByKey
values = sorted.values
values.saveAsTextFile
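
Concretely, in Scala terms it's roughly the following (the paths and the
composite-key extraction are placeholders; the real job uses the Java API):

// Placeholder for the real composite-key logic.
def compositeKey(line: String): (String, String) = {
  val cols = line.split('\t')
  (cols(0), cols(1))
}

val input  = sc.textFile("hdfs:///path/to/input", 1000)    // 1000 partitions
val pairs  = input.map(line => (compositeKey(line), line)) // mapToPair equivalent
val sorted = pairs.sortByKey(true, 1000)                   // sort into 1000 partitions
val values = sorted.values
values.saveAsTextFile("hdfs:///path/to/output")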

Input size is ~160 GB, and I specified 1000 partitions in
JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the job
is split into two stages: mapToPair and saveAsTextFile. mapToPair finished
in 8 minutes, while saveAsTextFile took ~15 minutes to reach (2366/2373)
progress, and the last few tasks just take forever and never finish.

Cluster setup:
8 nodes
on each node: 15 GB memory, 8 cores

running parameters:
--executor-memory 12G
--conf spark.cores.max=60

Thank you for any help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
