I'm using SparkSQL with Hive 0.13. Here is the SQL for inserting a partition
with 2048 buckets:
<pre>
// use 2048 reduce partitions for the shuffle introduced by CLUSTER BY
sqlsc.set("spark.sql.shuffle.partitions", "2048")

// overwrite and log_date are filled in by format()
hql("""|insert %s table mz_log
       |PARTITION (date='%s')
       |select * from tmp_mzlog
       |CLUSTER BY mzid
       |""".stripMargin.format(overwrite, log_date))
</pre>
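
For context, here is a rough RDD-level sketch of what the CLUSTER BY is doing: rows are hash-partitioned on mzid into spark.sql.shuffle.partitions reduce partitions and sorted within each partition before being written out. This is only an illustration, not the actual Spark SQL physical plan; the sample data and the output path are made up.

<pre>
import org.apache.spark.{SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)

val sc = new SparkContext("local[2]", "cluster-by-sketch")

// (mzid, log line) pairs standing in for tmp_mzlog
val tmpMzlog = sc.parallelize(Seq(("42", "line a"), ("7", "line b"), ("42", "line c")))

val clustered = tmpMzlog
  .partitionBy(new HashPartitioner(2048))                 // one partition per reducer
  .mapPartitions(_.toArray.sortBy(_._1).iterator, true)   // sort rows inside each partition

clustered.saveAsTextFile("/tmp/mz_log_clustered")         // stand-in for the partition insert
</pre>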

env:

yarn-client mode with 80 executors, 2 cores per executor.
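
For reference, a launch command along these lines would give that layout (a sketch only; the flags assume spark-submit on YARN, and the class and jar names are placeholders):

<pre>
spark-submit --master yarn-client \
  --num-executors 80 \
  --executor-cores 2 \
  --executor-memory 2G \
  --class com.example.MzLogInsert mz-log-insert.jar
</pre>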

Data:

The original text log is about 1.1 TB.

- - -

The reduce stage is too slow:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_1.png>
 

Here is the network usage; it's not the bottleneck:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_2.png>
 

And the CPU load is very high. Why?

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_3.png>
 
Here is the configuration (conf/spark-defaults.conf):

<pre>
spark.ui.port   8888
spark.akka.frameSize    128
spark.akka.timeout      600
spark.akka.threads      8
spark.files.overwrite   true
spark.executor.memory   2G
spark.default.parallelism       32
spark.shuffle.consolidateFiles  true
spark.kryoserializer.buffer.mb  128
spark.storage.blockManagerSlaveTimeoutMs        200000
spark.serializer        org.apache.spark.serializer.KryoSerializer
</pre>
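
For what it's worth, the same properties can also be set programmatically on the SparkConf before creating the context (a sketch; the app name is a placeholder, and in practice spark-defaults.conf is picked up automatically by spark-submit):

<pre>
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("mz-log-insert")   // placeholder app name
  .set("spark.akka.frameSize", "128")
  .set("spark.executor.memory", "2G")
  .set("spark.default.parallelism", "32")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.kryoserializer.buffer.mb", "128")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
</pre>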

Two failed with a MapTracker error.


