I'm using Spark SQL with Hive 0.13. Here is the SQL for inserting a partition into a table with 2048 buckets.
<pre>
sqlsc.set("spark.sql.shuffle.partitions", "2048")
hql("""|INSERT %s TABLE mz_log
       |PARTITION (date='%s')
       |SELECT * FROM tmp_mzlog
       |CLUSTER BY mzid
       """.stripMargin.format(overwrite, log_date))
</pre>
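For context, the target table is assumed to have been created with DDL along these lines; the post does not include it, and the column list here is a hypothetical placeholder. What matters is the partitioning by date and the 2048 buckets on mzid:
<pre>
// Illustrative only: mz_log's real columns are not shown in the post.
hql("""|CREATE TABLE IF NOT EXISTS mz_log (mzid STRING, line STRING)
       |PARTITIONED BY (date STRING)
       |CLUSTERED BY (mzid) INTO 2048 BUCKETS
       """.stripMargin)
</pre>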
Environment: yarn-client mode with 80 executors, 2 cores per executor.
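For completeness, this environment corresponds to a launch command roughly like the one below; the exact flags and the application name are assumed, not taken from the post:
<pre>
# Assumed spark-submit invocation matching the environment above
# (yarn-client, 80 executors, 2 cores each, 2G heap as configured below).
./bin/spark-submit --master yarn-client \
  --num-executors 80 \
  --executor-cores 2 \
  --executor-memory 2G \
  --class MyApp my-app.jar   # MyApp / my-app.jar are placeholders
</pre>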
Data: the original text log is about 1.1 TB.
- - -
The reduce stage is too slow:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_1.png>
Here is the network usage; it is not the bottleneck:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_2.png>
And the CPU load is very high. Why?
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_3.png>
Here is the configuration (conf/spark-defaults.conf):
<pre>
spark.ui.port 8888
spark.akka.frameSize 128
spark.akka.timeout 600
spark.akka.threads 8
spark.files.overwrite true
spark.executor.memory 2G
spark.default.parallelism 32
spark.shuffle.consolidateFiles true
spark.kryoserializer.buffer.mb 128
spark.storage.blockManagerSlaveTimeoutMs 200000
spark.serializer org.apache.spark.serializer.KryoSerializer
</pre>
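Since KryoSerializer is enabled above but no registrator is configured, one aside worth noting: unregistered classes make Kryo write full class names for every record, which costs CPU during the shuffle. A minimal sketch of a registrator, where MzLogRecord is a hypothetical stand-in for whatever the shuffled rows deserialize to:
<pre>
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical record class; the real row type is not shown in the post.
case class MzLogRecord(mzid: String, line: String)

// Registering shuffled classes lets Kryo write small integer IDs instead
// of full class names, which can reduce CPU in the reduce stage.
class MzKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MzLogRecord])
  }
}
// Enabled via spark-defaults.conf:
//   spark.kryo.registrator  MzKryoRegistrator
</pre>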
Two failed with a MapTracker error.