I'm using SparkSQL with Hive 0.13. Here is the SQL for inserting into a partition with 2048 buckets:

<pre>
sqlsc.set("spark.sql.shuffle.partitions", "2048")
hql("""|insert %s table mz_log
       |PARTITION (date='%s')
       |select * from tmp_mzlog
       |CLUSTER BY mzid
       """.stripMargin.format(overwrite, log_date))
</pre>
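For clarity, here is roughly what that template renders to, with hypothetical values overwrite = "overwrite" and log_date = "2014-07-28" (neither value is shown in the snippet above):

<pre>
-- Hypothetical rendered HiveQL, assuming overwrite = "overwrite"
-- and log_date = "2014-07-28":
insert overwrite table mz_log
PARTITION (date='2014-07-28')
select * from tmp_mzlog
CLUSTER BY mzid
</pre>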
Environment: yarn-client mode with 80 executors, 2 cores per executor.

Data: the original text log is about 1.1 TB.

The reduce stage is too slow:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_1.png>

Here is the network usage; it's not the bottleneck:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_2.png>

And the CPU load is very high. Why?
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_3.png>

Here is the configuration (conf/spark-defaults.conf):

<pre>
spark.ui.port 8888
spark.akka.frameSize 128
spark.akka.timeout 600
spark.akka.threads 8
spark.files.overwrite true
spark.executor.memory 2G
spark.default.parallelism 32
spark.shuffle.consolidateFiles true
spark.kryoserializer.buffer.mb 128
spark.storage.blockManagerSlaveTimeoutMs 200000
spark.serializer org.apache.spark.serializer.KryoSerializer
</pre>

Two failed with a MapTracker error.
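In case it helps reproduce the setup, here is a minimal sketch of applying the same settings programmatically instead of through spark-defaults.conf, assuming the standard Spark 1.x SparkConf API; the values simply mirror the file above:

<pre>
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: mirrors conf/spark-defaults.conf shown above.
// Properties set here take precedence over spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")
  .set("spark.executor.memory", "2G")
  .set("spark.default.parallelism", "32")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.kryoserializer.buffer.mb", "128")
  .set("spark.storage.blockManagerSlaveTimeoutMs", "200000")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
</pre>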