The optimal config depends on many factors, but did you try a
smaller spark.sql.shuffle.partitions value? Just guessing -- 160 or
320 may be reasonable.
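
For reference, lowering it is a one-line change before the insert (a
sketch against your snippet below; `sqlsc` is the SQLContext variable
from your own code, not a Spark API, and 320 is only a starting guess
to tune from):

```scala
// Fewer shuffle partitions => fewer, larger reduce tasks, which can
// cut per-task overhead when 2048 partitions is more than the
// 80 executors x 2 cores can keep busy.
sqlsc.set("spark.sql.shuffle.partitions", "320")
```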

On Mon, Jul 28, 2014 at 1:52 AM, Earthson <earthson...@gmail.com> wrote:
> I'm using SparkSQL with Hive 0.13, here is the SQL for inserting a partition
> with 2048 buckets.
> <pre>
> sqlsc.set("spark.sql.shuffle.partitions", "2048")
> hql("""|insert %s table mz_log
>        |PARTITION (date='%s')
>        |select * from tmp_mzlog
>        |CLUSTER BY mzid
>     """.stripMargin.format(overwrite, log_date))
> </pre>
>
> env:
>
> yarn-client mode with 80 executors, 2 cores per executor.
>
> Data:
>
> the original text log is about 1.1T.
>
> - - -
>
> the reduce stage is too slow.
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_1.png>
>
> here is the network usage; it's not the bottleneck.
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_2.png>
>
> and the CPU load is very high, why?
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n10765/Screen_Shot_2014-07-28_3.png>
> here is the configuration(conf/spark-defaults.conf)
>
> <pre>
> spark.ui.port   8888
> spark.akka.frameSize    128
> spark.akka.timeout      600
> spark.akka.threads      8
> spark.files.overwrite   true
> spark.executor.memory   2G
> spark.default.parallelism       32
> spark.shuffle.consolidateFiles  true
> spark.kryoserializer.buffer.mb  128
> spark.storage.blockManagerSlaveTimeoutMs        200000
> spark.serializer        org.apache.spark.serializer.KryoSerializer
> </pre>
>
> 2 failed with a MapTracker error.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SparkSQL-reduce-stage-of-shuffle-is-slow-tp10765.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
