Hi Neal,

spark.sql.shuffle.partitions is the property that controls the number of tasks
after a shuffle (to generate t2, there is a shuffle for the aggregations
specified by groupBy and agg). You can use
sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or
sqlContext.sql("set spark.sql.shuffle.partitions=newNumber") to set its
value.
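
For example, a minimal sketch using the hiveContext and t1 from your snippet
(the value 10 is just illustrative):

// Either form works; set this before running the aggregation.
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
// or, equivalently, via SQL:
hiveContext.sql("SET spark.sql.shuffle.partitions=10")

// Re-running the groupBy/agg query after this produces a t2 with 10
// post-shuffle partitions instead of the default 200.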

For t2.repartition(10).collect, because your t2 has 200 partitions, the
first stage has 200 tasks (the second stage, which does the repartitioning,
has 10 tasks). So, if you have something like val t3 = t2.repartition(10),
t3 will have 10 partitions.
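
For example, a sketch using your DataFrames (partition counts are the ones
from your run):

val t3 = t2.repartition(10)
t3.rdd.partitions.size   // 10

// t3.collect() still runs a 200-task stage (computing t2's shuffle output),
// followed by a 10-task stage for the repartition to 10 partitions.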

Thanks,

Yin

On Thu, Apr 16, 2015 at 3:04 PM, Neal Yin <neal....@workday.com> wrote:

>  I have some trouble to control number of spark tasks for a stage.  This
> on latest spark 1.3.x source code build.
>
>  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sc.getConf.get("spark.default.parallelism")  —> set to *10*
> val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
> val t2 = t1.groupBy("country", "state",
> "city").agg(avg("price").as("aprive”))
>
>  t1.rdd.partitions.size   —>  got 2
> t2.rdd.partitions.size  —>  got 200
>
>  First question, why does t2’s partition size become 200?
>
>  Second question, even if I do t2.repartition(10).collect, in some
> stages, it still fires 200 tasks.
>
>  Thanks,
>
>  -Neal
>
