Hi Neal,

spark.sql.shuffle.partitions is the property that controls the number of tasks
after a shuffle (to generate t2, there is a shuffle for the aggregations
specified by groupBy and agg). You can use
sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or
sqlContext.sql("set spark.sql.shuffle.partitions=newNumber") to set its
value.
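
For example, a minimal sketch using the hiveContext and t1 from your snippet
(the value 10 is just illustrative):

// Either form works; set this before running the aggregation.
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
// or, equivalently, via SQL:
hiveContext.sql("SET spark.sql.shuffle.partitions=10")

// Re-running the groupBy/agg query after this produces a t2 with 10
// post-shuffle partitions instead of the default 200.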

For t2.repartition(10).collect, because your t2 has 200 partitions, the
first stage has 200 tasks (the second stage, which does the repartitioning,
has 10 tasks). So, if you have something like val t3 = t2.repartition(10),
t3 will have 10 partitions.
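
For example, a sketch using your DataFrames (partition counts are the ones
from your run):

val t3 = t2.repartition(10)
t3.rdd.partitions.size   // 10

// t3.collect() still runs a 200-task stage (computing t2's shuffle output),
// followed by a 10-task stage for the repartition to 10 partitions.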

Thanks,

Yin

On Thu, Apr 16, 2015 at 3:04 PM, Neal Yin <neal....@workday.com> wrote:

>  I have some trouble to control number of spark tasks for a stage.  This
> on latest spark 1.3.x source code build.
>
>  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sc.getConf.get("spark.default.parallelism")  —> set to *10*
> val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
> val t2 = t1.groupBy("country", "state",
> "city").agg(avg("price").as("aprive”))
>
>  t1.rdd.partitions.size   —>  got 2
> t2.rdd.partitions.size  —>  got 200
>
>  First question, why does t2’s partition size become 200?
>
>  Second question, even if I do t2.repartition(10).collect, in some
> stages, it still fires 200 tasks.
>
>  Thanks,
>
>  -Neal
>
