I'm having trouble controlling the number of Spark tasks for a stage. This is on the latest Spark 1.3.x source code build.
import org.apache.spark.sql.functions.avg

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sc.getConf.get("spark.default.parallelism") // set to 10

val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
val t2 = t1.groupBy("country", "state", "city").agg(avg("price").as("aprive"))

t1.rdd.partitions.size // got 2
t2.rdd.partitions.size // got 200

First question: why does t2's partition size become 200?

Second question: even if I do t2.repartition(10).collect, some stages still fire 200 tasks.

Thanks,
-Neal
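For reference, here is roughly what I am running for the second question, as a sketch (it assumes the same spark-shell session, where sc is already defined, and the same SalesJan2009 table as above):

import org.apache.spark.sql.functions.avg

// Same aggregation as above, then repartition down to 10 before collecting.
val t2 = hiveContext.sql("FROM SalesJan2009 select * ")
  .groupBy("country", "state", "city")
  .agg(avg("price").as("aprive"))

val t3 = t2.repartition(10)
t3.rdd.partitions.size // 10, as expected after the repartition
t3.collect()           // yet some stages of this job still run 200 tasks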