I'm having trouble controlling the number of Spark tasks for a stage. This is on the latest Spark 1.3.x source code build.

import org.apache.spark.sql.functions.avg

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sc.getConf.get("spark.default.parallelism")   // set to 10

val t1 = hiveContext.sql("FROM SalesJan2009 select *")
val t2 = t1.groupBy("country", "state", "city").agg(avg("price").as("aprive"))

t1.rdd.partitions.size   // got 2
t2.rdd.partitions.size   // got 200

First question: why does t2's partition count become 200?
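My guess is that this may be related to spark.sql.shuffle.partitions, which I believe defaults to 200 for Spark SQL shuffles rather than following spark.default.parallelism, but I'm not sure. This is just how I checked and experimented with it (the value 10 below is only a test):

// Guess: the post-shuffle partition count for Spark SQL may come from
// spark.sql.shuffle.partitions (default 200), not spark.default.parallelism.
hiveContext.getConf("spark.sql.shuffle.partitions", "200")   // read the current setting
hiveContext.setConf("spark.sql.shuffle.partitions", "10")    // experiment: lower it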

Second question: even if I do t2.repartition(10).collect, some stages still fire 200 tasks.
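In case it helps, this is exactly what I ran for the second question (same t2 as above):

// Explicitly repartition t2 down to 10 before collecting.
val t3 = t2.repartition(10)
t3.rdd.partitions.size   // 10 after the explicit repartition
t3.collect()             // but some upstream stages still fire 200 tasks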

Thanks,

-Neal
