I'm having trouble controlling the number of Spark tasks for a stage. This is on the latest Spark 1.3.x source code build.
import org.apache.spark.sql.functions.avg

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sc.getConf.get("spark.default.parallelism") // set to 10

val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
val t2 = t1.groupBy("country", "state", "city").agg(avg("price").as("aprive"))

t1.rdd.partitions.size // got 2
t2.rdd.partitions.size // got 200

First question: why does t2's partition size become 200?

Second question: even if I do t2.repartition(10).collect, some stages still fire 200 tasks.

Thanks,
-Neal
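For reference, here is roughly what I am running for the second question, as a sketch (it assumes the same spark-shell session, where sc is already defined, and the same SalesJan2009 table as above):

import org.apache.spark.sql.functions.avg

// Same aggregation as above, then repartition down to 10 before collecting.
val t2 = hiveContext.sql("FROM SalesJan2009 select * ")
  .groupBy("country", "state", "city")
  .agg(avg("price").as("aprive"))

val t3 = t2.repartition(10)
t3.rdd.partitions.size // 10, as expected after the repartition
t3.collect()           // yet some stages of this job still run 200 tasks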