dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Neal Yin
I have some trouble controlling the number of Spark tasks for a stage. This is on the latest Spark 1.3.x source code build. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc); sc.getConf.get("spark.default.parallelism") is set to 10; val t1 = hiveContext.sql("FROM SalesJan2009 select *"); val t2 =
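A minimal sketch of the kind of pipeline being described (the SalesJan2009 table is from the message; the column names and aggregation are hypothetical, chosen only to match the groupBy/agg step mentioned in the reply below):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions._

    val hiveContext = new HiveContext(sc)
    val t1 = hiveContext.sql("FROM SalesJan2009 select *")
    // groupBy/agg introduces a shuffle; the number of post-shuffle tasks for
    // SQL/DataFrame plans is not governed by spark.default.parallelism.
    val t2 = t1.groupBy("country").agg(sum("price"))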

Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal, spark.sql.shuffle.partitions is the property that controls the number of tasks after a shuffle (to generate t2, there is a shuffle for the aggregations specified by groupBy and agg). You can use sqlContext.setConf("spark.sql.shuffle.partitions", newNumber) or sqlContext.sql("SET spark.sql.shuffle.partitions=newNumber")
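A minimal sketch of both forms, assuming a SQLContext or HiveContext named sqlContext and 200 as an example value:

    // programmatic form: the value is passed as a string
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")
    // equivalent SQL SET command
    sqlContext.sql("SET spark.sql.shuffle.partitions=200")

Either way, subsequent stages produced by a shuffle (groupBy/agg, joins, etc.) will run with that many tasks.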