Hi all,

How can I set MEMORY_ONLY_SER, spark.kryoserializer.buffer.mb, spark.default.parallelism and spark.worker.timeout when I run a Shark query? Can I set these and other parameters in spark-env.sh or hive-site.xml instead, or with set name=value in the Shark CLI?
I have a Shark query test. Table a is 38 bytes and table b is 23 bytes. The SQL is:

select a.*, b.* from a join b on a.id = b.id;

It builds three stages:

stage 1 has two tasks:
  task 1: rdd.HadoopRDD: input split table a 0+19
  task 2: rdd.HadoopRDD: input split table a 19+19
stage 2 has two tasks:
  task 1: rdd.HadoopRDD: input split table b 0+11
  task 2: rdd.HadoopRDD: input split table b 11+12
stage 3 has one task:
  task 1: just fetches the map outputs for the shuffle and writes to an HDFS path

Why are two tasks built to read each table when the tables are so small? How can I control the number of reduce tasks in Shark? It seems to be computed from the parent RDD with the most partitions?

Thanks!
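
For reference, the split offsets in the log (0+19 and 19+19 for the 38-byte table, 0+11 and 11+12 for the 23-byte table) are consistent with the split arithmetic of Hadoop's old-API FileInputFormat when two splits are requested per input. The following is a minimal Python sketch of that arithmetic, not the actual Hadoop or Shark code. The requested split count of 2, the minimum split size of 1 byte and the 64 MB block size are assumptions, and compute_splits is a hypothetical helper name.

```python
# Sketch of old-API FileInputFormat-style split arithmetic (assumed, simplified).
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_splits(total_size, num_splits=2, min_size=1,
                   block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs for a single file of total_size bytes.

    num_splits, min_size and block_size are assumed values for illustration.
    """
    goal_size = total_size // max(num_splits, 1)
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    remaining = total_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((total_size - remaining, split_size))
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining != 0:
        splits.append((total_size - remaining, remaining))
    return splits


print(compute_splits(38))  # [(0, 19), (19, 19)]
print(compute_splits(23))  # [(0, 11), (11, 12)]
```

Under these assumptions, even a tiny file yields two splits, and each split becomes one HadoopRDD task, which would match the two tasks seen in stage 1 and stage 2.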
