some problems about shark on spark

[email protected] Fri, 10 Jan 2014 06:50:58 -0800

Hi all,
How can I set the MEMORY_ONLY_SER storage level and the params
spark.kryoserializer.buffer.mb, spark.default.parallelism, and
spark.worker.timeout when I run a Shark query?
Can I set such params in spark-env.sh or hive-site.xml instead,
or with set name=value in the Shark CLI?


I have a Shark query test:
table a is 38 bytes; table b is 23 bytes.
SQL: select a.*, b.* from a join b on a.id = b.id;
It builds three stages:
stage1 has two tasks:
task1: rdd.HadoopRDD: input split table a 0+19;
task2: rdd.HadoopRDD: input split table a 19+19;
stage2 has two tasks:
task1: rdd.HadoopRDD: input split table b 0+11;
task2: rdd.HadoopRDD: input split table b 11+12;
stage3 has one task:
task1: just fetches the map outputs for the shuffle and writes to an HDFS path.

Why are two tasks built to read each table when the tables are so small?
How can I control the number of reduce tasks in Shark? It seems to be computed
from the parent RDD with the most partitions?
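
For reference, the split offsets listed above are consistent with the split arithmetic in Hadoop's old-API FileInputFormat when it is asked for at least two splits. The sketch below is a simplified model of that arithmetic, not Shark or Hadoop code. The function name compute_splits, the hint of 2 splits, and the 64 MB block size are assumptions for illustration.

```python
# Simplified model of old-API Hadoop FileInputFormat split sizing.
# Assumptions (not confirmed by this thread): a min-split hint of 2
# and a 64 MB block size; compute_splits is a hypothetical helper.

SPLIT_SLOP = 1.1  # slack factor: the last split may be up to 10% larger


def compute_splits(file_size, min_splits_hint=2,
                   block_size=64 * 1024 * 1024, min_size=1):
    """Return a list of (offset, length) pairs for a single file."""
    # Target size per split: total bytes divided by the requested split count.
    goal_size = file_size // max(min_splits_hint, 1)
    # Never exceed a block, never go below the minimum split size.
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    remaining = file_size
    # Cut full-size splits while more than 1.1 splits' worth of data remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits.append((file_size - remaining, remaining))
    return splits


if __name__ == "__main__":
    for name, size in (("a", 38), ("b", 23)):
        for offset, length in compute_splits(size):
            print(f"table {name} {offset}+{length}")
```

Under these assumptions, a hint of 2 reproduces the 0+19, 19+19 and 0+11, 11+12 splits, while a hint of 1 would give a single split per table.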
 
THX !




[email protected]
