Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
> underlying price paid. > Yong > From: Takeshi Yamamuro <linguin@gmail.com> > Sent: Thursday, August 4, 2016 8:18 AM > To: Marco Colombo > Cc: user > Subject: Re: Spark SQL and number of task

Re: Spark SQL and number of task

2016-08-04 Thread Yong Zhang
From: Takeshi Yamamuro <linguin@gmail.com> Sent: Thursday, August 4, 2016 8:18 AM To: Marco Colombo Cc: user Subject: Re: Spark SQL and number of task > Seems the performance difference comes from `CassandraSourceRelation`. I'm not familiar with the implementation, though; I guess the filter `IN` is pushed down

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Seems the performance difference comes from `CassandraSourceRelation`. I'm not familiar with the implementation, though; I guess the filter `IN` is pushed down into the datasource and the other is not. You'd be better off checking the performance metrics in the webUI. // maropu
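As a reference for the pushdown point above, a minimal sketch of how one might compare the equality and `IN` variants of the query from the original question (at the bottom of the thread). The `in (90, 91)` value list is illustrative, and whether the scan node actually reports a pushed filter depends on the spark-cassandra-connector version:

  // Compare the physical plans: a predicate the source can handle appears
  // on the scan node as a pushed filter; one it cannot handle is applied
  // as a separate Filter step after a wider scan.
  val withEq = sqlCtx.sql("select d.id, avg(d.avg) from v_points d where id = 90 group by id")
  val withIn = sqlCtx.sql("select d.id, avg(d.avg) from v_points d where id in (90, 91) group by id")
  withEq.explain(true)
  withIn.explain(true)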

Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Ok, thanks. The two plans are very similar with the IN condition.

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Hi, please type `sqlCtx.sql("select * ").explain` to show the execution plans. Also, you can kill jobs from the webUI. // maropu
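A minimal sketch of that suggestion, assuming the Spark 1.x HiveContext bound to `sqlCtx` from the original question below, with its query text filled in:

  val df = sqlCtx.sql("select d.id, avg(d.avg) from v_points d where id = 90 group by id")
  df.explain()      // print the physical plan only
  df.explain(true)  // also print the parsed, analyzed, and optimized plans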

Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Hi all, I have a question on how Hive+Spark handle data. I've started a new HiveContext and I'm extracting data from Cassandra. I've configured spark.sql.shuffle.partitions=10. Now, I have the following query: select d.id, avg(d.avg) from v_points d where id=90 group by id; I see that 10 tasks are
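For context, a minimal sketch of the setup described above, assuming Spark 1.x with the spark-cassandra-connector on the classpath and `v_points` already registered as a table over the Cassandra data (the registration step is not shown in the original message):

  import org.apache.spark.sql.hive.HiveContext

  val sqlCtx = new HiveContext(sc)                      // sc: the existing SparkContext
  sqlCtx.setConf("spark.sql.shuffle.partitions", "10")  // 10 post-shuffle tasks, as configured above
  sqlCtx.sql("select d.id, avg(d.avg) from v_points d where id=90 group by id").show()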