Hi all, I have a question about how Hive + Spark handle data. I've started a new HiveContext and I'm extracting data from Cassandra. I've set spark.sql.shuffle.partitions=10. Now, I run the following query:
select d.id, avg(d.avg) from v_points d where id=90 group by id;

I see that 10 tasks are submitted and execution is fast. Every id in that table has 2000 samples. But if I just add a second id with OR, like this:

select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;

it launches 663 tasks and the query never finishes. If I instead write the query with IN (), like:

select d.id, avg(d.avg) from v_points d where id in (90, 2) group by id;

the query is fast again.

How can I get the execution plan of the query? And also, how can I kill the long-running submitted tasks?

Thanks all!
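For reference, here is roughly the setup I described, as a minimal Scala sketch (sc is an existing SparkContext; v_points is my own table mapped from Cassandra). I believe explain(true) should print the logical and physical plans, but I'm not sure it's the recommended way, hence the question:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext from the running application
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")

// The IN () variant, which runs fast for me
val df = hc.sql(
  "select d.id, avg(d.avg) from v_points d where id in (90, 2) group by id")

// I believe this prints the parsed, analyzed, optimized and physical plans;
// EXPLAIN EXTENDED <query> via hc.sql(...) may be an alternative
df.explain(true)
```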