Ok, thanks. The two plans are very similar. With the IN condition:

+--------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==                                                                                                                              |
| TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#81])                                      |
| +- TungstenExchange hashpartitioning(id#0L,10), None                                                                                             |
|    +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#85,count#86L])                    |
|       +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [In(id,[Ljava.lang.Object;@6f8b2e9d)]  |
+--------------------------------------------------------------------------------------------------------------------------------------------------+--+
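(For reference, both plans were produced the way you suggested, by prefixing the queries from my earlier mail with EXPLAIN; the exact id literals may differ between runs:)

```sql
-- IN condition: pushed down as a single In(id, ...) filter
EXPLAIN SELECT d.id, avg(d.avg) FROM v_points d WHERE id IN (90, 2) GROUP BY id;

-- OR condition: pushed down as Or(EqualTo(id, ...), EqualTo(id, ...))
EXPLAIN SELECT d.id, avg(d.avg) FROM v_points d WHERE id = 90 OR id = 2 GROUP BY id;
```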
With the OR condition:

+--------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==                                                                                                                              |
| TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#88])                                      |
| +- TungstenExchange hashpartitioning(id#0L,10), None                                                                                             |
|    +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#92,count#93L])                    |
|       +- Filter ((id#0L = 94) || (id#0L = 2))                                                                                                    |
|          +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [Or(EqualTo(id,94),EqualTo(id,2))]  |
+--------------------------------------------------------------------------------------------------------------------------------------------------+--+

The filters are pushed down in both cases, so I cannot understand why the OR version extracts so much data. It behaves like a full table scan. Any advice? Thanks!

2016-08-04 13:25 GMT+02:00 Takeshi Yamamuro <linguin....@gmail.com>:

> Hi,
>
> Please type `sqlCtx.sql("select * .... ").explain` to show execution plans.
> Also, you can kill jobs from the webUI.
>
> // maropu
>
>
> On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo <ing.marco.colo...@gmail.com> wrote:
>
>> Hi all, I have a question on how hive+spark are handling data.
>>
>> I've started a new HiveContext and I'm extracting data from cassandra.
>> I've configured spark.sql.shuffle.partitions=10.
>> Now, I have the following query:
>>
>> select d.id, avg(d.avg) from v_points d where id=90 group by id;
>>
>> I see that 10 tasks are submitted and execution is fast. Every id in that
>> table has 2000 samples.
>>
>> But if I just add another id, as in:
>>
>> select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;
>>
>> it adds 663 tasks and the query does not end.
>>
>> If I write the query with in (), like
>>
>> select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;
>>
>> the query is again fast.
>>
>> How can I get the 'execution plan' of the query?
>>
>> And also, how can I kill the long-running submitted tasks?
>>
>> Thanks all!
>
>
> --
> ---
> Takeshi Yamamuro

--
Ing. Marco Colombo