Hi Olivier,

Here it is.

== Physical Plan ==
Aggregate false, [PartialGroup#155], [PartialGroup#155 AS is_bad#108,Coalesce(SUM(PartialCount#152L),0) AS count#109L,(CAST(SUM(PartialSum#153), DoubleType) / CAST(SUM(PartialCount#154L), DoubleType)) AS avg#110]
 Exchange (HashPartitioning [PartialGroup#155], 200), []
  Aggregate true, [meta#143[is_bad]], [meta#143[is_bad] AS PartialGroup#155,COUNT(1) AS PartialCount#152L,COUNT(nvar#145[var1]) AS PartialCount#154L,SUM(nvar#145[var1]) AS PartialSum#153]
   Project [meta#143,nvar#145]
    Filter ((date#147 >= 2014-04-01) && (date#147 <= 2014-04-30))
     PhysicalRDD [meta#143,nvar#145,date#147], MapPartitionsRDD[6] at explain at <console>:32
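
For what it's worth, here is a rough, untested sketch of a workaround I am
considering (assuming the same test table and that sqlContext is the
HiveContext from the console): pull the two map entries out into plain
columns once, cache that narrow DataFrame, and run the aggregation over it,
so the big maps only have to be deserialized a single time across repeated
queries.

// Untested sketch, not a fix for SPARK-5446: extract the two map
// entries once into plain columns and cache the narrow result.
val narrow = sqlContext.sql("""
  select meta['is_bad'] as is_bad, nvar['var1'] as var1
  from test
  where date between '2014-04-01' and '2014-04-30'
""")
narrow.cache()
narrow.registerTempTable("test_narrow")

// Same aggregation as before, but over plain columns instead of
// map lookups on every row.
sqlContext.sql("""
  select is_bad, count(*) as count, avg(var1) as avg
  from test_narrow
  group by is_bad
""").show()

Not sure this addresses the scan cost SPARK-5446 is about, but it should at
least avoid paying the map-extraction cost more than once.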


Jianshi

On Tue, May 12, 2015 at 10:34 PM, Olivier Girardot <ssab...@gmail.com>
wrote:

> Can you post the explain too?
>
> On Tue, May 12, 2015 at 12:11, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a SQL query on tables containing big Map columns (thousands of
>> keys). I found it to be very slow.
>>
>> select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as
>> avg
>> from test
>> where date between '2014-04-01' and '2014-04-30'
>> group by meta['is_bad']
>>
>> =>
>>
>> +---------+-----------+-----------------------+
>> | is_bad  |   count   |          avg          |
>> +---------+-----------+-----------------------+
>> | 0       | 17024396  | 0.16257395850742645   |
>> | 1       | 179729    | -0.37626256661125485  |
>> | 2       | 28128     | 0.11674427263203344   |
>> | 3       | 116327    | -0.6398689187187386   |
>> | 4       | 87715     | -0.5349632960030563   |
>> | 5       | 169771    | 0.40812641191854626   |
>> | 6       | 542447    | 0.5238256418341465    |
>> | 7       | 160324    | 0.29442847034840386   |
>> | 8       | 2099      | -0.9165701665162977   |
>> | 9       | 3104      | 0.3845685004598235    |
>> +---------+-----------+-----------------------+
>> 10 rows selected (130.5 seconds)
>>
>>
>> The total number of rows is less than 20M. Why so slow?
>>
>> I'm running on Spark 1.4.0-SNAPSHOT with 100 executors, each with 4 GB of
>> RAM and 2 CPU cores.
>>
>> Looks like https://issues.apache.org/jira/browse/SPARK-5446 is still
>> open; when can we have it fixed? :)
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
