Hi,

I have a SQL query over tables containing big Map columns (thousands of
keys per map), and it turns out to be very slow.

select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as avg
from test
where date between '2014-04-01' and '2014-04-30'
group by meta['is_bad']

=>

+---------+-----------+-----------------------+
| is_bad  |   count   |          avg          |
+---------+-----------+-----------------------+
| 0       | 17024396  | 0.16257395850742645   |
| 1       | 179729    | -0.37626256661125485  |
| 2       | 28128     | 0.11674427263203344   |
| 3       | 116327    | -0.6398689187187386   |
| 4       | 87715     | -0.5349632960030563   |
| 5       | 169771    | 0.40812641191854626   |
| 6       | 542447    | 0.5238256418341465    |
| 7       | 160324    | 0.29442847034840386   |
| 8       | 2099      | -0.9165701665162977   |
| 9       | 3104      | 0.3845685004598235    |
+---------+-----------+-----------------------+
10 rows selected (130.5 seconds)


The total number of rows is under 20M. Why is it so slow?
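My guess is that even though the query only touches meta['is_bad'] and
nvar['var1'], the scan still has to deserialize every key of both Maps for
every row, so the cost scales with map width rather than row count. The plan
output should show whether the full columns are being read; a quick check
(assuming EXPLAIN EXTENDED works in this build):

explain extended
select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as avg
from test
where date between '2014-04-01' and '2014-04-30'
group by meta['is_bad']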

I'm running Spark 1.4.0-SNAPSHOT with 100 executors, each with 4 GB of RAM
and 2 CPU cores.

It looks like https://issues.apache.org/jira/browse/SPARK-5446 is still
open; when can we expect a fix? :)
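
In the meantime I'm thinking of working around it by flattening the hot keys
into plain columns, roughly like this (test_flat is just an illustrative
name, and I'm assuming a HiveContext so CTAS works):

-- materialize only the keys the query needs as top-level columns
create table test_flat as
select date, meta['is_bad'] as is_bad, nvar['var1'] as var1
from test;

-- the aggregation then scans three narrow columns instead of two full Maps
select is_bad, count(*) as count, avg(var1) as avg
from test_flat
where date between '2014-04-01' and '2014-04-30'
group by is_bad;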

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
