Regarding the design decisions I encourage you to look at these:
https://blog.cloudera.com/blog/2015/11/new-in-cloudera-
enterprise-5-5-support-for-complex-types-in-impala/
https://www.slideshare.net/cloudera/nested-types-in-impala-55344174

Regarding making your query faster:
1. Your query with count(1) does not scan any data, so using it as a
comparison point is not realistic. A query with count(some_column) is a
better comparison point, but still not quite fair.
2. To speed up your query you might try the MT_DOP query option. The MT_DOP
query option controls the intra-node parallelism, i.e., how many fragment
instances per node will be launched.

In the future, this type of query might benefit from data elimination
techniques like Parquet min/max filtering (assuming your predicate is
selective), but we don't fully support min/max on nested types yet. If you
are able to restructure your data such that columns with selective
predicates are at the top-level, then you could benefit from min/max data
elimination even today (with the latest Impala version).

Alex

On Thu, Nov 2, 2017 at 12:58 AM, yu feng <olaptes...@gmail.com> wrote:

> Hi, I find In impala, If I have a table T and a column attributes which
> defained as Map<string, string>, I have to write SQL like this :
>
>
> select attributes.value, T.name from  T join T.attributes where attributes
> .key = 'play'.
>
> However, We query this table in hive like this :
>
> select T.attributes['play'], T.name from T
>
> And I find the execution of impala is Slow caused of the JOIN.
> If I execute :
>
> select count(1) from T;
>
> It takes 0.5 seconds and If I add join attributes, It will take 8s +.
>
> the summary like this :
>
> Operator                    #Hosts   Avg Time   Max Time    #Rows
> Est. #Rows   Peak Mem  Est. Peak Mem  Detail
> ------------------------------------------------------------
> -------------------------------------------------------------------
> 07:AGGREGATE                     1   85.020ms   85.020ms        1
>      1  112.00 KB        -1.00 B  FINALIZE
> 06:EXCHANGE                      1  404.928us  404.928us       25
>      1          0        -1.00 B  UNPARTITIONED
> 05:AGGREGATE                    25  104.354ms  141.573ms       25
>      1    8.35 MB       10.00 MB
> 01:SUBPLAN                      25    1s218ms    1s813ms  201.54M
> 668.96B    8.25 MB              0
> |--04:NESTED LOOP JOIN          25    3s250ms    4s852ms        0
>     10   24.00 KB        16.00 B  CROSS JOIN
> |  |--02:SINGULAR ROW SRC       25    0.000ns    0.000ns        0
>      1          0              0
> |  03:UNNEST                    25  688.110ms    1s033ms        0
>     10          0              0  uda_events.attributes
> 00:SCAN HDFS                    25  112.396ms  206.405ms  201.61M
> 66.90B  107.20 MB       48.00 MB  uda.uda_events
>
>
> Operator       #Hosts   Avg Time   Max Time    #Rows  Est. #Rows
> Peak Mem  Est. Peak Mem  Detail
> ------------------------------------------------------------
> -----------------------------------------------
> 03:AGGREGATE        1   87.908ms   87.908ms        1           1
> 112.00 KB        -1.00 B  FINALIZE
> 02:EXCHANGE         1  321.296us  321.296us       25           1
>    0        -1.00 B  UNPARTITIONED
> 01:AGGREGATE       25   74.613ms   81.567ms       25           1
> 12.00 KB       10.00 MB
> 00:SCAN HDFS       25   91.512ms  135.441ms  208.47M      66.90B
> 3.09 MB              0  uda.uda_events
>
>
> I have two question :
> 1、 Why Impala implements map column like this( via join) ?
> 2、 How to speed up the speed the query like this?
>
> Thanks very much.
>

Reply via email to