Regarding the design decisions I encourage you to look at these: https://blog.cloudera.com/blog/2015/11/new-in-cloudera- enterprise-5-5-support-for-complex-types-in-impala/ https://www.slideshare.net/cloudera/nested-types-in-impala-55344174
Regarding making your query faster: 1. Your query with count(1) does not scan any data, so using it as a comparison point is not realistic. A query with count(some_column) is a better comparison point, but still not quite fair. 2. To speed up your query you might try the MT_DOP query option. The MT_DOP query option controls the intra-node parallelism, i.e., how many fragment instances per node will be launched. In the future, this type of query might benefit from data elimination techniques like Parquet min/max filtering (assuming your predicate is selective), but we don't fully support min/max on nested types yet. If you are able to restructure your data such that columns with selective predicates are at the top-level, then you could benefit from min/max data elimination even today (with the latest Impala version). Alex On Thu, Nov 2, 2017 at 12:58 AM, yu feng <olaptes...@gmail.com> wrote: > Hi, I find In impala, If I have a table T and a column attributes which > defained as Map<string, string>, I have to write SQL like this : > > > select attributes.value, T.name from T join T.attributes where attributes > .key = 'play'. > > However, We query this table in hive like this : > > select T.attributes['play'], T.name from T > > And I find the execution of impala is Slow caused of the JOIN. > If I execute : > > select count(1) from T; > > It takes 0.5 seconds and If I add join attributes, It will take 8s +. > > the summary like this : > > Operator #Hosts Avg Time Max Time #Rows > Est. #Rows Peak Mem Est. Peak Mem Detail > ------------------------------------------------------------ > ------------------------------------------------------------------- > 07:AGGREGATE 1 85.020ms 85.020ms 1 > 1 112.00 KB -1.00 B FINALIZE > 06:EXCHANGE 1 404.928us 404.928us 25 > 1 0 -1.00 B UNPARTITIONED > 05:AGGREGATE 25 104.354ms 141.573ms 25 > 1 8.35 MB 10.00 MB > 01:SUBPLAN 25 1s218ms 1s813ms 201.54M > 668.96B 8.25 MB 0 > |--04:NESTED LOOP JOIN 25 3s250ms 4s852ms 0 > 10 24.00 KB 16.00 B CROSS JOIN > | |--02:SINGULAR ROW SRC 25 0.000ns 0.000ns 0 > 1 0 0 > | 03:UNNEST 25 688.110ms 1s033ms 0 > 10 0 0 uda_events.attributes > 00:SCAN HDFS 25 112.396ms 206.405ms 201.61M > 66.90B 107.20 MB 48.00 MB uda.uda_events > > > Operator #Hosts Avg Time Max Time #Rows Est. #Rows > Peak Mem Est. Peak Mem Detail > ------------------------------------------------------------ > ----------------------------------------------- > 03:AGGREGATE 1 87.908ms 87.908ms 1 1 > 112.00 KB -1.00 B FINALIZE > 02:EXCHANGE 1 321.296us 321.296us 25 1 > 0 -1.00 B UNPARTITIONED > 01:AGGREGATE 25 74.613ms 81.567ms 25 1 > 12.00 KB 10.00 MB > 00:SCAN HDFS 25 91.512ms 135.441ms 208.47M 66.90B > 3.09 MB 0 uda.uda_events > > > I have two question : > 1、 Why Impala implements map column like this( via join) ? > 2、 How to speed up the speed the query like this? > > Thanks very much. >