Hello,

We leverage the new filtering on Parquet stats for a few selective queries (see also http://blog.cloudera.com/blog/2017/12/faster-performance-for-selective-queries/). We run Impala 2.9 on Cloudera 5.12.0.
However, we notice that when inserting data into a non-partitioned table, the data is not optimally sorted. If we check the query plan (for a simple insert with only a filter on the source table), there is no exchange of data between Impala daemons. So whatever a specific daemon reads gets sorted 'locally' and written out. That means that for selective queries we typically still scan as many Parquet files as there are daemons in the cluster. Is this expected? Some kind of distributed merge sort could limit the IO needed.

For partitioned tables, the insert already typically happens with an implicit SHUFFLE. In that case all the data to be sorted goes to one daemon and sorting happens as expected. With a NOSHUFFLE hint, the sorting again only happens within the scope of individual daemons (which is kind of what is requested by the hint, but still somewhat conflicting with the table definition). I have not yet tried the behaviour with the CLUSTERED hint (my understanding is that it would be similar to SHUFFLE). With SHUFFLE, however, the insert takes a lot longer, because a single daemon is sorting all the data. A more optimal distributed merge sort could maybe reduce this time. (A minimal sketch of the DDL and insert statements I am referring to is included at the bottom of this mail.)

I had a quick look at JIRA, but didn't find much about this. In IMPALA-4969, sorting during insert into a partitioned table is discussed, but that was before the filtering on Parquet stats was implemented. Hoping to get more insights on what thoughts were given to this and the possible trade-offs involved.

Thanks,
Bruno
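P.S. For concreteness, here is a minimal sketch of the kind of statements behind the observations above. The table names (events_sorted, events_raw, events_part), columns and filter values are made-up placeholders, not our actual schema, and the hint placement follows the Impala 2.9 documentation as I understand it:

-- Non-partitioned table with a SORT BY clause (new in Impala 2.9), stored as Parquet.
CREATE TABLE events_sorted (
  event_id BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
)
SORT BY (event_id)
STORED AS PARQUET;

-- Insert with only a filter on the source table: the plan shows no exchange,
-- so each daemon sorts and writes just the rows it scanned itself.
INSERT INTO events_sorted
SELECT event_id, event_ts, payload
FROM events_raw
WHERE event_ts >= '2017-01-01';

-- Selective query that we hope can skip most row groups via the Parquet min/max stats.
SELECT payload FROM events_sorted WHERE event_id = 12345;

-- Partitioned variant: the implicit shuffle sends the data to be sorted to one daemon;
-- with the NOSHUFFLE hint the sort again stays local to each daemon.
CREATE TABLE events_part (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
SORT BY (event_id)
STORED AS PARQUET;

INSERT INTO events_part PARTITION (event_date) /* +NOSHUFFLE */
SELECT event_id, payload, to_date(event_ts) AS event_date
FROM events_raw;

Running EXPLAIN on the first INSERT is how we observed that there is no exchange node in the non-partitioned case.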
