Hello

We leverage the new filtering on Parquet stats for a few selective queries
(refer also to
http://blog.cloudera.com/blog/2017/12/faster-performance-for-selective-queries/).
We run Impala 2.9 on Cloudera 5.12.0.

However, we notice that when inserting data into a non-partitioned table, the
data is not optimally sorted. If we check the query plan (for a simple insert
with only a filter on the source table), there is no exchange of data between
Impala daemons. So whatever a specific daemon reads gets sorted 'locally' and
written out. That means that for selective queries we typically still scan as
many Parquet files as there are daemons in the cluster.
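
To make this concrete, below is a minimal sketch of the kind of statements I
mean (table, column and predicate names are made up, not our real schema):

    CREATE TABLE events_sorted (id BIGINT, ts TIMESTAMP, payload STRING)
    SORT BY (id)
    STORED AS PARQUET;

    -- simple insert with only a filter on the source table; the plan shows no
    -- exchange, so each daemon only sorts its own scan output before writing
    INSERT INTO events_sorted
    SELECT id, ts, payload
    FROM events_raw
    WHERE ts >= '2017-01-01';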

Is this expected?
Some kind of distributed merge sort could limit the I/O needed.

For partitioned tables, the insert already typically happens with an implicit
SHUFFLE. In that case all data to be sorted goes to one daemon and sorting
happens as expected. With a NOSHUFFLE hint, the sorting again only happens
within the scope of individual daemons (which is more or less what the hint
requests, but still somewhat conflicting with the table definition). I have not
yet tried the behaviour with the CLUSTERED hint (my understanding is that it
would be similar to SHUFFLE).
With SHUFFLE, however, the insert takes a lot longer because a single daemon
sorts all the data. A more optimal distributed merge sort could perhaps reduce
this time.
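
For reference, the hints I tried look roughly like the following (again with
made-up names, and assuming events_part is partitioned by day); I expect the
CLUSTERED variant would go in the same position, but I have not run it yet:

    -- shuffle: rows are repartitioned on the partition key expression, so the
    -- sort for a given partition happens on a single daemon
    INSERT INTO events_part PARTITION (day) /* +SHUFFLE */
    SELECT id, ts, payload, to_date(ts) AS day
    FROM events_raw;

    -- noshuffle: each daemon sorts and writes just the rows it scanned itself
    INSERT INTO events_part PARTITION (day) /* +NOSHUFFLE */
    SELECT id, ts, payload, to_date(ts) AS day
    FROM events_raw;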

I had a quick look at JIRA, but didn't find much about this. IMPALA-4969
discusses sorting during inserts into partitioned tables, but that was before
the filtering on Parquet stats was implemented.

I'm hoping to get more insight into the thinking that went into this and the
trade-offs involved.

Thanks
Bruno
