[ https://issues.apache.org/jira/browse/IMPALA-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
karthik updated IMPALA-10431: ----------------------------- Comment: was deleted (was: [~karthikkpm] test) > Consider partitioning instead of sorting before inserts > ------------------------------------------------------- > > Key: IMPALA-10431 > URL: https://issues.apache.org/jira/browse/IMPALA-10431 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend > Reporter: Csaba Ringhofer > Priority: Minor > > Currently before partitioned inserts Impala can add a sort node to sort by > the partitioning columns - the benefit of this is that the insert will only > have to write one file at a time, which reduces memory requirements if the > number of partitions is large. The drawback is the extra O(n*log n) CPU cost > and the potential spilling. > As the order we write the partitions in doesn't matter, partitioning the rows > in a spilling hash table would give the same benefits but reduce the > O(n*log) n CPU cost to O(n). It may be even possible to avoid spilling in > some cases - e.g. if only a single partition is written, and the partitioner > passes through the first partition it sees to the file sink. > The drawback is that unlike sort, we don't have a partitioner node in Impala > yet, but the algorithm is very similar to what's happening in grouping > aggregation or hash join builders, so probably this could be done with medium > amount of code. > Note that we can also sort by other columns than partitioning columns during > insert to make min-max stat filtering efficient - in this case we should > probably stick to the sorting before insert. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org