[ https://issues.apache.org/jira/browse/IMPALA-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048025#comment-17048025 ]
ASF subversion and git services commented on IMPALA-8755:
---------------------------------------------------------

Commit 60a9e72faf7b8c3e034f4319c0761ed389994707 in impala's branch refs/heads/master from norbert.luksa
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=60a9e72 ]

IMPALA-8755: Backend support for Z-ordering

This change depends on gerrit.cloudera.org/#/c/13955/
(Frontend support for Z-ordering)

The commit adds a Comparator based on Z-ordering.
See in detail: https://en.wikipedia.org/wiki/Z-order_curve

Instead of calculating the Z-values of the rows, the comparator looks
for the column with the most significant dimension and compares only
the values of that column. The most significant dimension is the one
in which the compared values differ at the highest bit position.

The algorithm requires values of the same binary representation,
therefore the values are converted into either uint32_t, uint64_t or
uint128_t, whichever is the smallest in which all data fits. Comparing
smaller types with bigger ones would make the bigger type much more
dominant, therefore the bits of the smaller types are shifted up.

All primitive types (including string and floating point types) are
supported.

Testing:
* Added unit tests.
* Ran manual tests, comparing 4-column values with 4-bit integers,
  for all possible combinations. Checked the result by calculating
  the Z-value for each comparison.
* Tested performance on various data, getting great results for
  selective queries. An example: used the TPCH dataset's lineitem
  table with scale 25, where the sorting columns are l_partkey and
  l_suppkey, in that order. Ran selective queries over the value range
  of the two columns, for both lexical and Z-ordering, and compared
  the percentage of filtered pages and row groups. While queries with
  filters on the first column showed almost no difference, queries
  on the second column are in favour of Z-ordering:

  Ordering | Column | Filtered pages % | Filtered row groups %
  Lex.     | 1st    | ~99%             | ~90%
  Z-ord.   | 1st    | ~99%             | ~89%
  Lex.     | 2nd    | ~25%             | 0%
  Z-ord.   | 2nd    | ~97%             | 0%

The only drawback is the sorting itself, which takes ~4 times longer
than lexical sorting (e.g. sorting the dataset above took 14m for
lexical and 55m for Z-ordering). Note, however, that this is a one-time
cost: sorting only happens once, when writing the data. Also, lexical
ordering is supported by codegen, while codegen is not implemented for
Z-ordering yet.

Change-Id: I0200748ce3e65ebc5d3530f794c0f80aa335a2ab
Reviewed-on: http://gerrit.cloudera.org:8080/14080
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Implement Z-ordering for Impala
> -------------------------------
>
>                 Key: IMPALA-8755
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8755
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Norbert Luksa
>            Priority: Major
>
> Implement Z-ordering for Impala: [https://en.wikipedia.org/wiki/Z-order_curve]
> A Z-order curve defines an ordering on multi-dimensional data. Data sorted
> that way can be efficiently filtered by min/max statistics on the
> columns participating in the ordering.
> Impala currently only supports lexicographic ordering via the SORT BY clause.
> This strongly prefers the first column, i.e. given the "SORT BY A, B, C"
> clause => A will be totally ordered (hence filtering on A will be very
> efficient), but values belonging to B and C will be scattered throughout the
> data set (hence filtering on B or C will barely do any good).
> We could add a new clause, e.g. a "ZSORT BY" clause to Impala that writes the
> data in Z-order.
> "ZSORT BY A, B, C" would cluster the rows in a way that filtering on A, B, or
> C would be equally efficient.
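
The comparison described in the commit message (pick the column whose
values differ at the highest bit, then compare only that column) can be
sketched in a few lines. This is an illustration in Python, not the
Impala backend code: all function names below are hypothetical, values
are assumed to be already converted to unsigned integers of one common
width, and ties between columns are given to the earlier column, which
is an assumption the commit message does not confirm. The last function
repeats, on a smaller scale, the manual check mentioned under Testing by
comparing against explicitly interleaved Z-values.

```python
from functools import cmp_to_key
from itertools import product


def less_msb(x, y):
    """True if the highest set bit of x is lower than the highest set bit of y."""
    return x < y and x < (x ^ y)


def z_less(a, b):
    """Z-order 'less than' for two equal-length tuples of unsigned ints.

    Finds the most significant dimension (the column whose values differ
    at the highest bit) and compares only that column. On ties the
    earlier column wins (assumption).
    """
    msd = 0
    for dim in range(1, len(a)):
        if less_msb(a[msd] ^ b[msd], a[dim] ^ b[dim]):
            msd = dim
    return a[msd] < b[msd]


def shift_up(value, src_bits, dst_bits):
    """Move a narrower value into the top bits of the common width, so a
    wider column does not dominate the comparison."""
    return value << (dst_bits - src_bits)


def z_value(point, bits):
    """Explicit Z-value: interleave the bits of all columns, highest bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for v in point:
            z = (z << 1) | ((v >> bit) & 1)
    return z


def z_sorted(points):
    """Sort points in Z-order using z_less as the comparator."""
    def cmp(a, b):
        if z_less(a, b):
            return -1
        return 1 if z_less(b, a) else 0
    return sorted(points, key=cmp_to_key(cmp))


def agrees_with_z_value(dims, bits):
    """Exhaustively check z_less against z_value for small inputs."""
    points = list(product(range(1 << bits), repeat=dims))
    return all(
        z_less(a, b) == (z_value(a, bits) < z_value(b, bits))
        for a in points
        for b in points
    )
```

Because z_less never builds the interleaved value, the comparison cost
does not depend on materializing a key wider than the widest column;
z_value is included only as a reference for checking.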