[ https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974290#comment-16974290 ]
wangningito edited comment on KUDU-1644 at 12/5/19 6:54 AM:
------------------------------------------------------------

I have submitted an implementation for token-based scans, for the case where there is only one hash partition and it contains only one key.

[https://gerrit.cloudera.org/c/14706/|https://gerrit.cloudera.org/c/14706/]

The implementation lives in the client module. It filters the IN-list values to be pushed down while the scan tokens are being built, with only slight modifications to the current code and only a slight impact on performance.

The existing pruneHashComponent method already calculates the hash buckets of all the rows. I simply collect those bucket ids and replace the IN-list predicate values with the filtered values, so the change has almost no performance impact on other cases.

I placed the filtering in the client rather than in the tablet because the performance improvement comes from two sources: fewer values are transported over the network, and the subsequent binary search gets cheaper, since its cost grows logarithmically with the number of values.

Here are some performance benchmarks for this implementation.

Hardware:
Client: 4 cores, 8 GB memory
Server: 4 cores, 8 GB memory

IN-list size: 100000; all queries are served from cache.
The table scanned by the IN-list query contains 10M rows and 30 dense columns, with cells randomly of type BIGINT or STRING. The table has 24 partitions.

Before tuning:
!image-2019-12-05-14-54-03-485.png!

After tuning:
!image-2019-12-05-14-53-57-741.png!

> Simplify IN-list predicate values based on tablet partition key or rowset PK bounds
> -----------------------------------------------------------------------------------
>
>                 Key: KUDU-1644
>                 URL: https://issues.apache.org/jira/browse/KUDU-1644
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: perf, tablet
>            Reporter: Dan Burkert
>            Priority: Major
>         Attachments: image-2019-12-05-14-52-05-846.png, image-2019-12-05-14-52-18-487.png, image-2019-12-05-14-53-51-175.png, image-2019-12-05-14-53-57-741.png, image-2019-12-05-14-54-03-485.png
>
> When new scans are optimized by the tablet, the tablet's partition key bounds aren't taken into account in order to remove predicates from the scan. One of the most important such optimizations is that IN-list predicates could remove values based on the tablet's constraints.
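
The client-side filtering described in the comment above can be illustrated with a minimal sketch. It is not the code from the Gerrit change: `hash_bucket` is a hypothetical stand-in based on CRC32 rather than Kudu's actual partition hash, and the function names and the 24-bucket layout are assumptions chosen to mirror the benchmark setup.

```python
import zlib
from collections import defaultdict


def hash_bucket(value, num_buckets):
    """Hypothetical stand-in for the partition hash (CRC32, NOT Kudu's real hash)."""
    data = repr(value).encode("utf-8")
    return zlib.crc32(data) % num_buckets


def split_in_list_by_bucket(values, num_buckets):
    """Group the IN-list values by the hash bucket each one falls into."""
    by_bucket = defaultdict(list)
    for v in values:
        by_bucket[hash_bucket(v, num_buckets)].append(v)
    return by_bucket


def build_token_in_lists(values, num_buckets):
    """Return, per bucket, the sorted subset of the IN-list that can match it.

    Buckets that receive no value are absent from the result, which
    corresponds to pruning that partition from the scan entirely.
    """
    by_bucket = split_in_list_by_bucket(values, num_buckets)
    return {bucket: sorted(vals) for bucket, vals in by_bucket.items()}


# Assumed setup: 100000 IN-list values over 24 hash buckets.
in_list = list(range(100_000))
tokens = build_token_in_lists(in_list, num_buckets=24)
```

Under these assumptions each token carries only the values that hash to its own bucket instead of the full list, so less data is sent per token and each tablet searches a shorter sorted list.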