[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes
[ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207989#comment-17207989 ]

Rajesh Balamohan commented on HIVE-24205:
-----------------------------------------

Thanks [~mustafaiman]. With repeated runs (i.e., without any data miss), I see around 9-10% improvement with the PR. This is based on a small 5 node LLAP cluster with TPCH12 (43.82 seconds vs 39.01 seconds).

I also tried "select count(*) from lineitem where l_shipmode in ('REG AIR', 'MAIL');", which showed a much larger improvement (10.94 seconds without the PR vs 8.49 seconds with it).

> Optimise CuckooSetBytes
> -----------------------
>
>                 Key: HIVE-24205
>                 URL: https://issues.apache.org/jira/browse/HIVE-24205
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa Iman
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, vectorized.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{FilterStringColumnInList, StringColumnInList}} etc use CuckooSetBytes for lookup.
> !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!
> One option to optimize would be to add boundary conditions on "length" with the min/max length stored in the hashes (ref: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]). This would significantly reduce the number of hash computations that need to happen. E.g. [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes
[ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206560#comment-17206560 ]

Mustafa Iman commented on HIVE-24205:
-------------------------------------

[~hashutosh] [~rajesh.balamohan] can you take a look?
[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes
[ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206559#comment-17206559 ]

Mustafa Iman commented on HIVE-24205:
-------------------------------------

I added a simple max/min length check in CuckooSetBytes#lookup. The attached file shows some benchmark results.

*TPCH_Q12* is a select with an IN clause and a join afterwards. Selectivity of the filter is 30%.

*Synthetic* query is a simple select with an IN clause. The IN list is over two of the longest comment fields (both 72 characters wide), so the filter is highly selective, at about 2%:

    select o_orderkey, o_comment from orders
    where o_comment in (
      'jole quickly furiously bold escapades: regular accounts play regular req',
      's foxes. regular warhorses detect fluffily. carefully regular tithes amo',
      'grate ironic, pending sauternes. deposits do are slyly. carefully ironic')

*Synthetic Wide* query is the same as Synthetic, except that the IN list is over one shortest-length and one longest-length comment. The filter is still highly selective at 4%, but our optimization cannot eliminate any tuples, because every comment length falls between the minimum and maximum lengths in the set:

    select o_orderkey, o_comment from orders
    where o_comment in (
      'jole quickly furiously bold escapades: regular accounts play regular req',
      'ts nag furiously. even');

The patch outperforms the original code by 50% on the Synthetic query. For TPCH Q12, there is no meaningful difference between the two runs.

My conclusion is that the optimization has very low overhead and gives a significant performance improvement in certain cases.

I also implemented a vectorized version of the early return from the cuckoo set. It is attached as vectorized.patch. However, in all cases the simpler patch outperforms the vectorized one.
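For readers following along, here is a minimal sketch of the min/max length early return described in this comment. It is not the actual Hive patch: the class name LengthBoundedByteSet, its methods, and the hashedLookups counter are hypothetical, and a java.util.HashSet of ByteBuffer stands in for the cuckoo hash table so that the example stays self-contained.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only: a byte-string set that remembers the shortest and
 * longest member length and rejects out-of-range probes before hashing.
 * A plain HashSet replaces the cuckoo table used by CuckooSetBytes.
 */
public class LengthBoundedByteSet {
    private final Set<ByteBuffer> members = new HashSet<>();
    private int minLen = Integer.MAX_VALUE; // shortest member seen so far
    private int maxLen = 0;                 // longest member seen so far
    private long hashedLookups = 0;         // probes that reached the hash step

    /** Adds a value and widens the [minLen, maxLen] bounds if needed. */
    public void insert(byte[] value) {
        members.add(ByteBuffer.wrap(value.clone()));
        minLen = Math.min(minLen, value.length);
        maxLen = Math.max(maxLen, value.length);
    }

    /** Tests membership of bytes[start, start + len). */
    public boolean lookup(byte[] bytes, int start, int len) {
        // Early return: no member has this length, so skip hashing entirely.
        if (len < minLen || len > maxLen) {
            return false;
        }
        hashedLookups++;
        // ByteBuffer hashes and compares only the wrapped slice.
        return members.contains(ByteBuffer.wrap(bytes, start, len));
    }

    /** Number of lookups that were not rejected by the length check. */
    public long hashedLookups() {
        return hashedLookups;
    }
}
```

With 'REG AIR' and 'MAIL' inserted, the bounds are 4 and 7 bytes, so a 3-byte probe such as 'AIR' is rejected without hashing, while a 5-byte probe such as 'TRUCK' still pays for the hash. This mirrors the Synthetic Wide case above: when the bounds span every length in the column, the check filters nothing and only adds two integer comparisons per lookup.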