[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes

2020-10-05 Thread Rajesh Balamohan (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207989#comment-17207989
 ] 

Rajesh Balamohan commented on HIVE-24205:
-

Thanks [~mustafaiman]. With repeated runs (i.e., without any data misses), I 
see around 9-10% improvement with the PR. This is based on a small 5-node LLAP 
cluster running TPCH Q12 (43.82 seconds vs 39.01 seconds). 

I also tried "select count(*) from lineitem where l_shipmode in ('REG AIR', 
'MAIL');", which showed a much larger improvement (10.94 seconds without the 
PR vs 8.49 seconds with it).
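
For reference, the timings quoted above can be restated as percentage reductions in run time. This is only a minimal arithmetic sketch: the class and method names are hypothetical, and it assumes the first number in each pair is the run without the PR and the second is the run with it.

```java
/** Hypothetical helper, only to restate the quoted timings as percentages. */
final class SpeedupCheck {
    /** Percentage reduction in run time going from beforeSeconds to afterSeconds. */
    static double percentReduction(double beforeSeconds, double afterSeconds) {
        return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
    }

    public static void main(String[] args) {
        // Timings copied from the comment above (assumed order: without PR, with PR).
        System.out.printf("TPCH Q12:          %.1f%%%n", percentReduction(43.82, 39.01));
        System.out.printf("IN-list count(*):  %.1f%%%n", percentReduction(10.94, 8.49));
    }
}
```

Under that assumption the two reductions come out at roughly 11% and 22% respectively.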

> Optimise CuckooSetBytes
> ---
>
> Key: HIVE-24205
> URL: https://issues.apache.org/jira/browse/HIVE-24205
> Project: Hive
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Mustafa Iman
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, 
> vectorized.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{FilterStringColumnInList, StringColumnInList}} etc. use CuckooSetBytes for 
> lookup.
> !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!
> One option to optimize would be to add boundary conditions on "length" with 
> the min/max length stored in the hashes (ref: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]).
> This would significantly reduce the number of hash computations that need 
> to happen. E.g. 
> [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes

2020-10-02 Thread Mustafa Iman (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206560#comment-17206560
 ] 

Mustafa Iman commented on HIVE-24205:
-

[~hashutosh] [~rajesh.balamohan] can you take a look?


[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes

2020-10-02 Thread Mustafa Iman (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206559#comment-17206559
 ] 

Mustafa Iman commented on HIVE-24205:
-

I added a simple max/min length check in CuckooSetBytes#lookup. The attached 
file shows some benchmark results.
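
The idea of a min/max length check can be sketched roughly as below. This is not the code from the PR and not Hive's CuckooSetBytes: LengthBoundedByteSet is a hypothetical stand-in backed by a plain HashSet, and the hashLookups counter exists only to show how many probes reach the hashing step. The candidate values are the TPC-H ship modes, used here purely as an illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a byte-string set; NOT Hive's CuckooSetBytes. */
final class LengthBoundedByteSet {
    private final Set<ByteBuffer> members = new HashSet<>();
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = 0;
    long hashLookups = 0; // probes that got past the length check

    void insert(byte[] value) {
        members.add(ByteBuffer.wrap(value.clone()));
        // Track the shortest and longest member seen so far.
        minLen = Math.min(minLen, value.length);
        maxLen = Math.max(maxLen, value.length);
    }

    boolean lookup(byte[] buf, int start, int len) {
        // Early return: a value outside [minLen, maxLen] cannot be a member,
        // so no hash needs to be computed for it.
        if (len < minLen || len > maxLen) {
            return false;
        }
        hashLookups++;
        return members.contains(ByteBuffer.wrap(buf, start, len));
    }

    public static void main(String[] args) {
        LengthBoundedByteSet set = new LengthBoundedByteSet();
        set.insert("REG AIR".getBytes(StandardCharsets.UTF_8));
        set.insert("MAIL".getBytes(StandardCharsets.UTF_8));
        for (String s : new String[] {"REG AIR", "AIR", "RAIL", "SHIP", "TRUCK", "MAIL", "FOB"}) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(s + " -> " + set.lookup(b, 0, b.length));
        }
        System.out.println("probes that reached hashing: " + set.hashLookups);
    }
}
```

In this sketch the check costs two integer comparisons per probe, and it only helps when probe lengths actually fall outside the range of the IN-list values.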

 

*TPCH_Q12* is a select with an IN clause followed by a join. The selectivity 
of the filter is 30%.

*Synthetic* query is a simple select with an IN clause. The IN list is over 
two of the longest comment fields (both 72 characters wide), so selectivity 
is very high, at about 2%:

select o_orderkey, o_comment from orders where o_comment in (
'jole quickly furiously bold escapades: regular accounts play regular req',
's foxes. regular warhorses detect fluffily. carefully regular tithes amo',
'grate ironic, pending sauternes. deposits do are slyly. carefully ironic');

*Synthetic Wide* query is the same as Synthetic, except that the IN clause is 
over one shortest-length and one longest-length comment. Selectivity is still 
high at 4%, but our optimization cannot eliminate any tuples, since every 
comment length falls between the minimum and maximum length in the set.

select o_orderkey, o_comment from orders where o_comment in (
'jole quickly furiously bold escapades: regular accounts play regular req',
'ts nag furiously. even');
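
To see why the two synthetic queries behave differently, it may help to compute the length bounds that a min/max check would derive from each IN list. The sketch below is illustrative only: InListLengthBounds and lengthBounds are hypothetical names, lengths are measured in UTF-8 bytes, and the literals are copied from the queries in this comment (with the wrapped word "carefully" rejoined).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: length bounds a min/max check would derive from an IN list. */
final class InListLengthBounds {
    /** Returns {minLength, maxLength} in UTF-8 bytes over the given literals. */
    static int[] lengthBounds(String... literals) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String s : literals) {
            int len = s.getBytes(StandardCharsets.UTF_8).length;
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] synthetic = lengthBounds(
            "jole quickly furiously bold escapades: regular accounts play regular req",
            "s foxes. regular warhorses detect fluffily. carefully regular tithes amo",
            "grate ironic, pending sauternes. deposits do are slyly. carefully ironic");
        int[] wide = lengthBounds(
            "jole quickly furiously bold escapades: regular accounts play regular req",
            "ts nag furiously. even");
        System.out.println("Synthetic:      [" + synthetic[0] + ", " + synthetic[1] + "]");
        System.out.println("Synthetic Wide: [" + wide[0] + ", " + wide[1] + "]");
    }
}
```

For Synthetic the bounds collapse to a single length, so any comment of a different length can be rejected before hashing. For Synthetic Wide the bounds span from the short literal to the long one, which leaves little or nothing for the length check to reject.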

 

The patch outperforms the original code by 50% on the synthetic query. For 
TPCH Q12, there is no meaningful difference between the two runs. My 
conclusion is that the optimization has very low overhead and gives a 
significant perf improvement in certain cases.

I also implemented a vectorized version of the early return from the cuckoo 
set. It is attached as vectorized.patch. However, in all cases the simpler 
patch outperforms the vectorized one.

