Mostafa Mokhtar created HIVE-7664:
-------------------------------------
Summary: VectorizedBatchUtil.addRowToBatchFrom is not optimized
for Vectorized execution and takes 25% CPU
Key: HIVE-7664
URL: https://issues.apache.org/jira/browse/HIVE-7664
Project: Hive
Issue Type: Bug
Affects Versions: 0.13.1
Reporter: Mostafa Mokhtar
Fix For: 0.14.0
In a GROUP BY-heavy vectorized Reducer vertex, about 25% of CPU time is spent in
VectorizedBatchUtil.addRowToBatchFrom().
I looked at the code of VectorizedBatchUtil.addRowToBatchFrom, and it does not
appear to be optimized for vectorized processing.
addRowToBatchFrom is called for every row, and for each row it calls
getPrimitiveCategory on every column in the batch to determine the column's
type. Column types are stored in a HashMap. For VectorGroupByOperator, column
types do not change between batches, so they should not be looked up for every
row.
I recommend storing the column type in StructObjectInspector so that other
components can leverage this optimization.
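A minimal sketch of the caching idea, not the actual Hive change: the column
types are resolved once when the batch layout is known, and the per-row path
only reads an array. ColumnKind, FieldInspector and CachedColumnKinds are
hypothetical stand-ins introduced for illustration; they are not Hive classes,
and the real fix may keep the cache somewhere else.

```java
import java.util.List;

// Hypothetical stand-in for Hive's primitive category enum.
enum ColumnKind { LONG, DOUBLE, STRING }

// Hypothetical stand-in for a field's object inspector, whose type
// lookup is assumed to be too costly to repeat for every row.
interface FieldInspector {
    ColumnKind kind();
}

// Resolves every column's kind once, up front, and serves later
// requests from a plain array instead of asking the inspector again.
final class CachedColumnKinds {
    private final ColumnKind[] kinds;

    CachedColumnKinds(List<? extends FieldInspector> fields) {
        kinds = new ColumnKind[fields.size()];
        for (int i = 0; i < kinds.length; i++) {
            kinds[i] = fields.get(i).kind(); // one lookup per column
        }
    }

    ColumnKind kindOf(int column) {
        return kinds[column]; // no inspector call on the per-row path
    }

    int columnCount() {
        return kinds.length;
    }
}
```

With this shape, the number of type lookups depends on the number of columns
rather than on rows times columns.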
Also, addRowToBatchFrom executes a case statement for every row and every
column to perform type casting. I recommend encapsulating the type logic in
templatized methods.
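One way to move the type dispatch out of the per-row loop, sketched here with
per-column assigner objects rather than templatized methods: the type-specific
cast is chosen once per column, and adding a row only invokes the pre-selected
assigners. LongColumn, DoubleColumn, ColumnAssigner, Assigners and
RowBatchWriter are hypothetical names for illustration and do not reflect
Hive's actual classes.

```java
// Hypothetical stand-ins for typed column vectors.
final class LongColumn {
    final long[] vector;
    LongColumn(int size) { vector = new long[size]; }
}

final class DoubleColumn {
    final double[] vector;
    DoubleColumn(int size) { vector = new double[size]; }
}

// Writes one value into one column; the cast is fixed when it is created.
interface ColumnAssigner {
    void assign(int rowIndex, Object value);
}

final class Assigners {
    static ColumnAssigner forLong(LongColumn col) {
        return (rowIndex, value) -> {
            col.vector[rowIndex] = ((Number) value).longValue();
        };
    }

    static ColumnAssigner forDouble(DoubleColumn col) {
        return (rowIndex, value) -> {
            col.vector[rowIndex] = ((Number) value).doubleValue();
        };
    }
}

// Holds one assigner per column, selected once; addRow contains no
// switch on the column type.
final class RowBatchWriter {
    private final ColumnAssigner[] assigners;

    RowBatchWriter(ColumnAssigner... assigners) {
        this.assigners = assigners;
    }

    void addRow(int rowIndex, Object[] row) {
        for (int c = 0; c < assigners.length; c++) {
            assigners[c].assign(rowIndex, row[c]);
        }
    }
}
```

Whether this is faster than a switch in practice would need to be confirmed
with the same profile used above.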
{code}
Stack Trace                                               Sample Count   Percentage(%)
VectorizedBatchUtil.addRowToBatchFrom                     86             26.543
AbstractPrimitiveObjectInspector.getPrimitiveCategory()   34             10.494
LazyBinaryStructObjectInspector.getStructFieldData        25             7.716
StandardStructObjectInspector.getStructFieldData          4              1.235
{code}
The query used:
{code}
select
ss_sold_date_sk
from
store_sales
where
ss_sold_date between '1998-01-01' and '1998-06-01'
group by ss_item_sk , ss_customer_sk , ss_sold_date_sk
having sum(ss_list_price) > 50000000000000;
{code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)