[jira] [Commented] (HIVE-7664) VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized execution and takes 25% CPU

Hive QA (JIRA) Wed, 12 Nov 2014 20:45:23 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209256#comment-14209256
 ]


Hive QA commented on HIVE-7664:
-------------------------------



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12662948/HIVE-7664.2.patch.txt

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/1764/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/1764/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-1764/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-1764/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
++ egrep -v '^X|^Performing status on external'
++ awk '{print $2}'
++ svn status --no-ignore
+ rm -rf
+ svn update

Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1639245.

At revision 1639245.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12662948 - PreCommit-HIVE-TRUNK-Build

> VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized 
> execution and takes 25% CPU
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7664
>                 URL: https://issues.apache.org/jira/browse/HIVE-7664
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.1
>            Reporter: Mostafa Mokhtar
>            Assignee: Gopal V
>         Attachments: HIVE-7664.1.patch.txt, HIVE-7664.2.patch.txt
>
>
> In a GROUP BY-heavy vectorized Reducer vertex, 25% of CPU time is spent in 
> VectorizedBatchUtil.addRowToBatchFrom().
> Looking at the code, addRowToBatchFrom does not appear to be optimized for 
> vectorized processing.
> addRowToBatchFrom is called for every row, and for each row it calls 
> getPrimitiveCategory on every column in the batch to determine the column's 
> type. Column types are stored in a HashMap. For VectorGroupByOperator the 
> column types do not change between batches, so they should not be looked up 
> for every row.
> I recommend storing the column type in StructObjectInspector so that other 
> components can leverage this optimization.
> addRowToBatchFrom also runs a case statement for every row and every column 
> to perform type casting. I recommend encapsulating the type logic in 
> templatized methods.
> {code}
> Stack Trace                                                  Sample Count  Percentage(%)
> VectorizedBatchUtil.addRowToBatchFrom                        86            26.543
>    AbstractPrimitiveObjectInspector.getPrimitiveCategory()   34            10.494
>    LazyBinaryStructObjectInspector.getStructFieldData        25            7.716
>    StandardStructObjectInspector.getStructFieldData          4             1.235
> {code}
> The query used:
> {code}
> select 
>     ss_sold_date_sk
> from
>     store_sales
> where
>     ss_sold_date between '1998-01-01' and '1998-06-01'
> group by ss_item_sk , ss_customer_sk , ss_sold_date_sk
> having sum(ss_list_price) > 50000000000000;
> {code}
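
The recommendation in the description (resolve each column's type once, then reuse it for every row) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of either attached patch: every name in it (ColumnAssignerSketch, Category, Assigner, buildAssigners, addRow) is hypothetical, plain Java arrays stand in for Hive's column vectors and object inspectors, and null handling is omitted.

```java
// Hypothetical sketch: none of these types are Hive classes.
final class ColumnAssignerSketch {

    /** Simplified stand-in for a primitive type category. */
    enum Category { LONG, DOUBLE, STRING }

    /** Writes one value of an already-known type into a fixed column. */
    interface Assigner {
        void assign(Object value, int rowIndex);
    }

    /** Counts type resolutions so the saving is observable. */
    static int typeResolutions = 0;

    /**
     * Resolves the type of every column once and returns one type-specific
     * assigner per column. The switch runs once per column here instead of
     * once per row and column.
     */
    static Assigner[] buildAssigners(Category[] schema, Object[] columns) {
        Assigner[] assigners = new Assigner[schema.length];
        for (int c = 0; c < schema.length; c++) {
            typeResolutions++;
            switch (schema[c]) {
                case LONG: {
                    final long[] vector = (long[]) columns[c];
                    assigners[c] = (value, row) -> vector[row] = ((Number) value).longValue();
                    break;
                }
                case DOUBLE: {
                    final double[] vector = (double[]) columns[c];
                    assigners[c] = (value, row) -> vector[row] = ((Number) value).doubleValue();
                    break;
                }
                case STRING: {
                    final String[] vector = (String[]) columns[c];
                    assigners[c] = (value, row) -> vector[row] = (String) value;
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unsupported category: " + schema[c]);
            }
        }
        return assigners;
    }

    /** Per-row path: no type lookup and no switch, only the cached assigners. */
    static void addRow(Object[] row, int rowIndex, Assigner[] assigners) {
        for (int c = 0; c < assigners.length; c++) {
            assigners[c].assign(row[c], rowIndex);
        }
    }

    public static void main(String[] args) {
        Category[] schema = { Category.LONG, Category.DOUBLE, Category.STRING };
        long[] ids = new long[2];
        double[] prices = new double[2];
        String[] names = new String[2];
        Assigner[] assigners = buildAssigners(schema, new Object[] { ids, prices, names });
        addRow(new Object[] { 7L, 1.5, "a" }, 0, assigners);
        addRow(new Object[] { 8L, 2.5, "b" }, 1, assigners);
        System.out.println(ids[1] + " " + prices[1] + " " + names[1]);
    }
}
```

In this sketch the number of type resolutions equals the number of columns, regardless of how many rows are added. Whether the cached types should live in StructObjectInspector, as the description proposes, or elsewhere is a design decision the sketch does not address.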



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
