Rajkumar Singh created HIVE-21935:
-------------------------------------

             Summary: Hive Vectorization : Server performance issue with vectorize UDF
                 Key: HIVE-21935
                 URL: https://issues.apache.org/jira/browse/HIVE-21935
             Project: Hive
          Issue Type: Bug
          Components: Vectorization
    Affects Versions: 3.1.1
         Environment: Hive-3, JDK-8
            Reporter: Rajkumar Singh


With vectorization turned on and hive.vectorized.adaptor.usage.mode=all, we were 
seeing severe performance degradation. Looking at the task jstacks, the task 
appears to be running the code that vectorizes the UDF and to be stuck in a loop.


{code:java}
jstack -l 14954 | grep 0x3af0 -A20
"TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573)
        at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
        at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
        at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
        at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)

[yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20
"TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554)
        at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570)
        at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
        at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
        at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
        at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
{code}
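
The `nid=0x3af0` value that the `grep` above filters on is the native thread ID in hexadecimal. A minimal sketch of how such a value can be derived from a decimal thread ID, for instance one read from `top -H -p <pid>`. The helper name `tid_to_nid` is hypothetical, and the use of `top` is an assumption about how the busy thread was found, not something the report states:

```shell
#!/usr/bin/env bash
# Convert a decimal native thread ID (as shown by tools such as `top -H`)
# into the hexadecimal form that jstack prints in its `nid=` field.
# `tid_to_nid` is a hypothetical helper name used only for this sketch.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Example: decimal 15088 corresponds to the nid seen in the dumps above.
tid_to_nid 15088   # prints 0x3af0
```

With that helper, the filter used above could be written as `jstack -l 14954 | grep "$(tid_to_nid 15088)" -A20`.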

After setting hive.vectorized.adaptor.usage.mode=none, the query completed 
much faster.

Steps To Reproduce:
1. Create Table:

{code}
CREATE EXTERNAL TABLE `splittestloc`(
  `id` int,
  `value` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1561482451');
{code}

2. Sample data: the table has about 40M rows. The sample data was generated 
using the following script.
{code}
for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done
{code}
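
The loop above appends to `data.log` once per row, which can be slow for 40M rows. A possible alternative, sketched here under the assumption that GNU `seq` and `awk` are available, streams the same `id,start#mid#id` lines in a single pass. The function name `gen_rows` is hypothetical:

```shell
#!/usr/bin/env bash
# Emit N rows in the same "id,start#mid#id" format as the loop above.
# `gen_rows` is a hypothetical helper; it assumes GNU `seq` and `awk`
# (other `seq` implementations may format large numbers differently).
gen_rows() {
  seq 1 "$1" | awk '{ print $1 ",start#mid#" $1 }'
}

# Small preview; for the full data set use: gen_rows 40000000 > data.log
gen_rows 3
```

The resulting file would still need to be placed under the table's LOCATION; the report does not show that step.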

3. I believe this should be reproducible with Hive's generic split, but I am 
attaching the custom UDF that I used to split the string.

4. Create the function:
{code}
add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit';
{code}

5. Run the following query, which reproduces the issue when vectorization is 
turned on.
{code}
create temporary table tmp2 as select id, mysplit(value, "#")[2] from splittestloc;
{code}
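
To compare the two adaptor modes named in this report on the same query, the session settings can be emitted ahead of the statements. The sketch below only prints HiveQL text; the function name `repro_sql` is hypothetical, and `hive.vectorized.execution.enabled=true` is assumed to be what "vectorization turned on" refers to:

```shell
#!/usr/bin/env bash
# Print the reproduction statements for a given adaptor usage mode
# ("all" or "none", the two values compared in this report).
# `repro_sql` is a hypothetical helper; it only prints HiveQL text.
repro_sql() {
  local mode="$1"
  cat <<EOF
set hive.vectorized.execution.enabled=true;
set hive.vectorized.adaptor.usage.mode=${mode};
add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit';
create temporary table tmp2 as select id, mysplit(value, "#")[2] from splittestloc;
EOF
}

repro_sql all
```

The output could be saved to a file and run once per mode with a client of your choice (for example `beeline -f`), timing each run to confirm the difference.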





