Rajkumar Singh created HIVE-21935:
-------------------------------------
Summary: Hive Vectorization : Server performance issue with
vectorize UDF
Key: HIVE-21935
URL: https://issues.apache.org/jira/browse/HIVE-21935
Project: Hive
Issue Type: Bug
Components: Vectorization
Affects Versions: 3.1.1
Environment: Hive-3, JDK-8
Reporter: Rajkumar Singh
with vectorization turned on and hive.vectorized.adaptor.usage.mode=all we were
seeing severe performance degradation. looking at the task jstacks it seems
that it is running the code which vectorizes UDF and stuck in some loop.
{code:java}
jstack -l 14954 | grep 0x3af0 -A20
"TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0
runnable [0x00007f1547581000]
java.lang.Thread.State: RUNNABLE
at
org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573)
at
org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
at
org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
at
org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
at
org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at
org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
at
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
at
org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
[yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20
"TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0
runnable [0x00007f1547581000]
java.lang.Thread.State: RUNNABLE
at
org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554)
at
org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570)
at
org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
at
org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
at
org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
at
org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at
org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
at
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
at
org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
{code}
after setting the hive.vectorized.adaptor.usage.mode=none query did complete
much faster.
Steps To Reproduce:
1. Create Table:
{code}
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `splittestloc`( |
| `id` int, |
| `value` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transient_lastDdlTime'='1561482451') |
+----------------------------------------------------+
{code}
2. Sample data: table has some 40M rows and sample data is generated using
following script.
{code}
for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done
{code}
3. I believe this should be reproducible with hive generic split but I am
attaching the custom UDF to split the string.
4. create a function
{code}
add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit'
{code}
5. run the following query which will reproduce the issue if vectorization
turned on.
{code}
create temporary table tmp2 as select id,mysplit(value,"#")[2] from
splittestloc
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)