[ 
https://issues.apache.org/jira/browse/HIVE-18524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375309#comment-16375309
 ] 

Ke Jia commented on HIVE-18524:
-------------------------------

[~mmccline]:

HIVE-17139 mainly optimizes vector-mode and row-mode expressions.

For vector expressions (for example IfExprDoubleColumnDoubleColumn.java):

For If(expr1, expr2, expr3), we first evaluate the child expressions (expr1, 
expr2 and expr3). expr1 is computed first and its result is stored in 
batch.cols[arg1Column]: the value is 1 where expr1 is true and 0 otherwise. 
We then compute expr2 only for the rows where batch.cols[arg1Column] is 1, and 
expr3 for the remaining rows. After the children are evaluated, the value of 
the If expression is computed from the result of expr1: where expr1 is 1 the 
value is expr2, otherwise it is expr3. I do not think this path can hit an NPE 
like the one in HIVE-18524. If my understanding is wrong, please tell me, 
thanks.
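
To make the two phases above concrete, here is a minimal sketch. It is not the 
Hive code: it uses plain arrays instead of VectorizedRowBatch, and the names 
IfExprSketch, evaluateIf, thenExpr and elseExpr are made up for illustration.

```java
import java.util.function.IntToDoubleFunction;

final class IfExprSketch {
    /**
     * cond[i] is 1 where expr1 is true and 0 otherwise.
     * thenExpr / elseExpr stand in for expr2 / expr3 (hypothetical names).
     */
    static double[] evaluateIf(long[] cond,
                               IntToDoubleFunction thenExpr,
                               IntToDoubleFunction elseExpr) {
        int n = cond.length;
        double[] thenCol = new double[n];
        double[] elseCol = new double[n];

        // Phase 1: evaluate the children. Each branch is computed only for
        // the rows that select it; the other rows are skipped.
        for (int i = 0; i < n; i++) {
            if (cond[i] == 1) {
                thenCol[i] = thenExpr.applyAsDouble(i);
            } else {
                elseCol[i] = elseExpr.applyAsDouble(i);
            }
        }

        // Phase 2: build the If result from cond. Only entries that were
        // actually computed in phase 1 are read here.
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = (cond[i] == 1) ? thenCol[i] : elseCol[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] cond = {1, 0, 1};
        double[] out = evaluateIf(cond, row -> row * 10.0, row -> -1.0);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

In this sketch the second phase never touches a skipped entry, and a skipped 
double entry holds a default 0.0 rather than a null reference, which matches 
the reasoning above for the double column case.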

For row expressions (for example VectorUDFAdaptor.java):

We evaluate the child expressions the same way as for the vector expressions 
above. After the children are evaluated, the current implementation in 
VectorUDFAdaptor takes the i-th row values batch.cols[arg1Column][i], 
batch.cols[arg2Column][i] and batch.cols[arg3Column][i], wraps each of them in 
a GenericUDF.DeferredObject and passes them to GenericUDFIf.java, which 
computes the final value of the If expression from the passed 
GenericUDF.DeferredObject arguments. The exception in HIVE-18524 is thrown in 
the phase that wraps the results in GenericUDF.DeferredObject. For example, 
when the If expression produces a BytesColumnVector and expr1 is 1 for the 
i-th row, we skip computing expr3 for that row while evaluating the children, 
so batch.cols[arg3Column][i] is null and wrapping it throws an NPE. Our 
solution is to wrap only the value of the selected branch and skip the other 
one. For example, if batch.cols[arg1Column][i] is 1, we wrap only 
batch.cols[arg2Column][i] and do not wrap batch.cols[arg3Column][i].
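
A minimal sketch of the failure and of the proposed fix, under simplifying 
assumptions: BytesCol below is a made-up stand-in for BytesColumnVector, and 
wrap() only imitates the byte copy that happens while building the deferred 
argument. None of these names (RowModeIfSketch, BytesCol, wrap, ifWrapAll, 
ifWrapSelected) exist in Hive.

```java
import java.util.Arrays;

final class RowModeIfSketch {
    /** Simplified bytes column: one byte[] reference and one length per row. */
    static final class BytesCol {
        final byte[][] vector;
        final int[] length;

        BytesCol(int n) {
            vector = new byte[n][];   // every entry starts out as null
            length = new int[n];
        }

        void set(int i, byte[] value) {
            vector[i] = value;
            length[i] = value.length;
        }
    }

    /** Imitates wrapping row i as a UDF argument; a null entry throws NPE. */
    static byte[] wrap(BytesCol col, int i) {
        return Arrays.copyOfRange(col.vector[i], 0, col.length[i]);
    }

    /** Wraps both branches first, then picks one: fails on a skipped branch. */
    static byte[] ifWrapAll(long[] cond, BytesCol thenCol, BytesCol elseCol, int i) {
        byte[] thenValue = wrap(thenCol, i);
        byte[] elseValue = wrap(elseCol, i);
        return cond[i] == 1 ? thenValue : elseValue;
    }

    /** Wraps only the branch selected by cond[i]; the skipped one is never read. */
    static byte[] ifWrapSelected(long[] cond, BytesCol thenCol, BytesCol elseCol, int i) {
        return cond[i] == 1 ? wrap(thenCol, i) : wrap(elseCol, i);
    }

    public static void main(String[] args) {
        long[] cond = {1, 0};
        BytesCol thenCol = new BytesCol(2);
        BytesCol elseCol = new BytesCol(2);
        thenCol.set(0, new byte[] {1, 2});   // row 0: only expr2 was computed
        elseCol.set(1, new byte[] {3, 4});   // row 1: only expr3 was computed
        System.out.println(Arrays.toString(ifWrapSelected(cond, thenCol, elseCol, 0)));
        System.out.println(Arrays.toString(ifWrapSelected(cond, thenCol, elseCol, 1)));
    }
}
```

Calling ifWrapAll on the same data fails on row 0, because the entry of 
elseCol for that row was never filled in.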

This optimization gives a 17% improvement on Q06 of TPCx-BB and a +40% 
improvement on complex String operations, so I think it is necessary.

> Vectorization: Execution failure related to non-standard embedding of 
> IfExprConditionalFilter inside VectorUDFAdaptor (Revert HIVE-17139)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-18524
>                 URL: https://issues.apache.org/jira/browse/HIVE-18524
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.0.0
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HIVE-18524.01.patch, HIVE-18524.02.patch
>
>
> {noformat}
> insert overwrite table insert_10_1
>     select cast(gpa as float),
>            age,
>            IF(age>40,cast('2011-01-01 01:01:01' as timestamp),NULL),
>            IF(LENGTH(name)>10,cast(name as binary),NULL)
>     from studentnull10k
> vectorizationSchemaColumns: [0:name:string, 1:age:int, 2:gpa:double]
> ExprNodeDescs:
>     UDFToFloat(gpa) (type: float),
>     age (type: int),
>     if((age > 40), 2011-01-01 01:01:01.0, null) (type: timestamp),
>     if((length(name) > 10), CAST( name AS BINARY), null) (type: binary)
> selectExpressions:
>     VectorUDFAdaptor(if((age > 40), 2011-01-01 01:01:01.0, null))
>         (children: LongColGreaterLongScalar(col 1:int, val 40) -> 4:boolean) 
> -> 5:timestamp,
>     VectorUDFAdaptor(if((length(name) > 10), CAST( name AS BINARY), null))
>         (children: LongColGreaterLongScalar(col 4:int, val 10)(children: 
> StringLength(col 0:string) -> 4:int) -> 6:boolean,
>         VectorUDFAdaptor(CAST( name AS BINARY)) -> 7:binary) -> 8:binary
> {noformat}
> *// Notice there is no vector expression shown for the last IF stmt.*  It has 
> been magically embedded inside the VectorUDFAdaptor object...
> Execution results in this call stack.
> {nocode}
> Caused by: java.lang.NullPointerException
>       at java.util.Arrays.copyOfRange(Arrays.java:3521)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory$9.writeValue(VectorExpressionWriterFactory.java:1101)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory$VectorExpressionWriterBytes.writeValue(VectorExpressionWriterFactory.java:343)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFArgDesc.getDeferredJavaObject(VectorUDFArgDesc.java:123)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:211)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:177)
>       at 
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:145)
>       ... 22 more
> {nocode}
> Change is due to:
> HIVE-17139: Conditional expressions optimization: skip the expression 
> evaluation if the condition is not satisfied for vectorization engine. (Jia 
> Ke, reviewed by Ferdinand Xu)
> Embedding a raw vector expression outside of VectorizationContext is quite 
> non-standard and evidently buggy.
> [~Ferd] [~Ke Jia] I am inclined to revert this change.  Comments?  CC: 
> [~ashutoshc] [~hagleitn]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
