Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18295 )

Change subject: IMPALA-11162: Support GenericUDFs for Hive
......................................................................


Patch Set 4:

(35 comments)

Thanks, and sorry for the late review.

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
File 
fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@65
PS4, Line 65: are passed in
How can they not be passed in? If you mean "are not NULL", it would be better 
to write that in my opinion, because 'not passing in' a parameter suggests to 
me that there is an overload that doesn't take those parameters.

Also, in checkValidFunction(), the check is always performed (regardless of the 
types being NULL), isn't it?


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@128
PS4, Line 128: getGenericUDFInstance
'constructGenericUDFInstance' would be better as we already have a 
'getGenericUDFInstance' which does a very different thing (is a getter).


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@139
PS4, Line 139: catch (InstantiationException e) {
             :       throw new CatalogException("Unable to call create UDF 
instance.", e);
             :     } catch (IllegalAccessException e) {
             :       throw new CatalogException("Unable to call create UDF 
instance.", e);
             :     } catch (InvocationTargetException e) {
             :       throw new CatalogException("Unable to call create UDF 
instance.", e);
These exceptions are handled identically, so catching them in the same clause 
could simplify the code (and also generate smaller bytecode):

catch 
(InstantiationException|IllegalAccessException|CatalogException|CatalogException
 e) {
  throw new CatalogException("Unable to call create UDF instance.", e);
}

https://docs.oracle.com/javase/8/docs/technotes/guides/language/catch-multiple.html


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@154
PS4, Line 154: !returnOI.getTypeName().equals("void")
Does it mean that if 'genericUDF_.initialize(parameterOIs)' returns 'void' we 
accept it even if 'retType_' is an int or some other valid type? Shouldn't we 
only accept 'void' if 'retType_' specifically indicates that we expect a void 
return type? I know that 'Type' doesn't have a void value, but it could be 
indicated by 'retType_' being NULL or 'retType_.isInvalid()' being true.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java
File 
fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@67
PS4, Line 67: @SuppressWarnings("restriction")
Just curious, why is this warning suppression needed?


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@76
PS4, Line 76:
Nit: missing 'is'.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@79
PS4, Line 79: deferredObjects_
I think 'parameters_' would be more descriptive name. 'deferredObjects_' 
actually refers to the type, not what they are used for in the code. If you are 
worried that it could be confused with 'inputArgs_' in the base class, the name 
could also be 'deferredParameters_' or something like that.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java
File fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java@132
PS4, Line 132: " + "
Nit: no need to separate the string literals here.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
File fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@50
PS4, Line 50: This class is a copy of the TestGenericUdf class in the FE.
This is actually in the FE. Either this comment should be removed from this 
file (and only kept in 
java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java), or if 
it is important that the two files be exactly the same, the comment should be 
adjusted so that it is appropriate in both folders.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@59
PS4, Line 59: TestGenericUdf
A class comment should describe what this UDF does for each argument type.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@61
PS4, Line 61: inputTypes
Should have an underscore at the end (and other members too), so inputTypes_, 
retTypeIO_, retType_.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@85
PS4, Line 85:     retTypeOI = (PrimitiveObjectInspector) 
arguments[arguments.length - 1];
We do not allow different argument types (this is checked in 'verify*Args()'), 
so we could take the type of the first argument, which is simpler.
Taking the last argument seems to imply to me that there is a reason for 
choosing the last one, i.e. there can be different types, which is misleading.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@89
PS4, Line 89: // Ok this is weird.  I tried doing these "if/else" commands as a 
switch
            :     // statement. WHen I did that, it threw a 
'java.lang.NoClassDefFoundError'
            :     // exception with the line number given as the one with the 
"switch" string.
            :     // This is the same case statement used successfully in the 
"evaluate" command,
            :     // yet it failed here. Given that this has nothing to do with 
functionality, I
            :     // decided not to debug this and leave it like this.
This comment should be removed.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@97
PS4, Line 97:       return retTypeOI;
Instead of putting a return statement into each branch, you could put the 
'throw' into a general 'else' branch at the end and return after the 
conditional - if we did not throw by then, we return 'retTypeOI'.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@150
PS4, Line 150: Unsupported return type
This is a bit misleading because the user of this class does not specify the 
return type directly, but the argument type, so an error message should mention 
the argument type.

For this UDF the return type is the same as the argument type, so 'retType' 
could be renamed to something like 'argAndRetType_'.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@159
PS4, Line 159: verifyBooleanArgs
Except for 'byte', these 'verify*Args()' functions are the same, so we could 
make the expected input type a parameter instead of having a different function 
for each.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@171
PS4, Line 171:     if (inputTypes.size() > 1) {
Why is 'byte' handled so differently from all the other types?


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@255
PS4, Line 255:       finalBoolean &= currentBool;
For the other types, the behaviour of the UDF is more or less addition 
(addition for numbers, concatenation for strings). With booleans, addition is 
usually associated with OR, not AND.
I suggest this is changed to
finalBoolean |= currentBool;


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@276
PS4, Line 276: evaluateShort
Couldn't we unify these 'evaluate*()' implementations, at least the ones for 
number types, into a generic (template) function? The template parameters would 
probably be the 'Writable's and (Short, Integer, Long etc.).


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@296
PS4, Line 296: short
Should be 'int', not 'short'.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@384
PS4, Line 384: evaluate
Is it also an evaluation method? Is it used anywhere?


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java
File fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java@475
PS4, Line 475: identity
The multi-argument functions are not identity functions, so this description is 
misleading.


http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java@485
PS4, Line 485:     TestUdf(null, TestGenericUdf.class, createText("ABCD"), 
createText("ABCD"));
             :     TestUdf(null, TestGenericUdf.class, createText("ABCD"), 
createText("ABCD"));
Nit: these two lines are identical, is it intentional?


http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java
File 
java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java@31
PS4, Line 31: GenericImportsNearbyClassesUdf
Could you explain in the class comment what this class is used for?


http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java
File 
java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java@31
PS4, Line 31: GenericReplaceStringUdf
A short comment would be useful about what this UDF does.


http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@49
PS4, Line 49:  * This class is a copy of the TestGenericUdf class in the FE. We 
need this class in a
See the comments for the other copy of this file.


http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java
File 
java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java:

http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java@30
PS4, Line 30: s
Nit: throw.


http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java@46
PS4, Line 46: NullPointerException
I'd suggest throwing a general RuntimeException, throwing NullPointerException 
is misleading as there is no NULL involved anywhere.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
File 
testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test:

http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@24
PS4, Line 24: generic_identity
The multi-argument versions are not identity functions.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@74
PS4, Line 74: BytesWritable, Text,
            : # and String
How are these three types tested? It seems that the type is the same for all.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@88
PS4, Line 88: BytesWritable, Text,
            : # and String
See L74-75.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@170
PS4, Line 170: generic_hive_add
Addition in a boolean context usually means OR, but the current implementation 
is AND. See the comment in 
fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java L255.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@272
PS4, Line 272: # Java Generic UDFs for TIMESTAMP are not allowed yet
We could also add tests that cover incorrect return types, for example for 
org.apache.impala.TestGenericUdf a function that takes two smallints and 
returns an int - this should also throw an exception, am I right?


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
File 
testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test:

http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test@35
PS4, Line 35: tinyint
It would be more logical if the definition for 'tinyint' was before 'smallint', 
so the integers would come in increasing byte size.


http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test@51
PS4, Line 51: generic_identity
I wouldn't call these functions '*identity' in the cases where there is more 
than one argument, maybe not even in the one-argument case.
In the multi-argument cases, they are definitely not identity functions.



--
To view, visit http://gerrit.cloudera.org:8080/18295
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6fd09120db413fade94410c83ebe8ff104013cd
Gerrit-Change-Number: 18295
Gerrit-PatchSet: 4
Gerrit-Owner: Steve Carlin <scar...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Steve Carlin <scar...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 12:24:25 +0000
Gerrit-HasComments: Yes

Reply via email to