Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/18295 )
Change subject: IMPALA-11162: Support GenericUDFs for Hive ...................................................................... Patch Set 4: (35 comments) Thanks, and sorry for the late review. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java File fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java: http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@65 PS4, Line 65: are passed in How can they not be passed in? If you mean "are not NULL", it would be better to write that in my opinion, because 'not passing in' a parameter suggests to me that there is an overload that doesn't take those parameters. Also, in checkValidFunction(), the check is always performed (regardless of the types being NULL), isn't it? http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@128 PS4, Line 128: getGenericUDFInstance 'constructGenericUDFInstance' would be better as we already have a 'getGenericUDFInstance' which does a very different thing (is a getter). http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@139 PS4, Line 139: catch (InstantiationException e) { : throw new CatalogException("Unable to call create UDF instance.", e); : } catch (IllegalAccessException e) { : throw new CatalogException("Unable to call create UDF instance.", e); : } catch (InvocationTargetException e) { : throw new CatalogException("Unable to call create UDF instance.", e); These exceptions are handled identically, so catching them in the same clause could simplify the code (and also generate smaller bytecode): catch (InstantiationException|IllegalAccessException|CatalogException|CatalogException e) { throw new CatalogException("Unable to call create UDF instance.", e); } https://docs.oracle.com/javase/8/docs/technotes/guides/language/catch-multiple.html http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java@154 PS4, Line 154: !returnOI.getTypeName().equals("void") Does it mean that if 'genericUDF_.initialize(parameterOIs)' returns 'void' we accept it even if 'retType_' is an int or some other valid type? Shouldn't we only accept 'void' if 'retType_' specifically indicates that we expect a void return type? I know that 'Type' doesn't have a void value, but it could be indicated by 'retType_' being NULL or 'retType_.isInvalid()' being true. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java File fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java: http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@67 PS4, Line 67: @SuppressWarnings("restriction") Just curious, why is this warning suppression needed? http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@76 PS4, Line 76: Nit: missing 'is'. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/HiveUdfExecutorGeneric.java@79 PS4, Line 79: deferredObjects_ I think 'parameters_' would be more descriptive name. 'deferredObjects_' actually refers to the type, not what they are used for in the code. If you are worried that it could be confused with 'inputArgs_' in the base class, the name could also be 'deferredParameters_' or something like that. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java File fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java: http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java@132 PS4, Line 132: " + " Nit: no need to separate the string literals here. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java File fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java: http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@50 PS4, Line 50: This class is a copy of the TestGenericUdf class in the FE. This is actually in the FE. Either this comment should be removed from this file (and only kept in java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java), or if it is important that the two files be exactly the same, the comment should be adjusted so that it is appropriate in both folders. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@59 PS4, Line 59: TestGenericUdf A class comment should describe what this UDF does for each argument type. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@61 PS4, Line 61: inputTypes Should have an underscore at the end (and other members too), so inputTypes_, retTypeIO_, retType_. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@85 PS4, Line 85: retTypeOI = (PrimitiveObjectInspector) arguments[arguments.length - 1]; We do not allow different argument types (this is checked in 'verify*Args()'), so we could take the type of the first argument, which is simpler. Taking the last argument seems to imply to me that there is a reason for choosing the last one, i.e. there can be different types, which is misleading. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@89 PS4, Line 89: // Ok this is weird. I tried doing these "if/else" commands as a switch : // statement. WHen I did that, it threw a 'java.lang.NoClassDefFoundError' : // exception with the line number given as the one with the "switch" string. : // This is the same case statement used successfully in the "evaluate" command, : // yet it failed here. Given that this has nothing to do with functionality, I : // decided not to debug this and leave it like this. This comment should be removed. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@97 PS4, Line 97: return retTypeOI; Instead of putting a return statement into each branch, you could put the 'throw' into a general 'else' branch at the end and return after the conditional - if we did not throw by then, we return 'retTypeOI'. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@150 PS4, Line 150: Unsupported return type This is a bit misleading because the user of this class does not specify the return type directly, but the argument type, so an error message should mention the argument type. For this UDF the return type is the same as the argument type, so 'retType' could be renamed to something like 'argAndRetType_'. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@159 PS4, Line 159: verifyBooleanArgs Except for 'byte', these 'verify*Args()' functions are the same, so we could make the expected input type a parameter instead of having a different function for each. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@171 PS4, Line 171: if (inputTypes.size() > 1) { Why is 'byte' handled so differently from all the other types? http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@255 PS4, Line 255: finalBoolean &= currentBool; For the other types, the behaviour of the UDF is more or less addition (addition for numbers, concatenation for strings). With booleans, addition is usually associated with OR, not AND. I suggest this is changed to finalBoolean |= currentBool; http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@276 PS4, Line 276: evaluateShort Couldn't we unify these 'evaluate*()' implementations, at least the ones for number types, into a generic (template) function? The template parameters would probably be the 'Writable's and (Short, Integer, Long etc.). http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@296 PS4, Line 296: short Should be 'int', not 'short'. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java@384 PS4, Line 384: evaluate Is it also an evaluation method? Is it used anywhere? http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java File fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java: http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java@475 PS4, Line 475: identity The multi-argument functions are not identity functions, so this description is misleading. http://gerrit.cloudera.org:8080/#/c/18295/4/fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java@485 PS4, Line 485: TestUdf(null, TestGenericUdf.class, createText("ABCD"), createText("ABCD")); : TestUdf(null, TestGenericUdf.class, createText("ABCD"), createText("ABCD")); Nit: these two lines are identical, is it intentional? http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java File java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java: http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericImportsNearbyClassesUdf.java@31 PS4, Line 31: GenericImportsNearbyClassesUdf Could you explain in the class comment what this class is used for? http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java File java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java: http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/GenericReplaceStringUdf.java@31 PS4, Line 31: GenericReplaceStringUdf A short comment would be useful about what this UDF does. http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java: http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@49 PS4, Line 49: * This class is a copy of the TestGenericUdf class in the FE. We need this class in a See the comments for the other copy of this file. http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java: http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java@30 PS4, Line 30: s Nit: throw. http://gerrit.cloudera.org:8080/#/c/18295/4/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdfException.java@46 PS4, Line 46: NullPointerException I'd suggest throwing a general RuntimeException, throwing NullPointerException is misleading as there is no NULL involved anywhere. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test File testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test: http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@24 PS4, Line 24: generic_identity The multi-argument versions are not identity functions. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@74 PS4, Line 74: BytesWritable, Text, : # and String How are these three types tested? It seems that the type is the same for all. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@88 PS4, Line 88: BytesWritable, Text, : # and String See L74-75. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@170 PS4, Line 170: generic_hive_add Addition in a boolean context usually means OR, but the current implementation is AND. See the comment in fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java L255. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test@272 PS4, Line 272: # Java Generic UDFs for TIMESTAMP are not allowed yet We could also add tests that cover incorrect return types, for example for org.apache.impala.TestGenericUdf a function that takes two smallints and returns an int - this should also throw an exception, am I right? http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test File testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test: http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test@35 PS4, Line 35: tinyint It would be more logical if the definition for 'tinyint' was before 'smallint', so the integers would come in increasing byte size. http://gerrit.cloudera.org:8080/#/c/18295/4/testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test@51 PS4, Line 51: generic_identity I wouldn't call these functions '*identity' in the cases where there is more than one argument, maybe not even in the one-argument case. In the multi-argument cases, they are definitely not identity functions. -- To view, visit http://gerrit.cloudera.org:8080/18295 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie6fd09120db413fade94410c83ebe8ff104013cd Gerrit-Change-Number: 18295 Gerrit-PatchSet: 4 Gerrit-Owner: Steve Carlin <scar...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Steve Carlin <scar...@cloudera.com> Gerrit-Comment-Date: Tue, 05 Apr 2022 12:24:25 +0000 Gerrit-HasComments: Yes