Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/GenericUDAFCaseStudy" page has been changed by ArvindPrabhakar.
http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy?action=diff&rev1=1&rev2=2

--------------------------------------------------

  == Writing the source ==
- As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`, relative to the Hive root directory. Please see the `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java` for a detailed example of a UDAF.
+ This section gives a high-level outline of how to implement your own generic UDAF. For a concrete example, look at any of the existing UDAF sources in the `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/` directory.
+
+ At a high level, there are two parts to implementing a generic UDAF. The first is to write an ''evaluator'', and the second is to create a ''resolver''. The evaluator is the actual implementation of the generic UDAF and contains the processing logic. The resolver, on the other hand, provides the mechanism through which the query processing framework obtains the evaluator.
+
+ All evaluators must extend the abstract base class `org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator`. This class declares a few abstract methods that the extending class must implement. These methods establish the processing semantics followed by the UDAF. Please refer to the javadocs of the abstract methods for their exact specifications.
+
+ A resolver is written either by implementing the interface `org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2` or by extending the abstract class `org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver`. The interface `org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver` can also be implemented, but it is deprecated as of the 0.6.0 release.
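The evaluator/resolver split described above can be illustrated with a small self-contained sketch. It deliberately does not use the Hive classes, so it compiles without Hive on the classpath: `SketchEvaluator`, `SketchResolver`, `SumEvaluator` and `SumResolver` are hypothetical stand-ins invented for this illustration, and their method signatures are simplified assumptions rather than the real `GenericUDAFEvaluator` or `GenericUDAFResolver2` API.

```java
import java.util.Arrays;
import java.util.List;

public class GenericUdafSketch {

    // Hypothetical stand-in for the evaluator role (NOT Hive's GenericUDAFEvaluator):
    // it owns the aggregation logic and the handling of partial results.
    abstract static class SketchEvaluator {
        abstract long[] newBuffer();                         // fresh aggregation state
        abstract void iterate(long[] buffer, Object value);  // consume one input value
        abstract void merge(long[] buffer, long[] partial);  // fold in a partial result
        abstract long terminate(long[] buffer);              // produce the final result
    }

    // Hypothetical stand-in for the resolver role (NOT Hive's GenericUDAFResolver2):
    // it checks the argument types and hands back a matching evaluator.
    interface SketchResolver {
        SketchEvaluator getEvaluator(List<String> argumentTypes);
    }

    // Evaluator: sums integral inputs and skips NULL values.
    static class SumEvaluator extends SketchEvaluator {
        @Override long[] newBuffer() { return new long[] {0L}; }

        @Override void iterate(long[] buffer, Object value) {
            if (value != null) {
                buffer[0] += ((Number) value).longValue();
            }
        }

        @Override void merge(long[] buffer, long[] partial) { buffer[0] += partial[0]; }

        @Override long terminate(long[] buffer) { return buffer[0]; }
    }

    // Resolver: accepts exactly one argument of type "int" or "bigint".
    static class SumResolver implements SketchResolver {
        @Override public SketchEvaluator getEvaluator(List<String> argumentTypes) {
            if (argumentTypes.size() != 1) {
                throw new IllegalArgumentException("exactly one argument expected");
            }
            String type = argumentTypes.get(0);
            if (!type.equals("int") && !type.equals("bigint")) {
                throw new IllegalArgumentException("unsupported argument type: " + type);
            }
            return new SumEvaluator();
        }
    }

    public static void main(String[] args) {
        // The framework's part, simulated: ask the resolver for an evaluator,
        // aggregate two partitions separately, then merge the partial results.
        SketchEvaluator eval = new SumResolver().getEvaluator(Arrays.asList("int"));
        long[] first = eval.newBuffer();
        long[] second = eval.newBuffer();
        eval.iterate(first, 1);
        eval.iterate(first, 2);
        eval.iterate(second, null);
        eval.iterate(second, 3);
        eval.merge(first, second);
        System.out.println(eval.terminate(first));
    }
}
```

In this sketch the resolver does only type checking and evaluator selection, while all per-row and merge logic stays in the evaluator, which mirrors the division of responsibilities described above. A real implementation would extend the Hive base classes instead and follow their javadocs.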
The key difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces is that the latter gives the implementation access to extra information about the function invocation, such as the presence of the DISTINCT qualifier or the use of the wildcard syntax, as in FUNCTION(*). Implementations based on the deprecated GenericUDAFResolver interface cannot tell the difference between invocations such as FUNCTION() and FUNCTION(*), because the information about the wildcard is not available to them. Similarly, they cannot tell the difference between FUNCTION(EXPR) and FUNCTION(DISTINCT EXPR), because the information about the presence of the DISTINCT qualifier is not available either.
+
+ Note that while resolvers that implement the GenericUDAFResolver2 interface are given the extra information about the presence of the DISTINCT qualifier or an invocation with the wildcard syntax, they can ignore it completely if it is of no significance to them. The underlying data manipulation that ensures the DISTINCT nature of the expression values is done by the framework, not by the evaluator or the resolver. UDAF implementations that do not care about this extra information can simply extend the AbstractGenericUDAFResolver class, which insulates the implementation from it. This class also offers an easy way to migrate previously written UDAF implementations to the new resolver interface without rewriting them, since the change from implementing the GenericUDAFResolver interface to extending the AbstractGenericUDAFResolver class is fairly minimal. Implementations that are already part of an inheritance hierarchy may run into issues, since it may not be easy to change their base class.

  == Modifying the function registry ==