[jira] [Created] (DRILL-6074) Corrections to UDF tutorial documentation page

Paul Rogers (JIRA) Sat, 06 Jan 2018 22:43:20 -0800

Paul Rogers created DRILL-6074:
----------------------------------

             Summary: Corrections to UDF tutorial documentation page
                 Key: DRILL-6074
                 URL: https://issues.apache.org/jira/browse/DRILL-6074
             Project: Apache Drill
          Issue Type: Bug
          Components: Documentation
            Reporter: Paul Rogers
            Assignee: Bridget Bevens
            Priority: Minor



Consider the [UDF 
Tutorial|http://drill.apache.org/docs/tutorial-develop-a-simple-function/]. 
Some of the details are a bit off.

Step 3:

bq. The function will be generated dynamically, as you can see in the 
DrillSimpleFuncHolder, and the input parameters and output holders are defined 
using holders by annotations. Define the parameters using the \@Param 
annotation.

Better: Drill uses your function template to in-line your function code into 
Drill's own generated code. The \@Param annotation identifies the input 
arguments. The order of the annotated fields indicates the order of the 
function parameters. Each parameter field must be one of Drill's holder types.

bq. Use a holder classes to provide a buffer to manage larger objects in an 
efficient way: VarCharHolder or NullableVarCharHolder.

Better: Our function template tells Drill to handle nulls, so all three of our 
arguments can be declared using the VarCharHolder type.

(Then, fix the code to use that type. The bit about larger objects is probably 
obsolete: holders are the only way to work with any value: large or otherwise.)

bq. NOTE: Drill doesn’t actually use the Java heap for data being processed in 
a query but instead keeps this data off the heap and manages the life-cycle for 
us without using the Java garbage collector.

Better: NOTE: VARCHAR data is stored in direct memory. The DrillBuf object in 
the VarCharHolder provides access to the data for the VARCHAR.

(For context: simple types, such as INT, are stored on the heap when passed to 
a UDF, so we don't want to make a blanket statement.)
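
To make the holder idea concrete, here is a minimal sketch of how a VARCHAR value is addressed as a byte range inside a direct buffer. It runs outside Drill: VarCharHolderSketch is a hypothetical stand-in for VarCharHolder, and a direct java.nio.ByteBuffer stands in for DrillBuf. Only the field names (start, end, buffer) are meant to mirror the real holder.

```java
// Hypothetical stand-in for Drill's VarCharHolder; a direct ByteBuffer plays the role of DrillBuf.
class VarCharHolderSketch {
    int start;                   // offset of the first byte of the value
    int end;                     // offset one past the last byte of the value
    java.nio.ByteBuffer buffer;  // direct (off-heap) memory holding the bytes

    // Copy the bytes in [start, end) onto the heap and decode them as UTF-8.
    static String read(VarCharHolderSketch h) {
        byte[] bytes = new byte[h.end - h.start];
        java.nio.ByteBuffer view = h.buffer.duplicate(); // do not disturb the shared buffer's position
        view.position(h.start);
        view.get(bytes);
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }

    // Build a holder over a buffer that stores two values back to back: "Smith" and "Jones".
    static VarCharHolderSketch demo() {
        byte[] data = "SmithJones".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(data.length);
        direct.put(data);
        VarCharHolderSketch h = new VarCharHolderSketch();
        h.buffer = direct;
        h.start = 5;   // "Jones" begins at byte 5
        h.end = 10;    // and ends just before byte 10
        return h;
    }
}
```

The point of the sketch is that the holder itself carries no string: it is just two offsets plus a reference to memory that lives off the heap.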

Step 4.

bq. Also, using the \@Output annotation, define the returned value as 
VarCharHolder type. Because you are manipulating a VarChar, you also have to 
inject a buffer that Drill uses for the output.

Better: Identify the function's return value using the \@Output annotation. 
Like parameters, the output must be a holder type. Drill, however, does not 
provide the output buffer; we have to request one using the \@Inject 
annotation. The injected field must be of type DrillBuf. Then, in our code, we 
set the output holder to point to the injected buffer.

Step 5. The code is inefficient and not a good example. Replace this:

{code}
    out.end = outputValue.getBytes().length;
    buffer.setBytes(0, outputValue.getBytes());
{code}

With this:

{code}
    byte[] result = outputValue.getBytes();
    out.end = result.length;
    buffer.setBytes(0, result);
{code}
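
For reference, the pattern from Steps 4 and 5 (encode once, copy into the injected buffer, point the output holder at it) can be sketched outside Drill roughly as follows. OutHolderSketch and the direct ByteBuffer are hypothetical stand-ins for the VarCharHolder output and the injected DrillBuf. The explicit UTF-8 charset is an assumption on my part; the tutorial calls getBytes() with no argument, which uses the platform default. The sketch also assumes the buffer is already large enough for the result.

```java
class OutputBufferSketch {
    // Hypothetical stand-in for the @Output VarCharHolder.
    static class OutHolderSketch {
        int start;
        int end;
        java.nio.ByteBuffer buffer;
    }

    // Encode the string once, copy the bytes into the scratch buffer,
    // then point the output holder at those bytes.
    static void writeOutput(String outputValue, java.nio.ByteBuffer buffer, OutHolderSketch out) {
        byte[] result = outputValue.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        for (int i = 0; i < result.length; i++) {
            buffer.put(i, result[i]);  // absolute write starting at offset 0
        }
        out.buffer = buffer;
        out.start = 0;
        out.end = result.length;
    }
}
```

Calling getBytes() a single time matters because each call allocates and fills a fresh byte array; the original tutorial code did that work twice per row.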

While we are at it, we might as well make another line a bit more readable.

{code}
    String outputValue = (new StringBuilder(maskSubString)).append(stringValue.substring(numberOfCharToReplace)).toString();
{code}

Should be rewritten as:

{code}
    String outputValue = new StringBuilder(maskSubString)
        .append(stringValue.substring(numberOfCharToReplace))
        .toString();
{code}
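
To check that the reformatted expression still does what the tutorial intends, here is a plain-Java sketch of the masking logic with no Drill types involved. The variable names stringValue and numberOfCharToReplace come from the tutorial line above; the MaskSketch class, the way the mask prefix is built, and the clamping of the count are my assumptions for illustration.

```java
class MaskSketch {
    // Replace the first toReplace characters of stringValue with maskValue.
    static String mask(String stringValue, String maskValue, int toReplace) {
        // Clamp the count so it is never negative and never exceeds the input length.
        int numberOfCharToReplace = Math.max(0, Math.min(toReplace, stringValue.length()));

        // Build the masked prefix, one mask value per replaced character.
        StringBuilder outputValue = new StringBuilder();
        for (int i = 0; i < numberOfCharToReplace; i++) {
            outputValue.append(maskValue);
        }

        // Append the untouched remainder of the input.
        return outputValue
            .append(stringValue.substring(numberOfCharToReplace))
            .toString();
    }
}
```

With this sketch, masking four characters of "Rogers" with "#" yields "####rs", and asking for more characters than the input has simply masks the whole string.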

Then in the list of steps:

bq. Gets the number of character to replace

The word "character" should be "characters" (plural)

And:

bq. Creates and populates the output buffer

Better:

* Copies the new string into the temporary DrillBuf
* Sets up the output holder to point to the data in the DrillBuf

Then:

bq. Even to a seasoned Java developer, the eval() method might look a bit 
strange because Drill generates the final code on the fly to fulfill a query 
request. This technique leverages Java’s just-in-time (JIT) compiler for 
maximum speed.

Better: Even to a seasoned Java developer, the eval() method might look a bit 
strange. It is best to think of the UDF declaration as a Domain-Specific 
Language (DSL) that Drill uses to describe the function. Drill uses the 
declaration to in-line your function into generated code. That is, Drill does 
not call your function code; instead Drill extracts the code and copies it into 
Drill's own generated code.

(Note: the bit about the JIT compiler is plain wrong. Drill's code generation 
has nothing to do with Java's JIT compiler.)

Basic Coding Rules

bq. To leverage Java’s just-in-time (JIT) compiler for maximum speed, you need 
to adhere to some basic rules.

Better: Drill's code generation mechanism supports a restricted subset of Java, 
meaning that you must adhere to some basic rules.

bq. Do not use imports. Instead, use the fully qualified class name as required 
by the Google Guava API packaged in Apache Drill and as shown in "Step 3: 
Declare input parameters".

(This mixes up a couple of ideas.) Better: Do not use imports. Instead, use the 
fully qualified class name.

bq. Manipulate the ValueHolders classes, for example VarCharHolder and 
IntHolder, as structs by calling helper methods, such as 
getStringFromVarCharHolder and toStringFromUTF8 as shown in "Step 5: Implement 
the eval() function".
bq. Do not call methods such as toString because this causes serious problems.

Better: Do not call any methods on the holder classes. The holders will be 
optimized away by Drill's scalar replacement mechanism.

Some additional restrictions:

* All class fields (member variables) must be preceded by one of the three 
annotations discussed above (\@Param, \@Output or \@Inject), or by the 
\@Workspace annotation, which identifies internal temporary fields. (If you omit 
the annotations, then queries that use your function will fail at runtime.)
* Do not use static fields (for example, to declare constants). If you must 
declare constants, declare them in a class other than the UDF class.
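
A rough sketch of the last two rules, written as plain Java so it can be run outside Drill. Both class names are hypothetical, and maskAll is only a stand-in for the body of an eval() method. In a real UDF the constant would be referenced by its fully qualified name (package included), since imports are not available.

```java
// Constants live in their own class, not in the UDF class.
class MaskConstantsSketch {
    static final String DEFAULT_MASK = "*";
}

// Stand-in for a UDF class: no imports, no static fields, no un-annotated instance fields.
class NoStaticFieldsSketch {
    // Stand-in for eval(): every library class is named by its fully qualified name.
    static String maskAll(String value) {
        java.lang.StringBuilder sb = new java.lang.StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            sb.append(MaskConstantsSketch.DEFAULT_MASK);
        }
        return sb.toString();
    }
}
```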

Prepare the Package

bq. Because Drill generates the source, ...

Better: Because Drill copies your code into its own generated code, ...

Basic Coding Rules
Build and Deploy the Function
Test the New Function

Each of the above three lines should probably be a heading; they currently 
appear as normal text.

bq. Add the JAR files to Drill, by copying them to the following location: 
<Drill installation directory>/jars/3rdparty

Perhaps add the following: Be sure to copy the jars into the above folder each 
time you rebuild, reinstall or upgrade Drill. If running in a cluster, copy the 
jars to the Drill installation on every node.

As an alternative, you can create a site directory as described (need link. Do 
we describe this anywhere except in the Drill-on-YARN PR?) Copy your files into 
the {{$DRILL_SITE/jars}} folder. This way, you need not remember to copy the 
jars each time you reinstall Drill.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
