[ https://issues.apache.org/jira/browse/DRILL-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300789#comment-16300789 ]

ASF GitHub Bot commented on DRILL-6028:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1071
  
    Thanks much for the example files and the explanation of the need to hash.
    
    The improvements look good. I wonder, however, if the code gen approach is 
overkill. There is exactly one allowable hash method per type. (Has to be the 
same for all queries to get reliable results.)
    
    Here, we are generating code to do the work of:
    
    * Bind to all vectors.
    * Get a value out of the vector into a holder.
    * Pass the value to the proper hash function.
    * Combine the results.
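
    For illustration, a hedged sketch of what this per-query generated 
pattern boils down to (the class names, array-backed "vectors", and hash mix 
below are mine for illustration, not Drill's actual generated source or 
HashHelper):

```java
// Hedged sketch of the generated-code pattern: read each value into a
// holder, make a direct call to a type-specific hash function, and chain
// the result as the seed for the next key. Names and the hash mix are
// illustrative, not Drill's actual generated source.
class IntHolder {
    int value;
}

class GeneratedStyleHasher {
    private int[] col0;   // stands in for a bound value vector
    private int[] col1;

    // "Bind to all vectors" step.
    void doSetup(int[] col0, int[] col1) {
        this.col0 = col0;
        this.col1 = col1;
    }

    // One inlined read-into-holder plus direct hash call per column;
    // with N columns, N copies of this boilerplate are generated.
    int getHash(int rowIdx) {
        IntHolder seed = new IntHolder();               // seed starts at 0
        seed.value = hash32(col0[rowIdx], seed.value);  // key 0
        seed.value = hash32(col1[rowIdx], seed.value);  // key 1, chained
        return seed.value;
    }

    // Stand-in for a direct, type-specific hash function.
    static int hash32(int value, int seed) {
        int h = seed ^ value;
        h *= 0x9E3779B9;     // simple multiplicative mix
        return h ^ (h >>> 16);
    }
}
```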
    
    The result is a huge amount of code to generate. The gist of this bug is 
that, when the number of columns becomes large, we generate so much code that 
we have to take extra steps to manage it. And, of course, compiling, caching 
and loading the code takes time.
    
    As something to think about for the future, this entire mechanism can be 
replaced with a much simpler one:
    
    * Add a `hash(seed)` method to each value vector.
    * Here, iterate over the list of vectors.
    * Call the `hash()` method on each vector.
    * Combine the results.
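
    A minimal sketch of that interpreted loop, assuming a hypothetical 
`hash(rowIdx, seed)` method on each vector (Drill's ValueVector does not 
expose such a method today):

```java
// Hedged sketch of the proposed "interpreted" alternative: one virtual
// call per key vector, no generated code. The HashableVector interface
// is hypothetical, standing in for a hash(seed) method on ValueVector.
interface HashableVector {
    // Hash the value at rowIdx, mixed with the given seed.
    int hash(int rowIdx, int seed);
}

class InterpretedRowHasher {
    private final HashableVector[] keyVectors;

    InterpretedRowHasher(HashableVector[] keyVectors) {
        this.keyVectors = keyVectors;
    }

    // Iterate over the vectors, chaining each hash as the next seed.
    int hashRow(int rowIdx) {
        int seed = 0;
        for (HashableVector v : keyVectors) {
            seed = v.hash(rowIdx, seed);
        }
        return seed;
    }
}
```

    Setup cost is zero and code size is independent of the number of key 
columns; the open question is whether the JIT handles the per-vector virtual 
calls well enough to beat the generated direct calls.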
    
    Tradeoffs?
    
    * The proposed method has no setup cost. It is, instead, an "interpreted" 
approach. The old method has a large setup cost.
    * The proposed method must make a "virtual" call into each vector to get 
the value and hash it using the pre-coded, type-specific function. The old 
method makes a direct call to get the value in a holder, then a direct call to 
the hash method.
    * The proposed method is insensitive to the number of columns (other than 
that the column loop runs more iterations). The old method needs special 
care to handle the extra code.
    
    It would be easy to test which approach is more efficient: 
(large code generation + direct method calls) vs. (no code generation + 
virtual method calls). My money is on the new method as it eliminates the 
holders, the shuffling of values between variables, and so on. The JIT can 
optimize the "pre-coded" methods once and for all early in the Drillbit run 
rather than having to re-optimize the (huge) generated methods per query.
    
    The improvement is not necessary for this PR, but is something to think 
about. @Ben-Zvi may need something similar for the hash join to avoid 
generating query-specific key hashes. In fact, since hashing is used in many 
places (exchanges, hash agg, etc.), we might get quite a nice savings in time 
and code complexity by slowly moving to the proposed model.


> Allow splitting generated code in ChainedHashTable into blocks to avoid "code 
> too large" error
> ----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-6028
>                 URL: https://issues.apache.org/jira/browse/DRILL-6028
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: 1.13.0
>
>         Attachments: HashTableGen5_for_1200_columns_AFTER.java, 
> HashTableGen5_for_1200_columns_BEFORE.java, 
> HashTableGen5_for_40_columns_AFTER.java, 
> HashTableGen5_for_40_columns_BEFORE.java
>
>
> Allow splitting generated code in ChainedHashTable into blocks to avoid "code 
> too large" error.
> *REPRODUCE*
> File {{1200_columns.csv}}
> {noformat}
> 0,1,2,3...1200
> 0,1,2,3...1200
> {noformat}
> Query
> {noformat}
> select columns[0], columns[1]...columns[1200] from dfs.`1200_columns.csv`
> union
> select columns[0], columns[1]...columns[1200] from dfs.`1200_columns.csv`
> {noformat}
> Error
> {noformat}
> Error: SYSTEM ERROR: CompileException: File 
> 'org.apache.drill.exec.compile.DrillJavaFileObject[HashTableGen10.java]', 
> Line -7886, Column 24: HashTableGen10.java:57650: error: code too large
>         public boolean isKeyMatchInternalBuild(int incomingRowIdx, int 
> htRowIdx)
>                        ^ (compiler.err.limit.code)
> {noformat}
> *ROOT CAUSE*
> DRILL-4715 added the ability to ensure that method size won't go beyond the 
> 64k limit imposed by the JVM. {{BlkCreateMode.TRUE_IF_BOUND}} was added to 
> create a new block only if the number of expressions added hits the upper 
> bound defined by {{exec.java.compiler.exp_in_method_size}}. Once the number 
> of expressions in a method hits the upper bound, we create an inner method 
> and call it from the outer one.
> Example: 
> {noformat}
> public void doSetup(RecordBatch incomingBuild, RecordBatch incomingProbe) 
> throws SchemaChangeException {
> // some logic
> return doSetup0(incomingBuild, incomingProbe);
> }
> {noformat}
> During code generation {{ChainedHashTable}} added all code in its methods in 
> one block (using {{BlkCreateMode.FALSE}}) since the {{getHashBuild}} and 
> {{getHashProbe}} methods contained state and thus could not be split. In 
> these methods a hash was generated for each key expression: the first key 
> was hashed with seed 0, and each subsequent key was hashed using the 
> previous key's hash as its seed.
> To allow splitting these methods, the following was done:
> 1. Method signatures were changed to add a new parameter, {{seedValue}}. 
> Initially the starting seed value was hard-coded during code generation (set 
> to 0); now it is passed as a method parameter.
> 2. Initially the hash function calls for all keys were combined into one 
> logical expression, which did not allow splitting. Now we create a logical 
> expression for each key, so splitting is possible. The new {{seedValue}} 
> parameter is used as a seed holder to pass the seed value to the next key.
> 3. {{ParameterExpression}} was added to generate a reference to a method 
> parameter during code generation.
> Code example:
> {noformat}
>     public int getHashBuild(int incomingRowIdx, int seedValue)
>         throws SchemaChangeException
>     {
>         {
>             NullableVarCharHolder out3 = new NullableVarCharHolder();
>             {
>                 out3 .isSet = vv0 .getAccessor().isSet((incomingRowIdx));
>                 if (out3 .isSet == 1) {
>                     out3 .buffer = vv0 .getBuffer();
>                     long startEnd = vv0 .getAccessor().getStartEnd((incomingRowIdx));
>                     out3 .start = ((int) startEnd);
>                     out3 .end = ((int)(startEnd >> 32));
>                 }
>             }
>             IntHolder seedValue4 = new IntHolder();
>             seedValue4 .value = seedValue;
>             //---- start of eval portion of hash32 function. ----//
>             IntHolder out5 = new IntHolder();
>             {
>                 final IntHolder out = new IntHolder();
>                 NullableVarCharHolder in = out3;
>                 IntHolder seed = seedValue4;
>                 Hash32FunctionsWithSeed$NullableVarCharHash_eval: {
>                     if (in.isSet == 0) {
>                         out.value = seed.value;
>                     } else {
>                         out.value = org.apache.drill.exec.expr.fn.impl.HashHelper.hash32(in.start, in.end, in.buffer, seed.value);
>                     }
>                 }
>                 out5 = out;
>             }
>             //---- end of eval portion of hash32 function. ----//
>             seedValue = out5 .value;
>             return getHashBuild0((incomingRowIdx), (seedValue));
>         }
>     }
> {noformat}
> Examples of code generation:
> {{HashTableGen5_for_40_columns_BEFORE.java}} - code compiles
> {{HashTableGen5_for_40_columns_AFTER.java}} - code compiles
> {{HashTableGen5_for_1200_columns_BEFORE.java}} - error during compilation, 
> method too large
> {{HashTableGen5_for_1200_columns_AFTER.java}} - code compiles since methods 
> were split



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
