[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

bdrillard Tue, 17 Oct 2017 11:20:26 -0700

GitHub user bdrillard opened a pull request:

    https://github.com/apache/spark/pull/19518


    [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State 
Compaction

    ## What changes were proposed in this pull request?
    
    This PR is the part-two follow-up to #18075, meant to address 
[SPARK-18016](https://issues.apache.org/jira/browse/SPARK-18016), Constant Pool 
limit exceptions. Part 1 implemented `NestedClass` code splitting, in which 
excess code was split off into nested private sub-classes of the `OuterClass`. 
In Part 2 we address excess mutable state, in which the number of inlined 
variables declared at the top of the `OuterClass` can also exceed the constant 
pool limit. 
    
    Here, we modify the `addMutableState` function in the `CodeGenerator` to 
check whether the declared state can be easily compacted into an array and 
initialized in loops, rather than inlined and initialized with its own line 
of code. We identify four types of state that can be compacted:
    
    * Primitive state (ints, booleans, etc.)
    * Object state of like-type without any initial assignment
    * Object state of like-type initialized to `null`
    * Object state of like-type initialized to the type's base (no-argument) 
constructor
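    
    The four categories above can be sketched as a single predicate. This is 
a minimal illustration, not the check implemented in this PR: the class name, 
the method signature, and the exact-string matching of initialization code are 
all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: the class, method, and parameter names are
// hypothetical and are not taken from Spark's CodeGenerator.
final class CompactionCheckSketch {
    private static final Set<String> PRIMITIVE_TYPES = new HashSet<>(Arrays.asList(
        "boolean", "byte", "short", "char", "int", "long", "float", "double"));

    /**
     * Decides whether a piece of mutable state falls into one of the four
     * compactable categories.
     *
     * @param javaType     declared Java type of the state, e.g. "int"
     * @param variableName name the initialization code assigns to
     * @param initCode     initialization snippet, possibly empty
     */
    static boolean canCompact(String javaType, String variableName, String initCode) {
        String init = initCode.trim();
        // 1. Primitive state. Assumption: every primitive qualifies, whatever
        //    its initializer; the PR text does not spell out this detail.
        if (PRIMITIVE_TYPES.contains(javaType)) {
            return true;
        }
        // 2. Object state without any initial assignment.
        if (init.isEmpty()) {
            return true;
        }
        // 3. Object state initialized to null.
        if (init.equals(variableName + " = null;")) {
            return true;
        }
        // 4. Object state initialized with the type's no-argument constructor.
        return init.equals(variableName + " = new " + javaType + "();");
    }
}
```

    Under these assumptions, state initialized with a constructor that takes 
arguments is not compactable and would stay inlined.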
    
    With mutable state compaction, at the top of the class we generate array 
declarations like:
    
    ```
    private Object[] references;
    private UnsafeRow result;
    private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
      ...
    private boolean[] mutableStateArray1 = new boolean[12507];
    private InternalRow[] mutableStateArray4 = new InternalRow[5268];
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] mutableStateArray5 =
        new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[7663];
    private java.lang.String[] mutableStateArray2 = new java.lang.String[12477];
    private int[] mutableStateArray = new int[42509];
    private java.lang.Object[] mutableStateArray6 = new java.lang.Object[30];
    private boolean[] mutableStateArray3 = new boolean[10536];
    ```
    
    and these arrays are initialized in loops as:
    
    ```
    private void init_3485() {
        for (int i = 0; i < mutableStateArray3.length; i++) {
            mutableStateArray3[i] = false;
        }
    }
    ```
    
    For compacted mutable state, `addMutableState` returns an array accessor 
value, which is then referenced in the subsequent generated code.
    
    **Note**: some state cannot be easily compacted (except perhaps with 
deeper changes to code generation), as some state value names are taken for 
granted at the global level during code generation (see `CatalystToExternalMap` 
in `Objects` as an example). For this state, we provide an `inline` hint to the 
function call, which indicates that the state should be inlined to the 
`OuterClass`. Still, the state we can easily compact manages to reduce the 
Constant Pool to a tractable size for the wide/deeply nested schemas I was 
able to test against.
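    
    The accessor and `inline` behavior described above can be sketched 
roughly as follows. This is an assumption-laden illustration rather than the 
code in this PR: `MutableStateSketch`, its method signatures, and the 
one-array-per-type naming scheme are hypothetical simplifications (the 
generated code shown earlier can use more than one array per type).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical and do
// not mirror the signatures in Spark's CodeGenerator.
final class MutableStateSketch {
    // Java type -> name of the array holding compacted state of that type.
    private final Map<String, String> arrayNames = new LinkedHashMap<>();
    // Java type -> number of array slots handed out so far.
    private final Map<String, Integer> slotCounts = new LinkedHashMap<>();
    // Declarations for state that must keep its own field name.
    private final List<String> inlinedDeclarations = new ArrayList<>();

    /** Returns the expression that later generated code should reference. */
    String addMutableState(String javaType, String variableName, boolean inline) {
        if (inline) {
            // The caller relies on this exact name, so declare a plain field.
            inlinedDeclarations.add("private " + javaType + " " + variableName + ";");
            return variableName;
        }
        String arrayName = arrayNames.get(javaType);
        if (arrayName == null) {
            // Simplified naming: one array per type, numbered in creation order.
            arrayName = "mutableStateArray" + arrayNames.size();
            arrayNames.put(javaType, arrayName);
        }
        // merge returns the updated count, so subtract one for a zero-based slot.
        int slot = slotCounts.merge(javaType, 1, Integer::sum) - 1;
        return arrayName + "[" + slot + "]";
    }

    /** Field declarations to emit at the top of the generated class. */
    List<String> declarations() {
        List<String> out = new ArrayList<>(inlinedDeclarations);
        for (Map.Entry<String, String> entry : arrayNames.entrySet()) {
            String type = entry.getKey();
            out.add("private " + type + "[] " + entry.getValue()
                + " = new " + type + "[" + slotCounts.get(type) + "];");
        }
        return out;
    }
}
```

    In this sketch, two `int` states added without the hint come back as 
`mutableStateArray0[0]` and `mutableStateArray0[1]`, while state added with 
`inline` set to `true` keeps its own name and its own field declaration.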
    
    ## How was this patch tested?
    
    Tested against several complex schema types. Also added a test case that 
generates 40,000 string columns and creates the `UnsafeProjection`. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdrillard/spark state_compaction

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19518
    
----
commit 081bc5de6ee55e00ff58c4abddc347f77c29d4aa
Author: ALeksander Eskilson <alek.eskil...@cerner.com>
Date:   2017-10-17T14:06:12Z

    adding state compaction

commit e7046c3d3bb528f18b3183d81e8bc26720a8baf7
Author: ALeksander Eskilson <alek.eskil...@cerner.com>
Date:   2017-10-17T16:54:54Z

    adding inline changes

----


