Paul Rogers created DRILL-7455: ---------------------------------- Summary: "Renaming" projection operator to avoid physical copies Key: DRILL-7455 URL: https://issues.apache.org/jira/browse/DRILL-7455 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers
Drill/Calcite inserts project operators for three main reasons: 1. To compute a new column: {{SELECT a + b AS c ...}} 2. To rename columns: {{SELECT a AS x ...}} 3. To remove columns: {{SELECT a ...} but a data source provides columns {{a}}, and {{b}}. Example of case 1: {code:json} "pop" : "project", "@id" : 4, "exprs" : [ { "ref" : "`a0`", "expr" : "`a`" }, { "ref" : "`b0`", "expr" : "`b`" } ], {code} Of these, only case 2 requires row-by-row computation of new values. Case 1 simply creates a new vector with only the name changed; but the same data. Case 3 preserves some vectors, drops others. In the cases 1 and 2, a simple data transfer from input to output would be adequate. Yet, if one steps through the code, and enables code generation, one will see that Drill steps through each record in all three cases, even calling an empty per-record compute block. A better-performance solution is to separate out the renames/drops (cases 1 and 3) from the column computations (case 2). This can be done either: 1. At plan time, identify that all columns are renames, and replace the row-by-row project with a column-level project. 2. At run time that identifies the column-level projections (cases 1 and 3) and handles those with transfer pairs, while doing row-by-row computes only if case 2 exists. Since row-by-row copies are among the most expensive operations in Drill, this optimization could improve performance by a decent amount. Note that a further optimization is to remove "trivial" projects such as the following: {code:json} "pop" : "project", "@id" : 2, "exprs" : [ { "ref" : "`a`", "expr" : "`a`" }, { "ref" : "`b`", "expr" : "`b`" }, { "ref" : "`b0`", "expr" : "`b0`" } ], {code} The only value of such a projection is to say, "remove all vectors except {{a}}, {{b}} and {{b0}}. In fact, the only time such a projection should be needed is: 1. On top of a data source that does not support projection push down. 2. When Calcite knows it wants to discard certain intermediate columns. Otherwise, Calcite knows which columns emerge from operator x, and should not need to add a project to enforce that schema if it is already what the project will emit. -- This message was sent by Atlassian Jira (v8.3.4#803005)