Jorge created ARROW-9516:
----------------------------

             Summary: [Rust][DataFusion] Refactor physical expressions to not 
care about their names nor indexes
                 Key: ARROW-9516
                 URL: https://issues.apache.org/jira/browse/ARROW-9516
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust - DataFusion
            Reporter: Jorge


This issue covers three main topics that IMO are addressed as a whole in a 
refactor of the physical plans and expressions in data fusion. The underlying 
issues that justify this particular ticket:
h3. We currently assign poor names to the output schema.

Specifically, most names are given based on the last expression's name. 
Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the fields 
names "c, SUM, SUM".
h3. We currently derive the column names from physical expressions, not logical 
expressions

This implies that logical expressions that perform multiple operations (e.g. an 
grouped aggregation that performs partitioned aggregations + merge + final 
aggregation) have their name derived from their physical declaration, not 
logical. IMO a physical plan is an execution plan and is thus not concerned 
with naming. It is the logical plan that should be concerned with naming. 
Conceptually, a given logical plan can have more than one physical plan, e.g. 
depending on the execution environment (e.g. locally vs distributed).
h3. We currently carry the index of a column read throughout the plans, making 
it cumbersome to write optimizers.

More details here. In summary, it is possible to remove one of the optimizers 
and significantly simplify the other if columns do not carry indexing 
information.
h2. Proposal

I propose that we:
h3. drop {{physical_plan::expressions::Column::index}}

This is a major simplification of the code, and allow us to just ignore the 
position of the statement on the schema, and instead focus on its name. This is 
overall a simplification because it allow us to treat columns based solely on 
their names, and not on their position in the schema. Since SQL does not care 
about the position of the column on the table anyway (we currently already take 
the first column with that name), this seems natural.

I already prototyped this 
[here|https://github.com/jorgecarleitao/arrow/tree/column_names].

The main conclusion of this prototype is that this feasible as long as all our 
expressions get assigned a unique name, which is against what we currently 
offer (see example above). This leads me to:
h3. drop {{physical_plan::PhysicalExpr::name()}}

Currently, the name of an expression is derived from its physical plan. 
However, some operations' names are required to be known before its physical 
representation. The example I found in our current code is the grouped 
aggregation described above. If we were to build the name of our aggregation 
based on its physical plan, the name of a "COUNT(a)" operation would be 
{{SUM(COUNT(a))}} because, in the physical plan we first count on each 
partition, then merge, and them sum the counts over all partitions.

Fundamentally, IMO the issue here is that we are mixing responsibilities: the 
physical plan should not care about naming, because the physical plan 
corresponds to an execution plan, not a logical description of the column (its 
name). This leads me to:
h3. add {{logicalplan::Expr::name()}}

This will contain the name of this expression, that will naturally depend on 
the variation. Its implementation will be based on our current code for 
{{physical_plan::PhysicalExpr::name()}}.

I can take this work, but before committing, would like to know your thoughts 
about this. My initial prototyping indicate that all of this is possible and 
greatly simplifies the code, but I may be missing a design aspect of this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to