kosiew commented on PR #17085:
URL: https://github.com/apache/datafusion/pull/17085#issuecomment-3389122318

   
   <html><head></head><body><h1>Why <code inline="">AccumulatorArgs</code> 
Differs from Other Function Args</h1>
   <p>Your review raises an excellent question about API consistency. The 
answer lies in <strong>when and how</strong> these functions resolve their 
input fields.</p>
   <hr>
   <h2>Scalar &amp; Window Functions: Pre-computed Fields</h2>
   <p>Both scalar and window functions receive <strong>pre-computed</strong> 
<code inline="">FieldRef</code>s at creation time:</p>
   <pre><code class="language-rust">// From create_udwf_window_expr 
(windows/mod.rs:178-180)
   let input_fields: Vec&lt;_&gt; = args
       .iter()
       .map(|arg| arg.return_field(input_schema)) // Computed once at planning
       .collect::&lt;Result&lt;_&gt;&gt;()?;
   </code></pre>
   <p>These are then passed to <code inline="">ExpressionArgs</code>, <code 
inline="">PartitionEvaluatorArgs</code>, or <code 
inline="">ScalarFunctionArgs</code>.<br>
   The schema itself is never exposed because <strong>the fields have already 
been computed</strong>.</p>
   <hr>
   <h2>Aggregate Functions: Runtime Schema Access</h2>
   <p>Aggregates, by contrast, receive the <strong>physical schema</strong> and 
compute fields <strong>on demand</strong>:</p>
   <pre><code class="language-rust">// From 
AggregateFunctionExpr::create_accumulator (aggregate.rs:429-432)
   let schema = self.args_schema();
   let acc_args = self.make_acc_args(schema.as_ref()); // Schema passed here
   self.fun.accumulator(acc_args)
   </code></pre>
   <hr>
   <h1>Why Can't Aggregates Use Pre-computed Fields?</h1>
   <p>The expressions need <strong>schema access at runtime</strong>. 
Consider:</p>
   <pre><code>SUM(a + b)
   </code></pre>
   <p>The expression <code inline="">a + b</code> references <strong>two 
columns</strong> from the physical schema, but produces <strong>one logical 
argument</strong> for the accumulator.<br>
   The <code inline="">PhysicalExpr</code> for <code inline="">a + b</code> 
needs to:</p>
   <ol>
   <li>
   <p>Resolve columns <code inline="">a</code> and <code inline="">b</code> 
from the physical schema</p>
   </li>
   <li>
   <p>Evaluate the addition</p>
   </li>
   <li>
   <p>Pass the result to the accumulator</p>
   </li>
   </ol>
   <p>If we only passed pre-computed fields (like scalar/window functions do), 
the <code inline="">PhysicalExpr</code> couldn’t resolve its column references 
— it needs the full <code inline="">Schema</code>.</p>
   <hr>
   <h3>From the patch (aggregate.rs:391-395):</h3>
   <blockquote>
   <p>Physical expressions may reference columns that are not one-to-one with 
the aggregate arguments (for example, <code inline="">SUM(a + b)</code> 
references both <code inline="">a</code> and <code inline="">b</code>). 
Replacing the schema with the synthesized argument fields would prevent those 
expressions from resolving their column references.</p>
   </blockquote>
   <hr>
   <h2>What the Patch Actually Changes</h2>
   <p>The patch <strong>doesn't change runtime behavior</strong> — it:</p>
   <ol>
   <li>
   <p><strong>Clarifies documentation:</strong> Makes explicit that <code 
inline="">schema</code> is the physical input schema</p>
   </li>
   <li>
   <p><strong>Adds convenience methods:</strong> <code 
inline="">input_field()</code> and <code inline="">input_fields()</code> wrap 
the common pattern of calling <code 
inline="">expr.return_field(&amp;schema)</code></p>
   </li>
   <li>
   <p><strong>Aligns ergonomics:</strong> Matches the convenience of <code 
inline="">ScalarFunctionArgs.arg_fields</code> and <code 
inline="">PartitionEvaluatorArgs.input_fields()</code>, without losing the 
schema access that aggregate expressions require</p>
   </li>
   </ol>
   <hr>
   <h1>Comparison Table</h1>
   
   Aspect | Scalar / Window | Accumulator (Before) | Accumulator (After)
   -- | -- | -- | --
   Fields | Pre-computed at planning | Computed on-demand | Computed on-demand
   Schema | Hidden (not needed) | Exposed (confusing) | Exposed (documented)
   Field access | Direct: args.arg_fields[i] | Manual: 
exprs[i].return_field(schema)? | Helper: input_field(i)?
   Why different? | Simple 1:1 args | Expressions need schema for column 
resolution | Same (clarified)
   
   
   <hr>
   <h2>Why This Fix Is Correct</h2>
   <p>The patch acknowledges that <code inline="">AccumulatorArgs</code> 
<strong>must remain unique</strong> because aggregates operate at the physical 
execution layer with complex expressions.<br>
   Rather than force-fitting them into the scalar/window model (which would 
break column resolution), it:</p>
   <ul>
   <li>
   <p>Preserves the necessary schema access</p>
   </li>
   <li>
   <p>Documents <strong>why</strong> it’s different</p>
   </li>
   <li>
   <p>Adds ergonomic helpers for the common case</p>
   </li>
   </ul>
   <hr>
   <p><strong>This is the right approach:</strong><br>
   Clarify intent and align ergonomics where possible, while preserving the 
functionality that makes aggregates work correctly.</p></body></html>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to