[GitHub] [arrow] westonpace commented on a diff in pull request #13150: ARROW-16549: [C++] Simplify AggregateNodeOptions aggregates/targets

GitBox Wed, 25 May 2022 15:59:16 -0700


westonpace commented on code in PR #13150:
URL: https://github.com/apache/arrow/pull/13150#discussion_r882163835



##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate {
 
   /// options for the aggregation function
   const FunctionOptions* options;

Review Comment:
   We should convert this to `std::unique_ptr<FunctionOptions>` but I don't 
mind if we leave that for a follow-up PR.



##########
r/R/query-engine.R:
##########
@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan",
             x
           })
         }
+        target_names <- names(.data$aggregations)
+        for (i in seq_len(length(target_names))) {
+          .data$aggregations[[i]][["name"]] <- 
.data$aggregations[[i]][["target"]] <- target_names[i]

Review Comment:
   Hmm...do we have any test case doing something like...
   ```
   mtcars %>% group_by(cyl)
       %>% summarise(
     disp_mean = mean(disp),
     hp_mean = mean(hp)
   )
   ```
   
   I would expect `target` to be `disp` and `name` to be `disp_mean`.  It's 
always possible we are renaming the columns somewhere else in the generated 
plan as well in which case `name` is meaningless here.



##########
cpp/src/arrow/compute/exec/test_util.cc:
##########
@@ -372,22 +374,12 @@ static inline void PrintToImpl(const std::string& 
factory_name,
     for (const auto& agg : o->aggregates) {
       *os << agg.function << "<";
       if (agg.options) PrintTo(*agg.options, os);
+      *os << agg.target.ToString() << "<";

Review Comment:
   I'm not sure this is correct.  Ideally we would have something like...
   
   ```
   aggregates=[{function=hash_sum<>,target=a,name=sum(a)}]
   ```
   
   I think the way it is now it would print:
   ```
   aggregates={hash_sum<a<sum_a<>,},
   ```



##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate {
 
   /// options for the aggregation function
   const FunctionOptions* options;
+
+  // fields to which aggregations will be applied
+  FieldRef target;
+
+  // output field name for aggregations

Review Comment:
   ```suggestion
     // output field name
   ```



##########
cpp/src/arrow/compute/kernels/hash_aggregate_test.cc:
##########
@@ -975,12 +964,12 @@ TEST(GroupBy, SumMeanProductDecimal) {
                                },
                                {table->GetColumnByName("key")},
                                {
-                                   {"hash_sum", nullptr},
-                                   {"hash_sum", nullptr},
-                                   {"hash_mean", nullptr},
-                                   {"hash_mean", nullptr},
-                                   {"hash_product", nullptr},
-                                   {"hash_product", nullptr},
+                                   {"hash_sum", nullptr, "agg_0", "hash_sum"},
+                                   {"hash_sum", nullptr, "agg_1", "hash_sum"},
+                                   {"hash_mean", nullptr, "agg_2", 
"hash_mean"},
+                                   {"hash_mean", nullptr, "agg_3", 
"hash_mean"},
+                                   {"hash_product", nullptr, "agg_4", 
"hash_product"},
+                                   {"hash_product", nullptr, "agg_5", 
"hash_product"},

Review Comment:
   I don't think we should have to make these changes and they do not make the 
test easier to read.  Could we change the definition of `GroupByTest` so that 
instead of taking in `const std::vector<internal::Aggregate>& aggregates` it 
takes in `const std::vector<TestAggregate>& aggregates` where we define:
   
   ```
   struct TestAggregate {
     std::string function;
     const FunctionOptions* options;
   };
   ```
   
   The old style makes sense for these unit tests since we don't really care 
about what we are naming the columns.



##########
r/src/compute-exec.cpp:
##########
@@ -226,31 +226,27 @@ std::shared_ptr<compute::ExecNode> ExecNode_Project(
 // [[arrow::export]]
 std::shared_ptr<compute::ExecNode> ExecNode_Aggregate(
     const std::shared_ptr<compute::ExecNode>& input, cpp11::list options,
-    std::vector<std::string> target_names, std::vector<std::string> 
out_field_names,
     std::vector<std::string> key_names) {
   std::vector<arrow::compute::internal::Aggregate> aggregates;
   std::vector<std::shared_ptr<arrow::compute::FunctionOptions>> keep_alives;
-
   for (cpp11::list name_opts : options) {
-    auto name = cpp11::as_cpp<std::string>(name_opts[0]);
-    auto opts = make_compute_options(name, name_opts[1]);
+    auto function = cpp11::as_cpp<std::string>(name_opts["fun"]);
+    auto opts = make_compute_options(function, name_opts["options"]);

Review Comment:
   It's a good thing this is probably always `nullptr` as this would be unsafe 
otherwise :).  Though, as I said above, this is a concern for another PR.



##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate {
 
   /// options for the aggregation function
   const FunctionOptions* options;
+
+  // fields to which aggregations will be applied

Review Comment:
   ```suggestion
     // field to which aggregations will be applied
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #13150: ARROW-16549: [C++] Simplify AggregateNodeOptions aggregates/targets

Reply via email to