SemyonSinchenko opened a new issue, #503:
URL: https://github.com/apache/datafusion-comet/issues/503

   ### Describe the bug
   
   I'm running a query that does the following:
   1. Read parquet files
   2. Generate a lot of case-when columns
   3. Run groupBy + agg on top of those columns
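
A minimal sketch of how the expressions in steps 2 and 3 could be generated, written as plain SQL strings so it runs without Spark. The column names (`t_minus`, `card_type`, `trx_type`) and the `_flag`/`_count` naming follow the plans in this report; the extra category values (`CC`, `travel`) and the 14-day window are hypothetical placeholders, not the benchmark's real lists.

```python
from itertools import product

# Hypothetical value lists; the real benchmark defines its own categories.
CARD_TYPES = ["DC", "CC"]
TRX_TYPES = ["food-and-household", "travel"]
WINDOWS_DAYS = [7, 14]


def flag_expr(card_type: str, trx_type: str, days: int) -> str:
    """Build a CASE WHEN expression that yields a 0/1 flag column."""
    name = f"{card_type}_{trx_type}_{days}d_flag"
    return (
        f"CASE WHEN t_minus <= {days} AND card_type = '{card_type}' "
        f"AND trx_type = '{trx_type}' THEN 1 ELSE 0 END AS `{name}`"
    )


def sum_expr(card_type: str, trx_type: str, days: int) -> str:
    """Build the aggregate expression that sums one flag column."""
    flag = f"{card_type}_{trx_type}_{days}d_flag"
    count = f"{card_type}_{trx_type}_{days}d_count"
    return f"sum(`{flag}`) AS `{count}`"


# One flag column and one aggregate per (card type, trx type, window) combination.
combos = list(product(CARD_TYPES, TRX_TYPES, WINDOWS_DAYS))
flag_exprs = [flag_expr(*c) for c in combos]
sum_exprs = [sum_expr(*c) for c in combos]

print(flag_exprs[0])
print(sum_exprs[0])
```

With a DataFrame `df` read from the parquet files, these strings could presumably be applied as `df.selectExpr("customer_id", *flag_exprs)` followed by `groupBy("customer_id")` and an `agg` over the parsed `sum_exprs`; the linked benchmark script is the authoritative version.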
   
   I have the following logical plan (I manually truncated some parts):
   ```
   == Parsed Logical Plan ==
   'Aggregate ['customer_id], ['customer_id, sum('DC_food-and-household_7d_flag) AS DC_food-and-household_7d_count#18, ..., ..., ... 2057 more fields]
   +- Project [customer_id AS customer_id#5422, CASE WHEN (((true AND (t_minus#5L <= cast(7 as bigint))) AND (card_type#1 = DC)) AND (trx_type#2 = food-and-household)) THEN 1 ELSE 0 END AS DC_food-and-household_7d_flag#14, ..., ... 1225 more fields]
      +- Relation [customer_id#0L,card_type#1,trx_type#2,channel#3,trx_amnt#4,t_minus#5L,part_col#6] parquet
   ```
   
   It is converted to the following Comet plan:
   ```
   == Physical Plan ==
   HashAggregate(keys=[customer_id#10003], functions=[sum(DC_food-and-household_7d_flag#14), ..., ... 2057 more fields])
   +- Exchange hashpartitioning(customer_id#10003, 11), ENSURE_REQUIREMENTS, [plan_id=35]
      +- ColumnarToRow
         +- CometHashAggregate [DC_food-and-household_7d_flag#14, DC_food-and-household_7d_or_none#15, ..., ... 2056 more fields]
            +- CometProject [DC_food-and-household_7d_flag#14, DC_food-and-household_7d_or_none#15, ..., ... 1224 more fields], [CASE WHEN (((t_minus#5L <= 7) AND (card_type#1 = DC)) AND (trx_type#2 = food-and-household)) THEN 1 ELSE 0 END AS DC_food-and-household_7d_flag#14, ..., ... 1224 more fields]
               +- CometScan parquet [...] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...>
   ```
   
   Visualization:
   
![image](https://github.com/apache/datafusion-comet/assets/29755009/fbf1c155-9094-430e-bffa-3852bb66aa4f)
   
   
   ### Steps to reproduce
   
   I'm running my own benchmark:
   1. [Generation of the dataset (link to GitHub)](https://github.com/SemyonSinchenko/feature-generation-benchmark?tab=readme-ov-file#generate-datasets): `generator --prefix test_data_tiny`;
   2. [PySpark code (link to GitHub)](https://github.com/SemyonSinchenko/feature-generation-benchmark/blob/main/impl/pyspark-comet-case-when.py);
   3. [Entry point (link to GitHub)](https://github.com/SemyonSinchenko/feature-generation-benchmark/blob/main/run_comet.sh)
   
   ### Expected behavior
   
   I expected to see a fully native plan, but for some reason the final `HashAggregate` runs on Spark. It even looks like it runs in Spark's interpreted mode (my guess is that I request too many aggregations and the generated code exceeds the size limit for whole-stage codegen, but I'm not 100% sure).
   
   I checked the Comet documentation, and it looks like `case-when` expressions and the `sum`/`min`/`max`/`mean` aggregate expressions are supported. `HashAggregate` is supported too. `Exchange` should also be supported, because I turned on Comet shuffle (`--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager`, `--conf spark.comet.exec.shuffle.enabled=true`, `--conf spark.comet.exec.shuffle.mode=native`).
   
   Why does the final aggregation fall back to Spark, with a `ColumnarToRow` in between, when the partial aggregation runs in Comet?
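
To make the fallback easier to spot in large plans, the explain output can be scanned for operators whose names do not start with `Comet`. The sketch below works on an abbreviated copy of the physical plan from this report (field lists replaced with `[...]`). It assumes a linear plan with one operator per line; hard-wrapped continuation lines or branching plans would need extra handling. The helper names are illustrative only.

```python
import re

# Abbreviated copy of the physical plan from this report (field lists elided).
PLAN = """\
HashAggregate(keys=[customer_id#10003], functions=[...])
+- Exchange hashpartitioning(customer_id#10003, 11), ENSURE_REQUIREMENTS, [plan_id=35]
   +- ColumnarToRow
      +- CometHashAggregate [...]
         +- CometProject [...]
            +- CometScan parquet [...]
"""

# Optional indentation, optional "+- " tree marker, then the operator name.
OPERATOR = re.compile(r"^\s*(?:\+- )?([A-Za-z]+)")


def operators(plan_text: str) -> list:
    """Return the operator name found at the start of each plan line."""
    names = []
    for line in plan_text.splitlines():
        match = OPERATOR.match(line)
        if match:
            names.append(match.group(1))
    return names


def non_comet(plan_text: str) -> list:
    """Return the operators that are not Comet operators."""
    return [op for op in operators(plan_text) if not op.startswith("Comet")]


print(non_comet(PLAN))  # ['HashAggregate', 'Exchange', 'ColumnarToRow']
```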
   
   ### Additional context
   
   I'm happy to provide any additional information or run any debug queries.
   
   Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

