SemyonSinchenko opened a new issue, #503: URL: https://github.com/apache/datafusion-comet/issues/503
### Describe the bug

I'm running a query that does the following:

1. Read parquet files
2. Generate a lot of case-when columns
3. Run groupBy + agg on top of those columns

I have the following logical plan (I manually truncated some parts):

```
== Parsed Logical Plan ==
'Aggregate ['customer_id], ['customer_id, sum('DC_food-and-household_7d_flag) AS DC_food-and-household_7d_count#18, ..., ..., ... 2057 more fields]
+- Project [customer_id AS customer_id#5422, CASE WHEN (((true AND (t_minus#5L <= cast(7 as bigint))) AND (card_type#1 = DC)) AND (trx_type#2 = food-and-household)) THEN 1 ELSE 0 END AS DC_food-and-household_7d_flag#14, ..., ... 1225 more fields]
   +- Relation [customer_id#0L,card_type#1,trx_type#2,channel#3,trx_amnt#4,t_minus#5L,part_col#6] parquet
```

It is converted to the following Comet plan:

```
== Physical Plan ==
HashAggregate(keys=[customer_id#10003], functions=[sum(DC_food-and-household_7d_flag#14), ..., ... 2057 more fields])
+- Exchange hashpartitioning(customer_id#10003, 11), ENSURE_REQUIREMENTS, [plan_id=35]
   +- ColumnarToRow
      +- CometHashAggregate [DC_food-and-household_7d_flag#14, DC_food-and-household_7d_or_none#15, ..., ... 2056 more fields]
         +- CometProject [DC_food-and-household_7d_flag#14, DC_food-and-household_7d_or_none#15, ..., ... 1224 more fields], [CASE WHEN (((t_minus#5L <= 7) AND (card_type#1 = DC)) AND (trx_type#2 = food-and-household)) THEN 1 ELSE 0 END AS DC_food-and-household_7d_flag#14, ..., ... 1224 more fields]
            +- CometScan parquet [...] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...>
```

Visualization: ![image](https://github.com/apache/datafusion-comet/assets/40116766/e4ee47d3-3167-4bf4-ae67-6c1f9e978b33)

### Steps to reproduce

I'm running my own benchmark:

1. [Generation of the dataset (link to github)](https://github.com/SemyonSinchenko/feature-generation-benchmark?tab=readme-ov-file#generate-datasets): `generator --prefix test_data_tiny`
2. [PySpark code (link to github)](https://github.com/SemyonSinchenko/feature-generation-benchmark/blob/main/impl/pyspark-comet-case-when.py)
3. [Entry point (link to github)](https://github.com/SemyonSinchenko/feature-generation-benchmark/blob/main/run_comet.sh)

### Expected behavior

I expected to see a fully native plan, but for some reason the last `HashAggregate` runs on Spark. It looks to me like it even runs in "Spark interpreter mode" (I guess because I request so many aggregations that the generated code exceeds the size limit for whole-stage codegen, but I'm not 100% sure). I checked the documentation of the Comet project, and it looks like `case-when` expressions and the `sum`/`min`/`max`/`mean` aggregate expressions are supported. `HashAggregate` is supported too. `Exchange` should also be supported, because I turned on Comet shuffle (`--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager`, `--conf spark.comet.exec.shuffle.enabled=true`, `--conf spark.comet.exec.shuffle.mode=native`). Why, if the partial aggregation runs in Comet, does the final one not, leaving a `ColumnarToRow` instead?

### Additional context

I'm ready to provide any additional information or to run any debug query. Thanks in advance!
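For context, the shape of the workload can be sketched in plain Python: each feature is a `CASE WHEN` flag over a (card_type, trx_type, window) combination, and the query then sums those flags per customer. The column and value names below mirror the plan above, but the helper itself is a hypothetical illustration of the pattern, not the actual benchmark code.

```python
from itertools import product

def case_when_flags(card_types, trx_types, windows):
    """Build CASE WHEN SQL snippets like those in the plan above,
    one flag column per (card_type, trx_type, window) combination."""
    exprs = []
    for ct, tt, w in product(card_types, trx_types, windows):
        name = f"{ct}_{tt}_{w}d_flag"
        exprs.append(
            f"CASE WHEN t_minus <= {w} AND card_type = '{ct}' "
            f"AND trx_type = '{tt}' THEN 1 ELSE 0 END AS `{name}`"
        )
    return exprs

# 2 card types x 3 trx types x 4 windows -> 24 flag columns.
# The real benchmark generates thousands of such columns, which is
# what can push the final aggregate past whole-stage codegen limits.
flags = case_when_flags(
    ["DC", "CC"],
    ["food-and-household", "travel", "fuel"],
    [7, 30, 90, 180],
)
```

In PySpark these snippets would be passed through `F.expr(...)` inside `select`, followed by `groupBy("customer_id").agg(...)` with one `sum` per flag.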
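For reference, the Comet settings quoted above can be assembled into a single `spark-submit` invocation. This is a sketch: the jar path and version are placeholders, the script path is taken from the repo linked above, and only flags mentioned in this issue plus the basic Comet enablement flags are included.

```shell
spark-submit \
  --jars /path/to/comet-spark-jar-<version>.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=native \
  impl/pyspark-comet-case-when.py
```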