[I] Improve performance on ClickBench [datafusion-comet]

via GitHub Thu, 17 Jul 2025 21:37:33 -0700


Iskander14yo opened a new issue, #2035:
URL: https://github.com/apache/datafusion-comet/issues/2035


   Hi!
   
   Just made a [PR](https://github.com/ClickHouse/ClickBench/pull/557) to add 
Comet to [ClickBench](https://benchmark.clickhouse.com/) - one of the popular 
benchmarks for analytical workloads. I've decided to create an issue similar to 
#391. You may close it if you find it irrelevant.
   
   I'd appreciate feedback on whether my **configuration and setup are 
correct**. I consider this important because Comet _failed_ on one query and 
showed a few curious behaviors that I outline below. Perhaps these (and other 
hidden issues) could be fixed with proper configuration.
   
   My notes:
   - Predictably, Comet doesn't support some expressions. Here's what I got 
from the logs:
   ```
   >>> grep -P "\[COMET:" log.txt | sed -e 's/^[ \t]*//' | sort | uniq -c
   
        78 +-  GlobalLimit [COMET: GlobalLimit is not supported]
        18 +-  HashAggregate [COMET: Unsupported aggregation mode PartialMerge]
       123 +-  HashAggregate [COMET: distinct aggregates are not supported]
        51 +-  Project [COMET: Unsupported cast from LongType to TimestampType 
with timezone Some(...) and evalMode LEGACY]
       126 +-  SortAggregate [COMET: SortAggregate is not supported]
        43 Execute CreateViewCommand [COMET: Execute CreateViewCommand is not 
supported]
       135 TakeOrderedAndProject [COMET: ]
   ```
   The `Unsupported cast from LongType to TimestampType...` fallback is 
similar to #44, but in this case a different column is involved (`EventTime` 
instead of `EventDate`). See also [this 
issue](https://github.com/ClickHouse/ClickBench/issues/7) for additional 
info.
   - Spark's local mode was used. The docs suggest using standalone mode 
for EC2, but I didn't want to spend extra resources on a separate driver. I 
looked at the Spark UI, and Comet seems to work fine.
   - Comet's cold runs are significantly slower than its hot runs, even 
compared to Spark.
   - As I already mentioned, Comet failed on one query:
   ```sql
   SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN 
(SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL 
AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= 
'2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY 
TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC 
LIMIT 10 OFFSET 1000;
   ```
   with the error:
   ```
   QueryPlanSerde: Comet native execution is disabled due to: unsupported Spark 
partitioning: ArrayBuffer(PageViews#1143L DESC NULLS LAST)
   
   Caused by: org.apache.comet.CometNativeException: InternalError: Native cast 
invoked for unsupported cast from Utf8 to Dictionary(Int32, Utf8).
   ```
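
For anyone who wants to reproduce the fallback counts without the `grep | sed | sort | uniq -c` pipeline, here is a minimal Python sketch that does roughly the same thing. The function name `count_fallbacks` and the sample lines are only illustrative (the sample mimics the log format shown above and is not real output), and it assumes each plan line appears on a single line in the log.

```python
import re
from collections import Counter

# Any line carrying a "[COMET: ...]" annotation counts as a fallback note.
COMET_NOTE = re.compile(r"\[COMET:")


def count_fallbacks(lines):
    """Count identical annotated plan lines, ignoring leading indentation."""
    counts = Counter()
    for line in lines:
        if COMET_NOTE.search(line):
            # Mirrors the sed step: drop leading spaces/tabs (and the newline).
            counts[line.rstrip("\n").lstrip(" \t")] += 1
    return counts


# Hypothetical sample input; in practice, pass an open log file instead.
sample = [
    "   +-  GlobalLimit [COMET: GlobalLimit is not supported]",
    "+-  GlobalLimit [COMET: GlobalLimit is not supported]",
    "      +-  SortAggregate [COMET: SortAggregate is not supported]",
    "== Physical Plan ==",
]

for text, n in sorted(count_fallbacks(sample).items()):
    print(f"{n:7d} {text}")
```

To run it on a real log, something like `count_fallbacks(open("log.txt"))` should work, since iterating over a file yields its lines.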
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

