waitingkuo commented on issue #5276:
URL: 
https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1431466976

   The code to reproduce the benchmark is here: 
https://github.com/ClickHouse/ClickBench/tree/main/datafusion
   To update the data on the ClickBench website, update the JSON 
file in the results folder and send a PR: 
https://github.com/ClickHouse/ClickBench/tree/main/datafusion/results
   
   The result currently listed on the website was benchmarked with datafusion 
v11, which was released around 6 months ago.
   
   To improve:
   
   1. Use datafusion v18 to rerun the benchmark code and send the PR.
   
   2. At the time I wrote the benchmark code, datafusion didn't support some 
features. E.g. it didn't support reading the schema from parquet, so for some 
data types like timestamp, we needed to load the column as a string and then 
cast it to timestamp explicitly. E.g.
    
   this is the original SQL query
   ```sql
   SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM ...
   ```
   
https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L41
   
   this is the modified version for datafusion
   ```sql
   SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM ...
   ```
   https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql#L41
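
To illustrate what the `::INT::DATE` chain in the modified query does, here is a minimal Python sketch. It assumes the raw `EventDate` value is a count of days since 1970-01-01, which is how Arrow's Date32 type interprets an integer. The comment above does not state the column's encoding, so this is an assumption. The helper names `days_to_date` and `cast_event_date` are hypothetical and are not part of datafusion or ClickBench.

```python
from datetime import date, timedelta

# Assumption: the integer is a number of days since the Unix epoch,
# matching the Arrow Date32 convention.
UNIX_EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Interpret an integer as days since 1970-01-01 (hypothetical helper)."""
    return UNIX_EPOCH + timedelta(days=days)


def cast_event_date(raw) -> date:
    """Mimic "EventDate"::INT::DATE: cast to int first, then to a date."""
    return days_to_date(int(raw))


if __name__ == "__main__":
    # The raw value may arrive as a string or an integer; both go through int().
    print(cast_event_date("15901"))
    print(cast_event_date(0))
```

If the new datafusion reads the date type directly from the parquet schema, the explicit cast chain should no longer be needed and the original query form could be used as is.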
   
   I modified some queries so that they work in datafusion. To fix this, we 
need to verify whether the original queries work in the new datafusion. If so, 
we can update the queries. If not, we can file an issue to improve it.
   
   I'll do the first option soon (update to v18); it should be a quick 
improvement.
   The second approach will take more time. Contributions are welcome. I'll 
get back to this if no one else is working on it.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
