waitingkuo commented on issue #5276: URL: https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1431466976
The code to reproduce the benchmark is here:
https://github.com/ClickHouse/ClickBench/tree/main/datafusion

To update the data on the ClickBench website, update the JSON file in the results folder and send a PR:
https://github.com/ClickHouse/ClickBench/tree/main/datafusion/results

The result currently listed on the website was benchmarked with DataFusion v11, which was released around 6 months ago.

To improve:

1. Use DataFusion v18 to rerun the benchmark code and send the PR.

2. At the time I wrote the benchmark code, DataFusion didn't support some features. For example, it didn't support taking the schema from Parquet, so for some data types such as timestamp, we need to load the column as a string and then cast it to timestamp explicitly.

   For example, this is the original SQL query:
   ```sql
   SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM ...
   ```
   https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L41

   This is the modified version for DataFusion:
   ```sql
   SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM
   ```
   https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql#L41

   I modified some queries so that they work in DataFusion. To fix this, we need to verify whether the original queries work in the new DataFusion. If so, we can update the queries. If not, we can file an issue to improve it.

I'll do the first option soon (update to v18); it should be a quick improvement. The second approach will take more time. Contributions are welcome. I'll get back to this if no one else is working on it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
