alamb commented on code in PR #19881:
URL: https://github.com/apache/datafusion/pull/19881#discussion_r2725539871
##########
datafusion/sqllogictest/test_files/clickbench.slt:
##########
@@ -64,10 +82,10 @@ SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
----
1
-query II
+query DD
SELECT MIN("EventDate"), MAX("EventDate") FROM hits;
----
-15901 15901
+2013-07-15 2013-07-15
Review Comment:
❤️
##########
benchmarks/src/clickbench.rs:
##########
@@ -29,6 +29,16 @@ use datafusion::{
use datafusion_common::exec_datafusion_err;
use datafusion_common::instant::Instant;
+/// SQL to create the hits view with proper EventDate casting.
+///
+/// ClickBench stores EventDate as UInt16 (days since 1970-01-01) for
+/// storage efficiency (2 bytes vs 4-8 bytes for date types).
+/// This view transforms it to SQL DATE type for query compatibility.
+const HITS_VIEW_DDL: &str = r#"CREATE VIEW hits AS
Review Comment:
This only changes the DataFusion benchmark runner -- it won't change the
actual clickbench results which are created with the scripts here:
https://github.com/ClickHouse/ClickBench/tree/main/datafusion
In order to keep the reported benchmarks close to our actual runner I
suggest we add the same thing to the clickbench runner scripts too. I will file
a ticket
##########
datafusion/sqllogictest/test_files/clickbench.slt:
##########
@@ -26,10 +26,28 @@
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet'
(FORMAT PARQUET);
statement ok
-CREATE EXTERNAL TABLE hits
+CREATE EXTERNAL TABLE hits_raw
STORED AS PARQUET
LOCATION '../core/tests/data/clickbench_hits_10.parquet';
+# ClickBench encodes EventDate as UInt16 days since epoch.
Review Comment:
I worry that we are spreading the knowledge needed to run DataFusion on
ClickBench effectively all over the place. For example, this view definition is
now copied twice
Would it be possible to extend this documentation (either in this PR or
another) here
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#clickbench
To mention:
1. The need to add the view / cast the integer column (basically the same
great explanation in `HITS_VIEW_DDL`)
2. The need to run with `binary_as_string` (e.g.
https://github.com/ClickHouse/ClickBench/blob/d88f3552bd928e86774e09901f05a6c1ea4e3ef4/datafusion/create.sql#L4)
Note this .slt test doesn't need the binary as string as the conversion
aparently was done when creating the subset (the columns say Utf8View, not
BinaryView)
```sql
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ datafusion-cli
DataFusion CLI v52.0.0
> describe
'/Users/andrewlamb/Software/datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | YES |
| JavaEnable | Int16 | YES |
| Title | Utf8View | YES |
| GoodEvent | Int16 | YES |
| EventTime | Int64 | YES |
| EventDate | UInt16 | YES |
| CounterID | Int32 | YES |
| ClientIP | Int32 | YES |
| RegionID | Int32 | YES |
| UserID | Int64 | YES |
| CounterClass | Int16 | YES |
| OS | Int16 | YES |
| UserAgent | Int16 | YES |
| URL | Utf8View | YES |
| Referer | Utf8View | YES |
| IsRefresh | Int16 | YES |
| RefererCategoryID | Int16 | YES |
| RefererRegionID | Int32 | YES |
| URLCategoryID | Int16 | YES |
| URLRegionID | Int32 | YES |
| ResolutionWidth | Int16 | YES |
| ResolutionHeight | Int16 | YES |
| ResolutionDepth | Int16 | YES |
| FlashMajor | Int16 | YES |
| FlashMinor | Int16 | YES |
| FlashMinor2 | Utf8View | YES |
| NetMajor | Int16 | YES |
| NetMinor | Int16 | YES |
| UserAgentMajor | Int16 | YES |
| UserAgentMinor | Utf8View | YES |
| CookieEnable | Int16 | YES |
| JavascriptEnable | Int16 | YES |
| IsMobile | Int16 | YES |
| MobilePhone | Int16 | YES |
| MobilePhoneModel | Utf8View | YES |
| Params | Utf8View | YES |
| IPNetworkID | Int32 | YES |
| TraficSourceID | Int16 | YES |
| SearchEngineID | Int16 | YES |
| SearchPhrase | Utf8View | YES |
| AdvEngineID | Int16 | YES |
| IsArtifical | Int16 | YES |
| WindowClientWidth | Int16 | YES |
| WindowClientHeight | Int16 | YES |
| ClientTimeZone | Int16 | YES |
| ClientEventTime | Int64 | YES |
| SilverlightVersion1 | Int16 | YES |
| SilverlightVersion2 | Int16 | YES |
| SilverlightVersion3 | Int32 | YES |
| SilverlightVersion4 | Int16 | YES |
| PageCharset | Utf8View | YES |
| CodeVersion | Int32 | YES |
| IsLink | Int16 | YES |
| IsDownload | Int16 | YES |
| IsNotBounce | Int16 | YES |
| FUniqID | Int64 | YES |
| OriginalURL | Utf8View | YES |
| HID | Int32 | YES |
| IsOldCounter | Int16 | YES |
| IsEvent | Int16 | YES |
| IsParameter | Int16 | YES |
| DontCountHits | Int16 | YES |
| WithHash | Int16 | YES |
| HitColor | Utf8View | YES |
| LocalEventTime | Int64 | YES |
| Age | Int16 | YES |
| Sex | Int16 | YES |
| Income | Int16 | YES |
| Interests | Int16 | YES |
| Robotness | Int16 | YES |
| RemoteIP | Int32 | YES |
| WindowName | Int32 | YES |
| OpenerName | Int32 | YES |
| HistoryLength | Int16 | YES |
| BrowserLanguage | Utf8View | YES |
| BrowserCountry | Utf8View | YES |
| SocialNetwork | Utf8View | YES |
| SocialAction | Utf8View | YES |
| HTTPError | Int16 | YES |
| SendTiming | Int32 | YES |
| DNSTiming | Int32 | YES |
| ConnectTiming | Int32 | YES |
| ResponseStartTiming | Int32 | YES |
| ResponseEndTiming | Int32 | YES |
| FetchTiming | Int32 | YES |
| SocialSourceNetworkID | Int16 | YES |
| SocialSourcePage | Utf8View | YES |
| ParamPrice | Int64 | YES |
| ParamOrderID | Utf8View | YES |
| ParamCurrency | Utf8View | YES |
| ParamCurrencyID | Int16 | YES |
| OpenstatServiceName | Utf8View | YES |
| OpenstatCampaignID | Utf8View | YES |
| OpenstatAdID | Utf8View | YES |
| OpenstatSourceID | Utf8View | YES |
| UTMSource | Utf8View | YES |
| UTMMedium | Utf8View | YES |
| UTMCampaign | Utf8View | YES |
| UTMContent | Utf8View | YES |
| UTMTerm | Utf8View | YES |
| FromTag | Utf8View | YES |
| HasGCLID | Int16 | YES |
| RefererHash | Int64 | YES |
| URLHash | Int64 | YES |
| CLID | Int32 | YES |
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.010 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]