alamb commented on code in PR #19881:
URL: https://github.com/apache/datafusion/pull/19881#discussion_r2725539871


##########
datafusion/sqllogictest/test_files/clickbench.slt:
##########
@@ -64,10 +82,10 @@ SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
 ----
 1
 
-query II
+query DD
 SELECT MIN("EventDate"), MAX("EventDate") FROM hits;
 ----
-15901 15901
+2013-07-15 2013-07-15

Review Comment:
   ❤️ 



##########
benchmarks/src/clickbench.rs:
##########
@@ -29,6 +29,16 @@ use datafusion::{
 use datafusion_common::exec_datafusion_err;
 use datafusion_common::instant::Instant;
 
+/// SQL to create the hits view with proper EventDate casting.
+///
+/// ClickBench stores EventDate as UInt16 (days since 1970-01-01) for
+/// storage efficiency (2 bytes vs 4-8 bytes for date types).
+/// This view transforms it to SQL DATE type for query compatibility.
+const HITS_VIEW_DDL: &str = r#"CREATE VIEW hits AS

Review Comment:
   This only changes the DataFusion benchmark runner -- it won't change the 
actual clickbench results which are created with the scripts here: 
https://github.com/ClickHouse/ClickBench/tree/main/datafusion
   
   In order to keep the reported benchmarks close to our actual runner I 
suggest we add the same thing to the clickbench runner scripts too. I will file 
a ticket
   
   



##########
datafusion/sqllogictest/test_files/clickbench.slt:
##########
@@ -26,10 +26,28 @@
 # COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' 
(FORMAT PARQUET);
 
 statement ok
-CREATE EXTERNAL TABLE hits
+CREATE EXTERNAL TABLE hits_raw
 STORED AS PARQUET
 LOCATION '../core/tests/data/clickbench_hits_10.parquet';
 
+# ClickBench encodes EventDate as UInt16 days since epoch.

Review Comment:
   I worry that we are spreading the knowledge needed to run DataFusion on 
ClickBench effectively all over the place. For example, this view definition is 
now copied twice
   
   Would it be possible to extend this documentation (either in this PR or 
another) here
   
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#clickbench
   
   
   To mention:
   1. The need to add the view / cast the integer column (basically the same 
great explanation in `HITS_VIEW_DDL`)
   2. The need to run with  `binary_as_string` (e.g. 
https://github.com/ClickHouse/ClickBench/blob/d88f3552bd928e86774e09901f05a6c1ea4e3ef4/datafusion/create.sql#L4)
   
   Note this .slt test doesn't need the binary as string as the conversion 
aparently was done when creating the subset (the columns say Utf8View, not 
BinaryView)
   
   ```sql
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ datafusion-cli
   DataFusion CLI v52.0.0
   > describe 
'/Users/andrewlamb/Software/datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet';
   +-----------------------+-----------+-------------+
   | column_name           | data_type | is_nullable |
   +-----------------------+-----------+-------------+
   | WatchID               | Int64     | YES         |
   | JavaEnable            | Int16     | YES         |
   | Title                 | Utf8View  | YES         |
   | GoodEvent             | Int16     | YES         |
   | EventTime             | Int64     | YES         |
   | EventDate             | UInt16    | YES         |
   | CounterID             | Int32     | YES         |
   | ClientIP              | Int32     | YES         |
   | RegionID              | Int32     | YES         |
   | UserID                | Int64     | YES         |
   | CounterClass          | Int16     | YES         |
   | OS                    | Int16     | YES         |
   | UserAgent             | Int16     | YES         |
   | URL                   | Utf8View  | YES         |
   | Referer               | Utf8View  | YES         |
   | IsRefresh             | Int16     | YES         |
   | RefererCategoryID     | Int16     | YES         |
   | RefererRegionID       | Int32     | YES         |
   | URLCategoryID         | Int16     | YES         |
   | URLRegionID           | Int32     | YES         |
   | ResolutionWidth       | Int16     | YES         |
   | ResolutionHeight      | Int16     | YES         |
   | ResolutionDepth       | Int16     | YES         |
   | FlashMajor            | Int16     | YES         |
   | FlashMinor            | Int16     | YES         |
   | FlashMinor2           | Utf8View  | YES         |
   | NetMajor              | Int16     | YES         |
   | NetMinor              | Int16     | YES         |
   | UserAgentMajor        | Int16     | YES         |
   | UserAgentMinor        | Utf8View  | YES         |
   | CookieEnable          | Int16     | YES         |
   | JavascriptEnable      | Int16     | YES         |
   | IsMobile              | Int16     | YES         |
   | MobilePhone           | Int16     | YES         |
   | MobilePhoneModel      | Utf8View  | YES         |
   | Params                | Utf8View  | YES         |
   | IPNetworkID           | Int32     | YES         |
   | TraficSourceID        | Int16     | YES         |
   | SearchEngineID        | Int16     | YES         |
   | SearchPhrase          | Utf8View  | YES         |
   | AdvEngineID           | Int16     | YES         |
   | IsArtifical           | Int16     | YES         |
   | WindowClientWidth     | Int16     | YES         |
   | WindowClientHeight    | Int16     | YES         |
   | ClientTimeZone        | Int16     | YES         |
   | ClientEventTime       | Int64     | YES         |
   | SilverlightVersion1   | Int16     | YES         |
   | SilverlightVersion2   | Int16     | YES         |
   | SilverlightVersion3   | Int32     | YES         |
   | SilverlightVersion4   | Int16     | YES         |
   | PageCharset           | Utf8View  | YES         |
   | CodeVersion           | Int32     | YES         |
   | IsLink                | Int16     | YES         |
   | IsDownload            | Int16     | YES         |
   | IsNotBounce           | Int16     | YES         |
   | FUniqID               | Int64     | YES         |
   | OriginalURL           | Utf8View  | YES         |
   | HID                   | Int32     | YES         |
   | IsOldCounter          | Int16     | YES         |
   | IsEvent               | Int16     | YES         |
   | IsParameter           | Int16     | YES         |
   | DontCountHits         | Int16     | YES         |
   | WithHash              | Int16     | YES         |
   | HitColor              | Utf8View  | YES         |
   | LocalEventTime        | Int64     | YES         |
   | Age                   | Int16     | YES         |
   | Sex                   | Int16     | YES         |
   | Income                | Int16     | YES         |
   | Interests             | Int16     | YES         |
   | Robotness             | Int16     | YES         |
   | RemoteIP              | Int32     | YES         |
   | WindowName            | Int32     | YES         |
   | OpenerName            | Int32     | YES         |
   | HistoryLength         | Int16     | YES         |
   | BrowserLanguage       | Utf8View  | YES         |
   | BrowserCountry        | Utf8View  | YES         |
   | SocialNetwork         | Utf8View  | YES         |
   | SocialAction          | Utf8View  | YES         |
   | HTTPError             | Int16     | YES         |
   | SendTiming            | Int32     | YES         |
   | DNSTiming             | Int32     | YES         |
   | ConnectTiming         | Int32     | YES         |
   | ResponseStartTiming   | Int32     | YES         |
   | ResponseEndTiming     | Int32     | YES         |
   | FetchTiming           | Int32     | YES         |
   | SocialSourceNetworkID | Int16     | YES         |
   | SocialSourcePage      | Utf8View  | YES         |
   | ParamPrice            | Int64     | YES         |
   | ParamOrderID          | Utf8View  | YES         |
   | ParamCurrency         | Utf8View  | YES         |
   | ParamCurrencyID       | Int16     | YES         |
   | OpenstatServiceName   | Utf8View  | YES         |
   | OpenstatCampaignID    | Utf8View  | YES         |
   | OpenstatAdID          | Utf8View  | YES         |
   | OpenstatSourceID      | Utf8View  | YES         |
   | UTMSource             | Utf8View  | YES         |
   | UTMMedium             | Utf8View  | YES         |
   | UTMCampaign           | Utf8View  | YES         |
   | UTMContent            | Utf8View  | YES         |
   | UTMTerm               | Utf8View  | YES         |
   | FromTag               | Utf8View  | YES         |
   | HasGCLID              | Int16     | YES         |
   | RefererHash           | Int64     | YES         |
   | URLHash               | Int64     | YES         |
   | CLID                  | Int32     | YES         |
   +-----------------------+-----------+-------------+
   105 row(s) fetched.
   Elapsed 0.010 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to