kosiew commented on PR #20500:
URL: https://github.com/apache/datafusion/pull/20500#issuecomment-4134062456

   
https://github.com/Samyak2/datafusion/blob/fix-repartition-string-view-counting/datafusion/common/src/config.rs#L738-L740
   
   For the above benchmark runs, the Parquet-backed benchmark data is expected 
to use view types by default.
   
   Why:
   
   - DataFusion's Parquet config defaults `schema_force_view_types` to `true`
   - when that option is enabled, Parquet string columns are read as `Utf8View` 
and binary columns as `BinaryView`
   - the TPC-H benchmark constructs `ParquetFormat` using the session's Parquet 
table options, so it inherits that default behavior
   - the ClickBench benchmark also uses the session Parquet defaults and 
additionally sets `binary_as_string = true` so legacy binary-encoded string 
columns in the `hits_partitioned` dataset are treated as strings
   
   So both benchmark outputs under discussion should be assumed to read 
Parquet-backed string columns as string view arrays, unless view types were 
explicitly disabled.
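
   As a rough illustration of how the two options mentioned above combine, here is a minimal, self-contained sketch. It is a toy model, not DataFusion's actual code: `ColType`, `ParquetReadOpts`, and `effective_type` are hypothetical names made up for this example, the `binary_as_string = false` default is an assumption, and so is the order in which the two options are applied.

```rust
/// Simplified stand-in for the Arrow data types relevant here
/// (hypothetical; not the real `arrow` `DataType` enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColType {
    Utf8,
    Binary,
    Utf8View,
    BinaryView,
    Int64,
}

/// Hypothetical mirror of the two Parquet options discussed above.
#[derive(Debug, Clone, Copy)]
pub struct ParquetReadOpts {
    pub schema_force_view_types: bool,
    pub binary_as_string: bool,
}

impl Default for ParquetReadOpts {
    fn default() -> Self {
        Self {
            // Defaults to `true`, per the config lines linked above.
            schema_force_view_types: true,
            // Assumed default; ClickBench sets this to `true` explicitly.
            binary_as_string: false,
        }
    }
}

/// Type a column is read as, given its type in the file and the options.
pub fn effective_type(file_type: ColType, opts: ParquetReadOpts) -> ColType {
    // Step 1 (assumed to run first): reinterpret binary columns as strings.
    let t = match (file_type, opts.binary_as_string) {
        (ColType::Binary, true) => ColType::Utf8,
        (ColType::BinaryView, true) => ColType::Utf8View,
        (other, _) => other,
    };
    // Step 2: upgrade string and binary columns to their view variants.
    match (t, opts.schema_force_view_types) {
        (ColType::Utf8, true) => ColType::Utf8View,
        (ColType::Binary, true) => ColType::BinaryView,
        (other, _) => other,
    }
}
```

   Under this model, a TPC-H style run with default options maps `Utf8` to `Utf8View`, and a ClickBench style run with `binary_as_string` set to `true` maps a `Binary` column to `Utf8View` as well. Only setting `schema_force_view_types` to `false` leaves `Utf8` columns as plain `Utf8`.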
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

