rdettai commented on a change in pull request #1141:
URL: https://github.com/apache/arrow-datafusion/pull/1141#discussion_r738991233
##########
File path: datafusion/src/physical_plan/file_format/mod.rs
##########
@@ -24,19 +24,134 @@ mod json;
mod parquet;
pub use self::parquet::ParquetExec;
+use arrow::{
+ array::{ArrayData, ArrayRef, DictionaryArray, UInt8BufferBuilder},
+ buffer::Buffer,
+ datatypes::{DataType, Field, Schema, SchemaRef, UInt8Type},
+ error::{ArrowError, Result as ArrowResult},
+ record_batch::RecordBatch,
+};
pub use avro::AvroExec;
pub use csv::CsvExec;
pub use json::NdJsonExec;
-use crate::datasource::PartitionedFile;
-use std::fmt::{Display, Formatter, Result};
+use crate::{
+ datasource::{object_store::ObjectStore, PartitionedFile},
+ scalar::ScalarValue,
+};
+use std::{
+ collections::HashMap,
+ fmt::{Display, Formatter, Result as FmtResult},
+ sync::Arc,
+ vec,
+};
+
+use super::{ColumnStatistics, Statistics};
+
+lazy_static! {
+ /// The datatype used for all partitioning columns for now
+ pub static ref DEFAULT_PARTITION_COLUMN_DATATYPE: DataType = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
Review comment:
My tendency would also be to force the listing provider to use `Utf8`
only and let users cast explicitly in their queries if necessary. But after
various discussions with @Dandandan, @yjshen and @houqp, the following points
emerged in favor of letting `ListingTable` perform the cast:
- "Better query user experience so users won't need to manually add the
casting in their queries if the type is already known"
- "Help with Spark migration where they even support automatic partition
column type inference (can be turned off)"
- Avoid the cost of materializing an intermediate `Utf8` batch (though the
cast should be fairly inexpensive, since partition columns are represented
by dictionaries with only one dictionary value)
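
To illustrate the last point, here is a minimal std-only sketch of why casting a dictionary-encoded partition column is cheap. `DictColumn`, `for_partition`, `cast` and `materialize` are hypothetical names made up for this example; it is a simplified model, not the Arrow `DictionaryArray` API or the code in this PR. The assumption is that every row of a file shares one partition value, so only that single dictionary entry needs to be parsed.

```rust
use std::str::FromStr;

/// Simplified stand-in for a dictionary-encoded column (hypothetical type,
/// not the Arrow API): each row stores a small key that indexes into a
/// list of distinct values.
#[derive(Debug, Clone, PartialEq)]
struct DictColumn<T> {
    keys: Vec<u8>,
    values: Vec<T>,
}

impl DictColumn<String> {
    /// Build the column for one file of a partitioned table: every row
    /// shares the same partition value, so the dictionary has one entry
    /// and every key is 0.
    fn for_partition(value: &str, num_rows: usize) -> Self {
        DictColumn {
            keys: vec![0; num_rows],
            values: vec![value.to_string()],
        }
    }

    /// Cast by parsing only the dictionary values; the keys are copied
    /// unchanged. For a partition column this parses a single string,
    /// whatever the number of rows.
    fn cast<T: FromStr>(&self) -> Result<DictColumn<T>, T::Err> {
        let values = self
            .values
            .iter()
            .map(|v| v.parse::<T>())
            .collect::<Result<Vec<T>, T::Err>>()?;
        Ok(DictColumn {
            keys: self.keys.clone(),
            values,
        })
    }
}

impl<T: Clone> DictColumn<T> {
    /// Expand to one value per row, as a plain (non-dictionary) column
    /// would store it.
    fn materialize(&self) -> Vec<T> {
        self.keys
            .iter()
            .map(|&k| self.values[k as usize].clone())
            .collect()
    }
}
```

For example, `DictColumn::for_partition("2021", 4).cast::<i32>()` parses the string `"2021"` once and keeps the four keys as they are, whereas an explicit cast over a materialized `Utf8` column would parse one string per row.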