Jefffrey commented on issue #18770: URL: https://github.com/apache/datafusion/issues/18770#issuecomment-3561153658
For JSONs, this does seem to be the expected behaviour of DataFusion (I don't think we have a config for this behaviour). As for why it happens, I assume it's because usually the order of keys in a JSON _shouldn't_ be significant and ideally systems don't need to rely on the order of keys within JSON files to function. We could look into fixing this behaviour (though part of the schema resolution taking place here is located in arrow-rs) depending on the appetite for the fix, considering JSON is considered unordered 🤔 - I guess if polars does it then that's a point towards following suit and doing the fix, though not sure if they do it intentionally or it's just a happy side-effect of their implementation For reference, the related code: https://github.com/apache/datafusion/blob/f1ecaccd183367086ecb5b7736d93b3aba109e01/datafusion/datasource-json/src/file_format.rs#L215-L229 Here when we call `infer_json_schema_from_iterator()` from arrow-rs: ```rust pub fn infer_json_schema_from_iterator<I, V>(value_iter: I) -> Result<Schema, ArrowError> where I: Iterator<Item = Result<V, ArrowError>>, V: Borrow<Value>, { let mut field_types: HashMap<String, InferredType> = HashMap::new(); for record in value_iter { match record?.borrow() { Value::Object(map) => { collect_field_types_from_object(&mut field_types, map)?; } value => { return Err(ArrowError::JsonError(format!( "Expected JSON record to be an object, found {value:?}" ))); } }; } generate_schema(field_types) } ``` - https://github.com/apache/arrow-rs/blob/ed9efe78e4cc958cc96707557818e754419debb0/arrow-json/src/reader/schema.rs#L500-L521 Specifically how we have `field_types` as a hashmap which I'm guessing is where the order of the columns is determined by the keys in there. Though you mention that you see the same behaviour for CSVs but I don't think that's the case? Unless you have a reproduction? For reference my attempt: ```sh datafusion-cli (main)$ cat /Users/jeffrey/Downloads/order.csv zzz,bbb,column1,abc 1,2,3,4 1,2,3,4 1,2,3,4 datafusion-cli (main)$ cargo run Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.47s Running `/Users/jeffrey/.cargo_target_cache/debug/datafusion-cli` DataFusion CLI v51.0.0 > create external table t17 stored as csv location '/Users/jeffrey/Downloads/order.csv'; 0 row(s) fetched. Elapsed 0.061 seconds. > select * from t17; +-----+-----+---------+-----+ | zzz | bbb | column1 | abc | +-----+-----+---------+-----+ | 1 | 2 | 3 | 4 | | 1 | 2 | 3 | 4 | | 1 | 2 | 3 | 4 | +-----+-----+---------+-----+ 3 row(s) fetched. Elapsed 0.032 seconds. > describe t17; +-------------+-----------+-------------+ | column_name | data_type | is_nullable | +-------------+-----------+-------------+ | zzz | Int64 | YES | | bbb | Int64 | YES | | column1 | Int64 | YES | | abc | Int64 | YES | +-------------+-----------+-------------+ 4 row(s) fetched. Elapsed 0.004 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
