Jefffrey commented on issue #18770:
URL: https://github.com/apache/datafusion/issues/18770#issuecomment-3561153658

   For JSONs, this does seem to be the expected behaviour of DataFusion (I 
don't think we have a config for this behaviour). As for why it happens, I 
assume it's because usually the order of keys in a JSON _shouldn't_ be 
significant and ideally systems don't need to rely on the order of keys within 
JSON files to function. We could look into fixing this behaviour (though part 
of the schema resolution taking place here is located in arrow-rs) depending on 
the appetite for the fix, considering JSON is considered unordered 🤔 
   
   - I guess if polars does it then that's a point towards following suit and 
doing the fix, though not sure if they do it intentionally or it's just a happy 
side-effect of their implementation
   
   For reference, the related code:
   
   
https://github.com/apache/datafusion/blob/f1ecaccd183367086ecb5b7736d93b3aba109e01/datafusion/datasource-json/src/file_format.rs#L215-L229
   
   Here when we call `infer_json_schema_from_iterator()` from arrow-rs:
   
   ```rust
   pub fn infer_json_schema_from_iterator<I, V>(value_iter: I) -> 
Result<Schema, ArrowError>
   where
       I: Iterator<Item = Result<V, ArrowError>>,
       V: Borrow<Value>,
   {
       let mut field_types: HashMap<String, InferredType> = HashMap::new();
   
       for record in value_iter {
           match record?.borrow() {
               Value::Object(map) => {
                   collect_field_types_from_object(&mut field_types, map)?;
               }
               value => {
                   return Err(ArrowError::JsonError(format!(
                       "Expected JSON record to be an object, found {value:?}"
                   )));
               }
           };
       }
   
       generate_schema(field_types)
   }
   ```
   
   - 
https://github.com/apache/arrow-rs/blob/ed9efe78e4cc958cc96707557818e754419debb0/arrow-json/src/reader/schema.rs#L500-L521
   
   Specifically how we have `field_types` as a hashmap which I'm guessing is 
where the order of the columns is determined by the keys in there.
   
   Though you mention that you see the same behaviour for CSVs but I don't 
think that's the case? Unless you have a reproduction? For reference my attempt:
   
   ```sh
   datafusion-cli (main)$ cat /Users/jeffrey/Downloads/order.csv
   zzz,bbb,column1,abc
   1,2,3,4
   1,2,3,4
   1,2,3,4
   datafusion-cli (main)$ cargo run
       Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.47s
        Running `/Users/jeffrey/.cargo_target_cache/debug/datafusion-cli`
   DataFusion CLI v51.0.0
   > create external table t17 stored as csv location 
'/Users/jeffrey/Downloads/order.csv';
   0 row(s) fetched.
   Elapsed 0.061 seconds.
   
   > select * from t17;
   +-----+-----+---------+-----+
   | zzz | bbb | column1 | abc |
   +-----+-----+---------+-----+
   | 1   | 2   | 3       | 4   |
   | 1   | 2   | 3       | 4   |
   | 1   | 2   | 3       | 4   |
   +-----+-----+---------+-----+
   3 row(s) fetched.
   Elapsed 0.032 seconds.
   
   > describe t17;
   +-------------+-----------+-------------+
   | column_name | data_type | is_nullable |
   +-------------+-----------+-------------+
   | zzz         | Int64     | YES         |
   | bbb         | Int64     | YES         |
   | column1     | Int64     | YES         |
   | abc         | Int64     | YES         |
   +-------------+-----------+-------------+
   4 row(s) fetched.
   Elapsed 0.004 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to