liamphmurphy opened a new issue, #43893:
URL: https://github.com/apache/arrow/issues/43893

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   After a schema merge operation involving nested columns, PyArrow fails to 
load the data and raises the following error:
   
   `pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c: int64> output fields: struct<c: int64, d: int64>`
   
   I have confirmed this does not happen with a schema merge that DOES NOT 
involve any nested columns.
   
   I believe this is a PyArrow-specific problem, as Spark does not exhibit 
it.
   
   Below is an example of how this can be reproduced:
   
   ```
   import pyarrow as pa
   import polars as pl
   from deltalake import write_deltalake
   
   # Create a PyArrow table that includes a nested (struct) column 'b'
   df = pa.table({
       "a": [1, 2, 3],
       "b": [{"c": 1}, {"c": 2}, {"c": 3}]
   })
   
   # Create a PyArrow schema that includes the nested (struct) column 'b'
   schema = pa.schema([
       pa.field("a", pa.int64()),
       pa.field("b", pa.struct([
           pa.field("c", pa.int64())
       ]))
   ])
   
   local_path = "./tables/merge_delta_table"
   
   # Write the table to delta lake
   write_deltalake(local_path, data=df, engine="rust", schema=schema,
                   mode="append")
   
   # Create a new table with a different schema, adding field 'd' to struct 'b'
   df2 = pa.table({
       "a": [4, 5, 6],
       "b": [{"d": 2, "c": 1}, {"c": 2}, {"c": 3}]
   })
   
   schema2 = pa.schema([
       pa.field("a", pa.int64()),
       pa.field("b", pa.struct([
           pa.field("d", pa.int64()),
           pa.field("c", pa.int64())
       ]))
   ])
   # Write the new table to the same delta lake
   write_deltalake(local_path, data=df2, schema=schema2, engine="rust",
                   mode="append", schema_mode="merge")
   
   # Now read the delta lake using polars
   df = pl.read_delta(local_path)
   print(df)
   ```
   
   ### Component(s)
   
   Parquet, Python

