lukaswelsch opened a new issue, #3260:
URL: https://github.com/apache/iceberg-python/issues/3260

   ### Apache Iceberg version
   
   0.11.0 (latest release)
   
   ### Please describe the bug 🐞
   
   We recently updated our functions to call pyiceberg's table.append() with 
dictionary-encoded Arrow tables. Our Iceberg tables now contain mixed data: 
data written before this change, which is stored as plain strings, and data 
written after it, which is stored as dictionary-encoded strings. 
   
   Calling to_arrow() on a DataScan of such a table now raises this error:
   
   ```
   pyarrow.lib.ArrowTypeError: Unable to merge: Field col has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
   ```
   
   Here is a minimal example that reproduces this error:
   ```python
   from pyiceberg.io.pyarrow import ArrowScan
   from pyiceberg.table import ALWAYS_TRUE
   from pyiceberg.schema import Schema
   from pyiceberg.types import NestedField
   from pyiceberg.types import StringType
   
   import pyarrow as pa
   
   
   def create_scan_with_mixed_dict_encode_not_encode() -> ArrowScan:
       schema = Schema(
           NestedField(field_id=1, name="col", field_type=StringType(), required=False)
       )
   
       class FakeTableMetadata:
           def schema(self) -> Schema:
               return schema
   
       scan = ArrowScan(table_metadata=FakeTableMetadata(),
                        io=object(),
                        projected_schema=schema,
                        row_filter=ALWAYS_TRUE)
   
       def _batches_for_repro(self, _tasks):
           str_values = pa.array(["a"], type=pa.string())
           yield pa.record_batch([str_values], names=["col"])
           yield pa.record_batch([str_values.dictionary_encode()], names=["col"])
   
       ArrowScan.to_record_batches = _batches_for_repro
       return scan
   
   
   if __name__ == "__main__":
       scan = create_scan_with_mixed_dict_encode_not_encode()
       arrow_table = ArrowScan.to_table(scan, tasks=[])
   ```
   
   I am happy to provide a bugfix PR, but I need some guidance on the best 
approach. 
   One idea is to cast every batch in to_table to the arrow_schema. A more 
performant option is to compare each batch's schema with the expected one and, 
only when they differ, find the dictionary-encoded column and cast just that 
column to string. 
   
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

