lukaswelsch opened a new issue, #3260:
URL: https://github.com/apache/iceberg-python/issues/3260
### Apache Iceberg version
0.11.0 (latest release)
### Please describe the bug 🐞
We recently updated our functions to call pyiceberg's `table.append()` with
dictionary-encoded Arrow tables. Our Iceberg tables now contain mixed data:
data written before this change, where the column is stored as plain strings,
and data written after it, where the column is stored as dictionary-encoded
strings. Calling `to_arrow()` on a `DataScan` over such a table now fails with
this error:
```
pyarrow.lib.ArrowTypeError: Unable to merge: Field col has incompatible
types: string vs dictionary<values=string, indices=int32, ordered=0>
```
Here is a minimal example that reproduces this error:
```python
from pyiceberg.io.pyarrow import ArrowScan
from pyiceberg.table import ALWAYS_TRUE
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField
from pyiceberg.types import StringType
import pyarrow as pa


def create_scan_with_mixed_dict_encode_not_encode() -> ArrowScan:
    schema = Schema(
        NestedField(field_id=1, name="col", field_type=StringType(),
                    required=False)
    )

    class FakeTableMetadata:
        def schema(self) -> Schema:
            return schema

    scan = ArrowScan(table_metadata=FakeTableMetadata(),
                     io=object(),
                     projected_schema=schema,
                     row_filter=ALWAYS_TRUE)

    def _batches_for_repro(self, _tasks):
        str_values = pa.array(["a"], type=pa.string())
        yield pa.record_batch([str_values], names=["col"])
        yield pa.record_batch([str_values.dictionary_encode()],
                              names=["col"])

    ArrowScan.to_record_batches = _batches_for_repro
    return scan


if __name__ == "__main__":
    scan = create_scan_with_mixed_dict_encode_not_encode()
    arrow_table = ArrowScan.to_table(scan, tasks=[])
```
I am happy to provide a bugfix PR, but I need some guidance on the best
approach.
One idea is to cast every batch in `to_table` to the arrow_schema. A more
performant option is to check whether each batch's schema differs from the
expected one and, only when it does, find the dictionary-encoded column and
cast just that column to string.
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]