syun64 commented on code in PR #902:
URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1669486866
##########
pyiceberg/table/__init__.py:
##########
@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow
+ target_schema = schema_to_pyarrow(self.projection())
Review Comment:
Here we are making an opinionated decision about whether to use large or small
types in the pyarrow schema when reading the Iceberg table as a
RecordBatchReader. Is there a reason we don't want to do the same for the
Table API? I've noticed that we changed the return type of the Table API to
`Optional[pa.Table]` in order to avoid having to use `schema_to_pyarrow`.
Similarly, other libraries such as polars choose one type over the other
(large types, in polars' case).
```
>>> strings = pa.array(["a", "b"])
>>> pydict = {"strings": strings}
>>> pa.Table.from_pydict(pydict)
pyarrow.Table
strings: string
----
strings: [["a","b"]]
>>> pq.write_table(pa.Table.from_pydict(pydict), "strings.parquet")
>>> pldf = pl.read_parquet("strings.parquet", use_pyarrow=True)
>>> pldf.dtypes
[String]
>>> pldf.to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
```
##########
pyiceberg/table/__init__.py:
##########
@@ -1895,7 +1896,7 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
case_sensitive=self.case_sensitive,
limit=self.limit,
),
- )
+ ).cast(target_schema=target_schema)
Review Comment:
When I originally worked on
https://github.com/apache/iceberg-python/pull/786, I considered this approach
as well, but ran into issues like:
```
tests/integration/test_reads.py::test_pyarrow_batches_deletes[session_catalog_hive] - pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 800, in pyarrow.lib.RecordBatchReader.cast
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]
```
As a workaround, I opted to cast each pa.Array individually within
`to_requested_schema` rather than using this API.
This issue is fixed in https://github.com/apache/arrow/issues/41884, but
until we pick up the new release, I don't think we will be able to use this
approach.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]