syun64 commented on code in PR #902:
URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1669486866
##########
pyiceberg/table/__init__.py:
##########
@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow
+ target_schema = schema_to_pyarrow(self.projection())
Review Comment:
Here we are making an opinionated decision about whether to use large or small
types in the pyarrow schema when reading the Iceberg table as a
RecordBatchReader. Is there a reason we don't want to do the same for the
Table API? I've noticed that we changed the return type of the Table API to
`Optional[pa.Table]` in order to avoid having to use `schema_to_pyarrow`.
Similarly, other libraries such as polars choose one type over the other
(large types, in polars' case).
```
>>> strings = pa.array(["a", "b"])
>>> pydict = {"strings": strings}
>>> pa.Table.from_pydict(pydict)
pyarrow.Table
strings: string
----
strings: [["a","b"]]
>>> pq.write_table(pa.Table.from_pydict(pydict), "strings.parquet")
>>> pldf = pl.read_parquet("strings.parquet", use_pyarrow=True)
>>> pldf.dtypes
[String]
>>> pldf.to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
```
##########
pyiceberg/table/__init__.py:
##########
@@ -1895,7 +1896,7 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
case_sensitive=self.case_sensitive,
limit=self.limit,
),
- )
+ ).cast(target_schema=target_schema)
Review Comment:
When I originally worked on
https://github.com/apache/iceberg-python/pull/786, I considered this approach
as well, but ran into issues like:
```
tests/integration/test_reads.py::test_pyarrow_batches_deletes[session_catalog_hive] - pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 800, in pyarrow.lib.RecordBatchReader.cast
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]
```
As a workaround, I opted to cast each pa.Array individually within
`to_requested_schema` rather than using this API.
This issue is fixed in https://github.com/apache/arrow/issues/41884, but
until we pick up the new release, I don't think we will be able to use this
approach.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]