dongsupkim-onepredict opened a new issue, #3166:
URL: https://github.com/apache/iceberg-python/issues/3166
### Apache Iceberg version
0.11.0 (latest release)
### Please describe the bug 🐞
**Description:**
When passing an `options` dictionary to `Table.scan(options=...)`, the
properties (such as `s3.connect-timeout` or `s3.request-timeout`) are accepted
by the `DataScan` object but are never propagated to the underlying `FileIO`
(e.g., `PyArrowFileIO`) when actual data materialization occurs via methods
like `to_pandas()` or `to_arrow()`.
Because `ArrowScan` is initialized with the `FileIO` that was created during
catalog instantiation (`table.io`), any S3-specific configurations provided at
the scan level are completely bypassed. This causes operations reading numerous
manifest files to fall back to the AWS C++ SDK default timeouts (often
10s-30s), leading to unexpected `curlCode: 28 (Timeout was reached)` errors
even when generous timeouts are explicitly requested in the scan options.
**Steps to Reproduce:**
```
# 1. Load catalog with default (or no) S3 timeout properties
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{
    "uri": "...",
    "s3.endpoint": "..."
})
table = catalog.load_table("my_namespace.my_table")

# 2. Attempt to scan with explicit S3 timeout options
scan_options = {
    "s3.connect-timeout": "600.0",
    "s3.request-timeout": "600.0"
}

# The options are accepted by DataScan...
scan = table.scan(options=scan_options)

# 3. ...but completely ignored during S3 I/O operations (ArrowScan).
# This may throw a timeout error if RGW/S3 latency spikes, ignoring the
# 600s setting above.
df = scan.to_pandas()
```
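As a stopgap under the current behavior, the same timeouts can be supplied at catalog load time instead, so they land in the `FileIO` that the catalog creates (a sketch reusing the property keys from above; the `uri` and `s3.endpoint` values are placeholders):

```
# Workaround sketch: set the S3 timeouts in the catalog properties, since
# properties present at catalog load time do reach the FileIO it creates.
catalog = load_catalog("my_catalog", **{
    "uri": "...",
    "s3.endpoint": "...",
    "s3.connect-timeout": "600.0",
    "s3.request-timeout": "600.0",
})
table = catalog.load_table("my_namespace.my_table")
df = table.scan().to_pandas()  # the catalog-created FileIO now carries the 600s timeouts
```

This only helps when the timeouts are known up front; it does not allow per-scan overrides, which is what this issue asks for.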
### Expected Behavior:
Properties passed via `options` in `Table.scan()` should cascade down and either
update or override `table.io.properties` for the duration of the scan.
Specifically, `s3.*` configurations should be respected by the underlying `FileIO`
(e.g., `PyArrowFileIO`) when downloading manifest lists or data files.
### Actual Behavior:
The `options` passed to `Table.scan()` are stored in the `DataScan` instance but
are never passed to the `ArrowScan` class or the `FileIO` instance during
`to_arrow()` / `to_pandas()`.
`ArrowScan` relies entirely on the unmodified `self.io` object originally
initialized by the catalog:
```
# In pyiceberg/table/__init__.py -> DataScan.to_arrow()
return ArrowScan(
self.table_metadata,
self.io, # <--- options are missing here!
self.projection(),
self.row_filter,
self.case_sensitive,
self.limit
).to_table(self.plan_files())
```
### Environment:
- PyIceberg Version: 0.11.1 (and earlier)
- PyArrow Version: 18.0.0
- Storage: Ceph S3 / Rados Gateway (RGW)
### Suggested Fix:
Ideally, `DataScan` should merge its `options` with `self.io.properties` and
instantiate a new `FileIO` for the scan, or `ArrowScan` should be modified to
accept the scan-level options and apply them dynamically to the `FileSystem`
instance before reading files.
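The merge semantics could look like the following minimal sketch (the helper name `merged_io_properties` is illustrative, not an existing pyiceberg API, and the endpoint value is a made-up example); scan-level options win on key conflicts:

```
# Illustrative sketch of the proposed property merge (hypothetical helper,
# not actual pyiceberg API): scan-level options override catalog-level
# FileIO properties for the duration of the scan.
def merged_io_properties(io_properties: dict, scan_options: dict) -> dict:
    merged = dict(io_properties)  # start from catalog-level properties
    merged.update(scan_options)   # scan-level options take precedence
    return merged

catalog_props = {"s3.endpoint": "http://rgw:7480", "s3.connect-timeout": "10.0"}
scan_opts = {"s3.connect-timeout": "600.0", "s3.request-timeout": "600.0"}

props = merged_io_properties(catalog_props, scan_opts)
# props["s3.connect-timeout"] is now "600.0" while the endpoint is preserved
```

The merged dict would then be used to construct a fresh `FileIO` for the scan, leaving `table.io` untouched for other callers.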
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]