dongsupkim-onepredict opened a new issue, #3166:
URL: https://github.com/apache/iceberg-python/issues/3166

   ### Apache Iceberg version
   
   0.11.0 (latest release)
   
   ### Please describe the bug 🐞
   
   
   **Description:**
   When passing an `options` dictionary to `Table.scan(options=...)`, the 
properties (such as `s3.connect-timeout` or `s3.request-timeout`) are accepted 
by the `DataScan` object but are never propagated to the underlying `FileIO` 
(e.g., `PyArrowFileIO`) when actual data materialization occurs via methods 
like `to_pandas()` or `to_arrow()`. 
   Because `ArrowScan` is initialized with the `FileIO` that was created during 
catalog instantiation (`table.io`), any S3-specific configurations provided at 
the scan level are completely bypassed. This causes operations reading numerous 
manifest files to fall back to the AWS C++ SDK default timeouts (often 
10s-30s), leading to unexpected `curlCode: 28 (Timeout was reached)` errors 
even when generous timeouts are explicitly requested in the scan options.
   
   **Steps to Reproduce:**
   ```python
   # 1. Load catalog with default (or no) S3 timeout properties
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("my_catalog", **{
       "uri": "...",
       "s3.endpoint": "..."
   })
   table = catalog.load_table("my_namespace.my_table")

   # 2. Attempt to scan with explicit S3 timeout options
   scan_options = {
       "s3.connect-timeout": "600.0",
       "s3.request-timeout": "600.0"
   }

   # The options are accepted by DataScan...
   scan = table.scan(options=scan_options)

   # 3. ...but are completely ignored during S3 I/O (ArrowScan).
   # This may raise a timeout error if RGW/S3 latency spikes,
   # ignoring the 600 s setting requested above.
   df = scan.to_pandas()
   ```
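As a stopgap until scan-level options propagate, the same `s3.*` keys can be supplied as catalog properties, since the `FileIO` created at `load_catalog` time does honor them. A config sketch via `~/.pyiceberg.yaml` (catalog name and elided `...` values mirror the snippet above):

```yaml
# ~/.pyiceberg.yaml -- config sketch; "..." values are placeholders
catalog:
  my_catalog:
    uri: "..."
    s3.endpoint: "..."
    s3.connect-timeout: "600.0"
    s3.request-timeout: "600.0"
```

This sets the timeouts globally for every scan against that catalog, which is exactly why per-scan `options` would still be useful.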
   ### Expected Behavior:
   Properties passed via `options` in `Table.scan()` should cascade down and either update or override `table.io.properties` for the duration of the scan. Specifically, `s3.*` configurations should be respected by the underlying `FileIO` (e.g., `PyArrowFileIO`) when downloading manifest lists or data files.
   ### Actual Behavior:
   The options passed to `Table.scan()` are stored in the `DataScan` instance but are never passed to the `ArrowScan` class or the `FileIO` instance during `to_arrow()` / `to_pandas()`.
   `ArrowScan` relies entirely on the unmodified `self.io` object originally initialized by the catalog:
   ```python
   # In pyiceberg/table/__init__.py -> DataScan.to_arrow()
   return ArrowScan(
       self.table_metadata,
       self.io,  # <--- scan options are missing here!
       self.projection(),
       self.row_filter,
       self.case_sensitive,
       self.limit,
   ).to_table(self.plan_files())
   ```
   ### Environment:
   - PyIceberg Version: 0.11.1 (and earlier)
   - PyArrow Version: 18.0.0
   - Storage: Ceph S3 / Rados Gateway (RGW)
   ### Suggested Fix:
   Ideally, `DataScan` should merge its `options` with `self.io.properties` and instantiate a new `FileIO`, or `ArrowScan` should be modified to accept the scan-level options and apply them to the `FileSystem` instance before reading files.
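The merge itself is simple; the real change is the wiring inside `DataScan.to_arrow()`. A minimal sketch of the intended precedence, with scan-level options overriding the catalog-created `FileIO` properties (the helper name and the wiring comment are hypothetical, not existing PyIceberg API):

```python
from typing import Dict

def merged_scan_properties(io_properties: Dict[str, str],
                           scan_options: Dict[str, str]) -> Dict[str, str]:
    # Scan-level options win over catalog-level FileIO properties.
    return {**io_properties, **scan_options}

# Hypothetical wiring inside DataScan.to_arrow(), sketched against the
# snippet quoted above:
#
#   io = load_file_io(
#       merged_scan_properties(self.io.properties, self.options),
#       self.table_metadata.location,
#   )
#   return ArrowScan(self.table_metadata, io, ...).to_table(self.plan_files())

props = merged_scan_properties(
    {"s3.endpoint": "http://rgw.local:7480", "s3.connect-timeout": "10.0"},
    {"s3.connect-timeout": "600.0", "s3.request-timeout": "600.0"},
)
print(props["s3.connect-timeout"])  # -> 600.0
```

Building a fresh, scan-scoped `FileIO` from the merged dict keeps `table.io` untouched for other scans, which matches the "for the duration of the scan" expectation above.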
   
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

