kevinjqliu commented on code in PR #1453:
URL: https://github.com/apache/iceberg-python/pull/1453#discussion_r1902219038
##########
mkdocs/docs/configuration.md:
##########
@@ -102,21 +102,21 @@ For the FileIO there are several configuration options
available:
<!-- markdown-link-check-disable -->
-| Key | Example | Description
|
-|----------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| s3.endpoint | <https://10.0.19.25/> | Configure an alternative
endpoint of the S3 service for the FileIO to access. This could be used to use
S3FileIO with any s3-compatible object storage service that has a different
endpoint, or access a private S3 endpoint in a virtual private cloud. |
-| s3.access-key-id | admin | Configure the static
access key id used to access the FileIO.
|
-| s3.secret-access-key | password | Configure the static
secret access key used to access the FileIO.
|
-| s3.session-token | AQoDYXdzEJr... | Configure the static
session token used to access the FileIO.
|
-| s3.role-session-name | session | An optional
identifier for the assumed role session.
|
-| s3.role-arn | arn:aws:... | AWS Role ARN. If
provided instead of access_key and secret_key, temporary credentials will be
fetched by assuming this role.
|
-| s3.signer | bearer | Configure the signature
version of the FileIO.
|
-| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote
signing uri if it differs from the catalog uri. Remote signing is only
implemented for `FsspecFileIO`. The final request is sent to
`<s3.signer.uri>/<s3.signer.endpoint>`.
|
-| s3.signer.endpoint | v1/main/s3-sign | Configure the remote
signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The
final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default :
v1/aws/s3/sign). |
-| s3.region | us-west-2 | Sets the region of the
bucket
|
-| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy
server to be used by the FileIO.
|
-| s3.connect-timeout | 60.0 | Configure socket
connection timeout, in seconds.
|
-| s3.force-virtual-addressing | False | Whether to use
virtual addressing of buckets. If true, then virtual addressing is always
enabled. If false, then virtual addressing is only enabled if endpoint_override
is empty. This can be used for non-AWS backends that only support virtual
hosted-style access.
|
+| Key | Example | Description
|
+|----------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| s3.endpoint | <https://10.0.19.25/> | Configure an alternative
endpoint of the S3 service for the FileIO to access. This could be used to use
S3FileIO with any s3-compatible object storage service that has a different
endpoint, or access a private S3 endpoint in a virtual private cloud.
|
+| s3.access-key-id | admin | Configure the static
access key id used to access the FileIO.
|
+| s3.secret-access-key | password | Configure the static
secret access key used to access the FileIO.
|
+| s3.session-token | AQoDYXdzEJr... | Configure the static
session token used to access the FileIO.
|
+| s3.role-session-name | session | An optional
identifier for the assumed role session.
|
+| s3.role-arn | arn:aws:... | AWS Role ARN. If
provided instead of access_key and secret_key, temporary credentials will be
fetched by assuming this role.
|
+| s3.signer | bearer | Configure the signature
version of the FileIO.
|
+| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote
signing uri if it differs from the catalog uri. Remote signing is only
implemented for `FsspecFileIO`. The final request is sent to
`<s3.signer.uri>/<s3.signer.endpoint>`.
|
+| s3.signer.endpoint | v1/main/s3-sign | Configure the remote
signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The
final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default :
v1/aws/s3/sign).
|
+| s3.region | us-west-2 | Configure the default
region used to initialize an S3FileSystem. This setting will be overwritten if
the bucket actually used resolves to a different region.
|
Review Comment:
```suggestion
| s3.region | us-west-2 | Configure the default
region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to
automatically resolve the region for each S3 bucket, falling back to this value
if resolution fails.
|
```
##########
pyiceberg/io/pyarrow.py:
##########
@@ -362,6 +362,12 @@ def _initialize_fs(self, scheme: str, netloc:
Optional[str] = None) -> FileSyste
"region": get_first_property_value(self.properties, S3_REGION,
AWS_REGION),
}
+ # Override the default s3.region if netloc(bucket) resolves to a
different region
+ try:
+ client_kwargs["region"] = resolve_s3_region(netloc)
Review Comment:
I think there are these 3 cases we're worried about:
```
# region match
region=us-east-1
s3://foo-us-east-1/
s3://bar-us-east-1/
# region mismatch
region=us-west-2
s3://foo-us-east-1/
s3://bar-us-west-2/
# region not provided
region=None
s3://foo-us-east-1/
s3://bar-us-west-2/
```
We have 2 options here
1. use `region` when provided, fallback to `resolve_s3_region`
2. always use `resolve_s3_region`
3. `resolve_s3_region`, fall back to `region`
Option 1 is difficult since we dont know that the provided `region` is wrong
until we try to use the FileIO.
The code above uses option 2 which will always make an extra call to S3 to
get the correct bucket region. This extra call to S3 is cached though, so
`resolve_s3_region` is only called once per bucket.
This is similar to the `cache_regions` option for
[s3fs.core.S3FileSystem](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem)
I like option 3, we can resolve the bucket region and fallback to the
provided `region`. It might be confusing to the enduser when a `region` is
specified but the FileIO uses a different region, so lets add a warning for
that.
Something like this
```
# Attempt to resolve the S3 region for the bucket, falling back to
configured region if resolution fails
# Note, bucket resolution is cached and only called once per bucket
provided_region = get_first_property_value(self.properties, S3_REGION,
AWS_REGION)
try:
bucket_region = resolve_s3_region(bucket=netloc)
except (OSError, TypeError):
bucket_region = None
logger.warning(f"Unable to resolve region for bucket {netloc}, using
default region {provided_region}")
if bucket_region and bucket_region != provided_region:
logger.warning(
f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
f"provided region {provided_region}, actual region {bucket_region}"
)
region = bucket_region or provided_region
client_kwargs: Dict[str, Any] = {
"endpoint_override": self.properties.get(S3_ENDPOINT),
"access_key": get_first_property_value(self.properties,
S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
"secret_key": get_first_property_value(self.properties,
S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
"session_token": get_first_property_value(self.properties,
S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
"region": region,
}
```
##########
tests/io/test_pyarrow.py:
##########
@@ -2074,3 +2077,65 @@ def
test__to_requested_schema_timestamps_without_downcast_raises_exception(
_to_requested_schema(requested_schema, file_schema, batch,
downcast_ns_timestamp_to_us=False, include_field_ids=False)
assert "Unsupported schema projection from timestamp[ns] to timestamp[us]"
in str(exc_info.value)
+
Review Comment:
nit: add a test using both LocalFilesystem with S3Filesystem for #1041
##########
pyiceberg/io/pyarrow.py:
##########
@@ -352,7 +352,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) ->
FileSystem:
if scheme in {"s3", "s3a", "s3n", "oss"}:
- from pyarrow.fs import S3FileSystem
+ from pyarrow.fs import S3FileSystem, resolve_s3_region
Review Comment:
nit: since `oss` scheme uses this path, does it also support
`resolve_s3_region`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]