tomtongue commented on code in PR #6352:
URL: https://github.com/apache/iceberg/pull/6352#discussion_r1046599507
##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java:
##########
@@ -74,17 +74,14 @@ class S3URI {
this.scheme = schemeSplit[0];
String[] authoritySplit = schemeSplit[1].split(PATH_DELIM, 2);
- ValidationException.check(
- authoritySplit.length == 2, "Invalid S3 URI, cannot determine bucket:
%s", location);
- ValidationException.check(
- !authoritySplit[1].trim().isEmpty(), "Invalid S3 URI, path is empty:
%s", location);
+
Review Comment:
Thanks for reviewing, Daniel.
> If you use the bucket as the location of a table, you basically cannot use
that bucket for anything else. Things like orphan files procedure will wipe
anything else in the bucket out.
>
> Is there a requirement that you cannot have a bucket prefix for some use
case?
Yes, I agree this, but I understand there are users who build their Iceberg
tables on each S3 bucket with some configurations such as encryptions, bucket
policy, lifecycle policy etc. Currently, Iceberg doesn’t allow users to
directly create an Iceberg table at S3 bucket, but I think it’s possible to
provide creating an Iceberg table with the users.
Let me share more background for this PR. This PR is based on the following
situation I confirmed:
1. You initially build an Iceberg table at direct S3 bucket like
`s3://bucket` using Athena or Spark (we can also build an Iceberg table at
direct S3 bucket with `LOCATION` clause in `CREATE TABLE` DDL)
2. Then, you write data into the Iceberg table by write operation such as
`INSERT INTO`. This is successful.
3. If we run a read query (e.g. `SELECT`) for the Iceberg table after
writing data, it fails with `org.apache.iceberg.exceptions.ValidationException
: Invalid S3 URI, path is empty: s3://bucket`.
Users who have this situation need to re-create the Iceberg table at s3
bucket with path(s) such as `s3://bucket/path`. I confirmed there are users who
had this situation.
To avoid this situation, we can have three choices (in my understanding):
* Just showing warning message to users if they directly create an Iceberg
table at s3 bucket
* Not allowing users to directly create an Iceberg table at s3 bucket
* Allowing users to directly create an Iceberg table at s3 bucket
Initially I thought adding the restriction of table creation in regards to
the first two choices. However based on the following considerations, I choose
the last (third) one and decided to submit this PR.
* For the first one, it’s not difficult because just putting warning message
in the Iceberg code. But this sometimes wouldn’t work well for users to avoid
creating an Iceberg table at direct s3 bucket.
* The second one needs to add the restriction of table creation at direct s3
bucket when users specify a s3 bucket name only for an Iceberg table location.
However, this restriction forces users to use s3 bucket with paths and it isn’t
consistent with other software behavior that allows to use direct s3 bucket.
Sorry for my long comments, if there’s my misunderstandings, please let me
know.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]