wypoon commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3139967831


##########
format/spec.md:
##########
@@ -168,6 +188,35 @@ All columns must be written to data files even if they 
introduce redundancy with
 
 Writers are not allowed to commit files with a partition spec that contains a 
field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two 
types:
+
+* **Absolute path** -- A path string that includes a [URI 
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., 
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without 
modification.
+* **Relative path** -- A path string that does not include a URI scheme. 
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, 
path fields may contain either absolute or relative paths. Directory navigation 
symbols (`.` and `..`) and other file system conventions are not supported in 
relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative 
path by combining it with the table's base location. If a path is absolute, it 
is used as-is. If a path is relative, it is concatenated with the table 
location to produce an absolute path:
+
+* If the path contains a URI scheme, it is absolute and is used without 
modification.
+* If the path does not contain a URI scheme, the resolved path is the table 
location followed by the relative path.
+
+Paths used as prefixes must not end in a path separator. The relative portion 
is appended to the prefix without introduction of any additional separator 
characters.
+
+#### Path Relativization
+
+Path relativization is the process of converting an absolute path to a 
relative path by removing the table location prefix. This is used when 
persisting paths to metadata files.
+
+* If an absolute path starts with the table location, the table location 
prefix should be removed and the remaining relative portion stored.

Review Comment:
   The table location must not end in a path separator; does the "remaining 
relative portion" include the beginning path separator or not?
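
To make the question concrete, here is a minimal sketch of relativization under the strict reading of the proposed text, where the prefix is removed and nothing else is trimmed or added. The function name `relativize` and the example file name are hypothetical; they do not come from the PR or from any Iceberg API.

```python
def relativize(path: str, table_location: str) -> str:
    """Strip the table location prefix and keep the remainder unchanged.

    Assumption: strict reading of the spec text, with no separator handling.
    """
    if path.startswith(table_location):
        return path[len(table_location):]
    return path  # not under the table location: stored as-is


table_location = "hdfs://ns/wh/foo.db/bar"  # does not end in a separator
absolute = table_location + "/data/file.parquet"  # hypothetical data file

relative = relativize(absolute, table_location)
print(relative)
```

Under this reading the stored portion is `/data/file.parquet`, leading separator included, and plain concatenation with the table location restores the original path. If the separator is meant to be dropped instead, resolution would have to add it back.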



##########
format/spec.md:
##########
@@ -168,6 +188,35 @@ All columns must be written to data files even if they 
introduce redundancy with
 
 Writers are not allowed to commit files with a partition spec that contains a 
field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two 
types:
+
+* **Absolute path** -- A path string that includes a [URI 
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., 
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without 
modification.
+* **Relative path** -- A path string that does not include a URI scheme. 
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, 
path fields may contain either absolute or relative paths. Directory navigation 
symbols (`.` and `..`) and other file system conventions are not supported in 
relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative 
path by combining it with the table's base location. If a path is absolute, it 
is used as-is. If a path is relative, it is concatenated with the table 
location to produce an absolute path:
+
+* If the path contains a URI scheme, it is absolute and is used without 
modification.
+* If the path does not contain a URI scheme, the resolved path is the table 
location followed by the relative path.
+
+Paths used as prefixes must not end in a path separator. The relative portion 
is appended to the prefix without introduction of any additional separator 
characters.

Review Comment:
   For my understanding, can you please clarify:
   Suppose the table location is "hdfs://ns/wh/foo.db/bar", and we are using 
relative paths; would a relative path for a data file look like 
"/data/00000-16-someuuid-0-00001.parquet", so that the absolute path is 
"hdfs://ns/wh/foo.db/bar/data/00000-16-someuuid-0-00001.parquet"? Or should 
the relative path for the data file be "data/00000-16-someuuid-0-00001.parquet", 
with the separator added during resolution?
   Further below, you discuss the `write.data.path`, which you state defaults 
to the value `data` if unspecified, and you write:
   ```
   * If `write.data.path` is a relative path, the base is the table location 
followed by the `write.data.path` value.
   ```
   If the table location must not end in a path separator, and we're using 
"data" for `write.data.path`, then we need to add the separator before the 
"data", don't we? Is "followed by" meant in a loose sense, or in a strict sense 
of being "appended ... without ... any additional separator characters"?
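
The two readings can be compared side by side. This is only a sketch: `has_scheme`, `resolve_strict`, and `resolve_loose` are hypothetical names, and the scheme check is a simplified regular expression based on the RFC 3986 scheme grammar, not anything defined in the PR.

```python
import re

# RFC 3986 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def has_scheme(path: str) -> bool:
    return _SCHEME.match(path) is not None


def resolve_strict(path: str, table_location: str) -> str:
    """Strict reading: append with no additional separator characters."""
    return path if has_scheme(path) else table_location + path


def resolve_loose(path: str, table_location: str) -> str:
    """Loose reading: a "/" is inserted between the location and the path."""
    return path if has_scheme(path) else table_location + "/" + path


loc = "hdfs://ns/wh/foo.db/bar"
name = "00000-16-someuuid-0-00001.parquet"

print(resolve_strict("/data/" + name, loc))  # relative path carries the separator
print(resolve_loose("data/" + name, loc))    # resolver supplies the separator
print(resolve_strict("data/" + name, loc))   # strict append without a separator
```

The first two calls both produce `hdfs://ns/wh/foo.db/bar/data/00000-16-someuuid-0-00001.parquet`. The third produces `hdfs://ns/wh/foo.db/bardata/00000-16-someuuid-0-00001.parquet`, which is what a strict append of `data` to the table location would give.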
   



##########
format/spec.md:
##########
@@ -1767,6 +1840,24 @@ Note that these requirements apply when writing data to 
a v2 table. Tables that
 
 This section covers topics not required by the specification but 
recommendations for systems implementing the Iceberg specification to help 
maintain a uniform experience.
 
+### Path Construction
+
+Path construction is the process by which new file locations are created for 
output files referenced by metadata. While the specific construction logic is 
not strictly required by the spec, the following guidance is provided for 
reference implementations to encourage consistency.
+
+The table properties `write.metadata.path` and `write.data.path` control where 
metadata and data files are written relative to the table location. When not 
specified, these default to the values `metadata` and `data` respectively.

Review Comment:
   I don't believe that `write.metadata.path` and `write.data.path` are 
mentioned in the spec before this. However, my understanding of the table 
properties according to docs/docs/configuration.md
   ```
   | write.data.path     | table location + /data     | Base location for data files     |
   | write.metadata.path | table location + /metadata | Base location for metadata files |
   ```
   is that `write.metadata.path` and `write.data.path` default to table 
location + /metadata and table location + /data respectively.
   In other words, since relative paths are not supported up to now, the 
default values of `write.metadata.path` and `write.data.path` have to be the 
absolute paths given by the above.
   
   I believe that we now want to allow `write.metadata.path` and 
`write.data.path` to be relative paths, and want the defaults to be `metadata` 
and `data` respectively. Would you agree?
   
   I think it'd be more accurate to write:
   
   "The table properties `write.metadata.path` and `write.data.path` control 
where metadata and data files are written. When not specified, these default 
respectively to `metadata` and `data` folders under the table location. As we 
now allow paths to be relative, the default values for the properties would be 
`metadata` and `data` respectively."
   
   I know that you're trying to avoid words like folder and directory, but I 
couldn't think of how to express the idea of something "under a table location" 
otherwise.
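
As a rough illustration of the suggested wording, the sketch below resolves the two properties with relative defaults. The helper `base_location` is hypothetical, and it assumes a separator is inserted between the table location and a relative value, which is exactly the point still under discussion.

```python
from typing import Dict
from urllib.parse import urlsplit


def base_location(table_location: str, properties: Dict[str, str],
                  key: str, default: str) -> str:
    """Return the base location for files governed by the property `key`."""
    value = properties.get(key, default)
    if urlsplit(value).scheme:
        return value  # absolute: used as-is
    # Assumption: a "/" is added between the table location and the value.
    return table_location + "/" + value


loc = "hdfs://ns/wh/foo.db/bar"

# Relative defaults give the same result as the documented absolute defaults.
print(base_location(loc, {}, "write.data.path", "data"))
print(base_location(loc, {}, "write.metadata.path", "metadata"))

# An absolute override is returned unchanged (hypothetical bucket name).
props = {"write.data.path": "s3://bucket/custom-data"}
print(base_location(loc, props, "write.data.path", "data"))
```

With no properties set, the results are `hdfs://ns/wh/foo.db/bar/data` and `hdfs://ns/wh/foo.db/bar/metadata`, matching "table location + /data" and "table location + /metadata" from docs/docs/configuration.md.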



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

