danielcweeks commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3174557454


##########
format/spec.md:
##########
@@ -168,6 +188,35 @@ All columns must be written to data files even if they 
introduce redundancy with
 
 Writers are not allowed to commit files with a partition spec that contains a 
field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two 
types:
+
+* **Absolute path** -- A path string that includes a [URI 
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., 
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without 
modification.
+* **Relative path** -- A path string that does not include a URI scheme. 
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, 
path fields may contain either absolute or relative paths. Directory navigation 
symbols (`.` and `..`) and other file system conventions are not supported in 
relative paths.

Review Comment:
   I will update this to refer to pre-v4 paths as 'fully-qualified', but the 
spec was not specific about that, with the exception of the description for 
the `referenced_data_file` field.
   
   Paths that start with `/`, like your example, may exist for local paths, but 
they are not limited to local paths and are very common for HDFS.  The 
`core-site/hdfs-site` configuration typically sets a default file system that 
includes the scheme and namenode address (e.g. 
`fs.defaultFS=hdfs://namenode-host:8020`).  All root-based references that 
start with `/` are then resolved against the default fs value.
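
   As a rough illustration of that resolution (a sketch only, not Hadoop's 
actual implementation), the hypothetical `qualify` helper below prepends a 
default file system URI to a scheme-less path that starts with `/`. The helper 
name and the sample warehouse path are assumptions made for the example; only 
the `fs.defaultFS` value is taken from the text above.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def qualify(path: str, default_fs: str) -> str:
    """Qualify a root-based, scheme-less path against a default file system.

    Hypothetical helper: mimics the effect of resolving `/...` paths
    against an `fs.defaultFS`-style value.
    """
    if _SCHEME.match(path):
        # Already carries a scheme, so it is used as-is.
        return path
    if not path.startswith("/"):
        raise ValueError(f"not a root-based path: {path!r}")
    # Avoid a double slash if the default FS value ends with '/'.
    return default_fs.rstrip("/") + path


default_fs = "hdfs://namenode-host:8020"
print(qualify("/warehouse/db/tbl/metadata/v1.metadata.json", default_fs))
# hdfs://namenode-host:8020/warehouse/db/tbl/metadata/v1.metadata.json
```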
   
   This means that there are many scenarios with HDFS where you will have these 
types of paths in existing metadata. Moving forward to v4, we would require 
that these be canonicalized to `hdfs://...` (the default namenode can still be 
omitted so as not to require encoding it in the path).  Cloud providers do 
not have this issue unless similarly configured for hdfs, but that's not 
typical.
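
   To make the consequence concrete, here is a minimal sketch of the 
classification rule from the quoted spec text: a path with a URI scheme is 
absolute, anything else is relative. The names `classify` and 
`has_navigation_segments` and the sample paths are hypothetical. Note that a 
scheme-less `/...` path would be classified as relative under this rule, which 
is why such paths would need to be canonicalized for v4.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def classify(path: str) -> str:
    """Return 'absolute' if the path starts with a URI scheme, else 'relative'."""
    return "absolute" if _SCHEME.match(path) else "relative"


def has_navigation_segments(path: str) -> bool:
    """Return True if any '/'-separated segment is '.' or '..'."""
    return any(segment in (".", "..") for segment in path.split("/"))


for p in (
    "s3://bucket/tbl/data/f.parquet",
    "data/f.parquet",
    "/warehouse/tbl/data/f.parquet",  # scheme-less HDFS-style path
):
    print(p, "->", classify(p))
```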
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

