kbendick commented on a change in pull request #3334:
URL: https://github.com/apache/iceberg/pull/3334#discussion_r733256251
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
+`expire_snapshot` calls will be able to physically delete the added files. This method should not be used if
+`migrate` or `snapshot` are possible.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Table to which files will be added |
+| `source_table`| ✔️ | string | Table where files should come from; paths are also possible in the form of `file_format`.`path` |
+| `partition_filter` | ️ | map<string, string> | A map of partitions in the source table to import from |
+
+Warning: Schema is not validated; adding files with a different schema to the Iceberg table will cause issues.
+
+Warning: Files added by this method can be physically deleted by Iceberg operations.
+
+#### Examples
+
+Add the files from table `db.src_tbl`, a Hive or Spark table registered in the session catalog, to the Iceberg table
+`db.tbl`. Only add files that exist within partitions where `part_col_1` is equal to `A`.
+```sql
+CALL spark_catalog.system.add_files(
+  table => 'db.tbl',
+  source_table => 'db.src_tbl',
+  partition_filter => map('part_col_1', 'A')
+)
+```
+
+Add files from a `parquet` file based table at location `path/to/table` to the Iceberg table `db.tbl`. Add all
+files regardless of what partition they belong to.
+```sql
+CALL spark_catalog.system.add_files(
+ table => 'db.tbl',
+ source_table => '`parquet`.`path/to/table`'
+)
+```
Review comment:
I agree that it would be good to include an example with the URI.
Although @samredai is working on a docs refactor, I am always a believer in more examples.
Maybe once this gets merged, you could help add an example or two as well @viniolivieri. Help with the docs is always hugely appreciated, and the community is about to embark on a relatively large docs refactoring, so we could definitely use your help!
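For reference, a URI-based call might look something like the following. This is only a sketch: the bucket and path are invented for illustration, and the exact path forms supported depend on the catalog and filesystem configuration.
```sql
-- Hypothetical sketch: importing Parquet files addressed by a full URI.
-- The s3 bucket and path below are invented for illustration only.
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => '`parquet`.`s3://my-bucket/path/to/table`'
)
```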
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
+`expire_snapshot` calls will be able to physically delete the added files. This method should not be used if
+`migrate` or `snapshot` are possible.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Table to which files will be added |
+| `source_table`| ✔️ | string | Table where files should come from; paths are also possible in the form of `file_format`.`path` |
+| `partition_filter` | ️ | map<string, string> | A map of partitions in the source table to import from |
+
+Warning: Schema is not validated; adding files with a different schema to the Iceberg table will cause issues.
+
+Warning: Files added by this method can be physically deleted by Iceberg operations.
+
+#### Examples
+
+Add the files from table `db.src_tbl`, a Hive or Spark table registered in the session catalog, to the Iceberg table
+`db.tbl`. Only add files that exist within partitions where `part_col_1` is equal to `A`.
+```sql
+CALL spark_catalog.system.add_files(
+  table => 'db.tbl',
+  source_table => 'db.src_tbl',
+  partition_filter => map('part_col_1', 'A')
+)
+```
+
+Add files from a `parquet` file based table at location `path/to/table` to the Iceberg table `db.tbl`. Add all
Review comment:
I agree that this write up doesn't really make it clear that `add_files` can be used to add individual files, and not just tables / portions of tables / directories (and even then it mostly says "table", though I get that a directory and a table are pretty much the same thing in some senses of the word).
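As a sketch of what a narrower example could look like, assuming the path-based form accepts a path below the table root (the directory layout here is invented for illustration, not taken from the docs):
```sql
-- Hypothetical sketch: pointing source_table at a single partition
-- directory rather than the table root. The layout is invented.
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => '`parquet`.`path/to/table/part_col_1=A`'
)
```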
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
Review comment:
Nit: extra space between `set of files` and `owned by Iceberg`.
Also, maybe it would be better phrased as `Upon completion, the Iceberg table will then have ownership of these files, and will be responsible for the lifecycle of the files the way it would any files written by other means.`?
For me, the phrasing that makes it most clear is "the Iceberg table will own". It's not so much that Iceberg owns it, but that this particular table treats it as though it's a file it wrote, as far as users are concerned (obviously there are caveats, but from a high level).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]