kbendick commented on a change in pull request #3334:
URL: https://github.com/apache/iceberg/pull/3334#discussion_r733256251
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
+`expire_snapshot` calls will be able to physically delete the added files. This method should not be used if
+`migrate` or `snapshot` are possible.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Table to which files will be added |
+| `source_table`| ✔️ | string | Table where files should come from; paths are also possible in the form of `file_format`.`path` |
+| `partition_filter` | ️ | map<string, string> | A map of partitions in the source table to import from |
+
+Warning: Schema is not validated; adding files with a different schema to the Iceberg table will cause issues.
+
+Warning: Files added by this method can be physically deleted by Iceberg operations.
+
+#### Examples
+
+Add the files from table `db.src_tbl`, a Hive or Spark table registered in the session catalog, to the Iceberg table
+`db.tbl`. Only add files that exist within partitions where `part_col_1` is equal to `A`.
+```sql
+CALL spark_catalog.system.add_files(
+  table => 'db.tbl',
+  source_table => 'db.src_tbl',
+  partition_filter => map('part_col_1', 'A')
+)
+```
+
+Add files from a `parquet` file based table at location `path/to/table` to the Iceberg table `db.tbl`. Add all
+files regardless of what partition they belong to.
+```sql
+CALL spark_catalog.system.add_files(
+ table => 'db.tbl',
+ source_table => '`parquet`.`path/to/table`'
+)
+```
Review comment:
I agree that it would be good to include an example with the URI.
Although @samredai is working on a docs refactor, I am always a believer in more examples.
Maybe once this gets merged, you could help add an example or two as well @viniolivieri. Help with the docs is always hugely appreciated, and the community is about to embark on a relatively large docs refactoring, so we could definitely use your help!
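For reference, a URI-based call might look something like the following. This is only a sketch: the bucket and path are invented for illustration, and the exact path forms supported depend on the catalog and filesystem configuration.
```sql
-- Hypothetical sketch: importing Parquet files addressed by a full URI.
-- The s3 bucket and path below are invented for illustration only.
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => '`parquet`.`s3://my-bucket/path/to/table`'
)
```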
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
+`expire_snapshot` calls will be able to physically delete the added files. This method should not be used if
+`migrate` or `snapshot` are possible.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Table to which files will be added |
+| `source_table`| ✔️ | string | Table where files should come from; paths are also possible in the form of `file_format`.`path` |
+| `partition_filter` | ️ | map<string, string> | A map of partitions in the source table to import from |
+
+Warning: Schema is not validated; adding files with a different schema to the Iceberg table will cause issues.
+
+Warning: Files added by this method can be physically deleted by Iceberg operations.
+
+#### Examples
+
+Add the files from table `db.src_tbl`, a Hive or Spark table registered in the session catalog, to the Iceberg table
+`db.tbl`. Only add files that exist within partitions where `part_col_1` is equal to `A`.
+```sql
+CALL spark_catalog.system.add_files(
+  table => 'db.tbl',
+  source_table => 'db.src_tbl',
+  partition_filter => map('part_col_1', 'A')
+)
+```
+
+Add files from a `parquet` file based table at location `path/to/table` to the Iceberg table `db.tbl`. Add all
Review comment:
I agree that this write up doesn't really make it clear that `add_files` can be used to add individual files, and not just tables / portions of tables / directories (and even then it mostly says "table", though I get that a directory and a table are pretty much the same thing in some senses of the word).
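As a sketch of what a narrower example could look like, assuming the path-based form accepts a path below the table root (the directory layout here is invented for illustration, not taken from the docs):
```sql
-- Hypothetical sketch: pointing source_table at a single partition
-- directory rather than the table root. The layout is invented.
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => '`parquet`.`path/to/table/part_col_1=A`'
)
```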
##########
File path: site/docs/spark-procedures.md
##########
@@ -365,3 +365,45 @@ Migrate `db.sample` in the current catalog to an Iceberg table without adding an
CALL catalog_name.system.migrate('db.sample')
```
+### add_files
+
+Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike migrate or
+snapshot, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table.
+This command will create metadata for the new files and will not move them. This procedure will not analyze the schema
+of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table
+will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent
Review comment:
Nit: extra space between `set of files` and `owned by Iceberg`.
Also, maybe it would be better phrased as `Upon completion, the Iceberg table will then have ownership of these files, and will be responsible for the lifecycle of the files the way it would any files written by other means.`?
For me, the phrasing that makes it most clear is "the Iceberg table will own". It's not so much that Iceberg owns it, but that this particular table treats it as though it's a file it wrote, as far as users are concerned (obviously there are caveats, but from a high level).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]