szehon-ho commented on code in PR #11396:
URL: https://github.com/apache/iceberg/pull/11396#discussion_r1824797640
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
Review Comment:
I think the original javadoc wording works well here too; how about taking from that:
```
Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning.
```
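For context, a minimal usage sketch of this option (the spec id value `1` is illustrative only, not taken from this PR; it must reference a partition spec that already exists on the table):
```sql
-- Rewrite data files so their layout matches an existing partition spec.
-- '1' is a hypothetical spec id; look up real ids in the table's metadata.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('output-spec-id', '1'));
```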
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |
Review Comment:
This is just my thought, but the last sentence could be clarified a little: 'This will generate an additional commit for the removal'
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |
+!!! info
Review Comment:
Nit: should we also add a note about position deletes? e.g. appending:
', nor to position delete files containing position deletes no longer matching any live data files.'
##########
docs/docs/spark-procedures.md:
##########
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | same as `partial-progress.max-commits` | Maximum amount of failed commits that this rewrite can have before job failure, if partial progress is enabled |
Review Comment:
i think javadoc says 'that this rewrite can have' => 'is allowed'
Opt: same as => value of? (very slightly clearer?)
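For reference, a hedged sketch combining the partial-progress options discussed above (the numeric values are illustrative, not recommendations from this PR):
```sql
-- Commit completed file groups incrementally instead of in one final commit,
-- tolerating up to 5 failed commits before the job is failed.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'partial-progress.max-failed-commits', '5'
  )
);
```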
##########
docs/docs/spark-procedures.md:
##########
@@ -447,9 +453,9 @@ Using the same defaults as bin-pack to determine which files to rewrite.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)');
```
-Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where more than 2 or more files need to be rewritten.
+Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where at least two files need rewriting, and then remove any dangling delete files.
```sql
-CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'));
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2','remove-dangling-deletes', 'true'));
Review Comment:
Do you think we can make the spacing consistent throughout the map?
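For illustration, the same call with consistent spacing throughout the map (only whitespace differs from the line in the diff):
```sql
-- Identical options; a space follows every comma in the map for consistency.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files', '2', 'remove-dangling-deletes', 'true'));
```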
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.