szehon-ho commented on code in PR #11396:
URL: https://github.com/apache/iceberg/pull/11396#discussion_r1824797640
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
Review Comment:
I think the original javadoc wording works well here too; how about taking from that:
```
Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning.
```
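For context, a minimal usage sketch of this option (the spec id value `1` is illustrative only, not taken from this PR; it must reference a partition spec that already exists on the table):
```sql
-- Rewrite data files so their layout matches an existing partition spec.
-- '1' is a hypothetical spec id; look up real ids in the table's metadata.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('output-spec-id', '1'));
```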
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |
Review Comment:
This is just my thought, but the last sentence could be clarified a little: 'This will generate an additional commit for the removal'
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |
+!!! info
Review Comment:
Nit: should we also add a note about position deletes? e.g. appending:
', nor to position delete files containing position deletes no longer matching any live data files.'
##########
docs/docs/spark-procedures.md:
##########
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | same as `partial-progress.max-commits` | Maximum amount of failed commits that this rewrite can have before job failure, if partial progress is enabled |
Review Comment:
i think javadoc says 'that this rewrite can have' => 'is allowed'
Opt: same as => value of? (very slightly clearer?)
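For reference, a hedged sketch combining the partial-progress options discussed above (the numeric values are illustrative, not recommendations from this PR):
```sql
-- Commit completed file groups incrementally instead of in one final commit,
-- tolerating up to 5 failed commits before the job is failed.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'partial-progress.max-failed-commits', '5'
  )
);
```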
##########
docs/docs/spark-procedures.md:
##########
@@ -447,9 +453,9 @@ Using the same defaults as bin-pack to determine which files to rewrite.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)');
```
-Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where more than 2 or more files need to be rewritten.
+Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where at least two files need rewriting, and then remove any dangling delete files.
```sql
-CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'));
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2','remove-dangling-deletes', 'true'));
Review Comment:
Do you think we can make the spacing consistent throughout the map?
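For illustration, the same call with consistent spacing throughout the map (only whitespace differs from the line in the diff):
```sql
-- Identical options; a space follows every comma in the map for consistency.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files', '2', 'remove-dangling-deletes', 'true'));
```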
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.