t3hw commented on PR #14797: URL: https://github.com/apache/iceberg/pull/14797#issuecomment-3747006047
Pasting my comment from another discussion about this PR.

The original commenter said:

> The primary issue with using DVs for this is identifying the position of a deleted row in the dataset. If you receive a delete record for an existing row, you need to scan the table to locate the file and position in order to create the DV. That results in lots of scan behavior and data materialization on the worker, which really isn't practical or scalable. That is the main issue equality deletes address, but equality deletes introduce a significant maintenance burden, and that maintenance isn't available natively in KC. This means that in order to use this approach with KC, you also need Spark or an equivalent to aggressively maintain the deletes. Based on the PR, I'm not clear how these issues are addressed.

And the response:

My implementation of `BaseDeltaWriter` inherits the `insertedRowMap` from the `BaseEqualityDeltaWriter` class. It deletes via the DV writer by default, falling back to equality deletes when the key is not found in the map. The map is cleared when the batch is closed and the file is written. I believe RisingWave uses a similar approach. If this gets accepted, the documentation should be updated to let users know that compaction is highly recommended. (A sketch of this hybrid strategy is included below.)

The implementation includes `PartitionedDeltaWriter`, which routes records to partition-specific writers. This means equality deletes are scoped to their partition: query engines only need to scan files within that partition, not the entire table. For well-partitioned tables (e.g., by date), this significantly reduces the read-time cost of equality deletes.

Another approach would be to keep a separate index mapping the id field to its Parquet file, but compaction is required regardless, so this would largely be wasted effort: the index would need to be replicated, persisted, or managed outside of Kafka Connect, and it would lose its state and have to be rebuilt every time a compaction is triggered.

I agree the PR description was misleading about "using DVs for CDC": DVs only help with in-batch deduplication. I'll update the description to be clearer.

Given that equality deletes plus compaction is how Flink handles CDC (and is the industry standard), would this approach be acceptable if we:

1. Update the PR description to accurately describe the behavior
2. Document that periodic compaction is required for production use
3. Recommend proper table partitioning to minimize equality-delete scan overhead
4. Provide example compaction commands in the docs (an example follows the sketches below)
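To make the hybrid delete strategy concrete, here is a minimal Java sketch. It is illustrative rather than the PR's actual code: `HybridDeltaWriter`, `PathOffset`, and the write helpers are simplified stand-ins for the real writer classes, which track row positions with Iceberg's internal structures rather than a plain `HashMap`.

```java
import java.util.HashMap;
import java.util.Map;

class HybridDeltaWriter<K, R> {

  // Maps each key written in the current batch to the data file and row
  // position it landed at; mirrors the role of insertedRowMap.
  private final Map<K, PathOffset> insertedRowMap = new HashMap<>();

  record PathOffset(String path, long pos) {}

  void insert(K key, R row, String currentFilePath, long currentRowPos) {
    insertedRowMap.put(key, new PathOffset(currentFilePath, currentRowPos));
    // ... append the row to the open data file ...
  }

  void delete(K key) {
    PathOffset offset = insertedRowMap.remove(key);
    if (offset != null) {
      // The row was written in this batch, so its exact position is known
      // and a deletion vector (position delete) can be emitted cheaply.
      writeDeletionVector(offset.path(), offset.pos());
    } else {
      // The row lives in an earlier file: fall back to an equality delete
      // and let read-time merging / compaction resolve it.
      writeEqualityDelete(key);
    }
  }

  void close() {
    // The map is per-batch: cleared once the files are written out.
    insertedRowMap.clear();
    // ... flush data, DV, and equality-delete files ...
  }

  private void writeDeletionVector(String path, long pos) { /* elided */ }

  private void writeEqualityDelete(K key) { /* elided */ }
}
```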
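The partition routing can be sketched the same way, assuming the hypothetical `HybridDeltaWriter` above; the real `PartitionedDeltaWriter` would derive the partition key from the table's partition spec rather than taking it as an argument.

```java
import java.util.HashMap;
import java.util.Map;

class PartitionedDeltaWriter<K, R> {

  // One hybrid writer per partition, so the equality deletes each writer
  // emits are scoped to that partition rather than the whole table.
  private final Map<String, HybridDeltaWriter<K, R>> writers = new HashMap<>();

  void delete(String partitionKey, K key) {
    writers.computeIfAbsent(partitionKey, p -> new HybridDeltaWriter<K, R>())
        .delete(key);
  }

  void close() {
    writers.values().forEach(HybridDeltaWriter::close);
    writers.clear();
  }
}
```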
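For item 4, a maintenance job could look roughly like this, using Iceberg's Spark actions API (catalog wiring and scheduling are omitted, and the `delete-file-threshold` value is just an aggressive example; the `rewrite_data_files` / `rewrite_position_delete_files` Spark procedures are the SQL equivalents):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class MaintenanceJob {

  // Periodic maintenance for tables written by this connector: bin-pack the
  // data files (applying accumulated equality deletes in the process), then
  // compact the remaining position-delete files.
  static void compact(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        // rewrite any data file that has at least one delete attached
        .option("delete-file-threshold", "1")
        .execute();

    SparkActions.get(spark)
        .rewritePositionDeletes(table)
        .execute();
  }
}
```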
