kszucs commented on PR #9450: URL: https://github.com/apache/arrow-rs/pull/9450#issuecomment-4085392043
> It is not entirely clear to me how a "content addressable filesystem" works (aka how does it know where the parquet pages start/end) so having that documented / mocked out would also be nice

The CDC feature in Parquet essentially splits pages according to the columns' content, resulting in fairly stable pages even if records are inserted or deleted. The HF xet filesystem is format agnostic (similar to, for example, a deduplicating backup solution like restic) and chunks the byte stream directly, so it does not need to know where the Parquet pages start or end. The main issue with Parquet is page-level compression, which breaks deduplication if the page values change. This CDC feature makes the pages more or less stable, depending on their content.
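
To make the chunking idea concrete, here is a minimal sketch of content-defined chunking over a raw byte stream. It is a toy gear-style rolling hash, not the chunker used by xet or restic and not the Parquet CDC implementation. All names (`cdc_chunks`, `fixed_chunks`, `shared_fraction`) and all sizes are made up for illustration. It compares how many chunks survive a small insertion under content-defined versus fixed-size chunking.

```python
import hashlib
import random

# Cut when the low 6 bits of the rolling hash are zero, so a boundary
# candidate appears roughly every 64 bytes (illustrative value only).
MASK = (1 << 6) - 1

# Per-byte random table for the gear-style hash (hypothetical, seeded for repeatability).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes, min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Split data at positions chosen by the content, not by the offset."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split data at fixed offsets, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of chunks in `new` whose digest already exists in `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    return sum(hashlib.sha256(c).digest() in known for c in new) / len(new)


# Simulate a file and an edited copy with a few bytes inserted near the start.
data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(20000))
edited = original[:100] + b"inserted record" + original[100:]

cdc_reuse = shared_fraction(cdc_chunks(original), cdc_chunks(edited))
fixed_reuse = shared_fraction(fixed_chunks(original), fixed_chunks(edited))

print(f"content-defined chunks reused: {cdc_reuse:.0%}")
print(f"fixed-size chunks reused:      {fixed_reuse:.0%}")
```

With fixed-size chunks, every chunk after the insertion point is shifted and no longer matches. With content-defined boundaries, the chunker resynchronizes shortly after the edit, so most chunks are unchanged. Compressing a page scrambles its bytes whenever its values change, which is why stable page boundaries matter for byte-level deduplication.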
