kszucs commented on PR #9450:
URL: https://github.com/apache/arrow-rs/pull/9450#issuecomment-4085392043

   > It is not entirely clear to me how a "content addressable filesystem" 
works (aka how does it know where the parquet pages start/end) so having that 
documented / mocked out would also be nice
   
   The CDC (content-defined chunking) feature in Parquet essentially splits 
pages according to the columns' content, resulting in fairly stable pages even 
if records are inserted or deleted. 
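
To make the "stable pages" point concrete, here is a minimal sketch in Python. It is not the actual Parquet writer logic: the `split_pages` and `fixed_pages` helpers, the SHA-256 based boundary test, and the mask value are all assumptions chosen for illustration. The only idea it shares with the real feature is that a page boundary is decided by the values themselves rather than by their position.

```python
import hashlib

def split_pages(values, mask=0x7):
    """Toy content-defined splitter (hypothetical helper): close a page after
    any value whose hash has its low bits all zero, so boundaries depend on
    the content and not on the row position."""
    pages, current = [], []
    for v in values:
        current.append(v)
        digest = hashlib.sha256(str(v).encode()).digest()
        if digest[0] & mask == 0:
            pages.append(tuple(current))
            current = []
    if current:
        pages.append(tuple(current))
    return pages

def fixed_pages(values, size=8):
    """Position-based splitter for comparison: every page holds `size` rows."""
    return [tuple(values[i:i + size]) for i in range(0, len(values), size)]

original = list(range(1000))
edited = original[:500] + [-1] + original[500:]  # insert one record

# Count how many pages of the edited data already existed before the edit.
cdc_before = set(split_pages(original))
cdc_after = split_pages(edited)
cdc_unchanged = sum(1 for p in cdc_after if p in cdc_before)

fixed_before = set(fixed_pages(original))
fixed_after = fixed_pages(edited)
fixed_unchanged = sum(1 for p in fixed_after if p in fixed_before)

print(f"content-defined: {cdc_unchanged}/{len(cdc_after)} pages unchanged")
print(f"fixed-size:      {fixed_unchanged}/{len(fixed_after)} pages unchanged")
```

With content-defined boundaries only the page or two around the inserted record change, while with fixed-size pages every page after the insertion point shifts by one row.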
   
   The HF xet filesystem is format agnostic (similar to, for example, a 
deduplicating backup solution like restic) and chunks the byte stream directly. 
The main issue with Parquet is the page-level compression, which breaks the 
deduplication if the page values change. This CDC feature makes the pages more 
or less stable depending on their content.
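
The interaction between byte-level chunking and page compression can be sketched as follows. This is an illustration under stated assumptions, not the xet or Parquet implementation: the gear-style rolling hash, the `GEAR` table, the masks, the row format, the use of zlib, and the helpers `chunk_bytes`, `write_file` and `shared_fraction` are all made up for the example. It writes the same rows with fixed-size and with content-defined pages, compresses each page, and then measures how many byte-level chunks survive a single inserted row.

```python
import hashlib
import random
import zlib

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # per-byte random values

def chunk_bytes(data, mask=0xFF):
    """Format-agnostic chunker: cut wherever a rolling hash of the most
    recent bytes has its low 8 bits all zero (roughly every 256 bytes)."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if h & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def write_file(rows, content_defined):
    """Group rows into pages, compress each page, and concatenate them."""
    pages, current = [], []
    for n, row in enumerate(rows, 1):
        current.append(row)
        if content_defined:
            cut = hashlib.sha256(row).digest()[0] & 0x3F == 0  # ~1 in 64 rows
        else:
            cut = n % 64 == 0  # fixed 64-row pages
        if cut:
            pages.append(b"".join(current))
            current = []
    if current:
        pages.append(b"".join(current))
    return b"".join(zlib.compress(p) for p in pages)

def shared_fraction(old, new):
    """Fraction of the new file's chunks already present in the old file."""
    known = set(chunk_bytes(old))
    new_chunks = chunk_bytes(new)
    return sum(c in known for c in new_chunks) / len(new_chunks)

data_rng = random.Random(1)
rows = [b"%016x,%d\n" % (data_rng.getrandbits(64), i) for i in range(4000)]
edited = rows[:100] + [b"inserted row\n"] + rows[100:]  # one new record

fixed_shared = shared_fraction(write_file(rows, False), write_file(edited, False))
cdc_shared = shared_fraction(write_file(rows, True), write_file(edited, True))
print(f"fixed pages:           {fixed_shared:.0%} of chunks deduplicated")
print(f"content-defined pages: {cdc_shared:.0%} of chunks deduplicated")
```

With fixed-size pages the insert shifts every later page by one row, so their compressed bytes differ and the byte-level chunker finds little to reuse. With content-defined pages only the page or two around the insert are recompressed differently, and the chunker resynchronizes on the unchanged compressed pages that follow.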

