etseidl commented on code in PR #9450:
URL: https://github.com/apache/arrow-rs/pull/9450#discussion_r2914919511


##########
parquet/src/file/properties.rs:
##########
@@ -62,6 +62,56 @@ pub const DEFAULT_OFFSET_INDEX_DISABLED: bool = false;
 /// Default values for [`WriterProperties::coerce_types`]
 pub const DEFAULT_COERCE_TYPES: bool = false;
 
+/// EXPERIMENTAL: Options for content-defined chunking (CDC).
+///
+/// Content-defined chunking is an experimental feature that optimizes parquet
+/// files for content addressable storage (CAS) systems by writing data pages
+/// according to content-defined chunk boundaries. This allows for more
+/// efficient deduplication of data across files, hence more efficient network
+/// transfers and storage.
+///
+/// Each content-defined chunk is written as a separate parquet data page. The
+/// following options control the chunks' size and the chunking process. Note
+/// that the chunk size is calculated based on the logical value of the data,
+/// before any encoding or compression is applied.
+#[derive(Debug, Clone, Copy)]
+pub struct CdcOptions {
+    /// Minimum chunk size in bytes, default is 256 KiB.
+    /// The rolling hash will not be updated until this size is reached for each chunk.
+    /// Note that all data sent through the hash function is counted towards the chunk
+    /// size, including definition and repetition levels if present.
+    pub min_chunk_size: usize,
+    /// Maximum chunk size in bytes, default is 1024 KiB.
+    /// The chunker will create a new chunk whenever the chunk size exceeds this value.
+    /// Note that the parquet writer has a related `data_page_size_limit` property that
+    /// controls the maximum size of a parquet data page after encoding. While setting
+    /// `data_page_size_limit` to a smaller value than `max_chunk_size` doesn't affect
+    /// the chunking effectiveness, it results in more small parquet data pages.

Review Comment:
   ```suggestion
        /// Note that the parquet writer has a related [`data_page_size_limit`] property that
        /// controls the maximum size of a parquet data page after encoding. While setting
        /// `data_page_size_limit` to a smaller value than `max_chunk_size` doesn't affect
        /// the chunking effectiveness, it results in more small parquet data pages.
        ///
        /// [`data_page_size_limit`]: WriterPropertiesBuilder::set_data_page_size_limit
   ```



##########
parquet/src/file/properties.rs:
##########
@@ -364,6 +415,13 @@ impl WriterProperties {
         self.coerce_types
     }
 
+    /// EXPERIMENTAL: Returns content-defined chunking options, or `None` if CDC is disabled.
+    ///
+    /// For more details see [`WriterPropertiesBuilder::set_content_defined_chunking`]
+    pub fn content_defined_chunking(&self) -> Option<&CdcOptions> {

Review Comment:
   Does this only make sense as a global option, or would making it a per-column property be reasonable?
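   One possible shape for a per-column variant, sketched below with hypothetical names (`CdcProperties`, `for_column` — not the parquet-rs API): a file-wide default plus per-column overrides keyed by column path, mirroring how other writer properties are scoped per column.

   ```rust
   use std::collections::HashMap;

   // Hypothetical sketch of per-column CDC options resolved against a
   // file-wide default. All names here are illustrative assumptions.
   #[derive(Debug, Clone, Copy)]
   pub struct CdcOptions {
       pub min_chunk_size: usize,
       pub max_chunk_size: usize,
       pub norm_level: i32,
   }

   struct CdcProperties {
       /// File-wide default; `None` means CDC is disabled unless overridden.
       default: Option<CdcOptions>,
       /// Per-column overrides keyed by dotted column path.
       overrides: HashMap<String, CdcOptions>,
   }

   impl CdcProperties {
       /// Resolve the effective options for one column: the per-column
       /// override if present, otherwise the file-wide default.
       fn for_column(&self, path: &str) -> Option<CdcOptions> {
           self.overrides.get(path).copied().or(self.default)
       }
   }

   fn main() {
       let mut props = CdcProperties {
           default: Some(CdcOptions {
               min_chunk_size: 256 * 1024,
               max_chunk_size: 1024 * 1024,
               norm_level: 0,
           }),
           overrides: HashMap::new(),
       };
       // Give one column smaller chunks than the rest of the file.
       props.overrides.insert(
           "a.b".to_string(),
           CdcOptions { min_chunk_size: 64 * 1024, max_chunk_size: 512 * 1024, norm_level: 0 },
       );
       let o = props.for_column("a.b").unwrap();
       let d = props.for_column("c").unwrap();
       println!("override={} fallback={}", o.min_chunk_size, d.min_chunk_size);
   }
   ```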



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -421,8 +445,10 @@ impl<W: Write + Send> ArrowWriter<W> {
             None => return Ok(()),
         };
 
+        let chunks = in_progress.close()?;

Review Comment:
   Also curious about this change and below. Does the `close` have to happen before calling `next_row_group`?



##########
parquet/src/file/properties.rs:
##########
@@ -62,6 +62,56 @@ pub const DEFAULT_OFFSET_INDEX_DISABLED: bool = false;
 /// Default values for [`WriterProperties::coerce_types`]
 pub const DEFAULT_COERCE_TYPES: bool = false;
 
+/// EXPERIMENTAL: Options for content-defined chunking (CDC).
+///
+/// Content-defined chunking is an experimental feature that optimizes parquet
+/// files for content addressable storage (CAS) systems by writing data pages
+/// according to content-defined chunk boundaries. This allows for more
+/// efficient deduplication of data across files, hence more efficient network
+/// transfers and storage.
+///
+/// Each content-defined chunk is written as a separate parquet data page. The
+/// following options control the chunks' size and the chunking process. Note
+/// that the chunk size is calculated based on the logical value of the data,
+/// before any encoding or compression is applied.
+#[derive(Debug, Clone, Copy)]
+pub struct CdcOptions {
+    /// Minimum chunk size in bytes, default is 256 KiB.
+    /// The rolling hash will not be updated until this size is reached for each chunk.
+    /// Note that all data sent through the hash function is counted towards the chunk
+    /// size, including definition and repetition levels if present.
+    pub min_chunk_size: usize,
+    /// Maximum chunk size in bytes, default is 1024 KiB.
+    /// The chunker will create a new chunk whenever the chunk size exceeds this value.
+    /// Note that the parquet writer has a related `data_page_size_limit` property that
+    /// controls the maximum size of a parquet data page after encoding. While setting
+    /// `data_page_size_limit` to a smaller value than `max_chunk_size` doesn't affect
+    /// the chunking effectiveness, it results in more small parquet data pages.
+    pub max_chunk_size: usize,
+    /// Number of bits by which to adjust the gearhash mask in order to center the chunk
+    /// size around the average size more aggressively, default is 0.
+    /// Increasing the normalization level increases the probability of finding a chunk,
+    /// improving the deduplication ratio, but also increases the number of small chunks,
+    /// resulting in many small parquet data pages. The default value provides a good
+    /// balance between deduplication ratio and fragmentation.
+    /// Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the
+    /// expense of fragmentation. Negative values can also be used to reduce the
+    /// probability of finding a chunk, resulting in larger chunks and fewer data pages.
+    /// Note that values outside [-3, 3] are not recommended; prefer the default
+    /// value of 0 for most use cases.
+    pub norm_level: i32,
+}
+
+impl Default for CdcOptions {
+    fn default() -> Self {
+        Self {
+            min_chunk_size: 256 * 1024,
+            max_chunk_size: 1024 * 1024,

Review Comment:
   Can you add constants for these above? 🙏 
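   A sketch of what those constants might look like, following the `DEFAULT_*` naming convention already used at the top of `properties.rs` (e.g. `DEFAULT_COERCE_TYPES`); the exact constant names below are assumptions, not necessarily what the PR will merge:

   ```rust
   // Sketch of the requested default constants; names are assumptions
   // mirroring the existing DEFAULT_* convention in properties.rs.
   pub const DEFAULT_CDC_MIN_CHUNK_SIZE: usize = 256 * 1024; // 256 KiB
   pub const DEFAULT_CDC_MAX_CHUNK_SIZE: usize = 1024 * 1024; // 1024 KiB
   pub const DEFAULT_CDC_NORM_LEVEL: i32 = 0;

   #[derive(Debug, Clone, Copy)]
   pub struct CdcOptions {
       pub min_chunk_size: usize,
       pub max_chunk_size: usize,
       pub norm_level: i32,
   }

   impl Default for CdcOptions {
       fn default() -> Self {
           Self {
               min_chunk_size: DEFAULT_CDC_MIN_CHUNK_SIZE,
               max_chunk_size: DEFAULT_CDC_MAX_CHUNK_SIZE,
               norm_level: DEFAULT_CDC_NORM_LEVEL,
           }
       }
   }

   fn main() {
       let opts = CdcOptions::default();
       println!("min={} max={} norm={}", opts.min_chunk_size, opts.max_chunk_size, opts.norm_level);
   }
   ```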



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to