wgtmac commented on code in PR #3396:
URL: https://github.com/apache/parquet-java/pull/3396#discussion_r3151217519
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java:
##########
@@ -672,6 +677,88 @@ public ColumnChunkPageWriteStore(
}
}
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /**
+   * Builder for {@link ColumnChunkPageWriteStore}. Prefer this over the constructors when new
+   * parameters are needed so that callers do not have to be updated every time a parameter is added.
+   */
+  public static class Builder {
+    private Function<ColumnDescriptor, BytesInputCompressor> compressorProvider;
+    private MessageType schema;
+    private ByteBufferAllocator allocator;
+    private int columnIndexTruncateLength = ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH;
+    private boolean pageWriteChecksumEnabled = ParquetProperties.DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED;
+    private InternalFileEncryptor fileEncryptor = null;
+    private int rowGroupOrdinal = 0;
+
+    private Builder() {}
+
+    /**
+     * Use a single compressor for every column.
+     */
+    public Builder withCompressor(BytesInputCompressor compressor) {
+      this.compressorProvider = path -> compressor;
+      return this;
+    }
+
+    /**
+     * Resolve the compressor per column from the given codec factory and properties, allowing
+     * per-column compression codecs.
+     */
+    public Builder withCodecFactory(CompressionCodecFactory codecFactory, ParquetProperties props) {
+      this.compressorProvider = path -> codecFactory.getCompressor(props.getCompressionCodec(path));
+      return this;
+    }
+
+    public Builder withSchema(MessageType schema) {
+      this.schema = schema;
+      return this;
+    }
+
+    public Builder withAllocator(ByteBufferAllocator allocator) {
+      this.allocator = allocator;
+      return this;
+    }
+
+    public Builder withColumnIndexTruncateLength(int columnIndexTruncateLength) {
+      this.columnIndexTruncateLength = columnIndexTruncateLength;
+      return this;
+    }
+
+    public Builder withPageWriteChecksumEnabled(boolean pageWriteChecksumEnabled) {
+      this.pageWriteChecksumEnabled = pageWriteChecksumEnabled;
+      return this;
+    }
+
+    public Builder withFileEncryptor(InternalFileEncryptor fileEncryptor) {
+      this.fileEncryptor = fileEncryptor;
+      return this;
+    }
+
+    public Builder withRowGroupOrdinal(int rowGroupOrdinal) {
+      this.rowGroupOrdinal = rowGroupOrdinal;
+      return this;
+    }
+
+    public ColumnChunkPageWriteStore build() {
+      if (compressorProvider == null) {
Review Comment:
It seems that `schema` and `allocator` are also required arguments, so we need to check them here as well. BTW, we may switch to using `Preconditions.checkArgument` to be consistent with the rest of the repo.
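To illustrate the suggestion, here is a minimal, self-contained sketch of validating all required builder arguments up front in `build()`. The `Schema`, `Allocator`, and `Builder` types below are hypothetical stand-ins, not the real parquet-java classes, and `Objects.requireNonNull` stands in for the repo's `Preconditions.checkArgument` convention; the shape of the checks is what matters.

```java
import java.util.Objects;

// Hypothetical stand-in types so the sketch compiles on its own;
// the real code would use MessageType, ByteBufferAllocator, etc.
class Schema {}
class Allocator {}

class Builder {
  Schema schema;
  Allocator allocator;

  Builder withSchema(Schema s) { this.schema = s; return this; }
  Builder withAllocator(Allocator a) { this.allocator = a; return this; }

  // Check every required argument before constructing the object, so a
  // missing argument fails fast with a clear message instead of causing
  // an NPE later inside the constructed object.
  String build() {
    Objects.requireNonNull(schema, "schema is required");
    Objects.requireNonNull(allocator, "allocator is required");
    return "built";
  }
}

public class Main {
  public static void main(String[] args) {
    boolean threw = false;
    try {
      new Builder().withSchema(new Schema()).build(); // allocator missing
    } catch (NullPointerException e) {
      threw = true;
    }
    System.out.println(threw); // true
    System.out.println(
        new Builder().withSchema(new Schema()).withAllocator(new Allocator()).build()); // built
  }
}
```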
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##########
@@ -77,7 +79,9 @@ class InternalParquetRecordWriter<T> {
* @param extraMetaData extra meta data to write in the footer of the file
* @param rowGroupSize the size of a block in the file (this will be approximate)
* @param compressor the codec used to compress
+   * @deprecated Use {@link #InternalParquetRecordWriter(ParquetFileWriter, WriteSupport, MessageType, Map, long, CompressionCodecFactory, boolean, ParquetProperties)} for per-column compression support
*/
+ @Deprecated
Review Comment:
Is it better not to deprecate this constructor? `InternalParquetRecordWriter` is not public, so we can change the parameter `compressor` to be a function similar to the one used by `ColumnChunkPageWriteStore`.
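A rough sketch of what changing the internal signature might look like, using hypothetical stand-in types (`ColumnDescriptor` here is a local stub, and `Compressor` stands in for `BytesInputCompressor`): the constructor takes a per-column function, and existing single-compressor call sites adapt with a constant lambda, so nothing needs to be deprecated.

```java
import java.util.function.Function;

// Hypothetical stand-ins for the real parquet-java types.
class ColumnDescriptor {
  final String path;
  ColumnDescriptor(String path) { this.path = path; }
}
interface Compressor { String name(); }

// An internal writer whose constructor takes a per-column compressor
// function instead of one fixed compressor.
class InternalWriter {
  private final Function<ColumnDescriptor, Compressor> compressorProvider;

  InternalWriter(Function<ColumnDescriptor, Compressor> compressorProvider) {
    this.compressorProvider = compressorProvider;
  }

  String codecFor(ColumnDescriptor column) {
    return compressorProvider.apply(column).name();
  }
}

public class Main {
  public static void main(String[] args) {
    Compressor snappy = () -> "SNAPPY";
    // A caller that previously passed a single compressor adapts with a
    // constant function, so the old behavior is preserved.
    InternalWriter writer = new InternalWriter(column -> snappy);
    System.out.println(writer.codecFor(new ColumnDescriptor("a.b"))); // SNAPPY
  }
}
```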
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java:
##########
@@ -188,7 +188,10 @@ public ParquetRecordWriter(
* @param codec the compression codec used to compress the pages
* @param validating if schema validation should be turned on
* @param props parquet encoding properties
+   * @deprecated Use {@link #ParquetRecordWriter(ParquetFileWriter, WriteSupport, MessageType, Map, long, boolean, ParquetProperties, MemoryManager, Configuration)}
+   *             and set the compression codec in {@link ParquetProperties} instead
*/
+ @Deprecated
Review Comment:
Let's not add `@Deprecated` here because this constructor is not a public
one. We can directly modify its signature. We don't need to add a public ctor
since this class is only used by `ParquetOutputFormat`.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java:
##########
@@ -672,6 +677,88 @@ public ColumnChunkPageWriteStore(
}
}
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /**
+   * Builder for {@link ColumnChunkPageWriteStore}. Prefer this over the constructors when new
+   * parameters are needed so that callers do not have to be updated every time a parameter is added.
+   */
+  public static class Builder {
Review Comment:
Does this new builder look good to you? @gszadovszky @Fokko?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java:
##########
@@ -672,6 +677,88 @@ public ColumnChunkPageWriteStore(
}
}
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /**
+   * Builder for {@link ColumnChunkPageWriteStore}. Prefer this over the constructors when new
+   * parameters are needed so that callers do not have to be updated every time a parameter is added.
+   */
+  public static class Builder {
+    private Function<ColumnDescriptor, BytesInputCompressor> compressorProvider;
+    private MessageType schema;
+    private ByteBufferAllocator allocator;
+    private int columnIndexTruncateLength = ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH;
+    private boolean pageWriteChecksumEnabled = ParquetProperties.DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED;
+    private InternalFileEncryptor fileEncryptor = null;
+    private int rowGroupOrdinal = 0;
+
+    private Builder() {}
+
+    /**
+     * Use a single compressor for every column.
+     */
+    public Builder withCompressor(BytesInputCompressor compressor) {
+      this.compressorProvider = path -> compressor;
+      return this;
+    }
+
+    /**
+     * Resolve the compressor per column from the given codec factory and properties, allowing
+     * per-column compression codecs.
+     */
+    public Builder withCodecFactory(CompressionCodecFactory codecFactory, ParquetProperties props) {
Review Comment:
It looks a little bit weird that `ParquetProperties` is used here. Should we
directly use `Function<ColumnDescriptor, BytesInputCompressor>` to be more
flexible?
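A sketch of the more flexible shape being suggested, again with hypothetical stand-in types (`Compressor` for `BytesInputCompressor`, a stub `ColumnDescriptor`): the builder exposes a single method taking the function directly, and both the fixed-compressor case and the codec-factory case reduce to a lambda at the call site.

```java
import java.util.function.Function;

// Hypothetical stand-ins for the real parquet-java types.
class ColumnDescriptor {
  final String path;
  ColumnDescriptor(String path) { this.path = path; }
}
interface Compressor { String name(); }

class Builder {
  private Function<ColumnDescriptor, Compressor> compressorProvider;

  // One method taking the function directly; withCompressor(c) becomes
  // withCompressorProvider(path -> c), and the codec-factory variant
  // becomes a lambda that looks up the codec per column.
  Builder withCompressorProvider(Function<ColumnDescriptor, Compressor> provider) {
    this.compressorProvider = provider;
    return this;
  }

  Compressor compressorFor(ColumnDescriptor column) {
    return compressorProvider.apply(column);
  }
}

public class Main {
  public static void main(String[] args) {
    Compressor gzip = () -> "GZIP";
    Compressor zstd = () -> "ZSTD";
    // Per-column codec choice expressed as an arbitrary function.
    Builder b = new Builder()
        .withCompressorProvider(c -> c.path.startsWith("meta") ? gzip : zstd);
    System.out.println(b.compressorFor(new ColumnDescriptor("meta.id")).name()); // GZIP
    System.out.println(b.compressorFor(new ColumnDescriptor("data.v")).name());  // ZSTD
  }
}
```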
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]