gknz opened a new pull request, #731: URL: https://github.com/apache/wayang/pull/731
## Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing `ParquetSink` logical operator, which previously only had a Spark implementation (`SparkParquetSink`). Now the optimizer can choose between Java and Spark when writing Parquet files.

### What this PR adds

**JavaParquetSink** (`wayang-java/operators/`)
- Writes Wayang `Record`s to Parquet files using the `parquet-avro` library
- Follows the same pattern/logic as `SparkParquetSink`
- Infers the Avro schema automatically by sampling up to 50 records
- Uses `RecordType` field names when available; falls back to `field0`, `field1`, etc.
- Uses Snappy compression
- Handles file overwrite mode

**ParquetSinkMapping** (`wayang-java/mapping/`)
- Connects the logical `ParquetSink` to `JavaParquetSink`
- Registered in the Java platform's `Mappings.java`

**Fluent API** (`DataQuantaBuilder.scala`)
- Added a `writeParquet` method to `DataQuantaBuilder` so users can call it from the fluent API (from my understanding, this was missing: only the `DataQuanta.scala` layer existed)

**Unit Tests** (`JavaParquetSinkTest.java`)
- `testWriteStringRecords`: verifies a basic write and read-back
- `testWriteMixedTypeRecords`: verifies type inference (Int, String, Double)
- `testWriteWithRecordType`: verifies that column names from `RecordType` are preserved in the Parquet schema
- `testOverwriteExistingFile`: verifies that overwrite mode replaces existing data

### Testing

All 4 unit tests pass. I also tested the operator in a separate pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing, and it worked as intended.
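For readers skimming the PR, the schema-inference behavior described above (sample up to 50 records, use `RecordType` field names when present, otherwise fall back to `field0`, `field1`, …) can be sketched roughly like this. This is an illustrative sketch, not the PR's actual code: the `fieldNames`/`inferTypes` helpers and the plain-string type tags are my own assumptions, and the real operator builds an Avro `Schema` via `parquet-avro` instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of schema inference for a Parquet sink (illustrative only). */
public class SchemaInferenceSketch {

    static final int SAMPLE_SIZE = 50; // the PR samples up to 50 records

    /** Use the given RecordType names when present, else fall back to field0, field1, ... */
    static List<String> fieldNames(String[] recordTypeNames, int arity) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            if (recordTypeNames != null && i < recordTypeNames.length) {
                names.add(recordTypeNames[i]);
            } else {
                names.add("field" + i);
            }
        }
        return names;
    }

    /** Map a sampled value to a simple type tag (a stand-in for an Avro type). */
    static String typeOf(Object v) {
        if (v instanceof Integer) return "int";
        if (v instanceof Long) return "long";
        if (v instanceof Double) return "double";
        return "string";
    }

    /** Infer one type tag per column from the first non-null value seen in the sample. */
    static List<String> inferTypes(List<Object[]> sample, int arity) {
        List<String> types = new ArrayList<>();
        int limit = Math.min(sample.size(), SAMPLE_SIZE);
        for (int i = 0; i < arity; i++) {
            String type = "string"; // default when no value is seen
            for (int r = 0; r < limit; r++) {
                Object v = sample.get(r)[i];
                if (v != null) {
                    type = typeOf(v);
                    break;
                }
            }
            types.add(type);
        }
        return types;
    }
}
```

Sampling a bounded prefix keeps inference cheap for large inputs, at the cost of mis-typing a column whose first values are unrepresentative.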
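The "handles file overwrite mode" bullet could plausibly be implemented along these lines. Again a hedged stdlib sketch, not the PR's code: the `prepareOutput` helper, and the choice to delete an existing file only when overwrite is requested and fail otherwise, are my assumptions about one reasonable design.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative overwrite handling for a file sink (not the actual PR code). */
public class OverwriteSketch {

    /**
     * Delete a pre-existing output file when overwrite is enabled;
     * otherwise refuse to clobber it.
     */
    static void prepareOutput(Path target, boolean overwrite) throws IOException {
        if (Files.exists(target)) {
            if (!overwrite) {
                throw new IOException("Output already exists: " + target);
            }
            Files.delete(target); // overwrite mode: remove the stale file first
        }
    }
}
```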
