gknz opened a new pull request, #731: URL: https://github.com/apache/wayang/pull/731
## Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing `ParquetSink` logical operator, which previously only had a Spark implementation (`SparkParquetSink`). Now the optimizer can choose between Java and Spark when writing Parquet files.

### What this PR adds

**JavaParquetSink** (`wayang-java/operators/`)
- Writes Wayang `Record`s to Parquet files using the `parquet-avro` library
- Follows the same pattern/logic as `SparkParquetSink`
- Infers the Avro schema automatically by sampling up to 50 records
- Uses `RecordType` field names when available; falls back to `field0`, `field1`, etc.
- Uses Snappy compression
- Handles file overwrite mode

**ParquetSinkMapping** (`wayang-java/mapping/`)
- Connects the logical `ParquetSink` to `JavaParquetSink`
- Registered in the Java platform's `Mappings.java`

**Fluent API** (`DataQuantaBuilder.scala`)
- Added a `writeParquet` method to `DataQuantaBuilder` so users can call it from the fluent API (from my understanding, this was missing: only the `DataQuanta.scala` layer existed)

**Unit Tests** (`JavaParquetSinkTest.java`)
- `testWriteStringRecords`: verifies a basic write and read-back
- `testWriteMixedTypeRecords`: verifies type inference (Int, String, Double)
- `testWriteWithRecordType`: verifies that column names from `RecordType` are preserved in the Parquet schema
- `testOverwriteExistingFile`: verifies that overwrite mode replaces existing data

### Testing

All 4 unit tests pass. I also tested the operator in a separate pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing, and it worked as intended.
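For readers skimming the PR, the schema-inference behavior described above (sample up to 50 records, use `RecordType` field names when present, otherwise fall back to `field0`, `field1`, …) can be sketched roughly like this. This is an illustrative sketch, not the PR's actual code: the `fieldNames`/`inferTypes` helpers and the plain-string type tags are my own assumptions, and the real operator builds an Avro `Schema` via `parquet-avro` instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of schema inference for a Parquet sink (illustrative only). */
public class SchemaInferenceSketch {

    static final int SAMPLE_SIZE = 50; // the PR samples up to 50 records

    /** Use the given RecordType names when present, else fall back to field0, field1, ... */
    static List<String> fieldNames(String[] recordTypeNames, int arity) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            if (recordTypeNames != null && i < recordTypeNames.length) {
                names.add(recordTypeNames[i]);
            } else {
                names.add("field" + i);
            }
        }
        return names;
    }

    /** Map a sampled value to a simple type tag (a stand-in for an Avro type). */
    static String typeOf(Object v) {
        if (v instanceof Integer) return "int";
        if (v instanceof Long) return "long";
        if (v instanceof Double) return "double";
        return "string";
    }

    /** Infer one type tag per column from the first non-null value seen in the sample. */
    static List<String> inferTypes(List<Object[]> sample, int arity) {
        List<String> types = new ArrayList<>();
        int limit = Math.min(sample.size(), SAMPLE_SIZE);
        for (int i = 0; i < arity; i++) {
            String type = "string"; // default when no value is seen
            for (int r = 0; r < limit; r++) {
                Object v = sample.get(r)[i];
                if (v != null) {
                    type = typeOf(v);
                    break;
                }
            }
            types.add(type);
        }
        return types;
    }
}
```

Sampling a bounded prefix keeps inference cheap for large inputs, at the cost of mis-typing a column whose first values are unrepresentative.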
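The "handles file overwrite mode" bullet could plausibly be implemented along these lines. Again a hedged stdlib sketch, not the PR's code: the `prepareOutput` helper, and the choice to delete an existing file only when overwrite is requested and fail otherwise, are my assumptions about one reasonable design.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative overwrite handling for a file sink (not the actual PR code). */
public class OverwriteSketch {

    /**
     * Delete a pre-existing output file when overwrite is enabled;
     * otherwise refuse to clobber it.
     */
    static void prepareOutput(Path target, boolean overwrite) throws IOException {
        if (Files.exists(target)) {
            if (!overwrite) {
                throw new IOException("Output already exists: " + target);
            }
            Files.delete(target); // overwrite mode: remove the stale file first
        }
    }
}
```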
