LuciferYang opened a new pull request, #55232: URL: https://github.com/apache/spark/pull/55232
### What changes were proposed in this pull request?

Implements `StreamingWrite` support for V2 file tables, enabling structured streaming writes through the V2 data source path.

- New `FileStreamingWrite` implementing `StreamingWrite`: uses `ManifestFileCommitProtocol` for file commit and `FileStreamSinkLog` for metadata tracking, with idempotent `commit(epochId, messages)` that skips already-committed batches
- New `FileStreamingWriterFactory` bridging `DataWriterFactory` to `StreamingDataWriterFactory`
- Override `FileWrite.toStreaming()` to create `FileStreamingWrite` with the same metadata log layout as V1 (`_spark_metadata`)
- Add `STREAMING_WRITE` to `FileTable.CAPABILITIES`
- Support `retention` option for metadata log cleanup (V1 parity)

The implementation follows the same `WriteTaskResult` → `TaskCommitMessage` → `SinkFileStatus` extraction pattern as `FileBatchWrite.commit()`. Uses `useCommitCoordinator = true` (unlike batch's `false`) because `ManifestFileCommitProtocol` writes files directly to the output path without a temp-to-final rename step.

### Why are the changes needed?

File streaming writes currently fall back to V1 `FileStreamSink`, preventing deprecation of V1 file source code. Together with SPARK-56232 (streaming read), this completes the streaming support needed for full V1 deprecation under SPARK-56170.

### Does this PR introduce _any_ user-facing change?

No. By default, `USE_V1_SOURCE_LIST` includes all file formats, so streaming writes still use V1. Users can opt into V2 by clearing the list. Existing checkpoints and `_spark_metadata` are compatible.

### How was this patch tested?

New `FileStreamV2WriteSuite` with 4 E2E tests: basic streaming write, multiple batches, checkpoint recovery, and JSON format. Existing `FileStreamSinkV1Suite` passes. Total: 108 streaming file tests pass across all suites.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code
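
For readers unfamiliar with the idempotent `commit(epochId, messages)` behavior described above, here is a minimal, dependency-free sketch of the idea. It is not the code from this PR: `ToySinkLog` and `ToyStreamingWrite` are hypothetical stand-ins for `FileStreamSinkLog` and `FileStreamingWrite`, file names are plain strings rather than `SinkFileStatus` entries, and the "skip if the epoch is not newer than the latest logged batch" check is an assumption about how the skip could be decided.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy stand-in for a sink metadata log (hypothetical; NOT Spark's FileStreamSinkLog).
final class ToySinkLog {
    private final TreeMap<Long, List<String>> batches = new TreeMap<>();

    // Highest committed batch id, or -1 when nothing has been committed yet.
    long latestBatchId() {
        return batches.isEmpty() ? -1L : batches.lastKey();
    }

    // Record the files written by one batch.
    void add(long batchId, List<String> files) {
        batches.put(batchId, new ArrayList<>(files));
    }

    // All logged files, in batch-id order.
    List<String> allFiles() {
        List<String> out = new ArrayList<>();
        for (List<String> files : batches.values()) {
            out.addAll(files);
        }
        return out;
    }
}

// Hypothetical model of an idempotent epoch commit (NOT the PR's FileStreamingWrite).
final class ToyStreamingWrite {
    private final ToySinkLog log = new ToySinkLog();

    // Returns true if the epoch was recorded, false if it was skipped
    // because an equal or newer epoch is already in the log.
    boolean commit(long epochId, List<String> files) {
        if (epochId <= log.latestBatchId()) {
            return false; // already committed, e.g. replayed after a restart
        }
        log.add(epochId, files);
        return true;
    }

    List<String> committedFiles() {
        return log.allFiles();
    }
}
```

In this model, replaying epoch 0 after it has been logged leaves the log unchanged, which is the property that lets a restarted query re-run its last batch without duplicating entries in the metadata log.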
