LuciferYang opened a new pull request, #55232:
URL: https://github.com/apache/spark/pull/55232

   ### What changes were proposed in this pull request?
   
   Implements `StreamingWrite` support for V2 file tables, enabling structured 
streaming writes through the V2 data source path.
   
   - New `FileStreamingWrite` implementing `StreamingWrite` — uses 
`ManifestFileCommitProtocol` for file commit and `FileStreamSinkLog` for 
metadata tracking, with idempotent `commit(epochId, messages)` that skips 
already-committed batches
   - New `FileStreamingWriterFactory` bridging `DataWriterFactory` to 
`StreamingDataWriterFactory`
   - Override `FileWrite.toStreaming()` to create `FileStreamingWrite` with the 
same metadata log layout as V1 (`_spark_metadata`)
   - Add `STREAMING_WRITE` to `FileTable.CAPABILITIES`
   - Support `retention` option for metadata log cleanup (V1 parity)
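
The idempotent-commit behaviour in the first bullet can be illustrated with a toy model. This is a minimal sketch, not the PR's code: `ToySinkLog` and `commit_epoch` are hypothetical names, and the in-memory dict only stands in for a batch-indexed metadata log such as `FileStreamSinkLog`. It assumes that "already committed" means the epoch id is not greater than the latest batch id in the log.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySinkLog:
    """Hypothetical in-memory stand-in for a batch-indexed metadata log."""
    entries: Dict[int, List[str]] = field(default_factory=dict)

    def latest_batch_id(self) -> int:
        # -1 means nothing has been committed yet.
        return max(self.entries, default=-1)

    def add(self, batch_id: int, files: List[str]) -> bool:
        # Refuse to overwrite an existing batch entry.
        if batch_id in self.entries:
            return False
        self.entries[batch_id] = list(files)
        return True


def commit_epoch(log: ToySinkLog, epoch_id: int, files: List[str]) -> bool:
    """Record files for epoch_id; return False if the epoch was already committed."""
    if epoch_id <= log.latest_batch_id():
        # Replayed epoch (for example after a restart): skip it.
        return False
    return log.add(epoch_id, files)


log = ToySinkLog()
first = commit_epoch(log, 0, ["part-00000"])
replay = commit_epoch(log, 0, ["part-00000-retry"])
second = commit_epoch(log, 1, ["part-00001"])
```

In this model a replayed epoch 0 leaves the log untouched, so the files recorded for the first attempt stay the only ones visible for that batch.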
   
   The implementation follows the same `WriteTaskResult` → `TaskCommitMessage` 
→ `SinkFileStatus` extraction pattern as `FileBatchWrite.commit()`. It sets 
`useCommitCoordinator = true` (batch writes use `false`) because 
`ManifestFileCommitProtocol` writes files directly to the output path without a 
temp-to-final rename step.
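
The extraction pattern described above can be pictured with a small model. The class names below mirror the types named in this description, but their fields are assumptions made for illustration (the real Spark classes carry more data), and `extract_file_statuses` is a hypothetical helper, not a function from the PR.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class SinkFileStatus:
    # Toy version: only a path and a size.
    path: str
    size: int


@dataclass(frozen=True)
class TaskCommitMessage:
    # Opaque payload; here assumed to be a list of SinkFileStatus.
    obj: Any


@dataclass(frozen=True)
class WriteTaskResult:
    # Per-task result wrapping the commit message.
    commit_msg: TaskCommitMessage


def extract_file_statuses(results: List[WriteTaskResult]) -> List[SinkFileStatus]:
    """Unwrap each task result and flatten the file statuses into one list."""
    statuses: List[SinkFileStatus] = []
    for result in results:
        statuses.extend(result.commit_msg.obj)
    return statuses


results = [
    WriteTaskResult(TaskCommitMessage([SinkFileStatus("out/part-0", 10)])),
    WriteTaskResult(TaskCommitMessage([SinkFileStatus("out/part-1", 20),
                                       SinkFileStatus("out/part-2", 30)])),
]
all_files = extract_file_statuses(results)
```

The flattened list is what a job-level commit would then record in the metadata log for the batch.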
   
   ### Why are the changes needed?
   
   File streaming writes currently fall back to V1 `FileStreamSink`, preventing 
deprecation of V1 file source code. Together with SPARK-56232 (streaming read), 
this completes the streaming support needed for full V1 deprecation under 
SPARK-56170.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. By default, `USE_V1_SOURCE_LIST` includes all file formats, so streaming 
writes still use V1. Users can opt into V2 by clearing the list. Existing 
checkpoints and `_spark_metadata` are compatible.
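
As a sketch of the opt-in, the helper below clears the list on an existing session. It assumes `USE_V1_SOURCE_LIST` maps to the SQL config key `spark.sql.sources.useV1SourceList` and that the setting is applied before the streaming query starts; `opt_into_v2_file_sources` is a hypothetical name, and `spark` is expected to be a `SparkSession`.

```python
# Assumed config key behind USE_V1_SOURCE_LIST.
V1_SOURCE_LIST_KEY = "spark.sql.sources.useV1SourceList"


def opt_into_v2_file_sources(spark) -> None:
    """Clear the V1 fallback list so no file format is forced onto the V1 path."""
    spark.conf.set(V1_SOURCE_LIST_KEY, "")
```

With a real session, one would call `opt_into_v2_file_sources(spark)` and then start the query as usual, for example `df.writeStream.format("parquet").option("checkpointLocation", ckpt).start(path)`, where `df`, `ckpt`, and `path` are placeholders.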
   
   ### How was this patch tested?
   
   New `FileStreamV2WriteSuite` with 4 E2E tests: basic streaming write, 
multiple batches, checkpoint recovery, and JSON format. Existing 
`FileStreamSinkV1Suite` passes. Total: 108 streaming file tests pass across all 
suites.
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by:  Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

