LuciferYang opened a new pull request, #55231:
URL: https://github.com/apache/spark/pull/55231

### What changes were proposed in this pull request?

Implements `MicroBatchStream` support for V2 file tables, enabling structured streaming reads through the V2 data source path.

- New `FileMicroBatchStream` (430 lines) implementing `MicroBatchStream`, `SupportsAdmissionControl`, and `SupportsTriggerAvailableNow`. It handles file discovery, offset management via `FileStreamSourceLog`, dedup via `SeenFilesMap`, rate limiting (`maxFilesPerTrigger` / `maxBytesPerTrigger`), and cross-batch file caching
- Override `FileScan.toMicroBatchStream()` to create `FileMicroBatchStream`
- Add `withFileIndex` method to `FileScan` and all 6 concrete scans for creating batch-specific scans in `planInputPartitions`
- Add `MICRO_BATCH_READ` to `FileTable.CAPABILITIES`
- Update `ResolveDataSource` to allow `FileDataSourceV2` into the V2 streaming path, respecting `USE_V1_SOURCE_LIST` for backward compatibility
- Remove the `FileTable` streaming fallback in `FindDataSourceTable`

Reuses V1 infrastructure for checkpoint compatibility: `FileStreamSourceLog` (metadata tracking), `FileStreamSourceOffset` (offset type), `SeenFilesMap` (dedup). Existing streaming queries can upgrade from V1 to V2 without checkpoint migration.

### Why are the changes needed?

File streaming reads currently fall back to V1 `FileStreamSource`, preventing deprecation of V1 file source code. This is part of SPARK-56170, which aims to make V2 the default path for all file source operations.

### Does this PR introduce _any_ user-facing change?

No. By default, `USE_V1_SOURCE_LIST` includes all file formats, so streaming reads still use V1. Users can opt into V2 by clearing the list (`spark.sql.sources.useV1SourceList=""`). Existing checkpoints are compatible.

### How was this patch tested?

New `FileStreamV2ReadSuite` with 6 E2E tests: basic streaming read, file discovery across batches, `maxFilesPerTrigger` rate limiting, checkpoint recovery, V2 path verification (`MicroBatchScanExec`), and JSON format.
Existing `FileStreamSourceSuite` (76 tests) passes with V1 forced via `USE_V1_SOURCE_LIST`. Total: 82 streaming file tests pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
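
For readers unfamiliar with the dedup and rate-limiting behavior listed in the change summary (`SeenFilesMap`, `maxFilesPerTrigger` / `maxBytesPerTrigger`), the idea can be pictured with a small, self-contained sketch. This is not the code in `FileMicroBatchStream`: `FileEntry`, `AdmissionLimit`, and `selectBatch` are hypothetical names, a plain `Set[String]` stands in for `SeenFilesMap`, and the policy of always admitting at least one file under a byte cap is an assumption made for the illustration.

```scala
// Hypothetical sketch only: illustrative names, not the PR's implementation.
final case class FileEntry(path: String, sizeBytes: Long, modifiedMs: Long)

sealed trait AdmissionLimit
case object NoLimit extends AdmissionLimit
final case class MaxFiles(n: Int) extends AdmissionLimit
final case class MaxBytes(bytes: Long) extends AdmissionLimit

object AdmissionSketch {
  /** Splits unseen files into (admitted in this batch, deferred to a later batch). */
  def selectBatch(
      listed: Seq[FileEntry],
      seen: Set[String],
      limit: AdmissionLimit): (Seq[FileEntry], Seq[FileEntry]) = {
    // Dedup: drop paths that were already processed, then order oldest first.
    val unseen = listed
      .filterNot(f => seen.contains(f.path))
      .sortBy(f => (f.modifiedMs, f.path))

    limit match {
      case NoLimit =>
        (unseen, Seq.empty)
      case MaxFiles(n) =>
        // File-count cap: take the first n unseen files.
        unseen.splitAt(n)
      case MaxBytes(cap) =>
        // Cumulative size up to and including each file.
        val cumulative = unseen.scanLeft(0L)(_ + _.sizeBytes).tail
        // Assumption: admit at least one file so an oversized file cannot stall progress.
        val count = math.max(1, cumulative.takeWhile(_ <= cap).length)
        unseen.splitAt(math.min(count, unseen.length))
    }
  }
}
```

With three files `a`, `b`, `c` of 10, 20, and 30 bytes (oldest first) and `a` already seen, `MaxFiles(1)` admits `b` and defers `c`; with nothing seen, `MaxBytes(30)` admits `a` and `b` and defers `c`.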
