[PR] [MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths [gluten]

via GitHub Fri, 22 May 2026 12:12:14 -0700


yaooqinn opened a new pull request, #12130:
URL: https://github.com/apache/gluten/pull/12130


   ### What changes were proposed in this pull request?
   
   Removes dead JVM-side code paths related to Arrow CSV scanning and Arrow Dataset readers in `gluten-arrow`. Two commits:
   
   1. **`[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path`** (-1441 lines)
      - Deletes 12 files: `ArrowCSVFileFormat`, `ArrowCSVOptionConverter`, `ArrowCSVPartitionReaderFactory`, `ArrowCSVScan`, `ArrowCSVScanBuilder`, `ArrowCSVTable`, `ArrowBatchScanExec`, `ArrowConvertorRule`, `ArrowScanReplaceRule`, `ArrowFileSourceScanExec`, `BaseArrowScanExec`, `ArrowCsvScanSuite`.
      - Truncates the `ArrowBatchScanExecShim` segment from 5 shim files (`spark33/34/35/40/41`).
   
   2. **`[MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil`** (-167 lines)
      - Removes the following reader methods from `ArrowUtil.scala`: `makeArrowDiscovery`, `readArrowSchema`, `readArrowFileColumnNames`, `readSchema` (×2 overloads), `loadMissingColumns`, `loadPartitionColumns`, `loadBatch`. All of them instantiated `FileSystemDatasetFactory` / `CsvFragmentScanOptions` and were used only by the classes deleted in the first commit.
   
   Total: 18 files changed, **-1608 lines**.
   
   ### Why are the changes needed?
   
   This code is **dead**:
   
   - **No service registration**: no `META-INF/services` entry routes Spark to `ArrowCSVFileFormat` or `ArrowCSVTable`.
   - **No rule injection**: `VeloxRuleApi` does not inject `ArrowConvertorRule` or `ArrowScanReplaceRule` into the optimizer pipeline.
   - **No active tests**: `ArrowCsvScanSuite` is fully `@Ignore`d.
   - **No callers**: `grep`ing the whole repo for the removed `ArrowUtil` reader methods returns 0 call sites outside the deleted files.
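
The registration and caller checks above could be reproduced with something like the following. This is a minimal sketch, not the commands actually used for this PR: the function names are hypothetical, and it assumes a checkout where the relevant sources are `.scala` / `.java` files and service registrations live under `META-INF/services`.

```bash
# Hypothetical helpers for re-checking the "dead code" claims; the function
# names are illustrative and not part of the PR or of Gluten.

# List every whole-word mention of NAME in Scala/Java sources under ROOT,
# as file:line:text. Prints nothing if NAME is not mentioned anywhere.
# Usage: find_mentions ROOT NAME
find_mentions() {
  find "$1" -type f \( -name '*.scala' -o -name '*.java' \) \
    -exec grep -nw -e "$2" /dev/null {} + || true
}

# List every META-INF/services file under ROOT that mentions NAME.
# Prints nothing if no service registration refers to NAME.
# Usage: find_service_registrations ROOT NAME
find_service_registrations() {
  find "$1" -type f -path '*/META-INF/services/*' \
    -exec grep -lw -e "$2" {} + || true
}
```

Run from the repository root, `find_service_registrations . ArrowCSVFileFormat` should print nothing if the first bullet holds, and `find_mentions . readArrowSchema` should list only the definition and the files this PR deletes. Matches are textual, so definitions and comments show up alongside real call sites and need to be read by eye.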
   
   These classes appear to be unreachable code introduced by a previous squash-merge and never wired into the actual execution path. They also keep `gluten-arrow` tied to the patched `arrow-dataset` JVM API (`CsvFragmentScanOptions.from(Map)`, 5-arg `FileSystemDatasetFactory`) shipped via `dev/build-arrow.sh`. Removing them unblocks future work to drop the patched-arrow build entirely.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Compiles cleanly under both Spark profiles tried (`compile` on Spark 4.0 / Scala 2.13, `test-compile` on Spark 3.5 / Scala 2.12):
   
   ```bash
   ./build/mvn compile      -Pbackends-velox -Pspark-4.0 -Pscala-2.13 -DskipTests  # ✅
   ./build/mvn test-compile -Pbackends-velox -Pspark-3.5 -Pscala-2.12 -DskipTests  # ✅
   ```
   
   All 6 affected modules pass scalastyle with 0 errors.
   
   Generated-by: claude-opus-4.7
   


