[PR] [VL] Hoist per-partition constants out of ColumnarCachedBatchSerializer.serialize hot path [gluten]

via GitHub Thu, 28 May 2026 02:17:17 -0700


yaooqinn opened a new pull request, #12166:
URL: https://github.com/apache/gluten/pull/12166


   ## What changes were proposed in this pull request?
   
   In `ColumnarCachedBatchSerializer.convertInternalRowToCachedBatch`,
   three values that are constant for the lifetime of the per-partition
   `Iterator[CachedBatch]` were being re-evaluated on every `next()` call:
   
   1. `BackendsApiManager.getBackendName` -- invoked twice per batch
      (once for the JNI runtime context, once for `getNativeHandle`)
   2. `GlutenConfig.get.getConf(COLUMNAR_TABLE_CACHE_PARTITION_STATS_ENABLED)`
      -- `GlutenConfig.get` allocates a fresh `GlutenConfig(SQLConf.get)`
      on every call (`GlutenConfig.scala` L584-L586) plus a `SparkConf`
      hashmap lookup
   3. `ColumnarBatchSerializerJniWrapper.create(Runtimes.contextInstance(...))`
      -- JNI wrapper construction
   
   This PR hoists all three out of `next()` into the `mapPartitions` body,
   alongside the `structSchema` value that the same block already hoists
   for the identical many-small-batch GC-pressure reason. Only the
   per-batch `handle = ColumnarBatches.getNativeHandle(backendName, batch)`
   remains inside `next()`, since it genuinely depends on the batch.
   
   ## Why are the changes needed?
   
   The block immediately preceding `new Iterator[CachedBatch]` already
   hoists `structSchema` with a comment explicitly citing many-small-batch
   GC pressure. The three values listed above belong in the same hoist --
   they are all partition-iterator-scoped constants whose per-batch
   re-evaluation produces zero behavioral signal but consumes allocator /
   JNI traffic on every batch.
   
   `GlutenConfig.get` is especially worth hoisting because it eagerly
   constructs a new `GlutenConfig(SQLConf.get)` on each invocation rather
   than returning a cached instance:
   
   ```scala
   def get: GlutenConfig = new GlutenConfig(SQLConf.get)
   ```
   
   ## Does this PR introduce any user-facing change?
   
   No. Wire format is byte-identical; `partitionStatsEnabled` is now
   captured once at partition start instead of re-read per batch, which
   matches the existing semantics of `SQLConf.get` in Spark task
   execution (already TaskContext-thread-snapshot) -- no observable
   difference from the per-batch re-read.
   
   ## How was this patch tested?
   
   The change is a pure refactor with byte-identical wire output. The
   behavior is fully covered by the existing serializer test suites:
   
   - `ColumnarCachedBatchKryoSuite` -- 6 tests (incl. V1 wire back-compat 
contract)
   - `ColumnarCachedBatchKryoBoundaryProbeBugSuite` -- 1 test (#12147)
   
   All 7 tests green locally on `-Pspark-4.1 -Pscala-2.13`.
   
   No new test file is added. Cross-batch reuse of the hoisted JNI
   wrapper is safe because (a) `ColumnarBatchSerializerJniWrapper` holds
   only a `final Runtime runtime` reference with no mutable per-batch
   state, and (b) the native side (`cpp/core/jni/JniWrapper.cc` L1280-L1325)
   constructs a fresh `ctx->createColumnarBatchSerializer(nullptr)` on
   every JNI call -- the wrapper is a stateless route to the runtime.
   
   ---
   
   **Was this patch authored or co-authored using generative AI tooling?**
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [VL] Hoist per-partition constants out of ColumnarCachedBatchSerializer.serialize hot path [gluten]

Reply via email to