[PR] [VL] Reduce Velox scan SQL metrics by default to mitigate driver OOM [gluten]

via GitHub Fri, 22 May 2026 00:41:22 -0700


lifulong opened a new pull request, #12127:
URL: https://github.com/apache/gluten/pull/12127


   ## What changes are proposed in this pull request?
   
   Gluten jobs on the Velox backend are more prone to driver memory pressure 
than vanilla Spark in some production workloads. Investigation points to scan 
operators registering too many SQL metrics (accumulators).
   
   Each BatchScanExecTransformer / FileSourceScanExecTransformer / 
HiveTableScanExecTransformer previously registered 30+ executor-side metrics 
per scan node.
   
   Vanilla Spark is much leaner. For example, BatchScanExec exposes only 
numOutputRows (plus connector customMetrics), and FileSourceScanExec adds a 
small set of driver metrics (numFiles, metadataTime, etc.).
   
   This gap increases driver heap usage and can contribute to driver OOM, 
especially on scan-heavy queries.
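   
   To give a feel for the scale of the gap, here is a rough count of the 
per-task metric values the driver may have to track for one scan stage. It is 
only a sketch: the task count is a hypothetical input, 30 is the lower bound of 
the "30+" figure above, 9 is the BatchScan default proposed in this PR, and the 
formula assumes every registered metric is reported by every task. It counts 
values; it does not measure heap.
   
   ```scala
   object MetricCountEstimate {
     // Upper bound on metric values reported to the driver for one stage:
     // one value per registered executor-side metric, per scan node, per task.
     def metricValues(scanNodes: Int, metricsPerScan: Int, tasks: Long): Long =
       scanNodes.toLong * metricsPerScan * tasks

     def main(args: Array[String]): Unit = {
       val tasks = 100000L // hypothetical task count for a large scan stage
       val scanNodes = 1   // hypothetical: a single scan node in the stage
       println(metricValues(scanNodes, 30, tasks)) // 3000000
       println(metricValues(scanNodes, 9, tasks))  // 900000
     }
   }
   ```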
   
   <img width="1004" height="352" 
alt="WeCom screenshot 7f05f208-9f83-472b-b638-0aa70650abfc" 
src="https://github.com/user-attachments/assets/fa71ac80-c593-4277-b2b1-d80affb58923"
 />
   
   <img width="590" height="143" 
alt="WeCom screenshot 0f06b928-eff5-4ba8-a1ae-6f87aca571be" 
src="https://github.com/user-attachments/assets/db15f11f-617f-4486-90a0-35ae3825d50d"
 />
   (Gluten failed in the first scan stage, while vanilla Spark finished 
successfully.)
   
   Introduce a Velox-only minimal scan metrics set as the default, with an 
opt-in switch for full metrics collection (debugging / advanced 
troubleshooting):
   spark.gluten.sql.scan.detailedMetrics.enabled
   
   The ClickHouse backend is unchanged; this config does not affect CH scan 
metrics.
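   
   For reference, opting back in to full collection might look like the 
following. This is a sketch, not taken from the patch: it assumes Gluten with 
the Velox backend is already configured for the application, and that the flag 
is read when scan nodes are planned, so it is set when the session is built. 
The app name is a placeholder.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   object EnableDetailedScanMetrics {
     def main(args: Array[String]): Unit = {
       // Assumption: setting the flag at session build time is sufficient;
       // whether it can also be changed at runtime is not stated in this PR.
       val spark = SparkSession.builder()
         .appName("scan-metrics-debug") // placeholder name
         .config("spark.gluten.sql.scan.detailedMetrics.enabled", "true")
         .getOrCreate()

       // Run the query under investigation here, then inspect the scan node
       // in the SQL tab of the Spark UI.
       spark.stop()
     }
   }
   ```
   
   Leaving the key unset keeps the minimal set described below.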
   
   Default minimal metrics (Velox):
   BatchScan (9 executor metrics):
   rawInputRows, rawInputBytes, numOutputRows, outputBytes, scanTime, 
wallNanos, peakMemoryBytes, ioWaitTime, storageReadBytes
   
   FileSourceScan / HiveTableScan (the above plus Spark-aligned driver metrics):
   numFiles, metadataTime, filesSize, numPartitions, pruningTime
   
   Moved to full collection only (registered only when detailed metrics are 
enabled):
   Examples include numInputRows, inputVectors, inputBytes, outputVectors, 
cpuCount, numMemoryAllocations, skippedSplits, processedSplits, 
numDynamicFiltersAccepted, loadLazyVectorTime, skippedStrides, 
processedStrides, connector timing (preloadSplits, pageLoadTime, 
dataSourceAddSplitTime, dataSourceReadTime), and storage cache details 
(storageReads, localReadBytes, ramReadBytes).
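   
   The selection described above can be sketched in plain Scala as follows. 
This is an illustration under assumptions, not the code in this patch: the 
object, the ScanKind type, and metricNames are hypothetical names, and 
detailedOnlyExamples holds only a few of the metrics named above, not the 
complete detailed set.
   
   ```scala
   object ScanMetricsSketch {
     // Executor-side metrics kept by default, as listed in this PR for BatchScan.
     val minimalExecutorMetrics: Seq[String] = Seq(
       "rawInputRows", "rawInputBytes", "numOutputRows", "outputBytes", "scanTime",
       "wallNanos", "peakMemoryBytes", "ioWaitTime", "storageReadBytes")

     // Spark-aligned driver metrics added for FileSourceScan / HiveTableScan.
     val driverMetrics: Seq[String] = Seq(
       "numFiles", "metadataTime", "filesSize", "numPartitions", "pruningTime")

     // A few of the metrics moved behind the opt-in flag (illustrative subset).
     val detailedOnlyExamples: Seq[String] = Seq(
       "numInputRows", "inputVectors", "inputBytes", "outputVectors", "cpuCount",
       "numMemoryAllocations", "skippedSplits", "processedSplits")

     // Hypothetical stand-ins for the three scan transformers.
     sealed trait ScanKind
     case object BatchScan extends ScanKind
     case object FileSourceScan extends ScanKind
     case object HiveTableScan extends ScanKind

     // Names of the metrics a scan node would register for a given flag value.
     def metricNames(kind: ScanKind, detailedEnabled: Boolean): Seq[String] = {
       val base = kind match {
         case BatchScan                      => minimalExecutorMetrics
         case FileSourceScan | HiveTableScan => minimalExecutorMetrics ++ driverMetrics
       }
       if (detailedEnabled) base ++ detailedOnlyExamples else base
     }

     def main(args: Array[String]): Unit = {
       println(metricNames(BatchScan, detailedEnabled = false).size)      // 9
       println(metricNames(FileSourceScan, detailedEnabled = false).size) // 14
     }
   }
   ```
   
   The point of gating at registration time is that a metric that is never 
registered costs the driver nothing, whereas a registered metric is tracked 
even when nobody looks at it.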
   
   ## How was this patch tested?
   WIP: testing is in progress in our production environment.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   Co-authored using Cursor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

