Re: [PR] [GLUTEN][VL] Optimize Delta Lake DV materialization and plan rule performance [gluten]

via GitHub Mon, 29 Jun 2026 04:58:09 -0700


iemejia commented on code in PR #12390:
URL: https://github.com/apache/gluten/pull/12390#discussion_r3491496850



##########
gluten-substrait/src/main/java/org/apache/gluten/substrait/rel/LocalFilesNode.java:
##########
@@ -96,6 +96,9 @@ public enum ReadFileFormat {
   /**
    * Copies an existing node, replacing its per-file extra metadata. Lets data-lake subclasses
    * decorate a generically built node without re-deriving the file listing.
+   *
+   * <p>Note: uses direct list reference transfer (not deep copy) for efficiency, since the original
+   * node is typically discarded immediately after this constructor returns.

Review Comment:
   Fixed. Updated the comment to accurately say "shallow list copy (element 
references are shared, not deep-copied)". A deep copy is unnecessary here since 
callers supply freshly built maps and the original node is discarded 
immediately after construction.
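
For readers following the thread, the distinction the reply draws can be sketched as below. This is a minimal illustration, not the code in `LocalFilesNode`: the class `ShallowCopyDemo` and both helper methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; it is not part of Gluten.
public class ShallowCopyDemo {

  // Shallow copy: a new list object whose elements are the SAME map instances.
  public static List<Map<String, Object>> shallowCopy(List<Map<String, Object>> src) {
    return new ArrayList<>(src);
  }

  // One-level deep copy: a new list holding a fresh map per element.
  public static List<Map<String, Object>> deepCopy(List<Map<String, Object>> src) {
    List<Map<String, Object>> out = new ArrayList<>(src.size());
    for (Map<String, Object> m : src) {
      out.add(new HashMap<>(m));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> meta = new HashMap<>();
    meta.put("key", "original");
    List<Map<String, Object>> source = new ArrayList<>();
    source.add(meta);

    List<Map<String, Object>> shallow = shallowCopy(source);
    List<Map<String, Object>> deep = deepCopy(source);

    // Mutating the original map is visible through the shallow copy only.
    meta.put("key", "changed");
    System.out.println(shallow.get(0).get("key"));
    System.out.println(deep.get(0).get("key"));
  }
}
```

If the source maps are never touched again after the copy, which is the situation the reply describes, sharing element references is harmless and avoids one map allocation per file.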



##########
gluten-delta/src-delta40/main/scala/org/apache/gluten/delta/DeltaDeletionVectorScanInfo.scala:
##########
@@ -62,10 +64,20 @@ object DeltaDeletionVectorScanInfo {
    * Materializes per-file Delta DV read options for a split, alongside each file's metadata with
    * the DV bookkeeping keys stripped. Returns None when no file in the split carries a deletion
    * vector, so callers can keep the generic split representation.
+   *
+   * Performance: resolves the table path once (using the first file) and reuses a single Hadoop
+   * Configuration instance across all files in the partition to avoid redundant filesystem I/O and
+   * object allocation.
    */
   def normalize(partitionColumnCount: Int, partitionFiles: Seq[PartitionedFile])
       : Option[(Seq[JMap[String, Object]], Seq[DeltaFileReadOptions])] = {
-    val scanInfos = extractAll(activeSparkSession, partitionColumnCount, partitionFiles)
+    val spark = activeSparkSession
+    val hadoopConf = spark.sessionState.newHadoopConf()
+    val cachedTablePath = resolveTablePath(hadoopConf, partitionColumnCount, partitionFiles.head)

Review Comment:
   Fixed. Added an early `if (partitionFiles.isEmpty) return None` guard before 
accessing `.head`.
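
The guard matters because `.head` on an empty Scala `Seq` throws `NoSuchElementException`. The updated Scala code is not shown in this message, so the following is only a sketch of the same guard-before-first-element pattern, written in Java. The class `FirstFileGuardDemo`, the method `resolveFromFirst`, and the "parent directory of the first file" rule are all hypothetical stand-ins; they are not Gluten's `resolveTablePath`.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical demo class; it is not part of Gluten.
public class FirstFileGuardDemo {

  // Derives a value from the first file only, returning empty when there is no first file.
  public static Optional<String> resolveFromFirst(List<String> files) {
    // Guard first: files.get(0) would throw IndexOutOfBoundsException on an empty list.
    if (files.isEmpty()) {
      return Optional.empty();
    }
    String first = files.get(0);
    // Assumption for illustration only: treat the parent directory as the "table path".
    int slash = first.lastIndexOf('/');
    return Optional.of(slash >= 0 ? first.substring(0, slash) : first);
  }

  public static void main(String[] args) {
    System.out.println(resolveFromFirst(List.of()).isPresent());
    System.out.println(resolveFromFirst(List.of("/table/part-0.parquet")).orElse("none"));
  }
}
```

Returning an empty result for an empty input matches the documented contract of `normalize`, which already returns `None` when no file in the split carries a deletion vector.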



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

