This is an automated email from the ASF dual-hosted git repository. yuanzhou pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/incubator-gluten.git
The following commit(s) were added to refs/heads/main by this push:
new cc34b49101 [GLUTEN-10660][VL] Adding configuration for hash table build (#10634)
cc34b49101 is described below
commit cc34b49101019d561ec527a77aa53e9d74a82027
Author: Yuan <[email protected]>
AuthorDate: Mon Sep 29 13:48:06 2025 +0100
[GLUTEN-10660][VL] Adding configuration for hash table build (#10634)
Adding configurations for hash map build optimization
---------
Signed-off-by: Yuan <[email protected]>
---
.../org/apache/gluten/config/VeloxConfig.scala | 14 ++++++++++
cpp/velox/compute/WholeStageResultIterator.cc | 6 +++++
cpp/velox/config/VeloxConfig.h | 4 +++
docs/Configuration.md | 30 +++++++++++-----------
docs/velox-configuration.md | 2 ++
5 files changed, 41 insertions(+), 15 deletions(-)
diff --git a/backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala b/backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala
index e87b18e078..d7eee76c53 100644
--- a/backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala
+++ b/backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala
@@ -529,6 +529,20 @@ object VeloxConfig {
.booleanConf
.createWithDefault(false)
+  val VELOX_HASHMAP_ABANDON_BUILD_DUPHASH_MIN_ROWS =
+    buildConf("spark.gluten.velox.abandonbuild.noduphashminrows")
+      .experimental()
+      .doc("Experimental: abandon hashmap build if duplicated rows are more than this number.")
+      .intConf
+      .createWithDefault(100000)
+
+  val VELOX_HASHMAP_ABANDON_BUILD_DUPHASH_MIN_PCT =
+    buildConf("spark.gluten.velox.abandonbuild.noduphashminpct")
+      .experimental()
+      .doc("Experimental: abandon hashmap build if duplicated rows are more than this percentage.")
+      .doubleConf
+      .createWithDefault(0)
+
val QUERY_TRACE_ENABLED =
buildConf("spark.gluten.sql.columnar.backend.velox.queryTraceEnabled")
.doc("Enable query tracing flag.")
.booleanConf
diff --git a/cpp/velox/compute/WholeStageResultIterator.cc b/cpp/velox/compute/WholeStageResultIterator.cc
index d2b130f629..3dd1dd784d 100644
--- a/cpp/velox/compute/WholeStageResultIterator.cc
+++ b/cpp/velox/compute/WholeStageResultIterator.cc
@@ -608,6 +608,12 @@ std::unordered_map<std::string, std::string> WholeStageResultIterator::getQueryC
   configs[velox::core::QueryConfig::kMaxSplitPreloadPerDriver] =
       std::to_string(veloxCfg_->get<int32_t>(kVeloxSplitPreloadPerDriver, 2));
+  // hashtable build optimizations
+  configs[velox::core::QueryConfig::kAbandonBuildNoDupHashMinRows] =
+      std::to_string(veloxCfg_->get<int32_t>(kAbandonBuildNoDupHashMinRows, 100000));
+  configs[velox::core::QueryConfig::kAbandonBuildNoDupHashMinPct] =
+      std::to_string(veloxCfg_->get<int32_t>(kAbandonBuildNoDupHashMinPct, 0));
+
// Disable driver cpu time slicing.
configs[velox::core::QueryConfig::kDriverCpuTimeSliceLimitMs] = "0";
diff --git a/cpp/velox/config/VeloxConfig.h b/cpp/velox/config/VeloxConfig.h
index e37c99987e..2a2bdaf56d 100644
--- a/cpp/velox/config/VeloxConfig.h
+++ b/cpp/velox/config/VeloxConfig.h
@@ -60,6 +60,10 @@ const std::string kAbandonPartialAggregationMinPct =
const std::string kAbandonPartialAggregationMinRows =
"spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinRows";
+// hashmap build
+const std::string kAbandonBuildNoDupHashMinRows = "spark.gluten.velox.abandonbuild.noduphashminrows";
+const std::string kAbandonBuildNoDupHashMinPct = "spark.gluten.velox.abandonbuild.noduphashminpct";
+
// execution
const std::string kBloomFilterExpectedNumItems =
"spark.gluten.sql.columnar.backend.velox.bloomFilter.expectedNumItems";
const std::string kBloomFilterNumBits =
"spark.gluten.sql.columnar.backend.velox.bloomFilter.numBits";
diff --git a/docs/Configuration.md b/docs/Configuration.md
index b7e725278a..9aacdb3777 100644
--- a/docs/Configuration.md
+++ b/docs/Configuration.md
@@ -75,21 +75,21 @@ nav_order: 15
| spark.gluten.sql.columnar.maxBatchSize | 4096 | |
| spark.gluten.sql.columnar.overwriteByExpression | true | Enable or disable columnar v2 command overwrite by expression. |
| spark.gluten.sql.columnar.parquet.write.blockSize | 128MB | |
-| spark.gluten.sql.columnar.partial.project | true | Break up one project node into 2 phases when some of the expressions are non offload-able. Phase one is a regular offloaded project transformer that evaluates the offload-able expressions in native, phase two preserves the output from phase one and evaluates the remaining non-offload-able expressions using vanilla Spark projections [...]
-| spark.gluten.sql.columnar.partial.generate | true | evaluates the non-offload-able HiveUDTF using vanilla Spark generator [...]
-| spark.gluten.sql.columnar.physicalJoinOptimizationLevel | 12 | Fallback to row operators if there are several continuous joins. [...]
-| spark.gluten.sql.columnar.physicalJoinOptimizeEnable | false | Enable or disable columnar physicalJoinOptimize. [...]
-| spark.gluten.sql.columnar.preferStreamingAggregate | true | Velox backend supports `StreamingAggregate`. `StreamingAggregate` uses the less memory as it does not need to hold all groups in memory, so it could avoid spill. When true and the child output ordering satisfies the grouping key then Gluten will choose `StreamingAggregate` as the native operator. [...]
-| spark.gluten.sql.columnar.project | true | Enable or disable columnar project. [...]
-| spark.gluten.sql.columnar.project.collapse | true | Combines two columnar project operators into one and perform alias substitution [...]
-| spark.gluten.sql.columnar.query.fallback.threshold | -1 | The threshold for whether query will fall back by counting the number of ColumnarToRow & vanilla leaf node. [...]
-| spark.gluten.sql.columnar.range | true | Enable or disable columnar range. [...]
-| spark.gluten.sql.columnar.replaceData | true | Enable or disable columnar v2 command replace data. [...]
-| spark.gluten.sql.columnar.scanOnly | false | When enabled, only scan and the filter after scan will be offloaded to native. [...]
-| spark.gluten.sql.columnar.shuffle | true | Enable or disable columnar shuffle. [...]
-| spark.gluten.sql.columnar.shuffle.celeborn.fallback.enabled | true | If enabled, fall back to ColumnarShuffleManager when celeborn service is unavailable.Otherwise, throw an exception. [...]
-| spark.gluten.sql.columnar.shuffle.celeborn.useRssSort | true | If true, use RSS sort implementation for Celeborn sort-based shuffle.If false, use Gluten's row-based sort implementation. Only valid when `spark.celeborn.client.spark.shuffle.writer` is set to `sort`. [...]
-| spark.gluten.sql.columnar.shuffle.codec | <undefined> | By default, the supported codecs are lz4 and zstd. When spark.gluten.sql.columnar.shuffle.codecBackend=qat,the supported codecs are gzip and zstd. [...]
+| spark.gluten.sql.columnar.partial.generate | true | Evaluates the non-offload-able HiveUDTF using vanilla Spark generator |
+| spark.gluten.sql.columnar.partial.project | true | Break up one project node into 2 phases when some of the expressions are non offload-able. Phase one is a regular offloaded project transformer that evaluates the offload-able expressions in native, phase two preserves the output from phase one and evaluates the remaining non-offload-able expressions using vanilla Spark projections |
+| spark.gluten.sql.columnar.physicalJoinOptimizationLevel | 12 | Fallback to row operators if there are several continuous joins. |
+| spark.gluten.sql.columnar.physicalJoinOptimizeEnable | false | Enable or disable columnar physicalJoinOptimize. |
+| spark.gluten.sql.columnar.preferStreamingAggregate | true | Velox backend supports `StreamingAggregate`. `StreamingAggregate` uses less memory as it does not need to hold all groups in memory, so it could avoid spill. When true and the child output ordering satisfies the grouping key then Gluten will choose `StreamingAggregate` as the native operator. |
+| spark.gluten.sql.columnar.project | true | Enable or disable columnar project. |
+| spark.gluten.sql.columnar.project.collapse | true | Combines two columnar project operators into one and performs alias substitution |
+| spark.gluten.sql.columnar.query.fallback.threshold | -1 | The threshold for whether query will fall back by counting the number of ColumnarToRow & vanilla leaf node. |
+| spark.gluten.sql.columnar.range | true | Enable or disable columnar range. |
+| spark.gluten.sql.columnar.replaceData | true | Enable or disable columnar v2 command replace data. |
+| spark.gluten.sql.columnar.scanOnly | false | When enabled, only scan and the filter after scan will be offloaded to native. |
+| spark.gluten.sql.columnar.shuffle | true | Enable or disable columnar shuffle. |
+| spark.gluten.sql.columnar.shuffle.celeborn.fallback.enabled | true | If enabled, fall back to ColumnarShuffleManager when celeborn service is unavailable. Otherwise, throw an exception. |
+| spark.gluten.sql.columnar.shuffle.celeborn.useRssSort | true | If true, use RSS sort implementation for Celeborn sort-based shuffle. If false, use Gluten's row-based sort implementation. Only valid when `spark.celeborn.client.spark.shuffle.writer` is set to `sort`. |
+| spark.gluten.sql.columnar.shuffle.codec | <undefined> | By default, the supported codecs are lz4 and zstd. When spark.gluten.sql.columnar.shuffle.codecBackend=qat, the supported codecs are gzip and zstd. |
| spark.gluten.sql.columnar.shuffle.codecBackend | <undefined> | |
| spark.gluten.sql.columnar.shuffle.compression.threshold | 100 | If number of rows in a batch falls below this threshold, will copy all buffers into one buffer to compress. |
| spark.gluten.sql.columnar.shuffle.dictionary.enabled | false | Enable dictionary in hash-based shuffle. |
diff --git a/docs/velox-configuration.md b/docs/velox-configuration.md
index b5724f24e8..d4fe5c2639 100644
--- a/docs/velox-configuration.md
+++ b/docs/velox-configuration.md
@@ -73,5 +73,7 @@ nav_order: 16
| Key | Default | Description |
|----------------------------------------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------|
+| spark.gluten.velox.abandonbuild.noduphashminpct | 0 | Experimental: abandon hashmap build if duplicated rows are more than this percentage. |
+| spark.gluten.velox.abandonbuild.noduphashminrows | 100000 | Experimental: abandon hashmap build if duplicated rows are more than this number. |
| spark.gluten.velox.offHeapBroadcastBuildRelation.enabled | false | Experimental: If enabled, broadcast build relation will use offheap memory. Otherwise, broadcast build relation will use onheap memory. |
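
For reference, a minimal sketch of how the two experimental keys added by this commit could be passed at submit time. Only the two `--conf` entries and their defaults come from the diff above; the application jar name is a placeholder.

```shell
# Hypothetical spark-submit invocation; key names and defaults are taken
# from this commit, your-app.jar is a placeholder.
spark-submit \
  --conf spark.gluten.velox.abandonbuild.noduphashminrows=100000 \
  --conf spark.gluten.velox.abandonbuild.noduphashminpct=0 \
  your-app.jar
```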
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
