(gluten) branch main updated: [GLUTEN-6887][VL] Daily Update Velox Version (2026_04_08) (#11891)

yuanzhou Sat, 11 Apr 2026 01:38:02 -0700

This is an automated email from the ASF dual-hosted git repository.

yuanzhou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gluten.git



The following commit(s) were added to refs/heads/main by this push:
     new 3af2e33967 [GLUTEN-6887][VL] Daily Update Velox Version (2026_04_08) (#11891)
3af2e33967 is described below

commit 3af2e339674304549c3cf2cce3cb9941d39661a2
Author: Gluten Performance Bot <[email protected]>
AuthorDate: Sat Apr 11 09:37:30 2026 +0100

    [GLUTEN-6887][VL] Daily Update Velox Version (2026_04_08) (#11891)
    
    * [GLUTEN-6887][VL] Daily Update Velox Version (dft-2026_04_08)
    
    Upstream Velox's New Commits:
    d7891436c by Masha Basmanova, fix: Skip custom type coercion for parameterized types (#17064)
    fbda33022 by Han Yan, feat(dwio): Add BufferPool for reusing cached BufferPtr objects (#17042)
    6e5224164 by Krishna Pai, fix(ci): Add OIDC permission and unrestrict Bash for CI failure analysis (#17061)
    b43a8c892 by Peter Enescu, feat: Allow EncodedVectorCopy to generate FlatMapVector in non-NULL vectors (#16161)
    1355dd3ab by Pratik Pugalia, fix: GetTimestampFunction recompiling datetime format on every row (#17037)
    c7d5b0104 by Krishna Pai, fix(ci): Use bash parameter expansion for multiline metadata substitution (#17058)
    472701f4a by Pratik Pugalia, fix: Remove per-query timeout in TableEvolutionFuzzer (#17046)
    933dd4e10 by Kevin Wilfong, fix: Remove unnecessary output_ field from IndexLookupJoin (#17043)
    034f86cb0 by Pratik Pugalia, fix: Increase Presto request timeout for parallel fuzzer runs (#17049)
    ba21e5661 by Ke Wang, fix: Allow IoStats to override storageReadBytes in getRuntimeStats (#17036)
    39d3494de by Masha Basmanova, fix: Change array_sort comparator lambda return type from bigint to integer (#17030)
    682c4a8e7 by Artem Selishchev, fix: Catch exceptions from TaskCompletionListeners in Task::onTaskCompletion() (#17051)
    c3f34536b by Varun Srinivas, fix(remote): Use VELOX_USER_FAIL for remote error re-throwing (#16903)
    a7d9036ae by Krishna Pai, feat(ci): Use Claude to analyze CI failures and post diagnostic PR comments (#17039)
    e9d03d8b3 by Kk Pulla, fix(exec): Fix data race in OutputBuffer::getUtilization and isOverutilized (#17009)
    fe24ae068 by Pratik Pugalia, fix: BetweenFunction to handle NaN with correct Spark semantics (#17025)
    dd58b1536 by Rui Mo, fix: Change count metric from signed to unsigned (int64_t -> uint64_t) (#16989)
    303bba60c by Mahadevuni Naveen Kumar, refactor: Revert iceberg data file statistics changes (#16999)
    a649489c1 by Simon Eves, fix(cudf): Fix failure in ToCudfSelectionTest.zeroColumnCountConstantFallsBack (#17031)
    95894c30a by Christian Zentgraf, feat(s3): Add support for hive.s3.min-part-size when writing (#16935)
    01b86e20d by Konjac Huang, refactor: refactor filebased datasource (#16914)
    7ea56098a by Rajeev Singh, feat(expr-eval): Fix flaky adaptiveCpuSamplingPerFunctionRates test (#17002)
    37e897b30 by joey.ljy, test: Use VectorFuzzer for random RowVector generation in `semiJoinDeduplicateResetCapacity` test (#15748)
    509ab8fd2 by Chengcheng Jin, feat(cudf): Add config to set timestamp unit (#16769)
    338598815 by Masha Basmanova, refactor: Migrate production code to ConnectorRegistry API and deprecate free functions (#16986)
    b65b5c1c5 by Pratik Pugalia, fix: TempFilePath fd_ member initialization order bug causing flaky test failures (#17020)
    95ce76125 by Rui Mo, test: Extend cast tests in the expression fuzzer test (#16990)
    4fb74c52f by Miguel Blanco Godón, feat: Support reading PARQUET files with zero offset (#16456)
    1dfcfbbcc by Kent Yao, fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility (#16947)
    65800681f by Masha Basmanova, refactor: Migrate test and fuzzer code to ConnectorRegistry API (#16985)
    4acf9bb28 by Masha Basmanova, feat: Add ScopedRegistry and query-scoped connector lookups (#16982)
    4a966b2ef by Pratik Pugalia, Fix: SIGSEGV in AggregationFuzzer when reference query returns empty result vector (#17018)
    4bbea83dc by Krishna Pai, feat(ci): Add workflow_run workflow for posting CI failure comments on PRs (#17022)
    d14cd0c27 by Matt Gara, fix(cudf): Enable GPU execution for count(*), count(column), and count(NULL) (#16522)
    084f2221a by Rui Mo, misc: Make `DirectBufferedInput` clone fields protected (#16979)
    d9c1b6ea3 by Krishna Pai, build(ci): Grant pull-requests write permission to Linux build workflow (#17021)
    cff0a6e36 by David Reveman, build: Update perfetto SDK to v54 (#17004)
    7534c2e47 by Bradley Dice, fix(cudf): Refactor CudfToVelox output batching to avoid O(n) D->H syncs (#16620)
    f736ec1d8 by Masha Basmanova, refactor: Add thread safety to connector registry (#16978)
    b79f0d188 by Krishna Pai, feat(ci): Add flaky test retry and JUnit XML reporting (#17003)
    cf7d5a7b7 by Natasha Sehgal, feat: Add pmod (positive modulo) function to Presto SQL (#17008)
    388105ba3 by Bradley Dice, fix(build): Add missing GTest::gmock link to velox_hive_connector_test (#16996)
    9d7a2ee24 by Andrii Rosa, fix: support NaN and Inf serialization for Variant (#17007)
    
    Signed-off-by: glutenperfbot <[email protected]>
    
    * Resolve compile issue
    
    * feat(velox): Support RESPECT NULLS for collect_list/collect_set
    
    Add ignoreNulls parameter to VeloxCollectList/VeloxCollectSet to support
    Spark's RESPECT NULLS syntax (SPARK-55256). When ignoreNulls=false, null
    elements are included in the collected array.
    
    - VeloxCollect: conditionally skip nulls based on ignoreNulls parameter
    - CollectRewriteRule: propagate ignoreNulls from Spark's CollectList/CollectSet
      via reflection (backward-compatible with Spark versions without ignoreNulls)
    - ArrayType containsNull reflects the ignoreNulls setting
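
A minimal sketch of the null-handling rule described above, written in C++ for illustration only. The function name `collectList` and the use of `std::optional<int>` to model nullable values are assumptions of this sketch; the actual change lives in the Scala class `VeloxCollect` shown in the diff below.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for a collect_list style aggregate over nullable ints.
// A null input element is modeled as std::nullopt.
std::vector<std::optional<int>> collectList(
    const std::vector<std::optional<int>>& input,
    bool ignoreNulls) {
  std::vector<std::optional<int>> buffer;
  for (const auto& value : input) {
    // With ignoreNulls = true, null elements are skipped (the previous behavior).
    // With ignoreNulls = false (RESPECT NULLS), they are kept in the result.
    if (ignoreNulls && !value.has_value()) {
      continue;
    }
    buffer.push_back(value);
  }
  return buffer;
}
```

For the input {1, null, 2}, this sketch returns two elements when ignoreNulls is true and three when it is false, which is why the result array type has to allow null elements in the second case.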
    
    Co-authored-by: Copilot <[email protected]>
    
    * fix(velox): Handle generic-typed companion function lookup for collect_set/list
    
    When aggregate functions have multiple signatures with the same intermediate
    type (e.g., collect_set with 1-arg and 2-arg signatures), Velox registers
    companion functions with suffix using generic type variables (e.g.,
    collect_set_merge_extract_array_T). The Substrait layer was constructing
    concrete type suffixes (e.g., array_row_VARCHAR_BIGINT_BIGINT_endrow) that
    don't match.
    
    Fix: After failing exact concrete suffix lookup, fall back to discovering
    companion function names via getCompanionFunctionSignatures() API.
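
The two-step lookup can be sketched with a toy registry, shown here only to illustrate the control flow. `ToyRegistry`, `resolveMergeExtract`, and the way the suffixed name is assembled are hypothetical; they stand in for the Velox registry calls used in the real change in `SubstraitToVeloxPlan.cc` below and do not reproduce their signatures.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of an aggregate function registry.
struct ToyRegistry {
  // Function names that have at least one registered signature.
  std::set<std::string> registered;
  // Base aggregate name -> names of its merge_extract companion functions.
  std::map<std::string, std::vector<std::string>> mergeExtractCompanions;
};

// Step 1: try the name built from the concrete result-type suffix.
// Step 2: otherwise, scan the companions recorded for the base function.
// Returns std::nullopt when neither step finds a registered function.
std::optional<std::string> resolveMergeExtract(
    const ToyRegistry& registry,
    const std::string& baseName,
    const std::string& concreteSuffix) {
  const std::string suffixed = baseName + "_merge_extract_" + concreteSuffix;
  if (registry.registered.count(suffixed) > 0) {
    return suffixed;
  }
  const auto it = registry.mergeExtractCompanions.find(baseName);
  if (it != registry.mergeExtractCompanions.end()) {
    for (const auto& candidate : it->second) {
      if (registry.registered.count(candidate) > 0) {
        return candidate;
      }
    }
  }
  return std::nullopt;
}
```

In this sketch, a registry that only knows `collect_set_merge_extract_array_T` still resolves a request for `collect_set` with a concrete suffix, because the second step returns the generic companion name. Like the real fallback, it returns the first registered companion without checking that its type matches.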
    
    Co-authored-by: Copilot <[email protected]>
    
    * trigger
    
    ---------
    
    Signed-off-by: glutenperfbot <[email protected]>
    Co-authored-by: glutenperfbot <[email protected]>
    Co-authored-by: Ke Jia <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Co-authored-by: Copilot <[email protected]>
    Co-authored-by: Yuan <[email protected]>
---
 .../gluten/expression/aggregate/VeloxCollect.scala | 27 +++++++++++--------
 .../gluten/extension/CollectRewriteRule.scala      | 13 ++++++++--
 cpp/velox/substrait/SubstraitToVeloxPlan.cc        | 30 +++++++++++++++++-----
 cpp/velox/utils/ConfigExtractor.cc                 |  2 +-
 ep/build-velox/src/get-velox.sh                    |  4 +--
 5 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala b/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
index dc41bbc4fc..0945343ce9 100644
--- a/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
+++ b/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
@@ -21,13 +21,13 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
 import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.types.{ArrayType, DataType}
 
-abstract class VeloxCollect(child: Expression)
+abstract class VeloxCollect(child: Expression, val ignoreNulls: Boolean)
   extends DeclarativeAggregate
   with UnaryLike[Expression] {
 
   protected lazy val buffer: AttributeReference = AttributeReference("buffer", dataType)()
 
-  override def dataType: DataType = ArrayType(child.dataType, false)
+  override def dataType: DataType = ArrayType(child.dataType, !ignoreNulls)
 
   override def nullable: Boolean = false
 
@@ -35,12 +35,17 @@ abstract class VeloxCollect(child: Expression)
 
   override lazy val initialValues: Seq[Expression] = Seq(Literal.create(Array(), dataType))
 
-  override lazy val updateExpressions: Seq[Expression] = Seq(
-    If(
-      IsNull(child),
-      buffer,
-      Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false))))
-  )
+  override lazy val updateExpressions: Seq[Expression] = {
+    val append = if (ignoreNulls) {
+      If(
+        IsNull(child),
+        buffer,
+        Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false))))
+    } else {
+      Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false)))
+    }
+    Seq(append)
+  }
 
   override lazy val mergeExpressions: Seq[Expression] = Seq(
     Concat(Seq(buffer.left, buffer.right))
@@ -49,7 +54,8 @@ abstract class VeloxCollect(child: Expression)
   override def defaultResult: Option[Literal] = Option(Literal.create(Array(), dataType))
 }
 
-case class VeloxCollectSet(child: Expression) extends VeloxCollect(child) {
+case class VeloxCollectSet(child: Expression, override val ignoreNulls: Boolean = true)
+  extends VeloxCollect(child, ignoreNulls) {
 
   override lazy val evaluateExpression: Expression =
     ArrayDistinct(buffer)
@@ -60,7 +66,8 @@ case class VeloxCollectSet(child: Expression) extends VeloxCollect(child) {
     copy(child = newChild)
 }
 
-case class VeloxCollectList(child: Expression) extends VeloxCollect(child) {
+case class VeloxCollectList(child: Expression, override val ignoreNulls: Boolean = true)
+  extends VeloxCollect(child, ignoreNulls) {
 
   override val evaluateExpression: Expression = buffer
 
diff --git a/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala b/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
index e76de56374..72e52cf3dd 100644
--- a/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
+++ b/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
@@ -67,15 +67,24 @@ object CollectRewriteRule {
     def unapply(expr: Expression): Option[Expression] = expr match {
       case aggExpr @ AggregateExpression(s: CollectSet, _, _, _, _) if has[VeloxCollectSet] =>
         val newAggExpr =
-          aggExpr.copy(aggregateFunction = VeloxCollectSet(s.child))
+          aggExpr.copy(aggregateFunction = VeloxCollectSet(s.child, getIgnoreNulls(s)))
         Some(newAggExpr)
       case aggExpr @ AggregateExpression(l: CollectList, _, _, _, _) if has[VeloxCollectList] =>
-        val newAggExpr = aggExpr.copy(VeloxCollectList(l.child))
+        val newAggExpr = aggExpr.copy(VeloxCollectList(l.child, getIgnoreNulls(l)))
         Some(newAggExpr)
       case _ => None
     }
   }
 
+  private def getIgnoreNulls(expr: Expression): Boolean = {
+    try {
+      val method = expr.getClass.getMethod("ignoreNulls")
+      method.invoke(expr).asInstanceOf[Boolean]
+    } catch {
+      case _: NoSuchMethodException => true // Default: ignore nulls
+    }
+  }
+
   private def has[T <: Expression: ClassTag]: Boolean =
     ExpressionMappings.expressionsMap.contains(classTag[T].runtimeClass)
 }
diff --git a/cpp/velox/substrait/SubstraitToVeloxPlan.cc b/cpp/velox/substrait/SubstraitToVeloxPlan.cc
index e9a2417d92..c7ec69906d 100644
--- a/cpp/velox/substrait/SubstraitToVeloxPlan.cc
+++ b/cpp/velox/substrait/SubstraitToVeloxPlan.cc
@@ -284,14 +284,30 @@ std::string SubstraitToVeloxPlanConverter::toAggregationFunctionName(
         // The merge_extract function is registered without suffix.
         return functionName;
       }
-      // The merge_extract function must be registered with suffix based on result type.
-      functionName += ("_" + companionFunctionSuffix(resultType));
-      signatures = exec::getAggregateFunctionSignatures(functionName);
-      VELOX_CHECK(
-          signatures.has_value() && signatures.value().size() > 0,
+      // The merge_extract function must be registered with suffix based on
+      // result type. First try exact concrete type suffix.
+      auto suffixedName =
+          functionName + "_" + companionFunctionSuffix(resultType);
+      signatures = exec::getAggregateFunctionSignatures(suffixedName);
+      if (signatures.has_value() && signatures.value().size() > 0) {
+        return suffixedName;
+      }
+      // When companion functions are registered with generic type variables
+      // (e.g., "collect_set_merge_extract_array_T"), look up companion
+      // function names from the aggregate function registry.
+      auto companionSigs = exec::getCompanionFunctionSignatures(baseName);
+      if (companionSigs.has_value()) {
+        for (const auto& entry : companionSigs->mergeExtract) {
+          auto entrySigs =
+              exec::getAggregateFunctionSignatures(entry.functionName);
+          if (entrySigs.has_value() && entrySigs.value().size() > 0) {
+            return entry.functionName;
+          }
+        }
+      }
+      VELOX_FAIL(
           "Cannot find function signature for {} in final aggregation step.",
-          functionName);
-      return functionName;
+          suffixedName);
     }
     case core::AggregationNode::Step::kIntermediate:
       suffix = "_merge";
diff --git a/cpp/velox/utils/ConfigExtractor.cc b/cpp/velox/utils/ConfigExtractor.cc
index 613331bdbb..b1c3767644 100644
--- a/cpp/velox/utils/ConfigExtractor.cc
+++ b/cpp/velox/utils/ConfigExtractor.cc
@@ -229,7 +229,7 @@ std::shared_ptr<facebook::velox::config::ConfigBase> createHiveConnectorSessionC
   configs[facebook::velox::connector::hive::HiveConfig::kFileColumnNamesReadAsLowerCaseSession] =
       !conf->get<bool>(kCaseSensitive, false) ? "true" : "false";
   configs[facebook::velox::connector::hive::HiveConfig::kPartitionPathAsLowerCaseSession] = "false";
-  configs[facebook::velox::parquet::WriterOptions::kParquetSessionWriteTimestampUnit] = std::string("6");
+  configs[facebook::velox::parquet::WriterOptions::kParquetWriteTimestampUnit] = std::string("6");
   configs[facebook::velox::connector::hive::HiveConfig::kReadTimestampUnitSession] = std::string("6");
   configs[facebook::velox::connector::hive::HiveConfig::kMaxPartitionsPerWritersSession] =
       conf->get<std::string>(kMaxPartitions, "10000");
diff --git a/ep/build-velox/src/get-velox.sh b/ep/build-velox/src/get-velox.sh
index 27e083bbf1..14afa2e581 100755
--- a/ep/build-velox/src/get-velox.sh
+++ b/ep/build-velox/src/get-velox.sh
@@ -18,8 +18,8 @@ set -exu
 
 CURRENT_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
 VELOX_REPO=https://github.com/IBM/velox.git
-VELOX_BRANCH=dft-2026_04_01-iceberg
-VELOX_ENHANCED_BRANCH=ibm-2026_04_01-fix
+VELOX_BRANCH=dft-2026_04_08
+VELOX_ENHANCED_BRANCH=ibm-2026_04_08
 VELOX_HOME=""
 RUN_SETUP_SCRIPT=ON
 ENABLE_ENHANCED_FEATURES=OFF


