This is an automated email from the ASF dual-hosted git repository.
yuanzhou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gluten.git
The following commit(s) were added to refs/heads/main by this push:
new 3af2e33967 [GLUTEN-6887][VL] Daily Update Velox Version (2026_04_08) (#11891)
3af2e33967 is described below
commit 3af2e339674304549c3cf2cce3cb9941d39661a2
Author: Gluten Performance Bot <[email protected]>
AuthorDate: Sat Apr 11 09:37:30 2026 +0100
[GLUTEN-6887][VL] Daily Update Velox Version (2026_04_08) (#11891)
* [GLUTEN-6887][VL] Daily Update Velox Version (dft-2026_04_08)
Upstream Velox's New Commits:
d7891436c by Masha Basmanova, fix: Skip custom type coercion for parameterized types (#17064)
fbda33022 by Han Yan, feat(dwio): Add BufferPool for reusing cached BufferPtr objects (#17042)
6e5224164 by Krishna Pai, fix(ci): Add OIDC permission and unrestrict Bash for CI failure analysis (#17061)
b43a8c892 by Peter Enescu, feat: Allow EncodedVectorCopy to generate FlatMapVector in non-NULL vectors (#16161)
1355dd3ab by Pratik Pugalia, fix: GetTimestampFunction recompiling datetime format on every row (#17037)
c7d5b0104 by Krishna Pai, fix(ci): Use bash parameter expansion for multiline metadata substitution (#17058)
472701f4a by Pratik Pugalia, fix: Remove per-query timeout in TableEvolutionFuzzer (#17046)
933dd4e10 by Kevin Wilfong, fix: Remove unnecessary output_ field from IndexLookupJoin (#17043)
034f86cb0 by Pratik Pugalia, fix: Increase Presto request timeout for parallel fuzzer runs (#17049)
ba21e5661 by Ke Wang, fix: Allow IoStats to override storageReadBytes in getRuntimeStats (#17036)
39d3494de by Masha Basmanova, fix: Change array_sort comparator lambda return type from bigint to integer (#17030)
682c4a8e7 by Artem Selishchev, fix: Catch exceptions from TaskCompletionListeners in Task::onTaskCompletion() (#17051)
c3f34536b by Varun Srinivas, fix(remote): Use VELOX_USER_FAIL for remote error re-throwing (#16903)
a7d9036ae by Krishna Pai, feat(ci): Use Claude to analyze CI failures and post diagnostic PR comments (#17039)
e9d03d8b3 by Kk Pulla, fix(exec): Fix data race in OutputBuffer::getUtilization and isOverutilized (#17009)
fe24ae068 by Pratik Pugalia, fix: BetweenFunction to handle NaN with correct Spark semantics (#17025)
dd58b1536 by Rui Mo, fix: Change count metric from signed to unsigned (int64_t -> uint64_t) (#16989)
303bba60c by Mahadevuni Naveen Kumar, refactor: Revert iceberg data file statistics changes (#16999)
a649489c1 by Simon Eves, fix(cudf): Fix failure in ToCudfSelectionTest.zeroColumnCountConstantFallsBack (#17031)
95894c30a by Christian Zentgraf, feat(s3): Add support for hive.s3.min-part-size when writing (#16935)
01b86e20d by Konjac Huang, refactor: refactor filebased datasource (#16914)
7ea56098a by Rajeev Singh, feat(expr-eval): Fix flaky adaptiveCpuSamplingPerFunctionRates test (#17002)
37e897b30 by joey.ljy, test: Use VectorFuzzer for random RowVector generation in `semiJoinDeduplicateResetCapacity` test (#15748)
509ab8fd2 by Chengcheng Jin, feat(cudf): Add config to set timestamp unit (#16769)
338598815 by Masha Basmanova, refactor: Migrate production code to ConnectorRegistry API and deprecate free functions (#16986)
b65b5c1c5 by Pratik Pugalia, fix: TempFilePath fd_ member initialization order bug causing flaky test failures (#17020)
95ce76125 by Rui Mo, test: Extend cast tests in the expression fuzzer test (#16990)
4fb74c52f by Miguel Blanco Godón, feat: Support reading PARQUET files with zero offset (#16456)
1dfcfbbcc by Kent Yao, fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility (#16947)
65800681f by Masha Basmanova, refactor: Migrate test and fuzzer code to ConnectorRegistry API (#16985)
4acf9bb28 by Masha Basmanova, feat: Add ScopedRegistry and query-scoped connector lookups (#16982)
4a966b2ef by Pratik Pugalia, Fix: SIGSEGV in AggregationFuzzer when reference query returns empty result vector (#17018)
4bbea83dc by Krishna Pai, feat(ci): Add workflow_run workflow for posting CI failure comments on PRs (#17022)
d14cd0c27 by Matt Gara, fix(cudf): Enable GPU execution for count(*), count(column), and count(NULL) (#16522)
084f2221a by Rui Mo, misc: Make `DirectBufferedInput` clone fields protected (#16979)
d9c1b6ea3 by Krishna Pai, build(ci): Grant pull-requests write permission to Linux build workflow (#17021)
cff0a6e36 by David Reveman, build: Update perfetto SDK to v54 (#17004)
7534c2e47 by Bradley Dice, fix(cudf): Refactor CudfToVelox output batching to avoid O(n) D->H syncs (#16620)
f736ec1d8 by Masha Basmanova, refactor: Add thread safety to connector registry (#16978)
b79f0d188 by Krishna Pai, feat(ci): Add flaky test retry and JUnit XML reporting (#17003)
cf7d5a7b7 by Natasha Sehgal, feat: Add pmod (positive modulo) function to Presto SQL (#17008)
388105ba3 by Bradley Dice, fix(build): Add missing GTest::gmock link to velox_hive_connector_test (#16996)
9d7a2ee24 by Andrii Rosa, fix: support NaN and Inf serialization for Variant (#17007)
Signed-off-by: glutenperfbot <[email protected]>
* Resolve compile issue
* feat(velox): Support RESPECT NULLS for collect_list/collect_set
Add ignoreNulls parameter to VeloxCollectList/VeloxCollectSet to support
Spark's RESPECT NULLS syntax (SPARK-55256). When ignoreNulls=false, null
elements are included in the collected array.
- VeloxCollect: conditionally skip nulls based on ignoreNulls parameter
- CollectRewriteRule: propagate ignoreNulls from Spark's CollectList/CollectSet via reflection (backward-compatible with Spark versions without ignoreNulls)
- ArrayType containsNull reflects the ignoreNulls setting
Co-authored-by: Copilot <[email protected]>
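For illustration only, the null handling described in this item can be sketched outside Spark and Velox. The snippet below is a minimal stand-alone model, not code from this commit: `collectUpdate` and `collectList` are hypothetical names, and `std::optional` stands in for a nullable column value. With ignoreNulls=true nulls are skipped; with ignoreNulls=false they are appended, which is why the result array type must allow null elements in that case.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for the aggregate's update step (not Gluten/Velox code).
// Appends `value` to `buffer`, skipping nulls only when ignoreNulls is true.
template <typename T>
void collectUpdate(
    std::vector<std::optional<T>>& buffer,
    const std::optional<T>& value,
    bool ignoreNulls) {
  if (ignoreNulls && !value.has_value()) {
    return; // IGNORE NULLS: leave the buffer unchanged
  }
  buffer.push_back(value); // RESPECT NULLS: the null itself is collected
}

// Hypothetical driver: folds the update step over all input rows.
template <typename T>
std::vector<std::optional<T>> collectList(
    const std::vector<std::optional<T>>& input,
    bool ignoreNulls) {
  std::vector<std::optional<T>> buffer;
  for (const auto& value : input) {
    collectUpdate(buffer, value, ignoreNulls);
  }
  return buffer;
}
```

In this model, an input of {1, null, 2} yields two elements when nulls are ignored and three when they are respected.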
* fix(velox): Handle generic-typed companion function lookup for collect_set/list
When aggregate functions have multiple signatures with the same intermediate
type (e.g., collect_set with 1-arg and 2-arg signatures), Velox registers
companion functions with suffix using generic type variables (e.g.,
collect_set_merge_extract_array_T). The Substrait layer was constructing
concrete type suffixes (e.g., array_row_VARCHAR_BIGINT_BIGINT_endrow) that
don't match.
Fix: After failing exact concrete suffix lookup, fall back to discovering
companion function names via getCompanionFunctionSignatures() API.
Co-authored-by: Copilot <[email protected]>
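The two-stage lookup described in this item can be sketched in isolation. This is a simplified model under stated assumptions, not the Velox API: `Registry` and `resolveMergeExtract` are hypothetical names, and the registry is reduced to a set of registered function names. The concrete-suffix name is tried first, then each companion name advertised for the base function.

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical registry: just the set of registered aggregate function names.
using Registry = std::set<std::string>;

// Tries the concrete-suffix name first, then falls back to the companion
// names listed for the base function. Returns nullopt when nothing matches,
// which is where the real code raises an error.
std::optional<std::string> resolveMergeExtract(
    const Registry& registered,
    const std::string& concreteName,
    const std::vector<std::string>& companionNames) {
  if (registered.count(concreteName) > 0) {
    return concreteName;
  }
  for (const auto& name : companionNames) {
    if (registered.count(name) > 0) {
      return name;
    }
  }
  return std::nullopt;
}
```

With only collect_set_merge_extract_array_T registered, a request for the concrete name ending in array_row_VARCHAR_BIGINT_BIGINT_endrow misses on the first attempt and resolves to the generic name on the fallback.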
* trigger
---------
Signed-off-by: glutenperfbot <[email protected]>
Co-authored-by: glutenperfbot <[email protected]>
Co-authored-by: Ke Jia <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Yuan <[email protected]>
---
.../gluten/expression/aggregate/VeloxCollect.scala | 27 +++++++++++--------
.../gluten/extension/CollectRewriteRule.scala | 13 ++++++++--
cpp/velox/substrait/SubstraitToVeloxPlan.cc | 30 +++++++++++++++++-----
cpp/velox/utils/ConfigExtractor.cc | 2 +-
ep/build-velox/src/get-velox.sh | 4 +--
5 files changed, 54 insertions(+), 22 deletions(-)
diff --git a/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala b/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
index dc41bbc4fc..0945343ce9 100644
--- a/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
+++ b/backends-velox/src/main/scala/org/apache/gluten/expression/aggregate/VeloxCollect.scala
@@ -21,13 +21,13 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
import org.apache.spark.sql.catalyst.trees.UnaryLike
import org.apache.spark.sql.types.{ArrayType, DataType}
-abstract class VeloxCollect(child: Expression)
+abstract class VeloxCollect(child: Expression, val ignoreNulls: Boolean)
extends DeclarativeAggregate
with UnaryLike[Expression] {
protected lazy val buffer: AttributeReference = AttributeReference("buffer", dataType)()
- override def dataType: DataType = ArrayType(child.dataType, false)
+ override def dataType: DataType = ArrayType(child.dataType, !ignoreNulls)
override def nullable: Boolean = false
@@ -35,12 +35,17 @@ abstract class VeloxCollect(child: Expression)
override lazy val initialValues: Seq[Expression] = Seq(Literal.create(Array(), dataType))
- override lazy val updateExpressions: Seq[Expression] = Seq(
- If(
- IsNull(child),
- buffer,
- Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false))))
- )
+ override lazy val updateExpressions: Seq[Expression] = {
+ val append = if (ignoreNulls) {
+ If(
+ IsNull(child),
+ buffer,
+ Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false))))
+ } else {
+ Concat(Seq(buffer, CreateArray(Seq(child), useStringTypeWhenEmpty = false)))
+ }
+ Seq(append)
+ }
override lazy val mergeExpressions: Seq[Expression] = Seq(
Concat(Seq(buffer.left, buffer.right))
@@ -49,7 +54,8 @@ abstract class VeloxCollect(child: Expression)
override def defaultResult: Option[Literal] = Option(Literal.create(Array(), dataType))
}
-case class VeloxCollectSet(child: Expression) extends VeloxCollect(child) {
+case class VeloxCollectSet(child: Expression, override val ignoreNulls: Boolean = true)
+ extends VeloxCollect(child, ignoreNulls) {
override lazy val evaluateExpression: Expression =
ArrayDistinct(buffer)
@@ -60,7 +66,8 @@ case class VeloxCollectSet(child: Expression) extends VeloxCollect(child) {
copy(child = newChild)
}
-case class VeloxCollectList(child: Expression) extends VeloxCollect(child) {
+case class VeloxCollectList(child: Expression, override val ignoreNulls: Boolean = true)
+ extends VeloxCollect(child, ignoreNulls) {
override val evaluateExpression: Expression = buffer
diff --git a/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala b/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
index e76de56374..72e52cf3dd 100644
--- a/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
+++ b/backends-velox/src/main/scala/org/apache/gluten/extension/CollectRewriteRule.scala
@@ -67,15 +67,24 @@ object CollectRewriteRule {
def unapply(expr: Expression): Option[Expression] = expr match {
case aggExpr @ AggregateExpression(s: CollectSet, _, _, _, _) if has[VeloxCollectSet] =>
val newAggExpr =
- aggExpr.copy(aggregateFunction = VeloxCollectSet(s.child))
+ aggExpr.copy(aggregateFunction = VeloxCollectSet(s.child, getIgnoreNulls(s)))
Some(newAggExpr)
case aggExpr @ AggregateExpression(l: CollectList, _, _, _, _) if has[VeloxCollectList] =>
- val newAggExpr = aggExpr.copy(VeloxCollectList(l.child))
+ val newAggExpr = aggExpr.copy(VeloxCollectList(l.child, getIgnoreNulls(l)))
Some(newAggExpr)
Some(newAggExpr)
case _ => None
}
}
+ private def getIgnoreNulls(expr: Expression): Boolean = {
+ try {
+ val method = expr.getClass.getMethod("ignoreNulls")
+ method.invoke(expr).asInstanceOf[Boolean]
+ } catch {
+ case _: NoSuchMethodException => true // Default: ignore nulls
+ }
+ }
+
private def has[T <: Expression: ClassTag]: Boolean =
ExpressionMappings.expressionsMap.contains(classTag[T].runtimeClass)
}
diff --git a/cpp/velox/substrait/SubstraitToVeloxPlan.cc b/cpp/velox/substrait/SubstraitToVeloxPlan.cc
index e9a2417d92..c7ec69906d 100644
--- a/cpp/velox/substrait/SubstraitToVeloxPlan.cc
+++ b/cpp/velox/substrait/SubstraitToVeloxPlan.cc
@@ -284,14 +284,30 @@ std::string SubstraitToVeloxPlanConverter::toAggregationFunctionName(
// The merge_extract function is registered without suffix.
return functionName;
}
- // The merge_extract function must be registered with suffix based on result type.
- functionName += ("_" + companionFunctionSuffix(resultType));
- signatures = exec::getAggregateFunctionSignatures(functionName);
- VELOX_CHECK(
- signatures.has_value() && signatures.value().size() > 0,
+ // The merge_extract function must be registered with suffix based on
+ // result type. First try exact concrete type suffix.
+ auto suffixedName =
+ functionName + "_" + companionFunctionSuffix(resultType);
+ signatures = exec::getAggregateFunctionSignatures(suffixedName);
+ if (signatures.has_value() && signatures.value().size() > 0) {
+ return suffixedName;
+ }
+ // When companion functions are registered with generic type variables
+ // (e.g., "collect_set_merge_extract_array_T"), look up companion
+ // function names from the aggregate function registry.
+ auto companionSigs = exec::getCompanionFunctionSignatures(baseName);
+ if (companionSigs.has_value()) {
+ for (const auto& entry : companionSigs->mergeExtract) {
+ auto entrySigs =
+ exec::getAggregateFunctionSignatures(entry.functionName);
+ if (entrySigs.has_value() && entrySigs.value().size() > 0) {
+ return entry.functionName;
+ }
+ }
+ }
+ VELOX_FAIL(
"Cannot find function signature for {} in final aggregation step.",
- functionName);
- return functionName;
+ suffixedName);
}
case core::AggregationNode::Step::kIntermediate:
suffix = "_merge";
diff --git a/cpp/velox/utils/ConfigExtractor.cc b/cpp/velox/utils/ConfigExtractor.cc
index 613331bdbb..b1c3767644 100644
--- a/cpp/velox/utils/ConfigExtractor.cc
+++ b/cpp/velox/utils/ConfigExtractor.cc
@@ -229,7 +229,7 @@ std::shared_ptr<facebook::velox::config::ConfigBase> createHiveConnectorSessionC
configs[facebook::velox::connector::hive::HiveConfig::kFileColumnNamesReadAsLowerCaseSession] =
!conf->get<bool>(kCaseSensitive, false) ? "true" : "false";
configs[facebook::velox::connector::hive::HiveConfig::kPartitionPathAsLowerCaseSession] = "false";
- configs[facebook::velox::parquet::WriterOptions::kParquetSessionWriteTimestampUnit] = std::string("6");
+ configs[facebook::velox::parquet::WriterOptions::kParquetWriteTimestampUnit] = std::string("6");
configs[facebook::velox::connector::hive::HiveConfig::kReadTimestampUnitSession] = std::string("6");
configs[facebook::velox::connector::hive::HiveConfig::kMaxPartitionsPerWritersSession] =
conf->get<std::string>(kMaxPartitions, "10000");
diff --git a/ep/build-velox/src/get-velox.sh b/ep/build-velox/src/get-velox.sh
index 27e083bbf1..14afa2e581 100755
--- a/ep/build-velox/src/get-velox.sh
+++ b/ep/build-velox/src/get-velox.sh
@@ -18,8 +18,8 @@ set -exu
CURRENT_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
VELOX_REPO=https://github.com/IBM/velox.git
-VELOX_BRANCH=dft-2026_04_01-iceberg
-VELOX_ENHANCED_BRANCH=ibm-2026_04_01-fix
+VELOX_BRANCH=dft-2026_04_08
+VELOX_ENHANCED_BRANCH=ibm-2026_04_08
VELOX_HOME=""
RUN_SETUP_SCRIPT=ON
ENABLE_ENHANCED_FEATURES=OFF