yaooqinn commented on code in PR #12092:
URL: https://github.com/apache/gluten/pull/12092#discussion_r3247745988
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
Review Comment:
Addressed in 1bf1524ff5 (Batch 2 — cpp stats correctness: bool/NaN/non-flat
null counting). Added scanBoolMinMax helper for the BOOLEAN path, NaN scan now
continues to accrue real nullCount (early-return removed), and non-flat
encoding gets a dedicated isNullAt loop. New gtest case verifies the BOOLEAN
scan fix; RED-validated by stashing the production change. Thank you for the
careful review!
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
+ for (vector_size_t i = 0; i < size; ++i) {
+ if (nulls != nullptr && bits::isBitNull(nulls, i)) {
+ ++nullCnt;
+ continue;
+ }
+ T v = values[i];
+ if constexpr (std::is_floating_point_v<T>) {
+ if (std::isnan(v)) {
+ return false;
Review Comment:
Addressed in 1bf1524ff5 (Batch 2 — cpp stats correctness: bool/NaN/non-flat
null counting). Added scanBoolMinMax helper for the BOOLEAN path, NaN scan now
continues to accrue real nullCount (early-return removed), and non-flat
encoding gets a dedicated isNullAt loop. New gtest case verifies the NaN
nullCount fix; RED-validated by stashing the production change. Thank you for
the careful review!
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
+ for (vector_size_t i = 0; i < size; ++i) {
+ if (nulls != nullptr && bits::isBitNull(nulls, i)) {
+ ++nullCnt;
+ continue;
+ }
+ T v = values[i];
+ if constexpr (std::is_floating_point_v<T>) {
+ if (std::isnan(v)) {
+ return false;
+ }
+ }
+ if (!seen) {
+ tLo = v;
+ tHi = v;
+ seen = true;
+ } else {
+ if (v < tLo)
+ tLo = v;
+ if (v > tHi)
+ tHi = v;
+ }
+ }
+ return true;
+}
+
+} // namespace
+
+std::vector<ColumnStats>
VeloxColumnarBatchSerializer::computeStats(RowVectorPtr rowVector) {
+ std::vector<ColumnStats> result;
+ const auto numCols = rowVector->childrenSize();
+ result.resize(numCols);
+ for (column_index_t col = 0; col < numCols; ++col) {
+ auto& stats = result[col];
+ auto child = rowVector->childAt(col);
+ if (child == nullptr || !child->isFlatEncoding()) {
Review Comment:
Addressed in 1bf1524ff5 (Batch 2 — cpp stats correctness: bool/NaN/non-flat
null counting). Added scanBoolMinMax helper for the BOOLEAN path, NaN scan now
continues to accrue real nullCount (early-return removed), and non-flat
encoding gets a dedicated isNullAt loop. New gtest case verifies the non-flat
nullCount fix; RED-validated by stashing the production change. Thank you for
the careful review!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]