harshitsaini17 opened a new pull request, #19197:
URL: https://github.com/apache/datafusion/pull/19197
## Which issue does this PR close?
Closes #19147
Part of #19144 (EPIC: fix nullability report for spark expression)
## Rationale for this change
The `bit_count` UDF was using the default `return_type` implementation which
does not preserve nullability information. This causes:
1. **Incorrect schema inference** - non-nullable integer inputs are
incorrectly marked as producing nullable Int32 outputs
2. **Missed optimization opportunities** - the query optimizer cannot apply
nullability-based optimizations when metadata is incorrect
3. **Inconsistent behavior** - other similar functions preserve nullability,
leading to unexpected differences
The `bit_count` function counts the number of set bits (ones) in the binary
representation of a number and returns an Int32. The operation itself doesn't
introduce nullability - if the input is non-nullable, the output will always be
non-nullable. Therefore, the output nullability should match the input.
## What changes are included in this PR?
1. **Implemented `return_field_from_args`**: Creates a field with Int32 type
and the same nullability as the input field
2. **Updated `return_type`**: Now returns an error directing users to use
`return_field_from_args` instead (following DataFusion best practices)
3. **Added `FieldRef` and `internal_err` imports** to support the new
implementation
4. **Added nullability test**: Verifies that both nullable and non-nullable
inputs are handled correctly
## Are these changes tested?
Yes, this PR includes a new test `test_bit_count_nullability` that verifies:
- Non-nullable Int32 input produces non-nullable Int32 output
- Nullable Int32 input produces nullable Int32 output
- Data type is correctly set to Int32 in both cases
Test results:
```
running 1 test
test function::bitwise::bit_count::tests::test_bit_count_nullability ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
```
Additionally, all existing `bit_count` tests continue to pass, ensuring
backward compatibility.
## Are there any user-facing changes?
**Yes - Schema metadata improvement:**
Users will now see correct nullability information in the schema:
**Before (Bug):**
- Non-nullable Int32/Int64/Boolean input → nullable Int32 output (incorrect)
**After (Fixed):**
- Non-nullable Int32/Int64/Boolean input → non-nullable Int32 output
(correct)
- Nullable Int32/Int64/Boolean input → nullable Int32 output (correct)
This is a **bug fix** that corrects schema metadata only - it does not
change the actual computation or introduce any breaking changes to the API.
**Impact:**
- Query optimizers can now make better decisions based on accurate
nullability information
- Schema validation will be more accurate
- No changes to function behavior or output values
---
## Code Changes Summary
### Modified File: `datafusion/spark/src/function/bitwise/bit_count.rs`
#### 1. Added Imports
```rust
use arrow::datatypes::{
DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type,
UInt16Type, UInt32Type,
UInt64Type, UInt8Type,
};
use datafusion_common::{internal_err, plan_err, Result};
```
#### 2. Updated return_type Method
```rust
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
internal_err!("return_field_from_args should be used instead")
}
```
#### 3. Added return_field_from_args Implementation
```rust
fn return_field_from_args(
&self,
args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
use arrow::datatypes::Field;
// bit_count returns Int32 with the same nullability as the input
Ok(Arc::new(Field::new(
args.arg_fields[0].name(),
DataType::Int32,
args.arg_fields[0].is_nullable(),
)))
}
```
#### 4. Added Test
```rust
#[test]
fn test_bit_count_nullability() -> Result<()> {
use datafusion_expr::ReturnFieldArgs;
let bit_count = SparkBitCount::new();
// Test with non-nullable Int32 field
let non_nullable_field = Arc::new(Field::new("num", DataType::Int32,
false));
let result = bit_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&non_nullable_field)],
scalar_arguments: &[None],
})?;
// The result should not be nullable (same as input)
assert!(!result.is_nullable());
assert_eq!(result.data_type(), &DataType::Int32);
// Test with nullable Int32 field
let nullable_field = Arc::new(Field::new("num", DataType::Int32, true));
let result = bit_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&nullable_field)],
scalar_arguments: &[None],
})?;
// The result should be nullable (same as input)
assert!(result.is_nullable());
assert_eq!(result.data_type(), &DataType::Int32);
Ok(())
}
```
---
## Verification Steps
1. **Run the new test:**
```bash
cargo test -p datafusion-spark test_bit_count_nullability --lib
```
2. **Run all bit_count tests:**
```bash
cargo test -p datafusion-spark bit_count --lib
```
3. **Run clippy checks:**
```bash
cargo clippy -p datafusion-spark --all-targets -- -D warnings
```
All checks pass successfully!
---
## Related Issues
- Closes: #19147
- Part of EPIC: #19144 (fix nullability report for spark expression)
- Similar fixes:
- #19145 (shuffle function nullability)
- #19146 (bitmap_count function nullability)
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]