harshitsaini17 opened a new pull request, #19197:
URL: https://github.com/apache/datafusion/pull/19197

   ## Which issue does this PR close?
   
   Closes #19147
   
   Part of #19144 (EPIC: fix nullability report for spark expression)
   
   ## Rationale for this change
   
   The `bit_count` UDF was using the default `return_type` implementation which 
does not preserve nullability information. This causes:
   
   1. **Incorrect schema inference** - non-nullable integer inputs are 
incorrectly marked as producing nullable Int32 outputs
   2. **Missed optimization opportunities** - the query optimizer cannot apply 
nullability-based optimizations when metadata is incorrect
   3. **Inconsistent behavior** - other similar functions preserve nullability, 
leading to unexpected differences
   
   The `bit_count` function counts the number of set bits (ones) in the binary 
representation of a number and returns an Int32. The operation itself doesn't 
introduce nullability - if the input is non-nullable, the output will always be 
non-nullable. Therefore, the output nullability should match the input.
   
   ## What changes are included in this PR?
   
   1. **Implemented `return_field_from_args`**: Creates a field with Int32 type 
and the same nullability as the input field
   2. **Updated `return_type`**: Now returns an error directing users to use 
`return_field_from_args` instead (following DataFusion best practices)
   3. **Added `FieldRef` and `internal_err` imports** to support the new 
implementation
   4. **Added nullability test**: Verifies that both nullable and non-nullable 
inputs are handled correctly
   
   ## Are these changes tested?
   
   Yes, this PR includes a new test `test_bit_count_nullability` that verifies:
   - Non-nullable Int32 input produces non-nullable Int32 output
   - Nullable Int32 input produces nullable Int32 output
   - Data type is correctly set to Int32 in both cases
   
   Test results:
   ```
   running 1 test
   test function::bitwise::bit_count::tests::test_bit_count_nullability ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
   ```
   
   Additionally, all existing `bit_count` tests continue to pass, ensuring 
backward compatibility.
   
   ## Are there any user-facing changes?
   
   **Yes - Schema metadata improvement:**
   
   Users will now see correct nullability information in the schema:
   
   **Before (Bug):**
   - Non-nullable Int32/Int64/Boolean input → nullable Int32 output (incorrect)
   
   **After (Fixed):**
   - Non-nullable Int32/Int64/Boolean input → non-nullable Int32 output 
(correct)
   - Nullable Int32/Int64/Boolean input → nullable Int32 output (correct)
   
   This is a **bug fix** that corrects schema metadata only - it does not 
change the actual computation or introduce any breaking changes to the API.
   
   **Impact:**
   - Query optimizers can now make better decisions based on accurate 
nullability information
   - Schema validation will be more accurate
   - No changes to function behavior or output values
   
   ---
   
   ## Code Changes Summary
   
   ### Modified File: `datafusion/spark/src/function/bitwise/bit_count.rs`
   
   #### 1. Added Imports
   ```rust
   use arrow::datatypes::{
       DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type, 
UInt16Type, UInt32Type,
       UInt64Type, UInt8Type,
   };
   use datafusion_common::{internal_err, plan_err, Result};
   ```
   
   #### 2. Updated return_type Method
   ```rust
   fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
       internal_err!("return_field_from_args should be used instead")
   }
   ```
   
   #### 3. Added return_field_from_args Implementation
   ```rust
   fn return_field_from_args(
       &self,
       args: datafusion_expr::ReturnFieldArgs,
   ) -> Result<FieldRef> {
       use arrow::datatypes::Field;
       // bit_count returns Int32 with the same nullability as the input
       Ok(Arc::new(Field::new(
           args.arg_fields[0].name(),
           DataType::Int32,
           args.arg_fields[0].is_nullable(),
       )))
   }
   ```
   
   #### 4. Added Test
   ```rust
   #[test]
   fn test_bit_count_nullability() -> Result<()> {
       use datafusion_expr::ReturnFieldArgs;
   
       let bit_count = SparkBitCount::new();
   
       // Test with non-nullable Int32 field
       let non_nullable_field = Arc::new(Field::new("num", DataType::Int32, 
false));
   
       let result = bit_count.return_field_from_args(ReturnFieldArgs {
           arg_fields: &[Arc::clone(&non_nullable_field)],
           scalar_arguments: &[None],
       })?;
   
       // The result should not be nullable (same as input)
       assert!(!result.is_nullable());
       assert_eq!(result.data_type(), &DataType::Int32);
   
       // Test with nullable Int32 field
       let nullable_field = Arc::new(Field::new("num", DataType::Int32, true));
   
       let result = bit_count.return_field_from_args(ReturnFieldArgs {
           arg_fields: &[Arc::clone(&nullable_field)],
           scalar_arguments: &[None],
       })?;
   
       // The result should be nullable (same as input)
       assert!(result.is_nullable());
       assert_eq!(result.data_type(), &DataType::Int32);
   
       Ok(())
   }
   ```
   
   ---
   
   ## Verification Steps
   
   1. **Run the new test:**
      ```bash
      cargo test -p datafusion-spark test_bit_count_nullability --lib
      ```
   
   2. **Run all bit_count tests:**
      ```bash
      cargo test -p datafusion-spark bit_count --lib
      ```
   
   3. **Run clippy checks:**
      ```bash
      cargo clippy -p datafusion-spark --all-targets -- -D warnings
      ```
   
   All checks pass successfully!
   
   ---
   
   ## Related Issues
   
   - Closes: #19147
   - Part of EPIC: #19144 (fix nullability report for spark expression)
   - Similar fixes: 
     - #19145 (shuffle function nullability)
     - #19146 (bitmap_count function nullability)
   ---


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to