codeant-ai-for-open-source[bot] commented on code in PR #37740:
URL: https://github.com/apache/superset/pull/37740#discussion_r2772765498


##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -49,7 +49,7 @@ def histogram(
         groupby = []
 
     # drop empty values from the target column
-    df = df.dropna(subset=[column])
+    df = df.dropna(subset=[column]).copy()

Review Comment:
   **Suggestion:** Dropping NaN rows from the entire DataFrame before grouping 
removes groups that only contain null values in the target column, so those 
groups will disappear from the histogram output instead of being represented 
with zero counts, which breaks the documented "each row corresponds to a group" 
contract. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ superset/utils/pandas_postprocessing/histogram.histogram() omits groups.
   - ⚠️ Grouped histogram outputs have missing group rows.
   - ⚠️ Downstream consumers receive fewer rows than expected.
   ```
   </details>
   
   ```suggestion
       # NaN values in the target column are dropped later in hist_values(), so 
keep all rows here
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Create a DataFrame file and call the function `histogram()` defined in
   `superset/utils/pandas_postprocessing/histogram.py` (function signature at 
lines 24-31).
   
   2. Prepare input DataFrame where one group has only nulls, e.g.:
   
      - DataFrame with columns `group` and `value` where group "B" rows exist 
but `value` is
      NaN for all "B" rows.
   
   3. Call histogram(df, column="value", groupby=["group"]) -> execution 
reaches the line
   that pre-filters nulls at line 52: `df = df.dropna(subset=[column]).copy()`.
   
   4. Rows for group "B" are removed by that dropna at line 52, so later 
grouping (code paths
   at lines 81-86: the df.groupby(...) branch) never sees group "B" and no 
output row for "B"
   is produced. The helper `hist_values` (lines 71-74) which drops NaNs inside 
a group's
   Series is never given a chance to generate a zero-count vector for an absent 
group.
   
   Explanation: The current code intentionally removes rows with NaN at line 
52; that causes
   groups whose only rows had NaN in the target column to disappear instead of 
appearing with
   zero-count bins.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/pandas_postprocessing/histogram.py
   **Line:** 52:52
   **Comment:**
        *Logic Error: Dropping NaN rows from the entire DataFrame before 
grouping removes groups that only contain null values in the target column, so 
those groups will disappear from the histogram output instead of being 
represented with zero counts, which breaks the documented "each row corresponds 
to a group" contract.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to