codeant-ai-for-open-source[bot] commented on code in PR #37744:
URL: https://github.com/apache/superset/pull/37744#discussion_r2772798085


##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -49,10 +49,13 @@ def histogram(
         groupby = []
 
     # drop empty values from the target column
-    df = df.dropna(subset=[column])
+    df = df.dropna(subset=[column]).copy()
     if df.empty:
         return df
 
+    # Make a copy to avoid SettingWithCopyWarning when modifying the DataFrame
+    df = df.copy()

Review Comment:
   **Suggestion:** The DataFrame is being copied twice in succession (once via 
`dropna(...).copy()` and again via `df.copy()`), which unnecessarily duplicates 
a potentially large DataFrame and impacts performance without changing 
behavior; the second copy and its comment should be removed or merged into the 
first. [possible bug]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Dashboard histogram loads may spike memory usage.
   - ⚠️ Chart editing can cause backend memory pressure.
   - ⚠️ Workers processing large charts consume extra RAM.
   - ⚠️ Increased GC/CPU from unnecessary data duplication.
   ```
   </details>
   
   ```suggestion
       # drop empty values from the target column and make a copy to avoid
       # SettingWithCopyWarning when modifying the DataFrame
       df = df.dropna(subset=[column]).copy()
       if df.empty:
           return df
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger histogram postprocessing by loading or editing a Histogram chart 
in Superset
   (per PR description). The postprocessing function is defined at
   superset/utils/pandas_postprocessing/histogram.py in function histogram() 
which contains
   the copy operations at lines 52 and 56-57.
   
   2. Provide a large DataFrame input to histogram(), for example create a 
DataFrame with
   millions of rows in a Python shell and call
   superset.utils.pandas_postprocessing.histogram.histogram(df, column="value",
   groupby=None). The code executes the dropna + copy at line 52 (df =
   df.dropna(subset=[column]).copy()) and then executes the second copy at 
lines 56-57 (df =
   df.copy()).
   
   3. Observe memory usage: both copy() calls create full in-memory duplicates. 
Monitor
   process memory (e.g., via top/psutil) during the call and note the transient 
∼2x memory
   allocation for the DataFrame compared to a single-copy implementation.
   
   4. Replace the two-copy block with the improved single-copy variant (as 
suggested) and
   re-run the same large-DataFrame call. The transient memory spike 
disappears/reduces,
   confirming the second copy was redundant and caused the overhead.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/pandas_postprocessing/histogram.py
   **Line:** 51:57
   **Comment:**
        *Possible Bug: The DataFrame is being copied twice in succession (once 
via `dropna(...).copy()` and again via `df.copy()`), which unnecessarily 
duplicates a potentially large DataFrame and impacts performance without 
changing behavior; the second copy and its comment should be removed or merged 
into the first.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to