codeant-ai-for-open-source[bot] commented on code in PR #37731:
URL: https://github.com/apache/superset/pull/37731#discussion_r2772674342


##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -48,6 +48,9 @@ def histogram(
     if groupby is None:
         groupby = []
 
+    # Create an explicit copy to avoid SettingWithCopyWarning
+    df = df.copy()

Review Comment:
   **Suggestion:** Creating a full deep copy of the entire DataFrame is expensive in both
   time and memory for large inputs and can lead to avoidable memory pressure or even a
   MemoryError under high load. A shallow copy is sufficient to break the chained-assignment
   relationship and prevent SettingWithCopyWarning while avoiding duplicating the underlying
   data. [possible bug]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Histogram postprocessing may OOM for large query results.
   - ⚠️ Backend worker memory pressure during histogram calculation.
   - ⚠️ Dashboard histogram tiles risk failing under large datasets.
   ```
   </details>
   
   ```suggestion
       # Create a shallow copy to avoid SettingWithCopyWarning without duplicating all data
       df = df.copy(deep=False)
   ```
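   As a rough illustration (not part of the Superset code or this PR; the column name and sizes are arbitrary), the sketch below shows that a shallow copy is a distinct DataFrame object that still shares the underlying buffers, whereas a deep copy duplicates them:

   ```python
   import numpy as np
   import pandas as pd

   # Arbitrary example frame, large enough for the copy cost to matter.
   df = pd.DataFrame({"value": np.random.rand(1_000_000)})

   deep = df.copy()               # duplicates the underlying data buffers
   shallow = df.copy(deep=False)  # new DataFrame object, shared buffers

   # The shallow copy shares memory with the original; the deep copy does not.
   print(np.shares_memory(df["value"].to_numpy(), shallow["value"].to_numpy()))  # True
   print(np.shares_memory(df["value"].to_numpy(), deep["value"].to_numpy()))     # False

   # Both copies are distinct objects from df, which is what breaks the
   # chained-assignment relationship behind SettingWithCopyWarning.
   print(deep is df, shallow is df)  # False False
   ```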
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In a Superset environment or Python REPL, import the function:

      from superset.utils.pandas_postprocessing.histogram import histogram

      (the implementation lives at superset/utils/pandas_postprocessing/histogram.py
      and the copy call is at lines 51-52).

   2. Construct a large pandas DataFrame in the same process, e.g.:

      import numpy, pandas
      df = pandas.DataFrame({"value": numpy.random.rand(10_000_000)})  # created in memory prior to calling histogram

   3. Call histogram on that DataFrame:

      histogram(df, column="value", groupby=None)

      Execution will enter superset/utils/pandas_postprocessing/histogram.py and hit
      the df.copy() call at lines 51-52, allocating a full duplicate of the underlying data.

   4. Observe the effect via process monitoring (top/psutil): memory usage spikes,
      roughly doubling the DataFrame's memory footprint, which can lead to MemoryError
      or a worker OOM and fail the histogram postprocessing step.
   ```
   </details>
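   The reproduction steps above can be condensed into a single script (a sketch only: the frame size is illustrative, and psutil, mentioned in step 4, is used just to read RSS):

   ```python
   import os

   import numpy as np
   import pandas as pd
   import psutil

   from superset.utils.pandas_postprocessing.histogram import histogram

   # ~80 MB of float64 data created in-process, mirroring step 2.
   df = pd.DataFrame({"value": np.random.rand(10_000_000)})

   proc = psutil.Process(os.getpid())
   print(f"RSS before histogram(): {proc.memory_info().rss / 1e6:.0f} MB")

   # With df.copy() inside histogram(), the call allocates a full duplicate of the
   # data for its duration, so peak RSS roughly doubles the DataFrame's footprint.
   result = histogram(df, column="value", groupby=None)

   print(f"RSS after histogram():  {proc.memory_info().rss / 1e6:.0f} MB")
   print(result.head())
   ```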
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/pandas_postprocessing/histogram.py
   **Line:** 51:52
   **Comment:**
        *Possible Bug: Creating a full deep copy of the entire DataFrame is expensive in both
   time and memory for large inputs and can lead to avoidable memory pressure or even a
   MemoryError under high load. A shallow copy is sufficient to break the chained-assignment
   relationship and prevent SettingWithCopyWarning while avoiding duplicating the underlying
   data.
   
   Validate the correctness of the flagged issue. If correct, how can I resolve this?
   If you propose a fix, implement it and keep it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

