codeant-ai-for-open-source[bot] commented on code in PR #37744:
URL: https://github.com/apache/superset/pull/37744#discussion_r2772798085
##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -49,10 +49,13 @@ def histogram(
groupby = []
# drop empty values from the target column
- df = df.dropna(subset=[column])
+ df = df.dropna(subset=[column]).copy()
if df.empty:
return df
+ # Make a copy to avoid SettingWithCopyWarning when modifying the DataFrame
+ df = df.copy()
Review Comment:
**Suggestion:** The DataFrame is being copied twice in succession (once via
`dropna(...).copy()` and again via `df.copy()`), which unnecessarily duplicates
a potentially large DataFrame and impacts performance without changing
behavior; the second copy and its comment should be removed or merged into the
first. [possible bug]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Dashboard histogram loads may spike memory usage.
- ⚠️ Chart editing can cause backend memory pressure.
- ⚠️ Workers processing large charts consume extra RAM.
- ⚠️ Increased GC/CPU from unnecessary data duplication.
```
</details>
```suggestion
# drop empty values from the target column and make a copy to avoid
# SettingWithCopyWarning when modifying the DataFrame
df = df.dropna(subset=[column]).copy()
if df.empty:
return df
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Trigger histogram postprocessing by loading or editing a Histogram chart
in Superset
(per PR description). The postprocessing function is defined at
superset/utils/pandas_postprocessing/histogram.py in function histogram()
which contains
the copy operations at lines 52 and 56-57.
2. Provide a large DataFrame input to histogram(), for example create a
DataFrame with
millions of rows in a Python shell and call
superset.utils.pandas_postprocessing.histogram.histogram(df, column="value",
groupby=None). The code executes the dropna + copy at line 52 (df =
df.dropna(subset=[column]).copy()) and then executes the second copy at
lines 56-57 (df =
df.copy()).
3. Observe memory usage: both copy() calls create full in-memory duplicates.
Monitor
process memory (e.g., via top/psutil) during the call and note the transient
∼2x memory
allocation for the DataFrame compared to a single-copy implementation.
4. Replace the two-copy block with the improved single-copy variant (as
suggested) and
re-run the same large-DataFrame call. The transient memory spike
disappears/reduces,
confirming the second copy was redundant and caused the overhead.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/pandas_postprocessing/histogram.py
**Line:** 51:57
**Comment:**
*Possible Bug: The DataFrame is being copied twice in succession (once
via `dropna(...).copy()` and again via `df.copy()`), which unnecessarily
duplicates a potentially large DataFrame and impacts performance without
changing behavior; the second copy and its comment should be removed or merged
into the first.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]