codeant-ai-for-open-source[bot] commented on code in PR #40190:
URL: https://github.com/apache/superset/pull/40190#discussion_r3252625012
##########
superset/common/utils/dataframe_utils.py:
##########
@@ -57,7 +57,7 @@ def df_metrics_to_num(df: pd.DataFrame, query_object:
QueryObject) -> None:
if dtype.type == np.object_ and col in query_object.metric_names:
# soft-convert a metric column to numeric
# will stay as strings if conversion fails
- df[col] = df[col].infer_objects()
+ df[col] = pd.to_numeric(df[col], errors="ignore")
Review Comment:
**Suggestion:** Using `errors="ignore"` can leave the entire metric column
as object/string when even one value is non-numeric, so downstream numeric
aggregations can still fail with the same `Could not convert string ... to
numeric` error. Use coercive conversion for metric columns (eg converting
invalid entries to `NaN`) so valid numeric strings are still parsed and the
column becomes numeric. [logic error]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Chart data requests using QueryContextProcessor may crash.
- ❌ Rolling/aggregate post-processing fails on ClickHouse metric strings.
- ⚠️ Contribution/other numeric features skip or mis-handle metrics.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. A chart request is executed through the chart data pipeline:
`QueryContextProcessor.get_payload()` at
`superset/common/query_context_processor.py:88-126` calls
`get_query_results(...)`, which
in turn invokes `_get_full()` at `superset/common/query_actions.py:153-172`,
obtaining a
`QueryContext` and calling `query_context.get_df_payload(query_obj, ...)` to
retrieve a
pandas DataFrame.
2. `QueryContext.get_df_payload()` delegates to
`QueryContextProcessor.get_df_payload()`
at `superset/common/query_context_processor.py:80-60` (around lines 82-60 in
that file),
which calls `self.get_query_result(query_obj)`; this delegates to the
datasource
implementation `get_query_result()` in `superset/models/helpers.py:107-120`,
which first
normalizes the DataFrame via `normalize_df(df, query_object)` at
`superset/models/helpers.py:32-105`.
3. Inside `normalize_df`, after datetime normalization, metrics are
normalized by calling
`dataframe_utils.df_metrics_to_num(df, query_object)` at
`superset/models/helpers.py:99-102`, which executes `df_metrics_to_num`
defined in
`superset/common/utils/dataframe_utils.py:15-21`. This function iterates
over columns and,
for metric columns with `dtype.type == np.object_`, runs `df[col] =
pd.to_numeric(df[col],
errors="ignore")` (line 60 in the PR). When the ClickHouse backend returns a
metric column
(e.g. `sum__metric`) as strings with at least one non-numeric value (for
example `"655"`
and `"total"`), `pd.to_numeric(..., errors="ignore")` leaves the entire
Series unchanged
as `object`, so the metric column remains a string-typed column instead of
becoming
numeric.
4. The resulting `df` with metric columns still of dtype `object` is then
used by
downstream numeric code paths. For instance, post-processing operations such
as rolling
windows or aggregations use numpy/pandas numeric functions, e.g. the rolling
operator in
`superset/utils/pandas_postprocessing/rolling.py:31-88` builds `df_rolling =
df.loc[:,
columns.keys()]` and then calls `df_rolling = getattr(df_rolling,
rolling_type)(**rolling_type_options)`. When `rolling_type` is a numeric
operation like
`"sum"` and the metric column is still an `object` Series with string
values, pandas'
internal `nanops` machinery (in `pandas/core/nanops.py`) attempts to convert
those strings
to numeric and raises `TypeError: Could not convert string '655' to
numeric`, causing the
chart request (triggered via `_get_full` / `get_query_results`) to fail.
This behavior
contrasts with the legacy `BaseViz.df_metrics_to_num` implementation in
`superset/viz.py:16-23`, which uses `pd.to_numeric(..., errors="coerce")`,
coercing
invalid metric entries to `NaN` while ensuring the overall column becomes
numeric.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fcommon%2Futils%2Fdataframe_utils.py%0A%2A%2ALine%3A%2A%2A%2060%3A60%0A%2A%2AComment%3A%2A%2A%0A%09%2ALogic%20Error%3A%20Using%20%60errors%3D%22ignore%22%60%20can%20leave%20the%20entire%20metric%20column%20as%20object%2Fstring%20when%20even%20one%20value%20is%20non-numeric%2C%20so%20downstream%20numeric%20aggregations%20can%20still%20fail%20with%20the%20same%20%60Could%20not%20convert%20string%20...%20to%20numeric%60%20error.%20Use%20coercive%20conversion%20for%20metric%20columns%20%28eg%20converting%20invalid%20entries%20to%20%60NaN%60%29%20so%20valid%20numeric%20strings%20are%20still%20parsed%20and%20the%20column%20becomes%20numeric.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%
20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fcommon%2Futils%2Fdataframe_utils.py%0A%2A%2ALine%3A%2A%2A%2060%3A60%0A%2A%2AComment%3A%2A%2A%0A%09%2ALogic%20Error%3A%20Using%20%60errors%3D%22ignore%22%60%20can%20leave%20the%20entire%20metric%20column%20as%20object%2Fstring%20when%20even%20one%20value%20is%20non-numeric%2C%20so%20downstream%20numeric%20aggregations%20can%20still%20fail%20with%20the%20same%20%60Could%20not%20convert%20string%20...%20to%20numeric%60%20error.%20Use%20coercive%20conversion%
20for%20metric%20columns%20%28eg%20converting%20invalid%20entries%20to%20%60NaN%60%29%20so%20valid%20numeric%20strings%20are%20still%20parsed%20and%20the%20column%20becomes%20numeric.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/common/utils/dataframe_utils.py
**Line:** 60:60
**Comment:**
*Logic Error: Using `errors="ignore"` can leave the entire metric
column as object/string when even one value is non-numeric, so downstream
numeric aggregations can still fail with the same `Could not convert string ...
to numeric` error. Use coercive conversion for metric columns (eg converting
invalid entries to `NaN`) so valid numeric strings are still parsed and the
column becomes numeric.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40190&comment_hash=0dbafd11a269f136a721e1ef4f04c36d483b600bfabdf462252c2152f105cddc&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40190&comment_hash=0dbafd11a269f136a721e1ef4f04c36d483b600bfabdf462252c2152f105cddc&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]