codeant-ai-for-open-source[bot] commented on code in PR #40398:
URL: https://github.com/apache/superset/pull/40398#discussion_r3293888224


##########
superset/mcp_service/chart/validation/dataset_validator.py:
##########
@@ -139,6 +139,11 @@ def _validate_columns_exist(
         for col_ref in column_refs:
             if col_ref.saved_metric:
                 continue
+            if col_ref.sql_expression:
+                # SQL-expression metrics define their column references
+                # inline in the expression itself; the ColumnRef's `name`
+                # is just a label. Skip the dataset-column lookup.
+                continue

Review Comment:
   **Suggestion:** This skips dataset-column validation for any `ColumnRef` 
that carries `sql_expression`, but `ColumnRef` is also used for non-metric 
positions (like x-axis, dimensions, and group-bys). That lets invalid column 
names in those positions bypass Tier-1 validation and fail later at query time 
with a generic DB error instead of a clear validation error. Restrict this 
bypass to metric contexts only (or track whether each extracted ref came from a 
metric field before skipping). [incorrect condition logic]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ generate_chart accepts invalid dimension columns with sql_expression.
   - ❌ XY and mixed_timeseries X-axis errors occur at database.
   - ⚠️ Handlebars and table raw columns may bypass validation.
   - ⚠️ LLM callers receive opaque DB errors, not clear hints.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. A caller invokes the MCP `generate_chart` tool with a JSON request parsed 
by
   `SchemaValidator.validate_request()` in
   `superset/mcp_service/chart/validation/schema_validator.py:36-55`, using 
`chart_type:
   "xy"` and a config where a non-metric position carries `sql_expression`, 
e.g.:
   
      - `config.x = {"name": "bogus_dim", "sql_expression": 
"SUM(profit)/SUM(sales)"}` and a
      valid `y` metric list.
   
   2. `ValidationPipeline._validate_dataset()` in
   `superset/mcp_service/chart/validation/pipeline.py:17-28` calls
   `DatasetValidator.validate_against_dataset(config, dataset_id, 
dataset_context)`, which in
   turn calls `_extract_column_references(config)` in
   `superset/mcp_service/chart/validation/dataset_validator.py:153-199`. For an
   `XYChartConfig`, `_extract_column_references` appends:
   
      - `config.x` (the X-axis `ColumnRef`), all `config.y` metric 
`ColumnRef`s, and any
      `group_by` `ColumnRef`s into a single `column_refs` list without tracking 
which refs
      are metrics vs dimensions.
   
   3. Still in `DatasetValidator.validate_against_dataset`,
   `_validate_columns_exist(column_refs, dataset_context)` is invoked at
   `superset/mcp_service/chart/validation/dataset_validator.py:118-120`. In 
that method, the
   loop over `column_refs` at lines 24-32 checks:
   
      - `if col_ref.saved_metric: continue`
   
      - `if col_ref.sql_expression: ... continue` (lines 142-146 in the PR hunk)
   
      Because `config.x` has `sql_expression` set, this X-axis `ColumnRef` hits 
the `if
      col_ref.sql_expression:` branch and skips the dataset column lookup 
entirely, even
      though for X/group_by/dimension positions `name` is expected to be a real 
dataset
      column.
   
   4. Dataset validation therefore returns `True` with no 
`ChartGenerationError`, and the
   pipeline proceeds to build form_data via `map_xy_config(config, dataset_id)` 
in
   `superset/mcp_service/chart/chart_utils.py:704-79`. `map_xy_config` uses 
`config.x.name`
   and `config.group_by[i].name` directly as column names in form_data 
(`"x_axis":
   config.x.name`, `"groupby": [c.name ...]`) and ignores `sql_expression`. 
Superset then
   executes a query referencing `bogus_dim`, which does not exist in the 
dataset, causing a
   database-level "column not found" failure that surfaces as a generic DB 
engine error
   instead of the intended Tier-1 validation error that would have been raised 
if
   `_validate_columns_exist` had actually checked non-metric column names 
carrying
   `sql_expression`.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=4c608f3bd5654bb0b433d6334af4565e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=4c608f3bd5654bb0b433d6334af4565e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/chart/validation/dataset_validator.py
   **Line:** 142:146
   **Comment:**
        *Incorrect Condition Logic: This skips dataset-column validation for 
any `ColumnRef` that carries `sql_expression`, but `ColumnRef` is also used for 
non-metric positions (like x-axis, dimensions, and group-bys). That lets 
invalid column names in those positions bypass Tier-1 validation and fail later 
at query time with a generic DB error instead of a clear validation error. 
Restrict this bypass to metric contexts only (or track whether each extracted 
ref came from a metric field before skipping).
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40398&comment_hash=3746ecec817bf86cb9f72f371d69a598e59f1507ce4f9010fa5384f3b2732e36&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40398&comment_hash=3746ecec817bf86cb9f72f371d69a598e59f1507ce4f9010fa5384f3b2732e36&reaction=dislike'>👎</a>



##########
superset/mcp_service/chart/validation/schema_validator.py:
##########
@@ -392,18 +392,25 @@ def _pre_validate_big_number_config(
                 ],
                 error_code="INVALID_BIG_NUMBER_METRIC_TYPE",
             )
-        if not metric.get("aggregate") and not metric.get("saved_metric"):
+        if (
+            not metric.get("aggregate")
+            and not metric.get("saved_metric")
+            and not metric.get("sql_expression")
+        ):

Review Comment:
   **Suggestion:** The new pre-validation treats any truthy `sql_expression` as 
valid, so whitespace-only expressions (for example `"   "`) bypass this guard 
and proceed to query building, where they fail with database SQL syntax errors. 
Normalize/strip the value and reject blank expressions here so invalid SQL 
metrics fail fast with a clear validation error. [incorrect condition logic]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Big Number charts with blank SQL metrics crash at query.
   - ⚠️ MCP callers receive backend SQL errors, not validation feedback.
   - ⚠️ LLM-generated configs with whitespace SQL fail non-gracefully.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. A caller constructs MCP chart request data targeting the MCP validation 
pipeline (see
   `superset/mcp_service/chart/validation/pipeline.py:96-115`), with a Big 
Number config
   whose metric uses a whitespace-only SQL expression:
   
   2.
   
      ```python
   
      request_data = {
   
          "dataset_id": 1,
   
          "config": {
   
              "chart_type": "big_number",
   
              "metric": {
   
                  "name": "bad_metric",
   
                  "sql_expression": " ", # only spaces
   
              },
   
          },
   
      }
   
      ```
   
   3. `ValidationPipeline.validate_request_with_warnings()` 
(pipeline.py:96-115) calls
   `SchemaValidator.validate_request(request_data)` 
(`schema_validator.py:8-31`), which in
   turn calls `_pre_validate(request_data)` and then 
`_pre_validate_chart_type(chart_type,
   config)` (`schema_validator.py:7-20). For `chart_type == "big_number"`,
   `_pre_validate_big_number_config` is invoked (`schema_validator.py:22-25).
   
   4. Inside `_pre_validate_big_number_config` at lines 395-399, the guard:
   
      
   
      ```python
   
      if (
   
          not metric.get("aggregate")
   
          and not metric.get("saved_metric")
   
          and not metric.get("sql_expression")
   
      ):
   
          ...
   
      ```
   
      
   
      treats `" "` as truthy, so the guard passes and no validation error is 
returned, even
      though the SQL expression is effectively blank.
   
   5. `SchemaValidator.validate_request` then constructs a 
`GenerateChartRequest` whose
   `config.metric` is a `ColumnRef` with `sql_expression=" "`
   (`superset/mcp_service/chart/schemas.py:36-45). `ColumnRef.is_metric` 
(`schemas.py:49-51`)
   uses `bool(self.sql_expression)`, so this whitespace value still marks the 
column as a
   metric.
   
   6. When the chart is executed, `map_big_number_config` (`chart_utils.py:8-20 
around
   815-822) calls `create_metric_object(config.metric)` 
(`chart_utils.py:47-73), which builds
   a metric dict with `"expressionType": "SQL"` and `"sqlExpression": " "`. 
This is sent to
   Superset's query engine as a SQL metric, resulting in an invalid/empty SQL 
expression and
   a database-level SQL syntax error instead of a clear MCP validation error.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=cc0541817eca469abfdecb842e10001b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=cc0541817eca469abfdecb842e10001b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/chart/validation/schema_validator.py
   **Line:** 395:399
   **Comment:**
        *Incorrect Condition Logic: The new pre-validation treats any truthy 
`sql_expression` as valid, so whitespace-only expressions (for example `"   "`) 
bypass this guard and proceed to query building, where they fail with database 
SQL syntax errors. Normalize/strip the value and reject blank expressions here 
so invalid SQL metrics fail fast with a clear validation error.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40398&comment_hash=38849d1dcc6b58a79da010039836228033e7346fa880874d44fb30009683386c&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40398&comment_hash=38849d1dcc6b58a79da010039836228033e7346fa880874d44fb30009683386c&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to