gkneighb commented on code in PR #40398:
URL: https://github.com/apache/superset/pull/40398#discussion_r3294383299


##########
superset/mcp_service/chart/schemas.py:
##########
@@ -700,19 +700,41 @@ class ColumnRef(BaseModel):
         "(use get_dataset_info to see available metrics). "
         "When set, 'aggregate' is ignored.",
     )
+    sql_expression: str | None = Field(
+        None,
+        max_length=2000,
+        description=(
+            "Inline SQL expression for a calculated metric, e.g. "
+            "'SUM(profit) / NULLIF(SUM(sales), 0)'. Use for ratios, "
+            "growth rates, conversion percentages, and other formulas "
+            "that don't reduce to a single column + aggregate. When set, "
+            "'aggregate' is ignored; 'name' is used as the metric label."
+        ),

Review Comment:
   Thanks for flagging this — wanted to address the suggestion directly because 
applying `sanitize_user_input(..., check_sql_keywords=True)` to 
`sql_expression` would break the field by design.
   
   The `name` field is sanitized with `check_sql_keywords=True` because `name` 
is supposed to be a bare column identifier — letting SQL keywords through there 
is genuinely an injection vector. `sql_expression`, however, is *expected* to 
be a SQL fragment. Examples of valid values are `SUM(profit) / 
NULLIF(SUM(sales), 0)`, `AVG(latency)`, `COUNT(DISTINCT user_id) / 
NULLIF(COUNT(*), 0)`. Running those through `check_sql_keywords=True` would 
reject every keyword the field exists to carry — `SUM`, `AVG`, `COUNT`, 
`NULLIF`, etc. — so the field becomes unusable.
   
   The mitigation model for arbitrary-SQL metrics in Superset isn't input-side 
keyword filtering; it's the same model the REST API has used for adhoc 
SIMPLE/SQL metrics since they were introduced:
   
   1. **Chart-write RBAC**: only users with `can_write` on `Chart` (and access 
to the referenced dataset) can set the expression. Enforced at the MCP tool 
boundary via `@tool(class_permission_name="Chart", 
method_permission_name="write")`.
   2. **Dataset access control**: the expression executes against the dataset's 
database, which has its own role-mapped access — a user can only land SQL 
against a warehouse they're already authorized for.
   3. **Read-only data warehouse**: chart queries don't reach the Superset 
metadata DB; they only hit the data-source database, where the configured role 
is typically read-only.
   
   This is identical to what already happens when a chart is created via `POST 
/api/v1/chart/` with `params.metric = {expressionType: "SQL", sqlExpression: 
"..."}` — the existing REST path doesn't keyword-sanitize either, for the same 
reason.
   
   What this PR *does* add to harden the field:
   - `max_length=2000` cap on the field
   - `field_validator` that strips whitespace and rejects blank/whitespace-only 
values (so callers can't slip through trivially malformed input)
   - Mutual-exclusion with `saved_metric` so a single ref can't carry both paths
   
   If the underlying concern is about who can pass a SQL fragment in the first 
place, that's worth checking — but it's a permission question 
(`Chart.can_write` + dataset access), not a sanitization one. Happy to dig 
deeper if there's a specific attack scenario I'm missing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to