codeant-ai-for-open-source[bot] commented on code in PR #40422: URL: https://github.com/apache/superset/pull/40422#discussion_r3299169766
########## superset/mcp_service/dataset/tool/create_dataset.py: ########## @@ -0,0 +1,145 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import logging +from typing import Any + +from fastmcp import Context +from superset_core.mcp.decorators import tool, ToolAnnotations + +from superset.daos.dataset import DatasetDAO +from superset.exceptions import SupersetSecurityException +from superset.extensions import event_logger, security_manager +from superset.mcp_service.dataset.schemas import ( + CreateDatasetRequest, + DatasetError, + DatasetInfo, + serialize_dataset_object, +) +from superset.sql.parse import Table + +logger = logging.getLogger(__name__) + + +@tool( + tags=["mutate"], + class_permission_name="Dataset", + method_permission_name="write", + annotations=ToolAnnotations( + title="Register a physical table as a dataset", + readOnlyHint=False, + destructiveHint=False, + ), +) +async def create_dataset( + request: CreateDatasetRequest, ctx: Context +) -> DatasetInfo | DatasetError: + """Register an existing physical table as a Superset dataset. + + Use this tool when the user wants to make a physical database table available + for charting or exploration. The table must already exist in the target database. + + Workflow: + 1. Call list_databases to find the correct database_id + 2. Call this tool with database_id, schema, and table_name + 3. Use the returned id as dataset_id in generate_chart or generate_explore_link + + Returns DatasetInfo on success or DatasetError with error_type on failure. + """ + await ctx.info( + "Registering physical table as dataset: database_id=%s, schema=%r, table=%r" + % (request.database_id, request.schema_name, request.table_name) + ) + + # Verify the database exists and the caller has table-level access before + # registering. Mirrors the check in DatabaseRestApi.table_metadata(). + database = DatasetDAO.get_database_by_id(request.database_id) + if database is None: + await ctx.warning("Database %s not found" % request.database_id) + return DatasetError.create( + error=f"Database {request.database_id} not found", + error_type="DatabaseNotFoundError", + ) + + table = Table(request.table_name, request.schema_name, request.catalog) + try: + security_manager.raise_for_access(database=database, table=table) + except SupersetSecurityException as exc: + await ctx.warning("Access denied for table %r: %s" % (str(table), str(exc))) + return DatasetError.create(error=str(exc), error_type="AccessDeniedError") Review Comment: **Suggestion:** The pre-access check only handles `SupersetSecurityException`, so any other runtime failure from `security_manager.raise_for_access` (for example DB/session errors) will escape and crash the tool instead of returning a structured `DatasetError`. Wrap this check in broader error handling and convert unexpected failures into `InternalError` responses to keep the tool contract consistent. [logic error] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ❌ MCP create_dataset tool can crash on access check failures. - ⚠️ MCP clients see transport errors, not DatasetError payloads. - ⚠️ Dataset registration via MCP becomes brittle under DB issues. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Invoke the MCP tool entrypoint `create_dataset` (decorated with `@tool`) defined in `superset/mcp_service/dataset/tool/create_dataset.py:38-50`, via the MCP server with a `CreateDatasetRequest` that points to a BigQuery database (`database_id`) whose credentials or network access are invalid. 2. In `create_dataset`, the code at `create_dataset.py:70-78` retrieves the database via `DatasetDAO.get_database_by_id` and builds a `Table` object, then calls `security_manager.raise_for_access(database=database, table=table)` at line 80 to perform an access check before creating the dataset. 3. Inside `security_manager.raise_for_access` in `superset/security/manager.py:2854-2927`, when `database` and `table` are set, it computes `default_catalog = database.get_default_catalog()` (line 64). For BigQuery, `Database.get_default_catalog()` delegates to `BigQueryEngineSpec.get_default_catalog` in `superset/db_engine_specs/bigquery.py:595-614`, which opens a SQLAlchemy engine via `database.get_sqla_engine()` (line 14) and may raise a `SQLAlchemyError` if the connection or credentials are invalid. 4. That `SQLAlchemyError` (or any non-`SupersetSecurityException`) propagates back to `create_dataset`: it is not caught by the `except SupersetSecurityException` block at `create_dataset.py:81-83`, and it occurs before the `try:` block at line 85 that wraps `CreateDatasetCommand`. As a result, `create_dataset` exits with an unhandled exception instead of returning a `DatasetError`, and the MCP client experiences a generic tool/transport failure rather than a structured error response. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=13565961502b4425aeb299fa1f5eea5a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=13565961502b4425aeb299fa1f5eea5a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/tool/create_dataset.py **Line:** 78:83 **Comment:** *Logic Error: The pre-access check only handles `SupersetSecurityException`, so any other runtime failure from `security_manager.raise_for_access` (for example DB/session errors) will escape and crash the tool instead of returning a structured `DatasetError`. Wrap this check in broader error handling and convert unexpected failures into `InternalError` responses to keep the tool contract consistent. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40422&comment_hash=3135040fd4c0e5358ed6868df65caa7fe71912a4bbb77eaa6991c6a4f88173bb&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40422&comment_hash=3135040fd4c0e5358ed6868df65caa7fe71912a4bbb77eaa6991c6a4f88173bb&reaction=dislike'>👎</a> ########## superset/mcp_service/dataset/tool/create_dataset.py: ########## @@ -0,0 +1,145 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import logging +from typing import Any + +from fastmcp import Context +from superset_core.mcp.decorators import tool, ToolAnnotations + +from superset.daos.dataset import DatasetDAO +from superset.exceptions import SupersetSecurityException +from superset.extensions import event_logger, security_manager +from superset.mcp_service.dataset.schemas import ( + CreateDatasetRequest, + DatasetError, + DatasetInfo, + serialize_dataset_object, +) +from superset.sql.parse import Table + +logger = logging.getLogger(__name__) + + +@tool( + tags=["mutate"], + class_permission_name="Dataset", + method_permission_name="write", + annotations=ToolAnnotations( + title="Register a physical table as a dataset", + readOnlyHint=False, + destructiveHint=False, + ), +) +async def create_dataset( + request: CreateDatasetRequest, ctx: Context +) -> DatasetInfo | DatasetError: + """Register an existing physical table as a Superset dataset. + + Use this tool when the user wants to make a physical database table available + for charting or exploration. The table must already exist in the target database. + + Workflow: + 1. Call list_databases to find the correct database_id + 2. Call this tool with database_id, schema, and table_name + 3. Use the returned id as dataset_id in generate_chart or generate_explore_link + + Returns DatasetInfo on success or DatasetError with error_type on failure. + """ + await ctx.info( + "Registering physical table as dataset: database_id=%s, schema=%r, table=%r" + % (request.database_id, request.schema_name, request.table_name) + ) + + # Verify the database exists and the caller has table-level access before + # registering. Mirrors the check in DatabaseRestApi.table_metadata(). + database = DatasetDAO.get_database_by_id(request.database_id) + if database is None: + await ctx.warning("Database %s not found" % request.database_id) + return DatasetError.create( + error=f"Database {request.database_id} not found", + error_type="DatabaseNotFoundError", + ) + + table = Table(request.table_name, request.schema_name, request.catalog) + try: + security_manager.raise_for_access(database=database, table=table) + except SupersetSecurityException as exc: + await ctx.warning("Access denied for table %r: %s" % (str(table), str(exc))) + return DatasetError.create(error=str(exc), error_type="AccessDeniedError") + + try: + from superset.commands.dataset.create import CreateDatasetCommand + from superset.commands.dataset.exceptions import ( + DatasetCreateFailedError, + DatasetExistsValidationError, + DatasetInvalidError, + TableNotFoundValidationError, + ) + + dataset_properties: dict[str, Any] = { + k: v + for k, v in { + "database": request.database_id, + "table_name": request.table_name, + "schema": request.schema_name, + "catalog": request.catalog, + "owners": request.owners, + }.items() + if v is not None + } + + with event_logger.log_context(action="mcp.create_dataset.create"): + dataset = CreateDatasetCommand(dataset_properties).run() + + result = serialize_dataset_object(dataset) + if result is None: + return DatasetError.create( + error="Dataset was created but could not be serialized", + error_type="InternalError", + ) + + await ctx.info( + "Dataset registered: id=%s, table=%r" % (dataset.id, dataset.table_name) + ) + return result + + except DatasetInvalidError as exc: + # CreateDatasetCommand.validate() aggregates individual validation errors + # into DatasetInvalidError; use the public get_list_classnames() helper + # to identify which specific validation errors are present. + classnames = exc.get_list_classnames() + if DatasetExistsValidationError.__name__ in classnames: + await ctx.warning("Dataset already exists: %s" % str(exc)) + return DatasetError.create(error=str(exc), error_type="DatasetExistsError") + if TableNotFoundValidationError.__name__ in classnames: + await ctx.warning("Table not found: %s" % str(exc)) + return DatasetError.create(error=str(exc), error_type="TableNotFoundError") + messages = exc.normalized_messages() + await ctx.warning("Dataset validation failed: %s" % (messages,)) + return DatasetError.create(error=str(messages), error_type="ValidationError") + except DatasetCreateFailedError as exc: + await ctx.error("Dataset creation failed: %s" % str(exc)) + return DatasetError.create(error=str(exc), error_type="CreateFailedError") + except Exception as exc: + await ctx.error( + "Unexpected error registering dataset: %s: %s" + % (type(exc).__name__, str(exc)) + ) + return DatasetError.create( + error=f"Failed to create dataset: {exc}", error_type="InternalError" + ) Review Comment: **Suggestion:** The generic error response includes raw exception text in the returned payload, which can leak internal implementation details (database/stack-derived messages) to MCP clients. Return a sanitized generic message to clients and keep full exception details only in server logs. [security] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ⚠️ MCP create_dataset responses expose raw internal exception text. - ⚠️ Clients can infer module names and internal code structure. - ⚠️ Database or catalog identifiers may leak via driver errors. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Call the MCP tool `create_dataset` defined in `superset/mcp_service/dataset/tool/create_dataset.py:48-50` via the MCP server, supplying a `CreateDatasetRequest` that targets any database and table (e.g., a valid `database_id`, `schema_name`, and `table_name`). 2. During execution of the `try:` block starting at `create_dataset.py:85`, an unexpected runtime error occurs—for example, a deployment/environment issue where the import `from superset.commands.dataset.exceptions import ...` at lines 87-92 fails with `ImportError: No module named 'superset.commands.dataset.exceptions'`. 3. This `ImportError` is not a `DatasetInvalidError` or `DatasetCreateFailedError`, so it bypasses the specific handlers at `create_dataset.py:121-135` and is caught by the generic `except Exception as exc:` block at lines 138-145. That block logs the error via `await ctx.error("Unexpected error registering dataset: %s: %s" % (type(exc).__name__, str(exc)))` and then constructs a `DatasetError` at lines 143-144 using `error=f"Failed to create dataset: {exc}"`. 4. The MCP server returns this `DatasetError` to the client, exposing the raw exception string (e.g., `"Failed to create dataset: No module named 'superset.commands.dataset.exceptions'"`) in the `error` field. Similar unexpected failures deeper in `CreateDatasetCommand.run()` (`superset/commands/dataset/create.py:46-52`) would likewise surface driver or stack-derived messages to MCP clients, leaking internal module names, paths, or database details instead of a generic sanitized error message. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=c053075164504ab0a80b725dc18d863c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=c053075164504ab0a80b725dc18d863c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/tool/create_dataset.py **Line:** 143:145 **Comment:** *Security: The generic error response includes raw exception text in the returned payload, which can leak internal implementation details (database/stack-derived messages) to MCP clients. Return a sanitized generic message to clients and keep full exception details only in server logs. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40422&comment_hash=be043af92d30375dcad0457dcaf50cf046537bc6f03f0d5350788840763b2143&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40422&comment_hash=be043af92d30375dcad0457dcaf50cf046537bc6f03f0d5350788840763b2143&reaction=dislike'>👎</a> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
