Re: [PR] feat(mcp): add BM25 tool search transform to reduce initial context size [superset]

via GitHub Fri, 13 Mar 2026 04:49:01 -0700


codeant-ai-for-open-source[bot] commented on code in PR #38562:
URL: https://github.com/apache/superset/pull/38562#discussion_r2930703935



##########
superset/mcp_service/flask_singleton.py:
##########
@@ -52,62 +51,45 @@
         logger.info("Reusing existing Flask app from app context for MCP 
service")
         # Use _get_current_object() to get the actual Flask app, not the 
LocalProxy
         app = current_app._get_current_object()
+    elif appbuilder_initialized:
+        # appbuilder is initialized but we have no app context. Calling
+        # create_app() here would invoke appbuilder.init_app() a second
+        # time with a *different* Flask app, overwriting shared internal
+        # state (views, security manager, etc.).  Fail loudly instead of
+        # silently corrupting the singleton.
+        raise RuntimeError(
+            "appbuilder is already initialized but no Flask app context is "
+            "available.  Cannot call create_app() as it would re-initialize "
+            "appbuilder with a different Flask app instance."
+        )

Review Comment:
   **Suggestion:** The new `RuntimeError` path makes `get_flask_app()` fail 
whenever `appbuilder` is initialized but no request/app context is active, 
which is exactly the situation some MCP paths hit before they push a context. 
This can cause MCP startup/tool calls to crash in-process even though a Flask 
app already exists. Reuse the already-bound app instance from `appbuilder` when 
available, and only raise if no concrete app can be resolved. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ MCP initialization fails in no-context in-process paths.
   - ⚠️ Tool auth/context setup can fail before execution.
   - ⚠️ Middleware factories depending on Flask app may not initialize.
   ```
   </details>
   
   ```suggestion
       elif appbuilder_initialized:
           # appbuilder is initialized but we have no app context. If the
           # concrete Flask app is already bound, reuse it directly.
           existing_app = getattr(appbuilder, "_app", None)
           if isinstance(existing_app, Flask):
               logger.info("Reusing existing Flask app from appbuilder for MCP 
service")
               app = existing_app
           else:
               raise RuntimeError(
                   "appbuilder is already initialized but no Flask app context 
is "
                   "available.  Cannot resolve an existing Flask app instance."
               )
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Start Superset so `appbuilder` is initialized 
(`superset/extensions/__init__.py:11`
   defines global `appbuilder`; `superset/mcp_service/flask_singleton.py:46` 
checks
   `appbuilder._session is not None`).
   
   2. Execute MCP setup from a code path without Flask app context (real 
callers use
   `get_flask_app()` outside request context, e.g. 
`superset/mcp_service/server.py:341-345`,
   `superset/mcp_service/auth.py:47-54` when `has_app_context()` is false).
   
   3. Import/load `superset.mcp_service.flask_singleton` via those callers; 
module-level
   initialization runs immediately 
(`superset/mcp_service/flask_singleton.py:35-65`).
   
   4. When `appbuilder_initialized=True` and no context, branch `elif 
appbuilder_initialized`
   raises `RuntimeError` (`superset/mcp_service/flask_singleton.py:54-64`), 
aborting MCP
   server/middleware/auth initialization instead of reusing existing app.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/flask_singleton.py
   **Line:** 54:64
   **Comment:**
        *Logic Error: The new `RuntimeError` path makes `get_flask_app()` fail 
whenever `appbuilder` is initialized but no request/app context is active, 
which is exactly the situation some MCP paths hit before they push a context. 
This can cause MCP startup/tool calls to crash in-process even though a Flask 
app already exists. Reuse the already-bound app instance from `appbuilder` when 
available, and only raise if no concrete app can be resolved.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=0a6a9ef7cfc88e24582948d0c1b9ec3e22b3ff722b493abbeb40f688f7c644e3&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=0a6a9ef7cfc88e24582948d0c1b9ec3e22b3ff722b493abbeb40f688f7c644e3&reaction=dislike'>👎</a>



##########
superset/mcp_service/system/schemas.py:
##########
@@ -125,33 +111,19 @@ class FeatureAvailability(BaseModel):
 
 
 class InstanceInfo(BaseModel):
-    instance_summary: InstanceSummary = Field(
-        ..., description="Instance summary information"
-    )
-    recent_activity: RecentActivity = Field(
-        ..., description="Recent activity information"
-    )
-    dashboard_breakdown: DashboardBreakdown = Field(
-        ..., description="Dashboard breakdown information"
-    )
-    database_breakdown: DatabaseBreakdown = Field(
-        ..., description="Database breakdown by type"
-    )
-    popular_content: PopularContent = Field(
-        ..., description="Popular content information"
-    )
+    instance_summary: InstanceSummary
+    recent_activity: RecentActivity
+    dashboard_breakdown: DashboardBreakdown
+    database_breakdown: DatabaseBreakdown
+    popular_content: PopularContent
     current_user: UserInfo | None = Field(
         None,
-        description="The authenticated user making the request. "
-        "Use current_user.id with created_by_fk filter to find your own 
assets.",
-    )
-    feature_availability: FeatureAvailability = Field(
-        ...,
         description=(
-            "Dynamic feature availability for the current user and deployment"
+            "Use current_user.id with created_by_fk filter to find your own 
assets."
         ),
     )
-    timestamp: datetime = Field(..., description="Response timestamp")
+    feature_availability: FeatureAvailability

Review Comment:
   **Suggestion:** `feature_availability` is now required, but 
`instance_metadata.py` builds `InstanceInfo` without registering that metric 
calculator. This causes a Pydantic validation failure during resource 
generation and forces the metadata endpoint into its error fallback path. Make 
this field default to an empty `FeatureAvailability` so callers that don't 
provide it still produce valid responses. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ instance://metadata returns fallback error payload only.
   - ❌ Available datasets/databases context missing from metadata resource.
   - ⚠️ Accessible menus not provided in metadata response.
   ```
   </details>
   
   ```suggestion
       feature_availability: FeatureAvailability = 
Field(default_factory=FeatureAvailability)
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Start MCP app import path; `superset/mcp_service/app.py:419-423` imports
   `system.resources`, and 
`superset/mcp_service/system/resources/__init__.py:21-24`
   registers `instance://metadata` via `instance_metadata.py`.
   
   2. Trigger resource read for `instance://metadata` (resource function
   `get_instance_metadata_resource` at
   `superset/mcp_service/system/resources/instance_metadata.py:36-38`).
   
   3. In that function, `InstanceInfoCore` is configured with 
`output_schema=InstanceInfo`
   (`instance_metadata.py:78`) but `metric_calculators` omits 
`feature_availability`
   (`instance_metadata.py:79-85`), then calls 
`instance_info_core.get_resource()`
   (`instance_metadata.py:95`).
   
   4. `InstanceInfoCore._generate_instance_info()` builds `response_data` and 
instantiates
   schema (`superset/mcp_service/mcp_core.py:121-129`); `InstanceInfo` requires
   `feature_availability` (`superset/mcp_service/system/schemas.py:125`), so 
Pydantic raises
   `ValidationError` (subclass of `ValueError`), caught by 
`instance_metadata.py:143-155`,
   returning fallback error JSON instead of metadata.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/system/schemas.py
   **Line:** 125:125
   **Comment:**
        *Logic Error: `feature_availability` is now required, but 
`instance_metadata.py` builds `InstanceInfo` without registering that metric 
calculator. This causes a Pydantic validation failure during resource 
generation and forces the metadata endpoint into its error fallback path. Make 
this field default to an empty `FeatureAvailability` so callers that don't 
provide it still produce valid responses.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=5f7cf68eead1fa1f4deaab812137a251e3658c8488ba3d94f9ca57aa6b31f468&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=5f7cf68eead1fa1f4deaab812137a251e3658c8488ba3d94f9ca57aa6b31f468&reaction=dislike'>👎</a>



##########
superset/mcp_service/dashboard/schemas.py:
##########
@@ -286,45 +286,31 @@ class GetDashboardInfoRequest(MetadataCacheControl):
 
 
 class DashboardInfo(BaseModel):
-    id: int | None = Field(None, description="Dashboard ID")
-    dashboard_title: str | None = Field(None, description="Dashboard title")
-    slug: str | None = Field(None, description="Dashboard slug")
-    description: str | None = Field(None, description="Dashboard description")
-    css: str | None = Field(None, description="Custom CSS for the dashboard")
-    certified_by: str | None = Field(None, description="Who certified the 
dashboard")
-    certification_details: str | None = Field(None, description="Certification 
details")
-    json_metadata: str | None = Field(
-        None, description="Dashboard metadata (JSON string)"
-    )
-    position_json: str | None = Field(None, description="Chart positions (JSON 
string)")
-    published: bool | None = Field(
-        None, description="Whether the dashboard is published"
-    )
-    is_managed_externally: bool | None = Field(
-        None, description="Whether managed externally"
-    )
-    external_url: str | None = Field(None, description="External URL")
-    created_on: str | datetime | None = Field(None, description="Creation 
timestamp")
-    changed_on: str | datetime | None = Field(
-        None, description="Last modification timestamp"
-    )
-    created_by: str | None = Field(None, description="Dashboard creator 
(username)")
-    changed_by: str | None = Field(None, description="Last modifier 
(username)")
-    uuid: str | None = Field(None, description="Dashboard UUID (converted to 
string)")
-    url: str | None = Field(None, description="Dashboard URL")
-    created_on_humanized: str | None = Field(
-        None, description="Humanized creation time"
-    )
-    changed_on_humanized: str | None = Field(
-        None, description="Humanized modification time"
-    )
-    chart_count: int = Field(0, description="Number of charts in the 
dashboard")
-    owners: List[UserInfo] = Field(default_factory=list, 
description="Dashboard owners")
-    tags: List[TagInfo] = Field(default_factory=list, description="Dashboard 
tags")
-    roles: List[RoleInfo] = Field(default_factory=list, description="Dashboard 
roles")
-    charts: List[ChartInfo] = Field(
-        default_factory=list, description="Dashboard charts"
-    )
+    id: int | None = None
+    dashboard_title: str | None = None
+    slug: str | None = None
+    description: str | None = None
+    css: str | None = None
+    certified_by: str | None = None
+    certification_details: str | None = None
+    json_metadata: str | None = None
+    position_json: str | None = None
+    published: bool | None = None
+    is_managed_externally: bool | None = None
+    external_url: str | None = None
+    created_on: str | datetime | None = None
+    changed_on: str | datetime | None = None
+    created_by: str | None = None
+    changed_by: str | None = None

Review Comment:
   **Suggestion:** `DashboardInfo` now omits `created_by_name` and 
`changed_by_name`, but `serialize_dashboard_object` and dashboard schema 
discovery still use those field names. Pydantic will silently ignore these 
unexpected inputs, so requested creator/modifier display-name columns disappear 
from list responses. Restore the name-based fields to keep serializer and 
schema contracts aligned. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ list_dashboards omits requested creator display-name columns.
   - ⚠️ get_schema advertises columns list_dashboards cannot return.
   - ❌ MCP client column-selection contract becomes inconsistent.
   ```
   </details>
   
   ```suggestion
       created_by: str | None = None
       changed_by: str | None = None
       created_by_name: str | None = None
       changed_by_name: str | None = None
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Start MCP server where tools are registered in 
`superset/mcp_service/app.py:402-407`
   (`list_dashboards`) and `superset/mcp_service/app.py:423-426` (`get_schema`).
   
   2. Call `get_schema(model_type=\"dashboard\")`; this executes
   `superset/mcp_service/system/tool/get_schema.py:126` and builds dashboard 
select columns
   via `select_columns=get_dashboard_columns()` at `.../get_schema.py:103`.
   
   3. `get_dashboard_columns()` includes 
`"created_by_name"`/`"changed_by_name"` in
   `superset/mcp_service/common/schema_discovery.py:293-295` and `:275-277`, so 
clients are
   told these are valid selectable columns.
   
   4. Call 
`list_dashboards(select_columns=[\"created_by_name\",\"changed_by_name\"])`;
   execution goes through 
`superset/mcp_service/dashboard/tool/list_dashboards.py:68`,
   computes columns from schema discovery at `:103`, and serializes with 
context filtering at
   `:149` and `:156`.
   
   5. During item serialization, `serialize_dashboard_object()` passes
   `created_by_name`/`changed_by_name` into `DashboardInfo` at
   `superset/mcp_service/dashboard/schemas.py:515` and `:512`, but 
`DashboardInfo` no longer
   defines these fields (`:303-304` only has `created_by`/`changed_by`), so 
requested
   display-name columns are dropped from output.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/dashboard/schemas.py
   **Line:** 303:304
   **Comment:**
        *Logic Error: `DashboardInfo` now omits `created_by_name` and 
`changed_by_name`, but `serialize_dashboard_object` and dashboard schema 
discovery still use those field names. Pydantic will silently ignore these 
unexpected inputs, so requested creator/modifier display-name columns disappear 
from list responses. Restore the name-based fields to keep serializer and 
schema contracts aligned.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=36ffa07b4a287e27404606d8aef5b334442903f5b815d4e79dd7bf361f4e547d&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38562&comment_hash=36ffa07b4a287e27404606d8aef5b334442903f5b815d4e79dd7bf361f4e547d&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(mcp): add BM25 tool search transform to reduce initial context size [superset]

Reply via email to