Re: [PR] feat(mcp): add BM25 tool search transform to reduce initial context size [superset]

via GitHub Fri, 13 Mar 2026 03:33:55 -0700


aminghadersohi commented on code in PR #38562:
URL: https://github.com/apache/superset/pull/38562#discussion_r2930363304



##########
superset/mcp_service/flask_singleton.py:
##########
@@ -53,61 +52,33 @@
         # Use _get_current_object() to get the actual Flask app, not the LocalProxy
         app = current_app._get_current_object()
     else:
-        # Either appbuilder is not initialized (standalone MCP server),
-        # or appbuilder is initialized but we're not in an app context
-        # (edge case - should rarely happen). In both cases, create a minimal app.
+        # Standalone MCP server — Superset models are deeply coupled to
+        # appbuilder, security_manager, event_logger, encrypted_field_factory,
+        # etc. so we use create_app() for full initialization rather than
+        # trying to init a minimal subset (which leads to cascading failures).
         #
-        # We avoid calling create_app() which would run full FAB initialization
-        # and could corrupt the shared appbuilder singleton if main app starts.
-        from superset.app import SupersetApp
+        # create_app() is safe here because in standalone mode the main

Review Comment:
   Good catch — you're right that the else branch also covers the edge case 
where `appbuilder_initialized` is True but `has_app_context()` is False. In 
that scenario, calling `create_app()` would re-initialize appbuilder and 
overwrite shared state. I'll restore the guard that checks 
`appbuilder_initialized` before deciding whether to call `create_app()` vs. 
creating a minimal app, so the full initialization only happens in genuine 
standalone mode.
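
A minimal sketch of the decision this guard would encode, assuming the existing `if` branch is taken when `appbuilder_initialized` is True and an app context is active. The `InitMode` enum and `choose_init_mode` helper are hypothetical names used only for illustration; they are not part of `flask_singleton.py`.

```python
from enum import Enum


class InitMode(Enum):
    """Hypothetical labels for the three ways the MCP service could get an app."""

    REUSE_CURRENT_APP = "reuse_current_app"
    MINIMAL_APP = "minimal_app"
    FULL_CREATE_APP = "full_create_app"


def choose_init_mode(appbuilder_initialized: bool, in_app_context: bool) -> InitMode:
    """Pick an initialization path without touching any Flask state."""
    if appbuilder_initialized and in_app_context:
        # The main Superset app is live: reuse it.
        return InitMode.REUSE_CURRENT_APP
    if appbuilder_initialized:
        # Edge case: appbuilder exists but no app context is active.
        # Avoid create_app() so the shared appbuilder is not re-initialized.
        return InitMode.MINIMAL_APP
    # Genuine standalone mode: nothing is initialized, so full init is safe.
    return InitMode.FULL_CREATE_APP
```

Keeping the decision in a pure function like this would also make the edge case easy to cover in a unit test, since no Flask app has to be constructed.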



##########
superset/mcp_service/server.py:
##########
@@ -151,6 +155,110 @@ def create_event_store(config: dict[str, Any] | None = None) -> Any | None:
         return None
 
 
+def _strip_titles(obj: Any, in_properties_map: bool = False) -> Any:
+    """Recursively strip schema metadata ``title`` keys.
+
+    Keeps real field names inside ``properties`` (e.g. a property literally
+    named ``title``), while removing auto-generated schema title metadata.
+    """
+    if isinstance(obj, dict):
+        result: dict[str, Any] = {}
+        for key, value in obj.items():
+            if key == "title" and not in_properties_map:
+                continue
+            result[key] = _strip_titles(value, in_properties_map=(key == "properties"))
+        return result
+    if isinstance(obj, list):
+        return [_strip_titles(item, in_properties_map=False) for item in obj]
+    return obj
+
+
+def _serialize_tools_without_output_schema(
+    tools: Sequence[Any],
+) -> list[dict[str, Any]]:
+    """Serialize tools to JSON, stripping outputSchema and titles to reduce tokens.
+
+    LLMs only need inputSchema to call tools. outputSchema accounts for
+    50-80% of the per-tool schema size, and auto-generated 'title' fields
+    add ~12% bloat. Stripping both cuts search result tokens significantly.
+    """
+    results = []
+    for tool in tools:
+        data = tool.to_mcp_tool().model_dump(mode="json", exclude_none=True)
+        data.pop("outputSchema", None)
+        if input_schema := data.get("inputSchema"):
+            data["inputSchema"] = _strip_titles(input_schema)
+        results.append(data)
+    return results
+
+
+def _fix_call_tool_schema(transform: Any) -> None:
+    """Patch the call_tool proxy to emit a clean ``type: object`` schema.
+
+    FastMCP's BaseSearchTransform defines ``arguments`` as
+    ``dict[str, Any] | None`` which emits an ``anyOf`` JSON Schema.
+    Some MCP bridges (mcp-remote, Claude Desktop) don't handle ``anyOf``
+    and strip it, leaving the field without a ``type`` — causing all
+    call_tool invocations to fail with "Input should be a valid dictionary".
+
+    This patches the transform's ``_make_call_tool`` to post-process the
+    schema, replacing the ``anyOf`` with a flat ``type: object``.
+    """
+    original_make = transform._make_call_tool
+
+    def patched_make_call_tool() -> Any:
+        tool = original_make()
+        if "arguments" in (props := (tool.parameters or {}).get("properties", {})):
+            props["arguments"] = {
+                "additionalProperties": True,
+                "default": None,
+                "description": "Arguments to pass to the tool",
+                "type": "object",
+            }
+        return tool
+
+    import types
+
+    transform._make_call_tool = types.MethodType(

Review Comment:
   Fair point: `_make_call_tool` is private API and could break on any FastMCP 
minor release. I'll check whether FastMCP 3.x exposes a public hook for 
customizing the call_tool schema and, if it doesn't, I may open an upstream 
FastMCP issue to request one. In the meantime, I'll add a comment pinning the 
FastMCP version this assumption depends on, plus test coverage for 
`_fix_call_tool_schema`.
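
As a rough idea of what that test coverage could check, here is a self-contained sketch. It does not import Superset or FastMCP: `flatten_arguments_schema` is a hypothetical pure helper that applies the same replacement dict shown in the diff, and the sample `anyOf` schema is an assumed shape for what `dict[str, Any] | None` would emit.

```python
from __future__ import annotations

from typing import Any

# The flat schema from the diff above, reused verbatim.
FLAT_ARGUMENTS_SCHEMA: dict[str, Any] = {
    "additionalProperties": True,
    "default": None,
    "description": "Arguments to pass to the tool",
    "type": "object",
}


def flatten_arguments_schema(parameters: dict[str, Any] | None) -> dict[str, Any] | None:
    """Hypothetical helper: replace the ``arguments`` property in place, if present."""
    props = (parameters or {}).get("properties", {})
    if "arguments" in props:
        props["arguments"] = dict(FLAT_ARGUMENTS_SCHEMA)
    return parameters


def test_anyof_is_replaced_with_flat_object() -> None:
    parameters = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "arguments": {
                "anyOf": [{"type": "object"}, {"type": "null"}],
                "default": None,
            },
        },
    }
    result = flatten_arguments_schema(parameters)
    assert result is not None
    assert result["properties"]["arguments"] == FLAT_ARGUMENTS_SCHEMA
    assert "anyOf" not in result["properties"]["arguments"]
    # Unrelated properties are left alone.
    assert result["properties"]["name"] == {"type": "string"}


def test_schema_without_arguments_is_untouched() -> None:
    parameters = {"type": "object", "properties": {"name": {"type": "string"}}}
    assert flatten_arguments_schema(parameters) == {
        "type": "object",
        "properties": {"name": {"type": "string"}},
    }
    assert flatten_arguments_schema(None) is None
```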



##########
superset/mcp_service/server.py:
##########
@@ -151,6 +155,110 @@ def create_event_store(config: dict[str, Any] | None = None) -> Any | None:
         return None
 
 
+def _strip_titles(obj: Any, in_properties_map: bool = False) -> Any:
+    """Recursively strip schema metadata ``title`` keys.
+
+    Keeps real field names inside ``properties`` (e.g. a property literally
+    named ``title``), while removing auto-generated schema title metadata.
+    """
+    if isinstance(obj, dict):
+        result: dict[str, Any] = {}
+        for key, value in obj.items():
+            if key == "title" and not in_properties_map:
+                continue
+            result[key] = _strip_titles(value, in_properties_map=(key == "properties"))
+        return result
+    if isinstance(obj, list):
+        return [_strip_titles(item, in_properties_map=False) for item in obj]
+    return obj
+
+
+def _serialize_tools_without_output_schema(
+    tools: Sequence[Any],
+) -> list[dict[str, Any]]:
+    """Serialize tools to JSON, stripping outputSchema and titles to reduce tokens.
+
+    LLMs only need inputSchema to call tools. outputSchema accounts for
+    50-80% of the per-tool schema size, and auto-generated 'title' fields
+    add ~12% bloat. Stripping both cuts search result tokens significantly.
+    """
+    results = []
+    for tool in tools:
+        data = tool.to_mcp_tool().model_dump(mode="json", exclude_none=True)
+        data.pop("outputSchema", None)
+        if input_schema := data.get("inputSchema"):
+            data["inputSchema"] = _strip_titles(input_schema)
+        results.append(data)
+    return results
+
+
+def _fix_call_tool_schema(transform: Any) -> None:
+    """Patch the call_tool proxy to emit a clean ``type: object`` schema.
+
+    FastMCP's BaseSearchTransform defines ``arguments`` as
+    ``dict[str, Any] | None`` which emits an ``anyOf`` JSON Schema.
+    Some MCP bridges (mcp-remote, Claude Desktop) don't handle ``anyOf``
+    and strip it, leaving the field without a ``type`` — causing all
+    call_tool invocations to fail with "Input should be a valid dictionary".
+
+    This patches the transform's ``_make_call_tool`` to post-process the
+    schema, replacing the ``anyOf`` with a flat ``type: object``.
+    """
+    original_make = transform._make_call_tool
+
+    def patched_make_call_tool() -> Any:

Review Comment:
   Agreed — monkey-patching is not the right pattern here, especially since the 
codebase already uses subclassing (DetailedJWTVerifier) and callback injection 
(search_result_serializer). I'll refactor `_fix_call_tool_schema` to use a 
subclass approach instead. If FastMCP's `BaseSearchTransform` doesn't expose a 
clean override point, I'll subclass 
`BM25SearchTransform`/`RegexSearchTransform` directly.
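
A minimal sketch of the subclass shape, using stand-in classes so it runs without FastMCP installed. `StubTool` and `StubSearchTransform` are hypothetical placeholders, not FastMCP types, and the `anyOf` schema they produce is an assumed example. Note that this version still overrides the private `_make_call_tool`, so it only changes the pattern (inheritance instead of instance patching), not the dependency on private API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubTool:
    """Stand-in for the tool object returned by the transform (hypothetical)."""

    parameters: dict[str, Any] = field(default_factory=dict)


class StubSearchTransform:
    """Stand-in for a FastMCP search transform; NOT the real class."""

    def _make_call_tool(self) -> StubTool:
        return StubTool(
            parameters={
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {
                        "anyOf": [{"type": "object"}, {"type": "null"}],
                        "default": None,
                    },
                },
            }
        )


class FlatArgumentsSearchTransform(StubSearchTransform):
    """Override the factory method instead of patching it on an instance."""

    def _make_call_tool(self) -> StubTool:
        tool = super()._make_call_tool()
        props = (tool.parameters or {}).get("properties", {})
        if "arguments" in props:
            # Same flat schema as in the diff above.
            props["arguments"] = {
                "additionalProperties": True,
                "default": None,
                "description": "Arguments to pass to the tool",
                "type": "object",
            }
        return tool
```

With the real classes, the same override would presumably live on subclasses of `BM25SearchTransform` and `RegexSearchTransform`, possibly via a shared mixin.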



##########
tests/unit_tests/mcp_service/test_tool_search_transform.py:
##########
@@ -0,0 +1,167 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Tests for MCP tool search transform configuration and application."""
+
+from unittest.mock import MagicMock, patch
+
+from superset.mcp_service.mcp_config import MCP_TOOL_SEARCH_CONFIG
+from superset.mcp_service.server import (
+    _apply_tool_search_transform,
+    _serialize_tools_without_output_schema,
+)
+
+
+def test_tool_search_config_defaults():
+    """Default config has expected keys and values."""
+    assert MCP_TOOL_SEARCH_CONFIG["enabled"] is True
+    assert MCP_TOOL_SEARCH_CONFIG["strategy"] == "bm25"
+    assert MCP_TOOL_SEARCH_CONFIG["max_results"] == 5
+    assert "health_check" in MCP_TOOL_SEARCH_CONFIG["always_visible"]
+    assert "get_instance_info" in MCP_TOOL_SEARCH_CONFIG["always_visible"]
+    assert MCP_TOOL_SEARCH_CONFIG["search_tool_name"] == "search_tools"
+    assert MCP_TOOL_SEARCH_CONFIG["call_tool_name"] == "call_tool"
+
+
+def test_apply_bm25_transform():
+    """BM25SearchTransform is applied when strategy is 'bm25'."""
+    mock_mcp = MagicMock()
+    config = {
+        "strategy": "bm25",
+        "max_results": 5,
+        "always_visible": ["health_check"],
+        "search_tool_name": "search_tools",
+        "call_tool_name": "call_tool",
+    }
+
+    with patch("fastmcp.server.transforms.search.BM25SearchTransform") as mock_bm25_cls:
+        mock_transform = MagicMock()
+        mock_bm25_cls.return_value = mock_transform
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_bm25_cls.call_args[1]
+        assert call_kwargs["max_results"] == 5
+        assert call_kwargs["always_visible"] == ["health_check"]
+        assert call_kwargs["search_tool_name"] == "search_tools"
+        assert call_kwargs["call_tool_name"] == "call_tool"
+        assert (
+            call_kwargs["search_result_serializer"]
+            is _serialize_tools_without_output_schema
+        )
+        mock_mcp.add_transform.assert_called_once_with(mock_transform)
+
+
+def test_apply_regex_transform():
+    """RegexSearchTransform is applied when strategy is 'regex'."""
+    mock_mcp = MagicMock()
+    config = {
+        "strategy": "regex",
+        "max_results": 10,
+        "always_visible": ["health_check", "get_instance_info"],
+        "search_tool_name": "find_tools",
+        "call_tool_name": "invoke_tool",
+    }
+
+    with patch(
+        "fastmcp.server.transforms.search.RegexSearchTransform"
+    ) as mock_regex_cls:
+        mock_transform = MagicMock()
+        mock_regex_cls.return_value = mock_transform
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_regex_cls.call_args[1]
+        assert call_kwargs["max_results"] == 10
+        assert call_kwargs["always_visible"] == ["health_check", "get_instance_info"]
+        assert call_kwargs["search_tool_name"] == "find_tools"
+        assert call_kwargs["call_tool_name"] == "invoke_tool"
+        assert (
+            call_kwargs["search_result_serializer"]
+            is _serialize_tools_without_output_schema
+        )
+        mock_mcp.add_transform.assert_called_once_with(mock_transform)
+
+
+def test_apply_transform_uses_defaults_for_missing_keys():
+    """Missing config keys fall back to sensible defaults."""
+    mock_mcp = MagicMock()
+    config = {}  # All keys missing — should use defaults
+
+    with patch("fastmcp.server.transforms.search.BM25SearchTransform") as mock_bm25_cls:
+        mock_bm25_cls.return_value = MagicMock()
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_bm25_cls.call_args[1]
+        assert call_kwargs["max_results"] == 5
+        assert call_kwargs["always_visible"] == []
+        assert call_kwargs["search_tool_name"] == "search_tools"
+        assert call_kwargs["call_tool_name"] == "call_tool"
+
+
+def test_transform_not_applied_when_disabled():
+    """No transform applied when config has enabled=False."""
+    # This tests the gating logic in run_server, not _apply_tool_search_transform
+    config = {"enabled": False}
+    assert not config.get("enabled", False)
+
+
+def test_transform_applied_when_enabled():
+    """Transform is applied when config has enabled=True."""
+    config = {"enabled": True}
+    assert config.get("enabled", False)

Review Comment:
   You're right — those two tests are just testing Python dict.get() behavior, 
not actual application logic. Removed in 389eda2.
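
If coverage of the enabled/disabled gating is still wanted later, one option is to test a function that actually makes the decision. The sketch below assumes the gating could be extracted into a small helper; `maybe_apply_tool_search_transform` is a hypothetical name, and the `False` default for `enabled` simply mirrors the removed tests rather than the real `run_server` behavior.

```python
from __future__ import annotations

from typing import Any, Callable
from unittest.mock import MagicMock


def maybe_apply_tool_search_transform(
    mcp: Any,
    config: dict[str, Any],
    apply: Callable[[Any, dict[str, Any]], None],
) -> bool:
    """Hypothetical gate: call ``apply`` only when the config enables it."""
    if not config.get("enabled", False):
        return False
    apply(mcp, config)
    return True


def test_disabled_config_skips_transform() -> None:
    apply = MagicMock()
    applied = maybe_apply_tool_search_transform(MagicMock(), {"enabled": False}, apply)
    assert applied is False
    apply.assert_not_called()


def test_enabled_config_applies_transform() -> None:
    mcp, apply = MagicMock(), MagicMock()
    config = {"enabled": True}
    applied = maybe_apply_tool_search_transform(mcp, config, apply)
    assert applied is True
    apply.assert_called_once_with(mcp, config)
```

Unlike the removed tests, these fail if the gate stops consulting `enabled` or calls the transform with the wrong arguments.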



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

