Re: [PR] feat(mcp): add BM25 tool search transform to reduce initial context size [superset]

via GitHub Fri, 13 Mar 2026 02:50:46 -0700


Antonio-RiveroMartnez commented on code in PR #38562:
URL: https://github.com/apache/superset/pull/38562#discussion_r2930114776



##########
superset/mcp_service/server.py:
##########
@@ -151,6 +155,110 @@ def create_event_store(config: dict[str, Any] | None = None) -> Any | None:
         return None
 
 
+def _strip_titles(obj: Any, in_properties_map: bool = False) -> Any:
+    """Recursively strip schema metadata ``title`` keys.
+
+    Keeps real field names inside ``properties`` (e.g. a property literally
+    named ``title``), while removing auto-generated schema title metadata.
+    """
+    if isinstance(obj, dict):
+        result: dict[str, Any] = {}
+        for key, value in obj.items():
+            if key == "title" and not in_properties_map:
+                continue
+            result[key] = _strip_titles(value, in_properties_map=(key == "properties"))
+        return result
+    if isinstance(obj, list):
+        return [_strip_titles(item, in_properties_map=False) for item in obj]
+    return obj
+
+
+def _serialize_tools_without_output_schema(
+    tools: Sequence[Any],
+) -> list[dict[str, Any]]:
+    """Serialize tools to JSON, stripping outputSchema and titles to reduce tokens.
+
+    LLMs only need inputSchema to call tools. outputSchema accounts for
+    50-80% of the per-tool schema size, and auto-generated 'title' fields
+    add ~12% bloat. Stripping both cuts search result tokens significantly.
+    """
+    results = []
+    for tool in tools:
+        data = tool.to_mcp_tool().model_dump(mode="json", exclude_none=True)
+        data.pop("outputSchema", None)
+        if input_schema := data.get("inputSchema"):
+            data["inputSchema"] = _strip_titles(input_schema)
+        results.append(data)
+    return results
+
+
+def _fix_call_tool_schema(transform: Any) -> None:
+    """Patch the call_tool proxy to emit a clean ``type: object`` schema.
+
+    FastMCP's BaseSearchTransform defines ``arguments`` as
+    ``dict[str, Any] | None`` which emits an ``anyOf`` JSON Schema.
+    Some MCP bridges (mcp-remote, Claude Desktop) don't handle ``anyOf``
+    and strip it, leaving the field without a ``type`` — causing all
+    call_tool invocations to fail with "Input should be a valid dictionary".
+
+    This patches the transform's ``_make_call_tool`` to post-process the
+    schema, replacing the ``anyOf`` with a flat ``type: object``.
+    """
+    original_make = transform._make_call_tool
+
+    def patched_make_call_tool() -> Any:
+        tool = original_make()
+        if "arguments" in (props := (tool.parameters or {}).get("properties", {})):
+            props["arguments"] = {
+                "additionalProperties": True,
+                "default": None,
+                "description": "Arguments to pass to the tool",
+                "type": "object",
+            }
+        return tool
+
+    import types
+
+    transform._make_call_tool = types.MethodType(

Review Comment:
   `_make_call_tool` is private API (`_` prefix). Any FastMCP 3.x minor release 
could rename it, remove it, or change its signature. Also, there is zero test 
coverage for this function.
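
A minimal sketch of what coverage for the schema rewrite could look like, assuming the rewrite is pulled out of the patched closure into a pure function. `flatten_arguments_schema` is a hypothetical name, not something in the PR; the replacement dict is copied from the diff above.

```python
from __future__ import annotations

from typing import Any


# Hypothetical helper: the schema rewrite from the diff, extracted so it can
# be tested without touching FastMCP or its private _make_call_tool hook.
def flatten_arguments_schema(parameters: dict[str, Any] | None) -> None:
    """Replace the ``arguments`` property with a flat object schema, in place."""
    props = (parameters or {}).get("properties", {})
    if "arguments" in props:
        props["arguments"] = {
            "additionalProperties": True,
            "default": None,
            "description": "Arguments to pass to the tool",
            "type": "object",
        }


def test_arguments_anyof_is_flattened() -> None:
    parameters = {
        "properties": {
            "name": {"type": "string"},
            "arguments": {
                "anyOf": [{"type": "object"}, {"type": "null"}],
                "default": None,
            },
        }
    }
    flatten_arguments_schema(parameters)
    arguments = parameters["properties"]["arguments"]
    assert arguments["type"] == "object"
    assert "anyOf" not in arguments
    # Other properties are left untouched.
    assert parameters["properties"]["name"] == {"type": "string"}


def test_schema_without_arguments_is_left_alone() -> None:
    parameters = {"properties": {"name": {"type": "string"}}}
    flatten_arguments_schema(parameters)
    assert parameters == {"properties": {"name": {"type": "string"}}}
    flatten_arguments_schema(None)  # must not raise
```

This only covers the rewrite itself; whether the patched method is actually installed on the transform would still need a separate test against the real class.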
   



##########
superset/mcp_service/server.py:
##########
@@ -151,6 +155,110 @@ def create_event_store(config: dict[str, Any] | None = None) -> Any | None:
         return None
 
 
+def _strip_titles(obj: Any, in_properties_map: bool = False) -> Any:
+    """Recursively strip schema metadata ``title`` keys.
+
+    Keeps real field names inside ``properties`` (e.g. a property literally
+    named ``title``), while removing auto-generated schema title metadata.
+    """
+    if isinstance(obj, dict):
+        result: dict[str, Any] = {}
+        for key, value in obj.items():
+            if key == "title" and not in_properties_map:
+                continue
+            result[key] = _strip_titles(value, in_properties_map=(key == "properties"))
+        return result
+    if isinstance(obj, list):
+        return [_strip_titles(item, in_properties_map=False) for item in obj]
+    return obj
+
+
+def _serialize_tools_without_output_schema(
+    tools: Sequence[Any],
+) -> list[dict[str, Any]]:
+    """Serialize tools to JSON, stripping outputSchema and titles to reduce tokens.
+
+    LLMs only need inputSchema to call tools. outputSchema accounts for
+    50-80% of the per-tool schema size, and auto-generated 'title' fields
+    add ~12% bloat. Stripping both cuts search result tokens significantly.
+    """
+    results = []
+    for tool in tools:
+        data = tool.to_mcp_tool().model_dump(mode="json", exclude_none=True)
+        data.pop("outputSchema", None)
+        if input_schema := data.get("inputSchema"):
+            data["inputSchema"] = _strip_titles(input_schema)
+        results.append(data)
+    return results
+
+
+def _fix_call_tool_schema(transform: Any) -> None:
+    """Patch the call_tool proxy to emit a clean ``type: object`` schema.
+
+    FastMCP's BaseSearchTransform defines ``arguments`` as
+    ``dict[str, Any] | None`` which emits an ``anyOf`` JSON Schema.
+    Some MCP bridges (mcp-remote, Claude Desktop) don't handle ``anyOf``
+    and strip it, leaving the field without a ``type`` — causing all
+    call_tool invocations to fail with "Input should be a valid dictionary".
+
+    This patches the transform's ``_make_call_tool`` to post-process the
+    schema, replacing the ``anyOf`` with a flat ``type: object``.
+    """
+    original_make = transform._make_call_tool
+
+    def patched_make_call_tool() -> Any:

Review Comment:
   We usually don't monkey-patch; the usual approaches are subclassing (e.g., 
DetailedJWTVerifier extends JWTVerifier) and configuration injection (e.g., the 
search_result_serializer callback).
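
A rough sketch of the subclassing shape, under the assumption that the transform class exposes the same `_make_call_tool` hook the diff patches. `ToolStandIn` and `SearchTransformStandIn` are hypothetical stand-ins so the example runs without FastMCP installed; in the PR the base would be the real search transform class.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolStandIn:
    """Hypothetical stand-in for the tool object returned by the transform."""

    parameters: dict[str, Any] = field(default_factory=dict)


class SearchTransformStandIn:
    """Hypothetical stand-in for the FastMCP search transform base class."""

    def _make_call_tool(self) -> ToolStandIn:
        # Mimics the ``anyOf`` schema described in the docstring above.
        return ToolStandIn(
            parameters={
                "properties": {
                    "arguments": {
                        "anyOf": [{"type": "object"}, {"type": "null"}],
                        "default": None,
                    }
                }
            }
        )


class FlatArgumentsSearchTransform(SearchTransformStandIn):
    """Override the hook in a subclass instead of rebinding it on an instance."""

    def _make_call_tool(self) -> ToolStandIn:
        tool = super()._make_call_tool()
        props = (tool.parameters or {}).get("properties", {})
        if "arguments" in props:
            props["arguments"] = {
                "additionalProperties": True,
                "default": None,
                "description": "Arguments to pass to the tool",
                "type": "object",
            }
        return tool
```

Note that a subclass like this still depends on the private hook, so it addresses the monkey-patching concern but not the private-API one.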



##########
superset/mcp_service/flask_singleton.py:
##########
@@ -53,61 +52,33 @@
         # Use _get_current_object() to get the actual Flask app, not the LocalProxy
         app = current_app._get_current_object()
     else:
-        # Either appbuilder is not initialized (standalone MCP server),
-        # or appbuilder is initialized but we're not in an app context
-        # (edge case - should rarely happen). In both cases, create a minimal app.
+        # Standalone MCP server — Superset models are deeply coupled to
+        # appbuilder, security_manager, event_logger, encrypted_field_factory,
+        # etc. so we use create_app() for full initialization rather than
+        # trying to init a minimal subset (which leads to cascading failures).
         #
-        # We avoid calling create_app() which would run full FAB initialization
-        # and could corrupt the shared appbuilder singleton if main app starts.
-        from superset.app import SupersetApp
+        # create_app() is safe here because in standalone mode the main

Review Comment:
   This is only true when `appbuilder_initialized` is `False`. The else branch 
now handles both standalone AND the edge case, and calling `create_app()` when 
`appbuilder` is already initialized would call `appbuilder.init_app()` a second 
time with a different Flask app, overwriting shared internal state.
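
One way to express the distinction, as a sketch only: split the else branch so the full `create_app()` path runs only when `appbuilder_initialized` is `False`. The function name, the `in_app_context` flag, and the three callables are hypothetical, and falling back to the previous minimal-app behavior for the edge case is an assumption, not something prescribed here.

```python
from __future__ import annotations

from typing import Callable, TypeVar

App = TypeVar("App")


def resolve_flask_app(
    appbuilder_initialized: bool,
    in_app_context: bool,
    get_current_app: Callable[[], App],
    create_full_app: Callable[[], App],
    create_minimal_app: Callable[[], App],
) -> App:
    """Pick an app without re-initializing an already initialized appbuilder."""
    if appbuilder_initialized and in_app_context:
        # Normal embedded case: reuse the running app.
        return get_current_app()
    if not appbuilder_initialized:
        # Standalone MCP server: full initialization cannot clobber anything.
        return create_full_app()
    # Edge case: appbuilder is initialized but there is no app context.
    # Assumption: keep the old minimal-app behavior rather than call
    # create_app() a second time against shared state.
    return create_minimal_app()
```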



##########
tests/unit_tests/mcp_service/test_tool_search_transform.py:
##########
@@ -0,0 +1,167 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Tests for MCP tool search transform configuration and application."""
+
+from unittest.mock import MagicMock, patch
+
+from superset.mcp_service.mcp_config import MCP_TOOL_SEARCH_CONFIG
+from superset.mcp_service.server import (
+    _apply_tool_search_transform,
+    _serialize_tools_without_output_schema,
+)
+
+
+def test_tool_search_config_defaults():
+    """Default config has expected keys and values."""
+    assert MCP_TOOL_SEARCH_CONFIG["enabled"] is True
+    assert MCP_TOOL_SEARCH_CONFIG["strategy"] == "bm25"
+    assert MCP_TOOL_SEARCH_CONFIG["max_results"] == 5
+    assert "health_check" in MCP_TOOL_SEARCH_CONFIG["always_visible"]
+    assert "get_instance_info" in MCP_TOOL_SEARCH_CONFIG["always_visible"]
+    assert MCP_TOOL_SEARCH_CONFIG["search_tool_name"] == "search_tools"
+    assert MCP_TOOL_SEARCH_CONFIG["call_tool_name"] == "call_tool"
+
+
+def test_apply_bm25_transform():
+    """BM25SearchTransform is applied when strategy is 'bm25'."""
+    mock_mcp = MagicMock()
+    config = {
+        "strategy": "bm25",
+        "max_results": 5,
+        "always_visible": ["health_check"],
+        "search_tool_name": "search_tools",
+        "call_tool_name": "call_tool",
+    }
+
+    with patch("fastmcp.server.transforms.search.BM25SearchTransform") as mock_bm25_cls:
+        mock_transform = MagicMock()
+        mock_bm25_cls.return_value = mock_transform
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_bm25_cls.call_args[1]
+        assert call_kwargs["max_results"] == 5
+        assert call_kwargs["always_visible"] == ["health_check"]
+        assert call_kwargs["search_tool_name"] == "search_tools"
+        assert call_kwargs["call_tool_name"] == "call_tool"
+        assert (
+            call_kwargs["search_result_serializer"]
+            is _serialize_tools_without_output_schema
+        )
+        mock_mcp.add_transform.assert_called_once_with(mock_transform)
+
+
+def test_apply_regex_transform():
+    """RegexSearchTransform is applied when strategy is 'regex'."""
+    mock_mcp = MagicMock()
+    config = {
+        "strategy": "regex",
+        "max_results": 10,
+        "always_visible": ["health_check", "get_instance_info"],
+        "search_tool_name": "find_tools",
+        "call_tool_name": "invoke_tool",
+    }
+
+    with patch(
+        "fastmcp.server.transforms.search.RegexSearchTransform"
+    ) as mock_regex_cls:
+        mock_transform = MagicMock()
+        mock_regex_cls.return_value = mock_transform
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_regex_cls.call_args[1]
+        assert call_kwargs["max_results"] == 10
+        assert call_kwargs["always_visible"] == ["health_check", "get_instance_info"]
+        assert call_kwargs["search_tool_name"] == "find_tools"
+        assert call_kwargs["call_tool_name"] == "invoke_tool"
+        assert (
+            call_kwargs["search_result_serializer"]
+            is _serialize_tools_without_output_schema
+        )
+        mock_mcp.add_transform.assert_called_once_with(mock_transform)
+
+
+def test_apply_transform_uses_defaults_for_missing_keys():
+    """Missing config keys fall back to sensible defaults."""
+    mock_mcp = MagicMock()
+    config = {}  # All keys missing — should use defaults
+
+    with patch("fastmcp.server.transforms.search.BM25SearchTransform") as mock_bm25_cls:
+        mock_bm25_cls.return_value = MagicMock()
+
+        _apply_tool_search_transform(mock_mcp, config)
+
+        call_kwargs = mock_bm25_cls.call_args[1]
+        assert call_kwargs["max_results"] == 5
+        assert call_kwargs["always_visible"] == []
+        assert call_kwargs["search_tool_name"] == "search_tools"
+        assert call_kwargs["call_tool_name"] == "call_tool"
+
+
+def test_transform_not_applied_when_disabled():
+    """No transform applied when config has enabled=False."""
+    # This tests the gating logic in run_server, not _apply_tool_search_transform
+    config = {"enabled": False}
+    assert not config.get("enabled", False)
+
+
+def test_transform_applied_when_enabled():
+    """Transform is applied when config has enabled=True."""
+    config = {"enabled": True}
+    assert config.get("enabled", False)

Review Comment:
   The original comment from the bot is correct: these two "transform" tests 
don't check any config "contract". The config might as well be called `foo` and 
have random keys, and they would still pass. I would recommend removing these two.
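
If the gating is considered worth covering, a test would have to exercise the code that does the gating rather than a local dict. A minimal sketch, assuming the check in `run_server` were factored into a small function; `maybe_apply_tool_search` and its `apply_transform` parameter are hypothetical and not part of the PR.

```python
from __future__ import annotations

from typing import Any, Callable
from unittest.mock import MagicMock


# Hypothetical extraction of the gating logic so it can be tested directly.
def maybe_apply_tool_search(
    mcp: Any,
    config: dict[str, Any],
    apply_transform: Callable[[Any, dict[str, Any]], None],
) -> bool:
    """Apply the transform only when the config enables it."""
    if not config.get("enabled", False):
        return False
    apply_transform(mcp, config)
    return True


def test_transform_not_applied_when_disabled() -> None:
    apply_transform = MagicMock()
    applied = maybe_apply_tool_search(MagicMock(), {"enabled": False}, apply_transform)
    assert applied is False
    apply_transform.assert_not_called()


def test_transform_applied_when_enabled() -> None:
    mcp = MagicMock()
    config = {"enabled": True}
    apply_transform = MagicMock()
    assert maybe_apply_tool_search(mcp, config, apply_transform) is True
    apply_transform.assert_called_once_with(mcp, config)
```

Unlike the two tests above, these fail if the gate is removed or inverted.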



##########
superset/mcp_service/chart/schemas.py:
##########


Review Comment:
   This trims descriptions for charts, but what about dashboard/schemas.py 
and system/schemas.py? Why not trim those as well?




