codeant-ai-for-open-source[bot] commented on code in PR #37200:
URL: https://github.com/apache/superset/pull/37200#discussion_r2699817933


##########
superset/mcp_service/utils/token_utils.py:
##########
@@ -0,0 +1,402 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Token counting and response size utilities for MCP service.
+
+This module provides utilities to estimate token counts and generate smart
+suggestions when responses exceed configured limits. This prevents large
+responses from overwhelming LLM clients like Claude Desktop.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Any, Dict, List
+
+logger = logging.getLogger(__name__)
+
+# Approximate characters per token for estimation
+# Claude tokenizer averages ~4 chars per token for English text
+# JSON tends to be more verbose, so we use a slightly lower ratio
+CHARS_PER_TOKEN = 3.5
+
+
+def estimate_token_count(text: str | bytes) -> int:
+    """
+    Estimate the token count for a given text.
+
+    Uses a character-based heuristic since we don't have direct access to
+    the actual tokenizer. This is conservative to avoid underestimating.
+
+    Args:
+        text: The text to estimate tokens for (string or bytes)
+
+    Returns:
+        Estimated number of tokens
+    """
+    if isinstance(text, bytes):
+        text = text.decode("utf-8", errors="replace")
+
+    # Simple heuristic: ~3.5 characters per token for JSON/code
+    return int(len(text) / CHARS_PER_TOKEN)
+
+
+def estimate_response_tokens(response: Any) -> int:
+    """
+    Estimate token count for an MCP tool response.
+
+    Handles various response types including Pydantic models, dicts, and 
strings.
+
+    Args:
+        response: The response object to estimate
+
+    Returns:
+        Estimated number of tokens
+    """
+    try:
+        from superset.utils import json
+
+        # Convert response to JSON string for accurate estimation
+        if hasattr(response, "model_dump"):
+            # Pydantic model
+            response_str = json.dumps(response.model_dump())
+        elif isinstance(response, (dict, list)):
+            response_str = json.dumps(response)
+        elif isinstance(response, (str, bytes)):
+            response_str = response if isinstance(response, str) else 
response.decode()
+        else:
+            response_str = str(response)
+
+        return estimate_token_count(response_str)
+    except (TypeError, ValueError, AttributeError) as e:
+        logger.warning("Failed to estimate response tokens: %s", e)
+        # Return a high estimate to be safe
+        return 100000
+
+
+def get_response_size_bytes(response: Any) -> int:
+    """
+    Get the size of a response in bytes.
+
+    Args:
+        response: The response object
+
+    Returns:
+        Size in bytes
+    """
+    try:
+        from superset.utils import json
+
+        if hasattr(response, "model_dump"):
+            response_str = json.dumps(response.model_dump())
+        elif isinstance(response, (dict, list)):
+            response_str = json.dumps(response)
+        elif isinstance(response, bytes):
+            return len(response)
+        elif isinstance(response, str):
+            return len(response.encode("utf-8"))
+        else:
+            response_str = str(response)
+
+        return len(response_str.encode("utf-8"))
+    except (TypeError, ValueError, AttributeError) as e:
+        logger.warning("Failed to get response size: %s", e)
+        return 0

Review Comment:
   **Suggestion:** The except block returns 0 on failure which silently 
underestimates response size (allowing oversized responses to pass); also it 
doesn't catch ImportError/UnicodeDecodeError — fall back to computing the 
byte-length via str(response) and if that fails, return a conservative large 
value. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Oversized responses may bypass size checks.
   - ⚠️ LLM clients (e.g., Claude Desktop) risk receiving huge payloads.
   - ⚠️ Affected component: MCP response-size determination.
   ```
   </details>
   
   ```suggestion
       except (TypeError, ValueError, AttributeError, ImportError, 
UnicodeDecodeError) as e:
           logger.warning("Failed to get response size: %s; falling back to 
string length", e)
           try:
               return len(str(response).encode("utf-8"))
           except Exception:
               # Return a large conservative size to avoid silently 
underestimating
               return 100000
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Start the MCP service with the PR applied.
   
   2. Call a tool that returns a response object that json.dumps cannot 
serialize (e.g., an
   object with non-serializable attributes). get_response_size_bytes in
   superset/mcp_service/utils/token_utils.py (function get_response_size_bytes, 
lines
   ~92-119) calls json.dumps(response) at lines ~105-108.
   
   3. json.dumps raises TypeError; execution falls into the except block at 
lines 117-119
   which currently logs and returns 0.
   
   4. Returning 0 underestimates size; the ResponseSizeGuardMiddleware (which 
uses this
   utility to decide blocking) may allow an oversized response through because 
the size check
   sees 0 bytes.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/utils/token_utils.py
   **Line:** 117:119
   **Comment:**
        *Logic Error: The except block returns 0 on failure which silently 
underestimates response size (allowing oversized responses to pass); also it 
doesn't catch ImportError/UnicodeDecodeError — fall back to computing the 
byte-length via str(response) and if that fails, return a conservative large 
value.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to