codeant-ai-for-open-source[bot] commented on code in PR #37389:
URL: https://github.com/apache/superset/pull/37389#discussion_r2726129012
##########
superset/mcp_service/system/resources/instance_metadata.py:
##########
@@ -88,12 +91,57 @@ def get_instance_metadata_resource() -> str:
         logger=logger,
     )
-    # Use the shared core's resource method
-    return instance_info_core.get_resource()
+    # Get base instance info
+    base_result = json.loads(instance_info_core.get_resource())
+
+    # Remove empty popular_content if it has no useful data
+    popular = base_result.get("popular_content", {})
+    if popular and not any(popular.get(k) for k in popular):
+        del base_result["popular_content"]
+
+    # Add available datasets (top 20 by most recent modification)
+    dataset_dao = instance_info_core.dao_classes["datasets"]
+    try:
+        datasets = dataset_dao.find_all()
+        # Convert to string to avoid TypeError when comparing datetime with None
+        sorted_datasets = sorted(
+            datasets,
+            key=lambda d: str(getattr(d, "changed_on", "") or ""),
+            reverse=True,
+        )[:20]
Review Comment:
**Suggestion:** Resource exhaustion: calling `dataset_dao.find_all()` loads
all dataset rows into memory. Replace with a paginated DAO call (`list`) that
fetches only a limited number of rows (top ~20) to avoid loading the entire
table into memory and then sorting in Python. [resource leak]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ instance://metadata may OOM on large dataset tables.
- ⚠️ Metadata calls become slow for many datasets.
- ⚠️ Affects LLM clients fetching dataset IDs.
```
</details>
```suggestion
        # Use paginated `list` to avoid loading every dataset into memory.
        datasets, _ = dataset_dao.list(
            page_size=20,
            columns=["id", "table_name", "schema", "database_id", "changed_on"],
        )
        # Keep previous string-based fallback for changed_on to avoid
        # datetime comparison errors.
        sorted_datasets = sorted(
            datasets,
            key=lambda d: str(getattr(d, "changed_on", "") or ""),
            reverse=True,
        )
```
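Note: the `list(page_size=..., columns=...)` signature above is this bot's assumption about the DAO API; if `BaseDAO` does not expose it, an equivalent fix is to push the ordering and limit into the database query itself. A minimal sketch, assuming direct access to the `SqlaTable` model and an active app context:

```python
# Hedged sketch (not necessarily the DAO API): fetch only the 20 most
# recently modified datasets at the database level, so Python never
# holds the full table in memory. The `SqlaTable` import path and the
# shared `db` session are assumptions about the current Superset layout.
from superset import db
from superset.connectors.sqla.models import SqlaTable

recent_datasets = (
    db.session.query(SqlaTable)
    .order_by(SqlaTable.changed_on.desc().nullslast())
    .limit(20)
    .all()
)
```

Ordering `NULL` values last in SQL would also remove the need for the string-based `changed_on` fallback.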
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Start the MCP service: `python -m superset.mcp_service`.
2. Call the instance metadata resource ("instance://metadata"), which invokes
   get_instance_metadata_resource in
   superset/mcp_service/system/resources/instance_metadata.py (function
   defined near line 36).
3. Code at lines 103-106 calls `dataset_dao.find_all()` (dataset DAO from
   InstanceInfoCore.dao_classes). BaseDAO.find_all
   (superset/daos/base.py:355-361) executes `query.all()`, loading all
   dataset rows into memory.
4. On installations with many datasets (tens of thousands), this allocates
   a large amount of memory and can cause long pauses or OOM during the
   metadata request. To reproduce, create many datasets, invoke the
   resource, and observe high memory usage at instance_metadata.py:103.
```
</details>
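To quantify step 4 without an external profiler, a hedged harness using Python's built-in `tracemalloc` can be wrapped around the call (assumes the resource function can be invoked inside a Flask app context):

```python
# Hypothetical repro harness: measure peak allocations while the
# metadata resource is built. Run inside a Flask app context.
import tracemalloc

from superset.mcp_service.system.resources.instance_metadata import (
    get_instance_metadata_resource,
)

tracemalloc.start()
get_instance_metadata_resource()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory during metadata call: {peak / 1e6:.1f} MB")
```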
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/mcp_service/system/resources/instance_metadata.py
**Line:** 105:111
**Comment:**
**Resource Leak:** Resource exhaustion: calling `dataset_dao.find_all()` loads
all dataset rows into memory. Replace with a paginated DAO call (`list`) that
fetches only a limited number of rows (top ~20) to avoid loading the entire
table into memory and then sorting in Python.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and keep it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]