elasticdotventures commented on issue #35003:
URL: https://github.com/apache/superset/issues/35003#issuecomment-3940271286

   Hi @betodealmeida and the Superset community 👋
   
   We wanted to share some context from a fork we've been developing in case 
it's useful to anyone looking for **DataFrame, MCP (Model Context Protocol), or 
AI-agent-driven chart/dashboard generation** capabilities today while SIP-182 
matures.
   
   **The fork**: https://github.com/PromptExecution/superset-datafusion-mcp
   
   ### Why the fork exists
   
   [@PromptExecution](https://github.com/PromptExecution) is a consulting org 
working with a client that operates a data platform currently in testing, 
expected to reach roughly **1,000 daily users within the next few months**. 
Many of those users are Python-proficient analysts who needed a capability that 
goes beyond what standard BI tools offer today:
   
   - **In-session DataFrames as chart sources** — ingest an Arrow/Parquet table 
via an AI agent, immediately generate a Superset chart against it, no database 
required
   - **MCP tool surface** — expose chart creation, dashboard assembly, and 
DataFrame querying as first-class tools that LLM agents can call
   - **"Better than Grafana" diagram and dashboard generation** — including 
Mermaid diagram output and composite dashboard assembly from agent conversations
   
   The delivery timeline made a clean upstream contribution path impractical 
for this cycle. Rather than wait, we made a **hard fork** to ship the MCP 
service layer on top of Superset's existing chart infrastructure.
   
   ### What we built (relevant to SIP-182)
   
   The fork adds a `VirtualDatasetRegistry` backed by **[Apache 
Arrow](https://arrow.apache.org/)** (in-memory tables, TTL-scoped, 
session-isolated) and **[Apache DataFusion](https://datafusion.apache.org/)** / 
DuckDB for query execution. We think Arrow plus DataFusion is the natural 
internal engine choice for Apache Superset: both are Apache-family projects with 
strong columnar performance characteristics, and Arrow in particular is already 
the lingua franca for DataFrame interchange across the Python ecosystem.
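
   To make "TTL-scoped, session-isolated" concrete, here is a minimal sketch of 
the registry idea. It is a simplified illustration, not the fork's actual code: 
the class name `VirtualDatasetRegistrySketch`, its methods, and the injectable 
`clock` are all hypothetical, and the stored `table` is any Python object 
standing in for an Arrow table.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class _Entry:
    session_id: str    # session that owns the dataset
    table: Any         # stand-in for an in-memory Arrow table
    expires_at: float  # clock value after which the entry is dead


class VirtualDatasetRegistrySketch:
    """Hypothetical, simplified registry; not the fork's real class."""

    def __init__(
        self,
        default_ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._entries: Dict[str, _Entry] = {}
        self._default_ttl = default_ttl_seconds
        self._clock = clock

    def register(
        self, session_id: str, table: Any, ttl_seconds: Optional[float] = None
    ) -> str:
        """Store a table for one session and return a `virtual:{uuid}` id."""
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        dataset_id = f"virtual:{uuid.uuid4()}"
        self._entries[dataset_id] = _Entry(session_id, table, self._clock() + ttl)
        return dataset_id

    def get(self, session_id: str, dataset_id: str) -> Any:
        """Return the table, enforcing expiry first and ownership second."""
        entry = self._entries.get(dataset_id)
        if entry is None:
            raise KeyError(dataset_id)
        if self._clock() >= entry.expires_at:
            # Lazily evict expired entries on access.
            del self._entries[dataset_id]
            raise KeyError(dataset_id)
        if entry.session_id != session_id:
            raise PermissionError(f"{dataset_id} belongs to another session")
        return entry.table
```

   The injectable clock is only there so expiry can be exercised without 
sleeping; a production registry would also need a background sweep and locking.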
   
   An AI agent can:
   1. Ingest a DataFrame → register as a virtual dataset (Arrow table in memory)
   2. Call `generate_chart(dataset_id="virtual:{uuid}", config={...})` → 
DataFusion/DuckDB executes the query → Superset renders the chart
   3. Query the virtual dataset with arbitrary SQL via the MCP tool surface
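
   The `virtual:{uuid}` identifier in step 2 implies some routing between 
virtual and database-backed datasets. A minimal sketch of that dispatch, 
assuming non-virtual ids are plain integers; `parse_dataset_id` and the 
`"virtual"` / `"database"` labels are hypothetical names for illustration, not 
the fork's API:

```python
import uuid
from typing import Tuple, Union

VIRTUAL_PREFIX = "virtual:"


def parse_dataset_id(dataset_id: str) -> Tuple[str, Union[uuid.UUID, int]]:
    """Classify a dataset id as virtual (UUID) or database-backed (int).

    Raises ValueError when the id is neither a valid `virtual:{uuid}`
    string nor an integer.
    """
    if dataset_id.startswith(VIRTUAL_PREFIX):
        # uuid.UUID raises ValueError on malformed input.
        return "virtual", uuid.UUID(dataset_id[len(VIRTUAL_PREFIX):])
    return "database", int(dataset_id)
```

   A chart tool could call this first and send `"virtual"` ids to the in-memory 
engine while leaving `"database"` ids on the regular path.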
   
   The bridge between virtual datasets and chart rendering lives entirely 
outside Superset's `get_sqla_query()` path, which means **it is structurally 
aligned with the decoupling SIP-182 proposes** — the `Explorable` protocol 
would give our bridge a proper first-class home.
   
   ### How we're planning to harmonize
   
   This fork is also serving as a live test of 
**[`gh-aw`](https://github.com/PromptExecution/superset-datafusion-mcp/tree/master/.github/workflows)**
 (GitHub Agentic Workflows) for CI/CD automation. We've wired up a 
breaking-change checker agent that watches specifically for SIP-182 milestones:
   
   - `Explorable` protocol introduction (Phase 0 / PR #36245)
   - `form_data` key renames (Phases 2/3): our bridge centralises all 
`form_data` reads into accessor functions, so a rename is a single-file update
   - `get_sqla_query()` removal (Phase 4) — low direct risk since we already 
bypass it, but we'll do a full audit when it lands
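
   The accessor pattern mentioned for `form_data` can be sketched as follows. 
This is an illustration under assumptions, not the bridge's real code: the 
function names are hypothetical, and the "new" key names (`metrics_v2`, 
`dimensions`) are invented placeholders, since the actual renames are whatever 
SIP-182 ends up specifying.

```python
from typing import Any, List, Mapping, Sequence

# Candidate keys, newest first. The first entry in each tuple is a made-up
# placeholder for a future rename; only this table changes when keys move.
_METRIC_KEYS: Sequence[str] = ("metrics_v2", "metrics")
_GROUPBY_KEYS: Sequence[str] = ("dimensions", "groupby")


def _first_present(
    form_data: Mapping[str, Any], keys: Sequence[str], default: Any
) -> Any:
    """Return the value of the first key found in form_data, else default."""
    for key in keys:
        if key in form_data:
            return form_data[key]
    return default


def get_metrics(form_data: Mapping[str, Any]) -> List[Any]:
    """Read the chart's metrics regardless of which key name is in use."""
    return list(_first_present(form_data, _METRIC_KEYS, []))


def get_groupby(form_data: Mapping[str, Any]) -> List[Any]:
    """Read the chart's grouping columns regardless of key name."""
    return list(_first_present(form_data, _GROUPBY_KEYS, []))
```

   Callers never index `form_data` directly, so supporting old and new key names 
during a migration window means editing only the tuples at the top.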
   
   When Phase 0 merges, our plan is to implement `Explorable` for the 
`VirtualDatasetRegistry` so virtual datasets work natively through Superset's 
chart pipeline. At that point we'd love to discuss upstreaming the registry, 
the MCP tool surface, and potentially the Prometheus query tool (which has no 
upstream equivalent proposed yet).
   
   ### Cherry-pick contributions
   
   In the meantime we're tracking upstream closely and tagging anything that 
looks like a clean upstream contribution candidate. If any of the patterns 
we've built — session-scoped in-memory datasets, TTL lifecycle management, 
Arrow-native query results, or the MCP agentic tool layer — would be useful 
reference material as Phases 1–3 land, we're happy to share specifics or open 
draft PRs for discussion.
   
   Thanks for the thoughtful design work here — SIP-182 is exactly the right 
abstraction boundary and we're genuinely excited to see it mature.
   
   — [@PromptExecution](https://github.com/PromptExecution)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

