[PR] feat(mcp): PR4 - add advanced chart tools and SQL Lab/Explore integration [superset]

via GitHub Fri, 24 Oct 2025 17:20:10 -0700


aminghadersohi opened a new pull request, #35837:
URL: https://github.com/apache/superset/pull/35837


   ### SUMMARY
   
   This PR adds 11 new MCP tools for chart creation/modification, data access, 
SQL execution, and explore link generation. This is the largest MCP PR, adding 
comprehensive write capabilities and production-ready features.
   
   **Built on top of:** PR #8 (feat/mcp_service_pr3_dataset_and_dashboard) - 
Dashboard and dataset tools
   
   **What's Included:**
   
   ## Chart Creation & Modification Tools (6 tools)
   
   ### generate_chart - Create new charts with AI assistance (454 lines)
   - **Chart types**: Table and XY charts (line, bar, area, scatter)
   - **5-layer validation pipeline**:
     1. Schema validation: Pydantic models catch type/format errors
     2. Business logic: Label uniqueness, required fields
     3. Dataset validation: Column/metric existence with fuzzy matching
     4. Superset compatibility: Valid aggregation functions, chart types
     5. Runtime validation: High-cardinality warnings, format suggestions
   - **Security**: XSS/SQL injection prevention with whitelist validation
   - **Smart features**:
     - Auto-generates chart names from dataset/columns
     - Fuzzy match column names ("totl_sales" → "total_sales")
     - Detects duplicate labels and suggests alternatives
     - Chart type recommendations based on data types
   - **Returns**: chart_id, preview URLs (PNG/ASCII/table/VegaLite), explore URL
   - **Options**: save_chart (True/False), generate_preview, preview_formats
   
   ### update_chart - Update existing charts (213 lines)
   - **Lookup**: By chart ID or UUID
   - **Updates**: Configuration, filters, axes, chart type
   - **Validation**: Same 5-layer pipeline as generate_chart
   - **Preserves**: Chart ownership, permissions, dashboard relationships
   - **Returns**: Updated chart info, new preview URLs, explore URL
   
   ### update_chart_preview - Modify cached preview (158 lines)
   - **Use case**: Iterative chart design without saving to database
   - **Input**: form_data_key from generate_chart (save_chart=False)
   - **Output**: New form_data_key, updated previews
   - **Invalidation**: Original form_data_key becomes invalid
   - **Workflow**: generate_chart (preview) → update_chart_preview (iterate) → 
generate_chart (save)
   
   ### get_chart_preview - Visual preview generation (2,082 lines)
   - **Formats**:
     - `url`: PNG screenshot via Selenium WebDriver (default)
     - `ascii`: Terminal-friendly ASCII art charts
     - `table`: Tabular data representation
     - `vega_lite`: Interactive VegaLite JSON spec
     - `base64`: Embedded base64-encoded PNG
   - **Screenshot system**:
     - WebDriver connection pool for concurrent generation
     - Automatic retry on transient failures
     - Configurable dimensions (default 800x600)
     - Headless Chrome/Firefox support
   - **Cache control**: use_cache, force_refresh, cache_timeout
   - **Performance**: Pooled WebDriver connections reduce latency
   
   ### get_chart_data - Export underlying data (649 lines)
   - **Formats**: json, csv, excel
   - **Pagination**: Configurable row limits (max 10,000)
   - **Cache control**: use_cache, force_refresh, cache_timeout
   - **Use case**: LLM analysis of raw data without rendering
   - **Returns**: Structured data with column metadata
   
   ### get_chart_available_filters - Filter metadata (50 lines)
   - **Returns**: Available filter fields and operators
   - **Fields**: slice_name, viz_type, datasource_name
   - **Operators**: eq, ne, sw, ew, in, nin, gt, gte, lt, lte, like, ilike
   - **Use case**: Helps LLMs discover filterable chart fields
   
   ## SQL Lab Tools (3 tools)
   
   ### execute_sql - Execute SQL queries (94 lines)
   - **Security**: DML permission enforcement, SQL injection prevention
   - **Limits**: Configurable row limits (max 10,000), query timeout (max 300s)
   - **Features**:
     - Parameterized queries for safe value substitution
     - Schema-specific execution
     - Structured error responses
     - Query result caching
   - **Returns**: Query results with column metadata and row data
   - **Use case**: LLM-driven data exploration and analysis
   
   ### open_sql_lab_with_context - Generate SQL Lab URL (118 lines)
   - **Pre-populate**:
     - SQL query in editor
     - Database connection selection
     - Default schema
     - Dataset context for exploration
     - Tab title
   - **Returns**: URL for direct navigation to SQL Lab
   - **Use case**: Seamless transition from MCP to interactive SQL editing
   
   ## Explore & Analysis Tools (1 tool)
   
   ### generate_explore_link - Create explore URLs (95 lines)
   - **Configurations**: Table and XY chart types
   - **Form data caching**: Temporary cache for seamless UX
   - **Features**:
     - Pre-configure dataset, metrics, filters
     - Set chart type and visualization options
     - Generate shareable explore URLs
   - **Returns**: explore_url for interactive chart building
   - **Use case**: LLM creates chart config, user refines in Superset UI
   
   ## Chart Infrastructure (3,085 lines)
   
   ### Core Utilities (1,045 lines)
   **chart_utils.py** (484 lines):
   - Chart creation and update orchestration
   - Form data lifecycle management (create, cache, expire)
   - Integration with ChartDAO for persistence
   - Handles both saved charts and cached previews
   
   **preview_utils.py** (561 lines):
   - Converts form_data to ASCII art using plotext
   - Generates tabular representations
   - Creates VegaLite specifications for interactivity
   - Extracts and formats chart data
   
   ### Validation Pipeline (2,040 lines)
   
   **5-Layer Architecture**:
   1. **schema_validator.py** (307 lines): Pydantic schema validation
      - Type checking (str, int, float, list, dict)
      - Required field enforcement
      - Format validation (dates, numbers, etc.)
   
   2. **dataset_validator.py** (329 lines): Column/metric existence
      - Validates x, y, group_by columns exist in dataset
      - Fuzzy matching for typos ("totl_sales" → "total_sales")
      - Suggests alternatives when columns not found
      - Checks aggregate function compatibility
   
   3. **pipeline.py** (293 lines): Validation orchestration
      - Runs validators in sequence
      - Collects errors from all layers
      - Returns structured ValidationResult
      - Supports warnings vs errors
   
   4. **runtime/chart_type_suggester.py** (437 lines): Smart recommendations
      - Analyzes data types (numeric, temporal, categorical)
      - Suggests appropriate chart types (line for time series, bar for 
categories)
      - Warns about high-cardinality group_by columns
      - Recommends aggregations for numeric columns
   
   5. **runtime/cardinality_validator.py** (195 lines): Performance warnings
      - Detects high-cardinality group_by columns (>1000 unique values)
      - Suggests filtering or aggregation
      - Prevents slow queries and unreadable charts
   
   **runtime/format_validator.py** (225 lines): Axis format validation
   - Validates D3 format strings (`$,.2f`, `.0%`, etc.)
   - Suggests corrections for invalid formats
   - Checks format compatibility with data types
   
   ## Screenshot Infrastructure (1,090 lines)
   
   **pooled_screenshot.py** (483 lines):
   - Manages pool of WebDriver connections
   - Concurrent screenshot generation (5 concurrent by default)
   - Automatic retry with exponential backoff
   - Configurable timeouts and dimensions
   - Health checks for WebDriver connections
   
   **webdriver_pool.py** (433 lines):
   - Connection pooling for Selenium WebDriver
   - Lazy initialization (create on demand)
   - Thread-safe pool management
   - Automatic recovery from connection failures
   - Supports Chrome and Firefox
   
   **webdriver_config.py** (139 lines):
   - Reads WEBDRIVER_* configuration from superset_config.py
   - Headless mode configuration
   - Chrome/Firefox options
   - Window size, timeouts, user agent
   
   **Configuration example**:
   ```python
   # superset_config.py
   WEBDRIVER_BASEURL = "http://localhost:9001/";
   WEBDRIVER_TYPE = "chrome"
   WEBDRIVER_OPTION_ARGS = ["--headless", "--no-sandbox"]
   WEBDRIVER_WINDOW = {"width": 1280, "height": 800}
   ```
   
   ## SQL Lab Infrastructure (589 lines)
   
   **execute_sql_core.py** (221 lines):
   - Core SQL execution logic with error handling
   - DML permission checks (SELECT only by default)
   - Result formatting and pagination
   - Query timeout enforcement
   - Integration with Superset's database connections
   
   **sql_lab_utils.py** (243 lines):
   - Constructs SQL Lab URLs with encoded form data
   - Handles schema, database, and query context
   - Generates shareable SQL Lab links
   
   **schemas.py** (109 lines):
   - `ExecuteSqlRequest`: database_id, sql, schema, limit, timeout, parameters
   - `ExecuteSqlResponse`: columns, data, query_id, status
   - `OpenSqlLabRequest`: database_id, sql, schema, title, dataset_in_context
   - `OpenSqlLabResponse`: url, title
   
   ## Enhanced Middleware (740 lines, +666 lines)
   
   **Expanded capabilities**:
   - Request/response logging with precise timing
   - Error response formatting with structured schemas
   - Flask context integration for all tools
   - Cache control header management
   - Form data cleanup for expired entries
   - Error tracking and debugging support
   
   ## Enhanced Authentication (auth.py, +38 lines)
   
   **New function**:
   - `has_dataset_access(dataset)`: Validates user permissions
     - Integrates with Superset's security_manager
     - Used by all chart creation/modification tools
     - Returns True/False for dataset access
   
   ## Common Utilities (738 lines)
   
   **error_schemas.py** (103 lines):
   - BaseError: Base class for all MCP errors
   - ValidationError: Schema/business logic errors with field details
   - ChartError, DashboardError, DatasetError: Resource-specific errors
   - Consistent timestamp and error_type fields
   
   **cache_utils.py** (143 lines):
   - CacheControl schema for unified caching behavior
   - Flags: use_cache, force_refresh, cache_timeout, refresh_metadata
   - Used across all listing/retrieval/preview tools
   
   **error_builder.py** (364 lines):
   - Builds structured error responses from exceptions
   - Validation error formatting with field-level details
   - Suggestion generation for typos (Levenshtein distance)
   - Stack trace inclusion for debugging
   
   **url_utils.py** (128 lines):
   - `get_superset_base_url()`: Returns SUPERSET_WEBSERVER_ADDRESS
   - Constructs chart URLs: `{base}/superset/slice/{id}/`
   - Constructs explore URLs: `{base}/explore/?form_data_key={key}`
   - Constructs SQL Lab URLs: `{base}/sqllab/?{params}`
   
   ## Commands (33 lines)
   
   **create_form_data.py**:
   - Wrapper for Superset's CreateFormDataCommand
   - Creates cached form_data entries
   - Returns form_data_key for URL generation
   - Used by preview tools
   
   ## Testing (2,554 lines, 105 new tests)
   
   **Chart Tool Tests** (1,877 lines, 72 tests):
   - `test_generate_chart.py` (268 lines): Chart creation validation
     - Schema validation tests
     - Dataset column existence tests
     - Label uniqueness enforcement
     - Chart type compatibility
     - Error response formatting
   
   - `test_update_chart.py` (385 lines): Chart update validation
     - ID and UUID lookup
     - Configuration modification
     - Permission checks
     - Not found error handling
   
   - `test_update_chart_preview.py` (474 lines): Preview update tests
     - Form data modification
     - Preview regeneration
     - form_data_key invalidation
     - Cache behavior
   
   - `test_get_chart_preview.py` (290 lines): Preview generation
     - URL format (screenshot) tests
     - ASCII format tests
     - Table format tests
     - VegaLite format tests
     - Base64 format tests
   
   - `test_chart_utils.py` (460 lines): Utility function tests
     - Chart creation helpers
     - Form data lifecycle
     - DAO integration mocking
   
   **SQL Lab Tests** (561 lines, 25 tests):
   - `test_execute_sql.py` (497 lines): SQL execution validation
     - Query execution success cases
     - Permission enforcement (DML blocked)
     - Limit enforcement (max 10,000 rows)
     - Timeout handling
     - SQL injection prevention
     - Parameterized query support
     - Error response formatting
   
   - `test_execute_sql_helper.py` (64 lines): Helper function tests
     - SQL parsing utilities
     - Query validation helpers
   
   **Explore Tests** (580 lines, 8 tests):
   - `test_generate_explore_link.py` (580 lines): Explore URL generation
     - Table chart configuration
     - XY chart configuration
     - Form data caching
     - URL construction
     - Error handling
   
   **Test Status**:
   - **Total**: 167 tests
   - **Passing**: 139 tests (83%)
   - **Needs work**: 28 SQL Lab tests require integration environment
   
   ## Integration
   
   **app.py updates** (tool imports):
   All 11 new tools are imported and auto-registered via @mcp.tool decorators:
   
   ```python
   # Chart tools (6 new)
   from superset.mcp_service.chart.tool import (
       generate_chart,              # NEW
       update_chart,                # NEW
       update_chart_preview,        # NEW
       get_chart_preview,           # NEW
       get_chart_data,              # NEW
       get_chart_available_filters, # NEW
       list_charts,
       get_chart_info,
   )
   
   # SQL Lab tools (3 new)
   from superset.mcp_service.sql_lab.tool import (
       execute_sql,                      # NEW
       open_sql_lab_with_context,        # NEW
   )
   
   # Explore tools (1 new)
   from superset.mcp_service.explore.tool import (
       generate_explore_link,  # NEW
   )
   ```
   
   ## Architecture & Patterns
   
   **Validation Pipeline**:
   ```
   User Request → Schema Validation → Business Logic → Dataset Validation
     → Superset Compatibility → Runtime Warnings → Success/Error Response
   ```
   
   **Chart Creation Flow**:
   ```
   generate_chart(config) → Validation Pipeline → Create Chart in DB
     → Cache form_data → Generate Previews → Return chart_id + URLs
   ```
   
   **Preview Workflow**:
   ```
   generate_chart(save_chart=False) → Cache form_data → form_data_key
     → update_chart_preview(form_data_key, new_config) → New form_data_key
     → generate_chart(save_chart=True) → Persists chart to DB
   ```
   
   **Screenshot Generation**:
   ```
   get_chart_preview(chart_id) → Retrieve form_data → Navigate to chart URL
     → Wait for chart-container element → Capture screenshot → Save to file
     → Return URL or base64 data
   ```
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   
   N/A - Backend MCP service infrastructure only, no UI changes.
   
   **Example tool usage via Claude Desktop:**
   
   ```
   User: "Create a line chart showing sales over time from the sales dataset"
   Claude calls: generate_chart({
     "dataset_id": 42,
     "config": {
       "chart_type": "xy",
       "kind": "line",
       "x": {"name": "order_date"},
       "y": [{"name": "total_sales", "aggregate": "SUM"}]
     }
   })
   Returns: chart_id: 123, preview_url: 
"http://localhost:9001/api/v1/chart/123/thumbnail/";
   
   User: "Change it to a bar chart"
   Claude calls: update_chart({
     "identifier": 123,
     "config": {"chart_type": "xy", "kind": "bar", ...}
   })
   
   User: "Show me the raw data"
   Claude calls: get_chart_data({"identifier": 123, "format": "json", "limit": 
100})
   
   User: "Execute SELECT * FROM sales WHERE region='West' LIMIT 10"
   Claude calls: execute_sql({
     "database_id": 1,
     "sql": "SELECT * FROM sales WHERE region='West' LIMIT 10"
   })
   
   User: "Open SQL Lab to explore the sales table"
   Claude calls: open_sql_lab_with_context({
     "database_connection_id": 1,
     "dataset_in_context": "sales",
     "sql": "SELECT * FROM sales LIMIT 100"
   })
   Returns: url: "http://localhost:9001/sqllab/?...";
   ```
   
   ### TESTING INSTRUCTIONS
   
   **Prerequisites:**
   - Superset running with PR #8 (dashboard and dataset tools) merged
   - Python 3.10 or 3.11
   - fastmcp installed (`pip install -e .[development]`)
   - Chrome or Firefox for screenshot generation
   - ChromeDriver or GeckoDriver installed
   
   **Setup:**
   
   ```bash
   # 1. Ensure database is initialized
   export FLASK_APP=superset
   superset db upgrade
   superset init
   
   # 2. Create admin user (if not already done)
   superset fab create-admin \
     --username admin \
     --firstname Admin \
     --lastname Admin \
     --email admin@localhost \
     --password admin
   
   # 3. Load example data (required for chart creation tests)
   superset load-examples
   
   # 4. Configure WebDriver in superset_config.py
   cat >> superset_config.py << 'EOF'
   # WebDriver Configuration (for chart screenshots)
   WEBDRIVER_BASEURL = "http://localhost:9001/";
   WEBDRIVER_BASEURL_USER_FRIENDLY = WEBDRIVER_BASEURL
   WEBDRIVER_TYPE = "chrome"  # or "firefox"
   WEBDRIVER_OPTION_ARGS = [
       "--headless",
       "--no-sandbox",
       "--disable-dev-shm-usage",
   ]
   WEBDRIVER_WINDOW = {"width": 1280, "height": 800}
   EOF
   
   # 5. Install ChromeDriver (macOS with Homebrew)
   brew install chromedriver
   
   # OR download manually:
   # https://chromedriver.chromium.org/downloads
   ```
   
   **Run Tests:**
   
   ```bash
   # Run all MCP service unit tests
   pytest tests/unit_tests/mcp_service/ -v
   
   # Should see:
   # - Chart schema tests: 11 passing
   # - Chart tool tests: Varies by test file
   # - Dashboard tests: 15 passing
   # - Dataset tests: 20 passing
   # - Core tests: 16 passing
   # - Explore tests: 8 passing
   # - SQL Lab tests: Some passing (25 need integration environment)
   # Total: 139-167 tests (139 passing in unit test mode)
   
   # Run specific test files
   pytest tests/unit_tests/mcp_service/chart/tool/test_generate_chart.py -v
   pytest tests/unit_tests/mcp_service/chart/tool/test_get_chart_preview.py -v
   pytest 
tests/unit_tests/mcp_service/explore/tool/test_generate_explore_link.py -v
   pytest tests/unit_tests/mcp_service/sql_lab/tool/test_execute_sql.py -v
   ```
   
   **Test with MCP Server:**
   
   ```bash
   # Terminal 1: Start Superset
   superset run -p 9001 --with-threads
   
   # Terminal 2: Start MCP service
   superset mcp run --port 5008 --debug
   
   # Terminal 3: Test endpoints
   curl http://localhost:9001/health  # Superset health
   curl http://localhost:5008/health  # MCP health
   ```
   
   **Test Tools via Claude Desktop:**
   
   Configure Claude Desktop with:
   ```json
   {
     "mcpServers": {
       "superset": {
         "command": "superset",
         "args": ["mcp", "run", "--port", "5008"]
       }
     }
   }
   ```
   
   Then test queries:
   - "Create a bar chart of sales by region from dataset 1"
   - "Show me the raw data from chart 5"
   - "Generate a preview of chart 10 in ASCII format"
   - "Update chart 3 to use a line chart instead"
   - "Execute SELECT * FROM cleaned_sales_data LIMIT 10"
   - "Open SQL Lab to explore the birth_names table"
   
   **Verify Chart Creation:**
   
   ```python
   # Test in Python shell with Flask app context
   from superset.mcp_service.chart.tool import generate_chart
   from superset.mcp_service.chart.schemas import (
       GenerateChartRequest,
       XYChartConfig,
       ColumnRef,
   )
   
   # Create a simple line chart
   request = GenerateChartRequest(
       dataset_id=1,  # Use an existing dataset ID
       config=XYChartConfig(
           chart_type="xy",
           kind="line",
           x=ColumnRef(name="ds"),  # Date column
           y=[ColumnRef(name="num", aggregate="SUM")],
       ),
       save_chart=True,
   )
   
   result = generate_chart.fn(request=request)
   print(f"Chart ID: {result.chart_id}")
   print(f"Preview URL: {result.preview_url}")
   print(f"Explore URL: {result.explore_url}")
   ```
   
   **Verify Screenshot Generation:**
   
   ```python
   from superset.mcp_service.chart.tool import get_chart_preview
   
   # Get PNG screenshot
   result = get_chart_preview.fn(
       identifier=1,  # Use an existing chart ID
       format="url",
       width=800,
       height=600,
   )
   print(f"Screenshot URL: {result.preview_url}")
   
   # Get ASCII art
   result = get_chart_preview.fn(identifier=1, format="ascii")
   print(result.ascii_chart)
   
   # Get tabular data
   result = get_chart_preview.fn(identifier=1, format="table")
   print(result.table_data)
   ```
   
   ### ADDITIONAL INFORMATION
   
   - [ ] Has associated issue:
   - [ ] Required feature flags: None
   - [ ] Changes UI: No
   - [ ] Includes DB Migration: No
   - [x] Introduces new feature or API: Yes - MCP chart creation, preview, SQL 
Lab, and explore tools
   - [ ] Removes existing feature or API: No
   
   **Stats:**
   - 53 files changed
   - 13,318 insertions, 37 deletions
   - 2,554 lines of tests (105 test cases: 72 chart + 25 SQL Lab + 8 explore)
   - 139 tests passing (28 SQL Lab tests need integration environment)
   - 100% mypy compliant
   - All pre-commit hooks passing
   
   **Configuration Requirements:**
   
   For screenshot generation to work, users must configure WebDriver in 
`superset_config.py`:
   
   ```python
   WEBDRIVER_BASEURL = "http://localhost:9001/";
   WEBDRIVER_BASEURL_USER_FRIENDLY = WEBDRIVER_BASEURL
   WEBDRIVER_TYPE = "chrome"  # or "firefox"
   WEBDRIVER_OPTION_ARGS = ["--headless", "--no-sandbox"]
   WEBDRIVER_WINDOW = {"width": 1280, "height": 800}
   ```
   
   **Future PRs will add:**
   - Dashboard creation tools (generate_dashboard, add_chart_to_dashboard)
   - Advanced SQL Lab features (query history, saved queries)
   - Chart template library
   - Bulk operations (create multiple charts, batch updates)
   - Export/import of chart configurations
   
   **Notes:**
   - Builds cleanly on PR #8 (dashboard and dataset tools)
   - Largest PR in MCP series (13K+ lines)
   - Adds full CRUD for charts
   - Production-ready validation and error handling
   - Screenshot system for visual previews
   - SQL Lab integration for data exploration
   - No database migrations
   - No UI changes
   - Optional dependencies (fastmcp, selenium in development requirements only)
   - All tools follow established patterns from PR #6 and #8
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat(mcp): PR4 - add advanced chart tools and SQL Lab/Explore integration [superset]

Reply via email to