codeant-ai-for-open-source[bot] commented on code in PR #36538:
URL: https://github.com/apache/superset/pull/36538#discussion_r2611370179
##########
superset/commands/importers/v1/utils.py:
##########
@@ -320,6 +321,49 @@ def import_tag(
return new_tag_ids
+def safe_insert_dashboard_chart_relationships(
+ dashboard_chart_ids: list[tuple[int, int]],
+) -> None:
+ """
+ Safely insert dashboard-chart relationships, handling duplicates.
+
+ This function checks for existing relationships and only inserts new ones
+ to avoid duplicate key constraint errors.
+ """
+ from sqlalchemy.sql import select
+
+ if not dashboard_chart_ids:
+ return
+
+ # Get all existing relationships
+ existing_relationships = db.session.execute(
+ select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id])
+ ).fetchall()
+    existing_relationships_set = {(row[0], row[1]) for row in existing_relationships}
+
+ # Filter out relationships that already exist
+ new_relationships = [
+ (dashboard_id, chart_id)
+ for dashboard_id, chart_id in dashboard_chart_ids
+ if (dashboard_id, chart_id) not in existing_relationships_set
+ ]
+
+ # Insert new relationships one by one to handle any edge cases
+ for dashboard_id, chart_id in new_relationships:
+ # Double-check that the relationship doesn't exist
+ exists = db.session.execute(
+ select([dashboard_slices.c.dashboard_id])
+ .where(dashboard_slices.c.dashboard_id == dashboard_id)
+ .where(dashboard_slices.c.slice_id == chart_id)
+ ).fetchone()
+
+ if not exists:
+ db.session.execute(
+ dashboard_slices.insert(),
+ {"dashboard_id": dashboard_id, "slice_id": chart_id},
Review Comment:
**Suggestion:** The current implementation issues an additional SELECT query
for each dashboard–chart pair in `new_relationships`, creating an N+1 query
pattern that can significantly slow down imports when many relationships are
inserted, even though duplicates have already been filtered using
`existing_relationships_set`. [performance]
**Severity Level:** Minor ⚠️
```suggestion
    # Insert new relationships in bulk, deduplicating to avoid unique constraint issues
    unique_new_relationships = {
        (dashboard_id, chart_id) for dashboard_id, chart_id in new_relationships
    }
if unique_new_relationships:
db.session.execute(
dashboard_slices.insert(),
[
{"dashboard_id": dashboard_id, "slice_id": chart_id}
for dashboard_id, chart_id in unique_new_relationships
],
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The suggestion correctly identifies an unnecessary per-row SELECT after
we've already filtered out existing relationships using
existing_relationships_set.
That extra check causes an N+1 pattern and is redundant in the common case,
hurting performance for large batches.
Replacing the loop with a single bulk insert of deduplicated new
relationships is a valid optimization and reduces DB round trips.
Edge case to consider (and why the change is still acceptable): another process could
insert the same row concurrently, between the SELECT and the bulk INSERT. That race
would surface as a unique constraint error, so the caller should be prepared to roll
back, or the code can catch IntegrityError and ignore the duplicate; a minimal sketch
follows after this section.
</details>
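
For readers who want the defensive variant mentioned above, here is a minimal sketch (not part of the PR) assuming the same `db.session` and `dashboard_slices` objects used in `superset/commands/importers/v1/utils.py`:

```python
from sqlalchemy.exc import IntegrityError


def bulk_insert_ignoring_duplicates(new_relationships: set[tuple[int, int]]) -> None:
    """Hypothetical helper: insert all pairs in one statement, tolerating a race."""
    if not new_relationships:
        return
    try:
        db.session.execute(  # `db` and `dashboard_slices` come from Superset (assumed)
            dashboard_slices.insert(),
            [
                {"dashboard_id": dashboard_id, "slice_id": chart_id}
                for dashboard_id, chart_id in new_relationships
            ],
        )
    except IntegrityError:
        # A concurrent import inserted the same row between the SELECT and this
        # INSERT; discard the failed statement and treat the duplicate as benign.
        db.session.rollback()
```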
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/importers/v1/utils.py
**Line:** 351:363
**Comment:**
*Performance: The current implementation issues an additional SELECT
query for each dashboard–chart pair in `new_relationships`, creating an N+1
query pattern that can significantly slow down imports when many relationships
are inserted, even though duplicates have already been filtered using
`existing_relationships_set`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/cli/examples.py:
##########
@@ -40,51 +40,53 @@ def load_examples_run(
# pylint: disable=import-outside-toplevel
import superset.examples.data_loading as examples
+ # Always load CSS templates
examples.load_css_templates()
- if load_test_data:
- logger.info("Loading energy related dataset")
- examples.load_energy(only_metadata, force)
-
- logger.info("Loading [World Bank's Health Nutrition and Population Stats]")
- examples.load_world_bank_health_n_pop(only_metadata, force)
+ # Auto-discover and load all datasets
+ for loader_name in dir(examples):
+ if not loader_name.startswith("load_"):
+ continue
- logger.info("Loading [Birth names]")
- examples.load_birth_names(only_metadata, force)
+ # Skip special loaders that aren't datasets
+ if loader_name in ["load_css_templates", "load_examples_from_configs"]:
+ continue
- if load_test_data:
- logger.info("Loading [Tabbed dashboard]")
- examples.load_tabbed_dashboard(only_metadata)
+ # Skip dashboards (loaded separately)
+        if loader_name in ["load_tabbed_dashboard", "load_supported_charts_dashboard"]:
+ continue
- logger.info("Loading [Supported Charts Dashboard]")
- examples.load_supported_charts_dashboard()
- else:
- logger.info("Loading [Random long/lat data]")
- examples.load_long_lat_data(only_metadata, force)
+ # Skip test/big data if not requested
+ if loader_name == "load_big_data" and not load_big_data:
Review Comment:
**Suggestion:** When `only_metadata` is set, the big data loader is still
executed if `load_big_data` is true, which contradicts the "only load metadata,
skip actual data" semantics and generates large synthetic tables even in
metadata-only mode; `load_big_data` has no way to honor `only_metadata`, so it
should be skipped entirely when that flag is active. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
# Skip big data if not requested or when only metadata is requested
            if loader_name == "load_big_data" and (not load_big_data or only_metadata):
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The click help and the `only_metadata` flag semantics ("Only load metadata,
skip actual data") mean metadata-only mode should override other flags that
cause heavy data generation. As written, if a user passes both
`--only-metadata` and `--load-big-data`, the `load_big_data` loader will run
(unless that loader itself accepts and honors `only_metadata`), which is
surprising and can cause unnecessary large data creation. Guarding the loader
with `or only_metadata` is the correct, simple precedence fix: metadata-only
should trump big-data generation. This is a real logic bug, not mere style.
</details>
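
To make the intended precedence concrete, here is a small, hypothetical helper (names invented for illustration, not code from the PR) that encodes the rule the comment argues for:

```python
def should_run_loader(loader_name: str, load_big_data: bool, only_metadata: bool) -> bool:
    """Return True if an example loader should execute under the given CLI flags."""
    if loader_name == "load_big_data":
        # Metadata-only mode always suppresses the synthetic big-data loader,
        # even when --load-big-data was passed explicitly.
        return load_big_data and not only_metadata
    return True


# --only-metadata overrides --load-big-data:
assert should_run_loader("load_big_data", load_big_data=True, only_metadata=True) is False
assert should_run_loader("load_big_data", load_big_data=True, only_metadata=False) is True
assert should_run_loader("load_birth_names", load_big_data=False, only_metadata=True) is True
```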
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/cli/examples.py
**Line:** 59:60
**Comment:**
*Logic Error: When `only_metadata` is set, the big data loader is still
executed if `load_big_data` is true, which contradicts the "only load metadata,
skip actual data" semantics and generates large synthetic tables even in
metadata-only mode; `load_big_data` has no way to honor `only_metadata`, so it
should be skipped entirely when that flag is active.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/generic_loader.py:
##########
@@ -0,0 +1,223 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Generic DuckDB example data loader."""
+
+import logging
+from typing import Any, Callable, Optional
+
+import numpy as np
+from sqlalchemy import inspect
+
+from superset import db
+from superset.connectors.sqla.models import SqlaTable
+from superset.examples.helpers import read_example_data
+from superset.models.core import Database
+from superset.sql.parse import Table
+from superset.utils import json
+from superset.utils.database import get_example_database
+
+logger = logging.getLogger(__name__)
+
+
+def serialize_numpy_arrays(obj: Any) -> Any: # noqa: C901
+ """Convert numpy arrays to JSON-serializable format."""
+ if isinstance(obj, np.ndarray):
+ return obj.tolist()
+ elif isinstance(obj, np.generic):
+ # Handle numpy scalar types
+ return obj.item()
+ elif isinstance(obj, (list, tuple)):
+ return [serialize_numpy_arrays(item) for item in obj]
+ elif isinstance(obj, dict):
+ return {key: serialize_numpy_arrays(val) for key, val in obj.items()}
+ return obj
+
+
+def load_duckdb_table( # noqa: C901
+ duckdb_file: str,
+ table_name: str,
+ database: Optional[Database] = None,
+ only_metadata: bool = False,
+ force: bool = False,
+ sample_rows: Optional[int] = None,
+) -> SqlaTable:
+ """Load a DuckDB table into the example database.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (e.g., "birth_names")
+ table_name: Name for the table in the target database
+ database: Target database (defaults to example database)
+ only_metadata: If True, only create metadata without loading data
+ force: If True, replace existing table
+ sample_rows: If specified, only load this many rows
+
+ Returns:
+ The created SqlaTable object
+ """
+ if database is None:
+ database = get_example_database()
+
+ # Check if table exists
+ with database.get_sqla_engine() as engine:
+ schema = inspect(engine).default_schema_name
+
+ table_exists = database.has_table(Table(table_name, schema=schema))
+ if table_exists and not force:
+ logger.info("Table %s already exists, skipping data load", table_name)
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+ if tbl:
+ return tbl
+
+ # Load data if not metadata only
+ if not only_metadata:
+        logger.info("Loading data for %s from %s.duckdb", table_name, duckdb_file)
+
+ # Read from DuckDB
+ pdf = read_example_data(f"examples://{duckdb_file}")
+
+ # Sample if requested
+ if sample_rows:
+ pdf = pdf.head(sample_rows)
+
+        # Check for columns with complex types (numpy arrays, nested structures)
+ for col in pdf.columns:
+            # Check if any value in the column is a numpy array or nested structure
+ if pdf[col].dtype == object:
+ try:
+ # Check if the first non-null value is complex
+ sample_val = (
+ pdf[col].dropna().iloc[0]
+ if not pdf[col].dropna().empty
+ else None
+ )
+ if sample_val is not None and isinstance(
+ sample_val, (np.ndarray, list, dict)
+ ):
+                        logger.info("Converting complex column %s to JSON string", col)
+
+ # Convert to JSON string for database storage
+ def safe_serialize(
+ x: Any, column_name: str = col
+ ) -> Optional[str]:
+ if x is None:
+ return None
+ try:
+ return json.dumps(serialize_numpy_arrays(x))
+ except (TypeError, ValueError) as e:
+ logger.warning(
+                                    "Failed to serialize value in column %s: %s",
+ column_name,
+ e,
+ )
+ # Convert to string representation as fallback
+ return str(x)
+
+                        pdf[col] = pdf[col].apply(lambda x, c=col: safe_serialize(x, c))
+ except Exception as e:
+ logger.warning("Could not process column %s: %s", col, e)
+
+ # Write to target database
+ with database.get_sqla_engine() as engine:
+ pdf.to_sql(
+ table_name,
+ engine,
+ schema=schema,
+ if_exists="replace",
+ chunksize=500,
+ method="multi",
+ index=False,
+ )
+
+ logger.info("Loaded %d rows into %s", len(pdf), table_name)
+
+ # Create or update SqlaTable metadata
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+
+ if not tbl:
+ tbl = SqlaTable(table_name=table_name, database_id=database.id)
+ # Set the database reference
+ tbl.database = database
+
+ if not only_metadata:
+ # Ensure database reference is set before fetching metadata
+ if not tbl.database:
+ tbl.database = database
+ tbl.fetch_metadata()
+
+ db.session.merge(tbl)
+ db.session.commit()
+
+ return tbl
+
+
+def create_generic_loader(
+ duckdb_file: str,
+ table_name: Optional[str] = None,
+ description: Optional[str] = None,
+ sample_rows: Optional[int] = None,
+) -> Callable[[Database, SqlaTable], None]:
+ """Create a loader function for a specific DuckDB file.
+
+ This factory function creates loaders that match the existing pattern
+ used by Superset examples.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (without .duckdb extension)
+ table_name: Table name (defaults to duckdb_file)
+ description: Description for the dataset
+ sample_rows: Default number of rows to sample
+
+ Returns:
+ A loader function with the standard signature
+ """
+ if table_name is None:
+ table_name = duckdb_file
+
+ def loader(
+ only_metadata: bool = False,
+ force: bool = False,
+ sample: bool = False,
+ ) -> None:
+ """Load the dataset."""
+ rows = sample_rows if sample and sample_rows else None
+
+ tbl = load_duckdb_table(
+ duckdb_file=duckdb_file,
+ table_name=table_name,
+ only_metadata=only_metadata,
+ force=force,
Review Comment:
**Suggestion:** In the loader factory, the computation of `rows` also relies
on the truthiness of `sample_rows`, so a value of `0` will be treated as "no
sampling" and result in full data load, which contradicts the documented "only
load this many rows" behavior. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
            rows = sample_rows if sample and sample_rows is not None else None
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The factory's loader uses the same truthiness pattern. If sample_rows is 0,
"sample and sample_rows" is falsy and rows becomes None, causing
load_duckdb_table to ignore sampling. The suggested change to check
"sample_rows is not None" correctly treats 0 as an explicit sampling request
and aligns behavior with the documented intent.
</details>
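
A quick illustration of the difference, using a hypothetical standalone helper rather than the PR's `loader` closure:

```python
from typing import Optional


def resolve_sample_rows(sample: bool, sample_rows: Optional[int]) -> Optional[int]:
    """Return None for no sampling, or the requested row count (including 0)."""
    return sample_rows if sample and sample_rows is not None else None


# The truthiness version would map sample_rows=0 to None and load everything;
# the `is not None` version preserves 0 as an explicit request.
assert resolve_sample_rows(sample=True, sample_rows=0) == 0
assert resolve_sample_rows(sample=True, sample_rows=None) is None
assert resolve_sample_rows(sample=False, sample_rows=100) is None
```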
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 210:210
**Comment:**
*Logic Error: In the loader factory, the computation of `rows` also
relies on the truthiness of `sample_rows`, so a value of `0` will be treated as
"no sampling" and result in full data load, which contradicts the documented
"only load this many rows" behavior.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
+ path = url
+
+ # Ensure .duckdb extension
+ if not path.endswith(".duckdb"):
+ path = f"{path}.duckdb"
- # Not an examples URL, return unchanged
- return url
+ # Build the full file path
+ full_path = os.path.join(get_examples_folder(), "data", path)
+
+ # Convert to file:// URL for schema validation
+ # This will pass URL validation and still work with our loader
+ return f"file://{full_path}"
def read_example_data(
filepath: str,
- max_attempts: int = 5,
- wait_seconds: float = 60,
+ table_name: str | None = None,
**kwargs: Any,
) -> pd.DataFrame:
- """Load CSV or JSON from example data mirror with retry/backoff."""
- url = normalize_example_data_url(filepath)
- is_json = filepath.endswith(".json") or filepath.endswith(".json.gz")
-
- for attempt in range(1, max_attempts + 1):
- try:
- if is_json:
- return pd.read_json(url, **kwargs)
- return pd.read_csv(url, **kwargs)
- except HTTPError:
- if attempt < max_attempts:
- sleep_time = wait_seconds * (2 ** (attempt - 1))
- print(
- f"HTTP 429 received from {url}. ",
- f"Retrying in {sleep_time:.1f}s ",
- f"(attempt {attempt}/{max_attempts})...",
- )
- time.sleep(sleep_time)
- else:
- raise
+ """Load data from local DuckDB files.
+
+ Args:
+        filepath: Path to the DuckDB file (e.g., "examples://birth_names.duckdb" or "file:///path/to/file.duckdb")
+        table_name: Table to read from DuckDB (defaults to filename without extension)
+ **kwargs: Ignored (kept for backward compatibility)
+
+ Returns:
+ DataFrame with the loaded data
+ """
+ import os
+
+ import duckdb
+
+ # Handle different URL formats
+ if filepath.startswith(EXAMPLES_PROTOCOL):
+ # examples:// protocol
+ relative_path = filepath[len(EXAMPLES_PROTOCOL) :]
+ if not relative_path.endswith(".duckdb"):
+ relative_path = f"{relative_path}.duckdb"
+ local_path = os.path.join(get_examples_folder(), "data", relative_path)
+ elif filepath.startswith("file://"):
+ # file:// protocol - extract the actual path
+ local_path = filepath[7:] # Remove "file://"
+ else:
+ # Assume it's a relative path
+ relative_path = filepath
+ if not relative_path.endswith(".duckdb"):
+ relative_path = f"{relative_path}.duckdb"
+ local_path = os.path.join(get_examples_folder(), "data", relative_path)
+
+ if not os.path.exists(local_path):
+ raise FileNotFoundError(f"Example data file not found: {local_path}")
+
+ # Determine table name if not provided
+ if table_name is None:
+ base_name = os.path.basename(local_path)
+ table_name = os.path.splitext(base_name)[0]
+
+ # Connect and read from DuckDB
+ conn = duckdb.connect(local_path, read_only=True)
+ try:
+ df = conn.execute(f"SELECT * FROM {table_name}").df() # noqa: S608
+ return df
Review Comment:
**Suggestion:** The DuckDB query in the example data reader interpolates
`table_name` directly into the SQL string without quoting or escaping, which
can lead to SQL injection or syntax errors if the table name contains special
characters and is ever derived from untrusted input. [security]
**Severity Level:** Critical 🚨
```suggestion
# Safely quote the table name to avoid SQL injection issues
safe_table_name = table_name.replace('"', '""')
df = conn.execute(f'SELECT * FROM "{safe_table_name}"').df()
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
Directly injecting table_name into the SQL string can cause syntax errors or
SQL injection if the name ever comes from untrusted input or contains special
characters. Safely quoting the identifier (escaping double quotes and wrapping
the identifier) is the correct mitigation because SQL parameters can't be used
for identifiers; this is a real security/correctness improvement.
</details>
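
A runnable sketch of the quoting approach against an in-memory DuckDB database (assumes the `duckdb` and `pandas` packages; the table name is invented for the demo). Since bind parameters cannot be used for identifiers, doubling any embedded double quotes and wrapping the name is the standard escape:

```python
import duckdb
import pandas as pd


def read_table(conn: duckdb.DuckDBPyConnection, table_name: str) -> pd.DataFrame:
    """Read an entire table, quoting the identifier to neutralize special characters."""
    safe_table_name = table_name.replace('"', '""')  # escape embedded double quotes
    return conn.execute(f'SELECT * FROM "{safe_table_name}"').df()


conn = duckdb.connect(":memory:")
conn.execute('CREATE TABLE "birth names" (name VARCHAR, num INTEGER)')
conn.execute("INSERT INTO \"birth names\" VALUES ('Ada', 1)")
print(read_table(conn, "birth names"))  # works even with a space in the name
```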
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 176:176
**Comment:**
*Security: The DuckDB query in the example data reader interpolates
`table_name` directly into the SQL string without quoting or escaping, which
can lead to SQL injection or syntax errors if the table name contains special
characters and is ever derived from untrusted input.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
+ path = url
+
+ # Ensure .duckdb extension
+ if not path.endswith(".duckdb"):
+ path = f"{path}.duckdb"
- # Not an examples URL, return unchanged
- return url
+ # Build the full file path
+ full_path = os.path.join(get_examples_folder(), "data", path)
Review Comment:
**Suggestion:** Building a `file://` URL by directly joining the
user-controlled path segment with `get_examples_folder()/data` allows directory
traversal (for example using `..` segments) so that a crafted `examples://`
value could escape the intended examples directory and access arbitrary files
when later fetched via `urlopen`. [security]
**Severity Level:** Critical 🚨
```suggestion
    # Normalize the path and prevent directory traversal outside examples/data
normalized_path = os.path.normpath(path).lstrip(os.sep)
if normalized_path.startswith(".."):
raise ValueError("Invalid example data path")
# Build the full file path
full_path = os.path.join(get_examples_folder(), "data", normalized_path)
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The code concatenates user-controlled path segments into the examples data
path without normalizing or rejecting ".." segments. That can allow a crafted
examples:// value to resolve to files outside the intended directory.
Normalizing and validating the path (or rejecting paths with traversal) is a
real security hardening and prevents unintended file access.
</details>
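
One way to implement the containment check, sketched with `os.path.realpath` so symlinks are resolved too (the paths below are illustrative; this is not the suggested diff verbatim):

```python
import os


def resolve_example_path(base_dir: str, relative_path: str) -> str:
    """Resolve relative_path under base_dir, rejecting anything that escapes it."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, relative_path))
    # commonpath() raises ValueError on mismatched drives (Windows); that is
    # also a rejection case for our purposes.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"Invalid example data path: {relative_path!r}")
    return candidate


base = "/tmp/superset-examples/data"  # hypothetical examples folder
print(resolve_example_path(base, "birth_names.duckdb"))
try:
    resolve_example_path(base, "../../etc/passwd")
except ValueError as exc:
    print(exc)
```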
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 120:121
**Comment:**
*Security: Building a `file://` URL by directly joining the
user-controlled path segment with `get_examples_folder()/data` allows directory
traversal (for example using `..` segments) so that a crafted `examples://`
value could escape the intended examples directory and access arbitrary files
when later fetched via `urlopen`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
Review Comment:
**Suggestion:** The URL normalization helper currently rewrites any
non-`examples://` and non-`file://` URL (including valid HTTP/S or other remote
URIs) into a local `file://` DuckDB path, which will break generic dataset
imports that rely on remote CSV/URL-based `data` values once
`normalize_example_data_url` is used in the dataset importer. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
elif "://" in url:
        # Non-example absolute URLs (http, https, s3, etc.) are left unchanged
return url
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current implementation treats any non-file, non-examples URL as a local
examples path, so an input like "https://..." or "s3://..." would be turned
into a local file:// path and break remote dataset imports. The suggested guard
for other schemes (detecting :// and returning unchanged) fixes a real logic
bug and preserves expected behavior for remote URLs.
</details>
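
An alternative to the substring check, sketched with `urllib.parse` (the `EXAMPLES_PROTOCOL` value is assumed to be `examples://`, matching `helpers.py`):

```python
from urllib.parse import urlparse

EXAMPLES_PROTOCOL = "examples://"  # assumed to match superset/examples/helpers.py


def is_remote_url(url: str) -> bool:
    """True for absolute URLs whose scheme is neither examples:// nor file://."""
    if url.startswith(EXAMPLES_PROTOCOL):
        return False
    # Bare names like "birth_names" parse with an empty scheme.
    return urlparse(url).scheme not in ("", "file")


assert is_remote_url("https://example.com/data.csv")
assert is_remote_url("s3://bucket/key.csv")
assert not is_remote_url("examples://birth_names")
assert not is_remote_url("file:///tmp/birth_names.duckdb")
assert not is_remote_url("birth_names.duckdb")
```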
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 113:113
**Comment:**
*Logic Error: The URL normalization helper currently rewrites any
non-`examples://` and non-`file://` URL (including valid HTTP/S or other remote
URIs) into a local `file://` DuckDB path, which will break generic dataset
imports that rely on remote CSV/URL-based `data` values once
`normalize_example_data_url` is used in the dataset importer.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/commands/importers/v1/utils.py:
##########
@@ -320,6 +321,49 @@ def import_tag(
return new_tag_ids
+def safe_insert_dashboard_chart_relationships(
+ dashboard_chart_ids: list[tuple[int, int]],
+) -> None:
+ """
+ Safely insert dashboard-chart relationships, handling duplicates.
+
+ This function checks for existing relationships and only inserts new ones
+ to avoid duplicate key constraint errors.
+ """
+ from sqlalchemy.sql import select
+
+ if not dashboard_chart_ids:
+ return
+
+ # Get all existing relationships
+ existing_relationships = db.session.execute(
+ select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id])
Review Comment:
**Suggestion:** `safe_insert_dashboard_chart_relationships` currently loads
all rows from the `dashboard_slices` table into memory, which is unnecessary
and can become a serious performance and memory issue in large installations
when only a subset of dashboards is being updated. [performance]
**Severity Level:** Minor ⚠️
```suggestion
# Get existing relationships only for dashboards being updated
dashboard_ids = {dashboard_id for dashboard_id, _ in dashboard_chart_ids}
existing_relationships = db.session.execute(
        select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id]).where(
dashboard_slices.c.dashboard_id.in_(dashboard_ids)
)
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current code unconditionally loads every row from dashboard_slices into memory,
which is wasteful and unnecessary when we are only inserting relationships for a
specific set of dashboards.
Scoping the SELECT to the dashboard IDs being updated
(dashboard_slices.c.dashboard_id.in_(dashboard_ids)) is a straightforward, correct
improvement that reduces memory use and query cost.
</details>
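
A self-contained illustration of the scoped lookup, run against an in-memory SQLite table shaped like `dashboard_slices` (this uses the modern SQLAlchemy 2.0-style `select()`; the list-style call in the diff is the older 1.x API):

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, select

metadata = MetaData()
dashboard_slices = Table(  # same shape as Superset's association table (illustrative)
    "dashboard_slices",
    metadata,
    Column("dashboard_id", Integer),
    Column("slice_id", Integer),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)

dashboard_chart_ids = [(1, 10), (1, 11), (2, 20)]

with engine.begin() as conn:
    conn.execute(dashboard_slices.insert(), [{"dashboard_id": 1, "slice_id": 10}])

    # Only fetch existing rows for the dashboards we are about to touch.
    dashboard_ids = {dashboard_id for dashboard_id, _ in dashboard_chart_ids}
    existing = {
        (row.dashboard_id, row.slice_id)
        for row in conn.execute(
            select(dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id).where(
                dashboard_slices.c.dashboard_id.in_(dashboard_ids)
            )
        )
    }
    print(existing)  # {(1, 10)}
```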
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/importers/v1/utils.py
**Line:** 338:340
**Comment:**
*Performance: `safe_insert_dashboard_chart_relationships` currently
loads all rows from the `dashboard_slices` table into memory, which is
unnecessary and can become a serious performance and memory issue in large
installations when only a subset of dashboards is being updated.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/generic_loader.py:
##########
@@ -0,0 +1,223 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Generic DuckDB example data loader."""
+
+import logging
+from typing import Any, Callable, Optional
+
+import numpy as np
+from sqlalchemy import inspect
+
+from superset import db
+from superset.connectors.sqla.models import SqlaTable
+from superset.examples.helpers import read_example_data
+from superset.models.core import Database
+from superset.sql.parse import Table
+from superset.utils import json
+from superset.utils.database import get_example_database
+
+logger = logging.getLogger(__name__)
+
+
+def serialize_numpy_arrays(obj: Any) -> Any: # noqa: C901
+ """Convert numpy arrays to JSON-serializable format."""
+ if isinstance(obj, np.ndarray):
+ return obj.tolist()
+ elif isinstance(obj, np.generic):
+ # Handle numpy scalar types
+ return obj.item()
+ elif isinstance(obj, (list, tuple)):
+ return [serialize_numpy_arrays(item) for item in obj]
+ elif isinstance(obj, dict):
+ return {key: serialize_numpy_arrays(val) for key, val in obj.items()}
+ return obj
+
+
+def load_duckdb_table( # noqa: C901
+ duckdb_file: str,
+ table_name: str,
+ database: Optional[Database] = None,
+ only_metadata: bool = False,
+ force: bool = False,
+ sample_rows: Optional[int] = None,
+) -> SqlaTable:
+ """Load a DuckDB table into the example database.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (e.g., "birth_names")
+ table_name: Name for the table in the target database
+ database: Target database (defaults to example database)
+ only_metadata: If True, only create metadata without loading data
+ force: If True, replace existing table
+ sample_rows: If specified, only load this many rows
+
+ Returns:
+ The created SqlaTable object
+ """
+ if database is None:
+ database = get_example_database()
+
+ # Check if table exists
+ with database.get_sqla_engine() as engine:
+ schema = inspect(engine).default_schema_name
+
+ table_exists = database.has_table(Table(table_name, schema=schema))
+ if table_exists and not force:
+ logger.info("Table %s already exists, skipping data load", table_name)
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+ if tbl:
+ return tbl
+
+ # Load data if not metadata only
+ if not only_metadata:
+        logger.info("Loading data for %s from %s.duckdb", table_name, duckdb_file)
+
+ # Read from DuckDB
+ pdf = read_example_data(f"examples://{duckdb_file}")
+
+ # Sample if requested
+ if sample_rows:
Review Comment:
**Suggestion:** The `sample_rows` condition uses a truthiness check, so
passing `sample_rows=0` (which is a valid "load zero rows" value per the
docstring) will be treated as if no sampling was requested and will load the
full dataset instead of zero rows. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
if sample_rows is not None:
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current code uses a truthiness check on sample_rows. If a caller
intentionally passes 0 (meaning "load zero rows"), the condition evaluates to
False and the full dataframe is left unchanged. That's a real logic bug because
the docstring states "If specified, only load this many rows" — 0 is a valid
specified integer. Changing the check to "is not None" preserves the
distinction between "not specified" (None) and "specified as 0".
</details>
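
For what it is worth, pandas already treats 0 as a meaningful sample size, so honoring it is cheap; a tiny demonstration (illustrative data only):

```python
import pandas as pd

pdf = pd.DataFrame({"name": ["Ada", "Grace"], "num": [1, 2]})

# head(0) returns an empty frame with the schema intact, which is what a
# sample_rows=0 caller would reasonably expect.
sampled = pdf.head(0)
print(len(sampled))           # 0
print(list(sampled.columns))  # ['name', 'num']

# A truthiness check (`if sample_rows:`) never reaches head(0) and loads all rows.
```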
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 97:97
**Comment:**
*Logic Error: The `sample_rows` condition uses a truthiness check, so
passing `sample_rows=0` (which is a valid "load zero rows" value per the
docstring) will be treated as if no sampling was requested and will load the
full dataset instead of zero rows.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]