codeant-ai-for-open-source[bot] commented on code in PR #36538:
URL: https://github.com/apache/superset/pull/36538#discussion_r2611370179
##########
superset/commands/importers/v1/utils.py:
##########
@@ -320,6 +321,49 @@ def import_tag(
return new_tag_ids
+def safe_insert_dashboard_chart_relationships(
+ dashboard_chart_ids: list[tuple[int, int]],
+) -> None:
+ """
+ Safely insert dashboard-chart relationships, handling duplicates.
+
+ This function checks for existing relationships and only inserts new ones
+ to avoid duplicate key constraint errors.
+ """
+ from sqlalchemy.sql import select
+
+ if not dashboard_chart_ids:
+ return
+
+ # Get all existing relationships
+ existing_relationships = db.session.execute(
+ select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id])
+ ).fetchall()
+    existing_relationships_set = {(row[0], row[1]) for row in existing_relationships}
+
+ # Filter out relationships that already exist
+ new_relationships = [
+ (dashboard_id, chart_id)
+ for dashboard_id, chart_id in dashboard_chart_ids
+ if (dashboard_id, chart_id) not in existing_relationships_set
+ ]
+
+ # Insert new relationships one by one to handle any edge cases
+ for dashboard_id, chart_id in new_relationships:
+ # Double-check that the relationship doesn't exist
+ exists = db.session.execute(
+ select([dashboard_slices.c.dashboard_id])
+ .where(dashboard_slices.c.dashboard_id == dashboard_id)
+ .where(dashboard_slices.c.slice_id == chart_id)
+ ).fetchone()
+
+ if not exists:
+ db.session.execute(
+ dashboard_slices.insert(),
+ {"dashboard_id": dashboard_id, "slice_id": chart_id},
Review Comment:
**Suggestion:** The current implementation issues an additional SELECT query
for each dashboard–chart pair in `new_relationships`, creating an N+1 query
pattern that can significantly slow down imports when many relationships are
inserted, even though duplicates have already been filtered using
`existing_relationships_set`. [performance]
**Severity Level:** Minor ⚠️
```suggestion
    # Insert new relationships in bulk, deduplicating to avoid unique constraint issues
    unique_new_relationships = {
        (dashboard_id, chart_id) for dashboard_id, chart_id in new_relationships
    }
if unique_new_relationships:
db.session.execute(
dashboard_slices.insert(),
[
{"dashboard_id": dashboard_id, "slice_id": chart_id}
for dashboard_id, chart_id in unique_new_relationships
],
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The suggestion correctly identifies an unnecessary per-row SELECT after
we've already filtered out existing relationships using
existing_relationships_set.
That extra check causes an N+1 pattern and is redundant in the common case,
hurting performance for large batches.
Replacing the loop with a single bulk insert of deduplicated new
relationships is a valid optimization and reduces DB round trips.
Edge case to consider (and why the change is still acceptable): another process could
insert the same row concurrently, between the SELECT and the bulk INSERT. That race
would surface as a unique constraint error, so the caller should be prepared to roll
back, or the code can catch IntegrityError and ignore the duplicate; a minimal sketch
follows after this section.
</details>
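
For readers who want the defensive variant mentioned above, here is a minimal sketch (not part of the PR) assuming the same `db.session` and `dashboard_slices` objects used in `superset/commands/importers/v1/utils.py`:

```python
from sqlalchemy.exc import IntegrityError


def bulk_insert_ignoring_duplicates(new_relationships: set[tuple[int, int]]) -> None:
    """Hypothetical helper: insert all pairs in one statement, tolerating a race."""
    if not new_relationships:
        return
    try:
        db.session.execute(  # `db` and `dashboard_slices` come from Superset (assumed)
            dashboard_slices.insert(),
            [
                {"dashboard_id": dashboard_id, "slice_id": chart_id}
                for dashboard_id, chart_id in new_relationships
            ],
        )
    except IntegrityError:
        # A concurrent import inserted the same row between the SELECT and this
        # INSERT; discard the failed statement and treat the duplicate as benign.
        db.session.rollback()
```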
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/importers/v1/utils.py
**Line:** 351:363
**Comment:**
*Performance: The current implementation issues an additional SELECT
query for each dashboard–chart pair in `new_relationships`, creating an N+1
query pattern that can significantly slow down imports when many relationships
are inserted, even though duplicates have already been filtered using
`existing_relationships_set`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/cli/examples.py:
##########
@@ -40,51 +40,53 @@ def load_examples_run(
# pylint: disable=import-outside-toplevel
import superset.examples.data_loading as examples
+ # Always load CSS templates
examples.load_css_templates()
- if load_test_data:
- logger.info("Loading energy related dataset")
- examples.load_energy(only_metadata, force)
-
- logger.info("Loading [World Bank's Health Nutrition and Population Stats]")
- examples.load_world_bank_health_n_pop(only_metadata, force)
+ # Auto-discover and load all datasets
+ for loader_name in dir(examples):
+ if not loader_name.startswith("load_"):
+ continue
- logger.info("Loading [Birth names]")
- examples.load_birth_names(only_metadata, force)
+ # Skip special loaders that aren't datasets
+ if loader_name in ["load_css_templates", "load_examples_from_configs"]:
+ continue
- if load_test_data:
- logger.info("Loading [Tabbed dashboard]")
- examples.load_tabbed_dashboard(only_metadata)
+ # Skip dashboards (loaded separately)
+        if loader_name in ["load_tabbed_dashboard", "load_supported_charts_dashboard"]:
+ continue
- logger.info("Loading [Supported Charts Dashboard]")
- examples.load_supported_charts_dashboard()
- else:
- logger.info("Loading [Random long/lat data]")
- examples.load_long_lat_data(only_metadata, force)
+ # Skip test/big data if not requested
+ if loader_name == "load_big_data" and not load_big_data:
Review Comment:
**Suggestion:** When `only_metadata` is set, the big data loader is still
executed if `load_big_data` is true, which contradicts the "only load metadata,
skip actual data" semantics and generates large synthetic tables even in
metadata-only mode; `load_big_data` has no way to honor `only_metadata`, so it
should be skipped entirely when that flag is active. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
# Skip big data if not requested or when only metadata is requested
            if loader_name == "load_big_data" and (not load_big_data or only_metadata):
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The click help and the `only_metadata` flag semantics ("Only load metadata,
skip actual data") mean metadata-only mode should override other flags that
cause heavy data generation. As written, if a user passes both
`--only-metadata` and `--load-big-data`, the `load_big_data` loader will run
(unless that loader itself accepts and honors `only_metadata`), which is
surprising and can cause unnecessary large data creation. Guarding the loader
with `or only_metadata` is the correct, simple precedence fix: metadata-only
should trump big-data generation. This is a real logic bug, not mere style.
</details>
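
To make the intended precedence concrete, here is a small, hypothetical helper (names invented for illustration, not code from the PR) that encodes the rule the comment argues for:

```python
def should_run_loader(loader_name: str, load_big_data: bool, only_metadata: bool) -> bool:
    """Return True if an example loader should execute under the given CLI flags."""
    if loader_name == "load_big_data":
        # Metadata-only mode always suppresses the synthetic big-data loader,
        # even when --load-big-data was passed explicitly.
        return load_big_data and not only_metadata
    return True


# --only-metadata overrides --load-big-data:
assert should_run_loader("load_big_data", load_big_data=True, only_metadata=True) is False
assert should_run_loader("load_big_data", load_big_data=True, only_metadata=False) is True
assert should_run_loader("load_birth_names", load_big_data=False, only_metadata=True) is True
```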
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/cli/examples.py
**Line:** 59:60
**Comment:**
*Logic Error: When `only_metadata` is set, the big data loader is still
executed if `load_big_data` is true, which contradicts the "only load metadata,
skip actual data" semantics and generates large synthetic tables even in
metadata-only mode; `load_big_data` has no way to honor `only_metadata`, so it
should be skipped entirely when that flag is active.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/generic_loader.py:
##########
@@ -0,0 +1,223 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Generic DuckDB example data loader."""
+
+import logging
+from typing import Any, Callable, Optional
+
+import numpy as np
+from sqlalchemy import inspect
+
+from superset import db
+from superset.connectors.sqla.models import SqlaTable
+from superset.examples.helpers import read_example_data
+from superset.models.core import Database
+from superset.sql.parse import Table
+from superset.utils import json
+from superset.utils.database import get_example_database
+
+logger = logging.getLogger(__name__)
+
+
+def serialize_numpy_arrays(obj: Any) -> Any: # noqa: C901
+ """Convert numpy arrays to JSON-serializable format."""
+ if isinstance(obj, np.ndarray):
+ return obj.tolist()
+ elif isinstance(obj, np.generic):
+ # Handle numpy scalar types
+ return obj.item()
+ elif isinstance(obj, (list, tuple)):
+ return [serialize_numpy_arrays(item) for item in obj]
+ elif isinstance(obj, dict):
+ return {key: serialize_numpy_arrays(val) for key, val in obj.items()}
+ return obj
+
+
+def load_duckdb_table( # noqa: C901
+ duckdb_file: str,
+ table_name: str,
+ database: Optional[Database] = None,
+ only_metadata: bool = False,
+ force: bool = False,
+ sample_rows: Optional[int] = None,
+) -> SqlaTable:
+ """Load a DuckDB table into the example database.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (e.g., "birth_names")
+ table_name: Name for the table in the target database
+ database: Target database (defaults to example database)
+ only_metadata: If True, only create metadata without loading data
+ force: If True, replace existing table
+ sample_rows: If specified, only load this many rows
+
+ Returns:
+ The created SqlaTable object
+ """
+ if database is None:
+ database = get_example_database()
+
+ # Check if table exists
+ with database.get_sqla_engine() as engine:
+ schema = inspect(engine).default_schema_name
+
+ table_exists = database.has_table(Table(table_name, schema=schema))
+ if table_exists and not force:
+ logger.info("Table %s already exists, skipping data load", table_name)
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+ if tbl:
+ return tbl
+
+ # Load data if not metadata only
+ if not only_metadata:
+        logger.info("Loading data for %s from %s.duckdb", table_name, duckdb_file)
+
+ # Read from DuckDB
+ pdf = read_example_data(f"examples://{duckdb_file}")
+
+ # Sample if requested
+ if sample_rows:
+ pdf = pdf.head(sample_rows)
+
+        # Check for columns with complex types (numpy arrays, nested structures)
+ for col in pdf.columns:
+            # Check if any value in the column is a numpy array or nested structure
+ if pdf[col].dtype == object:
+ try:
+ # Check if the first non-null value is complex
+ sample_val = (
+ pdf[col].dropna().iloc[0]
+ if not pdf[col].dropna().empty
+ else None
+ )
+ if sample_val is not None and isinstance(
+ sample_val, (np.ndarray, list, dict)
+ ):
+                        logger.info("Converting complex column %s to JSON string", col)
+
+ # Convert to JSON string for database storage
+ def safe_serialize(
+ x: Any, column_name: str = col
+ ) -> Optional[str]:
+ if x is None:
+ return None
+ try:
+ return json.dumps(serialize_numpy_arrays(x))
+ except (TypeError, ValueError) as e:
+ logger.warning(
+                                    "Failed to serialize value in column %s: %s",
+ column_name,
+ e,
+ )
+ # Convert to string representation as fallback
+ return str(x)
+
+                        pdf[col] = pdf[col].apply(lambda x, c=col: safe_serialize(x, c))
+ except Exception as e:
+ logger.warning("Could not process column %s: %s", col, e)
+
+ # Write to target database
+ with database.get_sqla_engine() as engine:
+ pdf.to_sql(
+ table_name,
+ engine,
+ schema=schema,
+ if_exists="replace",
+ chunksize=500,
+ method="multi",
+ index=False,
+ )
+
+ logger.info("Loaded %d rows into %s", len(pdf), table_name)
+
+ # Create or update SqlaTable metadata
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+
+ if not tbl:
+ tbl = SqlaTable(table_name=table_name, database_id=database.id)
+ # Set the database reference
+ tbl.database = database
+
+ if not only_metadata:
+ # Ensure database reference is set before fetching metadata
+ if not tbl.database:
+ tbl.database = database
+ tbl.fetch_metadata()
+
+ db.session.merge(tbl)
+ db.session.commit()
+
+ return tbl
+
+
+def create_generic_loader(
+ duckdb_file: str,
+ table_name: Optional[str] = None,
+ description: Optional[str] = None,
+ sample_rows: Optional[int] = None,
+) -> Callable[[Database, SqlaTable], None]:
+ """Create a loader function for a specific DuckDB file.
+
+ This factory function creates loaders that match the existing pattern
+ used by Superset examples.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (without .duckdb extension)
+ table_name: Table name (defaults to duckdb_file)
+ description: Description for the dataset
+ sample_rows: Default number of rows to sample
+
+ Returns:
+ A loader function with the standard signature
+ """
+ if table_name is None:
+ table_name = duckdb_file
+
+ def loader(
+ only_metadata: bool = False,
+ force: bool = False,
+ sample: bool = False,
+ ) -> None:
+ """Load the dataset."""
+ rows = sample_rows if sample and sample_rows else None
+
+ tbl = load_duckdb_table(
+ duckdb_file=duckdb_file,
+ table_name=table_name,
+ only_metadata=only_metadata,
+ force=force,
Review Comment:
**Suggestion:** In the loader factory, the computation of `rows` also relies
on the truthiness of `sample_rows`, so a value of `0` will be treated as "no
sampling" and result in full data load, which contradicts the documented "only
load this many rows" behavior. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
            rows = sample_rows if sample and sample_rows is not None else None
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The factory's loader uses the same truthiness pattern. If sample_rows is 0,
"sample and sample_rows" is falsy and rows becomes None, causing
load_duckdb_table to ignore sampling. The suggested change to check
"sample_rows is not None" correctly treats 0 as an explicit sampling request
and aligns behavior with the documented intent.
</details>
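
A quick illustration of the difference, using a hypothetical standalone helper rather than the PR's `loader` closure:

```python
from typing import Optional


def resolve_sample_rows(sample: bool, sample_rows: Optional[int]) -> Optional[int]:
    """Return None for no sampling, or the requested row count (including 0)."""
    return sample_rows if sample and sample_rows is not None else None


# The truthiness version would map sample_rows=0 to None and load everything;
# the `is not None` version preserves 0 as an explicit request.
assert resolve_sample_rows(sample=True, sample_rows=0) == 0
assert resolve_sample_rows(sample=True, sample_rows=None) is None
assert resolve_sample_rows(sample=False, sample_rows=100) is None
```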
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 210:210
**Comment:**
*Logic Error: In the loader factory, the computation of `rows` also
relies on the truthiness of `sample_rows`, so a value of `0` will be treated as
"no sampling" and result in full data load, which contradicts the documented
"only load this many rows" behavior.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
+ path = url
+
+ # Ensure .duckdb extension
+ if not path.endswith(".duckdb"):
+ path = f"{path}.duckdb"
- # Not an examples URL, return unchanged
- return url
+ # Build the full file path
+ full_path = os.path.join(get_examples_folder(), "data", path)
+
+ # Convert to file:// URL for schema validation
+ # This will pass URL validation and still work with our loader
+ return f"file://{full_path}"
def read_example_data(
filepath: str,
- max_attempts: int = 5,
- wait_seconds: float = 60,
+ table_name: str | None = None,
**kwargs: Any,
) -> pd.DataFrame:
- """Load CSV or JSON from example data mirror with retry/backoff."""
- url = normalize_example_data_url(filepath)
- is_json = filepath.endswith(".json") or filepath.endswith(".json.gz")
-
- for attempt in range(1, max_attempts + 1):
- try:
- if is_json:
- return pd.read_json(url, **kwargs)
- return pd.read_csv(url, **kwargs)
- except HTTPError:
- if attempt < max_attempts:
- sleep_time = wait_seconds * (2 ** (attempt - 1))
- print(
- f"HTTP 429 received from {url}. ",
- f"Retrying in {sleep_time:.1f}s ",
- f"(attempt {attempt}/{max_attempts})...",
- )
- time.sleep(sleep_time)
- else:
- raise
+ """Load data from local DuckDB files.
+
+ Args:
+        filepath: Path to the DuckDB file (e.g., "examples://birth_names.duckdb" or "file:///path/to/file.duckdb")
+        table_name: Table to read from DuckDB (defaults to filename without extension)
+ **kwargs: Ignored (kept for backward compatibility)
+
+ Returns:
+ DataFrame with the loaded data
+ """
+ import os
+
+ import duckdb
+
+ # Handle different URL formats
+ if filepath.startswith(EXAMPLES_PROTOCOL):
+ # examples:// protocol
+ relative_path = filepath[len(EXAMPLES_PROTOCOL) :]
+ if not relative_path.endswith(".duckdb"):
+ relative_path = f"{relative_path}.duckdb"
+ local_path = os.path.join(get_examples_folder(), "data", relative_path)
+ elif filepath.startswith("file://"):
+ # file:// protocol - extract the actual path
+ local_path = filepath[7:] # Remove "file://"
+ else:
+ # Assume it's a relative path
+ relative_path = filepath
+ if not relative_path.endswith(".duckdb"):
+ relative_path = f"{relative_path}.duckdb"
+ local_path = os.path.join(get_examples_folder(), "data", relative_path)
+
+ if not os.path.exists(local_path):
+ raise FileNotFoundError(f"Example data file not found: {local_path}")
+
+ # Determine table name if not provided
+ if table_name is None:
+ base_name = os.path.basename(local_path)
+ table_name = os.path.splitext(base_name)[0]
+
+ # Connect and read from DuckDB
+ conn = duckdb.connect(local_path, read_only=True)
+ try:
+ df = conn.execute(f"SELECT * FROM {table_name}").df() # noqa: S608
+ return df
Review Comment:
**Suggestion:** The DuckDB query in the example data reader interpolates
`table_name` directly into the SQL string without quoting or escaping, which
can lead to SQL injection or syntax errors if the table name contains special
characters and is ever derived from untrusted input. [security]
**Severity Level:** Critical 🚨
```suggestion
# Safely quote the table name to avoid SQL injection issues
safe_table_name = table_name.replace('"', '""')
df = conn.execute(f'SELECT * FROM "{safe_table_name}"').df()
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
Directly injecting table_name into the SQL string can cause syntax errors or
SQL injection if the name ever comes from untrusted input or contains special
characters. Safely quoting the identifier (escaping double quotes and wrapping
the identifier) is the correct mitigation because SQL parameters can't be used
for identifiers; this is a real security/correctness improvement.
</details>
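
A runnable sketch of the quoting approach against an in-memory DuckDB database (assumes the `duckdb` and `pandas` packages; the table name is invented for the demo). Since bind parameters cannot be used for identifiers, doubling any embedded double quotes and wrapping the name is the standard escape:

```python
import duckdb
import pandas as pd


def read_table(conn: duckdb.DuckDBPyConnection, table_name: str) -> pd.DataFrame:
    """Read an entire table, quoting the identifier to neutralize special characters."""
    safe_table_name = table_name.replace('"', '""')  # escape embedded double quotes
    return conn.execute(f'SELECT * FROM "{safe_table_name}"').df()


conn = duckdb.connect(":memory:")
conn.execute('CREATE TABLE "birth names" (name VARCHAR, num INTEGER)')
conn.execute("INSERT INTO \"birth names\" VALUES ('Ada', 1)")
print(read_table(conn, "birth names"))  # works even with a space in the name
```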
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 176:176
**Comment:**
*Security: The DuckDB query in the example data reader interpolates
`table_name` directly into the SQL string without quoting or escaping, which
can lead to SQL injection or syntax errors if the table name contains special
characters and is ever derived from untrusted input.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
+ path = url
+
+ # Ensure .duckdb extension
+ if not path.endswith(".duckdb"):
+ path = f"{path}.duckdb"
- # Not an examples URL, return unchanged
- return url
+ # Build the full file path
+ full_path = os.path.join(get_examples_folder(), "data", path)
Review Comment:
**Suggestion:** Building a `file://` URL by directly joining the
user-controlled path segment with `get_examples_folder()/data` allows directory
traversal (for example using `..` segments) so that a crafted `examples://`
value could escape the intended examples directory and access arbitrary files
when later fetched via `urlopen`. [security]
**Severity Level:** Critical 🚨
```suggestion
    # Normalize the path and prevent directory traversal outside examples/data
normalized_path = os.path.normpath(path).lstrip(os.sep)
if normalized_path.startswith(".."):
raise ValueError("Invalid example data path")
# Build the full file path
full_path = os.path.join(get_examples_folder(), "data", normalized_path)
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The code concatenates user-controlled path segments into the examples data
path without normalizing or rejecting ".." segments. That can allow a crafted
examples:// value to resolve to files outside the intended directory.
Normalizing and validating the path (or rejecting paths with traversal) is a
real security hardening and prevents unintended file access.
</details>
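
One way to implement the containment check, sketched with `os.path.realpath` so symlinks are resolved too (the paths below are illustrative; this is not the suggested diff verbatim):

```python
import os


def resolve_example_path(base_dir: str, relative_path: str) -> str:
    """Resolve relative_path under base_dir, rejecting anything that escapes it."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, relative_path))
    # commonpath() raises ValueError on mismatched drives (Windows); that is
    # also a rejection case for our purposes.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"Invalid example data path: {relative_path!r}")
    return candidate


base = "/tmp/superset-examples/data"  # hypothetical examples folder
print(resolve_example_path(base, "birth_names.duckdb"))
try:
    resolve_example_path(base, "../../etc/passwd")
except ValueError as exc:
    print(exc)
```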
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 120:121
**Comment:**
*Security: Building a `file://` URL by directly joining the
user-controlled path segment with `get_examples_folder()/data` allows directory
traversal (for example using `..` segments) so that a crafted `examples://`
value could escape the intended examples directory and access arbitrary files
when later fetched via `urlopen`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/helpers.py:
##########
@@ -119,52 +88,91 @@ def get_slice_json(defaults: dict[Any, Any], **kwargs: Any) -> str:
return json.dumps(defaults_copy, indent=4, sort_keys=True)
-def get_example_url(filepath: str) -> str:
- """Return an absolute URL to *filepath* under the examples‑data repo.
-
- All calls are routed through jsDelivr unless overridden. Supports nested
- paths like ``datasets/examples/slack/messages.csv``.
- """
- return f"{BASE_URL}{filepath}"
+def normalize_example_data_url(url: str) -> str:
+ """Normalize example data URLs for consistency.
+ This function ensures that example data URLs are properly formatted.
+ Since the schema validator expects valid URLs and our examples:// protocol
+ isn't standard, we convert to file:// URLs pointing to the actual location.
-def normalize_example_data_url(url: str) -> str:
- """Convert example data URLs to use the configured CDN.
+ Args:
+        url: URL to normalize (e.g., "examples://birth_names" or "birth_names.duckdb")
- Transforms examples:// URLs to the configured CDN URL.
- Non-example URLs are returned unchanged.
+ Returns:
+ Normalized file:// URL pointing to the DuckDB file
"""
+ import os
+
+ # Handle existing examples:// protocol
if url.startswith(EXAMPLES_PROTOCOL):
- relative_path = url[len(EXAMPLES_PROTOCOL) :]
- return get_example_url(relative_path)
+ # Remove the protocol for processing
+ path = url[len(EXAMPLES_PROTOCOL) :]
+ elif url.startswith("file://"):
+ # Already a file URL, just return it
+ return url
+ else:
Review Comment:
**Suggestion:** The URL normalization helper currently rewrites any
non-`examples://` and non-`file://` URL (including valid HTTP/S or other remote
URIs) into a local `file://` DuckDB path, which will break generic dataset
imports that rely on remote CSV/URL-based `data` values once
`normalize_example_data_url` is used in the dataset importer. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
elif "://" in url:
        # Non-example absolute URLs (http, https, s3, etc.) are left unchanged
return url
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current implementation treats any non-file, non-examples URL as a local
examples path, so an input like "https://..." or "s3://..." would be turned
into a local file:// path and break remote dataset imports. The suggested guard
for other schemes (detecting :// and returning unchanged) fixes a real logic
bug and preserves expected behavior for remote URLs.
</details>
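
An alternative to the substring check, sketched with `urllib.parse` (the `EXAMPLES_PROTOCOL` value is assumed to be `examples://`, matching `helpers.py`):

```python
from urllib.parse import urlparse

EXAMPLES_PROTOCOL = "examples://"  # assumed to match superset/examples/helpers.py


def is_remote_url(url: str) -> bool:
    """True for absolute URLs whose scheme is neither examples:// nor file://."""
    if url.startswith(EXAMPLES_PROTOCOL):
        return False
    # Bare names like "birth_names" parse with an empty scheme.
    return urlparse(url).scheme not in ("", "file")


assert is_remote_url("https://example.com/data.csv")
assert is_remote_url("s3://bucket/key.csv")
assert not is_remote_url("examples://birth_names")
assert not is_remote_url("file:///tmp/birth_names.duckdb")
assert not is_remote_url("birth_names.duckdb")
```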
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/helpers.py
**Line:** 113:113
**Comment:**
*Logic Error: The URL normalization helper currently rewrites any
non-`examples://` and non-`file://` URL (including valid HTTP/S or other remote
URIs) into a local `file://` DuckDB path, which will break generic dataset
imports that rely on remote CSV/URL-based `data` values once
`normalize_example_data_url` is used in the dataset importer.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/commands/importers/v1/utils.py:
##########
@@ -320,6 +321,49 @@ def import_tag(
return new_tag_ids
+def safe_insert_dashboard_chart_relationships(
+ dashboard_chart_ids: list[tuple[int, int]],
+) -> None:
+ """
+ Safely insert dashboard-chart relationships, handling duplicates.
+
+ This function checks for existing relationships and only inserts new ones
+ to avoid duplicate key constraint errors.
+ """
+ from sqlalchemy.sql import select
+
+ if not dashboard_chart_ids:
+ return
+
+ # Get all existing relationships
+ existing_relationships = db.session.execute(
+ select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id])
Review Comment:
**Suggestion:** `safe_insert_dashboard_chart_relationships` currently loads
all rows from the `dashboard_slices` table into memory, which is unnecessary
and can become a serious performance and memory issue in large installations
when only a subset of dashboards is being updated. [performance]
**Severity Level:** Minor ⚠️
```suggestion
# Get existing relationships only for dashboards being updated
dashboard_ids = {dashboard_id for dashboard_id, _ in dashboard_chart_ids}
existing_relationships = db.session.execute(
        select([dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id]).where(
dashboard_slices.c.dashboard_id.in_(dashboard_ids)
)
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current code unconditionally loads every row from dashboard_slices into memory,
which is wasteful and unnecessary when we are only inserting relationships for a
specific set of dashboards.
Scoping the SELECT to the dashboard IDs being updated
(dashboard_slices.c.dashboard_id.in_(dashboard_ids)) is a straightforward, correct
improvement that reduces memory use and query cost.
</details>
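
A self-contained illustration of the scoped lookup, run against an in-memory SQLite table shaped like `dashboard_slices` (this uses the modern SQLAlchemy 2.0-style `select()`; the list-style call in the diff is the older 1.x API):

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, select

metadata = MetaData()
dashboard_slices = Table(  # same shape as Superset's association table (illustrative)
    "dashboard_slices",
    metadata,
    Column("dashboard_id", Integer),
    Column("slice_id", Integer),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)

dashboard_chart_ids = [(1, 10), (1, 11), (2, 20)]

with engine.begin() as conn:
    conn.execute(dashboard_slices.insert(), [{"dashboard_id": 1, "slice_id": 10}])

    # Only fetch existing rows for the dashboards we are about to touch.
    dashboard_ids = {dashboard_id for dashboard_id, _ in dashboard_chart_ids}
    existing = {
        (row.dashboard_id, row.slice_id)
        for row in conn.execute(
            select(dashboard_slices.c.dashboard_id, dashboard_slices.c.slice_id).where(
                dashboard_slices.c.dashboard_id.in_(dashboard_ids)
            )
        )
    }
    print(existing)  # {(1, 10)}
```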
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/importers/v1/utils.py
**Line:** 338:340
**Comment:**
*Performance: `safe_insert_dashboard_chart_relationships` currently
loads all rows from the `dashboard_slices` table into memory, which is
unnecessary and can become a serious performance and memory issue in large
installations when only a subset of dashboards is being updated.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/examples/generic_loader.py:
##########
@@ -0,0 +1,223 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Generic DuckDB example data loader."""
+
+import logging
+from typing import Any, Callable, Optional
+
+import numpy as np
+from sqlalchemy import inspect
+
+from superset import db
+from superset.connectors.sqla.models import SqlaTable
+from superset.examples.helpers import read_example_data
+from superset.models.core import Database
+from superset.sql.parse import Table
+from superset.utils import json
+from superset.utils.database import get_example_database
+
+logger = logging.getLogger(__name__)
+
+
+def serialize_numpy_arrays(obj: Any) -> Any: # noqa: C901
+ """Convert numpy arrays to JSON-serializable format."""
+ if isinstance(obj, np.ndarray):
+ return obj.tolist()
+ elif isinstance(obj, np.generic):
+ # Handle numpy scalar types
+ return obj.item()
+ elif isinstance(obj, (list, tuple)):
+ return [serialize_numpy_arrays(item) for item in obj]
+ elif isinstance(obj, dict):
+ return {key: serialize_numpy_arrays(val) for key, val in obj.items()}
+ return obj
+
+
+def load_duckdb_table( # noqa: C901
+ duckdb_file: str,
+ table_name: str,
+ database: Optional[Database] = None,
+ only_metadata: bool = False,
+ force: bool = False,
+ sample_rows: Optional[int] = None,
+) -> SqlaTable:
+ """Load a DuckDB table into the example database.
+
+ Args:
+ duckdb_file: Name of the DuckDB file (e.g., "birth_names")
+ table_name: Name for the table in the target database
+ database: Target database (defaults to example database)
+ only_metadata: If True, only create metadata without loading data
+ force: If True, replace existing table
+ sample_rows: If specified, only load this many rows
+
+ Returns:
+ The created SqlaTable object
+ """
+ if database is None:
+ database = get_example_database()
+
+ # Check if table exists
+ with database.get_sqla_engine() as engine:
+ schema = inspect(engine).default_schema_name
+
+ table_exists = database.has_table(Table(table_name, schema=schema))
+ if table_exists and not force:
+ logger.info("Table %s already exists, skipping data load", table_name)
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database.id)
+ .first()
+ )
+ if tbl:
+ return tbl
+
+ # Load data if not metadata only
+ if not only_metadata:
+        logger.info("Loading data for %s from %s.duckdb", table_name, duckdb_file)
+
+ # Read from DuckDB
+ pdf = read_example_data(f"examples://{duckdb_file}")
+
+ # Sample if requested
+ if sample_rows:
Review Comment:
**Suggestion:** The `sample_rows` condition uses a truthiness check, so
passing `sample_rows=0` (which is a valid "load zero rows" value per the
docstring) will be treated as if no sampling was requested and will load the
full dataset instead of zero rows. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
if sample_rows is not None:
```
<details>
<summary><b>Why it matters? ⭐ </b></summary>
The current code uses a truthiness check on sample_rows. If a caller
intentionally passes 0 (meaning "load zero rows"), the condition evaluates to
False and the full dataframe is left unchanged. That's a real logic bug because
the docstring states "If specified, only load this many rows" — 0 is a valid
specified integer. Changing the check to "is not None" preserves the
distinction between "not specified" (None) and "specified as 0".
</details>
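
For what it is worth, pandas already treats 0 as a meaningful sample size, so honoring it is cheap; a tiny demonstration (illustrative data only):

```python
import pandas as pd

pdf = pd.DataFrame({"name": ["Ada", "Grace"], "num": [1, 2]})

# head(0) returns an empty frame with the schema intact, which is what a
# sample_rows=0 caller would reasonably expect.
sampled = pdf.head(0)
print(len(sampled))           # 0
print(list(sampled.columns))  # ['name', 'num']

# A truthiness check (`if sample_rows:`) never reaches head(0) and loads all rows.
```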
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 97:97
**Comment:**
*Logic Error: The `sample_rows` condition uses a truthiness check, so
passing `sample_rows=0` (which is a valid "load zero rows" value per the
docstring) will be treated as if no sampling was requested and will load the
full dataset instead of zero rows.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]