codeant-ai-for-open-source[bot] commented on code in PR #37557:
URL: https://github.com/apache/superset/pull/37557#discussion_r2744246258


##########
superset/examples/generic_loader.py:
##########
@@ -34,6 +34,45 @@
 logger = logging.getLogger(__name__)
 
 
+def _find_dataset(
+    table_name: str,
+    database_id: int,
+    uuid: Optional[str] = None,
+    schema: Optional[str] = None,
+) -> tuple[Optional[SqlaTable], bool]:
+    """Find a dataset by UUID first, then fall back to table_name + 
database_id.
+
+    Includes schema in the fallback lookup to prevent cross-schema collisions.
+
+    This avoids unique constraint violations when a duplicate row exists.
+
+    Args:
+        table_name: The table name to look up
+        database_id: The database ID
+        uuid: Optional UUID to look up first
+        schema: Optional schema to include in fallback lookup
+
+    Returns:
+        A tuple of (dataset or None, found_by_uuid bool)
+    """
+    tbl = None
+    found_by_uuid = False
+
+    if uuid:
+        tbl = db.session.query(SqlaTable).filter_by(uuid=uuid).first()
+        if tbl:
+            found_by_uuid = True
+
+    if not tbl:
+        tbl = (
+            db.session.query(SqlaTable)
+            .filter_by(table_name=table_name, database_id=database_id, 
schema=schema)

Review Comment:
   **Suggestion:** The `_find_dataset` fallback query filters by 
`schema=schema`, so when an existing dataset row has `schema=None` but the 
caller passes a non-null schema (the common backfill case), the row is not 
found; this prevents UUID/schema backfill from ever running and instead leads 
to creating a new dataset row for the same physical table, causing duplicates 
and leaving the original broken row in place. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Duplicate dataset rows created during examples import.
   - ❌ Examples load creating new datasets instead of backfilling UUIDs.
   - ⚠️ Backfill of legacy datasets (schema/UUID) fails silently.
   - ⚠️ Example re-run produces dataset lookup collisions.
   ```
   </details>
   
   ```suggestion
           # First try to match on the exact schema for cross-schema safety
           tbl = (
               db.session.query(SqlaTable)
               .filter_by(table_name=table_name, database_id=database_id, 
schema=schema)
               .first()
           )
           # If nothing matches the requested schema, fall back to rows with no 
schema
           # so we can backfill schema/UUID on legacy datasets that have 
schema=None.
           if not tbl and schema is not None:
               tbl = (
                   db.session.query(SqlaTable)
                   .filter_by(table_name=table_name, database_id=database_id, 
schema=None)
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Discover example loaders: discover_datasets() in
   superset/examples/data_loading.py:143-150 calls create_generic_loader(...,
   schema=config[\"schema\"], uuid=config.get(\"uuid\")) (file:
   superset/examples/data_loading.py, lines 143-150). This yields a loader that 
closes over a
   non-null schema value when dataset.yaml specifies a schema.
   
   2. Invoke the loader created by create_generic_loader: create_generic_loader 
-> loader()
   calls load_parquet_table(...) with the captured schema and uuid
   (superset/examples/generic_loader.py:280-289). See create_generic_loader 
definition and
   loader invocation (generic_loader.py, lines 272-289).
   
   3. load_parquet_table receives schema!=None and uuid and reaches the dataset 
lookup: it
   calls _find_dataset(table_name, database.id, uuid, schema) 
(generic_loader.py, line 216).
   _find_dataset first tries UUID then falls back to a single filter_by that 
includes
   schema=schema (generic_loader.py, lines 61-71 and 216-217).
   
   4. Real-world problematic state: an existing SqlaTable row was created 
earlier by the old
   data-loading path without setting schema (row.schema is NULL) and uuid is 
NULL. In that
   case _find_dataset's fallback filter_by(..., schema=schema) will not match 
the NULL-schema
   row because schema!=NULL, so tbl remains None and found_by_uuid False. This 
causes
   load_parquet_table to treat the dataset as missing and create a new 
SqlaTable row for the
   same physical table, leaving the legacy NULL-schema row intact (duplicate 
dataset row).
   
   5. Concrete reproduction with current test scaffolding: run the sequence in
   tests/unit_tests/examples/generic_loader_test.py using mocks that simulate 
(a)
   discover_datasets passing schema (data_loading.py:143-150), (b) mock DB 
having an existing
   row with schema=None and uuid=None, and (c) load_parquet_table being called 
with
   schema="public" and uuid set. With the existing code path, _find_dataset 
will not return
   the NULL-schema row (because schema="public" != None), and the loader will 
create a new
   dataset row instead of backfilling the legacy row. The mock scenarios in 
tests demonstrate
   related behaviors (see tests around lines 618-646 and 312-345 for backfill 
expectations).
   
   6. Why the improved code fixes it: after trying an exact schema match, the 
suggested
   change falls back to searching rows with schema=None when a schema was 
requested (i.e.,
   schema is not None). This allows the loader to find legacy rows with NULL 
schema so the
   code can backfill tbl.schema and/or tbl.uuid and avoid creating duplicate 
dataset rows for
   the same physical table.
   
   7. Note on intentionality: current code intentionally includes schema in the 
lookup to
   avoid cross-schema collisions (tests cover schema-distinguishing behavior at 
tests lines
   471-506). The suggested change preserves exact-schema matching first 
(maintaining
   cross-schema safety) and only falls back to schema=None when no exact-schema 
match exists,
   which is consistent with the test expectations and addresses legacy backfill 
scenarios.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/examples/generic_loader.py
   **Line:** 67:69
   **Comment:**
        *Logic Error: The `_find_dataset` fallback query filters by 
`schema=schema`, so when an existing dataset row has `schema=None` but the 
caller passes a non-null schema (the common backfill case), the row is not 
found; this prevents UUID/schema backfill from ever running and instead leads 
to creating a new dataset row for the same physical table, causing duplicates 
and leaving the original broken row in place.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



##########
tests/unit_tests/examples/generic_loader_test.py:
##########
@@ -0,0 +1,731 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for the generic_loader module, specifically UUID handling."""
+
+from unittest.mock import MagicMock, patch
+
+import pandas as pd
+
+
+def _setup_database_mocks(
+    mock_get_db: MagicMock, mock_database: MagicMock, has_table: bool = False
+) -> MagicMock:
+    """Helper to set up common database mocks."""
+    mock_database.id = 1
+    mock_database.has_table.return_value = has_table
+    mock_get_db.return_value = mock_database
+
+    mock_engine = MagicMock()
+    mock_inspector = MagicMock()
+    mock_inspector.default_schema_name = "public"
+    mock_database.get_sqla_engine.return_value.__enter__ = MagicMock(
+        return_value=mock_engine
+    )
+    mock_database.get_sqla_engine.return_value.__exit__ = 
MagicMock(return_value=False)
+
+    return mock_inspector
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_uuid_on_new_table(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table sets UUID when creating a new SqlaTable."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table by UUID or table_name
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        test_uuid = "14f48794-ebfa-4f60-a26a-582c49132f1b"
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_finds_existing_by_uuid_first(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table looks up by UUID first when provided."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table found by UUID
+        test_uuid = "existing-uuid-1234"
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = test_uuid
+        mock_existing_table.table_name = "test_table"
+
+        # First call (by uuid) returns the table, second call (by table_name) 
not needed
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        # Should return the existing table found by UUID
+        assert result.uuid == test_uuid
+        assert result is mock_existing_table
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_uuid_on_existing_table(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing dataset with uuid=None gets UUID backfilled."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table with NO UUID (needs backfill)
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = None
+        mock_existing_table.table_name = "test_table"
+
+        # UUID lookup returns None, table_name lookup returns the table
+        def filter_by_side_effect(**kwargs):
+            mock_result = MagicMock()
+            if "uuid" in kwargs:
+                mock_result.first.return_value = None
+            else:
+                mock_result.first.return_value = mock_existing_table
+            return mock_result
+
+        mock_db.session.query.return_value.filter_by.side_effect = 
filter_by_side_effect
+
+        new_uuid = "new-uuid-5678"
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=new_uuid,
+        )
+
+        # UUID should be backfilled
+        assert result.uuid == new_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_avoids_uuid_collision(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that finding by UUID doesn't try to re-set UUID (avoids 
collision)."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Table already has the UUID we're looking for
+        test_uuid = "existing-uuid-1234"
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = test_uuid
+
+        # UUID lookup finds the table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        # UUID should remain unchanged (not re-assigned)
+        assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_preserves_existing_different_uuid(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that if table has different UUID, we find it by UUID lookup 
first."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # A table exists with the target UUID
+        target_uuid = "target-uuid-1234"
+        mock_uuid_table = MagicMock()
+        mock_uuid_table.uuid = target_uuid
+
+        # UUID lookup finds the UUID-matching table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_uuid_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="different_table_name",
+            database=mock_database,
+            only_metadata=True,
+            uuid=target_uuid,
+        )
+
+        # Should return the table found by UUID, not create new one
+        assert result is mock_uuid_table
+        assert result.uuid == target_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_works_without_uuid(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table still works when no UUID is provided."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+        )
+
+        assert result is not None
+        assert result.table_name == "test_table"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_schema_on_new_table(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table sets schema when creating a new 
SqlaTable."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            schema="custom_schema",
+        )
+
+        assert result is not None
+        assert result.schema == "custom_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_schema_on_existing_table(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing dataset with schema=None gets schema backfilled."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table with NO schema (needs backfill)
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = "some-uuid"
+        mock_existing_table.schema = None
+        mock_existing_table.table_name = "test_table"
+
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            schema="public",
+        )
+
+        # Schema should be backfilled
+        assert result.schema == "public"
+
+
+def test_create_generic_loader_passes_uuid() -> None:
+    """Test that create_generic_loader passes UUID to load_parquet_table."""
+    from superset.examples.generic_loader import create_generic_loader
+
+    test_uuid = "test-uuid-1234"
+    loader = create_generic_loader(
+        parquet_file="test_data",
+        table_name="test_table",
+        uuid=test_uuid,
+    )
+
+    assert loader is not None
+    assert callable(loader)
+    assert loader.__name__ == "load_test_data"
+

Review Comment:
   **Suggestion:** The `test_create_generic_loader_passes_uuid` test never 
calls the generated loader or inspects `load_parquet_table`'s arguments, so it 
will still pass even if the UUID is not actually threaded through, failing to 
detect regressions in the behavior it claims to validate. The fix is to patch 
`load_parquet_table`, invoke the loader, and assert that `uuid` was passed in 
the call. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Unit test suite fails to catch UUID propagation regressions.
   - ⚠️ examples loading UUID bug can slip into releases.
   - ⚠️ Affects tests under tests/unit_tests/examples.
   ```
   </details>
   
   ```suggestion
   @patch("superset.examples.generic_loader.load_parquet_table")
   def test_create_generic_loader_passes_uuid(
       mock_load_parquet: MagicMock,
   ) -> None:
       """Test that create_generic_loader passes UUID to load_parquet_table."""
       from superset.examples.generic_loader import create_generic_loader
   
       test_uuid = "test-uuid-1234"
       loader = create_generic_loader(
           parquet_file="test_data",
           table_name="test_table",
           uuid=test_uuid,
       )
   
       # Call the loader to trigger load_parquet_table
       loader(True, False, False)  # type: ignore[call-arg,arg-type]
   
       # Verify UUID was passed through to load_parquet_table
       mock_load_parquet.assert_called_once()
       _, kwargs = mock_load_parquet.call_args
       assert kwargs.get("uuid") == test_uuid
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Open tests file tests/unit_tests/examples/generic_loader_test.py and 
locate the test
   function `test_create_generic_loader_passes_uuid()` (file header shows this 
at lines
   347-361 in the PR). The current test only constructs the loader and asserts 
its name
   (lines 347-361).
   
   2. Because the test never calls the returned loader, the internal call into
   `load_parquet_table()` is never executed. This means regressions that drop 
the uuid
   argument from create_generic_loader -> loader -> load_parquet_table won't 
fail the test.
   
   3. Reproduce the regression locally by modifying create_generic_loader to 
not forward uuid
   (e.g., remove uuid kwarg when calling load_parquet_table) then run pytest
   
tests/unit_tests/examples/generic_loader_test.py::test_create_generic_loader_passes_uuid
 -
   the current test will still pass because it only asserts the loader's name 
(no call).
   
   4. Applying the improved test (patching load_parquet_table and calling 
loader) makes the
   regression fail: the patched mock will record call args and the assertion on
   kwargs["uuid"] will detect missing propagation.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** tests/unit_tests/examples/generic_loader_test.py
   **Line:** 347:361
   **Comment:**
        *Logic Error: The `test_create_generic_loader_passes_uuid` test never 
calls the generated loader or inspects `load_parquet_table`'s arguments, so it 
will still pass even if the UUID is not actually threaded through, failing to 
detect regressions in the behavior it claims to validate. The fix is to patch 
`load_parquet_table`, invoke the loader, and assert that `uuid` was passed in 
the call.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



##########
tests/unit_tests/examples/generic_loader_test.py:
##########
@@ -0,0 +1,731 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for the generic_loader module, specifically UUID handling."""
+
+from unittest.mock import MagicMock, patch
+
+import pandas as pd
+
+
+def _setup_database_mocks(
+    mock_get_db: MagicMock, mock_database: MagicMock, has_table: bool = False
+) -> MagicMock:
+    """Helper to set up common database mocks."""
+    mock_database.id = 1
+    mock_database.has_table.return_value = has_table
+    mock_get_db.return_value = mock_database
+
+    mock_engine = MagicMock()
+    mock_inspector = MagicMock()
+    mock_inspector.default_schema_name = "public"
+    mock_database.get_sqla_engine.return_value.__enter__ = MagicMock(
+        return_value=mock_engine
+    )
+    mock_database.get_sqla_engine.return_value.__exit__ = 
MagicMock(return_value=False)
+
+    return mock_inspector
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_uuid_on_new_table(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table sets UUID when creating a new SqlaTable."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table by UUID or table_name
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        test_uuid = "14f48794-ebfa-4f60-a26a-582c49132f1b"
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_finds_existing_by_uuid_first(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table looks up by UUID first when provided."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table found by UUID
+        test_uuid = "existing-uuid-1234"
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = test_uuid
+        mock_existing_table.table_name = "test_table"
+
+        # First call (by uuid) returns the table, second call (by table_name) 
not needed
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        # Should return the existing table found by UUID
+        assert result.uuid == test_uuid
+        assert result is mock_existing_table
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_uuid_on_existing_table(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing dataset with uuid=None gets UUID backfilled."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table with NO UUID (needs backfill)
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = None
+        mock_existing_table.table_name = "test_table"
+
+        # UUID lookup returns None, table_name lookup returns the table
+        def filter_by_side_effect(**kwargs):
+            mock_result = MagicMock()
+            if "uuid" in kwargs:
+                mock_result.first.return_value = None
+            else:
+                mock_result.first.return_value = mock_existing_table
+            return mock_result
+
+        mock_db.session.query.return_value.filter_by.side_effect = 
filter_by_side_effect
+
+        new_uuid = "new-uuid-5678"
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=new_uuid,
+        )
+
+        # UUID should be backfilled
+        assert result.uuid == new_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_avoids_uuid_collision(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that finding by UUID doesn't try to re-set UUID (avoids 
collision)."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Table already has the UUID we're looking for
+        test_uuid = "existing-uuid-1234"
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = test_uuid
+
+        # UUID lookup finds the table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid=test_uuid,
+        )
+
+        # UUID should remain unchanged (not re-assigned)
+        assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_preserves_existing_different_uuid(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that if table has different UUID, we find it by UUID lookup 
first."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # A table exists with the target UUID
+        target_uuid = "target-uuid-1234"
+        mock_uuid_table = MagicMock()
+        mock_uuid_table.uuid = target_uuid
+
+        # UUID lookup finds the UUID-matching table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_uuid_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="different_table_name",
+            database=mock_database,
+            only_metadata=True,
+            uuid=target_uuid,
+        )
+
+        # Should return the table found by UUID, not create new one
+        assert result is mock_uuid_table
+        assert result.uuid == target_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_works_without_uuid(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table still works when no UUID is provided."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+        )
+
+        assert result is not None
+        assert result.table_name == "test_table"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_schema_on_new_table(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that load_parquet_table sets schema when creating a new 
SqlaTable."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=False)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # No existing table
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            None
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            schema="custom_schema",
+        )
+
+        assert result is not None
+        assert result.schema == "custom_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_schema_on_existing_table(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing dataset with schema=None gets schema backfilled."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table with NO schema (needs backfill)
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = "some-uuid"
+        mock_existing_table.schema = None
+        mock_existing_table.table_name = "test_table"
+
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            schema="public",
+        )
+
+        # Schema should be backfilled
+        assert result.schema == "public"
+
+
+def test_create_generic_loader_passes_uuid() -> None:
+    """Test that create_generic_loader passes UUID to load_parquet_table."""
+    from superset.examples.generic_loader import create_generic_loader
+
+    test_uuid = "test-uuid-1234"
+    loader = create_generic_loader(
+        parquet_file="test_data",
+        table_name="test_table",
+        uuid=test_uuid,
+    )
+
+    assert loader is not None
+    assert callable(loader)
+    assert loader.__name__ == "load_test_data"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_returns_uuid_match_first(mock_db: MagicMock) -> None:
+    """Test that _find_dataset returns UUID match over table_name match."""
+    from superset.examples.generic_loader import _find_dataset
+
+    # Row with UUID (should be found first)
+    uuid_row = MagicMock()
+    uuid_row.uuid = "target-uuid"
+    uuid_row.table_name = "table_a"
+
+    # Row without UUID (same table_name as query)
+    tablename_row = MagicMock()
+    tablename_row.uuid = None
+    tablename_row.table_name = "test_table"
+
+    # UUID lookup returns uuid_row
+    
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+        uuid_row
+    )
+
+    result, found_by_uuid = _find_dataset("test_table", 1, "target-uuid", 
"public")
+
+    assert result is uuid_row
+    assert found_by_uuid is True
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_falls_back_to_table_name(mock_db: MagicMock) -> None:
+    """Test that _find_dataset falls back to table_name when UUID not found."""
+    from superset.examples.generic_loader import _find_dataset
+
+    tablename_row = MagicMock()
+    tablename_row.uuid = None
+    tablename_row.table_name = "test_table"
+
+    # UUID lookup returns None, table_name lookup returns the row
+    def filter_by_side_effect(**kwargs):
+        mock_result = MagicMock()
+        if "uuid" in kwargs:
+            mock_result.first.return_value = None
+        else:
+            mock_result.first.return_value = tablename_row
+        return mock_result
+
+    mock_db.session.query.return_value.filter_by.side_effect = 
filter_by_side_effect
+
+    result, found_by_uuid = _find_dataset("test_table", 1, "nonexistent-uuid", 
"public")
+
+    assert result is tablename_row
+    assert found_by_uuid is False
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_duplicate_rows_table_missing(
+    mock_read_data: MagicMock,
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test UUID-first lookup when duplicate rows exist and physical table is 
missing.
+
+    Scenario:
+    - Row A: table_name="test_table", uuid="target-uuid" (correct row)
+    - Row B: table_name="test_table", uuid=None (duplicate from broken run)
+    - Physical table was dropped
+
+    The UUID-first lookup should find Row A and avoid collision.
+    """
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(
+        mock_get_db,
+        mock_database,
+        has_table=False,  # Table was dropped
+    )
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Row with UUID (should be found by UUID lookup)
+        uuid_row = MagicMock()
+        uuid_row.uuid = "target-uuid"
+        uuid_row.table_name = "test_table"
+        uuid_row.database = mock_database
+
+        # UUID lookup finds the correct row
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            uuid_row
+        )
+
+        mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid="target-uuid",
+        )
+
+        # Should return the UUID row, not try to backfill (which would collide)
+        assert result is uuid_row
+        assert result.uuid == "target-uuid"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_distinguishes_schemas(mock_db: MagicMock) -> None:
+    """Test that _find_dataset uses schema to distinguish same-name tables.
+
+    Scenario:
+    - Row A: table_name="users", schema="schema_a", uuid=None
+    - Row B: table_name="users", schema="schema_b", uuid=None
+
+    Looking up "users" in "schema_b" should find Row B, not Row A.
+    """
+    from superset.examples.generic_loader import _find_dataset
+
+    # Row in schema_b (should be found)
+    schema_b_row = MagicMock()
+    schema_b_row.uuid = None
+    schema_b_row.table_name = "users"
+    schema_b_row.schema = "schema_b"
+
+    # No UUID lookup (uuid not provided), table_name lookup returns schema_b 
row
+    def filter_by_side_effect(**kwargs):
+        mock_result = MagicMock()
+        if "uuid" in kwargs:
+            mock_result.first.return_value = None
+        elif kwargs.get("schema") == "schema_b":
+            mock_result.first.return_value = schema_b_row
+        else:
+            mock_result.first.return_value = None  # schema_a not requested
+        return mock_result
+
+    mock_db.session.query.return_value.filter_by.side_effect = 
filter_by_side_effect
+
+    result, found_by_uuid = _find_dataset("users", 1, None, "schema_b")
+
+    assert result is schema_b_row
+    assert found_by_uuid is False
+    assert result.schema == "schema_b"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_no_uuid_no_schema(mock_db: MagicMock) -> None:
+    """Test _find_dataset with no UUID and no schema (basic lookup)."""
+    from superset.examples.generic_loader import _find_dataset
+
+    basic_row = MagicMock()
+    basic_row.uuid = None
+    basic_row.table_name = "test_table"
+    basic_row.schema = None
+
+    # No UUID provided, so skip UUID lookup; table_name+database_id lookup 
returns row
+    
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+        basic_row
+    )
+
+    result, found_by_uuid = _find_dataset("test_table", 1, None, None)
+
+    assert result is basic_row
+    assert found_by_uuid is False
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_no_backfill_when_uuid_already_set(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing UUID is preserved (not overwritten) during 
backfill."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table already has a UUID set
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = "existing-uuid-1234"
+        mock_existing_table.schema = "public"
+        mock_existing_table.table_name = "test_table"
+
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid="new-uuid-5678",  # Try to set a different UUID
+        )
+
+        # Existing UUID should be preserved, not overwritten
+        assert result.uuid == "existing-uuid-1234"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_no_backfill_when_schema_already_set(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that existing schema is preserved (not overwritten) during 
backfill."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table already has a schema set
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = "some-uuid"
+        mock_existing_table.schema = "existing_schema"
+        mock_existing_table.table_name = "test_table"
+
+        
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+            mock_existing_table
+        )
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            schema="new_schema",  # Try to set a different schema
+        )
+
+        # Existing schema should be preserved, not overwritten
+        assert result.schema == "existing_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_both_uuid_and_schema_backfill(
+    mock_get_db: MagicMock,
+    mock_db: MagicMock,
+) -> None:
+    """Test that both UUID and schema are backfilled in a single call."""
+    from superset.examples.generic_loader import load_parquet_table
+
+    mock_database = MagicMock()
+    mock_inspector = _setup_database_mocks(mock_get_db, mock_database, 
has_table=True)
+
+    with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+        mock_inspect.return_value = mock_inspector
+
+        # Existing table with neither UUID nor schema
+        mock_existing_table = MagicMock()
+        mock_existing_table.uuid = None
+        mock_existing_table.schema = None
+        mock_existing_table.table_name = "test_table"
+
+        # UUID lookup returns None, table_name lookup returns the table
+        def filter_by_side_effect(**kwargs):
+            mock_result = MagicMock()
+            if "uuid" in kwargs:
+                mock_result.first.return_value = None
+            else:
+                mock_result.first.return_value = mock_existing_table
+            return mock_result
+
+        mock_db.session.query.return_value.filter_by.side_effect = 
filter_by_side_effect
+
+        result = load_parquet_table(
+            parquet_file="test_data",
+            table_name="test_table",
+            database=mock_database,
+            only_metadata=True,
+            uuid="new-uuid",
+            schema="new_schema",
+        )
+
+        # Both should be backfilled
+        assert result.uuid == "new-uuid"
+        assert result.schema == "new_schema"
+
+
+def test_create_generic_loader_passes_schema() -> None:
+    """Test that create_generic_loader passes schema to load_parquet_table."""
+    from superset.examples.generic_loader import create_generic_loader
+
+    test_schema = "custom_schema"
+    loader = create_generic_loader(
+        parquet_file="test_data",
+        table_name="test_table",
+        schema=test_schema,
+    )
+
+    assert loader is not None
+    assert callable(loader)
+    assert loader.__name__ == "load_test_data"
+

Review Comment:
   **Suggestion:** The `test_create_generic_loader_passes_schema` test 
similarly only inspects the returned loader object and never verifies that 
`schema` is actually forwarded into `load_parquet_table`, so it will not fail 
if schema propagation is broken; patching `load_parquet_table`, invoking the 
loader, and asserting on the `schema` argument fixes this. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Unit test suite misses schema propagation regressions.
   - ⚠️ Example dataset schema mismatches at load time.
   - ⚠️ Affects tests under tests/unit_tests/examples.
   ```
   </details>
   
   ```suggestion
   @patch("superset.examples.generic_loader.load_parquet_table")
   def test_create_generic_loader_passes_schema(
       mock_load_parquet: MagicMock,
   ) -> None:
       """Test that create_generic_loader passes schema to 
load_parquet_table."""
       from superset.examples.generic_loader import create_generic_loader
   
       test_schema = "custom_schema"
       loader = create_generic_loader(
           parquet_file="test_data",
           table_name="test_table",
           schema=test_schema,
       )
   
       # Call the loader to trigger load_parquet_table
       loader(True, False, False)  # type: ignore[call-arg,arg-type]
   
       # Verify schema was passed through to load_parquet_table
       mock_load_parquet.assert_called_once()
       _, kwargs = mock_load_parquet.call_args
       assert kwargs.get("schema") == test_schema
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Open tests/unit_tests/examples/generic_loader_test.py and find
   `test_create_generic_loader_passes_schema()` (lines 649-663). The test only 
creates the
   loader and asserts its name, never exercising the loader call path.
   
   2. Simulate a regression by changing create_generic_loader to drop the 
schema kwarg when
   constructing the inner loader function. Run pytest
   
tests/unit_tests/examples/generic_loader_test.py::test_create_generic_loader_passes_schema
   — the current test will still pass because it doesn't call the loader.
   
   3. With the improved test (patching load_parquet_table and invoking the 
loader), running
   the same pytest target will record the mocked call and assert that kwargs 
contain
   "schema", failing if schema was omitted.
   
   4. This verifies the runtime propagation from create_generic_loader -> 
returned loader ->
   load_parquet_table, preventing silent regressions.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** tests/unit_tests/examples/generic_loader_test.py
   **Line:** 649:663
   **Comment:**
        *Logic Error: The `test_create_generic_loader_passes_schema` test 
similarly only inspects the returned loader object and never verifies that 
`schema` is actually forwarded into `load_parquet_table`, so it will not fail 
if schema propagation is broken; patching `load_parquet_table`, invoking the 
loader, and asserting on the `schema` argument fixes this.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to