codeant-ai-for-open-source[bot] commented on code in PR #35621:
URL: https://github.com/apache/superset/pull/35621#discussion_r2599069240
##########
superset/commands/dashboard/permalink/create.py:
##########
@@ -71,9 +76,31 @@ def run(self) -> str:
"state": self.state,
}
user_id = get_user_id()
- entry = KeyValueDAO.upsert_entry(
+ payload = (user_id, value)
+
+ # Try to find existing entry with current algorithm
+ uuid_key = get_deterministic_uuid(self.salt, payload)
+ entry = KeyValueDAO.get_entry(self.resource, uuid_key)
+
+    # Fallback: check configured fallback algorithms for backward compatibility
+ if not entry:
+ for fallback_algo in get_fallback_algorithms():
+ uuid_fallback = get_deterministic_uuid_with_algorithm(
+ self.salt, payload, fallback_algo
+ )
+ entry = KeyValueDAO.get_entry(self.resource, uuid_fallback)
+ if entry:
+ break
+
+ if entry:
+ # Return existing entry
+ assert entry.id # for type checks
+ return encode_permalink_key(key=entry.id, salt=self.salt)
+
+ # Create new entry with current algorithm
+ entry = KeyValueDAO.create_entry(
resource=self.resource,
- key=get_deterministic_uuid(self.salt, (user_id, value)),
+ key=uuid_key,
value=value,
codec=self.codec,
)
Review Comment:
**Suggestion:** The code performs a separate "find then create", which is
vulnerable to a race: two concurrent requests can each fail to find an entry
and then both attempt to insert, causing a conflict or duplicate entries. Wrap
the create in a try/except for the database integrity error, and roll back and
fetch the existing entry if insertion failed due to a concurrent insert. [race
condition]
**Severity Level:** Minor ⚠️
```suggestion
        # Create new entry with current algorithm (handle concurrent inserts)
        from sqlalchemy.exc import IntegrityError

        try:
            entry = KeyValueDAO.create_entry(
                resource=self.resource,
                key=uuid_key,
                value=value,
                codec=self.codec,
            )
            db.session.flush()
        except IntegrityError:
            # Another transaction inserted the same key concurrently;
            # roll back and fetch it
            db.session.rollback()
            entry = KeyValueDAO.get_entry(self.resource, uuid_key)
            if not entry:
                raise DashboardPermalinkCreateFailedError(
                    "Failed to create or retrieve permalink entry"
                )
```
<details>
<summary><b>Why it matters?</b></summary>
The "find-then-create" pattern is vulnerable to concurrent inserts. Catching
IntegrityError, rolling back and fetching the existing row is a valid and
practical fix to avoid duplicate entries or hard failures under concurrency.
The proposed change directly addresses a real race condition visible from this
diff.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/dashboard/permalink/create.py
**Line:** 100:106
**Comment:**
*Race Condition: The code does a separate "find then create" which is
vulnerable to a race: two concurrent requests can both not find an entry and
both try to insert, causing a conflict or duplicate entries. Wrap the create in
a try/except for the database integrity error, roll back and fetch the existing
entry if insertion failed due to a concurrent insert.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
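The pattern the bot proposes is easy to demonstrate outside Superset. Below is a minimal sketch with stdlib `sqlite3` standing in for the DAO layer; the table and function names are illustrative, not Superset's API:

```python
import sqlite3

def get_or_create(conn: sqlite3.Connection, key: str, value: str) -> str:
    """Find-then-create with a safety net for concurrent inserts."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]
    try:
        conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (key, value))
        conn.commit()
        return value
    except sqlite3.IntegrityError:
        # A concurrent writer inserted the same key first; fetch its row instead
        conn.rollback()
        existing = conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if existing is None:
            raise
        return existing[0]
```

Under concurrency, two callers can both miss on the SELECT; the unique key then makes the slower INSERT fail, and the except branch recovers the winner's row instead of surfacing an error.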
##########
superset/commands/dashboard/permalink/create.py:
##########
@@ -71,9 +76,31 @@ def run(self) -> str:
"state": self.state,
}
user_id = get_user_id()
- entry = KeyValueDAO.upsert_entry(
+ payload = (user_id, value)
+
+ # Try to find existing entry with current algorithm
+ uuid_key = get_deterministic_uuid(self.salt, payload)
+ entry = KeyValueDAO.get_entry(self.resource, uuid_key)
+
+    # Fallback: check configured fallback algorithms for backward compatibility
+ if not entry:
+ for fallback_algo in get_fallback_algorithms():
+ uuid_fallback = get_deterministic_uuid_with_algorithm(
+ self.salt, payload, fallback_algo
+ )
+ entry = KeyValueDAO.get_entry(self.resource, uuid_fallback)
+ if entry:
+ break
+
+ if entry:
+ # Return existing entry
+ assert entry.id # for type checks
Review Comment:
**Suggestion:** An `assert entry.id` is used to check the presence of
`entry.id`, but `assert` statements are stripped when Python runs with `-O`,
so they are not a reliable runtime check. Replace the assert with an explicit
runtime validation and raise a clear exception if `entry.id` is missing.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
        if not getattr(entry, "id", None):
            raise DashboardPermalinkCreateFailedError(
                "Existing permalink entry does not have an id"
            )
```
<details>
<summary><b>Why it matters?</b></summary>
Using assert for a runtime guarantee is brittle because asserts can be
disabled; replacing it with an explicit runtime check (and a clear exception)
improves robustness and error signaling. The improved code addresses a
verifiable correctness concern in production.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/dashboard/permalink/create.py
**Line:** 97:97
**Comment:**
*Possible Bug: An `assert entry.id` is used to check presence of
`entry.id`, but `assert` statements may be disabled at runtime and are not a
reliable runtime check. Replace the assert with an explicit runtime validation
and raise a clear exception if `entry.id` is missing.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
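For context on why the assert is fragile: CPython removes `assert` statements entirely under the `-O` flag. A minimal sketch of the explicit-check pattern, with a hypothetical exception and function name (not Superset's actual API):

```python
from typing import Optional

class PermalinkCreateError(Exception):
    """Raised when an existing permalink entry lacks a usable id."""

def encode_existing(entry_id: Optional[int], salt: str) -> str:
    # An `assert entry_id` here would be stripped under `python -O`;
    # an explicit check survives optimization and raises a clear error.
    if not entry_id:
        raise PermalinkCreateError("Existing permalink entry does not have an id")
    return f"{salt}:{entry_id}"
```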
##########
superset/db_engine_specs/dremio.py:
##########
@@ -104,4 +104,4 @@ def _mutate_label(label: str) -> str:
:param label: Expected expression label
:return: Conditionally mutated label
"""
- return f"{label}_{md5_sha_from_str(label)[:6]}"
+ return f"{label}_{hash_from_str(label)[:6]}"
Review Comment:
**Suggestion:** The docstring says the suffix is derived from the MD5 of the
label, but the new code calls `hash_from_str` with the configured/default
algorithm, which can differ between deployments and breaks the
documented/previous behavior. Explicitly use MD5 if the intent is to preserve
the prior semantics (or update the docstring). This prevents surprising changes
in label suffixes when different hash algorithms are configured. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
return f"{label}_{hash_from_str(label, algorithm='md5')[:6]}"
```
<details>
<summary><b>Why it matters?</b></summary>
The file's docstring explicitly states the suffix comes from the MD5 of the
label, but the code calls hash_from_str() with the default/configured
algorithm. That creates a behavioral mismatch and can surprise users when the
project's hash algorithm changes. For correctness you should either force md5
here or update the docstring to reflect the configured algorithm; the suggested
change (forcing md5) directly fixes the inconsistency visible in the PR diff.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/db_engine_specs/dremio.py
**Line:** 107:107
**Comment:**
*Logic Error: The docstring says the suffix is derived from the MD5 of
the label, but the new code calls `hash_from_str` with the configured/default
algorithm, which can differ between deployments and breaks the
documented/previous behavior; explicitly use MD5 if the intent is to preserve
prior semantics (or update the docstring). This prevents surprising changes in
label suffixes when different hash algorithms are configured.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
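A standalone sketch of the pinned-MD5 variant the suggestion describes, mirroring the shape of `_mutate_label` but self-contained (behavior assumed from the diff, not verified against Superset):

```python
import hashlib

def mutate_label(label: str) -> str:
    """Append a 6-char suffix derived from the MD5 of the label.

    MD5 is pinned deliberately so suffixes never change when a
    deployment-wide hash algorithm setting changes.
    """
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()  # noqa: S324
    return f"{label}_{digest[:6]}"
```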
##########
superset/utils/hashing.py:
##########
@@ -14,23 +14,84 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+from __future__ import annotations
+
import hashlib
-from typing import Any, Callable, Optional
+import logging
+from typing import Any, Callable, Literal, Optional
+
+from flask import current_app
from superset.utils import json
+logger = logging.getLogger(__name__)
+
+HashAlgorithm = Literal["md5", "sha256"]
+
+# Hash function lookup table for efficient dispatch
+_HASH_FUNCTIONS: dict[str, Callable[[bytes], str]] = {
+ "sha256": lambda data: hashlib.sha256(data).hexdigest(),
+ "md5": lambda data: hashlib.md5(data).hexdigest(), # noqa: S324
+}
+
+
+def get_hash_algorithm() -> HashAlgorithm:
+ """
+ Get the configured hash algorithm for non-cryptographic purposes.
+
+ Returns:
+ Hash algorithm name ('md5' or 'sha256')
+ """
+ return current_app.config["HASH_ALGORITHM"]
+
-def md5_sha_from_str(val: str) -> str:
- return hashlib.md5(val.encode("utf-8")).hexdigest() # noqa: S324
+def hash_from_str(val: str, algorithm: Optional[HashAlgorithm] = None) -> str:
+ """
+ Generate a hash from a string using the configured or specified algorithm.
+ Args:
+ val: String to hash
+ algorithm: Hash algorithm to use (defaults to configured algorithm)
-def md5_sha_from_dict(
+ Returns:
+ Hexadecimal hash digest string
+
+ Examples:
+ >>> hash_from_str("test") # Uses configured algorithm
+ '9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'
+ >>> hash_from_str("test", algorithm="md5") # Force MD5
+ '098f6bcd4621d373cade4e832627b4f6'
+ """
+ if algorithm is None:
+ algorithm = get_hash_algorithm()
+
+ hash_func = _HASH_FUNCTIONS.get(algorithm)
+ if hash_func is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+
+ return hash_func(val.encode("utf-8"))
Review Comment:
**Suggestion:** `hash_from_str` always calls `.encode()` on `val` which will
raise an AttributeError if a bytes-like object is passed; accept
bytes/bytearray inputs and only encode when `val` is a str. [type error]
**Severity Level:** Minor ⚠️
```suggestion
    # Accept bytes/bytearray inputs directly; only encode when a str is provided
    data = val if isinstance(val, (bytes, bytearray)) else val.encode("utf-8")

    hash_func = _HASH_FUNCTIONS.get(algorithm)
    if hash_func is None:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")

    return hash_func(data)
```
<details>
<summary><b>Why it matters?</b></summary>
The current implementation assumes val is always a str (and the signature
names it so).
If a bytes-like object is ever passed, calling .encode will raise. Accepting
bytes/bytearray
(or guarding the encode call) is a small, sensible robustness improvement.
The change is safe
but should be accompanied by a signature/type-hint update and tests.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/hashing.py
**Line:** 68:72
**Comment:**
*Type Error: `hash_from_str` always calls `.encode()` on `val` which
will raise an AttributeError if a bytes-like object is passed; accept
bytes/bytearray inputs and only encode when `val` is a str.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
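The str-or-bytes guard can be sketched in isolation; `hash_from_value` below is an illustrative name, and it uses `hashlib.new` for dispatch rather than the module's lookup table:

```python
import hashlib
from typing import Union

def hash_from_value(val: Union[str, bytes, bytearray], algorithm: str = "sha256") -> str:
    """Hash text or bytes; encode only when given a str."""
    if algorithm not in ("md5", "sha256"):
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")
    data = val if isinstance(val, (bytes, bytearray)) else val.encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()
```

Text and its UTF-8 bytes hash identically, so callers holding either type get consistent digests.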
##########
tests/unit_tests/key_value/utils_test.py:
##########
@@ -28,33 +29,159 @@
ID_KEY = 123
-def test_get_filter_uuid() -> None:
+@pytest.mark.parametrize(
+ "key,expected_filter",
+ [
+ (UUID_KEY, {"resource": RESOURCE, "uuid": UUID_KEY}),
+ (ID_KEY, {"resource": RESOURCE, "id": ID_KEY}),
+ ],
+ ids=["uuid_key", "id_key"],
+)
+def test_get_filter(key, expected_filter) -> None:
+ """Test get_filter with different key types."""
from superset.key_value.utils import get_filter
- assert get_filter(resource=RESOURCE, key=UUID_KEY) == {
- "resource": RESOURCE,
- "uuid": UUID_KEY,
- }
-
-
-def test_get_filter_id() -> None:
- from superset.key_value.utils import get_filter
-
- assert get_filter(resource=RESOURCE, key=ID_KEY) == {
- "resource": RESOURCE,
- "id": ID_KEY,
- }
+ assert get_filter(resource=RESOURCE, key=key) == expected_filter
def test_encode_permalink_id_valid() -> None:
+ """Test encoding permalink ID with valid input."""
from superset.key_value.utils import encode_permalink_key
salt = "abc"
assert encode_permalink_key(1, salt) == "AyBn4lm9qG8"
Review Comment:
**Suggestion:** The test asserts an exact encoded string for
`encode_permalink_key`, which is brittle (depends on `HASHIDS_MIN_LENGTH` and
implementation details). Instead, assert roundtrip correctness by encoding and
then decoding and verifying the original ID is recovered. [possible bug]
**Severity Level:** Critical 🚨
```suggestion
    """Test encoding permalink ID with valid input by asserting roundtrip decode."""
    from superset.key_value.utils import decode_permalink_id, encode_permalink_key

    salt = "abc"
    encoded = encode_permalink_key(1, salt)
    assert decode_permalink_id(encoded, salt) == 1
```
<details>
<summary><b>Why it matters?</b></summary>
Asserting an exact encoded output ties the test to implementation details
(hashids min length, alphabet) and is brittle. Testing encode/decode roundtrip
is a robust, verifiable improvement that ensures behaviour without depending on
exact string format.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/utils_test.py
**Line:** 48:52
**Comment:**
*Possible Bug: The test asserts an exact encoded string for
`encode_permalink_key`, which is brittle (depends on `HASHIDS_MIN_LENGTH` and
implementation details). Instead, assert roundtrip correctness by encoding and
then decoding and verifying the original ID is recovered.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
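The roundtrip style can be shown with a stand-in codec (urlsafe base64 here, not hashids), making the point that the test pins behavior rather than an exact encoded string:

```python
import base64

def encode_key(key: int, salt: str) -> str:
    """Stand-in encoder: pack salt and id, then urlsafe-base64 them."""
    return base64.urlsafe_b64encode(f"{salt}:{key}".encode()).decode()

def decode_key(encoded: str, salt: str) -> int:
    """Inverse of encode_key; rejects a mismatched salt."""
    prefix, _, raw = base64.urlsafe_b64decode(encoded).decode().partition(":")
    if prefix != salt:
        raise ValueError("salt mismatch")
    return int(raw)
```

A test asserting `decode_key(encode_key(n, salt), salt) == n` stays green even if the encoder's alphabet or padding changes, unlike one asserting the literal encoded string.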
##########
tests/unit_tests/key_value/test_shared_entries_migration.py:
##########
@@ -0,0 +1,135 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Test hash algorithm migration for shared entries."""
+
+from unittest.mock import MagicMock, patch
+from uuid import uuid3
+
+
+def test_get_shared_value_fallback_to_md5() -> None:
+    """Test that get_shared_value falls back to MD5 when SHA-256 doesn't find entry."""
+ from superset.key_value.shared_entries import get_shared_value
+ from superset.key_value.types import SharedKey
+ from superset.key_value.utils import get_uuid_namespace_with_algorithm
+
+ key = SharedKey.DASHBOARD_PERMALINK_SALT
+ expected_value = "test_salt_value_12345"
+
+ # Calculate what the MD5 UUID would be
+ namespace_md5 = get_uuid_namespace_with_algorithm("", "md5")
+ uuid_md5 = uuid3(namespace_md5, key)
Review Comment:
**Suggestion:** Passing a non-string `SharedKey` value into `uuid3` can
raise a TypeError if `SharedKey` is an Enum (or any non-str/bytes type);
explicitly convert `key` to `str` when calling `uuid3` to ensure a valid name
type is provided. [type error]
**Severity Level:** Minor ⚠️
```suggestion
uuid_md5 = uuid3(namespace_md5, str(key))
```
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/test_shared_entries_migration.py
**Line:** 34:34
**Comment:**
*Type Error: Passing a non-string `SharedKey` value into `uuid3` can
raise a TypeError if `SharedKey` is an Enum (or any non-str/bytes type);
explicitly convert `key` to `str` when calling `uuid3` to ensure a valid name
type is provided.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
tests/unit_tests/key_value/test_shared_entries_migration.py:
##########
@@ -0,0 +1,135 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Test hash algorithm migration for shared entries."""
+
+from unittest.mock import MagicMock, patch
+from uuid import uuid3
+
+
+def test_get_shared_value_fallback_to_md5() -> None:
+    """Test that get_shared_value falls back to MD5 when SHA-256 doesn't find entry."""
+ from superset.key_value.shared_entries import get_shared_value
+ from superset.key_value.types import SharedKey
+ from superset.key_value.utils import get_uuid_namespace_with_algorithm
+
+ key = SharedKey.DASHBOARD_PERMALINK_SALT
+ expected_value = "test_salt_value_12345"
+
+ # Calculate what the MD5 UUID would be
+ namespace_md5 = get_uuid_namespace_with_algorithm("", "md5")
+ uuid_md5 = uuid3(namespace_md5, key)
+
+ # Mock KeyValueDAO to simulate MD5 entry exists, SHA-256 doesn't
+ mock_dao = MagicMock()
+
+ def mock_get_value(resource, uuid_key, codec):
+ # Only return value if UUID matches MD5 version
+ if uuid_key == uuid_md5:
+ return expected_value
+ return None
+
+ mock_dao.get_value.side_effect = mock_get_value
+
+ # Mock current_app to use SHA-256 with MD5 fallback
+ mock_app = MagicMock()
+ mock_app.config = {
+ "HASH_ALGORITHM": "sha256",
+ "HASH_ALGORITHM_FALLBACKS": ["md5"],
+ }
+
+ with patch("superset.key_value.shared_entries.KeyValueDAO", mock_dao):
+ with patch("superset.key_value.utils.current_app", mock_app):
Review Comment:
**Suggestion:** Using `patch(..., mock_dao)` (positional second arg) is
valid but ambiguous; be explicit by passing `new=mock_dao` and `new=mock_app`
so the patch target is clearly replaced with the provided mock, avoiding
accidental misuse or confusion about which argument is being supplied.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
    with patch("superset.key_value.shared_entries.KeyValueDAO", new=mock_dao):
        with patch("superset.key_value.utils.current_app", new=mock_app):
```
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/test_shared_entries_migration.py
**Line:** 54:55
**Comment:**
*Possible Bug: Using `patch(..., mock_dao)` (positional second arg) is
valid but ambiguous; be explicit by passing `new=mock_dao` and `new=mock_app`
so the patch target is clearly replaced with the provided mock, avoiding
accidental misuse or confusion about which argument is being supplied.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/utils/cache.py:
##########
@@ -31,14 +31,14 @@
from superset.constants import CACHE_DISABLED_TIMEOUT
from superset.extensions import cache_manager
from superset.models.cache import CacheKey
-from superset.utils.hashing import md5_sha_from_dict
+from superset.utils.hashing import hash_from_dict
from superset.utils.json import json_int_dttm_ser
logger = logging.getLogger(__name__)
def generate_cache_key(values_dict: dict[str, Any], key_prefix: str = "") -> str:
- hash_str = md5_sha_from_dict(values_dict, default=json_int_dttm_ser)
+ hash_str = hash_from_dict(values_dict, default=json_int_dttm_ser)
Review Comment:
**Suggestion:** The new call to `hash_from_dict` doesn't honor a
configurable hash algorithm; the decorator/utility change in the PR title
suggests algorithm selection should be supported. Pass the configured algorithm
(e.g. `app.config["CACHE_HASH_ALGORITHM"]`) into `hash_from_dict` so cache keys
use the intended algorithm. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
hash_str = hash_from_dict(
values_dict,
default=json_int_dttm_ser,
algorithm=app.config.get("CACHE_HASH_ALGORITHM"),
)
```
<details>
<summary><b>Why it matters?</b></summary>
hash_from_dict accepts an algorithm parameter (see hashing.py), so passing a
configured algorithm (e.g., app.config["CACHE_HASH_ALGORITHM"]) is a valid
enhancement that makes cache keys consistent with any deploy-time
configuration. This isn't strictly a bug fix but it is a sensible, actionable
improvement tied to the new use of hash_from_dict.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/cache.py
**Line:** 35:41
**Comment:**
*Logic Error: The new call to `hash_from_dict` doesn't honor a
configurable hash algorithm; the decorator/utility change in the PR title
suggests algorithm selection should be supported. Pass the configured algorithm
(e.g. `app.config["CACHE_HASH_ALGORITHM"]`) into `hash_from_dict` so cache keys
use the intended algorithm.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/key_value/utils.py:
##########
@@ -66,13 +68,94 @@ def decode_permalink_id(key: str, salt: str) -> int:
raise KeyValueParseKeyError(_("Invalid permalink key"))
-def get_uuid_namespace(seed: str) -> UUID:
+def _uuid_namespace_from_md5(seed: str) -> UUID:
+ """Generate UUID namespace from MD5 hash (legacy compatibility)."""
md5_obj = md5() # noqa: S324
md5_obj.update(seed.encode("utf-8"))
return UUID(md5_obj.hexdigest())
+def _uuid_namespace_from_sha256(seed: str) -> UUID:
+ """Generate UUID namespace from SHA-256 hash (first 16 bytes)."""
+ sha256_obj = hashlib.sha256()
+ sha256_obj.update(seed.encode("utf-8"))
+ # Use first 16 bytes of SHA-256 digest for UUID
+ return UUID(bytes=sha256_obj.digest()[:16])
+
+
+# UUID namespace generator dispatch table
+_UUID_NAMESPACE_GENERATORS = {
+ "md5": _uuid_namespace_from_md5,
+ "sha256": _uuid_namespace_from_sha256,
+}
+
+
+def get_uuid_namespace_with_algorithm(seed: str, algorithm: str) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using specified hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ UUID namespace
+ """
+ generator = _UUID_NAMESPACE_GENERATORS.get(algorithm)
+ if generator is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+ return generator(seed)
+
+
+def get_uuid_namespace(seed: str, app: Any = None) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using configured hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ UUID namespace
+ """
+ app = app or current_app
+ algorithm = app.config["HASH_ALGORITHM"]
+ return get_uuid_namespace_with_algorithm(seed, algorithm)
+
+
+def get_deterministic_uuid_with_algorithm(
+ namespace: str, payload: Any, algorithm: str
+) -> UUID:
+ """
+ Get a deterministic UUID (uuid3) using specified hash algorithm.
+
+ Args:
+ namespace: Namespace string for UUID generation
+ payload: JSON-serializable payload
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ Deterministic UUID
+ """
+ payload_str = json_dumps_w_dates(payload, sort_keys=True)
+    return uuid3(get_uuid_namespace_with_algorithm(namespace, algorithm), payload_str)
+
+
def get_deterministic_uuid(namespace: str, payload: Any) -> UUID:
    """Get a deterministic UUID (uuid3) from a salt and a JSON-serializable payload."""
payload_str = json_dumps_w_dates(payload, sort_keys=True)
return uuid3(get_uuid_namespace(namespace), payload_str)
+
+
+def get_fallback_algorithms(app: Any = None) -> list[str]:
+ """
+ Get the list of fallback hash algorithms from config.
+
+ Args:
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ List of fallback algorithm names (empty list if none configured)
+ """
+ app = app or current_app
+ return app.config.get("HASH_ALGORITHM_FALLBACKS", [])
Review Comment:
**Suggestion:** `get_fallback_algorithms` returns the raw config value which
may be None, a string, or another type; this can cause callers to receive
unexpected types. Normalize the return to always be a list of strings (empty
list if not configured) by coercing `None` or single-string values into a list.
[type error]
**Severity Level:** Minor ⚠️
```suggestion
value = app.config.get("HASH_ALGORITHM_FALLBACKS", [])
if value is None:
return []
if isinstance(value, str):
return [value]
try:
return list(value)
except TypeError:
return []
```
<details>
<summary><b>Why it matters?</b></summary>
Normalizing the configuration value to always return a list[str] is
practical and prevents callers from handling unexpected types (None or single
string). The improved_code is sensible: it handles None, strings, and attempts
to coerce iterables to lists. It's a defensive, low-risk improvement that
matches the documented return type.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/key_value/utils.py
**Line:** 161:161
**Comment:**
*Type Error: `get_fallback_algorithms` returns the raw config value
which may be None, a string, or another type; this can cause callers to receive
unexpected types. Normalize the return to always be a list of strings (empty
list if not configured) by coercing `None` or single-string values into a list.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
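The proposed normalization is small enough to verify in isolation; a sketch, with a hypothetical helper name:

```python
def normalize_fallbacks(value: object) -> list:
    """Coerce a config value into a list of algorithm names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # a bare "md5" becomes ["md5"], not ["m", "d", "5"]
    try:
        return [str(item) for item in value]
    except TypeError:
        return []  # non-iterable config values degrade to "no fallbacks"
```

Checking for `str` before iterating is the important subtlety: strings are iterable, so a naive `list(value)` would silently explode a single algorithm name into characters.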
##########
tests/integration_tests/utils/hashing_tests.py:
##########
@@ -17,80 +17,184 @@
import datetime
import math
from typing import Any
Review Comment:
**Suggestion:** The tests later use json.dumps to produce serialized JSON
strings, but the file doesn't import the `json` module; this will cause a
NameError when replacing hard-coded serialized strings with `json.dumps` or
when adding new json-based assertions. Add `import json` to the imports block.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
import json
```
<details>
<summary><b>Why it matters?</b></summary>
The test file currently doesn't import json. If we adopt other suggestions
to replace brittle hard-coded JSON strings with json.dumps, the missing import
would cause a NameError. Adding the import is a harmless, necessary change when
switching to json.dumps and is correct here.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/integration_tests/utils/hashing_tests.py
**Line:** 19:19
**Comment:**
*Possible Bug: The tests later use json.dumps to produce serialized
JSON strings, but the file doesn't import the `json` module; this will cause a
NameError when replacing hard-coded serialized strings with `json.dumps` or
when adding new json-based assertions. Add `import json` to the imports block.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/extensions/metastore_cache.py:
##########
@@ -54,7 +54,7 @@ def factory(
cls, app: Flask, config: dict[str, Any], args: list[Any], kwargs:
dict[str, Any]
) -> BaseCache:
seed = config.get("CACHE_KEY_PREFIX", "")
- kwargs["namespace"] = get_uuid_namespace(seed)
+ kwargs["namespace"] = get_uuid_namespace(seed, app)
Review Comment:
**Suggestion:** Calling `get_uuid_namespace(seed, app)` can raise a KeyError
if `app.config["HASH_ALGORITHM"]` is not set inside `get_uuid_namespace`;
retrieve the hash algorithm with a safe default and call the underlying helper
that accepts an explicit algorithm to avoid crashing when the config entry is
missing. [possible bug]
**Severity Level:** Critical 🚨
```suggestion
        # pylint: disable=import-outside-toplevel
        from superset.key_value.utils import get_uuid_namespace_with_algorithm

        algorithm = app.config.get("HASH_ALGORITHM", "md5")
        kwargs["namespace"] = get_uuid_namespace_with_algorithm(seed, algorithm)
```
<details>
<summary><b>Why it matters?</b></summary>
get_uuid_namespace (in utils.py) indexes app.config["HASH_ALGORITHM"]
directly, so a missing config key can raise KeyError during cache factory
initialization. Replacing the call with an explicit helper call that reads the
config with .get(..., default) or otherwise provides a safe default is a
reasonable defensive fix. The change is localized and prevents an easily
avoidable crash during app startup.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/extensions/metastore_cache.py
**Line:** 57:57
**Comment:**
*Possible Bug: Calling `get_uuid_namespace(seed, app)` can raise a
KeyError if `app.config["HASH_ALGORITHM"]` is not set inside
`get_uuid_namespace`; retrieve the hash algorithm with a safe default and call
the underlying helper that accepts an explicit algorithm to avoid crashing when
the config entry is missing.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
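The defensive lookup reduces to a small helper; a sketch using a plain dict in place of `app.config`, with the `md5` default assumed from the suggestion rather than taken from Superset's actual defaults:

```python
def pick_algorithm(config: dict, default: str = "md5") -> str:
    """Read HASH_ALGORITHM with a safe default and validate the result."""
    algorithm = config.get("HASH_ALGORITHM", default)
    if algorithm not in ("md5", "sha256"):
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")
    return algorithm
```

A missing key yields the default instead of a bare KeyError, while a present-but-bogus value still fails loudly with a message naming the offending algorithm.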
##########
superset/key_value/utils.py:
##########
@@ -66,13 +68,94 @@ def decode_permalink_id(key: str, salt: str) -> int:
raise KeyValueParseKeyError(_("Invalid permalink key"))
-def get_uuid_namespace(seed: str) -> UUID:
+def _uuid_namespace_from_md5(seed: str) -> UUID:
+ """Generate UUID namespace from MD5 hash (legacy compatibility)."""
md5_obj = md5() # noqa: S324
md5_obj.update(seed.encode("utf-8"))
return UUID(md5_obj.hexdigest())
+def _uuid_namespace_from_sha256(seed: str) -> UUID:
+ """Generate UUID namespace from SHA-256 hash (first 16 bytes)."""
+ sha256_obj = hashlib.sha256()
+ sha256_obj.update(seed.encode("utf-8"))
+ # Use first 16 bytes of SHA-256 digest for UUID
+ return UUID(bytes=sha256_obj.digest()[:16])
+
+
+# UUID namespace generator dispatch table
+_UUID_NAMESPACE_GENERATORS = {
+ "md5": _uuid_namespace_from_md5,
+ "sha256": _uuid_namespace_from_sha256,
+}
+
+
+def get_uuid_namespace_with_algorithm(seed: str, algorithm: str) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using specified hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ UUID namespace
+ """
+ generator = _UUID_NAMESPACE_GENERATORS.get(algorithm)
+ if generator is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+ return generator(seed)
+
+
+def get_uuid_namespace(seed: str, app: Any = None) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using configured hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ UUID namespace
+ """
+ app = app or current_app
+ algorithm = app.config["HASH_ALGORITHM"]
Review Comment:
**Suggestion:** `get_uuid_namespace` reads `app.config["HASH_ALGORITHM"]`
without a safe lookup; if the configuration key is missing this raises a
KeyError at runtime. Use `app.config.get("HASH_ALGORITHM", "<default>")` (or
raise a clear ValueError) so the code fails with a clearer message or uses a
sensible default. [possible bug]
**Severity Level:** Critical π¨
```suggestion
algorithm = app.config.get("HASH_ALGORITHM", "md5")
```
<details>
<summary><b>Why it matters? β </b></summary>
The suggestion is reasonable: directly indexing app.config can raise
KeyError if the configuration key is missing, which surfaces as a less-helpful
error. Making the lookup explicit (either defaulting or raising a clearer
error) improves robustness and diagnostics. Defaulting to "md5" is a sensible
choice given the code has an MD5 legacy path, but switching to a default is a
behavioral change that should be discussed. Still, the core ideaβavoid
unguarded config indexingβis valid.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/key_value/utils.py
**Line:** 122:122
**Comment:**
*Possible Bug: `get_uuid_namespace` reads
`app.config["HASH_ALGORITHM"]` without a safe lookup; if the configuration key
is missing this raises a KeyError at runtime. Use
`app.config.get("HASH_ALGORITHM", "<default>")` (or raise a clear ValueError)
so the code fails with a clearer message or uses a sensible default.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
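The diff above dispatches namespace generation through a table keyed by algorithm name. A self-contained sketch of that pattern for reference (the helper names here are illustrative, not Superset's actual functions):

```python
import hashlib
from uuid import UUID


def namespace_from_md5(seed: str) -> UUID:
    # An MD5 digest is exactly 16 bytes, so its hex digest maps directly onto a UUID
    return UUID(hashlib.md5(seed.encode("utf-8")).hexdigest())  # noqa: S324


def namespace_from_sha256(seed: str) -> UUID:
    # SHA-256 yields 32 bytes; only the first 16 feed the UUID
    return UUID(bytes=hashlib.sha256(seed.encode("utf-8")).digest()[:16])


GENERATORS = {"md5": namespace_from_md5, "sha256": namespace_from_sha256}


def namespace(seed: str, algorithm: str) -> UUID:
    try:
        return GENERATORS[algorithm](seed)
    except KeyError:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}") from None
```

Both paths are deterministic for a given seed, which is what keeps permalink keys stable per algorithm and makes the fallback lookup in the create command possible.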
##########
tests/unit_tests/key_value/utils_test.py:
##########
@@ -28,33 +29,159 @@
ID_KEY = 123
-def test_get_filter_uuid() -> None:
+@pytest.mark.parametrize(
+ "key,expected_filter",
+ [
+ (UUID_KEY, {"resource": RESOURCE, "uuid": UUID_KEY}),
+ (ID_KEY, {"resource": RESOURCE, "id": ID_KEY}),
Review Comment:
**Suggestion:** The test asserts the filter's `resource` key equals the Enum
member `RESOURCE`, but `get_filter` returns `resource.value` (the enum's
value). This will make the equality assertion fail because the types/values
differ; change expected dictionaries to use `RESOURCE.value`. [logic error]
**Severity Level:** Minor β οΈ
```suggestion
(UUID_KEY, {"resource": RESOURCE.value, "uuid": UUID_KEY}),
(ID_KEY, {"resource": RESOURCE.value, "id": ID_KEY}),
```
<details>
<summary><b>Why it matters? β </b></summary>
This is a real failing assertion: utils.get_filter builds {"resource":
resource.value}, so the test must compare against RESOURCE.value. The proposed
change directly fixes the mismatch and will make the test pass.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/utils_test.py
**Line:** 35:36
**Comment:**
*Logic Error: The test asserts the filter's `resource` key equals the
Enum member `RESOURCE`, but `get_filter` returns `resource.value` (the enum's
value). This will make the equality assertion fail because the types/values
differ; change expected dictionaries to use `RESOURCE.value`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
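Whether the assertion actually fails depends on how the resource enum is defined. A sketch with a plain `Enum` (an illustrative stand-in, not Superset's `KeyValueResource`) reproduces the mismatch the comment describes:

```python
from enum import Enum


class Resource(Enum):
    # Illustrative stand-in for Superset's KeyValueResource
    DASHBOARD_PERMALINK = "dashboard_permalink"


def get_filter(resource: Resource, key: int) -> dict:
    # Mirrors the pattern under review: the filter stores the enum's *value*
    return {"resource": resource.value, "id": key}
```

With a plain `Enum`, `Resource.DASHBOARD_PERMALINK != "dashboard_permalink"`, so an expected dict built from the member never equals the returned filter. If the real enum mixes in `str` (`class KeyValueResource(str, Enum)`), member and value compare equal and the original test would pass, so that is worth confirming before applying the suggestion.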
##########
superset/db_engine_specs/bigquery.py:
##########
@@ -290,7 +290,7 @@ def _truncate_label(cls, label: str) -> str:
:param label: expected expression label
:return: truncated label
"""
- return "_" + md5_sha_from_str(label)
+ return "_" + hash_from_str(label)
Review Comment:
**Suggestion:** `_truncate_label` now returns an underscore + full hash
which may exceed column name length limits; truncate the hash to ensure the
returned label always respects `max_column_name_length` (reserve one char for
the leading underscore). [logic error]
**Severity Level:** Minor β οΈ
```suggestion
        # Ensure truncated label respects the engine's max column name length.
max_len = getattr(cls, "max_column_name_length", None) or 128
return "_" + hash_from_str(label)[: max_len - 1]
```
<details>
<summary><b>Why it matters? β </b></summary>
Returning the entire hash could, in theory, produce a label longer than the
engine's max_column_name_length if hashing algorithm or limits change.
Truncating the hashed suffix to respect max_column_name_length (reserving one
char for the leading underscore) is a harmless hardening that prevents
generating invalid column names. This addresses a potential production issue
(invalid/too-long identifiers) visible in the diff.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/db_engine_specs/bigquery.py
**Line:** 293:293
**Comment:**
*Logic Error: `_truncate_label` now returns an underscore + full hash
which may exceed column name length limits; truncate the hash to ensure the
returned label always respects `max_column_name_length` (reserve one char for
the leading underscore).
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
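The cap the comment asks for can be sketched independently of Superset (sha256 and the default limit of 128 are assumptions here; BigQuery's real limit and hash helper may differ):

```python
import hashlib


def truncate_label(label: str, max_column_name_length: int = 128) -> str:
    # Reserve one character for the leading underscore so the whole
    # identifier stays within the engine's column-name limit
    digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return "_" + digest[: max_column_name_length - 1]
```

A 64-character SHA-256 hex digest fits comfortably under a 128-character limit today, so the slice only bites if the limit shrinks or the digest grows, which is exactly the hardening the reviewer is after.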
##########
superset/utils/core.py:
##########
@@ -1005,7 +1005,7 @@ def form_data_to_adhoc(form_data: dict[str, Any], clause: str) -> AdhocFilterClause
"expressionType": "SQL",
"sqlExpression": form_data.get(clause),
}
- result["filterOptionName"] = md5_sha_from_dict(cast(dict[Any, Any], result))
+ result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result))
Review Comment:
**Suggestion:** Same issue as above for `form_data_to_adhoc`: hashing
`result` directly can raise if `sqlExpression` or other values are
mutable/unhashable; normalize the dict (convert lists to tuples) and hash the
normalized copy to produce a stable `filterOptionName`. [possible bug]
**Severity Level:** Critical π¨
```suggestion
    result_for_hash = {k: tuple(v) if isinstance(v, list) else v for k, v in result.items()}
    result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result_for_hash))
```
<details>
<summary><b>Why it matters? β </b></summary>
Same rationale as above but for form_data_to_adhoc: the dict being hashed
can contain values that are mutable/unhashable. Normalizing the dict before
hashing is a safe improvement that prevents errors and makes identifiers
deterministic.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/core.py
**Line:** 1008:1008
**Comment:**
*Possible Bug: Same issue as above for `form_data_to_adhoc`: hashing
`result` directly can raise if `sqlExpression` or other values are
mutable/unhashable; normalize the dict (convert lists to tuples) and hash the
normalized copy to produce a stable `filterOptionName`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/utils/hashing.py:
##########
@@ -14,23 +14,84 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+from __future__ import annotations
+
import hashlib
-from typing import Any, Callable, Optional
+import logging
+from typing import Any, Callable, Literal, Optional
+
+from flask import current_app
from superset.utils import json
+logger = logging.getLogger(__name__)
+
+HashAlgorithm = Literal["md5", "sha256"]
+
+# Hash function lookup table for efficient dispatch
+_HASH_FUNCTIONS: dict[str, Callable[[bytes], str]] = {
+ "sha256": lambda data: hashlib.sha256(data).hexdigest(),
+ "md5": lambda data: hashlib.md5(data).hexdigest(), # noqa: S324
+}
+
+
+def get_hash_algorithm() -> HashAlgorithm:
+ """
+ Get the configured hash algorithm for non-cryptographic purposes.
+
+ Returns:
+ Hash algorithm name ('md5' or 'sha256')
+ """
+ return current_app.config["HASH_ALGORITHM"]
Review Comment:
**Suggestion:** Accessing `current_app.config["HASH_ALGORITHM"]` will raise
a KeyError if the config key is missing and doesn't validate the configured
value; use `.get()` with a secure default and validate membership in supported
algorithms, logging and falling back if invalid. [possible bug]
**Severity Level:** Critical π¨
```suggestion
alg = current_app.config.get("HASH_ALGORITHM", "sha256")
if alg not in _HASH_FUNCTIONS:
logger.warning(
            "Unsupported HASH_ALGORITHM '%s' in app config; falling back to 'sha256'", alg
)
return "sha256"
return alg
```
<details>
<summary><b>Why it matters? β </b></summary>
The current code will raise a KeyError if HASH_ALGORITHM is absent and
doesn't validate values.
The proposed change hardens the function by providing a safe default and
validating membership
against _HASH_FUNCTIONS, preventing runtime errors and unexpected algorithms.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/hashing.py
**Line:** 45:45
**Comment:**
*Possible Bug: Accessing `current_app.config["HASH_ALGORITHM"]` will
raise a KeyError if the config key is missing and doesn't validate the
configured value; use `.get()` with a secure default and validate membership in
supported algorithms, logging and falling back if invalid.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
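The defensive lookup pattern, sketched against a plain dict instead of Flask's `app.config` (the `"sha256"` default is the reviewer's choice, not necessarily the project's):

```python
import hashlib
import logging

logger = logging.getLogger(__name__)

HASH_FUNCTIONS = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
    "md5": lambda data: hashlib.md5(data).hexdigest(),  # noqa: S324
}


def get_hash_algorithm(config: dict) -> str:
    # .get() avoids a KeyError on a missing key; the membership check
    # rejects typos instead of failing later at dispatch time
    alg = config.get("HASH_ALGORITHM", "sha256")
    if alg not in HASH_FUNCTIONS:
        logger.warning("Unsupported HASH_ALGORITHM %r; falling back to sha256", alg)
        return "sha256"
    return alg
```

Logging and falling back keeps the app serving requests on a misconfiguration; raising a `ValueError` instead would be the stricter alternative if silent fallback is considered a behavior change.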
##########
superset/utils/core.py:
##########
@@ -992,7 +992,7 @@ def simple_filter_to_adhoc(
}
if filter_clause.get("isExtra"):
result["isExtra"] = True
- result["filterOptionName"] = md5_sha_from_dict(cast(dict[Any, Any], result))
+ result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result))
Review Comment:
**Suggestion:** Hashing `result` directly can fail or be non-deterministic
if any values (e.g. `comparator`) are mutable/unhashable (lists) or if the dict
is later mutated; build a normalized, immutable copy of `result` (convert lists
to tuples) before calling `hash_from_dict` so the computed `filterOptionName`
is stable and won't raise on unhashable values. [possible bug]
**Severity Level:** Critical π¨
```suggestion
    result_for_hash = {k: tuple(v) if isinstance(v, list) else v for k, v in result.items()}
    result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result_for_hash))
```
<details>
<summary><b>Why it matters? β </b></summary>
This is a reasonable defensive suggestion: result may contain lists (e.g.
comparator) and different hashing implementations may behave inconsistently or
fail on mutable/unhashable nested objects. Normalizing mutable sequences (lists
-> tuples) into an immutable structure before calling hash_from_dict produces a
more stable, predictable filterOptionName and avoids potential runtime errors.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/core.py
**Line:** 995:995
**Comment:**
*Possible Bug: Hashing `result` directly can fail or be
non-deterministic if any values (e.g. `comparator`) are mutable/unhashable
(lists) or if the dict is later mutated; build a normalized, immutable copy of
`result` (convert lists to tuples) before calling `hash_from_dict` so the
computed `filterOptionName` is stable and won't raise on unhashable values.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
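Whether the normalization is strictly needed depends on `hash_from_dict`'s implementation (a `json.dumps`-based serializer already copes with lists). The technique itself, sketched with a hypothetical JSON-backed hash:

```python
import hashlib
import json


def hash_from_dict(obj: dict) -> str:
    # Hypothetical stand-in: deterministic JSON serialization, then SHA-256
    payload = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def normalize_for_hash(obj: dict) -> dict:
    # Freeze mutable sequence values so later mutation of the original
    # dict cannot retroactively disagree with the stored hash
    return {k: tuple(v) if isinstance(v, list) else v for k, v in obj.items()}
```

Note `sort_keys=True` already makes the hash independent of insertion order; the tuple conversion only adds protection against the caller mutating the list after hashing.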
##########
tests/integration_tests/utils_tests.py:
##########
@@ -266,7 +266,7 @@ def test_ssl_certificate_parse(self):
def test_ssl_certificate_file_creation(self):
path = create_ssl_cert_file(ssl_certificate)
- expected_filename = md5_sha_from_str(ssl_certificate)
+ expected_filename = hash_from_str(ssl_certificate)
assert expected_filename in path
Review Comment:
**Suggestion:** The test asserts `expected_filename in path` which can match
directory components by accident; verify the filename portion of the returned
path (basename) instead to avoid false positives. [logic error]
**Severity Level:** Minor β οΈ
```suggestion
    # verify the filename portion only to avoid matching directory names that may contain the same substring
assert expected_filename in os.path.basename(path)
```
<details>
<summary><b>Why it matters? β </b></summary>
Valid suggestion. Checking the basename (os.path.basename(path)) is a safer
assertion β it prevents accidental matches against directory names and makes
the test intention explicit. The improved code directly fixes a potential
false-positive test without requiring broader changes.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/integration_tests/utils_tests.py
**Line:** 270:270
**Comment:**
*Logic Error: The test asserts `expected_filename in path` which can
match directory components by accident; verify the filename portion of the
returned path (basename) instead to avoid false positives.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
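The false positive the comment describes is easy to reproduce; the path below is made up for illustration:

```python
import os.path

digest = "abc123"
# A directory component happens to contain the digest, but the file does not
path = "/tmp/abc123_certs/unrelated.pem"

assert digest in path                        # passes for the wrong reason
assert digest not in os.path.basename(path)  # the basename check catches it
```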
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]