codeant-ai-for-open-source[bot] commented on code in PR #35621:
URL: https://github.com/apache/superset/pull/35621#discussion_r2599069240
##########
superset/commands/dashboard/permalink/create.py:
##########
@@ -71,9 +76,31 @@ def run(self) -> str:
"state": self.state,
}
user_id = get_user_id()
- entry = KeyValueDAO.upsert_entry(
+ payload = (user_id, value)
+
+ # Try to find existing entry with current algorithm
+ uuid_key = get_deterministic_uuid(self.salt, payload)
+ entry = KeyValueDAO.get_entry(self.resource, uuid_key)
+
+    # Fallback: check configured fallback algorithms for backward compatibility
+ if not entry:
+ for fallback_algo in get_fallback_algorithms():
+ uuid_fallback = get_deterministic_uuid_with_algorithm(
+ self.salt, payload, fallback_algo
+ )
+ entry = KeyValueDAO.get_entry(self.resource, uuid_fallback)
+ if entry:
+ break
+
+ if entry:
+ # Return existing entry
+ assert entry.id # for type checks
+ return encode_permalink_key(key=entry.id, salt=self.salt)
+
+ # Create new entry with current algorithm
+ entry = KeyValueDAO.create_entry(
resource=self.resource,
- key=get_deterministic_uuid(self.salt, (user_id, value)),
+ key=uuid_key,
value=value,
codec=self.codec,
)
Review Comment:
**Suggestion:** The code performs a separate "find then create", which is
vulnerable to a race: two concurrent requests can each fail to find an entry
and then both attempt to insert, causing a conflict or duplicate entries. Wrap
the create in a try/except for the database integrity error, and roll back and
fetch the existing entry if insertion failed due to a concurrent insert. [race
condition]
**Severity Level:** Minor ⚠️
```suggestion
        # Create new entry with current algorithm (handle concurrent inserts)
        from sqlalchemy.exc import IntegrityError

        try:
            entry = KeyValueDAO.create_entry(
                resource=self.resource,
                key=uuid_key,
                value=value,
                codec=self.codec,
            )
            db.session.flush()
        except IntegrityError:
            # Another transaction inserted the same key concurrently;
            # roll back and fetch it
            db.session.rollback()
            entry = KeyValueDAO.get_entry(self.resource, uuid_key)
            if not entry:
                raise DashboardPermalinkCreateFailedError(
                    "Failed to create or retrieve permalink entry"
                )
```
<details>
<summary><b>Why it matters?</b></summary>
The "find-then-create" pattern is vulnerable to concurrent inserts. Catching
IntegrityError, rolling back and fetching the existing row is a valid and
practical fix to avoid duplicate entries or hard failures under concurrency.
The proposed change directly addresses a real race condition visible from this
diff.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/dashboard/permalink/create.py
**Line:** 100:106
**Comment:**
*Race Condition: The code does a separate "find then create" which is
vulnerable to a race: two concurrent requests can both not find an entry and
both try to insert, causing a conflict or duplicate entries. Wrap the create in
a try/except for the database integrity error, roll back and fetch the existing
entry if insertion failed due to a concurrent insert.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
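The pattern the bot proposes is easy to demonstrate outside Superset. Below is a minimal sketch with stdlib `sqlite3` standing in for the DAO layer; the table and function names are illustrative, not Superset's API:

```python
import sqlite3

def get_or_create(conn: sqlite3.Connection, key: str, value: str) -> str:
    """Find-then-create with a safety net for concurrent inserts."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]
    try:
        conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (key, value))
        conn.commit()
        return value
    except sqlite3.IntegrityError:
        # A concurrent writer inserted the same key first; fetch its row instead
        conn.rollback()
        existing = conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if existing is None:
            raise
        return existing[0]
```

Under concurrency, two callers can both miss on the SELECT; the unique key then makes the slower INSERT fail, and the except branch recovers the winner's row instead of surfacing an error.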
##########
superset/commands/dashboard/permalink/create.py:
##########
@@ -71,9 +76,31 @@ def run(self) -> str:
"state": self.state,
}
user_id = get_user_id()
- entry = KeyValueDAO.upsert_entry(
+ payload = (user_id, value)
+
+ # Try to find existing entry with current algorithm
+ uuid_key = get_deterministic_uuid(self.salt, payload)
+ entry = KeyValueDAO.get_entry(self.resource, uuid_key)
+
+    # Fallback: check configured fallback algorithms for backward compatibility
+ if not entry:
+ for fallback_algo in get_fallback_algorithms():
+ uuid_fallback = get_deterministic_uuid_with_algorithm(
+ self.salt, payload, fallback_algo
+ )
+ entry = KeyValueDAO.get_entry(self.resource, uuid_fallback)
+ if entry:
+ break
+
+ if entry:
+ # Return existing entry
+ assert entry.id # for type checks
Review Comment:
**Suggestion:** An `assert entry.id` is used to check the presence of
`entry.id`, but `assert` statements are stripped when Python runs with `-O`,
so they are not a reliable runtime check. Replace the assert with an explicit
runtime validation and raise a clear exception if `entry.id` is missing.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
        if not getattr(entry, "id", None):
            raise DashboardPermalinkCreateFailedError(
                "Existing permalink entry does not have an id"
            )
```
<details>
<summary><b>Why it matters?</b></summary>
Using assert for a runtime guarantee is brittle because asserts can be
disabled; replacing it with an explicit runtime check (and a clear exception)
improves robustness and error signaling. The improved code addresses a
verifiable correctness concern in production.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/dashboard/permalink/create.py
**Line:** 97:97
**Comment:**
*Possible Bug: An `assert entry.id` is used to check presence of
`entry.id`, but `assert` statements may be disabled at runtime and are not a
reliable runtime check. Replace the assert with an explicit runtime validation
and raise a clear exception if `entry.id` is missing.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
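For context on why the assert is fragile: CPython removes `assert` statements entirely under the `-O` flag. A minimal sketch of the explicit-check pattern, with a hypothetical exception and function name (not Superset's actual API):

```python
from typing import Optional

class PermalinkCreateError(Exception):
    """Raised when an existing permalink entry lacks a usable id."""

def encode_existing(entry_id: Optional[int], salt: str) -> str:
    # An `assert entry_id` here would be stripped under `python -O`;
    # an explicit check survives optimization and raises a clear error.
    if not entry_id:
        raise PermalinkCreateError("Existing permalink entry does not have an id")
    return f"{salt}:{entry_id}"
```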
##########
superset/db_engine_specs/dremio.py:
##########
@@ -104,4 +104,4 @@ def _mutate_label(label: str) -> str:
:param label: Expected expression label
:return: Conditionally mutated label
"""
- return f"{label}_{md5_sha_from_str(label)[:6]}"
+ return f"{label}_{hash_from_str(label)[:6]}"
Review Comment:
**Suggestion:** The docstring says the suffix is derived from the MD5 of the
label, but the new code calls `hash_from_str` with the configured/default
algorithm, which can differ between deployments and breaks the
documented/previous behavior. Explicitly use MD5 if the intent is to preserve
the prior semantics (or update the docstring). This prevents surprising changes
in label suffixes when different hash algorithms are configured. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
return f"{label}_{hash_from_str(label, algorithm='md5')[:6]}"
```
<details>
<summary><b>Why it matters?</b></summary>
The file's docstring explicitly states the suffix comes from the MD5 of the
label, but the code calls hash_from_str() with the default/configured
algorithm. That creates a behavioral mismatch and can surprise users when the
project's hash algorithm changes. For correctness you should either force md5
here or update the docstring to reflect the configured algorithm; the suggested
change (forcing md5) directly fixes the inconsistency visible in the PR diff.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/db_engine_specs/dremio.py
**Line:** 107:107
**Comment:**
*Logic Error: The docstring says the suffix is derived from the MD5 of
the label, but the new code calls `hash_from_str` with the configured/default
algorithm, which can differ between deployments and breaks the
documented/previous behavior; explicitly use MD5 if the intent is to preserve
prior semantics (or update the docstring). This prevents surprising changes in
label suffixes when different hash algorithms are configured.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
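A standalone sketch of the pinned-MD5 variant the suggestion describes, mirroring the shape of `_mutate_label` but self-contained (behavior assumed from the diff, not verified against Superset):

```python
import hashlib

def mutate_label(label: str) -> str:
    """Append a 6-char suffix derived from the MD5 of the label.

    MD5 is pinned deliberately so suffixes never change when a
    deployment-wide hash algorithm setting changes.
    """
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()  # noqa: S324
    return f"{label}_{digest[:6]}"
```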
##########
superset/utils/hashing.py:
##########
@@ -14,23 +14,84 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+from __future__ import annotations
+
import hashlib
-from typing import Any, Callable, Optional
+import logging
+from typing import Any, Callable, Literal, Optional
+
+from flask import current_app
from superset.utils import json
+logger = logging.getLogger(__name__)
+
+HashAlgorithm = Literal["md5", "sha256"]
+
+# Hash function lookup table for efficient dispatch
+_HASH_FUNCTIONS: dict[str, Callable[[bytes], str]] = {
+ "sha256": lambda data: hashlib.sha256(data).hexdigest(),
+ "md5": lambda data: hashlib.md5(data).hexdigest(), # noqa: S324
+}
+
+
+def get_hash_algorithm() -> HashAlgorithm:
+ """
+ Get the configured hash algorithm for non-cryptographic purposes.
+
+ Returns:
+ Hash algorithm name ('md5' or 'sha256')
+ """
+ return current_app.config["HASH_ALGORITHM"]
+
-def md5_sha_from_str(val: str) -> str:
- return hashlib.md5(val.encode("utf-8")).hexdigest() # noqa: S324
+def hash_from_str(val: str, algorithm: Optional[HashAlgorithm] = None) -> str:
+ """
+ Generate a hash from a string using the configured or specified algorithm.
+ Args:
+ val: String to hash
+ algorithm: Hash algorithm to use (defaults to configured algorithm)
-def md5_sha_from_dict(
+ Returns:
+ Hexadecimal hash digest string
+
+ Examples:
+ >>> hash_from_str("test") # Uses configured algorithm
+ '9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'
+ >>> hash_from_str("test", algorithm="md5") # Force MD5
+ '098f6bcd4621d373cade4e832627b4f6'
+ """
+ if algorithm is None:
+ algorithm = get_hash_algorithm()
+
+ hash_func = _HASH_FUNCTIONS.get(algorithm)
+ if hash_func is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+
+ return hash_func(val.encode("utf-8"))
Review Comment:
**Suggestion:** `hash_from_str` always calls `.encode()` on `val` which will
raise an AttributeError if a bytes-like object is passed; accept
bytes/bytearray inputs and only encode when `val` is a str. [type error]
**Severity Level:** Minor ⚠️
```suggestion
    # Accept bytes/bytearray inputs directly; only encode when a str is provided
    data = val if isinstance(val, (bytes, bytearray)) else val.encode("utf-8")

    hash_func = _HASH_FUNCTIONS.get(algorithm)
    if hash_func is None:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")

    return hash_func(data)
```
<details>
<summary><b>Why it matters?</b></summary>
The current implementation assumes val is always a str (and the signature
names it so).
If a bytes-like object is ever passed, calling .encode will raise. Accepting
bytes/bytearray
(or guarding the encode call) is a small, sensible robustness improvement.
The change is safe
but should be accompanied by a signature/type-hint update and tests.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/hashing.py
**Line:** 68:72
**Comment:**
*Type Error: `hash_from_str` always calls `.encode()` on `val` which
will raise an AttributeError if a bytes-like object is passed; accept
bytes/bytearray inputs and only encode when `val` is a str.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
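The str-or-bytes guard can be sketched in isolation; `hash_from_value` below is an illustrative name, and it uses `hashlib.new` for dispatch rather than the module's lookup table:

```python
import hashlib
from typing import Union

def hash_from_value(val: Union[str, bytes, bytearray], algorithm: str = "sha256") -> str:
    """Hash text or bytes; encode only when given a str."""
    if algorithm not in ("md5", "sha256"):
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")
    data = val if isinstance(val, (bytes, bytearray)) else val.encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()
```

Text and its UTF-8 bytes hash identically, so callers holding either type get consistent digests.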
##########
tests/unit_tests/key_value/utils_test.py:
##########
@@ -28,33 +29,159 @@
ID_KEY = 123
-def test_get_filter_uuid() -> None:
+@pytest.mark.parametrize(
+ "key,expected_filter",
+ [
+ (UUID_KEY, {"resource": RESOURCE, "uuid": UUID_KEY}),
+ (ID_KEY, {"resource": RESOURCE, "id": ID_KEY}),
+ ],
+ ids=["uuid_key", "id_key"],
+)
+def test_get_filter(key, expected_filter) -> None:
+ """Test get_filter with different key types."""
from superset.key_value.utils import get_filter
- assert get_filter(resource=RESOURCE, key=UUID_KEY) == {
- "resource": RESOURCE,
- "uuid": UUID_KEY,
- }
-
-
-def test_get_filter_id() -> None:
- from superset.key_value.utils import get_filter
-
- assert get_filter(resource=RESOURCE, key=ID_KEY) == {
- "resource": RESOURCE,
- "id": ID_KEY,
- }
+ assert get_filter(resource=RESOURCE, key=key) == expected_filter
def test_encode_permalink_id_valid() -> None:
+ """Test encoding permalink ID with valid input."""
from superset.key_value.utils import encode_permalink_key
salt = "abc"
assert encode_permalink_key(1, salt) == "AyBn4lm9qG8"
Review Comment:
**Suggestion:** The test asserts an exact encoded string for
`encode_permalink_key`, which is brittle (depends on `HASHIDS_MIN_LENGTH` and
implementation details). Instead, assert roundtrip correctness by encoding and
then decoding and verifying the original ID is recovered. [possible bug]
**Severity Level:** Critical 🚨
```suggestion
    """Test encoding permalink ID with valid input by asserting roundtrip decode."""
    from superset.key_value.utils import decode_permalink_id, encode_permalink_key

    salt = "abc"
    encoded = encode_permalink_key(1, salt)
    assert decode_permalink_id(encoded, salt) == 1
```
<details>
<summary><b>Why it matters?</b></summary>
Asserting an exact encoded output ties the test to implementation details
(hashids min length, alphabet) and is brittle. Testing encode/decode roundtrip
is a robust, verifiable improvement that ensures behaviour without depending on
exact string format.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/utils_test.py
**Line:** 48:52
**Comment:**
*Possible Bug: The test asserts an exact encoded string for
`encode_permalink_key`, which is brittle (depends on `HASHIDS_MIN_LENGTH` and
implementation details). Instead, assert roundtrip correctness by encoding and
then decoding and verifying the original ID is recovered.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
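The roundtrip style can be shown with a stand-in codec (urlsafe base64 here, not hashids), making the point that the test pins behavior rather than an exact encoded string:

```python
import base64

def encode_key(key: int, salt: str) -> str:
    """Stand-in encoder: pack salt and id, then urlsafe-base64 them."""
    return base64.urlsafe_b64encode(f"{salt}:{key}".encode()).decode()

def decode_key(encoded: str, salt: str) -> int:
    """Inverse of encode_key; rejects a mismatched salt."""
    prefix, _, raw = base64.urlsafe_b64decode(encoded).decode().partition(":")
    if prefix != salt:
        raise ValueError("salt mismatch")
    return int(raw)
```

A test asserting `decode_key(encode_key(n, salt), salt) == n` stays green even if the encoder's alphabet or padding changes, unlike one asserting the literal encoded string.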
##########
tests/unit_tests/key_value/test_shared_entries_migration.py:
##########
@@ -0,0 +1,135 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Test hash algorithm migration for shared entries."""
+
+from unittest.mock import MagicMock, patch
+from uuid import uuid3
+
+
+def test_get_shared_value_fallback_to_md5() -> None:
+    """Test that get_shared_value falls back to MD5 when SHA-256 doesn't find entry."""
+ from superset.key_value.shared_entries import get_shared_value
+ from superset.key_value.types import SharedKey
+ from superset.key_value.utils import get_uuid_namespace_with_algorithm
+
+ key = SharedKey.DASHBOARD_PERMALINK_SALT
+ expected_value = "test_salt_value_12345"
+
+ # Calculate what the MD5 UUID would be
+ namespace_md5 = get_uuid_namespace_with_algorithm("", "md5")
+ uuid_md5 = uuid3(namespace_md5, key)
Review Comment:
**Suggestion:** Passing a non-string `SharedKey` value into `uuid3` can
raise a TypeError if `SharedKey` is an Enum (or any non-str/bytes type);
explicitly convert `key` to `str` when calling `uuid3` to ensure a valid name
type is provided. [type error]
**Severity Level:** Minor ⚠️
```suggestion
uuid_md5 = uuid3(namespace_md5, str(key))
```
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/test_shared_entries_migration.py
**Line:** 34:34
**Comment:**
*Type Error: Passing a non-string `SharedKey` value into `uuid3` can
raise a TypeError if `SharedKey` is an Enum (or any non-str/bytes type);
explicitly convert `key` to `str` when calling `uuid3` to ensure a valid name
type is provided.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
tests/unit_tests/key_value/test_shared_entries_migration.py:
##########
@@ -0,0 +1,135 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Test hash algorithm migration for shared entries."""
+
+from unittest.mock import MagicMock, patch
+from uuid import uuid3
+
+
+def test_get_shared_value_fallback_to_md5() -> None:
+    """Test that get_shared_value falls back to MD5 when SHA-256 doesn't find entry."""
+ from superset.key_value.shared_entries import get_shared_value
+ from superset.key_value.types import SharedKey
+ from superset.key_value.utils import get_uuid_namespace_with_algorithm
+
+ key = SharedKey.DASHBOARD_PERMALINK_SALT
+ expected_value = "test_salt_value_12345"
+
+ # Calculate what the MD5 UUID would be
+ namespace_md5 = get_uuid_namespace_with_algorithm("", "md5")
+ uuid_md5 = uuid3(namespace_md5, key)
+
+ # Mock KeyValueDAO to simulate MD5 entry exists, SHA-256 doesn't
+ mock_dao = MagicMock()
+
+ def mock_get_value(resource, uuid_key, codec):
+ # Only return value if UUID matches MD5 version
+ if uuid_key == uuid_md5:
+ return expected_value
+ return None
+
+ mock_dao.get_value.side_effect = mock_get_value
+
+ # Mock current_app to use SHA-256 with MD5 fallback
+ mock_app = MagicMock()
+ mock_app.config = {
+ "HASH_ALGORITHM": "sha256",
+ "HASH_ALGORITHM_FALLBACKS": ["md5"],
+ }
+
+ with patch("superset.key_value.shared_entries.KeyValueDAO", mock_dao):
+ with patch("superset.key_value.utils.current_app", mock_app):
Review Comment:
**Suggestion:** Using `patch(..., mock_dao)` (positional second arg) is
valid but ambiguous; be explicit by passing `new=mock_dao` and `new=mock_app`
so the patch target is clearly replaced with the provided mock, avoiding
accidental misuse or confusion about which argument is being supplied.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
    with patch("superset.key_value.shared_entries.KeyValueDAO", new=mock_dao):
        with patch("superset.key_value.utils.current_app", new=mock_app):
```
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/test_shared_entries_migration.py
**Line:** 54:55
**Comment:**
*Possible Bug: Using `patch(..., mock_dao)` (positional second arg) is
valid but ambiguous; be explicit by passing `new=mock_dao` and `new=mock_app`
so the patch target is clearly replaced with the provided mock, avoiding
accidental misuse or confusion about which argument is being supplied.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/utils/cache.py:
##########
@@ -31,14 +31,14 @@
from superset.constants import CACHE_DISABLED_TIMEOUT
from superset.extensions import cache_manager
from superset.models.cache import CacheKey
-from superset.utils.hashing import md5_sha_from_dict
+from superset.utils.hashing import hash_from_dict
from superset.utils.json import json_int_dttm_ser
logger = logging.getLogger(__name__)
def generate_cache_key(values_dict: dict[str, Any], key_prefix: str = "") -> str:
- hash_str = md5_sha_from_dict(values_dict, default=json_int_dttm_ser)
+ hash_str = hash_from_dict(values_dict, default=json_int_dttm_ser)
Review Comment:
**Suggestion:** The new call to `hash_from_dict` doesn't honor a
configurable hash algorithm; the decorator/utility change in the PR title
suggests algorithm selection should be supported. Pass the configured algorithm
(e.g. `app.config["CACHE_HASH_ALGORITHM"]`) into `hash_from_dict` so cache keys
use the intended algorithm. [logic error]
**Severity Level:** Minor ⚠️
```suggestion
hash_str = hash_from_dict(
values_dict,
default=json_int_dttm_ser,
algorithm=app.config.get("CACHE_HASH_ALGORITHM"),
)
```
<details>
<summary><b>Why it matters?</b></summary>
hash_from_dict accepts an algorithm parameter (see hashing.py), so passing a
configured algorithm (e.g., app.config["CACHE_HASH_ALGORITHM"]) is a valid
enhancement that makes cache keys consistent with any deploy-time
configuration. This isn't strictly a bug fix but it is a sensible, actionable
improvement tied to the new use of hash_from_dict.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/cache.py
**Line:** 35:41
**Comment:**
*Logic Error: The new call to `hash_from_dict` doesn't honor a
configurable hash algorithm; the decorator/utility change in the PR title
suggests algorithm selection should be supported. Pass the configured algorithm
(e.g. `app.config["CACHE_HASH_ALGORITHM"]`) into `hash_from_dict` so cache keys
use the intended algorithm.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/key_value/utils.py:
##########
@@ -66,13 +68,94 @@ def decode_permalink_id(key: str, salt: str) -> int:
raise KeyValueParseKeyError(_("Invalid permalink key"))
-def get_uuid_namespace(seed: str) -> UUID:
+def _uuid_namespace_from_md5(seed: str) -> UUID:
+ """Generate UUID namespace from MD5 hash (legacy compatibility)."""
md5_obj = md5() # noqa: S324
md5_obj.update(seed.encode("utf-8"))
return UUID(md5_obj.hexdigest())
+def _uuid_namespace_from_sha256(seed: str) -> UUID:
+ """Generate UUID namespace from SHA-256 hash (first 16 bytes)."""
+ sha256_obj = hashlib.sha256()
+ sha256_obj.update(seed.encode("utf-8"))
+ # Use first 16 bytes of SHA-256 digest for UUID
+ return UUID(bytes=sha256_obj.digest()[:16])
+
+
+# UUID namespace generator dispatch table
+_UUID_NAMESPACE_GENERATORS = {
+ "md5": _uuid_namespace_from_md5,
+ "sha256": _uuid_namespace_from_sha256,
+}
+
+
+def get_uuid_namespace_with_algorithm(seed: str, algorithm: str) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using specified hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ UUID namespace
+ """
+ generator = _UUID_NAMESPACE_GENERATORS.get(algorithm)
+ if generator is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+ return generator(seed)
+
+
+def get_uuid_namespace(seed: str, app: Any = None) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using configured hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ UUID namespace
+ """
+ app = app or current_app
+ algorithm = app.config["HASH_ALGORITHM"]
+ return get_uuid_namespace_with_algorithm(seed, algorithm)
+
+
+def get_deterministic_uuid_with_algorithm(
+ namespace: str, payload: Any, algorithm: str
+) -> UUID:
+ """
+ Get a deterministic UUID (uuid3) using specified hash algorithm.
+
+ Args:
+ namespace: Namespace string for UUID generation
+ payload: JSON-serializable payload
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ Deterministic UUID
+ """
+ payload_str = json_dumps_w_dates(payload, sort_keys=True)
+    return uuid3(get_uuid_namespace_with_algorithm(namespace, algorithm), payload_str)
+
+
def get_deterministic_uuid(namespace: str, payload: Any) -> UUID:
    """Get a deterministic UUID (uuid3) from a salt and a JSON-serializable payload."""
payload_str = json_dumps_w_dates(payload, sort_keys=True)
return uuid3(get_uuid_namespace(namespace), payload_str)
+
+
+def get_fallback_algorithms(app: Any = None) -> list[str]:
+ """
+ Get the list of fallback hash algorithms from config.
+
+ Args:
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ List of fallback algorithm names (empty list if none configured)
+ """
+ app = app or current_app
+ return app.config.get("HASH_ALGORITHM_FALLBACKS", [])
Review Comment:
**Suggestion:** `get_fallback_algorithms` returns the raw config value which
may be None, a string, or another type; this can cause callers to receive
unexpected types. Normalize the return to always be a list of strings (empty
list if not configured) by coercing `None` or single-string values into a list.
[type error]
**Severity Level:** Minor ⚠️
```suggestion
value = app.config.get("HASH_ALGORITHM_FALLBACKS", [])
if value is None:
return []
if isinstance(value, str):
return [value]
try:
return list(value)
except TypeError:
return []
```
<details>
<summary><b>Why it matters?</b></summary>
Normalizing the configuration value to always return a list[str] is
practical and prevents callers from handling unexpected types (None or single
string). The improved_code is sensible: it handles None, strings, and attempts
to coerce iterables to lists. It's a defensive, low-risk improvement that
matches the documented return type.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/key_value/utils.py
**Line:** 161:161
**Comment:**
*Type Error: `get_fallback_algorithms` returns the raw config value
which may be None, a string, or another type; this can cause callers to receive
unexpected types. Normalize the return to always be a list of strings (empty
list if not configured) by coercing `None` or single-string values into a list.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
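The proposed normalization is small enough to verify in isolation; a sketch, with a hypothetical helper name:

```python
def normalize_fallbacks(value: object) -> list:
    """Coerce a config value into a list of algorithm names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # a bare "md5" becomes ["md5"], not ["m", "d", "5"]
    try:
        return [str(item) for item in value]
    except TypeError:
        return []  # non-iterable config values degrade to "no fallbacks"
```

Checking for `str` before iterating is the important subtlety: strings are iterable, so a naive `list(value)` would silently explode a single algorithm name into characters.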
##########
tests/integration_tests/utils/hashing_tests.py:
##########
@@ -17,80 +17,184 @@
import datetime
import math
from typing import Any
Review Comment:
**Suggestion:** The tests later use json.dumps to produce serialized JSON
strings, but the file doesn't import the `json` module; this will cause a
NameError when replacing hard-coded serialized strings with `json.dumps` or
when adding new json-based assertions. Add `import json` to the imports block.
[possible bug]
**Severity Level:** Critical 🚨
```suggestion
import json
```
<details>
<summary><b>Why it matters?</b></summary>
The test file currently doesn't import json. If we adopt other suggestions
to replace brittle hard-coded JSON strings with json.dumps, the missing import
would cause a NameError. Adding the import is a harmless, necessary change when
switching to json.dumps and is correct here.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/integration_tests/utils/hashing_tests.py
**Line:** 19:19
**Comment:**
*Possible Bug: The tests later use json.dumps to produce serialized
JSON strings, but the file doesn't import the `json` module; this will cause a
NameError when replacing hard-coded serialized strings with `json.dumps` or
when adding new json-based assertions. Add `import json` to the imports block.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/extensions/metastore_cache.py:
##########
@@ -54,7 +54,7 @@ def factory(
cls, app: Flask, config: dict[str, Any], args: list[Any], kwargs:
dict[str, Any]
) -> BaseCache:
seed = config.get("CACHE_KEY_PREFIX", "")
- kwargs["namespace"] = get_uuid_namespace(seed)
+ kwargs["namespace"] = get_uuid_namespace(seed, app)
Review Comment:
**Suggestion:** Calling `get_uuid_namespace(seed, app)` can raise a KeyError
if `app.config["HASH_ALGORITHM"]` is not set inside `get_uuid_namespace`;
retrieve the hash algorithm with a safe default and call the underlying helper
that accepts an explicit algorithm to avoid crashing when the config entry is
missing. [possible bug]
**Severity Level:** Critical 🚨
```suggestion
        # pylint: disable=import-outside-toplevel
        from superset.key_value.utils import get_uuid_namespace_with_algorithm

        algorithm = app.config.get("HASH_ALGORITHM", "md5")
        kwargs["namespace"] = get_uuid_namespace_with_algorithm(seed, algorithm)
```
<details>
<summary><b>Why it matters?</b></summary>
get_uuid_namespace (in utils.py) indexes app.config["HASH_ALGORITHM"]
directly, so a missing config key can raise KeyError during cache factory
initialization. Replacing the call with an explicit helper call that reads the
config with .get(..., default) or otherwise provides a safe default is a
reasonable defensive fix. The change is localized and prevents an easily
avoidable crash during app startup.
</details>
<details>
<summary><b>Prompt for AI Agent 🤖</b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/extensions/metastore_cache.py
**Line:** 57:57
**Comment:**
*Possible Bug: Calling `get_uuid_namespace(seed, app)` can raise a
KeyError if `app.config["HASH_ALGORITHM"]` is not set inside
`get_uuid_namespace`; retrieve the hash algorithm with a safe default and call
the underlying helper that accepts an explicit algorithm to avoid crashing when
the config entry is missing.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
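The defensive lookup reduces to a small helper; a sketch using a plain dict in place of `app.config`, with the `md5` default assumed from the suggestion rather than taken from Superset's actual defaults:

```python
def pick_algorithm(config: dict, default: str = "md5") -> str:
    """Read HASH_ALGORITHM with a safe default and validate the result."""
    algorithm = config.get("HASH_ALGORITHM", default)
    if algorithm not in ("md5", "sha256"):
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")
    return algorithm
```

A missing key yields the default instead of a bare KeyError, while a present-but-bogus value still fails loudly with a message naming the offending algorithm.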
##########
superset/key_value/utils.py:
##########
@@ -66,13 +68,94 @@ def decode_permalink_id(key: str, salt: str) -> int:
raise KeyValueParseKeyError(_("Invalid permalink key"))
-def get_uuid_namespace(seed: str) -> UUID:
+def _uuid_namespace_from_md5(seed: str) -> UUID:
+ """Generate UUID namespace from MD5 hash (legacy compatibility)."""
md5_obj = md5() # noqa: S324
md5_obj.update(seed.encode("utf-8"))
return UUID(md5_obj.hexdigest())
+def _uuid_namespace_from_sha256(seed: str) -> UUID:
+ """Generate UUID namespace from SHA-256 hash (first 16 bytes)."""
+ sha256_obj = hashlib.sha256()
+ sha256_obj.update(seed.encode("utf-8"))
+ # Use first 16 bytes of SHA-256 digest for UUID
+ return UUID(bytes=sha256_obj.digest()[:16])
+
+
+# UUID namespace generator dispatch table
+_UUID_NAMESPACE_GENERATORS = {
+ "md5": _uuid_namespace_from_md5,
+ "sha256": _uuid_namespace_from_sha256,
+}
+
+
+def get_uuid_namespace_with_algorithm(seed: str, algorithm: str) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using specified hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ algorithm: Hash algorithm to use ('sha256' or 'md5')
+
+ Returns:
+ UUID namespace
+ """
+ generator = _UUID_NAMESPACE_GENERATORS.get(algorithm)
+ if generator is None:
+ raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+ return generator(seed)
+
+
+def get_uuid_namespace(seed: str, app: Any = None) -> UUID:
+ """
+    Generate a UUID namespace from a seed string using configured hash algorithm.
+
+ Args:
+ seed: Seed string for namespace generation
+ app: Flask app instance (optional, uses current_app if not provided)
+
+ Returns:
+ UUID namespace
+ """
+ app = app or current_app
+ algorithm = app.config["HASH_ALGORITHM"]
Review Comment:
**Suggestion:** `get_uuid_namespace` reads `app.config["HASH_ALGORITHM"]`
without a safe lookup; if the configuration key is missing this raises a
KeyError at runtime. Use `app.config.get("HASH_ALGORITHM", "<default>")` (or
raise a clear ValueError) so the code fails with a clearer message or uses a
sensible default. [possible bug]
**Severity Level:** Critical π¨
```suggestion
algorithm = app.config.get("HASH_ALGORITHM", "md5")
```
<details>
<summary><b>Why it matters? β </b></summary>
The suggestion is reasonable: directly indexing app.config can raise
KeyError if the configuration key is missing, which surfaces as a less-helpful
error. Making the lookup explicit (either defaulting or raising a clearer
error) improves robustness and diagnostics. Defaulting to "md5" is a sensible
choice given the code has an MD5 legacy path, but switching to a default is a
behavioral change that should be discussed. Still, the core ideaβavoid
unguarded config indexingβis valid.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/key_value/utils.py
**Line:** 122:122
**Comment:**
*Possible Bug: `get_uuid_namespace` reads
`app.config["HASH_ALGORITHM"]` without a safe lookup; if the configuration key
is missing this raises a KeyError at runtime. Use
`app.config.get("HASH_ALGORITHM", "<default>")` (or raise a clear ValueError)
so the code fails with a clearer message or uses a sensible default.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
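The diff above dispatches namespace generation through a table keyed by algorithm name. A self-contained sketch of that pattern for reference (the helper names here are illustrative, not Superset's actual functions):

```python
import hashlib
from uuid import UUID


def namespace_from_md5(seed: str) -> UUID:
    # An MD5 digest is exactly 16 bytes, so its hex digest maps directly onto a UUID
    return UUID(hashlib.md5(seed.encode("utf-8")).hexdigest())  # noqa: S324


def namespace_from_sha256(seed: str) -> UUID:
    # SHA-256 yields 32 bytes; only the first 16 feed the UUID
    return UUID(bytes=hashlib.sha256(seed.encode("utf-8")).digest()[:16])


GENERATORS = {"md5": namespace_from_md5, "sha256": namespace_from_sha256}


def namespace(seed: str, algorithm: str) -> UUID:
    try:
        return GENERATORS[algorithm](seed)
    except KeyError:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}") from None
```

Both paths are deterministic for a given seed, which is what keeps permalink keys stable per algorithm and makes the fallback lookup in the create command possible.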
##########
tests/unit_tests/key_value/utils_test.py:
##########
@@ -28,33 +29,159 @@
ID_KEY = 123
-def test_get_filter_uuid() -> None:
+@pytest.mark.parametrize(
+ "key,expected_filter",
+ [
+ (UUID_KEY, {"resource": RESOURCE, "uuid": UUID_KEY}),
+ (ID_KEY, {"resource": RESOURCE, "id": ID_KEY}),
Review Comment:
**Suggestion:** The test asserts the filter's `resource` key equals the Enum
member `RESOURCE`, but `get_filter` returns `resource.value` (the enum's
value). This will make the equality assertion fail because the types/values
differ; change expected dictionaries to use `RESOURCE.value`. [logic error]
**Severity Level:** Minor β οΈ
```suggestion
(UUID_KEY, {"resource": RESOURCE.value, "uuid": UUID_KEY}),
(ID_KEY, {"resource": RESOURCE.value, "id": ID_KEY}),
```
<details>
<summary><b>Why it matters? β </b></summary>
This is a real failing assertion: utils.get_filter builds {"resource":
resource.value}, so the test must compare against RESOURCE.value. The proposed
change directly fixes the mismatch and will make the test pass.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/key_value/utils_test.py
**Line:** 35:36
**Comment:**
*Logic Error: The test asserts the filter's `resource` key equals the
Enum member `RESOURCE`, but `get_filter` returns `resource.value` (the enum's
value). This will make the equality assertion fail because the types/values
differ; change expected dictionaries to use `RESOURCE.value`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
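Whether the assertion actually fails depends on how the resource enum is defined. A sketch with a plain `Enum` (an illustrative stand-in, not Superset's `KeyValueResource`) reproduces the mismatch the comment describes:

```python
from enum import Enum


class Resource(Enum):
    # Illustrative stand-in for Superset's KeyValueResource
    DASHBOARD_PERMALINK = "dashboard_permalink"


def get_filter(resource: Resource, key: int) -> dict:
    # Mirrors the pattern under review: the filter stores the enum's *value*
    return {"resource": resource.value, "id": key}
```

With a plain `Enum`, `Resource.DASHBOARD_PERMALINK != "dashboard_permalink"`, so an expected dict built from the member never equals the returned filter. If the real enum mixes in `str` (`class KeyValueResource(str, Enum)`), member and value compare equal and the original test would pass, so that is worth confirming before applying the suggestion.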
##########
superset/db_engine_specs/bigquery.py:
##########
@@ -290,7 +290,7 @@ def _truncate_label(cls, label: str) -> str:
:param label: expected expression label
:return: truncated label
"""
- return "_" + md5_sha_from_str(label)
+ return "_" + hash_from_str(label)
Review Comment:
**Suggestion:** `_truncate_label` now returns an underscore + full hash
which may exceed column name length limits; truncate the hash to ensure the
returned label always respects `max_column_name_length` (reserve one char for
the leading underscore). [logic error]
**Severity Level:** Minor β οΈ
```suggestion
        # Ensure truncated label respects the engine's max column name length.
max_len = getattr(cls, "max_column_name_length", None) or 128
return "_" + hash_from_str(label)[: max_len - 1]
```
<details>
<summary><b>Why it matters? β </b></summary>
Returning the entire hash could, in theory, produce a label longer than the
engine's max_column_name_length if hashing algorithm or limits change.
Truncating the hashed suffix to respect max_column_name_length (reserving one
char for the leading underscore) is a harmless hardening that prevents
generating invalid column names. This addresses a potential production issue
(invalid/too-long identifiers) visible in the diff.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/db_engine_specs/bigquery.py
**Line:** 293:293
**Comment:**
*Logic Error: `_truncate_label` now returns an underscore + full hash
which may exceed column name length limits; truncate the hash to ensure the
returned label always respects `max_column_name_length` (reserve one char for
the leading underscore).
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
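The cap the comment asks for can be sketched independently of Superset (sha256 and the default limit of 128 are assumptions here; BigQuery's real limit and hash helper may differ):

```python
import hashlib


def truncate_label(label: str, max_column_name_length: int = 128) -> str:
    # Reserve one character for the leading underscore so the whole
    # identifier stays within the engine's column-name limit
    digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return "_" + digest[: max_column_name_length - 1]
```

A 64-character SHA-256 hex digest fits comfortably under a 128-character limit today, so the slice only bites if the limit shrinks or the digest grows, which is exactly the hardening the reviewer is after.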
##########
superset/utils/core.py:
##########
@@ -1005,7 +1005,7 @@ def form_data_to_adhoc(form_data: dict[str, Any], clause: str) -> AdhocFilterClause
"expressionType": "SQL",
"sqlExpression": form_data.get(clause),
}
- result["filterOptionName"] = md5_sha_from_dict(cast(dict[Any, Any], result))
+ result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result))
Review Comment:
**Suggestion:** Same issue as above for `form_data_to_adhoc`: hashing
`result` directly can raise if `sqlExpression` or other values are
mutable/unhashable; normalize the dict (convert lists to tuples) and hash the
normalized copy to produce a stable `filterOptionName`. [possible bug]
**Severity Level:** Critical π¨
```suggestion
    result_for_hash = {k: tuple(v) if isinstance(v, list) else v for k, v in result.items()}
    result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result_for_hash))
```
<details>
<summary><b>Why it matters? β </b></summary>
Same rationale as above but for form_data_to_adhoc: the dict being hashed
can contain values that are mutable/unhashable. Normalizing the dict before
hashing is a safe improvement that prevents errors and makes identifiers
deterministic.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/core.py
**Line:** 1008:1008
**Comment:**
*Possible Bug: Same issue as above for `form_data_to_adhoc`: hashing
`result` directly can raise if `sqlExpression` or other values are
mutable/unhashable; normalize the dict (convert lists to tuples) and hash the
normalized copy to produce a stable `filterOptionName`.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/utils/hashing.py:
##########
@@ -14,23 +14,84 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+from __future__ import annotations
+
import hashlib
-from typing import Any, Callable, Optional
+import logging
+from typing import Any, Callable, Literal, Optional
+
+from flask import current_app
from superset.utils import json
+logger = logging.getLogger(__name__)
+
+HashAlgorithm = Literal["md5", "sha256"]
+
+# Hash function lookup table for efficient dispatch
+_HASH_FUNCTIONS: dict[str, Callable[[bytes], str]] = {
+ "sha256": lambda data: hashlib.sha256(data).hexdigest(),
+ "md5": lambda data: hashlib.md5(data).hexdigest(), # noqa: S324
+}
+
+
+def get_hash_algorithm() -> HashAlgorithm:
+ """
+ Get the configured hash algorithm for non-cryptographic purposes.
+
+ Returns:
+ Hash algorithm name ('md5' or 'sha256')
+ """
+ return current_app.config["HASH_ALGORITHM"]
Review Comment:
**Suggestion:** Accessing `current_app.config["HASH_ALGORITHM"]` will raise
a KeyError if the config key is missing and doesn't validate the configured
value; use `.get()` with a secure default and validate membership in supported
algorithms, logging and falling back if invalid. [possible bug]
**Severity Level:** Critical π¨
```suggestion
alg = current_app.config.get("HASH_ALGORITHM", "sha256")
if alg not in _HASH_FUNCTIONS:
logger.warning(
            "Unsupported HASH_ALGORITHM '%s' in app config; falling back to 'sha256'", alg
)
return "sha256"
return alg
```
<details>
<summary><b>Why it matters? β </b></summary>
The current code will raise a KeyError if HASH_ALGORITHM is absent and
doesn't validate values.
The proposed change hardens the function by providing a safe default and
validating membership
against _HASH_FUNCTIONS, preventing runtime errors and unexpected algorithms.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/hashing.py
**Line:** 45:45
**Comment:**
*Possible Bug: Accessing `current_app.config["HASH_ALGORITHM"]` will
raise a KeyError if the config key is missing and doesn't validate the
configured value; use `.get()` with a secure default and validate membership in
supported algorithms, logging and falling back if invalid.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
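The defensive lookup pattern, sketched against a plain dict instead of Flask's `app.config` (the `"sha256"` default is the reviewer's choice, not necessarily the project's):

```python
import hashlib
import logging

logger = logging.getLogger(__name__)

HASH_FUNCTIONS = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
    "md5": lambda data: hashlib.md5(data).hexdigest(),  # noqa: S324
}


def get_hash_algorithm(config: dict) -> str:
    # .get() avoids a KeyError on a missing key; the membership check
    # rejects typos instead of failing later at dispatch time
    alg = config.get("HASH_ALGORITHM", "sha256")
    if alg not in HASH_FUNCTIONS:
        logger.warning("Unsupported HASH_ALGORITHM %r; falling back to sha256", alg)
        return "sha256"
    return alg
```

Logging and falling back keeps the app serving requests on a misconfiguration; raising a `ValueError` instead would be the stricter alternative if silent fallback is considered a behavior change.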
##########
superset/utils/core.py:
##########
@@ -992,7 +992,7 @@ def simple_filter_to_adhoc(
}
if filter_clause.get("isExtra"):
result["isExtra"] = True
- result["filterOptionName"] = md5_sha_from_dict(cast(dict[Any, Any], result))
+ result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result))
Review Comment:
**Suggestion:** Hashing `result` directly can fail or be non-deterministic
if any values (e.g. `comparator`) are mutable/unhashable (lists) or if the dict
is later mutated; build a normalized, immutable copy of `result` (convert lists
to tuples) before calling `hash_from_dict` so the computed `filterOptionName`
is stable and won't raise on unhashable values. [possible bug]
**Severity Level:** Critical π¨
```suggestion
    result_for_hash = {k: tuple(v) if isinstance(v, list) else v for k, v in result.items()}
    result["filterOptionName"] = hash_from_dict(cast(dict[Any, Any], result_for_hash))
```
<details>
<summary><b>Why it matters? β </b></summary>
This is a reasonable defensive suggestion: result may contain lists (e.g.
comparator) and different hashing implementations may behave inconsistently or
fail on mutable/unhashable nested objects. Normalizing mutable sequences (lists
-> tuples) into an immutable structure before calling hash_from_dict produces a
more stable, predictable filterOptionName and avoids potential runtime errors.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/core.py
**Line:** 995:995
**Comment:**
*Possible Bug: Hashing `result` directly can fail or be
non-deterministic if any values (e.g. `comparator`) are mutable/unhashable
(lists) or if the dict is later mutated; build a normalized, immutable copy of
`result` (convert lists to tuples) before calling `hash_from_dict` so the
computed `filterOptionName` is stable and won't raise on unhashable values.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
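Whether the normalization is strictly needed depends on `hash_from_dict`'s implementation (a `json.dumps`-based serializer already copes with lists). The technique itself, sketched with a hypothetical JSON-backed hash:

```python
import hashlib
import json


def hash_from_dict(obj: dict) -> str:
    # Hypothetical stand-in: deterministic JSON serialization, then SHA-256
    payload = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def normalize_for_hash(obj: dict) -> dict:
    # Freeze mutable sequence values so later mutation of the original
    # dict cannot retroactively disagree with the stored hash
    return {k: tuple(v) if isinstance(v, list) else v for k, v in obj.items()}
```

Note `sort_keys=True` already makes the hash independent of insertion order; the tuple conversion only adds protection against the caller mutating the list after hashing.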
##########
tests/integration_tests/utils_tests.py:
##########
@@ -266,7 +266,7 @@ def test_ssl_certificate_parse(self):
def test_ssl_certificate_file_creation(self):
path = create_ssl_cert_file(ssl_certificate)
- expected_filename = md5_sha_from_str(ssl_certificate)
+ expected_filename = hash_from_str(ssl_certificate)
assert expected_filename in path
Review Comment:
**Suggestion:** The test asserts `expected_filename in path` which can match
directory components by accident; verify the filename portion of the returned
path (basename) instead to avoid false positives. [logic error]
**Severity Level:** Minor β οΈ
```suggestion
    # verify the filename portion only to avoid matching directory names that may contain the same substring
assert expected_filename in os.path.basename(path)
```
<details>
<summary><b>Why it matters? β </b></summary>
Valid suggestion. Checking the basename (os.path.basename(path)) is a safer
assertion β it prevents accidental matches against directory names and makes
the test intention explicit. The improved code directly fixes a potential
false-positive test without requiring broader changes.
</details>
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/integration_tests/utils_tests.py
**Line:** 270:270
**Comment:**
*Logic Error: The test asserts `expected_filename in path` which can
match directory components by accident; verify the filename portion of the
returned path (basename) instead to avoid false positives.
Validate the correctness of the flagged issue. If correct, how can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
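The false positive the comment describes is easy to reproduce; the path below is made up for illustration:

```python
import os.path

digest = "abc123"
# A directory component happens to contain the digest, but the file does not
path = "/tmp/abc123_certs/unrelated.pem"

assert digest in path                        # passes for the wrong reason
assert digest not in os.path.basename(path)  # the basename check catches it
```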
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]