cloud-fan commented on code in PR #56026:
URL: https://github.com/apache/spark/pull/56026#discussion_r3281452915
##########
dev/merge_spark_pr.py:
##########
@@ -773,6 +773,169 @@ def standardize_jira_ref(text):
return clean_text
+# Recognized Spark component tags. A PR title must contain at least one of
+# these (within square brackets) to be merged. The list shown to the committer
+# in prompt_for_components is a subset of this set selected for prominence.
+# Status markers like [MINOR] or [FOLLOWUP] are intentionally absent and do
+# not satisfy the component requirement. [WIP] is also absent: a WIP PR
+# should be aborted at the earlier WIP warning, not merged.
+COMPONENT_MARKERS = frozenset(
Review Comment:
The allowlist has 23 entries, but SPARK's active JIRA components number ~32
(per `issues.apache.org/jira/rest/api/2/project/SPARK/components`). Several
active components have no allowlist entry: `Declarative Pipelines` (would be
`[PIPELINES]` per the "last-word-uppercased" convention, or `[SDP]` as 77
recent commits actually use), `Deploy`, `DStreams`, `Examples`, `Optimizer`,
`Scheduler`, `Shuffle`, `Spark Docker`, `Spark Shell`, `Spark Submit`,
`Windows`, `Block Manager`.
What's the selection rule for what goes in `COMPONENT_MARKERS`? A few
possibilities:
1. **Full JIRA canonical list** — derive from
`spark_jira_utils.list_components` (network-live) or snapshot it here, so any
future JIRA addition is recognized.
2. **Curated "common subsystems" only** — explicitly document the criterion
(commit volume? top-level visibility?) so the next maintainer knows when to add
or remove an entry.
3. **Status quo** — keep the current set as-is and force committers to add a
covering canonical tag for everything else (`[PYTHON][SDP]`, `[CORE][SHUFFLE]`,
etc.).
If the answer is (3), it's worth saying so in the header comment — and
especially worth confirming for `Declarative Pipelines`, since `[SDP]` is the
largest single non-canonical tag in recent history and the SDP team has
presumably been using it as a deliberate subsystem marker.
##########
dev/merge_spark_pr.py:
##########
@@ -773,6 +773,169 @@ def standardize_jira_ref(text):
return clean_text
+# Recognized Spark component tags. A PR title must contain at least one of
+# these (within square brackets) to be merged. The list shown to the committer
+# in prompt_for_components is a subset of this set selected for prominence.
+# Status markers like [MINOR] or [FOLLOWUP] are intentionally absent and do
+# not satisfy the component requirement. [WIP] is also absent: a WIP PR
+# should be aborted at the earlier WIP warning, not merged.
+COMPONENT_MARKERS = frozenset(
+ {
+ "AVRO",
+ "BUILD",
+ "CONNECT",
+ "CORE",
+ "DOC",
+ "DOCS",
+ "GRAPHX",
+ "INFRA",
+ "K8S",
+ "ML",
+ "MLLIB",
+ "PROTOBUF",
+ "PS",
+ "PYTHON",
+ "R",
+ "REPL",
+ "SECURITY",
+ "SQL",
+ "SS",
+ "STREAMING",
+ "TESTS",
+ "UI",
+ "YARN",
+ }
+)
+
+
+def title_missing_component(title):
+ """
+ Return True when the PR title has no bracket whose content is in
+ COMPONENT_MARKERS.
+
+ A compliant Spark PR title looks like:
[SPARK-XXXXX][COMPONENT1][COMPONENT2] Rest,
+ where at least one [COMPONENT] is in COMPONENT_MARKERS. Status markers like
+ [MINOR] or [FOLLOWUP] are not in COMPONENT_MARKERS and don't satisfy the
+ requirement. Unknown tags like [FOO] also don't satisfy it. Revert PRs,
+ which reuse the original commit's title verbatim, are not flagged.
+
+ >>> title_missing_component("[SPARK-56853] Improve PATH Tests")
+ True
+ >>> title_missing_component("[SPARK-56853][SQL] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-56853][SQL][TESTS] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-1234][SPARK-5678] Title")
+ True
+ >>> title_missing_component("[SPARK-1234][SPARK-5678][CORE] Title")
+ False
+ >>> title_missing_component('Revert "[SPARK-48591][PYTHON] Simplify"')
+ False
+ >>> title_missing_component("Plain title with no brackets")
+ True
+ >>> title_missing_component("[SPARK-1234][MINOR] Fix typo")
+ True
+ >>> title_missing_component("[SPARK-1234][FOLLOWUP] Follow up")
+ True
+ >>> title_missing_component("[SPARK-1234][TRIVIAL] Trivial fix")
+ True
+ >>> title_missing_component("[SPARK-1234][SQL][MINOR] SQL typo fix")
+ False
+ >>> title_missing_component("[SPARK-1234][FOO] Unknown component")
+ True
+ >>> title_missing_component("[MINOR] Fix typo")
+ True
+ """
+ if title.startswith('Revert "') and title.endswith('"'):
+ return False
+ brackets = re.findall(r"\[([^\]]+)\]", title)
+ return not any(b.strip().upper() in COMPONENT_MARKERS for b in brackets)
+
+
+def insert_components_into_title(title, components):
+ """
+ Insert [COMPONENT1][COMPONENT2]... after the leading [SPARK-XXX] ref(s) in
title.
+ If the title has no leading SPARK ref, the component tag is prepended.
When the
+ remainder of the title starts with another [bracket] tag (e.g. an existing
+ sub-tag like [SDP]), the inserted block is joined to it without a space so
all
+ leading brackets stay adjacent.
+
+ >>> insert_components_into_title("[SPARK-56853] Improve PATH Tests",
["SQL"])
+ '[SPARK-56853][SQL] Improve PATH Tests'
+ >>> insert_components_into_title(
+ ... "[SPARK-1234][SPARK-5678] Title", ["CORE", "SQL"])
+ '[SPARK-1234][SPARK-5678][CORE][SQL] Title'
+ >>> insert_components_into_title("Random title", ["CORE"])
+ '[CORE] Random title'
+ >>> insert_components_into_title("[SPARK-56853]Improve PATH Tests",
["SQL"])
+ '[SPARK-56853][SQL] Improve PATH Tests'
+ >>> insert_components_into_title("[SPARK-56953][SDP] Implement", ["SQL"])
+ '[SPARK-56953][SQL][SDP] Implement'
+ >>> insert_components_into_title(
+ ... "[SPARK-56953][SDP] Implement", ["SQL", "CORE"])
+ '[SPARK-56953][SQL][CORE][SDP] Implement'
+ >>> insert_components_into_title("[MINOR] Fix typo", ["CORE"])
+ '[CORE][MINOR] Fix typo'
+ >>> insert_components_into_title("[MINOR] Fix typo", ["CORE", "SQL"])
+ '[CORE][SQL][MINOR] Fix typo'
+ """
+ component_tag = "".join("[%s]" % c for c in components)
+ m = re.match(r"^((?:\[SPARK-[0-9]{3,6}\])+)\s*(.*)$", title)
+ if m and m.group(1):
+ refs, rest = m.group(1), m.group(2)
+ if not rest:
+ return "%s%s" % (refs, component_tag)
+ sep = "" if rest.startswith("[") else " "
+ return "%s%s%s%s" % (refs, component_tag, sep, rest)
+ sep = "" if title.startswith("[") else " "
+ return "%s%s%s" % (component_tag, sep, title)
+
+
+def prompt_for_components(title):
+ """
+ Prompt the committer for component(s) when the PR title lacks a recognized
+ [COMPONENT] tag. Validates that at least one entered component is in
+ COMPONENT_MARKERS. Re-prompts until valid input is provided. Returns an
+ uppercase list.
+ """
+ print_error("PR title is missing a recognized [COMPONENT] tag.")
Review Comment:
`print_error` is the project's red-text helper for fatal/error states. This
message is a recoverable interactive prompt, not a failure — consider using
`print` (or a less alarming color) so it doesn't read as "the script broke."
Minor.
##########
dev/merge_spark_pr.py:
##########
@@ -773,6 +773,169 @@ def standardize_jira_ref(text):
return clean_text
+# Recognized Spark component tags. A PR title must contain at least one of
+# these (within square brackets) to be merged. The list shown to the committer
+# in prompt_for_components is a subset of this set selected for prominence.
+# Status markers like [MINOR] or [FOLLOWUP] are intentionally absent and do
+# not satisfy the component requirement. [WIP] is also absent: a WIP PR
+# should be aborted at the earlier WIP warning, not merged.
+COMPONENT_MARKERS = frozenset(
+ {
+ "AVRO",
+ "BUILD",
+ "CONNECT",
+ "CORE",
+ "DOC",
+ "DOCS",
+ "GRAPHX",
+ "INFRA",
+ "K8S",
+ "ML",
+ "MLLIB",
+ "PROTOBUF",
+ "PS",
+ "PYTHON",
+ "R",
+ "REPL",
+ "SECURITY",
+ "SQL",
+ "SS",
+ "STREAMING",
+ "TESTS",
+ "UI",
+ "YARN",
+ }
+)
+
+
+def title_missing_component(title):
+ """
+ Return True when the PR title has no bracket whose content is in
+ COMPONENT_MARKERS.
+
+ A compliant Spark PR title looks like:
[SPARK-XXXXX][COMPONENT1][COMPONENT2] Rest,
+ where at least one [COMPONENT] is in COMPONENT_MARKERS. Status markers like
+ [MINOR] or [FOLLOWUP] are not in COMPONENT_MARKERS and don't satisfy the
+ requirement. Unknown tags like [FOO] also don't satisfy it. Revert PRs,
+ which reuse the original commit's title verbatim, are not flagged.
+
+ >>> title_missing_component("[SPARK-56853] Improve PATH Tests")
+ True
+ >>> title_missing_component("[SPARK-56853][SQL] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-56853][SQL][TESTS] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-1234][SPARK-5678] Title")
+ True
+ >>> title_missing_component("[SPARK-1234][SPARK-5678][CORE] Title")
+ False
+ >>> title_missing_component('Revert "[SPARK-48591][PYTHON] Simplify"')
+ False
+ >>> title_missing_component("Plain title with no brackets")
+ True
+ >>> title_missing_component("[SPARK-1234][MINOR] Fix typo")
+ True
+ >>> title_missing_component("[SPARK-1234][FOLLOWUP] Follow up")
+ True
+ >>> title_missing_component("[SPARK-1234][TRIVIAL] Trivial fix")
+ True
+ >>> title_missing_component("[SPARK-1234][SQL][MINOR] SQL typo fix")
+ False
+ >>> title_missing_component("[SPARK-1234][FOO] Unknown component")
+ True
+ >>> title_missing_component("[MINOR] Fix typo")
+ True
+ """
+ if title.startswith('Revert "') and title.endswith('"'):
+ return False
+ brackets = re.findall(r"\[([^\]]+)\]", title)
+ return not any(b.strip().upper() in COMPONENT_MARKERS for b in brackets)
+
+
+def insert_components_into_title(title, components):
Review Comment:
Worth adding a doctest for `insert_components_into_title("[SPARK-1234]",
["SQL"])` — that exercises the `if not rest:` branch which currently has no
doctest coverage.
##########
dev/merge_spark_pr.py:
##########
@@ -773,6 +773,169 @@ def standardize_jira_ref(text):
return clean_text
+# Recognized Spark component tags. A PR title must contain at least one of
+# these (within square brackets) to be merged. The list shown to the committer
+# in prompt_for_components is a subset of this set selected for prominence.
+# Status markers like [MINOR] or [FOLLOWUP] are intentionally absent and do
+# not satisfy the component requirement. [WIP] is also absent: a WIP PR
+# should be aborted at the earlier WIP warning, not merged.
+COMPONENT_MARKERS = frozenset(
+ {
+ "AVRO",
+ "BUILD",
+ "CONNECT",
+ "CORE",
+ "DOC",
+ "DOCS",
+ "GRAPHX",
+ "INFRA",
+ "K8S",
+ "ML",
+ "MLLIB",
+ "PROTOBUF",
+ "PS",
+ "PYTHON",
+ "R",
+ "REPL",
+ "SECURITY",
+ "SQL",
+ "SS",
+ "STREAMING",
+ "TESTS",
+ "UI",
+ "YARN",
+ }
+)
+
+
+def title_missing_component(title):
+ """
+ Return True when the PR title has no bracket whose content is in
+ COMPONENT_MARKERS.
+
+ A compliant Spark PR title looks like:
[SPARK-XXXXX][COMPONENT1][COMPONENT2] Rest,
+ where at least one [COMPONENT] is in COMPONENT_MARKERS. Status markers like
+ [MINOR] or [FOLLOWUP] are not in COMPONENT_MARKERS and don't satisfy the
+ requirement. Unknown tags like [FOO] also don't satisfy it. Revert PRs,
+ which reuse the original commit's title verbatim, are not flagged.
+
+ >>> title_missing_component("[SPARK-56853] Improve PATH Tests")
+ True
+ >>> title_missing_component("[SPARK-56853][SQL] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-56853][SQL][TESTS] Improve PATH Tests")
+ False
+ >>> title_missing_component("[SPARK-1234][SPARK-5678] Title")
+ True
+ >>> title_missing_component("[SPARK-1234][SPARK-5678][CORE] Title")
+ False
+ >>> title_missing_component('Revert "[SPARK-48591][PYTHON] Simplify"')
+ False
+ >>> title_missing_component("Plain title with no brackets")
+ True
+ >>> title_missing_component("[SPARK-1234][MINOR] Fix typo")
+ True
+ >>> title_missing_component("[SPARK-1234][FOLLOWUP] Follow up")
+ True
+ >>> title_missing_component("[SPARK-1234][TRIVIAL] Trivial fix")
+ True
+ >>> title_missing_component("[SPARK-1234][SQL][MINOR] SQL typo fix")
+ False
+ >>> title_missing_component("[SPARK-1234][FOO] Unknown component")
+ True
+ >>> title_missing_component("[MINOR] Fix typo")
+ True
+ """
+ if title.startswith('Revert "') and title.endswith('"'):
+ return False
+ brackets = re.findall(r"\[([^\]]+)\]", title)
+ return not any(b.strip().upper() in COMPONENT_MARKERS for b in brackets)
+
+
+def insert_components_into_title(title, components):
+ """
+ Insert [COMPONENT1][COMPONENT2]... after the leading [SPARK-XXX] ref(s) in
title.
+ If the title has no leading SPARK ref, the component tag is prepended.
When the
+ remainder of the title starts with another [bracket] tag (e.g. an existing
+ sub-tag like [SDP]), the inserted block is joined to it without a space so
all
+ leading brackets stay adjacent.
+
+ >>> insert_components_into_title("[SPARK-56853] Improve PATH Tests",
["SQL"])
+ '[SPARK-56853][SQL] Improve PATH Tests'
+ >>> insert_components_into_title(
+ ... "[SPARK-1234][SPARK-5678] Title", ["CORE", "SQL"])
+ '[SPARK-1234][SPARK-5678][CORE][SQL] Title'
+ >>> insert_components_into_title("Random title", ["CORE"])
+ '[CORE] Random title'
+ >>> insert_components_into_title("[SPARK-56853]Improve PATH Tests",
["SQL"])
+ '[SPARK-56853][SQL] Improve PATH Tests'
+ >>> insert_components_into_title("[SPARK-56953][SDP] Implement", ["SQL"])
+ '[SPARK-56953][SQL][SDP] Implement'
+ >>> insert_components_into_title(
+ ... "[SPARK-56953][SDP] Implement", ["SQL", "CORE"])
+ '[SPARK-56953][SQL][CORE][SDP] Implement'
+ >>> insert_components_into_title("[MINOR] Fix typo", ["CORE"])
+ '[CORE][MINOR] Fix typo'
+ >>> insert_components_into_title("[MINOR] Fix typo", ["CORE", "SQL"])
+ '[CORE][SQL][MINOR] Fix typo'
+ """
+ component_tag = "".join("[%s]" % c for c in components)
+ m = re.match(r"^((?:\[SPARK-[0-9]{3,6}\])+)\s*(.*)$", title)
+ if m and m.group(1):
+ refs, rest = m.group(1), m.group(2)
+ if not rest:
+ return "%s%s" % (refs, component_tag)
+ sep = "" if rest.startswith("[") else " "
+ return "%s%s%s%s" % (refs, component_tag, sep, rest)
+ sep = "" if title.startswith("[") else " "
+ return "%s%s%s" % (component_tag, sep, title)
+
+
+def prompt_for_components(title):
+ """
+ Prompt the committer for component(s) when the PR title lacks a recognized
+ [COMPONENT] tag. Validates that at least one entered component is in
+ COMPONENT_MARKERS. Re-prompts until valid input is provided. Returns an
+ uppercase list.
+ """
+ print_error("PR title is missing a recognized [COMPONENT] tag.")
+ print("Common components:")
+ print(" [BUILD] - build system")
+ print(" [CONNECT] - Spark Connect")
+ print(" [CORE] - Spark Core")
+ print(" [DOCS] - documentation changes")
+ print(" [GRAPHX] - GraphX")
+ print(" [INFRA] - project infrastructure")
+ print(" [K8S] - Kubernetes")
+ print(" [ML] - Spark ML (DataFrame-based)")
+ print(" [MLLIB] - MLlib (RDD-based)")
+ print(" [PS] - pandas API on Spark")
+ print(" [PYTHON] - PySpark")
+ print(" [R] - SparkR")
+ print(" [SQL] - Spark SQL")
+ print(" [SS] - Structured Streaming")
+ print(" [STREAMING] - DStream API")
+ print(" [TESTS] - test-only changes")
+ print(" [UI] - Spark Web UI")
+ print(" [YARN] - Hadoop YARN")
+ print("Current title: %s" % title)
+ while True:
+ raw = bold_input(
+ "Enter comma-separated component(s) to insert into the title (e.g.
CORE,SQL): "
+ )
+ components = [c.strip().upper() for c in raw.split(",") if c.strip()]
+ if not components:
+ print_error("Component(s) cannot be empty. Please enter at least
one.")
+ continue
+ if not any(c in COMPONENT_MARKERS for c in components):
Review Comment:
When the user enters a mix of canonical and unknown tags (e.g., `SQL,FOO` or
a typo like `SQL,SQLLL`), the unknown tag is silently inserted into the final
title. Consider warning so typos don't slip through — for example:
```python
unknown = [c for c in components if c not in COMPONENT_MARKERS]
if not any(c in COMPONENT_MARKERS for c in components):
print_error(
"At least one component must be a recognized Spark component. "
"Got: %s" % ", ".join(components)
)
continue
if unknown:
print(
"Warning: unrecognized component(s) will be inserted as-is: %s" % ",
".join(unknown)
)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]