This is an automated email from the ASF dual-hosted git repository.
HyukjinKwon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
new f3f9b34e754e [SPARK-56998] Add SECURITY.md + AGENTS.md Security
section for scan-agent discoverability
f3f9b34e754e is described below
commit f3f9b34e754e0ad21e439c28a4fd446b59eff9f7
Author: Jarek Potiuk <[email protected]>
AuthorDate: Fri May 22 13:25:05 2026 -0700
[SPARK-56998] Add SECURITY.md + AGENTS.md Security section for scan-agent
discoverability
**This is a proposal for the PMC to review — please correct, reject, or
discuss as needed.** Nothing here is a requirement; the maintainer is the
decision-maker.
This adds a `SECURITY.md` to the repo root and a `Security` section to the
existing `AGENTS.md` so an automated scan agent can mechanically discover the
project's security model via the conventional `AGENTS.md → SECURITY.md → model
URL` chain. The chain terminates at the existing
<https://spark.apache.org/docs/latest/security.html> page — nothing about the
model content itself changes.
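
As a rough illustration of that chain, here is a minimal Python sketch of how a scan agent might follow it. Everything in it is an assumption made for illustration: the function names, the regular expressions, and the rule "first autolinked URL after the `## Threat model` heading" are hypothetical and do not describe any actual ASF tooling.

```python
import re
from typing import Optional

# Hypothetical patterns; a real scan agent may use different rules.
SECURITY_LINK = re.compile(r"\[[^\]]*\]\((?:\./)?(SECURITY\.md)\)")
AUTOLINK = re.compile(r"<(https?://[^>\s]+)>")


def find_security_file(agents_md: str) -> Optional[str]:
    """Return the SECURITY.md path linked from AGENTS.md, if any."""
    match = SECURITY_LINK.search(agents_md)
    return match.group(1) if match else None


def find_model_url(security_md: str) -> Optional[str]:
    """Return the first autolinked URL after a 'Threat model' heading."""
    _, sep, tail = security_md.partition("## Threat model")
    if not sep:
        return None
    match = AUTOLINK.search(tail)
    return match.group(1) if match else None


def discover_model(agents_md: str, security_md: str) -> Optional[str]:
    """Follow AGENTS.md -> SECURITY.md -> model URL; None if a hop is missing."""
    if find_security_file(agents_md) is None:
        return None
    return find_model_url(security_md)
```

Returning `None` when either hop is missing mirrors the "refuse upfront" behaviour described in the next paragraph: the caller can stop before scanning rather than guess at a model.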
Context: the ASF Security team is preparing the project for an automated
agentic security scan we're piloting. Such scans refuse to run if the model
isn't discoverable by that path (refusing upfront beats wasting PMC reviewer
cycles on a noise-heavy run against an unknown model). Discoverability is the
one hard gate; everything else is suggestion. The Security team has reached out
separately on the PMC's private list with the program details; this PR is the
public-facing repo piece.
The Security team uses
[`threat-model-producer`](https://gist.github.com/potiuk/da14a826283038ddfe38cc9fe6310573)
as the rubric for what a complete model looks like — but this PR is just the
*link*; the existing `security.html` content is accepted as the model.
After this lands on `master`, the same two files would need to be on
`branch-3.5` for the second scan target — happy to open a cherry-pick PR for
that, or leave it to the PMC.
Questions / pushback welcome. Happy to adjust the wording or move the
section if the project has a house style.
Closes #55933 from potiuk/asf-security/discoverability-2026-05-18.
Lead-authored-by: Jarek Potiuk <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 411dedc1aa98a5a1aeacfd9b27487583f18eaab1)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
AGENTS.md | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SECURITY.md | 13 +++++
2 files changed, 176 insertions(+)
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 000000000000..c37d8a130421
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,163 @@
+# Apache Spark
+
+## Pre-flight Checks
+
+Before the first code edit or test run in a session, ensure a clean working environment. DO NOT skip these checks:
+
+1. Run `git remote -v` to identify the personal fork and upstream
(`apache/spark`). If unclear, ask the user to configure their remotes following
the standard convention (`origin` for the fork, `upstream` for `apache/spark`).
+2. If the latest commit on `<upstream>/master` is more than a day old (check with `git log -1 --format="%ci" <upstream>/master`), run `git fetch <upstream> master`.
+3. If there are uncommitted changes (check with `git status`), ask the user to
stash them before proceeding.
+4. Switch to the appropriate branch:
+ - **Existing PR**: resolve the PR branch name via `gh api repos/apache/spark/pulls/<number> --jq '.head.ref'`, then look for a local branch matching that name. If found, switch to it and inform the user. If not found, ask whether to fetch it or if there is a local branch under a different name.
+ - **New edits**: ask the user to choose: create a new git worktree from
`<upstream>/master` and work from there (recommended), or create and switch to
a new branch from `<upstream>/master`.
+ - **Running tests**: use `<upstream>/master`.
+
+## Development Notes
+
+SQL golden file tests are managed by `SQLQueryTestSuite` and its variants.
Read the class documentation before running or updating these tests. DO NOT
edit the generated golden files (`.sql.out`) directly. Always regenerate them
when needed, and carefully review the diff to make sure it's expected.
+
+Spark Connect protocol is defined in proto files under
`sql/connect/common/src/main/protobuf/`. Read the README there before modifying
proto definitions.
+
+Avoid introducing non-ASCII characters in code or comments. String literals
may contain non-ASCII when the content requires it (error messages, test data,
etc.). Identifiers are ASCII by convention. The common failure mode is
typographic characters (em-dash, smart quotes, ellipsis, non-breaking space)
sneaking into comments; scalastyle flags some of these. Spot-check before
committing: `grep -rn -P "[^\x00-\x7F]" <files>`.
+
+## Scala Test Base Classes
+
+When writing a new Scala test suite, pick the lowest base class that provides
what the test actually needs. Spark uses the `AnyFunSuite` ScalaTest style
throughout, so the bases below are the chain to choose from. Each adds
capability on top of the previous:
+
+ SparkFunSuite           (core)
+   <- PlanTest           (sql/catalyst)
+     <- QueryTest        (sql/core)
+
+| Test scope | Base | Notes |
+|------------|------|-------|
+| Plain JVM/Scala — no Spark SQL | `SparkFunSuite` | `core` utilities, RDD, network, util classes, etc. Adds per-test timeout, `testRetry`, `gridTest`, thread audit, fixed timezone/locale, `withTempDir`, `withLogAppender`, `checkError`. |
+| Catalyst plan tests — no `SparkSession` | `PlanTest` | Adds `comparePlans`, `normalizePlan`, `normalizeExprIds`. For analyzer / optimizer / planner rule tests. |
+| SQL/DataFrame tests — needs a `SparkSession` | `QueryTest` | Adds `checkAnswer`, codegen-on/off helpers. `spark: SparkSession` is abstract and must be supplied by a session-providing trait (see below). |
+
+### Providing a `SparkSession` for `QueryTest`
+
+`QueryTest` declares `spark: SparkSession` abstractly via
`SparkSessionProvider`, so it cannot be instantiated on its own. A concrete
suite mixes in one of the session-providing traits below:
+
+ QueryTest                        (abstract `spark`)
+   + SharedSparkSession (sql/core) -> classic in-process `TestSparkSession`
+   + TestHiveSingleton  (sql/hive) -> Hive-backed `TestHive` session
+
+| Session provider | Module / location | Typical usage |
+|---|---|---|
+| `SharedSparkSession` | `sql/core` | Already extends `QueryTest` for historical reasons, but still mix in `QueryTest` explicitly, e.g. `class X extends QueryTest with SharedSparkSession`. Default for tests under `sql/core`. |
+| `TestHiveSingleton` | `sql/hive` | Mixed in alongside `QueryTest`, e.g. `class X extends QueryTest with TestHiveSingleton`. Used by tests under `sql/hive`. |
+
+## Build and Test
+
+Builds and tests can take a long time. If the user explicitly asked to run tests, run them. Otherwise (that is, you are running tests on your own to verify a change), first ask the user if they have more changes to make.
+
+Prefer SBT over Maven for faster incremental compilation. Module names are
defined in `project/SparkBuild.scala`.
+
+Compile a single module:
+
+ build/sbt <module>/compile
+
+Compile test code for a single module:
+
+ build/sbt <module>/Test/compile
+
+Run test suites by wildcard or full class name:
+
+ build/sbt '<module>/testOnly *MySuite'
+ build/sbt '<module>/testOnly org.apache.spark.sql.MySuite'
+
+Run test cases matching a substring:
+
+ build/sbt '<module>/testOnly *MySuite -- -z "test name"'
+
+For faster iteration, keep SBT open in interactive mode:
+
+ build/sbt
+ > project <module>
+ > testOnly *MySuite
+
+### PySpark Tests
+
+PySpark tests require building Spark with Hive support first:
+
+ build/sbt -Phive package
+
+Activate the virtual environment specified by the user, or default to `.venv`:
+
+ source <venv>/bin/activate
+
+If the default venv does not exist, create it:
+
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r dev/requirements.txt
+
+Run a single test suite:
+
+ python/run-tests --testnames pyspark.sql.tests.arrow.test_arrow
+
+Run a single test case:
+
+ python/run-tests --testnames "pyspark.sql.tests.test_catalog CatalogTests.test_current_database"
+
+## Investigating PR CI Failures
+
+Do NOT download full job logs to grep for errors — they are very large and
slow. Instead, use the test report annotations on the fork.
+
+Step 1 — Get the fork owner and the latest commit SHA of the PR:
+
+ gh api repos/apache/spark/pulls/<PR_NUMBER> --jq '{owner: .head.repo.owner.login, sha: .head.sha}'
+
+Step 2 — Find the "Report test results" check run on the fork's commit:
+
+ gh api repos/<OWNER>/spark/commits/<SHA>/check-runs \
+   --jq '.check_runs[] | select(.name == "Report test results") | {id: .id, annotations: .output.annotations_count}'
+
+Step 3 — Fetch failure annotations:
+
+ gh api repos/<OWNER>/spark/check-runs/<CHECK_RUN_ID>/annotations
+
+Each annotation contains the test class, test name, and failure message.
+
+## Pull Request Workflow
+
+PR title format is `[SPARK-xxxx][COMPONENT] Title`. The component tag is
derived from the JIRA component name: take the last word and uppercase it (e.g.
`Project Infra` → `[INFRA]`, `Spark Core` → `[CORE]`, `Structured Streaming` →
`[STREAMING]`, `SQL` → `[SQL]`).
+
+Infer the PR title from the changes. If no ticket ID is given, create one
using `dev/create_spark_jira.py`, using the PR title (without the JIRA ID and
component tag) as the ticket title.
+
+ python3 dev/create_spark_jira.py "<title>" -c <component> { -t <type> | -p <parent-jira-id> }
+
+- **Component** (`-c`): the exact JIRA component name (not the PR title shorthand), e.g. "SQL", "Spark Core", "PySpark", "Connect". Run `python3 dev/create_spark_jira.py --list-components` for the full list.
+- **Issue type** (`-t`): "Bug", "Improvement", "New Feature", "Test",
"Documentation", or "Dependency upgrade".
+- **Parent** (`-p`): if the user mentions a parent JIRA ticket (e.g., "this is
a subtask of SPARK-12345"), pass it instead of `-t`. The issue type is
automatically "Sub-task".
+
+The script sets the latest unreleased version as the default affected version.
+
+After creating a JIRA ticket, print a prominent notice so the user does not
miss it:
+
+ ============================================================
+ JIRA ticket created: SPARK-XXXXX
+ https://issues.apache.org/jira/browse/SPARK-XXXXX
+
+ Title: <title>
+ Component(s): <component>
+ Issue type: <type>
+ Affected version(s): <version>
+ Priority: <priority>
+
+ Please review and adjust these fields if needed.
+ ============================================================
+
+Before writing the PR description, read `.github/PULL_REQUEST_TEMPLATE` and
fill in every section from that file.
+
+DO NOT push to the upstream repo. Always push to the personal fork. Open PRs
against `master` on the upstream repo.
+
+DO NOT force push or use `--amend` on pushed commits unless the user
explicitly asks. If the remote branch has new commits, fetch and rebase before
pushing.
+
+Always get user approval before external operations such as pushing commits,
creating PRs, or posting comments. Use `gh pr create` to open PRs. If `gh` is
not installed, generate the GitHub PR URL for the user and recommend installing
the GitHub CLI.
+
+## Security
+
+Security model: [SECURITY.md](./SECURITY.md)
+
+Agents that scan this repository should consult `SECURITY.md` for the
project's threat model, in-scope / out-of-scope declarations, and known
non-findings before reporting issues.
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 000000000000..26b540d81ab5
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,13 @@
+# Security policy
+
+## Reporting a vulnerability
+
+To report a security vulnerability in Apache Spark, follow the ASF security
process at <https://www.apache.org/security/>. Reports can be sent privately to
[email protected] or to [email protected].
+
+## Threat model
+
+Apache Spark's security guidance for operators — covering authentication,
encryption, access control, Kerberos integration, trust boundaries,
environmental assumptions, security properties provided and not provided, and
downstream operator responsibilities — is documented at:
+
+<https://spark.apache.org/docs/latest/security.html>
+
+Automated security scanning agents should consult that document for the
project's in-scope / out-of-scope declarations before reporting issues.
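
The component-tag rule from the "Pull Request Workflow" section of the `AGENTS.md` diff above (take the last word of the JIRA component name and uppercase it) can be sketched as follows. This is an illustrative Python sketch only: `component_tag` and `pr_title` are hypothetical helper names, not part of `dev/create_spark_jira.py` or any other Spark tooling.

```python
def component_tag(jira_component: str) -> str:
    """Derive the PR-title tag: last word of the JIRA component, uppercased."""
    words = jira_component.split()
    if not words:
        raise ValueError("empty JIRA component name")
    return f"[{words[-1].upper()}]"


def pr_title(ticket: str, jira_component: str, title: str) -> str:
    """Assemble a title in the `[SPARK-xxxx][COMPONENT] Title` format."""
    return f"[{ticket}]{component_tag(jira_component)} {title}"
```

With the examples given in that section, `component_tag("Project Infra")` yields `[INFRA]` and `component_tag("Structured Streaming")` yields `[STREAMING]`.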