Re: SPIP: Automated Integrity Validation (AIV) Gate for Apache Spark

vaquar khan Tue, 17 Mar 2026 15:07:26 -0700

Hi Tian,

I have spent a significant amount of time on this proposal and have already
shared the link to the Google Doc. Please review it thoroughly rather than
making assumptions; it is essential to have a working prototype before
proposing any technical solution. I hope you understand.


SPIP -
https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh

Regards,

Viquar Khan

On Tue, 17 Mar 2026 at 16:57, Tian Gao via dev <[email protected]> wrote:

> I guess Vaquar is talking about
> https://github.com/vaquarkhan/aiv-integrity-gate , which I assume is a
> new project he developed. Forgive me, but this feels like promotion to me.
>
> Tian
>
> On Tue, Mar 17, 2026 at 2:52 PM vaquar khan <[email protected]> wrote:
>
>> Sure, let me update the SPIP doc with supporting links.
>>
>> On Tue, Mar 17, 2026, 4:44 PM Dongjoon Hyun <[email protected]> wrote:
>>
>>> Hi Viquar,
>>>
>>> Thank you for sharing this.
>>>
>>> While reviewing the SPIP, I noticed that we might need more concrete
>>> data to support the claims regarding the recent surge in the Apache Spark
>>> community, specifically this section:
>>>
>>> > Why Now: The Open Source Automated Contribution Crisis: The
>>> > open-source ecosystem is experiencing an unprecedented surge in
>>> > automated, low-quality pull requests. This is not a theoretical
>>> > concern; it is an active, documented crisis affecting Apache projects
>>> > and the broader community:
>>> > Apache Spark's Own Data (Verified from Commit History): Spark added a
>>> > generative tooling disclosure checkbox to its PR template on August 19,
>>> > 2023. Analysis of commit history shows machine-assisted commits
>>> > accelerating: 9 in 2024, 23 in 2025, and 35 in just the first 45 days of
>>> > 2026. Only ~1-2% of commits currently disclose automated tooling usage,
>>> > but disclosure is voluntary and unverifiable; the actual percentage is
>>> > likely much higher.
>>>
>>> Just FYI, please note that the recent `Generated-By: ` commits came mostly
>>> from active Apache Spark PMC members (like me, Kent, and Yang). This is
>>> because of recent promotions from vendors (like the Claude Code OSS
>>> program, the Google Antigravity Ultra Plan Discount, and Copilot). These
>>> commits are genuine productivity enhancements, not an attack of AI slop.
>>>
>>> Additionally, as a point of context, our community has already taken
>>> proactive measures to safeguard against low-quality AI-generated
>>> contributions. We currently maintain a human-in-the-loop system—such as
>>> requiring an ASF JIRA ticket to be created before submitting a PR—to help
>>> mitigate this issue.
>>>
>>> So, we may want to revisit this topic later, with concrete and
>>> large-scale examples of AI slop in the Spark pull request list.
>>>
>>> Sincerely,
>>> Dongjoon Hyun
>>>
>>>
>>> On 2026/03/17 21:22:55 vaquar khan wrote:
>>> > Hi Team,
>>> >
>>> > AI is a hot topic across all Apache projects right now, and I wanted to
>>> > kick off a discussion around a new SPIP I've been putting together.
>>> > With the sheer volume of contributions we handle, relying entirely on PR
>>> > templates and manual review to filter out AI-generated slop is just
>>> > burning out maintainers. We've seen other projects like curl and Airflow
>>> > get completely hammered by this stuff lately, and I think we need a hard
>>> > technical defense.
>>> >
>>> > I'm proposing the Automated Integrity Validation (AIV) Gate. Basically,
>>> > it's a local CI job that parses the AST of a PR (using Python, jAST, and
>>> > tree-sitter-scala) to catch submissions that are mostly empty
>>> > scaffolding or violate our specific design rules (like missing .stop()
>>> > calls or using Await.result).
>>> >
>>> > To keep our pipeline completely secure from CI supply chain attacks,
>>> > this runs 100% locally in our dev/ directory, with zero external API
>>> > calls. If the tooling ever messes up or a committer needs to force a
>>> > hotfix, you can just bypass it instantly with a GPG-signed commit
>>> > containing '/aiv skip'.
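
The bypass described above could be checked along these lines. This is only a sketch under assumptions: the helper names are hypothetical, it assumes the directive must appear on a line by itself, and signature checking is delegated to `git verify-commit`, which succeeds only when git can validate the commit's signature against the local keyring. Deciding whose keys count as a committer's is not covered here.

```python
import subprocess

SKIP_DIRECTIVE = "/aiv skip"


def has_skip_directive(message: str) -> bool:
    """True if some line of the commit message is exactly the skip directive."""
    return any(line.strip() == SKIP_DIRECTIVE for line in message.splitlines())


def signature_is_valid(commit: str, repo: str = ".") -> bool:
    """Ask git to verify the GPG signature of `commit` in `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "verify-commit", commit],
        capture_output=True,
    )
    return result.returncode == 0


def may_bypass(message: str, signed: bool) -> bool:
    """Allow the bypass only for a signed commit that carries the directive."""
    return signed and has_skip_directive(message)
```

In a CI job, `signed` would come from `signature_is_valid(commit_sha)`; keeping it as a plain argument lets the message rule be tested without a repository.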
>>> >
>>> > I think the safest way to roll this out without disrupting anyone's
>>> > workflow is starting it in a non-blocking "Shadow Mode" just to gather
>>> > data and tune the thresholds.
>>> >
>>> > I've attached the full SPIP draft below which dives into all the
>>> > technical weeds, the rollout plan, and a FAQ. Would love to hear your
>>> > thoughts!
>>> >
>>> >
>>> > https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>> >
>>> > --
>>> > Regards,
>>> > Viquar Khan
>>> > *LinkedIn* - https://www.linkedin.com/in/vaquar-khan-b695577/
>>> > *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>> > *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>> > *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
>>> > *GitHub* - https://github.com/vaquarkhan/aiv-integrity-gate
>>> >
>>>
