IMHO, requiring ASF JIRA tickets to be created is somewhat orthogonal to this. That measure is about attackers/spammers trying to pollute the community. Even if we assume the gate is broken, I don't think they would generate code to address JIRA tickets; they wouldn't file AI-slop issues, they would file spam/phishing issues.
Though I get where you are coming from: code contributions to the project are now effectively scoped to fairly active contributors (including committers and PMC members), and there is a shared belief among the active contributors that we do not submit AI slop. We generate code with an LLM, but I expect we "definitely" review it before submitting a PR, with the PR author playing the role of the initial reviewer. (That said, if anyone is not doing the initial review of LLM-generated code, please start. This is an "integrity" issue.)

Ironically, the way contributors tend to work deters cherry-pickers (who would find it easy to vibe-code with an LLM to earn credit), albeit unintentionally. We allow "anyone" to work on a ticket until a PR is raised, so experienced contributors never file the JIRA ticket first; they work on the item and file the JIRA ticket after the work is complete. Of the TODO tickets that are filed and left open, many would be non-trivial. (Or maybe the UX of JIRA is bad enough that cherry-pickers won't search through it.)

Btw, IMHO I wouldn't consider the AI disclaimer template to be something that gives us honest, factual information. Assume you are a contributor who is eager to show off your contribution and be recognized in any way: you will find it useful to NOT disclose that you used an LLM to generate the code. There is no enforcement and no way to verify it. We'd better consider this optional at this point.

On Thu, Mar 19, 2026 at 5:51 AM Dongjoon Hyun <[email protected]> wrote:

> Hi, Lisa.
>
> Thank you for sharing your opinion, but this SPIP itself seems based on an illusion. Since Apache Spark doesn't have this volume problem, this argument is a strawman.
>
> > I want to share how Apache Airflow is handling this, since they're dealing with the same volume problem.
>
> Let me share my previous comment once more.
> The recent `Generated-By: ` commits came from active Apache Spark PMC members (like me, Kent, Yang) mostly. It's because of the recent promotion from the vendors (like Claude Code OSS program, Google Antigravity Ultra Plan Discount, and Copilot). It's truly the productivity enhancements, not the AI slop that the SPIP claims, that are the issue.
>
> Unlike the Apache Airflow community, the Apache Spark community has already taken proactive measures to safeguard against low-quality AI-generated contributions. We currently maintain a human-in-the-loop system, such as requiring an ASF JIRA ticket to be created before submitting a PR.
>
> We may want to revisit this topic later when a real problem arises.
>
> Sincerely,
> Dongjoon.
>
> On Wed, Mar 18, 2026 at 1:23 PM Lisa N. Cao <[email protected]> wrote:
>
>> Hi all,
>>
>> I want to share how Apache Airflow is handling this, since they're dealing with the same volume problem.
>>
>> Rather than building detection for AI-generated PRs specifically, they've focused on raising the quality bar for all non-collaborator contributions and automating the enforcement. The discussion and tooling could provide inspiration:
>> https://lists.apache.org/thread/8tzwwwd7jmtmfo4j9pzg27704g10vpr4
>> https://github.com/apache/airflow/pull/62682
>>
>> PRs from non-collaborators must pass all checks, follow PR guidelines (LLM-verified), and include proper descriptions and tests before any maintainer looks at them. PRs that don't meet the bar get converted to drafts automatically with actionable comments. A human triager reviews the automated output, but the responsibility sits entirely with the author. I don't see it as all that different from the goals of this SPIP.
>>
>> Their Gen-AI disclosure policy layers on top of this:
>> https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions
>>
>> Could be a useful model as the community weighs what levels of enforcement are available.
>>
>> --
>> LNC
>>
>> On Wed, Mar 18, 2026, 1:48 AM Jungtaek Lim <[email protected]> wrote:
>>
>>> Hi Vaquar,
>>>
>>> I do not see a value in coupling this with Apache Spark. If this is useful for Apache Spark, why is this particularly useful only for Apache Spark? It shouldn't be too hard for you to run the prototype with existing/new PRs over various OSS projects. Apache Spark project is too restricted to prove your project because nowadays code contributors are rather almost bound - we are not running the project which is quite new and shiny to gain traction from random contributors. I don't feel like we should take the approach of shadow mode while it is not really necessary. There is an existing way to prove the value; go with a faster loop on your own project first.
>>>
>>> There is no actual relation between this and Apache Spark "from the product point of view". You should become more successful when you prove the value with the project. Please incubate properly and in the right direction.
>>>
>>> On Wed, Mar 18, 2026 at 5:26 PM vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> Thank you for these points. Your concern regarding *accuracy and reviewer overhead* is perhaps the most impactful feedback I've received so far. I completely agree: if an automated tool has a high false-positive rate, it creates a "validation tax" that makes a reviewer's job harder, not easier.
>>>>
>>>> Because your questions get to the heart of the proposal's viability, I have specifically documented the answers and data regarding accuracy and your "validate before integrate" suggestion directly into the SIP: *[Link to SIP: PR Quality & AI-Generated Content Policy]*.
>>>>
>>>> To summarize the strategy I've outlined there to address your concerns:
>>>>
>>>> 1. *The "Linter" Strategy:* We are not using subjective "guesses" to identify AI. We are looking for objective metadata violations, such as missing JIRA IDs, ignored PR templates, and specific automated signatures. These are "binary" failures with a near-zero false-positive rate, much like a code linter.
>>>>
>>>> 2. *Shadow Mode (Validation without Integration):* To your point about figuring out the value first, I propose we run this logic in *Shadow Mode*. It would run as a non-blocking background process to collect accurate data on Spark PRs for a set period (e.g., 4 weeks). This allows us to prove the value and measure the false-positive rate without adding a single second of overhead to your current review process.
>>>>
>>>> 3. *Proactive vs. Reactive:* While testing on other projects is possible, Spark's unique standards mean we need Spark-specific data. This proactive approach ensures we have the tools ready before the volume of "AI slop" becomes a crisis.
>>>>
>>>> I've made sure the SIP now reflects that the goal of this tool is to act as a *shield* for committers, not a new hurdle. I'd value your thoughts on the "Shadow Mode" data collection as a way to provide the proof of accuracy you're looking for.
>>>>
>>>> Please read the details in the SIP doc with your name.
>>>>
>>>> Best regards,
>>>>
>>>> Viquar Khan
>>>>
>>>> On Wed, 18 Mar 2026 at 03:17, vaquar khan <[email protected]> wrote:
>>>>
>>>>> Hi Holden,
>>>>>
>>>>> I appreciate the perspective on keeping a human in the loop. However, relying on "massive examples" as a lagging indicator means we only act once maintainers are already overwhelmed. Data across the ecosystem shows that the transition from a manageable queue to an unmanageable flood happens rapidly; if Spark is not heavily impacted today, the trajectory of sibling projects suggests we will be within 6 months.
>>>>>
>>>>> The "human in the loop" approach is already costing us time. We are seeing drive-by AI contributions that bypass our soft controls and require manual intervention to close. For example:
>>>>>
>>>>> - *Large-Scale Noise:* PR #52218 <https://github.com/apache/spark/pull/52218> introduced 1,151 lines of a RabbitMQ connector explicitly marked as "Generated-by: ChatGPT-5," lacking tests and ignoring architectural standards.
>>>>>
>>>>> - *Duplicate Overhead:* PR #54810 <https://github.com/apache/spark/pull/54810> and PR #54717 <https://github.com/apache/spark/pull/54717> are concrete instances of AI-driven duplicate PRs for the same JIRA ticket, showing a lack of context awareness.
>>>>>
>>>>> - *Template Evasion:* PR #54150 <https://github.com/apache/spark/pull/54150> and PR #50400 <https://github.com/apache/spark/pull/50400> completely ignored JIRA IDs and PR templates without disclosing AI usage. This proves the voluntary checkbox is an unreliable metric for the true volume of AI code entering the repo.
>>>>>
>>>>> It is important to distinguish this "AI slop" from high-quality, productive AI use.
>>>>> As I mentioned, PR #54300 <https://github.com/apache/spark/pull/54300> from *Dongjoon Hyun* (using Gemini 3 Pro on Antigravity) is a perfect example of how AI should be used—with PMC-level oversight and intent.
>>>>>
>>>>> I have documented these emerging patterns in the SIP. If we look at the data, it is clear we are moving toward the same crisis seen in other projects. This proposal is a *proactive approach* to protect our committers' bandwidth before the flood arrives, rather than a *reactive* one that forces us to scramble once the review queue is already broken.
>>>>>
>>>>> If a full "auto-close" feels too aggressive right now, could we at least implement *automated labeling* based on these SIP patterns to reduce "discovery time" for the PMC?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Viquar Khan
>>>>>
>>>>> On Wed, 18 Mar 2026 at 03:08, vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Thank you all for taking the time to review and respond to my email, especially on what I know is a busy Monday.
>>>>>>
>>>>>> Before diving into the specifics, I want to share a bit of my background. I am an AI developer building various AI products, which gives me a clear perspective on both its pros and cons. I am a strong advocate for using AI and rely on it heavily in my day-to-day life.
>>>>>>
>>>>>> On that note, I was happy to see our PMC member, Dongjoon Hyun—who requested evidence—is also actively utilizing AI. Specifically, PR #54300 uses "Gemini 3 Pro (High) on Antigravity" (GitHub Link <https://github.com/apache/spark/pull/54300>). I want to emphasize that this is perfectly acceptable; it is a great example of productive AI use rather than "AI slop."
>>>>>>
>>>>>> *Because there are many questions to cover, I won't overwhelm you by answering them all in a single thread.
>>>>>> Instead, I will send multiple follow-up emails to ensure I address each point thoroughly. For a few of the more complex questions, the answers were quite long, so I have documented them directly in the SIP.*
>>>>>>
>>>>>> Thanks again for your time and feedback.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Viquar Khan
>>>>>>
>>>>>> On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]> wrote:
>>>>>>
>>>>>>> Personally I would love to ask Vaquar to run the idea against OSS projects and figure out the value, rather than trying to integrate first and validate. I do not see a limitation to running the idea without actual integration - the only issue is the cost, but I hope he can get some help from his employer if this is ever useful. While it will take multiple months to collect the useful info from Apache Spark, it shouldn't need multiple months if it's expanded to many OSS projects, and it will be much more useful than trying to frame that the Apache Spark project would need this.
>>>>>>>
>>>>>>> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think for now we should probably avoid adding automated closing of possible AI PRs; I think we are not as badly impacted (knock on wood) as some projects, and having a human in the loop for closing is reasonable. If we start getting a bunch of seemingly openclaw generated PRs then we can revisit this.
>>>>>>>>
>>>>>>>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Maybe my biggest worry for this kind of attempt is the accuracy. If this gives false positives, this will just add overhead in the review phase, pushing the reviewer to check the validation manually, which is "additional" overhead.
>>>>>>>>> I wouldn't be happy with it if I get another phase in addition to the current review process.
>>>>>>>>>
>>>>>>>>> We get AI slop exactly because of the accuracy. How is this battle tested? Do you have a proof of the accuracy? Linter failures are almost obvious and there are really rare false positives (at least I haven't seen any), so I don't bother with linter checking. I would bother with an additional process if it does not guarantee (or at least have a sense of) the accuracy.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> Nowadays a really hot topic in all Apache projects is AI, and I wanted to kick off a discussion around a new SPIP I've been putting together. With the sheer volume of contributions we handle, relying entirely on PR templates and manual review to filter out AI-generated slop is just burning out maintainers. We've seen other projects like curl and Airflow get completely hammered by this stuff lately, and I think we need a hard technical defense.
>>>>>>>>>>
>>>>>>>>>> I'm proposing the Automated Integrity Validation (AIV) Gate. Basically, it's a local CI job that parses the AST of a PR (using Python, jAST, and tree-sitter-scala) to catch submissions that are mostly empty scaffolding or violate our specific design rules (like missing .stop() calls or using Await.result).
>>>>>>>>>>
>>>>>>>>>> To keep our pipeline completely secure from CI supply chain attacks, this runs 100% locally in our dev/ directory; zero external API calls.
>>>>>>>>>> If the tooling ever messes up or a committer needs to force a hotfix, you can just bypass it instantly with a GPG-signed commit containing '/aiv skip'.
>>>>>>>>>>
>>>>>>>>>> I think the safest way to roll this out without disrupting anyone's workflow is starting it in a non-blocking "Shadow Mode" just to gather data and tune the thresholds.
>>>>>>>>>>
>>>>>>>>>> I've attached the full SPIP draft below, which dives into all the technical weeds, the rollout plan, and a FAQ. Would love to hear your thoughts!
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Viquar Khan
>>>>>>>>>> *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>>>> *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>>>>>>>> *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>>>>>>>> *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
>>>>>>>>>> *github* - https://github.com/vaquarkhan/aiv-integrity-gate
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>> Pronouns: she/her
