Re: SPIP: Automated Integrity Validation (AIV) Gate for Apache Spark

vaquar khan Wed, 18 Mar 2026 01:09:18 -0700

Hi everyone,

Thank you all for taking the time to review and respond to my email,
especially on what I know is a busy Monday.


Before diving into the specifics, I want to share a bit of my background. I
am an AI developer building various AI products, which gives me a clear
perspective on both its pros and cons. I am a strong advocate for using AI
and rely on it heavily in my day-to-day life.

On that note, I was happy to see that our PMC member Dongjoon Hyun, who
requested evidence, is also actively using AI. Specifically, PR #54300
uses "Gemini 3 Pro (High) on Antigravity" (GitHub link
<https://github.com/apache/spark/pull/54300>). I want to emphasize that
this is perfectly acceptable; it is a great example of productive AI use
rather than "AI slop."

*Because there are many questions to cover, I won't overwhelm you by
answering them all in a single email. Instead, I will send multiple
follow-up emails to ensure I address each point thoroughly. For a few of
the more complex questions, the answers were quite long, so I have
documented them directly in the SPIP.*

Thanks again for your time and feedback.

Regards,

Viquar Khan

On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]>
wrote:

> Personally, I would love to ask Vaquar to run the idea against OSS projects
> and figure out its value first, rather than trying to integrate first and
> validate afterwards. I do not see anything that prevents running the idea
> without actual integration - the only issue is the cost, but I hope he can
> get some help from his employer if this is ever useful. While it would take
> multiple months to collect useful info from Apache Spark alone, it shouldn't
> need multiple months if it's expanded to many OSS projects, and that would be
> much more useful than trying to frame it as something the Apache Spark
> project needs.
>
> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]>
> wrote:
>
>> I think for now we should probably avoid adding automated closing of
>> possible AI PRs. I think we are not as badly impacted (knock on wood) as
>> some projects, and having a human in the loop for closing is reasonable. If
>> we start getting a bunch of seemingly OpenClaw-generated PRs, then we can
>> revisit this.
>>
>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Maybe my biggest worry about this kind of attempt is accuracy. If this
>>> gives false positives, it will just add overhead to the review phase,
>>> pushing the reviewer to check the validation manually, which is
>>> "additional" overhead. I wouldn't be happy if I got another phase
>>> in addition to the current review process.
>>>
>>> We get AI slop exactly because of accuracy problems. How is this battle
>>> tested? Do you have proof of its accuracy? Linter failures are almost
>>> always obvious and false positives are really rare (at least I haven't seen
>>> one), so I don't mind linter checks. I would mind an
>>> additional process if it does not guarantee (or at least give a sense of)
>>> its accuracy.
>>>
>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> Nowadays, a really hot topic in all Apache projects is AI, and I wanted
>>>> to kick off a discussion around a new SPIP I've been putting together. With
>>>> the sheer volume of contributions we handle, relying entirely on PR
>>>> templates and manual review to filter out AI-generated slop is just burning
>>>> out maintainers. We've seen other projects like curl and Airflow get
>>>> completely hammered by this stuff lately, and I think we need a hard
>>>> technical defense.
>>>>
>>>> I'm proposing the Automated Integrity Validation (AIV) Gate. Basically,
>>>> it's a local CI job that parses the AST of a PR (using Python, jAST, and
>>>> tree-sitter-scala) to catch submissions that are mostly empty scaffolding
>>>> or violate our specific design rules (like missing .stop() calls or using
>>>> Await.result).
>>>>
>>>> To keep our pipeline completely secure from CI supply chain attacks,
>>>> this runs 100% locally in our dev/ directory, with zero external API calls. If
>>>> the tooling ever messes up or a committer needs to force a hotfix, you can
>>>> just bypass it instantly with a GPG-signed commit containing '/aiv skip'.
>>>>
>>>> I think the safest way to roll this out without disrupting anyone's
>>>> workflow is starting it in a non-blocking "Shadow Mode" just to gather data
>>>> and tune the thresholds.
>>>>
>>>> I've attached the full SPIP draft below which dives into all the
>>>> technical weeds, the rollout plan, and a FAQ. Would love to hear your
>>>> thoughts!
>>>>
>>>>
>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>
>>>> --
>>>> Regards,
>>>> Viquar Khan
>>>> *Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
>>>> *Book *-
>>>> https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>> *GitBook*-
>>>> https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>> *Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
>>>> *github*-https://github.com/vaquarkhan/aiv-integrity-gate
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>
