Hi everyone,

Thank you all for taking the time to review and respond to my email, especially on what I know is a busy Monday.

Before diving into the specifics, I want to share a bit of my background. I am an AI developer building various AI products, which gives me a clear view of both the pros and cons of AI. I am a strong advocate for using AI and rely on it heavily in my day-to-day life.

On that note, I was happy to see that our PMC member Dongjoon Hyun, who requested evidence, is also actively using AI. Specifically, PR #54300 uses "Gemini 3 Pro (High) on Antigravity" (GitHub link: <https://github.com/apache/spark/pull/54300>). I want to emphasize that this is perfectly acceptable; it is a great example of productive AI use rather than "AI slop."

*Because there are many questions to cover, I won't overwhelm you by answering them all in a single email. Instead, I will send multiple follow-up emails to ensure I address each point thoroughly. For a few of the more complex questions, the answers were quite long, so I have documented them directly in the SPIP.*

Thanks again for your time and feedback.

Regards,
Viquar Khan

On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]> wrote:

> Personally I would love to ask Vaquar to run the idea against OSS projects
> and figure out the value, rather than trying to integrate first and
> validate. I do not see a limitation to run the idea without actual
> integration - the only issue is the cost, but I hope he can get some help
> from his employer if this is ever useful. While it will take multiple
> months to collect the useful info from Apache Spark, it shouldn't need
> multiple months if it's expanded to so many OSS projects and it will be
> much more useful than trying to frame that Apache Spark project would need
> this.
>
> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]>
> wrote:
>
>> I think for now we should probably avoid adding automated closing of
>> possible AI PRs, I think we are not as badly impacted (knock on wood) as
>> some projects and having a human in the loop for closing is reasonable. If
>> we start getting a bunch of seemingly openclaw generated PRs then we can
>> revisit this.
>>
>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Maybe my biggest worry for this kind of attempt is the accuracy. If this
>>> gives false positives, this will just add overhead on the review phase
>>> pushing the reviewer to check the validation manually, which is
>>> "additional" overhead. I wouldn't be happy with it if I get another phase
>>> in addition to the current review process.
>>>
>>> We get AI slop exactly because of the accuracy. How is this battle
>>> tested? Do you have a proof of the accuracy? Linter failures are almost
>>> obvious and there are really rare false positives (at least I haven't seen
>>> it), so I don't bother with linter checking. I would bother with an
>>> additional process if that does not guarantee (or at least has a sense of)
>>> the accuracy.
>>>
>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> Nowadays a really hot topic in all Apache Projects is AI and I wanted
>>>> to kick off a discussion around a new SPIP I've been putting together. With
>>>> the sheer volume of contributions we handle, relying entirely on PR
>>>> templates and manual review to filter out AI-generated slop is just burning
>>>> out maintainers. We've seen other projects like curl and Airflow get
>>>> completely hammered by this stuff lately, and I think we need a hard
>>>> technical defense.
>>>>
>>>> I'm proposing the Automated Integrity Validation (AIV) Gate. Basically,
>>>> it's a local CI job that parses the AST of a PR (using Python, jAST, and
>>>> tree-sitter-scala) to catch submissions that are mostly empty scaffolding
>>>> or violate our specific design rules (like missing .stop() calls or using
>>>> Await.result).
>>>>
>>>> To keep our pipeline completely secure from CI supply chain attacks,
>>>> this runs 100% locally in our dev/ directory; zero external API calls. If
>>>> the tooling ever messes up or a committer needs to force a hotfix, you can
>>>> just bypass it instantly with a GPG-signed commit containing '/aiv skip'.
>>>>
>>>> I think the safest way to roll this out without disrupting anyone's
>>>> workflow is starting it in a non-blocking "Shadow Mode" just to gather data
>>>> and tune the thresholds.
>>>>
>>>> I've attached the full SPIP draft below which dives into all the
>>>> technical weeds, the rollout plan, and a FAQ. Would love to hear your
>>>> thoughts!
>>>>
>>>>
>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>
>>>> --
>>>> Regards,
>>>> Viquar Khan
>>>> *Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
>>>> *Book *-
>>>> https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>> *GitBook*-
>>>> https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>> *Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
>>>> *github*-https://github.com/vaquarkhan/aiv-integrity-gate
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>
