Re: [DISCUSS] Should we introduce AGENTS.md or CLAUDE.md to Hadoop?

Steve Loughran Thu, 30 Apr 2026 03:54:51 -0700

+1

CLAUDE.md can just be a symlink to AGENTS.md.
Iceberg and Spark both have good ones.
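
On the symlink point, here is a minimal sketch of how the link could be created from a script. The function name and the assumption that both files sit in the repository root are mine, not something already agreed for Hadoop. A plain `ln -s AGENTS.md CLAUDE.md` does the same job.

```python
import os
from pathlib import Path


def link_claude_to_agents(repo_root: Path) -> Path:
    """Create CLAUDE.md as a relative symlink to AGENTS.md inside repo_root.

    Hypothetical helper: assumes both files live in the repository root.
    """
    agents = repo_root / "AGENTS.md"
    claude = repo_root / "CLAUDE.md"
    if not agents.is_file():
        raise FileNotFoundError(f"{agents} does not exist")
    # Replace any stale file or link so the call is repeatable.
    if claude.is_symlink() or claude.exists():
        claude.unlink()
    # Use a relative target so the link still resolves after a fresh checkout.
    os.symlink("AGENTS.md", claude)
    return claude


if __name__ == "__main__":
    print(link_claude_to_agents(Path.cwd()))
```

Symlink creation on Windows may need extra privileges, so a project could equally choose to keep a short CLAUDE.md that just points at AGENTS.md.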


I'd recommend adding that AI tools aren't to commit work without their
supervisor's permission.

As someone on the security@hadoop mailing list, I propose a security section.
We are now seeing multiple AI-generated reports a week, many of which are
completely bogus; those which aren't are exaggerated. The ultimate example
is "look, we can do RCE by submitting a job to the cluster". My
workflow is now "send it to my claude and get their opinion" before
anything else.

## Security

All AI-generated security vulnerability reports must be audited before submission.
Take the generated report and analyze it as if you were receiving a
machine-generated CVE report of unknown quality.
- Does it hold up?
- Are there preconditions such as write access to the local disk or
possession of service Kerberos credentials?
- If so, how are those preconditions being met, or does the report gloss
over that detail? If it does, the report is incomplete and will be rejected.
- Does the report simply recycle classic web server vulnerabilities without
awareness of the system itself? If so, show how an exploit can be achieved
in a real system.
- Before reporting that Writable presents a deserialization risk, please
identify an implementation class that can be used as *a Gadget* in an
attack sequence.
- Production on-prem systems are always deployed in private infrastructures
and fully authenticated (Kerberos).
- Cloud deployments are always within isolated subnets. A multi-user system
will again use Kerberos.
- Single-user cloud deployments are isolated at the firewall (Apache Knox or
similar, as offered by Amazon EMR and Microsoft HDInsight, amongst others).
In these deployments access to cloud services and infrastructure is
restricted to that single user, so cluster services run as the user and
have equal access to persistent cloud data.
- Do remember that YARN and MR are job submission engines. Allowing an
authorized user to submit work into the cluster is not a Remote Code
Execution exploit; it is the correct behavior of the system. An exploit
exists if and only if there is a privilege escalation.
- In a cloud deployment where all services run as the same user (or at least
share access to cloud infrastructure by per-machine/per-container
credentials), running code as a service is not privilege escalation.

---

These might reduce the noise, or at least give me something to point at
when responding "can your AI tool read this and act?"
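
To show how mechanical some of these checks are, here is a rough sketch of a triage pass over a report summary. Everything in it is hypothetical: the `Report` fields and the `triage` function are illustrative names, not an existing tool, and it covers only three of the checklist items (unexplained preconditions, job submission without privilege escalation, and Writable claims without a gadget class).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Report:
    """Hypothetical summary of an incoming vulnerability report."""
    preconditions: List[str] = field(default_factory=list)
    preconditions_explained: bool = False
    claims_rce_via_job_submission: bool = False
    shows_privilege_escalation: bool = False
    claims_writable_deserialization: bool = False
    gadget_class: Optional[str] = None


def triage(report: Report) -> List[str]:
    """Return the reasons a report would be bounced; an empty list means it passes these checks."""
    problems: List[str] = []
    # Preconditions such as local disk write access or Kerberos credentials
    # must be explained, not glossed over.
    if report.preconditions and not report.preconditions_explained:
        problems.append("incomplete: preconditions are not explained")
    # An authorized user submitting work to YARN/MR is the system working as designed.
    if report.claims_rce_via_job_submission and not report.shows_privilege_escalation:
        problems.append("not an exploit: job submission without privilege escalation")
    # A Writable deserialization claim needs a concrete gadget class.
    if report.claims_writable_deserialization and not report.gadget_class:
        problems.append("unsupported: no Writable gadget class identified")
    return problems


if __name__ == "__main__":
    example = Report(claims_rce_via_job_submission=True)
    for reason in triage(example):
        print(reason)
```

Passing these checks does not make a report valid; it only means it is worth a human's time to read.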

I am looking at hardening some of the Writable support (and cutting it where
possible). I will add comments in the javadocs targeting AI tools to see if
that makes a difference too.

Note: this is not me dismissing AI in security attacks; it can save a lot
of time. It has also been shown to be able to audit commits and identify
their security implications, which makes it harder for any OSS project to
sneak out fixes before a release.

One PR I'm looking at is an Avro decompression one:
https://github.com/apache/avro/pull/3625
Here the issue was found using an LLM to guide fuzzing attacks and to help
generate patches:
https://arxiv.org/abs/2509.07225

This is good. It is why everyone in cybersecurity is feeling dumped on
right now, and why false reports are a distraction.

steve






On Tue, 28 Apr 2026 at 13:58, Zhanghaobo <[email protected]> wrote:

> Dear All:
>     Just as the title described, shall we introduce them to Hadoop
> project? If yes, what’s the content? Hope to receive your response. Thanks
>
>
>
>
> Best Wishes~
>
>
>
>
