This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git
The following commit(s) were added to refs/heads/main by this push:
new dcef1484 docs: deduplicate AI agent guidance with sub-directory AGENTS.md (#594)
dcef1484 is described below
commit dcef1484afd08aee5b8585f53ab88a87b386d474
Author: Shiyan Xu <[email protected]>
AuthorDate: Mon May 4 13:11:06 2026 -0500
docs: deduplicate AI agent guidance with sub-directory AGENTS.md (#594)
---
.github/copilot-instructions.md | 46 ++++----------
.github/instructions/python.instructions.md | 95 +----------------------------
.github/instructions/rust.instructions.md | 64 +------------------
AGENTS.md | 79 +++++-------------------
CLAUDE.md | 7 +--
cpp/AGENTS.md | 5 ++
cpp/CLAUDE.md | 1 +
crates/AGENTS.md | 65 ++++++++++++++++++++
crates/CLAUDE.md | 1 +
python/AGENTS.md | 66 ++++++++++++++++++++
python/CLAUDE.md | 1 +
11 files changed, 168 insertions(+), 262 deletions(-)
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 8c88cfb2..1c094c15 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1,42 +1,18 @@
# GitHub Copilot Instructions — Apache Hudi-rs
-GitHub Copilot loads **both** this file and [`AGENTS.md`](../AGENTS.md) at the root of the repo;
-they are concatenated, not alternatives. Treat `AGENTS.md` as the source of truth for project
-overview, build commands, coding conventions, testing, PR rules, and the review rubric — this file
-adds Copilot-specific notes only.
+GitHub Copilot loads **both** this file and [`AGENTS.md`](../AGENTS.md). Treat `AGENTS.md` files
+as the source of truth — this file adds Copilot-specific notes only.
-Path-scoped rules under [`./instructions/`](./instructions) are loaded automatically when files
-match their `applyTo` glob and remain authoritative for those files.
+Language-specific conventions live in sub-directory AGENTS.md files:
-## Quick orientation
+- [`crates/AGENTS.md`](../crates/AGENTS.md) — Rust
+- [`python/AGENTS.md`](../python/AGENTS.md) — Python / PyO3
+- [`cpp/AGENTS.md`](../cpp/AGENTS.md) — C++ / cxx
-- Native Rust implementation of Apache Hudi with Python (PyO3) and C++ (`cxx`) bindings.
-- Workspace: `crates/{core,datafusion,hudi,test}`, plus `python/`, `cpp/`, `benchmark/tpch/`.
-- Toolchain: Rust edition `2024` / MSRV `1.88`; Python `>=3.10`; managed via `uv` and `maturin`.
-- Pre-PR check: `make format check test`.
+Path-scoped rules under [`./instructions/`](./instructions) reference the sub-directory AGENTS.md
+files and are loaded automatically when files match their `applyTo` glob.
-## Path-scoped rules (loaded by `applyTo` frontmatter)
+For code review behavior, see
+[`code-review.instructions.md`](./instructions/code-review.instructions.md).
-| Glob | File | Topic |
-| ----------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ |
-| `**/*.rs` | [`instructions/rust.instructions.md`](./instructions/rust.instructions.md) | Rust error handling, async, performance, API design, doc comments |
-| `python/**` | [`instructions/python.instructions.md`](./instructions/python.instructions.md) | PyO3 patterns, GIL management, PyArrow conversion, Python tests |
-| `**/*` (review) | [`instructions/code-review.instructions.md`](./instructions/code-review.instructions.md) | Review rubric, severity tags, multi-round behavior, cross-file impact |
-
-## PR title format
-
-PR titles must follow [Conventional Commits](https://www.conventionalcommits.org)
-(`<type>(<scope>): <description>`). Allowed types per
-[`.commitlintrc.yaml`](../.commitlintrc.yaml):
-`build chore ci docs feat fix perf refactor revert style test`. Examples:
-
-- `feat(core): add support for MOR table reads`
-- `fix(python): handle null partition values correctly`
-- `docs: update API documentation for HudiTable`
-
-For code review behavior, severity tags, and patterns to flag, see the path-scoped
-[`code-review.instructions.md`](./instructions/code-review.instructions.md) (loaded automatically
-for all files during review).
-
-For everything else — build commands, coding conventions, testing, security expectations — see
-[`AGENTS.md`](../AGENTS.md).
+For everything else — build commands, testing, PR rules — see [`AGENTS.md`](../AGENTS.md).
diff --git a/.github/instructions/python.instructions.md b/.github/instructions/python.instructions.md
index 3751a283..ad689750 100644
--- a/.github/instructions/python.instructions.md
+++ b/.github/instructions/python.instructions.md
@@ -2,97 +2,4 @@
applyTo: "python/**"
---
-# Python Bindings Instructions
-
-## PyO3 Patterns
-
-### Error Handling
-
-- Convert Rust errors to Python exceptions properly
-- Use appropriate Python exception types
-
-```rust
-// GOOD
-#[pyfunction]
-fn read_table(path: &str) -> PyResult<PyObject> {
- let result = hudi_core::read_table(path)
- .map_err(|e| PyRuntimeError::new_err(format!("Failed to read table: {e}")))?;
- // ...
-}
-
-// BAD - Panics on error
-#[pyfunction]
-fn read_table(path: &str) -> PyObject {
- let result = hudi_core::read_table(path).unwrap(); // Will panic!
- // ...
-}
-```
-
-### GIL Management
-
-- Release GIL for long-running operations
-- Be careful with Python object access outside GIL
-
-```rust
-// GOOD - Release GIL for I/O
-fn read_files(&self, py: Python<'_>, paths: Vec<String>) -> PyResult<Vec<PyObject>> {
- let batches = py.allow_threads(|| {
- // This runs without GIL
- self.inner.read_files_blocking(&paths)
- })?;
- // Convert to Python with GIL held
- // ...
-}
-```
-
-### PyArrow Integration
-
-- Use proper Arrow<->PyArrow conversion
-- Leverage zero-copy when possible
-
-```rust
-// Use arrow's Python integration
-use arrow::pyarrow::ToPyArrow;
-
-fn to_pyarrow_batch(py: Python<'_>, batch: &RecordBatch) -> PyResult<PyObject> {
- batch.to_pyarrow(py)
-}
-```
-
-## Python API Design
-
-### Consistency with Python Conventions
-
-- Use `snake_case` for function/method names
-- Use docstrings for all public functions
-
-### Type Hints
-
-- Add type hints to Python stub files (`.pyi`)
-- Ensure compatibility with type checkers
-
-## Testing Python Bindings
-
-### Test Patterns
-
-```python
-import pytest
-from hudi import HudiTableBuilder
-
-def test_read_snapshot_with_filters():
- table = HudiTableBuilder.from_base_uri("/tmp/test").build()
- batches = table.read_snapshot(filters=[("city", "=", "test")])
- assert len(batches) > 0
-
-def test_invalid_path_raises():
- with pytest.raises(RuntimeError, match="Failed to"):
- HudiTableBuilder.from_base_uri("/nonexistent").build()
-```
-
-## Review Checklist for Python Changes
-
-- [ ] Rust panics don't propagate to Python (caught and converted)
-- [ ] GIL is released for blocking operations
-- [ ] Type stubs updated for API changes
-- [ ] Python tests added for new functionality
-- [ ] Memory management is correct (no leaks)
+See [`python/AGENTS.md`](../../python/AGENTS.md) for Python conventions.
diff --git a/.github/instructions/rust.instructions.md b/.github/instructions/rust.instructions.md
index 076fce34..3883b16d 100644
--- a/.github/instructions/rust.instructions.md
+++ b/.github/instructions/rust.instructions.md
@@ -2,66 +2,4 @@
applyTo: "**/*.rs"
---
-# Rust Instructions for Apache Hudi-rs
-
-## Error Handling (Critical)
-
-- Flag `.unwrap()` / `.expect()` in non-test code as **Critical**
-- Flag `panic!()` in library code as **Critical**
-- Exception: `unreachable!()` is acceptable with a comment explaining why
-- Prefer `anyhow::Context` or custom error types with context over bare `.map_err()`
-- Error messages should be actionable and include relevant values
-
-```rust
-// GOOD - with context
-let file = File::open(&path)
- .with_context(|| format!("Failed to open table metadata at {}", path.display()))?;
-
-// BAD - no context
-let file = File::open(&path).map_err(|e| HudiError::Io(e))?;
-```
-
-## Memory and Performance
-
-- Flag unnecessary `.clone()`, especially on large types like `Vec<RecordBatch>`
-- Prefer `&str` / `&[T]` over owned types in parameters when ownership isn't needed
-- Use `Cow<'_, str>` when a function might or might not need to allocate
-- Prefer Arrow compute kernels over manual iteration on arrays
-- Use `arrow::compute::concat_batches` for combining batches
-
-## Async Code Patterns
-
-- Ensure async functions return `Send` futures (no `Rc`, `RefCell` across await points)
-- Flag blocking I/O (`std::fs`, `std::thread::sleep`) in async functions
-- Use `tokio::task::spawn_blocking` for CPU-intensive work
-
-## Style
-
-- Use inline format args (Rust 1.88+): `format!("{x}")`, not `format!("{}", x)` — expressions
- like `path.display()` still require positional args
-
-## API Design
-
-### Builder Pattern
-
-- Use builder pattern for types with many optional parameters
-- Builders should consume `self` and return `Self` for chaining
-
-```rust
-pub struct TableBuilder {
- base_uri: String,
- options: HashMap<String, String>,
-}
-
-impl TableBuilder {
- pub fn from_base_uri(uri: impl Into<String>) -> Self { ... }
- pub fn with_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self { ... }
- pub async fn build(self) -> Result<Table> { ... }
-}
-```
-
-### Public API Documentation
-
-- All public items must have doc comments
-- Include examples in doc comments for complex APIs
-- Document panics, errors, and safety requirements
+See [`crates/AGENTS.md`](../../crates/AGENTS.md) for Rust conventions.
diff --git a/AGENTS.md b/AGENTS.md
index 810aae49..7f271791 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,20 +1,11 @@
# AGENTS.md
-Guidance for AI coding agents working on this repository — the [agents.md](https://agents.md) format.
-Humans should start with [`README.md`](./README.md) and [`CONTRIBUTING.md`](./CONTRIBUTING.md).
-
-- [`CLAUDE.md`](./CLAUDE.md) imports this file via `@AGENTS.md` (Claude Code reads `CLAUDE.md`).
-- [`.github/copilot-instructions.md`](./.github/copilot-instructions.md) is loaded by GitHub Copilot
- alongside this file and carries Copilot-specific notes.
-- [`.github/instructions/*.instructions.md`](./.github/instructions) are path-scoped Copilot rules
- (via `applyTo` frontmatter); their content is summarized below so any agent applies the same standards.
-
## Project
Native Rust implementation of [Apache Hudi](https://hudi.apache.org) with Python (PyO3) and C++
([`cxx`](https://cxx.rs)) bindings. Apache 2.0. Rust workspace, edition `2024`, MSRV `1.88`.
-Python `>=3.10`. Distributed as [`hudi`](https://crates.io/crates/hudi) on crates.io and
-[`hudi`](https://pypi.org/project/hudi/) on PyPI.
+Python `>=3.10`. Key traits: async-first (tokio), Arrow-native, `object_store` for all I/O,
+timeline-based MVCC.
```
crates/
@@ -25,7 +16,6 @@ crates/
python/ PyO3 bindings (module hudi._internal); tests in python/tests
cpp/ cxx bindings; bridge in cpp/src/lib.rs
benchmark/tpch/ TPC-H benchmark harness
-.github/instructions/ path-scoped Copilot rules (rust, python, code-review)
```
## Commands
@@ -47,42 +37,22 @@ make coverage-rust # tarpaulin HTML
## Conventions
-### Rust ([details](./.github/instructions/rust.instructions.md))
-
-- **No `.unwrap()` / `.expect()` / `panic!()`** in non-test code (🔴 Critical). `unreachable!()`
- only with a comment justifying the invariant.
-- **Errors carry context** — typed `thiserror` variants or `anyhow::Context`; messages name the
- offending value. Avoid bare `.map_err(Into::into)`.
-- **No blocking I/O in async** (`std::fs::*`, `std::thread::sleep`); use Tokio or
- `tokio::task::spawn_blocking`. Async functions must return `Send` futures.
-- **Avoid unnecessary `.clone()`** on `RecordBatch` / `Schema` / `Vec<_>`. Prefer `&str`, `&[T]`,
- `Cow<'_, str>` in parameters. Prefer Arrow compute kernels over hand-rolled loops.
-- **Builder pattern** for many-optional types: consume `self`, return `Self`, finalize with `build()`.
-- **Public items must have doc comments** (examples for non-trivial APIs; note panics, errors, safety).
-- **Inline format args** (Rust 1.88+): `format!("{x}")`, not `format!("{}", x)`.
-- **Don't widen the `cxx` FFI surface** to expose internal Hudi types — keep the bridge narrow.
-
-### Python ([details](./.github/instructions/python.instructions.md))
+### Dependencies
-- `ruff` (`E4 E7 E9 F I`, target `py310`) + `mypy --strict` over `hudi/*.py`. `snake_case`,
- docstrings on public APIs, type hints in `python/hudi/_internal.pyi`.
-- **PyO3**: convert Rust errors to specific Python exceptions
- (`PyRuntimeError::new_err(format!("Failed to …: {e}"))`); never let a panic surface to Python.
- Release the GIL with `py.allow_threads(...)` for blocking I/O. Use `arrow::pyarrow::ToPyArrow`
- for zero-copy Arrow ↔ PyArrow.
+Prefer stdlib or existing workspace dependencies before adding new crates. Keep `Cargo.lock`
+changes intentional — don't `cargo add` without justification.
-### C++
+### Language-specific
-`cxx` bridge in `cpp/src/lib.rs` — keep thin; push logic into `crates/core`. Functions may throw
-`rust::Error`; document the error semantics on the C++ side.
+- [`crates/AGENTS.md`](./crates/AGENTS.md) — Rust
+- [`python/AGENTS.md`](./python/AGENTS.md) — Python / PyO3
+- [`cpp/AGENTS.md`](./cpp/AGENTS.md) — C++ / cxx
## Testing
-Rust unit tests are colocated under `#[cfg(test)] mod tests`; use `#[tokio::test]` for async.
-Naming: `test_<function>_<scenario>_<expected>`. Shared fixtures in `crates/test`. Python tests in
-`python/tests/`. Cover happy and error paths. New features and bug fixes **must** add tests; for
-bug fixes, add a regression test that would have caught the bug. Avoid redundant coverage — each
-test should have a unique purpose. Coverage tracked via [Codecov](https://app.codecov.io/github/apache/hudi-rs).
+Cover happy and error paths. New features and bug fixes **must** add tests; for bug fixes, add a
+regression test that would have caught the bug. Avoid redundant coverage — each test should have a
+unique purpose.
## Pull requests
@@ -108,29 +78,10 @@ Storage backends route by URI scheme (`file://`, `s3://`, `az://`, `gs://`) thro
are typed: `HudiTableConfig`, `HudiReadConfig` (also Python enums). Prefer enum members over raw
string keys; bulk variants (`with_hudi_options` / `with_options`) currently expect string keys.
-## Code review rubric
-
-Full rubric: [`.github/instructions/code-review.instructions.md`](./.github/instructions/code-review.instructions.md).
-Severity tags:
-
-- 🔴 **Critical** — `.unwrap()/.expect()/panic!()` in lib code, blocking calls in async, hardcoded
- secrets, breaking public-API changes without docs.
-- 🟠 **Important** — missing error context, unnecessary clones, missing doc comments on `pub`,
- missing tests for new behavior.
-- 🟡 **Suggestion** — iterator chains over loops, `?` over nested `match` on `Result`.
-- 💬 **Question** — clarification.
-
-On updated PRs, focus on the latest commits; do not re-raise issues already fixed. For
-`crates/core` public-API changes, also inspect `crates/datafusion`, `python/`, and `cpp/`.
-
-## Pointers
+## Code review
-- [`README.md`](./README.md) — usage examples
-- [`CONTRIBUTING.md`](./CONTRIBUTING.md) — full contributor workflow
-- [`Makefile`](./Makefile) — every supported dev/CI command
-- [`Cargo.toml`](./Cargo.toml), [`python/pyproject.toml`](./python/pyproject.toml) — version pins
-- [`CHANGELOG.md`](./CHANGELOG.md) — release history (driven by `cliff.toml`)
-- [Apache Hudi docs](https://hudi.apache.org/docs/overview), [issue tracker](https://github.com/apache/hudi-rs/issues), [Slack](https://hudi.apache.org/slack)
+See [`.github/instructions/code-review.instructions.md`](./.github/instructions/code-review.instructions.md)
+for the full rubric, severity tags, and checklists.
## Maintenance
diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 0f83ea42..00000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# CLAUDE.md
-
-Imports the canonical agent guidance. Add Claude-specific notes (plan-mode tips, skill references,
-hook expectations) below the import as they emerge.
-
-@AGENTS.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file
diff --git a/cpp/AGENTS.md b/cpp/AGENTS.md
new file mode 100644
index 00000000..4db1be32
--- /dev/null
+++ b/cpp/AGENTS.md
@@ -0,0 +1,5 @@
+# C++ Conventions
+
+`cxx` bridge in `cpp/src/lib.rs` — keep thin; push logic into `crates/core`. Don't widen the FFI
+surface to expose internal Hudi types. Functions may throw `rust::Error`; document the error
+semantics on the C++ side.
diff --git a/cpp/CLAUDE.md b/cpp/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/cpp/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file
diff --git a/crates/AGENTS.md b/crates/AGENTS.md
new file mode 100644
index 00000000..cd423ab5
--- /dev/null
+++ b/crates/AGENTS.md
@@ -0,0 +1,65 @@
+# Rust Conventions
+
+## Error Handling
+
+- **No `.unwrap()` / `.expect()` / `panic!()`** in non-test code (Critical). `unreachable!()`
+ only with a comment justifying the invariant.
+- **Errors carry context** — typed `thiserror` variants or `anyhow::Context`; messages name the
+ offending value. Avoid bare `.map_err(Into::into)`.
+
+```rust
+// GOOD — with context
+let file = File::open(&path)
+ .with_context(|| format!("Failed to open table metadata at {}", path.display()))?;
+
+// BAD — no context
+let file = File::open(&path).map_err(|e| HudiError::Io(e))?;
+```
+
+## Async
+
+- **No blocking I/O in async** (`std::fs::*`, `std::thread::sleep`); use Tokio or
+ `tokio::task::spawn_blocking`. Async functions must return `Send` futures (no `Rc` or `RefCell`
+ across await points).
+
+## Memory & Performance
+
+- **Avoid unnecessary `.clone()`** on `RecordBatch` / `Schema` / `Vec<_>`. Prefer `&str`, `&[T]`,
+ `Cow<'_, str>` in parameters.
+- **Prefer streaming over collecting** — don't collect streams of `RecordBatch` into `Vec` when
+ you can process them incrementally.
+- Prefer Arrow compute kernels over hand-rolled loops. Use `arrow::compute::concat_batches` for
+ combining batches.
+
+## API Design
+
+- **Builder pattern** for many-optional types: consume `self`, return `Self`, finalize with
+ `build()`.
+
+```rust
+pub struct TableBuilder {
+ base_uri: String,
+ options: HashMap<String, String>,
+}
+
+impl TableBuilder {
+ pub fn from_base_uri(uri: impl Into<String>) -> Self { ... }
+ pub fn with_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self { ... }
+ pub async fn build(self) -> Result<Table> { ... }
+}
+```
+
+- **Public items must have doc comments** (examples for non-trivial APIs; note panics, errors,
+ safety).
+
+## Testing
+
+Unit tests are colocated under `#[cfg(test)] mod tests`; use `#[tokio::test]` for async.
+Naming: `test_<function>_<scenario>_<expected>`. Shared fixtures in `crates/test`.
+
+## Style
+
+Run `make format-rust check-rust` before submitting to match CI.
+
+- **Inline format args** (Rust 1.88+): `format!("{x}")`, not `format!("{}", x)` — expressions
+ like `path.display()` still require positional args.
diff --git a/crates/CLAUDE.md b/crates/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/crates/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file
diff --git a/python/AGENTS.md b/python/AGENTS.md
new file mode 100644
index 00000000..a68cabff
--- /dev/null
+++ b/python/AGENTS.md
@@ -0,0 +1,66 @@
+# Python Conventions
+
+## Linting & Formatting
+
+Run `make format-python check-python` before submitting to match CI.
+
+`ruff` (`E4 E7 E9 F I`, target `py310`) + `mypy --strict` over `hudi/*.py`. `snake_case`,
+docstrings on public APIs, type hints in `python/hudi/_internal.pyi`.
+
+## PyO3 Patterns
+
+### Error Handling
+
+Convert Rust errors to specific Python exceptions; never let a panic surface to Python.
+
+```rust
+// GOOD
+#[pyfunction]
+fn read_table(path: &str) -> PyResult<PyObject> {
+ let result = hudi_core::read_table(path)
+ .map_err(|e| PyRuntimeError::new_err(format!("Failed to read table: {e}")))?;
+ // ...
+}
+
+// BAD — panics on error
+#[pyfunction]
+fn read_table(path: &str) -> PyObject {
+ let result = hudi_core::read_table(path).unwrap();
+ // ...
+}
+```
+
+### GIL Management
+
+Release the GIL with `py.allow_threads(...)` for blocking I/O.
+
+```rust
+fn read_files(&self, py: Python<'_>, paths: Vec<String>) -> PyResult<Vec<PyObject>> {
+ let batches = py.allow_threads(|| {
+ self.inner.read_files_blocking(&paths)
+ })?;
+ // ...
+}
+```
+
+### PyArrow Integration
+
+Use `arrow::pyarrow::ToPyArrow` for zero-copy Arrow <-> PyArrow.
+
+## Testing
+
+Tests in `python/tests/`.
+
+```python
+import pytest
+from hudi import HudiTableBuilder
+
+def test_read_snapshot_with_filters():
+ table = HudiTableBuilder.from_base_uri("/tmp/test").build()
+ batches = table.read_snapshot(filters=[("city", "=", "test")])
+ assert len(batches) > 0
+
+def test_invalid_path_raises():
+ with pytest.raises(RuntimeError, match="Failed to"):
+ HudiTableBuilder.from_base_uri("/nonexistent").build()
+```
diff --git a/python/CLAUDE.md b/python/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/python/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file