[PR] [python] Add streaming infrastructure: scanners, consumers, caching [paimon]

via GitHub Wed, 04 Mar 2026 06:22:33 -0800


tub opened a new pull request, #7342:
URL: https://github.com/apache/paimon/pull/7342


   ## Summary
   
   This is PR 1 of 3 for pure-Python streaming reads. It adds the foundational infrastructure:
   
   - **Follow-up scanners** (delta, changelog, incremental diff) for continuous snapshot polling
   - **Consumer manager** for persisting read progress to the table path
   - **LRU caching** for snapshots, manifests, and manifest lists
   - **Batch existence checks** for efficient file IO
   - **Bucket-based sharding** params in FileScanner for parallel consumption
   - **Row kind support** in table reads
   - **Streaming-related core options**
   - **Backtick support** for identifier parsing
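
To illustrate the backtick item above, here is a minimal sketch of how a dotted identifier with backtick-quoted parts could be split. The function name `split_identifier` and its behavior are assumptions made for illustration; they are not taken from this PR or from the pypaimon API.

```python
from typing import List


def split_identifier(text: str) -> List[str]:
    """Split a dotted identifier, treating dots inside backticks as literal.

    Hypothetical helper for illustration only; not the parser added in this PR.
    """
    parts: List[str] = []
    current: List[str] = []
    quoted = False
    for ch in text:
        if ch == "`":
            # A backtick toggles quoting and is not part of the name itself.
            quoted = not quoted
        elif ch == "." and not quoted:
            # An unquoted dot separates two parts of the identifier.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError(f"unterminated backtick in identifier: {text!r}")
    parts.append("".join(current))
    return parts


plain = split_identifier("my_db.my_table")
quoted_parts = split_identifier("`my.db`.`my.table`")
```

In this sketch, quoting only changes how dots are interpreted; escaping a literal backtick inside a quoted part is left out.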
   
   25 files changed, +2701 / -31 lines
   
   ## PR Stack
   
   1. **👉 this PR**: Streaming infrastructure (scanners, consumers, caching, sharding)
   2. Core streaming (StreamReadBuilder, AsyncStreamingTableScan, table integration)
   3. CLI (`paimon tail` command)
   
   **Merge workflow:** Merge PR 1, rebase PR 2 onto the updated master (the PR 1 commits drop out), merge PR 2, then repeat for PR 3.
   
   ## Test plan
   
   - [x] `python -m pytest pypaimon/tests`: 537 passed (9 pre-existing lance failures)
   - [x] `python -c "from pypaimon import CatalogFactory"`: no import errors
   - [x] Unit tests for all new scanners, consumer manager, manifest caching, identifier parsing
   - [x] Integration tests for FileScanner shard filtering
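
As a rough picture of what bucket-based shard filtering means, the sketch below assigns each bucket to exactly one shard by taking the bucket number modulo the shard count. The modulo rule, the `FileEntry` class, and the `filter_by_shard` function with its parameters are all assumptions for illustration; the actual FileScanner parameters and assignment rule in this PR may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a scanned data file; not a pypaimon class."""
    bucket: int
    file_name: str


def filter_by_shard(
    entries: Iterable[FileEntry], shard_index: int, shard_count: int
) -> List[FileEntry]:
    """Keep only the entries whose bucket is assigned to this shard.

    Assumed rule: bucket b belongs to shard (b % shard_count).
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return [e for e in entries if e.bucket % shard_count == shard_index]


# Six buckets split across two parallel consumers.
entries = [FileEntry(bucket=b, file_name=f"data-{b}.parquet") for b in range(6)]
shard_0 = filter_by_shard(entries, shard_index=0, shard_count=2)
shard_1 = filter_by_shard(entries, shard_index=1, shard_count=2)
```

Under this rule the shards are disjoint and together cover every bucket, which is the property a shard-filtering test would typically check.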
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
