Hi Manan,

We noticed you cross-posted some discussion items from the KIP-1248/KIP-1254 thread (1) and wanted to share our response to your points:

1. Performance & Latency

1.2 Increased internal bandwidth & Cost
You mentioned that RRRs improve cost control. We view this differently:

1.2.a Cost: Having clients read directly from S3 (KIP-1254) is inherently cheaper, since there is no need to launch and manage separate RRR instances.
1.2.b Throughput: Direct S3 access offers better scalability for fan-out reads. We can launch hundreds of clients reading parallel offset ranges from S3, whereas Kafka's scale-out read capability is limited by the provisioned RRR fleet size (see the sketch below).
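To make the fan-out point in 1.2.b concrete, here is a minimal sketch using the AWS SDK for Java v2. The bucket name, object key, segment size, and reader count are placeholders we invented for illustration; nothing here is prescribed by KIP-1254 or by the tiered-storage segment layout.

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSegmentRead {
    public static void main(String[] args) throws Exception {
        // All names and sizes below are illustrative placeholders, not KIP-defined.
        String bucket = "example-tiered-storage";
        String key = "topicA-0/00000000000001234567.log";
        long segmentSize = 1L << 30;   // pretend the tiered segment is 1 GiB
        long chunk = 8L << 20;         // each reader fetches an 8 MiB byte range
        int readers = 64;              // stand-in for "hundreds of clients"

        ExecutorService pool = Executors.newFixedThreadPool(readers);
        List<Future<Long>> pending = new ArrayList<>();

        try (S3Client s3 = S3Client.create()) {
            for (long start = 0; start < segmentSize; start += chunk) {
                long from = start;
                long to = Math.min(start + chunk, segmentSize) - 1;
                // Each task issues an independent ranged GET; S3 serves these in
                // parallel, so aggregate throughput grows with the number of readers.
                pending.add(pool.submit(() -> {
                    GetObjectRequest req = GetObjectRequest.builder()
                            .bucket(bucket)
                            .key(key)
                            .range("bytes=" + from + "-" + to)
                            .build();
                    ResponseBytes<GetObjectResponse> body = s3.getObjectAsBytes(req);
                    return (long) body.asByteArray().length;
                }));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get();   // wait for all ranged reads to complete
            }
            System.out.println("Read " + total + " bytes using " + readers + " parallel readers");
        } finally {
            pool.shutdown();
        }
    }
}

The point is simply that read parallelism is chosen on the client side and scales with the number of readers, rather than being capped by a provisioned fleet.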
2. Client & Protocol

2.2 Redirect-based flow
You mentioned that "Redirects are lightweight." Could you elaborate on the protocol?

2.2.a Switching: How does the client seamlessly switch between the RRR and the normal broker? Does the main broker return a specific error code?
2.2.b Flapping: If a client consumes at the hot/cold boundary, is there a risk of "flapping" (repeated disconnects) between the leader and the RRR as segments roll over?

4. Metadata & Routing

4.1 Partition assignment
The KIP states that RRRs are "stateless" but also that they "optionally cache" data. These two statements seem to be at odds with each other.

4.1.a Assignment: How is partition assignment handled when the topic's partition count changes? Does the Controller explicitly map partitions to RRRs?
4.1.b Discovery: How does the cluster (and the client) discover a newly created RRR node when scaling out?
4.1.c Cache hit rate: If routing is purely dynamic/stateless, requests for the same partition could land on different nodes, which negates the benefit of local caching (see the toy sketch below).
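To illustrate the concern in 4.1.c, here is a toy comparison we put together (entirely our own illustration; the "sticky" hash mapping is just one hypothetical counter-example, not something proposed in the KIP). With purely random/stateless routing, fetches for one partition scatter across the fleet, so every node ends up pulling and caching the same segments; with a deterministic partition-to-node mapping, repeat reads stay on one node and its cache can actually pay off.

import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class RrrRoutingSketch {
    public static void main(String[] args) {
        // Hypothetical fleet and partition names, purely for illustration.
        List<String> rrrFleet = List.of("rrr-1", "rrr-2", "rrr-3", "rrr-4", "rrr-5");
        String topicPartition = "topicA-0";
        int fetches = 1_000;
        Random random = new Random(42);

        Set<String> nodesHitRandom = new HashSet<>();
        Set<String> nodesHitSticky = new HashSet<>();

        for (int i = 0; i < fetches; i++) {
            // Stateless routing: any RRR may serve this fetch, so over time the
            // partition's segments get pulled and cached on (almost) every node.
            nodesHitRandom.add(rrrFleet.get(random.nextInt(rrrFleet.size())));

            // Hypothetical deterministic ("sticky") routing: the same partition
            // always maps to the same RRR, so its segments only need caching once.
            int idx = Math.floorMod(topicPartition.hashCode(), rrrFleet.size());
            nodesHitSticky.add(rrrFleet.get(idx));
        }

        System.out.println("Stateless routing touched " + nodesHitRandom.size()
                + " of " + rrrFleet.size() + " nodes");          // almost certainly all 5
        System.out.println("Sticky routing touched " + nodesHitSticky.size() + " node"); // always 1
    }
}

If the intent is to keep RRRs fully stateless, it would help to understand whether the brokers or the Controller apply some deterministic partition-to-RRR mapping so that local caching remains effective.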
Thanks,
Tom & Henry

(1) https://lists.apache.org/thread/j9l1orx8x67lwcbo6f7qgn7xs3p5bjq0

On 2026/01/09 12:07:41 Manan Gupta wrote:
> Below are responses to the key concerns raised around RRRs in KIP-1248 and
> KIP-1254, organized by area:
>
> 1. Performance & Latency
> 1.1 Higher read latency
> Yes. Historical reads add a hop (remote storage → RRR → client). This is
> intentional: RRRs target cold and analytic workloads where throughput and
> cost efficiency matter more than tail latency.
> Mitigations include prefetching, local caching, larger sequential reads,
> and AZ-local RRRs. Hot-path consumers continue to read directly from
> leaders.
>
> 1.2 Increased internal bandwidth
> RRRs increase internal traffic, but they:
> - Reduce load on leader brokers
> - Centralize and optimize remote storage access
> - Improve cost control versus per-client object storage reads
>
> 2. Client & Protocol
> 2.1 Client complexity
> Client complexity is reduced, not eliminated. Brokers remain authoritative,
> clients stay storage-agnostic, and most complexity is encapsulated in
> shared libraries.
>
> 2.2 Redirect-based flow
> Redirects are lightweight and Kafka-native (similar to leader/coordinator
> discovery). Clients follow broker instructions without understanding
> storage layouts or tiering.
>
> 3. Semantics & Features
> 3.1 Transactional semantics
> Preserved. RRRs read canonical log segments, including transaction markers.
> read_committed semantics are supported.
>
> 3.2 Newer features
> RRRs initially support standard log consumption only. Features requiring
> coordination or state mutation remain on main brokers by design.
>
> 4. Metadata & Routing
> 4.1 Partition assignment
> RRRs are stateless: no partition ownership, ISR participation, or
> rebalancing. Routing is dynamic and broker/controller-driven.
>
> 4.2 AZ affinity
> Handled via existing rack/AZ metadata and broker-directed redirects.
>
> 4.3 Failure handling
> No state means no rebalancing. Clients retry against another RRR or fall
> back to brokers.
>
> 5. Operations & Scaling
> 5.1 Operational overhead
> RRRs add a fleet but are stateless: no replication, elections, writes, or
> durability responsibilities. They are easy to automate and replace.
>
> 5.2 Autoscaling
> A first-class goal. RRRs scale on load, start quickly, and scale down
> safely without state migration.
>
> 6. Architectural Trade-off
> Yes, complexity is shifted—but deliberately off the hot path. This isolates
> cold and bursty reads, protects real-time workloads, and cleanly separates
> durability, serving, and analytics concerns.
>
> On 2025/12/14 10:58:32 Manan Gupta wrote:
> > Hi all,
> >
> > This email starts the discussion thread for *KIP-1255: Remote Read
> > Replicas for Kafka Tiered Storage*. The proposal introduces a lightweight
> > broker role, *Remote Read Replica*, dedicated to serving historical reads
> > directly from remote storage.
> >
> > We’d appreciate your initial thoughts and feedback on the proposal.
> >
