[ 
https://issues.apache.org/jira/browse/KAFKA-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Roesler resolved KAFKA-6144.
---------------------------------
    Resolution: Fixed

> Allow serving interactive queries from in-sync Standbys
> -------------------------------------------------------
>
>                 Key: KAFKA-6144
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6144
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Antony Stubbs
>            Assignee: Navinder Brar
>            Priority: Major
>              Labels: kip-535
>             Fix For: 2.5.0
>
>         Attachments: image-2019-10-09-20-33-37-423.png, 
> image-2019-10-09-20-47-38-096.png
>
>
> Currently when expanding the KS cluster, the new node's partitions will be 
> unavailable during the rebalance, which for large states can take a very long 
> time, or for small state stores even more than a few ms can be a deal-breaker 
> for micro service use cases.
> One workaround is to allow stale data to be read from the state stores when 
> use case allows. Adding the use case from KAFKA-8994 as it is more 
> descriptive.
> "Consider the following scenario in a three node Streams cluster with node A, 
> node S and node R, executing a stateful sub-topology/topic group with 1 
> partition and `_num.standby.replicas=1_`  
>  * *t0*: A is the active instance owning the partition, B is the standby that 
> keeps replicating the A's state into its local disk, R just routes streams 
> IQs to active instance using StreamsMetadata
>  * *t1*: IQs pick node R as router, R forwards query to A, A responds back to 
> R which reverse forwards back the results.
>  * *t2:* Active A instance is killed and rebalance begins. IQs start failing 
> to A
>  * *t3*: Rebalance assignment happens and standby B is now promoted as active 
> instance. IQs continue to fail
>  * *t4*: B fully catches up to changelog tail and rewinds offsets to A's last 
> commit position, IQs continue to fail
>  * *t5*: IQs to R, get routed to B, which is now ready to serve results. IQs 
> start succeeding again
>  
> Depending on Kafka consumer group session/heartbeat timeouts, step t2,t3 can 
> take few seconds (~10 seconds based on defaults values). Depending on how 
> laggy the standby B was prior to A being killed, t4 can take few 
> seconds-minutes. 
> While this behavior favors consistency over availability at all times, the 
> long unavailability window might be undesirable for certain classes of 
> applications (e.g simple caches or dashboards). 
> This issue aims to also expose information about standby B to R, during each 
> rebalance such that the queries can be routed by an application to a standby 
> to serve stale reads, choosing availability over consistency."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to