[
https://issues.apache.org/jira/browse/KAFKA-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias J. Sax resolved KAFKA-6953.
------------------------------------
Resolution: Abandoned
> [Streams] Schedulable KTable as Graph source (for minimizing aggregation
> pressure)
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-6953
> URL: https://issues.apache.org/jira/browse/KAFKA-6953
> Project: Kafka
> Issue Type: New Feature
> Components: streams
> Reporter: Flavio Stutz
> Priority: Major
>
> === PROBLEM ===
> We have faced the following scenario/problem in a lot of situations with
> KStreams:
> - Huge incoming data being processed by numerous application instances
> - Need to aggregate, or count the overall data as a single value
> (something like "count the total number of messages that has been processed
> among all distributed instances")
> - The challenge here is to manage this kind of situation without any
> bottlenecks. We don't need the overall aggregation of all instances states at
> each processed message, so it is possible to store the partial aggregations
> on local stores and, at time to time, query those states and aggregate them,
> avoiding bottlenecks.
> Some ways we KNOW it wouldn't work because of bottlenecks:
> - Sink all instances local counter/aggregation result to a Topic with a
> single partition so that we could have another Graph with a single instance
> that could aggregate all results
> - In this case, if I had 500 instances processing 1000/s each (with
> no bottlenecks), I would have a single partition topic with 500k messages/s
> for my single aggregating instance to process that much messages (IMPOSSIBLE
> bottleneck)
> === TRIALS ===
> These are some ways we managed to do this:
> - Expose a REST endpoint so that Prometheus could extract local metrics of
> each application instance's state stores and them calculate the total count
> on Prometheus using queries
> - we don't like this much because we believe KStreams was meant to
> INPUT and OUTPUT data using Kafka Topics for simplicity and power
> - Create a scheduled Punctuate at the end of the Graph so that we can
> query (using getAllMetadata) all other instances's state store counters, sum
> them all and them publish to another Kafka Topic from time to time.
> - For this to work we created a way so that only one application
> instance's Punctuate algorithm would perform the calculations (something like
> a master election through instance ids and metadata)
> === PROPOSAL ===
> Create a new DSL Source with the following characteristics:
> - Source parameters: "scheduled time" (using cron's like config), "state
> store name", bool "from all application instances"
> - Behavior: At the desired time, query all K,V tuples from the state store
> and source those messages to the Graph
> - If "from all application instances" is true, query the tuples
> from all application instances state stores and source them all, concatenated
> - This is a way to create a "timed aggregation barrier" to avoid
> bottlenecks. With this we could enhance the ability of KStreams to better
> handle the CAP Theorem characteristics, so that one could choose to have
> Consistency over Availability.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)