You make some great cases for your architecture. To be clear - Ive been proselytizing for kafka since I joined this company last year. I think my largest issue is rethinking some preexisting notions about streaming to make them work in the kstream universe.
On Fri, Mar 24, 2017 at 6:07 AM, Michael Noll <mich...@confluent.io> wrote: > > If I understand this correctly: assuming I have a simple aggregator > > distributed across n-docker instances each instance will _also_ need to > > support some sort of communications process for allowing access to its > > statestore (last param from KStream.groupby.aggregate). > > Yes. > > See > http://docs.confluent.io/current/streams/developer- > guide.html#your-application-and-interactive-queries > . > > > - The tombstoning facilities of redis or C* would lend themselves well to > > implementing a 'true' rolling aggregation > > What is a 'true' rolling aggregation, and how could Redis or C* help with > that in a way that Kafka can't? (Honest question.) > > > > I get that RocksDB has a small footprint but given the choice of > > implementing my own RPC / gossip-like process for data sharing and using > a > > well tested one (ala C* or redis) I would almost always opt for the > latter. > > [...] > > Just my $0.02. I would love to hear why Im missing the 'big picture'. The > > kstreams architecture seems rife with potential. > > One question is, for example: can the remote/central DB of your choice > (Redis, C*) handle as many qps as Kafka/Kafka Streams/RocksDB can handle? > Over the network? At the same low latency? Also, what happens if the > remote DB is unavailable? Do you wait and retry? Discard? Accept the > fact that your app's processing latency will now go through the roof? I > wrote about some such scenarios at > https://www.confluent.io/blog/distributed-real-time-joins- > and-aggregations-on-user-activity-events-using-kafka-streams/ > . > > One big advantage (for many use cases, not for) with Kafka/Kafka Streams is > that you can leverage fault-tolerant *local* state that may also be > distributed across app instances. Local state is much more efficient and > faster when doing stateful processing such as joins or aggregations. You > don't need to worry about an external system, whether it's up and running, > whether its version is still compatible with your app, whether it can scale > as much as your app/Kafka Streams/Kafka/the volume of your input data. > > Also, note that some users have actually opted to run hybrid setups: Some > processing output is sent to a remote data store like Cassandra (e.g. via > Kafka Connect), some processing output is exposed directly through > interactive queries. It's not like your forced to pick only one approach. > > > > - Typical microservices would separate storing / retrieving data > > I'd rather argue that for microservices you'd oftentimes prefer to *not* > use a remote DB, and rather do everything inside your microservice whatever > the microservice needs to do (perhaps we could relax this to "do everything > in a way that your microservices is in full, exclusive control", i.e. it > doesn't necessarily need to be *inside*, but arguably it would be better if > it actually is). > See e.g. the article > https://www.confluent.io/blog/data-dichotomy-rethinking-the- > way-we-treat-data-and-services/ > that lists some of the reasoning behind this school of thinking. Again, > YMMV. > > Personally, I think there's no simple true/false here. The decisions > depend on what you need, what your context is, etc. Anyways, since you > already have some opinions for the one side, I wanted to share some food > for thought for the other side of the argument. :-) > > Best, > Michael > > > > > > On Fri, Mar 24, 2017 at 1:25 PM, Jon Yeargers <jon.yearg...@cedexis.com> > wrote: > > > If I understand this correctly: assuming I have a simple aggregator > > distributed across n-docker instances each instance will _also_ need to > > support some sort of communications process for allowing access to its > > statestore (last param from KStream.groupby.aggregate). > > > > How would one go about substituting a separated db (EG redis) for the > > statestore? > > > > Some advantages to decoupling: > > - It would seem like having a centralized process like this would > alleviate > > the need to execute multiple requests for a given kv pair (IE "who has > this > > data?" and subsequent requests to retrieve it). > > - it would take some pressure off of each node to maintain a large disk > > store > > - Typical microservices would separate storing / retrieving data > > - It would raise some eyebrows if a spec called for a mysql/nosql > instance > > to be installed with every docker container > > - The tombstoning facilities of redis or C* would lend themselves well to > > implementing a 'true' rolling aggregation > > > > I get that RocksDB has a small footprint but given the choice of > > implementing my own RPC / gossip-like process for data sharing and using > a > > well tested one (ala C* or redis) I would almost always opt for the > latter. > > (Footnote: Our implementations already heavily use redis/memcached for > > deduplication of kafka messages so it would seem a small step to use the > > same to store aggregation results.) > > > > Just my $0.02. I would love to hear why Im missing the 'big picture'. The > > kstreams architecture seems rife with potential. > > > > On Thu, Mar 23, 2017 at 3:17 PM, Matthias J. Sax <matth...@confluent.io> > > wrote: > > > > > The config does not "do" anything. It's metadata that get's broadcasted > > > to other Streams instances for IQ feature. > > > > > > See this blog post for more details: > > > https://www.confluent.io/blog/unifying-stream-processing- > > > and-interactive-queries-in-apache-kafka/ > > > > > > Happy to answer any follow up question. > > > > > > > > > -Matthias > > > > > > On 3/23/17 11:51 AM, Jon Yeargers wrote: > > > > What does this config param do? > > > > > > > > I see it referenced / used in some samples and here ( > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 67%3A+Queryable+state+for+Kafka+Streams > > > > ) > > > > > > > > > > > > >