We had some discussions about whether we can/should replace the re-partitioning topic with a direct network connection between instances. It's a tricky problem though, with many strings attached... Thus, it comes with pros and cons, and it's still unclear what the exact trade-off is.
Thus, it might happen, but it's unclear atm if or when. No concrete road map. But as an open-source project, we rely on user feedback. Thus, this idea just got one more +1 :)

-Matthias

On 11/29/17 8:26 AM, Adrienne Kole wrote:
> Hi,
>
> You misunderstood the focus of the post perhaps or I could not explain
> properly. I am not claiming the streams is limited to single node.
> Although the whole topology instance can be limited to a single node (each
> node run all topology), this is sth else.
> Also, I think that "moving 100s of GB data per day" claim is orthogonal
> and as this is not big/fast/ enough to reason.
>
> The thing is that, for some use-cases streams-kafka-streams connection can
> be a bottleneck. Yes, if I have 40GB/s or infiniband network bandwidth
> this might not be an issue.
>
> Consider a simple topology with operators A->B->C. (B forces to re-partition)
> Streams nodes are s1(A), s2 (B,C) and kafka resides on cluster k, which
> might be in different network switch.
> So, rather than transferring data k->s1->s2, we make a round trip
> k->s1->k->s2. If we know that s1 and s2 are in the same network and data
> transfer is fast between two, we should not go through another intermediate
> layer.
>
>
> Thanks.
>
>
>
> On Wed, Nov 29, 2017 at 4:52 PM, Jan Filipiak <[email protected]>
> wrote:
>
>> Hey,
>>
>> you making some wrong assumptions here.
>> Kafka Streams is in no way single threaded or
>> limited to one physical instance.
>> Having connectivity issues to your brokers is IMO
>> a problem with the deployment and not at all
>> with how kafka streams is designed and works.
>>
>> Kafka Streams moves hundreds of GB per day for us.
>>
>> Hope this helps.
>>
>> Best Jan
>>
>>
>>
>> On 29.11.2017 15:10, Adrienne Kole wrote:
>>
>>> Hi,
>>>
>>> The purpose of this email is to get overall intuition for the future
>>> plans of streams library.
>>>
>>> The main question is that, will it be a single threaded application in the
>>> long run and serve microservices use-cases, or are there any plans to
>>> extend it to multi-node execution framework with less kafka dependency.
>>>
>>> Currently, each streams node 'talks' with kafka cluster and they can
>>> indirectly talk with each other again through kafka. However, especially
>>> if kafka is not in the same network with streams nodes (actually this can
>>> happen if they are in the same network as well) this will cause high
>>> network overhead and inefficiency.
>>>
>>> One solution for this (bypassing network overhead) is to deploy streams
>>> node on kafka cluster to ensure the data locality. However, this is not
>>> recommended as the library and kafka can affect each other's performance
>>> and streams does not necessarily have to know the internal data
>>> partitioning of kafka.
>>>
>>> Another solution would be extending streams library to have a common
>>> runtime. IMO, preserving the current selling points of streams (like
>>> dynamic scale in/out) with this kind of extensions can be very good
>>> improvement.
>>>
>>> So my question is that, will streams in the long/short run, will extend
>>> its use-cases to massive and efficient stream processing (and compete with
>>> spark) or stay and strengthen its current position?
>>>
>>> Cheers,
>>> Adrienne
>>>
>>>
>>
>
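To make the k->s1->s2 vs. k->s1->k->s2 comparison from the thread concrete, here is a rough back-of-the-envelope sketch in Python. It is only a toy model, not anything from Kafka Streams itself: the helper names (hops, traffic_gb, broker_hops) and the 100 GB figure are made up for illustration, and it assumes every hop carries the full data volume (no filtering in A, no broker replication traffic, all records move from s1 to s2).

```python
# Toy model (hypothetical helpers, not a Kafka Streams API): count network
# transfers for the A->B->C topology from the thread, where s1 runs A,
# s2 runs B and C, and k is the Kafka cluster.

def hops(path):
    """Return the (source, destination) pairs along a path of nodes."""
    return list(zip(path, path[1:]))

def traffic_gb(path, volume_gb):
    """Total data sent over the network, assuming each hop carries volume_gb."""
    return len(hops(path)) * volume_gb

def broker_hops(path, broker="k"):
    """Number of hops that start or end at the Kafka cluster."""
    return sum(1 for src, dst in hops(path) if broker in (src, dst))

via_kafka = ["k", "s1", "k", "s2"]  # today: re-partition topic lives on the brokers
direct = ["k", "s1", "s2"]          # discussed idea: s1 sends directly to s2

volume_gb = 100  # arbitrary illustrative volume, an assumption
for name, path in (("via kafka", via_kafka), ("direct", direct)):
    print(name, traffic_gb(path, volume_gb), "GB total,",
          broker_hops(path), "hops touching the brokers")
```

Under these simplifying assumptions the direct path saves one full transfer of the data and touches the brokers once instead of three times. What the model leaves out is exactly the hard part mentioned above: the re-partition topic also provides buffering and fault tolerance, which a direct connection would have to replace somehow.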
