Re: Plans to extend streams?

Guozhang Wang Wed, 29 Nov 2017 11:16:27 -0800

Hello Adrienne,

I think your suggested feature is to use not only Kafka for inter-process
communication but also to make it configurable to use TCP directly, right?


A few people have asked about this before, especially about not using
Kafka for repartitioning (think: shuffling in the batch world) and instead
sending the data over TCP between processes. Though this is doable, I'd point
out that it may have many side effects, such as:

1) back pressure: the Streams library does not worry about back pressure at
all, since all communication channels are persistent (Kafka topics); with TCP
you would need to face the back pressure issue again.
2) exactly-once semantics: Streams leverages transactional messaging to
achieve EOS, and extending it to TCP means that we need to add more
machinery to handle TCP data loss / duplicates (e.g. other frameworks have
been using buffers with epoch boundaries to do that).
3) state snapshots: imagine you are shutting down your app; we then need
to make sure all in-flight messages over TCP are drained, because otherwise
we cannot be certain whether the committed offsets are valid.
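
To make point 1 concrete, here is a minimal sketch of what back pressure
looks like once the channel between two operators is a bounded buffer (a
stand-in for a direct TCP connection) instead of a persistent topic. This is
not Kafka Streams code; the function name, parameters, and tick-based model
are all hypothetical and only illustrate the idea.

```python
from collections import deque


def simulate_direct_channel(num_records, buffer_capacity,
                            produce_per_tick, consume_per_tick):
    """Simulate upstream -> bounded buffer -> downstream in discrete ticks.

    Returns (ticks_elapsed, stalled_ticks), where stalled_ticks counts the
    ticks in which the upstream operator wanted to send more records than
    the buffer had room for, i.e. it felt back pressure.
    """
    buffer = deque()
    sent = 0
    consumed = 0
    ticks = 0
    stalled_ticks = 0

    while consumed < num_records:
        ticks += 1

        # Upstream tries to send; it can only send what fits in the buffer.
        wanted = min(produce_per_tick, num_records - sent)
        room = buffer_capacity - len(buffer)
        accepted = min(wanted, room)
        for _ in range(accepted):
            buffer.append(sent)
            sent += 1
        if accepted < wanted:
            stalled_ticks += 1  # upstream had to hold records back

        # Downstream drains at its own (possibly slower) rate.
        for _ in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
            consumed += 1

    return ticks, stalled_ticks


if __name__ == "__main__":
    # Slow downstream: upstream stalls once the buffer fills up.
    print(simulate_direct_channel(100, 10, 5, 1))
    # Matched rates: the buffer never fills, so there are no stalls.
    print(simulate_direct_channel(100, 10, 5, 5))
```

With a persistent topic in between, the upstream side simply keeps appending
and the downstream side reads at its own pace, which is why Streams does not
have to deal with this today.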



Guozhang


On Wed, Nov 29, 2017 at 8:26 AM, Adrienne Kole <adrienneko...@gmail.com>
wrote:

> Hi,
>
> You perhaps misunderstood the focus of the post, or I did not explain it
> properly. I am not claiming that Streams is limited to a single node.
> Although a whole topology instance can be limited to a single node (each
> node runs the whole topology), this is something else.
> Also, I think that the "moving 100s of GB data per day" claim is orthogonal,
> as this is not big/fast enough to reason about.
>
> The thing is that, for some use cases, the streams-kafka-streams connection
> can be a bottleneck. Yes, if I have 40GB/s or InfiniBand network bandwidth
> this might not be an issue.
>
> Consider a simple topology with operators A->B->C (B forces a
> repartition).
> Streams nodes are s1 (A) and s2 (B, C), and Kafka resides on cluster k, which
> might be on a different network switch.
> So, rather than transferring data k->s1->s2, we make a round trip
> k->s1->k->s2. If we know that s1 and s2 are in the same network and data
> transfer between the two is fast, we should not go through another
> intermediate layer.
>
>
> Thanks.
>
>
>
> On Wed, Nov 29, 2017 at 4:52 PM, Jan Filipiak <jan.filip...@trivago.com>
> wrote:
>
> > Hey,
> >
> > you are making some wrong assumptions here.
> > Kafka Streams is in no way single threaded or
> > limited to one physical instance.
> > Having connectivity issues to your brokers is IMO
> > a problem with the deployment and not at all
> > with how kafka streams is designed and works.
> >
> > Kafka Streams moves hundreds of GB per day for us.
> >
> > Hope this helps.
> >
> > Best Jan
> >
> >
> >
> > On 29.11.2017 15:10, Adrienne Kole wrote:
> >
> >> Hi,
> >>
> >> The purpose of this email is to get an overall intuition for the future
> >> plans of the streams library.
> >>
> >> The main question is: will it be a single-threaded application in the
> >> long run and serve microservices use cases, or are there any plans to
> >> extend it to a multi-node execution framework with less Kafka dependency?
> >>
> >> Currently, each streams node 'talks' with the Kafka cluster, and the nodes
> >> can talk with each other only indirectly, again through Kafka. However,
> >> especially if Kafka is not in the same network as the streams nodes
> >> (actually this can happen if they are in the same network as well), this
> >> will cause high network overhead and inefficiency.
> >>
> >> One solution for this (bypassing network overhead) is to deploy streams
> >> nodes on the Kafka cluster to ensure data locality. However, this is not
> >> recommended, as the library and Kafka can affect each other's performance
> >> and streams does not necessarily have to know the internal data
> >> partitioning of Kafka.
> >>
> >> Another solution would be extending the streams library to have a common
> >> runtime. IMO, preserving the current selling points of streams (like
> >> dynamic scale in/out) with this kind of extension can be a very good
> >> improvement.
> >>
> >> So my question is: will streams, in the long/short run, extend its
> >> use cases to massive and efficient stream processing (and compete with
> >> Spark), or stay and strengthen its current position?
> >>
> >> Cheers,
> >> Adrienne
> >>
> >>
> >
>



-- 
-- Guozhang
