Hi Guozhang and Kohki,

Thanks for your replies.

I think I know how to deal with partitioning now, but I am still not sure how 
to handle the traffic between the hidden state stores and the Kafka brokers 
(the same concern as Kohki's).

I feel like the easiest thing to do is to set a larger commit interval, so 
that the state stores are synced to the brokers less frequently than by default.
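
A minimal sketch of what I mean, assuming the standard Streams configuration key `commit.interval.ms`; the application id and broker address are placeholders:

```java
import java.util.Properties;

public class StreamsCommitConfig {
    // Build a Streams config with a larger commit interval (the application
    // id and broker address below are placeholders).
    public static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "my-aggregation-app");
        props.put("bootstrap.servers", "broker1:9092");
        // Commit (and hence flush state-store changelogs) every 5 minutes
        // instead of the 30-second default.
        props.put("commit.interval.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("commit.interval.ms"));
    }
}
```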

I do have a Spark cluster, but I am not convinced that Spark Streaming would 
handle this differently. Guozhang, could you comment on Kafka Streams vs. 
Spark Streaming, especially in terms of how aggregations/groupBys/joins are 
implemented?

Also, is it possible to stop the syncing of state stores to the brokers 
altogether, if I am fine with losing state on failures?
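
For context, a hedged sketch of what I have in mind, using `Materialized.withLoggingDisabled()` from newer Streams versions (the 0.10.x store builder had a similar `disableLogging()`); the topic and store names are made up:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class NoChangelogExample {
    // A count whose backing state store has changelogging disabled, so no
    // store updates are synced back to the brokers. Trade-off: a failed
    // instance cannot restore its state from a changelog topic.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store")
                      .withLoggingDisabled());
        return builder.build();
    }
}
```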

Thanks
Tianji


On 2017-02-26 23:52 (-0500), Guozhang Wang <wangg...@gmail.com> wrote: 
> Hello Tianji,
> 
> As Kohki mentioned, in Streams joins and aggregations are always done
> pre-partitioned, and hence locally. So there won't be any inter-node
> communication needed to execute the joins / aggregations. Also, they can be
> hosted as persistent local state stores, so you don't need to keep them in
> memory. So, for example, if you partition your data with K1 / K2, then data
> with the same values in the combo (K1, K2) will always go to the same
> partition, and hence this is good for aggregations / joins on either K1, K2,
> or combo(K1, K2), but not sufficient for combo(K1, K2, K3, K4), as data with
> the same values of K3 / K4 might still go to different partitions
> processed by different Streams instances.
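
A hedged sketch of the re-keying step described above, so that all records sharing the combo (K1, K2) land in the same partition before a local aggregation; the topic name, `Event` type, and its getters are placeholders, and `repartition()` is from newer Streams versions (older ones used `selectKey(...).through("topic")`):

```java
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class CoPartitionExample {
    public static void rekey(StreamsBuilder builder, Serde<Event> eventSerde) {
        // Re-key every record by the composite (K1, K2) and force a
        // repartition so co-partitioning holds for downstream operations.
        KStream<String, Event> byCombo = builder
            .<String, Event>stream("raw-events")
            .selectKey((key, v) -> v.getK1() + "|" + v.getK2())
            .repartition(Repartitioned.with(Serdes.String(), eventSerde));

        // Counting per (K1, K2) is now local to each partition / task.
        byCombo.groupByKey().count();
    }
}
```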
> 
> So what you really want is to partition based on the "maximum superset" of
> all the involved keys. Note that with the superset of all the keys, one
> thing to watch out for is the even distribution of the partitions. If it is
> not evenly distributed, then some instances might become hot spots. This can
> be tackled by customizing the "PartitionGrouper" interface in Streams, which
> indicates which set of partitions will be assigned to each of the tasks (by
> default each partition from the source topics will form a task, and
> a task is the unit of parallelism in Streams).
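
A hypothetical sketch of a custom PartitionGrouper (the 0.10.x-era interface, plugged in via the "partition.grouper" config and removed in later Kafka versions) that pairs two source partitions into each task to even out skewed load; the pairing rule is illustrative only:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.PartitionGrouper;
import org.apache.kafka.streams.processor.TaskId;

public class PairingPartitionGrouper implements PartitionGrouper {
    @Override
    public Map<TaskId, Set<TopicPartition>> partitionGroups(
            Map<Integer, Set<String>> topicGroups, Cluster metadata) {
        Map<TaskId, Set<TopicPartition>> tasks = new HashMap<>();
        for (Map.Entry<Integer, Set<String>> group : topicGroups.entrySet()) {
            for (String topic : group.getValue()) {
                Integer partitions = metadata.partitionCountForTopic(topic);
                if (partitions == null) continue;
                for (int p = 0; p < partitions; p++) {
                    // Partitions 2k and 2k+1 are processed by the same task,
                    // halving the task count versus the default grouping.
                    TaskId id = new TaskId(group.getKey(), p / 2);
                    tasks.computeIfAbsent(id, k -> new HashSet<>())
                         .add(new TopicPartition(topic, p));
                }
            }
        }
        return tasks;
    }
}
```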
> 
> Hope this helps.
> 
> Guozhang
> 
> 
> On Sun, Feb 26, 2017 at 10:57 AM, Kohki Nishio <tarop...@gmail.com> wrote:
> 
> > Tianji,
> > KStream is indeed the Append mode as long as I do stateless processing,
> > but when you do an aggregation, that is a stateful operation, and it turns
> > into a KTable, which follows the Update mode.
> >
> > In regard to your aggregation, I believe Kafka's aggregation works within
> > a single partition, not across multiple partitions. Are you doing 100
> > different aggregations against the record key? Then you should have a
> > single data object for those 100 values. Anyway, it sounds like we have a
> > similar problem ..
> >
> > -Kohki
> >
> >
> >
> >
> >
> > On Sat, Feb 25, 2017 at 1:11 PM, Tianji Li <skyah...@gmail.com> wrote:
> >
> > > Hi Kohki,
> > >
> > > Thanks very much for providing your investigation results. Regarding
> > > 'append' mode with Kafka Streams, isn't KStream the thing you want?
> > >
> > > Hi Guozhang,
> > >
> > > Thanks for the pointers to the two blogs. I read one of them before and
> > > just had a look at the other one.
> > >
> > > What I am hoping to do is described below; can you help me decide if
> > > Kafka Streams is a good fit?
> > >
> > > We have a few data sources, and we are hoping to correlate these sources,
> > > and then do aggregations, as *a stream in real-time*.
> > >
> > > The number of aggregations is around 100, which means that, if using
> > > Kafka Streams, we need to maintain around 100 state stores with 100
> > > change-log topics behind the scenes when joining and aggregating.
> > >
> > > The number of unique entries in each of these state stores is expected
> > > to be on the level of < 100M. The size of each record is around 1K
> > > bytes, so each store is expected to be ~100G bytes in size. The total
> > > number of bytes across all these state stores is thus around 10T bytes.
> > >
> > > If keeping all these stores in memory, this translates into around 50
> > > machines with 256 Gbytes of memory for this purpose alone.
> > >
> > > Plus, the incoming raw data rate could reach 10M records per second
> > > during peak hours. So, during aggregation, data movement between Kafka
> > > Streams instances will be heavy, i.e., 10M records per second within the
> > > cluster for joining and aggregations.
> > >
> > > Is Kafka Streams good for this? My gut feeling is Kafka Streams is fine.
> > > But I'd like to run this by you.
> > >
> > > And, I am hoping to minimize data movement (to save bandwidth) during
> > > joins/groupBys. If I partition the raw data with the minimum subset of
> > > aggregation keys (say K1 and K2), then I wonder whether the subsequent
> > > joins/groupBys (say on keys K1, K2, K3, K4) happen on local data, if
> > > using the DSL?
> > >
> > > Thanks
> > > Tianji
> > >
> > >
> > > On 2017-02-25 13:49 (-0500), Guozhang Wang <w...@gmail.com> wrote:
> > > > Hello Kohki,
> > > >
> > > > Thanks for the email. I'd like to learn what your concern about the
> > > > size of the state store is. From your description it's a bit hard to
> > > > figure out, but I'd guess you have lots of state stores while each of
> > > > them is relatively small?
> > > >
> > > > Hello Tianji,
> > > >
> > > > Regarding your question about the maturity and users of Streams, you
> > > > can take a look at a bunch of the blog posts written about Streams
> > > > usage in production, for example:
> > > >
> > > > http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/
> > > >
> > > > http://developers.linecorp.com/blog/?p=3960
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Sat, Feb 25, 2017 at 7:52 AM, Kohki Nishio <ta...@gmail.com> wrote:
> > > >
> > > > > I did a bit of research on that matter recently; the comparison is
> > > > > between Spark Structured Streaming (SSS) and Kafka Streams.
> > > > >
> > > > > Both are relatively new (~1y) and trying to solve similar problems.
> > > > > However, if you go with Spark, you have to go with a cluster; if
> > > > > your environment already has a cluster, then that's good. Our team
> > > > > doesn't do any Spark, though, so the initial cost would be very
> > > > > high. On the other hand, Kafka Streams is a Java library; since we
> > > > > have a service framework, doing stream processing inside a service
> > > > > is super easy.
> > > > >
> > > > > However, for some reason people see SSS as more mature and Kafka
> > > > > Streams as less mature (like beta). But the old-fashioned stream
> > > > > abstraction is mature enough in both (in my opinion); I didn't see
> > > > > any difference between DStream (Spark) and KStream (Kafka).
> > > > >
> > > > > DataFrame (Structured Streaming) and KTable I found quite different.
> > > > > Kafka's model is more like a change log, which means you need to see
> > > > > the latest entry to make a final decision. I would call this an
> > > > > 'Update' model, whereas Spark does an 'Append' model and doesn't
> > > > > support the 'Update' model yet (it's coming in 2.2).
> > > > >
> > > > > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
> > > > >
> > > > > I wanted to have the 'Append' model with Kafka, but it seems that's
> > > > > not an easy thing to do. Also, Kafka Streams uses an internal topic
> > > > > to keep state changes for the fail-over scenario, but I'm dealing
> > > > > with lots of tiny pieces of information and I have a big concern
> > > > > about the size of the state store / topic, so my decision is to go
> > > > > with my own handling of the Kafka API.
> > > > >
> > > > > If you do stateless operations and don't have a Spark cluster, yeah,
> > > > > Kafka Streams is perfect.
> > > > > If you do stateful, complicated operations and happen to have a
> > > > > Spark cluster, give Spark a try.
> > > > > Else, you have to write code which is optimized for your use case.
> > > > >
> > > > >
> > > > > thanks
> > > > > -Kohki
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 24, 2017 at 6:22 PM, Tianji Li <sk...@gmail.com> wrote:
> > > > >
> > > > > > Hi there,
> > > > > >
> > > > > > Can anyone give a good explanation of in what cases Kafka Streams
> > > > > > is preferred, and in what cases Spark Streaming is better?
> > > > > >
> > > > > > Thanks
> > > > > > Tianji
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kohki Nishio
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> >
> > --
> > Kohki Nishio
> >
> 
> 
> 
> -- 
> -- Guozhang
> 
