Hi Guozhang and Kohki,

Thanks for your replies.
I think I know how to deal with partitioning now, but I am still not sure how
to deal with the traffic between the hidden state stores and the Kafka
brokers (the same concern as Kohki's). I feel like the easiest thing to do is
to set a larger commit interval, so that the state stores are synced to the
brokers less often than by default. I do have a Spark cluster, but I am not
convinced that Spark Streaming would handle this any differently. Guozhang,
could you comment on Kafka Streams vs. Spark Streaming, especially in terms
of how aggregations/groupBys/joins are implemented? Also, is it possible to
stop the syncing of state stores to brokers altogether, if I am fine with
losing state on failure?

Thanks
Tianji

On 2017-02-26 23:52 (-0500), Guozhang Wang <wangg...@gmail.com> wrote:
> Hello Tianji,
>
> As Kohki mentioned, in Streams joins and aggregations are always done
> pre-partitioned, and hence locally. So there won't be any inter-node
> communication needed to execute the joins / aggregations. They can also be
> hosted as persistent local state stores, so you don't need to keep them in
> memory. For example, if you partition your data by K1 / K2, then data with
> the same values in combo (K1, K2) will always go to the same partition,
> which is good for aggregations / joins on either K1, K2, or combo(K1, K2),
> but not sufficient for combo(K1, K2, K3, K4), as data with the same values
> of K3 / K4 might still go to different partitions processed by different
> Streams instances.
>
> So what you want is really to partition based on the "maximum superset" of
> all the involved keys. Note that with the superset of all the keys, one
> thing to watch out for is the even distribution of the partitions. If it
> is not evenly distributed, then some instances might become hot spots.
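[The locality argument above can be sketched in plain Java. Kafka's default
partitioner hashes the serialized record key (murmur2); `hashCode` stands in
for it here, and all names are illustrative, not from the thread:]

```java
// Sketch: why pre-partitioning on a composite key keeps aggregations local.
// The partition is chosen from the record key alone, so any field that is
// not part of the key (K3, K4, ...) never influences placement.
public class CopartitionSketch {
    // Records sharing (k1, k2) always land on the same partition.
    static int partitionFor(String k1, String k2, int numPartitions) {
        String compositeKey = k1 + "|" + k2;
        // Mask to non-negative, mimicking toPositive(hash) % numPartitions.
        return (compositeKey.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-1", "device-A", 8);
        int p2 = partitionFor("user-1", "device-A", 8); // same (K1, K2)
        int p3 = partitionFor("user-1", "device-B", 8); // K2 differs
        System.out.println(p1 + " " + p2 + " " + p3);
        // p1 == p2 always; p3 may or may not collide with p1.
        // An aggregation keyed on (K1, K2, K3) is NOT guaranteed local:
        // K3 played no role in the partition choice above.
    }
}
```

[In the Streams DSL the same effect comes from `selectKey` / `groupBy` on
the superset key, which triggers a repartition topic before the aggregation.]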
> This can be tackled by customizing the "PartitionGrouper" interface in
> Streams, which indicates which set of partitions will be assigned to each
> of the tasks (by default each partition from the source topics forms a
> task, and a task is the unit of parallelism in Streams).
>
> Hope this helps.
>
> Guozhang
>
> On Sun, Feb 26, 2017 at 10:57 AM, Kohki Nishio <tarop...@gmail.com> wrote:
>
> > Tianji,
> >
> > KStream is indeed Append mode as long as I do stateless processing, but
> > when you do aggregation, that is a stateful operation, and it turns into
> > a KTable, which does Update mode.
> >
> > In regard to your aggregation, I believe Kafka's aggregation works on a
> > single partition, not over multiple partitions. Are you doing 100
> > different aggregations against the record key? Then you should have a
> > single data object for those 100 values; anyway, it sounds like we have
> > a similar problem ..
> >
> > -Kohki
> >
> > On Sat, Feb 25, 2017 at 1:11 PM, Tianji Li <skyah...@gmail.com> wrote:
> >
> > > Hi Kohki,
> > >
> > > Thanks very much for providing your investigation results. Regarding
> > > 'append' mode with Kafka Streams, isn't KStream the thing you want?
> > >
> > > Hi Guozhang,
> > >
> > > Thanks for the pointers to the two blogs. I read one of them before
> > > and just had a look at the other one.
> > >
> > > What I am hoping to do is below; can you help me decide if Kafka
> > > Streams is a good fit?
> > >
> > > We have a few data sources, and we are hoping to correlate these
> > > sources and then do aggregations, as *a stream in real time*.
> > >
> > > The number of aggregations is around 100, which means, if using Kafka
> > > Streams, we need to maintain around 100 state stores with 100
> > > change-log topics behind the scenes when joining and aggregating.
> > >
> > > The number of unique entries in each of these state stores is
> > > expected to be at the level of < 100M.
> > > The size of each record is around 1K bytes, so each state store is
> > > expected to be ~100G bytes in size. The total number of bytes in all
> > > these state stores is thus around 10T bytes.
> > >
> > > If keeping all these stores in memory, this translates into around 50
> > > machines with 256G bytes each for this purpose alone.
> > >
> > > Plus, the incoming raw data rate could reach 10M records per second
> > > in peak hours. So, during aggregation, data movement between Kafka
> > > Streams instances will be heavy, i.e., 10M records per second in the
> > > cluster for joining and aggregating.
> > >
> > > Is Kafka Streams good for this? My gut feeling is that Kafka Streams
> > > is fine, but I'd like to run this by you.
> > >
> > > Also, I am hoping to minimize data movement (to save bandwidth)
> > > during joins/groupBys. If I partition the raw data with the minimum
> > > subset of aggregation keys (say K1 and K2), then I wonder if the
> > > following joins/groupBys (say on keys K1, K2, K3, K4) happen on local
> > > data, if using the DSL?
> > >
> > > Thanks
> > > Tianji
> > >
> > > On 2017-02-25 13:49 (-0500), Guozhang Wang <w...@gmail.com> wrote:
> > > > Hello Kohki,
> > > >
> > > > Thanks for the email. I'd like to learn what your concern about
> > > > the size of the state store is?
> > > > From your description it's a bit hard to figure out, but I'd guess
> > > > you have lots of state stores while each of them is relatively
> > > > small?
> > > >
> > > > Hello Tianji,
> > > >
> > > > Regarding your question about the maturity and users of Streams,
> > > > you can take a look at a bunch of the blog posts written about
> > > > their Streams usage in production, for example:
> > > >
> > > > http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/
> > > >
> > > > http://developers.linecorp.com/blog/?p=3960
> > > >
> > > > Guozhang
> > > >
> > > > On Sat, Feb 25, 2017 at 7:52 AM, Kohki Nishio <ta...@gmail.com>
> > > > wrote:
> > > >
> > > > > I did a bit of research on that matter recently; the comparison
> > > > > is between Spark Structured Streaming (SSS) and Kafka Streams.
> > > > >
> > > > > Both are relatively new (~1y) and trying to solve similar
> > > > > problems. However, if you go with Spark, you have to go with a
> > > > > cluster; if your environment already has a cluster, then that's
> > > > > good. Our team doesn't do any Spark, though, so the initial cost
> > > > > would be very high. On the other hand, Kafka Streams is a Java
> > > > > library; since we have a service framework, doing streaming
> > > > > inside a service is super easy.
> > > > >
> > > > > However, for some reason, people see SSS as more mature and Kafka
> > > > > Streams as not so mature (like beta). But old-fashioned streaming
> > > > > is mature enough in both (in my opinion); I didn't see any
> > > > > difference between DStream (Spark) and KStream (Kafka).
> > > > >
> > > > > DataFrame (Structured Streaming) and KTable I found quite
> > > > > different. Kafka's model is more like a change log; that means
> > > > > you need to see the latest entry to make a final decision.
> > > > > I would call this the 'Update' model, whereas Spark does the
> > > > > 'Append' model and doesn't support the 'Update' model yet (it's
> > > > > coming in 2.2).
> > > > >
> > > > > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
> > > > >
> > > > > I wanted to have the 'Append' model with Kafka, but it seems it's
> > > > > not an easy thing to do. Also, Kafka Streams uses an internal
> > > > > topic to keep state changes for fail-over scenarios, but I'm
> > > > > dealing with lots of tiny pieces of information and I have a big
> > > > > concern about the size of the state store / topic, so my decision
> > > > > is that I'm going with my own handling of the Kafka API ..
> > > > >
> > > > > If you do stateless operations and don't have a Spark cluster,
> > > > > yeah, Kafka Streams is perfect.
> > > > > If you do stateful, complicated operations and happen to have a
> > > > > Spark cluster, give Spark a try.
> > > > > Else, you have to write code that is optimized for your use case.
> > > > >
> > > > > thanks
> > > > > -Kohki
> > > > >
> > > > > On Fri, Feb 24, 2017 at 6:22 PM, Tianji Li <sk...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi there,
> > > > > >
> > > > > > Can anyone give a good explanation of in what cases Kafka
> > > > > > Streams is preferred, and in what cases Spark Streaming is
> > > > > > better?
> > > > > >
> > > > > > Thanks
> > > > > > Tianji
> > > > >
> > > > > --
> > > > > Kohki Nishio
> > > >
> > > > --
> > > > -- Guozhang
> >
> > --
> > Kohki Nishio
>
> --
> -- Guozhang
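[For reference, the "larger commit window" Tianji asks about at the top of
the thread maps onto two real Kafka Streams config names,
`commit.interval.ms` and `cache.max.bytes.buffering`; the values below are
illustrative, not recommendations. Plain string keys are used here so the
sketch needs no Kafka dependency:]

```java
import java.util.Properties;

public class LowTrafficConfig {
    public static Properties sketch() {
        Properties p = new Properties();
        // Commit (and flush changelog writes) less often than the 30s
        // default, batching broker traffic into fewer, larger sends.
        p.put("commit.interval.ms", "300000"); // 5 minutes
        // A larger record cache dedupes repeated updates to the same key
        // before they ever reach the changelog topic.
        p.put("cache.max.bytes.buffering", String.valueOf(512L * 1024 * 1024));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(sketch());
    }
}
```

[As for stopping the state-store sync entirely: changelogging can be
disabled per store in the DSL (e.g. `withLoggingDisabled()` on a
materialized store), at the cost of losing that store's contents if the
instance fails.]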