KStreams was certainly incubated based on early Samza learnings. Although the two share a common architectural base, they started evolving independently a year ago.
In the last year there have been significant improvements in Samza to make stateful stream processing more reliable and production ready. The most significant ones are described in this article: https://engineering.linkedin.com/blog/2016/01/whats-new-samza

To summarize some of these improvements: Samza stores the partition mapping durably. In addition, Samza integrates deeply with YARN to ensure that YARN doesn't move containers around if the jobs are stateful. This allows state to be sticky and hence avoids reseeding state even when the client application is upgraded, stopped, restarted, etc. Samza also makes sure that a minimal amount of state gets moved when you increase or decrease the number of containers in your job. These improvements have been critical in keeping many production jobs (especially those with large state) stable.

One of the other key differences between Samza and KStreams is that although Samza has first-class support for Kafka, it fundamentally supports input from and output to non-Kafka sources. For example, at LinkedIn we have a Samza job which reads from Kinesis and DynamoDB streams directly. In many situations this can significantly reduce the additional Kafka hardware cost and the operational cost of running a separate bridging service that moves the data from the external source into Kafka.

We are also prototyping running Samza jobs in our Hadoop grids and having them read directly from HDFS and produce to HDFS. If this is successful, we will allow for running batch jobs in Samza (this will support some scenarios where customers want to experiment in Hadoop before moving to near-real-time processing using Kafka).

Having said the above, the additional flexibility and support for sticky stateful apps in Samza come at the cost of additional things to configure. More work will happen in Samza in the future to make the config simpler.
On the question of futures, Samza will continue to evolve independently and we have a long list of stream processing features and ease of use improvements that we hope to contribute to Samza in the coming year.

Hope that helps.

Thanks
Kartik

On Mon, May 23, 2016 at 10:28 AM, Sriram Ramachandrasekaran <[email protected]> wrote:

> Hello Yi, et all,
>
> I've been following Samza and Kafka (and, Kafka Streams). Given the state
> where Kafka Streams is, it provides a nice high level API for consuming
> stuff from Kafka + support for localized state. If thrown into an
> environment like mesos(fronted by marathon), we should get distribution out
> of the box.
>
> I wanted to hear from you, what this means to Samza's roadmap and what you
> guys are thinking about it. As I understand, a lot of Samza's learning have
> gone into Kafka Streams, which means, it should be more polished out of the
> box. Please share your thoughts.
>
> --
> It's just about how deep your longing is!

--
We are hiring in Streams Infra (Kafka/Samza/Datastream) !!
