[ https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743375#comment-16743375 ]
Vinoth Rajasekar commented on KAFKA-7820: ----------------------------------------- Hey Boyang, That's a good idea and we considered that option too. And that comes with a trade-off allocating more space in the cluster for distinct count itself with re-partitioning. we have large schema with around 15-20 fields qualify for distinct counts based on the use cases for making near-real time decisions. re-partitioning for those many fields we have to trade-off on the space. this will be challenging at enterprise level, since many teams share the same cluster. There are other ways where we can perform distinct count for a field using statetore or ktables. In this scenario we have too maintain too many tables at the application level. This is a nice to have feature for apps that uses Kafka streams for analytical use cases making real-time decisions. since many other streaming frameworks supports this feature, we thought it would be a very useful feature in the streams for many use cases, if this is something that can be handled at framework level . Would implementing this feature at the framework level going to be a heavy-lift? > distinct count kafka streams api > -------------------------------- > > Key: KAFKA-7820 > URL: https://issues.apache.org/jira/browse/KAFKA-7820 > Project: Kafka > Issue Type: New Feature > Components: streams > Reporter: Vinoth Rajasekar > Priority: Minor > Labels: needs-kip > > we are using Kafka streams for our real-time analytic use cases. most of our > use cases involved with doing distinct count on certain fields. > currently we do distinct count by storing the hash map value of the data in a > set and do a count as event flows in. There are lot of challenges doing this > using application memory, because storing the hashmap value and counting them > is limited by the allotted memory size. When we get high volume or spike in > traffic hash map of the distinct count fields grows beyond allotted memory > size leading to issues. > other issue is when we scale the app, we need to use global ktables so we > get all the values for doing distinct count and this adds back pressure in > the cluster or we have to re-partition the topic and do count on the key. > Can we have feature, where the distinct count is supported by through streams > api at the framework level, rather than dealing it with application level. -- This message was sent by Atlassian JIRA (v7.6.3#76005)