[ https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744663#comment-16744663 ]
Guozhang Wang commented on KAFKA-7820:
--------------------------------------

[~vinubarro] Thanks for sharing your use case. I think proposal 2) from [~bchen225242] may well fit your needs. To be more specific: say you have 10-20 fields that require distinct counts. You can create a repartition key that is a combo of all of these fields and send it through a single repartition topic. For example, if the fields you are interested in are A, B, and C, and you create a combo key (A,B,C), the semantics of a co-partition key is that "all the records with the same values in A,B,C will go to the same partition", which implies "all the records with the same values of A will go to the same partition" (same for B, C). So after you've done the repartitioning, to distinctly count on field A, say, you can aggregate on B/C and count on A, aggregate on A/C to count on B, etc.

> distinct count kafka streams api
> --------------------------------
>
>                 Key: KAFKA-7820
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7820
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Vinoth Rajasekar
>            Priority: Minor
>              Labels: needs-kip
>
> We are using Kafka Streams for our real-time analytics use cases. Most of our
> use cases involve doing a distinct count on certain fields.
> Currently we do the distinct count by storing the hash map value of the data in a
> set and counting as events flow in. There are a lot of challenges doing this
> using application memory, because storing the hash map values and counting them
> is limited by the allotted memory size. When we get high volume or a spike in
> traffic, the hash map of the distinct count fields grows beyond the allotted memory
> size, leading to issues.
> Another issue is that when we scale the app, we need to use global KTables so we
> get all the values for doing the distinct count, and this adds back pressure in
> the cluster, or we have to re-partition the topic and do the count on the key.
> Can we have a feature where distinct count is supported through the Streams
> API at the framework level, rather than dealing with it at the application level?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
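
For illustration only, here is a minimal plain-Java sketch of the two approaches the issue description mentions: keeping every value in one in-memory set, and repartitioning on the counted field so that each partition holds a disjoint subset of values. It does not use the Kafka Streams API. The class and method names are hypothetical, and `String.hashCode()` is only a stand-in for a real partitioner.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Kafka Streams.
final class DistinctCountSketch {

    // Single-instance approach from the issue: every distinct value is kept
    // in one set, so memory grows with the number of distinct values.
    static int countInOneSet(List<String> values) {
        Set<String> seen = new HashSet<>(values);
        return seen.size();
    }

    // Repartition-style approach: route each value to a partition by its hash
    // (assumed stand-in for a partitioner), so equal values always land in the
    // same partition. The per-partition sets are then disjoint, and their
    // sizes can simply be summed.
    static int countByPartition(List<String> values, int numPartitions) {
        List<Set<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashSet<>());
        }
        for (String v : values) {
            int p = Math.floorMod(v.hashCode(), numPartitions);
            partitions.get(p).add(v);
        }
        int total = 0;
        for (Set<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }
}
```

The per-partition sum equals the global distinct count only because the partition is chosen from the counted value itself. Each partition still holds its share of values in memory, so this spreads the state across instances rather than bounding it.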