[ 
https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743375#comment-16743375
 ] 

Vinoth Rajasekar commented on KAFKA-7820:
-----------------------------------------

Hey Boyang,

That's a good idea and we considered that option too. And that comes with a 
trade-off allocating more space in the cluster for distinct count itself with 
re-partitioning. we have large schema with around 15-20 fields qualify for 
distinct counts based on the use cases for making near-real time decisions. 
re-partitioning for those many fields we have to trade-off on the space. this 
will be challenging at enterprise level, since many teams share the same 
cluster.

There are other ways where we can perform distinct count for a field using 
statetore or ktables. In this scenario we have too maintain too many tables at 
the application level. 

This is a nice to have feature for apps that uses Kafka streams for analytical 
use cases making real-time decisions. since many other streaming frameworks 
supports this feature, we thought it would be a very useful feature in the 
streams for many use cases, if this is something that can be handled at 
framework level .

Would implementing this feature at the framework level going to be a heavy-lift?

 

> distinct count kafka streams api
> --------------------------------
>
>                 Key: KAFKA-7820
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7820
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Vinoth Rajasekar
>            Priority: Minor
>              Labels: needs-kip
>
> we are using Kafka streams for our real-time analytic use cases. most of our 
> use cases involved with doing distinct count on certain fields.
> currently we do distinct count by storing the hash map value of the data in a 
> set and do a count as event flows in. There are lot of challenges doing this 
> using application memory, because storing the hashmap value and counting them 
> is limited by the allotted memory size. When we get high volume  or spike in 
> traffic hash map of the distinct count fields grows beyond allotted memory 
> size leading to issues.
> other issue is when  we scale the app, we need to use global ktables so we 
> get all the values for doing distinct count and this adds back pressure in 
> the cluster or we have to re-partition the topic and do count on the key.
> Can we have feature, where the distinct count is supported by through streams 
> api at the framework level, rather than dealing it with application level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to