[ 
https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743479#comment-16743479
 ] 

Boyang Chen commented on KAFKA-7820:
------------------------------------

Thanks [~vinubarro] for the additional details. I think we need to understand the 
use case further before we decide whether a new API is needed. 
 # About repartitioning: since KStream does aggregation at the partition level, 
records with the same key must hash to the same partition. My question is: how 
many unique keys do we have here (the combination of all 15-20 fields)? If the 
total number of keys is not that big, it should be fine.
 # We don't need multiple KTables to solve the problem. If you are referring to 
a single streaming instance, we could just derive a common aggregation key and 
do the counts. At Pinterest, we are building a generic API on top of 
#aggregate() that extracts the necessary fields to generate a superset join key 
from our Thrift structure.

In conclusion, with proper repartitioning applied, count() should suffice for 
our use case. If we want field extraction at the framework level, the API would 
need to support multiple data types (JSON, Avro, Thrift). Let me know if this 
answers your question. 
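The approach in point 2 (derive a composite aggregation key from the relevant fields, then count per key) can be sketched in plain Java. This is only an illustration of the logic, not Kafka Streams code; the Event record and its fields are made up, and in a real topology the same effect would come from re-keying the stream on the composite key before count(). The number of distinct composite keys in the resulting table is the distinct count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompositeKeyCount {
    // Hypothetical record standing in for a message with some of the
    // 15-20 fields the reporter mentions.
    record Event(String country, String device, String userId) {}

    // Build the "superset" aggregation key by concatenating the chosen
    // fields, mirroring what a groupBy on a composite key would use.
    static String compositeKey(Event e) {
        return e.country() + "|" + e.device();
    }

    // Count events per composite key; the map's size is the number of
    // distinct field combinations seen so far.
    static Map<String, Long> countByCompositeKey(List<Event> events) {
        Map<String, Long> counts = new HashMap<>();
        for (Event e : events) {
            counts.merge(compositeKey(e), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("US", "ios", "u1"),
                new Event("US", "ios", "u2"),
                new Event("DE", "android", "u3"));
        Map<String, Long> counts = countByCompositeKey(events);
        System.out.println(counts);        // per-key counts
        System.out.println(counts.size()); // distinct combinations
    }
}
```

With proper repartitioning, Kafka Streams keeps one such count per partition, so the per-key state stays bounded by the number of unique key combinations rather than the raw event volume.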

> distinct count kafka streams api
> --------------------------------
>
>                 Key: KAFKA-7820
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7820
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Vinoth Rajasekar
>            Priority: Minor
>              Labels: needs-kip
>
> We are using Kafka Streams for our real-time analytics use cases. Most of our 
> use cases involve doing a distinct count on certain fields.
> Currently we do the distinct count by storing hash values of the data in a 
> set and counting as events flow in. There are a lot of challenges doing this 
> in application memory, because storing the hash values and counting them is 
> limited by the allotted memory size. When we get high volume or a spike in 
> traffic, the hash map of the distinct-count fields grows beyond the allotted 
> memory size, leading to issues.
> The other issue is that when we scale the app, we either need to use global 
> KTables so we get all the values for the distinct count, which adds back 
> pressure in the cluster, or we have to re-partition the topic and count on 
> the key.
> Can we have a feature where distinct count is supported through the Streams 
> API at the framework level, rather than dealing with it at the application 
> level?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)