Matt,

If I understand correctly, you have an 8-node Kafka cluster, need to support 
about 1 million requests/sec into the cluster from your source servers, and 
expect to consume that stream for aggregation.

How big are your msgs?

I would suggest looking into batching multiple requests into a single Kafka 
msg to achieve the desired throughput.

So e.g. on the request-receiving systems, I would suggest creating a logical 
Avro file (byte buffer) of, say, N requests and then making that the payload 
of a single Kafka msg.
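
A rough sketch of what I mean (this assumes GenericRecord-based requests and 
a byte[] value serializer on the producer; batchToAvroBytes and the 
"impressions" topic are placeholders):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // Pack N buffered requests into one in-memory Avro container "file"
    static byte[] batchToAvroBytes(List<GenericRecord> batch, Schema schema)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);  // writes the header + embedded schema
            for (GenericRecord rec : batch) {
                writer.append(rec);
            }
        }  // close() flushes the final block
        return out.toByteArray();
    }

    // One producer send per batch of N instead of one per request:
    // producer.send(new ProducerRecord<>("impressions", batchToAvroBytes(batch, schema)));

The consumer can then unwrap each payload with Avro's DataFileStream over the 
message bytes, and since the container format embeds the schema, each batch 
is self-describing.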

We have a similar situation 
(https://www.slideshare.net/JayeshThakrar/apacheconflumekafka2016) and saw 
anywhere from 4x to 10x better throughput with batching as compared to one 
request per msg.
We have different kinds of msgs/topics, and the individual "request" size 
varies from about 100 bytes to 1+ KB.

On 3/2/18, 8:24 AM, "Matt Daum" <m...@setfive.com> wrote:

    I am new to Kafka but I think I have a good use case for it.  I am trying
    to build daily counts of requests based on a number of different attributes
    in a high-throughput system (~1 million requests/sec across all 8
    servers).  The different attributes are unbounded in terms of values, and
    some will spread across hundreds of millions of values.  This is my current
    thought process; let me know where I could be more efficient or if there is
    a better way to do it.
    
    I'll create an Avro object "Impression" which has all the attributes of the
    inbound request.  On each request, my application servers will then create
    one of these and send it to a single Kafka topic.
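    
    For illustration, something like this is what I mean (a sketch only; it
    assumes a code-generated Avro "Impression" class, and the topic name,
    broker address, and buildFromRequest helper are placeholders):
    
        import java.io.ByteArrayOutputStream;
        import java.util.Properties;
        import org.apache.avro.io.BinaryEncoder;
        import org.apache.avro.io.EncoderFactory;
        import org.apache.avro.specific.SpecificDatumWriter;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
    
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
    
        // Per inbound request: serialize the Impression to Avro bytes and send
        Impression impression = buildFromRequest(request);  // hypothetical helper
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<>(Impression.class).write(impression, enc);
        enc.flush();
        producer.send(new ProducerRecord<>("impressions", out.toByteArray()));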
    
    I'll then have a consumer that creates a stream from the topic.  From
    there I'll use windowed timeframes and groupBy to group by the
    attributes on each given day.  At the end of the day I'd need to read
    the data store out to an external system for storage.  Since I won't
    know all the values, I'd need something similar to KeyValueStore.all()
    but for windowed KV stores.  It appears this will be possible in 1.1
    with this commit:
    https://github.com/apache/kafka/commit/1d1c8575961bf6bce7decb049be7f10ca76bd0c5
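    
    To make that concrete, here's a sketch of what I'm picturing (the
    attribute extractor, store name, and serde setup are assumptions on my
    part, and all() on window stores needs that 1.1 change):
    
        import java.util.Properties;
        import java.util.concurrent.TimeUnit;
        import org.apache.kafka.common.utils.Bytes;
        import org.apache.kafka.streams.KafkaStreams;
        import org.apache.kafka.streams.KeyValue;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.Materialized;
        import org.apache.kafka.streams.kstream.TimeWindows;
        import org.apache.kafka.streams.kstream.Windowed;
        import org.apache.kafka.streams.state.KeyValueIterator;
        import org.apache.kafka.streams.state.QueryableStoreTypes;
        import org.apache.kafka.streams.state.ReadOnlyWindowStore;
        import org.apache.kafka.streams.state.WindowStore;
    
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Impression> impressions = builder.stream("impressions");
        impressions
            .groupBy((key, imp) -> imp.getAttributeKey())  // hypothetical extractor
            .windowedBy(TimeWindows.of(TimeUnit.DAYS.toMillis(1)))
            .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as(
                "daily-counts"));
    
        Properties config = new Properties();  // application.id, bootstrap.servers, serdes
        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();
    
        // At end of day, scan every key in the windowed store (needs 1.1's all()):
        ReadOnlyWindowStore<String, Long> store =
            streams.store("daily-counts", QueryableStoreTypes.windowStore());
        try (KeyValueIterator<Windowed<String>, Long> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<Windowed<String>, Long> entry = it.next();
                // export entry.key.key(), entry.key.window().start(), entry.value
            }
        }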
    
    Is this the best approach?  Or would I be better off using the stream just
    to listen, with an external DB like Aerospike storing the counts, and then
    reading out of it directly at end of day?
    
    Thanks for the help!
    Daum
    
