Thanks! For the counts I'd need to use a global table to make sure it's counting across all the data, right? Also, will having millions of different values per grouped attribute scale OK?
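For concreteness, here's roughly what I'm picturing for the counting piece. It's only a sketch: Impression is the Avro-generated class, getCountry() stands in for whichever attribute we group on, the topic and store names are made up, serdes are assumed to come from the streams config, and it needs Kafka Streams 1.1+. My understanding is that groupBy() repartitions by the new key, so each per-key count already covers all the data without a global table:

import java.util.concurrent.TimeUnit;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.state.WindowStore;

public class DailyCountTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // One Impression per inbound request.
        KStream<String, Impression> impressions = builder.stream("impressions");

        // Re-key by the attribute value and count per 1-day window. groupBy()
        // repartitions by the new key, so every record with the same attribute
        // value lands on the same task and the count is complete across all
        // input partitions.
        KTable<Windowed<String>, Long> dailyCounts = impressions
                .groupBy((requestId, imp) -> imp.getCountry())
                .windowedBy(TimeWindows.of(TimeUnit.DAYS.toMillis(1)))
                .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as(
                        "daily-attribute-counts"));

        // Optionally stream the running counts to a topic for a sink
        // connector (DB, S3, etc.).
        dailyCounts.toStream()
                .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
                .to("daily-attribute-counts-out",
                        Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}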
On Mar 4, 2018 8:45 AM, "Thakrar, Jayesh" <jthak...@conversantmedia.com> wrote:

> Yes, that's the general design pattern. Another thing to look into is
> compressing the data. The Kafka consumer/producer can already do it for
> you, but we chose to compress in the applications due to a historic issue
> that degraded performance, although it has since been resolved.
>
> Also, just keep in mind that while you do your batching, the Kafka
> producer also tries to batch msgs to Kafka, and you will need to ensure
> you have enough buffer memory. However, that's all configurable.
>
> Finally, ensure you have the latest Java updates and Kafka 0.10.2 or
> higher.
>
> Jayesh
>
> ------------------------------
> *From:* Matt Daum <m...@setfive.com>
> *Sent:* Sunday, March 4, 2018 7:06:19 AM
> *To:* Thakrar, Jayesh
> *Cc:* users@kafka.apache.org
> *Subject:* Re: Kafka Setup for Daily counts on wide array of keys
>
> We actually don't have a Kafka cluster set up yet at all; right now we
> just have 8 of our application servers. We currently sample some
> impressions and then dedupe/count them at a different DC, but we are
> looking to analyze all impressions for some overall analytics.
>
> Our requests are around 100-200 bytes each. If we lost some of them due
> to network jitter etc. it would be fine; we're just trying to get a rough
> overall count for each attribute. Creating batched messages definitely
> makes sense and will also cut down on the network IO.
>
> We're trying to determine the required setup for Kafka to do what we're
> looking to do, as these are physical servers so we'll most likely need to
> buy new hardware. For the first run I think we'll try it out on one of
> our application clusters that gets a smaller amount of traffic (300-400k
> req/sec) and run the Kafka cluster on the same machines as the
> applications.
>
> So would the best route here be something like: each application server
> batches requests and sends them to Kafka; a stream consumer then tallies
> up the totals per attribute that we want to track and outputs them to a
> new topic, which then goes to a sink (either a DB or something like S3)
> that we then read into our external DBs?
>
> Thanks!
>
> On Sun, Mar 4, 2018 at 12:31 AM, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
>> Matt,
>>
>> If I understand correctly, you have an 8 node Kafka cluster and need to
>> support about 1 million requests/sec into the cluster from source
>> servers, and expect to consume that for aggregation.
>>
>> How big are your msgs?
>>
>> I would suggest looking into batching multiple requests per single Kafka
>> msg to achieve the desired throughput.
>>
>> So e.g. on the request-receiving systems, I would suggest creating a
>> logical Avro file (byte buffer) of say N requests and then making that
>> into one Kafka msg payload.
>>
>> We have a similar situation
>> (https://www.slideshare.net/JayeshThakrar/apacheconflumekafka2016) and
>> found anything from 4x to 10x better throughput with batching as
>> compared to one request per msg. We have different kinds of msgs/topics
>> and the individual "request" size varies from about 100 bytes to 1+ KB.
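A rough sketch of the request batching described above, packing N requests into one Avro container in memory and sending it as a single Kafka message. Impression is assumed to be the Avro-generated class; the topic name is made up, and the compression, batching, and buffer-memory settings mentioned earlier in the thread are shown with placeholder values, not tuned recommendations:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class BatchingSender {

    private final Producer<byte[], byte[]> producer;

    public BatchingSender(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Producer-side compression, batching and buffer sizing; these
        // numbers are placeholders to tune, not recommendations.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128L * 1024 * 1024);
        this.producer = new KafkaProducer<>(props);
    }

    /** Packs one application-side batch of impressions into a single Kafka msg payload. */
    public void sendBatch(List<Impression> batch) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<Impression> writer =
                     new DataFileWriter<>(new SpecificDatumWriter<>(Impression.class))) {
            writer.create(Impression.getClassSchema(), out);
            for (Impression impression : batch) {
                writer.append(impression);
            }
        }
        // One Kafka message containing the N batched requests.
        producer.send(new ProducerRecord<>("impressions-batched", out.toByteArray()));
    }
}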
>> On 3/2/18, 8:24 AM, "Matt Daum" <m...@setfive.com> wrote:
>>
>> I am new to Kafka, but I think I have a good use case for it. I am
>> trying to build daily counts of requests based on a number of different
>> attributes in a high-throughput system (~1 million requests/sec across
>> all 8 servers). The different attributes are unbounded in terms of
>> values, and some will spread across 100's of millions of values. This is
>> my current thought process; let me know where I could be more efficient
>> or if there is a better way to do it.
>>
>> I'll create an Avro object "Impression" which has all the attributes of
>> the inbound request. My application servers will then, on each request,
>> create one and send it to a single Kafka topic.
>>
>> I'll then have a consumer which creates a stream from the topic. From
>> there I'll use the windowed timeframes and groupBy to group by the
>> attributes on each given day. At the end of the day I'd need to read the
>> data store out to an external system for storage. Since I won't know all
>> the values, I'd need something similar to KVStore.all() but for windowed
>> KV stores. It appears that this will be possible in 1.1 with this commit:
>> https://github.com/apache/kafka/commit/1d1c8575961bf6bce7decb049be7f10ca76bd0c5
>>
>> Is this the best approach? Or would I be better off using the stream to
>> listen and then an external DB like Aerospike to store the counts,
>> reading out of it directly at the end of the day?
>>
>> Thanks for the help!
>> Daum
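For completeness, a sketch of the end-of-day read-out of the windowed store, using the window-store scan that the commit linked above appears to add in 1.1. It assumes a running KafkaStreams instance and the "daily-attribute-counts" store name from the sketch near the top of the thread:

import java.util.concurrent.TimeUnit;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;

public class DailyExport {

    /** Iterates every attribute value counted in the last day so it can be
     *  written out to an external store (DB, S3, etc.). */
    public static void exportLastDay(KafkaStreams streams) {
        ReadOnlyWindowStore<String, Long> store =
                streams.store("daily-attribute-counts", QueryableStoreTypes.windowStore());

        long to = System.currentTimeMillis();
        long from = to - TimeUnit.DAYS.toMillis(1);

        // fetchAll() iterates every key and window in the time range, which
        // plays the role KVStore.all() does for a plain key-value store.
        try (KeyValueIterator<Windowed<String>, Long> it = store.fetchAll(from, to)) {
            while (it.hasNext()) {
                KeyValue<Windowed<String>, Long> entry = it.next();
                String attributeValue = entry.key.key();
                long count = entry.value;
                // Replace with the actual write to the external system.
                System.out.printf("%s=%d%n", attributeValue, count);
            }
        }
    }
}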