Hi,

In short, I don't see Kafka having problems with those numbers. Logstash will have a harder time, I believe. That said, it all depends on how you tune things and on what kind of / how much hardware you use.
2B or 200B events, yes, big numbers, but how quickly do you need to process them? In 1 minute, 1 hour, 1 day, or a week? :)

SPM for Kafka (http://sematext.com/spm) will show you all possible Kafka metrics you can imagine, so if you decide to give Kafka a try, you'll be able to tune it with the help of the SPM for Kafka charts and the people on this mailing list.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 5, 2015 at 2:12 PM, Vineet Mishra <clearmido...@gmail.com> wrote:

> Yury,
>
> Thanks for sharing the insight into Kafka partition distribution.
>
> I am more concerned about the throughput that Kafka and Storm can
> deliver together for event processing.
>
> Currently I have a roughly 30 GB file with around 0.2 billion events,
> and this number is soon going to rise to 100 times the existing figure.
>
> I was wondering whether the above-mentioned stream processing engines
> would be a good fit in my case. If yes, then with what configuration and
> tuning, so as to use resources effectively and maximize throughput?
>
> Thanks!
>
> On Feb 3, 2015 8:38 PM, "Yury Ruchin" <yuri.ruc...@gmail.com> wrote:
>
> > This is a quote from the Kafka documentation:
> >
> > "The routing decision is influenced by the kafka.producer.Partitioner.
> >
> > interface Partitioner<T> {
> >     int partition(T key, int numPartitions);
> > }
> >
> > The partition API uses the key and the number of available broker
> > partitions to return a partition id. This id is used as an index into
> > a sorted list of broker_ids and partitions to pick a broker partition
> > for the producer request. The default partitioning strategy is
> > hash(key)%numPartitions. If the key is null, then a random broker
> > partition is picked. A custom partitioning strategy can also be
> > plugged in using the partitioner.class config parameter."
> >
> > An important point for the null key is that the randomly chosen broker
> > partition sticks for the time specified by
> > "topic.metadata.refresh.interval.ms", which is 10 minutes by default.
> > So if you are using a null key for Logstash entries, you will be
> > writing to the same partition for 10 minutes. Is this your case?
> >
> > 2015-02-03 14:03 GMT+03:00 Vineet Mishra <clearmido...@gmail.com>:
> >
> > > Hi,
> > >
> > > I have a setup where I am sniffing some logs (of course, the big
> > > ones) through Logstash Forwarder and forwarding them to Logstash,
> > > which in turn publishes these events to Kafka.
> > >
> > > I have created the Kafka topic with the required number of
> > > partitions and replication factor, but I am not sure about the
> > > Logstash output configuration. I have the following doubts about it.
> > >
> > > For Logstash publishing events to Kafka:
> > >
> > > 1) Do we need to explicitly define the partition in Logstash while
> > > publishing to Kafka?
> > > 2) Will Kafka take care of proper distribution of the data across
> > > the partitions?
> > >
> > > I have a notion that, despite declaring the partitions while
> > > creating the Kafka topic, the data from Logstash is being pushed to
> > > a single partition, or perhaps is not getting uniformly distributed.
> > >
> > > Looking for expert advice.
> > >
> > > Thanks!
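
To make the partitioning discussion above concrete: below is a minimal
sketch of a custom partitioner implementing the Partitioner<T> interface
exactly as quoted from the docs. The class name is made up for
illustration, and note that in 0.8 the interface differs slightly (it is
non-generic and requires a VerifiableProperties constructor argument), so
treat this as a sketch, not a drop-in class:

    import kafka.producer.Partitioner;

    // Hypothetical custom partitioner matching the quoted interface.
    // It implements the same idea as the default strategy,
    // hash(key) % numPartitions, but masks the sign bit instead of
    // using Math.abs(), since Math.abs(Integer.MIN_VALUE) is still
    // negative and would yield an invalid partition index.
    public class EventKeyPartitioner implements Partitioner<String> {
        @Override
        public int partition(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }

It gets plugged in through the partitioner.class config parameter
mentioned in the quote.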
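
On Yury's point about null keys sticking to one random partition for
topic.metadata.refresh.interval.ms: here is a hedged sketch of the
producer side, using the old 0.8-era Scala producer API. It shows two ways
out of the single-partition trap: send with a non-null key so the
partitioner is actually used, and/or shorten the refresh interval so a new
random partition is picked more often. The broker address, topic name,
key, and payload are placeholders, not from the thread:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092"); // placeholder
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Custom partitioning strategy, per the quoted docs.
            props.put("partitioner.class", "com.example.EventKeyPartitioner");
            // Null-key producers stick to one random partition for this
            // long; the default is 600000 ms (10 minutes), as Yury notes.
            props.put("topic.metadata.refresh.interval.ms", "60000");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            // A non-null key routes the event through the partitioner,
            // spreading load across all partitions of the topic.
            producer.send(new KeyedMessage<String, String>(
                "logstash-events", "host-42", "{\"msg\":\"example\"}"));
            producer.close();
        }
    }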
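
And for completeness, since the original question mentions creating the
topic with a given number of partitions and a replication factor, this is
what that looks like with the stock kafka-topics.sh tool shipped with
0.8.x (the counts here are arbitrary examples, not a recommendation):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --topic logstash-events --partitions 8 --replication-factor 2

Whatever the partition count, events only spread across all partitions if
the producer's key/partitioner actually distributes them, which is the
crux of this thread.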