Re: Logstash to Kafka
Yury,

Thanks for sharing the insight into Kafka's partition distribution. My main concern now is the throughput that Kafka and Storm can deliver together for event processing. I currently have a ~30 GB file holding around 0.2 billion events, and that volume will soon grow to 100 times its current size. Will the stream-processing stack mentioned above be a good fit for my case? If yes, what configuration and tuning would make effective use of resources and maximize throughput?

Thanks!

On Feb 3, 2015 8:38 PM, Yury Ruchin yuri.ruc...@gmail.com wrote: [...]
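For scale, the figures above work out as follows. The one-day processing window here is an assumption for illustration only; 30 GB over 0.2 billion events gives an average event size of roughly 160 bytes.

    // Back-of-envelope throughput estimate for the numbers quoted above.
    public class ThroughputEstimate {
        public static void main(String[] args) {
            long events = 200_000_000L * 100;              // 0.2 billion events, scaled 100x
            long bytes = 30L * 1024 * 1024 * 1024 * 100;   // 30 GB, scaled 100x
            long windowSeconds = 24 * 3600;                // assumed one-day window
            System.out.printf("~%d events/sec%n", events / windowSeconds);
            System.out.printf("~%.1f MB/sec%n", bytes / (double) windowSeconds / (1024 * 1024));
        }
    }

That comes to roughly 230,000 events/sec and ~36 MB/sec sustained, which is the kind of write load Kafka clusters routinely handle; as the next message notes, the harder question is the Logstash side.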
Re: Logstash to Kafka
Hi,

In short, I don't see Kafka having problems with those numbers. Logstash will have a harder time, I believe. That said, it all depends on how you tune things and what kind of / how much hardware you use. 0.2B or 20B events are big numbers, yes, but how quickly do you need to process them: in 1 minute, 1 hour, 1 day, or a week? :)

SPM for Kafka (http://sematext.com/spm) will show you all possible Kafka metrics you can imagine, so if you decide to give Kafka a try you'll be able to tune it with the help of the SPM for Kafka charts and the people on this mailing list.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On Thu, Feb 5, 2015 at 2:12 PM, Vineet Mishra clearmido...@gmail.com wrote: [...]
Re: Logstash to Kafka
This is a quote from the Kafka documentation:

The routing decision is influenced by the kafka.producer.Partitioner:

    interface Partitioner<T> {
        int partition(T key, int numPartitions);
    }

The partition API uses the key and the number of available broker partitions to return a partition id. This id is used as an index into a sorted list of broker_ids and partitions to pick a broker partition for the producer request. The default partitioning strategy is hash(key) % numPartitions. If the key is null, then a random broker partition is picked. A custom partitioning strategy can also be plugged in using the partitioner.class config parameter (see the sketch after the quoted message below).

An important point for the null key is that the randomly chosen broker partition sticks for the time specified by topic.metadata.refresh.interval.ms, which is 10 minutes by default. So if you are using a null key for Logstash entries, you will be writing to the same partition for 10 minutes at a time. Is this your case?

2015-02-03 14:03 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

Hi,

I have a setup where I sniff some logs (the big ones, of course) through Logstash Forwarder and forward them to Logstash, which in turn publishes these events to Kafka. I have created the Kafka topic with the required number of partitions and replication factor, but I am not sure about the Logstash output configuration, and I have the following doubts about it.

For Logstash publishing events to Kafka:
1) Do we need to explicitly define the partition in Logstash while publishing to Kafka?
2) Will Kafka take care of properly distributing the data across the partitions?

My impression is that, despite the partitions declared when the topic was created, the data from Logstash is being pushed to a single partition, or at least is not getting uniformly distributed.

Looking for expert advice. Thanks!
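As an illustration of the pluggable strategy mentioned above, a custom partitioner written against the quoted interface could look like the sketch below. The class name and the hashing choice are illustrative assumptions; the grounded parts are the interface signature and the hash(key) % numPartitions default it mimics.

    import kafka.producer.Partitioner; // the interface quoted in the docs above

    // Hypothetical partitioner, enabled via the producer config parameter, e.g.
    //   partitioner.class=com.example.StringKeyPartitioner
    public class StringKeyPartitioner implements Partitioner<String> {
        @Override
        public int partition(String key, int numPartitions) {
            if (key == null) {
                // The stock producer would instead pick a random partition here
                // and stick with it until the next metadata refresh (10 min default).
                return 0;
            }
            // Mask the sign bit so the modulo always yields a valid index;
            // otherwise a negative hashCode() would produce a negative partition.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

With a partitioner like this, giving each Logstash event a non-null key (for example the source host) spreads writes across partitions deterministically instead of parking on one partition per metadata-refresh interval.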