Re: Logstash to Kafka

2015-02-05 Thread Vineet Mishra
Yury,

Thanks for sharing the insight into Kafka partition distribution.

I am more concerned about the throughput that Kafka and Storm can deliver
together for event processing.

Currently I have a roughly 30 GB file containing around 0.2 billion events,
and this volume will soon grow to 100 times the current size.

I was wondering whether the above-mentioned stream processing engines would
be a good fit in my case, and if so, with what configuration and tuning I
could use resources effectively and maximize throughput.
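
(As a rough point of reference, a minimal storm-kafka wiring is sketched
below. The ZooKeeper address, topic name, bolt class, and parallelism figures
are placeholders rather than a tuned configuration; spout parallelism is
commonly set to the topic's partition count, and the fetch size is one of the
first knobs to raise for throughput.)

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class LogProcessingTopology {

        // Placeholder bolt standing in for the real per-event processing.
        public static class LogBolt extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                // Parsing / enrichment / indexing of one event would go here.
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
            }
        }

        public static void main(String[] args) throws Exception {
            // ZooKeeper address, topic, and consumer id are placeholders.
            BrokerHosts hosts = new ZkHosts("zkhost:2181");
            SpoutConfig spoutConfig =
                    new SpoutConfig(hosts, "logstash-events", "/kafka-spout", "log-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
            spoutConfig.fetchSizeBytes = 1048576;   // larger fetches usually help throughput

            TopologyBuilder builder = new TopologyBuilder();
            // A common starting point: one spout executor per topic partition.
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 8);
            builder.setBolt("process", new LogBolt(), 16).shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(4);                  // spread executors across worker JVMs
            StormSubmitter.submitTopology("log-processing", conf, builder.createTopology());
        }
    }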

Thanks!
On Feb 3, 2015 8:38 PM, Yury Ruchin yuri.ruc...@gmail.com wrote:

 This is a quote from Kafka documentation:
 The routing decision is influenced by the kafka.producer.Partitioner.

 interface Partitioner<T> {
    int partition(T key, int numPartitions);
 }
 The partition API uses the key and the number of available broker
 partitions to return a partition id. This id is used as an index into a
 sorted list of broker_ids and partitions to pick a broker partition for the
 producer request. The default partitioning strategy is
 hash(key)%numPartitions. If the key is null, then a random broker partition
 is picked. A custom partitioning strategy can also be plugged in using the
 partitioner.class config parameter.

 An important point for the null key is that the randomly chosen broker
 partition sticks for the time specified by
 topic.metadata.refresh.interval.ms, which is 10 minutes by default. So if
 you are using a null key for Logstash entries, you will be writing to the
 same partition for 10 minutes. Is this your case?

 2015-02-03 14:03 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

  Hi,
 
  I have a setup where I tail some (rather large) logs with Logstash
  Forwarder and forward them to Logstash, which in turn publishes these
  events to Kafka.
 
  I have created the Kafka topic with the required number of partitions and
  replication factor, but I am not sure about the Logstash output
  configuration, and I have the following doubts about it.
 
  For Logstash publishing events to Kafka:
 
  1) Do we need to explicitly define the partition in Logstash while
  publishing to Kafka?
  2) Will Kafka take care of distributing the data properly across the
  partitions?
 
  I have the impression that, despite declaring the partitions when creating
  the Kafka topic, the data from Logstash is being pushed to a single
  partition, or at least is not being uniformly distributed.
 
  Looking for expert advice.
 
  Thanks!
 



Re: Logstash to Kafka

2015-02-05 Thread Otis Gospodnetic
Hi,

In short, I don't see Kafka having problems with those numbers.  Logstash
will have a harder time, I believe.
That said, it all depends on how you tune things and on what kind of / how
much hardware you use.

2B or 200B events are big numbers, yes, but how quickly do you need to
process them? In 1 minute, 1 hour, 1 day, or a week? :)
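
(Back-of-envelope, using the figures from the quoted message rather than any
measurement: 0.2 billion events in ~30 GB is roughly 150 bytes per event, so
100x that volume is about 20 billion events, or around 3 TB. Draining that in
a day means roughly 230,000 events/s (~35 MB/s sustained); in an hour, roughly
5.6 million events/s (~830 MB/s). The answer to the "how quickly" question
changes the hardware and tuning picture completely.)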

SPM for Kafka (http://sematext.com/spm) will show you all possible Kafka
metrics you can imagine, so if you decide to give Kafka a try you'll be
able to tune Kafka with the help of SPM for Kafka charts and the help of
people on this mailing list.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 5, 2015 at 2:12 PM, Vineet Mishra clearmido...@gmail.com
wrote:

 Yury,

 Thanks for sharing the insight into Kafka partition distribution.

 I am more concerned about the throughput that Kafka and Storm can deliver
 together for event processing.

 Currently I have a roughly 30 GB file containing around 0.2 billion events,
 and this volume will soon grow to 100 times the current size.

 I was wondering whether the above-mentioned stream processing engines would
 be a good fit in my case, and if so, with what configuration and tuning I
 could use resources effectively and maximize throughput.

 Thanks!
 On Feb 3, 2015 8:38 PM, Yury Ruchin yuri.ruc...@gmail.com wrote:

  This is a quote from Kafka documentation:
  The routing decision is influenced by the kafka.producer.Partitioner.
 
  interface Partitioner<T> {
     int partition(T key, int numPartitions);
  }
  The partition API uses the key and the number of available broker
  partitions to return a partition id. This id is used as an index into a
  sorted list of broker_ids and partitions to pick a broker partition for
  the producer request. The default partitioning strategy is
  hash(key)%numPartitions. If the key is null, then a random broker
  partition is picked. A custom partitioning strategy can also be plugged
  in using the partitioner.class config parameter.
 
  An important point for the null key is that the randomly chosen broker
  partition sticks for the time specified by
  topic.metadata.refresh.interval.ms, which is 10 minutes by default. So if
  you are using a null key for Logstash entries, you will be writing to the
  same partition for 10 minutes. Is this your case?
 
  2015-02-03 14:03 GMT+03:00 Vineet Mishra clearmido...@gmail.com:
 
   Hi,
  
   I have a setup where I tail some (rather large) logs with Logstash
   Forwarder and forward them to Logstash, which in turn publishes these
   events to Kafka.
  
   I have created the Kafka topic with the required number of partitions
   and replication factor, but I am not sure about the Logstash output
   configuration, and I have the following doubts about it.
  
   For Logstash publishing events to Kafka:
  
   1) Do we need to explicitly define the partition in Logstash while
   publishing to Kafka?
   2) Will Kafka take care of distributing the data properly across the
   partitions?
  
   I have the impression that, despite declaring the partitions when
   creating the Kafka topic, the data from Logstash is being pushed to a
   single partition, or at least is not being uniformly distributed.
  
   Looking for expert advice.
  
   Thanks!
  
 



Re: Logstash to Kafka

2015-02-03 Thread Yury Ruchin
This is a quote from Kafka documentation:
The routing decision is influenced by the kafka.producer.Partitioner.

interface Partitioner<T> {
   int partition(T key, int numPartitions);
}
The partition API uses the key and the number of available broker
partitions to return a partition id. This id is used as an index into a
sorted list of broker_ids and partitions to pick a broker partition for the
producer request. The default partitioning strategy is
hash(key)%numPartitions. If the key is null, then a random broker partition
is picked. A custom partitioning strategy can also be plugged in using the
partitioner.class config parameter.
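
For illustration only, a custom partitioner in the spirit of the quoted
interface might look like the sketch below, enabled with
partitioner.class=com.example.HostPartitioner (the class name and keying
scheme are made up; note also that the exact Partitioner signature differs
between 0.8 releases, and some of them construct the partitioner reflectively
with a VerifiableProperties argument, which is assumed here).

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;

    // Hypothetical example: key events by source host so each host's events
    // stay in one partition while the hosts spread across all partitions.
    public class HostPartitioner implements Partitioner {

        // Some 0.8 releases instantiate the partitioner reflectively and pass
        // the producer's config; keep a matching constructor for that case.
        public HostPartitioner(VerifiableProperties props) {
        }

        public int partition(Object key, int numPartitions) {
            // Mask the sign bit so a negative hashCode() never yields a
            // negative partition index.
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }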

An important point for the null key is that the randomly chosen broker
partition sticks for the time specified by
topic.metadata.refresh.interval.ms, which is 10 minutes by default. So if
you are using a null key for Logstash entries, you will be writing to the
same partition for 10 minutes. Is this your case?
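
One way to avoid that sticky behaviour is simply to send keyed messages, so
the default hash(key)%numPartitions strategy spreads the data. A rough sketch
against the 0.8 Java producer follows; the broker address, topic name, and
key are placeholders.

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedLogProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");   // placeholder broker
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // A custom strategy could be plugged in instead of the default:
            // props.put("partitioner.class", "com.example.HostPartitioner");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));

            // A non-null key (here an invented source host) lets the default
            // hash(key)%numPartitions strategy spread events across partitions.
            producer.send(new KeyedMessage<String, String>(
                    "logstash-events", "web-host-42", "{\"message\":\"...\"}"));

            producer.close();
        }
    }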

2015-02-03 14:03 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

 Hi,

 I have a setup where I tail some (rather large) logs with Logstash
 Forwarder and forward them to Logstash, which in turn publishes these
 events to Kafka.

 I have created the Kafka topic with the required number of partitions and
 replication factor, but I am not sure about the Logstash output
 configuration, and I have the following doubts about it.

 For Logstash publishing events to Kafka:

 1) Do we need to explicitly define the partition in Logstash while
 publishing to Kafka?
 2) Will Kafka take care of distributing the data properly across the
 partitions?

 I have the impression that, despite declaring the partitions when creating
 the Kafka topic, the data from Logstash is being pushed to a single
 partition, or at least is not being uniformly distributed.

 Looking for expert advice.

 Thanks!