Re: Consumer that consumes only local partition?

2015-08-04 Thread Hawin Jiang
Hi Robert,

Here is the Kafka benchmark for your reference.
If you add Flink, Storm, Samza, or Spark on top, expect the throughput to be
lower.

821,557 records/sec (78.3 MB/sec)

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
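
To compare that figure with other message sizes, it helps to work out the record size it implies. A minimal sketch, using only the two numbers quoted above; whether "MB" means 2**20 or 10**6 bytes is an assumption, so both are computed:

```python
# Record size implied by the quoted benchmark figure.
RECORDS_PER_SEC = 821_557
MB_PER_SEC = 78.3


def bytes_per_record(mb_per_sec, records_per_sec, bytes_per_mb):
    """Average payload bytes per record for a given throughput."""
    return mb_per_sec * bytes_per_mb / records_per_sec


# Assumption: the unit is not stated, so try both conventions.
binary = bytes_per_record(MB_PER_SEC, RECORDS_PER_SEC, 1024 * 1024)
decimal = bytes_per_record(MB_PER_SEC, RECORDS_PER_SEC, 1000 * 1000)

print(f"{binary:.1f} bytes/record if 1 MB = 2**20 bytes")
print(f"{decimal:.1f} bytes/record if 1 MB = 10**6 bytes")
```

Either way the benchmark records come out at roughly 100 bytes, more than ten times the 8-byte messages discussed in this thread, so the MB/sec figures are not directly comparable.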





Best regards
Hawin





Re: Consumer that consumes only local partition?

2015-08-04 Thread Robert Metzger
Sorry for the very late reply ...

The performance issue was not caused by network latency. I had a job like
this:
FlinkKafkaConsumer --> someSimpleOperation --> FlinkKafkaProducer.

I thought that our FlinkKafkaConsumer was slow, but the actual problem was
that our FlinkKafkaProducer used the old producer API of Kafka. Switching to
the new producer API greatly improved our write performance to Kafka. Flink
had been slowing down the KafkaConsumer because the producer could not keep up.

Since we are already talking about performance, let me ask you the
following question:
I am using Kafka and Flink on an HDP 2.2 cluster (with 40 machines). What
would you consider a good read/write performance for 8-byte messages on the
following setup?
- 40 brokers
- topic with 120 partitions
- 120 reading threads (on 30 machines)
- 120 writing threads (on 30 machines)

I'm getting a write throughput of ~75k elements/core/second and a read
throughput of ~50k elements/core/second.
When I stop the writers, the read throughput goes up to 130k.
I would expect a higher throughput than (8 * 75000) / 1024 = 585.9 KB/sec per
partition. Or are the messages so small that the per-message overhead dominates?
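
The arithmetic above can be checked with a short sketch. It uses only the numbers from this message and assumes one thread per partition (120 threads on 120 partitions), so that the per-core rate equals the per-partition rate. It counts payload bytes only and ignores any protocol or storage overhead.

```python
# Payload throughput implied by the figures in the message above.
MESSAGE_BYTES = 8          # message size
WRITES_PER_SEC = 75_000    # ~75k elements/core/second
READS_PER_SEC = 50_000     # ~50k elements/core/second
PARTITIONS = 120           # assumption: one thread per partition


def payload_bytes_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Payload bytes per second moved by a single thread."""
    return messages_per_sec * message_bytes


write_kib = payload_bytes_per_sec(WRITES_PER_SEC) / 1024
read_kib = payload_bytes_per_sec(READS_PER_SEC) / 1024
total_write_mib = PARTITIONS * payload_bytes_per_sec(WRITES_PER_SEC) / 1024 ** 2

print(f"write: {write_kib:.1f} KiB/s per partition")
print(f"read:  {read_kib:.1f} KiB/s per partition")
print(f"write: {total_write_mib:.1f} MiB/s across {PARTITIONS} partitions")
```

Summed over all partitions, the message rate is 120 * 75,000 = 9 million messages per second, even though the payload volume stays small. For messages this small, messages per second is probably the more telling number than bytes per second.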

Which system out there would you recommend for getting reference
performance numbers? Samza, Spark, Storm?


On Wed, Jul 15, 2015 at 7:20 PM, Gwen Shapira gshap...@cloudera.com wrote:

 This is not something you can do easily with the consumer API
 (consumers have no notion of locality).
 I can imagine using Kafka's low-level API calls to get a list of
 partitions and the lead replica, figuring out which are local and
 using those - but that sounds painful.
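
A minimal sketch of the filtering step described above, assuming the partition-to-leader mapping has already been fetched from the cluster metadata. The metadata lookup itself is left out because it depends on the client library. The names `local_names`, `local_partitions`, and the example broker hosts are hypothetical and not part of any Kafka API.

```python
import socket
from typing import Dict, Iterable, List, Set


def local_names() -> Set[str]:
    """Names under which this machine might be listed as a broker host."""
    return {socket.gethostname().lower(), socket.getfqdn().lower()}


def local_partitions(leaders: Dict[int, str], local_hosts: Iterable[str]) -> List[int]:
    """Return the partitions whose lead replica lives on one of local_hosts.

    leaders maps partition id -> host name of the lead replica.
    """
    hosts = {h.lower() for h in local_hosts}
    return sorted(p for p, host in leaders.items() if host.lower() in hosts)


# Hypothetical metadata for a three-partition topic.
leaders = {
    0: "broker1.example.com",
    1: "broker2.example.com",
    2: "broker1.example.com",
}
print(local_partitions(leaders, {"broker1.example.com"}))  # [0, 2]
```

On a broker machine one would pass `local_names()` as the second argument. Partition leadership can move between brokers, so the mapping would have to be refreshed periodically, which is part of why this approach is painful.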

 Are you 100% sure the performance issue is due to network latency? If
 not, you may want to start optimizing somewhere more productive :)
 Kafka brokers and clients both have Metrics that may help you track
 where the performance issues are coming from.

 Gwen

 On Wed, Jul 15, 2015 at 9:24 AM, Robert Metzger rmetz...@apache.org
 wrote:
  Hi Shef,
 
  did you resolve this issue?
  I'm facing some performance issues and I was wondering whether reading
  locally would resolve them.
 
  On Mon, Jun 22, 2015 at 11:43 PM, Shef she...@yahoo.com wrote:
 
  Noob question here. I want to have a single consumer for each partition
  that consumes only the messages that have been written locally. In other
  words, I want the consumer to access the local disk and not pull anything
  across the network. Possible?
 
  How can I discover which partitions are local?
 
 
 



Re: Consumer that consumes only local partition?

2015-07-15 Thread Robert Metzger
Hi Shef,

Did you resolve this issue?
I'm facing some performance issues and I was wondering whether reading
locally would resolve them.

On Mon, Jun 22, 2015 at 11:43 PM, Shef she...@yahoo.com wrote:

 Noob question here. I want to have a single consumer for each partition
 that consumes only the messages that have been written locally. In other
 words, I want the consumer to access the local disk and not pull anything
 across the network. Possible?

 How can I discover which partitions are local?