Re: Topic Partitioning Strategy For Large Data

2014-05-25 Thread Drew Goya
A few things I've learned:

1) Don't break things up into separate topics unless the data in them is
truly independent.  Consumer behavior can be extremely variable; don't
assume you will always be consuming as fast as you are producing.

2) Keep time-related messages in the same partition.  Again, consumer
behavior can be (and will be) extremely variable; don't assume the lag on
all your partitions will be similar.  Design a partitioning scheme so that
the owner of one partition can stop consuming for a long period of time and
your application will be minimally impacted (for example, partitioning by
transaction id).
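
The partition-by-key idea above can be sketched as follows. This is an illustrative snippet, not the poster's code or the Kafka client API; the class and method names are assumptions. It mirrors what a hash-based partitioner does: every message with the same key lands in the same partition, so a lagging consumer on another partition never affects that key's ordering.

```java
public class KeyPartitioning {
    // Map a transaction id to a partition by hashing the key.
    // Masking the sign bit keeps the result non-negative even for
    // Integer.MIN_VALUE, where Math.abs would fail.
    static int partitionFor(String transactionId, int numPartitions) {
        return (transactionId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int parts = 8;
        // All messages for "txn-42" land in the same partition, so one
        // slow consumer elsewhere leaves this transaction untouched.
        int p1 = partitionFor("txn-42", parts);
        int p2 = partitionFor("txn-42", parts);
        System.out.println(p1 == p2); // prints true
    }
}
```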


On Fri, May 23, 2014 at 1:12 PM, Joel Koshy jjkosh...@gmail.com wrote:

 Take a look at:

 https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic
 ?

 On Fri, May 23, 2014 at 12:49:39PM -0700, Bhavesh Mistry wrote:
  Hi Kafka Users,
 
 
 
  We are trying to transport 4 TB of data per day on a single topic.  It is
  operational application logs.  How do we estimate the number of partitions
  and the partitioning strategy?  Our goal is to drain (from the consumer
  side) from the Kafka brokers as soon as messages arrive (keep the lag as
  low as possible), and we would also like to uniformly distribute the logs
  across all partitions.
 
 
 
  Here is our Brokers HW Spec:
 
  3-broker cluster (192 GB RAM, 32 cores each, with SSDs to hold 7 days of
  data) with 100G NICs
 
 
 
  Data rate: ~13 GB per minute
 
 
 
 
 
  Is there a formula to compute the optimal number of partitions needed?
  Also, how do we ensure uniform distribution from the producer side?
  (Currently we use counter % numPartitions, which is not a viable solution
  in a prod env.)
 
 
 
  Thanks,
  Bhavesh
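
The FAQ Joel links suggests sizing by throughput: roughly max(t/p, t/c) partitions, where t is the target throughput and p and c are the measured per-partition producer and consumer rates. A back-of-the-envelope sketch for the 13 GB/min figure above follows; the 50 MB/s and 30 MB/s per-partition rates are illustrative assumptions, not measurements, and should be benchmarked on real hardware.

```java
public class PartitionEstimate {
    // FAQ rule of thumb: partitions = max(ceil(t/p), ceil(t/c)),
    // all rates in MB/s.
    static int estimatePartitions(double targetMBps,
                                  double producerMBps,
                                  double consumerMBps) {
        return (int) Math.max(Math.ceil(targetMBps / producerMBps),
                              Math.ceil(targetMBps / consumerMBps));
    }

    public static void main(String[] args) {
        double target = 13.0 * 1024 / 60; // 13 GB/min ≈ 222 MB/s
        // Assumed single-partition rates (hypothetical numbers).
        int partitions = estimatePartitions(target, 50.0, 30.0);
        System.out.println(partitions); // prints 8 (ceil(222/30))
    }
}
```

In practice you would over-provision somewhat, since repartitioning a live keyed topic later is disruptive.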




Re: kafka performance question

2014-05-25 Thread Zhujie (zhujie, Smartcare)
Only one broker and eight partitions, in async mode.

Increasing batch.num.messages is useless.

We split the whole file into 1 KB blocks.
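
The splitting described above can be sketched as follows. This is an illustrative chunker only, assuming the 1 KB block size mentioned in the thread; it is not the poster's actual code. Note that many tiny messages add per-message overhead on the producer path, which is one plausible contributor to the low throughput discussed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class FileChunker {
    static final int CHUNK = 1024; // 1 KB per message, as in the thread

    // Read a stream into fixed-size blocks; each block would become one
    // producer message. The last block may be shorter than CHUNK.
    static List<byte[]> split(InputStream in) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        byte[] buf = new byte[CHUNK];
        int n;
        while ((n = in.read(buf)) > 0) {
            byte[] block = new byte[n];
            System.arraycopy(buf, 0, block, 0, n);
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[2500]; // 2.5 KB of sample data
        List<byte[]> blocks = split(new ByteArrayInputStream(data));
        System.out.println(blocks.size()); // prints 3 (1024 + 1024 + 452)
    }
}
```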

 
-----Original Message-----
From: robairrob...@gmail.com [mailto:robairrob...@gmail.com] On behalf of Robert Turner
Sent: May 16, 2014 13:45
To: users@kafka.apache.org
Subject: Re: kafka performance question

A couple of thoughts spring to mind: are you sending the whole file as one
message, and is your producer code using sync or async mode?

Cheers
   Rob.


On 14 May 2014 15:49, Jun Rao jun...@gmail.com wrote:

 How many brokers and partitions do you have? You may try increasing 
 batch.num.messages.

 Thanks,

 Jun


 On Tue, May 13, 2014 at 5:56 PM, Zhujie (zhujie, Smartcare)  
 first.zhu...@huawei.com wrote:

  Dear all,
 
  We want to use Kafka to collect and dispatch data files, but the
  performance may be lower than we want.

  In our cluster, there is a producer and a broker. We use one thread to
  read a file from the local disk of the producer and send it to the broker.
  The average throughput is only 3 MB/s~4 MB/s.
  But if we just use the Java NIO API to send the file, the throughput can
  exceed 200 MB/s.
  Why is the Kafka performance so bad in our test; are we missing
 something?
 
 
 
  Our server:
  CPU: Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz * 4; Mem: 300 GB;
  Disk: 600 GB 15K RPM SAS * 8
 
  Configuration of the producer:
  props.put("serializer.class", "kafka.serializer.NullEncoder");
  props.put("metadata.broker.list", "169.10.35.57:9092");
  props.put("request.required.acks", "0");
  props.put("producer.type", "async"); // async
  props.put("queue.buffering.max.ms", "500");
  props.put("queue.buffering.max.messages", "10");
  props.put("batch.num.messages", "1200");
  props.put("queue.enqueue.timeout.ms", "-1");
  props.put("send.buffer.bytes", "10240");
 
  Configuration of broker:
 
  # Licensed to the Apache Software Foundation (ASF) under one or more
  # contributor license agreements.  See the NOTICE file distributed with
  # this work for additional information regarding copyright ownership.
  # The ASF licenses this file to You under the Apache License, Version 2.0
  # (the "License"); you may not use this file except in compliance with
  # the License.  You may obtain a copy of the License at
  #
  #    http://www.apache.org/licenses/LICENSE-2.0
  #
  # Unless required by applicable law or agreed to in writing, software
  # distributed under the License is distributed on an "AS IS" BASIS,
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
  # see kafka.server.KafkaConfig for additional details and defaults
 
  ############################# Server Basics #############################
 
  # The id of the broker. This must be set to a unique integer for 
  each broker.
  broker.id=0
 
  ######################## Socket Server Settings ########################
 
  # The port the socket server listens on
  port=9092
 
  # Hostname the broker will bind to. If not set, the server will bind to
  # all interfaces
  #host.name=localhost
 
  # Hostname the broker will advertise to producers and consumers. If not
  # set, it uses the value for "host.name" if configured.  Otherwise, it
  # will use the value returned from
  # java.net.InetAddress.getCanonicalHostName().
  #advertised.host.name=<hostname routable by clients>

  # The port to publish to ZooKeeper for clients to use. If this is not set,
  # it will publish the same port that the broker binds to.
  #advertised.port=<port accessible by clients>
 
  # The number of threads handling network requests
  #num.network.threads=2
  # The number of threads doing disk I/O
  #num.io.threads=8
 
  # The send buffer (SO_SNDBUF) used by the socket server
  #socket.send.buffer.bytes=1048576
 
  # The receive buffer (SO_RCVBUF) used by the socket server
  #socket.receive.buffer.bytes=1048576
 
  # The maximum size of a request that the socket server will accept 
  (protection against OOM)
  #socket.request.max.bytes=104857600
 
 
  ############################# Log Basics #############################
 
  # A comma separated list of directories under which to store log files
  log.dirs=/data/kafka-logs
 
  # The default number of log partitions per topic. More partitions allow
  # greater parallelism for consumption, but this will also result in more
  # files across the brokers.
  #num.partitions=2
 
  ########################### Log Flush Policy ###########################
 
  # Messages are immediately written to the filesystem but by default we
  # only fsync() to sync the OS cache lazily. The following configurations
  # control the flush of data to disk.
  # There are a few important trade-offs here:
  #    1. Durability: Unflushed data may be lost if you are not using
  #       replication.
  #    2. Latency: Very large flush intervals may lead to latency spikes
  #       when the flush does occur as there will be a lot