article: hands-on kafka: dynamic DNS
Hi list! I just wanted to mention a small article I put together describing an approach to leveraging log compaction when you have compound types and messages are operations on that compound type, with an example use case: http://spootnik.org/entries/2015/04/23_hands-on-kafka-dynamic-dns.html Always eager to hear your feedback and about other approaches. Cheers, - pyr
Atomic write of message batch to single partition
Hello, I'm using Kafka 0.8.2.1 (in a Scala/Java project) and trying to find out how to atomically write n messages (a message batch) to a single topic partition. Is there any client API that gives such a guarantee? I couldn't find a clear answer reading the documentation, the API docs (of the old and new producer) and the mailing list archives - sorry if I missed something obvious.

With the new org.apache.kafka.clients.producer.KafkaProducer, the send method takes only a single ProducerRecord parameter, and record accumulation and batch-sending of records is an implementation detail (which can be controlled to some extent by the batch.size configuration setting etc. but not by user-defined message batches). So it seems that the new producer cannot be used for that. Are there any plans to support this in future versions?

Only the old kafka.producer.Producer allows me to pass a user-defined KeyedMessage batch to its send method. What are the semantics of this method when producer.type=sync and all KeyedMessages for a given send call are targeted at the same topic partition? Are these messages written atomically to the partition?

Are there other options to achieve atomic writes of user-defined message batches (apart from making the batch a single Kafka message)? Thanks for any hints! Regards, Martin
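Absent broker-side support, the workaround mentioned at the end - making the whole batch a single Kafka message - comes down to framing on the producer side and unframing on the consumer side. A minimal sketch in plain Java (class and method names are mine, no Kafka API involved):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Length-prefix framing: N messages become one payload, which can then be
// sent as a single Kafka message and is therefore written atomically.
class BatchCodec {
    static byte[] pack(List<byte[]> messages) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        try {
            data.writeInt(messages.size());          // message count
            for (byte[] m : messages) {
                data.writeInt(m.length);             // length prefix
                data.write(m);                       // message bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);       // cannot happen in-memory
        }
        return out.toByteArray();
    }

    static List<byte[]> unpack(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int n = buf.getInt();
        List<byte[]> messages = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            byte[] m = new byte[buf.getInt()];
            buf.get(m);
            messages.add(m);
        }
        return messages;
    }
}
```

The obvious costs: consumers must know about the framing, and the combined payload has to stay under the broker's maximum message size.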
[ANN] Apache Cloudstack 4.5 kafka-event-bus plugin
Hi list, I thought I'd also mention that the next release of Apache Cloudstack adds the ability to publish all events happening throughout the environment to Kafka. Events are published as JSON. http://cloudstack-administration.readthedocs.org/en/latest/events.html#kafka-configuration Cheers, - pyr
Re: Why fetching meta-data for topic is done three times?
Hi All, Having gone through the code, I found that when the producer starts it does three things: 1. Sends a metadata request 2. Sends the message to the broker (fetching the broker list) 3. If the number of messages to be produced is greater than 0, tries again to refresh metadata for the outstanding produce requests. Each of these requests takes the configured timeout before moving on to the next, and only once all are done does it throw an Exception (if step 3 also fails).

The problem is: if we set the timeout to 1 sec, it takes 3 sec to throw an exception, so the user request hangs for 3 sec, which is comparatively high for a response time; and if all threads are blocked in producer send, the whole application is blocked for 3 sec. We want to reduce this to 1 sec overall before the Exception is thrown. What is the possible way to do this? Thanks Madhukar

On Thu, Apr 16, 2015 at 8:10 PM, Madhukar Bharti bhartimadhu...@gmail.com wrote: Hi All, I came across a problem: if we use a broker IP which is not reachable or out of the network, it takes more than the configured time (request.timeout.ms). After checking the log I found that it tries to fetch topic metadata three times, changing the correlation id each time. Because of this, even though I keep request.timeout.ms=1000, it takes 3 sec to throw an Exception. I am using Kafka 0.8.1.1 with the patch https://issues.apache.org/jira/secure/attachment/12678547/kafka-1733-add-connectTimeoutMs.patch I have attached the log. Please check and clarify why it behaves like this - whether it is by design or whether some other property has to be set as well. Regards Madhukar
Does log.retention.bytes apply only to partition leader or also replicas
Does the byte retention policy apply to replica partitions, leader partitions, or both? In a multi-node cluster with the brokers configured with different retention policies, it seems obvious that the partitions for which a given broker is the leader will be subject to its byte retention policy, but what about the partitions for which that broker is a replica? Are they subject to the same policy?

For example, take a 2 node cluster with a single topic and two partitions: Broker A is the leader for partition 1 and holds the replica of partition 2. Broker B is the leader for partition 2 and holds the replica of partition 1. Broker A has a byte retention policy of 1 byte; broker B has a byte retention policy of 2 bytes. Will broker A retain 1 byte of both partitions it hosts? Likewise, will broker B retain 2 bytes of both partitions it hosts?
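For what it's worth, my understanding (worth verifying against the docs for your version) is that retention in 0.8.x is enforced locally by each broker's log manager over every log it hosts, whether that broker is leader or follower for the partition - which is exactly why mismatched settings like the example above would let the two replicas of the same partition retain different amounts of data. A sketch of the usual recommendation, keeping the setting identical across brokers:

```properties
# server.properties - applied by this broker to every partition it hosts,
# leader or replica alike, so keep it the same on all brokers.
log.retention.bytes=1073741824
log.retention.hours=168
```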
Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer
Hi, Here are the questions I think we should consider:

1. Do we need this at all, given that we have the partition argument in ProducerRecord which gives full control? I think we do need it because this is a way to plug in a different partitioning strategy at run time and do it in a fairly transparent way. Yes, we need it if we want to support different partitioning strategies inside Kafka rather than requiring the user to code them externally.

3. Do we need to add the value? I suspect people will have uses for computing something off a few fields in the value to choose the partition. This would be useful in cases where the key was being used for log compaction purposes and did not contain the full information for computing the partition. I am not entirely sure about this. I guess that most partitioners should not use it. I think it makes it easier to reason about the system if the partitioner only works on the key. However, if the value (and its serialization) are already available, there is not much harm in passing them along.

4. This interface doesn't include either an init() or close() method. It should implement Closeable and Configurable, right? Right now the only application I can think of for having an init() and close() is to read some state information (e.g., load information) that is published on some external distributed storage (e.g., zookeeper) by the brokers. It might also be useful for reconfiguration and state migration. I think it's not a very common use case right now, but if the added complexity is not too much it might be worth having support for these methods.

5. What happens if the user both sets the partition id in the ProducerRecord and sets a partitioner? Does the partition id just get passed in to the partitioner (as sort of implied in this interface)? This is a bit weird, since if you pass in the partition id you kind of expect it to get used, right?
Or is it the case that if you specify a partition the partitioner isn't used at all (in which case there is no point in including partition in the Partitioner API)? The user should be able to override the partitioner on a per-record basis by explicitly setting the partition id. I don't think it makes sense for the partitioners to take hints on the partition. I would even go the extra step and have a default logic that accepts both key and partition id (current interface) and calls partition() only if the partition id is not set. The partition() method does *not* take the partition ID as input (only key-value). Cheers, -- Gianmarco

Cheers, -Jay

On Thu, Apr 23, 2015 at 6:55 AM, Sriharsha Chintalapani ka...@harsha.io wrote: Hi, Here is the KIP for adding a partitioner interface for producer. https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Partitioner+interface+in+the+new+producer There is one open question about how the interface should look. Please take a look and let me know if you prefer one way or the other. Thanks, Harsha
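To make the per-record override semantics concrete, here is a minimal sketch of the logic Gianmarco proposes (all names are illustrative, not the KIP-22 API): an explicit partition id on the record always wins, and partition() is only consulted when none is set, so partitioners never need to see the partition id.

```java
import java.util.Arrays;

// Hypothetical pluggable partitioner: works only on the key, as proposed.
interface Partitioner {
    int partition(byte[] key, int numPartitions);
}

// A default key-hashing strategy, roughly in the spirit of the old producer.
class KeyHashPartitioner implements Partitioner {
    public int partition(byte[] key, int numPartitions) {
        // mask the sign bit so the result is always in [0, numPartitions)
        return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
    }
}

// The "default logic" from the discussion: explicit partition id wins,
// the partitioner is only a fallback.
class PartitionResolver {
    private final Partitioner partitioner;

    PartitionResolver(Partitioner partitioner) { this.partitioner = partitioner; }

    int resolve(Integer explicitPartition, byte[] key, int numPartitions) {
        if (explicitPartition != null)
            return explicitPartition;                    // per-record override
        return partitioner.partition(key, numPartitions); // pluggable strategy
    }
}
```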
Re: Kafka server - conflicted ephemeral node
Is there a recommended way to handle this issue? Thanks! Mayuresh Gharat gharatmayures...@gmail.com wrote on Wednesday, April 22, 2015: This happens due to a bug in zookeeper; sometimes the znode does not get deleted automatically. We have seen it many times at LinkedIn and are trying to investigate further. Thanks, Mayuresh On Mon, Apr 20, 2015 at 8:52 PM, 小宇 mocking...@gmail.com wrote: Thanks for your response gharatmayuresh1, but I don't know what you mean exactly. I have restarted my server and I want to find out the cause in case it happens again. 2015-04-21 11:36 GMT+08:00 gharatmayures...@gmail.com: Try bouncing 10.144.38.185 This should resolve the issue. Thanks, Mayuresh Sent from my iPhone On Apr 20, 2015, at 8:22 PM, 小宇 mocking...@gmail.com wrote: 10.144.38.185 -- -Regards, Mayuresh R. Gharat (862) 250-7125
New Java Producer: Single Producer vs multiple Producers
We have a 2 node cluster with 100 topics. Should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t. network load/failures, node failures, latency, and locks? Regards, Manikumar
Re: New Java Producer: Single Producer vs multiple Producers
Hi, I ran some tests on our cluster by sending messages from multiple clients (machines). Each machine had about 40-100 threads per producer.

I thought of trying out having multiple producers per client, with each producer receiving messages from say 10-15 threads. I actually did see an increase in throughput in this case. It was not a one-off case but a repeatable phenomenon. I called the threads-to-producer ratio sharingFactor in my code.

I am not planning to use it this way in our clients sending messages to Kafka, but it did go against the suggestion to have a single producer across all threads.

On Fri, Apr 24, 2015 at 10:27 PM, Manikumar Reddy ku...@nmsworks.co.in wrote: Hi Jay, Yes, we are producing from a single process/jvm. From the docs: The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. If I understand correctly, batching happens at topic/partition level, not at Node level. Right? If yes, then both approaches (a single producer for all topics, or a separate producer for each topic) may give similar performance.

On Fri, Apr 24, 2015 at 9:29 PM, Jay Kreps jay.kr...@gmail.com wrote: If you are talking about within a single process, having one producer is generally the fastest because batching dramatically reduces the number of requests (esp using the new java producer). -Jay

On Fri, Apr 24, 2015 at 4:54 AM, Manikumar Reddy manikumar.re...@gmail.com wrote: We have a 2 node cluster with 100 topics. Should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t. network load/failures, node failures, latency, and locks? Regards, Manikumar

-- Thanks Regards, Navneet Gupta
Re: New Java Producer: Single Producer vs multiple Producers
Jay, It's not evident how to switch between sync and async modes using this new 'org.apache.kafka.clients.tools.ProducerPerformance'. AFAICT it measures in async mode by default. -roshan

On 4/24/15 3:23 PM, Jay Kreps jay.kr...@gmail.com wrote: That should work. I recommend using the performance tool cited in the blog linked from the performance page of the website. That tool is more accurate and uses the new producer.

On Fri, Apr 24, 2015 at 2:29 PM, Roshan Naik ros...@hortonworks.com wrote: Can we use the new 0.8.2 producer perf tool against a 0.8.1 broker? -roshan
Re: New Java Producer: Single Producer vs multiple Producers
That should work. I recommend using the performance tool cited in the blog linked from the performance page of the website. That tool is more accurate and uses the new producer. On Fri, Apr 24, 2015 at 2:29 PM, Roshan Naik ros...@hortonworks.com wrote: Can we use the new 0.8.2 producer perf tool against a 0.8.1 broker? -roshan
New and old producers partition messages differently
Hi, I was playing with the new producer in 0.8.2.1 using partition keys (semantic partitioning I believe is the phrase?). I noticed that the default partitioner in 0.8.2.1 does not partition items the same way as the old 0.8.1.1 default partitioner did. For a test item, the old producer was sending it to partition 0, whereas the new producer was sending it to partition 4.

Digging into the code, it appears that the partitioning logic is different between the old and new producers. Both of them hash the key, but they use different hashing algorithms.

Old partitioner (./core/src/main/scala/kafka/producer/DefaultPartitioner.scala):

def partition(key: Any, numPartitions: Int): Int = {
  Utils.abs(key.hashCode) % numPartitions
}

New partitioner (./clients/src/main/java/org/apache/kafka/clients/producer/internals/Partitioner.java):

} else {
  // hash the key to choose a partition
  return Utils.abs(Utils.murmur2(record.key())) % numPartitions;
}

where murmur2 is a custom hashing algorithm. (I'm assuming that murmur2 isn't the same logic as hashCode, especially since hashCode is overridable.)

Was it intentional that the hashing algorithm would change between the old and new producer? If so, was this documented? I don't know if anyone was relying on the old default partitioner, as opposed to going round-robin or using their own custom partitioner. Do you expect it to change in the future? I'm guessing that one of the main reasons to have a custom hashing algorithm is so that you are in full control of the partitioning and can keep it stable (as opposed to being reliant on hashCode()). Thanks, -James
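The two strategies can be put side by side in plain Java. The murmur2 below is a transcription from memory of Kafka's Utils.murmur2 and may not match it bit-for-bit; the point is only that it is a different function from String.hashCode, so the same key can land on a different partition after switching producers.

```java
import java.nio.charset.StandardCharsets;

class PartitionCompare {
    // Old producer's default: abs(hashCode) % numPartitions
    // (Kafka's Utils.abs masks the sign bit rather than using Math.abs).
    static int oldPartition(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // MurmurHash2-style hash, as used by the new producer's default partitioner.
    static int murmur2(byte[] data) {
        int length = data.length;
        final int seed = 0x9747b28c;   // assumed constants - verify against source
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // handle the last few bytes (intentional switch fall-through)
        switch (length % 4) {
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1: h ^= data[length & ~3] & 0xff;
                    h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    static int newPartition(String key, int numPartitions) {
        return (murmur2(key.getBytes(StandardCharsets.UTF_8)) & 0x7fffffff) % numPartitions;
    }
}
```

Both functions are deterministic and stable for a given key, but they generally disagree with each other, which matches the partition 0 vs partition 4 observation above.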
New producer: metadata update problem on 2 Node cluster.
We are testing the new producer on a 2 node cluster. Under some node failure scenarios, the producer is not able to update metadata. Steps to reproduce: 1. Form a 2 node cluster (K1, K2) 2. Create a topic with a single partition, replication factor = 2 3. Start producing data (producer metadata: K1, K2) 4. Kill the leader node (say K1) 5. K2 becomes the leader (producer metadata: K2) 6. Bring back K1 and kill K2 before metadata.max.age.ms 7. K1 becomes the leader (producer metadata still contains: K2) After this point, the producer is not able to update the metadata and continuously tries to connect to the dead node (K2). This looks like a bug to me. Am I missing anything?
Re: New Java Producer: Single Producer vs multiple Producers
If you are talking about within a single process, having one producer is generally the fastest because batching dramatically reduces the number of requests (esp using the new java producer). -Jay On Fri, Apr 24, 2015 at 4:54 AM, Manikumar Reddy manikumar.re...@gmail.com wrote: We have a 2 node cluster with 100 topics. should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t network load/failures, node failures, latency, locks? Regards, Manikumar
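Jay's point can be illustrated with a toy model (this is not Kafka code; the class, the flush rule, and the request counter are all invented for illustration): when records funnel through one producer, they accumulate into batches, so the number of network round trips drops by roughly the batch size compared with sending each record individually.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a batching producer: counts one "network request" per
// flushed batch instead of actually talking to a broker.
class ToyProducer {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int requests = 0;

    ToyProducer(int batchSize) { this.batchSize = batchSize; }

    void send(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize)
            flush();                 // full batch -> one request
    }

    void flush() {
        if (!buffer.isEmpty()) {
            requests++;              // the whole batch goes in one request
            buffer.clear();
        }
    }

    int requests() { return requests; }
}
```

With a batch size of 1 (no batching), 100 records cost 100 requests; with a shared producer batching 10 at a time, the same 100 records cost 10.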
Consumer members do not own any partitions in consumer group
Hi everyone, We are running Kafka 0.8.1.1 with Storm. We wrote our own spout which uses the high level consumer API. Our setup is to create 4 spouts per worker; if you're not familiar with Storm, that's basically 4 kafka consumers per java process. This particular consumer group is interested in 20 topics and ~150 partitions.

When we increased the parallelism to 6 workers and 24 consumers, we noticed certain consumers did not own any partitions in the group. These consumers were on certain hosts. We see their ephemeral nodes in zookeeper under their consumer group. We have also verified connectivity with kafka from those nodes. I also found that if I get those workers/consumers to run on the hosts with consumers that do own partitions, they too will start owning partitions. I'm also finding nothing in our logs in Kafka or on the consumer side which indicates any kind of problem.

Any suggestions on what to try? Is it possible to force a consumer rebalance, and who handles the partition assignment for a consumer group? Bryan
Re: kafka user group in los angeles
Hey Alex, It looks like this group might be appropriate to have a Kafka talk at: http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/ It might be worth showing up at one of their events and asking around. -Jon On Thu, Apr 23, 2015 at 11:40 AM, Alex Toth a...@purificator.net wrote: Hi, Sorry this isn't directly a kafka question, but I was wondering if there are any Kafka user groups in (or within driving range of) Los Angeles. Looking through meetup.com and the usual web search engines hasn't brought me much outside of the LA Hadoop user group, and I was hoping for something more specific. If I should have asked this somewhere else, again, sorry and let me know. alex
Re: New Java Producer: Single Producer vs multiple Producers
Hi Jay, Yes, we are producing from a single process/jvm. From the docs: The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. If I understand correctly, batching happens at topic/partition level, not at Node level. Right? If yes, then both approaches (a single producer for all topics, or a separate producer for each topic) may give similar performance. On Fri, Apr 24, 2015 at 9:29 PM, Jay Kreps jay.kr...@gmail.com wrote: If you are talking about within a single process, having one producer is generally the fastest because batching dramatically reduces the number of requests (esp using the new java producer). -Jay On Fri, Apr 24, 2015 at 4:54 AM, Manikumar Reddy manikumar.re...@gmail.com wrote: We have a 2 node cluster with 100 topics. Should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t. network load/failures, node failures, latency, and locks? Regards, Manikumar
Re: kafka user group in los angeles
Thanks. I'll see what I can find. alex
RE: kafka user group in los angeles
If you don't mind venturing further south, http://www.meetup.com/OCBigData/ could be a good meetup to discuss Kafka at as well.
Getting java.lang.IllegalMonitorStateException in mirror maker when building fetch request
Hi team, I observed a java.lang.IllegalMonitorStateException thrown from AbstractFetcherThread in mirror maker when it is trying to build the fetch request. Below is the error:

[2015-04-23 16:16:02,049] ERROR [ConsumerFetcherThread-group_id_localhost-1429830778627-4519368f-0-7], Error due to (kafka.consumer.ConsumerFetcherThread) java.lang.IllegalMonitorStateException
at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:155)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1723)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2166)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

I believe this is because partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS) is called while the lock is not held. The code below should fix the issue:

inLock(partitionMapLock) {
  partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}

Should I file a jira ticket and submit the patch? I use the latest version of mirror maker in trunk. -- Regards, Tao
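The failure mode reproduces with nothing but java.util.concurrent (class and method names here are mine, not mirror maker code): Condition.await requires its lock to be held and otherwise throws IllegalMonitorStateException from fullyRelease(), exactly as in the stack trace above; taking the lock first - which is what the inLock wrapper in the snippet does - makes the timed await simply time out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class AwaitDemo {
    // The bug: awaiting on a Condition without holding its lock.
    static boolean awaitWithoutLock() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        try {
            cond.await(10, TimeUnit.MILLISECONDS);
            return false; // not reached: await throws before parking
        } catch (IllegalMonitorStateException expected) {
            return true;  // same exception as in the mirror maker log
        } catch (InterruptedException e) {
            return false;
        }
    }

    // The fix: acquire the lock around the timed await.
    static boolean awaitWithLock() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        lock.lock();
        try {
            cond.await(10, TimeUnit.MILLISECONDS); // just times out, no error
            return true;
        } catch (InterruptedException e) {
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```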
Re: New Java Producer: Single Producer vs multiple Producers
Yes, I too notice the same behavior (with the producer/consumer perf tool on 8.1.2)... adding more threads indeed improved the perf a lot (both with and without --sync). In --sync mode batch size made almost no diff; larger events improved the perf.

I was doing some 8.1.2 perf testing with a 1 node broker setup (machine: 32 cpu cores, 256gb RAM, 10gig ethernet, 1 x 15000rpm disk). My observations:

ASYNC MODE:
Partition count: large improvement when going from 1 to 2; beyond 2, a slight dip
Number of producer threads: perf much better than sync mode with 1 thread; perf peaks out at ~10 threads; beyond 10 threads perf is impacted negatively

SYNC MODE (does not seem to use batch size):
Batch size: little to no impact
Event size: perf scales linearly with event size
Number of producer threads: poor perf with one thread; improves with more threads; peaks around 30 to 50 threads
socket.send.buffer.bytes: increasing it made a small but measurable difference (~4%)

--sync mode was much slower. I modified the producer perf tool to use the scala batched producer api (not available in 0.8.2) in --sync mode, and the perf of --sync mode was closer to async mode. -roshan
Re: Consumer members do not own any partitions in consumer group
Managed to figure this one out myself. This is due to the range partition assignment in 0.8.1.1 and the fact that each of our topics has 8 partitions, so only the first 8 consumers get assigned anything. It looks like 0.8.2.0 has a round robin assignment, which is what we want.
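The difference can be shown with a toy simulation (a simplification of the real assignment algorithms, written for illustration only): a range-style split is done per topic, so 8 partitions per topic can never reach more than 8 of the 24 consumers, while a round-robin spread over all topic-partitions reaches everyone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AssignmentDemo {
    // Range-style: each topic's partitions are split independently across
    // the sorted consumers; leftover consumers for that topic get nothing.
    static Map<Integer, List<String>> range(int consumers, List<String> topics, int partsPerTopic) {
        Map<Integer, List<String>> owned = emptyAssignment(consumers);
        for (String t : topics) {
            int per = partsPerTopic / consumers;
            int extra = partsPerTopic % consumers;
            int p = 0;
            for (int c = 0; c < consumers && p < partsPerTopic; c++) {
                int take = per + (c < extra ? 1 : 0);
                for (int i = 0; i < take; i++)
                    owned.get(c).add(t + "-" + p++);
            }
        }
        return owned;
    }

    // Round-robin-style: all topic-partitions are dealt out across all
    // consumers in one pass.
    static Map<Integer, List<String>> roundRobin(int consumers, List<String> topics, int partsPerTopic) {
        Map<Integer, List<String>> owned = emptyAssignment(consumers);
        int c = 0;
        for (String t : topics)
            for (int p = 0; p < partsPerTopic; p++)
                owned.get(c++ % consumers).add(t + "-" + p);
        return owned;
    }

    private static Map<Integer, List<String>> emptyAssignment(int consumers) {
        Map<Integer, List<String>> owned = new TreeMap<>();
        for (int c = 0; c < consumers; c++)
            owned.put(c, new ArrayList<>());
        return owned;
    }
}
```

With 24 consumers, 20 topics, and 8 partitions per topic, the range split leaves consumers 8 through 23 empty (matching the symptom in this thread), while round-robin gives every consumer several of the 160 partitions.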
Kafka dependencies on Pig and Avro
Hi, I'm new to kafka and noticed that kafka has dependencies on older versions of Avro (1.4.0) and Pig (0.8.0); is there a reason for not moving to the latest (avro 1.7.7 and pig 0.14.0)? Also, kafka-hadoop-producer has dependencies on different versions of pig (pig-0.8.0 and piggybank-0.12.0); should they be in sync? Regards, Carita
Re: New Java Producer: Single Producer vs multiple Producers
Do make sure, if you are at all performance sensitive, that you are using the new producer API we released in 0.8.2. -Jay On Fri, Apr 24, 2015 at 12:46 PM, Roshan Naik ros...@hortonworks.com wrote: Yes, I too notice the same behavior (with the producer/consumer perf tool on 8.1.2) ... adding more threads indeed improved the perf a lot (both with and without --sync). In --sync mode, batch size made almost no difference; larger events improved the perf. I was doing some 8.1.2 perf testing with a 1-node broker setup (machine: 32 CPU cores, 256 GB RAM, 10-gig ethernet, 1 x 15000 rpm disk). My observations: ASYNC MODE: Partition count: large improvement when going from 1 to 2; beyond 2, a slight dip. Number of producer threads: perf much better than sync mode with 1 thread; perf peaks out at ~10 threads; beyond 10 threads, perf is impacted negatively. SYNC MODE (does not seem to use batch size): Batch size: little to no impact. Event size: perf scales linearly with event size. Number of producer threads: poor perf with one thread; improves with more threads; peaks around 30 to 50 threads. socket.send.buffer.bytes: increasing it made a small but measurable difference (~4%). --sync mode was much slower. I modified the producer perf tool to use the Scala batched producer API (not available in v8.2) in --sync mode, and perf of --sync mode was closer to async mode. -roshan On 4/24/15 11:42 AM, Navneet Gupta (Tech - BLR) navneet.gu...@flipkart.com wrote: Hi, I ran some tests on our cluster by sending messages from multiple clients (machines). Each machine had about 40-100 threads per producer. I thought of trying out having multiple producers per client, with each producer receiving messages from say 10-15 threads. I actually did see an increase in throughput in this case. It was not a one-off case but a repeatable phenomenon. I called the threads-to-producer ratio the sharingFactor in my code.
I am not planning to use it this way in our clients sending messages to Kafka, but it did go against the suggestion to have a single producer across all threads. On Fri, Apr 24, 2015 at 10:27 PM, Manikumar Reddy ku...@nmsworks.co.in wrote: Hi Jay, Yes, we are producing from a single process/JVM. From the docs: "The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition." If I understand correctly, batching happens at the topic/partition level, not at the node level, right? If yes, then both approaches (a single producer for all topics, or a separate producer for each topic) may give similar performance. On Fri, Apr 24, 2015 at 9:29 PM, Jay Kreps jay.kr...@gmail.com wrote: If you are talking about within a single process, having one producer is generally the fastest, because batching dramatically reduces the number of requests (esp. using the new Java producer). -Jay On Fri, Apr 24, 2015 at 4:54 AM, Manikumar Reddy manikumar.re...@gmail.com wrote: We have a 2-node cluster with 100 topics. Should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t. network load/failures, node failures, latency, locks? Regards, Manikumar -- Thanks Regards, Navneet Gupta
Re: New Java Producer: Single Producer vs multiple Producers
Can we use the new 0.8.2 producer perf tool against a 0.8.1 broker? -roshan On 4/24/15 1:19 PM, Jay Kreps jay.kr...@gmail.com wrote: Do make sure, if you are at all performance sensitive, that you are using the new producer API we released in 0.8.2. -Jay On Fri, Apr 24, 2015 at 12:46 PM, Roshan Naik ros...@hortonworks.com wrote: Yes, I too notice the same behavior (with the producer/consumer perf tool on 8.1.2) ... adding more threads indeed improved the perf a lot (both with and without --sync). In --sync mode, batch size made almost no difference; larger events improved the perf. I was doing some 8.1.2 perf testing with a 1-node broker setup (machine: 32 CPU cores, 256 GB RAM, 10-gig ethernet, 1 x 15000 rpm disk). My observations: ASYNC MODE: Partition count: large improvement when going from 1 to 2; beyond 2, a slight dip. Number of producer threads: perf much better than sync mode with 1 thread; perf peaks out at ~10 threads; beyond 10 threads, perf is impacted negatively. SYNC MODE (does not seem to use batch size): Batch size: little to no impact. Event size: perf scales linearly with event size. Number of producer threads: poor perf with one thread; improves with more threads; peaks around 30 to 50 threads. socket.send.buffer.bytes: increasing it made a small but measurable difference (~4%). --sync mode was much slower. I modified the producer perf tool to use the Scala batched producer API (not available in v8.2) in --sync mode, and perf of --sync mode was closer to async mode. -roshan On 4/24/15 11:42 AM, Navneet Gupta (Tech - BLR) navneet.gu...@flipkart.com wrote: Hi, I ran some tests on our cluster by sending messages from multiple clients (machines). Each machine had about 40-100 threads per producer. I thought of trying out having multiple producers per client, with each producer receiving messages from say 10-15 threads. I actually did see an increase in throughput in this case. It was not a one-off case but a repeatable phenomenon.
I called the threads-to-producer ratio the sharingFactor in my code. I am not planning to use it this way in our clients sending messages to Kafka, but it did go against the suggestion to have a single producer across all threads. On Fri, Apr 24, 2015 at 10:27 PM, Manikumar Reddy ku...@nmsworks.co.in wrote: Hi Jay, Yes, we are producing from a single process/JVM. From the docs: "The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition." If I understand correctly, batching happens at the topic/partition level, not at the node level, right? If yes, then both approaches (a single producer for all topics, or a separate producer for each topic) may give similar performance. On Fri, Apr 24, 2015 at 9:29 PM, Jay Kreps jay.kr...@gmail.com wrote: If you are talking about within a single process, having one producer is generally the fastest, because batching dramatically reduces the number of requests (esp. using the new Java producer). -Jay On Fri, Apr 24, 2015 at 4:54 AM, Manikumar Reddy manikumar.re...@gmail.com wrote: We have a 2-node cluster with 100 topics. Should we use a single producer for all topics or create multiple producers? What is the best choice w.r.t. network load/failures, node failures, latency, locks? Regards, Manikumar -- Thanks Regards, Navneet Gupta
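Jay's point that batching dramatically reduces the number of requests, and Roshan's observation that --sync mode is much slower, can be sketched with a toy request counter (plain-JDK simulation, not the real client; note that `batch.size` in the actual producer is measured in bytes, while here it is a record count for simplicity): waiting for each record individually forces one round trip per record, while the accumulating async path sends one request per full batch.

```java
// Toy request counter contrasting per-record ("sync") sends with
// batched ("async") sends. Illustrative only; real batch.size is in bytes.
public class BatchingSketch {

    // Sync semantics: the caller blocks on each record, so every record
    // costs one broker round trip and batching never kicks in.
    public static int requestsSync(int records) {
        return records;
    }

    // Async semantics: records accumulate until a batch fills (or linger
    // expires), so the request count is roughly records / batchSize.
    public static int requestsAsync(int records, int batchSize) {
        return (records + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(requestsSync(100000));       // 100000
        System.out.println(requestsAsync(100000, 200)); // 500
    }
}
```

The 200x gap in round trips is why the Scala batched-producer modification above brought --sync throughput closer to async: user-defined batches restore the amortization that per-record sync calls destroy.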
leader election rate
Looking at the output of the JMX stats from our Kafka cluster, I see a more or less constant leader election rate of around 2.5 from our controller. Is this expected, or does this mean that leaders are shifting around constantly? If they are shifting, how should I go about debugging, and what triggers a leader election? Thanks, Wes
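One thing worth checking (hedged: it depends on which attribute the JMX tool reports) is whether the 2.5 figure is a mean rate averaged over the broker's whole uptime; in that case a burst of elections at startup can keep the mean nonzero long after leadership has stabilized, while the meter's raw Count attribute would distinguish real churn from a stale average. Below is a sketch that samples the count twice, assuming the controller MBean name Kafka registers in 0.8.x (`kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs`) and a hypothetical `broker-host:9999` JMX endpoint — verify both against your broker's JMX tree.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: sample the controller's leader-election meter twice over JMX.
// A growing Count means elections really are happening; a flat Count with
// a nonzero rate suggests a stale uptime-averaged mean.
public class LeaderElectionCheck {
    static final String MBEAN =
        "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs";

    public static ObjectName mbean() {
        try {
            return new ObjectName(MBEAN);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical host:port; point this at a broker started with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            long before = (Long) conn.getAttribute(mbean(), "Count");
            Thread.sleep(60_000); // sample again one minute later
            long after = (Long) conn.getAttribute(mbean(), "Count");
            System.out.println("elections in the last minute: " + (after - before));
        }
    }
}
```

If the count really is climbing, the controller log (state-change log) records which partitions are changing leaders and why, e.g. brokers dropping out of ZooKeeper and rejoining.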