article: hands-on kafka: dynamic DNS

2015-04-24 Thread Pierre-Yves Ritschard
Hi list!

I just wanted to mention a small article I put together describing an
approach that leverages log compaction when you have compound types and
messages are operations on that compound type, with an example use case:

http://spootnik.org/entries/2015/04/23_hands-on-kafka-dynamic-dns.html

Always eager to hear your feedback and to learn about other approaches.

Cheers,
  - pyr


Atomic write of message batch to single partition

2015-04-24 Thread Martin Krasser

Hello,

I'm using Kafka 0.8.2.1 (in a Scala/Java project) and trying to find out 
how to atomically write n messages (message batch) to a single topic 
partition. Is there any client API that gives such a guarantee? I 
couldn't find a clear answer reading the documentation, API docs (of the 
old and new producer) and mailing list archives - sorry if I missed 
something obvious.


With the new org.apache.kafka.clients.producer.KafkaProducer, the send 
method has only a single ProducerRecord parameter, and record 
accumulation and batch-sending of records is an implementation detail 
(which can be controlled to some extent by the batch.size configuration 
setting etc. but not by user-defined message batches). So it seems that 
the new producer cannot be used for that. Are there any plans to support 
that in future versions?


Only the old kafka.producer.Producer allows me to pass a user-defined 
KeyedMessage batch to its send method. What are the semantics of this 
method when producer.type=sync and all KeyedMessages for a given send 
call are targeted at the same topic partition? Are these messages being 
written atomically to the partition?
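
For reference, here is a minimal sketch of that batched send via the old
producer's Java wrapper (kafka.javaapi.producer.Producer); the broker
address and topic are placeholders, and whether such a batch is appended
atomically is exactly my open question:

import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BatchSendSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("metadata.broker.list", "localhost:9092"); // placeholder broker
    props.put("producer.type", "sync");
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    Producer<String, String> producer = new Producer<>(new ProducerConfig(props));

    // Same key, so with the default partitioner the whole batch
    // targets a single topic partition.
    List<KeyedMessage<String, String>> batch = Arrays.asList(
        new KeyedMessage<>("my-topic", "key-1", "message-1"),
        new KeyedMessage<>("my-topic", "key-1", "message-2"));

    producer.send(batch); // user-defined batch; is this write atomic?
    producer.close();
  }
}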


Are there other options to achieve atomic writes of user-defined message 
batches (except making the batch a single Kafka message)?
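
For completeness, a sketch of that last workaround: packing the batch into
a single Kafka message value by length-prefixing each payload (the framing
format here is an arbitrary choice of mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

public class BatchFraming {
  // Encode n payloads as one message value: [count][len][bytes][len][bytes]...
  static byte[] pack(List<byte[]> payloads) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeInt(payloads.size());
    for (byte[] p : payloads) {
      out.writeInt(p.length);
      out.write(p);
    }
    out.flush();
    return buf.toByteArray();
  }
}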


Thanks for any hints!

Regards,
Martin



[ANN] Apache Cloudstack 4.5 kafka-event-bus plugin

2015-04-24 Thread Pierre-Yves Ritschard
Hi list,

I thought I'd also mention that the next release of Apache Cloudstack
adds the ability to publish all events happening throughout the
environment to Kafka. Events are published as JSON.

http://cloudstack-administration.readthedocs.org/en/latest/events.html#kafka-configuration

Cheers,
  - pyr


Re: Why fetching meta-data for topic is done three times?

2015-04-24 Thread Madhukar Bharti
Hi All,

After going through the code, I found that when the producer starts it does
three things:

1. Sends a meta-data request
2. Sends the message to the broker (fetching the broker list)
3. If the number of messages to be produced is greater than 0, tries again
to refresh meta-data for the outstanding produce requests.

Each of these requests takes the configured timeout before moving on to the
next, and only once all are done does it throw an Exception (if step 3 also
fails).

The problem is: if we set the timeout to 1 sec, then it takes 3 sec to throw
an exception, so the user request hangs for 3 sec, which is comparatively
high for our response time; and if all threads are blocked in producer
sends, the whole application is blocked for 3 sec. So we want to reduce the
overall time-to-exception to 1 sec.

What is a possible way to do this?
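
One lever we are considering, assuming the three attempts come from the old
producer's send-retry loop (which refreshes meta-data before each retry), is
lowering message.send.max.retries; a sketch with a placeholder broker:

import java.util.Properties;
import kafka.producer.ProducerConfig;

public class FastFailProducerConfig {
  public static ProducerConfig build() {
    Properties props = new Properties();
    props.put("metadata.broker.list", "broker1:9092"); // placeholder
    props.put("request.timeout.ms", "1000");
    // The old producer retries a failed send up to message.send.max.retries
    // times (default 3), refreshing meta-data before each retry. Lowering it
    // trades delivery robustness for faster failure.
    props.put("message.send.max.retries", "1");
    props.put("retry.backoff.ms", "100");
    return new ProducerConfig(props);
  }
}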

Thanks
Madhukar

On Thu, Apr 16, 2015 at 8:10 PM, Madhukar Bharti bhartimadhu...@gmail.com
wrote:

 Hi All,

 I came across a problem: if we use a broker IP which is not reachable or
 out of the network, then it takes more than the configured time
 (request.timeout.ms). After checking the log, I got to know that it is
 trying to fetch topic meta-data three times, changing the correlation id
 each time. Due to this, even though I keep request.timeout.ms=1000, it
 takes 3 sec to throw an Exception. I am using Kafka 0.8.1.1 with the patch
 https://issues.apache.org/jira/secure/attachment/12678547/kafka-1733-add-connectTimeoutMs.patch


 I have attached the log. Please check this and clarify why it is behaving
 like this: is it by design, or do I have to set some other property as well?



 Regards
 Madhukar





Does log.retention.bytes apply only to partition leader or also replicas

2015-04-24 Thread David Corley
Does the byte retention policy apply to replica partitions, leader
partitions, or both?
In a multi-node cluster, with all brokers configured with different
retention policies, it seems obvious that the partitions for which a given
broker is the leader will be subject to that broker's byte retention
policy, but what about the partitions for which the given broker is a
replica? Are they subject to the same policy?

For example, a 2 node cluster with single topic and two partitions.
Broker A is the leader for partition 1 and contains the replica of
partition 2
Broker B is the leader for partition 2 and contains the replica of
partition 1
Broker A has a byte retention policy of 1 byte
Broker B has a byte retention policy of 2 bytes

Will broker A retain 1 byte of both partitions it hosts?
Likewise, will broker B retain 2 bytes of both partitions it hosts?


Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer

2015-04-24 Thread Gianmarco De Francisci Morales
Hi,


Here are the questions I think we should consider:
 1. Do we need this at all given that we have the partition argument in
 ProducerRecord which gives full control? I think we do need it because this
 is a way to plug in a different partitioning strategy at run time and do it
 in a fairly transparent way.


Yes, we need it if we want to support different partitioning strategies
inside Kafka rather than requiring the user to code them externally.


 3. Do we need to add the value? I suspect people will have uses for
 computing something off a few fields in the value to choose the partition.
 This would be useful in cases where the key was being used for log
 compaction purposes and did not contain the full information for computing
 the partition.


I am not entirely sure about this. I guess that most partitioners should
not use it.
I think it makes it easier to reason about the system if the partitioner
only works on the key.
However, if the value (and its serialization) are already available, there
is not much harm in passing them along.


 4. This interface doesn't include either an init() or close() method. It
 should implement Closable and Configurable, right?


Right now the only application I can think of to have an init() and close()
is to read some state information (e.g., load information) that is
published on some external distributed storage (e.g., zookeeper) by the
brokers.
It might be useful also for reconfiguration and state migration.

I think it's not a very common use case right now, but if the added
complexity is not too much, it might be worth supporting these methods.



 5. What happens if the user both sets the partition id in the
 ProducerRecord and sets a partitioner? Does the partition id just get
 passed in to the partitioner (as sort of implied in this interface?). This
 is a bit weird since if you pass in the partition id you kind of expect it
 to get used, right? Or is it the case that if you specify a partition the
 partitioner isn't used at all (in which case no point in including
 partition in the Partitioner api).


The user should be able to override the partitioner on a per-record basis
by explicitly setting the partition id.
I don't think it makes sense for the partitioners to take hints on the
partition.

I would even go a step further, and have a default logic that accepts both
the key and the partition id (the current interface) and calls partition()
only if the partition id is not set. The partition() method does *not* take
the partition ID as input (only key and value).
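
A hypothetical sketch of that dispatch logic (the interface is still under
discussion, so the names and signatures below are illustrative only, not
the actual KIP-22 API):

// Illustrative only: a pluggable partitioner that sees key and value
// but no partition id, per the proposal above.
public interface Partitioner {
  int partition(Object key, Object value, int numPartitions);
}

class PartitionDispatcher {
  private final Partitioner partitioner;

  PartitionDispatcher(Partitioner partitioner) {
    this.partitioner = partitioner;
  }

  // An explicit partition id on the record always wins; the partitioner
  // is consulted only when no id was set.
  int choosePartition(Integer recordPartition, Object key, Object value,
                      int numPartitions) {
    if (recordPartition != null)
      return recordPartition; // per-record user override
    return partitioner.partition(key, value, numPartitions);
  }
}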


Cheers,
--
Gianmarco



 Cheers,

 -Jay

 On Thu, Apr 23, 2015 at 6:55 AM, Sriharsha Chintalapani ka...@harsha.io
 wrote:

  Hi,
  Here is the KIP for adding a partitioner interface for producer.
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Partitioner+interface+in+the+new+producer
  There is one open question about how the interface should look. Please
  take a look and let me know if you prefer one way or the other.
 
  Thanks,
  Harsha
 
 



Re: Kafka server - conflicted ephemeral node

2015-04-24 Thread 小宇
Is there a recommended way to handle this issue?

Thanks!

Mayuresh Gharat gharatmayures...@gmail.com wrote on Wednesday, April 22, 2015:

 This happens due to a bug in ZooKeeper: sometimes the znode does not get
 deleted automatically. We have seen it many times at LinkedIn and are
 trying to investigate further.

 Thanks,

 Mayuresh

  On Mon, Apr 20, 2015 at 8:52 PM, 小宇 mocking...@gmail.com
 wrote:

   Thanks for your response gharatmayuresh1, but I don't know what you mean
   exactly. I have restarted my server and I want to find out the cause in
   case it happens again.
 
   2015-04-21 11:36 GMT+08:00 gharatmayures...@gmail.com:
 
   Try bouncing
   10.144.38.185
  
   This should resolve the issue.
  
   Thanks,
  
   Mayuresh
   Sent from my iPhone
  
 On Apr 20, 2015, at 8:22 PM, 小宇 mocking...@gmail.com wrote:
 wrote:
   
10.144.38.185
  
 



 --
 -Regards,
 Mayuresh R. Gharat
 (862) 250-7125



New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Manikumar Reddy
We have a 2-node cluster with 100 topics.
Should we use a single producer for all topics, or create multiple
producers?
What is the best choice w.r.t. network load/failures, node failures,
latency, and locks?

Regards,
Manikumar


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Navneet Gupta (Tech - BLR)
Hi,

I ran some tests on our cluster by sending messages from multiple clients
(machines). Each machine had about 40-100 threads per producer.

I thought of trying out multiple producers per client, with each producer
receiving messages from say 10-15 threads. I actually did see an increase
in throughput in this case. It was not a one-off result but a repeatable
phenomenon. I called the threads-to-producer ratio sharingFactor in my
code.

I am not planning to use it this way in our clients sending messages to
Kafka, but it did go against the suggestion to have a single producer
shared across all threads.
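
A sketch of that pool idea (all names here are illustrative, not our actual
code; props must carry bootstrap.servers and byte-array serializers):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerPool {
  private final List<KafkaProducer<byte[], byte[]>> producers = new ArrayList<>();
  private final int sharingFactor; // threads per producer

  ProducerPool(Properties props, int numThreads, int sharingFactor) {
    this.sharingFactor = sharingFactor;
    // Round up so every thread has a producer to map to.
    int numProducers = (numThreads + sharingFactor - 1) / sharingFactor;
    for (int i = 0; i < numProducers; i++)
      producers.add(new KafkaProducer<byte[], byte[]>(props));
  }

  // Map each thread id to one producer; ~sharingFactor threads share each.
  KafkaProducer<byte[], byte[]> forThread(int threadId) {
    return producers.get((threadId / sharingFactor) % producers.size());
  }
}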







-- 
Thanks & Regards,
Navneet Gupta


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Roshan Naik
Jay,
  It's not evident how to switch between sync and async modes using this
new 'org.apache.kafka.clients.tools.ProducerPerformance'.

AFAICT it measures in async mode by default.

-roshan



On 4/24/15 3:23 PM, Jay Kreps jay.kr...@gmail.com wrote:

That should work. I recommend using the performance tool cited in the blog
linked from the performance page of the website. That tool is more
accurate and uses the new producer.



Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Jay Kreps
That should work. I recommend using the performance tool cited in the blog
linked from the performance page of the website. That tool is more
accurate and uses the new producer.

On Fri, Apr 24, 2015 at 2:29 PM, Roshan Naik ros...@hortonworks.com wrote:

 Can we use the new 0.8.2 producer perf tool against a 0.8.1 broker?
 -roshan



New and old producers partition messages differently

2015-04-24 Thread James Cheng
Hi,

I was playing with the new producer in 0.8.2.1 using partition keys (semantic 
partitioning, I believe, is the phrase?). I noticed that the default partitioner 
in 0.8.2.1 does not partition items the same way as the old 0.8.1.1 default 
partitioner did. For a test item, the old producer was sending it to 
partition 0, whereas the new producer was sending it to partition 4.

Digging into the code, it appears that the partitioning logic is different 
between the old and new producers. Both of them hash the key, but they use 
different hashing algorithms.

Old partitioner:
./core/src/main/scala/kafka/producer/DefaultPartitioner.scala:

  def partition(key: Any, numPartitions: Int): Int = {
    Utils.abs(key.hashCode) % numPartitions
  }

New partitioner:
./clients/src/main/java/org/apache/kafka/clients/producer/internals/Partitioner.java:

} else {
    // hash the key to choose a partition
    return Utils.abs(Utils.murmur2(record.key())) % numPartitions;
}

Where murmur2 is a custom hashing algorithm. (I'm assuming that murmur2 isn't 
the same logic as hashCode, especially since hashCode is overrideable).

Was it intentional that the hashing algorithm would change between the old and 
new producer? If so, was this documented? I don't know if anyone was relying on 
the old default partitioner, as opposed to going round-robin or using their own 
custom partitioner. Do you expect it to change in the future? I'm guessing that 
one of the main reasons to have a custom hashing algorithm is so that you are 
in full control of the partitioning and can keep it stable (as opposed to being 
reliant on hashCode()).
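
In case it helps others: one workaround to keep the old placement with the
new producer (which has no pluggable partitioner in 0.8.2) is to compute the
partition yourself and set it explicitly on the ProducerRecord. A sketch;
the abs() handling below approximates the old Utils.abs and should be
double-checked against your Kafka version:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OldStylePartitioning {
  // Approximates the old DefaultPartitioner: Utils.abs(key.hashCode) % numPartitions.
  static int oldStylePartition(Object key, int numPartitions) {
    int hash = key.hashCode();
    int positive = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
    return positive % numPartitions;
  }

  static void send(KafkaProducer<String, String> producer, String topic,
                   String key, String value) {
    int numPartitions = producer.partitionsFor(topic).size();
    int partition = oldStylePartition(key, numPartitions);
    // An explicit partition bypasses the new default partitioner entirely.
    producer.send(new ProducerRecord<>(topic, partition, key, value));
  }
}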

Thanks,
-James



New producer: metadata update problem on 2 Node cluster.

2015-04-24 Thread Manikumar Reddy
We are testing the new producer on a 2-node cluster.
Under some node-failure scenarios, the producer is not able
to update its metadata.

Steps to reproduce:
1. Form a 2-node cluster (K1, K2).
2. Create a topic with a single partition, replication factor = 2.
3. Start producing data (producer metadata: K1, K2).
4. Kill the leader node (say K1).
5. K2 becomes the leader (producer metadata: K2).
6. Bring back K1 and kill K2 before metadata.max.age.ms elapses.
7. K1 becomes the leader (producer metadata still contains: K2).

After this point, the producer is not able to update the metadata;
it continuously tries to connect to the dead node (K2).

This looks like a bug to me. Am I missing anything?


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Jay Kreps
If you are talking about within a single process, having one producer is
generally the fastest because batching dramatically reduces the number of
requests (esp using the new java producer).
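
A minimal sketch of that pattern (the new producer is thread-safe, so one
instance can be shared by all threads; the broker address and topic are
placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SharedProducerSketch {
  public static void main(String[] args) throws InterruptedException {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092"); // placeholder
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    // One producer shared by all threads; records headed to the same
    // partition are batched into fewer requests, across threads.
    final KafkaProducer<String, String> producer = new KafkaProducer<>(props);

    Runnable task = new Runnable() {
      public void run() {
        producer.send(new ProducerRecord<String, String>("my-topic", "key", "value"));
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start(); t2.start();
    t1.join(); t2.join();
    producer.close();
  }
}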
-Jay




Consumer members do not own any partitions in consumer group

2015-04-24 Thread Bryan Baugher
Hi everyone,

We are running Kafka 0.8.1.1 with Storm. We wrote our own spout which uses
the high-level consumer API. Our setup is to create 4 spouts per worker; if
you're not familiar with Storm, that's basically 4 Kafka consumers per Java
process. This particular consumer group is interested in 20 topics and ~150
partitions. When we increased the parallelism to 6 workers and 24 consumers,
we noticed certain consumers did not own any partitions in the group. These
consumers were on certain hosts. We see their ephemeral nodes in ZooKeeper
under their consumer group. We have also verified connectivity with Kafka
from those nodes. I also found that if I get those workers/consumers to run
on the hosts with consumers that do own partitions, they too will start
owning partitions.

I'm also finding nothing in our logs in Kafka or on the consumer side which
indicates any kind of problem.

Any suggestions on what to try? Is it possible to force a consumer
rebalance and who handles the partition assignment for a consumer group?

Bryan


Re: kafka user group in los angeles

2015-04-24 Thread Jon Bringhurst
Hey Alex,

It looks like this group might be appropriate to have a Kafka talk at:

http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/

It might be worth showing up at one of their events and asking around.

-Jon

On Thu, Apr 23, 2015 at 11:40 AM, Alex Toth a...@purificator.net wrote:
 Hi,
 Sorry this isn't directly a Kafka question, but I was wondering if there are 
 any Kafka user groups in (or within driving range of) Los Angeles.  Looking 
 through meetup.com and the usual web search engines hasn't brought me much 
 outside of the LA Hadoop user group and I was hoping for something more 
 specific.
 If I should have asked this somewhere else, again, sorry and let me know.


   alex


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Manikumar Reddy
Hi Jay,

Yes, we are producing from single process/jvm.

From the docs: The producer will attempt to batch records together into
fewer requests whenever multiple records are being sent to the same
partition.

If I understand correctly, batching happens at the topic/partition level,
not at the node level. Right?

If yes, then both approaches (a single producer for all topics, or a
separate producer for each topic) may give similar performance.

On Fri, Apr 24, 2015 at 9:29 PM, Jay Kreps jay.kr...@gmail.com wrote:

 If you are talking about within a single process, having one producer is
 generally the fastest because batching dramatically reduces the number of
 requests (esp using the new java producer).
 -Jay




Re: kafka user group in los angeles

2015-04-24 Thread Alex Toth
Thanks.  I'll see what I can find.

  alex


RE: kafka user group in los angeles

2015-04-24 Thread Jeff Field
If you don't mind venturing further south, http://www.meetup.com/OCBigData/ 
could be a good meetup to discuss Kafka at as well.



Getting java.lang.IllegalMonitorStateException in mirror maker when building fetch request

2015-04-24 Thread tao xiao
Hi team,

I observed a java.lang.IllegalMonitorStateException thrown
from AbstractFetcherThread in mirror maker when it is trying to build the
fetch request. Below is the error:

[2015-04-23 16:16:02,049] ERROR
[ConsumerFetcherThread-group_id_localhost-1429830778627-4519368f-0-7],
Error due to  (kafka.consumer.ConsumerFetcherThread)

java.lang.IllegalMonitorStateException
    at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:155)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1723)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2166)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

I believe this is due to partitionMapCond.await(fetchBackOffMs,
TimeUnit.MILLISECONDS) being called while the lock is not held.

The code below should fix the issue:

inLock(partitionMapLock) {
  partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}

Should I file a jira ticket and submit the patch?

I use the latest version of mirror maker in trunk.


-- 
Regards,
Tao


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Roshan Naik
Yes, I too noticed the same behavior (with the producer/consumer perf tool
on 8.1.2): adding more threads indeed improved the perf a lot (both with and
without --sync); in --sync mode, batch size made almost no difference, while
larger events improved the perf.

I was doing some 8.1.2 perf testing with a 1-node broker setup (machine:
32 CPU cores, 256 GB RAM, 10-gig Ethernet, 1 x 15000 rpm disk).

My observations:

ASYNC MODE:
- Partition count: large improvement when going from 1 to 2; beyond 2, a
slight dip
- Number of producer threads: perf much better than sync mode with 1
thread; perf peaks out at ~10 threads, and beyond 10 threads perf is
impacted negatively

SYNC MODE (does not seem to use batch size):
- Batch size: little to no impact
- Event size: perf scales linearly with event size
- Number of producer threads: poor perf with one thread, improves with more
threads, peaks around 30 to 50 threads
- socket.send.buffer.bytes: increasing it made a small but measurable
difference (~4%)

Overall, --sync mode was much slower.

I modified the producer perf tool to use the Scala batched producer API
(not available in v8.2) in --sync mode, and the perf of --sync mode was
closer to async mode.

-roshan





Re: Consumer members do not own any partitions in consumer group

2015-04-24 Thread Bryan Baugher
Managed to figure this one out myself. This is due to the range partition
assignment in 0.8.1.1 and the fact that each of our topics has 8
partitions, so only the first 8 consumers get assigned anything. It looks
like 0.8.2.0 has a round-robin assignment, which is what we want.
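
For reference, a sketch of enabling that on the 0.8.2 high-level consumer
(ZooKeeper address and group id are placeholders; as far as I can tell, the
roundrobin strategy also requires every consumer in the group to subscribe
to the same topics):

import java.util.Properties;
import kafka.consumer.ConsumerConfig;

public class RoundRobinConsumerConfig {
  public static ConsumerConfig build() {
    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1:2181"); // placeholder
    props.put("group.id", "my-group");          // placeholder
    // 0.8.2 high-level consumer accepts "range" (default) or "roundrobin".
    props.put("partition.assignment.strategy", "roundrobin");
    return new ConsumerConfig(props);
  }
}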




Kafka dependencies on Pig and Avro

2015-04-24 Thread Carita Ou
Hi,

I'm new to Kafka and noticed that Kafka has dependencies on older versions
of Avro (1.4.0) and Pig (0.8.0); is there a reason for not moving to the
latest (Avro 1.7.7 and Pig 0.14.0)?

Also, kafka-hadoop-producer has dependencies on different versions of Pig,
pig-0.8.0 and piggybank-0.12.0; should they be in sync?

Regards,
Carita


Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Jay Kreps
Do make sure, if you are at all performance sensitive, that you are using
the new producer API we released in 0.8.2.

-Jay




Re: New Java Producer: Single Producer vs multiple Producers

2015-04-24 Thread Roshan Naik
Can we use the new 0.8.2 producer perf tool against a 0.8.1 broker?
-roshan


On 4/24/15 1:19 PM, Jay Kreps jay.kr...@gmail.com wrote:

Do make sure if you are at all performance sensitive you are using the new
producer api we released in 0.8.2.

-Jay






leader election rate

2015-04-24 Thread Wesley Chow
Looking at the output of the JMX stats from our Kafka cluster, I see a more
or less constant leader election rate of around 2.5 on our controller. Is
this expected, or does this mean that leaders are shifting around
constantly?

If they are shifting, how should I go about debugging, and what triggers a
leader election?

Thanks,
Wes