How to achieve distributed processing and high availability simultaneously in Kafka?

2015-05-05 Thread sumit jain
I have a topic consisting of n partitions. To have distributed processing I
create two processes running on different machines. They subscribe to the
topic with the same group id and allocate n/2 threads each, each of which
processes a single stream (n/2 partitions per process).
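
For reference, the setup described above looks roughly like this with the
0.8.x high-level consumer (topic name, ZooKeeper address, and thread count are
illustrative assumptions):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class PartitionedWorker {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");    // assumption
        props.put("group.id", "my-processing-group");  // same group id on both machines

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        int threads = 4;  // n/2 streams per process, e.g. n = 8 partitions
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("mytopic", threads));

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final KafkaStream<byte[], byte[]> stream : streams.get("mytopic")) {
            pool.submit(() -> {
                // each thread drains the partitions currently assigned to its stream
                stream.iterator().forEachRemaining(msg -> process(msg.message()));
            });
        }
    }

    private static void process(byte[] payload) {
        // application logic
    }
}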

With this I will have achieved load distribution, but now if process 1
crashes, then process 2 cannot consume messages from the partitions allocated
to process 1, as it listened on only n/2 streams at the start.

Alternatively, if I configure for HA and start n threads/streams on both
processes, then when one node fails all partitions will be processed by the
other node. But here we have compromised distribution, as all partitions
will be processed by a single node at a time.

Is there a way to achieve both simultaneously and how?
Note: Already asked this on stackoverflow
http://stackoverflow.com/questions/30060261/how-to-achieve-distributed-processing-and-high-availability-simultaneously-in-ka
.
-- 
Thanks & Regards,
Sumit Jain


Re: circuit breaker for producer

2015-05-05 Thread Guozhang Wang
KAFKA-1955, I think Jay has a WIP patch for it.


On Tue, May 5, 2015 at 5:10 PM, Jason Rosenberg  wrote:

> Guozhang,
>
> Do you have the ticket number for possibly adding in local log file
> failover? Is it actively being worked on?
>
> Thanks,
>
> Jason
>
> On Tue, May 5, 2015 at 6:11 PM, Guozhang Wang  wrote:
>
> > Does this "log file" acts as a temporary disk buffer when broker slows
> > down, whose data will be re-send to broker later, or do you plan to use
> it
> > as a separate persistent storage as Kafka brokers?
> >
> > For the former use case, I think there is an open ticket for integrating
> > this kind of functionality into producer; for the latter use case, you
> may
> > want to do this traffic control out of Kafka producer, i.e. upon
> detecting
> > producer buffer full, do not call send() on it for a while but write to a
> > different file, etc.
> >
> > Guozhang
> >
> > On Tue, May 5, 2015 at 11:28 AM, mete  wrote:
> >
> > > Sure, i kind of count on that actually, i guess with this setting the
> > > sender blocks on allocate method and this bufferpool-wait-ratio
> > increases.
> > >
> > > I want to fully compartmentalize the kafka producer from the rest of
> the
> > > system. Ex: writing to a log file instead of trying to send to kafka
> when
> > > some metric in the producer indicates that there is a performance
> > > degradation or some other problem.
> > > I was wondering what would be the ideal way of deciding that?
> > >
> > >
> > >
> > > On Tue, May 5, 2015 at 6:32 PM, Jay Kreps  wrote:
> > >
> > > > Does block.on.buffer.full=false do what you want?
> > > >
> > > > -Jay
> > > >
> > > > On Tue, May 5, 2015 at 1:59 AM, mete  wrote:
> > > >
> > > > > Hello Folks,
> > > > >
> > > > > I was looking through the kafka.producer metrics on the JMX
> > interface,
> > > to
> > > > > find a good indicator when to "trip" the circuit. So far it seems
> > like
> > > > the
> > > > > "bufferpool-wait-ratio" metric is a useful decision mechanism when
> to
> > > cut
> > > > > off the production to kafka.
> > > > >
> > > > > As far as i experienced, when kafka server slow for some reason,
> > > requests
> > > > > start piling up on the producer queue and if you are not willing to
> > > drop
> > > > > any messages on the producer, send method starts blocking because
> of
> > > the
> > > > > slow responsiveness.
> > > > >
> > > > > So this buffer pool wait ratio starts going up from 0.x up to 1.0.
> > And
> > > i
> > > > am
> > > > > thinking about tripping the circuit breaker using this metric, ex:
> if
> > > > > wait-ratio > 0.90 etc...
> > > > >
> > > > > What do you think? Do you think there would be a better indicator
> to
> > > > check
> > > > > the health overall?
> > > > >
> > > > > Best
> > > > > Mete
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Ewen Cheslack-Postava
I'm not sure about the old producer behavior in this same failure scenario,
but creating a new producer instance would resolve the issue since it would
start with the list of bootstrap nodes and, assuming at least one of them
was up, it would be able to fetch up-to-date metadata.
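
A hedged sketch of that workaround with the new producer; the 30 second
timeout and the decision of when to rebuild are illustrative, and props is
assumed to already carry bootstrap.servers and the serializers:

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RebootstrappingSender {
    private final Properties props;   // must contain bootstrap.servers and serializers
    private KafkaProducer<byte[], byte[]> producer;

    public RebootstrappingSender(Properties props) {
        this.props = props;
        this.producer = new KafkaProducer<>(props);
    }

    public synchronized void send(ProducerRecord<byte[], byte[]> record) {
        try {
            producer.send(record).get(30, TimeUnit.SECONDS);
        } catch (Exception e) {
            // metadata may be stuck pointing at a dead node; a fresh instance
            // starts again from the bootstrap list (the failed record itself
            // would need to be retried by the caller)
            producer.close();
            producer = new KafkaProducer<>(props);
        }
    }
}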

On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg  wrote:

> Can you clarify, is this issue here specific to the "new" producer?  With
> the "old" producer, we routinely construct a new producer which makes a
> fresh metadata request (via a VIP connected to all nodes in the cluster).
> Would this approach work with the new producer?
>
> Jason
>
>
> On Tue, May 5, 2015 at 1:12 PM, Rahul Jain  wrote:
>
> > Mayuresh,
> > I was testing this in a development environment and manually brought
> down a
> > node to simulate this. So the dead node never came back up.
> >
> > My colleague and I were able to consistently see this behaviour several
> > times during the testing.
> > On 5 May 2015 20:32, "Mayuresh Gharat" 
> wrote:
> >
> > > I agree that to find the least Loaded node the producer should fall
> back
> > to
> > > the bootstrap nodes if its not able to connect to any nodes in the
> > current
> > > metadata. That should resolve this.
> > >
> > > Rahul, I suppose the problem went off because the dead node in your
> case
> > > might have came back up and allowed for a metadata update. Can you
> > confirm
> > > this?
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain  wrote:
> > >
> > > > We observed the exact same error. Not very clear about the root cause
> > > > although it appears to be related to leastLoadedNode implementation.
> > > > Interestingly, the problem went away by increasing the value of
> > > > reconnect.backoff.ms to 1000ms.
> > > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" 
> > wrote:
> > > >
> > > > > Ok, all of that makes sense. The only way to possibly recover from
> > that
> > > > > state is either for K2 to come back up allowing the metadata
> refresh
> > to
> > > > > eventually succeed or to eventually try some other node in the
> > cluster.
> > > > > Reusing the bootstrap nodes is one possibility. Another would be
> for
> > > the
> > > > > client to get more metadata than is required for the topics it
> needs
> > in
> > > > > order to ensure it has more nodes to use as options when looking
> for
> > a
> > > > node
> > > > > to fetch metadata from. I added your description to KAFKA-1843,
> > > although
> > > > it
> > > > > might also make sense as a separate bug since fixing it could be
> > > > considered
> > > > > incremental progress towards resolving 1843.
> > > > >
> > > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy <
> > ku...@nmsworks.co.in
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Ewen,
> > > > > >
> > > > > >  Thanks for the response.  I agree with you, In some case we
> should
> > > use
> > > > > > bootstrap servers.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > If you have logs at debug level, are you seeing this message in
> > > > between
> > > > > > the
> > > > > > > connection attempts:
> > > > > > >
> > > > > > > Give up sending metadata request since no node is available
> > > > > > >
> > > > > >
> > > > > >  Yes, this log came for couple of times.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Also, if you let it continue running, does it recover after the
> > > > > > > metadata.max.age.ms timeout?
> > > > > > >
> > > > > >
> > > > > >  It does not reconnect.  It is continuously trying to connect
> with
> > > dead
> > > > > > node.
> > > > > >
> > > > > >
> > > > > > -Manikumar
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Ewen
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -Regards,
> > > Mayuresh R. Gharat
> > > (862) 250-7125
> > >
> >
>



-- 
Thanks,
Ewen


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Jason Rosenberg
Can you clarify: is this issue specific to the "new" producer? With
the "old" producer, we routinely construct a new producer which makes a
fresh metadata request (via a VIP connected to all nodes in the cluster).
Would this approach work with the new producer?

Jason


On Tue, May 5, 2015 at 1:12 PM, Rahul Jain  wrote:

> Mayuresh,
> I was testing this in a development environment and manually brought down a
> node to simulate this. So the dead node never came back up.
>
> My colleague and I were able to consistently see this behaviour several
> times during the testing.
> On 5 May 2015 20:32, "Mayuresh Gharat"  wrote:
>
> > I agree that to find the least Loaded node the producer should fall back
> to
> > the bootstrap nodes if its not able to connect to any nodes in the
> current
> > metadata. That should resolve this.
> >
> > Rahul, I suppose the problem went off because the dead node in your case
> > might have came back up and allowed for a metadata update. Can you
> confirm
> > this?
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain  wrote:
> >
> > > We observed the exact same error. Not very clear about the root cause
> > > although it appears to be related to leastLoadedNode implementation.
> > > Interestingly, the problem went away by increasing the value of
> > > reconnect.backoff.ms to 1000ms.
> > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" 
> wrote:
> > >
> > > > Ok, all of that makes sense. The only way to possibly recover from
> that
> > > > state is either for K2 to come back up allowing the metadata refresh
> to
> > > > eventually succeed or to eventually try some other node in the
> cluster.
> > > > Reusing the bootstrap nodes is one possibility. Another would be for
> > the
> > > > client to get more metadata than is required for the topics it needs
> in
> > > > order to ensure it has more nodes to use as options when looking for
> a
> > > node
> > > > to fetch metadata from. I added your description to KAFKA-1843,
> > although
> > > it
> > > > might also make sense as a separate bug since fixing it could be
> > > considered
> > > > incremental progress towards resolving 1843.
> > > >
> > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy <
> ku...@nmsworks.co.in
> > >
> > > > wrote:
> > > >
> > > > > Hi Ewen,
> > > > >
> > > > >  Thanks for the response.  I agree with you, In some case we should
> > use
> > > > > bootstrap servers.
> > > > >
> > > > >
> > > > > >
> > > > > > If you have logs at debug level, are you seeing this message in
> > > between
> > > > > the
> > > > > > connection attempts:
> > > > > >
> > > > > > Give up sending metadata request since no node is available
> > > > > >
> > > > >
> > > > >  Yes, this log came for couple of times.
> > > > >
> > > > >
> > > > > >
> > > > > > Also, if you let it continue running, does it recover after the
> > > > > > metadata.max.age.ms timeout?
> > > > > >
> > > > >
> > > > >  It does not reconnect.  It is continuously trying to connect with
> > dead
> > > > > node.
> > > > >
> > > > >
> > > > > -Manikumar
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> > > >
> > >
> >
> >
> >
> > --
> > -Regards,
> > Mayuresh R. Gharat
> > (862) 250-7125
> >
>


Re: circuit breaker for producer

2015-05-05 Thread Jason Rosenberg
Guozhang,

Do you have the ticket number for possibly adding in local log file
failover? Is it actively being worked on?

Thanks,

Jason

On Tue, May 5, 2015 at 6:11 PM, Guozhang Wang  wrote:

> Does this "log file" acts as a temporary disk buffer when broker slows
> down, whose data will be re-send to broker later, or do you plan to use it
> as a separate persistent storage as Kafka brokers?
>
> For the former use case, I think there is an open ticket for integrating
> this kind of functionality into producer; for the latter use case, you may
> want to do this traffic control out of Kafka producer, i.e. upon detecting
> producer buffer full, do not call send() on it for a while but write to a
> different file, etc.
>
> Guozhang
>
> On Tue, May 5, 2015 at 11:28 AM, mete  wrote:
>
> > Sure, i kind of count on that actually, i guess with this setting the
> > sender blocks on allocate method and this bufferpool-wait-ratio
> increases.
> >
> > I want to fully compartmentalize the kafka producer from the rest of the
> > system. Ex: writing to a log file instead of trying to send to kafka when
> > some metric in the producer indicates that there is a performance
> > degradation or some other problem.
> > I was wondering what would be the ideal way of deciding that?
> >
> >
> >
> > On Tue, May 5, 2015 at 6:32 PM, Jay Kreps  wrote:
> >
> > > Does block.on.buffer.full=false do what you want?
> > >
> > > -Jay
> > >
> > > On Tue, May 5, 2015 at 1:59 AM, mete  wrote:
> > >
> > > > Hello Folks,
> > > >
> > > > I was looking through the kafka.producer metrics on the JMX
> interface,
> > to
> > > > find a good indicator when to "trip" the circuit. So far it seems
> like
> > > the
> > > > "bufferpool-wait-ratio" metric is a useful decision mechanism when to
> > cut
> > > > off the production to kafka.
> > > >
> > > > As far as i experienced, when kafka server slow for some reason,
> > requests
> > > > start piling up on the producer queue and if you are not willing to
> > drop
> > > > any messages on the producer, send method starts blocking because of
> > the
> > > > slow responsiveness.
> > > >
> > > > So this buffer pool wait ratio starts going up from 0.x up to 1.0.
> And
> > i
> > > am
> > > > thinking about tripping the circuit breaker using this metric, ex: if
> > > > wait-ratio > 0.90 etc...
> > > >
> > > > What do you think? Do you think there would be a better indicator to
> > > check
> > > > the health overall?
> > > >
> > > > Best
> > > > Mete
> > > >
> > >
> >
>
>
>
> --
> -- Guozhang
>


Re: Round Robin Partition Assignment

2015-05-05 Thread Jason Rosenberg
I asked about this same issue in a previous thread.  Thanks for reminding
me, I've added this Jira:  https://issues.apache.org/jira/browse/KAFKA-2172

I think this is a great new feature, but unfortunately the "all
consumers must be the same" requirement is just a bit too restrictive.

Jason

On Tue, May 5, 2015 at 5:20 PM, Bryan Baugher  wrote:

> Hi everyone,
>
> We recently switched to round robin partition assignment after we noticed
> that range partition assignment (default) will only make use of the first X
> consumers were X is the number of partitions for a topic our consumers are
> interested in. We then noticed the caveat in round robin,
>
> "Round-robin assignment is permitted only if: (a) Every topic has the same
> number of streams within a consumer instance (b) The set of subscribed
> topics is identical for every consumer instance within the group."
>
> We tried this out and found if all consumers don't agree on topic
> subscription they basically stop consuming until things get figured out.
> This is bad for us since our consumers change their topic subscription
> based on config they load from a REST service periodically.
>
> Is there something we can do on our side to avoid this? The best thing I
> can think of is to try and use something like zookeeper to coordinate
> changing the topic filters.
>
> Would it be possible to see the round robin assignment updated to not
> require identical topic subscriptions?
>
> Bryan
>


Re: 'roundrobin' partition assignment strategy restrictions

2015-05-05 Thread Jason Rosenberg
I filed this jira, fwiw:  https://issues.apache.org/jira/browse/KAFKA-2172

Jason

On Mon, Mar 23, 2015 at 2:44 PM, Jiangjie Qin 
wrote:

> Hi Jason,
>
> Yes, I agree the restriction makes the usage of round-robin less flexible.
> I think the focus of round-robin strategy is workload balance. If
> different consumers are consuming from different topics, it is unbalanced
> by nature. In that case, is it possible that you use different consumer
> group for different sets of topics?
> The rolling update is a good point. If you do rolling bounce in a small
> window, the rebalance retry should handle it. But if you want to canary a
> new topic setting on one consumer for some time, it won’t work.
> Could you maybe share the use case with more detail? So we can see if
> there is any workaround.
>
> Jiangjie (Becket) Qin
>
> On 3/22/15, 10:04 AM, "Jason Rosenberg"  wrote:
>
> >Jiangjie,
> >
> >Yeah, I welcome the round-robin strategy, as the 'range' strategy ('til
> >now
> >the only one available), is not always good at balancing partitions, as
> >you
> >observed above.
> >
> >The main thing I'm bringing up in this thread though is the question of
> >why
> >there needs to be a restriction to having a homogenous set of consumers in
> >the group being balanced.  This is not a requirement for the range
> >algorithm, but is for the roundrobin algorithm.  So, I'm just wanting to
> >understand why there's that limitation.  (And sadly, in our case, we do
> >have heterogenous consumers using the same groupid, so we can't easily
> >turn
> >on roundrobin at the moment, without some effort :) ).
> >
> >I can see that it does simplify the implementation to have that
> >limitation,
> >but I'm just wondering if there's anything fundamental that would prevent
> >an implementation that works over heterogenous consumers.  E.g. "Lay out
> >all partitions, and layout all consumer threads, and proceed round robin
> >assigning each partition to the next consumer thread. *If the next
> >consumer
> >thread doesn't have a selection for the current partition, then move on to
> >the next consumer-thread"*
> >
> >The current implementation is also problematic if you are doing a rolling
> >restart of a consumer cluster.  Let's say you are updating the topic
> >selection as part of an update to the cluster.  Once the first node is
> >updated, the entire cluster will no longer be homogenous until the last
> >node is updated, which means you will have a temporary outage consuming
> >data until all nodes have been updated.  So, it makes it difficult to do
> >rolling restarts, or canary updates on a subset of nodes, etc.
> >
> >Jason
> >
> >Jason
> >
> >On Fri, Mar 20, 2015 at 10:15 PM, Jiangjie Qin  >
> >wrote:
> >
> >> Hi Jason,
> >>
> >> The motivation behind round robin is to better balance the consumers'
> >> load. Imagine you have two topics each with two partitions. These topics
> >> are consumed by two consumers each with two consumer threads.
> >>
> >> The range assignment gives:
> >> T1-P1 -> C1-Thr1
> >> T1-P2 -> C1-Thr2
> >> T2-P1 -> C1-Thr1
> >> T2-P2 -> C1-Thr2
> >> Consumer 2 will not be consuming from any partitions.
> >>
> >> The round robin algorithm gives:
> >> T1-P1 -> C1-Thr1
> >> T1-P2 -> C1-Thr2
> >> T2-P1 -> C2-Thr1
> >> T2-p2 -> C2-Thr2
> >> It is much better than range assignment.
> >>
> >> That's the reason why we introduced round robin strategy even though it
> >> has restrictions.
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >> On 3/20/15, 12:20 PM, "Jason Rosenberg"  wrote:
> >>
> >> >Jiangle,
> >> >
> >> >The error messages I got (and the config doc) do clearly state that the
> >> >number of threads per consumer must match also
> >> >
> >> >I'm not convinced that an easy to understand algorithm would work fine
> >> >with
> >> >a heterogeneous set of selected topics between consumers.
> >> >
> >> >Jason
> >> >
> >> >On Thu, Mar 19, 2015 at 8:07 PM, Mayuresh Gharat
> >> > >> >> wrote:
> >> >
> >> >> Hi Becket,
> >> >>
> >> >> Can you list down an example for this. It would be easier to
> >>understand
> >> >>:)
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Mayuresh
> >> >>
> >> >> On Thu, Mar 19, 2015 at 4:46 PM, Jiangjie Qin
> >> >>
> >> >> wrote:
> >> >>
> >> >> > Hi Jason,
> >> >> >
> >> >> > The round-robin strategy first takes the partitions of all the
> >>topics
> >> >>a
> >> >> > consumer is consuming from, then distributed them across all the
> >> >> consumers.
> >> >> > If different consumers are consuming from different topics, the
> >> >>assigning
> >> >> > algorithm will generate different answers on different consumers.
> >> >> > It is OK for consumers to have different thread count, but the
> >> >>consumers
> >> >> > have to consume from the same set of topics.
> >> >> >
> >> >> >
> >> >> > For range strategy, the balance is for each individual topic
> >>instead
> >> >>of
> >> >> > cross topics. So the balance is only done for the consumers
> >>consuming
> >> >> from
> >> >> > the same topic.
> >> >> >
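
To make the idea concrete, a rough sketch of the modified round-robin
assignment described in the quoted discussion above (lay out all partitions
and all consumer threads, and skip threads that are not subscribed to a
partition's topic). This is illustrative pseudologic only, not the assignor
that ships with Kafka:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HeterogeneousRoundRobin {

    /**
     * partitions    e.g. "topicA-0", "topicA-1", "topicB-0" (in a fixed global order)
     * threads       e.g. "c1-thr0", "c1-thr1", "c2-thr0"    (in a fixed global order)
     * subscriptions thread id -> topics that thread consumes
     */
    public static Map<String, String> assign(List<String> partitions,
                                             List<String> threads,
                                             Map<String, Set<String>> subscriptions) {
        Map<String, String> assignment = new HashMap<>();
        int cursor = 0;
        for (String partition : partitions) {
            String topic = partition.substring(0, partition.lastIndexOf('-'));
            for (int tried = 0; tried < threads.size(); tried++) {
                String thread = threads.get(cursor % threads.size());
                cursor++;
                if (subscriptions.getOrDefault(thread, Collections.<String>emptySet())
                                 .contains(topic)) {
                    assignment.put(partition, thread);
                    break;  // the next partition continues from the following thread
                }
            }
            // a partition whose topic no thread subscribes to is simply left unassigned
        }
        return assignment;
    }
}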

Re: circuit breaker for producer

2015-05-05 Thread Guozhang Wang
Does this "log file" acts as a temporary disk buffer when broker slows
down, whose data will be re-send to broker later, or do you plan to use it
as a separate persistent storage as Kafka brokers?

For the former use case, I think there is an open ticket for integrating
this kind of functionality into producer; for the latter use case, you may
want to do this traffic control out of Kafka producer, i.e. upon detecting
producer buffer full, do not call send() on it for a while but write to a
different file, etc.
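
A rough sketch of that second option (illustrative only; it assumes
block.on.buffer.full=false so that send() throws instead of blocking, and the
cool-off period and file handling are arbitrary):

import java.io.FileWriter;
import java.io.IOException;

import org.apache.kafka.clients.producer.BufferExhaustedException;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpillingSender {
    private final Producer<String, String> producer;
    private final FileWriter spillFile;      // records written here get re-sent later
    private volatile long blockedUntilMs = 0;

    public SpillingSender(Producer<String, String> producer, FileWriter spillFile) {
        this.producer = producer;
        this.spillFile = spillFile;
    }

    public void send(ProducerRecord<String, String> record) throws IOException {
        if (System.currentTimeMillis() < blockedUntilMs) {
            spill(record);
            return;
        }
        try {
            producer.send(record);
        } catch (BufferExhaustedException e) {
            // thrown only when block.on.buffer.full=false; back off for an
            // arbitrary 30 seconds before trying the producer again
            blockedUntilMs = System.currentTimeMillis() + 30_000;
            spill(record);
        }
    }

    private void spill(ProducerRecord<String, String> record) throws IOException {
        spillFile.write(record.value() + "\n");
        spillFile.flush();
    }
}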

Guozhang

On Tue, May 5, 2015 at 11:28 AM, mete  wrote:

> Sure, i kind of count on that actually, i guess with this setting the
> sender blocks on allocate method and this bufferpool-wait-ratio increases.
>
> I want to fully compartmentalize the kafka producer from the rest of the
> system. Ex: writing to a log file instead of trying to send to kafka when
> some metric in the producer indicates that there is a performance
> degradation or some other problem.
> I was wondering what would be the ideal way of deciding that?
>
>
>
> On Tue, May 5, 2015 at 6:32 PM, Jay Kreps  wrote:
>
> > Does block.on.buffer.full=false do what you want?
> >
> > -Jay
> >
> > On Tue, May 5, 2015 at 1:59 AM, mete  wrote:
> >
> > > Hello Folks,
> > >
> > > I was looking through the kafka.producer metrics on the JMX interface,
> to
> > > find a good indicator when to "trip" the circuit. So far it seems like
> > the
> > > "bufferpool-wait-ratio" metric is a useful decision mechanism when to
> cut
> > > off the production to kafka.
> > >
> > > As far as i experienced, when kafka server slow for some reason,
> requests
> > > start piling up on the producer queue and if you are not willing to
> drop
> > > any messages on the producer, send method starts blocking because of
> the
> > > slow responsiveness.
> > >
> > > So this buffer pool wait ratio starts going up from 0.x up to 1.0. And
> i
> > am
> > > thinking about tripping the circuit breaker using this metric, ex: if
> > > wait-ratio > 0.90 etc...
> > >
> > > What do you think? Do you think there would be a better indicator to
> > check
> > > the health overall?
> > >
> > > Best
> > > Mete
> > >
> >
>



-- 
-- Guozhang


Round Robin Partition Assignment

2015-05-05 Thread Bryan Baugher
Hi everyone,

We recently switched to round robin partition assignment after we noticed
that range partition assignment (the default) will only make use of the first X
consumers, where X is the number of partitions for a topic our consumers are
interested in. We then noticed the caveat in round robin:

"Round-robin assignment is permitted only if: (a) Every topic has the same
number of streams within a consumer instance (b) The set of subscribed
topics is identical for every consumer instance within the group."
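
For reference, the switch itself is just consumer configuration; a minimal
sketch for the 0.8.x high-level consumer (ZooKeeper address and group id are
illustrative):

import java.util.Properties;

import kafka.consumer.ConsumerConfig;

public class RoundRobinConsumerConfig {

    public static ConsumerConfig build() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");                // assumption
        props.put("group.id", "my-group");                         // assumption
        props.put("partition.assignment.strategy", "roundrobin");  // default is "range"
        return new ConsumerConfig(props);
    }
}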

We tried this out and found that if all consumers don't agree on the topic
subscription, they basically stop consuming until things get sorted out.
This is bad for us since our consumers change their topic subscription
based on config they load from a REST service periodically.

Is there something we can do on our side to avoid this? The best thing I
can think of is to try and use something like zookeeper to coordinate
changing the topic filters.

Would it be possible to see the round robin assignment updated to not
require identical topic subscriptions?

Bryan


Re: circuit breaker for producer

2015-05-05 Thread mete
Sure, I kind of count on that actually; I guess with this setting the
sender blocks on the allocate method and the bufferpool-wait-ratio increases.

I want to fully compartmentalize the Kafka producer from the rest of the
system, e.g. writing to a log file instead of trying to send to Kafka when
some metric in the producer indicates that there is a performance
degradation or some other problem.
I was wondering what would be the ideal way of deciding that?



On Tue, May 5, 2015 at 6:32 PM, Jay Kreps  wrote:

> Does block.on.buffer.full=false do what you want?
>
> -Jay
>
> On Tue, May 5, 2015 at 1:59 AM, mete  wrote:
>
> > Hello Folks,
> >
> > I was looking through the kafka.producer metrics on the JMX interface, to
> > find a good indicator when to "trip" the circuit. So far it seems like
> the
> > "bufferpool-wait-ratio" metric is a useful decision mechanism when to cut
> > off the production to kafka.
> >
> > As far as i experienced, when kafka server slow for some reason, requests
> > start piling up on the producer queue and if you are not willing to drop
> > any messages on the producer, send method starts blocking because of the
> > slow responsiveness.
> >
> > So this buffer pool wait ratio starts going up from 0.x up to 1.0. And i
> am
> > thinking about tripping the circuit breaker using this metric, ex: if
> > wait-ratio > 0.90 etc...
> >
> > What do you think? Do you think there would be a better indicator to
> check
> > the health overall?
> >
> > Best
> > Mete
> >
>


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Rahul Jain
Mayuresh,
I was testing this in a development environment and manually brought down a
node to simulate this. So the dead node never came back up.

My colleague and I were able to consistently see this behaviour several
times during the testing.
On 5 May 2015 20:32, "Mayuresh Gharat"  wrote:

> I agree that to find the least Loaded node the producer should fall back to
> the bootstrap nodes if its not able to connect to any nodes in the current
> metadata. That should resolve this.
>
> Rahul, I suppose the problem went off because the dead node in your case
> might have came back up and allowed for a metadata update. Can you confirm
> this?
>
> Thanks,
>
> Mayuresh
>
> On Tue, May 5, 2015 at 5:10 AM, Rahul Jain  wrote:
>
> > We observed the exact same error. Not very clear about the root cause
> > although it appears to be related to leastLoadedNode implementation.
> > Interestingly, the problem went away by increasing the value of
> > reconnect.backoff.ms to 1000ms.
> > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava"  wrote:
> >
> > > Ok, all of that makes sense. The only way to possibly recover from that
> > > state is either for K2 to come back up allowing the metadata refresh to
> > > eventually succeed or to eventually try some other node in the cluster.
> > > Reusing the bootstrap nodes is one possibility. Another would be for
> the
> > > client to get more metadata than is required for the topics it needs in
> > > order to ensure it has more nodes to use as options when looking for a
> > node
> > > to fetch metadata from. I added your description to KAFKA-1843,
> although
> > it
> > > might also make sense as a separate bug since fixing it could be
> > considered
> > > incremental progress towards resolving 1843.
> > >
> > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy  >
> > > wrote:
> > >
> > > > Hi Ewen,
> > > >
> > > >  Thanks for the response.  I agree with you, In some case we should
> use
> > > > bootstrap servers.
> > > >
> > > >
> > > > >
> > > > > If you have logs at debug level, are you seeing this message in
> > between
> > > > the
> > > > > connection attempts:
> > > > >
> > > > > Give up sending metadata request since no node is available
> > > > >
> > > >
> > > >  Yes, this log came for couple of times.
> > > >
> > > >
> > > > >
> > > > > Also, if you let it continue running, does it recover after the
> > > > > metadata.max.age.ms timeout?
> > > > >
> > > >
> > > >  It does not reconnect.  It is continuously trying to connect with
> dead
> > > > node.
> > > >
> > > >
> > > > -Manikumar
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>


Re: circuit breaker for producer

2015-05-05 Thread Jay Kreps
Does block.on.buffer.full=false do what you want?
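
For reference, a minimal sketch of that setting on the new (0.8.2) producer
(broker list and serializers are illustrative); with it, send() throws
BufferExhaustedException instead of blocking when the buffer fills:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class NonBlockingProducer {

    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");  // assumption
        props.put("block.on.buffer.full", "false");  // throw instead of blocking on a full buffer
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<>(props);
    }
}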

-Jay

On Tue, May 5, 2015 at 1:59 AM, mete  wrote:

> Hello Folks,
>
> I was looking through the kafka.producer metrics on the JMX interface, to
> find a good indicator when to "trip" the circuit. So far it seems like the
> "bufferpool-wait-ratio" metric is a useful decision mechanism when to cut
> off the production to kafka.
>
> As far as i experienced, when kafka server slow for some reason, requests
> start piling up on the producer queue and if you are not willing to drop
> any messages on the producer, send method starts blocking because of the
> slow responsiveness.
>
> So this buffer pool wait ratio starts going up from 0.x up to 1.0. And i am
> thinking about tripping the circuit breaker using this metric, ex: if
> wait-ratio > 0.90 etc...
>
> What do you think? Do you think there would be a better indicator to check
> the health overall?
>
> Best
> Mete
>


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Mayuresh Gharat
I agree that to find the least loaded node the producer should fall back to
the bootstrap nodes if it's not able to connect to any node in the current
metadata. That should resolve this.
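
Roughly the shape of the fallback being proposed (illustrative pseudologic
only, not the actual NetworkClient code; canConnect stands in for whatever
connectivity/blackout check the client uses):

import java.util.List;

public class MetadataNodeFallback {

    public static String pickNodeForMetadata(List<String> metadataNodes,   // assumed least-loaded first
                                             List<String> bootstrapNodes) {
        for (String node : metadataNodes) {
            if (canConnect(node)) return node;
        }
        // every node in the current metadata is unreachable: go back to the bootstrap list
        for (String node : bootstrapNodes) {
            if (canConnect(node)) return node;
        }
        return null;  // nothing reachable; the caller retries after reconnect.backoff.ms
    }

    private static boolean canConnect(String node) {
        return true;  // placeholder for the client's real connectivity check
    }
}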

Rahul, I suppose the problem went away because the dead node in your case
might have come back up and allowed for a metadata update. Can you confirm
this?

Thanks,

Mayuresh

On Tue, May 5, 2015 at 5:10 AM, Rahul Jain  wrote:

> We observed the exact same error. Not very clear about the root cause
> although it appears to be related to leastLoadedNode implementation.
> Interestingly, the problem went away by increasing the value of
> reconnect.backoff.ms to 1000ms.
> On 29 Apr 2015 00:32, "Ewen Cheslack-Postava"  wrote:
>
> > Ok, all of that makes sense. The only way to possibly recover from that
> > state is either for K2 to come back up allowing the metadata refresh to
> > eventually succeed or to eventually try some other node in the cluster.
> > Reusing the bootstrap nodes is one possibility. Another would be for the
> > client to get more metadata than is required for the topics it needs in
> > order to ensure it has more nodes to use as options when looking for a
> node
> > to fetch metadata from. I added your description to KAFKA-1843, although
> it
> > might also make sense as a separate bug since fixing it could be
> considered
> > incremental progress towards resolving 1843.
> >
> > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy 
> > wrote:
> >
> > > Hi Ewen,
> > >
> > >  Thanks for the response.  I agree with you, In some case we should use
> > > bootstrap servers.
> > >
> > >
> > > >
> > > > If you have logs at debug level, are you seeing this message in
> between
> > > the
> > > > connection attempts:
> > > >
> > > > Give up sending metadata request since no node is available
> > > >
> > >
> > >  Yes, this log came for couple of times.
> > >
> > >
> > > >
> > > > Also, if you let it continue running, does it recover after the
> > > > metadata.max.age.ms timeout?
> > > >
> > >
> > >  It does not reconnect.  It is continuously trying to connect with dead
> > > node.
> > >
> > >
> > > -Manikumar
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Rahul Jain
We observed the exact same error. Not very clear about the root cause
although it appears to be related to the leastLoadedNode implementation.
Interestingly, the problem went away by increasing the value of
reconnect.backoff.ms to 1000ms.
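
For reference, that workaround as producer configuration (everything other
than reconnect.backoff.ms is an illustrative assumption):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class BackoffTunedProducer {

    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");  // assumption
        props.put("reconnect.backoff.ms", "1000");  // raised well above the default
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<>(props);
    }
}
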
On 29 Apr 2015 00:32, "Ewen Cheslack-Postava"  wrote:

> Ok, all of that makes sense. The only way to possibly recover from that
> state is either for K2 to come back up allowing the metadata refresh to
> eventually succeed or to eventually try some other node in the cluster.
> Reusing the bootstrap nodes is one possibility. Another would be for the
> client to get more metadata than is required for the topics it needs in
> order to ensure it has more nodes to use as options when looking for a node
> to fetch metadata from. I added your description to KAFKA-1843, although it
> might also make sense as a separate bug since fixing it could be considered
> incremental progress towards resolving 1843.
>
> On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy 
> wrote:
>
> > Hi Ewen,
> >
> >  Thanks for the response.  I agree with you, In some case we should use
> > bootstrap servers.
> >
> >
> > >
> > > If you have logs at debug level, are you seeing this message in between
> > the
> > > connection attempts:
> > >
> > > Give up sending metadata request since no node is available
> > >
> >
> >  Yes, this log came for couple of times.
> >
> >
> > >
> > > Also, if you let it continue running, does it recover after the
> > > metadata.max.age.ms timeout?
> > >
> >
> >  It does not reconnect.  It is continuously trying to connect with dead
> > node.
> >
> >
> > -Manikumar
> >
>
>
>
> --
> Thanks,
> Ewen
>


Re: Kafka Cluster Issue

2015-05-05 Thread Kamal C
This is resolved; it turned out I had missed the host entry configuration in my infrastructure.

On Mon, May 4, 2015 at 10:35 AM, Kamal C  wrote:

> We are running ZooKeeper in ensemble (Cluster of 3 / 5).  With further
> investigation, I found that the Connect Exception throws for all "inflight"
> producers.
>
> Say we are pushing 50 msg/s to a topic. Stop the leader Kafka for that
> topic. Producers are unable to push messages to the Kafka Cluster and
> starts throwing the Connect Exception.
>
>
>
> *Exception* WARN 2015-05-04 10:27:41,052 [kafka-producer-network-thread |
> NOTIFICATION_CATEGORY_ALARM]: Selector:poll() :  : Error in I/O with
> tcstest2.nmsworks.co.in/192.168.11.140
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> ~[?:1.7.0_40]
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
> ~[?:1.7.0_40]
> at
> org.apache.kafka.common.network.Selector.poll(Selector.java:238)
> [kafka-clients-0.8.2.0.jar:?]
> at
> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
> [kafka-clients-0.8.2.0.jar:?]
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
> [kafka-clients-0.8.2.0.jar:?]
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
> [kafka-clients-0.8.2.0.jar:?]
>
>
> *Description of a Kafka topic*
>
> [root@tcstest2 bin]# sh kafka-topics.sh --zookeeper localhost:2181
> --describe | grep NOTIFICATION_CATEGORY_ALARM
> Topic:NOTIFICATION_CATEGORY_ALARM   PartitionCount:1   ReplicationFactor:3   Configs:
>     Topic: NOTIFICATION_CATEGORY_ALARM   Partition: 0   Leader: 1   Replicas: 2,0,1   Isr: 1,0
>
>
> Leader is switching but producers are unable to find the new leader. How
> to resolve it?
>
>
> On Sun, May 3, 2015 at 11:13 PM, Jiangjie Qin 
> wrote:
>
>> What do you mean by cluster mode with 3 Zookeeper and 3 Kafka brokers? Do
>> you mean 1 Zookeeper and 3 brokers?
>>
>> On 5/2/15, 11:01 PM, "Kamal C"  wrote:
>>
>> >Any comments on this issue?
>> >
>> >On Sat, May 2, 2015 at 9:16 AM, Kamal C  wrote:
>> >
>> >> Hi,
>> >> We are using Kafka_2.10-0.8.2.0, new Kafka producer and Kafka Simple
>> >> Consumer. In Standalone mode, 1 ZooKeeper and 1 Kafka we haven't faced
>> >>any
>> >> problems.
>> >>
>> >> In cluster mode, 3 ZooKeeper and 3 Kafka Brokers. We did some sanity
>> >> testing by bringing a Kafka node down then a random Producer starts to
>> >> throw Connect Exception continuously and tries to connect with the dead
>> >> node (not all producers).
>> >>
>> >> Is there any configuration available to avoid this exception ?
>> >>
>> >> Regards,
>> >> Kamal C
>> >>
>> >>
>> >>
>> >>
>> >>
>>
>>
>


circuit breaker for producer

2015-05-05 Thread mete
Hello Folks,

I was looking through the kafka.producer metrics on the JMX interface to
find a good indicator of when to "trip" the circuit. So far it seems like the
"bufferpool-wait-ratio" metric is a useful decision mechanism for when to cut
off production to Kafka.

As far as I have experienced, when the Kafka server is slow for some reason,
requests start piling up in the producer queue, and if you are not willing to
drop any messages on the producer side, the send method starts blocking
because of the slow responsiveness.

So this buffer pool wait ratio climbs from 0.x up to 1.0, and I am
thinking about tripping the circuit breaker using this metric, e.g. if
wait-ratio > 0.90.
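
A minimal sketch of such a breaker, assuming the producer's metrics are
registered over JMX under kafka.producer:type=producer-metrics,client-id=<id>
(the exact ObjectName should be verified against your client id and version);
the 0.90 threshold follows the example above:

import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class BufferPoolCircuitBreaker {
    private static final double THRESHOLD = 0.90;

    private final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    private final ObjectName producerMetrics;
    private volatile boolean open = false;

    public BufferPoolCircuitBreaker(String clientId) throws Exception {
        // assumed MBean naming of the 0.8.2 new producer's JMX reporter
        this.producerMetrics =
            new ObjectName("kafka.producer:type=producer-metrics,client-id=" + clientId);
    }

    public boolean isOpen() {
        try {
            double waitRatio =
                (Double) server.getAttribute(producerMetrics, "bufferpool-wait-ratio");
            open = waitRatio > THRESHOLD;  // tripped: divert writes to a local file instead
        } catch (Exception e) {
            // metric not registered yet or momentarily unavailable; keep the previous state
        }
        return open;
    }
}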

What do you think? Do you think there would be a better indicator to check
the health overall?

Best
Mete