Re: Broker deregisters from ZK, but stays alive and does not rejoin the cluster

2019-03-22 Thread Ismael Juma
I'd suggest filing a single JIRA as a first step. Please test the 2.2 RC
before filing if possible. Please include enough details for someone else
to reproduce.

Thanks!

Ismael

On Fri, Mar 22, 2019, 3:14 PM Joe Ammann  wrote:

> Hi Ismael
>
> I've done a few more tests, and it seems that I'm able to "reproduce"
> various kinds of problems in Kafka 2.1.1 in our DEV. I can force these by
> faking an outage of Zookeeper. What I do for my tests is freeze (kill
> -STOP) 2 out of 3 ZK instances, let the Kafka brokers continue, then thaw
> the ZK instances (kill -CONT) and see what happens.
>
> The ZK nodes always very quickly reunite and re-establish a quorum after thawing.
>
> But the Kafka brokers (running on the same 3 Linux VMs) quite often show
> problems after this procedure (most of the time they successfully
> re-register and continue to work). I've seen 3 different kinds of problems
> (this is why I put "reproduce" in quotes, I can never predict what will
> happen)
>
> - the brokers get their ZK sessions expired (obviously) and sometimes only
> 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register
> for some reason (that's the problem I originally described)
> - the brokers all re-register and re-elect a new controller. But that new
> controller does not fully work. For example it doesn't process partition
> reassignment requests and/or does not transfer partition leadership after I
> kill a broker
> - the previous controller gets "dead-locked" (it has 3-4 of the important
> controller threads in a lock) and hence does not perform any of its
> controller duties. But it still regards itself as the valid controller and
> is accepted by the other brokers
>
> We have seen variants of these behaviours in TEST and PROD during the last
> days. Of course they're not provoked by kill -STOP, but rather by the stalled
> underlying Linux VMs (we're heavily working on getting those replaced by
> bare metal, but it may take some time).
>
> Before I start filing JIRAs:
>
> - I feel this behaviour is so totally weird that I can hardly believe these
> are Kafka bugs. They should have hit the community really hard and have
> been uncovered quickly. So I'm rather guessing I'm doing something terribly
> wrong. Any clue what that might be?
> - if I really start filing JIRAs, should it rather be one single JIRA, or
> one per error scenario?
>
> On 3/21/19 4:05 PM, Ismael Juma wrote:
> > Hi Joe,
> >
> > This is not expected behaviour, please file a JIRA.
> >
> > Ismael
> >
> > On Mon, Mar 18, 2019 at 7:29 AM Joe Ammann <j...@pyx.ch> wrote:
> >
> > Hi all
> >
> > We're running several clusters (mostly with 3 brokers) with 2.1.1
> >
> > We quite regularly see the pattern that one of the 3 brokers
> "detaches" from ZK (the broker id is not registered anymore under
> /brokers/ids). We assume that the root cause for this is that the brokers
> are running on VMs (due to company policy, no alternative) and that the VM
> gets "stalled" for several minutes due to missing resources on the VMware
> ESX host.
> >
> > This is not new behaviour with 2.1.1, we already saw it with
> 0.10.2.1 before.
>
>
> --
> CU, Joe
>


Re: Broker deregisters from ZK, but stays alive and does not rejoin the cluster

2019-03-22 Thread Joe Ammann
Hi Ismael

I've done a few more tests, and it seems that I'm able to "reproduce" various 
kinds of problems in Kafka 2.1.1 in our DEV. I can force these by faking an 
outage of Zookeeper. What I do for my tests is freeze (kill -STOP) 2 out of 3 
ZK instances, let the Kafka brokers continue, then thaw the ZK instances (kill 
-CONT) and see what happens.
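
A check against /brokers/ids shows whether all brokers have re-registered after
the thaw; a minimal sketch of such a check using the ZooKeeper Java client
(connect string and session timeout are illustrative):

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class CheckBrokerRegistrations {
        public static void main(String[] args) throws Exception {
            // Connect to the ZK ensemble used by the brokers (address is illustrative).
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> { });
            try {
                // Every live broker keeps an ephemeral znode under /brokers/ids,
                // so after thawing ZK all three broker ids should reappear here.
                List<String> ids = zk.getChildren("/brokers/ids", false);
                System.out.println("Registered broker ids: " + ids);
            } finally {
                zk.close();
            }
        }
    }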

The ZK nodes always very quickly reunite and re-establish a quorum after thawing.

But the Kafka brokers (running on the same 3 Linux VMs) quite often show 
problems after this procedure (most of the time they successfully re-register 
and continue to work). I've seen 3 different kinds of problems (this is why I 
put "reproduce" in quotes, I can never predict what will happen)

- the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 
3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some 
reason (that's the problem I originally described)
- the brokers all re-register and re-elect a new controller. But that new 
controller does not fully work. For example it doesn't process partition 
reassignment requests and/or does not transfer partition leadership after I 
kill a broker
- the previous controller gets "dead-locked" (it has 3-4 of the important 
controller threads in a lock) and hence does not perform any of its controller 
duties. But it still regards itself as the valid controller and is accepted by 
the other brokers

We have seen variants of these behaviours in TEST and PROD during the last 
days. Of course they're not provoked by kill -STOP, but rather by the stalled 
underlying Linux VMs (we're heavily working on getting those replaced by bare 
metal, but it may take some time).

Before I start filing JIRAs:

- I feel this behaviour is so totally weird that I can hardly believe these are 
Kafka bugs. They should have hit the community really hard and have been 
uncovered quickly. So I'm rather guessing I'm doing something terribly wrong. 
Any clue what that might be?
- if I really start filing JIRAs, should it rather be one single JIRA, or one per 
error scenario?

On 3/21/19 4:05 PM, Ismael Juma wrote:
> Hi Joe,
> 
> This is not expected behaviour, please file a JIRA.
> 
> Ismael
> 
> On Mon, Mar 18, 2019 at 7:29 AM Joe Ammann <j...@pyx.ch> wrote:
> 
> Hi all
> 
> We're running several clusters (mostly with 3 brokers) with 2.1.1
> 
> We quite regularly see the pattern that one of the 3 brokers "detaches" 
> from ZK (the broker id is not registered anymore under /brokers/ids). We 
> assume that the root cause for this is that the brokers are running on VMs 
> (due to company policy, no alternative) and that the VM gets "stalled" for 
> several minutes due to missing resources on the VMware ESX host.
> 
> This is not new behaviour with 2.1.1, we already saw it with 0.10.2.1 
> before.


-- 
CU, Joe


KafkaStreams backoff for non-existing topic

2019-03-22 Thread Murilo Tavares
Hi
After some research, I've come across a few discussions, and they all tell me
that Kafka Streams requires the topics to be created before starting the
application.
Nevertheless, I'd like my application to keep retrying if a topic does not
exist.
I've seen this thread:
https://groups.google.com/forum/#!topic/confluent-platform/nmfrnAKCM3c,
which is pretty old, and I'd like to know if it's still hard to catch that
Exception in my app.
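
A minimal sketch of the retry/backoff idea, assuming the application polls
topic existence with an AdminClient and backs off before starting the topology
(topic names, config values and the backoff interval are all illustrative):

    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class WaitForTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Keep retrying until all required source topics exist,
            // backing off between attempts.
            Set<String> required = Set.of("input-topic");
            try (AdminClient admin = AdminClient.create(props)) {
                while (!admin.listTopics().names().get().containsAll(required)) {
                    Thread.sleep(30_000);
                }
            }

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");
            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

This only guards startup; a topic deleted while the application is running
would still need separate handling.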

Thanks
Murilo


Re: Tracking progress for messages generated by a batch process

2019-03-22 Thread Matthias J. Sax
Sounds reasonable to me.

-Matthias

On 3/22/19 9:50 AM, Tim Gent wrote:
> Hi all,
> 
> We have a data processing system where a daily batch process generates
> some data into a Kafka topic. This then goes through several other
> components that enrich the data, these are also integrated via Kafka.
> So overall we have something like:
> 
> Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3
> 
> We would like to know when all the data generated onto topic A finally
> gets processed by streaming app 3, as we may trigger some other
> processes from this (e.g. notifying customers their data is processed
> for that day). We've come up with a possible solution, and it would be
> great to get feedback to see what we missed.
> 
> Assumptions:
> - Consumers all track their offsets using Kafka, committing once
> they've done all required processing for a message
> - We have some "batch-monitor" component which will track progress,
> described below
> - It isn't important to us to know exactly when the batch finished
> processing, sometime soon after batch finished processing is good
> enough
> 
> Broad flow:
> - Batch job reads some input data and publishes output to topic A
> - Batch job sends data to our "batch-monitor" component about the
> offsets on each partition at the time it finishes its processing
> - "batch-monitor" subscribes to the topic containing the committed
> offsets for topic A for streaming app 2 consumer
> - "batch-monitor" can therefore see when streaming app 2 has committed
> all the offsets that were in the batch
> - Once "batch-monitor" detects that streaming app 2 has finished it's
> processing for the batch it records max offsets for all partitions for
> messages in topic b -> these can be used to know when streaming app 3
> has finished processing the batch
> - "batch-monitor" subscribes to the topic containing the committed
> offsets for topic B for streaming app 3 consumer
> - "batch-monitor" can therefore see when streaming app 3 has committed
> all the offsets that were in the batch
> - Once that happens "batch-monitor" can send some notification somewhere else
> 
> Any thoughts gratefully received
> 
> Tim
> 





Tracking progress for messages generated by a batch process

2019-03-22 Thread Tim Gent
Hi all,

We have a data processing system where a daily batch process generates
some data into a Kafka topic. This then goes through several other
components that enrich the data, these are also integrated via Kafka.
So overall we have something like:

Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3

We would like to know when all the data generated onto topic A finally
gets processed by streaming app 3, as we may trigger some other
processes from this (e.g. notifying customers their data is processed
for that day). We've come up with a possible solution, and it would be
great to get feedback to see what we missed.

Assumptions:
- Consumers all track their offsets using Kafka, committing once
they've done all required processing for a message
- We have some "batch-monitor" component which will track progress,
described below
- It isn't important to us to know exactly when the batch finished
processing, sometime soon after batch finished processing is good
enough

Broad flow:
- Batch job reads some input data and publishes output to topic A
- Batch job sends data to our "batch-monitor" component about the
offsets on each partition at the time it finishes its processing
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic A for streaming app 2 consumer
- "batch-monitor" can therefore see when streaming app 2 has committed
all the offsets that were in the batch
- Once "batch-monitor" detects that streaming app 2 has finished it's
processing for the batch it records max offsets for all partitions for
messages in topic b -> these can be used to know when streaming app 3
has finished processing the batch
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic B for streaming app 3 consumer
- "batch-monitor" can therefore see when streaming app 3 has committed
all the offsets that were in the batch
- Once that happens "batch-monitor" can send some notification somewhere else

Any thoughts gratefully received

Tim


Re: Question on performance data for Kafka vs NATS

2019-03-22 Thread Adam Bellemare
One more thing to note:

You are looking at regular, base NATS. On its own, it is not a direct 1-1
comparison to Kafka because it lacks things like data retention, clustering
and replication. Instead, you would want to compare it to NATS-Streaming
(https://github.com/nats-io/nats-streaming-server). You can find a number
of more recent articles and comparisons by a simple web search.

With that being said, this is likely not the best venue for an in-depth
discussion on tradeoffs between the two (especially since I see you're
spanning two very large mailing lists).

Adam




On Fri, Mar 22, 2019 at 1:34 AM Hans Jespersen  wrote:

> That's a 4.5-year-old benchmark and it was run with a single broker node
> and only 1 producer and 1 consumer all running on a single MacBookPro.
> Definitely not the target production environment for Kafka.
>
> -hans
>
> > On Mar 21, 2019, at 11:43 AM, M. Manna  wrote:
> >
> > HI All,
> >
> > https://nats.io/about/
> >
> > this shows a general comparison of sender/receiver throughputs for NATS
> and
> > other messaging system including our favourite Kafka.
> >
> > It appears that Kafka, despite taking the 2nd place, has a very low
> > throughput. My question is, where does Kafka win over NATS? Is it the
> > unique partitioning and delivery semantics? Or is it something else?
> >
> > From what I can see, NATS has traditional pub/sub and queuing. But it
> > doesn't look like there is any proper retention system built for this.
> >
> > Has anyone come across this already?
> >
> > Thanks,
>