RE: sliding ktable?

2016-11-11 Thread John Hayles
Thanks again for the replies, and for taking the time to suggest alternate solutions.  I 
know your time is valuable.



The reason for the List in the aggregate, which I omitted while trying to keep the 
example simple, is that the order of records added to the topic is not guaranteed 
to be chronological. Keeping the list lets me traverse it and apply more complex 
business rules to determine the output values.
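To make that concrete, here is a hedged, stand-alone sketch (class and method names are my own invention, not from the actual application) of the kind of rule such a list enables: treat a transaction as a duplicate of the earliest transaction in the same lane whose timestamp falls within a small window, regardless of arrival order:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class DuplicateRuleSketch {
    record Txn(String txnId, Instant txnDate) {}

    // A txn duplicates the earliest other txn in the lane list whose
    // timestamp is within `window` of it, even if records arrived out of order.
    static Optional<String> findDuplicateOf(Txn candidate, List<Txn> laneTxns, Duration window) {
        return laneTxns.stream()
                .filter(t -> !t.txnId().equals(candidate.txnId()))
                .filter(t -> Duration.between(t.txnDate(), candidate.txnDate())
                        .abs().compareTo(window) <= 0)
                .min(Comparator.comparing(Txn::txnDate))
                .map(Txn::txnId);
    }

    public static void main(String[] args) {
        List<Txn> laneC = List.of(
                new Txn("03", Instant.parse("2016-11-07T04:00:00Z")),
                new Txn("11", Instant.parse("2016-11-08T15:30:00Z")));
        // '09' arrived last but matches '03', which is one second earlier.
        Txn late = new Txn("09", Instant.parse("2016-11-07T04:00:01Z"));
        System.out.println(findDuplicateOf(late, laneC, Duration.ofSeconds(5)).orElse(""));
    }
}
```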



Topic payload {txnId,lane,txnDate}.  Notice lane 'c' appears 3 times.



. . .

{'03','c','11/07/2016  04:00:00'}// first entry for lane 'c'; no duplicate

. . .

{'11','c','11/08/2016  15:30:00'} // second entry for lane 'c'; no duplicate

. . .

{'09','c','11/07/2016  04:00:01'} // third entry for lane 'c' - added to topic out of chronological order; a duplicate of txnId 03 even though the time differs slightly

. . .



Resulting materialized view.



'c':{('03','11/07/2016 04:00:00') ,

  ('11','11/08/2016 15:30:00') ,

  ('09','11/07/2016 04:00:01')}



When the kstream is joined to the lookup ktable defined above, txnId '09' should be 
identified as a duplicate of '03'.  So the output topic would look like...



Topic payload {txnId,lane,txnDate,duplicateTxnId}



. . .

{'03','c','11/07/2016  04:00:00',''}

. . .

{'11','c','11/08/2016  15:30:00',''}

. . .

{'09','c','11/07/2016  04:00:01','03'} // note the duplicate value '03' is not from the most recent entry

. . .



Note both the kstream and the ktable source are from the same topic.   When executed 
(see code below), the joined ktable is 'behind' the kstream, so the lookup will 
sometimes fail.  I think the ktable is just slower because it is doing more 
processing than the kstream.



I need to synchronize things such that a record is added to the ktable before the 
join operation occurs for the same record in the kstream.
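If it helps, one commonly suggested way around this kind of race is to do the write and the lookup in a single step, e.g. with a Transformer backed by one state store, instead of joining a stream against a table derived from the same topic: the record is inserted into the store before the store is consulted, so the lookup can never be behind. Below is a deliberately Kafka-free sketch of that read-your-own-writes idea, with a plain HashMap standing in for the state store and an invented duplicate rule (same date down to the minute counts as a duplicate):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupTransformSketch {
    record Txn(String txnId, String lane, String txnDate) {}
    record Output(String txnId, String lane, String txnDate, String duplicateTxnId) {}

    // Stand-in for a Kafka Streams state store keyed by lane.
    private final Map<String, List<Txn>> store = new HashMap<>();

    // Update the store FIRST, then run the duplicate lookup against it,
    // so the lookup always sees the lane history including this record.
    Output transform(Txn txn) {
        List<Txn> lane = store.computeIfAbsent(txn.lane(), k -> new ArrayList<>());
        lane.add(txn);
        String dup = lane.stream()
                .filter(t -> !t.txnId().equals(txn.txnId()))
                .filter(t -> sameWindow(t.txnDate(), txn.txnDate()))
                .map(Txn::txnId)
                .findFirst().orElse("");
        return new Output(txn.txnId(), txn.lane(), txn.txnDate(), dup);
    }

    // Hypothetical rule: duplicates share the same "yyyy-MM-dd HH:mm" prefix.
    private static boolean sameWindow(String a, String b) {
        return a.substring(0, 16).equals(b.substring(0, 16));
    }

    public static void main(String[] args) {
        DedupTransformSketch t = new DedupTransformSketch();
        t.transform(new Txn("03", "c", "2016-11-07 04:00:00"));
        t.transform(new Txn("11", "c", "2016-11-08 15:30:00"));
        System.out.println(t.transform(new Txn("09", "c", "2016-11-07 04:00:01")).duplicateTxnId());
    }
}
```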



(1) Can you give a hint if there are any options to resolve this?



Also, from below comments...



> About infinitely growing KTable: It seems you are extending each lane

> with a list of all txnId -- so your view needs infinite memory as you

> expend your values... A quick fix might be, to delete older txnID for

> this list, each time you update the list (as you mentioned you only

> need data for the last two weeks -- you might need to add a timestamp

> for each txnID in the list to do the pruning each time you append or

> lookup the list).
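The pruning idea in the quote can be sketched as follows; this is a stand-alone illustration with invented names, and the real code would do this inside the aggregator, ideally using record timestamps rather than wall-clock time:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class PruningSketch {
    record Entry(String txnId, Instant txnDate) {}

    // Drop entries older than `retention` relative to `now`, then append
    // the incoming entry, keeping the per-lane list bounded.
    static List<Entry> addWithPruning(List<Entry> list, Entry incoming,
                                      Instant now, Duration retention) {
        Instant cutoff = now.minus(retention);
        list.removeIf(e -> e.txnDate().isBefore(cutoff));
        list.add(incoming);
        return list;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-11-11T00:00:00Z");
        List<Entry> list = new ArrayList<>();
        list.add(new Entry("old", now.minus(Duration.ofDays(20))));
        list.add(new Entry("recent", now.minus(Duration.ofDays(3))));
        addWithPruning(list, new Entry("new", now), now, Duration.ofDays(14));
        System.out.println(list.size()); // the 20-day-old entry is gone
    }
}
```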



(2) Once I determine a key can be deleted, can I tombstone the record 
programmatically in the same process?  I am not clear on how.  My input topic 
is created by a Kafka Connect configuration (no code).
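For reference, a hedged sketch of how a tombstone can be sent programmatically (the topic name, serializers, and broker address below are assumptions): in Kafka, a record whose value is null is the delete marker for its key on a compacted topic, and it can be produced from any process, regardless of how the topic was created. This requires kafka-clients on the classpath and a running broker, so it is shown as a sketch only:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TombstoneSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: log compaction will eventually
            // remove all records for this key on a compacted topic.
            producer.send(new ProducerRecord<>("LANE_TRANSACTIONS", "keyToDelete", null));
        }
    }
}
```

Note that compaction only applies if the topic has cleanup.policy=compact; on a delete-retention topic the null-valued record is just another record.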



If interested, here is the main part of the code so far…



KStream txnStream = builder.stream("LANE_TRANSACTIONS");

// Create new ktable from original.
KTable countTableStream = txnStream
    .map((dummy, value) -> new KeyValue<>(value, value))
    .countByKey(valueAvroSerde, "CountTable");

// Aggregate ktable.
KTable lookup = countTableStream
    .groupBy((key, value) -> new KeyValue(key.get("LANE_ID").toString(), key),
        stringSerdeKey,
        valueAvroSerde)
    .aggregate(
        () -> new ArrayList<>(),
        (key, record, list) -> {        // adder
            list.add(record);
            return list;
        },
        (key, record, list) -> {        // subtractor
            // list.remove(record);     // this is always executed, not sure why
            return list;
        },
        new ArrayListSerde<>(keyAvroSerde, valueAvroSerde),
        "TxnMapByTagPlaza");

ArrayListSerde arrayListSerde = new ArrayListSerde<>(keyAvroSerde, valueAvroSerde);
lookup.through(stringSerdeKey, arrayListSerde, "DUPLICATE_LOOKUPS_STREAM");

// Create new stream from original.
KStream txnStreamFull = txnStream
    .map((key, record) -> new KeyValue<>(record.get("LANE_ID").toString(), record))
    .through(stringSerdeKey, valueAvroSerde, "RekeyedIntermediateTopic19");

// Join stream to lookup ktable.
// Right now just evaluating results.
KStream duplicatesStream =
    txnStreamFull.leftJoin(lookup, (vTxnStream, vLookupTable) -> {
        // This is where the vLookupTable list will be traversed to determine
        // any duplicates that match the vTxnStream value.

        // From inspecting joined objects I can see vLookupTable values are
        // late compared to txnStreamFull stream values.
        return vTxnStream;
    });
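Regarding the commented-out list.remove(record) above: the subtractor firing on every update is expected behavior for KTable aggregates. When a row of the grouped KTable is updated, Kafka Streams forwards the old value to the subtractor and the new value to the adder so the aggregate can be revised. A plain-Java sketch of that bookkeeping, using a count per group with invented names:

```java
import java.util.HashMap;
import java.util.Map;

public class AddSubtractSketch {
    // Simulates how a grouped KTable aggregate maintains a count per group:
    // an upstream row update arrives as (oldValue, newValue); the old value
    // is subtracted from its group and the new value added to its group.
    static void update(Map<String, Integer> counts, String oldGroup, String newGroup) {
        if (oldGroup != null)
            counts.merge(oldGroup, -1, Integer::sum); // subtractor fires on every update
        if (newGroup != null)
            counts.merge(newGroup, 1, Integer::sum);  // adder
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        update(counts, null, "laneC");    // insert: adder only
        update(counts, "laneC", "laneC"); // in-place update: subtract, then add
        System.out.println(counts.get("laneC"));
    }
}
```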



Know that I have spent time reading the docs / google searches, but still 

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Marcos Juarez
Thanks Becket,

We should get a full thread dump the next time, so I'll send it as soon
as that happens.

Marcos

On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin  wrote:

> Hi Marcos,
>
> Thanks for the update. It looks like the deadlock you saw was another one. Do
> you mind sending us a full stack trace after this happens?
>
> Regarding the downgrade, the steps would be the following:
> 1. change the inter.broker.protocol to 0.10.0
> 2. rolling bounce the cluster
> 3. deploy the 0.10.0.1 code
>
> There might be a bunch of .timeindex file left over but that should be
> fine.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez  wrote:
>
> > Becket/Jason,
> >
> > We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994
> patch,
> > but we're seeing the same exact issue.  It doesn't seem like the patch
> > fixes the problem we're seeing.
> >
> > At this point, we're considering downgrading our prod clusters back to
> > 0.10.0.1.  Is there any concern/issues we should be aware of when
> > downgrading the cluster like that?
> >
> > Thanks,
> >
> > Marcos Juarez
> >
> >
> > On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez  wrote:
> >
> > > Thanks Becket.
> > >
> > > I was working on that today.  I have a working jar, created from the
> > > 0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it.
> I've
> > > left it running in one test broker today, will try tomorrow to trigger
> > the
> > > issue, and try it with both the patched and un-patched versions.
> > >
> > > I'll let you know what we find.
> > >
> > > Thanks,
> > >
> > > Marcos
> > >
> > > On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin 
> > wrote:
> > >
> > >> Hi Marcos,
> > >>
> > >> Is it possible for you to apply the patch of KAFKA-3994 and see if the
> > >> issue is still there. The current patch of KAFKA-3994 should work, the
> > >> only
> > >> reason we haven't checked that in was because when we ran stress test
> it
> > >> shows noticeable performance impact when producers are producing with
> > >> acks=all. So if you are blocking on this issue maybe you can pick up
> the
> > >> patch as a short term solution. Meanwhile we will prioritize the
> ticket.
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >> On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez 
> > wrote:
> > >>
> > >> > We ran into this issue several more times over the weekend.
> > Basically,
> > >> > FDs are exhausted so fast now, we can't even get to the server in
> > time,
> > >> the
> > >> > JVM goes down in less than 5 minutes.
> > >> >
> > >> > I can send the whole thread dumps if needed, but for brevity's
> sake, I
> > >> > just copied over the relevant deadlock segment, and concatenated
> them
> > >> all
> > >> > together in the attached text file.
> > >> >
> > >> > Do you think this is something I should add to KAFKA-3994 ticket?
> Or
> > is
> > >> > the information in that ticket enough for now?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Marcos
> > >> >
> > >> > On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez 
> > >> wrote:
> > >> >
> > >> >> That's great, thanks Jason.
> > >> >>
> > >> >> We'll try and apply the patch in the meantime, and wait for the
> > >> official
> > >> >> release for 0.10.1.1.
> > >> >>
> > >> >> Please let us know if you need more details about the deadlocks on
> > our
> > >> >> side.
> > >> >>
> > >> >> Thanks again!
> > >> >>
> > >> >> Marcos
> > >> >>
> > >> >> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <
> ja...@confluent.io>
> > >> >> wrote:
> > >> >>
> > >> >>> Hi Marcos,
> > >> >>>
> > >> >>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA).
> > >> Since
> > >> >>> we're now seeing users hit this in practice, I'll definitely bump
> up
> > >> the
> > >> >>> priority on a fix. I can't say for sure when the release will be,
> > but
> > >> >>> we'll
> > >> >>> merge the fix into the 0.10.1 branch and you can build from there
> if
> > >> you
> > >> >>> need something in a hurry.
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Jason
> > >> >>>
> > >> >>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez 
> > >> wrote:
> > >> >>>
> > >> >>> > Jason,
> > >> >>> >
> > >> >>> > Thanks for that link.  It does appear to be a very similar
> issue,
> > if
> > >> >>> not
> > >> >>> > identical.  In our case, the deadlock is reported as across 3
> > >> threads,
> > >> >>> one
> > >> >>> > of them being a group_metadata_manager thread. Otherwise, it
> looks
> > >> the
> > >> >>> > same.
> > >> >>> >
> > >> >>> > On your questions:
> > >> >>> >
> > >> >>> > - We did not see this in prior releases, but we are ramping up
> > usage
> > >> >>> of our
> > >> >>> > kafka clusters lately, so maybe we didn't have the needed volume
> > >> >>> before to
> > >> >>> > trigger it.
> > >> >>> >
> > >> >>> > - Across our multiple staging and production clusters, we're
> > 

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Becket Qin
Hi Marcos,

Thanks for the update. It looks like the deadlock you saw was another one. Do
you mind sending us a full stack trace after this happens?

Regarding the downgrade, the steps would be the following:
1. change the inter.broker.protocol to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code

There might be a bunch of .timeindex file left over but that should be fine.

Thanks,

Jiangjie (Becket) Qin


On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez  wrote:

> Becket/Jason,
>
> We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch,
> but we're seeing the same exact issue.  It doesn't seem like the patch
> fixes the problem we're seeing.
>
> At this point, we're considering downgrading our prod clusters back to
> 0.10.0.1.  Is there any concern/issues we should be aware of when
> downgrading the cluster like that?
>
> Thanks,
>
> Marcos Juarez
>
>
> On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez  wrote:
>
> > Thanks Becket.
> >
> > I was working on that today.  I have a working jar, created from the
> > 0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it.  I've
> > left it running in one test broker today, will try tomorrow to trigger
> the
> > issue, and try it with both the patched and un-patched versions.
> >
> > I'll let you know what we find.
> >
> > Thanks,
> >
> > Marcos
> >
> > On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin 
> wrote:
> >
> >> Hi Marcos,
> >>
> >> Is it possible for you to apply the patch of KAFKA-3994 and see if the
> >> issue is still there. The current patch of KAFKA-3994 should work, the
> >> only
> >> reason we haven't checked that in was because when we ran stress test it
> >> shows noticeable performance impact when producers are producing with
> >> acks=all. So if you are blocking on this issue maybe you can pick up the
> >> patch as a short term solution. Meanwhile we will prioritize the ticket.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez 
> wrote:
> >>
> >> > We ran into this issue several more times over the weekend.
> Basically,
> >> > FDs are exhausted so fast now, we can't even get to the server in
> time,
> >> the
> >> > JVM goes down in less than 5 minutes.
> >> >
> >> > I can send the whole thread dumps if needed, but for brevity's sake, I
> >> > just copied over the relevant deadlock segment, and concatenated them
> >> all
> >> > together in the attached text file.
> >> >
> >> > Do you think this is something I should add to KAFKA-3994 ticket?  Or
> is
> >> > the information in that ticket enough for now?
> >> >
> >> > Thanks,
> >> >
> >> > Marcos
> >> >
> >> > On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez 
> >> wrote:
> >> >
> >> >> That's great, thanks Jason.
> >> >>
> >> >> We'll try and apply the patch in the meantime, and wait for the
> >> official
> >> >> release for 0.10.1.1.
> >> >>
> >> >> Please let us know if you need more details about the deadlocks on
> our
> >> >> side.
> >> >>
> >> >> Thanks again!
> >> >>
> >> >> Marcos
> >> >>
> >> >> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson 
> >> >> wrote:
> >> >>
> >> >>> Hi Marcos,
> >> >>>
> >> >>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA).
> >> Since
> >> >>> we're now seeing users hit this in practice, I'll definitely bump up
> >> the
> >> >>> priority on a fix. I can't say for sure when the release will be,
> but
> >> >>> we'll
> >> >>> merge the fix into the 0.10.1 branch and you can build from there if
> >> you
> >> >>> need something in a hurry.
> >> >>>
> >> >>> Thanks,
> >> >>> Jason
> >> >>>
> >> >>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez 
> >> wrote:
> >> >>>
> >> >>> > Jason,
> >> >>> >
> >> >>> > Thanks for that link.  It does appear to be a very similar issue,
> if
> >> >>> not
> >> >>> > identical.  In our case, the deadlock is reported as across 3
> >> threads,
> >> >>> one
> >> >>> > of them being a group_metadata_manager thread. Otherwise, it looks
> >> the
> >> >>> > same.
> >> >>> >
> >> >>> > On your questions:
> >> >>> >
> >> >>> > - We did not see this in prior releases, but we are ramping up
> usage
> >> >>> of our
> >> >>> > kafka clusters lately, so maybe we didn't have the needed volume
> >> >>> before to
> >> >>> > trigger it.
> >> >>> >
> >> >>> > - Across our multiple staging and production clusters, we're
> seeing
> >> the
> >> >>> > problem roughly once or twice a day.
> >> >>> >
> >> >>> > - Our clusters are small at the moment.  The two that are
> >> experiencing
> >> >>> the
> >> >>> > issue are 5 and 8 brokers, respectively.  The number of consumers
> is
> >> >>> small,
> >> >>> > I'd say less than 20 at the moment.  The amount of data being
> >> produced
> >> >>> is
> >> >>> > small also, in the tens of megabytes per second range, but the
> >> number
> >> >>> of
> >> >>> > connects/disconnects is 

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Marcos Juarez
Becket/Jason,

We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch,
but we're seeing the same exact issue.  It doesn't seem like the patch
fixes the problem we're seeing.

At this point, we're considering downgrading our prod clusters back to
0.10.0.1.  Is there any concern/issues we should be aware of when
downgrading the cluster like that?

Thanks,

Marcos Juarez


On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez  wrote:

> Thanks Becket.
>
> I was working on that today.  I have a working jar, created from the
> 0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it.  I've
> left it running in one test broker today, will try tomorrow to trigger the
> issue, and try it with both the patched and un-patched versions.
>
> I'll let you know what we find.
>
> Thanks,
>
> Marcos
>
> On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin  wrote:
>
>> Hi Marcos,
>>
>> Is it possible for you to apply the patch of KAFKA-3994 and see if the
>> issue is still there. The current patch of KAFKA-3994 should work, the
>> only
>> reason we haven't checked that in was because when we ran stress test it
>> shows noticeable performance impact when producers are producing with
>> acks=all. So if you are blocking on this issue maybe you can pick up the
>> patch as a short term solution. Meanwhile we will prioritize the ticket.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez  wrote:
>>
>> > We ran into this issue several more times over the weekend.  Basically,
>> > FDs are exhausted so fast now, we can't even get to the server in time,
>> the
>> > JVM goes down in less than 5 minutes.
>> >
>> > I can send the whole thread dumps if needed, but for brevity's sake, I
>> > just copied over the relevant deadlock segment, and concatenated them
>> all
>> > together in the attached text file.
>> >
>> > Do you think this is something I should add to KAFKA-3994 ticket?  Or is
>> > the information in that ticket enough for now?
>> >
>> > Thanks,
>> >
>> > Marcos
>> >
>> > On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez 
>> wrote:
>> >
>> >> That's great, thanks Jason.
>> >>
>> >> We'll try and apply the patch in the meantime, and wait for the
>> official
>> >> release for 0.10.1.1.
>> >>
>> >> Please let us know if you need more details about the deadlocks on our
>> >> side.
>> >>
>> >> Thanks again!
>> >>
>> >> Marcos
>> >>
>> >> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson 
>> >> wrote:
>> >>
>> >>> Hi Marcos,
>> >>>
>> >>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA).
>> Since
>> >>> we're now seeing users hit this in practice, I'll definitely bump up
>> the
>> >>> priority on a fix. I can't say for sure when the release will be, but
>> >>> we'll
>> >>> merge the fix into the 0.10.1 branch and you can build from there if
>> you
>> >>> need something in a hurry.
>> >>>
>> >>> Thanks,
>> >>> Jason
>> >>>
>> >>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez 
>> wrote:
>> >>>
>> >>> > Jason,
>> >>> >
>> >>> > Thanks for that link.  It does appear to be a very similar issue, if
>> >>> not
>> >>> > identical.  In our case, the deadlock is reported as across 3
>> threads,
>> >>> one
>> >>> > of them being a group_metadata_manager thread. Otherwise, it looks
>> the
>> >>> > same.
>> >>> >
>> >>> > On your questions:
>> >>> >
>> >>> > - We did not see this in prior releases, but we are ramping up usage
>> >>> of our
>> >>> > kafka clusters lately, so maybe we didn't have the needed volume
>> >>> before to
>> >>> > trigger it.
>> >>> >
>> >>> > - Across our multiple staging and production clusters, we're seeing
>> the
>> >>> > problem roughly once or twice a day.
>> >>> >
>> >>> > - Our clusters are small at the moment.  The two that are
>> experiencing
>> >>> the
>> >>> > issue are 5 and 8 brokers, respectively.  The number of consumers is
>> >>> small,
>> >>> > I'd say less than 20 at the moment.  The amount of data being
>> produced
>> >>> is
>> >>> > small also, in the tens of megabytes per second range, but the
>> number
>> >>> of
>> >>> > connects/disconnects is really high, because they're usually
>> >>> short-lived
>> >>> > processes.  Our guess at the moment is that this is triggering the
>> >>> bug.  We
>> >>> > have a separate cluster where we don't have short-lived producers,
>> and
>> >>> that
>> >>> > one has been rock solid.
>> >>> >
>> >>> >
>> >>> > We'll look into applying the patch, which could help reduce the
>> >>> problem.
>> >>> > The ticket says it's being targeted for the 0.10.2 release.  Any
>> rough
>> >>> > estimate of a timeline for that to come out?
>> >>> >
>> >>> > Thanks!
>> >>> >
>> >>> > Marcos
>> >>> >
>> >>> >
>> >>> > On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson > >
>> >>> > wrote:
>> >>> >
>> >>> > > Hey Marcos,
>> >>> > >
>> >>> > > Thanks for the report. Can you check out
>> >>> 

Re: Kafka upgrade from 0.8.0 to 0.10.0.0

2016-11-11 Thread Amit Tank
Hi,

I am not an expert, but from what I read and understood from the HDFS
documentation, if you want to upgrade zookeeper, you cannot avoid
downtime.

Thank you,
Amit

On Thursday, November 10, 2016, ZHU Hua B 
wrote:

> Hi,
>
>
> For a rolling upgrade, Kafka suggests upgrading the brokers one at a time
> (shut down the broker, update the code, and restart it) to avoid downtime
> during the upgrade.
> Usually there is one zookeeper serving the brokers in a Kafka cluster;
> should the zookeeper be upgraded also? If so, how can downtime be avoided
> during the zookeeper upgrade? Thanks!
>
>
>
>
>
>
> Best Regards
>
> Johnny
>
>


Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
Eno,

I tried the following, but I haven't seen any improvement. In fact it seems
to run longer. Are you sure Kafka / ZK are run as threads and not as
processes?


Set<Thread> threadSet = Thread.getAllStackTraces().keySet();
threadSet.forEach(t -> {
    if (t == Thread.currentThread())
        return;
    t.stop();
});
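Worth noting: Thread.stop() is deprecated and unsafe, and stopping or killing a "wrapper" thread would not stop threads it spawned, because Java threads have no parent/child lifecycle. Daemon threads, however, do die when the JVM exits. A small stand-alone sketch of both points:

```java
public class ThreadLifecycleDemo {
    public static void main(String[] args) throws Exception {
        final Thread[] childRef = new Thread[1];
        // A "wrapper" thread spawns a long-running child, then finishes.
        Thread wrapper = new Thread(() -> {
            Thread child = new Thread(() -> {
                try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
            });
            child.setDaemon(true); // dies with the JVM instead of blocking exit
            child.start();
            childRef[0] = child;
        });
        wrapper.start();
        wrapper.join();

        // The wrapper is done, but the child it spawned is still running:
        // stopping the wrapper would not have stopped the child either.
        System.out.println("wrapper alive: " + wrapper.isAlive());
        System.out.println("child alive: " + childRef[0].isAlive());
        // The JVM still exits immediately here, because the child is a daemon.
    }
}
```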


On Fri, Nov 11, 2016 at 9:52 PM, Ali Akhtar  wrote:

> Oh, so it seems like there's no easy way to just Thread.stop() without
> changing the internal kafka / zk code? :(
>
> Perhaps its possible to start kafka / zk within another thread, and then
> kill the wrapper thread. Will that stop the children threads, if the
> wrapper thread is killed?
>
> Hmm, or maybe an Executor which is shut down, and which force shuts down
> the children threads?
>
>
>
> On Fri, Nov 11, 2016 at 9:46 PM, Eno Thereska 
> wrote:
>
>> It's the org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster
>> (start ZK and calls org.apache.kafka.streams.integration.utils.KafkaEmbedded
>> to start Kafka).
>> So these are embedded in the sense that it's not another process, just
>> threads within the main streams test process.
>>
>> Thanks
>> Eno
>>
>> > On 11 Nov 2016, at 16:26, Ali Akhtar  wrote:
>> >
>> > Hey Eno,
>> >
>> > Thanks for the quick reply.
>> >
>> > In the meantime, is it possible to just send a sigterm / kill -9 which
>> just
>> > kills the zookeeper + kafka? I can figure out how to do it if you can
>> point
>> > out which class / method creates the processes / threads.
>> >
>> > Thanks.
>> >
>> > On Fri, Nov 11, 2016 at 9:24 PM, Eno Thereska 
>> > wrote:
>> >
>> >> Hi Ali,
>> >>
>> >> You're right, shutting down the broker and ZK is expensive. We kept the
>> >> number of integration tests relatively small (and pushed some more
>> tests as
>> >> system tests, while doing as many as possible as unit tests). It's not
>> just
>> >> the shutdown that's expensive, it's also the starting up unfortunately.
>> >> It's on our todo list to do something about this, but we haven't gotten
>> >> there yet. If someone from the community wants to have a look and help
>> out,
>> >> that'd be great (with a JIRA and PR).
>> >>
>> >> About the second problem with ZK logs, this is being worked on as part
>> of
>> >> removing the ZK dependency from streams and should be merged shortly:
>> >> https://github.com/apache/kafka/pull/1884. The msg you see does not affect correctness, it's
>> just
>> >> annoying and it will go away.
>> >>
>> >> Thanks,
>> >> Eno
>> >>
>> >>
>> >>> On 11 Nov 2016, at 14:28, Ali Akhtar  wrote:
>> >>>
>> >>> I have some unit tests in which I create an embedded single broker
>> kafka
>> >>> cluster, using :
>> >>>
>> >>> EmbeddedSingleNodeKafkaCluster.java from
>> >>> https://github.com/confluentinc/examples/blob/master/kafka-streams/src/test/java/io/confluent/examples/streams/kafka/EmbeddedSingleNodeKafkaCluster.java
>> >>>
>> >>> That class also creates an embedded zookeeper cluster / instance.
>> >>>
>> >>> The problem is, while the tests run pretty fast and pass, they then
>> stay
>> >>> stuck in the 'teardown / clean up' stage for a really long time, often
>> >> upto
>> >>> 10-20
>> >>> seconds per test.
>> >>>
>> >>> As I have a lot of test classes, each class creating its own embedded
>> >> kafka
>> >>> cluster, this time can really add up during compiles.
>> >>>
>> >>> Is it possible to get these test classes to not do any clean up /
>> safety
>> >>> stuff, because the instances are just throwaway. Just have them kill
>> -9
>> >> the
>> >>> kafka / zookeeper and exit?
>> >>>
>> >>> It doesn't make any sense that tests pass within seconds, but can't
>> move
>> >> on
>> >>> to the next test class because its cleaning up.
>> >>>
>> >>> I also have an embedded cassandra instance in these tests, but I don't
>> >>> think that one is the problem, as i see a lot of zookeeper logs such
>> as
>> >>> these after the test runs:
>> >>>
>> >>> 133764 [main-SendThread(127.0.0.1:38846)] WARN
>> >>> org.apache.zookeeper.ClientCnxn  - Session 0x15853c3497f0001 for
>> server
>> >>> null, unexpected error, closing socket connection and attempting
>> >> reconnect
>> >>> java.net.ConnectException: Connection refused
>> >>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >>> at sun.nio.ch.SocketChannelImpl.finishConnect(
>> >> SocketChannelImpl.java:717)
>> >>> at
>> >>> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>> >> ClientCnxnSocketNIO.java:361)
>> >>> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.
>> java:1081)
>> >>>
>> >>>
>> >>> Could it be that zookeeper doesn't exit and keeps retrying to connect?
>> >>
>> >>
>>
>>
>


Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
Oh, so it seems like there's no easy way to just Thread.stop() without
changing the internal kafka / zk code? :(

Perhaps its possible to start kafka / zk within another thread, and then
kill the wrapper thread. Will that stop the children threads, if the
wrapper thread is killed?

Hmm, or maybe an Executor which is shut down, and which force shuts down
the children threads?



On Fri, Nov 11, 2016 at 9:46 PM, Eno Thereska 
wrote:

> It's the org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster
> (start ZK and calls org.apache.kafka.streams.integration.utils.KafkaEmbedded
> to start Kafka).
> So these are embedded in the sense that it's not another process, just
> threads within the main streams test process.
>
> Thanks
> Eno
>
> > On 11 Nov 2016, at 16:26, Ali Akhtar  wrote:
> >
> > Hey Eno,
> >
> > Thanks for the quick reply.
> >
> > In the meantime, is it possible to just send a sigterm / kill -9 which
> just
> > kills the zookeeper + kafka? I can figure out how to do it if you can
> point
> > out which class / method creates the processes / threads.
> >
> > Thanks.
> >
> > On Fri, Nov 11, 2016 at 9:24 PM, Eno Thereska 
> > wrote:
> >
> >> Hi Ali,
> >>
> >> You're right, shutting down the broker and ZK is expensive. We kept the
> >> number of integration tests relatively small (and pushed some more
> tests as
> >> system tests, while doing as many as possible as unit tests). It's not
> just
> >> the shutdown that's expensive, it's also the starting up unfortunately.
> >> It's on our todo list to do something about this, but we haven't gotten
> >> there yet. If someone from the community wants to have a look and help
> out,
> >> that'd be great (with a JIRA and PR).
> >>
> >> About the second problem with ZK logs, this is being worked on as part
> of
> >> removing the ZK dependency from streams and should be merged shortly:
> >> https://github.com/apache/kafka/pull/1884. The msg you see does not affect correctness, it's just
> >> annoying and it will go away.
> >>
> >> Thanks,
> >> Eno
> >>
> >>
> >>> On 11 Nov 2016, at 14:28, Ali Akhtar  wrote:
> >>>
> >>> I have some unit tests in which I create an embedded single broker
> kafka
> >>> cluster, using :
> >>>
> >>> EmbeddedSingleNodeKafkaCluster.java from
> >>> https://github.com/confluentinc/examples/blob/master/kafka-streams/src/test/java/io/confluent/examples/streams/kafka/EmbeddedSingleNodeKafkaCluster.java
> >>>
> >>> That class also creates an embedded zookeeper cluster / instance.
> >>>
> >>> The problem is, while the tests run pretty fast and pass, they then
> stay
> >>> stuck in the 'teardown / clean up' stage for a really long time, often
> >> upto
> >>> 10-20
> >>> seconds per test.
> >>>
> >>> As I have a lot of test classes, each class creating its own embedded
> >> kafka
> >>> cluster, this time can really add up during compiles.
> >>>
> >>> Is it possible to get these test classes to not do any clean up /
> safety
> >>> stuff, because the instances are just throwaway. Just have them kill -9
> >> the
> >>> kafka / zookeeper and exit?
> >>>
> >>> It doesn't make any sense that tests pass within seconds, but can't
> move
> >> on
> >>> to the next test class because its cleaning up.
> >>>
> >>> I also have an embedded cassandra instance in these tests, but I don't
> >>> think that one is the problem, as i see a lot of zookeeper logs such as
> >>> these after the test runs:
> >>>
> >>> 133764 [main-SendThread(127.0.0.1:38846)] WARN
> >>> org.apache.zookeeper.ClientCnxn  - Session 0x15853c3497f0001 for
> server
> >>> null, unexpected error, closing socket connection and attempting
> >> reconnect
> >>> java.net.ConnectException: Connection refused
> >>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >>> at sun.nio.ch.SocketChannelImpl.finishConnect(
> >> SocketChannelImpl.java:717)
> >>> at
> >>> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
> >> ClientCnxnSocketNIO.java:361)
> >>> at org.apache.zookeeper.ClientCnxn$SendThread.run(
> ClientCnxn.java:1081)
> >>>
> >>>
> >>> Could it be that zookeeper doesn't exit and keeps retrying to connect?
> >>
> >>
>
>


Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Eno Thereska
It's the org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster (start 
ZK and calls org.apache.kafka.streams.integration.utils.KafkaEmbedded to start 
Kafka).
So these are embedded in the sense that it's not another process, just threads 
within the main streams test process.

Thanks
Eno

> On 11 Nov 2016, at 16:26, Ali Akhtar  wrote:
> 
> Hey Eno,
> 
> Thanks for the quick reply.
> 
> In the meantime, is it possible to just send a sigterm / kill -9 which just
> kills the zookeeper + kafka? I can figure out how to do it if you can point
> out which class / method creates the processes / threads.
> 
> Thanks.
> 
> On Fri, Nov 11, 2016 at 9:24 PM, Eno Thereska 
> wrote:
> 
>> Hi Ali,
>> 
>> You're right, shutting down the broker and ZK is expensive. We kept the
>> number of integration tests relatively small (and pushed some more tests as
>> system tests, while doing as many as possible as unit tests). It's not just
>> the shutdown that's expensive, it's also the starting up unfortunately.
>> It's on our todo list to do something about this, but we haven't gotten
>> there yet. If someone from the community wants to have a look and help out,
>> that'd be great (with a JIRA and PR).
>> 
>> About the second problem with ZK logs, this is being worked on as part of
>> removing the ZK dependency from streams and should be merged shortly:
>> https://github.com/apache/kafka/pull/1884. The msg you see does not affect correctness, it's just
>> annoying and it will go away.
>> 
>> Thanks,
>> Eno
>> 
>> 
>>> On 11 Nov 2016, at 14:28, Ali Akhtar  wrote:
>>> 
>>> I have some unit tests in which I create an embedded single broker kafka
>>> cluster, using :
>>> 
>>> EmbeddedSingleNodeKafkaCluster.java from
>>> https://github.com/confluentinc/examples/blob/master/kafka-streams/src/test/java/io/confluent/examples/streams/kafka/EmbeddedSingleNodeKafkaCluster.java
>>> 
>>> That class also creates an embedded zookeeper cluster / instance.
>>> 
>>> The problem is, while the tests run pretty fast and pass, they then stay
>>> stuck in the 'teardown / clean up' stage for a really long time, often
>> upto
>>> 10-20
>>> seconds per test.
>>> 
>>> As I have a lot of test classes, each class creating its own embedded
>> kafka
>>> cluster, this time can really add up during compiles.
>>> 
>>> Is it possible to get these test classes to not do any clean up / safety
>>> stuff, because the instances are just throwaway. Just have them kill -9
>> the
>>> kafka / zookeeper and exit?
>>> 
>>> It doesn't make any sense that tests pass within seconds, but can't move
>> on
>>> to the next test class because its cleaning up.
>>> 
>>> I also have an embedded cassandra instance in these tests, but I don't
>>> think that one is the problem, as i see a lot of zookeeper logs such as
>>> these after the test runs:
>>> 
>>> 133764 [main-SendThread(127.0.0.1:38846)] WARN
>>> org.apache.zookeeper.ClientCnxn  - Session 0x15853c3497f0001 for server
>>> null, unexpected error, closing socket connection and attempting
>> reconnect
>>> java.net.ConnectException: Connection refused
>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> at sun.nio.ch.SocketChannelImpl.finishConnect(
>> SocketChannelImpl.java:717)
>>> at
>>> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>> ClientCnxnSocketNIO.java:361)
>>> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>> 
>>> 
>>> Could it be that zookeeper doesn't exit and keeps retrying to connect?
>> 
>> 



Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
For me, the startup doesn't take anywhere near as long as shutdown does.



Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
Unless I'm missing anything, there's no reason why these throwaway
processes should be shut down gracefully. Just kill them as soon as the test
finishes.
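
(Not from the thread: note that EmbeddedSingleNodeKafkaCluster runs the broker and ZooKeeper in-process, as threads, so a literal `kill -9` only applies if you launch them as separate child processes. Under that assumption, forcible termination looks roughly like the sketch below; the `ThrowawayKill` class and `killHard` helper are hypothetical names, and the example assumes a Unix-like system with a `sleep` binary standing in for an external broker/ZK process.)

```java
import java.util.concurrent.TimeUnit;

public class ThrowawayKill {
    /**
     * Start a disposable child process and terminate it forcibly
     * (SIGKILL on Unix), skipping any graceful-shutdown handshake.
     * Returns true if the process is gone within the grace period.
     */
    static boolean killHard(String... command) {
        try {
            Process p = new ProcessBuilder(command).start();
            p.destroyForcibly();  // no clean shutdown, just kill
            return p.waitFor(5, TimeUnit.SECONDS) && !p.isAlive();
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "sleep 60" stands in for an external broker/ZK process.
        System.out.println(killHard("sleep", "60"));
    }
}
```

For an in-process embedded cluster this pattern doesn't apply directly; the closest equivalent is marking the cluster's threads as daemons so they die with the JVM instead of being joined during teardown.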



Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
Hey Eno,

Thanks for the quick reply.

In the meantime, is it possible to just send a sigterm / kill -9 which just
kills the zookeeper + kafka? I can figure out how to do it if you can point
out which class / method creates the processes / threads.

Thanks.



Re: Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Eno Thereska
Hi Ali,

You're right, shutting down the broker and ZK is expensive. We kept the number 
of integration tests relatively small (and pushed some more tests as system 
tests, while doing as many as possible as unit tests). It's not just the 
shutdown that's expensive, it's also the starting up unfortunately. It's on our 
todo list to do something about this, but we haven't gotten there yet. If 
someone from the community wants to have a look and help out, that'd be great 
(with a JIRA and PR).

About the second problem with ZK logs, this is being worked on as part of 
removing the ZK dependency from streams and should be merged shortly:
https://github.com/apache/kafka/pull/1884. The message you see does not affect
correctness; it's just annoying and it will go away.

Thanks,
Eno
 




Problem with kafka cluster with 2 brokers - Replicas out of sync

2016-11-11 Thread Alexandre Juma
Hi list,

I've added a Kafka node to our Hortonworks cluster, and while executing the
partition-reassignment procedure something went wrong; I'm kind of stuck.

ZK nodes: RHTPINEC001, RHTPINEC004, RHTPINEC005
Kafka nodes: RHTPINEC008 (broker id = 1001), RHTPINEC007 (broker id = 1002,
this is the new broker)
Relevant Topics: poc_3b_syslog, poc_3a_syslog, uma_syslog_topic
Number of partitions: 6

Topics to move file:

{"topics": [
   {"topic": "poc_3a_syslog"},
   {"topic": "poc_3b_syslog"},
   {"topic": "uma_syslog_topic"}
 ],
 "version": 1
}

Generated reassign output:

{"version":1,"partitions":[
  {"topic":"uma_syslog_topic","partition":2,"replicas":[1001]},
  {"topic":"poc_3a_syslog","partition":0,"replicas":[1002]},
  {"topic":"poc_3b_syslog","partition":2,"replicas":[1002]},
  {"topic":"poc_3b_syslog","partition":1,"replicas":[1001]},
  {"topic":"poc_3b_syslog","partition":0,"replicas":[1002]},
  {"topic":"uma_syslog_topic","partition":1,"replicas":[1002]},
  {"topic":"uma_syslog_topic","partition":0,"replicas":[1001]},
  {"topic":"poc_3b_syslog","partition":4,"replicas":[1002]},
  {"topic":"poc_3b_syslog","partition":3,"replicas":[1001]},
  {"topic":"poc_3a_syslog","partition":2,"replicas":[1002]},
  {"topic":"uma_syslog_topic","partition":3,"replicas":[1002]},
  {"topic":"poc_3a_syslog","partition":5,"replicas":[1001]},
  {"topic":"poc_3a_syslog","partition":3,"replicas":[1001]},
  {"topic":"poc_3a_syslog","partition":1,"replicas":[1001]},
  {"topic":"poc_3a_syslog","partition":4,"replicas":[1002]},
  {"topic":"uma_syslog_topic","partition":6,"replicas":[1001]},
  {"topic":"poc_3b_syslog","partition":5,"replicas":[1001]},
  {"topic":"uma_syslog_topic","partition":5,"replicas":[1002]},
  {"topic":"uma_syslog_topic","partition":4,"replicas":[1001]}
]}

Reassignment request:

$ ./kafka-reassign-partitions.sh --zookeeper 10.135.96.207:2181
--reassignment-json-file /tmp/move_me --broker-list "1001,1002" --execute

Verify:

$ ./kafka-reassign-partitions.sh --zookeeper 10.135.96.207:2181
--reassignment-json-file /tmp/move_me --broker-list "1001,1002" --verify
Status of partition reassignment:
Reassignment of partition [uma_syslog_topic,0] completed successfully
Reassignment of partition [poc_3b_syslog,4] completed successfully
Reassignment of partition [uma_syslog_topic,6] completed successfully
Reassignment of partition [poc_3a_syslog,0] completed successfully
Reassignment of partition [poc_3a_syslog,2] completed successfully
Reassignment of partition [poc_3b_syslog,0] completed successfully
Reassignment of partition [uma_syslog_topic,2] completed successfully
Reassignment of partition [uma_syslog_topic,1] is still in progress
Reassignment of partition [poc_3b_syslog,3] is still in progress
Reassignment of partition [uma_syslog_topic,4] completed successfully
Reassignment of partition [poc_3a_syslog,4] completed successfully
Reassignment of partition [poc_3b_syslog,5] is still in progress
Reassignment of partition [poc_3a_syslog,1] is still in progress
Reassignment of partition [poc_3a_syslog,3] is still in progress
Reassignment of partition [poc_3b_syslog,1] is still in progress
Reassignment of partition [uma_syslog_topic,3] is still in progress
Reassignment of partition [poc_3a_syslog,5] is still in progress
Reassignment of partition [poc_3b_syslog,2] completed successfully
Reassignment of partition [uma_syslog_topic,5] is still in progress

And I'm stuck getting these to move. I can't see anything moving in the
logs on broker 1002.

After this I've cleared the /admin/reassign_partitions znode and I'm trying
to abort the procedure but can't go anywhere.

Right now the current status of the topics/partitions is:

Topic:poc_3a_syslog   PartitionCount:6   ReplicationFactor:1   Configs:retention.ms=8640
    Topic: poc_3a_syslog   Partition: 0   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3a_syslog   Partition: 1   Leader: 1001   Replicas: 1002,1001   Isr: 1001
    Topic: poc_3a_syslog   Partition: 2   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3a_syslog   Partition: 3   Leader: 1001   Replicas: 1002,1001   Isr: 1001
    Topic: poc_3a_syslog   Partition: 4   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3a_syslog   Partition: 5   Leader: 1001   Replicas: 1002,1001   Isr: 1001
Topic:poc_3b_syslog   PartitionCount:6   ReplicationFactor:1   Configs:retention.ms=8640
    Topic: poc_3b_syslog   Partition: 0   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3b_syslog   Partition: 1   Leader: 1001   Replicas: 1002,1001   Isr: 1001
    Topic: poc_3b_syslog   Partition: 2   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3b_syslog   Partition: 3   Leader: 1001   Replicas: 1002,1001   Isr: 1001
    Topic: poc_3b_syslog   Partition: 4   Leader: 1001   Replicas: 1001        Isr: 1001
    Topic: poc_3b_syslog   Partition: 5   Leader: 1001   Replicas: 1002,1001   Isr: 1001
Topic:uma_syslog_topic   PartitionCount:7
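
(Editor's sketch, not from the thread: when a reassignment wedges like this, it can help to mechanically split the `--verify` output into finished vs. stuck partitions before touching any znodes. The `ReassignTriage` class name is hypothetical, and the line format is assumed to match the verify output pasted above.)

```java
import java.util.ArrayList;
import java.util.List;

public class ReassignTriage {
    /**
     * Given lines from `kafka-reassign-partitions.sh --verify` output,
     * return the "[topic,partition]" identifiers still in progress.
     */
    static List<String> stillInProgress(List<String> verifyLines) {
        List<String> stuck = new ArrayList<>();
        for (String line : verifyLines) {
            if (line.contains("is still in progress")) {
                // Extract the bracketed [topic,partition] identifier.
                int open = line.indexOf('[');
                int close = line.indexOf(']');
                if (open >= 0 && close > open) {
                    stuck.add(line.substring(open, close + 1));
                }
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Reassignment of partition [uma_syslog_topic,0] completed successfully",
            "Reassignment of partition [uma_syslog_topic,1] is still in progress",
            "Reassignment of partition [poc_3b_syslog,3] is still in progress");
        System.out.println(stillInProgress(lines));
    }
}
```

The stuck set can then be cross-checked against the `--describe` output: here every "still in progress" partition is one whose target replica (1002) never joined the ISR.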

Unit tests (in Java) take a *very* long time to 'clean up'?

2016-11-11 Thread Ali Akhtar
I have some unit tests in which I create an embedded single broker kafka
cluster, using :

EmbeddedSingleNodeKafkaCluster.java from
https://github.com/confluentinc/examples/blob/master/kafka-streams/src/test/java/io/confluent/examples/streams/kafka/EmbeddedSingleNodeKafkaCluster.java

That class also creates an embedded zookeeper cluster / instance.

The problem is, while the tests run pretty fast and pass, they then stay
stuck in the 'teardown / clean up' stage for a really long time, often up to
10-20 seconds per test.

As I have a lot of test classes, each class creating its own embedded kafka
cluster, this time can really add up during compiles.

Is it possible to get these test classes to not do any clean up / safety
stuff, because the instances are just throwaway. Just have them kill -9 the
kafka / zookeeper and exit?

It doesn't make any sense that tests pass within seconds, but can't move on
to the next test class because it's cleaning up.

I also have an embedded Cassandra instance in these tests, but I don't
think that one is the problem, as I see a lot of zookeeper logs such as
these after the test runs:

133764 [main-SendThread(127.0.0.1:38846)] WARN
 org.apache.zookeeper.ClientCnxn  - Session 0x15853c3497f0001 for server
null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)


Could it be that zookeeper doesn't exit and keeps retrying to connect?
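
(Editor's note, not part of the thread: until the startup/shutdown cost is addressed upstream, one workaround is to run each class's teardown on a daemon thread with a hard deadline, so a slow or retrying shutdown can be abandoned instead of blocking the next test class. `stopWithTimeout` is a hypothetical helper, not something the embedded cluster provides; the empty/sleeping `Runnable`s below stand in for `cluster.stop()`.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FastTeardown {
    /**
     * Run a possibly-slow cleanup action, but give up after timeoutMs so
     * the test JVM can move on. Returns true if cleanup finished in time.
     */
    static boolean stopWithTimeout(Runnable cleanup, long timeoutMs) {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "teardown");
            t.setDaemon(true);  // daemon: won't keep the JVM alive if abandoned
            return t;
        });
        Future<?> f = ex.submit(cleanup);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                 // finished within the deadline
        } catch (TimeoutException e) {
            f.cancel(true);              // interrupt and abandon the slow shutdown
            return false;
        } catch (Exception e) {
            return false;
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean fast = stopWithTimeout(() -> {}, 1000);
        boolean slow = stopWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ignored) {}
        }, 200);
        System.out.println(fast + " " + slow);
    }
}
```

The trade-off is that an abandoned shutdown can leak sockets and temp directories for the remainder of the JVM's life, which is usually acceptable for throwaway per-class clusters.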