Kafka cluster instability

2018-02-14 Thread Avinash Herle
Hi,

I'm using Kafka version 0.11.0.2. My cluster has 4 nodes running Kafka,
3 of which also run Zookeeper. I have a few producer processes that
publish to Kafka and multiple consumer processes: a streaming engine
(Spark) that ingests from Kafka and also publishes data back to Kafka, and a
distributed data store (Druid) that reads all messages from Kafka and
stores them in its DB. Druid also uses the same Zookeeper cluster as
Kafka for its cluster state management.

*Kafka Configs:*
1) No replication being used
2) Number of network threads: 30
3) Number of I/O threads: 8
4) Machines have 64GB RAM and 16 cores
5) 3 topics with 64 partitions per topic
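
For reference, these map onto broker settings roughly like the excerpt
below (an illustrative server.properties sketch; the Zookeeper addresses
are placeholders):

# server.properties (illustrative excerpt)
num.network.threads=30
num.io.threads=8
default.replication.factor=1
num.partitions=64
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181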

*Questions:*

1) *Partitions going offline*
I frequently see partitions going offline, which increases the scheduling
delay of the Spark application and makes the input rate jittery. I also
tried enabling replication to see if it helped, but it didn't make much of
a difference. What could be causing this? Lack of resources or a cluster
misconfiguration? Could the large number of receiver processes be the
cause?
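
(A sketch of how such offline partitions can be listed with the stock
tooling; the Zookeeper address is a placeholder:)

# partitions that currently have no live leader
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --unavailable-partitions
# with replication enabled, partitions whose ISR has shrunk
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions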

2) *Colocation of Zookeeper and Kafka*
As mentioned above, 3 of my nodes run both Zookeeper and Kafka. Both
components are containerized, so they run inside Docker containers. I found
a few blogs suggesting not to colocate them for performance reasons. Is it
necessary to run them on dedicated machines?

3) *Using the same Zookeeper cluster across different components*
I use the same Zookeeper cluster for state management of both the Kafka
cluster and the Druid cluster. Could this cause instability in the overall
system?
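
(A sketch of what sharing the ensemble looks like in the configs, with
placeholder hostnames; each component can at least sit under its own
chroot/base path:)

# Kafka server.properties
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
# Druid common.runtime.properties
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
druid.zk.paths.base=/druid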

Hope I've covered all the necessary information. Please let me know if
more details about my cluster are needed.

Thanks in advance,
Avinash
-- 

Excuse brevity and typos. Sent from mobile device.


Re: Memory Leak in Kafka

2018-01-24 Thread Avinash Herle
Hi Ted,

I've posted this question to the kafka-clients Google group as well. Here is
the link <https://groups.google.com/forum/#!topic/kafka-clients/AeglVfsRCak>;
it includes the attachments.

Thanks,
Avinash

On Tue, 23 Jan 2018 at 17:23 Ted Yu  wrote:

> Did you attach two .png files ?
>
> Please use third party site since the attachment didn't come thru.
>
> On Tue, Jan 23, 2018 at 5:20 PM, Avinash Herle 
> wrote:
>
> >
> > Hi,
> >
> > I'm using Kafka as a messaging system in my data pipeline. I've a couple
> > of producer processes in my pipeline and Spark Streaming
> > <https://spark.apache.org/docs/2.2.1/streaming-kafka-0-10-integration.html>
> > and Druid's Kafka indexing service
> > <http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html>
> > as consumers of Kafka. The indexing service spawns 40 new indexing tasks
> > (Kafka consumers) every 15 mins.
> >
> > The heap memory used on Kafka seems fairly constant for an hour after
> > which it seems to shoot up to the max allocated space. The garbage
> > collection logs of Kafka seems to indicate a memory leak in Kafka. Find
> > attached the plots generated from the GC logs.
> >
> > *Kafka Deployment:*
> > 3 nodes, with 3 topics and 64 partitions per topic
> >
> > *Kafka Runtime JVM Parameters:*
> > 8GB heap memory
> > 1GB swap memory
> > Using G1GC
> > MaxGCPauseMillis=20
> > InitiatingHeapOccupancyPercent=35
> >
> > *Kafka Versions Used:*
> > I've used Kafka version 0.10.0, 0.11.0.2 and 1.0.0 and find similar
> > behavior
> >
> > *Questions:*
> > 1) Is this a memory leak on the Kafka side or a misconfiguration of my
> > Kafka cluster?
> > 2) Druid creates new indexing tasks periodically. Does Kafka stably
> handle
> > large number of consumers being added periodically?
> > 3) As a knock-on effect, we also notice Kafka partitions going offline
> > periodically after some time, with the following error:
> > ERROR [ReplicaFetcherThread-18-2], Error for partition [topic1,2] to
> > broker 2: *org.apache.kafka.common.errors.UnknownTopicOrPartitionException*:
> > This server does not host this topic-partition.
> > (kafka.server.ReplicaFetcherThread)
> >
> > Can someone shed some light on the behavior being seen in my cluster?
> >
> > Please let me know if more details are needed to root cause the behavior
> > being seen.
> >
> > Thanks in advance.
> >
> > Avinash
> > [image: Screen Shot 2018-01-23 at 2.29.04 PM.png][image: Screen Shot
> > 2018-01-23 at 2.29.21 PM.png]
> >
> >
> >
> >
> > --
> >
> > Excuse brevity and typos. Sent from mobile device.
> >
> >
>


-- 

Excuse brevity and typos. Sent from mobile device.


Memory Leak in Kafka

2018-01-23 Thread Avinash Herle
Hi,

I'm using Kafka as a messaging system in my data pipeline. I have a couple of
producer processes in my pipeline, and Spark Streaming
<https://spark.apache.org/docs/2.2.1/streaming-kafka-0-10-integration.html>
and Druid's Kafka indexing service
<http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html>
as consumers of Kafka. The indexing service spawns 40 new indexing tasks
(Kafka consumers) every 15 minutes.

The heap memory used by Kafka stays fairly constant for about an hour, after
which it shoots up to the maximum allocated space. The garbage collection
logs of Kafka seem to indicate a memory leak in Kafka. Find attached the
plots generated from the GC logs.

*Kafka Deployment:*
3 nodes, with 3 topics and 64 partitions per topic

*Kafka Runtime JVM Parameters:*
8GB heap memory
1GB swap memory
Using G1GC
MaxGCPauseMillis=20
InitiatingHeapOccupancyPercent=35
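
(In case it helps, this is roughly how such settings are passed in via the
stock Kafka start scripts; a sketch, with the GC log path as a placeholder:)

export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
export KAFKA_GC_LOG_OPTS="-Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
bin/kafka-server-start.sh config/server.properties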

*Kafka Versions Used:*
I've used Kafka versions 0.10.0, 0.11.0.2, and 1.0.0 and see similar behavior
with all of them.

*Questions:*
1) Is this a memory leak on the Kafka side or a misconfiguration of my
Kafka cluster? Does Kafka stably handle a large number of consumers being
added periodically?
2) As a knock-on effect, we also notice Kafka partitions going offline
periodically after some time, with the following error:
ERROR [ReplicaFetcherThread-18-2], Error for partition [topic1,2] to
broker 2: *org.apache.kafka.common.errors.UnknownTopicOrPartitionException*:
This server does not host this topic-partition.
(kafka.server.ReplicaFetcherThread)
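
(For reference, the topic's current leader/replica assignment can be
inspected when this happens with something like the following; the
Zookeeper address is a placeholder:)

bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic topic1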

Can someone shed some light on this behavior?

Please let me know if more details are needed to root-cause it.

Thanks in advance.

Avinash
[image: Screen Shot 2018-01-23 at 2.29.04 PM.png][image: Screen Shot
2018-01-23 at 2.29.21 PM.png]




-- 

Excuse brevity and typos. Sent from mobile device.

