I'll look into those tools, thanks!

I was able to turn on the JMX polling and consumer metrics in kafka-manager.  I 
now know which topic & partition is causing the problem.  It's basically 80MB 
of a single partition on a single topic being hit by 60-odd consumers.  Now I 
need to figure out what that means.
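
In case it's useful to anyone else reading along, the next thing I plan to run 
is roughly the following, to see which consumer groups are actually sitting on 
that partition (broker address and topic name are placeholders for mine, and 
--all-groups needs the 2.4+ CLI tools; on older versions you'd loop over the 
output of --list instead):

kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --all-groups | grep 'my-hot-topic '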

Thanks!

-Dylan


________________________________
From: Alex Woolford <a...@woolford.io>
Sent: Saturday, February 8, 2020 10:09 PM
To: users@kafka.apache.org <users@kafka.apache.org>
Cc: Dylan Martin <dmar...@istreamplanet.com>
Subject: Re: Confusingly unbalanced broker


[EXTERNAL E-MAIL]

That's a very intriguing question, Dylan.

Even if the partitions for each of the topics are distributed evenly across the 
brokers, it's not guaranteed that the *data* will be distributed evenly. By 
default, the producer will send all the messages in a topic with the same key 
to the same partition. It's possible you have keyed messages, the cardinality 
of the key is very low, and a disproportionate portion of the messages are 
going to a single "hot" partition.
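
One quick way to sanity-check that theory (broker address and topic name below 
are just examples) is to compare the end offsets of the partitions with 
GetOffsetShell; run it twice a few minutes apart, and if one partition's offset 
is climbing much faster than its siblings, you've probably got a hot key:

kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list broker1:9092 --topic mytopic --time -1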

One thing you could do, off the top of my head, is to take a peek at the file 
access events. For example, the following one-liner shows that on this 
particular node, there are a lot of writes to the `aprs` topic, partition 2:

# fatrace --seconds 10 | sort | uniq -c | sort -nr | head
   161 java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log
   155 java(1928): R /var/lib/kafka/_confluent-metrics-2/00000000000031360445.log
   148 java(1928): R /var/lib/kafka/conn-0/00000000000029833400.log
   136 ossec-agentd(1733): R /var/ossec/etc/shared/merged.mg
   129 osqueryd(2201): O /etc/passwd
   104 java(1928): R /var/lib/kafka/_confluent-monitoring-0/00000000000046052008.log
    95 osqueryd(2201): RC /etc/passwd
    91 osqueryd(2201): RCO /etc/passwd
    79 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-MetricsAggregateStore-repartition-2/00000000000414771172.log
    64 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-monitoring-message-rekey-store-1/00000000000002063409.log

I'm running CentOS 7. Here's what I did to install fatrace:

wget https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/source/tree/Packages/f/fatrace-0.13-5.fc31.src.rpm
rpm -i fatrace-0.13-5.fc31.src.rpm
yum install bzip2
tar xvf /root/rpmbuild/SOURCES/fatrace-0.13.tar.bz2
cd fatrace-0.13
make
make install

You could also poke around in the filesystem, perhaps using `ncdu`, to see 
which topics/partitions are consuming the disk. For example, `ncdu 
/var/lib/kafka` shows that partition 0 of my syslog topic is consuming most of 
the space on this particular broker:

--- /var/lib/kafka -------------------
  61.1 GiB [##########] /syslog-0
   6.4 GiB [#         ] /aprs-0
   3.7 GiB [          ] /syslog-7
   3.7 GiB [          ] /syslog-9
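
If you'd rather get the same numbers out of Kafka itself, the kafka-log-dirs 
tool that ships with the broker reports the on-disk size of each partition 
replica as JSON (broker address and topic name here are just examples; drop 
--topic-list to see everything):

kafka-log-dirs.sh --bootstrap-server broker1:9092 --describe --topic-list syslog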

Hopefully, someone with better Kafka-fu can suggest a more native way to 
understand, at the partition level, what's causing this behavior.

HTH,

Alex Woolford

On Fri, Feb 7, 2020 at 2:38 PM Dylan Martin 
<dmar...@istreamplanet.com<mailto:dmar...@istreamplanet.com>> wrote:
Hi all!

I have a cluster of about 20 brokers and one of them is transmitting about 4 
times as much data as the others (80 MB/sec vs 20 MB/sec).  It has roughly the 
same number of topics & partitions, and it's the leader for the same number of 
partitions as all the other brokers.  The kafka-manager web tool doesn't say 
it's doing a particularly large amount of work.  Datadog & iftop both agree 
that it's sending out 4 times as much traffic as any of the others.  It's very 
consistent, in that it's been this way for weeks.
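
In case it's useful context, the per-topic breakdown I was hoping to get is 
something like the output of JmxTool below (broker hostname and JMX port 9999 
are placeholders for whatever your brokers expose, and this assumes JmxTool 
accepts an object-name pattern):

kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://suspect-broker:9999/jmxrmi \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=*' \
  --attributes OneMinuteRate --reporting-interval 10000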

Any advice on how to track down what's going on?

Thanks!
-Dylan

