Hi,

I'm running a 9-node Kafka 0.10.2.1 cluster in AWS, and 3-4 of the nodes
experience high IO wait while the others don't.
In terms of read/write load, the high-IO nodes look identical to the others.
All nodes were at around 40% disk space usage.
There are no errors or warnings in the logs.

Here is part of the output from running iostat -kx 2 5:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.99    0.00    8.87   18.38    0.26   54.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     5.00    0.00    2.50     0.00    30.00    24.00     0.00    0.00   0.00   0.00
xvdf              0.00     8.00    1.00   50.00     4.00   736.75    29.05   149.10 3128.82  19.61 100.00

The cluster is composed of r4.xlarge nodes, each with a single 1 TB st1
volume used for Kafka data (xvdf).
According to the AWS docs, a 1 TB st1 EBS volume should have 40 MiB/s base
throughput.
As you can see from above, it's only writing at ~736 kB/s. The average write
size is also very small at ~15 kB.
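
For anyone who wants to re-check the arithmetic behind those two figures, here is a minimal sketch. It uses only the xvdf numbers from the iostat sample above and the 40 MiB/s baseline quoted from the AWS docs; the variable names are made up for illustration, and it assumes iostat reports avgrq-sz in 512-byte sectors.

```python
# Figures copied from the iostat sample above for xvdf.
writes_per_sec = 50.00        # w/s
write_kb_per_sec = 736.75     # wkB/s
avg_request_sectors = 29.05   # avgrq-sz (covers reads and writes)
SECTOR_BYTES = 512            # assumption: avgrq-sz is in 512-byte sectors

# Baseline throughput quoted above for a 1 TB st1 volume.
baseline_mib_per_sec = 40.0

# Average size of a single write request, in kB.
avg_write_kb = write_kb_per_sec / writes_per_sec

# Average request size as reported by iostat, converted to kB.
avg_request_kb = avg_request_sectors * SECTOR_BYTES / 1024

# Observed write throughput as a fraction of the quoted baseline.
baseline_fraction = write_kb_per_sec / (baseline_mib_per_sec * 1024)

print(f"average write size:   {avg_write_kb:.1f} kB")
print(f"average request size: {avg_request_kb:.1f} kB")
print(f"share of baseline:    {baseline_fraction:.1%}")
```

Both ways of computing the request size land close to 15 kB, and the observed write rate works out to roughly 2% of the quoted baseline.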

Running iotop shows the following most of the time:
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
14321 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kworker/u30:1]
Occasionally the Kafka java process also flashes up at 99.99%, just like
kworker.

Additional info:

   - There are about 60 topics total, 60 partitions each, all with
   replication factor 3. Topic/partition distribution is very uniform across
   nodes.
   - Total message ingest rate for the cluster is only 2.2 MB/s at
   ~8000 msg/s. Load is very uniform across nodes.
   - Some possibly relevant configs in server.properties:
   log.retention.hours=168
   log.segment.bytes=1073741824
   log.retention.check.interval.ms=300000
   num.io.threads=1
   - I'm using XFS as the file system, with default configs, on Amazon Linux.
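
To put the figures in this list into per-node terms, here is a rough back-of-the-envelope sketch. It assumes the perfectly uniform distribution described above and that every replica of every message is eventually written to the data volume; the variable names are hypothetical, and this is only a sanity check, not a diagnosis.

```python
# Cluster figures from the list above.
topics = 60
partitions_per_topic = 60
replication_factor = 3
brokers = 9
ingest_mb_per_sec = 2.2      # cluster-wide, before replication
ingest_msgs_per_sec = 8000   # cluster-wide, before replication

# Total partition replicas in the cluster and per broker (uniform spread assumed).
replicas_total = topics * partitions_per_topic * replication_factor
replicas_per_broker = replicas_total / brokers

# Data each broker must write once replication is included.
disk_write_mb_per_broker = ingest_mb_per_sec * replication_factor / brokers

# Message rate seen by a single partition replica.
msgs_per_replica_per_sec = (
    ingest_msgs_per_sec * replication_factor / replicas_total
)

# Average message size implied by the two ingest figures.
avg_msg_bytes = ingest_mb_per_sec * 1_000_000 / ingest_msgs_per_sec

print(f"partition replicas per broker: {replicas_per_broker:.0f}")
print(f"disk writes per broker:        {disk_write_mb_per_broker:.2f} MB/s")
print(f"messages per replica:          {msgs_per_replica_per_sec:.2f} msg/s")
print(f"average message size:          {avg_msg_bytes:.0f} bytes")
```

Under these assumptions each broker hosts about 1200 partition replicas and writes on the order of 0.7 MB/s, which is in the same range as the wkB/s value iostat reports for xvdf.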

So far I have tried restarting the high-IO nodes, but that doesn't seem
to help.
I also tried increasing num.io.threads, which did not help either.
I'm a little skeptical of just increasing the EBS volume size for more IOPS,
given the IO stats above.

I'm fairly new to using Kafka so perhaps something is very wrong with my
setup.
If anyone has any suggestions it would be greatly appreciated.

Thanks,
Xiaochuan Yu
