[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779769#comment-17779769 ]
Sergio Troiano edited comment on KAFKA-4084 at 10/26/23 6:45 AM:
-----------------------------------------------------------------

[~wushujames] [~junrao], I wanted to add some extra information about the metrics and issues you have reported: I had similar issues and found out that the root cause of those failures is the OS (I use Linux) reaching its capacity to hold dirty pages in memory. The kernel has a mechanism to check how "quickly" processes write data; to protect itself, it throttles a process that is generating a lot of data whenever it calculates that, at the current rate, the process would end up filling the OS memory with dirty pages. Once the throttle kicks in, the Kafka process effectively moves from "async" to "sync" writes: the OS makes the process wait on each write so the flusher can write out the existing dirty pages before accepting new ones.

I still consider demoting the broker the best option when replacing a volume, but we have built a tool to monitor when the OS is throttling the Kafka process, so at least we have visibility. We saw that while the OS throttles the process, produce request time goes from 30 ms to 3 seconds, which of course creates a lot of problems for the clients.

Here is the script we use to monitor this:
{code:python}
#!/usr/bin/env python3
#
# Detects if the kernel is throttling a process due to high write operations.
# Gets the throttle time in jiffies, more info here:
#   https://www.yugabyte.com/blog/linux-performance-tuning-memory-disk-io/
#   https://msreekan.com/tag/dirty-pages/
# For Linux, uses BCC, eBPF.
#
# USAGE: monitor-writeback-throttle
#
# Copyright 2023 Adevinta.
#
# 28-August-2023 Sergio Troiano Created this.
import argparse
from time import sleep

from bcc import BPF
from datadog import initialize, statsd


def argsparser():
    parser = argparse.ArgumentParser(
        description="Detects if the kernel is throttling a process due to high write operations.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "-f",
        "--probes_source_file",
        default="monitor-writeback-throttle.c",
        type=str,
        help="C source code of the probe to be attached to the kernel trace function",
    )
    parser.add_argument(
        "-i", "--interval", default=3, type=int, help="output interval, in seconds"
    )
    parser.add_argument(
        "-d",
        "--report-to-datadog",
        default=False,
        help="Report metrics to Datadog.",
        action="store_true",
    )
    return parser.parse_args()


def load_probes(args):
    # Compile and load the eBPF probe source into the kernel.
    with open(args.probes_source_file, "r") as file:
        bpf_text = file.read()
    b = BPF(text=bpf_text)
    return b


def print_event(ctx, data, size):
    # Ring buffer callback: one event per balance_dirty_pages() pause.
    event = b["events"].event(data)
    paused_events.append(event.pause)


args = argsparser()
if args.report_to_datadog:
    options = {"statsd_host": "127.0.0.1", "statsd_port": 8125}
    initialize(**options)
    PREFIX = "ebpf."

b = load_probes(args)
b["events"].open_ring_buffer(print_event)

while True:
    paused_events = []
    sleep(int(args.interval))
    b.ring_buffer_poll(timeout=int(args.interval))
    if args.report_to_datadog:
        statsd.gauge("{}writeback_throttle".format(PREFIX), int(any(paused_events)))
    else:
        print({"paused": any(paused_events)})
{code}
And of course you will need the probe in C for eBPF:
{code:c}
#include <linux/ptrace.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

struct event {
    unsigned long pause;
};

BPF_RINGBUF_OUTPUT(events, 8);

// This probe traces the kernel function balance_dirty_pages() from writeback.c
// via the writeback:balance_dirty_pages tracepoint.
TRACEPOINT_PROBE(writeback, balance_dirty_pages) {
    struct event event = {};
    event.pause = args->pause;
    events.ringbuf_output(&event, sizeof(event), 0);
    return 0;
};
{code}
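As a side note (not part of the tool above), the thresholds that drive this throttling are the vm.dirty_* sysctls; a minimal illustrative sketch to dump them on a broker, assuming a standard /proc layout:
{code:python}
#!/usr/bin/env python3
# Prints the kernel writeback settings that control when balance_dirty_pages()
# starts throttling writers. Illustrative companion snippet only.
from pathlib import Path

SETTINGS = [
    "dirty_background_ratio",     # % of memory at which background flushing starts
    "dirty_ratio",                # % of memory at which writers themselves are made to wait
    "dirty_background_bytes",     # absolute-bytes alternatives (0 means the ratio is used)
    "dirty_bytes",
    "dirty_expire_centisecs",     # age after which dirty pages become eligible for writeback
    "dirty_writeback_centisecs",  # how often the flusher threads wake up
]

for name in SETTINGS:
    value = Path("/proc/sys/vm", name).read_text().strip()
    print("vm.{} = {}".format(name, value))
{code}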
> automated leader rebalance causes replication downtime for clusters with too many partitions
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority:
Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and you have a cluster with many partitions, there is a severe amount of replication downtime following a restart. This causes `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes leaders for *all* imbalanced partitions at once, instead of doing it gradually. This effectively stops all replica fetchers in the cluster (assuming there are enough imbalanced partitions), and restarts them. This can take minutes on busy clusters, during which no replication is happening and user data is at risk. Clients with {{acks=-1}} also see issues at this time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election manually. There is also a broker configuration “auto.leader.rebalance.enable” which you can set to have the broker automatically perform the PLE when needed. DO NOT USE THIS OPTION. There are serious performance issues when doing so, especially on larger clusters. It needs some development work that has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of partitions to do automated leader rebalancing for at once, and *stop* once that number of leader rebalances are in flight, until they're done. There may be better mechanisms, and I'd love to hear if anybody has any ideas.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
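For readers on the affected versions, the workaround Todd describes maps roughly to the broker settings and CLI below; this is a sketch only (on newer Kafka releases kafka-leader-election.sh replaces kafka-preferred-replica-election.sh), and the partitions.json file name is just an example:
{code}
# server.properties: disable the automatic, all-at-once preferred leader election
auto.leader.rebalance.enable=false

# the settings that drive the automatic election when it is left enabled
# (defaults shown: 10% imbalance, checked every 300 seconds)
leader.imbalance.per.broker.percentage=10
leader.imbalance.check.interval.seconds=300

# then trigger the preferred replica election manually, in controlled batches,
# by listing only a subset of partitions in a JSON file such as
#   {"partitions": [{"topic": "my-topic", "partition": 0}]}
#
#   bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181 \
#       --path-to-json-file partitions.json
{code}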