Sequential writes make Kafka fast, or so they say
A question about something that has always been in the back of my mind. According to Jay Kreps:

> The first [reason that Kafka is so fast despite writing to disk] is that Kafka does only sequential file I/O.

I wonder how true this statement is, because Kafka keeps at least 3 files per partition (the active log segment plus its offset and time indexes). So even with a single topic and a single partition per broker and disk, the writes would not be strictly sequential. Now say we have 1000 partitions per broker/disk, i.e. 3000 files. How can concurrent/interleaved writes to thousands of files on a single disk be considered sequential file I/O?

Isn't the real reason Kafka is so fast despite writing to disk that it does not fsync to disk, leaving that to the OS? The OS would, I assume, be smart enough to order the writes when it flushes its caches to disk in a way that minimizes random seeks. But then, wouldn't the manner in which Kafka writes to files be more or less irrelevant? Or, put differently: if Kafka were synchronously flushing to disk, wouldn't it have to limit itself to writing all partitions for a broker/disk to a single file if it wanted to do sequential file I/O?

For reading (historical, non-realtime) data that is not in the OS cache, keeping it in append-only files of course makes sense.
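For what it's worth, the flush behavior discussed above is configurable per broker; out of the box Kafka leaves flushing to the OS. A sketch of the relevant server.properties knobs (the commented values illustrate the defaults; this is not a tuning recommendation):

```properties
# Force an fsync after this many messages; the default is effectively
# "never", relying on the OS page cache and on replication for durability.
#log.flush.interval.messages=9223372036854775807

# Or force a flush after a message has sat unflushed for this long (ms);
# unset by default.
#log.flush.interval.ms=1000
```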
Re: log.dirs and SSDs
Thanks! Will let this list know if and when I run a log.dirs vs. num.io.threads test.

> Hah :) I think this deserves an experiment. I’d try setting up some tests with one, two, four, and eight log directories per disk and running some performance tests. I’d be interested to see your results.
Re: log.dirs and SSDs
Hah :) I think this deserves an experiment. I’d try setting up some tests with one, two, four, and eight log directories per disk and running some performance tests. I’d be interested to see your results.

> On Mar 11, 2020, at 5:45 PM, Eugen Dueck wrote:
>
> I'm asking the questions here!
> So is that the way to tune the broker if it does not achieve disk throughput?
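Outside Kafka entirely, a quick feel for the single-disk question can be had with a plain-file microbenchmark. Below is a minimal Python sketch (file names, sizes, and counts are made up for illustration) that appends the same total volume either to one file or round-robin across eight files, and times both:

```python
import os
import tempfile
import time

def interleaved_append(n_files: int, chunk: bytes, n_chunks: int, root: str) -> float:
    """Append n_chunks chunks round-robin across n_files files; return seconds."""
    paths = [os.path.join(root, f"log-{i}.bin") for i in range(n_files)]
    handles = [open(p, "ab") for p in paths]
    start = time.perf_counter()
    for i in range(n_chunks):
        handles[i % n_files].write(chunk)
    for h in handles:
        h.flush()
        os.fsync(h.fileno())  # force real disk I/O so the page cache doesn't hide the cost
        h.close()
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as root:
    chunk = b"x" * 4096
    t1 = interleaved_append(1, chunk, 2048, root)  # one file: purely sequential appends
    t8 = interleaved_append(8, chunk, 2048, root)  # eight files: interleaved appends
    print(f"1 file: {t1:.3f}s, 8 files: {t8:.3f}s")
```

On a spinning disk the interleaved case tends to get slower once real seeks are involved; on an SSD, or when the OS is left free to batch writes in its page cache, the difference often shrinks, which is the point raised in the first post of this thread.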
Re: log.dirs and SSDs
The reason I'm asking is that the documentation is quite brief:

    num.io.threads: The number of threads that the server uses for processing requests, which may include disk I/O

And none of the Google hits revealed any more details.

> I'm asking the questions here!
> So is that the way to tune the broker if it does not achieve disk throughput?
>
> From: Peter Bukowinski
> Sent: March 12, 2020, 9:38
>
>> Couldn’t the same be accomplished by increasing the num.io.threads broker setting?
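For reference, the setting under discussion lives in server.properties; a sketch (the value shown is the broker default in Kafka 2.x, not a recommendation):

```properties
# The number of threads that the server uses for processing requests,
# which may include disk I/O (broker default: 8)
num.io.threads=8
```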
Re: what happened in case of single disk failure
Thanks, very helpful!

Peter Bukowinski wrote on Thu, Mar 12, 2020, 5:48 AM:

> Yes, that’s correct. While a broker is down:
>
> - all topic partitions assigned to that broker will be under-replicated
> - topic partitions with an unmet minimum ISR count will be offline
> - leadership of partitions meeting the minimum ISR count will move to the next in-sync replica in the replica list
> - if no in-sync replica exists for a topic-partition, it will be offline
>
> Setting unclean.leader.election.enable=true will allow an out-of-sync replica to become a leader. If topic partition availability is more important to you than data integrity, you should allow unclean leader election.
>
>> On Mar 11, 2020, at 6:11 AM, 张祥 wrote:
>>
>> Hi, Peter, following what we talked about before, I want to understand what will happen when one broker goes down. I would say it will be very similar to what happens under disk failure, except that the rules apply to all the partitions on that broker instead of only one malfunctioned disk. Am I right? Thanks.
>>
>>> Thanks Peter, really appreciate it. (张祥, Mar 5, 2020)
>>>
>>>> Yes, you should restart the broker. I don’t believe there’s any code to check if a log directory previously marked as failed has returned to healthy.
>>>>
>>>> I always restart the broker after a hardware repair. I treat broker restarts as a normal, non-disruptive operation in my clusters. I use a minimum of 3x replication.
>>>>
>>>> -- Peter (from phone), Mar 4, 2020
>>>>
>>>>> Another question: according to my memory, the broker needs to be restarted after replacing the disk to recover. Is that correct? If so, I take it that Kafka cannot know by itself that the disk has been replaced; a manual restart is necessary. (张祥, Mar 4, 2020)
>>>>>
>>>>>> Thanks Peter, it makes a lot of sense. (张祥, Mar 4, 2020)
>>>>>>
>>>>>>> Whether your brokers have a single data directory or multiple data directories on separate disks, when a disk fails, the topic partitions located on that disk become unavailable. What happens next depends on how your cluster and topics are configured.
>>>>>>>
>>>>>>> If the topics on the affected broker have replicas and the minimum ISR (in-sync replicas) count is met, then all topic partitions will remain online and leaders will move to another broker. Producers and consumers will continue to operate as usual.
>>>>>>>
>>>>>>> If the topics don’t have replicas or the minimum ISR count is not met, then the topic partitions on the failed disk will be offline. Producers can still send data to the affected topics — it will just go to the online partitions. Consumers can still consume data from the online partitions.
>>>>>>>
>>>>>>> -- Peter (Mar 3, 2020)
>>>>>>>
>>>>>>>> Hi community, I ran into disk failure when using Kafka, and fortunately it did not crash the entire cluster. So I am wondering how Kafka handles multiple disks and how it manages to keep working in case of a single disk failure. The more detailed, the better. Thanks! (张祥, Mar 2, 2020)
Re: log.dirs and SSDs
I'm asking the questions here! So is that the way to tune the broker if it does not achieve disk throughput?

From: Peter Bukowinski
Sent: March 12, 2020, 9:38

> Couldn’t the same be accomplished by increasing the num.io.threads broker setting?
Re: log.dirs and SSDs
Couldn’t the same be accomplished by increasing the num.io.threads broker setting?

> On Mar 11, 2020, at 5:15 PM, Eugen Dueck wrote:
>
> So there is not, e.g., a single thread responsible per directory in log.dirs that could become a bottleneck relative to SSD throughput of GB/s?
>
> This is in fact the case for Apache Pulsar, and the openmessaging benchmark uses 4 directories on the same SSD to increase throughput.
Re: log.dirs and SSDs
So there is not, e.g., a single thread responsible per directory in log.dirs that could become a bottleneck relative to SSD throughput of GB/s?

This is in fact the case for Apache Pulsar, and the openmessaging benchmark uses 4 directories on the same SSD to increase throughput.

From: Peter Bukowinski
Sent: March 12, 2020, 8:51
To: users@kafka.apache.org
Subject: Re: log.dirs and SSDs

> Given a constant number of partitions, I don’t see any advantage to splitting partitions among multiple log directories vs. keeping them all in one (per disk). You’d still have the same total number of topic-partition directories and the same number of topic-partition leaders.
>
> If you want to increase throughput, focus on using the appropriate number of partitions.
>
> — Peter Bukowinski
Re: log.dirs and SSDs
> On Mar 11, 2020, at 4:28 PM, Eugen Dueck wrote:
>
> So log.dirs should contain only one entry per HDD disk, to avoid random seeks.
> What about SSDs? Can throughput be increased by specifying multiple directories on the same SSD?

Given a constant number of partitions, I don’t see any advantage to splitting partitions among multiple log directories vs. keeping them all in one (per disk). You’d still have the same total number of topic-partition directories and the same number of topic-partition leaders.

If you want to increase throughput, focus on using the appropriate number of partitions.

—
Peter Bukowinski
log.dirs and SSDs
So log.dirs should contain only one entry per HDD disk, to avoid random seeks. What about SSDs? Can throughput be increased by specifying multiple directories on the same SSD?
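For context, log.dirs is a comma-separated list in server.properties; a sketch with illustrative paths, one directory per physical disk as discussed:

```properties
# One data directory per physical disk; Kafka assigns each newly created
# partition to whichever listed directory currently holds the fewest partitions.
log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs,/mnt/disk3/kafka-logs
```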
Re: what happened in case of single disk failure
Yes, that’s correct. While a broker is down:

- all topic partitions assigned to that broker will be under-replicated
- topic partitions with an unmet minimum ISR count will be offline
- leadership of partitions meeting the minimum ISR count will move to the next in-sync replica in the replica list
- if no in-sync replica exists for a topic-partition, it will be offline

Setting unclean.leader.election.enable=true will allow an out-of-sync replica to become a leader. If topic partition availability is more important to you than data integrity, you should allow unclean leader election.

> On Mar 11, 2020, at 6:11 AM, 张祥 wrote:
>
> Hi, Peter, following what we talked about before, I want to understand what will happen when one broker goes down. I would say it will be very similar to what happens under disk failure, except that the rules apply to all the partitions on that broker instead of only one malfunctioned disk. Am I right? Thanks.
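The leadership rules above can be sketched as a toy selection function (this is not Kafka's actual controller code; the function and parameter names are invented for illustration):

```python
from typing import Optional

def pick_leader(replicas: list[int], isr: set[int], down: int,
                unclean: bool = False) -> Optional[int]:
    """Pick a new leader for a partition when broker `down` fails.

    replicas: assigned replica brokers, in preference order
    isr:      brokers currently in sync with the leader
    unclean:  allow an out-of-sync replica to lead (may lose data)
    """
    # The failed broker can no longer lead and drops out of the ISR.
    isr = isr - {down}
    for b in replicas:
        if b != down and b in isr:
            return b           # first live in-sync replica in the replica list wins
    if unclean:
        for b in replicas:
            if b != down:
                return b       # out-of-sync replica: availability over integrity
    return None                # no eligible leader: partition goes offline
```

For example, with replicas [1, 2, 3] all in sync and broker 1 failing, leadership moves to broker 2; with only broker 1 in sync, the partition goes offline unless unclean election is allowed.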
Re: what happened in case of single disk failure
Hi, Peter, following what we talked about before, I want to understand what will happen when one broker goes down. I would say it will be very similar to what happens under disk failure, except that the rules apply to all the partitions on that broker instead of only one malfunctioned disk. Am I right? Thanks.

张祥 wrote on Thu, Mar 5, 2020, 9:25 AM:

> Thanks Peter, really appreciate it.
>
> Peter Bukowinski wrote on Wed, Mar 4, 2020, 11:50 PM:
>
>> Yes, you should restart the broker. I don’t believe there’s any code to check if a log directory previously marked as failed has returned to healthy.
>>
>> I always restart the broker after a hardware repair. I treat broker restarts as a normal, non-disruptive operation in my clusters. I use a minimum of 3x replication.
>>
>> -- Peter (from phone)