Sequential writes make Kafka fast, or so they say

2020-03-11 Thread Eugen Dueck
A question about something that was always in the back of my mind.

According to Jay Kreps

> The first [reason that Kafka is so fast despite writing to disk] is that 
> Kafka does only sequential file I/O.

I wonder how true this statement is, because Kafka keeps 3 files per partition
segment (the log plus its offset and time indexes), so even with a single topic
and partition per broker and disk, the writes would not be strictly sequential.
Now say we have 1000 partitions per broker/disk, i.e. 3000 files. How can
concurrent/interleaved writes to thousands of files on a single disk be
considered sequential file I/O?

Isn't the reason Kafka is so fast despite writing to disk the fact that it does 
not fsync to disk, leaving that to the OS? The OS would, I assume, be smart 
enough to order the writes when it flushes its caches to disk in a way that 
minimizes random seeks. But then, wouldn't the manner in which Kafka writes to 
files be more or less irrelevant? Or, put differently: if Kafka were
synchronously flushing to disk, wouldn't it have to limit itself to writing all
partitions for a broker/disk to a single file if it wanted to do sequential
file I/O?

For reading (historical, non-real-time) data that is not in the OS cache, the
statement of course makes sense, since the data is kept in append-only files.
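
Incidentally, the flushing behaviour I am talking about is, as far as I
understand, governed by the broker's flush settings, which are effectively
disabled by default (the values below are only illustrative, not
recommendations):

    # server.properties -- both are effectively unset by default, so Kafka never
    # fsyncs explicitly and leaves flushing of dirty pages to the OS
    #log.flush.interval.messages=10000
    #log.flush.interval.ms=1000

If one were to enable these, the broker would fsync periodically, and I suspect
the sequential-vs-interleaved question would then matter a lot more.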


Re: log.dirs and SSDs

2020-03-11 Thread Eugen Dueck
Thanks! Will let this list know if and when I run a log.dirs vs. num.io.threads 
test.



Hah :)

I think this deserves an experiment. I’d try setting up some tests with one, 
two, four, and eight log directories per disk and running some performance 
tests. I’d be interested to see your results.

> On Mar 11, 2020, at 5:45 PM, Eugen Dueck  wrote:
>
> I'm asking the questions here! 
> So is that the way to tune the broker if it does not achieve disk throughput?
>
> 
> From: Peter Bukowinski
> Sent: March 12, 2020 9:38
>
> Couldn’t the same be accomplished by increasing the num.io.threads broker 
> setting?
>
>> On Mar 11, 2020, at 5:15 PM, Eugen Dueck  wrote:
>>
>> So there is not e.g. a single thread responsible per directory in log.dirs 
>> that could become a bottleneck relative to SSD throughput of GB/s?
>>
>> This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
>> uses 4 directories on the same SSD to increase throughput.
>>
>> 
>> From: Peter Bukowinski
>> Sent: March 12, 2020 8:51
>>
>>> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>>>
>>> So log.dirs should contain only one entry per HDD disk, to avoid random 
>>> seeks.
>>> What about SSDs? Can throughput be increased by specifying multiple 
>>> directories on the same SSD?
>>
>>
>> Given a constant number of partitions, I don’t see any advantage to 
>> splitting partitions among multiple log directories vs. keeping them all in 
>> one (per disk). You’d still have the same total number of topic-partition 
>> directories and the same number of topic-partition leaders.
>>
>> If you want to increase throughput, focus on using the appropriate number of 
>> partitions.
>>
>> —
>> Peter Bukowinski
>



Re: log.dirs and SSDs

2020-03-11 Thread Peter Bukowinski
Hah :)

I think this deserves an experiment. I’d try setting up some tests with one, 
two, four, and eight log directories per disk and running some performance 
tests. I’d be interested to see your results.
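
For what it's worth, here is a sketch of how I'd drive such a test with the
bundled producer perf tool (topic name, record counts and bootstrap server are
placeholders, so adjust them to your setup):

    bin/kafka-producer-perf-test.sh \
      --topic perf-test \
      --num-records 10000000 \
      --record-size 1024 \
      --throughput -1 \
      --producer-props bootstrap.servers=localhost:9092 acks=1

and then compare the reported MB/sec across the one/two/four/eight-directory
layouts.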

> On Mar 11, 2020, at 5:45 PM, Eugen Dueck  wrote:
> 
> I'm asking the questions here! 
> So is that the way to tune the broker if it does not achieve disk throughput?
> 
> 
> From: Peter Bukowinski
> Sent: March 12, 2020 9:38
> 
> Couldn’t the same be accomplished by increasing the num.io.threads broker 
> setting?
> 
>> On Mar 11, 2020, at 5:15 PM, Eugen Dueck  wrote:
>> 
>> So there is not e.g. a single thread responsible per directory in log.dirs 
>> that could become a bottleneck relative to SSD throughput of GB/s?
>> 
>> This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
>> uses 4 directories on the same SSD to increase throughput.
>> 
>> 
>> From: Peter Bukowinski
>> Sent: March 12, 2020 8:51
>> 
>>> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>>> 
>>> So log.dirs should contain only one entry per HDD disk, to avoid random 
>>> seeks.
>>> What about SSDs? Can throughput be increased by specifying multiple 
>>> directories on the same SSD?
>> 
>> 
>> Given a constant number of partitions, I don’t see any advantage to 
>> splitting partitions among multiple log directories vs. keeping them all in 
>> one (per disk). You’d still have the same total number of topic-partition 
>> directories and the same number of topic-partition leaders.
>> 
>> If you want to increase throughput, focus on using the appropriate number of 
>> partitions.
>> 
>> —
>> Peter Bukowinski
> 



Re: log.dirs and SSDs

2020-03-11 Thread Eugen Dueck
The reason I'm asking is that the documentation is quite brief:

num.io.threads: The number of threads that the server uses for processing 
requests, which may include disk I/O

And none of the google hits revealed any more details.
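
For reference, these are the thread-related broker settings I have been looking
at, with the defaults as I understand them (please double-check against your
Kafka version):

    # threads that process requests, which may include disk I/O (default 8, I believe)
    num.io.threads=8
    # threads that handle network requests and responses (default 3, I believe)
    num.network.threads=3
    # fetcher threads used to replicate from each source broker (default 1)
    num.replica.fetchers=1
    # threads per entry in log.dirs, but apparently only for log recovery at
    # startup and flushing at shutdown (default 1)
    num.recovery.threads.per.data.dir=1

Only num.recovery.threads.per.data.dir is tied to log directories, and only for
startup/shutdown, which is what prompted my question about regular writes.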



I'm asking the questions here! 
So is that the way to tune the broker if it does not achieve disk throughput?


From: Peter Bukowinski
Sent: March 12, 2020 9:38

Couldn’t the same be accomplished by increasing the num.io.threads broker 
setting?

> On Mar 11, 2020, at 5:15 PM, Eugen Dueck  wrote:
>
> So there is not e.g. a single thread responsible per directory in log.dirs 
> that could become a bottleneck relative to SSD throughput of GB/s?
>
> This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
> uses 4 directories on the same SSD to increase throughput.
>
> 
> From: Peter Bukowinski
> Sent: March 12, 2020 8:51
>
>> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>>
>> So log.dirs should contain only one entry per HDD disk, to avoid random 
>> seeks.
>> What about SSDs? Can throughput be increased by specifying multiple 
>> directories on the same SSD?
>
>
> Given a constant number of partitions, I don’t see any advantage to splitting 
> partitions among multiple log directories vs. keeping them all in one (per 
> disk). You’d still have the same total number of topic-partition directories 
> and the same number of topic-partition leaders.
>
> If you want to increase throughput, focus on using the appropriate number of 
> partitions.
>
> —
> Peter Bukowinski



Re: what happened in case of single disk failure

2020-03-11 Thread 张祥
Thanks, very helpful !

Peter Bukowinski wrote on Thursday, March 12, 2020 at 5:48 AM:

> Yes, that’s correct. While a broker is down:
>
> - all topic partitions assigned to that broker will be under-replicated
> - topic partitions with an unmet minimum ISR count will be offline
> - leadership of partitions meeting the minimum ISR count will move to the
>   next in-sync replica in the replica list
> - if no in-sync replica exists for a topic-partition, it will be offline
>
> Setting unclean.leader.election.enable=true will allow an out-of-sync
> replica to become a leader. If topic partition availability is more
> important to you than data integrity, you should allow unclean leader
> election.
>
>
> > On Mar 11, 2020, at 6:11 AM, 张祥  wrote:
> >
> > Hi, Peter, following what we talked about before, I want to understand what
> > will happen when one broker goes down, I would say it will be very similar
> > to what happens under disk failure, except that the rules apply to all the
> > partitions on that broker instead of only one malfunctioned disk. Am I
> > right? Thanks.
> >
> > 张祥 wrote on Thursday, March 5, 2020 at 9:25 AM:
> >
> >> Thanks Peter, really appreciate it.
> >>
> >> Peter Bukowinski wrote on Wednesday, March 4, 2020 at 11:50 PM:
> >>
> >>> Yes, you should restart the broker. I don’t believe there’s any code to
> >>> check if a Log directory previously marked as failed has returned to
> >>> healthy.
> >>>
> >>> I always restart the broker after a hardware repair. I treat broker
> >>> restarts as a normal, non-disruptive operation in my clusters. I use a
> >>> minimum of 3x replication.
> >>>
> >>> -- Peter (from phone)
> >>>
> >>>> On Mar 4, 2020, at 12:46 AM, 张祥  wrote:
> >>>>
> >>>> Another question, according to my memory, the broker needs to be
> >>>> restarted after replacing the disk to recover from this. Is that correct?
> >>>> If so, I take it that Kafka cannot know by itself that the disk has been
> >>>> replaced, and a manual restart is necessary.
> >>>>
> >>>> 张祥 wrote on Wednesday, March 4, 2020 at 2:48 PM:
> >>>>
> >>>>> Thanks Peter, it makes a lot of sense.
> >>>>>
> >>>>> Peter Bukowinski wrote on Tuesday, March 3, 2020 at 11:56 AM:
> >>>>>
> >>>>>> Whether your brokers have a single data directory or multiple data
> >>>>>> directories on separate disks, when a disk fails, the topic partitions
> >>>>>> located on that disk become unavailable. What happens next depends on
> >>>>>> how your cluster and topics are configured.
> >>>>>>
> >>>>>> If the topics on the affected broker have replicas and the minimum ISR
> >>>>>> (in-sync replicas) count is met, then all topic partitions will remain
> >>>>>> online and leaders will move to another broker. Producers and consumers
> >>>>>> will continue to operate as usual.
> >>>>>>
> >>>>>> If the topics don’t have replicas or the minimum ISR count is not met,
> >>>>>> then the topic partitions on the failed disk will be offline. Producers
> >>>>>> can still send data to the affected topics — it will just go to the
> >>>>>> online partitions. Consumers can still consume data from the online
> >>>>>> partitions.
> >>>>>>
> >>>>>> -- Peter
> >>>>>>
> >>>>>>> On Mar 2, 2020, at 7:00 PM, 张祥  wrote:
> >>>>>>>
> >>>>>>> Hi community,
> >>>>>>>
> >>>>>>> I ran into disk failure when using Kafka, and fortunately it did not
> >>>>>>> crash the entire cluster. So I am wondering how Kafka handles multiple
> >>>>>>> disks and it manages to work in case of single disk failure. The more
> >>>>>>> detailed, the better. Thanks !
>


Re: log.dirs and SSDs

2020-03-11 Thread Eugen Dueck
I'm asking the questions here! 
So is that the way to tune the broker if it does not achieve disk throughput?


From: Peter Bukowinski
Sent: March 12, 2020 9:38

Couldn’t the same be accomplished by increasing the num.io.threads broker 
setting?

> On Mar 11, 2020, at 5:15 PM, Eugen Dueck  wrote:
>
> So there is not e.g. a single thread responsible per directory in log.dirs 
> that could become a bottleneck relative to SSD throughput of GB/s?
>
> This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
> uses 4 directories on the same SSD to increase throughput.
>
> 
> From: Peter Bukowinski
> Sent: March 12, 2020 8:51
>
>> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>>
>> So log.dirs should contain only one entry per HDD disk, to avoid random 
>> seeks.
>> What about SSDs? Can throughput be increased by specifying multiple 
>> directories on the same SSD?
>
>
> Given a constant number of partitions, I don’t see any advantage to splitting 
> partitions among multiple log directories vs. keeping them all in one (per 
> disk). You’d still have the same total number of topic-partition directories 
> and the same number of topic-partition leaders.
>
> If you want to increase throughput, focus on using the appropriate number of 
> partitions.
>
> —
> Peter Bukowinski



Re: log.dirs and SSDs

2020-03-11 Thread Peter Bukowinski
Couldn’t the same be accomplished by increasing the num.io.threads broker 
setting?

> On Mar 11, 2020, at 5:15 PM, Eugen Dueck  wrote:
> 
> So there is not e.g. a single thread responsible per directory in log.dirs 
> that could become a bottleneck relative to SSD throughput of GB/s?
> 
> This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
> uses 4 directories on the same SSD to increase throughput.
> 
> 
> From: Peter Bukowinski
> Sent: March 12, 2020 8:51
> To: users@kafka.apache.org
> Subject: Re: log.dirs and SSDs
> 
>> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>> 
>> So log.dirs should contain only one entry per HDD disk, to avoid random 
>> seeks.
>> What about SSDs? Can throughput be increased by specifying multiple 
>> directories on the same SSD?
> 
> 
> Given a constant number of partitions, I don’t see any advantage to splitting 
> partitions among multiple log directories vs. keeping them all in one (per 
> disk). You’d still have the same total number of topic-partition directories 
> and the same number of topic-partition leaders.
> 
> If you want to increase throughput, focus on using the appropriate number of 
> partitions.
> 
> —
> Peter Bukowinski



Re: log.dirs and SSDs

2020-03-11 Thread Eugen Dueck
So there is not e.g. a single thread responsible per directory in log.dirs that 
could become a bottleneck relative to SSD throughput of GB/s?

This is in fact the case for Apache Pulsar, and the openmessaging benchmark 
uses 4 directories on the same SSD to increase throughput.


From: Peter Bukowinski
Sent: March 12, 2020 8:51
To: users@kafka.apache.org
Subject: Re: log.dirs and SSDs

> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
>
> So log.dirs should contain only one entry per HDD disk, to avoid random seeks.
> What about SSDs? Can throughput be increased by specifying multiple 
> directories on the same SSD?


Given a constant number of partitions, I don’t see any advantage to splitting 
partitions among multiple log directories vs. keeping them all in one (per 
disk). You’d still have the same total number of topic-partition directories 
and the same number of topic-partition leaders.

If you want to increase throughput, focus on using the appropriate number of 
partitions.

—
Peter Bukowinski


Re: log.dirs and SSDs

2020-03-11 Thread Peter Bukowinski
> On Mar 11, 2020, at 4:28 PM, Eugen Dueck  wrote:
> 
> So log.dirs should contain only one entry per HDD disk, to avoid random seeks.
> What about SSDs? Can throughput be increased by specifying multiple 
> directories on the same SSD?


Given a constant number of partitions, I don’t see any advantage to splitting 
partitions among multiple log directories vs. keeping them all in one (per 
disk). You’d still have the same total number of topic-partition directories 
and the same number of topic-partition leaders.

If you want to increase throughput, focus on using the appropriate number of 
partitions.
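
For example (numbers purely illustrative, and assuming a broker recent enough
that kafka-topics.sh accepts --bootstrap-server), creating a topic with more
partitions looks like:

    bin/kafka-topics.sh --create \
      --bootstrap-server localhost:9092 \
      --topic my-topic \
      --partitions 12 \
      --replication-factor 3

Roughly speaking, more partitions give you more parallelism on the producer and
broker side, until the disk or network becomes the limit.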

—
Peter Bukowinski

log.dirs and SSDs

2020-03-11 Thread Eugen Dueck
So log.dirs should contain only one entry per HDD disk, to avoid random seeks.
What about SSDs? Can throughput be increased by specifying multiple directories 
on the same SSD?
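
To make the question concrete, the two layouts I have in mind would look
roughly like this (paths are made up):

    # one data directory per physical disk (what I understand the usual advice to be)
    log.dirs=/data/disk1/kafka-logs,/data/disk2/kafka-logs

    # versus several directories on the same SSD
    log.dirs=/ssd/kafka-logs-1,/ssd/kafka-logs-2,/ssd/kafka-logs-3,/ssd/kafka-logs-4

i.e. does the second layout buy any extra throughput on an SSD, or is it
pointless?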


Re: what happened in case of single disk failure

2020-03-11 Thread Peter Bukowinski
Yes, that’s correct. While a broker is down:

- all topic partitions assigned to that broker will be under-replicated
- topic partitions with an unmet minimum ISR count will be offline
- leadership of partitions meeting the minimum ISR count will move to the next
  in-sync replica in the replica list
- if no in-sync replica exists for a topic-partition, it will be offline

Setting unclean.leader.election.enable=true will allow an out-of-sync replica
to become a leader. If topic partition availability is more important to you
than data integrity, you should allow unclean leader election.
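
As a rough sketch (illustrative values, not a recommendation), these are the
settings involved:

    # broker-level defaults; the last two can also be overridden per topic
    default.replication.factor=3
    min.insync.replicas=2
    # set to true to favour availability over data integrity
    unclean.leader.election.enable=false

With acks=all producers, min.insync.replicas determines how many replicas must
be in sync before a write is acknowledged.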


> On Mar 11, 2020, at 6:11 AM, 张祥  wrote:
> 
> Hi, Peter, following what we talked about before, I want to understand what
> will happen when one broker goes down, I would say it will be very similar
> to what happens under disk failure, except that the rules apply to all the
> partitions on that broker instead of only one malfunctioned disk. Am I
> right? Thanks.
> 
> 张祥 wrote on Thursday, March 5, 2020 at 9:25 AM:
> 
>> Thanks Peter, really appreciate it.
>> 
>> Peter Bukowinski wrote on Wednesday, March 4, 2020 at 11:50 PM:
>> 
>>> Yes, you should restart the broker. I don’t believe there’s any code to
>>> check if a Log directory previously marked as failed has returned to
>>> healthy.
>>> 
>>> I always restart the broker after a hardware repair. I treat broker
>>> restarts as a normal, non-disruptive operation in my clusters. I use a
>>> minimum of 3x replication.
>>> 
>>> -- Peter (from phone)
>>> 
>>>> On Mar 4, 2020, at 12:46 AM, 张祥  wrote:
>>>>
>>>> Another question, according to my memory, the broker needs to be restarted
>>>> after replacing the disk to recover from this. Is that correct? If so, I
>>>> take it that Kafka cannot know by itself that the disk has been replaced,
>>>> and a manual restart is necessary.
>>>>
>>>> 张祥 wrote on Wednesday, March 4, 2020 at 2:48 PM:
>>>>
>>>>> Thanks Peter, it makes a lot of sense.
>>>>>
>>>>> Peter Bukowinski wrote on Tuesday, March 3, 2020 at 11:56 AM:
>>>>>
>>>>>> Whether your brokers have a single data directory or multiple data
>>>>>> directories on separate disks, when a disk fails, the topic partitions
>>>>>> located on that disk become unavailable. What happens next depends on
>>>>>> how your cluster and topics are configured.
>>>>>>
>>>>>> If the topics on the affected broker have replicas and the minimum ISR
>>>>>> (in-sync replicas) count is met, then all topic partitions will remain
>>>>>> online and leaders will move to another broker. Producers and consumers
>>>>>> will continue to operate as usual.
>>>>>>
>>>>>> If the topics don’t have replicas or the minimum ISR count is not met,
>>>>>> then the topic partitions on the failed disk will be offline. Producers
>>>>>> can still send data to the affected topics — it will just go to the
>>>>>> online partitions. Consumers can still consume data from the online
>>>>>> partitions.
>>>>>>
>>>>>> -- Peter
>>>>>>
>>>>>>> On Mar 2, 2020, at 7:00 PM, 张祥  wrote:
>>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> I ran into disk failure when using Kafka, and fortunately it did not
>>>>>>> crash the entire cluster. So I am wondering how Kafka handles multiple
>>>>>>> disks and it manages to work in case of single disk failure. The more
>>>>>>> detailed, the better. Thanks !



Re: what happened in case of single disk failure

2020-03-11 Thread 张祥
Hi, Peter, following what we talked about before, I want to understand what
will happen when one broker goes down, I would say it will be very similar
to what happens under disk failure, except that the rules apply to all the
partitions on that broker instead of only one malfunctioned disk. Am I
right? Thanks.

张祥 wrote on Thursday, March 5, 2020 at 9:25 AM:

> Thanks Peter, really appreciate it.
>
> Peter Bukowinski wrote on Wednesday, March 4, 2020 at 11:50 PM:
>
>> Yes, you should restart the broker. I don’t believe there’s any code to
>> check if a Log directory previously marked as failed has returned to
>> healthy.
>>
>> I always restart the broker after a hardware repair. I treat broker
>> restarts as a normal, non-disruptive operation in my clusters. I use a
>> minimum of 3x replication.
>>
>> -- Peter (from phone)
>>
>>> On Mar 4, 2020, at 12:46 AM, 张祥  wrote:
>>>
>>> Another question, according to my memory, the broker needs to be restarted
>>> after replacing the disk to recover from this. Is that correct? If so, I
>>> take it that Kafka cannot know by itself that the disk has been replaced,
>>> and a manual restart is necessary.
>>>
>>> 张祥 wrote on Wednesday, March 4, 2020 at 2:48 PM:
>>>
>>>> Thanks Peter, it makes a lot of sense.
>>>>
>>>> Peter Bukowinski wrote on Tuesday, March 3, 2020 at 11:56 AM:
>>>>
>>>>> Whether your brokers have a single data directory or multiple data
>>>>> directories on separate disks, when a disk fails, the topic partitions
>>>>> located on that disk become unavailable. What happens next depends on
>>>>> how your cluster and topics are configured.
>>>>>
>>>>> If the topics on the affected broker have replicas and the minimum ISR
>>>>> (in-sync replicas) count is met, then all topic partitions will remain
>>>>> online and leaders will move to another broker. Producers and consumers
>>>>> will continue to operate as usual.
>>>>>
>>>>> If the topics don’t have replicas or the minimum ISR count is not met,
>>>>> then the topic partitions on the failed disk will be offline. Producers
>>>>> can still send data to the affected topics — it will just go to the
>>>>> online partitions. Consumers can still consume data from the online
>>>>> partitions.
>>>>>
>>>>> -- Peter
>>>>>
>>>>>> On Mar 2, 2020, at 7:00 PM, 张祥  wrote:
>>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I ran into disk failure when using Kafka, and fortunately it did not
>>>>>> crash the entire cluster. So I am wondering how Kafka handles multiple
>>>>>> disks and it manages to work in case of single disk failure. The more
>>>>>> detailed, the better. Thanks !