Hey Jay,

Thanks for your comments!

Yeah, recovery is different from the sanity check. They are related in the
sense that there may still be corrupted index files even after a clean
broker shutdown, and in that case delaying the sanity check may also delay
log recovery. The main goal of this KIP is to optimize the sanity-check
related work so that it does not delay broker startup much.

The KIP mentions that the sanity check is done using the log recovery
background threads. The name "recovery" comes up mainly because the number
of background threads is determined by the existing config
num.recovery.threads.per.data.dir. I have updated the KIP to make this less
confusing.
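
To illustrate the threading (a minimal sketch with made-up names, not the
actual LogManager code): the pool sized by num.recovery.threads.per.data.dir
is what runs the per-segment sanity checks while the logs are being loaded.

    import java.util.concurrent.{Executors, TimeUnit}

    object SanityCheckPoolSketch {
      // Run one sanity-check task per segment on a pool whose size comes
      // from num.recovery.threads.per.data.dir, so the checks run in
      // parallel while the logs for a data dir are being loaded.
      def checkSegments(checks: Seq[() => Unit],
                        numRecoveryThreadsPerDataDir: Int): Unit = {
        val pool = Executors.newFixedThreadPool(numRecoveryThreadsPerDataDir)
        try {
          checks.foreach { check =>
            pool.submit(new Runnable {
              override def run(): Unit = check() // e.g. offsetIndex.sanityCheck()
            })
          }
        } finally {
          pool.shutdown()
          pool.awaitTermination(Long.MaxValue, TimeUnit.MILLISECONDS)
        }
      }
    }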

It makes a ton of sense to optimize broker startup time in a data-driven
fashion, and the current optimization was done in roughly that way. The
broker log shows that LogManager.loadLogs() takes a long time in large
clusters. So I started a broker with a cold cache and repeatedly took thread
dumps to see what the broker threads were doing during LogManager.loadLogs().
Most of the threads were working on sanityCheck(), which motivates the
change in this KIP. Broker shutdown time was previously investigated in a
similar data-driven fashion and optimized with KAFKA-6172 and KAFKA-6175.
It seems that this KIP can reduce the rolling bounce time of a large cluster
by 50% -- there may be room for further improvement, but those improvements
probably would not require as big a change as the one proposed in this KIP
(with the caveat described there).
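
Roughly, the sampling amounted to the following (the actual investigation
used external thread dumps of the broker process; this in-process Scala
sketch with made-up names just illustrates the idea of repeatedly sampling
stacks and counting where the loading threads spend their time):

    import scala.collection.JavaConverters._

    object StartupProfilerSketch {
      // Take a stack snapshot every intervalMs and print the five most
      // common top-of-stack frames -- during LogManager.loadLogs() these
      // were dominated by sanityCheck().
      def sample(intervalMs: Long, samples: Int): Unit = {
        for (_ <- 1 to samples) {
          val hottest = Thread.getAllStackTraces.asScala
            .flatMap { case (_, stack) => stack.headOption.map(_.toString) }
            .groupBy(identity)
            .map { case (frame, occurrences) => (frame, occurrences.size) }
            .toSeq
            .sortBy { case (_, count) => -count }
            .take(5)
          hottest.foreach { case (frame, count) => println(s"$count  $frame") }
          Thread.sleep(intervalMs)
        }
      }
    }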

It is not clear whether it is safe to just read the latest segment, without
sanity checking all previous inactive segments of a given partition, when
transactions are used. Otherwise we would probably want to always skip the
sanity check of inactive segments without introducing a new config. Maybe
developers familiar with transactions can comment on that?

Thanks,
Dong


On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <j...@confluent.io> wrote:

> Optimizing startup seems really valuable but I'm a little confused by this.
>
> There are two different things:
> 1. Recovery
> 2. Sanity check
>
> The terminology we're using is a bit mixed here.
>
> Recovery means checksumming the log segments and rebuilding the index on a
> hard crash. This only happens on unflushed segments, which is generally
> just the last segment. Recovery is essential for the correctness guarantees
> of the log and you shouldn't disable it. It only happens on hard crash and
> is not a factor in graceful restart. We can likely optimize it but that
> would make most sense to do in a data driven fashion off some profiling.
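
[Sketch, not the actual Log.recoverLog code; names are made up. It just
makes the "unflushed segments" point concrete: only segments that can hold
data written after the recorded recovery point need recovery, which is
usually just the last segment.]

    // Sketch with made-up names illustrating which segments recovery touches.
    case class SegmentInfo(baseOffset: Long, lastOffset: Long)

    def segmentsToRecover(segments: Seq[SegmentInfo], recoveryPoint: Long): Seq[SegmentInfo] =
      segments.filter(_.lastOffset >= recoveryPoint) // typically only the active segment
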
>
> However there is also a ton of disk activity that happens during
> initialization (lots of checks on the file size, absolute path, etc). I
> think these have crept in over time with people not really realizing this
> code is perf sensitive and java hiding a lot of what is and isn't a file
> operation. One part of this is the sanityCheck() call for the two indexes.
> I don't think this call reads the full index, just the last entry in the
> index, right? There should be no need to read the full index except during
> recovery (and then only for the segments being recovered). I think it would
> make a ton of sense to optimize this, but I don't think that optimization
> needs to be configurable: this is just a helpful sanity check to detect
> common non-sensical things in the index files, but it isn't part of the
> core guarantees. In general you aren't supposed to lose committed data from
> disk, and if you do we may be able to fail faster, but we fundamentally
> can't really help you. Again, I think this would make the most sense to do
> in a data-driven way. If you look at that code, I think it is doing crazy
> amounts of file operations (e.g. getAbsolutePath, file sizes, etc). I think
> it'd make most sense to profile startup with a cold cache on a large log
> directory and do the same with an strace to see how many redundant system
> calls we do per segment and what is costing us, and then cut some of this
> out. I suspect we could speed up our startup time quite a lot if we did
> that.
>
> For example we have a bunch of calls like this:
>
>     require(len % entrySize == 0,
>             "Index file " + file.getAbsolutePath + " is corrupt, found " + len +
>             " bytes which is not positive or not a multiple of 8.")
> I'm pretty sure file.getAbsolutePath is a system call and I assume that
> happens whether or not you fail the in-memory check?
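
[For illustration, a minimal sketch of the kind of cheap check described
above -- not the actual OffsetIndex code, and the names and explicit throw
are made up. It touches only the file length and the last entry, and it
resolves the absolute path only on the failure path.]

    import java.io.File
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.StandardOpenOption

    object CheapIndexCheckSketch {
      private val EntrySize = 8 // an offset index entry is 8 bytes

      def sanityCheck(file: File): Unit = {
        val len = file.length() // one stat call on the happy path
        if (len % EntrySize != 0)
          throw new IllegalStateException( // path resolved only when the check fails
            s"Index file ${file.getAbsolutePath} is corrupt, found $len bytes " +
              s"which is not a multiple of $EntrySize.")
        if (len >= EntrySize) {
          val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
          try {
            val lastEntry = ByteBuffer.allocate(EntrySize)
            channel.read(lastEntry, len - EntrySize) // read only the final entry
            // a real check would decode and validate the last offset/position here
          } finally channel.close()
        }
      }
    }
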
>
> -Jay
>
>
> On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi all,
> >
> > I have created KIP-263: Allow broker to skip sanity check of inactive
> > segments on broker startup. See
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-263%3A+Allow+broker+to+skip+sanity+check+of+inactive+segments+on+broker+startup
> > .
> >
> > This KIP provides a way to significantly reduce time to rolling bounce a
> > Kafka cluster.
> >
> > Comments are welcome!
> >
> > Thanks,
> > Dong
> >
>
