On Wed, Nov 6, 2019 at 4:33 PM Ying Zheng <yi...@uber.com> wrote:

> 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry objects are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> We do not assume how the data is stored in the remote storage.
> Depending on the implementation, the data of one segment may not
> necessarily be stored in a single file.
> There could be a maximum object / chunk / file size restriction on the
> remote storage. So, one Kafka
> segment could be saved as multiple chunks in remote storage.
>
> The remote log index also has a larger index interval. The default
> interval of the local .index file
> (log.index.interval.bytes) is 4KB. In the current HDFS RSM implementation,
> the default remote
> index interval (hdfs.remote.index.interval.bytes) is 256KB. The
> coarse-grained remote index saves
> some local disk space. The smaller index is also more likely to stay
> cached in physical memory.
>
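
As a rough illustration of the interval difference, here is a small sketch.
It assumes a 1 GiB segment (Kafka's default log.segment.bytes); the class and
method names are made up for this example and are not part of the KIP. It
only compares entry counts. A RemoteLogIndexEntry is not the same size as a
local .index entry, so the disk-space ratio will differ from the entry ratio.

```java
// Sketch only: upper bound on index entries per segment for a given interval.
// IndexIntervalSketch and maxEntries are hypothetical names for illustration.
final class IndexIntervalSketch {

    /** At most one index entry is written per intervalBytes of log data. */
    static long maxEntries(long segmentBytes, long intervalBytes) {
        return segmentBytes / intervalBytes;
    }

    public static void main(String[] args) {
        long segmentBytes = 1L << 30;       // assumed: 1 GiB segment
        long localInterval = 4L * 1024;     // log.index.interval.bytes default (4KB)
        long remoteInterval = 256L * 1024;  // hdfs.remote.index.interval.bytes default (256KB)

        System.out.println("local  .index entries (max): "
                + maxEntries(segmentBytes, localInterval));
        System.out.println("remote index entries (max):  "
                + maxEntries(segmentBytes, remoteInterval));
        System.out.println("entry count ratio:           "
                + remoteInterval / localInterval);
    }
}
```

Under these assumptions the remote index holds 64 times fewer entries per
segment than the local .index file.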

The remote log index file is also very different from the existing .index
file. With the current design,
one .index file corresponds to one segment file. But one remote log index
file can correspond to many
remote segments.

Because only inactive segments can be shipped to remote storage, to be able
to ship log data as soon
as possible, we will roll log segments very frequently (e.g. every half hour).
This will lead to a large number of
small segments. If we maintain one remote index file for each remote
segment, we can easily hit some
OS limitations, like the maximum number of open files or the maximum number of
mmapped files.

So, instead of creating a new remote index file for each remote segment, we
append the RemoteLogIndexEntries of multiple
remote segments to one local file. We will roll the remote index file at a
configurable size or time interval.
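
To make the rolling behaviour concrete, here is a minimal in-memory sketch.
It is not the actual implementation: the class name, the size and age limits,
and the fixed 100-byte entry size are all assumptions for illustration, and
no real file I/O is done. It only shows entries from several segments landing
in the same index file until a size or age limit triggers a roll.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: entries from many remote segments share one index file,
// which is rolled once it would exceed a size limit or reaches an age limit.
final class SharedRemoteIndexSketch {
    private final long maxBytes;   // assumed size limit per index file
    private final long maxAgeMs;   // assumed age limit per index file
    private final List<Integer> entriesPerFile = new ArrayList<>();
    private long currentBytes;
    private long currentCreatedMs;

    SharedRemoteIndexSketch(long maxBytes, long maxAgeMs) {
        this.maxBytes = maxBytes;
        this.maxAgeMs = maxAgeMs;
        entriesPerFile.add(0); // start with one empty index file
    }

    /** Starts a new, empty index file. */
    private void roll(long nowMs) {
        entriesPerFile.add(0);
        currentBytes = 0;
        currentCreatedMs = nowMs;
    }

    /** Appends one entry; returns the ordinal of the index file it landed in. */
    int append(int entryBytes, long nowMs) {
        if (currentBytes == 0) {
            currentCreatedMs = nowMs; // age is measured from the first entry
        } else if (currentBytes + entryBytes > maxBytes
                || nowMs - currentCreatedMs >= maxAgeMs) {
            roll(nowMs);
        }
        int fileNo = entriesPerFile.size() - 1;
        entriesPerFile.set(fileNo, entriesPerFile.get(fileNo) + 1);
        currentBytes += entryBytes;
        return fileNo;
    }

    int fileCount() {
        return entriesPerFile.size();
    }

    public static void main(String[] args) {
        long halfHourMs = 30L * 60 * 1000;
        // Assumed limits: 1000 bytes or one hour, whichever is hit first.
        SharedRemoteIndexSketch index = new SharedRemoteIndexSketch(1000, 2 * halfHourMs);
        for (int segment = 0; segment < 4; segment++) {   // one segment every half hour
            for (int i = 0; i < 3; i++) {                 // three entries per segment
                int fileNo = index.append(100, segment * halfHourMs);
                System.out.println("segment " + segment + " entry " + i
                        + " -> index file " + fileNo);
            }
        }
        System.out.println("index files used: " + index.fileCount());
    }
}
```

With these made-up limits, four half-hourly segments end up in two index
files instead of four.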
