RE: Re: [DISCUSS] KIP-1333: Support log segments larger than 2 GB

Haifeng Chen Wed, 17 Jun 2026 18:23:00 -0700

Thanks Murali. I replied in another thread, not sure why it didn't
reply to this thread and merge.

Hi Murali,

Thanks for the careful read and the code pointers. Replying point by
point; I've also revised the wiki where noted.

mb-1 (read-path detection): No filename suffix or magic byte (both
rejected — see Rejected Alternatives 6/7). Every OffsetIndex is built
via a constructor that calls detectEntrySize(file) on the actual
bytes: divisible by 12-not-8 → large; by 8-not-12 → legacy; by both →
validate first entries under each format, with the MetadataVersion
hint as tiebreaker; by neither → rebuild. The 5-arg constructor flag
only picks the format for new files. So one log dir can hold both
formats and each reads correctly. Added a "Format detection for
existing files" section.

mb-2 (RemoteIndexCache): No silent misread in the normal case. The
useLargeFormat=false is only a tiebreaker hint — the constructor still
auto-detects from the fetched bytes, so a real 12-byte index (size
divisible by 12-not-8) is detected regardless. The only theoretical
gap is the doubly-divisible-and-both-valid case, where sanityCheck()'s
full monotonicity scan rejects a wrong pick and triggers a rebuild —
so the failure mode is "rebuild," not wrong offsets. If you'd prefer
zero ambiguity, RemoteIndexCache could pass the current
MetadataVersion as the hint instead of hardcoding false; happy to add
that.

mb-3 (index sizing): Agreed, your ~24 MB for 8 GB is right. Added an
"Index size guidance" note. Defaults stay unchanged (no behavior
change for existing users), but operators running >2 GB segments
should scale log.index.size.max.bytes proportionally. Failure mode is
benign: a full index just rolls the segment early — no data loss.

mb-4 (SegmentPosition int→long): Type consistency only, not a range
need. It's built from LogOffsetMetadata.relativePositionInSegment (now
long); keeping it int would reintroduce a lossy truncation. KRaft
segments are ~1 GB so the value never exceeds INT_MAX. Added a
rationale note.

mb-5 (sizeInBytes / "no impact"): My wording was misleading — "no
impact" applies only to the client/network layer (bounded by
max.message.bytes, so int is fine). The storage layer does need long:
we add FileRecords.sizeInBytesLong() and migrate all storage call
sites to it. BaseRecords.sizeInBytes() stays int and clamps at INT_MAX
only to avoid cascading the interface change. Your LogSegment:256
example is one of the migrated sites → long physicalPosition =
log.sizeInBytesLong(). Fixed the wording in the KIP.

mb-6 (read() lines 446/460): Correct on both, already in the
storage-widening table. line 446 → long startPosition (since
OffsetPosition.position becomes long); line 460 → (int)
Math.min(maxPosition - startPosition, (long) adjustedMaxSize) —
fetchSize stays int (bounded by fetch.max.bytes) but the subtraction
is done in long to avoid overflow; slice becomes sliceLong. We'll
catch the long tail via the compiler (once source types widen) plus
the LogSegment/FileRecords tests.

Thanks again — these sharpened the KIP. Let me know if you'd want the
mb-2 hardening in scope.

Haifeng

On 2026/05/09 16:36:42 Muralidhar Basani via dev wrote:
> Hi Chen, thanks for this well written kip. It definitely addresses a core
> issue.
>
> I have a few points to clarify.
>
> mb-1 : It is mentioned as 'The large format is only written after
> MetadataVersion IBP_4_4_IV1 is finalized' and 'New index files are written
> in 12-byte format. Existing 8-byte index files continue to be read
> correctly.'.
>
> But it is not clear how a 8 byte and 12 byte index files are detected by
> OffsetIndex, after the 4.4 is finalized.
> Because log dir may contain 8 byte and 12 byte index files simultaneously.
> Something like a filename suffix ?
> There is a mention of 5 arg constructor, but it controls the writing path,
> but no mechanism mentioned for the read path.
>
> mb-2 : RemoteIndexCache caches index files fetched from tiered storage. Is
> it possible that 12-byte index gets cached, but read back as legacy format ?
>
> mb-3 : Current default for log.index.size.max.bytes is 10mb and
> log.index.interval.bytes default is 4 kb. But to handle larger segments, I
> think this is not enough. Would it be better to mention some guidance or
> other default in the kip? For ex : guidance on value, for a 8 gb segment,
> needs this index cap to be around 24mb ?
>
> mb-4 : There is a change of type for SegmentPosition.relativePosition
> (raft) from int to long. But it is not mentioned why this needs a change?
>
> mb-5 : As per kip, BaseRecords.sizeInBytes() is not changed and keeps it as
> int. But when FileRecords size is > 2 gb, curious what should be returned
> by sizeInBytes ?
>
> And 'BaseRecords.sizeInBytes() remains int (449 callers.. ' is there really
> no impact ?
>
> For ex : LogSegment.java:256, we have int physicalPosition =
> log.sizeInBytes(). If this value is above 2gb, it's value can become wrong,
> pointing to wrong position ? May be there are other references like these.
>
> mb-6 : Also in LogSegment line 446, int startPosition =
> startOffsetAndSize.position; If this value is above 2gb, probably this gets
> changed to long ?
> and in line 460,
> int fetchSize = Math.min((int) (maxPositionOpt.get() - startPosition),
> adjustedMaxSize);
>
> these kinds of changes could be captured while implementing may be, Just
> thought, if they could be mentioned in kip, so we don't miss while
> implementing ?
>
>
> Thanks,
> Murali
>
> On Wed, May 6, 2026 at 12:35 PM Haifeng Chen <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I'd like to propose KIP-1333: Support log segments larger than 2 GB.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1333%3A+Support+log+segments+larger+than+2+GB
> >
> > Please help take a look and look forward to your feedback. Thanks,
> >
> > Haifeng
> >
>

RE: Re: [DISCUSS] KIP-1333: Support log segments larger than 2 GB

Reply via email to