Thanks Murali. I replied in another thread, not sure why it didn't reply to this thread and merge.
Hi Murali, Thanks for the careful read and the code pointers. Replying point by point; I've also revised the wiki where noted. mb-1 (read-path detection): No filename suffix or magic byte (both rejected — see Rejected Alternatives 6/7). Every OffsetIndex is built via a constructor that calls detectEntrySize(file) on the actual bytes: divisible by 12-not-8 → large; by 8-not-12 → legacy; by both → validate first entries under each format, with the MetadataVersion hint as tiebreaker; by neither → rebuild. The 5-arg constructor flag only picks the format for new files. So one log dir can hold both formats and each reads correctly. Added a "Format detection for existing files" section. mb-2 (RemoteIndexCache): No silent misread in the normal case. The useLargeFormat=false is only a tiebreaker hint — the constructor still auto-detects from the fetched bytes, so a real 12-byte index (size divisible by 12-not-8) is detected regardless. The only theoretical gap is the doubly-divisible-and-both-valid case, where sanityCheck()'s full monotonicity scan rejects a wrong pick and triggers a rebuild — so the failure mode is "rebuild," not wrong offsets. If you'd prefer zero ambiguity, RemoteIndexCache could pass the current MetadataVersion as the hint instead of hardcoding false; happy to add that. mb-3 (index sizing): Agreed, your ~24 MB for 8 GB is right. Added an "Index size guidance" note. Defaults stay unchanged (no behavior change for existing users), but operators running >2 GB segments should scale log.index.size.max.bytes proportionally. Failure mode is benign: a full index just rolls the segment early — no data loss. mb-4 (SegmentPosition int→long): Type consistency only, not a range need. It's built from LogOffsetMetadata.relativePositionInSegment (now long); keeping it int would reintroduce a lossy truncation. KRaft segments are ~1 GB so the value never exceeds INT_MAX. Added a rationale note. mb-5 (sizeInBytes / "no impact"): My wording was misleading — "no impact" applies only to the client/network layer (bounded by max.message.bytes, so int is fine). The storage layer does need long: we add FileRecords.sizeInBytesLong() and migrate all storage call sites to it. BaseRecords.sizeInBytes() stays int and clamps at INT_MAX only to avoid cascading the interface change. Your LogSegment:256 example is one of the migrated sites → long physicalPosition = log.sizeInBytesLong(). Fixed the wording in the KIP. mb-6 (read() lines 446/460): Correct on both, already in the storage-widening table. line 446 → long startPosition (since OffsetPosition.position becomes long); line 460 → (int) Math.min(maxPosition - startPosition, (long) adjustedMaxSize) — fetchSize stays int (bounded by fetch.max.bytes) but the subtraction is done in long to avoid overflow; slice becomes sliceLong. We'll catch the long tail via the compiler (once source types widen) plus the LogSegment/FileRecords tests. Thanks again — these sharpened the KIP. Let me know if you'd want the mb-2 hardening in scope. Haifeng On 2026/05/09 16:36:42 Muralidhar Basani via dev wrote: > Hi Chen, thanks for this well written kip. It definitely addresses a core > issue. > > I have a few points to clarify. > > mb-1 : It is mentioned as 'The large format is only written after > MetadataVersion IBP_4_4_IV1 is finalized' and 'New index files are written > in 12-byte format. Existing 8-byte index files continue to be read > correctly.'. > > But it is not clear how a 8 byte and 12 byte index files are detected by > OffsetIndex, after the 4.4 is finalized. > Because log dir may contain 8 byte and 12 byte index files simultaneously. > Something like a filename suffix ? > There is a mention of 5 arg constructor, but it controls the writing path, > but no mechanism mentioned for the read path. > > mb-2 : RemoteIndexCache caches index files fetched from tiered storage. Is > it possible that 12-byte index gets cached, but read back as legacy format ? > > mb-3 : Current default for log.index.size.max.bytes is 10mb and > log.index.interval.bytes default is 4 kb. But to handle larger segments, I > think this is not enough. Would it be better to mention some guidance or > other default in the kip? For ex : guidance on value, for a 8 gb segment, > needs this index cap to be around 24mb ? > > mb-4 : There is a change of type for SegmentPosition.relativePosition > (raft) from int to long. But it is not mentioned why this needs a change? > > mb-5 : As per kip, BaseRecords.sizeInBytes() is not changed and keeps it as > int. But when FileRecords size is > 2 gb, curious what should be returned > by sizeInBytes ? > > And 'BaseRecords.sizeInBytes() remains int (449 callers.. ' is there really > no impact ? > > For ex : LogSegment.java:256, we have int physicalPosition = > log.sizeInBytes(). If this value is above 2gb, it's value can become wrong, > pointing to wrong position ? May be there are other references like these. > > mb-6 : Also in LogSegment line 446, int startPosition = > startOffsetAndSize.position; If this value is above 2gb, probably this gets > changed to long ? > and in line 460, > int fetchSize = Math.min((int) (maxPositionOpt.get() - startPosition), > adjustedMaxSize); > > these kinds of changes could be captured while implementing may be, Just > thought, if they could be mentioned in kip, so we don't miss while > implementing ? > > > Thanks, > Murali > > On Wed, May 6, 2026 at 12:35 PM Haifeng Chen <[email protected]> > wrote: > > > Hi all, > > > > I'd like to propose KIP-1333: Support log segments larger than 2 GB. > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1333%3A+Support+log+segments+larger+than+2+GB > > > > Please help take a look and look forward to your feedback. Thanks, > > > > Haifeng > > >
