Hi there, I'm joining the party a little late on this one, but this is something I encountered at work and I think I can shed some light on the problem at hand. I filed a bug report https://issues.apache.org/jira/browse/KAFKA-10207 and also submitted a pull request https://github.com/apache/kafka/pull/8936 that should resolve the issue.
From my investigation, it appears the issue was related to the JVM version we were using and only happened against a ZFS mount. We tried ext4 and btrfs successfully under this configuration, but eventually upgraded our JVM and the issue with ZFS disappeared. I hope this helps!

On 2020/04/29 12:09:28, Liam Clarke-Hutchinson <l...@adscale.co.nz> wrote:
> Hmm, how are you doing your rolling deploys?
>
> I'm wondering if the time indexes are being corrupted by unclean
> shutdowns. I've been reading code and the only path I could find that
> led to a largest timestamp of 0 was, as you've discovered, where there
> was no time index.
>
> WRT the corruption - the broker being SIGKILLed (systemctl by default
> sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to
> shut down cleanly) has caused index corruption for us in the past -
> although in our case it was recovered from automatically by the
> broker. It just took 2 hours.
>
> Also, are you moving between versions with these deploys?
>
> On Wed, 29 Apr. 2020, 11:23 pm JP MB, <jo...@gmail.com> wrote:
>
> > The server is in UTC; [2020-04-27 10:36:40,386] was actually my time.
> > On the server it was 9:36.
> > It doesn't look like a timezone problem because it cleans other
> > records properly, at exactly 48 hours.
> >
> > On Wed, Apr 29, 2020 at 11:26, Goran Sliskovic
> > <gs...@yahoo.com.invalid> wrote:
> >
> > > Hi,
> > > When lastModifiedTime on that segment is converted to human-readable
> > > time: Monday, April 27, 2020 9:14:19 AM UTC
> > >
> > > In what time zone is the server (IOW: in what time zone is
> > > [2020-04-27 10:36:40,386] from the log)?
> > > It looks as if largestTime is a property of the log record, and 0
> > > means the log record is empty.
> > >
> > > On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> > > jose.brandao1...@gmail.com> wrote:
> > >
> > > Hi,
> > > We have messages disappearing from topics on Apache Kafka with
> > > versions 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a
> > > rolling deployment of our clusters, and unfortunately it doesn't
> > > happen every time, so it's very inconsistent.
> > >
> > > Sometimes we lose all messages inside a topic, other times we lose
> > > all messages inside a partition. When this happens, the following
> > > log line is a constant:
> > >
> > > [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> > > dir=/var/kafkadata/data01/data] Deleting segments
> > > List(LogSegment(baseOffset=6, size=728,
> > > lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
> > >
> > > There is also a previous log line saying this segment hit the
> > > retention time breach of 48 hours. In this example, the message was
> > > produced ~12 minutes before the deployment.
> > >
> > > Notice that all messages that are wrongly deleted have largestTime=0,
> > > and the ones that are properly deleted have a valid timestamp there.
> > > From what we read in the documentation and code, it looks like
> > > largestTime is used to calculate whether a given segment has reached
> > > the time breach or not.
> > >
> > > Since we can observe this in multiple versions of Kafka, we think
> > > this might be related to something external to Kafka, e.g. Zookeeper.
> > >
> > > Does anyone have any ideas of why this could be happening?
> > > For the record, we are using Zookeeper 3.6.0.
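
P.S. For anyone following along, the failure mode described in the thread can be illustrated with a small sketch. This is Python pseudocode of my own, not Kafka's actual Scala implementation, and the names (is_expired, retention_ms) are mine; it only shows why a segment whose time index was lost (largestTime=0) gets deleted immediately by a time-based retention check:

```python
import time

RETENTION_MS = 48 * 60 * 60 * 1000  # a retention.ms of 48 hours, as in the report


def is_expired(largest_time_ms: int, now_ms: int,
               retention_ms: int = RETENTION_MS) -> bool:
    """Rough sketch of a time-based retention check: a segment becomes
    eligible for deletion once its newest record timestamp is older
    than retention_ms."""
    return now_ms - largest_time_ms > retention_ms


now = int(time.time() * 1000)

# A healthy segment whose newest record is 12 minutes old is kept...
print(is_expired(now - 12 * 60 * 1000, now))  # False

# ...but a segment reporting largestTime=0 (the Unix epoch) looks
# roughly 50 years old, so any sane retention threshold deletes it.
print(is_expired(0, now))  # True
```

With largestTime=0 the computed age is `now - 0`, which dwarfs any plausible retention setting, so the broker drops the segment on its next retention pass - which matches the "Deleting segments ... largestTime=0" line quoted above.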