Re: Kafka: Messages disappearing from topics, largestTime=0

2020-06-27 Thread John Malizia
Hi there, I'm joining the party a little late on this one, but this is
something I encountered at work and I think I can shed some light on the
problem at hand. I filed a bug report
https://issues.apache.org/jira/browse/KAFKA-10207 and also submitted a pull
request https://github.com/apache/kafka/pull/8936 that should resolve the
issue.

From my investigation, the issue appears to be related to the JVM version
we were using and only happened against a ZFS mount. We tried ext4 and
btrfs successfully under the same configuration, but after we eventually
upgraded our JVM the issue with ZFS disappeared. I hope this helps!

On 2020/04/29 12:09:28, Liam Clarke-Hutchinson  wrote:
> Hmm, how are you doing your rolling deploys?
>
> I'm wondering if the time indexes are being corrupted by unclean
> shutdowns. I've been reading code and the only path I could find that led
> to a largest timestamp of 0 was, as you've discovered, where there was no
> time index.
>
> WRT the corruption - the broker being SIGKILLed (systemctl by default
> sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut
> down cleanly) has caused index corruption for us in the past - although in
> our case it was recovered from automatically by the broker. It just took 2
> hours.
>
> Also, are you moving between versions with these deploys?
>
> On Wed, 29 Apr. 2020, 11:23 pm JP MB,  wrote:
>
> > The server is in UTC, [2020-04-27 10:36:40,386] was actually my time. On
> > the server it was 9:36.
> > It doesn't look like a timezone problem because it cleans other records
> > properly, exactly 48 hours.
> >
> > On Wed., 29 Apr. 2020 at 11:26, Goran Sliskovic
> >  wrote:
> >
> > > Hi,
> > > When lastModifiedTime on that segment is converted to human-readable
> > > time: Monday, April 27, 2020 9:14:19 AM UTC
> > >
> > > In what time zone is the server (IOW: what time zone is [2020-04-27
> > > 10:36:40,386] from the log in)?
> > > It looks as if largestTime is a property of the log record and 0 means
> > > the log record is empty.
> > >
> > > On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> > > jose.brandao1...@gmail.com> wrote:
> > >
> > > Hi,
> > > We have messages disappearing from topics on Apache Kafka with versions
> > > 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> > > deployment of our clusters, and unfortunately it doesn't happen every
> > > time, so it's very inconsistent.
> > >
> > > Sometimes we lose all messages inside a topic, other times we lose all
> > > messages inside a partition. When this happens the following log is a
> > > constant:
> > >
> > > [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> > > dir=/var/kafkadata/data01/data] Deleting segments
> > > List(LogSegment(baseOffset=6, size=728,
> > > lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
> > >
> > > There is also a previous log saying this segment hit the retention time
> > > breach of 48 hours. In this example, the message was produced ~12 minutes
> > > before the deployment.
> > >
> > > Notice, all messages that are wrongly deleted have largestTime=0 and the
> > > ones that are properly deleted have a valid timestamp there. From what
> > > we read in the documentation and code, it looks like largestTime is used
> > > to calculate whether a given segment reached the time breach or not.
> > >
> > > Since we can observe this in multiple versions of Kafka, we think this
> > > might be related to something external to Kafka, e.g. Zookeeper.
> > >
> > > Does anyone have any ideas of why this could be happening?
> > > For the record, we are using Zookeeper 3.6.0.


Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-08 Thread JP MB
Hi guys,
I just wanted to let you know that we solved our issue; it was indeed related
to the volume-switching process and some permissions mess.
Thanks, everyone, for the effort in finding the root cause.

Now, as a question for a possible improvement: should Kafka ever allow
largestTime to be 0 in any scenario?

I mean, from what I understood, the .timeindex files were introduced to
give more fine-grained control over segment retention:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
Could we use the segment's .log file last-modified time as a fallback?
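
To make the question concrete, here is a minimal, hypothetical sketch of the
fallback being suggested (the class and field names are made up for
illustration and are not Kafka's internal API): when a segment's largestTime
is 0 because its .timeindex is empty or corrupted, compare retention against
the .log file's last-modified time instead of against 0. The timestamps
mirror the log line quoted earlier in the thread.

// RetentionFallbackSketch.java -- illustrative only, not Kafka's actual retention code.
public class RetentionFallbackSketch {

    /** Illustrative view of a log segment: indexed max timestamp plus file mtime. */
    record Segment(long largestTimeMs, long lastModifiedMs) {}

    /** Retention check that treats largestTime=0 as "unknown" and falls back to mtime. */
    static boolean retentionBreached(Segment seg, long nowMs, long retentionMs) {
        long effective = seg.largestTimeMs() > 0 ? seg.largestTimeMs() : seg.lastModifiedMs();
        return nowMs - effective > retentionMs;
    }

    public static void main(String[] args) {
        long retentionMs = 48L * 60 * 60 * 1000; // the 48-hour retention from this thread
        long nowMs = 1587984000000L;             // ~2020-04-27 10:40 UTC, shortly after the deploy
        // largestTime=0 and lastModifiedTime=1587978859000 are from the deleted segment's log line.
        Segment suspect = new Segment(0L, 1587978859000L);

        // Comparing "now" against largestTime=0 always looks like a retention breach:
        System.out.println(nowMs - suspect.largestTimeMs() > retentionMs);   // true
        // Falling back to the .log file's mtime keeps the recently written segment:
        System.out.println(retentionBreached(suspect, nowMs, retentionMs));  // false
    }
}

With a fallback like this, a segment whose time index was never populated
would only become eligible for deletion once the file itself had genuinely
been untouched for the retention period.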

Best regards

On Tue., 5 May 2020 at 10:13, JP MB  wrote:

> Hi,
> What I can see from the configurations:
>
>> log.dir = /tmp/kafka-logs (default)
>> log.dirs = /var/kafkadata/data01/data
>
>
> From the documentation, log.dir is only used if log.dirs is not set, so
> /var/kafkadata/data01/data is the folder used for logs.
>
> Regards
>
> Em ter., 5 de mai. de 2020 às 08:52, Nicolas Carlot
>  escreveu:
>
>> Hi guys, still following your discussion even if it's out of my reach.
>> Just been noticing that you use /tmp/ for your logs, dunno if it's a good
>> idea :o https://issues.apache.org/jira/browse/KAFKA-3925
>>
>> Le lun. 4 mai 2020 à 19:40, JP MB  a écrit :
>>
>> > Here are the startup logs from a deployment where we lost 15 messages in
>> > topic-p:
>> > https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c
>> >
>> > .timeindex files state before the deployment:
>> > *Partitions with messages: timestamp mismatch
>> > *Partitions without messages: permission denied
>> >
>> > .timeindex files state after the deployment:
>> > *All partitions without messages: permission denied, new files were
>> > created.
>> >
>> > Don't see anything particular in the logs but you can see that the
>> messages
>> > are deleted with largestTime=0 and are from today morning.
>> >
>> > Em seg., 4 de mai. de 2020 às 11:37, JP MB 
>> > escreveu:
>> >
>> > > Hi guys,
>> > >
>> > > I'm gonna get back to this today, I get mixed feelings regarding the
>> > > volumes being the cause. This volume switching is around for quite
>> some
>> > > time, in a lot of clusters, and we only started noticing this problem
>> > when
>> > > we updated some of them. Also, this only happens in *a few* of those
>> > > .timeindex files and not in all of them. The .log files or .index
>> files
>> > > which are also on the volumes don't also have the problem.
>> > >
>> > > Additionally, I'm a bit confused on what should be the initial state
>> of
>> > > those .timeindex files. Sometimes I see "found log offset: -1", others
>> > the
>> > > timestamp mismatch error "Index timestamp: 0, log timestamp:
>> > 1588583643582"
>> > > and sometimes something like this "Indexed offset: 0, found log
>> offset:
>> > 28".
>> > >
>> > > So we have seen previously that whenever the timestamp mismatch error
>> is
>> > > present we lose messages. eventually after a deployment. Since this
>> looks
>> > > like the trigger for the problem I would like to understand how it can
>> > > happen. So my question is, how can each of those different states of
>> > > initialization for the .timeindexes happen? We can reproduce all of
>> them
>> > > when running Kafka locally.
>> > >
>> > > Meanwhile, I'm trying to reproduce the situation in our dev
>> environments
>> > > and get some startup logs and I will play with the log flush settings.
>> > >
>> > > Regards
>> > >
>> > > Em sáb., 2 de mai. de 2020 às 14:45, Liam Clarke-Hutchinson <
>> > > liam.cla...@adscale.co.nz> escreveu:
>> > >
>> > >> Good luck JP, do try it with the volume switching commented out, and
>> see
>> > >> how it goes.
>> > >>
>> > >> On Fri, May 1, 2020 at 6:50 PM JP MB 
>> > wrote:
>> > >>
>> > >> > Thank you very much for the help anyway.
>> > >> >
>> > >> > Best regards
>> > >> >
>> > >> > On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
>> > >> > liam.cla...@adscale.co.nz>
>> > >> > wrote:
>> > >> >
>> > >> > > So the logs show a healthy shutdown, so we can eliminate that as
>> an
>> > >> > issue.
>> > >> > > I would look next at the volume management during a rollout
>> based on
>> > >> the
>> > >> > > other error messages you had earlier about permission denied etc.
>> > It's
>> > >> > > possible there's some journalled but not flushed changes in those
>> > time
>> > >> > > indexes, but at this point we're getting into filesystem
>> internals
>> > >> which
>> > >> > > aren't my forte. But if you can temporarily disable the volume
>> > >> switching
>> > >> > > and do a test roll out, see if you get the same problems or not,
>> > would
>> > >> > help
>> > >> > > eliminate it or confirm it.
>> > >> > >
>> > >> > > Sorry I can't help further on that.
>> > >> > >
>> > >> > > On Fri, May 1, 2020 at 5:34 AM JP MB > >
>> > >> wrote:
>> > >> > >
>> > >> > > > I took a bit because I needed logs of the server shutting down
>> > when
>> > >> > this
>> > >> > > > occurs. Here they are, I can see 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-05 Thread JP MB
Hi,
What I can see from the configurations:

> log.dir = /tmp/kafka-logs (default)
> log.dirs = /var/kafkadata/data01/data


From the documentation, log.dir is only used if log.dirs is not set, so
/var/kafkadata/data01/data is the folder used for logs.
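
For reference, a minimal sketch of the relevant server.properties lines,
using the paths from this thread (when both keys are present, log.dirs takes
precedence and log.dir is ignored):

  # server.properties (illustrative)
  log.dir=/tmp/kafka-logs
  log.dirs=/var/kafkadata/data01/data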

Regards

On Tue., 5 May 2020 at 08:52, Nicolas Carlot wrote:

> Hi guys, still following your discussion even if it's out of my reach.
> Just been noticing that you use /tmp/ for your logs, dunno if it's a good
> idea :o https://issues.apache.org/jira/browse/KAFKA-3925
>
> Le lun. 4 mai 2020 à 19:40, JP MB  a écrit :
>
> > Here are the startup logs from a deployment where we lost 15 messages in
> > topic-p:
> > https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c
> >
> > .timeindex files state before the deployment:
> > *Partitions with messages: timestamp mismatch
> > *Partitions without messages: permission denied
> >
> > .timeindex files state after the deployment:
> > *All partitions without messages: permission denied, new files were
> > created.
> >
> > Don't see anything particular in the logs but you can see that the
> messages
> > are deleted with largestTime=0 and are from today morning.
> >
> > Em seg., 4 de mai. de 2020 às 11:37, JP MB 
> > escreveu:
> >
> > > Hi guys,
> > >
> > > I'm gonna get back to this today, I get mixed feelings regarding the
> > > volumes being the cause. This volume switching is around for quite some
> > > time, in a lot of clusters, and we only started noticing this problem
> > when
> > > we updated some of them. Also, this only happens in *a few* of those
> > > .timeindex files and not in all of them. The .log files or .index files
> > > which are also on the volumes don't also have the problem.
> > >
> > > Additionally, I'm a bit confused on what should be the initial state of
> > > those .timeindex files. Sometimes I see "found log offset: -1", others
> > the
> > > timestamp mismatch error "Index timestamp: 0, log timestamp:
> > 1588583643582"
> > > and sometimes something like this "Indexed offset: 0, found log offset:
> > 28".
> > >
> > > So we have seen previously that whenever the timestamp mismatch error
> is
> > > present we lose messages. eventually after a deployment. Since this
> looks
> > > like the trigger for the problem I would like to understand how it can
> > > happen. So my question is, how can each of those different states of
> > > initialization for the .timeindexes happen? We can reproduce all of
> them
> > > when running Kafka locally.
> > >
> > > Meanwhile, I'm trying to reproduce the situation in our dev
> environments
> > > and get some startup logs and I will play with the log flush settings.
> > >
> > > Regards
> > >
> > > Em sáb., 2 de mai. de 2020 às 14:45, Liam Clarke-Hutchinson <
> > > liam.cla...@adscale.co.nz> escreveu:
> > >
> > >> Good luck JP, do try it with the volume switching commented out, and
> see
> > >> how it goes.
> > >>
> > >> On Fri, May 1, 2020 at 6:50 PM JP MB 
> > wrote:
> > >>
> > >> > Thank you very much for the help anyway.
> > >> >
> > >> > Best regards
> > >> >
> > >> > On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
> > >> > liam.cla...@adscale.co.nz>
> > >> > wrote:
> > >> >
> > >> > > So the logs show a healthy shutdown, so we can eliminate that as
> an
> > >> > issue.
> > >> > > I would look next at the volume management during a rollout based
> on
> > >> the
> > >> > > other error messages you had earlier about permission denied etc.
> > It's
> > >> > > possible there's some journalled but not flushed changes in those
> > time
> > >> > > indexes, but at this point we're getting into filesystem internals
> > >> which
> > >> > > aren't my forte. But if you can temporarily disable the volume
> > >> switching
> > >> > > and do a test roll out, see if you get the same problems or not,
> > would
> > >> > help
> > >> > > eliminate it or confirm it.
> > >> > >
> > >> > > Sorry I can't help further on that.
> > >> > >
> > >> > > On Fri, May 1, 2020 at 5:34 AM JP MB 
> > >> wrote:
> > >> > >
> > >> > > > I took a bit because I needed logs of the server shutting down
> > when
> > >> > this
> > >> > > > occurs. Here they are, I can see some errors:
> > >> > > >
> > >>
> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> > >> > > >
> > >> > > > Regarding systemd, the closest I could find to TimeoutStopSec
> was
> > >> > > > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I
> > could
> > >> not
> > >> > > > find any KillSignal or RestartKillSignal. You can see the output
> > of
> > >> > > > systemctl show --all here:
> > >> > > >
> > >>
> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
> > >> > > >
> > >> > > > Once again, thanks for the help.
> > >> > > >
> > >> > > > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> > >> > > > liam.cla...@adscale.co.nz> escreveu:
> > >> > > >
> > >> > > > > I'd also suggest eyeballing your systemd conf to verify that
> > >> 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-05 Thread Nicolas Carlot
Hi guys, still following your discussion even if it's out of my reach.
Just been noticing that you use /tmp/ for your logs, dunno if it's a good
idea :o https://issues.apache.org/jira/browse/KAFKA-3925

On Mon. 4 May 2020 at 19:40, JP MB  wrote:

> Here are the startup logs from a deployment where we lost 15 messages in
> topic-p:
> https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c
>
> .timeindex files state before the deployment:
> *Partitions with messages: timestamp mismatch
> *Partitions without messages: permission denied
>
> .timeindex files state after the deployment:
> *All partitions without messages: permission denied, new files were
> created.
>
> Don't see anything particular in the logs but you can see that the messages
> are deleted with largestTime=0 and are from today morning.
>
> Em seg., 4 de mai. de 2020 às 11:37, JP MB 
> escreveu:
>
> > Hi guys,
> >
> > I'm gonna get back to this today, I get mixed feelings regarding the
> > volumes being the cause. This volume switching is around for quite some
> > time, in a lot of clusters, and we only started noticing this problem
> when
> > we updated some of them. Also, this only happens in *a few* of those
> > .timeindex files and not in all of them. The .log files or .index files
> > which are also on the volumes don't also have the problem.
> >
> > Additionally, I'm a bit confused on what should be the initial state of
> > those .timeindex files. Sometimes I see "found log offset: -1", others
> the
> > timestamp mismatch error "Index timestamp: 0, log timestamp:
> 1588583643582"
> > and sometimes something like this "Indexed offset: 0, found log offset:
> 28".
> >
> > So we have seen previously that whenever the timestamp mismatch error is
> > present we lose messages. eventually after a deployment. Since this looks
> > like the trigger for the problem I would like to understand how it can
> > happen. So my question is, how can each of those different states of
> > initialization for the .timeindexes happen? We can reproduce all of them
> > when running Kafka locally.
> >
> > Meanwhile, I'm trying to reproduce the situation in our dev environments
> > and get some startup logs and I will play with the log flush settings.
> >
> > Regards
> >
> > Em sáb., 2 de mai. de 2020 às 14:45, Liam Clarke-Hutchinson <
> > liam.cla...@adscale.co.nz> escreveu:
> >
> >> Good luck JP, do try it with the volume switching commented out, and see
> >> how it goes.
> >>
> >> On Fri, May 1, 2020 at 6:50 PM JP MB 
> wrote:
> >>
> >> > Thank you very much for the help anyway.
> >> >
> >> > Best regards
> >> >
> >> > On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
> >> > liam.cla...@adscale.co.nz>
> >> > wrote:
> >> >
> >> > > So the logs show a healthy shutdown, so we can eliminate that as an
> >> > issue.
> >> > > I would look next at the volume management during a rollout based on
> >> the
> >> > > other error messages you had earlier about permission denied etc.
> It's
> >> > > possible there's some journalled but not flushed changes in those
> time
> >> > > indexes, but at this point we're getting into filesystem internals
> >> which
> >> > > aren't my forte. But if you can temporarily disable the volume
> >> switching
> >> > > and do a test roll out, see if you get the same problems or not,
> would
> >> > help
> >> > > eliminate it or confirm it.
> >> > >
> >> > > Sorry I can't help further on that.
> >> > >
> >> > > On Fri, May 1, 2020 at 5:34 AM JP MB 
> >> wrote:
> >> > >
> >> > > > I took a bit because I needed logs of the server shutting down
> when
> >> > this
> >> > > > occurs. Here they are, I can see some errors:
> >> > > >
> >> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> >> > > >
> >> > > > Regarding systemd, the closest I could find to TimeoutStopSec was
> >> > > > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I
> could
> >> not
> >> > > > find any KillSignal or RestartKillSignal. You can see the output
> of
> >> > > > systemctl show --all here:
> >> > > >
> >> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
> >> > > >
> >> > > > Once again, thanks for the help.
> >> > > >
> >> > > > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> >> > > > liam.cla...@adscale.co.nz> escreveu:
> >> > > >
> >> > > > > I'd also suggest eyeballing your systemd conf to verify that
> >> someone
> >> > > > hasn't
> >> > > > > set a very low TimeoutStopSec, or that
> >> KillSignal/RestartKillSignal
> >> > > > haven't
> >> > > > > been configured to SIGKILL (confusingly named, imo, as the
> default
> >> > for
> >> > > > > KillSignal is SIGTERM).
> >> > > > >
> >> > > > > Also, the Kafka broker logs at shutdown look very different if
> it
> >> > shut
> >> > > > down
> >> > > > > currently vs if it didn't. Could you perhaps put them in a Gist
> >> and
> >> > > email
> >> > > > > the link?
> >> > > > >
> >> > > > > Just trying to make sure basic assumptions are holding 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-04 Thread JP MB
Here are the startup logs from a deployment where we lost 15 messages in
topic-p:
https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c

.timeindex files state before the deployment:
* Partitions with messages: timestamp mismatch
* Partitions without messages: permission denied

.timeindex files state after the deployment:
* All partitions without messages: permission denied, new files were created.

I don't see anything in particular in the logs, but you can see that the
messages are deleted with largestTime=0 and are from this morning.

On Mon., 4 May 2020 at 11:37, JP MB  wrote:

> Hi guys,
>
> I'm gonna get back to this today, I get mixed feelings regarding the
> volumes being the cause. This volume switching is around for quite some
> time, in a lot of clusters, and we only started noticing this problem when
> we updated some of them. Also, this only happens in *a few* of those
> .timeindex files and not in all of them. The .log files or .index files
> which are also on the volumes don't also have the problem.
>
> Additionally, I'm a bit confused on what should be the initial state of
> those .timeindex files. Sometimes I see "found log offset: -1", others the
> timestamp mismatch error "Index timestamp: 0, log timestamp: 1588583643582"
> and sometimes something like this "Indexed offset: 0, found log offset: 28".
>
> So we have seen previously that whenever the timestamp mismatch error is
> present we lose messages. eventually after a deployment. Since this looks
> like the trigger for the problem I would like to understand how it can
> happen. So my question is, how can each of those different states of
> initialization for the .timeindexes happen? We can reproduce all of them
> when running Kafka locally.
>
> Meanwhile, I'm trying to reproduce the situation in our dev environments
> and get some startup logs and I will play with the log flush settings.
>
> Regards
>
> Em sáb., 2 de mai. de 2020 às 14:45, Liam Clarke-Hutchinson <
> liam.cla...@adscale.co.nz> escreveu:
>
>> Good luck JP, do try it with the volume switching commented out, and see
>> how it goes.
>>
>> On Fri, May 1, 2020 at 6:50 PM JP MB  wrote:
>>
>> > Thank you very much for the help anyway.
>> >
>> > Best regards
>> >
>> > On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
>> > liam.cla...@adscale.co.nz>
>> > wrote:
>> >
>> > > So the logs show a healthy shutdown, so we can eliminate that as an
>> > issue.
>> > > I would look next at the volume management during a rollout based on
>> the
>> > > other error messages you had earlier about permission denied etc. It's
>> > > possible there's some journalled but not flushed changes in those time
>> > > indexes, but at this point we're getting into filesystem internals
>> which
>> > > aren't my forte. But if you can temporarily disable the volume
>> switching
>> > > and do a test roll out, see if you get the same problems or not, would
>> > help
>> > > eliminate it or confirm it.
>> > >
>> > > Sorry I can't help further on that.
>> > >
>> > > On Fri, May 1, 2020 at 5:34 AM JP MB 
>> wrote:
>> > >
>> > > > I took a bit because I needed logs of the server shutting down when
>> > this
>> > > > occurs. Here they are, I can see some errors:
>> > > >
>> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
>> > > >
>> > > > Regarding systemd, the closest I could find to TimeoutStopSec was
>> > > > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could
>> not
>> > > > find any KillSignal or RestartKillSignal. You can see the output of
>> > > > systemctl show --all here:
>> > > >
>> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
>> > > >
>> > > > Once again, thanks for the help.
>> > > >
>> > > > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
>> > > > liam.cla...@adscale.co.nz> escreveu:
>> > > >
>> > > > > I'd also suggest eyeballing your systemd conf to verify that
>> someone
>> > > > hasn't
>> > > > > set a very low TimeoutStopSec, or that
>> KillSignal/RestartKillSignal
>> > > > haven't
>> > > > > been configured to SIGKILL (confusingly named, imo, as the default
>> > for
>> > > > > KillSignal is SIGTERM).
>> > > > >
>> > > > > Also, the Kafka broker logs at shutdown look very different if it
>> > shut
>> > > > down
>> > > > > currently vs if it didn't. Could you perhaps put them in a Gist
>> and
>> > > email
>> > > > > the link?
>> > > > >
>> > > > > Just trying to make sure basic assumptions are holding :)
>> > > > >
>> > > > > On Fri, 1 May 2020, 1:21 am JP MB, 
>> > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > > It's quite a complex script generated with ansible where we use
>> a/b
>> > > > > > deployment and honestly, I don't have full knowledge on it I can
>> > > share
>> > > > > the
>> > > > > > general guidelines of what is done:
>> > > > > >
>> > > > > > > - Any old volumes (from previous releases are removed) (named
>> > with
>> > > > > suffix
>> > > > > > > '-old')
>> > > > > > > - Detach 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-04 Thread JP MB
Hi guys,

I'm going to get back to this today. I have mixed feelings about the
volumes being the cause. This volume switching has been around for quite some
time, in a lot of clusters, and we only started noticing this problem when
we updated some of them. Also, this only happens in *a few* of those
.timeindex files and not in all of them. The .log files or .index files,
which are also on the volumes, don't have the problem either.

Additionally, I'm a bit confused about what the initial state of those
.timeindex files should be. Sometimes I see "found log offset: -1", other
times the timestamp mismatch error "Index timestamp: 0, log timestamp:
1588583643582", and sometimes something like "Indexed offset: 0, found log
offset: 28".

We have seen previously that whenever the timestamp mismatch error is
present we lose messages, eventually, after a deployment. Since this looks
like the trigger for the problem, I would like to understand how it can
happen. So my question is: how can each of those different states of
initialization for the .timeindexes happen? We can reproduce all of them
when running Kafka locally.
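
To help reason about those states, here is a small, standalone sketch that
dumps the raw entries of a .timeindex file. It assumes the on-disk layout of
12-byte entries (an 8-byte timestamp followed by a 4-byte offset relative to
the segment's base offset) and that the base offset is encoded in the file
name; treat it as a diagnostic aid under those assumptions, not an official
tool (kafka.tools.DumpLogSegments is the supported way to inspect indexes).

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Kafka: prints entries of a .timeindex file,
// assuming each entry is an 8-byte timestamp followed by a 4-byte relative offset.
public class TimeIndexDump {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "00000000000000000000.timeindex";
        // By convention the base offset is encoded in the file name.
        long baseOffset = Long.parseLong(
                Paths.get(path).getFileName().toString().replace(".timeindex", ""));

        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(Paths.get(path)));
        int entries = buf.remaining() / 12;
        System.out.println(path + ": " + entries + " slots (preallocated slots read as zeros)");

        int toPrint = Math.min(entries, 16); // the file is preallocated, so cap the output
        for (int i = 0; i < toPrint; i++) {
            long timestamp = buf.getLong();
            int relativeOffset = buf.getInt();
            // All-zero entries are what we suspect ends up reported as largestTime=0.
            System.out.printf("entry %d: timestamp=%d offset=%d%n",
                    i, timestamp, baseOffset + relativeOffset);
        }
    }
}

An empty or all-zero dump for a segment that clearly contains recent messages
would line up with the largestTime=0 deletions we are seeing, while a
non-zero timestamp paired with the wrong offset would match the "timestamp
mismatch" output.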

Meanwhile, I'm trying to reproduce the situation in our dev environments
and get some startup logs and I will play with the log flush settings.

Regards

On Sat., 2 May 2020 at 14:45, Liam Clarke-Hutchinson <
liam.cla...@adscale.co.nz> wrote:

> Good luck JP, do try it with the volume switching commented out, and see
> how it goes.
>
> On Fri, May 1, 2020 at 6:50 PM JP MB  wrote:
>
> > Thank you very much for the help anyway.
> >
> > Best regards
> >
> > On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
> > liam.cla...@adscale.co.nz>
> > wrote:
> >
> > > So the logs show a healthy shutdown, so we can eliminate that as an
> > issue.
> > > I would look next at the volume management during a rollout based on
> the
> > > other error messages you had earlier about permission denied etc. It's
> > > possible there's some journalled but not flushed changes in those time
> > > indexes, but at this point we're getting into filesystem internals
> which
> > > aren't my forte. But if you can temporarily disable the volume
> switching
> > > and do a test roll out, see if you get the same problems or not, would
> > help
> > > eliminate it or confirm it.
> > >
> > > Sorry I can't help further on that.
> > >
> > > On Fri, May 1, 2020 at 5:34 AM JP MB 
> wrote:
> > >
> > > > I took a bit because I needed logs of the server shutting down when
> > this
> > > > occurs. Here they are, I can see some errors:
> > > >
> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> > > >
> > > > Regarding systemd, the closest I could find to TimeoutStopSec was
> > > > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could
> not
> > > > find any KillSignal or RestartKillSignal. You can see the output of
> > > > systemctl show --all here:
> > > >
> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
> > > >
> > > > Once again, thanks for the help.
> > > >
> > > > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> > > > liam.cla...@adscale.co.nz> escreveu:
> > > >
> > > > > I'd also suggest eyeballing your systemd conf to verify that
> someone
> > > > hasn't
> > > > > set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> > > > haven't
> > > > > been configured to SIGKILL (confusingly named, imo, as the default
> > for
> > > > > KillSignal is SIGTERM).
> > > > >
> > > > > Also, the Kafka broker logs at shutdown look very different if it
> > shut
> > > > down
> > > > > currently vs if it didn't. Could you perhaps put them in a Gist and
> > > email
> > > > > the link?
> > > > >
> > > > > Just trying to make sure basic assumptions are holding :)
> > > > >
> > > > > On Fri, 1 May 2020, 1:21 am JP MB, 
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > > It's quite a complex script generated with ansible where we use
> a/b
> > > > > > deployment and honestly, I don't have full knowledge on it I can
> > > share
> > > > > the
> > > > > > general guidelines of what is done:
> > > > > >
> > > > > > > - Any old volumes (from previous releases are removed) (named
> > with
> > > > > suffix
> > > > > > > '-old')
> > > > > > > - Detach the volumes attached to the old host
> > > > > > > - Stop the service in the old host - uses systemctl stop kafka
> > > > > > > - Attempt to create a CNAME volume: this is a volume with the
> > same
> > > > name
> > > > > > > that will be attached to the new box. Except for very first
> run,
> > > this
> > > > > > task
> > > > > > > is used to get the information about the existing volume. (no
> > > sufix)
> > > > > > > - A new volume is created as copy of the CNAME volume (named
> with
> > > > > suffix
> > > > > > > '-new')
> > > > > > > - The new volume is attached to the host/vm (named with suffix
> > > > '-new')
> > > > > > > - The new volume is formated (except for very first run, its
> > > already
> > > > > > > formated)(named with 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-02 Thread Liam Clarke-Hutchinson
Good luck JP, do try it with the volume switching commented out, and see
how it goes.

On Fri, May 1, 2020 at 6:50 PM JP MB  wrote:

> Thank you very much for the help anyway.
>
> Best regards
>
> On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <
> liam.cla...@adscale.co.nz>
> wrote:
>
> > So the logs show a healthy shutdown, so we can eliminate that as an
> issue.
> > I would look next at the volume management during a rollout based on the
> > other error messages you had earlier about permission denied etc. It's
> > possible there's some journalled but not flushed changes in those time
> > indexes, but at this point we're getting into filesystem internals which
> > aren't my forte. But if you can temporarily disable the volume switching
> > and do a test roll out, see if you get the same problems or not, would
> help
> > eliminate it or confirm it.
> >
> > Sorry I can't help further on that.
> >
> > On Fri, May 1, 2020 at 5:34 AM JP MB  wrote:
> >
> > > I took a bit because I needed logs of the server shutting down when
> this
> > > occurs. Here they are, I can see some errors:
> > > https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> > >
> > > Regarding systemd, the closest I could find to TimeoutStopSec was
> > > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could not
> > > find any KillSignal or RestartKillSignal. You can see the output of
> > > systemctl show --all here:
> > > https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
> > >
> > > Once again, thanks for the help.
> > >
> > > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> > > liam.cla...@adscale.co.nz> escreveu:
> > >
> > > > I'd also suggest eyeballing your systemd conf to verify that someone
> > > hasn't
> > > > set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> > > haven't
> > > > been configured to SIGKILL (confusingly named, imo, as the default
> for
> > > > KillSignal is SIGTERM).
> > > >
> > > > Also, the Kafka broker logs at shutdown look very different if it
> shut
> > > down
> > > > currently vs if it didn't. Could you perhaps put them in a Gist and
> > email
> > > > the link?
> > > >
> > > > Just trying to make sure basic assumptions are holding :)
> > > >
> > > > On Fri, 1 May 2020, 1:21 am JP MB, 
> wrote:
> > > >
> > > > > Hi,
> > > > > It's quite a complex script generated with ansible where we use a/b
> > > > > deployment and honestly, I don't have full knowledge on it I can
> > share
> > > > the
> > > > > general guidelines of what is done:
> > > > >
> > > > > > - Any old volumes (from previous releases are removed) (named
> with
> > > > suffix
> > > > > > '-old')
> > > > > > - Detach the volumes attached to the old host
> > > > > > - Stop the service in the old host - uses systemctl stop kafka
> > > > > > - Attempt to create a CNAME volume: this is a volume with the
> same
> > > name
> > > > > > that will be attached to the new box. Except for very first run,
> > this
> > > > > task
> > > > > > is used to get the information about the existing volume. (no
> > sufix)
> > > > > > - A new volume is created as copy of the CNAME volume (named with
> > > > suffix
> > > > > > '-new')
> > > > > > - The new volume is attached to the host/vm (named with suffix
> > > '-new')
> > > > > > - The new volume is formated (except for very first run, its
> > already
> > > > > > formated)(named with suffix '-new')
> > > > > > - The new volume is mounted (named with suffix '-new')
> > > > > > - Start the service in the new host - uses systemctl start kafka
> > > > > > - If everthing went well stopping/starting services:
> > > > > >- The volume no the old host is renamed with prefix '-old'.
> > > > > >- The new volume is renamed stripping the suffix '-new'.
> > > > >
> > > > >
> > > > > I made a new experiment today with some interesting findings. Had
> 518
> > > > > messages in a given topic, after a deployment lost 9 due to this
> > > problem
> > > > in
> > > > > partitions 13,15,16 and 17. All the errors I could find in the time
> > > > > index files before the deployment (left is partition number):
> > > > >
> > > > > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805,
> > no
> > > > > > message loss here
> > > > > > 12 -> -1 error no indexes on the log - base segment was the last
> > > offset
> > > > > so
> > > > > > ok
> > > > > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to
> > > > 823172,
> > > > > > four messages lost
> > > > > > 14 -> timestamp mismatch on 619257 - offsets from 619253 to
> 619258,
> > > no
> > > > > > message loss here
> > > > > > 15 -> timestamp mismatch on 658783 - offsets from 658783 to
> 658784,
> > > one
> > > > > > message missing
> > > > > > 16 -> timestamp mismatch on 623508 - offsets from 623508 to
> 623509,
> > > one
> > > > > > message missing
> > > > > > 17 -> timestamp mismatch on 515479 - offsets from 515479 to
> 515481,
> > > two
> > > > > > messages missing
> > > > 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-01 Thread Goran Sliskovic
 
Yes, that's a clean shutdown log, with a few exceptions that are expected
(connected clients get disconnected during shutdown). Adding an fsync after the
Kafka shutdown should force the OS to flush buffers to disk. I somewhat suspect
there are some problems during unmounting/mounting of the disks; however, I
don't know much about that process. We need startup logs, because exceptions
there are unexpected.

General notes: such things are difficult to trace, so setting up a staging
environment (a test copy of the production system) is a must. Then you can
experiment freely. There is an option in Kafka to force a flush on every
message (but it can have a serious performance impact). I'd test that on
staging; it is not an option you want to use in production, but it might help
diagnose the issue. Link:
https://stackoverflow.com/questions/33970374/need-to-understand-kafka-broker-property-log-flush-interval-messages
You may also want to check for filesystem errors (fsck).
You should not copy live filesystems (with the cp command, for example) while
applications operate; you need at least a crash-consistent copy.
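
For reference, the option being described is the broker-side flush setting; a
minimal sketch of what trying it on a staging cluster's server.properties
could look like (the values are deliberately aggressive and purely
illustrative, not a production recommendation):

  # staging server.properties (illustrative only)
  # fsync after every single message -- serious performance impact, diagnostic use only
  log.flush.interval.messages=1
  # alternatively, bound the time between flushes
  log.flush.interval.ms=1000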



On Friday, May 1, 2020, 01:54:49 AM GMT+2, Liam Clarke-Hutchinson 
 wrote:  
 
 So the logs show a healthy shutdown, so we can eliminate that as an issue.
I would look next at the volume management during a rollout based on the
other error messages you had earlier about permission denied etc. It's
possible there's some journalled but not flushed changes in those time
indexes, but at this point we're getting into filesystem internals which
aren't my forte. But if you can temporarily disable the volume switching
and do a test roll out, see if you get the same problems or not, would help
eliminate it or confirm it.

Sorry I can't help further on that.

On Fri, May 1, 2020 at 5:34 AM JP MB  wrote:

> I took a bit because I needed logs of the server shutting down when this
> occurs. Here they are, I can see some errors:
> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
>
> Regarding systemd, the closest I could find to TimeoutStopSec was
> DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could not
> find any KillSignal or RestartKillSignal. You can see the output of
> systemctl show --all here:
> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
>
> Once again, thanks for the help.
>
> Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> liam.cla...@adscale.co.nz> escreveu:
>
> > I'd also suggest eyeballing your systemd conf to verify that someone
> hasn't
> > set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> haven't
> > been configured to SIGKILL (confusingly named, imo, as the default for
> > KillSignal is SIGTERM).
> >
> > Also, the Kafka broker logs at shutdown look very different if it shut
> down
> > currently vs if it didn't. Could you perhaps put them in a Gist and email
> > the link?
> >
> > Just trying to make sure basic assumptions are holding :)
> >
> > On Fri, 1 May 2020, 1:21 am JP MB,  wrote:
> >
> > > Hi,
> > > It's quite a complex script generated with ansible where we use a/b
> > > deployment and honestly, I don't have full knowledge on it I can share
> > the
> > > general guidelines of what is done:
> > >
> > > > - Any old volumes (from previous releases are removed) (named with
> > suffix
> > > > '-old')
> > > > - Detach the volumes attached to the old host
> > > > - Stop the service in the old host - uses systemctl stop kafka
> > > > - Attempt to create a CNAME volume: this is a volume with the same
> name
> > > > that will be attached to the new box. Except for very first run, this
> > > task
> > > > is used to get the information about the existing volume. (no sufix)
> > > > - A new volume is created as copy of the CNAME volume (named with
> > suffix
> > > > '-new')
> > > > - The new volume is attached to the host/vm (named with suffix
> '-new')
> > > > - The new volume is formated (except for very first run, its already
> > > > formated)(named with suffix '-new')
> > > > - The new volume is mounted (named with suffix '-new')
> > > > - Start the service in the new host - uses systemctl start kafka
> > > > - If everthing went well stopping/starting services:
> > > >    - The volume no the old host is renamed with prefix '-old'.
> > > >    - The new volume is renamed stripping the suffix '-new'.
> > >
> > >
> > > I made a new experiment today with some interesting findings. Had 518
> > > messages in a given topic, after a deployment lost 9 due to this
> problem
> > in
> > > partitions 13,15,16 and 17. All the errors I could find in the time
> > > index files before the deployment (left is partition number):
> > >
> > > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> > > > message loss here
> > > > 12 -> -1 error no indexes on the log - base segment was the last
> offset
> > > so
> > > > ok
> > > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to
> > 823172,
> > > > four messages lost
> > 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-05-01 Thread JP MB
Thank you very much for the help anyway.

Best regards

On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson 
wrote:

> So the logs show a healthy shutdown, so we can eliminate that as an issue.
> I would look next at the volume management during a rollout based on the
> other error messages you had earlier about permission denied etc. It's
> possible there's some journalled but not flushed changes in those time
> indexes, but at this point we're getting into filesystem internals which
> aren't my forte. But if you can temporarily disable the volume switching
> and do a test roll out, see if you get the same problems or not, would help
> eliminate it or confirm it.
>
> Sorry I can't help further on that.
>
> On Fri, May 1, 2020 at 5:34 AM JP MB  wrote:
>
> > I took a bit because I needed logs of the server shutting down when this
> > occurs. Here they are, I can see some errors:
> > https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> >
> > Regarding systemd, the closest I could find to TimeoutStopSec was
> > DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could not
> > find any KillSignal or RestartKillSignal. You can see the output of
> > systemctl show --all here:
> > https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
> >
> > Once again, thanks for the help.
> >
> > Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> > liam.cla...@adscale.co.nz> escreveu:
> >
> > > I'd also suggest eyeballing your systemd conf to verify that someone
> > hasn't
> > > set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> > haven't
> > > been configured to SIGKILL (confusingly named, imo, as the default for
> > > KillSignal is SIGTERM).
> > >
> > > Also, the Kafka broker logs at shutdown look very different if it shut
> > down
> > > currently vs if it didn't. Could you perhaps put them in a Gist and
> email
> > > the link?
> > >
> > > Just trying to make sure basic assumptions are holding :)
> > >
> > > On Fri, 1 May 2020, 1:21 am JP MB,  wrote:
> > >
> > > > Hi,
> > > > It's quite a complex script generated with ansible where we use a/b
> > > > deployment and honestly, I don't have full knowledge on it I can
> share
> > > the
> > > > general guidelines of what is done:
> > > >
> > > > > - Any old volumes (from previous releases are removed) (named with
> > > suffix
> > > > > '-old')
> > > > > - Detach the volumes attached to the old host
> > > > > - Stop the service in the old host - uses systemctl stop kafka
> > > > > - Attempt to create a CNAME volume: this is a volume with the same
> > name
> > > > > that will be attached to the new box. Except for very first run,
> this
> > > > task
> > > > > is used to get the information about the existing volume. (no
> sufix)
> > > > > - A new volume is created as copy of the CNAME volume (named with
> > > suffix
> > > > > '-new')
> > > > > - The new volume is attached to the host/vm (named with suffix
> > '-new')
> > > > > - The new volume is formated (except for very first run, its
> already
> > > > > formated)(named with suffix '-new')
> > > > > - The new volume is mounted (named with suffix '-new')
> > > > > - Start the service in the new host - uses systemctl start kafka
> > > > > - If everthing went well stopping/starting services:
> > > > >- The volume no the old host is renamed with prefix '-old'.
> > > > >- The new volume is renamed stripping the suffix '-new'.
> > > >
> > > >
> > > > I made a new experiment today with some interesting findings. Had 518
> > > > messages in a given topic, after a deployment lost 9 due to this
> > problem
> > > in
> > > > partitions 13,15,16 and 17. All the errors I could find in the time
> > > > index files before the deployment (left is partition number):
> > > >
> > > > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805,
> no
> > > > > message loss here
> > > > > 12 -> -1 error no indexes on the log - base segment was the last
> > offset
> > > > so
> > > > > ok
> > > > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to
> > > 823172,
> > > > > four messages lost
> > > > > 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258,
> > no
> > > > > message loss here
> > > > > 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784,
> > one
> > > > > message missing
> > > > > 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509,
> > one
> > > > > message missing
> > > > > 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481,
> > two
> > > > > messages missing
> > > >
> > > >
> > > > After the deployment, I took a look and the state was this:
> > > >
> > > > > 11 -> timestamp mismatch error on 685803 -   same state
> > > > > 12 -> -1 error no indexes on the log - same state
> > > > > 13 -> Exception in thread "main" java.io.IOException: Permission
> > denied
> > > > > 14 -> timestamp mismatch error on 619257 - same state
> > > > > 15 -> Exception in thread "main" 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-30 Thread Liam Clarke-Hutchinson
So the logs show a healthy shutdown, so we can eliminate that as an issue.
I would look next at the volume management during a rollout, based on the
other error messages you had earlier about permission denied etc. It's
possible there are some journalled but not yet flushed changes in those time
indexes, but at this point we're getting into filesystem internals, which
aren't my forte. But if you can temporarily disable the volume switching
and do a test rollout to see whether you get the same problems or not, that
would help eliminate it or confirm it.

Sorry I can't help further on that.

On Fri, May 1, 2020 at 5:34 AM JP MB  wrote:

> I took a bit because I needed logs of the server shutting down when this
> occurs. Here they are, I can see some errors:
> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
>
> Regarding systemd, the closest I could find to TimeoutStopSec was
> DefaultTimeoutStopUSec=1min 30s that looks to be 90seconds. I could not
> find any KillSignal or RestartKillSignal. You can see the output of
> systemctl show --all here:
> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
>
> Once again, thanks for the help.
>
> Em qui., 30 de abr. de 2020 às 15:04, Liam Clarke-Hutchinson <
> liam.cla...@adscale.co.nz> escreveu:
>
> > I'd also suggest eyeballing your systemd conf to verify that someone
> hasn't
> > set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> haven't
> > been configured to SIGKILL (confusingly named, imo, as the default for
> > KillSignal is SIGTERM).
> >
> > Also, the Kafka broker logs at shutdown look very different if it shut
> down
> > currently vs if it didn't. Could you perhaps put them in a Gist and email
> > the link?
> >
> > Just trying to make sure basic assumptions are holding :)
> >
> > On Fri, 1 May 2020, 1:21 am JP MB,  wrote:
> >
> > > Hi,
> > > It's quite a complex script generated with ansible where we use a/b
> > > deployment and honestly, I don't have full knowledge on it I can share
> > the
> > > general guidelines of what is done:
> > >
> > > > - Any old volumes (from previous releases are removed) (named with
> > suffix
> > > > '-old')
> > > > - Detach the volumes attached to the old host
> > > > - Stop the service in the old host - uses systemctl stop kafka
> > > > - Attempt to create a CNAME volume: this is a volume with the same
> name
> > > > that will be attached to the new box. Except for very first run, this
> > > task
> > > > is used to get the information about the existing volume. (no sufix)
> > > > - A new volume is created as copy of the CNAME volume (named with
> > suffix
> > > > '-new')
> > > > - The new volume is attached to the host/vm (named with suffix
> '-new')
> > > > - The new volume is formated (except for very first run, its already
> > > > formated)(named with suffix '-new')
> > > > - The new volume is mounted (named with suffix '-new')
> > > > - Start the service in the new host - uses systemctl start kafka
> > > > - If everthing went well stopping/starting services:
> > > >- The volume no the old host is renamed with prefix '-old'.
> > > >- The new volume is renamed stripping the suffix '-new'.
> > >
> > >
> > > I made a new experiment today with some interesting findings. Had 518
> > > messages in a given topic, after a deployment lost 9 due to this
> problem
> > in
> > > partitions 13,15,16 and 17. All the errors I could find in the time
> > > index files before the deployment (left is partition number):
> > >
> > > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> > > > message loss here
> > > > 12 -> -1 error no indexes on the log - base segment was the last
> offset
> > > so
> > > > ok
> > > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to
> > 823172,
> > > > four messages lost
> > > > 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258,
> no
> > > > message loss here
> > > > 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784,
> one
> > > > message missing
> > > > 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509,
> one
> > > > message missing
> > > > 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481,
> two
> > > > messages missing
> > >
> > >
> > > After the deployment, I took a look and the state was this:
> > >
> > > > 11 -> timestamp mismatch error on 685803 -   same state
> > > > 12 -> -1 error no indexes on the log - same state
> > > > 13 -> Exception in thread "main" java.io.IOException: Permission
> denied
> > > > 14 -> timestamp mismatch error on 619257 - same state
> > > > 15 -> Exception in thread "main" java.io.IOException: Permission
> denied
> > > > 16 -> Exception in thread "main" java.io.IOException: Permission
> denied
> > > > 17 -> Exception in thread "main" java.io.IOException: Permission
> denied
> > >
> > >
> > > Some conclusions at this point:
> > >
> > >- We only lost messages where the initial offset had a corrupted
> > >.timeindex file, 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-30 Thread JP MB
I took a bit because I needed logs of the server shutting down when this
occurs. Here they are, I can see some errors:
https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726

Regarding systemd, the closest I could find to TimeoutStopSec was
DefaultTimeoutStopUSec=1min 30s, which looks to be 90 seconds. I could not
find any KillSignal or RestartKillSignal. You can see the output of
systemctl show --all here:
https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4

Once again, thanks for the help.

On Thu., 30 Apr. 2020 at 15:04, Liam Clarke-Hutchinson <
liam.cla...@adscale.co.nz> wrote:

> I'd also suggest eyeballing your systemd conf to verify that someone hasn't
> set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal haven't
> been configured to SIGKILL (confusingly named, imo, as the default for
> KillSignal is SIGTERM).
>
> Also, the Kafka broker logs at shutdown look very different if it shut down
> currently vs if it didn't. Could you perhaps put them in a Gist and email
> the link?
>
> Just trying to make sure basic assumptions are holding :)
>
> On Fri, 1 May 2020, 1:21 am JP MB,  wrote:
>
> > Hi,
> > It's quite a complex script generated with ansible where we use a/b
> > deployment and honestly, I don't have full knowledge on it I can share
> the
> > general guidelines of what is done:
> >
> > > - Any old volumes (from previous releases are removed) (named with
> suffix
> > > '-old')
> > > - Detach the volumes attached to the old host
> > > - Stop the service in the old host - uses systemctl stop kafka
> > > - Attempt to create a CNAME volume: this is a volume with the same name
> > > that will be attached to the new box. Except for very first run, this
> > task
> > > is used to get the information about the existing volume. (no sufix)
> > > - A new volume is created as copy of the CNAME volume (named with
> suffix
> > > '-new')
> > > - The new volume is attached to the host/vm (named with suffix '-new')
> > > - The new volume is formated (except for very first run, its already
> > > formated)(named with suffix '-new')
> > > - The new volume is mounted (named with suffix '-new')
> > > - Start the service in the new host - uses systemctl start kafka
> > > - If everthing went well stopping/starting services:
> > >- The volume no the old host is renamed with prefix '-old'.
> > >- The new volume is renamed stripping the suffix '-new'.
> >
> >
> > I made a new experiment today with some interesting findings. Had 518
> > messages in a given topic, after a deployment lost 9 due to this problem
> in
> > partitions 13,15,16 and 17. All the errors I could find in the time
> > index files before the deployment (left is partition number):
> >
> > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> > > message loss here
> > > 12 -> -1 error no indexes on the log - base segment was the last offset
> > so
> > > ok
> > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to
> 823172,
> > > four messages lost
> > > 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no
> > > message loss here
> > > 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one
> > > message missing
> > > 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one
> > > message missing
> > > 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two
> > > messages missing
> >
> >
> > After the deployment, I took a look and the state was this:
> >
> > > 11 -> timestamp mismatch error on 685803 -   same state
> > > 12 -> -1 error no indexes on the log - same state
> > > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > > 14 -> timestamp mismatch error on 619257 - same state
> > > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > > 16 -> Exception in thread "main" java.io.IOException: Permission denied
> > > 17 -> Exception in thread "main" java.io.IOException: Permission denied
> >
> >
> > Some conclusions at this point:
> >
> >- We only lost messages where the initial offset had a corrupted
> >.timeindex file, this is, the base offset for the segment.
> >- Immediately after the deployment, we were unable to open all the
> >partitions where we lost messages: Permission denied.
> >
> > This was yesterday at the end of the day, today I checked the number of
> > messages and it was reduced from 509 to 493. Also, the state of the
> > .timeindex files was changed:
> >
> > 11 -> Exception in thread "main" java.io.IOException: Permission denied -
> > > changed state
> > > 12 -> -1 error no indexes on the log
> > > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > > 14 -> Exception in thread "main" java.io.IOException: Permission
> denied -
> > > changed state
> > > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > > 16 -> Exception in thread "main" java.io.IOException: Permission 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-30 Thread Liam Clarke-Hutchinson
I'd also suggest eyeballing your systemd conf to verify that someone hasn't
set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal haven't
been configured to SIGKILL (confusingly named, imo, as the default for
KillSignal is SIGTERM).
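
To make that concrete, a systemd drop-in for the Kafka unit might look like
the sketch below; TimeoutStopSec and KillSignal are standard [Service]
directives, but the unit name, path and the 180-second value are assumptions
for illustration only:

  # /etc/systemd/system/kafka.service.d/override.conf (illustrative)
  [Service]
  # give the broker enough time to shut down cleanly before systemd escalates to SIGKILL
  TimeoutStopSec=180
  # keep the graceful default; setting this to SIGKILL would skip Kafka's clean shutdown
  KillSignal=SIGTERM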

Also, the Kafka broker logs at shutdown look very different if it shut down
cleanly vs if it didn't. Could you perhaps put them in a Gist and email
the link?

Just trying to make sure basic assumptions are holding :)

On Fri, 1 May 2020, 1:21 am JP MB,  wrote:

> Hi,
> It's quite a complex script generated with ansible where we use a/b
> deployment and honestly, I don't have full knowledge on it I can share the
> general guidelines of what is done:
>
> > - Any old volumes (from previous releases are removed) (named with suffix
> > '-old')
> > - Detach the volumes attached to the old host
> > - Stop the service in the old host - uses systemctl stop kafka
> > - Attempt to create a CNAME volume: this is a volume with the same name
> > that will be attached to the new box. Except for very first run, this
> task
> > is used to get the information about the existing volume. (no sufix)
> > - A new volume is created as copy of the CNAME volume (named with suffix
> > '-new')
> > - The new volume is attached to the host/vm (named with suffix '-new')
> > - The new volume is formated (except for very first run, its already
> > formated)(named with suffix '-new')
> > - The new volume is mounted (named with suffix '-new')
> > - Start the service in the new host - uses systemctl start kafka
> > - If everthing went well stopping/starting services:
> >- The volume no the old host is renamed with prefix '-old'.
> >- The new volume is renamed stripping the suffix '-new'.
>
>
> I made a new experiment today with some interesting findings. Had 518
> messages in a given topic, after a deployment lost 9 due to this problem in
> partitions 13,15,16 and 17. All the errors I could find in the time
> index files before the deployment (left is partition number):
>
> 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> > message loss here
> > 12 -> -1 error no indexes on the log - base segment was the last offset
> so
> > ok
> > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to 823172,
> > four messages lost
> > 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no
> > message loss here
> > 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one
> > message missing
> > 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one
> > message missing
> > 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two
> > messages missing
>
>
> After the deployment, I took a look and the state was this:
>
> > 11 -> timestamp mismatch error on 685803 -   same state
> > 12 -> -1 error no indexes on the log - same state
> > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > 14 -> timestamp mismatch error on 619257 - same state
> > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > 16 -> Exception in thread "main" java.io.IOException: Permission denied
> > 17 -> Exception in thread "main" java.io.IOException: Permission denied
>
>
> Some conclusions at this point:
>
>- We only lost messages where the initial offset (that is, the base offset
>for the segment) had a corrupted .timeindex file.
>- Immediately after the deployment, we were unable to open all the
>partitions where we lost messages: Permission denied.
>
> This was yesterday at the end of the day; today I checked the number of
> messages and it had dropped from 509 to 493. Also, the state of the
> .timeindex files had changed:
>
> 11 -> Exception in thread "main" java.io.IOException: Permission denied -
> > changed state
> > 12 -> -1 error no indexes on the log
> > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > 14 -> Exception in thread "main" java.io.IOException: Permission denied -
> > changed state
> > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > 16 -> Exception in thread "main" java.io.IOException: Permission denied
> > 17 -> Exception in thread "main" java.io.IOException: Permission denied
>
>
> So partitions 11 and 14 had the timeindex files with the timestamp mismatch
> error that did not lose messages immediately after the deployment. After the
> deployment was done, and with the cluster already running, both changed to
> permission denied and *all the messages inside those partitions (11 & 14)
> were gone.* So this didn't happen only immediately after the rolling
> deployment but also while the cluster was running.
>
> I manually restarted a broker with systemctl stop (took 2/3 seconds) &
> systemctl start, and all those "permission denied" errors turned into
> "-1 error no indexes on the log", as if the files had been reset. The
> other brokers still show permission denied.
>

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-30 Thread JP MB
Hi,
It's quite a complex script generated with Ansible where we use a/b
deployment and, honestly, I don't have full knowledge of it, but I can share
the general guidelines of what is done:

> - Any old volumes (from previous releases) are removed (named with suffix
> '-old')
> - Detach the volumes attached to the old host
> - Stop the service on the old host - uses systemctl stop kafka
> - Attempt to create a CNAME volume: this is a volume with the same name
> that will be attached to the new box. Except for the very first run, this task
> is used to get the information about the existing volume. (no suffix)
> - A new volume is created as a copy of the CNAME volume (named with suffix
> '-new')
> - The new volume is attached to the host/VM (named with suffix '-new')
> - The new volume is formatted (except for the very first run, it's already
> formatted) (named with suffix '-new')
> - The new volume is mounted (named with suffix '-new')
> - Start the service on the new host - uses systemctl start kafka
> - If everything went well stopping/starting the services:
>    - The volume on the old host is renamed with the suffix '-old'.
>    - The new volume is renamed, stripping the suffix '-new'.


I ran a new experiment today with some interesting findings. I had 518
messages in a given topic and, after a deployment, lost 9 of them to this
problem, in partitions 13, 15, 16 and 17. These are all the errors I could
find in the time index files before the deployment (left is the partition
number):

11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> message loss here
> 12 -> -1 error no indexes on the log - base segment was the last offset so
> ok
> 13 -> timestamp mismatch error on 823168 - offsets from 323168 to 823172,
> four messages lost
> 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no
> message loss here
> 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one
> message missing
> 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one
> message missing
> 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two
> messages missing


After the deployment, I took a look and the state was this:

> 11 -> timestamp mismatch error on 685803 -   same state
> 12 -> -1 error no indexes on the log - same state
> 13 -> Exception in thread "main" java.io.IOException: Permission denied
> 14 -> timestamp mismatch error on 619257 - same state
> 15 -> Exception in thread "main" java.io.IOException: Permission denied
> 16 -> Exception in thread "main" java.io.IOException: Permission denied
> 17 -> Exception in thread "main" java.io.IOException: Permission denied


Some conclusions at this point:

   - We only lost messages where the initial offset (that is, the base offset
   for the segment) had a corrupted .timeindex file.
   - Immediately after the deployment, we were unable to open all the
   partitions where we lost messages: Permission denied.

This was yesterday at the end of the day; today I checked the number of
messages and it had dropped from 509 to 493. Also, the state of the
.timeindex files had changed:

11 -> Exception in thread "main" java.io.IOException: Permission denied -
> changed state
> 12 -> -1 error no indexes on the log
> 13 -> Exception in thread "main" java.io.IOException: Permission denied
> 14 -> Exception in thread "main" java.io.IOException: Permission denied -
> changed state
> 15 -> Exception in thread "main" java.io.IOException: Permission denied
> 16 -> Exception in thread "main" java.io.IOException: Permission denied
> 17 -> Exception in thread "main" java.io.IOException: Permission denied


So partitions 11 and 14 had the timeindex files with the timestamp mismatch
error that did not lose messages immediately after the deployment. After the
deployment was done, and with the cluster already running, both changed to
permission denied and *all the messages inside those partitions (11 & 14)
were gone.* So this didn't happen only immediately after the rolling
deployment but also while the cluster was running.

I manually restarted a broker with systemctl stop (took 2/3 seconds) &
systemctl start, and all those "permission denied" errors turned into
"-1 error no indexes on the log", as if the files had been reset. The
other brokers still show permission denied.
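
In case it helps, a quick way to check ownership and permissions on those
index files (same data path as above; run it on each broker):

  ls -l /var/kafkadata/data01/data/topic-15/*.timeindex
  stat -c '%U:%G %a %n' /var/kafkadata/data01/data/topic-15/*.timeindex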

Does this ring any bells for you? I don't really have any idea of what could
be corrupting those index files.

Next things I will check:

   - When exactly those messages were deleted in partitions 11 and 14 (see
   the sketch below).
   - What happens if I have timeindex files with a "timestamp mismatch
   error" and manually start and stop a broker.

Once again, thanks for the effort in answering.

Em qui., 30 de abr. de 2020 às 09:39, Goran Sliskovic
 escreveu:

>  Hi,
> It does look as index corruption... Can you post script that stops kafka?
>
>
> On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB <
> jose.brandao1...@gmail.com> wrote:
>
>  >
> > Can you try using the console consumer 

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-30 Thread Goran Sliskovic
 Hi,
It does look like index corruption... Can you post the script that stops Kafka?


On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB 
 wrote:  
 
 >
> Can you try using the console consumer to display messages/keys and
> timestamps ?
> --property print.key=true --property print.timestamp=true


There are a lot of messages, so I'm picking one example without and one with
a timeindex entry. All of them have a null key:

Offset 57 CreateTime:1588074808027 Key:null  - no time index
Offset 144 CreateTime:1588157145655 Key:null - has time index


Hmm, how are you doing your rolling deploys?

It's a rolling deployment: we take one node down and spin up a new one, one
at a time.

I'm wondering if the time indexes are being corrupted by unclean
> shutdowns. I've been reading code and the only path I could find that led
> to a largest
> timestamp of 0 was, as you've discovered, where there was no time index.
>  WRT to the corruption - the broker being SIGKILLed (systemctl by default
> sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut
> down cleanly) has caused index corruption for us in the past - although in
> our case it was recovered from automatically by the broker. Just took 2
> hours.


This would be a perfect explanation for it, but we use systemctl stop and it
takes around 4 seconds to shut down, so I believe it ends gracefully before
any SIGKILL?

Also, are you moving between versions with these deploys?

No, we have several clusters where this is happening. The information I
showed you is from a cluster running version 2.3 but with 0.10.2 for the
inter-broker protocol and log format. We have also experienced this in fully
updated 2.4 and 2.4.1 clusters. But to sum up, the experiments always
redeploy the same version that is already there.

Thanks all for the efforts so far.


...  

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread JP MB
>
> Can you try using the console consumer to display messages/keys and
> timestamps ?
> --property print.key=true --property print.timestamp=true


There are a lot of messages, so I'm picking one example without and one with
a timeindex entry. All of them have a null key:

Offset 57 CreateTime:1588074808027 Key:null  - no time index
Offset 144 CreateTime:1588157145655 Key:null - has time index
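
For reference, the timestamps and keys above were pulled with the console
consumer, roughly like this (broker address and topic name are placeholders):

  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic my-topic --from-beginning \
    --property print.timestamp=true --property print.key=true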


Hmm, how are you doing your rolling deploys?

It's a rolling deployment: we take one node down and spin up a new one, one
at a time.

I'm wondering if the time indexes are being corrupted by unclean
> shutdowns. I've been reading code and the only path I could find that led
> to a largest
> timestamp of 0 was, as you've discovered, where there was no time index.
>  WRT to the corruption - the broker being SIGKILLed (systemctl by default
> sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut
> down cleanly) has caused index corruption for us in the past - although in
> our case it was recovered from automatically by the broker. Just took 2
> hours.


This would be a perfect explanation for it, but we use systemctl stop and it
takes around 4 seconds to shut down, so I believe it ends gracefully before
any SIGKILL?
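
One way to double-check that is to look for kill signals in the journal and
for the broker's "shut down completed" line in its own logs (unit name and
log path here are assumptions):

  journalctl -u kafka --since "1 hour ago" | grep -iE "sigkill|timed out|killing"
  grep -i "shut down completed" /var/log/kafka/server.log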

Also, are you moving between versions with these deploys?

No, we have several clusters where this is happening. The information I
showed you is from a cluster running version 2.3 but with 0.10.2 for the
inter-broker protocol and log format. We have also experienced this in fully
updated 2.4 and 2.4.1 clusters. But to sum up, the experiments always
redeploy the same version that is already there.

Thanks all for the efforts so far.


Em qua., 29 de abr. de 2020 às 13:01, Nicolas Carlot <
nicolas.car...@chronopost.fr> escreveu:

> Can you try using the console consumer to display messages/keys and
> timestamps ?
> --property print.key=true --property print.timestamp=true
>
>
> Le mer. 29 avr. 2020 à 13:23, JP MB  a écrit :
>
> > The server is in UTC, [2020-04-27 10:36:40,386] was actually my time. On
> > the server was 9:36.
> > It doesn't look like a timezone problem because it cleans properly other
> > records, exactly 48 hours.
> >
> > Em qua., 29 de abr. de 2020 às 11:26, Goran Sliskovic
> >  escreveu:
> >
> > >  Hi,
> > > When lastModifiedTime on that segment is converted to human readable
> > time:
> > > Monday, April 27, 2020 9:14:19 AM UTC
> > >
> > > In what time zone is the server (IOW: [2020-04-27 10:36:40,386] from
> the
> > > log is in what time zone)?
> > > It looks as largestTime is property of log record and 0 means the log
> > > record is empty.
> > >
> > > On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> > > jose.brandao1...@gmail.com> wrote:
> > >
> > >  Hi,
> > > We have messages disappearing from topics on Apache Kafka with versions
> > > 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> > > deployment of our clusters and unfortunately it doesn't happen every
> > time,
> > > so it's very inconsistent.
> > >
> > > Sometimes we lose all messages inside a topic, other times we lose all
> > > messages inside a partition. When this happens the following log is a
> > > constant:
> > >
> > > [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> > > dir=/var/kafkadata/data01/data] Deleting segments
> > > List(LogSegment(baseOffset=6, size=728,
> > > lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
> > >
> > > There is also a previous log saying this segment hit the retention time
> > > breach of 48 hours. In this example, the message was produced ~12
> minutes
> > > before the deployment.
> > >
> > > Notice, all messages that are wrongly deleted havelargestTime=0 and the
> > > ones that are properly deleted have a valid timestamp in there. From
> what
> > > we read from documentation and code it looks like the largestTime is
> used
> > > to calculate if a given segment reached the time breach or not.
> > >
> > > Since we can observe this in multiple versions of Kafka, we think this
> > > might be related to anything external to Kafka. E.g Zookeeper.
> > >
> > > Does anyone have any ideas of why this could be happening?
> > > For the record, we are using Zookeeper 3.6.0.
> > >
> >
>
>
> --
> *Nicolas Carlot*
>
> Lead dev
> |  | nicolas.car...@chronopost.fr
>
>


Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread Liam Clarke-Hutchinson
Hmm, how are you doing your rolling deploys?

I'm wondering if the time indexes are being corrupted by unclean
shutdowns. I've
been reading code and the only path I could find that led to a largest
timestamp of 0 was, as you've discovered, where there was no time index.

WRT the corruption - the broker being SIGKILLed (systemctl by default
sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut
down cleanly) has caused index corruption for us in the past - although in
our case it was recovered from automatically by the broker. Just took 2
hours.

Also are you moving between versions with these deploys?

On Wed, 29 Apr. 2020, 11:23 pm JP MB,  wrote:

> The server is in UTC, [2020-04-27 10:36:40,386] was actually my time. On
> the server was 9:36.
> It doesn't look like a timezone problem because it cleans properly other
> records, exactly 48 hours.
>
> Em qua., 29 de abr. de 2020 às 11:26, Goran Sliskovic
>  escreveu:
>
> >  Hi,
> > When lastModifiedTime on that segment is converted to human readable
> time:
> > Monday, April 27, 2020 9:14:19 AM UTC
> >
> > In what time zone is the server (IOW: [2020-04-27 10:36:40,386] from the
> > log is in what time zone)?
> > It looks as largestTime is property of log record and 0 means the log
> > record is empty.
> >
> > On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> > jose.brandao1...@gmail.com> wrote:
> >
> >  Hi,
> > We have messages disappearing from topics on Apache Kafka with versions
> > 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> > deployment of our clusters and unfortunately it doesn't happen every
> time,
> > so it's very inconsistent.
> >
> > Sometimes we lose all messages inside a topic, other times we lose all
> > messages inside a partition. When this happens the following log is a
> > constant:
> >
> > [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> > dir=/var/kafkadata/data01/data] Deleting segments
> > List(LogSegment(baseOffset=6, size=728,
> > lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
> >
> > There is also a previous log saying this segment hit the retention time
> > breach of 48 hours. In this example, the message was produced ~12 minutes
> > before the deployment.
> >
> > Notice, all messages that are wrongly deleted havelargestTime=0 and the
> > ones that are properly deleted have a valid timestamp in there. From what
> > we read from documentation and code it looks like the largestTime is used
> > to calculate if a given segment reached the time breach or not.
> >
> > Since we can observe this in multiple versions of Kafka, we think this
> > might be related to anything external to Kafka. E.g Zookeeper.
> >
> > Does anyone have any ideas of why this could be happening?
> > For the record, we are using Zookeeper 3.6.0.
> >
>


Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread Nicolas Carlot
Can you try using the console consumer to display messages/keys and
timestamps ?
--property print.key=true --property print.timestamp=true


Le mer. 29 avr. 2020 à 13:23, JP MB  a écrit :

> The server is in UTC, [2020-04-27 10:36:40,386] was actually my time. On
> the server was 9:36.
> It doesn't look like a timezone problem because it cleans properly other
> records, exactly 48 hours.
>
> Em qua., 29 de abr. de 2020 às 11:26, Goran Sliskovic
>  escreveu:
>
> >  Hi,
> > When lastModifiedTime on that segment is converted to human readable
> time:
> > Monday, April 27, 2020 9:14:19 AM UTC
> >
> > In what time zone is the server (IOW: [2020-04-27 10:36:40,386] from the
> > log is in what time zone)?
> > It looks as largestTime is property of log record and 0 means the log
> > record is empty.
> >
> > On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> > jose.brandao1...@gmail.com> wrote:
> >
> >  Hi,
> > We have messages disappearing from topics on Apache Kafka with versions
> > 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> > deployment of our clusters and unfortunately it doesn't happen every
> time,
> > so it's very inconsistent.
> >
> > Sometimes we lose all messages inside a topic, other times we lose all
> > messages inside a partition. When this happens the following log is a
> > constant:
> >
> > [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> > dir=/var/kafkadata/data01/data] Deleting segments
> > List(LogSegment(baseOffset=6, size=728,
> > lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
> >
> > There is also a previous log saying this segment hit the retention time
> > breach of 48 hours. In this example, the message was produced ~12 minutes
> > before the deployment.
> >
> > Notice, all messages that are wrongly deleted havelargestTime=0 and the
> > ones that are properly deleted have a valid timestamp in there. From what
> > we read from documentation and code it looks like the largestTime is used
> > to calculate if a given segment reached the time breach or not.
> >
> > Since we can observe this in multiple versions of Kafka, we think this
> > might be related to anything external to Kafka. E.g Zookeeper.
> >
> > Does anyone have any ideas of why this could be happening?
> > For the record, we are using Zookeeper 3.6.0.
> >
>


-- 
*Nicolas Carlot*

Lead dev
|  | nicolas.car...@chronopost.fr




Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread JP MB
The server is in UTC, [2020-04-27 10:36:40,386] was actually my time. On
the server was 9:36.
It doesn't look like a timezone problem because it cleans properly other
records, exactly 48 hours.

Em qua., 29 de abr. de 2020 às 11:26, Goran Sliskovic
 escreveu:

>  Hi,
> When lastModifiedTime on that segment is converted to human readable time:
> Monday, April 27, 2020 9:14:19 AM UTC
>
> In what time zone is the server (IOW: [2020-04-27 10:36:40,386] from the
> log is in what time zone)?
> It looks as largestTime is property of log record and 0 means the log
> record is empty.
>
> On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB <
> jose.brandao1...@gmail.com> wrote:
>
>  Hi,
> We have messages disappearing from topics on Apache Kafka with versions
> 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> deployment of our clusters and unfortunately it doesn't happen every time,
> so it's very inconsistent.
>
> Sometimes we lose all messages inside a topic, other times we lose all
> messages inside a partition. When this happens the following log is a
> constant:
>
> [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> dir=/var/kafkadata/data01/data] Deleting segments
> List(LogSegment(baseOffset=6, size=728,
> lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
>
> There is also a previous log saying this segment hit the retention time
> breach of 48 hours. In this example, the message was produced ~12 minutes
> before the deployment.
>
> Notice, all messages that are wrongly deleted havelargestTime=0 and the
> ones that are properly deleted have a valid timestamp in there. From what
> we read from documentation and code it looks like the largestTime is used
> to calculate if a given segment reached the time breach or not.
>
> Since we can observe this in multiple versions of Kafka, we think this
> might be related to anything external to Kafka. E.g Zookeeper.
>
> Does anyone have any ideas of why this could be happening?
> For the record, we are using Zookeeper 3.6.0.
>


Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread JP MB
We are using the console producer, directly on the machines where we are
experiencing the problem. I just inserted 150 messages into a topic and chose
the partition with the most messages for this analysis, in this case
partition 15 on broker 1.
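
For completeness, the producer side is just the stock console producer, along
these lines (broker address and topic name are placeholders):

  kafka-console-producer.sh --broker-list localhost:9092 --topic my-topic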

The log file:

> kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration
> --print-data-log --files
> /var/kafkadata/data01/data/topic-15/0059.log
> Dumping /var/kafkadata/data01/data/topic-15/0059.log
> Starting offset: 59
> offset: 59 position: 0 CreateTime: 1588151779702 size: 36 magic: 1
> compresscodec: NONE crc: 565272749 isvalid: true
> | offset: 59 CreateTime: 1588151779702 keysize: -1 valuesize: 2 crc:
> 565272749 isvalid: true payload: 12
> offset: 60 position: 36 CreateTime: 1588151799916 size: 36 magic: 1
> compresscodec: NONE crc: 370075951 isvalid: true
> | offset: 60 CreateTime: 1588151799916 keysize: -1 valuesize: 2 crc:
> 370075951 isvalid: true payload: 30
> offset: 61 position: 72 CreateTime: 1588152179129 size: 36 magic: 1
> compresscodec: NONE crc: 2353683039 isvalid: true
> | offset: 61 CreateTime: 1588152179129 keysize: -1 valuesize: 2 crc:
> 2353683039 isvalid: true payload: 36
> offset: 62 position: 108 CreateTime: 1588152202048 size: 36 magic: 1
> compresscodec: NONE crc: 83181941 isvalid: true
> | offset: 62 CreateTime: 1588152202048 keysize: -1 valuesize: 2 crc:
> 83181941 isvalid: true payload: 54
> offset: 63 position: 144 CreateTime: 1588152232426 size: 36 magic: 1
> compresscodec: NONE crc: 1251610227 isvalid: true
> | offset: 63 CreateTime: 1588152232426 keysize: -1 valuesize: 2 crc:
> 1251610227 isvalid: true payload: 72
> offset: 64 position: 180 CreateTime: 1588152250662 size: 36 magic: 1
> compresscodec: NONE crc: 1452283589 isvalid: true
> | offset: 64 CreateTime: 1588152250662 keysize: -1 valuesize: 2 crc:
> 1452283589 isvalid: true payload: 90
> offset: 65 position: 216 CreateTime: 1588152271999 size: 37 magic: 1
> compresscodec: NONE crc: 3155811409 isvalid: true
> | offset: 65 CreateTime: 1588152271999 keysize: -1 valuesize: 3 crc:
> 3155811409 isvalid: true payload: 108
> offset: 66 position: 253 CreateTime: 1588152304661 size: 37 magic: 1
> compresscodec: NONE crc: 2526532572 isvalid: true
> | offset: 66 CreateTime: 1588152304661 keysize: -1 valuesize: 3 crc:
> 2526532572 isvalid: true payload: 126
> offset: 67 position: 290 CreateTime: 1588152330022 size: 37 magic: 1
> compresscodec: NONE crc: 4266477330 isvalid: true
> | offset: 67 CreateTime: 1588152330022 keysize: -1 valuesize: 3 crc:
> 4266477330 isvalid: true payload: 144


The .index file is empty:

> kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration
> --print-data-log --files
> /var/kafkadata/data01/data/topic-15/0059.index
> Dumping /var/kafkadata/data01/data/topic-15/0059.index
> offset: 59 position: 0


The .timeindex file returns this, but from what I found on the internet you
can get this error when using DumpLogSegments on the active segment:
> kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration
> --print-data-log --files
> /var/kafkadata/data01/data/topic-15/0059.timeindex

> Dumping /var/kafkadata/data01/data/topic-15/0059.timeindex
> timestamp: 0 offset: 59
> Found timestamp mismatch in
> :/var/kafkadata/data01/data/topic-15/0059.timeindex
> Index timestamp: 0, log timestamp: 1588151779702


*The consumer gets messages from 53 to 67, which is strange because on this
broker the log starts from 59 and all the brokers should have the
information replicated.*
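
A simple way to compare what each replica actually has on disk is to list the
segment files (and their base offsets) in the partition directory on every
broker, e.g.:

  ls /var/kafkadata/data01/data/topic-15/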

So I injected around 1000 more messages and got this in the .timeindex file,
which only starts at offset 144:

> kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration
> --print-data-log --files
> /var/kafkadata/data01/data/topic-15/0059.timeindex
> Dumping /var/kafkadata/data01/data/topic-15/0059.timeindex
> timestamp: 1588157171331 offset: 144
> timestamp: 1588157306199 offset: 147
> timestamp: 1588157358211 offset: 150
> timestamp: 1588157465320 offset: 155
> timestamp: 1588157467376 offset: 157
> timestamp: 1588157469434 offset: 160
> timestamp: 1588157471324 offset: 163
> timestamp: 1588157474553 offset: 168
> timestamp: 1588157476271 offset: 171
> timestamp: 1588157478642 offset: 174
> timestamp: 1588157481068 offset: 178
> timestamp: 1588157484115 offset: 181
> timestamp: 1588157486643 offset: 184
> timestamp: 1588157489433 offset: 188



So it looks like we don't have any time index entries from 59 to 144... If I
had made a rolling deploy before offset 144, there would have been no
timestamp for those offsets and I would have lost all those messages to
cleanup? Any thoughts on this?
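
In case it's useful, a rough way to scan the newest .timeindex of every
partition of this topic for the same problem (same data path as above):

  for d in /var/kafkadata/data01/data/topic-*; do
    f=$(ls "$d"/*.timeindex 2>/dev/null | sort | tail -1)  # newest time index per partition
    echo "== $f =="
    kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration \
      --print-data-log --files "$f"
  done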

Thanks in advance.


Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread Goran Sliskovic
 Hi,
When lastModifiedTime on that segment is converted to human readable time:
Monday, April 27, 2020 9:14:19 AM UTC
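
That is lastModifiedTime=1587978859000 from the log line, with the
milliseconds dropped, converted with GNU date for example:

  date -u -d @1587978859
  # prints: Mon Apr 27 09:14:19 UTC 2020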

In what time zone is the server (IOW: what time zone is [2020-04-27
10:36:40,386] from the log in)?
It looks as if largestTime is a property of the log record, and 0 means the
log record is empty.

On Tuesday, April 28, 2020, 04:37:03 PM GMT+2, JP MB 
 wrote:  
 
 Hi,
We have messages disappearing from topics on Apache Kafka with versions
2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
deployment of our clusters and unfortunately it doesn't happen every time,
so it's very inconsistent.

Sometimes we lose all messages inside a topic, other times we lose all
messages inside a partition. When this happens the following log is a
constant:

[2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
dir=/var/kafkadata/data01/data] Deleting segments
List(LogSegment(baseOffset=6, size=728,
lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)

There is also a previous log saying this segment hit the retention time
breach of 48 hours. In this example, the message was produced ~12 minutes
before the deployment.

Notice, all messages that are wrongly deleted havelargestTime=0 and the
ones that are properly deleted have a valid timestamp in there. From what
we read from documentation and code it looks like the largestTime is used
to calculate if a given segment reached the time breach or not.

Since we can observe this in multiple versions of Kafka, we think this
might be related to anything external to Kafka. E.g Zookeeper.

Does anyone have any ideas of why this could be happening?
For the record, we are using Zookeeper 3.6.0.
  

Re: Kafka: Messages disappearing from topics, largestTime=0

2020-04-29 Thread Nicolas Carlot
Timestamps are stored in the message headers. You can use the tools in
kafka/bin to dump those messages' timestamps and check their value.
My guess is that the producer is setting a specific timestamp that is wrong?

Le mer. 29 avr. 2020 à 10:09, JP MB  a écrit :

> Hi,
> Sorry guys for insisting on this but does anyone have an idea of how that
> largestTime can be 0 ?
>
> Regards
>
> -- Forwarded message -
> De: JP MB 
> Date: ter., 28 de abr. de 2020 às 15:36
> Subject: Kafka: Messages disappearing from topics, largestTime=0
> To: 
>
>
> Hi,
> We have messages disappearing from topics on Apache Kafka with versions
> 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling
> deployment of our clusters and unfortunately it doesn't happen every time,
> so it's very inconsistent.
>
> Sometimes we lose all messages inside a topic, other times we lose all
> messages inside a partition. When this happens the following log is a
> constant:
>
> [2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5,
> dir=/var/kafkadata/data01/data] Deleting segments
> List(LogSegment(baseOffset=6, size=728,
> lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
>
> There is also a previous log saying this segment hit the retention time
> breach of 48 hours. In this example, the message was produced ~12 minutes
> before the deployment.
>
> Notice, all messages that are wrongly deleted havelargestTime=0 and the
> ones that are properly deleted have a valid timestamp in there. From what
> we read from documentation and code it looks like the largestTime is used
> to calculate if a given segment reached the time breach or not.
>
> Since we can observe this in multiple versions of Kafka, we think this
> might be related to anything external to Kafka. E.g Zookeeper.
>
> Does anyone have any ideas of why this could be happening?
> For the record, we are using Zookeeper 3.6.0.
>


-- 
*Nicolas Carlot*

Lead dev
|  | nicolas.car...@chronopost.fr

