Thank you very much for the help anyway.

Best regards

On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz>
wrote:

> The logs show a healthy shutdown, so we can eliminate that as an issue. I
> would look next at the volume management during a rollout, based on the
> other error messages you had earlier about permission denied etc. It's
> possible there are some journalled but not yet flushed changes in those
> time indexes, but at this point we're getting into filesystem internals,
> which aren't my forte. If you can temporarily disable the volume switching
> and do a test rollout, seeing whether you get the same problems would help
> confirm or eliminate it.
>
> Sorry I can't help further on that.
>
> On Fri, May 1, 2020 at 5:34 AM JP MB <jose.brandao1...@gmail.com> wrote:
>
> > It took me a while because I needed logs of the server shutting down when
> > this occurs. Here they are; I can see some errors:
> > https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
> >
> > Regarding systemd, the closest thing to TimeoutStopSec I could find was
> > DefaultTimeoutStopUSec=1min 30s, which looks to be 90 seconds. I could not
> > find any KillSignal or RestartKillSignal. You can see the output of
> > systemctl show --all here:
> > https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
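> >
> > For what it's worth, a minimal check against the unit itself rather than
> > the defaults (assuming the unit is literally named "kafka"):
> >
> >     systemctl cat kafka
> >     systemctl show kafka -p TimeoutStopUSec -p KillSignal -p KillMode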
> >
> > Once again, thanks for the help.
> >
> > On Thu, Apr 30, 2020 at 15:04, Liam Clarke-Hutchinson <
> > liam.cla...@adscale.co.nz> wrote:
> >
> > > I'd also suggest eyeballing your systemd conf to verify that someone
> > > hasn't set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal
> > > haven't been configured to SIGKILL (confusingly named, imo, as the default
> > > for KillSignal is SIGTERM).
> > >
> > > Also, the Kafka broker logs at shutdown look very different if it shut
> > > down cleanly vs if it didn't. Could you perhaps put them in a Gist and
> > > email the link?
> > >
> > > Just trying to make sure basic assumptions are holding :)
> > >
> > > On Fri, 1 May 2020, 1:21 am JP MB, <jose.brandao1...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > It's quite a complex script generated with Ansible, where we use a/b
> > > > deployment, and honestly I don't have full knowledge of it. I can share
> > > > the general guidelines of what it does (a rough sketch of the sequence
> > > > follows the list):
> > > >
> > > > > - Any old volumes from previous releases (named with suffix '-old') are
> > > > >   removed.
> > > > > - Detach the volumes attached to the old host.
> > > > > - Stop the service on the old host - uses systemctl stop kafka.
> > > > > - Attempt to create a CNAME volume: this is a volume with the same name
> > > > >   that will be attached to the new box. Except for the very first run,
> > > > >   this task is used to get the information about the existing volume.
> > > > >   (no suffix)
> > > > > - A new volume is created as a copy of the CNAME volume (named with
> > > > >   suffix '-new').
> > > > > - The new volume is attached to the host/VM (named with suffix '-new').
> > > > > - The new volume is formatted (except for the very first run, it is
> > > > >   already formatted) (named with suffix '-new').
> > > > > - The new volume is mounted (named with suffix '-new').
> > > > > - Start the service on the new host - uses systemctl start kafka.
> > > > > - If everything went well stopping/starting the services:
> > > > >    - The volume on the old host is renamed with the suffix '-old'.
> > > > >    - The new volume is renamed, stripping the suffix '-new'.
> > > >
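> > > > A very rough shell sketch of that sequence (the cloud volume steps are
> > > > only comments here, since they go through provider-specific Ansible
> > > > modules; the device and mount point are guesses, not the real values):
> > > >
> > > >     # On the old host: stop the broker before detaching its data volume.
> > > >     systemctl stop kafka
> > > >     # (provider step: detach the old volume, copy it to '<name>-new',
> > > >     #  attach '<name>-new' to the new host)
> > > >     # On the new host: mount the copied volume and start the broker.
> > > >     mount /dev/xvdf /var/lib/kafka
> > > >     systemctl start kafka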
> > > >
> > > > I ran a new experiment today, with some interesting findings. We had 518
> > > > messages in a given topic; after a deployment we lost 9 to this problem,
> > > > in partitions 13, 15, 16 and 17. These are all the errors I could find in
> > > > the time index files before the deployment (the left column is the
> > > > partition number; a sketch of the inspection command follows the list):
> > > > > 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no message loss here
> > > > > 12 -> -1 error, no indexes on the log - base segment was the last offset, so OK
> > > > > 13 -> timestamp mismatch error on 823168 - offsets from 323168 to 823172, four messages lost
> > > > > 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no message loss here
> > > > > 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one message missing
> > > > > 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one message missing
> > > > > 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two messages missing
> > > >
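> > > > The inspection sketch mentioned above is something along these lines
> > > > (the log directory, topic name and segment file name are placeholders;
> > > > kafka-dump-log.sh ships with the broker and prints the time index
> > > > entries plus any mismatch it finds):
> > > >
> > > >     bin/kafka-dump-log.sh \
> > > >       --files /var/lib/kafka/data/<topic>-13/<base-offset>.timeindex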
> > > >
> > > > After the deployment, I took a look and the state was this:
> > > >
> > > > > 11 -> timestamp mismatch error on 685803 - same state
> > > > > 12 -> -1 error, no indexes on the log - same state
> > > > > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 14 -> timestamp mismatch error on 619257 - same state
> > > > > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 16 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 17 -> Exception in thread "main" java.io.IOException: Permission denied
> > > >
> > > >
> > > > Some conclusions at this point:
> > > >
> > > >    - We only lost messages where the corrupted .timeindex entry was at
> > > >    the initial offset, that is, the base offset of the segment.
> > > >    - Immediately after the deployment, we were unable to open any of the
> > > >    partitions where we lost messages: Permission denied.
> > > >
> > > > That was yesterday at the end of the day. Today I checked the number of
> > > > messages and it was reduced from 509 to 493. Also, the state of the
> > > > .timeindex files had changed:
> > > >
> > > > > 11 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
> > > > > 12 -> -1 error, no indexes on the log
> > > > > 13 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 14 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
> > > > > 15 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 16 -> Exception in thread "main" java.io.IOException: Permission denied
> > > > > 17 -> Exception in thread "main" java.io.IOException: Permission denied
> > > >
> > > >
> > > > So the partition 11 and 14 timeindex files were the ones with the
> > > > timestamp mismatch error that didn't lose messages immediately after the
> > > > deployment. After the deployment was done and the cluster was already
> > > > running, both changed to permission denied and *all the messages inside
> > > > those partitions (11 & 14) were gone*. So this didn't happen only
> > > > immediately after the rolling deployment, but also while the cluster was
> > > > running.
> > > >
> > > > I manually restarted a broker with systemctl stop (it took 2-3 seconds)
> > > > and systemctl start, and all those "permission denied" errors turned into
> > > > "-1 error, no indexes on the log", as if the files had been reset. The
> > > > other brokers still have permission denied.
> > > >
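> > > > One thing that may be worth checking on a broker showing those
> > > > "Permission denied" errors is the ownership and mode of the index files
> > > > after the volume swap (the paths below are placeholders for the actual
> > > > log.dirs):
> > > >
> > > >     ls -ln /var/lib/kafka/data/<topic>-13
> > > >     stat -c '%U:%G %a %n' /var/lib/kafka/data/<topic>-13/*.timeindex
> > > >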
> > > > Does this ring any bells for you? I don't really have an idea of what
> > > > could be corrupting those index files.
> > > >
> > > > Next things I will check:
> > > >
> > > >    - When exactly those messages were deleted in partitions 11 and 14.
> > > >    - What happens if I have timeindex files with a "timestamp mismatch
> > > >    error" and manually start and stop a broker.
> > > >
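> > > > For the first point, a rough way to see when segments were deleted is to
> > > > grep the broker logs for the deletion messages (the log path is a
> > > > placeholder and the exact wording varies by Kafka version):
> > > >
> > > >     grep -iE "deleting segment|deletable segments" /var/log/kafka/server.log*
> > > >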
> > > > Once again, thanks for the effort in answering.
> > > >
> > > > On Thu, Apr 30, 2020 at 09:39, Goran Sliskovic
> > > > <gslis...@yahoo.com.invalid> wrote:
> > > >
> > > > > Hi,
> > > > > It does look like index corruption... Can you post the script that
> > > > > stops Kafka?
> > > > >
> > > > >
> > > > >     On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB <
> > > > > jose.brandao1...@gmail.com> wrote:
> > > > >
> > > > > > Can you try using the console consumer to display messages/keys and
> > > > > > timestamps?
> > > > > > --property print.key=true --property print.timestamp=true
> > > > >
> > > > >
> > > > > There are a lot of messages, so I'm picking one example without and one
> > > > > with a time index entry. All of them have a null key:
> > > > >
> > > > > Offset 57 CreateTime:1588074808027 Key:null  - no time index
> > > > > Offset 144 CreateTime:1588157145655 Key:null - has time index
> > > > >
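> > > > > For completeness, the consumer invocation looks roughly like this (the
> > > > > broker address and topic name are placeholders):
> > > > >
> > > > >     bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
> > > > >       --topic <topic> --from-beginning \
> > > > >       --property print.key=true --property print.timestamp=true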
> > > > >
> > > > > Hmm, how are you doing your rolling deploys?
> > > > >
> > > > > It's a rolling deployment: we take one node down and spin up a new one,
> > > > > one at a time.
> > > > >
> > > > > > I'm wondering if the time indexes are being corrupted by unclean
> > > > > > shutdowns. I've been reading code, and the only path I could find that
> > > > > > led to a largest timestamp of 0 was, as you've discovered, where there
> > > > > > was no time index.
> > > > > > WRT the corruption - the broker being SIGKILLed (systemd by default
> > > > > > sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to
> > > > > > shut down cleanly) has caused index corruption for us in the past -
> > > > > > although in our case the broker recovered from it automatically. Just
> > > > > > took 2 hours.
> > > > >
> > > > >
> > > > > This would be a perfect explanation for it, but we use systemctl stop
> > > > > and it takes around 4 seconds to shut down, so I believe it ends
> > > > > gracefully before SIGKILL?
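> > > > >
> > > > > If it helps to double-check that, systemd logs when it escalates from
> > > > > SIGTERM to SIGKILL, so something like this should show whether that ever
> > > > > happened (assuming the unit is named "kafka"; adjust the date):
> > > > >
> > > > >     journalctl -u kafka --since "2020-04-29" | grep -iE "sigkill|killing"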
> > > > >
> > > > > Also, are you moving between versions with these deploys?
> > > > >
> > > > > No, we have several clusters where this is happening. The information I
> > > > > showed you is from a cluster running version 2.3, but with 0.10.2 for
> > > > > the inter-broker protocol and log format. We have also experienced this
> > > > > in fully updated 2.4 and 2.4.1 clusters. But to sum up, the experiments
> > > > > are always done by deploying (again) the version that is already there.
> > > > >
> > > > > Thanks all for the efforts so far.
> > > > >
> > > > >
> > > > > ...
> > > >
> > >
> >
>
