Bug#961481: ceph: Protocol incompatibility between armhf and amd64

Val Lorentz Sun, 24 May 2020 18:15:54 -0700

Package: ceph
Version: 14.2.9-1~bpo10+1

Dear maintainers,


I run a cluster made of armhf and amd64 OSDs, and amd64 monitors and
manager.

I recently updated my cluster from Luminous (12, in buster) to Nautilus
(14, in buster-backports), following the instructions here:
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

At some point (and after hot-fixing for #956293 on armhf machines), I
noticed something was off, as my OSDs kept flipping between up and down,
with all machines of one arch up and the others down.
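
A minimal sketch of how the per-architecture flapping described above could
be checked from a monitor host. It assumes that `ceph osd metadata` reports
an "arch" field and an "id" field per OSD, and that `ceph osd dump` lists
each OSD with "osd" and "up" fields; the helper names are hypothetical and
not part of Ceph.

```python
import json
import subprocess
from collections import defaultdict


def ceph_json(*args):
    """Run a ceph CLI subcommand and return its parsed JSON output."""
    result = subprocess.run(
        ["ceph", *args, "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def group_osds_by_arch(metadata, dump):
    """Group OSD ids by architecture and up/down state.

    metadata: list of dicts as assumed from `ceph osd metadata`
    dump:     dict as assumed from `ceph osd dump`
    """
    # Map each OSD id to its up/down flag as seen by the monitors.
    up_by_id = {osd["osd"]: bool(osd["up"]) for osd in dump["osds"]}

    summary = defaultdict(lambda: {"up": [], "down": []})
    for entry in metadata:
        arch = entry.get("arch", "unknown")
        # An OSD missing from the dump is counted as down.
        state = "up" if up_by_id.get(entry["id"], False) else "down"
        summary[arch][state].append(entry["id"])
    return {arch: dict(states) for arch, states in summary.items()}
```

Calling group_osds_by_arch(ceph_json("osd", "metadata"),
ceph_json("osd", "dump")) a few times in a row would show whether the set
of down OSDs follows the architecture split.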

Eventually, the armhf OSDs went down definitively (in the monitors'
view). (This might be when I enabled msgr2, but I do not remember the
exact timing.)

Starting one of the armhf OSDs causes this kind of line to appear in
monitors' logs:

2020-05-25 02:07:55.681 7f142df5b700 -1 --2-
[v2:[fdfc:0:0:2::e]:3300/0,v1:[fdfc:0:0:2::e]:6789/0] >>
conn(0x55f003781a80 0x55f004589b80 unknown :-1 s=HELLO_ACCEPTING pgs=0
cs=0 l=0 rx=0 tx=0).run_continuation failed decoding of frame header:
buffer::bad_alloc


Moving the disk and config from an armhf to an arm64 machine fixes the
issue.
