Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hello Val, hello Ard,

I am not sure, but the issue might be fixed. There is an interesting
comment in the upstream changelog [1] of Ceph Pacific v16.2.5:

"A long-standing bug that prevented 32-bit and 64-bit client/server
interoperability under msgr v2 has been fixed. In particular, mixing
armv7l (armhf) and x86_64 or aarch64 servers in the same cluster now
works."

Ceph version 16.2.7 is now in sid/unstable and has also passed the
autopkgtest for armhf.

Many thanks to Thomas, Bernd et al. for their work, much appreciated.

[1]: https://ceph.com/en/news/blog/2021/v16-2-5-pacific-released/

Best Regards
Berni
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hi Ard,

On 6/18/20 1:28 PM, Ard van Breemen wrote:
>> The biggest issue in maintaining ceph is to make it build on 32 bit
>> architectures. This seems not to be supported at all by upstream anymore.
>
> First of all, I don't know what your goal is to support 32 bit.

Debian supports it, so it should be supported if possible.

> I do have a goal: I have loads of armhf machines and only so many amd64
> machines that do not even have enough memory to properly support ceph
> and being able to do something (as the MON uses 1GB of memory alone).
> I have multiple sites with this situation, and for the foreseeable
> future, we will still be building infrastructure on armhf. Getting a
> decent AMD64 setup in any location is additional and probably
> unnecessary costs.

You'll either need to migrate to amd64 (or arm/whatever64) or pay
somebody to fix ceph upstream.

> I think the stance of the ceph community in this is: as long as nobody
> sends in patches they are not going to care. And they can't support it
> themselves because they have a totally different target (clouds).

It's the same: they support what they get paid for or what is needed.
People rarely use 32 bit these days; even on cheap ARM devices, 64 bit
is the way to go.

> I am willing to host the armhf releases and maybe the i386 releases on
> my server, that way there will be 32 bit releases but not official ones.

That doesn't matter; hosting is not the issue here.

> But I do want your involvement.

You can want that, but you won't get it. Send patches, or people who
will do the work. I'll happily accept patches, or even better, bug
reports with links to patches at upstream.

> I've been trying to compile it for a time, using sources from ceph and
> from proxmox, until I realised ceph nautilus is in backports. And it
> worked.
> So at least I want your guidance on how you build these... For now I've
> used an armhf machine, and I needed to limit the number of threads to 1
> due to the C++ compiler needing more than 1GB of RAM to compile a single
> source file.

Upstream has a detailed readme, or you can use the standard way to
build a Debian package with dpkg-buildpackage or similar tools.

> Not only do I want to make support complete so I can use hardware, I
> also think it's just bad programming not to use explicit sizes. And I am
> also on the verge of investing in amd64 clusters, I don't want it to
> depend on code that's depending on a lot of features.
> Anyway: I don't know how you build and test on non amd64 systems, do you
> also use armhf, or do you use a cross compile environment?

You can just build it if you are using the Debian source. Otherwise
you'll need a lot of patches to make it build, and even more to fix
the various 32 bit related bugs.

Bernd

--
Bernd Zeimetz
Debian GNU/Linux Developer
http://bzed.de    http://www.debian.org

GPG Fingerprint: ECA1 E3F2 8E11 2432 D485 DD95 EB36 171A 6FF9 435F
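The build route described above can be sketched as a few commands; a
minimal sketch, assuming a Debian system with deb-src entries enabled.
The parallel=1 limit is the low-memory armhf workaround mentioned in
this thread; the rest is standard Debian packaging tooling.

```shell
# Fetch the Debian source package and its build dependencies.
apt-get source ceph
sudo apt-get build-dep ceph

cd ceph-*/

# On armhf, g++ can need more than 1GB of RAM for a single
# translation unit, so force a single compile job.
export DEB_BUILD_OPTIONS="parallel=1"
dpkg-buildpackage -us -uc -j1
```

The resulting .deb files land in the parent directory.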
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hi Bernd,

On 2020-05-27 21:22, Bernd Zeimetz wrote:
> sorry for not replying inline, but I thought I'd just share my general
> opinion on this.
>
> The biggest issue in maintaining ceph is to make it build on 32 bit
> architectures. This seems not to be supported at all by upstream anymore.

First of all, I don't know what your goal is to support 32 bit. I do
have a goal: I have loads of armhf machines and only so many amd64
machines, which do not even have enough memory to properly support
ceph and still be able to do something (the MON alone uses 1GB of
memory). I have multiple sites in this situation, and for the
foreseeable future we will still be building infrastructure on armhf.
Getting a decent amd64 setup in every location means additional and
probably unnecessary costs.

> Between 14.2.7 and 14.2.9 I had a longer look into the issue and
> started to fix some issues, for example the parsing of config options
> does pretty broken things if the default for the option does not fit
> into a 32bit integer. Fixing this properly brought me to various other
> places where size_t is being used in the code, but actually an (at
> least) uint64_t is being required.
>
> Fedora already removed ceph for all 32bit architectures with a "not
> supported by upstream anymore", but I was not able to find an official
> statement from ceph upstream.

I think the stance of the ceph community in this is: as long as nobody
sends in patches they are not going to care. And they can't support it
themselves because they have a totally different target (clouds).

> Also unfortunately I did not yet find the time to collect my findings
> and send them to the ceph devel mailinglist, but I'd assume that they
> just don't want to support 32bit anymore, otherwise they'd test it
> properly.
>
> As the work to fix this properly seems to be a rather long task, I
> definitely won't do this. But I also don't want to upload maybe-working
> binaries to Debian anymore.
>
> So unless somebody fixes and tests ceph for 32bit (or does this for
> Debian, also fine for me - running the regression test suite is
> possible with enough resources and some hardware), I will remove all
> 32bit architectures with the next upload.

My Debian karma is bad, really bad. That's why I asked you what your
goal is in supporting 32 bit: I have a goal. I might also be able to
let 64 bit lxc containers talk to 32 bit lxc containers and real armhf
machines, so I can test.

I am willing to host the armhf releases and maybe the i386 releases on
my server; that way there will be 32 bit releases, just not official
ones. But I do want your involvement.

I've been trying to compile it for a while, using sources from ceph
and from proxmox, until I realised ceph nautilus is in backports. And
it worked. So at least I want your guidance on how you build these...
For now I've used an armhf machine, and I needed to limit the number
of threads to 1 due to the C++ compiler needing more than 1GB of RAM
to compile a single source file.

Not only do I want to make support complete so I can use the hardware,
I also think it's just bad programming not to use explicit sizes. And
I am also on the verge of investing in amd64 clusters; I don't want it
to depend on code that's depending on a lot of features.

Anyway: I don't know how you build and test on non-amd64 systems. Do
you also use armhf, or do you use a cross compile environment?

Regards,
Ard van Breemen
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hi,

sorry for not replying inline, but I thought I'd just share my general
opinion on this.

The biggest issue in maintaining ceph is to make it build on 32 bit
architectures. This seems not to be supported at all by upstream
anymore.

Between 14.2.7 and 14.2.9 I had a longer look into the issue and
started to fix some things; for example, the parsing of config options
does pretty broken things if the default for the option does not fit
into a 32bit integer. Fixing this properly brought me to various other
places where size_t is being used in the code but actually an (at
least) uint64_t is required.

Fedora already removed ceph for all 32bit architectures with a "not
supported by upstream anymore", but I was not able to find an official
statement from ceph upstream. Also, unfortunately, I did not yet find
the time to collect my findings and send them to the ceph devel
mailinglist, but I'd assume that they just don't want to support 32bit
anymore, otherwise they'd test it properly.

As the work to fix this properly seems to be a rather long task, I
definitely won't do it. But I also don't want to upload maybe-working
binaries to Debian anymore. So unless somebody fixes and tests ceph
for 32bit (or does this for Debian, which is also fine for me -
running the regression test suite is possible with enough resources
and some hardware), I will remove all 32bit architectures with the
next upload.

I guess those are not the news you wanted to hear, but so far that's
the situation.

Bernd

On 5/27/20 10:54 AM, Ard van Breemen wrote:
> Hi,
>
> On Tue, May 26, 2020 at 06:35:20PM +0200, Val Lorentz wrote:
>> Thanks for the tip.
>>
>> I just tried downgrading an OSD (armhf) and a monitor (amd64) to
>> 14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still
>> unable to communicate ("failed decoding of frame header:
>> buffer::bad_alloc").
>>
>> So this might be a different issue, although related.
>
> Well, 14.2.7-~bpo something did work on my armhf osd cluster,
> with 2 mons running on armhf, and one on proxmox pve 6 running
> ceph 14.2.8.
> What already did not work was OSDs on amd64 working together
> with a 2x armhf and 1x amd64 mon setup.
> I had a lot of problems getting it to work at all, but I thought
> it was just my lack of knowledge at that time. 99% of the
> problems are with setting up the correct secrets, or in other
> words, the handling of the "keyrings". Even between amd64 and
> amd64 this has been buggy, if I look at the release notes;
> specifically 14.2.6 to 14.2.7, I think.
> I assume the bugs are in authentication, because as long as I did
> not reboot the amd64 it works.
> The daemons authenticate using the secrets, and the secret gives
> an authentication ticket.
>
> Anyway: the most simple test is to install a system, rsync
> /etc/ceph and type in ceph status. It either works (on 32 bits,
> fix the timeout in the python script, because if you don't it
> won't work at all) or it doesn't return at all.
>
> I will test if it's also the case with an armhf ceph cli client
> to an amd64 cluster. I only have one working amd64 cluster
> though, and it has 2 fake OSDs, because amd64 clusters are too
> expensive to experiment with.
> I have to do some networking hacks though to connect the systems.
>
> Anyway: the kernel has no problem talking to either OSD type, so
> the kernel's protocol handling is implemented correctly, and
> cephx works between an rbd amd64 or armhf kernel client and armhf
> userspace.
> The rbd amd64 userspace utility however does not work at all. As
> far as I can see it can't get past authentication, but without
> any logs I am a bit riddled.
>
> By the way: the mgr dashboard module is about 99% correct. The
> disk space is obviously calculated incorrectly.
>
> Regards,
> Ard

--
Bernd Zeimetz
Debian GNU/Linux Developer
http://bzed.de    http://www.debian.org

GPG Fingerprint: ECA1 E3F2 8E11 2432 D485 DD95 EB36 171A 6FF9 435F
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hi,

On Tue, May 26, 2020 at 06:35:20PM +0200, Val Lorentz wrote:
> Thanks for the tip.
>
> I just tried downgrading an OSD (armhf) and a monitor (amd64) to
> 14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still
> unable to communicate ("failed decoding of frame header:
> buffer::bad_alloc").
>
> So this might be a different issue, although related.

Well, 14.2.7-~bpo something did work on my armhf osd cluster, with 2
mons running on armhf and one on proxmox pve 6 running ceph 14.2.8.
What already did not work was OSDs on amd64 working together with a
2x armhf and 1x amd64 mon setup. I had a lot of problems getting it to
work at all, but I thought it was just my lack of knowledge at that
time. 99% of the problems are with setting up the correct secrets, or
in other words, the handling of the "keyrings". Even between amd64 and
amd64 this has been buggy, if I look at the release notes;
specifically 14.2.6 to 14.2.7, I think. I assume the bugs are in
authentication, because as long as I did not reboot the amd64 it
works. The daemons authenticate using the secrets, and the secret
gives an authentication ticket.

Anyway: the most simple test is to install a system, rsync /etc/ceph
and type in ceph status. It either works (on 32 bits, fix the timeout
in the python script, because if you don't it won't work at all) or it
doesn't return at all.

I will test if it's also the case with an armhf ceph cli client to an
amd64 cluster. I only have one working amd64 cluster though, and it
has 2 fake OSDs, because amd64 clusters are too expensive to
experiment with. I have to do some networking hacks though to connect
the systems.

Anyway: the kernel has no problem talking to either OSD type, so the
kernel's protocol handling is implemented correctly, and cephx works
between an rbd amd64 or armhf kernel client and armhf userspace. The
rbd amd64 userspace utility however does not work at all. As far as I
can see it can't get past authentication, but without any logs I am a
bit riddled.

By the way: the mgr dashboard module is about 99% correct. The disk
space is obviously calculated incorrectly.

Regards,
Ard
--
.signature not found
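The "most simple test" described above can be written out as two
commands; a sketch with a hypothetical hostname (armhf-node), using
the coreutils timeout command to bound the run, since a broken client
otherwise never returns.

```shell
# Copy the cluster config and keyrings from a known-good node
# ("armhf-node" is a placeholder hostname).
rsync -a root@armhf-node:/etc/ceph/ /etc/ceph/

# Bound the call: a working client answers within seconds, a broken
# one hangs forever. Exit status 124 means timeout killed it.
timeout 30 ceph status
```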
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Thanks for the tip. I just tried downgrading an OSD (armhf) and a monitor (amd64) to 14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still unable to communicate ("failed decoding of frame header: buffer::bad_alloc"). So this might be a different issue, although related.
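For reference, the downgrade described above can be done by pointing
APT at snapshot.debian.org; a sketch, where the snapshot date is a
placeholder (it just has to be one that still carries
14.2.7-1~bpo10+1) and the package selection is illustrative.

```shell
# Add a snapshot.debian.org source; validity checks must be disabled
# because the Release files of old snapshots have expired.
echo 'deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20200301T000000Z/ buster-backports main' \
  > /etc/apt/sources.list.d/ceph-snapshot.list
apt-get update

# Downgrade the daemons to the pinned version.
apt-get install --allow-downgrades \
  ceph-osd=14.2.7-1~bpo10+1 ceph-mon=14.2.7-1~bpo10+1
```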
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Hi Guys,

I've had working OSDs on armhf using 14.2.7, fixed using the
workaround from #956293. The OSDs and mon worked on armhf 14.2.7 and
amd64 14.2.8 (a proxmox install). When I upgraded the 14.2.7 cluster
to 14.2.9, everything still worked, until I rebooted the proxmox
server. Everything since then just went sour.

So: I have a complete working ceph cluster on 14.2.9 running on arm.
ceph status works. Mapping rbd by echoing to
/sys/bus/rbd/add_single_major works (using the username, key and
monitors from ceph.conf) on kernel 5.6.11 amd64 and any other kernel
(armhf or whatever). So the ceph cluster works and the protocol is
still correct.

However, as soon as I want to do a simple ceph status on an amd64, I
get an indefinitely hanging ceph command line. No way to trace that
(please tell me how). This problem is limited to amd64, though: when I
install ceph on an i386 image, connecting to the ceph cluster works
and the cluster is healthy.

So, protocol-wise, the amd64 kernel works with 32 bit clusters, but
amd64 user space does not. This might be somewhere in the
authentication chain, as 14.2.9 was working (as far as I know) until I
rebooted the 64 bit system. And I think that last CVE fix might be the
problem.

Anyway, I hope this reaches someone...

Regards,
Ard van Breemen
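The kernel-side mapping described above bypasses the rbd userspace
tool entirely; a sketch with placeholder monitor address, key, pool
and image, following the write format of the kernel's rbd sysfs
interface ("-" in the snapshot position means no snapshot).

```shell
# Placeholder values: monitor address, cephx key, pool and image name.
MON="192.168.1.10:6789"
KEY="AQB-placeholder-key=="
POOL="rbd"
IMAGE="testimg"

# Ask the kernel rbd driver to map the image directly, without the
# rbd userspace tool. The device then appears as /dev/rbdN.
echo "$MON name=admin,secret=$KEY $POOL $IMAGE -" \
  > /sys/bus/rbd/add_single_major
```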
Bug#961481: ceph: Protocol incompatibility between armhf and amd64
Package: ceph
Version: 14.2.9-1~bpo10+1

Dear maintainers,

I run a cluster made of armhf and amd64 OSDs, and amd64 monitors and
manager. I recently updated my cluster from Luminous (12, in buster)
to Nautilus (14, in buster-backports), following the instructions
here:
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

At some point (and after hot-fixing for #956293 on the armhf
machines), I noticed something was off, as my OSDs kept flipping
between up and down, with all machines of one arch up and the others
down. Eventually, the armhf OSDs went definitively down (in the
monitors' view). This might have been when I enabled msgr2, but I do
not remember the exact timing.

Starting one of the armhf OSDs causes this kind of line to appear in
the monitors' logs:

2020-05-25 02:07:55.681 7f142df5b700 -1 --2- [v2:[fdfc:0:0:2::e]:3300/0,v1:[fdfc:0:0:2::e]:6789/0] >> conn(0x55f003781a80 0x55f004589b80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).run_continuation failed decoding of frame header: buffer::bad_alloc

Moving the disk and config from an armhf to an arm64 machine fixes the
issue.