Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
On Fri, May 22, 2020 at 11:25 AM Jay Ring <1874...@bugs.launchpad.net>
wrote:

> "However it should be possible to complete the do-release-upgrade to the
> point of requesting a reboot - don't - drop to the CLI and get all
> machines to this point and then:
>
>   restart the mons across all three machines
>   restart the mgrs across all three machines
>   restart the osds across all three machines"
>
> Yes, I believe this would work.
>
> However, that's not normally how I would do an upgrade.  Normally, I
> upgrade one machine, make sure it works, and then upgrade the next.  I
> have done it this way since I built the cluster back in Firefly.  When I
> did it that way this time, it destroyed every OSD on the node that I
> upgraded.
>

Although not best practice (upgrading a machine at a time, rather than the
mons, mgrs and osds in groups), when I tried this earlier today it did
actually work - hence my suspicion that I'm missing something about the
impacted deployments.
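
For reference, that grouped restart maps onto the systemd targets shipped
with the Ubuntu ceph packages, so on each machine it is roughly:

  sudo systemctl restart ceph-mon.target   # all mons first, across all machines
  sudo systemctl restart ceph-mgr.target   # then all mgrs
  sudo systemctl restart ceph-osd.target   # then all osds, once mons and mgrs are done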

My testing did a fresh deploy of eoan with Nautilus and then upgraded to
focal; maybe deployments which have been around for a while have different
on-disk state or characteristics which cause this issue.

I'm endeavouring to get to a point where we understand *why* this happens
in certain situations.

tl;dr I need more details about impacted deployments to be able to debug
this further.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1874939

Title:
  ceph-osd can't connect after upgrade to focal

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Hi Christian

On Fri, May 22, 2020 at 8:10 AM Christian Huebner <
1874...@bugs.launchpad.net> wrote:

> I filed this bug specifically for hyperconverged environments. Upgrading
> monitor nodes first and then upgrading separate OSD nodes is probably
> doable, but in a hyperconverged environment you cannot separate them.
>

I appreciate that, which is why I have endeavoured to reproduce your issue
on a hyperconverged deployment as well.


> I tried do-release-upgrade (a couple of times) without rebooting at the
> end, but found the monitors and OSDs were upgraded and deadlocked at the
> end.
> I tried shutting down all Ceph services first and then running
> do-release-upgrade, which started my Ceph services and destroyed my
> cluster.
> I tried manually upgrading Ceph, which is thwarted by the dependencies;
> it's all or nothing.
>
> I finally accomplished the upgrade by marking all Ceph packages held,
> then digging myself through the dependency jungle to upgrade the
> packages in the right sequence. This was an absolute nightmare and took
> me more than an hour per node. This is obviously not a production-ready
> way to do it, but at least Ceph Octopus is running on 20.04 now.
>
> There are two asks here:
>
> Separate the dependencies so that ceph-mon, ceph-mgr and ceph-osd can be
> installed separately (with the appropriate dependencies, but in a way
> that upgrading ceph-mon does not also try to upgrade ceph-osd). There is
> no good reason why an upgrade of ceph-mon should go down and back up the
> dependency tree and try to upgrade ceph-osd too. In fact, I would not
> want monitor packages on my OSD nodes and vice versa in a traditional
> cluster.
>

The various binary packages produced from the ceph source package are
strongly versioned against each other so that you can't end up with an
inappropriate/broken mix of binaries on disk at the same time.

Upgrading the ceph-mon package results in an upgrade of the ceph-osd
package because both have a strict versioned dependency on ceph-base at the
matching binary version.

This is how we enforce a known-good set of bits on disk - and it is also
why the package maintainer scripts don't restart the daemons on upgrade, so
that the restart process can be managed with appropriate upgrade ordering.
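
You can see that coupling in the package metadata, e.g.:

  apt-cache show ceph-mon | grep ^Depends
  apt-cache show ceph-osd | grep ^Depends

Both should list a strictly versioned dependency on a matching ceph-base.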


> And fix do-release-upgrade, so a Ceph cluster does not get restarted
> when the upgrade procedure ends. I can vouch for the services being
> restarted; I tried it several times, once even with the services shut
> down before do-release-upgrade was started.
>

If you shut down services, the postinst script starts
'ceph-{mon,osd,mgr}.target' so they would get started back up, but targets
and services won't get restarted - I tested this and checked the installed
maintainer scripts.

I think you'd have to disable and mask the targets *and* services to ensure
that the target start did not force daemons to start as well, but I did not
observe any restart behaviour during my upgrade testing (other than due to
the reboot of the system).
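
Concretely, I think that would look something like the following (using
osd.0 and the local hostname as example instance names):

  sudo systemctl disable --now ceph.target ceph-mon.target \
      ceph-mgr.target ceph-osd.target
  sudo systemctl mask ceph-mon@$(hostname).service \
      ceph-mgr@$(hostname).service ceph-osd@0.service

before the upgrade, then unmask and re-enable afterwards - but as above I
didn't see anything in my testing that would make this necessary.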


>
> An upgrade procedure that breaks customer data should be fixed.
>

Agreed, but the first step is reproducing the issue so that we can actually
identify what the problem is.

I've followed what I think is the same process that you undertook, but I've
not seen the same issue when running mixed-version MONs, MGRs and OSDs.

So there is something specific in your deployment that we've not captured
in this bug report yet.

Full details of a) /etc/ceph/ceph.conf and b) pool types and configurations
in use would be helpful.
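
For example, from any of the nodes (assuming the admin keyring is
available):

  cat /etc/ceph/ceph.conf
  sudo ceph versions
  sudo ceph osd pool ls detail
  sudo ceph osd crush rule dump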


Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-27 Thread Christian Huebner
This would work if all nodes have a single function only (mon, mgr, osd). I
tried everything to update the monitors first, but due to the dependencies
between the Ceph packages the monitors and mgr daemons cannot simply be
updated separately from the OSDs. What I don't get, though, is that once
all three monitors and mgrs are updated the OSDs do not fall back in line
after a reboot.
I will try to force the install of ceph-base, ceph-common and mon/mgr and
then force upgrade the OSDs to test whether that will work. If not, at
least a workflow should be considered that allows upgrade of hyperconverged
clusters, which are becoming more and more important for edge sites.
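
For reference, to see where a partially upgraded cluster stands, something
like the following from a mon node shows which daemons are still on
Nautilus and what release the OSDs are currently required to be at:

  sudo ceph versions
  sudo ceph osd dump | grep require_osd_release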

On Fri, Apr 24, 2020 at 5:50 PM Dan Hill <1874...@bugs.launchpad.net>
wrote:

> Eoan packages Nautilus, while Focal packages Octopus:
>  ceph | 14.2.2-0ubuntu3  | eoan
>  ceph | 14.2.4-0ubuntu0.19.10.2  | eoan-security
>  ceph | 14.2.8-0ubuntu0.19.10.1  | eoan-updates
>  ceph | 15.2.1-0ubuntu1  | focal
>  ceph | 15.2.1-0ubuntu2  | focal-proposed
>
> When upgrading your cluster, make sure to follow the Octopus upgrade
> guidelines [0]. Specifically, the Mon and Mgr nodes must be upgraded and
> their services restarted before upgrading OSD nodes.
>
> [0] https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-
> mimic-or-nautilus
> 
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1874939
>
> Title:
>   ceph-osd can't connect after upgrade to focal
>
> Status in ceph package in Ubuntu:
>   New
>
> Bug description:
>   Upon upgrading a Ceph node with do-release-upgrade from eoan to focal,
>   the OSD doesn't connect after the upgrade. I rolled back the change
>   (VBox snapshot) and tried again, same result. I also tried to hold
>   back the Ceph packages and upgrade after the fact, but again same
>   result.
>
>   Expected behavior: OSD connects to cluster after upgrade.
>
>   Actual behavior: OSD log shows endlessly repeated
>   'tick_without_osd_lock' messages. OSD will stay down from perspective
>   of the cluster.
>
>   Extract from debug log of OSD:
>
>   2020-04-24T16:25:35.811-0700 7fd70e83d700  5 osd.0 16499 heartbeat
> osd_stat(store_statfs(0x4499/0x4000/0x24000, data
> 0x14bb97877/0x1bb66, compress 0x0/0x0/0x0, omap 0x2bbf, meta
> 0x3fffd441), peers [] op hist [])
>   2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499
> check_full_status cur ratio 0.769796, physical ratio 0.769796, new state
> none
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0
> queries
>   2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>
>   This repeats over and over again.
>
>   strace of the process yields lots of unfinished futex access attempts:
>
>   [pid  2130] futex(0x55b17b8e216c,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054,
> tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY 
>   [pid  2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
>   [pid  2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL
> 
>   [pid  2190] <...