I am finally in the process of migrating my OSDs to bluestore and thought I 
would give you some input on how I am approaching it.
Some of the saga you can find in another ML thread here: 
https://www.spinics.net/lists/ceph-users/msg41802.html

For my first OSD I was cautious: I outed the OSD without downing it, allowing 
it to move its data off first.
Some background on my cluster: this OSD is an 8TB spinner with an NVMe 
partition previously used for journaling in filestore, now intended to be used 
for block.db in bluestore.

Then I downed it, flushed the journal, destroyed it, zapped it with 
ceph-volume, set the norecover and norebalance flags, ran ceph osd crush remove 
osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID, and used ceph-volume 
locally to create the new LVM target. Then I unset the norecover and 
norebalance flags and it backfilled like normal (rough sketch of the sequence 
below).
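
Roughly, that sequence looks like the following (a sketch from memory rather 
than a verbatim transcript; /dev/sdX and /dev/nvme0n1pY are placeholders for 
the 8TB spinner and its NVMe partition, so substitute your own devices):

  ceph osd out $ID                        # let the data drain off; wait for active+clean
  systemctl stop ceph-osd@$ID             # down the OSD
  ceph-osd -i $ID --flush-journal         # flush the filestore journal
  ceph osd destroy $ID --yes-i-really-mean-it
  umount /var/lib/ceph/osd/ceph-$ID
  ceph-volume lvm zap /dev/sdX            # the spinner
  ceph-volume lvm zap /dev/nvme0n1pY      # old journal / future block.db partition
  ceph osd set norecover
  ceph osd set norebalance
  ceph osd crush remove osd.$ID
  ceph auth del osd.$ID
  ceph osd rm osd.$ID
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY
  ceph osd unset norecover
  ceph osd unset norebalance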

I initially ran into issues with specifying --osd-id causing my OSDs to fail 
to start, but after removing that option I was able to get the new OSD to fill 
in the gap of the OSD I just removed.
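
Concretely, an invocation along these lines failed for me (device names are 
placeholders again; presumably the bug Alfredo links below, 
http://tracker.ceph.com/issues/22642):

  ceph-volume lvm create --bluestore --osd-id $ID --data /dev/sdX --block.db /dev/nvme0n1pY

while the same command without --osd-id worked and the new OSD picked up the 
freed ID on its own:

  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY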

I’m now doing quicker, more destructive migrations in an attempt to reduce data 
movement.
This way I don’t read from the OSD I’m replacing, write to other OSDs 
temporarily, read back from those temporary OSDs, and write back to the ‘new’ 
OSD.
I’m just reading from the replicas and writing to the ‘new’ OSD.

So I’m setting the norecover and norebalance flags, downing the OSD (but not 
out; it stays in, and I also have the noout flag set), destroying/zapping it, 
recreating it with ceph-volume, and unsetting the flags, at which point it 
starts backfilling (rough sketch below).
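
Roughly (same placeholder device names as before; again a sketch, not a 
verbatim transcript):

  ceph osd set noout                      # the OSD stays in and won't be marked out while down
  ceph osd set norecover
  ceph osd set norebalance
  systemctl stop ceph-osd@$ID             # down the OSD
  ceph osd destroy $ID --yes-i-really-mean-it   # marks it destroyed, keeps the CRUSH entry
  umount /var/lib/ceph/osd/ceph-$ID
  ceph-volume lvm zap /dev/sdX
  ceph-volume lvm zap /dev/nvme0n1pY
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY
  ceph osd unset noout
  ceph osd unset norecover
  ceph osd unset norebalance
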
For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time 
to offload one and then backfill it back from them. I trust my disks enough to 
backfill from the other disks, and it’s going well. I’m also seeing very good 
write performance while backfilling compared to previous drive replacements in 
filestore, so that’s very promising.

Reed

> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
> 
> Hi Alfredo,
> 
> thank you for your comments:
> 
> Quoting Alfredo Deza <ad...@redhat.com>:
>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>> Dear *,
>>> 
>>> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
>>> keeping the OSD number? There have been a number of messages on the list,
>>> reporting problems, and my experience is the same. (Removing the existing
>>> OSD and creating a new one does work for me.)
>>> 
>>> I'm working on a Ceph 12.2.2 cluster and tried following
>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>> - this basically says
>>> 
>>> 1. destroy old OSD
>>> 2. zap the disk
>>> 3. prepare the new OSD
>>> 4. activate the new OSD
>>> 
>>> I never got step 4 to complete. The closest I got was by doing the following
>>> steps (assuming OSD ID "999" on /dev/sdzz):
>>> 
>>> 1. Stop the old OSD via systemd (osd-node # systemctl stop
>>> ceph-osd@999.service)
>>> 
>>> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>>> 
>>> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
>>> volume group
>>> 
>>> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>>> 
>>> 4. destroy the old OSD (osd-node # ceph osd destroy 999
>>> --yes-i-really-mean-it)
>>> 
>>> 5. create a new OSD entry (osd-node # ceph osd new $(cat
>>> /var/lib/ceph/osd/ceph-999/fsid) 999)
>> 
>> Steps 5 and 6 are problematic if you are going to be trying ceph-volume
>> later on, which takes care of doing this for you.
>> 
>>> 
>>> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
>>> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
>>> /var/lib/ceph/osd/ceph-999/keyring)
> 
> At first I tried to follow the documented steps (without my steps 5 and 6), 
> which did not work for me. The documented approach failed with "init 
> authentication failed: (1) Operation not permitted", because ceph-volume did 
> not actually add the auth entry for me.
> 
> But even after manually adding the authentication, the "ceph-volume" approach 
> failed, as the OSD was still marked "destroyed" in the osdmap epoch used by 
> ceph-osd (see the commented messages from ceph-osd.999.log below).
> 
>>> 
>>> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
>>> --osd-id 999 --data /dev/sdzz)
>> 
>> You are going to hit a bug in ceph-volume that is preventing you from
>> specifying the osd id directly if the ID has been destroyed.
>> 
>> See http://tracker.ceph.com/issues/22642
> 
> If I read that bug description correctly, you're confirming why I needed step 
> #6 above (manually adding the OSD auth entry). But even if ceph-volume had 
> added it, the ceph-osd.log entries suggest that starting the OSD would still 
> have failed, because it was accessing the wrong osdmap epoch.
> 
> To me it seems like I'm hitting a bug outside of ceph-volume - unless it's 
> ceph-volume that somehow determines which osdmap epoch is used by ceph-osd.
> 
>> In order for this to work, you would need to make sure that the ID has
>> really been destroyed and avoid passing --osd-id in ceph-volume. The caveat
>> being that you will get whatever ID is available next in the cluster.
> 
> Yes, that's the work-around I then used - purge the old OSD and create a new 
> one.
> 
> Thanks & regards,
> Jens
> 
>>> [...]
>>> --- cut here ---
>>> # first of multiple attempts, before "ceph auth add ..."
>>> # no actual epoch referenced, as login failed due to missing auth
>>> 2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has features
>>> 288232575208783872, adjusting msgr requires for clients
>>> 2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has features
>>> 288232575208783872 was 8705, adjusting msgr requires for mons
>>> 2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has features
>>> 288232575208783872, adjusting msgr requires for osds
>>> 2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
>>> 2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
>>> 2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using weightedpriority
>>> op queue with priority op cut off at 64.
>>> 2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors
>>> {default=true}
>>> 2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication
>>> failed: (1) Operation not permitted
>>> 
>>> # after "ceph auth ..."
>>> # note the different epochs below? BTW, 110587 is the current epoch at that
>>> time and osd.999 is marked destroyed there
>>> # 109892: much too old to offer any details
>>> # 110587: modified 2018-01-09 23:43:13.202381
>>> 
>>> 2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has features
>>> 288232575208783872, adjusting msgr requires for clients
>>> 2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has features
>>> 288232575208783872 was 8705, adjusting msgr requires for mons
>>> 2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has features
>>> 288232575208783872, adjusting msgr requires for osds
>>> 2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
>>> 2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
>>> 2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using weightedpriority
>>> op queue with priority op cut off at 64.
>>> 2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors
>>> {default=true}
>>> 2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init,
>>> starting boot process
>>> 2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial
>>> osdmap
>>> 2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has
>>> features 288232610642264064, adjusting msgr requires for clients
>>> 2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map has
>>> features 288232610642264064 was 288232575208792577, adjusting msgr requires
>>> for mons
>>> 2018-01-10 00:08:00.970660 7fc546614700  0 osd.999 109892 crush map has
>>> features 1008808551021559808, adjusting msgr requires for osds
>>> 2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am
>>> destroyed, exiting
>>> 
>>> # another try
>>> # it is now using epoch 110587 for everything. But that one is off by one at
>>> that time already:
>>> # 110587: modified 2018-01-09 23:43:13.202381
>>> # 110588: modified 2018-01-10 00:12:55.271913
>>> 
>>> # but both 110587 and 110588 have osd.999 as "destroyed", so never mind.
>>> 2018-01-10 00:13:04.332026 7f408d5a4d00  0 osd.999 110587 crush map has
>>> features 288232610642264064, adjusting msgr requires for clients
>>> 2018-01-10 00:13:04.332037 7f408d5a4d00  0 osd.999 110587 crush map has
>>> features 288232610642264064 was 8705, adjusting msgr requires for mons
>>> 2018-01-10 00:13:04.332043 7f408d5a4d00  0 osd.999 110587 crush map has
>>> features 1008808551021559808, adjusting msgr requires for osds
>>> 2018-01-10 00:13:04.332092 7f408d5a4d00  0 osd.999 110587 load_pgs
>>> 2018-01-10 00:13:04.332096 7f408d5a4d00  0 osd.999 110587 load_pgs opened 0
>>> pgs
>>> 2018-01-10 00:13:04.332100 7f408d5a4d00  0 osd.999 110587 using
>>> weightedpriority op queue with priority op cut off at 64.
>>> 2018-01-10 00:13:04.332990 7f408d5a4d00 -1 osd.999 110587 log_to_monitors
>>> {default=true}
>>> 2018-01-10 00:13:06.026628 7f408d5a4d00  0 osd.999 110587 done with init,
>>> starting boot process
>>> 2018-01-10 00:13:06.027627 7f4075352700 -1 osd.999 110587 osdmap says I am
>>> destroyed, exiting
>>> 
>>> # the attempt after using "ceph osd new", which created epoch 110591 as the
>>> first with osd.999 as autoout,exists,new
>>> # But ceph-osd still uses 110587.
>>> # 110587: modified 2018-01-09 23:43:13.202381
>>> # 110591: modified 2018-01-10 00:30:44.850078
>>> 
>>> 2018-01-10 00:31:15.453871 7f1c57c58d00  0 osd.999 110587 crush map has
>>> features 288232610642264064, adjusting msgr requires for clients
>>> 2018-01-10 00:31:15.453882 7f1c57c58d00  0 osd.999 110587 crush map has
>>> features 288232610642264064 was 8705, adjusting msgr requires for mons
>>> 2018-01-10 00:31:15.453887 7f1c57c58d00  0 osd.999 110587 crush map has
>>> features 1008808551021559808, adjusting msgr requires for osds
>>> 2018-01-10 00:31:15.453940 7f1c57c58d00  0 osd.999 110587 load_pgs
>>> 2018-01-10 00:31:15.453945 7f1c57c58d00  0 osd.999 110587 load_pgs opened 0
>>> pgs
>>> 2018-01-10 00:31:15.453952 7f1c57c58d00  0 osd.999 110587 using
>>> weightedpriority op queue with priority op cut off at 64.
>>> 2018-01-10 00:31:15.454862 7f1c57c58d00 -1 osd.999 110587 log_to_monitors
>>> {default=true}
>>> 2018-01-10 00:31:15.520533 7f1c57c58d00  0 osd.999 110587 done with init,
>>> starting boot process
>>> 2018-01-10 00:31:15.521278 7f1c40207700 -1 osd.999 110587 osdmap says I am
>>> destroyed, exiting
>>> --- cut here ---
>>> [...]
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
