Also worth pointing out something a bit obvious: this kind of 
faster/destructive migration should only be attempted if all your pools are at 
least 3x replicated.

For example, if you had a 1x replicated pool you would lose data using this 
approach.
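
A quick way to double-check before you start (just a sketch; "rbd" below is a 
placeholder for whatever pools you actually have):

    # list every pool with its replication size; you want "size 3" or higher
    ceph osd pool ls detail

    # or query a single pool explicitly
    ceph osd pool get rbd size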

-- Dan

> On Jan 11, 2018, at 14:24, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> Thank you for documenting your progress and peril on the ML.
> 
> Luckily I only have 24x 8TB HDDs and 50x 1.92TB SSDs to migrate over to 
> bluestore.
> 
> 8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m 
> able to do about 3 at a time (1 node) for rip/replace.
> 
> Definitely taking it slow and steady, and the SSDs will move quickly for 
> backfills as well.
> Seeing about 1TB/6hr on backfills, without much performance hit on the rest 
> of everything. With about 5TB average utilization on each 8TB disk, that’s 
> roughly 30 hours per host; times 8 hosts that’s about 10 days, so a couple 
> of weeks is a safe amount of headroom.
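> 
> (Spelling that out, and assuming the 3 disks in a node backfill concurrently: 
> 5 TB per disk at 1 TB per 6 hr is ~30 hr per disk, hence ~30 hr per host; 
> 8 hosts x 30 hr = 240 hr, i.e. roughly 10 days.)
> 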
> The write performance certainly seems better on bluestore than on filestore, 
> so that likely helps as well.
> 
> I expect I can probably refill an SSD OSD in about an hour or two, and will 
> likely stagger those out.
> But with such a small number of OSDs currently, I’m taking the by-hand 
> approach rather than scripting it, so as to avoid similar pitfalls.
> 
> Reed 
> 
>> On Jan 11, 2018, at 12:38 PM, Brady Deetz <bde...@gmail.com> wrote:
>> 
>> I hear you on time. I have 350 x 6TB drives to convert. I recently posted 
>> about a disaster I created by automating my migration. Good luck.
>> 
>> On Jan 11, 2018 12:22 PM, "Reed Dier" <reed.d...@focusvq.com> wrote:
>> I am finally in the process of migrating my OSDs to bluestore and thought I 
>> would give you some input on how I am approaching it.
>> Some of the saga you can find in another ML thread here: 
>> https://www.spinics.net/lists/ceph-users/msg41802.html
>> 
>> For my first OSD I was cautious: I outed the OSD without downing it, 
>> allowing it to move its data off first.
>> Some background on my cluster: this OSD is an 8TB spinner with an NVMe 
>> partition previously used for journaling in filestore, intended to be used 
>> for block.db in bluestore.
>> 
>> Then I downed it, flushed the journal, destroyed it, zapped it with 
>> ceph-volume, set the norecover and norebalance flags, did ceph osd crush 
>> remove osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID, and used 
>> ceph-volume locally to create the new LVM target. Then I unset the 
>> norecover and norebalance flags and it backfilled like normal.
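>> 
>> Roughly, that sequence of commands looked like the following (a sketch, not 
>> cut-and-paste: $ID, the data disk /dev/sdX and the NVMe db partition 
>> /dev/nvme0n1pY stand in for the real values):
>> 
>> systemctl stop ceph-osd@$ID
>> ceph-osd -i $ID --flush-journal                # flush the filestore journal
>> ceph osd destroy $ID --yes-i-really-mean-it
>> ceph-volume lvm zap /dev/sdX                   # wipe the old data disk
>> ceph osd set norecover
>> ceph osd set norebalance
>> ceph osd crush remove osd.$ID
>> ceph auth del osd.$ID
>> ceph osd rm osd.$ID
>> ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY
>> ceph osd unset norecover
>> ceph osd unset norebalance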
>> 
>> I initially ran into issues with specifying --osd-id, which caused my OSDs 
>> to fail to start, but after removing that option I was able to get it to 
>> fill in the gap of the OSD I had just removed.
>> 
>> I’m now doing quicker, more destructive migrations in an attempt to reduce 
>> data movement.
>> This way I don’t read from the OSD I’m replacing, write to other OSDs 
>> temporarily, read back from the temporary OSDs, and write back to the ‘new’ 
>> OSD.
>> I’m just reading from the replicas and writing to the ‘new’ OSD.
>> 
>> So I’m setting the norecover and norebalance flags, downing the OSD (but 
>> not out; it stays in, and I also have the noout flag set), destroying/zapping 
>> it, recreating it with ceph-volume, and unsetting the flags, and it starts 
>> backfilling.
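>> 
>> In command form that is roughly (again just a sketch; $ID and /dev/sdX are 
>> placeholders, and you would add --block.db if the DB lives on another 
>> device):
>> 
>> ceph osd set noout
>> ceph osd set norecover
>> ceph osd set norebalance
>> systemctl stop ceph-osd@$ID                    # down, but still "in"
>> ceph osd destroy $ID --yes-i-really-mean-it
>> ceph-volume lvm zap /dev/sdX
>> ceph-volume lvm create --bluestore --data /dev/sdX
>> ceph osd unset norecover
>> ceph osd unset norebalance
>> ceph osd unset noout
>> 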
>> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time 
>> to offload one and then backfill back from the others. I trust my disks 
>> enough to backfill from the other disks, and it’s going well. I’m also seeing 
>> very good write performance backfilling compared to previous drive 
>> replacements in filestore, so that’s very promising.
>> 
>> Reed
>> 
>>> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>> 
>>> Hi Alfredo,
>>> 
>>> thank you for your comments:
>>> 
>>> Quoting Alfredo Deza <ad...@redhat.com>:
>>>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>>>> Dear *,
>>>>> 
>>>>> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
>>>>> keeping the OSD number? There have been a number of messages on the list,
>>>>> reporting problems, and my experience is the same. (Removing the existing
>>>>> OSD and creating a new one does work for me.)
>>>>> 
>>>>> I'm working on a Ceph 12.2.2 cluster and tried following
>>>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>>>> - this basically says
>>>>> 
>>>>> 1. destroy old OSD
>>>>> 2. zap the disk
>>>>> 3. prepare the new OSD
>>>>> 4. activate the new OSD
>>>>> 
>>>>> I never got step 4 to complete. The closest I got was by doing the 
>>>>> following
>>>>> steps (assuming OSD ID "999" on /dev/sdzz):
>>>>> 
>>>>> 1. Stop the old OSD via systemd (osd-node # systemctl stop
>>>>> ceph-osd@999.service)
>>>>> 
>>>>> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>>>>> 
>>>>> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
>>>>> volume group
>>>>> 
>>>>> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>>>>> 
>>>>> 4. destroy the old OSD (osd-node # ceph osd destroy 999
>>>>> --yes-i-really-mean-it)
>>>>> 
>>>>> 5. create a new OSD entry (osd-node # ceph osd new $(cat
>>>>> /var/lib/ceph/osd/ceph-999/fsid) 999)
>>>> 
>>>> Steps 5 and 6 are problematic if you are going to be trying ceph-volume
>>>> later on, which takes care of doing this for you.
>>>> 
>>>>> 
>>>>> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
>>>>> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
>>>>> /var/lib/ceph/osd/ceph-999/keyring)
>>> 
>>> I at first tried to follow the documented steps (without my steps 5 and 6), 
>>> which did not work for me. The documented approach failed with "init 
>>> authentication failed: (1) Operation not permitted", because ceph-volume 
>>> did not actually add the auth entry for me.
>>> 
>>> But even after manually adding the authentication, the "ceph-volume" 
>>> approach failed, as the OSD was still marked "destroyed" in the osdmap 
>>> epoch used by ceph-osd (see the commented messages from ceph-osd.999.log 
>>> below).
>>> 
>>>>> 
>>>>> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
>>>>> --osd-id 999 --data /dev/sdzz)
>>>> 
>>>> You are going to hit a bug in ceph-volume that prevents you from
>>>> specifying the OSD id directly if that ID has been destroyed.
>>>> 
>>>> See http://tracker.ceph.com/issues/22642
>>> 
>>> If I read that bug description correctly, you're confirming why I needed 
>>> step #6 above (manually adding the OSD auth entry). But even if ceph-volume 
>>> had added it, the ceph-osd.log entries suggest that starting the OSD would 
>>> still have failed, because of accessing the wrong osdmap epoch.
>>> 
>>> To me it seems like I'm hitting a bug outside of ceph-volume - unless it's 
>>> ceph-volume that somehow determines which osdmap epoch is used by ceph-osd.
>>> 
>>>> In order for this to work, you would need to make sure that the ID has
>>>> really been destroyed and avoid passing --osd-id in ceph-volume. The
>>>> caveat is that you will get whatever ID is available next in the cluster.
>>> 
>>> Yes, that's the work-around I then used - purge the old OSD and create a 
>>> new one.
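>>> 
>>> For the record, that work-around was roughly (a sketch; ceph-volume simply 
>>> gets whatever OSD ID the cluster hands out next, so no --osd-id is passed):
>>> 
>>> ceph osd purge 999 --yes-i-really-mean-it   # drops crush entry, auth key and OSD id
>>> ceph-volume lvm zap /dev/sdzz
>>> ceph-volume lvm create --bluestore --data /dev/sdzz   # note: no --osd-id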
>>> 
>>> Thanks & regards,
>>> Jens
>>> 
>>>>> [...]
>>>>> --- cut here ---
>>>>> # first of multiple attempts, before "ceph auth add ..."
>>>>> # no actual epoch referenced, as login failed due to missing auth
>>>>> 2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872, adjusting msgr requires for clients
>>>>> 2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872 was 8705, adjusting msgr requires for mons
>>>>> 2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872, adjusting msgr requires for osds
>>>>> 2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
>>>>> 2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
>>>>> 2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using 
>>>>> weightedpriority
>>>>> op queue with priority op cut off at 64.
>>>>> 2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors
>>>>> {default=true}
>>>>> 2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication
>>>>> failed: (1) Operation not permitted
>>>>> 
>>>>> # after "ceph auth ..."
>>>>> # note the different epochs below? BTW, 110587 is the current epoch at 
>>>>> that
>>>>> time and osd.999 is marked destroyed there
>>>>> # 109892: much too old to offer any details
>>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>>> 
>>>>> 2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872, adjusting msgr requires for clients
>>>>> 2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872 was 8705, adjusting msgr requires for mons
>>>>> 2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has 
>>>>> features
>>>>> 288232575208783872, adjusting msgr requires for osds
>>>>> 2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
>>>>> 2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
>>>>> 2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using 
>>>>> weightedpriority
>>>>> op queue with priority op cut off at 64.
>>>>> 2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors
>>>>> {default=true}
>>>>> 2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init,
>>>>> starting boot process
>>>>> 2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial
>>>>> osdmap
>>>>> 2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has
>>>>> features 288232610642264064, adjusting msgr requires for clients
>>>>> 2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map has
>>>>> features 288232610642264064 was 288232575208792577, adjusting msgr 
>>>>> requires
>>>>> for mons
>>>>> 2018-01-10 00:08:00.970660 7fc546614700  0 osd.999 109892 crush map has
>>>>> features 1008808551021559808, adjusting msgr requires for osds
>>>>> 2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am
>>>>> destroyed, exiting
>>>>> 
>>>>> # another try
>>>>> # it is now using epoch 110587 for everything. But that one is off by one 
>>>>> at
>>>>> that time already:
>>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>>> # 110588: modified 2018-01-10 00:12:55.271913
>>>>> 
>>>>> # but both 110587 and 110588 have osd.999 as "destroyed", so never mind.
>>>>> 2018-01-10 00:13:04.332026 7f408d5a4d00  0 osd.999 110587 crush map has
>>>>> features 288232610642264064, adjusting msgr requires for clients
>>>>> 2018-01-10 00:13:04.332037 7f408d5a4d00  0 osd.999 110587 crush map has
>>>>> features 288232610642264064 was 8705, adjusting msgr requires for mons
>>>>> 2018-01-10 00:13:04.332043 7f408d5a4d00  0 osd.999 110587 crush map has
>>>>> features 1008808551021559808, adjusting msgr requires for osds
>>>>> 2018-01-10 00:13:04.332092 7f408d5a4d00  0 osd.999 110587 load_pgs
>>>>> 2018-01-10 00:13:04.332096 7f408d5a4d00  0 osd.999 110587 load_pgs opened 0
>>>>> pgs
>>>>> 2018-01-10 00:13:04.332100 7f408d5a4d00  0 osd.999 110587 using
>>>>> weightedpriority op queue with priority op cut off at 64.
>>>>> 2018-01-10 00:13:04.332990 7f408d5a4d00 -1 osd.999 110587 log_to_monitors
>>>>> {default=true}
>>>>> 2018-01-10 00:13:06.026628 7f408d5a4d00  0 osd.999 110587 done with init,
>>>>> starting boot process
>>>>> 2018-01-10 00:13:06.027627 7f4075352700 -1 osd.999 110587 osdmap says I am
>>>>> destroyed, exiting
>>>>> 
>>>>> # the attempt after using "ceph osd new", which created epoch 110591 as 
>>>>> the
>>>>> first with osd.999 as autoout,exists,new
>>>>> # But ceph-osd still uses 110587.
>>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>>> # 110591: modified 2018-01-10 00:30:44.850078
>>>>> 
>>>>> 2018-01-10 00:31:15.453871 7f1c57c58d00  0 osd.999 110587 crush map has
>>>>> features 288232610642264064, adjusting msgr requires for clients
>>>>> 2018-01-10 00:31:15.453882 7f1c57c58d00  0 osd.999 110587 crush map has
>>>>> features 288232610642264064 was 8705, adjusting msgr requires for mons
>>>>> 2018-01-10 00:31:15.453887 7f1c57c58d00  0 osd.999 110587 crush map has
>>>>> features 1008808551021559808, adjusting msgr requires for osds
>>>>> 2018-01-10 00:31:15.453940 7f1c57c58d00  0 osd.999 110587 load_pgs
>>>>> 2018-01-10 00:31:15.453945 7f1c57c58d00  0 osd.999 110587 load_pgs opened 0
>>>>> pgs
>>>>> 2018-01-10 00:31:15.453952 7f1c57c58d00  0 osd.999 110587 using
>>>>> weightedpriority op queue with priority op cut off at 64.
>>>>> 2018-01-10 00:31:15.454862 7f1c57c58d00 -1 osd.999 110587 log_to_monitors
>>>>> {default=true}
>>>>> 2018-01-10 00:31:15.520533 7f1c57c58d00  0 osd.999 110587 done with init,
>>>>> starting boot process
>>>>> 2018-01-10 00:31:15.521278 7f1c40207700 -1 osd.999 110587 osdmap says I am
>>>>> destroyed, exiting
>>>>> --- cut here ---
>>>>> [...]
>>> 
>> 
>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
