If you're only using a 1GB DB partition, there is a very real possibility
it's already 100% full. The safe estimate for DB size seems to be 10GB per
1TB of OSD, so for a 4TB OSD a 40GB DB should work for most use cases
(except loads and loads of small files). There are a few threads that
mention how to check how much of your DB partition is in use. Once it's
full, it spills over to the HDD.
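
For example, the BlueFS counters on an OSD's admin socket show how much of
the DB is in use and whether anything has spilled onto the slow device
(osd.0 is just an example id here):

ceph daemon osd.0 perf dump bluefs

Compare db_total_bytes with db_used_bytes; a non-zero slow_used_bytes means
the DB has already spilled over onto the HDD.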

On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <caspars...@supernas.eu> wrote:

> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>
>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu>
>> wrote:
>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonst...@gmail.com>:
>>>
>>>> Caspar, it looks like your idea should work. The worst-case scenario
>>>> seems to be that the OSDs wouldn't start; you'd put the old SSD back in
>>>> and fall back to the plan of weighting them to 0, backfilling, then
>>>> recreating the OSDs. Definitely worth a try in my opinion, and I'd love
>>>> to hear about your experience afterwards.
>>>>
>>>>
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this ML; you're putting
>>> a lot of effort into answering many of the questions asked here, and your
>>> answers very often contain invaluable information.
>>>
>>>
>>> To follow up on this post, I went out and built a very small (Proxmox)
>>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL
>>> SSD. And it worked!
>>> Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSD's).
>>>
>>> Here's what I did on one node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2
>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>> 4) removed the old SSD physically from the node
>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>>> 6) ceph osd unset noout
>>>
>>> I assume that once the ddrescue step is finished, a 'partprobe' or
>>> something similar is triggered, udev finds the DB partitions on the new
>>> SSD, and the OSD's are started again (kind of what happens during
>>> hotplug). So it is probably better to clone the SSD in another (non-Ceph)
>>> system so as not to trigger any udev events.
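>>>
>>> For reference, a quick sanity check after the clone (the paths are just
>>> an example for osd.0; ceph-disk links block.db by partition UUID, and
>>> ddrescue copies the GPT, partition UUIDs included, verbatim):
>>>
>>> ls -l /var/lib/ceph/osd/ceph-0/block.db
>>>   (points to /dev/disk/by-partuuid/<uuid>)
>>> readlink -f /var/lib/ceph/osd/ceph-0/block.db
>>>   (should now resolve to a partition on the new SSD)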
>>>
>>> I also tested a reboot after this and everything still worked.
>>>
>>>
>>> The old SSD was 120GB and the new one is 256GB (cloning took around 4
>>> minutes).
>>> The delta of data was very low because it was a test cluster.
>>>
>>> All in all the OSD's in question were 'down' for only 5 minutes, so I
>>> stayed within the default 10-minute mon_osd_down_out_interval and didn't
>>> actually need to set noout. :)
>>>
>>
>> I kicked off a brief discussion about this with some of the BlueStore
>> guys and they're aware of the problem with migrating across SSDs, but so
>> far it's just a Trello card:
>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>> They do confirm you should be okay with dd'ing things across, assuming
>> symlinks get set up correctly as David noted.
>>
>>
> Great that it is on the radar to be addressed; this method does feel hacky.
>
>
>> I've got some other bad news, though: BlueStore has internal metadata
>> about the size of the block device it's using, so if you copy it onto a
>> larger block device, it will not actually make use of the additional space.
>> :(
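>> (To see what it recorded, something like
>> ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block
>> should print the on-disk label, including the size written at mkfs time;
>> the path is just an example.)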
>> -Greg
>>
>
> Yes, I was well aware of that, no problem. The reason is that the smaller
> SSD sizes are simply not being made anymore or have been discontinued by
> the manufacturer.
> It would be nice though if the DB could be resized in the future; the
> default 1GB DB size seems very small to me.
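>
> For new OSD's the DB size can at least be set bigger at creation time via
> ceph.conf before running ceph-disk, e.g. for roughly 40 GiB (the value is
> in bytes):
>
> [osd]
> bluestore_block_db_size = 42949672960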
>
> Caspar
>
>
>>
>>
>>>
>>> Kind regards,
>>> Caspar
>>>
>>>
>>>
>>>> Nico, it is not possible to change the WAL or DB size, location, etc.
>>>> after OSD creation. If you want to change the configuration of the OSD
>>>> after creation, you have to remove it from the cluster and recreate it.
>>>> There is no functionality similar to how you could move or recreate
>>>> FileStore OSD journals. I think this might be on the radar as a feature,
>>>> but I don't know for certain. I definitely consider it a regression in
>>>> BlueStore.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>>> nico.schottel...@ungleich.ch> wrote:
>>>>
>>>>>
>>>>> A very interesting question, and I would add the follow-up question:
>>>>>
>>>>> Is there an easy way to add an external DB/WAL device to an existing
>>>>> OSD?
>>>>>
>>>>> I suspect that it might be something along the lines of:
>>>>>
>>>>> - stop osd
>>>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>>>> - (maybe run some kind of osd mkfs ?)
>>>>> - start osd
>>>>>
>>>>> Has anyone done this so far, or does anyone have recommendations on how
>>>>> to do it?
>>>>>
>>>>> Which also makes me wonder: what is actually the format of the WAL and
>>>>> block DB in BlueStore? Is there any documentation available about it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Nico
>>>>>
>>>>>
>>>>> Caspar Smit <caspars...@supernas.eu> writes:
>>>>>
>>>>> > Hi All,
>>>>> >
>>>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>>>> > (when it is nearing its DWPD/TBW limit but has not failed yet)?
>>>>> >
>>>>> > It hosts DB partitions for 5 OSD's.
>>>>> >
>>>>> > Maybe something like:
>>>>> >
>>>>> > 1) ceph osd reweight 0 the 5 OSD's
>>>>> > 2) let backfilling complete
>>>>> > 3) destroy/remove the 5 OSD's
>>>>> > 4) replace SSD
>>>>> > 5) create 5 new OSD's with separate DB partitions on the new SSD
>>>>> >
>>>>> > When these 5 OSD's are big HDD's (8TB), a LOT of data has to be moved,
>>>>> > so I thought maybe the following would work:
>>>>> >
>>>>> > 1) ceph osd set noout
>>>>> > 2) stop the 5 OSD's (systemctl stop)
>>>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size (example
>>>>> > below)
>>>>> > 4) remove the old SSD
>>>>> > 5) start the 5 OSD's (systemctl start)
>>>>> > 6) let backfilling/recovery complete (only the delta of data between
>>>>> > OSD stop and now)
>>>>> > 7) ceph osd unset noout
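>>>>> >
>>>>> > (For step 3, something along these lines should do it; device names
>>>>> > are just an example:
>>>>> >
>>>>> > dd if=/dev/sdX of=/dev/sdY bs=4M status=progress
>>>>> >
>>>>> > or ddrescue if the old SSD is already throwing read errors.)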
>>>>> >
>>>>> > Would this be a viable method to replace a DB SSD? Is there any
>>>>> > udev/serial number/UUID stuff preventing this from working?
>>>>> >
>>>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>>>> > moving too much data?
>>>>> >
>>>>> > Kind regards,
>>>>> > Caspar
>>>>>
>>>>>
>>>>> --
>>>>> Modern, affordable, Swiss Virtual Machines. Visit
>>>>> www.datacenterlight.ch
>>>>>
>>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
