Re: [ceph-users] Nautilus HEALTH_WARN for msgr2 protocol

2019-06-14 Thread Yury Shevchuk
This procedure worked for us:

http://docs.ceph.com/docs/master/releases/nautilus/#v14-2-0-nautilus

  13. To enable the new v2 network protocol, issue the following command:

ceph mon enable-msgr2

  This will instruct all monitors that bind to the old default port 6789 for
  the legacy v1 protocol to also bind to the new 3300 v2 protocol port. To see
  if all monitors have been updated, run:

ceph mon dump

  and verify that each monitor has both a v2: and v1: address listed.
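
To illustrate, a quick check could look like this (monitor names and
addresses below are examples, not taken from any real cluster):

    # Hedged sketch: list the monitor entries and confirm each one shows
    # both a v2: and a v1: address.
    ceph mon dump | grep '^[0-9]'
    # Expected shape once msgr2 is enabled everywhere:
    #   0: [v2:10.1.0.10:3300/0,v1:10.1.0.10:6789/0] mon.node01
    # A monitor still speaking only v1 shows up as:
    #   0: v1:10.1.0.10:6789/0 mon.node01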

And then make sure all monitors are listed with both protocols in
/etc/ceph/ceph.conf, like this:

  mon_host = [v2:10.1.0.10:3300/0,v1:10.1.0.10:6789/0] [v2:10.1.0.11:3300/0,v1:10.1.0.11:6789/0] [v2:10.1.0.12:3300/0,v1:10.1.0.12:6789/0]


-- Yury

On Fri, Jun 14, 2019 at 05:10:56PM +0100, Bob Farrell wrote:
> Hi. Firstly thanks to all involved in this great mailing list, I learn lots
> from it every day.
> 
> We are running Ceph with a huge amount of success to store website
> themes/templates across a large collection of websites. We are very pleased
> with the solution in every way.
> 
> The only issue we have, which we have had since day 1, is we always see
> HEALTH_WARN:
> 
> health: HEALTH_WARN
> 1 monitors have not enabled msgr2
> 
> And this is reflected in the monmap:
> 
> monmaptool: monmap file /tmp/monmap
> epoch 7
> fsid 7273720d-04d7-480f-a77c-f0207ae35852
> last_changed 2019-04-02 17:21:56.935381
> created 2019-04-02 17:21:09.925941
> min_mon_release 14 (nautilus)
> 0: v1:172.30.0.144:6789/0 mon.node01.homeflow.co.uk
> 1: [v2:172.30.0.146:3300/0,v1:172.30.0.146:6789/0] mon.node03.homeflow.co.uk
> 2: [v2:172.30.0.147:3300/0,v1:172.30.0.147:6789/0] mon.node04.homeflow.co.uk
> 3: [v2:172.30.0.148:3300/0,v1:172.30.0.148:6789/0] mon.node05.homeflow.co.uk
> 4: [v2:172.30.0.145:3300/0,v1:172.30.0.145:6789/0] mon.node02.homeflow.co.uk
> 5: [v2:172.30.0.149:3300/0,v1:172.30.0.149:6789/0] mon.node06.homeflow.co.uk
> 6: [v2:172.30.0.150:3300/0,v1:172.30.0.150:6789/0] mon.node07.homeflow.co.uk
> 
> I never figured out the correct syntax to set up the first monitor to use
> both 6789 and 3300. The other monitors that join the cluster set this
> config automatically but I couldn't work out how to apply it to the first
> monitor node.
> 
> The cluster has been operating in production for at least a month now with
> no issues at all, so it would be nice to remove this warning as, at the
> moment, it's not really very useful as a monitoring metric.
> 
> Could somebody advise me on the safest/most sensible way to update the
> monmap so that node01 listens on v2 and v1 ?
> 
> Thanks for any help !



Re: [ceph-users] Luminous OSD: replace block.db partition

2019-05-27 Thread Yury Shevchuk
Hi Swami,

In Luminous you will have to delete and re-create the OSD with the
desired size.  Please follow this link for details:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034805.html
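
For reference, a minimal sketch of that procedure (volume group, LV names,
OSD id and the 100G db size are placeholders; adapt them to your layout):

    # Hedged sketch only: the OSD is destroyed and re-created, and its data
    # is rebuilt by normal recovery/backfill afterwards.
    ceph-volume lvm zap --destroy --osd-id 2
    ceph osd destroy 2 --yes-i-really-mean-it
    lvcreate -L1G   -n osd2wal vg0
    lvcreate -L100G -n osd2db  vg0
    lvcreate -L400G -n osd2    vg0
    ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 \
        --block.db vg0/osd2db --block.wal vg0/osd2wal
    # Wait for the cluster to return to HEALTH_OK before touching the next OSD.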

-- Yury

PS [cross-posting to ceph-devel removed]

On Mon, May 27, 2019 at 05:37:02PM +0530, M Ranga Swami Reddy wrote:
> Hello - I have created an OSD with 20G block.db, now I wanted to change the
> block.db to 100G size.
> Please let us know if there is a process for the same.
> 
> PS: Ceph version 12.2.4 with bluestore backend.
> 
> Thanks
> Swami



Re: [ceph-users] Grow bluestore PV/LV

2019-05-15 Thread Yury Shevchuk
Hello Michael,

growing (expanding) a bluestore OSD is possible since Nautilus (14.2.0)
using the bluefs-bdev-expand tool, as discussed in this thread:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034116.html
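
Condensed, the sequence that worked for us looks roughly like this (LV path
and OSD id are placeholders; the OSD must be stopped first):

    # Hedged sketch, assuming an LVM-backed bluestore OSD.
    systemctl stop ceph-osd@2
    lvextend -L+50G /dev/vg0/osd2                  # grow the underlying LV
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
    ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size   # verify
    systemctl start ceph-osd@2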

-- Yury

On Wed, May 15, 2019 at 10:03:29PM -0700, Michael Andersen wrote:
> Hi
> 
> After growing the size of an OSD's PV/LV, how can I get bluestore to see
> the new space as available? It does notice the LV has changed size, but it
> sees the new space as occupied.
> 
> This is the same question as:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023893.html
> and
> that original poster spent a lot of effort in explaining exactly what he
> meant, but I could not find a reply to his email.
> 
> Thanks
> Michael



Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Yury Shevchuk
cephfs is not alone in this; there are other inode-less filesystems
around.  They all report zeroes:

# df -i /nfs-dir
Filesystem  Inodes IUsed IFree IUse% Mounted on
xxx.xxx.xx.x:/xxx/xxx/x  0 0 0 - /xxx

# df -i /reiserfs-dir
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/xxx//x                    0       0       0     -  /xxx/xxx//x

# df -i /btrfs-dir
Filesystem   Inodes IUsed IFree IUse% Mounted on
/xxx/xx/  0 0 0 - /

Would YUM refuse to install on them all, including mainstream btrfs?
I doubt that.  Perhaps YUM is confused by the Inodes count that
cephfs (alone!) reports as non-zero.  Look at the YUM sources?
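
If you want Oliver's manual workaround to be permanent rather than per-run,
a sketch (diskspacecheck is a stock yum.conf option; disabling it skips the
space/inode check for all transactions, so weigh that trade-off):

    # Hedged sketch: persist the workaround host-wide.
    echo 'diskspacecheck=0' >> /etc/yum.conf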


-- Yury

On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote:
> Am 01.05.19 um 00:51 schrieb Patrick Donnelly:
> > On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
> >  wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> we have a classic libvirtd / KVM based virtualization cluster using 
> >> Ceph-RBD (librbd) as backend and sharing the libvirtd configuration 
> >> between the nodes via CephFS
> >> (all on Mimic).
> >>
> >> To share the libvirtd configuration between the nodes, we have symlinked 
> >> some folders from /etc/libvirt to their counterparts on /cephfs,
> >> so all nodes see the same configuration.
> >> In general, this works very well (of course, there's a "gotcha": Libvirtd 
> >> needs reloading / restart for some changes to the XMLs, we have automated 
> >> that),
> >> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
> >> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
> >>
> >>Transaction check error:
> >>  installing package 
> >> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on 
> >> the /cephfs filesystem
> >>  installing package 
> >> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
> >> the /cephfs filesystem
> >>
> >> So it seems yum follows the symlinks and checks the available inodes on 
> >> /cephfs. Sadly, that reveals:
> >>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
> >>Filesystem Inodes IUsed IFree IUse% Mounted on
> >>    ceph-fuse                 68    68     0  100% /cephfs
> >>
> >> I think that's just because there is no real "limit" on the maximum inodes 
> >> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
> >>
> >> What do you think? Should CephFS return something different than 0 here to 
> >> not break existing tools?
> >> Or should the tools behave differently? But one might also argue that if 
> >> the total number of Inodes matches the used number of Inodes, the FS is 
> >> indeed "full".
> >> It's just unclear to me who to file a bug against ;-).
> >>
> >> Right now, I am just using:
> >> yum -y --setopt=diskspacecheck=0 update
> >> as a manual workaround, but this is naturally rather cumbersome.
> > 
> > This is fallout from [1]. See discussion on setting f_free to 0 here
> > [2]. In summary, userland tools are trying to be too clever by looking
> > at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
> > there are other instances of this.]
> > 
> > [1] https://github.com/ceph/ceph/pull/23323
> > [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911
> 
> Thanks for the references! That certainly enlightens me on why this decision 
> was taken, and of course I congratulate upon trying to prevent false 
> monitoring. 
> Still, even though I don't have other instances at hand (yet), I am not yet 
> convinced "0" is a better choice than "ULONG_MAX". 
> It certainly alerts users / monitoring software about doing something wrong, 
> but it prevents a check which any file system (or rather, any file system I 
> encountered so far) allows. 
> 
> Yum (or other package managers doing things in a safe manner) need to ensure 
> they can fully install a package in an "atomic" way before doing so,
> since rolling back may be complex or even impossible (for most file systems). 
> So they need a way to check if a file system can store the additional files 
> in terms of space and inodes, before placing the data there,
> or risk installing something only partially, and potentially being unable to 
> roll back. 
> 
> In most cases, the free number of inodes allows for that check. Of course, 
> that has no (direct) meaning for CephFS, so one might argue the tools should 
> add an exception for CephFS - 
> but as the discussion correctly stated, there's no defined way to find out 
> where the file system has a notion of "free inodes", and - if we go for an 
> exceptional treatment for a list of file systems - 
> not even a "clean" way to find out if the file system is CephFS (the tools 
> will only see it is FUSE for ceph-fuse) [1]. 
> 
> So my question is: 
> How are tools which need to ensure that a file system can accept a given 
> number of bytes and inodes before ac

Re: [ceph-users] bluestore block/db/wal sizing (Was: bluefs-bdev-expand experience)

2019-04-13 Thread Yury Shevchuk
On Fri, Apr 12, 2019 at 08:06:49AM -0400, Alfredo Deza wrote:
> On Thu, Apr 11, 2019 at 4:23 PM Yury Shevchuk  wrote:
[quoting trimmed]
> >
> > I used this (admittedly not very recent) message as a guide for
> > volume sizing:
> >
> >   https://www.spinics.net/lists/ceph-devel/msg37804.html
> >
> > It reads: "1GB for block.wal.  For block.db, as much as you have."
> 
> Hey Yury, I've tried to further clarify this part of configuring
> Bluestore OSDs because "as large as possible" isn't accurate enough
> and ends up raising more questions.
> 
> Have you read the sizing section here?
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> 
> block.db should be at least 4% the size of block. So if your block
> device is 1TB your block.db shouldn't be less than 40GB.
> 
> I didn't see any mentions of NVMe or SSDs in your message, if that is
> the case a separate block.db is not required at all, and you can just
> create block and be done with it.
> 
> If I did miss out an SSD mention in this thread, then block.db on the
> fast device is what is recommended, and lastly, no block.wal is
> required unless you have something faster
> than the block.db.
> 
> If the doc link isn't clear enough, I would like to hear about it so I
> can improve it further!

Hi Alfredo,

This phrase in the doc confused me: "Generally, block.db should have
as large as possible logical volumes".  It made me think: the bigger the
NVMe/SSD, the happier Ceph will be?  The doc gives no hint at which
point the money invested in an SSD stops paying off, no trade-off
explanation.

The following sentence was enlightening:

"If they are on separate devices, then you need to make it as big as
you need to ensure that it won't spill over (or if it does that you're
ok with the degraded performance while the db partition is full)."

The quote is from
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021030.html
(Thanks to Igor for pointing me in the right direction)

As to separate block/db/wal in my setup that lacks an SSD: this is an
attempt to secure a growth path.  If we ever start struggling with OSD
performance, we can add an SSD to the LVM volume group and pvmove(8)
block.db to the SSD without recreating the OSD, even without stopping it.
If we start seeing spillover messages
(https://tracker.ceph.com/issues/38745), we can lvextend(8) block.db,
adding another SSD if necessary, then bluefs-bdev-expand.  We can
blktrace(8) each volume separately to identify where the bottlenecks are.
That is the point of creating separate block/db/wal even in a
single-drive setup.  However, this is all speculation; we have very
little Ceph experience as of now.

BTW, to use pvmove(8) one needs to create all volumes in one LVM volume
group, not a separate VG for each LV as the doc suggests:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#block-and-block-db
LVs can still be placed on the intended PVs via the PhysicalVolumePath
argument to lvcreate(8).
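
A minimal sketch of that growth path, assuming a hypothetical new SSD
partition /dev/nvme0n1p1 and that block.db currently sits on /dev/sdb1
(both device names are made up):

    # Hedged sketch: migrate only the block.db LV onto the SSD, online.
    vgextend vg0 /dev/nvme0n1p1                  # add the SSD to the existing VG
    pvmove -n osd2db /dev/sdb1 /dev/nvme0n1p1    # move that LV's extents to the SSD
    # The OSD keeps using vg0/osd2db throughout; Ceph never notices the move.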

Regards,


-- Yury

[quoting trimmed]


Re: [ceph-users] bluefs-bdev-expand experience

2019-04-11 Thread Yury Shevchuk
Hi Igor!

I have upgraded from Luminous to Nautilus, and slow device expansion
does work now.  The steps are shown below to round out the topic.

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA   OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739      1.0 233 GiB  91 GiB 90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.04 128     up
 1   hdd 0.22739      1.0 233 GiB  91 GiB 90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.04 128     up
 3   hdd 0.22739        0     0 B     0 B    0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739      1.0 481 GiB 172 GiB 90 GiB 201 MiB 823 MiB 309 GiB 35.70 0.96 128     up
TOTAL 947 GiB 353 GiB 269 GiB 610 MiB 2.4 GiB 594 GiB 37.28
MIN/MAX VAR: 0.96/1.04  STDDEV: 1.62

node2# lvextend -L+50G /dev/vg0/osd2
  Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 
450.00 GiB (115200 extents).
  Logical volume vg0/osd2 successfully resized.

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
2019-04-11 22:28:00.240 7f2e24e190c0 -1 bluestore(/var/lib/ceph/osd/ceph-2) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-2/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
...
*** Caught signal (Aborted) **
[two pages of stack dump stripped]

My mistake in the first place: I tried again to expand the OSD without stopping it.

node2# systemctl stop ceph-osd.target

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
0 : device size 0x4000 : own 0x[1000~3000] = 0x3000 : using 0x8ff000
1 : device size 0x144000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 0x24dfe000
2 : device size 0x708000 : own 0x[30~4] = 0x4 : using 0x0
Expanding...
2 : expanding  from 0x64 to 0x708000
2 : size label updated to 483183820800

node2# ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
"size": 483183820800,

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA   OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739      1.0 233 GiB  91 GiB 90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.10 128     up
 1   hdd 0.22739      1.0 233 GiB  91 GiB 90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.10 128     up
 3   hdd 0.22739        0     0 B     0 B    0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739      1.0 531 GiB 172 GiB 90 GiB 185 MiB 839 MiB 359 GiB 32.33 0.91 128     up
TOTAL 997 GiB 353 GiB 269 GiB 593 MiB 2.4 GiB 644 GiB 35.41
MIN/MAX VAR: 0.91/1.10  STDDEV: 3.37

It worked: AVAIL = 594+50 = 644.  Great!
Thanks a lot for your help.

And one more question regarding your last remark is inline below.

On Wed, Apr 10, 2019 at 09:54:35PM +0300, Igor Fedotov wrote:
>
> On 4/9/2019 1:59 PM, Yury Shevchuk wrote:
> > Igor, thank you, Round 2 is explained now.
> >
> > Main aka block aka slow device cannot be expanded in Luminous, this
> > functionality will be available after upgrade to Nautilus.
> > Wal and db devices can be expanded in Luminous.
> >
> > Now I have recreated osd2 once again to get rid of the paradoxical
> > ceph osd df output and tried to test db expansion, 40G -> 60G:
> >
> > node2:/# ceph-volume lvm zap --destroy --osd-id 2
> > node2:/# ceph osd lost 2 --yes-i-really-mean-it
> > node2:/# ceph osd destroy 2 --yes-i-really-mean-it
> > node2:/# lvcreate -L1G -n osd2wal vg0
> > node2:/# lvcreate -L40G -n osd2db vg0
> > node2:/# lvcreate -L400G -n osd2 vg0
> > node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
> > --block.db vg0/osd2db --block.wal vg0/osd2wal
> >
> > node2:/# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
> >   0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
> >   1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
> >   3   hdd 0.227390 0B  0B 0B00   0
> >   2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
> >  TOTAL 866GiB 28.5GiB 837GiB 3.29
> > MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83
> >
> > node2:/# lvextend -L+20G /dev/vg0/osd2db
> >Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) 
> > to 60.00 GiB (15360 extents).
> >Logical volume vg0/osd2db successfully resized.
> >
> > node2:/# ceph-bluestore-tool bluefs-bdev-expand --path 
> > /var/lib/ceph/osd/ceph-2/
> > inferring bluefs devices from bluestore path
> >   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> >   slot 1 /var/lib/ceph/osd/ceph-2//block.db
> >   slot 2 /var/lib/ceph/osd/ceph-2//block
> > 0 : size 0x4000 : own 0x[1000~3000]
> 

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-09 Thread Yury Shevchuk
Igor, thank you, Round 2 is explained now.

Main aka block aka slow device cannot be expanded in Luminous, this
functionality will be available after upgrade to Nautilus.
Wal and db devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
ceph osd df output and tried to test db expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
--block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
  Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 
60.00 GiB (15360 extents).
  Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
 slot 0 /var/lib/ceph/osd/ceph-2//block.wal
 slot 1 /var/lib/ceph/osd/ceph-2//block.db
 slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
"size": 64424509440,

The label updated correctly, but ceph osd df has not changed.
I expected to see 391 + 20 = 411 GiB in the AVAIL column, but it stays at 391:

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

I have restarted monitors on all three nodes, 391GiB stays intact.

OK, but I ran bluefs-bdev-expand on a running OSD... probably not good,
since it seems to open bluefs directly... trying once again:

node2:/# systemctl stop ceph-osd@2

node2:/# lvextend -L+20G /dev/vg0/osd2db
  Size of logical volume vg0/osd2db changed from 60.00 GiB (15360 extents) to 
80.00 GiB (20480 extents).
  Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
 slot 0 /var/lib/ceph/osd/ceph-2//block.wal
 slot 1 /var/lib/ceph/osd/ceph-2//block.db
 slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0x14 : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0x14
1 : size label updated to 85899345920

node2:/# systemctl start ceph-osd@2
node2:/# systemctl restart ceph-mon@pier42

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.50GiB 391GiB 2.37 0.72   0
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

Something is wrong.  Maybe I was wrong to expect the db change to appear
in the AVAIL column?  From the Bluestore description I understood that db
and slow should sum up, no?

Thanks for your help,


-- Yury

On Mon, Apr 08, 2019 at 10:17:24PM +0300, Igor Fedotov wrote:
> Hi Yuri,
>
> both issues from Round 2 relate to unsupported expansion for main device.
>
> In fact it doesn't work and silently bypasses the operation in your case.
>
> Please try with a different device...
>
>
> Also I've just submitted a PR for mimic to indicate the bypass, will
> backport to Luminous once mimic patch is approved.
>
> See https://github.com/ceph/ceph/pull/27447
>
>
> Thanks,
>
> Igor
>
> On 4/5/2019 4:07 PM, Yury Shevchuk wrote:
> > On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
> > > wrt Round 1 - an ability to expand block(main) device has been added to
> > > Nautilus,
> > >
> > > see: https://github.com/ceph/ceph/pull/25308

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-05 Thread Yury Shevchuk
On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
> wrt Round 1 - an ability to expand block(main) device has been added to
> Nautilus,
>
> see: https://github.com/ceph/ceph/pull/25308

Oh, that's good.  But separate wal and db volumes may still be useful for
studying the load on each volume (blktrace) or for moving db/wal to another
physical disk by means of LVM, transparently to Ceph.
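
For example, per-volume tracing could look like this (a sketch; the LV path
matches the layout used elsewhere in this thread):

    # Hedged sketch: trace I/O on the db volume independently of the data volume.
    blktrace -d /dev/vg0/osd2db -o - | blkparse -i -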

> wrt Round 2:
>
> - not setting 'size' label looks like a bug although I recall I fixed it...
> Will double check.
>
> - wrong stats output is probably related to the lack of monitor restart -
> could you please try that and report back if it helps? Or even restart the
> whole cluster.. (well I understand that's a bad approach for production but
> just to verify my hypothesis)

Mon restart didn't help:

node0:~# systemctl restart ceph-mon@0
node1:~# systemctl restart ceph-mon@1
node2:~# systemctl restart ceph-mon@2
node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.0  233GiB 9.44GiB 223GiB  4.06 0.12 128
 1   hdd 0.22739  1.0  233GiB 9.44GiB 223GiB  4.06 0.12 128
 3   hdd 0.227390  0B  0B 0B 00   0
 2   hdd 0.22739  1.0  800GiB  409GiB 391GiB 51.18 1.51 128
TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Restarting mgrs and then all ceph daemons on all three nodes didn't
help either:

node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.0  233GiB 9.43GiB 223GiB  4.05 0.12 128
 1   hdd 0.22739  1.0  233GiB 9.43GiB 223GiB  4.05 0.12 128
 3   hdd 0.227390  0B  0B 0B 00   0
 2   hdd 0.22739  1.0  800GiB  409GiB 391GiB 51.18 1.51 128
TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Maybe we should upgrade to v14.2.0 Nautilus instead of studying old
bugs... after all, this is a toy cluster for now.

Thank you for responding,


-- Yury

> On 4/5/2019 2:06 PM, Yury Shevchuk wrote:
> > Hello all!
> >
> > We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
> > bluestore osd per node.  We started with pretty small OSDs and would
> > like to be able to expand OSDs whenever needed.  We had two issues
> > with the expansion: one turned out to be user-serviceable while the other
> > probably needs a developer's look.  I will describe both shortly.
> >
> > Round 1
> > ~~~
> > Trying to expand osd.2 by 1TB:
> >
> ># lvextend -L+1T /dev/vg0/osd2
> >  Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 
> > extents) to 1.23 TiB (321762 extents).
> >  Logical volume vg0/osd2 successfully resized.
> >
> ># ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
> >inferring bluefs devices from bluestore path
> > slot 1 /var/lib/ceph/osd/ceph-2//block
> >1 : size 0x13a3880 : own 0x[1bf220~25430]
> >Expanding...
> >1 : can't be expanded. Bypassing...
> >#
> >
> > It didn't work.  The explanation can be found in
> > ceph/src/os/bluestore/BlueFS.cc at line 310:
> >
> >// returns true if specified device is under full bluefs control
> >// and hence can be expanded
> >bool BlueFS::is_device_expandable(unsigned id)
> >{
> >  if (id >= MAX_BDEV || bdev[id] == nullptr) {
> >return false;
> >  }
> >  switch(id) {
> >  case BDEV_WAL:
> >return true;
> >
> >  case BDEV_DB:
> >// true if DB volume is non-shared
> >return bdev[BDEV_SLOW] != nullptr;
> >  }
> >  return false;
> >}
> >
> > So we have to use separate block.db and block.wal for OSD to be
> > expandable.  Indeed, our OSDs were created without separate block.db
> > and block.wal, like this:
> >
> >ceph-volume lvm create --bluestore --data /dev/vg0/osd2
> >
> > Recreating osd.2 with separate block.db and block.wal:
> >
> ># ceph-volume lvm zap --destroy --osd-id 2
> ># lvcreate -L1G -n osd2wal vg0
> >  Logical volume "osd2wal" created.
> ># lvcreate -L40G -n osd2db vg0
> >  Logical volume "osd2db" created.
> ># lvcreate -L400G -n osd2 vg0
> >  Logical volume "osd2" created.
> ># ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
> > --block.db vg0/osd2db --block.wal vg0/osd2wal
> >
> > Resync takes some time, and then we have expandable osd.2.
> >
> >
> > Round 2
> > ~~~
> > 

[ceph-users] bluefs-bdev-expand experience

2019-04-05 Thread Yury Shevchuk
Hello all!

We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
bluestore osd per node.  We started with pretty small OSDs and would
like to be able to expand OSDs whenever needed.  We had two issues
with the expansion: one turned out to be user-serviceable while the other
probably needs a developer's look.  I will describe both shortly.

Round 1
~~~
Trying to expand osd.2 by 1TB:

  # lvextend -L+1T /dev/vg0/osd2
Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) to 
1.23 TiB (321762 extents).
Logical volume vg0/osd2 successfully resized.

  # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
  inferring bluefs devices from bluestore path
   slot 1 /var/lib/ceph/osd/ceph-2//block
  1 : size 0x13a3880 : own 0x[1bf220~25430]
  Expanding...
  1 : can't be expanded. Bypassing...
  #

It didn't work.  The explanation can be found in
ceph/src/os/bluestore/BlueFS.cc at line 310:

  // returns true if specified device is under full bluefs control
  // and hence can be expanded
  bool BlueFS::is_device_expandable(unsigned id)
  {
    if (id >= MAX_BDEV || bdev[id] == nullptr) {
      return false;
    }
    switch(id) {
    case BDEV_WAL:
      return true;

    case BDEV_DB:
      // true if DB volume is non-shared
      return bdev[BDEV_SLOW] != nullptr;
    }
    return false;
  }

So we have to use separate block.db and block.wal for OSD to be
expandable.  Indeed, our OSDs were created without separate block.db
and block.wal, like this:

  ceph-volume lvm create --bluestore --data /dev/vg0/osd2

Recreating osd.2 with separate block.db and block.wal:

  # ceph-volume lvm zap --destroy --osd-id 2
  # lvcreate -L1G -n osd2wal vg0
Logical volume "osd2wal" created.
  # lvcreate -L40G -n osd2db vg0
Logical volume "osd2db" created.
  # lvcreate -L400G -n osd2 vg0
Logical volume "osd2" created.
  # ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db 
vg0/osd2db --block.wal vg0/osd2wal

Resync takes some time, and then we have expandable osd.2.


Round 2
~~~
Trying to expand osd.2 from 400G to 700G:

  # lvextend -L+300G /dev/vg0/osd2
Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 
700.00 GiB (179200 extents).
Logical volume vg0/osd2 successfully resized.

  # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
  inferring bluefs devices from bluestore path
   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
   slot 1 /var/lib/ceph/osd/ceph-2//block.db
   slot 2 /var/lib/ceph/osd/ceph-2//block
  0 : size 0x4000 : own 0x[1000~3000]
  1 : size 0xa : own 0x[2000~9e000]
  2 : size 0xaf : own 0x[30~4]
  Expanding...
  #


This time expansion appears to work: 0xaf = 700GiB.

However, the size in the block device label has not changed:

  # ceph-bluestore-tool show-label --dev /dev/vg0/osd2
  {
  "/dev/vg0/osd2": {
  "osd_uuid": "a18ff7f7-0de1-4791-ba4b-f3b6d2221f44",
  "size": 429496729600,

429496729600 = 400GiB

Worse, ceph osd df shows the added space as used, not available:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.0  233GiB 8.06GiB 225GiB  3.46 0.13 128
 1   hdd 0.22739  1.0  233GiB 8.06GiB 225GiB  3.46 0.13 128
 2   hdd 0.22739  1.0  700GiB  301GiB 399GiB 43.02 1.58  64
TOTAL 1.14TiB  317GiB 849GiB 27.21
MIN/MAX VAR: 0.13/1.58  STDDEV: 21.43

If I expand osd.2 by another 100G, the added space also goes to the
"USE" column:

pier42:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.0  233GiB 8.05GiB 225GiB  3.46 0.10 128
 1   hdd 0.22739  1.0  233GiB 8.05GiB 225GiB  3.46 0.10 128
 3   hdd 0.227390  0B  0B 0B 00   0
 2   hdd 0.22739  1.0  800GiB  408GiB 392GiB 51.01 1.52 128
TOTAL 1.24TiB  424GiB 842GiB 33.51
MIN/MAX VAR: 0.10/1.52  STDDEV: 26.54

So OSD expansion almost works, but not quite.  If you had better luck
with bluefs-bdev-expand, could you please share your story?


-- Yury