[ceph-users] Multiple radosgw on the same server

2021-03-02 Thread Szabo, Istvan (Agoda)
Hi,

I've heard many times that it is possible to install multiple rados-gateways on the same 
server; you just need to run each one on a different port.
However, I've never managed to make it work.
Today I gave it another try, like this:


  1.  Created a new keyring: ceph auth get-or-create client.rgw.servername.rgw1 
mon 'allow rw' osd 'allow rwx'
  2.  Created the keyring file: 
/var/lib/ceph/radosgw/ceph-rgw.servername.rgw1/keyring
  3.  Added another entry in the Ceph Octopus configuration, with a different port:
[client.rgw.servername.rgw1]

host = servername

keyring = /var/lib/ceph/radosgw/ceph-rgw.servername.rgw1/keyring

log file = /var/log/ceph/ceph-rgw-servername.rgw1.log

rgw frontends = beast endpoint=10.104.198.101:8081

rgw thread pool size = 512

rgw_zone=zone

  4.  Copied the existing RGW systemd unit file to a new name in CentOS 8: cp -pr 
/etc/systemd/system/ceph-radosgw.target.wants/ceph-radosgw\@rgw.servername.rgw0.service 
/etc/systemd/system/ceph-radosgw.target.wants/ceph-radosgw\@rgw.servername.rgw1.service
  5.  Restarted ceph.target.
  6.  The result is the same number of rados-gateways as before.

So how should this actually be done?
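(For reference, a hedged sketch of the step that may be missing: on a non-containerized 
install the radosgw services are instances of the templated ceph-radosgw@.service unit, so 
rather than copying the unit file by hand, the second instance would normally just be enabled 
by name. Instance name and port below are the ones from the steps above.)

  # enable and start a second radosgw instance via the systemd template unit
  systemctl enable --now ceph-radosgw@rgw.servername.rgw1
  # check that both instances are running and listening on their ports
  systemctl status ceph-radosgw@rgw.servername.rgw0 ceph-radosgw@rgw.servername.rgw1
  ss -tlnp | grep 8081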

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] bug in latest cephadm bootstrap: got an unexpected keyword argument 'verbose_on_failure'

2021-03-02 Thread Philip Brown
Seems like someone is not testing cephadm on centos 7.9

Just tried installing cephadm from the repo, and ran
cephadm bootstrap --mon-ip=xxx

it blew up, with

ceph TypeError: __init__() got an unexpected keyword argument 
'verbose_on_failure'

just after the firewall section.

I happen to have a test cluster from a few months ago, and compared the code.

Someone added, in line 2348,

"out, err, ret = call([self.cmd, '--permanent', '--query-port', 
tcp_port], verbose_on_failure=False)"

this made the init fail, on my centos 7.9 system, freshly installed and updated 
today.

# cephadm version
ceph version 15.2.9 (357616cbf726abb779ca75a551e8d02568e15b17) octopus (stable)


Simply commenting out that line makes it complete the cluster init like I 
remember.
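(For anyone wanting the same workaround without hand-editing, a hedged sketch; the install 
path of the cephadm script is an assumption, so check where your package put it first.)

  # back up the installed cephadm script
  cp /usr/sbin/cephadm /usr/sbin/cephadm.bak
  # drop the keyword argument that the older firewalld wrapper doesn't accept
  sed -i 's/, verbose_on_failure=False//' /usr/sbin/cephadm
  # then re-run the bootstrap
  cephadm bootstrap --mon-ip=xxx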


--
Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
5 Peters Canyon Rd Suite 250 
Irvine CA 92606 
Office 714.918.1310| Fax 714.918.1325 
pbr...@medata.com| www.medata.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitor leveldb growing without bound v14.2.16

2021-03-02 Thread Peter Woodman
is the ceph insights plugin enabled? this caused huge huge bloat of the mon
stores for me. before i figured that out, i turned on leveldb compression
options on the mon store and got pretty significant savings, also.
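a hedged sketch of the commands involved, in case it saves someone a lookup; module and
option names are from memory, so verify against your release:

  # see whether the insights module is on, and turn it off if so
  ceph mgr module ls | grep -i insights
  ceph mgr module disable insights
  # offline compaction of a mon's leveldb store (run with that mon stopped)
  ceph-kvstore-tool leveldb /var/lib/ceph/mon/ceph-a/store.db compact
  # or ask a running mon to compact
  ceph tell mon.a compact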

On Tue, Mar 2, 2021 at 6:56 PM Lincoln Bryant  wrote:

> Hi list,
>
> We recently had a cluster outage over the weekend where several OSDs were
> inaccessible over night for several hours. When I found the cluster in the
> morning, the monitors' root disks (which contained both the monitor's
> leveldb and the Ceph logs) had completely filled.
>
> After restarting OSDs, cleaning out the monitors' logs, moving
> /var/lib/ceph to dedicated disks on the mons, and starting recovery (in
> which there was 1 unfound object that I marked lost, if that has any
> relevancy), the leveldb continued/continues to grow without bound. The
> cluster has all PGs in active+clean at this point, yet I'm accumulating
> what seems like approximately ~1GB/hr of new leveldb data.
>
> Two of the monitors (a, c) are in quorum, while the third (b) has been
> synchronizing for the last several hours, but doesn't seem to be able to
> catch up. Mon 'b' has been running for 4 hours now in the 'synchronizing'
> state. The mon's log has many messages about compacting and deleting files,
> yet we never exit the synchronization state.
>
> The ceph.log is also rapidly accumulating complaints that the mons are
> slow (not surprising, I suppose, since the levelDBs are ~100GB at this
> point).
>
> I've found that using the monstore tool to do compaction on mons 'a' and 'c'
> helps but is only a temporary fix. Soon the database inflates again and
> I'm back to where I started.
>
> Thoughts on how to proceed here? Some ideas I had:
>- Would it help to add some new monitors that use RocksDB?
>- Stop a monitor and dump the keys via monstoretool, just to get an
> idea of what's going on?
>- Increase mon_sync_max_payload_size to try to move data in larger
> chunks?
>- Drop down to a single monitor, and see if normal compaction triggers
> and stops growing unbounded?
>- Stop both 'a' and 'c', compact them, start them, and immediately
> start 'b' ?
>
> Appreciate any advice.
>
> Regards,
> Lincoln
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD id 241 != my id 248: conversion from "ceph-disk" to "ceph-volume simple" destroys OSDs

2021-03-02 Thread Chris Dunlop

Hi Frank,

On Tue, Mar 02, 2021 at 02:58:05PM +, Frank Schilder wrote:

Hi all,

this is a follow-up on "reboot breaks OSDs converted from ceph-disk to ceph-volume 
simple".

I converted a number of ceph-disk OSDs to ceph-volume using "simple scan" and 
"simple activate". Somewhere along the way, the OSDs meta-data gets rigged and the 
prominent symptom is that the symlink block is changes from a part-uuid target to an unstable 
device name target like:

before conversion:

block -> /dev/disk/by-partuuid/9123be91-7620-495a-a9b7-cc85b1de24b7

after conversion:

block -> /dev/sdj2

This is a huge problem, as the "after conversion" device names are unstable. I 
now have a cluster where I cannot reboot servers due to this problem. OSDs with randomly 
re-assigned devices will refuse to start with:

2021-03-02 15:56:21.709 7fb7c2549b80 -1 OSD id 241 != my id 248

Please help me with getting out of this mess.



These paths might be coming from /etc/ceph/osd/*.json files.

Have you tried editing the files to replace the /dev/sdXX path with the 
by-partuuid path?
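A hedged sketch of what that edit could look like (the OSD id, fsid and partuuid below are 
the ones from Frank's mail; adjust them to the OSD in question):

  # find the PARTUUID of the correct block partition
  blkid /dev/sdb2
  # then, in the matching file under /etc/ceph/osd/, point "block" at the stable path, e.g.
  #   "path": "/dev/disk/by-partuuid/9123be91-7620-495a-a9b7-cc85b1de24b7"
  vi /etc/ceph/osd/241-c35a7efb-8c1c-42a1-8027-cf422d7e7ecb.json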

Cheers,

Chris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Norman.Kern
James,

Can you tell me the hardware config of your bcache? I use a 400G SATA SSD as the 
cache device and a 10T HDD as the backing device. Is it hardware related?

On 2021/3/2 下午4:49, James Page wrote:
> Hi Norman
>
> On Mon, Mar 1, 2021 at 4:38 AM Norman.Kern  wrote:
>
>> Hi, guys
>>
>> I am testing ceph on bcache devices,  I found the performance is not good
>> as expected. Does anyone have any best practices for it?  Thanks.
>>
> I've used bcache quite a bit with Ceph with the following configuration
> options tweaked
>
> a) use writeback mode rather than writethrough (which is the default)
>
> This ensures that the cache device is actually used for write caching
>
> b) turn off the sequential cutoff
>
> sequential_cutoff = 0
>
> This means that sequential writes will also always go to the cache device
> rather than the backing device
>
> c) disable the congestion read and write thresholds
>
> congested_read_threshold_us = congested_write_threshold_us = 0
>
> The following repository:
>
> https://git.launchpad.net/charm-bcache-tuning/tree/src/files
>
> has a python script and systemd configuration todo b) and c) automatically
> on all bcache devices on boot; a) we let the provisioning system take care
> of.
>
> HTH
>
>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
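(For reference, a hedged sketch of applying b) and c) from James' reply by hand via sysfs; 
bcache0 and the cache-set UUID are placeholders for whatever your system exposes.)

  # a) writeback instead of the default writethrough
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # b) never bypass the cache for sequential writes
  echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
  # c) disable the congestion thresholds on the cache set
  echo 0 > /sys/fs/bcache/<cache-set-uuid>/congested_read_threshold_us
  echo 0 > /sys/fs/bcache/<cache-set-uuid>/congested_write_threshold_us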


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Stefan Kooman

On 3/2/21 7:17 PM, Stefan Kooman wrote:



What is output of "ceph daemon osd.0 config get ms_bind_ipv4" on the
osd0 node?


ceph daemon osd.0 config get ms_bind_ipv4
{
     "ms_bind_ipv4": "true"
}

And

ceph daemon mds.mds1 config get ms_bind_ipv4
{
     "ms_bind_ipv4": "true"
}

for that matter.


Now that I'm typing this: I recall there was an issue related to this on the 
mailing list, where you would need to disable IPv4 explicitly. I believe it 
was a PG peering issue.


OK, so this is probably it. And up to kernel 5.11 (I'll test 5.10 as 
well) this would not be a problem.


I should be able to reproduce this on a couple of test clusters that are 
configured similarly, and be able to test if setting ms_bind_ipv4=false 
on OSDs / MDSs fixes the issue.


I would like to have the heuristic to prefer IPv6 over IPv4 when 
filtering addresses (as that is default / common behavior for most if 
not all dual stack systems) ;-).


Hmm, we looked up the documentation, and it seems to be missing these 
options entirely in master / latest: 
https://docs.ceph.com/en/latest/rados/configuration/ms-ref/


The ms_bind_ipv6 option states the following:

Description

Enable to bind daemons to IPv6 addresses instead of IPv4. Not 
required if you specify a daemon or cluster


So, it says: _instead_. We have configured ms_bind_ipv6, so according to the 
documentation it should not bind to IPv4, but to IPv6 only. So 
either the documentation is not correct, or the current behaviour of the 
daemons is not correct.


Anyways, I'll come back if disabling binding to IPv4 works.
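(A hedged sketch of what disabling it would look like; whether a daemon restart is enough 
for the addresses to be re-registered is an assumption on my part.)

  # bind daemons to IPv6 only
  ceph config set global ms_bind_ipv4 false
  ceph config set global ms_bind_ipv6 true
  # restart OSDs / MDSs so they pick up the new binding
  systemctl restart ceph-osd.target ceph-mds.target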

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Monitor leveldb growing without bound v14.2.16

2021-03-02 Thread Lincoln Bryant
Hi list,

We recently had a cluster outage over the weekend where several OSDs were 
inaccessible over night for several hours. When I found the cluster in the 
morning, the monitors' root disks (which contained both the monitor's leveldb 
and the Ceph logs) had completely filled.

After restarting OSDs, cleaning out the monitors' logs, moving /var/lib/ceph to 
dedicated disks on the mons, and starting recovery (in which there was 1 
unfound object that I marked lost, if that has any relevancy), the leveldb 
continued/continues to grow without bound. The cluster has all PGs in 
active+clean at this point, yet I'm accumulating what seems like approximately 
~1GB/hr of new leveldb data.

Two of the monitors (a, c) are in quorum, while the third (b) has been 
synchronizing for the last several hours, but doesn't seem to be able to catch 
up. Mon 'b' has been running for 4 hours now in the 'synchronizing' state. The 
mon's log has many messages about compacting and deleting files, yet we never 
exit the synchronization state.

The ceph.log is also rapidly accumulating complaints that the mons are slow 
(not surprising, I suppose, since the levelDBs are ~100GB at this point).

I've found that using the monstore tool to do compaction on mons 'a' and 'c' helps 
but is only a temporary fix. Soon the database inflates again and I'm back to 
where I started.

Thoughts on how to proceed here? Some ideas I had:
   - Would it help to add some new monitors that use RocksDB?
   - Stop a monitor and dump the keys via monstoretool, just to get an idea of 
what's going on?
   - Increase mon_sync_max_payload_size to try to move data in larger chunks?
   - Drop down to a single monitor, and see if normal compaction triggers and 
stops growing unbounded?
   - Stop both 'a' and 'c', compact them, start them, and immediately start 'b' 
?
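(For reference, a hedged sketch of the third idea above; <bytes> is deliberately left as a 
placeholder since I don't know a good value yet.)

  # check the current value
  ceph config get mon mon_sync_max_payload_size
  # set it cluster-wide for the mons
  ceph config set mon mon_sync_max_payload_size <bytes>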

Appreciate any advice.

Regards,
Lincoln

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Need Clarification on Maintenance Shutdown Procedure

2021-03-02 Thread Joachim Kraftmayer

Hello Dave,

I recommend you read this docu:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/administration_guide/understanding-process-managemnet-for-ceph#powering-down-and-rebooting-a-red-hat-ceph-storage-cluster-management

Regards, Joachim

___

Clyso GmbH - Ceph Foundation Member

Am 02.03.2021 um 13:55 schrieb Dave Hall:

Dave,

Just to be certain of the terminology,

-
Step before Step 4:  Quiesce client systems using Ceph

Step 4:  Turn off everything that's not a MGR, MON, or OSD.

Step 5:  Turn off OSDs

Step 6:  Turn off MONs

Step 7: Turn off MGRs

If any of the above are running on the same nodes (i.e. mixed nodes),
use OS capabilities (systemd) to stop and disable so nothing auto-starts
when the hardware is powered back on.
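(A hedged sketch of the per-node commands that ordering implies; the target names assume a 
packaged, non-container install, so please correct me if I have them wrong.)

  # keep CRUSH from rebalancing while daemons are down
  ceph osd set noout
  # step 4: everything that's not a MGR, MON or OSD
  systemctl stop ceph-mds.target ceph-radosgw.target
  # steps 5-7: OSDs, then MONs, then MGRs
  systemctl stop ceph-osd.target
  systemctl stop ceph-mon.target
  systemctl stop ceph-mgr.target
  # disable so nothing auto-starts at power-on; re-enable and start in reverse order later
  systemctl disable ceph-osd.target ceph-mon.target ceph-mgr.target
  # once everything is back up and healthy
  ceph osd unset noout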


Regarding my cluster:  Currently 3 nodes with 10GB front and back networks,
8 x 12 TB HDDs per node with Samsung 1.6TB PCIe NVMe cards.  The NVMe was
provisioned to allow adding 4 more HDDs per node, but the RocksDBs are
proving to be a bit too small.

We will shortly increase to 6 OSD nodes plus 3 separate nodes for MGRs,
MONs, MDSs, RGWs, etc.  We will also add Enterprise M.2 drives to the
original nodes to allow us to increase the size of the caches.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Tue, Mar 2, 2021 at 4:06 AM David Caro  wrote:


On 03/01 21:41, Dave Hall wrote:

Hello,

I've had a look at the instructions for clean shutdown given at
https://ceph.io/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/,

but

I'm not clear about some things on the steps about shutting down the
various Ceph components.

For my current 3-node cluster I have MONs, MDSs, MGRs, and OSDs all

running

on the same nodes.  Also, this is a non-container installation.

Since I don't have separate dedicated nodes, as described in the

referenced

web page, I think  the instructions mean that I need to issue SystemD
commands to stop the corresponding services/targets on each node for the
Ceph components mentioned in each step.

Yep, the systemd units are usually named 'ceph-<daemon type>@<id>', for example
'ceph-osd@45' would be the systemd unit for osd.45.


Since we want to bring services up in the right order, I should also use
SystemD commands to disable these services/targets so they don't
automatically restart when I power the nodes back on.  After power-on, I
would then re-enable and manually start services/targets in the order
described.

Also yes, and if you use some configuration management or similar that
might
bring them up automatically you might want to disable it temporarily too.


One other specific question:  For step 4 it says to shut down my service
nodes.  Does this mean my MDSs?  (I'm not running any Object Gateways or
NFS, but I think these would go in this step as well?)

Yes, that is correct. Monitor would be the MONs, and admin the MGRs.


Please let me know if I've got this right.  The cluster contains 200TB

of a

researcher's data that has taken a year to collect, so caution is needed.

Can you share a bit more about your setup? Are you using replicas? How
many?
Erasure coding? (a ceph osd pool ls detail , ceph osd status or similar can
help too).


I would recommend trying to get the hand of the process in a test
environment
first.

Cheers!


Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Stefan Kooman




On 3/2/21 6:54 PM, Ilya Dryomov wrote:



--- snip ---
osd.0 up   in  weight 1 up_from 98071 up_thru 98719 down_at 98068
last_clean_interval [96047,98067)
[v2:[2001:7b8:80:1:0:1:2:1]:6848/505534,v1:[2001:7b8:80:1:0:1:2:1]:6854/505534,v2:0.0.0.0:6860/505534,v1:0.0.0.0:6866/505534]


Where did "v2:0.0.0.0:6860/505534,v1:0.0.0.0:6866/505534" come from?
This is what confuses the kernel client: it sees two addresses of
the same type and doesn't know which one to pick.  Instead of blindly
picking the first one (or some other dubious heuristic) it just denies
the osdmap.


[mds.mds1{0:229930080} state up:active seq 144042 addr
[v2:[2001:7b8:80:1:0:1:3:1]:6800/2234186180,v1:[2001:7b8:80:1:0:1:3:1]:6801/2234186180,v2:0.0.0.0:6802/2234186180,v1:0.0.0.0:6803/2234186180]]


Same for the mdsmap.

Were you using ipv6 with the kernel client before upgrading to 5.11?


Yes, exclusively. So dual stack was not something that was possible 
before nautilus. We always did set "ms_bind_ipv6=true".


But good find, I missed that "0.0.0.0" part. Yeah, that must be it.


What is output of "ceph daemon osd.0 config get ms_bind_ipv4" on the
osd0 node?


ceph daemon osd.0 config get ms_bind_ipv4
{
"ms_bind_ipv4": "true"
}

And

ceph daemon mds.mds1 config get ms_bind_ipv4
{
"ms_bind_ipv4": "true"
}

for that matter.


Now that I'm typing this: I recall there was an issue related to this on the 
mailing list, where you would need to disable IPv4 explicitly. I believe it 
was a PG peering issue.


OK, so this is probably it. And up to kernel 5.11 (I'll test 5.10 as 
well) this would not be a problem.


I should be able to reproduce this on a couple of test clusters that are 
configured similarly, and be able to test if setting ms_bind_ipv4=false 
on OSDs / MDSs fixes the issue.


I would like to have the heuristic to prefer IPv6 over IPv4 when 
filtering addresses (as that is default / common behavior for most if 
not all dual stack systems) ;-).


I'll do testing and let you know.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Stefan Kooman

On 3/2/21 6:00 PM, Jeff Layton wrote:


v2 support in the kernel is keyed on the ms_mode= mount option, so that
has to be passed in if you're connecting to a v2 port. Until the mount
helpers get support for that option you'll need to specify the address
and port manually if you want to use v2.


I've tried feeding it ms_mode=v2 but I get a "mount error 22 = Invalid
argument", the ms_mode=legacy does work, but fails with the same errors.



That needs different values. See:

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=00498b994113a871a556f7ff24a4cf8a00611700

You can try passing in a specific mon address and port, like:

 192.168.21.22:3300:/cephfs/dir/

...and then pass in ms_mode=crc or something similar.


That works, as in I don't get a mount error (ms_mode=prefer-crc) and 
added the port 3300 explicitly, but same error.
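(For completeness, a hedged sketch of the exact invocation being tested; the monitor 
address is the example one from Jeff's mail and the client name is our placeholder.)

  mount -t ceph 192.168.21.22:3300:/cephfs/dir /mnt/cephfs \
    -o name=client-id,secretfile=/etc/ceph/ceph.client.client-id.cephfs.key,ms_mode=prefer-crc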




That said, what you're doing should be working, so this sounds like a
regression. I presume you're able to mount with earlier kernels? What's
the latest kernel version that you have that works?


Previous one was 4.18 ... elrepo only has 5.4 / 5.11 available now 
AFAIK. I'll try to test some Ubuntu kernel PPAs, as I can choose what 
version to use.


I'll keep you posted.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metadata for LibRADOS

2021-03-02 Thread Cary FitzHugh
Phooey. :)

Do you know of any notification subsystems in libRADOS that might be
useful?

Will have to think on this...

Thanks
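(For reference, librados does expose a per-object watch/notify mechanism, and the rados CLI 
can drive it; a hedged sketch with placeholder pool/object names. It is per-object rather 
than per-pool, though, so by itself it wouldn't give a guaranteed stream of all updates to 
a pool.)

  # in one shell: watch an object for notifications
  rados -p mypool watch myobject
  # in another shell: notify all watchers of that object
  rados -p mypool notify myobject 'metadata updated'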

On Tue, Mar 2, 2021 at 4:05 PM Matt Benjamin  wrote:

> Right.  The elastic search integration--or something custom you could
> base on s3 bucket notifications--would both be working with events
> generated in RGW.
>
> Matt
>
> On Tue, Mar 2, 2021 at 3:55 PM Cary FitzHugh 
> wrote:
> >
> > Understood.
> >
> > With the RGW architecture comes more load balancing concerns, more
> moving parts, more tedious (to me) ACLs, less features (append and some
> other things not supported in S3).  Was hoping for a solution which didn't
> require us to be hamstrung and only read / write to a pool via the gateway.
> >
> > If the RGW Metadata search was able to "source" its data from the OSDs
> and sync that way, then I'd be up for setting up a skeleton
> implementation,  but it sounds like RGW Metadata is only going to record
> things which are flowing through the gateway.  (Is that correct?)
> >
> >
> >
> >
> > On Tue, Mar 2, 2021 at 3:46 PM Matt Benjamin 
> wrote:
> >>
> >> Hi Cary,
> >>
> >> As you've said, these are well-developed features of RGW, I think that
> >> would be the way to go, in the Ceph ecosystem.
> >>
> >> Matt
> >>
> >> On Tue, Mar 2, 2021 at 3:41 PM Cary FitzHugh 
> wrote:
> >> >
> >> > Hello -
> >> >
> >> > We're trying to use native libRADOS and the only challenge we're
> running
> >> > into is searching metadata.
> >> >
> >> > Using the rgw metadata sync seems to require all data to be pushed
> through
> >> > the rgw, which is not something we're interested in setting up at the
> >> > moment.
> >> >
> >> > Are there hooks or features of libRADOS which could be leveraged to
> enable
> >> > syncing of metadata to an external system (elastic-search / postgres
> / etc)?
> >> >
> >> > Is there a way to listen to a stream of updates to a pool in
> real-time,
> >> > with some guarantees I wouldn't miss things?
> >> >
> >> > Are there any features like this in libRADOS?
> >> >
> >> > Thank you
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> >
> >>
> >>
> >> --
> >>
> >> Matt Benjamin
> >> Red Hat, Inc.
> >> 315 West Huron Street, Suite 140A
> >> Ann Arbor, Michigan 48103
> >>
> >> http://www.redhat.com/en/technologies/storage
> >>
> >> tel.  734-821-5101
> >> fax.  734-769-8938
> >> cel.  734-216-5309
> >>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metadata for LibRADOS

2021-03-02 Thread Matt Benjamin
Right.  The elastic search integration--or something custom you could
base on s3 bucket notifications--would both be working with events
generated in RGW.

Matt

On Tue, Mar 2, 2021 at 3:55 PM Cary FitzHugh  wrote:
>
> Understood.
>
> With the RGW architecture comes more load balancing concerns, more moving 
> parts, more tedious (to me) ACLs, less features (append and some other things 
> not supported in S3).  Was hoping for a solution which didn't require us to 
> be hamstrung and only read / write to a pool via the gateway.
>
> If the RGW Metadata search was able to "source" its data from the OSDs and 
> sync that way, then I'd be up for setting up a skeleton implementation,  but 
> it sounds like RGW Metadata is only going to record things which are flowing 
> through the gateway.  (Is that correct?)
>
>
>
>
> On Tue, Mar 2, 2021 at 3:46 PM Matt Benjamin  wrote:
>>
>> Hi Cary,
>>
>> As you've said, these are well-developed features of RGW, I think that
>> would be the way to go, in the Ceph ecosystem.
>>
>> Matt
>>
>> On Tue, Mar 2, 2021 at 3:41 PM Cary FitzHugh  wrote:
>> >
>> > Hello -
>> >
>> > We're trying to use native libRADOS and the only challenge we're running
>> > into is searching metadata.
>> >
>> > Using the rgw metadata sync seems to require all data to be pushed through
>> > the rgw, which is not something we're interested in setting up at the
>> > moment.
>> >
>> > Are there hooks or features of libRADOS which could be leveraged to enable
>> > syncing of metadata to an external system (elastic-search / postgres / 
>> > etc)?
>> >
>> > Is there a way to listen to a stream of updates to a pool in real-time,
>> > with some guarantees I wouldn't miss things?
>> >
>> > Are there any features like this in libRADOS?
>> >
>> > Thank you
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>>
>>
>> --
>>
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-821-5101
>> fax.  734-769-8938
>> cel.  734-216-5309
>>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Stefan Kooman

On 3/2/21 5:42 PM, Ilya Dryomov wrote:

On Tue, Mar 2, 2021 at 9:26 AM Stefan Kooman  wrote:


Hi,

On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP
Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with
Ceph Octopus 15.2.9 packages installed. The MDS server is running
Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the
monitors is reachable from the client. At mount time we get the following:


Mar  2 09:01:14  kernel: Key type ceph registered
Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session established
Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22
Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 off 
6357 (27a57a75 of d3075952-e307797f)
Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy


The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2
changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still 
seems to use the v1 port though (although since 5.11 v2 should be 
supported).

Has anyone seen this before? Just guessing here, but could it that the
client tries to speak v2 protocol on v1 port?


Hi Stefan,

Those "another match of type 1" errors suggest that you have two
different v1 addresses for some of or all OSDs and MDSes in osdmap
and mdsmap respectively.

What is the output of "ceph osd dump" and "ceph fs dump"?


That's a lot of output, so I trimmed it:

--- snip ---
osd.0 up   in  weight 1 up_from 98071 up_thru 98719 down_at 98068 
last_clean_interval [96047,98067) 
[v2:[2001:7b8:80:1:0:1:2:1]:6848/505534,v1:[2001:7b8:80:1:0:1:2:1]:6854/505534,v2:0.0.0.0:6860/505534,v1:0.0.0.0:6866/505534] 
[v2:[2001:7b8:80:1:0:1:2:1]:6872/505534,v1:[2001:7b8:80:1:0:1:2:1]:6878/505534,v2:0.0.0.0:6886/505534,v1:0.0.0.0:6892/505534] 
exists,up 93e7d17f-2c7a-4acd-93c0-586dbb7cc6d7

-- snap ---

-- snip ---
[mds.mds1{0:229930080} state up:active seq 144042 addr 
[v2:[2001:7b8:80:1:0:1:3:1]:6800/2234186180,v1:[2001:7b8:80:1:0:1:3:1]:6801/2234186180,v2:0.0.0.0:6802/2234186180,v1:0.0.0.0:6803/2234186180]]



Standby daemons:

[mds.mds2{-1:229977514} state up:standby seq 2 addr 
[v2:[2001:7b8:80:3:0:2c:3:2]:6800/2983725953,v1:[2001:7b8:80:3:0:2c:3:2]:6801/2983725953,v2:0.0.0.0:6802/2983725953,v1:0.0.0.0:6803/2983725953]]

--- snap ---

We only have a public address network in use, and IPv6 only, so I'm not sure 
how we could have multiple v1 addresses for OSDs and MDSes.


Thanks,

Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metadata for LibRADOS

2021-03-02 Thread Cary FitzHugh
Understood.

With the RGW architecture comes more load balancing concerns, more moving
parts, more tedious (to me) ACLs, less features (append and some other
things not supported in S3).  Was hoping for a solution which didn't
require us to be hamstrung and only read / write to a pool via the
gateway.

If the RGW Metadata search was able to "source" its data from the OSDs and
sync that way, then I'd be up for setting up a skeleton implementation,
but it sounds like RGW Metadata is only going to record things which are
flowing through the gateway.  (Is that correct?)




On Tue, Mar 2, 2021 at 3:46 PM Matt Benjamin  wrote:

> Hi Cary,
>
> As you've said, these are well-developed features of RGW, I think that
> would be the way to go, in the Ceph ecosystem.
>
> Matt
>
> On Tue, Mar 2, 2021 at 3:41 PM Cary FitzHugh 
> wrote:
> >
> > Hello -
> >
> > We're trying to use native libRADOS and the only challenge we're running
> > into is searching metadata.
> >
> > Using the rgw metadata sync seems to require all data to be pushed
> through
> > the rgw, which is not something we're interested in setting up at the
> > moment.
> >
> > Are there hooks or features of libRADOS which could be leveraged to
> enable
> > syncing of metadata to an external system (elastic-search / postgres /
> etc)?
> >
> > Is there a way to listen to a stream of updates to a pool in real-time,
> > with some guarantees I wouldn't miss things?
> >
> > Are there any features like this in libRADOS?
> >
> > Thank you
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metadata for LibRADOS

2021-03-02 Thread Matt Benjamin
Hi Cary,

As you've said, these are well-developed features of RGW, I think that
would be the way to go, in the Ceph ecosystem.

Matt

On Tue, Mar 2, 2021 at 3:41 PM Cary FitzHugh  wrote:
>
> Hello -
>
> We're trying to use native libRADOS and the only challenge we're running
> into is searching metadata.
>
> Using the rgw metadata sync seems to require all data to be pushed through
> the rgw, which is not something we're interested in setting up at the
> moment.
>
> Are there hooks or features of libRADOS which could be leveraged to enable
> syncing of metadata to an external system (elastic-search / postgres / etc)?
>
> Is there a way to listen to a stream of updates to a pool in real-time,
> with some guarantees I wouldn't miss things?
>
> Are there any features like this in libRADOS?
>
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Stefan Kooman

On 3/2/21 5:16 PM, Jeff Layton wrote:

On Tue, 2021-03-02 at 09:25 +0100, Stefan Kooman wrote:

Hi,

On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP
Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with


I'm guessing this is a stable series kernel


It's a kernel from 'elrepo', so I'm not sure. I would guess so too.




Ceph Octopus 15.2.9 packages installed. The MDS server is running
Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the
monitors is reachable from the client. At mount time we get the following:


Mar  2 09:01:14  kernel: Key type ceph registered
Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session established
Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22


-22 == -EINVAL

Looks like a an osdmap parsing error?


Indeed, that is kinda weird isn't it.




Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 off 
6357 (27a57a75 of d3075952-e307797f)
Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy


The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2
changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still
seems to use the v1 port though (although since 5.11 v2 should be
supported).



The mount helper only recently got v2 support, and that hasn't trickled
out into the distros yet. See:

 https://github.com/ceph/ceph/pull/38788


Ah, good to know.




Has anyone seen this before? Just guessing here, but could it that the
client tries to speak v2 protocol on v1 port?



What mount options are you passing in? Are you using mon autodiscovery?


We provide the mon address explictly in the /etc/fstab, So something 
like this:


mon1,mon2,mon3,mon4,mon5:/cephfs/dir /client_mountpoint ceph 
name=client-id,secretfile=/etc/ceph/ceph.client.client-id.cephfs.key,noatime,_netdev 
0 2


We are not using any dns based discovery of monitors if that is what you 
mean.


Note: We tried with nautilus packages before (14.2.16) and got the same 
result.




v2 support in the kernel is keyed on the ms_mode= mount option, so that
has to be passed in if you're connecting to a v2 port. Until the mount
helpers get support for that option you'll need to specify the address
and port manually if you want to use v2.


I've tried feeding it ms_mode=v2 but I get a "mount error 22 = Invalid 
argument", the ms_mode=legacy does work, but fails with the same errors.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Metadata for LibRADOS

2021-03-02 Thread Cary FitzHugh
Hello -

We're trying to use native libRADOS and the only challenge we're running
into is searching metadata.

Using the rgw metadata sync seems to require all data to be pushed through
the rgw, which is not something we're interested in setting up at the
moment.

Are there hooks or features of libRADOS which could be leveraged to enable
syncing of metadata to an external system (elastic-search / postgres / etc)?

Is there a way to listen to a stream of updates to a pool in real-time,
with some guarantees I wouldn't miss things?

Are there any features like this in libRADOS?

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD id 241 != my id 248: conversion from "ceph-disk" to "ceph-volume simple" destroys OSDs

2021-03-02 Thread Frank Schilder
Hi all,

this is a follow-up on "reboot breaks OSDs converted from ceph-disk to 
ceph-volume simple".

I converted a number of ceph-disk OSDs to ceph-volume using "simple scan" and 
"simple activate". Somewhere along the way, the OSDs meta-data gets rigged and 
the prominent symptom is that the symlink block is changes from a part-uuid 
target to an unstable device name target like:

before conversion:

block -> /dev/disk/by-partuuid/9123be91-7620-495a-a9b7-cc85b1de24b7

after conversion:

block -> /dev/sdj2

This is a huge problem, as the "after conversion" device names are unstable. I 
now have a cluster where I cannot reboot servers due to this problem. OSDs with 
randomly re-assigned devices will refuse to start with:

2021-03-02 15:56:21.709 7fb7c2549b80 -1 OSD id 241 != my id 248

Please help me with getting out of this mess.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remapped PGs

2021-03-02 Thread David Orman
I wanted to revisit this - we're now on 15.2.9 and still have this one
cluster with 5 PGs "stuck" in pg_temp. Any idea how to clean this up,
or how it might have occurred? I'm fairly certain it showed up after
an autoscale up and autoscale down happened that overlapped each
other.
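(A hedged sketch of the sort of nudge that might clear these; whether "pg repeer" actually 
drops a stale pg_temp entry is an assumption on my part.)

  # list the lingering pg_temp entries
  ceph osd dump | grep pg_temp
  # ask one of the affected PGs to re-peer and see if its entry disappears
  ceph pg repeer 3.7af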

On Mon, Aug 10, 2020 at 10:28 AM David Orman  wrote:
>
> We've gotten a bit further, after evaluating how this remapped count was 
> determined (pg_temp), we've found the PGs counted as being remapped:
>
> root@ceph01:~# ceph osd dump |grep pg_temp
> pg_temp 3.7af [93,1,29]
> pg_temp 3.7bc [137,97,5]
> pg_temp 3.7d9 [72,120,18]
> pg_temp 3.7e8 [80,21,71]
> pg_temp 3.7fd [74,51,8]
>
> Looking at 3.7af:
> root@ceph01:~# ceph pg map 3.7af
> osdmap e15406 pg 3.7af (3.f) -> up [87,156,29] acting [87,156,29]
>
> I'm unclear why this is staying in pg_temp. Is there a way to clean this up? 
> I would have expected it to be cleaned up as per docs but I might be missing 
> something here.
>
> On Thu, Aug 6, 2020 at 2:40 PM David Orman  wrote:
>>
>> Still haven't figured this out. We went ahead and upgraded the entire 
>> cluster to Podman 2.0.4 and in the process did OS/Kernel upgrades and 
>> rebooted every node, one at a time. We've still got 5 PGs stuck in 
>> 'remapped' state, according to 'ceph -s' but 0 in the pg dump output in that 
>> state. Does anybody have any suggestions on what to do about this?
>>
>> On Wed, Aug 5, 2020 at 10:54 AM David Orman  wrote:
>>>
>>> Hi,
>>>
>>> We see that we have 5 'remapped' PGs, but are unclear why/what to do about 
>>> it. We shifted some target ratios for the autobalancer and it resulted in 
>>> this state. When adjusting ratio, we noticed two OSDs go down, but we just 
>>> restarted the container for those OSDs with podman, and they came back up. 
>>> Here's status output:
>>>
>>> ###
>>> root@ceph01:~# ceph status
>>> INFO:cephadm:Inferring fsid x
>>> INFO:cephadm:Inferring config x
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>>   cluster:
>>> id: 41bb9256-c3bf-11ea-85b9-9e07b0435492
>>> health: HEALTH_OK
>>>
>>>   services:
>>> mon: 5 daemons, quorum ceph01,ceph04,ceph02,ceph03,ceph05 (age 2w)
>>> mgr: ceph03.ytkuyr(active, since 2w), standbys: ceph01.aqkgbl, 
>>> ceph02.gcglcg, ceph04.smbdew, ceph05.yropto
>>> osd: 168 osds: 168 up (since 2d), 168 in (since 2d); 5 remapped pgs
>>>
>>>   data:
>>> pools:   3 pools, 1057 pgs
>>> objects: 18.00M objects, 69 TiB
>>> usage:   119 TiB used, 2.0 PiB / 2.1 PiB avail
>>> pgs: 1056 active+clean
>>>  1active+clean+scrubbing+deep
>>>
>>>   io:
>>> client:   859 KiB/s rd, 212 MiB/s wr, 644 op/s rd, 391 op/s wr
>>>
>>> root@ceph01:~#
>>>
>>> ###
>>>
>>> When I look at ceph pg dump, I don't see any marked as remapped:
>>>
>>> ###
>>> root@ceph01:~# ceph pg dump |grep remapped
>>> INFO:cephadm:Inferring fsid x
>>> INFO:cephadm:Inferring config x
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> dumped all
>>> root@ceph01:~#
>>> ###
>>>
>>> Any idea what might be going on/how to recover? All OSDs are up. Health is 
>>> 'OK'. This is Ceph 15.2.4 deployed using Cephadm in containers, on Podman 
>>> 2.0.3.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS is reporting damaged metadata damage- followup

2021-03-02 Thread ricardo.re.azevedo
Hi all,

 

Following up on a previous issue. 

 

My cephfs MDS is reporting damaged metadata following the addition (and
remapping) of 12 new OSDs.  
`ceph tell mds.database-0 damage ls` reports ~85 files damaged. All of type
"backtrace". 
` ceph tell mds.database-0 scrub start / recursive repair` seems to have no
effect on the damage. 
` ceph tell mds.database-0 scrub start / recursive repair force` also has no
effect.

I understand this seems to be an issue with mapping the file to a filesystem
path. Is there anything I can do to recover these files? Any manual methods?
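(One thing worth double-checking, hedged because the CLI syntax has shifted between 
releases: on recent versions the scrub options are passed as a single comma-separated 
list, so a repair scrub would look something like the following.)

  ceph tell mds.database-0 damage ls
  ceph tell mds.database-0 scrub start / recursive,repair
  # and, if that still reports the same damage, force a re-scrub
  ceph tell mds.database-0 scrub start / recursive,repair,force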

 

 


> ceph status reports:

  cluster:
    id:     692905c0-f271-4cd8-9e43-1c32ef8abd13
    health: HEALTH_ERR
            1 MDSs report damaged metadata
            300 pgs not deep-scrubbed in time
            300 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum database-0,file-server,webhost (age 37m)
    mgr: webhost(active, since 3d), standbys: file-server, database-0
    mds: cephfs:1 {0=database-0=up:active} 2 up:standby
    osd: 48 osds: 48 up (since 56m), 48 in (since 13d); 10 remapped pgs

  task status:
    scrub status:
        mds.database-0: idle

  data:
    pools:   7 pools, 633 pgs
    objects: 60.82M objects, 231 TiB
    usage:   336 TiB used, 246 TiB / 582 TiB avail
    pgs:     623 active+clean
             6   active+remapped+backfilling
             4   active+remapped+backfill_wait

 

Thanks for the help.

 

Best,

Ricardo

 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Ilya Dryomov
On Tue, Mar 2, 2021 at 6:02 PM Stefan Kooman  wrote:
>
> On 3/2/21 5:42 PM, Ilya Dryomov wrote:
> > On Tue, Mar 2, 2021 at 9:26 AM Stefan Kooman  wrote:
> >>
> >> Hi,
> >>
> >> On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP
> >> Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with
> >> Ceph Octopus 15.2.9 packages installed. The MDS server is running
> >> Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the
> >> monitors is reachable from the client. At mount time we get the following:
> >>
> >>> Mar  2 09:01:14  kernel: Key type ceph registered
> >>> Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
> >>> Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
> >>> Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
> >>> Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session 
> >>> established
> >>> Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> >>> Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
> >>> Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22
> >>> Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> >>> Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 
> >>> off 6357 (27a57a75 of d3075952-e307797f)
> >>> Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy
> >>
> >> The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2
> >> changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still
> >> seems to use the v1 port though (although since 5.11 v2 should be
> >> supported).
> >>
> >> Has anyone seen this before? Just guessing here, but could it that the
> >> client tries to speak v2 protocol on v1 port?
> >
> > Hi Stefan,
> >
> > Those "another match of type 1" errors suggest that you have two
> > different v1 addresses for some of or all OSDs and MDSes in osdmap
> > and mdsmap respectively.
> >
> > What is the output of "ceph osd dump" and "ceph fs dump"?
>
> That's a lot of output, so I trimmed it:
>
> --- snip ---
> osd.0 up   in  weight 1 up_from 98071 up_thru 98719 down_at 98068
> last_clean_interval [96047,98067)
> [v2:[2001:7b8:80:1:0:1:2:1]:6848/505534,v1:[2001:7b8:80:1:0:1:2:1]:6854/505534,v2:0.0.0.0:6860/505534,v1:0.0.0.0:6866/505534]

Where did "v2:0.0.0.0:6860/505534,v1:0.0.0.0:6866/505534" come from?
This is what confuses the kernel client: it sees two addresses of
the same type and doesn't know which one to pick.  Instead of blindly
picking the first one (or some other dubious heuristic) it just denies
the osdmap.

> [mds.mds1{0:229930080} state up:active seq 144042 addr
> [v2:[2001:7b8:80:1:0:1:3:1]:6800/2234186180,v1:[2001:7b8:80:1:0:1:3:1]:6801/2234186180,v2:0.0.0.0:6802/2234186180,v1:0.0.0.0:6803/2234186180]]

Same for the mdsmap.

Were you using ipv6 with the kernel client before upgrading to 5.11?

What is output of "ceph daemon osd.0 config get ms_bind_ipv4" on the
osd0 node?

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reboot breaks OSDs converted from ceph-disk to ceph-volume simple

2021-03-02 Thread Frank Schilder
For comparison, the output of device discovery from ceph-disk and ceph-volume. 
ceph-disk does it correctly, ceph-volume is screwed up:

[root@ceph-adm:ceph-18 ceph-241]# ceph-disk list /dev/sdb
/usr/lib/python2.7/site-packages/ceph_disk/main.py:5689: UserWarning:
***
This tool is now deprecated in favor of ceph-volume.
It is recommended to use ceph-volume for OSD deployments. For details see:

http://docs.ceph.com/docs/master/ceph-volume/#migrating

***

  warnings.warn(DEPRECATION_WARNING)
/dev/sdb :
 /dev/sdb1 ceph data, active, cluster ceph, osd.241, block /dev/sdb2
 /dev/sdb2 ceph block, for /dev/sdb1
/usr/lib/python2.7/site-packages/ceph_disk/main.py:5721: UserWarning:
***
This tool is now deprecated in favor of ceph-volume.
It is recommended to use ceph-volume for OSD deployments. For details see:

http://docs.ceph.com/docs/master/ceph-volume/#migrating

***

  warnings.warn(DEPRECATION_WARNING)


[root@ceph-adm:ceph-18 ceph-241]# ceph-volume simple scan --stdout /dev/sdb1
Running command: /usr/sbin/cryptsetup status /dev/sdb1
{
"active": "ok",
"block": {
"path": "/dev/sda2",
"uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
},
"block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
"bluefs": 1,
"ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"data": {
"path": "/dev/sdb1",
"uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
},
"fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
"keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"none": "",
"ready": "ready",
"require_osd_release": "",
"type": "bluestore",
"whoami": 241
}

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 02 March 2021 14:35:25
To: ceph-users@ceph.io
Subject: [ceph-users] reboot breaks OSDs converted from ceph-disk to 
ceph-volume simple

Dear all,

ceph version: mimic 13.2.10

I'm facing a serious bug with devices converted from "ceph-disk" to 
"ceph-volume simple". I "converted" all ceph-disk devices using "ceph-volume 
simple scan ..." And everything worked fine at the beginning. Today I needed to 
reboot an OSD host and since then most ceph-disk OSDs are screwed up. 
Apparently, "ceph-volume simple scan ..." creates symlinks to the block 
partition /dev/sd?2 using the "/dev/sd?2" name for the link target. These names 
are not stable and are expected to change after every reboot. Now I have a 
bunch of OSDs with new /dev/sd?2" names that won't boot any more, because this 
link points to the wrong block partition. Doing another "ceph-volume simple 
scan ..." doesn't help, it just "rediscovers" the wrong location. Here is what 
a broken OSD looks like (fresh  "ceph-volume simple scan --stdout ..." output):

{
"active": "ok",
"block": {
"path": "/dev/sda2",
"uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
},
"block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
"bluefs": 1,
"ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"data": {
"path": "/dev/sdb1",
"uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
},
"fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
"keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"none": "",
"ready": "ready",
"require_osd_release": "",
"type": "bluestore",
"whoami": 241
}

OSD 241's data partition looks like this (after mount /dev/sdb1 
/var/lib/ceph/osd/ceph-241):

[root@ceph-adm:ceph-18 ceph-241]# ls -l /var/lib/ceph/osd/ceph-241
total 56
-rw-r--r--. 1 root root 411 Oct 16  2019 activate.monmap
-rw-r--r--. 1 ceph ceph   3 Oct 16  2019 active
lrwxrwxrwx. 1 root root   9 Mar  2 14:19 block -> /dev/sda2
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 block_uuid
-rw-r--r--. 1 ceph disk   2 Oct 16  2019 bluefs
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 ceph_fsid
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 fsid
-rw---. 1 ceph ceph  58 Oct 16  2019 keyring
-rw-r--r--. 1 ceph disk   8 Oct 16  2019 kv_backend
-rw-r--r--. 1 ceph ceph  21 Oct 16  2019 magic
-rw-r--r--. 1 ceph disk   4 Oct 16  2019 mkfs_done
-rw-r--r--. 1 ceph ceph   0 Nov 23 14:58 none
-rw-r--r--. 1 ceph disk   6 Oct 16  2019 ready
-rw-r--r--. 1 ceph disk   2 Jan 31  2020 require_osd_release
-rw-r--r--. 1 ceph ceph  10 Oct 16  2019 type
-rw-r--r--. 1 ceph ceph   4 Oct 16  2019 whoami

The symlink "block -> /dev/sda2" goes to the wrong disk. How can I fix 

[ceph-users] reboot breaks OSDs converted from ceph-disk to ceph-volume simple

2021-03-02 Thread Frank Schilder
Dear all,

ceph version: mimic 13.2.10

I'm facing a serious bug with devices converted from "ceph-disk" to 
"ceph-volume simple". I "converted" all ceph-disk devices using "ceph-volume 
simple scan ..." And everything worked fine at the beginning. Today I needed to 
reboot an OSD host and since then most ceph-disk OSDs are screwed up. 
Apparently, "ceph-volume simple scan ..." creates symlinks to the block 
partition /dev/sd?2 using the "/dev/sd?2" name for the link target. These names 
are not stable and are expected to change after every reboot. Now I have a 
bunch of OSDs with new /dev/sd?2" names that won't boot any more, because this 
link points to the wrong block partition. Doing another "ceph-volume simple 
scan ..." doesn't help, it just "rediscovers" the wrong location. Here is what 
a broken OSD looks like (fresh  "ceph-volume simple scan --stdout ..." output):

{
"active": "ok", 
"block": {
"path": "/dev/sda2", 
"uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
}, 
"block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879", 
"bluefs": 1, 
"ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9", 
"cluster_name": "ceph", 
"data": {
"path": "/dev/sdb1", 
"uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
}, 
"fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb", 
"keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==", 
"kv_backend": "rocksdb", 
"magic": "ceph osd volume v026", 
"mkfs_done": "yes", 
"none": "", 
"ready": "ready", 
"require_osd_release": "", 
"type": "bluestore", 
"whoami": 241
}

OSD 241's data partition looks like this (after mount /dev/sdb1 
/var/lib/ceph/osd/ceph-241):

[root@ceph-adm:ceph-18 ceph-241]# ls -l /var/lib/ceph/osd/ceph-241
total 56
-rw-r--r--. 1 root root 411 Oct 16  2019 activate.monmap
-rw-r--r--. 1 ceph ceph   3 Oct 16  2019 active
lrwxrwxrwx. 1 root root   9 Mar  2 14:19 block -> /dev/sda2
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 block_uuid
-rw-r--r--. 1 ceph disk   2 Oct 16  2019 bluefs
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 ceph_fsid
-rw-r--r--. 1 ceph ceph  37 Oct 16  2019 fsid
-rw---. 1 ceph ceph  58 Oct 16  2019 keyring
-rw-r--r--. 1 ceph disk   8 Oct 16  2019 kv_backend
-rw-r--r--. 1 ceph ceph  21 Oct 16  2019 magic
-rw-r--r--. 1 ceph disk   4 Oct 16  2019 mkfs_done
-rw-r--r--. 1 ceph ceph   0 Nov 23 14:58 none
-rw-r--r--. 1 ceph disk   6 Oct 16  2019 ready
-rw-r--r--. 1 ceph disk   2 Jan 31  2020 require_osd_release
-rw-r--r--. 1 ceph ceph  10 Oct 16  2019 type
-rw-r--r--. 1 ceph ceph   4 Oct 16  2019 whoami

The symlink "block -> /dev/sda2" goes to the wrong disk. How can I fix that in 
a stable way? Also, why are not stable "/dev/disk/by-uuid/..." link targets 
created instead? Can I change that myself?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.1.16

2021-03-02 Thread Jeff Layton
On Tue, 2021-03-02 at 17:44 +0100, Stefan Kooman wrote:
> On 3/2/21 5:16 PM, Jeff Layton wrote:
> > On Tue, 2021-03-02 at 09:25 +0100, Stefan Kooman wrote:
> > > Hi,
> > > 
> > > On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP
> > > Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with
> > 
> > I'm guessing this is a stable series kernel
> 
> It's a kernel from 'elrepo', so I'm not sure. I would guess so too.
> 
> > 
> > > Ceph Octopus 15.2.9 packages installed. The MDS server is running
> > > Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the
> > > monitors is reachable from the client. At mount time we get the following:
> > > 
> > > > Mar  2 09:01:14  kernel: Key type ceph registered
> > > > Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
> > > > Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
> > > > Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
> > > > Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session 
> > > > established
> > > > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > > > Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
> > > > Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22
> > 
> > -22 == -EINVAL
> > 
> > Looks like a an osdmap parsing error?
> 
> Indeed, that is kinda weird isn't it.
> 
> > 
> > > > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > > > Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 
> > > > off 6357 (27a57a75 of d3075952-e307797f)
> > > > Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is 
> > > > laggy
> > > 
> > > The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2
> > > changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still
> > > seems to use the v1 port though (although since 5.11 v2 should be
> > > supported).
> > > 
> > 
> > The mount helper only recently got v2 support, and that hasn't trickled
> > out into the distros yet. See:
> > 
> >  https://github.com/ceph/ceph/pull/38788
> 
> Ah, good to know.
> 
> > 
> > > Has anyone seen this before? Just guessing here, but could it that the
> > > client tries to speak v2 protocol on v1 port?
> > > 
> > 
> > What mount options are you passing in? Are you using mon autodiscovery?
> 
> We provide the mon address explictly in the /etc/fstab, So something 
> like this:
> 
> mon1,mon2,mon3,mon4,mon5:/cephfs/dir /client_mountpoint ceph 
> name=client-id,secretfile=/etc/ceph/ceph.client.client-id.cephfs.key,noatime,_netdev
>  
> 0 2
> 
> We are not using any dns based discovery of monitors if that is what you 
> mean.
> 
> Note: We tried with nautilus packages before (14.2.16) and got the same 
> result.
> 
> > 
> > v2 support in the kernel is keyed on the ms_mode= mount option, so that
> > has to be passed in if you're connecting to a v2 port. Until the mount
> > helpers get support for that option you'll need to specify the address
> > and port manually if you want to use v2.
> 
> I've tried feeding it ms_mode=v2 but I get a "mount error 22 = Invalid 
> argument", the ms_mode=legacy does work, but fails with the same errors.
> 

That needs different values. See:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=00498b994113a871a556f7ff24a4cf8a00611700

You can try passing in a specific mon address and port, like:

192.168.21.22:3300:/cephfs/dir/

...and then pass in ms_mode=crc or something similar.

That said, what you're doing should be working, so this sounds like a
regression. I presume you're able to mount with earlier kernels? What's
the latest kernel version that you have that works?
-- 
Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Need Clarification on Maintenance Shutdown Procedure

2021-03-02 Thread Dave Hall
Dave,

Just to be certain of the terminology,

-
Step before Step 4:  Quiesce client systems using Ceph

Step 4:  Turn off everything that's not a MGR, MON, or OSD.

Step 5:  Turn off OSDs

Step 6:  Turn off MONs

Step 7: Turn off MGRs

If any of the above are running on the same nodes (i.e. mixed nodes),
use OS capabilities (systemd) to stop and disable so nothing auto-starts
when the hardware is powered back on.
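
A minimal sketch of the per-node systemd commands, assuming the standard
non-container target names (adjust to the units that actually exist on each
node):

# step 4: stop everything that is not a MGR, MON or OSD (here only MDS)
systemctl stop ceph-mds.target
# steps 5-7: OSDs, then MONs, then MGRs
systemctl stop ceph-osd.target
systemctl stop ceph-mon.target
systemctl stop ceph-mgr.target
# keep them from auto-starting when the hardware is powered back on
systemctl disable ceph-mds.target ceph-osd.target ceph-mon.target ceph-mgr.target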


Regarding my cluster:  Currently 3 nodes with 10Gb front and back networks,
8 x 12 TB HDDs per node with Samsung 1.6TB PCIe NVMe cards.  The NVMe was
provisioned to allow adding 4 more HDDs per node, but the RocksDBs are
proving to be a bit too small.

We will shortly increase to 6 OSD nodes plus 3 separate nodes for MGRs,
MONs, MDSs, RGWs, etc.  We will also add Enterprise M.2 drives to the
original nodes to allow us to increase the size of the caches.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Tue, Mar 2, 2021 at 4:06 AM David Caro  wrote:

> On 03/01 21:41, Dave Hall wrote:
> > Hello,
> >
> > I've had a look at the instructions for clean shutdown given at
> > https://ceph.io/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/,
> but
> > I'm not clear about some things on the steps about shutting down the
> > various Ceph components.
> >
> > For my current 3-node cluster I have MONs, MDSs, MGRs, and OSDs all
> running
> > on the same nodes.  Also, this is a non-container installation.
> >
> > Since I don't have separate dedicated nodes, as described in the
> referenced
> > web page, I think  the instructions mean that I need to issue SystemD
> > commands to stop the corresponding services/targets on each node for the
> > Ceph components mentioned in each step.
>
> Yep, the systemd units are usually named 'ceph-<daemon type>@<id>', for example
> 'ceph-osd@45' would be the systemd unit for osd.45.
>
> >
> > Since we want to bring services up in the right order, I should also use
> > SystemD commands to disable these services/targets so they don't
> > automatically restart when I power the nodes back on.  After power-on, I
> > would then re-enable and manually start services/targets in the order
> > described.
>
> Also yes, and if you use some configuration management or similar that
> might
> bring them up automatically you might want to disable it temporarily too.
>
> >
> > One other specific question:  For step 4 it says to shut down my service
> > nodes.  Does this mean my MDSs?  (I'm not running any Object Gateways or
> > NFS, but I think these would go in this step as well?)
>
> Yes, that is correct. Monitor would be the MONs, and admin the MGRs.
>
> >
> > Please let me know if I've got this right.  The cluster contains 200TB
> of a
> > researcher's data that has taken a year to collect, so caution is needed.
>
> Can you share a bit more about your setup? Are you using replicas? How
> many?
> Erasure coding? (a ceph osd pool ls detail , ceph osd status or similar can
> help too).
>
>
> I would recommend trying to get the hang of the process in a test
> environment
> first.
>
> Cheers!
>
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdh...@binghamton.edu
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation 
> PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.2.16

2021-03-02 Thread Ilya Dryomov
On Tue, Mar 2, 2021 at 9:26 AM Stefan Kooman  wrote:
>
> Hi,
>
> On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP
> Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with
> Ceph Octopus 15.2.9 packages installed. The MDS server is running
> Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the
> monitors is reachable from the client. At mount time we get the following:
>
> > Mar  2 09:01:14  kernel: Key type ceph registered
> > Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
> > Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
> > Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
> > Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session 
> > established
> > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
> > Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22
> > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 off 
> > 6357 (27a57a75 of d3075952-e307797f)
> > Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy
>
> The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2
> changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still
> seems to be using the v1 port though (although since 5.11 v2 should be
> supported).
>
> Has anyone seen this before? Just guessing here, but could it be that the
> client tries to speak the v2 protocol on the v1 port?

Hi Stefan,

Those "another match of type 1" errors suggest that you have two
different v1 addresses for some or all of the OSDs and MDSes in the osdmap
and mdsmap, respectively.

What is the output of "ceph osd dump" and "ceph fs dump"?

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Octopus auto-scale causing HEALTH_WARN re object numbers

2021-03-02 Thread Matthew Vernon

Hi,

I've upgraded our test cluster to Octopus, and enabled the auto-scaler. 
It's nearly finished:


PG autoscaler decreasing pool 11 PGs from 1024 to 32 (4d)
  [==..] (remaining: 3h)

But I notice it looks to be making pool 11 smaller when HEALTH_WARN 
thinks it should be larger:


root@sto-t1-1:~# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average; 9 pgs 
not deep-scrubbed in time
[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than 
average
pool default.rgw.buckets.data objects per pg (313153) is more than 
23.4063 times cluster average (13379)


...which seems like the wrong thing for the auto-scaler to be doing. Is 
this a known problem?


Regards,

Matthew

More details:

ceph df:
root@sto-t1-1:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    993 TiB  782 TiB  210 TiB  211 TiB   21.22
TOTAL  993 TiB  782 TiB  210 TiB  211 TiB   21.22

--- POOLS ---
POOL                        ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
.rgw.root                    2   69 KiB        4  1.4 MiB      0    220 TiB
default.rgw.control          3  1.1 MiB        8  3.3 MiB      0    220 TiB
default.rgw.data.root        4  115 KiB       14  3.6 MiB      0    220 TiB
default.rgw.gc               5  5.3 MiB       32   23 MiB      0    220 TiB
default.rgw.log              6   31 MiB      184   96 MiB      0    220 TiB
default.rgw.users.uid        7  249 KiB        8  1.8 MiB      0    220 TiB
default.rgw.buckets.data    11   23 GiB   10.02M  2.0 TiB   0.30    220 TiB
rgwtls                      13   54 KiB        3  843 KiB      0    220 TiB
pilot-metrics               14  285 MiB    2.60M  476 GiB   0.07    220 TiB
pilot-images                15   40 GiB    4.97k  122 GiB   0.02    220 TiB
pilot-volumes               16  192 GiB   48.90k  577 GiB   0.09    220 TiB
pilot-vms                   17  125 GiB   33.79k  376 GiB   0.06    220 TiB
default.rgw.users.keys      18  111 KiB        5  1.5 MiB      0    220 TiB
default.rgw.buckets.index   19  4.0 GiB      246   12 GiB      0    220 TiB
rbd                         20   39 TiB   10.09M  116 TiB  14.88    220 TiB
default.rgw.buckets.non-ec  21  344 KiB        1  1.0 MiB      0    220 TiB
rgw-ec                      22  7.0 TiB    1.93M   11 TiB   1.57    441 TiB
rbd-ec                      23   45 TiB   11.73M   67 TiB   9.22    441 TiB
default.rgw.users.email     24   23 MiB        1   69 MiB      0    220 TiB
pilot-backups               25   73 MiB        3  219 MiB      0    220 TiB
device_health_metrics       26   51 MiB      186  153 MiB      0    220 TiB

root@sto-t1-1:~# ceph osd pool autoscale-status
POOL                        SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
.rgw.root                   70843                3.0   992.7T        0.                                     1.0   32                  on
default.rgw.control         1116k                3.0   992.7T        0.                                     1.0   32                  on
default.rgw.data.root       115.1k               3.0   992.7T        0.                                     1.0   32                  on
default.rgw.gc              5379k                3.0   992.7T        0.                                     1.0   32                  on
default.rgw.log             32036k               3.0   992.7T        0.                                     1.0   32                  on
default.rgw.users.uid       248.7k               3.0   992.7T        0.                                     1.0   32                  on
default.rgw.buckets.data    23894M               3.0   992.7T        0.0001                                 1.0   32                  on
rgwtls                      55760                3.0   992.7T        0.                                     1.0   32                  on
pilot-metrics               285.3M               3.0   992.7T        0.                                     1.0   32                  on
pilot-images                41471M               3.0   992.7T        0.0001                                 1.0   32                  on
pilot-volumes               192.3G               3.0   992.7T        0.0006                                 1.0   32                  on
pilot-vms                   124.6G               3.0   992.7T        0.0004                                 1.0   32                  on
default.rgw.users.keys      111.1k               3.0   992.7T        0.                                     1.0   32                  on
default.rgw.buckets.index   4090M                3.0   992.7T        0.                                     1.0   32                  on
rbd                         39430G               3.0   992.7T        0.1164                                 1.0   1024                on
default.rgw.buckets.non-ec  344.3k               3.0   992.7T        0.

[ceph-users] Re: cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.2.16

2021-03-02 Thread Jeff Layton
On Tue, 2021-03-02 at 09:25 +0100, Stefan Kooman wrote:
> Hi,
> 
> On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP 
> Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with 

I'm guessing this is a stable series kernel

> Ceph Octopus 15.2.9 packages installed. The MDS server is running 
> Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the 
> monitors is reachable from the client. At mount time we get the following:
> 
> > Mar  2 09:01:14  kernel: Key type ceph registered
> > Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
> > Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
> > Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
> > Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session 
> > established
> > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
> > Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22

-22 == -EINVAL

Looks like an osdmap parsing error?

> > Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
> > Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 off 
> > 6357 (27a57a75 of d3075952-e307797f)
> > Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy
> 
> The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2 
> changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still 
> seems to be using the v1 port though (although since 5.11 v2 should be 
> supported).
> 

The mount helper only recently got v2 support, and that hasn't trickled
out into the distros yet. See:

https://github.com/ceph/ceph/pull/38788

> Has anyone seen this before? Just guessing here, but could it be that the 
> client tries to speak the v2 protocol on the v1 port?
> 

What mount options are you passing in? Are you using mon autodiscovery?

v2 support in the kernel is keyed on the ms_mode= mount option, so that
has to be passed in if you're connecting to a v2 port. Until the mount
helpers get support for that option you'll need to specify the address
and port manually if you want to use v2.

-- 
Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Object Gateway setup/tutorial

2021-03-02 Thread Rok Jaklič
Hi,

installation of the cluster/OSDs went "by the book" (https://docs.ceph.com/), but
now I want to set up the Ceph Object Gateway, and the documentation on
https://docs.ceph.com/en/latest/radosgw/ seems to lack information about
what to restart, and where, for example when setting [client.rgw.gateway-node1]
in /etc/ceph/ceph.conf. Also, where should we set this? In the cephadm shell or
on the host ...?

Is there some tutorial on how to set up the gateway from the beginning?

Kind regards,
Rok
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs: unable to mount share with 5.11 mainline, ceph 15.2.9, MDS 14.2.16

2021-03-02 Thread Stefan Kooman

Hi,

On a CentOS 7 VM with mainline kernel (5.11.2-1.el7.elrepo.x86_64 #1 SMP 
Fri Feb 26 11:54:18 EST 2021 x86_64 x86_64 x86_64 GNU/Linux) and with 
Ceph Octopus 15.2.9 packages installed. The MDS server is running 
Nautilus 14.2.16. Messenger v2 has been enabled. Port 3300 of the 
monitors is reachable from the client. At mount time we get the following:



Mar  2 09:01:14  kernel: Key type ceph registered
Mar  2 09:01:14  kernel: libceph: loaded (mon/osd proto 15/24)
Mar  2 09:01:14  kernel: FS-Cache: Netfs 'ceph' registered for caching
Mar  2 09:01:14  kernel: ceph: loaded (mds proto 32)
Mar  2 09:01:14  kernel: libceph: mon4 (1)[mond addr]:6789 session established
Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: ceph: corrupt mdsmap
Mar  2 09:01:14  kernel: ceph: error decoding mdsmap -22
Mar  2 09:01:14  kernel: libceph: another match of type 1 in addrvec
Mar  2 09:01:14  kernel: libceph: corrupt full osdmap (-22) epoch 98764 off 
6357 (27a57a75 of d3075952-e307797f)
Mar  2 09:02:15  kernel: ceph: No mds server is up or the cluster is laggy


The /etc/ceph/ceph.conf has been adjusted to reflect the messenger v2 
changes. ms_bind_ipv6=true, ms_bind_ipv4=false. The kernel client still 
seems to be using the v1 port though (although since 5.11 v2 should be 
supported).
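
For reference, the relevant part of the client's ceph.conf looks roughly like
this (which section the options live in, [global] here, is an assumption):

[global]
ms_bind_ipv6 = true
ms_bind_ipv4 = false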


Has anyone seen this before? Just guessing here, but could it be that the 
client tries to speak the v2 protocol on the v1 port?


Thanks,

Stefan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Matthias Ferdinand
On Tue, Mar 02, 2021 at 05:47:29PM +0800, Norman.Kern wrote:
> Matthias, 
> 
> I agree with you about tuning. I asked this question because my OSDs have 
> problems when cache_available_percent drops below 30: the SSDs become almost 
> useless and all I/Os bypass to the HDDs with large latency.


Hi,

I used to tune writeback_percent as far down as 1. I guess rapid
writeback helped keep complexity (CPU, additional I/O) of handling dirty
blocks low. Hoarding dirty data for a better chance to eventually turn
it into sequential I/O is an important gain on MD-RAID5/6, but not so
much on a single disk.
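
For what it's worth, that is just a sysfs knob on the backing device (bcache0
here is only a placeholder for whatever device name you have):

echo 1 > /sys/block/bcache0/bcache/writeback_percent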

Perhaps at cache_available_percent < 30 bcache needs to do some garbage
collection. This would at least give some CPU spike, and probably some
additional I/O spike for the on-disk data structure.

This is where you really want decent DC-grade caching devices that can
keep up with this sort of write-amplification spike. Consumer-grade
devices won't be able to, and will even add their own very noticeable
internal garbage collection on top.

Bypassing the caching SSD device on (non-sequential) I/O is usually a
symptom of bcache detecting a saturated caching device, i.e. the SSDs
are probably not DC-grade. At this point you get all the latency of the
backing HDD, plus some more from metadata handling. You might even tune
bcache to never bypass, but at this point, it would only further
add to the latency.


Regards
Matthias
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Andreas John
Hello,

we clearly understood that. But in Ceph we have the concept of the "OSD
journal on a different, very fast disk".

I just asked what, in theory, the advantage of caching on bcache/NVMe
should be vs. a journal on NVMe. I would not expect any performance
advantage for bcache (if the journal is reasonably sized).

I might be totally wrong, though. If you do it just because you don't
want to re-create (or modify) the OSDs, it's not worth the effort IMHO.


rgds,

derjohn


On 02.03.21 10:48, Norman.Kern wrote:
> On 2021/3/2 上午5:09, Andreas John wrote:
>> Hallo,
>>
>> do you expect that to be better (faster), than having the OSD's Journal
>> on a different disk (ssd, nvme) ?
> No, I created the OSD storage devices using bcache devices.
>>
>> rgds,
>>
>> derjohn
>>
>>
>> On 01.03.21 05:37, Norman.Kern wrote:
>>> Hi, guys
>>>
>>> I am testing ceph on bcache devices,  I found the performance is not good 
>>> as expected. Does anyone have any best practices for it?  Thanks.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Andreas John
net-lab GmbH  |  Frankfurter Str. 99  |  63067 Offenbach
Geschaeftsfuehrer: Andreas John | AG Offenbach, HRB40832
Tel: +49 69 8570033-1 | Fax: -2 | http://www.net-lab.net

Facebook: https://www.facebook.com/netlabdotnet
Twitter: https://twitter.com/netlabdotnet

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Norman.Kern

On 2021/3/2 下午4:49, James Page wrote:
> Hi Norman
>
> On Mon, Mar 1, 2021 at 4:38 AM Norman.Kern  wrote:
>
>> Hi, guys
>>
>> I am testing ceph on bcache devices,  I found the performance is not good
>> as expected. Does anyone have any best practices for it?  Thanks.
>>
> I've used bcache quite a bit with Ceph with the following configuration
> options tweaked
>
> a) use writeback mode rather than writethrough (which is the default)
>
> This ensures that the cache device is actually used for write caching
>
> b) turn off the sequential cutoff
>
> sequential_cutoff = 0
>
> This means that sequential writes will also always go to the cache device
> rather than the backing device
>
> c) disable the congestion read and write thresholds
>
> congested_read_threshold_us = congested_write_threshold_us = 0
>
> The following repository:
>
> https://git.launchpad.net/charm-bcache-tuning/tree/src/files
>
> has a python script and systemd configuration to do b) and c) automatically
> on all bcache devices on boot; a) we let the provisioning system take care
> of.
>
> HTH

I have set the variables described above. Didn't you hit latency problems 
when cache usage increased past 30%?

My cache status looks like this:

root@WXS0089:~# cat /sys/block/sda/bcache/priority_stats
Unused: 4%
Clean:  28%
Dirty:  70%
Metadata:   0%
Average:    551
Sectors per Q:  29197312
Quantiles:  [27 135 167 199 230 262 294 326 358 390 422 454 486 517 549 581 
613 645 677 709 741 773 804 836 844 847 851 855 860 868 881]


>
>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Norman.Kern

On 2021/3/2 上午5:09, Andreas John wrote:
> Hallo,
>
> do you expect that to be better (faster), than having the OSD's Journal
> on a different disk (ssd, nvme) ?
No, I created the OSD storage devices using bcache devices.
>
>
> rgds,
>
> derjohn
>
>
> On 01.03.21 05:37, Norman.Kern wrote:
>> Hi, guys
>>
>> I am testing ceph on bcache devices,  I found the performance is not good as 
>> expected. Does anyone have any best practices for it?  Thanks.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread Norman.Kern

On 2021/3/1 下午6:32, Matthias Ferdinand wrote:
> On Mon, Mar 01, 2021 at 12:37:38PM +0800, Norman.Kern wrote:
>> Hi, guys
>>
>> I am testing ceph on bcache devices,  I found the performance is not
>> good as expected. Does anyone have any best practices for it?  Thanks.
> Hi,
>
> sorry to say, but since use cases and workloads differ so much, there is
> no easy list of best practises.
>
> Number one reason for low bcache performance is consumer-grade caching
> devices, since bcache does a lot of write amplification and not even
> "PRO"
> consumer devices will give you decent and consistent performance. You
> might even end up with worse performance than on direct HDD under load.
>
> With decent caching device, there still are quite a few tuning knobs in
> bcache, but it all depends on your workload.
>
> You also have to consider the added complexity of a bcache setup for
> maintenance operations. Moving an OSD between hosts becomes a complex
> operation (wait for bcache draining, detach bcache, move HDD, create new
> bcache caching device, attach bcache).

Matthias, 

I agree with you about tuning. I asked this question because my OSDs have 
problems when cache_available_percent drops below 30: the SSDs become almost 
useless and all I/Os bypass to the HDDs with large latency.

So I think maybe I have the wrong configs for bcache.

>
> Regards
> Matthias
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Need Clarification on Maintenance Shutdown Procedure

2021-03-02 Thread David Caro
On 03/01 21:41, Dave Hall wrote:
> Hello,
> 
> I've had a look at the instructions for clean shutdown given at
> https://ceph.io/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/, but
> I'm not clear about some things on the steps about shutting down the
> various Ceph components.
> 
> For my current 3-node cluster I have MONs, MDSs, MGRs, and OSDs all running
> on the same nodes.  Also, this is a non-container installation.
> 
> Since I don't have separate dedicated nodes, as described in the referenced
> web page, I think  the instructions mean that I need to issue SystemD
> commands to stop the corresponding services/targets on each node for the
> Ceph components mentioned in each step.

Yep, the systemd units are usually named 'ceph-<daemon type>@<id>', for example
'ceph-osd@45' would be the systemd unit for osd.45.
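
So, as a rough example (osd.45 is only an illustration), stopping one and
keeping it from auto-starting on reboot would be:

systemctl stop ceph-osd@45
systemctl disable ceph-osd@45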

> 
> Since we want to bring services up in the right order, I should also use
> SystemD commands to disable these services/targets so they don't
> automatically restart when I power the nodes back on.  After power-on, I
> would then re-enable and manually start services/targets in the order
> described.

Also yes, and if you use some configuration management or similar that might
bring them up automatically you might want to disable it temporarily too.

> 
> One other specific question:  For step 4 it says to shut down my service
> nodes.  Does this mean my MDSs?  (I'm not running any Object Gateways or
> NFS, but I think these would go in this step as well?)

Yes, that is correct. Monitor would be the MONs, and admin the MGRs.

> 
> Please let me know if I've got this right.  The cluster contains 200TB of a
> researcher's data that has taken a year to collect, so caution is needed.

Can you share a bit more about your setup? Are you using replicas? How many?
Erasure coding? (a ceph osd pool ls detail , ceph osd status or similar can
help too).


I would recommend trying to get the hang of the process in a test environment
first.

Cheers!

> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."


signature.asc
Description: PGP signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-02 Thread James Page
Hi Norman

On Mon, Mar 1, 2021 at 4:38 AM Norman.Kern  wrote:

> Hi, guys
>
> I am testing ceph on bcache devices,  I found the performance is not good
> as expected. Does anyone have any best practices for it?  Thanks.
>

I've used bcache quite a bit with Ceph with the following configuration
options tweaked

a) use writeback mode rather than writethrough (which is the default)

This ensures that the cache device is actually used for write caching

b) turn off the sequential cutoff

sequential_cutoff = 0

This means that sequential writes will also always go to the cache device
rather than the backing device

c) disable the congestion read and write thresholds

congested_read_threshold_us = congested_write_threshold_us = 0

The following repository:

https://git.launchpad.net/charm-bcache-tuning/tree/src/files

has a python script and systemd configuration to do b) and c) automatically
on all bcache devices on boot; a) we let the provisioning system take care
of.
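
For reference, the sysfs knobs those options map to look roughly like this
(the bcache0 device name and the cache-set UUID are placeholders):

echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_read_threshold_us
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_write_threshold_us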

HTH


> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io