[ceph-users] First 6 nodes cluster with Octopus

2021-03-30 Thread mabi
Hello,

I am planning to set up a small Ceph cluster for testing purposes with 6 Ubuntu 
nodes and have a few questions, mostly regarding planning of the infrastructure.

1) The OS requirements in the documentation mention Ubuntu 18.04 LTS. Is it OK 
to use Ubuntu 20.04 instead, or should I stick with 18.04?

2) The documentation recommends cephadm for new deployments, so I will use 
that, but I read that with cephadm everything runs in containers. Is this the 
new way to go, or is Ceph in containers still somewhat experimental?

3) As I will be needing CephFS, I will also need MDS servers, so with a total of 
6 nodes I am planning the following layout:

Node 1: MGR+MON+MDS
Node 2: MGR+MON+MDS
Node 3: MGR+MON+MDS
Node 4: OSD
Node 5: OSD
Node 6: OSD

Does this make sense? I am mostly interested in stability and HA with this 
setup.
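
In case it helps to make the question more concrete, this is roughly how I 
imagine expressing that layout with cephadm after bootstrap (untested sketch; 
the hostnames node1-node6, the "mon" label and the "cephfs" filesystem name are 
just placeholders):

ceph orch host label add node1 mon
ceph orch host label add node2 mon
ceph orch host label add node3 mon
# place MON/MGR/MDS on the labelled nodes
ceph orch apply mon --placement="label:mon"
ceph orch apply mgr --placement="label:mon"
ceph orch apply mds cephfs --placement="3 label:mon"
# let cephadm create OSDs from the free disks on node4-node6
ceph orch apply osd --all-available-devices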

4) Are there any special disk requirements on the MGR+MON+MDS nodes, or can I 
just use the OS disks on these nodes? As far as I understand, the MDS will 
create a metadata pool on the OSDs.

Thanks for the hints.

Best,
Mabi



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: First 6 nodes cluster with Octopus

2021-03-30 Thread DHilsbos
Mabi;

We're running Nautilus, and I am not wholly convinced of the "everything in 
containers" view of the world, so take this with a small grain  of salt...

1) We don't run Ubuntu, sorry.  I suspect the documentation highlights 18.04 
because it was the current LTS release when that page was written.  Personally, 
if I preferred 20.04 over 18.04, I would attempt to build a cluster on 20.04 
and see how it goes.  You might also look at this: 
https://www.server-world.info/en/note?os=Ubuntu_20.04&p=ceph15&f=1

2) Containers are the preferred way of doing things in Octopus, so yes it's 
considered stable.

3) Our first evaluation cluster was 3 Intel Atom C3000 nodes, with each node 
running all the daemons (MON, MGR, MDS, 2 x OSD).  Worked fine, and allowed me 
to demonstrate the concepts in a size I could carry around.

4) Yes, and no...  When the cluster is happy, everything is generally happy.  
In certain warning and error situations, MONs can chew through disk space 
fairly quickly. Beyond that, I'm not familiar with the disk usage of the 
individual daemons.

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com

-Original Message-
From: mabi [mailto:m...@protonmail.ch] 
Sent: Tuesday, March 30, 2021 12:03 PM
To: ceph-users@ceph.io
Subject: [ceph-users] First 6 nodes cluster with Octopus

Hello,

I am planning to setup a small Ceph cluster for testing purpose with 6 Ubuntu 
nodes and have a few questions mostly regarding planning of the infra.

1) Based on the documentation the OS requirements mentions Ubuntu 18.04 LTS, is 
it ok to use Ubuntu 20.04 instead or should I stick with 18.04?

2) The documentation recommends using Cephadm for new deployments, so I will 
use that but I read that with Cephadm everything is running in containers, so 
is this the new way to go? Or is Ceph in containers kind of still experimental?

3) As I will be needing cephfs I will also need MDS servers so with a total of 
6 nodes I am planning the following layout:

Node 1: MGR+MON+MDS
Node 2: MGR+MON+MDS
Node 3: MGR+MON+MDS
Node 4: OSD
Node 5: OSD
Node 6: OSD

Does this make sense? I am mostly interested in stability and HA with this 
setup.

4) Is there any special kind of demand in terms of disks on the MGR+MON+MDS 
nodes? Or can I use have my OS disks on these nodes? As far as I understand the 
MDS will create a metadata pool on the OSDs.

Thanks for the hints.

Best,
Mabi



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Reed Dier
I've not undertaken such a large data movement myself.

The pg-upmap script may be of use here, but let's assume it's not.

If I were doing this, I would first take several backups of the current CRUSH map.
I would set the norebalance and norecover flags.
Then I would verify that all of the backfill settings are as aggressive as you 
expect them to be.
Then I would make the CRUSH changes, which will put the affected PGs into 
backfill_wait.
After verifying everything is as you expect, unset the flags and let Ceph do 
its thing. A rough sketch of those commands follows below.

And of course tweak the backfill/recovery settings as needed to speed up or 
lighten the load.
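
Something along these lines, as an untested sketch (the pod bucket names and 
the backfill values are placeholders):

# back up the current CRUSH map
ceph osd getcrushmap -o crushmap.backup.bin
# pause data movement
ceph osd set norebalance
ceph osd set norecover
# check/tune how aggressive backfill is allowed to be (example values)
ceph tell 'osd.*' injectargs '--osd_max_backfills=1 --osd_recovery_max_active=1'
# make the CRUSH changes, e.g. create pods and move racks under them
ceph osd crush add-bucket pod1 pod
ceph osd crush move pod1 root=default
ceph osd crush move rack2 pod=pod1
# ...repeat for the other pods/racks, adjust the crush rules, then verify
ceph osd crush tree
# finally let the data move
ceph osd unset norebalance
ceph osd unset norecover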

Hope that's helpful.

Reed

> On Mar 30, 2021, at 8:00 AM, Thomas Hukkelberg  
> wrote:
> 
> Hi all!
> 
> We run a 1.5PB cluster with 12 hosts, 192 OSDs (mix of NVMe and HDD) and need 
> to improve our failure domain by altering the crush rules and moving rack to 
> pods, which would imply a lot of data movement.
> 
> I wonder what would the preferred order of operations be when doing such 
> changes to the crush map and pools? Will there be minimal data movement by 
> moving all racks to pods at once and change pool repl rules or is the best 
> approach to first move racks one by one to pods and then change pool 
> replication rules from rack to pods? Anyhow I guess it's good practice to set 
> 'norebalance' before moving hosts and unset to start the actual moving?
> 
> Right now we have the following setup:
> 
> root -> rack2 -> ups1 + node51 + node57 + switch21
> root -> rack3 -> ups2 + node52 + node58 + switch22
> root -> rack4 -> ups3 + node53 + node59 + switch23
> root -> rack5 -> ups4 + node54 + node60 -- switch 21 ^^
> root -> rack6 -> ups5 + node55 + node61 -- switch 22 ^^
> root -> rack7 -> ups6 + node56 + node62 -- switch 23 ^^
> 
> Note that racks 5-7 are connected to same ToR switches as racks 2-4. Cluster 
> and frontend network are in different VXLANs connected with dual 40GbE. 
> Failure domain for 3x replicated pools are currently by rack, and after 
> adding hosts 57-62 we realized that if one of the switches reboots or fails, 
> replicated PGs located only on those 4 hosts will be unavailable and force 
> pools offline. I guess the best way would instead like to organize the racks 
> in pods like this:
> 
> root -> pod1 -> rack2 -> ups1 + node51 + node57
> root -> pod1 -> rack5 -> ups4 + node54 + node60 -> switch21
> root -> pod2 -> rack3 -> ups2 + node52 + node58
> root -> pod2 -> rack6 -> ups5 + node55 + node61 -> switch 22
> root -> pod3 -> rack4 -> ups3 + node53 + node59
> root -> pod3 -> rack7 -> ups6 + node56 + node62 -> switch 23
> 
> The reason for this arrangement is that we in the future plan to organize the 
> pods in different buildings. We're running nautilus 14.2.16 and are about to 
> upgrade to Octopus. Should we upgrade to Octopus before crush changes? 
> 
> Any thoughts or insight on how to achieve this with minimal data movement and 
> risk of cluster downtime would be welcome!
> 
> 
> --thomas
> 
> --
> Thomas Hukkelberg
> tho...@hovedkvarteret.no
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v14.2.19 Nautilus released

2021-03-30 Thread David Caro
Thanks for the quick release! \o/

On Tue, 30 Mar 2021, 22:30 David Galloway,  wrote:

> This is the 19th update to the Ceph Nautilus release series. This is a
> hotfix release to prevent daemons from binding to loopback network
> interfaces. All nautilus users are advised to upgrade to this release.
>
> Notable Changes
> ---
>
> * This release fixes a regression introduced in v14.2.18 whereby in
> certain environments, OSDs will bind to 127.0.0.1.  See
> https://tracker.ceph.com/issues/49938.
>
> Getting Ceph
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-14.2.19.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: bb796b9b5bab9463106022eef406373182465d11
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rados gateway static website

2021-03-30 Thread Marcel Kuiper

Casey,

Many thanks. That did the trick.

Regards

Marcel

Casey Bodley schreef op 2021-03-30 16:48:

this error 2039 is ERR_NO_SUCH_WEBSITE_CONFIGURATION. if you want to
access a bucket via rgw_dns_s3website_name, you have to set a website
configuration on the bucket - see
https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketWebsite.html

On Tue, Mar 30, 2021 at 10:05 AM Marcel Kuiper  wrote:



despite the examples that can be found on the internet I have troubles
setting up a static website that serves from a S3 bucket If anyone 
could

point me in the right direction that would be much appreciated

Marcel

I created an index.html in the bucket sky

gm-rc3-jumphost01@ceph/s3cmd (master)$ ./s3cmd info 
s3://sky/index.html

s3://sky/index.html (object):
File size: 42046
Last mod:  Tue, 30 Mar 2021 13:28:02 GMT
MIME type: text/html
Storage:   STANDARD
MD5 sum:   93acaccebb23a18da33ec4294d99ea1a
SSE:   none
Policy:none
CORS:  none
ACL:   *anon*: READ
ACL:   Generic Sky Account: FULL_CONTROL

And curl returns

gm-rc3-jumphost01@tmp/skills$ curl
https://sky.static.gm.core.local/index.html

  404 Not Found
  
   404 Not Found
   
Code: NoSuchWebsiteConfiguration
BucketName: sky
RequestId: 
tx000ba-00606327b8-cca124-rc3-gm

HostId: cca124-rc3-gm-rc3

COnfig of de rados instance

[client.radosgw.rc3-gm]
debug_rgw = 20
ms_debug = 1
rgw_zonegroup = rc3
rgw_zone = rc3-gm
rgw_enable_static_website = true
rgw_enable_apis = s3website
rgw expose bucket = true
rgw_dns_name = gm-rc3-radosgw.gm.core.local
rgw_dns_s3website_name = static.gm.core.local
rgw_resolve_cname = true
host = gm-rc3-s3web01
keyring = /etc/ceph/ceph.client.radosgw.rc3-gm.keyring
log_file = /var/log/ceph/radosgw.log
user = ceph
rgw_frontends = civetweb port=443s
ssl_certificate=/etc/ceph/ssl/key_cert_ca.pem

DNS (from pdnsutil list-zone)
*.static.gm.core.local  3600IN  CNAME   
gm-rc3-s3web01.gm.core.local


The logs shows

2021-03-30 15:32:53.725 7ff760fcd700  2
RGWDataChangesLog::ChangesRenewThread: start
2021-03-30 15:32:58.409 7ff746798700 20 HTTP_ACCEPT=*/*
2021-03-30 15:32:58.409 7ff746798700 20
HTTP_HOST=sky.static.gm.core.local
2021-03-30 15:32:58.409 7ff746798700 20 HTTP_USER_AGENT=curl/7.58.0
2021-03-30 15:32:58.409 7ff746798700 20 HTTP_VERSION=1.1
2021-03-30 15:32:58.409 7ff746798700 20 REMOTE_ADDR=10.128.160.47
2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_METHOD=GET
2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_URI=/index.html
2021-03-30 15:32:58.409 7ff746798700 20 SCRIPT_URI=/index.html
2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT=443
2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT_SECURE=443
2021-03-30 15:32:58.409 7ff746798700  1 == starting new request
req=0x7ff746791740 =
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s initializing 
for

trans_id = tx000c4-006063288a-cca124-rc3-gm
2021-03-30 15:32:58.409 7ff746798700 10 rgw api priority: s3=-1
s3website=1
2021-03-30 15:32:58.409 7ff746798700 10 host=sky.static.gm.core.local
2021-03-30 15:32:58.409 7ff746798700 20 subdomain=sky
domain=static.gm.core.local in_hosted_domain=1
in_hosted_domain_s3website=1
2021-03-30 15:32:58.409 7ff746798700 20 final domain/bucket
subdomain=sky domain=static.gm.core.local in_hosted_domain=1
in_hosted_domain_s3website=1 s->info.domain=static.gm.core.local
s->info.request_uri=/sky/index.html
2021-03-30 15:32:58.409 7ff746798700 20 get_handler
handler=29RGWHandler_REST_Obj_S3Website
2021-03-30 15:32:58.409 7ff746798700 10
handler=29RGWHandler_REST_Obj_S3Website
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s getting op 0
2021-03-30 15:32:58.409 7ff746798700 10
op=28RGWGetObj_ObjStore_S3Website
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj
verifying requester
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
rgw::auth::StrategyRegistry::s3_main_strategy_t: trying
rgw::auth::s3::AWSAuthStrategy
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
rgw::auth::s3::AWSAuthStrategy: trying 
rgw::auth::s3::S3AnonymousEngine

2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
rgw::auth::s3::S3AnonymousEngine granted access
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
rgw::auth::s3::AWSAuthStrategy granted access
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj
normalizing buckets and tenants
2021-03-30 15:32:58.409 7ff746798700 10 s->object=index.html
s->bucket=sky
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj init
permissions
2021-03-30 15:32:58.409 7ff746798700 15 decode_policy Read
AccessControlPolicyhttp://s3.amazonaws.com/doc/2006-03-01/";>skyGeneric
Sky Accounthttp://www.w3.org/2001/XMLSchema-instance";
xsi:type="CanonicalUser">skyGeneric Sky
AccountFULL_CONTROL
2021-03-30 15:32:58.409 7ff746798700 20 get_system_obj_state:
rctx=0x7ff74678f310 obj=rc3-gm.rgw.meta:users.uid:anonymou

[ceph-users] v14.2.19 Nautilus released

2021-03-30 Thread David Galloway
This is the 19th update to the Ceph Nautilus release series. This is a
hotfix release to prevent daemons from binding to loopback network
interfaces. All nautilus users are advised to upgrade to this release.

Notable Changes
---

* This release fixes a regression introduced in v14.2.18 whereby in
certain environments, OSDs will bind to 127.0.0.1.  See
https://tracker.ceph.com/issues/49938.

Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.19.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: bb796b9b5bab9463106022eef406373182465d11
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-fuse false passed X_OK check

2021-03-30 Thread Patrick Donnelly
It's a bug: https://tracker.ceph.com/issues/50060

On Wed, Dec 23, 2020 at 5:53 PM Alex Taylor  wrote:
>
> Hi Patrick,
>
> Any updates? Looking forward to your reply :D
>
>
> On Thu, Dec 17, 2020 at 11:39 AM Patrick Donnelly  wrote:
> >
> > On Wed, Dec 16, 2020 at 5:46 PM Alex Taylor  wrote:
> > >
> > > Hi Cephers,
> > >
> > > I'm using VSCode remote development with a docker server. It worked OK
> > > but fails to start the debugger after /root mounted by ceph-fuse. The
> > > log shows that the binary passes access X_OK check but cannot be
> > > actually executed. see:
> > >
> > > ```
> > > strace_log: 
> > > access("/root/.vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7",
> > > X_OK) = 0
> > >
> > > root@develop:~# ls -alh
> > > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > > -rw-r--r-- 1 root root 978 Dec 10 13:06
> > > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > > ```
> > >
> > > I also test the access syscall on ext4, xfs and even cephfs kernel
> > > client, all of them return -EACCES, which is expected (the extension
> > > will then explicitly call chmod +x).
> > >
> > > After some digging in the code, I found it is probably caused by
> > > https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L5549-L5550.
> > > So here come two questions:
> > > 1. Is this a bug or is there any concern I missed?
> >
> > I tried reproducing it with the master branch and could not. It might
> > be due to an older fuse/ceph. I suggest you upgrade!
> >
>
> I tried the master(332a188d9b3c4eb5c5ad2720b7299913c5a772ee) as well
> and the issue still exists. My test program is:
> ```
> #include <stdio.h>
> #include <unistd.h>
>
> int main() {
> int r;
> const char path[] = "test";
>
> r = access(path, F_OK);
> printf("file exists: %d\n", r);
>
> r = access(path, X_OK);
> printf("file executable: %d\n", r);
>
> return 0;
> }
> ```
> And the test result:
> ```
> # local filesystem: ext4
> root@f626800a6e85:~# ls -l test
> -rw-r--r-- 1 root root 6 Dec 19 06:13 test
> root@f626800a6e85:~# ./a.out
> file exists: 0
> file executable: -1
>
> root@f626800a6e85:~# findmnt -t fuse.ceph-fuse
> TARGETSOURCEFSTYPE OPTIONS
> /root/mnt ceph-fuse fuse.ceph-fuse
> rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other
> root@f626800a6e85:~# cd mnt
>
> # ceph-fuse
> root@f626800a6e85:~/mnt# ls -l test
> -rw-r--r-- 1 root root 6 Dec 19 06:10 test
> root@f626800a6e85:~/mnt# ./a.out
> file exists: 0
> file executable: 0
> root@f626800a6e85:~/mnt# ./test
> bash: ./test: Permission denied
> ```
> Again, ceph-fuse says file `test` is executable but in fact it can't
> be executed.
> The kernel version I'm testing on is:
> ```
> root@f626800a6e85:~/mnt# uname -ar
> Linux f626800a6e85 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05)
> x86_64 GNU/Linux
> ```
>
> Please try the program above and make sure you're running it as root
> user, thank you. And if the reproduction still fails, please let me
> know the kernel version.
>
> > > 2. It works again with fuse_default_permissions=true, any drawbacks if
> > > this option is set?
> >
> > Correctness (ironically, for you) and performance.
> >
> > --
> > Patrick Donnelly, Ph.D.
> > He / Him / His
> > Principal Software Engineer
> > Red Hat Sunnyvale, CA
> > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >
>


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Stefan Kooman

On 3/30/21 3:02 PM, Boris Behrens wrote:

I reweighted the OSD to .0 and then forced the backfilling.

How long does it take for ceph to free up space? I looks like it was 
doing this, but it could also be the "backup cleanup job" that removed 
images from the buckets.


I don't have any numbers on that. But from experience it seems it starts 
right away.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Rados gateway static website

2021-03-30 Thread Marcel Kuiper



Despite the examples that can be found on the internet, I have trouble 
setting up a static website that serves from an S3 bucket. If anyone could 
point me in the right direction, that would be much appreciated.


Marcel

I created an index.html in the bucket sky

gm-rc3-jumphost01@ceph/s3cmd (master)$ ./s3cmd info s3://sky/index.html
s3://sky/index.html (object):
   File size: 42046
   Last mod:  Tue, 30 Mar 2021 13:28:02 GMT
   MIME type: text/html
   Storage:   STANDARD
   MD5 sum:   93acaccebb23a18da33ec4294d99ea1a
   SSE:   none
   Policy:none
   CORS:  none
   ACL:   *anon*: READ
   ACL:   Generic Sky Account: FULL_CONTROL

And curl returns

gm-rc3-jumphost01@tmp/skills$ curl  
https://sky.static.gm.core.local/index.html


 404 Not Found
 
  404 Not Found
  
   Code: NoSuchWebsiteConfiguration
   BucketName: sky
   RequestId: tx000ba-00606327b8-cca124-rc3-gm
   HostId: cca124-rc3-gm-rc3

Config of the rados gateway instance

[client.radosgw.rc3-gm]
debug_rgw = 20
ms_debug = 1
rgw_zonegroup = rc3
rgw_zone = rc3-gm
rgw_enable_static_website = true
rgw_enable_apis = s3website
rgw expose bucket = true
rgw_dns_name = gm-rc3-radosgw.gm.core.local
rgw_dns_s3website_name = static.gm.core.local
rgw_resolve_cname = true
host = gm-rc3-s3web01
keyring = /etc/ceph/ceph.client.radosgw.rc3-gm.keyring
log_file = /var/log/ceph/radosgw.log
user = ceph
rgw_frontends = civetweb port=443s 
ssl_certificate=/etc/ceph/ssl/key_cert_ca.pem


DNS (from pdnsutil list-zone)
*.static.gm.core.local  3600IN  CNAME   gm-rc3-s3web01.gm.core.local

The logs shows

2021-03-30 15:32:53.725 7ff760fcd700  2 
RGWDataChangesLog::ChangesRenewThread: start

2021-03-30 15:32:58.409 7ff746798700 20 HTTP_ACCEPT=*/*
2021-03-30 15:32:58.409 7ff746798700 20 
HTTP_HOST=sky.static.gm.core.local

2021-03-30 15:32:58.409 7ff746798700 20 HTTP_USER_AGENT=curl/7.58.0
2021-03-30 15:32:58.409 7ff746798700 20 HTTP_VERSION=1.1
2021-03-30 15:32:58.409 7ff746798700 20 REMOTE_ADDR=10.128.160.47
2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_METHOD=GET
2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_URI=/index.html
2021-03-30 15:32:58.409 7ff746798700 20 SCRIPT_URI=/index.html
2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT=443
2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT_SECURE=443
2021-03-30 15:32:58.409 7ff746798700  1 == starting new request 
req=0x7ff746791740 =
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s initializing for 
trans_id = tx000c4-006063288a-cca124-rc3-gm
2021-03-30 15:32:58.409 7ff746798700 10 rgw api priority: s3=-1 
s3website=1

2021-03-30 15:32:58.409 7ff746798700 10 host=sky.static.gm.core.local
2021-03-30 15:32:58.409 7ff746798700 20 subdomain=sky 
domain=static.gm.core.local in_hosted_domain=1 
in_hosted_domain_s3website=1
2021-03-30 15:32:58.409 7ff746798700 20 final domain/bucket 
subdomain=sky domain=static.gm.core.local in_hosted_domain=1 
in_hosted_domain_s3website=1 s->info.domain=static.gm.core.local 
s->info.request_uri=/sky/index.html
2021-03-30 15:32:58.409 7ff746798700 20 get_handler 
handler=29RGWHandler_REST_Obj_S3Website
2021-03-30 15:32:58.409 7ff746798700 10 
handler=29RGWHandler_REST_Obj_S3Website

2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s getting op 0
2021-03-30 15:32:58.409 7ff746798700 10 
op=28RGWGetObj_ObjStore_S3Website
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj 
verifying requester
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj 
rgw::auth::StrategyRegistry::s3_main_strategy_t: trying 
rgw::auth::s3::AWSAuthStrategy
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj 
rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::S3AnonymousEngine
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj 
rgw::auth::s3::S3AnonymousEngine granted access
2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj 
rgw::auth::s3::AWSAuthStrategy granted access
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj 
normalizing buckets and tenants
2021-03-30 15:32:58.409 7ff746798700 10 s->object=index.html 
s->bucket=sky
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj init 
permissions
2021-03-30 15:32:58.409 7ff746798700 15 decode_policy Read 
AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Owner><ID>sky</ID><DisplayName>Generic Sky Account</DisplayName></Owner>
<AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:type="CanonicalUser"><ID>sky</ID><DisplayName>Generic Sky Account</DisplayName></Grantee>
<Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
2021-03-30 15:32:58.409 7ff746798700 20 get_system_obj_state: 
rctx=0x7ff74678f310 obj=rc3-gm.rgw.meta:users.uid:anonymous 
state=0x55835be11220 s->prefetch_data=0
2021-03-30 15:32:58.409 7ff746798700 10 cache get: 
name=rc3-gm.rgw.meta+users.uid+anonymous : hit (negative entry)
2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj 
recalculating target

2021-03-30 15:32:58.409 7ff746798700 10 retarget Starting retarget
2021-03-30 15:32:58.409 7ff746798700 10 
RGWHandler_REST_S3Websi

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Ahh, right. I saw it fixed here https://tracker.ceph.com/issues/18749 a long 
time ago, but it seems the back-port never happened.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Josh Baergen 
Sent: 30 March 2021 15:23:10
To: Frank Schilder
Cc: Rainer Krienke; Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

I thought that recovery below min_size for EC pools wasn't expected to work 
until Octopus. From the Octopus release notes: "Ceph will allow recovery below 
min_size for Erasure coded pools, wherever possible."

Josh

On Tue, Mar 30, 2021 at 6:53 AM Frank Schilder <fr...@dtu.dk> wrote:
Dear Rainer,

hmm, maybe the option is ignored or not implemented properly. This option set 
to true should have the same effect as reducing min_size *except* that new 
writes will not go to non-redundant storage. When reducing min-size, a 
critically degraded PG will accept new writes, which is the danger of data-loss 
mentioned before and avoided if only recovery ops are allowed on such PGs.

Can you open a tracker about your observation that reducing min-size was 
necessary and helped despite osd_allow_recovery_below_min_size=true?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Rainer Krienke <krie...@uni-koblenz.de>
Sent: 30 March 2021 13:30:00
To: Frank Schilder; Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

Hello Frank,

the option is actually set. On one of my monitors:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep
osd_allow_recovery_below_min_size
 "osd_allow_recovery_below_min_size": "true",

Thank you very much
Rainer

Am 30.03.21 um 13:20 schrieb Frank Schilder:
> Hi, this is odd. The problem with recovery when sufficiently many but less 
> than min_size shards are present should have been resolved with 
> osd_allow_recovery_below_min_size=true. It is really dangerous to reduce 
> min_size below k+1 and, in fact, should never be necessary for recovery. Can 
> you check if this option is present and set to true? If it is not working as 
> intended, a tracker ticker might be in order.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>

  --
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
1001312
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Josh Baergen
I thought that recovery below min_size for EC pools wasn't expected to work
until Octopus. From the Octopus release notes: "Ceph will allow recovery
below min_size for Erasure coded pools, wherever possible."

Josh

On Tue, Mar 30, 2021 at 6:53 AM Frank Schilder  wrote:

> Dear Rainer,
>
> hmm, maybe the option is ignored or not implemented properly. This option
> set to true should have the same effect as reducing min_size *except* that
> new writes will not go to non-redundant storage. When reducing min-size, a
> critically degraded PG will accept new writes, which is the danger of
> data-loss mentioned before and avoided if only recovery ops are allowed on
> such PGs.
>
> Can you open a tracker about your observation that reducing min-size was
> necessary and helped despite osd_allow_recovery_below_min_size=true?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Rainer Krienke 
> Sent: 30 March 2021 13:30:00
> To: Frank Schilder; Eugen Block; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night
> everything hangs
>
> Hello Frank,
>
> the option is actually set. On one of my monitors:
>
> # ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep
> osd_allow_recovery_below_min_size
>  "osd_allow_recovery_below_min_size": "true",
>
> Thank you very much
> Rainer
>
> Am 30.03.21 um 13:20 schrieb Frank Schilder:
> > Hi, this is odd. The problem with recovery when sufficiently many but
> less than min_size shards are present should have been resolved with
> osd_allow_recovery_below_min_size=true. It is really dangerous to reduce
> min_size below k+1 and, in fact, should never be necessary for recovery.
> Can you check if this option is present and set to true? If it is not
> working as intended, a tracker ticker might be in order.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
>
>   --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287
> 1312
> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
> 1001312
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Stefan Kooman

On 3/30/21 3:00 PM, Thomas Hukkelberg wrote:


Any thoughts or insight on how to achieve this with minimal data movement and 
risk of cluster downtime would be welcome!


I would do this with Dan's "upmap-remapped" script [1]; see [2] for his 
presentation. We have used it quite a few times now (also for cluster 
expansions) and it works great. Especially the fact that you can "pause" 
any data movement and get back to "HEALTH_OK" is a really nice benefit.
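
If I remember correctly, the usual pattern is roughly the following (please 
double-check against the script's README first, this is from memory):

ceph osd set norebalance
./upmap-remapped.py > upmaps.sh   # generates 'ceph osd pg-upmap-items ...' commands
sh upmaps.sh                      # review the generated commands before running them
ceph osd unset norebalance
# then let the balancer (upmap mode) move data gradually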


Gr. Stefan

[1]: 
https://github.com/HeinleinSupport/cern-ceph-scripts/blob/master/tools/upmap/upmap-remapped.py


[2]: 
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Sorry about the flow of messages.

I forgot to mention this. Looking at the other replies, the fact that the PG 
in question remained at 4 out of 6 OSDs until you reduced min_size might 
indicate that peering was blocked for some reason and only completed after the 
reduction. If this was the order of events, it seems like an important detail.

It is true that recovery will need to wait until the PG has the missing OSDs 
assigned. If this assignment is somehow blocked by min-size>k, the flag 
osd_allow_recovery_below_min_size itself will have no effect.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 30 March 2021 14:53:18
To: Rainer Krienke; Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

Dear Rainer,

hmm, maybe the option is ignored or not implemented properly. This option set 
to true should have the same effect as reducing min_size *except* that new 
writes will not go to non-redundant storage. When reducing min-size, a 
critically degraded PG will accept new writes, which is the danger of data-loss 
mentioned before and avoided if only recovery ops are allowed on such PGs.

Can you open a tracker about your observation that reducing min-size was 
necessary and helped despite osd_allow_recovery_below_min_size=true?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Rainer Krienke 
Sent: 30 March 2021 13:30:00
To: Frank Schilder; Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

Hello Frank,

the option is actually set. On one of my monitors:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep
osd_allow_recovery_below_min_size
 "osd_allow_recovery_below_min_size": "true",

Thank you very much
Rainer

Am 30.03.21 um 13:20 schrieb Frank Schilder:
> Hi, this is odd. The problem with recovery when sufficiently many but less 
> than min_size shards are present should have been resolved with 
> osd_allow_recovery_below_min_size=true. It is really dangerous to reduce 
> min_size below k+1 and, in fact, should never be necessary for recovery. Can 
> you check if this option is present and set to true? If it is not working as 
> intended, a tracker ticker might be in order.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>

  --
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
1001312
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Dear Rainer,

hmm, maybe the option is ignored or not implemented properly. This option set 
to true should have the same effect as reducing min_size *except* that new 
writes will not go to non-redundant storage. When reducing min-size, a 
critically degraded PG will accept new writes, which is the danger of data-loss 
mentioned before and avoided if only recovery ops are allowed on such PGs.

Can you open a tracker about your observation that reducing min-size was 
necessary and helped despite osd_allow_recovery_below_min_size=true?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Rainer Krienke 
Sent: 30 March 2021 13:30:00
To: Frank Schilder; Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

Hello Frank,

the option is actually set. On one of my monitors:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep
osd_allow_recovery_below_min_size
 "osd_allow_recovery_below_min_size": "true",

Thank you very much
Rainer

Am 30.03.21 um 13:20 schrieb Frank Schilder:
> Hi, this is odd. The problem with recovery when sufficiently many but less 
> than min_size shards are present should have been resolved with 
> osd_allow_recovery_below_min_size=true. It is really dangerous to reduce 
> min_size below k+1 and, in fact, should never be necessary for recovery. Can 
> you check if this option is present and set to true? If it is not working as 
> intended, a tracker ticker might be in order.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>

  --
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
1001312
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Stefan Kooman

On 3/30/21 12:55 PM, Boris Behrens wrote:

I just move one PG away from the OSD, but the diskspace will not get freed.


How did you move? I would suggest you use upmap:

ceph osd pg-upmap-items
Invalid command: missing required parameter pgid(<pgid>)
osd pg-upmap-items <pgid> <id|osd.id> [<id|osd.id>...] :  set pg_upmap_items 
mapping <pgid>:{<id> to <id>, [...]} (developers only)



So you specify which PG has to move to which OSD.
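
For example (PG and OSD ids made up):

ceph osd pg-upmap-items 11.7f 79 88   # map PG 11.7f away from OSD 79 onto OSD 88
ceph osd rm-pg-upmap-items 11.7f      # remove that mapping again later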


Do I need to do something to clean obsolete objects from the osd?


No. The OSD will trim PG data that is not needed anymore.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Hi, this is odd. The problem with recovery when sufficiently many but less than 
min_size shards are present should have been resolved with 
osd_allow_recovery_below_min_size=true. It is really dangerous to reduce 
min_size below k+1 and, in fact, should never be necessary for recovery. Can 
you check if this option is present and set to true? If it is not working as 
intended, a tracker ticket might be in order.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Rainer Krienke 
Sent: 30 March 2021 13:05:56
To: Eugen Block; ceph-users@ceph.io
Subject: [ceph-users] Re: ceph Nautilus lost two disk over night everything 
hangs

Hello,

yes, your assumptions are correct: pxa-rbd is the metadata pool for
pxa-ec, which uses an erasure coding 4+2 profile.

In the last hours Ceph repaired most of the damage. One inactive PG
remained, and ceph health detail then told me:

-
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete;
15 daemons have recently crashed; 150 slow ops, oldest one blocked for
26716 sec, daemons [osd.60,osd.67] have slow ops.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
 pg 36.15b is remapped+incomplete, acting
[60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size from
5 may help; search ceph.com/docs for 'incomplete')
RECENT_CRASH 15 daemons have recently crashed
 osd.90 crashed on host ceph6 at 2021-03-29 21:14:10.442314Z
 osd.67 crashed on host ceph5 at 2021-03-30 02:21:23.944205Z
 osd.67 crashed on host ceph5 at 2021-03-30 01:39:14.452610Z
 osd.90 crashed on host ceph6 at 2021-03-29 21:14:24.23Z
 osd.67 crashed on host ceph5 at 2021-03-30 02:35:43.373845Z
 osd.67 crashed on host ceph5 at 2021-03-30 01:19:58.762393Z
 osd.67 crashed on host ceph5 at 2021-03-30 02:09:42.297941Z
 osd.67 crashed on host ceph5 at 2021-03-30 02:28:29.981528Z
 osd.67 crashed on host ceph5 at 2021-03-30 01:50:05.374278Z
 osd.90 crashed on host ceph6 at 2021-03-29 21:13:51.896849Z
 osd.67 crashed on host ceph5 at 2021-03-30 02:00:22.593745Z
 osd.67 crashed on host ceph5 at 2021-03-30 01:29:39.170134Z
 osd.90 crashed on host ceph6 at 2021-03-29 21:14:38.114768Z
 osd.67 crashed on host ceph5 at 2021-03-30 00:54:06.629808Z
 osd.67 crashed on host ceph5 at 2021-03-30 01:10:21.824447Z
-

All OSDs except for 67 and 90 are up. I followed the hint in health detail
and lowered min_size from 5 to 4 for pxa-ec (roughly as sketched below).
Since then Ceph is repairing again, and in the meantime some VMs in the
attached Proxmox cluster are working again.
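
# roughly what I did, to be reverted once recovery has finished:
ceph osd pool set pxa-ec min_size 4
# ...wait for recovery to complete, then:
ceph osd pool set pxa-ec min_size 5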

So I hope that after repairing all PGs are up, so that I can restart all
VMs again.

Thanks
Rainer

Am 30.03.21 um 11:41 schrieb Eugen Block:
> Hi,
>
> from what you've sent my conclusion about the stalled I/O would be
> indeed the min_size of the EC pool.
> There's only one PG reported as incomplete, I assume that is the EC
> pool, not the replicated pxa-rbd, right? Both pools are for rbd so I'm
> guessing the rbd headers are in pxa-rbd while the data is stored in
> pxa-ec, could you confirm that?
>
> You could add 'ceph health detail' output to your question to see which
> PG is incomplete.
> I assume that both down OSDs are in the acting set of the inactive PG,
> and since the pool's min_size is 5 the I/O pauses. If you can't wait for
> recovery to finish and can't bring up at least one of those OSDs you
> could set the min_size of pxa-ec to 4, but if you do, be aware that one
> more disk failure could mean data loss! So think carefully about it
> (maybe you could instead speed up recovery?) and don't forget to
> increase min_size back to 5 when the recovery has finished, that's very
> important!
>
> Regards,
> Eugen
>
>
> Zitat von Rainer Krienke :
>
>> Hello,
>>
>> i run a ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
>> lost two disks, so two OSDs (67,90) are down. The two disks are on two
>> different hosts. A third ODS on a third host repotrts slow ops. ceph
>> is repairing at the moment.
>>
>> Pools affected are eg these ones:
>>  pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor
>> 0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0
>> pg_num_min 128 target_size_ratio 0.0001 application rbd
>>
>> pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash
>> rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor
>> 0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps
>> stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd
>>
>> At the mmoment the proxmox-cluster using storage from the seperate
>> ceph cluster hangs. The ppols with date are erasure coded with the
>> following profile:
>>
>> crush-device-class=
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=4
>> m=2
>> plugin=jerasure
>>

[ceph-users] Re: Upgrade from Luminous to Nautilus now one MDS with could not get service secret

2021-03-30 Thread Dan van der Ster
Hi Robert,

We get a handful of verify_authorizer warnings on some of our clusters
too but they don't seem to pose any problems.
I've tried without success to debug this in the past -- IIRC I started
to suspect it was coming from old cephfs kernel clients but got
distracted and never reached the bottom of it.

Below in the PS is what it looks like on an osd with debug_auth=20 and
debug_ms=1 in case this sparks any ideas.

-- dan

2021-03-30 17:11:55.015 7f2a178a6700  1 --2-
[v2:128.142.xx:6816/465972,v1:128.142.xx:6817/465972] >>
conn(0x5608ef7c7000 0x5608d6be1c00 unknown :-1 s=BANNER_ACCEPTING
pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload
supported=0 required=0
2021-03-30 17:11:55.016 7f2a178a6700 20 AuthRegistry(0x7fff71e9bf58)
get_handler peer_type 8 method 2 cluster_methods [2] service_methods
[2] client_methods [2]
2021-03-30 17:11:55.016 7f2a178a6700 10 cephx: verify_authorizer
decrypted service osd secret_id=58900
2021-03-30 17:11:55.016 7f2a178a6700  0 auth: could not find secret_id=58900
2021-03-30 17:11:55.016 7f2a178a6700 10 auth: dump_rotating:
2021-03-30 17:11:55.016 7f2a178a6700 10 auth:  id 61926 AQxxx==
expires 2021-03-30 16:11:57.193945
2021-03-30 17:11:55.016 7f2a178a6700 10 auth:  id 61927 AQyyy==
expires 2021-03-30 17:11:58.331600
2021-03-30 17:11:55.016 7f2a178a6700 10 auth:  id 61928 AQzzz==
expires 2021-03-30 18:11:59.341208
2021-03-30 17:11:55.016 7f2a178a6700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=58900
2021-03-30 17:11:55.016 7f2a178a6700  1 --2-
[v2:128.142.xx:6816/465972,v1:128.142.xx:6817/465972] >>
conn(0x5608ef7c7000 0x5608d6be1c00 crc :-1 s=AUTH_ACCEPTING pgs=0 cs=0
l=1 rev1=0 rx=0 tx=0)._auth_bad_method auth_method 2 r (13) Permission
denied, allowed_methods [2], allowed_modes [1,2]
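
(For completeness, the debug levels were raised at runtime with something like 
the following; the OSD id is a placeholder:)

ceph tell osd.123 config set debug_auth 20
ceph tell osd.123 config set debug_ms 1
# and back to the defaults afterwards
ceph tell osd.123 config set debug_auth 1/5
ceph tell osd.123 config set debug_ms 0/5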

On Sun, Mar 28, 2021 at 8:17 PM Robert LeBlanc  wrote:
>
> We just upgraded our cluster from Lumious to Nautilus and after a few
> days one of our MDS servers is getting:
>
> 2021-03-28 18:06:32.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
> Sending beacon up:standby seq 16
> 2021-03-28 18:06:32.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
> sender thread waiting interval 4s
> 2021-03-28 18:06:32.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
> received beacon reply up:standby seq 16 rtt 0.0041
> 2021-03-28 18:06:36.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
> Sending beacon up:standby seq 17
> 2021-03-28 18:06:36.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
> sender thread waiting interval 4s
> 2021-03-28 18:06:36.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
> received beacon reply up:standby seq 17 rtt 0
> 2021-03-28 18:06:37.788 7f57c900a700  0 auth: could not find secret_id=34586
> 2021-03-28 18:06:37.788 7f57c900a700  0 cephx: verify_authorizer could
> not get service secret for service mds secret_id=34586
> 2021-03-28 18:06:37.788 7f57c6004700  5 mds.sun-gcs01-mds02
> ms_handle_reset on v2:10.65.101.13:46566/0
> 2021-03-28 18:06:40.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
> Sending beacon up:standby seq 18
> 2021-03-28 18:06:40.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
> sender thread waiting interval 4s
> 2021-03-28 18:06:40.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
> received beacon reply up:standby seq 18 rtt 0
> 2021-03-28 18:06:44.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
> Sending beacon up:standby seq 19
> 2021-03-28 18:06:44.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
> sender thread waiting interval 4s
>
> I've tried removing the /var/lib/ceph/mds/ directory and getting the
> key again. I've removed the key and generated a new one, I've checked
> the clocks between all the nodes. From what I can tell, everything is
> good.
>
> We did have an issue where the monitor cluster fell over and would not
> boot. We reduced the monitors to a single monitor, disabled cephx,
> pulled it off the network and restarted the service a few times which
> allowed it to come up. We then expanded back to three mons and
> reenabled cephx and everything has been good until this. No other
> services seem to be suffering from this and it even appears that the
> MDS works okay even with these messages. We would like to figure out
> how to resolve this.
>
> Thank you,
> Robert LeBlanc
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rados gateway static website

2021-03-30 Thread Casey Bodley
this error 2039 is ERR_NO_SUCH_WEBSITE_CONFIGURATION. if you want to
access a bucket via rgw_dns_s3website_name, you have to set a website
configuration on the bucket - see
https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketWebsite.html
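
with s3cmd that would be something along the lines of (untested; note it has
to go against the regular S3 API endpoint, not the s3website endpoint):

s3cmd ws-create --ws-index=index.html --ws-error=error.html s3://sky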

On Tue, Mar 30, 2021 at 10:05 AM Marcel Kuiper  wrote:
>
>
> despite the examples that can be found on the internet I have troubles
> setting up a static website that serves from a S3 bucket If anyone could
> point me in the right direction that would be much appreciated
>
> Marcel
>
> I created an index.html in the bucket sky
>
> gm-rc3-jumphost01@ceph/s3cmd (master)$ ./s3cmd info s3://sky/index.html
> s3://sky/index.html (object):
> File size: 42046
> Last mod:  Tue, 30 Mar 2021 13:28:02 GMT
> MIME type: text/html
> Storage:   STANDARD
> MD5 sum:   93acaccebb23a18da33ec4294d99ea1a
> SSE:   none
> Policy:none
> CORS:  none
> ACL:   *anon*: READ
> ACL:   Generic Sky Account: FULL_CONTROL
>
> And curl returns
>
> gm-rc3-jumphost01@tmp/skills$ curl
> https://sky.static.gm.core.local/index.html
> 
>   404 Not Found
>   
>404 Not Found
>
> Code: NoSuchWebsiteConfiguration
> BucketName: sky
> RequestId: tx000ba-00606327b8-cca124-rc3-gm
> HostId: cca124-rc3-gm-rc3
>
> COnfig of de rados instance
>
> [client.radosgw.rc3-gm]
> debug_rgw = 20
> ms_debug = 1
> rgw_zonegroup = rc3
> rgw_zone = rc3-gm
> rgw_enable_static_website = true
> rgw_enable_apis = s3website
> rgw expose bucket = true
> rgw_dns_name = gm-rc3-radosgw.gm.core.local
> rgw_dns_s3website_name = static.gm.core.local
> rgw_resolve_cname = true
> host = gm-rc3-s3web01
> keyring = /etc/ceph/ceph.client.radosgw.rc3-gm.keyring
> log_file = /var/log/ceph/radosgw.log
> user = ceph
> rgw_frontends = civetweb port=443s
> ssl_certificate=/etc/ceph/ssl/key_cert_ca.pem
>
> DNS (from pdnsutil list-zone)
> *.static.gm.core.local  3600IN  CNAME   gm-rc3-s3web01.gm.core.local
>
> The logs shows
>
> 2021-03-30 15:32:53.725 7ff760fcd700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2021-03-30 15:32:58.409 7ff746798700 20 HTTP_ACCEPT=*/*
> 2021-03-30 15:32:58.409 7ff746798700 20
> HTTP_HOST=sky.static.gm.core.local
> 2021-03-30 15:32:58.409 7ff746798700 20 HTTP_USER_AGENT=curl/7.58.0
> 2021-03-30 15:32:58.409 7ff746798700 20 HTTP_VERSION=1.1
> 2021-03-30 15:32:58.409 7ff746798700 20 REMOTE_ADDR=10.128.160.47
> 2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_METHOD=GET
> 2021-03-30 15:32:58.409 7ff746798700 20 REQUEST_URI=/index.html
> 2021-03-30 15:32:58.409 7ff746798700 20 SCRIPT_URI=/index.html
> 2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT=443
> 2021-03-30 15:32:58.409 7ff746798700 20 SERVER_PORT_SECURE=443
> 2021-03-30 15:32:58.409 7ff746798700  1 == starting new request
> req=0x7ff746791740 =
> 2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s initializing for
> trans_id = tx000c4-006063288a-cca124-rc3-gm
> 2021-03-30 15:32:58.409 7ff746798700 10 rgw api priority: s3=-1
> s3website=1
> 2021-03-30 15:32:58.409 7ff746798700 10 host=sky.static.gm.core.local
> 2021-03-30 15:32:58.409 7ff746798700 20 subdomain=sky
> domain=static.gm.core.local in_hosted_domain=1
> in_hosted_domain_s3website=1
> 2021-03-30 15:32:58.409 7ff746798700 20 final domain/bucket
> subdomain=sky domain=static.gm.core.local in_hosted_domain=1
> in_hosted_domain_s3website=1 s->info.domain=static.gm.core.local
> s->info.request_uri=/sky/index.html
> 2021-03-30 15:32:58.409 7ff746798700 20 get_handler
> handler=29RGWHandler_REST_Obj_S3Website
> 2021-03-30 15:32:58.409 7ff746798700 10
> handler=29RGWHandler_REST_Obj_S3Website
> 2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s getting op 0
> 2021-03-30 15:32:58.409 7ff746798700 10
> op=28RGWGetObj_ObjStore_S3Website
> 2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj
> verifying requester
> 2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
> rgw::auth::StrategyRegistry::s3_main_strategy_t: trying
> rgw::auth::s3::AWSAuthStrategy
> 2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
> rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::S3AnonymousEngine
> 2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
> rgw::auth::s3::S3AnonymousEngine granted access
> 2021-03-30 15:32:58.409 7ff746798700 20 req 196 0.000s s3:get_obj
> rgw::auth::s3::AWSAuthStrategy granted access
> 2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj
> normalizing buckets and tenants
> 2021-03-30 15:32:58.409 7ff746798700 10 s->object=index.html
> s->bucket=sky
> 2021-03-30 15:32:58.409 7ff746798700  2 req 196 0.000s s3:get_obj init
> permissions
> 2021-03-30 15:32:58.409 7ff746798700 15 decode_policy Read
> AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/";>skyGeneric
> Sky Account xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
> xsi:type="CanonicalUser">skyGeneric Sky
> AccountFULL_

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
I raised the backfillfull_ratio to 0.91 to see what happens; now I am
waiting. Some OSDs were around 89-91%, some are around 50-60%.
The pgp_num has been at 1946 for one week. I think this will solve itself
once the cluster becomes a bit tidier.

Am Di., 30. März 2021 um 15:23 Uhr schrieb Dan van der Ster <
d...@vanderster.com>:

> You started with 1024 PGs, and are splitting to 2048.
> Currently there are 1946 PGs used .. so it is nearly there at the goal.
>
> You need to watch that value 1946 and see if it increases slowly. If
> it does not increase, then those backfill_toofull PGs are probably
> splitting PGs, and they are blocked by not having enough free space.
>
> To solve that free space problem, you could either increase the
> backfillfull_ratio like we discussed earlier, or add capacity.
> I prefer the former, if the OSDs are just over the 90% default limit.
>
> -- dan
>
> On Tue, Mar 30, 2021 at 3:18 PM Boris Behrens  wrote:
> >
> > The output from ceph osd pool ls detail tell me nothing, except that the
> pgp_num is not where it should be. Can you help me to read the output? How
> do I estimate how long the split will take?
> >
> > [root@s3db1 ~]# ceph status
> >   cluster:
> > id: dca79fff-ffd0-58f4-1cff-82a2feea05f4
> > health: HEALTH_WARN
> > noscrub,nodeep-scrub flag(s) set
> > 10 backfillfull osd(s)
> > 19 nearfull osd(s)
> > 37 pool(s) backfillfull
> > BlueFS spillover detected on 1 OSD(s)
> > 13 large omap objects
> > Low space hindering backfill (add storage if this doesn't
> resolve itself): 234 pgs backfill_toofull
> > ...
> >   data:
> > pools:   37 pools, 4032 pgs
> > objects: 121.40M objects, 199 TiB
> > usage:   627 TiB used, 169 TiB / 795 TiB avail
> > pgs: 45263471/364213596 objects misplaced (12.428%)
> >  3719 active+clean
> >  209  active+remapped+backfill_wait+backfill_toofull
> >  59   active+remapped+backfill_wait
> >  24   active+remapped+backfill_toofull
> >  20   active+remapped+backfilling
> >  1active+remapped+forced_backfill+backfill_toofull
> >
> >   io:
> > client:   8.4 MiB/s rd, 127 MiB/s wr, 208 op/s rd, 163 op/s wr
> > recovery: 276 MiB/s, 164 objects/s
> >
> > [root@s3db1 ~]# ceph osd pool ls detail
> > ...
> > pool 10 'eu-central-1.rgw.buckets.index' replicated size 3 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> last_change 320966 lfor 0/193276/306366 flags hashpspool,backfillfull
> stripe_width 0 application rgw
> > pool 11 'eu-central-1.rgw.buckets.data' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 1946 pgp_num_target
> 2048 autoscale_mode warn last_change 320966 lfor 0/263549/317774 flags
> hashpspool,backfillfull stripe_width 0 application rgw
> > ...
> >
> > Am Di., 30. März 2021 um 15:07 Uhr schrieb Dan van der Ster <
> d...@vanderster.com>:
> >>
> >> It would be safe to turn off the balancer, yes go ahead.
> >>
> >> To know if adding more hardware will help, we need to see how much
> >> longer this current splitting should take. This will help:
> >>
> >> ceph status
> >> ceph osd pool ls detail
> >>
> >> -- dan
> >>
> >> On Tue, Mar 30, 2021 at 3:00 PM Boris Behrens  wrote:
> >> >
> >> > I would think due to splitting, because the balancer doesn't refuses
> it's work, because to many misplaced objects.
> >> > I also think to turn it off for now, so it doesn't begin it's work at
> 5% missplaced objects.
> >> >
> >> > Would adding more hardware help? We wanted to insert another OSD node
> with 7x8TB disks anyway, but postponed it due to the rebalancing.
> >> >
> >> > Am Di., 30. März 2021 um 14:23 Uhr schrieb Dan van der Ster <
> d...@vanderster.com>:
> >> >>
> >> >> Are those PGs backfilling due to splitting or due to balancing?
> >> >> If it's the former, I don't think there's a way to pause them with
> >> >> upmap or any other trick.
> >> >>
> >> >> -- dan
> >> >>
> >> >> On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
> >> >> >
> >> >> > One week later the ceph is still balancing.
> >> >> > What worries me like hell is the %USE on a lot of those OSDs. Does
> ceph
> >> >> > resolv this on it's own? We are currently down to 5TB space in the
> cluster.
> >> >> > Rebalancing single OSDs doesn't work well and it increases the
> "missplaced
> >> >> > objects".
> >> >> >
> >> >> > I thought about letting upmap do some rebalancing. Anyone know if
> this is a
> >> >> > good idea? Or if I should bite my nails an wait as I am the
> headache of my
> >> >> > life.
> >> >> > [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap
> out.txt
> >> >> > --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat
> out.txt
> >> >> > got osdmap epoch 321975
> >> >> > osdmaptool: osdmap file 'om'
> >> >> > writing upmap command output to: out.txt
> >> >> > checking for upm

[ceph-users] Re: Device class not deleted/set correctly

2021-03-30 Thread Stefan Kooman

On 3/25/21 1:05 PM, Nico Schottelius wrote:


it seems there is no reference to it in the ceph documentation. Do you
have any pointers to it?


Not anymore with new Ceph documentation.


Out of curiosity, do you have any clue why it's not in there anymore?


It might still be, but I cannot find it anymore ...

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
You started with 1024 PGs and are splitting to 2048.
Currently 1946 PGs are in use, so it is nearly at the goal.

You need to watch that value 1946 and see if it increases slowly. If
it does not increase, then those backfill_toofull PGs are probably
splitting PGs, and they are blocked by not having enough free space.

To solve that free space problem, you could either increase the
backfillfull_ratio like we discussed earlier, or add capacity.
I prefer the former, if the OSDs are just over the 90% default limit.
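
For example (pick a ratio just above where the fullest OSDs are, and revert
once the splitting completes):

ceph osd set-backfillfull-ratio 0.91
# and keep an eye on the split progress, e.g.:
ceph osd pool get eu-central-1.rgw.buckets.data pgp_num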

-- dan

On Tue, Mar 30, 2021 at 3:18 PM Boris Behrens  wrote:
>
> The output from ceph osd pool ls detail tell me nothing, except that the 
> pgp_num is not where it should be. Can you help me to read the output? How do 
> I estimate how long the split will take?
>
> [root@s3db1 ~]# ceph status
>   cluster:
> id: dca79fff-ffd0-58f4-1cff-82a2feea05f4
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 10 backfillfull osd(s)
> 19 nearfull osd(s)
> 37 pool(s) backfillfull
> BlueFS spillover detected on 1 OSD(s)
> 13 large omap objects
> Low space hindering backfill (add storage if this doesn't resolve 
> itself): 234 pgs backfill_toofull
> ...
>   data:
> pools:   37 pools, 4032 pgs
> objects: 121.40M objects, 199 TiB
> usage:   627 TiB used, 169 TiB / 795 TiB avail
> pgs: 45263471/364213596 objects misplaced (12.428%)
>  3719 active+clean
>  209  active+remapped+backfill_wait+backfill_toofull
>  59   active+remapped+backfill_wait
>  24   active+remapped+backfill_toofull
>  20   active+remapped+backfilling
>  1active+remapped+forced_backfill+backfill_toofull
>
>   io:
> client:   8.4 MiB/s rd, 127 MiB/s wr, 208 op/s rd, 163 op/s wr
> recovery: 276 MiB/s, 164 objects/s
>
> [root@s3db1 ~]# ceph osd pool ls detail
> ...
> pool 10 'eu-central-1.rgw.buckets.index' replicated size 3 min_size 1 
> crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn 
> last_change 320966 lfor 0/193276/306366 flags hashpspool,backfillfull 
> stripe_width 0 application rgw
> pool 11 'eu-central-1.rgw.buckets.data' replicated size 3 min_size 2 
> crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 1946 pgp_num_target 
> 2048 autoscale_mode warn last_change 320966 lfor 0/263549/317774 flags 
> hashpspool,backfillfull stripe_width 0 application rgw
> ...
>
> Am Di., 30. März 2021 um 15:07 Uhr schrieb Dan van der Ster 
> :
>>
>> It would be safe to turn off the balancer, yes go ahead.
>>
>> To know if adding more hardware will help, we need to see how much
>> longer this current splitting should take. This will help:
>>
>> ceph status
>> ceph osd pool ls detail
>>
>> -- dan
>>
>> On Tue, Mar 30, 2021 at 3:00 PM Boris Behrens  wrote:
>> >
>> > I would think due to splitting, because the balancer doesn't refuses it's 
>> > work, because to many misplaced objects.
>> > I also think to turn it off for now, so it doesn't begin it's work at 5% 
>> > missplaced objects.
>> >
>> > Would adding more hardware help? We wanted to insert another OSD node with 
>> > 7x8TB disks anyway, but postponed it due to the rebalancing.
>> >
>> > Am Di., 30. März 2021 um 14:23 Uhr schrieb Dan van der Ster 
>> > :
>> >>
>> >> Are those PGs backfilling due to splitting or due to balancing?
>> >> If it's the former, I don't think there's a way to pause them with
>> >> upmap or any other trick.
>> >>
>> >> -- dan
>> >>
>> >> On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
>> >> >
>> >> > One week later the ceph is still balancing.
>> >> > What worries me like hell is the %USE on a lot of those OSDs. Does ceph
>> >> > resolv this on it's own? We are currently down to 5TB space in the 
>> >> > cluster.
>> >> > Rebalancing single OSDs doesn't work well and it increases the 
>> >> > "missplaced
>> >> > objects".
>> >> >
>> >> > I thought about letting upmap do some rebalancing. Anyone know if this 
>> >> > is a
>> >> > good idea? Or if I should bite my nails an wait as I am the headache of 
>> >> > my
>> >> > life.
>> >> > [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
>> >> > --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
>> >> > got osdmap epoch 321975
>> >> > osdmaptool: osdmap file 'om'
>> >> > writing upmap command output to: out.txt
>> >> > checking for upmap cleanups
>> >> > upmap, max-count 10, max deviation 5
>> >> >  limiting to pools eu-central-1.rgw.buckets.data ([11])
>> >> > pools eu-central-1.rgw.buckets.data
>> >> > prepared 10/10 changes
>> >> > ceph osd rm-pg-upmap-items 11.209
>> >> > ceph osd rm-pg-upmap-items 11.253
>> >> > ceph osd pg-upmap-items 11.7f 79 88
>> >> > ceph osd pg-upmap-items 11.fc 53 31 105 78
>> >> > ceph osd pg-upmap-items 11.1d8 84 50
>> >> > ceph osd pg-upmap-items 11.47f 94 86
>> >> > ceph osd pg-upmap-items 11.49c 44 71
>>

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
The output from ceph osd pool ls detail tells me nothing, except that the
pgp_num is not where it should be. Can you help me to read the output? How
do I estimate how long the split will take?

[root@s3db1 ~]# ceph status
  cluster:
id: dca79fff-ffd0-58f4-1cff-82a2feea05f4
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
10 backfillfull osd(s)
19 nearfull osd(s)
37 pool(s) backfillfull
BlueFS spillover detected on 1 OSD(s)
13 large omap objects
Low space hindering backfill (add storage if this doesn't
resolve itself): 234 pgs backfill_toofull
...
  data:
pools:   37 pools, 4032 pgs
objects: 121.40M objects, 199 TiB
usage:   627 TiB used, 169 TiB / 795 TiB avail
pgs: 45263471/364213596 objects misplaced (12.428%)
 3719 active+clean
 209  active+remapped+backfill_wait+backfill_toofull
 59   active+remapped+backfill_wait
 24   active+remapped+backfill_toofull
 20   active+remapped+backfilling
 1active+remapped+forced_backfill+backfill_toofull

  io:
client:   8.4 MiB/s rd, 127 MiB/s wr, 208 op/s rd, 163 op/s wr
recovery: 276 MiB/s, 164 objects/s

[root@s3db1 ~]# ceph osd pool ls detail
...
pool 10 'eu-central-1.rgw.buckets.index' replicated size 3 min_size 1
crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 320966 lfor 0/193276/306366 flags hashpspool,backfillfull
stripe_width 0 application rgw
pool 11 'eu-central-1.rgw.buckets.data' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 1946 pgp_num_target
2048 autoscale_mode warn last_change 320966 lfor 0/263549/317774 flags
hashpspool,backfillfull stripe_width 0 application rgw
...

On Tue, 30 Mar 2021 at 15:07, Dan van der Ster <
d...@vanderster.com> wrote:

> It would be safe to turn off the balancer, yes go ahead.
>
> To know if adding more hardware will help, we need to see how much
> longer this current splitting should take. This will help:
>
> ceph status
> ceph osd pool ls detail
>
> -- dan
>
> On Tue, Mar 30, 2021 at 3:00 PM Boris Behrens  wrote:
> >
> > I would think due to splitting, because the balancer doesn't refuses
> it's work, because to many misplaced objects.
> > I also think to turn it off for now, so it doesn't begin it's work at 5%
> missplaced objects.
> >
> > Would adding more hardware help? We wanted to insert another OSD node
> with 7x8TB disks anyway, but postponed it due to the rebalancing.
> >
> > Am Di., 30. März 2021 um 14:23 Uhr schrieb Dan van der Ster <
> d...@vanderster.com>:
> >>
> >> Are those PGs backfilling due to splitting or due to balancing?
> >> If it's the former, I don't think there's a way to pause them with
> >> upmap or any other trick.
> >>
> >> -- dan
> >>
> >> On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
> >> >
> >> > One week later the ceph is still balancing.
> >> > What worries me like hell is the %USE on a lot of those OSDs. Does
> ceph
> >> > resolv this on it's own? We are currently down to 5TB space in the
> cluster.
> >> > Rebalancing single OSDs doesn't work well and it increases the
> "missplaced
> >> > objects".
> >> >
> >> > I thought about letting upmap do some rebalancing. Anyone know if
> this is a
> >> > good idea? Or if I should bite my nails an wait as I am the headache
> of my
> >> > life.
> >> > [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
> >> > --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
> >> > got osdmap epoch 321975
> >> > osdmaptool: osdmap file 'om'
> >> > writing upmap command output to: out.txt
> >> > checking for upmap cleanups
> >> > upmap, max-count 10, max deviation 5
> >> >  limiting to pools eu-central-1.rgw.buckets.data ([11])
> >> > pools eu-central-1.rgw.buckets.data
> >> > prepared 10/10 changes
> >> > ceph osd rm-pg-upmap-items 11.209
> >> > ceph osd rm-pg-upmap-items 11.253
> >> > ceph osd pg-upmap-items 11.7f 79 88
> >> > ceph osd pg-upmap-items 11.fc 53 31 105 78
> >> > ceph osd pg-upmap-items 11.1d8 84 50
> >> > ceph osd pg-upmap-items 11.47f 94 86
> >> > ceph osd pg-upmap-items 11.49c 44 71
> >> > ceph osd pg-upmap-items 11.553 74 50
> >> > ceph osd pg-upmap-items 11.6c3 66 63
> >> > ceph osd pg-upmap-items 11.7ad 43 50
> >> >
> >> > ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATA OMAP META
> >> >  AVAIL%USE  VAR  PGS STATUS TYPE NAME
> >> >  -1   795.42548- 795 TiB 626 TiB  587 TiB   82 GiB 1.4
> TiB  170
> >> > TiB 78.64 1.00   -root default
> >> >  56   hdd   7.32619  1.0 7.3 TiB 6.4 TiB  6.4 TiB  684 MiB  16
> GiB  910
> >> > GiB 87.87 1.12 129 up osd.56
> >> >  67   hdd   7.27739  1.0 7.3 TiB 6.4 TiB  6.4 TiB  582 MiB  16
> GiB  865
> >> > GiB 88.40 1.12 115 up osd.67
> >> >  79   hdd   3.63689  1.0 3.6 TiB 3.2 TiB  432 GiB  1.9 GiB 0
> B  432
> >> 

[ceph-users] Re: Resolving LARGE_OMAP_OBJECTS

2021-03-30 Thread David Orman
Hi Ben,

That was beyond helpful. Thank you so much for the thoughtful and
detailed explanation. That should definitely be added to the
documentation, until/unless the dynamic resharder/sharder handle this
case (if there is even desire to do so) with versioned objects.

Respectfully,
David

On Tue, Mar 30, 2021 at 12:21 AM Benoît Knecht  wrote:
>
> Hi David,
>
> On Tuesday, March 30th, 2021 at 00:50, David Orman  
> wrote:
> > Sure enough, it is more than 200,000, just as the alert indicates.
> > However, why did it not reshard further? Here's the kicker - we only
> > see this with versioned buckets/objects. I don't see anything in the
> > documentation that indicates this is a known issue with sharding, but
> > perhaps there is something going on with versioned buckets/objects. Is
> > there any clarity here/suggestions on how to deal with this? It sounds
> > like you expect this behavior with versioned buckets, so we must be
> > missing something.
>
> The issue with versioned buckets is that each object is associated with at 
> least 4 index entries, with 2 additional index entries for each version of 
> the object. Dynamic resharding is based on the number of objects, not the 
> number of index entries, and it counts each version of an object as an 
> object, so the biggest discrepancy between number of objects and index 
> entries happens when there's only one version of each object (factor of 4), 
> and it tends to a factor of two as the number of versions per object 
> increases to infinity. But there's one more special case. When you delete an 
> versioned object, it also creates two more index entries, but those are not 
> taken into account by dynamic resharding. Therefore, the absolute worst case 
> is when there was a single version of each object, and all the objects have 
> been deleted. In that case, there's 6 index entries for each object counted 
> by dynamic resharding, i.e. a factor of 6.
>
> So one way to "solve" this issue is to set 
> `osd_deep_scrub_large_omap_object_key_threshold=60`, which (with the 
> default `rgw_max_objs_per_shard=10`) will guarantee that dynamic 
> resharding will kick in before you get a large omap object warning even in 
> the worst case scenario for versioned buckets. If you're not comfortable 
> having that many keys per omap object, you could instead decrease 
> `rgw_max_objs_per_shard`.
>
> Cheers,
>
> --
> Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
It would be safe to turn off the balancer, yes go ahead.

To know if adding more hardware will help, we need to see how much
longer this current splitting should take. This will help:

ceph status
ceph osd pool ls detail
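
A short sketch of what that would look like on the command line (pool name as
in your earlier output):

# keep the balancer from queueing additional moves while the split finishes
ceph balancer off
ceph balancer status
# the split is done once pgp_num has caught up with pg_num for the data pool
ceph osd pool get eu-central-1.rgw.buckets.data pg_num
ceph osd pool get eu-central-1.rgw.buckets.data pgp_num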

-- dan

On Tue, Mar 30, 2021 at 3:00 PM Boris Behrens  wrote:
>
> I would think due to splitting, because the balancer doesn't refuses it's 
> work, because to many misplaced objects.
> I also think to turn it off for now, so it doesn't begin it's work at 5% 
> missplaced objects.
>
> Would adding more hardware help? We wanted to insert another OSD node with 
> 7x8TB disks anyway, but postponed it due to the rebalancing.
>
> Am Di., 30. März 2021 um 14:23 Uhr schrieb Dan van der Ster 
> :
>>
>> Are those PGs backfilling due to splitting or due to balancing?
>> If it's the former, I don't think there's a way to pause them with
>> upmap or any other trick.
>>
>> -- dan
>>
>> On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
>> >
>> > One week later the ceph is still balancing.
>> > What worries me like hell is the %USE on a lot of those OSDs. Does ceph
>> > resolv this on it's own? We are currently down to 5TB space in the cluster.
>> > Rebalancing single OSDs doesn't work well and it increases the "missplaced
>> > objects".
>> >
>> > I thought about letting upmap do some rebalancing. Anyone know if this is a
>> > good idea? Or if I should bite my nails an wait as I am the headache of my
>> > life.
>> > [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
>> > --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
>> > got osdmap epoch 321975
>> > osdmaptool: osdmap file 'om'
>> > writing upmap command output to: out.txt
>> > checking for upmap cleanups
>> > upmap, max-count 10, max deviation 5
>> >  limiting to pools eu-central-1.rgw.buckets.data ([11])
>> > pools eu-central-1.rgw.buckets.data
>> > prepared 10/10 changes
>> > ceph osd rm-pg-upmap-items 11.209
>> > ceph osd rm-pg-upmap-items 11.253
>> > ceph osd pg-upmap-items 11.7f 79 88
>> > ceph osd pg-upmap-items 11.fc 53 31 105 78
>> > ceph osd pg-upmap-items 11.1d8 84 50
>> > ceph osd pg-upmap-items 11.47f 94 86
>> > ceph osd pg-upmap-items 11.49c 44 71
>> > ceph osd pg-upmap-items 11.553 74 50
>> > ceph osd pg-upmap-items 11.6c3 66 63
>> > ceph osd pg-upmap-items 11.7ad 43 50
>> >
>> > ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATA OMAP META
>> >  AVAIL%USE  VAR  PGS STATUS TYPE NAME
>> >  -1   795.42548- 795 TiB 626 TiB  587 TiB   82 GiB 1.4 TiB  170
>> > TiB 78.64 1.00   -root default
>> >  56   hdd   7.32619  1.0 7.3 TiB 6.4 TiB  6.4 TiB  684 MiB  16 GiB  910
>> > GiB 87.87 1.12 129 up osd.56
>> >  67   hdd   7.27739  1.0 7.3 TiB 6.4 TiB  6.4 TiB  582 MiB  16 GiB  865
>> > GiB 88.40 1.12 115 up osd.67
>> >  79   hdd   3.63689  1.0 3.6 TiB 3.2 TiB  432 GiB  1.9 GiB 0 B  432
>> > GiB 88.40 1.12  63 up osd.79
>> >  53   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.4 TiB  971 MiB  22 GiB  864
>> > GiB 88.48 1.13 114 up osd.53
>> >  51   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.4 TiB  734 MiB  15 GiB  837
>> > GiB 88.77 1.13 120 up osd.51
>> >  73   hdd  14.55269  1.0  15 TiB  13 TiB   13 TiB  1.8 GiB  39 GiB  1.6
>> > TiB 88.97 1.13 246 up osd.73
>> >  55   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.5 TiB  259 MiB  15 GiB  825
>> > GiB 89.01 1.13 118 up osd.55
>> >  70   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.5 TiB  291 MiB  16 GiB  787
>> > GiB 89.44 1.14 119 up osd.70
>> >  42   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  685 MiB 8.2 GiB  374
>> > GiB 90.23 1.15  60 up osd.42
>> >  94   hdd   3.63869  1.0 3.6 TiB 3.3 TiB  3.3 TiB  132 MiB 7.7 GiB  345
>> > GiB 90.75 1.15  64 up osd.94
>> >  25   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  3.2 MiB 8.1 GiB  352
>> > GiB 90.79 1.15  53 up osd.25
>> >  31   hdd   7.32619  1.0 7.3 TiB 6.7 TiB  6.6 TiB  223 MiB  15 GiB  690
>> > GiB 90.80 1.15 117 up osd.31
>> >  84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB  6.6 TiB  159 MiB  16 GiB  699
>> > GiB 90.93 1.16 121 up osd.84
>> >  82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB  332 GiB  1.0 GiB 0 B  332
>> > GiB 91.08 1.16  59 up osd.82
>> >  89   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  400 MiB  15 GiB  670
>> > GiB 91.29 1.16 126 up osd.89
>> >  33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  382 MiB 8.6 GiB  327
>> > GiB 91.46 1.16  66 up osd.33
>> >  90   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  338 MiB  15 GiB  658
>> > GiB 91.46 1.16 112 up osd.90
>> > 105   hdd   3.63869  0.8 3.6 TiB 3.3 TiB  3.3 TiB  206 MiB 8.1 GiB  301
>> > GiB 91.91 1.17  56 up osd.105
>> >  66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB  6.7 TiB  322 MiB  16 GiB  548
>> > GiB 92.64 1.18 121 up osd.66
>>

[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Boris Behrens
I reweighted the OSD to .0 and then forced the backfilling.

How long does it take for Ceph to free up space? It looks like it was doing
this, but it could also be the "backup cleanup job" that removed images
from the buckets.

On Tue, 30 Mar 2021 at 14:41, Stefan Kooman wrote:

> On 3/30/21 12:55 PM, Boris Behrens wrote:
> > I just move one PG away from the OSD, but the diskspace will not get
> freed.
>
> How did you move? I would suggest you use upmap:
>
> ceph osd pg-upmap-items
> Invalid command: missing required parameter pgid(<pgid>)
> osd pg-upmap-items <pgid> <id|osd.id> [<id|osd.id>...] :  set pg_upmap_items
> mapping <pgid>:{<id> to <id>, [...]} (developers only)
>
>
> So you specify which PG has to move to which OSD.
>
> > Do I need to do something to clean obsolete objects from the osd?
>
> No. The OSD will trim PG data that is not needed anymore.
>
> Gr. Stefan
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)

2021-03-30 Thread Sasha Litvak
Any time frame on 14.2.19?

On Fri, Mar 26, 2021, 1:43 AM Konstantin Shalygin  wrote:

> Finally master is merged now
>
>
> k
>
> Sent from my iPhone
>
> > On 25 Mar 2021, at 23:09, Simon Oosthoek 
> wrote:
> >
> > I'll wait a bit before upgrading the remaining nodes. I hope 14.2.19
> will be available quickly.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
I would think due to splitting, because the balancer refuses to do its
work when there are too many misplaced objects.
I am also thinking of turning it off for now, so it doesn't begin its work at 5%
misplaced objects.

Would adding more hardware help? We wanted to insert another OSD node with
7x8TB disks anyway, but postponed it due to the rebalancing.

On Tue, 30 Mar 2021 at 14:23, Dan van der Ster <
d...@vanderster.com> wrote:

> Are those PGs backfilling due to splitting or due to balancing?
> If it's the former, I don't think there's a way to pause them with
> upmap or any other trick.
>
> -- dan
>
> On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
> >
> > One week later the ceph is still balancing.
> > What worries me like hell is the %USE on a lot of those OSDs. Does ceph
> > resolv this on it's own? We are currently down to 5TB space in the
> cluster.
> > Rebalancing single OSDs doesn't work well and it increases the
> "missplaced
> > objects".
> >
> > I thought about letting upmap do some rebalancing. Anyone know if this
> is a
> > good idea? Or if I should bite my nails an wait as I am the headache of
> my
> > life.
> > [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
> > --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
> > got osdmap epoch 321975
> > osdmaptool: osdmap file 'om'
> > writing upmap command output to: out.txt
> > checking for upmap cleanups
> > upmap, max-count 10, max deviation 5
> >  limiting to pools eu-central-1.rgw.buckets.data ([11])
> > pools eu-central-1.rgw.buckets.data
> > prepared 10/10 changes
> > ceph osd rm-pg-upmap-items 11.209
> > ceph osd rm-pg-upmap-items 11.253
> > ceph osd pg-upmap-items 11.7f 79 88
> > ceph osd pg-upmap-items 11.fc 53 31 105 78
> > ceph osd pg-upmap-items 11.1d8 84 50
> > ceph osd pg-upmap-items 11.47f 94 86
> > ceph osd pg-upmap-items 11.49c 44 71
> > ceph osd pg-upmap-items 11.553 74 50
> > ceph osd pg-upmap-items 11.6c3 66 63
> > ceph osd pg-upmap-items 11.7ad 43 50
> >
> > ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATA OMAP META
> >  AVAIL%USE  VAR  PGS STATUS TYPE NAME
> >  -1   795.42548- 795 TiB 626 TiB  587 TiB   82 GiB 1.4 TiB
> 170
> > TiB 78.64 1.00   -root default
> >  56   hdd   7.32619  1.0 7.3 TiB 6.4 TiB  6.4 TiB  684 MiB  16 GiB
> 910
> > GiB 87.87 1.12 129 up osd.56
> >  67   hdd   7.27739  1.0 7.3 TiB 6.4 TiB  6.4 TiB  582 MiB  16 GiB
> 865
> > GiB 88.40 1.12 115 up osd.67
> >  79   hdd   3.63689  1.0 3.6 TiB 3.2 TiB  432 GiB  1.9 GiB 0 B
> 432
> > GiB 88.40 1.12  63 up osd.79
> >  53   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.4 TiB  971 MiB  22 GiB
> 864
> > GiB 88.48 1.13 114 up osd.53
> >  51   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.4 TiB  734 MiB  15 GiB
> 837
> > GiB 88.77 1.13 120 up osd.51
> >  73   hdd  14.55269  1.0  15 TiB  13 TiB   13 TiB  1.8 GiB  39 GiB
> 1.6
> > TiB 88.97 1.13 246 up osd.73
> >  55   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.5 TiB  259 MiB  15 GiB
> 825
> > GiB 89.01 1.13 118 up osd.55
> >  70   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.5 TiB  291 MiB  16 GiB
> 787
> > GiB 89.44 1.14 119 up osd.70
> >  42   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  685 MiB 8.2 GiB
> 374
> > GiB 90.23 1.15  60 up osd.42
> >  94   hdd   3.63869  1.0 3.6 TiB 3.3 TiB  3.3 TiB  132 MiB 7.7 GiB
> 345
> > GiB 90.75 1.15  64 up osd.94
> >  25   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  3.2 MiB 8.1 GiB
> 352
> > GiB 90.79 1.15  53 up osd.25
> >  31   hdd   7.32619  1.0 7.3 TiB 6.7 TiB  6.6 TiB  223 MiB  15 GiB
> 690
> > GiB 90.80 1.15 117 up osd.31
> >  84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB  6.6 TiB  159 MiB  16 GiB
> 699
> > GiB 90.93 1.16 121 up osd.84
> >  82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB  332 GiB  1.0 GiB 0 B
> 332
> > GiB 91.08 1.16  59 up osd.82
> >  89   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  400 MiB  15 GiB
> 670
> > GiB 91.29 1.16 126 up osd.89
> >  33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  382 MiB 8.6 GiB
> 327
> > GiB 91.46 1.16  66 up osd.33
> >  90   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  338 MiB  15 GiB
> 658
> > GiB 91.46 1.16 112 up osd.90
> > 105   hdd   3.63869  0.8 3.6 TiB 3.3 TiB  3.3 TiB  206 MiB 8.1 GiB
> 301
> > GiB 91.91 1.17  56 up osd.105
> >  66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB  6.7 TiB  322 MiB  16 GiB
> 548
> > GiB 92.64 1.18 121 up osd.66
> >  46   hdd   7.27739  1.0 7.3 TiB 6.8 TiB  6.7 TiB  316 MiB  16 GiB
> 536
> > GiB 92.81 1.18 119 up osd.46
> >
> > Am Di., 23. März 2021 um 19:59 Uhr schrieb Boris Behrens :
> >
> > > Good point. Thanks for the hint. I changed it for all OSDs from 5 to 1
> > > *crossing finger*
> > >
> > > Am Di., 23. März 2021 um 19:45 Uhr schri

[ceph-users] Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Thomas Hukkelberg
Hi all!

We run a 1.5 PB cluster with 12 hosts and 192 OSDs (a mix of NVMe and HDD) and need 
to improve our failure domain by altering the crush rules and moving the racks into 
pods, which would imply a lot of data movement.

I wonder what the preferred order of operations would be when making such 
changes to the crush map and pools. Is data movement minimized by moving all 
racks into pods at once and then changing the pool replication rules, or is the 
better approach to first move the racks one by one into pods and then change the 
pool replication rules from rack to pod? Anyhow, I guess it's good practice to set 
'norebalance' before moving hosts and to unset it to start the actual data movement?

Right now we have the following setup:

root -> rack2 -> ups1 + node51 + node57 + switch21
root -> rack3 -> ups2 + node52 + node58 + switch22
root -> rack4 -> ups3 + node53 + node59 + switch23
root -> rack5 -> ups4 + node54 + node60 -- switch 21 ^^
root -> rack6 -> ups5 + node55 + node61 -- switch 22 ^^
root -> rack7 -> ups6 + node56 + node62 -- switch 23 ^^

Note that racks 5-7 are connected to the same ToR switches as racks 2-4. The cluster 
and frontend networks are in different VXLANs connected with dual 40GbE. The failure 
domain for the 3x replicated pools is currently the rack, and after adding hosts 
57-62 we realized that if one of the switches reboots or fails, replicated PGs 
located only on those 4 hosts will be unavailable and force pools offline. I 
guess the best way would instead be to organize the racks into pods like this:

root -> pod1 -> rack2 -> ups1 + node51 + node57
root -> pod1 -> rack5 -> ups4 + node54 + node60 -> switch21
root -> pod2 -> rack3 -> ups2 + node52 + node58
root -> pod2 -> rack6 -> ups5 + node55 + node61 -> switch 22
root -> pod3 -> rack4 -> ups3 + node53 + node59
root -> pod3 -> rack7 -> ups6 + node56 + node62 -> switch 23

The reason for this arrangement is that we plan to place the pods in different 
buildings in the future. We're running Nautilus 14.2.16 and are about to 
upgrade to Octopus. Should we upgrade to Octopus before making the crush changes? 

Any thoughts or insight on how to achieve this with minimal data movement and 
risk of cluster downtime would be welcome!
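
In case it helps, a hedged sketch of the bucket moves under the layout above
('pod' is one of the default CRUSH bucket types; the rule name and <pool> are
placeholders):

ceph osd set norebalance
# create the pod buckets and hang them under the root
ceph osd crush add-bucket pod1 pod
ceph osd crush move pod1 root=default
# move the racks into their pods (repeat for pod2/pod3 and the other racks)
ceph osd crush move rack2 pod=pod1
ceph osd crush move rack5 pod=pod1
# new replicated rule with pod as the failure domain
ceph osd crush rule create-replicated replicated_pod default pod
ceph osd pool set <pool> crush_rule replicated_pod
# data starts moving once the flag is cleared
ceph osd unset norebalance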


--thomas

--
Thomas Hukkelberg
tho...@hovedkvarteret.no
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke

Hello,

In the meantime, Ceph is running again normally, except for the two OSDs that 
are down because of the failed disks.


What really helped in my situation was to lower min_size from 5 (k+1) 
to 4 in my 4+2 erasure code setup. So I am also grateful to the 
programmer who put the helpful hint in ceph health detail for this 
situation.
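
In command form, the temporary change was essentially:

# allow I/O and recovery with only k=4 shards present -- temporary!
ceph osd pool set pxa-ec min_size 4
# ... wait for recovery to finish, then restore the safer value
ceph osd pool set pxa-ec min_size 5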


Thanks very much to everyone who answered my request to help out.

What is left now is to replace the disks and then bring the two osds up 
again.


Have a nice day
Rainer

On 30.03.21 at 13:32, Burkhard Linke wrote:

Hi,

On 30.03.21 13:05, Rainer Krienke wrote:

Hello,

yes your assumptions are correct pxa-rbd ist the metadata pool for 
pxa-ec which uses a erasure coding 4+2 profile.


In the last hours ceph repaired most of the damage. One inactive PG 
remained and in ceph health detail then told me:


-
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 
15 daemons have recently crashed; 150 slow ops, oldest one blocked for 
26716 sec, daemons [osd.60,osd.67] have slow ops.

PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 36.15b is remapped+incomplete, acting 
[60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size 
from 5 may help; search ceph.com/docs for 'incomplete')



*snipsnap*

2147483647 is 0x7fffffff (CRUSH_ITEM_NONE, the largest signed 32-bit value), 
which means no associated OSD. So this PG does not have six independent OSDs, 
and no backfilling is happening since there are no targets to backfill.



You mentioned 9 hosts, so if you use a simple host based crush rule ceph 
should be able to find new OSDs for that PG. If you do not use standard 
crush rules please check that ceph is able to derive enough OSDs to 
satisfy the PG requirements (six different OSDs).



The 'incomplete' part might be a problem. If just a chunk would be 
missing, the state should be undersized, not incomplete...



Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 
1001312

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
Are those PGs backfilling due to splitting or due to balancing?
If it's the former, I don't think there's a way to pause them with
upmap or any other trick.

-- dan

On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens  wrote:
>
> One week later the ceph is still balancing.
> What worries me like hell is the %USE on a lot of those OSDs. Does ceph
> resolv this on it's own? We are currently down to 5TB space in the cluster.
> Rebalancing single OSDs doesn't work well and it increases the "missplaced
> objects".
>
> I thought about letting upmap do some rebalancing. Anyone know if this is a
> good idea? Or if I should bite my nails an wait as I am the headache of my
> life.
> [root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
> --upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
> got osdmap epoch 321975
> osdmaptool: osdmap file 'om'
> writing upmap command output to: out.txt
> checking for upmap cleanups
> upmap, max-count 10, max deviation 5
>  limiting to pools eu-central-1.rgw.buckets.data ([11])
> pools eu-central-1.rgw.buckets.data
> prepared 10/10 changes
> ceph osd rm-pg-upmap-items 11.209
> ceph osd rm-pg-upmap-items 11.253
> ceph osd pg-upmap-items 11.7f 79 88
> ceph osd pg-upmap-items 11.fc 53 31 105 78
> ceph osd pg-upmap-items 11.1d8 84 50
> ceph osd pg-upmap-items 11.47f 94 86
> ceph osd pg-upmap-items 11.49c 44 71
> ceph osd pg-upmap-items 11.553 74 50
> ceph osd pg-upmap-items 11.6c3 66 63
> ceph osd pg-upmap-items 11.7ad 43 50
>
> ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATA OMAP META
>  AVAIL%USE  VAR  PGS STATUS TYPE NAME
>  -1   795.42548- 795 TiB 626 TiB  587 TiB   82 GiB 1.4 TiB  170
> TiB 78.64 1.00   -root default
>  56   hdd   7.32619  1.0 7.3 TiB 6.4 TiB  6.4 TiB  684 MiB  16 GiB  910
> GiB 87.87 1.12 129 up osd.56
>  67   hdd   7.27739  1.0 7.3 TiB 6.4 TiB  6.4 TiB  582 MiB  16 GiB  865
> GiB 88.40 1.12 115 up osd.67
>  79   hdd   3.63689  1.0 3.6 TiB 3.2 TiB  432 GiB  1.9 GiB 0 B  432
> GiB 88.40 1.12  63 up osd.79
>  53   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.4 TiB  971 MiB  22 GiB  864
> GiB 88.48 1.13 114 up osd.53
>  51   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.4 TiB  734 MiB  15 GiB  837
> GiB 88.77 1.13 120 up osd.51
>  73   hdd  14.55269  1.0  15 TiB  13 TiB   13 TiB  1.8 GiB  39 GiB  1.6
> TiB 88.97 1.13 246 up osd.73
>  55   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.5 TiB  259 MiB  15 GiB  825
> GiB 89.01 1.13 118 up osd.55
>  70   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.5 TiB  291 MiB  16 GiB  787
> GiB 89.44 1.14 119 up osd.70
>  42   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  685 MiB 8.2 GiB  374
> GiB 90.23 1.15  60 up osd.42
>  94   hdd   3.63869  1.0 3.6 TiB 3.3 TiB  3.3 TiB  132 MiB 7.7 GiB  345
> GiB 90.75 1.15  64 up osd.94
>  25   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  3.2 MiB 8.1 GiB  352
> GiB 90.79 1.15  53 up osd.25
>  31   hdd   7.32619  1.0 7.3 TiB 6.7 TiB  6.6 TiB  223 MiB  15 GiB  690
> GiB 90.80 1.15 117 up osd.31
>  84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB  6.6 TiB  159 MiB  16 GiB  699
> GiB 90.93 1.16 121 up osd.84
>  82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB  332 GiB  1.0 GiB 0 B  332
> GiB 91.08 1.16  59 up osd.82
>  89   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  400 MiB  15 GiB  670
> GiB 91.29 1.16 126 up osd.89
>  33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  382 MiB 8.6 GiB  327
> GiB 91.46 1.16  66 up osd.33
>  90   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  338 MiB  15 GiB  658
> GiB 91.46 1.16 112 up osd.90
> 105   hdd   3.63869  0.8 3.6 TiB 3.3 TiB  3.3 TiB  206 MiB 8.1 GiB  301
> GiB 91.91 1.17  56 up osd.105
>  66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB  6.7 TiB  322 MiB  16 GiB  548
> GiB 92.64 1.18 121 up osd.66
>  46   hdd   7.27739  1.0 7.3 TiB 6.8 TiB  6.7 TiB  316 MiB  16 GiB  536
> GiB 92.81 1.18 119 up osd.46
>
> Am Di., 23. März 2021 um 19:59 Uhr schrieb Boris Behrens :
>
> > Good point. Thanks for the hint. I changed it for all OSDs from 5 to 1
> > *crossing finger*
> >
> > Am Di., 23. März 2021 um 19:45 Uhr schrieb Dan van der Ster <
> > d...@vanderster.com>:
> >
> >> I see. When splitting PGs, the OSDs will increase is used space
> >> temporarily to make room for the new PGs.
> >> When going from 1024->2048 PGs, that means that half of the objects from
> >> each PG will be copied to a new PG, and then the previous PGs will have
> >> those objects deleted.
> >>
> >> Make sure osd_max_backfills is set to 1, so that not too many PGs are
> >> moving concurrently.
> >>
> >>
> >>
> >> On Tue, Mar 23, 2021, 7:39 PM Boris Behrens  wrote:
> >>
> >>> Thank you.
> >>> Currently I do not have any full OSDs (all <90%) but I keep this in mind.
> >>>

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
One week later the cluster is still balancing.
What worries me like hell is the %USE on a lot of those OSDs. Does Ceph
resolve this on its own? We are currently down to 5 TB of free space in the cluster.
Rebalancing single OSDs doesn't work well and it increases the "misplaced
objects" count.

I thought about letting upmap do some rebalancing. Does anyone know if this is a
good idea? Or should I bite my nails and wait, as this is the headache of my
life.
[root@s3db1 ~]# ceph osd getmap -o om; osdmaptool om --upmap out.txt
--upmap-pool eu-central-1.rgw.buckets.data --upmap-max 10; cat out.txt
got osdmap epoch 321975
osdmaptool: osdmap file 'om'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 5
 limiting to pools eu-central-1.rgw.buckets.data ([11])
pools eu-central-1.rgw.buckets.data
prepared 10/10 changes
ceph osd rm-pg-upmap-items 11.209
ceph osd rm-pg-upmap-items 11.253
ceph osd pg-upmap-items 11.7f 79 88
ceph osd pg-upmap-items 11.fc 53 31 105 78
ceph osd pg-upmap-items 11.1d8 84 50
ceph osd pg-upmap-items 11.47f 94 86
ceph osd pg-upmap-items 11.49c 44 71
ceph osd pg-upmap-items 11.553 74 50
ceph osd pg-upmap-items 11.6c3 66 63
ceph osd pg-upmap-items 11.7ad 43 50

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1   795.42548- 795 TiB 626 TiB  587 TiB   82 GiB 1.4 TiB  170
TiB 78.64 1.00   -root default
 56   hdd   7.32619  1.0 7.3 TiB 6.4 TiB  6.4 TiB  684 MiB  16 GiB  910
GiB 87.87 1.12 129 up osd.56
 67   hdd   7.27739  1.0 7.3 TiB 6.4 TiB  6.4 TiB  582 MiB  16 GiB  865
GiB 88.40 1.12 115 up osd.67
 79   hdd   3.63689  1.0 3.6 TiB 3.2 TiB  432 GiB  1.9 GiB 0 B  432
GiB 88.40 1.12  63 up osd.79
 53   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.4 TiB  971 MiB  22 GiB  864
GiB 88.48 1.13 114 up osd.53
 51   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.4 TiB  734 MiB  15 GiB  837
GiB 88.77 1.13 120 up osd.51
 73   hdd  14.55269  1.0  15 TiB  13 TiB   13 TiB  1.8 GiB  39 GiB  1.6
TiB 88.97 1.13 246 up osd.73
 55   hdd   7.32619  1.0 7.3 TiB 6.5 TiB  6.5 TiB  259 MiB  15 GiB  825
GiB 89.01 1.13 118 up osd.55
 70   hdd   7.27739  1.0 7.3 TiB 6.5 TiB  6.5 TiB  291 MiB  16 GiB  787
GiB 89.44 1.14 119 up osd.70
 42   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  685 MiB 8.2 GiB  374
GiB 90.23 1.15  60 up osd.42
 94   hdd   3.63869  1.0 3.6 TiB 3.3 TiB  3.3 TiB  132 MiB 7.7 GiB  345
GiB 90.75 1.15  64 up osd.94
 25   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  3.2 MiB 8.1 GiB  352
GiB 90.79 1.15  53 up osd.25
 31   hdd   7.32619  1.0 7.3 TiB 6.7 TiB  6.6 TiB  223 MiB  15 GiB  690
GiB 90.80 1.15 117 up osd.31
 84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB  6.6 TiB  159 MiB  16 GiB  699
GiB 90.93 1.16 121 up osd.84
 82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB  332 GiB  1.0 GiB 0 B  332
GiB 91.08 1.16  59 up osd.82
 89   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  400 MiB  15 GiB  670
GiB 91.29 1.16 126 up osd.89
 33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB  3.3 TiB  382 MiB 8.6 GiB  327
GiB 91.46 1.16  66 up osd.33
 90   hdd   7.52150  1.0 7.5 TiB 6.9 TiB  6.6 TiB  338 MiB  15 GiB  658
GiB 91.46 1.16 112 up osd.90
105   hdd   3.63869  0.8 3.6 TiB 3.3 TiB  3.3 TiB  206 MiB 8.1 GiB  301
GiB 91.91 1.17  56 up osd.105
 66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB  6.7 TiB  322 MiB  16 GiB  548
GiB 92.64 1.18 121 up osd.66
 46   hdd   7.27739  1.0 7.3 TiB 6.8 TiB  6.7 TiB  316 MiB  16 GiB  536
GiB 92.81 1.18 119 up osd.46

On Tue, 23 Mar 2021 at 19:59, Boris Behrens wrote:

> Good point. Thanks for the hint. I changed it for all OSDs from 5 to 1
> *crossing finger*
>
> Am Di., 23. März 2021 um 19:45 Uhr schrieb Dan van der Ster <
> d...@vanderster.com>:
>
>> I see. When splitting PGs, the OSDs will increase is used space
>> temporarily to make room for the new PGs.
>> When going from 1024->2048 PGs, that means that half of the objects from
>> each PG will be copied to a new PG, and then the previous PGs will have
>> those objects deleted.
>>
>> Make sure osd_max_backfills is set to 1, so that not too many PGs are
>> moving concurrently.
>>
>>
>>
>> On Tue, Mar 23, 2021, 7:39 PM Boris Behrens  wrote:
>>
>>> Thank you.
>>> Currently I do not have any full OSDs (all <90%) but I keep this in mind.
>>> What worries me is the ever increasing %USE metric (it went up from
>>> around 72% to 75% in three hours). It looks like there is comming a lot of
>>> data (there comes barely new data at the moment), but I think this might
>>> have to do with my "let's try to increase the PGs to 2048". I hope that
>>> ceph begins to split the old PGs into new ones and removes the old PGs.
>>>
>>> ID  CLASS WEIGHTREWEIGHT SI

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Burkhard Linke

Hi,

On 30.03.21 13:05, Rainer Krienke wrote:

Hello,

yes your assumptions are correct pxa-rbd ist the metadata pool for 
pxa-ec which uses a erasure coding 4+2 profile.


In the last hours ceph repaired most of the damage. One inactive PG 
remained and in ceph health detail then told me:


-
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 
15 daemons have recently crashed; 150 slow ops, oldest one blocked for 
26716 sec, daemons [osd.60,osd.67] have slow ops.

PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 36.15b is remapped+incomplete, acting 
[60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size 
from 5 may help; search ceph.com/docs for 'incomplete')



*snipsnap*

2147483647 is 0x7fffffff (CRUSH_ITEM_NONE, the largest signed 32-bit value), 
which means no associated OSD. So this PG does not have six independent OSDs, 
and no backfilling is happening since there are no targets to backfill.



You mentioned 9 hosts, so if you use a simple host based crush rule ceph 
should be able to find new OSDs for that PG. If you do not use standard 
crush rules please check that ceph is able to derive enough OSDs to 
satisfy the PG requirements (six different OSDs).
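
One way to verify that (a sketch; the file name is arbitrary, rule 7 and
k+m=6 are taken from the pool definition earlier in the thread):

ceph osd getcrushmap -o crushmap.bin
# any line printed by --show-bad-mappings means CRUSH could not find 6 OSDs
crushtool -i crushmap.bin --test --rule 7 --num-rep 6 --show-bad-mappings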



The 'incomplete' part might be a problem. If just a chunk would be 
missing, the state should be undersized, not incomplete...



Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke

Hello Frank,

the option is actually set. On one of my monitors:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep 
osd_allow_recovery_below_min_size

"osd_allow_recovery_below_min_size": "true",

Thank you very much
Rainer

On 30.03.21 at 13:20, Frank Schilder wrote:

Hi, this is odd. The problem with recovery when sufficiently many but less than 
min_size shards are present should have been resolved with 
osd_allow_recovery_below_min_size=true. It is really dangerous to reduce 
min_size below k+1 and, in fact, should never be necessary for recovery. Can 
you check if this option is present and set to true? If it is not working as 
intended, a tracker ticket might be in order.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14



 --
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 
1001312

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke

Hello,

Yes, your assumptions are correct: pxa-rbd is the metadata pool for 
pxa-ec, which uses an erasure coding 4+2 profile.


In the last hours Ceph repaired most of the damage. One inactive PG 
remained, and ceph health detail then told me:


-
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 
15 daemons have recently crashed; 150 slow ops, oldest one blocked for 
26716 sec, daemons [osd.60,osd.67] have slow ops.

PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
pg 36.15b is remapped+incomplete, acting 
[60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size from 
5 may help; search ceph.com/docs for 'incomplete')

RECENT_CRASH 15 daemons have recently crashed
osd.90 crashed on host ceph6 at 2021-03-29 21:14:10.442314Z
osd.67 crashed on host ceph5 at 2021-03-30 02:21:23.944205Z
osd.67 crashed on host ceph5 at 2021-03-30 01:39:14.452610Z
osd.90 crashed on host ceph6 at 2021-03-29 21:14:24.23Z
osd.67 crashed on host ceph5 at 2021-03-30 02:35:43.373845Z
osd.67 crashed on host ceph5 at 2021-03-30 01:19:58.762393Z
osd.67 crashed on host ceph5 at 2021-03-30 02:09:42.297941Z
osd.67 crashed on host ceph5 at 2021-03-30 02:28:29.981528Z
osd.67 crashed on host ceph5 at 2021-03-30 01:50:05.374278Z
osd.90 crashed on host ceph6 at 2021-03-29 21:13:51.896849Z
osd.67 crashed on host ceph5 at 2021-03-30 02:00:22.593745Z
osd.67 crashed on host ceph5 at 2021-03-30 01:29:39.170134Z
osd.90 crashed on host ceph6 at 2021-03-29 21:14:38.114768Z
osd.67 crashed on host ceph5 at 2021-03-30 00:54:06.629808Z
osd.67 crashed on host ceph5 at 2021-03-30 01:10:21.824447Z
-

All OSDs except for 67 and 90 are up, and I followed the hint in health 
detail and lowered min_size from 5 to 4 for pxa-ec. Since then Ceph is 
repairing again, and in the meantime some VMs in the attached Proxmox cluster 
are working again.


So I hope that after the repair all PGs will be up, so that I can restart all 
VMs again.


Thanks
Rainer

On 30.03.21 at 11:41, Eugen Block wrote:

Hi,

from what you've sent my conclusion about the stalled I/O would be 
indeed the min_size of the EC pool.
There's only one PG reported as incomplete, I assume that is the EC 
pool, not the replicated pxa-rbd, right? Both pools are for rbd so I'm 
guessing the rbd headers are in pxa-rbd while the data is stored in 
pxa-ec, could you confirm that?


You could add 'ceph health detail' output to your question to see which 
PG is incomplete.
I assume that both down OSDs are in the acting set of the inactive PG, 
and since the pool's min_size is 5 the I/O pauses. If you can't wait for 
recovery to finish and can't bring up at least one of those OSDs you 
could set the min_size of pxa-ec to 4, but if you do, be aware that one 
more disk failure could mean data loss! So think carefully about it 
(maybe you could instead speed up recovery?) and don't forget to 
increase min_size back to 5 when the recovery has finished, that's very 
important!


Regards,
Eugen


Zitat von Rainer Krienke :


Hello,

i run a ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we 
lost two disks, so two OSDs (67,90) are down. The two disks are on two 
different hosts. A third ODS on a third host repotrts slow ops. ceph 
is repairing at the moment.


Pools affected are eg these ones:
 pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor 
0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0 
pg_num_min 128 target_size_ratio 0.0001 application rbd


pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 
0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps 
stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd


At the mmoment the proxmox-cluster using storage from the seperate 
ceph cluster hangs. The ppols with date are erasure coded with the 
following profile:


crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

What I do not understand is why access on the virtualization seem to 
block. Could that be related to min_size of the pools cause this 
behaviour? How can I find out if this is true or what else is causing 
the blocking behaviour seen?


This is the current status:
    health: HEALTH_WARN
    Reduced data availability: 1 pg inactive, 1 pg incomplete
    Degraded data redundancy: 42384/130014984 objects degraded 
(0.033%), 4 pgs degraded, 5 pgs undersized

    15 daemons have recently crashed
    150 slow ops, oldest one blocked for 15901 sec, daemons 
[osd.60,osd.67] have slow ops.


  services:
    mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
    mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
    mds: cephfsrz:1 {0=ceph

[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Boris Behrens
I just moved one PG away from the OSD, but the disk space does not get freed.
Do I need to do something to clean obsolete objects from the OSD?

On Tue, 30 Mar 2021 at 11:47, Boris Behrens wrote:

> Hi,
> I have a couple OSDs that currently get a lot of data, and are running
> towards 95% fillrate.
>
> I would like to forcefully remap some PGs (they are around 100GB) to more
> empty OSDs and drop them from the full OSDs. I know this would lead to
> degraded objects, but I am not sure how long the cluster will stay in a
> state where it can allocate objects.
>
> OSD.105 grew from around 85% to 92% in the last 4 hours.
>
> This is the current state
>   cluster:
> id: dca79fff-ffd0-58f4-1cff-82a2feea05f4
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 9 backfillfull osd(s)
> 19 nearfull osd(s)
> 37 pool(s) backfillfull
> BlueFS spillover detected on 1 OSD(s)
> 13 large omap objects
> Low space hindering backfill (add storage if this doesn't
> resolve itself): 248 pgs backfill_toofull
> Degraded data redundancy: 18115/362288820 objects degraded
> (0.005%), 1 pg degraded, 1 pg undersized
>
>   services:
> mon: 3 daemons, quorum ceph-s3-mon1,ceph-s3-mon2,ceph-s3-mon3 (age 6d)
> mgr: ceph-mgr2(active, since 6d), standbys: ceph-mgr3, ceph-mgr1
> mds:  3 up:standby
> osd: 110 osds: 110 up (since 4d), 110 in (since 6d); 324 remapped pgs
>  flags noscrub,nodeep-scrub
> rgw: 4 daemons active (admin, eu-central-1, eu-msg-1, eu-secure-1)
>
>   task status:
>
>   data:
> pools:   37 pools, 4032 pgs
> objects: 120.76M objects, 197 TiB
> usage:   620 TiB used, 176 TiB / 795 TiB avail
> pgs: 18115/362288820 objects degraded (0.005%)
>  47144186/362288820 objects misplaced (13.013%)
>  3708 active+clean
>  241  active+remapped+backfill_wait+backfill_toofull
>  63   active+remapped+backfill_wait
>  11   active+remapped+backfilling
>  6active+remapped+backfill_toofull
>  1active+remapped+backfilling+forced_backfill
>  1active+remapped+forced_backfill+backfill_toofull
>  1active+undersized+degraded+remapped+backfilling
>
>   io:
> client:   23 MiB/s rd, 252 MiB/s wr, 347 op/s rd, 381 op/s wr
> recovery: 194 MiB/s, 112 objects/s
> ---
> ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATAOMAP META
>  AVAIL%USE  VAR  PGS STATUS TYPE NAME
>  -1   795.42548- 795 TiB 620 TiB 582 TiB   82 GiB 1.4 TiB  176
> TiB 77.90 1.00   -root default
>  84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.5 TiB  158 MiB  15 GiB  764
> GiB 90.07 1.16 121 up osd.84
>  79   hdd   3.63689  1.0 3.6 TiB 3.3 TiB 367 GiB  1.9 GiB 0 B  367
> GiB 90.15 1.16  64 up osd.79
>  70   hdd   7.27739  1.0 7.3 TiB 6.6 TiB 6.5 TiB  268 MiB  15 GiB  730
> GiB 90.20 1.16 121 up osd.70
>  82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB 364 GiB  1.1 GiB 0 B  364
> GiB 90.23 1.16  59 up osd.82
>  89   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.6 TiB  395 MiB  16 GiB  735
> GiB 90.45 1.16 126 up osd.89
>  90   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.6 TiB  338 MiB  15 GiB  723
> GiB 90.62 1.16 112 up osd.90
>  33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB 3.3 TiB  382 MiB 8.6 GiB  358
> GiB 90.64 1.16  66 up osd.33
>  66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB 6.7 TiB  313 MiB  16 GiB  605
> GiB 91.88 1.18 122 up osd.66
>  46   hdd   7.27739  1.0 7.3 TiB 6.7 TiB 6.7 TiB  312 MiB  16 GiB  601
> GiB 91.93 1.18 119 up osd.46
> 105   hdd   3.63869  0.8 3.6 TiB 3.4 TiB 3.4 TiB  206 MiB 8.1 GiB  281
> GiB 92.45 1.19  58 up osd.105
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph User Survey Working Group - Next Steps

2021-03-30 Thread Mike Perez
Hi everyone,

I didn't get enough responses on the previous Doodle to schedule a
meeting. I'm wondering if people are OK with the previous PDF I
released or if there's interest in the community to develop better
survey results?

https://ceph.io/community/ceph-user-survey-2019/

On Mon, Mar 22, 2021 at 7:39 AM Mike Perez  wrote:
>
> Hi everyone,
>
> We are approaching the April 2nd deadline in two weeks, so we should
> start proposing the next meeting to plan the survey results.
>
> Anybody in the community is welcome to join the Ceph Working Groups.
> Please add your name to:
> https://ceph.io/user-survey/
>
> I have started a doodle:
> https://doodle.com/poll/y3t2ttdt8a3egz4v?utm_source=poll&utm_medium=link
>
> Please help promote the User Survey:
> https://twitter.com/Ceph/status/1369589099716349956
>
> --
> Mike Perez (thingee)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] forceful remap PGs

2021-03-30 Thread Boris Behrens
Hi,
I have a couple of OSDs that currently get a lot of data and are running
towards a 95% fill rate.

I would like to forcefully remap some PGs (they are around 100 GB each) to
emptier OSDs and drop them from the full OSDs. I know this would lead to
degraded objects, but I am not sure how long the cluster will stay in a
state where it can allocate objects.

OSD.105 grew from around 85% to 92% in the last 4 hours.
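
A sketch of the kind of move I mean (the PG ID and target OSD are made up):

# which PGs sit on the fullest OSD, and how big they are
ceph pg ls-by-osd 105
# map one of them to an emptier OSD instead (hypothetical PG 11.3a1 -> osd.12)
ceph osd pg-upmap-items 11.3a1 105 12
# check the full/backfillfull/nearfull ratios the cluster enforces
ceph osd dump | grep full_ratio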

This is the current state
  cluster:
id: dca79fff-ffd0-58f4-1cff-82a2feea05f4
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
9 backfillfull osd(s)
19 nearfull osd(s)
37 pool(s) backfillfull
BlueFS spillover detected on 1 OSD(s)
13 large omap objects
Low space hindering backfill (add storage if this doesn't
resolve itself): 248 pgs backfill_toofull
Degraded data redundancy: 18115/362288820 objects degraded
(0.005%), 1 pg degraded, 1 pg undersized

  services:
mon: 3 daemons, quorum ceph-s3-mon1,ceph-s3-mon2,ceph-s3-mon3 (age 6d)
mgr: ceph-mgr2(active, since 6d), standbys: ceph-mgr3, ceph-mgr1
mds:  3 up:standby
osd: 110 osds: 110 up (since 4d), 110 in (since 6d); 324 remapped pgs
 flags noscrub,nodeep-scrub
rgw: 4 daemons active (admin, eu-central-1, eu-msg-1, eu-secure-1)

  task status:

  data:
pools:   37 pools, 4032 pgs
objects: 120.76M objects, 197 TiB
usage:   620 TiB used, 176 TiB / 795 TiB avail
pgs: 18115/362288820 objects degraded (0.005%)
 47144186/362288820 objects misplaced (13.013%)
 3708 active+clean
 241  active+remapped+backfill_wait+backfill_toofull
 63   active+remapped+backfill_wait
 11   active+remapped+backfilling
 6active+remapped+backfill_toofull
 1active+remapped+backfilling+forced_backfill
 1active+remapped+forced_backfill+backfill_toofull
 1active+undersized+degraded+remapped+backfilling

  io:
client:   23 MiB/s rd, 252 MiB/s wr, 347 op/s rd, 381 op/s wr
recovery: 194 MiB/s, 112 objects/s
---
ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1   795.42548- 795 TiB 620 TiB 582 TiB   82 GiB 1.4 TiB  176
TiB 77.90 1.00   -root default
 84   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.5 TiB  158 MiB  15 GiB  764
GiB 90.07 1.16 121 up osd.84
 79   hdd   3.63689  1.0 3.6 TiB 3.3 TiB 367 GiB  1.9 GiB 0 B  367
GiB 90.15 1.16  64 up osd.79
 70   hdd   7.27739  1.0 7.3 TiB 6.6 TiB 6.5 TiB  268 MiB  15 GiB  730
GiB 90.20 1.16 121 up osd.70
 82   hdd   3.63689  1.0 3.6 TiB 3.3 TiB 364 GiB  1.1 GiB 0 B  364
GiB 90.23 1.16  59 up osd.82
 89   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.6 TiB  395 MiB  16 GiB  735
GiB 90.45 1.16 126 up osd.89
 90   hdd   7.52150  1.0 7.5 TiB 6.8 TiB 6.6 TiB  338 MiB  15 GiB  723
GiB 90.62 1.16 112 up osd.90
 33   hdd   3.73630  1.0 3.7 TiB 3.4 TiB 3.3 TiB  382 MiB 8.6 GiB  358
GiB 90.64 1.16  66 up osd.33
 66   hdd   7.27739  0.95000 7.3 TiB 6.7 TiB 6.7 TiB  313 MiB  16 GiB  605
GiB 91.88 1.18 122 up osd.66
 46   hdd   7.27739  1.0 7.3 TiB 6.7 TiB 6.7 TiB  312 MiB  16 GiB  601
GiB 91.93 1.18 119 up osd.46
105   hdd   3.63869  0.8 3.6 TiB 3.4 TiB 3.4 TiB  206 MiB 8.1 GiB  281
GiB 92.45 1.19  58 up osd.105

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Eugen Block

Hi,

from what you've sent my conclusion about the stalled I/O would be  
indeed the min_size of the EC pool.
There's only one PG reported as incomplete, I assume that is the EC  
pool, not the replicated pxa-rbd, right? Both pools are for rbd so I'm  
guessing the rbd headers are in pxa-rbd while the data is stored in  
pxa-ec, could you confirm that?


You could add 'ceph health detail' output to your question to see  
which PG is incomplete.
I assume that both down OSDs are in the acting set of the inactive PG,  
and since the pool's min_size is 5 the I/O pauses. If you can't wait  
for recovery to finish and can't bring up at least one of those OSDs  
you could set the min_size of pxa-ec to 4, but if you do, be aware  
that one more disk failure could mean data loss! So think carefully  
about it (maybe you could instead speed up recovery?) and don't forget  
to increase min_size back to 5 when the recovery has finished, that's  
very important!
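
If you go the "speed up recovery" route instead, a cautious sketch (values are
examples; watch client latency while they are raised):

ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 4
# or push the same values into the running daemons right away
ceph tell 'osd.*' injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'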


Regards,
Eugen


Zitat von Rainer Krienke :


Hello,

i run a ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night  
we lost two disks, so two OSDs (67,90) are down. The two disks are  
on two different hosts. A third ODS on a third host repotrts slow  
ops. ceph is repairing at the moment.


Pools affected are eg these ones:
 pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor  
0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0  
pg_num_min 128 target_size_ratio 0.0001 application rbd


pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash  
rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor  
0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps  
stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application  
rbd


At the mmoment the proxmox-cluster using storage from the seperate  
ceph cluster hangs. The ppols with date are erasure coded with the  
following profile:


crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

What I do not understand is why access on the virtualization seem to  
block. Could that be related to min_size of the pools cause this  
behaviour? How can I find out if this is true or what else is  
causing the blocking behaviour seen?


This is the current status:
health: HEALTH_WARN
Reduced data availability: 1 pg inactive, 1 pg incomplete
Degraded data redundancy: 42384/130014984 objects  
degraded (0.033%), 4 pgs degraded, 5 pgs undersized

15 daemons have recently crashed
150 slow ops, oldest one blocked for 15901 sec, daemons  
[osd.60,osd.67] have slow ops.


  services:
mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs

  task status:
scrub status:
mds.ceph6: idle

  data:
pools:   15 pools, 2632 pgs
objects: 21.70M objects, 80 TiB
usage:   139 TiB used, 378 TiB / 517 TiB avail
pgs: 0.038% pgs not active
 42384/130014984 objects degraded (0.033%)
 2623 active+clean
 3active+undersized+degraded+remapped+backfilling
 3active+clean+scrubbing+deep
 1active+undersized+degraded+remapped+backfill_wait
 1active+undersized+remapped+backfill_wait
 1remapped+incomplete

  io:
client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
recovery: 51 MiB/s, 12 objects/s

Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax:  
+49261287 1001312

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Resolving LARGE_OMAP_OBJECTS

2021-03-30 Thread Benoît Knecht
Hi David,

On Tuesday, March 30th, 2021 at 00:50, David Orman  wrote:
> Sure enough, it is more than 200,000, just as the alert indicates.
> However, why did it not reshard further? Here's the kicker - we only
> see this with versioned buckets/objects. I don't see anything in the
> documentation that indicates this is a known issue with sharding, but
> perhaps there is something going on with versioned buckets/objects. Is
> there any clarity here/suggestions on how to deal with this? It sounds
> like you expect this behavior with versioned buckets, so we must be
> missing something.

The issue with versioned buckets is that each object is associated with at 
least 4 index entries, with 2 additional index entries for each version of the 
object. Dynamic resharding is based on the number of objects, not the number of 
index entries, and it counts each version of an object as an object, so the 
biggest discrepancy between number of objects and index entries happens when 
there's only one version of each object (factor of 4), and it tends to a factor 
of two as the number of versions per object increases to infinity. But there's 
one more special case. When you delete a versioned object, it also creates two 
more index entries, but those are not taken into account by dynamic resharding. 
Therefore, the absolute worst case is when there was a single version of each 
object, and all the objects have been deleted. In that case, there are 6 index 
entries for each object counted by dynamic resharding, i.e. a factor of 6.
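
In other words, per object that dynamic resharding counts once, the worst case
works out to 4 index entries (single current version) plus 2 more written by
the delete (which resharding ignores), i.e. 6 index entries. With the default
`rgw_max_objs_per_shard=100000` that can be up to 600,000 keys in a single
index shard before resharding triggers, three times the default 200,000
large omap warning threshold.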

So one way to "solve" this issue is to set 
`osd_deep_scrub_large_omap_object_key_threshold=600000`, which (with the 
default `rgw_max_objs_per_shard=100000`) will guarantee that dynamic resharding 
will kick in before you get a large omap object warning even in the worst case 
scenario for versioned buckets. If you're not comfortable having that many keys 
per omap object, you could instead decrease `rgw_max_objs_per_shard`.
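
For illustration, assuming the options are managed through the MON config
store (Mimic or newer), the adjustment might look like:

  # 600000 = 6 worst-case index entries per counted object x 100000 objects per shard
  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 600000

  # alternative: make resharding kick in earlier (6 x 33000 stays below 200000)
  ceph config set global rgw_max_objs_per_shard 33000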

Cheers,

--
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke

Hello,

I run a Ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
lost two disks, so two OSDs (67, 90) are down. The two disks are on two
different hosts. A third OSD on a third host reports slow ops. Ceph is
repairing at the moment.


Pools affected are e.g. these:
 pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor 
0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0 
pg_num_min 128 target_size_ratio 0.0001 application rbd


pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 0/172580/172578 
flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
pg_num_min 512 target_size_ratio 0.15 application rbd


At the moment the Proxmox cluster using storage from the separate Ceph
cluster hangs. The pools with data are erasure coded with the following
profile:


crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

What I do not understand is why access from the virtualization cluster
seems to block. Could the min_size of the pools be causing this
behaviour? How can I find out whether this is the case, or what else is
causing the blocking behaviour I am seeing?
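
For reference, a minimal set of checks that should show whether min_size is
involved (just a sketch, using the pool names above):

  ceph health detail                  # which PG is incomplete and what it is waiting for
  ceph pg ls incomplete               # the incomplete PG and its up/acting sets
  ceph osd pool get pxa-ec min_size   # current min_size of the EC pool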


This is the current status:
health: HEALTH_WARN
Reduced data availability: 1 pg inactive, 1 pg incomplete
Degraded data redundancy: 42384/130014984 objects degraded 
(0.033%), 4 pgs degraded, 5 pgs undersized

15 daemons have recently crashed
150 slow ops, oldest one blocked for 15901 sec, daemons 
[osd.60,osd.67] have slow ops.


  services:
mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs

  task status:
scrub status:
mds.ceph6: idle

  data:
pools:   15 pools, 2632 pgs
objects: 21.70M objects, 80 TiB
usage:   139 TiB used, 378 TiB / 517 TiB avail
pgs: 0.038% pgs not active
 42384/130014984 objects degraded (0.033%)
 2623 active+clean
    3 active+undersized+degraded+remapped+backfilling
    3 active+clean+scrubbing+deep
    1 active+undersized+degraded+remapped+backfill_wait
    1 active+undersized+remapped+backfill_wait
    1 remapped+incomplete

  io:
client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
recovery: 51 MiB/s, 12 objects/s

Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 
1001312

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io