[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-09 Thread Adam King
Wanted to respond to the original thread I saw archived on this topic but I
wasn't subscribed to the mailing list yet so don't have the thread in my
inbox to reply to. Hopefully, those involved in that thread still see this.

This issue looks the same as https://tracker.ceph.com/issues/51027 which is
being worked on. Essentially, it seems that hosts being rebooted were
temporarily marked as offline, and cephadm had an issue where it would try
to remove all daemons (other than OSDs, I believe) from offline hosts.
The pre-remove step for a monitor is to remove it from the monmap, so that
part would happen, but the daemon itself would not be removed since the
host was temporarily inaccessible due to the reboot. When the host came
back up, the mon was restarted, but since it had already been removed from
the monmap it got stuck in a "stopped" state. A fix that stops cephadm from
trying to remove daemons from offline hosts is in the works.

A temporary workaround right now, as mentioned by Harry on that tracker, is
to get cephadm to actually remove the mon daemon: change the placement spec
so it no longer includes the host with the broken mon, wait until the mon
daemon has been removed, and finally put the placement spec back to how it
was so the mon gets redeployed (and now hopefully runs normally).
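
For example (a sketch -- host names are placeholders, adjust to your own mon
placement):

ceph orch ls mon --export                        # note the current placement
ceph orch apply mon --placement="node01,node02"  # temporarily drop the host with the broken mon
ceph orch ps --daemon-type mon                   # wait until the stale mon is gone
ceph orch apply mon --placement="node01,node02,node03"   # restore the original placement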
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] very low RBD and Cephfs performance

2021-08-09 Thread Prokopis Kitros
Hello 

I have a 4-node Ceph cluster on Azure. Each node is an E32s_v4 VM, which has
32 vCPUs and 256 GB of memory. The network between nodes is 15 Gbit/s,
measured with iperf.
The OS is CentOS 8.2. The Ceph version is Pacific, deployed with
ceph-ansible.

Three nodes have the OSDs and the fourth node acts as the RBD client.
In total there are 12 OSDs, four per node, with each disk capable of 5000 IOPS
for 4K writes.

I have one pool with 512 PGs and one RBD image. I am running the following fio
command and I get only 1433 IOPS:


fio --filename=/dev/rbd0 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=16 
--iodepth=8 --runtime=360 --time_based --group_reporting --name=4k-sync-write

4k-sync-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=psync, iodepth=8
...
fio-3.19
Starting 16 processes
Jobs: 16 (f=16): [W(16)][100.0%][w=5734KiB/s][w=1433 IOPS][eta 00m:00s]
4k-sync-write: (groupid=0, jobs=16): err= 0: pid=12427: Mon Aug  9 16:18:38 2021
  write: IOPS=1327, BW=5309KiB/s (5436kB/s)(1866MiB/360011msec); 0 zone resets
clat (msec): min=2, max=365, avg=12.04, stdev= 7.79
 lat (msec): min=2, max=365, avg=12.04, stdev= 7.79
clat percentiles (usec):
 |  1.00th=[ 3556],  5.00th=[ 4686], 10.00th=[ 5669], 20.00th=[ 6849],
 | 30.00th=[ 7767], 40.00th=[ 8717], 50.00th=[ 9896], 60.00th=[11338],
 | 70.00th=[13173], 80.00th=[15795], 90.00th=[20841], 95.00th=[26608],
 | 99.00th=[41157], 99.50th=[47449], 99.90th=[66323], 99.95th=[76022],
 | 99.99th=[96994]
   bw (  KiB/s): min= 1855, max=10240, per=100.00%, avg=5313.12, stdev=97.24, 
samples=11488
   iops: min=  463, max= 2560, avg=1324.30, stdev=24.33, samples=11488
  lat (msec)   : 4=2.42%, 10=48.56%, 20=37.90%, 50=10.73%, 100=0.38%
  lat (msec)   : 250=0.01%, 500=0.01%
  fsync/fdatasync/sync_file_range:
sync (nsec): min=1100, max=114600, avg=5610.37, stdev=3387.10
sync percentiles (nsec):
 |  1.00th=[ 2192],  5.00th=[ 3312], 10.00th=[ 3408], 20.00th=[ 3408],
 | 30.00th=[ 3504], 40.00th=[ 3600], 50.00th=[ 3888], 60.00th=[ 6816],
 | 70.00th=[ 7712], 80.00th=[ 7776], 90.00th=[ 7904], 95.00th=[ 9408],
 | 99.00th=[18304], 99.50th=[23936], 99.90th=[41216], 99.95th=[45824],
 | 99.99th=[61696]
  cpu  : usr=0.30%, sys=0.53%, ctx=477856, majf=0, minf=203
  IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=0,477811,0,477795 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=5309KiB/s (5436kB/s), 5309KiB/s-5309KiB/s (5436kB/s-5436kB/s), 
io=1866MiB (1957MB), run=360011-360011msec

Disk stats (read/write):
rbd0: ios=0/469238, merge=0/4868, ticks=0/5598109, in_queue=5363153, util=38.89%
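
For what it's worth, ioengine=psync is synchronous, so the iodepth=8 setting
has no effect here (the effective per-process queue depth is 1, which the
"IO depths: 1=200.0%" line hints at). A sketch of an asynchronous variant of
the same job for comparison -- dropping --fsync=1, since --direct=1 on the raw
device already bypasses the page cache (keep it if you specifically want
per-write cache flushes):

fio --filename=/dev/rbd0 --direct=1 --rw=write --bs=4k --numjobs=16 \
    --iodepth=8 --ioengine=libaio --runtime=360 --time_based \
    --group_reporting --name=4k-aio-write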

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-09 Thread David Orman
Hi,

We are seeing very similar behavior on 16.2.5, and also have noticed
that an undeploy/deploy cycle fixes things. Before we go rummaging
through the source code trying to determine the root cause, has
anybody else figured this out? It seems odd that a repeatable issue
impacting at least 16.2.4/16.2.5 on reboots (I've seen other mailing list
posts about this same issue) hasn't been addressed yet, so we wanted to
check.

Here's one of the other thread titles that appears related:
"[ceph-users] mons assigned via orch label 'committing suicide' upon
reboot."

Respectfully,
David


On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
 wrote:
>
> Hi guys,
>
> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
> it in production on a 1PB+ storage cluster with rgw-only access.
>
> I noticed a weird issue with my mons :
>
> - if I reboot a mon host, the ceph-mon container is not starting after
> reboot
>
> - I can see with 'ceph orch ps' the following output :
>
> mon.node01   node01   running (20h)   4m ago
> 20h   16.2.4 8d91d370c2b8  0a2e86af94b2
> mon.node02   node02   running (115m)  12s ago
> 115m  16.2.4 8d91d370c2b8  51f4885a1b06
> mon.node03   node03   stopped 4m ago
> 19h  
>
> (where node03 is the host which was rebooted).
>
> - I tried to start the mon container manually on node03 with '/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
> and I've got the following output :
>
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 3314933069573799936,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164)
> 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB
> data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1
> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map
> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out
> after 0.0s
> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0
> mon.node03@-1(probing) e5  my rank is now 1 (was -1)
> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing)
> e6  removed from monmap, suicide.
>
> root@node03:/home/adrian# systemctl status
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph
> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>   Loaded: loaded
> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service;
> enabled; vendor preset: enabled)
>   Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>  Process: 1176 ExecStart=/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run
> (code=exited, status=0/SUCCESS)
>  Process: 1855 ExecStop=/usr/bin/docker stop
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited,
> status=1/FAILURE)
>  Process: 1861 ExecStopPost=/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop
> (code=exited, status=0/SUCCESS)
> Main PID: 1176 (code=exited, status=0/SUCCESS)
>
> The only fix I could find was to redeploy the mon with :
>
> ceph orch daemon rm  mon.node03 --force
> ceph orch daemon add mon node03
>
> However, even if it's working after redeploy, it's not giving me a lot
> of trust to use it in a production environment having an issue like
> that.  I could reproduce it with 2 different mons so it's not just an
> exception.
>
> My setup is based on Ubuntu 20.04 and docker instead of podman :
>
> root@node01:~# docker -v
> Docker version 20.10.6, build 370c289
>
> Do you know a workaround for this issue or is this a known bug ? I
> noticed that there are some other complaints with the same behaviour in
> Octopus as well and the solution at that time was to delete the
> /var/lib/ceph/mon folder .
>
>
> Thanks.
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd object mapping

2021-08-09 Thread Tony Liu
Thank you Konstantin!
Tony

From: Konstantin Shalygin 
Sent: August 9, 2021 01:20 AM
To: Tony Liu
Cc: ceph-users; d...@ceph.io
Subject: Re: [ceph-users] rbd object mapping


On 8 Aug 2021, at 20:10, Tony Liu 
mailto:tonyliu0...@hotmail.com>> wrote:

That's what I thought. I am confused by this.

# ceph osd map vm fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
osdmap e18381 pool 'vm' (4) object 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk' 
-> pg 4.c7a78d40 (4.0) -> up ([4,17,6], p4) acting ([4,17,6], p4)

It calls RBD image "object" and it shows the whole image maps to a single PG,
while the image is actually split into many objects each of which maps to a PG.
How am I supposed to understand the output of this command?

You can execute `ceph osd map vm nonexist` and you will see a mapping for the
'nonexist' object as well -- the command only computes where an object with
that name would be placed, whether or not it exists.
To get the mappings for each object of your image, you need to find all of the
image's objects (via the block name prefix from the rbd header) and iterate
over that list.



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multiple cephfs MDS crashes with same assert_condition: state == LOCK_XLOCK || state == LOCK_XLOCKDONE

2021-08-09 Thread Thomas Hukkelberg
Hi

Today we suddenly experienced multiple MDS crashes during the day with an error
we have not seen before. We run Octopus 15.2.13 with 4 ranks, 4
standby-replay MDSes and 1 passive standby. Any input on how to troubleshoot or
resolve this would be most welcome.

---

root@hk-cephnode-54:~# ceph crash ls
2021-08-09T08:06:41.573899Z_306a9a10-b9d7-4a68-83a9-f5bd3d700fd7  
mds.hk-cephnode-58   
2021-08-09T08:09:03.132838Z_9a62b1fc-6069-4576-974d-2e0464169bb5  
mds.hk-cephnode-62   
2021-08-09T11:20:23.776776Z_5a665d00-9862-4d8f-99b5-323cdf441966  
mds.hk-cephnode-54   
2021-08-09T11:25:14.213601Z_f47fa398-5582-4da6-8e18-9252bbb52805  
mds.hk-cephnode-62   
2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31  
mds.hk-cephnode-60   

---

*All the crashlogs have the same assert_condition/file/msg*

root@hk-cephnode-54:~# ceph crash info 
2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31
{
"archived": "2021-08-09 12:53:01.429088",
"assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
"assert_file": "/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h",
"assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
"assert_line": 59,
"assert_msg": "/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h: In function 
'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f0f76853700 time 
2021-08-09T14:44:34.185861+0200\n/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h:
 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)\n",
"assert_thread_name": "MR_Finisher",
"backtrace": [
"(()+0x12730) [0x7f0f8153d730]",
"(gsignal()+0x10b) [0x7f0f80e027bb]",
"(abort()+0x121) [0x7f0f80ded535]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a5) [0x7f0f81f1d0f5]",
"(()+0x28127c) [0x7f0f81f1d27c]",
"(MDCache::truncate_inode(CInode*, LogSegment*)+0x305) 
[0x55ed3b243aa5]",
"(C_MDS_inode_update_finish::finish(int)+0x14c) [0x55ed3b219dec]",
"(MDSContext::complete(int)+0x52) [0x55ed3b4156d2]",
"(MDSIOContextBase::complete(int)+0x9f) [0x55ed3b4158af]",
"(MDSLogContextBase::complete(int)+0x40) [0x55ed3b415c30]",
"(Finisher::finisher_thread_entry()+0x19d) [0x7f0f81fab73d]",
"(()+0x7fa3) [0x7f0f81532fa3]",
"(clone()+0x3f) [0x7f0f80ec44cf]"
],
"ceph_version": "15.2.13",
"crash_id": 
"2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31",
"entity_name": "mds.hk-cephnode-60",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-mds",
"stack_sig": 
"5f310d14ffe4b2600195c874fba3761c268218711ee4a449413862bb5553fb4c",
"timestamp": "2021-08-09T12:44:34.190128Z",
"utsname_hostname": "hk-cephnode-60",
"utsname_machine": "x86_64",
"utsname_release": "5.4.114-1-pve",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200)»
}


--- 

root@hk-cephnode-54:~# ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
mds.hk-cephnode-54 crashed on host hk-cephnode-54 at 
2021-08-09T11:20:23.776776Z

root@hk-cephnode-54:~# ceph status
  cluster:
id: 
health: HEALTH_WARN
1 daemons have recently crashed

  services:
mon: 3 daemons, quorum hk-cephnode-60,hk-cephnode-61,hk-cephnode-62 (age 4w)
mgr: hk-cephnode-53(active, since 4h), standbys: hk-cephnode-51, 
hk-cephnode-52
mds: cephfs:4 
{0=hk-cephnode-60=up:active,1=hk-cephnode-61=up:active,2=hk-cephnode-55=up:active,3=hk-cephnode-57=up:active}
 4 up:standby-replay 1 up:standby
osd: 180 osds: 180 up (since 5d), 180 in (since 2w)
 
  data:
pools:   9 pools, 2433 pgs
objects: 118.22M objects, 331 TiB
usage:   935 TiB used, 990 TiB / 1.9 PiB avail
pgs: 2433 active+clean
 
  io:
client:   231 MiB/s rd, 146 MiB/s wr, 900 op/s rd, 4.07k op/s wr


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BUG #51821 - client is using insecure global_id reclaim

2021-08-09 Thread Ilya Dryomov
On Mon, Aug 9, 2021 at 5:14 PM Robert W. Eckert  wrote:
>
> I have had the same issue with the windows client.
> I had to issue
> ceph config set mon auth_expose_insecure_global_id_reclaim false
> Which allows the other clients to connect.
> I think you need to restart the monitors as well, because the first few times 
> I tried this, I still couldn't connect.

For archive's sake, I'd like to mention that disabling
auth_expose_insecure_global_id_reclaim isn't right and it wasn't
intended for this.  Enabling auth_allow_insecure_global_id_reclaim
should be enough to allow all (however old) clients to connect.
The fact that it wasn't enough for the available Windows build
suggests that there is some subtle breakage in it, because all "expose"
does is force the client to connect twice instead of just once.
It doesn't actually refuse old unpatched clients.
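
For reference, the intended workaround per the CVE-2021-20288 guidance looks
like this (the mute duration is just an example):

ceph config set mon auth_allow_insecure_global_id_reclaim true
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w   # optional, while old clients get upgraded
ceph config set mon auth_allow_insecure_global_id_reclaim false   # re-enable enforcement once all clients are patched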

(The breakage isn't surprising given that the available build is
more or less a random development snapshot with some pending-at-the-time
Windows-specific patches applied.  I'll try to escalate the issue
and get the linked MSI bundle updated.)

Thanks,

Ilya

>
> -Original Message-
> From: Richard Bade 
> Sent: Sunday, August 8, 2021 8:27 PM
> To: Daniel Persson 
> Cc: Ceph Users 
> Subject: [ceph-users] Re: BUG #51821 - client is using insecure global_id 
> reclaim
>
> Hi Daniel,
> I had a similar issue last week after upgrading my test cluster from
> 14.2.13 to 14.2.22 which included this fix for Global ID reclaim in .20. My 
> issue was a rados gw that I was re-deploying on the latest version. The 
> problem seemed to be related with cephx authentication.
> It kept displaying the error message you have and the service wouldn't start.
> I ended up stopping and removing the old rgw service, deleting all the keys 
> in /etc/ceph/ and all data in /var/lib/ceph/radosgw/ and re-deploying the 
> radosgw. This used the new rgw bootstrap keys and new key for this radosgw.
> So, I would suggest you double and triple check which keys your clients are 
> using and that cephx is enabled correctly on your cluster.
> Check your admin key in /etc/ceph as well, as that's what's being used for 
> ceph status.
>
> Regards,
> Rich
>
> On Sun, 8 Aug 2021 at 05:01, Daniel Persson  wrote:
> >
> > Hi everyone.
> >
> > I suggested asking for help here instead of in the bug tracker so that
> > I will try it.
> >
> > https://tracker.ceph.com/issues/51821?next_issue_id=51820&prev_issue_i
> > d=51824
> >
> > I have a problem that I can't seem to figure out how to resolve the issue.
> >
> > AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id
> > reclaim
> > AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
> > global_id reclaim
> >
> >
> > Both of these have to do with reclaiming ID and securing that no
> > client could steal or reuse another client's ID. I understand the
> > reason for this and want to resolve the issue.
> >
> > Currently, I have three different clients.
> >
> > * One Windows client using the latest Ceph-Dokan build. (ceph version
> > 15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea)
> > pacific
> > (rc))
> > * One Linux Debian build using the built packages for that kernel. (
> > 4.19.0-17-amd64)
> > * And one client that I've built from source for a raspberry PI as
> > there is no arm build for the Pacific release. (5.11.0-1015-raspi)
> >
> > If I switch over to not allow global id reclaim, none of these clients
> > could connect, and using the command "ceph status" on one of my nodes
> > will also fail.
> >
> > All of them giving the same error message:
> >
> > monclient(hunting): handle_auth_bad_method server allowed_methods [2]
> > but i only support [2]
> >
> >
> > Has anyone encountered this problem and have any suggestions?
> >
> > PS. The reason I have 3 different hosts is that this is a test
> > environment where I try to resolve and look at issues before we
> > upgrade our production environment to pacific. DS.
> >
> > Best regards
> > Daniel
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BUG #51821 - client is using insecure global_id reclaim

2021-08-09 Thread Robert W. Eckert
I have had the same issue with the Windows client.
I had to issue
ceph config set mon auth_expose_insecure_global_id_reclaim false
which allows the other clients to connect.
I think you need to restart the monitors as well, because the first few times I
tried this, I still couldn't connect.

-Original Message-
From: Richard Bade  
Sent: Sunday, August 8, 2021 8:27 PM
To: Daniel Persson 
Cc: Ceph Users 
Subject: [ceph-users] Re: BUG #51821 - client is using insecure global_id 
reclaim

Hi Daniel,
I had a similar issue last week after upgrading my test cluster from
14.2.13 to 14.2.22 which included this fix for Global ID reclaim in .20. My 
issue was a rados gw that I was re-deploying on the latest version. The problem 
seemed to be related with cephx authentication.
It kept displaying the error message you have and the service wouldn't start.
I ended up stopping and removing the old rgw service, deleting all the keys in 
/etc/ceph/ and all data in /var/lib/ceph/radosgw/ and re-deploying the radosgw. 
This used the new rgw bootstrap keys and new key for this radosgw.
So, I would suggest you double and triple check which keys your clients are 
using and that cephx is enabled correctly on your cluster.
Check your admin key in /etc/ceph as well, as that's what's being used for ceph 
status.
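
A quick sanity check is to compare what the cluster has against what the
client presents, e.g.:

ceph auth get client.admin                 # key the monitors expect
cat /etc/ceph/ceph.client.admin.keyring    # key the local client actually uses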

Regards,
Rich

On Sun, 8 Aug 2021 at 05:01, Daniel Persson  wrote:
>
> Hi everyone.
>
> I suggested asking for help here instead of in the bug tracker so that 
> I will try it.
>
> https://tracker.ceph.com/issues/51821?next_issue_id=51820&prev_issue_i
> d=51824
>
> I have a problem that I can't seem to figure out how to resolve the issue.
>
> AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id 
> reclaim
> AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure 
> global_id reclaim
>
>
> Both of these have to do with reclaiming ID and securing that no 
> client could steal or reuse another client's ID. I understand the 
> reason for this and want to resolve the issue.
>
> Currently, I have three different clients.
>
> * One Windows client using the latest Ceph-Dokan build. (ceph version
> 15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea) 
> pacific
> (rc))
> * One Linux Debian build using the built packages for that kernel. (
> 4.19.0-17-amd64)
> * And one client that I've built from source for a raspberry PI as 
> there is no arm build for the Pacific release. (5.11.0-1015-raspi)
>
> If I switch over to not allow global id reclaim, none of these clients 
> could connect, and using the command "ceph status" on one of my nodes 
> will also fail.
>
> All of them giving the same error message:
>
> monclient(hunting): handle_auth_bad_method server allowed_methods [2] 
> but i only support [2]
>
>
> Has anyone encountered this problem and have any suggestions?
>
> PS. The reason I have 3 different hosts is that this is a test 
> environment where I try to resolve and look at issues before we 
> upgrade our production environment to pacific. DS.
>
> Best regards
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Balanced use of HDD and SSD

2021-08-09 Thread E Taka
Hello all,

a year ago we started with a 3-node Ceph cluster with 21 HDDs and 3
SSDs, which we installed with cephadm, configuring the disks with
`ceph orch apply osd --all-available-devices`

Over time, the usage grew quite significantly: we now have another
5 nodes with 8-12 HDDs and 1-2 SSDs each; the integration worked without
any problems with `ceph orch add host`. Now we wonder whether the HDDs and
SSDs are used as recommended, so that access is fast, but without
My questions: how can I check what the data_devices and db_devices
are? Can we still apply a setup like, for example, the second one in this
documentation? https://docs.ceph.com/en/latest/cephadm/osd/#the-simple-case
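
For reference, a sketch based on that page (the service_id, host pattern and
file name below are placeholders): the currently applied spec can be dumped,
and a rotational-based split applied.

ceph orch ls osd --export      # shows the OSD service spec(s) cephadm currently applies
ceph orch device ls --wide     # lists devices and whether they are rotational
ceph osd metadata 0            # per-OSD view, including bluefs/DB device details

# osd_spec.yml -- the "simple case" from the linked docs: HDDs as data, SSDs for DB/WAL
service_type: osd
service_id: osd_hdd_with_ssd_db
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0

ceph orch apply -i osd_spec.yml --dry-run   # preview before applying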

Some technical details: Xeons with plenty of RAM and cores, Ceph 16.2.5
with mostly default configuration, Ubuntu 20.04, separate cluster and
public networks (both 10 Gb). Usage: RBD (QEMU), CephFS, and the Ceph
object gateway. (The latter is surprisingly slow, but I want to sort
out the problem with the underlying configuration first.)

Thanks for any helpful responses,
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Size of cluster

2021-08-09 Thread Jorge JP
Hello, this is my osd tree:

ID   CLASS  WEIGHT     TYPE NAME
 -1         312.14557  root default
 -3          68.97755      host pveceph01
  3    hdd   10.91409          osd.3
 14    hdd   16.37109          osd.14
 15    hdd   16.37109          osd.15
 20    hdd   10.91409          osd.20
 23    hdd   10.91409          osd.23
  0    ssd    3.49309          osd.0
 -5          68.97755      host pveceph02
  4    hdd   10.91409          osd.4
 13    hdd   16.37109          osd.13
 16    hdd   16.37109          osd.16
 21    hdd   10.91409          osd.21
 24    hdd   10.91409          osd.24
  1    ssd    3.49309          osd.1
 -7          68.97755      host pveceph03
  6    hdd   10.91409          osd.6
 12    hdd   16.37109          osd.12
 17    hdd   16.37109          osd.17
 22    hdd   10.91409          osd.22
 25    hdd   10.91409          osd.25
  2    ssd    3.49309          osd.2
-13          52.60646      host pveceph04
  9    hdd   10.91409          osd.9
 11    hdd   16.37109          osd.11
 18    hdd   10.91409          osd.18
 26    hdd   10.91409          osd.26
  5    ssd    3.49309          osd.5
-16          52.60646      host pveceph05
  8    hdd   10.91409          osd.8
 10    hdd   16.37109          osd.10
 19    hdd   10.91409          osd.19
 27    hdd   10.91409          osd.27
  7    ssd    3.49309          osd.7

Sorry, but how do I check the failure domain? I seem to remember that my failure
domain is host.
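
A quick way to check (assuming the default replicated rule; substitute
whatever rule name the first command reports):

ceph osd pool get POOL-HDD crush_rule
ceph osd crush rule dump replicated_rule    # look for "type": "host" in the chooseleaf step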

Regards.


From: Robert Sander 
Sent: Monday, August 9, 2021 13:40
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Size of cluster

Hi,

On 09.08.21 at 12:56, Jorge JP wrote:

> 15 x 12TB = 180TB
> 8 x 18TB = 144TB

How are these distributed across your nodes and what is the failure
domain? I.e. how will Ceph distribute data among them?

> The raw size of this cluster (HDD) should be 295TB after format but the size 
> of my "primary" pool (2/1) in this moment is:

A pool with a size of 2 and a min_size of 1 will lead to data loss.

> 53.50% (65.49 TiB of 122.41 TiB)
>
> 122,41TiB multiplied by replication of 2 is 244TiB, not 295TiB.
>
> How can use all size of the class?

If you have 3 nodes with each 5x 12TB (60TB) and 2 nodes with each 4x
18TB (72TB) the maximum usable capacity will not be the sum of all
disks. Remember that Ceph tries to evenly distribute the data.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Size of cluster

2021-08-09 Thread Robert Sander
Hi,

On 09.08.21 at 12:56, Jorge JP wrote:

> 15 x 12TB = 180TB
> 8 x 18TB = 144TB

How are these distributed across your nodes and what is the failure
domain? I.e. how will Ceph distribute data among them?

> The raw size of this cluster (HDD) should be 295TB after format but the size 
> of my "primary" pool (2/1) in this moment is:

A pool with a size of 2 and a min_size of 1 will lead to data loss.

> 53.50% (65.49 TiB of 122.41 TiB)
> 
> 122,41TiB multiplied by replication of 2 is 244TiB, not 295TiB.
> 
> How can use all size of the class?

If you have 3 nodes with each 5x 12TB (60TB) and 2 nodes with each 4x
18TB (72TB) the maximum usable capacity will not be the sum of all
disks. Remember that Ceph tries to evenly distribute the data.
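
As a rough sketch of where the 122.41 TiB figure likely comes from: the pool
usage shown is STORED plus MAX AVAIL from `ceph df` (65.49 TiB + 57 TiB is
~122.4 TiB), and MAX AVAIL is a conservative projection based on the fullest
OSD the pool's CRUSH rule maps to and the full ratio, not simply the raw
class capacity divided by the replica count.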

Regards
-- 
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: "ceph orch ls", "ceph orch daemon rm" fail with exception "'KeyError: 'not'" on 15.2.10

2021-08-09 Thread Erkki Seppala
Hi,

Might anyone have any insight into this issue? I have been unable to resolve
it so far, and it prevents many "ceph orch" commands and breaks many aspects
of the web user interface.

-- 
Erkki Seppälä  @inside.org
http://www.inside.org/~flux/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Size of cluster

2021-08-09 Thread Jorge JP
Hello,

I have a Ceph cluster with 5 nodes. I have 23 OSDs with hdd class distributed
across these nodes. The disk sizes are:

15 x 12TB = 180TB
8 x 18TB = 144TB

Result of execute "ceph df" command:

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    295 TiB  163 TiB  131 TiB   131 TiB      44.55
ssd     17 TiB   17 TiB  316 GiB   324 GiB       1.81
TOTAL  312 TiB  181 TiB  131 TiB   132 TiB      42.16

--- POOLS ---
POOLID  PGS  STORED   OBJECTS  USED %USED  MAX 
AVAIL
device_health_metrics11   13 MiB5   39 MiB  0 
40 TiB
.rgw.root44  1.5 KiB4  768 KiB  0 
38 TiB
default.rgw.meta 64  4.7 KiB   12  1.9 MiB  0 
38 TiB
rbd  8  512  1.4 KiB4  384 KiB  0 
38 TiB
default.rgw.buckets.data12   32   10 GiB2.61k   31 GiB   0.03 
38 TiB
default.rgw.log 13  128   35 KiB  2076 MiB  0 
38 TiB
default.rgw.control 144  0 B8  0 B  0 
38 TiB
default.rgw.buckets.non-ec  15  128 27 B1  192 KiB  0 
38 TiB
default.rgw.buckets.index   184  1.1 MiB2  3.3 MiB  0
5.4 TiB
default.rgw.buckets.ssd.index   218  0 B0  0 B  0
5.4 TiB
default.rgw.buckets.ssd.data228  0 B0  0 B  0
5.4 TiB
default.rgw.buckets.ssd.non-ec  238  0 B0  0 B  0
5.4 TiB
POOL-HDD32  512   65 TiB   17.28M  131 TiB  53.51 
57 TiB
POOL_SSD_2_134   32  157 GiB  296.94k  316 GiB   1.86
8.1 TiB

The raw size of this cluster (HDD) should be 295TB after formatting, but the usage
of my "primary" pool (2/1) at the moment is:

53.50% (65.49 TiB of 122.41 TiB)

122.41 TiB multiplied by a replication factor of 2 is 244 TiB, not 295 TiB.

How can I use the full size of the class?

Thanks a lot.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BUG #51821 - client is using insecure global_id reclaim

2021-08-09 Thread Daniel Persson
Hi Tobias and Richard.

Thank you for answering my questions. I got the link suggested by Tobias on
the issue report, which led me to further investigation. It was hard to see
which version each client on the system was using, but looking at the
result of "ceph health detail" and ldd librados2.so gave me some
information.

It seemed that one of my Linux environments used the old buster packages,
which were 12.2.* and not compatible with the new global ID reclaim.

Another issue I ran into was that the Windows client available for download
reports a strange version, 15.0.0 Pacific, which is just not correct.

After reading and searching on GitHub, I realized that the Windows
executables could be built in a Linux environment using the Ceph source
code. So I've now built new binaries for Windows that work just fine except
for libwnbd.dll, which was never built. After adding it from the old
installation, I got it to work.
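
Roughly, the build boils down to something like this (a sketch, assuming the
win32_build.sh cross-compile helper in the Ceph tree; see README.windows.rst
in the source for the exact prerequisites):

git clone -b v16.2.5 https://github.com/ceph/ceph.git
cd ceph
./win32_deps_build.sh    # fetch and build the mingw dependencies
./win32_build.sh         # cross-compile the Windows binaries (ceph-dokan, rbd, ...)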

Now ceph-dokan reports a version of 16.2.5, which was the version I built.

Building this was not straightforward, and it is something I think could be
interesting for the community, so I'm planning to create an instructional
video on the subject that I will publish next week.

Again thank you for your help.

Best regards
Daniel

On Mon, Aug 9, 2021 at 11:46 AM Tobias Urdin 
wrote:

> Hello,
>
> Did you follow the fix/recommendation when applying patches as per
> the documentation in the CVE security post [1] ?
>
> Best regards
>
> [1] https://docs.ceph.com/en/latest/security/CVE-2021-20288/
>
> > On 9 Aug 2021, at 02:26, Richard Bade  wrote:
> >
> > Hi Daniel,
> > I had a similar issue last week after upgrading my test cluster from
> > 14.2.13 to 14.2.22 which included this fix for Global ID reclaim in
> > .20. My issue was a rados gw that I was re-deploying on the latest
> > version. The problem seemed to be related with cephx authentication.
> > It kept displaying the error message you have and the service wouldn't
> > start.
> > I ended up stopping and removing the old rgw service, deleting all the
> > keys in /etc/ceph/ and all data in /var/lib/ceph/radosgw/ and
> > re-deploying the radosgw. This used the new rgw bootstrap keys and new
> > key for this radosgw.
> > So, I would suggest you double and triple check which keys your
> > clients are using and that cephx is enabled correctly on your cluster.
> > Check your admin key in /etc/ceph as well, as that's what's being used
> > for ceph status.
> >
> > Regards,
> > Rich
> >
> > On Sun, 8 Aug 2021 at 05:01, Daniel Persson 
> wrote:
> >>
> >> Hi everyone.
> >>
> >> I suggested asking for help here instead of in the bug tracker so that I
> >> will try it.
> >>
> >>
> https://tracker.ceph.com/issues/51821?next_issue_id=51820&prev_issue_id=51824
> >>
> >> I have a problem that I can't seem to figure out how to resolve the
> issue.
> >>
> >> AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id
> reclaim
> >> AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
> >> global_id reclaim
> >>
> >>
> >> Both of these have to do with reclaiming ID and securing that no client
> >> could steal or reuse another client's ID. I understand the reason for
> this
> >> and want to resolve the issue.
> >>
> >> Currently, I have three different clients.
> >>
> >> * One Windows client using the latest Ceph-Dokan build. (ceph version
> >> 15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea)
> pacific
> >> (rc))
> >> * One Linux Debian build using the built packages for that kernel. (
> >> 4.19.0-17-amd64)
> >> * And one client that I've built from source for a raspberry PI as
> there is
> >> no arm build for the Pacific release. (5.11.0-1015-raspi)
> >>
> >> If I switch over to not allow global id reclaim, none of these clients
> >> could connect, and using the command "ceph status" on one of my nodes
> will
> >> also fail.
> >>
> >> All of them giving the same error message:
> >>
> >> monclient(hunting): handle_auth_bad_method server allowed_methods [2]
> >> but i only support [2]
> >>
> >>
> >> Has anyone encountered this problem and have any suggestions?
> >>
> >> PS. The reason I have 3 different hosts is that this is a test
> environment
> >> where I try to resolve and look at issues before we upgrade our
> production
> >> environment to pacific. DS.
> >>
> >> Best regards
> >> Daniel
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BUG #51821 - client is using insecure global_id reclaim

2021-08-09 Thread Tobias Urdin
Hello,

Did you follow the fix/recommendation when applying patches as per
the documentation in the CVE security post [1] ?

Best regards

[1] https://docs.ceph.com/en/latest/security/CVE-2021-20288/

> On 9 Aug 2021, at 02:26, Richard Bade  wrote:
> 
> Hi Daniel,
> I had a similar issue last week after upgrading my test cluster from
> 14.2.13 to 14.2.22 which included this fix for Global ID reclaim in
> .20. My issue was a rados gw that I was re-deploying on the latest
> version. The problem seemed to be related with cephx authentication.
> It kept displaying the error message you have and the service wouldn't
> start.
> I ended up stopping and removing the old rgw service, deleting all the
> keys in /etc/ceph/ and all data in /var/lib/ceph/radosgw/ and
> re-deploying the radosgw. This used the new rgw bootstrap keys and new
> key for this radosgw.
> So, I would suggest you double and triple check which keys your
> clients are using and that cephx is enabled correctly on your cluster.
> Check your admin key in /etc/ceph as well, as that's what's being used
> for ceph status.
> 
> Regards,
> Rich
> 
> On Sun, 8 Aug 2021 at 05:01, Daniel Persson  wrote:
>> 
>> Hi everyone.
>> 
>> I suggested asking for help here instead of in the bug tracker so that I
>> will try it.
>> 
>> https://tracker.ceph.com/issues/51821?next_issue_id=51820&prev_issue_id=51824
>> 
>> I have a problem that I can't seem to figure out how to resolve the issue.
>> 
>> AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id reclaim
>> AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
>> global_id reclaim
>> 
>> 
>> Both of these have to do with reclaiming ID and securing that no client
>> could steal or reuse another client's ID. I understand the reason for this
>> and want to resolve the issue.
>> 
>> Currently, I have three different clients.
>> 
>> * One Windows client using the latest Ceph-Dokan build. (ceph version
>> 15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea) pacific
>> (rc))
>> * One Linux Debian build using the built packages for that kernel. (
>> 4.19.0-17-amd64)
>> * And one client that I've built from source for a raspberry PI as there is
>> no arm build for the Pacific release. (5.11.0-1015-raspi)
>> 
>> If I switch over to not allow global id reclaim, none of these clients
>> could connect, and using the command "ceph status" on one of my nodes will
>> also fail.
>> 
>> All of them giving the same error message:
>> 
>> monclient(hunting): handle_auth_bad_method server allowed_methods [2]
>> but i only support [2]
>> 
>> 
>> Has anyone encountered this problem and have any suggestions?
>> 
>> PS. The reason I have 3 different hosts is that this is a test environment
>> where I try to resolve and look at issues before we upgrade our production
>> environment to pacific. DS.
>> 
>> Best regards
>> Daniel
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd object mapping

2021-08-09 Thread Konstantin Shalygin


> On 8 Aug 2021, at 20:10, Tony Liu  wrote:
> 
> That's what I thought. I am confused by this.
> 
> # ceph osd map vm fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
> osdmap e18381 pool 'vm' (4) object 
> 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk' -> pg 4.c7a78d40 (4.0) -> up 
> ([4,17,6], p4) acting ([4,17,6], p4)
> 
> It calls RBD image "object" and it shows the whole image maps to a single PG,
> while the image is actually split into many objects each of which maps to a 
> PG.
> How am I supposed to understand the output of this command?

You can execute `ceph osd map vm nonexist` and you will see a mapping for the
'nonexist' object as well -- the command only computes where an object with
that name would be placed, whether or not it exists.
To get the mappings for each object of your image, you need to find all of the
image's objects (via the block name prefix from the rbd header) and iterate
over that list.
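
A sketch of that iteration, using the pool and image from the example above:

prefix=$(rbd info vm/fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk | awk '/block_name_prefix/ {print $2}')
rados -p vm ls | grep "^${prefix}" | while read -r obj; do
    ceph osd map vm "$obj"
done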



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io