[ceph-users] cephfs-journal-tool event recover_dentries summary killed due to memory usage

2018-11-02 Thread Rhian Resnick
I was posting with my office account but I think it is being blocked.

Our CephFS metadata pool went from 1 GB to 1 TB in a matter of hours, and
after using up all the storage on the OSDs the filesystem now reports two damaged ranks.

The cephfs-journal-tool crashes when performing any operations due to
memory utilization.

We tried a backup, which crashed (we then did a rados cppool to back up our
metadata pool).
I then tried to run a dentry recovery, which also failed due to memory usage.
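
For reference, the commands attempted were along the lines of the following
(pool and rank names are the ones used elsewhere in this thread; a sketch of
what was run, not a recommendation):

# pool-level copy of the metadata pool as a backup
rados cppool cephfs_metadata cephfs_metadata.bk
# the dentry recovery that gets killed for memory usage
cephfs-journal-tool event recover_dentries summary --rank=0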

Any recommendations for the next step?

Data from our config and status




Combined logs (after marking things as repaired to see if that would rescue us):


Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499
7f68db7a3700 -1 mds.4.purge_queue operator(): Error -108 loading
Journaler
Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499
7f68db7a3700 -1 mds.4.purge_queue operator(): Error -108 loading
Journaler
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
1 mds daemon damaged (MDS_DAMAGE)
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
1 mds daemon damaged (MDS_DAMAGE)
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934
7f6dacd69700 -1 mds.1.journaler.mdlog(ro) try_read_entry: decode error
from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: mds.1
10.141.255.202:6898/1492854021 1 : Error loading MDS rank 1: (22)
Invalid argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914949
7f6dacd69700  0 mds.1.log _replay journaler got error -22, aborting
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934
7f6dacd69700 -1 mds.1.journaler.mdlog(ro) try_read_entry: decode error
from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745
7f6dacd69700 -1 log_channel(cluster) log [ERR] : Error loading MDS
rank 1: (22) Invalid argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745
7f6dacd69700 -1 log_channel(cluster) log [ERR] : Error loading MDS
rank 1: (22) Invalid argument
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
2 mds daemons damaged (MDS_DAMAGE)
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
2 mds daemons damaged (MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
1 mds daemon damaged (MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231
7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update:
1 mds daemon damaged (MDS_DAMAGE)

Ceph OSD Status: (The missing and out OSDs are in a different pool
from all data; these were the bad SSDs that caused the issue)


  cluster:
id: 6a2e8f21-bca2-492b-8869-eecc995216cc
health: HEALTH_ERR
1 filesystem is degraded
2 mds daemons damaged

  services:
mon: 3 daemons, quorum ceph-p-mon2,ceph-p-mon1,ceph-p-mon3
mgr: ceph-p-mon1(active), standbys: ceph-p-mon2
mds: cephfs-3/5/5 up
{0=ceph-storage3=up:resolve,2=ceph-p-mon3=up:resolve,4=ceph-p-mds1=up:resolve},
3 up:standby, 2 damaged
osd: 170 osds: 167 up, 158 in

  data:
pools:   7 pools, 7520 pgs
objects: 188.46M objects, 161TiB
usage:   275TiB used, 283TiB / 558TiB avail
pgs: 7511 active+clean
     9    active+clean+scrubbing+deep

  io:
client:   0B/s rd, 17.2KiB/s wr, 0op/s rd, 1op/s wr



Ceph OSD Tree:

ID  CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF
-10       0       root deefault
 -9       5.53958 root ssds
-11       1.89296     host ceph-cache1
 35   hdd 1.09109         osd.35                 up        0 1.0
181   hdd 0.26729         osd.181                up        0 1.0
182   hdd 0.26729         osd.182              down        0 1.0
183   hdd 0.26729         osd.183              down        0 1.0
-12       1.75366     host ceph-cache2
 46   hdd 1.09109         osd.46                 up        0 1.0
185   hdd 0.26729         osd.185              down        0 1.0
186   hdd 0.12799         osd.186                up        0 1.0
187   hdd 0.26729         osd.187                up        0 1.0
-13       1.89296     host ceph-cache3
 60   hdd 1.09109         osd.60                 up        0 1.0
189   hdd 0.26729         osd.189                up        0 1.0
190   hdd 0.26729         osd.190                up        0 1.0
191   hdd 0.26729         osd.191                up        0 1.0
 -5       4.33493 root ssds-ro
 -6       1.44498     host ceph-storage1-ssd
 85   ssd 0.72249         osd.85                 up  1.0 1.0
 89   ssd 0.72249         osd.89                 up

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread David Turner
That makes so much more sense. It seems like RHCS has had this ability
since Jewel while it was only put into the community version as of Mimic.
So my version of Ceph isn't actually capable of changing the backend
db. While digging into the code I did find a bug with the creation of the
rocksdb backend with ceph-kvstore-tool: it doesn't use the Ceph
defaults or any settings in your config file for the db settings. I'm
working on testing a modified version that should take those settings into
account. If the fix does work, it will also apply to a few other
tools that can be used to set up the omap backend db.
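
For context, the conversion being discussed is roughly the following; the exact
store-copy arguments, paths, and the follow-up steps (swapping the omap
directories and setting filestore_omap_backend = rocksdb) are assumptions on my
part, not a tested recipe:

# stop the OSD, then copy its leveldb omap store into a new rocksdb store
systemctl stop ceph-osd@1
ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-1/current/omap \
    store-copy /var/lib/ceph/osd/ceph-1/current/omap.rocksdb 10000 rocksdb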

On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi 
wrote:

> It was Redhat versioned Jewel. But may be more relevantly, we are on
> Ubuntu unlike your case.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 10:24 AM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> Pavan, which version of Ceph were you using when you changed your backend
> to rocksdb?
>
>
>
> On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on centOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I got pulled away from this for a while.  The error in the log is
> "abort: Corruption: Snappy not supported or corrupted Snappy 

Re: [ceph-users] Filestore to Bluestore migration question

2018-11-02 Thread Hayashida, Mami
I followed all the steps Hector suggested, and almost everything seems to
have worked fine.  I say "almost" because one out of the 10 OSDs I was
migrating could not be activated, even though everything up to that point
had worked just as well for that OSD as for the others. Here is the output for
that particular failure:

*
ceph-volume lvm activate --all
...
--> Activating OSD ID 67 FSID 17cd6755-76f9-4160-906c-XX
Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-67
--> Absolute path not found for executable: restorecon
--> Ensure $PATH environment variable contains common executable locations
Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/hdd67/data67 --path /var/lib/ceph/osd/ceph-67
 stderr: failed to read label for /dev/hdd67/data67: (2) No such file or
directory
-->  RuntimeError: command returned non-zero exit status:

***
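
(In case it helps anyone reading later: the "failed to read label" error above
can be inspected directly. The device path is the one from the output, and this
is only a diagnostic sketch, not a fix.)

# check whether the LV exists/is active and whether it carries a bluestore label
lvs -o lv_name,vg_name,lv_path,lv_tags | grep data67
ceph-bluestore-tool show-label --dev /dev/hdd67/data67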
I then checked to see if the rest of the migrated OSDs were back in by
calling the ceph osd tree command from the admin node.  Since they were
not, I tried to restart the first of the 10 newly migrated Bluestore osds
by calling

***
systemctl start ceph-osd@60

At that point, not only could this particular service not be started, but
ALL the OSD daemons on the entire node shut down!

**
root@osd1:~# systemctl status ceph-osd@60
● ceph-osd@60.service - Ceph object storage daemon osd.60
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime;
vendor preset: enabled)
   Active: inactive (dead) since Fri 2018-11-02 15:47:20 EDT; 1h 9min ago
  Process: 3473621 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
%i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
  Process: 3473147 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3473621 (code=exited, status=0/SUCCESS)

Oct 29 15:57:53 osd1.x.uky.edu ceph-osd[3473621]: 2018-10-29
15:57:53.868856 7f68adaece00 -1 osd.60 48106 log_to_monitors {default=true}
Oct 29 15:57:53 osd1.x.uky.edu ceph-osd[3473621]: 2018-10-29
15:57:53.874373 7f68adaece00 -1 osd.60 48106 mon_cmd_maybe_osd_create fail:
'you must complete the upgrade and 'ceph osd require-osd-release luminous'
before using crush device classes': (1) Operation not permitted
Oct 30 06:25:01 osd1.x.uky.edu ceph-osd[3473621]: 2018-10-30
06:25:01.961720 7f687feb3700 -1 received  signal: Hangup from  PID: 3485955
task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse
radosgw  UID: 0
Oct 31 06:25:02 osd1.x.uky.edu ceph-osd[3473621]: 2018-10-31
06:25:02.110898 7f687feb3700 -1 received  signal: Hangup from  PID: 3500945
task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse
radosgw  UID: 0
Nov 01 06:25:02 osd1.x.uky.edu ceph-osd[3473621]: 2018-11-01
06:25:02.101548 7f687feb3700 -1 received  signal: Hangup from  PID: 3514774
task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse
radosgw  UID: 0
Nov 02 06:25:02 osd1.x.uky.edu ceph-osd[3473621]: 2018-11-02
06:25:01.997557 7f687feb3700 -1 received  signal: Hangup from  PID: 3528128
task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse
radosgw  UID: 0
Nov 02 15:47:16 osd1.ox.uky.edu ceph-osd[3473621]: 2018-11-02
15:47:16.39 7f687feb3700 -1 received  signal: Terminated from  PID: 1
task name: /lib/systemd/systemd --system --deserialize 20  UID: 0
Nov 02 15:47:16 osd1.x.uky.edu ceph-osd[3473621]: 2018-11-02
15:47:16.322253 7f687feb3700 -1 osd.60 48504 *** Got signal Terminated ***
Nov 02 15:47:16 osd1.x.uky.edu ceph-osd[3473621]: 2018-11-02
15:47:16.676625 7f687feb3700 -1 osd.60 48504 shutdown
Nov 02 16:34:05 osd1.ox.uky.edu systemd[1]: Stopped Ceph object storage
daemon osd.60.

**
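
(A diagnostic sketch for the mass shutdown, not from the original post: the
ceph-osd@ units are PartOf ceph-osd.target, so if that target gets stopped,
every OSD on the node stops with it.)

# check whether the whole target was stopped and what systemd logged around it
systemctl status ceph-osd.target
journalctl -u ceph-osd.target --since "2018-11-02 15:40"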
And here is the output for one of the OSDs (osd.70, still using Filestore)
that shut down right when I tried to start osd.60:



root@osd1:~# systemctl status ceph-osd@70
● ceph-osd@70.service - Ceph object storage daemon osd.70
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime;
vendor preset: enabled)
   Active: inactive (dead) since Fri 2018-11-02 16:34:08 EDT; 2min 6s ago
  Process: 3473629 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
%i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
  Process: 3473153 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3473629 (code=exited, status=0/SUCCESS)

Oct 29 15:57:51 osd1..uky.edu ceph-osd[3473629]: 2018-10-29
15:57:51.300563 7f530eec2e00 -1 osd.70 pg_epoch: 48095 pg[68.ces1( empty
local-lis/les=47489/47489 n=0 ec=6030/6030 lis/c 47488/47488 les/c/f
47489/47489/0 47485/47488/47488) [138,70,203]p138(0) r=1 lpr=0 crt=0'0
unknown NO
Oct 30 06:25:01 osd1..uky.edu ceph-osd[3473629]: 2018-10-30
06:25:01.961743 7f52d8e44700 -1 received  signal: Hangup from  PID: 3485955
task name: killall -q -1 ceph-mon ceph-mgr ceph-mds 

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread Pavan Rallabhandi
It was Redhat versioned Jewel. But may be more relevantly, we are on Ubuntu 
unlike your case.

From: David Turner 
Date: Friday, November 2, 2018 at 10:24 AM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Pavan, which version of Ceph were you using when you changed your backend to 
rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com> wrote:
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner <drakonst...@gmail.com>
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi" wrote:

I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksDB on centOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be 
enabled.

Thanks,
-Pavan.

From: David Turner
Date: Thursday, September 27, 2018 at 1:18 PM
To: Pavan Rallabhandi
Cc: ceph-users
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi
<prallabha...@walmartlabs.com> wrote:
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

Have a feel it’s expecting the compression to be enabled, can you try 
removing 

[ceph-users] Ceph Community Newsletter (October 2018)

2018-11-02 Thread Mike Perez
Hey Cephers,

The Ceph Community Newsletter of October 2018 has been published:

https://ceph.com/community/ceph-community-newsletter-october-2018-edition/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Damaged MDS Ranks will not start / recover

2018-11-02 Thread Rhian Resnick
I am posting this with an improved subject since we really want your input, so
please pardon the double submission.


This has been a rough couple of days. We thought we had resolved all our
performance issues by moving the Ceph metadata to some high-intensity-write
disks from Intel, but what we didn't notice was that Ceph labeled them as HDDs
(thanks, Dell RAID controller).


We believe this caused read-lock errors and resulted in the journal increasing
from 700 MB to 1 TB in 2 hours (basically over lunch). We tried to migrate and
then stop everything before the OSDs reached full status, but failed.


Over the last 24 hours the data has been migrated from the SSDs back to
spinning disks, but the MDS servers are now reporting that two ranks are damaged.


We are running a backup of the metadata pool but wanted to know what the list
thinks the next steps should be. I have attached the errors we see in the logs
as well as our OSD tree, ceph.conf (comments removed), and ceph fs dump.


Our 1 TB metadata backup is 75% done (we are using rados cppool, as the metadata
export fails by using up all server memory). Before we start working on fixing
our metadata we wanted our proposed procedure to be reviewed by an expert.


Does the following sequence look correct for our environment?



  1.  rados cppool cephfs_metadata cephfs_metadata.bk
  2.  Create a snap of the metadata pool, and the cephfs-cold pool (ec), and 
the cache tier (cephfs-hot)
  3.  cephfs-journal-tool event recover_dentries summary --rank=0
  4.  cephfs-journal-tool event recover_dentries summary --rank=1
  5.  cephfs-journal-tool event recover_dentries summary --rank=2
  6.  cephfs-journal-tool event recover_dentries summary --rank=3
  7.  cephfs-journal-tool event recover_dentries summary --rank=4
  8.  cephfs-journal-tool journal reset --rank=0
  9.  cephfs-journal-tool journal reset --rank=1
  10. cephfs-journal-tool journal reset --rank=2
  11. cephfs-journal-tool journal reset --rank=3
  12. cephfs-journal-tool journal reset --rank=4
  13. cephfs-table-tool all reset session
  14. Start metadata servers
  15. Scrub mds:
 *   ceph daemon mds.{hostname} scrub_path / recursive
 *   ceph daemon mds.{hostname} scrub_path / force
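
For steps 3-13, a condensed form of the same commands (a sketch only; it assumes
cephfs-journal-tool can get through each rank without running out of memory,
which is exactly what has been failing so far):

# recover dentries from, then reset, the journal of each of the five ranks
for rank in 0 1 2 3 4; do
    cephfs-journal-tool event recover_dentries summary --rank=${rank}
done
for rank in 0 1 2 3 4; do
    cephfs-journal-tool journal reset --rank=${rank}
done
cephfs-table-tool all reset session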





Combined logs (after marking things as repaired to see if that would rescue us):


Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: mds.1 10.141.255.202:6898/1492854021 1 
: Error loading MDS rank 1: (22) Invalid argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914949 7f6dacd69700 
 0 mds.1.log _replay journaler got error -22, aborting
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)

Ceph OSD Status: (The missing and out OSDs are in a different pool from all
data; these were the bad SSDs that caused the issue)


  cluster:
id: 6a2e8f21-bca2-492b-8869-eecc995216cc
health: HEALTH_ERR
1 filesystem is degraded
2 mds daemons damaged

  services:
mon: 3 daemons, quorum ceph-p-mon2,ceph-p-mon1,ceph-p-mon3
mgr: ceph-p-mon1(active), standbys: ceph-p-mon2
mds: cephfs-3/5/5 up  

Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif

If you simply multiply number of objects and rbd object size
you will get 7611672*4M ~= 29T and that is what you should see in USED
field, and 29/2*3=43.5T of raw space.
Unfortunately no idea why they consume less; probably because not all
objects are fully written.


It seems some objects correspond to snapshots and Bluestore is smart and 
uses copy-on-write (virtual clone) on them, so they aren't provisioned 
at all...


...UNTIL REBALANCE


What ceph version?


Mimic 13.2.2


Can you show osd config, "ceph daemon osd.0 config show"?


See the attachment. But it mostly contains defaults, only the following 
variables are overridden in /etc/ceph/ceph.conf:


[osd]
rbd_op_threads = 4
osd_op_queue = mclock_opclass
osd_max_backfills = 2
bluestore_prefer_deferred_size_ssd = 1
bdev_enable_discard = true


Can you show some "rbd info ecpool_hdd/rbd_name"?


[root@sill-01 ~]# rbd info rpool_hdd/rms-201807-golden
rbd image 'rms-201807-golden':
size 14 TiB in 3670016 objects
order 22 (4 MiB objects)
id: 3d3e1d6b8b4567
data_pool: ecpool_hdd
block_name_prefix: rbd_data.15.3d3e1d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool

op_features:
flags:
create_timestamp: Tue Aug  7 13:00:10 2018
{
"name": "osd.0",
"cluster": "ceph",
"admin_socket": "/var/run/ceph/ceph-osd.0.asok",
"admin_socket_mode": "",
"auth_client_required": "cephx",
"auth_cluster_required": "cephx",
"auth_debug": "false",
"auth_mon_ticket_ttl": "43200.00",
"auth_service_required": "cephx",
"auth_service_ticket_ttl": "3600.00",
"auth_supported": "",
"bdev_aio": "true",
"bdev_aio_max_queue_depth": "1024",
"bdev_aio_poll_ms": "250",
"bdev_aio_reap_max": "16",
"bdev_async_discard": "false",
"bdev_block_size": "4096",
"bdev_debug_aio": "false",
"bdev_debug_aio_suicide_timeout": "60.00",
"bdev_debug_inflight_ios": "false",
"bdev_enable_discard": "true",
"bdev_inject_crash": "0",
"bdev_inject_crash_flush_delay": "2",
"bdev_nvme_retry_count": "-1",
"bdev_nvme_unbind_from_kernel": "false",
"bluefs_alloc_size": "1048576",
"bluefs_allocator": "stupid",
"bluefs_buffered_io": "true",
"bluefs_compact_log_sync": "false",
"bluefs_log_compact_min_ratio": "5.00",
"bluefs_log_compact_min_size": "16777216",
"bluefs_max_log_runway": "4194304",
"bluefs_max_prefetch": "1048576",
"bluefs_min_flush_size": "524288",
"bluefs_min_log_runway": "1048576",
"bluefs_preextend_wal_files": "false",
"bluefs_sync_write": "false",
"bluestore_2q_cache_kin_ratio": "0.50",
"bluestore_2q_cache_kout_ratio": "0.50",
"bluestore_allocator": "stupid",
"bluestore_bitmapallocator_blocks_per_zone": "1024",
"bluestore_bitmapallocator_span_size": "1024",
"bluestore_blobid_prealloc": "10240",
"bluestore_block_create": "true",
"bluestore_block_db_create": "false",
"bluestore_block_db_path": "",
"bluestore_block_db_size": "0",
"bluestore_block_path": "",
"bluestore_block_preallocate_file": "false",
"bluestore_block_size": "10737418240",
"bluestore_block_wal_create": "false",
"bluestore_block_wal_path": "",
"bluestore_block_wal_size": "100663296",
"bluestore_bluefs": "true",
"bluestore_bluefs_balance_failure_dump_interval": "0.00",
"bluestore_bluefs_balance_interval": "1.00",
"bluestore_bluefs_env_mirror": "false",
"bluestore_bluefs_gift_ratio": "0.02",
"bluestore_bluefs_max_ratio": "0.90",
"bluestore_bluefs_min": "1073741824",
"bluestore_bluefs_min_free": "1073741824",
"bluestore_bluefs_min_ratio": "0.02",
"bluestore_bluefs_reclaim_ratio": "0.20",
"bluestore_cache_kv_min": "536870912",
"bluestore_cache_kv_ratio": "0.50",
"bluestore_cache_meta_ratio": "0.50",
"bluestore_cache_size": "0",
"bluestore_cache_size_hdd": "1073741824",
"bluestore_cache_size_ssd": "3221225472",
"bluestore_cache_trim_interval": "0.05",
"bluestore_cache_trim_max_skip_pinned": "64",
"bluestore_cache_type": "2q",
"bluestore_clone_cow": "true",
"bluestore_compression_algorithm": "snappy",
"bluestore_compression_max_blob_size": "0",
"bluestore_compression_max_blob_size_hdd": "524288",
"bluestore_compression_max_blob_size_ssd": "65536",
"bluestore_compression_min_blob_size": "0",
"bluestore_compression_min_blob_size_hdd": "131072",
"bluestore_compression_min_blob_size_ssd": "8192",
"bluestore_compression_mode": "none",
"bluestore_compression_required_ratio": "0.875000",
"bluestore_csum_type": "crc32c",
"bluestore_debug_freelist": "false",
"bluestore_debug_fsck_abort": "false",
"bluestore_debug_inject_bug21040": "false",
"bluestore_debug_inject_read_err": 

Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread Aleksei Gutikov

If you simply multiply number of objects and rbd object size
you will get 7611672*4M ~= 29T and that is what you should see in USED 
field, and 29/2*3=43.5T of raw space.
Unfortunately no idea why they consume less; probably because not all 
objects are fully written.

What ceph version?
Can you show osd config, "ceph daemon osd.0 config show"?
Can you show some "rbd info ecpool_hdd/rbd_name"?
For example, if bluestore_min_alloc_size_hdd is greater than ... it can cause
additional space consumption.
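
A couple of quick checks along these lines (the arithmetic only restates the
numbers above; the config query is a standard admin-socket call, run on the OSD
host):

# expected usage from object count * object size, and raw usage for EC 2+1
echo "7611672 * 4 / 1024 / 1024" | bc   # ~29 TiB of logical data
echo "scale=1; 29 * 3 / 2" | bc         # ~43.5 TiB raw for k=2,m=1
# allocation unit actually in use on an OSD
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd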


On 10/29/2018 11:50 PM, Виталий Филиппов wrote:

Is there a way to force OSDs to remove old data?


Hi

After I recreated one OSD + increased pg count of my erasure-coded 
(2+1) pool (which was way too low, only 100 for 9 osds) the cluster 
started to eat additional disk space.


First I thought that was caused by the moved PGs using additional 
space during unfinished backfills. I pinned most of new PGs to old 
OSDs via `pg-upmap` and indeed it freed some space in the cluster.


Then I reduced osd_max_backfills to 1 and started to remove upmap pins 
in small portions which allowed Ceph to finish backfills for these PGs.


HOWEVER, used capacity still grows! It drops after moving each PG, but 
still grows overall.


It has grown +1.3TB yesterday. In the same period of time clients have 
written only ~200 new objects (~800 MB, there are RBD images only).


Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were 
created by RBD clients yesterday, but used raw space increased +1.3 TB.


Additional question is why ceph df / rados df tells there is only 16 
TB actual data written, but it uses 29.8 TB (now 31 TB) of raw disk 
space. Shouldn't it be 16 / 2*3 = 24 TB ?


ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED
38 TiB 6.9 TiB   32 TiB 82.03
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672
rpool_hdd  15 9.2 MiB 0   515 GiB  92
fs_meta44  20 KiB 0   515 GiB  23
fs_data45 0 B 0   1.0 TiB   0

How to heal it?





--

Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif

Hi again.

It seems I've found the problem, although I don't understand the root 
cause.


I looked into OSD datastore using ceph-objectstore-tool and I see that 
for almost every object there are two copies, like:


2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:28#
2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:head#

And more interesting is the fact that these two copies don't differ (!).

So the space is taken up by the unneeded snapshot copies.

rbd_data.15.3d3e1d6b8b4567 is the prefix of the biggest (14 TB) base 
image we have. This image has 1 snapshot:


[root@sill-01 ~]# rbd info rpool_hdd/rms-201807-golden
rbd image 'rms-201807-golden':
size 14 TiB in 3670016 objects
order 22 (4 MiB objects)
id: 3d3e1d6b8b4567
data_pool: ecpool_hdd
block_name_prefix: rbd_data.15.3d3e1d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool

op_features:
flags:
create_timestamp: Tue Aug  7 13:00:10 2018
[root@sill-01 ~]# rbd snap ls rpool_hdd/rms-201807-golden
SNAPID NAME  SIZE TIMESTAMP
37 initial 14 TiB Tue Aug 14 12:42:48 2018

The problem is this image has NEVER been written to after importing it 
to Ceph with RBD. All writes go only to its clones.
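
(For completeness, the clones that do receive the writes can be listed from the
snapshot; image and snapshot names are the ones shown above:)

rbd children rpool_hdd/rms-201807-golden@initial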


So I have 2.. no, 5 questions:

1) Why base image snapshot is "provisioned" while the image isn't 
written to? May it be related to `rbd snap revert`? (i.e. does rbd snap 
revert just copy all snapshot data into the image itself?)


2) If all parent snapshots seem to be forcefully provisioned on write: 
Is there a way to disable this behaviour? Maybe if I make the base image 
readonly its snapshots will stop to be "provisioned"?


3) Even if there is no way to disable it: why does Ceph create extra 
copy of equal snapshot data during rebalance?


4) What's ":28" in rados objects? Snapshot id is 37. Even in hex 0x28 = 
40, not 37. Or does RADOS snapshot id not need to be equal to RBD 
snapshot ID?


5) Am I safe to "unprovision" the snapshot? (for example, by doing `rbd 
snap revert`?)
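
(One way to look at question 4 directly is to list the per-object clones and
their snap ids with rados; pool and object name are taken from the listing
above:)

rados -p ecpool_hdd listsnaps rbd_data.15.3d3e1d6b8b4567.00361a96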

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread David Turner
Pavan, which version of Ceph were you using when you changed your backend
to rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on centOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I got pulled away from this for a while.  The error in the log is
> "abort: Corruption: Snappy not supported or corrupted Snappy compressed
> block contents" and the OSD has 2 settings set to snappy by default,
> async_compressor_type and bluestore_compression_algorithm.  Do either of
> these settings affect the omap store?
>
> On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> Have a feel it’s expecting the compression to be enabled, can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner 
> Date: Tuesday, September 18, 2018 at 6:01 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Here's the [1] full log from the 

Re: [ceph-users] Removing MDS

2018-11-02 Thread Rhian Resnick
Morning, our backup of the metadata is 75% done (we are using rados cppool, as
the metadata export fails by using up all server memory). Before we start working
on fixing our metadata we wanted our proposed procedure to be reviewed.


Does the following sequence look correct for our environment?


  1.  rados cppool cephfs_metadata cephfs_metadata.bk
  2.  cephfs-journal-tool event recover_dentries summary --rank=0
  3.  cephfs-journal-tool event recover_dentries summary --rank=1
  4.  cephfs-journal-tool event recover_dentries summary --rank=2
  5.  cephfs-journal-tool event recover_dentries summary --rank=3
  6.  cephfs-journal-tool event recover_dentries summary --rank=4
  7.  cephfs-journal-tool journal reset --rank=0
  8.  cephfs-journal-tool journal reset --rank=1
  9.  cephfs-journal-tool journal reset --rank=2
  10. cephfs-journal-tool journal reset --rank=3
  11. cephfs-journal-tool journal reset --rank=4
  12. cephfs-table-tool all reset session
  13. Start metadata servers
  14. Scrub mds:
 *   ceph daemon mds.{hostname} scrub_path / recursive
 *   ceph daemon mds.{hostname} scrub_path / force
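
If ranks are still flagged as damaged after the journal work, they would also
need to be marked repaired before the MDS daemons can take them (this was
already attempted once, per the logs quoted below); a sketch only:

# see which ranks are flagged damaged, then clear the flag per damaged rank
ceph fs dump | grep -i damaged
ceph mds repaired 1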





Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222




From: Rhian Resnick
Sent: Thursday, November 1, 2018 10:32 AM
To: Patrick Donnelly
Cc: Ceph Users
Subject: Re: [ceph-users] Removing MDS


Morning all,


This has been a rough couple of days. We thought we had resolved all our
performance issues by moving the Ceph metadata to some high-intensity-write
disks from Intel, but what we didn't notice was that Ceph labeled them as HDDs
(thanks, Dell RAID controller).


We believe this caused read lock errors and resulted in the journal increasing 
from 700MB to 1 TB in 2 hours. (Basically over lunch) We tried to migrate and 
then stop everything before the OSD's reached full status but failed.


Over the last 12 hours the data has been migrated from the SSDs back to
spinning disks, but the MDS servers are now reporting that two ranks are damaged.


We are running a backup of the metadata pool but wanted to know what the list 
thinks the next steps should be. I have attached the errors we see in the logs
as well as our OSD Tree, ceph.conf (comments removed), and ceph fs dump.


Combined logs (after marking things as repaired to see if that would rescue us):


Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: mds.1 10.141.255.202:6898/1492854021 1 
: Error loading MDS rank 1: (22) Invalid argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914949 7f6dacd69700 
 0 mds.1.log _replay journaler got error -22, aborting
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)

Ceph OSD Status: (The missing and out OSDs are in a different pool from all
data; these were the bad SSDs that caused the issue)


  cluster:
id: 

Re: [ceph-users] Large omap objects - how to fix ?

2018-11-02 Thread Ben Morrice

Thanks for everyone's comments, including the thread hijackers :)

I solved this in our infrastructure slightly differently:

1) find largest omap(s)
# for i in `rados -p .bbp-gva-master.rgw.buckets.index ls`; do echo -n 
"$i:"; rados -p .bbp-gva-master.rgw.buckets.index listomapkeys $i |wc 
-l; done > omapkeys

# sort -t: -k2 -r -n omapkeys  |head -1
.dir.bbp-gva-master.125103342.18:7558822

2) confirm that the above index is not used by any buckets
# cat bucketstats
#!/bin/bash
for bucket in $(radosgw-admin bucket list | jq -r .[]); do
    bucket_id=$(radosgw-admin metadata get bucket:${bucket} | jq -r 
.data.bucket.bucket_id)
    marker=$(radosgw-admin metadata get bucket:${bucket} | jq -r 
.data.bucket.marker)

    echo "$bucket:$bucket_id:$marker"
done
# ./bucketstats > bucketstats.out
# grep 125103342.18 bucketstats.out

3) delete the rados object
rados -p .bbp-gva-master.rgw.buckets.index rm 
.dir.bbp-gva-master.125103342.18


4) perform a deep scrub on the PGs that were affected
# for i in `ceph pg ls-by-pool .bbp-gva-master.rgw.buckets.index | tail 
-n +2 | awk '{print $1}'`; do echo -n "$i: "; ceph pg $i query |grep 
num_large_omap_objects | head -1 | awk '{print $2}'; done | grep ": 1"

137.1b: 1
137.36: 1
# ceph pg deep-scrub 137.1b
# ceph pg deep-scrub 137.36
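
Once those deep-scrubs finish, the counter should drop and the warning clear; a
quick way to confirm (standard commands, nothing specific to this cluster):

ceph health detail
ceph pg 137.1b query | grep num_large_omap_objects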



Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 10/31/2018 11:02 AM, Alexandru Cucu wrote:

Hi,

Didn't know that auto resharding does not remove old instances. Wrote
my own script for cleanup as I've discovered this before reading your
message.
Not very well tested, but here it is:

for bucket in $(radosgw-admin bucket list | jq -r .[]); do
 bucket_id=$(radosgw-admin metadata get bucket:${bucket} | jq -r
.data.bucket.bucket_id)
 marker=$(radosgw-admin metadata get bucket:${bucket} | jq -r
.data.bucket.marker)
 for instance in $(radosgw-admin metadata list bucket.instance | jq
-r .[] | grep "^${bucket}:" | grep -v ${bucket_id} | grep -v ${marker}
| cut -f2 -d':'); do
  radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
  radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
 done
done
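
(A cautious way to preview what the loop above would touch before letting it
purge anything is to just list the instance metadata first; an untested sketch:)

radosgw-admin metadata list bucket.instance | jq -r .[] | sort | less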


On Tue, Oct 30, 2018 at 3:30 PM Tomasz Płaza  wrote:

Hi hijackers,

Please read: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030317.html

TL;DR: Ceph should reshard big indexes, but after that it leaves them to be 
removed manually. Starting from some version, deep-scrub reports indexes above 
some threshold as HEALTH_WARN. You should find it in the OSD logs. If you do not
have logs, just listomapkeys on every object in default.rgw.buckets.index and 
find the biggest ones... it should be safe to remove those (radosgw-admin bi 
purge) but I can not guarantee it.


On 26.10.2018 at 17:18, Florian Engelmann wrote:

Hi,

hijacking the hijacker! Sorry!

radosgw-admin bucket reshard --bucket somebucket --num-shards 8
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed manually ***
tenant:
bucket name: somebucket
old bucket instance id: cb1594b3-a782-49d0-a19f-68cd48870a63.1923153.1
new bucket instance id: cb1594b3-a782-49d0-a19f-68cd48870a63.3119759.1
total entries: 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000
13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000
26000 27000 28000 29000 30000 31000 32000 33000 34000 35000 36000 37000 38000
39000 40000 41000 42000 43000 44000 45000 46000 47000 48000 49000 50000 51000
52000 53000 54000 55000 56000 57000 58000 59000 60000 61000 62000 63000 64000
65000 66000 67000 68000 69000 70000 71000 72000 73000 74000 75000 76000 77000
78000 79000 80000 81000 82000 83000 84000 85000 86000 87000 88000 89000 90000
91000 92000 93000 94000 95000 96000 97000 98000 99000 100000 101000 102000
103000 104000 105000 106000 107000 108000 109000 110000 111000 112000 113000
114000 115000 116000 117000 118000 119000 120000 121000 122000 123000 124000
125000 126000 127000 128000 129000 130000 131000 132000 133000 134000 135000
136000 137000 138000 139000 140000 141000 142000 143000 144000 145000 146000
147000 148000 149000 150000 151000 152000 153000 154000 155000 156000 157000
158000 159000 160000 161000 162000 163000 164000 165000 166000 167000 168000
169000 170000 171000 172000 173000 174000 175000 176000 177000 178000 179000
180000 181000 182000 183000 184000 185000 186000 187000 188000 189000 190000
191000 192000 193000 194000 195000 196000 197000 198000 199000 200000 201000
202000 203000 204000 205000 206000 207000 207660

What to do now?

ceph -s is still:

 health: HEALTH_WARN
 1 large omap objects

But I have no idea how to:
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed