Re: [ceph-users] Newbie question: stretch ceph cluster

2018-02-16 Thread Alex Gorbachev
On Wed, Feb 14, 2018 at 3:20 AM Maged Mokhtar  wrote:

> Hi,
>
> You need to set the min_size to 2 in crush rule.
>
> The exact location and replication flow when a client writes data depend
> on the object name and the number of PGs. The CRUSH rule determines which OSDs
> will serve a PG; the first is the primary OSD for that PG. The client
> computes the PG from the object name and writes the object to the primary
> OSD for that PG; the primary OSD is then responsible for replicating to the
> other OSDs serving this PG. So for the same client, some objects will be
> sent to datacenter 1 and some to 2, and the OSDs will do the rest.
>
> The other point is how to set up monitors across 2 datacenters so that
> they can keep functioning if one goes down. This is tricky since monitors
> require an odd number and must form a quorum. This link is quite interesting;
> I am not sure if there are better ways to do it:
>
> https://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/
>
>
FYI, I had this reply from Vincent Godin; you can search the ML for the
full thread:

Hello

We have a similar design: two datacenters at short distance (sharing
the same level 2 network) and one datacenter at long range (more than
100 km) for our Ceph cluster. Let's call these sites A1, A2 and B.

We set 2 Mons on A1, 2 Mons on A2 and 1 Mon on B. A1 and A2 share the
same level 2 network. We need routing to connect to B.

We set an HSRP gateway on A1 & A2 to reach the B site. Let's call them
GwA1 and GwA2, with the default being GwA1.

We set an HSRP gateway on site B. Let's call them GwB1 and GwB2, with the
default being GwB1. GwB1 is connected to A1 and A2 via GwA1, and GwB2 is
connected to A1 and A2 via GwA2. We set a simple LACP between the GwB1 and
GwA1 ports and another between the GwB2 and GwA2 ports (so if the GwA1 port
goes down, the GwB1 port will go down too).

So if everything is OK, the Mon on site B can see all OSDs and Mons
on both sites A1 & A2 via GwB1, then GwA1. Quorum is reached and Ceph
is healthy.

If the A2 site is down, the Mon on site B can see all OSDs and Mons on
site A1 via GwB1, then GwA1. Quorum is reached and Ceph is available.

If the A1 site is down, both HSRPs will fail over. The Mon on site B will see
the Mons and OSDs of the A2 site via GwB2, then GwA2. Quorum is reached and
Ceph is still available.

If the L2 links between A1 & A2 are cut, the A2 site will be isolated.
The Mon on site B can see all OSDs and Mons on A1 via GwB1, then
GwA1, but cannot see the Mons and OSDs of the A2 site because of the link
failure. Quorum will be reached only on the A1 side, with 3 Mons, and
Ceph will still be available.
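
A couple of quick sanity checks for this kind of layout (the pool and object
names below are only placeholders, not from the setup above):

# Which PG and OSDs does a given object map to under the current CRUSH rules?
ceph osd map rbd some-object-name

# Which monitors are currently in quorum, and who is the leader?
ceph quorum_status --format json-pretty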






> Maged
>
> On 2018-02-14 04:12, ST Wong (ITSC) wrote:
>
> Hi,
>
> Thanks for your advice,
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Luis Periquito
> Sent: Friday, February 09, 2018 11:34 PM
> To: Kai Wagner
> Cc: Ceph Users
> Subject: Re: [ceph-users] Newbie question: stretch ceph cluster
>
> On Fri, Feb 9, 2018 at 2:59 PM, Kai Wagner  wrote:
>
> Hi and welcome,
>
>
> On 09.02.2018 15:46, ST Wong (ITSC) wrote:
>
> Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR feature.
> We've got 2 data centers in the same campus, connected at 10Gb. I wonder if
> it's possible to set up a Ceph cluster with the following components in each
> data center:
>
>
> 3 x mon + mds + mgr
>
> In this scenario you wouldn't be any better off, as losing a room means
> losing half of your cluster. Can you run the MONs somewhere else that would
> be able to continue if you lose one of the rooms?
>
>
> Will it be okay to have 3 x MON per DC so that we still have 3 x MON in
> case of losing 1 DC?  Or do we need more in case of a double fault - losing
> 1 DC plus a failure of any MON in the remaining DC would make the cluster
> stop working?
>
>
> As for MGR and MDS they're (recommended) active/passive; so one per room
> would be enough.
>
>
> 3 x OSD (replicated factor=2, between data center)
>
>
> replicated with size=2 is a bad idea. You can have size=4 and
> min_size=2 and have a crush map with rules something like:
>
>
>
> rule crosssite {
> id 0
> type replicated
> min_size 4
> max_size 4
> step take default
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
>
> this will store 4 copies: 2 different hosts in each of 2 different rooms.
>
>
> Does it mean that for a new data write to hostA:roomA, replication will take
> place as follows?
> 1. from hostA:roomA to hostB:roomA
> 2. from hostA:roomA to hostA:roomB
> 3. from hostB:roomA to hostB:roomB
>
> If it works this way, can the copy in step 3 be skipped, so that each piece
> of data has 3 replicas - the original, a replica in the same room, and a
> replica in the other room - in order to save some space?
>
> Besides, I would also like to ask whether it's correct that the cluster will
> continue to work (degraded) if one room is lost?
>
> Will there 

Re: [ceph-users] rgw bucket inaccessible - appears to be using incorrect index pool?

2018-02-16 Thread Robin H. Johnson
On Fri, Feb 16, 2018 at 07:06:21PM -0600, Graham Allan wrote:
[snip great debugging]

This seems similar to two open issues; it could be either of them depending
on how old that bucket is.
http://tracker.ceph.com/issues/22756
http://tracker.ceph.com/issues/22928

- I have a mitigation posted to 22756.
- There's a PR posted for 22928, but it'll probably only be in v12.2.4.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rgw bucket inaccessible - appears to be using incorrect index pool?

2018-02-16 Thread Graham Allan
Sorry to be posting a second mystery at the same time - though this 
feels unconnected to my other one.


We had a user complain that they can't list the contents of one of their 
buckets (they can access certain objects within the bucket).


I started by running a simple command to get data on the bucket:


root@cephmon1:~# radosgw-admin bucket stats --bucket=mccuelab
error getting bucket stats ret=-2


Not encouraging, but it struck a memory... we had a bucket some time ago
which had 32 index shards while the metadata showed num_shards=0, and it gave
the same error. So, looking at the bucket metadata...



root@cephmon1:~# radosgw-admin metadata get 
bucket.instance:mccuelab:default.2049236.2
{
"key": "bucket.instance:mccuelab:default.2049236.2",
"ver": {
"tag": "_pOR6OLmXKQxYuFBa0E-eEmK",
"ver": 17
},
"mtime": "2018-02-15 17:50:28.225135Z",
"data": {
"bucket_info": {
"bucket": {
"name": "mccuelab",
"marker": "default.2049236.2",
"bucket_id": "default.2049236.2",
"tenant": "",
"explicit_placement": {
"data_pool": ".rgw.buckets",
"data_extra_pool": "",
"index_pool": ".rgw.buckets"
}
},
"creation_time": "0.00",
"owner": "uid=12093",
"flags": 0,
"zonegroup": "default",
"placement_rule": "",
"has_instance_obj": "true",
"quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1024,
"max_size_kb": 0,
"max_objects": -1
},
"num_shards": 32,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": 
"AgKxAwImCQAAAHVpZD0xMjA5MxUAAABSb2JlcnQgSmFtZXMgU2NoYWVmZXIEA38BAQkAAAB1aWQ9MTIwOTMPAQkAAAB1aWQ9MTIwOTMFA0oCAgQACQAAAHVpZD0xMjA5MwAAAgIEDwAAABUAAABSb2JlcnQgSmFtZXMgU2NoYWVmZXIA"
},
{
"key": "user.rgw.idtag",
"val": ""
}
]
}
}


But, the index pool doesn't contain all of these - only 15 shards:


root@cephmon1:~# rados -p .rgw.buckets.index ls - | grep "default.2049236.2"
.dir.default.2049236.2.22
.dir.default.2049236.2.3
.dir.default.2049236.2.10
.dir.default.2049236.2.31
.dir.default.2049236.2.12
.dir.default.2049236.2.0
.dir.default.2049236.2.18
.dir.default.2049236.2.13
.dir.default.2049236.2.16
.dir.default.2049236.2.11
.dir.default.2049236.2.23
.dir.default.2049236.2.17
.dir.default.2049236.2.9
.dir.default.2049236.2.29
.dir.default.2049236.2.24



But, wait a minute - I wasn't reading carefully. This is a really old
bucket... the data pool and index pool are both set to .rgw.buckets.


OK, so I check for index objects in the .rgw.buckets pool, where shards 
0..31 are all present - that's good.
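
(For the record, the check was presumably just the same listing against the
data pool, something along these lines - same bucket marker as above:)

rados -p .rgw.buckets ls | grep '^\.dir\.default\.2049236\.2\.' | sort -V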


so why do any index objects even exist in the .rgw.buckets.index pool...?

I set debug rgw=1 and debug ms=1 and ran "radosgw-admin bi list"... amongst a
lot of other output I see...

- a query to osd 204 pg 100.1c;
- it finds and lists the first entry from ".dir.default.2049236.2.0"
- then a query to osd 164 pg 100.3d
- which returns "file not found" for ".dir.default.2049236.2.1"...
- consistent with shard #0 existing in .rgw.buckets.index, but not #1.
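
(One way to reproduce that run - the debug switches are standard Ceph options
and can be passed on the command line:)

radosgw-admin bi list --bucket=mccuelab --debug-rgw=1 --debug-ms=1 2>&1 | less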


2018-02-16 18:13:10.405545 7f0f539beb80  1 -- 10.32.16.93:0/3172453804 --> 
10.31.0.65:6812/58901 -- osd_op(unknown.0.0:97 100.1c 
100:3a18c885:::.dir.default.2049236.2.0:head [call rgw.bi_list] snapc 0=[] 
ondisk+read+known_if_redirected e507701) v8 -- 0x7f0f55921f90 con 0
2018-02-16 18:13:10.410902 7f0f40a2f700  1 -- 10.32.16.93:0/3172453804 <== 
osd.204 10.31.0.65:6812/58901 1  osd_op_reply(97 .dir.default.2049236.2.0 
[call] v0'0 uv1416558 ondisk = 0) v8  168+0+317 (1847036665 0 2928341486) 
0x7f0f3403d510 con 0x7f0f55925720[
{
"type": "plain",
"idx": "durwa004/Copenhagen_bam_files_3.tar.xz",
"entry": {
"name": "durwa004/Copenhagen_bam_files_3.tar.xz",
"instance": "",
"ver": {
"pool": 23,
"epoch": 179629
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 291210535540,
"mtime": "2018-02-09 04:59:43.869899Z",
"etag": "e75dc95f44944fe9df6a102c809566be-272",
"owner": "uid=12093",

Re: [ceph-users] High Load and High Apply Latency

2018-02-16 Thread John Petrini
I thought I'd follow up on this just in case anyone else experiences
similar issues. We ended up increasing the tcmalloc thread cache size and
saw a huge improvement in latency. This got us out of the woods because we
were finally in a state where performance was good enough that it was no
longer impacting services.
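
(One common way to bump the thread cache, in case it helps anyone - the file
location and the 128 MB value below are only examples, not necessarily what we
used:)

# /etc/sysconfig/ceph on RHEL/CentOS, /etc/default/ceph on Debian/Ubuntu
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

# then restart the OSDs on that host for it to take effect, e.g.
systemctl restart ceph-osd.target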

The tcmalloc issues are pretty well documented on this mailing list and I
don't believe they impact newer versions of Ceph but I thought I'd at least
give a data point. After making this change our average apply latency
dropped to 3.46ms during peak business hours. To give you an idea of how
significant that is here's a graph of the apply latency prior to the
change: https://imgur.com/KYUETvD

This however did not resolve all of our issues. We were still seeing high
iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
disks. We tried replacing the RAID controller (PERC H730) on these nodes
and while this resolved the issue on one server the two others remained
problematic. These two nodes were configured differently than the rest.
They'd been configured in non-raid mode while the others were configured as
individual raid-0. This turned out to be the problem. We ended up removing
the two nodes one at a time and rebuilding them with their disks configured
in independent raid-0 instead of non-raid. After this change iowait rarely
spikes above 15ms and averages <1ms.
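
(If you want to watch for the same symptom, per-disk latency and utilisation
can be followed live with iostat from the sysstat package:)

iostat -x 5    # watch the await and %util columns per device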

I was really surprised at the performance impact when using non-raid mode.
While I realize non-raid bypasses the controller cache I still would have
never expected such high latency. Dell has a whitepaper that recommends
using individual raid-0 but their own tests show only a small performance
advantage over non-raid. Note that we are running SAS disks; they actually
recommend non-raid mode for SATA, but I have not tested this. You can view
the whitepaper here:
http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download

I hope this helps someone.

John Petrini
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Signature check failures.

2018-02-16 Thread Gregory Farnum
On Thu, Feb 15, 2018 at 10:28 AM Cary  wrote:

> Hello,
>
> I have enabled debugging on my MONs and OSDs to help troubleshoot
> these signature check failures. I was watching ods.4's log and saw
> these errors when the signature check failure happened.
>
> 2018-02-15 18:06:29.235791 7f8bca7de700  1 --
> 192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
> conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).read_bulk peer
> close file descriptor 81
> 2018-02-15 18:06:29.235832 7f8bca7de700  1 --
> 192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
> conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).read_until read
> failed
> 2018-02-15 18:06:29.235841 7f8bca7de700  1 --
> 192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
> conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).process read
> tag failed
> 2018-02-15 18:06:29.235848 7f8bca7de700  1 --
> 192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
> conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).fault on lossy
> channel, failing
> 2018-02-15 18:06:29.235966 7f8bc0853700  2 osd.8 27498 ms_handle_reset
> con 0x55f802746000 session 0x55f8063b3180
>
>
>  Could someone please look at this? We have 3 different Ceph clusters
> setup and they all have this issue. This cluster is running Gentoo and
> Ceph version 12.2.2-r1. The other two clusters are 12.2.2. Exporting
> images causes signature check failures and with larger files it seg
> faults as well.
>
> When exporting the image from osd.4 This message shows up as well.
> Exporting image: 1% complete...2018-02-15 18:14:05.283708 7f6834277700
>  0 -- 192.168.173.44:0/122241099 >> 192.168.173.44:6801/72152
> conn(0x7f681400ff10 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH
> pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
>
> The error below show up on all OSD/MGR/MON nodes when exporting an image.
> Exporting image: 8% complete...2018-02-15 18:15:51.419437 7f2b64ac0700
>  0 SIGN: MSG 28 Message signature does not match contents.
> 2018-02-15 18:15:51.419459 7f2b64ac0700  0 SIGN: MSG 28Signature on
> message:
> 2018-02-15 18:15:51.419460 7f2b64ac0700  0 SIGN: MSG 28sig:
> 8338581684421737157
> 2018-02-15 18:15:51.419469 7f2b64ac0700  0 SIGN: MSG 28Locally
> calculated signature:
> 2018-02-15 18:15:51.419470 7f2b64ac0700  0 SIGN: MSG 28
> sig_check:5913182128308244
> 2018-02-15 18:15:51.419471 7f2b64ac0700  0 Signature failed.
> 2018-02-15 18:15:51.419472 7f2b64ac0700  0 --
> 192.168.173.44:0/3919097436 >> 192.168.173.44:6801/72152
> conn(0x7f2b4800ff10 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH
> pgs=39 cs=1 l=1).process Signature check failed
>
> Our VMs crash when writing to disk. Libvirt's logs just say the VM
> crashed.   This is a blocker. Has anyone else seen this? This seems to
> be an issue with Ceph Luminous, as we were not having these problem
> with Jewel.
>

When I search through my email, the only two reports of failed signatures
are from people who in fact had misconfiguration issues resulting in one end
using signatures and the other side not.
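
A quick way to compare the two ends is to dump the cephx signing settings on
each daemon and client and check that they agree - a sketch, using one of your
OSDs as an example (run on the host where that OSD's admin socket lives):

ceph daemon osd.4 config show | egrep 'cephx_(sign_messages|.*require_signatures)'
# relevant options: cephx_require_signatures, cephx_cluster_require_signatures,
#                   cephx_service_require_signatures, cephx_sign_messages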

Given that, and since you're on Gentoo and presumably compiled the packages
yourself, the most likely explanation I can think of is that something went
wrong with your packages or the compilation. :/

I guess you could try switching from libnss to libcryptopp (or vice versa)
by recompiling with the relevant makeflags if you want to do something that
only involves the Ceph code. Otherwise, do a rebuild?

Sadly I don't think there's much else we can suggest given that nobody has
seen this with binary packages blessed by the upstream or a distribution.
-Greg


>
> Cary
> -Dynamic
>
> On Thu, Feb 1, 2018 at 7:04 PM, Cary  wrote:
> > Hello,
> >
> > I did not do anything special that I know of. I was just exporting an
> > image from Openstack. We have recently upgraded from Jewel 10.2.3 to
> > Luminous 12.2.2.
> >
> > Caps for admin:
> > client.admin
> > key: CENSORED
> > auid: 0
> > caps: [mgr] allow *
> > caps: [mon] allow *
> > caps: [osd] allow *
> >
> > Caps for Cinder:
> > client.cinder
> > key: CENSORED
> > caps: [mgr] allow r
> > caps: [mon] profile rbd, allow command "osd blacklist"
> > caps: [osd] profile rbd pool=vms, profile rbd pool=volumes,
> > profile rbd pool=images
> >
> > Caps for MGR:
> > mgr.0
> > key: CENSORED
> > caps: [mon] allow *
> >
> > I believe this is causing the virtual machines we have running to
> > crash. Any advice would be appreciated. Please let me know if I need
> > to provide any other details. Thank you,
> >
> > Cary
> > -Dynamic
> >
> > On Mon, Jan 29, 2018 at 7:53 PM, Gregory Farnum 
> wrote:
> >> On Fri, Jan 26, 2018 at 12:14 PM Cary  wrote:
> >>>
> >>> Hello,
> >>>
> >>>  We are running Luminous 12.2.2. 6 OSD hosts with 12 1TB OSDs, and 64GB
> >>> RAM. Each host has a SSD for 

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
I checked the disk that the monitor is on with smartctl and it didn't return
any errors, and it doesn't have any Current_Pending_Sectors.
Do you recommend any other disk checks to make sure that this disk really has a
problem, so that I can send the report to the provider and have the disk replaced?
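
(The checks I'm considering, in case someone can confirm they are sensible -
assuming the device is a plain SATA/SAS disk at /dev/sdX, adjust the name:)

smartctl -a /dev/sdX            # full SMART attributes and error log
smartctl -t long /dev/sdX       # kick off an extended self-test...
smartctl -l selftest /dev/sdX   # ...and read the result once it finishes
badblocks -sv /dev/sdX          # read-only surface scan (slow, but non-destructive)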

On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum  wrote:

> The disk that the monitor is on...there isn't anything for you to
> configure about a monitor WAL though so I'm not sure how that enters into
> it?
>
> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> Thanks for your reply
>>
>> Do you mean, that's the problem with the disk I use for WAL and DB?
>>
>> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum 
>> wrote:
>>
>>>
>>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 Hi there,

 I have a Ceph cluster version 12.2.2 on CentOS 7.

 It is a testing cluster and I have set it up 2 weeks ago.
 after some days, I see that one of the three mons has stopped(out of
 quorum) and I can't start it anymore.
 I checked the mon service log and the output shows this error:

 """
 mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
 store state
 rocksdb: submit_transaction_sync error: Corruption: block checksum
 mismatch

>>>
>>> This bit is the important one. Your disk is bad and it’s feeding back
>>> corrupted data.
>>>
>>>
>>>
>>>
 code = 2 Rocksdb transaction:
  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
 LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
 MonitorDBStore::clear(std::set&)' thread
 7f45a1e52e40 time 2018-02-16 17:37:07.040846
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/
 ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILE
 D assert(r >= 0)
 """

 the only solution I found is to remove this mon from quorum and remove
 all mon data and re-add this mon to quorum again.
 and ceph goes to the healthy status again.

 but now after some days this mon has stopped and I face the same
 problem again.

 My cluster setup is:
 4 osd hosts
 total 8 osds
 3 mons
 1 rgw

 this cluster has setup with ceph-volume lvm and wal/db separation on
 logical volumes.

 Best regards,
 Behnam Loghmani


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:

> On 02/16/2018 12:31 PM, Graham Allan wrote:
> >
> > If I set debug rgw=1 and demug ms=1 before running the "object stat"
> > command, it seems to stall in a loop of trying communicate with osds for
> > pool 96, which is .rgw.control
> >
> >> 10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
> >> osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
> >> cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
> >> e507695) v8 -- 0x7f10ac033610 con 0
> >> 10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
> >> osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
> >> uv3933745 ondisk = 0) v8  152+0+0 (2536111836 <(253)%20611-1836> 0
> 0) 0x7f1158003e20
> >> con 0x7f117afd8390
> >
> > Prior to that, probably more relevant, this was the only communication
> > logged with the primary osd of the pg:
> >
> >> 10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
> >> osd_op(unknown.0.0:96 70.438s0
> >> 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
> >> [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
> >> v8 -- 0x7fab79889fa0 con 0
> >> 10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
> >> osd_backoff(70.438s0 block id 1
> >>
> [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
> >> e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
> >> 0x7fab79885980
> >> 210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
> >> osd_backoff(70.438s0 ack-block id 1
> >>
> [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
> >> e507695) v1 -- 0x7fab48065420 con 0
> >
> > so I guess the backoff message above is saying the object is
> > unavailable. OK, that certainly makes sense. Not sure that it helps me
> > understand how to fix the inconsistencies
>
> If I restart the primary osd for the pg, that makes it forget its state
> and return to active+clean+inconsistent. I can then download the
> previously-unfound objects again, as well as run "radosgw-admin object
> stat".
>
> So the interesting bit is probably figuring out why it decides these
> objects are unfound, when clearly they aren't.
>
> What would be the best place to enable additional logging to understand
> this - perhaps the primary osd?
>

David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg


>
> Thanks for all your help,
>
> Graham
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
The disk that the monitor is on...there isn't anything for you to configure
about a monitor WAL though so I'm not sure how that enters into it?

On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani 
wrote:

> Thanks for your reply
>
> Do you mean, that's the problem with the disk I use for WAL and DB?
>
> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum 
> wrote:
>
>>
>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>> behnam.loghm...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>>
>>> It is a testing cluster and I have set it up 2 weeks ago.
>>> after some days, I see that one of the three mons has stopped(out of
>>> quorum) and I can't start it anymore.
>>> I checked the mon service log and the output shows this error:
>>>
>>> """
>>> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
>>> store state
>>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>>> mismatch
>>>
>>
>> This bit is the important one. Your disk is bad and it’s feeding back
>> corrupted data.
>>
>>
>>
>>
>>> code = 2 Rocksdb transaction:
>>>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
>>> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
>>> MonitorDBStore::clear(std::set&)' thread
>>> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
>>> 581: FAILE
>>> D assert(r >= 0)
>>> """
>>>
>>> the only solution I found is to remove this mon from quorum and remove
>>> all mon data and re-add this mon to quorum again.
>>> and ceph goes to the healthy status again.
>>>
>>> but now after some days this mon has stopped and I face the same problem
>>> again.
>>>
>>> My cluster setup is:
>>> 4 osd hosts
>>> total 8 osds
>>> 3 mons
>>> 1 rgw
>>>
>>> this cluster has setup with ceph-volume lvm and wal/db separation on
>>> logical volumes.
>>>
>>> Best regards,
>>> Behnam Loghmani
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Orphaned entries in Crush map

2018-02-16 Thread David Turner
First you stop the service, then make sure they're down, out, crush remove,
auth del, and finally osd rm.  You had it almost in the right order, but
you were marking them down and out before you stopped them.  That would allow
them to mark themselves back up and in.  The down and out commands don't
need the 'osd.', just the ${n}.
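
As a sketch, the per-OSD order would look like this (same ID list as yours):

for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do
  systemctl stop ceph-osd@$n.service
  ceph osd down $n
  ceph osd out $n
  ceph osd crush remove osd.$n
  ceph auth del osd.$n
  ceph osd rm $n
done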

In any case, by this point the cluster definitely believes them to be down,
out, and removed from the cluster.  I swear I remember having phantom
devices in my crush map like this before, but I thought it was because the
osd hadn't been rm'd from the cluster... which doesn't seem to be the case.

Does anyone else have any thoughts?

On Fri, Feb 16, 2018 at 4:22 PM Karsten Becker 
wrote:

> Here is what I did - bash history:
>
> >  1897  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd down osd.$n; done
> >  1920  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd out osd.$n; done
> >  1921  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd down osd.$n; done
> >  1923  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do systemctl stop ceph-osd@$n.service; done
> >  1925  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd crush remove osd.${n}; done
> >  1926  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph auth del osd.${n}; done
> >  1927  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd rm ${n}; done
>
> I assume that I did the right steps...
>
>
>
>
>
> On 16.02.2018 21:56, David Turner wrote:
> > What is the output of `ceph osd stat`?  My guess is that they are still
> > considered to be part of the cluster and going through the process of
> > removing OSDs from your cluster is what you need to do.  In particular
> > `ceph osd rm 19`.
> >
> > On Fri, Feb 16, 2018 at 2:31 PM Karsten Becker
> > > wrote:
> >
> > Hi.
> >
> > during the reorgainzation of my cluster I removed some OSDs.
> Obviously
> > something went wrong for 2 of them, osd.19 and osd.20.
> >
> > If I get my current Crush map, decompile and edit them, I see 2
> > orphaned/stale entries for the former OSDs:
> >
> > > device 16 osd.16 class hdd
> > > device 17 osd.17 class hdd
> > > device 18 osd.18 class hdd
> > > device 19 device19
> > > device 20 device20
> > > device 21 osd.21 class hdd
> > > device 22 osd.22 class hdd
> > > device 23 osd.23 class hdd
> >
> > If I delete them from the Crush map (file), recompile it and set it
> > productive - they appear again... if I get the current map again and
> > decompile them, they are in again.
> >
> > So how to get rid of these entries?
> >
> > Best from Berlin/Germany
> > Karsten
> >
> > Ecologic Institut gemeinnuetzige GmbH
> > Pfalzburger Str. 43/44, D-10717 Berlin
> > Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
> > Sitz der Gesellschaft / Registered Office: Berlin (Germany)
> > Registergericht / Court of Registration: Amtsgericht Berlin
> > (Charlottenburg), HRB 57947
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> Ecologic Institut gemeinnuetzige GmbH
> Pfalzburger Str. 43/44, D-10717 Berlin
> Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
> Sitz der Gesellschaft / Registered Office: Berlin (Germany)
> Registergericht / Court of Registration: Amtsgericht Berlin
> (Charlottenburg), HRB 57947
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Orphaned entries in Crush map

2018-02-16 Thread Karsten Becker
Here is what I did - bash history:

>  1897  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd down osd.$n; done
>  1920  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd out osd.$n; done
>  1921  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd down osd.$n; done
>  1923  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do systemctl stop ceph-osd@$n.service; done
>  1925  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd crush remove osd.${n}; done
>  1926  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph auth del osd.${n}; done
>  1927  for n in 6 7 14 15 16 17 18 19 3 9 10 11 12 20; do ceph osd rm ${n}; done

I assume that I did the right steps...





On 16.02.2018 21:56, David Turner wrote:
> What is the output of `ceph osd stat`?  My guess is that they are still
> considered to be part of the cluster and going through the process of
> removing OSDs from your cluster is what you need to do.  In particular
> `ceph osd rm 19`.
> 
> On Fri, Feb 16, 2018 at 2:31 PM Karsten Becker
> > wrote:
> 
> Hi.
> 
> during the reorganization of my cluster I removed some OSDs. Obviously
> something went wrong for 2 of them, osd.19 and osd.20.
> 
> If I get my current Crush map, decompile and edit them, I see 2
> orphaned/stale entries for the former OSDs:
> 
> > device 16 osd.16 class hdd
> > device 17 osd.17 class hdd
> > device 18 osd.18 class hdd
> > device 19 device19
> > device 20 device20
> > device 21 osd.21 class hdd
> > device 22 osd.22 class hdd
> > device 23 osd.23 class hdd
> 
> If I delete them from the Crush map (file), recompile it and set it
> productive - they appear again... if I get the current map again and
> decompile them, they are in again.
> 
> So how to get rid of these entries?
> 
> Best from Berlin/Germany
> Karsten
> 
> Ecologic Institut gemeinnuetzige GmbH
> Pfalzburger Str. 43/44, D-10717 Berlin
> Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
> Sitz der Gesellschaft / Registered Office: Berlin (Germany)
> Registergericht / Court of Registration: Amtsgericht Berlin
> (Charlottenburg), HRB 57947
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
That sounds like a good next step.  Start with OSDs involved in the longest
blocked requests.  Wait a couple minutes after the osd marks itself back up
and continue through them.  Hopefully things will start clearing up so that
you don't need to mark all of them down.  There are usually only a couple of
OSDs holding everything up.
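
Something along these lines, using the health output to pick the worst
offender first (the OSD ID is just an example from your earlier list):

ceph health detail | grep 'stuck requests'   # find the OSDs with the oldest blocked requests
ceph osd down 103                            # mark one down; it should re-assert itself shortly
ceph -s                                      # watch whether peering starts to progress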

On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister 
wrote:

> Thanks David,
>
>
>
> Taking the list of all OSDs that are stuck reports that a little over 50%
> of all OSDs are in this condition.  There isn’t any discernable pattern
> that I can find and they are spread across the three servers.  All of the
> OSDs are online as far as the service is concern.
>
>
>
>
> I have also taken all PGs that were reported the health detail output and
> looked for any that report “peering_blocked_by” but none do, so I can’t
> tell if any OSD is actually blocking the peering operation.
>
>
>
> As suggested, I got a report of all peering PGs:
>
> [root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering
> | sort -k13
>
> pg 14.fe0 is stuck peering since forever, current state peering, last
> acting [104,94,108]
>
> pg 14.fe0 is stuck unclean since forever, current state peering, last
> acting [104,94,108]
>
> pg 14.fbc is stuck peering since forever, current state peering, last
> acting [110,91,0]
>
> pg 14.fd1 is stuck peering since forever, current state peering, last
> acting [130,62,111]
>
> pg 14.fd1 is stuck unclean since forever, current state peering, last
> acting [130,62,111]
>
> pg 14.fed is stuck peering since forever, current state peering, last
> acting [32,33,82]
>
> pg 14.fed is stuck unclean since forever, current state peering, last
> acting [32,33,82]
>
> pg 14.fee is stuck peering since forever, current state peering, last
> acting [37,96,68]
>
> pg 14.fee is stuck unclean since forever, current state peering, last
> acting [37,96,68]
>
> pg 14.fe8 is stuck peering since forever, current state peering, last
> acting [45,31,107]
>
> pg 14.fe8 is stuck unclean since forever, current state peering, last
> acting [45,31,107]
>
> pg 14.fc1 is stuck peering since forever, current state peering, last
> acting [59,124,39]
>
> pg 14.ff2 is stuck peering since forever, current state peering, last
> acting [62,117,7]
>
> pg 14.ff2 is stuck unclean since forever, current state peering, last
> acting [62,117,7]
>
> pg 14.fe4 is stuck peering since forever, current state peering, last
> acting [84,55,92]
>
> pg 14.fe4 is stuck unclean since forever, current state peering, last
> acting [84,55,92]
>
> pg 14.fb0 is stuck peering since forever, current state peering, last
> acting [94,30,38]
>
> pg 14.ffc is stuck peering since forever, current state peering, last
> acting [96,53,70]
>
> pg 14.ffc is stuck unclean since forever, current state peering, last
> acting [96,53,70]
>
>
>
> Some have common OSDs but some OSDs only listed once.
>
>
>
> Should I try just marking OSDs with stuck requests down to see if that
> will re-assert them?
>
>
>
> Thanks!!
>
> -Bryan
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Friday, February 16, 2018 2:51 PM
>
>
> *To:* Bryan Banister 
> *Cc:* Bryan Stillwell ; Janne Johansson <
> icepic...@gmail.com>; Ceph Users 
> *Subject:* Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
>
>
>
> *Note: External Email*
> --
>
> The questions I definitely know the answer to first, and then we'll
> continue from there.  If an OSD is blocking peering but is online, when you
> mark it as down in the cluster it receives a message in it's log saying it
> was wrongly marked down and tells the mons it is online.  That gets it to
> stop what it was doing and start talking again.  I referred to that as
> re-asserting.  If the OSD that you marked down doesn't mark itself back up
> within a couple minutes, restarting the OSD might be a good idea.  Then
> again actually restarting the daemon could be bad because the daemon is
> doing something.  With as much potential for places to work with to get
> things going, actually restarting the daemons is probably something I would
> wait to do for now.
>
>
>
> The reason the cluster doesn't know anything about the PG is because it's
> still creating and hasn't actually been created.  Starting with some of the
> OSDs that you see with blocked requests would be a good idea.  Eventually
> you'll down an OSD that when it comes back up things start looking much
> better as things start peering and getting better.  Below are the list of
> OSDs you had from a previous email that if they're still there with stuck
> requests then they'll be good to start doing this to.  On closer review,
> it's almost all of them... but you have to start somewhere.  Another
> possible place to start with these is 

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
Thanks David,

Taking the list of all OSDs that are stuck shows that a little over 50% of
all OSDs are in this condition.  There isn’t any discernible pattern that I can
find and they are spread across the three servers.  All of the OSDs are online
as far as the service is concerned.

I have also taken all PGs that were reported in the health detail output and
looked for any that report “peering_blocked_by”, but none do, so I can’t tell if
any OSD is actually blocking the peering operation.

As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort 
-k13
pg 14.fe0 is stuck peering since forever, current state peering, last 
acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last 
acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last 
acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last 
acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last 
acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last 
acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last 
acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last 
acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last 
acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last 
acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last 
acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last 
acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last 
acting [62,117,7]
pg 14.ff2 is stuck unclean since forever, current state peering, last 
acting [62,117,7]
pg 14.fe4 is stuck peering since forever, current state peering, last 
acting [84,55,92]
pg 14.fe4 is stuck unclean since forever, current state peering, last 
acting [84,55,92]
pg 14.fb0 is stuck peering since forever, current state peering, last 
acting [94,30,38]
pg 14.ffc is stuck peering since forever, current state peering, last 
acting [96,53,70]
pg 14.ffc is stuck unclean since forever, current state peering, last 
acting [96,53,70]

Some have common OSDs, but some OSDs are only listed once.
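
(To count how often each OSD shows up in those acting sets, a quick-and-dirty
pipeline like this works:)

ceph health detail | grep 'stuck peering' | grep -o '\[[0-9,]*\]' \
  | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn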

Should I try just marking OSDs with stuck requests down to see if that will 
re-assert them?

Thanks!!
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 2:51 PM
To: Bryan Banister 
Cc: Bryan Stillwell ; Janne Johansson 
; Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email

The questions I definitely know the answer to first, and then we'll continue 
from there.  If an OSD is blocking peering but is online, when you mark it as 
down in the cluster it receives a message in it's log saying it was wrongly 
marked down and tells the mons it is online.  That gets it to stop what it was 
doing and start talking again.  I referred to that as re-asserting.  If the OSD 
that you marked down doesn't mark itself back up within a couple minutes, 
restarting the OSD might be a good idea.  Then again actually restarting the 
daemon could be bad because the daemon is doing something.  With as much 
potential for places to work with to get things going, actually restarting the 
daemons is probably something I would wait to do for now.

The reason the cluster doesn't know anything about the PG is because it's still 
creating and hasn't actually been created.  Starting with some of the OSDs that 
you see with blocked requests would be a good idea.  Eventually you'll down an 
OSD that when it comes back up things start looking much better as things start 
peering and getting better.  Below are the list of OSDs you had from a previous 
email that if they're still there with stuck requests then they'll be good to 
start doing this to.  On closer review, it's almost all of them... but you have 
to start somewhere.  Another possible place to start with these is to look at a 
list of all of the peering PGs and see if there are any common OSDs when you 
look at all of them at once.  Some patterns may emerge and would be good 
options to try.

osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 
5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have 
stuck requests > 134218 sec
osds 
4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132
 have stuck requests > 268435 sec


On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister 

Re: [ceph-users] Orphaned entries in Crush map

2018-02-16 Thread Karsten Becker
Hi David.

So far everything else is fine.

> 46 osds: 46 up, 46 in; 1344 remapped pgs

And the rm gives:

> root@kong[/0]:~ # ceph osd rm 19
> osd.19 does not exist. 
> root@kong[/0]:~ # ceph osd rm 20
> osd.20 does not exist.

The "devices" do NOT show up in "ceph osd tree" or "ceph osd df"... just
in the map.

If I do NOT delete them out of the crush map, compile and set it active,
I get
> 2 osds exist in the crush map but not in the osdmap
In that case they also do NOT show up in "ceph osd tree" or "ceph osd df".

:-(
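
(For completeness, the edit cycle I'm using is the usual one:)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt, i.e. remove the device19/device20 lines
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new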





On 16.02.2018 21:56, David Turner wrote:
> What is the output of `ceph osd stat`?  My guess is that they are still
> considered to be part of the cluster and going through the process of
> removing OSDs from your cluster is what you need to do.  In particular
> `ceph osd rm 19`.
> 
> On Fri, Feb 16, 2018 at 2:31 PM Karsten Becker
> > wrote:
> 
> Hi.
> 
> during the reorganization of my cluster I removed some OSDs. Obviously
> something went wrong for 2 of them, osd.19 and osd.20.
> 
> If I get my current Crush map, decompile and edit them, I see 2
> orphaned/stale entries for the former OSDs:
> 
> > device 16 osd.16 class hdd
> > device 17 osd.17 class hdd
> > device 18 osd.18 class hdd
> > device 19 device19
> > device 20 device20
> > device 21 osd.21 class hdd
> > device 22 osd.22 class hdd
> > device 23 osd.23 class hdd
> 
> If I delete them from the Crush map (file), recompile it and set it
> productive - they appear again... if I get the current map again and
> decompile them, they are in again.
> 
> So how to get rid of these entries?
> 
> Best from Berlin/Germany
> Karsten
> 
> Ecologic Institut gemeinnuetzige GmbH
> Pfalzburger Str. 43/44, D-10717 Berlin
> Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
> Sitz der Gesellschaft / Registered Office: Berlin (Germany)
> Registergericht / Court of Registration: Amtsgericht Berlin
> (Charlottenburg), HRB 57947
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 

Karsten Becker
Head of Information Technology
Ecologic Institute

Tel: +49 30 86880-137
Website: http://ecologic.eu

Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Orphaned entries in Crush map

2018-02-16 Thread David Turner
What is the output of `ceph osd stat`?  My guess is that they are still
considered to be part of the cluster and going through the process of
removing OSDs from your cluster is what you need to do.  In particular
`ceph osd rm 19`.

On Fri, Feb 16, 2018 at 2:31 PM Karsten Becker 
wrote:

> Hi.
>
> during the reorganization of my cluster I removed some OSDs. Obviously
> something went wrong for 2 of them, osd.19 and osd.20.
>
> If I get my current Crush map, decompile and edit them, I see 2
> orphaned/stale entries for the former OSDs:
>
> > device 16 osd.16 class hdd
> > device 17 osd.17 class hdd
> > device 18 osd.18 class hdd
> > device 19 device19
> > device 20 device20
> > device 21 osd.21 class hdd
> > device 22 osd.22 class hdd
> > device 23 osd.23 class hdd
>
> If I delete them from the Crush map (file), recompile it and set it
> productive - they appear again... if I get the current map again and
> decompile them, they are in again.
>
> So how to get rid of these entries?
>
> Best from Berlin/Germany
> Karsten
>
> Ecologic Institut gemeinnuetzige GmbH
> Pfalzburger Str. 43/44, D-10717 Berlin
> Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
> Sitz der Gesellschaft / Registered Office: Berlin (Germany)
> Registergericht / Court of Registration: Amtsgericht Berlin
> (Charlottenburg), HRB 57947
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
The questions I definitely know the answer to first, and then we'll
continue from there.  If an OSD is blocking peering but is online, when you
mark it as down in the cluster, it receives a message in its log saying it
was wrongly marked down and tells the mons it is online.  That gets it to
stop what it was doing and start talking again.  I referred to that as
re-asserting.  If the OSD that you marked down doesn't mark itself back up
within a couple minutes, restarting the OSD might be a good idea.  Then
again actually restarting the daemon could be bad because the daemon is
doing something.  With so many other places to start in order to get
things going, actually restarting the daemons is probably something I would
wait to do for now.
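
A minimal sketch of that check, with the OSD ID purely as an example:

ceph osd down 103                                    # mark it down in the map
tail -f /var/log/ceph/ceph-osd.103.log | grep -i 'wrongly marked'   # it should log that it was wrongly marked down
ceph osd dump | grep '^osd.103 '                     # confirm it shows as up again after a minute or two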

The reason the cluster doesn't know anything about the PG is because it's
still creating and hasn't actually been created.  Starting with some of the
OSDs that you see with blocked requests would be a good idea.  Eventually
you'll down an OSD that when it comes back up things start looking much
better as things start peering and getting better.  Below are the list of
OSDs you had from a previous email that if they're still there with stuck
requests then they'll be good to start doing this to.  On closer review,
it's almost all of them... but you have to start somewhere.  Another
possible place to start with these is to look at a list of all of the
peering PGs and see if there are any common OSDs when you look at all of
them at once.  Some patterns may emerge and would be good options to try.

osds 7,39,60,103,133 have stuck requests > 67108.9 sec

osds
5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131
have stuck requests > 134218 sec

osds
4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132
have stuck requests > 268435 sec


On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister 
wrote:

> Thanks David,
>
>
>
> I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at
> this point and the backfills have stopped.  I’ll also stop the backups from
> pushing into ceph for now.
>
>
>
> I don’t want to make things worse, so ask for some more guidance now.
>
>
>
> 1)  In looking at a PG that is still peering or one that is
> “unknown”, Ceph complains that it doesn’t have that pgid:
>
> pg 14.fb0 is stuck peering since forever, current state peering, last
> acting [94,30,38]
>
> [root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
>
> Error ENOENT: i don't have pgid 14.fb0
>
> [root@carf-ceph-osd03 ~]#
>
>
>
> 2)  One that is activating shows this for the recovery_state:
>
> [root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
>
> [snip]
>
> "recovery_state": [
>
> {
>
> "name": "Started/Primary/Active",
>
> "enter_time": "2018-02-13 14:33:21.406919",
>
> "might_have_unfound": [
>
> {
>
> "osd": "84(0)",
>
> "status": "not queried"
>
> }
>
> ],
>
> "recovery_progress": {
>
> "backfill_targets": [
>
> "56(0)",
>
> "87(1)",
>
> "88(2)"
>
> ],
>
> "waiting_on_backfill": [],
>
> "last_backfill_started": "MIN",
>
> "backfill_info": {
>
> "begin": "MIN",
>
> "end": "MIN",
>
> "objects": []
>
> },
>
> "peer_backfill_info": [],
>
> "backfills_in_flight": [],
>
> "recovering": [],
>
> "pg_backend": {
>
> "recovery_ops": [],
>
> "read_ops": []
>
> }
>
> },
>
> "scrub": {
>
> "scrubber.epoch_start": "0",
>
> "scrubber.active": false,
>
> "scrubber.state": "INACTIVE",
>
> "scrubber.start": "MIN",
>
> "scrubber.end": "MIN",
>
> "scrubber.subset_last_update": "0'0",
>
> "scrubber.deep": false,
>
> "scrubber.seed": 0,
>
> "scrubber.waiting_on": 0,
>
> "scrubber.waiting_on_whom": []
>
> }
>
> },
>
> {
>
> "name": "Started",
>
> "enter_time": "2018-02-13 14:33:17.491148"
>
> }
>
> ],
>
>
>
> Sorry for all the hand holding, but how do I determine if I need to set an
> OSD as ‘down’ to fix the issues, and how does it go about re-asserting
> itself?
>
>
>
> I again tried looking at the ceph docs on troubleshooting OSDs but didn’t
> find any details.  Man page also has no details.
>
>
>
> Thanks again,
>
> -Bryan
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Friday, February 16, 

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Thanks for your reply

Do you mean, that's the problem with the disk I use for WAL and DB?

On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum  wrote:

>
> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani 
> wrote:
>
>> Hi there,
>>
>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>
>> It is a testing cluster and I have set it up 2 weeks ago.
>> after some days, I see that one of the three mons has stopped(out of
>> quorum) and I can't start it anymore.
>> I checked the mon service log and the output shows this error:
>>
>> """
>> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
>> store state
>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>> mismatch
>>
>
> This bit is the important one. Your disk is bad and it’s feeding back
> corrupted data.
>
>
>
>
>> code = 2 Rocksdb transaction:
>>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
>> 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
>> centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
>> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
>> MonitorDBStore::clear(std::set&)' thread
>> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
>> 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
>> centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/
>> ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILE
>> D assert(r >= 0)
>> """
>>
>> the only solution I found is to remove this mon from quorum and remove
>> all mon data and re-add this mon to quorum again.
>> and ceph goes to the healthy status again.
>>
>> but now after some days this mon has stopped and I face the same problem
>> again.
>>
>> My cluster setup is:
>> 4 osd hosts
>> total 8 osds
>> 3 mons
>> 1 rgw
>>
>> this cluster has setup with ceph-volume lvm and wal/db separation on
>> logical volumes.
>>
>> Best regards,
>> Behnam Loghmani
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

It seems your monitor capabilities are different to mine:

root@server3:/opt/ungleich-tools# ceph -k 
/var/lib/ceph/mon/ceph-server3/keyring -n mon. auth list
2018-02-16 20:34:59.257529 7fe0d5c6b700  0 librados: mon. authentication error 
(13) Permission denied
[errno 13] error connecting to the cluster
root@server3:/opt/ungleich-tools# cat /var/lib/ceph/mon/ceph-server3/keyring
[mon.]
key = AQCp9IVa2GmYARAAVvCGfNpXfxOoUf119KAq1g==

Where you have

> root@ceph-mon1:/# cat /var/lib/ceph/mon/ceph-ceph-mon1/keyring
> [mon.]
> key = AQD1y3RapVDCNxAAmInc8D3OPZKuTVeUcNsPug==
> caps mon = "allow *"

Which probably explains why it works for you, but not for me.
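
Untested idea from my side: ceph-authtool can add the missing cap to the local
keyring file, after which a restart of that mon might let the mon. key be used
the same way - but this is an assumption I have not verified:

ceph-authtool /var/lib/ceph/mon/ceph-server3/keyring -n mon. --cap mon 'allow *'
systemctl restart ceph-mon@server3
ceph -k /var/lib/ceph/mon/ceph-server3/keyring -n mon. auth list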

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan

On 02/16/2018 12:31 PM, Graham Allan wrote:


If I set debug rgw=1 and demug ms=1 before running the "object stat" 
command, it seems to stall in a loop of trying communicate with osds for 
pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- 
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping 
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected 
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59  
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20 
con 0x7f117afd8390


Prior to that, probably more relevant, this was the only communication 
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- 
osd_op(unknown.0.0:96 70.438s0 
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head 
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) 
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1  
osd_backoff(70.438s0 block id 1 
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 
e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- 
osd_backoff(70.438s0 ack-block id 1 
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 
e507695) v1 -- 0x7fab48065420 con 0


so I guess the backoff message above is saying the object is 
unavailable. OK, that certainly makes sense. Not sure that it helps me 
understand how to fix the inconsistencies


If I restart the primary osd for the pg, that makes it forget its state 
and return to active+clean+inconsistent. I can then download the 
previously-unfound objects again, as well as run "radosgw-admin object 
stat".


So the interesting bit is probably figuring out why it decides these 
objects are unfound, when clearly they aren't.


What would be the best place to enable additional logging to understand 
this - perhaps the primary osd?
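
(What I have in mind, unless someone suggests a better target - bumping the
primary's logging at runtime and reverting afterwards:)

ceph tell osd.175 injectargs '--debug_osd 20 --debug_ms 1'
# ...reproduce the "object stat" / download attempt...
ceph tell osd.175 injectargs '--debug_osd 1/5 --debug_ms 0/5'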


Thanks for all your help,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Michel Raabe
On 02/16/18 @ 18:59, Nico Schottelius wrote:
> Saw that, too, however it does not work:
> 
> root@server3:/var/lib/ceph/mon/ceph-server3# ceph -n mon. --keyring keyring  
> auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
> 2018-02-16 17:23:38.154282 7f7e257e3700  0 librados: mon. authentication 
> error (13) Permission denied
> [errno 13] error connecting to the cluster
> 
> ... which kind of makes sense, as the mon. key does not have
> capabilities for it. Then again, I wonder how monitors actually talk to
> each other...

Weird. Works for me.

root@ceph-mon1:/# ceph -k /var/lib/ceph/mon/ceph-ceph-mon1/keyring -n mon. auth 
list | grep -A4 client.admin
installed auth entries:

client.admin
key: AQD1y3RaTyOzNhAA7NwuH5CDmpTiJAX9tAoCzQ==
auid: 0
caps: [mgr] allow *
client.bootstrap-mds

root@ceph-mon1:/# ceph -k /var/lib/ceph/mon/ceph-ceph-mon1/keyring -n mon. auth 
caps client.admin mon 'allow *' osd 'allow *' mgr 'allow *' mds 'allow *'
updated caps for client.admin

root@ceph-mon1:/# ceph -k /var/lib/ceph/mon/ceph-ceph-mon1/keyring -n mon. auth 
list | grep -A7 client.admin 
installed auth entries:

client.admin
key: AQD1y3RaTyOzNhAA7NwuH5CDmpTiJAX9tAoCzQ==
auid: 0
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *
client.bootstrap-mds

root@ceph-mon1:/# cat /var/lib/ceph/mon/ceph-ceph-mon1/keyring
[mon.]
key = AQD1y3RapVDCNxAAmInc8D3OPZKuTVeUcNsPug==
caps mon = "allow *"

> Michel Raabe  writes:
> > On 02/16/18 @ 18:21, Nico Schottelius wrote:
> >> on a test cluster I issued a few seconds ago:
> >>
> >>   ceph auth caps client.admin mgr 'allow *'
> >>
> >> instead of what I really wanted to do
> >>
> >>   ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
> >>   mds allow
> >>
> >> Now any access to the cluster using client.admin correctly results in
> >> client.admin authentication error (13) Permission denied.
> >>
> >> Is there any way to modify the keyring capabilities "from behind",
> >> i.e. by modifying the rocksdb of the monitors or similar?
> >
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df: Raw used vs. used vs. actual bytes in cephfs

2018-02-16 Thread Gregory Farnum
What does the cluster deployment look like? Usually this happens when
you’re sharing disks with the OS, or have co-located file journals or
something.
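
A quick way to see where the raw space is actually going, per pool and per 
OSD, is (a sketch; both are standard commands in recent releases):

  ceph df detail
  ceph osd df tree
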
On Fri, Feb 16, 2018 at 4:02 AM Flemming Frandsen <
flemming.frand...@stibosystems.com> wrote:

> I'm trying out cephfs and I'm in the process of copying over some
> real-world data to see what happens.
>
> I have created a number of cephfs file systems; the only one I've
> started working on is the one named jenkins, which lives in
> fs_jenkins_data and fs_jenkins_metadata.
>
> According to ceph df I have about 1387 GB of data in all of the pools,
> while the raw used space is 5918 GB, which gives a ratio of about 4.3. I
> would have expected a ratio of around 2, as the pool size has been set to 2.
>
>
> Can anyone explain where half my space has been squandered?
>
>  > ceph df
> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     8382G     2463G        5918G         70.61
> POOLS:
>     NAME                         ID     USED       %USED     MAX AVAIL      OBJECTS
>     .rgw.root                     1       1113         0          258G            4
>     default.rgw.control           2          0         0          258G            8
>     default.rgw.meta              3          0         0          258G            0
>     default.rgw.log               4          0         0          258G          207
>     fs_docker-nexus_data          5     66120M     11.09          258G        22655
>     fs_docker-nexus_metadata      6     39463k         0          258G         2376
>     fs_meta_data                  7        330         0          258G            4
>     fs_meta_metadata              8       567k         0          258G           22
>     fs_jenkins_data               9      1321G     71.84          258G     28576278
>     fs_jenkins_metadata          10     52178k         0          258G      2285493
>     fs_nexus_data                11          0         0          258G            0
>     fs_nexus_metadata            12       4181         0          258G           21
>
> --
>   Regards Flemming Frandsen - Stibo Systems - DK - STEP Release Manager
>   Please use rele...@stibo.com for all Release Management requests
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani 
wrote:

> Hi there,
>
> I have a Ceph cluster version 12.2.2 on CentOS 7.
>
> It is a testing cluster and I have set it up 2 weeks ago.
> after some days, I see that one of the three mons has stopped(out of
> quorum) and I can't start it anymore.
> I checked the mon service log and the output shows this error:
>
> """
> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent store
> state
> rocksdb: submit_transaction_sync error: Corruption: block checksum
> mismatch
>

This bit is the important one. Your disk is bad and it’s feeding back
corrupted data.
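
If you want to double-check from the OS side, something like this is a 
reasonable first look (the device name is a placeholder for whatever backs 
the mon store, and smartmontools is assumed to be installed):

  smartctl -a /dev/sdX
  dmesg | grep -iE 'sdX|i/o error'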




> code = 2 Rocksdb transaction:
>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
> MonitorDBStore::clear(std::set&)' thread
> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
> 581: FAILE
> D assert(r >= 0)
> """
>
> the only solution I found is to remove this mon from quorum and remove all
> mon data and re-add this mon to quorum again.
> and ceph goes to the healthy status again.
>
> but now after some days this mon has stopped and I face the same problem
> again.
>
> My cluster setup is:
> 4 osd hosts
> total 8 osds
> 3 mons
> 1 rgw
>
> this cluster has setup with ceph-volume lvm and wal/db separation on
> logical volumes.
>
> Best regards,
> Behnam Loghmani
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
Thanks David,

I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at this 
point and the backfills have stopped.  I’ll also stop the backups from pushing 
into ceph for now.

I don’t want to make things worse, so ask for some more guidance now.


1)  In looking at a PG that is still peering or one that is “unknown”, Ceph 
complains that it doesn’t have that pgid:
pg 14.fb0 is stuck peering since forever, current state peering, last 
acting [94,30,38]
[root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
Error ENOENT: i don't have pgid 14.fb0
[root@carf-ceph-osd03 ~]#


2)  One that is activating shows this for the recovery_state:
[root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
[snip]
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-02-13 14:33:21.406919",
"might_have_unfound": [
{
"osd": "84(0)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [
"56(0)",
"87(1)",
"88(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],


Sorry for all the hand holding, but how do I determine if I need to set an OSD 
as ‘down’ to fix the issues, and how does it go about re-asserting itself?

I again tried looking at the ceph docs on troubleshooting OSDs but didn’t find 
any details.  Man page also has no details.
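
For what it's worth, a sketch of the kind of kick David describes, using the 
"not queried" peer from the query output above (whether osd.84 is really the 
blocker is only an assumption on my part):

  ceph osd down 84
  ceph pg 14.fe1 query | grep -A4 might_have_unfound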

Thanks again,
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 1:21 PM
To: Bryan Banister 
Cc: Bryan Stillwell ; Janne Johansson 
; Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email

Your problem might have been creating too many PGs at once.  I generally 
increase pg_num and pgp_num by no more than 256 at a time.  Making sure that 
all PGs are creating, peered, and healthy (other than backfilling).

To help you get back to a healthy state, let's start off by getting all of your 
PGs peered.  Go ahead and put a stop to backfilling, recovery, scrubbing, etc.  
Those are all hindering the peering effort right now.  The more clients you can 
disable is also better.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub

After that look at your peering PGs and find out what is blocking their 
peering.  This is where you might need to be using `ceph osd down 23` (assuming 
you needed to kick osd.23) to mark them down in the cluster and let them 
re-assert themselves.  Once you have all PGs done with peering, go ahead and 
unset nobackfill and norecover and let the cluster start moving data around.  
Leaving scrubbing and deep scrubbing disabled is optional and up to you.  I'll 
never say it's better to leave them off, but scrubbing does use a fair bit of 
spindle time while you're trying to backfill.

On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister wrote:
Well I decided to try the increase in PGs to 4096 and that seems to have caused 
some issues:

2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 
61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 
pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects 
degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are 
blocked > 4096 sec

The cluster is actively backfilling misplaced objects, but not all PGs are 
active at this point and many are stuck peering, stuck unclean, or have a state 
of unknown:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
pg 14.fae is stuck inactive 

[ceph-users] Orphaned entries in Crush map

2018-02-16 Thread Karsten Becker
Hi.

during the reorganization of my cluster I removed some OSDs. Obviously
something went wrong for 2 of them, osd.19 and osd.20.

If I get my current Crush map, decompile and edit them, I see 2
orphaned/stale entries for the former OSDs:

> device 16 osd.16 class hdd
> device 17 osd.17 class hdd
> device 18 osd.18 class hdd
> device 19 device19
> device 20 device20
> device 21 osd.21 class hdd
> device 22 osd.22 class hdd
> device 23 osd.23 class hdd

If I delete them from the Crush map (file), recompile it and set it
productive, they appear again: if I get the current map again and
decompile it, the entries are back in.

So how to get rid of these entries?
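
In case it helps, the usual cleanup when an OSD removal did not finish looks 
roughly like this (whether osd.19/osd.20 still show up in `ceph osd tree` and 
`ceph auth list` is an assumption - check there first):

  ceph osd crush remove osd.19
  ceph auth del osd.19
  ceph osd rm 19

and the same for osd.20.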

Best from Berlin/Germany
Karsten

Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Crush for 2 room setup

2018-02-16 Thread Karsten Becker
Hi.

I want to run my Ceph cluster in a 2 datacenter/room setup with pool
size/replica 3.

But I can't manage to define the ruleset correctly - or at least I
am unsure whether it is correct.

I have the following setup of my Ceph cluster:

> ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF 
>  -1   91.00598 root company_spinning 
> -12   45.50299 room dc3_spinning  
> -11   45.50299 rack lan3_spinning 
>  -2   33.66600 host kong_spinning 
>   8   hdd  3.73799 osd.8  up  1.0 1.0 
> [...]
>  45   hdd  3.73799 osd.45 up  1.0 1.0 
> -43   11.83699 host predator_spinning 
>  21   hdd  1.69099 osd.21 up  1.0 1.0 
> [...]
>  27   hdd  1.69099 osd.27 up  1.0 1.0 
>  
>  [...]
>  
> -10   45.50299 room dc6_spinning  
> -49   11.83699 rack dev6_spinning 
> -58   11.83699 host alien_spinning
>  29   hdd  1.69099 osd.29 up  1.0 1.0 
> [...]
>  35   hdd  1.69099 osd.35 up  1.0 1.0 
>  -8   33.66600 rack lan6_spinning 
>  -3   33.66600 host king_spinning 
>   3   hdd  1.87299 osd.3  up  1.0 1.0 
> []...
>  47   hdd  3.73799 osd.47 up  1.0 1.0
>  
>  [...]


What I want to achieve is that at least one replica lives in a different
datacenter than the remaining two. Which racks/hosts/osds it lands on within a
specific datacenter does not matter.


My ruleset looks like:

> rule replicated_ruleset_spinning {
> id 0
> type replicated
> min_size 1
> max_size 10
> step take company_spinning
> step choose firstn 2 type room
> step chooseleaf firstn -1 type host
> step emit
> }


Is this correct... I'm in doubt...
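
One way to check the rule without touching the cluster is to let crushtool 
simulate mappings from the edited map (file names are placeholders):

  crushtool -c crushmap.txt -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-statistics

If every emitted mapping contains OSDs from both rooms, the rule does what
you want.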

Best from Berlin/Germany
Karsten

Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
Your problem might have been creating too many PGs at once.  I generally
increase pg_num and pgp_num by no more than 256 at a time, making sure
that all PGs are created, peered, and healthy (other than backfilling)
before going further.

To help you get back to a healthy state, let's start off by getting all of
your PGs peered.  Go ahead and put a stop to backfilling, recovery,
scrubbing, etc.  Those are all hindering the peering effort right now.  The
more clients you can disable is also better.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub

After that look at your peering PGs and find out what is blocking their
peering.  This is where you might need to be using `ceph osd down 23`
(assuming you needed to kick osd.23) to mark them down in the cluster and
let them re-assert themselves.  Once you have all PGs done with peering, go
ahead and unset nobackfill and norecover and let the cluster start moving
data around.  Leaving scrubbing and deep scrubbing disabled is optional and
up to you.  I'll never say it's better to leave them off, but scrubbing
does use a fair bit of spindle time while you're trying to backfill.
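
A rough sketch of where to look for what is blocking (osd.23 is just the
example id from above; the ceph daemon commands have to be run on the host
that carries that OSD):

  ceph pg dump_stuck inactive
  ceph daemon osd.23 dump_blocked_ops
  ceph daemon osd.23 ops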

On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister 
wrote:

> Well I decided to try the increase in PGs to 4096 and that seems to have
> caused some issues:
>
>
>
> 2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR
> 61802168/241154376 objects misplaced (25.628%); Reduced data availability:
> 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376
> objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck
> requests are blocked > 4096 sec
>
>
>
> The cluster is actively backfilling misplaced objects, but not all PGs are
> active at this point and many are stuck peering, stuck unclean, or have a
> state of unknown:
>
> PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs
> peering
>
> pg 14.fae is stuck inactive for 253360.025730, current state
> activating+remapped, last acting [85,12,41]
>
> pg 14.faf is stuck inactive for 253368.511573, current state unknown,
> last acting []
>
> pg 14.fb0 is stuck peering since forever, current state peering, last
> acting [94,30,38]
>
> pg 14.fb1 is stuck inactive for 253362.605886, current state
> activating+remapped, last acting [6,74,34]
>
> [snip]
>
>
>
> The health also shows a large number of degraded data redundancy PGs:
>
> PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded
> (0.000%), 3099 pgs unclean, 38 pgs degraded
>
> pg 14.fc7 is stuck unclean for 253368.511573, current state unknown,
> last acting []
>
> pg 14.fc8 is stuck unclean for 531622.531271, current state
> active+remapped+backfill_wait, last acting [73,132,71]
>
> pg 14.fca is stuck unclean for 420540.396199, current state
> active+remapped+backfill_wait, last acting [0,80,61]
>
> pg 14.fcb is stuck unclean for 531622.421855, current state
> activating+remapped, last acting [70,26,75]
>
> [snip]
>
>
>
> We also now have a number of stuck requests:
>
> REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
>
> 69 ops are blocked > 268435 sec
>
> 66 ops are blocked > 134218 sec
>
>28 ops are blocked > 67108.9 sec
>
> osds 7,39,60,103,133 have stuck requests > 67108.9 sec
>
> osds
> 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131
> have stuck requests > 134218 sec
>
> osds
> 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132
> have stuck requests > 268435 sec
>
>
>
> I tried looking through the mailing list archive on how to solve the stuck
> requests, and it seems that restarting the OSDs is the right way?
>
>
>
> At this point we have just been watching the backfills running and see a
> steady but slow decrease of misplaced objects.  When the cluster is idle,
> the overall OSD disk utilization is not too bad at roughly 40% on the
> physical disks running these backfills.
>
>
>
> However we still have our backups trying to push new images to the
> cluster.  This worked ok for the first few days, but yesterday we were
> getting failure alerts.  I checked the status of the RGW service and
> noticed that 2 of the 3 RGW civetweb servers where not responsive.  I
> restarted the RGWs on the ones that appeared hung and that got them working
> for a while, but then the same condition happened.  The RGWs seem to have
> recovered on their own now, but again the cluster is idle and only
> backfills are currently doing anything (that I can tell).  I did see these
> log entries:
>
> 2018-02-15 16:46:07.541542 7fffe6c56700  1 heartbeat_map is_healthy
> 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
>
> 2018-02-15 16:46:12.541613 7fffe6c56700  1 heartbeat_map is_healthy
> 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out 

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
Well I decided to try the increase in PGs to 4096 and that seems to have caused 
some issues:

2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 
61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 
pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects 
degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are 
blocked > 4096 sec

The cluster is actively backfilling misplaced objects, but not all PGs are 
active at this point and many are stuck peering, stuck unclean, or have a state 
of unknown:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
pg 14.fae is stuck inactive for 253360.025730, current state 
activating+remapped, last acting [85,12,41]
pg 14.faf is stuck inactive for 253368.511573, current state unknown, last 
acting []
pg 14.fb0 is stuck peering since forever, current state peering, last 
acting [94,30,38]
pg 14.fb1 is stuck inactive for 253362.605886, current state 
activating+remapped, last acting [6,74,34]
[snip]

The health also shows a large number of degraded data redundancy PGs:
PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 
3099 pgs unclean, 38 pgs degraded
pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last 
acting []
pg 14.fc8 is stuck unclean for 531622.531271, current state 
active+remapped+backfill_wait, last acting [73,132,71]
pg 14.fca is stuck unclean for 420540.396199, current state 
active+remapped+backfill_wait, last acting [0,80,61]
pg 14.fcb is stuck unclean for 531622.421855, current state 
activating+remapped, last acting [70,26,75]
[snip]

We also now have a number of stuck requests:
REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
69 ops are blocked > 268435 sec
66 ops are blocked > 134218 sec
   28 ops are blocked > 67108.9 sec
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 
5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have 
stuck requests > 134218 sec
osds 
4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132
 have stuck requests > 268435 sec

I tried looking through the mailing list archive on how to solve the stuck 
requests, and it seems that restarting the OSDs is the right way?

At this point we have just been watching the backfills running and see a steady 
but slow decrease of misplaced objects.  When the cluster is idle, the overall 
OSD disk utilization is not too bad at roughly 40% on the physical disks 
running these backfills.

However we still have our backups trying to push new images to the cluster.  
This worked ok for the first few days, but yesterday we were getting failure 
alerts.  I checked the status of the RGW service and noticed that 2 of the 3 
RGW civetweb servers were not responsive.  I restarted the RGWs on the ones 
that appeared hung and that got them working for a while, but then the same 
condition happened.  The RGWs seem to have recovered on their own now, but 
again the cluster is idle and only backfills are currently doing anything (that 
I can tell).  I did see these log entries:
2018-02-15 16:46:07.541542 7fffe6c56700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:12.541613 7fffe6c56700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
2018-02-15 16:46:12.541629 7fffe6c56700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:17.541701 7fffe6c56700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600

At this point we do not know to proceed with recovery efforts.  I tried looking 
at the ceph docs and mail list archives but wasn’t able to determine the right 
path forward here.

Any help is appreciated,
-Bryan


From: Bryan Stillwell [mailto:bstillw...@godaddy.com]
Sent: Tuesday, February 13, 2018 2:27 PM
To: Bryan Banister ; Janne Johansson 

Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email

It may work fine, but I would suggest limiting the number of operations going 
on at the same time.

Bryan

From: Bryan Banister
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell, Janne Johansson
Cc: Ceph Users
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Thanks for the response Bryan!


Re: [ceph-users] Bluestore Hardwaresetup

2018-02-16 Thread Jan Peters

Hi,
 
thank you.
 
The network setup is like this:
 
2 x 10 GBit LACP for public
2 x 10 GBit LACP for clusternetwork
1 x 1 GBit for management 
Yes Joe, the sizing for block.db and block.wal would be interesting!
 
Is there any other advice for SSDs, like the blog post from Sébastien Han?:
 
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
Best regards
 
Peter
 
 

Sent: Friday, 16 February 2018 at 19:09
From: "Joe Comeau" 
To: "Michel Raabe" , "Jan Peters" 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore Hardwaresetup

I have a question about  block.db and block.wal
 
How big should they be?
Relative to drive size or ssd size ?
 
Thanks Joe

>>> Michel Raabe  2/16/2018 9:12 AM >>>
Hi Peter,

On 02/15/18 @ 19:44, Jan Peters wrote:
> I want to evaluate ceph with bluestore, so I need some hardware/configure 
> advices from you.
>
> My Setup should be:
>
> 3 Nodes Cluster, on each with:
>
> - Intel Gold Processor SP 5118, 12 core / 2.30Ghz
> - 64GB RAM
> - 6 x 7,2k, 4 TB SAS
> - 2 x SSDs, 480GB

Network?

> On the POSIX FS you have to set your journal on SSDs. What is the best way 
> for bluestore?
>
> Should I configure separate SSDs for block.db and block.wal?

Yes.

Regards,
Michel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous and calamari

2018-02-16 Thread Ronny Aasen

On 16.02.2018 06:20, Laszlo Budai wrote:

Hi,

I've just started up the dashboard component of the ceph mgr. It looks 
OK, but from what can be seen, and what I was able to find in the 
docs, the dashboard is just for monitoring. Is there any plugin that 
allows management of the ceph resources (pool create/delete)? 



openATTIC allows for web administration, but I think it is only possible 
to run it comfortably on openSUSE Leap at the moment. I could not find 
updated Debian packages last time I checked.


Proxmox also allows for Ceph administration, but it is probably a 
bit overkill for Ceph administration alone, since it is a web admin tool 
for KVM VMs and LXC containers as well as Ceph.




kind regards
Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan



On 02/15/2018 05:33 PM, Gregory Farnum wrote:
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan > wrote:


A lot more in xattrs which I won't paste, though the keys are:

 > root@cephmon1:~# ssh ceph03 find
/var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*'
-exec xattr {} +
 > user.ceph._user.rgw.idtag
 > user.cephos.spill_out
 > user.ceph._
 > user.ceph.snapset
 > user.ceph._user.rgw.manifest
 > user.ceph._@1
 > user.ceph.hinfo_key
 > user.ceph._user.rgw.manifest@1
 > user.ceph._user.rgw.manifest@2
 > user.ceph._user.rgw.acl
 > user.ceph._user.rgw.x-amz-acl
 > user.ceph._user.rgw.etag
 > user.ceph._user.rgw.x-amz-date
 > user.ceph._user.rgw.content_type

Not sure which among these would contain pointers to part files.


I believe it’s the manifest xattrs there.


Thanks, I see better now how this goes together. Rather than dumping out 
the manifest xattrs, I guess I should also be able to understand this 
from the output of "radosgw-admin object stat".



One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now download. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.

Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg list". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear
itself.

Right, “recovery_wait” means that the pg needs to do log-based recovery 
but (at least) one of the participating OSDs doesn’t have a slot 
available; that will resolve itself eventually.


It sounds like the scrubbing has detected some inconsistencies but the 
reason you weren’t getting data is just that it hit an object which 
needed recovery but was blocked waiting on it.


Yes, though it seems to be stuck in a cycle. This morning, that same pg 
70.438 is back in active+recovery_wait+inconsistent state with the same 
169 unfound objects - and from s3, the objects which would download 
successfully yesterday now stall. Which probably makes sense while in 
"unfound" state, but clearly the data is there, while ceph is not 
successful in making the pg consistent. Each time it repairs again, it 
claims to fix more errors but finds the same number of unfound objects 
again.



/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.438357 osd.175 osd.175 
10.31.0.71:6838/66301 2928 : cluster [ERR] 70.438s0 repair 0 missing, 169 
inconsistent objects
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.442875 osd.175 osd.175 
10.31.0.71:6838/66301 2929 : cluster [ERR] 70.438 repair 169 errors, 845 fixed
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.040196 osd.175 osd.175 
10.31.0.71:6838/66301 2995 : cluster [ERR] 70.438s0 repair 0 missing, 169 
inconsistent objects
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.046028 osd.175 osd.175 
10.31.0.71:6838/66301 2996 : cluster [ERR] 70.438 repair 169 errors, 685 fixed


I also now see that for these unfound objects "radosgw-admin object 
stat" also hangs. Clearly makes sense since radosgw must perform this to 
retrieve the object. Does it imply that ceph can't access the "head" 
object in order to read the xattr data?


If I set debug rgw=1 and debug ms=1 before running the "object stat" 
command, it seems to stall in a loop of trying to communicate with osds for 
pool 96, which is .rgw.control



10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 
96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] 
ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59  
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 
ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390


Prior to that, probably more relevant, this was the only communication 
logged with the primary osd of the pg:



10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 
70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head 
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 
0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1  
osd_backoff(70.438s0 block id 1 

Re: [ceph-users] Bluestore Hardwaresetup

2018-02-16 Thread Joe Comeau
I have a question about  block.db and block.wal
 
How big should they be?
Relative to drive size or ssd size ?
 
Thanks Joe


>>> Michel Raabe  2/16/2018 9:12 AM >>>
Hi Peter,

On 02/15/18 @ 19:44, Jan Peters wrote:
> I want to evaluate ceph with bluestore, so I need some hardware/configure 
> advices from you. 
> 
> My Setup should be:
> 
> 3 Nodes Cluster, on each with:
> 
> - Intel Gold Processor SP 5118, 12 core / 2.30Ghz
> - 64GB RAM
> - 6 x 7,2k, 4 TB SAS
> - 2 x SSDs, 480GB

Network?

> On the POSIX FS you have to set your journal on SSDs. What is the best way 
> for bluestore? 
> 
> Should I configure separate SSDs for block.db and block.wal?

Yes. 

Regards,
Michel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

Saw that, too, however it does not work:

root@server3:/var/lib/ceph/mon/ceph-server3# ceph -n mon. --keyring keyring  
auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
2018-02-16 17:23:38.154282 7f7e257e3700  0 librados: mon. authentication error 
(13) Permission denied
[errno 13] error connecting to the cluster

... which kind of makes sense, as the mon. key does not have
capabilities for it. Then again, I wonder how monitors actually talk to
each other...

Michel Raabe  writes:

> On 02/16/18 @ 18:21, Nico Schottelius wrote:
>> on a test cluster I issued a few seconds ago:
>>
>>   ceph auth caps client.admin mgr 'allow *'
>>
>> instead of what I really wanted to do
>>
>>   ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
>>   mds allow
>>
>> Now any access to the cluster using client.admin correctly results in
>> client.admin authentication error (13) Permission denied.
>>
>> Is there any way to modify the keyring capabilities "from behind",
>> i.e. by modifying the rocksdb of the monitors or similar?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html
>
> Not verified.
>
> Regards,
> Michel


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Michel Raabe
On 02/16/18 @ 18:21, Nico Schottelius wrote:
> on a test cluster I issued a few seconds ago:
> 
>   ceph auth caps client.admin mgr 'allow *'
> 
> instead of what I really wanted to do
> 
>   ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
>   mds allow
> 
> Now any access to the cluster using client.admin correctly results in
> client.admin authentication error (13) Permission denied.
> 
> Is there any way to modify the keyring capabilities "from behind",
> i.e. by modifying the rocksdb of the monitors or similar?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html

Not verified.

Regards,
Michel


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

Hello,

on a test cluster I issued a few seconds ago:

  ceph auth caps client.admin mgr 'allow *'

instead of what I really wanted to do

  ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
  mds allow

Now any access to the cluster using client.admin correctly results in
client.admin authentication error (13) Permission denied.

Is there any way to modify the keyring capabilities "from behind",
i.e. by modifying the rocksdb of the monitors or similar?

If the answer is no, it's not a big problem, as we can easily destroy
the cluster, but if the answer is yes, it would be interesting to know
how to get out of this.

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore Hardwaresetup

2018-02-16 Thread Michel Raabe
Hi Peter,

On 02/15/18 @ 19:44, Jan Peters wrote:
> I want to evaluate ceph with bluestore, so I need some hardware/configure 
> advices from you. 
> 
> My Setup should be:
> 
> 3 Nodes Cluster, on each with:
> 
> - Intel Gold Processor SP 5118, 12 core / 2.30Ghz
> - 64GB RAM
> - 6 x 7,2k, 4 TB SAS
> - 2 x SSDs, 480GB

Network?

> On the POSIX FS you have to set your journal on SSDs. What is the best way 
> for bluestore? 
> 
> Should I configure separate SSDs for block.db and block.wal?

Yes. 
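
If you also want to pin the sizes at OSD-creation time, a minimal ceph.conf
sketch (option names as in Luminous; the values are placeholders, not
recommendations):

  [osd]
  bluestore_block_db_size  = 32212254720
  bluestore_block_wal_size = 1073741824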

Regards,
Michel


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Jason Dillaman
On Fri, Feb 16, 2018 at 11:20 AM, Eugen Block  wrote:
> Hi Jason,
>
>> ... also forgot to mention "rbd export --export-format 2" / "rbd
>> import --export-format 2" that will also deeply export/import all
>> snapshots associated with an image and that feature is available in
>> the Luminous release.
>
>
> thanks for that information, this could be very valuable for us. I'll have
> to test that intensively, but not before next week.
>
> But a first quick test brought up a couple of issues which I'll have to
> re-check before bringing them up here.
>
> One issue is worth mentioning, though: After I exported (rbd export
> --export-format ...) a glance image and imported it back to a different pool
> (rbd import --export-format ...) its snapshot was copied, but not protected.
> This prevented nova from cloning the base image and leaving that instance in
> error state. Protecting the snapshot manually and launch another instance
> enabled nova to clone the image successfully.
>
> Could this be worth a bug report or is it rather something I did wrong or
> missed?

Definitely deserves a bug tracker ticket opened. Thanks.

> I wish you all a nice weekend!
>
> Regards
> Eugen
>
>
> Zitat von Jason Dillaman :
>
>> On Fri, Feb 16, 2018 at 8:08 AM, Jason Dillaman 
>> wrote:
>>>
>>> On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen  wrote:

 Dear list, hello Jason,

 you may have seen my message on the Ceph mailing list about RDB pool
 migration - it's a common subject that pools were created in a
 sub-optimum
 fashion and i. e. pgnum is (not yet) reducible, so we're looking into
 means
 to "clone" an RBD pool into a new pool within the same cluster
 (including
 snapshots).

 We had looked into creating a tool for this job, but soon noticed that
 we're
 duplicating basic functionality of rbd-mirror. So we tested the
 following,
 which worked out nicely:

 - create a test cluster (Ceph cluster plus an Openstack cluster using an
 RBD
 pool) and some Openstack instances

 - create a second Ceph test cluster

 - stop Openstack

 - use rbd-mirror to clone the RBD pool from the first to the second Ceph
 cluster (IOW aborting rbd-mirror once the initial coping was done)

 - recreate the RDB pool on the first cluster

 - use rbd-mirror to clone the mirrored pool back to the (newly created)
 pool
 on the first cluster

 - start Openstack and work with the (recreated) pool on the first
 cluster

 So using rbd-mirror, we could clone an RBD pool's content to a
 differently
 structured pool on the same cluster - by using an intermediate cluster.

 @Jason: Looking at the commit history for rbd-mirror, it seems you might
 be
 able to shed some light on this: Do you see an easy way to modify
 rbd-mirror
 in such a fashion that instead of mirroring to a pool on a different
 cluster
 (having the same pool name as the original), mirroring would be to a
 pool on
 the *same* cluster, (obviously having a pool different name)?

 From the "rbd cppool" perspective, a one-shot mode of operation would be
 fully sufficient - but looking at the code, I have not even been able to
 identify the spots where we might "cut away" the networking part, so
 that
 rbd-mirror might do an intra-cluster job.

 Are you able to judge how much work would need to be done, in order to
 create a one-shot, intra-cluster version of rbd-mirror? Might it even be
 something that could be a simple enhancement?
>>>
>>>
>>> You might be interested in the deep-copy feature that will be included
>>> in the Mimic release. By running "rbd deep-copy 
>>> ", it will fully copy the image, including snapshots and
>>> parentage, to a new image. There is also work-in-progress for online
>>> image migration [1] that will allow you to keep using the image while
>>> it's being migrated to a new destination image. Both of these are
>>> probably more suited to your needs than the heavy-weight RBD mirroring
>>> process -- especially if you are only interested in the first step
>>> since RBD mirroring now directly utilizes the deep-copy feature for
>>> the initial image sync.
>>
>>
>> ... also forgot to mention "rbd export --export-format 2" / "rbd
>> import --export-format 2" that will also deeply export/import all
>> snapshots associated with an image and that feature is available in
>> the Luminous release.
>>
 Thank you for any information and / or opinion you care to share!

 With regards,
 Jens

>>>
>>> [1] https://github.com/ceph/ceph/pull/15831
>>>
>>> --
>>> Jason
>>
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> 

Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Eugen Block

Hi Jason,


... also forgot to mention "rbd export --export-format 2" / "rbd
import --export-format 2" that will also deeply export/import all
snapshots associated with an image and that feature is available in
the Luminous release.


thanks for that information, this could be very valuable for us. I'll  
have to test that intensively, but not before next week.


But a first quick test brought up a couple of issues which I'll have  
to re-check before bringing them up here.


One issue is worth mentioning, though: after I exported (rbd export  
--export-format ...) a glance image and imported it back to a  
different pool (rbd import --export-format ...), its snapshot was  
copied, but not protected. This prevented nova from cloning the base  
image and left that instance in an error state. Protecting the  
snapshot manually and launching another instance enabled nova to clone  
the image successfully.
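
For reference, the sequence looked roughly like this (pool, image and
snapshot names are placeholders):

  rbd export --export-format 2 images/base-image - \
    | rbd import --export-format 2 - images-new/base-image
  rbd snap ls images-new/base-image
  rbd snap protect images-new/base-image@snap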


Could this be worth a bug report or is it rather something I did wrong  
or missed?


I wish you all a nice weekend!

Regards
Eugen


Zitat von Jason Dillaman :


On Fri, Feb 16, 2018 at 8:08 AM, Jason Dillaman  wrote:

On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen  wrote:

Dear list, hello Jason,

you may have seen my message on the Ceph mailing list about RDB pool
migration - it's a common subject that pools were created in a sub-optimum
fashion and i. e. pgnum is (not yet) reducible, so we're looking into means
to "clone" an RBD pool into a new pool within the same cluster (including
snapshots).

We had looked into creating a tool for this job, but soon noticed  
that we're

duplicating basic functionality of rbd-mirror. So we tested the following,
which worked out nicely:

- create a test cluster (Ceph cluster plus an Openstack cluster  
using an RBD

pool) and some Openstack instances

- create a second Ceph test cluster

- stop Openstack

- use rbd-mirror to clone the RBD pool from the first to the second Ceph
cluster (IOW aborting rbd-mirror once the initial coping was done)

- recreate the RDB pool on the first cluster

- use rbd-mirror to clone the mirrored pool back to the (newly  
created) pool

on the first cluster

- start Openstack and work with the (recreated) pool on the first cluster

So using rbd-mirror, we could clone an RBD pool's content to a differently
structured pool on the same cluster - by using an intermediate cluster.

@Jason: Looking at the commit history for rbd-mirror, it seems you might be
able to shed some light on this: Do you see an easy way to modify  
rbd-mirror
in such a fashion that instead of mirroring to a pool on a  
different cluster
(having the same pool name as the original), mirroring would be to  
a pool on

the *same* cluster, (obviously having a pool different name)?

From the "rbd cppool" perspective, a one-shot mode of operation would be
fully sufficient - but looking at the code, I have not even been able to
identify the spots where we might "cut away" the networking part, so that
rbd-mirror might do an intra-cluster job.

Are you able to judge how much work would need to be done, in order to
create a one-shot, intra-cluster version of rbd-mirror? Might it even be
something that could be a simple enhancement?


You might be interested in the deep-copy feature that will be included
in the Mimic release. By running "rbd deep-copy 
", it will fully copy the image, including snapshots and
parentage, to a new image. There is also work-in-progress for online
image migration [1] that will allow you to keep using the image while
it's being migrated to a new destination image. Both of these are
probably more suited to your needs than the heavy-weight RBD mirroring
process -- especially if you are only interested in the first step
since RBD mirroring now directly utilizes the deep-copy feature for
the initial image sync.


... also forgot to mention "rbd export --export-format 2" / "rbd
import --export-format 2" that will also deeply export/import all
snapshots associated with an image and that feature is available in
the Luminous release.


Thank you for any information and / or opinion you care to share!

With regards,
Jens



[1] https://github.com/ceph/ceph/pull/15831

--
Jason




--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] ceph luminous - ceph tell osd bench performance

2018-02-16 Thread Steven Vacaroaia
Hi,

For every CONSECUTIVE run of the "ceph tell osd.x bench" command
I get different and MUCH worse results.

Is this expected ?

The OSD was created with the following command (/dev/sda is an enterprise-class
SSD):

ceph-deploy osd create --zap-disk --bluestore  osd01:sdc --block-db
/dev/sda --block-wal /dev/sda

If not, what could cause it ?

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 440630335
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 313287177
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 251350160
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 244450342
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 253622108
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 253355474
}

[root@osd01 ~]# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 252890400


[root@osd01 ~]# megacli -LDGetProp  -DskCache -L3 -a0

Adapter 0-VD 3(target id: 3): Disk Write Cache : Enabled

Exit Code: 0x00
[root@osd01 ~]# megacli -LDGetProp  -Cache -L3 -a0

Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached,
Write Cache OK if bad BBU

mount
/dev/sdd1 on /var/lib/ceph/osd/ceph-0 type xfs
(rw,noatime,nodiratime,swalloc,attr2,largeio,inode64,allocsize=4096k,logbufs=8,logbsize=256k,noquota)


[root@osd01 ~]# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdf        8:80   0  59.8G  0 disk
├─sdf5     8:85   0    37G  0 part /var
├─sdf3     8:83   0     6G  0 part [SWAP]
├─sdf1     8:81   0     1G  0 part /boot
├─sdf4     8:84   0     1K  0 part
└─sdf2     8:82   0  15.5G  0 part /
sdd        8:48   0 558.4G  0 disk
├─sdd2     8:50   0 558.3G  0 part
└─sdd1     8:49   0   100M  0 part /var/lib/ceph/osd/ceph-0
sdb        8:16   0 558.4G  0 disk
sr0       11:0    1  1024M  0 rom
sde        8:64   0 558.4G  0 disk
sdc        8:32   0 558.4G  0 disk
├─sdc2     8:34   0 558.3G  0 part
└─sdc1     8:33   0   100M  0 part /var/lib/ceph/osd/ceph-3
sda        8:0    0   372G  0 disk
├─sda4     8:4    0     1G  0 part
├─sda2     8:2    0     1G  0 part
├─sda3     8:3    0    30G  0 part
└─sda1     8:1    0    30G  0 part
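
One thing worth trying is to vary the total bytes and the block size, to
separate drive/controller caching from the OSD itself (the arguments are
total bytes and bytes per write; the values below are only examples):

  ceph tell osd.0 bench 4294967296 4194304
  ceph tell osd.0 bench 1073741824 65536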
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Hi there,

I have a Ceph cluster version 12.2.2 on CentOS 7.

It is a testing cluster that I set up 2 weeks ago.
After some days, I saw that one of the three mons had stopped (out of
quorum) and I can't start it anymore.
I checked the mon service log and the output shows this error:

"""
mon.XX@-1(probing) e4 preinit clean up potentially inconsistent store
state
rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch
code = 2 Rocksdb transaction:
 0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
MonitorDBStore::clear(std::set&)' thread
7f45a1e52e40 time 2018-02-16 17:37:07.040846
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
581: FAILE
D assert(r >= 0)
"""

The only solution I found is to remove this mon from the quorum, remove all
of its mon data, and re-add it to the quorum again.
Then ceph goes back to healthy status.

but now after some days this mon has stopped and I face the same problem
again.

My cluster setup is:
4 osd hosts
total 8 osds
3 mons
1 rgw

this cluster has setup with ceph-volume lvm and wal/db separation on
logical volumes.

Best regards,
Behnam Loghmani
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is the minimum length of a part in a RGW multipart upload configurable?

2018-02-16 Thread Casey Bodley


On 02/16/2018 12:39 AM, F21 wrote:

I am uploading parts to RGW using the S3 multipart upload functionality.

I tried uploading a part sized at 500 KB and received an EntityTooSmall 
error from the server. I am assuming that it expects each part to have 
a minimum size of 5 MB, like S3.


I found `rgw multipart min part size` being mentioned on the issue 
tracker, but this option does not seem to be in the docs. This PR also 
shows that it was removed: https://github.com/ceph/ceph/pull/9285


Is this still a configurable option?

Thanks,

Francis

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


That is the right config option; it just hasn't been documented. I've 
opened a doc bug for that at http://tracker.ceph.com/issues/23027 - 
anyone interested in helping out can follow up there.
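
For anyone needing it before the docs land, it goes into ceph.conf under the
RGW client section, roughly like this (the section name is a placeholder and
the value, in bytes, is only an example):

  [client.rgw.gateway1]
  rgw multipart min part size = 524288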


Thanks,
Casey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor won't upgrade

2018-02-16 Thread Mark Schouten
On vrijdag 16 februari 2018 13:33:32 CET David Turner wrote:
> Can you send us a `ceph status` and `ceph health detail`? Something is
> still weird. Also can you query the running daemon for its version instead
> of asking the cluster? You should also be able to find it in the logs when
> it starts.

There is no output in ceph status or ceph health detail that reveals anything 
about this situation. The logging was sent in my previous message; it shows 
that it is starting the new version.

The daemon itself seems to report the correct version!

root@proxmox2:/var/run/ceph# ceph --admin-daemon ceph-mon.0.asok version
{"version":"0.94.10"}

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076  | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl


signature.asc
Description: This is a digitally signed message part.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor won't upgrade

2018-02-16 Thread David Turner
Can you send us a `ceph status` and `ceph health detail`? Something is
still weird. Also can you query the running daemon for its version instead
of asking the cluster? You should also be able to find it in the logs when
it starts.

On Fri, Feb 16, 2018, 4:24 AM Mark Schouten  wrote:

> On vrijdag 16 februari 2018 00:21:34 CET Gregory Farnum wrote:
>
> > If mon.0 is not connected to the cluster, the monitor version report
> won’t
>
> > update — how could it?
>
> >
>
> > So you need to figure out why that’s not working. A monitor that’s
> running
>
> > but isn’t part of the active set is not good.
>
>
>
> Obviously, I check the versions after the monitor is restarted. So the
> steps I followed are:
>
>
>
> - `ceph tell mon.* version`
>
> - Stop mon.0
>
> - Verify mon.0 is actually stopped using `ps uaxwww` and `ceph -s` on
> other monitors
>
> - Start mon.0
>
> - `ceph tell mon.* version`
>
>
>
> It's not too weird that various versions of OSDs are running. The box has
> a high uptime and has been upgraded in between adding new OSDs. The
> md5sum of /usr/bin/ceph-mon, which is obviously the daemon that is running,
> is identical on all machines.
>
>
>
> See the attachment for logging on the moment of restarting.
>
>
>
> --
>
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
>
> Mark Schouten | Tuxis Internet Engineering
>
> KvK: 61527076 | http://www.tuxis.nl/
>
> T: 0318 200208 | i...@tuxis.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] puppet for the deployment of ceph

2018-02-16 Thread Александр Пивушков
Colleagues, please tell me: who uses puppet for the deployment of ceph in 
production?
And also, where can I get puppet modules for ceph?


Александр Пивушков
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Jason Dillaman
On Fri, Feb 16, 2018 at 8:08 AM, Jason Dillaman  wrote:
> On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen  wrote:
>> Dear list, hello Jason,
>>
>> you may have seen my message on the Ceph mailing list about RBD pool
>> migration - it's a common subject that pools were created in a sub-optimum
>> fashion and i. e. pgnum is (not yet) reducible, so we're looking into means
>> to "clone" an RBD pool into a new pool within the same cluster (including
>> snapshots).
>>
>> We had looked into creating a tool for this job, but soon noticed that we're
>> duplicating basic functionality of rbd-mirror. So we tested the following,
>> which worked out nicely:
>>
>> - create a test cluster (Ceph cluster plus an Openstack cluster using an RBD
>> pool) and some Openstack instances
>>
>> - create a second Ceph test cluster
>>
>> - stop Openstack
>>
>> - use rbd-mirror to clone the RBD pool from the first to the second Ceph
>> cluster (IOW aborting rbd-mirror once the initial copying was done)
>>
>> - recreate the RBD pool on the first cluster
>>
>> - use rbd-mirror to clone the mirrored pool back to the (newly created) pool
>> on the first cluster
>>
>> - start Openstack and work with the (recreated) pool on the first cluster
>>
>> So using rbd-mirror, we could clone an RBD pool's content to a differently
>> structured pool on the same cluster - by using an intermediate cluster.
>>
>> @Jason: Looking at the commit history for rbd-mirror, it seems you might be
>> able to shed some light on this: Do you see an easy way to modify rbd-mirror
>> in such a fashion that instead of mirroring to a pool on a different cluster
>> (having the same pool name as the original), mirroring would be to a pool on
>> the *same* cluster, (obviously having a pool different name)?
>>
>> From the "rbd cppool" perspective, a one-shot mode of operation would be
>> fully sufficient - but looking at the code, I have not even been able to
>> identify the spots where we might "cut away" the networking part, so that
>> rbd-mirror might do an intra-cluster job.
>>
>> Are you able to judge how much work would need to be done, in order to
>> create a one-shot, intra-cluster version of rbd-mirror? Might it even be
>> something that could be a simple enhancement?
>
> You might be interested in the deep-copy feature that will be included
> in the Mimic release. By running "rbd deep-copy 
> ", it will fully copy the image, including snapshots and
> parentage, to a new image. There is also work-in-progress for online
> image migration [1] that will allow you to keep using the image while
> it's being migrated to a new destination image. Both of these are
> probably more suited to your needs than the heavy-weight RBD mirroring
> process -- especially if you are only interested in the first step
> since RBD mirroring now directly utilizes the deep-copy feature for
> the initial image sync.

... also forgot to mention "rbd export --export-format 2" / "rbd
import --export-format 2" that will also deeply export/import all
snapshots associated with an image and that feature is available in
the Luminous release.
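
For example, something along these lines should work on Luminous (pool and
image names are just placeholders, and the intermediate file obviously needs
enough space to hold the full image):

  rbd export --export-format 2 oldpool/myimage /tmp/myimage.v2
  rbd import --export-format 2 /tmp/myimage.v2 newpool/myimage
  rbd snap ls newpool/myimage   # sanity check that the snapshots came along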

>> Thank you for any information and / or opinion you care to share!
>>
>> With regards,
>> Jens
>>
>
> [1] https://github.com/ceph/ceph/pull/15831
>
> --
> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Jason Dillaman
On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen  wrote:
> Dear list, hello Jason,
>
> you may have seen my message on the Ceph mailing list about RDB pool
> migration - it's a common subject that pools were created in a sub-optimum
> fashion and i. e. pgnum is (not yet) reducible, so we're looking into means
> to "clone" an RBD pool into a new pool within the same cluster (including
> snapshots).
>
> We had looked into creating a tool for this job, but soon noticed that we're
> duplicating basic functionality of rbd-mirror. So we tested the following,
> which worked out nicely:
>
> - create a test cluster (Ceph cluster plus an Openstack cluster using an RBD
> pool) and some Openstack instances
>
> - create a second Ceph test cluster
>
> - stop Openstack
>
> - use rbd-mirror to clone the RBD pool from the first to the second Ceph
> cluster (IOW aborting rbd-mirror once the initial coping was done)
>
> - recreate the RDB pool on the first cluster
>
> - use rbd-mirror to clone the mirrored pool back to the (newly created) pool
> on the first cluster
>
> - start Openstack and work with the (recreated) pool on the first cluster
>
> So using rbd-mirror, we could clone an RBD pool's content to a differently
> structured pool on the same cluster - by using an intermediate cluster.
>
> @Jason: Looking at the commit history for rbd-mirror, it seems you might be
> able to shed some light on this: Do you see an easy way to modify rbd-mirror
> in such a fashion that instead of mirroring to a pool on a different cluster
> (having the same pool name as the original), mirroring would be to a pool on
> the *same* cluster, (obviously having a pool different name)?
>
> From the "rbd cppool" perspective, a one-shot mode of operation would be
> fully sufficient - but looking at the code, I have not even been able to
> identify the spots where we might "cut away" the networking part, so that
> rbd-mirror might do an intra-cluster job.
>
> Are you able to judge how much work would need to be done, in order to
> create a one-shot, intra-cluster version of rbd-mirror? Might it even be
> something that could be a simple enhancement?

You might be interested in the deep-copy feature that will be included
in the Mimic release. By running "rbd deep-copy <source-image-spec>
<destination-image-spec>", it will fully copy the image, including snapshots and
parentage, to a new image. There is also work-in-progress for online
image migration [1] that will allow you to keep using the image while
it's being migrated to a new destination image. Both of these are
probably more suited to your needs than the heavy-weight RBD mirroring
process -- especially if you are only interested in the first step
since RBD mirroring now directly utilizes the deep-copy feature for
the initial image sync.
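
As a rough sketch (the image specs are placeholders), the invocation would
simply be:

  rbd deep-copy oldpool/myimage newpool/myimage

which copies the image, its snapshots and its parentage into the target pool
in one go.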

> Thank you for any information and / or opinion you care to share!
>
> With regards,
> Jens
>

[1] https://github.com/ceph/ceph/pull/15831

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph df: Raw used vs. used vs. actual bytes in cephfs

2018-02-16 Thread Flemming Frandsen
I'm trying out cephfs and I'm in the process of copying over some 
real-world data to see what happens.


I have created a number of CephFS file systems; the only one I've
started working on is the one named jenkins, which lives in the
fs_jenkins_data and fs_jenkins_metadata pools.


According to ceph df I have about 1387 GB of data across all of the pools, 
while the raw used space is 5918 GB, which gives a ratio of about 4.3. I 
would have expected a ratio of around 2, as the pool size has been set to 2.



Can anyone explain where half my space has been squandered?

> ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8382G     2463G        5918G         70.61
POOLS:
    NAME                         ID     USED       %USED     MAX AVAIL      OBJECTS
    .rgw.root                     1       1113         0          258G            4
    default.rgw.control           2          0         0          258G            8
    default.rgw.meta              3          0         0          258G            0
    default.rgw.log               4          0         0          258G          207
    fs_docker-nexus_data          5     66120M     11.09          258G        22655
    fs_docker-nexus_metadata      6     39463k         0          258G         2376
    fs_meta_data                  7        330         0          258G            4
    fs_meta_metadata              8       567k         0          258G           22
    fs_jenkins_data               9      1321G     71.84          258G     28576278
    fs_jenkins_metadata          10     52178k         0          258G      2285493
    fs_nexus_data                11          0         0          258G            0
    fs_nexus_metadata            12       4181         0          258G           21

--
 Regards Flemming Frandsen - Stibo Systems - DK - STEP Release Manager
 Please use rele...@stibo.com for all Release Management requests

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Ansgar Jazdzewski
hi,

while I was adding the "class" to all my OSDs the ceph-mgr crashed :-( but
the prometheus plugin works now

for i in {1..9}; do ceph osd crush set-device-class hdd osd.$i; done

Thanks,
Ansgar

2018-02-16 10:12 GMT+01:00 Jan Fajerski :
> On Fri, Feb 16, 2018 at 09:27:08AM +0100, Ansgar Jazdzewski wrote:
>>
>> Hi Folks,
>>
>> i just try to get the prometheus plugin up and runing but as soon as i
>> browse /metrics i got:
>>
>> 500 Internal Server Error
>> The server encountered an unexpected condition which prevented it from
>> fulfilling the request.
>>
>> Traceback (most recent call last):
>>  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
>> 670, in respond
>>response.body = self.handler()
>>  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
>> line 217, in __call__
>>self.body = self.oldhandler(*args, **kwargs)
>>  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
>> line 61, in __call__
>>return self.callable(*self.args, **self.kwargs)
>>  File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
>>metrics = global_instance().collect()
>>  File "/usr/lib/ceph/mgr/prometheus/module.py", line 323, in collect
>>self.get_metadata_and_osd_status()
>>  File "/usr/lib/ceph/mgr/prometheus/module.py", line 283, in
>> get_metadata_and_osd_status
>>dev_class['class'],
>> KeyError: 'class'
>
> This error is part of the osd metadata metric. Which version of Ceph are you
> running this with? Specifically the Crush Map of this cluster seems to not
> have the device class for each OSD yet.
>>
>>
>> I assume that i have to change the mkgr cephx kex? but iam not 100% sure
>>
>> mgr.mgr01
>>   key: AQAqLIRasocnChAAbOIEMKVEWWHCbgVeEctwng==
>>   caps: [mds] allow *
>>   caps: [mon] allow profile mgr
>>   caps: [osd] allow *
>>
>> thanks for your help,
>> Ansgar
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> --
> Jan Fajerski
> Engineer Enterprise Storage
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Ansgar Jazdzewski
we upgraded the cluster from jewel to luminous

but restarting the ceph-osd service did not add the 'class' (hdd/ssd)
on its own if it did not exist, so I had to add it myself to make it
work.

Should this somehow be mentioned in the upgrade process?

Something like: for all OSDs, make sure that the "class" is set in the
crushmap by using "ceph osd crush set-device-class hdd osd.<id>".
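
A minimal sketch of that (the "hdd" class and the loop over all OSD ids are
just examples; use "ssd" where appropriate):

  ceph osd crush tree    # the CLASS column is empty for the affected OSDs
  for id in $(ceph osd ls); do
      ceph osd crush set-device-class hdd osd.$id
  done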

the versions on the stage cluster
# ceph versions
{
   "mon": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 3
   },
   "mgr": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 1
   },
   "osd": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 10
   },
   "mds": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 2
   },
   "rgw": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 3
   },
   "overall": {
   "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)": 19
   }
}


2018-02-16 12:09 GMT+01:00 John Spray :
> On Fri, Feb 16, 2018 at 8:27 AM, Ansgar Jazdzewski
>  wrote:
>> Hi Folks,
>>
>> i just try to get the prometheus plugin up and runing but as soon as i
>> browse /metrics i got:
>>
>> 500 Internal Server Error
>> The server encountered an unexpected condition which prevented it from
>> fulfilling the request.
>>
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
>> 670, in respond
>> response.body = self.handler()
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
>> line 217, in __call__
>> self.body = self.oldhandler(*args, **kwargs)
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
>> line 61, in __call__
>> return self.callable(*self.args, **self.kwargs)
>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
>> metrics = global_instance().collect()
>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 323, in collect
>> self.get_metadata_and_osd_status()
>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 283, in
>> get_metadata_and_osd_status
>> dev_class['class'],
>> KeyError: 'class'
>
> Looks a lot like you might be trying to run a >= luminous ceph-mgr
> with an older Ceph cluster?
>
> What versions are you running?
>
> John
>
>>
>> I assume that i have to change the mkgr cephx kex? but iam not 100% sure
>>
>> mgr.mgr01
>>key: AQAqLIRasocnChAAbOIEMKVEWWHCbgVeEctwng==
>>caps: [mds] allow *
>>caps: [mon] allow profile mgr
>>caps: [osd] allow *
>>
>> thanks for your help,
>> Ansgar
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread John Spray
On Fri, Feb 16, 2018 at 8:27 AM, Ansgar Jazdzewski
 wrote:
> Hi Folks,
>
> i just try to get the prometheus plugin up and runing but as soon as i
> browse /metrics i got:
>
> 500 Internal Server Error
> The server encountered an unexpected condition which prevented it from
> fulfilling the request.
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
> 670, in respond
> response.body = self.handler()
>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
> line 217, in __call__
> self.body = self.oldhandler(*args, **kwargs)
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
> line 61, in __call__
> return self.callable(*self.args, **self.kwargs)
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
> metrics = global_instance().collect()
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 323, in collect
> self.get_metadata_and_osd_status()
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 283, in
> get_metadata_and_osd_status
> dev_class['class'],
> KeyError: 'class'

Looks a lot like you might be trying to run a >= luminous ceph-mgr
with an older Ceph cluster?

What versions are you running?

John

>
> I assume that i have to change the mkgr cephx kex? but iam not 100% sure
>
> mgr.mgr01
>key: AQAqLIRasocnChAAbOIEMKVEWWHCbgVeEctwng==
>caps: [mds] allow *
>caps: [mon] allow profile mgr
>caps: [osd] allow *
>
> thanks for your help,
> Ansgar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Efficient deletion of large radosgw buckets

2018-02-16 Thread Sean Purdy
Thanks David.


> purging the objects and bypassing the GC is definitely the way to go

Cool.

> What rebalancing do you expect to see during this operation that you're 
> trying to avoid

I think I just have a poor understanding, or wasn't thinking very hard :)  I 
suppose the question really was "are there any performance implications in 
deleting large buckets that I should be aware of?". So, not really; it will 
just take a while.

The actual cluster is small and balanced with free space.  Buckets are not 
customer-facing.
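
In case it is useful for the archive, the multi-threaded client-side purge
David describes could look roughly like this (Python 3 sketch; endpoint,
credentials and bucket name are placeholders, and it assumes boto3 speaking
S3 to the radosgw):

  from concurrent.futures import ThreadPoolExecutor
  import boto3

  s3 = boto3.client('s3', endpoint_url='http://radosgw.example.com',
                    aws_access_key_id='ACCESS', aws_secret_access_key='SECRET')

  def delete_batch(keys):
      # delete_objects accepts up to 1000 keys per call
      s3.delete_objects(Bucket='test',
                        Delete={'Objects': [{'Key': k} for k in keys]})

  with ThreadPoolExecutor(max_workers=8) as pool:
      paginator = s3.get_paginator('list_objects')
      for page in paginator.paginate(Bucket='test'):
          keys = [o['Key'] for o in page.get('Contents', [])]
          if keys:
              pool.submit(delete_batch, keys)

Note that this only empties the bucket quickly from the client's point of
view; the actual space is still reclaimed by the radosgw garbage collector
afterwards.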


Thanks for the advice,

Sean


On Thu, 15 Feb 2018, David Turner said:
> Which is more important to you?  Deleting the bucket fast or having the
> used space become available?  If deleting the bucket fast is the priority,
> then you can swamp the GC by multithreading object deletion from the bucket
> with python or something.  If having everything deleted and cleaned up from
> the cluster is the priority (which is most likely the case), then what you
> have there is the best option.  If you want to do it in the background away
> from what the client can see, then you can change the ownership of the
> bucket so they no longer see it and then take care of the bucket removal in
> the background, but purging the objects and bypassing the GC is definitely
> the way to go. ... It's just really slow.
> 
> I just noticed that your question is about ceph rebalancing.  What
> rebalancing do you expect to see during this operation that you're trying
> to avoid?  I'm unaware of any such rebalancing (unless it might be the new
> automatic OSD rebalancing mechanism in Luminous to keep OSDs even... but
> deleting data shouldn't really trigger that if the cluster is indeed
> balanced).
> 
> On Thu, Feb 15, 2018 at 9:13 AM Sean Purdy  wrote:
> 
> >
> > Hi,
> >
> > I have a few radosgw buckets with millions or tens of millions of
> > objects.  I would like to delete these entire buckets.
> >
> > Is there a way to do this without ceph rebalancing as it goes along?
> >
> > Is there anything better than just doing:
> >
> > radosgw-admin bucket rm --bucket=test --purge-objects --bypass-gc
> >
> >
> > Thanks,
> >
> > Sean Purdy
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Jens-U. Mozdzen

Dear list, hello Jason,

you may have seen my message on the Ceph mailing list about RBD pool 
migration - it's a common situation that pools were created in a 
sub-optimal fashion and, e.g., pgnum is (not yet) reducible, so we're 
looking into means to "clone" an RBD pool into a new pool within the 
same cluster (including snapshots).


We had looked into creating a tool for this job, but soon noticed that  
we're duplicating basic functionality of rbd-mirror. So we tested the  
following, which worked out nicely:


- create a test cluster (Ceph cluster plus an Openstack cluster using  
an RBD pool) and some Openstack instances


- create a second Ceph test cluster

- stop Openstack

- use rbd-mirror to clone the RBD pool from the first to the second  
Ceph cluster (IOW aborting rbd-mirror once the initial copying was done)


- recreate the RBD pool on the first cluster

- use rbd-mirror to clone the mirrored pool back to the (newly  
created) pool on the first cluster


- start Openstack and work with the (recreated) pool on the first cluster

So using rbd-mirror, we could clone an RBD pool's content to a  
differently structured pool on the same cluster - by using an  
intermediate cluster.
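
For anyone wanting to reproduce this, the one-way rbd-mirror setup we used
follows the standard pattern, roughly as below (pool and cluster names are
illustrative; note that in pool mode only images with the journaling feature
enabled are mirrored):

  # on both clusters, enable pool-level mirroring for the pool
  rbd mirror pool enable volumes pool
  # on the second cluster, register the first cluster as a peer
  rbd --cluster second mirror pool peer add volumes client.admin@first
  # then run rbd-mirror on the second cluster until the initial sync is
  # complete, and stop it again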


@Jason: Looking at the commit history for rbd-mirror, it seems you  
might be able to shed some light on this: Do you see an easy way to  
modify rbd-mirror in such a fashion that instead of mirroring to a  
pool on a different cluster (having the same pool name as the  
original), mirroring would be to a pool on the *same* cluster  
(obviously having a different pool name)?


From the "rbd cppool" perspective, a one-shot mode of operation would  
be fully sufficient - but looking at the code, I have not even been  
able to identify the spots where we might "cut away" the networking  
part, so that rbd-mirror might do an intra-cluster job.


Are you able to judge how much work would need to be done, in order to  
create a one-shot, intra-cluster version of rbd-mirror? Might it even  
be something that could be a simple enhancement?


Thank you for any information and / or opinion you care to share!

With regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] libvirt on ceph - external snapshots?

2018-02-16 Thread João Pagaime

Hello all,

I have a VM system with libvirt/KVM on top of a Ceph storage system.

I can't take an external snapshot (disk + RAM) of a running VM; the 
option View->Snapshots is disabled in the virt-manager application. On 
another VM on the same hypervisor that runs on local storage, I can take 
snapshots in virt-manager.


Is it possible to do external snapshots with libvirt/KVM on top of a 
Ceph storage system? If so, how do I configure it?


Here is an example of a running VM (output from "virsh dumpxml x"):

[the relevant disk XML from the domain definition was stripped by the list archiver]
thanks for any insight!

best regards

João

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Ansgar Jazdzewski
Thanks, I will have a look into it.

--
Ansgar

2018-02-16 10:10 GMT+01:00 Konstantin Shalygin :
>> i just try to get the prometheus plugin up and runing
>
>
>
> Use module from master.
>
> From this commit should work with 12.2.2, just wget it and replace stock
> module.
>
> https://github.com/ceph/ceph/blob/d431de74def1b8889ad568ab99436362833d063e/src/pybind/mgr/prometheus/module.py
>
>
>
>
> k
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] balancer mgr module

2018-02-16 Thread Caspar Smit
2018-02-16 10:16 GMT+01:00 Dan van der Ster :

> Hi Caspar,
>
> I've been trying the mgr balancer for a couple weeks now and can share
> some experience.
>
> Currently there are two modes implemented: upmap and crush-compat.
>
> Upmap requires all clients to be running luminous -- it uses this new
> pg-upmap mechanism to precisely move PGs one by one to a more balanced
> layout.
> The upmap mode is working only with num PGs, AFAICT, and on at least
> one of our clusters it happens to be moving PGs in a pool with no data
> -- useless. Checking the implementation, it should be upmapping PGs
> from a random pool each iteration -- I have a tracker open for this:
> http://tracker.ceph.com/issues/22431
>
> Upmap is the future, but for now I'm trying to exercise the
> crush-compat mode on some larger clusters. It's still early days, but
> in general it seems to be working in the right direction.
> crush-compat does two things: it creates a new "compat" crush
> weight-set to give underutilized OSDs more crush weight; and second,
> it phases out the osd reweights back to 1.0. So, if you have a cluster
> that was previously balanced with ceph osd reweight-by-*, then
> crush-compat will gently bring you to the new balancing strategy.
>
> There have been a few issues spotted in 12.2.2... some of the balancer
> config-key settings aren't cast properly to int/float so they can
> break the balancer; and more importantly the mgr doesn't refresh
> config-keys if they change. So if you do change the configuration, you
> need to ceph mgr fail  to force the next mgr to reload
> the config.
>
> My current config is:
>
> ceph config-key dump
> {
> "mgr/balancer/active": "1",
> "mgr/balancer/begin_time": "0830",
> "mgr/balancer/end_time": "1600",
> "mgr/balancer/max_misplaced": "0.01",
> "mgr/balancer/mode": "crush-compat"
> }
>
> Note that the begin_time/end_time seem to be in UTC, not the local time
> zone.
> max_displaced defaults to 0.05, and this is used to limit the
> percentage of PGs/objects to be rebalanced each iteration.
>
> I have it enabled (ceph balancer on) which means it tries to balance
> every 60s. It will skip an iteration if num misplaced is greater than
> > max_misplaced, or if any objects are degraded.
>
> When you're first trying the balancer you should do two things to test
> a one-off balancing (rather than the always on mode that I use):
>   - set debug_mgr=4/5 # then you can tail -f ceph-mgr.*.log | grep
> balancer  to see what it's doing
>   - ceph balancer mode crush-compat
>   - ceph balancer eval # to check the current score
>   - ceph balancer optimize myplan # create but do not execute a new plan
>   - ceph balancer eval myplan # check what would be the new score
> after myplan. Is it getting closer to the optimal value 0?
>   - ceph balancer show myplan # study what it's trying to do
>   - ceph balancer execute myplan # execute the plan. data movement starts
> here!
>   - ceph balancer reset # we do this because balancer rm is broken,
> and myplan isn't removed automatically after execution
>
> v12.2.3 has quite a few balancer fixes, and also adds a pool-specific
> balancing (which should hopefully fix my upmap issue).
>
> Hope that helps!
>
>
It sure does Dan! Thank you very much for your detailed answer.

I will start testing the balancer module with our demo cluster.

Caspar



> Dan
>
>
>
> On Fri, Feb 16, 2018 at 9:22 AM, Caspar Smit 
> wrote:
> > Hi,
> >
> > After watching Sage's talk at LinuxConfAU about making distributed
> storage
> > easy he mentioned the Balancer Manager module. After enabling this
> module,
> > pg's should get balanced automagically around the cluster.
> >
> > The module was added in Ceph Luminous v12.2.2
> >
> > Since i couldn't find much documentation about this module i was
> wondering
> > if it is considered stable? (production ready) or still experimental/WIP.
> >
> > Here's the original mailinglist post describing the module:
> >
> > https://www.spinics.net/lists/ceph-devel/msg37730.html
> >
> > A few questions:
> >
> > What are the differences between the different optimization modes?
> > Is the balancer run at certain intervals, if yes, what is the interval?
> > Will this trigger continuous backfillling/recovering of pg's when a
> cluster
> > is mostly under write load?
> >
> > Kind regards,
> > Caspar
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor won't upgrade

2018-02-16 Thread Mark Schouten
On vrijdag 16 februari 2018 00:21:34 CET Gregory Farnum wrote:
> If mon.0 is not connected to the cluster, the monitor version report won’t
> update — how could it?
> 
> So you need to figure out why that’s not working. A monitor that’s running
> but isn’t part of the active set is not good.

Obviously, I check the versions after the monitor is restarted. So the steps I 
followed are:

- `ceph tell mon.* version`
- Stop mon.0
- Verify mon.0 is actually stopped using `ps uaxwww` and `ceph -s` on other 
monitors
- Start mon.0
- `ceph tell mon.* version`

It's not too weird that various versions of OSDs are running. The box has a 
high uptime and has been upgraded in between adding new OSDs. The md5sum of 
/usr/bin/ceph-mon, which is obviously the daemon that is running, is identical 
on all machines.

See the attachment for logging on the moment of restarting.

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076  | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl
2018-02-14 03:45:02.142186 7ff8d3348700  0 quorum service shutdown
2018-02-14 03:45:38.928910 7fada75a8880  0 ceph version 0.94.10 
(b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-mon, pid 55243
2018-02-14 03:45:38.979708 7fada75a8880  0 starting mon.0 rank 1 at 
192.168.100.2:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 
36dab522-b24d-44db-8df8-e7aa6b901f66
2018-02-14 03:45:38.979958 7fada75a8880  1 mon.0@-1(probing) e3 preinit fsid 
36dab522-b24d-44db-8df8-e7aa6b901f66
2018-02-14 03:45:38.980313 7fada75a8880  1 mon.0@-1(probing).paxosservice(pgmap 
59142586..59143174) refresh upgraded, format 0 -> 1
2018-02-14 03:45:38.980321 7fada75a8880  1 mon.0@-1(probing).pg v0 on_upgrade 
discarding in-core PGMap
2018-02-14 03:45:38.985070 7fada75a8880  0 mon.0@-1(probing).mds e1 print_map
epoch   1
flags   0
created 0.00
modified2016-01-20 11:09:14.572361
tableserver 0
root0
session_timeout 0
session_autoclose   0
max_file_size   0
last_failure0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={}
max_mds 0
in  
up  {}
failed  
stopped 
data_pools  
metadata_pool   0
inline_data disabled

2018-02-14 03:45:38.985331 7fada75a8880  0 mon.0@-1(probing).osd e4020 crush 
map has features 1107558400, adjusting msgr requires
2018-02-14 03:45:38.985339 7fada75a8880  0 mon.0@-1(probing).osd e4020 crush 
map has features 1107558400, adjusting msgr requires
2018-02-14 03:45:38.985342 7fada75a8880  0 mon.0@-1(probing).osd e4020 crush 
map has features 1107558400, adjusting msgr requires
2018-02-14 03:45:38.985344 7fada75a8880  0 mon.0@-1(probing).osd e4020 crush 
map has features 1107558400, adjusting msgr requires
2018-02-14 03:45:38.985731 7fada75a8880  1 mon.0@-1(probing).paxosservice(auth 
19251..19482) refresh upgraded, format 0 -> 1
2018-02-14 03:45:38.986735 7fada75a8880  0 mon.0@-1(probing) e3  my rank is now 
1 (was -1)
2018-02-14 03:45:38.988799 7fad9ce13700  0 -- 192.168.100.2:6789/0 >> 
192.168.100.3:6789/0 pipe(0x3d26000 sd=22 :8 s=2 pgs=776395 cs=1 l=0 
c=0x3957a20).reader missed message?  skipped from seq 0 to 1886081844
2018-02-14 03:45:38.989156 7fad9cd12700  0 -- 192.168.100.2:6789/0 >> 
192.168.100.1:6789/0 pipe(0x4132000 sd=17 :56104 s=2 pgs=18652390 cs=1 l=0 
c=0x3957760).reader missed message?  skipped from seq 0 to 1847147828
2018-02-14 03:45:38.992068 7fada28a1700  0 log_channel(audit) log [DBG] : 
from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2018-02-14 03:45:38.992152 7fada28a1700  0 log_channel(audit) log [DBG] : 
from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] balancer mgr module

2018-02-16 Thread Dan van der Ster
Hi Caspar,

I've been trying the mgr balancer for a couple weeks now and can share
some experience.

Currently there are two modes implemented: upmap and crush-compat.

Upmap requires all clients to be running luminous -- it uses this new
pg-upmap mechanism to precisely move PGs one by one to a more balanced
layout.
The upmap mode is working only with num PGs, AFAICT, and on at least
one of our clusters it happens to be moving PGs in a pool with no data
-- useless. Checking the implementation, it should be upmapping PGs
from a random pool each iteration -- I have a tracker open for this:
http://tracker.ceph.com/issues/22431

Upmap is the future, but for now I'm trying to exercise the
crush-compat mode on some larger clusters. It's still early days, but
in general it seems to be working in the right direction.
crush-compat does two things: it creates a new "compat" crush
weight-set to give underutilized OSDs more crush weight; and second,
it phases out the osd reweights back to 1.0. So, if you have a cluster
that was previously balanced with ceph osd reweight-by-*, then
crush-compat will gently bring you to the new balancing strategy.

There have been a few issues spotted in 12.2.2... some of the balancer
config-key settings aren't cast properly to int/float so they can
break the balancer; and more importantly the mgr doesn't refresh
config-keys if they change. So if you do change the configuration, you
need to run "ceph mgr fail <active mgr>" to force the next mgr to reload
the config.

My current config is:

ceph config-key dump
{
"mgr/balancer/active": "1",
"mgr/balancer/begin_time": "0830",
"mgr/balancer/end_time": "1600",
"mgr/balancer/max_misplaced": "0.01",
"mgr/balancer/mode": "crush-compat"
}
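
(To reproduce that state, roughly the following should do; the final
argument to "ceph mgr fail" is whatever "ceph mgr dump" reports as the
active mgr:)

ceph config-key set mgr/balancer/mode crush-compat
ceph config-key set mgr/balancer/max_misplaced 0.01
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1600
ceph balancer on
ceph mgr fail <active mgr>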

Note that the begin_time/end_time seem to be in UTC, not the local time zone.
max_misplaced defaults to 0.05, and this is used to limit the
percentage of PGs/objects to be rebalanced each iteration.

I have it enabled (ceph balancer on) which means it tries to balance
every 60s. It will skip an iteration if num misplaced is greater than
max_misplaced, or if any objects are degraded.

When you're first trying the balancer you should do two things to test
a one-off balancing (rather than the always on mode that I use):
  - set debug_mgr=4/5 # then you can tail -f ceph-mgr.*.log | grep
balancer  to see what it's doing
  - ceph balancer mode crush-compat
  - ceph balancer eval # to check the current score
  - ceph balancer optimize myplan # create but do not execute a new plan
  - ceph balancer eval myplan # check what would be the new score
after myplan. Is it getting closer to the optimal value 0?
  - ceph balancer show myplan # study what it's trying to do
  - ceph balancer execute myplan # execute the plan. data movement starts here!
  - ceph balancer reset # we do this because balancer rm is broken,
and myplan isn't removed automatically after execution

v12.2.3 has quite a few balancer fixes, and also adds a pool-specific
balancing (which should hopefully fix my upmap issue).

Hope that helps!

Dan



On Fri, Feb 16, 2018 at 9:22 AM, Caspar Smit  wrote:
> Hi,
>
> After watching Sage's talk at LinuxConfAU about making distributed storage
> easy he mentioned the Balancer Manager module. After enabling this module,
> pg's should get balanced automagically around the cluster.
>
> The module was added in Ceph Luminous v12.2.2
>
> Since i couldn't find much documentation about this module i was wondering
> if it is considered stable? (production ready) or still experimental/WIP.
>
> Here's the original mailinglist post describing the module:
>
> https://www.spinics.net/lists/ceph-devel/msg37730.html
>
> A few questions:
>
> What are the differences between the different optimization modes?
> Is the balancer run at certain intervals, if yes, what is the interval?
> Will this trigger continuous backfillling/recovering of pg's when a cluster
> is mostly under write load?
>
> Kind regards,
> Caspar
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Jan Fajerski

On Fri, Feb 16, 2018 at 09:27:08AM +0100, Ansgar Jazdzewski wrote:

Hi Folks,

I'm just trying to get the prometheus plugin up and running, but as soon as I
browse /metrics I get:

500 Internal Server Error
The server encountered an unexpected condition which prevented it from
fulfilling the request.

Traceback (most recent call last):
 File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
670, in respond
   response.body = self.handler()
 File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
line 217, in __call__
   self.body = self.oldhandler(*args, **kwargs)
 File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
line 61, in __call__
   return self.callable(*self.args, **self.kwargs)
 File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
   metrics = global_instance().collect()
 File "/usr/lib/ceph/mgr/prometheus/module.py", line 323, in collect
   self.get_metadata_and_osd_status()
 File "/usr/lib/ceph/mgr/prometheus/module.py", line 283, in
get_metadata_and_osd_status
   dev_class['class'],
KeyError: 'class'
This error is part of the osd metadata metric. Which version of Ceph are you 
running this with? Specifically the Crush Map of this cluster seems to not have 
the device class for each OSD yet.
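
A quick way to check (command names as of Luminous):

  ceph osd crush tree       # the CLASS column will be empty for affected OSDs
  ceph osd crush class ls   # lists the device classes known to the crush map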


I assume that I have to change the mgr cephx key, but I am not 100% sure.

mgr.mgr01
  key: AQAqLIRasocnChAAbOIEMKVEWWHCbgVeEctwng==
  caps: [mds] allow *
  caps: [mon] allow profile mgr
  caps: [osd] allow *

thanks for your help,
Ansgar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Konstantin Shalygin

I'm just trying to get the prometheus plugin up and running



Use the module from master.

From this commit it should work with 12.2.2; just wget it and replace the 
stock module.


https://github.com/ceph/ceph/blob/d431de74def1b8889ad568ab99436362833d063e/src/pybind/mgr/prometheus/module.py
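
A rough sketch of the swap (the raw URL is derived from the blob link above;
the module path and restart assume the default Luminous packages with
systemd):

wget -O /usr/lib/ceph/mgr/prometheus/module.py \
  https://raw.githubusercontent.com/ceph/ceph/d431de74def1b8889ad568ab99436362833d063e/src/pybind/mgr/prometheus/module.py
systemctl restart ceph-mgr@mgr01    # or: ceph mgr fail mgr01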




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw: Huge Performance impact during dynamic bucket index resharding

2018-02-16 Thread Micha Krause

Hi,

Radosgw decided to reshard a bucket with 25 million objects from 256 to 512 
shards.

Resharding took about 1 hour; during this time all buckets on the cluster had a 
huge performance drop.

"GET" requests for small objects (on other buckets) took multiple seconds.

Are there any configuration options to reduce this impact, or to limit 
resharding to a maximum of 256 shards?
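
The options I have found so far are the following (Luminous option names; I
have not verified yet whether they actually address the problem):

  # ceph.conf on the rgw nodes
  rgw dynamic resharding = false     # disable automatic resharding entirely
  rgw max objs per shard = 200000    # or raise the per-shard threshold (default 100000)

  # and reshard manually off-peak instead
  radosgw-admin reshard add --bucket=<bucket> --num-shards=512
  radosgw-admin reshard process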


Micha Krause
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous and calamari

2018-02-16 Thread Lenz Grimmer
On 02/16/2018 07:16 AM, Kai Wagner wrote:

> yes there are plans to add management functionality to the dashboard as
> well. As soon as we're covered all the existing functionality to create
> the initial PR we'll start with the management stuff. The big benefit
> here is, that we can profit what we've already done within openATTIC.
> 
> If you've missed the ongoing Dashboard V2 discussions and work, here's a
> blog post to follow up:
> 
> https://www.openattic.org/posts/ceph-manager-dashboard-v2/
> 
> Let us know about your thoughts on this.

And just to add to this - of course the standalone version of openATTIC
is still available, too ;)

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-16 Thread Ansgar Jazdzewski
Hi Folks,

I'm just trying to get the prometheus plugin up and running, but as soon as I
browse /metrics I get:

500 Internal Server Error
The server encountered an unexpected condition which prevented it from
fulfilling the request.

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
670, in respond
response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
line 61, in __call__
return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
metrics = global_instance().collect()
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 323, in collect
self.get_metadata_and_osd_status()
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 283, in
get_metadata_and_osd_status
dev_class['class'],
KeyError: 'class'

I assume that I have to change the mgr cephx key, but I am not 100% sure.

mgr.mgr01
   key: AQAqLIRasocnChAAbOIEMKVEWWHCbgVeEctwng==
   caps: [mds] allow *
   caps: [mon] allow profile mgr
   caps: [osd] allow *

thanks for your help,
Ansgar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] balancer mgr module

2018-02-16 Thread Caspar Smit
 Hi,

In Sage's talk at LinuxConfAU about making distributed storage easy, he
mentioned the Balancer manager module. After enabling this module, PGs
should get balanced automagically around the cluster.

The module was added in Ceph Luminous v12.2.2

Since I couldn't find much documentation about this module, I was wondering
whether it is considered stable (production ready) or still experimental/WIP.

Here's the original mailinglist post describing the module:

https://www.spinics.net/lists/ceph-devel/msg37730.html

A few questions:

What are the differences between the different optimization modes?
Is the balancer run at certain intervals? If yes, what is the interval?
Will this trigger continuous backfilling/recovery of PGs when a cluster
is mostly under write load?

Kind regards,
Caspar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com