Re: [ceph-users] Large omap objects - how to fix ?

2018-11-02 Thread Ben Morrice

Thanks for everyone's comments, including the thread hijackers :)

I solved this in our infrastructure slightly differently:

1) find largest omap(s)
# for i in `rados -p .bbp-gva-master.rgw.buckets.index ls`; do echo -n "$i:"; rados -p .bbp-gva-master.rgw.buckets.index listomapkeys $i | wc -l; done > omapkeys

# sort -t: -k2 -r -n omapkeys  |head -1
.dir.bbp-gva-master.125103342.18:7558822

2) confirm that the above index is not used by any buckets
# cat bucketstats
#!/bin/bash
for bucket in $(radosgw-admin bucket list | jq -r .[]); do
    bucket_id=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.bucket_id)
    marker=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.marker)

    echo "$bucket:$bucket_id:$marker"
done
# ./bucketstats > bucketstats.out
# grep 125103342.18 bucketstats.out
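
As an extra sanity check before deleting anything, something like the following should confirm that no bucket instance metadata still references that marker, and that the object is really there (a sketch; the grep should come back empty):

# radosgw-admin metadata list bucket.instance | grep 125103342.18
# rados -p .bbp-gva-master.rgw.buckets.index stat .dir.bbp-gva-master.125103342.18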

3) delete the rados object
rados -p .bbp-gva-master.rgw.buckets.index rm .dir.bbp-gva-master.125103342.18


4) perform a deep scrub on the PGs that were affected
# for i in `ceph pg ls-by-pool .bbp-gva-master.rgw.buckets.index | tail -n +2 | awk '{print $1}'`; do echo -n "$i: "; ceph pg $i query | grep num_large_omap_objects | head -1 | awk '{print $2}'; done | grep ": 1"

137.1b: 1
137.36: 1
# ceph pg deep-scrub 137.1b
# ceph pg deep-scrub 137.36
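
Once the deep scrubs complete, the same checks as above can be re-run to confirm the warning clears (should report HEALTH_OK and return no PGs with num_large_omap_objects set):

# ceph health detail
# for i in `ceph pg ls-by-pool .bbp-gva-master.rgw.buckets.index | tail -n +2 | awk '{print $1}'`; do echo -n "$i: "; ceph pg $i query | grep num_large_omap_objects | head -1 | awk '{print $2}'; done | grep ": 1"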



Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 10/31/2018 11:02 AM, Alexandru Cucu wrote:

Hi,

Didn't know that auto resharding does not remove old instances. Wrote
my own script for cleanup, as I discovered this before reading your
message.
Not very well tested, but here it is:

for bucket in $(radosgw-admin bucket list | jq -r .[]); do
 bucket_id=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.bucket_id)
 marker=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.marker)
 for instance in $(radosgw-admin metadata list bucket.instance | jq -r .[] | grep "^${bucket}:" | grep -v ${bucket_id} | grep -v ${marker} | cut -f2 -d':'); do
  radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
  radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
 done
done
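
A dry-run variant of the same loop (a sketch, equally untested) that only prints what would be purged, for anyone who wants to review before deleting:

for bucket in $(radosgw-admin bucket list | jq -r .[]); do
 bucket_id=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.bucket_id)
 marker=$(radosgw-admin metadata get bucket:${bucket} | jq -r .data.bucket.marker)
 for instance in $(radosgw-admin metadata list bucket.instance | jq -r .[] | grep "^${bucket}:" | grep -v ${bucket_id} | grep -v ${marker} | cut -f2 -d':'); do
  # print only; swap the echo for the bi purge / metadata rm commands above to actually clean up
  echo "would purge: bucket=${bucket} instance=${instance}"
 done
done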


On Tue, Oct 30, 2018 at 3:30 PM Tomasz Płaza  wrote:

Hi hijackers,

Please read: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030317.html

TL;DR: Ceph should reshard big indexes, but after that it leaves the old index 
objects to be removed manually. Starting from some version, deep-scrub reports 
indexes above some threshold as HEALTH_WARN. You should find the details in the 
OSD logs. If you do not have the logs, just run listomapkeys on every object in 
default.rgw.buckets.index and find the biggest ones... it should be safe to 
remove those (radosgw-admin bi purge), but I cannot guarantee it.
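
For finding them via the logs, something like this should work (a sketch; log paths assume a default RHEL/CentOS packaging layout):

# grep -i 'large omap object found' /var/log/ceph/ceph.log          # cluster log, on a mon host
# grep -i 'large omap object found' /var/log/ceph/ceph-osd.*.log    # per-OSD logs, on the OSD hosts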


On 26.10.2018 at 17:18, Florian Engelmann wrote:

Hi,

hijacking the hijacker! Sorry!

radosgw-admin bucket reshard --bucket somebucket --num-shards 8
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed manually ***
tenant:
bucket name: somebucket
old bucket instance id: cb1594b3-a782-49d0-a19f-68cd48870a63.1923153.1
new bucket instance id: cb1594b3-a782-49d0-a19f-68cd48870a63.3119759.1
total entries: 1000 2000 3000 ... 207000 207660

What to do now?

ceph -s is still:

 health: HEALTH_WARN
 1 large omap objects

But I have no idea how to:
*** NOTICE: operation will not remove old bucket ind
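
A possible manual cleanup, based on the old instance id printed above (a sketch only - it assumes the index pool is default.rgw.buckets.index; double-check the ids and pool name before removing anything):

# rados -p default.rgw.buckets.index ls | grep cb1594b3-a782-49d0-a19f-68cd48870a63.1923153.1
# radosgw-admin bi purge --bucket=somebucket --bucket-id=cb1594b3-a782-49d0-a19f-68cd48870a63.1923153.1
# radosgw-admin metadata rm bucket.instance:somebucket:cb1594b3-a782-49d0-a19f-68cd48870a63.1923153.1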

[ceph-users] Large omap objects - how to fix ?

2018-10-26 Thread Ben Morrice

Hello all,

After a recent Luminous upgrade (now running 12.2.8 with all OSDs 
migrated to bluestore; upgraded from 11.2.0, which was running filestore) I am 
currently experiencing the warning 'large omap objects'.
I know this is related to large buckets in radosgw, and Luminous 
supports 'dynamic sharding' - however I feel that something is missing 
from our configuration and I'm a bit confused about the right approach 
to fix it.


First a bit of background info:

We previously had a multi site radosgw installation, however recently we 
decommissioned the second site. With the radosgw multi-site 
configuration we had 'bucket_index_max_shards = 0'. Since 
decommissioning the second site, I have removed the secondary zonegroup 
and changed 'bucket_index_max_shards' to be 16 for the single primary zone.
None of our buckets have a 'num_shards' field when running 
'radosgw-admin bucket stats --bucket <bucket>'

Is this normal ?

Also - I'm finding it difficult to work out exactly what to do with the 
buckets that are affected by 'large omap' (see commands below).

My interpretation of 'search the cluster log' is also listed below.

What do I need to do with the buckets below to get back to an overall 
ceph HEALTH_OK state? :)



# ceph health detail
HEALTH_WARN 2 large omap objects
2 large objects found in pool '.bbp-gva-master.rgw.buckets.index'
Search the cluster log for 'Large omap object found' for more details.

# ceph osd pool get .bbp-gva-master.rgw.buckets.index pg_num
pg_num: 64

# for i in `ceph pg ls-by-pool .bbp-gva-master.rgw.buckets.index | tail -n +2 | awk '{print $1}'`; do echo -n "$i: "; ceph pg $i query | grep num_large_omap_objects | head -1 | awk '{print $2}'; done | grep ": 1"

137.1b: 1
137.36: 1

# cat buckets
#!/bin/bash
buckets=`radosgw-admin metadata list bucket |grep \" | cut -d\" -f2`
for i in $buckets
do
  id=`radosgw-admin bucket stats --bucket $i |grep \"id\" | cut -d\" -f4`
  pg=`ceph osd map .bbp-gva-master.rgw.buckets.index ${id} | awk '{print $11}' | cut -d\( -f2 | cut -d\) -f1`

  echo "$i:$id:$pg"
done
# ./buckets > pglist
# egrep '137.1b|137.36' pglist |wc -l
192

The following doesn't appear to change anything

# for bucket in `cut -d: -f1 pglist`; do radosgw-admin reshard add --bucket $bucket --num-shards 8; done


# radosgw-admin reshard process
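
To see whether the reshard requests were actually queued and processed, something like the following should help (a sketch; both subcommands should be available on 12.2.8):

# radosgw-admin reshard list
# radosgw-admin reshard status --bucket <bucket>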



--
Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] slow requests and degraded cluster, but not really ?

2018-10-23 Thread Ben Morrice

Hello all,

We have an issue with our ceph cluster where 'ceph -s' shows that 
several requests are blocked, however querying further with 'ceph health 
detail' indicates that the PGs affected are either active+clean or do 
not currently exist.
OSD 32 appears to be working fine, and the cluster is performing as 
expected with no clients seemingly affected.


Note - we had just upgraded to Luminous - and despite having "mon max pg 
per osd = 400" set in ceph.conf, we still have the message "too many PGs 
per OSD (278 > max 200)"
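
A quick way to see what value the mons are actually enforcing at runtime (a sketch; assumes the Luminous option name mon_max_pg_per_osd, admin socket access on a mon host, and that the mon id matches the short hostname):

# ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd
# ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'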


In order to improve the situation above, I removed several pools that 
were not used anymore. I assume the PGs that ceph cannot find now are 
related to this pool deletion.


Does anyone have any ideas on how to get out of this state?

Details below - and full 'ceph health detail' attached to this email.

Kind regards,

Ben Morrice

[root@ceph03 ~]# ceph -s
  cluster:
    id: 6c21c4ba-9c4d-46ef-93a3-441b8055cdc6
    health: HEALTH_WARN
    Degraded data redundancy: 443765/14311983 objects degraded 
(3.101%), 162 pgs degraded, 241 pgs undersized

    75 slow requests are blocked > 32 sec. Implicated osds 32
    too many PGs per OSD (278 > max 200)

  services:
    mon: 5 daemons, quorum bbpocn01,bbpocn02,bbpocn03,bbpocn04,bbpocn07
    mgr: bbpocn03(active, starting)
    osd: 36 osds: 36 up, 36 in
    rgw: 1 daemon active

  data:
    pools:   24 pools, 3440 pgs
    objects: 4.77M objects, 7.69TiB
    usage:   23.1TiB used, 104TiB / 127TiB avail
    pgs: 443765/14311983 objects degraded (3.101%)
 3107 active+clean
 170  active+undersized
 109  active+undersized+degraded
 43   active+recovery_wait+degraded
 10   active+recovering+degraded
 1    active+recovery_wait

[root@ceph03 ~]# for i in `ceph health detail | grep stuck | awk '{print $2}'`; do echo -n "$i: "; ceph pg $i query -f plain | cut -d: -f2 | cut -d\" -f2; done

150.270: active+clean
150.2a0: active+clean
150.2b6: active+clean
150.2c2: active+clean
150.2cc: active+clean
150.2d5: active+clean
150.2d6: active+clean
150.2e1: active+clean
150.2ef: active+clean
150.2f5: active+clean
150.2f7: active+clean
150.2fc: active+clean
150.315: active+clean
150.318: active+clean
150.31a: active+clean
150.320: active+clean
150.326: active+clean
150.36e: active+clean
150.380: active+clean
150.389: active+clean
150.3a4: active+clean
150.3ad: active+clean
150.3b4: active+clean
150.3bb: active+clean
150.3ce: active+clean
150.3d0: active+clean
150.3d8: active+clean
150.3e0: active+clean
150.3f6: active+clean
165.24c: Error ENOENT: problem getting command descriptions from pg.165.24c
165.28f: Error ENOENT: problem getting command descriptions from pg.165.28f
165.2b3: Error ENOENT: problem getting command descriptions from pg.165.2b3
165.2b4: Error ENOENT: problem getting command descriptions from pg.165.2b4
165.2d6: Error ENOENT: problem getting command descriptions from pg.165.2d6
165.2f4: Error ENOENT: problem getting command descriptions from pg.165.2f4
165.2fd: Error ENOENT: problem getting command descriptions from pg.165.2fd
165.30f: Error ENOENT: problem getting command descriptions from pg.165.30f
165.322: Error ENOENT: problem getting command descriptions from pg.165.322
165.325: Error ENOENT: problem getting command descriptions from pg.165.325
165.334: Error ENOENT: problem getting command descriptions from pg.165.334
165.36e: Error ENOENT: problem getting command descriptions from pg.165.36e
165.37c: Error ENOENT: problem getting command descriptions from pg.165.37c
165.382: Error ENOENT: problem getting command descriptions from pg.165.382
165.387: Error ENOENT: problem getting command descriptions from pg.165.387
165.3af: Error ENOENT: problem getting command descriptions from pg.165.3af
165.3da: Error ENOENT: problem getting command descriptions from pg.165.3da
165.3e0: Error ENOENT: problem getting command descriptions from pg.165.3e0
165.3e2: Error ENOENT: problem getting command descriptions from pg.165.3e2
165.3e9: Error ENOENT: problem getting command descriptions from pg.165.3e9
165.3fb: Error ENOENT: problem getting command descriptions from pg.165.3fb

[root@ceph03 ~]# ceph pg 165.24c query
Error ENOENT: problem getting command descriptions from pg.165.24c
[root@ceph03 ~]# ceph pg 165.24c delete
Error ENOENT: problem getting command descriptions from pg.165.24c
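
A quick check that pool 165 really was one of the deleted pools (a sketch; the pool id is the numeric prefix of the PG ids above):

# ceph osd lspools
# ceph osd pool ls detail | grep "^pool 165 "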

--
Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

HEALTH_WARN Degraded data redundancy: 443765/14311983 objects degraded 
(3.101%), 162 pgs degraded, 241 pgs undersized; 75 slow requests are blocked > 
32 sec. Implicated osds 32; too many PGs per OSD (278 > max 200)
pg 150.270 is stuck undersized for 1871.987162, current state 

[ceph-users] Ceph re-ip of OSD node

2017-08-30 Thread Ben Morrice

Hello

We have a small cluster that we need to move to a different network in 
the same datacentre.


My workflow was the following (for a single OSD host), but I failed 
(further details below)


1) ceph osd set noout
2) stop ceph-osd processes
3) change IP, gateway, domain (short hostname is the same), VLAN
4) change references of OLD IP (cluster and public network) in 
/etc/ceph/ceph.conf with NEW IP (see [1])

5) start a single OSD process

This seems to work as the NEW IP can communicate with mon hosts and osd 
hosts on the OLD network, the OSD is booted and is visible via 'ceph -w' 
however after a few seconds the OSD drops with messages such as the 
below in it's log file


heartbeat_check: no reply from 10.1.1.100:6818 osd.14 ever on either 
front or back, first ping sent 2017-08-30 16:42:14.692210 (cutoff 
2017-08-30 16:42:24.962245)


There are logs like the above for every OSD server/process

and then eventually a

2017-08-30 16:42:14.486275 7f6d2c966700  0 log_channel(cluster) log 
[WRN] : map e85351 wrongly marked me down



Am I missing something obvious when reconfiguring the network on an OSD host?



[1]

OLD
[osd.0]
   host = sn01
   devs = /dev/sdi
   cluster addr = 10.1.1.101
   public addr = 10.1.1.101
NEW
[osd.0]
   host = sn01
   devs = /dev/sdi
   cluster addr = 10.1.2.101
   public addr = 10.1.2.101
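
For reference, a way to see which addresses the cluster map currently holds for the OSD (a sketch; using osd.0 from the example above):

# ceph osd find 0
# ceph osd dump | grep '^osd.0 '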

--
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW: Auth error with hostname instead of IP

2017-06-12 Thread Ben Morrice

Hello Eric,

You are probably hitting the git commits listed on this thread: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017731.html


If this is the same behaviour, your options are:

a) set all FQDNs inside the 'hostnames' array of your zonegroup(s) (rough sketch below)

or

b) remove 'rgw dns name' from your ceph.conf
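
For option a), a rough sketch of editing the zonegroup hostnames (review the JSON before applying, and adjust names to your own setup):

# radosgw-admin zonegroup get > zonegroup.json
  (add every FQDN clients will use to the "hostnames" array in zonegroup.json)
# radosgw-admin zonegroup set < zonegroup.json
# radosgw-admin period update --commit   (if running with a realm/period configuration)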

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 09/06/17 23:50, Eric Choi wrote:
When I send an RGW request with a hostname (with a port that is not 
80), I am seeing a "SignatureDoesNotMatch" error.


GET / HTTP/1.1
Host: cephrgw0002s2mdw1.sendgrid.net:50680

User-Agent: Minio (linux; amd64) minio-go/2.0.4 mc/2017-04-03T18:35:01Z
Authorization: AWS **REDACTED**:**REDACTED**


encoding="UTF-8"?>SignatureDoesNotMatchtx00093e0c1-00593b145c-996aae1-default996aae1-default-defaultmc: 



However this works fine when I send it with an IP address instead.  Is 
the hostname part of the signature?  If so, how can I make it so that 
it will work with hostname as well?



Thank you,


Eric



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prometheus RADOSGW usage exporter

2017-05-30 Thread Ben Morrice

Hello Berant,

This is very nice! I've had a play with this against our installation of 
Ceph, which is Kraken. We had to change the bucket_owner variable to be 
inside the for loop [1], and we are currently not getting any bytes 
sent/received statistics - though this is not an issue with your code, 
as these values are not updated via radosgw-admin either. I think I'm 
hitting this bug: http://tracker.ceph.com/issues/19194


[1] for bucket in entry['buckets']:
        print bucket
        bucket_owner = bucket['owner']

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 25/05/17 16:25, Berant Lemmenes wrote:

Hello all,

I've created prometheus exporter that scrapes the RADOSGW Admin Ops API and
exports the usage information for all users and buckets. This is my first
prometheus exporter so if anyone has feedback I'd greatly appreciate it.
I've tested it against Hammer, and will shortly test against Jewel; though
looking at the docs it should work fine for Jewel as well.

https://github.com/blemmenes/radosgw_usage_exporter


Sample output:
radosgw_usage_successful_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
1.0
radosgw_usage_successful_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
1094978.0
radosgw_usage_successful_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
2276.0
radosgw_usage_successful_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
1094978.0
radosgw_usage_successful_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
20.0
radosgw_usage_received_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
0.0
radosgw_usage_received_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
6352678.0
radosgw_usage_received_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
0.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="create_bucket",owner="testuser"}
19.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="delete_obj",owner="testuser"}
0.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="list_bucket",owner="testuser"}
638339458.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="put_obj",owner="testuser"}
79.0
radosgw_usage_sent_bytes_total{bucket="shard0",category="stat_bucket",owner="testuser"}
380.0
radosgw_usage_ops_total{bucket="shard0",category="create_bucket",owner="testuser"}
1.0
radosgw_usage_ops_total{bucket="shard0",category="delete_obj",owner="testuser"}
1094978.0
radosgw_usage_ops_total{bucket="shard0",category="list_bucket",owner="testuser"}
2276.0
radosgw_usage_ops_total{bucket="shard0",category="put_obj",owner="testuser"}
1094979.0
radosgw_usage_ops_total{bucket="shard0",category="stat_bucket",owner="testuser"}
20.0


Thanks,
Berant



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-28 Thread Ben Morrice

Hello again,

I can work around this issue. If the host header is an IP address, the 
request is treated as a virtual:


So if I auth to my backends via IP, things work as expected.

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 28/04/17 09:26, Ben Morrice wrote:

Hello Radek,

Thanks again for your analysis.

I can confirm that on 10.2.7, if I remove the conf "rgw dns name", I can 
auth directly to the radosgw host.


In our environment we terminate SSL and route connections via haproxy, 
but it's still sometimes useful to be able to communicate directly to 
the backend radosgw server.


It seems that it's not possible to set multiple "rgw dns name" entries 
in ceph.conf


Is the only solution to modify the zonegroup and populate the 
'hostnames' array with all backend server hostnames as well as the 
hostname terminated by haproxy?


Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 13:53, Radoslaw Zarzynski wrote:

Bingo! From the 10.2.5-admin:

   GET

   Thu, 27 Apr 2017 07:49:59 GMT
   /

And also:

   2017-04-27 09:49:59.117447 7f4a90ff9700 20 subdomain= domain=
in_hosted_domain=0 in_hosted_domain_s3website=0
   2017-04-27 09:49:59.117449 7f4a90ff9700 20 final domain/bucket
subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
s->info.domain= s->info.request_uri=/

The most interesting part is the "final ... in_hosted_domain=0".
It looks we need to dig around RGWREST::preprocess(),
rgw_find_host_in_domains() & company.

There is a commit introduced in v10.2.6 that touches this area [1].
I'm definitely not saying it's the root cause. It might be that a change
in the code just unhidden a configuration issue [2].

I will talk about the problem on the today's sync-up.

Thanks for the logs!
Regards,
Radek

[1] 
https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0

[2] http://tracker.ceph.com/issues/17440

On Thu, Apr 27, 2017 at 10:11 AM, Ben Morrice <ben.morr...@epfl.ch> 
wrote:

Hello Radek,

Thank-you for your analysis so far! Please find attached logs for 
both the
admin user and a keystone backed user from 10.2.5 (same host as 
before, I
have simply downgraded the packages). Both users can authenticate 
and list

buckets on 10.2.5.

Also - I tried version 10.2.6 and see the same behavior as 10.2.7, 
so the

bug i'm hitting looks like it was introduced in 10.2.6

Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 04:45, Radoslaw Zarzynski wrote:

Thanks for the logs, Ben.

It looks that two completely different authenticators have failed:
the local, RADOS-backed auth (admin.txt) and Keystone-based
one as well. In the second case I'm pretty sure that Keystone has
rejected [1][2] to authenticate provided signature/StringToSign.
RGW tried to fallback to the local auth which obviously didn't have
any chance as the credentials were stored remotely. This explains
the presence of "error reading user info" in the user-keystone.txt.

What is common for both scenarios are the low-level things related
to StringToSign crafting/signature generation at RadosGW's side.
Following one has been composed for the request from admin.txt:

GET


Wed, 26 Apr 2017 09:18:42 GMT
/bbpsrvc15.cscs.ch/

If you could provide a similar log from v10.2.5, I would be really
grateful.

Regards,
Radek

[1]
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272 

[2] 
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170


On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben <ben.morr...@epfl.ch> 
wrote:

Hello Radek,

Please find attached the failed request for both the admin user and a
standard user (backed by keystone).

Kind regards,

Ben Morrice

__________ 


Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland


From: Radoslaw Zarzynski <rzarzyn...@mirantis.com>
Sent: Tuesday, April 25, 2017 7:38 PM
To: Morrice Ben
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

Hello Ben,

Could you provide full RadosGW's log for the failed request?
I mean the lines starting from header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details relat

Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-28 Thread Ben Morrice

Hello Radek,

Thanks again for your analysis.

I can confirm that on 10.2.7, if I remove the conf "rgw dns name", I can 
auth directly to the radosgw host.


In our environment we terminate SSL and route connections via haproxy, 
but it's still sometimes useful to be able to communicate directly to 
the backend radosgw server.


It seems that it's not possible to set multiple "rgw dns name" entries 
in ceph.conf


Is the only solution to modify the zonegroup and populate the 
'hostnames' array with all backend server hostnames as well as the 
hostname terminated by haproxy?


Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 13:53, Radoslaw Zarzynski wrote:

Bingo! From the 10.2.5-admin:

   GET

   Thu, 27 Apr 2017 07:49:59 GMT
   /

And also:

   2017-04-27 09:49:59.117447 7f4a90ff9700 20 subdomain= domain=
in_hosted_domain=0 in_hosted_domain_s3website=0
   2017-04-27 09:49:59.117449 7f4a90ff9700 20 final domain/bucket
subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
s->info.domain= s->info.request_uri=/

The most interesting part is the "final ... in_hosted_domain=0".
It looks we need to dig around RGWREST::preprocess(),
rgw_find_host_in_domains() & company.

There is a commit introduced in v10.2.6 that touches this area [1].
I'm definitely not saying it's the root cause. It might be that a change
in the code just unhidden a configuration issue [2].

I will talk about the problem on the today's sync-up.

Thanks for the logs!
Regards,
Radek

[1] https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0
[2] http://tracker.ceph.com/issues/17440

On Thu, Apr 27, 2017 at 10:11 AM, Ben Morrice <ben.morr...@epfl.ch> wrote:

Hello Radek,

Thank-you for your analysis so far! Please find attached logs for both the
admin user and a keystone backed user from 10.2.5 (same host as before, I
have simply downgraded the packages). Both users can authenticate and list
buckets on 10.2.5.

Also - I tried version 10.2.6 and see the same behavior as 10.2.7, so the
bug i'm hitting looks like it was introduced in 10.2.6

Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 04:45, Radoslaw Zarzynski wrote:

Thanks for the logs, Ben.

It looks that two completely different authenticators have failed:
the local, RADOS-backed auth (admin.txt) and Keystone-based
one as well. In the second case I'm pretty sure that Keystone has
rejected [1][2] to authenticate provided signature/StringToSign.
RGW tried to fallback to the local auth which obviously didn't have
any chance as the credentials were stored remotely. This explains
the presence of "error reading user info" in the user-keystone.txt.

What is common for both scenarios are the low-level things related
to StringToSign crafting/signature generation at RadosGW's side.
Following one has been composed for the request from admin.txt:

GET


Wed, 26 Apr 2017 09:18:42 GMT
/bbpsrvc15.cscs.ch/

If you could provide a similar log from v10.2.5, I would be really
grateful.

Regards,
Radek

[1]
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272
[2] https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170

On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben <ben.morr...@epfl.ch> wrote:

Hello Radek,

Please find attached the failed request for both the admin user and a
standard user (backed by keystone).

Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland


From: Radoslaw Zarzynski <rzarzyn...@mirantis.com>
Sent: Tuesday, April 25, 2017 7:38 PM
To: Morrice Ben
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

Hello Ben,

Could you provide full RadosGW's log for the failed request?
I mean the lines starting from header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details related to the signature
calculation.

Regards,
Radek

On Thu, Apr 20, 2017 at 5:08 PM, Ben Morrice <ben.morr...@epfl.ch> wrote:

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7
(RHEL7)
and authentication is in a very bad state. This installation is part of
a
multigw configuration, and I have just updated one host in the secondary
zone (all other hosts/zones are running 10.2.5).

On the 10.2.7 server I cannot authenticate as a u

Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-27 Thread Ben Morrice

Hello Radek,

Thank-you for your analysis so far! Please find attached logs for both 
the admin user and a keystone backed user from 10.2.5 (same host as 
before, I have simply downgraded the packages). Both users can 
authenticate and list buckets on 10.2.5.


Also - I tried version 10.2.6 and see the same behavior as 10.2.7, so 
the bug i'm hitting looks like it was introduced in 10.2.6


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 04:45, Radoslaw Zarzynski wrote:

Thanks for the logs, Ben.

It looks that two completely different authenticators have failed:
the local, RADOS-backed auth (admin.txt) and Keystone-based
one as well. In the second case I'm pretty sure that Keystone has
rejected [1][2] to authenticate provided signature/StringToSign.
RGW tried to fallback to the local auth which obviously didn't have
any chance as the credentials were stored remotely. This explains
the presence of "error reading user info" in the user-keystone.txt.

What is common for both scenarios are the low-level things related
to StringToSign crafting/signature generation at RadosGW's side.
Following one has been composed for the request from admin.txt:

   GET


   Wed, 26 Apr 2017 09:18:42 GMT
   /bbpsrvc15.cscs.ch/

If you could provide a similar log from v10.2.5, I would be really grateful.

Regards,
Radek

[1] https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272
[2] https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170

On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben <ben.morr...@epfl.ch> wrote:

Hello Radek,

Please find attached the failed request for both the admin user and a standard 
user (backed by keystone).

Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland


From: Radoslaw Zarzynski <rzarzyn...@mirantis.com>
Sent: Tuesday, April 25, 2017 7:38 PM
To: Morrice Ben
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

Hello Ben,

Could you provide full RadosGW's log for the failed request?
I mean the lines starting from header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details related to the signature
calculation.

Regards,
Radek

On Thu, Apr 20, 2017 at 5:08 PM, Ben Morrice <ben.morr...@epfl.ch> wrote:

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7 (RHEL7)
and authentication is in a very bad state. This installation is part of a
multigw configuration, and I have just updated one host in the secondary
zone (all other hosts/zones are running 10.2.5).

On the 10.2.7 server I cannot authenticate as a user (normally backed by
OpenStack Keystone), but even worse I can also not authenticate with an
admin user.

Please see [1] for the results of performing a list bucket operation with
python boto (script works against rgw 10.2.5)

Also, if I try to authenticate from the 'master' rgw zone with a
"radosgw-admin sync status --rgw-zone=bbp-gva-master" I get:

"ERROR: failed to fetch datalog info"

"failed to retrieve sync info: (13) Permission denied"

The above errors correlates to the errors in the log on the server running
10.2.7 (debug level 20) at [2]

I'm not sure what I have done wrong or can try next?

By the way, downgrading the packages from 10.2.7 to 10.2.5 returns
authentication functionality

[1]
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?><Error><Code>SignatureDoesNotMatch</Code><RequestId>tx4-0058f8c86a-3fa2959-bbp-gva-secondary</RequestId><HostId>3fa2959-bbp-gva-secondary-bbp-gva</HostId></Error>

[2]
/bbpsrvc15.cscs.ch/admin/log
2017-04-20 16:43:04.916253 7ff87c6c0700 15 calculated
digest=Ofg/f/NI0L4eEG1MsGk4PsVscTM=
2017-04-20 16:43:04.916255 7ff87c6c0700 15
auth_sign=qZ3qsy7AuNCOoPMhr8yNoy5qMKU=
2017-04-20 16:43:04.916255 7ff87c6c0700 15 compare=34
2017-04-20 16:43:04.916266 7ff87c6c0700 10 failed to authorize request
2017-04-20 16:43:04.916268 7ff87c6c0700 20 handler->ERRORHANDLER:
err_no=-2027 new_err_no=-2027
2017-04-20 16:43:04.916329 7ff87c6c0700  2 req 354:0.052585:s3:GET
/admin/log:get_obj:op status=0
2017-04-20 16:43:04.916339 7ff87c6c0700  2 req 354:0.052595:s3:GET
/admin/log:get_obj:http status=403
2017-04-20 16:43:04.916343 7ff87c6c0700  1 == req done
req=0x7ff87c6ba710 op status=0 http_status=403 ==
2017-04-20 16:43:04.916350 7ff87c6c0700 20 process_request() returned -2027
2017-04-20 16:43:04.916390 7ff87c6c0700  1 civetweb: 0x7ff990015610:
10.80.6.26 - - [20/Apr/2017:16:43:04 +0200] "GET /admin/log HTTP/1.1" 403 0
- -
2017-04-20 16:43:

Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-24 Thread Ben Morrice

Hello Orit,

Could it be that something has changed in 10.2.5+ which is related to 
reading the endpoints from the zone/period config?


In my master zone I have specified the endpoint with a trailing 
backslash (which is also escaped), however I do not define the secondary 
endpoint this way. Am I hitting a bug here?


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 21/04/17 09:36, Ben Morrice wrote:

Hello Orit,

Please find attached the output from the radosgw commands and the 
relevant section from ceph.conf (radosgw)


bbp-gva-master is running 10.2.5

bbp-gva-secondary is running 10.2.7

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 21/04/17 07:55, Orit Wasserman wrote:

Hi Ben,

On Thu, Apr 20, 2017 at 6:08 PM, Ben Morrice <ben.morr...@epfl.ch> 
wrote:

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7 
(RHEL7)
and authentication is in a very bad state. This installation is part 
of a
multigw configuration, and I have just updated one host in the 
secondary

zone (all other hosts/zones are running 10.2.5).

On the 10.2.7 server I cannot authenticate as a user (normally 
backed by

OpenStack Keystone), but even worse I can also not authenticate with an
admin user.

Please see [1] for the results of performing a list bucket operation 
with

python boto (script works against rgw 10.2.5)

Also, if I try to authenticate from the 'master' rgw zone with a
"radosgw-admin sync status --rgw-zone=bbp-gva-master" I get:

"ERROR: failed to fetch datalog info"

"failed to retrieve sync info: (13) Permission denied"

The above errors correlates to the errors in the log on the server 
running

10.2.7 (debug level 20) at [2]

I'm not sure what I have done wrong or can try next?

By the way, downgrading the packages from 10.2.7 to 10.2.5 returns
authentication functionality

Can you provide the following info:
radosgw-admin period get
radsogw-admin zonegroup get
radsogw-admin zone get

Can you provide your ceph.conf?

Thanks,
Orit


[1]
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
encoding="UTF-8"?>SignatureDoesNotMatchtx4-0058f8c86a-3fa2959-bbp-gva-secondary3fa2959-bbp-gva-secondary-bbp-gva 



[2]
/bbpsrvc15.cscs.ch/admin/log
2017-04-20 16:43:04.916253 7ff87c6c0700 15 calculated
digest=Ofg/f/NI0L4eEG1MsGk4PsVscTM=
2017-04-20 16:43:04.916255 7ff87c6c0700 15
auth_sign=qZ3qsy7AuNCOoPMhr8yNoy5qMKU=
2017-04-20 16:43:04.916255 7ff87c6c0700 15 compare=34
2017-04-20 16:43:04.916266 7ff87c6c0700 10 failed to authorize request
2017-04-20 16:43:04.916268 7ff87c6c0700 20 handler->ERRORHANDLER:
err_no=-2027 new_err_no=-2027
2017-04-20 16:43:04.916329 7ff87c6c0700  2 req 354:0.052585:s3:GET
/admin/log:get_obj:op status=0
2017-04-20 16:43:04.916339 7ff87c6c0700  2 req 354:0.052595:s3:GET
/admin/log:get_obj:http status=403
2017-04-20 16:43:04.916343 7ff87c6c0700  1 == req done
req=0x7ff87c6ba710 op status=0 http_status=403 ==
2017-04-20 16:43:04.916350 7ff87c6c0700 20 process_request() 
returned -2027

2017-04-20 16:43:04.916390 7ff87c6c0700  1 civetweb: 0x7ff990015610:
10.80.6.26 - - [20/Apr/2017:16:43:04 +0200] "GET /admin/log 
HTTP/1.1" 403 0

- -
2017-04-20 16:43:04.917212 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff9703a5440:18RGWMetaSyncShardCR: operate()
2017-04-20 16:43:04.917223 7ff9777e6700 20 rgw meta sync:
incremental_sync:1544: shard_id=20
mdlog_marker=1_1492686039.901886_5551978.1
sync_marker.marker=1_1492686039.901886_5551978.1 period_marker=
2017-04-20 16:43:04.917227 7ff9777e6700 20 rgw meta sync:
incremental_sync:1551: shard_id=20 syncing mdlog for shard_id=20
2017-04-20 16:43:04.917236 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: 
operate()

2017-04-20 16:43:04.917238 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: init request
2017-04-20 16:43:04.917240 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: 
operate()

2017-04-20 16:43:04.917241 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: reading shard status
2017-04-20 16:43:04.917303 7ff9777e6700 20 run: stack=0x7ff97000d420 
is io

blocked
2017-04-20 16:43:04.918285 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: 
operate()

2017-04-20 16:43:04.918295 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: reading shard status complete
2017-04-20 16:43:04.918307 7ff9777e6700 20 rgw meta sync: shard_id=20
marker=1_1492686039.901886_5551978.1 last_update=2017-04-20
13:00:39.0.901886s
2017-04-20 16:43:04.918316 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff9700

Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-21 Thread Ben Morrice

Hello Orit,

Please find attached the output from the radosgw commands and the 
relevant section from ceph.conf (radosgw)


bbp-gva-master is running 10.2.5

bbp-gva-secondary is running 10.2.7

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 21/04/17 07:55, Orit Wasserman wrote:

Hi Ben,

On Thu, Apr 20, 2017 at 6:08 PM, Ben Morrice <ben.morr...@epfl.ch> wrote:

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7 (RHEL7)
and authentication is in a very bad state. This installation is part of a
multigw configuration, and I have just updated one host in the secondary
zone (all other hosts/zones are running 10.2.5).

On the 10.2.7 server I cannot authenticate as a user (normally backed by
OpenStack Keystone), but even worse I can also not authenticate with an
admin user.

Please see [1] for the results of performing a list bucket operation with
python boto (script works against rgw 10.2.5)

Also, if I try to authenticate from the 'master' rgw zone with a
"radosgw-admin sync status --rgw-zone=bbp-gva-master" I get:

"ERROR: failed to fetch datalog info"

"failed to retrieve sync info: (13) Permission denied"

The above errors correlates to the errors in the log on the server running
10.2.7 (debug level 20) at [2]

I'm not sure what I have done wrong or can try next?

By the way, downgrading the packages from 10.2.7 to 10.2.5 returns
authentication functionality

Can you provide the following info:
radosgw-admin period get
radsogw-admin zonegroup get
radsogw-admin zone get

Can you provide your ceph.conf?

Thanks,
Orit


[1]
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?><Error><Code>SignatureDoesNotMatch</Code><RequestId>tx4-0058f8c86a-3fa2959-bbp-gva-secondary</RequestId><HostId>3fa2959-bbp-gva-secondary-bbp-gva</HostId></Error>

[2]
/bbpsrvc15.cscs.ch/admin/log
2017-04-20 16:43:04.916253 7ff87c6c0700 15 calculated
digest=Ofg/f/NI0L4eEG1MsGk4PsVscTM=
2017-04-20 16:43:04.916255 7ff87c6c0700 15
auth_sign=qZ3qsy7AuNCOoPMhr8yNoy5qMKU=
2017-04-20 16:43:04.916255 7ff87c6c0700 15 compare=34
2017-04-20 16:43:04.916266 7ff87c6c0700 10 failed to authorize request
2017-04-20 16:43:04.916268 7ff87c6c0700 20 handler->ERRORHANDLER:
err_no=-2027 new_err_no=-2027
2017-04-20 16:43:04.916329 7ff87c6c0700  2 req 354:0.052585:s3:GET
/admin/log:get_obj:op status=0
2017-04-20 16:43:04.916339 7ff87c6c0700  2 req 354:0.052595:s3:GET
/admin/log:get_obj:http status=403
2017-04-20 16:43:04.916343 7ff87c6c0700  1 == req done
req=0x7ff87c6ba710 op status=0 http_status=403 ==
2017-04-20 16:43:04.916350 7ff87c6c0700 20 process_request() returned -2027
2017-04-20 16:43:04.916390 7ff87c6c0700  1 civetweb: 0x7ff990015610:
10.80.6.26 - - [20/Apr/2017:16:43:04 +0200] "GET /admin/log HTTP/1.1" 403 0
- -
2017-04-20 16:43:04.917212 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff9703a5440:18RGWMetaSyncShardCR: operate()
2017-04-20 16:43:04.917223 7ff9777e6700 20 rgw meta sync:
incremental_sync:1544: shard_id=20
mdlog_marker=1_1492686039.901886_5551978.1
sync_marker.marker=1_1492686039.901886_5551978.1 period_marker=
2017-04-20 16:43:04.917227 7ff9777e6700 20 rgw meta sync:
incremental_sync:1551: shard_id=20 syncing mdlog for shard_id=20
2017-04-20 16:43:04.917236 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.917238 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: init request
2017-04-20 16:43:04.917240 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.917241 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: reading shard status
2017-04-20 16:43:04.917303 7ff9777e6700 20 run: stack=0x7ff97000d420 is io
blocked
2017-04-20 16:43:04.918285 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.918295 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: reading shard status complete
2017-04-20 16:43:04.918307 7ff9777e6700 20 rgw meta sync: shard_id=20
marker=1_1492686039.901886_5551978.1 last_update=2017-04-20
13:00:39.0.901886s
2017-04-20 16:43:04.918316 7ff9777e6700 20
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.918317 7ff9777e6700 20 rgw meta sync: operate:
shard_id=20: sending rest request
2017-04-20 16:43:04.918381 7ff9777e6700 20 RGWEnv::set(): HTTP_DATE: Thu Apr
20 14:43:04 2017
2017-04-20 16:43:04.918390 7ff9777e6700 20 > HTTP_DATE -> Thu Apr 20
14:43:04 2017
2017-04-20 16:43:04.918404 7ff9777e6700 10 get_canon_resource():
dest=/admin/log
2017-04-20 16:43:04.918406 7ff9777e6700 10 generated canonical header: GET

--
Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
E

[ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-20 Thread Ben Morrice

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7 
(RHEL7) and authentication is in a very bad state. This installation is 
part of a multigw configuration, and I have just updated one host in the 
secondary zone (all other hosts/zones are running 10.2.5).


On the 10.2.7 server I cannot authenticate as a user (normally backed by 
OpenStack Keystone), but even worse I can also not authenticate with an 
admin user.


Please see [1] for the results of performing a list bucket operation 
with python boto (script works against rgw 10.2.5)


Also, if I try to authenticate from the 'master' rgw zone with a 
"radosgw-admin sync status --rgw-zone=bbp-gva-master" I get:


"ERROR: failed to fetch datalog info"

"failed to retrieve sync info: (13) Permission denied"

The above errors correlates to the errors in the log on the server 
running 10.2.7 (debug level 20) at [2]


I'm not sure what I have done wrong or can try next?

By the way, downgrading the packages from 10.2.7 to 10.2.5 returns 
authentication functionality


[1]
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
encoding="UTF-8"?>SignatureDoesNotMatchtx4-0058f8c86a-3fa2959-bbp-gva-secondary3fa2959-bbp-gva-secondary-bbp-gva


[2]
/bbpsrvc15.cscs.ch/admin/log
2017-04-20 16:43:04.916253 7ff87c6c0700 15 calculated 
digest=Ofg/f/NI0L4eEG1MsGk4PsVscTM=
2017-04-20 16:43:04.916255 7ff87c6c0700 15 
auth_sign=qZ3qsy7AuNCOoPMhr8yNoy5qMKU=

2017-04-20 16:43:04.916255 7ff87c6c0700 15 compare=34
2017-04-20 16:43:04.916266 7ff87c6c0700 10 failed to authorize request
2017-04-20 16:43:04.916268 7ff87c6c0700 20 handler->ERRORHANDLER: 
err_no=-2027 new_err_no=-2027
2017-04-20 16:43:04.916329 7ff87c6c0700  2 req 354:0.052585:s3:GET 
/admin/log:get_obj:op status=0
2017-04-20 16:43:04.916339 7ff87c6c0700  2 req 354:0.052595:s3:GET 
/admin/log:get_obj:http status=403
2017-04-20 16:43:04.916343 7ff87c6c0700  1 == req done 
req=0x7ff87c6ba710 op status=0 http_status=403 ==

2017-04-20 16:43:04.916350 7ff87c6c0700 20 process_request() returned -2027
2017-04-20 16:43:04.916390 7ff87c6c0700  1 civetweb: 0x7ff990015610: 
10.80.6.26 - - [20/Apr/2017:16:43:04 +0200] "GET /admin/log HTTP/1.1" 
403 0 - -
2017-04-20 16:43:04.917212 7ff9777e6700 20 
cr:s=0x7ff97000d420:op=0x7ff9703a5440:18RGWMetaSyncShardCR: operate()
2017-04-20 16:43:04.917223 7ff9777e6700 20 rgw meta sync: 
incremental_sync:1544: shard_id=20 
mdlog_marker=1_1492686039.901886_5551978.1 
sync_marker.marker=1_1492686039.901886_5551978.1 period_marker=
2017-04-20 16:43:04.917227 7ff9777e6700 20 rgw meta sync: 
incremental_sync:1551: shard_id=20 syncing mdlog for shard_id=20
2017-04-20 16:43:04.917236 7ff9777e6700 20 
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.917238 7ff9777e6700 20 rgw meta sync: operate: 
shard_id=20: init request
2017-04-20 16:43:04.917240 7ff9777e6700 20 
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.917241 7ff9777e6700 20 rgw meta sync: operate: 
shard_id=20: reading shard status
2017-04-20 16:43:04.917303 7ff9777e6700 20 run: stack=0x7ff97000d420 is 
io blocked
2017-04-20 16:43:04.918285 7ff9777e6700 20 
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.918295 7ff9777e6700 20 rgw meta sync: operate: 
shard_id=20: reading shard status complete
2017-04-20 16:43:04.918307 7ff9777e6700 20 rgw meta sync: shard_id=20 
marker=1_1492686039.901886_5551978.1 last_update=2017-04-20 
13:00:39.0.901886s
2017-04-20 16:43:04.918316 7ff9777e6700 20 
cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
2017-04-20 16:43:04.918317 7ff9777e6700 20 rgw meta sync: operate: 
shard_id=20: sending rest request
2017-04-20 16:43:04.918381 7ff9777e6700 20 RGWEnv::set(): HTTP_DATE: Thu 
Apr 20 14:43:04 2017
2017-04-20 16:43:04.918390 7ff9777e6700 20 > HTTP_DATE -> Thu Apr 20 
14:43:04 2017
2017-04-20 16:43:04.918404 7ff9777e6700 10 get_canon_resource(): 
dest=/admin/log

2017-04-20 16:43:04.918406 7ff9777e6700 10 generated canonical header: GET

--
Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph pg inconsistencies - omap data lost

2017-04-04 Thread Ben Morrice

Hi all,

We have a weird issue with a few inconsistent PGs. We are running ceph 
11.2 on RHEL7.


As an example inconsistent PG we have:

# rados -p volumes list-inconsistent-obj 4.19
{"epoch":83986,"inconsistents":[{"object":{"name":"rbd_header.08f7fa43a49c7f","nspace":"","locator":"","snap":"head","version":28785242},"errors":[],"union_shard_errors":["omap_digest_mismatch_oi"],"selected_object_info":"4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 
0])","shards":[{"osd":10,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":20,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":29,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"}]}]}


If I try to repair this PG, I get the following in the OSD logs:

2017-04-04 14:31:37.825833 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 10: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825863 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 20: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825870 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 29: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825877 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head: failed to 
pick suitable auth object
2017-04-04 14:32:37.926980 7f2d7cffd700 -1 log_channel(cluster) log 
[ERR] : 4.19 deep-scrub 3 errors


If I list the omapvalues, they are null

# rados -p volumes listomapvals rbd_header.08f7fa43a49c7f |wc -l
0


If I list the extended attributes on the filesystem of each OSD that 
hosts this file, they are indeed empty (all 3 OSDs are the same, but 
just listing one for brevity)


getfattr 
/var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\\uheader.08f7fa43a49c7f__head_6C8FC219__4

getfattr: Removing leading '/' from absolute path names
# file: 
var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\134uheader.08f7fa43a49c7f__head_6C8FC219__4

user.ceph._
user.ceph._@1
user.ceph._lock.rbd_lock
user.ceph.snapset
user.cephos.spill_out


Is there anything I can do to recover from this situation?


--
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in radosgw

2016-10-21 Thread Ben Morrice
What version of libcurl are you using?

I was hitting this bug with RHEL7/libcurl 7.29 which could also be your
catalyst.

http://tracker.ceph.com/issues/15915
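
A quick way to check which libcurl radosgw is actually linked against (a sketch for RHEL/CentOS):

# rpm -q libcurl
# ldd $(which radosgw) | grep -i curl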

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 20/10/16 21:41, Trey Palmer wrote:
> I've been trying to test radosgw multisite and have a pretty bad memory
> leak.It appears to be associated only with multisite sync.
>
> Multisite works well for a small numbers of objects.However, it all
> fell over when I wrote in 8M 64K objects to two buckets overnight for
> testing (via cosbench).
>
> The leak appears to happen on the multisite transfer source -- that is, the
> node where the objects were written originally.   The radosgw process
> eventually dies, I'm sure via the OOM killer, and systemd restarts it.
> Then repeat, though multisite sync pretty much stops at that point.
>
> I have tried 10.2.2, 10.2.3 and a combination of the two.   I'm running on
> CentOS 7.2, using civetweb with SSL.   I saw that the memory profiler only
> works on mon, osd and mds processes.
>
> Anyone else seen anything like this?
>
>-- Trey
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW multisite replication failures

2016-09-28 Thread Ben Morrice
14 7f845b7fe700  5 bucket sync: sync obj:
bbp-gva-master/20160928(@{i=.bbp-gva-secondary.rgw.buckets.index,e=.bbp-gva-secondary.rgw.buckets.extra}.bbp-gva-secondary.rgw.buckets[bbp-gva-master.106061599.1])/20160928-1mb-testfile[null][0]
2016-09-28 16:19:01.969017 7f845b7fe700  5
Sync:bbp-gva-:data:Object:20160928:bbp-gva-master.106061599.1/20160928-1mb-testfile[null][0]:fetch
2016-09-28 16:19:01.969363 7f84913f6700 20 get_obj_state:
rctx=0x7f84913f46a0 obj=20160928:20160928-1mb-testfile
state=0x7f844c17f348 s->prefetch_data=0
2016-09-28 16:19:01.970699 7f84913f6700 10 get_canon_resource():
dest=/20160928/20160928-1mb-testfile?versionId=null
/20160928/20160928-1mb-testfile?versionId=null
2016-09-28 16:19:01.970882 7f84913f6700 20 sending request to
https://bbpobjectstorage.epfl.ch:443/20160928/20160928-1mb-testfile?rgwx-zonegroup=bbp-gva=bbp-gva=null
2016-09-28 16:19:02.087169 7f84913f6700 10 received
header:x-amz-meta-orig-filename: 20160928-1mb-testfile
2016-09-28 16:19:02.156463 7f845b7fe700  5
Sync:bbp-gva-:data:Object:20160928:bbp-gva-master.106061599.1/20160928-1mb-testfile[null][0]:done,
retcode=-5
2016-09-28 16:19:02.156467 7f845b7fe700  0 ERROR: failed to sync object:
20160928:bbp-gva-master.106061599.1/20160928-1mb-testfile
2016-09-28 16:19:02.160115 7f845b7fe700  5
Sync:bbp-gva-:data:Object:20160928:bbp-gva-master.106061599.1/20160928-1mb-testfile[null][0]:finish
2016-09-28 16:19:02.163101 7f845b7fe700  5
Sync:bbp-gva-:data:BucketFull:20160928:bbp-gva-master.106061599.1:finish
2016-09-28 16:19:02.163108 7f845b7fe700  5 full sync on
20160928:bbp-gva-master.106061599.1 failed, retcode=-5
2016-09-28 16:19:02.163111 7f845b7fe700  5
Sync:bbp-gva-:data:Bucket:20160928:bbp-gva-master.106061599.1:finish


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/09/16 09:36, Ben Morrice wrote:
> Hello Orit,
>
> Yes, this bug looks to correlate. Was this included in 10.2.3?
>
> I guess not as I have since updated to 10.2.3 but getting the same errors
>
> This bug talks about not retrying after a failure, however do you know
> why the sync fails in the first place? It seems that basically any
> object over 500k in size fails :(
>
> Kind regards,
>
> Ben Morrice
>
> __
> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> EPFL ENT CBS BBP
> Biotech Campus
> Chemin des Mines 9
> 1202 Geneva
> Switzerland
>
> On 23/09/16 16:52, Orit Wasserman wrote:
>> Hi Ben,
>> It seems to be http://tracker.ceph.com/issues/16742.
>> It is being backported to jewel http://tracker.ceph.com/issues/16794,
>> you can try apply it and see if it helps you.
>>
>> Regards,
>> Orit
>>
>> On Fri, Sep 23, 2016 at 9:21 AM, Ben Morrice <ben.morr...@epfl.ch> wrote:
>>> Hello all,
>>>
>>> I have two separate ceph (10.2.2) clusters and have configured multisite
>>> replication between the two. I can see some buckets get synced, however
>>> others do not.
>>>
>>> Both clusters are RHEL7, and I have upgraded libcurl from 7.29 to 7.50
>>> (to avoid http://tracker.ceph.com/issues/15915).
>>>
>>> Below is some debug output on the 'secondary' zone (bbp-gva-secondary)
>>> after uploading a file to the bucket 'bentest1' from onto the master
>>> zone (bbp-gva-master).
>>>
>>> This appears to be happening very frequently. The size of my bucket
>>> pool in the master is ~120GB, however on the secondary site it's only
>>> 5GB so things are not very happy at the moment.
>>>
>>> What steps can I take to work out why RGW cannot create a lock in the
>>> log pool?
>>>
>>> Is there a way to force a full sync, starting fresh (the secondary site
>>> is not advertised to users, thus it's okay to even clean pools to start
>>> again)?
>>>
>>>
>>> 2016-09-23 09:03:28.498292 7f992e664700 20 execute(): read data:
>>> [{"key":6,"val":["bentest1:bbp-gva-master.85732351.16:-1"]}]
>>> 2016-09-23 09:03:28.498453 7f992e664700 20 execute(): modified
>>> key=bentest1:bbp-gva-master.85732351.16:-1
>>> 2016-09-23 09:03:28.498456 7f992e664700 20 wakeup_data_sync_shards:
>>> source_zone=bbp-gva-master,
>>> shard_ids={6=bentest1:bbp-gva-master.85732351.16:-1}
>>> 2016-09-23 09:03:28.498547 7f9a72ffd700 20 incremental_sync(): async
>>> update notification: bentest1:bbp-gva-master.85732351.16:-1
>>> 2016-09-23 09:03:28.499137 7f9a7dffb700 20 get_system_obj_

Re: [ceph-users] RGW multisite replication failures

2016-09-27 Thread Ben Morrice
Hello Orit,

Yes, this bug looks to correlate. Was this included in 10.2.3?

I guess not, as I have since updated to 10.2.3 but am still getting the same errors.

This bug talks about not retrying after a failure; however, do you know
why the sync fails in the first place? It seems that basically any
object over 500k in size fails :(
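
One way to pin down the size threshold is to upload test objects just below
and above ~500k on the master side and watch whether they appear on the
secondary. A rough sketch (it assumes an S3 client such as s3cmd is already
configured against the master zone endpoint, and uses the pool names from the
debug output quoted below):

# dd if=/dev/urandom of=/tmp/obj_400k bs=1k count=400
# dd if=/dev/urandom of=/tmp/obj_600k bs=1k count=600
# s3cmd put /tmp/obj_400k s3://bentest1/obj_400k
# s3cmd put /tmp/obj_600k s3://bentest1/obj_600k
then, on the secondary:
# rados -p .bbp-gva-secondary.rgw.buckets ls | grep -e obj_400k -e obj_600k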

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 23/09/16 16:52, Orit Wasserman wrote:
> Hi Ben,
> It seems to be http://tracker.ceph.com/issues/16742.
> It is being backported to jewel http://tracker.ceph.com/issues/16794,
> you can try applying it and see if it helps you.
>
> Regards,
> Orit
>
> On Fri, Sep 23, 2016 at 9:21 AM, Ben Morrice <ben.morr...@epfl.ch> wrote:
>> Hello all,
>>
>> I have two separate ceph (10.2.2) clusters and have configured multisite
>> replication between the two. I can see some buckets get synced; however,
>> others do not.
>>
>> Both clusters are RHEL7, and I have upgraded libcurl from 7.29 to 7.50
>> (to avoid http://tracker.ceph.com/issues/15915).
>>
>> Below is some debug output on the 'secondary' zone (bbp-gva-secondary)
>> after uploading a file to the bucket 'bentest1' on the master
>> zone (bbp-gva-master).
>>
>> This appears to be happening very frequently. The size of my bucket
>> pool on the master is ~120GB; however, on the secondary site it's only
>> 5GB, so things are not very happy at the moment.
>>
>> What steps can I take to work out why RGW cannot create a lock in the
>> log pool?
>>
>> Is there a way to force a full sync, starting fresh (the secondary site
>> is not advertised to users, thus it's okay to even clean pools to start
>> again)?
>>
>>
>> 2016-09-23 09:03:28.498292 7f992e664700 20 execute(): read data:
>> [{"key":6,"val":["bentest1:bbp-gva-master.85732351.16:-1"]}]
>> 2016-09-23 09:03:28.498453 7f992e664700 20 execute(): modified
>> key=bentest1:bbp-gva-master.85732351.16:-1
>> 2016-09-23 09:03:28.498456 7f992e664700 20 wakeup_data_sync_shards:
>> source_zone=bbp-gva-master,
>> shard_ids={6=bentest1:bbp-gva-master.85732351.16:-1}
>> 2016-09-23 09:03:28.498547 7f9a72ffd700 20 incremental_sync(): async
>> update notification: bentest1:bbp-gva-master.85732351.16:-1
>> 2016-09-23 09:03:28.499137 7f9a7dffb700 20 get_system_obj_state:
>> rctx=0x7f9a3c5f8e08
>> obj=.bbp-gva-secondary.log:bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16
>> state=0x7f9a0c069848 s->prefetch_data=0
>> 2016-09-23 09:03:28.501379 7f9a72ffd700 20 operate(): sync status for
>> bucket bentest1:bbp-gva-master.85732351.16:-1: 2
>> 2016-09-23 09:03:28.501433 7f9a877fe700 20 reading from
>> .bbp-gva-secondary.domain.rgw:.bucket.meta.bentest1:bbp-gva-master.85732351.16
>> 2016-09-23 09:03:28.501447 7f9a877fe700 20 get_system_obj_state:
>> rctx=0x7f9a877fc6d0
>> obj=.bbp-gva-secondary.domain.rgw:.bucket.meta.bentest1:bbp-gva-master.85732351.16
>> state=0x7f9a340cfbe8 s->prefetch_data=0
>> 2016-09-23 09:03:28.503269 7f9a877fe700 20 get_system_obj_state:
>> rctx=0x7f9a877fc6d0
>> obj=.bbp-gva-secondary.domain.rgw:.bucket.meta.bentest1:bbp-gva-master.85732351.16
>> state=0x7f9a340cfbe8 s->prefetch_data=0
>> 2016-09-23 09:03:28.510428 7f9a72ffd700 20 sending request to
>> https://bbpobjectstorage.epfl.ch:443/admin/log?bucket-instance=bentest1%3Abbp-gva-master.85732351.16=json=034.4578.3=bucket-index=bbp-gva
>> 2016-09-23 09:03:28.625755 7f9a72ffd700 20 [inc sync] skipping object:
>> bentest1:bbp-gva-master.85732351.16:-1/1m: non-complete operation
>> 2016-09-23 09:03:28.625759 7f9a72ffd700 20 [inc sync] syncing object:
>> bentest1:bbp-gva-master.85732351.16:-1/1m
>> 2016-09-23 09:03:28.625831 7f9a72ffd700 20 bucket sync single entry
>> (source_zone=bbp-gva-master)
>> b=bentest1(@{i=.bbp-gva-secondary.rgw.buckets.index,e=.bbp-gva-master.rgw.buckets.extra}.bbp-gva-secondary.rgw.buckets[bbp-gva-master.85732351.16]):-1/1m[0]
>> log_entry=036.4586.3 op=0 op_state=1
>> 2016-09-23 09:03:28.625857 7f9a72ffd700  5 bucket sync: sync obj:
>> bbp-gva-master/bentest1(@{i=.bbp-gva-secondary.rgw.buckets.index,e=.bbp-gva-master.rgw.buckets.extra}.bbp-gva-secondary.rgw.buckets[bbp-gva-master.85732351.16])/1m[0]
>> 2016-09-23 09:03:28.626092 7f9a85ffb700 20 get_obj_state:
>> rctx=0x7f9a85ff96a0 obj=bentest1:1m state=0x7f9a30051cf8 s->prefetch_data=0
>> 2016-09-23 09:03:28.626119 7f9a72ffd700 20 s

[ceph-users] RGW multisite replication failures

2016-09-23 Thread Ben Morrice
03:28.731703 7f9a72ffd700 20
cr:s=0x7f9a3c5a4f90:op=0x7f9a3ca75ef0:20RGWContinuousLeaseCR: couldn't
lock
.bbp-gva-secondary.log:bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16:sync_lock:
retcode=-16
2016-09-23 09:03:28.731721 7f9a72ffd700  0 ERROR: incremental sync on
bentest1 bucket_id=bbp-gva-master.85732351.16 shard_id=-1 failed,
retcode=-16
2016-09-23 09:03:28.758421 7f9a72ffd700 20 store_marker(): updating
marker
marker_oid=bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16
marker=035.4585.2
2016-09-23 09:03:28.829207 7f9a72ffd700  0 ERROR: failed to sync object:
bentest1:bbp-gva-master.85732351.16:-1/1m
2016-09-23 09:03:28.834281 7f9a72ffd700 20 store_marker(): updating
marker
marker_oid=bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16
marker=036.4586.3
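
retcode=-16 is EBUSY, i.e. the sync lease on the per-bucket status object is
already held, typically by another radosgw instance or another sync coroutine.
The rados advisory lock commands can show who holds it (a sketch; double-check
the object name against the log line above, and be careful with 'lock break',
as the holder may be a live gateway):

# rados -p .bbp-gva-secondary.log lock list bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16
# rados -p .bbp-gva-secondary.log lock info bucket.sync-status.bbp-gva-master:bentest1:bbp-gva-master.85732351.16 sync_lock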



-- 
Kind regards,

Ben Morrice

__________
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland



Re: [ceph-users] RGW multisite - second cluster woes

2016-08-22 Thread Ben Morrice
Hello,

Looks fine on the first cluster:

cluster1# radosgw-admin period get
{
"id": "6ea09956-60a7-48df-980c-2b5bbf71b565",
"epoch": 2,
"predecessor_uuid": "80026abd-49f4-436e-844f-f8743685dac5",
"sync_status": [
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"period_map": {
"id": "6ea09956-60a7-48df-980c-2b5bbf71b565",
"zonegroups": [
{
"id": "rgw1-gva",
"name": "rgw1-gva",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "rgw1-gva-master",
"zones": [
{
"id": "rgw1-gva-master",
"name": "rgw1-gva-master",
"endpoints": [
"http:\/\/rgw1:80\/"
],
"log_meta": "true",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "b23771d0-6005-41da-8ee0-aec03db510d7"
}
],
"short_zone_ids": [
        {
"key": "rgw1-gva-master",
"val": 1414621010
}
]
},
"master_zonegroup": "rgw1-gva",
"master_zone": "rgw1-gva-master",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "b23771d0-6005-41da-8ee0-aec03db510d7",
"realm_name": "gold",
"realm_epoch": 2
}

And, from the second cluster I get this:

cluster2 # radosgw-admin realm pull --url=http://rgw1:80
--access-key=access --secret=secret
2016-08-22 08:48:42.682785 7fc5d3fe29c0  0 error read_lastest_epoch
.rgw.root:periods.381464e1-4326-4b6b-9191-35940c4f645f.latest_epoch
{
"id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
"name": "",
"current_period": "381464e1-4326-4b6b-9191-35940c4f645f",
"epoch": 1
}
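
It may be worth checking what realm and period objects actually landed on the
second cluster after that pull, for example (a sketch, assuming the default
.rgw.root pool):

cluster2# radosgw-admin realm list
cluster2# radosgw-admin period list
cluster2# rados -p .rgw.root ls | grep -e realms -e periods
cluster2# radosgw-admin period pull --url=http://rgw1:80 --access-key=access --secret=secret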


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 19/08/16 08:46, Shilpa Manjarabad Jagannath wrote:
>
> - Original Message -

[ceph-users] RGW multisite - second cluster woes

2016-08-18 Thread Ben Morrice
Hello,

I am trying to add a second cluster to an existing Jewel RGW installation.

I do not get the expected output when I perform a 'radosgw-admin realm
pull'. My realm on the first cluster is called 'gold', but the realm pull
output does not reflect the 'gold' name or id, and I get an error related
to latest_epoch (?).

The documentation seems straightforward, so I'm not quite sure what I'm
missing here.

Please see below for the full output.

# radosgw-admin realm pull --url=http://cluster1:80 --access-key=access
--secret=secret

2016-08-18 17:20:09.585261 7fb939d879c0  0 error read_lastest_epoch
.rgw.root:periods.8c64a4dd-ccd8-4975-b63b-324fbb24aab6.latest_epoch
{
"id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
"name": "",
"current_period": "8c64a4dd-ccd8-4975-b63b-324fbb24aab6",
"epoch": 1
}

# radosgw-admin period pull --url=http://cluster1:80 --access-key=access
--secret=secret
2016-08-18 17:21:33.277719 7f5dbc7849c0  0 error read_lastest_epoch
.rgw.root:periods..latest_epoch
{
"id": "",
"epoch": 0,
"predecessor_uuid": "",
"sync_status": [],
"period_map": {
"id": "",
"zonegroups": [],
"short_zone_ids": []
},
"master_zonegroup": "",
"master_zone": "",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "",
"realm_name": "",
"realm_epoch": 0
}

# radosgw-admin realm default --rgw-realm=gold
failed to init realm: (2) No such file or directory
2016-08-18 17:21:46.220181 7f720defa9c0  0 error in read_id for id  : (2) No such
file or directory

# radosgw-admin zonegroup default --rgw-zonegroup=us
failed to init zonegroup: (2) No such file or directory
2016-08-18 17:22:10.348984 7f9b2da699c0  0 error in read_id for id  :
(2) No such file or directory
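
For reference, the sequence that should work on the second cluster looks
roughly like this (zone name and endpoints are placeholders; the realm pull
error above has to be resolved first, since everything after it depends on the
'gold' realm being present locally):

cluster2# radosgw-admin realm pull --url=http://cluster1:80 --access-key=access --secret=secret
cluster2# radosgw-admin period pull --url=http://cluster1:80 --access-key=access --secret=secret
cluster2# radosgw-admin realm default --rgw-realm=gold
cluster2# radosgw-admin zonegroup default --rgw-zonegroup=us
cluster2# radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-secondary --endpoints=http://cluster2:80 --access-key=access --secret=secret
cluster2# radosgw-admin period update --commit
cluster2# systemctl restart ceph-radosgw.target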


-- 
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland



[ceph-users] RGW Jewel upgrade: realms and default .rgw.root pool?

2016-05-04 Thread Ben Morrice
f49d119c50590d63 
state=0x7f93732b0e18 s->prefetch_data=0

2016-05-04 14:00:13.924347 7f9371d7da40 20 rados->read ofs=0 len=524288
2016-05-04 14:00:13.924834 7f9371d7da40 20 rados->read r=0 bl.length=118
2016-05-04 14:00:13.924852 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e56150 
obj=.rgw.root:periods.21305dac-ee64-42ea-87cf-ee5bb3b42d40.latest_epoch 
state=0x7f93732b0e18 s->prefetch_data=0
2016-05-04 14:00:13.925401 7f9371d7da40 20 get_system_obj_state: 
s->obj_tag was set empty
2016-05-04 14:00:13.925407 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e56150 
obj=.rgw.root:periods.21305dac-ee64-42ea-87cf-ee5bb3b42d40.latest_epoch 
state=0x7f93732b0e18 s->prefetch_data=0

2016-05-04 14:00:13.925409 7f9371d7da40 20 rados->read ofs=0 len=524288
2016-05-04 14:00:13.925950 7f9371d7da40 20 rados->read r=0 bl.length=10
2016-05-04 14:00:13.925971 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e56170 
obj=.rgw.root:periods.21305dac-ee64-42ea-87cf-ee5bb3b42d40.1 
state=0x7f93732b0e18 s->prefetch_data=0
2016-05-04 14:00:13.926584 7f9371d7da40 20 get_system_obj_state: 
s->obj_tag was set empty
2016-05-04 14:00:13.926590 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e56170 
obj=.rgw.root:periods.21305dac-ee64-42ea-87cf-ee5bb3b42d40.1 
state=0x7f93732b0e18 s->prefetch_data=0

2016-05-04 14:00:13.926592 7f9371d7da40 20 rados->read ofs=0 len=524288
2016-05-04 14:00:13.927347 7f9371d7da40 20 rados->read r=0 bl.length=242
2016-05-04 14:00:13.927387 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e561d0 obj=.bbp-dev.rgw.root:region_info.bbp-dev 
state=0x7f93732b0e18 s->prefetch_data=0
2016-05-04 14:00:13.928068 7f9371d7da40 20 get_system_obj_state: 
s->obj_tag was set empty
2016-05-04 14:00:13.928075 7f9371d7da40 20 get_system_obj_state: 
rctx=0x7ffd86e561d0 obj=.bbp-dev.rgw.root:region_info.bbp-dev 
state=0x7f93732b0e18 s->prefetch_data=0

2016-05-04 14:00:13.928077 7f9371d7da40 20 rados->read ofs=0 len=524288
2016-05-04 14:00:13.928759 7f9371d7da40 20 rados->read r=0 bl.length=212
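
The objects being read above are the realm/period objects that the Jewel
upgrade creates in the default .rgw.root pool, alongside the old region_info
object still living in the per-zone .bbp-dev.rgw.root pool. A quick way to see
what the conversion produced (a sketch, assuming default pool names):

# rados -p .rgw.root ls
# radosgw-admin realm list
# radosgw-admin zonegroup list
# radosgw-admin zone list
# radosgw-admin period get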

--
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland
