Re: [ceph-users] ceph pg inconsistencies - omap data lost

2017-04-28 Thread Gregory Farnum
On Tue, Apr 4, 2017 at 7:09 AM, Ben Morrice  wrote:
> Hi all,
>
> We have a weird issue with a few inconsistent PGs. We are running ceph 11.2
> on RHEL7.
>
> As an example inconsistent PG we have:
>
> # rados -p volumes list-inconsistent-obj 4.19
> {"epoch":83986,"inconsistents":[{"object":{"name":"rbd_header.08f7fa43a49c7f","nspace":"","locator":"","snap":"head","version":28785242},"errors":[],"union_shard_errors":["omap_digest_mismatch_oi"],"selected_object_info":"4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242
> client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd
>  od  alloc_hint [0 0
> 0])","shards":[{"osd":10,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":20,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":29,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"}]}]}
>
> If I try to repair this PG, I get the following in the OSD logs:
>
> 2017-04-04 14:31:37.825833 7f2d7f802700 -1 log_channel(cluster) log [ERR] :
> 4.19 shard 10: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest
> 0x62b5dcb6 != omap_digest 0x from auth oi
> 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242
> client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd
>  od  alloc_hint [0 0 0])
> 2017-04-04 14:31:37.825863 7f2d7f802700 -1 log_channel(cluster) log [ERR] :
> 4.19 shard 20: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest
> 0x62b5dcb6 != omap_digest 0x from auth oi
> 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242
> client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd
>  od  alloc_hint [0 0 0])
> 2017-04-04 14:31:37.825870 7f2d7f802700 -1 log_channel(cluster) log [ERR] :
> 4.19 shard 29: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest
> 0x62b5dcb6 != omap_digest 0x from auth oi
> 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242
> client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd
>  od  alloc_hint [0 0 0])
> 2017-04-04 14:31:37.825877 7f2d7f802700 -1 log_channel(cluster) log [ERR] :
> 4.19 soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head: failed to pick
> suitable auth object
> 2017-04-04 14:32:37.926980 7f2d7cffd700 -1 log_channel(cluster) log [ERR] :
> 4.19 deep-scrub 3 errors
>
> If I list the omapvalues, they are null
>
> # rados -p volumes listomapvals rbd_header.08f7fa43a49c7f |wc -l
> 0
>
>
> If I list the extended attributes on the filesystem of each OSD that hosts
> this file, they are indeed empty (all 3 OSDs are the same, but just listing
> one for brevity)
>
> getfattr
> /var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\\uheader.08f7fa43a49c7f__head_6C8FC219__4
> getfattr: Removing leading '/' from absolute path names
> # file:
> var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\134uheader.08f7fa43a49c7f__head_6C8FC219__4
> user.ceph._
> user.ceph._@1
> user.ceph._lock.rbd_lock
> user.ceph.snapset
> user.cephos.spill_out
>
>
> Is there anything I can do to recover from this situation?

This is probably late, but for future reference, you can use
ceph-objectstore-tool against the local OSDs to examine their
specific state (as opposed to the rados listomapvals command, which
only looks at the primary). If you have a valid replica, you can
generally just use that tool to delete the primary's copy of the object
and copy it over from the replicas, or run a repair, which does it for you.
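
For the thread's example (osd.10, pg 4.19, rbd_header.08f7fa43a49c7f) that
would look roughly like the sketch below. This is untested here, so stop the
OSD first and double-check the tool's flags on your release before running
anything destructive:

systemctl stop ceph-osd@10
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --journal-path /var/lib/ceph/osd/ceph-10/journal \
    --pgid 4.19 --op list rbd_header.08f7fa43a49c7f

That prints a JSON object spec; feeding it back in (with the same
--data-path/--journal-path flags) lets you inspect or drop that copy, e.g.:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --journal-path /var/lib/ceph/osd/ceph-10/journal \
    --pgid 4.19 '<json-from-above>' list-omap
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --journal-path /var/lib/ceph/osd/ceph-10/journal \
    --pgid 4.19 '<json-from-above>' remove

systemctl start ceph-osd@10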
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is cls_log_add logging so much?

2017-04-28 Thread Gregory Farnum
On Tue, Apr 4, 2017 at 2:49 AM, Jens Rosenboom  wrote:
> On a busy cluster, I'm seeing a couple of OSDs logging millions of
> lines like this:
>
> 2017-04-04 06:35:18.240136 7f40ff873700  0 
> cls/log/cls_log.cc:129: storing entry at
> 1_1491287718.237118_57657708.1
> 2017-04-04 06:35:18.244453 7f4102078700  0 
> cls/log/cls_log.cc:129: storing entry at
> 1_1491287718.241622_57657709.1
> 2017-04-04 06:35:18.296092 7f40ff873700  0 
> cls/log/cls_log.cc:129: storing entry at
> 1_1491287718.296308_57657710.1
>
> 1. Can someone explain what these messages mean? It seems strange to
> me that only a few OSDs generate these.
>
> 2. Why are they being generated at debug level 0, meaning that they
> cannot be filtered? This shouldn't happen for a non-error message that
> can be generated at least 50 times per second.

It looks like these are generated by one of the object classes that RGW
uses (for its geo-replication features?). They are indeed generated at
level 0 and I can't imagine why either, unless it was just a developer
debug message that didn't get cleaned up.
I'm sure a patch changing it would be welcome.
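If you want to gauge how noisy a given OSD is, a plain count of those lines
works (the log path below is just an example):

grep -c 'cls_log.cc:129: storing entry' /var/log/ceph/ceph-osd.12.log

And if the guess above about RGW's replication machinery is right, the
entries being stored are probably the ones you would see with
"radosgw-admin mdlog list" or "radosgw-admin datalog list", though I
haven't verified that.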
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Question] RBD Striping

2017-04-28 Thread Jason Dillaman
Here is some background on Ceph striping [1]. By default, RBD will stripe
data with a stripe unit of 4MB and a stripe count of 1. Decreasing the
default RBD image object size will balloon the number of objects in
your backing Ceph cluster but will also result in less data to copy
during snapshot and clone CoW operations. Using "fancy" stripe
settings can improve performance under small, sequential IO workloads
since the ops can be executed in parallel by multiple OSDs.


[1] http://docs.ceph.com/docs/master/architecture/#data-striping
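
As a rough illustration (flag spellings vary a bit between releases, so
treat this as a sketch rather than copy-paste material; --size is in MB,
--stripe-unit in bytes):

rbd create mypool/plain --size 10240
rbd create mypool/striped --size 10240 --order 22 --stripe-unit 1048576 --stripe-count 4
rbd info mypool/striped

The first image uses the defaults (4MB objects, stripe unit equal to the
object size, stripe count 1). The second keeps 4MB objects (order 22) but
interleaves 1MB stripe units across 4 objects at a time, which is the sort
of layout that helps small sequential IO.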

On Thu, Apr 27, 2017 at 10:13 AM, Timofey Titovets  wrote:
> Hi, I found that the RBD striping documentation is not detailed enough.
> Can someone explain how RBD stripes an image's data over multiple objects,
> and why it is better to use striping instead of a small RBD object size?
>
> Also, if RBD uses a 4MB object size by default, does that mean that every
> time an object is modified the OSD reads 4MB of data and replicates it,
> instead of only the changes?
> If yes, can striping help with that?
>
> Thanks for any answer
>
> --
> Have a nice day,
> Timofey.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd and/or filestore tuning for ssds?

2017-04-28 Thread Sage Weil
Hi everyone,

Are there any osd or filestore options that operators are tuning for 
all-SSD clusters?  If so (and they make sense) we'd like to introduce them 
as defaults for ssd-backed OSDs.

BlueStore already has different hdd and ssd default values for many 
options that it chooses based on the type of device; we'd like to do that 
for other options in either filestore or the OSD if it makes sense.
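
To make the question concrete, the kind of thing meant here is a ceph.conf
snippet along these lines. The option names below are just examples of
tunables that exist, with placeholder values, not a recommendation of what
to change:

[osd]
        filestore max sync interval = ...
        filestore queue max ops = ...
        filestore op threads = ...
        osd op num shards = ...
        osd op num threads per shard = ...

If you are setting any of these (or others) differently on SSD-backed OSDs,
that is exactly the feedback we are after.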

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deploy on centos 7

2017-04-28 Thread Roger Brown
I don't recall. Perhaps later I can try a test and see.


On Fri, Apr 28, 2017 at 10:22 AM Ali Moeinvaziri  wrote:

> Thanks. So, you didn't get any error on command "ceph-deploy mon
> create-initial"?
> -AM
>
>
> On Fri, Apr 28, 2017 at 9:50 AM, Roger Brown 
> wrote:
>
>> I used ceph on centos 7. I check monitor status with commands like these:
>> systemctl status ceph-mon@nuc1
>> systemctl stop ceph-mon@nuc1
>> systemctl start ceph-mon@nuc1
>> systemctl restart ceph-mon@nuc1
>>
>> for me, the hostnames are nuc1, nuc2, nuc3 so you have to modify to suit
>> your case.
>>
>>
>> On Fri, Apr 28, 2017 at 9:43 AM Ali Moeinvaziri 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I'm just trying to install and test cephs with centos 7, which is the
>>> recommended version
>>> over centos 6 (if I read it correctly). However, the scripts seem to be
>>> still tuned for centos6. So
>>> here is the error I get on deploying monitor node:
>>>
>>> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
>>> ceph -c /etc/ceph/ceph.conf start mon.
>>> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>>>
>>>
>>> Here is the tail of log file:
>>>
>>> [][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-orion2/done
>>> [][DEBUG ] create a done file to avoid re-doing the mon deployment
>>> [][DEBUG ] create the init path if it does not exist
>>> [][DEBUG ] locating the `service` executable...
>>> [orion2][INFO  ] Running command: sudo /usr/sbin/service ceph -c
>>> /etc/ceph/ceph.conf start mon.
>>> [][WARNING] The service command supports only basic LSB actions
>>> (start, stop, restart, try-restart, reload, force-reload, status). For
>>> other actions, please try to use systemctl.
>>> [][ERROR ] RuntimeError: command returned non-zero exit status: 2
>>> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
>>> ceph -c /etc/ceph/ceph.conf start mon.
>>> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>>>
>>>
>>> Are there scripts for centos7, or should this be done manually? Any
>>> suggestions?
>>>
>>> Thanks,
>>> -AM
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deploy on centos 7

2017-04-28 Thread Ali Moeinvaziri
Thanks. So, you didn't get any error on the command "ceph-deploy mon
create-initial"?
-AM


On Fri, Apr 28, 2017 at 9:50 AM, Roger Brown  wrote:

> I used ceph on centos 7. I check monitor status with commands like these:
> systemctl status ceph-mon@nuc1
> systemctl stop ceph-mon@nuc1
> systemctl start ceph-mon@nuc1
> systemctl restart ceph-mon@nuc1
>
> for me, the hostnames are nuc1, nuc2, nuc3 so you have to modify to suit
> your case.
>
>
> On Fri, Apr 28, 2017 at 9:43 AM Ali Moeinvaziri 
> wrote:
>
>>
>> Hi,
>>
>> I'm just trying to install and test cephs with centos 7, which is the
>> recommended version
>> over centos 6 (if I read it correctly). However, the scripts seem to be
>> still tuned for centos6. So
>> here is the error I get on deploying monitor node:
>>
>> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
>> ceph -c /etc/ceph/ceph.conf start mon.
>> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>>
>>
>> Here is the tail of log file:
>>
>> [][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-orion2/done
>> [][DEBUG ] create a done file to avoid re-doing the mon deployment
>> [][DEBUG ] create the init path if it does not exist
>> [][DEBUG ] locating the `service` executable...
>> [orion2][INFO  ] Running command: sudo /usr/sbin/service ceph -c
>> /etc/ceph/ceph.conf start mon.
>> [][WARNING] The service command supports only basic LSB actions
>> (start, stop, restart, try-restart, reload, force-reload, status). For
>> other actions, please try to use systemctl.
>> [][ERROR ] RuntimeError: command returned non-zero exit status: 2
>> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
>> ceph -c /etc/ceph/ceph.conf start mon.
>> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>>
>>
>> Are there scripts for centos7, or should this be done manually? Any
>> suggestions?
>>
>> Thanks,
>> -AM
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deploy on centos 7

2017-04-28 Thread Roger Brown
I used ceph on centos 7. I check monitor status with commands like these:
systemctl status ceph-mon@nuc1
systemctl stop ceph-mon@nuc1
systemctl start ceph-mon@nuc1
systemctl restart ceph-mon@nuc1

For me, the hostnames are nuc1, nuc2, nuc3, so you will have to modify them
to suit your case.


On Fri, Apr 28, 2017 at 9:43 AM Ali Moeinvaziri  wrote:

>
> Hi,
>
> I'm just trying to install and test cephs with centos 7, which is the
> recommended version
> over centos 6 (if I read it correctly). However, the scripts seem to be
> still tuned for centos6. So
> here is the error I get on deploying monitor node:
>
> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
> ceph -c /etc/ceph/ceph.conf start mon.
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>
>
> Here is the tail of log file:
>
> [][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-orion2/done
> [][DEBUG ] create a done file to avoid re-doing the mon deployment
> [][DEBUG ] create the init path if it does not exist
> [][DEBUG ] locating the `service` executable...
> [orion2][INFO  ] Running command: sudo /usr/sbin/service ceph -c
> /etc/ceph/ceph.conf start mon.
> [][WARNING] The service command supports only basic LSB actions
> (start, stop, restart, try-restart, reload, force-reload, status). For
> other actions, please try to use systemctl.
> [][ERROR ] RuntimeError: command returned non-zero exit status: 2
> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service
> ceph -c /etc/ceph/ceph.conf start mon.
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>
>
> Are there scripts for centos7, or should this be done manually? Any
> suggestions?
>
> Thanks,
> -AM
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] deploy on centos 7

2017-04-28 Thread Ali Moeinvaziri
Hi,

I'm just trying to install and test Ceph on CentOS 7, which is the
recommended version over CentOS 6 (if I read it correctly). However, the
scripts seem to be still tuned for CentOS 6. So here is the error I get when
deploying the monitor node:

[ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service ceph
-c /etc/ceph/ceph.conf start mon.
[ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors


Here is the tail of log file:

[][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-orion2/done
[][DEBUG ] create a done file to avoid re-doing the mon deployment
[][DEBUG ] create the init path if it does not exist
[][DEBUG ] locating the `service` executable...
[orion2][INFO  ] Running command: sudo /usr/sbin/service ceph -c
/etc/ceph/ceph.conf start mon.
[][WARNING] The service command supports only basic LSB actions (start,
stop, restart, try-restart, reload, force-reload, status). For other
actions, please try to use systemctl.
[][ERROR ] RuntimeError: command returned non-zero exit status: 2
[ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service ceph
-c /etc/ceph/ceph.conf start mon.
[ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors


Are there scripts for CentOS 7, or should this be done manually? Any
suggestions?

Thanks,
-AM
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-04-28 Thread Mark Nelson

On 04/28/2017 08:23 AM, Frédéric Nass wrote:


On 28/04/2017 at 15:19, Frédéric Nass wrote:


Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on
Luminous dev (1st release) and came to the same (early) conclusion
regarding the performance drop with many small objects on bluestore,
whatever the number of PGs is on a pool. Here is the graph I generated
from the results:



The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128
--no-cleanup


Correction: test was made on a size 1 pool hosted on a single 12x OSDs
node. The rados bench was run from this single host (to this single host).

Frédéric.


If you happen to have time, I would be very interested to see what the 
compaction statistics look like in rocksdb (available via the osd logs). 
 We actually wrote a tool that's in the cbt tools directory that can 
parse the data and look at what rocksdb is doing.  Here's some of the 
data we collected last fall:


https://drive.google.com/open?id=0B2gTBZrkrnpZRFdiYjFRNmxLblU

The idea there was to try to determine how WAL buffer size / count and 
min_alloc size affected the amount of compaction work that rocksdb was 
doing.  There are also some more general compaction statistics that are 
more human readable in the logs that are worth looking at (ie things 
like write amp and such).


The gist of it is that as you do lots of small writes the amount of 
metadata that has to be kept track of in rocksdb increases, and rocksdb 
ends up doing a *lot* of compaction work, with the associated read and 
write amplification.  The only ways to really deal with this are to 
either reduce the amount of metadata (onodes, extents, etc) or see if we 
can find any ways to reduce the amount of work rocksdb has to do.


On the first point, increasing the min_alloc size in bluestore tends to 
help, but with tradeoffs.  Any io smaller than the min_alloc size will 
be doubly-written like with filestore journals, so you trade reducing 
metadata for an extra WAL write.  We did a bunch of testing last fall 
and at least on NVMe it was better to use a 16k min_alloc size and eat 
the WAL write than use a 4K min_alloc size, skip the WAL write, but 
shove more metadata at rocksdb.  For HDDs, I wouldn't expect too bad of 
behavior with the default 64k min alloc size, but it sounds like it 
could be a problem based on your results.  That's why it would be 
interesting to see if that's what's happening during your tests.
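
If you want to experiment with that, the knobs are the bluestore
min_alloc_size options. A sketch only; the values below are the defaults as
I remember them, so verify against your build before changing anything:

[osd]
        bluestore min alloc size hdd = 65536
        bluestore min alloc size ssd = 16384

(bluestore min alloc size, without the hdd/ssd suffix, overrides both when
set to a non-zero value.)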


Another issue is that short lived WAL writes potentially can leak into 
level0 and cause additional compaction work.  Sage has a pretty clever 
idea to fix this but we need someone knowledgeable about rocksdb to go 
in and try to implement it (or something like it).


Anyway, we still see a significant amount of work being done by rocksdb 
due to compaction, most of it being random reads.  We actually spoke 
about this quite a bit yesterday at the performance meeting.  If you 
look at a wallclock profile of 4K random writes, you'll see a ton of 
work being done on compact (about 70% in total of thread 2):


https://paste.fedoraproject.org/paste/uS3LHRHw2Yma0iUYSkgKOl5M1UNdIGYhyRLivL9gydE=

One thing we are still confused about is why rocksdb is doing 
random_reads for compaction rather than sequential reads.  It would be 
really great if someone that knows rocksdb well could help us understand 
why it's doing this.


Ultimately for something like RBD I suspect the performance will stop 
dropping once you've completely filled the disk with 4k random writes. 
For RGW type work, the more tiny objects you add the more data rocksdb 
has to keep track of and the more rocksdb is going to slow down.  It's 
not the same problem filestore suffers from, but it's similar in that 
the more keys/bytes/levels rocksdb has to deal with, the more data gets 
moved around between levels, the more background work that happens, the 
more likely we are waiting on rocksdb before we can write more data.


Mark





I hope this will improve as the performance drop seems more related to
how many objects are in the pool (> 40M) rather than how many objects
are written each second.
Like Wido, I was thinking that we may have to increase the number of
disks in the future to keep up with the needed performance for our
Zimbra messaging use case.
Or move data from the current EC pool to a replicated pool, as erasure
coding doesn't help either for this type of use case.

Regards,

Frédéric.

On 26/04/2017 at 22:25, Wido den Hollander wrote:

On 24 April 2017 at 19:52, Florian Haas wrote:


Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain 

Re: [ceph-users] Replication (k=1) in LRC

2017-04-28 Thread David Turner
You can't have different EC profiles in the same pool either.  You have to
create the pool with either a specific EC profile or as Replica.  If you
choose EC you can't even change the EC profile later; however, you can
change the number of copies a Replica pool has.  An EC pool with k=1, m=1
doesn't do anything.  It's worse than a Replica 2 pool because you could
never change it to have more copies of the data.  I'm guessing this is why
it isn't possible to do.

If you meant to say that you can't have EC and Replica pools in the same
cluster on the same disks, then that isn't correct.  That is very common
and done regularly.
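
For reference, having both kinds of pool on the same cluster and disks looks
roughly like this (pool names and PG counts made up):

ceph osd erasure-code-profile set k2m1 k=2 m=1
ceph osd pool create ecpool 128 128 erasure k2m1
ceph osd pool create reppool 128 128 replicated
ceph osd pool set reppool size 3

The replicated pool's size can be changed later; the EC pool's profile
cannot.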

On Fri, Apr 28, 2017 at 9:27 AM Loic Dachary  wrote:

>
>
> On 04/28/2017 02:48 PM, David Turner wrote:
> > Wouldn't k=1, m=1 just be replica 2?
>
> Well yes. But Ceph does not support mixing replication and erasure code in
> the same pool.
>
> > EC will split the object into k pieces (1)... Ok, that's the whole
> object.
>
> I was just wondering if jerasure tolerates this degraded use case. Even
> though it's not useful in general, it would solve Oleg's issue.
>
> > And then you want to be able to lose m copies of the object (1)... Ok,
> that's an entire copy of that whole object.  That isn't erasure coding,
> that is full 2 copy replication. For erasure coding to work you need to
> split the object into at least 2 pieces (k) and then have at least one
> parity copy (m). With m=0  you have no redundancy and just made a super
> slow raid 0.
>
> :-D
>
> >
> >
> > On Thu, Apr 27, 2017, 6:49 PM Loic Dachary > wrote:
> >
> >
> >
> > On 04/27/2017 11:43 PM, Oleg Kolosov wrote:
> > > Hi Loic,
> > > Of course.
> > > I'm implementing a version of Pyramid Code. In Pyramid you remove
> one of the global parities of Reed-Solomon and add one local parity for
> each local group. In my version, I'd like to add local parity to the global
> parity (meaning that for the case the global parity = 1, it would be
> replicated). This way in case of a failure in the global parity, you can
> reconstruct it using the replicated node instead of reconstructing it with
> all K nodes.
> > >
> > > This is my profile:
> > > ceph osd erasure-code-profile set myprofile \
> > > plugin=lrc \
> > > mapping=DD_DD___ \
> > > layers='[
> > > [ "DD_DD_c_", "" ],
> > > [ "DDc_", "" ],
> > > [ "___DDc__", "" ],
> > > [ "__Dc", "" ],
> > > ]' \
> > > ruleset-steps='[
> > > [ "chooseleaf", "osd",  8  ],
> > > ]'
> >
> > You could test and see if commenting out the sanity check at
> >
> >
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L89
> >
> > does the trick. I don't remember enough about this border case to be
> sure it won't work. You can also give it a try with
> >
> >
> https://github.com/ceph/ceph/blob/master/src/test/erasure-code/ceph_erasure_code_benchmark.cc
> >
> > Cheers
> >
> > > Regards,
> > > Oleg
> > >
> > > On Fri, Apr 28, 2017 at 12:33 AM, Loic Dachary   >> wrote:
> > >
> > > Hi Oleg,
> > >
> > > On 04/27/2017 11:23 PM, Oleg Kolosov wrote:
> > > > Hi,
> > > > I'm working on various implementation of LRC codes for study
> purposes. The layers implementation in the LRC module is very convenient
> for this, but I've came upon a problem in one of the cases.
> > > > I'm interested in having k=1, m=1 in one of the layers.
> However this gives out an error:
> > > > Error EINVAL: k=1 must be >= 2
> > > >
> > > > I should point out that my erasure code has additional
> layers which are fine, only this one has k=1, m=1.
> > > >
> > > > What was the reason for this issue?
> > > > Can replication be implemented in one of LRC's layers?
> > >
> > > Could you provide the code for me to reproduce this problem ?
> Or a description of the layers ? I implemented this restriction because it
> made the code simpler. And also because I could not think of a valid use
> case.
> > >
> > > Cheers
> > >
> > > --
> > > Loïc Dachary, Artisan Logiciel Libre
> > >
> > >
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication (k=1) in LRC

2017-04-28 Thread Loic Dachary


On 04/28/2017 02:48 PM, David Turner wrote:
> Wouldn't k=1, m=1 just be replica 2? 

Well yes. But Ceph does not support mixing replication and erasure code in the 
same pool.

> EC will split the object into k pieces (1)... Ok, that's the whole object. 

I was just wondering if jerasure tolerates this degraded use case. Even though
it's not useful in general, it would solve Oleg's issue.

> And then you want to be able to lose m copies of the object (1)... Ok, that's 
> an entire copy of that whole object.  That isn't erasure coding, that is full 
> 2 copy replication. For erasure coding to work you need to split the object 
> into at least 2 pieces (k) and then have at least one parity copy (m). With 
> m=0  you have no redundancy and just made a super slow raid 0.

:-D

> 
> 
> On Thu, Apr 27, 2017, 6:49 PM Loic Dachary  > wrote:
> 
> 
> 
> On 04/27/2017 11:43 PM, Oleg Kolosov wrote:
> > Hi Loic,
> > Of course.
> > I'm implementing a version of Pyramid Code. In Pyramid you remove one 
> of the global parities of Reed-Solomon and add one local parity for each 
> local group. In my version, I'd like to add local parity to the global parity 
> (meaning that for the case the global parity = 1, it would be replicated). 
> This way in case of a failure in the global parity, you can reconstruct it 
> using the replicated node instead of reconstructing it with all K nodes.
> >
> > This is my profile:
> > ceph osd erasure-code-profile set myprofile \
> > plugin=lrc \
> > mapping=DD_DD___ \
> > layers='[
> > [ "DD_DD_c_", "" ],
> > [ "DDc_", "" ],
> > [ "___DDc__", "" ],
> > [ "__Dc", "" ],
> > ]' \
> > ruleset-steps='[
> > [ "chooseleaf", "osd",  8  ],
> > ]'
> 
> You could test and see if commenting out the sanity check at
> 
> 
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L89
> 
> does the trick. I don't remember enough about this border case to be sure 
> it won't work. You can also give it a try with
> 
> 
> https://github.com/ceph/ceph/blob/master/src/test/erasure-code/ceph_erasure_code_benchmark.cc
> 
> Cheers
> 
> > Regards,
> > Oleg
> >
> > On Fri, Apr 28, 2017 at 12:33 AM, Loic Dachary    >> wrote:
> >
> > Hi Oleg,
> >
> > On 04/27/2017 11:23 PM, Oleg Kolosov wrote:
> > > Hi,
> > > I'm working on various implementation of LRC codes for study 
> purposes. The layers implementation in the LRC module is very convenient for 
> this, but I've came upon a problem in one of the cases.
> > > I'm interested in having k=1, m=1 in one of the layers. However 
> this gives out an error:
> > > Error EINVAL: k=1 must be >= 2
> > >
> > > I should point out that my erasure code has additional layers 
> which are fine, only this one has k=1, m=1.
> > >
> > > What was the reason for this issue?
> > > Can replication be implemented in one of LRC's layers?
> >
> > Could you provide the code for me to reproduce this problem ? Or a 
> description of the layers ? I implemented this restriction because it made 
> the code simpler. And also because I could not think of a valid use case.
> >
> > Cheers
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> >
> >
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-04-28 Thread Frédéric Nass


On 28/04/2017 at 15:19, Frédéric Nass wrote:


Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on 
Luminous dev (1st release) and came to the same (early) conclusion 
regarding the performance drop with many small objects on bluestore, 
whatever the number of PGs is on a pool. Here is the graph I generated 
from the results:




The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS 
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1 
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128 
--no-cleanup


Correction: test was made on a size 1 pool hosted on a single 12x OSDs 
node. The rados bench was run from this single host (to this single host).


Frédéric.



I hope this will improve as the performance drop seems more related to 
how many objects are in the pool (> 40M) rather than how many objects 
are written each second.
Like Wido, I was thinking that we may have to increase the number of 
disks in the future to keep up with the needed performance for our 
Zimbra messaging use case.
Or move data from the current EC pool to a replicated pool, as erasure
coding doesn't help either for this type of use case.


Regards,

Frédéric.

On 26/04/2017 at 22:25, Wido den Hollander wrote:

On 24 April 2017 at 19:52, Florian Haas wrote:


Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain near-constant performance
characteristics while supporting a steady intake of a large number of
small objects?

This is probably a very common problem, but we have a bit of a dearth of
truly adequate best practices for it. To clarify, what I'm talking about
is an intake on the order of millions per hour. That might sound like a
lot, but if you consider an intake of 700 objects/s at 20 KiB/object,
that's just 14 MB/s. That's not exactly hammering your cluster — but it
amounts to 2.5 million objects created per hour.


I have seen that the amount of objects at some point becomes a problem.

Eventually you will have scrubs running and especially a deep-scrub will cause 
issues.

I have never had the use-case to have a sustained intake of so many 
objects/hour, but it is interesting though.


Under those circumstances, two things tend to happen:

(1) There's a predictable decline in insert bandwidth. In other words, a
cluster that may allow inserts at a rate of 2.5M/hr rapidly goes down to
1.8M/hr and then 1.7M/hr ... and by "rapidly" I mean hours, not days. As
I understand it, this is mainly due to the FileStore's propensity to
index whole directories with a readdir() call, which is a linear-time
operation.

(2) FileStore's mitigation strategy for this is to proactively split
directories so they never get so large as for readdir() to become a
significant bottleneck. That's fine, but in a cluster with a steadily
growing number of objects, that tends to lead to lots and lots of
directory splits happening simultaneously — causing inserts to slow to a
crawl.

For (2) there is a workaround: we can initialize a pool with an expected
number of objects, set a pool max_objects quota, and disable on-demand
splitting altogether by setting a negative filestore merge threshold.
That way, all splitting occurs at pool creation time, and before another
split were to happen, you hit the pool quota. So you never hit that
brick wall caused by the thundering herd of directory splits. Of course,
it also means that when you want to insert yet more objects, you need
another pool — but you can handle that at the application level.
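
A hedged sketch of what that workaround looks like on the command line (pool
name, rule name and numbers invented here, and the negative merge threshold
needs to be in place before the pool is created for the pre-splitting to
happen):

ceph osd pool create intake 4096 4096 replicated replicated_ruleset 100000000
ceph osd pool set-quota intake max_objects 100000000

with something like "filestore merge threshold = -10" (plus a suitable
"filestore split multiple") already set on the OSDs.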

It's actually a bit of a dilemma: we want directory splits to happen
proactively, so that readdir() doesn't slow things down, but then we
also *don't* want them to happen, because while they do, inserts flatline.

(2) will likely be killed off completely by BlueStore, because there are
no more directories, hence nothing to split.

For (1) there really isn't a workaround that I'm aware of for FileStore.
And at least preliminary testing shows that BlueStore clusters suffer
from similar, if not the same, performance degradation (although, to be
fair, I haven't yet seen tests under the above parameters with rocksdb
and WAL on NVMe hardware).


Can you point me to this testing of BlueStore?


For (1) however I understand that there would be a potential solution in
FileStore itself, by throwing away Ceph's own directory indexing and
just rely on flat directory lookups — which should be logarithmic-time
operations in both btrfs and XFS, as both use B-trees for directory
indexing. But I understand that that would be a fairly massive operation
that looks even less attractive to undertake with BlueStore around the
corner.

One suggestion that has been made (credit to Greg) was to do object
packing, i.e. bunch up a lot of discrete 

Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-04-28 Thread Frédéric Nass

Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on 
Luminous dev (1st release) and came to the same (early) conclusion 
regarding the performance drop with many small objects on bluestore, 
whatever the number of PGs is on a pool. Here is the graph I generated 
from the results:




The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS 
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1 
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128 
--no-cleanup


I hope this will improve as the performance drop seems more related to 
how many objects are in the pool (> 40M) rather than how many objects 
are written each second.
Like Wido, I was thinking that we may have to increase the number of 
disks in the future to keep up with the needed performance for our 
Zimbra messaging use case.
Or move data from the current EC pool to a replicated pool, as erasure
coding doesn't help either for this type of use case.


Regards,

Frédéric.

On 26/04/2017 at 22:25, Wido den Hollander wrote:

On 24 April 2017 at 19:52, Florian Haas wrote:


Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain near-constant performance
characteristics while supporting a steady intake of a large number of
small objects?

This is probably a very common problem, but we have a bit of a dearth of
truly adequate best practices for it. To clarify, what I'm talking about
is an intake on the order of millions per hour. That might sound like a
lot, but if you consider an intake of 700 objects/s at 20 KiB/object,
that's just 14 MB/s. That's not exactly hammering your cluster — but it
amounts to 2.5 million objects created per hour.


I have seen that the amount of objects at some point becomes a problem.

Eventually you will have scrubs running and especially a deep-scrub will cause 
issues.

I have never had the use-case to have a sustained intake of so many 
objects/hour, but it is interesting though.


Under those circumstances, two things tend to happen:

(1) There's a predictable decline in insert bandwidth. In other words, a
cluster that may allow inserts at a rate of 2.5M/hr rapidly goes down to
1.8M/hr and then 1.7M/hr ... and by "rapidly" I mean hours, not days. As
I understand it, this is mainly due to the FileStore's propensity to
index whole directories with a readdir() call, which is a linear-time
operation.

(2) FileStore's mitigation strategy for this is to proactively split
directories so they never get so large as for readdir() to become a
significant bottleneck. That's fine, but in a cluster with a steadily
growing number of objects, that tends to lead to lots and lots of
directory splits happening simultaneously — causing inserts to slow to a
crawl.

For (2) there is a workaround: we can initialize a pool with an expected
number of objects, set a pool max_objects quota, and disable on-demand
splitting altogether by setting a negative filestore merge threshold.
That way, all splitting occurs at pool creation time, and before another
split were to happen, you hit the pool quota. So you never hit that
brick wall caused by the thundering herd of directory splits. Of course,
it also means that when you want to insert yet more objects, you need
another pool — but you can handle that at the application level.

It's actually a bit of a dilemma: we want directory splits to happen
proactively, so that readdir() doesn't slow things down, but then we
also *don't* want them to happen, because while they do, inserts flatline.

(2) will likely be killed off completely by BlueStore, because there are
no more directories, hence nothing to split.

For (1) there really isn't a workaround that I'm aware of for FileStore.
And at least preliminary testing shows that BlueStore clusters suffer
from similar, if not the same, performance degradation (although, to be
fair, I haven't yet seen tests under the above parameters with rocksdb
and WAL on NVMe hardware).


Can you point me to this testing of BlueStore?


For (1) however I understand that there would be a potential solution in
FileStore itself, by throwing away Ceph's own directory indexing and
just rely on flat directory lookups — which should be logarithmic-time
operations in both btrfs and XFS, as both use B-trees for directory
indexing. But I understand that that would be a fairly massive operation
that looks even less attractive to undertake with BlueStore around the
corner.

One suggestion that has been made (credit to Greg) was to do object
packing, i.e. bunch up a lot of discrete data chunks into a single RADOS
object. But in terms of distribution and lookup logic that would have to
be built on top, that seems weird to me (CRUSH on top of CRUSH to find
out which RADOS object a chunk 

Re: [ceph-users] Replication (k=1) in LRC

2017-04-28 Thread David Turner
Wouldn't k=1, m=1 just be replica 2? EC will split the object into k pieces
(1)... Ok, that's the whole object. And then you want to be able to lose m
copies of the object (1)... Ok, that's an entire copy of that whole
object.  That isn't erasure coding, that is full 2 copy replication. For
erasure coding to work you need to split the object into at least 2 pieces
(k) and then have at least one parity copy (m). With m=0  you have no
redundancy and just made a super slow raid 0.

On Thu, Apr 27, 2017, 6:49 PM Loic Dachary  wrote:

>
>
> On 04/27/2017 11:43 PM, Oleg Kolosov wrote:
> > Hi Loic,
> > Of course.
> > I'm implementing a version of Pyramid Code. In Pyramid you remove one of
> the global parities of Reed-Solomon and add one local parity for each local
> group. In my version, I'd like to add local parity to the global parity
> (meaning that for the case the global parity = 1, it would be replicated).
> This way in case of a failure in the global parity, you can reconstruct it
> using the replicated node instead of reconstructing it will all K nodes.
> >
> > This is my profile:
> > ceph osd erasure-code-profile set myprofile \
> > plugin=lrc \
> > mapping=DD_DD___ \
> > layers='[
> > [ "DD_DD_c_", "" ],
> > [ "DDc_", "" ],
> > [ "___DDc__", "" ],
> > [ "__Dc", "" ],
> > ]' \
> > ruleset-steps='[
> > [ "chooseleaf", "osd",  8  ],
> > ]'
>
> You could test and see if commenting out the sanity check at
>
>
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L89
>
> does the trick. I don't remember enough about this border case to be sure
> it won't work. You can also give it a try with
>
>
> https://github.com/ceph/ceph/blob/master/src/test/erasure-code/ceph_erasure_code_benchmark.cc
>
> Cheers
>
> > Regards,
> > Oleg
> >
> > On Fri, Apr 28, 2017 at 12:33 AM, Loic Dachary  > wrote:
> >
> > Hi Oleg,
> >
> > On 04/27/2017 11:23 PM, Oleg Kolosov wrote:
> > > Hi,
> > > I'm working on various implementation of LRC codes for study
> purposes. The layers implementation in the LRC module is very convenient
> for this, but I've came upon a problem in one of the cases.
> > > I'm interested in having k=1, m=1 in one of the layers. However
> this gives out an error:
> > > Error EINVAL: k=1 must be >= 2
> > >
> > > I should point out that my erasure code has additional layers
> which are fine, only this one has k=1, m=1.
> > >
> > > What was the reason for this issue?
> > > Can replication be implemented in one of LRC's layers?
> >
> > Could you provide the code for me to reproduce this problem ? Or a
> description of the layers ? I implemented this restriction because it made
> the code simpler. And also because I could not think of a valid use case.
> >
> > Cheers
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> >
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-28 Thread Ben Morrice

Hello again,

I can work around this issue. If the host header is an IP address, the 
request is treated as a virtual:


So if I auth to my backends via IP, things work as expected.

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 28/04/17 09:26, Ben Morrice wrote:

Hello Radek,

Thanks again for your analysis.

I can confirm that on 10.2.7, if I remove the conf "rgw dns name", I can
auth directly to the radosgw host.


In our environment we terminate SSL and route connections via haproxy, 
but it's still sometimes useful to be able to communicate directly to 
the backend radosgw server.


It seems that it's not possible to set multiple "rgw dns name" entries 
in ceph.conf


Is the only solution to modify the zonegroup and populate the 
'hostnames' array with all backend server hostnames as well as the 
hostname terminated by haproxy?


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 13:53, Radoslaw Zarzynski wrote:

Bingo! From the 10.2.5-admin:

   GET

   Thu, 27 Apr 2017 07:49:59 GMT
   /

And also:

   2017-04-27 09:49:59.117447 7f4a90ff9700 20 subdomain= domain=
in_hosted_domain=0 in_hosted_domain_s3website=0
   2017-04-27 09:49:59.117449 7f4a90ff9700 20 final domain/bucket
subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
s->info.domain= s->info.request_uri=/

The most interesting part is the "final ... in_hosted_domain=0".
It looks we need to dig around RGWREST::preprocess(),
rgw_find_host_in_domains() & company.

There is a commit introduced in v10.2.6 that touches this area [1].
I'm definitely not saying it's the root cause. It might be that a change
in the code just exposed a configuration issue [2].

I will talk about the problem at today's sync-up.

Thanks for the logs!
Regards,
Radek

[1] 
https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0

[2] http://tracker.ceph.com/issues/17440

On Thu, Apr 27, 2017 at 10:11 AM, Ben Morrice  
wrote:

Hello Radek,

Thank-you for your analysis so far! Please find attached logs for 
both the
admin user and a keystone backed user from 10.2.5 (same host as 
before, I
have simply downgraded the packages). Both users can authenticate 
and list

buckets on 10.2.5.

Also - I tried version 10.2.6 and see the same behavior as 10.2.7, 
so the

bug i'm hitting looks like it was introduced in 10.2.6

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 04:45, Radoslaw Zarzynski wrote:

Thanks for the logs, Ben.

It looks like two completely different authenticators have failed:
the local, RADOS-backed auth (admin.txt) and the Keystone-based
one as well. In the second case I'm pretty sure that Keystone has
refused [1][2] to authenticate the provided signature/StringToSign.
RGW tried to fall back to the local auth, which obviously didn't have
any chance as the credentials were stored remotely. This explains
the presence of "error reading user info" in the user-keystone.txt.

What is common to both scenarios is the low-level handling related
to StringToSign crafting/signature generation on RadosGW's side.
The following one has been composed for the request from admin.txt:

GET


Wed, 26 Apr 2017 09:18:42 GMT
/bbpsrvc15.cscs.ch/

If you could provide a similar log from v10.2.5, I would be really
grateful.

Regards,
Radek

[1]
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272 

[2] 
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170


On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben  
wrote:

Hello Radek,

Please find attached the failed request for both the admin user and a
standard user (backed by keystone).

Kind regards,

Ben Morrice

__ 


Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland


From: Radoslaw Zarzynski 
Sent: Tuesday, April 25, 2017 7:38 PM
To: Morrice Ben
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

Hello Ben,

Could you provide full RadosGW's log for the failed request?
I mean the lines starting from header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details related to the signature
calculation.

Regards,
Radek

On Thu, Apr 20, 2017 at 5:08 PM, 

Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-28 Thread Ben Morrice

Hello Radek,

Thanks again for your analysis.

I can confirm that on 10.2.7, if I remove the conf "rgw dns name", I can
auth directly to the radosgw host.


In our environment we terminate SSL and route connections via haproxy, 
but it's still sometimes useful to be able to communicate directly to 
the backend radosgw server.


It seems that it's not possible to set multiple "rgw dns name" entries 
in ceph.conf


Is the only solution to modify the zonegroup and populate the 
'hostnames' array with all backend server hostnames as well as the 
hostname terminated by haproxy?
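
If it does come to that, the zonegroup edit itself should be straightforward.
Roughly (untested here, and the hostnames are placeholders):

radosgw-admin zonegroup get > zonegroup.json
(edit the "hostnames" array to list the haproxy name plus each backend host)
radosgw-admin zonegroup set < zonegroup.json
radosgw-admin period update --commit

followed by a restart of the radosgw instances.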


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 13:53, Radoslaw Zarzynski wrote:

Bingo! From the 10.2.5-admin:

   GET

   Thu, 27 Apr 2017 07:49:59 GMT
   /

And also:

   2017-04-27 09:49:59.117447 7f4a90ff9700 20 subdomain= domain=
in_hosted_domain=0 in_hosted_domain_s3website=0
   2017-04-27 09:49:59.117449 7f4a90ff9700 20 final domain/bucket
subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
s->info.domain= s->info.request_uri=/

The most interesting part is the "final ... in_hosted_domain=0".
It looks we need to dig around RGWREST::preprocess(),
rgw_find_host_in_domains() & company.

There is a commit introduced in v10.2.6 that touches this area [1].
I'm definitely not saying it's the root cause. It might be that a change
in the code just exposed a configuration issue [2].

I will talk about the problem at today's sync-up.

Thanks for the logs!
Regards,
Radek

[1] https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0
[2] http://tracker.ceph.com/issues/17440

On Thu, Apr 27, 2017 at 10:11 AM, Ben Morrice  wrote:

Hello Radek,

Thank-you for your analysis so far! Please find attached logs for both the
admin user and a keystone backed user from 10.2.5 (same host as before, I
have simply downgraded the packages). Both users can authenticate and list
buckets on 10.2.5.

Also - I tried version 10.2.6 and see the same behavior as 10.2.7, so the
bug i'm hitting looks like it was introduced in 10.2.6

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 27/04/17 04:45, Radoslaw Zarzynski wrote:

Thanks for the logs, Ben.

It looks like two completely different authenticators have failed:
the local, RADOS-backed auth (admin.txt) and the Keystone-based
one as well. In the second case I'm pretty sure that Keystone has
refused [1][2] to authenticate the provided signature/StringToSign.
RGW tried to fall back to the local auth, which obviously didn't have
any chance as the credentials were stored remotely. This explains
the presence of "error reading user info" in the user-keystone.txt.

What is common to both scenarios is the low-level handling related
to StringToSign crafting/signature generation on RadosGW's side.
The following one has been composed for the request from admin.txt:

GET


Wed, 26 Apr 2017 09:18:42 GMT
/bbpsrvc15.cscs.ch/

If you could provide a similar log from v10.2.5, I would be really
grateful.

Regards,
Radek

[1]
https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272
[2] https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170

On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben  wrote:

Hello Radek,

Please find attached the failed request for both the admin user and a
standard user (backed by keystone).

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland


From: Radoslaw Zarzynski 
Sent: Tuesday, April 25, 2017 7:38 PM
To: Morrice Ben
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

Hello Ben,

Could you provide full RadosGW's log for the failed request?
I mean the lines starting from header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details related to the signature
calculation.

Regards,
Radek

On Thu, Apr 20, 2017 at 5:08 PM, Ben Morrice  wrote:

Hi all,

I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7
(RHEL7)
and authentication is in a very bad state. This installation is part of
a
multigw configuration, and I have just updated one host in the secondary
zone (all other hosts/zones are running 10.2.5).

On the 10.2.7 server I cannot authenticate as a user (normally backed by
OpenStack Keystone), but even worse I can also not authenticate with an