Re: [ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-03 Thread Benjeman Meekhof
I'd like to see a few of the cache tier counters exposed.  You get
some info on cache activity in 'ceph -s' so it makes sense from my
perspective to have similar availability in exposed counters.

There's a tracker for this request (opened by me a while ago):
https://tracker.ceph.com/issues/37156
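
For anyone curious which counters their mgr modules can currently see (and
therefore what Dashboard or the Prometheus/Influx exporters could ever
report), a minimal sketch of a custom ceph-mgr module is below.  It is only
an illustration: it assumes the standard MgrModule base class and that
get_all_perf_counters() returns a dict keyed by daemon name, and the exact
shape of that data may vary between releases.

# module.py for a hypothetical 'counterdump' mgr module
import json
from mgr_module import MgrModule

class Module(MgrModule):
    """Log the perf counters visible to mgr modules (PRIO_USEFUL and up)."""

    def serve(self):
        # get_all_perf_counters() only returns counters at or above the
        # priority threshold, which is why lower-priority counters (e.g. the
        # cache tier ones) never show up in Dashboard/Prometheus today.
        for daemon, counters in self.get_all_perf_counters().items():
            self.log.info("%s exposes %d counters: %s", daemon, len(counters),
                          json.dumps(sorted(counters.keys())))

    def shutdown(self):
        pass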

thanks,
Ben



On Tue, Dec 3, 2019 at 8:36 AM Ernesto Puerta  wrote:
>
> Hi Cephers,
>
> As a result of this tracker (https://tracker.ceph.com/issues/42961)
> Neha and I were wondering if there would be other perf-counters deemed
> by users/operators as worthy to be exposed via ceph-mgr modules for
> monitoring purposes.
>
> The default behaviour is that only perf-counters with priority
> PRIO_USEFUL (5) or higher are exposed (via `get_all_perf_counters` API
> call) to ceph-mgr modules (including Dashboard, DiskPrediction or
> Prometheus/InfluxDB/Telegraf exporters).
>
> While changing that is rather trivial, it could make sense to get
> users' feedback and come up with a list of missing perf-counters to be
> exposed.
>
> Kind regards,
> Ernesto
>


Re: [ceph-users] RGW Admin REST metadata caps

2019-07-23 Thread Benjeman Meekhof
Please disregard: the listed caps are sufficient and there does not
seem to be any issue here.  Between adding the metadata caps and
re-testing I made a mistake in passing credentials to the module and
naturally received an AccessDenied for bad credentials.

thanks,
Ben


On Tue, Jul 23, 2019 at 12:53 PM Benjeman Meekhof  wrote:
>
> Ceph Nautilus, 14.2.2, RGW civetweb.
> Trying to read from the RGW admin api /metadata/user with request URL like:
> GET /admin/metadata/user?key=someuser&format=json
>
> But am getting a 403 denied error from RGW.  Shouldn't the caps below
> be sufficient, or am I missing something?
>
>  "caps": [
> {
> "type": "metadata",
> "perm": "read"
> },
> {
> "type": "user",
> "perm": "read"
> },
> {
> "type": "users",
> "perm": "read"
> }
> ],
>
> The application making the call is a python module:
> https://github.com/UMIACS/rgwadmin
>
> I have another application using the API and it is able to make
> requests to fetch a user but does so by calling 'GET
> /admin/user?format=xml&uid=someuser' and that user has just the
> 'users=read' cap.
>
> thanks,
> Ben


[ceph-users] RGW Admin REST metadata caps

2019-07-23 Thread Benjeman Meekhof
Ceph Nautilus, 14.2.2, RGW civetweb.
Trying to read from the RGW admin API /metadata/user with a request URL like:
GET /admin/metadata/user?key=someuser&format=json

But I am getting a 403 (AccessDenied) error from RGW.  Shouldn't the caps below
be sufficient, or am I missing something?

 "caps": [
{
"type": "metadata",
"perm": "read"
},
{
"type": "user",
"perm": "read"
},
{
"type": "users",
"perm": "read"
}
],

The application making the call is a python module:
https://github.com/UMIACS/rgwadmin

I have another application using the API and it is able to make
requests to fetch a user but does so by calling 'GET
/admin/user?format=xml&uid=someuser' and that user has just the
'users=read' cap.
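
For reference, the failing request can be reproduced outside the module with
a short python script.  This is just a sketch of what rgwadmin does under the
hood; it assumes the third-party requests-aws4auth package for signing, that
your RGW accepts v4 signatures, and the endpoint/keys/region below are
placeholders:

import requests
from requests_aws4auth import AWS4Auth   # assumption: pip 'requests-aws4auth'

endpoint = "http://rgw.example.org:7480"              # placeholder
auth = AWS4Auth("ADMINACCESSKEY", "ADMINSECRETKEY",   # user with the caps above
                "default", "s3")                      # region/zonegroup name may vary

resp = requests.get(endpoint + "/admin/metadata/user",
                    params={"key": "someuser", "format": "json"},
                    auth=auth)
print(resp.status_code)                  # 403 is the AccessDenied I'm seeing
print(resp.json() if resp.ok else resp.text)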

thanks,
Ben


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-28 Thread Benjeman Meekhof
I suggest having a look at this thread, which suggests that sizes 'in
between' the requirements of the different RocksDB levels have no net
effect, and sizing accordingly.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030740.html

My impression is that 28GB is good (L0+L1+L3), or 280 GB is good (+L4
too), or whatever size is required for +L5 is good, but anything in
between will probably not get used.  I've seen this somewhat borne out
on our oldest storage nodes, which have only enough NVMe space to
provide 24GB per OSD.  Though only ~3GiB of the 24GiB of available DB
space is in use, 1GiB of 'slow' DB is used:

"db_total_bytes": 26671570944,
"db_used_bytes": 2801795072,
"slow_used_bytes": 1102053376
(Mimic 13.2.5)
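
(For anyone wanting to check their own OSDs: the numbers above come from the
'bluefs' section of the admin socket perf dump.  A small sketch to pull them
for one OSD; it assumes you run it on the OSD host with access to the admin
socket, and the OSD id is a placeholder.)

import json
import subprocess

osd = "osd.0"   # placeholder
bluefs = json.loads(subprocess.check_output(
    ["ceph", "daemon", osd, "perf", "dump"]))["bluefs"]

gib = 1024.0 ** 3
print("db used/total: %.1f / %.1f GiB" %
      (bluefs["db_used_bytes"] / gib, bluefs["db_total_bytes"] / gib))
# a non-zero slow_used_bytes means RocksDB has spilled onto the slow device
print("slow used: %.1f GiB" % (bluefs["slow_used_bytes"] / gib))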

thanks,
Ben


On Tue, May 28, 2019 at 12:55 PM Igor Fedotov  wrote:
>
> Hi Jake,
>
> just my 2 cents - I'd suggest using LVM for DB/WAL so you can
> seamlessly extend their sizes if needed.
>
> Once you've configured things this way, and if you're able to add more NVMe
> later, you're almost free to select any size at the initial stage.
>
>
> Thanks,
>
> Igor
>
>
> On 5/28/2019 4:13 PM, Jake Grimmett wrote:
> > Dear All,
> >
> > Quick question regarding SSD sizing for a DB/WAL...
> >
> > I understand 4% is generally recommended for a DB/WAL.
> >
> > Does this 4% continue for "large" 12TB drives, or can we  economise and
> > use a smaller DB/WAL?
> >
> > Ideally I'd fit a smaller drive providing a 266GB DB/WAL per 12TB OSD,
> > rather than 480GB. i.e. 2.2% rather than 4%.
> >
> > Will "bad things" happen as the OSD fills with a smaller DB/WAL?
> >
> > By the way the cluster will mainly be providing CephFS, fairly large
> > files, and will use erasure encoding.
> >
> > many thanks for any advice,
> >
> > Jake
> >
> >


Re: [ceph-users] Restricting access to RadosGW/S3 buckets

2019-05-02 Thread Benjeman Meekhof
Hi Vlad,

If a user creates a bucket then only that user can see the bucket
unless an S3 ACL is applied giving additional permissions, but I'd
guess you are asking a more complex question than that.
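
For that simple case (one user granting another read access to a bucket), an
S3 ACL set with any client works.  A rough boto3 sketch; the endpoint, keys,
bucket and user ids below are placeholders, and the 'id=' grantee form refers
to the radosgw user id:

import boto3

s3 = boto3.client("s3",
                  endpoint_url="http://rgw.example.org:7480",   # placeholder
                  aws_access_key_id="OWNERACCESSKEY",
                  aws_secret_access_key="OWNERSECRETKEY")

# Grant bucket read (listing) to 'otheruser' while keeping full control for
# the owner; reading objects may additionally need object-level grants.
s3.put_bucket_acl(Bucket="mybucket",
                  GrantRead='id="otheruser"',
                  GrantFullControl='id="owneruser"')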

If you are looking to apply some kind of policy overriding whatever
ACL a user might apply to a bucket, then it looks like the integration
with Open Policy Agent can do what you want.  I have not myself tried
this out but it looks very interesting if you have the Nautilus
release.
http://docs.ceph.com/docs/nautilus/radosgw/opa/

A third option: you could run RGW behind something like HAproxy
and configure ACLs there which allow/disallow requests based on
different criteria.  For example you can parse the bucket name out of
the URL and match against an ACL.  You may be able to use the
Authorization header to pull out the access key id and match that
against a map file and allow/disallow the request, or use some other
criteria as might be available in HAproxy.  HAproxy does have a unix
socket interface allowing for modifying mapfile entries without
restarting/editing the proxy config files.
http://cbonte.github.io/haproxy-dconv/1.8/configuration.html#7

thanks,
Ben

On Thu, May 2, 2019 at 12:53 PM Vladimir Brik
 wrote:
>
> Hello
>
> I am trying to figure out a way to restrict access to S3 buckets. Is it
> possible to create a RadosGW user that can only access specific bucket(s)?
>
>
> Thanks,
>
> Vlad


[ceph-users] Limits of mds bal fragment size max

2019-04-12 Thread Benjeman Meekhof
We have a user syncing data with some kind of rsync + hardlink based
system creating/removing large numbers of hard links.  We've
encountered many of the issues with stray inode re-integration as
described in the thread and tracker below.

As noted, one fix is to increase mds_bal_fragment_size_max so the stray
directories can accommodate the high stray count.  We blew right
through 200,000, then 300,000, and at this point I'm wondering if
there is a safe upper limit on this parameter.  If I go to something
like 1 million to work with this use case, will I have other problems?

Background:
https://www.spinics.net/lists/ceph-users/msg51985.html
http://tracker.ceph.com/issues/38849
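
(For reference, the stray count we're watching can be read off the MDS admin
socket.  A small sketch, assuming the counter is still called 'num_strays'
under 'mds_cache' as it is on our MDS, and the daemon name is a placeholder.)

import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "mds.yourmds", "perf", "dump"])
print("current strays: %d" % json.loads(out)["mds_cache"]["num_strays"])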

thanks,
Ben


[ceph-users] Additional meta data attributes for rgw user?

2019-01-21 Thread Benjeman Meekhof
Hi all,

I'm looking to keep some extra metadata associated with radosgw users
created by radosgw-admin.  I saw in the output of 'radosgw-admin
metadata get user:someuser' there is an 'attrs' structure that looked
promising.  However, it seems to be strict about what it accepts, so I
wonder if that's really OK.  For example, I can collect the metadata,
alter it, and with 'radosgw-admin metadata put' insert it back
successfully (and get it back when I re-read the metadata):

"attrs": [
{
"key": "user.rgw.idtag",
"val": ""
},
{
"key": "user.rgw.mynewvalue",
"val": "DLJDFLKD"
},
]

But if the "val" is even one character longer, there is an
error.  If the "val" is a number, also an error:
{
"key": "user.rgw.mynewvalue",
"val": "1234"  (or any string longer than 8 char)
 },
cat someuser.meta | radosgw-admin metadata put user:someuser
ERROR: can't put key: (22) Invalid argument

So, is there some other way or place to put my additional attributes
in the user object?  The version I am experimenting with is Mimic.

thanks,
Ben


[ceph-users] ceph-mon high single-core usage, reencode_incremental_map

2018-12-19 Thread Benjeman Meekhof
Version:  Mimic 13.2.2

Lately, during any kind of cluster change (particularly adding OSDs in
this most recent instance), I'm seeing our mons (all of them) showing
100% usage on a single core while not at all using any of the other
available cores on the system.  Cluster commands are slow to respond
and clients start seeing session timeouts as well.

When I turn debug_mon up to 20 the logs are dominated by the message
below at a very high rate.  Is this indicating an issue?

2018-12-19 10:50:21.713 7f356c2cd700 20 mon.wsu-mon01@1(peon).osd
ef8d3b build_incrementalinc f82a1 d4 bytes
2018-12-19 10:50:21.713 7f356c2cd700 20 mon.wsu-mon01@1(peon).osd
ef8d3b reencode_incremental_map f82a0 with features 700088000202a00
2018-12-19 10:50:21.713 7f356c2cd700 20 mon.wsu-mon01@1(peon).osd
ef8d3b build_incrementalinc f82a0 f1 bytes
2018-12-19 10:50:21.713 7f356c2cd700 20 mon.wsu-mon01@1(peon).osd
ef8d3b reencode_incremental_map f829f with features 700088000202a00

Some searching turned up a bug that should already be resolved but did seem related:
https://tracker.ceph.com/issues/23713

My reading of the tracker leads me to believe re-encoding the map for
older clients may be involved?  As such I'll include the output of 'ceph
features' in case it's relevant.  Depending on what's in the most recent
CentOS kernel, updating/rebooting might be a workaround option if
relevant.  Moving to ceph-fuse across all our clients might be an
option as well.  Any other cluster components not shown below are
"luminous" with "features": "0x3ffddff8ffa4fffb".

"client": [
{
"features": "0x40107b84a842ada",
"release": "jewel",
"num": 16
},
{
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 75
},
{
"features": "0x27018fb86aa42ada",
"release": "jewel",
"num": 63
},
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 60
}
],


thanks,
Ben


Re: [ceph-users] Need help related to authentication

2018-12-05 Thread Benjeman Meekhof
Hi Rishabh,

You might want to check out these examples for python boto3 which include SSE-C:
https://github.com/boto/boto3/blob/develop/boto3/examples/s3.rst
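
A trimmed-down version of what those examples do, in case it helps.  The
endpoint, bucket and keys below are placeholders, and boto3 takes care of
base64-encoding the SSE-C key and adding the key-MD5 header for you:

import os
import boto3

s3 = boto3.client("s3",
                  endpoint_url="https://rgw.example.org",    # must be HTTPS for SSE-C
                  aws_access_key_id="ACCESSKEY",
                  aws_secret_access_key="SECRETKEY")

sse_key = os.urandom(32)    # 256-bit customer-provided key; you must keep it safe

s3.put_object(Bucket="mybucket", Key="secret.txt", Body=b"hello",
              SSECustomerAlgorithm="AES256", SSECustomerKey=sse_key)

# The same key has to be supplied again to read the object back.
obj = s3.get_object(Bucket="mybucket", Key="secret.txt",
                    SSECustomerAlgorithm="AES256", SSECustomerKey=sse_key)
print(obj["Body"].read())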

As already noted use 'radosgw-admin' to retrieve access key and secret
key to plug into your client.  If you are not an administrator on your
Ceph cluster you may have to ask someone who is to create/retrieve the
necessary user info.  Example:

 radosgw-admin user info --uid testuser
.
 "keys": [
{
"user": "testuser",
"access_key":"ABCDE0",
"secret_key": "1FGHIJK"
}

There is also an Admin API to retrieve this information but you
wouldn't use it unless your application is something more general
purpose requiring access to all user credentials (or other
information).  There are libraries for this API as well noted at the
bottom of the docs page.  If you just need an access/secret to plug
into your client this is not what you are looking for - to even use it
you still need to create a user with the radosgw-admin command.  If
you need to programmatically manage / retrieve user info with some
kind of privileged application it might be of use.
http://docs.ceph.com/docs/mimic/radosgw/adminops/

thanks,
Ben


On Tue, Dec 4, 2018 at 11:41 PM Rishabh S  wrote:
>
> Hi Paul,
>
> Thank You.
>
> I was looking for suggestions on how my ceph client should get access and 
> secret keys.
>
> Another thing where I need help is regarding encryption
> http://docs.ceph.com/docs/mimic/radosgw/encryption/#
>
> I am a little confused about what these statements mean.
>
> The Ceph Object Gateway supports server-side encryption of uploaded objects, 
> with 3 options for the management of encryption keys. Server-side encryption 
> means that the data is sent over HTTP in its unencrypted form, and the Ceph 
> Object Gateway stores that data in the Ceph Storage Cluster in encrypted form.
>
> Note Requests for server-side encryption must be sent over a secure HTTPS 
> connection to avoid sending secrets in plaintext.
>
> CUSTOMER-PROVIDED KEYS
>
> In this mode, the client passes an encryption key along with each request to 
> read or write encrypted data. It is the client’s responsibility to manage 
> those keys and remember which key was used to encrypt each object.
>
>
> My understanding is that when a ceph client is trying to upload a file/object to
> the Ceph cluster, the client request should be HTTPS and will include the
> "customer-provided-key".
> Then Ceph will use the customer-provided key to encrypt the file/object before
> storing the data in the Ceph cluster.
>
> Please correct me and suggest the best approach to store files/objects in the Ceph
> cluster.
>
> Any code example of the initial handshake to upload a file/object with an
> encryption key will be of great help.
>
> Regards,
> Rishabh
>
> On 05-Dec-2018, at 2:48 AM, Paul Emmerich  wrote:
>
> You are probably looking for radosgw-admin which can manage users on
> the shell, e.g.:
>
> radosgw-admin user create --uid username --display-name "full name"
> radosgw-admin user list
> radosgw-admin user info --uid username
>
> The create and info commands return the secret/access key which can be
> used with any S3 client.
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> Am Di., 4. Dez. 2018 um 18:55 Uhr schrieb Rishabh S 
> :
>
>
> Dear Members,
>
> I am new to ceph and am implementing an object store using ceph.
>
> I have the following scenario:
>
> 1. I have an application which needs to store thousands of files into the ceph
> cluster
> 2. My application will be deployed in a kubernetes cluster
> 3. My application will communicate using a REST API
>
> My application will be a ceph client which will communicate with the ceph cluster
> using http/https.
> Can someone please help me with how my application should get an
> access-key/secret-key to communicate with the ceph cluster.
>
> I am mainly looking for rest/http api example for initial 
> authentication/authorization handshake.
>
> Thanks in advance.
>
> Regards,
> Rishabh
>
>
>


Re: [ceph-users] Ceph MDS and hard links

2018-08-07 Thread Benjeman Meekhof
I switched configs to use ms_type: simple and restarted all of our MDS
(there are 3 but only 1 active).  It looks like the memory usage crept
back up to the same levels as before.  I've included new mempool dump
and heap stat.  If I can provide other debug info let me know.

 ceph daemon mds.xxx config show | grep simple
"ms_type": "simple",

---

 ceph daemon mds.xxx dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 4691601,
"bytes": 4691601
},
"bluestore_alloc": {
"items": 0,
"bytes": 0
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 0,
"bytes": 0
},
"bluestore_cache_other": {
"items": 0,
"bytes": 0
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 0,
"bytes": 0
},
"buffer_anon": {
"items": 1304976,
"bytes": 6740803514
},
"buffer_meta": {
"items": 1506,
"bytes": 96384
},
"osd": {
"items": 0,
"bytes": 0
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 0,
"bytes": 0
},
"osdmap": {
"items": 8205,
"bytes": 185760
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 163871317,
"bytes": 4080249950
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 169877605,
"bytes": 10826027209
}
}
}

---
 ceph tell mds.xxx heap stats

mds.um-mds01 tcmalloc heap stats:
MALLOC:12521041136 (11941.0 MiB) Bytes in use by application
MALLOC: +49152 (0.0 MiB) Bytes in page heap freelist
MALLOC: +246633144 (  235.2 MiB) Bytes in central cache freelist
MALLOC: +  6692352 (6.4 MiB) Bytes in transfer cache freelist
MALLOC: + 27288152 (   26.0 MiB) Bytes in thread cache freelists
MALLOC: + 67895296 (   64.8 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: =  12869599232 (12273.4 MiB) Actual memory used (physical + swap)
MALLOC: +436740096 (  416.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: =  13306339328 (12689.9 MiB) Virtual address space used
MALLOC:
MALLOC: 901356  Spans in use
MALLOC:   1099  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.


On Fri, Aug 3, 2018 at 9:14 AM, Yan, Zheng  wrote:
> On Fri, Aug 3, 2018 at 8:53 PM Benjeman Meekhof  wrote:
>>
>> Thanks, that's useful to know.  I've pasted the output you asked for
>> below, thanks for taking a look.
>>
>> Here's the output of dump_mempools:
>>
>> {
>> "mempool": {
>> "by_pool": {
>>

Re: [ceph-users] Ceph MDS and hard links

2018-08-03 Thread Benjeman Meekhof
Thanks, that's useful to know.  I've pasted the output you asked for
below, thanks for taking a look.

Here's the output of dump_mempools:

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 4806709,
"bytes": 4806709
},
"bluestore_alloc": {
"items": 0,
"bytes": 0
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 0,
"bytes": 0
},
"bluestore_cache_other": {
"items": 0,
"bytes": 0
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 0,
"bytes": 0
},
"buffer_anon": {
"items": 1303621,
"bytes": 6643324694
},
"buffer_meta": {
"items": 2397,
"bytes": 153408
},
"osd": {
"items": 0,
"bytes": 0
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 0,
"bytes": 0
},
"osdmap": {
"items": 8222,
"bytes": 185840
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 160660321,
"bytes": 4080240182
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 166781270,
"bytes": 10728710833
}
}
}

and heap_stats:

MALLOC:12418630040 (11843.3 MiB) Bytes in use by application
MALLOC: +  1310720 (1.2 MiB) Bytes in page heap freelist
MALLOC: +378986760 (  361.4 MiB) Bytes in central cache freelist
MALLOC: +  4713472 (    4.5 MiB) Bytes in transfer cache freelist
MALLOC: + 20722016 (   19.8 MiB) Bytes in thread cache freelists
MALLOC: + 62652416 (   59.8 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: =  12887015424 (12290.0 MiB) Actual memory used (physical + swap)
MALLOC: +309624832 (  295.3 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: =  13196640256 (12585.3 MiB) Virtual address space used
MALLOC:
MALLOC: 921411  Spans in use
MALLOC: 20  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

On Wed, Aug 1, 2018 at 10:31 PM, Yan, Zheng  wrote:
> On Thu, Aug 2, 2018 at 3:36 AM Benjeman Meekhof  wrote:
>>
>> I've been encountering lately a much higher than expected memory usage
>> on our MDS which doesn't align with the cache_memory limit even
>> accounting for potential over-runs.  Our memory limit is 4GB but the
>> MDS process is steadily at around 11GB used.
>>
>> Coincidentally we also have a new user heavily relying on hard links.
>> This led me to the following (old) document which says "Hard links are
>> also supported, although in their current implementation each link
>> requires a small bit of MDS memory and so there is an implied limit
>> based on your available memory. "
>> (https://ceph.com/geen-categorie/cephfs-mds-status-discussion/)
>>
>> Is that statement still correct, could it potentially explain why our
>> memory usage appears so high?  As far as I know this is a recent
>> development and it does very closely correspond to a new user doing a
>> lot of hardlinking.  Ceph Mimic 13.2.1, though we first saw the issue
>> while still running 13.2.0.
>>
>
> That statement is no longer correct.   what are output of  "ceph
> daemon mds.x dump_mempools" and "ceph tell mds.x heap stats"?
>
>
>> thanks,
>> Ben


[ceph-users] Ceph MDS and hard links

2018-08-01 Thread Benjeman Meekhof
I've lately been encountering much higher than expected memory usage
on our MDS, which doesn't align with the cache_memory limit even
accounting for potential over-runs.  Our memory limit is 4GB but the
MDS process is steadily at around 11GB used.

Coincidentally we also have a new user heavily relying on hard links.
This led me to the following (old) document which says "Hard links are
also supported, although in their current implementation each link
requires a small bit of MDS memory and so there is an implied limit
based on your available memory. "
(https://ceph.com/geen-categorie/cephfs-mds-status-discussion/)

Is that statement still correct, could it potentially explain why our
memory usage appears so high?  As far as I know this is a recent
development and it does very closely correspond to a new user doing a
lot of hardlinking.  Ceph Mimic 13.2.1, though we first saw the issue
while still running 13.2.0.

thanks,
Ben


Re: [ceph-users] active directory integration with cephfs

2018-07-26 Thread Benjeman Meekhof
I can comment on that docker image: we built it to bake in a certain
amount of config regarding nfs-ganesha serving CephFS and using LDAP to
do idmap lookups (example ldap entries are in the readme).  At least as
we use it, the server-side uid/gid information is pulled from sssd using
a config file on the machine hosting the docker image, but one could
probably also map passwd files into the image.  Most of it is
configurable with environment vars if some other usage were desired, or
it could be a good start for building something specific to your needs.
The README is a bit out of date as the image now uses mimic and ganesha
2.6.  We use it successfully to serve CephFS to a few krb5-authenticated
clients / users and have them mapped to uid/gid/gidlist out of our LDAP
directory instead of their local info.  I'm happy to answer questions on
the list or via direct email if that's more appropriate.

thanks,
Ben

On Thu, Jul 26, 2018 at 4:32 AM, John Hearns  wrote:
> NFS Ganesha certainly works with Cephfs. I would investigate that also.
> http://docs.ceph.com/docs/master/cephfs/nfs/
>
> Regarding Active Directory, I have done a lot of work recently with sssd.
> Not entirely relevant to this list, please send me a mail offline.
>
> Not sure if this is any direct use
> https://github.com/MI-OSiRIS/docker-nfs-ganesha-ceph
>
>
>
>
>
>
>
>
>
> On Thu, 26 Jul 2018 at 08:34, Serkan Çoban  wrote:
>>
>> You can do it by exporting cephfs by samba. I don't think any other
>> way exists for cephfs.
>>
>> On Thu, Jul 26, 2018 at 9:12 AM, Manuel Sopena Ballesteros
>>  wrote:
>> > Dear Ceph community,
>> >
>> >
>> >
>> > I am quite new to Ceph but trying to learn as quickly as I can. We are
>> > deploying our first Ceph production cluster in the next few weeks; we chose
>> > luminous and our goal is to have cephfs. One of the questions I have been
>> > asked by other members of our team is whether there is a possibility to
>> > integrate ceph authentication/authorization with Active Directory. I have
>> > seen in the documentation that the object gateway can do this, but I am
>> > not sure about cephfs.
>> >
>> >
>> >
>> > Anyone has any idea if I can integrate cephfs with AD?
>> >
>> >
>> >
>> > Thank you very much
>> >
>> >
>> >
>> > Manuel Sopena Ballesteros | Big data Engineer
>> > Garvan Institute of Medical Research
>> > The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
>> > T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E:
>> > manuel...@garvan.org.au
>> >
>> >
>> >


Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread Benjeman Meekhof
I do have one follow-up related question: while doing this I took
offline all the standby MDS, and max_mds on our cluster is at 1.  Were
I to enable multiple MDS, would they all actively split up processing
the purge queue?  We have not at this point ever allowed multiple active
MDS but plan to enable that now that it's been stable a while (as well
as now being the default in Mimic).  Though it is not enough to cause
problems, my MDS right now is at an increased level of CPU usage
processing the queue backlog.  I'm not really inclined to throw another
variable into the mix right now, but the answer to the question might be
interesting for future reference.

thanks,
Ben

On Thu, Jun 21, 2018 at 11:32 AM, Benjeman Meekhof  wrote:
> Thanks very much John!  Skipping over the corrupt entry by setting a
> new expire_pos seems to have worked.  The journal expire_pos is now
> advancing and pools are being purged.  It has a little while to go to
> catch up to current write_pos but the journal inspect command gives an
> 'OK' for overall integrity.
>
> As recommended I did take an export of the journal first and I'll take
> a stab at using a hex editor on it near future.  Worst case we go
> through the tag/scan if necessary.
>
> thanks,
> Ben
>
>
> On Thu, Jun 21, 2018 at 9:04 AM, John Spray  wrote:
>> On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof  wrote:
>>>
>>> Thanks for the response.  I was also hoping to be able to debug better
>>> once we got onto Mimic.  We just finished that upgrade yesterday and
>>> cephfs-journal-tool does find a corruption in the purge queue though
>>> our MDS continues to startup and the filesystem appears to be
>>> functional as usual.
>>>
>>> How can I modify the purge queue to remove damaged sections?
>>
>> Before any modifications, use the "cephfs-journal-tool
>> --journal=purge_queue export " command to take a backup.
>>
>> The 'splice' mode of cephfs-journal-tool exists for this purpose with
>> the ordinary log, but unfortunately we don't have an equivalent for
>> the purge queue at the moment (http://tracker.ceph.com/issues/24604)
>>
>> Since the header commands are implemented, you could do a "header set
>> expire_pos 6822485" to skip over the corrupt entry.  You would run
>> that while the MDS was not running, and then start the MDS again
>> afterwards.  Hopefully all the subsequent entries are valid, and you
>> will only be orphaning the objects for one file.
>>
>>> Is there
>>> some way to scan known FS objects and remove any that might now be
>>> orphaned once the damage is removed/repaired?
>>
>> Yes, we do that using the online "ceph daemon mds. tag path"
>> operation to tag all the non-orphan files' data objects, followed by a
>> cephfs-data-scan operation to scan for anything untagged.  However, in
>> this instance I'd be more inclined to try going over your journal
>> export with a hex editor to see in what way it was corrupted, and
>> maybe the inode of the entry we skipped will still be visible there,
>> saving you a big O(N) scan over data objects.
>>
>> John
>>
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x6819f8-681a55
>>>
>>> # cephfs-journal-tool --journal=purge_queue header get
>>>
>>> {
>>> "magic": "ceph fs volume v011",
>>> "write_pos": 203357732,
>>> "expire_pos": 6822392,
>>> "trimmed_pos": 4194304,
>>> "stream_format": 1,
>>> "layout": {
>>> "stripe_unit": 4194304,
>>> "stripe_count": 1,
>>> "object_size": 4194304,
>>> "pool_id": 64,
>>> "pool_ns": ""
>>> }
>>> }
>>>
>>> thanks,
>>> Ben
>>>
>>> On Fri, Jun 15, 2018 at 11:54 AM, John Spray  wrote:
>>> > On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof  
>>> > wrote:
>>> >> Have seen some posts and issue trackers related to this topic in the
>>> >> past but haven't been able to put it together to resolve the issue I'm
>>> >> having.  All on Luminous 12.2.5 (upgraded over time from past
>>> >> releases).  We are going to upgrade to Mimic near future if that would
>>> >> somehow resolve the issue.

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread Benjeman Meekhof
Thanks very much John!  Skipping over the corrupt entry by setting a
new expire_pos seems to have worked.  The journal expire_pos is now
advancing and pools are being purged.  It has a little while to go to
catch up to current write_pos but the journal inspect command gives an
'OK' for overall integrity.

As recommended I did take an export of the journal first and I'll take
a stab at using a hex editor on it near future.  Worst case we go
through the tag/scan if necessary.

thanks,
Ben


On Thu, Jun 21, 2018 at 9:04 AM, John Spray  wrote:
> On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof  wrote:
>>
>> Thanks for the response.  I was also hoping to be able to debug better
>> once we got onto Mimic.  We just finished that upgrade yesterday and
>> cephfs-journal-tool does find a corruption in the purge queue though
>> our MDS continues to startup and the filesystem appears to be
>> functional as usual.
>>
>> How can I modify the purge queue to remove damaged sections?
>
> Before any modifications, use the "cephfs-journal-tool
> --journal=purge_queue export " command to take a backup.
>
> The 'splice' mode of cephfs-journal-tool exists for this purpose with
> the ordinary log, but unfortunately we don't have an equivalent for
> the purge queue at the moment (http://tracker.ceph.com/issues/24604)
>
> Since the header commands are implemented, you could do a "header set
> expire_pos 6822485" to skip over the corrupt entry.  You would run
> that while the MDS was not running, and then start the MDS again
> afterwards.  Hopefully all the subsequent entries are valid, and you
> will only be orphaning the objects for one file.
>
>> Is there
>> some way to scan known FS objects and remove any that might now be
>> orphaned once the damage is removed/repaired?
>
> Yes, we do that using the online "ceph daemon mds. tag path"
> operation to tag all the non-orphan files' data objects, followed by a
> cephfs-data-scan operation to scan for anything untagged.  However, in
> this instance I'd be more inclined to try going over your journal
> export with a hex editor to see in what way it was corrupted, and
> maybe the inode of the entry we skipped will still be visible there,
> saving you a big O(N) scan over data objects.
>
> John
>
>>
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>
>> Overall journal integrity: DAMAGED
>> Corrupt regions:
>>   0x6819f8-681a55
>>
>> # cephfs-journal-tool --journal=purge_queue header get
>>
>> {
>> "magic": "ceph fs volume v011",
>> "write_pos": 203357732,
>> "expire_pos": 6822392,
>> "trimmed_pos": 4194304,
>> "stream_format": 1,
>> "layout": {
>> "stripe_unit": 4194304,
>> "stripe_count": 1,
>> "object_size": 4194304,
>> "pool_id": 64,
>> "pool_ns": ""
>> }
>> }
>>
>> thanks,
>> Ben
>>
>> On Fri, Jun 15, 2018 at 11:54 AM, John Spray  wrote:
>> > On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof  
>> > wrote:
>> >> Have seen some posts and issue trackers related to this topic in the
>> >> past but haven't been able to put it together to resolve the issue I'm
>> >> having.  All on Luminous 12.2.5 (upgraded over time from past
>> >> releases).  We are going to upgrade to Mimic near future if that would
>> >> somehow resolve the issue.
>> >>
>> >> Summary:
>> >>
>> >> 1.  We have a CephFS data pool which has steadily and slowly grown in
>> >> size without corresponding writes to the directory placed on it - a
>> >> plot of usage over a few hours shows a very regular upward rate of
>> >> increase.   The pool is now 300TB vs 16TB of actual space used in
>> >> directory.
>> >>
>> >> 2.  Reading through some email posts and issue trackers led me to
>> >> disabling 'standby replay' though we are not and have not ever used
>> >> snapshots.   Disabling that feature on our 3 MDS stopped the steady
>> >> climb.  However the pool remains with 300TB of unaccounted for space
>> >> usage.  http://tracker.ceph.com/issues/19593 and
>> >> http://tracker.ceph.com/issues/21551
>> >
>> > This is pretty strange -- if you were already on 12.2.5 then the
>> > http://tracker.ceph.com/issues/19593 should have been fixed and
>

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-20 Thread Benjeman Meekhof
Thanks for the response.  I was also hoping to be able to debug better
once we got onto Mimic.  We just finished that upgrade yesterday and
cephfs-journal-tool does find a corruption in the purge queue, though
our MDS continues to start up and the filesystem appears to be
functional as usual.

How can I modify the purge queue to remove damaged sections?  Is there
some way to scan known FS objects and remove any that might now be
orphaned once the damage is removed/repaired?

# cephfs-journal-tool --journal=purge_queue journal inspect

Overall journal integrity: DAMAGED
Corrupt regions:
  0x6819f8-681a55

# cephfs-journal-tool --journal=purge_queue header get

{
"magic": "ceph fs volume v011",
"write_pos": 203357732,
"expire_pos": 6822392,
"trimmed_pos": 4194304,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"pool_id": 64,
"pool_ns": ""
}
}

thanks,
Ben

On Fri, Jun 15, 2018 at 11:54 AM, John Spray  wrote:
> On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof  wrote:
>> Have seen some posts and issue trackers related to this topic in the
>> past but haven't been able to put it together to resolve the issue I'm
>> having.  All on Luminous 12.2.5 (upgraded over time from past
>> releases).  We are going to upgrade to Mimic near future if that would
>> somehow resolve the issue.
>>
>> Summary:
>>
>> 1.  We have a CephFS data pool which has steadily and slowly grown in
>> size without corresponding writes to the directory placed on it - a
>> plot of usage over a few hours shows a very regular upward rate of
>> increase.   The pool is now 300TB vs 16TB of actual space used in
>> directory.
>>
>> 2.  Reading through some email posts and issue trackers led me to
>> disabling 'standby replay' though we are not and have not ever used
>> snapshots.   Disabling that feature on our 3 MDS stopped the steady
>> climb.  However the pool remains with 300TB of unaccounted for space
>> usage.  http://tracker.ceph.com/issues/19593 and
>> http://tracker.ceph.com/issues/21551
>
> This is pretty strange -- if you were already on 12.2.5 then the
> http://tracker.ceph.com/issues/19593 should have been fixed and
> switching standby replays on/off shouldn't make a difference (unless
> there's some similar bug that crept back into luminous).
>
>> 3.   I've never had any issue starting the MDS or with filesystem
>> functionality but looking through the mds logs I see a single
>> 'journaler.pg(rw) _decode error from assimilate_prefetch' at every
>> startup.  A log snippet with context is below with debug_mds and
>> debug_journaler at 20.
>
> This message suggests that the purge queue has been corrupted, but the
> MDS is ignoring this -- something is wrong with the error handling.
> The MDS should be marked damaged when something like this happens, but
> in this case PurgeQueue is apparently dropping the error on the floor
> after it gets logged by Journaler.  I've opened a ticket+PR for the
> error handling here: http://tracker.ceph.com/issues/24533 (however,
> the loading path in PurgeQueue::_recover *does* have error handling so
> I'm not clear why that isn't happening in your case).
>
> I believe cephfs-journal-tool in mimic was enhanced to be able to
> optionally operate on the purge queue as well as the metadata journal
> (they use the same underlying format), so upgrading to mimic would
> give you better tooling for debugging this.
>
> John
>
>
>> As noted, there is at least one past email thread on the topic but I'm
>> not quite having the same issue as this person and I couldn't glean
>> any information as to what I should do to repair this error and get
>> stale objects purged from this pool (if that is in fact the issue):
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html
>>
>> Any thoughts on troubleshooting steps I could try next?
>>
>> Here is the log snippet:
>>
>> 2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw)
>> write_buf_throttle get, delta 101
>> 2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw)
>> append_entry len 81 to 88121773~101
>> 2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> 2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw)
>> write_buf_throttle get, delta 101
>> 2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw)
>> append_entry l

[ceph-users] MDS: journaler.pq decode error

2018-06-15 Thread Benjeman Meekhof
I have seen some posts and issue trackers related to this topic in the
past but haven't been able to put it all together to resolve the issue I'm
having.  This is all on Luminous 12.2.5 (upgraded over time from past
releases).  We are going to upgrade to Mimic in the near future if that
would somehow resolve the issue.

Summary:

1.  We have a CephFS data pool which has steadily and slowly grown in
size without corresponding writes to the directory placed on it - a
plot of usage over a few hours shows a very regular upward rate of
increase.   The pool is now 300TB vs 16TB of actual space used in
directory.

2.  Reading through some email posts and issue trackers led me to
disabling 'standby replay' though we are not and have not ever used
snapshots.   Disabling that feature on our 3 MDS stopped the steady
climb.  However the pool remains with 300TB of unaccounted for space
usage.  http://tracker.ceph.com/issues/19593 and
http://tracker.ceph.com/issues/21551

3.   I've never had any issue starting the MDS or with filesystem
functionality but looking through the mds logs I see a single
'journaler.pq(rw) _decode error from assimilate_prefetch' at every
startup.  A log snippet with context is below with debug_mds and
debug_journaler at 20.

As noted, there is at least one past email thread on the topic but I'm
not quite having the same issue as this person and I couldn't glean
any information as to what I should do to repair this error and get
stale objects purged from this pool (if that is in fact the issue):
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html

Any thoughts on troubleshooting steps I could try next?

Here is the log snippet:

2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88121773~101
2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88121874~101
2018-06-15 09:14:50.746867 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
2018-06-15 09:14:50.746901 7fb46fd16700 10 mds.0.journaler.pq(rw)
_finish_read got 6822392~1566216
2018-06-15 09:14:50.746909 7fb46fd16700 10 mds.0.journaler.pq(rw)
_assimilate_prefetch 6822392~1566216
2018-06-15 09:14:50.746911 7fb46fd16700 10 mds.0.journaler.pq(rw)
_assimilate_prefetch gap of 4194304 from received_pos 8388608 to first
prefetched buffer 12582912
2018-06-15 09:14:50.746913 7fb46fd16700 10 mds.0.journaler.pq(rw)
_assimilate_prefetch read_buf now 6822392~1566216, read pointers
6822392/8388608/50331648

=== error here ===> 2018-06-15 09:14:50.746965 7fb46fd16700 -1
mds.0.journaler.pq(rw) _decode error from assimilate_prefetch

2018-06-15 09:14:50.746994 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.746998 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88121975~101
2018-06-15 09:14:50.747007 7fb47251b700 10 mds.0.journaler.pq(rw)
wait_for_readable at 6822392 onreadable 0x557ee0f58300
2018-06-15 09:14:50.747042 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.747043 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88122076~101
2018-06-15 09:14:50.747063 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.747064 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88122177~101
2018-06-15 09:14:50.747113 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101
2018-06-15 09:14:50.747114 7fb47251b700 10 mds.0.journaler.pq(rw)
append_entry len 81 to 88122278~101
2018-06-15 09:14:50.747136 7fb47251b700 20 mds.0.journaler.pq(rw)
write_buf_throttle get, delta 101


[ceph-users] nfs-ganesha 2.6 deb packages

2018-05-14 Thread Benjeman Meekhof
I see that luminous RPM packages are up at download.ceph.com for
ganesha-ceph 2.6 but there is nothing in the Deb area.  Any estimates
on when we might see those packages?

http://download.ceph.com/nfs-ganesha/deb-V2.6-stable/luminous/

thanks,
Ben


Re: [ceph-users] Radosgw ldap info

2018-03-26 Thread Benjeman Meekhof
Hi Marc,

I can't speak to your other questions, but as far as the user auth caps
go, those are still kept in the radosgw metadata outside of ldap.  As far
as I know, all that LDAP gives you is a way to authenticate users with
a user/password combination.

So, for example, if you create a user 'ldapuser' in your ldap
directory, generate a token for that user, and then use the LDAP
token to authenticate to RGW as that user, you would then find this
info in the radosgw metadata, where it can be altered to set quotas,
caps, etc.  You could perhaps even add an access key so that
conventional auth also works for that user identity (I have never
tried that; we only do one or the other for any given user).

$ radosgw-admin user info --uid ldapuser

{
"user_id": "ldapuser",
"caps": [],
... etc ...
"type": "ldap"
}

thanks,
Ben

On Sat, Mar 24, 2018 at 10:30 AM, Marc Roos  wrote:
>
>
> To clarify if I understand correctly:
>
> It is NOT POSSIBLE to use an s3 client like eg. cyberduck/mountainduck
> and supply a user with an 'Access key' and a 'Password' regardless if
> the user is defined in ldap or local?
>
> I honestly cannot see how this ldap integration should even work,
> without a proper ldap scheme for auth caps being available. Nor do I
> understand where you set currently these auth caps, nor do I understand
> what use the current ldap functionality has.
>
> Would be nice to update this on these pages
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/ceph_object_gateway_with_ldapad_guide/index
> http://docs.ceph.com/docs/master/radosgw/ldap-auth/
>
>
> Maybe it would be good to give some 'beginners' access to the docs pages,
> because as they are learning ceph (and maybe missing info in the docs)
> they can then add it. I have the impression that many things
> asked here could be added to the docs.
>
>
>
>
>
> -Original Message-
> From: Konstantin Shalygin [mailto:k0...@k0ste.ru]
> Sent: zondag 18 maart 2018 5:04
> To: ceph-users@lists.ceph.com
> Cc: Marc Roos; Yehuda Sadeh-Weinraub
> Subject: Re: [ceph-users] Radosgw ldap user authentication issues
>
> Hi Marc
>
>
>> looks like no search is being done there.
>
>> rgw::auth::s3::AWSAuthStrategy denied with reason=-13
>
>
> The same for me, http://tracker.ceph.com/issues/23091
>
>
> But Yehuda closed this.
>
>
>
>
> k
>
>
>


Re: [ceph-users] Radosgw ldap user authentication issues

2018-03-19 Thread Benjeman Meekhof
Hi Marc,

You mentioned following the instructions 'except for doing this ldap
token'.  Do I read that correctly that you did not generate / use an
LDAP token with your client?  I think that is a necessary part of
triggering the LDAP authentication (Section 3.2 and 3.3 of the doc you
linked).  I can verify it works if you do that.  Pass the base64 token
(ewogICAgIlJHetc) to the 'access key' param of your client leaving
the secret blank (it is ignored).

You can use the referenced command line tool or any method you like to
generate a base64 string which encodes a json struct that looks like
this (this is the decoded ldap token string from the docs):

{
"RGW_TOKEN": {
"version": 1,
"type": "ldap",
"id": "ceph",
"key": "800#Gorilla"
}
}
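
If you'd rather not use the command-line tool the docs reference, the same
base64 blob can be produced with a few lines of python (the id/key here are
the example values from the docs above, not real credentials):

import base64
import json

token = {"RGW_TOKEN": {"version": 1,
                       "type": "ldap",
                       "id": "ceph",
                       "key": "800#Gorilla"}}
encoded = base64.b64encode(json.dumps(token).encode("utf-8"))
print(encoded.decode("ascii"))   # pass this as the client's access key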

thanks,
Ben


On Sun, Mar 18, 2018 at 12:04 AM, Konstantin Shalygin  wrote:
> Hi Marc
>
>
>> looks like no search is being done there.
>
>
>> rgw::auth::s3::AWSAuthStrategy denied with reason=-13
>
>
>
> The same for me, http://tracker.ceph.com/issues/23091
>
>
> But Yehuda closed this.
>
>
>
>
> k
>
>


Re: [ceph-users] Ganesha-rgw export with LDAP auth

2018-03-09 Thread Benjeman Meekhof
Hi Matt,

Sorry about the incomplete last message, sent by mistake (unknown hotkey
slip; the secrets in it have been invalidated).

So, to continue:
In ganesha.conf, Access_Key_Id is set to the LDAP token; that token encodes
user 'myuser' with secret 'whatever'.  The User_id and Secret_access_key
settings are blank - they cannot be left out or the config parser complains,
but I would expect they are unused in this context.

In ganesha log it seems to pick up what you'd expect out of the ldap token:
2018-03-09 11:21:27.513315 7fafbfd861c0 12 auth search filter: (uid=myuser)

I have seen that there would be an 'auth simple_bind failed' message
from the rgw instance if this bind had failed...

And in ldap logs it appears to bind:
[09/Mar/2018:11:21:27.637588220 -0500] conn=8965 op=0 BIND
dn="uid=myuser,ou=RGWUsers,dc=example,dc=org" method=128 version=3

But still have this in ganesha log:
09/03/2018 11:21:27 : epoch 5aa2b485 : host.example :
ganesha.nfsd-363383[main] create_export :FSAL :CRIT :Authorization
Failed for user

That's not truncated, it's using the User_id setting which is an empty
string.  It doesn't work even if I put 'myuser' in User_id though.

The net result is the share doesn't initialize.
09/03/2018 11:21:27 : epoch 5aa2b485 : host.example :
ganesha.nfsd-363383[main] mdcache_fsal_create_export :FSAL :MAJ
:Failed to call create_export on underlying FSAL RGW
09/03/2018 11:21:27 : epoch 5aa2b485 : host.example :
ganesha.nfsd-363383[main] fsal_put :FSAL :INFO :FSAL RGW now unused
09/03/2018 11:21:27 : epoch 5aa2b485 : host.example :
ganesha.nfsd-363383[main] fsal_cfg_commit :CONFIG :CRIT :Could not
create export for (/) to (/)

This same configuration has no issues if I use radosgw-admin to create
a user that does not use LDAP for authentication and configure with
those credentials.  Likewise the same ldap token I am using for
Access_Key_Id is working fine with via a rgw http instance.

Let me know if there's any other info that would be useful, and thanks
very much for the help.

regards,
Ben


On Fri, Mar 9, 2018 at 12:16 PM, Matt Benjamin  wrote:
> Hi Benjeman,
>
> It is -intended- to work, identically to the standalone radosgw
> server.  I can try to verify whether there could be a bug affecting
> this path.
>
> Matt
>
> On Fri, Mar 9, 2018 at 12:01 PM, Benjeman Meekhof  wrote:
>> I'm having issues exporting a radosgw bucket if the configured user is
>> authenticated using the rgw ldap connectors.  I've verified that this
>> same ldap token works ok for other clients, and as I'll note below it
>> seems like the rgw instance is contacting the LDAP server and
>> successfully authenticating the user.  Details:
>>
>> Ganesha export:
>>  FSAL {
>> Name = RGW;
>> User_Id = "";
>>
>> Access_Key_Id =
>> "eyJSR1dfVE9LRU4iOnsidmVyc2lvbiI6MSwidHlwZSI6ImxkYXAiLCJpZCI6ImJtZWVraG9mX29zaXJpc2FkbWluIiwia2V$
>>
>> # Secret_Access_Key =
>> "eyJSR1dfVE9LRU4iOnsidmVyc2lvbiI6MSwidHlwZSI6ImxkYXAiLCJpZCI6ImJtZWVraG9mX29zaXJpc2FkbWluI$
>> # Secret_Access_Key = "weW\/XGiHfcVhtH3chUTyoF+uz9Ldz3Hz";
>>
>> }
>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309


[ceph-users] Ganesha-rgw export with LDAP auth

2018-03-09 Thread Benjeman Meekhof
I'm having issues exporting a radosgw bucket if the configured user is
authenticated using the rgw ldap connectors.  I've verified that this
same ldap token works ok for other clients, and as I'll note below it
seems like the rgw instance is contacting the LDAP server and
successfully authenticating the user.  Details:

Ganesha export:
 FSAL {
Name = RGW;
User_Id = "";

Access_Key_Id =
"eyJSR1dfVE9LRU4iOnsidmVyc2lvbiI6MSwidHlwZSI6ImxkYXAiLCJpZCI6ImJtZWVraG9mX29zaXJpc2FkbWluIiwia2V$

# Secret_Access_Key =
"eyJSR1dfVE9LRU4iOnsidmVyc2lvbiI6MSwidHlwZSI6ImxkYXAiLCJpZCI6ImJtZWVraG9mX29zaXJpc2FkbWluI$
# Secret_Access_Key = "weW\/XGiHfcVhtH3chUTyoF+uz9Ldz3Hz";

}


Re: [ceph-users] puppet for the deployment of ceph

2018-02-19 Thread Benjeman Meekhof
We use this one, now heavily modified in our own fork.   I'd sooner
point you at the original unless it is missing something you need.
Ours has diverged a bit and makes no attempt to support anything
outside our specific environment (RHEL7).

https://github.com/openstack/puppet-ceph
https://github.com/MI-OSiRIS/puppet-ceph

In the near future the master branch of our fork will use ceph-volume
instead of ceph-disk to setup new OSD, those changes are on
'bluestore' branch now.

There are a few on the forge which I have no experience with:
https://forge.puppet.com/tags/ceph

thanks,
Ben

On Fri, Feb 16, 2018 at 8:11 AM, Александр Пивушков  wrote:
> Colleagues, tell me please, who uses puppet for the deployment of ceph in
> production?
>  And also, where can I get the puppet modules for ceph?
>
>
> Александр Пивушков
>


Re: [ceph-users] "Cannot get stat of OSD" in ceph.mgr.log upon enabling influx plugin

2018-02-19 Thread Benjeman Meekhof
The 'cannot stat' messages are normal at startup; we see them also in
our working setup with the mgr influx module.  Maybe they could be fixed
by delaying the module startup, or having it check for some other
'all good' status, but I haven't looked into it.  You should only be
seeing them when the mgr initially loads.
As far as not getting data, if the self-test works and outputs metrics
then the module is reading metrics ok from the mgr.  A few things you
could try:

- Check that the user you set up has rights to the destination
database, or admin rights to create the database if you did not create and
set it up beforehand (a quick connectivity check is sketched after this list)
- Increase mgr debug and see if anything is showing up:  ceph tell
mgr.* injectargs '--debug_mgr 20'(this will be a lot of logging,
be sure to reset to 1/5 default)
- Check that your influx server is getting the traffic:   ' tcpdump -i
eth1 port 8086 and src host.example '
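
If you want to rule out connectivity or credential problems outside of
ceph-mgr, a quick check with the same influxdb python module the plugin uses;
the host, user, password and database below are placeholders:

from influxdb import InfluxDBClient

client = InfluxDBClient(host="influx.example.org", port=8086,
                        username="ceph", password="secret",
                        database="ceph")
print(client.ping())                 # returns the InfluxDB version if reachable
print(client.get_list_database())    # shows the databases this user can see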

thanks,
Ben

On Mon, Feb 19, 2018 at 9:36 AM,   wrote:
> Forgot to mentioned that influx self-test produces a reasonable output too
> (long json list with some metrics and timestamps) as well as there are the
> following lines in mgr log:
>
> 2018-02-19 17:35:04.208858 7f33a50ec700  1 mgr.server reply handle_command
> (0) Success
> 2018-02-19 17:35:04.245285 7f33a50ec700  0 log_channel(audit) log [DBG] :
> from='client.344950 :0/3773014505' entity='client.admin'
> cmd=[{"prefix": "influx self-test"}]: dispatch
> 2018-02-19 17:35:04.245314 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer status'
> 2018-02-19 17:35:04.245319 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer mode'
> 2018-02-19 17:35:04.245323 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer on'
> 2018-02-19 17:35:04.245327 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer off'
> 2018-02-19 17:35:04.245331 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer eval'
> 2018-02-19 17:35:04.245335 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer eval-verbose'
> 2018-02-19 17:35:04.245339 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer optimize'
> 2018-02-19 17:35:04.245343 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer show'
> 2018-02-19 17:35:04.245347 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer rm'
> 2018-02-19 17:35:04.245351 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer reset'
> 2018-02-19 17:35:04.245354 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer dump'
> 2018-02-19 17:35:04.245358 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'balancer execute'
> 2018-02-19 17:35:04.245363 7f33a50ec700  1 mgr.server handle_command
> pyc_prefix: 'influx self-test'
> 2018-02-19 17:35:04.402782 7f33a58ed700  1 mgr.server reply handle_command
> (0) Success Self-test OK
>
> kna...@gmail.com wrote on 19/02/18 17:27:
>
>> Dear Ceph users,
>>
>> I am trying to enable influx plugin for ceph following
>> http://docs.ceph.com/docs/master/mgr/influx/ but no data comes to influxdb
>> DB. As soon as 'ceph mgr module enable influx' command is executed on one of
>> ceph mgr node (running on CentOS 7.4.1708) there are the following messages
>> in /var/log/ceph/ceph-mgr..log:
>>
>> 2018-02-19 17:11:05.947122 7f33c9b43600  0 ceph version 12.2.2
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
>> (unknown), pid 96425
>> 2018-02-19 17:11:05.947737 7f33c9b43600  0 pidfile_write: ignore empty
>> --pid-file
>> 2018-02-19 17:11:05.986676 7f33c9b43600  1 mgr send_beacon standby
>> 2018-02-19 17:11:06.003029 7f33c0e2a700  1 mgr init Loading python module
>> 'balancer'
>> 2018-02-19 17:11:06.031293 7f33c0e2a700  1 mgr init Loading python module
>> 'dashboard'
>> 2018-02-19 17:11:06.119328 7f33c0e2a700  1 mgr init Loading python module
>> 'influx'
>> 2018-02-19 17:11:06.220394 7f33c0e2a700  1 mgr init Loading python module
>> 'restful'
>> 2018-02-19 17:11:06.398380 7f33c0e2a700  1 mgr init Loading python module
>> 'status'
>> 2018-02-19 17:11:06.919109 7f33c0e2a700  1 mgr handle_mgr_map Activating!
>> 2018-02-19 17:11:06.919454 7f33c0e2a700  1 mgr handle_mgr_map I am now
>> activating
>> 2018-02-19 17:11:06.952174 7f33a58ed700  1 mgr load Constructed class from
>> module: balancer
>> 2018-02-19 17:11:06.953259 7f33a58ed700  1 mgr load Constructed class from
>> module: dashboard
>> 2018-02-19 17:11:06.953959 7f33a58ed700  1 mgr load Constructed class from
>> module: influx
>> 2018-02-19 17:11:06.954193 7f33a58ed700  1 mgr load Constructed class from
>> module: restful
>> 2018-02-19 17:11:06.955549 7f33a58ed700  1 mgr load Constructed class from
>> module: status
>> 2018-02-19 17:11:06.955613 7f33a58ed700  1 mgr send_beacon active
>> 2018-02-19 17:11:06.960224 7f33a58ed700  1 mgr[restful] Unknown request ''
>> 2018-02-19 17:11:06.961912 7f33a28e7700  1 mgr[restful] server not
>> running: no certificate configured
>> 2018-02-19 17:11:06.969027 7f33a30e8700  0 Cannot get stat of OSD 

Re: [ceph-users] mgr[influx] Cannot transmit statistics: influxdb python module not found.

2018-02-12 Thread Benjeman Meekhof
In our case I think we grabbed the SRPM from Fedora and rebuilt it on
Scientific Linux (another RHEL derivative).  Presumably the binary
didn't work or I would have installed it directly.  I'm not quite sure
why it hasn't migrated to EPEL yet.

I haven't tried the SRPM for the latest releases; we're actually quite
far behind the current python-influxdb version since I built it a
while back, but if I were you I'd grab whatever SRPM gets you the
latest python-influxdb release and give it a try.

http://rpmfind.net/linux/rpm2html/search.php?query=python-influxdb
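
If it helps, the rebuild itself is only a couple of commands; the
exact filename will depend on which SRPM version you download:

  yum-builddep python-influxdb-<version>.src.rpm
  rpmbuild --rebuild python-influxdb-<version>.src.rpm
  yum localinstall ~/rpmbuild/RPMS/noarch/python-influxdb-<version>.noarch.rpm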

thanks,
Ben

On Mon, Feb 12, 2018 at 11:03 AM,   wrote:
> Dear all,
>
> I'd like to store ceph luminous metrics into influxdb. It seems like influx
> plugin has been already backported for lumious:
> rpm -ql ceph-mgr-12.2.2-0.el7.x86_64|grep -i influx
> /usr/lib64/ceph/mgr/influx
> /usr/lib64/ceph/mgr/influx/__init__.py
> /usr/lib64/ceph/mgr/influx/__init__.pyc
> /usr/lib64/ceph/mgr/influx/__init__.pyo
> /usr/lib64/ceph/mgr/influx/module.py
> /usr/lib64/ceph/mgr/influx/module.pyc
> /usr/lib64/ceph/mgr/influx/module.pyo
>
> So following http://docs.ceph.com/docs/master/mgr/influx/ doc I enabled
> influx plugin by executing the following command on mgr node:
> ceph mgr module enable influx
>
> but in ceph log I see the following error:
> 2018-02-12 15:51:31.241854 7f95e7942600  0 ceph version 12.2.2
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
> (unknown), pid 96425
> []
> 2018-02-12 15:51:31.422414 7f95dec29700  1 mgr init Loading python module
> 'influx'
> []
> 2018-02-12 15:51:32.227206 7f95c36ec700  1 mgr load Constructed class from
> module: influx
> []
> 2018-02-12 15:51:32.228163 7f95c0ee7700  0 mgr[influx] Cannot transmit
> statistics: influxdb python module not found.  Did you install it?
>
> Indeed there is no python-influxdb module install on my mgr node (CentOS 7
> x64) but yum search can't find it with the following repos enabled:
> repo id
> repo name   status
> Ceph/x86_64
> Ceph packages for x86_64
> Ceph-noarch
> Ceph noarch packages
> base/7/x86_64 CentOS-7 - Base
> ceph-source
> Ceph source packages
> epel/x86_64
> Extra Packages for Enterprise Linux 7 - x86_64
> extras/7/x86_64 CentOS-7 - Extras
> updates/7/x86_64 CentOS-7 - Updates
>
> Python version is 2.7.5.
>
> Is 'pip install' the only way to go or there is still some option to have
> required python module via rpm? I wonder how other people deals with that
> issue?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MGR Influx plugin 12.2.2

2018-01-11 Thread Benjeman Meekhof
Hi Reed,

Someone in our group originally wrote the plugin and put in PR.  Since
our commit the plugin was 'forward-ported' to master and made
incompatible with Luminous so we've been using our own version of the
plugin while waiting for the necessary pieces to be back-ported to
Luminous to use the modified upstream version.  Now we are in the
process of trying out the back-ported version that is in 12.2.2 as
well as adding some additional code from our version that collects pg
summary information (count of active, etc) and supports sending to
multiple influx destinations.  We'll attempt to PR any changes we
make.

So to answer your question:  Yes, we use it but not exactly the
version from upstream in production yet.  However in our testing the
module included with 12.2.2 appears to work as expected and we're
planning to move over to it and do any future work based from the
version in the upstream Ceph tree.

There is one issue/bug that may still exist:  because the data point
timestamps are written inside a loop through OSD stats, the spread is
sometimes wide enough that Grafana doesn't group properly and you get
the appearance of extreme spikes in derivative calculations of rates.
We ended up modifying our code to calculate timestamps just outside
the loops that create data points and apply that single timestamp to
every point created in loops through stats.  Of course we'll feed that
back upstream when we get to it, assuming it is still an issue in the
current code.

thanks,
Ben

On Thu, Jan 11, 2018 at 2:04 AM, Reed Dier  wrote:
> Hi all,
>
> Does anyone have any idea if the influx plugin for ceph-mgr is stable in
> 12.2.2?
>
> Would love to ditch collectd and report directly from ceph if that is the
> case.
>
> Documentation says that it is added in Mimic/13.x, however it looks like
> from an earlier ML post that it would be coming to Luminous.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021302.html
>
> I also see it as a disabled module currently:
>
> $ ceph mgr module ls
> {
> "enabled_modules": [
> "dashboard",
> "restful",
> "status"
> ],
> "disabled_modules": [
> "balancer",
> "influx",
> "localpool",
> "prometheus",
> "selftest",
> "zabbix"
> ]
> }
>
>
> Curious if anyone has been using it in place of CollectD/Telegraf for
> feeding InfluxDB with statistics.
>
> Thanks,
>
> Reed
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr summarize recovery counters

2017-10-10 Thread Benjeman Meekhof
Hi John,

Thanks for the guidance!  Is pg_status something we should expect to
find in Luminous (12.2.1)?  It doesn't seem to exist.  We do have a
'pg_summary' object which contains a list of every PG and current
state (active, etc) but nothing about I/O.

Calls to self.get('pg_status') in our module log:  mgr get_python
Python module requested unknown data 'pg_status'
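
For what it's worth, as a stopgap we can pull similar numbers out of
the cluster from the CLI, e.g. (assuming jq is installed; the field
names are from memory and seem to only show up while recovery is
actually running):

  ceph status --format json | jq .pgmap

which includes things like recovering_objects_per_sec and
recovering_bytes_per_sec during recovery.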

thanks,
Ben

On Thu, Oct 5, 2017 at 8:42 AM, John Spray  wrote:
> On Wed, Oct 4, 2017 at 7:14 PM, Gregory Farnum  wrote:
>> On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof  wrote:
>>> Wondering if anyone can tell me how to summarize recovery
>>> bytes/ops/objects from counters available in the ceph-mgr python
>>> interface?  To put another way, how does the ceph -s command put
>>> together that infomation and can I access that information from a
>>> counter queryable by the ceph-mgr python module api?
>>>
>>> I want info like the 'recovery' part of the status output.  I have a
>>> ceph-mgr module that feeds influxdb but I'm not sure what counters
>>> from ceph-mgr to summarize to create this information.  OSD have
>>> available a recovery_ops counter which is not quite the same.  Maybe
>>> the various 'subop_..' counters encompass recovery ops?  It's not
>>> clear to me but I'm hoping it is obvious to someone more familiar with
>>> the internals.
>>>
>>> io:
>>> client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
>>> recovery: 1173 MB/s, 8 keys/s, 682 objects/s
>>
>>
>> You'll need to run queries against the PGMap. I'm not sure how that
>> works in the python interfaces but I'm led to believe it's possible.
>> Documentation is probably all in the PGMap.h header; you can look at
>> functions like the "recovery_rate_summary" to see what they're doing.
>
> Try get("pg_status") from a python module, that should contain the
> recovery/client IO amongst other things.
>
> You may find that the fields only appear when they're nonzero, I would
> be happy to see a change that fixed the underlying functions to always
> output the fields (e.g. in PGMapDigest::recovery_rate_summary) when
> writing to a Formatter.  Skipping the irrelevant stuff is only useful
> when doing plain text output.
>
> John
>
>> -Greg
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-mgr summarize recovery counters

2017-10-04 Thread Benjeman Meekhof
Wondering if anyone can tell me how to summarize recovery
bytes/ops/objects from counters available in the ceph-mgr python
interface?  To put it another way, how does the ceph -s command put
together that information and can I access that information from a
counter queryable by the ceph-mgr python module api?

I want info like the 'recovery' part of the status output.  I have a
ceph-mgr module that feeds influxdb but I'm not sure what counters
from ceph-mgr to summarize to create this information.  OSD have
available a recovery_ops counter which is not quite the same.  Maybe
the various 'subop_..' counters encompass recovery ops?  It's not
clear to me but I'm hoping it is obvious to someone more familiar with
the internals.

io:
client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
recovery: 1173 MB/s, 8 keys/s, 682 objects/s

thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Benjeman Meekhof
Some of this thread seems to contradict the documentation, which
confuses me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fix)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

It seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.
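
If that reading is right, then a ceph-disk invocation like this should
be all that's needed when DB and WAL would share the same fast device
(device names are examples only):

  ceph-disk prepare --bluestore --block.db /dev/nvme0n1 /dev/sdb

i.e. no separate --block.wal argument, with the partition size taken
from bluestore_block_db_size if you want something other than the
default.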

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
 On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe someone can
> share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> mailto:mrxlazuar...@gmail.com>> wrote:
>
> Hi,
>
> 1. Is it possible configure use osd_data not as small partition on
> OSD but a folder (ex. on root disk)? If yes, how to do that with
> ceph-disk and any pros/cons of doing that?
> 2. Is WAL & DB size calculated based on OSD size or expected
> throughput like on journal device of filestore? If no, what is the
> default value and pro/cons of adjusting that?
> 3. Is partition alignment matter on Bluestore, including WAL & DB
> if using separate device for them?
>
> Best regards,
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 I am also looking for recommendations on wal/db partition sizes. Some
 hints:

 ceph-disk defaults used in case it does not find
 bluestore_block_wal_size or bluestore_block_db_size in config file:

 wal =  512MB

 db = if bluestore_block_size (data size) is in config file it uses 1/100
 of it else it uses 1G.

 There is also a presentation by Sage back in March, see page 16:

 https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


 wal: 512 MB

 db: "a few" GB

 the wal size is probably not debatable, it will be like a journal for
 small block sizes which are constrained by iops hence 512 MB is more
 than enough. Probably we will see more on the db size in the future.
>>>
>>> This is what I understood so far.
>>> I wonder if it makes sense to set the db size as big as possible and
>>> divide entire db device is  by the number of OSDs it will serve.
>>>
>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>
>>>  (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>
>>> Is this smart/stupid?
>>
>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>> amp but mean larger memtables and potentially higher overhead scanning
>> through memtables).  4x256MB buffers works pretty well, but it means
>> memory overhead too.  Beyond that, I'd devote the entire rest of the
>> device to DB partitions.
>>
>
> thanks for your suggestion Mark!
>
> So, just to make sure I understood this right:
>
> You'd  use a separeate 512MB-2GB WAL partition for each OSD and the
> entire rest for DB partitions.
>
> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
> partitions with each 512MB-2GB and 10 equal sized DB partitions
> consuming the rest of the NVME.
>
>
> Thanks
>   Dietmar
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipath configuration for Ceph storage nodes

2017-07-12 Thread Benjeman Meekhof
We use a puppet module to deploy them.  We give it devices to
configure from hiera data specific to our different types of storage
nodes.  The module is a fork from
https://github.com/openstack/puppet-ceph.

Ultimately the module ends up running 'ceph-disk prepare [arguments]
/dev/mapper/mpathXX /dev/nvmeXX'   (data dev, journal dev).

thanks,
Ben



On Wed, Jul 12, 2017 at 12:13 PM,   wrote:
> Hi Ben,
>
> Thanks for this, much appreciated.
>
> Can I just check: Do you use ceph-deploy to create your OSDs? E.g.:
>
> ceph-deploy disk zap ceph-sn1.example.com:/dev/mapper/disk1
> ceph-deploy osd prepare ceph-sn1.example.com:/dev/mapper/disk1
>
> Best wishes,
> Bruno
>
>
> -Original Message-
> From: Benjeman Meekhof [mailto:bmeek...@umich.edu]
> Sent: 11 July 2017 18:46
> To: Canning, Bruno (STFC,RAL,SC)
> Cc: ceph-users
> Subject: Re: [ceph-users] Multipath configuration for Ceph storage nodes
>
> Hi Bruno,
>
> We have similar types of nodes and minimal configuration is required 
> (RHEL7-derived OS).  Install device-mapper-multipath or equivalent package, 
> configure /etc/multipath.conf and enable 'multipathd'.  If working correctly 
> the command 'multipath -ll' should output multipath devices and component 
> devices on all paths.
>
> For reference, our /etc/multipath.conf is just these few lines:
>
> defaults {
> user_friendly_names yes
> find_multipaths yes
> }
>
> thanks,
> Ben
>
> On Tue, Jul 11, 2017 at 10:48 AM,   wrote:
>> Hi All,
>>
>>
>>
>> I’d like to know if anyone has any experience of configuring multipath
>> on ceph storage nodes, please. I’d like to know how best to go about it.
>>
>>
>>
>> We have a number of Dell PowerEdge R630 servers, each of which are
>> fitted with two SAS 12G HBA cards and each of which have two
>> associated Dell MD1400 storage units connected to them via HD-Mini -
>> HD-Mini cables, see the attached graphic (ignore colours: two direct
>> connections from the server to each storage unit, two connections running 
>> between each storage unit).
>>
>>
>>
>> Best wishes,
>>
>> Bruno
>>
>>
>>
>>
>>
>> Bruno Canning
>>
>> LHC Data Store System Administrator
>>
>> Scientific Computing Department
>>
>> STFC Rutherford Appleton Laboratory
>>
>> Harwell Oxford
>>
>> Didcot
>>
>> OX11 0QX
>>
>> Tel. +44 ((0)1235) 446621
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipath configuration for Ceph storage nodes

2017-07-11 Thread Benjeman Meekhof
Hi Bruno,

We have similar types of nodes and minimal configuration is required
(RHEL7-derived OS).  Install device-mapper-multipath or equivalent
package, configure /etc/multipath.conf and enable 'multipathd'.  If
working correctly the command 'multipath -ll' should output multipath
devices and component devices on all paths.

For reference, our /etc/multipath.conf is just these few lines:

defaults {
user_friendly_names yes
find_multipaths yes
}
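
For completeness, with that config in /etc/multipath.conf the whole
setup on a RHEL7-like box is basically (package name may differ on
other distributions):

  yum install device-mapper-multipath
  systemctl enable multipathd
  systemctl start multipathd
  multipath -ll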

thanks,
Ben

On Tue, Jul 11, 2017 at 10:48 AM,   wrote:
> Hi All,
>
>
>
> I’d like to know if anyone has any experience of configuring multipath on
> ceph storage nodes, please. I’d like to know how best to go about it.
>
>
>
> We have a number of Dell PowerEdge R630 servers, each of which are fitted
> with two SAS 12G HBA cards and each of which have two associated Dell MD1400
> storage units connected to them via HD-Mini - HD-Mini cables, see the
> attached graphic (ignore colours: two direct connections from the server to
> each storage unit, two connections running between each storage unit).
>
>
>
> Best wishes,
>
> Bruno
>
>
>
>
>
> Bruno Canning
>
> LHC Data Store System Administrator
>
> Scientific Computing Department
>
> STFC Rutherford Appleton Laboratory
>
> Harwell Oxford
>
> Didcot
>
> OX11 0QX
>
> Tel. +44 ((0)1235) 446621
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Benjeman Meekhof
Hi Sage,

We did at one time run multiple clusters on our OSD nodes and RGW
nodes (with Jewel).  We accomplished this by putting code in our
puppet-ceph module that would create additional systemd units with
appropriate CLUSTER=name environment settings for clusters not named
ceph.  I.e., if the module were asked to configure OSDs for a cluster
named 'test' it would copy/edit the ceph-osd service to create a
'test-osd@.service' unit that would start instances with CLUSTER=test
so they would point to the right config file, etc.  Eventually on the
RGW side I started doing instance-specific overrides like
'/etc/systemd/system/ceph-rado...@client.name.d/override.conf' so as
to avoid replicating the stock systemd unit.
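
For reference, that kind of drop-in is only a couple of lines; the
unit instance and cluster name here are just illustrative:

  # /etc/systemd/system/ceph-radosgw@rgw.gateway01.service.d/override.conf
  [Service]
  Environment=CLUSTER=test

followed by a 'systemctl daemon-reload' to pick it up.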

We gave up on multiple clusters on the OSD nodes because it wasn't
really that useful to maintain a separate 'test' cluster on the same
hardware.  We continue to need the ability to reference multiple
clusters for RGW nodes and other clients.  As another example, users
of our project might have their own Ceph clusters in addition to
wanting to

If the daemon solution in the no-clustername future is to 'modify
systemd unit files to do something' we're already doing that, so it's
not a big issue.  However, the current approach of overriding CLUSTER
in the environment section of systemd files does seem cleaner than
overriding an exec command to specify a different config file and
keyring path.  Maybe systemd units could ship with those arguments as
variables for easy overriding.

thanks,
Ben

On Thu, Jun 8, 2017 at 3:37 PM, Sage Weil  wrote:
> At CDM yesterday we talked about removing the ability to name your ceph
> clusters.  There are a number of hurtles that make it difficult to fully
> get rid of this functionality, not the least of which is that some
> (many?) deployed clusters make use of it.  We decided that the most we can
> do at this point is remove support for it in ceph-deploy and ceph-ansible
> so that no new clusters or deployed nodes use it.
>
> The first PR in this effort:
>
> https://github.com/ceph/ceph-deploy/pull/441
>
> Background:
>
> The cluster name concept was added to allow multiple clusters to have
> daemons coexist on the same host.  At the type it was a hypothetical
> requirement for a user that never actually made use of it, and the
> support is kludgey:
>
>  - default cluster name is 'ceph'
>  - default config is /etc/ceph/$cluster.conf, so that the normal
> 'ceph.conf' still works
>  - daemon data paths include the cluster name,
>  /var/lib/ceph/osd/$cluster-$id
>which is weird (but mostly people are used to it?)
>  - any cli command you want to touch a non-ceph cluster name
> needs -C $name or --cluster $name passed to it.
>
> Also, as of jewel,
>
>  - systemd only supports a single cluster per host, as defined by $CLUSTER
> in /etc/{sysconfig,default}/ceph
>
> which you'll notice removes support for the original "requirement".
>
> Also note that you can get the same effect by specifying the config path
> explicitly (-c /etc/ceph/foo.conf) along with the various options that
> substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).
>
>
> Crap preventing us from removing this entirely:
>
>  - existing daemon directories for existing clusters
>  - various scripts parse the cluster name out of paths
>
>
> Converting an existing cluster "foo" back to "ceph":
>
>  - rename /etc/ceph/foo.conf -> ceph.conf
>  - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
>  - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph
>  - reboot
>
>
> Questions:
>
>  - Does anybody on the list use a non-default cluster name?
>  - If so, do you have a reason not to switch back to 'ceph'?
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Disable osd hearbeat logs

2017-03-14 Thread Benjeman Meekhof
Hi all,

Even with debug_osd 0/0 as well as every other debug_ setting at 0/0 I
still get logs like those pasted below in
/var/log/ceph/ceph-osd..log when the relevant situation arises
(release 11.2.0).

Any idea what toggle switches these off?   I went through and set
every single debug_ setting to 0/0 and am still getting them.  I use
'ceph daemon' to set them though so I don't know if there's something
not taking effect until startup.   At startup ceph.conf includes:
debug_osd = 0
debug_heartbeatmap = 0
debug_filestore = 0

2017-03-14 12:14:41.220880 7ff3478cd700 -1 osd.5 222380
heartbeat_check: no reply from x.x.x.x:6909 osd.571 ever on either
front or back, first ping sent 2017-03-14 12:12:53.791402 (cutoff
2017-03-14 12:14:21.220749)

Same for this one, can't get rid of it:
2017-03-14 12:14:41.817346 7ff34b009700  0 -- x.x.x.x:6853/166369 >> -
conn(0x7ff3609a7800 :6853 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state
just closed

thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph SElinux denials on OSD startup

2017-02-27 Thread Benjeman Meekhof
Hi,

I'm seeing some SELinux denials for ops to nvme devices.  They only
occur at OSD start; they are not ongoing.  I'm not sure it's causing
an issue, though I did try a few tests with SELinux in permissive mode
to see if it made any difference with the startup/recovery CPU loading
we have seen since updating to Kraken (another thread).  There doesn't
seem to be a noticeable difference in behaviour when we turn enforcing
off - our default state is with enforcing on and has been since the
start of our cluster.

Familiar to anyone?  I can open a tracker issue if it isn't obviously
an issue on my end.
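
In case anyone else hits this and wants a local workaround while it
gets sorted out, the usual audit2allow recipe should generate a small
policy module from these denials (the module name is arbitrary):

  grep nvme_device_t /var/log/audit/audit.log | audit2allow -M ceph-nvme-local
  semodule -i ceph-nvme-local.pp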

thanks,
Ben

---
type=AVC msg=audit(1487971555.994:39654): avc:  denied  { read } for
pid=470733 comm="ceph-osd" name="nvme0n1p13" dev="devtmpfs" ino=28742
scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487971555.994:39654): avc:  denied  { open } for
pid=470733 comm="ceph-osd" path="/dev/nvme0n1p13" dev="devtmpfs"
ino=28742 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487971555.995:39655): avc:  denied  { getattr }
for  pid=470733 comm="ceph-osd" path="/dev/nvme0n1p13" dev="devtmpfs"
ino=28742 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487971555.995:39656): avc:  denied  { ioctl } for
pid=470733 comm="ceph-osd" path="/dev/nvme0n1p13" dev="devtmpfs"
ino=28742 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file

type=AVC msg=audit(1487978131.752:40937): avc:  denied  { getattr }
for  pid=528235 comm="fn_odsk_fstore" path="/dev/nvme0n1"
dev="devtmpfs" ino=16546 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487978131.752:40938): avc:  denied  { read } for
pid=528235 comm="fn_odsk_fstore" name="nvme0n1p1" dev="devtmpfs"
ino=16549 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487978131.752:40938): avc:  denied  { open } for
pid=528235 comm="fn_odsk_fstore" path="/dev/nvme0n1p1" dev="devtmpfs"
ino=16549 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
type=AVC msg=audit(1487978131.752:40939): avc:  denied  { ioctl } for
pid=528235 comm="fn_odsk_fstore" path="/devnvme0n1p1" dev="devtmpfs"
ino=16549 scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:nvme_device_t:s0 tclass=blk_file
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-23 Thread Benjeman Meekhof
Hi Greg,

Appreciate you looking into it.  I'm concerned about CPU power per
daemon as well...though we never had this issue when restarting our
dense nodes under Jewel.  Is the rapid rate of OSDmap generation a
one-time condition particular to post-update processing or to Kraken
in general?

We did eventually get all the OSD back up either by doing so in small
batches or setting nodown and waiting for the host to churn
through...a day or so later all the OSD pop up.  Now that we're in a
stable non-degraded state I have to do more tests to see what happens
under Kraken when we kill a node or several nodes.

I have to give Ceph a lot of credit here.  Following my email on the
16th, while we were in a marginal state with Kraken OSDs churning to
come up, we lost a data center for a minute.  Subsequently we had our
remaining 2 mons refuse to stay in quorum long enough to serve cluster
sessions (constant back and forth elections).  I believe the issue was
timeouts caused by explosive leveldb growth in combination with other
activity, but eventually we got them to come back by increasing the db
lease time in ceph settings.  We had some unfound objects at this
point but after waiting out all the OSDs coming online with
nodown/noout set everything was fine.  I should have been more careful
in applying the update, but as one of our team put it, we definitely
found out that Ceph is resilient to admins as well as other disasters.

thanks,
Ben

On Thu, Feb 23, 2017 at 5:10 PM, Gregory Farnum  wrote:
> On Thu, Feb 16, 2017 at 9:19 AM, Benjeman Meekhof  wrote:
>> I tried starting up just a couple OSD with debug_osd = 20 and
>> debug_filestore = 20.
>>
>> I pasted a sample of the ongoing log here.  To my eyes it doesn't look
>> unusual but maybe someone else sees something in here that is a
>> problem:  http://pastebin.com/uy8S7hps
>>
>> As this log is rolling on, our OSD has still not been marked up and is
>> occupying 100% of a CPU core.  I've done this a couple times and in a
>> matter of some hours it will be marked up and CPU will drop.  If more
>> kraken OSD on another host are brought up the existing kraken OSD go
>> back into max CPU usage again while pg recover.  The trend scales
>> upward as OSD are started until the system is completely saturated.
>>
>> I was reading the docs on async messenger settings at
>> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
>> that under 'ms async max op threads' there is a note about one or more
>> CPUs constantly on 100% load.  As an experiment I set max op threads
>> to 20 and that is the setting during the period of the pasted log.  It
>> seems to make no difference.
>>
>> Appreciate any thoughts on troubleshooting this.  For the time being
>> I've aborted our kraken update and will probably re-initialize any
>> already updated OSD to revert to Jewel except perhaps one host to
>> continue testing.
>
> Ah, that log looks like you're just generating OSDMaps so quickly that
> rebooting 60 at a time leaves you with a ludicrous number to churn
> through, and that takes a while. It would have been exacerbated by
> having 60 daemons fight for the CPU to process them, leading to
> flapping.
>
> You might try restarting daemons sequentially on the node instead of
> all at once. Depending on your needs it would be even cheaper if you
> set the nodown flag, though obviously that will impede IO while it
> happens.
>
> I'd be concerned that this demonstrates you don't have enough CPU
> power per daemon, though.
> -Greg
>
>>
>> thanks,
>> Ben
>>
>> On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum  wrote:
>>> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof  
>>> wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
>>>> (11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
>>>> mons to Kraken.
>>>>
>>>> After updating ceph packages I restarted the 60 OSD on the box with
>>>> 'systemctl restart ceph-osd.target'.  Very soon after the system cpu
>>>> load flat-lines at 100% with top showing all of that being system load
>>>> from ceph-osd processes.  Not long after we get OSD flapping due to
>>>> the load on the system (noout was set to start this, but perhaps
>>>> too-quickly unset post restart).
>>>>
>>>> This is causing problems in the cluster, and we reboot the box.  The
>>>> OSD don't start up/mount automatically - not a new problem on this
>>>> setup.  We run 'ceph-disk activate $disk' on a list of all 

Re: [ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-16 Thread Benjeman Meekhof
Sure, looks as follows:

ceph -s
cluster 24b3bf92-299d-426c-ae56-48a995014f04
 health HEALTH_ERR
775 pgs are stuck inactive for more than 300 seconds
1954 pgs backfill_wait
7764 pgs degraded
517 pgs down
3 pgs inconsistent
504 pgs peering
31 pgs recovering
5661 pgs recovery_wait
6976 pgs stuck degraded
775 pgs stuck inactive
8362 pgs stuck unclean
1851 pgs stuck undersized
1880 pgs undersized
110 requests are blocked > 32 sec
recovery 2788277/17810399 objects degraded (15.655%)
recovery 1846569/17810399 objects misplaced (10.368%)
recovery 11442/5635366 unfound (0.203%)
76 scrub errors
 monmap e4: 3 mons at
{msu-mon01=207.73.217.13:6789/0,um-mon01=141.211.169.13:6789/0,wsu-mon01=204.39.195.13:6789/0}
election epoch 23402, quorum 0,1,2 um-mon01,wsu-mon01,msu-mon01
  fsmap e1074: 1/1/1 up {0=wsu-mds01=up:active}, 1
up:standby-replay, 1 up:standby
mgr active: um-mon01
 osdmap e152705: 627 osds: 475 up, 475 in; 2434 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v13524534: 20864 pgs, 25 pools, 21246 GB data, 5503 kobjects
59604 GB used, 3397 TB / 3455 TB avail
2788277/17810399 objects degraded (15.655%)
1846569/17810399 objects misplaced (10.368%)
11442/5635366 unfound (0.203%)
   12501 active+clean
5638 active+recovery_wait+degraded
1617 active+undersized+degraded+remapped+backfill_wait
 436 down+remapped+peering
 258 undersized+degraded+remapped+backfill_wait+peered
 189 active+degraded
  79 active+remapped+backfill_wait
  68 down+peering
  31 active+recovering+degraded
  20 active+recovery_wait+degraded+remapped
  10 down
   4 active+degraded+remapped
   3 down+remapped
   3 active+undersized+degraded+remapped
   2 active+recovery_wait+undersized+degraded+remapped
   2 active+remapped
   1 active+clean+inconsistent
   1 active+recovery_wait+degraded+inconsistent
   1 active+degraded+inconsistent

On Thu, Feb 16, 2017 at 5:08 PM, Shinobu Kinjo  wrote:
> Would you simply do?
>
>  * ceph -s
>
> On Fri, Feb 17, 2017 at 6:26 AM, Benjeman Meekhof  wrote:
>> As I'm looking at logs on the OSD mentioned in previous email at this
>> point, I mostly see this message repeating...is this normal or
>> indicating a problem?  This osd is marked up in the cluster.
>>
>> 2017-02-16 16:23:35.550102 7fc66fce3700 20 osd.564 152609
>> share_map_peer 0x7fc6887a3000 already has epoch 152609
>> 2017-02-16 16:23:35.556208 7fc66f4e2700 20 osd.564 152609
>> share_map_peer 0x7fc689e35000 already has epoch 152609
>> 2017-02-16 16:23:35.556233 7fc66f4e2700 20 osd.564 152609
>> share_map_peer 0x7fc689e35000 already has epoch 152609
>> 2017-02-16 16:23:35.577324 7fc66fce3700 20 osd.564 152609
>> share_map_peer 0x7fc68f4c1000 already has epoch 152609
>> 2017-02-16 16:23:35.577356 7fc6704e4700 20 osd.564 152609
>> share_map_peer 0x7fc68f4c1000 already has epoch 152609
>>
>> thanks,
>> Ben
>>
>> On Thu, Feb 16, 2017 at 12:19 PM, Benjeman Meekhof  
>> wrote:
>>> I tried starting up just a couple OSD with debug_osd = 20 and
>>> debug_filestore = 20.
>>>
>>> I pasted a sample of the ongoing log here.  To my eyes it doesn't look
>>> unusual but maybe someone else sees something in here that is a
>>> problem:  http://pastebin.com/uy8S7hps
>>>
>>> As this log is rolling on, our OSD has still not been marked up and is
>>> occupying 100% of a CPU core.  I've done this a couple times and in a
>>> matter of some hours it will be marked up and CPU will drop.  If more
>>> kraken OSD on another host are brought up the existing kraken OSD go
>>> back into max CPU usage again while pg recover.  The trend scales
>>> upward as OSD are started until the system is completely saturated.
>>>
>>> I was reading the docs on async messenger settings at
>>> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
>>> that under 'ms async max op threads' there is a note about one or more
>>> CPUs constantly on 100% load.  As an experiment I set max op threads
>>> to 20 and that is the setting during the period of the pasted log.  It
>>> seems to make no difference.
>>>
>>> Apprec

Re: [ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-16 Thread Benjeman Meekhof
As I'm looking at logs on the OSD mentioned in my previous email, at
this point I mostly see this message repeating... is this normal or
does it indicate a problem?  This OSD is marked up in the cluster.

2017-02-16 16:23:35.550102 7fc66fce3700 20 osd.564 152609
share_map_peer 0x7fc6887a3000 already has epoch 152609
2017-02-16 16:23:35.556208 7fc66f4e2700 20 osd.564 152609
share_map_peer 0x7fc689e35000 already has epoch 152609
2017-02-16 16:23:35.556233 7fc66f4e2700 20 osd.564 152609
share_map_peer 0x7fc689e35000 already has epoch 152609
2017-02-16 16:23:35.577324 7fc66fce3700 20 osd.564 152609
share_map_peer 0x7fc68f4c1000 already has epoch 152609
2017-02-16 16:23:35.577356 7fc6704e4700 20 osd.564 152609
share_map_peer 0x7fc68f4c1000 already has epoch 152609

thanks,
Ben

On Thu, Feb 16, 2017 at 12:19 PM, Benjeman Meekhof  wrote:
> I tried starting up just a couple OSD with debug_osd = 20 and
> debug_filestore = 20.
>
> I pasted a sample of the ongoing log here.  To my eyes it doesn't look
> unusual but maybe someone else sees something in here that is a
> problem:  http://pastebin.com/uy8S7hps
>
> As this log is rolling on, our OSD has still not been marked up and is
> occupying 100% of a CPU core.  I've done this a couple times and in a
> matter of some hours it will be marked up and CPU will drop.  If more
> kraken OSD on another host are brought up the existing kraken OSD go
> back into max CPU usage again while pg recover.  The trend scales
> upward as OSD are started until the system is completely saturated.
>
> I was reading the docs on async messenger settings at
> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
> that under 'ms async max op threads' there is a note about one or more
> CPUs constantly on 100% load.  As an experiment I set max op threads
> to 20 and that is the setting during the period of the pasted log.  It
> seems to make no difference.
>
> Appreciate any thoughts on troubleshooting this.  For the time being
> I've aborted our kraken update and will probably re-initialize any
> already updated OSD to revert to Jewel except perhaps one host to
> continue testing.
>
> thanks,
> Ben
>
> On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum  wrote:
>> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof  
>> wrote:
>>> Hi all,
>>>
>>> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
>>> (11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
>>> mons to Kraken.
>>>
>>> After updating ceph packages I restarted the 60 OSD on the box with
>>> 'systemctl restart ceph-osd.target'.  Very soon after the system cpu
>>> load flat-lines at 100% with top showing all of that being system load
>>> from ceph-osd processes.  Not long after we get OSD flapping due to
>>> the load on the system (noout was set to start this, but perhaps
>>> too-quickly unset post restart).
>>>
>>> This is causing problems in the cluster, and we reboot the box.  The
>>> OSD don't start up/mount automatically - not a new problem on this
>>> setup.  We run 'ceph-disk activate $disk' on a list of all the
>>> /dev/dm-X devices as output by ceph-disk list.  Everything activates
>>> and the CPU gradually climbs to once again be a solid 100%.  No OSD
>>> have joined cluster so it isn't causing issues.
>>>
>>> I leave the box overnight...by the time I leave I see that 1-2 OSD on
>>> this box are marked up/in.   By morning all are in, CPU is fine,
>>> cluster is still fine.
>>>
>>> This is not a show-stopping issue now that I know what happens though
>>> it means upgrades are a several hour or overnight affair.  Next box I
>>> will just mark all the OSD out before updating and restarting them or
>>> try leaving them up but being sure to set noout to avoid flapping
>>> while they churn.
>>>
>>> Here's a log snippet from one currently spinning in the startup
>>> process since 11am.  This is the second box we did, the first
>>> experience being as detailed above.  Could this have anything to do
>>> with the 'PGs are upgrading' message?
>>
>> It doesn't seem likely — there's a fixed per-PG overhead that doesn't
>> scale with the object count. I could be missing something but I don't
>> see anything in the upgrade notes that should be doing this either.
>> Try running an upgrade with "debug osd = 20" and "debug filestore =
>> 20" set and see what the log spits out.
>> -Greg
>>
>>>
>>> 2017-02-14 11:04:07.028311 7f

Re: [ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-16 Thread Benjeman Meekhof
I tried starting up just a couple OSD with debug_osd = 20 and
debug_filestore = 20.

I pasted a sample of the ongoing log here.  To my eyes it doesn't look
unusual but maybe someone else sees something in here that is a
problem:  http://pastebin.com/uy8S7hps

As this log is rolling on, our OSD has still not been marked up and is
occupying 100% of a CPU core.  I've done this a couple times and in a
matter of some hours it will be marked up and CPU will drop.  If more
Kraken OSDs on another host are brought up, the existing Kraken OSDs
go back into max CPU usage again while PGs recover.  The trend scales
upward as OSDs are started until the system is completely saturated.

I was reading the docs on async messenger settings at
http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
that under 'ms async max op threads' there is a note about one or more
CPUs constantly on 100% load.  As an experiment I set max op threads
to 20 and that is the setting during the period of the pasted log.  It
seems to make no difference.

Appreciate any thoughts on troubleshooting this.  For the time being
I've aborted our kraken update and will probably re-initialize any
already updated OSD to revert to Jewel except perhaps one host to
continue testing.

thanks,
Ben

On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum  wrote:
> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof  wrote:
>> Hi all,
>>
>> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
>> (11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
>> mons to Kraken.
>>
>> After updating ceph packages I restarted the 60 OSD on the box with
>> 'systemctl restart ceph-osd.target'.  Very soon after the system cpu
>> load flat-lines at 100% with top showing all of that being system load
>> from ceph-osd processes.  Not long after we get OSD flapping due to
>> the load on the system (noout was set to start this, but perhaps
>> too-quickly unset post restart).
>>
>> This is causing problems in the cluster, and we reboot the box.  The
>> OSD don't start up/mount automatically - not a new problem on this
>> setup.  We run 'ceph-disk activate $disk' on a list of all the
>> /dev/dm-X devices as output by ceph-disk list.  Everything activates
>> and the CPU gradually climbs to once again be a solid 100%.  No OSD
>> have joined cluster so it isn't causing issues.
>>
>> I leave the box overnight...by the time I leave I see that 1-2 OSD on
>> this box are marked up/in.   By morning all are in, CPU is fine,
>> cluster is still fine.
>>
>> This is not a show-stopping issue now that I know what happens though
>> it means upgrades are a several hour or overnight affair.  Next box I
>> will just mark all the OSD out before updating and restarting them or
>> try leaving them up but being sure to set noout to avoid flapping
>> while they churn.
>>
>> Here's a log snippet from one currently spinning in the startup
>> process since 11am.  This is the second box we did, the first
>> experience being as detailed above.  Could this have anything to do
>> with the 'PGs are upgrading' message?
>
> It doesn't seem likely — there's a fixed per-PG overhead that doesn't
> scale with the object count. I could be missing something but I don't
> see anything in the upgrade notes that should be doing this either.
> Try running an upgrade with "debug osd = 20" and "debug filestore =
> 20" set and see what the log spits out.
> -Greg
>
>>
>> 2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load 
>> lua
>> 2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map
>> has features 288514119978713088, adjusting msgr requires for clients
>> 2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map
>> has features 288514394856620032 was 8705, adjusting msgr requires for
>> mons
>> 2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map
>> has features 288514394856620032, adjusting msgr requires for osds
>> 2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
>> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
>> 2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs
>> opened 148 pgs
>> 2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op
>> queue with priority op cut off at 64.
>> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493
>> log_to_monitors {default=true}
>> 2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with
>> init, starting boot process
>> (logs stop here, cpu spinning)
>>
>>
>> regards,
>> Ben
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-16 Thread Benjeman Meekhof
Hi all,

I'd also not like to see cache tiering in the current form go away.
We've explored using it in situations where we have a data pool with
replicas spread across WAN sites which we then overlay with a fast
cache tier local to the site where most clients will be using the
pool.  This significantly speeds up operations for that set of clients
as long as they don't involve waiting for the cache to flush data.  In
theory we could move the cache tier around (flush, delete, recreate)
as needed.  The drawback of course is that clients not near the cache
tier pool would still be forced into using it by Ceph.  What would be
really useful is multiple cache pools with rulesets that allow us to
direct clients by IP, GeoIP proximity lookups or something like that.
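
To be concrete about 'flush, delete, recreate', the sequence we had in
mind is roughly the standard tier removal procedure (pool names are
placeholders, and the exact cache-mode flags vary a bit by release):

  ceph osd tier cache-mode cachepool forward --yes-i-really-mean-it
  rados -p cachepool cache-flush-evict-all
  ceph osd tier remove-overlay basepool
  ceph osd tier remove basepool cachepool

and then recreating the tier near whichever site needs it next.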

As I see it, the alternative configuration for this case is to simply
map replicas at the appropriate site.  Which makes more sense for us
might depend on different factors for different project users.

thanks,
Ben
---
OSiRIS http://www.osris.org

On Thu, Feb 16, 2017 at 3:40 AM, Dongsheng Yang
 wrote:
> BTW, is there any body using EnhanceIO?
>
> On 02/15/2017 05:51 PM, Dongsheng Yang wrote:
>>
>> thanx Nick, Gregory and Wido,
>> So at least, we can say the cache tiering in Jewel is stable enough I
>> think.
>> I like cache tiering more than the others, but yes, there is a problem
>> about cache tiering in
>> flushing data between different nodes, which are not a problem in local
>> caching solution.
>>
>> guys:
>> Is there any plan to enhance cache tiering to solve such problem? Or
>> as Nick asked, is
>> that cache tiering fading away?
>>
>> Yang
>>
>>
>> On 15/02/2017, 06:42, Nick Fisk wrote:

 -Original Message-
 From: Gregory Farnum [mailto:gfar...@redhat.com]
 Sent: 14 February 2017 21:05
 To: Wido den Hollander 
 Cc: Dongsheng Yang ; Nick Fisk
 ; Ceph Users 
 Subject: Re: [ceph-users] bcache vs flashcache vs cache tiering

 On Tue, Feb 14, 2017 at 8:25 AM, Wido den Hollander 
 wrote:
>>
>> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
>>
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>> Behalf Of Dongsheng Yang
>>> Sent: 14 February 2017 09:01
>>> To: Sage Weil 
>>> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
>>> Subject: [ceph-users] bcache vs flashcache vs cache tiering
>>>
>>> Hi Sage and all,
>>>   We are going to use SSDs for cache in ceph. But I am not sure
>>> which one is the best solution, bcache? flashcache? or cache
>>
>> tier?
>>
>> I would vote for cache tier. Being able to manage it from within
>> Ceph, instead of having to manage X number of bcache/flashcache
>> instances, appeals to me more. Also last time I looked Flashcache
>> seems unmaintained and bcache might be going that way with talk of
>> this new bcachefs. Another point to consider is that Ceph has had a
>> lot of

 work done on it to ensure data consistency; I don't ever want to be in a
 position where I'm trying to diagnose problems that might be being
 caused
 by another layer sitting in-between Ceph and the Disk.
>>
>> However, I know several people on here are using bcache and
>> potentially getting better performance than with cache tiering, so

 hopefully someone will give their views.
>
> I am using Bcache on various systems and it performs really well. The

 caching layer in Ceph is slow. Promoting Objects is slow and it also
 involves
 additional RADOS lookups.

 Yeah. Cache tiers have gotten a lot more usable in Ceph, but the use
 cases
 where they're effective are still pretty limited and I think in-node
 caching has
 a brighter future. We just don't like to maintain the global state that
 makes
 separate caching locations viable and unless you're doing something
 analogous to the supercomputing "burst buffers" (which some people
 are!),
 it's going to be hard to beat something that doesn't have to pay the
 cost of
 extra network hops/bandwidth.
 Cache tiers are also not a feature that all the vendors support in their
 downstream products, so it will probably see less ongoing investment
 than
 you'd expect from such a system.
>>>
>>> Should that be taken as an unofficial sign that the tiering support is
>>> likely to fade away?
>>>
>>> I think both approaches have different strengths and probably the
>>> difference between a tiering system and a caching one is what causes some of
>>> the problems.
>>>
>>> If something like bcache is going to be the preferred approach, then I
>>> think more work needs to be done around certifying it for use with Ceph and
>>> allowing its behavior to be more controlled by Ceph as well. I assume there
>>> are issues around backfilling and scrubbing polluting the cache

[ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-14 Thread Benjeman Meekhof
Hi all,

We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
(11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
mons to Kraken.

After updating ceph packages I restarted the 60 OSD on the box with
'systemctl restart ceph-osd.target'.  Very soon after the system cpu
load flat-lines at 100% with top showing all of that being system load
from ceph-osd processes.  Not long after we get OSD flapping due to
the load on the system (noout was set to start this, but perhaps
too-quickly unset post restart).

This is causing problems in the cluster, and we reboot the box.  The
OSD don't start up/mount automatically - not a new problem on this
setup.  We run 'ceph-disk activate $disk' on a list of all the
/dev/dm-X devices as output by ceph-disk list.  Everything activates
and the CPU gradually climbs to once again be a solid 100%.  No OSD
have joined cluster so it isn't causing issues.

I leave the box overnight...by the time I leave I see that 1-2 OSD on
this box are marked up/in.   By morning all are in, CPU is fine,
cluster is still fine.

This is not a show-stopping issue now that I know what happens though
it means upgrades are a several hour or overnight affair.  Next box I
will just mark all the OSD out before updating and restarting them or
try leaving them up but being sure to set noout to avoid flapping
while they churn.

Here's a log snippet from one currently spinning in the startup
process since 11am.  This is the second box we did, the first
experience being as detailed above.  Could this have anything to do
with the 'PGs are upgrading' message?

2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load lua
2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map
has features 288514119978713088, adjusting msgr requires for clients
2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map
has features 288514394856620032 was 8705, adjusting msgr requires for
mons
2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map
has features 288514394856620032, adjusting msgr requires for osds
2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs
opened 148 pgs
2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op
queue with priority op cut off at 64.
2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493
log_to_monitors {default=true}
2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with
init, starting boot process
(logs stop here, cpu spinning)


regards,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw scaling recommendation?

2017-02-14 Thread Benjeman Meekhof
Thanks everyone for the suggestions.  Playing with all three of the
tuning knobs mentioned has greatly increased the number of client
connections an instance can deal with.  We're still experimenting to
find the maximum values that saturate our hardware.

With values as below we'd see something around 50 reqs/s and at higher
rates start to see some 403 responses or TCP peer resets.  However
we're still only hitting 3-5% utilization of the hardware CPU and
plenty of headroom with other resources so there's room to go higher I
think.  There wasn't a lot of thought put into those numbers or the
relation between them, just 'bigger'.

My analysis is that the connection resets are likely due to too few
civetweb threads to handle requests, and the 403 responses to too few
threads/handles to handle the connections that do get through.

rgw_thread_pool_size = 800
civetweb num_threads = 400
rgw_num_rados_handles = 8
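
In ceph.conf terms that looks roughly like the following; the section
name is only an example and depends on how your gateway instance is
named:

  [client.rgw.gateway01]
  rgw thread pool size = 800
  rgw num rados handles = 8
  rgw frontends = civetweb port=80 num_threads=400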

regards,
Ben

On Thu, Feb 9, 2017 at 4:48 PM, Ben Hines  wrote:
> I'm curious how does the num_threads option to civetweb relate to the 'rgw
> thread pool size'?  Should i make them equal?
>
> ie:
>
> rgw frontends = civetweb enable_keep_alive=yes port=80 num_threads=125
> error_log_file=/var/log/ceph/civetweb.error.log
> access_log_file=/var/log/ceph/civetweb.access.log
>
>
> -Ben
>
> On Thu, Feb 9, 2017 at 12:30 PM, Wido den Hollander  wrote:
>>
>>
>> > Op 9 februari 2017 om 19:34 schreef Mark Nelson :
>> >
>> >
>> > I'm not really an RGW expert, but I'd suggest increasing the
>> > "rgw_thread_pool_size" option to something much higher than the default
>> > 100 threads if you haven't already.  RGW requires at least 1 thread per
>> > client connection, so with many concurrent connections some of them
>> > might end up timing out.  You can scale the number of threads and even
>> > the number of RGW instances on a single server, but at some point you'll
>> > run out of threads at the OS level.  Probably before that actually
>> > happens though, you'll want to think about multiple RGW gateway nodes
>> > behind a load balancer.  Afaik that's how the big sites do it.
>> >
>>
>> In addition, have you tried to use more RADOS handles?
>>
>> rgw_num_rados_handles = 8
>>
>> That with more RGW threads as Mark mentioned.
>>
>> Wido
>>
>> > I believe some folks are considering trying to migrate rgw to a
>> > threadpool/event processing model but it sounds like it would be quite a
>> > bit of work.
>> >
>> > Mark
>> >
>> > On 02/09/2017 12:25 PM, Benjeman Meekhof wrote:
>> > > Hi all,
>> > >
>> > > We're doing some stress testing with clients hitting our rados gw
>> > > nodes with simultaneous connections.  When the number of client
>> > > connections exceeds about 5400 we start seeing 403 forbidden errors
>> > > and log messages like the following:
>> > >
>> > > 2017-02-09 08:53:16.915536 7f8c667bc700 0 NOTICE: request time skew
>> > > too big now=2017-02-09 08:53:16.00 req_time=2017-02-09
>> > > 08:37:18.00
>> > >
>> > > This is version 10.2.5 using embedded civetweb.  There's just one
>> > > instance per node, and they all start generating 403 errors and the
>> > > above log messages when enough clients start hitting them.  The
>> > > hardware is not being taxed at all, negligible load and network
>> > > throughput.   OSD don't show any appreciable increase in CPU load or
>> > > io wait on journal/data devices.  Unless I'm missing something it
>> > > looks like the RGW is just not scaling to fill out the hardware it is
>> > > on.
>> > >
>> > > Does anyone have advice on scaling RGW to fully utilize a host?
>> > >
>> > > thanks,
>> > > Ben
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw scaling recommendation?

2017-02-09 Thread Benjeman Meekhof
Hi all,

We're doing some stress testing with clients hitting our rados gw
nodes with simultaneous connections.  When the number of client
connections exceeds about 5400 we start seeing 403 forbidden errors
and log messages like the following:

2017-02-09 08:53:16.915536 7f8c667bc700 0 NOTICE: request time skew
too big now=2017-02-09 08:53:16.00 req_time=2017-02-09
08:37:18.00

This is version 10.2.5 using embedded civetweb.  There's just one
instance per node, and they all start generating 403 errors and the
above log messages when enough clients start hitting them.  The
hardware is not being taxed at all: negligible load and network
throughput.  The OSDs don't show any appreciable increase in CPU load or
I/O wait on journal/data devices.  Unless I'm missing something, it
looks like RGW is just not scaling to fill out the hardware it is
running on.

Does anyone have advice on scaling RGW to fully utilize a host?

thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Latency between datacenters

2017-02-08 Thread Benjeman Meekhof
Hi Daniel,

50 ms of latency is going to introduce a big performance hit, though
things will still function.  We did a few tests which are documented
at http://www.osris.org/performance/latency
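
If you want a quick feel for the impact before settling on a design,
comparing a small write benchmark from each site gives a rough picture
(the pool name and parameters below are only an example):

  ping -c 20 mon1.remote-site.example       # confirm the ~50 ms RTT
  rados bench -p bench-test 60 write -t 16  # compare latency/throughput per site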

thanks,
Ben

On Tue, Feb 7, 2017 at 12:17 PM, Daniel Picolli Biazus
 wrote:
> Hi Guys,
>
> I have been planning to deploy a Ceph Cluster with the following hardware:
>
> OSDs:
>
> 4 Servers Xeon D 1520 / 32 GB RAM / 5 x 6TB SAS 2 (6 OSD daemon per server)
>
> Monitor/Rados Gateways
>
> 5 Servers Xeon D 1520 32 GB RAM / 2 x 1TB SAS 2 (5 MON daemon/ 4 rados
> daemon)
>
> Usage: Object Storage only
>
> However I need to deploy 2 OSD and 3 MON Servers in Miami datacenter and
> another 2 OSD and 2 MON Servers in Montreal Datacenter. The latency between
> these datacenters is 50 milliseconds.
>Considering this scenario, should I use Federated Gateways or should I
> use a single Cluster ?
>
> Thanks in advance
> Daniel
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] general ceph cluster design

2016-11-28 Thread Benjeman Meekhof
Hi Nick,

We have a Ceph cluster spread across 3 datacenters at 3 institutions
in Michigan (UM, MSU, WSU).  It certainly is possible.  As noted you
will have increased latency for write operations and overall reduced
throughput as latency increases.  Latency between our sites is 3-5ms.

We did some simulated latency testing with netem where we induced
varying levels of latency on one of our storage hosts (60 OSD).  Some
information about the results is on our website:
http://www.osris.org/performance/latency
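
If you want to reproduce that kind of test, the netem commands were
essentially of this form (the interface name and delay values here are
just examples):

  tc qdisc add dev eth0 root netem delay 10ms     # inject 10 ms on the storage host
  tc qdisc change dev eth0 root netem delay 50ms  # step the delay up between runs
  tc qdisc del dev eth0 root netem                # remove the emulation when done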

We also had success running a 4th cluster site at Supercomputing in
SLC.  We'll be putting up information on experiences there in the near
future.

thanks,
Ben


On Mon, Nov 28, 2016 at 12:06 AM, nick  wrote:
> Hi Maxime,
> thank you for the information given. We will have a look and check.
>
> Cheers
> Nick
>
> On Friday, November 25, 2016 09:48:35 PM Maxime Guyot wrote:
>> Hi Nick,
>>
>> See inline comments.
>>
>> Cheers,
>> Maxime
>>
>> On 25/11/16 16:01, "ceph-users on behalf of nick"
>>  wrote:
>
>>
>> >Hi,
>> >we are currently planning a new ceph cluster which will be used for
>> >virtualization (providing RBD storage for KVM machines) and we have
>> >some
>> >general questions.
>> >
>> >* Is it advisable to have one ceph cluster spread over multiple
>> >datacenters (latency is low, as they are not so far from each
>> >other)? Is anybody doing this in a production setup? We know that any
>> >network issue would affect virtual machines in all locations instead
>> >just one, but we can see a lot of advantages as well.
>>
>>
>> I think the general consensus is to limit the size of the failure domain.
>> That said, it depends the use case and what you mean by “multiple
>> datacenters” and “latency is low”: writes will have to be journal-ACK:ed
>> by the OSDs in the other datacenter. If there is 10ms latency between
>> Location1 and Location2, then it would add 10ms to each write operation if
>> crushmap requires replicas in each location. Speaking of which a 3rd
>> location would help with sorting our quorum (1 mon at each location) in
>> “triangle” configuration.
>
>> If this is for DR: RBD-mirroring is supposed to address that, you might not
>> want to have 1 big cluster ( = failure domain).
>> If this is for VM live
>> migration: Usually requires spread L2 adjacency (failure domain) or
>> overlays (VXLAN and the likes), “network trombone” effect can be a problem
>> depending on the setup
>> I know of Nantes University who used/is using a 3 datacenter Ceph cluster:
>> http://dachary.org/?p=2087
>
>>
>> >
>> >* We are planning to combine the hosts for ceph and KVM (so far we are
>> >using
>  seperate hosts for virtual machines and ceph storage). We see
>> >the big advantage (next to the price drop) of an automatic ceph
>> >expansion when adding more compute nodes as we got into situations in
>> >the past where we had too many compute nodes and the ceph cluster was
>> >not expanded properly (performance dropped over time). On the other
>> >side there would be changes to the crush map every time we add a
>> >compute node and that might end in a lot of data movement in ceph. Is
>> >anybody using combined servers for compute and ceph storage and has
>> >some experience?
>>
>>
>> The challenge is to avoid ceph-osd to become a noisy neighbor for the VMs
>> hosted on the hypervisor, especially under recovery. I’ve heard people
>> using CPU pinning, containers, and QoS to keep it under control.
>> Sebastian has an article on his blog on this topic:
>> https://www.sebastien-han.fr/blog/2016/07/11/Quick-dive-into-hyperconverged
>> -architecture-with-OpenStack-and-Ceph/
>> For the performance dropped over time, you can look to improve your
>> capacity:performance ratio.
>
>>
>> >* is there a maximum amount of OSDs in a ceph cluster? We are planning
>> >to use a minimum of 8 OSDs per server and going to have a cluster
>> >with about 100 servers which would end in about 800 OSDs.
>>
>>
>> There are a couple of thread from the ML about this:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028371.html
>> and
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-November/014246.ht
>> ml
>
>>
>> >
>> >Thanks for any help...
>> >
>> >Cheers
>> >Nick
>>
>>
>
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Feedback wanted: health warning when standby MDS dies?

2016-10-18 Thread Benjeman Meekhof
+1 to this, it would be useful

On Tue, Oct 18, 2016 at 8:31 AM, Wido den Hollander  wrote:
>
>> Op 18 oktober 2016 om 14:06 schreef Dan van der Ster :
>>
>>
>> +1 I would find this warning useful.
>>
>
> +1 Probably make it configurable, say, you want at least X standby MDS to be 
> available before WARN. But in general, yes, please!
>
> Wido
>
>>
>>
>> On Tue, Oct 18, 2016 at 1:46 PM, John Spray  wrote:
>> > Hi all,
>> >
>> > Someone asked me today how to get a list of down MDS daemons, and I
>> > explained that currently the MDS simply forgets about any standby that
>> > stops sending beacons.  That got me thinking about the case where a
>> > standby dies while the active MDS remains up -- the cluster has gone
>> > into a non-highly-available state, but we are not giving the admin any
>> > indication.
>> >
>> > I've suggested a solution here:
>> > http://tracker.ceph.com/issues/17604
>> >
>> > This is probably going to be a bit of a subjective thing in terms of
>> > whether people find it useful or find it to be annoying noise, so I'd
>> > be interested in feedback from people currently running cephfs.
>> >
>> > Cheers,
>> > John
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread Benjeman Meekhof
For automatically collecting stats like this you might also look into
collectd.  It has many plugins for different system statistics,
including one for collecting stats from Ceph daemon admin sockets.
There are several ways to collect and view the data from collectd.  We
are pointing clients at InfluxDB and then viewing with Grafana.  There
are many small tutorials on this combination of tools if you search
around, and the docs for each tool cover how to configure its end.

In particular this plugin will get disk utilization counters:
https://collectd.org/wiki/index.php/Plugin:Disk

This combination is how we are monitoring OSD journal utilization
among other things.
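
A rough sketch of the relevant collectd.conf pieces (the daemon names,
device pattern and InfluxDB host are assumptions; adjust to your setup):

  LoadPlugin disk
  LoadPlugin ceph
  LoadPlugin network

  <Plugin disk>
    Disk "/^nvme/"          # match the journal devices on this host
    IgnoreSelected false
  </Plugin>

  <Plugin ceph>
    <Daemon "osd.0">
      SocketPath "/var/run/ceph/ceph-osd.0.asok"
    </Daemon>
  </Plugin>

  <Plugin network>
    Server "influxdb.example.org" "25826"   # InfluxDB's collectd listener
  </Plugin>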

thanks,
Ben





On Mon, Jun 20, 2016 at 12:02 PM, David Turner
 wrote:
> If you want to watch what a disk is doing while you watch it, use iostat on
> the journal device.  If you want to see its patterns at all times of the
> day, use sar.  Neither of these are ceph-specific commands, just Linux tools
> that can watch your disk utilization, speeds, etc. (among other things).  Both
> tools are well documented and easy to use.
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP Komarla
> [ep.koma...@flextronics.com]
> Sent: Friday, June 17, 2016 5:13 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph OSD journal utilization
>
> Hi,
>
>
>
> I am looking for a way to monitor the utilization of OSD journals – by
> observing the utilization pattern over time, I can determine if I have over
> provisioned them or not. Is there a way to do this?
>
>
>
> When I googled on this topic, I saw one similar request about 4 years back.
> I am wondering if there is some traction on this topic since then.
>
>
>
> Thanks a lot.
>
>
>
> - epk
>
>
> Legal Disclaimer:
> The information contained in this message may be privileged and
> confidential. It is intended to be read only by the individual or entity to
> whom it is addressed or by their designee. If the reader of this message is
> not the intended recipient, you are on notice that any distribution of this
> message, in any form, is strictly prohibited. If you have received this
> message in error, please immediately notify the sender and delete or destroy
> any copy of this message!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-19 Thread Benjeman Meekhof
Hi Christian,

Thanks for your insights.  To answer your question the NVMe devices
appear to be some variety of Samsung:

Model: Dell Express Flash NVMe 400GB
Manufacturer: SAMSUNG
Product ID: a820

regards,
Ben

On Wed, May 18, 2016 at 10:01 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:
>
>> Hi Lionel,
>>
>> These are all very good points we should consider, thanks for the
>> analysis.  Just a couple clarifications:
>>
>> - NVMe in this system are actually slotted in hot-plug front bays so a
>> failure can be swapped online.  However I do see your point about this
>> otherwise being a non-optimal config.
>>
> What NVMes are these exactly? DC P3700?
> With Intel you can pretty much rely on them not to die before their time
> is up, so monitor wearout levels religiously and automatically (nagios
> etc).
> At a low node count like yours it is understandable to not want to loose
> 15 OSDs because a NVMe failed, but your performance and cost are both not
> ideal as Lionel said.
>
> I guess you're happy with what you have, but as I mentioned in this
> thread also about RAIDed OSDs, there is a chassis that does basically what
> you're having while saving 1U:
> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
>
> This can also have optionally 6 NVMes, hot-swappable.
>
>> - Our 20 physical cores come out to be 40 HT cores to the system which
>> we are hoping is adequate to do 60 OSD without raid devices.  My
>> experiences in other contexts lead me to believe a hyper-threaded core
>> is pretty well the same as a phys core (perhaps with some exceptions
>> depending on specific cases).
>>
> It all depends, if you had no SSD journals at all I'd say you could scrape
> by, barely.
> With NVMes for journals, especially if you should decide to use them
> individually with 15 OSDs per NVMe, I'd expect CPU to become the
> bottleneck when dealing with a high number of small IOPS.
>
> Regards,
>
> Christian
>> regards,
>> Ben
>>
>> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
>>  wrote:
>> > Hi,
>> >
>> > I'm not yet familiar with Jewel, so take this with a grain of salt.
>> >
>> > Le 18/05/2016 16:36, Benjeman Meekhof a écrit :
>> >> We're in process of tuning a cluster that currently consists of 3
>> >> dense nodes with more to be added.  The storage nodes have spec:
>> >> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> >> - 384 GB RAM
>> >> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>> >> LSI 9207-8e SAS 6Gbps
>> >
>> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly I
>> > think your performance would be limited by the CPUs but Jewel is faster
>> > AFAIK.
>> > That said you could setup the 60 disks as RAID arrays to limit the
>> > number of OSDs. This can be tricky but some people have reported doing
>> > so successfully (IIRC using RAID5 in order to limit both the number of
>> > OSDs and the rebalancing events when a disk fails).
>> >
>> >> - XFS filesystem on OSD data devs
>> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> >> raid-1 device)
>> >
>> > Your disks are rated at a maximum of ~200MB/s so even with a 100-150MB
>> > conservative estimate, for 30 disks you'd need a write bandwidth of
>> > 3GB/s to 4.5GB/s on each NVMe. Your NVMe will die twice as fast as they
>> > will take twice the amount of writes in RAID1. The alternative - using
>> > NVMe directly for journals - will get better performance and have less
>> > failures. The only drawback is that an NVMe failing entirely (I'm not
>> > familiar with NVMe but with SSD you often get write errors affecting a
>> > single OSD before a whole device failure) will bring down 15 OSDs at
>> > once. Note that replacing NVMe usually means stopping the whole node
>> > when not using hotplug PCIe, so not losing the journals when one fails
>> > may not gain you as much as anticipated if the cluster must rebalance
>> > anyway during the maintenance operation where your replace the faulty
>> > NVMe (and might perform other upgrades/swaps that were waiting).
>> >
>> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb
>> >
>> > Seems adequate although more bandwidth could be of some benefit.
>> >
>> > This is a total of ~12GB/s full dup

Re: [ceph-users] dense storage nodes

2016-05-18 Thread Benjeman Meekhof
Hi Lionel,

These are all very good points we should consider; thanks for the
analysis.  Just a couple of clarifications:

- The NVMe devices in this system are actually slotted into hot-plug
front bays, so a failed device can be swapped online.  However, I do see
your point about this otherwise being a non-optimal config.

- Our 20 physical cores present 40 HT cores to the system, which we are
hoping is adequate to run 60 OSDs without RAID devices.  My experience
in other contexts leads me to believe a hyper-threaded core is pretty
much the same as a physical core (perhaps with some exceptions in
specific cases).

regards,
Ben

On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton  wrote:
> Hi,
>
> I'm not yet familiar with Jewel, so take this with a grain of salt.
>
> Le 18/05/2016 16:36, Benjeman Meekhof a écrit :
>> We're in process of tuning a cluster that currently consists of 3
>> dense nodes with more to be added.  The storage nodes have spec:
>> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> - 384 GB RAM
>> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>> LSI 9207-8e SAS 6Gbps
>
> I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly I
> think your performance would be limited by the CPUs but Jewel is faster
> AFAIK.
> That said you could setup the 60 disks as RAID arrays to limit the
> number of OSDs. This can be tricky but some people have reported doing
> so successfully (IIRC using RAID5 in order to limit both the number of
> OSDs and the rebalancing events when a disk fails).
>
>> - XFS filesystem on OSD data devs
>> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> raid-1 device)
>
> Your disks are rated at a maximum of ~200MB/s so even with a 100-150MB
> conservative estimate, for 30 disks you'd need a write bandwidth of
> 3GB/s to 4.5GB/s on each NVMe. Your NVMe will die twice as fast as they
> will take twice the amount of writes in RAID1. The alternative - using
> NVMe directly for journals - will get better performance and have less
> failures. The only drawback is that an NVMe failing entirely (I'm not
> familiar with NVMe but with SSD you often get write errors affecting a
> single OSD before a whole device failure) will bring down 15 OSDs at once.
> Note that replacing NVMe usually means stopping the whole node when not
> using hotplug PCIe, so not losing the journals when one fails may not
> gain you as much as anticipated if the cluster must rebalance anyway
> during the maintenance operation where your replace the faulty NVMe (and
> might perform other upgrades/swaps that were waiting).
>
>> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb
>
> Seems adequate although more bandwidth could be of some benefit.
>
> This is a total of ~12GB/s full duplex. If Ceph is able to use the whole
> disk bandwidth you will saturate this : if you get a hotspot on one node
> with a client capable of writing at 12GB/s on it and have a replication
> size of 3, you will get only half of this (as twice this amount will be
> sent on replicas). So ideally you would have room for twice the client
> bandwidth on the cluster network. In my experience this isn't a problem
> (hot spots like this almost never happen as client write traffic is
> mostly distributed evenly on nodes) but having the headroom avoids the
> risk of atypical access patterns becoming a problem so it seems like a
> good thing if it doesn't cost too much.
> Note that if your total NVMe write bandwidth is more than the total disk
> bandwidth they act as buffers capable of handling short write bursts
> (only if there's no read on recent writes which should almost never
> happen for RBD but might for other uses) so you could limit your ability
> to handle these.
>
> Best regards,
>
> Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-18 Thread Benjeman Meekhof
We're in process of tuning a cluster that currently consists of 3
dense nodes with more to be added.  The storage nodes have spec:
- Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
- 384 GB RAM
- 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
LSI 9207-8e SAS 6Gbps
- XFS filesystem on OSD data devs
- 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
raid-1 device)
- 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)
- Jewel release

I don't have much tuning advice to add; I'm reading this thread for
tips and a preview of upcoming problems.  We've done a little network
tuning based on Mellanox recommendations, but nothing specific to ceph
(in fact we just use the scripts that come with the Mellanox driver
packages).  We haven't hit any major issues so far in trying out RBD
and RGW, but we haven't taxed anything yet.

thanks,
Ben

On Wed, May 18, 2016 at 9:14 AM, Brian Felton  wrote:
> At my current gig, we are running five (soon to be six) pure object storage
> clusters in production with the following specs:
>
>  - 9 nodes
>  - 32 cores, 256 GB RAM per node
>  - 72 6 TB SAS spinners per node (648 total per cluster)
>  - 7,2 erasure coded pool for RGW buckets
>  - ZFS as the filesystem on the OSDs with collocated journals
>  - Hammer release with rgw patches
>
> We are currently storing a few hundred TB of data across several hundred
> MObjects.
>
> We have hit the following issues:
>
>  - Filestore merge splits occur at ~40 MObjects with default settings.  This
> is a really, really bad couple of days while things settle.
>  - Realizing that, with erasure coding, scrubs have the same impact as deep
> scrubs
>  - Scrubs causing a slew of blocked/slow requests and stale pgs
>  - A handful of RGW issues
>
> As utilization has grown, the performance impact of scrubbing has become
> much more noticeable, to the point that we've had to hand-roll software to
> manage the scrubs and keep them at a very reduced rate.  SSD journals are
> your friends, folks.  Don't skimp.  We are in the process of retrofitting
> these clusters with SSD journals to help speed things up.  We are also
> evaluating BlueStore in Jewel to see how it compares to LevelDB as well as
> different node configurations (less dense and with SSD journals).
>
> Brian
>
> On Wed, May 18, 2016 at 12:54 AM, Blair Bethwaite
>  wrote:
>>
>> Hi all,
>>
>> What are the densest node configs out there, and what are your
>> experiences with them and tuning required to make them work? If we can
>> gather enough info here then I'll volunteer to propose some upstream
>> docs covering this.
>>
>> At Monash we currently have some 32-OSD nodes (running RHEL7), though
>> 8 of those OSDs are not storing or doing much yet (in a quiet EC'd RGW
>> pool), the other 24 OSDs are serving RBD and at perhaps 65% full on
>> average - these are 4TB drives.
>>
>> Aside from the already documented pid_max increases that are typically
>> necessary just to start all OSDs, we've also had to up
>> nf_conntrack_max. We've hit issues (twice now) that seem (have not
>> figured out exactly how to confirm this yet) to be related to kernel
>> dentry slab cache exhaustion - symptoms were a major slow down in
>> performance and slow requests all over the place on writes, watching
>> OSD iostat would show a single drive hitting 90+% util for ~15s with a
>> bunch of small reads and no writes. These issues were worked around by
>> tuning up filestore split and merge thresholds, though if we'd known
>> about this earlier we'd probably have just bumped up the default
>> object size so that we simply had fewer objects (and/or rounded up the
>> PG count to the next power of 2). We also set vfs_cache_pressure to 1,
>> though this didn't really seem to do much at the time. I've also seen
>> recommendations about setting min_free_kbytes to something higher
>> (currently 90112 on our hardware) but have not verified this.
>>
>> --
>> Cheers,
>> ~Blairo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
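
For anyone following along, the filestore split/merge thresholds and
sysctls mentioned above translate into settings roughly like these
(values are illustrative only, not recommendations):

  # ceph.conf
  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8

  # /etc/sysctl.d/ceph.conf
  kernel.pid_max = 4194303
  net.netfilter.nf_conntrack_max = 1048576
  vm.vfs_cache_pressure = 1
  vm.min_free_kbytes = 90112
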
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do I start ceph jewel in CentOS?

2016-05-04 Thread Benjeman Meekhof
Hi Michael,

The systemctl pattern for an OSD with Infernalis or higher is
'systemctl start ceph-osd@<id>' (or status, restart).

It will start the OSD in the default cluster 'ceph', or in another
cluster if you have set 'CLUSTER=<name>' in /etc/sysconfig/ceph.

If by chance you have 2 clusters on the same hardware, you'll have to
manually create separate systemd unit files in /usr/lib/systemd/system
like '<cluster>-osd@.service' edited to have the 2nd cluster name, create
a separate '<cluster>-osd.target' in the same dir, and symlink it in
/etc/systemd/system/<cluster>-osd.target.wants.  I don't know if there
might be another built-in way, but I did not see it.
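
A rough sketch of what that looks like for a hypothetical second
cluster named 'backup' (the cluster name and OSD id are made up):

  cp /usr/lib/systemd/system/ceph-osd@.service /usr/lib/systemd/system/backup-osd@.service
  cp /usr/lib/systemd/system/ceph-osd.target /usr/lib/systemd/system/backup-osd.target
  # edit both copies so they reference CLUSTER=backup and each other
  # (PartOf/WantedBy) instead of the 'ceph' units
  mkdir -p /etc/systemd/system/backup-osd.target.wants
  ln -s /usr/lib/systemd/system/backup-osd@.service \
        /etc/systemd/system/backup-osd.target.wants/backup-osd@12.service
  systemctl daemon-reload
  systemctl start backup-osd@12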

To see all the ceph units configured:

systemctl -a | grep ceph

regards,
Ben



On Wed, May 4, 2016 at 12:58 PM, Michael Kuriger  wrote:
> How are others starting ceph services?  Am I the only person trying to 
> install jewel on CentOS 7?
> Unfortunately, systemctl status does not list any “ceph” services at all.
>
>
>
>
>
>
>
>
>
>
> On 5/4/16, 9:37 AM, "Vasu Kulkarni"  wrote:
>
>>sadly there are still some issues with jewel/master branch for centos
>>systemctl service,
>>As a workaround if you run "systemctl status" and look at the top most
>>service name in the ceph-osd service tree and use that to stop/start
>>it should work.
>>
>>
>>On Wed, May 4, 2016 at 9:00 AM, Michael Kuriger  wrote:
>>> I’m running CentOS 7.2.  I upgraded one server from hammer to jewel.   I
>>> cannot get ceph to start using these new systems scripts.  Can anyone help?
>>>
>>> I tried to enable ceph-osd@.service by creating symlinks manually.
>>>
>>> # systemctl list-unit-files|grep ceph
>>>
>>> ceph-create-keys@.service  static
>>>
>>> ceph-disk@.service static
>>>
>>> ceph-mds@.service  disabled
>>>
>>> ceph-mon@.service  disabled
>>>
>>> ceph-osd@.service  enabled
>>>
>>> ceph-mds.targetdisabled
>>>
>>> ceph-mon.targetdisabled
>>>
>>> ceph-osd.targetenabled
>>>
>>> ceph.targetenabled
>>>
>>>
>>>
>>> # systemctl start ceph.target
>>>
>>>
>>> # systemctl status ceph.target
>>>
>>> ● ceph.target - ceph target allowing to start/stop all ceph*@.service
>>> instances at once
>>>
>>>Loaded: loaded (/usr/lib/systemd/system/ceph.target; enabled; vendor
>>> preset: disabled)
>>>
>>>Active: active since Wed 2016-05-04 08:53:30 PDT; 4min 6s ago
>>>
>>>
>>> May 04 08:53:30  systemd[1]: Reached target ceph target allowing to
>>> start/stop all ceph*@.service instances at once.
>>>
>>> May 04 08:53:30  systemd[1]: Starting ceph target allowing to start/stop all
>>> ceph*@.service instances at once.
>>>
>>> May 04 08:57:32  systemd[1]: Reached target ceph target allowing to
>>> start/stop all ceph*@.service instances at once.
>>>
>>>
>>> # systemctl status ceph-osd.target
>>>
>>> ● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service
>>> instances at once
>>>
>>>Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled; vendor
>>> preset: disabled)
>>>
>>>Active: active since Wed 2016-05-04 08:53:30 PDT; 4min 20s ago
>>>
>>>
>>> May 04 08:53:30  systemd[1]: Reached target ceph target allowing to
>>> start/stop all ceph-osd@.service instances at once.
>>>
>>> May 04 08:53:30  systemd[1]: Starting ceph target allowing to start/stop all
>>> ceph-osd@.service instances at once.
>>>
>>>
>>> # systemctl status ceph-osd@.service
>>>
>>> Failed to get properties: Unit name ceph-osd@.service is not valid.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_listinfo.cgi_ceph-2Dusers-2Dceph.com&d=CwIFaQ&c=lXkdEK1PC7UK9oKA-BBSI8p1AamzLOSncm6Vfn0C_UQ&r=CSYA9OS6Qd7fQySI2LDvlQ&m=ha3XvQGcc5Yztz98b7hb8pYQo14dcIiYxfOoMzyUM00&s=VdVOtGV4JQUKyQDDC_QYn1-7wBcSh-eYwx_cCSQWlQk&e=
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd prepare 10.1.2

2016-04-14 Thread Benjeman Meekhof
Hi Michael,

The partprobe issue was resolved for me by updating parted to the
package from Fedora 22: parted-3.2-16.fc22.x86_64.  It shouldn't
require any other dependency updates to install on EL7 varieties.

http://tracker.ceph.com/issues/15176
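
If it helps, the workaround boils down to installing that newer parted
package on the EL7 host (assuming you've downloaded the rpm locally;
the exact source is up to you):

  yum localinstall -y parted-3.2-16.fc22.x86_64.rpm
  rpm -q parted        # should now report parted-3.2-16.fc22
  partprobe /dev/sdi   # re-test against the device that failed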

regards,
Ben

On Thu, Apr 14, 2016 at 12:35 PM, Michael Hanscho  wrote:
> Hi!
>
> A fresh install of 10.1.2 on CentOS 7.2.1511 fails adding osds:
>
> [ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v
> prepare --cluster ceph --fs-type xfs -- /dev/sdm /dev/sdi
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
>
> The reason seems to be a failing partprobe command:
> [cestor1][WARNIN] update_partition: Calling partprobe on created device
> /dev/sdi
> [cestor1][WARNIN] command_check_call: Running command: /usr/bin/udevadm
> settle --timeout=600
> [cestor1][WARNIN] command: Running command: /sbin/partprobe /dev/sdi
> [cestor1][WARNIN] update_partition: partprobe /dev/sdi failed : Error:
> Error informing the kernel about modifications to partition /dev/sdi1 --
> Device or resource busy.  This means Linux won't know about any changes
> you made to /dev/sdi1 until you reboot -- so you shouldn't mount it or
> use it in any way before rebooting.
> [cestor1][WARNIN] Error: Failed to add partition 1 (Device or resource busy)
> [cestor1][WARNIN]  (ignored, waiting 60s)
>
> Attached ceph-deploy-osd-prepare-error.log with the details.
>
> Modifying ceph-disk to ignore the partprobe failing allows to proceed.
> Any hints?
>
> Gruesse
> Michael
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com