Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Wei Jin
>
> So, questions: does that really matter? What are possible impacts? What
> could have caused this 2 hosts to hold so many capabilities?
> 1 of the hosts are for tests purposes, traffic is close to zero. The other
> host wasn't using cephfs at all. All services stopped.
>

The reason might be updatedb program, you could forbid it to scan your
mount point.
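For example, with mlocate/updatedb this is usually a one-line change in /etc/updatedb.conf on the client (the mount point below is an assumption; use your actual CephFS mount path):

# skip the CephFS mount point entirely
PRUNEPATHS="/tmp /var/spool /media /mnt/cephfs"
# or exclude by filesystem type (type names can vary; kernel mounts show up as "ceph",
# ceph-fuse mounts typically as "fuse.ceph-fuse")
PRUNEFS="NFS nfs nfs4 ceph fuse.ceph-fuse"

The nightly updatedb run will then no longer walk the CephFS tree and pin all of those dentries and caps.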


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 4:44 PM, Webert de Souza Lima
 wrote:
> Hi Patrick,
>
> On Thu, Dec 14, 2017 at 7:52 PM, Patrick Donnelly 
> wrote:
>>
>>
>> It's likely you're a victim of a kernel backport that removed a dentry
>> invalidation mechanism for FUSE mounts. The result is that ceph-fuse
>> can't trim dentries.
>
>
> even though I'm not using FUSE? I'm using kernel mounts.
>
>
>>
>> I suggest setting that config manually to false on all of your clients
>
>
> Ok how do I do that?

I missed that you were using the kernel client. I agree with Zheng's analysis.

-- 
Patrick Donnelly


Re: [ceph-users] 1 osd Segmentation fault in test cluster

2017-12-14 Thread Konstantin Shalygin

> Is this useful for someone?
Yes!



See http://tracker.ceph.com/issues/21259

The latest luminous branch (which you can get from
https://shaman.ceph.com/builds/ceph/luminous/) has some additional
debugging on OSD shutdown that should help me figure out what is causing
this.  If this is something you can reproduce on your cluster, please
install the latest luminous and set 'osd debug shutdown = true' in the
[osd] section of your config, and then ceph-post-file the log after a


I don't know whether the fix has been backported to 12.2.2 or not, but today one of the OSDs hit this:

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: *** Caught signal 
(Segmentation fault) **
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: in thread 7f8f44b72700 
thread_name:bstore_mempool
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 1: (()+0xa339e1) [0x5629c08799e1]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 2: (()+0xf5e0) [0x7f8f4f63a5e0]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 3: 
(BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x2df) 
[0x5629c074665f]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 4: 
(BlueStore::Cache::trim(unsigned long, float, float, float)+0x1d1) 
[0x5629c0718d71]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 5: 
(BlueStore::MempoolThread::entry()+0x14d) [0x5629c071f4ad]

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 6: (()+0x7e25) [0x7f8f4f632e25]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 7: (clone()+0x6d) 
[0x7f8f4e72634d]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 2017-12-15 06:23:57.714362 
7f8f44b72700 -1 *** Caught signal (Segmentation fault) **
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: in thread 7f8f44b72700 
thread_name:bstore_mempool
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 1: (()+0xa339e1) [0x5629c08799e1]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 2: (()+0xf5e0) [0x7f8f4f63a5e0]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 3: 
(BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x2df) 
[0x5629c074665f]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 4: 
(BlueStore::Cache::trim(unsigned long, float, float, float)+0x1d1) 
[0x5629c0718d71]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 5: 
(BlueStore::MempoolThread::entry()+0x14d) [0x5629c071f4ad]

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 6: (()+0x7e25) [0x7f8f4f632e25]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 7: (clone()+0x6d) 
[0x7f8f4e72634d]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: NOTE: a copy of the 
executable, or `objdump -rdS ` is needed to interpret this.
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 0> 2017-12-15 06:23:57.714362 
7f8f44b72700 -1 *** Caught signal (Segmentation fault) **
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: in thread 7f8f44b72700 
thread_name:bstore_mempool
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 1: (()+0xa339e1) [0x5629c08799e1]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 2: (()+0xf5e0) [0x7f8f4f63a5e0]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 3: 
(BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x2df) 
[0x5629c074665f]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 4: 
(BlueStore::Cache::trim(unsigned long, float, float, float)+0x1d1) 
[0x5629c0718d71]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 5: 
(BlueStore::MempoolThread::entry()+0x14d) [0x5629c071f4ad]

Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 6: (()+0x7e25) [0x7f8f4f632e25]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: 7: (clone()+0x6d) 
[0x7f8f4e72634d]
Dec 15 06:23:57 ceph-osd0 ceph-osd[89499]: NOTE: a copy of the 
executable, or `objdump -rdS ` is needed to interpret this.
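For reference, a rough sketch of how the NOTE above and the earlier request for debug logs are usually satisfied (paths and the OSD id are assumptions; adjust to your installation):

# disassemble the crashing binary so the backtrace offsets can be resolved
objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump

# enable the extra shutdown debugging mentioned above in ceph.conf, then restart the OSD
[osd]
osd debug shutdown = true

# upload the resulting OSD log for the developers
ceph-post-file /var/log/ceph/ceph-osd.<id>.log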




k


Re: [ceph-users] Odd object blocking IO on PG

2017-12-14 Thread Brad Hubbard
On Wed, Dec 13, 2017 at 11:39 PM, Nick Fisk  wrote:
> Boom!! Fixed it. Not sure if the behavior I stumbled on is correct, but
> this has a potential to break a few things for people moving from Jewel to
> Luminous if they potentially had a few too many PG’s.
>
>
>
> Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68
> and saw this mentioned in the logs
>
>
>
> osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >=
> 400
>
>
>
> This made me search through the code for this warning string
>
>
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221
>
>
>
> Which jogged my memory about the changes in Luminous regarding max PG’s
> warning, and in particular these two config options
>
> mon_max_pg_per_osd
>
> osd_max_pg_per_osd_hard_ratio
>
>
>
> In my cluster I have just over 200 PG’s per OSD, but the node with OSD.68
> in, has 8TB disks instead of 3TB for the rest of the cluster. This means
> these OSD’s were taking a lot more PG’s than the average would suggest. So
> in Luminous 200x2 gives a hard limit of 400, which is what that error
> message in the log suggests is the limit. I set the
> osd_max_pg_per_osd_hard_ratio  option to 3 and restarted the OSD and hey
> presto everything fell into line.
>
>
>
> Now a question. I get the idea around these settings to stop making too many
> pools, or pools with too many PG’s. But is it correct they can break an existing
> pool which is maybe making the new PG on an OSD due to CRUSH layout being
> modified?

It would be good to capture this in a tracker, Nick, so it can be
explored in more depth.

>
>
>
> Nick
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 13 December 2017 11:14
> To: 'Gregory Farnum' 
> Cc: 'ceph-users' 
> Subject: Re: [ceph-users] Odd object blocking IO on PG
>
>
>
>
>
> On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk  wrote:
>
>
>> That doesn't look like an RBD object -- any idea who is
>> "client.34720596.1:212637720"?
>
> So I think these might be proxy ops from the cache tier, as there are also
> block ops on one of the cache tier OSD's, but this time it actually lists
> the object name. Block op on cache tier.
>
>"description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
> 17:f3831e75:::rbd_data.15a5e20238e1f29.000388ad:head [set-alloc-hint
> object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
> RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
> "initiated_at": "2017-12-12 16:25:32.435718",
> "age": 13996.681147,
> "duration": 13996.681203,
> "type_data": {
> "flag_point": "reached pg",
> "client_info": {
> "client": "client.34720596",
> "client_addr": "10.3.31.41:0/2600619462",
> "tid": 212637720
>
> I'm a bit baffled at the moment what's going. The pg query (attached) is not
> showing in the main status that it has been blocked from peering or that
> there are any missing objects. I've tried restarting all OSD's I can see
> relating to the PG in case they needed a bit of a nudge.
>
>
>
> Did that fix anything? I don't see anything immediately obvious but I'm not
> practiced in quickly reading that pg state output.
>
>
>
> What's the output of "ceph -s"?
>
>
>
> Hi Greg,
>
>
>
> No restarting OSD’s didn’t seem to help. But I did make some progress late
> last night. By stopping OSD.68 the cluster unlocks itself and IO can
> progress. However as soon as it starts back up, 0.1cf and a couple of other
> PG’s again get stuck in an activating state. If I out the OSD, either with
> it up or down, then some other PG’s seem to get hit by the same problem as
> CRUSH moves PG mappings around to other OSD’s.
>
>
>
> So there definitely seems to be some sort of weird peering issue somewhere.
> I have seen a very similar issue before on this cluster where after running
> the crush reweight script to balance OSD utilization, the weight got set too
> low and PG’s were unable to peer. I’m not convinced this is what’s happening
> here as all the weights haven’t changed, but I’m intending to explore this
> further just in case.
>
>
>
> With 68 down
>
> pgs: 1071783/48650631 objects degraded (2.203%)
>
>  5923 active+clean
>
>  399  active+undersized+degraded
>
>  7    active+clean+scrubbing+deep
>
>  7    active+clean+remapped
>
>
>
> With it up
>
> pgs: 0.047% pgs not active
>
>  67271/48651279 objects degraded (0.138%)
>
>  15602/48651279 objects misplaced (0.032%)
>
>  6051 active+clean
>
>  273  active+recovery_wait+degraded
>
>  4    active+clean+scrubbing+deep
>
>  4    active+remapped+backfill_wait
>
>  3    activating+remapped
>
> 
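For anyone else hitting the same "maybe_wait_for_max_pg withhold creation" symptom, a minimal sketch of the workaround Nick describes above (the value 3 is simply what worked for him; mon_max_pg_per_osd can be raised instead or as well):

# ceph.conf on the affected OSD nodes
[osd]
osd_max_pg_per_osd_hard_ratio = 3

# then restart the affected OSD (osd.68 in this thread)
systemctl restart ceph-osd@68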

[ceph-users] cephfs miss data for 15s when master mds rebooting

2017-12-14 Thread 13605702...@163.com
Hi,

I used 3 nodes to deploy MDS (each node also has a MON on it).

my config:
[mds.ceph-node-10-101-4-17]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.ceph-node-10-101-4-21]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.ceph-node-10-101-4-22]
mds_standby_replay = true
mds_standby_for_rank = 0

the mds stat:
e29: 1/1/1 up {0=ceph-node-10-101-4-22=up:active}, 1 up:standby-replay, 1 
up:standby

I mount CephFS on the client and run a test script that writes data
into a file under the CephFS dir.
When I reboot the master MDS, I find that data is not written into the file;
after 15 seconds, data can be written into the file again.

so my questions are:
is this normal when rebooting the master MDS?
when will the up:standby-replay MDS take over the CephFS?

thanks



13605702...@163.com
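A likely explanation, offered as an assumption: the ~15 second gap matches the default mds_beacon_grace of 15 seconds, i.e. the monitors wait that long without beacons before marking the active MDS failed and promoting the standby-replay daemon, so a pause of roughly that length is expected on an unclean reboot. A hedged example of shortening the window (lower values fail over faster but risk spurious failovers):

[global]
mds_beacon_interval = 4
mds_beacon_grace = 10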


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Yan, Zheng
On Fri, Dec 15, 2017 at 1:18 AM, Webert de Souza Lima
 wrote:
> Hi,
>
> I've been looking at Ceph MDS perf counters and I saw that one of my clusters
> was hugely different from the other in the number of caps:
>
> rlat inos  caps  | hsr  hcs   hcr | writ read actv  | recd recy stry  purg |
> segs evts subm
>   0  3.0M 5.1M |  0 0 595 | 30440 |  0   0   13k   0
> | 42 35k   893
>   0  3.0M 5.1M |  0 0 165 | 1.8k   437   |  0   0   13k   0
> | 43 36k   302
> 16  3.0M 5.1M |  0 0 429 | 24794 |  0   0   13k   58
> | 38 32k   1.7k
>   0  3.0M 5.1M |  0 1 213 | 1.2k   0857 |  0   0   13k   0
> | 40 33k   766
> 23  3.0M 5.1M |  0 0 945 | 44510 |  0   0   13k   0
> | 41 34k   1.1k
>   0  3.0M 5.1M |  0 2 696 | 376   11   0 |  0   0   13k   0
> | 43 35k   1.0k
>   3  2.9M 5.1M |  0 0 601 | 2.0k   60 |  0   0   13k
> 56| 38 29k   1.2k
>   0  2.9M 5.1M |  0 0 394 | 272   11   0 |  0   0   13k   0
> | 38 30k   758
>
> on another cluster running the same version:
>
> -mds-- --mds_server-- ---objecter--- -mds_cache-
> ---mds_log
> rlat inos caps  | hsr  hcs  hcr  | writ read actv | recd recy stry purg |
> segs evts subm
>   2  3.9M 380k |  01 266 | 1.8k   0   370  |  0   0   24k  44
> |  37  129k  1.5k
>
>
> I did a perf dump on the active mds:
>
> ~# ceph daemon mds.a perf dump mds
> {
> "mds": {
> "request": 2245276724,
> "reply": 2245276366,
> "reply_latency": {
> "avgcount": 2245276366,
> "sum": 18750003.074118977
> },
> "forward": 0,
> "dir_fetch": 20217943,
> "dir_commit": 555295668,
> "dir_split": 0,
> "inode_max": 300,
> "inodes": 3000276,
> "inodes_top": 152555,
> "inodes_bottom": 279938,
> "inodes_pin_tail": 2567783,
> "inodes_pinned": 2782064,
> "inodes_expired": 308697104,
> "inodes_with_caps": 2779658,
> "caps": 5147887,
> "subtrees": 2,
> "traverse": 2582452087,
> "traverse_hit": 2338123987,
> "traverse_forward": 0,
> "traverse_discover": 0,
> "traverse_dir_fetch": 16627249,
> "traverse_remote_ino": 29276,
> "traverse_lock": 2507504,
> "load_cent": 18446743868740589422,
> "q": 27,
> "exported": 0,
> "exported_inodes": 0,
> "imported": 0,
> "imported_inodes": 0
> }
> }
>
> and then a session ls to see what clients could be holding that much:
>
>{
>   "client_metadata" : {
>  "entity_id" : "admin",
>  "kernel_version" : "4.4.0-97-generic",
>  "hostname" : "suppressed"
>   },
>   "completed_requests" : 0,
>   "id" : 1165169,
>   "num_leases" : 343,
>   "inst" : "client.1165169 10.0.0.112:0/982172363",
>   "state" : "open",
>   "num_caps" : 111740,
>   "reconnecting" : false,
>   "replay_requests" : 0
>},
>{
>   "state" : "open",
>   "replay_requests" : 0,
>   "reconnecting" : false,
>   "num_caps" : 108125,
>   "id" : 1236036,
>   "completed_requests" : 0,
>   "client_metadata" : {
>  "hostname" : "suppressed",
>  "kernel_version" : "4.4.0-97-generic",
>  "entity_id" : "admin"
>   },
>   "num_leases" : 323,
>   "inst" : "client.1236036 10.0.0.113:0/1891451616"
>},
>{
>   "num_caps" : 63186,
>   "reconnecting" : false,
>   "replay_requests" : 0,
>   "state" : "open",
>   "num_leases" : 147,
>   "completed_requests" : 0,
>   "client_metadata" : {
>  "kernel_version" : "4.4.0-75-generic",
>  "entity_id" : "admin",
>  "hostname" : "suppressed"
>   },
>   "id" : 1235930,
>   "inst" : "client.1235930 10.0.0.110:0/2634585537"
>},
>{
>   "num_caps" : 2476444,
>   "replay_requests" : 0,
>   "reconnecting" : false,
>   "state" : "open",
>   "num_leases" : 0,
>   "completed_requests" : 0,
>   "client_metadata" : {
>  "entity_id" : "admin",
>  "kernel_version" : "4.4.0-75-generic",
>  "hostname" : "suppressed"
>   },
>   "id" : 1659696,
>   "inst" : "client.1659696 10.0.0.101:0/4005556527"
>},
>{
>   "state" : "open",
>   "replay_requests" : 0,
>   "reconnecting" : false,
>   "num_caps" : 2386376,
>   "id" : 1069714,
>   "client_metadata" : {
>  "hostname" : "suppressed",
>  "kernel_version" : "4.4.0-75-generic",
>  "entity_id" : "admin"
>   },
>   "completed_requests" : 0,
>   "num_leases" : 0,
>   "inst" : "client.1069714 10.0.0.111:0/1876172355"
>},
>{
>   "replay_requests" : 0,
>   "reconnecting" : false,
>   "num_caps" : 1726,
>   

Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Webert de Souza Lima
Hi Patrick,

On Thu, Dec 14, 2017 at 7:52 PM, Patrick Donnelly 
 wrote:

>
> It's likely you're a victim of a kernel backport that removed a dentry
> invalidation mechanism for FUSE mounts. The result is that ceph-fuse
> can't trim dentries.
>

even though I'm not using FUSE? I'm using kernel mounts.



> I suggest setting that config manually to false on all of your clients
>

Ok how do I do that?

Many thanks.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


[ceph-users] S3 objects deleted but storage doesn't free space

2017-12-14 Thread Jan-Willem Michels


Hi there all,
Perhaps someone can help.

We tried to free some storage, so we deleted a lot of S3 objects. The bucket
also holds valuable data, so we can't delete the whole bucket.
Everything went fine, but the used storage space doesn't shrink. We are
expecting several TB of data to be freed.


We then learned about garbage collection, so we thought we'd wait. But
even days later there was no real change.
We started "radosgw-admin gc process", which never finished or
displayed any error or other output.
We could not find anything like a --verbose or debug flag for this command, nor a
log that would show what radosgw-admin is doing while it works.


We tried to change the default settings, which we got from an old posting.
We put them in [global] and also tried [client.rgw..]:
rgw_gc_max_objs = 7877 (but also rgw_gc_max_objs = 200 and
rgw_gc_max_objs = 1000)

rgw_lc_max_objs = 7877
rgw_gc_obj_min_wait = 300
rgw_gc_processor_period = 600
rgw_gc_processor_max_time = 600

We restarted ceph-radosgw several times, and the machines too, over a
period of days, and ran radosgw-admin gc process a few more times.
We did not find any references in the radosgw logs like "gc:: delete", but we
don't know exactly what to look for.
The system is healthy, no errors or warnings, but it is in use (we are
loading data into it) -> will GC only run when idle?


When we count them with "radosgw-admin gc list | grep oid | wc -l" we get
11:00 18.086.665 objects
13:00 18.086.665 objects
15:00 18.086.665 objects
so no change in objects after hours

When we list "radosgw-admin gc list" we get files like
 radosgw-admin gc list | more
[
{
"tag": "b5687590-473f-4386-903f-d91a77b8d5cd.7354141.21122\u",
"time": "2017-12-06 11:04:56.0.459704s",
"objs": [
{
"pool": "default.rgw.buckets.data",
"oid": 
"b5687590-473f-4386-903f-d91a77b8d5cd.44121.4__shadow_.5OtA02n_GU8TkP08We_SLrT5GL1ihuS_1",

"key": "",
"instance": ""
},
{
"pool": "default.rgw.buckets.data",
"oid": 
"b5687590-473f-4386-903f-d91a77b8d5cd.44121.4__shadow_.5OtA02n_GU8TkP08We_SLrT5GL1ihuS_2",

"key": "",
"instance": ""
},
{
"pool": "default.rgw.buckets.data",
"oid": 
"b5687590-473f-4386-903f-d91a77b8d5cd.44121.4__shadow_.5OtA02n_GU8TkP08We_SLrT5GL1ihuS_3",

"key": "",
"instance": ""
},

A few questions ->

Who purges the GC list? Is it done on the radosgw machines, or is it
distributed across the OSDs?
Where do I have to change the default "rgw_gc_max_objs = 1000"? We tried
everywhere. We used "tell" to change it on the OSD and MON systems
and also on the RGW endpoints, which we restarted.


We have two radosgw endpoints. Is there a lock so that only one will act,
or will they both try to delete? Can we release / display such a lock?


How can I debug the radosgw-admin tool? In which log files should we
look, and what would an example message look like?


If I know an oid like the ones above, can I manually delete such an oid?

Suppose we deleted the complete bucket with "radosgw-admin bucket
rm --bucket=mybucket --purge-objects --inconsistent-index" - would that
also get rid of the GC entries that are already there?


Thanks  ahead for your time,

JW Michels
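One note on the "where do I change rgw_gc_max_objs" question above: the rgw_gc_* options are read by the radosgw process itself, not by the OSDs or MONs, so they belong in the [client.rgw.<name>] (or [global]) section of ceph.conf on the gateway hosts, followed by a radosgw restart. A hedged way to verify what a running gateway actually uses is its admin socket (the socket name/path is an assumption; check /var/run/ceph/ on the gateway host):

ceph daemon /var/run/ceph/ceph-client.rgw.<gateway-name>.asok config show | grep rgw_gc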







q


Re: [ceph-users] Understanding reshard issues

2017-12-14 Thread Graham Allan



On 12/14/2017 04:00 AM, Martin Emrich wrote:

Hi!

On 13.12.17 at 20:50, Graham Allan wrote:
After our Jewel to Luminous 12.2.2 upgrade, I ran into some of the 
same issues reported earlier on the list under "rgw resharding 
operation seemingly won't end". 


Yes, that were/are my threads, I also have this issue.


I was able to correct the buckets using "radosgw-admin bucket check 
--fix" command, and later disabled the auto resharding.


Were you able to manually reshard a bucket after the "--fix"? Here, 
after a bucket was damaged once, the manual reshard process will freeze.


Interesting... the test bucket I tried to reshard below was one that had 
previously needed "bucket check --fix".


I just tried the same thing on another old (and small, ~100 object) 
bucket which had not previously seen problems - I got the same hang.


Although, I was doing a "reshard add" and "reshard execute" on the 
bucket which I guess is more of a manually triggered automatic reshard, 
as opposed to a true manual "bucket reshard" command. Having said that, 
the manual "bucket reshard" command also now freezes on that bucket.


As an experiment, I selected an unsharded bucket to attempt a manual 
reshard. I added it the reshard list ,then ran "radosgw-admin reshard 
execute". The bucket in question contains 184000 objects and was being 
converted from 1 to 3 shards.


I'm trying to understand what I found...

1) the "radosgw-admin reshard execute" never returned. Somehow I 
expected it to kick off a background operation, but possibly this was 
mistaken.


Yes, same behaviour here. Someone on the list mentioned that resharding 
should actually happen quite fast (at most a few minutes).


So there's clearly something wrong here, and I am glad I am not the only 
one experiencing it.


To compare: What is your infrastructure? mine is:

* three beefy hosts (64GB RAM) with 4 OSDs each for data (HDD), and 2 
OSDs each on SSDs for the index.

* all bluestore (DB/WAL for the HDD OSDs also on SSD partitions)
* radosgw runs on each of these OSD hosts (as they are mostly idling, I 
see no cause for my poor performance in running the rados gateways on 
the OSD hosts)

* 3 separate monitor/mgr hosts
* OS is CentOS 7, running Ceph 12.2.2
* We use several buckets, all with Versioning enabled, for many (100k to 
12M) rather small objects.


This cluster has been around for some time (since firefly), and is 
running ubuntu 14.04. I will be converting it to Centos 7 over the next 
few weeks or months. It's only used for object store, no rbd or cephfs.


3 dedicated mons
9 large osd nodes with ~60x 6TB osds each, plus a handful of SSDs
4 radosgw nodes (2 ubuntu, 2 centos 7)

The radosgw main storage pools are ec42 filestore spinning drives, the 
indexes are on 3-way replicated filestore ssds.


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
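For what it's worth, a few commands that can help narrow down where a stuck reshard sits, assuming your 12.2.x build has them (check radosgw-admin --help):

radosgw-admin reshard list
radosgw-admin reshard status --bucket=<bucket-name>
# if a stale reshard entry is blocking further attempts, it can be dropped:
radosgw-admin reshard cancel --bucket=<bucket-name>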



Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

2017-12-14 Thread Cary
James,

 Usually, once the misplaced data has balanced out, the cluster should
reach a healthy state. If you run "ceph health detail", Ceph will
show you some more detail about what is happening. Is Ceph still
recovering, or has it stalled? Has the "objects misplaced (62.511%)"
changed to a lower %?

Cary
-Dynamic
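A simple way to watch the recovery counters James mentions below is just to poll ceph -s, e.g. (illustrative only):

watch -n 10 'ceph -s | egrep "degraded|misplaced|backfill|recover"'

Both the degraded and misplaced percentages should trend toward zero as backfill proceeds.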

On Thu, Dec 14, 2017 at 10:52 PM, James Okken  wrote:
> Thanks Cary!
>
> Your directions worked on my first server (once I found the missing carriage
> return in your list of commands; the email must have messed it up).
>
> For anyone else:
> chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 ceph auth add osd.4 osd 'allow *' 
> mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring
> really is 2 commands:
> chown -R ceph:ceph /var/lib/ceph/osd/ceph-4
>  and
> ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i 
> /etc/ceph/ceph.osd.4.keyring
>
> Cary, what am I looking for in ceph -w and ceph -s to show the status of the 
> data moving?
> Seems like the data is moving and that I have some issue...
>
> root@node-53:~# ceph -w
> cluster 2b9f7957-d0db-481e-923e-89972f6c594f
>  health HEALTH_WARN
> 176 pgs backfill_wait
> 1 pgs backfilling
> 27 pgs degraded
> 1 pgs recovering
> 26 pgs recovery_wait
> 27 pgs stuck degraded
> 204 pgs stuck unclean
> recovery 10322/84644 objects degraded (12.195%)
> recovery 52912/84644 objects misplaced (62.511%)
>  monmap e3: 3 mons at 
> {node-43=192.168.1.7:6789/0,node-44=192.168.1.5:6789/0,node-45=192.168.1.3:6789/0}
> election epoch 138, quorum 0,1,2 node-45,node-44,node-43
>  osdmap e206: 4 osds: 4 up, 4 in; 177 remapped pgs
> flags sortbitwise,require_jewel_osds
>   pgmap v3936175: 512 pgs, 5 pools, 333 GB data, 58184 objects
> 370 GB used, 5862 GB / 6233 GB avail
> 10322/84644 objects degraded (12.195%)
> 52912/84644 objects misplaced (62.511%)
>  308 active+clean
>  176 active+remapped+wait_backfill
>   26 active+recovery_wait+degraded
>    1 active+remapped+backfilling
>    1 active+recovering+degraded
> recovery io 100605 kB/s, 14 objects/s
>   client io 0 B/s rd, 92788 B/s wr, 50 op/s rd, 11 op/s wr
>
> 2017-12-14 22:45:57.459846 mon.0 [INF] pgmap v3936174: 512 pgs: 1 activating, 
> 1 active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
> active+remapped+backfilling, 307 active+clean, 176 
> active+remapped+wait_backfill; 333 GB data, 369 GB used, 5863 GB / 6233 GB 
> avail; 0 B/s rd, 101107 B/s wr, 19 op/s; 10354/84644 objects degraded 
> (12.232%); 52912/84644 objects misplaced (62.511%); 12224 kB/s, 2 objects/s 
> recovering
> 2017-12-14 22:45:58.466736 mon.0 [INF] pgmap v3936175: 512 pgs: 1 
> active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
> active+remapped+backfilling, 308 active+clean, 176 
> active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
> avail; 0 B/s rd, 92788 B/s wr, 61 op/s; 10322/84644 objects degraded 
> (12.195%); 52912/84644 objects misplaced (62.511%); 100605 kB/s, 14 objects/s 
> recovering
> 2017-12-14 22:46:00.474335 mon.0 [INF] pgmap v3936176: 512 pgs: 1 
> active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
> active+remapped+backfilling, 308 active+clean, 176 
> active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
> avail; 0 B/s rd, 434 kB/s wr, 45 op/s; 10322/84644 objects degraded 
> (12.195%); 52912/84644 objects misplaced (62.511%); 84234 kB/s, 10 objects/s 
> recovering
> 2017-12-14 22:46:02.482228 mon.0 [INF] pgmap v3936177: 512 pgs: 1 
> active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
> active+remapped+backfilling, 308 active+clean, 176 
> active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
> avail; 0 B/s rd, 334 kB/s wr
>
>
> -Original Message-
> From: Cary [mailto:dynamic.c...@gmail.com]
> Sent: Thursday, December 14, 2017 4:21 PM
> To: James Okken
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)
>
> Jim,
>
> I am not an expert, but I believe I can assist.
>
>  Normally you will only have 1 OSD per drive. I have heard discussions about 
> using multiple OSDs per disk, when using SSDs though.
>
>  Once your drives have been installed you will have to format them, unless 
> you are using Bluestore. My steps for formatting are below.
> Replace the sXX with your drive name.
>
> parted -a optimal /dev/sXX
> print
> mklabel gpt
> unit mib
> mkpart OSD4sdd1 1 -1
> quit
> mkfs.xfs -f /dev/sXX1
>
> # Run blkid, and copy the UUID for the newly formatted drive.
> blkid
> # Add the mount point/UUID to fstab. The mount point will be created later.
> vi /etc/fstab
> # For example
> UUID=6386bac4-7fef-3cd2-7d64-13db51d83b12 /var/lib/ceph/osd/ceph-4 

Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

2017-12-14 Thread James Okken
Thanks Cary!

Your directions worked on my first server (once I found the missing carriage
return in your list of commands; the email must have messed it up).

For anyone else: 
chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 ceph auth add osd.4 osd 'allow *' 
mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring
really is 2 commands: 
chown -R ceph:ceph /var/lib/ceph/osd/ceph-4
 and 
ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i 
/etc/ceph/ceph.osd.4.keyring

Cary, what am I looking for in ceph -w and ceph -s to show the status of the 
data moving?
Seems like the data is moving and that I have some issue...

root@node-53:~# ceph -w
cluster 2b9f7957-d0db-481e-923e-89972f6c594f
 health HEALTH_WARN
176 pgs backfill_wait
1 pgs backfilling
27 pgs degraded
1 pgs recovering
26 pgs recovery_wait
27 pgs stuck degraded
204 pgs stuck unclean
recovery 10322/84644 objects degraded (12.195%)
recovery 52912/84644 objects misplaced (62.511%)
 monmap e3: 3 mons at 
{node-43=192.168.1.7:6789/0,node-44=192.168.1.5:6789/0,node-45=192.168.1.3:6789/0}
election epoch 138, quorum 0,1,2 node-45,node-44,node-43
 osdmap e206: 4 osds: 4 up, 4 in; 177 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v3936175: 512 pgs, 5 pools, 333 GB data, 58184 objects
370 GB used, 5862 GB / 6233 GB avail
10322/84644 objects degraded (12.195%)
52912/84644 objects misplaced (62.511%)
 308 active+clean
 176 active+remapped+wait_backfill
  26 active+recovery_wait+degraded
   1 active+remapped+backfilling
   1 active+recovering+degraded
recovery io 100605 kB/s, 14 objects/s
  client io 0 B/s rd, 92788 B/s wr, 50 op/s rd, 11 op/s wr

2017-12-14 22:45:57.459846 mon.0 [INF] pgmap v3936174: 512 pgs: 1 activating, 1 
active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
active+remapped+backfilling, 307 active+clean, 176 
active+remapped+wait_backfill; 333 GB data, 369 GB used, 5863 GB / 6233 GB 
avail; 0 B/s rd, 101107 B/s wr, 19 op/s; 10354/84644 objects degraded 
(12.232%); 52912/84644 objects misplaced (62.511%); 12224 kB/s, 2 objects/s 
recovering
2017-12-14 22:45:58.466736 mon.0 [INF] pgmap v3936175: 512 pgs: 1 
active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
active+remapped+backfilling, 308 active+clean, 176 
active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
avail; 0 B/s rd, 92788 B/s wr, 61 op/s; 10322/84644 objects degraded (12.195%); 
52912/84644 objects misplaced (62.511%); 100605 kB/s, 14 objects/s recovering
2017-12-14 22:46:00.474335 mon.0 [INF] pgmap v3936176: 512 pgs: 1 
active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
active+remapped+backfilling, 308 active+clean, 176 
active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
avail; 0 B/s rd, 434 kB/s wr, 45 op/s; 10322/84644 objects degraded (12.195%); 
52912/84644 objects misplaced (62.511%); 84234 kB/s, 10 objects/s recovering
2017-12-14 22:46:02.482228 mon.0 [INF] pgmap v3936177: 512 pgs: 1 
active+recovering+degraded, 26 active+recovery_wait+degraded, 1 
active+remapped+backfilling, 308 active+clean, 176 
active+remapped+wait_backfill; 333 GB data, 370 GB used, 5862 GB / 6233 GB 
avail; 0 B/s rd, 334 kB/s wr


-Original Message-
From: Cary [mailto:dynamic.c...@gmail.com] 
Sent: Thursday, December 14, 2017 4:21 PM
To: James Okken
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

Jim,

I am not an expert, but I believe I can assist.

 Normally you will only have 1 OSD per drive. I have heard discussions about 
using multiple OSDs per disk, when using SSDs though.

 Once your drives have been installed you will have to format them, unless you 
are using Bluestore. My steps for formatting are below.
Replace the sXX with your drive name.

parted -a optimal /dev/sXX
print
mklabel gpt
unit mib
mkpart OSD4sdd1 1 -1
quit
mkfs.xfs -f /dev/sXX1

# Run blkid, and copy the UUID for the newly formatted drive.
blkid
# Add the mount point/UUID to fstab. The mount point will be created later.
vi /etc/fstab
# For example
UUID=6386bac4-7fef-3cd2-7d64-13db51d83b12 /var/lib/ceph/osd/ceph-4 xfs
rw,noatime,inode64,logbufs=8 0 0


# You can then add the OSD to the cluster.

uuidgen
# Replace the UUID below with the UUID that was created with uuidgen.
ceph osd create 23e734d7-96d8-4327-a2b9-0fbdc72ed8f1

# Notice what number of osd it creates usually the lowest # OSD available.

# Add osd.4 to ceph.conf on all Ceph nodes.
vi /etc/ceph/ceph.conf
...
[osd.4]
public addr = 172.1.3.1
cluster addr = 10.1.3.1
...

# Now add the mount point.
mkdir -p /var/lib/ceph/osd/ceph-4
chown -R ceph:ceph /var/lib/ceph/osd/ceph-4

# The command below mounts everything in fstab.
mount -a
# The 

Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

2017-12-14 Thread Ronny Aasen

On 14.12.2017 18:34, James Okken wrote:

Hi all,

Please let me know if I am missing steps or using the wrong steps

I'm hoping to expand my small CEPH cluster by adding 4TB hard drives to each of 
the 3 servers in the cluster.

I also need to change my replication factor from 1 to 3.
This is part of an Openstack environment deployed by Fuel and I had foolishly 
set my replication factor to 1 in the Fuel settings before deploy. I know this 
would have been done better at the beginning. I do want to keep the current 
cluster and not start over. I know this is going to thrash my cluster for a while 
replicating, but there isn't too much data on it yet.


To start I need to safely turn off each CEPH server and add in the 4TB drive:
To do that I am going to run:
ceph osd set noout
systemctl stop ceph-osd@1 (or 2 or 3 on the other servers)
ceph osd tree (to verify it is down)
poweroff, install the 4TB drive, bootup again
ceph osd unset noout



Next step would be to get CEPH to use the 4TB drives. Each CEPH server already 
has a 836GB OSD.

ceph> osd df
ID WEIGHT  REWEIGHT SIZE  USE  AVAIL %USE  VAR  PGS
  0 0.81689  1.0  836G 101G  734G 12.16 0.90 167
  1 0.81689  1.0  836G 115G  721G 13.76 1.02 166
  2 0.81689  1.0  836G 121G  715G 14.49 1.08 179
   TOTAL 2509G 338G 2171G 13.47
MIN/MAX VAR: 0.90/1.08  STDDEV: 0.97

ceph> df
GLOBAL:
 SIZE  AVAIL RAW USED %RAW USED
 2509G 2171G 338G 13.47
POOLS:
 NAMEID USED %USED MAX AVAIL OBJECTS
 rbd 0 0 0 2145G   0
 images  1  216G  9.15 2145G   27745
 backups 2 0 0 2145G   0
 volumes 3  114G  5.07 2145G   29717
 compute 4 0 0 2145G   0


Once I get the 4TB drive into each CEPH server should I look to increasing the 
current OSD (ie: to 4836GB)?
Or create a second 4000GB OSD on each CEPH server?
If I am going to create a second OSD on each CEPH server I hope to use this doc:
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/



As far as changing the replication factor from 1 to 3:
Here are my pools now:

ceph osd pool ls detail
pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'images' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 116 flags hashpspool stripe_width 0
 removed_snaps [1~3,b~6,12~8,20~2,24~6,2b~8,34~2,37~20]
pool 2 'backups' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 7 flags hashpspool stripe_width 0
pool 3 'volumes' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 73 flags hashpspool stripe_width 0
 removed_snaps [1~3]
pool 4 'compute' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 34 flags hashpspool stripe_width 0

I plan on using these steps I saw online:
ceph osd pool set rbd size 3
ceph -s  (Verify that replication completes successfully)
ceph osd pool set images size 3
ceph -s
ceph osd pool set backups size 3
ceph -s
ceph osd pool set volumes size 3
ceph -s


please let me know any advice or better methods...


You normally want each drive to be its own OSD. It is the number of
OSDs that gives Ceph its scalability, so more OSDs = more aggregate
performance. The only exception is if you are limited by something like CPU
or RAM and must limit the OSD count because of that.


Also remember to raise your min_size from 1 to the default 2. With 1, your
cluster will accept writes with only a single operational OSD, and if
that one fails you will have data loss, corruption and inconsistencies.


You might also consider raising your size and min_size before taking down
an OSD, since you will obviously have the PGs on that OSD unavailable,
and you may want the extra redundancy before shaking the tree.
With max usage at 15% on the most-used OSD you should have the space for it.



good luck
Ronny Aasen
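A minimal sketch of the min_size change Ronny recommends, mirroring the size commands already listed above (run per pool once size is at 3):

ceph osd pool set rbd min_size 2
ceph osd pool set images min_size 2
ceph osd pool set backups min_size 2
ceph osd pool set volumes min_size 2
ceph osd pool set compute min_size 2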


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 9:18 AM, Webert de Souza Lima
 wrote:
> So, questions: does that really matter? What are possible impacts? What
> could have caused this 2 hosts to hold so many capabilities?
> 1 of the hosts are for tests purposes, traffic is close to zero. The other
> host wasn't using cephfs at all. All services stopped.

It's likely you're a victim of a kernel backport that removed a dentry
invalidation mechanism for FUSE mounts. The result is that ceph-fuse
can't trim dentries. We have a patch to turn off that particular
mechanism by default:

https://github.com/ceph/ceph/pull/17925

I suggest setting that config manually to false on all of your clients
and ensure each client can remount itself to trim dentries (i.e. it's
being run as root or with sufficient capabiltities) which is a
fallback mechanism.

-- 
Patrick Donnelly
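For the "how do I do that" question: a client-side option like this normally goes in the [client] section of ceph.conf on each client host and is picked up when ceph-fuse (re)mounts. A minimal sketch, assuming the option name from the linked PR is client_try_dentry_invalidate (check the PR / your release for the exact name):

[client]
client_try_dentry_invalidate = false

This only applies to ceph-fuse mounts; as noted earlier in the thread, it is not relevant for kernel clients.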


Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

2017-12-14 Thread Cary
Jim,

I am not an expert, but I believe I can assist.

 Normally you will only have 1 OSD per drive. I have heard discussions
about using multiple OSDs per disk, when using SSDs though.

 Once your drives have been installed you will have to format them,
unless you are using Bluestore. My steps for formatting are below.
Replace the sXX with your drive name.

parted -a optimal /dev/sXX
print
mklabel gpt
unit mib
mkpart OSD4sdd1 1 -1
quit
mkfs.xfs -f /dev/sXX1

# Run blkid, and copy the UUID for the newly formatted drive.
blkid
# Add the mount point/UUID to fstab. The mount point will be created later.
vi /etc/fstab
# For example
UUID=6386bac4-7fef-3cd2-7d64-13db51d83b12 /var/lib/ceph/osd/ceph-4 xfs
rw,noatime,inode64,logbufs=8 0 0


# You can then add the OSD to the cluster.

uuidgen
# Replace the UUID below with the UUID that was created with uuidgen.
ceph osd create 23e734d7-96d8-4327-a2b9-0fbdc72ed8f1

# Notice what number of osd it creates usually the lowest # OSD available.

# Add osd.4 to ceph.conf on all Ceph nodes.
vi /etc/ceph/ceph.conf
...
[osd.4]
public addr = 172.1.3.1
cluster addr = 10.1.3.1
...

# Now add the mount point.
mkdir -p /var/lib/ceph/osd/ceph-4
chown -R ceph:ceph /var/lib/ceph/osd/ceph-4

# The command below mounts everything in fstab.
mount -a
# The number after -i below needs changed to the correct OSD ID, and
the osd-uuid needs to be changed the UUID created with uuidgen above.
Your keyring location may be different and need changed as well.
ceph-osd -i 4 --mkfs --mkkey --osd-uuid 23e734d7-96d8-4327-a2b9-0fbdc72ed8f1
chown -R ceph:ceph /var/lib/ceph/osd/ceph-4
ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i
/etc/ceph/ceph.osd.4.keyring

# Add the new OSD to its host in the crush map.
ceph osd crush add osd.4 .0 host=YOURhostNAME

# Since the weight used in the previous step was .0, you will need to
increase it. I use 1 for a 1TB drive and 5 for a 5TB drive. The
command below will reweight osd.4 to 1. You may need to slowly ramp up
this number. ie .10 then .20 etc.
ceph osd crush reweight osd.4 1

You should now be able to start the drive. You can watch the data move
to the drive with a ceph -w. Once data has migrated to the drive,
start the next.

Cary
-Dynamic

On Thu, Dec 14, 2017 at 5:34 PM, James Okken  wrote:
> Hi all,
>
> Please let me know if I am missing steps or using the wrong steps
>
> I'm hoping to expand my small CEPH cluster by adding 4TB hard drives to each 
> of the 3 servers in the cluster.
>
> I also need to change my replication factor from 1 to 3.
> This is part of an Openstack environment deployed by Fuel and I had foolishly 
> set my replication factor to 1 in the Fuel settings before deploy. I know 
> this would have been done better at the beginning. I do want to keep the 
> current cluster and not start over. I know this is going to thrash my cluster 
> for a while replicating, but there isn't too much data on it yet.
>
>
> To start I need to safely turn off each CEPH server and add in the 4TB drive:
> To do that I am going to run:
> ceph osd set noout
> systemctl stop ceph-osd@1 (or 2 or 3 on the other servers)
> ceph osd tree (to verify it is down)
> poweroff, install the 4TB drive, bootup again
> ceph osd unset noout
>
>
>
> Next step would be to get CEPH to use the 4TB drives. Each CEPH server 
> already has a 836GB OSD.
>
> ceph> osd df
> ID WEIGHT  REWEIGHT SIZE  USE  AVAIL %USE  VAR  PGS
>  0 0.81689  1.0  836G 101G  734G 12.16 0.90 167
>  1 0.81689  1.0  836G 115G  721G 13.76 1.02 166
>  2 0.81689  1.0  836G 121G  715G 14.49 1.08 179
>   TOTAL 2509G 338G 2171G 13.47
> MIN/MAX VAR: 0.90/1.08  STDDEV: 0.97
>
> ceph> df
> GLOBAL:
> SIZE  AVAIL RAW USED %RAW USED
> 2509G 2171G 338G 13.47
> POOLS:
> NAMEID USED %USED MAX AVAIL OBJECTS
> rbd 0 0 0 2145G   0
> images  1  216G  9.15 2145G   27745
> backups 2 0 0 2145G   0
> volumes 3  114G  5.07 2145G   29717
> compute 4 0 0 2145G   0
>
>
> Once I get the 4TB drive into each CEPH server should I look to increasing 
> the current OSD (ie: to 4836GB)?
> Or create a second 4000GB OSD on each CEPH server?
> If I am going to create a second OSD on each CEPH server I hope to use this 
> doc:
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/
>
>
>
> As far as changing the replication factor from 1 to 3:
> Here are my pools now:
>
> ceph osd pool ls detail
> pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
> rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
> pool 1 'images' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
> rjenkins pg_num 64 pgp_num 64 last_change 116 flags hashpspool stripe_width 0
> removed_snaps 
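A small sketch of the gradual reweight ramp-up described in Cary's steps above (step size and interval are arbitrary; watch ceph -s between steps and pause if client I/O suffers):

for w in 0.2 0.4 0.6 0.8 1.0; do
    ceph osd crush reweight osd.4 $w
    sleep 600
done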

[ceph-users] add hard drives to 3 CEPH servers (3 server cluster)

2017-12-14 Thread James Okken
Hi all,

Please let me know if I am missing steps or using the wrong steps

I'm hoping to expand my small CEPH cluster by adding 4TB hard drives to each of 
the 3 servers in the cluster.

I also need to change my replication factor from 1 to 3.
This is part of an Openstack environment deployed by Fuel and I had foolishly 
set my replication factor to 1 in the Fuel settings before deploy. I know this 
would have been done better at the beginning. I do want to keep the current 
cluster and not start over. I know this is going to thrash my cluster for a while 
replicating, but there isn't too much data on it yet.


To start I need to safely turn off each CEPH server and add in the 4TB drive:
To do that I am going to run:
ceph osd set noout
systemctl stop ceph-osd@1 (or 2 or 3 on the other servers)
ceph osd tree (to verify it is down)
poweroff, install the 4TB drive, bootup again
ceph osd unset noout



Next step would be to get CEPH to use the 4TB drives. Each CEPH server already 
has a 836GB OSD.

ceph> osd df
ID WEIGHT  REWEIGHT SIZE  USE  AVAIL %USE  VAR  PGS
 0 0.81689  1.0  836G 101G  734G 12.16 0.90 167
 1 0.81689  1.0  836G 115G  721G 13.76 1.02 166
 2 0.81689  1.0  836G 121G  715G 14.49 1.08 179
  TOTAL 2509G 338G 2171G 13.47
MIN/MAX VAR: 0.90/1.08  STDDEV: 0.97

ceph> df
GLOBAL:
SIZE  AVAIL RAW USED %RAW USED
2509G 2171G 338G 13.47
POOLS:
NAMEID USED %USED MAX AVAIL OBJECTS
rbd 0 0 0 2145G   0
images  1  216G  9.15 2145G   27745
backups 2 0 0 2145G   0
volumes 3  114G  5.07 2145G   29717
compute 4 0 0 2145G   0


Once I get the 4TB drive into each CEPH server should I look to increasing the 
current OSD (ie: to 4836GB)?
Or create a second 4000GB OSD on each CEPH server?
If I am going to create a second OSD on each CEPH server I hope to use this doc:
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/



As far as changing the replication factor from 1 to 3:
Here are my pools now:

ceph osd pool ls detail
pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'images' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 116 flags hashpspool stripe_width 0
removed_snaps [1~3,b~6,12~8,20~2,24~6,2b~8,34~2,37~20]
pool 2 'backups' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 7 flags hashpspool stripe_width 0
pool 3 'volumes' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 73 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 4 'compute' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 34 flags hashpspool stripe_width 0

I plan on using these steps I saw online:
ceph osd pool set rbd size 3
ceph -s  (Verify that replication completes successfully)
ceph osd pool set images size 3
ceph -s  
ceph osd pool set backups size 3
ceph -s  
ceph osd pool set volumes size 3
ceph -s  


please let me know any advice or better methods...

thanks

--Jim



[ceph-users] cephfs mds millions of caps

2017-12-14 Thread Webert de Souza Lima
Hi,

I've been looking at Ceph MDS perf counters and I saw that one of my clusters
was hugely different from the other in the number of caps:

rlat inos  caps  | hsr  hcs   hcr | writ read actv  | recd recy stry  purg
| segs evts subm
  0  3.0M 5.1M |  0 0 595 | 30440 |  0   0   13k
0  | 42 35k   893
  0  3.0M 5.1M |  0 0 165 | 1.8k   437   |  0   0   13k   0
 | 43 36k   302
16  3.0M 5.1M |  0 0 429 | 24794 |  0   0   13k
58| 38 32k   1.7k
  0  3.0M 5.1M |  0 1 213 | 1.2k   0857 |  0   0   13k   0
 | 40 33k   766
23  3.0M 5.1M |  0 0 945 | 44510 |  0   0   13k   0
 | 41 34k   1.1k
  0  3.0M 5.1M |  0 2 696 | 376   11   0 |  0   0   13k   0
 | 43 35k   1.0k
  3  2.9M 5.1M |  0 0 601 | 2.0k   60 |  0   0   13k
56| 38 29k   1.2k
  0  2.9M 5.1M |  0 0 394 | 272   11   0 |  0   0   13k   0
 | 38 30k   758

on another cluster running the same version:

-mds-- --mds_server-- ---objecter--- -mds_cache-
---mds_log
rlat inos caps  | hsr  hcs  hcr  | writ read actv | recd recy stry purg |
segs evts subm
  2  3.9M 380k |  01 266 | 1.8k   0   370  |  0   0   24k  44
 |  37  129k  1.5k


I did a perf dump on the active mds:

~# ceph daemon mds.a perf dump mds
{
"mds": {
"request": 2245276724,
"reply": 2245276366,
"reply_latency": {
"avgcount": 2245276366,
"sum": 18750003.074118977
},
"forward": 0,
"dir_fetch": 20217943,
"dir_commit": 555295668,
"dir_split": 0,
"inode_max": 300,
"inodes": 3000276,
"inodes_top": 152555,
"inodes_bottom": 279938,
"inodes_pin_tail": 2567783,
"inodes_pinned": 2782064,
"inodes_expired": 308697104,
"inodes_with_caps": 2779658,
"caps": 5147887,
"subtrees": 2,
"traverse": 2582452087,
"traverse_hit": 2338123987,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 16627249,
"traverse_remote_ino": 29276,
"traverse_lock": 2507504,
"load_cent": 18446743868740589422,
"q": 27,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}

and then a session ls to see what clients could be holding that much:

   {
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-97-generic",
 "hostname" : "suppressed"
  },
  "completed_requests" : 0,
  "id" : 1165169,
  "num_leases" : 343,
  "inst" : "client.1165169 10.0.0.112:0/982172363",
  "state" : "open",
  "num_caps" : 111740,
  "reconnecting" : false,
  "replay_requests" : 0
   },
   {
  "state" : "open",
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 108125,
  "id" : 1236036,
  "completed_requests" : 0,
  "client_metadata" : {
 "hostname" : "suppressed",
 "kernel_version" : "4.4.0-97-generic",
 "entity_id" : "admin"
  },
  "num_leases" : 323,
  "inst" : "client.1236036 10.0.0.113:0/1891451616"
   },
   {
  "num_caps" : 63186,
  "reconnecting" : false,
  "replay_requests" : 0,
  "state" : "open",
  "num_leases" : 147,
  "completed_requests" : 0,
  "client_metadata" : {
 "kernel_version" : "4.4.0-75-generic",
 "entity_id" : "admin",
 "hostname" : "suppressed"
  },
  "id" : 1235930,
  "inst" : "client.1235930 10.0.0.110:0/2634585537"
   },
   {
  "num_caps" : 2476444,
  "replay_requests" : 0,
  "reconnecting" : false,
  "state" : "open",
  "num_leases" : 0,
  "completed_requests" : 0,
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-75-generic",
 "hostname" : "suppressed"
  },
  "id" : 1659696,
  "inst" : "client.1659696 10.0.0.101:0/4005556527"
   },
   {
  "state" : "open",
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 2386376,
  "id" : 1069714,
  "client_metadata" : {
 "hostname" : "suppressed",
 "kernel_version" : "4.4.0-75-generic",
 "entity_id" : "admin"
  },
  "completed_requests" : 0,
  "num_leases" : 0,
  "inst" : "client.1069714 10.0.0.111:0/1876172355"
   },
   {
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 1726,
  "state" : "open",
  "inst" : "client.8394 10.0.0.103:0/3970353996",
  "num_leases" : 0,
  "id" : 8394,
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-75-generic",
 "hostname" : "suppressed"
  },
  "completed_requests" : 0
   }


Surprisingly, the 2 hosts that were holding 2M+ caps 

Re: [ceph-users] High Load and High Apply Latency

2017-12-14 Thread David Turner
We show high disk latencies on a node when the controller's cache battery
dies.  This is assuming that you're using a controller with cache enabled
for your disks.  In any case, I would look at the hardware on the server.

On Thu, Dec 14, 2017 at 10:15 AM John Petrini  wrote:

> Anyone have any ideas on this?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Snap trim queue length issues

2017-12-14 Thread David Turner
I've tracked this in a much more manual way.  I would grab a random subset
of PGs in the pool and query the PGs, counting how many objects were in their
queues.  After that, you average it out by how many PGs you queried and how
many objects there were and multiply it back out by how many PGs are in the
pool.  That gave us a relatively accurate size of the snaptrimq.  Well
enough to be monitored at least.  We could run this in a matter of minutes
with a subset of 200 PGs and it was generally accurate in a pool with 32k
pgs.

I also created a daemon that ran against the cluster watching for cluster
load and modifying the snap_trim_sleep accordingly.  The combination of
those 2 things and we were able to keep up with deleting hundreds of GB of
snapshots/day while not killing VM performance.  We hit a bug where we had
to disable snap trimming completely for about a week and on a dozen osds
for about a month.  We ended up with a snaptrimq over 100M objects, but
with these tools we were able to catch up within a couple weeks taking care
of the daily snapshots being added to the queue.

This was all on a Hammer cluster.  The changes to the snap trimming queues
going into the main osd thread made it so that our use case was not viable
on Jewel until changes to Jewel that happened after I left.  It's exciting
that this will actually be a reportable value from the cluster.

Sorry that this story doesn't really answer your question, except to say
that people aware of this problem likely have a work around for it.
However I'm certain that a lot more clusters are impacted by this than are
aware of it and being able to quickly see that would be beneficial to
troubleshooting problems.  Backporting would be nice.  I run a few Jewel
clusters that have some VMs and it would be nice to see how well the
clusters handle snap trimming.  But they are much less critical in how many
snapshots they do.
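For reference, a hedged sketch of the runtime throttling described in both posts; the option names are from Jewel/Luminous and the values are only examples:

# make trimming gentler (raise the sleep) or more aggressive (lower it)
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'
# allow more snap trims per PG in parallel (verify the exact option name on your release)
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 4'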

On Thu, Dec 14, 2017 at 9:36 AM Piotr Dałek 
wrote:

> Hi,
>
> We recently ran into low disk space issues on our clusters, and it wasn't
> because of actual data. On those affected clusters we're hosting VMs and
> volumes, so naturally there are snapshots involved. For some time, we
> observed increased disk space usage that we couldn't explain, as there was
> discrepancy between  what Ceph reported and actual space used on disks. We
> finally found out that snap trim queues were both long and not getting any
> shorter, and decreasing snap trim sleep and increasing max concurrent snap
> trims helped reversing the trend - we're safe now.
> The problem is, we haven't been aware of this issue for some time, and
> there's no easy (and fast[1]) way to check this. I made a pull request[2]
> that makes snap trim queue lengths available to monitoring tools
> and also generates health warning when things go out of control, so an
> admin
> can act before hell breaks loose.
>
> My question is, how many Jewel users would be interested in a such feature?
> There's a lot of changes between Luminous and Jewel, and it's not going to
> be a straight backport, but it's not a big patch either, so I won't mind
> doing it myself. But having some support from users would be helpful in
> pushing this into next Jewel release.
>
> Thanks!
>
>
> [1] one of our guys hacked a bash oneliner that printed out snap trim queue
> lengths for all pgs, but full run takes over an hour to complete on a
> cluster with over 20k pgs...
> [2] https://github.com/ceph/ceph/pull/19520
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Ceph luminous nfs-ganesha-ceph

2017-12-14 Thread Daniel Gryniewicz

On 12/14/2017 09:46 AM, nigel davies wrote:

Is this nfs-ganesha exporting Cephfs? Yes

Are you using NFS for a Vmware Datastore? Yes

What are you using for the NFS failover? (this is where i could be going 
wrong)


When creating the NFS Datastore I put the two NFS servers' IP addresses in.



NFS failover is more complicated than that.  It doesn't natively support 
failover from one machine to another.  Instead, you need to have an 
active/active or active/passive setup on the servers that manages 
virtual IPs, and fails the IP over from one machine to another.  This is 
because NFS, as a protocol, will recover from a failed server if *that 
server* comes back, but not to a second server.  The virtual IP failover 
looks, to the client, like a restart of a single server.


This is done in Ganesha currently with Pacemaker/Corosync to manage the 
failover.  It's in a supported product for Ganesha over Gluster, but 
not, as far as I know, for Ganesha over Ceph.  It has been done by 
community members, however.


The Gluster docs for this are here:
http://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/

In the future (Ganesha 2.6 / Ceph Mimic) this should be better supported 
over Ceph.


Daniel
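As an illustration only, the floating-IP piece of such an active/passive pair is typically a Pacemaker resource along these lines (names and addresses are placeholders, not a tested configuration):

pcs resource create nfs_server systemd:nfs-ganesha op monitor interval=30s
pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=10s
pcs constraint colocation add nfs_vip with nfs_server INFINITY

The ESXi datastore is then mounted against the virtual IP rather than either server's own address.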




On Thu, Dec 14, 2017 at 2:29 PM, David C > wrote:


Is this nfs-ganesha exporting Cephfs?
Are you using NFS for a Vmware Datastore?
What are you using for the NFS failover?

We need more info but this does sound like a vmware/nfs question
rather than specifically ceph/nfs-ganesha

On Thu, Dec 14, 2017 at 1:47 PM, nigel davies > wrote:

Hey all

I am in the process of trying to set up a VMware storage
environment.
I have been reading and found that iSCSI (on the Jewel release) can cause
issues and the datastore can drop out.

I have been looking at using nfs-ganesha with my Ceph platform; it
all looked good until I looked at failover to our 2nd NFS server.
I believe I set up the servers right and gave the two IP addresses of
the NFS servers.

When I shut down the live NFS server, the datastore becomes
"inactive", and even after I bring the NFS server back up, it still
shows as inactive.

Any advice on this would be appreciated, in case I am missing
something.
I am using NFS 4.1 as I have been advised it will support the failover.












Re: [ceph-users] High Load and High Apply Latency

2017-12-14 Thread John Petrini
Anyone have any ideas on this?


Re: [ceph-users] Ceph luminous nfs-ganesha-ceph

2017-12-14 Thread nigel davies
Is this nfs-ganesha exporting Cephfs? Yes

Are you using NFS for a Vmware Datastore? Yes

What are you using for the NFS failover? (this is where I could be going
wrong)

When creating the NFS datastore I added the two NFS servers' IP addresses.




On Thu, Dec 14, 2017 at 2:29 PM, David C  wrote:

> Is this nfs-ganesha exporting Cephfs?
> Are you using NFS for a Vmware Datastore?
> What are you using for the NFS failover?
>
> We need more info but this does sound like a vmware/nfs question rather
> than specifically ceph/nfs-ganesha
>
> On Thu, Dec 14, 2017 at 1:47 PM, nigel davies  wrote:
>
>> Hey all,
>>
>> I am in the process of trying to set up a VMware storage environment.
>> I have been reading and found that iSCSI (on the Jewel release) can cause
>> issues and the datastore can drop out.
>>
>> I have been looking at using nfs-ganesha with my Ceph platform; it all looked
>> good until I looked at failover to our 2nd NFS server. I believe I set up
>> the server right and gave the two IP addresses of the NFS servers.
>>
>> When I shut down the live NFS server, the datastore becomes "inactive";
>> even after I bring the NFS server back up, it still shows as inactive.
>>
>> Any advice on this would be appreciated in case I am missing something.
>> I am using NFS 4.1 as I have been advised it will support the failover.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2017-12-14 Thread Yan, Zheng
On Thu, Dec 14, 2017 at 8:52 PM, Florent B  wrote:
> On 14/12/2017 03:38, Yan, Zheng wrote:
>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B  wrote:
>>>
>>> Systems are on Debian Jessie : kernel 3.16.0-4-amd64 & libfuse 2.9.3-15.
>>>
>>> I don't know the pattern of corruption, but according to the error message in
>>> Dovecot, it seems to expect data to read but reaches EOF.
>>>
>>> All seems fine using fuse_disable_pagecache (no more corruption, and
>>> performance increased : no more MDS slow requests on filelock requests).
>>
>> I checked ceph-fuse changes since kraken, didn't find any clue. I
>> would be helpful if you can try recent version kernel.
>>
>> Regards
>> Yan, Zheng
>
> Problem occurred this morning even with fuse_disable_pagecache=true.
>
> It seems to be a lock issue between imap & lmtp processes.
>
> Dovecot uses fcntl as locking method. Is there any change about it in
> Luminous ? I switched to flock to see if problem is still there...
>

I don't remember any change there.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snap trim queue length issues

2017-12-14 Thread Piotr Dałek

Hi,

We recently ran into low disk space issues on our clusters, and it wasn't 
because of actual data. On those affected clusters we're hosting VMs and 
volumes, so naturally there are snapshots involved. For some time, we 
observed increased disk space usage that we couldn't explain, as there was 
a discrepancy between what Ceph reported and the actual space used on disks. We 
finally found out that snap trim queues were both long and not getting any 
shorter; decreasing snap trim sleep and increasing max concurrent snap 
trims helped reverse the trend - we're safe now.
The problem is, we hadn't been aware of this issue for some time, and 
there's no easy (and fast[1]) way to check this. I made a pull request[2] 
that makes snap trim queue lengths available to monitoring tools, 
and also generates a health warning when things go out of control, so an admin 
can act before hell breaks loose.


My question is: how many Jewel users would be interested in such a feature? 
There are a lot of changes between Luminous and Jewel, and it's not going to 
be a straight backport, but it's not a big patch either, so I won't mind 
doing it myself. But having some support from users would be helpful in 
pushing this into the next Jewel release.


Thanks!


[1] one of our guys hacked a bash oneliner that printed out snap trim queue 
lengths for all pgs, but full run takes over an hour to complete on a 
cluster with over 20k pgs...

[2] https://github.com/ceph/ceph/pull/19520
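
For the record, a crude stand-in for [1] looks something like the sketch 
below (assumptions: jq is available and 'ceph pg <pgid> query' exposes a 
snap_trimq field somewhere in its JSON; one query per PG is exactly why it 
takes so long on large clusters):

# one 'pg query' per PG, so expect a very long runtime on big clusters
for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
  q=$(ceph pg "$pg" query -f json | jq -r '[.. | .snap_trimq? // empty][0] // "[]"')
  printf '%s %s\n' "$pg" "$q"
done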

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph luminous nfs-ganesha-ceph

2017-12-14 Thread David C
Is this nfs-ganesha exporting Cephfs?
Are you using NFS for a Vmware Datastore?
What are you using for the NFS failover?

We need more info but this does sound like a vmware/nfs question rather
than specifically ceph/nfs-ganesha

On Thu, Dec 14, 2017 at 1:47 PM, nigel davies  wrote:

> Hey all,
>
> I am in the process of trying to set up a VMware storage environment.
> I have been reading and found that iSCSI (on the Jewel release) can cause
> issues and the datastore can drop out.
>
> I have been looking at using nfs-ganesha with my Ceph platform; it all looked
> good until I looked at failover to our 2nd NFS server. I believe I set up
> the server right and gave the two IP addresses of the NFS servers.
>
> When I shut down the live NFS server, the datastore becomes "inactive";
> even after I bring the NFS server back up, it still shows as inactive.
>
> Any advice on this would be appreciated in case I am missing something.
> I am using NFS 4.1 as I have been advised it will support the failover.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Max number of objects per bucket

2017-12-14 Thread Prasad Bhalerao
Hello,

I have the following questions; could you please help me out?

I am using the S3 APIs. What is the maximum number of objects a bucket can have
when using an indexless bucket?

What is the maximum number of buckets a user can create?

Can we have both indexless and indexed buckets at the same time? Is there
any configuration for this?


Thanks,
Prasad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] measure performance / latency in blustore

2017-12-14 Thread Sage Weil
On Thu, 14 Dec 2017, Stefan Priebe - Profihost AG wrote:
> 
> > On 14.12.2017 at 13:22, Sage Weil wrote:
> > On Thu, 14 Dec 2017, Stefan Priebe - Profihost AG wrote:
> >> Hello,
> >>
> >> On 21.11.2017 at 11:06, Stefan Priebe - Profihost AG wrote:
> >>> Hello,
> >>>
> >>> to measure performance / latency for filestore we used:
> >>> filestore:apply_latency
> >>> filestore:commitcycle_latency
> >>> filestore:journal_latency
> >>> filestore:queue_transaction_latency_avg
> >>>
> >>> What are the correct ones for bluestore?
> >>
> >> really nobody? Does nobody track the latency under bluestore?
> > 
> > I forget the long names off the top of my head, but the interesting 
> > latency measures are marked with a low priority and come up (with a wide 
> > terminal) when you do 'ceph daemonperf osd.N'. You can see the metrics, 
> > priorities, and descriptions with 'ceph daemon osd.N perf schema'.
> 
> uhuh very long list. Any idea which ones are relevant?
> 
> ceph daemon osd.8 perf dump

If you do 'perf schema' you'll see a 'priority' property that calls out 
the important ones.  That's how the daemonperf command decides which ones 
to show (based on terminal width and priorities).
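
For example, to pull out just the higher-priority counters from that schema 
(a sketch only - it assumes jq is installed, and the >= 8 cutoff is a guess 
at what daemonperf treats as interesting):

ceph daemon osd.8 perf schema | jq -r '
  to_entries[] | .key as $sec | .value | to_entries[]
  | select((.value.priority // 0) >= 8)
  | "\($sec).\(.key)  prio=\(.value.priority)  \(.value.description // "")"'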

sage


 > 
> shows me a lot of stuff and even a lot of wait or latency values.
> 
> ceph daemon osd.8 perf dump | egrep "wait|lat"
> "kv_flush_lat": {
> "kv_commit_lat": {
> "kv_lat": {
> "state_prepare_lat": {
> "state_aio_wait_lat": {
> "state_io_done_lat": {
> "state_kv_queued_lat": {
> "state_kv_commiting_lat": {
> "state_kv_done_lat": {
> "state_deferred_queued_lat": {
> "state_deferred_aio_wait_lat": {
> "state_deferred_cleanup_lat": {
> "state_finishing_lat": {
> "state_done_lat": {
> "throttle_lat": {
> "submit_lat": {
> "commit_lat": {
> "read_lat": {
> "read_onode_meta_lat": {
> "read_wait_aio_lat": {
> "compress_lat": {
> "decompress_lat": {
> "csum_lat": {
> "complete_latency": {
> "complete_latency": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "op_latency": {
> "op_process_latency": {
> "op_prepare_latency": {
> "op_r_latency": {
> "op_r_process_latency": {
> "op_r_prepare_latency": {
> "op_w_latency": {
> "op_w_process_latency": {
> "op_w_prepare_latency": {
> "op_rw_latency": {
> "op_rw_process_latency": {
> "op_rw_prepare_latency": {
> "op_before_queue_op_lat": {
> "op_before_dequeue_op_lat": {
> "subop_latency": {
> "subop_w_latency": {
> "subop_pull_latency": {
> "subop_push_latency": {
> "osd_tier_flush_lat": {
> "osd_tier_promote_lat": {
> "osd_tier_r_lat": {
> "initial_latency": {
> "started_latency": {
> "reset_latency": {
> "start_latency": {
> "primary_latency": {
> "peering_latency": {
> "backfilling_latency": {
> "waitremotebackfillreserved_latency": {
> "waitlocalbackfillreserved_latency": {
> "notbackfilling_latency": {
> "repnotrecovering_latency": {
> "repwaitrecoveryreserved_latency": {
> "repwaitbackfillreserved_latency": {
> "reprecovering_latency": {
> "activating_latency": {
> "waitlocalrecoveryreserved_latency": {
> "waitremoterecoveryreserved_latency": {
> "recovering_latency": {
> "recovered_latency": {
> "clean_latency": {
> "active_latency": {
> "replicaactive_latency": {
> "stray_latency": {
> "getinfo_latency": {
> "getlog_latency": {
> "waitactingchange_latency": {
> "incomplete_latency": {
> "down_latency": {
> "getmissing_latency": {
> "waitupthru_latency": {
> "notrecovering_latency": {
> "get_latency": {
> "submit_latency": {
> "submit_sync_latency": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> "wait": {
> 
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] measure performance / latency in blustore

2017-12-14 Thread Stefan Priebe - Profihost AG

On 14.12.2017 at 13:22, Sage Weil wrote:
> On Thu, 14 Dec 2017, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> On 21.11.2017 at 11:06, Stefan Priebe - Profihost AG wrote:
>>> Hello,
>>>
>>> to measure performance / latency for filestore we used:
>>> filestore:apply_latency
>>> filestore:commitcycle_latency
>>> filestore:journal_latency
>>> filestore:queue_transaction_latency_avg
>>>
>>> What are the correct ones for bluestore?
>>
>> really nobody? Does nobody track the latency under bluestore?
> 
> I forget the long names off the top of my head, but the interesting 
> latency measures are marked with a low priority and come up (with a wide 
> terminal) when you do 'ceph daemonperf osd.N'. You can see the metrics, 
> priorities, and descriptions with 'ceph daemon osd.N perf schema'.

uhuh very long list. Any idea which ones are relevant?

ceph daemon osd.8 perf dump

shows me a lot of stuff and even a lot of wait or latency values.

ceph daemon osd.8 perf dump | egrep "wait|lat"
"kv_flush_lat": {
"kv_commit_lat": {
"kv_lat": {
"state_prepare_lat": {
"state_aio_wait_lat": {
"state_io_done_lat": {
"state_kv_queued_lat": {
"state_kv_commiting_lat": {
"state_kv_done_lat": {
"state_deferred_queued_lat": {
"state_deferred_aio_wait_lat": {
"state_deferred_cleanup_lat": {
"state_finishing_lat": {
"state_done_lat": {
"throttle_lat": {
"submit_lat": {
"commit_lat": {
"read_lat": {
"read_onode_meta_lat": {
"read_wait_aio_lat": {
"compress_lat": {
"decompress_lat": {
"csum_lat": {
"complete_latency": {
"complete_latency": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"op_latency": {
"op_process_latency": {
"op_prepare_latency": {
"op_r_latency": {
"op_r_process_latency": {
"op_r_prepare_latency": {
"op_w_latency": {
"op_w_process_latency": {
"op_w_prepare_latency": {
"op_rw_latency": {
"op_rw_process_latency": {
"op_rw_prepare_latency": {
"op_before_queue_op_lat": {
"op_before_dequeue_op_lat": {
"subop_latency": {
"subop_w_latency": {
"subop_pull_latency": {
"subop_push_latency": {
"osd_tier_flush_lat": {
"osd_tier_promote_lat": {
"osd_tier_r_lat": {
"initial_latency": {
"started_latency": {
"reset_latency": {
"start_latency": {
"primary_latency": {
"peering_latency": {
"backfilling_latency": {
"waitremotebackfillreserved_latency": {
"waitlocalbackfillreserved_latency": {
"notbackfilling_latency": {
"repnotrecovering_latency": {
"repwaitrecoveryreserved_latency": {
"repwaitbackfillreserved_latency": {
"reprecovering_latency": {
"activating_latency": {
"waitlocalrecoveryreserved_latency": {
"waitremoterecoveryreserved_latency": {
"recovering_latency": {
"recovered_latency": {
"clean_latency": {
"active_latency": {
"replicaactive_latency": {
"stray_latency": {
"getinfo_latency": {
"getlog_latency": {
"waitactingchange_latency": {
"incomplete_latency": {
"down_latency": {
"getmissing_latency": {
"waitupthru_latency": {
"notrecovering_latency": {
"get_latency": {
"submit_latency": {
"submit_sync_latency": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {
"wait": {

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph luminous nfs-ganesha-ceph

2017-12-14 Thread nigel davies
Hey all,

I am in the process of trying to set up a VMware storage environment.
I have been reading and found that iSCSI (on the Jewel release) can cause issues
and the datastore can drop out.

I have been looking at using nfs-ganesha with my Ceph platform; it all looked
good until I looked at failover to our 2nd NFS server. I believe I set up
the server right and gave the two IP addresses of the NFS servers.

When I shut down the live NFS server, the datastore becomes "inactive";
even after I bring the NFS server back up, it still shows as inactive.

Any advice on this would be appreciated in case I am missing something.
I am using NFS 4.1 as I have been advised it will support the failover.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-12-14 Thread Matthew Vernon
On 29/11/17 17:24, Matthew Vernon wrote:

> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
> 
> It's always the same OSD, and we've tried replacing the underlying disk.
> 
> The logs have lots of entries of the form
> 
> 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

Thanks for the various helpful suggestions in response to this. In case
you're interested (and for the archives), the answer was Gnocchi - all
the slow requests were for a particular pool, which is where we were
sending metrics from an OpenStack instance. Gnocchi less than version
4.0 is, I learn, known to kill ceph because its use of librados is
rather badly behaved. Newer OpenStacks (from Pike, I think) use a newer
Gnocchi. We stopped ceilometer and gnocchi, and the problem went away.
Thanks are due to RedHat support for finding this for us :)

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] measure performance / latency in blustore

2017-12-14 Thread Sage Weil
On Thu, 14 Dec 2017, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> On 21.11.2017 at 11:06, Stefan Priebe - Profihost AG wrote:
> > Hello,
> > 
> > to measure performance / latency for filestore we used:
> > filestore:apply_latency
> > filestore:commitcycle_latency
> > filestore:journal_latency
> > filestore:queue_transaction_latency_avg
> > 
> > What are the correct ones for bluestore?
> 
> really nobody? Does nobody track the latency under bluestore?

I forget the long names off the top of my head, but the interesting 
latency measures are marked with a low priority and come up (with a wide 
terminal) when you do 'ceph daemonperf osd.N'. You can see the metrics, 
priorities, and descriptions with 'ceph daemon osd.N perf schema'.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-14 Thread Yan, Zheng
On Thu, Dec 14, 2017 at 12:52 AM, Jens-U. Mozdzen  wrote:
> Hi Yan,
>
> Zitat von "Yan, Zheng" :
>>
>> [...]
>>
>> It's likely some clients had caps on unlinked inodes, which prevent
>> MDS from purging objects. When a file gets deleted, mds notifies all
>> clients, clients are supposed to drop corresponding caps if possible.
>> You may hit a bug in this area, some clients failed to drop cap for
>> unlinked inodes.
>> [...]
>> There is a reconnect stage during MDS recovers. To reduce reconnect
>> message size, clients trim unused inodes from their cache
>> aggressively. In your case, most unlinked inodes also got trimmed.
>> So mds could purge corresponding objects after it recovered
>
>
> thank you for that detailed explanation. While I've already included the
> recent code fix for this issue on a test node, all other mount points
> (including the NFS server machine) still run thenon-fixed kernel Ceph
> client. So your description makes me believe we've hit exactly what you
> describe.
>
> Seems we'll have to fix the clients :)
>
> Is there a command I can use to see what caps a client holds, to verify the
> proposed patch actually works?
>

No easy way.

'ceph daemon mds.x session ls' shows how many caps each client
holds. 'ceph daemon mds.x dump cache' dumps the whole MDS cache; that
information can be extracted from the cache dump.
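
For example (a sketch only - it assumes jq is available and that the session
listing exposes a per-client num_caps field):

# caps held per client session
ceph daemon mds.x session ls | jq -r '.[] | "\(.id) \(.num_caps)"'
# dump the whole cache to a file for offline inspection (can be very large)
ceph daemon mds.x dump cache /tmp/mds-cache.dump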

Regards
Yan, Zheng
> Regards,
> Jens
>
> PS: Is there a command I can use to see what caps a client holds, to verify
> the proposed patch actually works?
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log

2017-12-14 Thread Tristan Le Toullec

Hi Jared,
    did you find a solution to your problem? It appears that I 
have the same OSD problem, and tcpdump captures don't point to a solution.


All OSD nodes produced logs like

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)


Sometimes the OSD process was shut down and respawned, sometimes it just shut down.

We used Ubuntu 14.04 (one node is on 16.04) and ceph version 10.2.10.

Thanks
Tristan





On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts wrote:
> I've got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
> ceph status and ceph osd tree output can be found at:
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> From osd.4, those endpoints look reachable:
>
> # nc -vz 10.32.0.3 6807
> 10.32.0.3 (10.32.0.3:6807) open
> # nc -vz 10.32.0.3 6811
> 10.32.0.3 (10.32.0.3:6811) open
>
> What else can I look at to determine why most of the OSDs cannot
> communicate? http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue, what else can I check there? I can turn
> on extra logging as needed. Thanks!
Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
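
Something along these lines should do it (the interface name is a
placeholder; plug in the peer address and the ports from the
heartbeat_check lines):

# run on both hosts at the same time and compare what arrives
tcpdump -ni eth0 host 10.32.0.3 and \( port 6807 or port 6811 \)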


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding reshard issues

2017-12-14 Thread Martin Emrich

Hi!

On 13.12.17 at 20:50, Graham Allan wrote:
After our Jewel to Luminous 12.2.2 upgrade, I ran into some of the same 
issues reported earlier on the list under "rgw resharding operation 
seemingly won't end". 


Yes, that were/are my threads, I also have this issue.


I was able to correct the buckets using the "radosgw-admin bucket check 
--fix" command, and later disabled the auto resharding.


Were you able to manually reshard a bucket after the "--fix"? Here, 
after a bucket was damaged once, the manual reshard process will freeze.


As an experiment, I selected an unsharded bucket to attempt a manual 
reshard. I added it to the reshard list, then ran "radosgw-admin reshard 
execute". The bucket in question contains 184000 objects and was being 
converted from 1 to 3 shards.


I'm trying to understand what I found...

1) the "radosgw-admin reshard execute" never returned. Somehow I 
expected it to kick off a background operation, but possibly this was 
mistaken.


Yes, same behaviour here. Someone on the list mentioned that resharding 
should actually happen quite fast (at most a few minutes).
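
For reference, the manual sequence being discussed is roughly the following 
(bucket name and shard count are placeholders; this is my reading of the 
Luminous tooling, not a verified recipe):

radosgw-admin bucket check --fix --bucket=mybucket
radosgw-admin reshard add --bucket=mybucket --num-shards=3
radosgw-admin reshard list
radosgw-admin reshard execute
radosgw-admin reshard status --bucket=mybucket
# auto resharding itself is toggled with "rgw dynamic resharding = false" in ceph.conf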


So there's clearly something wrong here, and I am glad I am not the only 
one experiencing it.


To compare: what is your infrastructure? Mine is:

* three beefy hosts (64GB RAM) with 4 OSDs each for data (HDD), and 2 
OSDs each on SSDs for the index.

* all bluestore (DB/WAL for the HDD OSDs also on SSD partitions)
* radosgw runs on each of these OSD hosts (they are mostly idling, so I 
don't see running the rados gateways on the OSD hosts as the cause of my 
poor performance)

* 3 separate monitor/mgr hosts
* OS is CentOS 7, running Ceph 12.2.2
* We use several buckets, all with Versioning enabled, for many (100k to 
12M) rather small objects.


pool settings:
# ceph osd pool ls detail
pool 1 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 174 lfor 0/172 flags 
hashpspool stripe_width 0
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 842 owner 18446744073709551615 
flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.control' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 8 pgp_num 8 last_change 843 owner 
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 128 pgp_num 128 last_change 950 lfor 0/948 
owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.log' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 8 pgp_num 8 last_change 845 owner 
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 
crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 846 
owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.buckets.data' replicated size 3 min_size 2 
crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 847 
lfor 0/246 owner 18446744073709551615 flags hashpspool stripe_width 0 
application rgw
pool 8 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 
crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 849 
flags hashpspool stripe_width 0 application rgw


Regards,

Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier unexpected behavior: promote on lock

2017-12-14 Thread Захаров Алексей
Hi, Gregory,

Thank you for your answer! Is there a way to not promote on "locking" when
not using EC pools? Is it possible to make this configurable?

We don't use an EC pool, so for us this mechanism is overhead. It only adds
more load on both pools and the network.

14.12.2017, 01:16, "Gregory Farnum":
> Voluntary "locking" in RADOS is an "object class" operation. These are not
> part of the core API and cannot run on EC pools, so any operation using
> them will cause an immediate promotion.
>
> On Wed, Dec 13, 2017 at 4:02 AM Захаров Алексей wrote:
>> Hello,
>> I've found that when a client gets a lock on an object, ceph ignores any
>> promotion settings and promotes this object immediately.
>> Is it a bug or a feature? Is it configurable?
>> Hope for any help!
>>
>> Ceph version: 10.2.10 and 12.2.2
>> We use libradosstriper-based clients.
>>
>> Cache pool settings:
>> size: 3
>> min_size: 2
>> crash_replay_interval: 0
>> pg_num: 2048
>> pgp_num: 2048
>> crush_ruleset: 0
>> hashpspool: true
>> nodelete: false
>> nopgchange: false
>> nosizechange: false
>> write_fadvise_dontneed: false
>> noscrub: true
>> nodeep-scrub: false
>> hit_set_type: bloom
>> hit_set_period: 60
>> hit_set_count: 30
>> hit_set_fpp: 0.05
>> use_gmt_hitset: 1
>> auid: 0
>> target_max_objects: 0
>> target_max_bytes: 18819770744832
>> cache_target_dirty_ratio: 0.4
>> cache_target_dirty_high_ratio: 0.6
>> cache_target_full_ratio: 0.8
>> cache_min_flush_age: 60
>> cache_min_evict_age: 180
>> min_read_recency_for_promote: 15
>> min_write_recency_for_promote: 15
>> fast_read: 0
>> hit_set_grade_decay_rate: 50
>> hit_set_search_last_n: 30
>>
>> To get a lock via the CLI (to test the behavior) we use:
>> # rados -p poolname lock get --lock-tag weird_ceph_locks --lock-cookie `uuid` objectname striper.lock
>> Right after that the object can be found in the caching pool.
>>
>> --
>> Regards,
>> Aleksei Zakharov

--
Regards,
Aleksei Zakharov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph scrub logs: _scan_snaps no head for $object?

2017-12-14 Thread Stefan Kooman
Hi,

We see the following in the logs after we start a scrub for some osds:

ceph-osd.2.log:2017-12-14 06:50:47.180344 7f0f47db2700  0 log_channel(cluster) 
log [DBG] : 1.2d8 scrub starts
ceph-osd.2.log:2017-12-14 06:50:47.180915 7f0f47db2700 -1 osd.2 pg_epoch: 11897 
pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 
n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) 
[2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 
active+clean+scrubbing] _scan_snaps no head for 
1:1b518155:::rbd_data.620652ae8944a.0126:29 (have MIN)
ceph-osd.2.log:2017-12-14 06:50:47.180929 7f0f47db2700 -1 osd.2 pg_epoch: 11897 
pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 
n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) 
[2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 
active+clean+scrubbing] _scan_snaps no head for 
1:1b518155:::rbd_data.620652ae8944a.0126:14 (have MIN)
ceph-osd.2.log:2017-12-14 06:50:47.180941 7f0f47db2700 -1 osd.2 pg_epoch: 11897 
pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 
n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) 
[2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 
active+clean+scrubbing] _scan_snaps no head for 
1:1b518155:::rbd_data.620652ae8944a.0126:a (have MIN)
ceph-osd.2.log:2017-12-14 06:50:47.214198 7f0f43daa700  0 log_channel(cluster) 
log [DBG] : 1.2d8 scrub ok

So finally it logs "scrub ok", but what does " _scan_snaps no head for ..." 
mean?
Does this indicate a problem?

Ceph 12.2.2 with bluestore on lvm

Gr. Stefan




-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-12-14 Thread Fulvio Galeazzi

Hallo Matthew, thanks for your feedback!
  Please clarify one point: do you mean that you recreated the pool as an 
erasure-coded one, or as a regular replicated one? That is, do you now have 
an erasure-coded pool in production as a gnocchi backend?


  In any case, from the instability you mention, experimenting with 
BlueStore looks like a better alternative.


  Thanks again

Fulvio

 Original Message 
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud 
To: Fulvio Galeazzi , Brian Andrus 


CC: "ceph-users@lists.ceph.com" 
Date: 12/13/2017 5:05 PM


We fixed it by destroying the pool and recreating it, though this isn't really a 
fix. It turns out Ceph has a weakness for small, high-change-rate objects 
(the behavior that gnocchi displays). The cluster will keep going fine until an 
event (a reboot, OSD failure, etc.) happens. I haven't been able to find 
another solution.
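
For completeness, the destroy/recreate step was essentially the following 
(pool name and PG counts are placeholders; on newer releases pool deletion 
also has to be explicitly allowed on the mons first):

ceph osd pool delete metrics metrics --yes-i-really-really-mean-it
ceph osd pool create metrics 128 128 replicated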

I have heard that BlueStore handles this better, but that wasn’t stable on the 
release we are on.

Thanks,
Matthew Stroud

On 12/13/17, 3:56 AM, "Fulvio Galeazzi"  wrote:

 Hallo Matthew,
  I am now facing the same issue and found this message of yours.
Were you eventually able to figure what the problem is, with
 erasure-coded pools?

 At first sight, the bugzilla page linked by Brian does not seem to
 specifically mention erasure-coded pools...

Thanks for your help

 Fulvio

  Original Message 
 Subject: Re: [ceph-users] Blocked requests
 From: Matthew Stroud 
 To: Brian Andrus 
 CC: "ceph-users@lists.ceph.com" 
 Date: 09/07/2017 11:01 PM

 > After some troubleshooting, the issues appear to be caused by gnocchi
 > using rados. I’m trying to figure out why.
 >
 > Thanks,
 >
 > Matthew Stroud
 >
 > *From: *Brian Andrus 
 > *Date: *Thursday, September 7, 2017 at 1:53 PM
 > *To: *Matthew Stroud 
 > *Cc: *David Turner , "ceph-users@lists.ceph.com"
 > 
 > *Subject: *Re: [ceph-users] Blocked requests
 >
 > "ceph osd blocked-by" can do the same thing as that provided script.
 >
 > Can you post relevant osd.10 logs and a pg dump of an affected placement
 > group? Specifically interested in recovery_state section.
 >
 > Hopefully you were careful in how you were rebooting OSDs, and not
 > rebooting multiple in the same failure domain before recovery was able
 > to occur.
 >
 > On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud wrote:
 >
 > Here is the output of your snippet:
 >
 > [root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
 >
 >6 osd.10
 >
 > 52  ops are blocked > 4194.3   sec on osd.17
 >
 > 9   ops are blocked > 2097.15  sec on osd.10
 >
 > 4   ops are blocked > 1048.58  sec on osd.10
 >
 > 39  ops are blocked > 262.144  sec on osd.10
 >
 > 19  ops are blocked > 131.072  sec on osd.10
 >
 > 6   ops are blocked > 65.536   sec on osd.10
 >
 > 2   ops are blocked > 32.768   sec on osd.10
 >
 > Here is some backfilling info:
 >
 > [root@mon01 ceph-conf]# ceph status
 >
 >  cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
 >
 >   health HEALTH_WARN
 >
 >  5 pgs backfilling
 >
 >  5 pgs degraded
 >
 >  5 pgs stuck degraded
 >
 >  5 pgs stuck unclean
 >
 >  5 pgs stuck undersized
 >
 >  5 pgs undersized
 >
 >  122 requests are blocked > 32 sec
 >
 >  recovery 2361/1097929 objects degraded (0.215%)
 >
 >  recovery 5578/1097929 objects misplaced (0.508%)
 >
 >   monmap e1: 3 mons at
 > {mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
 >
 >  election epoch 58, quorum 0,1,2 mon01,mon02,mon03
 >
 >   osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
 >
 >  flags sortbitwise,require_jewel_osds
 >
 >pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
 >
 >  1005 GB used, 20283 GB / 21288 GB avail
 >
 >  2361/1097929 objects degraded (0.215%)
 >