Re: [ceph-users] osd_op_threads appears to be removed from the settings

2018-06-14 Thread Matthew Stroud
Thanks for the info From: Piotr Dalek Date: Friday, June 15, 2018 at 12:33 AM To: Matthew Stroud , ceph-users Subject: RE: osd_op_threads appears to be removed from the settings No, it’s no longer valid. -- Piotr Dałek piotr.da...@corp.ovh.com https://ovhcloud.com/ From: ceph-users On Behal

Re: [ceph-users] osd_op_threads appears to be removed from the settings

2018-06-14 Thread Piotr Dalek
No, it’s no longer valid. -- Piotr Dałek piotr.da...@corp.ovh.com https://ovhcloud.com/ From: ceph-users On Behalf Of Matthew Stroud Sent: Friday, June 15, 2018 8:11 AM To: ceph-users Subject: [ceph-users] osd_op_threads appears to be removed from the settings So I’m trying to update the osd_o

[ceph-users] osd_op_threads appears to be removed from the settings

2018-06-14 Thread Matthew Stroud
So I’m trying to update the osd_op_threads setting that was in jewel but now doesn't appear to be in luminous. What's more confusing is that the docs state it is a valid option. Is osd_op_threads still valid? I'm currently running ceph 12.2.2. Thanks, Matthew Stroud
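For anyone hitting the same question: in Luminous the OSD op worker threads are governed by the sharded op queue options rather than osd_op_threads, so a quick check like the sketch below (osd.0 is a placeholder, run on the OSD host) shows what is actually in effect:

    # via the admin socket on the OSD host
    ceph daemon osd.0 config show | egrep 'osd_op_num_shards|osd_op_num_threads_per_shard'
    # a removed option simply errors out here
    ceph daemon osd.0 config get osd_op_threads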

Re: [ceph-users] ceph pg dump

2018-06-14 Thread John Spray
On Thu, Jun 14, 2018 at 6:31 PM, Ranjan Ghosh wrote: > Hi all, > > we have two small clusters (3 nodes each) called alpha and beta. One node > (alpha0/beta0) is on a remote site and only has monitor & manager. The two > other nodes (alpha/beta-1/2) have all 4 services and contain the OSDs and > ar

[ceph-users] Is Ceph Full Tiering Possible?

2018-06-14 Thread Pardhiv Karri
Hi, In Ceph there is cache tiering, but is there a way to have full tiering? In cache tiering: Cache Pool (SSDs) = 10TB, Slow Pool (HDDs) = 50TB, but the total usable space that Ceph will see is only 50TB, not 60TB (50+10). Is there a way to see 60TB so that if we later want to expand fast storage po
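For what it's worth, the 50TB figure follows from how a cache tier works: the cache pool overlays the base pool and its objects are eventually flushed/evicted down to it, so only the base pool's capacity counts as usable space. A rough sketch of the standard writeback setup (pool names are made up):

    ceph osd tier add slow-hdd-pool fast-ssd-pool
    ceph osd tier cache-mode fast-ssd-pool writeback
    ceph osd tier set-overlay slow-hdd-pool fast-ssd-pool
    # clients address slow-hdd-pool; the 10TB cache only holds hot copies,
    # so usable capacity stays at the base pool's 50TB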

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Steve Anthony
For reference, building luminous with the changes in the pull request also fixed this issue for me. Some of my unexpected snapshots were on Bluestore devices; here's how I used the objectstore tool to remove them. In the example, the problematic placement group is 2.1c3f, and the unexpected clone i
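The general pattern looks roughly like the sketch below (not necessarily the exact commands from this message; the OSD id and the object spec are placeholders you fill in from your own cluster):

    systemctl stop ceph-osd@12
    # list objects in the problematic PG to locate the unexpected clone
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 2.1c3f --op list
    # remove the offending clone using the JSON object spec printed by the list step
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 '<json-object-spec>' remove
    systemctl start ceph-osd@12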

Re: [ceph-users] Frequent slow requests

2018-06-14 Thread Brad Hubbard
Turn up debug logging, at least debug_osd 20, and search for the operation in the osd logs. On Thu, Jun 14, 2018 at 5:38 PM, Frank (lists) wrote: > Hi, > > On a small cluster (3 nodes) I frequently have slow requests. When dumping > the inflight ops from the hanging OSD, it seems it doesn't get a
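A sketch of what that could look like (osd.3 and the client id are placeholders):

    ceph tell osd.3 injectargs '--debug_osd 20 --debug_ms 1'
    # reproduce the slow request, then look for the op in the OSD log
    grep 'client.4123' /var/log/ceph/ceph-osd.3.log
    # turn logging back down afterwards
    ceph tell osd.3 injectargs '--debug_osd 1/5 --debug_ms 0/5'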

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Do you think it's safe to start the MDS daemons back up at this point? Current status is: https://gist.github.com/oschulz/36d92af84851ec42e09ce1f3cacbc110 Or would it be better to wait until backfill is complete? On 14.06.2018 22:53, Gregory Farnum wrote: I would not run the cephfs disaster

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Nick Fisk
For completeness, in case anyone has this issue in the future and stumbles across this thread: if your OSD is crashing and you are still running a Luminous build that does not have the fix in the pull request below, you will need to compile the ceph-osd binary and replace it on the affected OSD
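Very roughly, and only as a sketch of the build step under the assumption that you really can't wait for the next point release (the tag and the fix commits are whatever matches your cluster):

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout v12.2.5            # the luminous tag you are actually running
    # cherry-pick the commit(s) from the fix here before building
    ./install-deps.sh && ./do_cmake.sh
    cd build && make -j$(nproc) ceph-osd
    # back up /usr/bin/ceph-osd on the affected host, swap in build/bin/ceph-osd, restart that OSD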

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Dear Paul, sure, here's the current status (including crushmap): https://gist.github.com/oschulz/36d92af84851ec42e09ce1f3cacbc110 Any advice will be very much appreciated. Cheers, Oliver On 14.06.2018 22:46, Paul Emmerich wrote: Can you post your whole crushmap? ceph osd getcrushmap

Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Awesome, thanks for the quick replies and insights. Seems like this is the issue you're talking about: http://tracker.ceph.com/issues/22769 which is set to be released in v12.2.6. We'll focus on investigating the issue regarding the resharding of buckets, hopefully this will solve the issue f

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
I would not run the cephfs disaster recovery tools. Your cluster was offline here because it couldn't do some writes, but it should still be self-consistent. On Thu, Jun 14, 2018 at 4:52 PM Oliver Schulz wrote: > They are recovered now, looks like it just took a bit > for them to "jump the queue

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Nick Fisk
I’ve seen things like this happen if you tend to end up with extreme weighting towards a small set of OSDs. CRUSH tries a slightly different combination of OSDs at each attempt, but with an extremely lopsided weighting it can run out of attempts before it finds a set of OSDs which mat

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
They are recovered now, looks like it just took a bit for them to "jump the queue". :-) Whew ... I remember something about there being some kind of fsck for CephFS now. Is that something I can/should run before I start my MDS daemons again? Maybe then I can finally reduce my MDS max-ran

Re: [ceph-users] ceph pg dump

2018-06-14 Thread Brad Hubbard
Try this and pay careful attention to the IPs and ports in use. Then you can make sure there are no connectivity issues. # ceph -s --debug_ms 20 On Fri, Jun 15, 2018 at 3:31 AM, Ranjan Ghosh wrote: > Hi all, > > we have two small clusters (3 nodes each) called alpha and beta. One node > (alpha0/

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
I don't think there's a way to help them. They "should" get priority in recovery, but there were a number of bugs with it in various versions and forcing that kind of priority without global decision making is prone to issues. But yep, looks like things will eventually become all good now. :) On

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Paul Emmerich
Can you post your whole crushmap? ceph osd getcrushmap -o crushmap crushtool -d crushmap -o crushmap.txt Paul 2018-06-14 22:39 GMT+02:00 Oliver Schulz : > Thanks, Greg!! > > I reset all the OSD weights to 1.00, and I think I'm in a much > better state now. The only trouble left in "ceph healt
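Once the map is decompiled, crushtool can also check whether the rule can actually place all replicas, which is relevant to the "no acting OSDs" symptom (rule id and replica count below are examples):

    ceph osd getcrushmap -o crushmap
    crushtool -d crushmap -o crushmap.txt
    # simulate placement and list any inputs the rule fails to map completely
    crushtool -i crushmap --test --rule 0 --num-rep 3 --show-bad-mappings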

Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Gregory Farnum
Yes. Deep scrub of a bucket index pool requires reading all the omap keys, and the rgw bucket indices can get quite large. The OSD will limit the number of keys it reads at a time to try and avoid overwhelming things. We backported to luminous (but after the 12.2.5 release, it looks like) a commit
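If you want to see how large the index objects actually are, something along these lines works (the pool name assumes the usual RGW defaults, the bucket marker is a placeholder):

    # list the index objects for a bucket, then count the keys in one shard
    rados -p default.rgw.buckets.index ls | grep '<bucket-marker>'
    rados -p default.rgw.buckets.index listomapkeys '.dir.<bucket-marker>.0' | wc -l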

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Thanks, Greg!! I reset all the OSD weights to 1.00, and I think I'm in a much better state now. The only trouble left in "ceph health detail" is PG_DEGRADED Degraded data redundancy: 4/404985012 objects degraded (0.000%), 3 pgs degraded pg 2.47 is active+recovery_wait+degraded+remapped, ac

Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Thank you for your reply. I'm not sure if this is the case, since we have a rather small cluster and the PGs have at most just over 10k objects (total objects in the cluster is about 9 million). During the 10 minute scrubs we're seeing a steady 10k iops on the underlying block device of the OS

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
On Thu, Jun 14, 2018 at 4:07 PM Oliver Schulz wrote: > Hi Greg, > > I increased the hard limit and rebooted everything. The > PG without acting OSDs still has none, but I also have > quite a few PGs that look like this now: > > pg 1.79c is stuck undersized for 470.640254, current state

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Hi Greg, I increased the hard limit and rebooted everything. The PG without acting OSDs still has none, but I also have quite a few PGs that look like this now: pg 1.79c is stuck undersized for 470.640254, current state active+undersized+degraded, last acting [179,154] I had that pr

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
On Thu, Jun 14, 2018 at 3:26 PM Oliver Schulz wrote: > But the contents of the remapped PGs should still be > Ok, right? What confuses me is that they don't > backfill - why don't they "move" where they belong? > > As for the PG hard limit, yes, I ran into this. Our > cluster had been very (very)

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Ah, I see some OSDs actually are over the 200 PG limit - I'll increase the hard limit and restart everything. On 14.06.2018 21:26, Oliver Schulz wrote: But the contents of the remapped PGs should still be Ok, right? What confuses me is that they don't backfill - why don't they "move" where they b
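For reference, a sketch of raising the per-OSD PG limits in Luminous (values are examples; the soft limit is the mon option, the hard limit is the ratio applied on top of it):

    # ceph.conf on mons and OSDs, followed by a daemon restart
    [global]
    mon_max_pg_per_osd = 300
    osd_max_pg_per_osd_hard_ratio = 3.0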

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
But the contents of the remapped PGs should still be Ok, right? What confuses me is that they don't backfill - why don't they "move" where they belong? As for the PG hard limit, yes, I ran into this. Our cluster had been very (very) full, but I wanted the new OSD nodes to use bluestore, so I updat

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
Okay, I can’t tell you what happened to that one pg, but you’ve got another 445 remapped pgs and that’s not a good state to be in. It was probably your use of reweight-by-utilization. :/ I am pretty sure the missing PG and remapped ones have the same root cause, and it’s possible but by no mea

[ceph-users] Reweighting causes whole cluster to peer/activate

2018-06-14 Thread Kevin Hrpcek
Hello, I'm seeing something that seems to be odd behavior when reweighting OSDs. I've just upgraded to 12.2.5 and am adding in a new osd server to the cluster. I gradually weight the 10TB OSDs into the cluster by doing a +1, letting things backfill for a while, then +1 until I reach my desire
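In case it helps anyone reading along, the gradual weighting usually looks like the sketch below (osd.120 and the final weight are examples for a 10TB drive):

    ceph osd crush reweight osd.120 1.0
    # wait for backfill to settle (watch ceph -s), then step up again
    ceph osd crush reweight osd.120 2.0
    # ...repeat until the device's full CRUSH weight, e.g.
    ceph osd crush reweight osd.120 9.09569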

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
I'm not running the balancer, but I did reweight-by-utilization a few times recently. "ceph osd tree" and "ceph -s" say: https://gist.github.com/oschulz/36d92af84851ec42e09ce1f3cacbc110 On 14.06.2018 20:23, Gregory Farnum wrote: Well, if this pg maps to no osds, something has certainly go
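For context, the reweight-by-utilization workflow looks roughly like this (the numeric arguments are the usual overload / max-change / max-OSDs parameters, shown with example values):

    ceph osd test-reweight-by-utilization           # dry run, shows what would change
    ceph osd reweight-by-utilization 120 0.05 8     # 120% overload threshold, 0.05 max change, at most 8 OSDs
    ceph osd reweight osd.17 1.0                    # undo a single override later if needed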

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
Well, if this pg maps to no osds, something has certainly gone wrong with your crush map. What’s the crush rule it’s using, and what’s the output of “ceph osd tree”? Are you running the manager’s balancer module or something that might be putting explicit mappings into the osd map and broken it? I

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Dear Greg, no, it's a very old cluster (continuous operation since 2013, with multiple extensions). It's a production cluster and there's about 300TB of valuable data on it. We recently updated to luminous and added more OSDs (a month ago or so), but everything seemed Ok since then. We didn't ha

Re: [ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Gregory Farnum
Is this a new cluster? Or did the crush map change somehow recently? One way this might happen is if CRUSH just failed entirely to map a pg, although I think if the pg exists anywhere it should still be getting reported as inactive. On Thu, Jun 14, 2018 at 8:40 AM Oliver Schulz wrote: > Dear all,

Re: [ceph-users] large omap object

2018-06-14 Thread Gregory Farnum
There may be a mismatch between the auto-resharding and the omap warning code. Looks like you already have 349 shards, with 13 of them warning on size! You can increase a config value to shut that error up, but you may want to get somebody from RGW to look at how you’ve managed to exceed those defau
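The config value in question is presumably the deep-scrub large-omap threshold; raising it would look roughly like this (the option name assumes the luminous-era default, the number is an example, and it only silences the warning rather than shrinking the index objects):

    ceph tell 'osd.*' injectargs '--osd_deep_scrub_large_omap_object_key_threshold 5000000'
    # make it persistent in ceph.conf too, then let a deep scrub re-run to clear the warning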

Re: [ceph-users] Aligning RBD stripe size with EC chunk size?

2018-06-14 Thread Gregory Farnum
On Thu, Jun 14, 2018 at 11:04 AM Lars Marowsky-Bree wrote: > Hi all, > > so, I'm wondering right now (with some urgency, ahem) how to make RBD on > EC pools faster without resorting to cache tiering. > > In a replicated pool, we had some success with RBD striping. > > I wonder if it would be poss

Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Gregory Farnum
Deep scrub needs to read every object in the pg. If some pgs are only taking 5 seconds they must be nearly empty (or maybe they only contain objects with small amounts of omap or something). Ten minutes is perfectly reasonable, but it is an added load on the cluster as it does all those object read

[ceph-users] ceph pg dump

2018-06-14 Thread Ranjan Ghosh
Hi all, we have two small clusters (3 nodes each) called alpha and beta. One node (alpha0/beta0) is on a remote site and only has monitor & manager. The two other nodes (alpha/beta-1/2) have all 4 services and contain the OSDs and are connected via an internal network. In short: alpha0 -

[ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Hello, We recently upgraded Ceph from version 12.2.2 to version 12.2.5. Since the upgrade we've been having performance issues which seem to relate to when deep-scrub actions are performed. Most of the time deep-scrub actions only take a couple of seconds at most, however occasionally it takes

[ceph-users] Aligning RBD stripe size with EC chunk size?

2018-06-14 Thread Lars Marowsky-Bree
Hi all, so, I'm wondering right now (with some urgency, ahem) how to make RBD on EC pools faster without resorting to cache tiering. In a replicated pool, we had some success with RBD striping. I wonder if it would be possible to align RBD stripe-unit with the EC chunk size ...? Is that worth p
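A sketch of what aligning the two might look like, assuming an EC profile with k=4 and stripe_unit=16K (i.e. a 64K EC stripe width); whether this actually helps is exactly the open question here:

    ceph osd erasure-code-profile get myprofile      # check k and stripe_unit
    ceph osd pool set ecpool allow_ec_overwrites true
    # image striping chosen so that writes land as full 64K EC stripes
    rbd create rbd/myimage --size 1T --data-pool ecpool \
        --object-size 4M --stripe-unit 64K --stripe-count 8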

Re: [ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster

2018-06-14 Thread Alessandro De Salvo
Hi, On 14/06/18 06:13, Yan, Zheng wrote: On Wed, Jun 13, 2018 at 9:35 PM Alessandro De Salvo wrote: Hi, On 13/06/18 14:40, Yan, Zheng wrote: On Wed, Jun 13, 2018 at 7:06 PM Alessandro De Salvo wrote: Hi, I'm trying to migrate a cephfs data pool to a different one in order to r

[ceph-users] How to fix a Ceph PG in unknown state with no OSDs?

2018-06-14 Thread Oliver Schulz
Dear all, I have a serious problem with our Ceph cluster: One of our PGs somehow ended up in this state (reported by "ceph health detail"): pg 1.XXX is stuck inactive for ..., current state unknown, last acting [] Also, "ceph pg map 1.xxx" reports: osdmap e525812 pg 1.721 (1.721) -> up
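One way to see what CRUSH itself computes for that PG, without touching the cluster, is to test the osdmap offline (pool id 1 and the pg id come from the report above):

    ceph osd getmap -o /tmp/osdmap
    # dump the calculated mapping for every PG in pool 1 and look at 1.721
    osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 1 | grep '^1\.721'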

Re: [ceph-users] Installing iSCSI support

2018-06-14 Thread Max Cuttins
Ok, I take your points. I'll do my best. On 13/06/2018 10:14, Lenz Grimmer wrote: On 06/12/2018 07:14 PM, Max Cuttins wrote: it's an honor for me to contribute to the main repo of ceph. We appreciate your support! Please take a look at http://docs.ceph.com/docs/master/start/documenting-ceph/

Re: [ceph-users] Problems with CephFS

2018-06-14 Thread Steininger, Herbert
Thanks guys, I was out of the office, I will try your suggestions and get back to you. And extending the cluster is something I will do in the near future; I just thought it would be better to get the cluster health back to “Normal” first. Thanks, Herbert From: ceph-users [mailto:ceph-users-

Re: [ceph-users] Add a new iSCSI gateway would not update client multipath

2018-06-14 Thread Max Cuttins
I did it of course! :) However I found the real issue. While I was playing with multipath I disabled some features on the RBD image, one of these was EXCLUSIVE LOCK, because I thought it was related to the issue with multipath. Instead this broke the RBD iSCSI target on the gateway side (but not
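For anyone else who disables features while debugging: re-enabling is straightforward (pool/image names are examples), though order matters since some features depend on exclusive-lock:

    rbd feature enable rbd/myimage exclusive-lock
    rbd feature enable rbd/myimage object-map fast-diff    # only if these were on before
    rbd info rbd/myimage | grep features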

[ceph-users] Frequent slow requests

2018-06-14 Thread Frank (lists)
Hi, On a small cluster (3 nodes) I frequently have slow requests. When dumping the inflight ops from the hanging OSD, it seems it doesn't get a 'response' for one of the subops. The events always look like: "events": [ { "time": "201
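The dump referred to is the admin-socket op dump; for reference (osd.5 is a placeholder):

    ceph daemon osd.5 dump_ops_in_flight      # current in-flight ops with their event timelines
    ceph daemon osd.5 dump_historic_ops       # recently completed slow ops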