Re: [ceph-users] Power outages!!! help!

2017-09-15 Thread hjcho616
Looking better... working on scrubbing...

HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set

Now PG 1.28.. looking at all old osds, dead or alive, the only one with a DIR_* directory is osd.4.  This appears to be the metadata pool!  21M of metadata can be quite a bit of stuff.. so I would like to rescue this!  But I am not able to start this OSD.  Exporting through ceph-objectstore-tool appears to crash, even with --skip-journal-replay and --skip-mount-omap (different failure).  As I mentioned in an earlier email, that exception thrown message is bogus...

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'
  what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
 in thread 7f812e7fb940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x55dee175fa57]
 2: (()+0x110c0) [0x7f812d0050c0]
 3: (gsignal()+0xcf) [0x7f812b438fcf]
 4: (abort()+0x16a) [0x7f812b43a3fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f812bd1fb3d]
 6: (()+0x5ebb6) [0x7f812bd1dbb6]
 7: (()+0x5ec01) [0x7f812bd1dc01]
 8: (()+0x5ee19) [0x7f812bd1de19]
 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55dee143001e]
 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55dee156d5f5]
 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55dee1562bb9]
 12: (DBObjectMap::init(bool)+0x288) [0x55dee1561eb8]
 13: (FileStore::mount()+0x2525) [0x55dee1498eb5]
 14: (main()+0x28c0) [0x55dee10c9400]
 15: (__libc_start_main()+0xf1) [0x7f812b4262b1]
 16: (()+0x34f747) [0x55dee1118747]
Aborted

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export --skip-journal-replay
terminate called after throwing an instance of 'std::domain_error'
  what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
 in thread 7fa6d087b940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x55abd356aa57]
 2: (()+0x110c0) [0x7fa6cf0850c0]
 3: (gsignal()+0xcf) [0x7fa6cd4b8fcf]
 4: (abort()+0x16a) [0x7fa6cd4ba3fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fa6cdd9fb3d]
 6: (()+0x5ebb6) [0x7fa6cdd9dbb6]
 7: (()+0x5ec01) [0x7fa6cdd9dc01]
 8: (()+0x5ee19) [0x7fa6cdd9de19]
 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55abd323b01e]
 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55abd33785f5]
 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55abd336dbb9]
 12: (DBObjectMap::init(bool)+0x288) [0x55abd336ceb8]
 13: (FileStore::mount()+0x2525) [0x55abd32a3eb5]
 14: (main()+0x28c0) [0x55abd2ed4400]
 15: (__libc_start_main()+0xf1) [0x7fa6cd4a62b1]
 16: (()+0x34f747) [0x55abd2f23747]
Aborted

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export --skip-mount-omap
ceph-objectstore-tool: /usr/include/boost/smart_ptr/scoped_ptr.hpp:99: T* boost::scoped_ptr::operator->() const [with T = ObjectMap]: Assertion `px != 0' failed.
*** Caught signal (Aborted) **
 in thread 7f14345c5940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x5575b50a9a57]
 2: (()+0x110c0) [0x7f1432dcf0c0]
 3: (gsignal()+0xcf) [0x7f1431202fcf]
 4: (abort()+0x16a) [0x7f14312043fa]
 5: (()+0x2be37) [0x7f14311fbe37]
 6: (()+0x2bee2) [0x7f14311fbee2]
 7: (()+0x2fa19c) [0x5575b4a0d19c]
 8: (FileStore::omap_get_values(coll_t const&, ghobject_t const&, std::set, std::allocator > const&, std::map, std::allocator > >*)+0x6c2) [0x5575b4dc9322]
 9: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x235) [0x5575b4ab3135]
 10: (main()+0x5bd6) [0x5575b4a16716]
 11: (__libc_start_main()+0xf1) [0x7f14311f02b1]
 12: (()+0x34f747) [0x5575b4a62747]

When trying to bring up osd.4 we get this message.  It feels very similar to the crash in the first two attempts above.

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x960e57) [0x5565e564ae57]
 2: (()+0x110c0) [0x7f34aa17e0c0]
 3: (gsignal()+0xcf) [0x7f34a81c4fcf]
 4: (abort()+0x16a) [0x7f34a81c63fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f34a8aabb3d]
 6: (()+0x5ebb6) [0x7f34a8aa9bb6]
 7: (()+0x5ec01) [0x7f34a8aa9c01]
 8: (()+0x5ee19) [0x7f34a8aa9e19]
 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x5565e531933e]
 10: (DBObjectMap::_Header::de
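
If an export ever does succeed, the plan would be to inject the PG into a
healthy OSD with the matching import operation and let the cluster recover
from there. A rough sketch only; osd.2 below is just a stand-in for whichever
healthy OSD receives the PG, and that OSD has to be stopped first:

# systemctl stop ceph-osd@2
# ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-2 --journal-path /var/lib/ceph/osd/ceph-2/journal --file ~/1.28.export
# systemctl start ceph-osd@2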

Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Josh Durgin
(Sorry for top posting, this email client isn't great at editing)


The mitigation strategy I mentioned before of forcing backfill could be 
backported to jewel, but I don't think it's a very good option for RBD users 
without SSDs.


In Luminous there is a command (something like 'ceph pg force-recovery') that 
you can use to prioritize recovery of particular PGs (and thus RBD images, 
with some scripting). This would at least let you limit the scope of affected 
images. A couple of folks from OVH added it for just this purpose.
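
To make the scripting part a little more concrete, the rough idea would be to
find the object prefix of an image, map its objects to PGs, and push those PGs
to the front of the queue. A minimal sketch, assuming Luminous and with
placeholder pool, image, object, and PG names:

# find the object prefix of the image you care about
rbd info rbd/vm-disk-1 | grep block_name_prefix
# map one of its objects (rbd_data.<prefix>.<index>) to a PG
ceph osd map rbd rbd_data.112a6b8b4567.0000000000000000
# push that PG to the front of the recovery queue
ceph pg force-recovery 1.3c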


Neither of these is an ideal workaround, but I haven't thought of a better one 
for existing versions.


Josh


Sent from Nine

From: Florian Haas 
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@lists.ceph.com; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops 
after OSDs rejoin cluster (also, slow requests)

> On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin  wrote: 
> >> So this affects just writes. Then I'm really not following the 
> >> reasoning behind the current behavior. Why would you want to wait for 
> >> the recovery of an object that you're about to clobber anyway? Naïvely 
> >> thinking an object like that would look like a candidate for 
> >> *eviction* from the recovery queue, not promotion to a higher 
> >> priority. Is this because the write could be a partial write, whereas 
> >> recovery would need to cover the full object? 
> > 
> > 
> > Generally most writes are partial writes - for RBD that's almost always 
> > the case - often writes are 512b or 4kb. It's also true for e.g. RGW 
> > bucket index updates (adding an omap key/value pair). 
>
> Sure, makes sense. 
>
> >> This is all under the disclaimer that I have no detailed 
> >> knowledge of the internals so this is all handwaving, but would a more 
> >> logical sequence of events not look roughly like this: 
> >> 
> >> 1. Are all replicas of the object available? If so, goto 4. 
> >> 2. Is the write a full object write? If so, goto 4. 
> >> 3. Read the local copy of the object, splice in the partial write, 
> >> making it a full object write. 
> >> 4. Evict the object from the recovery queue. 
> >> 5. Replicate the write. 
> >> 
> >> Forgive the silly use of goto; I'm wary of email clients mangling 
> >> indentation if I were to write this as a nested if block. :) 
> > 
> > 
> > This might be a useful optimization in some cases, but it would be 
> > rather complex to add to the recovery code. It may be worth considering 
> > at some point - same with deletes or other cases where the previous data 
> > is not needed. 
>
> Uh, yeah, waiting for an object to recover just so you can then delete 
> it, and blocking the delete I/O in the process, does also seem rather 
> very strange. 
>
> I think we do agree that any instance of I/O being blocked upward of 
> 30s in a VM is really really bad, but the way you describe it, I see 
> little chance for a Ceph-deploying cloud operator to ever make a 
> compelling case to their customers that such a thing is unlikely to 
> happen. And I'm not even sure if a knee-jerk reaction to buy faster 
> hardware would be a very prudent investment: it's basically all just a 
> factor of (a) how much I/O happens on a cluster during an outage, (b) 
> how many nodes/OSDs will be affected by that outage. Neither is very 
> predictable, and only (b) is something you have any influence over in 
> a cloud environment. Beyond a certain threshold of either (a) or (b), 
> the probability of *recovery* slowing a significant number of VMs to a 
> crawl approximates 1. 
>
> For an rgw bucket index pool, that's usually a sufficiently small 
> amount of data that allows you to sprinkle a few fast drives 
> throughout your cluster, create a ruleset with a separate root 
> (pre-Luminous) or making use of classes (Luminous and later), and then 
> assign that ruleset to the pool. But for RBD storage, that's usually 
> not an option — not at non-prohibitive cost, anyway. 
>
> Can you share your suggested workaround / mitigation strategy for 
> users that are currently being bitten by this behavior? If async 
> recovery lands in mimic with no chance of a backport, then it'll be a 
> while before LTS users get any benefit out of it. 
>
> Cheers, 
> Florian 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread Vasu Kulkarni
On Fri, Sep 15, 2017 at 3:49 PM, Gregory Farnum  wrote:
> On Fri, Sep 15, 2017 at 3:34 PM David Turner  wrote:
>>
>> I don't understand a single use case where I want updating my packages
>> using yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are so
>> many clusters out there with multiple types of daemons running on the same
>> server.
>>
>> My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.
>> If upgrading the packages restarts all of those daemons at once, then I'm
>> mixing MON versions, OSD versions and MDS versions every time I upgrade my
>> cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
>> then clients.
I think the trade-off one makes with a small cluster is that the upgrade is
going to be disruptive, but for a large, redundant cluster it is better
that the upgrade does the *full* job for a better user experience, because
once the decision to upgrade is made there is no point running the old
version of a daemon. I'm also not sure how format changes in a major
version would actually affect the newly upgraded files on a system still
running the old daemon - maybe we haven't hit the corner cases?

>>
>> Now let's take the Luminous upgrade which REQUIRES you to upgrade all of
>> your MONs before anything else... I'm screwed.  I literally can't perform
>> the upgrade if it's going to restart all of my daemons because it is
>> impossible for me to achieve a paxos quorum of MONs running the Luminous
>> binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
>> achieve that is to stop the entire cluster and every daemon,

Again, for a small physical-node cluster with colocated daemons that's
the compromise; one could use VMs inside the cluster to separate out the
upgrade process, with some trade-offs.

>> upgrade all of
>> the packages, then start the mons, then start the rest of the cluster
>> again... There is no way that is a desired behavior.
>>
>> All of this is ignoring large clusters using something like Puppet to
>> manage their package versions.  I want to just be able to update the ceph
>> version and push that out to the cluster.  It will install the new packages
>> to the entire cluster and then my automated scripts can perform a rolling
>> restart of the cluster upgrading all of the daemons while ensuring that the
>> cluster is healthy every step of the way.  I don't want to add in the time
>> of installing the packages on every node DURING the upgrade.  I want that
>> done before I initiate my script to be in a mixed version state as little as
>> possible.
>>
>> Claiming that having anything other than an issued command to specifically
>> restart a Ceph daemon is anything but a bug and undesirable sounds crazy to
>> me.  I don't ever want anything restarting my Ceph daemons that is not
>> explicitly called to do so.  That just sounds like it's begging to put my
>> entire cluster into a world of hurt by accidentally restarting too many
>> daemons at the same time making the data in my cluster inaccessible.
>>
>> I'm used to the Ubuntu side of things.  I've never seen upgrading the Ceph
>> packages to ever affect a daemon before.  If that's actually a thing that is
>> done on purpose in RHEL and CentOS... good riddance! That's ridiculous!
>
>
> I don't know what the settings are right now, or what the latest argument
> was to get them there.
>
> But we *have* had distributions require us to make changes to come into
> compliance with their packaging policies.
> Some users *do* want their daemons to automatically reboot on upgrade,
> because if you have segregated nodes that you're managing by hand, it's a
> lot easier to issue one command than two.
> And on and on and on.
>
> Personally, I tend closer to your position. But this is a thing that some
> people get very vocal about; we don't have a lot of upstream people
> interested in maintaining packaging or fighting with other interest groups
> who say we're doing it wrong; and it's just not a lot of fun to deal with.
>
> Looking through the git logs, I think CEPH_AUTO_RESTART_ON_UPGRADE was
> probably added so distros could easily make that distinction. And it would
> not surprise me if the use of selinux required restarts — upgrading packages
> tends to change what the daemon's selinux policy allows it to do, and if
> they have different behavior I presume selinux is going to complain
> wildly...
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-15 Thread Gregory Farnum
On Mon, Sep 11, 2017 at 1:10 PM Florian Haas  wrote:

> On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
>  wrote:
> >
> > On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick <
> patrick.mcl...@sony.com> wrote:
> > >
> > >> On a related note, we are very curious why the snapshot id is
> > >> incremented when a snapshot is deleted, this creates lots
> > >> phantom entries in the deleted snapshots set. Interleaved
> > >> deletions and creations will cause massive fragmentation in
> > >> the interval set. The only reason we can come up for this
> > >> is to track if anything changed, but I suspect a different
> > >> value that doesn't inject entries in to the interval set might
> > >> be better for this purpose.
> > > Yes, it's because having a sequence number tied in with the snapshots
> > > is convenient for doing comparisons. Those aren't leaked snapids that
> > > will make holes; when we increment the snapid to delete something we
> > > also stick it in the removed_snaps set. (I suppose if you alternate
> > > deleting a snapshot with adding one that does increase the size until
> > > you delete those snapshots; hrmmm. Another thing to avoid doing I
> > > guess.)
> > >
> >
> >
> > Fair enough, though it seems like these limitations of the
> > snapshot system should be documented.
>
> This is why I was so insistent on numbers, formulae or even
> rules-of-thumb to predict what works and what does not. Greg's "one
> snapshot per RBD per day is probably OK" from a few months ago seemed
> promising, but looking at your situation it's probably not that useful
> a rule.
>
>
> > We most likely would
> > have used a completely different strategy if it was documented
> > that certain snapshot creation and removal patterns could
> > cause the cluster to fall over over time.
>
> I think right now there are probably very few people, if any, who
> could *describe* the pattern that causes this. That complicates
> matters of documentation. :)
>
>
> > >>> It might really just be the osdmap update processing -- that would
> > >>> make me happy as it's a much easier problem to resolve. But I'm also
> > >>> surprised it's *that* expensive, even at the scales you've described.
>
> ^^ This is what I mean. It's kind of tough to document things if we're
> still in "surprised that this is causing harm" territory.
>
>
> > >> That would be nice, but unfortunately all the data is pointing
> > >> to PGPool::Update(),
> > > Yes, that's the OSDMap update processing I referred to. This is good
> > > in terms of our ability to remove it without changing client
> > > interfaces and things.
> >
> > That is good to hear, hopefully this stuff can be improved soon
> > then.
>
> Greg, can you comment on just how much potential improvement you see
> here? Is it more like "oh we know we're doing this one thing horribly
> inefficiently, but we never thought this would be an issue so we shied
> away from premature optimization, but we can easily reduce 70% CPU
> utilization to 1%" or rather like "we might be able to improve this by
> perhaps 5%, but 100,000 RBDs is too many if you want to be using
> snapshotting at all, for the foreseeable future"?
>

I got the chance to discuss this a bit with Patrick at the Open Source
Summit Wednesday (good to see you!).

So the idea in the previously-referenced CDM talk essentially involves
changing the way we distribute snap deletion instructions from a
"deleted_snaps" member in the OSDMap to a "deleting_snaps" member that gets
trimmed once the OSDs report to the manager that they've finished removing
that snapid. This should entirely resolve the CPU burn they're seeing
during OSDMap processing on the nodes, as it shrinks the intersection
operation down from "all the snaps" to merely "the snaps not-done-deleting".

The other reason we maintain the full set of deleted snaps is to prevent
client operations from re-creating deleted snapshots — we filter all client
IO which includes snaps against the deleted_snaps set in the PG. Apparently
this is also big enough in RAM to be a real (but much smaller) problem.

Unfortunately eliminating that is a lot harder and a permanent fix will
involve changing the client protocol in ways nobody has quite figured out
how to do. But Patrick did suggest storing the full set of deleted snaps
on-disk and only keeping in-memory the set which covers snapids in the
range we've actually *seen* from clients. I haven't gone through the code
but that seems broadly feasible — the hard part will be working out the
rules when you have to go to disk to read a larger part of the
deleted_snaps set. (Perfectly feasible.)

PRs are of course welcome! ;)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David Turner
I'm sorry for getting a little hot there.  You're definitely right that you
can't please everyone with a forced choice.  It's unfortunate that it can
so drastically impact an upgrade like it did here.  Is there a way to
configure yum or apt to make sure that it won't restart these (or guarantee
that it will for those folks)?
On Fri, Sep 15, 2017 at 6:49 PM Gregory Farnum  wrote:

> On Fri, Sep 15, 2017 at 3:34 PM David Turner 
> wrote:
>
>> I don't understand a single use case where I want updating my packages
>> using yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are so
>> many clusters out there with multiple types of daemons running on the same
>> server.
>>
>> My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.
>> If upgrading the packages restarts all of those daemons at once, then I'm
>> mixing MON versions, OSD versions and MDS versions every time I upgrade my
>> cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
>> then clients.
>>
>> Now let's take the Luminous upgrade which REQUIRES you to upgrade all of
>> your MONs before anything else... I'm screwed.  I literally can't perform
>> the upgrade if it's going to restart all of my daemons because it is
>> impossible for me to achieve a paxos quorum of MONs running the Luminous
>> binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
>> achieve that is to stop the entire cluster and every daemon, upgrade all of
>> the packages, then start the mons, then start the rest of the cluster
>> again... There is no way that is a desired behavior.
>>
>> All of this is ignoring large clusters using something like Puppet to
>> manage their package versions.  I want to just be able to update the ceph
>> version and push that out to the cluster.  It will install the new packages
>> to the entire cluster and then my automated scripts can perform a rolling
>> restart of the cluster upgrading all of the daemons while ensuring that the
>> cluster is healthy every step of the way.  I don't want to add in the time
>> of installing the packages on every node DURING the upgrade.  I want that
>> done before I initiate my script to be in a mixed version state as little
>> as possible.
>>
>> Claiming that having anything other than an issued command to
>> specifically restart a Ceph daemon is anything but a bug and undesirable
>> sounds crazy to me.  I don't ever want anything restarting my Ceph daemons
>> that is not explicitly called to do so.  That just sounds like it's begging
>> to put my entire cluster into a world of hurt by accidentally restarting
>> too many daemons at the same time making the data in my cluster
>> inaccessible.
>>
>> I'm used to the Ubuntu side of things.  I've never seen upgrading the
>> Ceph packages to ever affect a daemon before.  If that's actually a thing
>> that is done on purpose in RHEL and CentOS... good riddance! That's
>> ridiculous!
>>
>
> I don't know what the settings are right now, or what the latest argument
> was to get them there.
>
> But we *have* had distributions require us to make changes to come into
> compliance with their packaging policies.
> Some users *do* want their daemons to automatically reboot on upgrade,
> because if you have segregated nodes that you're managing by hand, it's a
> lot easier to issue one command than two.
> And on and on and on.
>
> Personally, I tend closer to your position. But this is a thing that some
> people get very vocal about; we don't have a lot of upstream people
> interested in maintaining packaging or fighting with other interest groups
> who say we're doing it wrong; and it's just not a lot of fun to deal with.
>
> Looking through the git logs, I think CEPH_AUTO_RESTART_ON_UPGRADE was
> probably added so distros could easily make that distinction. And it would
> not surprise me if the use of selinux required restarts — upgrading
> packages tends to change what the daemon's selinux policy allows it to do,
> and if they have different behavior I presume selinux is going to complain
> wildly...
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread Gregory Farnum
On Fri, Sep 15, 2017 at 3:34 PM David Turner  wrote:

> I don't understand a single use case where I want updating my packages
> using yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are so
> many clusters out there with multiple types of daemons running on the same
> server.
>
> My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.
> If upgrading the packages restarts all of those daemons at once, then I'm
> mixing MON versions, OSD versions and MDS versions every time I upgrade my
> cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
> then clients.
>
> Now let's take the Luminous upgrade which REQUIRES you to upgrade all of
> your MONs before anything else... I'm screwed.  I literally can't perform
> the upgrade if it's going to restart all of my daemons because it is
> impossible for me to achieve a paxos quorum of MONs running the Luminous
> binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
> achieve that is to stop the entire cluster and every daemon, upgrade all of
> the packages, then start the mons, then start the rest of the cluster
> again... There is no way that is a desired behavior.
>
> All of this is ignoring large clusters using something like Puppet to
> manage their package versions.  I want to just be able to update the ceph
> version and push that out to the cluster.  It will install the new packages
> to the entire cluster and then my automated scripts can perform a rolling
> restart of the cluster upgrading all of the daemons while ensuring that the
> cluster is healthy every step of the way.  I don't want to add in the time
> of installing the packages on every node DURING the upgrade.  I want that
> done before I initiate my script to be in a mixed version state as little
> as possible.
>
> Claiming that having anything other than an issued command to specifically
> restart a Ceph daemon is anything but a bug and undesirable sounds crazy to
> me.  I don't ever want anything restarting my Ceph daemons that is not
> explicitly called to do so.  That just sounds like it's begging to put my
> entire cluster into a world of hurt by accidentally restarting too many
> daemons at the same time making the data in my cluster inaccessible.
>
> I'm used to the Ubuntu side of things.  I've never seen upgrading the Ceph
> packages to ever affect a daemon before.  If that's actually a thing that
> is done on purpose in RHEL and CentOS... good riddance! That's ridiculous!
>

I don't know what the settings are right now, or what the latest argument
was to get them there.

But we *have* had distributions require us to make changes to come into
compliance with their packaging policies.
Some users *do* want their daemons to automatically reboot on upgrade,
because if you have segregated nodes that you're managing by hand, it's a
lot easier to issue one command than two.
And on and on and on.

Personally, I tend closer to your position. But this is a thing that some
people get very vocal about; we don't have a lot of upstream people
interested in maintaining packaging or fighting with other interest groups
who say we're doing it wrong; and it's just not a lot of fun to deal with.

Looking through the git logs, I think CEPH_AUTO_RESTART_ON_UPGRADE was
probably added so distros could easily make that distinction. And it would
not surprise me if the use of selinux required restarts — upgrading
packages tends to change what the daemon's selinux policy allows it to do,
and if they have different behavior I presume selinux is going to complain
wildly...
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Florian Haas
On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin  wrote:
>> So this affects just writes. Then I'm really not following the
>> reasoning behind the current behavior. Why would you want to wait for
>> the recovery of an object that you're about to clobber anyway? Naïvely
>> thinking an object like that would look like a candidate for
>> *eviction* from the recovery queue, not promotion to a higher
>> priority. Is this because the write could be a partial write, whereas
>> recovery would need to cover the full object?
>
>
> Generally most writes are partial writes - for RBD that's almost always
> the case - often writes are 512b or 4kb. It's also true for e.g. RGW
> bucket index updates (adding an omap key/value pair).

Sure, makes sense.

>> This is all under the disclaimer that I have no detailed
>> knowledge of the internals so this is all handwaving, but would a more
>> logical sequence of events not look roughly like this:
>>
>> 1. Are all replicas of the object available? If so, goto 4.
>> 2. Is the write a full object write? If so, goto 4.
>> 3. Read the local copy of the object, splice in the partial write,
>> making it a full object write.
>> 4. Evict the object from the recovery queue.
>> 5. Replicate the write.
>>
>> Forgive the silly use of goto; I'm wary of email clients mangling
>> indentation if I were to write this as a nested if block. :)
>
>
> This might be a useful optimization in some cases, but it would be
> rather complex to add to the recovery code. It may be worth considering
> at some point - same with deletes or other cases where the previous data
> is not needed.

Uh, yeah, waiting for an object to recover just so you can then delete
it, and blocking the delete I/O in the process, does also seem rather
very strange.

I think we do agree that any instance of I/O being blocked upward of
30s in a VM is really really bad, but the way you describe it, I see
little chance for a Ceph-deploying cloud operator to ever make a
compelling case to their customers that such a thing is unlikely to
happen. And I'm not even sure if a knee-jerk reaction to buy faster
hardware would be a very prudent investment: it's basically all just a
factor of (a) how much I/O happens on a cluster during an outage, (b)
how many nodes/OSDs will be affected by that outage. Neither is very
predictable, and only (b) is something you have any influence over in
a cloud environment. Beyond a certain threshold of either (a) or (b),
the probability of *recovery* slowing a significant number of VMs to a
crawl approximates 1.

For an rgw bucket index pool, that's usually a sufficiently small
amount of data that allows you to sprinkle a few fast drives
throughout your cluster, create a ruleset with a separate root
(pre-Luminous) or making use of classes (Luminous and later), and then
assign that ruleset to the pool. But for RBD storage, that's usually
not an option — not at non-prohibitive cost, anyway.
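
(For the Luminous device-class route, the commands are roughly the following;
the rule and pool names here are just examples:

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd

Pre-Luminous you'd build the equivalent rule against a separate SSD-only
root instead.)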

Can you share your suggested workaround / mitigation strategy for
users that are currently being bitten by this behavior? If async
recovery lands in mimic with no chance of a backport, then it'll be a
while before LTS users get any benefit out of it.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David Turner
I don't understand a single use case where I want updating my packages
using yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are so
many clusters out there with multiple types of daemons running on the same
server.

My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.  If
upgrading the packages restarts all of those daemons at once, then I'm
mixing MON versions, OSD versions and MDS versions every time I upgrade my
cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
then clients.

Now let's take the Luminous upgrade which REQUIRES you to upgrade all of
your MONs before anything else... I'm screwed.  I literally can't perform
the upgrade if it's going to restart all of my daemons because it is
impossible for me to achieve a paxos quorum of MONs running the Luminous
binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
achieve that is to stop the entire cluster and every daemon, upgrade all of
the packages, then start the mons, then start the rest of the cluster
again... There is no way that is a desired behavior.

All of this is ignoring large clusters using something like Puppet to
manage their package versions.  I want to just be able to update the ceph
version and push that out to the cluster.  It will install the new packages
to the entire cluster and then my automated scripts can perform a rolling
restart of the cluster upgrading all of the daemons while ensuring that the
cluster is healthy every step of the way.  I don't want to add in the time
of installing the packages on every node DURING the upgrade.  I want that
done before I initiate my script to be in a mixed version state as little
as possible.

Claiming that having anything other than an issued command to specifically
restart a Ceph daemon is anything but a bug and undesirable sounds crazy to
me.  I don't ever want anything restarting my Ceph daemons that is not
explicitly called to do so.  That just sounds like it's begging to put my
entire cluster into a world of hurt by accidentally restarting too many
daemons at the same time making the data in my cluster inaccessible.

I'm used to the Ubuntu side of things.  I've never seen upgrading the Ceph
packages to ever affect a daemon before.  If that's actually a thing that
is done on purpose in RHEL and CentOS... good riddance! That's ridiculous!

On Fri, Sep 15, 2017 at 6:06 PM Vasu Kulkarni  wrote:

> On Fri, Sep 15, 2017 at 2:10 PM, David Turner 
> wrote:
> > I'm glad that worked for you to finish the upgrade.
> >
> > He has multiple MONs, but all of them are on nodes with OSDs as well.
> When
> > he updated the packages on the first node, it restarted the MON and all
> of
> > the OSDs.  This is strictly not supported in the Luminous upgrade as the
> > OSDs can't be running Luminous code until all of the MONs are running
> > Luminous.  I have never seen updating Ceph packages cause a restart of
> the
> > daemons because you need to schedule the restarts and wait until the
> cluster
> > is back to healthy before restarting the next node to upgrade the
> daemons.
> > If upgrading the packages is causing a restart of the Ceph daemons, it is
> > most definitely a bug and needs to be fixed.
>
> The current spec file says that unless CEPH_AUTO_RESTART_ON_UPGRADE is
> set to "yes" it shouldn't restart, but I remember it does restart in my
> own testing as well. Although I see no harm, since the underlying
> binaries have changed, and for a cluster in redundant mode a restart of
> the service shouldn't cause any issue. But maybe it's still useful for
> some use cases.
>
>
> >
> > On Fri, Sep 15, 2017 at 4:48 PM David  wrote:
> >>
> >> Happy to report I got everything up to Luminous, used your tip to keep
> the
> >> OSDs running, David, thanks again for that.
> >>
> >> I'd say this is a potential gotcha for people collocating MONs. It
> appears
> >> that if you're running selinux, even in permissive mode, upgrading the
> >> ceph-selinux packages forces a restart on all the OSDs. You're left
> with a
> >> load of OSDs down that you can't start as you don't have a Luminous mon
> >> quorum yet.
> >>
> >>
> >> On 15 Sep 2017 4:54 p.m., "David"  wrote:
> >>
> >> Hi David
> >>
> >> I like your thinking! Thanks for the suggestion. I've got a maintenance
> >> window later to finish the update so will give it a try.
> >>
> >>
> >> On Thu, Sep 14, 2017 at 6:24 PM, David Turner 
> >> wrote:
> >>>
> >>> This isn't a great solution, but something you could try.  If you stop
> >>> all of the daemons via systemd and start them all in a screen as a
> manually
> >>> running daemon in the foreground of each screen... I don't think that
> yum
> >>> updating the packages can stop or start the daemons.  You could copy
> and
> >>> paste the running command (viewable in ps) to know exactly what to run
> in
> >>> the screens to start the daemons like this.
> >>>
> >>> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
> 
>  Hi All
> 
>  I di

Re: [ceph-users] OSD memory usage

2017-09-15 Thread Christian Wuerdig
Assuming you're using Bluestore, you could experiment with the cache
settings 
(http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/)

In your case, setting bluestore_cache_size_hdd lower than the default
1GB might help with the RAM usage.

Various people have reported solving OOM issues by setting this to
512MB; I'm not sure what the performance impact might be.
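
Something along these lines in ceph.conf on the OSD hosts, followed by an
OSD restart, is what I'd try first (value in bytes, purely illustrative):

[osd]
bluestore_cache_size_hdd = 536870912   # 512MB instead of the 1GB default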

On Tue, Sep 12, 2017 at 6:15 AM,   wrote:
> Please excuse my brain-fart.  We're using 24 disks on the servers in
> question.  Only after discussing this further with a colleague did we
> realize this.
>
> This brings us right to the minimum-spec which generally isn't a good idea.
>
> Sincerely
>
> -Dave
>
>
> On 11/09/17 11:38 AM, bulk.sch...@ucalgary.ca wrote:
>>
>> Hi Everyone,
>>
>> I wonder if someone out there has a similar problem to this?
>>
>> I keep having issues with memory usage.  I have 2 OSD servers with 48G
>> memory and 12 2TB OSDs.  I seem to have significantly more memory than
>> the minimum spec, but these two machines with 2TB drives seem to OOM
>> kill and crash periodically -- basically any time the cluster goes into
>> recovery for even 1 OSD this happens.
>>
>> 12 Drives * 2TB = 24 TB.  By the 1GB RAM per 1TB of disk rule, I
>> should need only 24GB or so.
>>
>> I am testing and benchmarking at this time so most changes are fine.  I
>> am abusing this filesystem considerably by running 14 clients with
>> something that is more or less dd each to a different file but that's
>> the point :)
>>
>> When it's working, the performance is really good: 3GB/s with a 3x
>> replicated data pool, up to around 10GB/s with 1x replication (just for
>> kicks and giggles). My bottleneck is likely the SAS channels to those
>> disks.
>>
>> I'm using the 12.2.0 release running on Centos 7
>>
>> Testing cephfs with one MDS and 3 montors.  The MON/MDS are not on the
>> servers in question.
>>
>> Total of around 350 OSDs (all spinning disk) most of which are 1TB
>> drives on 15 servers that are a bit older with Xeon E5620's.
>>
>> Dual QDR Infiniband (20GBit) fabrics (1 cluster and 1 client).
>>
>> Any thoughts?  Am I missing some tuning parameter in /proc or something?
>>
>> Thanks
>> -Dave
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread Vasu Kulkarni
On Fri, Sep 15, 2017 at 2:10 PM, David Turner  wrote:
> I'm glad that worked for you to finish the upgrade.
>
> He has multiple MONs, but all of them are on nodes with OSDs as well.  When
> he updated the packages on the first node, it restarted the MON and all of
> the OSDs.  This is strictly not supported in the Luminous upgrade as the
> OSDs can't be running Luminous code until all of the MONs are running
> Luminous.  I have never seen updating Ceph packages cause a restart of the
> daemons because you need to schedule the restarts and wait until the cluster
> is back to healthy before restarting the next node to upgrade the daemons.
> If upgrading the packages is causing a restart of the Ceph daemons, it is
> most definitely a bug and needs to be fixed.

The current spec file says that unless CEPH_AUTO_RESTART_ON_UPGRADE is
set to "yes" it shouldn't restart, but I remember it does restart in my
own testing as well. Although I see no harm, since the underlying
binaries have changed, and for a cluster in redundant mode a restart of
the service shouldn't cause any issue. But maybe it's still useful for
some use cases.
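
If someone wants to pin that down, something along these lines could be
checked. The sysconfig path is what I'd expect from the spec file, so verify
it on your distro, and the policy-rc.d stub is the usual way to keep apt's
maintainer scripts from touching services during the install:

# RPM-based systems: see whether the variable is set
grep CEPH_AUTO_RESTART_ON_UPGRADE /etc/sysconfig/ceph
# Debian/Ubuntu: temporarily forbid service (re)starts during the install
printf '#!/bin/sh\nexit 101\n' > /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d
apt-get install ceph
rm /usr/sbin/policy-rc.d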


>
> On Fri, Sep 15, 2017 at 4:48 PM David  wrote:
>>
>> Happy to report I got everything up to Luminous, used your tip to keep the
>> OSDs running, David, thanks again for that.
>>
>> I'd say this is a potential gotcha for people collocating MONs. It appears
>> that if you're running selinux, even in permissive mode, upgrading the
>> ceph-selinux packages forces a restart on all the OSDs. You're left with a
>> load of OSDs down that you can't start as you don't have a Luminous mon
>> quorum yet.
>>
>>
>> On 15 Sep 2017 4:54 p.m., "David"  wrote:
>>
>> Hi David
>>
>> I like your thinking! Thanks for the suggestion. I've got a maintenance
>> window later to finish the update so will give it a try.
>>
>>
>> On Thu, Sep 14, 2017 at 6:24 PM, David Turner 
>> wrote:
>>>
>>> This isn't a great solution, but something you could try.  If you stop
>>> all of the daemons via systemd and start them all in a screen as a manually
>>> running daemon in the foreground of each screen... I don't think that yum
>>> updating the packages can stop or start the daemons.  You could copy and
>>> paste the running command (viewable in ps) to know exactly what to run in
>>> the screens to start the daemons like this.
>>>
>>> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:

 Hi All

 I did a Jewel -> Luminous upgrade on my dev cluster and it went very
 smoothly.

 I've attempted to upgrade on a small production cluster but I've hit a
 snag.

 After installing the ceph 12.2.0 packages with "yum install ceph" on the
 first node and accepting all the dependencies, I found that all the OSD
 daemons, the MON and the MDS running on that node were terminated. Systemd
 appears to have attempted to restart them all but the daemons didn't start
 successfully (not surprising as first stage of upgrading all mons in 
 cluster
 not completed). I was able to start the MON and it's running. The OSDs are
 all down and I'm reluctant to attempt to start them without upgrading the
 other MONs in the cluster. I'm also reluctant to attempt upgrading the
 remaining 2 MONs without understanding what happened.

 The cluster is on Jewel 10.2.5 (as was the dev cluster)
 Both clusters running on CentOS 7.3

 The only obvious difference I can see between the dev and production is
 the production has selinux running in permissive mode, the dev had it
 disabled.

 Any advice on how to proceed at this point would be much appreciated.
 The cluster is currently functional, but I have 1 node out 4 with all OSDs
 down. I had noout set before the upgrade and I've left it set for now.

 Here's the journalctl right after the packages were installed (hostname
 changed):

 https://pastebin.com/fa6NMyjG

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David Turner
I'm glad that worked for you to finish the upgrade.

He has multiple MONs, but all of them are on nodes with OSDs as well.  When
he updated the packages on the first node, it restarted the MON and all of
the OSDs.  This is strictly not supported in the Luminous upgrade as the
OSDs can't be running Luminous code until all of the MONs are running
Luminous.  I have never seen updating Ceph packages cause a restart of the
daemons because you need to schedule the restarts and wait until the
cluster is back to healthy before restarting the next node to upgrade the
daemons.  If upgrading the packages is causing a restart of the Ceph
daemons, it is most definitely a bug and needs to be fixed.

On Fri, Sep 15, 2017 at 4:48 PM David  wrote:

> Happy to report I got everything up to Luminous, used your tip to keep the
> OSDs running, David, thanks again for that.
>
> I'd say this is a potential gotcha for people collocating MONs. It appears
> that if you're running selinux, even in permissive mode, upgrading the
> ceph-selinux packages forces a restart on all the OSDs. You're left with a
> load of OSDs down that you can't start as you don't have a Luminous mon
> quorum yet.
>
>
> On 15 Sep 2017 4:54 p.m., "David"  wrote:
>
> Hi David
>
> I like your thinking! Thanks for the suggestion. I've got a maintenance
> window later to finish the update so will give it a try.
>
>
> On Thu, Sep 14, 2017 at 6:24 PM, David Turner 
> wrote:
>
>> This isn't a great solution, but something you could try.  If you stop
>> all of the daemons via systemd and start them all in a screen as a manually
>> running daemon in the foreground of each screen... I don't think that yum
>> updating the packages can stop or start the daemons.  You could copy and
>> paste the running command (viewable in ps) to know exactly what to run in
>> the screens to start the daemons like this.
>>
>> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>>
>>> Hi All
>>>
>>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>>> smoothly.
>>>
>>> I've attempted to upgrade on a small production cluster but I've hit a
>>> snag.
>>>
>>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>>> first node and accepting all the dependencies, I found that all the OSD
>>> daemons, the MON and the MDS running on that node were terminated. Systemd
>>> appears to have attempted to restart them all but the daemons didn't start
>>> successfully (not surprising as first stage of upgrading all mons in
>>> cluster not completed). I was able to start the MON and it's running. The
>>> OSDs are all down and I'm reluctant to attempt to start them without
>>> upgrading the other MONs in the cluster. I'm also reluctant to attempt
>>> upgrading the remaining 2 MONs without understanding what happened.
>>>
>>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>>> Both clusters running on CentOS 7.3
>>>
>>> The only obvious difference I can see between the dev and production is
>>> the production has selinux running in permissive mode, the dev had it
>>> disabled.
>>>
>>> Any advice on how to proceed at this point would be much appreciated.
>>> The cluster is currently functional, but I have 1 node out 4 with all OSDs
>>> down. I had noout set before the upgrade and I've left it set for now.
>>>
>>> Here's the journalctl right after the packages were installed (hostname
>>> changed):
>>>
>>> https://pastebin.com/fa6NMyjG
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread Vasu Kulkarni
On Fri, Sep 15, 2017 at 1:48 PM, David  wrote:
> Happy to report I got everything up to Luminous, used your tip to keep the
> OSDs running, David, thanks again for that.
>
> I'd say this is a potential gotcha for people collocating MONs. It appears
> that if you're running selinux, even in permissive mode, upgrading the
> ceph-selinux packages forces a restart on all the OSDs.
It is the ceph-osd and/or ceph-mon package that got upgraded and
restarted the service. This is the *correct* behavior; if it is not
restarting after the upgrade then it's a bug.

> You're left with a
> load of OSDs down that you can't start as you don't have a Luminous mon
> quorum yet.
Do you have only one monitor? It is recommended to have 3 or more mons
(an odd number) for HA. The release notes also mention upgrading mons
one by one. As long as you have redundancy, mon/osd collocation
shouldn't matter much.

>
>
> On 15 Sep 2017 4:54 p.m., "David"  wrote:
>
> Hi David
>
> I like your thinking! Thanks for the suggestion. I've got a maintenance
> window later to finish the update so will give it a try.
>
>
> On Thu, Sep 14, 2017 at 6:24 PM, David Turner  wrote:
>>
>> This isn't a great solution, but something you could try.  If you stop all
>> of the daemons via systemd and start them all in a screen as a manually
>> running daemon in the foreground of each screen... I don't think that yum
>> updating the packages can stop or start the daemons.  You could copy and
>> paste the running command (viewable in ps) to know exactly what to run in
>> the screens to start the daemons like this.
>>
>> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>>>
>>> Hi All
>>>
>>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>>> smoothly.
>>>
>>> I've attempted to upgrade on a small production cluster but I've hit a
>>> snag.
>>>
>>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>>> first node and accepting all the dependencies, I found that all the OSD
>>> daemons, the MON and the MDS running on that node were terminated. Systemd
>>> appears to have attempted to restart them all but the daemons didn't start
>>> successfully (not surprising as first stage of upgrading all mons in cluster
>>> not completed). I was able to start the MON and it's running. The OSDs are
>>> all down and I'm reluctant to attempt to start them without upgrading the
>>> other MONs in the cluster. I'm also reluctant to attempt upgrading the
>>> remaining 2 MONs without understanding what happened.
>>>
>>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>>> Both clusters running on CentOS 7.3
>>>
>>> The only obvious difference I can see between the dev and production is
>>> the production has selinux running in permissive mode, the dev had it
>>> disabled.
>>>
>>> Any advice on how to proceed at this point would be much appreciated. The
>>> cluster is currently functional, but I have 1 node out 4 with all OSDs down.
>>> I had noout set before the upgrade and I've left it set for now.
>>>
>>> Here's the journalctl right after the packages were installed (hostname
>>> changed):
>>>
>>> https://pastebin.com/fa6NMyjG
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David
Happy to report I got everything up to Luminous, used your tip to keep the
OSDs running, David, thanks again for that.

I'd say this is a potential gotcha for people collocating MONs. It appears
that if you're running selinux, even in permissive mode, upgrading the
ceph-selinux packages forces a restart on all the OSDs. You're left with a
load of OSDs down that you can't start as you don't have a Luminous mon
quorum yet.


On 15 Sep 2017 4:54 p.m., "David"  wrote:

Hi David

I like your thinking! Thanks for the suggestion. I've got a maintenance
window later to finish the update so will give it a try.


On Thu, Sep 14, 2017 at 6:24 PM, David Turner  wrote:

> This isn't a great solution, but something you could try.  If you stop all
> of the daemons via systemd and start them all in a screen as a manually
> running daemon in the foreground of each screen... I don't think that yum
> updating the packages can stop or start the daemons.  You could copy and
> paste the running command (viewable in ps) to know exactly what to run in
> the screens to start the daemons like this.
>
> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>
>> Hi All
>>
>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>> smoothly.
>>
>> I've attempted to upgrade on a small production cluster but I've hit a
>> snag.
>>
>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>> first node and accepting all the dependencies, I found that all the OSD
>> daemons, the MON and the MDS running on that node were terminated. Systemd
>> appears to have attempted to restart them all but the daemons didn't start
>> successfully (not surprising as first stage of upgrading all mons in
>> cluster not completed). I was able to start the MON and it's running. The
>> OSDs are all down and I'm reluctant to attempt to start them without
>> upgrading the other MONs in the cluster. I'm also reluctant to attempt
>> upgrading the remaining 2 MONs without understanding what happened.
>>
>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>> Both clusters running on CentOS 7.3
>>
>> The only obvious difference I can see between the dev and production is
>> the production has selinux running in permissive mode, the dev had it
>> disabled.
>>
>> Any advice on how to proceed at this point would be much appreciated. The
>> cluster is currently functional, but I have 1 node out 4 with all OSDs
>> down. I had noout set before the upgrade and I've left it set for now.
>>
>> Here's the journalctl right after the packages were installed (hostname
>> changed):
>>
>> https://pastebin.com/fa6NMyjG
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Josh Durgin

On 09/15/2017 01:57 AM, Florian Haas wrote:
> On Fri, Sep 15, 2017 at 8:58 AM, Josh Durgin  wrote:
>> This is more of an issue with write-intensive RGW buckets, since the
>> bucket index object is a single bottleneck if it needs recovery, and
>> all further writes to a shard of a bucket index will be blocked on that
>> bucket index object.
>
> Well, yes, the problem impact may be even worse on rgw, but you do
> agree that the problem does exist for RBD too, correct? (The hard
> evidence points to that.)

Yes, of course it still exists for RBD or other uses.

>> There's a description of the idea here:
>>
>> https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92
>
> Thanks! That raises another question:
>
> "Until now, this recovery process was synchronous - it blocked writes
> to an object until it was recovered."
>
> So this affects just writes. Then I'm really not following the
> reasoning behind the current behavior. Why would you want to wait for
> the recovery of an object that you're about to clobber anyway? Naïvely
> thinking an object like that would look like a candidate for
> *eviction* from the recovery queue, not promotion to a higher
> priority. Is this because the write could be a partial write, whereas
> recovery would need to cover the full object?

Generally most writes are partial writes - for RBD that's almost always
the case - often writes are 512b or 4kb. It's also true for e.g. RGW
bucket index updates (adding an omap key/value pair).

> This is all under the disclaimer that I have no detailed
> knowledge of the internals so this is all handwaving, but would a more
> logical sequence of events not look roughly like this:
>
> 1. Are all replicas of the object available? If so, goto 4.
> 2. Is the write a full object write? If so, goto 4.
> 3. Read the local copy of the object, splice in the partial write,
> making it a full object write.
> 4. Evict the object from the recovery queue.
> 5. Replicate the write.
>
> Forgive the silly use of goto; I'm wary of email clients mangling
> indentation if I were to write this as a nested if block. :)

This might be a useful optimization in some cases, but it would be
rather complex to add to the recovery code. It may be worth considering
at some point - same with deletes or other cases where the previous data
is not needed.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSDs are down after Server reboot

2017-09-15 Thread Joe Comeau
 
We're running journals on NVMe as well - SLES 
 
before rebooting try deleting the links here:
 /etc/systemd/system/ceph-osd.target.wants/
 
if we delete them first it boots ok
if we don't delete them, the disks sometimes don't come up and we have
to run ceph-disk activate-all
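
i.e. something along these lines (adjust to your own OSD ids):

 rm /etc/systemd/system/ceph-osd.target.wants/ceph-osd@*.service
 reboot
 # if some OSDs still don't come up afterwards:
 ceph-disk activate-all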

HTH
 
Thanks Joe

>>> David Turner  9/15/2017 9:54 AM >>>
I have this issue with my NVMe OSDs, but not my HDD OSDs. I have 15
HDD's and 2 NVMe's in each host. We put most of the journals on one of
the NVMe's and a few on the second, but added a small OSD partition to
the second NVMe for RGW metadata pools.

When restarting a server manually for testing, the NVMe OSD comes back
up normally. We're tracking a problem with the OSD nodes freezing and
having to force reboot them. After this, the NVMe OSD doesn't come back
on its own until I run `ceph-disk activate-all`. This seems to track
with your theory that a non-clean FS is a part of the equation.

Are there any ideas as to how to resolve this yet? So far being able to
run `ceph-disk activate-all` is good enough, but a bit of a nuisance.

On Fri, Sep 15, 2017 at 11:48 AM Matthew Vernon 
wrote:


Hi,

On 14/09/17 16:26, Götz Reinicke wrote:

> maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, ceph 10.2.7.
>
> I did a kernel update to the recent CentOS 7.3 one on a node and did a
> reboot.
>
> After that, 10 OSDs did not come up like the others. The disks did not
> get mounted and the OSD processes did nothing … even after a couple of
> minutes no more disks/OSDs showed up.
>
> So I did a ceph-disk activate-all.
>
> And all missing OSDs got back online.
>
> Questions: Any hints on debugging why the disks did not come online
> after the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we
upgraded
from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.

I'm still digging, but AFAICT it's a race condition in startup - in
our
case, we're only seeing it if some of the filesystems aren't clean.
This
may be related to the thread "Very slow start of osds after reboot"
from
August, but I don't think any conclusion was reached there.

Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.


[ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-15 Thread Lazuardi Nasution
Hi,

1. Is it possible to configure osd_data not as a small partition on the OSD
but as a folder (e.g. on the root disk)? If yes, how is that done with
ceph-disk, and what are the pros/cons of doing it?
2. Are the WAL & DB sizes calculated based on OSD size or on expected
throughput, as with the journal device on filestore? If not, what are the
default values and the pros/cons of adjusting them? (see the sketch below)
3. Does partition alignment matter on Bluestore, including for the WAL & DB
when using a separate device for them?
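
For context on question 2, a minimal sketch of how a separate DB/WAL device is typically specified with ceph-disk; the sizes and device names below are illustrative assumptions, not recommendations:

# sizes are read from ceph.conf at prepare time, e.g.
#   [osd]
#   bluestore_block_db_size  = 10737418240   # 10 GiB (illustrative)
#   bluestore_block_wal_size = 1073741824    # 1 GiB (illustrative)
# create a BlueStore OSD on /dev/sdb with DB and WAL on a separate NVMe
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1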

Best regards,


[ceph-users] mon health status gone from display

2017-09-15 Thread Alex Gorbachev
In Jewel and prior there was a health status for MONs in the ceph -s JSON
output; this seems to be gone now.  Is there a place where the status of
a given monitor is shown in Luminous?
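
For what it's worth, a few commands that still report per-monitor state (a hedged sketch; the exact fields may differ between releases):

ceph mon stat                       # quorum membership and leader
ceph quorum_status -f json-pretty   # per-mon names, ranks, quorum details
ceph status -f json-pretty          # overall health and quorum summary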

Thank you
--
Alex Gorbachev
Storcium


Re: [ceph-users] Some OSDs are down after Server reboot

2017-09-15 Thread David Turner
I have this issue with my NVMe OSDs, but not my HDD OSDs.  I have 15 HDD's
and 2 NVMe's in each host.  We put most of the journals on one of the
NVMe's and a few on the second, but added a small OSD partition to the
second NVMe for RGW metadata pools.

When restarting a server manually for testing, the NVMe OSD comes back up
normally.  We're tracking a problem with the OSD nodes freezing and having
to force reboot them.  After this, the NVMe OSD doesn't come back on its
own until I run `ceph-disk activate-all`.  This seems to track with your
theory that a non-clean FS is a part of the equation.

Are there any ideas as to how to resolve this yet?  So far being able to run
`ceph-disk activate-all` is good enough, but a bit of a nuisance.

On Fri, Sep 15, 2017 at 11:48 AM Matthew Vernon  wrote:

> Hi,
>
> On 14/09/17 16:26, Götz Reinicke wrote:
>
> > maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> > OSDs), CentOS 7.3, ceph 10.2.7.
> >
> > I did a kernel update to the recent CentOS 7.3 one on a node and did a
> > reboot.
> >
> > After that, 10 OSDs did not come up like the others. The disks did not get
> > mounted and the OSD processes did nothing … even after a couple of
> > minutes no more disks/OSDs showed up.
> >
> > So I did a ceph-disk activate-all.
> >
> > And all missing OSDs got back online.
> >
> > Questions: Any hints on debugging why the disks did not come online after
> > the reboot?
>
> We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded
> from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.
>
> I'm still digging, but AFAICT it's a race condition in startup - in our
> case, we're only seeing it if some of the filesystems aren't clean. This
> may be related to the thread "Very slow start of osds after reboot" from
> August, but I don't think any conclusion was reached there.
>
> Regards,
>
> Matthew
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.


Re: [ceph-users] Mixed versions of cluster and clients

2017-09-15 Thread Mike A

> On 15 Sept 2017, at 18:42, Sage Weil  wrote:
> 
> On Fri, 15 Sep 2017, Mike A wrote:
>> Hello!
>> 
>> We have a Ceph cluster based on the Jewel release and one virtualization 
>> infrastructure that is using the cluster. Now we are going to add another 
>> Ceph cluster, based on Luminous with BlueStore. 
>> The virtualization infrastructure must use both Ceph clusters.  Do I need 
>> to update the client software (librbd/librados) inside the 
>> virtualization infra? 
>> 
>> I think there are 3 different ways to add the new cluster:
>> 1. Update the client side to the Luminous release and leave the old cluster 
>> on the Jewel release
>> 2. Update the old cluster and the client to the Luminous release
>> 3. Leave the old cluster and the client on the Jewel release
>> 
>> Please suggest pros and cons.
> 
> You can use either jewel or luminous clients.  Just be aware that the 
> luminous cluster can't use luminous-only features until clients are 
> upgrade.  By default, new luminous clusters will set their "min compat 
> client" to jewel, so no special configuration is needed.
> 
> 3 is the smallest change and least risk, which is appealing. You'll want 
> to do 2 eventually, and 1 is a step along that path.
> 
> sage

Thank you for the detailed answer.

— 
Mike, runs!


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David
Hi David

I like your thinking! Thanks for the suggestion. I've got a maintenance
window later to finish the update so will give it a try.
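
For anyone following along, a sketch of what that could look like (the exact daemon command lines should be copied from ps on your own nodes; the unit names and IDs below are assumptions based on a stock Jewel install):

# stop the systemd-managed daemons first
systemctl stop ceph-mon.target ceph-osd.target ceph-mds.target
# then run each daemon in the foreground inside its own screen session
screen -dmS mon-a  /usr/bin/ceph-mon -f --cluster ceph --id a  --setuser ceph --setgroup ceph
screen -dmS osd-12 /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph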


On Thu, Sep 14, 2017 at 6:24 PM, David Turner  wrote:

> This isn't a great solution, but something you could try.  If you stop all
> of the daemons via systemd and start each one manually in the foreground
> inside its own screen session, I don't think that yum updating the packages
> can stop or restart the daemons.  You could copy and paste the running
> command (viewable in ps) to know exactly what to run in the screens to
> start the daemons that way.
>
> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>
>> Hi All
>>
>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>> smoothly.
>>
>> I've attempted to upgrade on a small production cluster but I've hit a
>> snag.
>>
>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>> first node and accepting all the dependencies, I found that all the OSD
>> daemons, the MON and the MDS running on that node were terminated. Systemd
>> appears to have attempted to restart them all but the daemons didn't start
>> successfully (not surprising as first stage of upgrading all mons in
>> cluster not completed). I was able to start the MON and it's running. The
>> OSDs are all down and I'm reluctant to attempt to start them without
>> upgrading the other MONs in the cluster. I'm also reluctant to attempt
>> upgrading the remaining 2 MONs without understanding what happened.
>>
>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>> Both clusters running on CentOS 7.3
>>
>> The only obvious difference I can see between the dev and production is
>> the production has selinux running in permissive mode, the dev had it
>> disabled.
>>
>> Any advice on how to proceed at this point would be much appreciated. The
>> cluster is currently functional, but I have 1 node out of 4 with all OSDs
>> down. I had noout set before the upgrade and I've left it set for now.
>>
>> Here's the journalctl right after the packages were installed (hostname
>> changed):
>>
>> https://pastebin.com/fa6NMyjG
>>


Re: [ceph-users] Some OSDs are down after Server reboot

2017-09-15 Thread Matthew Vernon
Hi,

On 14/09/17 16:26, Götz Reinicke wrote:

> maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, ceph 10.2.7.
> 
> I did a kernel update to the recent CentOS 7.3 one on a node and did a
> reboot.
> 
> After that, 10 OSDs did not come up like the others. The disks did not get
> mounted and the OSD processes did nothing … even after a couple of
> minutes no more disks/OSDs showed up.
> 
> So I did a ceph-disk activate-all.
> 
> And all missing OSDs got back online.
> 
> Questions: Any hints on debugging why the disks did not come online after
> the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded
from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.

I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


Re: [ceph-users] Mixed versions of cluster and clients

2017-09-15 Thread Sage Weil
On Fri, 15 Sep 2017, Mike A wrote:
> Hello!
> 
> We have a Ceph cluster based on the Jewel release and one virtualization 
> infrastructure that is using the cluster. Now we are going to add another 
> Ceph cluster, based on Luminous with BlueStore. 
> The virtualization infrastructure must use both Ceph clusters.  Do I need to 
> update the client software (librbd/librados) inside the virtualization 
> infra? 
> 
> I think there are 3 different ways to add the new cluster:
> 1. Update the client side to the Luminous release and leave the old cluster 
> on the Jewel release
> 2. Update the old cluster and the client to the Luminous release
> 3. Leave the old cluster and the client on the Jewel release
> 
> Please suggest pros and cons.

You can use either jewel or luminous clients.  Just be aware that the 
luminous cluster can't use luminous-only features until clients are 
upgraded.  By default, new luminous clusters will set their "min compat 
client" to jewel, so no special configuration is needed.

3 is the smallest change and least risk, which is appealing. You'll want 
to do 2 eventually, and 1 is a step along that path.

sage


[ceph-users] Mixed versions of cluster and clients

2017-09-15 Thread Mike A
Hello!

We have a Ceph cluster based on the Jewel release and one virtualization 
infrastructure that is using the cluster. Now we are going to add another Ceph 
cluster, based on Luminous with BlueStore. 
The virtualization infrastructure must use both Ceph clusters.  Do I need to 
update the client software (librbd/librados) inside the virtualization 
infra? 

I think there are 3 different ways to add the new cluster:
1. Update the client side to the Luminous release and leave the old cluster 
on the Jewel release
2. Update the old cluster and the client to the Luminous release
3. Leave the old cluster and the client on the Jewel release

Please suggest pros and cons.

— 
Mike, runs!


Re: [ceph-users] Power outages!!! help!

2017-09-15 Thread hjcho616
After running ceph osd lost on osd.0, it started backfilling... I figured that was 
supposed to happen earlier when I added those missing PGs.  Running into "too 
few PGs per OSD", I had removed those OSDs after the cluster stopped working when 
I added them.  But I guess I still needed them.  Currently I see several incomplete 
PGs and am trying to import those PGs back. =P
As far as 1.28 goes, it didn't look like it was limited by osd.0: the logs didn't 
show any signs of osd.0, and the data is only available on osd.4, which wouldn't 
export... So I still need to deal with that one.  It is still showing up as 
incomplete.. =P  Any recommendations on how to get that back?

pg 1.28 is stuck inactive since forever, current state down+incomplete, last acting [11,6]
pg 1.28 is stuck unclean since forever, current state down+incomplete, last acting [11,6]
pg 1.28 is down+incomplete, acting [11,6] (reducing pool metadata min_size from 2 may help; search ceph.com/docs for 'incomplete')
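
In case it helps, the hint in that health message maps to roughly the following (assuming the pool really is named metadata; lowering min_size trades redundancy for availability, so use with care):

ceph pg 1.28 query                        # see why peering is blocked
ceph osd pool set metadata min_size 1     # only if you accept the reduced redundancy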

Regards,
Hong


On Friday, September 15, 2017 4:51 AM, Ronny Aasen 
 wrote:
 

 
You write you had all PGs exported except one, so I assume you have 
injected those PGs into the cluster again using the method linked a few 
times in this thread. How did that go? Were you successful in 
recovering those PGs?

kind regards.
Ronny Aasen



On 15. sep. 2017 07:52, hjcho616 wrote:
> I just did this and backfilling started.  Let's see where this takes me.
> ceph osd lost 0 --yes-i-really-mean-it
> 
> Regards,
> Hong
> 
> 
> On Friday, September 15, 2017 12:44 AM, hjcho616  wrote:
> 
> 
> Ronny,
> 
> Working with all of the pgs shown in the "ceph health detail", I ran 
> below for each PG to export.
> ceph-objectstore-tool --op export --pgid 0.1c  --data-path 
> /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal 
> --skip-journal-replay --file 0.1c.export
> 
> I have all PGs exported, except 1... PG 1.28.  It is on ceph-4.  This 
> error doesn't make much sense to me.  Looking at the source code from 
> https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that 
> message is telling me struct_v is 1... but not sure how it ended up in 
> the default in the case statement when 1 case is defined...  I tried 
> with --skip-journal-replay, fails with same error message.
> ceph-objectstore-tool --op export --pgid 1.28  --data-path 
> /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal 
> --file 1.28.export
> terminate called after throwing an instance of 'std::domain_error'
>    what():  coll_t::decode(): don't know how to decode version 1
> *** Caught signal (Aborted) **
>  in thread 7fabc5ecc940 thread_name:ceph-objectstor
>  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
>  1: (()+0x996a57) [0x55b2d3323a57]
>  2: (()+0x110c0) [0x7fabc46d50c0]
>  3: (gsignal()+0xcf) [0x7fabc2b08fcf]
>  4: (abort()+0x16a) [0x7fabc2b0a3fa]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d]
>  6: (()+0x5ebb6) [0x7fabc33edbb6]
>  7: (()+0x5ec01) [0x7fabc33edc01]
>  8: (()+0x5ee19) [0x7fabc33ede19]
>  9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e]
>  10: 
> (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) 
> [0x55b2d31315f5]
>  11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9]
>  12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8]
>  13: (FileStore::mount()+0x2525) [0x55b2d305ceb5]
>  14: (main()+0x28c0) [0x55b2d2c8d400]
>  15: (__libc_start_main()+0xf1) [0x7fabc2af62b1]
>  16: (()+0x34f747) [0x55b2d2cdc747]
> Aborted
> 
> Then wrote a simple script to run import process... just created an OSD 
> per PG.  Basically ran below for each PG.
> mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph osd crush reweight osd.$(cat 
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0
> systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
> ceph-objectstore-tool --op import --pgid 0.1c  --data-path 
> /var/lib/ceph/osd/ceph-$(cat 
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path 
> /var/lib/ceph/osd/ceph-$(cat 
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file 
> ./export/0.1c.export
> chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
> 
> Sometimes the import didn't work.. but stopping the OSD and rerunning 
> ceph-objectstore-tool again seems to help when some PG didn't really 
> want to import.
> 
> Unfound messages are gone!  But I still have down+peering, or 
> down+remapped+peering.
> # ceph health detail
> HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs 
> down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs 
> stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow 
> requests; 2 scrub errors; mds

Re: [ceph-users] Power outages!!! help!

2017-09-15 Thread Ronny Aasen


You write you had all PGs exported except one, so I assume you have 
injected those PGs into the cluster again using the method linked a few 
times in this thread. How did that go? Were you successful in 
recovering those PGs?


kind regards.
Ronny Aasen



On 15. sep. 2017 07:52, hjcho616 wrote:

I just did this and backfilling started.  Let's see where this takes me.
ceph osd lost 0 --yes-i-really-mean-it

Regards,
Hong


On Friday, September 15, 2017 12:44 AM, hjcho616  wrote:


Ronny,

Working with all of the pgs shown in the "ceph health detail", I ran 
below for each PG to export.
ceph-objectstore-tool --op export --pgid 0.1c   --data-path 
/var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal 
--skip-journal-replay --file 0.1c.export


I have all PGs exported, except 1... PG 1.28.  It is on ceph-4.  This 
error doesn't make much sense to me.  Looking at the source code from 
https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that 
message is telling me struct_v is 1... but not sure how it ended up in 
the default in the case statement when 1 case is defined...  I tried 
with --skip-journal-replay, fails with same error message.
ceph-objectstore-tool --op export --pgid 1.28  --data-path 
/var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal 
--file 1.28.export

terminate called after throwing an instance of 'std::domain_error'
   what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
  in thread 7fabc5ecc940 thread_name:ceph-objectstor
  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
  1: (()+0x996a57) [0x55b2d3323a57]
  2: (()+0x110c0) [0x7fabc46d50c0]
  3: (gsignal()+0xcf) [0x7fabc2b08fcf]
  4: (abort()+0x16a) [0x7fabc2b0a3fa]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d]
  6: (()+0x5ebb6) [0x7fabc33edbb6]
  7: (()+0x5ec01) [0x7fabc33edc01]
  8: (()+0x5ee19) [0x7fabc33ede19]
  9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e]
  10: 
(DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) 
[0x55b2d31315f5]

  11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9]
  12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8]
  13: (FileStore::mount()+0x2525) [0x55b2d305ceb5]
  14: (main()+0x28c0) [0x55b2d2c8d400]
  15: (__libc_start_main()+0xf1) [0x7fabc2af62b1]
  16: (()+0x34f747) [0x55b2d2cdc747]
Aborted

Then wrote a simple script to run import process... just created an OSD 
per PG.  Basically ran below for each PG.

mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph osd crush reweight osd.$(cat 
/var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0

systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
ceph-objectstore-tool --op import --pgid 0.1c   --data-path 
/var/lib/ceph/osd/ceph-$(cat 
/var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path 
/var/lib/ceph/osd/ceph-$(cat 
/var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file 
./export/0.1c.export

chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)

Sometimes the import didn't work.. but stopping the OSD and rerunning 
ceph-objectstore-tool again seems to help when some PG didn't really 
want to import.


Unfound messages are gone!   But I still have down+peering, or 
down+remapped+peering.

# ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs 
down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs 
stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow 
requests; 2 scrub errors; mds cluster is degraded; noout flag(s) set; no 
legacy OSD present but 'sortbitwise' flag is not set
pg 1.d is stuck inactive since forever, current state down+peering, last 
acting [11,2]
pg 0.a is stuck inactive since forever, current state 
down+remapped+peering, last acting [11,7]
pg 2.8 is stuck inactive since forever, current state 
down+remapped+peering, last acting [11,7]
pg 2.b is stuck inactive since forever, current state 
down+remapped+peering, last acting [7,11]
pg 1.9 is stuck inactive since forever, current state 
down+remapped+peering, last acting [11,7]
pg 0.e is stuck inactive since forever, current state down+peering, last 
acting [11,2]
pg 1.3d is stuck inactive since forever, current state 
down+remapped+peering, last acting [10,6]
pg 0.2c is stuck inactive since forever, current state down+peering, 
last acting [1,11]
pg 0.0 is stuck inactive since forever, current state 
down+remapped+peering, last acting [10,7]
pg 1.2b is stuck inactive since forever, current state down+peering, 
last acting [1,11]
pg 0.29 is stuck inactive since forever, current state down+peering, 
last acting [11,6]
pg 1.28 is stuck inactive since forever, current sta

Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Florian Haas
On Fri, Sep 15, 2017 at 8:58 AM, Josh Durgin  wrote:
>> OK, maybe the "also" can be removed to reduce potential confusion?
>
>
> Sure

That'd be great. :)

>> - We have a bunch of objects that need to be recovered onto the
>> just-returned OSD(s).
>> - Clients access some of these objects while they are pending recovery.
>> - When that happens, recovery of those objects gets reprioritized.
>> Simplistically speaking, they get to jump the queue.
>>
>> Did I get that right?
>
>
> Yes
>
>> If so, let's zoom out a bit now and look at RBD's most frequent use
>> case, virtualization. While the OSDs were down, the RADOS objects that
>> were created or modified would have come from whatever virtual
>> machines were running at that time. When the OSDs return, there's a
>> very good chance that those same VMs are still running. While they're
>> running, they of course continue to access the same RBDs, and are
>> quite likely to access the same *data* as before on those RBDs — data
>> that now needs to be recovered.
>>
>> So that means that there is likely a solid majority of to-be-recovered
>> RADOS objects that needs to be moved to the front of the queue at some
>> point during the recovery. Which, in the extreme, renders the
>> prioritization useless: if I have, say, 1,000 objects that need to be
>> recovered but 998 have been moved to the "front" of the queue, the
>> queue is rather meaningless.
>
>
> This is more of an issue with write-intensive RGW buckets, since the
> bucket index object is a single bottleneck if it needs recovery, and
> all further writes to a shard of a bucket index will be blocked on that
> bucket index object.

Well, yes, the problem impact may be even worse on rgw, but you do
agree that the problem does exist for RBD too, correct? (The hard
evidence points to that.)

>> Again, on the assumption that this correctly describes what Ceph
>> currently does, do you have suggestions for how to mitigate this? It
>> seems to me that the only actual remedy for this issue in
>> Jewel/Luminous would be to not access objects pending recovery, but as
>> just pointed out, that's a rather unrealistic goal.
>
>
> In luminous you can force the osds to backfill (which does not block
> I/O) instead of using log-based recovery. This requires scanning
> the disk to see which objects are missing, instead of looking at the pg
> log, so it will take longer to recover. This is feasible for all-SSD
> setups, but with pure HDD it may be too much slower, depending on your
> desire to trade off durability for availability.
>
> You can do this by setting:
>
> osd pg log min entries = 1
> osd pg log max entries = 2
>
>>> I'm working on the fix (aka async recovery) for mimic. This won't be
>>> backportable unfortunately.
>>
>>
>> OK — is there any more information on this that is available and
>> current? A quick search turned up a Trello card
>> (https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
>> post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
>> deck
>> (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
>> a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
>> branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
>> but I was hoping for something a little more detailed. Thanks in
>> advance for any additional insight you can share here!
>
>
> There's a description of the idea here:
>
> https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92

Thanks! That raises another question:

"Until now, this recovery process was synchronous - it blocked writes
to an object until it was recovered."

So this affects just writes. Then I'm really not following the
reasoning behind the current behavior. Why would you want to wait for
the recovery of an object that you're about to clobber anyway? Naïvely
thinking an object like that would look like a candidate for
*eviction* from the recovery queue, not promotion to a higher
priority. Is this because the write could be a partial write, whereas
recovery would need to cover the full object?

This is all under the disclaimer that I have no detailed
knowledge of the internals so this is all handwaving, but would a more
logical sequence of events not look roughly like this:

1. Are all replicas of the object available? If so, goto 4.
2. Is the write a full object write? If so, goto 4.
3. Read the local copy of the object, splice in the partial write,
making it a full object write.
4. Evict the object from the recovery queue.
5. Replicate the write.

Forgive the silly use of goto; I'm wary of email clients mangling
indentation if I were to write this as a nested if block. :)

Again, thanks for the continued insight!

Cheers,
Florian


[ceph-users] s3cmd not working with luminous radosgw

2017-09-15 Thread Yoann Moulin
Hello,

I have a fresh Luminous cluster in test, and I made a copy of a bucket (4TB, 1.5M 
files) with rclone. I'm able to list/copy files with rclone, but
s3cmd does not work at all: it is only able to give the bucket list, and I can't 
list files or update ACLs.

Has anyone already tested this?

root@iccluster012:~# rclone --version
rclone v1.37

root@iccluster012:~# s3cmd --version
s3cmd version 2.0.0


### rclone ls files ###

root@iccluster012:~# rclone ls testadmin:image-net/LICENSE
 1589 LICENSE
root@iccluster012:~#

nginx (as reverse proxy) log:

> 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
> HTTP/1.1" 200 0 "-" "rclone/v1.37"
> 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "GET 
> /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.1" 200 779 "-" 
> "rclone/v1.37"

rgw logs :

> 2017-09-15 10:30:02.620266 7ff1f58f7700  1 == starting new request 
> req=0x7ff1f58f11f0 =
> 2017-09-15 10:30:02.622245 7ff1f58f7700  1 == req done req=0x7ff1f58f11f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:30:02.622324 7ff1f58f7700  1 civetweb: 0x56061584b000: 
> 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE HTTP/1.0" 
> 1 0 - rclone/v1.37
> 2017-09-15 10:30:02.623361 7ff1f50f6700  1 == starting new request 
> req=0x7ff1f50f01f0 =
> 2017-09-15 10:30:02.689632 7ff1f50f6700  1 == req done req=0x7ff1f50f01f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:30:02.689719 7ff1f50f6700  1 civetweb: 0x56061585: 
> 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "GET 
> /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.0" 1 0 - rclone/v1.37



### s3cmds ls files ###

root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls 
s3://image-net/LICENSE
root@iccluster012:~#

nginx (as reverse proxy) log:

> 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
> http://test.iccluster.epfl.ch/image-net/?location HTTP/1.1" 200 127 "-" "-"
> 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
> http://image-net.test.iccluster.epfl.ch/?delimiter=%2F&prefix=LICENSE 
> HTTP/1.1" 200 318 "-" "-"

rgw logs :

> 2017-09-15 10:30:04.295355 7ff1f48f5700  1 == starting new request 
> req=0x7ff1f48ef1f0 =
> 2017-09-15 10:30:04.295913 7ff1f48f5700  1 == req done req=0x7ff1f48ef1f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:30:04.295977 7ff1f48f5700  1 civetweb: 0x560615855000: 
> 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET /image-net/?location 
> HTTP/1.0" 1 0 - -
> 2017-09-15 10:30:04.299303 7ff1f40f4700  1 == starting new request 
> req=0x7ff1f40ee1f0 =
> 2017-09-15 10:30:04.300993 7ff1f40f4700  1 == req done req=0x7ff1f40ee1f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:30:04.301070 7ff1f40f4700  1 civetweb: 0x56061585a000: 
> 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET 
> /?delimiter=%2F&prefix=LICENSE HTTP/1.0" 1 0 - 



### s3cmd : list bucket ###

root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls s3://
2017-08-28 12:27  s3://image-net
root@iccluster012:~#

nginx (as reverse proxy) log:

> ==> nginx/access.log <==
> 10.90.37.13 - - [15/Sep/2017:10:36:10 +0200] "GET 
> http://test.iccluster.epfl.ch/ HTTP/1.1" 200 318 "-" "-"

rgw logs :

> 2017-09-15 10:36:10.645354 7ff1f38f3700  1 == starting new request 
> req=0x7ff1f38ed1f0 =
> 2017-09-15 10:36:10.647419 7ff1f38f3700  1 == req done req=0x7ff1f38ed1f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:36:10.647488 7ff1f38f3700  1 civetweb: 0x56061585f000: 
> 127.0.0.1 - - [15/Sep/2017:10:36:10 +0200] "GET / HTTP/1.0" 1 0 - -



### rclone : list bucket ###


root@iccluster012:~# rclone lsd testadmin:
  -1 2017-08-28 12:27:33-1 image-net
root@iccluster012:~#

nginx (as reverse proxy) log:

> ==> nginx/access.log <==
> 10.90.37.13 - - [15/Sep/2017:10:37:53 +0200] "GET / HTTP/1.1" 200 318 "-" 
> "rclone/v1.37"

rgw logs :

> ==> ceph/luminous-rgw-iccluster015.log <==
> 2017-09-15 10:37:53.005424 7ff1f28f1700  1 == starting new request 
> req=0x7ff1f28eb1f0 =
> 2017-09-15 10:37:53.007192 7ff1f28f1700  1 == req done req=0x7ff1f28eb1f0 
> op status=0 http_status=200 ==
> 2017-09-15 10:37:53.007282 7ff1f28f1700  1 civetweb: 0x56061586e000: 
> 127.0.0.1 - - [15/Sep/2017:10:37:53 +0200] "GET / HTTP/1.0" 1 0 - rclone/v1.37
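
One hedged guess from the logs above: s3cmd is sending virtual-host style requests (image-net.test.iccluster.epfl.ch) while rclone uses path style, so it may be worth checking that rgw knows its DNS name and that the s3cmd config matches. Roughly, with names taken from the logs (purely an assumption):

# ceph.conf on the rgw host
#   [client.rgw.<instance>]
#   rgw dns name = test.iccluster.epfl.ch
#
# ~/.s3cfg
#   host_base   = test.iccluster.epfl.ch
#   host_bucket = %(bucket)s.test.iccluster.epfl.ch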


Thanks for your help

-- 
Yoann Moulin
EPFL IC-IT


[ceph-users] 'flags' of PG.

2017-09-15 Thread dE .
Hi,
I was going through the health check documentation, where I found references
to 'PG flags' like degraded, undersized, backfill_toofull or recovery_toofull
etc... I find traces of these flags throughout the documentation, but no one
appears to have a list of the flags, where they will appear (in ceph health
detail?), or what exactly a PG flag is in the first place.
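
For anyone searching later: these show up as PG states rather than cluster-wide flags, and a few places they surface (a rough sketch; behaviour as of Jewel/Luminous):

ceph health detail        # lists affected PGs per state when health is not OK
ceph pg dump pgs_brief    # pg id, current state, up/acting sets
ceph pg <pgid> query      # full per-PG detail, including the "state" field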