I think Greg (who appears to be a Ceph committer) basically said he'd be interested in looking at it, if only you still had the pool that failed this way. Why not try to reproduce it, and keep a log of your procedure so he can follow it too? What caused the slow requests... copy-on-write from snapshots? A bad disk? exclusive-lock with two clients writing at the same time, maybe?
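If you can catch it again while requests are actually blocked, a snapshot of the cluster state is probably what he'd want. A rough sketch of what I'd collect (replace osd.N with whichever OSD "ceph health detail" blames; I'm assuming all of these commands exist in your Hammer builds):

    # overall state, and which OSDs the slow requests are blocked on
    ceph -s
    ceph health detail

    # per-OSD commit/apply latency, to help spot a bad disk
    ceph osd perf

    # any stuck PGs, and which OSDs they map to
    ceph pg dump_stuck unclean

    # on the blamed OSD's host: ops currently in flight, plus the
    # slowest recently completed ops (ages, flag points, clients)
    ceph daemon osd.N dump_ops_in_flight
    ceph daemon osd.N dump_historic_ops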
I'd be interested in a solution too... why can't idle disks (a non-full disk queue) mean that the OSD op queue (or whatever queue it is) can still accept requests unrelated to the blocked PG/objects? I would love for Ceph to handle this better. I suspect some issues of my own are related to this: slow requests on one VM can freeze others (likely the OSD's fault), sometimes even requiring a kill -9 (likely client librbd's fault). For one way to inspect the throttle Greg describes, see my note after his reply below.

On 03/22/17 16:18, Alejandro Comisario wrote:
> any thoughts ?
>
> On Tue, Mar 14, 2017 at 10:22 PM, Alejandro Comisario
> <alejan...@nubeliu.com> wrote:
>
>> Greg, thanks for the reply. True that I can't provide enough
>> information to know what happened, since the pool is gone.
>>
>> But based on your experience, can I please take some of your time
>> and ask for the top 5 things that could have happened to that pool
>> (or any pool) to make Ceph (maybe specifically in Hammer) behave
>> like that?
>>
>> Information that I think will be of value: the cluster was 5 nodes
>> large, running "0.94.6-1trusty". I added two nodes running the
>> latest "0.94.9-1trusty", and replication onto those new disks never
>> finished, since I saw WEIRD errors on the new OSDs. I thought the
>> packages needed to be the same, so I "apt-get upgraded" the 5 old
>> nodes without restarting anything, after which rebalancing proceeded
>> without errors (WEIRD).
>>
>> After these two nodes reached 100% of the disks' weight, the cluster
>> worked perfectly for about two weeks, till this happened. After the
>> resolution from my first email, everything has been working
>> perfectly.
>>
>> Thanks for the responses.
>>
>> On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum
>> <gfar...@redhat.com> wrote:
>>
>>> On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario
>>> <alejan...@nubeliu.com> wrote:
>>>
>>>> Gregory, thanks for the response. What you've said is by far the
>>>> most enlightening thing I've heard about Ceph in a long time.
>>>>
>>>> What raises even greater doubt is that this "non-functional" pool
>>>> was only 1.5GB large, vs. 50-150GB for the other affected pools;
>>>> the tiny pool was still being used, and just because that pool was
>>>> blocking requests, the whole cluster was unresponsive.
>>>>
>>>> So, what do you mean by a "non-functional" pool? How can a pool
>>>> become non-functional? And what assures me that tomorrow (just
>>>> because I deleted the 1.5GB pool to fix the whole problem) another
>>>> pool won't become non-functional?
>>>
>>> Well, you said there were a bunch of slow requests. That can happen
>>> any number of ways, if you're overloading the OSDs or something.
>>> When there are slow requests, those ops take up OSD memory and
>>> throttle, and so they don't let in new messages until the old ones
>>> are serviced. This can cascade across a cluster -- because
>>> everything is interconnected, clients and OSDs end up with all
>>> their requests targeted at the slow OSDs, which aren't letting in
>>> new IO quickly enough. It's one of the weaknesses of the standard
>>> deployment patterns, but it usually doesn't come up unless
>>> something else has gone pretty wrong first.
>>> As for what actually went wrong here, you haven't provided near
>>> enough information, and probably can't now that the pool has been
>>> deleted. *shrug*
>>> -Greg
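P.S. on the throttling Greg describes: as far as I understand it, the per-OSD limits involved are the client-message throttles, and you can watch them on a live OSD through the admin socket. A rough sketch, assuming osd.0 and the Hammer-era option and counter names (check "config show" / "perf dump" on your build if these don't match):

    # how many client messages/bytes an OSD will hold before it stops
    # reading new requests off the wire
    ceph daemon osd.0 config get osd_client_message_cap
    ceph daemon osd.0 config get osd_client_message_size_cap

    # live counters; if val sits pinned at max, the OSD is refusing to
    # read new client messages until old ops drain
    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 6 throttle-osd_client

If those counters are pegged while the disks themselves are idle, that would support the idea that unrelated requests are being starved by the shared throttle rather than by the disks.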