I think Greg (who appears to be a Ceph committer) basically said he'd be interested in looking at it, if only you still had the pool that failed this way. Why not try to reproduce it, and keep a log of your procedure so he can follow it too? What caused the slow requests... copy-on-write from snapshots? A bad disk? exclusive-lock with two clients writing at the same time, maybe?
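If you can catch it again while requests are actually blocked, a snapshot of the cluster state is probably what he'd want. A rough sketch of what I'd collect (replace osd.N with whichever OSD "ceph health detail" blames; I'm assuming all of these commands exist in your Hammer builds):

    # overall state, and which OSDs the slow requests are blocked on
    ceph -s
    ceph health detail

    # per-OSD commit/apply latency, to help spot a bad disk
    ceph osd perf

    # any stuck PGs, and which OSDs they map to
    ceph pg dump_stuck unclean

    # on the blamed OSD's host: ops currently in flight, plus the
    # slowest recently completed ops (ages, flag points, clients)
    ceph daemon osd.N dump_ops_in_flight
    ceph daemon osd.N dump_historic_ops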
I'd be interested in a solution too... why can't idle disks (a non-full disk queue) mean that the OSD op queue (or whatever queue it is) can still accept requests unrelated to the blocked PG/objects? I would love for Ceph to handle this better. I suspect some issues of my own are related to this: slow requests on one VM can freeze others (likely the OSD's fault), sometimes even requiring a kill -9 (likely client librbd's fault). For one way to inspect the throttle Greg describes, see my note after his reply below.

On 03/22/17 16:18, Alejandro Comisario wrote:
> any thoughts ?
>
> On Tue, Mar 14, 2017 at 10:22 PM, Alejandro Comisario
> <alejan...@nubeliu.com> wrote:
>
>> Greg, thanks for the reply. True that I can't provide enough
>> information to know what happened, since the pool is gone.
>>
>> But based on your experience, can I please take some of your time
>> and ask for the top 5 things that could have happened to that pool
>> (or any pool) to make Ceph (maybe specifically in Hammer) behave
>> like that?
>>
>> Information that I think will be of value: the cluster was 5 nodes
>> large, running "0.94.6-1trusty". I added two nodes running the
>> latest "0.94.9-1trusty", and replication onto those new disks never
>> finished, since I saw WEIRD errors on the new OSDs. I thought the
>> packages needed to be the same, so I "apt-get upgraded" the 5 old
>> nodes without restarting anything, after which rebalancing proceeded
>> without errors (WEIRD).
>>
>> After these two nodes reached 100% of the disks' weight, the cluster
>> worked perfectly for about two weeks, till this happened. After the
>> resolution from my first email, everything has been working
>> perfectly.
>>
>> Thanks for the responses.
>>
>> On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum
>> <gfar...@redhat.com> wrote:
>>
>>> On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario
>>> <alejan...@nubeliu.com> wrote:
>>>
>>>> Gregory, thanks for the response. What you've said is by far the
>>>> most enlightening thing I've heard about Ceph in a long time.
>>>>
>>>> What raises even greater doubt is that this "non-functional" pool
>>>> was only 1.5GB large, vs. 50-150GB for the other affected pools;
>>>> the tiny pool was still being used, and just because that pool was
>>>> blocking requests, the whole cluster was unresponsive.
>>>>
>>>> So, what do you mean by a "non-functional" pool? How can a pool
>>>> become non-functional? And what assures me that tomorrow (just
>>>> because I deleted the 1.5GB pool to fix the whole problem) another
>>>> pool won't become non-functional?
>>>
>>> Well, you said there were a bunch of slow requests. That can happen
>>> any number of ways, if you're overloading the OSDs or something.
>>> When there are slow requests, those ops take up OSD memory and
>>> throttle, and so they don't let in new messages until the old ones
>>> are serviced. This can cascade across a cluster -- because
>>> everything is interconnected, clients and OSDs end up with all
>>> their requests targeted at the slow OSDs, which aren't letting in
>>> new IO quickly enough. It's one of the weaknesses of the standard
>>> deployment patterns, but it usually doesn't come up unless
>>> something else has gone pretty wrong first.
>>> As for what actually went wrong here, you haven't provided near
>>> enough information, and probably can't now that the pool has been
>>> deleted. *shrug*
>>> -Greg
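P.S. on the throttling Greg describes: as far as I understand it, the per-OSD limits involved are the client-message throttles, and you can watch them on a live OSD through the admin socket. A rough sketch, assuming osd.0 and the Hammer-era option and counter names (check "config show" / "perf dump" on your build if these don't match):

    # how many client messages/bytes an OSD will hold before it stops
    # reading new requests off the wire
    ceph daemon osd.0 config get osd_client_message_cap
    ceph daemon osd.0 config get osd_client_message_size_cap

    # live counters; if val sits pinned at max, the OSD is refusing to
    # read new client messages until old ops drain
    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 6 throttle-osd_client

If those counters are pegged while the disks themselves are idle, that would support the idea that unrelated requests are being starved by the shared throttle rather than by the disks.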