Re: [ceph-users] Blocked requests during and after CephFS delete

2013-12-09 Thread Gregory Farnum
[ Re-added the list since I don't have log files. ;) ]

On Mon, Dec 9, 2013 at 5:52 AM, Oliver Schulz  wrote:
> Hi Greg,
>
> I'll send this privately; maybe better not to post log files, etc.
> to the list. :-)
>
>
>> Nobody's reported it before, but I think the CephFS MDS is sending out
>> too many delete requests. [...]
>>
>> That's all speculation on my part though; can you go sample the slow
>> requests and see what their makeup looks like? Do you have logs from
>> the MDS or OSDs during that time period?
>
>
> Uh - how do I sample the requests?

I believe the slow requests should have been logged in the monitor's
central log. That's a file sitting in the mon directory, and it's
probably accessible via other means I can't think of off-hand. Go see
if it describes what the slow OSD requests are (e.g., are they a bunch
of MDS deletes with some other stuff sprinkled in, or all other stuff,
or whatever).
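
Something like this should give a rough breakdown (an untested sketch;
I'm assuming the cluster log sits at /var/log/ceph/ceph.log on a
monitor host, and that the slow-request warnings name the offending op
with its type in brackets, e.g. osd_op(... [delete] ...)):

  # All slow-request warnings from the test window:
  grep 'slow request' /var/log/ceph/ceph.log | less

  # How many of them were object deletes, versus everything else:
  grep 'slow request' /var/log/ceph/ceph.log | grep -c 'delete'
  grep 'slow request' /var/log/ceph/ceph.log | grep -vc 'delete'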

> Concerning logs - you mean the regular ceph daemon log files?
> Sure - I'm attaching a tarball of all daemon logs from the
> relevant time interval (please don't publish them ;-) ).
> It's 13.2 MB; I hope it goes through by email.
>
>
> I also dumped "ceph health" every minute during the test.
>
> * 15:34:34 to 15:48:37 is the effect from my first mass delete. I aborted
>   that one before it could finish, to see if Emperor would do better

By "abort", you mean you stopped before deleting everything you intended to?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Blocked requests during and after CephFS delete

2013-12-08 Thread Gregory Farnum
On Sun, Dec 8, 2013 at 7:16 AM, Oliver Schulz  wrote:
> Hello Ceph-Gurus,
>
> a short while ago I reported some trouble we had with our cluster
> suddenly going into a state of "blocked requests".
>
> We did a few tests, and we can reproduce the problem:
> During / after deleting a substantial chunk of data on
> CephFS (a few TB), ceph health shows blocked requests like
>
> HEALTH_WARN 222 requests are blocked > 32 sec
>
> This goes on for a couple of minutes, during which the cluster is
> pretty much unusable. The number of blocked requests jumps around
> (but seems to go down on average), until finally (after about 15
> minutes in my last test) health is back to OK.
>
> I upgraded the cluster to Ceph Emperor (0.72.1) and repeated the
> test, but the problem persists.
>
> Is this normal - and if not, what might be the reason? Obviously,
> having the cluster go on strike for a while after data deletion
> is a bit of a problem, especially with a mixed application load.
> The VMs running on RBDs aren't too happy about it, for example. ;-)

Nobody's reported it before, but I think the CephFS MDS is sending out
too many delete requests. When you delete something in CephFS, it's
just marked as deleted, and the MDS is supposed to carry out the
actual deletion asynchronously in the background, but I'm not sure if
there are any throttles on how quickly it does so. If you remove
several terabytes worth of data, and the MDS is sending out RADOS
object deletes for each 4MB object as fast as it can, that's a lot of
unthrottled traffic hitting the OSDs.
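
Back-of-the-envelope (assuming the default 4MB object size): a 4TB
delete turns into roughly a million RADOS delete ops:

  # 4 TB of file data striped into 4 MB objects:
  echo $((4 * 1024 * 1024 / 4))   # => 1048576 object deletes
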
That's all speculation on my part though; can you go sample the slow
requests and see what their makeup looks like? Do you have logs from
the MDS or OSDs during that time period?
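
If you can catch the cluster while it's unhappy, the OSD admin socket
will also dump the ops currently in flight (assuming the default
socket path; substitute the right OSD id):

  # Run on an OSD host while requests are blocked; osd.0 is an example.
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight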
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com