On Fri, Nov 27, 2015 at 9:53 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> > Nothing about the cluster has changed recently -- no OS patches, no Ceph
> > patches, no software updates of any kind.  For the months we've had the
> > cluster operational, we've had no performance-related issues.  In the days
> > leading up to the major performance issue we're now experiencing, the logs
> > did record 100 or so 'slow request' events of >30 seconds on subsequent
> > days.  After that, the slow requests became a constant, and now our logs
> > are spammed with entries like the following:
> >
> > 2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 :
> > cluster [WRN] 2 slow requests, 1 included below; oldest blocked
> > for > 60.024165 secs
> > 2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 :
> > cluster [WRN] slow request 60.024165 seconds old, received at 2015-11-28
> > 02:29:07.304113: osd_op(client.214858.0:6990585
> > default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create 0~0
> > [excl],setxattr user.rgw.idtag (22),writefull 0~523488,setxattr
> > user.rgw.manifest (444),setxattr user.rgw.acl (371),setxattr
> > user.rgw.content_type (1),setxattr user.rgw.etag (33)] 48.158d9795
> > ondisk+write+known_if_redirected e15933) currently commit_sent
>
> In this op, it's already sent a commit back to the client — the only
> thing that can happen after that point is applying the update to the
> local filesystem. The only thing that can block that is if we've got a
> local fs sync in progress, or have too much dirty data waiting to be
> flushed.
>
> It's vaguely conceivable that this warning just happened to be
> generated on an op in a short time between being committed to journal
> and applied, and that you just happened to pick it out...but I see
> you're using ZFS, which is not a common fs under Ceph. I bet you've
> run into something there, but I can't help much since I've never run
> ZFS in any capacity and am not sure how it's used by the OSD. (Is it
> using the snapshot journal mode? If so I'd start there. Otherwise,
> whatever the generic zfs issue discovery and diagnosis stuff is.)
> -Greg
>

I understand that our ZFS setup is atypical, and we've got more digging to
do there, for sure.  For what it's worth, those commit_sent messages make up
about 25% of the warning messages we see (about 600k total warnings in the
past 24 hours), so it wasn't just a lucky grab of log data; a rough sketch
of the tally is below the next excerpt.  Almost 50% of our warnings are of
the following variety:

2015-11-28 04:16:56.271835 osd.142 192.168.10.10:6888/10971 2867 : cluster
[WRN] 3 slow requests, 2 included below; oldest blocked for > 31.059483 secs
2015-11-28 04:16:56.271856 osd.142 192.168.10.10:6888/10971 2868 : cluster
[WRN] slow request 30.299307 seconds old, received at 2015-11-28
04:16:25.972474: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42479,
reqid=client.204927.0:10413916, at_version=15933'47335,
trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started
2015-11-28 04:16:56.271862 osd.142 192.168.10.10:6888/10971 2869 : cluster
[WRN] slow request 30.295754 seconds old, received at 2015-11-28
04:16:25.976027: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42480,
reqid=client.204927.0:10413917, at_version=15933'47336,
trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started
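
In case it's useful to reproduce that breakdown, here is a minimal sketch of
the kind of tally behind those percentages.  The log path and the regex that
picks apart the warning lines are assumptions to adapt, not exactly what we
ran:

#!/usr/bin/env python
# Tally slow-request warnings in the cluster log by op type and by the
# "currently <state>" flag point they report.  The log location and the
# line format matched below are assumptions based on the excerpts in this
# thread; adjust both for your own cluster.
import re
from collections import Counter

LOG = "/var/log/ceph/ceph.log"   # assumed default location
PATTERN = re.compile(
    r"\[WRN\] slow request .* old, received at .*?: "
    r"(?P<op>\w+)\(.*\)\s+currently\s+(?P<state>\S+)")

tally = Counter()
total = 0
with open(LOG) as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            total += 1
            tally[(m.group("op"), m.group("state"))] += 1

for (op, state), count in tally.most_common():
    print("%8d  %5.1f%%  %s / currently %s"
          % (count, 100.0 * count / total, op, state))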

I really appreciate the quick response here.
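
On the commit-to-apply gap, the next bit of digging on our side will probably
be to sample the historic ops off the admin socket of a few affected OSDs and
see how long they sit at each flag point.  A rough sketch of what I have in
mind (the OSD ids are just examples from the warnings above, it has to run on
the host that owns each OSD since the admin socket is local, and the JSON key
names differ between Ceph releases, so treat this as an outline rather than
something we actually ran):

#!/usr/bin/env python
# Pull recent slow ops off the OSD admin socket and print their per-op
# timing data, to see where time is spent between commit and apply.
# OSD ids are examples; the exact JSON layout of dump_historic_ops
# ("Ops" vs "ops", "type_data", etc.) varies by release, so check the
# raw output on your version first.
import json
import subprocess

OSDS = [116, 142]   # example ids taken from the warnings above

for osd in OSDS:
    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd, "dump_historic_ops"])
    data = json.loads(raw.decode("utf-8"))
    ops = data.get("Ops") or data.get("ops") or []   # key name varies
    for op in ops:
        print("osd.%d  duration=%s  %s"
              % (osd, op.get("duration", "?"), op.get("description", "?")))
        # The remaining fields normally include the flag point the op is
        # sitting at and a list of timestamped events; dumping them raw
        # is enough to eyeball where the time goes.
        for item in op.get("type_data", []):
            print("    %s" % item)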

Brian