On Fri, Nov 27, 2015 at 9:53 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> > Nothing about the cluster has changed recently -- no OS patches, no Ceph
> > patches, no software updates of any kind. For the months we've had the
> > cluster operational, we've had no performance-related issues. In the days
> > leading up to the major performance issue we're now experiencing, the logs
> > did record 100 or so 'slow request' events of >30 seconds on subsequent
> > days. After that, the slow requests became a constant, and now our logs are
> > spammed with entries like the following:
> >
> > 2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 : cluster
> > [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.024165 secs
> > 2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 : cluster
> > [WRN] slow request 60.024165 seconds old, received at 2015-11-28
> > 02:29:07.304113: osd_op(client.214858.0:6990585
> > default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create 0~0
> > [excl],setxattr user.rgw.idtag (22),writefull 0~523488,setxattr
> > user.rgw.manifest (444),setxattr user.rgw.acl (371),setxattr
> > user.rgw.content_type (1),setxattr user.rgw.etag (33)] 48.158d9795
> > ondisk+write+known_if_redirected e15933) currently commit_sent
>
> In this op, it's already sent a commit back to the client -- the only
> thing that can happen after that point is applying the update to the
> local filesystem. The only thing that can block that is if we've got a
> local fs sync in progress, or have too much dirty data waiting to be
> flushed.
>
> It's vaguely conceivable that this warning just happened to be
> generated on an op in a short time between being committed to journal
> and applied, and that you just happened to pick it out...but I see
> you're using ZFS, which is not a common fs under Ceph. I bet you've
> run into something there, but I can't help much since I've never run
> ZFS in any capacity and am not sure how it's used by the OSD. (Is it
> using the snapshot journal mode? If so I'd start there. Otherwise,
> whatever the generic zfs issue discovery and diagnosis stuff is.)
> -Greg
>

I understand that our ZFS setup is atypical, and we've got more digging to
do there, for sure. For what it's worth, those commit_sent messages comprise
about 25% of the warning messages we see (about 600k total warnings in the
past 24 hours), so it wasn't just a lucky grab of log data. Almost 50% of our
warnings are of the following variety:

2015-11-28 04:16:56.271835 osd.142 192.168.10.10:6888/10971 2867 : cluster [WRN] 3 slow requests, 2 included below; oldest blocked for > 31.059483 secs
2015-11-28 04:16:56.271856 osd.142 192.168.10.10:6888/10971 2868 : cluster [WRN] slow request 30.299307 seconds old, received at 2015-11-28 04:16:25.972474: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42479, reqid=client.204927.0:10413916, at_version=15933'47335, trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started
2015-11-28 04:16:56.271862 osd.142 192.168.10.10:6888/10971 2869 : cluster [WRN] slow request 30.295754 seconds old, received at 2015-11-28 04:16:25.976027: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42480, reqid=client.204927.0:10413917, at_version=15933'47336, trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started

I really appreciate the quick response here.

Brian
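
For anyone who wants to produce a similar breakdown of slow-request warnings by operation type and state, here is a rough Python sketch. It is not a Ceph tool; the regular expression is only inferred from the log lines quoted in this thread, and the names `SLOW_REQUEST`, `tally_slow_requests`, and `SAMPLE` are made up for illustration. Other Ceph versions may format these warnings differently.

```python
import re
from collections import Counter

# Pattern inferred from the warnings quoted in this thread (an assumption,
# not an official format). It matches per-request lines such as
#   "... slow request 30.29 seconds old, received at <ts>: <Op>(...) currently <state>"
# and skips the "N slow requests, M included below" summary lines.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, received at .*?: "
    r"(?P<op>\w+)\(.* currently (?P<state>.+?)\s*$"
)


def tally_slow_requests(lines):
    """Count slow-request warnings keyed by (operation type, current state)."""
    counts = Counter()
    for line in lines:
        match = SLOW_REQUEST.search(line)
        if match:
            counts[(match.group("op"), match.group("state"))] += 1
    return counts


# Sample input: the warnings quoted earlier in this thread, one entry per line.
SAMPLE = [
    "2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 : cluster "
    "[WRN] 2 slow requests, 1 included below; oldest blocked for > 60.024165 secs",
    "2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 : cluster "
    "[WRN] slow request 60.024165 seconds old, received at 2015-11-28 "
    "02:29:07.304113: osd_op(client.214858.0:6990585 "
    "default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create 0~0 "
    "[excl],setxattr user.rgw.idtag (22),writefull 0~523488,setxattr "
    "user.rgw.manifest (444),setxattr user.rgw.acl (371),setxattr "
    "user.rgw.content_type (1),setxattr user.rgw.etag (33)] 48.158d9795 "
    "ondisk+write+known_if_redirected e15933) currently commit_sent",
    "2015-11-28 04:16:56.271856 osd.142 192.168.10.10:6888/10971 2868 : cluster "
    "[WRN] slow request 30.299307 seconds old, received at 2015-11-28 "
    "04:16:25.972474: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42479, "
    "reqid=client.204927.0:10413916, at_version=15933'47335, "
    "trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started",
    "2015-11-28 04:16:56.271862 osd.142 192.168.10.10:6888/10971 2869 : cluster "
    "[WRN] slow request 30.295754 seconds old, received at 2015-11-28 "
    "04:16:25.976027: MOSDECSubOpWrite(48.e7as2 15933 ECSubWrite(tid=42480, "
    "reqid=client.204927.0:10413917, at_version=15933'47336, "
    "trim_to=15062'44316, trim_rollback_to=15933'47333)) currently started",
]

if __name__ == "__main__":
    counts = tally_slow_requests(SAMPLE)
    total = sum(counts.values())
    # Print each (op, state) pair with its count and share of all matches.
    for (op, state), n in counts.most_common():
        print(f"{op:20s} {state:15s} {n:6d} {100 * n / total:5.1f}%")
```

To run it against a real log, pass an open file object to `tally_slow_requests` instead of `SAMPLE`; the function only needs an iterable of lines. Where the cluster log lives depends on the installation, so no path is assumed here.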
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com