On 7/25/14, 4:58 AM, Seger, Mark (Cloud Services) wrote:
I’m trying to track object server GET errors using statsd and I’m not
seeing them.  The test I’m doing is to simply do a GET on a
non-existent object.  As expected, a 404 is returned and the object
server log records it.  However, statsd implies it succeeded because
there were no errors reported.  A read of the admin guide does clearly
say the GET timing includes failed GETs, but my question then becomes
how is one to tell there was a failure?  Should there be another type of
message that DOES report errors?  Or how about including these in the
‘object-server.GET.errors.timing’ message?

What "error" means with respect to Swift's backend-server timing metrics is pretty fuzzy at the moment, and could probably use some work.

The idea is that object-server.GET.timing has timing data for everything that Swift handled successfully, and object-server.GET.errors.timing has timing data for requests where Swift itself failed.

Some things are pretty easy to divide up. For example, a 200-series status code always counts as success, and a 500-series status code always counts as an error.

It gets tricky in the 400-series status codes. For example, a 404 means that a client asked for an object that doesn't exist. That's not Swift's fault, so that goes into the success bucket (object-server.GET.timing). Similarly, a 412 means that a client set an unsatisfiable precondition in the If-Match, If-None-Match, If-Modified-Since, or If-Unmodified-Since headers, and Swift correctly determined that the requested object can't fulfill the precondition, so that one goes in the success bucket too.
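To make the bucketing concrete, here is a minimal sketch of the rule as described so far. It is not Swift's actual code: the function name timing_metric and the CLIENT_FAULT_STATUSES set are hypothetical, the errors metric name follows the form quoted in the question, and treating every other 400-series code as an error is an assumption made only so the example is complete.

```python
# Hypothetical sketch of the success/error bucketing described above.
# Assumption: 404 and 412 are the only 4xx codes counted as success here;
# the ambiguous ones (400, 409, ...) fall into the error bucket.
CLIENT_FAULT_STATUSES = {404, 412}


def timing_metric(method, status):
    """Return the statsd timing metric name for a backend response."""
    if status < 400 or status in CLIENT_FAULT_STATUSES:
        bucket = "timing"          # Swift did its job
    else:
        bucket = "errors.timing"   # counted as Swift's failure
    return "object-server.%s.%s" % (method, bucket)


print(timing_metric("GET", 200))
print(timing_metric("GET", 404))
print(timing_metric("GET", 507))
```

With this rule, a GET on a missing object lands in object-server.GET.timing, which is why the 404 test never shows up as an error.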

However, there are other status codes that are more ambiguous. Consider 409; the object server responds with 409 if the request's X-Timestamp is less than the object's X-Timestamp (on PUT/POST/DELETE). You can get this with two near-simultaneous POSTs:

  1. request A hits proxy; proxy assigns X-Timestamp: 1406316223.851131
  2. request B hits proxy; proxy assigns X-Timestamp: 1406316223.851132
  3. request B hits object server and gets 202
  4. request A hits object server and gets 409
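The four steps above can be replayed with a toy model. This is only a sketch: handle_post and the dict standing in for the stored object are hypothetical, and the comparison is the simple "less than" rule described above rather than the object server's real logic. Decimal is used so the microsecond difference is not lost to float rounding.

```python
from decimal import Decimal


def handle_post(stored, request_timestamp):
    """Toy object-server POST: reject requests older than the stored object."""
    req = Decimal(request_timestamp)
    if req < stored["x_timestamp"]:
        return 409                  # request is older than what is on disk
    stored["x_timestamp"] = req     # accept and record the newer timestamp
    return 202


# Hypothetical object last written some time before either request.
obj = {"x_timestamp": Decimal("1406316000.000000")}

# Request B was stamped later by the proxy but reaches the object server first.
status_b = handle_post(obj, "1406316223.851132")
status_a = handle_post(obj, "1406316223.851131")
print(status_b, status_a)  # 202 409
```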

Does that error count as Swift's fault? If the client requests were nearly simultaneous, then I think not; there's always going to be *some* delay between accept() and gettimeofday(). On the other hand, if one proxy server's time is significantly behind another's, then it is Swift's fault.

It's even worse with 400; sometimes it's for bad paths (like asking an object server for /<partition>/<account>/<container>; this can happen if the administrator misconfigures their rings), and sometimes it's for bad X-Delete-At / X-Delete-After values (which are set by the client).

I'm not sure what the best way to fix this is, but if you just want to see some error metrics, unmount a disk to get some 507s.
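If it helps while testing, error timings can be picked out of the raw statsd traffic with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, the sample lines are made up for illustration, and the errors metric name follows the form quoted in the question. The only thing relied on is the usual statsd wire format, name:value|type.

```python
def parse_statsd_line(line):
    """Split a statsd line such as 'a.b.timing:3.2|ms' into its parts."""
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")        # value, type, optional sample rate
    return name, float(fields[0]), fields[1]


def is_error_timing(name):
    # endswith() still matches if a metric prefix is configured.
    return name.endswith(".errors.timing")


# Made-up sample lines, for illustration only.
samples = [
    "object-server.GET.timing:3.2|ms",
    "object-server.GET.errors.timing:12.5|ms",
]
errors = [parse_statsd_line(s) for s in samples
          if is_error_timing(parse_statsd_line(s)[0])]
print(errors)
```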
