At the risk of opening this email with a pun: we've invested a bunch of time on
both desktop[0] and Android[1] addressing clock skew problems.
(And in server-side tests, too: [2].)
Auth, token, and storage requests are all Hawk-authenticated. The Hawk
authentication process bakes in a timestamp. That timestamp necessarily comes
from the client clock. If the client clock is too far off the server clock (and
remember, there are three different servers in our architecture), the request
will be rejected because the timestamp in the header falls outside the
allowed window.[3]
The solution we've used for this is skew adjustment. We maintain a skew value
for each server, and apply that offset to the timestamps of future requests.
This is part and parcel of Hawk:
--- [4]
Hawk uses an interesting mechanism to ensure the clock skews are within the
reasonable limits. When the server must fail a request on account of stale
timestamp (MAC computed matches with the one in the request but timestamp is
outside of the allowable skew), the server sends the timestamp (ts) as per the
server clock along with a MAC (tsm) computed using the client credentials, in
the WWW-Authenticate response header like so.
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Hawk ts="1353832234",
        tsm="6G8r5JiE+NLoym+WwjeHzjDNCUtLNIxmo1vpMofpLAE="
---
--- [5]
Using a timestamp requires the client's clock to be in sync with the server's
clock. Hawk requires both the client clock and the server clock to use NTP to
ensure synchronization. However, given the limitations of some client types
(e.g. browsers) to deploy NTP, the server provides the client with its current
time (in seconds precision) in response to a bad timestamp.
There is no expectation that the client will adjust its system clock to match
the server (in fact, this would be a potential attack vector). Instead, the
client only uses the server's time to calculate an offset used only for
communications with that particular server. The protocol rewards clients with
synchronized clocks by reducing the number of round trips required to
authenticate the first request.
---
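For anyone who hasn't looked at the client code, here is a rough sketch of the
per-server skew adjustment described above. It is not our actual desktop or
Android implementation: the SkewTracker class, its method names, and the
injectable clock are all hypothetical, and it skips verifying tsm, which a real
client would presumably want to do before trusting ts.

```python
import re
import time

# Matches the ts="..." attribute of a Hawk WWW-Authenticate header.
# The \b keeps it from matching inside some longer attribute name.
TS_PATTERN = re.compile(r'\bts="(\d+)"')


class SkewTracker:
    """Hypothetical helper: keeps one clock offset (in seconds) per server."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so the sketch can be tested
        self._skew = {}       # host -> seconds to add to the local clock

    def timestamp_for(self, host):
        """Timestamp to put in the Hawk header of a request to this host."""
        return int(self._clock() + self._skew.get(host, 0))

    def update_from_401(self, host, www_authenticate):
        """Record a new skew from a 401 response; True if a ts was found.

        NOTE: tsm verification is deliberately omitted in this sketch.
        """
        match = TS_PATTERN.search(www_authenticate)
        if match is None:
            return False
        server_ts = int(match.group(1))
        self._skew[host] = server_ts - self._clock()
        return True
```

Note that the skew is keyed by host, so a correction learned from the token
server does nothing for the first storage request, and that nothing here
accounts for network latency between the server stamping ts and the client
reading it.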
Correct and efficient usage of Hawk is predicated on clients with correct
clocks, which seems like an insane assumption to make: at least 3.5% of Android
devices have clocks that are incorrect by more than *1 hour*, let alone 1
minute.[6]
Network-set Android clocks are also routinely wrong by 15s, which is 25% of the
protocol's margin of error.
Failures due to clocks seem incredibly widespread amongst the small set of
Mozillians who've given FxA Sync a try. That is disheartening, but not
surprising.
We're *requiring* clients to fail frequently in the course of normal operation:
on the first request (no known skew yet); on subsequent requests if the clock
is adjusted since the skew was computed; on subsequent syncs if your network
changes and your latency shifts (because our skew computation doesn't try to
model the network); when the server clock is automatically corrected; etc.
This whole process is fragile, provides a bad user experience (your first sync
is almost guaranteed to fail), and on an implementation level it is apparently
hard to get right (as the existence of [3], after we've landed our skew
handling, demonstrates).
We know we still have low-level work to do: maybe persisting skew values across
restarts, doing better at modeling the environment to correct skews, retrying
in more places to allow for skew-driven failures.
But this seems like a bad choice of investment. Correcting for skew seems to
defeat some of the purpose of this timestamp validation: if you can intercept a
request from a client whose clock is wrong in the right direction, you can save
that token and use it later when the timestamp becomes valid, no? And
categorizing a large chunk of requests as routinely erroneous, forcing them
into error handling states, seems like a bad idea.
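To make the delayed-replay worry concrete, here is a toy model of the server's
timestamp check. The 60-second window is the 1-minute figure discussed in this
email; the function name is made up, and the sketch ignores nonce tracking and
everything else a real Hawk server does, so treat it as an illustration of the
concern rather than a demonstrated attack.

```python
WINDOW_SECONDS = 60  # assumed: the 1-minute validity window


def timestamp_accepted(request_ts, server_now, window=WINDOW_SECONDS):
    """Toy check: accept only timestamps within `window` of server time."""
    return abs(server_now - request_ts) <= window


# A client whose clock runs two hours fast signs a request.
server_now = 1_000_000
captured_ts = server_now + 2 * 60 * 60

# Right now the server rejects it as out of range...
rejected_now = not timestamp_accepted(captured_ts, server_now)

# ...but the same captured header passes the timestamp check two hours later.
accepted_later = timestamp_accepted(captured_ts, server_now + 2 * 60 * 60)
```

In this model both rejected_now and accepted_later come out True: the window
only constrains when a captured request is usable, not whether it is.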
What can we do to mitigate this problem? Ideas, many of which will no doubt
violate the promises that Hawk makes:
* Widen the validity window from 1 minute to 1 hour. Or six hours. Or three
days.
* Do something non-conformant, like having clients pass their clock to the
server, eliminating the requirement for clients to manage skew.
* Eliminate Hawk entirely, at least for the storage servers, switching the
output of the token server to be some kind of short-lived bearer token.
* ???
More input, please!
-R
[0] https://bugzilla.mozilla.org/show_bug.cgi?id=957863
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=962668,
https://bugzilla.mozilla.org/show_bug.cgi?id=929066
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=971059#c16
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=971059
[4]
http://lbadri.wordpress.com/2013/09/01/Hawk-authentication-for-asp-net-web-api-using-thinktecture-identitymodel-45-replay-protection/
[5] https://www.npmjs.org/package/Hawk
[6] http://opensignal.com/reports/timestamps/