Hi Zack,

Thanks for bringing this up again; this is a very useful discussion to
have.

On Thu, Jun 05, 2014 at 12:45:11PM -0400, Zack Weinberg wrote:
> * what page is the target reading?
> * what _sequence of pages_ is the target reading?  (This is actually
> easier, assuming the attacker knows the internal link graph.)

The former should be pretty easy too, due to the ancillary requests that
you already briefly mentioned.

Because of our domain sharding strategy, which places media under a
separate domain (upload.wikimedia.org), an adversary would know for a
given page (1) the size of the encrypted text response, and (2) the
count and sizes of the media responses requested immediately after the
main text response. This combination would create a pretty unique
fingerprint for a lot of pages, especially well-curated pages that have
a fair amount of media embedded in them.
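
To make this concrete, here's a rough sketch (Python, with made-up
numbers; nothing here corresponds to a real page or to our actual
stack) of what such a passive observer could record per page view:

# Reduce one observed page load to the pair an eavesdropper can see:
# the size of the encrypted text response, plus the sizes of the media
# responses fetched from upload.wikimedia.org right afterwards.
def observed_fingerprint(text_size, media_sizes):
    return (text_size, tuple(sorted(media_sizes)))

# A ~48 KB text response followed by four images of distinctive sizes;
# the sorted media-size vector alone often singles a page out.
print(observed_fingerprint(48213, [105332, 23411, 8120, 512998]))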

Combine this with the fact that we provide XML dumps of our content
and images, plus a live feed of changes in real time, and it should be
easy enough for a couple of researchers (let alone state agencies with
unlimited resources) to devise an algorithm that computes these
fingerprints with great accuracy and exposes (at least) reads.
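
To illustrate how little machinery that would actually take, here's a
toy sketch (Python again; the dump-derived index and the
tolerance-based comparison are purely hypothetical, and a real attack
would be considerably more sophisticated):

# Toy matcher: fingerprints are (text_size, sorted tuple of media
# sizes) pairs, precomputed from the public dumps and kept current via
# the live change feed. Real traffic adds noise (TLS framing,
# compression, caching), hence the relative tolerance.
def candidate_pages(observed, dump_fingerprints, tolerance=0.03):
    obs_text, obs_media = observed

    def close(a, b):
        return abs(a - b) <= tolerance * max(a, b, 1)

    matches = []
    for title, (text_size, media_sizes) in dump_fingerprints.items():
        if len(media_sizes) != len(obs_media):
            continue
        if not close(text_size, obs_text):
            continue
        if all(close(a, b) for a, b in zip(sorted(media_sizes),
                                           sorted(obs_media))):
            matches.append(title)
    return matches

# Made-up index and observation; prints ['Example_page_A'].
index = {
    "Example_page_A": (48213, (8120, 23411, 105332, 512998)),
    "Example_page_B": (47950, (9000, 23000)),
}
print(candidate_pages((48500, (8200, 23500, 104000, 510000)), index))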

> What I would like to do, in the short term, is perform a large-scale
> crawl of one or more of the encyclopedias and measure what the above
> eavesdropper would observe.  I would do this over regular HTTPS, from
> a documented IP address, both as a logged-in user and an anonymous
> user.

I doubt you can create enough traffic to make a difference, so with my
operations hat on, yes, go ahead. Note that all of our software,
production stack/configuration management and dumps of our content are
publicly available and free (as in speech) to use, so you or anyone
else could even create a replica environment and do this kind of
analysis without us ever noticing.

> With that data in hand, the next phase would be to develop some sort
> of algorithm for automatically padding HTTP responses to maximize
> eavesdropper confusion while minimizing overhead.  I don't yet know
> exactly how this would work.  I imagine that it would be based on
> clustering the database into sets of pages with similar length but
> radically different contents.

I don't think it'd make sense to involve the database in this at all.
It'd make much more sense to postprocess the content (still within
MediaWiki, most likely) and pad it to fit into buckets of predefined
sizes. You'd have to take care of padding images as well, since the
combination of count and size alone leaks too many bits of information.
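
For illustration only, a minimal sketch of that bucketing idea could
look like this (the bucket sizes and the null-byte padding are entirely
made up; this is not something MediaWiki does today):

import bisect

# Hypothetical predefined response-size buckets, in bytes. Every
# response gets padded up to the smallest bucket that fits it, so an
# observer only learns the bucket, not the exact content length.
BUCKETS = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024,
           128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024]

def pad_to_bucket(body):
    # In practice the padding would need to live somewhere the client
    # ignores (an HTML comment, a dedicated header, etc.), and images
    # would need the same treatment, since the count/size of media
    # requests leaks information on its own.
    idx = bisect.bisect_left(BUCKETS, len(body))
    if idx == len(BUCKETS):
        return body  # larger than the largest bucket; left as-is here
    return body + b"\0" * (BUCKETS[idx] - len(body))

# A 41 KB text response ends up in the 64 KB bucket.
assert len(pad_to_bucket(b"x" * 41 * 1024)) == 64 * 1024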

However, as others have already mentioned, this kind of attack is
partially addressed by the introduction of SPDY / HTTP/2.0, which is on
our roadmap. A full production deployment, including undoing
optimizations such as domain sharding (and SSL+SPDY by default, for
everyone), is still many months away; it does make me wonder, though,
whether it makes much sense to spend time focusing on plain-HTTPS
attacks right now.

Regards,
Faidon
