Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-06 Thread Faidon Liambotis
Hi Zack,

Thanks for bringing this up again, this is a very useful discussion to
have.

On Thu, Jun 05, 2014 at 12:45:11PM -0400, Zack Weinberg wrote:
> * what page is the target reading?
> * what _sequence of pages_ is the target reading?  (This is actually
> easier, assuming the attacker knows the internal link graph.)

The former should be pretty easy too, due to the ancillary requests that
you already briefly mentioned.

Because of our domain-sharding strategy, which places media under a
separate domain (upload.wikimedia.org), an adversary would know, for a
given page, (1) the size of the encrypted text response and (2) the
count and sizes of the media responses requested immediately after it.
This combination would create a pretty unique fingerprint for a lot of
pages, especially well-curated pages that have a fair amount of media
embedded in them.

Combine this with the fact that we provide XML dumps of our content
and images, plus a live feed of changes in real time, and it should be
easy enough for a couple of researchers (let alone state agencies with
unlimited resources) to devise an algorithm that computes these
fingerprints with great accuracy and exposes (at least) reads.
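
A rough sketch of that matching step, purely illustrative (Python; the
corpus format, the 512-byte tolerance bucket and the function names are
all made up, and a real attack would also need to model TLS and HTTP
header overhead):

    from collections import defaultdict

    # Hypothetical corpus: {title: (html_size, [image_size, ...])},
    # built offline from the public XML dumps and image dumps.
    def build_fingerprints(pages, bucket=512):
        """Index every page by a coarse fingerprint so that lookups
        tolerate small variations (headers, TLS records, skins)."""
        index = defaultdict(list)
        for title, (html_size, image_sizes) in pages.items():
            key = (html_size // bucket,
                   len(image_sizes),
                   tuple(sorted(s // bucket for s in image_sizes)))
            index[key].append(title)
        return index

    def match_page_load(index, html_size, image_sizes, bucket=512):
        """Candidate pages consistent with one observed page load: the
        size of the text response plus the sizes of the media responses
        fetched right afterwards from upload.wikimedia.org."""
        key = (html_size // bucket,
               len(image_sizes),
               tuple(sorted(s // bucket for s in image_sizes)))
        return index.get(key, [])

Even with a bucket that coarse, a page with several images would often
map to a candidate list of one.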

> What I would like to do, in the short term, is perform a large-scale
> crawl of one or more of the encyclopedias and measure what the above
> eavesdropper would observe.  I would do this over regular HTTPS, from
> a documented IP address, both as a logged-in user and an anonymous
> user.

I doubt you can create enough traffic to make a difference, so yes,
with my operations hat on, sure, you can go ahead. Note that all of our
software, production stack/config management and dumps of our content
are publicly available and free (as in speech) to use, so you or anyone
else could even create a replica environment and do this kind of
analysis without us ever noticing.

> With that data in hand, the next phase would be to develop some sort
> of algorithm for automatically padding HTTP responses to maximize
> eavesdropper confusion while minimizing overhead.  I don't yet know
> exactly how this would work.  I imagine that it would be based on
> clustering the database into sets of pages with similar length but
> radically different contents.

I don't think it'd make sense to involve the database in this at all.
It'd make much more sense to postprocess the content (still within
MediaWiki, most likely) and pad it to fit into buckets of predefined
sizes. You'd also have to take care of padding images, as the
combination of count and size alone leaks too many bits of information.
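
A minimal sketch of the bucket idea (Python rather than PHP, with
made-up bucket boundaries; gzip and HTTP header sizes are deliberately
ignored here):

    BUCKETS_KIB = [8, 16, 32, 64, 128, 256, 512, 1024]  # made up

    PAD_OPEN, PAD_CLOSE = b"<!-- pad:", b" -->"

    def pad_to_bucket(body: bytes) -> bytes:
        """Round an (uncompressed) response body up to the next bucket
        boundary by appending an HTML comment full of filler."""
        overhead = len(PAD_OPEN) + len(PAD_CLOSE)
        for kib in BUCKETS_KIB:
            target = kib * 1024
            if len(body) + overhead <= target:
                filler = b"x" * (target - len(body) - overhead)
                return body + PAD_OPEN + filler + PAD_CLOSE
        return body  # larger than the biggest bucket: pass through

Something equivalent would have to happen for thumbnails, too, so that
the padded media sizes collapse onto the same small set of values.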

However, as others have already mentioned, this kind of attack is
partially addressed by the introduction of SPDY / HTTP/2.0, which is on
our roadmap. A full production deployment, including undoing
optimizations such as domain sharding (and SSL+SPDY by default, for
everyone), is many months away; that does make me wonder whether it
makes much sense to spend time focusing on plain-HTTPS attacks right
now.

Regards,
Faidon


Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Tyler Romeo
On Thu, Jun 5, 2014 at 4:50 PM, David Gerard  wrote:

> Or, indeed, MediaWiki tarball version itself.


MediaWiki is a web application. As amazing as it would be for Wikipedia to
be secure against traffic analysis, we are not going to introduce
presentation-layer logic into an application-layer product.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science

Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread David Gerard
On 5 June 2014 17:45, Zack Weinberg  wrote:

> I'd like to restart the conversation about hardening Wikipedia (or
> possibly Wikimedia in general) against traffic analysis.


Or, indeed, MediaWiki tarball version itself.



- d.


Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Gabriel Wicke
On 06/05/2014 11:53 AM, Nick White wrote:
> As was mentioned, external resources like variously sized images
> would probably be the trickiest thing to figure out good ways around.
> IIRC SPDY has some mechanism for delivering multiple resources in the
> same stream, which we might be able to take advantage of to help here
> (it's been ages since I read about it, though).


When using SPDY, browsers multiplex and interleave all requests over a
single TCP connection, whereas with plain HTTPS they typically open
around six parallel connections. HTTP pipelining also tends to be
disabled in desktop browsers, which makes it relatively easy to figure
out the size of individual requests and responses, and thus potentially
the page viewed or edited.

With Apple finally adding SPDY support in the latest Safari release
(after IE ;), support should soon grow beyond the 67% of global
requests currently claimed [1], which is good for performance, security
and architectural simplicity.

Ops has a draft goal of experimental SPDY support in Q2, so it seems that
it's going to happen soon. Also see bug 33890 [2].

Gabriel

[1]: http://caniuse.com/spdy
[2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=33890


Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread C. Scott Ananian
Introducing my own working theory here; ignore if you wish.

I'd think that the *first* thing that would have to happen is that the
page and the images it contains are delivered in one stream.  There
are both HTML5 (resource bundling) and protocol (SPDY) mechanisms for doing
this.  Some URL rearrangement to consolidate hosts might be required as
well. But in the interest of being forward-looking I'd suggest taking some
implementation of resource bundling as a given.  Similarly, I wouldn't
waste time profiling both compressed and uncompressed data; assume the
bundle is compressed.

That leaves a simpler question: if the contents of a page (including
images) are delivered as a compressed bundle over a single host connection,
how easy is it to tell what page I'm looking at from the size of the
response?  How much padding would need to be added to bundles to foil the
analysis?
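
One way to put numbers on that, assuming we can get a compressed bundle
size for every page (from a crawl, or computed from the dumps): pick
candidate bucket boundaries, then measure both the padding overhead and
the smallest anonymity set, i.e. how many pages share each padded size.
A sketch, with a made-up bucket-generation rule:

    import bisect
    from collections import Counter

    def evaluate_buckets(bundle_sizes, buckets):
        """bundle_sizes: compressed bundle sizes in bytes, one per page.
        buckets: sorted candidate padded sizes.  Returns (mean relative
        overhead, size of the smallest anonymity set)."""
        padded = []
        for size in bundle_sizes:
            i = bisect.bisect_left(buckets, size)
            # pages bigger than the largest bucket go out unpadded
            padded.append(buckets[i] if i < len(buckets) else size)
        overhead = (sum(p - s for p, s in zip(padded, bundle_sizes))
                    / sum(bundle_sizes))
        anonymity = min(Counter(padded).values())
        return overhead, anonymity

    def geometric_buckets(max_size, factor=1.25, start=4096):
        """Candidate boundaries; the factor trades bandwidth overhead
        against how many pages collapse onto each padded size."""
        out, b = [], float(start)
        while b < max_size * factor:
            out.append(int(b))
            b *= factor
        return out

Run over real crawl data, something like evaluate_buckets(sizes,
geometric_buckets(max(sizes))) would turn the "how much padding"
question into a curve of overhead versus anonymity rather than a guess.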

Some extensions:
1) Assume that it's not just a single page at a time, but a sequence of N
linked pages.  How does the required padding depend on N?
2) Can correlations be made? For example, once I've visited page A which
contained image X, visiting page B which also has image X will not require
X to be re-downloaded.  So can I distinguish certain pairs of "long
request/short request"?
3) What if the attacker is not interested in what an arbitrary user is
reading, but in enumerating the users looking at one *specific* page (or
set of pages)?  That is, instead of looking at averages across all
pages, look at the "unusual" outliers.
4) Active targeted attack: assume that the attacker can modify pages or
images referenced by pages.  This makes it easier to tell when specific
pages are accessed.  How can this be prevented/discouraged?
5) As you mentioned, disclosing user identity from read-only traffic
patterns is also a concern.

I'm afraid the answers will be that traffic analysis is still very easy.
But it would still be useful to know *how* easy.
 --scott

Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Nick White
Hi Zack,

On Thu, Jun 05, 2014 at 12:45:11PM -0400, Zack Weinberg wrote:
> I'd like to restart the conversation about hardening Wikipedia (or
> possibly Wikimedia in general) against traffic analysis.  I brought
> this up ... last November, I think, give or take a month?  but it got
> lost in a larger discussion about HTTPS.

This sounds like a great idea to me; thanks for thinking about it and
sharing it. The privacy of people's reading habits is critical, and the
more we can do to ensure it the better.

> With that data in hand, the next phase would be to develop some sort
> of algorithm for automatically padding HTTP responses to maximize
> eavesdropper confusion while minimizing overhead.  I don't yet know
> exactly how this would work.  I imagine that it would be based on
> clustering the database into sets of pages with similar length but
> radically different contents.  The output of this would be some
> combination of changes to MediaWiki core (for instance, to ensure that
> the overall length of the HTTP response headers does not change when
> one logs in) and an extension module that actually performs the bulk
> of the padding.  I am not at all a PHP developer, so I would need help
> from someone who is with this part.

I'm not a big PHP developer, but given the right project I can be
enticed into doing some, and I'd be very happy to help out with this.
Ensuring any changes don't add complexity would be very important, but
that should be doable.

As was mentioned, external resources like variously sized images
would probably be the trickiest thing to figure out good ways around.
IIRC SPDY has some mechanism for delivering multiple resources in the
same stream, which we might be able to take advantage of to help here
(it's been ages since I read about it, though).

Nick


Re: [Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Chris Steipp
On Thu, Jun 5, 2014 at 9:45 AM, Zack Weinberg  wrote:

> I'd like to restart the conversation about hardening Wikipedia (or
> possibly Wikimedia in general) against traffic analysis.  I brought
> this up ... last November, I think, give or take a month?  but it got
> lost in a larger discussion about HTTPS.
>

Thanks Zack, I think this is research that needs to happen, but the WMF
doesn't have the resources to do it itself right now. I'm very
interested in seeing the results you come up with.


>
> For background, the type of attack that it would be nice to be able to
> prevent is described in this paper:
>
> http://sysseclab.informatics.indiana.edu/projects/sidebuster/sidebuster-final.pdf
>  Someone is eavesdropping on an encrypted connection to
> LANG.wikipedia.org.  (It's not possible to prevent the attacker from
> learning the DNS name and therefore the language the target reads,
> short of Tor or similar.  It's also not possible to prevent them from
> noticing accesses to ancillary servers, e.g. Commons for media.)  The
> attacker's goal is to figure out things like
>
> * what page is the target reading?
> * what _sequence of pages_ is the target reading?  (This is actually
> easier, assuming the attacker knows the internal link graph.)
> * is the target a logged-in user, and if so, which user?
> * did the target just edit a page, and if so, which page?
> * (... y'all are probably better at thinking up these hypotheticals than
> me ...)
>

Anything in the logs: account creation is probably an easy target.


>
> Wikipedia is different from a tax-preparation website (the case study
> in the above paper) in that all of the content is public, and edit
> actions are also public.  The attacker can therefore correlate their
> eavesdropping data with observations of Special:RecentChanges and the
> like.  This may mean it is impossible to prevent the attacker from
> detecting edits.  I think it's worth the experiment, though.
>
> What I would like to do, in the short term, is perform a large-scale
> crawl of one or more of the encyclopedias and measure what the above
> eavesdropper would observe.  I would do this over regular HTTPS, from
> a documented IP address, both as a logged-in user and an anonymous
> user.  This would capture only the reading experience; I would also
> like to work with prolific editors to take measurements of the traffic
> patterns generated by that activity.  (Bot edits go via the API, as I
> understand it, and so are not reflective of "naturalistic" editing by
> human users.)
>

Make sure to respect typical bot rate limits. Anonymous crawling should
be fine, although logged-in crawling could cause issues. But if you're
doing this from a single machine, I don't think there's too much harm
you can do. Thanks for warning us in advance!

Also, mobile looks very different from desktop. May be worth analyzing it
as well.


>
> With that data in hand, the next phase would be to develop some sort
> of algorithm for automatically padding HTTP responses to maximize
> eavesdropper confusion while minimizing overhead.  I don't yet know
> exactly how this would work.  I imagine that it would be based on
> clustering the database into sets of pages with similar length but
> radically different contents.  The output of this would be some
> combination of changes to MediaWiki core (for instance, to ensure that
> the overall length of the HTTP response headers does not change when
> one logs in) and an extension module that actually performs the bulk
> of the padding.  I am not at all a PHP developer, so I would need help
> from someone who is with this part.
>

Padding the page on output would be a pretty simple extension, although
ensuring the page is a specific size after the web server gzips it would
be more difficult to do efficiently. However, IIRC the most obvious
fingerprinting technique was just looking at the number and sizes of
images loaded from Commons. Making sure those come out at consistent
sizes is likely going to be hard.
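
To illustrate why the post-gzip size is the hard part: constant filler
just compresses away, so the padding has to be incompressible and tuned
iteratively against the compressed length. A rough standalone sketch
(Python, not a MediaWiki extension; bucket sizes made up) that only gets
near the boundary rather than hitting it exactly:

    import base64
    import gzip
    import os

    BUCKETS = [16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024]  # made up

    def gz_len(html: str) -> int:
        return len(gzip.compress(html.encode("utf-8")))

    def pad_compressed_to_bucket(html: str) -> str:
        """Grow the page with random (hence incompressible) base64
        filler inside an HTML comment until its *gzipped* size reaches
        the next bucket.  This overshoots the boundary by a small,
        variable amount; hitting an exact size would need a final
        adjustment pass, which is where the efficiency problem lives."""
        targets = [b for b in BUCKETS if gz_len(html) <= b]
        if not targets:
            return html  # larger than the biggest bucket
        target = targets[0]
        padded = html
        while gz_len(padded) < target:
            deficit = target - gz_len(padded)
            filler = base64.b64encode(
                os.urandom(max(12, deficit // 2))).decode()
            padded += "<!-- pad:" + filler + " -->"
        return padded

And as you say, doing the same for the image responses, so that counts
and sizes stop leaking, is the harder half of the problem.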


>
> What do you think?  I know some of this is vague and handwavey but I
> hope it is at least a place to start a discussion.
>

One more thing to take into account is that the WMF is likely going to
switch to SPDY, which will completely change the characteristics of the
traffic. So developing a solid process that you can repeat next year
would be time well spent.


>
> zw
>

[Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Zack Weinberg
I'd like to restart the conversation about hardening Wikipedia (or
possibly Wikimedia in general) against traffic analysis.  I brought
this up ... last November, I think, give or take a month?  but it got
lost in a larger discussion about HTTPS.

For background, the type of attack that it would be nice to be able to
prevent is described in this paper:
http://sysseclab.informatics.indiana.edu/projects/sidebuster/sidebuster-final.pdf
 Someone is eavesdropping on an encrypted connection to
LANG.wikipedia.org.  (It's not possible to prevent the attacker from
learning the DNS name and therefore the language the target reads,
short of Tor or similar.  It's also not possible to prevent them from
noticing accesses to ancillary servers, e.g. Commons for media.)  The
attacker's goal is to figure out things like

* what page is the target reading?
* what _sequence of pages_ is the target reading?  (This is actually
easier, assuming the attacker knows the internal link graph.)
* is the target a logged-in user, and if so, which user?
* did the target just edit a page, and if so, which page?
* (... y'all are probably better at thinking up these hypotheticals than me ...)

Wikipedia is different from a tax-preparation website (the case study
in the above paper) in that all of the content is public, and edit
actions are also public.  The attacker can therefore correlate their
eavesdropping data with observations of Special:RecentChanges and the
like.  This may mean it is impossible to prevent the attacker from
detecting edits.  I think it's worth the experiment, though.

What I would like to do, in the short term, is perform a large-scale
crawl of one or more of the encyclopedias and measure what the above
eavesdropper would observe.  I would do this over regular HTTPS, from
a documented IP address, both as a logged-in user and an anonymous
user.  This would capture only the reading experience; I would also
like to work with prolific editors to take measurements of the traffic
patterns generated by that activity.  (Bot edits go via the API, as I
understand it, and so are not reflective of "naturalistic" editing by
human users.)

With that data in hand, the next phase would be to develop some sort
of algorithm for automatically padding HTTP responses to maximize
eavesdropper confusion while minimizing overhead.  I don't yet know
exactly how this would work.  I imagine that it would be based on
clustering the database into sets of pages with similar length but
radically different contents.  The output of this would be some
combination of changes to MediaWiki core (for instance, to ensure that
the overall length of the HTTP response headers does not change when
one logs in) and an extension module that actually performs the bulk
of the padding.  I am not at all a PHP developer, so I would need help
from someone who is with this part.
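
To make the clustering idea slightly more concrete, here is a very
rough sketch (Python, not PHP; page lengths would come from the dumps,
and k is a tunable anonymity parameter, both invented for the example):

    def cluster_by_length(page_lengths, k=50):
        """page_lengths: {title: length in bytes}.  Group pages into
        clusters of at least k members with similar lengths; every
        member would be padded to its cluster's maximum, so an
        eavesdropper seeing that size learns only "one of these
        k-or-more pages".  Returns {title: padded target size}."""
        ordered = sorted(page_lengths.items(), key=lambda kv: kv[1])
        clusters = [ordered[i:i + k] for i in range(0, len(ordered), k)]
        if len(clusters) > 1 and len(clusters[-1]) < k:
            clusters[-2].extend(clusters.pop())  # fold in a short tail
        return {title: max(length for _, length in cluster)
                for cluster in clusters
                for title, _ in cluster}

Length alone is the easy half; making sure each cluster also contains
radically different pages, and stays stable as pages are edited, would
be the real work (and the part where I'd need that help).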

What do you think?  I know some of this is vague and handwavey but I
hope it is at least a place to start a discussion.

zw
