Re: [Wikitech-l] Wikimedia's anti-surveillance plans: traffic analysis resistance

2013-08-17 Thread Jeremy Baron
On Sat, Aug 17, 2013 at 12:04 AM, Zack Weinberg wrote:
>  * State transitions between reading and editing.

Reads on our projects (whether logged in or out) have very little data
coming from the client and a lot being sent back to the client. When a
page is saved or previewed (or even during an edit, with calls to the
Parsoid web service?), or a file is uploaded, there's suddenly much more
substantial traffic going from the client up to the projects.

This could be mitigated in part by chunking larger transmissions and
sending them over time, which we already do for some users with the
UploadWizard (i.e. file uploads; we could expand that to cover more
users). Those chunked transmissions could then be spread out over time
and mixed in with transmissions of garbage when there's no pending
chunk to send.
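
A rough sketch of what that scheduling could look like, in Python (the
names here are hypothetical, not the actual UploadWizard API; the point
is only the shape of the traffic):

import os
import random
import time

CHUNK_SIZE = 256 * 1024  # hypothetical chunk size

def padded_upload(chunks, send, session_seconds=600):
    """Send real chunks paced over time; when no real chunk is pending,
    send same-sized garbage so an eavesdropper sees the same upstream
    pattern whether or not an upload (or edit) is in progress.
    send(data, real=...) stands in for the actual transport; the server
    simply discards the garbage."""
    queue = list(chunks)
    deadline = time.monotonic() + session_seconds
    while time.monotonic() < deadline:
        if queue:
            send(queue.pop(0), real=True)
        else:
            send(os.urandom(CHUNK_SIZE), real=False)
        time.sleep(random.uniform(1.0, 3.0))  # steady pacing, no tell-tale burst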

That may be OK from a UX perspective for file uploads (run in the
background and let the user do other stuff while the upload runs) but
I don't think people will want to wait for their edits (or previews!)
to go through.

Also, we'd probably need an opt-out option for sending the garbage
when idle and we'd have to research the potential impact on bandwidth
bills/quotas. And maybe also impact on battery life??

(it couldn't be opt-in because the fact that you had opted in would
itself be an indicator that you might be an editor.)

-Jeremy


Re: [Wikitech-l] Wikimedia's anti-surveillance plans: traffic analysis resistance

2013-08-17 Thread Tim Starling
On 17/08/13 10:04, Zack Weinberg wrote:
> What's
> actually needed is to *bin* page (+resource) sizes such that any given
> load could be a substantial number of different pages.

That would certainly help for some sites, but I don't think it would
make much difference for Wikipedia. Even if you used power-of-two
bins, you would still be leaking tens of bits per page view due to the
sizes and request order of the images on each page.
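
To put illustrative numbers on that (made up, not measured from real
pages): even with every object rounded up to a power of two, a page
that loads a dozen images still exposes a long sequence of bins, e.g.

import math

def p2_bin(size_bytes):
    # round a transfer size up to the next power of two
    return 1 << (size_bytes - 1).bit_length()

# hypothetical page: HTML plus 12 images, sizes in bytes
sizes = [38411, 912, 4303, 17882, 2117, 65001, 8450,
         1203, 33004, 5998, 12345, 700, 51200]
bins = [p2_bin(s) for s in sizes]

# if (say) 8 bin values are plausible for a typical object and the
# request order is visible, the pattern carries roughly
# 13 * log2(8) = 39 bits -- "tens of bits" per page view
print(bins)
print("~bits visible:", len(bins) * math.log2(8))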

Binning would mean that instead of just crawling the HTML, the attacker
would also have to send a HEAD request for each image. Maybe it would
add 50% to the attacker's software development time, but once it was
done for one site, it would work against any similar site.
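
That extra step is only a few lines with an off-the-shelf HTTP library.
A sketch (assuming the image URLs have already been scraped from the
HTML):

import requests

def page_fingerprint(html_size, image_urls):
    # the fingerprint is the HTML length plus the Content-Length of
    # every image, fetched with cheap HEAD requests
    sizes = [html_size]
    for url in image_urls:
        r = requests.head(url, allow_redirects=True, timeout=10)
        sizes.append(int(r.headers.get("Content-Length", 0)))
    return tuple(sizes)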

-- Tim Starling



Re: [Wikitech-l] Wikimedia's anti-surveillance plans: traffic analysis resistance

2013-08-16 Thread C. Scott Ananian
On Fri, Aug 16, 2013 at 8:04 PM, Zack Weinberg wrote:

> Wikipedia user handle.  I realize how disruptive this would be, but I
> think we need to consider changing the canonical Wikipedia URL format to
> https://wikipedia.org/LANGUAGE/PAGENAME.
>

Note that LANGUAGE.wikipedia.org/VARIANT/PAGENAME is already in use for
wikis which use the language variant conversion code, such as zhwiki.
 Usually LANGUAGE is a prefix of VARIANT, for example zh-hans, zh-hant,
en-us, en-gb, sr, sr-ec.

If we wanted to approach this goal, we could start by creating a proxy
service at https://secure.wikipedia.org/LANGUAGE-VARIANT/PAGENAME that
internally proxied pages from
https://LANGUAGE.wikipedia.org/LANGUAGE-VARIANT/PAGENAME.  That would allow
some low-risk, being-bold exploration of the different implications.
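
As a very rough illustration of the rewrite such a proxy would do (a
sketch only; the path grammar is an assumption, not a worked-out
proposal):

import re

def rewrite(path):
    """Map /LANGUAGE-VARIANT/PAGENAME on the consolidated host to the
    existing per-language host, e.g. /zh-hant/Foo ->
    https://zh.wikipedia.org/zh-hant/Foo"""
    m = re.match(r"^/([a-z]+(?:-[a-z]+)?)/(.+)$", path)
    if not m:
        raise ValueError("unrecognized path: %r" % path)
    variant, page = m.groups()
    lang = variant.split("-")[0]  # zh-hant -> zh, en-gb -> en, sr -> sr
    return "https://%s.wikipedia.org/%s/%s" % (lang, variant, page)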


> This last article raises a critical point. To render Wikipedia genuinely
> secure against traffic analysis


Whenever someone seems to veer into a discussion of absolute security I
get nervous.  It would be better to begin by asking "how can we make
attacks more expensive?"

Given the contents of the most recent NSA document leaks, it seems like it
is also worthwhile to attempt to confound the "are we at least 51% certain
that this user is not an American" question.  It does seem like combining
wikis is a worthwhile step here. I wonder if any arbitrary user of zhwiki
(for example) would automatically be assumed to have a >51% chance of
being non-American.

> Random padding, in fact, is no good at all. The adversary can simply
> average over many pageloads and extract the true length.


Again, "no good at all" slides into this "absolute security" fallacy.  *How
much more difficult* does padding make things?  *How many* more pageloads?
 The adversaries with infinite resources can also legally compel the sysop
to compromise the server.  But can we improve the situation for
medium-sized state actors, or raise the bar so that only targeted users can
be compromised (instead of passively collecting information on all users)?
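
To put a back-of-the-envelope number on the "how many more pageloads"
question, assuming uniform random padding of up to W bytes per response:

# uniform padding in [0, W) has standard deviation W/sqrt(12);
# averaging n loads shrinks that by sqrt(n), so to separate two pages
# whose true sizes differ by D bytes the attacker needs roughly
#     n >~ W**2 / (3 * D**2)
W, D = 256, 32              # illustrative: 256 B of padding, pages 32 B apart
print(W**2 / (3 * D**2))    # ~21 pageloads

With those (made-up) numbers the extra cost is a couple of dozen
pageloads -- a real cost, but not a large factor.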

As a start on constructing a better threat model, let me offer two
scenarios:

a) NSA passive collection of all traffic to/from Wikipedia (XKEYSCORE).  It
would be nice to frustrate this so that (as a start) only traffic from
targeted users could be effectively collected -- for example, by requiring
an active MITM attack instead of a passive tap.

b) Great Firewall monitoring of specific pages (Tiananmen Square, Falun
Gong).  Can we better protect the identities of readers of these pages?
Can we protect the identities of editors?  Can we frustrate attempts to
block specific pages?

Real world issues should also be taken into account.  Methods that prevent
the Great Firewall from blocking specific pages might provoke a site-wide
block.  Efforts to force utilization of the latest browsers (which support
some new protocol) might disenfranchise mobile users or users for whom
poverty and resource limitations are a bigger threat than coercive
government.  Etc...
 --scott

-- 
(http://cscott.net)

[Wikitech-l] Wikimedia's anti-surveillance plans: traffic analysis resistance

2013-08-16 Thread Zack Weinberg
(Please see the thread titled "Wikimedia's anti-surveillance plans: site 
hardening" for who I am and some general context.)


Once Wikipedia is up to snuff with all the site-hardening I recommended 
in the other thread, there remain two significant information leaks (and 
probably others, but these two are gonna be a big project all by 
themselves, so let's worry about them first).  One is hostnames, and the 
other is page(+resource) length.


Server hostnames are transmitted over the net in cleartext even when TLS 
is in use (because DNS operates in cleartext, and because the cleartext 
portion of the TLS handshake includes the hostname, so the server knows 
which certificate to send down).  The current URL structure of 
*.wiki[pm]edia.org exposes sensitive information in the server hostname: 
for Wikipedia it's the language tag, for Wikimedia the subproject. 
Language seems like a serious exposure to me, potentially enough all by 
itself to finger a specific IP address as associated with a specific 
Wikipedia user handle.  I realize how disruptive this would be, but I 
think we need to consider changing the canonical Wikipedia URL format to 
https://wikipedia.org/LANGUAGE/PAGENAME.
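
Anyone who wants to see the leak for themselves can do so with nothing
but the Python standard library -- a minimal sketch:

import socket
import ssl

host = "zh.wikipedia.org"   # the language tag rides along in the hostname
ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    # server_hostname is sent unencrypted in the ClientHello (SNI), so a
    # passive observer learns "zh.wikipedia.org" even though the request
    # path and page body are encrypted
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print(tls.version())

(And the DNS query for the same name went out in cleartext before the
connection was even opened.)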


For *.wikimedia.org it is less obvious what should be done. That domain 
makes use of subdomain partitioning to control the same-origin policy 
(for instance, upload.wikimedia.org needs to be a distinct hostname from 
everything else, lest someone upload e.g. a malicious SVG that 
exfiltrates your session cookies) so it cannot be altogether 
consolidated. However, knowing (for instance) whether a particular user 
is even *aware* of Commons or Meta may be enough to finger them, so we 
need to think about *some* degree of consolidation.


---

Just how much information is exposed by page length (and how to best 
mitigate it) is a live area of basic research. It happens to be *my* 
area of basic research, and I would be interested in collaborating with 
y'all on locking it down (it would make a spiffy case study for my 
thesis :-) but I must emphasize that *we don't know if it is possible to 
prevent this attack*.


I recommend that everyone interested in this topic read these articles: 
http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam.pdf 
discusses why Web browsing history is sensitive information in general. 
http://kpdyer.com/publications/oakland2012.pdf and 
http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf 
demonstrate how page length can reveal page identity and debunk a number 
of "easy" fixes; their reference lists are good portals to the 
literature. Finally, 
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a 
related but perhaps even more insidious attack, whereby the eavesdropper 
learns the *user identity* of someone on a social network by virtue of 
the size of their profile photo.


This last article raises a critical point. To render Wikipedia genuinely 
secure against traffic analysis, it is not sufficient for the 
eavesdropper to be unable to identify *which pages* are being read or 
edited. The eavesdropper may also be able to learn and make use of the 
answers to questions such as:


 * Given an IP address known to be communicating with WP/WM, whether
   or not there is a logged-in user responsible for the traffic.
 * Assuming it is known that a logged-in user is responsible for some
   traffic, *which user it is* (User: handle) or whether the user has
   any special privileges.
 * State transitions between uncredentialed and logged-in (in either
   direction).
 * State transitions between reading and editing.

This is unlikely to be an exhaustive list. If we are serious about 
defending against traffic analysis, one of the first things we should do 
is have a bunch of experienced editors and developers sit down and work 
out an exhaustive list of things we don't want to reveal. (I have only 
ever dabbled in editing Wikipedia.)


Now, once this is pinned down, theoretically, yes, the cure is padding. 
However, the padding inherent in TLS block cipher modes is *not* 
adequate; it's normally strictly "round up to the nearest multiple of 16 
bytes", which has been shown to be completely inadequate.  One of the 
above papers talks about patching GnuTLS to pad randomly by up to 256 
bytes, but this too is probably insufficient.
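
To make that concrete (illustrative numbers, not real page-size
measurements):

import random
from collections import Counter

random.seed(1)
# pretend watchlist of 1,000 sensitive pages with HTML sizes spread
# over 2 KB .. 200 KB
targets = [random.randint(2000, 200000) for _ in range(1000)]

def round16(n):              # what TLS block-cipher padding gives you
    return (n + 15) // 16 * 16

unique = sum(1 for c in Counter(map(round16, targets)).values() if c == 1)
print(unique)  # the large majority still map to a unique padded size

And that is before the eavesdropper even looks at the sizes of the
resources each page pulls in.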


Random padding, in fact, is no good at all. The adversary can simply 
average over many pageloads and extract the true length. What's actually 
needed is to *bin* page (+resource) sizes such that any given load could 
be a substantial number of different pages. 
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how 
this can be done in principle. The project - and I emphasize that it 
would be a *project* - would be to arrange for MediaWiki (the software) 
to do this binning automatically, such that the adversary cannot learn 
anything useful either from individual traffic bursts or from a sequence 
of such bursts.
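
A sketch of the binning idea, with made-up bin boundaries (in practice
they would have to be derived from the real page-size distribution so
that each bin holds many popular pages):

import bisect

BINS = [4096, 16384, 65536, 262144, 1048576]  # bytes; illustrative only

def padded_size(size):
    """Smallest bin boundary that fits the response; everything in the
    same bin is indistinguishable by length on the wire."""
    i = bisect.bisect_left(BINS, size)
    if i == len(BINS):
        # oversized responses round up to a whole number of max bins
        return -(-size // BINS[-1]) * BINS[-1]
    return BINS[i]

def pad_body(body):
    # MediaWiki would pad with something harmless (an HTTP header or an
    # HTML comment), not raw filler; this is just the arithmetic
    return body + b" " * (padded_size(len(body)) - len(body))

Doing the same for the resources each page pulls in, not just the HTML,
is part of what makes it a project.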