Re: [Wikitech-l] Sane versioning for core (was: Re: Fwd: No more Architecture Committee?)

2015-01-25 Thread Zack Weinberg
On Sun, Jan 25, 2015 at 1:27 PM, Legoktm legoktm.wikipe...@gmail.com wrote:
> On 01/15/2015 08:26 PM, Chad wrote:
>> I've been saying for over a year now we should just drop the "1." from
>> the 1.x.y release versions. So the next release would be 25.0, 26.0,
>> etc etc.

-1 from me, for what little that's worth...

> It would allow us to follow semver and still retain
> our current version number history instead of waiting for a magical 2.0.

This logic is the opposite of semver.  Semver says you only bump the
major version number when you make a breaking change.  Since breaking
changes are Bad Things, therefore bumping the major version number is
also a Bad Thing.  It is something that you should strive to *avoid*
having to do.

Under semver, a version number of the form "1.<large integer>" is a
*badge of honor*.  It means that you have successfully executed many
releases *without* needing to make a breaking change.  One should
display that initial "1." proudly; one should not consider it to be
superfluous.
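The distinction can be made concrete with a toy version-bumping rule (a sketch of semver's logic, not any project's actual release tooling):

```python
def next_version(current: str, change: str) -> str:
    """Semver in miniature: only a breaking change bumps the major
    number; features bump minor, fixes bump patch."""
    major, minor, patch = map(int, current.split('.'))
    if change == 'breaking':      # a Bad Thing -- strive to avoid
        return f"{major + 1}.0.0"
    if change == 'feature':
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Under this rule MediaWiki's releases would simply continue 1.25.0, 1.26.0, ... for as long as compatibility holds.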

zw

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Feature request.

2014-11-16 Thread Zack Weinberg
On Sun, Nov 16, 2014 at 7:27 PM, svetlana svetl...@fastmail.com.au wrote:
> On the second edit conflict, I read the message at the page top. It says:
>
>   "Someone else has changed this page since you started editing it. The upper
>   text area contains the page text as it currently exists. **Your changes are
>   shown in the lower text area.** You will have to merge your changes into the
>   existing text. Only the text in the upper text area will be saved when you
>   press 'Save page'."
>
> Emphasis added by me.  We all know that people fail to read though.  If we
> can come up with a more colorful error message or a more intuitive edit
> conflict page layout, I'm all ears.

Perhaps we could look at desktop 3-way diff utilities for inspiration?
Something like (pray forgive the ASCII art...)

EDIT CONFLICT

   YOUR VERSION                OTHER VERSION
+----------------+         +----------------+
| (read only     |         | (read only     |
|  text area)    |         |  text area)    |
| (background    |         | (background    |
|  greenish)     |         |  yellowish)    |
+----------------+         +----------------+

        Please merge the two versions
          into the text area below.
    +------------------------------+
    | (editable text area)         |
    | (background white)           |
    | (prefilled with the best     |
    |  3-way merge we can manage)  |
    +------------------------------+

I have to say I don't understand why the system does such a terrible
job -- 3-way merge is a Solved Problem over in the land of version
control systems for programmers.
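For illustration, here is a toy line-based 3-way merge in the diff3 spirit, built on Python's difflib (a crude sketch of the technique, far simpler than what real version-control systems do -- note that, e.g., insertions by both sides at the same point are naively concatenated rather than flagged):

```python
import difflib

def changed_regions(base, derived):
    """Regions of `base` that `derived` altered, as
    (base_start, base_end, replacement_lines) triples."""
    sm = difflib.SequenceMatcher(None, base, derived)
    return [(i1, i2, derived[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal']

def three_way_merge(base, mine, theirs):
    """Apply both sides' edits to the common ancestor; edits that
    overlap the same base region become a conflict block."""
    edits = sorted(
        [(s, e, r, 'mine') for s, e, r in changed_regions(base, mine)] +
        [(s, e, r, 'theirs') for s, e, r in changed_regions(base, theirs)],
        key=lambda t: (t[0], t[1]))
    out, pos, i = [], 0, 0
    while i < len(edits):
        start, end = edits[i][0], edits[i][1]
        j = i + 1
        while j < len(edits) and edits[j][0] < end:  # gather overlaps
            end = max(end, edits[j][1])
            j += 1
        out.extend(base[pos:start])
        if j == i + 1:              # only one side touched this region
            out.extend(edits[i][2])
        else:                       # both sides did: emit a conflict
            out.append('<<<<<<< mine')
            for _, _, r, who in edits[i:j]:
                if who == 'mine':
                    out.extend(r)
            out.append('=======')
            for _, _, r, who in edits[i:j]:
                if who == 'theirs':
                    out.extend(r)
            out.append('>>>>>>> theirs')
        pos, i = end, j
    out.extend(base[pos:])
    return out
```

When the two sides edit disjoint regions, the merge is clean and automatic; only genuinely overlapping edits need the user's attention.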

zw


Re: [Wikitech-l] Changing edit token length

2014-10-20 Thread Zack Weinberg
On Mon, Oct 20, 2014 at 1:38 PM, Chris Steipp cste...@wikimedia.org wrote:
> * Tokens can be time limited. By default they won't be, but this puts
> the plumbing in place if it makes sense to do that on any token checks
> in the future.
> * The tokens returned in a request will change on each request. Any
> version of the token will be good for as long as the time limit is
> valid (which again, will default to infinite), but this protects
> against ssl-compression attacks (like BREACH) where plaintext in a
> request can be brute-forced by making many requests and watching the
> size of the response.
>
> To do this, the size of the token (which has been a fixed 32 bytes +
> token suffix for a very long time) will change to add up to 16 bytes
> of timestamp (although in practice, it will stay 8 bytes for the next
> several years) to the end of the token.

I have no objection to the change itself, but I would like to make a
few related comments:

1) Since this is changing anyway, it would be a good time to make the
token size and structure independent of whether the user is logged on
or not.  (This is probably not the only place where MediaWiki leaks
"is this user logged on?" via request or response size, but it is an
obvious place.)  I think that would involve generating 'wsEditToken'
whether or not isAnon() is true, which should be fine?  And then
matchEditToken() would be simpler.  And anonymous editing tokens could
also be time-limited.

2) Since this is changing anyway, it would be a good time to stop
using MD5.  SHA256 should be good for a while.

3) You're using the per-session 'wsEditToken' value as the HMAC secret
key.  Is there anywhere that the raw 'wsEditToken' might be exposed to
the client?  Such a leak would enable a malicious client to forge
editing tokens and bypass the time-limiting.

4) Architecturally speaking, does it make sense to time-limit the
*token* rather than the *session*?
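To make the discussion concrete, here is a toy token scheme with the properties under discussion: SHA-256 HMAC, a per-request nonce so the token differs on every request, and an embedded timestamp for optional expiry. This is an illustrative sketch, not MediaWiki's actual implementation; `session_secret` stands in for the per-session 'wsEditToken' value:

```python
import hashlib, hmac, os, time

def make_token(session_secret: bytes) -> str:
    """MAC over a fresh nonce plus an issue timestamp: the nonce makes
    every token distinct (the BREACH countermeasure), the timestamp
    enables optional expiry."""
    nonce = os.urandom(8).hex()
    ts = format(int(time.time()), 'x')
    mac = hmac.new(session_secret, f'{nonce}.{ts}'.encode(),
                   hashlib.sha256).hexdigest()
    return f'{mac}.{nonce}.{ts}'

def check_token(session_secret: bytes, token: str,
                max_age: float = float('inf')) -> bool:
    try:
        mac, nonce, ts = token.split('.')
        issued = int(ts, 16)
    except ValueError:
        return False
    expected = hmac.new(session_secret, f'{nonce}.{ts}'.encode(),
                        hashlib.sha256).hexdigest()
    # constant-time comparison, then the (default infinite) age check
    return (hmac.compare_digest(mac, expected)
            and time.time() - issued <= max_age)
```

Two successive tokens differ (different nonces) yet both validate against the same session secret, which is exactly the property Chris describes.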

zw


Re: [Wikitech-l] Changing edit token length

2014-10-20 Thread Zack Weinberg
On Mon, Oct 20, 2014 at 3:34 PM, Chris Steipp cste...@wikimedia.org wrote:
> On Mon, Oct 20, 2014 at 11:00 AM, Zack Weinberg za...@cmu.edu wrote:
>> 1) Since this is changing anyway, it would be a good time to make the
>> token size and structure independent of whether the user is logged on
>> or not. [...]
>
> This is the direction I'm pushing towards. The way we handle caching
> at the WMF keeps this from being as simple as you have here, but yes,
> it's a long overdue change.

Good to know.  I'm not much of a PHP developer but I am interested in
helping to the extent that I can.

>> 2) Since this is changing anyway, it would be a good time to stop
>> using MD5.  SHA256 should be good for a while.
>
> Preimage attacks on md5 are still just slightly faster than brute
> force, so while I don't think we're in danger, I'm not opposed to
> strengthening this.

It's not an urgent change, and you're right that the HMAC construction
insulates you from all the known problems with MD5, but it ought not
be put off indefinitely.  It seems to me that one flag day is better
than two.

>> 4) Architecturally speaking, does it make sense to time-limit the
>> *token* rather than the *session*?
>
> That would be nice, but it makes it harder to do rolling validity, and
> this way we can also limit different types of tokens (so a checkuser
> token can be limited to a few minutes, while an edit token can have
> several hours) without having to track more secrets in a user's
> session.

Ah.  Makes sense.

zw


Re: [Wikitech-l] Making a plain MW core git clone not be installable

2014-06-11 Thread Zack Weinberg
On Wed, Jun 11, 2014 at 10:58 AM, Tyler Romeo tylerro...@gmail.com wrote:
> On Wed, Jun 11, 2014 at 10:56 AM, Brad Jorsch (Anomie)
> bjor...@wikimedia.org wrote:
>> ... That's just awful.
>
> How so?

Well, it makes *me* wince because you're directing people to pull code
over the network and feed it straight to the PHP interpreter, probably
as root, without inspecting it first.  And the site is happy to send
it to you via plain HTTP, which means a one-character typo gives an
active attacker a chance to pwn your entire installation.

No, nobody bothers to read all the code they just checked out of Git,
but it's integrity-protected by design, independent of the transport
channel -- you know that the code you just received is the exact same
code everyone else is getting, so you can trust that *someone* did the
security audit.

(And yeah, no one does *that* either, which is how we got the OpenSSL
fiasco, but computers can't solve that problem.)
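The integrity-by-design property comes from Git's content addressing: every object's name is a hash of its content, so any tampering changes the name. The blob format is simple enough to reproduce (a sketch for illustration; Git of course does this internally):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the object name Git assigns a file's contents:
    SHA-1 over a 'blob <len>\\0' header plus the bytes themselves."""
    header = b'blob %d\x00' % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Commits hash over trees, which hash over blobs -- so a single commit
# hash pins the exact contents of the entire checkout.
```

The result matches what `git hash-object` reports for the same bytes, which is why everyone who fetches a given commit can be sure they have identical code.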

zw


Re: [Wikitech-l] Making a plain MW core git clone not be installable

2014-06-11 Thread Zack Weinberg
On Wed, Jun 11, 2014 at 11:21 AM, Tyler Romeo tylerro...@gmail.com wrote:

> It's over HTTPS. As long as you trust that getcomposer.org is the domain
> you are looking for, this is really no different than installing via a
> package manager.

Nothing stops you from installing it over insecure HTTP.  (I filed
https://github.com/composer/composer/issues/3047 for that.)

But this is bad practice even with HTTPS; you're relying on
*transport* integrity/authenticity to secure *document* authenticity.
Yeah, we do that all the time on today's Web, but software
installation is (I don't think this is hyperbole) more
security-critical than anything else and should be held to higher
standards.  In this case, there should be an independently verifiable
(i.e. not tied to the TLS PKI) PGP signature on the installer and
people should be instructed to check that before executing it.
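In its simplest form, "independently verifiable" could look like the following sketch: hash the downloaded installer and compare against a digest published over a separate trusted channel (a PGP-signed release announcement, say). File paths and digests here are hypothetical:

```python
import hashlib, hmac

def installer_matches(path: str, expected_sha256_hex: str) -> bool:
    """Hash the downloaded file and compare against a digest obtained
    out of band -- never from the same channel as the download itself."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    # compare_digest avoids leaking the match prefix length via timing
    return hmac.compare_digest(h.hexdigest(), expected_sha256_hex.lower())
```

The point is that the digest's trust root (a PGP key you already hold) is independent of the TLS PKI securing the download.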

Note that Git submodules do this for you automatically, because the
revision hash is unforgeable.

zw


[Wikitech-l] Hardening WP/WM against traffic analysis (take two)

2014-06-05 Thread Zack Weinberg
I'd like to restart the conversation about hardening Wikipedia (or
possibly Wikimedia in general) against traffic analysis.  I brought
this up last November, I think (give or take a month), but it got
lost in a larger discussion about HTTPS.

For background, the type of attack that it would be nice to be able to
prevent is described in this paper:
http://sysseclab.informatics.indiana.edu/projects/sidebuster/sidebuster-final.pdf
Someone is eavesdropping on an encrypted connection to
LANG.wikipedia.org.  (It's not possible to prevent the attacker from
learning the DNS name and therefore the language the target reads,
short of Tor or similar.  It's also not possible to prevent them from
noticing accesses to ancillary servers, e.g. Commons for media.)  The
attacker's goal is to figure out things like

* what page is the target reading?
* what _sequence of pages_ is the target reading?  (This is actually
easier, assuming the attacker knows the internal link graph.)
* is the target a logged-in user, and if so, which user?
* did the target just edit a page, and if so, which page?
* (... y'all are probably better at thinking up these hypotheticals than me ...)

Wikipedia is different from a tax-preparation website (the case study
in the above paper) in that all of the content is public, and edit
actions are also public.  The attacker can therefore correlate their
eavesdropping data with observations of Special:RecentChanges and the
like.  This may mean it is impossible to prevent the attacker from
detecting edits.  I think it's worth the experiment, though.

What I would like to do, in the short term, is perform a large-scale
crawl of one or more of the encyclopedias and measure what the above
eavesdropper would observe.  I would do this over regular HTTPS, from
a documented IP address, both as a logged-in user and an anonymous
user.  This would capture only the reading experience; I would also
like to work with prolific editors to take measurements of the traffic
patterns generated by that activity.  (Bot edits go via the API, as I
understand it, and so are not reflective of naturalistic editing by
human users.)

With that data in hand, the next phase would be to develop some sort
of algorithm for automatically padding HTTP responses to maximize
eavesdropper confusion while minimizing overhead.  I don't yet know
exactly how this would work.  I imagine that it would be based on
clustering the database into sets of pages with similar length but
radically different contents.  The output of this would be some
combination of changes to MediaWiki core (for instance, to ensure that
the overall length of the HTTP response headers does not change when
one logs in) and an extension module that actually performs the bulk
of the padding.  I am not at all a PHP developer, so I would need help
from someone who is with this part.
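One simple shape such an algorithm could take (purely illustrative; choosing the bucket series well is exactly the open research question): pad every response up to the next size in a geometric series, so overhead is bounded and each observable size is shared by many pages.

```python
import math
from collections import defaultdict

def padded_size(size: int, base: int = 1024, factor: float = 1.25) -> int:
    """Round a response length up to the next bucket in a geometric
    series: overhead is at most `factor`-fold, and every bucket is
    shared by many pages."""
    if size <= base:
        return base
    n = math.ceil(math.log(size / base, factor))
    return math.ceil(base * factor ** n)

def anonymity_sets(page_sizes: dict) -> dict:
    """Group pages by padded size: pages in the same bucket are
    indistinguishable to an eavesdropper measuring only lengths."""
    buckets = defaultdict(list)
    for page, size in page_sizes.items():
        buckets[padded_size(size)].append(page)
    return dict(buckets)
```

The clustering step would then be about checking that each bucket actually contains enough pages of radically different content to confuse the eavesdropper.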

What do you think?  I know some of this is vague and handwavey but I
hope it is at least a place to start a discussion.

zw


Re: [Wikitech-l] Forget mailing lists and on-wiki discussions; Twitter's the place!

2014-04-06 Thread Zack Weinberg
On Sun, Apr 6, 2014 at 8:39 PM, Steven Walling steven.wall...@gmail.com wrote:

> I too was surprised at how many users are A) on XP with ClearType off,
> which is the default there or B) turn font smoothing off intentionally.

I have no comment on any of the rest of this, but with my Firefox dev
hat on, we've been through several cycles of being told in no
uncertain terms that a vocal minority of our userbase _hates_ what
ClearType does to the visual appearance of text, and will stop at
nothing to suppress it.  (That the vast majority of fonts do not have
the extra hinting required to look good with ClearType off matters not
to them; these people generally also wish to force all websites to use
their preferred fonts.  Possibly I should say preferred *font*.)

zw


Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2014-01-13 Thread Zack Weinberg
On Sun, Jan 12, 2014 at 11:46 PM, Gryllida gryll...@fastmail.fm wrote:
> On Mon, 13 Jan 2014, at 15:29, Gregory Maxwell wrote:
>> What freenode does is not functionally useful for Tor users. In my
>> first hand experience it manages to enable abusive activity while
>> simultaneously eliminating Tor's usefulness at protecting its users.
>
> The "register at real IP, then only use TOR through an account" flow
> implies trust in some entity (such as freenode irc network opers or
> Wikipedia CheckUsers). I currently believe that requiring such trust
> doesn't eliminate TOR's usefulness at protecting its users.

I rather think it does.  Assume a person under continual surveillance.
If they have to reveal their true IP address to Wikipedia in order to
register their editor account, the adversary will learn it as well,
and can then attribute all subsequent edits by that handle to that
person *whether or not* those edits are routed over an anonymity
network.

To satisfy Applebaum's request, there needs to be a mechanism whereby
someone can edit even if *all of their communications with Wikipedia,
including the initial contact* are coming over Tor or equivalent.
Blinded, costly-to-create handles (minted by Wikipedia itself) are one
possible way to achieve that; if there are concrete reasons why that
will not work for Wikipedia, the people designing these schemes would
like to know about them.
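To give a flavor of the "costly-to-create" half of that idea, here is a hashcash-style proof-of-work sketch. This is illustrative only: a real scheme would combine something like this with blind signatures, so that Wikipedia could mint handles without ever learning who holds them.

```python
import hashlib, os

def _meets_difficulty(handle: str, nonce: int, bits: int) -> bool:
    """A (handle, nonce) pair is valid when the top `bits` bits of its
    SHA-256 digest are zero."""
    digest = hashlib.sha256(f'{handle}:{nonce}'.encode()).digest()
    return int.from_bytes(digest, 'big') >> (256 - bits) == 0

def mint_handle(bits: int = 20) -> tuple:
    """Creating a handle costs ~2**bits hash evaluations, deterring
    mass production of throwaway accounts; verification costs one."""
    handle = os.urandom(8).hex()
    nonce = 0
    while not _meets_difficulty(handle, nonce, bits):
        nonce += 1
    return handle, nonce

def verify_handle(handle: str, nonce: int, bits: int = 20) -> bool:
    return _meets_difficulty(handle, nonce, bits)
```

A blocked handle stays blocked, and minting a replacement is expensive, so vandalism can be throttled without ever knowing an IP address.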

zw


Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2014-01-13 Thread Zack Weinberg
On Mon, Jan 13, 2014 at 11:43 AM, Marc A. Pelletier m...@uberbox.org wrote:
> On 01/13/2014 11:32 AM, Zack Weinberg wrote:
>> Assume a person under continual surveillance.
>> If they have to reveal their true IP address to Wikipedia in order to
>> register their editor account, the adversary will learn it as well,
>> and can then attribute all subsequent edits by that handle to that
>> person *whether or not* those edits are routed over an anonymity
>> network.
>
> If you start with that assumption, then it is unreasonable to assume
> that the endpoints aren't /also/ compromised or under surveillance.

Not true.  Tor's threat model already includes protecting clients
against malicious exit nodes.  The client endpoint can be secured by
using trusted hardware (Snowden notwithstanding, I feel relatively
comfortable assuming that attacks on the integrity of computers bought
off the shelf and never let out of one's sight since are rare and
expensive, even for nation-state adversaries) and a canned Tor-centric
client operating system executing from read-only media (e.g. Tails).

> What TOR may be good at is to protect your privacy from casual or
> economic spying; in which case going to some random Internet access
> point to create an account protects you adequately.

That is exactly the wrong advice to give the sort of people who want
to be able to edit Wikipedia over Tor (you should be thinking of
democracy activists in totalitarian states). Random
publicly-accessible internet access points are *more* likely to be
under aggressive surveillance, including thoroughly-bugged client OSes
which one may not supplant.

zw


Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2014-01-13 Thread Zack Weinberg
On Mon, Jan 13, 2014 at 2:51 PM, Gryllida gryll...@fastmail.fm wrote:
> On Tue, 14 Jan 2014, at 3:32, Zack Weinberg wrote:
>
>> I rather think it does.  Assume a person under continual surveillance.
>> If they have to reveal their true IP address to Wikipedia in order to
>> register their editor account, the adversary will learn it as well,
>> and can then attribute all subsequent edits by that handle to that
>> person *whether or not* those edits are routed over an anonymity
>> network.
>
> Doesn't it get solved if, despite the surveillance, the trust entity
> (freenode opers or wikipedia checkusers) reveals the user's IP
> only under a court order?

No.  The adversary doesn't need to talk to the trust entity to get the
user's IP.  The adversary learns the IP by eavesdropping on the initial,
uncloaked network traffic between the user-to-be and the trusted entity.

Equally, in some contexts it is unacceptable for the trust entity to be
able to reveal the user's IP even under legal compulsion or threat of force.

>> To satisfy Applebaum's request, there needs to be a mechanism whereby
>> someone can edit even if *all of their communications with Wikipedia,
>> including the initial contact* are coming over Tor or equivalent.
>
> Rubbish. This makes a vandal inherently untrackable and unblockable.

This isn't necessarily so.  In my previous message I mentioned one
technique that *should* be adequate to prevent vandalism even when the
administrators have never known the vandal's IP address of origin.
There are others in the literature, and if there are concrete reasons why
none of those techniques will work for Wikipedia, the people designing
them want to know about it.

zw

Re: [Wikitech-l] $wgRedactedFunctionArguments

2013-10-29 Thread Zack Weinberg
On Tue, Oct 29, 2013 at 9:55 AM, Dan Andreescu dandree...@wikimedia.org wrote:
> I think Ori's original point stands though.  Configuration could be used to
> redact fully / not redact at all for local debugging purposes.  But a black
> list for what to redact is bad for all the reasons black lists are bad
> security in general.

Theoretically speaking, the right way to do this would be to identify
the (small, one hopes) number of *sources* of sensitive data and
change them to return objects of a special class, which would then
automatically print out as [REDACTED] (if so configured) in a stack
trace. This would have other benefits; for instance, the special class
could arrange to handle the data extra-carefully (scrubbing it from
memory when no longer required, doing constant-time comparisons, that
sort of thing) and code that needed to treat the datum as anything
other than an opaque blob would have to explicitly unwrap it, which
would then be a red flag for code review.
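A sketch of what such a wrapper class might look like -- hypothetical, not an existing MediaWiki class, and shown in Python for brevity although MediaWiki itself is PHP:

```python
class Sensitive:
    """Opaque wrapper for secret values: anything that stringifies it
    (stack traces, logs, debuggers) sees only [REDACTED]; code that
    genuinely needs the value must call .reveal(), which is easy to
    grep for and stands out in code review."""
    __slots__ = ('_value',)

    def __init__(self, value):
        self._value = value

    def __repr__(self):
        return '[REDACTED]'

    __str__ = __repr__

    def reveal(self):
        return self._value
```

Because the wrapped value never appears in any default string representation, it cannot leak through a stack trace by accident.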

I don't have any idea how hard this would be; I'd guess somewhere
between "conceptually easy, but a huge number of tedious,
almost-but-not-quite-mechanical changes to implement" and "infeasible
due to API breakage."

zw


[Wikitech-l] Wikimedia's anti-surveillance plans

2013-08-18 Thread Zack Weinberg
Hi, I'm a grad student at CMU studying network security in general and 
censorship / surveillance resistance in particular. I also used to work 
for Mozilla, some of you may remember me in that capacity. My friend 
Sumana Harihareswara asked me to comment on Wikimedia's plans for 
hardening the encyclopedia against state surveillance. I've read all of 
the discussion to date on this subject, but it was kinda all over the 
map, so I thought it would be better to start a new thread.


I understand that there is specific interest in making it hard for an 
eavesdropper to identify *which pages* are being read or edited. I'd 
first like to suggest that there are probably dozens of other things a 
traffic-analytic attacker could learn and make use of, such as:


 * Given an IP address known to be communicating with WP/WM, whether
   or not there is a logged-in user responsible for the traffic.
 * Assuming it is known that a logged-in user is responsible for some
   traffic, *which user it is* (User: handle) or whether the user has
   any special privileges.
 * State transitions between uncredentialed and logged-in (in either
   direction).
 * State transitions between reading and editing.

This is unlikely to be an exhaustive list. If we are serious about 
defending against traffic analysis, one of the first things we should do 
is have a bunch of experienced editors and developers sit down and work 
out an exhaustive list of things we don't want to reveal. (I have only 
ever dabbled in editing Wikipedia.)


---

Now, to technical measures. The roadmap at [URL] looks to me to have the 
right shape, but there are some missing things and points of confusion.


The very first step really must be to enable HTTPS unconditionally for 
everyone (whether or not logged in). I saw a couple of people mention 
that this would lock some user groups out of the encyclopedia -- can 
anyone expand on that a little? We're going to have to find a workaround 
for that. If the server ever emits cleartext, the game is over. You 
should probably think about doing SPDY, or whatever they're calling it 
these days, at the same time; it's valuable not only for traffic 
analysis' sake, but because it offers server-side efficiency gains that 
(in theory) should mitigate the overhead of doing TLS for everyone.


After that's done, there's a grab bag of additional security refinements 
that are deployable now or with minimal-to-moderate engineering effort. 
The roadmap mentions Strict Transport Security; that should definitely 
happen. You should also do Content-Security-Policy, as strict as 
possible. I know this can be a huge amount of development effort, but 
the benefits are equally huge - we don't know exactly how it was done, 
but there's an excellent chance CSP on the hidden service would have 
prevented the exploit that got us all talking about this. Certificate 
pinning (possible either via HSTS extensions, or via talking to browser 
vendors and getting them to bake your certificate in) should at least 
cut down on the risk of a compromised CA. Deploying DNSSEC and DANE will 
also help with that. (Nobody consumes DANE information yet, but if you 
make the first move, things might happen very fast on the client side; 
also, if you discover that you can't reasonably deploy DANE, the IETF 
needs to know about it [I would rate it as moderately likely that DANE 
is broken-as-specified].)


Perfect forward secrecy should also be considered at this stage. Folks 
seem to be confused about what PFS is good for. It is *complementary* to 
traffic analysis resistance, but it's not useless in the absence of it. 
What it does is provide defense in depth against a server compromise by 
a well-heeled entity who has been logging traffic *contents*. If you 
don't have PFS and the server is compromised, *all* traffic going back 
potentially for years is decryptable, including cleartext passwords and 
other equally valuable info. If you do have PFS, the exposure is limited 
to the session rollover interval.


You should also consider aggressively paring back the set of 
ciphersuites offered by your servers. [...]


And finally, I realize how disruptive this is, but you need to change 
all the URLs so that the hostname does not expose the language tag. 
Server hostnames are cleartext even with HTTPS and SPDY (because they're 
the subject of DNS lookups, and because they are sent both ways in the 
clear as part of the TLS handshake); so even with ubiquitous encryption, 
an eavesdropper can tell which language-specific encyclopedia is being 
read, and that might be enough to finger someone.
My suggested bikeshed color would be 
https://wikipedia.org/LANGUAGE/PAGENAME (i.e. replace /wiki/ with the 
language tag).  It is probably not necessary to do this for Commons, but 
it *is* necessary for metawikis (knowing whether a given IP address ever 
even looks at a metawiki may reveal something important).


---

Once *all of* those things have been done, we could 

Re: [Wikitech-l] Wikimedia's anti-surveillance plans

2013-08-18 Thread Zack Weinberg

On 2013-08-18 1:04 PM, Bjoern Hoehrmann wrote:

> an elision mark that does not explain itself. Makes you come across as
> "hit send too early".


My email client appears to have decided to post an early draft of the 
messages I sent on Friday.  Sorry about that.  Please ignore.


For the record, I have read everything that was sent in response to 
those messages but probably won't get around to responding till Monday.


zw



[Wikitech-l] Wikimedia's anti-surveillance plans: site hardening

2013-08-16 Thread Zack Weinberg
Hi, I'm a grad student at CMU studying network security in general and 
censorship / surveillance resistance in particular. I also used to work 
for Mozilla, some of you may remember me in that capacity. My friend 
Sumana Harihareswara asked me to comment on Wikimedia's plans for 
hardening the encyclopedia against state surveillance. I've read the 
discussion to date on this subject, but it was kinda all over the map, 
so I thought it would be better to start a new thread. Actually I'm 
going to start two threads, one for general site hardening and one 
specifically about traffic analysis. This is the one about site 
hardening, which should happen first. Please note that I am subscribed 
to wikitech-l but not wikimedia-l (but I have read the discussion over 
there).


The roadmap at 
https://blog.wikimedia.org/2013/08/01/future-https-wikimedia-projects/ 
looks to me to have the right shape, but there are some missing things 
and points of confusion.


The first step really must be to enable HTTPS unconditionally for 
everyone (whether or not logged in). I see on the roadmap that there is 
concern that this will lock out large groups of users, e.g. from China; 
a workaround simply *must* be found for this. Everything else that is 
worth doing is rendered ineffective if *any* application layer data is 
*ever* transmitted over an insecure channel. There is no point worrying 
about traffic analysis when an active man-in-the-middle can inject 
malicious JavaScript into unsecured pages, or a passive one can steal 
session cookies as they fly by in cleartext.


As part of the engineering effort to turn on TLS for everyone, you 
should also provide SPDY, or whatever they're calling it these days. 
It's valuable not only for traffic analysis' sake, but because it offers 
server-side efficiency gains that (in theory anyway) should mitigate the 
TLS overhead somewhat.


After that's done, there's a grab bag of additional security refinements 
that are deployable immediately or with minimal-to-moderate engineering 
effort. The roadmap mentions HTTP Strict Transport Security; that should 
definitely happen. All cookies should be tagged both Secure and HttpOnly 
(which renders them inaccessible to accidental HTTP loads and to page 
JavaScript); now would also be a good time to prune your cookie 
requirements, ideally to just one which does not reveal via inspection 
whether or not someone is logged in. You should also do 
Content-Security-Policy, as strict as possible. I know this can be a 
huge amount of development effort, but the benefits are equally huge - 
we don't know exactly how it was done, but there's an excellent chance 
CSP on the hidden service would have prevented the exploit discussed 
here: 
https://blog.torproject.org/blog/hidden-services-current-events-and-freedom-hosting
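By way of illustration, the refinements in this paragraph boil down to a handful of response headers (the values here are examples for discussion, not a recommendation of specific policy):

```python
# Example security headers -- illustrative values, tune per site.
SECURITY_HEADERS = {
    # Force HTTPS for a year, including subdomains (HSTS)
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
    # Only load resources from our own origin; no plugins, no inline JS
    'Content-Security-Policy':
        "default-src 'self'; script-src 'self'; object-src 'none'",
}

def session_cookie(token: str) -> str:
    """One opaque session cookie: Secure keeps it off plain HTTP,
    HttpOnly keeps it away from page JavaScript."""
    return f'session={token}; Secure; HttpOnly; Path=/'
```

A single opaque cookie, identical in shape for every visitor, is also what keeps the logged-in/logged-out distinction from being visible on inspection.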


Several people raised concerns about Wikimedia's certificate authority 
becoming compromised (whether by traditional hacking, social 
engineering, or government coercion). The best available cure for this 
is called certificate pinning, which is unfortunately only doable by 
talking to browser vendors right now; however, I imagine they would be 
happy to apply pins for Wikipedia. There's been some discussion of an 
HSTS extension that would apply a pin 
(http://tools.ietf.org/html/draft-evans-palmer-key-pinning-00) and it's 
also theoretically doable via DANE (http://tools.ietf.org/html/rfc6698); 
however, AFAIK no one implements either of these things yet, and I rate 
it moderately likely that DANE is broken-as-specified. DANE requires 
DNSSEC, which is worth implementing for its own sake (it appears that 
the wikipedia.org. and wikimedia.org. zones are not currently signed).


Perfect forward secrecy should also be considered at this stage. Folks 
seem to be confused about what PFS is good for. It is *complementary* to 
traffic analysis resistance, but it's not useless in the absence of it. 
What it does is provide defense in depth against a server compromise by 
a well-heeled entity who has been logging traffic *contents*. If you 
don't have PFS and the server is compromised, *all* traffic going back 
potentially for years is decryptable, including cleartext passwords and 
other equally valuable info. If you do have PFS, the exposure is limited 
to the session rollover interval.  Browsers are fairly aggressively 
moving away from non-PFS ciphersuites (see 
https://briansmith.org/browser-ciphersuites-01.html; all of the 
non-deprecated suites are PFS).


Finally, consider paring back the set of ciphersuites accepted by your 
servers. Hopefully we will soon be able to ditch TLS 1.0 entirely (all 
of its ciphersuites have at least one serious flaw).  Again, see 
https://briansmith.org/browser-ciphersuites-01.html for the current 
thinking from the browser side.


zw


[Wikitech-l] Wikimedia's anti-surveillance plans: traffic analysis resistance

2013-08-16 Thread Zack Weinberg
(Please see the thread titled "Wikimedia's anti-surveillance plans: site 
hardening" for who I am and some general context.)


Once Wikipedia is up to snuff with all the site-hardening I recommended 
in the other thread, there remain two significant information leaks (and 
probably others, but these two are gonna be a big project all by 
themselves, so let's worry about them first).  One is hostnames, and the 
other is page(+resource) length.


Server hostnames are transmitted over the net in cleartext even when TLS 
is in use (because DNS operates in cleartext, and because the cleartext 
portion of the TLS handshake includes the hostname, so the server knows 
which certificate to send down).  The current URL structure of 
*.wiki[pm]edia.org exposes sensitive information in the server hostname: 
for Wikipedia it's the language tag, for Wikimedia the subproject. 
Language seems like a serious exposure to me, potentially enough all by 
itself to finger a specific IP address as associated with a specific 
Wikipedia user handle.  I realize how disruptive this would be, but I 
think we need to consider changing the canonical Wikipedia URL format to 
https://wikipedia.org/LANGUAGE/PAGENAME.
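The suggested change amounts to a rewrite like the following (a sketch of the URL mapping only; the real migration would involve redirects, caching, and link updates far beyond this):

```python
import re

_WIKI_URL = re.compile(r'^https://([a-z][a-z0-9-]*)\.wikipedia\.org/wiki/(.+)$')

def consolidate(url: str) -> str:
    """Move the language tag out of the hostname (visible to any
    eavesdropper via DNS and the TLS handshake) into the path
    (encrypted under TLS)."""
    m = _WIKI_URL.match(url)
    if not m:
        return url
    lang, page = m.groups()
    return f'https://wikipedia.org/{lang}/{page}'
```

After the rewrite, an eavesdropper sees only the hostname `wikipedia.org`; the language tag travels inside the encrypted channel.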


For *.wikimedia.org it is less obvious what should be done. That domain 
makes use of subdomain partitioning to control the same-origin policy 
(for instance, upload.wikimedia.org needs to be a distinct hostname from 
everything else, lest someone upload e.g. a malicious SVG that 
exfiltrates your session cookies) so it cannot be altogether 
consolidated. However, knowing (for instance) whether a particular user 
is even *aware* of Commons or Meta may be enough to finger them, so we 
need to think about *some* degree of consolidation.


---

Just how much information is exposed by page length (and how to best 
mitigate it) is a live area of basic research. It happens to be *my* 
area of basic research, and I would be interested in collaborating with 
y'all on locking it down (it would make a spiffy case study for my 
thesis :-) but I must emphasize that *we don't know if it is possible to 
prevent this attack*.


I recommend that everyone interested in this topic read these articles: 
http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam.pdf 
discusses why Web browsing history is sensitive information in general. 
http://kpdyer.com/publications/oakland2012.pdf and 
http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf 
demonstrate how page length can reveal page identity, debunk a number of 
easy fixes, and their reference lists are good portals to the 
literature. Finally, 
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a 
related but perhaps even more insidious attack, whereby the eavesdropper 
learns the *user identity* of someone on a social network by virtue of 
the size of their profile photo.


This last article raises a critical point. To render Wikipedia genuinely 
secure against traffic analysis, it is not sufficient for the 
eavesdropper to be unable to identify *which pages* are being read or 
edited. The eavesdropper may also be able to learn and make use of the 
answers to questions such as:


 * Given an IP address known to be communicating with WP/WM, whether
   or not there is a logged-in user responsible for the traffic.
 * Assuming it is known that a logged-in user is responsible for some
   traffic, *which user it is* (User: handle) or whether the user has
   any special privileges.
 * State transitions between uncredentialed and logged-in (in either
   direction).
 * State transitions between reading and editing.

This is unlikely to be an exhaustive list. If we are serious about 
defending against traffic analysis, one of the first things we should do 
is have a bunch of experienced editors and developers sit down and work 
out an exhaustive list of things we don't want to reveal. (I have only 
ever dabbled in editing Wikipedia.)


Now, once this is pinned down, theoretically, yes, the cure is padding. 
However, the padding inherent in TLS block cipher modes is *not* 
adequate; it normally just rounds up to the next multiple of 16 
bytes, which has been shown to be completely inadequate.  One of the 
above papers talks about patching GnuTLS to pad randomly by up to 256 
bytes, but this too is probably insufficient.


Random padding, in fact, is no good at all. The adversary can simply 
average over many pageloads and extract the true length. What's actually 
needed is to *bin* page (+resource) sizes such that any given load could 
be a substantial number of different pages. 
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how 
this can be done in principle. The project - and I emphasize that it 
would be a *project* - would be to arrange for MediaWiki (the software) 
to do this binning automatically, such that the adversary cannot learn 
anything useful either from individual traffic bursts or from a sequence 
of such bursts,