Re: [tor-dev] Building better pluggable transports - GSoC 2013 project

2013-06-11 Thread Steven Murdoch
Hi Chang,

On 29 May 2013, at 06:22, Chang Lan changl...@gmail.com wrote:
 Given that ScrambleSuit is being deployed, improving protocol obfuscation 
 will be my main focus. HTTP impersonation is really useful, since there are 
 numerous HTTP proxies outside the censored region, while the number of bridges 
 is quite limited. What I'm going to be doing during the summer is implementing a 
 good-enough HTTP impersonation based on the pluggable transports specification. 
 There are still many open questions indeed. Discussions are more than welcome!

There certainly are quite a few open questions, so it would be good to start 
planning early. Implementing HTTP is a deceptively difficult project. 

I'd suggest starting by reading the HTTP specification in detail, particularly 
the parts that deal with caching:
  http://tools.ietf.org/html/rfc2616
For comparison HTTP/1.0 is also worth looking at:
  http://tools.ietf.org/html/rfc1945

Some issues that you will need to deal with are:
- Individual HTTP requests may be re-ordered if they are sent over different 
TCP connections.
- Responses may be truncated without an error being reported to higher layers 
(which is why HTTP includes length fields as an option).
- HTTP doesn't give the same congestion avoidance as TCP.
- Proxies can both cache and modify data they transmit.
- Proxies deviate from what is permitted by the specification.
- (and others)

When dealing with these, you will need to ensure you don't introduce any new 
ways for a censor to efficiently and reliably distinguish your protocol from 
HTTP.
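
To make the truncation point concrete, the transport has to notice short reads itself, for example by comparing any declared Content-Length against what actually arrived. A minimal Python 2 sketch using only the standard library (hostnames and paths are placeholders):

import socket

def fetch_and_check(host, path, port=80):
    # Send a bare HTTP/1.0 request and read until the peer closes the
    # connection.
    s = socket.create_connection((host, port))
    s.sendall("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host))
    data = ""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    s.close()

    # Compare the body length with any declared Content-Length to spot
    # silent truncation, which would otherwise go unreported.
    headers, _, body = data.partition("\r\n\r\n")
    for line in headers.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length" and len(body) < int(value):
            raise IOError("truncated: got %d of %d bytes" % (len(body), int(value)))
    return body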

I think it would also be a good idea to implement scanning resistance. Since it 
will be over TCP, you can't hide that something is listening, but you can 
ensure that if the initial request does not demonstrate knowledge of a valid 
secret, the response does not disclose that it is a Tor bridge.
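
Concretely, that could look something like the sketch below (Python 2; SHARED_SECRET and the exact framing are illustrative placeholders, not a worked-out protocol): the client proves knowledge of a pre-shared secret in its first request, and anything that fails the check is served an innocuous website rather than a bridge-specific error.

import hmac, hashlib

# Placeholder: distributed out of band along with the bridge address,
# much like obfs3/ScrambleSuit shared secrets.
SHARED_SECRET = "example-shared-secret"

def make_authenticator(nonce):
    # Client side: prove knowledge of the secret without revealing it.
    return hmac.new(SHARED_SECRET, nonce, hashlib.sha256).hexdigest()

def request_is_authorised(nonce, authenticator):
    # Server side: recompute and compare (a real implementation should use a
    # constant-time comparison and reject replayed nonces).  On failure,
    # behave exactly like the decoy web server.
    return hmac.new(SHARED_SECRET, nonce, hashlib.sha256).hexdigest() == authenticator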

As you start implementing, you should have some way of testing. Initially this 
can be a direct connection from your pluggable transport client to pluggable 
transport server. You can set up an OP and bridge on the same machine (set your 
bridge not to advertise itself), and get your OP to talk to your bridge via 
your pluggable transport.
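
For example, the two torrc files (one per tor instance) might look roughly like this; "httpt", the ports and the paths are placeholders for whatever your transport ends up being called:

# Bridge side: a bridge that never publishes its descriptor.
BridgeRelay 1
PublishServerDescriptor 0
ORPort 6000
AssumeReachable 1
ServerTransportPlugin httpt exec /path/to/httpt-server
ServerTransportListenAddr httpt 127.0.0.1:7000
DataDirectory /tmp/tor-bridge

# Client side: an OP that only reaches the bridge via the transport.
UseBridges 1
ClientTransportPlugin httpt exec /path/to/httpt-client
Bridge httpt 127.0.0.1:7000
SocksPort 9055
DataDirectory /tmp/tor-client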

However, you shouldn't keep to this setup for very long, as it won't test how 
your pluggable transport works with a proxy. So you should put a caching proxy 
(e.g. Squid) between your pluggable transport server and client, and make sure 
they keep working. You can try configuring Squid in ways to stress your 
pluggable transport, and also replace Squid with a proxy server you create 
(e.g. based on one of the many Python HTTP proxies 
http://proxies.xhaus.com/python/). This proxy server could behave 
pathologically, and test the corner cases of your pluggable transport.
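
A starting point for such a misbehaving proxy could be something like the following Python 2 sketch (a forward proxy for plain GET requests only; the delays, truncation probability and port are arbitrary and exist purely to stress the transport):

import random, time, urllib2, BaseHTTPServer

class PathologicalProxy(BaseHTTPServer.BaseHTTPRequestHandler):
    # A deliberately badly behaved forward proxy: it stalls some requests,
    # drops all upstream headers, and truncates some bodies while still
    # promising the full Content-Length.
    def do_GET(self):
        if random.random() < 0.3:
            time.sleep(5)                         # stall the request for a while
        body = urllib2.urlopen(self.path).read()  # self.path is the absolute URL
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if random.random() < 0.2:
            body = body[:len(body) // 2]          # truncate without signalling an error
        self.wfile.write(body)

if __name__ == "__main__":
    BaseHTTPServer.HTTPServer(("127.0.0.1", 3128), PathologicalProxy).serve_forever()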

When working on your experiments, automate the setup, the running of the tests, and 
the processing of the results. This is not just to make your life easier; it also 
means that your experiments are repeatable. The scripts and configuration 
files should be checked into version control. Your goal should be that someone 
can check out your code, install a few standard packages via apt-get or yum, 
run a single command, and get the same results. There are tools to help do this 
(e.g. http://software-carpentry.org/4_0/data/mgmt.html and 
http://software-carpentry.org/4_0/data/bein.html) but just using make and shell 
scripts might be fine.

There's a lot to think about here, so we don't need answers to everything now, 
but if you have any questions or comments do let me know.

Best wishes,
Steven
 ___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] grabbing Tor circuit (node) data- Tor stem, torrc and Tor control port

2013-06-11 Thread SARAH CORTES
Damian, thanks. Your summary pretty well sums up where I am trying to start. I 
will try this out and go from there. 

On Jun 10, 2013, at 11:08 PM, Damian Johnson ata...@torproject.org wrote:

 Hi Sarah. I'm not really sure what you're trying to ask in most of
 these questions. Assuming that your goal is simply 'I want to connect
 to port 9051 and dump information about all the relays I'm presently
 using in my circuits' then the following should do the trick.
 
 Take the following with a grain of salt - I just wrote it, I haven't
 tried running it. Should be close though. ;)
 
 from stem.control import Controller

 with Controller.from_port(port = 9051) as controller:
   my_relay_fingerprints = []  # fingerprints of all relays in our circuits

   for circ in controller.get_circuits():
     my_relay_fingerprints += [fp for (fp, nickname) in circ.path]

   for fingerprint in my_relay_fingerprints:
     desc = controller.get_network_status(fingerprint)
     country = controller.get_info("ip-to-country/%s" % desc.address, "unknown")

     print "relay: %s:" % fingerprint
     print "  address: %s:%s" % (desc.address, desc.or_port)
     print "  locale: %s" % country
 
 
 Did you have any other questions?
 
 On Mon, Jun 10, 2013 at 9:43 AM, SARAH CORTES sa...@lewis.is wrote:
 Damian, thanks, this is very helpful.
 
 Is there a way to do this in torrc? Otherwise, I suppose I will need to:

 1) Create a socket or connection to my port 9051; do I need (or can I use)
 TORRC_CONTROL_SOCKET?
 2) Call get_circuits() and grab the relay fingerprints.
 Do I need circuit = controller.get_circuit(circ_id)?
 3) Return :class:`stem.events.CircuitEvent` for the given circuit.
 Not sure whether or where to use the path attribute.
 4) Call controller.get_network_status() to get IP address, nickname, ORPort.
 Should I use:
 desc_by_fingerprint = controller.get_network_status(test_relay.fingerprint)
 plus
 desc_by_nickname = controller.get_network_status(test_relay.nickname)

 5) Use Maxmind -- I already have the GeoIPLite DB -- to grab AS and country, and
 onion code also from Arturo.
 
 
 Any guidance is appreciated
 
 
 https://lists.torproject.org/pipermail/tor-commits/2012-December/051174.html
 
 get_circuit(self, circuit_id, default = UNDEFINED):
 +  """
 +  Provides a circuit presently available from tor.
 +
 +  :param int circuit_id: circuit to be fetched
 +  :param object default: response if the query fails
 +
 +  :returns: :class:`stem.events.CircuitEvent` for the given circuit
 +
 +  :raises:
 +    * :class:`stem.ControllerError` if the call fails
 +    * ValueError if the circuit doesn't exist
 +
 +    An exception is only raised if we weren't provided a default response.
 +  """
 +
 +  try:
 +    for circ in self.get_circuits():
 +      if circ.id == circuit_id:
 +        return circ
 +
 +    raise ValueError("Tor presently does not have a circuit with the id of '%s'" % circuit_id)
 +  except Exception, exc:
 +    if default: return default
 +    else: raise exc
 +
   def get_circuits(self):
     """
     Provides the list of circuits Tor is currently handling.
 
 
 On Jun 10, 2013, at 10:34 AM, Damian Johnson ata...@torproject.org wrote:
 
 Hi, Damian, thanks. I am happy to discuss it on tor-dev@. But I want to keep
 spam off the list, and at first some of my questions may essentially be just
 that. But if you think they would be of interest to tor-dev, or others could
 help, just let me know, and I will sign up for it.
 
 
 They certainly are! If you're interested in tor and development then I
 would definitely suggest being on that list. Including it for this
 thread.
 
 I am trying to figure out how to pull in the nodes that are actually used in
 my Tor circuits. They are the nodes reflected in the Network map function.
 
 
 You want the get_circuits() method. As you mentioned the 'path'
 attribute has the relays in your present circuits...
 
 https://stem.torproject.org/api/control.html#stem.control.Controller.get_circuits
 
 I have created a MySQL DB of some of my Tor circuits and nodes which I am
 analyzing. I grabbed 48 circuits with their 144 nodes and info (IP address,
 nickname, country) manually from my laptop's Tor network map.
 
 
 That certainly sounds painful. The circuit paths will provide the
 relay fingerprints which you can use for get_network_status() to get
 the address, nickname, ORPort, etc...
 
 https://stem.torproject.org/api/control.html#stem.control.Controller.get_network_status
 
 As for locales that would be done via get_info('ip-to-country/address')...
 
 https://gitweb.torproject.org/torspec.git/blob/HEAD:/control-spec.txt#l672
 
 ... and ultimately to AS and country.
 
 
 AS will require the Maxmind AS database or something else. I know that
 Onionoo includes the AS information so the options that come to mind
 are either to (a) see how it does it or (b) query Onionoo for this
 information.
 
 https://onionoo.torproject.org/
 
 And i have read much of the control-spec, don't know how 

Re: [tor-dev] Metrics Plans

2013-06-11 Thread Damian Johnson
 I can try experimenting with this later on (when we have the full / needed
 importer working, e.g.), but it might be difficult to scale indeed (not
 sure, of course). Do you have any specific use cases in mind? (actually
 curious, could be interesting to hear.)

The advantage of being able to reconstruct Descriptor instances is
simpler usage (and hence more maintainable code). I.e., usage could be
as simple as...



from tor.metrics import descriptor_db

# Fetches all of the server descriptors for a given date. These are provided as
# instances of...
#
#   stem.descriptor.server_descriptor.RelayDescriptor

for desc in descriptor_db.get_server_descriptors(2013, 1, 1):
  # print the addresses of only the exits

  if desc.exit_policy.is_exiting_allowed():
print desc.address



Obviously we'd still want to do raw SQL queries for high traffic
applications. However, for applications where maintainability trumps
speed this could be a nice feature to have.

 * After making the schema update the importer could then run over this
 raw data table, constructing Descriptor instances from it and
 performing updates for any missing attributes.

 I can't say I can easily see the specifics of how all this would work, but
 if we had an always-up-to-date data model (mediated by Stem Relay Descriptor
 class, but not necessarily), this might work.. (The ORM - Stem Descriptor
 object mapping itself is trivial, so all is well in that regard.)

I'm not sure if I entirely follow. As I understand it the importer...

* Reads raw rsynced descriptor data.
* Uses it to construct stem Descriptor instances.
* Persists those to the database.

My suggestion is that for the first step it could read the rsynced
descriptors *or* the raw descriptor content from the database itself.
This means that the importer could be used to not only populate new
descriptors, but also back-fill after a schema update.

That is to say, adding a new column would simply be...

* Perform the schema update.
* Run the importer, which...
  * Reads raw descriptor data from the database.
  * Uses it to construct stem Descriptor instances.
  * Performs an UPDATE for anything that's out of sync or missing from
the database.
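
A rough sketch of that back-fill step (the "descriptor" table, its columns and the psycopg2 connection string are invented for illustration; the point is only that stem can re-parse the stored raw content and compute whatever the new column needs):

import psycopg2
from stem.descriptor.server_descriptor import RelayDescriptor

conn = psycopg2.connect("dbname=metrics")
cur = conn.cursor()

# Re-parse every raw descriptor that hasn't had the new column filled in.
cur.execute("SELECT digest, raw_content FROM descriptor WHERE new_column IS NULL")

for digest, raw_content in cur.fetchall():
    desc = RelayDescriptor(raw_content)  # reconstruct the stem object
    cur.execute("UPDATE descriptor SET new_column = %s WHERE digest = %s",
                (desc.exit_policy.is_exiting_allowed(), digest))

conn.commit()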

Cheers! -Damian
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Client simulation

2013-06-11 Thread Norman Danner



On 6/10/13 4:40 AM, Karsten Loesing wrote:

On 6/6/13 7:32 PM, Norman Danner wrote:

I have two questions regarding a possible research project.

First, the research question:  can one use machine-learning techniques
to construct a model of Tor client behavior?  Or in a more general form:
   can one use fill-in-the-blank to construct a model of Tor client
behavior?  A student of mine did some work on this over the last year,
and the results are encouraging, though not strong enough to do anything
with yet.


The intent is that each cluster (represented by a single hidden Markov
model) represents a type of client, even though we don't know for sure
what that client type does.  We can make some guesses about some:  the
type of steady high-volume cell counts is probably a bulk downloader;
the type of steady zero cell counts is probably an unused circuit;
etc.  But in some sense, I'm thinking that what counts is the behavior
of the client, not the reason for that behavior.  We don't have to
instrument clients for this.  Of course, then one has to ask whether
this kind of modeling is in fact useful.  It is somewhat different than
what you are envisioning, I think.

There are about a billion variations (at last count) on this theme.  We
chose one particular one as a test case to play with the methodology.  I
think the methodology is mostly OK, though I'm not completely satisfied
with the results of the particular variation Julian worked on.  So now
I'm trying to figure out whether to push this forward and in particular
what directions and end goals would be useful.


Interesting stuff!  You're indeed taking a different approach than I
was envisioning by gathering data on a single guard rather than on a
set of volunteering clients.  Both approaches have their pros and cons,
but I think your approach leads to some interesting results and can be
done in a privacy-preserving fashion.

Two thoughts:

- I could imagine that your results are quite valuable for modeling
better Shadow/ExperimenTor clients or for deriving better client models
for Tor path simulators.  Maybe Julian's thesis already has some good
data for that, or maybe we'll have to repeat the experiment in a
slightly different setting.  I'm cc'ing Rob (the Shadow author) and
Aaron (working on a path simulator) to make sure they saw this thread.
I can help by reviewing code changes to Tor to make sure data is
gathered in a privacy-preserving way, and I'd appreciate if those code
changes would be made public together with analysis results.


I'm in the process of rewriting the data collection code, and will 
e-mail later with some of the details.  But maybe off-list initially, as 
I think the first few passes will be very special-purpose and hence not 
of general interest (though I'm happy to discuss it more publicly if 
that's more appropriate).


Right now I'm considering focusing on trying to get a reasonable 
(partial) answer to the following question:  how well do various 
timing-analysis attacks actually work?  That is, how well do they work 
when the client model is accurate?  I'm not even sure exactly how to 
define "accurate", though I can think of at least a few different ways. 
But I'm hoping that by focusing on a relatively narrow question, we 
can see manageable chunks of questions related to what kinds of data can 
be reasonably collected and how we can use that data for other purposes.



- It might be interesting to observe how Tor usage changes over time.
Maybe the research experiment leads to a set of classifiers telling us
when a circuit is most likely used for bulk downloads, used for web
browsing, used for IRC, unused, or whatever.  We could then extend
circuit statistics to have all relays report aggregate data of how
circuits can be classified.  Requires a proposal and code, but I could
help with those.


Yes, I can see a number of longer-range applications like this.  I'm not 
sure I want to think about proposals and code just yet.


- Norman

--
Norman Danner - ndan...@wesleyan.edu - http://ndanner.web.wesleyan.edu
Department of Mathematics and Computer Science - Wesleyan University
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Building better pluggable transports - GSoC 2013 project

2013-06-11 Thread David Fifield
On Tue, Jun 11, 2013 at 05:46:49PM +0100, Steven Murdoch wrote:
 On 11 Jun 2013, at 12:49, Steven Murdoch steven.murd...@cl.cam.ac.uk
 wrote:
 
 There certainly are quite a few open questions, so it would be good to
 start planning early. Implementing HTTP is a deceptively difficult
 project.
 
 I've started a design document,
 https://github.com/sjmurdoch/http-transport/blob/master/design.md, which is
 very much a work-in-progress, but I'm interested in comments.

Here are some ideas on a few things that I've been thinking about
recently, mostly taken from https://www.bamsoftware.com/papers/oss.pdf.
That's an HTTP-based transport, though one with different goals: It's
meant to evade IP-based blocking and not DPI. (The paper does have a
section at the end about mitigations against DPI.)

 Bi-directional data
 Tor requires that communication exchanges can be initiated by either the
 bridge client or the bridge server. In contrast, HTTP clients initiate all
 communications. There are a few ways to avoid this problem:
 * The client periodically polls the server to check if any data is
   available
 * The client keeps a long-running Comet TCP connection, on which the
   server can send responses
 * The client and server both act as HTTP clients and HTTP servers, so
   can each send data when they wish

Making the client an HTTP server has the same NAT problems that flash
proxy has. The OSS model has the worst of both worlds: the client has to
be an HTTP server and also has to poll. But we implemented polling and
it was usable.
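
For what it's worth, the polling option can be as small as the loop below (a sketch, not OSS or StegoTorus code; the URL and the framing of the payload are placeholders). The client POSTs whatever it has buffered, the response body carries whatever the server has buffered, and the interval backs off while the link is idle:

import time, urllib2

POLL_URL = "http://example.com/poll"   # placeholder rendezvous URL

def poll_loop(read_outgoing, write_incoming):
    # read_outgoing() returns buffered client->server bytes (or "");
    # write_incoming() consumes server->client bytes.
    delay = 0.1
    while True:
        request = urllib2.Request(POLL_URL, data=read_outgoing() or "")
        body = urllib2.urlopen(request).read()
        if body:
            write_incoming(body)
        # Back off while idle so an unused circuit doesn't hammer the server.
        delay = 0.1 if body else min(delay * 2, 10.0)
        time.sleep(delay)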

 Proxy busting
 Proxies will, under certain conditions, not send a request they
 receive to the destination server, but instead serve whatever the
 proxy thinks is the correct response. The HTTP specification dictates
 a proxy's behaviour but some proxy servers may deviate from the
 requirements. The pluggable transport will therefore need to either
 prevent the proxy from caching responses or detect cached data and
 trigger a re-transmission. It may be unusual behaviour for an HTTP
 client to always send unique requests, so it should perhaps
 occasionally send dummy requests that are the same as previous ones and so
 would be cached.

To inhibit caching, we added a random number to every request. However,
that's a good point about not having all requests be unique.
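
Something along these lines covers both halves (a sketch; the parameter name, header choices and probabilities are arbitrary): real requests carry a random nonce plus no-cache headers, while an occasional repeated dummy request gives the proxy something it is allowed to cache.

import os, random, urllib2

def build_request(url, payload):
    if random.random() < 0.05:
        # Occasionally repeat a plain, cacheable request so that a stream of
        # globally unique URLs doesn't itself become a fingerprint.
        return urllib2.Request(url)
    nonce = os.urandom(8).encode("hex")        # defeats caching of real requests
    req = urllib2.Request("%s?id=%s" % (url, nonce), data=payload)
    req.add_header("Cache-Control", "no-cache")
    req.add_header("Pragma", "no-cache")
    return req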

 Client to server (requests)
 - Cookies: short and usually do not change, so possibly not a good choice
 - HTTP POST file uploads: quite unusual, but permit large uploads

Another avenue is URLs--they are sometimes kilobytes long (and clients
and servers support much longer than that), and often contain opaque
binary data.
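
As a quick illustration of the URL carrier (a sketch; the chunk size and the fake .js path are arbitrary, and naive base64 in paths is of course itself a DPI giveaway, so a real transport would want a more HTTP-like encoding):

import base64

MAX_URL_DATA = 1800   # conservative; many clients and servers accept far more

def encode_into_urls(data, host):
    # Pack upstream bytes into URL-safe base64 and spread them across GET
    # request paths that resemble ordinary resource fetches.
    encoded = base64.urlsafe_b64encode(data).rstrip("=")
    return ["http://%s/static/%s.js" % (host, encoded[i:i + MAX_URL_DATA])
            for i in range(0, len(encoded), MAX_URL_DATA)]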

David
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Building better pluggable transports - GSoC 2013 project

2013-06-11 Thread Steven Murdoch
On 11 Jun 2013, at 12:49, Steven Murdoch steven.murd...@cl.cam.ac.uk wrote:
 There certainly are quite a few open questions, so it would be good to start 
 planning early. Implementing HTTP is a deceptively difficult project. 

I've started a design document 
https://github.com/sjmurdoch/http-transport/blob/master/design.md which is very 
much a work-in-progress but I'm interested in comments.

Best wishes,
Steven
 

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Building better pluggable transports - GSoC 2013 project

2013-06-11 Thread Zack Weinberg
I've been thinking about writing a lessons-learned document about
StegoTorus; I'll bump that up a little on the todo queue.

For right now I want to mention that any greenfields design should
take a hard look at MinimaLT
http://cr.yp.to/tcpip/minimalt-20130522.pdf as its cryptographic
layer.  It looks like it addresses most if not all of the problems I
was trying to tackle with ST's crypto layer, only (unlike ST's crypto
layer) it's actually *finished*.

On Tue, Jun 11, 2013 at 12:46 PM, Steven Murdoch
steven.murd...@cl.cam.ac.uk wrote:
 On 11 Jun 2013, at 12:49, Steven Murdoch steven.murd...@cl.cam.ac.uk
 wrote:

 There certainly are quite a few open questions, so it would be good to start
 planning early. Implementing HTTP is a deceptively difficult project.


 I've started a design document
 https://github.com/sjmurdoch/http-transport/blob/master/design.md which is
 very much a work-in-progress but I'm interested in comments.

 Best wishes,
 Steven



 ___
 tor-dev mailing list
 tor-dev@lists.torproject.org
 https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev