Re: [tor-dev] Remote descriptor fetching

2013-07-23 Thread Damian Johnson
 - see the
 example at the start of its docs...

 https://stem.torproject.org/api/descriptor/remote.html
 https://stem.torproject.org/_modules/stem/descriptor/remote.html

 These two links don't work for me for some reason.

Very strange. It didn't work when I just tried clicking them from
another system, but when I did a full refresh (ctrl+shift+r) it did.
Probably browser-side caching.

 PS. Where does an authority's v3ident come from? Presently I reference
 users to the values in config.c but that's mostly because I'm confused
 about what it is and how it differs from their fingerprint.

 The v3 identity is what v3 directory authorities use to sign their votes
 and consensuses.  Here's a better explanation of v3 identity keys:

 https://gitweb.torproject.org/torspec.git/blob/HEAD:/attic/v3-authority-howto.txt

I spotted that the v3ident is the same thing as the 'fingerprint' line
from the authority key certificates. In my humble opinion this
overloaded meaning of a relay fingerprint is confusing, and I'm not
clear why we'd reference authorities by the key fingerprint rather
than the relay fingerprint. But oh well. If there's anything we can
improve in the module pydocs then let me know.
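
To make that concrete, here's a rough, stem-free sketch of where the value
comes from (the '/tor/keys/all' DirPort resource is my reading of dir-spec,
so treat the parsing as illustrative only): the v3ident is the 'fingerprint'
line of an authority's key certificate, while the relay fingerprint comes
from its router descriptor...

import urllib2

# moria1's DirPort; any directory authority serves its key certificates here
certs = urllib2.urlopen('http://128.31.0.34:9131/tor/keys/all').read()

for line in certs.splitlines():
  if line.startswith('fingerprint '):
    print "v3ident: %s" % line.split(' ', 1)[1]
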
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Remote descriptor fetching

2013-07-21 Thread Damian Johnson
Hi Karsten, I've finally finished implementing stem's module for
remote descriptor fetching. Its usage is pleasantly simple - see the
example at the start of its docs...

https://stem.torproject.org/api/descriptor/remote.html
https://stem.torproject.org/_modules/stem/descriptor/remote.html
https://gitweb.torproject.org/stem.git/commitdiff/7f050eb?hp=b6c23b0
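
For the list archives, the gist of the example is roughly the following
(paraphrased from the linked docs - double check them for the exact class
and argument names)...

from stem.descriptor import remote

downloader = remote.DescriptorDownloader(use_mirrors = True)

try:
  for desc in downloader.get_server_descriptors():
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except Exception, exc:
  print "Unable to retrieve the server descriptors: %s" % exc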

The only part of our wiki plans that I regretted needing to drop is a
local filesystem cache (the part that made me think twice was figuring
out when to invalidate cached resources). Otherwise this turned into a
pleasantly slick module.

Cheers! -Damian

PS. Where does an authority's v3ident come from? Presently I reference
users to the values in config.c but that's mostly because I'm confused
about what it is and how it differs from their fingerprint.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Remote descriptor fetching

2013-06-10 Thread Kostas Jakeliunas
Hi folks!

 Indeed, this would be pretty bad.  I'm not convinced that moria1
 provides truncated responses though.  It could also be that it
 compresses results for every new request and that compressed responses
 randomly differ in size, but are still valid compressions of the same
 input.  Kostas, do you want to look more into this and open a ticket if
 this really turns out to be a bug?

I did check each downloaded file; each was different in size etc., but not
all of them were valid, from a shallow look at things (just chucking the
file at zlib and seeing what comes out).
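
Concretely, that check was along these lines (file names being whatever
wget saved: all.z, all.z.1 and so on)...

import glob
import zlib

for path in sorted(glob.glob('all.z*')):
  with open(path) as desc_file:
    try:
      data = zlib.decompress(desc_file.read())
      print "%s: ok, %i bytes decompressed" % (path, len(data))
    except zlib.error, exc:
      print "%s: %s" % (path, exc)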

Ok, I'll try looking into this. :) Do note that exams etc. are still
ongoing, so this will get pushed back; if anybody figures things out
earlier, then great!

 Tor clients use the ORPort to fetch descriptors. As I understand it
 the DirPort has been pretty well unused for years, in which case a
 regression there doesn't seem that surprising. Guess we'll see.

Noted - OK, will see!

Re: python url request parallelization: @Damian: in the past when I wanted
to do concurrent urllib requests, I simply used threading.Thread. There
might be caveats here; I'm not familiar with the specifics. I can (again,
maybe quite a bit later) try cooking something up to see if such a simple
parallelization approach would work. (I should probably just try and do it
when I have time; maybe it will turn out that some specific solution is
needed and you guys will have solved it by then anyway.)
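
To sketch what I mean (completely untested, and the URL list is just a
placeholder)...

import threading
import urllib2

def fetch(url, results):
  try:
    results[url] = urllib2.urlopen(url, timeout = 60).read()
  except IOError, exc:
    results[url] = exc

urls = [
  'http://128.31.0.34:9131/tor/status-vote/current/consensus',
  # ... more DirPort URLs here
]

results, threads = {}, []

for url in urls:
  thread = threading.Thread(target = fetch, args = (url, results))
  thread.start()
  threads.append(thread)

for thread in threads:
  thread.join()

for url, content in results.items():
  if isinstance(content, Exception):
    print "%s failed: %s" % (url, content)
  else:
    print "%s gave us %i bytes" % (url, len(content))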

Cheers
Kostas.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Remote descriptor fetching

2013-06-08 Thread Damian Johnson
 Indeed, this would be pretty bad.  I'm not convinced that moria1
 provides truncated responses though.  It could also be that it
 compresses results for every new request and that compressed responses
 randomly differ in size, but are still valid compressions of the same
 input.  Kostas, do you want to look more into this and open a ticket if
 this really turns out to be a bug?

Tor clients use the ORPort to fetch descriptors. As I understand it
the DirPort has been pretty well unused for years, in which case a
regression there doesn't seem that surprising. Guess we'll see.

If Kostas wants to lead this investigation then that would be fantastic. :)

 So, this isn't the super smart downloader that I had in mind, but maybe
 there should still be some logic left in the application using this API.
  I can imagine how both DocTor and metrics-db-R could use this API with
 some modifications.  A few comments/suggestions:

What kind of additional smartness were you hoping for the downloader to have?

 - There could be two methods get/set_compression(compression) that
 define whether to use compression.  Assuming we get it working.

Good idea. Added.

 - If possible, the downloader should support parallel downloads, with at
 most one parallel download per directory.  But it's possible to ask
 multiple directories at the same time.  There could be two methods
 get/set_max_parallel_downloads(max) with a default of 1.

Usually I'd be all for parallelizing our requests to both improve
performance and distribute load. However, tor's present interface
doesn't really encourage it. There's no way of saying "get half of the
server descriptors from location X and the other half from location
Y" - you can only request specific descriptors or all of them.

Are you thinking that get_server_descriptors() and friends should
only try to parallelize when given a set of fingerprints? If so then
that sounds like a fine idea.
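
If we go that route the splitting itself is trivial - something along
these lines, where the commented part stands in for whatever the fetch
method ends up being called...

def split_across(fingerprints, endpoint_count):
  # round-robin the requested fingerprints into one bucket per endpoint
  buckets = [[] for _ in range(endpoint_count)]

  for i, fingerprint in enumerate(fingerprints):
    buckets[i % endpoint_count].append(fingerprint)

  return buckets

# for endpoint, chunk in zip(endpoints, split_across(fingerprints, len(endpoints))):
#   fetch just 'chunk' from 'endpoint', ideally in parallel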

 - I'd want to set a global timeout for all things requested from the
 directories, so a get/set_global_timeout(seconds) would be nice.  The
 downloader could throw an exception when the global download timeout
 elapses.  I need such a timeout for hourly running cronjobs to prevent
 them from overlapping when things are really, really slow.

How does the global timeout differ from our present set_timeout()?
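
Is the distinction you're after something like the following? Names and
numbers here are made up purely to illustrate...

import time

GLOBAL_TIMEOUT = 300      # one deadline for the entire batch
PER_REQUEST_TIMEOUT = 60  # what set_timeout() presently bounds

def fetch_descriptor(fingerprint, timeout):
  pass  # stand-in for a single directory request

fingerprints = ['A' * 40]  # placeholder
start = time.time()

for fingerprint in fingerprints:
  remaining = GLOBAL_TIMEOUT - (time.time() - start)

  if remaining <= 0:
    raise IOError('global download timeout elapsed')

  fetch_descriptor(fingerprint, timeout = min(PER_REQUEST_TIMEOUT, remaining))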

 - Just to be sure, get/set_retries(tries) is meant for each endpoint, right?

Yup, clarified.

 - I don't like get_directory_mirrors() as much, because it does two
 things: make a network request and parse it.  I'd prefer a method
 use_v2dirs_as_endpoints(consensus) that takes a consensus document and
 uses the contained v2dirs as endpoints for future downloads.  The
 documentation could suggest to use this approach to move some load off
 the directory authorities and to directory mirrors.

Very good point. Changed to a use_directory_mirrors() method; callers
can then call get_endpoints() if they're really curious what the
present directory mirrors are (which I doubt they often will be).
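
So from a caller's perspective the flow is roughly the following (class and
method names here follow this thread's discussion rather than a finalized
API)...

from stem.descriptor import remote

downloader = remote.DescriptorDownloader()

# one consensus fetch, after which V2Dir relays stand in for the authorities
downloader.use_directory_mirrors()

print "now querying %i endpoints" % len(downloader.get_endpoints())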

 - Related note: I always look if the Dir port is non-zero to decide
 whether a relay is a directory.  Not sure if there's a difference to
 looking at the V2Dir flag.

Sounds good. We'll go for that instead.

 - All methods starting at get_consensus() should be renamed to fetch_*
 or query_* to make it clear that these are no getters but perform actual
 network requests.

Going with fetch_*.

 - All methods starting at get_consensus() could have an additional
 parameter for the number of copies (from different directories) to
 download.  The default would be 1.  But in some cases people might be
 interested in having 2 or 3 copies of a descriptor to compare if there
 are any differences, or to compare download times (more on this below).
  Also, a special value of -1 could mean to download every requested
 descriptor from every available directory.  That's what I'd do in DocTor
 to download the consensus from all directory authorities.

 - As for download times, is there a way to include download meta data in
 the result of get_consensus() and friends?  I'd be interested in the
 directory that a descriptor was downloaded from and in the download time
 in millis.  This is similar to how I'm interested in file meta data in
 the descriptor reader, like file name or last modified time of the file
 containing a descriptor.

This sounds really specialized. If callers care about the download
times then that seems best done via something like...

endpoints = ['location1', 'location2']  # etc

for endpoint in endpoints:
  try:
    start_time = time.time()
    downloader.set_endpoints([endpoint])
    downloader.get_consensus()

    print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
  except IOError, exc:
    print "failed to use %s: %s" % (endpoint, exc)

 - Can you add a fetch|query_votes(fingerprints) method to request vote
 documents?

Added a fetch_vote(authority) to provide an authority's
NetworkStatusDocument 

Re: [tor-dev] Remote descriptor fetching

2013-05-28 Thread Karsten Loesing
On 5/28/13 1:50 AM, Damian Johnson wrote:
 Hi Karsten. I'm starting to look into remote descriptor fetching, a
 capability of metrics-lib that stem presently lacks [1][2]. The spec
 says that mirrors provide zlib compressed data [3], and the
 DirectoryDownloader handles this via an InflaterInputStream [4].
 
 So far, so good. By my read of the man pages this means that gzip or
 python's zlib module should be able to handle the decompression.
 However, I must be missing something...
 
 % wget http://128.31.0.34:9131/tor/server/all.z
 
 % file all.z
 all.z: data
 
 % gzip -d all.z
 gzip: all.z: not in gzip format
 
 % zcat all.z
 gzip: all.z: not in gzip format
 
 % python
 >>> import zlib
 >>> with open('all.z') as desc_file:
 ...   print zlib.decompress(desc_file.read())
 ...
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
 zlib.error: Error -5 while decompressing data: incomplete or truncated stream
 
 Maybe a fresh set of eyes will spot something I'm obviously missing.
 Spotting anything?

Hmmm, that's a fine question.  I remember this was tricky in Java and
took me a while to figure out.  I did a quick Google search, but I
didn't find a way to decompress tor's .z files using shell commands or
Python. :/
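
One thing that might still be worth trying, if the responses are simply
truncated: zlib's decompressobj() returns whatever prefix of the stream it
can decode instead of raising, so something like this (untested against
tor's output) would at least show how much of the data is usable...

import zlib

with open('all.z') as desc_file:
  decompressor = zlib.decompressobj()
  data = decompressor.decompress(desc_file.read())

# a truncated stream just yields a shorter result rather than an exception
print "got %i decompressed bytes" % len(data)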

How about we focus on the API first and ignore the fact that compressed
responses exist?

 Speaking of remote descriptor fetching, any thoughts on the API? I'm
 thinking of a 'stem/descriptor/remote.py' module with...
 
 * get_directory_authorities()
 
 List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally
 we'd have an integ test to notify us when our listing falls out of
 date. However, it looks like the controller interface doesn't surface
 this. Is there a nice method of determining the present authorities
 besides polling the authorities array of 'src/or/config.c' [5]?
 
 * fetch_directory_mirrors()
 
 Polls an authority for the present consensus and filters it down to
 relays with the V2Dir flag. It then uses this to populate a global
 directory mirror cache that's used when querying directory data. This
 can optionally be provided with a Controller instance or cached
 consensus file to use that instead of polling an authority.

(Minor note: if possible, let's separate methods like this into one
method that makes a network request and another method that works only
locally.)

 * get_directory_cache()
 
 Provides a list of our present directory mirrors. This is a list of
 (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been
 called this is the directory authorities.
 
 * query(descriptor_type, fingerprint = None, retries = 5)
 
 Picks a random relay from our directory mirror cache, and attempts to
 retrieve the given type of descriptor data. Arguments behave as
 follows...
 
 descriptor_type (str): Type of descriptor to be fetched. This is the
 same as our @type annotations [6]. This raises a ValueError if the
 descriptor type isn't available from directory mirrors.
 
 fingerprint (str, list): Optional argument for the relay or list of
 relays to fetch the descriptors for. This retrieves all relays if
 omitted.
 
 retries (int): Maximum number of times we'll attempt to retrieve the
 descriptors. We fail over to another randomly selected directory mirror
 when unsuccessful. Our last attempt is always via a directory
 authority. If all attempts are unsuccessful we raise an IOError.
 
 
 
 I'd imagine usage of the module would look something like the following...
 
 # Simple script to print all of the exits.
 
 from stem.descriptor import remote
 
 # Populates our directory mirror cache. This does more harm
 # here than good since we're only making a single request.
 # However, if this was a longer-living script, doing this
 # would relieve load from the authorities.
 
 remote.fetch_directory_mirrors()
 
 try:
   for desc in remote.query('server-descriptor 1.0'):
     if desc.exit_policy.is_exiting_allowed():
       print "%s (%s)" % (desc.nickname, desc.fingerprint)
 except IOError, exc:
   print "Unable to query the server descriptors: %s" % exc
 
 
 
 Thoughts? Does this cover all of the use cases we'll use this module for?

This API looks like a fine way to manually download descriptors, but I
wonder if we can make the downloader smarter than that.

The two main use cases I have in mind are:

1. Download and archive relay descriptors: metrics-db uses different
sources to archive relay descriptors including gabelmoo's cached-*
files.  But there's always the chance to miss a descriptor that is
referenced from another descriptor.  metrics-db (or the Python
equivalent) would initialize the downloader by telling it which
descriptors it's missing, and the downloader would go fetch them.

2. Monitor consensus process for any issues: DocTor downloads the
current consensus from all directory authorities and all votes from any
directory authority.  It doesn't care about server or extra-info

Re: [tor-dev] Remote descriptor fetching

2013-05-28 Thread Kostas Jakeliunas
On Tue, May 28, 2013 at 2:50 AM, Damian Johnson ata...@torproject.org wrote:

 So far, so good. By my read of the man pages this means that gzip or
 python's zlib module should be able to handle the decompression.
 However, I must be missing something...

 % wget http://128.31.0.34:9131/tor/server/all.z

 [...]

 % python
  >>> import zlib
  >>> with open('all.z') as desc_file:
 ...   print zlib.decompress(desc_file.read())
 ...
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
 zlib.error: Error -5 while decompressing data: incomplete or truncated
 stream


This seemed peculiar, so I tried it out. Each time I wget all.z from that
address, it's always a different one; I guess that's how it should be, but
it seems that sometimes not all of it gets downloaded (hence the actually
legit zlib error).

I was able to make it work after my second download attempt (with your
exact code); zlib handles it well. So far it's worked every time since.

This is probably not good if the source may sometimes deliver an incomplete
stream.

TL;DR try wget'ing multiple times and getting even more puzzled (?)
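
In code, "try multiple times" could be as dumb as this (retry count and
timeout are arbitrary)...

import urllib2
import zlib

URL = 'http://128.31.0.34:9131/tor/server/all.z'

for attempt in range(1, 6):
  data = urllib2.urlopen(URL, timeout = 60).read()

  try:
    descriptors = zlib.decompress(data)
    print "attempt %i: ok, %i descriptor bytes" % (attempt, len(descriptors))
    break
  except zlib.error, exc:
    print "attempt %i: %s" % (attempt, exc)
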
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev