Thanks for that. It's interesting to have that data visualized.
On 12/16/06, Mike Perry <[EMAIL PROTECTED]> wrote:
While testing the latest release of my Tor scanner, I decided to do a study of circuit reliability: how long it takes to construct a circuit and then fetch the HTML of http://tor.eff.org, and also how long it takes to fetch http://tor.eff.org via that same already-constructed circuit. Using tor-0.1.2.4 (actually SVN r9067), I sorted the routers by their bandwidth capacity, divided them up into 15% segments of the network from 0 to 90, and for each segment timed 250 circuits. I also used the new failure tracking abilities of my scanner to track node failure rates as well as the failure reasons for circuits and streams. Times are in seconds:

RANGE 0-15  250 build+fetches: avg=20.89, dev=31.23
RANGE 0-15  250 fetches:       avg=3.66,  dev=2.69
RANGE 15-30 250 build+fetches: avg=33.44, dev=47.01
RANGE 15-30 250 fetches:       avg=7.28,  dev=12.86
RANGE 30-45 250 build+fetches: avg=81.47, dev=79.55
RANGE 30-45 250 fetches:       avg=12.66, dev=38.63
RANGE 45-60 250 build+fetches: avg=63.56, dev=67.56
RANGE 45-60 250 fetches:       avg=7.51,  dev=12.80
RANGE 60-75 250 build+fetches: avg=40.85, dev=42.76
RANGE 60-75 250 fetches:       avg=10.13, dev=11.28
RANGE 75-90 250 build+fetches: avg=48.87, dev=56.11
RANGE 75-90 250 fetches:       avg=6.82,  dev=7.48

As you can see, the high-bandwidth nodes in 0-15% are much quicker than the rest, both at using existing circuits and at building new ones. My guess is that the circuit build speedup comes from the fact that running a fast node requires a fast machine to do all the crypto, so crypto-intensive circuit builds execute faster on these nodes. The rest of the results for circuit construction and speed seem only loosely tied to bandwidth, however. Probably other factors like network connection and stability come into play there. A few bad nodes can slow those averages down a lot, as is hinted at by the large standard deviations in some of the classes.

So what of the failure rates and reasons, then? Let's have a look at the FAILTOTALS line from each class:

0-15.naive_fail_rates: FAILTOTALS 131/473 54+6/603 OK
15-30.naive_fail_rates: FAILTOTALS 224/750 135+40/726 OK
30-45.naive_fail_rates: FAILTOTALS 559/1221 130+29/737 OK
45-60.naive_fail_rates: FAILTOTALS 273/845 138+22/752 OK
60-75.naive_fail_rates: FAILTOTALS 140/592 85+33/678 OK
75-90.naive_fail_rates: FAILTOTALS 187/637 76+18/656 OK

By looking at the README for the scanner, we see the format of these lines is:

250 FAILTOTALS CIRCUIT_FAILURES/TOTAL_CIRCUITS DETACHED+FAILED/TOTAL_STREAMS

So it looks like nodes in the 30-45% range had a good deal higher rate of circuit failure than the rest (if you're wondering, the overall circuit failure rate is 33%).
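As a quick sanity check on those numbers, here is a minimal sketch (plain Python, not part of SoaT itself) that re-parses the six FAILTOTALS lines above according to that format and recomputes the per-class and overall circuit failure rates; the variable names are just for illustration:

# Minimal sketch, not part of SoaT: re-parse the six FAILTOTALS lines quoted
# above (format: FAILTOTALS CIRCUIT_FAILURES/TOTAL_CIRCUITS DETACHED+FAILED/TOTAL_STREAMS)
# and recompute the per-class and overall circuit failure rates.
import re

failtotals = """\
0-15.naive_fail_rates: FAILTOTALS 131/473 54+6/603 OK
15-30.naive_fail_rates: FAILTOTALS 224/750 135+40/726 OK
30-45.naive_fail_rates: FAILTOTALS 559/1221 130+29/737 OK
45-60.naive_fail_rates: FAILTOTALS 273/845 138+22/752 OK
60-75.naive_fail_rates: FAILTOTALS 140/592 85+33/678 OK
75-90.naive_fail_rates: FAILTOTALS 187/637 76+18/656 OK"""

# Group order: class, circuit failures, total circuits,
#              detached streams, failed streams, total streams.
line_re = re.compile(r"([\d-]+)\.naive_fail_rates:\s+FAILTOTALS\s+"
                     r"(\d+)/(\d+)\s+"       # CIRCUIT_FAILURES/TOTAL_CIRCUITS
                     r"(\d+)\+(\d+)/(\d+)")  # DETACHED+FAILED/TOTAL_STREAMS

fail_sum = circ_sum = 0
for line in failtotals.splitlines():
    cls, cfail, ctot, detached, failed, stot = line_re.match(line).groups()
    cfail, ctot = int(cfail), int(ctot)
    sfail, stot = int(detached) + int(failed), int(stot)
    fail_sum += cfail
    circ_sum += ctot
    print("%-5s circuits %3d/%4d (%4.1f%%)  streams %3d/%3d (%4.1f%%)" %
          (cls, cfail, ctot, 100.0 * cfail / ctot,
           sfail, stot, 100.0 * sfail / stot))

# Summing the classes gives 1514/4518, i.e. roughly the 33% overall
# circuit failure rate mentioned above.
print("overall circuit failure rate: %.1f%%" % (100.0 * fail_sum / circ_sum))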
Looking at the top of the 30-45.naive_fail_rates file shows us a handful of nodes with slightly higher failure rates than normal, but several of the other classes have a few bad nodes also. So why was this class so much slower? It turns out that if you look at the naive_fail_reasons file, the largest portion of failures comes from the CIRCUITFAILED:TIMEOUT reason:

250 REASONTOTAL 522/1277

or 522 timeout failures out of the 1277 total node failures for that class. Note that reason-based failure counting and reason totals are node-based, whereas the FAILTOTALS lines just count circuits and streams, hence the larger numbers here.

In general, the most common failure reasons were circuit timeouts, stream timeouts, and OR connection closed (TCP connections between nodes mysteriously dying or failing to open). Here's the top failure reasons by class. When three reason terms are paired together, the reason was reported from an upstream node and not deduced locally.

0-15:
 1. CIRCUITFAILED:OR_CONNECTION_CLOSED (174/322 node failures)
 2. CIRCUITFAILED:TIMEOUT (72/322 node failures)
 3. STREAMDETACHED:TIMEOUT (41/322 node failures)

15-30:
 1. CIRCUITFAILED:OR_CONN_CLOSED (182/623)
 2. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (116/623)
 3. CIRCUITFAILED:TIMEOUT (124/623)

30-45:
 1. CIRCUITFAILED:TIMEOUT (522/1277)
 2. CIRCUITFAILED:OR_CONN_CLOSED (396/1277)
 3. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (192/1277)

45-60:
 1. CIRCUITFAILED:TIMEOUT (306/706)
 2. CIRCUITFAILED:OR_CONN_CLOSED (164/706)
 3. STREAMDETACHED:TIMEOUT (138/706)
 4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (72/706)

60-75:
 1. CIRCUITFAILED:TIMEOUT (112/398)
 2. CIRCUITFAILED:OR_CONN_CLOSED (110/398)
 3. STREAMDETACHED:TIMEOUT (85/398)
 4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (56/398)

75-90:
 1. CIRCUITFAILED:TIMEOUT (216/468)
 2. CIRCUITFAILED:OR_CONN_CLOSED (96/468)
 3. STREAMDETACHED:TIMEOUT (76/468)
 4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (56/468)

So if you total the two OR_CONN_CLOSED reasons (local and remote), you see that for some reason node-to-node TCP connections are fairly unreliable and prone to being closed (or are difficult to open/establish?). This is strange...

I should also note that stream failure reasons are only counted for the exit node, whereas circuit failure reasons are counted for two nodes - the last successful hop and the first unsuccessful one. So in effect, the STREAMDETACHED reason really is 2x more common than in those lists. On the other hand, it is mostly alleviated by making compute_socks_timeout() always return 15 (this was not done for this study, however).

Well, that's about all the detail I have time to go into right now. The complete results are up at http://fscked.org/proj/minihax/SnakesOnATor/speedrace.zip

As soon as I finish polishing up my README and change log, I will put the new release of SoaT itself up. It should be out sometime today.

--
Mike Perry
Mad Computer Scientist
fscked.org evil labs