#29772: Plot nearly worst-case bandwidth when downloading from [public|onion] server
-----------------------------+------------------------------
 Reporter:  karsten          |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_review
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  scalability      |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------
Changes (by karsten):

 * status:  needs_revision => needs_review

Comment:

 I should start this comment by saying that I'm not a statistician. If in
 doubt about anything I'm saying below, please go re-read this first
 sentence! :)

 I agree with you that the bandwidth plot works better than the latency
 plot. We're excluding very few bandwidth values as outliers compared to
 the number of latency values that we're throwing out.

 However, I don't think that a 4-day moving average would fix this. As you
 can see in the boxplots I posted here last week, medians and quartiles are
 relatively stable over the days, and those values are what we're using to
 decide whether another value is excluded as an outlier. After all, we have
 around 144 latency values per day and public/onion service. So, even if we
 considered 4 days (or even more) at a time, our threshold for excluding
 values as outliers would not change much. Also, implementing such a moving
 average wouldn't be trivial, with all the missing data that we have to
 handle.

 I think the issue is that the way we're excluding outliers is based on the
 assumption that our data is normally distributed. That assumption is
 obviously not 100% correct for bandwidth, because there's no negative
 bandwidth, but it's apparently close enough. It doesn't work very well for
 latencies, because there's some heavy-tailed distribution at work that we
 don't know, and not all the values we're excluding are really outliers.

 Another reason could be that we're looking at the smallest bandwidth
 values, which are at the ''head'' of the distribution, and at the largest
 latency values, which are the heavy ''tail''.

 However, my suggestion is to ignore all this and make the plots as you
 suggested earlier and as I plotted them last week. Two reasons:

  1. Boxplots are understood by many people, and if we say that we're
     plotting the five values from boxplots, people will understand what
     we're doing.
  2. We need a baseline, even if it's not 100% correct in a
     mathematical/statistical sense. If our way to exclude outliers is
     flawed, it will be flawed for past measurements as well as for future
     measurements, in the exact same way.

 Regarding your rocket analogy: it's certainly not just distance between
 relays that we're seeing here. We're also seeing overfull queues keeping
 received cells waiting for crypto and forwarding to the next relay. But
 this is fine: we want to know how long it takes to send something over the
 circuit and get back a response.

 So, my suggestion would be to move forward with what we have. What do you
 think?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29772#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
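
A minimal sketch of the boxplot-based outlier exclusion discussed in the comment above, for readers who want to see the arithmetic. It is an illustration under stated assumptions, not the code used on the Metrics website: the function name `boxplot_five`, the conventional 1.5 x IQR fence multiplier, and the "inclusive" quartile method are all assumptions that the ticket does not confirm.

```python
from statistics import quantiles


def boxplot_five(values, k=1.5):
    """Return (low whisker, Q1, median, Q3, high whisker) for one day's values.

    Assumption: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers,
    with k = 1.5 as in conventional boxplots. The ticket does not state
    which multiplier or quartile method is actually used.
    """
    # Quartiles of the day's measurements (three cut points for n=4).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Fences beyond which a value is excluded as an outlier.
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    kept = [v for v in values if low_fence <= v <= high_fence]

    # Whiskers are the most extreme values that survive the exclusion.
    return min(kept), q1, median, q3, max(kept)


# Hypothetical sample: eight ordinary values and one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(boxplot_five(sample))
```

Under these assumptions, the low whisker would be the "nearly worst-case" value for bandwidth and the high whisker the one for latency. Because the fences depend only on the quartiles, they barely move when the quartiles are stable from day to day, which is why pooling several days would not change the threshold much.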
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs