Re: [tor-dev] 24 hours worth of BridgeDB usage metrics

2019-07-30 Thread Philipp Winter
On Tue, Jul 30, 2019 at 05:42:11PM +0200, Karsten Loesing wrote:
> You say that you're planning to add aggregate statistics like numbers by
> distributor without drilling down to transports or countries. Keep in
> mind that this is going to reduce the noise that you added when rounding
> up to multiples of 10. For example, knowing that the total by country is
> closer to $entries_in_that_country * 1 or $entries_in_that_country * 10
> will tell you something about the average noise added per entry. It
> would be more privacy-preserving (and also less accurate) to keep all
> the noise in the statistics and do the aggregation in a separate step.

That's a great point.  I was originally concerned about the decrease in
accuracy but, after running the numbers, it seems tolerable.  Let's have
a look at the lower and upper bound of the total number of HTTPS
requests.  Summing up all bins (and ignoring bot requests) gives us the
upper bound:

  grep https bridgedb-metrics.log | grep -v zz | cut -d ' ' -f 3 | paste -sd+ | bc
  3850

To determine the lower bound, we first calculate the number of bins:

  grep https bridgedb-metrics.log | grep -c -v zz
  235

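Each bin was rounded up to the next multiple of 10, so it overstates its
true count by at most 9.  We therefore multiply the number of bins by 9
and subtract the result from the upper bound, which gives us a lower
bound of 3,850 - 235 * 9 = 1,735.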
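
The two steps above can also be sketched in Python.  This is a minimal
sketch, not part of BridgeDB: it assumes each metrics line has the three
space-separated fields seen in this thread ("bridgedb-metric-count", a
dot-separated key, and a count rounded up to a multiple of 10), that the
distributor is the first key component, and that bot requests carry a
"zz" key component, which approximates the "grep -v zz" above.  The
function name request_bounds and most of the sample lines are made up
for illustration.

```python
def request_bounds(lines, distributor, bin_size=10):
    """Return (lower, upper) bounds on the true request count for one distributor."""
    upper = 0
    bins = 0
    for line in lines:
        fields = line.split()
        # Skip anything that is not a three-field metrics line.
        if len(fields) != 3 or fields[0] != "bridgedb-metric-count":
            continue
        key_parts = fields[1].split(".")
        # Keep only the requested distributor and drop bot ("zz") entries.
        if key_parts[0] != distributor or "zz" in key_parts:
            continue
        upper += int(fields[2])
        bins += 1
    # Each bin was rounded up, so it overstates its true count
    # by at most bin_size - 1.
    lower = upper - bins * (bin_size - 1)
    return lower, upper


# Hypothetical sample lines, for illustration only.
sample = [
    "bridgedb-metric-count https.obfs4.de.success.none 20",
    "bridgedb-metric-count https.obfs4.us.success.none 10",
    "bridgedb-metric-count https.vanilla.zz.fail.none 10",
    "bridgedb-metric-count email.obs4.gmail.fail.none 10",
]
print(request_bounds(sample, "https"))  # (12, 30)
```

Here two HTTPS bins survive the filter, so the upper bound is 30 and the
lower bound is 30 - 2 * 9 = 12.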

Applying this method to all three distribution mechanisms results in the
following table:

          Lower bound  Upper bound
          -----------  -----------
  Moat          4,576        4,630
  HTTPS         1,735        3,850
  Email           303          420

Despite the inaccuracy caused by the binning, we can be certain that
moat is more popular than HTTPS (moat's lower bound > HTTPS's upper
bound) and email is an order of magnitude less popular than both HTTPS
and moat.  HTTPS is the most inaccurate because of the large number of
bins.
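
The comparison can be checked mechanically.  The following is a small
sketch that uses only the bounds from the table; the helper names
certainly_more_popular and relative_uncertainty are invented here.  One
distributor is certainly more popular than another only if its lower
bound exceeds the other's upper bound.

```python
# (lower, upper) bounds copied from the table above.
BOUNDS = {
    "moat": (4576, 4630),
    "https": (1735, 3850),
    "email": (303, 420),
}


def certainly_more_popular(a, b, bounds=BOUNDS):
    """True if a's lower bound exceeds b's upper bound."""
    return bounds[a][0] > bounds[b][1]


def relative_uncertainty(name, bounds=BOUNDS):
    """Width of the interval as a fraction of its upper bound."""
    lower, upper = bounds[name]
    return (upper - lower) / upper


print(certainly_more_popular("moat", "https"))   # True
print(certainly_more_popular("https", "email"))  # True
print(max(BOUNDS, key=relative_uncertainty))     # https
```

The last line picks the distributor whose interval is widest relative to
its upper bound, which is HTTPS, matching the observation about its
large number of bins.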

> What is obs4 in bridgedb-metric-count email.obs4.gmail.fail.none 10 (as
> opposed to obfs4)?

That's a typo that a user made when requesting the transport.  I had not
yet changed the code to only consider transports that are supported by
BridgeDB.  All unsupported transport types should result in a log
message and not affect the metrics.

Interestingly, there's another metrics line that shows that there were
1-10 successful requests for the invalid obs4 transport.  When
requesting an invalid transport, BridgeDB tells you that there are
currently no bridges available.  Instead, it should tell you that the
requested transport does not exist.
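
The intended behaviour could look roughly like the sketch below.  It is
not BridgeDB's actual code: SUPPORTED_TRANSPORTS is a placeholder set,
record_request is a made-up helper, and the key layout simply mirrors
the metrics line quoted above.

```python
import logging
from collections import Counter

log = logging.getLogger("bridgedb-metrics-sketch")

# Assumption: a placeholder set, not BridgeDB's real list of transports.
SUPPORTED_TRANSPORTS = {"vanilla", "obfs4"}


def record_request(metrics, distributor, transport, origin, success):
    """Count a request only if its transport is supported; otherwise log it."""
    if transport not in SUPPORTED_TRANSPORTS:
        log.warning("Ignoring request for unsupported transport %r", transport)
        return False
    status = "success" if success else "fail"
    metrics[f"{distributor}.{transport}.{origin}.{status}.none"] += 1
    return True


metrics = Counter()
record_request(metrics, "email", "obs4", "gmail", False)  # typo: logged, not counted
record_request(metrics, "email", "obfs4", "gmail", True)  # counted
print(dict(metrics))  # {'email.obfs4.gmail.success.none': 1}
```

With a check like this, the misspelled obs4 request produces a log
message and leaves the metrics untouched.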

> Would it make sense to add a line like bridge-stats-version to include a
> version number of some sort, just in case you want to change the format
> at a later time?

Yes, that's a good idea.  I will do that.

Thanks,
Philipp
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] 24 hours worth of BridgeDB usage metrics

2019-07-30 Thread Philipp Winter
On Mon, Jul 29, 2019 at 09:22:52PM -0700, Rick Huebner wrote:
> Could some metrics be added to summarize how the bridges and queries
> are distributed across the hashrings?

Thanks for this suggestion.  I agree that it would be helpful and I'll
look into incorporating it into the metrics.

Cheers,
Philipp


Re: [tor-dev] 24 hours worth of BridgeDB usage metrics

2019-07-30 Thread Karsten Loesing
On 2019-07-30 00:01, Philipp Winter wrote:
> [...]
> 
> After a cursory look at the numbers, I would like to aggregate the data,
> to make it easier to compare distributors, transports, and countries.
> For example: how do moat, email, and HTTPS rank in popularity?  I'll
> improve the patch to keep track of these numbers in separate metrics.
> 
> Any thoughts or suggestions?

Looks like a great start!

I have two questions and one suggestion, based on a quick read:

You say that you're planning to add aggregate statistics like numbers by
distributor without drilling down to transports or countries. Keep in
mind that this is going to reduce the noise that you added when rounding
up to multiples of 10. For example, knowing that the total by country is
closer to $entries_in_that_country * 1 or $entries_in_that_country * 10
will tell you something about the average noise added per entry. It
would be more privacy-preserving (and also less accurate) to keep all
the noise in the statistics and do the aggregation in a separate step.
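
To make the concern concrete, here is a small sketch with made-up
numbers.  The true counts are hypothetical, and round_up is assumed to
match the "round up to a multiple of 10" binning described in this
thread.  If both the per-entry bins and a separately binned total are
published, the difference between them puts a lower bound on the noise
that was added per entry.

```python
def round_up(count, bin_size=10):
    """Round a count up to the next multiple of bin_size (0 stays 0)."""
    return -(-count // bin_size) * bin_size


# Hypothetical true per-entry counts for one country.
true_counts = [1, 3, 2]

# Noise kept in every entry, aggregation done afterwards.
sum_of_bins = sum(round_up(c) for c in true_counts)
# Aggregation done first, noise added once to the total.
binned_total = round_up(sum(true_counts))

# The true total is at most binned_total, so the total noise in the
# per-entry bins is at least sum_of_bins - binned_total.
min_avg_noise = (sum_of_bins - binned_total) / len(true_counts)
print(sum_of_bins, binned_total, min_avg_noise)
```

In this example the bins sum to 30 while the binned total is 10, so an
observer learns that each entry carries at least 20/3 of noise on
average, which narrows down the true per-entry counts.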

What is obs4 in bridgedb-metric-count email.obs4.gmail.fail.none 10 (as
opposed to obfs4)?

Would it make sense to add a line like bridge-stats-version to include a
version number of some sort, just in case you want to change the format
at a later time?

> Cheers,
> Philipp

All the best,
Karsten





Re: [tor-dev] 24 hours worth of BridgeDB usage metrics

2019-07-29 Thread Rick Huebner
That's awesome, and will shine a lot of light on user demand patterns 
and how well things are actually working through various channels. Could 
some metrics be added to summarize how the bridges and queries are 
distributed across the hashrings? As in, at the end of the day, roughly 
how many bridges are in each hashring, and how many requests were served 
from each hashring? I've seen behavior in the past that made me wonder 
if the internal HMAC/modulo partitioning method is actually uniformly 
distributed or not, like perhaps there are some hashrings with most of 
the bridges and others with too few (within a given distribution 
method), or maybe there are too many requests being pulled from certain 
hashrings, leaving others under-utilized. This might not need to be a 
permanent stat dump, but seeing it for at least a few days would help a 
lot to confirm that the db's guts are working as intended.
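
A check along those lines might be sketched as follows.  This is only an
illustration of HMAC-then-modulo partitioning in general, not BridgeDB's
actual code: the key, the bridge identifiers, the choice of SHA-1, and
the function names are all assumptions.

```python
import hashlib
import hmac
from collections import Counter


def assign_ring(key, bridge_id, n_rings):
    """Map a bridge identifier to a ring index via HMAC, then modulo."""
    digest = hmac.new(key, bridge_id.encode(), hashlib.sha1).digest()
    return int.from_bytes(digest, "big") % n_rings


def ring_histogram(key, bridge_ids, n_rings):
    """Count how many bridges land in each ring."""
    counts = Counter(assign_ring(key, b, n_rings) for b in bridge_ids)
    return [counts[i] for i in range(n_rings)]


# Hypothetical key and bridge identifiers, for illustration only.
bridges = [f"bridge-{i}" for i in range(1000)]
print(ring_histogram(b"example-key", bridges, 4))
```

Dumping such a histogram once a day, together with per-ring request
counts, would show whether some rings end up with far more bridges or
requests than others.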

