Re: Firefox Hello new data collection

Ian Bicking Mon, 04 Apr 2016 13:36:07 -0700

On Mon, Apr 4, 2016 at 10:44 AM, Gijs Kruitbosch <gijskruitbo...@gmail.com>
wrote:

> On 04/04/2016 11:01, Romain Testard wrote:
>
>> The privacy review bug is
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
>> More details added below.
>>
>
> See response at the bottom.
>
>> On Mon, Apr 4, 2016 at 11:23 AM, Gijs Kruitbosch <gijskruitbo...@gmail.com>
>> wrote:
>>
>>> On 04/04/2016 10:01, Romain Testard wrote:
>>>
>>>>      We would use a whitelist client-side to only collect domains that are
>>>>      part of the top 2000 domains (Alexa list of top domains). This prevents
>>>>      personal identification based on obscure domain usage.
>>>>
>>>>
>>> Mathematically, the combination of a set of (popular) domains shared could
>>> still be uniquely identifying, especially as, AIUI, you will get the counts
>>> of each domain and in what sequence they were visited / which ones were
>>> visited in which session. It all depends on the number of unique users and
>>> the number of domains they visit / share (not clear: see above). Because
>>> the total number of Hello users compared with the number of Firefox users
>>> is quite low, this still seems somewhat concerning to me. Have you tried to
>>> remedy this in any way?
>>>
>>>
>> We are aggregating domain names, and are not storing session histories.
>> These are submitted at the end of the session, so exact timestamps of any
>> visit are not included.
>>
>
> But both Firefox and Hello sessions are commonly relatively short (<1d)
> and numerous. That means lots of data points, which will likely be enough
> to uniquely identify people even without exact timestamps of their visits.
> (FWIW, from a technical perspective, there is no reason why the submission
> time implies ("so") that exact timestamps of visits are not included.)

Yes, if an attacker has access to cross-domain tracking for several sites
that a user visits, and that attacker can access the reporting in transit,
it may be possible to correlate, thus finding the rest of the whitelisted
history, and some associated Firefox Hello data.  But that's only in the
case of an attack.  The actual data sent to the logging pipeline is
immediately pulled out of a session list and submitted as individual items,
and all other data (e.g., IP address) is left out of this logging.
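
As a rough illustration of the flow described above (a sketch only, not the
actual Hello code): the client keeps only whitelisted domains, then breaks
the session list apart so that each log item carries a domain and nothing
else. The names WHITELIST, whitelisted_domain and session_to_items are
hypothetical, the whitelist here is a five-entry stand-in for the top-2000
list, and de-duplicating within a session is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the client-side whitelist of top domains.
WHITELIST = {
    "steampowered.com",
    "steamcommunity.com",
    "aa.com",
    "tripadvisor.com",
    "expedia.com",
}


def whitelisted_domain(url, whitelist=WHITELIST):
    """Return the whitelisted domain that the URL's host falls under, or None."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check the host itself, then each parent domain (a.b.c -> b.c),
    # stopping before the bare top-level label.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in whitelist:
            return candidate
    return None


def session_to_items(shared_urls, whitelist=WHITELIST):
    """Reduce one session's shared URLs to independent log items.

    Each item holds only a domain: no session id, timestamp, or IP address.
    De-duplication and sorting (assumptions of this sketch) drop the visit
    counts and visit order from what is emitted.
    """
    domains = set()
    for url in shared_urls:
        domain = whitelisted_domain(url, whitelist)
        if domain is not None:
            domains.add(domain)
    return [{"domain": d} for d in sorted(domains)]
```

Under these assumptions, a session that shared two steampowered.com pages,
one aa.com page, and one non-whitelisted page would yield two items, one per
whitelisted domain, with nothing tying them to each other once they are
logged separately.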

>
>> We looked into this approach originally although we found that we'd lose a
>> level of granularity that can have an importance. We may find that Hello
>> gets used a lot with a specific Website for a specific reason and using
>> client side categories would prevent us from learning this.
>>
>
> This was explicitly not in your original motivation, so you're moving the
> goalposts here. If the goal is about separate categories or separate sites
> then those are pretty distinct goals that require different approaches. If
> the real point is "we have no idea, so we figured we'd just get the data
> and then go from there", why not be upfront about it?

We are looking for clues about how people are using Hello, and using
domains as one way to understand this.  So yes, it is exploratory, and we
are looking for insight we have not yet received, rather than a more binary
signal such as whether people use Hello for shopping.

For example, two domains that are on the whitelist: steampowered.com and
steamcommunity.com – these would both typically be categorized as "gaming",
but they represent very different use cases (store vs. discussion).  Or
aa.com, tripadvisor.com, and expedia.com are all travel sites, but
represent different (but overlapping) use cases.

> But in that case, yeah, why not consider a survey or something less
> intrusive, like asking people explicitly what type of site they were using,
> or asking if Mozilla can use the domain in question ?

Asking people what site they were using seems challenging.  Do we suggest
types?  Will people acknowledge the full path of sites they used?  How much
do we have to annoy people with questions in order to get a large enough
sample?  Will it ever be a representative sample?  Even if we do work to
address these, how can we tell if we have done so if we don't have real
usage data to compare to?

As fully implemented, including the backend collection which further
aggregates the information, I believe we are not collecting private or
personally revealing information.  If we ask users to opt in to collection
I don't think we can accurately explain to users the limits of what we are
collecting (especially at that moment when we are interrupting what they
are doing), and I think it will make it appear that we are trying to
collect personal information that we are not.

>
>> Also Alexa
>> website categories are far from perfect which would add another level of
>> complexity to understand the collected data.
>>
>
> At no point did I say I expected you to use their categorization, whatever
> that is. Categorize as you see fit, rather than as Alexa does it.
>
> Conversely, if their categorization is questionable, then your scrubbing
> of the Adult category sounds like it might need auditing? Also, why not
> other categories like "Banking" or "Medical" (NB: no idea what
> categorization Alexa employs, but these seem like categories that ought to
> be scrubbed, too)?
>

For filtering out adult sites we used a well-maintained blacklist.  Alexa
categorization seems to be based on dmoz, which is very out of date –
browsing the categories feels like being sent back in time to a younger
internet.  It seemed reasonable to add items to the list based on that
categorization, but it's otherwise a very poor categorization.

>>> 6 months also seems incredibly long. You should be able to aggregate the
>>> data and keep that ("60% of users share on sites of type X") and throw away
>>> the raw data much sooner than that.
>>>
>> Yes agreed, we'll look into what's the most optimal amount of time required
>> to process the data and extract the useful information. I agree we should
>> try to make this shorter - we'll learn from being on Beta and will adjust
>> this accordingly.
>>
>
> Well, why not make it 1 week to start with, and make it longer if you
> don't get enough information from beta (with a rationale as to why that is
> the case) ?
>

The way tab sharing now works in Hello is a new experience, and we don't
expect that it has found its niche yet, or that people have decided how
they want to use Hello.  Capturing data for 1 week now is unlikely to show
us how people successfully use Hello, and seeing when the data settles
around particular use cases requires us to track it over time.

>
>>> Finally, I am surprised that you're sharing this 2 weeks before we're
>>> releasing Firefox 46. Hasn't this been tested and verified on Nightly
>>> and/or other channels? Why was no privacy update made at/before that
>>> time?
>>>
>>>
>> We are shipping Hello through Go Faster. The Go Faster process allows us to
>> uplift directly to Beta 46 directly since we're a system add-on
>> (development was done about 2 weeks ago).
>> Firefox Hello has its own privacy notice (details here
>> <https://www.mozilla.org/en-US/privacy/firefox-hello/>).
>>
>
> But shipping through go faster does not absolve you from adequately
> testing changes and getting feedback on them. Is the add-on not getting
> tested on nightly at all? Or at the same time as it goes to beta? When will
> it be used on release - when 46 ships as release, or earlier, or later?
>
> It also seems like you filed the privacy review after the functionality
> was implemented and is now shipping, which per
> https://wiki.mozilla.org/Privacy/Reviews seems like it is too late to
> incorporate meaningful feedback. I'm not on the privacy team, but that
> order looks wrong to me.
>

This is my fault – we began discussion of this collection many months ago
with people from data stewardship and legal through less formal channels,
and I didn't follow up with a formal privacy review bug.  I agree it is not
the correct order.

Note that while implemented, the functionality is currently pref'd off.

-- 
Ian Bicking | Engineering Manager | Hello | Mozilla
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
