Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 11.09.2013, at 02:06 , Gervase Markham  wrote:
> On 10/09/13 19:05, Chris Peterson wrote:
>> Our location service (and stumbler) also collects cell data, so we can
>> geolocate with Wi-Fi AP and/or cell data.
> 
> Sure. But in the rural areas I am thinking about, cells cover many
> square km. The wifi access point has a much smaller range, and therefore
> geolocates a person much more precisely.
> 
> So it would be awesome if I could say "I'm in this network cell, near
> this single access point - tell me where I am, please", and the service
> complied.

That's a good idea. I added a ticket for it at 
https://github.com/mozilla/ichnaea/issues/23

Hanno
___
dev-security mailing list
dev-security@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 10.09.2013, at 17:41 , Daniel Veditz  wrote:
> That can't be right, so your database must be more complex. If you're
> storing more than originally implied that may have some impact on a
> security assessment.

We apparently haven't been clear about the scope of the proposal. It only deals 
with a way to export and publicly share a subset of our data. Internally the 
service has a lot more data, but there's no way we can share that, thanks to 
the privacy aspects of it.

But at this point it seems clear to me that there is likely no way to share any 
meaningful subset or aggregated version of this data publicly at all.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 10.09.2013, at 20:23 , ianG  wrote:
> On 11/09/13 03:27 AM, Daniel Veditz wrote:
>> "private" means we can't even /look/ at it, rather than merely can't
>> store it?
> 
> The data regime might be simply put as this:  you can't store a number 
> suitable for tracking (or any derivative of it if that simply creates a new 
> tracking number) unless you have a compelling business reason, and you have 
> agreement.
> 
> The EU data protection regime makes a very strong distinction about any 
> private tracking information.  It also goes to another level if you share 
> that information with anyone.
> 
> The initial simple answer is, don't go there.  (I have no idea how google 
> finessed this issue, or even if they didn't.)

Most of this is very much a gray area. The data privacy officers / protection 
agencies have generally recognized that location services based on wifi 
networks are a very useful service, and that in order to run them in practice, 
you have to be able to collect wifi BSSIDs without getting the individual 
consent of every wifi AP operator.

But at the same time they consider the combination of a BSSID, timestamp and 
geolocation as personally identifiable information suitable for tracking, much 
like IP addresses or phone numbers.

So currently there's an unspoken agreement where industry players like Google, 
Microsoft and Apple have voluntarily put some restrictions into place. One of 
those is the introduction of the _nomap network name suffix, which was deemed 
an effective way for wifi operators to opt-out of the data gathering (see for 
example 
http://www.dutchdpa.nl/Pages/en_pb_20120405_google-complies-with-Dutch-DPA-requirements.aspx).
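
As a rough illustration of the opt-out, here is a minimal Python sketch of 
dropping observations whose network name carries the _nomap suffix. The 
function names and the shape of an observation (a dict with an "ssid" key) are 
assumptions made for this example, not the actual service code.

```python
NOMAP_SUFFIX = "_nomap"


def is_opted_out(ssid: str) -> bool:
    """Return True if the network name ends with the _nomap opt-out suffix."""
    return ssid.endswith(NOMAP_SUFFIX)


def drop_opted_out(observations):
    """Keep only observations whose SSID does not carry the opt-out suffix.

    `observations` is assumed to be a list of dicts with an "ssid" key.
    """
    return [obs for obs in observations if not is_opted_out(obs["ssid"])]
```

The check would have to run before anything is stored, since the point of the 
opt-out is that the record never enters the database.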

Another case is the "you need to know two nearby wifis to geolocate yourself" 
protection. This measure was suggested and first implemented by Google in 
response to media outcries and has now become an industry best practice. But 
to my knowledge it's not actually mandated by any official regulation.
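
A minimal Python sketch of what such a "two nearby wifis" rule could look like 
on the query side. The 1 km spread, the dict-based lookup table and the names 
locate and distance_km are assumptions for illustration only; the real 
thresholds used by Google or anyone else are not public as far as this thread 
states.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def locate(requested_keys, known_aps, max_spread_km=1.0):
    """Answer only if two requested APs are known and near each other.

    `known_aps` maps an AP key to its (lat, lon). Returns the midpoint of the
    first matching pair, or None if the request names fewer than two
    neighboring APs.
    """
    hits = [known_aps[key] for key in requested_keys if key in known_aps]
    for i, first in enumerate(hits):
        for second in hits[i + 1:]:
            if distance_km(first, second) <= max_spread_km:
                return ((first[0] + second[0]) / 2,
                        (first[1] + second[1]) / 2)
    return None
```

A request naming a single AP, or two APs that are far apart, gets no answer, 
which is what makes polling for one known access point unproductive.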

For now the whole space hasn't seen official tight regulation and the industry 
players are allowed to continue to operate. But it's a fine balance and any new 
media outcries or questionable behavior can threaten this balance.

So for us this means trying to adhere to existing industry best practices and 
generally following data privacy best practices like: only gather and store 
what you need, delete data as soon as you don't need it anymore, etc.

All of this applies to the hosted service use-case, where we keep the data 
internal and don't share or sell it for other purposes. Since it's all 
unofficial agreements, it's very hard to impossible to know exactly what we 
should do for the "we want to publicly share this data" use-case.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Hanno Schlichting
On 10.09.2013, at 03:39 , Gervase Markham  wrote:
> BTW, how does the service figure out the lat/long of an AP? Do we do
> anything at all with signal strengths? Could we?

This is a bit off-topic for the security discussion.

I suggest starting a new thread on dev-geolocation, if you want to know more 
about the technical details. The short answer is: Yes, but it's a lot more 
complicated than that :)

Cheers :)
Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Hanno Schlichting
On 10.09.2013, at 03:46 , Gervase Markham  wrote:
> On 10/09/13 10:48, ianG wrote:
>> If that is the case, why not flip it around.  Instead of trying to
>> interpolate the existing data that is broadcast out there, why not write
>> a protocol to broadcast the direct location from the wireless access point?
> 
> Because only a tiny, tiny fraction of devices would run it, and for most
> of those, the user wouldn't have correctly set the device's location
> anyway, and for some of them, they'd have set it and then moved.
> 
> This is a "boil the sea" approach to the problem.

In addition, the CDMA cell networks actually have support for reporting the 
base station's lat/lon as part of the protocol. But in practice these are 
almost never set, as cell operators value ease of deployment and uniform 
configuration more than providing this extra service.

As another anecdote, mobile operators cannot actually give you lists of all 
their cell towers and locations - we asked our partners. Thanks to a multitude 
of subsidiaries, subcontractors and partnerships, they often don't actually 
know how many cell towers they have and where they are. The same problem 
applies to the many wifi APs officially being operated by some large telco.

So even where this is possible, it's not actually a practically relevant 
approach.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Hanno Schlichting
On 10.09.2013, at 03:46 , Gervase Markham  wrote:
> On 10/09/13 00:25, R. Jason Cronk wrote:
>> What happens if I move? 
> 
> The raw database notes that you are now being detected in a new
> location. What happens then is up for debate. I'd argue that if your
> position was fixed for N months before, and it seems fixed again now, we
> should assume you have moved house and keep the point in the DB. APs
> which seem to move a lot, or move regularly, should be excluded.

As of this moment, we filter out any AP that has been detected in two different 
places (where different means more than ~1km away from each other). This is 
a very conservative approach and we'll relax that later.

The real strategy is going to involve thresholds of a certain number of reports 
over a certain time period. For example we could change the "canonical 
location" after an AP has been seen 10 times over the course of at least one 
week in a new location. Potentially we could flag it as "dirty" after seeing 
just two reports in a different location and remove it from the search results, 
while we wait for more reports and "confirmation". The exact numbers will 
depend on the reporting volume and spam problems.
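
To make the threshold idea concrete, here is a minimal Python sketch using the 
example numbers from this paragraph (dirty after two reports, moved after 10 
reports spread over at least a week, ~1km as "different"). The ApRecord class, 
the report function and the day-based timestamps are hypothetical and chosen 
for the example; as said above, the exact numbers are not decided.

```python
import math
from dataclasses import dataclass, field


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


@dataclass
class ApRecord:
    location: tuple                # canonical (lat, lon)
    dirty: bool = False            # True: hide from search results
    pending: list = field(default_factory=list)  # (day, (lat, lon)) elsewhere


def report(ap, day, position, move_km=1.0, dirty_after=2,
           confirm_reports=10, confirm_days=7.0):
    """Record one sighting and update the dirty flag / canonical location."""
    if distance_km(ap.location, position) <= move_km:
        return ap  # consistent with the canonical location, nothing to do
    ap.pending.append((day, position))
    if len(ap.pending) >= dirty_after:
        ap.dirty = True
    # Count only pending reports that agree with the newly reported position.
    days = [d for d, p in ap.pending if distance_km(p, position) <= move_km]
    if len(days) >= confirm_reports and max(days) - min(days) >= confirm_days:
        ap.location = position
        ap.pending.clear()
        ap.dirty = False
    return ap
```

With these defaults, ten reports on a single day mark the AP as dirty but do 
not move it, which is the spam-resistance property the time window is meant to 
give.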

Anecdotal reports suggest it takes Apple and Google about a day to realize your 
AP has actually moved.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Hanno Schlichting
On 09.09.2013, at 18:41 , Eric Rescorla  wrote:
> 1. How do I bootstrap? I turn on my device and want to get the coordinates of 
> the aps I see. That requires a lat long for neighbors. What now?

We build the database by having people use a stumbler application to send us 
observations. The stumbler app uses the mobile phone's GPS sensor to know its 
location. It reports to us all cell towers and wifi APs it sees in a certain 
location. We crunch some data, then we make a search API available over this 
data. Later someone else asks us what their location is, based on seeing cell 
towers or APs.
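
A deliberately naive Python sketch of that flow. The "crunch some data" step 
is not described in this thread, so the plain averaging below is only an 
assumption for illustration, as are the names build_index and search and the 
(key, lat, lon) tuple format.

```python
from collections import defaultdict
from statistics import fmean


def build_index(observations):
    """Average all stumbler observations per station.

    `observations` is an iterable of (key, lat, lon) tuples, where key
    identifies a cell tower or wifi AP and lat/lon come from the phone's GPS.
    A plain mean is an assumption here and ignores antimeridian wrap-around.
    """
    grouped = defaultdict(list)
    for key, lat, lon in observations:
        grouped[key].append((lat, lon))
    return {
        key: (fmean(p[0] for p in points), fmean(p[1] for p in points))
        for key, points in grouped.items()
    }


def search(index, seen_keys):
    """Estimate a position from the stations a client currently sees."""
    hits = [index[key] for key in seen_keys if key in index]
    if not hits:
        return None
    return (fmean(h[0] for h in hits), fmean(h[1] for h in hits))
```

A real service would presumably weight by signal strength and discard 
outliers, but the division of labour is the same: stumblers write, search 
clients only read aggregated positions.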

> 2. As asked previously will the db be published or query able?

It will definitely be queryable, but with a lot of restrictions to enhance 
privacy. We would like to publish it or as much of it as possible, but it's 
unclear how to do that, when a lot of the individual records are considered 
personally identifiable information.

> 3. What is the lat/long resolution? How is it measured?

The resolution differs, but is generally "as precise as it gets". GPS sensors 
often have 5 meter precision, and Google aims for 1 meter resolution for 
indoor locations based on Wifi access points. Internally we currently store 
things with centimeter precision and timestamps in milliseconds - so definitely 
all on the far side of "extremely detailed / private".
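
For a feel of what these resolutions mean in decimal degrees, a small Python 
calculation. It assumes a spherical Earth with a mean radius of 6371 km, and 
the coarsen helper is a hypothetical example of reducing precision before 
sharing, not something the service is stated to do.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius, spherical approximation
METERS_PER_DEGREE_LAT = 2 * math.pi * EARTH_RADIUS_M / 360


def lat_resolution_m(decimal_places):
    """Size in meters of the last stored decimal place of a latitude."""
    return METERS_PER_DEGREE_LAT * 10 ** (-decimal_places)


def coarsen(lat, lon, decimal_places):
    """Round a coordinate pair to fewer decimal places."""
    return (round(lat, decimal_places), round(lon, decimal_places))
```

Under this approximation seven decimal places correspond to roughly a 
centimeter and five decimal places to roughly a meter of latitude. Longitude 
steps shrink further with the cosine of the latitude.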

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Hanno Schlichting
On 09.09.2013, at 18:13 , Brian Smith  wrote:
> On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson  wrote:
>> Google's Location Service prevents people from tracking individual access
>> points by requiring requests to include at least 2-3 access points that
>> Google knows are near each other. This "proves" the requester is near the
>> access points.
> 
> I assume by "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair. Google tries to limit
> this type of abuse as much as practical while still
> providing a location service based on such crowdsourced data.

Yes :) Though there's one crucial difference between Google and us: We would 
like to make as much of this data public as possible, while Google will always 
just provide a service without access to the underlying data.

>> Unlike Google's Location Service, our server does not store MAC addresses or
>> SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
>> To query the location of an access point in the database, you must know both
>> its MAC address and current SSID.
> 
> MAC addresses are 48 bits. SSIDs are often guessable or predictable.
> Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
> not buying you much in terms of privacy, IMO. Basically, if you are
> really trying to use this as a privacy mechanism then you should store
> the MAC+SSID according to best practices for storing passwords. For
> example, use PBKDF2 with a large number of iterations. Regardless of
> whether you use SHA1, SHA2, PBKDF2, or something else, I will still
> call whatever function you use H(x). But, I am not sure that switching
> to PBKDF2 even buys you much improved privacy protection.

We were looking for two things in using sha1:

- Make it possible for the end-user to change their unique value (they cannot 
change the mac address, but they can change the ssid). This allows them to 
"invalidate" historical records in the database.
- Make it harder for spammers to "guess" actual unique keys and flood our 
service. Mac addresses have a vendor prefix, which makes it rather easy to 
generate lots of valid mac addresses. Taking the ssid into account makes it 
harder to generate valid keys. Unfortunately the ssid itself is considered 
private data in European countries, so you aren't allowed to store it without 
the user's consent. That's why Google and everyone else have stopped storing 
them and only use mac addresses now.

The sha1 scheme might be ineffective in doing this.
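
For reference, a minimal Python sketch of the SHA1(MAC+SSID) key and of the 
PBKDF2 variant suggested in the quoted text. The MAC normalization, the fixed 
salt, the iteration count and the function names are assumptions for this 
example. A fixed salt is used only because the key must be recomputable from 
MAC and SSID alone at lookup time.

```python
import hashlib


def normalize_mac(mac: str) -> str:
    """Lowercase the MAC and strip ':' or '-' separators (assumed format)."""
    return mac.lower().replace(":", "").replace("-", "")


def ap_key(mac: str, ssid: str) -> str:
    """Plain SHA1(MAC + SSID), hex encoded."""
    data = (normalize_mac(mac) + ssid).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def ap_key_slow(mac: str, ssid: str, salt: bytes = b"example-salt",
                iterations: int = 100_000) -> str:
    """PBKDF2-HMAC-SHA256 over MAC + SSID, to make guessing more expensive."""
    data = (normalize_mac(mac) + ssid).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", data, salt, iterations).hex()
```

Changing the SSID yields a different key, so old records can no longer be 
found under the new name. Either variant still depends on how guessable the 
MAC and SSID are, which is the concern raised above.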

>>     H1 = Hash(AP1.MAC + AP1.SSID)
>>     H2 = Hash(AP2.MAC + AP2.SSID)
>> 
>> Our private database's schema looks something like:
>> 
>>     Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
>>     Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...
>> 
>> Our published database would include two tables. The first table would map a
>> random row id to metadata about an anonymous access point:
>> 
>>     Random1 ==> AP1.latitude, AP1.longitude, ...
>>     Random2 ==> AP2.latitude, AP2.longitude, ...
>> 
>> The second table's primary key would be a hash of hashes. It would map a
>> hash of two neighboring access points' hash IDs to a row id of the first
>> table. Something like:
>> 
>>     Hash(H1 + H2) ==> Random1
>>     Hash(H2 + H1) ==> Random2
>> 
>> Someone querying the published database would need to know the MAC addresses
>> and current SSIDs of two neighboring access points to look up either's
>> location.
> 
> If  you know the MAC+SSID of person X's personal access point and the
> MAC+SSID of person Y's personal access point, then you can use this
> database to ask the question "are person X and person Y in the same
> location?" This seems bad. I see that you attempt to address this
> below.

On the service level, we can prevent this by adding extra thresholds, like 
filtering out "moving" APs and only reporting APs which have been seen in the 
same location a number of times over a minimum time period.

But this doesn't help us when publishing the underlying data.
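
To see why, here is a minimal Python sketch of the proposed two-table export 
from the quoted text. SHA-256 stands in for the generic Hash(), and the 
function names, the dict-based tables and the random row id length are 
assumptions made for the example.

```python
import hashlib
import secrets


def h(text: str) -> str:
    """Stand-in for the generic Hash() in the quoted schema."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def publish(private_db, neighbor_pairs):
    """Build the two published tables from the private one.

    `private_db` maps an AP hash to (lat, lon); `neighbor_pairs` lists
    (hash_a, hash_b) tuples of APs known to be near each other.
    """
    row_ids = {ap: secrets.token_hex(8) for ap in private_db}
    locations = {row_ids[ap]: pos for ap, pos in private_db.items()}
    pair_index = {}
    for a, b in neighbor_pairs:
        pair_index[h(a + b)] = row_ids[a]  # Hash(H1 + H2) ==> Random1
        pair_index[h(b + a)] = row_ids[b]  # Hash(H2 + H1) ==> Random2
    return locations, pair_index


def lookup(locations, pair_index, own_hash, neighbor_hash):
    """Return own_hash's location if the two APs are published as neighbors."""
    row_id = pair_index.get(h(own_hash + neighbor_hash))
    return locations.get(row_id) if row_id else None
```

Anyone holding the published tables can run lookup offline for any two APs 
whose MAC and SSID they know. A hit confirms that the two are neighbors, and 
no service-side threshold or rate limit can be applied after the fact.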

>> btw, should we use SHA-2 instead of SHA-1?
> 
> There is no reason to use SHA-1 when you have SHA-2 available.
> However, as I indicated above, it isn't clear it is a good idea to be
> using any plain hash function as H(x).
> 
>> Other layers of privacy protection include filtering out ad-hoc Wi-Fi
>> networks; MAC addresses with vendor prefixes from mobile device manufacturers
>> (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
>> "XXX's iPhone" and Google's "_nomap" opt-out); and APs reported in multiple
>> locations.
> 
> I think that these things are much more important than the protection
> offered by H(x). My concern is that if you store the data on the
> server as