Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-03-20 Thread Tom Lee
Arggh. Wrong link. Apologies to all and thanks to James McKinney. That's
what I get for having that many tabs open.

https://sunlightfoundation.com/blog/2014/03/20/a-little-math-could-make-identifiers-a-whole-lot-better/


On Thu, Mar 20, 2014 at 5:44 PM, James McKinney ja...@opennorth.ca wrote:

 Do you mean this post?


 https://sunlightfoundation.com/blog/2014/03/20/a-little-math-could-make-identifiers-a-whole-lot-better/


 On Mar 20, 2014, at 3:44 PM, Tom Lee t...@sunlightfoundation.com wrote:

 Thanks again to everyone who helped me think through how government's
 approach to disclosing identifiers could be improved through checksums,
 tokenization and related techniques -- it was extremely helpful.  The
 resulting post is here:


 https://sunlightfoundation.com/blog/2013/07/25/the-sunlight-foundations-comments-on-the-faas-proposed-open-data-policy/

 I'd be grateful for any feedback -- or, especially, corrections -- that
 might occur to you.


 On Thu, Feb 6, 2014 at 3:49 PM, Tom Lee t...@sunlightfoundation.com wrote:

 We've been kicking around an idea at Sunlight that aims to use
 cryptographic ideas to resolve some of the concerns around the publication
 of personally identifiable information in government disclosures. I could use
 some smart people to tell me what's dumb about it.

 We often face challenges related to disambiguating entities: is the John
 Smith who gave political donation A the same John Smith that gave political
 donation B? One obvious solution to this problem is to push to expand the
 information that's collected and disclosed -- if we had John's driver's
 license number (DLN), for instance, it'd be easy to disambiguate these
 records. But that could introduce privacy concerns for John. One approach
 to this problem (which I don't think government has tried) is employing a
 one-way hash.

 Obviously the input key space for DLNs and most other personal ID numbers
 is so small that reversing this with a dictionary attack would be trivial.
 You can add a salt, but only on a per-entity basis (not a per-record basis)
 if you want to preserve the capacity to disambiguate. That in turn calls
 for a lookup table in which the input keys are stored, which kind of
 defeats the point of using a hash (you might as well just assign random
 output IDs for each input ID). I would worry about government's ability to
 keep this lookup table secure, and I worry about the brittleness of such a
 system.
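
A minimal sketch of the dictionary attack described above. It assumes a hypothetical 6-digit numeric ID (real DLN formats vary) and plain unsalted SHA-256; hash_id and dictionary_attack are illustrative names, not part of any existing system.

```python
import hashlib

def hash_id(id_number: str) -> str:
    """Unsalted one-way hash of an ID number (the naive scheme)."""
    return hashlib.sha256(id_number.encode("ascii")).hexdigest()

def dictionary_attack(target_hash: str, digits: int = 6):
    """Recover the ID by hashing every value in the (small) key space."""
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if hash_id(candidate) == target_hash:
            return candidate
    return None

# What a disclosure would contain: only the hash of a made-up ID.
published = hash_id("482913")

# An attacker who knows the ID format simply enumerates all candidates.
recovered = dictionary_attack(published)
print(recovered)
```

With only a million candidates the loop finishes quickly on ordinary hardware, which is the sense in which reversal is trivial.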

 Alternately, you can use a single system-wide secret (or set of secrets)
 to transform inputs into reliable outputs. I think this is less brittle and
 maybe easier to preserve as a secret, but this system might be too easily
 reversible given the ability to observe its outputs and know the universe
 of possible inputs. I'm unsure of the cryptographic options that might be
 appropriate here.
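
One standard building block that fits the "single system-wide secret" description is a keyed hash such as HMAC. The sketch below is an illustration under assumptions, not a vetted design: SYSTEM_SECRET and pseudonymize are hypothetical names, and the IDs are made up. The same input always yields the same output, so disambiguation still works, but anyone who obtains the key can run a dictionary attack over the whole key space.

```python
import hashlib
import hmac
import secrets

# Hypothetical system-wide key; in practice it would have to be stored
# and protected by the publishing agency.
SYSTEM_SECRET = secrets.token_bytes(32)

def pseudonymize(id_number: str, key: bytes = SYSTEM_SECRET) -> str:
    """Keyed one-way transform: stable per input, different per key."""
    return hmac.new(key, id_number.encode("utf-8"), hashlib.sha256).hexdigest()

a = pseudonymize("D1234567")
b = pseudonymize("D1234567")   # same person, second record
c = pseudonymize("D7654321")   # different person

print(a == b, a == c)
```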

 For all I know, the lack of implementations using this kind of one-way
 transformation isn't about government sluggishness but rather about its
 feasibility. I'd be very curious to hear folks' ideas on this score, though.
  My general hunch is that something must be possible -- even a few bits'
 worth of disambiguating information would be hugely useful to us, and
 presumably you're not leaking important amounts of information by, say,
 sharing the last digit of a DLN. So there must be a spectrum of options.
 But as is probably apparent, I don't think I've got a handle on how to
 think about this problem rigorously.

 Tom



 --
 You received this message because you are subscribed to the Google Groups
 sunlightlabs group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to sunlightlabs+unsubscr...@googlegroups.com.
 To post to this group, send email to sunlightl...@googlegroups.com.
 Visit this group at http://groups.google.com/group/sunlightlabs.
 For more options, visit https://groups.google.com/d/optout.



-- 
Liberationtech is public & archives are searchable on Google. Violations of 
list guidelines will get you moderated: 
https://mailman.stanford.edu/mailman/listinfo/liberationtech. Unsubscribe, 
change to digest, or change password by emailing moderator at 
compa...@stanford.edu.

Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-07 Thread Michael Rogers
On 06/02/14 20:56, Margie Roswell wrote:
 For all I know, the lack of implementations using this kind of 
 one-way transformation isn't about government sluggishness but 
 rather about its feasibility. I'd be very curious to hear folks' 
 ideas on this score, though.  My general hunch is that something 
 must be possible -- even a few bits' worth of disambiguating 
 information would be hugely useful to us, and presumably you're
 not leaking important amounts of information by, say, sharing the
 last digit of a DLN. So there must be a spectrum of options. But as
 is probably apparent, I don't think I've got a handle on how to
 think about this problem rigorously.

Even if you had a perfect method of anonymising the individual
records, they might be reidentifiable by examining the whole dataset:

http://33bits.org/2010/06/21/myths-and-fallacies-of-personally-identifiable-information/
http://randomwalker.info/social-networks/
http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

At the level of individual records, you could use modular
exponentiation to anonymise the data. You pick a prime modulus p, and
each organisation that's going to publish anonymised data picks a
random secret value. Organisation X with secret value x anonymises a
piece of data d by publishing d_x = d^x mod p, and organisation Y with
secret value y anonymises the same data by publishing d_y = d^y mod p.

If X and Y want to know which records they have in common, X takes the
data published by Y and calculates d_x' = d_y^x mod p = d^(yx) mod p,
and Y takes the data published by X and calculates d_y' = d_x^y mod p
= d^(xy) mod p. For each record in common, d_x' = d_y', but neither
can de-anonymise records published by the other that they don't have
in common.
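
A minimal sketch of the two-organisation exchange described above. Several things here are assumptions added for illustration: the modulus is a toy-sized Mersenne prime chosen only because it is easy to write down, records are first hashed into the group with SHA-256, and secrets are picked coprime to p - 1 so that exponentiation is a one-to-one map. None of this is a recommendation of parameters.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # toy prime modulus, for illustration only

def to_group(data: str) -> int:
    """Map a record to an integer in [2, P-2] by hashing (an added assumption)."""
    h = int.from_bytes(hashlib.sha256(data.encode("utf-8")).digest(), "big")
    return h % (P - 3) + 2

def random_secret() -> int:
    """Random exponent coprime to P-1, so d -> d^s mod P is a bijection."""
    while True:
        s = secrets.randbelow(P - 3) + 2
        if math.gcd(s, P - 1) == 1:
            return s

x = random_secret()  # organisation X's secret
y = random_secret()  # organisation Y's secret

records_x = ["DLN-111", "DLN-222", "DLN-333"]  # made-up IDs
records_y = ["DLN-222", "DLN-333", "DLN-444"]

# Each organisation publishes d^secret mod P for its own records.
published_x = [pow(to_group(d), x, P) for d in records_x]
published_y = [pow(to_group(d), y, P) for d in records_y]

# X re-exponentiates Y's values with x; Y re-exponentiates X's values with y.
from_y_via_x = {pow(v, x, P) for v in published_y}  # d^(yx) mod P
from_x_via_y = {pow(v, y, P) for v in published_x}  # d^(xy) mod P

# Values present in both sets correspond to records held by both.
common = from_y_via_x & from_x_via_y
print(len(common))
```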

This can be extended to more than two organisations: pass the records
round in a circle, and when they get back to you they've been
exponentiated by all the secret values (order doesn't matter). Now you
can see which records you have in common with all the other organisations.
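
The "order doesn't matter" property can be checked directly. The sketch below uses three hypothetical organisations X, Y and Z and the same illustrative assumptions as before (toy prime, records hashed into the group, secrets coprime to p - 1); it passes one record around the circle in every possible order and collects the results.

```python
import hashlib
import itertools
import math
import secrets

P = 2**127 - 1  # toy prime modulus, for illustration only

def to_group(data: str) -> int:
    """Map a record to an integer in [2, P-2] by hashing (an added assumption)."""
    h = int.from_bytes(hashlib.sha256(data.encode("utf-8")).digest(), "big")
    return h % (P - 3) + 2

def random_secret() -> int:
    """Random exponent coprime to P-1, so exponentiation is a bijection."""
    while True:
        s = secrets.randbelow(P - 3) + 2
        if math.gcd(s, P - 1) == 1:
            return s

# One secret per organisation (hypothetical names).
org_secrets = {name: random_secret() for name in "XYZ"}

def pass_around(value: int, order) -> int:
    """Exponentiate by each organisation's secret in the given order."""
    for name in order:
        value = pow(value, org_secrets[name], P)
    return value

d = to_group("DLN-222")  # made-up ID

# All six orderings of X, Y, Z give the same final value.
results = {pass_around(d, order) for order in itertools.permutations("XYZ")}

# A different record ends up at a different value.
other = pass_around(to_group("DLN-444"), "XYZ")
print(len(results), other in results)
```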

(Maybe. IANAC.)

Cheers,
Michael



Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-06 Thread Margie Roswell
 PII = personally identifiable information

(Anyone who can address the question probably already knows that... but I
was curious, and figured I'd spare others the look-up.)



--
http://FarmBillPrimer.org
http://www.BaltimoreUrbanAg.org (Please send events; This site is hungry.)
http://www.ExcellentNutrition.org
http://www.packtpub.com/drupal-5-views-recipes/book



Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-06 Thread Chris Dary
Just one thought to throw out: Something that sprang to mind is the idea of
a check digit or simplified hash that would be redundant enough to collide
very often if you were trying to reverse, but would still provide enough
disambiguation that you'd be able to appropriately determine who you're
dealing with.

You could probably use something similar to the Luhn algorithm for that,
although I'm not sure how uniform that is:
http://en.wikipedia.org/wiki/Luhn_algorithm - also, that only ends up with
a single check digit, which is probably too small for good disambiguation.
The approach in general might still be helpful though.
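
For reference, a minimal sketch of a Luhn check digit, plus a quick look at the uniformity question. The function name and the 4-digit test range are illustrative choices; the sample payload 7992739871 is the one used in the Wikipedia article linked above.

```python
from collections import Counter
from itertools import product

def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every other digit starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if i % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return (10 - total % 10) % 10

print(luhn_check_digit("7992739871"))

# How evenly are check digits spread over every 4-digit payload?
counts = Counter(
    luhn_check_digit("".join(p)) for p in product("0123456789", repeat=4)
)
print(sorted(counts.items()))
```

Over all fixed-length digit strings the ten check digits come out equally often, so the real limitation is the one already noted: ten buckets is very little disambiguation.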

-Chris



Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-06 Thread Chris Dary
It's been a while since I dug into it, but something like an 8-bit
CRC (http://en.wikipedia.org/wiki/Cyclic_redundancy_check) would
probably provide enough disambiguation but would collide often enough
to not be much of a concern for reversing - 256 different values.
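
A minimal sketch of an 8-bit CRC, to make the "256 different values" point concrete. The polynomial 0x07 with a zero initial value is one common CRC-8 variant and is an assumption here, as is the made-up 4-digit ID space used to show how many inputs share each output.

```python
from collections import Counter

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

print(hex(crc8(b"123456789")))

# Bucket every ID in a hypothetical 4-digit ID space by its CRC-8 value.
buckets = Counter(crc8(str(n).zfill(4).encode("ascii")) for n in range(10_000))
print(len(buckets), min(buckets.values()), max(buckets.values()))
```

Each published value is consistent with many possible inputs, which is what makes reversal ambiguous; the same property means two different people will share a value about once in 256 comparisons.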



Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-06 Thread Josh Tauberer

On 02/06/2014 03:49 PM, Tom Lee wrote:
Obviously the input key space for DLNs and most other personal ID 
numbers is so small that reversing this with a dictionary attack would 
be trivial. You can add a salt, but only on a per-entity basis (not a 
per-record basis) if you want to preserve the capacity to 
disambiguate. That in turn calls for a lookup table in which the 
input keys are stored, which kind of defeats the point of using a hash 
(you might as well just assign random output IDs for each input ID). I 
would worry about government's ability to keep this lookup table 
secure, and I worry about the brittleness of such a system.



And yet a lookup table mapping inputs to random outputs might be the 
best worst option.


Even if the right cryptographic method (hash, encryption, etc.) can be 
found and is mathematically sound, I'd have /very/ low confidence that 
it would be implemented correctly. Maybe one office does it right, the 
next office says "hey, that's a great idea" but forgets that hashing a 
four-digit PIN doesn't provide any obscurity, etc. (That's not a jab at 
government. Crypto is so hard.)


I'd ask, for a particular case, what data does the data source already 
have? If they /already/ have DLNs in their database, there's no added 
privacy concern in creating a random mapping to unique identifiers for 
public consumption. (Besides the mosaic effect, but that aside.) 
Assuming the data source can make the distinction at all internally, 
they must have /something/ already in their database.
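
A minimal sketch of the random-mapping idea, assuming the data source already holds the private IDs. PseudonymTable and the sample IDs are hypothetical; the table itself is the secret that has to be kept and reused if identifiers are to stay stable across publications.

```python
import secrets

class PseudonymTable:
    """Maps each private ID to a random public ID, assigned on first use."""

    def __init__(self):
        self._by_input = {}   # private ID -> public ID (must stay secret)
        self._used = set()    # public IDs already handed out

    def public_id(self, private_id: str) -> str:
        if private_id not in self._by_input:
            token = secrets.token_hex(8)
            while token in self._used:      # re-draw on the rare collision
                token = secrets.token_hex(8)
            self._used.add(token)
            self._by_input[private_id] = token
        return self._by_input[private_id]

table = PseudonymTable()
first = table.public_id("D1234567")
again = table.public_id("D1234567")   # same donor, later record
other = table.public_id("D7654321")   # different donor
print(first == again, first == other)
```

Because the public IDs are random, nothing about the private ID can be recovered from them without the table.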


HTH,

- Josh Tauberer (@JoshData)

http://razor.occams.info



Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

2014-02-06 Thread James McKinney
I don't know how these government databases are maintained in the US, but in 
Canada it's not infrequent for such databases to be more or less write-only - 
the government fills up a database with names, donation amounts, postcodes, 
etc. and then publishes it somewhere for others to consume. In a subsequent 
year, it fills up a fresh database - maybe it maintains the same database 
schema, but in every other respect it's as if the old database didn't exist.

If we go with the solution of generating a new ID for each donor, there will 
have to be better coordination within and between agencies to store this 
information centrally in order for them to share IDs across time and location. 
That's a security risk.

Can we guarantee that each agency will have the same private information to 
create identifiers from? If so, as Chris mentions, a CRC can be used to 
disambiguate, i.e. match donors on name, etc. and resolve collisions by looking 
at the CRC.
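
A minimal sketch of that matching step, assuming every agency computes the check value from the same private ID. The records, field names and CRC-8 variant (polynomial 0x07, initial value 0) are all made up for illustration.

```python
from collections import defaultdict

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

# What the agency holds internally (hypothetical data).
private_records = [
    {"name": "John Smith", "dln": "S4829131", "amount": 250},
    {"name": "John Smith", "dln": "S4829131", "amount": 100},
    {"name": "John Smith", "dln": "S4829137", "amount": 500},
]

# What gets published: the name plus a short check value, not the DLN.
published = [
    {"name": r["name"], "check": crc8(r["dln"].encode("ascii")), "amount": r["amount"]}
    for r in private_records
]

# A consumer matches on name first, then splits name collisions by check value.
donors = defaultdict(list)
for rec in published:
    donors[(rec["name"], rec["check"])].append(rec["amount"])

print(len(donors))
```

Two different donors with the same name will still be merged whenever their check values happen to coincide, so this reduces ambiguity rather than removing it.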
