Re: [liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

James McKinney Thu, 06 Feb 2014 19:21:29 -0800

I don't know how these government databases are maintained in the US, but in 
Canada it's not infrequent for such databases to be more-or-less "write only" - 
the government fills up a database with names, donation amounts, postcodes, 
etc. and then publishes it somewhere for others to consume. In a subsequent 
year, it fills up a fresh database - maybe it maintains the same database 
schema, but in every other respect it's as if the old database didn't exist.


If we go with the solution of generating a new ID for each donor, there will 
have to be better coordination within and between agencies to store this 
information centrally in order for them to share IDs across time and location. 
That's a security risk.

Can we guarantee that each agency will have the same private information to 
create identifiers from? If so, as Chris mentions, a CRC can be used to 
disambiguate, i.e. match donors on name, etc. and resolve collisions by looking 
at the CRC.
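
As a rough sketch of that matching step (not anything an agency actually runs): assume each agency holds the same private identifier for a donor and publishes only the name plus an 8-bit CRC of that identifier. The record layout, the identifiers, and the helper names below are all hypothetical; the CRC itself is the common CRC-8 with polynomial 0x07.

```python
from collections import defaultdict


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def group_donors(records):
    """Group donations by (normalized name, CRC-8 of the private identifier)."""
    groups = defaultdict(list)
    for name, private_id, amount in records:
        key = (name.strip().lower(), crc8(private_id.encode("utf-8")))
        groups[key].append(amount)
    return dict(groups)


# Hypothetical records: (name, private identifier, donation amount).
records = [
    ("John Smith", "D1234567", 100),
    ("John Smith", "D7654321", 250),
    ("john smith ", "D1234567", 50),
]
groups = group_donors(records)
```

Two records with the same name and the same identifier always land in the same group. Two different John Smiths still collide about once in 256 pairs, so the CRC narrows the match rather than proving it.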

On 2014-02-06, at 4:19 PM, Chris Dary wrote:

> It's been a while since I dug into it, but something like an 8-bit CRC would 
> probably provide enough disambiguation but would collide often enough to not 
> be much of a concern for reversing - 256 different values.
> 
> 
> On Thu, Feb 6, 2014 at 4:10 PM, Chris Dary <umb...@gmail.com> wrote:
> Just one thought to throw out: Something that sprang to mind is the idea of a 
> check digit or simplified hash that would be redundant enough to collide very 
> often if you were trying to reverse, but would still provide enough 
> disambiguation that you'd be able to appropriately determine who you're 
> dealing with.
> 
> You could probably use something similar to the Luhn algorithm for that, 
> although I'm not sure how uniform that is: 
> http://en.wikipedia.org/wiki/Luhn_algorithm - also, that only ends up with a 
> single check digit, which is probably too small for good disambiguation. The 
> approach in general might still be helpful though.
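
For reference, a minimal sketch of the Luhn check digit, assuming the identifier consists only of decimal digits (many real DLNs contain letters, so this is illustrative at best). The function name is my own.

```python
def luhn_check_digit(number: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return (10 - total % 10) % 10
```

The result has only 10 possible values, a little over 3 bits, which is the "too small for good disambiguation" problem mentioned above.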
> 
> -Chris
> 
> 
> On Thu, Feb 6, 2014 at 3:49 PM, Tom Lee <t...@sunlightfoundation.com> wrote:
> We've been kicking around an idea at Sunlight that aims to use cryptographic 
> ideas to resolve some of the concerns around the publication of personally 
> identifiable information in government disclosures. I could use some smart 
> people to tell me what's dumb about it.
> 
> We often face challenges related to disambiguating entities: is the John 
> Smith who gave political donation A the same John Smith that gave political 
> donation B? One obvious solution to this problem is to push to expand the 
> information that's collected and disclosed -- if we had John's driver's 
> license number (DLN), for instance, it'd be easy to disambiguate these 
> records. But that could introduce privacy concerns for John. One approach to 
> this problem (which I don't think government has tried) is employing a 
> one-way hash. 
> 
> Obviously the input key space for DLNs and most other personal ID numbers is 
> so small that reversing this with a dictionary attack would be trivial. You 
> can add a salt, but only on a per-entity basis (not a per-record basis) if 
> you want to preserve the capacity to disambiguate. That in turn calls for a 
> lookup table in which the input keys are stored, which kind of defeats the 
> point of using a hash (you might as well just assign random output IDs for 
> each input ID). I would worry about government's ability to keep this lookup 
> table secure, and I worry about the brittleness of such a system.
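
To illustrate why the small key space matters, here is a minimal sketch of the dictionary attack against an unsalted hash. It assumes a hypothetical DLN format of a fixed number of decimal digits and uses SHA-256 as a stand-in for whatever hash would be chosen; neither detail comes from the thread.

```python
import hashlib


def hash_dln(dln: str) -> str:
    """Unsalted one-way hash of a licence number (SHA-256 as a stand-in)."""
    return hashlib.sha256(dln.encode("utf-8")).hexdigest()


def reverse_by_enumeration(target_hash: str, digits: int):
    """Try every possible fixed-width numeric DLN until one hashes to the target."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        if hash_dln(candidate) == target_hash:
            return candidate
    return None


# Toy run with a 4-digit space so it finishes instantly.
recovered = reverse_by_enumeration(hash_dln("0427"), digits=4)
```

The loop grows only by a factor of ten per extra digit, and the same enumeration can be precomputed once and reused against every published hash.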
> 
> Alternately, you can use a single system-wide secret (or set of secrets) to 
> transform inputs into reliable outputs. I think this is less brittle and 
> maybe easier to preserve as a secret, but this system might be too easily 
> reversible given the ability to observe its outputs and know the universe of 
> possible inputs. I'm unsure of the cryptographic options that might be 
> appropriate here.
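
One candidate for the system-wide-secret approach is a keyed hash such as HMAC, optionally truncated to a few bits. The sketch below is an assumption on my part, not something proposed in the thread; the function name, the sample secret, and the 16-bit default are all hypothetical.

```python
import hashlib
import hmac


def pseudonym(secret: bytes, dln: str, bits: int = 16) -> int:
    """Keyed, truncated identifier: the top `bits` bits of HMAC-SHA256(secret, dln)."""
    digest = hmac.new(secret, dln.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits)


# Hypothetical secret; in practice it would have to be generated randomly and kept private.
secret = b"example-system-wide-secret"
a = pseudonym(secret, "D1234567")
b = pseudonym(secret, "D1234567")
```

The same secret and the same input always give the same output, so records can be matched across datasets. HMAC is designed so that observing outputs should not reveal the key, but anyone who obtains the secret can enumerate the inputs exactly as with an unsalted hash, so the whole scheme stands or falls on protecting that one value.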
> 
> For all I know, the lack of implementations using this kind of one-way 
> transformation isn't about government sluggishness but rather about its 
> feasibility. I'd be very curious to hear folks' ideas on this score, though. 
> My general hunch is that something must be possible -- even a few bits' worth 
> of disambiguating information would be hugely useful to us, and presumably 
> you're not leaking important amounts of information by, say, sharing the last 
> digit of a DLN. So there must be a spectrum of options. But as is probably 
> apparent, I don't think I've got a handle on how to think about this problem 
> rigorously.
> 
> Tom
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "sunlightlabs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to sunlightlabs+unsubscr...@googlegroups.com.
> To post to this group, send email to sunlightl...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sunlightlabs.
> For more options, visit https://groups.google.com/groups/opt_out.
> 

-- 
Liberationtech is public & archives are searchable on Google. Violations of 
list guidelines will get you moderated: 
https://mailman.stanford.edu/mailman/listinfo/liberationtech. Unsubscribe, 
change to digest, or change password by emailing moderator at 
compa...@stanford.edu.
