Re: de-identification
Steven M. Bellovin writes:
 | | Ladies and Gentlemen,
 | |
 | | I'd like to come up to speed on the state of the
 | | art in de-identification (~=anonymization) of data
 | | especially monitoring data (firewall/hids logs, say).
 | | A little googling suggests that this is an academic
 | | subspeciality as well as a word with many interpretations.
 | | If someone here can point me at the mother lode of
 | | insight, I would be most grateful.
 |
 | What's your threat model? It's proved to be a very hard problem to
 | solve, since there are all sorts of other channels -- application data,
 | timing data (the remote fingerprinting paper mentioned this one), etc.

Steve, et al.,

My threat model is how can I have a convincing technical solution
that, in turn, gets your average corporate general counsel to permit
sharing various kinds of logs with similar firms. The Patriot Act
(2001, Bush), PDD 63 (1998, Clinton), and various other intervening
bits of legislation say that threat and vulnerability information
shared between like private sector firms is (1) exempt from
Anti-Trust (even where security is a competitive feature) and
(2) exempt from FOIA (even where such sharing is under government
aegis). Nevertheless no corporate general counsel will permit such
sharing. From where a GC sits, the risk is clear, near-term and
direct to the firm while any benefit is diffuse and distant, and
no GC believes any laws' words until the courts, as unacknowledged
legislators, get a whack at it and that being so no GC wants to be
the test case.

Ipso facto, I (we) need a way to ensure that log data can be shared
between firms in ways that do not identify the source firm so that,
in turn, I can stand up and say that the risk as seen from the GC's
point of view has been technically put to bed. I don't imagine for
a minute that even that argument will be trivial, but a technical
solution is necessary even if insufficient.

My real aim is, of course, the characterization of macro-scale risk
to critical infrastructure. In the hypothesis-generation stage of
such an effort I need to take field observations that could easily
go any of three ways:

 (1) All the players see the same scans, the same automated attacks,
     the same over-pressure;

 (2) All the players see entirely different scans, entirely different
     automated attacks, entirely different over-pressures; or

 (3) One of the players stands apart from the others and whereas the
     corpus of that industrial sector sees the same scans, the same
     automated attacks, the same over-pressure there is one player
     whose experience is different.

This is information that no firm can get on its own, so uniqueness
of value is a given and amongst rational players unarguable. What
I need is to break the logjam over being the first to share. The
only alternative is to take the biased samples that are available
inside managed security providers and confidential consulting firms
and pool that data, thus anonymizing it, within a single corporate
shell. This is second best and tends to have little motive power of
its own, though I/we proved it can be done[1] as has Qualys[2],
inter alia.

Clear enough?

--dan

[1] http://www.atstake.com/research/reports/acrobat/atstake_app_reloaded.pdf
[2] http://www.qualys.com/company/newsroom/newsreleases/usa/pr.php/2004-07-28

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: de-identification
In message [EMAIL PROTECTED], [EMAIL PROTECTED] writes:
 | Ladies and Gentlemen,
 |
 | I'd like to come up to speed on the state of the
 | art in de-identification (~=anonymization) of data
 | especially monitoring data (firewall/hids logs, say).
 | A little googling suggests that this is an academic
 | subspeciality as well as a word with many interpretations.
 | If someone here can point me at the mother lode of
 | insight, I would be most grateful.

What's your threat model? It's proved to be a very hard problem to
solve, since there are all sorts of other channels -- application data,
timing data (the remote fingerprinting paper mentioned this one), etc.

		--Steven M. Bellovin, http://www.cs.columbia.edu/~smb
Re: de-identification
 | I'd like to come up to speed on the state of the
 | art in de-identification (~=anonymization) of data
 | especially monitoring data (firewall/hids logs, say).

We call it pseudonymization (Pseudonymisierung). It's a commonly
used technique in Germany to detaint personally identifiable
information, so you can share it freely for statistical purposes.
The methods used in the field are rather crude (time-seeded LCGs
are very common, unfortunately).

A reference to the book Translucent Databases was posted to this
list a couple of months ago, but IIRC it's being revised, so I
didn't rush to buy and read it.
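To illustrate why a time-seeded LCG is a crude source of pseudonyms, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the LCG constants, the function names (lcg_stream, pseudonymize, recover_seed) and the scheme of handing out successive LCG outputs as pseudonyms are not taken from any particular deployed system. The point is only that anyone who sees one pseudonym and can guess roughly when the table was generated can search the candidate seeds and rebuild the whole mapping.

```python
M = 2**32
A, C = 1664525, 1013904223  # illustrative LCG constants, an assumption


def lcg_stream(seed, count):
    """Yield `count` successive 32-bit LCG outputs starting from `seed`."""
    x = seed % M
    for _ in range(count):
        x = (A * x + C) % M
        yield x


def pseudonymize(ids, seed):
    """Give each distinct identifier the next LCG output as its pseudonym."""
    distinct = list(dict.fromkeys(ids))  # de-duplicate, keep first-seen order
    return dict(zip(distinct, lcg_stream(seed, len(distinct))))


def recover_seed(first_pseudonym, t_lo, t_hi):
    """Brute-force the seed over a window of candidate Unix timestamps."""
    for t in range(t_lo, t_hi):
        if next(lcg_stream(t, 1)) == first_pseudonym:
            return t
    return None


if __name__ == "__main__":
    secret_time = 1118243940  # hypothetical generation time (Unix seconds)
    table = pseudonymize(["alice", "bob", "carol"], secret_time)
    # The attacker knows one pseudonym and the day, not the exact second.
    guess = recover_seed(table["alice"], secret_time - 43200, secret_time + 43200)
    print(guess == secret_time)
```

A one-day window is only 86,400 candidates, so the search is instant. A keyed construction with a long-term secret avoids this particular weakness, although it does nothing about the side channels raised elsewhere in the thread.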
Re: de-identification
Florian Weimer wrote:
 | We call it pseudonymization (Pseudonymisierung). It's a commonly
 | used technique in Germany to detaint personally identifiable
 | information, so you can share it freely for statistics purposes.
 | The methods used in the field are rather crude (time-seeded LCGs
 | are very common, unfortunately).

from privacy glossary and taxonomy
http://www.garlic.com/~lynn/privacy.htm
that i put together when working on x9.99 PIA standard for
financial industry ... from HIPAA

anonymized
    Previously identifiable data that have been deidentified and
    for which a code or other link no longer exists. An investigator
    would not be able to link anonymized information back to a
    specific individual. [HIPAA] (see also anonymous, coded,
    directly identifiable, indirectly identifiable)
Re: de-identification
On Jun 8, 2005, at 15:19, [EMAIL PROTECTED] wrote:
 | I'd like to come up to speed on the state of the
 | art in de-identification (~=anonymization) of data
 | especially monitoring data (firewall/hids logs, say).

I don't know the state of the art, but I can tell you the state of
the artless.

I had a request to share our border router traffic logs (Cisco
netflow) with a university, so they could try out some anomaly
detection schemes they were working on.

(Bkgnd: We don't consider our network topology sensitive. Our
traffic logs are subject to a general respect for privacy.)

Since they could send us packets of their choosing, I deemed it
useless to obfuscate our own IP addresses. I chose to anonymize
all the external addresses. My design note is below.

But then, as fate would have it, the university said they needed
the true external addresses. That left me a bit stumped. Perhaps
a less chaotic mapping, like one that is bijective between classful
network numbers, would do.


obfuscation filter program

Parameters

    Blocks of IP addresses deemed internal. Internal includes
    multicast addresses and RFC 1918 private-use addresses.

Working data preserved across runs

    For each date, a database of (true address, substituted address)
    pairings.

Algorithms

    Substituted addresses are pseudo-random, formed by MD5-hashing
    a string (S | D | A | N) and taking the first 32 bits.

        S = fixed secret hash seed, long term
        D = date of data, in MMDD format
        N = integer, starting at 0 and incremented if the resulting
            address is an internal one or a collision.

    to obfuscate an IP address:
    {
        if it's internal, return it unchanged.
        otherwise, is a substitute already assigned? If so, return it.
        otherwise
            for ( done = N = 0; !done; N++ ) {
                generate substitute address by hashing as above
                if ( !internal && !collision ) done = 1
            }
        save forward and reverse mappings
    }

    for each netflow record
    {
        i = 0
        if ( src is external ) { obfuscate src; i++ }
        if ( dst is external ) { obfuscate dst; i++ }
        if ( i != 1 ) log an unusual condition
        write output
    }

Scripts:

    generator loops over input files, applying obfuscator, writing
    a temp-named output file, then renaming the completed output
    file to its permanent name.

    mover looks for completed output files, copies them to the
    destination, then looks for more, sleeping and retrying if
    there are none.

Other notes:

    The obfuscated mappings can be regenerated at will if exactly
    the same data is processed in the same sequence, and the secret
    hash seed is known.
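The design note above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the poster's actual program: the note does not define A, so it is taken here to be the true address being obfuscated; the literal "|" separator, the big-endian reading of the first 32 bits, the class and function names (Obfuscator, obfuscate_flow) and the default internal blocks are likewise hypothetical. Persisting the per-date table across runs is left out.

```python
import hashlib
import ipaddress

# Hypothetical defaults: RFC 1918 private ranges plus IPv4 multicast.
DEFAULT_INTERNAL = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4")
]


class Obfuscator:
    """In-memory substitution table for one date's external IPv4 addresses."""

    def __init__(self, secret, date_mmdd, internal=DEFAULT_INTERNAL):
        self.secret = secret        # S: long-term secret hash seed (bytes)
        self.date = date_mmdd       # D: date of the data, "MMDD"
        self.internal = list(internal)
        self.forward = {}           # true address -> substitute
        self.reverse = {}           # substitute -> true address

    def is_internal(self, addr):
        return any(addr in net for net in self.internal)

    def obfuscate(self, text):
        addr = ipaddress.IPv4Address(text)
        if self.is_internal(addr):
            return str(addr)        # internal addresses pass through
        if addr in self.forward:
            return str(self.forward[addr])
        n = 0
        while True:
            # Hash S | D | A | N and keep the first 32 bits (assumed layout).
            fields = [self.secret, self.date.encode(),
                      str(addr).encode(), str(n).encode()]
            digest = hashlib.md5(b"|".join(fields)).digest()
            cand = ipaddress.IPv4Address(int.from_bytes(digest[:4], "big"))
            # Retry if the candidate is internal or already in use.
            if not self.is_internal(cand) and cand not in self.reverse:
                break
            n += 1
        self.forward[addr] = cand
        self.reverse[cand] = addr
        return str(cand)


def obfuscate_flow(ob, src, dst):
    """Return (src, dst, unusual); unusual unless exactly one end is external."""
    externals = sum(
        not ob.is_internal(ipaddress.IPv4Address(a)) for a in (src, dst)
    )
    return ob.obfuscate(src), ob.obfuscate(dst), externals != 1
```

Because a collision bumps N, the substitute chosen for an address can depend on which addresses were seen before it, which is why the note says the mappings are reproducible only if the same data is processed in the same sequence.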
de-identification
Ladies and Gentlemen,

I'd like to come up to speed on the state of the
art in de-identification (~=anonymization) of data
especially monitoring data (firewall/hids logs, say).
A little googling suggests that this is an academic
subspeciality as well as a word with many interpretations.
If someone here can point me at the mother lode of
insight, I would be most grateful.

--dan