Re: de-identification

2005-06-17 Thread dan

Steven M. Bellovin writes:
 | 
 | Ladies and Gentlemen,
 | 
 | I'd like to come up to speed on the state of the
 | art in de-identification (~=anonymization) of data
 | especially monitoring data (firewall/hids logs, say).
 | A little googling suggests that this is an academic
 | subspeciality as well as a word with many interpretations.
 | If someone here can point me at the mother lode of 
 | insight, I would be most grateful.
 | 
 | 
 | What's your threat model?  It's proved to be a very hard problem to 
 | solve, since there are all sorts of other channels -- application data, 
 | timing data (the remote fingerprinting paper mentioned this one), etc.

Steve, et al.,

My threat model is how can I have a convincing
technical solution that, in turn, gets your average
corporate general counsel to permit sharing various
kinds of logs with similar firms.  The Patriot Act
(2001,Bush), PDD 63 (1998,Clinton), and various other
intervening bits of legislation say that threat and
vulnerability information shared between like private
sector firms is (1) exempt from Anti-Trust (even
where security is a competitive feature) and (2)
exempt from FOIA (even where such sharing is under
government aegis).  Nevertheless no corporate general
counsel will permit such sharing.  From where a GC
sits, the risk is clear, near-term and direct to the
firm while any benefit is diffuse and distant, and
no GC believes any law's words until the courts, as
unacknowledged legislators, get a whack at it and,
that being so, no GC wants to be the test case.

Ipso facto, I (we) need a way to ensure that log
data can be shared between firms in ways that do
not identify the source firm so that, in turn, 
I can stand up and say that the risk as seen from
the GC's point of view has been technically put
to bed.  I don't imagine for a minute that even
that argument will be trivial, but a technical
solution is necessary even if insufficient.

My real aim is, of course, the characterization 
of macro-scale risk to critical infrastructure.
In the hypothesis-generation stage of such an
effort I need to take field observations that
could easily go any of three ways:

 (1) All the players see the same scans, the same
 automated attacks, the same over-pressure;

 (2) All the players see entirely different scans,
 entirely different automated attacks, entirely
 different over-pressures; or

 (3) One of the players stands apart from the others
 and whereas the corpus of that industrial 
 sector sees the same scans, the same automated
 attacks, the same over-pressure there is one
 player whose experience is different.
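
A rough sketch of how the three outcomes might be told apart once
pooled data exists, assuming each firm's logs have already been
reduced to a set of attack signatures.  The firm names, signature
labels, and the 0.5 threshold are all hypothetical, and Jaccard
overlap is just one convenient measure, not something proposed in
this thread.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def classify(observations, threshold=0.5):
    """Crudely sort per-firm signature sets into the three cases.

    observations maps a firm label to the set of signatures it saw
    (at least two firms are required).
    Returns "same", "different", or ("outlier", firm).
    """
    firms = sorted(observations)
    scores = {firm: [] for firm in firms}
    for x, y in combinations(firms, 2):
        s = jaccard(observations[x], observations[y])
        scores[x].append(s)
        scores[y].append(s)
    # Mean similarity of each firm to all the others.
    mean = {firm: sum(v) / len(v) for firm, v in scores.items()}
    low = [firm for firm in firms if mean[firm] < threshold]
    if not low:
        return "same"                  # case (1)
    if len(low) == 1:
        return ("outlier", low[0])     # case (3)
    return "different"                 # case (2), or a mixture

if __name__ == "__main__":
    pooled = {
        "firm_a": {"scan-445", "worm-x"},
        "firm_b": {"scan-445", "worm-x"},
        "firm_c": {"scan-445", "worm-x"},
        "firm_d": {"probe-unique"},
    }
    print(classify(pooled))
```

Anything between the clean cases falls into "different" here, so a
real analysis would want something less blunt than one threshold.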

This is information that no firm can get on its
own, so uniqueness of value is a given and amongst
rational players unarguable.  What I need is to
break the logjam over being the first to share.

The only alternative is to take the biased samples
that are available inside managed security providers
and confidential consulting firms and pool that data,
thus anonymizing it, within a single corporate shell.
This is second best and tends to have little motive
power of its own, though I/we proved it can be done[1]
as has Qualys[2], inter alia.

Clear enough?

--dan



[1]
http://www.atstake.com/research/reports/acrobat/atstake_app_reloaded.pdf

[2]
http://www.qualys.com/company/newsroom/newsreleases/usa/pr.php/2004-07-28


-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: de-identification

2005-06-16 Thread Steven M. Bellovin
In message [EMAIL PROTECTED], [EMAIL PROTECTED] writes:

> Ladies and Gentlemen,
>
> I'd like to come up to speed on the state of the
> art in de-identification (~=anonymization) of data
> especially monitoring data (firewall/hids logs, say).
> A little googling suggests that this is an academic
> subspeciality as well as a word with many interpretations.
> If someone here can point me at the mother lode of
> insight, I would be most grateful.


What's your threat model?  It's proved to be a very hard problem to 
solve, since there are all sorts of other channels -- application data, 
timing data (the remote fingerprinting paper mentioned this one), etc.

--Steven M. Bellovin, http://www.cs.columbia.edu/~smb





Re: de-identification

2005-06-13 Thread Florian Weimer
> I'd like to come up to speed on the state of the
> art in de-identification (~=anonymization) of data
> especially monitoring data (firewall/hids logs, say).

We call it pseudonymization (Pseudonymisierung).  It's a commonly
used technique in Germany to detaint personally identifiable
information, so you can share it freely for statistics purposes.  The
methods used in the field are rather crude (time-seeded LCGs are very
common, unfortunately).
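
A minimal sketch of why a time-seeded LCG is a weak basis for
pseudonyms, assuming the seed is a Unix timestamp and each identifier
simply receives the next generator output.  The constants, function
names, and sample identifiers are illustrative only and are not taken
from any particular deployed tool.  The keyed HMAC at the end is a
commonly used alternative, shown for contrast.

```python
import hashlib
import hmac

# Classic LCG constants, used here purely for illustration.
A, C, M = 1103515245, 12345, 2**31

def lcg_stream(seed):
    """Yield successive outputs of a linear congruential generator."""
    state = seed % M
    while True:
        state = (A * state + C) % M
        yield state

def weak_pseudonyms(identifiers, seed):
    """The crude scheme: each identifier gets the next LCG output."""
    gen = lcg_stream(seed)
    return {ident: next(gen) for ident in identifiers}

def recover_seed(first_pseudonym, earliest, latest):
    """Try every timestamp in a window until the first output matches."""
    for guess in range(earliest, latest + 1):
        if (A * guess + C) % M == first_pseudonym:
            return guess
    return None

def keyed_pseudonym(identifier, key):
    """Alternative: HMAC-SHA256 under a secret key, truncated to 16 hex digits."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

if __name__ == "__main__":
    seed = 1118620800  # midnight UTC on 2005-06-13, a hypothetical run time
    table = weak_pseudonyms(["alice", "bob"], seed)
    # An observer who knows the job ran within an hour of that time:
    print(recover_seed(table["alice"], seed - 3600, seed + 3600) == seed)
    print(keyed_pseudonym("alice", b"hypothetical secret key"))
```

Once the seed is recovered, the entire pseudonym sequence can be
regenerated, which is the weakness; the keyed variant has no such
guessable input as long as the key stays secret.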

A reference to the book Translucent Databases was posted to this
list a couple of months ago, but IIRC it's being revised, so I didn't
rush to buy and read it.



Re: de-identification

2005-06-13 Thread Anne Lynn Wheeler
Florian Weimer wrote:
> We call it pseudonymization (Pseudonymisierung).  It's a commonly
> used technique in Germany to detaint personally identifiable
> information, so you can share it freely for statistics purposes.  The
> methods used in the field are rather crude (time-seeded LCGs are very
> common, unfortunately).


from privacy glossary and taxonomy
http://www.garlic.com/~lynn/privacy.htm

that i put together when working on x9.99 PIA standard for financial
industry ... from HIPAA

anonymized
Previously identifiable data that have been deidentified and for
which a code or other link no longer exists. An investigator would not
be able to link anonymized information back to a specific individual.
[HIPAA] (see also anonymous, coded, directly identifiable, indirectly
identifiable)



Re: de-identification

2005-06-09 Thread Matt Crawford

On Jun 8, 2005, at 15:19, [EMAIL PROTECTED] wrote:

> I'd like to come up to speed on the state of the
> art in de-identification (~=anonymization) of data
> especially monitoring data (firewall/hids logs, say).


I don't know the state of the art, but I can tell you the state of the 
artless.  I had a request to share our border router traffic logs 
(Cisco netflow) with a university, so they could try out some anomaly 
detection schemes they were working on.


(Bkgnd: We don't consider our network topology sensitive. Our traffic 
logs are subject to a general respect for privacy.)


Since they could send us packets of their choosing, I deemed it useless 
to obfuscate our own IP addresses.  I chose to anonymize all the 
external addresses.  My design note is below.


But then, as fate would have it, the university said they needed the 
true external addresses.  That left me a bit stumped.  Perhaps a less 
chaotic mapping, like one that is bijective between classful network 
numbers, would do.
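
One way such a "less chaotic" mapping could look, as a sketch only:
keep the address class and the host bits, and apply a keyed bijection
to the network number within its class.  The Feistel construction,
the HMAC round function, and the cycle-walking trick for odd bit
widths are assumptions made for this illustration, not part of the
design note below; the key and sample addresses are hypothetical.

```python
import hashlib
import hmac
import ipaddress

def _round(key, rnd, value, bits):
    """Keyed round function: HMAC-SHA256 of (round, value), cut to `bits` bits."""
    msg = bytes([rnd]) + value.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << bits) - 1)

def _feistel(key, x, half):
    """Four-round balanced Feistel permutation on 2*half bits."""
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for rnd in range(4):
        left, right = right, left ^ _round(key, rnd, right, half)
    return (left << half) | right

def permute(key, x, bits):
    """Keyed bijection on [0, 2**bits); cycle-walks when bits is odd."""
    half = (bits + 1) // 2
    y = _feistel(key, x, half)
    while y >= (1 << bits):       # outside the domain: apply again
        y = _feistel(key, y, half)
    return y

# (leading bits, their count, free network bits, host bits) for classes A, B, C
CLASSES = [(0b0, 1, 7, 24), (0b10, 2, 14, 16), (0b110, 3, 21, 8)]

def map_address(key, addr):
    """Permute the classful network number; keep class and host part."""
    ip = int(ipaddress.IPv4Address(addr))
    for prefix, plen, netbits, hostbits in CLASSES:
        if ip >> (32 - plen) == prefix:
            net = (ip >> hostbits) & ((1 << netbits) - 1)
            host = ip & ((1 << hostbits) - 1)
            new_net = permute(key + bytes([plen]), net, netbits)
            new_ip = (prefix << (32 - plen)) | (new_net << hostbits) | host
            return str(ipaddress.IPv4Address(new_ip))
    return addr                   # class D/E (multicast, reserved) unchanged

if __name__ == "__main__":
    key = b"hypothetical secret"
    print(map_address(key, "192.0.2.1"), map_address(key, "192.0.2.55"))
```

Hosts on the same classful network stay together after mapping, which
is the property an anomaly detector would presumably care about.  This
sketch makes no attempt to avoid landing on loopback or private ranges.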



obfuscation filter program

  Parameters
    Blocks of IP addresses deemed internal.  Internal includes multicast
    addresses and RFC 1918 private-use addresses.

  Working data preserved across runs
    For each date, a database of (true address, substituted address)
    pairings.


  Algorithms
    Substituted addresses are pseudo-random, formed by MD5-hashing a
    string (S | D | A | N) and taking the first 32 bits.
      S = fixed secret hash seed, long term
      D = date of data, in MMDD format
      A = true address being obfuscated
      N = integer, starting at 0 and incremented if the resulting address
          is an internal one or a collision.

to obfuscate an IP address: {
  if it's internal, return it unchanged.  otherwise
  is a substitute already assigned?  If so, return it.  otherwise
  for ( done = N = 0; !done; N++ ) {
    generate substitute address by hashing as above
    if ( !internal && !collision ) done = 1
  }
  save forward & reverse mappings
}

for each netflow record {
  i = 0
  if ( src is external ) {
    obfuscate src; i++
  }
  if ( dst is external ) {
    obfuscate dst; i++
  }
  if ( i != 1 ) log an unusual condition
  write output
}
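
The pseudocode above could be sketched in Python roughly as follows.
Several details are assumptions, since the note does not pin them
down: "|" is taken as a literal separator byte, N is encoded as a
decimal string, a record is reduced to a (src, dst) pair, and the
internal blocks and secret shown are placeholders using documentation
address ranges.

```python
import hashlib
import ipaddress
import logging

# Placeholder parameters; real internal blocks and the seed are site-specific.
INTERNAL = [ipaddress.ip_network(n) for n in
            ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
             "224.0.0.0/4", "198.51.100.0/24")]
SECRET = b"hypothetical long-term seed"

def is_internal(ip):
    """True if the address falls inside any block deemed internal."""
    return any(ip in net for net in INTERNAL)

class Obfuscator:
    """Holds one day's forward and reverse address mappings."""

    def __init__(self, secret, date):
        self.secret, self.date = secret, date.encode()
        self.forward, self.reverse = {}, {}

    def obfuscate(self, addr):
        if is_internal(ipaddress.IPv4Address(addr)):
            return addr                       # internal: unchanged
        if addr in self.forward:
            return self.forward[addr]         # already assigned
        n = 0
        while True:
            material = b"|".join(
                [self.secret, self.date, addr.encode(), str(n).encode()])
            # First 32 bits of the MD5 digest become the substitute address.
            sub = str(ipaddress.IPv4Address(hashlib.md5(material).digest()[:4]))
            if not is_internal(ipaddress.IPv4Address(sub)) and sub not in self.reverse:
                break
            n += 1                            # internal or collision: retry
        self.forward[addr] = sub
        self.reverse[sub] = addr
        return sub

def process(records, obf):
    """Obfuscate the external endpoint(s) of each (src, dst) pair."""
    out = []
    for idx, (src, dst) in enumerate(records):
        externals = sum(not is_internal(ipaddress.IPv4Address(a))
                        for a in (src, dst))
        if externals != 1:
            logging.warning("record %d has %d external endpoints", idx, externals)
        out.append((obf.obfuscate(src), obf.obfuscate(dst)))
    return out

if __name__ == "__main__":
    obf = Obfuscator(SECRET, "0609")
    print(process([("198.51.100.5", "203.0.113.7")], obf))
```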

Scripts:

  generator loops over input files, applying obfuscator, writing temp-named
  output file, then renaming completed output file to permanent name.

  mover looks for completed output files, copies them to destination, then
  looks for more, sleeping and retrying if there are none.
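
A sketch of the two scripts, assuming line-oriented input and a local
output directory.  The ".obf" and ".tmp" suffixes, the function names,
and the 60-second sleep are hypothetical, and deleting the local copy
after a successful copy is an added assumption so that the mover does
not pick up the same file twice.

```python
import os
import shutil
import time

def generate(in_path, out_dir, transform):
    """Write output under a temporary name, then rename once complete."""
    final = os.path.join(out_dir, os.path.basename(in_path) + ".obf")
    temp = final + ".tmp"
    with open(in_path) as src, open(temp, "w") as dst:
        for line in src:
            dst.write(transform(line))
    # The rename is atomic when temp and final are on the same filesystem,
    # so the mover never sees a half-written permanent file.
    os.replace(temp, final)
    return final

def move_completed(out_dir, dest_dir):
    """Copy each completed file to dest_dir; return how many were moved."""
    moved = 0
    for name in sorted(os.listdir(out_dir)):
        if not name.endswith(".obf"):
            continue                  # skips in-progress ".tmp" files
        path = os.path.join(out_dir, name)
        shutil.copy(path, os.path.join(dest_dir, name))
        os.remove(path)               # assumption: drop the local copy
        moved += 1
    return moved

def mover(out_dir, dest_dir, idle_seconds=60):
    """Loop forever, sleeping whenever there is nothing to move."""
    while True:
        if move_completed(out_dir, dest_dir) == 0:
            time.sleep(idle_seconds)
```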

Other notes:

  The obfuscated mappings can be regenerated at will if exactly the same data
  is processed in the same sequence, and the secret hash seed is known.




de-identification

2005-06-08 Thread dan

Ladies and Gentlemen,

I'd like to come up to speed on the state of the
art in de-identification (~=anonymization) of data
especially monitoring data (firewall/hids logs, say).
A little googling suggests that this is an academic
subspeciality as well as a word with many interpretations.
If someone here can point me at the mother lode of 
insight, I would be most grateful.

--dan

