[ 
https://jira.duraspace.org/browse/DS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Donohue updated DS-955:
---------------------------

    Status: Open  (was: Received)

This idea was discussed in the Developers Meeting on Feb 22, 2012:

[20:07] <kompewter> [ https://jira.duraspace.org/browse/DS-955 ] - [#DS-955] 
Anonymize IP-Logging to comply to privacy laws - DuraSpace JIRA
[20:10] <mhwood> Does hashing a 32-bit keyspace really provide much protection?
[20:10] * mdiggory (~mdigg...@rrcs-74-87-47-114.west.biz.rr.com) has joined 
#duraspace
[20:10] <richardrodgers> See the concern, not sure DSpace is the place to 
address it (e.g. still can have IP addrs in Apache logs, etc)
[20:10] <tdonohue> to me, this seems like a reasonable request. A part of me 
wonders though if there's two stages here: (1) hash it initially, (2) after a 
month (or so) aggregate the stats and just remove specific IPs (once 
aggregated, we don't really need to keep IP specifics)
[20:11] <mdiggory> We are logging IP in both the dspace logs and the 
statistics engine. This is more specific to anonymizing those.
[20:12] <mdiggory> it's important that a number of actions happen prior to 
anonymization
[20:12] <tdonohue> richardrodgers -- good point. I agree about the logs. Not 
sure we should need to anonymize at the log level. I'd be more interested in 
not storing IPs forever in the stats engine
[20:12] <mdiggory> Most important of which is GeoIP evaluation
[20:13] <tdonohue> right, but after GeoIP eval is done, why keep around the IP? 
Can't it just be anonymized to a general location at that point? (or aggregated 
after a month or so)
[20:14] <mdiggory> We've been discussing a solution for distributed statistics 
aggregation for Dryad, as part of which we've proposed an approach of producing 
statistics-specific logs that only contain the event data that needs to be 
placed into solr. This would also act as a backup datasource for DSpace solr 
based statistics and provide a source for other stats tools as well.
[20:14] <mdiggory> these logs would be generated by the UsageEvent system
[20:14] <mdiggory> and update of solr would be a separate process.
[20:15] <mdiggory> this is not dissimilar to the current dspace.log stats 
processing logic
[20:15] <mhwood> Note the reference to privacy laws. Recall that, just because 
something is sensible and useful doesn't mean it isn't unlawful.
[20:15] <mdiggory> and yes, at that point what was in the log would be 
anonymized
[20:17] <tdonohue> mdiggory: makes sense -- yea, I'd assume we'd want to 
anonymize after GeoIP (or any other initial necessary IP analysis tasks). Is 
the work you describe for Dryad actually going to be open sourced?
[20:17] <mdiggory> one generally wants to reduce risk of any sort of risk of 
legal action if one can avoid it
[20:18] <mdiggory> sorry I was attempting to rewrite that thought before 
sending it...
[20:18] <mdiggory> yes, most everything in Dryad is OS
[20:19] <hpottinger> going back to mhwood's question re: hashing a 32-bit 
keyspace, I think it's sufficient to clear the hurdle of privacy concerns, 
probably insufficient if you were talking security concerns, though since the 
specter of legality has been invoked, probably a lawyer would need to weigh in 
on that
[20:19] <mdiggory> the intent is to bring back projects from Dryad whenever it 
is possible.
[20:21] <tdonohue> hpottinger & mhwood, I guess then the question would be 
whether we should think of providing an option to hash it in the logs -- as it 
sounds like we have some general agreement that once it gets to Stats Engine, 
it should be anonymized after GeoIP (or similar)
...
[20:22] <mhwood> I'm thinking that we might want to immediately extract 
aggregated measures such as GeoIP results and just destroy the address 
altogether -- not log anything. If a site has a specific problem then it could 
(with advice of counsel) insert additional logging to address the specific 
problem. I think that might satisfy the law very well and not lose much if 
anything of value.
...
[20:24] <mdiggory> mhwood: that's an interesting idea too; however, we do retain 
the ip to post process bots if they are ever added to the bot lists.
[20:25] <mdiggory> so we need to be able to produce a hashing strategy to 
support detecting and cleaning out bots
[20:25] <hpottinger> deriving lat/long from IP and then storing that for stats 
sounds fine... would impact efforts to measure "impact" based on IP address, 
not sure if that's our concern, making it configurable would be best, not every 
institution has these concerns
[20:25] <tdonohue> mhwood -- agreed, interesting idea. I question that we could 
get all the IP analysis (GeoIP / bot processing) done quickly enough to make it 
"seamless" though (i.e. we may need to keep that IP temporarily as mdiggory is 
saying)
[20:26] <mdiggory> I'm not sure you got that... the identification of a bot may 
come long after the logging event itself
[20:26] <mdiggory> and there are CLI tools that are responsible for clearing or 
flagging bot records based on new IPs added to the spiders directory
[20:27] <tdonohue> mdiggory: I did get that, I'd just say that "long after" is 
relative. It could be that it's "good enough" for many sites to do bot 
identification as of today, and then assume it's OK.
[20:27] <mdiggory> so, in that case being able to match on records that are 
bots needs to retain some variant of the IP
[20:28] <mdiggory> You may not say that if your stats get bloated with bot 
events that you can no longer tell were bots.
[20:29] * tdonohue may need to table this discussion shortly. very interesting 
discussion, but we likely shouldn't take all meeting on it -- still, these are 
great brainstorms
[20:29] <ablemann> ....like the hash of it?
[20:29] <mdiggory> remember this is stats, they may be the worst kind of lies, 
but we do try to minimize the error and achieve some level of accuracy
[20:29] <mhwood> Make it configurable? If your local law is persnickety then 
you turn it off and lose the ability to bot-flag retroactively; otherwise you 
can turn it on.
[20:30] <tdonohue> mdiggory: stats are always 'relative' and accuracy is never 
a given. The only way to truly be sure about whether an IP could be a bot would 
be to keep it 'forever' and keep checking (and even then, you still may not be 
right).
[20:30] <hpottinger> I do think there are use cases for keeping some record of 
an IP address, if it's legal for you to do so, i.e. some researchers might want 
to know # of downloads from various known institutional ranges
[20:30] <mdiggory> hpottinger: interesting point...
[20:31] <mhwood> Yes, but that goes back to the "specific problem" argument: if 
you want to extract more, add an extractor.
[20:31] <mdiggory> I suspect the appropriate case even then is to abstract it 
to some "state" like "oncampus" / "offcampus"
[20:31] <mhwood> interface StackableLogMassager....
[20:33] <tdonohue> could also think about just making it 
'semi-anonymous'....cut off the last part of the IP, so that "127.0.0.1" 
becomes "127.0.0" at some point...then, you still know *something* about where 
that came from, but it's not exact. Just a brainstorm
[20:33] <tdonohue> in any case, we probably should close up this discussion 
here shortly
[20:33] <hpottinger> mdiggory: requires that you know what you're looking for 
ahead of time, but I agree, that's probably the best approach
[20:34] * mhwood notes that many many addresses are effectively anonymized at 
the other end due to dynamic allocation, NAT, etc.
[20:34] <mhwood> The point being, not that we don't need to address this, but 
that the quality of the data is already iffy.
[20:35] <tdonohue> ok, shall we move on to other topics now? We should post 
these discussion notes to DS-955
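
Editor's illustration of the hashing question raised above. A plain hash of an IPv4 address can be reversed by hashing all 2^32 possible addresses, which is the weakness mhwood's question hints at. One possible mitigation, not something agreed in the meeting, is a keyed hash (HMAC) whose key stays secret: stored values cannot be enumerated without the key, yet a newly listed bot IP can still be matched against old records by hashing it with the same key. The function names and the key handling below are hypothetical and are not part of DSpace; this is a minimal sketch, not a reviewed privacy mechanism.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(ip: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of the address.

    Hypothetical helper: the address is parsed first so that equivalent
    spellings of the same address produce the same digest.
    """
    packed = ipaddress.ip_address(ip).packed
    return hmac.new(secret_key, packed, hashlib.sha256).hexdigest()


def is_listed_bot(stored_digest: str, bot_ips, secret_key: bytes) -> bool:
    """Check a stored digest against a list of individual bot addresses.

    Assumption: the bot list contains single addresses. Large address
    ranges would have to be expanded address by address before hashing.
    """
    return any(
        hmac.compare_digest(stored_digest, pseudonymize_ip(bot_ip, secret_key))
        for bot_ip in bot_ips
    )
```

With this approach, anyone holding the key can still test a specific address against the stored records, so the key would need the same care as the raw addresses, and whether this satisfies a given privacy law is a question for counsel.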
                
> Anonymize IP-Logging to comply to privacy laws
> ----------------------------------------------
>
>                 Key: DS-955
>                 URL: https://jira.duraspace.org/browse/DS-955
>             Project: DSpace
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 
> 1.7.2
>            Reporter: Claudia Jürgen
>            Priority: Major
>
> At the moment DSpace logs the complete IP addresses in its log files. 
> Furthermore they are stored/used for Solr statistics. In some countries 
> this is forbidden based on privacy laws. It should be made possible to 
> anonymize the IPs logged in the log files and disable solr stats and solr 
> logging.
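
Editor's illustration of the "semi-anonymous" option brainstormed in the meeting (cutting off the last part of the address). The sketch below zeroes the host bits instead of dropping them, so "127.0.0.1" becomes "127.0.0.0" and the result remains a valid address. The function name and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions chosen for illustration; they do not come from the issue or from DSpace.

```python
import ipaddress


def truncate_ip(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Return the address with every bit beyond the prefix set to zero.

    Hypothetical helper: the default prefix lengths are illustrative only.
    """
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets ip_network accept an address with host bits set
    # and masks them off instead of raising ValueError.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)
```

The truncated value still supports coarse location or per-network counts, but it can no longer be matched against a single bot address, only against the whole network it belongs to.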

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
