[ https://jira.duraspace.org/browse/DS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Donohue updated DS-955:
---------------------------

    Status: Open  (was: Received)

This idea was discussed in the Developers Meeting on Feb 22, 2012:

[20:07] <kompewter> [ https://jira.duraspace.org/browse/DS-955 ] - [#DS-955] Anonymize IP-Logging to comply to privacy laws - DuraSpace JIRA
[20:10] <mhwood> Does hashing a 32-bit keyspace really provide much protection?
[20:10] * mdiggory (~mdigg...@rrcs-74-87-47-114.west.biz.rr.com) has joined #duraspace
[20:10] <richardrodgers> See the concern, not sure DSpace is the place to address it (e.g. still can have IP addrs in Apache logs, etc)
[20:10] <tdonohue> to me, this seems like a reasonable request. A part of me wonders though if there's two stages here: (1) hash it initially, (2) after a month (or so) aggregate the stats and just remove specific IPs (once aggregated, we don't really need to keep IP specifics)
[20:11] <mdiggory> We are logging IP in both the dspace logs and the statistics engine. This is more specific to anonymizing those.
[20:12] <mdiggory> it's important that a number of actions happen prior to anonymization
[20:12] <tdonohue> richardrodgers -- good point. I agree about the logs. Not sure we should need to anonymize at the log level. I'd be more interested in not storing IPs forever in the stats engine
[20:12] <mdiggory> Most important of which is GeoIP evaluation
[20:13] <tdonohue> right, but after GeoIP eval is done, why keep around the IP? Can't it just be anonymized to a general location at that point? (or aggregated after a month or so)
[20:14] <mdiggory> We've been discussing a solution for distributed statistics aggregation for Dryad, part of which we've proposed an approach of producing statistics specific logs that only contain the event data that needs to be placed into solr. This would also act as a backup datasource for DSpace solr based statistics and provide a source for other stats tools as well.
[20:14] <mdiggory> these logs would be generated by the UsageEvent system
[20:14] <mdiggory> and update of solr would be a separate process.
[20:15] <mdiggory> this is not dissimilar to the current dspace.log stats processing logic
[20:15] <mhwood> Note the reference to privacy laws. Recall that, just because something is sensible and useful doesn't mean it isn't unlawful.
[20:15] <mdiggory> and yes, at that point what was in the log would be anonymized
[20:17] <tdonohue> mdiggory: makes sense -- yea, I'd assume we'd want to anonymize after GeoIP (or any other initial necessary IP analysis tasks). Is the work you describe for Dryad actually going to be open sourced?
[20:17] <mdiggory> one generally wants to reduce risk of any sort of risk of legal action if one can avoid it
[20:18] <mdiggory> sorry I was attempting to rewrite that thought before sending it...
[20:18] <mdiggory> yes, most everything in Dryad is OS
[20:19] <hpottinger> going back to mhwood's question re: hashing a 32-bit keyspace, I think it's sufficient to clear the hurdle of privacy concerns, probably insufficient if you were talking security concerns, though since the specter of legality has been invoked, probably a lawyer would need to weigh in on that
[20:19] <mdiggory> the intent is to bring back projects from Dryad whenever it is possible.
[20:21] <tdonohue> hpottinger & mhwood, I guess then the question would be whether we should think of providing an option to hash it in the logs -- as it sounds like we have some general agreement that once it gets to Stats Engine, it should be anonymized after GeoIP (or similar)
...
[20:22] <mhwood> I'm thinking that we might want to immediately extract aggregated measures such as GeoIP results and just destroy the address altogether -- not log anything. If a site has a specific problem then it could (with advice of counsel) insert additional logging to address the specific problem.
I think that might satisfy the law very well and not lose much if anything of value.
...
[20:24] <mdiggory> mhwood: that's an interesting idea too however, we do retain the ip to post process bots if they are ever added to the bot lists.
[20:25] <mdiggory> so we need to be able to produce a hashing strategy to support detecting and cleaning out bots
[20:25] <hpottinger> deriving lat/long from IP and then storing that for stats sounds fine... would impact efforts to measure "impact" based on IP address, not sure if that's our concern, making it configurable would be best, not every institution has these concerns
[20:25] <tdonohue> mhwood -- agreed, interesting idea. I question that we could get all the IP analysis (GeoIP / bot processing) done quickly enough to make it "seamless" though (i.e. we may need to keep that IP temporarily as mdiggory is saying)
[20:26] <mdiggory> I'm not sure you got that... the identification of a bot may come long after the logging event itself
[20:26] <mdiggory> and there are CLI tools that are responsible for clearing or flagging bot records based on new IPs added to the spiders directory
[20:27] <tdonohue> mdiggory: I did get that, I'd just say that "long after" is relative. It could be that it's "good enough" for many sites to do bot identification as of today, and then assume it's OK.
[20:27] <mdiggory> so, in that case being able to match on records that are bots needs to retain some variant of the IP
[20:28] <mdiggory> You may not say that if your stats get bloated with bot events that you can no longer tell were bots.
[20:29] * tdonohue may need to table this discussion shortly. very interesting discussion, but we likely shouldn't take all meeting on it -- still, these are great brainstorms
[20:29] <ablemann> ....like the hash of it?
[20:29] <mdiggory> remember this is stats, they may be the worst kind of lies, but we do try to minimize the error and achieve some level of accuracy
[20:29] <mhwood> Make it configurable?
If your local law is persnickety then you turn it off and lose the ability to bot-flag retroactively; otherwise you can turn it on.
[20:30] <tdonohue> mdiggory: stats are always 'relative' and accuracy is never a given. The only way to truly be sure about whether an IP could be a bot would be to keep it 'forever' and keep checking (and even then, you still may not be right).
[20:30] <hpottinger> I do think there are use cases for keeping some record of an IP address, if it's legal for you to do so, i.e. some researchers might want to know # of downloads from various known institutional ranges
[20:30] <mdiggory> hpottinger: interesting point...
[20:31] <mhwood> Yes, but that goes back to the "specific problem" argument: if you want to extract more, add an extractor.
[20:31] <mdiggory> I suspect the appropriate case even then is to abstract it to some "state" like "oncampus" / "offcampus"
[20:31] <mhwood> interface StackableLogMassager....
[20:33] <tdonohue> could also think about just making it 'semi-anonymous'....cut off the last part of the IP, so that "127.0.0.1" becomes "127.0.0" at some point...then, you still know *something* about where that came from, but it's not exact. Just a brainstorm
[20:33] <tdonohue> in any case, we probably should close up this discussion here shortly
[20:33] <hpottinger> mdiggory: requires that you know what you're looking for ahead of time, but I agree, that's probably the best approach
[20:34] * mhwood notes that many many addresses are effectively anonymized at the other end due to dynamic allocation, NAT, etc.
[20:34] <mhwood> The point being, not that we don't need to address this, but that the quality of the data is already iffy.
[20:35] <tdonohue> ok, shall we move on to other topics now?
We should post these discussion notes to DS-955

> Anonymize IP-Logging to comply to privacy laws
> ----------------------------------------------
>
>                 Key: DS-955
>                 URL: https://jira.duraspace.org/browse/DS-955
>             Project: DSpace
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2
>            Reporter: Claudia Jürgen
>            Priority: Major
>
> At the moment DSpace logs the complete IP addresses in its log files.
> Furthermore they are stored/used for Solr statistics. In some countries
> this is forbidden based on privacy laws. It should be made possible to
> anonymize the IPs logged in the log files and disable solr stats and solr
> logging.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
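
Editor's sketch of three ideas raised in the meeting log: cutting off the last part of the address (tdonohue), keeping a hashed variant of the IP so records can still be matched against a bot list later (mdiggory), and the concern that hashing a 32-bit keyspace gives little protection (mhwood). This is Python rather than DSpace code, and every name in it (truncate_ip, hash_ip, is_known_bot, brute_force_unkeyed, the secret key) is hypothetical and not part of DSpace. The truncation here zeroes the last octet ("127.0.0.1" becomes "127.0.0.0") instead of dropping it. The keyed hash is only one possible strategy, assuming the site keeps the key private; anyone holding the key can still enumerate addresses.

```python
import hashlib
import hmac
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the last octet of an IPv4 address ('semi-anonymous' variant)."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def hash_ip(ip: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of an address; the key is a site secret (assumption)."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


def is_known_bot(stored_hash: str, bot_ips, key: bytes) -> bool:
    """Retroactive bot check: hash each listed bot IP with the same key and compare."""
    return any(hmac.compare_digest(stored_hash, hash_ip(b, key)) for b in bot_ips)


def brute_force_unkeyed(target_digest: str, cidr: str):
    """Illustrates the keyspace concern: an unkeyed hash can be reversed by
    enumerating candidate addresses (here only one network, for speed)."""
    for addr in ipaddress.ip_network(cidr):
        digest = hashlib.sha256(str(addr).encode("ascii")).hexdigest()
        if digest == target_digest:
            return str(addr)
    return None


if __name__ == "__main__":
    key = b"example-site-secret"  # hypothetical value
    stored = hash_ip("192.0.2.77", key)
    print(truncate_ip("192.0.2.77"))
    print(is_known_bot(stored, ["198.51.100.1", "192.0.2.77"], key))
    unkeyed = hashlib.sha256(b"192.0.2.77").hexdigest()
    print(brute_force_unkeyed(unkeyed, "192.0.2.0/24"))
```

With a keyed hash, a site could still flag bot records after new addresses are added to a spiders list, at the cost of keeping the key around; discarding the key later would make the stored values unmatchable, which is close to the "destroy the address" option discussed above.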