On 05/22/2018 07:33 PM, Stephen J. Turnbull wrote:
I would imagine that it is the subthread rooted at the first post containing complainant's PII -- "Personally Identifying Information".

I feel like that's a self referencing definition.

A "thread" is "a subthread rooted at the first post containing PII".

I agree that's where the focus should start. But I don't think it defines a thread in the way that I'm asking.

What is their working definition of "thread"?

Let's say:

1)  Bla
2)   +--- Re: Bla
3)   +--- Re: Bla
4)   |     +--- BlaBlaBla
5)   +--- Re: Bla
6)         +--- I hijacked this thread because I need help!!!

Let's say the PII was in message 3 and the person replying to it in message 4 removed the PII. Do messages 3 and 4 need to be removed (or otherwise modified)?

Let's say that message 1 had the PII, messages 2, 3, and 5 quoted it, but 4 did not and 6 is a hijacker that hit reply on the most convenient message (under his cursor) and removed all content. Do messages 4 and 6 need to be removed?

What is the "(sub)thread" that needs to be removed?

That is going to depend on the presence of PII in the messages. If *whole messages* are to be deleted, that would presumably involve content that somehow identifies the person. I would expect that we don't have to delete whole bug reports on this list just because somebody requests their PII be redacted.

I agree that it's possible to remove / redact PII without deleting the items containing the PII.

Think about it this way, spooks don't shred the entire sheet of paper, instead they take a black marker and redact just the pieces that need to be removed.

I'm afraid that the infinite wisdom of politicians will say that the entire paper needs to be shredded.
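To make the "marker" versus "shredder" readings concrete for the six-message example above, here is a minimal sketch. The `Message` class and the helper names are my own invention for illustration (nothing here comes from Mailman), and which messages carry PII is simply assumed as input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    number: int
    has_pii: bool = False
    replies: List["Message"] = field(default_factory=list)


def build_thread(pii_numbers):
    """Build the six-message example tree; pii_numbers marks messages with PII."""
    m = {n: Message(n, has_pii=(n in pii_numbers)) for n in range(1, 7)}
    m[1].replies = [m[2], m[3], m[5]]
    m[3].replies = [m[4]]
    m[5].replies = [m[6]]
    return m[1]


def subthread(root):
    """Numbers of root and every message below it, depth first."""
    numbers = [root.number]
    for reply in root.replies:
        numbers.extend(subthread(reply))
    return numbers


def rooted_at_first_pii(root):
    """'Shredder' reading: the whole subthread under the first PII message."""
    if root.has_pii:
        return subthread(root)
    for reply in root.replies:
        found = rooted_at_first_pii(reply)
        if found:
            return found
    return []


def only_pii_messages(root):
    """'Marker' reading: just the messages that actually contain PII."""
    numbers = [root.number] if root.has_pii else []
    for reply in root.replies:
        numbers.extend(only_pii_messages(reply))
    return numbers
```

With PII only in message 3, the shredder reading takes messages 3 and 4 while the marker reading takes only 3. With PII in 1, 2, 3, and 5, the shredder reading takes all six messages, including the hijacker's message 6, while the marker reading leaves 4 and 6 alone.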

I think it also significantly depends on what needs to be redacted. Removing "supercalifragilisticexpialidocious" is a LOT different than removing "Grant Taylor" from the Mailman-Users archive. "supercalifragilisticexpialidocious" would be like a reference to an event. "Grant Taylor" would be any mention of my (or an impostor's) name.

The former is likely MUCH simpler to do than the latter. The latter will also impact MANY more messages.
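A rough sketch of why, assuming the archive is plain text and the redaction is a simple case-insensitive string match. The `redact` helper, the sample message, and the `example.invalid` address are all made up for illustration.

```python
import re


def redact(text, terms, marker="[REDACTED]"):
    """Replace every occurrence of any term; return (new_text, hit_count)."""
    # Longest terms first so a longer term is not shadowed by a shorter one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.subn(marker, text)


# Hypothetical message: the name shows up in a header and an attribution line.
sample = (
    "From: Grant Taylor <gtaylor@example.invalid>\n"
    "Subject: Re: Bla\n"
    "\n"
    "On 05/22/2018, Grant Taylor wrote:\n"
    "> supercalifragilisticexpialidocious\n"
)

_, event_hits = redact(sample, ["supercalifragilisticexpialidocious"])
_, name_hits = redact(sample, ["Grant Taylor", "gtaylor@example.invalid"])
```

Even in this one tiny message the name and address match in more places than the event word does, and a literal match like this still misses obfuscated forms such as "gtaylor (at) ...", nicknames, and misspellings.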

What worries me more is the implications for blockchain, or more precisely, DAG-based VCSes that use hashes for integrity check like git: the identity of commits will change if authors and emails are redacted, including if a commit log refers to PII of a bug reporter as they often do. I guess you'd need to maintain an index of pointers from old commit ids, or at least for branches and tags (we do have the reflog in git).

I don't want to try to work that out.
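Just to illustrate why the ids change, here is a toy hash chain. This is NOT git's real object format (a real commit also hashes a tree, committer, timestamps, and an object header); the function names and the sample history are assumptions of mine, and the old-to-new map is only a sketch of the "index of pointers" idea.

```python
import hashlib


def commit_id(parent_id, author, message):
    """Toy commit id: a hash over the parent id, author, and message."""
    payload = f"parent {parent_id}\nauthor {author}\n\n{message}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def build_history(commits):
    """commits: list of (author, message), oldest first. Returns their ids."""
    ids = []
    parent = ""
    for author, message in commits:
        parent = commit_id(parent, author, message)
        ids.append(parent)
    return ids


def redact_history(commits, author_to_redact, replacement="redacted"):
    """Rewrite one author and return (new_ids, mapping from old id to new id)."""
    rewritten = [
        (replacement if author == author_to_redact else author, message)
        for author, message in commits
    ]
    old_ids = build_history(commits)
    new_ids = build_history(rewritten)
    return new_ids, dict(zip(old_ids, new_ids))


# Hypothetical three-commit history; only the middle author is redacted.
history = [
    ("alice@example.invalid", "Initial import"),
    ("bob@example.invalid", "Fix bug reported on the list"),
    ("alice@example.invalid", "Release"),
]
old_ids = build_history(history)
new_ids, old_to_new = redact_history(history, "bob@example.invalid")
```

The first commit keeps its id, but the redacted commit and every commit after it get a new one, because each id covers its parent's id. That is why any signature made over the old ids stops verifying against the rewritten history.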

And heaven help you if you're a security conscious group like the Linux kernel and use signed commits. I guess the person who does the redaction would sign the new commits, but that's pretty yucky -- that person could do anything and nobody would know when it happened because you have to delete the old commits and blobs that get redacted.

Yep.

As I understand the "right to be forgotten", it's *not* a right to arbitrarily edit content stored by someone else, it's the right to redact *all* PII in that content.

Agreed.

In this case, I don't think that supercalifragilisticexpialidocious qualifies under GDPR's right to be forgotten. }:-)

It's not just messages from a person, it's headers containing their name and email address, attribution lines for quoted material, quoted .sigs, etc etc.

Agreed.

What about headers containing message ID from an uncommon / single user domain like mine? I'd say that anything that can be used to identify less than a group of 1000 people would probably need to be redacted. (I just chose 1000 arbitrarily, but it's a starting point.)
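As a sketch of how that threshold might be checked, the snippet below counts distinct addresses per domain and flags domains under the limit. Using the number of known addresses as a stand-in for "how many people share this domain" is my own assumption, as are the function names and sample data.

```python
from collections import Counter


def domain_of(address):
    """Domain part of an email address or Message-ID, lowercased."""
    return address.rsplit("@", 1)[-1].strip("<> ").lower()


def identifying_domains(addresses, threshold=1000):
    """Domains shared by fewer than `threshold` distinct addresses."""
    distinct = {a.strip().lower() for a in addresses}
    counts = Counter(domain_of(a) for a in distinct)
    return {domain for domain, n in counts.items() if n < threshold}


# Hypothetical subscriber list, with a tiny threshold to keep the example short.
subscribers = [
    "a@big.example",
    "b@big.example",
    "c@big.example",
    "me@tiny.example",
]
flagged = identifying_domains(subscribers, threshold=3)
```

A Message-ID such as `<20180522.1234@tiny.example>` would then need redacting if `domain_of()` of it lands in the flagged set.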

You're missing

0)  Randos accessing public archives.

What other modes have we collectively missed?

For (0), the only logging would be IP addresses in the webserver.

True.

No. The accessing IPs will be in the webserver logs, but I don't think there is any logging in either Mailman 2 or Mailman 3 of authentication data. All there would be is the implication that authentication was successful if that data were accessed.

Okay.

I wonder if there's any correlation between the IP that authenticated and the IP that accessed data.

In Mailman 2 there's no PII data whatsoever except for email address and (maybe) display name in the subscriber data.

I expect that either of those, the email address -or- the display name are enough to count as PII.

I believe it's fair to say that people expect gtaylor (at) tnetconsulting (dot) net to reference a single person. I also believe it's fair to say that most people expect most email addresses to be associated with one person. The only exceptions to the rule are things like positional addresses: sales@ or info@ or webmaster@.

I suppose you could put phone #s and junk like that in the display name, but GDPR is more concerned with the database fields that might store PII than the actual content.

1) I'd consider the phone numbers in the display name to be a form of display name. 2) *sigh* It sounds like GDPR is talking about specific fields that could contain PII, even if they don't, while ignoring other fields that erroneously do contain PII.

However, in Mailman 2 the various list passwords are shared, and would not identify individuals in cases with multiple moderators or list owners.

IMHO that's an operational misstep. I get that it does happen. But I think that it shouldn't. People tend to share the root password on Unix too, despite multiple other options that make it unnecessary.

Indeed. The problem is identifying them if they do, since they can just use normal filesystem operations from the shell, which are not normally logged at all.

Where I've worked, it was assumed that if you had an ID on the box and file system level permission to access things then you effectively had accessed it. — If you can't prove that they didn't access the data, then you assume that they did access the data.

In Mailman 3, we can configure databases like PostgreSQL, which I suppose can log access to the subscriber databases, and which make it hard (but not impossible) to access data via ordinary filesystem operations.

Having an RDBMS (et al) manage the files doesn't prevent file level access. I can very likely still copy the DB file(s) and do my own thing with them to extract the data.

This is where (and why) DB encryption comes into play. Though, if a rogue admin has access to the decryption key through any method (this includes extracting it out of memory), the encryption doesn't help. }:-)

However, I think that the issue here is basically moot. You keep host access logs to check for suspicious IP addresses (attempting to) log in, and otherwise (for #2 and #3) you just give the list of all the people who can access that data in the normal course of their duties.

Yep.

I don't think the issue with logging is pinning down a particular access to specific data, but rather determining who *could* access that data.

Yep. Yep.

The relevant access might have been by a long-since fired engineer who did a Snowden on your database. How could you possibly know?

Yep. Yep. Yep.

I don't understand the "exclude third party site hosters". The GDPR requirement is not to *limit* access, it's to *log* access.

I was trying to imply that companies would need to host their own list servers. Meaning that they couldn't outsource it to 3rd party companies, who have their own host system administrators.

I'm pretty sure they're referring to CRM-type databases where you track customer interactions over time, linked by PII, and build up a profile. One-off "for sale" posts wouldn't matter. However, if this were a common activity on the list, the *archives* might qualify as such a database.

~chuckle~

How many grains of sand does it take to make a pile?

IMHO none.  You just have to declare the pile's location.

Sure, the point is to make it difficult for 3rd parties to discover that history ex post.

Okay. I want to make sure I'm understanding you correctly. (Part of) GDPR is not about (just) knowing who has (had at the time) legitimate access to data, but additionally making it more difficult for other 3rd parties to gain access to the data in the future, by virtue of the data being removed from the corpus that the 3rd party is subsequently given access to.

I don't think the legislators envisioned people invoking these rights frivolously or maliciously (though I do :-/).

Agreed.

Backups would need to be redacted as well, I suppose.

Um... that also presents a severe technical problem. One that could impose large operational expenses. Suppose a company contracts to store their backup tapes off site. This means that they would need to recall the tapes that need to be redacted, redact them, and send the tapes back to the offsite storage. This may involve an additional company that is simply the courier. Let's not forget about the offsite company's handling fees and the courier's fees, both ways for each tape. Let's also throw in company policies that dictate that only X number of drives can be in transit or recalled at one time. That's a logistical nightmare that could take more than a trivial amount of time to complete, at untold cost. Ouch!

I have no idea what you mean by "ongoing discovery".

Ah.

Let's say that Wile E. Coyote decides to sue Acme because of their bad products. As soon as the lawsuit is initiated, chances are very good that Acme's lawyers will 1) tell them to destroy all records or 2) tell Acme's IT staff that they can no longer rotate out any backups that may contain data pertinent to the lawsuit. This is to facilitate the legal process of discovering evidence to be used in the case. (Whether that evidence is for or against Mr. Coyote doesn't matter.)

I frequently hear this referred to as one of two things: "Litigation Hold" or "(Electronic) Discovery". Discovery is the more common term and applies to more than just electronics.

Not Mailman host's problem, assuming all subscribers have properly been opted in and are allowed to opt out at will, as is normally the case.

What about that pesky time where the moderator hasn't approved the unsubscribe request? (I think I remember seeing that option in Mailman.)

Distributing content downstream is the purpose of the software, and subscribers are aware of that. The only edge cases I can imagine offhand are the one discussed elsewhere in the thread, where a subscriber posts a third party's information without permission, and possibly an open-post list where the poster doesn't realize that it's open subscription/public archives/whatever.

I think you misinterpreted what I was referring to. Or I'm misinterpreting your reply.

I'm talking about 3rd party spam filtering services that sit downstream, in the path between Mailman and the recipient's server. They collect logs / data all the time. Usually those logs and that data are what help them be better at their job of spam filtering.

Not Mailman host's problem.

Okay.

Sure, but you probably won't like what the courts consider reasonable.

"reasonable" is always subject to deliberation.

Lawyers get paid to tell a judge that "It will cost $Company $50,000 to recover the messages that $Plaintiff is requesting from $Defendant as part of their sunshine law request. Here's why:

1)  We don't have a server that we can use so we must buy a low end machine. (Legit, when there is only one mail server and the business can't be without mail for days / weeks.)
2)  We need another tape drive to do the restores.
3)  It will take $X number of (wo)man hours at $Y dollars per hour.
4)  We, $Defendant's lawyers must go through the emails at $YYYYY dollars per hour to make sure there's nothing given out that's outside of the sunshine law request.
5)  You just expanded the scope of your discovery? Well, now we need to increase #1 and #2 to go through the last 5 years of things in the next three weeks. Also #3 and #4. }:-)

So … the total bill for your sunshine request comes to just over $50,000. Are you willing to pay that bill to get an answer to your question via a sunshine law request?

Aside: A sunshine law request is a request from a citizen to a governmental body for data that was arguably paid for by tax funding and on behalf of citizens, thus the citizen effectively owns the data in a roundabout way. I don't know how widespread that is.

You lock up the backups offline unless and until the court asks for them or you actually need to restore. That reasonably addresses the privacy issue itself, and you're covered by the "essential to business purpose" clause for the duration of the court order.

6)  We have to buy additional tapes to replace the tapes that are on Lit' Hold.
7)  We have to pay for more storage to accommodate #6. (Or we have to pay someone to house the tapes in a secure manner.)

I digress.



--
Grant. . . .
unix || die