Re: GDPR compliance best practices?

2018-06-13 Thread Peter Backes
On Wed, Jun 13, 2018 at 10:12:18AM -0400, Theodore Y. Ts'o wrote:
> Sure, but given that you are the one trying to claim that people need
> to do all sorts of extra development work (I don't see any patches

No. I am not. I said it is desirable to have a convenient solution for 
the problem. I did not demand development work or patches from anyone, 
just kindly asked for a comment on a possible solution.

> from you) and suffer performance degredation, the burden of proof is
> on _you_ to show that this is a problem that github, et. al., are
> likely run into.

*You* claimed there was performance degradation, not me.

That github et. al. will sooner or later receive such erasure requests 
is a practical certainty. Google receives them every day in large 
quantities. Just think about someone who committed smelly code on 
github and now wants to get a new job and wants to get rid of all 
associations with those smells.

> In particular, keep in mind that distribution of open source code can
> only be done under the terms of an open source license --- and a
> license is a contract.

Not that it would be relevant here, but, depending on jurisdication, it 
is highly controversial whether open source licenses really constitute 
contracts (or, for example, promissory estoppel).

For the right to erasure, it does not matter whether a contract exists 
or not.

The GDPR explicitly prohibits any use of contracts in a way that 
undermines the GDPR. Making it an irrevocable contractual obligation to 
publish the data is not going to be an excuse thus. And Free Software 
licenses have nothing whatsoever to do with repository metadata. Such 
software has existed long before version control became so popular.

> So in particular, your claim that the data is
> no longer necessary (point a) is at the very least going to be subject

No, it is github's claim that it must no longer be necessary for being 
erased, not mine!

I clearly stated that if ANY point (not: ALL points) is given, the data 
must be deleted.

Thus, point b, c, d or any other are just as good as point a.

> to dispute and is a legal question.  I can think of any number of ways
> that this could considered necessary in order to assure open source
> license compliance, the public interest in terms of allowing forking,
> etc.

To claim that the data is necessary (which is, as I said, irrelevant) 
and then say it's not because you can as well use a dummy user string, 
is self-contradicting.

> The bottom line is I'm sure the lawyers at github and Microsoft have
> very carefully done their due diligence, and if they are concerned,
> I'm sure we'll see patches from them, since after all, they would not

Why should they be concerned? They can rewrite history if necessary. 
They have a solution, though an inconvenient one. As far as the lawyers 
are concerned, that solution is pefectly fine.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-12 Thread Peter Backes
On Tue, Jun 12, 2018 at 11:56:13AM -0700, David Lang wrote:
> [quoting github]
> 
> It's important to remember that the Right to Erasure only applies to
> personal data, not all data. It only applies to data a controller (GitHub,
> for example) is processing _solely_ on the basis of consent.

This is very obviously wrong. See Art. 17 GDPR. Consent is only one of 
the explicitly mentioned grounds for deletion (it is (1) lit b, but 
there's also a and c to f).

> And it only
> applies when there's not another legal reason to keep the data -- for
> instance, if the data is no longer necessary for the purpose for which it
> was collected.

This incorrect claim is completely inverting the logic of Art. 17.

The logic is clarly that if ANY of lit (a) to (f) is satisfied, the 
data must be deleted.

It is not necessary for ALL of them to be satisfied.

In particular, if the data is no longer necessary for the purpose for 
which it was collected, then THAT ALONE is grounds for erasure ((1) 
lit. a). It does not matter at all whether processing was consent-based 
or whether such consent was withdrawn.

> We do not process Git commit history on the basis of consent. We have a
> legitimate business purpose for collecting Git commit history: to maintain
> the integrity of the Git commit record. It remains necessary for its purpose
> for as long as a commit needs to be attributable to its committer.

Right, but this merely justifies storing the data, not publishing it, 
or keeping it published, as I already explained at length.

> At GitHub, as part of our Privacy By Design work, we offer ways for users to
> set their own Git commit email data, so if an individual wants to remain
> anonymous or pseudonymous, he or she can do so.

Not only is this contradicting fundamentally what they just said in the 
previous sentence, it is not a justification for ignoring the right to 
erasure either. It is exactly the purpose of the right to erasure to 
get the data erased *after* the fact.

> We also explain, in our
> [Privacy
> Statement](https://help.github.com/articles/github-privacy-statement), that
> we are not able to delete personal data from the Git commit history once it
> has been recorded.

Privacy Statements are not a justification under GDPR for processing 
data or ignoring the right to erasure.

And oh yes they are able. Rewriting history is a possibility, though an 
inconvenient one.

I have pointed towards more convenient solutions.

> I'll point out that not only did the Github lawyers need to sign off on this
> stance, but the Microsoft lawyers would have looked at it as well as part of
> their purchase of Github.

So? If a thousand lawyers claim 1+1=3, it becomes a mathematical truth?

Best wishes
Peter


-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-08 Thread Peter Backes
On Fri, Jun 08, 2018 at 10:45:51AM -0400, Theodore Y. Ts'o wrote:
> *Anyone* can run a repository.  It's not just github and gitlab.  The
> hobbiest in New Zealand, who might never visit Europe (so she can't
> be arrested when she visits the fair shores of Europe) and who has no
> business interests in Europe, can host such a web site.

Just because letters of request are hardly enforced doesn't make it 
legal to break the GDPR. For sure, a hobbyist would not have much to 
fear, even if he is violating the GDPR and coming to Europe. The GDPR 
is mostly about taming the megacorporations, not about arresting 
tourists.

> So the person trying to engage in censorship

Censorship? The GDPR is not about censorship.

If you want to write an opionion about someone by name, the GDPR gives 
you all legitimization to do so, against that person's will.

This is about removing the data under ordinary circumstances.

> would need to contact *everyone*.

This is the subject's problem, not the repository provider's.

> And someone who has a git note in their private repo who
> then pushes to github/gitlab would end up pushing that note back up to
> the web server.

If that note has been deleted based on the right to be forgotten, you 
as the repository provider have to make sure you don't publish it 
again. Since you are allowed to keep a private copy, ensuring that 
shouldn't be a problem for you. 

> Great, so you can get github and gitlab to get rid of the information.
> But it's *pointless*.

It's up to the subject to consider it pointless or not to exercise his 
rights...

> Your problem is in the word: "a"

...and against whom, whether one repository provider, the major ones, 
all of them he can find.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-08 Thread Peter Backes
On Fri, Jun 08, 2018 at 10:13:20AM +0200, Ævar Arnfjörð Bjarmason wrote:
> Can you walk us through how anyone would be expected to fork (as create
> a new project, not the github-ism) existing projects under such a
> regiment?

I don't see your point. Copy the repository to fork. Nothing changes 
about that. Nothing prevents anyone from forking a repository which had 
some of its author names removed from the commits.

> As David Lang notes upthread, "the license is granted to the world, so
> the world has an interest in it". I wouldn't be so sure that this line
> of argument wouldn't work.

As I already stressed, having an interest is not enough. You need to 
have overriding legitimate grounds.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-08 Thread Peter Backes
On Fri, Jun 08, 2018 at 12:42:54AM -0700, David Lang wrote:
> Wrong, if you have to delete info, you are not allowed to keep a private
> copy.

Yes you are allowed. See Art. 17 (3) lit e GDPR.

> There is _nothing_ in the GDPR about publishing information,
> everything in it is about what you are allowed to store privately, how you
> are required to protect it (or more precisely, what you are required to do
> if private data gets hacked), and how you are required to keep it available.

Nope, the GDPR is not at all restricted to private copies.

The GDPR has special jargon for publishing; the GDPR calls it 
"disclosure (Art. 4 (2) GDPR) to an unspecified number of unspecified 
recipients (Art. 4 (9) GDPR), including ones in third countries 
(Chapter 5) in a repetitive (Art 49 (1) GDPR) fashion".

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-08 Thread Peter Backes
On Thu, Jun 07, 2018 at 10:53:13PM -0400, Theodore Y. Ts'o wrote:
> The problem is you've left undefined who is "you"?  With an open
> source project, anyone who has contributed to open source project has
> a copyright interest.  That hobbyist in German who submitted a patch?
> They have a copyright interest.  That US Company based in Redmond,
> Washington?  They own a copyright interest.  Huawei in China?  They
> have a copyright interest.
> 
> So there is no "privately".  And "you" numbers in the thousands and
> thousands of copyright holders of portions of the open source code.

Of course there is "privately". Every single one of those who have the 
author information can keep it, privately, for themselves. But those 
that have received a request to be forgotten must not keep publishing 
it on the Internet for download or distribute it to others.

> And of course, that's the other thing you seem to fundamentally not
> understand about how git works.  Every developer in the world working
> on that open source project has their own copy.  There is
> fundamentally no way that you can expunge that information from every
> single git repository in the world.

The misunderstanding is on your side.

If you run a website where the world can access a repository, you are 
responsible for obeying the GDPR with respect to that repository. If 
you receive a request to be forgotten, you have to make sure you stop 
publishing that author's identity as part of the repository.

You do NOT need to

- delete it from a private copy you have
- care about others who publish that data
- or even make sure the data is deleted from private copies others may 
have, even if the number of copies is in the thousands.

In practical terms, if someone wishes to exercise his right to be 
forgotten, he will usually send the request to the maintainer and stop 
him from distributing the information, and perhaps to a third party he 
might use as a platform for publication, such as github.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-08 Thread Peter Backes
On Thu, Jun 07, 2018 at 04:53:16PM -0700, David Lang wrote:
> the license is granted to the world, so the world has an interest in it.

Certainly, but you need to have overriding legitimate grounds. An 
interest is not enough for justification. You have to weight your 
interests against those of the subject.

> Unless you are going to argue that the GDPR outlawed open source 
> development.

No it certainly did not and I don't see how it could.

All the GDPR arguably demands is that the author's identity is deleted 
from a public repository if he wishes so.

Just assume it was a CVS repo. Then removal would not be any issue at 
all. It is a technical speciality of git that makes the removal so 
intricate to implement, which is not at all an intrinsic property of 
open source development.

> you are the one arguing that the GDPR prohibits Git from storing and
> revealing this license granting data, not me.

It prohibits publishing, and only after a request to be forgotten. It 
does not prohibit storing your private copy.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > Again: The GDPR certainly allows you to keep a proof of copyright
> > privately if you have it. However, it does not allow you to keep
> > publishing it if someone exercises his right to be forgotten.
> someone is granting the world the right to use the code and you are claiming
> that the evidence that they have granted this right is illegal to have?

Hell no! Please read what I wrote:

- "allows you to keep a proof ... privately"
- "However, it does not allow you to keep publishing it"

> And you are incorrect to say that the GDPR lets you keep records privately
> and only applies to publishing them. The GDPR is specifically targeted at
> companies like Facebook and Google that want to keep lots of data privately.
> It does no good to ask Facebook to not publish your info, they don't want to
> publish it in the first place, they want to keep it internally and use it.

How can you misunderstand so badly what I wrote.

Sure the GDPR does not let you keep records privately at will. You 
ultimately need to have overriding legitimate grounds for doing so. 

However, overriding legitimate grounds for keeping private records are 
rarely overriding legitimate grounds for publishing them.

In case of git history metadata, for publishing, you may have consent 
or even legitimate interests, but not overriding legitimate grounds. 
For keeping a private copy of the metadata, your probably have 
overriding legitimate grounds, however.

The GDPR is not an "all or nothing" thing.

Facebook and Google certainly do not have overriding legitimate grounds 
for most of the data they keep privately.

Is it that so hard to understand?

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
On Thu, Jun 07, 2018 at 10:28:47PM +0100, Philip Oakley wrote:
> Some of Peter's fine distinctions may be technically valid, but that does
> not stop there being legal grounds. The proof of copyright is a legal
> grounds.

Again: The GDPR certainly allows you to keep a proof of copyright 
privately if you have it. However, it does not allow you to keep 
publishing it if someone exercises his right to be forgotten.

There is simply no justification for publishing against the explicit 
will of the subject, except for the rare circumstances where there are 
overriding legitimate grounds for doing so. I hardly see those for the 
average author entry in your everyday git repo. Such a justification is 
extremely fragile.

> Unfortunately once one gets into legal nitpicking the wording becomes
> tortuous and helps no-one.

That's not nitpicking. If what you say were true, the GDPR would be 
without any practical validity at all.

> If one starts from an absolute "right to be forgotten" perspective one can
> demand all evidence of wrong doing , or authority to do something, be
> forgotten. The GDPR has the right to retain such evidence.

Yes, but not to keep it published.

> I'll try and comment where I see the distinctions to be.

You're essentially repeating what you already said there.

> Publishing (the meta data) is *distinct* from having it.

Absolutely right. That is my point.

> You either start off public and stay public, or you start off private and
> stay there.

Nope. The GDPR says you have to go from public to private if the 
subject wishes so and there are no overriding legitimate grounds.

That is the entire purpose of the GDPR's right to be forgotten.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
Hi David,

thanks for your input on the issue.

> LEGAL GDPR NOTICE:
> According to the European data protection laws (GDPR), we would like to make 
> you
> aware that contributing to rsyslog via git will permanently store the
> name and email address you provide as well as the actual commit and the
> time and date you made it inside git's version history. This is inevitable,
> because it is a main feature git.

As we can, see, rsyslog tries to solve the issue by the already 
discussed legal "technology" of disclaimers (which is certainly not 
accepted as state of the art technology by the GDPR). In essence, they 
are giving excuses for why they are not honoring the right to be 
forgotten.

Disclaimers do not work. They have no legal effect, they are placebos.

The GDPR does not accept such excuses. If it would, companies could 
arbitrarily design their data storage such as to make it "the main 
feature" to not honor the right to be forgotten and/or other GDPR 
rights. It is obvious that this cannot work, as it would completely 
undermine those rights.

The GDPR honors technology as a means to protect the individual's 
rights, not as a means to subvert them.

> If you are concerned about your
> privacy, we strongly recommend to use
> 
> --author "anonymous "
> 
> together with your commit.

This can only be a solution if the project rejects any commits which 
are not anonymous.

> However, we have valid reasons why we cannot remove that information
> later on. The reasons are:
> 
> * this would break git history and make future merges unworkable

This is not a valid excuse (see above). The technology has to be 
designed or applied in such a way that the individuals rights are 
honored, not the other way around.

In absence of other means, the project has to rewrite history if it 
gets a valid request by someone exercising his right to be forgotten, 
even if that causes a lot of hazzle for everyone.

> * the rsyslog projects has legitimate interest to keep a permanent record of 
> the
>   contributor identity, once given, for
>   - copyright verification
>   - being able to provide proof should a malicious commit be made

True, but that doesn't justify publishing that information and keeping 
it published even when someone exercises his right to be forgotten.

In that case, "legitimate interest" is not enough. There need to be 
"overriding legitimate grounds". I don't see them here.

> Please also note that your commit is public and as such will potentially be
> processed by many third-parties. Git's distributed nature makes it impossible
> to track where exactly your commit, and thus your personal data, will be 
> stored
> and be processed. If you would not like to accept this risk, please do either
> commit anonymously or refrain from contributing to the rsyslog project.

This is one of those statements that ultimately say "we do not honor 
the GDPR; either accept that or don't submit". That's the old, arguably 
ignorant mentality, and won't stand.

The project has to have a legal basis for publishing the personal 
metadata contained in the repository. In doubt, it needs to be consent 
based, as that is practically the only basis that allows putting the 
data on the internet for everyone to download. And consent can be 
withdrawn at any time.

The GDPR's transitional period started over two years ago. There was 
enough time to get everything GDPR compliant.

It might be possible to implement my solution without changing git, 
btw. Simply use the anonymous hash as author name, and store the random 
number and the author as a git-notes. git-notes can be rewritten or 
deleted at any time without changing the commit ID. I am currently 
looking into this solution. One just needs to add something that can 
verify and resolve those anonymous hashes.

Best wishes
Peter


-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-04 Thread Peter Backes
On Mon, Jun 04, 2018 at 09:47:18AM -0400, Theodore Y. Ts'o wrote:
> For people who are doing real work on git repos, other commands that
> we very much care about include "git log --author=", "git
> tag --contains", "git blame", etc.

I do not see how those, or anything but git clone (and even that only 
if author verification is requested) could possibly be affected in any 
significant way.

Best wishes
Peter


-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 11:28:43PM +0100, Philip Oakley wrote:
> It is here that Article 6 kicks in as to whether the 'organisation' can
> retain the data and continue to use it.

Article 6 is not about continuing to use data. Article 6 is about 
having and even obtaining it in the first place.

Article 17 and article 21 are about continuing to use data.

> For an open source project with an open source licence then an implict DCO
> applies for the meta data. It is the legal  basis for the the release.

Neither article 6 nor 17 or 21 have anything remotely like an "implicit 
DCO" as a legitimization for publishing employee data.

The GDPR is very explicit about implicit stuff never being a basis for 
consent, if you want to imply that is your basis. And consent can be 
withdrawn at any time anyway.

An open source license has nothing whatsoever to do with the question 
of version control metadata. A public version control system is not 
necessary to publish open source software.

> > - copyright is about distributing the program, not about distributing
> > version control metadata.
> It is specificaly about giving that right to copy by Jane Doe (but git gives
> no other information other than that supposedly globally unique 'author
> email'.

I don't get what you are saying. As I said, a public version control 
system is not necessary to publish open source software. The two things 
may be intimately related in practice, but not in theory.

> > - Being named is a right, not an obligation of the author. Hence, if
> > the author doesn't want his name published, the company doesn't have
> > legitimate grounds based in copyright for doing it anyway, against his
> > or her will.
> Git for Open Source is about open licencing by name. I'd agree that a closed
> corporate licence stays closed, but not forgotten.

Again I don't get what you are saying. The author has a right to be 
named as the author, not an obligation. This has nothing whatsoever to 
do with the question of Open Source vs. closed corporate licenses.

> > Let's be honest: We do not know what legitimization exactly in each
> > specific case the git metadata is being distributed under.
> 
> We should know, already. A specific licence [or limit] should be in place.
> We don't really want to have to let a court decide ;-)

It is insufficient to have a license for distributing the program. The 
license is not a GDPR legitimization for git metadata. Distributing the 
program can be done without distributing the author's identity as part 
of the metadata of his commits.

> The law is never decided by technical means, unfortunately.

It is. The GDPR refers to the state of the art of technology without 
defining it. Thus, technical means are very important in the GDPR. This 
may be something new for lawyers. If technology changes tomorrow, even 
without anything else changing, you may be breaking the GDPR by this 
simple fact tomorrow, while not breaking it today.

Again: Technology is very important in the GDPR.

> Regular git users should have no issues - they just need to point 
> their finger at the responsible authority.

If git users are putting commits online for global download, they are 
the responsible authority.

> The DCO/GPL2 are the legitimate data record that recipients should have for
> their copy. There is no right to be forgotten at that point.

What do you mean by "should have for their copy"? Why shouldn't there 
be a right to be forgotten? Open Source Software has been distributed a 
lot without detailed version control history information. Having this 
information as a record is certainly in the interest of the recipient, 
but it is very very questionable that it is an overriding legitimate 
grounds as per Art. 17 for keeping that data.

> I see the solution to be elsewhere, and that it is in some ways a strawman
> discussion: "if someone has the right to be forgotten, how do we delete the
> meta data", when that right (to delete the meta data in a properly licence
> repo) does not exist.

See, this kind of shady legal argument is what lawyers are selling you. 
Why not put the energy into designing a technical solution.

They tell you: "Ignore the GDPR. I will give you backup by giving you 
lots of disclaimers and excuses for doing so. Just give me a lot of 
money."

Having the ability to validate yet erase data form repositorys is 
desirable from a technical point of view. It has a lot of uses, not 
necessarily only legal ones. The objection of efficiency raised by Ted 
is a valid one. The strawman argument is not.

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 05:03:44PM -0400, Theodore Y. Ts'o wrote:
> If you don't think a potential 2x -- 10x performance hit isn't a
> blocking factor --- sure, go ahead and try implementing it.  And good
> luck to you.  And this is not a guarantee that it won't get rejected.
> I certainly don't have the power to make that guarantee.

I do not want or expect a guarantee, or even a probability, of course. 
Just trying to avoid "STRONG REJECT. We could have said you before you 
even started implementing. Why didn't you discuss this beforehand?"

One would simply change something like

author A U Thor  1465982009 +

into something like

author 21bbba8e9ce9734022d2c23df247a2704c0320ad7d43c02e8bdecdfae27e23b4 A U 
Thor 
author-hash 469bb107e38f8e59dddb3bbd6f8646e052bf73d48427865563c7358a64467f2c
authordate c444f739ca317e09dbd3dae1207065585ae2c2e18cd0fc434b5bde08df1e0569 
1465982009 +
authordate-hash 199875e5aedb6cb164a2b40c16209dc5bb37f34c059a56c6d96766440fb0fe68

and then compute the commit id without the "author" and the 
"authordate" lines.

The *-hash values were obtained as follows:

echo -n '21bbba8e9ce9734022d2c23df247a2704c0320ad7d43c02e8bdecdfae27e23b4 A U 
Thor ' | sha3sum -a 256
echo -n 'c444f739ca317e09dbd3dae1207065585ae2c2e18cd0fc434b5bde08df1e0569 
1465982009 +' | sha3sum -a 256

The hex values here are simply the $huge_random_numbers

Verifying the commit ID by itself wouldn't be any less efficient than 
before. Admitteldly, it wouldn't verify the author and authordate 
integrity anymore without additional work. That would be some overhead, 
sure, and could be done on demand, and would mostly affect clones. I 
don't think it would be that much of a problem. It can be parallelized 
easily. The hashes for each field are independent of each other. They 
can all be verified in parallel in different threads running on 
different cores.

On djb's typical 2015 skylake machine the supercop benchmark tells us 
that sha3-256 (~=keccakc512) has a speed of about 20 cycles/byte for 
blocks of 64 bytes of data, see 
https://bench.cr.yp.to/results-sha3.html#amd64-skylake

Let's say we have 128 bytes of data on average for the author field, so 
conservatively speaking it takes about 3000 cycles (> 128*20) to hash 
and compare the hash.

At 3000 MHz, we can thus do roughly about 1000 verifications per second 
per core.

Let's assume we have 10 anonymizable fields of this kind per commit.

Then the overhead would be one second per 100 x ncores commits.

How many commits are we talking about in a huge repository? And how 
long does a clone of such a huge repository take at the moment? Do you 
have any numbers?

> If you don't have time to implement, why do you think it's fair to
> inflict on everyone else the request for time to do a design review
> for something for which the need hasn't even been established?

I do not request from anyone to even reply to my messages. I just see a 
lot of time being wasted by discussing things about my proposal that 
are technically irrelevant. If that time were put into reviewing the 
design, it would be spent better.

Please don't devalue a proposal. It is not true that the only value is 
in actual code and proposals are "bullshit".

I was not the first to raise the issue, as I clearly showed in my 
initial email.

The demand is in fact high; very high. At present, that demand is 
satisfied by lawyers. Who are writing snake oil disclaimers and such 
for enormous sums of money. In a lot of companies. To "solve" a 
technical issue by pseudo-legal means by finding excuses for why the 
"right to be forgotten" doesn't have to be implemented in specific 
cases such as git. What if all that lawyer money were put into actually 
solving the technical issues as technical issues? Engineers are 
apparently bad at marketing, the lawyers seem more successful in that 
respect.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 04:07:39PM -0400, Theodore Y. Ts'o wrote:
> Why don't you try to implement your proposal then, and then benchmark
> it.  After you find out how much of a performance disaster it's going
> to be, especially for large git repos, we can discuss who is being
> tyrannical.

See, Ted, but I have this other hobby project with git stash preserving 
timestamps, which is 90% done but not yet finished. I am a very busy 
person. I might implement it but it's not the topmost priority. Thus, 
first I want to discuss to not waste too much time implementing 
something that's then rejected by valid criticism while that criticms 
could have been raised beforehand. Perhaps I can convince my employer 
to work on it on their account. But there's so much to do at the moment.

I have a PhD, about very complex things like static program analysis by 
abstract interpretation. I love hacking very much but I can mostly only 
do it as a hobby because humanity is better served doing the complex 
things that not every hacker can do.

I know I am being whiny but that's how it is.

But I will take your message as saying you at least don't see any 
obvious criticism leading to complete rejection of the approach.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 09:48:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Sure, but what I'm pointing out is a) you can't focus on git as the
> technology because it tells you nothing about what's being done with it
> (e.g. the log file case I mentioned b) nobody who came up with the GDPR
> was concerned with some free software projects or the SCM used by
> companies, so this is very unlikely to be enforced.

As I already said, the GDPR refers to the state of the art in 
technology, without defining it.

The GDPR provides a generic framework. It covers everyone. From a 
single person running a small blog to a S enterprise. It also 
covers non-profits and state authorities. Everyone is covered. 
Including SCM used.

The GDPR will be enforced against SCMs. The question is just who will 
be the first to be affected. I suspect it will be a mega-corporation 
who fired one of their developers who wants to fight back and exercise 
his right to be forgotten against the company's public git repos.

> So nobody can be GDPR compliant in the face of archive.org and the like?

The GDPR has special exceptions for archives and the like.

> It does if you've got the ref. Maybe I just don't get your proposal,
> quote:
> 
> Do not hash anything directly to obtain the commit ID. Instead, hash a
> list of hashes of [$random_number, $information] pairs. $information
> could be an author id, a commit date, a comment, or anything else. Then
> store the commit id, the list of hashes, and the list of pairs to form
> the commit.
> 
> You're just proposing (if I've read this correctly) that the commit
> object should have some list of headers pointing to other SHA1s, and
> that fsck and the like be OK with these going away. Right?

Certainly not SHA1. SHA1 is completely broken. I know Linus has a bit 
of a different opinion. But there's really no defense for SHA1. It's an 
utterly broken algorithm and should not be used at all anymore.

> How is this intrinsically different from referring to something in the
> ref namespace that may be deleted in the future?

I guess I am partly repeating myself, but:

1. Having fsck be OK with erasure is not enough. It tells you nothing 
about anonymization. If the hash is the same in 5000 instances that's 
pseudonymization, not anonymization. You need to ensure a different 
hash in each instance, and you need to ensure there's no easy way to 
reconstruct the data from its hash. Hence $random_number (or let's call 
it $huge_random_number, it should have x bits if the hash has x bits). 
If you have the SHA1 64ca93f83bb29b51d8cbd6f3e6a8daff2e08d3ec it's too 
easy to figure out the plaintext (it's "Peter" BTW).

2. If you use a random UUID you cannot reconstruct the data from its 
hash, but you have the same issue about UUID reuse. Plus, you lose the 
ability to verify the author's name as part of the commit.

> Okey, so you're not reading the GDPR in some literal sense, but you're
> coming to a conclusion that's supported by ... what? To echo Theodore
> Y. Ts'o E-Mail have you consulted with someone who's an actual lawyer on
> this subject?

I'm replying in private conversation about this one. It's not relevant 
for this discussion.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
Addendum:

I one discussed with a philosopher the question: What is your argument 
against libertarianism?

He said: It would be a tyranny of lawyers.

Let's not have a tyranny of lawyers. Let us, the engineers and hackers, 
exercise the necessary control over those pesky lawyers by defining and 
redefining the state of the art in technology, and prevent them from 
defining it by themselves. For a hammer, everything looks like a nail. 
What is the better options: To suggest people to pay for legal advice 
by lawyers, who only offer lengthy disclaimers and such for bypassing 
the right to be forgotten, or simply discuss technical changes for git 
which enable its easy implementation, without legal excuses for not 
doing supporting it?

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 02:18:07PM -0400, Theodore Y. Ts'o wrote:
> I would gently suggest that if you really want to engage in something
> practical than speculating how the GPDR compliance will work out in
> actual practice, that you contact a lawyer and get official legal
> advice?

I completely disagree.

Erasure is a technical issue to be solved by engineers, not by lawyers.

And that's completely in line with the GDPR. The GDPR is ultimately not 
a legal thing to be solved by lawyers writing lengthy legal 
argumentations and disclaimers and such. They are not even the ones to 
take lead in GDPR implementation. All that would be simply snake oil. 
Some legal documentation may be necessary, and having a competent 
lawyer in a GDPR compliance task force is certainly a must. But that 
gets you done only 20% of the job, 80% is engineering. Every lawyer who 
claims to give you shady legal tricks to get the job 100% done in no 
time is a liar.

The GDPRs ultimate goal is to incline the world to improve how data 
protection is implemented on a technical level. The GDPR contains 
several blanket clauses that refer to the "state of the art" of 
technology, which the GDPR itself of course does not define and which 
is of course nothing a lawyer has any competence in.

My proposal is a technical, not a legal one: Provide a generic 
possibility of having eraseability and verifiability at the same time 
in git. Improve the state of the art in version control such that it is 
more in line with the GDPRs idea that people have a right to be 
forgotten, but to also be useful for a multitude of other applications. 
The lawyers can then build on this.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 04:28:31PM +0100, Philip Oakley wrote:
> In most Git cases that legal/legitimate purpose is the copyright licence,
> and/or corporate employment. That is, Jane wrote it, hence X has a legal
> rights of use, and we need to have a record of that (Jane wrote it) as
> evidence of that (I'm X, I can use it) right. That would mean that Jane
> cannot just ask to have that record removed and expect it to be removed.

Re corporate employment:

For sure nobody would dare to quesion that a company has a right to 
keep an internal record that Jane wrote it.

The issue is publishing that information. This is an entirely different 
story.

I already stressed that from the very beginning.

Re copyright license:

No, a copyright license does not provide a legitimization.

- copyright is about distributing the program, not about distributing 
version control metadata.

- Being named is a right, not an obligation of the author. Hence, if 
the author doesn't want his name published, the company doesn't have 
legitimate grounds based in copyright for doing it anyway, against his 
or her will.

> From a personal view, many folk want it to be that corporates (and open
> source organisations) should hold no personal information with having
> explicit permission that can then be withdrawn, with deletion to follow.
> However that 'legal' clause does [generally] win.

Let's be honest: We do not know what legitimization exactly in each 
specific case the git metadata is being distributed under.

It may be copyright, it may be employment, but it may also be revocable 
consent. This is, we cannot safely assume that no git user will ever 
have to deal with a legitimate request based on the right to be 
forgotten.

> In the git.git case (and linux.git) there is the DCO (to back up the GLP2)
> as an explicit requirement/certification that puts the information into the
> legal evidence category. IIUC almost all copyright ends up with a similar
> evidentail trail for the meta data.

This makes things more complicated, not less. You have yet more meta 
data to cope with, yet more opportunities to be bitten by the right to 
be forgotten. Since I proposed a list of metadata where each entry can 
be anonymized independently of each other, it would be able to deal 
with this perfectly.

> The more likely problem is if the content of the repo, rather than the meta
> data, is subject to GDPR, and that could easily ruin any storage method.
> Being able to mark an object as  would help here(*).

My proposal supports any part of the commit, including the contents of 
individual files, as eraseable, yet verifiable data.

> Also remember that most EU legislation is 'intent' based, rather than
> 'letter of', for the style of legal arguments (which is where some of the UK
> Brexit misunderstandings come from), so it is more than possible to get into
> the situation where an action is both mandated and illegal at the same time,
> so plent of snake oil salesman continue to sell magic fixes according to the
> customers local biases.

This may be true. I am not trying to sell snake oil, however. To have 
erasure and verifiability at the same time is a highly generic feature 
that may be desirable to have for a multitude of reasons, including but 
not limited to legal ones like GDPR and copyright violations.

> I do not believe Git has anything to worry about that wasn't already an
> issue.

Yes, but it definitely had and still does have something to worry about.

git should provide technical means to deal with this. I provided a 
proposal based on anonymization that does not in any way have any 
drawback compared to the status quo, except a slight increase in 
metadata size and various degrees of backwards incompatibility, 
depending on how it is implemented.

What do you think about my proposal as a solution for the problem?

You provide a lot of arguments about why it is not a necessity to have 
this, but let's assume it is; is there any actual problem you see with 
the proposal, except that someone would have to implement it?

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 02:59:26PM +0200, Ævar Arnfjörð Bjarmason wrote:
> I'm not trying to be selfish, I'm just trying to counter your literal
> reading of the law with a comment of "it'll depend".
> 
> Just like there's a law against public urination in many places, but
> this is applied very differently to someone taking a piss in front of
> parliament v.s. someone taking a piss in the forest on a hike, even
> though the law itself usually makes no distinction about the two.

We have huge companies using git now. This is not the tool used by a 
few kernel hackers anymore.

> In this example once you'd delete the UUID ref you don't have the UUID
> -> author mapping anymore (and b.t.w. that could be a many to one
> mapping).

It is not relevant whether you have that mapping or not, it is enough 
that with additional information you could obtain it. For example, say, 
you have 5000 commits with the same UUID. Now your delete the mapping. 
But your friend still has it on his local copy. Now your friendly 
merely needs to tell you who is behind that UUID and instantly you can 
associate all 5000 commits with that person again.

The GDPR is very explict about this, see recital 26. It says that 
pseudonymization is not enough, you need anonymization if you want to 
be free from regulation.

In addition, and in contrast to my proposal, your solution doesn't 
allow verification of the author field.

> I think again that this is taking too much of a literalist view. The
> intent of that policy is to ensure that companies like Google can't just
> close down their EU offices weasel out of compliance be saying "we're
> just doing business from the US, it doesn't apply to us".
> 
> It will not be used against anyone who's taking every reasonable
> precaution from doing business with EU customers.

I think you are underestimating the political intention behind the 
GDPR. It has kind of an imperialist goal, to set international 
standards, to enforce them against foreign companies and to pressure 
other nations to establish the same standards.

If I would read the GPDR in a literal sense, I would in fact come to 
the same conclusion as you: It's about companies doing substantial 
business in the EU. But the GDPR is carefully constructed in such a way 
that it is hard not to be affected by the GDPR in one way or another, 
and the obvious way to cope with that risk is to more or less obey the 
GDPR rules even if one does not have substantial business interests in 
the EU. 

> What do you imagine that this is going to be like? That some EU citizen
> is going to walk into a small business in South America one day, which
> somehow is violating the GPDR, and when that business owner goes on
> holiday to the EU they're going to get detained? Not even the US policy
> against Cuba is anywhere remotely close to that.

Well not if he's locally interacting with that business, a situation 
which I am sure is not regulated by the GDPR.

However, if a large US website accepts users from the EU and uses the 
data gathered in conflict with the GDPR, perhaps selling it for use in 
political campaigns, and it gets several fines for this by EU 
authorities but ignores them and doesn't pay them, and the CEO one day 
takes a flight to Frankfurt to continue by train to Switzerland to get 
some cash from his bank account, then he will most likely not reach 
Swiss territory.

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
On Sun, Jun 03, 2018 at 12:45:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
> protection". I.e. regulators / prosecutors are much likely to go after
> some advertising company than some project using a Git repo.

Well, it is indeed rather unlikely that one particular git repo project 
will be targeted, but I guess it is basically certain that at least 
some of them will be.

It is the same as a lottery, it's very unlikely you win the jackpot, 
yet someone wins it every few months. We should care about the entire 
community, not be too selfish.

> Since the Author is free-form this sort of thing doesn't need to be part
> of the git data format. You can just generate a UUID like
> "5c679eda-b4e5-4f35-b691-8e13862d4f79" and then set user.name to
> "refval:5c679eda-b4e5-4f35-b691-8e13862d4f79" and user.email to
> "refval:5c679eda-b4e5-4f35-b691-8e13862d4f79".

Well, this is merely pseudonymization, not anonymization. Note that the 
UUID, innocent as it may look, is not in any way less "personal data" 
than the author string itself. Your proposal would thus not actually 
solve the problem, only slightly transform it. Only when you truly 
anonymize (see my proposal about one way to to it), you can completely 
evade the GDPR.

> Sites that are paranoid about the GDPR could have a pre-receive hook
> rejecting any pushes from EU customers unless their commits were in this
> format.

This won't work either. The GDPR makes each data processor directly 
responsible in relation to the data subject. So it does not matter at 
all who is pushing, it matters who is in the author field of the 
commits that were pushed. And since you don't have any information 
about whether those authors are residing within the EU or not, you have 
to assume they are and you have to obey the GDPR. Even if you are 
outside the EU and do not have any subsidiaries within the EU, the GDPR 
sill applies as long as you are processing personal data of EU citizen. 
Perhaps the authorities in your country will refuse to obey letters of 
request if the EU authorities try to enforce the GDPR on an 
international scope, but if you have a record of GDPR violation and you 
ever set foot on EU territory, you are fair game.

> Instead I'll have a daily UUID issued from a government API

Heaven forbid. ;) There is an old German proverb, warning that even 
humorous trolling might be dangerous: "Man soll den Teufel nicht an die 
Wand malen!" ;)

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-03 Thread Peter Backes
Hi,

Unfortunatly this important topic of GDPR compliance has not seen much 
interest.

After asking github about how they would cope with the issue of erasing 
the author field, they changed their privacy policy, which now 
clarifies that this won't be done.

My guess is that this would ultimately rely on "overriding legitimate 
grounds for the processing" (Art. 17 (1) point (a) GDPR) which is one 
of the most fragile legitimizations avaiblable in the GDPR.

The GDPR emphasizes the importance of using state of the art 
technology, including anonymization, in as much as possible to ensure 
privacy.

At 
https://public-inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a92p0C2O6TA53==bzf...@mail.gmail.com/T/
 
there is already some discussion about transitioning to a different 
hashing algorithm to get more in line with state of the art in hashing. 
(My clear favourite would be SHA-3.)

In course of this, anonymization could also be added. My idea would be 
as follows:

Do not hash anything directly to obtain the commit ID. Instead, hash a 
list of hashes of [$random_number, $information] pairs. $information 
could be an author id, a commit date, a comment, or anything else. Then 
store the commit id, the list of hashes, and the list of pairs to form 
the commit.

If someone requests erasure, simply empty the corresponding pair in the 
list. All that would be left would be the hash of the pair, which is 
completely anonymous (not more useful than a random number) and thus 
not covered by the GDPR. The history could still be completely 
verified, and when displaying the log, the erased entry could be 
displayed as "<>".

What do you think about this?

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-04-17 Thread Peter Backes
On Tue, Apr 17, 2018 at 11:38:26PM +0200, Ævar Arnfjörð Bjarmason wrote:
> I've been loosely following a similar discussion around blockchains and
> my understanding of the situation is that for a project such as say
> Linux the GDPR gives you this potential out for that[1]:
> 
> "the personal data are no longer necessary in relation to the
> purposes for which they were collected or otherwise processed"
> 
> I.e. you understand that when you submit a patch to linux.git how it's
> going to get used, and that it's in a storage system that isn't going to
> be pruned just because you ask for it.
> [...]
> You can make a compelling case that for say submitting your data to the
> Bitcoin blockhcain the above quote from article 17 overrides it

Well, you're quoting from lit. a but there's also lit. b to f! It says 
"one of the following grounds applies", not "all of ...".

> This is very different from you say joining a company, committing to its
> internal git repo, and your name being there in perpetuity, or choosing
> to submit a patch to linux.git or git.git.
>
> I'd think that would be handled the same way as a structural engineering
> firm being able to record in perpetuity who it was that drew up the
> design for some bridge.

Internal repo is entirely unproblematic, since you don't need consent 
for doing that. It is covered by Art. 6 (1) lit. f.

The problem is public repos. Publishing employee information is 
generally considered not to be covered by Art. 6 (1) lit. f. After all, 
you can easily publish the software but not the repo.

> I don't think it's plausible that the GDPR,
> which is probably mainly going to be about consumer protection, is going
> to concern itself with that in practice.

Oh, no, GDPR is about privacy in general. It's not only about consumer 
protection. It applies in the same way to employees in relation to 
their employer and to citizens in relation to the authorities, and to 
open source contributors in relation to the projects, or to any other 
data processing outside family and friends (Art. 2 (2) lit. c).

I am inclined to assume that Art. 6 (1) lit. b might be the solution, 
since the licenses typically demand a history of changes to be 
distributed with the program (for example, GPLv3 section 5 a). After 
all, the author generally wants to be given credit for his changes and 
it can be assumed that this one of the conditions for licensing the 
work in the first place.

On the other hand, of course, the author could waive the condition at 
any time, which means Art. 6 (1) lit. b wouldn't apply anymore and 
you'd have the same issue as with consent-based processing of the 
information (lit. a).

Best wishes
Peter


-- 
Peter Backes, r...@helen.plasma.xg8.de


GDPR compliance best practices?

2018-04-17 Thread Peter Backes
Hi,

I'd like to ask whether anyone has best practices for achieving GDPR 
compliance for git repos? The GDPR will come into effect in the EU next 
month.

In particular, how do you cope with the "Right to erasure" concerning 
entries in the history of your git repos?

Erasing author names from the history changes the commit hashes.  It is 
well known that this leads to a lot of problems.  So I don't consider 
this a workable solution.

And how do you justify publishing your employee's name/email as part of 
a git commit under GDPR rules in the first place?

github has the following page mentioning the "Right to erasure" but 
AFAICS nothing about how it will be implemented
https://about.gitlab.com/gdpr/

Here are discussions I found but they do not really provide a solution:
https://law.stackexchange.com/questions/24623/gdpr-git-history
https://news.ycombinator.com/item?id=16509755

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-26 Thread 'Peter Backes'
On Mon, Feb 26, 2018 at 11:56:42AM +0100, Andreas Krey wrote:
> > The bigger issue is usually to copy with those pesky leap seconds. It 
> > makes a difference whether one uses solar seconds ("posix" style; those 
> > are more commonly seen) or atomic seconds ("right" style) for the UNIX 
> > timestamp.
> 
> Is there any system, unix or otherwise, that uses 'right'-style seconds,
> i.e. TAI, as its base?

Most certainly there is. This depends on the individual configuration 
of the system. On my Fedora system, the commonly used tzdata package 
off the shelf contains support for 'right' style versions of all 
timezones in /usr/share/zoneinfo/right If the user links one of those 
timezones to /etc/localtime or manually specifies them (like 
TZ=right/Europe/Berlin ls -l) they will be used.

You don't find a lot of those systems today, but those who used to use 
the 'right' timestamps might for legacy reasons explicitly configure 
their system to use those timezone variants. I personally did this for 
a number of years, but then converted the filesystems timestamps to 
'posix' and I am now exclusively using 'posix' ones.

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-21 Thread 'Peter Backes'
On Wed, Feb 21, 2018 at 06:58:34PM -0500, Randall S. Becker wrote:
> May I suggest storing the date/time in UTC+0 in all cases. I can see
> potential issues a couple of times a year where holes exist. I cannot even
> fathom what would happen on a merge or edit of history.

I consider storing the timestamp simply in the traditional 
seconds-since-epoch UNIX timestamp format. But I'm not entirely sure 
yet (see below).

If a timestamp includes the offset, there shouldn't be any issue with 
holes. UTC+0 is nice, too, of course, though some might want to 
preserve the timezone in which the timestamp was actually created.

The bigger issue is usually to copy with those pesky leap seconds. It 
makes a difference whether one uses solar seconds ("posix" style; those 
are more commonly seen) or atomic seconds ("right" style) for the UNIX 
timestamp. Those differences accumulate over time, so you can have 
almost half a minute delta if you are not careful with timestamp 
conversion. If I remember correctly, rcs uses some rather awkward 
interative convergence algorithm to portably convert from 
human-readable date and time to UNIX timestamps.

Thus I'm still not sure whether it will be a UNIX-format timestamp or 
whether a human-readable date/time might be preferrable.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-21 Thread Peter Backes
On Wed, Feb 21, 2018 at 11:44:13PM +0100, Ævar Arnfjörð Bjarmason wrote:
> If it were added as a first-level feature to git it would present a lot
> of UX confusion. E.g. you run "git add" and it'll be showing the mtime
> somehow, or you get a formatted patch over E-Mail and it doesn't only
> include the commit time but also times for individual files.

But that's pretty standard. patch format has timestamp fields for 
precisely this purpose:

% echo a > x  
% echo b > y
% diff -u x y
--- x   2018-02-21 23:56:29.574029523 +0100
+++ y   2018-02-21 23:56:31.430003389 +0100
@@ -1 +1 @@
-a
+b

At present, git simply leaves those fields blank...

> The VC systems that had this feature in the past were centralized, so
> they could (in theory anyway) ensure that timestamps were monotonically
> increasing. This won't be the case with git, we have plenty of timestamp
> drift in e.g. linux.git and other git repos.

I don't see where monotonicity would be an issue any more than it is 
for centralized version control systems.

Even in the centralized setting, monotonicity is not guaranteed, since 
you might have local timestamps deviating from the repository; you 
might have added a line, compiled, and removed it again later on, 
without running make again. Now if you checkout changes from the 
repository, and it sets the timestamp, that timestamp might be older 
than before the compile, and the file would not be rebuilt if you run 
make. So you cannot avoid those issues in centralized setttings either.

> So if these mtimes were used by default they'd interact badly with stuff
> like "make" in those cases, because you might check out a modified
> version with a timestamp in the past.

That's very clearly the case, and I have stressed in my initial email 
that I fully agree with the reasoning of the FAQ in this regard. It is, 
however, merely an argument against *restoring* the timestamps *by 
default*, to comply with the principle of least astonishment. It is, by 
itself, not an argument against *storing* the timestamps, let alone 
against restoring them *on request*.

For the initial checkout, it should not even be harmful to restore the 
timestamps by default.

> any case, I just wanted to point out a workaround (but then digressed
> into critiquing the idea above...).

Well, Johannes's proposed solution seems pretty reasonable and 
realistic to me.  Thanks to Phillip's hint about unquote_path() in 
Git.pm it seems I now have all the needed ingredients to implement this 
feature.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-21 Thread Peter Backes
On Wed, Feb 21, 2018 at 10:33:05PM +0100, Ævar Arnfjörð Bjarmason wrote:
> This sounds like a sensible job for a git import tool, i.e. import a
> target directory into git, and instead of 'git add'-ing the whole thing
> it would look at the mtimes, sort files by mtime, then add them in order
> and only commit those files that had the same mtime in the same commit
> (or within some boundary).

I think that this would be The Wrong Thing to do.

The commit time is just that: The time the commit was done. The commit 
is an atomic group of changes to a number of files that hopefully bring 
the tree from one usable state into the next.

The mtime, in contrast, tells us when a file was most recently modified.

It may well be that main.c was most recently modified yesterday, and 
feature.c was modified this morning, and that only both changes taken 
together make sense as a commit, despite the long time in between.

Even worse, it may be that feature A took a long time to implement, so 
we have huge gaps in between the mtimes, but feature B was quickly done 
after A was finished. Such an algorithm would probably split feature A 
incorrectly into several commits, and group the more recently changed 
files of feature A with those of feature B.

And if Feature A and Feature B were developed in parallel, things get 
completely messy.

> The advantage of doing this via such a tool is that you could tweak it
> to commit by any criteria you wanted, e.g. not mtime but ctime or even
> atime.

Maybe, but it would be rather useless to commit by ctime or atime. You 
do one grep -r and the atime is different. You do one chmod or chown 
and the ctime is different. Those timestamps are really only useful for 
very limited purposes.

That ctime exists seems reasonable, since it's only ever updated when 
the inode is written anyway.

atime, in contrast, was clearly one of the rather nonsensical 
innovations of UNIX: Do one write to the disk for each read from the 
disk. C'mon, really? It would have been a lot more reasonable to simply 
provide a generic way for tracing read() system calls instead; then 
userspace could decide what to do with that information and which of it 
is useful and should be kept and perhaps stored on disk. Now we have 
this ugly hack called relatime to deal with the problem.

> You'd get the same thing as you'd get if git's tree format would change
> to include mtimes (which isn't going to happen), but with a lot more
> flexibility.

Well, from basic logic, I don't see how a decision not to implement a 
feature could possibly increase flexility. The opposite seems to be the 
case.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-20 Thread Peter Backes
On Tue, Feb 20, 2018 at 11:32:23PM +0100, Johannes Schindelin wrote:
> Hi Peter,
> 
> On Tue, 20 Feb 2018, Peter Backes wrote:
> 
> > On Tue, Feb 20, 2018 at 11:46:38AM +0100, Johannes Schindelin wrote:
> > 
> > > I would probably invent a file format (``)
> > 
> > I'm stuck there because of  being munged.
> 
> From which command do you want to get it? If you are looking at `git
> diff`, you may want to use the `-z --name-only` options to avoid munging
> the paths.

I plan to use "git diff-tree --name-only $w_tree HEAD" and subtract
all lines from "git diff-index --name-only HEAD" to get the files for 
which the timestamp should be stored..

If I use "-z" I get the non-munged path, but I cannot safely store such 
paths in the proposed file format; they might contain newlines (sigh). 
So at one point I have to munge. Then the same question arises when I 
have to get the actual path from the munged path when restoring the 
timestamps.

If there's no ready-made functionality to munge and unmunge paths, I 
have to write some awk for this. At first I thought this might add one 
more dependency to git, but it seems that awk is already used in 
git-mergetool.sh, so I suppose it's okay to use in git-stash.sh etc, 
too.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-20 Thread Peter Backes
Hi Jeff,

On Tue, Feb 20, 2018 at 04:16:34PM -0500, Jeff King wrote:
> I think there are some references buried somewhere in that wiki, but did
> you look at any of the third-party tools that store file metadata
> alongside the files in the repository? E.g.:
> 
>   https://etckeeper.branchable.com/
> 
> or
> 
>   https://github.com/przemoc/metastore
> 
> I didn't see either of those mentioned in this thread (though I also do
> not have personal experience with them, either).
> 
> Modification times are a subset of the total metadata you might care
> about, so they are solving a much more general problem. Which may also
> partially answer your question about why this isn't built into git. The
> general problem gets much bigger when you start wanting to carry things
> like modes (which git doesn't actually track; we really only care about
> the executable bit) or extended attributes (acls, etc).

I know about those, but that's not what I am looking for. Those tools 
serve entirely different purposes, ie., tracking file system changes. 
I, however, am specifically interested in version control.

In version control, the user checks out his own copy of the tree for 
working. For this purpose, it is thus pointless to track ownership, 
permissions (except for the x bit), xattrs, or any other metadata. In 
fact, it can be considered the wrong thing to do.

The modification time, however, is special. It clearly has its place in 
version control. It tells us when the last modification was actually 
done to the file. I am often working on some feature, and one part is 
finished and is lying around, but I am still working on other parts in 
other files. Then, maybe after some weeks, the other parts are 
finished. Now, when committing, the information about modification time 
is lost. Maybe some weeks later I want to figure out when I last 
modified those files that were committed. But that information is now 
gone, at least in the git repository. Sure, I could do lots of WIP 
commits, but this would clutter up the history unneccessarly and I 
would have lots of versions that might not even compile, let alone run.

As far as I remember, bitkeeper had this distinction between checkins 
and commits. You could check in a file at any time, and any number of 
times, and then group all those checkins together with a commit. Git 
seems to have avoided this principle, or have kept it only 
rudimentarily via git add (but git add cannot add more than one version 
of the same file). Perhaps for simplificiation of use, perhaps for 
simplification of implementation, I don't know.

I assume, if it were not for the build tool issues, git would have 
tracked mtime from the very start.

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-20 Thread Peter Backes
Hi Johannes,

On Tue, Feb 20, 2018 at 11:46:38AM +0100, Johannes Schindelin wrote:
> If I were you [...]

It seems all pretty straight forward, except for

> I would probably invent a file format (``)

I'm stuck there because of  being munged.

To obtain or set the mtime of the file, I need the unmunged path.

How to get it?



What follows is irrelevant for progress.

> I don't think that code was ever there. Maybe you heard about some file
> mode being preserved overzealously (we stored the octal file mode
> verbatim, but then decided to store only 644 or 755).

I'm not sure. I'm not able to find that source anymore, though.

> As you can see from the code decoding a tree entry:
> 
> https://github.com/git-for-windows/git/blob/e1848984d/tree-walk.c#L25-L52
> 
> there is no mtime at all in the on-disk format of tree objects. There is
> the hash, the mode, and the file name.

I didn't comletely get the code in tree-walk.c since the parsing 
architecture seems to pass around pointers via global variables. 
It seems that in addition to hash, mode and file name, the on-disk 
format has at least the object type, see git cat-file -p master^{tree} 
Perhaps I got it wrong.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-20 Thread Peter Backes
Hello Johannes,

On Tue, Feb 20, 2018 at 11:46:38AM +0100, Johannes Schindelin wrote:
> Oh, sorry. I understood your mail as if you had told the core Git
> developers that they should implement the feature you desire. I did not
> understand that you hinted at a discussion first, and that you would then
> go and implement the feature you asked for.

Well, sorry for being misunderstandable. It was my impression from the 
FAQ that the reason for why this feature doesn't exist was a strong 
opinion that it would cause technical problems. The FAQ doesn't mention 
anything like a lack of manpower. As I stated it was my 
impression that this feature would not be too hard to implement.

Because of this my email presupposed it was not manpower that prevented 
this feature.

My statement "Please provide options" was thus targeted at reviewing 
and discussing the perceived technical reasons for not implementing 
this feature at least as an option. It wasn't supposed to demand free 
lunch from anyone.

Of course I can offer to do some work to the best of my abilitites if 
that's the issue. That should go without saying for Free Software 
projects. Perhaps even my employer would be happy to pay me for 
implementing the feature during workign hours. This shouldn't be the 
issue. The issue is the seemingly dogmatic reply in the FAQ which makes 
me reluctant to put work into this in fear that a patch submission 
would be met with strong rejection.

> You will not be able to convince the core Git developers to make this the
> default, I don't think.

I have stressed very clearly in my mail that I am not asking the 
defaults about mtime restoring to be changed. I agree that those 
defaults are reasonable and in line with the principle of least 
astonishment.

What bugs me is my impression from the FAQ that even as an option, the 
feature might be unwelcome.

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: Git should preserve modification times at least on request

2018-02-19 Thread Peter Backes
Hi Johannes,

On Mon, Feb 19, 2018 at 10:58:12PM +0100, Johannes Schindelin wrote:
> Since you already assessed that it shouldn't be hard to do, you probably
> want to put your money where your mouth is and come up with a patch, and
> then offer it up for discussion on this here mailing list.

Well, it would be good to discuss this a bit beforehand, since my time 
is wasted if there's no chance to get it accepted. Perhaps there is 
some counterargument I don't know about.

Is there some existing code that could be used? I think I read 
somewhere that git once did preserve mtimes, but that this code was 
removed because of the build tool issues. Perhaps that code could 
simply be put back in, and surrounded by conditions.

Best wishes
Peter

PS: Given the opportunity, I want to thank you very much for 
maintaining the git repository for my cvsclone tool.

-- 
Peter Backes, r...@helen.plasma.xg8.de


Git should preserve modification times at least on request

2018-02-19 Thread Peter Backes
Hello,

please ensure to CC me if you reply as I am not subscribed to the list.

https://git.wiki.kernel.org/index.php/Git_FAQ#Why_isn.27t_Git_preserving_modification_time_on_files.3F
 
argues that git isn't preserving modification times because it needs to 
ensure that build tools work properly.

I agree that modification times should not be restored by default, 
because of the principle of least astonishment. But should it be 
impossible? The principle of least astonishment does not mandate this; 
it is not a paternalistic principle.

Thus, I do not get at all
- why git doesn't *store* modification times, perhaps by default, but 
at least on request
- why git doesn't restore modification times *on request*

It is pretty annoying that git cannot, even if I know what I am doing, 
and explicitly want it to, preserve the modification time.

One use case: I have lots of file lying around in my build directory 
and for some of them, the modification time in important information to 
me. Those files are not at all used with the build tool. In contrast to 
git pull, git pull --rebase needs those to be stashed. But after the 
pull and unstash, the mtime is gone. Boo.

Please provide options to store and restore modification times. It 
shouldn't be hard to do, given that other metadata such as the mode is 
already stored. It would make live so much easier. And the fact that 
this has made into the FAQ clearly suggests that there are many others 
who think so.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de