Re: [PATCH v2] teach fast-export an --anonymize option

Philip Oakley Fri, 22 Aug 2014 11:40:18 -0700

From: "Jeff King" <p...@peff.net>: Friday, August 22, 2014 12:21 AM

On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:
The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.
  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.
  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public, I can find out you have it as submodule). I doubt it
     matters, but we can also scramble the sha1s.
Here's a re-roll that addresses the latter two. I don't think any are a
big deal, but it's much easier to say "it's handled" than try to figure
out whether and when it's important.

This also includes the documentation update I sent earlier. The
interdiff is a bit noisy, as I also converted the anonymize_mem function
to take void pointers (since it doesn't know or care what it's storing,
and this makes storing unsigned chars for sha1s easier).


Just a bit of bikeshedding for future improvements.

The .gitignore is another potential user problem area that may benefit
from not being anonymised when problems strike. For example, there's a
current problem on the git-users list
https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ about "git
clean vs git status re .gitignore", which would then also beg questions
about retaining file extensions/suffixes (.txt, .o, .c, etc).

I've had a similar problem with an overzealous file compare routine,
where the same "too much vs too little" trade-off was an issue.

One thought is that the user should be able to, as an option, select the
number of initial characters retained from filenames, and similarly, the
option to retain the file extension, and possibly directory names, such
that the full .gitignore still works in most cases, and the sort order
works (as far as it goes on number of characters).
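
To make the suggestion concrete, here is a minimal Python sketch of such
a filename mapping. It is only an illustration of the idea, not anything
in the patch: the class name PathAnonymizer, the parameters keep_chars,
keep_extension and keep_names, and the "pathN" token format are all
hypothetical.

```python
import posixpath


class PathAnonymizer:
    """Hypothetical sketch: scramble path components, optionally keeping
    a few leading characters and the file extension."""

    def __init__(self, keep_chars=0, keep_extension=True,
                 keep_names=(".gitignore",)):
        self.keep_chars = keep_chars          # leading characters retained
        self.keep_extension = keep_extension  # retain ".c", ".txt", ...
        self.keep_names = set(keep_names)     # names passed through as-is
        self.seen = {}                        # (name, is_file) -> token

    def _component(self, name, is_file):
        if name in self.keep_names:
            return name
        key = (name, is_file)
        if key not in self.seen:
            if is_file and self.keep_extension:
                stem, ext = posixpath.splitext(name)
            else:
                stem, ext = name, ""
            prefix = stem[:self.keep_chars]
            # The counter makes every distinct component map to a new token.
            self.seen[key] = f"{prefix}path{len(self.seen)}{ext}"
        return self.seen[key]

    def anonymize(self, path):
        parts = path.split("/")
        last = len(parts) - 1
        return "/".join(self._component(part, index == last)
                        for index, part in enumerate(parts))


anonymizer = PathAnonymizer(keep_chars=2)
for path in ["src/main.c", "src/util.c", ".gitignore"]:
    print(path, "->", anonymizer.anonymize(path))
```

Retaining a prefix and extension trades some secrecy for usefulness: the
more characters are kept, the more of the original name leaks, which is
the same "too much vs too little" balance mentioned above.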


All things for future improvers to consider.

Philip

-- >8 --
Subject: teach fast-export an --anonymize option

Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

 git fast-export --anonymize --all |
 perl -pe 's/\d+/X/g' |
 sort -u |
 less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

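The numbering scheme described above can be sketched in a few lines of
Python. This is a simplified illustration rather than the C code in the
patch: the Anonymizer class and the collapse_numbers helper are
hypothetical names, and the helper only mimics the perl and sort steps
of the pipeline on an in-memory list.

```python
import re


class Anonymizer:
    """Map each distinct original token to a stable numbered placeholder."""

    def __init__(self, label):
        self.label = label
        self.seen = {}

    def __call__(self, original):
        # First sighting gets the next number; later sightings reuse it.
        if original not in self.seen:
            self.seen[original] = f"{self.label} {len(self.seen)}"
        return self.seen[original]


def collapse_numbers(lines):
    """Replace every run of digits with X and keep unique lines, sorted."""
    return sorted({re.sub(r"\d+", "X", line) for line in lines})


users = Anonymizer("User")
stream = [f"author {users(name)}" for name in ["alice", "bob", "alice"]]
print(stream)
print(collapse_numbers(stream))
```

Because the same input always maps to the same token, the shape of the
history is preserved, and collapsing the numbers leaves only a short
list of distinct line patterns to review before sharing.
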
In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

 $ time git fast-export --anonymize --all \
        --tag-of-filtered-object=drop >output
 real    0m2.883s
 user    0m2.828s
 sys     0m0.052s

 $ gzip output
 $ ls -lh output.gz | awk '{print $5}'
 2.9M

Signed-off-by: Jeff King <p...@peff.net>
---

[...]

