[PATCH 1/2] filter-branch: Add --state-branch to hold pickled copy of ref map

2017-08-08 Thread Ian Campbell
Allowing for incremental updates of large trees.

I have been using this as part of the device tree extraction from the Linux
kernel source since 2013, about time I sent the patch upstream!

Signed-off-by: Ian Campbell 
---
 git-filter-branch.sh | 39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 3a74602ef..d07db3fee 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -86,7 +86,7 @@ USAGE="[--setup ] [--env-filter ]
[--parent-filter ] [--msg-filter ]
[--commit-filter ] [--tag-name-filter ]
[--subdirectory-filter ] [--original ]
-   [-d ] [-f | --force]
+   [-d ] [-f | --force] [--state-branch ]
[--] [...]"
 
 OPTIONS_SPEC=
@@ -106,6 +106,7 @@ filter_msg=cat
 filter_commit=
 filter_tag_name=
 filter_subdir=
+state_branch=
 orig_namespace=refs/original/
 force=
 prune_empty=
@@ -181,6 +182,9 @@ do
--original)
orig_namespace=$(expr "$OPTARG/" : '\(.*[^/]\)/*$')/
;;
+   --state-branch)
+   state_branch="$OPTARG"
+   ;;
*)
usage
;;
@@ -252,6 +256,20 @@ export GIT_INDEX_FILE
 # map old->new commit ids for rewriting parents
 mkdir ../map || die "Could not create map/ directory"
 
+if [ -n "$state_branch" ] ; then
+   state_commit=`git show-ref -s "$state_branch"`
+   if [ -n "$state_commit" ] ; then
+   echo "Populating map from $state_branch ($state_commit)" 1>&2
+   git show "$state_commit":filter.map |
+   perl -n -e 'm/(.*):(.*)/ or die;
+   open F, ">../map/$1" or die;
+   print F "$2" or die;
+   close(F) or die'
+   else
+   echo "Branch $state_branch does not exist. Will create" 1>&2
+   fi
+fi
+
 # we need "--" only if there are no path arguments in $@
 nonrevs=$(git rev-parse --no-revs "$@") || exit
 if test -z "$nonrevs"
@@ -544,6 +562,25 @@ if [ "$filter_tag_name" ]; then
done
 fi
 
+if [ -n "$state_branch" ] ; then
+   echo "Saving rewrite state to $state_branch" 1>&2
+   STATE_BLOB=$(ls ../map |
+   perl -n -e 'chomp();
+   open F, "<../map/$_" or die;
+   chomp($f = ); print "$_:$f\n";' |
+   git hash-object -w --stdin )
+   STATE_TREE=$(/bin/echo -e "100644 blob $STATE_BLOB\tfilter.map" | git 
mktree)
+   STATE_PARENT=$(git show-ref -s "$state_branch")
+   unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE
+   unset GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL GIT_COMMITTER_DATE
+   if [ -n "$STATE_PARENT" ] ; then
+   STATE_COMMIT=$(/bin/echo "Sync" | git commit-tree "$STATE_TREE" -p 
"$STATE_PARENT")
+   else
+   STATE_COMMIT=$(/bin/echo "Sync" | git commit-tree "$STATE_TREE" )
+   fi
+   git update-ref "$state_branch" "$STATE_COMMIT"
+fi
+
 cd "$orig_dir"
 rm -rf "$tempdir"
 
-- 
2.11.0



Re: [PATCH 1/2] filter-branch: Add --state-branch to hold pickled copy of ref map

2017-08-08 Thread Junio C Hamano
Ian Campbell  writes:

> Allowing for incremental updates of large trees.

"by doing what" is missing.  And ...

>
> I have been using this as part of the device tree extraction from the Linux
> kernel source since 2013, about time I sent the patch upstream!

... this does not help understanding what is going on.  It belongs
to the space after three dashes.

Perhaps

Subject: filter-branch: stash away ref map in a branch

With "--state-branch=" option, the mapping from
old object names and filtered ones in ./map/ directory is
stashed away in the object database, and the one from the
previous run is read to populate the ./map/ directory,
allowing for incremental updates of large trees.

or something?

>
> Signed-off-by: Ian Campbell 
> ---
>  git-filter-branch.sh | 39 ++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> index 3a74602ef..d07db3fee 100755
> --- a/git-filter-branch.sh
> +++ b/git-filter-branch.sh
> @@ -86,7 +86,7 @@ USAGE="[--setup ] [--env-filter ]
>   [--parent-filter ] [--msg-filter ]
>   [--commit-filter ] [--tag-name-filter ]
>   [--subdirectory-filter ] [--original ]
> - [-d ] [-f | --force]
> + [-d ] [-f | --force] [--state-branch ]
>   [--] [...]"
>  
>  OPTIONS_SPEC=
> @@ -106,6 +106,7 @@ filter_msg=cat
>  filter_commit=
>  filter_tag_name=
>  filter_subdir=
> +state_branch=
>  orig_namespace=refs/original/
>  force=
>  prune_empty=
> @@ -181,6 +182,9 @@ do
>   --original)
>   orig_namespace=$(expr "$OPTARG/" : '\(.*[^/]\)/*$')/
>   ;;
> + --state-branch)
> + state_branch="$OPTARG"
> + ;;
>   *)
>   usage
>   ;;
> @@ -252,6 +256,20 @@ export GIT_INDEX_FILE
>  # map old->new commit ids for rewriting parents
>  mkdir ../map || die "Could not create map/ directory"
>  
> +if [ -n "$state_branch" ] ; then
> + state_commit=`git show-ref -s "$state_branch"`

I hate to nitpick styles, especially on this script that already has
existing violations, but for completeness:

Style: we prefer to write $(command substitution) instead.
Style: we prefer to write "if test", not "if [".
Style: we prefer to avoid ';' and write "if test condtion" and
   "then" on different lines.

It is a bit curious use of "show-ref".  It is not wrong per-se, but
"git rev-parse" may be more common.  I do not care too deeply either
way, though.

Don't we want to make sure the value given to --state-branch is a
full refname, not just a branch name?  What happens when you say 

filter-branch --state-branch master

by mistake?  "show-ref -s" is likely to show your refs/heads/master,
and other master branches that appear as remote-tracking branches for
the remotes you interact with.

> + if [ -n "$state_commit" ] ; then
> + echo "Populating map from $state_branch ($state_commit)" 1>&2
> + git show "$state_commit":filter.map |
> + perl -n -e 'm/(.*):(.*)/ or die;
> + open F, ">../map/$1" or die;
> + print F "$2" or die;
> + close(F) or die'

The process calling this perl script, which carefully diagnoses
malformed input and dies, does not seem to do anything when it sees
errors.  Intended?

> + else
> + echo "Branch $state_branch does not exist. Will create" 1>&2
> + fi
> +fi
> +
>  # we need "--" only if there are no path arguments in $@
>  nonrevs=$(git rev-parse --no-revs "$@") || exit
>  if test -z "$nonrevs"
> @@ -544,6 +562,25 @@ if [ "$filter_tag_name" ]; then
>   done
>  fi
>  
> +if [ -n "$state_branch" ] ; then
> + echo "Saving rewrite state to $state_branch" 1>&2
> + STATE_BLOB=$(ls ../map |
> + perl -n -e 'chomp();
> + open F, "<../map/$_" or die;
> + chomp($f = ); print "$_:$f\n";' |

I see it somewhat gross to pipe the output of "/bin/ls" to a Perl
script, instead of iterating over "while (<../map/*>)" inside the
script itself.

> + git hash-object -w --stdin )
> + STATE_TREE=$(/bin/echo -e "100644 blob $STATE_BLOB\tfilter.map" | git 
> mktree)
> + STATE_PARENT=$(git show-ref -s "$state_branch")

Don't you already have this in $state_commit?

One advantage of reading $state_branch again at this point is to
detect mistakes of running more than one filter-branch (which may
cause you to read $STATE_PARENT that is different from $state_commit
you read earlier), but I do not think that is being done here, so...

> + unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE
> + unset GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL GIT_COMMITTER_DATE

Hmph.  I can see that you are trying not to be affected by the
committers and authors of the commits on the branch being filtered
(which are set by finish_ident shell function), but I won

Re: [PATCH 1/2] filter-branch: Add --state-branch to hold pickled copy of ref map

2017-08-09 Thread Ian Campbell
On Tue, 2017-08-08 at 13:56 -0700, Junio C Hamano wrote:
> Ian Campbell  writes:
> 
> > Allowing for incremental updates of large trees.
> 
> "by doing what" is missing.  And ...
> 
> >
> > I have been using this as part of the device tree extraction from
> the Linux
> > kernel source since 2013, about time I sent the patch upstream!
> 
> ... this does not help understanding what is going on.  It belongs
> to the space after three dashes.
> 
> Perhaps
> 
>   Subject: filter-branch: stash away ref map in a branch
> 
>   With "--state-branch=" option, the mapping from
>   old object names and filtered ones in ./map/ directory is
>   stashed away in the object database, and the one from the
>   previous run is read to populate the ./map/ directory,
>   allowing for incremental updates of large trees.
> 
> or something?

Yes, thanks that is a lot better.

I'll address the feedback (style nits and all) in the coming weeks,
heads up that I might be a bit slow, got a busy week this week followed
by 3 weeks of travel (which might mean no time for hacking or lots,
hard to say ;-))

> Don't we want to make sure the value given to --state-branch is a
> full refname, not just a branch name?  What happens when you say 
> 
>   filter-branch --state-branch master
> 
> by mistake?  "show-ref -s" is likely to show your refs/heads/master,
> and other master branches that appear as remote-tracking branches for
> the remotes you interact with.

I've been using this as `--state-branch refs/heads/filter-state` which
creates a local/visible filter-state branch which I also push to a
remote, so I also have a `refs/remotes/state/filter-state` too.

What is the correct way to check for a full ref name? Is it as simple
as checking for a refs/heads/ prefix or is there a better way?

> > +   if [ -n "$state_commit" ] ; then
> > +   echo "Populating map from $state_branch
> ($state_commit)" 1>&2
> > +   git show "$state_commit":filter.map |
> > +   perl -n -e 'm/(.*):(.*)/ or die;
> > +   open F, ">../map/$1" or die;
> > +   print F "$2" or die;
> > +   close(F) or die'
> 
> The process calling this perl script, which carefully diagnoses
> malformed input and dies, does not seem to do anything when it sees
> errors.  Intended?

I hadn't realised the script wasn't using `set -e`. I'll sort this with
some local error handling.

> 
> > +   else
> > +   echo "Branch $state_branch does not exist. Will
> create" 1>&2
> > +   fi
> > +fi
> > +
> >  # we need "--" only if there are no path arguments in $@
> >  nonrevs=$(git rev-parse --no-revs "$@") || exit
> >  if test -z "$nonrevs"
> > @@ -544,6 +562,25 @@ if [ "$filter_tag_name" ]; then
> >     done
> >  fi
> >  
> > +if [ -n "$state_branch" ] ; then
> > +   echo "Saving rewrite state to $state_branch" 1>&2
> > +   STATE_BLOB=$(ls ../map |
> > +   perl -n -e 'chomp();
> > +   open F, "<../map/$_" or die;
> > +   chomp($f = ); print "$_:$f\n";' |
> 
> I see it somewhat gross to pipe the output of "/bin/ls" to a Perl
> script, instead of iterating over "while (<../map/*>)" inside the
> script itself.

I considered cleaning this up too as I was forward porting, but weirdly
it appeared to microbenchmark slower that way, I don't remember the
magnitude of the difference (and the test script is on another machine
right now). I'll revisit that and if it isn't too much slower I'll
switch to the saner looking all in Perl method.

> > +   unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE
> > +   unset GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
> GIT_COMMITTER_DATE
> 
> Hmph.  I can see that you are trying not to be affected by the
> committers and authors of the commits on the branch being filtered
> (which are set by finish_ident shell function), but I wonder if we
> could (and more importantly "want to") do better to preserve the
> real committer the user who runs the script may have in the
> environment before running it.  I guess it does not matter that
> much, as long as the user has properly user.{name,email} configured
> elsewhere without relying on the environment variable.

I'm glad you spotted this because I couldn't remember ;-)

I'll stash these in a bunch of ORIG_FOO near the top and then reset
them at an appropriate point (I'll use ORIG_GIT_DIR as the pattern).

> Despite all the above comments, I like what you are trying to
> achieve here.  Thanks for sharing.

Thanks for the review and feedback.


Ian.