Re: git-svn performance

2014-10-20 Thread Jakob Stoklund Olesen

> On Oct 19, 2014, at 18:16, Eric Wong  wrote:
> 
> Jakob Stoklund Olesen  wrote:
>> If cached_mergeinfo is using too much memory, you can probably drop
>> that cache entirely. IIRC, it didn't give that much of a speed up.
>> 
>> I am surprised that it is using a lot of memory, though. There is only
>> one entry per SVN branch.
> 
> Something like the below?  (on top of your original two patches)
> Pushed to my master @ git://bogomips.org/git-svn.git

Yes, but I think you can remove cached_mergeinfo_rev too. 

Thanks
/Jakob


>Eric Wong (2):
>  git-svn: reduce check_cherry_pick cache overhead
>  git-svn: cache only mergeinfo revisions
> 
>Jakob Stoklund Olesen (2):
>  git-svn: only look at the new parts of svn:mergeinfo
>  git-svn: only look at the root path for svn:mergeinfo
> 
> git-svn still seems to have some excessive memory usage problems,
> even independenty of mergeinfo stuff.
> --8<
> From: Eric Wong 
> Date: Mon, 20 Oct 2014 01:02:53 +
> Subject: [PATCH] git-svn: cache only mergeinfo revisions
> 
> This should reduce excessive memory usage from the new mergeinfo
> caches without hurting performance too much, assuming reasonable
> latency to the SVN server.
> 
> Cc: Hin-Tak Leung 
> Suggested-by: Jakob Stoklund Olesen 
> Signed-off-by: Eric Wong 
> ---
> perl/Git/SVN.pm | 22 --
> 1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/perl/Git/SVN.pm b/perl/Git/SVN.pm
> index 171af37..f8a75b1 100644
> --- a/perl/Git/SVN.pm
> +++ b/perl/Git/SVN.pm
> @@ -1713,13 +1713,10 @@ sub mergeinfo_changes {
># Initialize cache on the first call.
>unless (defined $self->{cached_mergeinfo_rev}) {
>$self->{cached_mergeinfo_rev} = {};
> -$self->{cached_mergeinfo} = {};
>}
> 
>my $cached_rev = $self->{cached_mergeinfo_rev}{$old_path};
> -if (defined $cached_rev && $cached_rev == $old_rev) {
> -$old_minfo = $self->{cached_mergeinfo}{$old_path};
> -} else {
> +unless (defined $cached_rev && $cached_rev == $old_rev) {
>my $ra = $self->ra;
># Give up if $old_path isn't in the repo.
># This is probably a merge on a subtree.
> @@ -1728,19 +1725,16 @@ sub mergeinfo_changes {
>"directory didn't exist in r$old_rev\n";
>return {};
>}
> -my (undef, undef, $props) =
> -$self->ra->get_dir($old_path, $old_rev);
> -if (defined $props->{"svn:mergeinfo"}) {
> -my %omi = map {split ":", $_ } split "\n",
> -$props->{"svn:mergeinfo"};
> -$old_minfo = \%omi;
> -}
> -$self->{cached_mergeinfo}{$old_path} = $old_minfo;
> -$self->{cached_mergeinfo_rev}{$old_path} = $old_rev;
>}
> +my (undef, undef, $props) = $self->ra->get_dir($old_path, $old_rev);
> +if (defined $props->{"svn:mergeinfo"}) {
> +my %omi = map {split ":", $_ } split "\n",
> +$props->{"svn:mergeinfo"};
> +$old_minfo = \%omi;
> +}
> +$self->{cached_mergeinfo_rev}{$old_path} = $old_rev;
> 
># Cache the new mergeinfo.
> -$self->{cached_mergeinfo}{$path} = \%minfo;
>$self->{cached_mergeinfo_rev}{$path} = $rev;
> 
>my %changes = ();
> -- 
> EW
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git-svn performance

2014-10-19 Thread Jakob Stoklund Olesen

> On Oct 18, 2014, at 19:33, Eric Wong  wrote:
> 
> Eric Wong  wrote:
>> This reduces hash lookups for looking up cache data and will
>> simplify tying data to disk in the next commit.
> 
> I considered the following, but GDBM might not be readily available on
> non-POSIX platforms.  I think the other problem is the existing caches
> are still in memory (whether YAML or Storable) even if disk-backed,
> causing a large amount of memory usage anyways.

If cached_mergeinfo is using too much memory, you can probably drop that cache 
entirely. IIRC, it didn't give that much of a speed up.

I am surprised that it is using a lot of memory, though. There is only one 
entry per SVN branch.

> (Both patches on top of Jakob's)
> -
> Subject: [RFC] git-svn: tie cached_mergeinfo to a GDBM_File store
> 
> This should reduce per-instance memory usage by allowing
> serialization to disk.  Using the existing Memoize::Storable
> or YAML backends does not allow fast lookups.
> 
> GDBM_File should be available in most Perl installations
> and should not pose unnecessary burden
> ---
> perl/Git/SVN.pm | 19 ---
> 1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/perl/Git/SVN.pm b/perl/Git/SVN.pm
> index 25dbcd5..3e477c7 100644
> --- a/perl/Git/SVN.pm
> +++ b/perl/Git/SVN.pm
> @@ -14,6 +14,7 @@ use IPC::Open3;
> use Memoize;  # core since 5.8.0, Jul 2002
> use Memoize::Storable;
> use POSIX qw(:signal_h);
> +use Storable qw(freeze thaw);
> 
> use Git qw(
> command
> @@ -1713,10 +1714,21 @@ sub mergeinfo_changes {
> 
># Initialize cache on the first call.
>unless (defined $cached_mergeinfo) {
> -$cached_mergeinfo = $self->{cached_mergeinfo} = {};
> +my %hash;
> +eval '
> +require File::Temp;
> +use GDBM_File;
> +my $fh = File::Temp->new(TEMPLATE => "mergeinfo.");
> +$self->{cached_mergeinfo_fh} = $fh;
> +$fh->unlink_on_destroy(1);
> +tie %hash => "GDBM_File", $fh->filename, GDBM_WRCREAT, 0600;
> +';
> +$cached_mergeinfo = $self->{cached_mergeinfo} = \%hash;
>}
> 
>my $cached = $cached_mergeinfo->{$old_path};
> +$cached = thaw($cached) if defined $cached;
> +
>if (defined $cached && $cached->[0] == $old_rev) {
>$old_minfo = $cached->[1];
>} else {
> @@ -1735,11 +1747,12 @@ sub mergeinfo_changes {
>$props->{"svn:mergeinfo"};
>$old_minfo = \%omi;
>}
> -$cached_mergeinfo->{$old_path} = [ $old_rev, $old_minfo ];
> +$cached_mergeinfo->{$old_path} =
> +freeze([ $old_rev, $old_minfo ]);
>}
> 
># Cache the new mergeinfo.
> -$cached_mergeinfo->{$path} = [ $rev, \%minfo ];
> +$cached_mergeinfo->{$path} = freeze([ $rev, \%minfo ]);
> 
>my %changes = ();
>foreach my $p (keys %minfo) {
> -- 
> EW
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git svn's performance issue and strange pauses, and other thing

2014-10-19 Thread Jakob Stoklund Olesen
On Oct 18, 2014, at 21:12, Eric Wong  wrote:

>> I am somwhat worry about the dramatic difference between the two 
>> .svn/.caches -
>> check_cherry_pick.yaml is 225MB in one and 73MB in the other, and also
>> _rev_list.yaml is opposite - 24MB vs 73MB. How do I reconcile that?
> 
> Calling patterns changed, and it looks like Jakob's changes avoided some
> calls.

It is possible that those functions don't need to be memoized any more. My 
patch is trying to avoid calling them with the same arguments over and over, 
and memoizing doesn't help when arguments are changing.

Thanks,
/jakob--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git svn's performance issue and strange pauses, and other thing

2014-09-19 Thread Jakob Stoklund Olesen


> On Sep 19, 2014, at 1:25, Eric Wong  wrote:
> 
> Hin-Tak Leung  wrote:
> 
>> -  I know I can probably just "read the source", but I'd like to know
>> why .git/svn/.caches is even larger than .git/objects (which supposedly
>> contains everything that's of interest)? I hope this can be documented
>> towards the end of the man-page, for example, of important parts
>> of .git/svn (and what not to do with them...), without needing to
>> 'read the source'. Here is part of "du" from a couple of days ago:
>> 
>> 254816.git/objects
>> 307056.git/svn/.caches
>> 332452.git/svn
>> 588064.git
>> 
>> The actual .git/config is here - this should be sufficient info for
>> somebody looking into experiencing the issues I mentioned above.
> 
> IIRC, the caching is unique to mergeinfo, so perhaps Jakob's patches
> help, there, too.

IIRC the caches are used for memoization, and with my two patches applied it 
doesn't improve performance much.

You could try removing the memoization after applying my patches.

Thanks,
/Jakob--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] git-svn: only look at the root path for svn:mergeinfo

2014-04-27 Thread Jakob Stoklund Olesen

On Apr 22, 2014, at 11:54 AM, Eric Wong  wrote:

> Jakob Stoklund Olesen  wrote:
>> Subversion can put mergeinfo on any sub-directory to track cherry-picks.
>> Since cherry-picks are not represented explicitly in git, git-svn should
>> just ignore it.
> 
> Hi, was git-svn trying to track cherry-picks as merge before?

It would try and fail. I didn't explain that properly in the commit message.

Suppose I have a standard svn layout with $url/trunk and $url/branches/topic1. 
My topic1 branch has a change in subdir1 that I want to cherry-pick into trunk:

% svn switch $url/trunk
% cd subdir1
% svn merge $url/branches/topic1/subdir1
% cd ..
% svn commit

This operation will set svn:mergeinfo on $url/trunk/subdir1 where a normal full 
merge would set it on $url/trunk:

% svn pg svn:mergeinfo subdir1 
/branches/topic1/subdir1:3-4

When git-svn fetches these changes, it currently does examine the svn:mergeinfo 
change on the subdirectory as if it were a full merge. It then fails to find a 
revmap for /branches/topic1/subdir1:

Couldn't find revmap for file:///tmp/sdb/branches/topic1/subdir1
r5 = 5ce1f687c30495deca40730fb7be3baa0e145479 (refs/remotes/trunk)

It is looking for refs/remotes/topic1/subdir1, but we only have the 
refs/remotes/topic1 branch in git.

This patch makes git-svn stop trying to reconstruct those subdirectory merges 
that we know will fail anyway.

> This changes behavior a bit, so two independent users of git-svn
> may not have identical histories as a result, correct?

For normal subdirectory cherry-picks as described above, the behavior doesn't 
change. This is just a performance optimization.

For weirder cases where a whole branch has been merged onto a subdirectory of 
trunk, behavior does change. Currently, git-svn will mark that as a full merge 
in git. With this change it won't.

> Can you add a test to ensure this behavior is preserved?
> Thanks.

I'll add a test for the subdirectory merge described above.

> Sorry, I've never looked at mergeinfo myself, mainly relying on
> Sam + tests for this.
> 
> [1] - Historically, git-svn (using defaults) has always tried to
>  preserve identical histories for independent users across
>  different git-svn versions.  However, mergeinfo may be
>  enough of a corner-case where we can make an exception.


I agree. It doesn't seem worthwhile to try to preserve git-svn's historical 
behavior in weird corner cases.

BTW, this performance optimization matters not because of sporadic manual 
cherry-picks, but because certain older svn releases would replicate 
svn:mergeinfo on every subdirectory in a standard merge. With hundreds of 
subdirectories and thousands of merged branches, git-svn gets completely stuck 
processing all those mergeinfo lines.

Thanks,
/jakob

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] git-svn: only look at the root path for svn:mergeinfo

2014-04-16 Thread Jakob Stoklund Olesen
Subversion can put mergeinfo on any sub-directory to track cherry-picks.
Since cherry-picks are not represented explicitly in git, git-svn should
just ignore it.

Signed-off-by: Jakob Stoklund Olesen 
---
 perl/Git/SVN.pm | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/perl/Git/SVN.pm b/perl/Git/SVN.pm
index d3785ab..0aa4dd3 100644
--- a/perl/Git/SVN.pm
+++ b/perl/Git/SVN.pm
@@ -1210,7 +1210,7 @@ sub do_fetch {
unless ($self->ra->gs_do_update($last_rev, $rev, $self, $ed)) {
die "SVN connection failed somewhere...\n";
}
-   $self->make_log_entry($rev, \@parents, $ed, $last_rev);
+   $self->make_log_entry($rev, \@parents, $ed, $last_rev, $self->path);
 }
 
 sub mkemptydirs {
@@ -1859,21 +1859,18 @@ sub make_log_entry {
my $untracked = $self->get_untracked($ed);
 
my @parents = @$parents;
-   my $ps = $ed->{path_strip} || "";
-   for my $path ( grep { m/$ps/ } %{$ed->{dir_prop}} ) {
-   my $props = $ed->{dir_prop}{$path};
-   if ( $props->{"svk:merge"} ) {
-   $self->find_extra_svk_parents
-   ($ed, $props->{"svk:merge"}, \@parents);
-   }
-   if ( $props->{"svn:mergeinfo"} ) {
-   my $mi_changes = $self->mergeinfo_changes
-   ($parent_path || $path, $parent_rev,
-$path, $rev,
-$props->{"svn:mergeinfo"});
-   $self->find_extra_svn_parents
-   ($ed, $mi_changes, \@parents);
-   }
+   my $props = $ed->{dir_prop}{$self->path};
+   if ( $props->{"svk:merge"} ) {
+   $self->find_extra_svk_parents
+   ($ed, $props->{"svk:merge"}, \@parents);
+   }
+   if ( $props->{"svn:mergeinfo"} ) {
+   my $mi_changes = $self->mergeinfo_changes
+   ($parent_path, $parent_rev,
+$self->path, $rev,
+$props->{"svn:mergeinfo"});
+   $self->find_extra_svn_parents
+   ($ed, $mi_changes, \@parents);
}
 
open my $un, '>>', "$self->{dir}/unhandled.log" or croak $!;
-- 
1.8.5.2 (Apple Git-48)

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] git-svn: only look at the new parts of svn:mergeinfo

2014-04-16 Thread Jakob Stoklund Olesen
In a Subversion repository where many feature branches are merged into a
trunk, the svn:mergeinfo property can grow very large. This severely
slows down git-svn's make_log_entry() because it is checking all
mergeinfo entries every time the property changes.

In most cases, the additions to svn:mergeinfo since the last commit are
pretty small, and there is nothing to gain by checking merges that were
already checked for the last commit in the branch.

Add a mergeinfo_changes() function which computes the set of interesting
changes to svn:mergeinfo since the last commit. Filter out merged
branches whose ranges haven't changed, and remove a common prefix of
ranges from other merged branches.

This speeds up "git svn fetch" by several orders of magnitude on a large
repository where thousands of feature branches have been merged.

Signed-off-by: Jakob Stoklund Olesen 
---
 perl/Git/SVN.pm | 84 -
 1 file changed, 72 insertions(+), 12 deletions(-)

diff --git a/perl/Git/SVN.pm b/perl/Git/SVN.pm
index a59564f..d3785ab 100644
--- a/perl/Git/SVN.pm
+++ b/perl/Git/SVN.pm
@@ -1178,7 +1178,7 @@ sub find_parent_branch {
  or die "SVN connection failed somewhere...\n";
}
print STDERR "Successfully followed parent\n" unless $::_q > 1;
-   return $self->make_log_entry($rev, [$parent], $ed);
+   return $self->make_log_entry($rev, [$parent], $ed, $r0, 
$branch_from);
}
return undef;
 }
@@ -1210,7 +1210,7 @@ sub do_fetch {
unless ($self->ra->gs_do_update($last_rev, $rev, $self, $ed)) {
die "SVN connection failed somewhere...\n";
}
-   $self->make_log_entry($rev, \@parents, $ed);
+   $self->make_log_entry($rev, \@parents, $ed, $last_rev);
 }
 
 sub mkemptydirs {
@@ -1478,9 +1478,9 @@ sub find_extra_svk_parents {
 sub lookup_svn_merge {
my $uuid = shift;
my $url = shift;
-   my $merge = shift;
+   my $source = shift;
+   my $revs = shift;
 
-   my ($source, $revs) = split ":", $merge;
my $path = $source;
$path =~ s{^/}{};
my $gs = Git::SVN->find_by_url($url.$source, $url, $path);
@@ -1702,6 +1702,62 @@ sub parents_exclude {
return @excluded;
 }
 
+# Compute what's new in svn:mergeinfo.
+sub mergeinfo_changes {
+   my ($self, $old_path, $old_rev, $path, $rev, $mergeinfo_prop) = @_;
+   my %minfo = map {split ":", $_ } split "\n", $mergeinfo_prop;
+   my $old_minfo = {};
+
+   # Initialize cache on the first call.
+   unless (defined $self->{cached_mergeinfo_rev}) {
+   $self->{cached_mergeinfo_rev} = {};
+   $self->{cached_mergeinfo} = {};
+   }
+
+   my $cached_rev = $self->{cached_mergeinfo_rev}{$old_path};
+   if (defined $cached_rev && $cached_rev == $old_rev) {
+   $old_minfo = $self->{cached_mergeinfo}{$old_path};
+   } else {
+   my $ra = $self->ra;
+   # Give up if $old_path isn't in the repo.
+   # This is probably a merge on a subtree.
+   if ($ra->check_path($old_path, $old_rev) != $SVN::Node::dir) {
+   warn "W: ignoring svn:mergeinfo on $old_path, ",
+   "directory didn't exist in r$old_rev\n";
+   return {};
+   }
+   my (undef, undef, $props) =
+   $self->ra->get_dir($old_path, $old_rev);
+   if (defined $props->{"svn:mergeinfo"}) {
+   my %omi = map {split ":", $_ } split "\n",
+   $props->{"svn:mergeinfo"};
+   $old_minfo = \%omi;
+   }
+   $self->{cached_mergeinfo}{$old_path} = $old_minfo;
+   $self->{cached_mergeinfo_rev}{$old_path} = $old_rev;
+   }
+
+   # Cache the new mergeinfo.
+   $self->{cached_mergeinfo}{$path} = \%minfo;
+   $self->{cached_mergeinfo_rev}{$path} = $rev;
+
+   my %changes = ();
+   foreach my $p (keys %minfo) {
+   my $a = $old_minfo->{$p} || "";
+   my $b = $minfo{$p};
+   # Omit merged branches whose ranges lists are unchanged.
+   next if $a eq $b;
+   # Remove any common range list prefix.
+   ($a ^ $b) =~ /^[\0]*/;
+   my $common_prefix = rindex $b, ",", $+[0] - 1;
+   $changes{$p} = substr $b, $common_prefix + 1;
+   }
+   print STDERR "Checking svn:mergeinfo changes since r$old_rev: ",
+   scalar(keys %minfo), " sources, ",
+   scalar(keys %changes), " change