Re: Is there a way to speed up remote-hg?

2013-04-21 Thread John Szakmeister
On Sat, Apr 20, 2013 at 7:07 PM, Felipe Contreras
felipe.contre...@gmail.com wrote:
 On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister j...@szakmeister.net 
 wrote:
 I really like the idea of remote-hg, but it appears to be awfully slow
 on the clone step:

 The short answer is no. I do have a couple of patches that improve
 performance, but not by a huge factor.

 I have profiled the code, and there are two significant places where
 performance is wasted:

 1) Fetching the file contents

[snip]

 This is due to Mercurial's stupid design, and there's nothing we, or
 anybody else, can do about it until they change it.

That's a bummer. :-(

 2) Checking for file changes

[snip]

 So, since we cannot rely on it, we have to check for differences
 manually the way Mercurial does, and that kills performance: you need
 to get the contents of the two parent revisions and compare them. By
 contents I mean the manifest, the list of files in the revision, and
 that comparison takes a considerable amount of time.

Eek!

 For 1) there's nothing we can do, and for 2) we could trust the files
 Mercurial thinks were modified, and that gives us a very significant
 boost, but the repository will sometimes end up wrong. Most of the
 time is spent on 2).

 So unfortunately there's nothing we can do; that's just Mercurial's
 design, and it really has nothing to do with Git. Any other tool would
 have the same problems, even a tool that converts a Mercurial
 repository to another Mercurial repository (without resorting to tricks).
[snip]

That's unfortunate, but thank you for taking the time to explain!

-John


Is there a way to speed up remote-hg?

2013-04-20 Thread John Szakmeister
I really like the idea of remote-hg, but it appears to be awfully slow
on the clone step:

...
progress revision 81499 'master' (81500/81664)
progress revision 81599 'master' (81600/81664)
Checking out files: 100% (3744/3744), done.
git clone hg::https://bitbucket.org/python_mirrors/cpython
4484.61s user 41510.05s system 102% cpu 12:29:45.73 total

That seems like an awfully high price to pay.  Is there a way to speed
this up at all?  I realize the Python hg repo has more history than
others, but even a smaller project like Sphinx takes a while:

git clone hg::https://bitbucket.org/birkenfeld/sphinx  56.41s user
90.86s system 98% cpu 2:28.87 total

I was just curious if something more could be done here.  I don't go
around cloning Python all the time, so it's not a big issue, but it'd
be nice if it were more performant.

Thanks!

-John


Re: Is there a way to speed up remote-hg?

2013-04-20 Thread Felipe Contreras
On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister j...@szakmeister.net wrote:
 I really like the idea of remote-hg, but it appears to be awfully slow
 on the clone step:

The short answer is no. I do have a couple of patches that improve
performance, but not by a huge factor.

I have profiled the code, and there are two significant places where
performance is wasted:

1) Fetching the file contents

Extracting, decompressing, transferring, and then compressing and
storing the file contents is mostly unavoidable, unless we already
have the contents of the file, which in Git would be easy to check
through the checksum (SHA-1). Unfortunately, Mercurial doesn't have
that information. The SHA-1 it stores is not of the contents alone,
but of the contents plus the parent checksums, which means that if you
revert a modification you made to a file, move a file, or do any other
operation that ends up with the same contents through a different
path, the SHA-1 is different. So the only way to know whether the
contents are the same is to extract them and calculate the SHA-1
yourself, which defeats the purpose of having the checksum in the
first place.
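
To illustrate, here is a minimal Python sketch of the two hashing
schemes (it ignores the copy/rename metadata Mercurial sometimes
prepends to the file text; the parent nodeids are 20 raw bytes each,
and the "aa..." parent below is purely hypothetical):

import hashlib

def git_blob_sha1(content):
    # Git hashes a small header plus the raw contents, so identical
    # contents always produce the identical object ID.
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

def hg_filelog_nodeid(content, p1, p2):
    # Mercurial hashes the two parent nodeids (sorted) before the
    # contents, so the same contents with a different ancestry
    # produce a different nodeid.
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + content).hexdigest()

NULL_ID = b"\0" * 20
content = b"hello world\n"
print(git_blob_sha1(content))                        # content-only hash
print(hg_filelog_nodeid(content, NULL_ID, NULL_ID))  # new file
parent = bytes.fromhex("aa" * 20)                    # hypothetical parent
print(hg_filelog_nodeid(content, parent, NULL_ID))   # same content, new hash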

I've tried both: calculating the SHA-1 myself and using a previous
reference to avoid the transfer, and doing the transfer anyway and
letting Git check for existing objects. Neither makes a difference.

This is due to Mercurial's stupid design, and there's nothing we, or
anybody else, can do about it until they change it.

2) Checking for file changes

For each commit (or revision), we need to figure out which files were
modified. Mercurial has a neat shortcut for that: the list of
modifications is stored in the commit context itself, so it's easy to
retrieve. Unfortunately, it's sometimes wrong.

Since the Mercurial tools never use this information for any real
work, only to show the changes to users, the Mercurial folks never
noticed that what they were storing was sometimes wrong. This means
that if you have a repository that started with old versions of
Mercurial, chances are this information is wrong, and there's no real
guarantee that future versions won't have this problem, since to this
day the information is used only to display things to the user.

So, since we cannot rely on it, we have to check for differences
manually the way Mercurial does, and that kills performance: you need
to get the contents of the two parent revisions and compare them. By
contents I mean the manifest, the list of files in the revision, and
that comparison takes a considerable amount of time.
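
Roughly, the comparison looks like this (a simplified Python sketch;
a real Mercurial manifest is a revlog entry mapping every path in the
revision to a filelog nodeid, and both parents' manifests have to be
reconstructed before they can be compared):

def changed_files(parent_manifest, child_manifest):
    # Finding what changed means walking both full path->nodeid maps,
    # for every single commit being converted.
    changed, removed = [], []
    for path, node in child_manifest.items():
        if parent_manifest.get(path) != node:
            changed.append(path)   # added or modified
    for path in parent_manifest:
        if path not in child_manifest:
            removed.append(path)
    return changed, removed

old = {"a.txt": "n1", "b.txt": "n2"}
new = {"a.txt": "n1", "b.txt": "n3", "c.txt": "n4"}
print(changed_files(old, new))     # (['b.txt', 'c.txt'], [])

Trusting the (sometimes wrong) file list stored in the changelog would
replace that walk with a single lookup, which is where the speedup
mentioned below comes from.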

For 1) there's nothing we can do, and for 2) we could trust the files
Mercurial thinks were modified, and that gives us a very significant
boost, but the repository will sometimes end up wrong. Most of the
time is spent on 2).

So unfortunately there's nothing we can do; that's just Mercurial's
design, and it really has nothing to do with Git. Any other tool would
have the same problems, even a tool that converts a Mercurial
repository to another Mercurial repository (without resorting to tricks).

It seems Bazaar is more sensible in this regard: 1) the checksums are
truly of the file contents, and 2) each revision does store the file
modifications correctly. So a clone from Bazaar is much faster. In my
opinion, Mercurial just screwed up their design.

Cheers.

-- 
Felipe Contreras