Re: Long clone time after done.
Hi guys,

Any further interest in this scalability problem, or should I move on?

Thanks,
Uri

On Thu, Nov 8, 2012 at 5:35 PM, Uri Moszkowicz u...@4refs.com wrote:

I tried on the local disk as well and it didn't help. I managed to find a SUSE11 machine and tried it there, but no luck, so I think we can eliminate NFS and the OS as significant factors now. I ran with perf and here's the report:

     69.07%  git  /lib64/libc-2.11.1.so                [.] memcpy
     12.33%  git  prefix/git-1.8.0.rc2.suse11/bin/git  [.] blk_SHA1_Block
      5.11%  git  prefix/zlib/local/lib/libz.so.1.2.5  [.] inflate_fast
      2.61%  git  prefix/zlib/local/lib/libz.so.1.2.5  [.] adler32
      1.98%  git  /lib64/libc-2.11.1.so                [.] _int_malloc
      0.86%  git  [kernel]                             [k] clear_page_c

Does this help? The machine has 396GB of RAM, if it matters.

On Thu, Nov 8, 2012 at 4:33 PM, Jeff King p...@peff.net wrote:

On Thu, Nov 08, 2012 at 04:16:59PM -0600, Uri Moszkowicz wrote:

I ran git cat-file commit some-tag for every tag. They seem to be roughly uniformly distributed between 0s and 2s, and about 2/3 of the time seems to be system. My disk is mounted over NFS, so I tried on the local disk and it didn't make a difference. I have only one 1.97GB pack. I ran git gc --aggressive before.

Ah. NFS. That is almost certainly the source of the problem. Git will aggressively mmap. I would not be surprised to find that RHEL4's NFS implementation is not particularly fast at mmap-ing 2G files and is spending a bunch of time in the kernel servicing the requests.

Aside from upgrading your OS or getting off of NFS, I don't have a lot of advice. The performance characteristics you are seeing are so grossly off of what is normal that using git is probably going to be painful. Your 2s cat-files should be more like .002s. I don't think there's anything for git to fix here.

You could try building with NO_MMAP, which will emulate it with pread. That might fare better under your NFS implementation. Or it might be just as bad.

-Peff
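For what it's worth, testing Peff's NO_MMAP suggestion means rebuilding git from source with that Makefile knob set; a minimal sketch (the install prefix is illustrative):

    make NO_MMAP=YesPlease prefix=$HOME/git-nommap all install
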
Re: Long clone time after done.
I tried the patch but it doesn't appear to have helped :( Clone time with it was ~32m. Do you all by any chance have a tool to obfuscate a repository? I probably still wouldn't be permitted to distribute it, but it might make the option slightly more palatable. Anything else that I can do to help debug this problem?

On Thu, Nov 8, 2012 at 9:56 AM, Jeff King p...@peff.net wrote:

On Wed, Nov 07, 2012 at 11:32:37AM -0600, Uri Moszkowicz wrote:

    #4 parse_object (sha1=0xb0ee98 "\017C\205Wj\001`\254\356\307Z\332\367\353\233.\375P}D") at object.c:212
    #5 0x004ae9ec in handle_one_ref (path=0xb0eec0 "refs/tags/removed",
       sha1=0xb0ee98 "\017C\205Wj\001`\254\356\307Z\332\367\353\233.\375P}D",
       flags=2, cb_data=<optimized out>) at pack-refs. [...]

It looks like handle_one_ref() is called for each ref, and most result in a call to read_sha1_file().

Right. When generating the packed-refs file, we include the peeled reference for a tag (i.e., the commit that a tag object points to). So we have to actually read any tag objects to get the value.

The upload-pack program generates a similar list, and I recently added some optimizations. This code path could benefit from some of them by using peel_ref instead of hand-rolling the tag dereferencing. The main optimization, though, is reusing peeled values that are already in packed-refs; we would probably need some additional magic to reuse the values from the source repository.

However:

It only takes a second or so for each call, but when you have thousands of them (one for each ref) it adds up.

I am more concerned that it takes a second to read each tag. Even in my pathological tests for optimizing upload-pack, peeling 50,000 refs took only half a second.

Adding --single-branch --branch <branch> doesn't appear to help as it is implemented afterwards. I would like to debug this problem further but am not familiar enough with the implementation to know what the next step is. Can anyone offer some suggestions? I don't see why a clone should depend on O(#refs) operations.

Does this patch help? In a sample repo with 5,000 annotated tags, it drops my local clone time from 0.20s to 0.11s. Which is a big percentage speedup, but this code isn't taking a long time in the first place for me.

---
diff --git a/pack-refs.c b/pack-refs.c
index f09a054..3344749 100644
--- a/pack-refs.c
+++ b/pack-refs.c
@@ -40,13 +40,9 @@ static int handle_one_ref(const char *path, const unsigned char *sha1,
 	fprintf(cb->refs_file, "%s %s\n", sha1_to_hex(sha1), path);
 	if (is_tag_ref) {
-		struct object *o = parse_object(sha1);
-		if (o->type == OBJ_TAG) {
-			o = deref_tag(o, path, 0);
-			if (o)
-				fprintf(cb->refs_file, "^%s\n",
-					sha1_to_hex(o->sha1));
-		}
+		unsigned char peeled[20];
+		if (!peel_ref(path, peeled))
+			fprintf(cb->refs_file, "^%s\n", sha1_to_hex(peeled));
 	}
 
 	if ((cb->flags & PACK_REFS_PRUNE) && !do_not_prune(flags)) {
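For context on what "peeled" means here, a sketch of what packed-refs looks like once tags are packed, with made-up object IDs; the "^" line records the commit a tag object points to, which is exactly what handle_one_ref otherwise has to open the tag object to compute:

    $ git pack-refs --all
    $ head -3 .git/packed-refs
    # pack-refs with: peeled
    0f43855776a016acee63dadf556adb3361ba14b1 refs/tags/some-tag
    ^2ec831e395c1871bb6b21533b24a2d01e5a7d25f
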
Re: Long clone time after done.
I'm using RHEL4. Looks like perf is only available with RHEL6.

    heads: 308
    tags: 9614

Looking up the tags that way took a very long time, by the way; git tag | wc -l was much quicker. I've already pruned a lot of tags to get to this number as well. The original repository had ~37k tags, since we used to tag every commit with CVS.

All my tags are packed, so cat-file doesn't work:

    fatal: git cat-file refs/tags/some-tag: bad file

On Thu, Nov 8, 2012 at 2:33 PM, Jeff King p...@peff.net wrote:

On Thu, Nov 08, 2012 at 11:20:29AM -0600, Uri Moszkowicz wrote:

I tried the patch but it doesn't appear to have helped :( Clone time with it was ~32m.

That sounds ridiculously long.

Do you all by any chance have a tool to obfuscate a repository? Probably I still wouldn't be permitted to distribute it but might make the option slightly more palatable. Anything else that I can do to help debug this problem?

I don't have anything already written. What platform are you on? If it's Linux, can you try using perf to record where the time is going?

How many refs do you have? What does:

    echo "heads: $(git for-each-ref refs/heads | wc -l)"
    echo "tags: $(git for-each-ref refs/tags | wc -l)"

report? How long does it take to look up a tag, like:

    time git cat-file tag refs/tags/some-tag

?

-Peff
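One guess about the "bad file" failure, since Peff's reply below says packing the refs shouldn't matter: cat-file wants an object type (or -p) in front of the ref, and the error is consistent with that argument having been dropped from the command line. The standard invocations would be:

    time git cat-file tag refs/tags/some-tag    # type given explicitly
    git cat-file -p refs/tags/some-tag          # pretty-print, no type needed

This is only an assumption about the command that was actually run.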
Re: Long clone time after done.
I ran git cat-file commit some-tag for every tag. They seem to be roughly uniformly distributed between 0s and 2s, and about 2/3 of the time seems to be system. My disk is mounted over NFS, so I tried on the local disk and it didn't make a difference. I have only one 1.97GB pack. I ran git gc --aggressive before.

On Thu, Nov 8, 2012 at 4:11 PM, Jeff King p...@peff.net wrote:

On Thu, Nov 08, 2012 at 03:49:32PM -0600, Uri Moszkowicz wrote:

I'm using RHEL4. Looks like perf is only available with RHEL6.

Yeah, RHEL4 is pretty ancient; I think it predates the invention of perf.

heads: 308
tags: 9614

Looking up the tags that way took a very long time by the way. git tag | wc -l was much quicker. I've already pruned a lot of tags to get to this number as well. The original repository had ~37k tags since we used to tag every commit with CVS.

Hmm. I think for-each-ref will actually open the tag objects, but git tag will not. That would imply that reading the refs is fast, but opening objects is slow. I wonder why. How many packs do you have in .git/objects/pack of the repository?

All my tags are packed so cat-file doesn't work: fatal: git cat-file refs/tags/some-tag: bad file

The packing shouldn't matter. The point of the command is to look up the refs/tags/some-tag ref (in packed-refs or in the filesystem), and then open and write the pointed-to object to stdout. If that is not working, then there is something seriously wrong going on.

-Peff
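To answer Peff's pack-count question, two standard commands (run from the top of the work tree):

    ls .git/objects/pack/*.pack | wc -l    # number of packfiles
    git count-objects -v                   # counts and sizes of loose and packed objects
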
Re: Long clone time after done.
I tried on the local disk as well and it didn't help. I managed to find a SUSE11 machine and tried it there, but no luck, so I think we can eliminate NFS and the OS as significant factors now. I ran with perf and here's the report:

     69.07%  git  /lib64/libc-2.11.1.so                [.] memcpy
     12.33%  git  prefix/git-1.8.0.rc2.suse11/bin/git  [.] blk_SHA1_Block
      5.11%  git  prefix/zlib/local/lib/libz.so.1.2.5  [.] inflate_fast
      2.61%  git  prefix/zlib/local/lib/libz.so.1.2.5  [.] adler32
      1.98%  git  /lib64/libc-2.11.1.so                [.] _int_malloc
      0.86%  git  [kernel]                             [k] clear_page_c

Does this help? The machine has 396GB of RAM, if it matters.

On Thu, Nov 8, 2012 at 4:33 PM, Jeff King p...@peff.net wrote:

On Thu, Nov 08, 2012 at 04:16:59PM -0600, Uri Moszkowicz wrote:

I ran git cat-file commit some-tag for every tag. They seem to be roughly uniformly distributed between 0s and 2s, and about 2/3 of the time seems to be system. My disk is mounted over NFS, so I tried on the local disk and it didn't make a difference. I have only one 1.97GB pack. I ran git gc --aggressive before.

Ah. NFS. That is almost certainly the source of the problem. Git will aggressively mmap. I would not be surprised to find that RHEL4's NFS implementation is not particularly fast at mmap-ing 2G files and is spending a bunch of time in the kernel servicing the requests.

Aside from upgrading your OS or getting off of NFS, I don't have a lot of advice. The performance characteristics you are seeing are so grossly off of what is normal that using git is probably going to be painful. Your 2s cat-files should be more like .002s. I don't think there's anything for git to fix here.

You could try building with NO_MMAP, which will emulate it with pread. That might fare better under your NFS implementation. Or it might be just as bad.

-Peff
Re: Long clone time after done.
It all goes to pack_refs() in write_remote_refs(), called from update_remote_refs().

On Tue, Oct 23, 2012 at 11:29 PM, Nguyen Thai Ngoc Duy pclo...@gmail.com wrote:

On Wed, Oct 24, 2012 at 1:30 AM, Uri Moszkowicz u...@4refs.com wrote:

I have a large repository which I ran git gc --aggressive on that I'm trying to clone on a local file system. I would expect it to complete very quickly with hard links, but it's taking about 6min to complete with no checkout (git clone -n). I see the message "Cloning into 'repos'... done." appear after a few seconds, but then Git just hangs there for another 6min. Any idea what it's doing at this point and how I can speed it up?

"done." is printed by clone_local(), which is called in cmd_clone(). After that there are just a few more calls. Maybe you could add a few printf calls in between and see which one takes the most time?

-- Duy
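Short of patching in printf calls, a "poor man's profiler" can show where the post-"done." time goes: sample stack traces of the hanging clone with gdb (a sketch; the pgrep pattern and sample count are illustrative):

    pid=$(pgrep -f 'git clone' | head -1)
    for i in 1 2 3 4 5; do
        gdb -batch -p "$pid" -ex bt -ex detach 2>/dev/null | grep '^#'
        sleep 5
    done

Whichever frames keep showing up across samples (here, pack_refs and its callers) are where the time is going.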
Re: tag storage format
That did the trick - thanks!

On Mon, Oct 22, 2012 at 5:46 PM, Andreas Schwab sch...@linux-m68k.org wrote:

Uri Moszkowicz u...@4refs.com writes:

Perhaps Git should switch to a single-file block text or binary format once a large number of tags becomes present in a repository.

This is what git pack-refs (called by git gc) does, by putting the refs in .git/packed-refs.

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
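For anyone skimming the thread, the fix in command form (pack-refs and its --all flag long predate git 1.8):

    git pack-refs --all       # pack all refs, branches as well as tags
    wc -l .git/packed-refs    # one line per ref, plus ^ lines for peeled tags
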
Long clone time after done.
I have a large repository which I ran git gc --aggressive on that I'm trying to clone on a local file system. I would expect it to complete very quickly with hard links, but it's taking about 6min to complete with no checkout (git clone -n). I see the message "Cloning into 'repos'... done." appear after a few seconds, but then Git just hangs there for another 6min. Any idea what it's doing at this point and how I can speed it up?
Large number of object files
Continuing my work on improving clone times: running git gc --aggressive combined the large number of tags into a single file, but now I have a large number of files in the objects directory - 131k for a ~2.7GB repository. Is there any way to reduce the number of these files to speed up clones?
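A hedged suggestion, using standard git commands: gc normally repacks everything, so a large leftover loose-object count may mean those objects are unreachable; either way, repacking and pruning should collapse them:

    git count-objects -v    # report loose vs. packed object counts
    git repack -a -d        # pack all reachable objects, drop redundant loose ones
    git prune -n            # dry run: list unreachable loose objects git prune would delete
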
tag storage format
I'm doing some testing on a large Git repository and am finding local clones to take a very long time. After some investigation I've determined that the problem is due to a very large number of tags (~38k). Even with hard links, it just takes a really long time to visit that many inodes. As it happens, I don't care about most of these tags and will prune many of them anyway, but I expect that over time the count will creep back up. Have others reported this problem before, and is there a workaround? Perhaps Git should switch to a single-file block text or binary format once a large number of tags becomes present in a repository.
Re: Unexpected directories from read-tree
I am using 1.8.0-rc2 but also tried 1.7.8.4. Thanks for the suggestion to use ls-files -t - that's exactly what I was looking for. With it I was easily able to tell what the problem was: a missing "/" in the sparse-checkout file.

On Thu, Oct 18, 2012 at 10:34 PM, Nguyen Thai Ngoc Duy pclo...@gmail.com wrote:

On Fri, Oct 19, 2012 at 6:10 AM, Uri Moszkowicz u...@4refs.com wrote:

I'm testing out the sparse checkout feature of Git on my large (14GB) repository and am running into a problem. When I add dir1/ to sparse-checkout and then run git read-tree -mu HEAD, I see dir1 as expected. But when I add dir2/ to sparse-checkout and read-tree again, I see dir2 and dir3 appear, and they're not nested. If I replace dir2/ with dir3/ in the sparse-checkout file, then I see dir1 and dir3 but not dir2, as expected again. How can I debug this problem?

Posting here is step 1. What version are you using? You can look at unpack-trees.c. The function that does the check is excluded_from_list. You should check ls-files -t and see whether CE_SKIP_WORKTREE is set correctly for all of dir1/*, dir2/* and dir3/*. Can you recreate a minimal test case for the problem?

-- Duy
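A note on the missing "/", assuming it was the leading slash: sparse-checkout entries use gitignore-style matching, so an unanchored dir2/ matches a directory named dir2 at any depth, which could pull in other top-level trees that contain one (an assumption about this repository's layout). Anchoring the patterns avoids that:

    cat > .git/info/sparse-checkout <<'EOF'
    /dir1/
    /dir2/
    EOF
    git read-tree -mu HEAD
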
Unexpected directories from read-tree
I'm testing out the sparse checkout feature of Git on my large (14GB) repository and am running into a problem. When I add dir1/ to sparse-checkout and then run git read-tree -mu HEAD, I see dir1 as expected. But when I add dir2/ to sparse-checkout and read-tree again, I see dir2 and dir3 appear, and they're not nested. If I replace dir2/ with dir3/ in the sparse-checkout file, then I see dir1 and dir3 but not dir2, as expected again. How can I debug this problem?
Re: error: git-fast-import died of signal 11
Hi Michael,

Looks like the changes to the limits solved the problem. I didn't verify whether it was the stacksize or the descriptors, but it was one of those. The final repository size was 14GB from a 328GB dump file.

Thanks,
Uri

On Tue, Oct 16, 2012 at 2:18 AM, Michael Haggerty mhag...@alum.mit.edu wrote:

On 10/15/2012 05:53 PM, Uri Moszkowicz wrote:

I'm trying to convert a CVS repository to Git using cvs2git. I was able to generate the dump file without problem but am unable to get Git to fast-import it. The dump file is 328GB and I ran git fast-import on a machine with 512GB of RAM.

    fatal: Out of memory? mmap failed: Cannot allocate memory
    fast-import: dumping crash report to fast_import_crash_18192
    error: git-fast-import died of signal 11

How can I import the repository?

What versions of git and of cvs2git are you using? If not the current versions, please try with the current versions.

What is the nature of your repository (i.e., why is it so big)? Does it consist of extremely large files? A very deep history? Extremely many branches/tags? Extremely many files?

Did you check whether the RAM usage of the git-fast-import process was growing gradually to fill RAM while it was running, versus whether the usage seemed reasonable until it suddenly crashed?

There are a few obvious possibilities:

0. There is some reason that too little of your computer's RAM is available to git-fast-import (e.g., ulimit, other processes running at the same time, much RAM being used as a ramdisk, etc).

1. Your import is simply too big for git-fast-import to hold in memory the accumulated things that it has to remember. I'm not familiar with the internals of git-fast-import, but I believe that the main thing it has to keep in RAM is the list of marks (references to git objects that can be referred to later in the import). From your crash file, it looks like there were about 350k marks loaded at the time of the crash. Supposing each mark is about 100 bytes, this would only amount to 35 MB, which should not be a problem (*if* my assumptions are correct).

2. Your import contains a gigantic object which individually is so big that it overflows some component of the import. (I don't know whether large objects are handled in a streaming fashion; they might be read into memory at some point.) But since your computer had so much RAM, this is hardly imaginable.

3. git-fast-import has a memory leak and the accumulated leakage is exhausting your RAM.

4. git-fast-import has some other kind of bug.

5. The contents of the dumpfile are corrupt in a way that triggers the problem. This could be invalid input (e.g., an object that is reported to be quaggabytes large), or some invalid input that triggers a bug in git-fast-import.

If (1), then you either need a bigger machine or git-fast-import needs architectural changes. If (2), then you either need a bigger machine or git-fast-import and/or git needs architectural changes. If (3), then it would be good to get more information about the problem so that the leak can be fixed. If this is the case, it might be possible to work around the problem by splitting the dumpfile into several parts and loading them one after the other (outputting the marks from one run and loading them into the next). If (4) or (5), then it would be helpful to narrow down the problem. It might be possible to do so by following the instructions in the cvs2svn FAQ [1] for systematically shrinking a test case to a smaller size using destroy_repository.py and shrink_test_case.py. If you can create a small repository that triggers the same problem, then there is a good chance that it is easy to fix.

Michael (the cvs2git maintainer)

[1] http://cvs2svn.tigris.org/faq.html#testcase

--
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
Re: error: git-fast-import died of signal 11
I'm using Git 1.8.0-rc2 and cvs2git version 2.5.0-dev (trunk). The repository is almost 20 years old and should consist mostly of smallish plain-text files. We've been tagging every commit, in addition to tagging for releases and development branches, so there are a lot of tags and branches.

I didn't see the memory usage of the process before it exited, but after ~3.5 hours of a subsequent run it seems to be using about 8.5GB of virtual memory with a resident size of only 0.5GB, which should have easily fit on the 512GB machine that I was using. I'm trying on a 1TB machine now, but it doesn't look like it'll make a difference.

There is no ramdisk, and I have exclusive access to the machine, so the only other memory use is from the OS, which is trivial. The only significant limit in my environment would be on the stack:

    [umoszkow@mawhp5 ~] limit
    cputime      unlimited
    filesize     unlimited
    datasize     unlimited
    stacksize    8000 kbytes
    coredumpsize 0 kbytes
    memoryuse    unlimited
    vmemoryuse   unlimited
    descriptors  1024
    memorylocked 32 kbytes
    maxproc      8388608

Would that result in an mmap error, though? I'll try with unlimited stacksize and descriptors anyway. I don't think modifying the original repository or a clone of it is possible at this point, but breaking up the import into a few steps may be - I'll try that next if this fails.

On Tue, Oct 16, 2012 at 2:18 AM, Michael Haggerty mhag...@alum.mit.edu wrote:

On 10/15/2012 05:53 PM, Uri Moszkowicz wrote:

I'm trying to convert a CVS repository to Git using cvs2git. I was able to generate the dump file without problem but am unable to get Git to fast-import it. The dump file is 328GB and I ran git fast-import on a machine with 512GB of RAM.

    fatal: Out of memory? mmap failed: Cannot allocate memory
    fast-import: dumping crash report to fast_import_crash_18192
    error: git-fast-import died of signal 11

How can I import the repository?

What versions of git and of cvs2git are you using? If not the current versions, please try with the current versions.

What is the nature of your repository (i.e., why is it so big)? Does it consist of extremely large files? A very deep history? Extremely many branches/tags? Extremely many files?

Did you check whether the RAM usage of the git-fast-import process was growing gradually to fill RAM while it was running, versus whether the usage seemed reasonable until it suddenly crashed?

There are a few obvious possibilities:

0. There is some reason that too little of your computer's RAM is available to git-fast-import (e.g., ulimit, other processes running at the same time, much RAM being used as a ramdisk, etc).

1. Your import is simply too big for git-fast-import to hold in memory the accumulated things that it has to remember. I'm not familiar with the internals of git-fast-import, but I believe that the main thing it has to keep in RAM is the list of marks (references to git objects that can be referred to later in the import). From your crash file, it looks like there were about 350k marks loaded at the time of the crash. Supposing each mark is about 100 bytes, this would only amount to 35 MB, which should not be a problem (*if* my assumptions are correct).

2. Your import contains a gigantic object which individually is so big that it overflows some component of the import. (I don't know whether large objects are handled in a streaming fashion; they might be read into memory at some point.) But since your computer had so much RAM, this is hardly imaginable.

3. git-fast-import has a memory leak and the accumulated leakage is exhausting your RAM.

4. git-fast-import has some other kind of bug.

5. The contents of the dumpfile are corrupt in a way that triggers the problem. This could be invalid input (e.g., an object that is reported to be quaggabytes large), or some invalid input that triggers a bug in git-fast-import.

If (1), then you either need a bigger machine or git-fast-import needs architectural changes. If (2), then you either need a bigger machine or git-fast-import and/or git needs architectural changes. If (3), then it would be good to get more information about the problem so that the leak can be fixed. If this is the case, it might be possible to work around the problem by splitting the dumpfile into several parts and loading them one after the other (outputting the marks from one run and loading them into the next). If (4) or (5), then it would be helpful to narrow down the problem. It might be possible to do so by following the instructions in the cvs2svn FAQ [1] for systematically shrinking a test case to a smaller size using destroy_repository.py and shrink_test_case.py. If you can create a small repository that triggers the same problem, then there is a good chance that it is easy to fix.

Michael (the cvs2git maintainer)

[1] http://cvs2svn.tigris.org/faq.html#testcase

--
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com
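Michael's workaround for case (3), splitting the import, would look roughly like this; part1.dat and part2.dat are hypothetical, and each part must end on a fast-import command boundary, which a naive byte split will not guarantee:

    git init repo && cd repo
    git fast-import --export-marks=../marks.dat <../part1.dat
    git fast-import --import-marks=../marks.dat --export-marks=../marks.dat <../part2.dat

--import-marks/--export-marks are standard fast-import options; carrying the marks file across runs is what lets later parts refer to objects created in earlier ones.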
error: git-fast-import died of signal 11
Hi,

I'm trying to convert a CVS repository to Git using cvs2git. I was able to generate the dump file without problem, but am unable to get Git to fast-import it. The dump file is 328GB, and I ran git fast-import on a machine with 512GB of RAM.

    fatal: Out of memory? mmap failed: Cannot allocate memory
    fast-import: dumping crash report to fast_import_crash_18192
    error: git-fast-import died of signal 11

How can I import the repository?

Thanks,
Uri
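For reference, cvs2git's documented load pipeline looks like this (file names as in the cvs2git documentation; the 328GB figure above may be the two files combined):

    git init --bare repo.git
    cd repo.git
    cat ../git-blob.dat ../git-dump.dat | git fast-import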