On 16 Sep 2015, at 10:36, Luke Diamand <l...@diamand.org> wrote: > On 14/09/15 14:26, larsxschnei...@gmail.com wrote: >> From: Lars Schneider <larsxschnei...@gmail.com> >> >> Perforce repositories can contain large (binary) files. Migrating these >> repositories to Git generates very large local clones. External storage >> systems such as Git LFS [1], Git Fat [2], Git Media [3], git-annex [4] >> try to address this problem. >> >> Add a generic mechanism to detect large files based on extension, >> uncompressed size, and/or compressed size. > > Looks excellent! I've got a small few comments (below) and I need to come > back and have another look through if that's OK. Sure! Thank you :-)
> > Thanks! > Luke > >> >> [1] https://git-lfs.github.com/ >> [2] https://github.com/jedbrown/git-fat >> [3] https://github.com/alebedev/git-media >> [4] https://git-annex.branchable.com/ >> >> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com> >> --- >> Documentation/git-p4.txt | 32 +++++++++++ >> git-p4.py | 139 >> +++++++++++++++++++++++++++++++++++++++++---- >> t/t9823-git-p4-mock-lfs.sh | 96 +++++++++++++++++++++++++++++++ >> 3 files changed, 257 insertions(+), 10 deletions(-) >> create mode 100755 t/t9823-git-p4-mock-lfs.sh >> >> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt >> index 82aa5d6..3f21e95 100644 >> --- a/Documentation/git-p4.txt >> +++ b/Documentation/git-p4.txt >> @@ -510,6 +510,38 @@ git-p4.useClientSpec:: >> option '--use-client-spec'. See the "CLIENT SPEC" section above. >> This variable is a boolean, not the name of a p4 client. >> >> +git-p4.largeFileSystem:: >> + Specify the system that is used for large (binary) files. Please note >> + that large file systems do not support the 'git p4 submit' command. > > Why is that? Is it just that you haven't implemented support, or is it > fundamentally impossible? If we detect LFS files only by file extension then we could make it work. But then we must not use any git-p4 settings. We would need to rely only on the “.gitattributes” file that is stored in the P4 repository. My implementation also looks at the file size and decides on a individual file basis if a file is stored in LFS. That means all clients need the same file size threshold. Junio explained the problem in the v4 thread: > How is collaboration between those who talk to the same p4 depot > backed by GitHub LFS expected to work? You use config to set size > limits and list of file extensions in your repository, and grab new > changes from p4 and turn them into Git commits (with pointers to LFS > and the .gitattributes file that records your choice of the config). > I as a new member to the same project come, clone the resulting Git > repository from you and then what do I do before talking to the same > p4 depot to further update the Git history? Are the values recorded > in .gitattributes (which essentially were derived from your > configuration) somehow reflected back automatically to my config so > that our world view would become consistent? Perhaps you added > 'iso' to largeFileExtensions before I cloned from you, and that > would be present in the copy of .gitattributes I obtained from you. > I may be trying to add a new ".iso" file, and I presume that an > existing .gitattributes entry you created based on the file > extension automatically covers it from the LFS side, but does my > "git p4 submit" also know what to do, or would it be broken until I > add a set of configrations that matches yours? > >> + Only Git LFS [1] is implemented right now. Download >> + and install the Git LFS command line extension to use this option >> + and configure it like this: >> ++ >> +------------- >> +git config git-p4.largeFileSystem GitLFS >> +------------- >> ++ >> + [1] https://git-lfs.github.com/ >> + >> +git-p4.largeFileExtensions:: >> + All files matching a file extension in the list will be processed >> + by the large file system. Do not prefix the extensions with '.'. >> + >> +git-p4.largeFileThreshold:: >> + All files with an uncompressed size exceeding the threshold will be >> + processed by the large file system. By default the threshold is >> + defined in bytes. Add the suffix k, m, or g to change the unit. >> + >> +git-p4.largeFileCompressedThreshold:: >> + All files with a compressed size exceeding the threshold will be >> + processed by the large file system. This option might slow down >> + your clone/sync process. By default the threshold is defined in >> + bytes. Add the suffix k, m, or g to change the unit. >> + >> +git-p4.pushLargeFiles:: >> + Boolean variable which defines if large files are automatically >> + pushed to a server. >> + >> Submit variables >> ~~~~~~~~~~~~~~~~ >> git-p4.detectRenames:: >> diff --git a/git-p4.py b/git-p4.py >> index b465356..bfe71b5 100755 >> --- a/git-p4.py >> +++ b/git-p4.py >> @@ -22,6 +22,8 @@ import platform >> import re >> import shutil >> import stat >> +import zipfile >> +import zlib >> >> try: >> from subprocess import CalledProcessError >> @@ -932,6 +934,111 @@ def wildcard_present(path): >> m = re.search("[*#@%]", path) >> return m is not None >> >> +class LargeFileSystem(object): >> + """Base class for large file system support.""" >> + >> + def __init__(self, writeToGitStream): >> + self.largeFiles = set() >> + self.writeToGitStream = writeToGitStream >> + >> + def generatePointer(self, cloneDestination, contentFile): >> + """Return the content of a pointer file that is stored in Git >> instead of >> + the actual content.""" >> + assert False, "Method 'generatePointer' required in " + >> self.__class__.__name__ >> + >> + def pushFile(self, localLargeFile): >> + """Push the actual content which is not stored in the Git >> repository to >> + a server.""" >> + assert False, "Method 'pushFile' required in " + >> self.__class__.__name__ >> + >> + def hasLargeFileExtension(self, relPath): >> + return reduce( >> + lambda a, b: a or b, >> + [relPath.endswith('.' + e) for e in >> gitConfigList('git-p4.largeFileExtensions')], >> + False >> + ) >> + >> + def generateTempFile(self, contents): >> + contentFile = >> tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False) >> + for d in contents: >> + contentFile.write(d) >> + contentFile.flush() >> + contentFile.close() > > close() should flush() automatically I would have thought. Correct. SO thinks so, too https://stackoverflow.com/questions/2447143/does-close-imply-flush-in-python > >> + return contentFile.name >> + >> + def exceedsLargeFileThreshold(self, relPath, contents): >> + if gitConfigInt('git-p4.largeFileThreshold'): >> + contentsSize = sum(len(d) for d in contents) >> + if contentsSize > gitConfigInt('git-p4.largeFileThreshold'): >> + return True >> + if gitConfigInt('git-p4.largeFileCompressedThreshold'): >> + contentsSize = sum(len(d) for d in contents) >> + if contentsSize <= >> gitConfigInt('git-p4.largeFileCompressedThreshold'): >> + return False >> + contentTempFile = self.generateTempFile(contents) >> + compressedContentFile = >> tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False) >> + zf = zipfile.ZipFile(compressedContentFile.name, mode='w') >> + zf.write(contentTempFile, compress_type=zipfile.ZIP_DEFLATED) >> + zf.close() >> + compressedContentsSize = zf.infolist()[0].compress_size >> + os.remove(contentTempFile) >> + os.remove(compressedContentFile.name) >> + if compressedContentsSize > >> gitConfigInt('git-p4.largeFileCompressedThreshold'): >> + return True >> + return False >> + >> + def addLargeFile(self, relPath): >> + self.largeFiles.add(relPath) >> + >> + def removeLargeFile(self, relPath): >> + self.largeFiles.remove(relPath) >> + >> + def isLargeFile(self, relPath): >> + return relPath in self.largeFiles >> + >> + def processContent(self, cloneDestination, git_mode, relPath, contents): >> + """Processes the content of git fast import. This method decides if >> a >> + file is stored in the large file system and handles all necessary >> + steps.""" >> + if self.exceedsLargeFileThreshold(relPath, contents) or >> self.hasLargeFileExtension(relPath): >> + contentTempFile = self.generateTempFile(contents) >> + (git_mode, contents, localLargeFile) = >> self.generatePointer(cloneDestination, contentTempFile) >> + >> + # Move temp file to final location in large file system >> + largeFileDir = os.path.dirname(localLargeFile) >> + if not os.path.isdir(largeFileDir): >> + os.makedirs(largeFileDir) >> + shutil.move(contentTempFile, localLargeFile) >> + self.addLargeFile(relPath) >> + if gitConfigBool('git-p4.pushLargeFiles'): >> + self.pushFile(localLargeFile) >> + if verbose: >> + sys.stderr.write("%s moved to large file system (%s)\n" % >> (relPath, localLargeFile)) >> + return (git_mode, contents) >> + >> +class MockLFS(LargeFileSystem): >> + """Mock large file system for testing.""" >> + >> + def generatePointer(self, cloneDestination, contentFile): >> + """The pointer content is the original content prefixed with >> "pointer-". >> + The local filename of the large file storage is derived from the >> file content. >> + """ >> + with open(contentFile, 'r') as f: >> + content = next(f) >> + gitMode = '100644' >> + pointerContents = 'pointer-' + content >> + localLargeFile = os.path.join(cloneDestination, '.git', >> 'mock-storage', 'local', content[:-1]) >> + return (gitMode, pointerContents, localLargeFile) >> + >> + def pushFile(self, localLargeFile): >> + """The remote filename of the large file storage is the same as the >> local >> + one but in a different directory. >> + """ >> + remotePath = os.path.join(os.path.dirname(localLargeFile), '..', >> 'remote') >> + if not os.path.exists(remotePath): >> + os.makedirs(remotePath) >> + shutil.copyfile(localLargeFile, os.path.join(remotePath, >> os.path.basename(localLargeFile))) >> + >> class Command: >> def __init__(self): >> self.usage = "usage: %prog [options]" >> @@ -1105,6 +1212,9 @@ class P4Submit(Command, P4UserMap): >> self.p4HasMoveCommand = p4_has_move_command() >> self.branch = None >> >> + if gitConfig('git-p4.largeFileSystem'): >> + die("Large file system not supported for git-p4 submit command. >> Please remove it from config.") >> + >> def check(self): >> if len(p4CmdList("opened ...")) > 0: >> die("You have files opened with perforce! Close them before >> starting the sync.") >> @@ -2057,6 +2167,12 @@ class P4Sync(Command, P4UserMap): >> lambda git_mode, relPath, contents: >> self.writeToGitStream(git_mode, relPath, contents) >> ) >> >> + if gitConfig('git-p4.largeFileSystem'): >> + largeFileSystemConstructor = >> globals()[gitConfig('git-p4.largeFileSystem')] >> + self.largeFileSystem = largeFileSystemConstructor( >> + lambda git_mode, relPath, contents: >> self.writeToGitStream(git_mode, relPath, contents) >> + ) >> + >> if gitConfig("git-p4.syncFromOrigin") == "false": >> self.syncWithOrigin = False >> >> @@ -2176,6 +2292,13 @@ class P4Sync(Command, P4UserMap): >> >> return branches >> >> + def writeToGitStream(self, gitMode, relPath, contents): >> + self.gitStream.write('M %s inline %s\n' % (gitMode, relPath)) >> + self.gitStream.write('data %d\n' % sum(len(d) for d in contents)) >> + for d in contents: >> + self.gitStream.write(d) >> + self.gitStream.write('\n') >> + >> # output one file from the P4 stream >> # - helper for streamP4Files >> >> @@ -2246,17 +2369,10 @@ class P4Sync(Command, P4UserMap): >> text = regexp.sub(r'$\1$', text) >> contents = [ text ] >> >> - self.gitStream.write("M %s inline %s\n" % (git_mode, relPath)) >> + if self.largeFileSystem: >> + (git_mode, contents) = >> self.largeFileSystem.processContent(self.cloneDestination, git_mode, >> relPath, contents) >> >> - # total length... >> - length = 0 >> - for d in contents: >> - length = length + len(d) >> - >> - self.gitStream.write("data %d\n" % length) >> - for d in contents: >> - self.gitStream.write(d) >> - self.gitStream.write("\n") >> + self.writeToGitStream(git_mode, relPath, contents) >> >> def streamOneP4Deletion(self, file): >> relPath = self.stripRepoPath(file['path'], self.branchPrefixes) >> @@ -2265,6 +2381,9 @@ class P4Sync(Command, P4UserMap): >> sys.stdout.flush() >> self.gitStream.write("D %s\n" % relPath) >> >> + if self.largeFileSystem and >> self.largeFileSystem.isLargeFile(relPath): >> + self.largeFileSystem.removeLargeFile(relPath) >> + >> # handle another chunk of streaming data >> def streamP4FilesCb(self, marshalled): >> >> diff --git a/t/t9823-git-p4-mock-lfs.sh b/t/t9823-git-p4-mock-lfs.sh >> new file mode 100755 >> index 0000000..8582857 >> --- /dev/null >> +++ b/t/t9823-git-p4-mock-lfs.sh >> @@ -0,0 +1,96 @@ >> +#!/bin/sh >> + >> +test_description='Clone repositories and store files in Mock LFS' >> + >> +. ./lib-git-p4.sh >> + >> +test_file_in_mock () { >> + FILE="$1" > Missing && > Plus the next few lines Are they strictly necessary? I believe you can define variables all in “one line”. I found it like that in existing tests: https://github.com/git/git/blob/f4d9753a89bf04011c00e943d85211906e86a0f6/t/t9402-git-cvsserver-refs.sh#L18-L20 Nevertheless I will add them in the next role! > >> + CONTENT="$2" >> + LOCAL_STORAGE=".git/mock-storage/local/$CONTENT" >> + SERVER_STORAGE=".git/mock-storage/remote/$CONTENT" >> + echo "pointer-$CONTENT" >expect_pointer >> + echo "$CONTENT" >expect_content >> + test_path_is_file "$FILE" && >> + test_path_is_file "$LOCAL_STORAGE" && >> + test_path_is_file "$SERVER_STORAGE" && >> + test_cmp expect_pointer "$FILE" && >> + test_cmp expect_content "$LOCAL_STORAGE" && >> + test_cmp expect_content "$SERVER_STORAGE" >> +} >> + >> +test_file_count_in_dir () { >> + DIR="$1" >> + EXPECTED_COUNT="$2" >> + find "$DIR" -type f >actual >> + test_line_count = $EXPECTED_COUNT actual >> +} >> + >> +test_expect_success 'start p4d' ' >> + start_p4d >> +' >> + >> +test_expect_success 'Create repo with binary files' ' >> + client_view "//depot/... //client/..." && >> + ( >> + cd "$cli" && >> + >> + echo "content 1 txt 23 bytes" >file1.txt && >> + p4 add file1.txt && >> + echo "content 2-3 bin 25 bytes" >file2.dat && >> + p4 add file2.dat && >> + p4 submit -d "Add text and binary file" && >> + >> + mkdir "path with spaces" && >> + echo "content 2-3 bin 25 bytes" >"path with spaces/file3.bin" && > > Nice putting spaces in the path! I stumbled across this case in a production migration :-) > >> + p4 add "path with spaces/file3.bin" && >> + p4 submit -d "Add another binary file with same content and >> spaces in path" && >> + >> + echo "content 4 bin 26 bytes XX" >file4.bin && >> + p4 add file4.bin && >> + p4 submit -d "Add another binary file with different content" >> + ) >> +' >> + >> +test_expect_success 'Store files in Mock based on size (>24 bytes)' ' >> + client_view "//depot/... //client/..." && >> + test_when_finished cleanup_git && >> + ( >> + cd "$git" && >> + git init . && >> + git config git-p4.useClientSpec true && >> + git config git-p4.largeFileSystem MockLFS && >> + git config git-p4.largeFileThreshold 24 && >> + git config git-p4.pushLargeFiles True && >> + git p4 clone --destination="$git" //depot@all && >> + >> + test_file_in_mock file2.dat "content 2-3 bin 25 bytes" && >> + test_file_in_mock "path with spaces/file3.bin" "content 2-3 bin >> 25 bytes" && >> + test_file_in_mock file4.bin "content 4 bin 26 bytes XX" && >> + >> + test_file_count_in_dir ".git/mock-storage/local" 2 && >> + test_file_count_in_dir ".git/mock-storage/remote" 2 >> + ) >> +' >> + >> +test_expect_success 'Run git p4 submit in repo configured with large file >> system' ' >> + client_view "//depot/... //client/..." && >> + test_when_finished cleanup_git && >> + ( >> + cd "$git" && >> + git init . && >> + git config git-p4.useClientSpec true && >> + git config git-p4.largeFileSystem MockLFS && >> + git config git-p4.largeFileThreshold 24 && >> + git config git-p4.pushLargeFiles True && >> + git p4 clone --destination="$git" //depot@all && >> + >> + test_must_fail git p4 submit > > No test for deletion, does that matter? OK, I will add one in the mock, too. I have a test for that in the “t9828-git-p4-lfs.sh” named “Remove file from repo and store files in LFS based on size (>24 bytes)”. Thanks, Lars-- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html