On 16 Sep 2015, at 10:36, Luke Diamand <l...@diamand.org> wrote:

> On 14/09/15 14:26, larsxschnei...@gmail.com wrote:
>> From: Lars Schneider <larsxschnei...@gmail.com>
>> 
>> Perforce repositories can contain large (binary) files. Migrating these
>> repositories to Git generates very large local clones. External storage
>> systems such as Git LFS [1], Git Fat [2], Git Media [3], git-annex [4]
>> try to address this problem.
>> 
>> Add a generic mechanism to detect large files based on extension,
>> uncompressed size, and/or compressed size.
> 
> Looks excellent! I've got a small few comments (below) and I need to come 
> back and have another look through if that's OK.
Sure! Thank you :-)

> 
> Thanks!
> Luke
> 
>> 
>> [1] https://git-lfs.github.com/
>> [2] https://github.com/jedbrown/git-fat
>> [3] https://github.com/alebedev/git-media
>> [4] https://git-annex.branchable.com/
>> 
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
>>  Documentation/git-p4.txt   |  32 +++++++++++
>>  git-p4.py                  | 139 
>> +++++++++++++++++++++++++++++++++++++++++----
>>  t/t9823-git-p4-mock-lfs.sh |  96 +++++++++++++++++++++++++++++++
>>  3 files changed, 257 insertions(+), 10 deletions(-)
>>  create mode 100755 t/t9823-git-p4-mock-lfs.sh
>> 
>> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
>> index 82aa5d6..3f21e95 100644
>> --- a/Documentation/git-p4.txt
>> +++ b/Documentation/git-p4.txt
>> @@ -510,6 +510,38 @@ git-p4.useClientSpec::
>>      option '--use-client-spec'.  See the "CLIENT SPEC" section above.
>>      This variable is a boolean, not the name of a p4 client.
>> 
>> +git-p4.largeFileSystem::
>> +    Specify the system that is used for large (binary) files. Please note
>> +    that large file systems do not support the 'git p4 submit' command.
> 
> Why is that? Is it just that you haven't implemented support, or is it 
> fundamentally impossible?

If we detect LFS files only by file extension then we could make it work. But 
then we must not use any git-p4 settings. We would need to rely only on the 
“.gitattributes” file that is stored in the P4 repository. My implementation 
also looks at the file size and decides on a individual file basis if a file is 
stored in LFS. That means all clients need the same file size threshold. 

Junio explained the problem in the v4 thread:
> How is collaboration between those who talk to the same p4 depot
> backed by GitHub LFS expected to work?  You use config to set size
> limits and list of file extensions in your repository, and grab new
> changes from p4 and turn them into Git commits (with pointers to LFS
> and the .gitattributes file that records your choice of the config).
> I as a new member to the same project come, clone the resulting Git
> repository from you and then what do I do before talking to the same
> p4 depot to further update the Git history?  Are the values recorded
> in .gitattributes (which essentially were derived from your
> configuration) somehow reflected back automatically to my config so
> that our world view would become consistent?  Perhaps you added
> 'iso' to largeFileExtensions before I cloned from you, and that
> would be present in the copy of .gitattributes I obtained from you.
> I may be trying to add a new ".iso" file, and I presume that an
> existing .gitattributes entry you created based on the file
> extension automatically covers it from the LFS side, but does my
> "git p4 submit" also know what to do, or would it be broken until I
> add a set of configrations that matches yours?
 


> 
>> +    Only Git LFS [1] is implemented right now. Download
>> +    and install the Git LFS command line extension to use this option
>> +    and configure it like this:
>> ++
>> +-------------
>> +git config       git-p4.largeFileSystem GitLFS
>> +-------------
>> ++
>> +    [1] https://git-lfs.github.com/
>> +
>> +git-p4.largeFileExtensions::
>> +    All files matching a file extension in the list will be processed
>> +    by the large file system. Do not prefix the extensions with '.'.
>> +
>> +git-p4.largeFileThreshold::
>> +    All files with an uncompressed size exceeding the threshold will be
>> +    processed by the large file system. By default the threshold is
>> +    defined in bytes. Add the suffix k, m, or g to change the unit.
>> +
>> +git-p4.largeFileCompressedThreshold::
>> +    All files with a compressed size exceeding the threshold will be
>> +    processed by the large file system. This option might slow down
>> +    your clone/sync process. By default the threshold is defined in
>> +    bytes. Add the suffix k, m, or g to change the unit.
>> +
>> +git-p4.pushLargeFiles::
>> +    Boolean variable which defines if large files are automatically
>> +    pushed to a server.
>> +
>>  Submit variables
>>  ~~~~~~~~~~~~~~~~
>>  git-p4.detectRenames::
>> diff --git a/git-p4.py b/git-p4.py
>> index b465356..bfe71b5 100755
>> --- a/git-p4.py
>> +++ b/git-p4.py
>> @@ -22,6 +22,8 @@ import platform
>>  import re
>>  import shutil
>>  import stat
>> +import zipfile
>> +import zlib
>> 
>>  try:
>>      from subprocess import CalledProcessError
>> @@ -932,6 +934,111 @@ def wildcard_present(path):
>>      m = re.search("[*#@%]", path)
>>      return m is not None
>> 
>> +class LargeFileSystem(object):
>> +    """Base class for large file system support."""
>> +
>> +    def __init__(self, writeToGitStream):
>> +        self.largeFiles = set()
>> +        self.writeToGitStream = writeToGitStream
>> +
>> +    def generatePointer(self, cloneDestination, contentFile):
>> +        """Return the content of a pointer file that is stored in Git 
>> instead of
>> +           the actual content."""
>> +        assert False, "Method 'generatePointer' required in " + 
>> self.__class__.__name__
>> +
>> +    def pushFile(self, localLargeFile):
>> +        """Push the actual content which is not stored in the Git 
>> repository to
>> +           a server."""
>> +        assert False, "Method 'pushFile' required in " + 
>> self.__class__.__name__
>> +
>> +    def hasLargeFileExtension(self, relPath):
>> +        return reduce(
>> +            lambda a, b: a or b,
>> +            [relPath.endswith('.' + e) for e in 
>> gitConfigList('git-p4.largeFileExtensions')],
>> +            False
>> +        )
>> +
>> +    def generateTempFile(self, contents):
>> +        contentFile = 
>> tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
>> +        for d in contents:
>> +            contentFile.write(d)
>> +        contentFile.flush()
>> +        contentFile.close()
> 
> close() should flush() automatically I would have thought.
Correct. SO thinks so, too 
https://stackoverflow.com/questions/2447143/does-close-imply-flush-in-python

> 
>> +        return contentFile.name
>> +
>> +    def exceedsLargeFileThreshold(self, relPath, contents):
>> +        if gitConfigInt('git-p4.largeFileThreshold'):
>> +            contentsSize = sum(len(d) for d in contents)
>> +            if contentsSize > gitConfigInt('git-p4.largeFileThreshold'):
>> +                return True
>> +        if gitConfigInt('git-p4.largeFileCompressedThreshold'):
>> +            contentsSize = sum(len(d) for d in contents)
>> +            if contentsSize <= 
>> gitConfigInt('git-p4.largeFileCompressedThreshold'):
>> +                return False
>> +            contentTempFile = self.generateTempFile(contents)
>> +            compressedContentFile = 
>> tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
>> +            zf = zipfile.ZipFile(compressedContentFile.name, mode='w')
>> +            zf.write(contentTempFile, compress_type=zipfile.ZIP_DEFLATED)
>> +            zf.close()
>> +            compressedContentsSize = zf.infolist()[0].compress_size
>> +            os.remove(contentTempFile)
>> +            os.remove(compressedContentFile.name)
>> +            if compressedContentsSize > 
>> gitConfigInt('git-p4.largeFileCompressedThreshold'):
>> +                return True
>> +        return False
>> +
>> +    def addLargeFile(self, relPath):
>> +        self.largeFiles.add(relPath)
>> +
>> +    def removeLargeFile(self, relPath):
>> +        self.largeFiles.remove(relPath)
>> +
>> +    def isLargeFile(self, relPath):
>> +        return relPath in self.largeFiles
>> +
>> +    def processContent(self, cloneDestination, git_mode, relPath, contents):
>> +        """Processes the content of git fast import. This method decides if 
>> a
>> +           file is stored in the large file system and handles all necessary
>> +           steps."""
>> +        if self.exceedsLargeFileThreshold(relPath, contents) or 
>> self.hasLargeFileExtension(relPath):
>> +            contentTempFile = self.generateTempFile(contents)
>> +            (git_mode, contents, localLargeFile) = 
>> self.generatePointer(cloneDestination, contentTempFile)
>> +
>> +            # Move temp file to final location in large file system
>> +            largeFileDir = os.path.dirname(localLargeFile)
>> +            if not os.path.isdir(largeFileDir):
>> +                os.makedirs(largeFileDir)
>> +            shutil.move(contentTempFile, localLargeFile)
>> +            self.addLargeFile(relPath)
>> +            if gitConfigBool('git-p4.pushLargeFiles'):
>> +                self.pushFile(localLargeFile)
>> +            if verbose:
>> +                sys.stderr.write("%s moved to large file system (%s)\n" % 
>> (relPath, localLargeFile))
>> +        return (git_mode, contents)
>> +
>> +class MockLFS(LargeFileSystem):
>> +    """Mock large file system for testing."""
>> +
>> +    def generatePointer(self, cloneDestination, contentFile):
>> +        """The pointer content is the original content prefixed with 
>> "pointer-".
>> +           The local filename of the large file storage is derived from the 
>> file content.
>> +           """
>> +        with open(contentFile, 'r') as f:
>> +            content = next(f)
>> +            gitMode = '100644'
>> +            pointerContents = 'pointer-' + content
>> +            localLargeFile = os.path.join(cloneDestination, '.git', 
>> 'mock-storage', 'local', content[:-1])
>> +            return (gitMode, pointerContents, localLargeFile)
>> +
>> +    def pushFile(self, localLargeFile):
>> +        """The remote filename of the large file storage is the same as the 
>> local
>> +           one but in a different directory.
>> +           """
>> +        remotePath = os.path.join(os.path.dirname(localLargeFile), '..', 
>> 'remote')
>> +        if not os.path.exists(remotePath):
>> +            os.makedirs(remotePath)
>> +        shutil.copyfile(localLargeFile, os.path.join(remotePath, 
>> os.path.basename(localLargeFile)))
>> +
>>  class Command:
>>      def __init__(self):
>>          self.usage = "usage: %prog [options]"
>> @@ -1105,6 +1212,9 @@ class P4Submit(Command, P4UserMap):
>>          self.p4HasMoveCommand = p4_has_move_command()
>>          self.branch = None
>> 
>> +        if gitConfig('git-p4.largeFileSystem'):
>> +            die("Large file system not supported for git-p4 submit command. 
>> Please remove it from config.")
>> +
>>      def check(self):
>>          if len(p4CmdList("opened ...")) > 0:
>>              die("You have files opened with perforce! Close them before 
>> starting the sync.")
>> @@ -2057,6 +2167,12 @@ class P4Sync(Command, P4UserMap):
>>                  lambda git_mode, relPath, contents: 
>> self.writeToGitStream(git_mode, relPath, contents)
>>              )
>> 
>> +        if gitConfig('git-p4.largeFileSystem'):
>> +            largeFileSystemConstructor = 
>> globals()[gitConfig('git-p4.largeFileSystem')]
>> +            self.largeFileSystem = largeFileSystemConstructor(
>> +                lambda git_mode, relPath, contents: 
>> self.writeToGitStream(git_mode, relPath, contents)
>> +            )
>> +
>>          if gitConfig("git-p4.syncFromOrigin") == "false":
>>              self.syncWithOrigin = False
>> 
>> @@ -2176,6 +2292,13 @@ class P4Sync(Command, P4UserMap):
>> 
>>          return branches
>> 
>> +    def writeToGitStream(self, gitMode, relPath, contents):
>> +        self.gitStream.write('M %s inline %s\n' % (gitMode, relPath))
>> +        self.gitStream.write('data %d\n' % sum(len(d) for d in contents))
>> +        for d in contents:
>> +            self.gitStream.write(d)
>> +        self.gitStream.write('\n')
>> +
>>      # output one file from the P4 stream
>>      # - helper for streamP4Files
>> 
>> @@ -2246,17 +2369,10 @@ class P4Sync(Command, P4UserMap):
>>              text = regexp.sub(r'$\1$', text)
>>              contents = [ text ]
>> 
>> -        self.gitStream.write("M %s inline %s\n" % (git_mode, relPath))
>> +        if self.largeFileSystem:
>> +            (git_mode, contents) = 
>> self.largeFileSystem.processContent(self.cloneDestination, git_mode, 
>> relPath, contents)
>> 
>> -        # total length...
>> -        length = 0
>> -        for d in contents:
>> -            length = length + len(d)
>> -
>> -        self.gitStream.write("data %d\n" % length)
>> -        for d in contents:
>> -            self.gitStream.write(d)
>> -        self.gitStream.write("\n")
>> +        self.writeToGitStream(git_mode, relPath, contents)
>> 
>>      def streamOneP4Deletion(self, file):
>>          relPath = self.stripRepoPath(file['path'], self.branchPrefixes)
>> @@ -2265,6 +2381,9 @@ class P4Sync(Command, P4UserMap):
>>              sys.stdout.flush()
>>          self.gitStream.write("D %s\n" % relPath)
>> 
>> +        if self.largeFileSystem and 
>> self.largeFileSystem.isLargeFile(relPath):
>> +            self.largeFileSystem.removeLargeFile(relPath)
>> +
>>      # handle another chunk of streaming data
>>      def streamP4FilesCb(self, marshalled):
>> 
>> diff --git a/t/t9823-git-p4-mock-lfs.sh b/t/t9823-git-p4-mock-lfs.sh
>> new file mode 100755
>> index 0000000..8582857
>> --- /dev/null
>> +++ b/t/t9823-git-p4-mock-lfs.sh
>> @@ -0,0 +1,96 @@
>> +#!/bin/sh
>> +
>> +test_description='Clone repositories and store files in Mock LFS'
>> +
>> +. ./lib-git-p4.sh
>> +
>> +test_file_in_mock () {
>> +    FILE="$1"
> Missing &&
> Plus the next few lines
Are they strictly necessary? I believe you can define variables all in “one 
line”. I found it like that in existing tests:
https://github.com/git/git/blob/f4d9753a89bf04011c00e943d85211906e86a0f6/t/t9402-git-cvsserver-refs.sh#L18-L20

Nevertheless I will add them in the next role!

> 
>> +    CONTENT="$2"
>> +    LOCAL_STORAGE=".git/mock-storage/local/$CONTENT"
>> +    SERVER_STORAGE=".git/mock-storage/remote/$CONTENT"
>> +    echo "pointer-$CONTENT" >expect_pointer
>> +    echo "$CONTENT" >expect_content
>> +    test_path_is_file "$FILE" &&
>> +    test_path_is_file "$LOCAL_STORAGE" &&
>> +    test_path_is_file "$SERVER_STORAGE" &&
>> +    test_cmp expect_pointer "$FILE" &&
>> +    test_cmp expect_content "$LOCAL_STORAGE" &&
>> +    test_cmp expect_content "$SERVER_STORAGE"
>> +}
>> +
>> +test_file_count_in_dir () {
>> +    DIR="$1"
>> +    EXPECTED_COUNT="$2"
>> +    find "$DIR" -type f >actual
>> +    test_line_count = $EXPECTED_COUNT actual
>> +}
>> +
>> +test_expect_success 'start p4d' '
>> +    start_p4d
>> +'
>> +
>> +test_expect_success 'Create repo with binary files' '
>> +    client_view "//depot/... //client/..." &&
>> +    (
>> +            cd "$cli" &&
>> +
>> +            echo "content 1 txt 23 bytes" >file1.txt &&
>> +            p4 add file1.txt &&
>> +            echo "content 2-3 bin 25 bytes" >file2.dat &&
>> +            p4 add file2.dat &&
>> +            p4 submit -d "Add text and binary file" &&
>> +
>> +            mkdir "path with spaces" &&
>> +            echo "content 2-3 bin 25 bytes" >"path with spaces/file3.bin" &&
> 
> Nice putting spaces in the path!
I stumbled across this case in a production migration :-)

> 
>> +            p4 add "path with spaces/file3.bin" &&
>> +            p4 submit -d "Add another binary file with same content and 
>> spaces in path" &&
>> +
>> +            echo "content 4 bin 26 bytes XX" >file4.bin &&
>> +            p4 add file4.bin &&
>> +            p4 submit -d "Add another binary file with different content"
>> +    )
>> +'
>> +
>> +test_expect_success 'Store files in Mock based on size (>24 bytes)' '
>> +    client_view "//depot/... //client/..." &&
>> +    test_when_finished cleanup_git &&
>> +    (
>> +            cd "$git" &&
>> +            git init . &&
>> +            git config git-p4.useClientSpec true &&
>> +            git config git-p4.largeFileSystem MockLFS &&
>> +            git config git-p4.largeFileThreshold 24 &&
>> +            git config git-p4.pushLargeFiles True &&
>> +            git p4 clone --destination="$git" //depot@all &&
>> +
>> +            test_file_in_mock file2.dat "content 2-3 bin 25 bytes" &&
>> +            test_file_in_mock "path with spaces/file3.bin" "content 2-3 bin 
>> 25 bytes" &&
>> +            test_file_in_mock file4.bin "content 4 bin 26 bytes XX" &&
>> +
>> +            test_file_count_in_dir ".git/mock-storage/local" 2 &&
>> +            test_file_count_in_dir ".git/mock-storage/remote" 2
>> +    )
>> +'
>> +
>> +test_expect_success 'Run git p4 submit in repo configured with large file 
>> system' '
>> +    client_view "//depot/... //client/..." &&
>> +    test_when_finished cleanup_git &&
>> +    (
>> +            cd "$git" &&
>> +            git init . &&
>> +            git config git-p4.useClientSpec true &&
>> +            git config git-p4.largeFileSystem MockLFS &&
>> +            git config git-p4.largeFileThreshold 24 &&
>> +            git config git-p4.pushLargeFiles True &&
>> +            git p4 clone --destination="$git" //depot@all &&
>> +
>> +            test_must_fail git p4 submit
> 
> No test for deletion, does that matter?
OK, I will add one in the mock, too. I have a test for that in the 
“t9828-git-p4-lfs.sh” named “Remove file from repo and store files in LFS based 
on size (>24 bytes)”.

Thanks,
Lars--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to