GitHub user jaikiran opened a pull request:

    https://github.com/apache/ant/pull/76

    bz-43144 - Improve the performance of the tar task when it uses a zipfileset

    https://bz.apache.org/bugzilla/show_bug.cgi?id=43144 is an issue where 
users have reported that the tar task is extremely slow when used with a 
`zipfileset`. 
    
    The comment in that bug, from @bodewig, rightly notes that the reason for 
this slowness has to do with the code where we iterate over the zip entries as 
a `Resource` and then open an `InputStream` for each entry during the `tar` 
task. The implementation of the `ZipResource#getInputStream` opens a new 
`ZipFile` every time and as Stefan notes in that bug, the reason it's done this 
way is because we don't know what would be the right time to close a `ZipFIle` 
if the implementation of the `ZipResource` would want to hold on to a open 
`ZipFile` instance and reuse it.
    
    However, there are occasions, like the one here, where the callers of the 
`ArchiveFileSet` are aware and know when and how long they expect the 
underlying archive to be open. The commit in this PR, introduces a new method 
(`openArchive()`) on the `ArchiveFileSet` to allow such callers to have more 
control over the opening and closing of the underlying resource(s) like the 
`ZipFile`. The whole goal of this new method is to allow iterating over the 
entries in the archive (just like the existing `iterator()` method) and yet the 
same time exposing a way for users to explicitly `close` the returned 
`ArchiveEntries`. When to use the `iterator()` method and when to use the 
`openArchive` method will be left to the users of ArchiveSet.
    
    The commit in this PR intentionally just exposes only one method 
`openArchive` as a `public` method and keeps the rest of the new methods either 
private or package private. Right now, only `ZipFileSet` overrides the new 
package private method to provide a scanner which holds on to open resource(s), 
if it's asked to do so.
    
    The changes in this commit maintain backward compatibility of the existing 
methods and does _not_ introduce any change in behaviour of their usages or 
semantics.
    
    The javadocs of the new methods explain more about what each one does and 
how the usage typically looks like.
    
    I have run a few of the existing Tar related tests locally and haven't 
found any regressions. I have also run a manual test to see how the new 
performance with this change looks like. I used a random zip file which is 
around 5MB in size and has 927 entries (as reported by the unzip command):
    
    ```
    unzip -l source.zip
    <contents of the file> ....
    ---------                     -------
     22222385                     927 files
    ```
    I then used the following build file:
    
    ```
    <?xml version="1.0" encoding="UTF-8"?>
    <project name="test" default="main">
        <target name="main">
          <mkdir dir="build"/>  
          <tar destFile="build/test.tar" longfile="gnu">
                <zipfileset src="source.zip"/>
           </tar>
        </target>
    </project>
    ```
    Ran the following command:
    ```
    time ant
    ```
    With Ant 1.10.5 (the latest released), it takes a while to complete and 
reports:
    
    ```
    Total time: 23 seconds
    
    real        0m23.485s
    user        0m16.767s
    sys 0m9.368s
    ```
    So around 23 seconds for the tar task.
    
    With this patch and the freshly built Ant distribution, this same build 
file (I did the necessary cleanup of the  build generated tar file, before 
running it again) reports:
    
    ```
    Total time: 0 seconds
    
    real        0m1.068s
    user        0m1.994s
    sys 0m0.258s
    ```
    
    Around 1 second for the exact same task. So as expected there's a big 
improvement. I have done basic comparison of the generated tar files, with Ant 
1.10.5 and this patched version to check it contains all the expected content 
and it looked fine. I will see if I can add some kind of automated testing 
around this if possible. Until then I would like inputs on whether this looks 
fine and any suggestions/feedback on this change. Right now, this is targetted 
for master branch. I'll take a look later if it's easy (I guess, should be) to 
have this in 1.9.x too.
    
    P.S: The linked bug also has one comment which states that the `copy` task 
when it uses a `zipfileset` is impacted by this performance issue too. I 
haven't checked or verified that. At a later date, I'll take a look if it needs 
this same/similar fix.
    
    
    
     

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaikiran/ant bz-43144-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ant/pull/76.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #76
    
----
commit 81b8c80c550ba560770a1f82de60c4d0b11ace91
Author: Jaikiran Pai <jaikiran@...>
Date:   2018-10-31T13:59:29Z

    bz-43144 - (performance fix) Explicitly control when an archive is opened 
and closed when a zipfileset is used as a resource collection in the tar task

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ant.apache.org
For additional commands, e-mail: dev-h...@ant.apache.org

Reply via email to