GitHub user jaikiran opened a pull request: https://github.com/apache/ant/pull/76
bz-43144 - Improve the performance of the tar task when it uses a zipfileset https://bz.apache.org/bugzilla/show_bug.cgi?id=43144 is an issue where users have reported that the tar task is extremely slow when used with a `zipfileset`. The comment in that bug, from @bodewig, rightly notes that the reason for this slowness has to do with the code where we iterate over the zip entries as a `Resource` and then open an `InputStream` for each entry during the `tar` task. The implementation of the `ZipResource#getInputStream` opens a new `ZipFile` every time and as Stefan notes in that bug, the reason it's done this way is because we don't know what would be the right time to close a `ZipFIle` if the implementation of the `ZipResource` would want to hold on to a open `ZipFile` instance and reuse it. However, there are occasions, like the one here, where the callers of the `ArchiveFileSet` are aware and know when and how long they expect the underlying archive to be open. The commit in this PR, introduces a new method (`openArchive()`) on the `ArchiveFileSet` to allow such callers to have more control over the opening and closing of the underlying resource(s) like the `ZipFile`. The whole goal of this new method is to allow iterating over the entries in the archive (just like the existing `iterator()` method) and yet the same time exposing a way for users to explicitly `close` the returned `ArchiveEntries`. When to use the `iterator()` method and when to use the `openArchive` method will be left to the users of ArchiveSet. The commit in this PR intentionally just exposes only one method `openArchive` as a `public` method and keeps the rest of the new methods either private or package private. Right now, only `ZipFileSet` overrides the new package private method to provide a scanner which holds on to open resource(s), if it's asked to do so. The changes in this commit maintain backward compatibility of the existing methods and does _not_ introduce any change in behaviour of their usages or semantics. The javadocs of the new methods explain more about what each one does and how the usage typically looks like. I have run a few of the existing Tar related tests locally and haven't found any regressions. I have also run a manual test to see how the new performance with this change looks like. I used a random zip file which is around 5MB in size and has 927 entries (as reported by the unzip command): ``` unzip -l source.zip <contents of the file> .... --------- ------- 22222385 927 files ``` I then used the following build file: ``` <?xml version="1.0" encoding="UTF-8"?> <project name="test" default="main"> <target name="main"> <mkdir dir="build"/> <tar destFile="build/test.tar" longfile="gnu"> <zipfileset src="source.zip"/> </tar> </target> </project> ``` Ran the following command: ``` time ant ``` With Ant 1.10.5 (the latest released), it takes a while to complete and reports: ``` Total time: 23 seconds real 0m23.485s user 0m16.767s sys 0m9.368s ``` So around 23 seconds for the tar task. With this patch and the freshly built Ant distribution, this same build file (I did the necessary cleanup of the build generated tar file, before running it again) reports: ``` Total time: 0 seconds real 0m1.068s user 0m1.994s sys 0m0.258s ``` Around 1 second for the exact same task. So as expected there's a big improvement. I have done basic comparison of the generated tar files, with Ant 1.10.5 and this patched version to check it contains all the expected content and it looked fine. I will see if I can add some kind of automated testing around this if possible. Until then I would like inputs on whether this looks fine and any suggestions/feedback on this change. Right now, this is targetted for master branch. I'll take a look later if it's easy (I guess, should be) to have this in 1.9.x too. P.S: The linked bug also has one comment which states that the `copy` task when it uses a `zipfileset` is impacted by this performance issue too. I haven't checked or verified that. At a later date, I'll take a look if it needs this same/similar fix. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaikiran/ant bz-43144-master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ant/pull/76.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #76 ---- commit 81b8c80c550ba560770a1f82de60c4d0b11ace91 Author: Jaikiran Pai <jaikiran@...> Date: 2018-10-31T13:59:29Z bz-43144 - (performance fix) Explicitly control when an archive is opened and closed when a zipfileset is used as a resource collection in the tar task ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@ant.apache.org For additional commands, e-mail: dev-h...@ant.apache.org