Hi Eirik,

Please see below

On Mar 6, 2023, at 1:43 PM, Eirik Bjørsnøs 
<eir...@gmail.com<mailto:eir...@gmail.com>> wrote:


Hi,

I noticed that ZipOutputStream violates APPNOTE.txt when writing directory 
entries:

You are referring to section:

------------

 4.3.8 File data


      Immediately following the local header for a file
      SHOULD be placed the compressed or stored data for the file.
      If the file is encrypted, the encryption header for the file
      SHOULD be placed after the local header and before the file
      data. The series of [local file header][encryption header]
      [file data][data descriptor] repeats for each file in the
      .ZIP archive.

      Zero-byte files, directories, and other file types that
      contain no content MUST NOT include file data.

——————

I don’t think changing the default will cause any harm but I have been 
surprised by even the most trivial change before.

The jar tool  already addresses this in Main.java:addFile() on a quick glance.

So I guess I do not see a huge downside, however we might want to look at a 
property to adjust in the unlikely event the unexpected occurs?

Alan thoughts?




Zero-byte files, directories, and other file types that
contain no content MUST NOT include file data.

Nobody seems to have complained about this violation (I searched JBS), so in 
itself it might not be worth pursuing a fix. However, in addition to breaking 
the specification, this also causes suboptimal performance both on the write 
and read paths.

Before I go ahead and suggest any particular solution, I wanted to share some 
analysis:

ZipOutputStream writes directories as any other entry, so by default it ends up 
writing a LOC header with the DEFLATED method and the data descriptor bit set. 
This is followed by a final, empty DEFLATE block (0x0300) and then a signed 
data descriptor with the CRC, csize (2) and size (0), each 4 bytes:

------  Local File Header  ------
000000  signature          0x04034b50
000004  version            20
000006  flags              0x0808
000008  method             8              Deflated
000010  time               0x9812         19:00:36
000012  date               0x5666         2023-03-06
000014  crc                0x00000000
000018  csize              0
000022  size               0
000026  nlen               9
000028  elen               0
000030  name               9 bytes        'META-INF/'

------  File Data  ------
000039  data               2 bytes

------  Data Descriptor  ------
000041  signature          0x08074b50
000045  crc                0x00000000
000049  csize              2
000053  size               0

In total, this means that an extra 18 bytes is output, which is a waste of 
storage space and IO cycles while writing and reading/parsing the unnecessary 
headers and data. If the directory entry has the STORED method and the sizes 
and crc is set to 0, then the file data and data descriptors are skipped and we 
reclaim 18 bytes.

The use of DEFLATE has an adverse negative impact on performance, since native 
ZLIB code needs to be called to produce and read (in ZipInputStream) the empty 
DEFLATE block. Worse, Deflater.reset and Inflater.reset are called, which seems 
to incur a significant overhead.

In fact, when configuring the STORED method and explicitly setting csize, size 
and crc to 0, we see considerable benchmark improvements writing and reading 
directory entries:

DEFLATED:

Benchmark                        (size)  Mode  Cnt  Score   Error  Units
ZipEmptyStreams.readDirStream       512  avgt   15  0.278 ± 0.005  ms/op
ZipEmptyStreams.readDirStream      1024  avgt   15  0.550 ± 0.006  ms/op
ZipEmptyStreams.writeDirStreams     512  avgt   15  2.379 ± 0.043  ms/op
ZipEmptyStreams.writeDirStreams    1024  avgt   15  4.845 ± 0.087  ms/op

STORED:

Benchmark                       (size)  Mode  Cnt  Score   Error  Units
ZipEmptyStreams.readDirStream      512  avgt   15  0.082 ± 0.001  ms/op
ZipEmptyStreams.readDirStream     1024  avgt   15  0.176 ± 0.015  ms/op
ZipEmptyStreams.writeDirStream     512  avgt   15  0.233 ± 0.021  ms/op
ZipEmptyStreams.writeDirStream    1024  avgt   15  0.460 ± 0.028  ms/op


Possible actions:

- Do nothing: Nobody complained so far, so if is ain't broke, don't fix it.
- Update Javadocs of ZipInputStream.putNextEntry to recommend that users should 
use STORED  and configure sizes & crc for directory entries for correctness and 
performance
- Change the default compression method for directories to STORED and set sizes 
& crc to 0. This would add the risk of changing behavior of existing code.

For context, 7% of entries in the dependencies of Spring Petclinic are 
directories.

Any thoughts or input?

Thanks,
Eirik.

[cid:E1C4E2F0-ECD0-4C9D-ADB4-B16CA7BCB7FC@home]






Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
lance.ander...@oracle.com<mailto:lance.ander...@oracle.com>



Reply via email to