Hi,

I noticed that ZipOutputStream violates APPNOTE.txt when writing directory
entries:

Zero-byte files, directories, and other file types that
> contain no content MUST NOT include file data.


Nobody seems to have complained about this violation (I searched JBS), so
in itself it might not be worth pursuing a fix. However, in addition to
breaking the specification, this also causes suboptimal performance both on
the write and read paths.

Before I go ahead and suggest any particular solution, I wanted to share
some analysis:

ZipOutputStream writes directories as any other entry, so by default it
ends up writing a LOC header with the DEFLATED method and the data
descriptor bit set. This is followed by a final, empty DEFLATE block
(0x0300) and then a signed data descriptor with the CRC, csize (2) and size
(0), each 4 bytes:

------  Local File Header  ------
000000  signature          0x04034b50
000004  version            20
000006  flags              0x0808
000008  method             8              Deflated
000010  time               0x9812         19:00:36
000012  date               0x5666         2023-03-06
000014  crc                0x00000000
000018  csize              0
000022  size               0
000026  nlen               9
000028  elen               0
000030  name               9 bytes        'META-INF/'

------  File Data  ------
000039  data               2 bytes

------  Data Descriptor  ------
000041  signature          0x08074b50
000045  crc                0x00000000
000049  csize              2
000053  size               0

In total, this means that an extra 18 bytes is output, which is a waste of
storage space and IO cycles while writing and reading/parsing the
unnecessary headers and data. If the directory entry has the STORED method
and the sizes and crc is set to 0, then the file data and data descriptors
are skipped and we reclaim 18 bytes.

The use of DEFLATE has an adverse negative impact on performance, since
native ZLIB code needs to be called to produce and read (in ZipInputStream)
the empty DEFLATE block. Worse, Deflater.reset and Inflater.reset are
called, which seems to incur a significant overhead.

In fact, when configuring the STORED method and explicitly setting csize,
size and crc to 0, we see considerable benchmark improvements writing and
reading directory entries:

DEFLATED:

Benchmark                        (size)  Mode  Cnt  Score   Error  Units
ZipEmptyStreams.readDirStream       512  avgt   15  0.278 ± 0.005  ms/op
ZipEmptyStreams.readDirStream      1024  avgt   15  0.550 ± 0.006  ms/op
ZipEmptyStreams.writeDirStreams     512  avgt   15  2.379 ± 0.043  ms/op
ZipEmptyStreams.writeDirStreams    1024  avgt   15  4.845 ± 0.087  ms/op

STORED:

Benchmark                       (size)  Mode  Cnt  Score   Error  Units
ZipEmptyStreams.readDirStream      512  avgt   15  0.082 ± 0.001  ms/op
ZipEmptyStreams.readDirStream     1024  avgt   15  0.176 ± 0.015  ms/op
ZipEmptyStreams.writeDirStream     512  avgt   15  0.233 ± 0.021  ms/op
ZipEmptyStreams.writeDirStream    1024  avgt   15  0.460 ± 0.028  ms/op


Possible actions:

- Do nothing: Nobody complained so far, so if is ain't broke, don't fix it.
- Update Javadocs of ZipInputStream.putNextEntry to recommend that users
should use STORED  and configure sizes & crc for directory entries for
correctness and performance
- Change the default compression method for directories to STORED and set
sizes & crc to 0. This would add the risk of changing behavior of existing
code.

For context, 7% of entries in the dependencies of Spring Petclinic are
directories.

Any thoughts or input?

Thanks,
Eirik.

Reply via email to