Hi,

*Summary*

Relieve ZipFile.Source.initCEN from non-parsing related responsibilities,
such that it can become a pure function for parsing and validating CEN
directories. Make it return Source, such that instance fields of this class
may become final, immutable and accessible without synchronization.

*Motivation*

The method ZipFile.Source.initCEN has been the cause of a good amount of
head-scratching over the years. It was brought over from the native
implementation when this was transitioned to Java in Java 9.

Risk assessments at that time probably constrained the code to closely
align with the native code, which together with interpreter-friendliness in
early startup have affected the current shape of this code.

It is hard to summarize the responsibilities of this method, but I'll try:

o Call findEND() to read metadata from the (Zip64) END headers in the ZIP
file, validate fields and reject invalid values
o Determine the position of the first LOC header
o Read the CEN byte array from the ZIP file
o Construct hash table data structures
o Assign these to instance fields in Source
o Parse and validate the CEN structure and populate hash table structures
o Detect manifest file(s), signature related files and multi-release JAR
versions
o Detect when a ZIP has more CEN entries than declared by the END header
and restart parsing by invoking itself recursively after counting the
actual number of CEN entries
o Detect when it is being invoked recursively to avoid reading the END and
CEN headers twice

It's been obvious for a while that this method performs an excessive number
of unrelated tasks, knows too much about the state of the object being
constructed and handles a disproportionate amount of funky control flow. It
calls out to a number of helper methods to perform parsing and validation,
some of them static, some relying on instance state.

The amount of synchronization in ZipFile today is probably excessive for
the current implementation. It made more sense back when ZipFile was backed
by native code and resources. In particular, lookup misses pay a high
relative cost to synchronization which is probably not needed.

Making Source fields final moves us in a direction which may allow us to
eventually remove synchronization for operations that only touch a parsed,
immutable CEN.

*Ideas and plan for improvements*

I think we can improve the current situation by:

o Moving END header parsing and validation out of initCEN
o Moving CEN byte array reading out of initCEN
o Making all helper methods static (it's a mix today)

At this point, initCEN will almost, but not quite be a pure function for
parsing the CEN byte array. So the next natural step is to actually make it
a pure function by:

o Making it static
o Having it return a Source instance instead of assigning to the instance
being constructed
o Making Source (mostly) immutable by marking parsed instance fields as
'final'

Rather than a big bang change, I see this effort as a step-by-step
refactoring, probably over several PRs, each focusing on a smaller aspect
of refactoring, hopefully making them easier to review. This is not an
all-or-nothing effort, most of these changes will be incremental
improvements on their own merit. If we don't get all the way, that's also
okay.

*Feedback?*

Input to this idea/plan is welcome. If someone sees reasons this may not be
a good idea, it's better to raise such concern here and now.

Eirik.

Reply via email to