Hello Eirik,
I strongly agree with your proposal. I see such a change has low risk given 
ZipCoder is an internal class.

Regards,
Chen

________________________________
From: core-libs-dev <[email protected]> on behalf of Eirik 
Bjørsnøs <[email protected]>
Sent: Wednesday, January 28, 2026 3:26 AM
To: core-libs-dev <[email protected]>
Subject: RFD: Reorganize ZipCoder such that UTF8 is handled by the base class

Hi,

Bringing this up on core-libs-dev such that the motivation can be 
explained/discussed here and any future PR can focus on actual code changes.

Summary:

Reorganize the ZipCoder class hierarchy to let the base class handle UTF-8 and
the subclass handle arbitrary Charsets. This makes the design better match the
ZIP specification and how ZIP files are used in the real world, and it also
brings some benefits in code quality and performance.

Motivation:

The ZipCoder class has been central to many ZipFile performance improvements in 
recent years. Many optimizations are encoding-specific and encapsulating these 
concerns makes a lot of sense.

Currently, the base ZipCoder instance supports any given Charset. Then, a 
subclass UTF8ZipCoder provides higher performance optimizations specific to 
UTF-8.

However, real-world use of the ZipFile API defaults to UTF-8. The ZIP 
specification long ago introduced a flag to explicitly indicate that entries
are encoded using UTF-8. The JAR specification has mandated UTF-8 since the 
beginning. Any use of non-UTF-8 ZIP files is increasingly niche and belongs in 
the legacy zone.

The current UTF8ZipCoder is stateless and documented as thread safe, while the
base class ZipCoder is not. As a subclass of ZipCoder, UTF8ZipCoder does however
inherit CharsetEncoder and CharsetDecoder state fields from its superclass, and
it needs to pass a UTF-8 Charset to its parent without really using it. This
makes state and thread safety harder to reason about.

Since UTF8ZipCoder is always needed, the JVM must always load it along with the 
base class ZipCoder. Apart from loading an extra class, this prevents the JVM 
from seeing calls to ZipCoder methods as monomorphic.

A draft implementation of this change indicates a ~3% performance win on
ZipFile lookups in ZipFileGetEntry, probably explained by the compiler seeing
only one ZipCoder implementation being loaded.

Solution:

Switch the class hierarchy of ZipCoder around such that the base class handles 
UTF-8. Introduce a new subclass CharsetZipCoder to handle legacy non-UTF-8 ZIP
files. Move the Charset, CharsetEncoder, CharsetDecoder fields to this 
subclass. Update code comments to reflect the changes.
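
To make the proposed shape concrete, here is a rough sketch of the inverted
hierarchy. It is not the draft implementation: the class names
SketchZipCoder and SketchCharsetZipCoder, the method set and the factory
method get() are assumptions chosen for illustration, and the UTF-8 path uses
plain String conversions rather than the optimized internal code paths.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the reorganized base class: UTF-8 only, no fields,
// so a single shared instance is safe to use from any thread.
class SketchZipCoder {
    static final SketchZipCoder UTF8 = new SketchZipCoder();

    // Hypothetical factory: the common UTF-8 case never touches the subclass.
    static SketchZipCoder get(Charset cs) {
        return StandardCharsets.UTF_8.equals(cs) ? UTF8 : new SketchCharsetZipCoder(cs);
    }

    boolean isUTF8() { return true; }

    // Simplification: this replaces malformed input instead of rejecting it.
    String toString(byte[] ba, int off, int len) {
        return new String(ba, off, len, StandardCharsets.UTF_8);
    }

    byte[] getBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical subclass for legacy charsets. The Charset, CharsetEncoder and
// CharsetDecoder fields live here only, so this is the only stateful class
// and instances must not be shared between threads without synchronization.
class SketchCharsetZipCoder extends SketchZipCoder {
    private final Charset cs;
    private CharsetDecoder dec;   // created lazily, reused across calls
    private CharsetEncoder enc;   // created lazily, reused across calls

    SketchCharsetZipCoder(Charset cs) { this.cs = cs; }

    @Override boolean isUTF8() { return false; }

    @Override String toString(byte[] ba, int off, int len) {
        try {
            // decode(ByteBuffer) resets the decoder before each operation.
            return decoder().decode(ByteBuffer.wrap(ba, off, len)).toString();
        } catch (CharacterCodingException x) {
            throw new IllegalArgumentException(x);
        }
    }

    @Override byte[] getBytes(String s) {
        try {
            ByteBuffer bb = encoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException x) {
            throw new IllegalArgumentException(x);
        }
    }

    private CharsetDecoder decoder() {
        if (dec == null) {
            dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return dec;
    }

    private CharsetEncoder encoder() {
        if (enc == null) {
            enc = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return enc;
    }
}
```

In this sketch a caller that only ever opens UTF-8 ZIP files gets the shared
base instance from get(), so the subclass is never instantiated on that path.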

Risks:

This should be a pure refactoring, mostly moving code around. Most changes can
be performed in place, such that a side-by-side review will mostly show
indentation changes. We have good test coverage for both UTF-8 and non-UTF-8
ZIP files to help us catch regressions.

If I see support for this proposal, I'll be happy to submit a PR with the 
actual changes.

Cheers,
Eirik :-)








