Hi, *Summary*
Relieve ZipFile.Source.initCEN from non-parsing related responsibilities, such that it can become a pure function for parsing and validating CEN directories. Make it return Source, such that instance fields of this class may become final, immutable and accessible without synchronization. *Motivation* The method ZipFile.Source.initCEN has been the cause of a good amount of head-scratching over the years. It was brought over from the native implementation when this was transitioned to Java in Java 9. Risk assessments at that time probably constrained the code to closely align with the native code, which together with interpreter-friendliness in early startup have affected the current shape of this code. It is hard to summarize the responsibilities of this method, but I'll try: o Call findEND() to read metadata from the (Zip64) END headers in the ZIP file, validate fields and reject invalid values o Determine the position of the first LOC header o Read the CEN byte array from the ZIP file o Construct hash table data structures o Assign these to instance fields in Source o Parse and validate the CEN structure and populate hash table structures o Detect manifest file(s), signature related files and multi-release JAR versions o Detect when a ZIP has more CEN entries than declared by the END header and restart parsing by invoking itself recursively after counting the actual number of CEN entries o Detect when it is being invoked recursively to avoid reading the END and CEN headers twice It's been obvious for a while that this method performs an excessive number of unrelated tasks, knows too much about the state of the object being constructed and handles a disproportionate amount of funky control flow. It calls out to a number of helper methods to perform parsing and validation, some of them static, some relying on instance state. The amount of synchronization in ZipFile today is probably excessive for the current implementation. It made more sense back when ZipFile was backed by native code and resources. In particular, lookup misses pay a high relative cost to synchronization which is probably not needed. Making Source fields final moves us in a direction which may allow us to eventually remove synchronization for operations that only touch a parsed, immutable CEN. *Ideas and plan for improvements* I think we can improve the current situation by: o Moving END header parsing and validation out of initCEN o Moving CEN byte array reading out of initCEN o Making all helper methods static (it's a mix today) At this point, initCEN will almost, but not quite be a pure function for parsing the CEN byte array. So the next natural step is to actually make it a pure function by: o Making it static o Having it return a Source instance instead of assigning to the instance being constructed o Making Source (mostly) immutable by marking parsed instance fields as 'final' Rather than a big bang change, I see this effort as a step-by-step refactoring, probably over several PRs, each focusing on a smaller aspect of refactoring, hopefully making them easier to review. This is not an all-or-nothing effort, most of these changes will be incremental improvements on their own merit. If we don't get all the way, that's also okay. *Feedback?* Input to this idea/plan is welcome. If someone sees reasons this may not be a good idea, it's better to raise such concern here and now. Eirik.
