Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-10 Thread Eirik Bjørsnøs
> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
> number of compressed or uncompressed bytes read from the inflater is larger 
> than the Zip64 magic value.
> 
> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
> also states that `ZIP64 format MAY be used regardless of the size of a file`. 
> For such small entries, the above assumption does not hold.
> 
> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
> ZipEntry includes a Zip64 extra information field. This brings ZipInputStream 
> into alignment with the APPNOTE format spec:
> 
> 
> When extracting, if the zip64 extended information extra 
> field is present for the file the compressed and 
> uncompressed sizes will be 8 byte values.
> 
> 
> While small Zip64 files with 8-byte data descriptors are not commonly found 
> in the wild, it is possible to create one using the Info-ZIP command line 
> `-fd` flag:
> 
> `echo hello | zip -fd > hello.zip`
> 
> The PR also adds a test verifying that such a small Zip64 file can be parsed 
> by ZipInputStream.

Eirik Bjørsnøs has updated the pull request incrementally with two additional 
commits since the last revision:

 - Remove trailing whitespace
 - Remove trailing whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12524/files
  - new: https://git.openjdk.org/jdk/pull/12524/files/8a44feb5..91fbcce5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12524&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12524&range=10-11

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/12524.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12524/head:pull/12524

PR: https://git.openjdk.org/jdk/pull/12524


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-15 Thread Alan Bateman
On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs  wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
>> number of compressed or uncompressed bytes read from the inflater is larger 
>> than the Zip64 magic value.
>> 
>> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
>> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
>> also states that `ZIP64 format MAY be used regardless of the size of a 
>> file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
>> ZipEntry includes a Zip64 extra information field AND the 'compressed size' 
>> and 'uncompressed size' have the expected Zip64 "magic" value 0x. 
>> This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found 
>> in the wild, it is possible to create one using the Info-ZIP command line 
>> `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed 
>> by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

I was initially concerned about this change but I think it's okay. As you say, 
it expands set the set of ZIP file that can be read by ZipInputStream. I've 
added myself as reviewer to the CSR.

-

PR Comment: https://git.openjdk.org/jdk/pull/12524#issuecomment-1892510671


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-15 Thread Andrey Turbanov
On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs  wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
>> number of compressed or uncompressed bytes read from the inflater is larger 
>> than the Zip64 magic value.
>> 
>> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
>> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
>> also states that `ZIP64 format MAY be used regardless of the size of a 
>> file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
>> ZipEntry includes a Zip64 extra information field AND the 'compressed size' 
>> and 'uncompressed size' have the expected Zip64 "magic" value 0x. 
>> This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found 
>> in the wild, it is possible to create one using the Info-ZIP command line 
>> `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed 
>> by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

src/java.base/share/classes/java/util/zip/ZipInputStream.java line 657:

> 655:  * @param size  the value of the 'uncompressed size' field in the LOC
> 656:  */
> 657: private boolean expect64BitDataDescriptor(byte[] extra, int flag, 
> long csize, long size)  {

Suggestion:

private boolean expect64BitDataDescriptor(byte[] extra, int flag, long 
csize, long size) {

test/jdk/java/util/zip/ZipInputStream/Zip64DataDescriptor.java line 113:

> 111: // This ZIP has the regular 4-bit data descriptor
> 112: 
> 113: byte[] extra = new byte[Long.BYTES  + Long.BYTES + Short.BYTES * 
> 2]; // Size of a regular Zip64 extra field

Suggestion:

byte[] extra = new byte[Long.BYTES + Long.BYTES + Short.BYTES * 2]; // 
Size of a regular Zip64 extra field

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1452996737
PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1452996515


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Tue, 16 Jan 2024 13:40:18 GMT, Jaikiran Pai  wrote:

>> Eirik Bjørsnøs has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Remove trailing whitespace
>>  - Remove trailing whitespace
>
> src/java.base/share/classes/java/util/zip/ZipInputStream.java line 534:
> 
>> 532: 
>> 533: long csize = get32(tmpbuf, LOCSIZ);
>> 534: long size = get32(tmpbuf, LOCLEN);
> 
> Hello Eirik, I suspect this part of the change has an issue. Before reading 
> the `tmpbuf` for compressed and uncompressed sizes, there will be 32 bits of 
> CRC, which should be read first. This now skips those 32 CRC bits and reads 
> them (in the else block) after reading these sizes and that can cause 
> incorrect LOC data.

The Github actions job which runs tier1 is all successful with this proposed 
change. So I'm a bit surprised that the tests didn't catch any issues, which 
makes me wonder if we have enough test coverage that covers this change.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453445673


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs  wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
>> number of compressed or uncompressed bytes read from the inflater is larger 
>> than the Zip64 magic value.
>> 
>> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
>> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
>> also states that `ZIP64 format MAY be used regardless of the size of a 
>> file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
>> ZipEntry includes a Zip64 extra information field AND the 'compressed size' 
>> and 'uncompressed size' have the expected Zip64 "magic" value 0x. 
>> This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found 
>> in the wild, it is possible to create one using the Info-ZIP command line 
>> `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed 
>> by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

src/java.base/share/classes/java/util/zip/ZipInputStream.java line 534:

> 532: 
> 533: long csize = get32(tmpbuf, LOCSIZ);
> 534: long size = get32(tmpbuf, LOCLEN);

Hello Eirik, I suspect this part of the change has an issue. Before reading the 
`tmpbuf` for compressed and uncompressed sizes, there will be 32 bits of CRC, 
which should be read first. This now skips those 32 CRC bits and reads them (in 
the else block) after reading these sizes and that can cause incorrect LOC data.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453442814


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs  wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
>> number of compressed or uncompressed bytes read from the inflater is larger 
>> than the Zip64 magic value.
>> 
>> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
>> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
>> also states that `ZIP64 format MAY be used regardless of the size of a 
>> file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
>> ZipEntry includes a Zip64 extra information field AND the 'compressed size' 
>> and 'uncompressed size' have the expected Zip64 "magic" value 0x. 
>> This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found 
>> in the wild, it is possible to create one using the Info-ZIP command line 
>> `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed 
>> by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

src/java.base/share/classes/java/util/zip/ZipInputStream.java line 664:

> 662: 
> 663: // The LOC's 'compressed size' and 'uncompressed size' must both 
> be marked for Zip64
> 664: if (csize != ZIP64_MAGICVAL || size != ZIP64_MAGICVAL) {

The spec for this says different. It says:

>
> 4.4.4 general purpose bit flag:
> ...
>Bit 3: If this bit is set, the fields crc-32, compressed size and 
> uncompressed size are set to zero in the local header.  The correct values 
> are put in the data descriptor immediately following the compressed data.  

So it expects the value zero for the compressed/uncompressed sizes in the LOC 
when the data descriptor bit is set.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453460177


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs  wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the 
>> number of compressed or uncompressed bytes read from the inflater is larger 
>> than the Zip64 magic value.
>> 
>> While the ZIP format  mandates that the data descriptor `SHOULD be stored in 
>> ZIP64 format (as 8 byte values) when a file's size exceeds 0x`, it 
>> also states that `ZIP64 format MAY be used regardless of the size of a 
>> file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the 
>> ZipEntry includes a Zip64 extra information field AND the 'compressed size' 
>> and 'uncompressed size' have the expected Zip64 "magic" value 0x. 
>> This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found 
>> in the wild, it is possible to create one using the Info-ZIP command line 
>> `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed 
>> by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

src/java.base/share/classes/java/util/zip/ZipInputStream.java line 706:

> 704:  * @return true if the extra field is a Zip64 extra field compatible 
> with data descriptors
> 705:  */
> 706: private static boolean isZip64DataDescriptorField(int headerId, 
> byte[] extra, int blockStart, int blockSize) {

I understand the goals of this method - what it's trying to do is, assure the 
caller that the extra field/block actually is a zip64 extra block. That 
assurance is then used to access the data descriptor content as 8 byte fields. 
However, I think in this proposed implementation of this method we are perhaps 
doing a bit too much. Specifically, I don't think we should check what values 
have been stamped for "Original size" and "Compressed size" fields of this 
zip64 block. I think, those values (presence or absence) shouldn't play a role 
in deciding whether we have to read a data descriptor size fields as 8 bytes. 
Doing these checks for these zip64 original/compressed size fields, I think 
will open up more permutations about which zip entries get processed as 8 byte 
data descriptors.

Given the context in which this method is used, I think the only checks that we 
should do in this method is to verify that the header id is `ZIP64_EXTID`. 
Perhaps then this `isZip64DataDescriptorField(...)` won't be needed and we can 
just inline that `headerid == ZIP64_EXTID` check inline in the implementation 
of `expect64BitDataDescriptor(...)` method

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453508949


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Eirik Bjørsnøs
On Tue, 16 Jan 2024 13:42:30 GMT, Jaikiran Pai  wrote:

>> src/java.base/share/classes/java/util/zip/ZipInputStream.java line 534:
>> 
>>> 532: 
>>> 533: long csize = get32(tmpbuf, LOCSIZ);
>>> 534: long size = get32(tmpbuf, LOCLEN);
>> 
>> Hello Eirik, I suspect this part of the change has an issue. Before reading 
>> the `tmpbuf` for compressed and uncompressed sizes, there will be 32 bits of 
>> CRC, which should be read first. This now skips those 32 CRC bits and reads 
>> them (in the else block) after reading these sizes and that can cause 
>> incorrect LOC data.
>
> The Github actions job which runs tier1 is all successful with this proposed 
> change. So I'm a bit surprised that the tests didn't catch any issues, which 
> makes me wonder if we have enough test coverage that covers this change.

> Hello Eirik, I suspect this part of the change has an issue. Before reading 
> the `tmpbuf` for compressed and uncompressed sizes, there will be 32 bits of 
> CRC, which should be read first. This now skips those 32 CRC bits and reads 
> them (in the else block) after reading these sizes and that can cause 
> incorrect LOC data.

How does the order of the reads from the byte array matter? Outside cache 
efficiency I would presume they could be read in any order?

(These reads are from a temp buffer, not the stream)

Could you elaborate? I'm not sure I'm following :-)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453516319


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Tue, 16 Jan 2024 14:34:29 GMT, Eirik Bjørsnøs  wrote:

>(These reads are from a temp buffer, not the stream)

Ah! you are right indeed. I didn't correctly read that part of that code. It's 
reading from a temp buffer which has been fully initialized with a LOC and 
these reads happens with specific offsets within that buffer. So yes, you are 
correct that the order won't matter here. Thank you for that detail.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453524828


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Eirik Bjørsnøs
On Tue, 16 Jan 2024 13:54:06 GMT, Jaikiran Pai  wrote:

>> Eirik Bjørsnøs has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Remove trailing whitespace
>>  - Remove trailing whitespace
>
> src/java.base/share/classes/java/util/zip/ZipInputStream.java line 664:
> 
>> 662: 
>> 663: // The LOC's 'compressed size' and 'uncompressed size' must 
>> both be marked for Zip64
>> 664: if (csize != ZIP64_MAGICVAL || size != ZIP64_MAGICVAL) {
> 
> The spec for this says different. It says:
> 
>>
>> 4.4.4 general purpose bit flag:
>> ...
>>Bit 3: If this bit is set, the fields crc-32, compressed size and 
>> uncompressed size are set to zero in the local header.  The correct values 
>> are put in the data descriptor immediately following the compressed data.  
> 
> So it expects the value zero for the compressed/uncompressed sizes in the LOC 
> when the data descriptor bit is set.

The spec isn't terribly helpful in spelling out what should happen in the case 
where an entry combines the uses of  data descriptors (mandating that CRC, size 
and compressed size must be zero) with Zip64 (mandating that size and 
compressed size must be 0x)

My interpretation (based in the InfoZIP implementation) is that in such cases, 
CRC should be zero, while size and compressed size should be 0x, with 
their counterparts in the Zip64 extra field should set to zero.

Does this make sense?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453526069


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-16 Thread Jaikiran Pai
On Tue, 16 Jan 2024 14:41:06 GMT, Eirik Bjørsnøs  wrote:

> The spec isn't terribly helpful in spelling out what should happen in the 
> case where an entry combines the uses of data descriptors (mandating that 
> CRC, size and compressed size must be zero) with Zip64 (mandating that size 
> and compressed size must be 0x)

I see what you mean. However, given that the data descriptor general bit field 
is set to indicate that the data descriptor should be honoured, plus the spec 
stating that the original/compressed sizes in the zip64 extra block aren't 
mandatory, I think not relying on the zip64 extra block for parsing decisions 
about a data descriptor content might be better. I think the only role that a 
zip64 block should play when we are deciding how to parse a data descriptor is 
whether or not the entry has zip64 extra field set (the header id value of the 
extra field). Does that sound reasonable?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453546995


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-22 Thread Eirik Bjørsnøs
On Tue, 16 Jan 2024 14:55:39 GMT, Jaikiran Pai  wrote:

> I think the only role that a zip64 block should play when we are deciding how 
> to parse a data descriptor is whether or not the entry has zip64 extra field 
> set (the header id value of the extra field). Does that sound reasonable?

Are you suggesting that we ONLY look for the presence of a Zip64 (tag 0x1) 
extended field? Meaning we should ignore the following:

- The values of the 'compressed size' and 'uncompressed size' fields in the LOC 
header
- The contents/fields in the data block in the Zip64 header

The purpose of this check is to determine wheter the entry "is in the ZIP64 
format", to use parlance from `APPNOTE.TXT`. The reason stronger validation was 
added was to avoid false positives, meaning an entry would be interpreted as 
"in the Zip64 format", even when that was not the intention of the producer. 
Some of this validation was on my initiative, but I think maybe some was in 
response to concerns raised in review.

If we ignore the 'uncompressed size' and 'compresssed size' fields, then if a 
LOC has both of these set to zero and has a Zip64 extended field, then we'll 
intepret this as "in the Zip64 format", even though no field is marked using 
the Zip64 magic value `0x`.

I think I would be fine with dropping inspection of the Zip64 contents (as 
there are probably efforts underway to add validation of LOC Zip64 fields 
similar to the recently added CEN validation).

But I'm a bit worried that ignoring the LOC values will make us parse some data 
descriptors using 8 bytes, when the producer did in fact not intend to use the 
Zip64 format. (The Zip64 field just came through some kind of "pollution" or 
unintended copying.

A natural way to implement Zip64 parsing is to only look at the extended field 
if at least one of the relevant LOC headers are `0x`, otherwise one can 
short-circuit and assume there is no Zip64 field.

@LanceAndersen Do you have a cent or two to spare?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1461951695


Re: RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

2024-01-22 Thread Lance Andersen
On Mon, 22 Jan 2024 14:34:25 GMT, Eirik Bjørsnøs  wrote:

> @LanceAndersen Do you have a cent or two to spare?

Let me try and dig out from a couple of things and circle back to this again

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1462519461