JDK-6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Philipp Kunz Thu, 19 Apr 2018 15:59:49 -0700

Hi,

I tried to fix bug 6202130 about manifest utf support and come up nowwith a test as suggested in the bug's comments that shows that utfcharset actually works before removing the comments from the code.

When I wanted to remove the XXX comments about utf it occurred to methat version attributes ("Signature-Version" and "Manifest-Version")would never be broken across lines and should anyway not support thewhole utf character set which sounds more like related to bugs 6910466or 4935610 but it's not a real fit. Therefore, I could not remove onesuch comment of Attributes#writeMain but I changed it. The first commentin bug 6202130 mentions only two comments but there are three inAttributes. In the attached patch I removed only two of three andchanged the remaining third to not mention utf anymore.

At the moment, at least until 6443578 is fixed, multi-byte utfcharacters can be broken across lines. It might be worth a considerationto test that explicitly as well but then I guess there is not much of apoint in testing the current behavior that will change with 6443578,hopefully soon. There are in my opinion enough characters broken acrosslines in the attached test that demonstrate that this still works likeit did before.

I would have preferred also to remove the calls to deprecatedString(byte[], int, int, int) but then figured it relates more to bug6443578 than 6202130 and now prefer to do that in another separate patch.

Bug 6202130 also states that lines are broken by String.length not bybyte length. While it looks so at first glance, I could not confirm. Thecombination of getBytes("UTF8"), String(byte[], int, int, int), and thenDataOutputStream.writeBytes(String) in that combination does not drophigh-bytes because every byte (whether a whole character or only a partof a multi-byte character) becomes a character in String(...) containingthat byte in its low-byte which will be read again from writeBytes(...).Or put in a different way, every utf encoded byte becomes a characterand multi-byte utf characters are converted into multiple stringcharacters containing one byte each in their lower bytes. The currentsolution is not nice, but at least works. With that respect I'd like tosuggest to deprecate DataOutputStream.writeBytes(String) because it doessomething not exactly expected when guessing from its name and thatwould suit a byte[] parameter better very much like it has been donewith String(byte[], int, int, int). Any advice about the procedure todeprecate something?

I was surprised that it was not trivial to list all valid utfcharacters. If someone has a better idea than isValidUtfCharacter in theattached test, let me know.

Altogether, I would not consider 6202130 resolved completely, unlessmaybe all remaining points are copied to 6443578 and maybe another bugabout valid values for "Signature-Version" and "Manifest-Version" if atall desired. But still I consider the attached patch an improvement andmost of the remainder can then be solved in 6443578 and so far I amlooking forward to any kind of feedback.


Regards,
Philipp

diff -r 3852547060c8 src/java.base/share/classes/java/util/jar/Attributes.java
--- a/src/java.base/share/classes/java/util/jar/Attributes.java	Thu Apr 12 19:12:54 2018 +0000
+++ b/src/java.base/share/classes/java/util/jar/Attributes.java	Fri Apr 20 00:42:44 2018 +0200
@@ -296,7 +296,6 @@
 
     /*
      * Writes the current attributes to the specified data output stream.
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
      */
      @SuppressWarnings("deprecation")
      void write(DataOutputStream os) throws IOException {
@@ -324,7 +323,8 @@
      * make sure to write out the MANIFEST_VERSION or SIGNATURE_VERSION
      * attributes first.
      *
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
+     * XXX Need to handle version values and break up lines longer than 72 bytes
+     * also in "Manifest-Version" and "Signature-Version" attributes
      */
     @SuppressWarnings("deprecation")
     void writeMain(DataOutputStream out) throws IOException
@@ -367,7 +367,6 @@
 
     /*
      * Reads attributes from the specified input stream.
-     * XXX Need to handle UTF8 values.
      */
     @SuppressWarnings("deprecation")
     void read(Manifest.FastInputStream is, byte[] lbuf) throws IOException {
diff -r 3852547060c8 test/jdk/java/util/jar/Manifest/ValueUtfEncoding.java
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test/jdk/java/util/jar/Manifest/ValueUtfEncoding.java	Fri Apr 20 00:42:44 2018 +0200
@@ -0,0 +1,226 @@
+/*
+ * Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This code is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 only, as
+ * published by the Free Software Foundation.
+ *
+ * This code is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * version 2 for more details (a copy is included in the LICENSE file that
+ * accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License version
+ * 2 along with this work; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
+ * or visit www.oracle.com if you need additional information or have any
+ * questions.
+ */
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.jar.Attributes;
+import java.util.jar.Attributes.Name;
+import java.util.jar.Manifest;
+import java.util.List;
+import java.util.ArrayList;
+
+import org.testng.annotations.Test;
+import static org.testng.Assert.*;
+
+/**
+ * @test
+ * @bug 6202130
+ * @run testng ValueUtfEncoding
+ * @summary manifest values utf encoding
+ *
+ * This test writes and reads a manifest that contains every valid utf
+ * character (three times), grouped into manifest header values with about
+ * 65535 bytes each or slightly more, resulting in a single huge manifest with
+ * 3 * 67 + 1 values and 13703968 bytes in the manifest's encoded form in
+ * total. This way, all possible 1111995 utf characters are covered in one
+ * manifest.
+ *
+ * Every character occurs three times, once in a main attribute value, once in
+ * a section name, and once in a named section attribute value, because
+ * implementation of writing the main section headers differs from the one
+ * writing named section headers in
+ * {@link Attributes#writeMain(DataOutputStream)} and
+ * {@link Attributes#write(DataOutputStream)} due to special order of
+ * {@link Name#MANIFEST_VERSION} and {@link Name#SIGNATURE_VERSION}.
+ * and also {@link Manifest#read(InputStream)} treating reading the main
+ * section differently from reading named sections names headers.
+ *
+ * Only header values are tested. Characters for header names are much more
+ * limited and very simple ones are used just to get valid and different ones.
+ */
+public class ValueUtfEncoding {
+
+    /**
+     * From the specifications:
+     * "Implementations should support 65535-byte (not character) header
+     * values, and 65535 headers per file. They might run out of memory,
+     * but there should not be hard-coded limits below these values."
+     *
+     * @see <a href="{@docRoot}/../specs/jar/jar.html#Notes_on_Manifest_and_Signature_Files">Notes on Manifest and Signature Files</a>
+     */
+    static final int MIN_VALUE_LENGTH_SUPPORTED = 2 << 16 - 1;
+
+    static final int MAX_UTF_CHARACTER_ENCODED_LENGTH = 4;
+
+    static boolean isValidUtfCharacter(int codePoint) {
+        if (0xD800 <= codePoint && codePoint <= 0xDFFF) {
+            return false; /* surrogates */
+        }
+        if (0xFDD0 <= codePoint && codePoint <= 0xFDEF) {
+            return false; /* non-characters */
+        }
+        if ((codePoint & 0xFFFE) == 0xFFFE) {
+            return false; /* more non-characters */
+        }
+        return true;
+    }
+
+    /**
+     * returns {@code true} if {@code codePoint} is explicitly forbidden in
+     * manifest values based on a statement from the specs:
+     * <pre>otherchar: any UTF-8 character except NUL, CR and LF<pre>
+     *
+     * @see <a href="{@docRoot}/../specs/jar/jar.html#Section-Specification">Jar File Specification</a>
+     */
+    static boolean isInvalidManifestValueCharacter(int codePoint) {
+        return codePoint == 0 /* NUL */
+            || codePoint == '\r' /* CR */
+            || codePoint == '\n' /* LF */;
+    };
+
+    /**
+     * produces a list of strings with all known utf characters except those
+     * invalid in manifest header values with at least
+     * {@link #MIN_VALUE_LENGTH_SUPPORTED} utf-8 encoded bytes each
+     * except the last string which contains just the remaining characters
+     */
+    static List<String> produceValidManifestUtfCharacterValues() {
+        int maxLengthBytes = MIN_VALUE_LENGTH_SUPPORTED +
+                // exceed the limit by at least one character
+                MAX_UTF_CHARACTER_ENCODED_LENGTH + 1;
+
+        int numberOfUsedCodepoints = 0;
+        ArrayList<String> values = new ArrayList<>();
+        byte[] valueBuf = new byte[maxLengthBytes];
+        int pos = 0;
+        for (int codePoint = Character.MIN_CODE_POINT;
+                codePoint <= Character.MAX_CODE_POINT; codePoint++) {
+            if (!isValidUtfCharacter(codePoint)) {
+                continue;
+            }
+            if (isInvalidManifestValueCharacter(codePoint)) {
+                continue;
+            }
+            numberOfUsedCodepoints++;
+
+            byte[] charBuf =
+                    new String(Character.toChars(codePoint)).getBytes(UTF_8);
+            if (pos + charBuf.length > valueBuf.length) {
+                values.add(new String(valueBuf, 0, pos, UTF_8));
+                pos = 0;
+            }
+            System.arraycopy(charBuf, 0, valueBuf, pos, charBuf.length);
+            pos += charBuf.length;
+        }
+        if (pos > 0) {
+            values.add(new String(valueBuf, 0, pos, UTF_8));
+        }
+
+        if (numberOfUsedCodepoints !=
+                (17 << 16) /* utf space */
+                - 2048 /* surrogates */
+                - 66 /* non-characters */
+                - 3 /* nul, cr, lf */) {
+            fail("self-test: utf character set not covered exactly");
+        }
+
+        return values;
+    }
+
+    /**
+     * returns simple, valid, short, and distinct manifest header names.
+     * The returned name cannot be "{@code Manifest-Version}" because it does
+     * not return "{@code -}".
+     *
+     * @param seed seed to produce distinct names
+     */
+    static String azName(int seed) {
+        StringBuffer name = new StringBuffer();
+        do {
+            name.insert(0, (char) (seed % 26 + (seed < 26 ? 'A' : 'a')));
+            seed = seed / 26 - 1;
+        } while (seed >= 0);
+        return name.toString();
+    }
+
+    /**
+     * covers writing and reading of manifests with all known utf characters.
+     *
+     * Because the OpenJDK implementation uses different portions
+     * of code depending on where the value occurs to read or write, each
+     * character is tested in each of the three positions:<ul>
+     * <li>main attribute header,</li>
+     * <li>named section name, which is in fact a header value after a blank
+     * line, and</li>
+     * <li>named sections header values</li>
+     * <ul>
+     */
+    @Test
+    public void testValueUtfEncoding() throws IOException {
+        Manifest mf = new Manifest();
+        mf.getMainAttributes().put(Name.MANIFEST_VERSION, "1.0");
+
+        List<String> values = produceValidManifestUtfCharacterValues();
+        for (int i = 0; i < values.size(); i++) {
+            String name = azName(i);
+            String value = values.get(i);
+
+            mf.getMainAttributes().put(new Name(name), value);
+            Attributes attributes = new Attributes();
+            mf.getEntries().put(value, attributes);
+            attributes.put(new Name(name), value);
+        }
+
+        mf = writeAndRead(mf);
+
+        for (int i = 0; i < values.size(); i++) {
+            String value = values.get(i);
+            String name = azName(i);
+
+            assertEquals(mf.getMainAttributes().getValue(name), value,
+                    "main attributes header value");
+            Attributes attributes = mf.getAttributes(value);
+            assertNotNull(attributes, "named section not found");
+            assertEquals(attributes.getValue(name), value,
+                    "named section attributes value");
+        }
+    }
+
+    static Manifest writeAndRead(Manifest mf) throws IOException {
+        ByteArrayOutputStream out = new ByteArrayOutputStream();
+        mf.write(out);
+        byte[] mfBytes = out.toByteArray();
+        System.out.println("-------------------------------------------"
+                + "-----------------------------");
+        System.out.print(new String(mfBytes, UTF_8));
+        System.out.println("-------------------------------------------"
+                + "-----------------------------");
+
+        ByteArrayInputStream in = new ByteArrayInputStream(mfBytes);
+        return new Manifest(in);
+    }
+
+}

JDK-6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Reply via email to