Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Lance Andersen Thu, 06 Feb 2020 12:26:24 -0800

Hi Philipp,

Thank you for the reminder.  Your email came in during the holidays, so it
got somewhat lost in my backlog.


First, thank you for the proposed patch.  This will take some time to review,
but it is on the list.


Best
Lance

> On Feb 6, 2020, at 1:37 PM, Philipp Kunz <philipp.k...@paratix.ch> wrote:
> 
> Hi,
> 
> I recently submitted two patches related to jar manifests and UTF-8 and
> haven't got any reaction so far. I understand and appreciate that
> not everyone has time for every wish, and my enquiry is certainly not
> urgent, but still, may I gently ask whether I may continue to hope for
> progress, whether I have missed anything, or whether this is of no
> interest at all?
> 
> Unfortunately the line breaks in the previous mail went bad which is
> why I paste the text again below and hope it looks nicer this time.
> 
> Regards,
> Philipp
> 
> 
> 
> [1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-January/064190.html
> [2] https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-December/064149.html
> 
> 
> On Thu, 2019-12-26 at 17:50 +0100, Philipp Kunz wrote:
> Hi,
> 
> The specification says that a line break in a manifest can occur before or
> after a Unicode character encoded in UTF-8:
> 
>   ...
>   value:         SPACE *otherchar newline *continuation
>   continuation:  SPACE *otherchar newline
>   ...
>   otherchar:     any UTF-8 character except NUL, CR and LF
> 
> The current implementation breaks manifest lines at 72 bytes regardless
> of whether the break falls inside a sequence of bytes encoding a single
> character. Code points may use up to four bytes when encoded in UTF-8.
> Manifests with line breaks inside multi-byte UTF-8 sequences are not
> only invalid UTF-8 but also look ugly in text editors. For example, a
> manifest could look like this:
> 
> Manifest-Version: 1.0
> Some-Key: Some languages have decorated characters, for example: espa?
>  ?ol
> 
> The code below produces the result shown above, with unexpected
> question marks where the encoding is invalid:
> 
> import java.util.jar.Manifest;
> import java.util.jar.Attributes;
> import static java.util.jar.Attributes.Name;
> 
> public class CharacterBrokenDemo1 {
>     public static void main(String[] args) throws Exception {
>         Manifest mf = new Manifest();
>         Attributes attrs = mf.getMainAttributes();
>         // Manifest::write skips the main attributes without a version.
>         attrs.put(Name.MANIFEST_VERSION, "1.0");
>         attrs.put(new Name("Some-Key"),
>                   "Some languages have decorated characters, " +
>                   "for example: español"); // or "espa\u00F1ol"
>         mf.write(System.out);
>     }
> }
> 
> This is of course an example written with actual question marks to get
> a valid text for this message. The trick here is that the text from
> "Some-Key" to "example: espa" takes exactly one byte less in UTF-8 than
> fits on one line with the 72 byte limit, so that the subsequent
> character, which is encoded with two bytes, gets broken across a
> continuation line break between its two bytes.
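
The effect of a break between the two bytes of one character can be sketched in isolation. The following is a minimal illustration, not code from the patch: the class name SplitInsideCharacter, the helper isValidUtf8 and the break position 5 are assumptions chosen for this example, and the string is cut directly rather than through Manifest::write.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitInsideCharacter {

    /** Returns true if the bytes decode as UTF-8 without any malformed input. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The n with tilde (U+00F1) is encoded as the two bytes 0xC3 0xB1.
        byte[] bytes = "espa\u00F1ol".getBytes(StandardCharsets.UTF_8);
        int split = 5; // hypothetical break position between 0xC3 and 0xB1

        byte[] head = Arrays.copyOfRange(bytes, 0, split);
        byte[] tail = Arrays.copyOfRange(bytes, split, bytes.length);

        System.out.println("whole valid: " + isValidUtf8(bytes));
        System.out.println("head valid:  " + isValidUtf8(head));
        System.out.println("tail valid:  " + isValidUtf8(tail));
    }
}
```

Only the uncut byte sequence is valid UTF-8; each half on its own ends or starts in the middle of a character.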
> 
> When the resulting bytes are decoded from UTF-8 as one whole string,
> the two question marks will not fit together again even if the line
> break with the continuation space is removed afterwards. However,
> Manifest::read removes the continuation line breaks ("\r\n ") before
> decoding the manifest header value from UTF-8 and hence can reproduce
> the original value.
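
The difference between the two orders of operation can be shown with a small sketch. It is not the code of Manifest::read, only an illustration under stated assumptions: the class UnfoldBeforeDecode and its methods unfold and brokenValue are hypothetical names, and the continuation break is placed by hand between the two bytes of U+00F1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UnfoldBeforeDecode {

    /** Removes every continuation line break ("\r\n ") at the byte level. */
    static byte[] unfold(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (i + 2 < in.length
                    && in[i] == '\r' && in[i + 1] == '\n' && in[i + 2] == ' ') {
                i += 2; // skip CR, LF and the continuation space
            } else {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }

    /** Builds a value whose two-byte character is cut by a continuation break. */
    static byte[] brokenValue() {
        byte[] value = "espa\u00F1ol".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value, 0, 5);                // "espa" plus the first byte of U+00F1
        out.write('\r');
        out.write('\n');
        out.write(' ');
        out.write(value, 5, value.length - 5); // second byte of U+00F1 plus "ol"
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] broken = brokenValue();

        // Decode first, unfold afterwards: the damage is already done.
        String decodedFirst =
                new String(broken, StandardCharsets.UTF_8).replace("\r\n ", "");
        // Unfold first, decode afterwards: the two bytes are adjacent again.
        String unfoldedFirst = new String(unfold(broken), StandardCharsets.UTF_8);

        System.out.println(decodedFirst.equals("espa\u00F1ol"));
        System.out.println(unfoldedFirst.equals("espa\u00F1ol"));
    }
}
```

Decoding first leaves replacement characters (U+FFFD) in the string, so the first comparison fails; unfolding at the byte level first restores the original value.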
> 
> Not only can a single code point span up to four bytes in UTF-8; there
> are also combining characters, such as combining diacritical marks (or
> whatever the appropriate term may be), that combine more than one code
> point into what is usually experienced and referred to as a character.
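
The three notions of length involved here (bytes, code points, experienced characters) can be compared with a short sketch. The class name CharacterCounts and the method graphemeCount are made up for this illustration, and using java.text.BreakIterator to delimit experienced characters is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class CharacterCounts {

    /** Counts the characters as delimited by a character BreakIterator. */
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'o' followed by U+0308 COMBINING DIAERESIS
        String s = "Angstro\u0308m";
        System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("characters:  " + graphemeCount(s));
    }
}
```

For this string the three numbers differ: 10 bytes, 9 code points and 8 characters as seen by the BreakIterator.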
> 
> The term "character" really gets ambiguous at this point. I wouldn't
> know what the specification actually refers to with that term.
> So rather than diving too deep into specifications or any sort of
> theory, let's look at another example:
> 
> import java.util.jar.Manifest;
> import java.util.jar.Attributes;
> import static java.util.jar.Attributes.Name;
> 
> public class DemoCharacterBroken2 {
>     public static void main(String[] args) throws Exception {
>         Manifest mf = new Manifest();
>         Attributes attrs = mf.getMainAttributes();
>         attrs.put(Name.MANIFEST_VERSION, "1.0");
>         attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m");
>         mf.write(System.out);
>     }
> }
> 
> which produces console output as follows:
> 
> Manifest-Version: 1.0
> Some-Key:                     Angstro
>   ̈m
> 
> (In case this does not display well, the diaeresis is on the m on the
> last line.)
> 
> When the whole manifest is decoded from UTF-8 as one big string and
> continuation line breaks are not removed until after decoding, the
> diaeresis (umlaut, two dots above, U+0308) apparently jumps onto the
> following letter m because it cannot be combined with the preceding
> space. The UTF-8 decoder (in all the editors I tried, not only Eclipse
> and its console view but also less, gedit, cat and the terminal)
> somehow tries to fix that, but the diaeresis does not necessarily jump
> back onto the "o" where it originally belonged when the continuation
> line break ("\r\n ") is removed after UTF-8 decoding has taken place.
> At least that did not work for me.
> 
> Hence, ideally, combining diacritical marks should not be separated
> from whatever they combine with when manifest lines are broken onto a
> continuation line. Such combinations, however, seem to be unlimited in
> the number of code points that combine into the same "experienced"
> character. I was able to find combinations that exceed not only the
> limit of 72 bytes per line but also the line buffer size of 512 bytes
> in Manifest::read. These may be rather uncommon but, to my own
> surprise, are still possible.
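
A combination of that size is easy to construct artificially. The sketch below is only an illustration and not one of the combinations referred to above: the class LongGrapheme, its methods, and the choice of stacking U+0308 on a single "o" are all assumptions of this example.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class LongGrapheme {

    /** One base letter followed by the given number of combining diaereses. */
    static String stacked(int marks) {
        return "o" + "\u0308".repeat(marks);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the BreakIterator sees no boundary inside the non-empty string. */
    static boolean isSingleCharacter(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        return it.next() == s.length();
    }

    public static void main(String[] args) {
        String overLine = stacked(36);    // 1 + 36 * 2 = 73 bytes
        String overBuffer = stacked(256); // 1 + 256 * 2 = 513 bytes
        System.out.println(utf8Length(overLine) + " bytes, single character: "
                + isSingleCharacter(overLine));
        System.out.println(utf8Length(overBuffer) + " bytes, single character: "
                + isSingleCharacter(overBuffer));
    }
}
```

Each combining diaeresis adds two bytes, so 36 marks already exceed a 72 byte line and 256 marks exceed a 512 byte buffer.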
> 
> The next consideration would then be to remove that limit of 512 bytes
> per manifest line, but exceeding it would make such manifests
> incompatible with previous Manifest::read implementations, so that is
> not really an immediately available option at the moment.
> 
> As a compromise, characters including combining diacritical marks
> whose UTF-8 encoded form stays within a limit of 71 bytes can be
> written without an interrupting continuation line break. That applies
> to most cases, but not all. I guess this should suit the values that
> can practically and realistically be expected well.
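
The compromise can be sketched roughly as follows. This is not the attached patch, only a minimal illustration under stated assumptions: the class GraphemeFolding and its fold method are hypothetical, java.text.BreakIterator is assumed as the way to find character boundaries, and characters longer than 71 bytes simply fall back to a break at an arbitrary byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeFolding {

    static final int LINE_LIMIT = 72; // bytes per line, excluding the line break

    /** Folds one header line, keeping characters of up to 71 bytes together. */
    static byte[] fold(String header) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(header);
        int lineLength = 0; // bytes on the current line, including a leading space
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            byte[] g = header.substring(start, end).getBytes(StandardCharsets.UTF_8);
            int pos = 0;
            while (pos < g.length) {
                boolean lineFull = lineLength == LINE_LIMIT;
                boolean keepTogether = pos == 0
                        && g.length <= LINE_LIMIT - 1
                        && lineLength + g.length > LINE_LIMIT;
                if (lineFull || keepTogether) {
                    out.write('\r');
                    out.write('\n');
                    out.write(' ');
                    lineLength = 1; // the continuation space counts
                }
                // Oversized characters are cut at the byte level (fallback).
                int n = Math.min(g.length - pos, LINE_LIMIT - lineLength);
                out.write(g, pos, n);
                pos += n;
                lineLength += n;
            }
        }
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String header = "Some-Key: " + " ".repeat(54) + "Angstro\u0308m";
        System.out.print(new String(fold(header), StandardCharsets.UTF_8));
    }
}
```

With 54 spaces the "o" and its diaeresis would need bytes 71 to 73 of the line, so the sketch moves both onto the continuation line instead of separating them.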
> 
> Another possibility would be to keep characters that are combinations
> of multiple Unicode code points together in their UTF-8 encoded form
> up to the 512 byte line length limit when reading, minus a space and a
> line break, which amounts to 509 bytes. But that would still not make
> manifests valid Unicode in all corner cases, and I guess it would
> probably not be a real improvement in practice over a limit of 71
> bytes.
> 
> Attached is a patch that tries to implement what was described above
> using a BreakIterator. While it works from a functional point of view,
> it might be less desirable performance-wise. Alternatively, one could
> do without the BreakIterator and only keep Unicode code points
> together by not placing line breaks before a continuation byte, which
> however would not address combining diacritical marks as in the second
> example above.
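
The alternative without a BreakIterator can be sketched as well. Again this is an illustration under assumptions rather than the patch itself: the class CodePointFolding and its methods are hypothetical names, and the 72 byte limit is taken to exclude the line break. A UTF-8 continuation byte has the bit pattern 10xxxxxx, so the break position is moved back until it no longer points at one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class CodePointFolding {

    static final int LINE_LIMIT = 72; // bytes per line, excluding the line break

    /** UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Folds one header line without ever breaking inside a code point. */
    static byte[] fold(String header) {
        byte[] bytes = header.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        int room = LINE_LIMIT; // the first line has no leading space
        while (bytes.length - pos > room) {
            int end = pos + room;
            // Move the break back to the start of the code point it would cut.
            while (end > pos && isContinuationByte(bytes[end])) {
                end--;
            }
            out.write(bytes, pos, end - pos);
            out.write('\r');
            out.write('\n');
            out.write(' ');
            pos = end;
            room = LINE_LIMIT - 1; // continuation lines start with a space
        }
        out.write(bytes, pos, bytes.length - pos);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String twoByte = "Some-Key: " + "x".repeat(61) + "\u00F1ol";
        String combining = "Some-Key: " + " ".repeat(55) + "Angstro\u0308m";
        System.out.print(new String(fold(twoByte), StandardCharsets.UTF_8));
        System.out.print(new String(fold(combining), StandardCharsets.UTF_8));
    }
}
```

The first header keeps its two-byte character intact. The second one still gets a break between the "o" and the diaeresis, because U+0308 is a code point of its own and starts with a lead byte.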
> 
> The jar file specification does not explicitly state that manifests
> should be valid UTF-8, and they were not always, but it also does not
> state otherwise. This leaves the impression that manifests could be
> (mis)taken for UTF-8 encoded strings, which they also are in many or
> most cases, and this has been confused many times. At the moment, the
> only case where a valid manifest is not also a valid UTF-8 encoded
> string is when a sequence of bytes encoding a single character happens
> to be interrupted by a continuation line break. To the best of my
> knowledge, all other valid manifests are also valid UTF-8 encoded
> strings.
> 
> It would be nice if manifests could be viewed and manipulated with all
> Unicode-capable editors and not only be parsed correctly with
> Manifest::read.
> 
> Any opinions? Would someone sponsor this patch?
> 
> Regards,
> Philipp
> 
> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specification
> https://bugs.openjdk.java.net/browse/JDK-6202130
> https://bugs.openjdk.java.net/browse/JDK-6443578
> https://github.com/gradle/gradle/issues/5225
> https://bugs.openjdk.java.net/browse/JDK-8202525
> https://en.wikipedia.org/wiki/Combining_character

Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
lance.ander...@oracle.com
