Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

naoto . sato Tue, 10 Mar 2020 10:47:12 -0700

Hi Philipp,

The reason that the current implementation of Grapheme boundary checkdoes it is that, some rules require information before the boundary,i.e. the rule GB12 in Unicode Segmentation [1] reads:

"do not break between regional indicator (RI) symbols if there is an oddnumber of RI characters *before* the break point."

Of course this itself can be improved, but I am not so keen on modifyingGrapheme/Pattern code at this time, so using BreakIterator (withLocale.US) is more preferred solution to me.


Naoto

[1] https://unicode.org/reports/tr29/


On 3/10/20 12:11 AM, Philipp Kunz wrote:

Hi Naoto and everyone,

When I tried the regular expression approach you proposed which reallylooks delightful, ValueUtf8Coding test ran into a timeout, which is whyI changed Pattern and Grapheme classes to start the loop in nextBoundaryto start at the offset off rather than at the beginning of the charactersequence src rendering isBoundary redundant. The changes to Pattern andGrapheme relate only by means of performance and could otherwise beapplied seperately as well. I admit I'm not too confident and familiarwith the regex parts of the openjdk and would welcome not only thoroughreview as usual but also any other feedback or thought shared about thematter.

Then I started to wonder about an issue of copy paste between Graphemeand GraphemeTest already out of date and also about some comments inRegExTest and if no more matches shouldn't also be asserted, which alsocould possibly be changed in a different change but still relate to theactual changes to certain extent. I could not find a way to accessUCD_FILES which has no or is in the default package.

What about passing the name and the value parameters ofManifest.printLine72 to it as separate parameters? Because names cannotexceed 70 bytes and the allowed characters are limited to ones that arerepresented with one byte each in UTF-8 the name can/could always beprinted without a line break. I don't know if Manifest.println72 can beremoved or has to be retained and deprecated for compatibility.


See attached patch.

Looking forward to receiving any kind of feedback and regards,
Philipp





On Tue, 2020-02-11 at 14:20 -0800, [email protected] wrote:

Hi,

Can the code in Manifest.java be simplified using regex? E.g.,

var gc = Pattern.compile("\\X");
gc.matcher(line).results()
      .map(MatchResult::group)
      .forEach(
        // print till 72 bytes in UTF-8
      );

Just a thought. If BreakIterator is preferred, it should take Locale.US
as an argument to the factory method, so that it would produce the same
result no matter what the default locale is.

Naoto

On 2/10/20 1:22 PM, Lance Andersen wrote:

Hi all,

Here is a webrev for the patch that Philipp is proposing which will make it 
easier to review:
http://cr.openjdk.java.net/~lancea/6202130/webrev.00
  <
http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>

On Dec 26, 2019, at 11:50 AM, Philipp Kunz <
[email protected]
 <mailto:[email protected]>
> wrote:

Hi,
The specification says, a line break in a manifest can occur beforeor
after a Unicode character encoded in UTF-8.

    ...>      value:         SPACE *otherchar newline

*continuation>      continuation:  SPACE *otherchar
newline>    ...>      otherchar:     any UTF-8 character except NUL, CR
and LF
The current implementation breaks manifest lines at 72 bytes regardless
ofhow the bytes around the break are part of a sequence of bytes
encoding acharacter. Code points may use up to four bytes when encoded
in UTF-8.Manifests with line breaks inside of sequences of bytes
encoding Unicodecharacters in UTF-8 with more than one bytes not only
are invalid UTF-8but also look ugly in text editors.For example, a
manifest could look like this:
import java.util.jar.Manifest;import java.util.jar.Attributes;import
static java.util.jar.Attributes.Name;
public class CharacterBrokenDemo1 {    public static void main(String[]
args) throws Exception{        Manifest mf = new
Manifest();        Attributes attrs =
mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
"1.0");        attrs.put(new Name("Some-Key"),                  "Some
languages have decorated characters, " +                   "for
example: español"); // or
"espa\u00D1ol"        mf.write(System.out);    }}
Above code produces a result as follows with some unexpected question
markswhere the encoding is invalid:

    Manifest-Version: 1.0>    Some-Key: Some languages have decorated

characters, for example: espa?>     ?ol
This is of course an example written with actual question marks to get
a validtext for this message. The trick here is that "Some-Key" to
"example :espa"amounts to exactly one byte less encoded in UTF-8 than
would fit on one linewith the 72 byte limit so that the subsequent
character encoded with two bytesgets broken inside of the sequence of
two bytes for this character across acontinuation line break.
When decoding the resulting bytes from UTF-8 as one whole string, the
twoquestion marks will not fit together again even if the line break
with thecontinuation space is removed. However, Manifest::read removes
the continuationline breaks ("\r\n ") before decoding the manifest
header value from UTF-8 andhence can reproduce the original value.
Characters encoded in UTF-8 can not only span up to four bytes for one
codepoint each, there are also combining characters or classes thereof
or combiningdiacritical marks or whatever the appropriate term could
be, that combine morethan one code point into what is usually
experienced and referred to as acharacter.
The term character really gets ambiguous at this point. I wouldn't know
whatthe specification actually refers to with that term "character". So
rather thandiving in too much specification or any sorts of theory,
let's look at anotherexample:
import java.util.jar.Manifest;import java.util.jar.Attributes;import
static java.util.jar.Attributes.Name;
public class DemoCharacterBroken2 {    public static void main(String[]
args) throws Exception{        Manifest mf = new
Manifest();        Attributes attrs =
mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
"1.0");        attrs.put(new Name("Some-Key"), " ".repeat(53) +
"Angstro\u0308m");        mf.write(System.out);    }}
which produces console output as follows:

    Manifest-Version: 1.0>    Some-

Key:                                                      Angstro>
̈m
(In case this does not display well, the diaeresis is on the m on the
last line)
When the whole Manifest is decoded from UTF-8 as one big single string
andcontinuation line breaks are not removed until after UTF-8 decoding
the wholemanifest, the diaeresis (umlaut, two points above, u0308)
apparently kind ofjumps onto the following letter m because somehow it
cannot be combined withthe preceding space. The UTF-8 decoder (all of
my editors I tried, not onlyEclipse and its console view, also less,
gedit, cat and terminal) somehowtries to fix that but the diaeresis may
not necessarily jump back on the "o"where it originally belonged by
removing the continuation line break ("\r\n ")after UTF-8 decoding has
taken place, at least that did not work for me.
Hence, ideally combining diacritical marks should better not be
separated fromwhatever they combine with when breaking manifest lines
onto a continuationline. Such combinations, however, seem to be
unlimited in terms of number ofcode points combining into the same
"experienced" character. I was able tofind combinations that not only
exceed the limit of 72 bytes per line but alsoexceed the line buffer
size of 512 bytes in Manifest::read. These may be ratheruncommon but
still possible to my own surprise.
Next consideration would then be to remove that limit of 512 bytes per
manifestline but exceeding it would make such manifests incompatible
with previousManifest::read implementations and is not really an
immediately availableoption at the moment.
As a compromise, those characters including combining diacritical marks
whichcombine only so many code points as that their binarily encoded
form in UTF-8remains within a limit of 71 bytes can be written without
an interruptingcontinuation line break, which applies to most cases,
but not all. I guess thisshould suit practically and realistically to
be expected values well.
Another possibility would be to allow for characters that are
combinations ofmultiple Unicode code points to be kept together in
their encoded form in UTF-8up to 512 bytes line length limit when
reading minus a space and a line breakamounting to 509 bytes, but that
would still not make manifests be representedas valid Unicode in all
corner cases and I guess would not probably make a realimprovement in
practice over a limit of 71 bytes.
Attached is a patch that tries to implement what was described above
using aBreakIterator. While it works from a functional point of view,
this might beless desirable performance-wise. Alternatively could be
considered to do withoutthe BreakIterator and only keep Unicode code
points together by not placingline breaks before a continuation byte,
which however would not addresscombining diacritical marks as in the
second example above.
The jar file specification does not explicitly state that manifest
should bevalid UTF-8, and they were not always, but it also does not
state otherwise,leaving an impression that manifests could be
(mis)taken for UTF-8 encodedstrings, which they also are in many or
most cases and which has been confusedmany times. At the moment, the
only case where a valid manifest is not also avalid UTF-8 encoded
string is when a sequence of bytes encoding the samecharacter happens
to be interrupted with a continuation line break. To the bestof my
knowledge, all other valid manifests are also valid UTF-8 encoded
strings.
It would be nice, if manifests could be viewed and manipulated with all
Unicodecapable editors and not only be parsed correctly with
Manifest::read.
Any opinions? Would someone sponsor this patch?
Regards,Philipp

https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character


<6202130-manifestutf8linebreak.patch>


   <
http://oracle.com/us/design/oracle-email-sig-198324.gif
>
   <
http://oracle.com/us/design/oracle-email-sig-198324.gif
> <
http://oracle.com/us/design/oracle-email-sig-198324.gif
>
   <
http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
  Andersen| Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
[email protected]
 <mailto:[email protected]>
  <mailto:
[email protected]
 <mailto:[email protected]>
>

Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Reply via email to