Summary.
A. Line support.
- Supporting a mix of line terminators `\n|\r\n|\r` is already a
well-established pattern in language parsers, in the JDK (e.g., see
java.nio.file.FileChannelLinesSpliterator) and in regex (e.g., see `\R`). The
performance difference between checking one terminator and checking all three
is negligible.
- Yes, Stream<String> stream =
Pattern.compile("\n|\r\n|\r").splitAsStream(string); is very useful
(Spliterators rule), but it is cumbersome for what is expected to be a common
use case. Only so-so streamy. :-)
- BufferedReader.lines() vs. String.lines() is a tricky discussion. It comes
down to whether the new line is a terminator or a separator. In the I/O case,
terminator seems to be the right answer: a well-formed text file will have a
new line at the end of every line. However, I think you'll find that when
people work with multi-line strings they think of new line as a separator.
Hence the common use of split("\n") and "".split("\n").length == 1.
Indentation, the position of the closing delimiter and margin trimming make
that last line very fluid.
What clinches the deal is that
string.lines().collect(joining("\n")).equals(string). I'll ensure both versions
of lines() have the difference well javadocumented.
- The current Spliterator implementation makes
String.lines().toArray(String[]::new) an order of magnitude faster than
split(`\n|\r\n|\r`). That’s why I implemented it for margin management. Faster
still if no collection/array is constructed.
BTW: split(`\R`) is 2x-3x faster than split(`\n|\r\n|\r`). Nice.
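A minimal sketch of the separator vs. terminator distinction from the points above, using only existing JDK calls (Pattern.split and BufferedReader.lines()). The class and method names (LineSplitSketch, viaSeparator, viaTerminator) are hypothetical, and the proposed String.lines() is deliberately not used since its semantics are what is being discussed.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LineSplitSketch {
    // The three terminators from the discussion; "\r\n" is listed before "\r"
    // so a CRLF pair is consumed as a single terminator.
    private static final Pattern EOL = Pattern.compile("\n|\r\n|\r");

    // Separator view: every terminator separates two lines.
    // A limit of -1 keeps trailing empty strings.
    static List<String> viaSeparator(String s) {
        return Arrays.asList(EOL.split(s, -1));
    }

    // Terminator view: what BufferedReader.lines() reports for the same text.
    static List<String> viaTerminator(String s) {
        return new BufferedReader(new StringReader(s))
                .lines()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String mixed = "abc\ndef\r\nghi\rjkl";
        System.out.println(viaSeparator(mixed));   // [abc, def, ghi, jkl]
        System.out.println(viaTerminator(mixed));  // [abc, def, ghi, jkl]

        // The two views differ on the empty string and on a trailing new line.
        System.out.println(viaSeparator("").size());       // 1
        System.out.println(viaTerminator("").size());      // 0
        System.out.println(viaSeparator("abc\n").size());  // 2
        System.out.println(viaTerminator("abc\n").size()); // 1
    }
}
```

The two views agree on the mixed-terminator input and only diverge at the edges, which is where the javadoc for both lines() methods would need to be explicit.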
B. Additions to basic trim methods.
- Revamped to become strip, stripLeading, stripTrailing using
Character.isWhitespace(codepoint) as the test (optimized using ch == ' ' || ch
== '\t' || Character.isWhitespace(ch)).
- No strong feeling about it, but String.trim() could be recommended for
deprecation.
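A rough sketch of how the revamped methods could apply that test, assuming a code-point walk from each end. StripSketch and its static methods are hypothetical stand-ins for the proposed instance methods, not the actual implementation.

```java
public class StripSketch {
    // Fast path for the two most common cases, then the full Unicode test.
    static boolean isWs(int cp) {
        return cp == ' ' || cp == '\t' || Character.isWhitespace(cp);
    }

    static String stripLeading(String s) {
        int start = 0;
        while (start < s.length()) {
            int cp = s.codePointAt(start);
            if (!isWs(cp)) break;
            start += Character.charCount(cp);
        }
        return s.substring(start);
    }

    static String stripTrailing(String s) {
        int end = s.length();
        while (end > 0) {
            int cp = s.codePointBefore(end);
            if (!isWs(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(0, end);
    }

    static String strip(String s) {
        return stripTrailing(stripLeading(s));
    }

    public static void main(String[] args) {
        String s = " \u2028abc ";
        // U+2028 LINE SEPARATOR is whitespace for Character.isWhitespace,
        // but it is greater than ' ', so trim() leaves it in place.
        System.out.println(strip(s).length());   // 3
        System.out.println(s.trim().length());   // 4
    }
}
```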
C. Margin management.
- String.trimMarkers() as a default to String.trimMarkers("|", "|") is
reasonable. Will put it in the CSR for broader discussion.
- Re use of patterns. I think the Stream<String> lines() method will make it
easy enough to create custom trim margin lambdas.
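As an illustration of the custom trim margin lambdas mentioned above, here is a sketch under a few assumptions: MarginSketch, lines, markers and trimMargin are hypothetical names, and lines(String) is a stand-in for the proposed String.lines() built on the existing `\R` regex construct.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MarginSketch {
    private static final Pattern EOL = Pattern.compile("\\R");

    // Stand-in for the proposed String.lines().
    static Stream<String> lines(String s) {
        return EOL.splitAsStream(s);
    }

    // Builds a per-line lambda: trim the line, then drop the markers if present.
    static UnaryOperator<String> markers(String left, String right) {
        return line -> {
            String t = line.trim();
            if (!left.isEmpty() && t.startsWith(left)) {
                t = t.substring(left.length());
            }
            if (!right.isEmpty() && t.endsWith(right)) {
                t = t.substring(0, t.length() - right.length());
            }
            return t;
        };
    }

    // Applies any per-line lambda and rejoins the lines with "\n".
    static String trimMargin(String s, UnaryOperator<String> perLine) {
        return lines(s).map(perLine).collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String boxed = "|This is a line|\n    |This is a line|";
        System.out.println(trimMargin(boxed, markers("|", "|")));

        String quoted = ">> This is a line\n    >> This is a line";
        System.out.println(trimMargin(quoted, markers(">> ", "")));
    }
}
```

Any other margin rule (a regex, a fixed column count) would plug in as a different UnaryOperator without touching trimMargin.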
D. Escape management.
- Good
Cheers,
— Jim
> On Mar 13, 2018, at 10:47 AM, Jim Laskey <[email protected]> wrote:
>
> With the announcement of JEP 326 Raw String Literals, we would like to open
> up a discussion with regards to RSL library support. Below are several
> implemented String methods that are believed to be appropriate. Please
> comment on those mentioned below including recommending alternate names or
> signatures. Additional methods can be considered if warranted, but as always,
> the bar for inclusion in String is high.
>
> You should keep a couple things in mind when reviewing these methods.
>
> Methods should be applicable to all strings, not just Raw String Literals.
>
> The number of additional methods should be minimized, not adding every
> possible method.
>
> Don't put any emphasis on performance. That is a separate discussion.
>
> Cheers,
>
> -- Jim
>
> A. Line support.
>
> public Stream<String> lines()
> Returns a stream of substrings extracted from this string partitioned by line
> terminators. Internally, the stream is implemented using a Spliterator that
> extracts one line at a time. The line terminators recognized are \n, \r\n and
> \r. This method provides versatility for the developer working with
> multi-line strings.
> Example:
>
> String string = "abc\ndef\nghi";
> Stream<String> stream = string.lines();
> List<String> list = stream.collect(Collectors.toList());
>
> Result:
>
> [abc, def, ghi]
>
>
> Example:
>
> String string = "abc\ndef\nghi";
> String[] array = string.lines().toArray(String[]::new);
>
> Result:
>
> [Ljava.lang.String;@33e5ccce // [abc, def, ghi]
>
>
> Example:
>
> String string = "abc\ndef\r\nghi\rjkl";
> String platformString =
> string.lines().collect(joining(System.lineSeparator()));
>
> Result:
>
> abc
> def
> ghi
> jkl
>
>
> Example:
>
> String string = " abc \n def \n ghi ";
> String trimmedString =
> string.lines().map(s -> s.trim()).collect(joining("\n"));
>
> Result:
>
> abc
> def
> ghi
>
>
> Example:
>
> String table = `First Name  Surname   Phone
>                 Al          Albert    555-1111
>                 Bob         Roberts   555-2222
>                 Cal         Calvin    555-3333
>                 `;
>
> // Extract headers
> String firstLine = table.lines().findFirst().orElse("");
> List<String> headings = List.of(firstLine.trim().split(`\s{2,}`));
>
> // Build stream of maps
> Stream<Map<String, String>> stream =
> table.lines().skip(1)
> .map(line -> line.trim())
> .filter(line -> !line.isEmpty())
> .map(line -> line.split(`\s{2,}`))
> .map(columns -> {
> List<String> values = List.of(columns);
> return IntStream.range(0, headings.size()).boxed()
> .collect(toMap(headings::get,
> values::get));
> });
>
> // print all "First Name"
> stream.map(row -> row.get("First Name"))
> .forEach(name -> System.out.println(name));
>
> Result:
>
> Al
> Bob
> Cal
> B. Additions to basic trim methods.
>
> In addition to margin methods trimIndent
> and trimMarkers described below in Section C, it would be worth introducing
> trimLeft and trimRight to augment the longstanding trim method. A key
> question is how trimLeft and trimRight should detect whitespace, because
> different definitions of whitespace exist in the library.
>
> trim itself uses the simple test less than or equal to the space character, a
> fast test but not Unicode friendly.
>
> Character.isWhitespace(codepoint) returns true if codepoint is one of the
> following:
>
> SPACE_SEPARATOR.
> LINE_SEPARATOR.
> PARAGRAPH_SEPARATOR.
> '\t', U+0009 HORIZONTAL TABULATION.
> '\n', U+000A LINE FEED.
> '\u000B', U+000B VERTICAL TABULATION.
> '\f', U+000C FORM FEED.
> '\r', U+000D CARRIAGE RETURN.
> '\u001C', U+001C FILE SEPARATOR.
> '\u001D', U+001D GROUP SEPARATOR.
> '\u001E', U+001E RECORD SEPARATOR.
> '\u001F', U+001F UNIT SEPARATOR.
> ' ', U+0020 SPACE.
> (Note that non-breaking space (\u00A0) is excluded.)
>
> Character.isSpaceChar(codepoint) returns true if codepoint is one of the
> following:
>
> SPACE_SEPARATOR.
> LINE_SEPARATOR.
> PARAGRAPH_SEPARATOR.
> ' ', U+0020 SPACE.
> '\u00A0', U+00A0 NON-BREAKING SPACE.
> That sets up several kinds of whitespace; trim's whitespace (TWS), Character
> whitespace (CWS) and the union of the two (UWS). TWS is a fast test. CWS is a
> slow test. UWS is fast for Latin1 and slow-ish for UTF-16.
>
> We are recommending that trimLeft and trimRight use UWS, leave trim alone to
> avoid breaking the world and then possibly introduce trimWhitespace that uses
> UWS.
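
The three kinds of whitespace described above can be compared on a few sample code points. This is only a sketch: the class WhitespaceKinds and the method names tws, cws and uws are hypothetical, chosen to match the abbreviations used in the text.

```java
public class WhitespaceKinds {
    // trim's whitespace: anything less than or equal to the space character.
    static boolean tws(int cp) {
        return cp <= ' ';
    }

    // Character whitespace: the Character.isWhitespace definition.
    static boolean cws(int cp) {
        return Character.isWhitespace(cp);
    }

    // Union of the two.
    static boolean uws(int cp) {
        return tws(cp) || cws(cp);
    }

    static String classify(int cp) {
        return String.format("U+%04X TWS=%b CWS=%b UWS=%b isSpaceChar=%b",
                cp, tws(cp), cws(cp), uws(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        // NUL, TAB, SPACE, NO-BREAK SPACE, LINE SEPARATOR
        int[] samples = {0x0000, 0x0009, 0x0020, 0x00A0, 0x2028};
        for (int cp : samples) {
            System.out.println(classify(cp));
        }
    }
}
```

NUL is whitespace only under the trim test, LINE SEPARATOR only under the Character test, and NO-BREAK SPACE under neither, which is why it stays untouched even with the union.
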
>
> public String trim()
> Removes characters less than or equal to space from the beginning and end of
> the string. No change except spec clarification and links to the new trim
> methods.
> Examples:
> "".trim(); // ""
> " ".trim(); // ""
> " abc ".trim(); // "abc"
> " \u2028abc ".trim(); // "\u2028abc"
> public String trimWhitespace()
> Removes whitespace from the beginning and end of the string.
> Examples:
>
> "".trimWhitespace(); // ""
> " ".trimWhitespace(); // ""
> " abc ".trimWhitespace(); // "abc"
> " \u2028abc ".trimWhitespace(); // "abc"
> public String trimLeft()
> Removes whitespace from the beginning of the string.
> Examples:
>
> "".trimLeft(); // ""
> " ".trimLeft(); // ""
> " abc ".trimLeft(); // "abc "
> public String trimRight()
> Removes whitespace from the end of the string.
> Examples:
>
> "".trimRight(); // ""
> " ".trimRight(); // ""
> " abc ".trimRight(); // " abc"
> C. Margin management.
>
> With the introduction of multi-line Raw String Literals,
> developers will have to deal with the extraneous spacing introduced by
> indenting and formatting string bodies.
>
> Note that for all the methods in this group, if the first line is empty then
> it is removed and if the last is empty then it is removed. This removal
> provides a means for developers that use delimiters on separate lines to
> bracket string bodies. Also note that all line separators are replaced with
> \n.
>
> public String trimIndent()
> This method determines a representative line in the string body that has a
> non-whitespace character closest to the left margin. Once that line has been
> determined, the number of leading whitespaces is tallied to produce a minimal
> indent amount. Consequently, the result of the method is a string with the
> minimal indent amount removed from each line. The first line is unaffected
> since it is preceded by the open delimiter. The type of whitespace used
> (spaces or tabs) does not affect the result as long as the developer is
> consistent with the whitespace used.
> Example:
>
> String x = `
> This is a line
> This is a line
> This is a line
> This is a line
> This is a line
> `.trimIndent();
>
> Result:
>
> This is a line
> This is a line
> This is a line
> This is a line
> This is a line
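
The trimIndent description above can be sketched roughly as follows. The assumptions are mine, not part of the proposal: the body starts on the line after the open delimiter, a whitespace-only first or last line counts as "empty", and the indent is counted in chars. IndentSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IndentSketch {
    // Number of leading whitespace characters on a line.
    static int leading(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i;
    }

    static boolean blank(String line) {
        return leading(line) == line.length();
    }

    static String trimIndent(String s) {
        // Split on any line terminator, keeping trailing empty lines.
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\\R", -1)));

        // Drop an empty first line and an empty last line (assumption:
        // whitespace-only counts as empty).
        if (!lines.isEmpty() && blank(lines.get(0))) {
            lines.remove(0);
        }
        if (!lines.isEmpty() && blank(lines.get(lines.size() - 1))) {
            lines.remove(lines.size() - 1);
        }

        // The minimal indent comes from the non-blank line closest to the margin.
        int min = lines.stream()
                .filter(l -> !blank(l))
                .mapToInt(IndentSketch::leading)
                .min()
                .orElse(0);

        // Remove that indent from every line and normalize separators to "\n".
        return lines.stream()
                .map(l -> l.length() >= min ? l.substring(min) : "")
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String body = "\n    This is a line\n      This is a line\n    This is a line\n  ";
        System.out.println(trimIndent(body));
    }
}
```
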
> public String trimMarkers(String leftMarker, String rightMarker)
> Each line of the multi-line string is first trimmed. If the trimmed line
> contains the leftMarker at the beginning of the string then it is removed.
> Finally, if the line contains the rightMarker at the end of line, it is
> removed.
> Example:
>
> String x = `|This is a line|
> |This is a line|
> |This is a line|`.trimMarkers("|", "|");
> Result:
>
> This is a line
> This is a line
> This is a line
>
> Example:
>
> String x = `>> This is a line
> >> This is a line
> >> This is a line`.trimMarkers(">> ", "");
> Result:
>
> This is a line
> This is a line
> This is a line
> D. Escape management.
>
> Since Raw String Literals do not interpret Unicode
> escapes (\unnnn) or escape sequences (\n, \b, etc), we need to provide a
> scheme for developers who just want multi-line strings but still have escape
> sequences interpreted.
>
> public String unescape() throws MalformedEscapeException
> Translates each Unicode escape or escape sequence in the string into the
> character represented by the escape. @jls 3.3, 3.10.6
> Example:
>
> `abc\u2022def\nghi`.unescape();
>
> Result:
>
> abc•def
> ghi
> public String unescape(EscapeType... escape) throws MalformedEscapeException
> Selectively translates Unicode escape or escape sequence based on the escape
> type flags provided.
> public enum EscapeType {
> /** Backslash escape sequences based on section 3.10.6 of the
> * <cite>The Java™ Language Specification</cite>.
> * This includes sequences for backspace, horizontal tab,
> * line feed, form feed, carriage return, double quote,
> * single quote, backslash and octal escape sequences.
> */
> BACKSLASH, //
>
> /** Unicode sequences based on section 3.3 of the
> * <cite>The Java™ Language Specification</cite>.
> * This includes sequences in the form {@code \u005Cunnnn}.
> */
> UNICODE
> }
>
>
> Example:
>
> `abc\u2022def\nghi`.unescape(EscapeType.BACKSLASH);
>
> Result:
>
> abc\u2022def
> ghi
>
>
> Example:
>
> `abc\u2022def\nghi`.unescape(EscapeType.UNICODE);
>
> Result:
>
> abc•def\nghi
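
A partial sketch of what the no-argument unescape might do, under stated assumptions: both kinds of escape are handled in one left-to-right pass, octal escapes and repeated "u" markers are not supported, and IllegalArgumentException stands in for the proposed MalformedEscapeException, which does not exist in the JDK. UnescapeSketch is a hypothetical name.

```java
public class UnescapeSketch {
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i == s.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                case 'u':
                    // Unicode escape: exactly four hex digits are expected.
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("truncated Unicode escape");
                    }
                    out.append((char) Integer.parseInt(s.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("unsupported escape character: " + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes give the same characters a raw string literal would.
        String raw = "abc\\u2022def\\nghi";
        System.out.println(unescape(raw));
    }
}
```
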
> Conversely, there are circumstances where the inverse is required.
>
> public String escape()
> Translates each quote, backslash, non-graphic character or non-ASCII
> character into a Unicode escape or escape sequence. The method is equivalent
> to escape(BACKSLASH, UNICODE).
> Example:
>
> `abc•def
> ghi`.escape();
>
> Result:
>
> abc\u2022def\nghi
> public String escape(EscapeType... escape)
> Selectively translates each quote, backslash, non-graphic character or
> non-ASCII character into a Unicode escape or escape sequence based on the
> escape type flags provided.
> Example:
>
> `abc•def
> ghi`.escape(EscapeType.BACKSLASH);
>
> Result:
>
> abc•def\nghi
>
>
> Example:
>
> `abc•def
> ghi`.escape(EscapeType.UNICODE);
>
> Result:
>
> abc\u2022def
> ghi
>