Raw string literals -- where we are, how we got here

Brian Goetz Tue, 27 Mar 2018 12:18:07 -0700

Now that things have largely stabilized with raw string literals, let me summarize where we are, and how we got here.


## The proposal

Where we are now is that a raw string literal consists of an opening delimiter which is a sequence of N consecutive backticks, for some N > 0, a body which may contain any characters (including newlines) except for a sequence of N consecutive backticks, and a closing delimiter of N consecutive backticks. Any line-end sequences (CR, LF, CRLF) are normalized to a single newline (LF), and the remainder of the body is treated without any further transformation (including without Unicode escape processing), and placed in a String. No other processing is done on the contents.
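To make the rule concrete, here is a minimal sketch of how a scanner might read one such literal. The class and method names (`RawStringScanner`, `scan`) are hypothetical and not part of the proposal, and the sketch assumes that a closing delimiter is a run of exactly N backticks, so that longer or shorter runs are ordinary body content.

```java
/** Hypothetical helper, not part of the proposal: scans one raw string literal. */
final class RawStringScanner {

    /**
     * Reads a raw string literal that begins at the start of source and returns
     * its body, with CR and CRLF normalized to LF and nothing else changed.
     */
    static String scan(String source) {
        // The opening delimiter is the run of backticks at the start; call its length n.
        int n = 0;
        while (n < source.length() && source.charAt(n) == '`') {
            n++;
        }
        if (n == 0) {
            throw new IllegalArgumentException("not a raw string literal");
        }
        // Walk the body looking for a run of exactly n backticks.
        int i = n;
        while (i < source.length()) {
            if (source.charAt(i) != '`') {
                i++;
                continue;
            }
            int runStart = i;
            while (i < source.length() && source.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == n) {
                String body = source.substring(n, runStart);
                // The single transformation: CRLF first, then any remaining CR.
                return body.replace("\r\n", "\n").replace('\r', '\n');
            }
            // A run of any other length is just part of the body.
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    public static void main(String[] args) {
        // The backslash and the letter n survive as two characters.
        System.out.println(scan("`a \\n b`"));
        // A single tick inside a two-tick delimiter is body content.
        System.out.println(scan("``one ` tick``"));
    }
}
```

Note that nothing in the loop looks inside the body for backslashes or any other escape; the only thing the scanner cares about is the length of each backtick run.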

A raw string literal has type String, just like a traditional string literal, and can be used anywhere an expression of type String can be used (assignment, concatenation, etc.).


Examples:

    String s = `Doesn't have a \n newline character in it`;
    String ss = `a multi-
        line-string`;
    String sss = ``a string with a single tick (`) character in it``;
    String ssss = `a string with two ticks (``) in it`;

    String sssss = `````a string literal with gratuitously many ticks in its delimiter`````;

Note that the delimiter need not be _more_ ticks than the longest tick sequence in the body; if the body contains sequences of two ticks and three ticks, it can be delimited by one tick, four ticks, five ticks, etc. This makes it possible to choose a minimal delimiter that doesn't interfere with the body.
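Choosing a minimal delimiter is mechanical, which is why an IDE can do it for you. The following is a sketch only; `DelimiterChooser` and its methods are hypothetical names. It assumes, as above, that only a run of exactly N ticks conflicts with an N-tick delimiter, and it rejects bodies that are empty or that begin or end with a tick, since in this reading such a run would merge with the delimiter.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: picks the shortest delimiter that does not clash with a body. */
final class DelimiterChooser {

    /** Returns the smallest N > 0 such that the body has no run of exactly N backticks. */
    static int minimalDelimiterLength(String body) {
        // Assumption of this sketch: these bodies cannot be delimited unambiguously.
        if (body.isEmpty() || body.startsWith("`") || body.endsWith("`")) {
            throw new IllegalArgumentException("body cannot be delimited unambiguously");
        }
        // Collect the length of every maximal run of backticks in the body.
        Set<Integer> runLengths = new HashSet<>();
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) != '`') {
                i++;
                continue;
            }
            int start = i;
            while (i < body.length() && body.charAt(i) == '`') {
                i++;
            }
            runLengths.add(i - start);
        }
        // The first length that is not taken is the minimal delimiter.
        int n = 1;
        while (runLengths.contains(n)) {
            n++;
        }
        return n;
    }

    /** Wraps the body in the minimal delimiter, as it would appear in source. */
    static String quote(String body) {
        StringBuilder delimiter = new StringBuilder();
        for (int k = minimalDelimiterLength(body); k > 0; k--) {
            delimiter.append('`');
        }
        return delimiter + body + delimiter;
    }

    public static void main(String[] args) {
        // Body contains runs of two and three ticks only, so one tick suffices.
        System.out.println(minimalDelimiterLength("a `` b ``` c"));
        // Body contains single ticks, so two ticks are needed.
        System.out.println(quote("SELECT `id` FROM t"));
    }
}
```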


## Design Center

The design center for this feature is _raw string literals_. Not multi-line strings (though this is well handled), not interpolated strings (though this can be considered in the future.) It turns off all inline escaping, even Unicode escaping (which is usually handled by the lexer before the production even sees the characters.) We stay as true as we can to this principle: raw means raw, not 99% raw with a little bit of escaping. (The single exception is normalizing of carriage control, the absence of which would just be too surprising.)

The primary use case addressed by raw string literals is snippets of code from other languages embedded in Java source files. Here we interpret "languages" broadly; they could be traditional programming languages, specialized languages like regular expressions or SQL, or human languages. We want Java lexing not to interfere at all; given a suitable O(1) incantation (picking a non-conflicting delimiter), you can freely cut and paste the foreign string to and from Java. Being able to do this is not only convenient, but it reduces errors due to hand-mangling the string, and enhances readability because the embedded snippet is free of interference from Java.

Choosing raw-ness as a design center leads to a simpler design, which is good, but it also is _more stable_, because it leads us away from the temptation to tweak the rules here and there in ways that might be subjectively attractive, but that further increase the complexity of the feature. This design choice reflects a priority choice: the high-order bit is _no embedding anomalies_. Users don't have to reason about whether they need to hand-mangle a snippet to avoid it being mangled by the compiler or runtime; given a suitable choice of delimiter, there's nothing else to think about. (IDEs can help with the "writing code" part of this.)

The various additional features we might be tempted to put in (special processing for leading or trailing blank lines, leading white space, trimming to markers, etc) can instead be handled via library functionality. Since raw string literals are Strings, we can further process them with library code -- both JDK code and user code (though methods on String have the advantage that they can be chained, rather than wrapped, which most users will prefer). Adding new string manipulation features via libraries rather than through the language is easier, can be done by users, and is not constrained by the demands of consistency (you can have seven different trimming methods, each with its own definition of whitespace, if you like), whereas a language feature has to be one-size-fits-all. Moving this complexity to the library where possible leads to a simpler feature and more choices for users.


#### A road not taken

We choose to divide the world of string literals first into raw and non-raw literals; from this, multi-line strings fall out for free as we can treat line breaks in the source file as just more raw characters.

We could have chosen, instead, to first divide the world into single and multi-line strings, and then into raw and non-raw; this would have left us with four choices (raw single-line, raw multi-line, cooked single-line, cooked multi-line.) This also would have been a defensible position, but seemed to add lexical complexity for little gain.


#### The exception that proves the rule

The one exception to raw-ness is that we normalize the line terminators to the most common (*nix) choice of a single newline, rather than using the platform-specific line terminator on the system that happens to have compiled the classfile. The alternative would have just been too surprising.



## Syntax

Given that this feature has such a high syntax-to-substance ratio, we should expect more than the usual number of syntax opinions. Let's start with some consequences of our chosen design center.


#### No fixed delimiter

From the design choice above, it is a forced move to accept variable delimiters. Otherwise, one cannot represent a string with the delimiter in a raw string, without inventing an escaping mechanism, and subverting our "raw means raw" goal.

The "self-embedding test" is not a mere theoretical goal. Since the snippets we expect to paste into Java source are not randomly chosen strings of characters, but meaningful snippets of some language, the likelihood of wanting to represent a string that contains the chosen delimiter goes up. Even if you are willing to dismiss "embed Java in Java" as a serious use case (we're not), people also want a familiar delimiter, which means something that looks like the delimiter in other languages, further increasing the chance of collision. (For example, if we'd picked a fixed triple quote delimiter, then you couldn't embed Groovy or Python code, among others -- surely a real use case). Fixed delimiters (of any length) and "raw means raw" are not compatible goals, and we choose "raw means raw".

The credible options for variable delimiters are using a repeating delimiter sequence (say, any number of ticks), or some sort of user-provided nonce ("here" docs), or both. Nonces impose a higher cognitive load on readers, and their benefit accrues mostly to corner cases, so the more constrained option of repeating delimiters seems preferable.


#### Why not 'just' use triple quotes

People's syntax preferences are guided by familiarity, so we should expect suggestions to be biased towards what "similar" languages already do. So the suggestion of using """triple quotes""" should be expected.

We've already discussed how a fixed delimiter is not acceptable. So at a minimum, this would have to be adjusted to "three or more." While some people find triple quotes natural (or at least familiar), others find them offensively heavyweight. Neither crowd is going to convince the other.


#### But ticks are too light

The opposite of the "triple quotes are too heavy" argument is "ticks are too light"; that a single tick is a lightweight character, and could go unnoticed, especially if your monitor hasn't been cleaned for a while. Unfortunately the quote-like delimiters in the middle of the weight range are taken by other activities. Again, we can't satisfy the "too light" and "too heavy" crowd at the same time; whichever we do will make some people unhappy.


#### Why do you have to always do something new?

The quoting scheme chosen -- any number of ticks -- is actually taken from something we all use: Markdown (https://daringfireball.net/projects/markdown/syntax), which permits any number of ticks to be used for infix sequences, and any different number of ticks to be embedded. (Where we depart from Markdown is that Markdown strips any leading and trailing newlines from multi-line tick blocks, an appropriate trick for a page presentation language, but not consistent with the design goal of "raw".)


#### But I want indentation stripping

When embedding a snippet of one language in another, both of which support indentation, we are left with two choices: indent the enclosed block exactly, which has the effect of the code "jutting out to the left", or indent the enclosed block relative to the enclosing block, which has the effect of having more indentation than you might want for the enclosed block. Sometimes this doesn't matter, but sometimes it does. Whatever we do, one of these crowds will be unhappy. When in doubt, we stick to the principle of "raw means raw", and provide indentation stripping via new instance methods on `String` to allow a range of trimming options, such as `trimIndent()`.
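As an illustration of how little machinery one such trimming option needs, here is a sketch of a common-indent stripper written as a plain static helper. It is not the proposed `trimIndent()`; the name `stripCommonIndent`, the choice to count only spaces and tabs as indentation, and the choice to ignore blank lines when measuring are all assumptions of this sketch.

```java
/** Hypothetical helper: removes the indentation shared by all non-blank lines. */
final class IndentStripper {

    static String stripCommonIndent(String text) {
        // Keep trailing empty strings so that line structure is preserved.
        String[] lines = text.split("\n", -1);

        // Find the smallest indentation among lines that have content.
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not constrain the result
            }
            min = Math.min(min, leadingWhitespace(line));
        }
        if (min == Integer.MAX_VALUE) {
            return text; // nothing but blank lines
        }

        // Remove that much indentation from every line (less if a line is shorter).
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            out.append(line.substring(Math.min(min, leadingWhitespace(line))));
        }
        return out.toString();
    }

    /** Counts leading spaces and tabs. */
    private static int leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String embedded = "\n        if x:\n            y()\n";
        System.out.print(stripCommonIndent(embedded));
    }
}
```

Under this particular rule, a literal whose first line begins right after the opening tick has zero indentation on that line, so nothing would be stripped. A different rule could treat the first line specially, which is exactly the kind of variation a library can offer and a language feature cannot.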


#### But I want leading / trailing empty lines

Some people would like for the language to strip off leading and trailing blank lines. Like indentation stripping, this is going to be what people want sometimes, and sometimes not. And given that, again, we can't do both, we are again guided by "raw means raw", and provide library means to strip the extraneous newlines.
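A library version of this is short. The sketch below is one possible definition, with a hypothetical name (`stripBlankEdges`) and the assumption that a line consisting only of whitespace counts as blank.

```java
/** Hypothetical helper: drops blank lines at the start and end of a string. */
final class BlankLineTrimmer {

    static String stripBlankEdges(String text) {
        String[] lines = text.split("\n", -1);

        // Skip blank lines from the front and from the back.
        int first = 0;
        int last = lines.length - 1;
        while (first <= last && lines[first].trim().isEmpty()) {
            first++;
        }
        while (last >= first && lines[last].trim().isEmpty()) {
            last--;
        }

        // Rejoin what is left; interior blank lines are kept.
        StringBuilder out = new StringBuilder();
        for (int i = first; i <= last; i++) {
            if (i > first) {
                out.append('\n');
            }
            out.append(lines[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripBlankEdges("\n\n  select *\n  from t\n\n"));
    }
}
```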


#### But I want a marker character to make it obvious

Some people would like a margin marker character, so they can manage margins like this:


    foo(`This is a long string
        >the characters up to, and
        >including, the bracket are stripped
        >by the compiler
        >    and this line is indented`)

(Others would argue the marker character should be "|".) Again, we believe these sorts of transforms are the purview of libraries, not language, and that is where they will be provided.
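Because the marker is just a parameter, a library method sidesteps the ">" versus "|" debate entirely. A minimal sketch, with a hypothetical name (`trimMargin`) and the assumption that lines without the marker are left untouched:

```java
/** Hypothetical helper: strips each line up to and including a margin marker. */
final class MarginTrimmer {

    static String trimMargin(String text, char marker) {
        String[] lines = text.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            // Find the first non-whitespace character on the line.
            int j = 0;
            while (j < line.length() && Character.isWhitespace(line.charAt(j))) {
                j++;
            }
            if (j < line.length() && line.charAt(j) == marker) {
                // Drop the leading whitespace and the marker itself.
                out.append(line, j + 1, line.length());
            } else {
                // No marker: keep the line exactly as written.
                out.append(line);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "This is a long string\n"
                + "        >the characters up to, and\n"
                + "        >including, the bracket are stripped\n"
                + "        >    and this line is indented";
        System.out.println(trimMargin(text, '>'));
    }
}
```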


#### But people will make ASCII art

    ``````````````````
    `Yes, they might.`
    ``````````````````

#### But I want to use unicode escaping

There will be library support for explicitly processing Unicode escape sequences, or backslash escape sequences, or both.
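To give a feel for what such a method does, here is a deliberately simplified sketch. The name `unescapeUnicode` is hypothetical, and the sketch only recognizes a backslash, one letter u, and exactly four hex digits; it does not implement the finer points of the language's own Unicode escape rules (repeated u characters, or checking how many backslashes precede the escape).

```java
/** Hypothetical helper: replaces simple Unicode escapes with the characters they name. */
final class UnicodeEscapes {

    static String unescapeUnicode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (isEscapeAt(s, i)) {
                // Four hex digits follow the backslash and the letter u.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True if a backslash, the letter u, and four hex digits start at index i. */
    private static boolean isEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The argument holds the six characters of an escape followed by "BC".
        System.out.println(unescapeUnicode("\\u0041BC")); // prints ABC
    }
}
```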


#### But calling library methods like `longString`.trim() is ugly

You say ugly; I say simple and transparent.

#### But doing these things in libraries has to be slower and yield more bloated bytecode


No, it doesn't.

## Anomalies and puzzlers

While the proposed scheme is lexically very simple, it does have at least one surprising consequence, as well as at least one restriction:

 - The empty string cannot be represented by a raw string literal (because two consecutive ticks will be interpreted as a double-tick delimiter, not a starting and ending delimiter);
 - Strings containing line delimiters other than \n cannot be represented directly by a raw string literal.

The latter anomaly is true for any scheme that is free of embedding anomalies (escaping) and that normalizes newlines. If we chose to not normalize newlines, we'd arguably have a worse anomaly, which is that the carriage control of a raw string depends on the platform you compiled it on.
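"Not directly" still leaves the indirect route: since the body is known to contain only LF after normalization, a string that needs some other terminator can be produced by ordinary post-processing. A small sketch, with hypothetical names:

```java
/** Hypothetical helper: re-introduces a specific line terminator after normalization. */
final class LineTerminators {

    /** Replaces every LF in an already-normalized string with the given terminator. */
    static String withTerminator(String normalized, String terminator) {
        return normalized.replace("\n", terminator);
    }

    public static void main(String[] args) {
        // A protocol that requires CRLF, written here with LF only.
        String request = "GET / HTTP/1.1\nHost: example.com\n\n";
        String wire = withTerminator(request, "\r\n");
        System.out.println(wire.contains("\r\n"));
    }
}
```

The result is the same on every platform, because the starting point is the same on every platform.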

The empty-string anomaly is scary at first, but, in my opinion, is much less of a concern than the initial surprise makes it appear. Once you learn it, you won't forget it -- and IDEs and compilers will provide feedback that helps you learn it. It is also easily avoided: use traditional string literals unless you have a specific need for raw-ness. There already is a perfectly valid way to denote the empty string.


#### Can't these be fixed?

These anomalies can be moved around by tweaking the rules, but the result is going to be more complicated rules and the same number (or more) of anomalies, just in different places -- and sometimes in worse places. While there is room to subjectively differ on which anomalies are worse than others, we believe that the simplicity of this scheme, and its freedom from embedding anomalies, makes it the winner.

Because we start with such a simple rule (any number of consecutive ticks), pretty much any tweak is going to be complexity-increasing. It seems a poor tradeoff to make the feature more complex and less convenient for everyone, just to cater to empty strings.
