As Reinier pointed out on amber-dev, regex strings routinely contain
escaped meta-characters: +, *, brackets, etc. So the embedded \- and \+
story has an obvious conflict: a raw regex containing \+ or \- would be
read as a mode switch. While these are not the only possible characters
for such "shift" operators, his point that this might be overkill is
a good one. So let's look at options for denoting raw-ness.
- Just make triple-quote strings always raw as well as multi-line-capable;
  regexes and friends would use TQ strings even though they are single line
  (Scala, Kotlin)
- Letter prefix, such as R"…" (C++, Rust)
- Symbol prefix, such as @"…" (C#), or \"…" (suggestive of "distributing" the
  escaping across the string)
- Embedded escape sequence that switches to raw mode, but can't be switched
  back: "\+raw string", "\{raw}raw string"
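To make the regex conflict mentioned at the top concrete, here is a minimal sketch in today's Java. The class and method names are made up for illustration, and containsToggle is a deliberately naive check, not a proposal for how a scanner would work. It shows an ordinary regex whose raw text already contains \+, which is exactly the sequence the embedded-toggle idea would claim.

```java
import java.util.regex.Pattern;

final class RegexToggleConflict {
    // Regex text as a raw literal would hold it verbatim: \d+\+\d+
    // (digits, a literal plus sign, digits). Written here with doubled
    // backslashes because today's string literals are always cooked.
    static final String RAW_BODY = "\\d+\\+\\d+";

    // True if the input looks like "12+34".
    static boolean matchesSum(String s) {
        return Pattern.matches(RAW_BODY, s);
    }

    // Naive check (illustration only): does the raw text contain a
    // backslash followed by '+' or '-', i.e. something a \- / \+ toggle
    // scheme would have to interpret as a mode switch?
    static boolean containsToggle(String body) {
        return body.contains("\\+") || body.contains("\\-");
    }

    public static void main(String[] args) {
        System.out.println(matchesSum("12+34"));
        System.out.println(containsToggle(RAW_BODY));
    }
}
```

Any scheme that reserves \+ or \- inside the literal has to say what happens to regexes like this one; a prefix on the literal avoids the question.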
Data from Google suggests that, in their code base, on the order of 5% of
candidates for multi-line strings use some escape sequences (Kevin/Liam, can
you verify?). This suggests to me that the "just use TQ" approach is vaguely
workable, but likely to be error-prone: 5% is infrequent enough that people
will write \t when they mean tab, discover this at runtime, and then have to
go back and add a .escape() call.
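A small sketch of that failure mode, for anyone who wants to try it. It assumes JDK 15 or later, where String.translateEscapes() exists and is used here as a stand-in for the hypothetical .escape() call; the class and constant names are illustrative only. Since Java has no raw literal, the raw content is spelled with a doubled backslash.

```java
final class RawTabPitfall {
    // What an always-raw literal written as col1\tcol2 would hold:
    // a backslash and the letter 't', not a tab character.
    static final String RAW_BODY = "col1\\tcol2";

    // The after-the-fact fix: interpret escape sequences at runtime.
    // Assumption: JDK 15+ (String.translateEscapes).
    static String cooked() {
        return RAW_BODY.translateEscapes();
    }

    public static void main(String[] args) {
        System.out.println(RAW_BODY.length());   // 10 characters, no tab
        System.out.println(cooked().length());   // 9 characters, one tab
    }
}
```

The one-character length difference is the whole bug: nothing fails at compile time, and the output only looks wrong once someone reads it.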
(Of these, my current favorite is using the backslash: "cooked", """cooked and
ML-capable""", \"raw", \"""raw and ML-capable""". The use of \ suggests "the
backslashes have been pre-added for you", building on existing associations
with backslash.)
Are there other credible candidates that I’ve missed?
> On Jan 2, 2019, at 2:00 PM, Jim Laskey <[email protected]> wrote:
>
>
>>
>> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2/index.html
>> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2.pdf
>> First of all, I would like to apologize for leading us down the garden path
>> re Java Raw String Literals. I jumped into this feature fully enamoured with
>> the JavaScript equivalent and, "why can't we have this in Java?" As the
>> proposal evolved, it became clear that what we came up with was not a good
>> Java solution. I underestimated the concern that the original proposal was
>> too left field and did not fit into Java very well. It's somewhat ironic
>> that the backtick looks like a thorn.
>>
>> So, let's start the new year with a structured approach to the enhanced
>> string literal design. Brian gave a summary of why the old design fails.
>> Starting with this summary, Brian and I talked out a series of critical
>> decision points that should be given thought, if not answers, before we
>> propose a new design. As an exercise, I supplemented these points and
>> created a series of small decision trees (a full on decision tree would be
>> complex and not very helpful.) I found these trees good intuition pumps for
>> getting the design at least 80% there. Hopefully, this exercise will help
>> you in the same way.
>>
>>
>>
>>
>> Even the label Raw String Literal put the emphasis on the wrong part of the
>> feature. What developers really want is multi-line strings. They want to be
>> able to paste alien source into their Java programs with as little fuss as
>> possible.
>>
>> String raw-ness (not translating escapes) is a tangential aspect, that may
>> or may not be needed to implement multi-line strings. Yes, the regex and
>> Windows file path arguments in JEP 326 are still valid, but this aspect
>> needs to be separated from the main part of the design. Further in the
>> discussion, we'll see that raw-ness is really a many-headed hydra, best
>> slain one head at a time.
>>
>>
>>
>>
>> We have to be honest. We know Java's primary market. Sure we want to embed
>> Java in Java for writing tests. Sure there is JavaScript and CSS in web
>> pages. Nevertheless, most uses of multi-line will be for non-complex
>> grammars. Specifically, grammars that don't require special handling of
>> multi-character delimiter sequences. If you can accept this, then the
>> solution set is much smaller.
>>
>>
>>
>>
>> This is an easy one. Familiarity is key to feature education. Radical
>> wandering off with new syntax is not helpful to anyone but bloggers and
>> authors.
>>
>>
>>
>>
>> If you buy into the familiarity argument, then double quote is really the
>> only choice for a delimiter. Double quote already indicates a string literal.
>> Single quote indicates a character. We don’t want to gratuitously burn
>> unused symbols like backtick. Backslash works for regex but maybe not for
>> others. Combinations and nonces just introduce new noise when our original
>> goal was to reduce noise and complexity.
>>
>>
>>
>>
>> Other languages avoid delimiter escape sequences by doubling up. For example,
>> "abc""def" -> abc"def. This concept is unfamiliar to Java developers, so why
>> change now? Escape sequences are what we know.
>>
>>
>>
>>
>> Language designers got very nervous when I suggested infinite delimiter
>> sequences in the original proposal; lexically sacrilegious. I felt strongly
>> that it was easy to explain and only 1 in 1M developers would ever use more
>> than 4-5 character delimiter sequences. In round two, I have come to agree.
>> This was taking on more complexity than is really warranted, for a use case
>> that doesn’t come along very often. I suggest we only need single and triple
>> double quotes. A single double quote works today, so no argument there.
>> Double double quotes means empty string, no problem. Triple double quotes
>> are only necessary to avoid having to escape quotes in alien source.
>>
>> String json = """
>> {
>> "name": "Jean Smith",
>> "age": 32,
>> "location": "San Jose"
>> }
>> """;
>>
>> versus
>>
>> String json = "
>> {
>> \"name\": \"Jean Smith\",
>> \"age\": 32,
>> \"location\": \"San Jose\"
>> }
>> ";
>>
>> This second case is where we wandered off the tracks with raw-ness. We
>> assumed raw-ness is necessary to avoid all the backslashes. Most cases can
>> be handled with triple double quotes.
>>
>> Okay, so why not more combinations? Simply because, most of the time they
>> are not needed. On the rare occasion we do have nested triple double quotes,
>> we can then use escape sequences.
>>
>> String nestedJSON = """
>> \"\"\"
>> {
>> "name": "Jean Smith",
>> "age": 32,
>> "location": "San Jose"
>> }
>> \"\"\";
>> """;
>>
>> or better yet, you only have to escape every third double quote
>>
>> String nestedJSON = """
>> \"""
>> {
>> "name": "Jean Smith",
>> "age": 32,
>> "location": "San Jose"
>> }
>> \""";
>> """;
>>
>> Not so evil and it's familiar.
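For readers who want to run the comparison, here is a sketch that assumes JDK 15 or later, where text blocks use the same triple-quote delimiter. Two caveats: in Java as shipped, the opening delimiter must be followed by a line break and common indentation is stripped automatically, which differs from the syntax sketched in this mail; and the class and constant names are illustrative only.

```java
final class TripleQuoteSketch {
    // Single-quote-delimited form: every embedded quote is escaped.
    static final String ESCAPED =
            "{\n"
            + "  \"name\": \"Jean Smith\",\n"
            + "  \"age\": 32,\n"
            + "  \"location\": \"San Jose\"\n"
            + "}\n";

    // Triple-quote form: embedded quotes need no escaping.
    static final String TRIPLE_QUOTED = """
            {
              "name": "Jean Smith",
              "age": 32,
              "location": "San Jose"
            }
            """;

    // Nested delimiter: escaping the first quote of each run of three
    // is enough to keep the literal from terminating early.
    static final String NESTED = """
            \"""
            payload
            \"""
            """;

    public static void main(String[] args) {
        System.out.println(ESCAPED.equals(TRIPLE_QUOTED));
        System.out.print(NESTED);
    }
}
```

Both JSON constants hold the same characters; only the amount of escaping in the source differs.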
>>
>>
>>
>>
>> Meaning, you can only use single quotes for simple strings and triple quotes
>> for multi-line strings. I don't have a strong opinion other than it seems
>> like an unneeded restriction. The only argument I've heard has been for
>> better error recovery when missing a close delimiter during parsing. My
>> counter for that argument is that if you are processing multi-line strings
>> then you can easily track the first newline after the opening delimiter and
>> recover from there. I implemented that recovery in javac and it worked out well.
>>
>>
>>
>>
>>
>> Cooked (translated escape sequences) should be the default. Why should a
>> multi-line string be different than a simple string? We have a solution for
>> embedding double quote. Single quotes don't require escaping. Tabs and
>> newlines can exist as is. Unicode characters can be either an escape
>> sequence or the unicode character. So the only problem case is backslash. I
>> would argue that the rare backslash can be escaped. If not, then the
>> developer can use the raw-ness solution.
>>
>>
>>
>>
>> If we don't translate newlines, then source is not transferable across
>> platforms. That is, a source from one platform may not execute the same way
>> on another platform. Translating consistently guarantees execution
>> consistency. As a note, programming languages that didn't translate newlines
>> in multi-line string literals typically regretted it later (Python.)
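A minimal sketch of the portability problem and the fix. normalizeNewlines is a hypothetical helper written for this illustration (it is not a JDK method); it does at runtime what the compiler would do at scan time if line terminators inside a multi-line literal were always translated to \n.

```java
final class NewlineNormalizationSketch {
    // Hypothetical helper: map CRLF and lone CR to LF.
    static String normalizeNewlines(String s) {
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        // The same two-line literal as it would be captured untranslated
        // from a file saved with Windows vs. Unix line endings.
        String savedOnWindows = "line one\r\nline two\r\n";
        String savedOnUnix = "line one\nline two\n";

        // Without translation the program's behavior depends on the editor.
        System.out.println(savedOnWindows.equals(savedOnUnix)); // false

        // With consistent translation both sources yield the same string.
        System.out.println(
                normalizeNewlines(savedOnWindows).equals(normalizeNewlines(savedOnUnix))); // true
    }
}
```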
>>
>>
>>
>>
>> With the original Raw String Literal proposal, there was concern about
>> leading and trailing nested delimiters. If we default to cooked strings,
>> then we can use \".
>>
>>
>>
>>
>> These questions have been answered numerous times and fall into the realm of
>> library support. Same arguments as before, same outcome.
>>
>>
>> To summarize the bold paths at this point:
>> - multi-line strings are an extension of traditional simple strings
>> - newlines in a string are no longer an error and the string can extend
>> across several lines
>> - error recovery can pick up at the first newline after the opening
>> delimiter
>> - multi-line strings process escape sequences (including unicode) in
>> the same way as simple strings
>> - multiple double quotes are handled with escape sequences
>> - triple double quote delimiter is introduced to avoid escaping simple
>> double quote sequences
>>
>> Generally, I think this is very much in the traditional Java spirit.
>>
>>
>> Now, let's move on to the lesser but more interesting issue. As I stated
>> above, raw-ness is a multi-headed beast. Raw-ness involves turning off the
>> translation of
>> - escape sequences
>> - unicode escapes
>> - delimiter sequences
>> - escape sequence prefix (backslash)
>> - tabs and newlines (control characters in general)
>>
>> Sometimes we need all of the translations, sometimes few and sometimes none.
>> In the multi-line discussion above, we see we don't need raw as much as we
>> might have expected. Maybe for occasional backslashes, as in regex and
>> Windows path strings.
>>
>>
>>
>>
>>
>> The original Raw String Literal proposal suggested that raw-ness was a
>> property of the whole string literal and thus we proposed an alternate
>> delimiter syntax just to emphasize that fact. If we accept the bold path of
>> multi-line discussion above, then alternate delimiter is out. This leaves
>> prefixing as the best option to bless a string literal with raw-ness.
>>
>> At this point, I would like to suggest an alternate, maybe progressive way
>> to think of raw-ness. Since the original proposal, I have been thinking of
>> raw-ness as a state of processing the literal. State is certainly obvious in
>> the scanner implementation, why not raise that to the language level? If it
>> is a state then we should be able to enter and leave that state in some way.
>> Escape sequences are an obvious way of transitioning translation in the
>> string. \- and \+ are available and not currently recognized as valid escape
>> sequences, why not \- and \+ to toggle escape processing?
>>
>> String a = "cooked \-raw\+ cooked"; // cooked raw cooked - a little odd
>>                                     // but not so much so
>> String b = "abc\-\\\\\+def"; // abc\\\\def - struggling
>> String c = "\-abc\\\\def"; // abc\\\\def - more readable as an
>>                            // inner prefix
>> String d = "abc\-\-def\+\+ghi"; // abc\-def\+ghi - raw on "\-" is
>>                                 // "\" and "-", raw off "\+" is "\" and "+"
>> String e = """\-"abc"\+"""; // "abc" - \- and \+ act as no-ops of
>>                             // sorts
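One way to pin down the toggle semantics is a tiny interpreter over the literal's body. This is a sketch under assumptions read off examples a to e above, not the javac scanner: \- enters raw mode, \+ leaves it, a \- seen while raw and a \+ seen while cooked are kept literally, and only a handful of ordinary escapes are supported (no octal or unicode escapes). The class and method names are invented for this illustration.

```java
final class RawStateSketch {
    // Translates the body of a literal (the text between the delimiters),
    // treating raw-ness as a state toggled by \- and \+.
    static String translate(String body) {
        StringBuilder out = new StringBuilder();
        boolean raw = false;
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            boolean hasNext = i + 1 < body.length();
            if (c != '\\') {
                out.append(c);               // ordinary character in either state
                i++;
            } else if (raw) {
                if (hasNext && body.charAt(i + 1) == '+') {
                    raw = false;             // \+ leaves raw mode, emits nothing
                    i += 2;
                } else {
                    out.append(c);           // any other backslash is literal
                    i++;
                }
            } else {
                if (!hasNext) {
                    throw new IllegalArgumentException("dangling backslash");
                }
                char next = body.charAt(i + 1);
                switch (next) {
                    case '-': raw = true; break;           // enter raw mode
                    case '+': out.append("\\+"); break;    // already cooked: keep as-is
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    case '\\': out.append('\\'); break;
                    case '"': out.append('"'); break;
                    case '\'': out.append('\''); break;
                    default:
                        throw new IllegalArgumentException("unsupported escape: \\" + next);
                }
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example a: the body is written with doubled backslashes because
        // this file is compiled by today's Java.
        System.out.println(translate("cooked \\-raw\\+ cooked")); // cooked raw cooked
    }
}
```

The asymmetry in example d falls out of the state: each toggle is only meaningful in one state, so the "wrong" one is passed through unchanged.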
>>
>> Comparing property vs state:
>>
>> Runtime.getRuntime().exec(R""" "C:\Program Files\foo" bar""".strip());
>> Runtime.getRuntime().exec("""\-"C:\Program Files\foo" bar""");
>>
>> System.out.println("this".matches(R"\w\w\w\w"));
>> System.out.println("this".matches("\-\w\w\w\w"));
>>
>> String html = R"""
>> <html>
>> <body>
>> <p>Hello World.</p>
>> </body>
>> </html>
>> """.align();
>> String html = """\-
>> <html>
>> <body>
>> <p>Hello World.</p>
>> </body>
>> </html>
>> """.align();
>>
>>
>> String nested = """
>> String EXAMPLE_TEST = "This is my small example "
>> + "string which I'm going to "
>> + "use for pattern matching.";
>> """ +
>> R"""
>> System.out.println(EXAMPLE_TEST.replaceAll("\\s+",
>> "\t"));
>> """;
>> String nested = """
>> String EXAMPLE_TEST = "This is my small example "
>> + "string which I'm going to "
>> + "use for pattern matching.";
>> \-
>> System.out.println(EXAMPLE_TEST.replaceAll("\\s+",
>> "\t"));
>> \+
>> """;
>>
>> Hopefully, this is a good starting point for discussion. As before, I'm
>> pragmatic about which direction we go, so feel free to comment.
>>
>> Cheers,
>>
>> -- Jim
>>