Re: String reboot (plain text)

Remi Forax Thu, 21 Mar 2019 06:48:44 -0700

What I really like in the syntax proposed by Jim is that the single quote " is
retconned to allow several lines;
it seems the easiest thing to do if we just want to introduce a multi-line
string literal.


From that, I agree that the more lines you have, the more you need a way to
define raw strings, because it's far easier to read.
I'm fine with """ meaning raw string; like ", """ will allow both single-line
strings and multi-line strings.

I disagree with Brian that we should try to have an intelligent algorithm to
remove the blank spaces, because I see several possible intelligent algorithms,
so I think it's better to keep things simple (another interesting question is
whether this intelligent algorithm should be applied only to the escapable
strings or to the raw strings too).
Obviously, it means users will be frustrated not to have an intelligent
algorithm to remove the blank spaces, but I think the code will be more readable
using a lazy static final,
for example:

  private lazy static final String aText = """
    + This is the first sentence of the comment
    + this is the second sentence
    """.alignUsingCharacter("+");

(note that we don't need the compiler to fold the method call here; it can be
done by the condy BSM (implementing the lazy static final), and when we
revisit the BSM protocol, we may add a way to read a constant without storing
it in the runtime constant pool representation, so the intermediary string
doesn't have to be stored at runtime).
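
A minimal sketch of what the method call in the example above might do, assuming alignUsingCharacter treats the given character as a left-margin marker: it keeps only the lines whose first non-blank character is the marker and strips everything up to and including it (plus one following space). The method name comes from the example; the behaviour, the class name AlignSketch, and the dropping of unmarked lines are assumptions, not an existing String API.

```java
import java.util.ArrayList;
import java.util.List;

final class AlignSketch {
    // Hypothetical stand-in for the alignUsingCharacter call in the example above.
    static String alignUsingCharacter(String text, String marker) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            int at = line.indexOf(marker);
            // Treat the marker as a margin only if nothing but whitespace precedes it.
            if (at >= 0 && line.substring(0, at).trim().isEmpty()) {
                String rest = line.substring(at + marker.length());
                // Drop one optional space after the marker.
                kept.add(rest.startsWith(" ") ? rest.substring(1) : rest);
            }
            // Lines without a margin marker (the blank first and last lines) are dropped.
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // The body of the """ literal above, written as an ordinary escaped string.
        String raw = "\n    + This is the first sentence of the comment"
                   + "\n    + this is the second sentence\n    ";
        System.out.println(alignUsingCharacter(raw, "+"));
    }
}
```

With this reading, the indentation in front of each "+" is free to follow the surrounding code, because only the text after the marker survives.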

Rémi

----- Original Message -----
> From: "John Rose" <john.r.r...@oracle.com>
> To: "Brian Goetz" <brian.go...@oracle.com>, "Jim Laskey" 
> <james.las...@oracle.com>
> Cc: "amber-spec-experts" <amber-spec-experts@openjdk.java.net>
> Sent: Saturday, 16 March 2019 01:54:30
> Subject: Re: String reboot (plain text)

> OK, I responded to one corner by pointing out a principle that tends to
> align rawness more strongly with multi-line-ness.  I guess I should lay
> all my cards on the table FTR, and will do so by responding to Brian's
> restacking Email and Jim's reboot Email.  (I guess today's String-day.)
> 
> TL;DR: I agree substantially with Jim's analysis and Brian's staging,
> especially the earlier and simpler parts.
> 
> Our order #1 should keep classic escapes, instead of eliminating them (raw)
> or strengthening them (strong escapes, like strong delimiters).  Later orders
> should have a place for such things (raw and/or strong escapes/quotes).
> 
> (Side note:  The term "escape" always makes me think of a two character
> sequence, the first of which is probably reverse solidus, like "\x".
> I'd like to use a neutral term like "interruptor" coupled with "quote" to
> refer to the more general feature of "a visible notation which interrupts a
> string rather than terminates it like a quote does".  And now I realize that
> Jim's term "delimiter" does the same thing for "quote".  So I'll try to tilt
> toward "delimiter" and "interruptor" instead of "quote" and "escape".)
> 
> Classic escapes and single quotes are both too tiny to see well inside
> multi-line strings, but they are also familiar and people will get used to
> "squinting" for them, at least the escapes.  Our take is that we'd all rather
> "squint" (in the first order) instead of add complexity to the first feature.
> 
> I'm fine with a two- or three-order stacking, as long as there is a credible
> story for the final course of the meal, if we are still hungry, which includes
> strong delimiters and (some sort of) strong escapes that are (a) not easy to
> collide with and (b) not hard to "squint" for.   IMO strong delimiters will
> often
> be associated somehow with strong interruptors.  In fact (see digression
> below in context) I think rawness is maybe not exactly the right concept;
> the concept of "escape strength" may be more fruitful for us.
> 
>> On Mar 13, 2019, at 10:52 AM, Brian Goetz <brian.go...@oracle.com> wrote:
>> 
>> Lots of good discussion so far.  Let me gather the threads.
>> 
>>  - The primary use case is embedding multi-line chunks of foreign code or 
>> data in
>>  Java, with minimal need to cruft it up with escaping.  This says to me that
>>  _multi-line strings_ are actually the high-order bit here, and raw strings 
>> are
>>  the next bit.  Let’s address these in order.
> 
> +1
> 
>>  - Multi-line-ness and raw-ness are orthogonal concepts.  Some languages 
>> merge
>>  them, and we might consider doing that too, but we shouldn’t start there.
> 
> +0.6
> 
> (As I implied previously, a number less than one is more representative of
> orthogonality, sine-of-the-angle-between, of the two features.)
> 
> But also, I'm fine with not starting with raw-ness, as long as it's on the
> menu somewhere.
> 
>>  - For multi-line strings, a stronger delimiter (e.g., """) seems to be 
>> preferred
>>  on readability grounds, because people don't want to have to squint to see
>>  where the embedded code ends and the Java code resumes.
> 
> Yes.  The same point applies to escapes ("string interruptors", not "string
> delimiters"), but since escapes are clearly less common than string
> boundaries, I'm content to just note the point, and accept a design which
> requires users to squint for escapes, on the grounds that they will be both
> rare and usually safe to disregard on first reading.
> 
>> To which I'll add the following observations:
>> 
>>  - Most multi-line string candidates (JSON, XML, SQL, etc) do not require
>>  characters that have to be escaped, as long as we don't have conflicts with
>>  the quote character.  Which suggests further that ML-ness and raw-ness are
>>  solving separate problems.
> 
> Jim notes this in passing in the "75%" section, but I'll call it out here too:
> 
> "Characters that have to be escaped" also include Java's escape.  A JSON
> string will have a puzzling problem if it contains a JSON escape sequence that
> is processed by Java, rather than by the JSON parser.  I don't see how to 
> avoid
> this easily in the first course on the menu, but I want to note the design
> heuristic that design vectors for delimiters are correlated with interruptors.
> 
> (The problem with JSON escapes is like the problem with regexp escapes.
> In both cases we have both Java and the foreign notation competing for
> ownership of the reverse solidus.  I think a proper notion of strong
> interruptors
> will allow Java to gracefully give the foreign notation precedence, within
> certain of Java's envelopes, just as strong delimiters do so with quotes.)
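
The competition for the reverse solidus described above can be seen with today's escaped string literals. The sketch below uses only existing Java behaviour; the class and constant names are illustrative, not taken from the thread.

```java
import java.util.regex.Pattern;

final class EscapeOwnership {
    // Regex: the pattern \d+\.\d+ needs every backslash doubled in a Java literal,
    // because Java claims the first backslash of each pair.
    static final String DECIMAL = "\\d+\\.\\d+";

    // JSON: Java processes \n first, so the JSON text receives a real line break...
    static final String COOKED = "{\"msg\": \"a\nb\"}";

    // ...while doubling the backslash passes the two-character JSON escape through.
    static final String PASSED_THROUGH = "{\"msg\": \"a\\nb\"}";

    public static void main(String[] args) {
        System.out.println(DECIMAL);
        System.out.println(Pattern.matches(DECIMAL, "3.14"));
        System.out.println(COOKED.contains("\n"));
        System.out.println(PASSED_THROUGH.contains("\n"));
    }
}
```

In COOKED the line break has already been consumed by Java before any JSON parser sees the text, which is the "puzzling problem" mentioned above; PASSED_THROUGH avoids it only at the price of a doubled backslash.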
> 
> If you have to escape foreign delimiters, chances are you'll have to escape
> foreign interruptors.  Another use of the heuristic:  If you found yourself
> tripling the quotes to avoid collisions, there's probably a related use case
> for strengthening (tripling???) the escapes, to avoid the same (but rarer)
> sort of collisions.
> 
> (I'm thinking Python also and JavaScript also, for script fragments, but we
> choose
> to place scripting lower on the menu, along with quoted-Java-in-Java nesting.)
> 
>>  - Once we separate multi-line from raw, the idea of automatically reflowing
>>  indentation starts to become a sensible option on non-raw, multi-line 
>> strings.
> 
> +100 Yes, this is the nugget of gold that we mine out of the decision to defer
> rawness.
> 
>>  - Repeating delimiters are slightly more powerful than fixed delimiters, but
>>  also have additional cognitive load, and can still lead to anomalies that 
>> are
>>  easily encountered.
> 
> That said, they pay for themselves as visual cues for multi-line thingies,
> and we immediately put them back into the shopping cart, with length set at
> three.  This helps us properly size the "cognitive load" argument.  Once you
> learn about jumbo delimiters, you learn to spot them, and you are paid for
> the effort because you only learned once, but you can spot them quicker every
> time you look.
> 
> The same point readily applies to replacing "a count of three" with "a count 
> of
> three or more", although with sharply diminished returns, since three is 
> almost
> always enough.  (What about quote counting?  Well, programmers shouldn't be
> writing puzzlers in their code.  So use extra, enough to make it obvious, and
> don't
> trick your reader with one-off counts unless you are writing a puzzler book.
> Or find another solution instead of quote counting to make the quotes look
> (a) like the quotes they are, and (b) different enough from competing would-be
> quotes.)
> 
> None of these ideas apply to the first course, IMO.  I'm realizing how apt it 
> is
> for Jim to call it an appetizer; it is very thin but tasty, as an appetizer
> should be.
> And Brian will say, "wait until you see how filling it is!"  We certainly want
> to avoid
> unhealthy gorging…
> 
>> With that said, let's reorder the dishes a bit.
>> 
>> For our first course, we could have multi-line strings, delimited by the 
>> fixed
>> delimiter """.  These would be escaped strings, just like existing string
>> literals, but because the single-quote is no longer the delimiter, the most
>> common source of escaping (embedded quotes) is removed.  Most multi-line
>> strings will require no escaping at all.
> 
> +1 (for most definitions of "most")
> 
>> Note that if we stopped here _and never ordered anything else_, we would 
>> still
>> be in a much better place than we are now (most snippets could just be cut 
>> and
>> pasted without mangling), and what we've introduced is dead-simple!  So the
>> cost-benefit ratio here is high; it’s a simple addition that addresses a
>> significant fraction of the pain points.  I think we should at least order
>> this.
> 
> +100
> 
>> Now, maybe we're still a little hungry, and the above doesn't help with those
>> strings that are most polluted by escapes, such as regular expressions.  So, 
>> we
>> might additionally order the ability to layer a way to say "no escape 
>> mangling"
>> atop both our " strings and our """ strings.  Jim proposes we use a delimiter
>> of \".."\ for such strings (\""" ... """\ for the multi-line version).  This
>> has a nice connotation; it is as if the backslash is “distributed over” the
>> whole string.
> 
> +1; it wins the beauty contest.
> 
> It needs simplicity as well as beauty.  By simplicity I mean
> it resists unintentional creation of puzzlers, and we think intentional
> puzzlers have a limited effect.  The jury is out IMO; puzzle on.
> 
> Also, the second course (tweaking escapes) needs IMO to be plausibly
> followable (if not followed in fact) by a third course, which allows fullest
> control of syntax (nonces, repeats, whatever).  I think Jim's syntax passes
> that test, since there are ways to increase the number of escapes, or
> lengthen the token in other ways to achieve strong delimiters.  It seems
> to me there may be a good course #3 design which pins the quotes
> at three and allows larger and larger numbers of escapes.
> 
> (Hmm, idea of the moment: We could allow any *whole* delimiter
> sequence to be *tripled* in order to strengthen it.  Not just little old
> double-quote " gets the tripling treatment.  But now I'm puzzling way
> outside the box.)
> 
>> This does, unfortunately, bring us back into Delimiter Hell; what if we want 
>> our
>> string to contain the quote + backslash combination?  One way is to dive back
>> into repeating delimiters (e.g., using multiple backslashes in the 
>> delimiter).
>> Having a non-homogeneous repeating delimiter leaves us in a slightly better
>> place than the original proposal, as we’ve eliminated the “empty string”
>> anomaly as well as the “starting with backtick” anomaly.  So this seems a
>> workable direction, though the cost-benefit here is less than with the first
>> course — in both directions (higher cost, lower benefit.)
>> 
>> 
>> So, in the spirit of “keep ordering until sated, but stop there”, here are 
>> some
>> reasonable choices.
>> 
>> 1.  Do multi-line (escaped) strings with a """ fixed delimiter.  Large
>> benefit, small cost.  Most embedded snippets don’t need any escaping.  Low
>> cost, big payoff.
>> 
>> 1a.  Do 1, but automatically reflow multi-line strings using the equivalent 
>> of
>> String::align.  There have been reasonable proposals on how to do this; where
>> they fell apart is the interaction with raw-ness, but if we separate ML and
>> raw, these become reasonable again.  Higher cost, but higher payoff; having
>> separated the interaction with raw strings, this is more defensible.
> 
> I like this; it will make ML-string code more readable, and coders can use
> indentation to guide the eye.  This almost (not quite) removes the need for
> tripling the quote.  (Not quite because it would mandate indentation, and
> because of JSON quotes.  Heuristic comment:  Remember JSON escapes also.)
> 
> 1a'. As part of 1a., add one or two new escape sequences to control
> string body layout, in straightforward ways, as part of the reflow story.
> Discussion on request; one way is to allow a "white space gobbler" escape
> which eats the backslash and all whitespace plus a final newline if any.
> I'm mentioning that now here because it has several uses.
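
One possible reading of the "white space gobbler" escape, sketched as a post-processing step over the string body. The trigger (a backslash immediately followed by a space, tab, or newline), the class name GobblerSketch, and the choice to copy every other backslash pair through untouched are assumptions; the proposal above does not fix these details.

```java
final class GobblerSketch {
    // Hypothetical gobbler: removes a backslash, the run of spaces/tabs after it,
    // and one final newline if present.
    static String gobble(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == ' ' || next == '\t' || next == '\n') {
                    i++; // eat the backslash
                    while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t')) {
                        i++; // eat the horizontal whitespace
                    }
                    if (i < s.length() && s.charAt(i) == '\n') {
                        i++; // eat one final newline, if any
                    }
                    continue;
                }
                // Any other backslash pair (including \\) is left for normal escape handling.
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below denotes: SELECT a, \<space><space><newline>b FROM t
        System.out.println(gobble("SELECT a, \\  \nb FROM t"));
    }
}
```

Under these assumptions the gobbler lets a long logical line be broken in the source without the line break ending up in the resulting string.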
> 
>> 2.  Do (1) or (1a), and add: single-line raw string literals delimited by
>> \"…"\.
> 
> This course (#2) raises the issue of controlling delimiters and interruptors
> separately
> instead of together.  I think it's fine to control them separately, in 
> different
> courses.
> If quote and escapes (delimiters and interruptors) were equally common in
> today's
> workloads I think we'd choose to control them together, but they are not, so
> it's
> more important to tweak the delimiters than tweak the interruptors.
> 
> This proposal can be understood in either of two ways:  The contents of the
> string
> are absolutely raw except for the occurrences of end-delimiters, or they are
> "more
> strongly raw", in that some stronger interruptor is sufficient to bring in
> today's
> rules for escapes, just as some stronger delimiter is sufficient to delimit 
> the
> end of the string.
> 
> I think Jim anticipated the idea of stronger interruptors when he said:
> 
>> Even with escaping off, we still might have to escape delimiters.
>> Repeated backslashes (or repeated delimiters) is the typical out.
> 
> The idea of stronger escapes conflicts with absolute "escaping
> off", which Jim also calls for, so I think order #2 needs a little
> more simmering.  Which is fine; let's eat order #1 first.
> 
> My overall take is, if a strong-enough (repeated?) escape can escape a
> strong delimiter, let's also allow such a strong-enough escape to do
> other chores as well; that leads me to a proper concept of "strong
> interruptor".  This means that if you have a raw string that has a very
> rare need for an escape sequence, then you just strengthen the escape,
> rather than cook the whole string or concatenate it.  Use the right rawness
> for the job, certainly, and maybe there's a way to do this on the whole-string
> level.  In any case I think we can improve here on the previous proposals for
> "regional rawness".  More details later; that's enough for now.
> 
> <digression>
> Rawness is proportional to escape strength.
> 
> No single string syntax is truly 100.000% raw, because the raw string
> cannot include a copy of its delimiter.  Adjust that viewpoint to embrace
> interruptors as well and you get:  A very raw string is one which is 
> difficult,
> but not impossible, to end with a delimiter token, or to interrupt with an
> interruptor token.  What does "difficult" mean?  Simple, it means using
> more characters, until the subject string gives up and says, "don't have
> one of those, go fish".
> 
> So the quest for ever stronger delimiters has a flip side:  It is also a quest
> for
> ever rawer string notations.  There is no such thing as an absolutely raw
> string,
> just one that is "raw enough".  In those terms, I'd like to reserve, for an
> optional final course, a scheme for making strings as raw as you please,
> so that a quoted-and-escaped-five-times-raw string can be quoted inside
> of quoted-and-escaped-six-times-raw string.  A corner case for purists?
> Yes.  A real need for real users?  We'll see; let's keep something brewing in
> the kitchen, just in case.
> 
> </digression>
> 
>> 2a.  Do (1) or (1a), and also support multi-line raw string literals (where 
>> we
>> _don’t_ automatically apply String::align; this can be done manually).  Note
>> that this creates anomalies for multi-line raw string literals starting with
>> quotes (this can be handled with concatenation, and having separated ML and
>> raw, this is less of a problem than before).
> 
> +1
> 
> If we allow stronger interruptors in rawer strings, we can easily disrupt
> would-be
> delimiters by escaping them, so we wouldn't need concatenation.  The stronger
> escapes could be part of 2 (controversially complex) or 3 (slightly 
> inconsistent
> with absolute rawness of simple 2 syntax).
> 
>> 3.  Do (2) and (2a), and also support a repeating compound delimiter with
>> multiple backslashes and a quote.
>> 
>> Note that we can start with 1 or 1a now, and move on to 2/2a later, and same 
>> for
>> 3.
> 
> Order #3 is where we would have a full and decisive conversation about not
> only strong delimiters but also strong interruptors.  I bring it up with order
> #2
> above because #2 is where interruptor control first appears as a possibility.
> 
>> As we evaluate these options, note that:
>> 
>>  - Having separated ML-ness from raw-ness, doing automatic reflow becomes 
>> more
>>  defensible for the common (ML, non-raw) case.
> 
> This is a very important point.  It wasn't apparent when we started, and 
> that's
> why we go slowly on these things.
> 
>>  - The intersection of ML and raw seems pretty small, so doing 1a + 2, while
>>  asymmetric, is defensible.
> 
> Our experience will bear out how truly small this intersection is; you and I
> perhaps
> differ on that call.  But after doing 1a (1a' please!) we will certainly know
> more.
> 
>>  - What we don’t order now, we can add later.
> 
> Yes, if we are careful not to get ourselves thrown out of the restaurant
> by making poor choices during the early courses.  That's why I'm being
> all picky and theoretical here.
> 
> Now for some brief responses to Jim's points, if they are not already
> noted above:
> 
> On Feb 10, 2019, at 7:43 AM, Jim Laskey <james.las...@oracle.com> wrote:
>> 
>>> ...50% solution
>>> 
>>> Where we keep running into trouble is that a choice for one part of the
>>> lexicon spreads into the other parts. That is, use of certain characters in
>>> the delimiter affects which characters require escaping and which
>>> characters can be used for escaping.
> 
> (Good insight; leads to independent control for delimiter.)
> 
>>> ...
>>> 
>>> 75% solution, almost
>>> 
>>> …
>>>     • Even with escaping off, we still might have to escape delimiters. 
>>> Repeated
>>>     backslashes (or repeated delimiters) is the typical out.
> 
> (Yes, this got me going, maybe more than you intended, see above.)
> 
>>> 
>>>           String html = \"<html>
>>>                             <body style="width: 100vw">
>>>                                   <p>Hello World.</p>
>>>                             </body>
>>>                             <script>console.log("\nloaded")</script>
>>>                           </html>"\;
> 
> (I'm starting to call these Jim-quotes.  They are growing on me.)
> 
>>> … Captain we need more sequences.
> 
>> 
>>> And, this is the crux of all the debate around strings. Fixed delimiters 
>>> imply a
>>> requirement for escape sequences, otherwise there is content you cannot 
>>> express
>>> as a string.
> 
> (My work is almost done here!  Now if we apply that reasoning to
> interruptors also, we get the idea of adjustable rawness, without
> losing the benefits of escape sequences.)
> 
>>> ...
> 
>>> Fixed delimiter
>>> 
>>> If we go with a fixed delimiter then we limit the content that can be 
>>> expressed
>>> without escape sequences. This is not totally left field. There are floating
>>> point values we can not express in Java and types we can express but not
>>> denote, such as anonymous class types, intersection types or capture types.
> 
> (Sure, but strings are much more "free" mathematically than those other things.
> One character shouldn't have to care (char?) what its neighbors are doing.)
> 
>>> ...
>>> Once you take away conflicts with the delimiter, most strings do not require
>>> escaping.
> 
> …Always excepting strings which have the audacity to mention
> the New, Improved Delimiter.  If Java picks one that nobody else
> would ever dream of, we'll still have one remaining case of
> embedding Java inside of Java.  For me failure to nest is a smell
> indicating possible rats, for others it's a trade-off.
> 
>>> …
>>> Summary: All strings can be expressed with fixed plus escaping, but can not
>>> express strings containing the fixed delimiter (""") with escaping off.
> 
> True.  Three points related to that:
> 
> A. If we have to escape the fixed delimiter, then we place an escape
> before it, and all is well.  If we are happy that users can easily spot
> our delimiter without "squinting", then they can probably spot the
> escaped copy of the same delimiter.
> 
> B. But, once we allow escapes to run through the string, there is
> another cost:  Little sequences like \\ and \n and \0 can be anywhere
> in the bulk of the ML string, and users *must squint* for those.
> This is a cost, and we wish we could make those more visible also,
> or just make the rest of the string raw.
> 
> C. The observations of A and B can be balanced if we use strong
> interruptors instead of the "little squinty sequences", and maybe
> also for the escaped delimiter.  There are various ways to do this,
> all of which suppress short escape sequences in favor of longer ones.
> 
>>> Jumping ahead: I think that stating that traditional " strings must be
>>> single-line will be a popular restriction, even if it is not needed. Then they
>>> will think of """ as meaning multi-line.
> 
> +1
> 
>>> 
>>> Structured delimiter
> 
> (AKA periodic or partially periodic string.)
> 
>>> …
>>> Summary: Can express all strings with and without escaping. If the delimiter
>>> length is limited then there is still a (smaller) set of strings that
>>> cannot be expressed.
> 
> Yep.  And put "structured interruptor" in the kitchen also.
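
A sketch of the structured (repeated-quote) delimiter summarized above, assuming the rule "use at least three quotes, and more than the longest run of quotes in the body". The class name, the method names, and the maxLength parameter are hypothetical. The sketch also ignores bodies that begin or end with a quote, which stay ambiguous under pure quote counting.

```java
import java.util.OptionalInt;

final class StructuredDelimiterSketch {
    // Length of the longest run of consecutive double quotes in the body.
    static int longestQuoteRun(String body) {
        int best = 0;
        int run = 0;
        for (int i = 0; i < body.length(); i++) {
            run = body.charAt(i) == '"' ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    // Shortest usable delimiter length, or empty if the length cap makes the
    // body inexpressible without escaping.
    static OptionalInt delimiterLength(String body, int maxLength) {
        int needed = Math.max(3, longestQuoteRun(body) + 1);
        return needed <= maxLength ? OptionalInt.of(needed) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(delimiterLength("say \"hi\"", 5));
        System.out.println(delimiterLength("a\"\"\"b", 5));
        System.out.println(delimiterLength("\"\"\"\"\"", 5));
    }
}
```

The empty result in the last call is the "(smaller) set of strings" from the summary: once the delimiter length is capped, a body with a long enough run of quotes cannot be written raw.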
> 
>>> Nonce delimiter
>>> 
>>> ...
>>> Summary: Can express all strings with and without escaping, but nonce can 
>>> affect
>>> readability.
> 
> I agree.  There's too much "noise" in a nonce, and it's easy to misuse.
> 
> Alternative (stated elsewhere):  Indexed delimiter.  Here, the role of the
> nonce is played by a small number which is not the length of the delimiter
> but rather an actual numeral placed in the delimiter.  Such things can be
> made deterministic, so that, if you are going to quote a string S which has
> apparent delimiters in it, there is a unique smallest non-conflicting index
> which may be used for the indexed delimiter of the quoted string.  (And the
> indexed interruptor, if you want one.)
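
The "unique smallest non-conflicting index" can be computed with a simple search. The sketch assumes a closing token of three quotes, a backslash, and the decimal index; that token shape, the class name, and the method names are illustrative guesses, since the thread does not pin down the syntax.

```java
final class IndexedDelimiterSketch {
    // Hypothetical closing token for a given index: three quotes, a backslash, the numeral.
    static String closer(int index) {
        return "\"\"\"\\" + index;
    }

    // Smallest index >= 1 whose closing token does not occur in the body.
    // The loop terminates because a finite body contains only finitely many tokens.
    static int smallestFreeIndex(String body) {
        int index = 1;
        while (body.contains(closer(index))) {
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        String inner = "text" + closer(1);  // a body that already uses index 1
        System.out.println(smallestFreeIndex("plain text"));
        System.out.println(smallestFreeIndex(inner));
    }
}
```

Because the search is a plain substring test, a body containing the token for index 10 also rules out index 1; that is conservative but keeps the choice deterministic.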
> 
>>> 
>>> Multi-line formatting
>>> 
>>> I left this out of the main discussion, but I think we can all agree that
>>> formatting rules should separate the delimiters from the content.
> 
> +1 (This is an instance of user control over the form of the source program
> containing the string.  I don't know what is the right mix of mechanism and
> policy to get it all right, but I agree format control is an important issue.)
> 
>>> Other details can be refined after choice of delimiter(s).
>>> ...
>>> Entrees and desserts
>>> 
>>> If we make good choices now (stay away from the oysters) we can still move 
>>> on to
>>> other courses later.
>>> 
>>> For instance; if we got up from the table with the ", """, \", \""" set of
>>> delimiters, we could still introduce structured delimiters in the future;
> 
> This is often true, but not always, so we have to keep our eyes open.
> Purely periodic strings don't extend, as structured delimiters, as well
> as non-periodic or (some) partially-periodic ones.  Consider:
> 
>  var s = \"""""…
> 
> Does that begin today's three-quote-delimited string, which has two more
> quotes in it, or tomorrow's five-quote-delimited string?  (This takes me back
> to the crazy idea of going with 1, 3, 9, 27 quotes.  "I'll have a triple.")
> 
> If I allow up to N quotes in my delimiter today, then coders will write 
> strings
> which
> begin with more quotes in the string body.  Either I have to somehow outlaw
> that,
> or else I am forbidden from using longer strings of N+1 quotes for future
> delimiters.
> 
> Adding more escapes on the front is another matter, and I think that would 
> work
> fine, especially if the "extra" escapes on the front somehow strengthened the
> string's interruptor and delimiter in a consistent manner.
> 
> So we could enumerate ", \", """, \""", \\""", \\\""", \\\\""" etc.
> 
> Or ", \", """, \""", \1""", \2""", \3""" etc.
> 
> No need for more than three quotes (or more than one, for that matter,
> but there are other reasons to like three).
> 
>>> either with repeated  (see Swift) or repeated ". We could also follow a
>>> suggestion John made to use a pseudo nonce like " for \\" or """"".
> 
> Yep, see above.
> 
>>> Point being, we can work with a 85% solution now that we can supplement 
>>> later
>>> when we're not so hangry.
> 
> +100
> 
> HTH
> 
> — John
