One feature common to many programming languages that Rust lacks is "raw"
string literals. Specifically, these are string literals that don't interpret
backslash-escapes. There are three obvious applications at the moment: regular
expressions, Windows file paths, and format!() strings that want to embed { and
} chars. I'm sure there are more as well, such as large string literals that
contain things like HTML text.
I took a look at 3 programming languages to see what solutions they had: D,
C++11, and Python. I've reproduced their syntax below, plus one more custom
syntax, along with pros & cons. I'm hoping we can come up with a syntax that
makes sense for Rust.
## Python syntax:
Python supports an "r" or "R" prefix on any string literal (both "short"
strings, delimited with one quote character, and "long" strings, delimited with
3 quotes). The "r" or "R" prefix denotes a "raw string", and has the effect of
disabling backslash-escapes within the string, for the most part. It actually
gets a bit weird: if a sequence of backslashes of an odd length occurs prior to
a quote (of the appropriate quote type for the string), then the quote is
considered to be escaped, but the backslashes are left in the string. This
means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is
`foo\\\"`, but r"foo\\" is merely the string `foo\\`.
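The odd/even backslash rule is easy to check in a Python interpreter. The following is just an illustrative sketch; the variable names and the `is_valid_literal` helper are made up for the demonstration and are not part of any proposal:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a single Python expression."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

# Raw strings keep backslashes verbatim.
path = r"C:\temp\new"
assert path == "C:\\temp\\new"

# An odd run of backslashes before a quote "escapes" the quote,
# but every backslash still ends up in the resulting string.
one = r"foo\""
assert len(one) == 5          # f o o \ "
three = r"foo\\\""
assert len(three) == 7        # f o o \ \ \ "

# An even run leaves the quote unescaped, so the quote closes the string.
two = r"foo\\"
assert len(two) == 5          # f o o \ \

# Consequence: a raw string cannot end in an odd number of backslashes.
# The source text r"foo\" never finds its closing quote.
assert not is_valid_literal('r"foo\\"')
assert is_valid_literal('r"foo\\\\"')
```

The last two checks show the practical downside: a raw string cannot hold a Windows directory path that ends in a single backslash.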
Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string
Cons:
* Handling of backslashes is very bizarre, and the closing quote character can
only be embedded if you want to have a backslash before it.
## C++11 syntax:
C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq".
In this construct, `seq` is any sequence of zero to 16 characters except
for: space, (, ), \, \t, \v, \f, \n, \r. The simplest form looks like R"(raw
text)", which allows for anything in the raw text except for the sequence `)"`.
The addition of the delimiter sequence allows for constructing a raw string
containing any sequence at all (as the delimiter sequence can be adjusted based
on the represented text).
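To make the delimiter rule concrete, here is a rough sketch (in Python, for brevity) of how a lexer might pull the text out of a complete C++11-style raw literal. The function name `parse_cpp_raw` and its error messages are hypothetical; this is not how any real C++ compiler is implemented, just the rule as described above.

```python
# Characters that may not appear in the delimiter sequence.
FORBIDDEN = set(' ()\\\t\v\f\n\r')
MAX_DELIM = 16  # C++11 caps the delimiter at 16 characters


def parse_cpp_raw(literal):
    """Return the text of a C++11-style raw literal such as R"seq(text)seq"."""
    if not literal.startswith('R"'):
        raise ValueError("not a raw string literal")
    open_paren = literal.find("(", 2)
    if open_paren == -1:
        raise ValueError("missing '(' after the delimiter")
    seq = literal[2:open_paren]
    if len(seq) > MAX_DELIM or any(ch in FORBIDDEN for ch in seq):
        raise ValueError("illegal delimiter sequence")
    # The literal ends at the first occurrence of )seq" after the open paren.
    closer = ")" + seq + '"'
    end = literal.find(closer, open_paren + 1)
    if end == -1:
        raise ValueError("unterminated raw string")
    if end + len(closer) != len(literal):
        raise ValueError("trailing characters after the literal")
    return literal[open_paren + 1:end]


# Simplest form: empty delimiter.
assert parse_cpp_raw('R"(raw text)"') == "raw text"
# A one-character delimiter lets the text contain the sequence )"
assert parse_cpp_raw('R"x(a)"b)x"') == 'a)"b'
```

With an empty delimiter the second text cannot be written, because the literal would end at the first `)"`; picking a delimiter that does not occur in the text avoids that.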
Pros:
* Allows for embedding any character at all (representable in the source file
encoding), including the closing quote.
* Reasonably straightforward
Cons:
* Syntax is slightly complicated
## D syntax:
D supports three different forms of raw strings. The first two are similar,
being r"raw text" and `raw text`. Besides the choice of delimiters, they behave
identically, in that the raw text may contain anything except for the
appropriate quote character. The third syntax is a slightly more complicated
form of C++11's syntax, and is called a delimited string. It takes two forms.
The first looks like q"(raw text)" where the ( may be any non-identifier
non-whitespace character. If the character is one of [(<{ then it is a "nesting
delimiter", and the close delimiter must be the matching ])>} character;
otherwise the close delimiter is the same as the open delimiter. Furthermore, nesting
delimiters do exactly what their name says: they nest. If the nesting delimiter
is (), then any ( in the raw text must be balanced with a ) in the raw text. In
other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and
q"(foobar))" are both illegal.
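The nesting rule can be sketched as a simple depth counter. The Python below is only an illustration of the rule as described here: `parse_d_nested` is a hypothetical name, it only models the four nesting delimiters, and it checks a complete literal rather than scanning a token stream the way a real D lexer would.

```python
NESTING = {"(": ")", "[": "]", "<": ">", "{": "}"}


def parse_d_nested(literal):
    """Return the text of a D-style q"(...)" literal with a nesting delimiter."""
    if len(literal) < 5 or not literal.startswith('q"') or not literal.endswith('"'):
        raise ValueError("not a delimited string literal")
    opener = literal[2]
    if opener not in NESTING:
        raise ValueError("only nesting delimiters are modeled in this sketch")
    closer = NESTING[opener]
    body = literal[3:-1]  # everything between the opener and the final quote
    if not body.endswith(closer):
        raise ValueError("missing close delimiter")
    text = body[:-1]
    # Every opener inside the text must be balanced by a closer.
    depth = 0
    for ch in text:
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced close delimiter")
    if depth != 0:
        raise ValueError("unbalanced open delimiter")
    return text


assert parse_d_nested('q"(foo(bar))"') == "foo(bar)"
for bad in ('q"(foo(bar)"', 'q"(foobar))"'):
    try:
        parse_d_nested(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for " + bad)
```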
The second uses any identifier as the delimiter. In this case, the identifier
must immediately be followed by a newline, and in order to close the string,
the close delimiter must be preceded by a newline. This looks like:
q"delim
this is some raw text
delim"
It's essentially a heredoc. Note that the first newline is not part of the
string, but the final newline is, so this evaluates to "this is some raw
text\n".
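The newline handling is the subtle part, so here is a small Python sketch of it. `parse_d_heredoc` is a hypothetical helper, and it uses Python's `str.isidentifier()` as a stand-in for D's identifier rules, which is an approximation.

```python
def parse_d_heredoc(literal):
    """Return the text of a D-style q"IDENT ... IDENT" heredoc literal."""
    first_line, newline, rest = literal.partition("\n")
    if not first_line.startswith('q"') or not newline:
        raise ValueError("delimiter must be followed by a newline")
    delim = first_line[2:]
    if not delim.isidentifier():  # approximation of D's identifier rules
        raise ValueError("delimiter is not an identifier")
    # The close delimiter must be preceded by a newline. Putting the consumed
    # newline back lets an empty body (close delimiter on line 2) work too.
    haystack = "\n" + rest
    closer = "\n" + delim + '"'
    end = haystack.find(closer)
    if end == -1 or end + len(closer) != len(haystack):
        raise ValueError("missing or misplaced close delimiter")
    # Drop the first newline, keep the one just before the close delimiter.
    return haystack[1:end + 1]


source = 'q"delim\nthis is some raw text\ndelim"'
assert parse_d_heredoc(source) == "this is some raw text\n"
```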
Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of
characters (representable in the source file's encoding)
Cons:
* Overly complicated
## Custom syntax
There's another approach that none of these three languages take, which is to
merely allow for doubling up the quote character in order to embed a quote.
This would look like R"raw string literal ""with embedded quotes"".", which
becomes `raw string literal "with embedded quotes".` (trailing period included).
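Decoding this form is a single pass that collapses each doubled quote. A minimal Python sketch, assuming the `R"` prefix used in the example above (the function name is hypothetical):

```python
def parse_doubled_quote_raw(literal):
    """Return the text of an R"..." literal where "" stands for one quote."""
    if len(literal) < 3 or not literal.startswith('R"') or not literal.endswith('"'):
        raise ValueError("not a raw string literal")
    body = literal[2:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == '"':
            # A quote inside the body is only legal as half of a doubled pair.
            if i + 1 < len(body) and body[i + 1] == '"':
                out.append('"')
                i += 2
            else:
                raise ValueError("lone quote inside the literal")
        else:
            out.append(body[i])  # backslashes and everything else pass through
            i += 1
    return "".join(out)


decoded = parse_doubled_quote_raw('R"raw string literal ""with embedded quotes""."')
assert decoded == 'raw string literal "with embedded quotes".'
```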
Pros:
* Very simple
* Allows for embedding the close quote character, and therefore, any character
(representable in the source file encoding)
Cons:
* Slightly odd to read
## Conclusion
Of the three existing syntaxes examined here, I think C++11's is the best. It
ties with D's syntax for being the most powerful, but is simpler than D's. The
custom syntax is just as powerful, though. The benefit of the C++11 syntax over
the custom syntax is that it's slightly easier to read, as the raw
text has a one-to-one mapping with the resulting string. The custom syntax is a
bit more confusing to read, especially if you want to add multiple quotes. As a
pathological case, let's try representing a Python triple-quoted docstring
using both syntaxes:
C++11: R"("""this is a python docstring""")"
Custom: R"""""""this is a python docstring"""""""
Based on this examination, I'm leaning towards saying Rust should support
C++11's raw string literal syntax.
I welcome any comments, criticisms, or suggestions.
-Kevin