Having thought a bit more concretely about this, using the suggestions here and
talking to Graydon on IRC, I've come up with the following design for a
string formatting library. I think that this addresses the comments in the
responses to my original email, but if not please let me know!

== Format Language ==

One of the major goals of the "formatting language" is to support
internationalization as necessary. This means that there must be nested format
patterns, a few functions that can be executed at runtime, and a way to test
the equivalence of format strings at runtime. To this end, I drew from these
links:

  http://docs.python.org/3/library/string.html#formatstrings
  http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/MessageFormat.html
  http://docs.oracle.com/javase/7/docs/api/java/text/ChoiceFormat.html?is-external=true
  http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/PluralFormat.html
  http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/SelectFormat.html
  https://github.com/SlexAxton/messageformat.js

And settled on this grammar:

  format_string := <text> [ format <text> ] *
  format := '{' [ argument ] [ ':' format_spec ] [ [ ',' ] function_spec ] '}'
  argument := '' | integer
  format_spec := [[fill]align][sign]['#'][width][.precision][type]
  fill := character
  align := '<' | '>'
  sign := '+' | '-'
  width := count
  precision := count | '*'
  type := identifier | ''
  count := parameter | integer
  parameter := integer '$'

  function_spec := plural | select

  plural := 'plural' ',' [ 'offset:' integer ] ( selector arm ) *
  selector := '=' integer | keyword
  keyword := 'zero' | 'one' | 'two' | 'few' | 'many' | 'other'

  select := 'select' ',' ( identifier arm ) *

  arm := '{' format_string '}'

Some examples would be:

  {}
  {1}
  {1:d}
  {:date}
  {: <}
  {:+#}
  {plural, other {...}}
  {1, plural, offset:1 =1{...} one{...} many{...} other {...}}
  {select, s1{...} s2{...} other{...}}
  {select, selector{{2, plural, other{...}}} other{...}}

An overview of this:
  * Any argument can be selected (0-indexed from the start)
  * Any argument can be formatted any way (or at least the formatter requests a
    particular format)
  * There are two internationalization functions 'select' and 'plural'. I've
    also seen a 'choice' function and I haven't quite been able to grasp it, but
    there's enough foundation here that it should be easy to add.
  * Nested format strings are allowed
  * Multi-char format names are allowed.
  * I'm not convinced the format specifiers are the best they could be. They're
    currently modified from Python's version, and differ slightly from what
    exists today.

Implementation-wise, there will be a parser in a `fmt` module which parses these
strings and yields ast-like items representing the structure of the format
string.
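
To make "ast-like items" a bit more concrete, here is a minimal sketch of such a
parser. It is written in present-day Rust rather than the syntax used elsewhere
in this mail, and it only covers the `{}`, `{N}`, `{:spec}` and `{N:spec}` forms
(no `plural`/`select`, no nesting). The names `Piece` and `parse` are made up
for illustration, and the text after ':' is kept as a raw string instead of
being broken down into fill/align/width/etc.

```rust
/// One item produced by the parser: literal text or an argument reference.
#[derive(Debug, PartialEq)]
enum Piece {
    Text(String),
    /// `index` is `None` for `{}`; `spec` is the raw text after ':' (may be empty).
    Arg { index: Option<usize>, spec: String },
}

/// Parses only the simple forms `{}`, `{N}`, `{:spec}` and `{N:spec}`.
fn parse(input: &str) -> Result<Vec<Piece>, String> {
    let mut pieces = Vec::new();
    let mut rest = input;
    while let Some(open) = rest.find('{') {
        // Everything before the '{' is literal text.
        if open > 0 {
            pieces.push(Piece::Text(rest[..open].to_string()));
        }
        let after = &rest[open + 1..];
        let close = after.find('}').ok_or_else(|| "unclosed '{'".to_string())?;
        let body = &after[..close];
        // Split "N:spec" at the first ':'; either half may be empty.
        let (arg, spec) = match body.find(':') {
            Some(i) => (&body[..i], &body[i + 1..]),
            None => (body, ""),
        };
        let index = if arg.is_empty() {
            None
        } else {
            let n = arg
                .parse::<usize>()
                .map_err(|e| format!("bad argument {:?}: {}", arg, e))?;
            Some(n)
        };
        pieces.push(Piece::Arg { index, spec: spec.to_string() });
        rest = &after[close + 1..];
    }
    // Trailing literal text after the last '}'.
    if !rest.is_empty() {
        pieces.push(Piece::Text(rest.to_string()));
    }
    Ok(pieces)
}
```

Under these assumptions, parsing "{:s}, {}!" gives an argument with spec "s",
the text ", ", an argument with an empty spec, and the text "!".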

== Compile-time support ==

All of the formats available for use will be defined at compile-time. Each
format will be defined as a trait, and each of these traits will have a single
method defining the format function. For example:

  #[fmt="b"] pub trait Bool { fn fmt(&Self, &mut Formatter); }
  #[fmt="c"] pub trait Char { fn fmt(&Self, &mut Formatter); }
  #[fmt="d"] pub trait Signed { fn fmt(&Self, &mut Formatter); }
  #[fmt="u"] #[fmt="i"] pub trait Unsigned { fn fmt(&Self, &mut Formatter); }
  #[fmt="s"] pub trait String { fn fmt(&Self, &mut Formatter); }
  #[fmt="?"] pub trait Poly { fn fmt(&Self, &mut Formatter); }

Here each format specifier is specified via a #[fmt] attribute. There is one
static function called `fmt` which takes the type as a first parameter and then
a `Formatter` object as a second. The `Formatter` object contains the output
stream and any relevant flags like width/precision/fill/alignment. It will be up
to each implementation of each trait to implement these flags, but there will be
a number of helper functions in a `fmt` package for dealing with these options.
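
As a rough illustration of what one of those helper functions might look like,
here is a sketch in present-day Rust. The field names on `Formatter`, the
`Align` enum and the `pad` helper are all assumptions of mine, not part of the
design above, and the output stream is simplified to a `String`. It assumes
that '<' means left-aligned and '>' means right-aligned, as in Python.

```rust
/// Alignment flag: '<' maps to Left, '>' maps to Right (assumed, as in Python).
#[derive(Clone, Copy)]
enum Align {
    Left,
    Right,
}

/// Simplified stand-in for the proposed `Formatter`: output plus a few flags.
struct Formatter {
    out: String,
    fill: char,
    align: Align,
    width: Option<usize>,
}

impl Formatter {
    /// Hypothetical helper: writes `s`, padded with `fill` up to `width`.
    /// Text longer than `width` is written unchanged.
    fn pad(&mut self, s: &str) {
        let len = s.chars().count();
        let padding = self.width.map_or(0, |w| w.saturating_sub(len));
        let fill: String = std::iter::repeat(self.fill).take(padding).collect();
        match self.align {
            Align::Left => {
                self.out.push_str(s);
                self.out.push_str(&fill);
            }
            Align::Right => {
                self.out.push_str(&fill);
                self.out.push_str(s);
            }
        }
    }
}

/// One trait per format name, with a single static `fmt` function.
trait Signed {
    fn fmt(value: &Self, f: &mut Formatter);
}

impl Signed for i32 {
    fn fmt(value: &Self, f: &mut Formatter) {
        // The implementation decides how to honor the flags; here it just pads.
        f.pad(&value.to_string());
    }
}
```

With a helper like this, an implementation of a format trait stays a one-liner
and still respects width, fill and alignment.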

From the compiler's point of view, there will be a new macro, let's call it
ifmt!, which will have the following transformation:

  ifmt!("{:s}, {}!", "Hello", "World")

  {
      let l1 = "Hello";
      let l2 = "World";
      ::std::fmt::sprintf("{:s}, {}!", [c(String::fmt, &l1),
                                        c(Poly::fmt, &l2)])
  }

A few notable points:
  * If you're wondering what this `c` function is, look below
  * An attempt is made to make this as little code as possible. Each format
    location should purely pass all the arguments along to someone else.
  * The argument list is a list of tuples where the first element is a
    function which takes the second element (and a formatter) to format the
    result into a stream. The exact function selected depends on the format
    parameter specified in the string, such as:
       "s" == String::fmt, default == Poly::fmt
  * A bit of magic goes on under the hood with unsafe casts to make these all
    typecheck to the same thing (more details below)

== Runtime support ==

The crux of the implementation will be around this function signature:

  type FormatFn<T> = extern "Rust" fn(&T, &mut Formatter);
  type Argument = (FormatFn<Void>, &Void);

  unsafe fn fprintf(w: &mut io::Writer, fmt: &str, args: &[Argument]) {
      ...
  }

Here the function takes the stream to output to, the format string, and the
list of arguments. Each argument is an "opaque" pointer/function pair where the
function knows how to format the value at the pointer. The validity of each
FormatFn/pointer pair is checked at compile time, so the compiler will only
emit valid calls to this function. The function is also tagged as `unsafe`, so
anyone calling it manually knows that mixing up the arguments will cause
serious problems.

From above, the compiler would emit calls to the `c` function like so:

  fn c<T>(f: FormatFn<T>, t: &T) -> Argument { ... }

The actual implementation is just a wrapper around `transmute`. This gets us a
lot of nice error messages and compile-time checks that guarantee the type of
each argument is sane (regarding its format specifier). For example an invalid
program would yield the following:

  ifmt!("{:s}", MyStruct{ foo: "bar" })
  //~^ ERROR: No implementation of `String` trait found for `MyStruct`

This comes about because the 's' format specifier is registered to the `String`
trait (or rather `std::fmt::String`), and due to the signature of the `c`
function it will attempt to look up an implementation of that trait for the
`MyStruct` type (passed as the second argument of `c`).
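
For the curious, the erasure trick can be sketched in present-day Rust roughly
as follows. This is only an illustration under several assumptions: `Opaque`
stands in for `Void`, `Str` stands in for the `String` trait, `Formatter` is
reduced to an output buffer, and `Argument` is a struct with private fields
instead of a bare tuple. The `render` function is a made-up stand-in that just
calls each pair in order.

```rust
use std::mem;

/// Stand-in for the proposed `Formatter`; it only carries the output buffer.
pub struct Formatter {
    pub out: String,
}

/// Zero-sized placeholder playing the role of `Void`.
pub struct Opaque;

pub type FormatFn<T> = fn(&T, &mut Formatter);

/// A type-erased (function, pointer) pair. The fields are private, so the
/// only way to build one is through `c`, which keeps the pair consistent.
pub struct Argument<'a> {
    f: FormatFn<Opaque>,
    value: &'a Opaque,
}

/// One trait per format name, each with a single static `fmt` function.
pub trait Str {
    fn fmt(value: &Self, f: &mut Formatter);
}
pub trait Signed {
    fn fmt(value: &Self, f: &mut Formatter);
}

impl<'a> Str for &'a str {
    fn fmt(value: &Self, f: &mut Formatter) {
        f.out.push_str(value);
    }
}
impl Signed for i32 {
    fn fmt(value: &Self, f: &mut Formatter) {
        f.out.push_str(&value.to_string());
    }
}

/// Erases `T`. The signature ties the function's argument type to the
/// pointer's type, so a mismatched pair fails to typecheck.
pub fn c<'a, T>(f: FormatFn<T>, t: &'a T) -> Argument<'a> {
    // SAFETY: both casts only change the pointee type of a thin pointer, and
    // `f` is only ever called with the `value` it was paired with here.
    unsafe {
        Argument {
            f: mem::transmute::<FormatFn<T>, FormatFn<Opaque>>(f),
            value: &*(t as *const T as *const Opaque),
        }
    }
}

/// Calls every pair in order, appending to one buffer.
pub fn render(args: &[Argument]) -> String {
    let mut f = Formatter { out: String::new() };
    for arg in args {
        (arg.f)(arg.value, &mut f);
    }
    f.out
}
```

Writing `c(<MyStruct as Str>::fmt, &value)` for a type with no `Str`
implementation is rejected by the type checker, which is the same kind of
compile-time check as the error shown above.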

Algorithm-wise, this will create a parser for the fmt string, and iterate over
each of the "tokens" performing the necessary action (streaming output to the
specified stream).

A few notes:
  * I believe that parsing must occur at runtime. Otherwise i18n wouldn't work,
    because i18n could supply an arbitrary format string at runtime based on
    the current locale.
  * Currently traits don't work well enough such that `&mut io::Writer` is a
    thing that works, so the current interface would only export an `sprintf`
    function which emits to a `&mut ~str` object (essentially a stream).

== Internationalization ==

I also wanted to touch on how this covers internationalization. The main point
of this is located within the format language, but the runtime must also support
some constructs. The format string and arguments are validated at compile-time,
but any format string could be run at runtime. For this reason an equivalence
function will be needed that takes the original format string and a translated
format string and ensures at runtime that the two are equivalent in terms of
types, position, and number of arguments.
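
One possible shape for such an equivalence check, sketched in present-day Rust:
reduce each format string to a sorted set of (argument index, spec) pairs and
compare the two sets. Everything here is an assumption for illustration. The
names `signature` and `equivalent` are made up, only the simple `{}`, `{N}`,
`{:spec}` and `{N:spec}` forms are handled, `{}` is assumed to take the next
argument in sequence, and the whole text after ':' is compared even though a
real check would probably compare only the type part.

```rust
/// Collects sorted, de-duplicated (argument index, spec) pairs, or `None` if
/// the string is malformed. Only simple, non-nested formats are handled.
fn signature(fmt: &str) -> Option<Vec<(usize, String)>> {
    let mut sig = Vec::new();
    let mut next_implicit = 0;
    let mut rest = fmt;
    while let Some(open) = rest.find('{') {
        let after = &rest[open + 1..];
        let close = after.find('}')?;
        let body = &after[..close];
        // Split "N:spec" at the first ':'; either half may be empty.
        let (arg, spec) = match body.find(':') {
            Some(i) => (&body[..i], &body[i + 1..]),
            None => (body, ""),
        };
        let index = if arg.is_empty() {
            // Assumption: `{}` refers to the next argument in sequence.
            next_implicit += 1;
            next_implicit - 1
        } else {
            arg.parse::<usize>().ok()?
        };
        sig.push((index, spec.to_string()));
        rest = &after[close + 1..];
    }
    // Sorting makes the comparison independent of the order of appearance,
    // so a translation is free to reorder explicitly numbered arguments.
    sig.sort();
    sig.dedup();
    Some(sig)
}

/// True if both strings use the same arguments with the same specs.
fn equivalent(original: &str, translated: &str) -> bool {
    match (signature(original), signature(translated)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```

With this definition, "{0:s} has {1:d} files" and "{1:d} fichiers pour {0:s}"
are equivalent, while "{0:s}" and "{0:d}" are not.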

On a related note, any argument as a parameter to the `plural` function will be
required to be of the `&uint` type, and any argument to the `select` function
will be required to be of the `& &str` type. Additionally, the function
pointer of the argument pair these are in will be some dummy function that fails
if called (because they should never be called). I haven't given too much
thought to these constructs, but that was kinda the first thing I came up with.
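
To give an idea of what the runtime would do with a `plural` argument, here is
a sketch of arm selection in present-day Rust. It follows the ICU PluralFormat
behaviour as I understand it: an explicit `=N` arm is matched against the raw
count first, otherwise the offset is subtracted and a keyword is chosen, with
`other` as the fallback. The keyword rule shown is a hard-coded English-like
one (1 is `one`, everything else is `other`), which is purely an assumption. A
real implementation would pick the rule from the locale. All names are made up.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    Zero,
    One,
    Two,
    Few,
    Many,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Selector {
    /// `=N` in the format string.
    Exact(usize),
    Keyword(Keyword),
}

/// Hypothetical English-like rule; real rules depend on the locale.
fn english_keyword(n: usize) -> Keyword {
    if n == 1 {
        Keyword::One
    } else {
        Keyword::Other
    }
}

/// Picks the arm for `count`. Returns `None` if nothing matches and there is
/// no `other` arm.
fn select_plural<'a>(
    count: usize,
    offset: usize,
    arms: &[(Selector, &'a str)],
) -> Option<&'a str> {
    // 1. Explicit `=N` arms are compared against the count without the offset.
    if let Some(&(_, arm)) = arms
        .iter()
        .find(|&&(sel, _)| sel == Selector::Exact(count))
    {
        return Some(arm);
    }
    // 2. Otherwise subtract the offset and look up the keyword arm,
    //    falling back to `other`.
    let kw = english_keyword(count.saturating_sub(offset));
    arms.iter()
        .find(|&&(sel, _)| sel == Selector::Keyword(kw))
        .or_else(|| {
            arms.iter()
                .find(|&&(sel, _)| sel == Selector::Keyword(Keyword::Other))
        })
        .map(|&(_, arm)| arm)
}
```

For a format like `{0, plural, offset:1 =1{...} one{...} other{...}}`, a count
of 1 picks the `=1` arm, a count of 2 picks `one` (since 2 - 1 = 1), and a
count of 5 picks `other`.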

== Summing up ==

Currently I have implemented the format language parsing, and the runtime
support necessary for this (without dealing with formatting flags). I haven't
started the compiler work yet, and there's no reason that any of this couldn't
completely change in the meantime.

I would love comments/suggestions on this system. I think that this takes into
account almost all of the feedback which I've received about how formatting
strings should work, but extra sets of eyes are always useful!