I've been mulling over this for a while. It seems for quoting we have
the following possibilities:

- just have a couple of basic \ rules, and avoid quoted unicode
altogether, on the basis that we can use real unicode files (which we
already can in the next generation of the tools)
- use the ISO rules that the spec currently indicates, i.e. &aaaa; or
&#xHHHH;
- use the \uNNNN approach Andrew suggests (is this hex or decimal?)

We have already built some unicode archetypes in Farsi, and no quoting
is needed. The current generation of tools is not quite up to
displaying them yet, but the next generation will do it. So is there a
strong argument for quoted unicode at all? Is it that we need to cater
for tools or situations where only ASCII is allowable in the saved form
of the file? I'm quite happy to go with the \uNNNN approach, but we need
to be clear what it's for. It seems to me that we need to state clearly
that we support two kinds of serialisation: (1) to ASCII, in which
anything not in the basic ISO Latin-1 charset is shown as quoted
unicode, and (2) to true unicode in UTF-8 (which is what we have stated
elsewhere in openEHR we will use as the encoding).
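
To make the two serialisations concrete, here is a rough sketch in
Python. It is not taken from any openEHR tool. The function names are
made up, and it assumes that \uNNNN means four hex digits as in Java and
C# (the open question above), that the cut-off for quoting is 7-bit
ASCII rather than Latin-1, and that characters beyond U+FFFF are written
as a surrogate pair the way Java does it. The basic \ rules are left out
here.

```python
def to_ascii_form(text: str) -> str:
    # Hypothetical helper: quote every code point above 0x7F as \uNNNN.
    # Assumption: NNNN is exactly four hexadecimal digits.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Assumption: code points beyond the BMP become a UTF-16
            # surrogate pair, as in Java source files.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


def to_utf8_form(text: str) -> bytes:
    # The "true unicode" serialisation: no quoting, just UTF-8 bytes.
    return text.encode("utf-8")


print(to_ascii_form("a \u2208 A"))  # prints: a \u2208 A
print(to_utf8_form("a \u2208 A"))
```

With something like this, a Farsi archetype saved in the UTF-8 form
needs no quoting at all, and the ASCII form is only a fallback for tools
that cannot handle anything else.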

As for the other quoted characters, I don't see what the need for things
like \f (formfeed) is; what we need is to decide a minimum set which
might be:
- \r - carriage return
- \n - linefeed
- \t - tab
- \\ - backslash
- \" - literal "

Is anything else needed?
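
For discussion, the minimum set could be read back roughly as follows.
This is only a sketch under stated assumptions: `unescape` and
`SIMPLE_ESCAPES` are names I made up, \uNNNN is taken to be four hex
digits, an unrecognised escape such as \. is treated as an error as
Andrew suggests, and the backslash line extension from the spec is not
handled.

```python
# Minimum set from the list above: \r \n \t \\ \"
SIMPLE_ESCAPES = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\", '"': '"'}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape(body: str) -> str:
    # 'body' is the text between the enclosing double quotes.
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        nxt = body[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "u":
            # Assumption: exactly four hex digits follow \u.
            digits = body[i + 2 : i + 6]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"bad \\u escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Unrecognised quotation character: report it, do not guess.
            raise ValueError(f"unknown escape \\{nxt} at offset {i}")
    return "".join(out)
```

So `unescape(r'a \"phrase\"')` gives back the phrase with plain double
quotes, while `unescape(r'\.')` raises an error rather than silently
choosing between one character and two.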

- thomas


Andrew Patterson wrote:
> I am having trouble with the exact definition of the
> string literal..
>
> From the spec:
> ----------------------------------------------
> 3.5.1.2 String Data
> All strings are enclosed in double quotes, as follows:
> "this is a string"
> Quoting and line extension is done using the backslash character, as follows:
> "this is a much longer string, what one might call a \"phrase\" or even \
> a \"sentence\" with a very annoying backslash (\\) in it."
> String data can be used to contain almost any other kind of data,
> which is intended to be parsed as some other formalism.
> Special characters (including the inverted comma and backslash characters)
> are expressed using the ISO 10646 or XML special character codes
> within single quotes. ISO codes are mnemonic, and follow the pattern
> &aaaa;, while
> XML codes are hexadecimal and follow the pattern
> &#xHHHH;, where H stands for a hexadecimal digit. An example is:
> "a &isin; A" -- prints as: a ∈ A
> All strings are case-sensitive, i.e. 'word' is distinct from 'Word'.
> ----------------------------------------------
>
> So the question then is, what should the behaviour be when a
> \ is used without a valid 'quotation' character following i.e. \\
> and \" should be interpreted as \ and " respectively, but should
> \. be interpreted as one character . or two characters \.? I'd suggest
> that if the quotation character is not recognised, then it should
> be an error.
>
> Furthermore, what are the rules around the &#xHHHH escape
> sequence? Surely the & will have to be quoted as well? Otherwise
> how will the parser know when it is being used for a mnemonic?
> Wouldn't a unicode escape technique such as used in Java and C#
> (\uAABB) be a better approach (i.e. one that fits in more
> comfortably with the standard \" quoting rules).
>
> Finally, what about the other standard string escapes used
> in Java and C# (\t, \b etc). Is there any room for them (maybe
> in ADL 2.0?)
>
> Comments?
>
> Andrew
> _______________________________________________
> openEHR-technical mailing list
> openEHR-technical at openehr.org
> http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical


-- 
___________________________________________________________________________________
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Research Fellow, University College London (http://www.chime.ucl.ac.uk)
Chair Architectural Review Board, openEHR (http://www.openEHR.org)

