I've been mulling over this for a while. It seems for quoting we have the following possibilities:
- just have a couple of basic \ rules, and avoid quoted Unicode altogether, on the basis that we can use real Unicode files (which we already can in the next generation of the tools)
- use the ISO rules that the spec currently indicates, i.e. &aaaa or &#xHHHH
- use the \uNNNN approach Andrew suggests (is this hex or decimal?)

We have already built some Unicode archetypes in Farsi, and no quoting is needed. The current generation of tools is not quite up to displaying them yet, but the next generation will do it.

So - is there a strong argument for quoted Unicode at all? Is it that we need to cater for tools or situations where only ASCII is allowable in the saved form of the file? I'm quite happy to go with the \uNNNN approach, but we need to be clear what it's for. It seems to me that we need to state clearly that we support 2 kinds of serialisation: (1) to ASCII, in which anything not in the basic ISO latin-1 charset is shown as quoted Unicode, and (2) to true Unicode UTF-8 (which is what we have stated elsewhere in openEHR we will use as the encoding).

As for the other quoted characters, I don't see what the need for things like \f (formfeed) is. What we need is to decide on a minimum set, which might be:
- \r - carriage return
- \n - linefeed
- \t - tab
- \\ - backslash
- \" - literal "

Is anything else needed?

- thomas

Andrew Patterson wrote:
> I am having trouble with the exact definition of the
> string literal..
>
> From the spec
> ----------------------------------------------
> 3.5.1.2 String Data
> All strings are enclosed in double quotes, as follows:
> "this is a string"
> Quoting and line extension is done using the backslash character, as follows:
> "this is a much longer string, what one might call a \"phrase\" or even \
> a \"sentence\" with a very annoying backslash (\\) in it."
> String data can be used to contain almost any other kind of data,
> which is intended to be parsed as some other formalism.
> Special characters (including the inverted comma and backslash characters)
> are expressed using the ISO 10646 or XML special character codes
> within single quotes. ISO codes are mnemonic, and follow the pattern
> &aaaa;, while
> XML codes are hexadecimal and follow the pattern
> &#xHHHH;, where H stands for a hexadecimal digit. An example is:
> "a &isin; A" -- prints as: a ∈ A
> All strings are case-sensitive, i.e. 'word' is distinct from 'Word'.
> ----------------------------------------------
>
> So the question then is, what should the behaviour be when a
> \ is used without a valid 'quotation' character following it? I.e. \\
> and \" should be interpreted as \ and " respectively, but should
> \. be interpreted as one character (.) or two characters (\.)? I'd suggest
> that if the quotation character is not recognised, then it should
> be an error.
>
> Furthermore, what are the rules around the &#xHHHH escape
> sequence? Surely the & will have to be quoted as well? Otherwise
> how will the parser know when it is being used for a mnemonic?
> Wouldn't a Unicode escape technique such as that used in Java and C#
> (\uAABB) be a better approach (i.e. one that fits in more
> comfortably with the standard \" quoting rules)?
>
> Finally, what about the other standard string escapes used
> in Java and C# (\t, \b etc.)? Is there any room for them (maybe
> in ADL 2.0?)
>
> Comments?
>
> Andrew
> _______________________________________________
> openEHR-technical mailing list
> openEHR-technical at openehr.org
> http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical

--
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Research Fellow, University College London (http://www.chime.ucl.ac.uk)
Chair Architectural Review Board, openEHR (http://www.openEHR.org)
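
P.S. To make the proposal concrete, here is a rough Python sketch of how the minimum escape set and the ASCII serialisation might behave. It is only an illustration under several assumptions that are still open in this thread: \uNNNN is taken to be exactly four hex digits, an unrecognised escape is treated as an error (as Andrew suggests), and the ASCII form quotes everything outside printable ASCII. The names unescape and escape_ascii are hypothetical and do not come from any openEHR tool. Line extension (backslash at end of line) is left out.

```python
import re

# Minimum escape set proposed in this thread (an assumption, not the ADL spec).
_SIMPLE = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\", '"': '"'}
_REVERSE = {value: "\\" + key for key, value in _SIMPLE.items()}
_HEX4 = re.compile(r"[0-9A-Fa-f]{4}")


def unescape(body: str) -> str:
    """Decode the text found between the double quotes of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        nxt = body[i + 1]
        if nxt in _SIMPLE:
            out.append(_SIMPLE[nxt])
            i += 2
        elif nxt == "u":
            # Assumption: \uNNNN means exactly four hexadecimal digits.
            digits = body[i + 2:i + 6]
            if not _HEX4.fullmatch(digits):
                raise ValueError(f"bad \\u escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Unrecognised escapes such as \. are rejected, not passed through.
            raise ValueError(f"unknown escape \\{nxt} at offset {i}")
    return "".join(out)


def escape_ascii(text: str) -> str:
    """Serialise text so that the result contains only printable ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in _REVERSE:
            out.append(_REVERSE[ch])
        elif 0x20 <= cp < 0x7F:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Four hex digits cannot express code points above U+FFFF.
            raise ValueError(f"U+{cp:X} does not fit in a \\uNNNN escape")
    return "".join(out)


if __name__ == "__main__":
    farsi = "سلام"
    quoted = escape_ascii(farsi)
    print(quoted)
    print(unescape(quoted) == farsi)
```

The UTF-8 serialisation needs no counterpart to escape_ascii beyond the five basic escapes, which is the argument for keeping the quoted form as a fallback for ASCII-only tools.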