Send links-users mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.inf.ed.ac.uk/mailman/listinfo/links-users
or, via email, send a message with subject or body 'help' to
[EMAIL PROTECTED]
You can reach the person managing the list at
[EMAIL PROTECTED]
When replying, please edit your Subject line so it is more specific
than "Re: Contents of links-users digest..."
Today's Topics:
1. Re: <?xml declaration?> (Jörg Roman Rudnick)
2. Re: <?xml declaration?> (Jörg Roman Rudnick)
3. Re: <?xml declaration?> (Ezra Cooper)
----------------------------------------------------------------------
Date: Mon, 04 Feb 2008 14:58:35 +0100
From: Jörg Roman Rudnick
<[EMAIL PROTECTED]>
To: Ezra Cooper <[EMAIL PROTECTED]>
Cc: [email protected]
Subject: Re: [links-users] <?xml declaration?>
Message-ID: <[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Precedence: list
Message: 1
Dear Ezra,
First off, a common way to declare the encoding of a page is with an
HTML meta tag:
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
(1) In principle, setting this tag is a fine solution. :-)
(2) The first thing I did was to adapt lexer.mll so that it supports
'-' as an XML NameChar.
(3) A second problem occurred: the meta tag is not represented 'as is' in
HEAD, but is hidden inside the JavaScript code - this seems to prevent it
from forcing an encoding. (It does work when the tag is added literally
in HEAD.)
Second, if you do want to write source files in UTF-8, I can't promise
that Links will correctly parse them. Poking at it a little bit, it
seems not to choke right away, but it hasn't been designed with that
in mind and might be breakable. Jeremy, who is our parser expert,
might have some comments on this.
This looks better than expected: the UTF-8 bytes seem to be transferred
correctly so far... :-)
Now, assuming that you did want to hack Links to make it handle XML
declarations, where to start? You might want to support them directly in
the XML parser in Links. In that case, you'll need to make changes to
the grammar (in parser.mly), specifically the "xml" nonterminal, and
probably the lexer too (in lexer.mll). You'll run into a problem:
presently the XML representation is uniform in the sense that the same
constructions are valid at any point in an XML literal (it's 'turtles
all the way down')--but what you want to add is a construction that's
only valid at the top level of an XML document. I'm not sure what the
implications of that would be--it could get thorny.
Rather than go that route, you might prefer to add a
primitive operation that takes our Xml type and
produces something different, say XmlDoc, adding an XML
declaration at the top; for example:
sig addXmlDecl : Xml                  # document root element
             -> [(String, String)]    # attributes for the <?xml?> declaration
             -> XmlDoc
You could then use this as a last step before sending a document to
the client. The run-time web system (in webif.ml) would have to be
modified to expect XmlDoc instead of, or in addition to, Xml.
Primitive operations can be added in the file library.ml--just follow
the pattern established there.
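As a rough illustration of what such a primitive would compute, here is a sketch in Python rather than OCaml (the name add_xml_decl and the string-based rendering are assumptions for illustration only; the real primitive would live in library.ml):

```python
def add_xml_decl(xml: str, attrs: list[tuple[str, str]]) -> str:
    """Prepend an <?xml ...?> declaration built from (name, value) pairs
    to an already-rendered XML fragment."""
    decl = " ".join(f'{name}="{value}"' for name, value in attrs)
    return f"<?xml {decl}?>\n{xml}"

doc = add_xml_decl("<greeting>hi</greeting>",
                   [("version", "1.0"), ("encoding", "UTF-8")])
print(doc)
# <?xml version="1.0" encoding="UTF-8"?>
# <greeting>hi</greeting>
```

The point of the design is that the declaration is attached only once, as a final step, so the uniform 'turtles all the way down' XML representation inside the language is left untouched.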
... sounds convincing... :-)
This appears much easier to me than introducing an alternative direct
rendering for tags (like an encoding meta in HEAD).
Thank you for the support,
Nick
------------------------------
Date: Mon, 04 Feb 2008 15:12:36 +0100
From: Jörg Roman Rudnick
<[EMAIL PROTECTED]>
To: Jeremy Yallop <[EMAIL PROTECTED]>
Cc: [email protected]
Subject: Re: [links-users] <?xml declaration?>
Message-ID: <[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Message: 2
Dear Jeremy,
I'd suggest working from the outside in: changing the representation
of XML values, then fixing everything that breaks. Currently a text
node is represented internally as an OCaml string (see the `xmlitem`
type in result.ml). If you start by changing this to a wide character
type (perhaps something like Camomile's UChar.t) then the OCaml
compiler should point out everything else that needs to be changed.
I'd expect this part to involve quite a lot of changes, but I think
most of them should be pretty straightforward.
As far as I can tell, the UTF-8 *bytes* seem to be transferred correctly
so far... :-)
You may want to change strings to Unicode as well. Links strings,
like Haskell strings, are lists of characters, so you'll need to
change the representation of characters in the `primitive_value' type
in result.ml.
This leads me to a question I am really curious about:
these char lists, e.g. ['P', 'o', 'o', 'h'] - are they kept for the
sake of performance or for some other important reason?
There would presumably be no need to change character handling to wide
characters if strings were passed through whole. The only real problem
seems to be that strings are cut into pieces *bytewise*: e.g., "Bussibär"
is represented as
['B', 'u', 's', 's', 'i', 'b', 'Ã', '¤', 'r'],
where 'Ã' and '¤' are the two UTF-8 bytes of 'ä'. Once I take a hex
editor and change this to
['B', 'u', 's', 's', 'i', 'b', 'ä', 'r'],
"Bussibär" is rendered correctly, given an appropriate header. :-)
So, if there are no important considerations in favour of byte lists, I
would guess that this approach might even be favourable with regard to
performance, as wide-char recognition might add some overhead...
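The byte-splitting described here is easy to reproduce outside Links (a Python sketch; Links itself stores the bytes in plain OCaml strings):

```python
# "Bussibär" contains one non-ASCII character: 'ä' (U+00E4), which UTF-8
# encodes as the two bytes 0xC3 0xA4. Treating each byte as a character,
# as a byte-oriented lexer effectively does, splits 'ä' into 'Ã' and '¤'.
s = "Bussibär"
utf8_bytes = s.encode("utf-8")

bytewise = [chr(b) for b in utf8_bytes]   # the current, byte-split view
charwise = list(s)                        # the intended, character view

print(bytewise)  # ['B', 'u', 's', 's', 'i', 'b', 'Ã', '¤', 'r']
print(charwise)  # ['B', 'u', 's', 's', 'i', 'b', 'ä', 'r']
```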
If you want to use UTF-8 in input files then you'll need to change the
lexer, either to use Alain Frisch's ulex package (which has a rather
different interface), or to do UTF-8 decoding/encoding by hand (!).
This is probably the trickiest bit of the whole endeavour. It might
be worth considering punting on this and using numerical char refs
instead.
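Punting in this way can be mechanised: every non-ASCII character can be rewritten as a numeric character reference before the source ever reaches the lexer. A minimal sketch (the helper name is hypothetical):

```python
def to_char_refs(text: str) -> str:
    # Replace each non-ASCII character with a hexadecimal numeric
    # character reference; ASCII passes through untouched, so the
    # result is safe for a byte-oriented, ASCII-only lexer.
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):X};" for c in text)

print(to_char_refs("Bussibär"))  # Bussib&#xE4;r
```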
We'd certainly welcome a patch that adds support for UTF-8/Unicode:
it's something we're planning to do eventually, but it hasn't been a
high priority so far. The internals of Links have changed rather a
lot since the last public release; if you'd like access to our source
code repository to work on the latest version, let us know.
If there were an alternative that worked without wide-char recognition,
would you welcome that, too? Are there any arguments against rendering
the JavaScript code string-wise?
Thank you very much for your support,
Nick
------------------------------
Date: Mon, 04 Feb 2008 18:25:50 +0000
From: Ezra Cooper <[EMAIL PROTECTED]>
To: Jörg Roman Rudnick <[EMAIL PROTECTED]>
Cc: [email protected]
Subject: Re: [links-users] <?xml declaration?>
Message-ID: <[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8; format=flowed
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Message: 3
Jörg Roman Rudnick wrote:
Dear Jeremy,
...
You may want to change strings to Unicode as well. Links strings,
like Haskell strings, are lists of characters, so you'll need to
change the representation of characters in the `primitive_value' type
in result.ml.
This leads me to a question I am really curious about:
these char lists, e.g. ['P', 'o', 'o', 'h'] - are they kept for the
sake of performance or for some other important reason?
The reason we represent strings that way is for polymorphism: In Links,
the type String is simply [Char], a list of characters, so you can use
any list functions on Strings. We think this is a nice feature at the
level of the Links language. At the JavaScript level, it has
disadvantages--in particular, it makes it harder to talk to native
JavaScript functions, and it may have a performance cost in some situations.
We have considered doing some engineering to represent strings as JS
strings, and there are at least a few possible ways to do this, but it
hasn't been urgent thus far, and would complicate the JS runtime library.
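The trade-off can be seen by modelling a Links string the same way. A Python analogue (Python strings are not char lists, so the list-of-characters representation is simulated here):

```python
# Model a Links String as [Char]: a list of one-character strings,
# so that any generic list function applies to strings directly.
pooh = list("Pooh")                  # ['P', 'o', 'o', 'h']

print(list(reversed(pooh)))          # generic list reversal works
print(len(pooh))                     # generic list length works

# The cost: talking to native string APIs needs a conversion each way.
native = "".join(pooh)               # back to a flat native string
print(native)                        # Pooh
```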
There would presumably be no need to change character handling to wide
characters if strings were passed through whole. The only real problem
seems to be that strings are cut into pieces *bytewise*: e.g., "Bussibär"
is represented as
['B', 'u', 's', 's', 'i', 'b', 'Ã', '¤', 'r'],
where 'Ã' and '¤' are the two UTF-8 bytes of 'ä'. Once I take a hex
editor and change this to
['B', 'u', 's', 's', 'i', 'b', 'ä', 'r'],
"Bussibär" is rendered correctly, given an appropriate header. :-)
So, if there are no important considerations in favour of byte lists, I
would guess that this approach might even be favourable with regard to
performance, as wide-char recognition might add some overhead...
You're right: this is an outgrowth of the fact that the Links
lexer/parser has no idea that there are UTF-8 characters in its
input--it is literally looking at them as separate bytes--and it should
be independent of the array-based JavaScript representation of lists. If
the lexer/parser were modified to read its input as UTF-8 characters,
then I assume there would be no problem in writing out the JS string
literals as, for example,
['B','u','s','s','i','b','ä','r']
We'd certainly welcome a patch that adds support for UTF-8/Unicode:
it's something we're planning to do eventually, but it hasn't been a
high priority so far. The internals of Links have changed rather a
lot since the last public release; if you'd like access to our source
code repository to work on the latest version, let us know.
If there were an alternative that worked without wide-char recognition,
would you welcome that, too? Are there any arguments against rendering
the JavaScript code string-wise?
As I noted, it would be inconvenient to translate Links strings to
JavaScript strings while maintaining polymorphism in the source language.
Having UTF-8 support in source files would be great. Short of that, I
think there are other ways you could get non-ASCII characters into your
HTML; for example, you could use numerical HTML character refs as
Jeremy mentioned (e.g. &#x1F55; for ὕ). Another option would be to extend
the Char datatype to encompass all Unicode characters, but provide some
special way of creating the ones that are not ASCII. Presently, we allow
ASCII hex and octal character specs in string and character literals.
For example:
links> '\x40';
'@' : Char
You could extend this syntax to some broader encoding than ASCII. Perl
allows "\x{abcd}" to designate a character (using, I think, raw Unicode
code points rather than UTF-8 bytes).
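Such escapes are simple to expand mechanically. A sketch of a Perl-style \x{...} expander (the helper name is hypothetical; the hex digits are interpreted as a Unicode code point, not as UTF-8 bytes):

```python
import re

def expand_hex_escapes(text: str) -> str:
    # Replace each \x{...} escape with the character at that code point.
    return re.sub(r"\\x\{([0-9A-Fa-f]+)\}",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(expand_hex_escapes(r"Bussib\x{E4}r"))  # Bussibär
print(expand_hex_escapes(r"\x{40}"))         # @
```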
This last approach would be useful even if the lexer/parser is
eventually extended to accept fancy encodings. It would be useful, for
example, for programmers who couldn't work with UTF-8 in the source.
Good luck!
Ezra
------------------------------
_______________________________________________
links-users mailing list
[email protected]
http://lists.inf.ed.ac.uk/mailman/listinfo/links-users
End of links-users Digest, Vol 17, Issue 2
******************************************