Dear Jeremy,
>
> I'd suggest working from the outside in: changing the representation
> of XML values, then fixing everything that breaks.  Currently a text
> node is represented internally as an OCaml string (see the `xmlitem`
> type in result.ml).  If you start by changing this to a wide character
> type (perhaps something like Camomile's UChar.t) then the OCaml
> compiler should point out everything else that needs to be changed.
> I'd expect this part to involve quite a lot of changes, but I think
> most of them should be pretty straightforward.
From what I can see, the UTF-8 *bytes* seem to be transferred correctly
so far... :-)
>
> You may want to change strings to Unicode as well.  Links strings,
> like Haskell strings, are lists of characters, so you'll need to
> change the representation of characters in the `primitive_value' type
> in result.ml.
This leads me to a question I am really curious about:

These char lists, e.g., ['P', 'o', 'o', 'h'] - are they used for the
sake of performance or for some other important reason?

There would presumably be no need to change character handling to wide
characters if strings were passed through whole. The only problem
actually seems to be that strings are cut into pieces *bytewise*: e.g.,
"Bussibär" is represented as:

['B', 'u', 's', 's', 'i', 'b', 'Ã', '¤', 'r'],

'Ã', '¤' representing the two UTF-8 bytes of 'ä'. Once I take a hex
editor and change the whole to

['B', 'u', 's', 's', 'i', 'b', 'ä', 'r'],

"Bussibär" is represented correctly with an appropriate header. :-)
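
To make the difference concrete, here is a small sketch. It is written
in Python purely for illustration and is not the Links code; the
variable names are my own. It shows how cutting the UTF-8 bytes of
"Bussibär" one by one yields the two separate items for 'ä', while
cutting per character keeps it whole:

```python
# Illustration only: Python stands in for the OCaml internals here.
word = "Bussibär"
utf8 = word.encode("utf-8")      # 'ä' is encoded as two bytes: 0xC3 0xA4

# Cut bytewise, then view every byte as a separate Latin-1 character.
bytewise = [chr(b) for b in utf8]
print(bytewise)

# Cut per character instead: 'ä' stays in one piece.
charwise = list(word)
print(charwise)
```

The first list has nine items, the second only eight.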

So, if there are no important considerations in favour of byte lists, I
would guess that this approach might be preferable in terms of
performance, as wide char recognition might add some overhead...
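
One way this might work without a wide character type is to keep the
bytes as they are and only make sure that a multi-byte sequence is never
torn apart. The following is a rough sketch in Python (again only for
illustration, not Links code). The function name `split_utf8` is
hypothetical, and it assumes the input is already valid UTF-8:

```python
def split_utf8(data):
    """Group UTF-8 bytes into one chunk of bytes per character.

    Assumes `data` is valid UTF-8; no decoding to code points is done.
    """
    chunks = []
    for b in data:
        if chunks and (b & 0xC0) == 0x80:
            # Continuation byte (10xxxxxx): attach it to the previous chunk.
            chunks[-1] += bytes([b])
        else:
            # ASCII byte or lead byte: start a new chunk.
            chunks.append(bytes([b]))
    return chunks


pieces = split_utf8("Bussibär".encode("utf-8"))
print(pieces)
```

Each list item is then a short byte string rather than a single byte, so
the two bytes of 'ä' travel together.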
>
> If you want to use UTF-8 in input files then you'll need to change the
> lexer, either to use Alain Frisch's ulex package (which has a rather
> different interface), or to do UTF-8 decoding/encoding by hand (!).
> This is probably the trickiest bit of the whole endeavour.  It might
> be worth considering punting on this and using numerical char refs
> instead.
>
> We'd certainly welcome a patch that adds support for UTF-8/Unicode:
> it's something we're planning to do eventually, but it hasn't been a
> high priority so far.  The internals of Links have changed rather a
> lot since the last public release; if you'd like access to our source
> code repository to work on the latest version, let us know.
>
If there were a way to do this without wide char recognition, would
you welcome that, too? Are there any arguments against rendering the
JavaScript code string-wise?


Thank you very much for your support,

          Nick


_______________________________________________
links-users mailing list
[email protected]
http://lists.inf.ed.ac.uk/mailman/listinfo/links-users
