On Tue, 19 Aug 2008, Achim D. Brucker wrote:

> On Tue, Aug 19, 2008 at 02:18:36PM +1000, Michael Norrish wrote:
> > Are there any plans to implement some sort of sensible WideChar  
> > signature?  (The existing WideChar signature in the Basis is not really  
> > a good base on which to build good support here.)
> 
> there is a Unicode library (based on the Word32 datatype) available as
> part of fxp [1] (an XML parser) that works with various SML systems (I
> use it with current releases of Poly/ML, SML/NJ, and MLton).

Note that Word32 in Poly/ML is rather inefficient: it is represented as a 
tuple of 2 smaller word types (even on 64-bit platforms).

In Isabelle we have managed to ignore character encodings beyond plain 
ASCII, using the \<forall> symbol notation that you certainly know of.  
Back in 1997 we took the window of opportunity *not* to convert to the 
then newly introduced char type, i.e. our view on exploded strings is 
still that of lists of (small) strings, each either a single char "a" or 
a named symbol "\<forall>".  Luckily a singleton string in Poly/ML is just 
an unboxed integer, i.e. more efficient than Word32 or Int32.  In other 
words, the original SML90 standard essentially already had chars of 
arbitrary width.
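
As a rough illustration of the "list of small strings" view, here is a 
minimal Python sketch (not the actual Isabelle/ML code).  The function 
name symbol_explode and the exact pattern for symbol names are 
assumptions made for this example only.

```python
import re

# Assumed symbol syntax for this sketch: a backslash, "<", an optional "^",
# an identifier, then ">".  Anything else is treated as a single character.
_SYMBOL = re.compile(r"\\<\^?[A-Za-z][A-Za-z0-9_']*>|.", re.DOTALL)


def symbol_explode(text):
    """Split text into a list of small strings: named symbols or single chars."""
    return _SYMBOL.findall(text)
```

For instance, symbol_explode(r"\<forall>x") yields a two-element list: the 
named symbol as one string, followed by "x".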

The user interface can convert to whatever encoding it needs to render 
text.  For example, the JVM uses UTF-16 with odd "surrogate characters" to 
represent Unicode code points outside of the "basic multilingual plane", 
such as blackboard-bold B.
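
To make the surrogate mechanism concrete, here is a small Python sketch 
of the standard UTF-16 encoding of one code point.  The function name 
utf16_code_units is made up for this example, and U+1D539 is taken to be 
the blackboard-bold B mentioned above.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if code_point < 0x10000:
        # Inside the basic multilingual plane: a single 16-bit unit.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

Under these assumptions, utf16_code_units(0x1D539) gives the pair 
[0xD835, 0xDD39], while a plain ASCII letter stays a single unit.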

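Recently we have introduced one minor change to this encoding-agnostic 
approach in Isabelle: 1 line of ML to count character positions, which 
ignores any byte in the range 128..191 (i.e. from 128 up to, but not 
including, 192).  This fits well with UTF-8, where exactly these bytes 
are the trailing bytes of a multi-byte sequence, and also with 
ISO-latin-1 (at the price of ignoring the special punctuation 160..191).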
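
The counting rule can be sketched in Python as follows.  This is only an 
illustration of the idea, not the ML line used in Isabelle; the name 
count_positions is hypothetical, and the range is read as 128 <= b < 192.

```python
def count_positions(data: bytes) -> int:
    """Count character positions, skipping bytes in the range 128..191.

    In UTF-8 these are exactly the continuation bytes, so each multi-byte
    sequence counts once.  In ISO-latin-1 the same rule skips the
    punctuation block, which is the accepted inaccuracy.
    """
    return sum(1 for b in data if not (128 <= b < 192))
```

With UTF-8 input such as "∀x".encode("utf-8"), which is 4 bytes long, the 
count is 2.  A Latin-1 accented letter such as "é" (byte 233) still 
counts as one position.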


        Makarius
_______________________________________________
polyml mailing list
polyml@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/polyml
