Here's a J model for converting between utf8 and utf32 (two
representations of unicode).

The model assumes characters are represented as numbers.  We do not
currently have 32 bit character literals.  The conversion uses numeric
properties of the characters.

To convert 8 bit literals to numbers, you can use:

charnum=: a.&i.
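
For example (a quick check, assuming the standard alphabet a., in which
ASCII characters sit at their ASCII positions):

   charnum 'abc'
97 98 99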

Here's the model:

CPLEN=:1 0 2 3 4 5 6 0 #~ 2>.2^i.-8
CPBASE=: 2^(#~ 2>.2&^)i.-8
CPOFF=: _128+ CPLEN i. ~.CPLEN
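
As a quick look at what these tables hold (a sketch, assuming the three
definitions above have been loaded): CPLEN maps each byte value to the
length of the sequence it starts, with 0 for bytes that cannot start
one; CPBASE is the modulus that strips the marker bits from a byte; and
CPOFF, indexed by sequence length, is what gets added on top of 128 to
form the lead byte.

   0 127 128 191 192 223 224 239 240 247 { CPLEN
1 1 0 0 2 2 3 3 4 4
   0 127 128 191 192 223 224 239 240 247 { CPBASE
128 128 64 64 32 32 16 16 8 8
   CPOFF
_128 0 64 96 112 120 124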

utf8len=: {&CPLEN
utf8dat=: {&CPBASE | ]
utf8to32=: *@utf8len (64 #. utf8dat);.1 ]
utf32to8=: [: ; <@((+ 128 + # {. CPOFF {~ #)@(#.inv~&64)^:(>&127))"0
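
As a worked example (a sketch, assuming the definitions above), take the
euro sign, code point 8364 (U+20AC), whose utf-8 bytes are 226 130 172.
The lead byte contributes 16|226, the two continuation bytes contribute
64|130 and 64|172, and the three values are combined in base 64:

   utf8dat 226 130 172
2 2 44
   utf8to32 97 226 130 172
97 8364
   utf32to8 97 8364
97 226 130 172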

Here's an illustration that these mechanisms are consistent with J's
existing utf-8 support, checked over the 256 characters of a. (the
alphabet):

   (utf32to8 i.256)-: charnum 8 u: 2 u: a.
1

   (i.256)-: utf8to32 charnum 8 u: 2 u: a.
1

Here's a test for valid utf-8:

isutf8=: # -: CPLEN +/@:{~ ]
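
For example (assuming the definitions above), a complete three-byte
sequence passes and a truncated one does not:

   isutf8 226 130 172
1
   isutf8 226 130
0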

This is not a complete test: it only checks that the number of bytes
matches the total of the lengths announced by the lead bytes -- it does
not check that the bytes are ordered properly.  Here's a version of the
conversion from utf8 which fails if it's given invalid utf8:

utf8valid=: [ assert@isutf8
utf8to32valid=: *@utf8len (64 #. utf8dat)@utf8valid;.1 utf8valid

This provides a complete test of the ordering, since each piece cut out
by ;.1 is validated on its own: an individual character cannot be valid
unless its piece has exactly the length its lead byte calls for.
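
To illustrate the difference (a sketch, assuming the definitions above),
here the continuation byte 130 is moved in front of its lead byte.  The
byte counts still add up, so isutf8 accepts the sequence, while
utf8to32valid cuts out the piece 226 172, finds it one byte short, and
stops with an assertion failure (error display not shown).  This assumes
assertions are enabled, which I believe is the J default.

   isutf8 130 226 172
1
   utf8to32valid 97 226 130 172
97 8364
   utf8to32valid 130 226 172   NB. signals an assertion failure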

FYI,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm