My favorite Unicode story is from macOS filenames.

macOS decomposes filenames and only tracks the decomposed form (the base
letter stored separately from the overstrike).

The following accented characters look the same but have different byte values.

   c=: 195 164{a. NB. composed
   c
ä
   d=: 97 204 136{a. NB. decomposed
   d
ä
   c-:d
0
   'abc'fwrite c
   fread c NB. fails on macOS, as the system looks for the decomposed form of the name
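
To see the difference at the code point level, here is a minimal sketch
continuing the session above. It assumes 7 u: is used to decode the UTF-8
bytes and 3 u: to list the resulting code points:

   #c   NB. two bytes: one precomposed character
2
   #d   NB. three bytes: base letter plus combining mark
3
   3 u: 7 u: c   NB. U+00E4
228
   3 u: 7 u: d   NB. 'a' followed by U+0308 (combining diaeresis)
97 776

The two names differ as byte lists, which is why c-:d is 0 even though
both display as the same glyph.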

Torvalds has a wonderful rant about this that is a fun read.

On Tue, Mar 22, 2022 at 7:02 PM Raul Miller <[email protected]> wrote:

> I ran into a situation, today (dealing with files), where most of the
> files were utf-8 encoded but some represented the latin-1 "code plane"
> with 8 bit characters.
>
> To cope with this issue, I coded up a mechanism to test whether the
> file contained only valid utf-8 sequences, and used {{ ": 10 u: y }}
> for the files which failed this test.
>
> In other words:
>
> cclass=: (i.9) (48+i.9)} 256#9
> cstates=: 0 10#:10* ".;._2{{)n
>   0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 0: start char sequence
>   0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 1: finish char sequence, start next
>   7.3  1    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 2: need one more character
>   7.3  2    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 3: need two more characters
>   7.3  3    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 4: need three more characters
>   7.3  4    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 5: need four more characters
>   7.3  5    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 6: need five more characters
>   7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.2 NB. 7: end
> }}
>
> utf8lenb=: <:2#.>1 #each~1+i.8
> utf8ok=: {{
>   try.
>     (1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
>     1
>   catch.
>     0
>   end.
> }}
>
> NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
> latin2utf8=: {{
>   if.utf8ok y do. y else. ":10 u: y end.
> }}
>
> I don't know if this approach would be useful to anyone else here,
> but... just in case...
>
> FYI,
>
> --
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
