I wonder what happens when you create two files with distinct names, and
then Unicode changes such that they become the same when normalised,
casefolded, etc.
Probably nothing good.
On Tue, 22 Mar 2022, Eric Iverson wrote:
My favorite Unicode story is from macOS filenames.
They decompose filenames and only track the decomposed form (letter
separate from the overstrike).
The following accented chars look the same, but have different values.
c=: 195 164{a. NB. composed
c
ä
d=: 97 204 136{a. NB. decomposed
d
ä
c-:d
0
'abc'fwrite c
fread c NB. fails on macOS, as the system has the decomposed form as the name it looks for
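One way to see why c and d differ even though they render identically is to look at the code points behind the bytes. A minimal sketch, assuming a J9 session in which 7 u: converts UTF-8 bytes to unicode and 3 u: returns code points; c and d are the names defined above, repeated here so the snippet stands alone:

```j
c=: 195 164{a.        NB. composed: one code point, two UTF-8 bytes
d=: 97 204 136{a.     NB. decomposed: two code points, three UTF-8 bytes
(#c),#d               NB. byte counts: 2 3
3 u: 7 u: c           NB. 228 (U+00E4, a with diaeresis)
3 u: 7 u: d           NB. 97 776 (U+0061 'a', then U+0308 combining diaeresis)
```

Making the two compare equal would need Unicode normalisation (NFC or NFD), which is table-driven and not something this sketch attempts.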
Torvalds has a wonderful rant about this that is a fun read.
On Tue, Mar 22, 2022 at 7:02 PM Raul Miller wrote:
I ran into a situation, today (dealing with files), where most of the
files were utf-8 encoded but some represented the latin-1 "code plane"
with 8 bit characters.
To cope with this issue, I coded up a mechanism to test whether the
file contained only valid utf-8 sequences, and used {{ ": 10 u: y }}
for the files which failed this test.
In other words:
cclass=: (i.9) (48+i.9)} 256#9
cstates=: 0 10#:10* ".;._2{{)n
0   7.3 2   3   4   5   6   7.3 7.3 7.1 NB. 0: start char sequence
0   7.3 2   3   4   5   6   7.3 7.3 7.1 NB. 1: finish char sequence, start next
7.3 1   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 2: need one more character
7.3 2   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 3: need two more characters
7.3 3   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 4: need three more characters
7.3 4   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 5: need four more characters
7.3 5   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 NB. 6: need five more characters
7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.2 NB. 7: end
}}
utf8lenb=: <:2#.>1 #each~1+i.8
utf8ok=: {{
try.
(1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
1
catch.
0
end.
}}
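For readers puzzling over utf8lenb: it evaluates to the upper bound of each UTF-8 byte class, so that I. maps a byte value to a class number. A small sketch of that step in isolation; the sample byte values are my own choice, and each is assumed to come from the standard library as in the definition above:

```j
utf8lenb=: <:2#.>1 #each~1+i.8
utf8lenb                                NB. 127 191 223 239 247 251 253 254
NB. class 0 = ASCII, class 1 = continuation byte,
NB. classes 2..6 = lead byte of a sequence of that many bytes (5 and 6 are the legacy long forms),
NB. classes 7 and 8 = the bytes 254 and 255, which never occur in UTF-8
utf8lenb I. 65 164 195 226 240 255      NB. 0 1 2 3 4 8
```

The class numbers index into '012345678_', which is how each byte ends up in the matching column of cstates.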
NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
latin2utf8=: {{
if. utf8ok y do. y else. ": 10 u: y end.
}}
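For comparison, the same structural test can be written as a plain loop rather than a ;: state machine. This is only a sketch and not the method used above: utf8check is a hypothetical name, it accepts sequences of at most four bytes, and like the table above it checks structure only (it does not reject overlong encodings or surrogates).

```j
NB. utf8check is a hypothetical name, not part of the original post
utf8check=: {{
  need=. 0                                 NB. continuation bytes still expected
  for_k. 127 191 223 239 247 I. a. i. y do.
    if. k = 1 do.                          NB. continuation byte
      if. need = 0 do. 0 return. end.      NB. ... with no sequence open
      need=. need - 1
    elseif. need > 0 do. 0 return.         NB. sequence cut short
    elseif. k = 5 do. 0 return.            NB. bytes 248..255 never start a sequence
    elseif. k > 1 do. need=. k - 1         NB. lead byte of a k-byte sequence
    end.
  end.
  need = 0                                 NB. input must not end mid-sequence
}}

utf8check 195 164{a.    NB. 1: valid two-byte sequence
utf8check 228 65{a.     NB. 0: Latin-1 byte 228 followed by 'A'
```

The loop is easier to read but slower on large files than a single ;: pass, which is presumably part of the appeal of the state-machine version.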
I don't know if this approach would be useful to anyone else here,
but... just in case...
FYI,
--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm