Not sure. And not sure I want to know.

But to continue the example:

   fread c  NB. fails on macos as the system has the decomposed form as the name it looks for
   fread d
abc

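
For anyone who wants to poke at the composed/decomposed difference outside of J, here is a rough sketch in Python (chosen only because its standard library ships a normaliser; as far as I know J has no primitive for this). The byte values are the ones from the example in this thread. The helper names same_name and code_points are made up for illustration, and comparing names after normalising both to NFD is an assumption about how one might work around the issue, not a description of what macOS does internally.

```python
import unicodedata

# Byte sequences copied from the J example in this thread.
composed = bytes([195, 164]).decode("utf-8")        # one code point, U+00E4
decomposed = bytes([97, 204, 136]).decode("utf-8")  # 'a' followed by U+0308


def code_points(s):
    """Return the Unicode code points of s as a list of integers."""
    return [ord(ch) for ch in s]


def same_name(a, b, form="NFD"):
    """Hypothetical helper: compare two names after normalising both.

    The default form, NFD, is the canonical decomposed form.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(code_points(composed), code_points(decomposed))
    print(composed == decomposed)           # raw comparison, like c-:d
    print(same_name(composed, decomposed))  # comparison after normalising
```

The raw comparison fails for the same reason c-:d gives 0, while the normalised comparison treats the two spellings as the same name.
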
On Tue, Mar 22, 2022 at 9:39 PM Elijah Stone <[email protected]> wrote:

> I wonder what happens when you create two files with distinct names, and
> then unicode changes such that they are the same when
> normalised/casefolded/..
>
> Probably nothing good.
>
> On Tue, 22 Mar 2022, Eric Iverson wrote:
>
> > My favorite unicode story is from macos filenames.
> >
> > They decompose filenames and only track the decomposed form (letter
> > separate from the overstrike).
> >
> > The following accented chars look the same, but have different values.
> >
> >    c=: 195 164{a.  NB. composed
> >    c
> > ä
> >    d=: 97 204 136{a.  NB. decomposed
> >    d
> > ä
> >    c-:d
> > 0
> >    'abc'fwrite c
> >    fread c  NB. fails on macos as the system has the decomposed form as the name it looks for
> >
> > Torvald has a wonderful rant about this that is a fun read.
> >
> > On Tue, Mar 22, 2022 at 7:02 PM Raul Miller <[email protected]> wrote:
> >
> >> I ran into a situation, today (dealing with files), where most of the
> >> files were utf-8 encoded but some represented the latin-1 "code plane"
> >> with 8 bit characters.
> >>
> >> To cope with this issue, I coded up a mechanism to test whether the
> >> file contained only valid utf-8 sequences, and used {{ ": 10 u: y }}
> >> for the files which failed this test.
> >>
> >> In other words:
> >>
> >> cclass=: (i.9) (48+i.9)} 256#9
> >> cstates=: 0 10#:10* ".;._2{{)n
> >>   0   7.3 2   3   4   5   6   7.3 7.3 7.1  NB. 0: start char sequence
> >>   0   7.3 2   3   4   5   6   7.3 7.3 7.1  NB. 1: finish char sequence, start next
> >>   7.3 1   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 2: need one more character
> >>   7.3 2   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 3: need two more characters
> >>   7.3 3   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 4: need three more characters
> >>   7.3 4   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 5: need four more characters
> >>   7.3 5   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 6: need five more characters
> >>   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.2  NB. 7: end
> >> }}
> >>
> >> utf8lenb=: <:2#.>1 #each~1+i.8
> >> utf8ok=: {{
> >>   try.
> >>     (1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
> >>     1
> >>   catch.
> >>     0
> >>   end.
> >> }}
> >>
> >> NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
> >> latin2utf8=: {{
> >>   if.utf8ok y do. y else. ":10 u: y end.
> >> }}
> >>
> >> I don't know if this approach would be useful to anyone else here,
> >> but... just in case...
> >>
> >> FYI,
> >>
> >> --
> >> Raul
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
