This is a bug in J. J should follow the macOS file system's name normalization rule. I'll take a look.
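
For illustration only, here is a minimal sketch of what "following the normalization rule" could mean, written in Python rather than J and not meant as the eventual fix. It decodes the two byte sequences from the quoted example and compares them after normalizing both to NFD. The helper name `same_name` is hypothetical, and standard NFD is an assumption; the rule the macOS file system actually applies may differ in details.

```python
import unicodedata

# The two byte sequences from the quoted example, decoded as UTF-8.
composed = bytes([195, 164]).decode("utf-8")        # U+00E4 (single code point)
decomposed = bytes([97, 204, 136]).decode("utf-8")  # U+0061 followed by U+0308


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFD normalization.

    Assumption: plain NFD approximates the file system's rule.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(composed == decomposed)           # False: the raw strings differ
print(same_name(composed, decomposed))  # True: they match once normalized
```

Normalizing a name before handing it to the file system, or before comparing it with a name read back from a directory listing, is one way to make the composed and decomposed spellings refer to the same file.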

On Wed, 23 Mar 2022 at 11:01 AM Eric Iverson wrote:

> Not sure. And not sure I want to know.
>
> But to continue the example:
>    fread c NB. fails on macos as the system has the decomposed form as the name it looks for
>    fread d
> abc
>
> On Tue, Mar 22, 2022 at 9:39 PM Elijah Stone wrote:
>
> > I wonder what happens when you create two files with distinct names, and
> > then unicode changes such that they are the same when
> > normalised/casefolded/..
> >
> > Probably nothing good.
> >
> > On Tue, 22 Mar 2022, Eric Iverson wrote:
> >
> > > My favorite unicode story is from macos filenames.
> > >
> > > They decompose filenames and only track the decomposed form (letter
> > > separate from the overstrike).
> > >
> > > The following accented chars look the same, but have different values.
> > >
> > >    c=: 195 164{a. NB. composed
> > >    c
> > > ä
> > >    d=: 97 204 136{a. NB. decomposed
> > >    d
> > > ä
> > >    c-:d
> > > 0
> > >    'abc'fwrite c
> > >    fread c NB. fails on macos as the system has the decomposed form as the name it looks for
> > >
> > > Torvalds has a wonderful rant about this that is a fun read.
> > >
> > > On Tue, Mar 22, 2022 at 7:02 PM Raul Miller wrote:
> > >
> > >> I ran into a situation, today (dealing with files), where most of the
> > >> files were utf-8 encoded but some represented the latin-1 "code plane"
> > >> with 8 bit characters.
> > >>
> > >> To cope with this issue, I coded up a mechanism to test whether the
> > >> file contained only valid utf-8 sequences, and used {{ ": 10 u: y }}
> > >> for the files which failed this test.
> > >>
> > >> In other words:
> > >>
> > >> cclass=: (i.9) (48+i.9)} 256#9
> > >> cstates=: 0 10#:10* ".;._2{{)n
> > >>   0   7.3 2   3   4   5   6   7.3 7.3 7.1  NB. 0: start char sequence
> > >>   0   7.3 2   3   4   5   6   7.3 7.3 7.1  NB. 1: finish char sequence, start next
> > >>   7.3 1   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 2: need one more character
> > >>   7.3 2   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 3: need two more characters
> > >>   7.3 3   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 4: need three more characters
> > >>   7.3 4   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 5: need four more characters
> > >>   7.3 5   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3  NB. 6: need five more characters
> > >>   7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.2  NB. 7: end
> > >> }}
> > >>
> > >> utf8lenb=: <:2#.>1 #each~1+i.8
> > >> utf8ok=: {{
> > >>   try.
> > >>     (1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
> > >>     1
> > >>   catch.
> > >>     0
> > >>   end.
> > >> }}
> > >>
> > >> NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
> > >> latin2utf8=: {{
> > >>   if.utf8ok y do. y else. ":10 u: y end.
> > >> }}
> > >>
> > >> I don't know if this approach would be useful to anyone else here,
> > >> but... just in case...
> > >>
> > >> FYI,
> > >>
> > >> --
> > >> Raul
> > >> ----------------------------------------------------------------------
> > >> For information about J forums see http://www.jsoftware.com/forums.htm
