This is a bug in J. J should follow the macOS file system's name
normalization rules. I'll take a look.
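
For reference, here is a minimal sketch of why the two names in the quoted example fail to match. It only illustrates the mismatch and is not a fix. The helper name cp is hypothetical, and the sketch assumes a J version with direct definitions ({{ }}), as used elsewhere in this thread.

```j
NB. Sketch only: compare the two spellings from the quoted example.
c=: 195 164{a.           NB. composed: U+00E4 encoded as UTF-8
d=: 97 204 136{a.        NB. decomposed: 'a' then U+0308, encoded as UTF-8
cp=: {{ 3 u: 7 u: y }}   NB. hypothetical helper: UTF-8 bytes -> code points
cp c                     NB. 228
cp d                     NB. 97 776
c -: d                   NB. 0: a byte-wise comparison sees two different names
```

Both spellings render as the same glyph, so any lookup that compares raw bytes has to bring both sides to one normalization form first.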


On Wed, 23 Mar 2022 at 11:01 AM Eric Iverson <[email protected]>
wrote:

> Not sure. And not sure I want to know.
>
> But to continue the example:
>    fread c NB. fails on macos as the system has the decomposed form as the name it looks for
>    fread d
> abc
>
>
> On Tue, Mar 22, 2022 at 9:39 PM Elijah Stone <[email protected]> wrote:
>
> > I wonder what happens when you create two files with distinct names, and
> > then unicode changes such that they are the same when
> > normalised/casefolded/..
> >
> > Probably nothing good.
> >
> > On Tue, 22 Mar 2022, Eric Iverson wrote:
> >
> > > My favorite unicode story is from macos filenames.
> > >
> > > They decompose filenames and only track the decomposed form (letter
> > > separate from the overstrike).
> > >
> > > The following accented chars look the same, but have different values.
> > >
> > >   c=: 195 164{a. NB. composed
> > >   c
> > > ä
> > >   d=: 97 204 136{a. NB. decomposed
> > >   d
> > > ä
> > >   c-:d
> > > 0
> > >   'abc'fwrite c
> > >   fread c NB. fails on macos as the system has the decomposed form as the name it looks for
> > >
> > > Torvalds has a wonderful rant about this that is a fun read.
> > >
> > > On Tue, Mar 22, 2022 at 7:02 PM Raul Miller <[email protected]> wrote:
> > >
> > >> I ran into a situation, today (dealing with files), where most of the
> > >> files were utf-8 encoded but some represented the latin-1 "code plane"
> > >> with 8 bit characters.
> > >>
> > >> To cope with this issue, I coded up a mechanism to test whether the
> > >> file contained only valid utf-8 sequences, and used {{ ": 10 u: y }}
> > >> for the files which failed this test.
> > >>
> > >> In other words:
> > >>
> > >> cclass=: (i.9) (48+i.9)} 256#9
> > >> cstates=: 0 10#:10* ".;._2{{)n
> > >>   0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 0: start char sequence
> > >>   0    7.3  2    3    4    5    6    7.3  7.3  7.1 NB. 1: finish char sequence, start next
> > >>   7.3  1    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 2: need one more character
> > >>   7.3  2    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 3: need two more characters
> > >>   7.3  3    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 4: need three more characters
> > >>   7.3  4    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 5: need four more characters
> > >>   7.3  5    7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3 NB. 6: need five more characters
> > >>   7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.3  7.2 NB. 7: end
> > >> }}
> > >>
> > >> utf8lenb=: <:2#.>1 #each~1+i.8
> > >> utf8ok=: {{
> > >>   try.
> > >>     (1;cstates;cclass) ;: '.',~'012345678_'{~ utf8lenb I. 3 u: y
> > >>     1
> > >>   catch.
> > >>     0
> > >>   end.
> > >> }}
> > >>
> > >> NB. most content is utf-8 -- assume non-utf-8 sequences are ascii+latin-1
> > >> latin2utf8=: {{
> > >>   if.utf8ok y do. y else. ":10 u: y end.
> > >> }}
> > >>
> > >> I don't know if this approach would be useful to anyone else here,
> > >> but... just in case...
> > >>
> > >> FYI,
> > >>
> > >> --
> > >> Raul
> > >> ----------------------------------------------------------------------
> > >> For information about J forums see http://www.jsoftware.com/forums.htm
> > >>
