Re: [Jprogramming] RFC: unicode

bill lam Sun, 20 Mar 2022 00:23:09 -0700

It is not problematic. If you know your data and apply a suitable conversion,
everything is ok.
A string is not a data structure in J; it is just a rank-1 literal array. J
can never be sure whether a string is utf8, some other codepage, or binary
data. For example, if the result of x,z is not what users expected, it is
the responsibility of the users to do a conversion to make the intention clear.
If #x = 4 and #z = 4, I would expect #x,z = 8, and this is the current
behavior.
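
As a minimal sketch of such an explicit conversion, reusing x and z from
Raul's example quoted below (the results shown are the ones reported in that
message, not re-run here):

```j
   x=: 8 u: 97 243 98          NB. UTF-8 encoded bytes, datatype literal
   z=: 10 u: 97 195 179 98     NB. four code points, datatype unicode4
   #x,z                        NB. plain append: each byte of x becomes one element
8
   #x,&(8 u: ]) z              NB. encode z as UTF-8 first, then append bytes
10
```

Which of the two results is wanted depends on what the data actually is, and
only the user knows that.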


On Sun, 20 Mar 2022 at 2:37 PM Elijah Stone wrote:

> The section I quoted comes from the u: page, not the ": page.
> https://www.jsoftware.com/docs/help807/dictionary/duco.htm
>
> It may be more pure, or not, to implement ": and u: as foreigns, but we
> still have to decide how they should behave.  And I find the current
> behaviour (as well as that of e.g. 1!:1) problematic.
>
> On Sun, 20 Mar 2022, bill lam wrote:
>
> > If ": and u: were implemented using the foreign conjunction, J would be
> > more pure.
> > The original J dictionary said nothing about unicode at all. How to
> > handle unicode in ": is implementation dependent.
> >
> > https://www.jsoftware.com/docs/help807/dictionary/d602.htm
> >
> > On Sun, Mar 20, 2022 at 1:56 PM Raul Miller wrote:
> >
> >> I think a point has been lost here (partially because of hasty
> >> statements I made, where I was not considering all of the details of
> >> how ": works) on why getting rid of u: would not change anything about
> >> the initial example in this thread:
> >>
> >>    #x=: 8 u: 97 243 98
> >> 4
> >>    datatype x
> >> literal
> >>    #z=: 10 u:  97 195 179 98
> >> 4
> >>    datatype z
> >> unicode4
> >>    datatype x,z
> >> unicode4
> >>    #x,z
> >> 8
> >>
> >> When displayed, x is displayed as utf-8. This is largely due to
> >> properties of the host environment and the operating system. Here, x
> >> is treated as an array of unicode octets.
> >>
> >> When we combine x and z into an array, x is not treated as an array of
> >> octets. It is instead treated as a utf-32 sequence. Discarding u:
> >> would not change this, because u: was not involved in that operation.
> >>
> >> Most likely, the operation you were looking for was something like
> >>
> >>    #x,&":z
> >> 10
> >>
> >> or
> >>
> >>    #x,&(8 u: ]) z
> >> 10
> >>
> >> Here, we are not treating x as a utf-32 array -- we are instead first
> >> representing z as utf-8.
> >>
> >> And, again, discarding u: would not change this aspect of J (except to
> >> cause an error for the x,&(8 u: ]) z example).
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Sun, Mar 20, 2022 at 1:10 AM Raul Miller wrote:
> >> >
> >> > On Sat, Mar 19, 2022 at 8:34 PM Elijah Stone wrote:
> >> > > I think a deprecation period would probably be a good idea.
> >> >
> >> > I think we would need to complete the preceding steps before we
> >> > attempted such a thing.
> >> >
> >> > Deprecation based on something which has not been implemented is bad
> >> > news.
> >> >
> >> > > Per the dictionary:
> >> > >
> >> > > > ": converts literal2 and literal4 to U8 encoded 1-byte char
> >> >
> >> > Yes, I realized that after I hit send on that message.
> >> >
> >> > > Not specified is whether literal2 is interpreted as ucs-2 or utf-16.
> >> > > Experimentally, it is utf-16.
> >> >
> >> > It's my understanding that ucs-2 is a subset of utf-16.
> >> >
> >> > > >   ; verb each sequence
> >> > >
> >> > > I don't understand the significance of this.
> >> >
> >> > Generally speaking, when you are working with text, you are working
> >> > with arbitrary length sequences. So, boxing intermediate results and
> >> > razing the boxes is a frequently used idiom.
> >> >
> >> >    ;(# ":)each 1 2 3
> >> > 122333
> >> >
> >> > > > Generally speaking, if you want an unambiguous representation of
> >> > > > your data, you should use something like {{ 5!:5<'y' }} rather than ":
> >> > >
> >> > > I don't need unambiguous.  I'll take non-obfuscatory.  And, as
> >> > > mentioned, the behaviour of ": here is inconsistent with other primitives.
> >> >
> >> > Every primitive is in some sense "inconsistent" with other primitives,
> >> > because every primitive accomplishes something different.
> >> >
> >> > The ": primitive is about formatting text for display. That is going
> >> > to have to be different from an operation like addition.
> >> >
> >> > > > it is not being displayed correctly.
> >> > >
> >> > > The display seems correct to me.
> >> >
> >> > Ah, that was my browser / email client messing up.
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Raul
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
