Re: [Jbeta] U8 and unicode

Raul Miller Wed, 29 May 2013 11:42:51 -0700

On Wed, May 29, 2013 at 1:30 PM, Don Guinn wrote:
> I don't think so. But maybe. The thing is that everything in J assumes that
> literal (char) is U8 except monadic u: default for char and concatenating
> literal with unicode (wchar).


I do not think this is accurate.  For example:

   'def',2 u: 'abc'
defabc
   3!:0 'abc'
2
   3!:0]2 u: 'abc'
131072
   3!:0 'def',2 u: 'abc'
131072

The thing to realize is that J has a concept of a "character literal".
Conceptually, this is a number which is treated as a character (we do
not support arithmetic on literals, for better or worse).
Interpretation of that number belongs to the context where that number
is delivered.
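
A minimal sketch of the "character is a number" view, using only the
u: conversions and a. from standard J. The sample string is arbitrary,
and the outputs are what I would expect from a stock J session:

```j
   a. i. 'abc'          NB. position of each literal in the alphabet a.
97 98 99
   3 u: 'abc'           NB. the same numbers, taken from a char list
97 98 99
   3 u: 2 u: 'abc'      NB. unchanged after widening to wchar
97 98 99
```

The storage type changes (2 versus 131072 from 3!:0), but the numbers
do not. What those numbers mean is left to whatever consumes them.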

U8 is an example of something where J delivers a sequence of numbers
to some external context.

> Just make everything assume that literal is U8.

Why should we make this assumption when it would break existing code?

> I can't think of a case where one would want it otherwise if the char
> were really text.

Does this mean you cannot think of uses for u16 (wchar) or u32 (not
yet implemented)? If so, should this lack of ability to think of uses
be valid justification for breaking backwards compatibility?

> If there is a case where one wants to copy the lower byte
> and zero the upper byte to wchar, apply 2&u: . Does anybody now use
> _128{.a. characters for anything other than U8? If not, such a change
> should not affect anyone. char data which does not contain any U8 would not
> be affected.

It seems to me that box drawing characters are in _128{.a. and that
they are not u8 characters.

It also seems to me that we could support u16 (and maybe u32) unicode
box drawing characters here.

> Right now if one has both wchar and U8 in an application, care must be
> taken to make sure that any char data that might contain U8 is run through
> 7&u: before concatenating it to wchar.

Yes.

One issue here is that most u8 characters (everything outside ASCII)
are not single J characters, but are instead a sequence of two or more
J characters.
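
To make that concrete, here is a small sketch. The name z and the
choice of code point 233 (e with an acute accent) are only for
illustration:

```j
   z =: 8 u: 4 u: 233   NB. wchar for code point 233, encoded as u8
   3!:0 z               NB. the result is plain literal (char)
2
   # z                  NB. one character of text, two J characters
2
   a. i. z
195 169
   # 7 u: z             NB. decoding the u8 yields a single wchar
1
   3 u: 7 u: z
233
```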

> Optimization may convert wchar to char unbeknownst to the programmer.
> Not now probably, but who knows in the future.

This, then, should be a documentation issue.

> If I combine an integer with a real I expect the integer to be converted to
> real before combining.

Yes. Note also that here you are converting 1 number to 1 number.

> It is not necessary for me to convert the integer to
> real. Why not have concatenation of char and wchar work the same way?

That's how it works, IF (and only if) we are talking about individual
characters. Translating multi-character sequences to individual
characters necessarily violates some invariants, and so should not be
done when the programmer does not explicitly call for it.
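
One invariant at stake is that (#x)+#y matches #x,y for lists. A
sketch, assuming the character-by-character promotion described in
this thread (z and w are illustrative names; behaviour in other J
versions may differ):

```j
   z =: 8 u: 4 u: 233   NB. u8 bytes 195 169, stored as literal
   w =: 4 u: 8364       NB. a one-element wchar list
   (# z) + # w
3
   # z , w              NB. each J character promoted individually
3
   3 u: z , w
195 169 8364
   # (7 u: z) , w       NB. explicit decode: the count changes
2
   3 u: (7 u: z) , w
233 8364
```

With the explicit 7 u: the programmer has asked for the shorter
result. If , did that decoding implicitly, the length of a
concatenation would no longer be predictable from the lengths of its
arguments.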

> Like I showed with z,":z, where the result of ":z is U8 gave unexpected 
> results.

I do not understand this example. But I believe that if z is tagged as
literal then (-: ":)z should be true. (In other words z and ": z
should be identical when viewed from inside J.)

> Before Unicode _128{.a. was needed for non-ASCII characters. Not any more.
> Do away with the idea that literal and U8 are different.

u16 is also literal. So you seem to be saying that u16 should not be
different from u8.

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
