Re: [Jbeta] U8 and unicode

Raul Miller Wed, 29 May 2013 14:01:50 -0700

On Wed, May 29, 2013 at 4:22 PM, Don Guinn <[email protected]> wrote:
>    'def',2 u: 'abc'
> defabc
>    'def',7 u: 'abc'
> defabc
>    3!:0 'de∑',2 u: 'abc'
> 131072
>    'de∑',2 u: 'abc'
> deâˆ‘abc
>    3 u: 'de∑',2 u: 'abc'
> 100 101 226 136 145 97 98 99
>    'de∑',7 u: 'abc'
> de∑abc
>    2 u: 'de∑'
> deâˆ‘
>    7 u: 'de∑'
> de∑
>
> For plain ASCII 2&u: and 7&u: are the same. But when wchar and char with U8
> are together 2&u: gives strange results. 7&u: works as one would expect.
> Notice the sequence 226 136 145. That was a U8 code that has been
> destroyed. I'm not asking that 2&u: be changed, just that u: monadic be
> 7&u: for character and , apply 7&u: to char before catenating with wchar.


From my point of view, u8 is not a character type. It's a character encoding.

It's not clear why we should expect (,) to know about encodings.
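
To make the type-versus-encoding distinction concrete outside of J, here
is a minimal sketch in Python (used only for illustration; it is not J,
and the names widen_bytewise and widen_utf8 are hypothetical). It assumes
that 2&u: widens each byte to its own code point and that 7&u: decodes
the bytes as UTF-8 first, which is what the quoted session suggests.

```python
# The UTF-8 octets of 'de' followed by U+2211 (the summation sign).
# These are the same octets shown in the quoted session: 100 101 226 136 145.
raw = "de\u2211".encode("utf-8")

def widen_bytewise(b: bytes) -> str:
    # Assumption: roughly what 2&u: does. Every byte becomes one code point,
    # so a multi-byte UTF-8 sequence is split into unrelated characters.
    return "".join(chr(x) for x in b)

def widen_utf8(b: bytes) -> str:
    # Assumption: roughly what 7&u: does. The bytes are decoded as UTF-8,
    # so the three-octet sequence becomes a single character.
    return b.decode("utf-8")

bytewise = widen_bytewise(raw) + "abc"
decoded = widen_utf8(raw) + "abc"

print([ord(c) for c in bytewise])  # eight code points, one per original byte
print([ord(c) for c in decoded])   # six code points, one per character
```

The bytes are identical in both cases. Only the interpretation differs,
and the bytes themselves carry nothing that says which one is intended.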

If you have part of a gif and part of an mp3 and you use (,) to
combine them, is that meaningful? (The answer is yes, if you are
careful, otherwise no. Some file formats, such as zip, do
combine content with different encodings.)

> True, J doesn't do + and - on characters, but it does do , . That is a
> calculation that is being done incorrectly.

So don't do that.

> As to breaking code. All that code that depended on _128{a. being
> characters is obsolete.

I was getting careless here.

I think you meant _128{.a. and not _128{a.

Meanwhile, box drawing characters are not in _128{.a. Instead, the box
drawing characters are at indices 16+i.11 of a.

Also, you are assuming that character literals are text. Claiming that
_128{.a. is obsolete amounts to claiming that we can no longer use
literals to represent raw memory. That would be bad.

You might argue that J needs a new "text type" for u8 - but that type
would not be literal. In u8, characters are sequences of octets, and
while the octets can be elements of an array, the sequences are not,
in and of themselves, elements of an array.
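
As a minimal sketch of that point (in Python purely for illustration,
not J; the variable names are hypothetical), the octet count and the
character count of the same text differ, and slicing at octet positions
can cut a sequence in half:

```python
# 'de' followed by U+2211 (the summation sign), held as text and as octets.
text = "de\u2211"
octets = text.encode("utf-8")

octet_count = len(octets)  # counts array elements (bytes)
char_count = len(text)     # counts characters

# Taking the first three octets cuts the three-octet sequence after its
# lead byte, so what remains is no longer valid UTF-8.
try:
    octets[:3].decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    prefix_is_valid = False

print(octet_count, char_count, prefix_is_valid)
```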

> All I'm asking is that J be aware of the character type.

I think you are asking for far more than you realize. So far you have
treated only one use case for one primitive (,). You have not
considered other primitives (# {. $ i. }. { and } all come to mind).
For example, if u8 "characters" were to be J array elements then to
select character 123456789 from a gigabyte file you must first scan
through every character preceding it to find its position in the file.
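
A rough sketch of that scan, written in Python for illustration rather
than J (byte_offset_of_char is a hypothetical helper, and the input is
assumed to be well-formed UTF-8): the only way to find where character n
starts is to walk the octets and count the ones that begin a character.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset at which character number n (0-based) starts.

    Assumes data is well-formed UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx; every other byte starts a new character.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")

data = "de\u2211abc".encode("utf-8")
print(byte_offset_of_char(data, 3))  # the 'a' after the three-octet character
```

The cost grows with the position of the character, whereas indexing a
plain array of octets does not depend on position.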

> Char containing U8 codes should not be blindly concatenated with
> wchar without conversion any more than an integer should be blindly added
> to a floating point number without conversion.

The punchline, here, should be that you should not do that when it's
not something you want done.

It's up to you, the programmer, to decide what it is that you want to
accomplish.

In this case: we are using the language's "literal" data type, and
"unicode" is a name for the u: verb. If we were to say that all
literals are unicode literals then that would break any application
which needs to deal with non-textual data, as well as historical
applications which need to deal with other text encodings. It would
also break existing code that supports unicode, and make it difficult
to adapt to future changes in unicode.

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
