Base character plus tag sequences (from RE: Is the binaryness/textness of a data format a property?)

2020-03-23 Thread wjgo_10...@btinternet.com via Unicode


Doug Ewell wrote:

> When 137,468 private-use characters aren't enough?

In my opinion, a base character plus tag sequence has the potential to 
be used for many large-scale applications in the future.
A base character plus tag sequence encoding has the advantage over a 
Private Use Area encoding (except for a prompt experimental use or for 
some applications) that the encoding can be unique and thus 
interoperability is possible amongst people generally.
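
As a rough illustration of what a base character plus tag sequence looks
like at the code point level, the Python sketch below builds the
already-standardised emoji tag sequence for the flag of Scotland. The
helper name tag_sequence is hypothetical. The sketch does not reproduce
the QID emoji proposal itself, whose base character and payload rules
are set out in the public review document.

```python
TAG_BASE = 0xE0000          # tag characters mirror ASCII at U+E0020..U+E007E
CANCEL_TAG = "\U000E007F"   # terminates the tag sequence


def tag_sequence(base: str, payload: str) -> str:
    """Return base + one tag character per payload character + CANCEL TAG.

    Hypothetical helper for illustration only.
    """
    if any(not 0x20 <= ord(ch) <= 0x7E for ch in payload):
        raise ValueError("payload must be printable ASCII")
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in payload)
    return base + tags + CANCEL_TAG


# WAVING BLACK FLAG followed by the tag form of "gbsct" (flag of Scotland).
scotland = tag_sequence("\U0001F3F4", "gbsct")
print([f"U+{ord(ch):04X}" for ch in scotland])
```

Because every part of the sequence is made of standard characters, two
parties need no private agreement to read the code points the same way,
which is the interoperability point made above.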


QID emoji is just the very start of applications, some not even dreamed 
of yet, for which a base character sequence encoding could be used.


Once the restriction that the result of a specific encoding may only 
be a fixed image is removed, new information technology applications 
will become possible within text streams.


There is the QID Emoji Public Review and issues like this can be 
explored there so that they will be before the Unicode Technical 
Committee when it assesses the responses to the public review.


In my response of Monday 2 March 2020 I put forward an idea that could 
allow the idea of QID emoji to proceed yet without the disadvantages.


No comment after that has been published as of the time of sending this 
post.


https://www.unicode.org/review/pri408/

Whatever your view on whether such ideas should be allowed to flourish 
and become mainstream in the future, I opine that it would be good for 
there to be more responses to the public review, so that as wide a range 
of views as possible is before the Unicode Technical Committee when it 
assesses the responses: not just on QID emoji as such, but on whether 
the underlying method of encoding large sets of items as a base 
character plus tag character sequence should be encouraged.


William Overington

Monday 23 March 2020






Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Martin J . Dürst via Unicode
On 23/03/2020 03:56, Markus Scherer via Unicode wrote:
> On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
> wrote:
> 
>> I thought the whole premise of GB18030 was that it was Unicode mapped into
>> a GB2312 framework. What characters exist in GB18030 that don't exist in
>> Unicode, and have they been proposed for Unicode yet, and why was none of
>> the PUA space considered appropriate for that in the meantime?
>>
> 
> My memory of GB18030 is that its code space has 1.6M code points, of which
> 1.1M are a permutation of Unicode. For the rest you would have to go beyond
> the Unicode code space for 1:1 round-trip mappings.

This matches my recollection. What's more, there are no characters 
allocated in the parts of the GB 18030 codespace that don't map to 
Unicode, and there is as far as I understand no plan to use that space. 
It's just there because that was the most straightforward way to extend 
GB 2312/GBK.

Regards,   Martin.



Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Markus Scherer via Unicode
On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
wrote:

> I thought the whole premise of GB18030 was that it was Unicode mapped into
> a GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet, and why was none of
> the PUA space considered appropriate for that in the meantime?
>

My memory of GB18030 is that its code space has 1.6M code points, of which
1.1M are a permutation of Unicode. For the rest you would have to go beyond
the Unicode code space for 1:1 round-trip mappings.

Just please don't call it UTF-8.

markus
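
The round figures above can be checked with some arithmetic. The sketch
below is only that, a sketch: the byte ranges are the commonly described
one-, two- and four-byte forms of GB 18030 and are stated here as
assumptions, not quoted from the standard.

```python
# Assumed byte ranges for the three GB 18030 forms.
one_byte = 0x7F - 0x00 + 1                       # 0x00..0x7F -> 128
lead = 0xFE - 0x81 + 1                           # 0x81..0xFE -> 126
trail2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)   # 0x40..0x7E, 0x80..0xFE -> 190
digit = 0x39 - 0x30 + 1                          # 0x30..0x39 -> 10

two_byte = lead * trail2                         # 23,940
four_byte = lead * digit * lead * digit          # 1,587,600
gb18030_total = one_byte + two_byte + four_byte  # about 1.6M

unicode_code_points = 0x10FFFF + 1               # about 1.1M
surplus = gb18030_total - unicode_code_points    # no Unicode counterpart

print(f"GB 18030 code space: {gb18030_total:,}")
print(f"Unicode code space:  {unicode_code_points:,}")
print(f"Left over:           {surplus:,}")
```

Under these assumptions roughly half a million GB 18030 positions cannot
be paired 1:1 with a Unicode code point.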


Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Richard Wordingham via Unicode
On Sat, 21 Mar 2020 13:33:18 -0600
Doug Ewell via Unicode  wrote:

> Eli Zaretskii wrote:

> > Emacs uses some of that for supporting charsets that cannot be
> > mapped into Unicode.  GB18030 is one example of such charsets.  The
> > internal representation of characters in Emacs is UTF-8, so it uses
> > 5-byte UTF-8 like sequences to represent such characters.  

> When 137,468 private-use characters aren't enough?

But they aren't private use!  I haven't made any agreement with anyone
about using them.

Additionally, just as some people seem to think that stray UTF-16 code
units should be supported (and occasionally declare UTF-8
implementations of Unicode standard algorithms to be automatically
non-compliant), there is a case for supporting stray UTF-8 code units.
Emacs supports the full range of 8-bit byte values - 128 unified with
ASCII and the other 128 with high bit set.

> What characters exist in GB18030 that don't
> exist in Unicode, and have they been proposed for Unicode yet, and
> why was none of the PUA space considered appropriate for that in the
> meantime?

Doesn't GB18030 appropriate some of the PUA for Tibetan (and quite
possibly other complex scripts)?  I haven't looked up how Emacs
handles this. 

Richard.


RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>> When 137,468 private-use characters aren't enough?
>
> Why is that relevant to the issue at hand?

You're right. I did ask what the uses of non-standard UTF-8 were, and you gave 
me an example.

> I don't remember off hand, but last time I looked at GB18030, there
> were a lot of them not in Unicode.

I'd forgotten that there were still about two dozen GB18030 characters mapped, 
more or less officially, into the Unicode PUA. But again, I changed the 
subject. Sorry about that.

--
Doug Ewell | Thornton, CO, US | ewellic.org






Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Julian Bradfield via Unicode
On 2020-03-21, Eli Zaretskii via Unicode  wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode 
>> 
>> Adam Borowski wrote:
>> 
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>> 
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

My own (now >10 year old) Unicode adaptation of XEmacs does the same,
even for charsets that can be mapped into Unicode. To ensure complete
backward compatibility, it distinguishes "legacy" charsets from Unicode,
and only does conversion when requested.



Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Eli Zaretskii via Unicode
> From: "Doug Ewell" 
> Cc: 
> Date: Sat, 21 Mar 2020 13:33:18 -0600
> 
> > Emacs uses some of that for supporting charsets that cannot be mapped
> > into Unicode.  GB18030 is one example of such charsets.  The internal
> > representation of characters in Emacs is UTF-8, so it uses 5-byte
> > UTF-8 like sequences to represent such characters.
> 
> When 137,468 private-use characters aren't enough?

Why is that relevant to the issue at hand?

> I thought the whole premise of GB18030 was that it was Unicode mapped into a 
> GB2312 framework. What characters exist in GB18030 that don't exist in 
> Unicode, and have they been proposed for Unicode yet

I don't remember off hand, but last time I looked at GB18030, there
were a lot of them not in Unicode.

> and why was none of the PUA space considered appropriate for that in the 
> meantime?

Because many fonts already use them?  I don't really know why it was
decided to use codepoints above 0x1FFFFF, it's just that this is how
Emacs has worked for quite some time.  You asked for examples of usage,
and I provided one.


RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>>> Also, UTF-8 can carry more than Unicode -- for example,
>>> U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or
>>> 2⁴²), which has its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

When 137,468 private-use characters aren't enough?

I thought the whole premise of GB18030 was that it was Unicode mapped into a 
GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, 
and have they been proposed for Unicode yet, and why was none of the PUA space 
considered appropriate for that in the meantime?
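
For reference, the figure of 137,468 private-use characters is the sum
of the three Private Use ranges. A short Python check:

```python
# Private Use Area in the BMP plus the two supplementary private-use planes.
# The last two code points of each plane are noncharacters, so the
# supplementary ranges stop at xFFFD.
bmp_pua = 0xF8FF - 0xE000 + 1        # U+E000..U+F8FF     -> 6,400
plane15 = 0xFFFFD - 0xF0000 + 1      # U+F0000..U+FFFFD   -> 65,534
plane16 = 0x10FFFD - 0x100000 + 1    # U+100000..U+10FFFD -> 65,534

pua_total = bmp_pua + plane15 + plane16
print(f"{pua_total:,}")  # 137,468
```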

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Eli Zaretskii via Unicode
> Date: Sat, 21 Mar 2020 11:13:40 -0600
> From: Doug Ewell via Unicode 
> 
> Adam Borowski wrote:
> 
> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> > its uses but is not well-formed Unicode.
> 
> I'd be interested in your elaboration on what these uses are.

Emacs uses some of that for supporting charsets that cannot be mapped
into Unicode.  GB18030 is one example of such charsets.  The internal
representation of characters in Emacs is UTF-8, so it uses 5-byte
UTF-8 like sequences to represent such characters.


Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Adam Borowski wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> its uses but is not well-formed Unicode.

I'd be interested in your elaboration on what these uses are.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Martin J . Dürst via Unicode
On 20/03/2020 23:41, Adam Borowski via Unicode wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
> U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
> but is not well-formed Unicode.

This would definitely no longer be UTF-8!   Martin.
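
To make the distinction concrete, here is a minimal Python sketch. The
function encode_extended is a hypothetical helper that applies the
UTF-8 bit layout to any value up to 31 bits, in the style of the
original six-byte design; it is not part of any library. Python's own
strict "utf-8" codec then serves as the referee for what counts as
well-formed.

```python
def encode_extended(cp: int) -> bytes:
    """Apply the UTF-8 bit pattern to any value below 2**31 (hypothetical)."""
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, sequence length, lead-byte marker)
    for limit, length, marker in (
        (0x800, 2, 0xC0), (0x10000, 3, 0xE0), (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8), (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    else:
        raise ValueError("value needs more than 31 bits")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # low six bits per continuation byte
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for value in (0x20AC, 0xD800, 0x110000, 0x200000):
    encoded = encode_extended(value)
    print(f"{value:#x}: {encoded.hex(' ')} well-formed={is_well_formed_utf8(encoded)}")
```

Only the first value, U+20AC, passes. The surrogate, the value just past
U+10FFFF, and the five-byte sequence are all rejected by a conforming
decoder, even though the same bit pattern produced them.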



Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Richard Wordingham via Unicode
On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode  wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> > 
> > JPEG is a binary data format.
> > CSV is a text data format.
> > 
> > Question #1: Is the binaryness/textness of a data format a
> > property? 
> > 
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?  

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
> 
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format.  Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system.  Digital's old RMS
formats included basic text files in which each record (roughly a
line) started with a binary 2-byte length field.  Text records on
magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable.  Indeed, are
you not aware of the concept of a write-only programming language? 

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file.  Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint?  This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary.  Indeed, as Word files naturally include pictures, it very much
isn't a text file.  A .doc file is more like an image dump of a file
system.  A .rtf file on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.
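
The ZIP point is easy to see with Python's standard zipfile module. The
sketch below does not open a real Word document. It builds a tiny
in-memory archive with a .docx-like member name (an assumption made for
illustration) and shows that the result starts with the ZIP signature
instead of readable text.

```python
import io
import zipfile


def build_fake_docx() -> bytes:
    """Create a tiny ZIP archive with a .docx-like member.

    Illustration only: the result is not a valid Word document.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


data = build_fake_docx()
print(data[:4])                               # ZIP local file header signature
print(zipfile.is_zipfile(io.BytesIO(data)))   # container check
```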

Richard.


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Adam Borowski via Unicode
On Fri, Mar 20, 2020 at 07:22:45AM -0700, J Decker via Unicode wrote:
> On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
> > For example, most Unix-heads will tell you that UTF16LE is a binary rather
> > than text format.  Microsoft employees and some members of this list will
> > disagree.
[...]
> > If you want _my_ definition of a file being _technically_ text, it's:
> > * no bytes 0..31 other than newlines and tabs (even form feeds are out
> >   nowadays)
> > * correctly encoded for the expected charset (and nowadays, if that's not
> >   UTF-8 Unicode, you're doing it wrong)
> > * no invalid characters
> 
> Just a minor note...
> In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
> valid utf8 codeunit has at least 1 bit off.

Yeah, but I allowed for ancient encodings, some of which do use these bytes.
(I do discriminate against UTF16 and shift-state ones, they're too broken.)

Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
but is not well-formed Unicode.

> I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
> codes) are all quite usable...

\t is tab, \n a newline (blah blah blah \r).

As for \e (\x1b), that's higher-level markup.  I do use it -- hey, you can
"apt/dnf install colorized-logs" for my tools -- but that's beyond plain
text.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
⠈⠳⣄


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread J Decker via Unicode
On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
unicode@unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode
> wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the name
> of
> > this binaryness/textness property?
>
> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary rather
> than text format.  Microsoft employees and some members of this list will
> disagree.
>
> Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
> for a (sane) human.
>
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's not
>   UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
>

Just a minor note...
In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
valid utf8 codeunit has at least 1 bit off.
I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
codes) are all quite usable...
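
The byte-value claim can be checked by brute force. The Python sketch
below encodes every Unicode scalar value as UTF-8 and lists the byte
values that never occur. It suggests the unused set is slightly larger
than 0xF8-0xFF.

```python
from itertools import chain

# Every Unicode scalar value: all code points except the surrogates.
scalars = chain(range(0x0000, 0xD800), range(0xE000, 0x110000))
encoded = "".join(map(chr, scalars)).encode("utf-8")

used = set(encoded)                       # byte values that occur somewhere
unused = sorted(set(range(256)) - used)   # byte values that never occur

print([f"0x{b:02X}" for b in unused])
```

The list comes out as 0xC0, 0xC1 and 0xF5 through 0xFF, thirteen byte
values in total.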



>
> But besides this narrow technical meaning -- is a Word document "text"?
> And if it is, why not Powerpoint?  This all falls apart.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
> ⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
> ⠈⠳⣄
>


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Adam Borowski via Unicode
On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode wrote:
> [Definition] Property: an attribute, quality, or characteristic of something.
> 
> JPEG is a binary data format.
> CSV is a text data format.
> 
> Question #1: Is the binaryness/textness of a data format a property? 
> 
> Question #2: If the answer to Question #1 is yes, then what is the name of
> this binaryness/textness property?

I'm afraid this question is too fuzzy to have a proper answer.

For example, most Unix-heads will tell you that UTF16LE is a binary rather
than text format.  Microsoft employees and some members of this list will
disagree.

Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
for a (sane) human.

If you want _my_ definition of a file being _technically_ text, it's:
* no bytes 0..31 other than newlines and tabs (even form feeds are out
  nowadays)
* correctly encoded for the expected charset (and nowadays, if that's not
  UTF-8 Unicode, you're doing it wrong)
* no invalid characters
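
One possible reading of the three criteria above, as a minimal Python
sketch. The function name looks_like_text is hypothetical, and "no
invalid characters" is interpreted here simply as "the strict decoder
raises no error", which is an assumption and not necessarily what was
meant.

```python
ALLOWED_CONTROLS = {0x09, 0x0A}  # tab and newline; add 0x0D to tolerate CRLF


def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Apply the checks listed above to a byte string (hypothetical helper)."""
    # 1. No bytes 0..31 other than newlines and tabs.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return False
    # 2 and 3. Correctly encoded for the expected charset; decoding
    # errors stand in for "invalid characters".
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b"name\tvalue\n"))             # plain ASCII with a tab
print(looks_like_text("na\u00efve\n".encode()))      # valid UTF-8
print(looks_like_text(b"\x1b[1mbold\x1b[0m"))        # ANSI escapes, byte 0x1B
print(looks_like_text("hi".encode("utf-16-le")))     # contains NUL bytes
```

The first two inputs pass and the last two fail, since the check works
on raw bytes and so rejects both escape codes and UTF-16.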

But besides this narrow technical meaning -- is a Word document "text"?
And if it is, why not Powerpoint?  This all falls apart.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
⠈⠳⣄


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Dreiheller, Albrecht via Unicode
#1: Yes.
#2: [ my suggestion ]  File type category

A.D.

-----Original Message-----
From: Unicode  On Behalf Of Costello, Roger L. via Unicode
Sent: Friday, 20 March 2020 13:21
To: unicode@unicode.org
Subject: Is the binaryness/textness of a data format a property?

Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property? 

Question #2: If the answer to Question #1 is yes, then what is the name of this 
binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the 
following blanks with the property name (both blanks should be filled with the 
same thing):

For the JPEG data format:  _ = binary.
For the CSV data format:  _ = text. 

/Roger




Is the binaryness/textness of a data format a property?

2020-03-20 Thread Costello, Roger L. via Unicode
Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property? 

Question #2: If the answer to Question #1 is yes, then what is the name of this 
binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the 
following blanks with the property name (both blanks should be filled with the 
same thing):

For the JPEG data format:  _ = binary.
For the CSV data format:  _ = text. 

/Roger