Base character plus tag sequences (from RE: Is the binaryness/textness of a data format a property?)
Doug Ewell wrote:

> When 137,468 private-use characters aren't enough?

In my opinion, a base character plus tag sequence has the potential to be used for many large-scale applications in the future.

A base character plus tag sequence encoding has the advantage over a Private Use Area encoding (except for a prompt experimental use or for some applications) that the encoding can be unique and thus interoperability is possible amongst people generally.

QID emoji is just the very start of applications, some not even dreamed of yet, for which a base character plus tag sequence encoding could be used. Once the restriction that the result of a specific encoding may only be a fixed image is removed, new information technology applications will be possible within text streams.

There is the QID Emoji Public Review, and issues like this can be explored there so that they will be before the Unicode Technical Committee when it assesses the responses to the public review. In my response of Monday 2 March 2020 I put forward an idea that could allow the idea of QID emoji to proceed yet without the disadvantages. No comment after that has been published as of the time of sending this post.

https://www.unicode.org/review/pri408/

Whatever your view on whether such ideas should be allowed to flourish and become mainstream in the future, I opine that it would be good for there to be more responses to the public review, so that as wide a range of views as possible is before the Unicode Technical Committee when it assesses the responses to the public review, not just on QID emoji as such but on whether the underlying method of encoding large sets of items as a base character plus tag character sequence should be encouraged.

William Overington

Monday 23 March 2020
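For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of how a base character plus tag sequence is assembled. It relies only on the facts that the tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E and that U+E007F CANCEL TAG ends the sequence. The function names are made up for illustration, and the base character used for the `Q42` payload is an arbitrary placeholder, not necessarily the one the PRI #408 proposal specifies.

```python
TAG_OFFSET = 0xE0000        # tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E
CANCEL_TAG = "\U000E007F"   # CANCEL TAG terminates a tag sequence


def make_tag_sequence(base: str, payload: str) -> str:
    """Return base + tag characters spelling `payload` + CANCEL TAG."""
    for ch in payload:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not representable as a tag character: {ch!r}")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in payload)
    return base + tags + CANCEL_TAG


def read_tag_payload(sequence: str) -> str:
    """Recover the ASCII payload; assumes a single-code-point base character."""
    body = sequence[1:-1]   # drop the base character and the CANCEL TAG
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in body)


# A sequence already in use today: WAVING BLACK FLAG + "gbeng" (flag of England).
england = make_tag_sequence("\U0001F3F4", "gbeng")

# Hypothetical QID-style payload; the base character here is only a placeholder.
qid_like = make_tag_sequence("\U0001F3F4", "Q42")
```

The payload stays invisible to software that does not understand the sequence, which then typically shows only the base character.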
Re: Is the binaryness/textness of a data format a property?
On 23/03/2020 03:56, Markus Scherer via Unicode wrote:
> On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode
> wrote:
>
>> I thought the whole premise of GB18030 was that it was Unicode mapped into
>> a GB2312 framework. What characters exist in GB18030 that don't exist in
>> Unicode, and have they been proposed for Unicode yet, and why was none of
>> the PUA space considered appropriate for that in the meantime?
>>
>
> My memory of GB18030 is that its code space has 1.6M code points, of which
> 1.1M are a permutation of Unicode. For the rest you would have to go beyond
> the Unicode code space for 1:1 round-trip mappings.

This matches my recollection. What's more, there are no characters allocated in the parts of the GB 18030 codespace that don't map to Unicode, and there is as far as I understand no plan to use that space. It's just there because that was the most straightforward way to extend GB 2312/GBK.

Regards, Martin.
Re: Is the binaryness/textness of a data format a property?
On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode wrote:

> I thought the whole premise of GB18030 was that it was Unicode mapped into
> a GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet, and why was none of
> the PUA space considered appropriate for that in the meantime?
>

My memory of GB18030 is that its code space has 1.6M code points, of which 1.1M are a permutation of Unicode. For the rest you would have to go beyond the Unicode code space for 1:1 round-trip mappings.

Just please don't call it UTF-8.

markus
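The "1.6M" and "1.1M" figures can be checked with a little arithmetic. The sketch below assumes the usual description of the GB 18030 byte structure (one-byte, two-byte and four-byte forms with the ranges given in the comments); those ranges are not stated in the thread itself, so treat them as an assumption.

```python
# Assumed GB 18030 byte structure:
#   1 byte : 0x00..0x7F
#   2 bytes: lead 0x81..0xFE, trail 0x40..0x7E or 0x80..0xFE
#   4 bytes: 0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39
single = 0x7F - 0x00 + 1                               # 128
lead = 0xFE - 0x81 + 1                                 # 126
trail = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)          # 63 + 127 = 190
two_byte = lead * trail                                # 23,940
digit = 0x39 - 0x30 + 1                                # 10
four_byte = lead * digit * lead * digit                # 1,587,600

gb18030_total = single + two_byte + four_byte          # 1,611,668 ("1.6M")
unicode_code_points = 0x110000                         # 1,114,112 ("1.1M")
unmapped = gb18030_total - unicode_code_points         # 497,556 positions left over

# Round trip through Python's built-in gb18030 codec for a few sample characters.
samples = ["A", "\u4e2d", "\U0010FFFF"]
round_trips = all(s.encode("gb18030").decode("gb18030") == s for s in samples)
```

Under these assumptions roughly half a million GB 18030 positions have no Unicode counterpart, which is the part of the code space that a 1:1 round-trip mapping would have to place outside the Unicode code space.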
Re: Is the binaryness/textness of a data format a property?
On Sat, 21 Mar 2020 13:33:18 -0600
Doug Ewell via Unicode wrote:

> Eli Zaretskii wrote:

> > Emacs uses some of that for supporting charsets that cannot be
> > mapped into Unicode. GB18030 is one example of such charsets. The
> > internal representation of characters in Emacs is UTF-8, so it uses
> > 5-byte UTF-8 like sequences to represent such characters.

> When 137,468 private-use characters aren't enough?

But they aren't private use! I haven't made any agreement with anyone about using them.

Additionally, just as some people seem to think that stray UTF-16 code units should be supported (and occasionally declare UTF-8 implementations of Unicode standard algorithms to be automatically non-compliant), there is a case for supporting stray UTF-8 code units. Emacs supports the full range of 8-bit byte values - 128 unified with ASCII and the other 128 with high bit set.

> What characters exist in GB18030 that don't
> exist in Unicode, and have they been proposed for Unicode yet, and
> why was none of the PUA space considered appropriate for that in the
> meantime?

Doesn't GB18030 appropriate some of the PUA for Tibetan (and quite possibly other complex scripts)? I haven't looked up how Emacs handles this.

Richard.
RE: Is the binaryness/textness of a data format a property?
Eli Zaretskii wrote:

>> When 137,468 private-use characters aren't enough?
>
> Why is that relevant to the issue at hand?

You're right. I did ask what the uses of non-standard UTF-8 were, and you gave me an example.

> I don't remember off hand, but last time I looked at GB18030, there
> were a lot of them not in Unicode.

I'd forgotten that there were still about two dozen GB18030 characters mapped, more or less officially, into the Unicode PUA. But again, I changed the subject. Sorry about that.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Is the binaryness/textness of a data format a property?
On 2020-03-21, Eli Zaretskii via Unicode wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode
>>
>> Adam Borowski wrote:
>>
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode. GB18030 is one example of such charsets. The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

My own (now >10 year old) Unicode adaptation of XEmacs does the same, even for charsets that can be mapped into Unicode. To ensure complete backward compatibility, it distinguishes "legacy" charsets from Unicode, and only does conversion when requested.
Re: Is the binaryness/textness of a data format a property?
> From: "Doug Ewell"
> Cc:
> Date: Sat, 21 Mar 2020 13:33:18 -0600
>
> > Emacs uses some of that for supporting charsets that cannot be mapped
> > into Unicode. GB18030 is one example of such charsets. The internal
> > representation of characters in Emacs is UTF-8, so it uses 5-byte
> > UTF-8 like sequences to represent such characters.
>
> When 137,468 private-use characters aren't enough?

Why is that relevant to the issue at hand?

> I thought the whole premise of GB18030 was that it was Unicode mapped into a
> GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet

I don't remember offhand, but last time I looked at GB18030, there were a lot of them not in Unicode.

> and why was none of the PUA space considered appropriate for that in the
> meantime?

Because many fonts already use them? I don't really know why it was decided to use codepoints above 0x1FFFFF, it's just that this is how Emacs has worked for quite some time. You asked for examples of usage, and I provided one.
RE: Is the binaryness/textness of a data format a property?
Eli Zaretskii wrote:

>>> Also, UTF-8 can carry more than Unicode -- for example,
>>> U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or
>>> 2⁴²), which has its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode. GB18030 is one example of such charsets. The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

When 137,468 private-use characters aren't enough?

I thought the whole premise of GB18030 was that it was Unicode mapped into a GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, and have they been proposed for Unicode yet, and why was none of the PUA space considered appropriate for that in the meantime?

--
Doug Ewell | Thornton, CO, US | ewellic.org
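As a side note on where the figure 137,468 comes from: it is the size of the BMP Private Use Area plus the two private-use planes, each minus its two trailing noncharacters. The short sketch below does the arithmetic and cross-checks it against the `Co` (private use) general category in Python's `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

bmp_pua = 0xF8FF - 0xE000 + 1          # U+E000..U+F8FF     -> 6,400
plane_15 = 0xFFFFD - 0xF0000 + 1       # U+F0000..U+FFFFD   -> 65,534
plane_16 = 0x10FFFD - 0x100000 + 1     # U+100000..U+10FFFD -> 65,534
private_use_total = bmp_pua + plane_15 + plane_16   # 137,468

# Cross-check: count every code point whose general category is Co.
counted = sum(
    1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Co"
)
```

The last two code points of planes 15 and 16 are noncharacters rather than private-use characters, which is why each plane contributes 65,534 and not 65,536.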
Re: Is the binaryness/textness of a data format a property?
> Date: Sat, 21 Mar 2020 11:13:40 -0600
> From: Doug Ewell via Unicode
>
> Adam Borowski wrote:
>
> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> > its uses but is not well-formed Unicode.
>
> I'd be interested in your elaboration on what these uses are.

Emacs uses some of that for supporting charsets that cannot be mapped into Unicode. GB18030 is one example of such charsets. The internal representation of characters in Emacs is UTF-8, so it uses 5-byte UTF-8 like sequences to represent such characters.
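To make "5-byte UTF-8 like sequences" concrete, here is a rough sketch of the bit pattern as originally defined for UTF-8 (up to six bytes and 31 bits), before the format was restricted to U+0000..U+10FFFF. It is a generic illustration only: the function name is invented, and nothing here is taken from the Emacs sources or claims to match Emacs's internal format in detail.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode an integer with the pre-restriction UTF-8 bit pattern (1 to 6 bytes).

    Unlike a conforming UTF-8 encoder, this accepts surrogates and values
    above 0x10FFFF, so its output is not necessarily well-formed UTF-8.
    """
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:
        return bytes([cp])
    # (payload bits, sequence length, lead-byte marker)
    forms = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0), (26, 5, 0xF8), (31, 6, 0xFC))
    for bits, length, marker in forms:
        if cp < (1 << bits):
            out = bytearray(length)
            # Fill continuation bytes from the end, six payload bits each.
            for i in range(length - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)
                cp >>= 6
            out[0] = marker | cp
            return bytes(out)
    raise ValueError("value needs more than 31 bits")


first_five_byte = encode_extended_utf8(0x200000)   # first value that needs 5 bytes
```

For anything up to U+10FFFF (surrogates aside) the result is byte-for-byte what a standard encoder produces. The 5-byte form begins at 0x200000, because the 4-byte form has 21 payload bits.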
Re: Is the binaryness/textness of a data format a property?
Adam Borowski wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> its uses but is not well-formed Unicode.

I'd be interested in your elaboration on what these uses are.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Is the binaryness/textness of a data format a property?
On 20/03/2020 23:41, Adam Borowski via Unicode wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
> U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
> but is not well-formed Unicode.

This would definitely no longer be UTF-8!

Martin.
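Martin's point can be observed with any strict decoder. The small sketch below uses Python's built-in `utf-8` codec, which rejects encoded surrogates, values above U+10FFFF and 5-byte forms; the helper name and the sample byte strings are chosen purely for illustration.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


checks = {
    "U+10FFFF, the last code point": is_well_formed_utf8(b"\xf4\x8f\xbf\xbf"),
    "encoded surrogate U+D800": is_well_formed_utf8(b"\xed\xa0\x80"),
    "0x110000 in the 4-byte pattern": is_well_formed_utf8(b"\xf4\x90\x80\x80"),
    "0x200000 in the 5-byte pattern": is_well_formed_utf8(b"\xf8\x88\x80\x80\x80"),
}
```

Only the first entry is accepted. The other three follow the UTF-8 bit pattern but fall outside what the standard defines as UTF-8.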
Re: Is the binaryness/textness of a data format a property?
On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a
> > property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format. Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system. Digital's old RMS formats included basic text files in which each record (roughly a line) started with a binary 2-byte length field. Text records on magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable. Indeed, are you not aware of the concept of a write-only programming language?

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
>   not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file. Surely you aren't saying that a text file using the characters new to Unicode 13.0 should, at present, usually be regarded as a binary file?

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint? This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is binary. Indeed, as Word files naturally include pictures, it very much isn't a text file. A .doc file is more like an image dump of a file system. A .rtf file, on the other hand, probably is a text file - though I've a feeling there are variants that aren't *A*SCII.

Richard.
Re: Is the binaryness/textness of a data format a property?
On Fri, Mar 20, 2020 at 07:22:45AM -0700, J Decker via Unicode wrote:
> On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode wrote:
> > For example, most Unix-heads will tell you that UTF16LE is a binary rather
> > than text format. Microsoft employees and some members of this list will
> > disagree.
[...]
> > If you want _my_ definition of a file being _technically_ text, it's:
> > * no bytes 0..31 other than newlines and tabs (even form feeds are out
> >   nowadays)
> > * correctly encoded for the expected charset (and nowadays, if that's not
> >   UTF-8 Unicode, you're doing it wrong)
> > * no invalid characters
>
> Just a minor note...
> In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
> valid utf8 codeunit has at least 1 bit off.

Yeah, but I allowed for ancient encodings, some of which do use these bytes. (I do discriminate against UTF16 and shift-state ones, they're too broken.)

Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses but is not well-formed Unicode.

> I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
> codes) are all quite usable...

\t is tab, \n a newline (blah blah blah \r). As for \e (\x1b), that's higher-level markup. I do use it -- hey, you can "apt/dnf install colorized-logs" for my tools -- but that's beyond plain text.

喵!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀       -- on #linux-sunxi
⠈⠳⣄
Re: Is the binaryness/textness of a data format a property?
On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <unicode@unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode
> wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the name
> > of this binaryness/textness property?
>
> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary rather
> than text format. Microsoft employees and some members of this list will
> disagree.
>
> Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
> for a (sane) human.
>
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's not
>   UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
>

Just a minor note...
In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every valid utf8 codeunit has at least 1 bit off.

I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI codes) are all quite usable...

>
> But besides this narrow technical meaning -- is a Word document "text"?
> And if it is, why not Powerpoint? This all falls apart.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
> ⢿⡄⠘⠷⠚⠋⠀       -- on #linux-sunxi
> ⠈⠳⣄
>
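The claim about unused byte values is easy to check by brute force. The sketch below encodes every Unicode scalar value with Python's built-in UTF-8 encoder and records which byte values occur; the variable names are illustrative only.

```python
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue                     # surrogates are not scalar values
    used.update(chr(cp).encode("utf-8"))

# Byte values that never occur in well-formed UTF-8.
unused = sorted(set(range(256)) - used)
unused_hex = [f"0x{b:02X}" for b in unused]
```

The result is somewhat wider than 0xF8..0xFF: the bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8, thirteen values in all. In particular 0xFF is never used, so every code unit does have at least one bit clear.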
Re: Is the binaryness/textness of a data format a property?
On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode wrote:
> [Definition] Property: an attribute, quality, or characteristic of something.
>
> JPEG is a binary data format.
> CSV is a text data format.
>
> Question #1: Is the binaryness/textness of a data format a property?
>
> Question #2: If the answer to Question #1 is yes, then what is the name of
> this binaryness/textness property?

I'm afraid this question is too fuzzy to have a proper answer.

For example, most Unix-heads will tell you that UTF16LE is a binary rather than text format. Microsoft employees and some members of this list will disagree.

Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable for a (sane) human.

If you want _my_ definition of a file being _technically_ text, it's:
* no bytes 0..31 other than newlines and tabs (even form feeds are out
  nowadays)
* correctly encoded for the expected charset (and nowadays, if that's not
  UTF-8 Unicode, you're doing it wrong)
* no invalid characters

But besides this narrow technical meaning -- is a Word document "text"? And if it is, why not Powerpoint? This all falls apart.

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀       -- on #linux-sunxi
⠈⠳⣄
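The three criteria above translate fairly directly into a check. What follows is only a sketch of one possible reading, not a definitive test. Two points are assumptions: carriage return is tolerated alongside tab and newline, and "no invalid characters" is taken to mean that strict decoding succeeds, so unassigned code points are accepted. The function name is invented.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}   # tab, newline, and (by assumption) CR


def is_technically_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Rough check of the three criteria for an ASCII-compatible charset."""
    # 1. No C0 control bytes other than the allowed whitespace.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return False
    # 2 and 3. Must decode cleanly in the expected charset.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


csv_ok = is_technically_text(b"name,age\nAda,36\n")
jpeg_magic = is_technically_text(b"\xff\xd8\xff\xe0\x00\x10JFIF")
ansi_colour = is_technically_text(b"\x1b[31mred\x1b[0m\n")
latin1_as_utf8 = is_technically_text("caf\u00e9\n".encode("latin-1"))
latin1_as_latin1 = is_technically_text("caf\u00e9\n".encode("latin-1"), "latin-1")
```

Because the first test looks at raw bytes, UTF-16 data fails it as soon as it contains ASCII text (every other byte is 0x00), which happens to agree with the view that UTF16LE is a binary format.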
RE: Is the binaryness/textness of a data format a property?
#1: Yes.
#2: [ my suggestion ] File type category

A.D.

-----Original Message-----
From: Unicode On Behalf Of Costello, Roger L. via Unicode
Sent: Friday, 20 March 2020 13:21
To: unicode@unicode.org
Subject: Is the binaryness/textness of a data format a property?

Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property?

Question #2: If the answer to Question #1 is yes, then what is the name of this binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the following blanks with the property name (both blanks should be filled with the same thing):

For the JPEG data format: _____ = binary.
For the CSV data format: _____ = text.

/Roger
Is the binaryness/textness of a data format a property?
Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property?

Question #2: If the answer to Question #1 is yes, then what is the name of this binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the following blanks with the property name (both blanks should be filled with the same thing):

For the JPEG data format: _____ = binary.
For the CSV data format: _____ = text.

/Roger