RE: Emoji map of Colorado

2020-04-02 Thread Doug Ewell via Unicode
Karl Williamson shared:
 
> https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08_eid=0700c8706b
 
It's too bad this was only made available as an image, not as text, which of 
course it is.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 




RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>> When 137,468 private-use characters aren't enough?
>
> Why is that relevant to the issue at hand?

You're right. I did ask what the uses of non-standard UTF-8 were, and you gave 
me an example.

> I don't remember off hand, but last time I looked at GB18030, there
> were a lot of them not in Unicode.

I'd forgotten that there were still about two dozen GB18030 characters mapped, 
more or less officially, into the Unicode PUA. But again, I changed the 
subject. Sorry about that.

--
Doug Ewell | Thornton, CO, US | ewellic.org






RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>>> Also, UTF-8 can carry more than Unicode -- for example,
>>> U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or
>>> 2⁴²), which has its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

When 137,468 private-use characters aren't enough?

I thought the whole premise of GB18030 was that it was Unicode mapped into a 
GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, 
and have they been proposed for Unicode yet, and why was none of the PUA space 
considered appropriate for that in the meantime?
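
For reference, here is a minimal sketch (mine, not part of the original 
exchange, and not Emacs's actual implementation) of the kind of 5-byte 
"UTF-8-like" sequence Eli describes, following the pre-RFC 3629 bit pattern 
that covers values up to 2²⁶ − 1:

def encode_utf8_extended(cp):
    # 5-byte form: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (26 payload bits)
    if not 0x200000 <= cp < 0x4000000:
        raise ValueError("outside the 5-byte range")
    return bytes([0xF8 | (cp >> 24),
                  0x80 | ((cp >> 18) & 0x3F),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# encode_utf8_extended(0x200000) == b'\xf8\x88\x80\x80\x80'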

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Adam Borowski wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> its uses but is not well-formed Unicode.

I'd be interested in your elaboration on what these uses are.

--
Doug Ewell | Thornton, CO, US | ewellic.org





RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Doug Ewell via Unicode
Costello, Roger L. wrote:

> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain newlines,
> tabs, and some other invisible things.
>
> Question: "characters" are defined as only the visible things, right?

In addition to this being incorrect, as Ken and Richard (so far) have pointed 
out, this isn't the distinction you are looking for.

All file formats contain data which is relevant to that file format. Zip 
files, executables, JPEGs, MP4s, all contain specific data structured in a 
specific way. If any of them has that structure interrupted by random bytes, 
the format has been broken and the file is corrupt. It is no different for 
text data, which is expected to contain certain bytes and is normally not 
expected to be interrupted by a series of ranëH‰UÀHƒÈÿH

Does that help?

--
Doug Ewell | Thornton, CO, US | ewellic.org


Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Doug Ewell via Unicode
Ken Whistler wrote:
 
> So, in general, no, you can *never* assume that once the UTC has just
> approved a new character that it will be in the next version of
> Unicode.
 
I got quite a few messages like this when UTC approved the legacy
computing characters in L2/19-025 last January. Great, that means I'll
be able to start using and exchanging them in March, when Unicode 12.1
is released, right? Uh, no:
 
1. What Ken said above.
 
2. Unicode 12.1 was always just about the Reiwa sign.
 
3. Even when Unicode 13.0 comes out, fonts won't be immediately and magically
updated to include them.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: On the lack of a SQUARE TB glyph

2019-09-30 Thread Doug Ewell via Unicode
I wrote:
 
> As others have stated, it was easily demonstrated that applications
> existed in Japan which required a single code point for the era name.
> That is what necessitated the acceptance, let alone fast-tracking, of
> U+32FF SQUARE ERA NAME REIWA.
 
Well, this is what I've heard, anyway.
 
Just out of curiosity, does anyone have actual examples of such
applications? This might help demonstrate why the Reiwa sign doesn't set
a precedent for TB et al.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: On the lack of a SQUARE TB glyph

2019-09-29 Thread Doug Ewell via Unicode
Fred Brennan wrote:

> The purpose of Unicode is plaintext encoding, is it not? The square TB
> form is fundamentally no different than the square form of Reiwa,
> U+32FF ㋿, which was added in a hurry. The difference is that SQUARE
> TB's necessity and use is a slow thing which happened over years, not
> all of a sudden via one announcement of the Japanese government.

I think the case you are going to have to make is that applications exist which 
*must* use the single-code-point character for this purpose, instead of simply 
being able to use U+0054 plus U+0042.

As others have stated, it was easily demonstrated that applications existed in 
Japan which required a single code point for the era name. That is what 
necessitated the acceptance, let alone fast-tracking, of U+32FF SQUARE ERA NAME 
REIWA.

The characters in the CJK Compatibility block were added for exactly that 
reason — compatibility with character encoding standards that existed prior to 
Unicode. There has never been any expectation that sets or sequences in that 
and other "compatibility" blocks would be updated continually. Compatibility 
with character sets created since the wide adoption of Unicode, such as in 
2008, is also not guaranteed.

Earlier I wrote "[t]his seems like a reasonable candidate for a proposal," not 
necessarily because UTC will agree with the stated use case, but because 
talking about such a character on the mailing list won't get it added.

> In plaintext SQUARE TB is fundamentally different than ASCII T followed by
> ASCII B. Plaintext tables (and programs generating them) and files already
> using SQUARE MB, SQUARE GB, etc benefit from SQUARE TB.

That is something you would have to demonstrate in your proposal: that there 
are important processes (as in, "the government and industry and commerce 
depend on this") that use ㎅ ㎆ ㎇ which it would not be feasible to extend or 
modify to use Basic Latin TB, PB, EB, etc. That was the case made for the Reiwa 
sign: that there were important processes using ㍾ ㍽ ㍼ ㍻ that could not simply 
use the two existing characters 令和 for Reiwa.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: On the lack of a SQUARE TB glyph

2019-09-26 Thread Doug Ewell via Unicode
Fred Brennan wrote:
 
> I can't help but notice that there is no "SQUARE TB" glyph. 
 
Marius Spix replied:
 
> Unfortunately, the CJK Compatibility block is full, but U+321F in the
> Enclosed CJK Letters and Months seems to be free. I definitely see a usage
> for the proposed character. 
 
IIRC the CJK Compatibility squared characters came from a legacy
character set or standard, developed at whatever point in history it was
developed. So it's kind of inevitable that the set of symbols in that
block might not always be up to date.
 
This seems like a reasonable candidate for a proposal. UTC won't add a
character based on mailing-list chat, of course; they'll need a proper
proposal. They'll also be the ones to decide what code point is
assigned, although the proposal can politely suggest one.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



RE: PUA (BMP) planned characters HTML tables

2019-08-21 Thread Doug Ewell via Unicode
On August 11, I replied to Robert Wheelock:
 
>> I remember that a website that has tables for certain PUA precomposed
>> accented characters that aren’t yet in Unicode (thing like:
>> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-
>> underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
>
> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.
 
I missed the possible significance of the Latvian comma below vs.
Marshallese cedilla, which captured most of the ensuing discussion and
morphed into a discussion about different user communities and group
identity.
 
I'd like to restate, since I think the point may have been lost, that
for the OTHER characters Robert mentioned:
 
> H/h-acute, capital T-dieresis, capital H-underbar, acute accented
> Cyrillic vowels, Cyrillic ER/er-caron, ...
 
there does not appear to be any conflicting usage between different user
communities, and no particular difficulty in rendering or otherwise
processing these as combining sequences, using up-to-date fonts and
rendering engines. I suppose Philippe's example of Võro might factor
into whether different groups prefer different appearances for h́, but
otherwise these user-perceived characters seem to be non-controversial.
 
So to reiterate, these characters appear vanishingly unlikely to be
atomically encoded, "yet" or ever, for good reason.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Numeric group separators and Bidi

2019-07-15 Thread Doug Ewell via Unicode
Philippe Verdy wrote:
 
> [... U+202F ...] and not even accessible in most input tools...
> including the Windows "Charmap" where it is not even listed with other
> spaces or punctuations, except if we display the FULL list of
> characters supported by a selected font that maps it (many fonts don't
> map it) and the "Unicode" encoding. Windows charmap is so outdated
> (and has many inconsistancies in its proposed grouping, look for
> example at the groups proposed for Greek, they are complete non-sense,
> with duplicate subranges, but groups made completely arbitrarily,
> making this basic tool really difficult to use).
 
BabelMap (http://www.babelstone.co.uk/Software/BabelMap.html) is free of
charge, is easy to use, runs on all versions of Windows since 2000,
provides much better support for almost all Character Map functions than
Character Map, and has tons of additional useful features which can be
easily ignored if not needed.
 
The only possible reason for a knowledgeable, let alone
Unicode-knowledgeable, Windows user to use the built-in Character Map
utility instead of BabelMap would be to look up and pick from legacy
character sets (which I think is what Philippe is referring to as
"proposed groupings").
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 




RE: Unicode "no-op" Character?

2019-07-04 Thread Doug Ewell via Unicode
Shawn Steele wrote:

> Even more complicated is that, as pointed out by others, it's pretty
> much impossible to say "these n codepoints should be ignored and have
> no meaning" because some process would try to use codepoints 1-3 for
> some private meaning.  Another would use codepoint 1 for their own
> thing, and there'd be a conflict.

That's pretty much what happened with NUL. It was originally intended (long, 
long before Unicode) to be ignorable and have no meaning, but then other 
processes were designed that gave it specific meaning, and that was pretty much 
that.

While the Unix/C "end of string" convention was not the only case in which NUL 
was hijacked, it is certainly the best-known, and the greatest impediment to 
any current attempt to use it with its original meaning.

--
Doug Ewell | Thornton, CO, US | ewellic.org





RE: Unicode "no-op" Character?

2019-06-22 Thread Doug Ewell via Unicode
Sławomir Osipiuk wrote:

> Does Unicode include a character that does nothing at all? I'm talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures.

I join Shawn Steele in wondering what your "data padding" use case is for such 
a character. Most modern protocols don't require string fields to be exactly N 
characters long, or have their own mechanism for storing the real string length 
and ignoring any padding characters.

If you just need to fill up space at the end of a line, and need a character 
that has as little disruptive meaning as possible, I agree that U+FEFF is 
probably the closest you'll get.

NULL, of course, was intended to serve exactly this purpose, but everyone has 
decided for themselves what the C0 code points should be used for, and "display 
a .notdef glyph" is one of the popular choices.

--
Doug Ewell | Thornton, CO, US | ewellic.org





RE: Proposal to extend the U+1F4A9 Symbol

2019-06-01 Thread Doug Ewell via Unicode
Andrew West wrote:

> oh, there is no Wikidata QID for phone dropped in the toilet.

It's Wikidata, right? Pretty much anyone can create an item for pretty much 
anything, right? Problem solved.

--
Doug Ewell | Thornton, CO, US | ewellic.org





RE: Proposal to extend the U+1F4A9 Symbol

2019-06-01 Thread Doug Ewell via Unicode
Tex wrote:

> What I would find useful is an emoji for when my phone falls into the
> toilet.

I would have thought ⤵ would be sufficient.

But I didn't include any variation selectors and combining sequences for the 
gender, skin color, hair style, profession, and current state of mind of the 
phone's owner, and there are none for the brand and model of phone and toilet. 
So the sequence above is clearly inadequate for people to express themselves.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Proposal to extend the U+1F4A9 Symbol

2019-06-01 Thread Doug Ewell via Unicode
bristol_poo wrote:

> This would produce 7 variants of the U+1F4A9 emoji, including existing
> (Which I believe is about Type 4 on the scale).
>
> Why? I think this would really benefit the medical profession, with a
> large uptick in e-doctor/text only conversations towards the medical
> profession.

If physicians and other medical professionals are relying on emoji, in any way 
and at any time, to determine diagnosis and treatment, the state of health care 
is much worse than I thought.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Format A

2019-05-30 Thread Doug Ewell via Unicode
Apologies if this is a repeat of a (much) earlier inquiry.
 
The mapping tables that are available as part of the Unicode Standard
(http://www.unicode.org/Public/MAPPINGS/) are generally provided in a
text format called "Format A." Each line in the file defines a mapping
between a character in a legacy encoding and the Unicode equivalent,
with fields separated by tabs or sequences of spaces, like this:
 
0xA0    0x00A0  #NO-BREAK SPACE
0xA1    0x00A1  #INVERTED EXCLAMATION MARK
0xA2    0x00A2  #CENT SIGN
 
The format supports DBCS as well:
 
0x8140  0x4E02  #CJK UNIFIED IDEOGRAPH
0x8141  0x4E04  #CJK UNIFIED IDEOGRAPH
0x8142  0x4E05  #CJK UNIFIED IDEOGRAPH
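
To make the layout concrete, here is a minimal parsing sketch (my own 
illustration, not from any Unicode specification), assuming only what is shown 
above: legacy code, Unicode code point, and a "#" comment, separated by 
whitespace:

def parse_format_a(lines):
    """Yield (legacy_code, unicode_code_point) pairs from Format A lines."""
    for line in lines:
        data = line.split('#', 1)[0].strip()   # drop the trailing comment
        if not data:
            continue                           # blank or comment-only line
        fields = data.split()
        if len(fields) < 2:
            continue                           # legacy code with no mapping
        yield int(fields[0], 16), int(fields[1], 16)

# parse_format_a(["0xA0  0x00A0  #NO-BREAK SPACE"]) yields (0xA0, 0x00A0)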
 
My questions are:
 
1. Is there a specification for this format anywhere, and if so, where?
 
2. Is there a "Format B" or similar? (I don't mean UCM, CharMapML, RFC
1345 format, etc., but something truly similar to and/or derivative of
Format A.)
 
Please reply on-list only if you think the list at large would benefit
from your reply. I'm hoping some of the Unicode elders might have some
insight here.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



RE: Symbols of colors used in Portugal for transport

2019-04-29 Thread Doug Ewell via Unicode
Hans Åberg wrote:
 
> The guy who made the artwork for Heroes is completely color-blind,
> seeing only in a grayscale, so they agreed he coded the colors in
> black and white, and then that was replaced with colors.
 
Did he use this particular scheme? That is something I would expect to
see on the scheme's web site, and would probably be good evidence for a
proposal.
 
I do see several awards related to the concept, but few examples where
this scheme is actually in use, especially in plain text.
 
I'm not opposed to this type of symbol, but I like to think the classic
rule about "established, not ephemeral" would still apply.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Symbols of colors used in Portugal for transport

2019-04-29 Thread Doug Ewell via Unicode
Philippe Verdy wrote:
 
> A very useful think to add to Unicode (for colorblind people) !
>
> http://bestinportugal.com/color-add-project-brings-color-identification-to-the-color-blind
>
> Is it proposed to add as new symbols ?
 
Well, it isn't proposed until someone proposes it.
 
At first I thought Emojination would be best to write this proposal, to
improve its chances of approval. But these aren't really emoji; they're
actual text-like symbols, of the type that has always been considered
appropriate for Unicode. (They're not "for transport" per se; they are a
secondary indication of colors, meant for the color-blind.)
 
One important question that a proposal would need to answer is whether
these symbols are actually used in the real world. They seem like a good
and innovative new idea, and there is always a desire to help people
with physical challenges; but neither of those is what Unicode is about.
For non-emoji characters, there is usually still a requirement to show a
certain level of actual usage.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Is ARMENIAN ABBREVIATION MARK (՟, U+055F) misclassified?

2019-04-26 Thread Doug Ewell via Unicode
Fredrick Brennan wrote:

> Although my research on this has by no means been exhaustive, it
> seems at a cursory glance that the «pativ», the Armenian abbreviation
> mark, is misclassified; it seems it should either be itself a
> combining mark or have a combining mark version.
>
> I have not been able to find a single Unicode font which treats it as
> such, however.

Using BabelPad on Windows 10, with the sequence <0531, 0532, 0533, 055F> (ayb, 
ben, gim, abbreviation mark), the following fonts show the abbreviation mark 
correctly over the gim:

Calibri
Cambria
Cambria Math
Nishiki-teki
Trebuchet MS

All of these except Nishiki-teki are standard Windows fonts.

This is a small percentage of the number of fonts that have all four of these 
Armenian glyphs, but show the abbreviation mark as a spacing glyph. It looks 
like Unicode is right, Wikipedia is right, and the fonts are wrong.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Unicode CLDR 35 alpha available for testing

2019-02-28 Thread Doug Ewell via Unicode
announcements at unicode.org wrote:
 
> The alpha version of Unicode CLDR 35 
> <http://cldr.unicode.org/index/downloads/cldr-35> is available for
> testing.
 
No downloadable data files in the sense of released builds, correct?
  
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Encoding italic

2019-02-10 Thread Doug Ewell via Unicode
[...] going through that era too.

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them?

Should we be extension-compatible with other implementations, or following the 
straight and narrow path? Another decision that is not unique to ISO 6429.

> Where to draw the line what to add to Unicode and what not to? Will
> Unicode possibly be a bottleneck of further improvements in terminal
> emulators, because from now on every new mode we figure out we'd like
> to have in terminals should go through some Unicode committee?

I think you know the answer to this by now.

>> This mechanism [...] is already supported
>> as widely as any new Unicode-only convention will ever be.
>
> I truly doubt this, these escape sequences are specific to terminal
> emulation, an extremely narrow subset of where Unicode is used and
> rich text might be desired.

That's true. Probably next to nobody is using ISO 6429 sequences for plain text 
intended for print, just as next to nobody is using the proposed VS14 mechanism 
or Andrew West's Plane 14 mechanism. My suggestion was to document the ISO 6429 
approach, run it up the flagpole, and see if anyone salutes.

> Or, if it wants to adopt some already existing technology, I find
> HTML/CSS a much better starting point.

Q: How can we represent italics in plain text?
A: Use rich text.


Kent Karlsson wrote:

>> • Underline on: ESC [4m
> (implies turning double underline off)
> Underline, double: ESC [21m
> (implies turning single underline off)

I deliberately left out single and double underlining, and many other features 
of ISO 6429 SGR (such as Fraktur). The email was not intended as a final 
proposal. I do think it would be strange for single and double underlining not 
to cancel each other out.

> Note that these do NOT nest (no stack...), just state changes for the
> relevant PART of the "graphic" (i.e. style) state. So the approach in
> that regard is quite different from the approach done in HTML/CSS.

I don't regard that as either a bug or a feature. I certainly don't expect that 
every such mechanism has to nest, simply because SGML and its descendants are 
designed that way.


--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Encoding italic

2019-02-08 Thread Doug Ewell via Unicode
I'd like to propose encoding italics and similar display attributes in
plain text using the following stateful mechanism:
 
•   Italics on: ESC [3m
•   Italics off: ESC [23m
•   Bold on: ESC [1m
•   Bold off: ESC [22m
•   Underline on: ESC [4m
•   Underline off: ESC [24m
•   Strikethrough on: ESC [9m
•   Strikethrough off: ESC [29m
•   Reverse on: ESC [7m
•   Reverse off: ESC [27m
•   Reset all attributes: ESC [m
 
where ESC is U+001B.
 
This mechanism has existed for around 40 years and is already supported
as widely as any new Unicode-only convention will ever be.
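
As a concrete illustration (mine, not part of the proposal itself), this is 
how the sequences above would be emitted; any terminal that implements ISO 
6429 SGR should render the styles:

ESC = "\u001b"                                # U+001B ESCAPE

def italic(s):
    return ESC + "[3m" + s + ESC + "[23m"     # italics on ... italics off

def bold(s):
    return ESC + "[1m" + s + ESC + "[22m"     # bold on ... bold off

print(bold("stateful") + " and " + italic("plain-text") + " styling," +
      " then reset all attributes: " + ESC + "[m")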
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Doug Ewell via Unicode
http://www.unicode.org/faq/utf_bom.html#utf8-2 
  
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

> Unicode may not deprecate the tag characters, but the characters of
> Plane 14 are widely deplored, despised or abhorred. That is why I
> think of it as the deprecated plane.

Think of it as the deplored plane, then, or the despised plane or the abhorred 
plane or the Plane That Shall Not Be Mentioned.

"Deprecated" is a term of art in Unicode.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Use of tag characters in emoji sequences (was: Re: Proposal for BiDi in terminal emulators)

2019-02-02 Thread Doug Ewell via Unicode
Philippe Verdy wrote:

> Actually not all U+E0020 through U+E007E are "un-deprecated" for this
> use.

Characters in Unicode are not "deprecated" for some purposes and not for 
others. "Deprecated" is a clearly defined property in Unicode. The only 
reference that matters here is in PropList.txt:

E0000         ; Other_Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Deprecated # Cf       LANGUAGE TAG
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>

Note carefully that the code point marked "Deprecated" is deprecated, and the 
others listed here are not. (My earlier post saying that U+E007F was still 
deprecated was incorrect, as Andrew noted.)

> For now emoji flags only use:
> - U+E0041 through U+E005A (mapping to ASCII letters A through Z used
> in 2-letter ISO3166-1 codes). These are usable in pairs, without
> requiring any modifier (and only for ISO3166-1 registered codes).

Section C.1 of UTS #51 says otherwise:

tag_base    U+1F3F4 BLACK FLAG
tag_spec    (U+E0030 TAG DIGIT ZERO .. U+E0039 TAG DIGIT NINE,
             U+E0061 TAG LATIN SMALL LETTER A .. U+E007A TAG LATIN SMALL LETTER Z)+

Emoji flags use lowercase tag letters, not uppercase, and may also use digits. 
The digits are for CLDR subdivision IDs containing ISO 3166-2 code elements 
that happen to be numeric, and there are plenty of these. For example, "fr75" 
is the subdivision ID for Paris. Almost all ISO 3166-2 code elements in France 
are numeric.
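
As an illustration (my sketch, not text from UTS #51), a subdivision flag 
sequence can be built mechanically from a CLDR subdivision ID, since each tag 
character is simply U+E0000 plus the corresponding ASCII value:

def flag_tag_sequence(subdivision_id):
    seq = "\U0001F3F4"                        # tag_base: U+1F3F4 BLACK FLAG
    for ch in subdivision_id.lower():
        seq += chr(0xE0000 + ord(ch))         # tag_spec: tag letters and digits
    return seq + "\U000E007F"                 # tag_end: U+E007F CANCEL TAG

scotland = flag_tag_sequence("gbsct")         # RGI, so widely supported in fonts
paris    = flag_tag_sequence("fr75")          # valid, but not RGI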

> - I think that U+0030 through U+E0039 (mapping to ASCII digits 0
> through 9) are reserved for ISO3166 extensions, started with only the
> 3 "countries" added in the United Kingdom ("ENENG", "ENSCO" and
> "ENWLS"), with possible pending additions for other ISO3166-2, but not
> mapping any dash separator).

There is no top-level country "EN", and if there were, I doubt Scotland and 
Wales would be enthusiastic to be considered part of it.

In any case, "gbeng" and "gbsct" and "gbwls" are merely the only subdivision 
IDs that are designated "RGI," or "recommended for general interchange," in 
CLDR. Any other subdivision ID can be used in a flag tag sequence, although the 
lack of RGI designation may cause vendors to think the sequence is "recommended 
against" and not support it in fonts.

As shown above, tag digits are not reserved for "ISO 3166 extensions" (possibly 
implying ISO 3166-1), but are already usable for ISO 3166-2 code elements.

> These tags are used as modifiers in sequences starting by a leading
> U+1F3F4
> <http://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0065_e006e_e0067_e007f>
> (WAVING BLACK FLAG) emoji.

This is true. (Note the lowercase tag letters.)

> - U+E007F (CANCEL TAG) is already used too for the regional extensions
> as a mandatory terminator, as seen in the three British countries.

This is true.

> It is not used for country flags made of 2-letter emoji codes without
> any leading flag emoji.

This is true, but not particularly relevant, as these use Regional Indicator 
Symbols and have nothing to do with the "emoji codes" discussed elsewhere.

> And the proposal discussed here to use U+E003C, mapped to the ASCII
> "<" LOWER THAN

LESS-THAN SIGN

> as a leading tag sequence for reencoding HTML tags in sequences
> terminated by U+E003E ">" (and containing HTML element names using
> lowercase letter tags,

Only "b", "i", "u", and "s" by definition.

> possibly digit tags in these names,

No.

> and "/" for HTML tags terminator, possibly also U+E0020 SPACE TAG for
> separating HTML attributes, U+003D "=" for attribute values, U+E0022
> (') or U+E0027 (") around attribute values, but a problem if the
> mapped element names or attributes contain non-ASCII characters...)

None of these are part of Andrew's mechanism. It's just b, i, u, and s.

> is not standard

Neither Andrew nor anyone else claimed it was.

> (it's just an experiment in one font),

It applies to any TrueType font, because the rendering engine can apply these 
four styles (in any combination) to any TrueType font.

> and would in fact not be compatible with the existing specification
> for tags.

Good thing nobody claimed they were.

> So only E+E0020 through U+E0040, and U+E005B through U+E007E remain
> deprecated.

Da capo.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Doug Ewell via Unicode
Richard Wordingham wrote:
 
> Language tagging is already available in Unicode, via the tag
> characters in the deprecated plane.
 
Plane 14 isn't deprecated -- that isn't a property of planes -- and the
tag characters U+E0020 through U+E007E have been un-deprecated for use
with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
deprecated.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



RE: Encoding italic

2019-01-31 Thread Doug Ewell via Unicode
Kent Karlsson wrote:
 
> ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
> sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
> is implemented in Cygwin (sorry for mentioning a product name).)
 
Fair enough. This thread is mostly about italics and bold and such, not
colors, but the point is well taken that one of these leads invariably
to the others, especially if the standard or flavor in question
implements them.
 
> ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
> traditionally does not use bold or italic.
 
But that's OK. For low-level mechanisms like these, it should be
incumbent on the user to say, "Yes, I can use this styling with that
script, but I shouldn't; it would look terrible and would fly in the
face of convention." ISO 6429 also allows green text on a cyan
background, which is about as good an idea as CJK italics.
 
> Compare those specified for CSS
> (https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
> https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
> These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
> be of interest for the generalised subject of this thread.
 
I'm hoping we can continue to restrict this thread to plain text.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Doug Ewell via Unicode
Egmont Koblinger wrote:
 
> "Basic Arabic shaping, at the level of a typewriter, is
> straightforward enough to be implemented in the application, using
> presentation form characters, as I suggest". Could you please point
> out the problems with this statement?
 
As multiple people have pointed out, Arabic presentation forms don't
cover the whole Arabic script and are not generally recommended for new
applications, though they are not formally deprecated.

If you take a look at the parallel discussion about italics in plain
text, you will see a corollary in the use of Mathematical Alphanumeric
Symbols: they look tempting and are (usually) easy to render, but among
other things, they only cover [A-Za-zıȷΑ-Ωα-ω] and thus miss much
of the text that may need to be italicized.
 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Encoding italic

2019-01-30 Thread Doug Ewell via Unicode
Kent Karlsson wrote:
 
> Yes, great. But as I've said, we've ALREADY got a
> default-ignorable-in-display (if implemented right)
> way of doing such things.
>
> And not only do we already have one, but it is also
> standardised in multiple standards from different
> standards institutions. See for instance "ISO/IEC 8613-6,
> Information technology --- Open Document Architecture (ODA)
> and Interchange Format: Character content architecture".
 
I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
has the advantage of not costing me USD 179, and it looks very similar
to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
are talking about: setting text display properties such as bold and
italics by means of escape sequences.
 
Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
doing, and if it does not, why we should not simply refer to the more
familiar 6429?
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Encoding italic

2019-01-30 Thread Doug Ewell via Unicode
Martin J. Dürst wrote:
 
> Here's a little dirty secret about these tag characters: They were
> placed in one of the astral planes explicitly to make sure they'd use
> 4 bytes per tag character, and thus quite a few bytes for any actual
> complete tags.
 


Aha. That explains why SCSU had to be banished to the hut, right around
the same time the Plane 14 language tags were deprecated. In SCSU,
astral characters can be 1 byte just like BMP characters.

 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode
Martin J. Dürst wrote:
 
> Here's a little dirty secret about these tag characters: They were
> placed in one of the astral planes explicitly to make sure they'd use
> 4 bytes per tag character, and thus quite a few bytes for any actual
> complete tags. See https://tools.ietf.org/html/rfc2482 for details.
> Note that RFC 2482 has been obsoleted by
> https://tools.ietf.org/html/rfc6082, in parallel with a similar motion
> on the Unicode side.
 
I don't recall anyone mentioning Plane 14 language tags per se in this
thread. The tag characters themselves were un-deprecated to support
emoji flag sequences. But more on language tags in a moment.
 
> These tag characters were born only to shoot down an even worse
> proposal, https://tools.ietf.org/html/draft-ietf-acap-mlsf-01. For
> some additional background, please see
> https://tools.ietf.org/html/draft-ietf-acap-langtag-00.
>
> The overall tag proposal had the desired effect: The original proposal
> to hijack some unused bytes in UTF-8 was defeated, and the tags itself
> were not actually used and therefore could be depreciated.
 
I agree that the ACAP proposal was awful, for many reasons and on many
levels. But in general, introducing a new standardized mechanism SO THAT
it can be deprecated is a crummy idea. It engenders bad feelings and
distrust among loyal users of the standard. Major software vendors, one
in particular starting with M, have been castigated for decades for
employing tactics similar to this.
 
> Bad ideas turn up once every 10 or 20 years. It usually takes some
> time for some of the people to realize that they are bad ideas. But
> that doesn't make them any better when they turn up again.
 
The suggestions over the past three weeks to encode basic styling in
plain text (I'm not saying I'm for or against that) have some
similarities with Plane 14 language tags: many people consider both
types of information to be meta-information, unsuitable for plain text,
and many of the suggested mechanisms are stateful, which is an anti-goal
of Unicode. But these are NOT the same idea, and the fact that they both
use Plane 14 tag characters doesn't make them so.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode
Kent Karlsson wrote:
 
> We already have a well-established standard for doing this kind of
> things...
 
I thought we were having this discussion because none of the existing
methods, no matter how well documented, has been accepted on a
widespread basis as "the" standard.
 
Some people dislike markdown because it looks like lightweight markup
(which it is), not like actual italics and boldface. Some dislike ISO
6429 because escape characters are invisible and might interfere with
other protocols (though they really shouldn't). Some dislike math
alphanumerics abuse because it's abuse, doesn't cover other writing
systems, etc.
 
I'd be happy to work with Kent to campaign for ISO 6429 as "the"
well-established standard for applying simple styling to plain text, but
we would have to acknowledge the significant challenges.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode
Philippe Verdy replied to James Kass:
 
> You're not very explicit about the Tag encoding you use for these
> styles.
 
Of course, it was Andrew West who implemented the styling mechanism in a
beta release of BabelPad. James was just reporting on it.
 
> And what is then the interest compared to standard HTML
 
This entire discussion, for more than three weeks now, has been about
how to implement styling (e.g. italics) in plain text. Everyone knows it
can be done, and how to do it, in rich text.
 
> So you used "<b>bold</b>" spelled out with tag characters (U+E003C ..
> U+E003E)? I.e., you converted from ASCII to tag characters the full HTML
> sequences "<b>" and "</b>", including the HTML element name. I see
> little interest for that approach.
 
I thought we had established that someone had mentioned it on this list,
at some time during the past three weeks. Can someone look up what post
that was? I don't have time to go through scores of messages, and there
is no search facility.
 
I can't speak for Andrew, but I strongly suspect he implemented this as
a proof of concept, not to declare himself the Maker of Standards.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Unihan variants information

2019-01-28 Thread Doug Ewell via Unicode
Michel MARIANI wrote:
 
> I've developped an open-source, multi-platform desktop application
> called Unicode Plus
 
Before you get too heavily invested in this product name, you may want
to:
 
1. check out the page "Unicode® Copyright and Terms of Use" located at
http://www.unicode.org/copyright.html, and
 
2. send a quick note to the Consortium officers asking whether they are
OK with this use of the Unicode name.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Encoding italic (was: A last missing link)

2019-01-21 Thread Doug Ewell via Unicode
Kent Karlsson wrote:

> There is already a standardised, "character level" (well, it is from
> a character standard, though a more modern view would be that it is
> a higher level protocol) way of specifying italics (and bold, and
> underline, and more):
>
> \u001b[3mbla bla bla\u001b[0m
>
> Terminal emulators implement some such escape sequences.

And indeed, the forthcoming Unicode Technical Note we are going to be
writing to supplement the introduction of the characters in L2/19-025,
whether next year or later, will recommend ISO 6429 sequences like this
to implement features like background and foreground colors, inverse
video, and more, which are not available as plain-text characters. 
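
For instance (my sketch of what such sequences look like, not the UTN's actual 
text), colors and inverse video are just additional SGR parameters:

ESC = "\u001b"
print(ESC + "[32;44m" + "green on blue" + ESC + "[0m")    # fg 32, bg 44, reset
print(ESC + "[7m" + "inverse video" + ESC + "[27m")       # reverse on / off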
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Encoding italic

2019-01-21 Thread Doug Ewell via Unicode
James Kass wrote:
 
> Even the enthusiasts among us seldom take the trouble to include
> ‘proper’ quotes and apostrophes in e-mails — even for posting to
> specialized lists such as this one where other members might notice
> and appreciate the extra effort involved.
 
Well, definitely not to this list, since the digest will clobber such
characters (quod vide). 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Where is my character @?

2019-01-09 Thread Doug Ewell via Unicode
James Kass wrote:
 
> It's probably old-fashioned to say that technology should be forced to
> accomodate people rather than the other way around.  But it's good to
> note that efforts are still being made on behalf of the users to make
> progress towards U.C.S. inclusion.
 
I'm as opposed to this proposal as I was in 2004, if not more so, and
I'm working on a brief response document for next week's UTC.
 
Among other things, it's not at all clear that the orthography using @,
cited in three works from a single publisher in 1998, has been adopted
or become particularly widespread within the Koalib community. (And no,
this does not constitute "disdain for the small community.")
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: The encoding of the Welsh flag

2018-11-22 Thread Doug Ewell via Unicode

Christoph Päper wrote:


>> We have gotten requests for this, but the stumbling block is the lack
>> of an official N. Ireland document describing what the official flag
>> is and should look like.
>
> Such documents are lacking for several of the RIS flag emojis as well,
> though, e.g. for 🇺🇲 from ISO 3166-1 code `UM` (United States Outlying
> Islands), resulting in unknown or duplicate flags, hence confusion.
> The solution there would have been to exclude codes for dependent
> territories becoming RGI emojis. ISO 3166 provides that property.


That's neither the problem nor the solution, IMHO. Even for RIS 
sequences, you have no guarantee of exactly how the flag will be 
depicted. For flags that have been recently changed, you might get the 
old or the new. For UM, you might get the US flag or one of the 
unofficially adopted flags. For Northern Ireland (if it were 
RGI-blessed), you might get either the Ulster Banner or St. Patrick's 
Saltire.


This situation is described, and explicitly so for the UM flags, in 
Annex B of UTS #51 under "Caveats."


--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: The encoding of the Welsh flag

2018-11-22 Thread Doug Ewell via Unicode

Ken Whistler replied to Michael Everson:


>> What really annoys me about this is that there is no flag for
>> Northern Ireland. The folks at CLDR did not think to ask either the
>> UK or the Irish representatives to SC2 about this.
>
> [...]
>
> If you or Andrew West or anyone else is interested in pursuing an
> emoji tag sequence for an emoji flag for Northern Ireland, then that
> should be done by submitting a proposal, with justification, to the
> Emoji Subcommittee, which *does* have jurisdiction.


There is, of course, an encoding for the flag of Northern Ireland:

1F3F4 E0067 E0062 E006E E0069 E0072 E007F

where the tag characters are "gbnir" followed by CANCEL TAG.
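
A quick sketch (mine, not from UTS #51) that recovers the subdivision ID from 
such a sequence:

seq = "\U0001F3F4\U000E0067\U000E0062\U000E006E\U000E0069\U000E0072\U000E007F"
tags = [c for c in seq if 0xE0020 <= ord(c) <= 0xE007E]   # drop base and CANCEL TAG
print("".join(chr(ord(c) - 0xE0000) for c in tags))       # prints "gbnir"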

What I suspect Michael means is that this sequence is not RGI, or 
"recommended for general interchange," a status which applies for flag 
emoji only to England, Scotland, and Wales, and not to any of the 
thousands of other subdivisions worldwide.


The terminology currently in UTS #51 is definitely an improvement over 
early drafts, which explicitly labeled such sequences "not recommended," 
but it still leads practically everyone, evidently including Michael, to 
believe the sequences are invalid or non-existent.


I would certainly like to use the flag of Colorado, whose visual 
appearance is very much standardized, but the vicious circle of vendor 
support and UTS #51 categorization means no system will offer glyph 
support, and some systems may even reject it as invalid.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-05 Thread Doug Ewell via Unicode
Philippe Verdy wrote:
 
> Note that I actually propose not just one rendering for the <abbreviation
> mark> but two possible variants (that would be equally valid without
> preference).
 
Actually you're not proposing them. You're talking about them (at
length) on the public mailing list. If you want to propose something,
you should consider writing a proposal.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Doug Ewell via Unicode
Michael Everson wrote:

> I write my 7’s and Z’s with a horizontal line through them. Ƶ is
> encoded not for this purpose, but because Z and Ƶ are distinct in
> orthographies for varieties of Tatar, Chechen, Karelian, and
> Mongolian. This is a contemporary writing convention but it does not
> argue for a new SEVEN WITH STROKE character or that I should use Ƶ
> rather than Z when I write *Ƶanƶibar. 

http://www.unicode.org/L2/L2018/18323-open-four.pdf
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Doug Ewell via Unicode
Do we have any other evidence of this usage, besides a single
handwritten postcard? 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




[getting OT] Re: A sign/abbreviation for "magister"

2018-10-30 Thread Doug Ewell via Unicode
Marcel Schneider replied to Khaled Hosny:

>>> E.g. in Arabic script, superscript is considered worth encoding and
>>> using without any caveat, [...]
>>
>> Curious, what Arabic superscripts are encoded in Unicode?
>
> [...] There is the range U+FC5E..U+FC63 (presentation forms).

Arabic presentation forms are never an example of anything, and their
use is full of caveats.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-10-30 Thread Doug Ewell via Unicode
Julian Bradfield wrote:
 
>> in the 17ᵗʰ or 18ᵗʰ century to keep it only for ordinals. Should
>> Unicode
>
> What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
> is not now (when superscripting looks very old-fashioned) and never
> has been any requirement to superscript them, as far as I know -
> though since the OED doesn't have an entry for "1st", I can't easily
> check.
 
The English Wikipedia article "Ordinal number (linguistics)" does not
show numbers such as 1st, 2nd, etc. with superscripts, though as a
rich-text Web page, it could easily.
 
The article "English numerals" does include a bullet point: "The
suffixes -th, -st, -nd and -rd are occasionally written superscript
above the number itself." Note the word "occasionally."
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-10-29 Thread Doug Ewell via Unicode
Richard Wordingham wrote:
 
>> I like palaeographic renderings of text very much indeed, and in fact
>> remain in conflict with members of the UTC (who still, alas, do NOT
>> communicate directly about such matters, but only in duelling ballot
>> comments) about some actually salient representations required for
>> medievalist use. The squiggle in your sample, Janusz, does not
>> indicate anything; it is only a decoration, and the abbreviation is
>> the same without it.
>
> I think this is one of the few cases where Multicode may have
> advantages over Unicode. In a mathematical contest, aⁿ would be
> interpreted as _a_ applied _n_ times. As to "fⁿ", ambiguity may be
> avoided by the superscript being inappropriate for an exponent. What
> is redundant in one context may be significant in another.
 
Are you referring to the encoding described in the 1997 paper by
Mudawwar, which "address[es] Unicode's principal drawbacks" by switching
between language-specific character sets? Kind of like ISO 2022, but
less extensible?
 
ObMagister: I agree that trying to reflect every decorative nuance of
handwriting is not what plain text is all about. (I also disagree with
those who insist that superscripted abbreviations are required for
correct spelling in certain languages, and I expect to draw swift
flamage for that stance.) The abbreviation in the postcard, rendered in
plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion just
fuels the recurring demands for every Latin letter (and eventually those
in other scripts) to be duplicated in subscript and superscript, à la
L2/18-206.

Back into my hole now.

--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Doug Ewell via Unicode

Steffen Nurpmeso wrote:


> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> (MIME) Part One: Format of Internet Message Bodies).


Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data 
Encodings." RFC 2045 defines a particular implementation of base64, 
specific to transporting Internet mail in a 7-bit environment.


RFC 4648 discusses many of the "higher-level protocol" topics that some 
people are focusing on, such as separating the base64-encoded output 
into lines of length 72 (or other), alternative target code unit sets or 
"alphabets," and padding characters. It would be helpful for everyone to 
read this particular RFC before concluding that these topics have not 
been considered, or that they compromise round-tripping or other 
characteristics of base64.


I had assumed that when Roger asked about "base64 encoding," he was 
asking about the basic definition of base64.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Doug Ewell via Unicode
J Decker wrote:

>> How about the opposite direction: If m is base64 encoded to yield t
>> and then t is base64 decoded to yield n, will it always be the case
>> that m equals n?
>
> False.
> Canonical translation may occur which the different base64 may be the
> same sort of string...

Base64 is a binary-to-text encoding. Neither encoding nor decoding
should presume any special knowledge of the meaning of the binary data,
or do anything extra based on that presumption.

Converting Unicode text to and from base64 should not perform any sort
of Unicode normalization, convert between UTFs, insert or remove BOMs,
etc. This is like saying that converting a JPEG image to and from base64
should not resize or rescale the image, change its color depth, convert
it to another graphic format, etc.
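
A minimal check of that claim (my illustration, using Python's standard base64 
module):

import base64

m = "Noël, नमस्ते".encode("utf-8")    # arbitrary Unicode text, as UTF-8 bytes
t = base64.b64encode(m)               # pure binary-to-text transformation
n = base64.b64decode(t)
assert m == n                         # byte-for-byte identical; nothing normalized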

So I'd say "true" to Roger's question.

I touched on this a little bit in UTN #14, from the standpoint of trying
to improve compression by normalizing the Unicode text first.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD

2018-09-08 Thread Doug Ewell via Unicode

To finish (I hope) this thread:

1. Glad to know that Notepad is getting some modern updates, even if 
belatedly.


2. Sorry that there are still tools out there, on different platforms, 
that can't handle each other's EOL conventions. (Of course, this is the 
problem Unicode was trying to solve by introducing LS and PS, but we 
know how that went.)


3. Unicode data files can be read and processed on any platform, but 
some careful choice of reading and processing tools might be advisable.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-06 Thread Doug Ewell via Unicode
Marcel Schneider wrote:
 
> BTW what I conjectured about the role of line breaks is true for CSV
> too, and any file downloaded from UCD on a semicolon separator basis
> becomes unusable when displayed straight in the built-in text editor
> of Windows, given Unicode uses Unix EOL.
 
It's been well known for decades that Windows Notepad doesn't display
LF-terminated text files correctly. The solution is to use almost any
other editor. Notepad++ is free and a great alternative, but there are
plenty of others (no editor wars, please).
 
The RFC Editor site explains why it provides PDF versions of every RFC,
nearly all of which are plain text:
 
"The primary version of every RFC is encoded as an ASCII text file,
which was once the lingua franca of the computer world. However, users
of Microsoft Windows often have difficulty displaying vanilla ASCII text
files with the correct pagination."
 
which similarly assumes that "users of Microsoft Windows" have only
Notepad at their disposal.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Unicode Digest, Vol 56, Issue 20

2018-08-30 Thread Doug Ewell via Unicode
UnicodeData.txt was devised long before any of the other UCD data files. Though 
it might seem like a simple enhancement to us, adding a header block, or even a 
single line, would break a lot of existing processes that were built long ago 
to parse this file.

So Unicode can't add a header to this file, and that is the reason the format 
can never be changed (e.g. with more columns). That is why new files keep 
getting created instead.

The XML format could indeed be expanded with more attributes and more 
subsections. Any process that can parse XML can handle unknown stuff like this 
without misinterpreting the stuff it does know.

That's why the only two reasonable options for getting UCD data are to read all 
the tab- and semicolon-delimited files, and be ready for new files, or just 
read the XML. Asking for changes to existing UCD file formats is kind of a 
non-starter, given these two alternatives.
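
To illustrate the second alternative (my own sketch, assuming a local copy of 
the UAX #42 flat data file, ucd.all.flat.xml): an XML consumer keeps working 
even as attributes it has never heard of get added.

import xml.etree.ElementTree as ET

NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"   # UAX #42 namespace
root = ET.parse("ucd.all.flat.xml").getroot()
for char in root.iter(NS + "char"):
    if char.get("cp") == "0041":
        # Read only the attributes we know about; anything newly added is ignored.
        print(char.get("cp"), char.get("na"), char.get("gc"))
        break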


--
Doug Ewell | Thornton, CO, US | ewellic.org

 Original message 
Message: 3
Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
From: Marcel Schneider via Unicode

Curiously, UnicodeData.txt is lacking the header line. That makes it unflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It's like the file was made up for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that's why at some point I submitted feedback asking 
for an extension. [...]



Re: Private Use areas

2018-08-28 Thread Doug Ewell via Unicode
On August 23, 2011, Asmus Freytag wrote:

> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>> Of all applications, a word processor or DTP application would want
>> to know more about the properties of characters than just whether
>> they are RTL. Line breaking, word breaking, and case mapping come to
>> mind.
>>
>> I would think the format used by standard UCD files, or the XML
>> equivalent, would be preferable to making one up:
>
> The right answer would follow the XML format of the UCD.
>
> That's the only format that allows all necessary information contained
> in one file, and it would leverage of any effort that users of the
> main UCD have made in parsing the XML format.
>
> An XML format shold also be flexible in that you can add/remove not
> just characters, but properties as needed.
>
> The worst thing do do, other than designing something from scratch,
> would be to replicate the UnicodeData.txt layout with its random, but
> fixed collection of properties and insanely many semi-colons. None of
> the existing UCD txt files carries all the needed data in a single
> file.

I don't know if or how I responded 7 years ago, but at least today, I
think this is an excellent suggestion.

If the goal is to encourage vendors to support PUA assignments, using an
exceedingly well-defined format (UAX #42) sitting atop one of the most
widely used base formats ever (XML), with all property information in a
single repository (per PUA scheme), would be great encouragement. I've
devised lots of novel file formats and I think this is one use case
where that would be a real hindrance.

Storing this information in a font, by hook or crook, would lock users
of those PUA characters into that font. At that rate, you might as well
use ASCII-hacked fonts, as we did 25 years ago.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Private Use areas

2018-08-21 Thread Doug Ewell via Unicode
Ken Whistler wrote:

> The way forward for folks who want to do this kind thing is: 
>
> 1. Define a *protocol* for reliable interchange of custom character
> property information about PUA code points. 

I've often thought that would be a great idea. You can't get to steps 2
and 3 without step 1. I'd gladly participate in such a project. 
  
--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Doug Ewell via Unicode
Mark Davis wrote:

> The only caution I would give is that people shouldn't expect general
> purpose software to do anything with PUA text that depends on
> character properties.

Very true, and a good point. People with creative PUA ideas do sometimes
expect this to magically work.

I have anecdotes, if anyone is interested off-list.

--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Unicode, emoji and Sundar Pichai

2018-07-13 Thread Doug Ewell via Unicode
Yuhong Bao wrote:

> I wonder how much Sundar Pichai (CEO of Google) participate in Unicode
> (especially the emoji part)?
> Would he be interested in Unicode UTC meetings for example?

Google currently has a representative on the Unicode Board of Directors
(Bob Jung), the Unicode Consortium President, CLDR Technical Committee
chair, and Emoji Subcommittee chair (Mark Davis), and the ICU Technical
Committee chair (Markus Scherer).

With apologies to James Kass, I would speculate that Mr. Pichai is a
busy man and is quite satisfied with Google's representation within the
Unicode organization. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Italic mu in squared Latin abbreviations?

2018-06-20 Thread Doug Ewell via Unicode
Ivan Panchenko wrote:

> Is there a reason why the mu does not appear upright

It was probably italicized in the glyphs printed in the relevant
Japanese standard, back in the 1990s.

The glyphs in the Unicode charts are not normative, except for a very
small handful of encoded characters like Dingbats where they are "kind
of normative."

Because of this, it's not necessary to worry about whether the µ in the
CJK squared Latin abbreviations is italic or roman in any given font.
Fonts will be fonts. Glyph variation happens.

Rendering ㎖ with a capital M does seem to be a violation of character
identity, but Arial Unicode MS has not been updated since 2000 and this
problem is likely to remain unsolved. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Hyphenation Markup

2018-06-02 Thread Doug Ewell via Unicode

Richard Wordingham wrote:


>> What about U+200B ZWSP?
>
> Thanks for the suggestion, but it's not likely to work:

Are you asking what schemes exist, or are you trying to call attention 
to some rendering engine and/or font that doesn't render a combination 
as it should?

> 1) In the sequence <character-1, ZWSP, character-2>, realisation of the
> break should definitely result in <character-1> on one line and in
> <character-2> on the next line, whereas in visual order, character-2
> should precede character-1.

This is too general for me to parse. Can you replace these hypotheticals 
with actual characters, using code points, or at least with actual 
General Categories? For example, an 'Mc' followed by ZWSP followed by an 
'Lo' displays like such-and-so. The code points would be best.

> Incidentally, does CLDR define the rendering of soft hyphen, or is one
> entirely at the mercy of the application?

Why would this be a CLDR thing?

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

>>> The effects of virama that spring to mind are:
>>>
>>> (a) Causing one or both letters on either side to change or combine
>>> to indicate combination;
>>>
>>> (b) Appearing as a mark only if it does not affect one of the
>>> letters on either side;
>>>
>>> (c) Causing a left matra to appear on the left of the sequence of
>>> consonants joined by a sequence of non-visible viramas.
>>
>> Most of these don't apply to Tamil, of course.
>
> They all apply to க்ஷே  TAMIL
> SYLLABLE KSSEE. There are four other named syllables where they all
> apply.

And several others where they do not. TUS explains that visible
puḷḷi is the general rule in Tamil, and conjunct ligatures are the
exception.

I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process
that checks whether each character in a sequence is "alphabetic" is
inappropriate for determining whether the sequence constitutes a word.
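To make that concrete, here is a quick sketch using Python's str.isalpha()
(which tests only the letter categories Lu/Ll/Lt/Lm/Lo) on an ordinary
Tamil word containing a puḷḷi:

```python
# A per-character "alphabetic" test fails on a perfectly good Tamil word,
# because the vowel sign (Mc) and the pulli U+0BCD (Mn) are marks, not letters.
import unicodedata

word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"   # தமிழ், the word "Tamil"

for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

print(word.isalpha())   # False -- yet the sequence is certainly a word
```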
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Doug Ewell via Unicode

SundaraRaman R wrote:


> but the very common pulli (VIRAMA)
> is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.


Is this definition part of Unicode? I thought the use of General 
Category to answer questions like "this sequence is a word" or "this 
string is alphabetic" was much more complex than that. (I'm not even 
sure what the latter means, for any script with any sort of combining 
mark.)


Richard Wordingham wrote:


> The effects of virama that spring to mind are:
>
> (a) Causing one or both letters on either side to change or combine to
> indicate combination;
>
> (b) Appearing as a mark only if it does not affect one of the letters
> on either side;
>
> (c) Causing a left matra to appear on the left of the sequence of
> consonants joined by a sequence of non-visible viramas.


Most of these don't apply to Tamil, of course.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: L2/18-181

2018-05-17 Thread Doug Ewell via Unicode
Otto Stolz wrote:

> I wonder how English and French ever could
> be made to use a single script, let alone
> German (???), Icelandic (???), Swedish (???),
> Latvian (???), Chech (???) or ? you name it.

They do use the same script, Latin. They do not use the same alphabet.
Each language has its own language-specific alphabet.

It is the same for Bengali and Assamese, although the language-specific
subsets are called abugidas instead of alphabets.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: L2/18-181

2018-05-17 Thread Doug Ewell via Unicode
I wrote:
 
> ক্ is a conjunct consisting of three code points
 
s/ক্/ক্ষ/

--
Doug Ewell | Thornton, CO, US | ewellic.org




RE: L2/18-181

2018-05-17 Thread Doug Ewell via Unicode
Everyone,

I was not serious about this proposal being "fascinating" or in any way
a model for what should happen with the Bengali script.

Please imagine a tongue-in-cheek expression as you re-read my post.
Maybe there is an emoji that depicts this. Maybe I've just been away
from the list too long and forgot that plain text often does not
communicate dry humor effectively.

James Kass wrote:

> We should strive to keep any criticism constructive rather than
> derisive.

Fair enough. My constructive suggestion would be to press vendors to
support Assamese language tools, so that spell-checking, sorting,
transcription, and other language-dependent operations will work
properly, whether or not that was the goal of the proposal. A language
with 15 million native speakers deserves no less.

Regarding keyboards, ক্ is a conjunct consisting of three code
points (U+0995, U+09CD, U+09B7) and fits comfortably on a single key
within a standard Windows layout. Indeed, the Assamese keyboards shipped
with Windows since at least 7 already have this key (E06, level 2).
Systems that limit a keystroke to one code point have problems that go
well beyond Assamese.
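For anyone following along, a quick sketch of that conjunct as plain data
(whether a given driver emits all three code points from one keystroke is
a layout question, not shown here):

```python
# The conjunct in question is an ordinary three-code-point sequence.
import unicodedata

kssa = "\u0995\u09CD\u09B7"   # ক্ষ

print(len(kssa))   # 3
for ch in kssa:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0995 BENGALI LETTER KA
# U+09CD BENGALI SIGN VIRAMA
# U+09B7 BENGALI LETTER SSA
```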

> If I'm not mistaken, the character naming for this script was
> inherited from the ISCII standard, so it was the Indian government's
> convention.

BIS made a mistake here in failing to distinguish languages, or
language-specific alphabets, from scripts, but it only cost them a
single attribute byte assignment in ISCII. Disunifying Assamese from
Bengali in Unicode would have a much greater impact.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




L2/18-181

2018-05-16 Thread Doug Ewell via Unicode
http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf

This is a fascinating proposal to disunify the Assamese script from
Bengali on the following bases:

1. The identity of Assamese as a script distinct from Bengali is in
jeopardy.

2. Collation is different between the Assamese and Bengali languages,
and code point order should reflect collation order.

3. Keyboard design is more difficult because consonants like ক্ষ
are encoded as conjunct forms instead of atomic characters.

4. The use of a single encoded script to write two languages forces
users to use language identifiers to identify the language.

5. Transliteration of Assamese into a different script is problematic
because letters have different phonological value in Assamese and
Bengali.

It will be interesting to see where this proposal goes. Given that all
or most of these issues can be claimed for English, French, German,
Spanish, and hundreds of other languages written in the Latin script, if
the Assamese proposal is approved we can expect similar disunification
of the Latin script into language-specific alphabets in the future.

--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Doug Ewell via Unicode
Martin J. Dürst wrote:
 
> Please enjoy. Sorry for being late with forwarding, at least in some
> parts of the world.
 
Unfortunately, we know some folks will look past the humor and use this
as a springboard for the recurring theme "Yes, what *will* we do when
Unicode runs out of code points?"

I did appreciate the Acknowledgements section which lists the members of
ABBA as a source of inspiration.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Doug Ewell via Unicode
Oh, let him have a little fun. At least he's using emoji for something 
related to characters, instead of playing Mr. Potato Head.


Incidentally, more prior art on large-base encoding:
https://sites.google.com/site/markusicu/unicode/base16k

--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-01 Thread Doug Ewell via Unicode
Tim Partridge wrote:

> Perhaps the CLDR work the Consortium does is being referenced. That is
> by language on this list
> http://www.unicode.org/cldr/charts/32/supplemental/locale_coverage.html#ee
> By the time it gets to the 100th entry the Modern percentage has "room
> for improvement".

I think that is a measurement of locale coverage -- whether the
collation tables and translations of "a.m." and "p.m." and "a week ago
Thursday" are correct and verified -- not character coverage.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Missing Kazakh Latin letters (was: Re: 0027, 02BC, 2019, or a new character?)

2018-02-27 Thread Doug Ewell via Unicode
Michael Everson wrote:

> Why on earth would they use Ch and Sh when 1) C isn’t used by itself
> and 2) if you’re using Ǵǵ you may as well use Çç Şş.

Philippe Verdy wrote:

> The three versions of the Cyrilic letter i is mapped to 1.5
> (distinguished only on lowercase with the Turkic lowercase dotless i,
> but not distinguished on uppercase where there's no dot at all...).
> It should have used two distinct letters at least (I with or without
> acute).

There's another problem. No Latin equivalents are listed for the
Cyrillic letters Ц ц Ъ ъ Ь ь Э э Ю ю Я я, in either the old
charts with apostrophes or the new chart with acutes. These are code
points 0426, 042A, 042C, 042D, 042E, and 042F and corresponding
lowercase.

All of these letters, in lowercase or both, are used in the Kazakh
translation of the UDHR currently available from the "UDHR in Unicode"
project. So either the UDHR translation is wildly incorrect, which seems
unlikely, or the transliteration tables are incomplete.

Wikipedia shows digraphs Iý ıý for Ю ю, and Ia ıa for Я я, and
nothing for the others, though it is not clear where the digraphs came
from, and of course the usual Wikipedia caveats apply. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Unicode of Death 2.0

2018-02-17 Thread Doug Ewell via Unicode

Manish Goregaokar wrote:


> FWIW I dissected the crashing strings, it's basically all <consonant,
> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali,
> Devanagari where the consonant is suffix-joining (ra in Devanagari,
> jo and ro in Bengali, and all Telugu consonants), the vowel is not
> Bengali au or o / Telugu ai, and if the second consonant is ra/ro the
> first one is not also ra/ro (or ro-with-line-through-it).
>
> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/


Thanks for this very detailed and informative blog post. It's certainly 
better than "probably not a bug of Unicode," implying an outside chance 
that it might be.


I've linked Manish's post on FB as a reply to one of those mainstream 
articles that repeatedly calls the conjunct a "single character," 
written by a staffer who couldn't be bothered to find out how a writing 
system used by 78 million people works.
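For the curious, one sequence fitting Manish's description can be spelled
out as follows; the particular choice of KA and the vowel sign AA is mine,
purely for illustration, and no claim is made that this exact string
triggers the bug:

```python
# <consonant, virama, consonant, ZWNJ, vowel> in Telugu -- five code points,
# not the "single character" of the mainstream coverage.
seq = "\u0C15\u0C4D\u0C15\u200C\u0C3E"   # KA, VIRAMA, KA, ZWNJ, VOWEL SIGN AA

print(len(seq))                          # 5
print([f"U+{ord(c):04X}" for c in seq])
# ['U+0C15', 'U+0C4D', 'U+0C15', 'U+200C', 'U+0C3E']
```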


--
Doug Ewell | Thornton, CO, US | ewellic.org



+1 (was: Re: Why so much emoji nonsense?)

2018-02-15 Thread Doug Ewell via Unicode

Philippe Verdy wrote:


> If people don't know how to read and cannot reuse the content and
> transmit it, they become just consumers and in fact less and less
> productors or creators of contents. Just look at opinions under
> videos, most of them are just "thumbs up", "like", "+1", barely
> counted only, unqualifiable (there's not even a thumb down).


+1 is actually a convenient shorthand when all that needs to be said is 
"I agree" or "me too" (especially now that the latter has taken on a 
highly charged meaning in the U.S.). It is especially popular in the 
IETF. It is not intended for situations that require explanation or 
details.


--
Doug Ewell | Thornton, CO, US | ewellic.org 



RE: Keyboard layouts and CLDR

2018-01-30 Thread Doug Ewell via Unicode
Marcel Schneider wrote:

>> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
>
> Sadly the downloads are still unavailable (as formerly discussed). But
> I saved in time, too (June 2015).

Sorry, try this:

http://vrici.lojban.org/~cowan/MobyLatinKeyboard.zip
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Keyboard layouts and CLDR

2018-01-30 Thread Doug Ewell via Unicode
Marcel Schneider wrote:

> That tends to prove that Mac users accept changes, while Windows users
> refuse changes.

I was going to say that was a gross over-generalization, but that didn't
adequately express how gross it was. It's just plain wrong. Pardon my
bluntness.

How about: Windows is often used in the workplace, where users may not
have the freedom or motivation to make their own changes and be
different from other users, while Macs are often used by individuals who
do. That's an over-generalization too, but not quite at the level of
"Windows users refuse changes."

Alastair Houghton replied:

> I think, rather, that Apple is (or has been) prepared to make radical
> changes, even at the expense of backwards compatibility and even where
> it knows there will be short term pain from users complaining about
> them, where Microsoft is more conservative.

That too. Good point.

--
Doug Ewell | Thornton, CO, US | ewellic.org




RE: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-29 Thread Doug Ewell via Unicode

> (b) it doesn't ship with Windows


Of course that is not a "luxury." Knowing that third-party options are 
available, let alone free and easily installed ones, is the luxury.


--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-29 Thread Doug Ewell via Unicode
Marcel Schneider wrote:

> Prior to this thread, I believed that the ratio of Windows users
> liking the US-International vs Mac users liking the US-Extended was
> like other “Windows implementation” vs “Apple implementation” ratios.

For many users, it may not be a question of what they like, but rather
(a) what they are aware of, (b) what comes standard with their Windows
installation, and (c) in the workplace, what their IT overlords have
granted them permission to use.

I use a modified version of John Cowan's "Moby Latin" layout on all my
machines:

http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

which allows me to type about 900 characters *in addition* to Basic
Latin, with 100% backward compatibility with U.S. English (i.e. none of
the apostrophe and quotation-mark shenanigans we are talking about). But
(a) I happen to know about Moby Latin, (b) it doesn't ship with Windows,
and (c) I am able to install it (and even modify it). Many users do not
have all or even any of these luxuries.

There is perhaps another factor: many Americans, who are probably the
majority users of US-International though not the only ones, simply do
not know or care about accents and other "foreign stuff." Even those who
know a language other than English often write it in ASCII, and see it
that way in marketing and other professionally created material. For
example, menus in Mexican restaurants often list "albondigas" and
"jalapenos."

The non-phonetic spelling of English may further encourage English-only
speakers to ignore the squiggles and dots that are necessary to indicate
correct pronunciation of other languages.

Given that, interest among potential users of US-International to find a
better solution is probably very low.

> If so many people like it, why was Windows Intl not updated, then?

1. I'd be surprised if there were "so many people," or much demand to
update it. Microsoft might have a few other items on their backlogs.

2. I don't speak for Microsoft, but there is often fear of making
changes to existing standards, even changes that fill in holes in the
standard. Users who type a formerly invalid sequence and get a valid
character, instead of the beep or question mark they once got, and
complain about the change, might seem to be a low-priority constituency,
but you'd be surprised.

> To like a particular layout does not mean to want to stick with it
> when anything better comes up. Userʼs choice is always respected.

See above regarding what users might like if only they had a choice.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-28 Thread Doug Ewell via Unicode

Mark Davis wrote:


> One addition: with the expansion of keyboards in
> http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html
> we are looking to expand the repository to not merely represent those,
> but to also serve as a resource that vendors can draw on.


Would you say, then, that Marcel's statements:

"Now that CLDR is sorting out how to improve keyboard layouts, hopefully 
something falls off to replace the *legacy* US-Intl."


and:

"We can only hope that now, CLDR is thoroughly re-engineering the way 
international or otherwise extended keyboards are mapped."


reflect the situation accurately?

Nothing in the PRI #367 blog post or background document communicated to 
me that CLDR was going to try to influence vendors to retire these 
keyboard layouts and replace them with those. I thought it was just 
about providing a richer CLDR format and syntax to better "support 
keyboard layouts from all major providers." Please point me to the part 
I missed.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-28 Thread Doug Ewell via Unicode

Marcel Schneider wrote:


> We can only hope that now, CLDR is thoroughly re-engineering the way
> international or otherwise extended keyboards are mapped.


I suspect you already know this and just misspoke, but CLDR doesn't 
prescribe any vendor's keyboard layouts. CLDR mappings reflect what 
vendors have released.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: 0027, 02BC, 2019, or a new character?

2018-01-25 Thread Doug Ewell via Unicode
Philippe Verdy wrote:

> I agree, and still you won't necessarily have to press a dead key to
> have these characters, if you map one key where the Cyrillic letter
> was > producing directly the character with its accent. [...]
>
> However, if you can type one key to produce one latin letter with its
> accent, I don't see why it could not use the caron instead of the
> acute above s and c, so that it is also immediately readable in other
> Eastern European languages. [...]

I think it is very likely the Kazakhs, like most people who are not
experts on computers or Unicode, did not consider the distinction
between the physical keyboard (hardware) and the driver that maps
keystrokes to characters (software). And they might consider replacing
software drivers nationwide to be as unfeasible as replacing physical
keyboards. Remember the government of Kazakhstan is probably not
composed of computer experts.

> As a bonus, banning the apostrophe from the alphabet will have be
> security improvement (thing about the many cases where ASCII
> apostrophes are used as string delimiters in various programming and
> markup languages

Another fact that they really did not seem to take into account. The
advisers and linguists might have considered this, but not the
decision-maker(s).

> the time of 7-bit ASCII is ended now since long, except in very old
> systems,

And on U.S. English keyboards. (It's true, as Sharma says, that they
didn't specify exactly what they meant by a "standard keyboard," but
they did banish all diacritical marks, so...)

> Even with UTF-8, these Latin letters with accents (from any ISO 8859-*
> subset) will be 2-byte wide, so exactly the same encoding size as
> basic letter+ASCII quote and the encoding size is definitely not an
> issue anywhere (all existing Kazakh Cyrillic letters are already using
> 2-byte encoding in UTF-8, as all their assigned code points values
> were higher than 0x7F but lower than 0x800) [...]
>
> Choosing the ASCII quote for this "apostrophe" will not save
> anything ; but the regular Unicode apostrophe U+2019 would need... 3
> bytes after the  1-byte basic Latin letter from ASCII (so it is
> worse !).

I did not see any evidence that this was something they ever considered
or cared about.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: 0027, 02BC, 2019, or a new character?

2018-01-25 Thread Doug Ewell via Unicode

Philippe Verdy wrote:


> So there will be a new administrative jargon in Kazakhstan that people
> won't like, and outside the government, they'll continue using their
> existing keyboards [...]
>
> Newspapers and books will continue for a while being published in
> Cyrillic [...]


Yes, it will be a mess. I think we can agree on that.


> Soon they will realize that this is not sustainable


And that, only that, is what will cause them to change it.

Shriramana Sharma wrote:


> Sir why this assumption that everyone here is "western"? I'm situated
> at an even more eastern longitude than Kazakhstan.


Most of the participants in this "apostrophe" thread appeared to be from 
North America and Western Europe; I think you're the only one who 
expanded that. I wasn't referring to the geographical or cultural makeup 
of the list as a whole.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: 0027, 02BC, 2019, or a new character?

2018-01-24 Thread Doug Ewell via Unicode
James Kass wrote:

> Heh. We are offering sound advice. If people fail to heed it, that's
> too bad.

We're offering excellent advice, very well informed. But the leadership
has made the decision that it has made. All the news stories say that
linguistic experts in Kazakhstan offered similar good advice, and were
disheartened to learn it was ignored completely.

Richard Wordingham wrote:

> Is it only in English then that typing an apostrophe key after a
> letter can't be relied UPON to yield U+0027 rather than U+2019?

Um, I always get U+0027 when I expect it.

Oh wait, you must be talking about AutoCorrect on Microsoft Word. Just
visit AutoCorrect Options and turn off that particular "replace as you
type" option, and be done with it.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




RE: 0027, 02BC, 2019, or a new character?

2018-01-23 Thread Doug Ewell via Unicode
Philippe Verdy wrote:

> The best they should have done is instead keeping their existing
> keyboard layout, continaing both the Cyrillic letters and Latin QWERTY
> printed on them, but operating in two modes (depending on OS
> preferences) to invert the two layouts but without changing the
> keystrokes. It would just have needed one Latin letter or modified
> Latin letter so that it was simply a 1 to 1 transliteration.

The objective apparently was to be able use a U.S. English keyboard
layout, AS IS, to type Kazakh-in-Latin. Adding new characters to the
layout would defeat this purpose.

Again, this may not be how you or I would solve the problem, and it may
not be how the Kazakhs would solve the problem if there were no
installed base (i.e. existing Latin-script keyboards with which
compatibility was desired).

As they say, the reason God was able to create the heavens and the earth
in only 6 days was that there was no installed base to worry about.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: 0027, 02BC, 2019, or a new character?

2018-01-23 Thread Doug Ewell via Unicode
I think it's so cute that some of us think we can advise Nazarbayev on
whether to use straight or curly apostrophes or accents or x's or
whatever. Like he would listen to a bunch of Western technocrats.

An explicitly stated goal of the new orthography was to enable typing
Kazakh on a "standard keyboard," meaning an English-language one.
Nazarbayev may ultimately be persuaded to embrace ASCII digraphs, which
also meet this goal, but this talk about U+2019 and U+02BC will make
exactly zero difference in Kazakh policy.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



SignWriting in U+40000 block

2018-01-22 Thread Doug Ewell via Unicode
The IETF is noting the progress of an updated draft:

Formal SignWriting
draft-slevinski-formal-signwriting-04
https://tools.ietf.org/html/draft-slevinski-formal-signwriting-04.html

which continues to describe an implementation of SignWriting in the
as-yet unassigned Plane 4, including a detailed breakdown of blocks for
different types of characters.

I know the struggle between Slevinski and Unicode is long and
contentious, with Slevinski arguing for years that the Unicode encoding
of SignWriting is useless because it doesn't encode position, and vowing
that no implementation (under his aegis) will ever use it.

Nevertheless, I wonder if it would be appropriate for Unicode or WG2, in
some capacity, to protest in some formal way against this recommendation
to arrogate an unassigned plane instead of using the PUA, which is the
correct place for unassigned characters.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10))

2018-01-15 Thread Doug Ewell via Unicode

On January 5, Mark Davis wrote:


> Doug, I modified my working draft, at
> https://docs.google.com/document/d/1EuNjbs0XrBwqlvCJxra44o3EVrwdBJUWsPf8Ec1fWKY
>
> If that looks ok, I'll submit.


Sorry for the delay. The text substitutions look fine.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10))

2018-01-02 Thread Doug Ewell via Unicode

Mark Davis wrote:


> BTW, relevant to this discussion is a proposal filed
> http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The
> date is wrong, should be 2017-12-22)


The phrase "emoji regex" had caused me to ignore this document, but I 
took a look based on this thread. It says "we still depend on the RGI 
test to filter the set of emoji sequences" and proposes that the EBNF in 
UTS #51 be simplified on the basis that only RGI sequences will pass the 
"possible emoji" test anyway.


Thus it is true, as some people have said (i.e. in L2/17‐382), that 
non-RGI sequences do not actually count as emoji, and therefore there is 
no way — not merely no "recommended" way — to represent the flags of 
entities such as Catalonia and Brittany.


In 2016 I had asked for the emoji tag sequence mechanism for flags to be 
available for all CLDR subdivisions, not just three, with the 
understanding that the vast majority would not be supported by vendor 
glyphs. It is unfortunate that, while the conciliatory name 
"recommended" was adopted for the three, the intent of "exclusively 
permitted" was retained.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Linearized tilde?

2017-12-30 Thread Doug Ewell via Unicode

Philippe Verdy wrote:


> Isn't it a rounded variant of Latin letter n ? Then it could exist
> also in uppercase form (like "n" and "N")


A defining characteristic of the 1982 African Reference Alphabet was 
that it was lowercase-only. An uppercase form would be an invention with 
no basis in history or usage.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Linearized tilde?

2017-12-30 Thread Doug Ewell via Unicode

David Starner wrote:


"The letter is not included in any current spelling and is not
included in Unicode." Should it be?


Did anyone ever use the 1982 alphabet, other than Mann and Dalby?

If not, I wonder if this letter is a bit like the "proposed new 
punctuation marks" that show up in proposals from time to time, but have 
never been used except by their inventors and to talk about them.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
J Decker wrote:

> I generally accepted any utf-8 encoding up to 31 bits though ( since
> I was going from the original spec, and not what was effective limit
> based on unicode codepoint space)

Hey, everybody: Don't do that.

UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
four bytes) for almost fourteen years now.
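A quick sanity check with any modern decoder (Python's shown here, as a
sketch) makes the point: the four-byte form for U+10FFFF is the ceiling,
and the old five- and six-byte lead bytes are simply rejected:

```python
# UTF-8 tops out at four bytes / U+10FFFF; old-style 5- and 6-byte lead
# bytes (F8..FD) are not valid UTF-8 at all.
top = "\U0010FFFF".encode("utf-8")
print(top.hex(" "), len(top))                 # f4 8f bf bf 4

five_byte = bytes([0xFB, 0x80, 0x80, 0x80, 0x80])
try:
    five_byte.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)              # invalid start byte
```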
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence. 
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence? 

1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.

2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.

In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
ambiguity.

A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
begin with.
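As a sketch of that last point, splitting only at sequence boundaries takes
a few lines: back up from the desired split point past any continuation
bytes (10xxxxxx) so that neither fragment cuts a character in half:

```python
# Split UTF-8 only at sequence boundaries, so each piece decodes on its own.
def split_utf8(data: bytes, limit: int):
    if limit >= len(data):
        return data, b""
    cut = limit
    while cut > 0 and (data[cut] & 0xC0) == 0x80:   # don't land on a trail byte
        cut -= 1
    return data[:cut], data[cut:]

text = "Pfützen in Θεσσαλονίκη".encode("utf-8")
head, tail = split_utf8(text, 10)
print(head.decode("utf-8"))       # decodes cleanly on its own
print((head + tail) == text)      # True: reassembly is lossless
```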
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: First bonafide use (≠ mention) of emoji by an academic publisher?

2017-07-23 Thread Doug Ewell via Unicode

Leonardo Boiko wrote:


> To my boundless, heartbreaking disappointment, these emojis are not
> U+1F4D8 BLUE BOOKs  from a custom @css font, but rather private-use
> U+F02Ds, which index a book glyph in some icon pack called Font
> Awesome <https://en.wikipedia.org/wiki/Font_Awesome>. At least they're
> inserted via CSS :before-selectors, which means they'll be
> automatically treated as decorations and seamlessly excluded from
> copy-paste operations.


We use Font Awesome for my project at work, for symbols embedded in text 
which have no reason and no need to be interchanged, converted to other 
character sets, or indexed in search engines.


Font Awesome also includes some symbols that, we think, won't ever be 
Unicode emoji, such as the Android, Apple, Bluetooth, and Windows logos.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Unicode education in UK Schools

2017-07-07 Thread Doug Ewell via Unicode
Asmus Freytag wrote:

> I've not (yet) located any assignments that try to address any of the
> "tricky" issues in the use of Unicode. 

That might be a good thing. Many introductory lessons or chapters or
talks about Unicode dive almost immediately into the complexities and
weirdnesses, much more so than with other technical topics. This scares
newbies and they walk away thinking every aspect of Unicode is complex
and weird.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Unicode education in the professional world

2017-07-07 Thread Doug Ewell via Unicode
Sort of along the lines of "education"...

I've been helping a colleague who is using the Oracle database and
trying to work through a customer's character conversion and mojibake
issues. I started suspecting the NLS_LANG variable and looked up some
references, and found the following alternative facts on the Oracle FAQ
and community pages:

> SQL> SELECT DUMP(col,1016)FROM table;
>
> Typ=1 Len=39 CharacterSet=UTF8: 227,131,143,227,131,170
>
> returns the value of a column consisting of 3 Japanese characters in
> UTF8 encoding . For example the 1st char is 227(*255)+131.

and:

> While UTF8 uses only 2 bytes to store data AL32UTF8 uses 2 or 4 bytes.

Unicode and UTF-8 have been around a long time by now. The fact that
there is still fake news like this out there, steering our less
Unicode-aware colleagues waaay down the wrong path, is disconcerting.
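For the record, a quick check of the byte values quoted above (only the
first six of the Len=39 appear there) tells the real story: each group of
three bytes is one three-byte UTF-8 sequence for one Katakana character,
so there is no "227(*255)+131" arithmetic and nothing two-byte-only about
UTF-8:

```python
# The DUMP output lists decimal byte values; 227,131,143 and 227,131,170
# are two complete 3-byte UTF-8 sequences, one Katakana character each.
data = bytes([227, 131, 143, 227, 131, 170])

text = data.decode("utf-8")
print(text)                                    # ハリ
print([f"U+{ord(c):04X}" for c in text])       # ['U+30CF', 'U+30EA']
print([len(c.encode("utf-8")) for c in text])  # [3, 3] -- three bytes per character
```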

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-03 Thread Doug Ewell via Unicode
a.lukyanov wrote:

> Is it possible to design fonts that will render ẞ as SS?
>
> So we could choose between ẞ and SS by just selecting the proper font,
> without changing the text itself.
>
> Or perhaps there will be a "font feature" to select this rendering
> within the same font.

I thought that was one of the main reasons we had Unicode: so we would
no longer have to rely on particular fonts, or magic font behavior, to
get character identities we expected and could interchange reliably.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: 10.0 Code Charts

2017-06-22 Thread Doug Ewell via Unicode
Michael Bear wrote:

> When are the code charts (http://www.unicode.org/charts/) going to be
> updated for Unicode 10.0? 

They look fine to me.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Looking for 8-bit computer designers

2017-06-14 Thread Doug Ewell via Unicode
Philippe Verdy wrote:

> These old platforms still have their fans which are easily found on
> socail networks. [...]

We know this. That's why a group of us is working on a proposal to add
missing characters from these platforms.

Some of the platforms have really obscure and hard-to-decipher
characters, and we were looking for insight from the original folks who
worked on them. We have no shortage of present-day expertise.

--
Doug Ewell | Thornton, CO, US | ewellic.org




RE: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-05 Thread Doug Ewell via Unicode
Martin J. Dürst wrote:

> Assuming (conservatively) that it will take about a century to fill up
> all 17 (well, actually 15, because two are private) planes, this would
> give us another century.

Current estimates seem to indicate that 800 years is closer to the mark.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Encoding of character for new Japanese era name after Heisei

2017-06-02 Thread Doug Ewell via Unicode
> Anyway, since emperor Akihito (明仁), the era starting in 1989 is no
> longer named after the emperor, but is Heisei (平成) "Peace everywhere".
> This already occured in the past on the Ningo system. There's no
> absolute requirement to change the era name even if there's a new
> Emperor named.

The Wikipedia article is instructive here (sorry, the French version
doesn't seem to have the same information):

https://en.wikipedia.org/wiki/Japanese_era_name#Neng.C5.8D_in_modern_Japan

Since 1868 Japan has adhered to a system of "one reign, one era name"
(一世一元). The era name is determined upon accession of the emperor
and is unrelated to his birth name.

The emperor continues to be known by his birth name until his death, at
which point he becomes known by the name of his era instead (so Emperor
Hirohito became Emperor Shōwa upon his death in 1989).

There are no indications that the abdication of an emperor, as opposed
to his death, would cause this system to be suspended.

Unicode does not have an extensive history of encoding "placeholder"
characters without knowing what they will actually be. This is probably
a Good Thing.

The four existing characters at U+337x are square compatibility
characters, with decompositions to unified ideographs. So, whatever era
name is chosen for the new emperor (probably Crown Prince Naruhito),
there is a near-guarantee that it will be immediately representable in
Unicode using normal ideographs.
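A quick check with Python's unicodedata (which just reflects the UCD) shows
those four characters and their compatibility decompositions:

```python
# The four square era-name characters decompose to ordinary ideographs.
import unicodedata

for cp in (0x337B, 0x337C, 0x337D, 0x337E):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> {unicodedata.normalize('NFKC', ch)}")
# U+337B SQUARE ERA NAME HEISEI -> 平成
# U+337C SQUARE ERA NAME SYOUWA -> 昭和
# U+337D SQUARE ERA NAME TAISYOU -> 大正
# U+337E SQUARE ERA NAME MEIZI -> 明治
```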

A new square compatibility character, if necessary, can be encoded after
the era name is chosen. It might be fast-tracked at that time, as the
Euro sign was, but there is no emergency about this and no reason to
invent any new encoding procedures or waive any existing ones.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

> even supporting 6-byte patterns just in case 20.1 bits eventually turn
> out not to be enough,

Oh, gosh, here we go with this.

What will we do if 31 bits turn out not to be enough?
 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode
Henri Sivonen wrote:

> If anything, I hope this thread results in the establishment of a
> requirement for proposals to come with proper research about what
> multiple prominent implementations do about the subject matter of a
> proposal concerning changes to text about implementation behavior.

Considering that several folks have objected that the U+FFFD
recommendation is perceived as having the weight of a requirement, I
think adding Henri's good advice above as a "requirement" seems
heavy-handed. Who will judge how much research qualifies as "proper"?
Who will determine that the judge doesn't have a conflict?

An alternative would be to require that proposals, once received with
whatever amount of research, are augmented with any necessary additional
research *before* being approved. The identity or reputation of the
requester should be irrelevant to approval.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
That's not at all the same as saying it was a valid sequence. That's saying 
decoders were allowed to be lenient with invalid sequences.
We're supposed to be comfortable with standards language here. Do we really not 
understand this distinction?
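To spell out the distinction with the classic example: <C0 AF> is the
overlong two-byte form of U+002F, and lenient decoders that quietly
accepted it were a well-known security hole. A conformant decoder today
rejects it outright; how many U+FFFDs it should emit in its place is
exactly what the proposal is arguing about. A minimal sketch:

```python
# <C0 AF> is the overlong two-byte form of U+002F "/". Lenient decoders that
# accepted it were a classic path-traversal hole; a conformant decoder
# rejects it rather than treating it as a valid encoding of the slash.
overlong_slash = b"\xc0\xaf"

try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)                          # invalid start byte

print(overlong_slash.decode("utf-8", errors="replace"))   # two U+FFFDs, not "/"
```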


--
Doug Ewell | Thornton, CO, US | ewellic.org

Original message
From: Karl Williamson <pub...@khwilliamson.com>
Date: 5/30/17 16:32 (GMT-07:00)
To: Doug Ewell <d...@ewellic.org>, Unicode Mailing List <unicode@unicode.org>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:
> L2/17-168 says:
> 
> "For UTF-8, recommend evaluating maximal subsequences based on the
> original structural definition of UTF-8, without ever restricting trail
> bytes to less than 80..BF. For example:  is a single maximal
> subsequence because C0 was originally a lead byte for two-byte
> sequences."
> 
> When was it ever true that C0 was a valid lead byte? And what does that
> have to do with (not) restricting trail bytes?

Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
<C0 AF> as U+002F.



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
L2/17-168 says:

"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF. For example:  is a single maximal
subsequence because C0 was originally a lead byte for two-byte
sequences."

When was it ever true that C0 was a valid lead byte? And what does that
have to do with (not) restricting trail bytes?
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Looking for 8-bit computer designers

2017-05-30 Thread Doug Ewell via Unicode
Not as OT as it might seem:

If there are any engineers or designers on this list who worked on 8-bit
and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy,
etc.), and especially on character set design for these machines, please
contact me privately at . Any desired degree of
anonymity and confidentiality will be honored.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



  1   2   3   4   5   6   7   8   9   10   >