Re: Encoding italic

2019-02-12 Thread Kent Karlsson via Unicode


Oh, the crystal ball is pure solid state, no moving or hot parts.
A magic 8-ball on the other hand can easily get jammed...

(Now, enough of that...)

/K


On 2019-02-12 02:57, "James Kass via Unicode" wrote:

> 
> On 2019-02-11 6:42 PM, Kent Karlsson wrote:
> 
>> Using a VS to get italics, or anything like that approach, will
>> NEVER be a part of Unicode!
> 
> Maybe the crystal ball is jammed.  This can happen, especially on the
> older models which use vacuum tubes.
> 
> Wanting a second opinion, I asked the magic 8 ball:
> "Will VS14 italic be part of Unicode?"
> The answer was:
> "It is decidedly so."
> 





Re: Encoding colour (from Re: Encoding italic)

2019-02-12 Thread Kent Karlsson via Unicode


On 2019-02-12 03:20, "Mark E. Shoulson via Unicode" wrote:

> On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:
>> Continuing to look deep into the crystal ball, doing some more
>> hand swirls...
>> 
>> ...
>> 
>> ...
>> 
>> The scheme quoted (far) below (from wjgo_10009), or anything like it,
>> will NEVER be part of Unicode!
> 
> Not in Unicode, but I have to say I'm intrigued by the idea of writing
> HTML with tag characters (not even necessarily "restricted" HTML: the
> whole deal).  This does NOT make it possible to write "italics in plain
> text," since you aren't writing plain text.  But what you can do is
> write rich text (HTML) that Just So Happens to look like plain text when
> rendered with a plain-text-renderer (and maybe there could be
> plain-text-renderers that straddle the line, maybe supporting some
> limited subset of HTML and doing boldface and italics or something).

And so would ESC/command sequences as such, if properly skipped for display.
If some are interpreted, those would affect the display of other characters.
Just like "HTML in tag characters" would. A show invisibles mode would
display both ESC/command sequences as well as "HTML in tag characters"
characters.

> BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
> be purely the decision of the committee in charge of HTML/XML and
> related standards, to decide to accept Unicode tag characters as if they
> were ASCII for the purposes of writing XML tags/attributes &c.  It's

I have no say on HTML/CSS, but I would venture to predict that those
who do have a say, would not be keen on that idea. And XML tags in
general need not be in ASCII. And... identifiers in CSS need not
be in pure ASCII either... And attribute values, like filenames
including those that refer to CSS files (CSS is preferably stored
separately from the HTML/XML), certainly need not be pure ASCII.

So, no, I'd say that that idea is completely dead.

/Kent K


> totally nothing to do with Unicode, unless the XML folks want Unicode to
> change some properties on the tag chars or something.  I think it's a...
> fascinating idea, and probably has *disastrous* consequences lurking
> that I haven't tried to think of yet, but it's not a Unicode idea.
> 
> ~mark
> 





Re: Encoding colour (from Re: Encoding italic)

2019-02-11 Thread Kent Karlsson via Unicode


Continuing to look deep into the crystal ball, doing some more
hand swirls...

...

...

The scheme quoted (far) below (from wjgo_10009), or anything like it,
will NEVER be part of Unicode!


---

But I do like colour (and bold and italic) also for otherwise "plain"
text. And having those stylings represented in a lightweight manner,
in many cases. Not needing heavy-lifting with (say) HTML+CSS. More on
that further below.

As we have noted already on this thread, we already have a standard
for specifying background and foreground (the glyphs for the text)
colour. As ESC (command) sequences. It even has (non-standard) "room"
for an alpha channel (after the 6th ':', a parameter position otherwise
unused for RGB; it is used for K of CMYK in the ITU T.416 standard).

Colour, RGB, with alpha channel T (0: opaque, 255: fully transparent;
this way around since 0 is the default default value in these things),
can be given with the detailed syntax below (it matches the overall
syntax, so there is no overall syntax error for the detailed syntax).
The brackets, except the single first one, indicate optional; strictly
speaking everything after the "2" here is incrementally optional, but
that is a nit; the i and the "a:s" are intended for different kinds
of colour adjustments (at least the "i" one being implementation defined).
But those are a bit too detailed to pick up here.

The lowercase variables, not the final m, are here to be replaced by digits
representing values 0 to 255. A syntax error would result in the
command sequence being ignored. (If too long, longer than 35(?) chars,
the printable characters would be displayed, no interpretation as a
command sequence.) The 2 means RGB (and, here, T) colour specification.

Foreground colour: ESC [38:2:i:r:g:b[:t[:a:s]]m
Background colour: ESC [48:2:i:r:g:b[:t[:a:s]]m

E.g. ESC [38:2:0:70:100:200:100m  for a slightly transparent bluish
foreground colour. Separator is (must be) colon, so as not to interfere
with the permitted (but I would not recommend it) multiple style
settings in a single SGR command sequence, using semicolon separator.
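The colour syntax above can be sketched in a few lines of Python (the function name and defaults are mine, not from any standard; the trailing t parameter is the non-standard transparency extension discussed above):

```python
ESC = "\x1b"

def sgr_fg_rgb(r, g, b, t=0, i=0):
    """Foreground colour: ESC [38:2:i:r:g:b:t m, colon-separated.

    i is the implementation-defined colour-space id; t (transparency,
    0 = opaque) is the non-standard extension described in the text.
    """
    return f"{ESC}[38:2:{i}:{r}:{g}:{b}:{t}m"

# The example from the text: a slightly transparent bluish foreground.
print(repr(sgr_fg_rgb(70, 100, 200, t=100)))
```

Using 48 instead of 38 would give the background-colour variant in exactly the same way.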

---

Now, colour for plain text? Well, lots of people are editing coloured
plain text daily! Any decent modern IDE does automatic syntax colouring
(and bold and italic). And that for program source text, which certainly
does not have any HTML/CSS or any other higher-level (formatting)
protocol applied to them. Ok, the colouring/bold/italic is entirely
internal. It is not saved in the files in any way, it is derived. But
it would be nice to sometimes keep the syntax colouring, when quoting
a piece of program source code (from an IDE) into a chat conversation,
for instance. Or pasting a piece of source code into a presentation
slide or a document (in these cases any light-weight colouring/style
would need to be converted to whatever representation is used for
such things in those document formats, something more "heavy-weight").

And keep the formatting/colour in a light-weight manner, when
copying/cutting (ctrl-c/ctrl-x) text from an IDE. One that is
also easy to strip away (if pasting a perhaps modified version of it
into a source file (via an IDE)). The "heavy-weight" ones are harder to
strip away, and might not even be supported on the target platform.

ESC/command sequences are easy to strip away, due to the starting
control character and well-defined overall syntax, even though it
is only the start character that is (otherwise) non-printable in
the sequence. They were designed for being easy to parse out! And they
are already standardised! Platform independently. And light-weight.
Granted, they are, for now, only popular to implement in terminal
emulators. But the styling command sequences are NOT specifically
made for terminal (emulators).
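That ease of stripping can be shown with a short Python sketch (the regex follows the overall syntax quoted elsewhere in this thread; the CSI alternative is tried first so that "ESC [" is not swallowed by the plain-ESC alternative, whose final bytes 0x30-0x7E include '['):

```python
import re

# CSI sequences (ESC [ or U+009B, parameter bytes, intermediates, final),
# then the shorter plain-ESC escape sequences.
ANSI_SEQ = re.compile(
    r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"  # CSI ... final
    r"|\x1b[\x20-\x2f]*[\x30-\x7e]"                        # other ESC seqs
)

def strip_style(text):
    """Remove escape/command sequences, keeping the printable text."""
    return ANSI_SEQ.sub("", text)
```

No OSC/APC string handling is attempted here, so this is a sketch of the styling case only, not a full ECMA-48 scanner.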

If you worry about actual ESC characters in source code (strings),
those should be written as \e, or other more general escape sequence
(a completely different, though somewhat related, sense of the term
"escape sequence"), like \u001B. It is a REALLY bad idea to have
a real escape character (U+001B) in a source code string literal.

(Nit: The "predefined" colours in ECMA-48 are not useful for this.
They are too stark. The IDEs (by default) use milder colours.)

If you think that using styling on program source text is a new-fangled
idea that came with the IDEs: No, it started already in the sixties.
Algol-60 source text, when printed in books, had the keywords written
in bold. For the *actual* programs, IIRC (at least for some compiler),
one had to mark the keywords with underscore: _BEGIN_, _IF_, ...
(No lowercase in computers then...) The keywords were initially
not reserved, so one had to mark them. And... often stored as punched
cards or punched paper tape...

While possible, I do NOT propose to use command sequences to mark
keywords (etc.) as bold (or colour) when input to a compiler.
NOR do I propose to encode characters for punched hole patterns...
(Have to draw the

Re: Encoding italic

2019-02-11 Thread Kent Karlsson via Unicode


On 2019-02-11 10:55, "wjgo_10...@btinternet.com via Unicode" wrote:

> Doug Ewell wrote:
> 
>> …, just as next to nobody is using the proposed VS14 mechanism …
> 
> Well, of course not because use of VS14 in a plain text document to
> record a request for an italic glyph version is not at the present time
> an official part of Unicode.

Looking deeply into the crystal ball, swirling my hands over it...

...

...

Using a VS to get italics, or anything like that approach, will
NEVER be a part of Unicode!

/Kent K





Re: Encoding italic

2019-02-10 Thread Kent Karlsson via Unicode




On 2019-02-10 16:31, "James Kass via Unicode" wrote:

> 
> Philippe Verdy wrote,
> 
>>> ...[one font file having both italic and roman]...

For OpenType fonts, there is a "design axis" called "ital". Value 0 on that
axis would be roman (upright, normally), and value 1 on that axis would be
italic. I don't know to what extent that is available in OpenType fonts in
common use... (Instead of using two separate font files.)

[math chars]
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX. 

They were basically requested "by" STIX, yes. Not sure about the
round-tripping bit.

> They remain Latin letter form
> variants.  If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era.  Arguing that they must not be
> used other than for scientific purposes

I don't think that particular argument was made, IIUC.

> is just so much semantic
> quibbling in order to justify their encoding.
> 
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or
> 𝕌+𝟚𝟘𝟞𝟘? 

That particular example would be ok (even though outside of a
conventional math formula). But we were talking about natural
languages in their conventional orthography, using italics/bold.

/Kent K





Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode

On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:

> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:

Note that these do NOT nest (no stack...), just state changes for the
relevant PART of the "graphic" (i.e. style) state. So the approach in
that regard is quite different from the approach done in HTML/CSS.

> • Italics on: ESC [3m
> • Italics off: ESC [23m
> • Bold on: ESC [1m
> • Bold off: ESC [22m
> • Underline on: ESC [4m
(implies turning double underline off)

   Underline, double: ESC [21m
(implies turning single underline off)

> • Underline off: ESC [24m
> • Strikethrough on: ESC [9m
> • Strikethrough off: ESC [29m
> • Reverse on: ESC [7m
> • Reverse off: ESC [27m

"Reverse" = "switch background and foreground colours".

This is an (odd) colour thing. If you want to go with (full!) colour
(foreground and background), fine, but the "reverse" is oddball (and
based on what really old terminals were limited to when it comes to colour).

I'd rather include 'ESC [50m' (not variable spacing, i.e. "monospace" font)
and 'ESC [26m' (variable spacing, i.e. "proportional" font). Recall that
this is NOT for terminal emulators but for styling applied to text
outside of terminal emulators. (Terminal emulators already implement
much of this and more; albeit sometimes wrongly). This would be handy
for including (say) programming code or computer commands (or for that
matter, "ASCII art", or more generally "Unicode art") in otherwise
"ordinary" text... (The "ordinary" text preferably set in a proportional
font.)

> • Reset all attributes: ESC [m

(Actually 'ESC [0m', with the 0 default-able.) Handy, agreed, but not
100% necessary. These ESC-sequences should not normally be inserted
"manually" but by a text editor program, using the conventional means of
"making bold" etc. (ctrl-b, cmd-b, "bold" in a menu); only "hackers"
(in the positive sense) would actually bother about the command
sequences as such.
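The non-nesting, stateful model can be illustrated with a minimal sketch (names are mine; the codes are those listed above):

```python
ESC = "\x1b"

# Each SGR code flips one aspect of the graphic state; there is no
# stack and no nesting, just on/off toggles per attribute.
SGR = {
    "bold_on": 1, "bold_off": 22,
    "italic_on": 3, "italic_off": 23,
    "underline_on": 4, "underline_off": 24,
    "reset": 0,
}

def sgr(name):
    return f"{ESC}[{SGR[name]}m"

# Overlapping (not nested) ranges are perfectly legal in this model:
sample = (sgr("italic_on") + "italic " + sgr("bold_on") + "bold italic"
          + sgr("italic_off") + " bold only" + sgr("bold_off"))
```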

/Kent K


> where ESC is U+001B.
>  
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>  
> 



Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode


On 2019-02-08 22:29, "Egmont Koblinger via Unicode" wrote:

> (Mind you, I don't find it a good idea to add italic and whatnot
> formatting support to Unicode at all... but let's put aside that now.)

I don't think Doug meant to "add it to the Unicode standard", just to
have a summary of "handy esc-sequences (actually command-sequences)
for simple styling of text" picked from long-standing (text level...)
standards.

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.
> 
> There is not a well-defined framework for escape sequences. In this
> particular case you might say it starts with ESC [ and ends with the
> letter 'm', but how do you know where to end the sequence if that
> letter 'm' just doesn't arrive? Terminal emulators have extremely

There is an overriding "basic (overall) syntax" for esc-seq/
command-sequences that do not include a string argument (like OSC,
APC, ...). IIUC it is (originally as byte sequences, but here as
character sequences):

\u001B[\u0020-\u002F]*[\u0030-\u007E] |
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

(no newline or carriage return in there). True, that has no direct
limit, but it would not be unreasonable to set a limit of (say)
max 30 characters. Potential (i.e. starting with ESC) esc-"sequences"
that do not match the overall syntax or are too long can simply be
rendered as is (except for the ESC itself). The esc/command sequences
(that match) but are not interpreted should be ignored in "normal"
(not "show invisibles" mode) display.
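A hypothetical parser step, assuming the suggested 30-character cap (my choice of number within the text's suggestion, not part of any standard), might classify candidate sequences like this:

```python
import re

# Overall CSI syntax from the discussion above: CSI introducer,
# parameter bytes, intermediate bytes, final byte.
CSI_RE = re.compile(r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]")

def classify(candidate, max_len=30):
    """'ignore' (hide in normal display) if it matches the overall
    syntax and fits the length cap; 'render' the printable characters
    as-is otherwise."""
    if len(candidate) <= max_len and CSI_RE.fullmatch(candidate):
        return "ignore"
    return "render"
```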

They are unlikely to be "default ignored" by such things as sorting
(and should preferably be filtered out beforehand, if possible). But
if we compare to other rich text editors, the command sequences should
be ignored by (interactive) searching, just like HTML tags are ignored
in interactive searching (the internal representation "skipping" the
HTML tags in one way or another). HTML tags should also (when the text
is known to be HTML) be filtered out before doing such things as sorting.

> complex tables for parsing (and still many of them get plenty of
> things wrong). It's unreasonable for any random small utility
> processing Unicode text to go into this business of recognizing all
> the well-known escape sequences, not even to the extent to know where
> they end. Whatever is designed should be much more easily parseable.
> Should you say "everything from ESC[ to m", you'll cause a whole bunch
> of problems when a different kind of escape sequence gets interpreted
> as Unicode.

The escape/command sequences would not be part of Unicode (standard).

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each. Also, it should be

Formally covered by the (non-Unicode) standards, but optional (IIUC).

> carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[
> opening for an escape sequence – here terminal emulators vary. These
> just make everything even more cumbersome.
> 
> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".

I think one should interpret these in a "modern" way, not looking
too much at what old terminals were limited to. (Colour ("increased
intensity") should be handled completely separately from bold.)

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette? Should Unicode go into the business of defining a fixed set
> of colors, or allow to alter the palette colors using the OSC 4 and
> friends escape sequences which supported by about half of the terminal
> emulators out there?

IF extending to colour, only refer to "true colour" (RGB) command-sequence.
The colour palette versions are for the limitations of (semi-)old terminals.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.

It can only be colon. Using semicolon would interfere with the syntax
for multiple style specifications in one command sequence. (I by mistake
wrote a semicolon there in an earlier post; sorry.)

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them? Where to draw the line what

(Note colon, not semicolon, as separator.) Possible, partially matching
the capabilities for underlining via CSS (solid, dotted, dashed, wavy,
double). Depends on how much styling options one wants to pick up.

> to add to Unicode and what not to? Will Unicode possibly be a

I don't think anyone wants to make this part of the Unicode standard.
(At most a Unicode technical note...; from Unicode's point of view.)

[...] 
> What to do with things that Unicode might also want to have, but
> doesn't exist in terminal emulators due to their nature, such as
> switching t

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode


On 2019-02-02 16:12, "Richard Wordingham via Unicode" wrote:

> On Sat, 02 Feb 2019 14:01:46 +0100
> Kent Karlsson via Unicode  wrote:
> 
>> Well, I guess you may need to put some (practical) limit to the number
>> of non-spacing marks (like max two above + max one below; overstrikes
>> are an edge case). Otherwise one may need to either increase the line
>> height (bad idea for a terminal emulator I think) or the marks start
>> to visually interfere with text on other lines (even with the hinted
>> limits there may be some interference), also a bad idea for a terminal
>> emulator. So I'm not so sure that non-spacing marks is a piece of
>> cake... (I.e., need to limit them.)
> 
> Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the
> lamedh?  The depth then is the maximum depth, not the sum of the
> depths. 

Do you want to view/edit such texts on a terminal emulator? (Rather
than a GUI window.)
 
> Tai Lue has 'mai sat 3 lem' - that's three marks above for a
> combination common enough to have a name.  Throw in the repetition mark
> and that's four marks above if you treat the subscript consonant as a
> mark (or code it to comply with the USE's erroneous grammar).

I don't question that as such. But again, do you want to view/edit such
texts on a **terminal emulator**?

It is just that such things are likely to graphically overflow the
"cell" boundaries, unless the cells are disproportionately high (i.e.
double or so line spacing). Doesn't really sound like a terminal
emulator... I do not think terminal emulators should be used for
ALL kinds of text.

/Kent K




Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode


On 2019-02-02 12:17, "Egmont Koblinger" wrote:

> the font. It's taken from EastAsianWidth (or other means, which we're
> working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9

Yes, that too:
FE0F ? VARIATION SELECTOR-16 = emoji variation selector

But the issue you refer to only deals with U+FE0F. There is also U+FE0E:
FE0E ? VARIATION SELECTOR-15 = text variation selector
which can make a character that is "default emoji" (which are wide)
into "text variant", often single-width, for instance:
1F315 FE0E ; text style;  # (6.0) FULL MOON SYMBOL
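In code, the two variation sequences are simply two-character strings (a sketch; the rendering effect depends entirely on the display stack, not on Python):

```python
# U+1F315 FULL MOON SYMBOL has emoji presentation by default.
# Appending VS15 (U+FE0E) requests the text (often single-width)
# presentation; VS16 (U+FE0F) requests the emoji presentation.
FULL_MOON = "\U0001F315"

text_style = FULL_MOON + "\uFE0E"   # text variation sequence
emoji_style = FULL_MOON + "\uFE0F"  # emoji variation sequence
```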

---

>> Likewise non-spacing combining characters should
>> be possible to deal reasonably with.
> 
> Most terminal emulators handle non-spacing combining marks, it's a
> piece of cake. (Spacing marks are more problematic.)

Well, I guess you may need to put some (practical) limit to the number
of non-spacing marks (like max two above + max one below; overstrikes
are an edge case). Otherwise one may need to either increase the line
height (bad idea for a terminal emulator I think) or the marks start
to visually interfere with text on other lines (even with the hinted
limits there may be some interference), also a bad idea for a terminal
emulator. So I'm not so sure that non-spacing marks is a piece of cake...
(I.e., need to limit them.)

/Kent K




Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Kent Karlsson via Unicode


On 2019-02-01 19:57, "Richard Wordingham via Unicode" wrote:

> On Fri, 1 Feb 2019 13:02:45 +0200
> Khaled Hosny via Unicode  wrote:
> 
>> On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via
>> Unicode wrote:
>>> On Thu, 31 Jan 2019 12:46:48 +0100
>>> Egmont Koblinger  wrote:
>>> 
>>> No.  How many cells do CJK ideographs occupy?  We've had a strong
>>> hint that a medial BEH should occupy one cell, while an isolated
>>> BEH should occupy two.
>> 
>> Monospaced Arabic fonts (there are not that many of them) are designed
>> so that all forms occupy just one cell (most even including the
>> mandatory lam-alef ligatures), unlike CJK fonts.
>> 
>> I can imagine the terminal restricting itself to monspaced fonts,
>> disable "liga" feature just in case, and expect the font to well
>> behave. Any other magic is likely to fail.
> 
> Of course, strictly speaking, a monospaced font cannot support harakat
> as Egmont has proposed.
> 
> Richard.

(harakat: non-spacing vowel mark in Arabic)

"Monospaced font" is really a concept with qualifications. Even for
"plain old ASCII" there are two advance widths, not just one: 0 for
control characters (and escape/control sequences, neither of which
should directly consult the font; even such things as OSC sequences,
but the latter are a bad idea to have in any line one might wish to
edit (vi/emacs/...) via a terminal emulator window). But terminals
(read terminal emulators) can deal with mixed single width and double
width characters (which is, IIUC, the motivation for the datafile
EastAsianWidth.txt). Likewise non-spacing combining characters should
be possible to deal reasonably with.

It is a lot more difficult to deal with BiDi in a terminal emulator,
also shaping may be hard to do, as well as reordering (or even
splitting) combining characters. All sorts of problems arise; feeding
the emulator a character (or "short" strings) at a time not allowed
to buffer for display (causing reshaping or movement of already
displayed characters, edit position movement even within a single
line, etc.). Even if solvable for a "GUI" text editor (not via a
terminal), they do not seem to be workable in a terminal (emulator)
setting. Esp. not if one also wants to support multiline editing
(vi/emacs/...) or even single-line editing.

As long as editing is limited to a single line (such as via the system
line editor, or an "enhanced functionality" line editor like that used
for bash; moving in the history sets the edit position at EOL), even
variable width ("proportional") fonts should not pose a major problem.
But for multiline editors (à la vi/emacs) it would not be possible to
synch nicely (unless one accepts strange jumps)
the visual edit position and the actual edit position in the edit
buffer: The program would not have access to the advance width data
from the font that the terminal emulator uses, unless one
revolutionise what terminal emulators do... (And I don't see a
case for doing that.) But both a terminal emulator and multiline
editing programs (for terminal emulators) still can have access
to EastAsianWidth data as well as which characters are non-spacing;
those are not font dependent. (There might be some glitches if
the Unicode versions used do not match (the terminal emulator
and the program being run are most often on different systems),
but only for characters where these properties have changed,
e.g. newly allocated non-spacing marks.)

/Kent K

PS
No, I have not done extensive testing of various terminal emulators
on how well they handle the stuff above.





Re: Encoding italic

2019-01-30 Thread Kent Karlsson via Unicode
I did say "multiple" and "for instance". But since you ask:

ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
is implemented in Cygwin (sorry for mentioning a product name).)
(The "named" ones, though very popular in terminal emulators, are
all much too stark, I think, and the exact colour for them are
implementation defined.)

ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
traditionally does not use bold or italic. Compare those specified for CSS
(https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
be of interest for the generalised subject of this thread.

There are some other differences as well, but those are the major ones
with regard to text styling. (I don't know those standards to a tee.
I've just looked at the "m" control sequences for text styling. And yes,
I looked at the free copies...)

/Kent Karlsson

PS
If people insist on that EACH character in "plain text" italic/bold/etc
"controls" be default ignorable: one could just take the control sequences
as specified, but map the printable characters part to the corresponding
tag characters... Not that I think that that is really necessary.
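That mapping in the PS is mechanical, since the tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E; a sketch (the function name is mine, and the leading ESC is left untouched here):

```python
# Map the printable part of a control sequence onto the corresponding
# tag characters (U+E0020..U+E007E), which are default-ignorable,
# simply by adding 0xE0000 to each ASCII code point.
def to_tag_chars(s):
    return "".join(
        chr(ord(c) + 0xE0000) if 0x20 <= ord(c) <= 0x7E else c
        for c in s
    )
```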


On 2019-01-30 22:24, "Doug Ewell via Unicode" wrote:

> Kent Karlsson wrote:
>  
>> Yes, great. But as I've said, we've ALREADY got a
>> default-ignorable-in-display (if implemented right)
>> way of doing such things.
>> 
>> And not only do we already have one, but it is also
>> standardised in multiple standards from different
>> standards institutions. See for instance "ISO/IEC 8613-6,
>> Information technology --- Open Document Architecture (ODA)
>> and Interchange Format: Character content architecture".
>  
> I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
> has the advantage of not costing me USD 179, and it looks very similar
> to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
> are talking about: setting text display properties such as bold and
> italics by means of escape sequences.
>  
> Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
> doing, and if it does not, why we should not simply refer to the more
> familiar 6429?
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 




Re: Encoding italic

2019-01-29 Thread Kent Karlsson via Unicode
Yes, great. But as I've said, we've ALREADY got a
default-ignorable-in-display (if implemented right)
way of doing such things.

And not only do we already have one, but it is also
standardised in multiple standards from different
standards institutions. See for instance "ISO/IEC 8613-6,
Information technology --- Open Document Architecture (ODA)
and Interchange Format: Character content architecture".
(In a little experiment I found that it seems that
Cygwin is one of the better implementations of this;
B.t.w. I have no relation to Cygwin other than using it.)

To boot, it's been around for decades and is still
alive and well. I see absolutely no need for a "bold"
new concept here; the one below is not better in any
significant way.

/Kent Karlsson


On 2019-01-29 23:35, "Andrew West via Unicode" wrote:

> On Mon, 28 Jan 2019 at 01:55, James Kass via Unicode
>  wrote:
>> 
>> This 󠀼󠁢󠀾bold󠀼󠀯󠁢󠀾 new concept was not mine.  When I tested it
>> here, I was using the tag encoding recommended by the developer.
> 
> Congratulations James, you've successfully interchanged tag-styled
> plain text over the internet with no adverse side effects. I copied
> your email into BabelPad and your "bold" is shown bold (see attached
> screenshot).
> 
> Andrew





Re: Encoding italic

2019-01-28 Thread Kent Karlsson via Unicode


On 2019-01-28 02:53, "James Kass via Unicode" wrote:

> plain-text and are uncomfortable using the math alphanumerics for this,
> although the math alphanumerics seem well qualified for the purpose. 

It "works" basically only for English (note that any diacritics would be
placed suitably for math, not for words; then there are Latin letters
that do not have a decomposition (like ø), and then there is of course
Cyrillic, and a whole slew of non-Latin scripts). So, no, they do NOT AT
ALL "seem well qualified". And... We already have a well-established
standard for doing this kind of things...

/Kent K





Re: Encoding italic

2019-01-27 Thread Kent Karlsson via Unicode
Apart from that control sequences for (some) styling is standardised
(since decades by now), and the "tag characters" approach is not:

For the control sequences for styling, there is no pretence of nesting,
just setting/unsetting an aspect of styling. For  etc. (in tag
characters) there is at least the pretence/appearance of nesting, even
if the interpreter doesn't actually care about nesting (and just interprets
them as set/unset). (In addition,  etc. in "real" HTML are
1) disrecommended, and
2) the actual styling comes from a style sheet (and the **default**
one makes  stuff bold).)

/Kent K


On 2019-01-27 21:03, "James Kass via Unicode" wrote:

> 
> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text
> using the tag characters method described earlier in this thread.  This
> enhancement is described in the release notes linked on this download page:
> 
> http://www.babelstone.co.uk/Software/index.html
> 





Re: Encoding italic (was: A last missing link)

2019-01-24 Thread Kent Karlsson via Unicode


On 2019-01-24 03:21, "Mark E. Shoulson via Unicode" wrote:

> On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote:
>> Ok. One thing to note is that escape sequences (including control sequences,
>> for those who care to distinguish those) probably should be "default
>> ignorable" for display. Requiring, or even recommending, them to be default
>> ignorable for other processing (like sorting, searching, and other things)
>> may be a tall order. So, for display, (maximal) substrings that match:
>> 
>> \u001B[\u0020-\u002F]*[\u0030-\u007E] |
>> (\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]
>> 
>> should be default ignorable (i.e. invisible, but a "show invisibles" mode
>> would show them; not interpreted ones should be kept, even if interpreted
>> ones need not, just (re)generated on save). That is as far as Unicode
>> should go.
> 
> So it isn't just "these characters should be default ignorable", but
> "this regular expression is default ignorable."  This gets back to
> "things that span more than a character" again, only this time the
> "span" isn't the text being styled, it's the annotation to style it. 

True. That is how ECMA/ISO/ANSI escape/control-sequences are designed.
Had they not already been designed, and implemented, but we were to do
a design today, it would surely be done differently; e.g. having
"controls" that consisted only of (individually) "default-ignorable"
characters.

But, and this is the important thing here:

a) The current esc/control-sequences is an accepted standard,
since long.

b) This standard is still in very much active use, albeit mostly
by terminal emulators. But the styling stuff need not at all
be limited to terminal emulators.

Since it is an actively and widely used standard, I don't see the
point of trying to design another way of specifying "default
ignorable"-controls for text styling. (HTML, for instance, does not
have "default ignorable" controls, since ALL characters in the
"controls" are printable characters, so one needs a "second level"
for parsing the controls.) True, ignoring or interpreting an
esc/control-sequence requires some processing of substrings, since
some (all but the first) are printable characters. But not that hard.
It has been implemented over and over...
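As a minimal sketch of that "not that hard" skipping (my own illustration, not from the thread), in Python, using the pattern quoted above. The CSI alternative is placed first because Python's `re` takes the leftmost matching alternative, and a plain `ESC` branch would otherwise swallow only the `ESC [` prefix of a control sequence:

```python
import re

# ECMA-48 escape and control sequences, per the pattern quoted earlier:
#   CSI (ESC [ or U+009B), parameter bytes 0x30-0x3F,
#   intermediate bytes 0x20-0x2F, final byte 0x40-0x7E;
#   or plain ESC, intermediates 0x20-0x2F, final byte 0x30-0x7E.
ECMA48 = re.compile(
    r'(?:\u001B\u005B|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]'
    r'|\u001B[\u0020-\u002F]*[\u0030-\u007E]'
)

def strip_for_display(text: str) -> str:
    """Treat escape/control sequences as default-ignorable for display."""
    return ECMA48.sub('', text)

print(strip_for_display('\u001B[3mbla bla bla\u001B[0m'))  # -> bla bla bla
```

A real implementation would of course interpret the styling sequences it recognises rather than merely discarding them, and keep the uninterpreted ones so they can be regenerated on save.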

Had this standard been defunct, then there would be an opportunity
to design something different.


> The "bash" shell has special escape-sequences (\[ and \]) to use in
> defining its prompt that tell the system that the text enclosed by them
> is not rendered and should not be counted when it comes to doing

Never heard of. Cannot find any reference mentioning them. Reference?


> cursor-control and line-editing stuff (so you put them around, yep, the
> escape sequences for coloring or boldfacing or whatever that you want in
> your prompt). 


Line editing stuff in bash is done on an internal buffer (there is a library
for doing this, and that library can be used by various other command line
programs; bash does not use the system input line editing). Then that
library tries to show what is in the buffer on the terminal. So, I'm
not sure what you are talking about; bash does NOT (somehow) scrape
the screen (terminal emulator window).


Furthermore, colouring and bold/underline is quite common not only in
prompts, but also in output directed at a terminal from various programs.
(And it works just fine.) Unfortunately cut-and-paste tends to lose
much (or all) of that. (Would be nicer if it got converted to HTML,
RTF, .doc, or whatever is the target format; or just nicely kept if
"plain text" is the target.)

 
> That would seem to be at least simpler than a big ol'
> regexp, but really not that much of an improvement.  It also goes to
> show how things like this require all kinds of special handling,
> even/especially in a "simple" shell prompt (which could make a strong
> case for being "plain text", though, yes, terminal escape codes are a
> thing.)

They are NOT "terminal escape codes". It is just that, for now, it is
just about only terminal emulator that implement esc/control-sequences.
From https://www.ecma-international.org/publications/standards/Ecma-048.htm:
"The control functions are intended to be used embedded in character-coded
data for interchange, in particular with character-imaging devices."
A (plain) text editor is an example of a 'character-imaging device'.
(Yes, the terminology is a bit dated.)

/Kent K

> 
> ~mark





Re: Encoding italic (was: A last missing link)

2019-01-22 Thread Kent Karlsson via Unicode
Ok. One thing to note is that escape sequences (including control sequences,
for those who care to distinguish those) probably should be "default
ignorable" for display. Requiring, or even recommending, them to be default
ignorable for other processing (like sorting, searching, and other things)
may be a tall order. So, for display, (maximal) substrings that match:

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B\u005B|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

should be default ignorable (i.e. invisible, but a "show invisibles" mode
would show them; not interpreted ones should be kept, even if interpreted
ones need not, just (re)generated on save). That is as far as Unicode
should go.

Some may be interpreted, this thread focuses on italic, but also bold
and underlined. There is a whole bunch of "style" control sequences
(those that have "m" at the end of the sequence) specified, and terminal
emulators implement several of them, but not all.

As for editing, if "style" control sequences à la ISO 6429 were to be
supported in text editors, I would NOT expect users to type in those
escape/control sequences in any way, but use "ctrl/command-i" (etc.) or
menu commands as editors do now, and the representation as esc-sequences
be kept under wraps (and maybe only present in files, not in the internal
representation during editing), and not seen unless one starts to analyse
the byte sequences in files. So, even if you don't like this esc-sequence
business:
1) It would not be seen by most users, mostly by programmers (the same
goes for other ways of representing this, be it HTML, .doc, or whatever.
2) It is already standardised, and one can make (a slightly inaccurate)
argument that it is "plain text".

What one would need to do is:
1) Prioritise which "style" control sequences should be interpreted
(rather than be ignored).
2) Lobby to "plain" text editor makers to support those styles,
representing them (in files) as standard control sequences.

A selection of already standardised style codes (i.e., for control
sequences that end in "m"):
 
0   default rendition (implementation-defined)

1   bold
(2  lean)
22  normal intensity (neither bold nor lean)

3   italicized
23  not italicized (i.e. upright)

4   singly underlined
(21 doubly underlined)
24  not underlined (neither singly nor doubly)

(9  crossed-out (strikethrough))
(29 not crossed out)

If you really want to go for colour as well (RGB values in 0–255)
(colour is popular in terminal emulators...):
 
(30-37  foreground: black, red, green, yellow, blue, magenta, cyan, white)
38  foreground colour as RGB. Next arguments 2;r;g;b
39  default foreground colour (implementation-defined)

(40-47  background: black, red, green, yellow, blue, magenta, cyan, white)
48  background colour as RGB. Next arguments 2;r;g;b
49  default background colour (implementation-defined)

There are some more (including some that assume a small font palette, for
changing font). But far enough for now. Maybe too far already. But do not
allow interpreting multiple style attribute codes in one control sequence;
quite unnecessary.
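Emitting the codes listed above as ECMA-48 SGR ("select graphic rendition") control sequences is one line of string formatting. A sketch of my own, just to make the byte layout concrete:

```python
# An SGR control sequence is CSI, a parameter code, then final byte "m".
# Codes are those listed above: 3 = italic (23 = off), 1 = bold (22 = off),
# 4 = underline (24 = off). One style attribute per sequence, as recommended.
CSI = '\u001B['

def sgr(on: int, text: str, off: int) -> str:
    return f'{CSI}{on}m{text}{CSI}{off}m'

italic = sgr(3, 'bla bla bla', 23)
bold = sgr(1, 'warning', 22)
# 24-bit foreground colour uses parameters 38;2;r;g;b (39 = default again):
red = f'{CSI}38;2;255;0;0mred text{CSI}39m'

print(repr(italic))  # '\x1b[3mbla bla bla\x1b[23m'
```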


/Kent K



Den 2019-01-21 21:46, skrev "Doug Ewell via Unicode" :

> Kent Karlsson wrote:
> 
>> There is already a standardised, "character level" (well, it is from
>> a character standard, though a more modern view would be that it is
>> a higher level protocol) way of specifying italics (and bold, and
>> underline, and more):
>> 
>> \u001b[3mbla bla bla\u001b[0m
>> 
>> Terminal emulators implement some such escape sequences.
> 
> And indeed, the forthcoming Unicode Technical Note we are going to be
> writing to supplement the introduction of the characters in L2/19-025,
> whether next year or later, will recommend ISO 6429 sequences like this
> to implement features like background and foreground colors, inverse
> video, and more, which are not available as plain-text characters.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 





Re: Encoding italic (was: A last missing link)

2019-01-19 Thread Kent Karlsson via Unicode
(I have skipped some messages in this thread, so maybe the following
has been pointed out already. Apologies for this message if so.)

You will not like this... But...

There is already a standardised, "character level" (well, it is from
a character standard, though a more modern view would be that it is
a higher level protocol) way of specifying italics (and bold, and
underline, and more):

\u001b[3mbla bla bla\u001b[0m

Terminal emulators implement some such escape sequences. The terminal
emulators I use support bold (1 after the [) but not italic (3). Every time
you use the "man"-command in a Linux/Unix/similar terminal you "use" the
escape sequences for bold and underline... Other terminal based programs
often use bold as well as colour esc-sequences for emphasis as well as for
warning/error messages, and other "hints" of various kinds. For xterm,
see: https://www.xfree86.org/4.8.0/ctlseqs.html.

So I don't see these esc-sequences becoming obsolete any time soon.
But I don't foresee them being supported outside of terminal emulators
either... (Though for style esc-sequences it would certainly be possible.
And a "smart" cut-and-paste operation could auto-insert an esc-sequence
that sets the style after the paste to the one before the paste...)

Had HTML (somehow, magically) been invented before terminals, maybe
terminals (terminal emulators) would have used some kind of "mini-HTML"
instead. But things are like they are on that point.

/Kent Karlsson

PS
The cut-and-paste I used here converts (imperfectly: bold is lost and
spurious ! inserted) to HTML
(surely going through some internal attribute-based representation, the HTML
being generated
when I press send):

man(1) 
man(1)

NAME
   man - format and display the on-line manual pages

SYNOPSIS
   man  [-acdfFhkKtwW]  [--path]  [-m system] [-p string] [-C
config_file]
   [-M pathlist] [-P pager] [-B browser] [-H htmlpager] [-S
section_list]
   [section] name ...






Den 2019-01-18 20:18, skrev "Asmus Freytag via Unicode"
:

>
> 
> I would fully agree and I think Mark puts it really well in the message below
> why some of the proposals brandished here are no longer plain text but
> "not-so-plain" text.
>  
> 
> I think we are better served with a solution that provides some form of
> "light" rich text, for basic emphasis in short messages. The proper way for
> this would be some form of MarkDown standard shared across vendors, and
> perhaps implemented in a way that users don't necessarily need to type
> anything special, but that, if exported to "true" plain text, it turns into
> the source format for the "light" rich text.
>  
> 
> This is an effort that's out of scope for Unicode to implement, or, I should
> say, if the Consortium were to take it on, it would be a separate technical
> standard from The Unicode Standard.
>  
>  
> 
> A./
>  
> 
> PS: I really hate the creeping expansion of pseudo-encoding via VS characters.
> The only worse thing is adding novel control functions.
>  
>  
> 
>  
>  
> On 1/18/2019 7:51 AM, Mark E. Shoulson via Unicode wrote:
>  
>  
>> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote:
>>  
>>>  
>>>  Encoding 'begin italic' and 'end italic' would introduce difficulties when
>>> partial strings are moved, etc. But that's no different than with current
>>> punctuation. If you select the second half of a string that includes an end
>>> quote character you end up with a mismatched pair, with the same problems of
>>> interpretation as selecting the second half of a string including an 'end
>>> italic' character. Apps have to deal with it, and do, as in code editors.
>>>  
>>>  
>>  It kinda IS different.  If you paste in half a string, you get a mismatched
>> or unmatched paren or quote or something.  A typo, but a transient one.  It
>> looks bad where it is, but everything else is unaffected.  It's no worse than
>> hitting an extra key by mistake. If you paste in a "begin italic" and miss
>> the "end italic", though, then *all* your text from that point on is
>> affected!  (Or maybe "all until a newline" or some other stopgap ending, but
>> that's just damage-control, not damage-prevention.)  Suddenly, letters and
>> symbols five words/lines/paragraphs/pages look different, the pagination is
>> all altered (by far more than merely a single extra punctuation mark, since
>> italic fonts generally are narrower than roman).  It's a disaster.
>>  
>>  No.  This kind of statefulness really is beyond what Unicode is de

Re: Unicode 11 Georgian uppercase vs. fonts

2018-07-28 Thread Kent Karlsson via Unicode
I know it is too late now, but... Could have added the characters,
without adding the case mappings. Just as it was done for the LATIN
CAPITAL LETTER SHARP S (ẞ), where the proper case mapping was relegated
to "special purpose software" (or just a special setting in common
software). The (proper) case-mapping for ẞ is nowhere to be found the
Unicode database (which I think is a pity, but that is a different matter).

I think "specialcasing.txt" is not really maintained anymore, but I'll
disregard that here.

One could add a special-casing for each modern Georgian lowercase letter
to (continue to) uppercase-map to itself (for the Georgian language at
least).

/Kent K



Den 2018-07-28 15:26, skrev "Michael Everson via Unicode"
:

> Mtavruli could not be represented in the UCS before we added these characters.
> Now it can. 
> 
> Michael Everson
> 
>> On 28 Jul 2018, at 14:10, Richard Wordingham via Unicode
>>  wrote:
>> 
>> On Sat, 28 Jul 2018 01:45:53 +
>> Peter Constable via Unicode  wrote:
>> 
>>> (iii) gave
>>> indication of intent to develop a plan of action for preparing their
>>> institutions for this change as well as communicating that within
>>> Georgian industry and society. It was only after that did UTC feel it
>>> was viable to proceed with encoding Mtavruli characters.
>> 
>> It is dangerous to rely on declarations of intent when making
>> irreversible decisions.  The UTC should have learnt that from the
>> Mongolian mess.
>> 
>> Richard.
> 
> 





Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Kent Karlsson via Unicode

Den 2017-04-12 06:12, skrev "Garth Wallace" :

> Shogi diagrams are uncheckered (as Shogi boards are), with grid-lines to
> separate the spaces; traditionally, chess diagrams use the contrast of dark
> and light squares to distinguish spaces with no grid lines. They may, but do
> not have to, have dots at some intersections (these mark starting and
> promotion zones). Graphical diagrams may show images of pieces (pentagonal,
> with names written in kanji), but typeset diagrams use abbreviations of the
> piece names as CJK ideographs or kana: e.g. the gold general is 金, and the
> promoted pawn is と. Instead of "black" and "white", the pieces belonging to
> the sente player are displayed upright and those belonging to the gote player
> are rotated 180°. Any proposal for Shogi would have to deal with that.

OT

Unicode has (only) these for Shogi pieces:

2616;WHITE SHOGI PIECE;So;0;ON;N;
2617;BLACK SHOGI PIECE;So;0;ON;N;
26C9;TURNED WHITE SHOGI PIECE;So;0;ON;N;
26CA;TURNED BLACK SHOGI PIECE;So;0;ON;N;

Which seems insufficient...

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Kent Karlsson via Unicode

Den 2017-04-12 05:14, skrev "Garth Wallace" :

> One salient feature the Block Elements have that the Box Drawing characters do
> not: distinct LEFT and RIGHT verticals, and LOWER and UPPER horizontals. The
> double frame typically consists of a thin line and a thicker line, with one on
> the inside and one on the outside, so left and right verticals are not
> interchangeable. Even when a single frame is used, it is important for
> spacing, since the frame should be flush against the board.

Note that I used TWO DIFFERENT variation selectors for the horizontal and
vertical box drawing characters in my suggestion (marked in bold here):

2500 FE00; Chessboard box drawing (top); # BOX DRAWINGS LIGHT HORIZONTAL
(U+2500)
2500 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS LIGHT HORIZONTAL
(U+2500)
2502 FE00; Chessboard box drawing (left); # BOX DRAWINGS LIGHT VERTICAL
(U+2502)
2502 FE01; Chessboard box drawing (right); # BOX DRAWINGS LIGHT VERTICAL
(U+2502)
250C FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND RIGHT
(U+250C)
2510 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND LEFT
(U+2510)
2514 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND RIGHT
(U+2514)
2518 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND LEFT (U+2518)

2550 FE00; Chessboard box drawing (top); # BOX DRAWINGS DOUBLE HORIZONTAL
(U+2550)
2550 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS DOUBLE HORIZONTAL
(U+2550)
2551 FE00; Chessboard box drawing (left); # BOX DRAWINGS DOUBLE VERTICAL
(U+2551)
2551 FE01; Chessboard box drawing (right); # BOX DRAWINGS DOUBLE VERTICAL
(U+2551)
2554 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND RIGHT
(U+2554)
2557 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND LEFT
(U+2557)
255A FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND RIGHT
(U+255A)
255D FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND LEFT
(U+255D)
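For illustration only (a sketch of my own, shown here without the proposed variation selectors so that it renders with today's fonts): framing board rows with the double box-drawing characters listed above.

```python
# Frame board rows with BOX DRAWINGS DOUBLE characters (U+2550..U+255D).
# Under the proposal, each border character would additionally carry a
# variation selector (FE00/FE01) to request the em-wide chessboard variant.
TL, TR, BL, BR = '\u2554', '\u2557', '\u255A', '\u255D'
H, V = '\u2550', '\u2551'

def framed(rows):
    width = len(rows[0])
    top = TL + H * width + TR
    bottom = BL + H * width + BR
    return '\n'.join([top, *(V + row + V for row in rows), bottom])

# Two empty ranks, just to show the shape; a real board has eight.
print(framed(['\u00A0\u00A0', '\u00A0\u00A0']))
```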

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-11 Thread Kent Karlsson via Unicode

Den 2017-04-10 12:19, skrev "Michael Everson" :

> I believe the box drawing characters are for drawing boxes

Which is exactly what you are doing.

> and grids on 
> computer terminals, which is not the same thing as scoring a line around a set
> of 64 graphic images.

No, that is why I put in variation selectors. The glyphic variation
selected would in my judgement fall well within the "box drawing semantics"
(if you like) of these characters.

In addition, thinking ahead, it is not at all unlikely that someone
might want to divide a chess board with a horizontal mid-line, or for
that matter a vertical mid-line (e.g. for "double chess"), or even
quadrants. And then, ta-da, there are already box-drawing characters for
doing just that (even when there is a small gap between the board and the
border. (I'm not suggesting to add variation selector sequences for /those/
box drawing characters, because I don't /know/ there is a use-case for
mid-lines in chess board layout, but I'm saying there might be.)

> I don't want to get mixed up in using the box-drawing
> characters. The characters which I have chosen work fine and to my mind suit
> the application better.

They "work" (of course), no font renderer or font editor is "smart" enough
to "see" that you are going quite a bit (in my judgement) outside of the
acceptable glyph variability for the characters you (so far) opted for
for chess box drawing. (Other relevant, and non-glyph, properties being
the same between the box drawing and block chars.)

That the "block characters" are pure crap (which they are), does not
mean that you can co-opt them for (slightly) "variant" box drawing.

> I also don't want to complicate chess fonts by having to have multiple choices
> within a font for bordering. For one thing, single-rule and double-rule
> bordering is by no means the gamut of possibility.

You are not wanting "emoji" style borders, I'm sure. But some slight
"ornate" style would be fine for the "box drawing" chars (even without
variation selectors). The "single" should still be single, though,
and the "double" be double. So triple (etc.) is out.

I think single/double line border should be a decision by the "author"/
"editor", and not the font maker. Imagine accompanying text saying
"the double bordered one is ".

> Chess fonts do not have to be swiss-army knives.

I don't see that I have asked for that.

B.t.w., I see you don't have 1-8, a-h labels on the boards... It might be
worth mentioning that FULLWIDTH a-h should work fine as labels (them being
em-wide).
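Such labels are trivial to generate; a quick sketch of my own, assuming the fullwidth code points U+FF41..U+FF48 (letters) and U+FF11..U+FF18 (digits):

```python
# FULLWIDTH LATIN SMALL LETTER A..H (U+FF41..U+FF48) and FULLWIDTH DIGIT
# ONE..EIGHT (U+FF11..U+FF18) are em-wide, so they line up under and
# beside em-wide board squares.
files = ''.join(chr(0xFF41 + i) for i in range(8))
ranks = ''.join(chr(0xFF11 + i) for i in range(8))
print(files)  # ａｂｃｄｅｆｇｈ
print(ranks)  # １２３４５６７８
```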

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-09 Thread Kent Karlsson

Den 2017-04-06 01:25, skrev "Michael Everson" :

> Oh, here. This is what I would add.
> 
> 2581 FE00; Chessboard box drawing; # LOWER ONE EIGHTH BLOCK
> 258F FE00; Chessboard box drawing; # LEFT ONE EIGHTH BLOCK
> 2594 FE00; Chessboard box drawing; # UPPER ONE EIGHTH BLOCK
> 2595 FE00; Chessboard box drawing; # RIGHT ONE EIGHTH BLOCK
> 2596 FE00; Chessboard box drawing; # QUADRANT LOWER LEFT
> 2597 FE00; Chessboard box drawing; # QUADRANT LOWER RIGHT
> 2598 FE00; Chessboard box drawing; # QUADRANT UPPER LEFT
> 259D FE00; Chessboard box drawing; # QUADRANT UPPER RIGHT

Instead of that, I'd suggest:
2500 FE00; Chessboard box drawing (top); # BOX DRAWINGS LIGHT HORIZONTAL
(U+2500)
2500 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS LIGHT HORIZONTAL
(U+2500)
2502 FE00; Chessboard box drawing (left); # BOX DRAWINGS LIGHT VERTICAL
(U+2502)
2502 FE01; Chessboard box drawing (right); # BOX DRAWINGS LIGHT VERTICAL
(U+2502)
250C FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND RIGHT
(U+250C)
2510 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND LEFT
(U+2510)
2514 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND RIGHT
(U+2514)
2518 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND LEFT (U+2518)

These are more likely to be supported (by (fixed-width) fonts) in fallback
than the ones you suggest.
They are also intended for box drawing (unlike the ones you suggest).

Perhaps also, since you exemplify also with double borders in your document:
2550 FE00; Chessboard box drawing (top); # BOX DRAWINGS DOUBLE HORIZONTAL
(U+2550)
2550 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS DOUBLE HORIZONTAL
(U+2550)
2551 FE00; Chessboard box drawing (left); # BOX DRAWINGS DOUBLE VERTICAL
(U+2551)
2551 FE01; Chessboard box drawing (right); # BOX DRAWINGS DOUBLE VERTICAL
(U+2551)
2554 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND RIGHT
(U+2554)
2557 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND LEFT
(U+2557)
255A FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND RIGHT
(U+255A)
255D FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND LEFT
(U+255D)

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-06 Thread Kent Karlsson

Den 2017-04-06 03:05, skrev "Michael Everson" :

> On 6 Apr 2017, at 01:54, Kent Karlsson  wrote:
> 
>>>> - some bidi fix [preferably making the box/border drawing characters bidi
>>>> "L", if possible; otherwise a caveat that if there is an expectation to
>>>> paste in such a board into an RTL document, bidi controls need be used to
>>>> LTR the board]).
>>> 
>>> I don't know if there is a problem here and am not able to offer a solution
>>> if there is. I don't object to a solution, if there is a problem.
>> 
>> I would think
> 
> Come on. This is a serious proposal.

I agree! ;-)

> I'm glad you support it, but if you are
> going to raise an issue like this, "I would think and guess about a problem"
> isn't the same as "I have tried and here's an actual problem".

I apologise for my slightly cautious way of expressing myself...

All the characters in the "chess board lines" (apart from spaces, if any),
are of bidi category ON or NSM. So there is no character that "sets" a bidi
direction of the lines ("paragraphs"). So if the bidi setting for display is
set to default to RTL, each of the chess board lines will be reversed in
display. Now, since the border characters are not mirrored, the left and
right side of the board side lines will be somewhat botched. Which is very
visible in that it is ugly. (And I guess(!) the reader will notice that...)

I'm not a bidi expert, but I know that much about bidi (and so should
you...).

> Roozbeh, there's an issue that might benefit from your expertise. Can you look
> into it? Discussion needn't occur here, but offline with Kent and me, if you
> prefer. 
> 
>> that anyone pasting a chess board (à la your proposal) to an RTL context will
>> see that something went amiss,
> 
> Will they? Why?

Since the border characters are not mirrored (they do not have the
mirroring property), the left and right side of the chess board side
lines will be somewhat botched. Which is visible/ugly. Indeed, the entire
chess board will be mirrored (though none of the individual glyphs), but,
though visible, that whole-mirroring (line reversal) is easier to miss.
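One way around this, sketched here as my own suggestion rather than anything from the proposal, is to wrap each board line in a left-to-right directional isolate (U+2066 LRI ... U+2069 PDI), so the otherwise-neutral (bidi class ON) characters resolve LTR regardless of paragraph direction:

```python
# Wrap a line in LEFT-TO-RIGHT ISOLATE / POP DIRECTIONAL ISOLATE so the
# neutral box-drawing and chess characters keep their LTR order even in
# an RTL paragraph context.
LRI, PDI = '\u2066', '\u2069'

def isolate_ltr(line: str) -> str:
    return LRI + line + PDI

board_line = '\u2551\u2659\u2659\u2551'  # a toy board line: ║♙♙║
print(repr(isolate_ltr(board_line)))
```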

>> and also know enough about bidi to set the bidi context to LTR for the chess
>> board(s),
> 
> RTL users understand the problems of cutting and pasting LTR text and symbols,
> certainly. LTR users don't.
> 
>> either by some setting, or by inserting bidi control characters.
> 
> Well, if there's a problem it should be well-defined so it can be tackled.
> 
>> So a small caveat is all that is necessary. Like: "The chess boards are
>> assumed to be set in a left-to-right bidi context."
> 
> THAT I can put into the document, but since chess is as important in both the
> RTL and LTR worlds, it would be good to know what's what.

See above.

/Kent K

> Thank you again for your thoughtfulness,
> 
> Michael





Re: Proposal to add standardized variation sequences for chess notation

2017-04-06 Thread Kent Karlsson

Den 2017-04-06 03:08, skrev "Michael Everson" :

> On 6 Apr 2017, at 02:05, Kent Karlsson  wrote:
> 
>>> Do generic font makers intend to support both graphic terminal emulation and
>>> chess?
>> 
>> I don't know. But it should not be impossible to do so.
> 
> And you think the proposal as it does leads to that?

Yes. In one single font (according to your current proposal), one can
only have EITHER the terminal emulator version OR the chess border version. Not
both. Using variant selectors for the chess border variants allow for both
glyph variants. Maybe it does not make much difference in a proportional
font. But for a "mono-width" font the terminal emulator versions for these
border characters would be "narrow", but the chess border versions should
be "fullwidh"/"square" (compare CJK in terminals; double the width of, e.g.,
Latin characters).

>>> Should chess font makers be burdened with graphic terminal emulation glyphs
>>> they know nothing about?
>> 
>> If it is really a chess font, they can just use the glyphs for the chess
>> variety also as the "plain" (terminal emulator variety), and it would not
>> matter (as long as no-one insist on using it for terminal emulation).
> 
> Ha, so you're saying it's mostly for things like Everson Mono that it matters…
> ;-)

Yes (but there are other fonts than Everson Mono that are suitable for
terminal emulators...).

There are still people who read (plain text) emails in terminal emulators
(or other email clients that cannot handle font switching inside an email,
and may have selected a "terminal emulator" font for viewing emails). Though
"mono-width", the chess board glyphs should be "fullwidth"...

/Kent K


>> All that is needed for that is a manoeuvre to copy a few glyphs within the
>> font (when creating the font). I guess that is not very hard…
> 
> It is not.
> 
> Michael Everson





Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Kent Karlsson

Den 2017-04-06 02:47, skrev "Michael Everson" :

> Well, see my follow-up to James Kass and evaluate the merits of the two
> choices.

> Do generic font makers intend to support both graphic terminal
> emulation and chess?

I don't know. But it should not be impossible to do so.

> Should chess font makers be burdened with graphic
> terminal emulation glyphs they know nothing about?

If it is really a chess font, they can just use the glyphs for the chess
variety also as the "plain" (terminal emulator variety), and it would not
matter (as long as no-one insist on using it for terminal emulation). All
that is needed for that is a manoeuvre to copy a few glyphs within the font
(when creating the font). I guess that is not very hard...

/Kent K


>> On 6 Apr 2017, at 01:31, Kent Karlsson  wrote:
>> 
>> 
>> Exactly.
>> 
>> /K
>> 
>> Den 2017-04-06 01:25, skrev "Michael Everson" :
>> 
>>> 2581 FE00; Chessboard box drawing; # LOWER ONE EIGHTH BLOCK
>>> 258F FE00; Chessboard box drawing; # LEFT ONE EIGHTH BLOCK
>>> 2594 FE00; Chessboard box drawing; # UPPER ONE EIGHTH BLOCK
>>> 2595 FE00; Chessboard box drawing; # RIGHT ONE EIGHTH BLOCK
>>> 2596 FE00; Chessboard box drawing; # QUADRANT LOWER LEFT
>>> 2597 FE00; Chessboard box drawing; # QUADRANT LOWER RIGHT
>>> 2598 FE00; Chessboard box drawing; # QUADRANT UPPER LEFT
>>> 259D FE00; Chessboard box drawing; # QUADRANT UPPER RIGHT
>>> 
>>> I guess I see your point. It does no harm, especially if the font might
>>> possibly be used for graphics terminal emulation. ;-)
>> 
>> 
> 
> 




Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Kent Karlsson

Den 2017-04-06 01:25, skrev "Michael Everson" :

>>  - some bidi fix [preferably making the box/border drawing characters bidi
>> "L", if possible; otherwise a caveat that
>>if there is an expectation to paste in such a board into an RTL document,
>> bidi controls need be used to LTR the board]).
> 
> I don't know if there is a problem here and am not able to offer a solution if
> there is. I don't object to a solution, if there is a problem.

I would think that anyone pasting a chess board (à la your proposal) to an
RTL context will see that something went amiss, and also know enough about
bidi to set the bidi context to LTR for the chess board(s), either by some
setting, or by inserting bidi control characters.

So a small caveat is all that is necessary. Like:
"The chess boards are assumed to be set in a left-to-right bidi context."

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Kent Karlsson

Den 2017-04-06 01:25, skrev "Michael Everson" :

> Oh, you misunderstood me. I knew it was raw HTML. I didn't expect it to
> render. But it was meaningless code.

It was a response to Marcus, in that HTML might be used (with existing
characters and no VSs) to format chess boards. And he is right, as proven
by the HTML code I (basically) copied from stackoverflow. And it does
typeset chess boards better than plain text ones à la your proposal...

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Kent Karlsson

Exactly.

/K

Den 2017-04-06 01:25, skrev "Michael Everson" :

> 2581 FE00; Chessboard box drawing; # LOWER ONE EIGHTH BLOCK
> 258F FE00; Chessboard box drawing; # LEFT ONE EIGHTH BLOCK
> 2594 FE00; Chessboard box drawing; # UPPER ONE EIGHTH BLOCK
> 2595 FE00; Chessboard box drawing; # RIGHT ONE EIGHTH BLOCK
> 2596 FE00; Chessboard box drawing; # QUADRANT LOWER LEFT
> 2597 FE00; Chessboard box drawing; # QUADRANT LOWER RIGHT
> 2598 FE00; Chessboard box drawing; # QUADRANT UPPER LEFT
> 259D FE00; Chessboard box drawing; # QUADRANT UPPER RIGHT
> 
> I guess I see your point. It does no harm, especially if the font might
> possibly be used for graphics terminal emulation. ;-)




Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Kent Karlsson
Den 2017-04-05 16:48, skrev "Michael Everson" :

Kent, I can't read this in a plain-text e-mail.

Well, it was SUPPOSED to be explicit HTML code in the email. It was NOT the
intent that the given example was to be
rendered directly in the email (even if you have HTML emails enabled).
Further, I would write the code a bit differently,
in order to easily be able to map your proposed encoding for (parts of)
chessboards to HTML. But at this point I did not
want to change the referenced example (written by someone posting to
stackoverflow.com) in any significant way.

So yes, if you want to see the result of the HTML code, paste the HTML code
to a plan text editor, name the file
you save it to "chess.html", and view that file in a browser. That display
in turn may be cut and pasted to another
document, depending on the capabilities of the app used to edit that other
document. The paste may, admittedly
result in an awful and uneditable result.

I agree that the HTML code is a bit of a mouthful (and I would also do it a
bit differently), and also has the problem
mentioned in the previous paragraph. Which is why I support your proposal,
but with these modifications:

 - with the extra requirement to have VSs also for the border line drawing
characters (to make them fit for
   drawing chess board borders, in a general purpose font), and

 - some bidi fix [preferably making the box/border drawing characters bidi
"L", if possible; otherwise a caveat that
   if there is an expectation to paste in such a board into an RTL document,
bidi controls need be used to LTR the board]).

Nit: You sometimes seem to have made the line spacing slightly larger (like
2 points) than the character width.
Should they not be exactly the same, to get the best (square) display of the
chess boards? (Not that it is very visible,
but a bit.)

/Kent K

PS
I think the "ligatures" approach is a dead end.
 - As you mention, the fallback will have very different line lengths for
the lines of a board display,
   and thus be basically unreadable.
 - If ZWJ is not needed, one will need two *new* characters that (in some
fonts) ligate with chess pieces.
   No existing character should ever ligate with chess pieces.
 - If ZWJ is needed, then one can use some existing characters as board
squares.
 - In either case, it is not clear (or obvious) which should come first, a
chess piece or a board square.
   There will surely be mistakes, giving them in the wrong order (not a
problem in your proposal).
 - My personal guesstimate is that there will be far fewer fonts that would
implement the ligation
   (if that approach was to be chosen) than would implement the VS approach
you are suggesting.

Thus I support your proposal, since that gives:
  - Good fallback (readable, though ugly).
  - Fairly good display when the VS sequences are interpreted (and the font
is otherwise reasonable),
and "good" context (line height setting, not too short lines so that
auto line breaking is avoided, ...).
  - Easier to machine parse than the ligatures approach; and MUCH easier to
parse than an HTML version.
  - Easy to convert to (say) HTML for even better display in (say) HTML
pages (CAN look much better,
and NO dependence on line height setting or line width setting (or bidi
direction derivations), but
just that the table (for the board) is reasonably done).
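Regarding that last point, here is a minimal sketch of such a converter,
assuming the proposed (not standardized) convention that every board cell is
a base character followed by VS1 (U+FE00, "white square") or VS2 (U+FE01,
"shaded square"). The CSS class names "white"/"black" are my own invention,
not part of any proposal:

```python
# Sketch: turn one rank of a VS-encoded chess board line into HTML cells.
# Assumes the proposed convention <base char> + VS1 = piece/square on a
# white square, <base char> + VS2 = on a shaded square.
VS1, VS2 = "\ufe00", "\ufe01"

def rank_to_html(rank: str) -> str:
    cells, base = [], None
    for ch in rank:
        if ch in (VS1, VS2):
            shade = "white" if ch == VS1 else "black"
            cells.append(f'<td class="{shade}">{base}</td>')
            base = None
        else:
            base = ch
    return "<tr>" + "".join(cells) + "</tr>"

# White king on a white square, then an empty shaded square:
print(rank_to_html("\u2654\ufe00\u25a8\ufe01"))
```

A real converter would of course also need to handle the border-drawing
characters and emit the surrounding table markup.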




Den 2017-04-05 16:48, skrev "Michael Everson" :

> Kent, I can't read this in a plain-text e-mail. I can't paste it into an
> ordinary word-processor like Word as in my previous response to Markus, or in
> Pages (left) or LibreOffice (right) as shown here. (I simply pasted in the
> text from Word to each of those. It's odd to see that there is some variation
> in display the text without selecting it and applying the correctly-configured
> font to it, but when that's done, the correct display is given (modulo some
> leading issues which I didn't focus on in either).
> 
> The workaround you give is just that. It works. It's not usefully portable or
> user-friendly, and as higher-level protocols go, it hasn't swept away all
> competition for presenting chessboards. People use ASCII or MS Symbol-based
> fonts not even with any Unicode characters in them.



Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-04 00:35, skrev "Michael Everson" :

>> What I am saying is that the glyphs for the two new variants you are
>> proposing need to harmonise with the block elements such as U+2581
>> LOWER ONE EIGHTH BLOCK.
> 
> No… in a chess font the font designer has to draw those block-element
> characters differently, to harmonize with the
> 
>> That requires uniform width *for those variants*.  That is a key part of the
>> glyph family's essence.
> 
> In their original usage in graphic terminals, sure. And some people still
> emulate those, and when they use those characters they draw them for that
> purpose. In current ASCII-based chess-fonts, a set of characters is used to
> draw a line (of one kind or another) around the board, and when I looked for
> Unicode characters to map to these, the block elements were the ones that had
> the right structure, since they were high and low and left and right in the em
> square. 
> 
>> There is no such requirement on the glyphs for normal text use as at present.
> 
> There is **in a chess font** if you want to be able to draw a box around the
> chessboard. 

I'm not too happy about this. Maybe have VSs applied also to the chess box
drawing chars?

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-04 03:12, skrev "Michael Everson" :

> It *is* important that there be an even number of characters in every row of 8
> squares for fallback display to be better rather than worse, I think.

I agree. (Though *at present*, I happen to get a visible display of the
VSs in the email app, which does not look too good.)

> I found while setting the tables that it was convenient to have to remember
> that every one of the 64 characters had to have VS1 or VS2 along with it.
> Constructing a table from scratch and modifying and existing one both felt
> easier with uniform encoding.

Yes. BUT, I would hope that chess enthusiasts would not have to think much
about the encoding. Either using a special keyboard layout (momentarily) or
using a palette for picking board item by board item seems to be better
options. I'm sure someone will make a browser based chess editor, complete
with suitable palette, and having an empty board pre-edited to start out
(replacing the empty squares as pieces are laid out or moved).

/Kent K




Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-04 03:21, skrev "Asmus Freytag" :

> would look like this, if you base your proposal on ligatures rather than
> variation selectors (minimal case A above):
>  
> ▕□︀▨︁□︀▨︁♙︁□︀♛︀▨︁□︀▨︁▏

That line has a lot of VSs in it... (I see them, since they happen to be
visible in the email app I use.)
 
>  The disadvantage is that the fallback rendering does not line up; but I would
> regard that as a minor issue.

I think Michael regards that non-lineup as a show-stopper.

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-04 02:10, skrev "Michael Everson" :

> On 4 Apr 2017, at 00:45, Kent Karlsson  wrote:
>> 
>> Book formatting? Old style book formatting still cannot use as sophisticated
>> layouts as HTML can... (AFAIK).
> 
> Yeah, but come on, the chief use of chess characters is to cite them inline in
> text like any other symbol @ § % & and the other equally chief use of chess
> characters is to set 8 × 8 chessboards which float in space in the layout as
> figures. The layout requirement isn’t all that demanding that HTML offers a
> major advantage. 

In case you missed it, the statement I made above was in *SUPPORT* of your
proposal (in general, but not necessarily all details)...

/Kent K

> Michael Everson




Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

I can well imagine people deeply interested in chess wanting to exchange
chess board layouts
in plain text emails (or at least not use quite hard-to-handle HTML code),
and even parse them
(programmatically) for analysis by a program, not wanting to bother with
quite complex HTML/CSS stuff.
Including making input easy (keyboard, palette), just "typing" the chess
board layout (with pieces).

But for HTML pages on chess, HTML/CSS markup is certainly preferable; but it
shouldn't be impossible
to just paste in a "plain text" chess board to an HTML page (with minimal
formatting effort).
One can (fairly easily) make a program to convert the "plain text" chess
board to an HTML one.

Book formatting? Old style book formatting still cannot use as sophisticated
layouts as HTML
can... (AFAIK).

/Kent K



Den 2017-04-03 23:44, skrev "markus@gmail.com" :

> On Mon, Apr 3, 2017 at 2:33 PM, Michael Everson  wrote:
>> On 3 Apr 2017, at 18:51, Markus Scherer  wrote:
>>> 
>>> It seems to me that higher-level layout (e.g, HTML+CSS) is appropriate for
>>> the board layout (e.g., via a table), board frame style, and cell/field
>>> shading. In each field, the existing characters should suffice.
>> 
>> That isn¹t plain text.
> 
> A lot of stuff needed for printing books and laying out PDFs and web pages
> goes beyond plain text.
> 
> Whose requirement is it to represent an entire chess or checkers board in
> plain text?
> 
> Other than a sort of puzzle of "what would it take to do so?"
> 
> markus
> 



Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-03 20:46, skrev "Kent Karlsson" :

> 
> Den 2017-04-03 19:51, skrev "markus@gmail.com" :
> 
>> > It seems to me that higher-level layout (e.g, HTML+CSS) is appropriate for
>> the 
>> > board layout (e.g., via a table), board frame style, and cell/field
>> shading.
>> > In each field, the existing characters should suffice.
>> > 
>> > markus
> 
> True, and one can easily find an example online.
> 
> Slightly modified from
> http://stackoverflow.com/questions/18505921/chess-using-tables
> 
> [...]
> 
A bit more modification: more colourful, even with /// striped backgrounds.
One disadvantage
is that the "white" pieces' interiors get the background colour rather than
being actually white.
To get them actually white (not just the interiors, but the entire pieces),
use the "black"(!) pieces,
and (via CSS) colour them white (need to be set on a non-white background to
be visible...).
I know, the latter trick will make parsing even more tricky (needing to
interpret not only the
HTML tag markup and chess characters, but also (say) HTML class attribute to
distinguish "white"
from "black" pieces).

And, parsing (for other things than display in a browser), will be quite
sensitive to the exact
way of expressing this in HTML. There are many quite different ways of
expressing this
in HTML (+CSS).

But... with a bit of JavaScript savvyness, you can program moving the pieces
around... ;-) And
substitute the chess characters to more emoji style images of chess
pieces... Still in ;-) mode.




a {
 color: #f00;                /* piece glyph colour ("black" pieces, here red) */
 display: block;
 font-size: 24px;
 height: 32px;
 width:  32px;
 position: relative;
 text-decoration: none;
 text-shadow: 0 1px #fff;
}

a.white { color: #0ff; }     /* "white" pieces, recoloured via CSS */

#chess_board { border: 2px solid #333; }

#chess_board td {
 background: #ffa;
 background: -moz-linear-gradient(top, #ffa, #eea);
 background: -webkit-gradient(linear,0 0, 0 100%, from(#ffa), to(#eea));
 box-shadow: inset 0 0 0 1px #ffa;
 -moz-box-shadow:inset 0 0 0 1px #ffa;
 -webkit-box-shadow: inset 0 0 0 1px #ffa;
 height: 32px;
 width:  32px;
 text-align: center;
 vertical-align: middle;
}

/* the dark squares: every other cell, alternating between rows */
#chess_board tr:nth-child(odd)  td:nth-child(even),
#chess_board tr:nth-child(even) td:nth-child(odd) {
 /*background: #acc;
 background: -moz-linear-gradient(top, #acc, #aee);
 background: -webkit-gradient(linear,0 0, 0 100%, from(#acc), to(#aee));*/
 background: repeating-linear-gradient(
  -45deg,
  #a0adbc,
  #a0adbc 2px,
  #465298 2px,
  #465298 4px
 );
 box-shadow: inset 0 0 10px rgba(0,0,0,.4);
 -moz-box-shadow:inset 0 0 10px rgba(0,0,0,.4);
 -webkit-box-shadow: inset 0 0 10px rgba(0,0,0,.4);
}





[The HTML table markup was stripped by the list archive; only the cell
contents survive. They laid out the standard starting position, one piece
per table cell:

♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
(four empty ranks)
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖]






Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson

Den 2017-04-03 19:51, skrev "markus@gmail.com" :

> It seems to me that higher-level layout (e.g, HTML+CSS) is appropriate for the
> board layout (e.g., via a table), board frame style, and cell/field shading.
> In each field, the existing characters should suffice.
> 
> markus

True, and one can easily find an example online.

Slightly modified from
http://stackoverflow.com/questions/18505921/chess-using-tables




a {
color:#000;
display:block;
font-size:12px;
height:16px;
position:relative;
text-decoration:none;
text-shadow:0 1px #fff;
width:16px;
}
#chess_board { border:2px solid #333; }
#chess_board td {
background:#fff;
background:-moz-linear-gradient(top, #fff, #eee);
background:-webkit-gradient(linear,0 0, 0 100%, from(#fff), to(#eee));
box-shadow:inset 0 0 0 1px #fff;
-moz-box-shadow:inset 0 0 0 1px #fff;
-webkit-box-shadow:inset 0 0 0 1px #fff;
height:16px;
text-align:center;
vertical-align:middle;
width:16px;
}
#chess_board tr:nth-child(odd) td:nth-child(even),
#chess_board tr:nth-child(even) td:nth-child(odd) {
background:#ccc;
background:-moz-linear-gradient(top, #ccc, #eee);
background:-webkit-gradient(linear,0 0, 0 100%, from(#ccc), to(#eee));
box-shadow:inset 0 0 10px rgba(0,0,0,.4);
-moz-box-shadow:inset 0 0 10px rgba(0,0,0,.4);
-webkit-box-shadow:inset 0 0 10px rgba(0,0,0,.4);
}






[The HTML table markup was stripped by the list archive; only the cell
contents survive. They laid out the standard starting position, one piece
per table cell:

♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
(four empty ranks)
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖]






Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Kent Karlsson



Den 2017-04-03 14:50, skrev "Michael Everson" :

> On 2 Apr 2017, at 18:52, Richard Wordingham 
> wrote:
>> 
>> You forgot the most important setting though - that the higher-order
>> protocols allow symbols to be displayed left-to-right. If the direction
>> should happen to be right-to-left, not only is the game mirrored, but the
>> board edges don't work properly, as the glyphs are not mirrored.  One needs
>> each bidi-paragraph to be forced to the correct order, e.g. by use of LRM
>> before and after, or, if the board is recorded right-to-left, RLM or ALM
>> before and after.
> 
> None of the characters listed in §3 has a mirroring property.

Right, but most of them have bidi property ON (other neutral), so in
a right-to-left context, the chess board characters will be reversed
(on each line, but the VSs (which are NSM) still go with their base).
This would
1) mirror the chess *board* display (but not the chess *piece* glyphs)
2) mess up the corner glyphs, which are not mirrored; and also
   the RIGHT/LEFT ONE EIGHTH BLOCK glyphs, which aren't mirrored
   either.

Issue 2 will result in ugly display.
Issue 1 will confuse the reader, mirroring the entire chess board (if
one disregards the ugly display of the corner and left/right borders).
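The property claims above are easy to verify with Python's unicodedata
module (a quick sketch, not part of the original thread):

```python
import unicodedata

# The chess pieces U+2654..U+265F are all bidi class ON (other neutral),
# so the bidi algorithm may reorder them in a right-to-left context:
assert all(unicodedata.bidirectional(chr(cp)) == "ON"
           for cp in range(0x2654, 0x2660))

# Variation selectors are non-spacing marks and travel with their base:
assert unicodedata.bidirectional("\ufe00") == "NSM"

# LEFT-TO-RIGHT MARK is strong "L", hence usable to pin a board line LTR:
assert unicodedata.bidirectional("\u200e") == "L"
print("bidi classes as described")
```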

Hence the chess board lines should be displayed in a strong left-to-right
context (either via bidi markup characters, or via some higher-order
bidi markup mechanism, such as the "dir" attribute in HTML). Though in
most cases (not Arabic/Hebrew/... documents), the bidi context will default
to left-to-right...

For cut-and-paste to work well also when pasting to a right-to-left
context document, bidi markup characters are probably better than using
a higher-level attribute. I think that is why Richard argues for using
bidi characters to make the lines strong left-to-right (without having
to surround each chess board line with visible strong l-t-r characters).

You might argue for making the board corner and board left/right border
characters strong l-t-r. Not sure if that would sit well with the UTC...

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-01 Thread Kent Karlsson



Den 2017-04-02 01:33, skrev "Michael Everson" :

>> but isn't the convention that one "always" start with FE00 for each character
>> that may have variation selectors applied?
> 
> I don¹t know what you mean by this.

> 25A8 FE01; Black chessboard square; # SQUARE WITH UPPER RIGHT TO LOWER LEFT
FILL

In this case, the "set of variation selectors" for 25A8 excludes FE00.

/Kent K





Re: Proposal to add standardized variation sequences for chess notation

2017-04-01 Thread Kent Karlsson
In addition, not directly related to your proposal: why aren't
chess pieces listed in http://unicode.org/emoji/charts/emoji-variants.html?

It seems to me that chess pieces would be very well suited to each have an
emoji variant (not to be used for the chess boards, maybe).

/Kent K

PS
Remember that Emoji style (or not) uses two OTHER variation selectors, FE0F
(and FE0E).





Re: Proposal to add standardized variation sequences for chess notation

2017-04-01 Thread Kent Karlsson

2654 FE00; Chesspiece on white; # WHITE CHESS KING

Why do the ones with white background need a variation selector?

25A1 FE00; White chessboard square; # WHITE SQUARE
25A8 FE01; Black chessboard square; # SQUARE WITH UPPER RIGHT TO LOWER LEFT
FILL

I see that you want a fallback in case the variation selectors aren't
supported; but isn't the convention that one "always" start with FE00
for each character that may have variation selectors applied?

So in this case, one would only need variation selector FE00; if applied
to 25A1 or 25A8 giving the chess board variety, if applied to a chess piece
character, gives "checkered" ("black") background (without, one gets the
white background).

Why not use 25A0 BLACK SQUARE with the variation selector? (I know that
it would not be entirely black with the variation selector (if not fallback).)
I mean, there is no absolute LOGICAL NEED to draw the "black" background
as WITH UPPER RIGHT TO LOWER LEFT FILL, it could go the other direction
or be just "gray" (or for that matter medium blue...); font maker choice.
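As a quick illustration of the fallback behaviour under discussion (a
sketch; the chess variation sequences themselves are only proposed):
variation selectors are combining marks, so a renderer that does not
support a sequence shows just the base character, and producing the
fallback programmatically is a one-liner:

```python
import unicodedata

# VS1..VS16 occupy U+FE00..U+FE0F; all are category Mn (non-spacing mark):
assert unicodedata.category("\ufe00") == "Mn"

seq = "\u2654\ufe00\u25a8\ufe01"   # proposed <king, VS1><square, VS2> pairs
fallback = "".join(ch for ch in seq if not "\ufe00" <= ch <= "\ufe0f")
assert fallback == "\u2654\u25a8"  # the bases survive, the selectors drop out
print(fallback)
```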

Kind regards
/Kent K




Re: IJ with accent

2016-09-28 Thread Kent Karlsson



Den 2016-09-28 22:48, skrev "Richard Wordingham"
:

> On Wed, 28 Sep 2016 12:30:04 -0700
> "Doug Ewell"  wrote:
> 
>>> Technically I see one, as bíj́na should never break between í and
>>> j́,
>> 
>> These wor-
>> ds should not bre-
>> ak at the places wh-
>> ere I have broken t-
>> hem
>> 
>> but they don't need embedded control characters to enforce that.
> 
> Indeed, there aren't any control characters to control hyphenation.

Well, there is SOFT HYPHEN, as you yourself noted later.

There is also
0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;

"NBH is used to indicate a point where a line break shall not occur when
text is formatted."

But that is in the C1 area, most of which nearly no-one implements...
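A quick check of the two characters mentioned, using Python's unicodedata
(a sketch):

```python
import unicodedata

# SOFT HYPHEN is a format character marking a discretionary break point:
assert unicodedata.category("\u00ad") == "Cf"
assert unicodedata.name("\u00ad") == "SOFT HYPHEN"

# U+0083 NO BREAK HERE is a C1 control; controls carry no character name
# in UnicodeData ("NO BREAK HERE" lives in the Unicode 1.0 name field):
assert unicodedata.category("\u0083") == "Cc"
try:
    unicodedata.name("\u0083")
except ValueError:
    print("U+0083: unnamed C1 control, category Cc")
```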

/K

> Indeed, CGJ between default grapheme clusters is often a very good
> place to hyphenate.
> 
> Richard.
> 





Re: IJ with accent

2016-09-28 Thread Kent Karlsson



Den 2016-09-29 00:12, skrev "Alex Plantema" :

> Op woensdag 28 september 2016 09:59 schreef a.lukyanov:
> 
>> Dutch language writing uses the ligature ij (U+0132, U+0133). When accented,
>> it should take an accent on each component, like this:
>> 
>> If one uses two separate characters (i+j), one can put an accent on each
>> character (íj́).
>> However, if monolithic ligature ij is used, how one can accent it correctly?
>> Unicode standard does not answer this.
>> Probably one should use the sequence U+0133 U+301, with the accent doubling
>> automatically, but this is not implemented (ij́).
> 
> I've never seen an ij with an accent. You can safely assume it's never needed.

See
https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken

/K

> Alex.
> 





Re: Swapcase for Titlecase characters

2016-03-29 Thread Kent Karlsson

Den 2016-03-19 17:40, skrev "Doug Ewell" :

> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters IJ ij DŽ Dž dž LJ Lj lj NJ Nj nj or their hex

(You missed the DZ "ligature" (which aren't really ligatures).)

As mentioned, for the IJ ij here (which sometimes ARE shown as ligatures,
mostly in signage), there is no "titlecase" variant for these (and thus
no problem for "swapcase"). For casing they behave just like Œ œ and Æ æ.
While we are off-topic for this thread... (but still on-topic for
this list):

I still think ij should have the "soft-dotted" property (and that
that property is finally implemented properly in various systems...).

> equivalents in any of the CLDR keyboard definitions.

I've heard that old typewriters used to have a key for IJ ij. Maybe it
should be reintroduced for Dutch computer keyboards, as well as used
(for Dutch) in autocorrects (IJ -> IJ, ij -> ij) or spell correctors
(looking at the whole word rather than just two letters, and then
not restricted to Dutch per se, but certain Dutch names regardless
of the language for the surrounding text). That, in turn, would
probably be a better approach than trying to have some special
handling of the sequence "ij" in case mapping (for Dutch alone).
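The case-mapping behaviour in question is visible directly in Python (a
sketch; plain str methods apply the untailored Unicode mappings):

```python
# U+0132/U+0133 case-map as a unit, while the two-letter sequence gets
# only one-letter-at-a-time treatment:
assert "\u0133".upper() == "\u0132"    # ij -> IJ (single character)
assert "\u0132".lower() == "\u0133"

# Untailored titlecasing of the two-letter spelling gives "Ijsberg",
# where Dutch orthography wants "IJsberg" -- hence the appeal of the
# single-character ij here:
assert "ijsberg".title() == "Ijsberg"
print("\u0133sberg".title())           # the one-character ij titlecases whole
```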

/Kent K

> I'd imagine that 
> users just type the two characters separately, and that consequently
> most data in the real world is like that.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 





Re: Case for letters j and J with acute

2016-02-09 Thread Kent Karlsson

Den 2016-02-09 16:58, skrev "Michael Everson" :

> Well, the specification should be í (or i + combining acute) + j +
> combining acute. Neither dotless i nor dotless j would be correct.

While true, using the latter (the dotless ones) tends to render better
than the dotted ones. (I.e., the Soft_dotted property is still not
well supported.)

> Or IJ (or ij) + combining double acute.

While I agree that that maybe SHOULD be fine, the ij character has not
been given the Soft_dotted property. Although, as a different matter,
using the ij character tends to make automatic case mapping work better
for the ij in Dutch...

/Kent K





Re: N'Ko - which character? 02BC vs. 2019

2015-02-02 Thread Kent Karlsson

Den 2015-02-02 19:36, skrev "Michael Everson" :

> Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of the same width, as
> U+2018. I believe this really must be considered good practice. In these

I think you mean 33 % taller, i.e. height 133 % relative to its "normal"
height. 133 % taller would be more than double its normal height, making
it about as tall as an uppercase letter... That would be excessive...

/Kent K


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Contrastive use of kratka and breve

2014-07-02 Thread Kent Karlsson
Sounds to me like what you really want is two different breve
characters
(assuming that the distinction is real and intentional, and not
happenstance).
That would require encoding a new combining character, AFAICT...

/Kent K
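The CGJ behaviour referred to in the quoted exchange below can be checked
with Python's unicodedata (a sketch):

```python
import unicodedata

# и + combining breve composes to the precomposed й under NFC:
assert unicodedata.normalize("NFC", "\u0438\u0306") == "\u0439"
assert unicodedata.normalize("NFD", "\u0439") == "\u0438\u0306"

# Interposing CGJ (U+034F) blocks canonical composition, so the sequence
# is NOT canonically equivalent to й and survives normalization intact --
# a renderer may not substitute the precomposed glyph:
blocked = "\u0438\u034f\u0306"
assert unicodedata.normalize("NFC", blocked) == blocked
print("CGJ blocks canonical composition")
```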



Den 2014-07-02 20:48, skrev "Leo Broukhis" :

> Jukka,
> 
> If the font happens to have lunar breve at U+0306, whereas the letter й has
> the rounded bowl breve, using CGJ should guarantee to achieve distinctive
> rendering, because <и, CGJ, U+0306> is not canonically equivalent to  <и,
> U+0306> (cf. "The sequences  and  are not
> canonically  equivalent.") and therefore the renderer must not be allowed to
> pick the glyph for й instead as its canonical composition. This is a hack, but
> a legal hack.
> 
> Leo
> 
> 
> On Wed, Jul 2, 2014 at 11:13 AM, Jukka K. Korpela  wrote:
>> 2014-07-02 20:34, Philippe Verdy wrote:
>> 
>>> CGJ would be better used to prevent canonical compositions but it won't
>>> normally give a distinctive semantic.
>> 
>> In the question, visual difference was desired. The Unicode FAQ says:
>> “The semantics of CGJ are such that it should impact only searching and
>> sorting, for systems which have been tailored to distinguish it, while being
>> otherwise ignored in interpretation. The CGJ character was encoded with this
>> purpose in mind.”
>> http://www.unicode.org/faq/char_combmark.html
>> 
>> 
>> So CGJ is to be used when you specifically want the same rendering but wish
>> to make a distinction in processing.
>> 
>> Yucca
>> 
>> 
>> 
>> 
> 
> 
> 



Re: Unicode organization is still anti-Serbian and anti-Macedonian

2014-02-17 Thread Kent Karlsson

Den 2014-02-17 10:33, skrev "Gerrit Ansmann" :

>> I don't like the idea, but one possibility would be to define Serbian glyph
>> styles by adding variation selectors.  Variation selectors are already
>> 'defined' for the decimal digits U+0030 to U+0039.  It would, however,
>> mess up string comparison operations that weren't smart enough to ignore
>> variation selectors.

>
> Also, for the variation selectors to work for the end user, it requires
> the same technologies whose lack of support is why we are discussing this
> in the first place, doesn't it? So, defining the corresponding variation
> selectors would not make the end user see the correct glyphs earlier.

Still, variation selectors would be, in the text, a very localized
indication, independent of (displaying) user's preference settings
or language declaration (from the author, in e.g. XML/HTML formats)
for the text, and variation selectors are indeed more likely to
survive operations like cut-and-paste. There would be a problem of
inserting variation selectors at all places where appropriate, though.
Spell checking functionality could, in principle at least, help with
the latter.

/Kent K





Re: Why blackletter letters?

2013-09-10 Thread Kent Karlsson
Den 2013-09-10 19:01, skrev "Asmus Freytag" :

> Good question, Jean-François.
> 
> I seem to recall that typographers may make a distinction between
> "black-letter" and "fraktur" forms, but even if they, the differences
> are typographical, not essential. For the purpose of *character*
> encoding, one would need to make a very strong rationale for disunifying
> these.
> 
> This rationale is absent in document WG2 N3907 that requests these
> characters.
> 
> Therefore, it seems these two additions should not have been made.

I would agree, and in addition,
AB3E;LATIN SMALL LETTER BLACKLETTER O WITH STROKE;Ll;0;L;;;;;N;;;;;
should have a compatibility decomposition to
00F8;LATIN SMALL LETTER O WITH STROKE;Ll

Cmp.
1D52C;MATHEMATICAL FRAKTUR SMALL O;Ll;0;L;<font> 006F;;;;N;;;;;

/Kent K
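The asymmetry being pointed out is visible in Python's unicodedata (a
sketch):

```python
import unicodedata

# The mathematical Fraktur o carries a <font> compatibility mapping:
assert unicodedata.decomposition("\U0001d52c") == "<font> 006F"
assert unicodedata.normalize("NFKC", "\U0001d52c") == "o"

# U+AB3E has no decomposition at all -- the point of the complaint above:
assert unicodedata.decomposition("\uab3e") == ""
print("1D52C decomposes; AB3E does not")
```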








Re: Why blackletter letters?

2013-09-10 Thread Kent Karlsson

Den 2013-09-10 20:34, skrev "Whistler, Ken" :

> Items listed there in green are still under ballot in ISO, while items
> listed in yellow are not yet in ballot in ISO. For those, input is still
> useful.
> 
> If the entry is listed in white, forget it. Those items are already too late
> to impact the character name or code point for.

But for the characters that are new in Unicode 7, they can still get
canonical and compatibility decompositions.

/Kent K





Re: Shaping Hangul text with U+115F and/or U+1160

2013-03-18 Thread Kent Karlsson
First of all, are the fonts you are referring to able to at all handle
conjoining Jamo? If not, the question is moot.

Fonts that do not support conjoining Jamo, often show them as if they
were non-conjoining Jamo. But that is a bad idea, since the non-conjoining
Jamo have separate code points (for legacy reasons, but still useful).

Assuming that the fonts (and rendering mechanism) can otherwise handle
conjoining Jamo, the Jamos in a Hangul syllable are rendered "together"
in a syllable block. The "fillers" are then just used to fulfill the syntax
for the Hangul syllables, and do not produce any "ink" in the Hangul
syllable block. Effectively, all trailing conjoining Jamos and all
vowel Jamos are then "non-spacing".

That the fillers are "default ignorable" means that if you cannot handle
them properly, then do not display them. That goes for most control codes
(most of which are legacy, and are no longer interpreted) and format
controls (they should not render, whether you can interpret them or not).
It goes also for the Hangul filler characters, because they are there
for syntactic reasons, and should not render whether you can support them
properly or not.
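The composition behaviour described can be checked with Python's
unicodedata (a sketch; the actual display is of course up to the font and
renderer):

```python
import unicodedata

# Conjoining jamo compose algorithmically into syllable blocks under NFC:
assert unicodedata.normalize("NFC", "\u1100\u1161") == "\uac00"  # 가

# The fillers lie outside the composition ranges, so a "defective"
# syllable stays as separate code points; the filler contributes no ink:
defective = "\u115f\u1161"
assert unicodedata.normalize("NFC", defective) == defective
print("fillers take no part in syllable composition")
```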

/Kent Karlsson



Den 2013-03-18 16:34, skrev "Konstantin Ritt" :

> Hi Philippe,
> 
> thanks for your reply.
> I was confused by http://www.unicode.org/faq/unsup_char.html , which states
>> All default-ignorable characters should be rendered as completely invisible
>> (and non advancing, i.e. "zero width"), if not explicitly supported in
>> rendering.
> Do I understand correctly that, if the choseong filler is used when
> there's no leading consonnent before a medial vowel, it should be
> rendered as visible; otherwise become non-advancing
> (similarly, if the jungseong filler is used to replace a missing
> medial or final vowel, it should be rendered as visible; otherwise
> become non-advancing) ?
> i.e. , or , or  -- should the
> fillers be rendered as non-advancing?
> 
> regards,
> Konstantin
> 
> 2013/3/18 Philippe Verdy :
>> The "Default ignorable" property has nothing to do with rendering or
>> being zero-width, it's just a matter of collation (comparing strings
>> for similarity, for plain-text searches, or sorting them), it does not
>> necesarily mean that the character is zero-width (that's a rendering
>> property).
>> 
>> Characters that are "default ignorable" may still have an effect on
>> cluster boundaries used when editing texts, if you count manually the
>> number of zero-width characters (by pressing the left or right arrow
>> function keys.) As long as the rendering is correct, editors may allow
>> you to place insertion points between them.
>> 
>> U+115F is the choseong filler (used when there's no leading consonant
>> to place before a medial vowel),  U+1160 is the jungseong filler (used
>> to replace a missing medial or final vowel).
>> 
>> You're right when saying that there should be two clusters in
>> , 
>> - The first one is an isolated vowel A, it should become spacing but
>> U+115F is just used as an invisible holder for the vowel,
>> - The second one is an isolated consonant KAPYEOUNPIEUP, and the
>> U+1160 filler will remain invisible except that it is used here so
>> that it explicitly terminates the cluster if it was followed by a
>> leading consonant or directly by a "defective" vowel.
>> 
>> But the U+115F and U+1160 Hangul fillers remain default ignorable in
>> collation. And there's no bug about this.
>> 
>> 2013/3/18 Konstantin Ritt :
>>> 2013/3/18 Konstantin Ritt :
>>>> The user reports Korean text rendering issue with any modern Hangul
>>>> font when U+115F and U+1160 are handled like default_ignorable code
>>>> points.
>>>> [quote]With input string "U+115F U+1161 U+112B U+1160",  we get three
>>>> zero-width glyphs instead of two; this is wrong.[/quote]
>>>> I did check some Hangul font and found that either U+115F or U+1160
>>>> zero-advances, not both. When handling them like default ignorable,
>>>> the rendered text seems to lack some advancing.
>>>> Since I know nothing about Korean typography, I'd like to ask here:
>>>> what is the reason for U+115F and U+1160 to be default ignorables and
>>>> shouldn't that be revised?
> 





Re: Missing geometric shapes

2012-11-11 Thread Kent Karlsson

Den 2012-11-11 23:08, skrev "Doug Ewell" :

> Personal opinions follow.
> 
> It looks like the only actual use case we have, exemplified by the xkcd
> strip, is for a star with the left half black and the right half white.
> There *might* also be a case for the left-white, right-black star.
> 
> Everything else, including one-quarter and three-quarter stars,


> rendering tomatoes or doughnuts or film reels as "glyph variants" of
> stars, 

They should certainly **NOT** be treated as glyph variants of stars! Ever!

> facilitating a right-to-left rating system for Arabic- or
> Hebrew-speaking environments,

Naa. Recall that these symbols (whether of g.c. So or Sm)
have bidi category ON (other neutral). So a string of stars,
presumably starting with black stars (0 or more) and ending
in white stars (0 or more) with possibly a "half" star in between,
would automatically be "reversed" (displayed right to left) via
the bidi algorithm when they occur in a right-to-left context.
Having the "half" star then get the "wrong half" of it black
would be annoying, at least...

> or turning Unicode into a standard for rating systems in general,

Here I agree. (Not sure why that branch of this thread is still ongoing...)

/Kent K


> is a complete flight of fancy by comparison
> to Jörg's original post.
> 
> I think in this case, as in many others, one introductory, exploratory
> proposal would be worth ten thousand speculative mailing-list posts.
> 
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
> 
> 






Re: Missing geometric shapes

2012-11-08 Thread Kent Karlsson

Den 2012-11-09 01:22, skrev "Asmus Freytag" :

> On 11/8/2012 3:40 PM, Philippe Verdy wrote:
>> Usually, we see the high ratings displayed as multiple stars, that are
>> either present or absent, but rarely half filled.
> 
> Half filled stars are relatively common, whenever there are fractional
> star ratings possible.
> 
> Stars are among the most common sets of symbols used for this kind of
> rating by "repetetive" symbols.

And indeed repeating a logo is quite common (even using half logos), or
suns (and half suns), or... That does not mean that Unicode should start
encoding logos, or regard logos as glyph variants of stars (or "rating
stars" or whatever).

/Kent K

> That there are many other ways to do rankings is besides the point.
> Stars (and half stars) are known to be in use (and can presumably be
> documented by whoever makes a proposal). That's enough to go on in
> deciding on encoding *those symbols*.
> 
> That some people use VULGAR FRACTION ONE HALF together with stars, or a
> plus sign or some other method is also besides the point. Unicode
> provides the symbols that allow people to write what they need to write,
> not what they "ought" to use.
> 
> A./
> 
> 





Re: Missing geometric shapes

2012-11-08 Thread Kent Karlsson



Den 2012-11-09 00:09, skrev "Michael Everson" :

> On 8 Nov 2012, at 22:54, Kent Karlsson  wrote:
> 
>>>>> 2605;BLACK STAR;So;0;ON;N;
>>>>> 2606;WHITE STAR;So;0;ON;N;
>> 
>> The *chart* glyphs for these aren't same-sized (outer outline)…
> 
> So?

It is quite common to fill up to the max rating (whichever that may
be in any one instance) with unfilled ("white") stars, rather than
just leave them out. In that case it looks better if the stars are
all the same size (outer outline). But that would be moot for those
two characters if one instead does the suggestion I had further down
in my message.

(The chart glyph sizes are often reflected in the sizes in other fonts
than the chart font(s). So you may want to do a minor chart glyph fix,
hoping other fonts get maintenance on that point.)

/Kent K

> Michael Everson * http://www.evertype.com/
> 
> 
> 






Re: Missing geometric shapes

2012-11-08 Thread Kent Karlsson

Den 2012-11-08 14:34, skrev "Asmus Freytag" :

> On 11/8/2012 2:27 AM, "Martin J. Dürst" wrote:
>> On 2012/11/08 19:15, Michael Everson wrote:
>>> On 8 Nov 2012, at 09:59, Simon Montagu  wrote:
>>> 
 Please take into account that the half-stars should be
 symmetric-swapped in RTL text. I attach an example from an
 advertisment for a movie published in Haaretz 2 November 2012
>>> 
>>> I don't think Geometric Shapes have the mirror property.
>>> 
>>> 2605;BLACK STAR;So;0;ON;N;
>>> 2606;WHITE STAR;So;0;ON;N;

The *chart* glyphs for these aren't same-sized (outer outline)...

>> Well, those are usually symmetric, so adding a mirror property
>> wouldn't change much.
>> 
>>> In a Hebrew context you'd just choose the star you wanted
>>> (black-white vs white-black) and use it.
>> 
>> That works well if the text is written by hand. If it is produced as
>> part of a script that better work the same for many languages,
>> symmetric swapping would really be very helpful.
> 
> That may be, but that train has left the station a long time ago.
> 
> The problem is that there are related symbols for which mirroring is not
> defined, and defining it for the half stars would make for an
> inconsistent solution, unless any other vertically divided symbols were
> mirrored as well.

Well, define 3 (4?) brand new characters of g.c. Sm, and the "half" one(s)
(and quarter ones, if those are included too) have the bidi mirrored
property... There are plenty of g.c. Sm chars that are bidi mirrored.
(E.g. 27E2-27E3,  ⟢ ⟣ , which are four-pointed-starry.)

(note: 22C6;STAR OPERATOR;Sm;0;ON;N; has a too small glyph for this)

Maybe that is cheating a bit, but don't be surprised if someone actually
starts using them as math symbols (if and when included)...
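The Bidi_Mirrored property under discussion is exposed by Python's `unicodedata`; a quick check of the existing characters (illustrative only):

```python
import unicodedata

# Paired delimiters mirror in RTL context; the existing star symbols
# do not, which is the gap the proposed Sm characters would fill.
print(unicodedata.mirrored("("))       # paired delimiter: mirrored
print(unicodedata.mirrored("\u2605"))  # BLACK STAR: not mirrored
```

The first call returns 1, the second 0.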

/Kent K

> I don't believe any of them are.
> 
> Adding mirroring after a character has been encoded effectively renders
> all existing documents incompatible, so it's not a property you can change
> mid-stream.
> 
> A./
> 






Re: Small i with/out dot and with arrow

2012-08-03 Thread Kent Karlsson

TUS 6.1 says:

"P9 [Guideline] When a nonspacing mark is applied to the letters i and j or
any other character with the Soft_Dotted property, the inherent dot on the
base character is suppressed in display."

Well, the term non-spacing mark is too wide here. Non-spacing marks include
marks below and more. This only applies to marks directly above, cc 230.


Later in TUS 6.1:
"Dotless Characters. In the Unicode Standard, the characters "i" and "j",
including their variations in the mathematical alphabets, have the
Soft_Dotted property. Any conformant renderer will remove the dot when the
character is followed by a nonspacing combining mark above." ...
That is (almost) correct: "combining mark [*directly*!] above" is cc 230.


UAX 44 has:
"Characters with a "soft dot", like i or j. An accent placed on these
characters causes the dot to disappear." That is sort of correct, but
apparently open to misinterpretation. This goes for all cc 230 ("Distinct
marks directly above"), and is not open to anyone's own interpretation of
what an "accent" (or "diacritic") is. (Thus, e.g., "Other_math" plays
*no* role here.) And: 20D7;COMBINING RIGHT ARROW ABOVE;Mn;230;..., U+20D7
does have cc 230.
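The rendering decision can be sketched as follows. Note that Soft_Dotted is not exposed by Python's stdlib, so the small set below is a hand-picked stand-in for the real property (assumption for illustration); the ccc check uses the stdlib directly:

```python
import unicodedata

# Illustrative subset of Soft_Dotted characters (the full property list
# lives in the UCD's PropList.txt; this is NOT complete).
SOFT_DOTTED = {"i", "j", "\u012F", "\u0268"}

def dot_suppressed(base, mark):
    """True if the base's inherent dot should be suppressed: the base is
    Soft_Dotted and the mark is *directly above* (canonical combining
    class 230), per the guideline quoted above."""
    return base in SOFT_DOTTED and unicodedata.combining(mark) == 230

print(dot_suppressed("i", "\u20D7"))  # COMBINING RIGHT ARROW ABOVE (cc 230): True
print(dot_suppressed("i", "\u0323"))  # COMBINING DOT BELOW (cc 220): False
```

The second case shows why "nonspacing mark" is too wide: a mark below (cc 220) leaves the dot intact.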


And yes, my original suggestion to the UTC did include "cc 230". Not sure
why that formulation sometimes seems to have deteriorated.


/Kent K



Den 2012-08-02 03:09, skrev "Leo Broukhis" :

> Kent,
> 
> No, 20D7 is not a Diacritic, it is Other_Math, therefore the dot should
> remain.
> In general, mathematical combining characters are not diacritics.
> 
> Renderers that treat "combining" as a synonym for "diacritic" and
> remove the dot are in error.
> UAX 44 says, "Characters that linguistically modify the meaning of
> another character to which they apply. Some diacritics are not
> combining characters, and some combining characters are not
> diacritics."
> 
> Leo
> 
> On Wed, Aug 1, 2012 at 11:53 AM, Kent Karlsson
>  wrote:
>> 
>> Den 2012-08-01 19:41, skrev "Andreas Prilop" :
>> 
>> 
>>> Is it correct that
>>> 
>>>   U+0069 U+20D7
>>>   U+006A U+20D7
>>> 
>>> should have a dot
>> 
>> No, they are soft-dotted:
>> 0069..006A; Soft_Dotted # L&   [2] LATIN SMALL LETTER I..LATIN SMALL
>> LETTER J
>> which means that the inherent dot should be removed if a diacritic above the
>> letter
>> is added, which it is in your examples. However, I have yet to see a system
>> that
>> handles this correctly...
> 






Re: Small i with/out dot and with arrow

2012-08-01 Thread Kent Karlsson

Den 2012-08-01 19:41, skrev "Andreas Prilop" :

> Is it correct that
> 
>   U+0069 U+20D7
>   U+006A U+20D7
> 
> should have a dot

No, they are soft-dotted:
0069..006A; Soft_Dotted # L&   [2] LATIN SMALL LETTER I..LATIN SMALL
LETTER J
which means that the inherent dot should be removed if a diacritic above the
letter
is added, which it is in your examples. However, I have yet to see a system
that
handles this correctly...

But note all the canonical decompositions that have i as base and a
diacritic above, like
U+00EC, LATIN SMALL LETTER I WITH GRAVE, canonical decomposition: 0069 0300,
and there is no dot above in the rendering of a LATIN SMALL LETTER I WITH
GRAVE.

If you really want to keep the dot above for i or j (or other soft-dotted
character),
you should add the dot explicitly, like in the named sequence
LATIN SMALL LETTER I WITH DOT ABOVE AND ACUTE;0069 0307 0301
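Both points are easy to verify with Python's `unicodedata` (an illustrative check, not part of the original message):

```python
import unicodedata

# U+00EC decomposes canonically to <U+0069, U+0300>; the soft dot is
# suppressed under the grave in rendering.
print(unicodedata.normalize("NFD", "\u00EC") == "i\u0300")  # True

# To keep the dot, add U+0307 explicitly, as in the named sequence
# <U+0069, U+0307, U+0301>; there is no precomposed form, so NFC
# leaves the sequence unchanged.
seq = "i\u0307\u0301"
print(unicodedata.normalize("NFC", seq) == seq)  # True
```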

> and that
> 
>   U+0131 U+20D7
>   U+0237 U+20D7
>   U+006B U+20D7
> 
> should have no dot?

Not sure why you include "k" here (which has no dot any which way)...

But the two first ones in the second group should nominally look the same
as the two in the first group. However, it is not to be recommended to use
<U+0131, U+20D7> nor
<U+0237, U+20D7>
(one might consider some kind of "error rendering").

/Kent K



Re: Unicode 6.2 to Support the Turkish Lira Sign

2012-05-27 Thread Kent Karlsson

Somebody wrote:
> For the ₤ we can define EXACTLY what it is: a scriptive capital Latin L with
> a double crossbar, in this very combination standing in for the term “Lira”
> (derived from Latin “libra”), meaning a monetary unit of that same name.

₤ (and £, which should have been the same character) is actually a
degenerate script lb (℔), so is #. (And "libra" = "pound", just
different languages.)

B.t.w., most fonts get the glyph for ℔ wrong, probably due to a mistake
in the example glyph in older Unicode charts (and a slightly misleading
name for that character). Please fix your fonts...

/Kent K






Re: Kaktovik Inupiaq numerals

2012-04-29 Thread Kent Karlsson

Den 2012-04-28 12:50, skrev "Richard Wordingham"
:

> On Fri, 27 Apr 2012 13:50:15 -0700
> Ken Whistler  wrote:
> 
>> On 4/27/2012 10:45 AM, Richard Wordingham wrote:
>>> If they are to be adopted by the CLDR, the digits need to be coded
>>> consecutively.
>> 
>> I doubt this matters in any case, because this proposed use is for
>> a vigesimal system, which has digits 0..19, not digits 0..9. Trying to
>> treat the first 10 digits as decimal digits in CLDR could accomplish
>> nothing, IMO.
> 
> I don't believe the exclusion of non-decimal bases is set in stone.
> So, while they wouldn't fit in to CLDR as it stands now, it would not
> take a huge change to add them.

CLDR used to require sequentially encoded decimal digits, but my
understanding is that that is no longer the case. And indeed, the
numeral systems need not be decimal, or even positional. Roman numerals
are supported, as are (e.g.) Armenian numerals, and traditional
Chinese numerals (non-positional, using multiplier words).

While vigesimal systems aren't supported (in CLDR) to the degree
that any got *named*, in the way some other systems have been, there
is still *some* support. See e.g.
 http://unicode.org/cldr/trac/browser/trunk/common/rbnf/nci.xml
(a full-fledged vigesimal system in those rules) for spelling out
numbers as words in Classical Nahuatl. There is also
 http://unicode.org/cldr/trac/browser/trunk/common/rbnf/kl.xml,
for spelling out numbers in Kalaallisut (Greenlandic), but it is not
full-fledged vigesimal.

These RBNF rules are based on what I could find out from sources on
the web a few years ago. If anyone has corrections/extensions/variation
to these, or additions for other languages using vigesimal systems (yes,
I did see that there was some data on the Wikipedia pages referenced),
please send them to me, preferably with contact information to someone
"in the know", and I'll see what I can do. I cannot use vigesimal digits,
though, since none are as yet encoded. But if some set of vigesimal
digits were to be encoded, supporting them via RBNF would likely be the
first point of support in CLDR.
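Encoding the twenty digits atomically amounts to treating them as an ordinary positional base-20 system; the digit extraction is a few lines (a sketch, not CLDR/RBNF code):

```python
def to_base20(n):
    """Digits of a non-negative integer in positional base 20,
    most significant first. Each digit 0..19 would map to one
    atomically encoded Kaktovik digit character."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, 20)
        digits.append(d)
    return digits[::-1]

print(to_base20(25))   # [1, 5]: one twenty and five
print(to_base20(101))  # [5, 1]: five twenties and one
```

This also shows why the sequences <ONE, FIVE> (25) and <FIVE, ONE> (101) must stay distinct from any single ligated digit such as 6.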

>> Furthermore, what Inuit has is a vigesimal *counting* system, as the
>> article indicates. But this innovated set of numerals, is attempting
>> to turn this into a full-blown radix-20 numerical system, which I
>> doubt has any cultural validity.
> 
> I presume you are talking about how the hundreds are (or were)
> traditionally expressed.
> 
>> The Inuit number system is another case of the rather widespread use
>> of mixed 5/20 counting systems, which count 4 "hands" of 5 into
>> groups of 20.
> 
> Indeed, it immediately made me think of Welsh, where native-speakers'
> use of their vigesimal system has been hammered by the use of Arabic
> numerals.  (In England, resistance to this 'heathen notation' collapsed
> long ago.)  Before anyone points it out,  I do know that Welsh _pymtheg_
> '15' and possibly even _ugain_ '20' ultimately derive from a
> (superseded) decimal system.  However, Welsh goes decimal at 100, so
> this vigesimal notation would not match the language at all for higher
> numbers.
> 
>> I don't think combining diacritics makes sense in this case. Rather,
>> this kind of construction is better handled by taking the graphic
>> elements for 5, 10, and 15, and ligating them in a font for the
>> combined units. So the only elements requiring encoding would
>> be 0, 1, 2, 3, 4, 5, 10, 15, in order to fully represent this system.
> 
> No.  One must be able to distinguish <ONE, FIVE> (= '25') and <FIVE, ONE> (= '101') from the notation for '6'.  Or are you suggesting that
> rendering of ZWJ should be *essential* for the semantics, not just for
> acceptability?

While I would have liked to have seen the use of combining characters
(or ligation) in certain other cases where it is not present in Unicode,
I think that that approach would be very inappropriate here (this is for
digits for use in a positional system). Just encode (when that time comes)
each of the new digits corresponding to 0, ..., 19 *atomically*.

The Kaktovik digits are niftily designed though, with a logic in the
(abstract) graphical design, and each of them can be drawn in a single
pen stroke.

They have found their way into some fonts
 (http://www.linguistsoftware.com/linup.htm#Kaktovik), and have some
support from the Inuit Circumpolar Council
 (http://inuitcircumpolar.com/section.php?Nav=Section&ID=10&Lang=En).

/Kent K


> The (undemonstrated) use of the notation denoting hands for which I
> suggested a combining diacritic could be handled by ligatures
> specified by ZWJ, but there could be a lot of them.  Look at the ugly
> mess in New Tai Lue caused by not anticipating the need for medial 'v'
> because the UTC knew too little about Tai Lue (or even, more
> surprisingly, Northern Thai).
> 
> Richard.
> 





Re: Sorting and German (was: Sorting and Volapük)

2012-01-02 Thread Kent Karlsson

Except that MacOS X *applications* (as opposed to more POSIXy programs,
and Terminal.app) should not use the POSIX locales, but should use the
CLDR locales (via an Apple API or via ICU)... (Yes, I know, CLDR has
POSIX locale format files covering **some** of the CLDR data...)

And ISO 8859-15? Really? I don't even find it in the list of encodings
Terminal.app supports (but maybe that is just me not finding it).
Terminal.app by default uses UTF-8.

/K


Den 2012-01-02 20:10, skrev "Steffen Daode Nurpmeso"
:

> Hi,
> 
>>> How? I am not a programmer.
> 
> Applications -> Utilities -> Terminal.app
> $ man 1 mklocale
> $ man 1 colldef
> 
>> pay somebody to do it for you
> 
> $ cd $TMPDIR
> $ mkdir c:\\vodka && cd c\:\\vodka # yes it's still Mac OS X
> $ curl 'http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/share/colldef/de_DE.ISO8859-15.src?rev=1.6.44.1.2.1;content-type=text/plain' > de_DE.ISO8859-15.src
> $ curl 'http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/share/colldef/map.ISO8859-15?rev=1.1.6.1;content-type=text/plain' > map.ISO8859-15
> $ echo adjustments are beyond my scope
> $ colldef -o VALOPUEK < de_DE.ISO8859-15.src
> $ sudo mv VALOPUEK /usr/share/locale/VALOPUEK
> $ export LC_COLLATE=VOLAPUEK
> $ echo sorting should work now
> 
>> The University of Edinburgh
> 
> Cheerio, Miss Sophie!
> (That's http://www.youtube.com/watch?v=NDqD0Dz_J-M according to
> Google, and only because there are so many germans around here;
> but happy new year to all of you, even to those which use
> a different calendar and don't drink alcohol.
> Long live small, easy and otherwise beautiful standards.)
> 
> --steffen
> 





Re: Sorting and German (was: Sorting and Volapük)

2012-01-01 Thread Kent Karlsson
Not sure why this discussion is on the Unicode list instead of the
cldr-users list... Anyhow...

While I do find (in CLDR's collation de.xml) a <collation type="phonebook">,
I don't find any variant implementing the Austrian phonebook order you mention.

Maybe you could file a ticket for adding that to CLDR. (Using that
same tailoring for Volapük is a separate matter.)

/Kent K


Den 2012-01-01 18:46, skrev "Otto Stolz" :

> In Austria, a third scheme is used in telephone directories
> (but not in the yellow pages): Here, Ä, Ö, and Ü, are
> indeed treated as distinct letters, to go between A and B,
> O and P, and U and V, respectively; and ß is treated as a
> distinct pair of letters, to go between SS and ST.
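That Austrian phonebook order for the umlauted vowels can be sketched with a toy ranking (illustration only; real tailoring belongs in CLDR/ICU collation data, and ß-between-SS-and-ST is omitted here):

```python
# Ä, Ö, Ü as distinct letters: between A and B, O and P, U and V.
ORDER = "aäbcdefghijklmnoöpqrstuüvwxyz"
RANK = {ch: i for i, ch in enumerate(ORDER)}

def at_phonebook_key(word):
    # Unknown characters sort after all ranked letters.
    return [RANK.get(ch, len(ORDER)) for ch in word.lower()]

names = ["Bahn", "Ärger", "Anker", "Öl", "Oper"]
print(sorted(names, key=at_phonebook_key))
# ['Anker', 'Ärger', 'Bahn', 'Oper', 'Öl']
```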






Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-20 Thread Kent Karlsson

Den 2011-11-20 20:50, skrev "Peter Constable" :

> Note that UTR 20 discusses semantic and presentation effects that are suitable
> for representation as characters versus markup and makes the point that, in
> XML, effects that involve spans of text should be represented using markup
> rather than characters that set and unset state. Those are, of course,
> recommendations about a markup language, not plain text. But the argument used
> works in both directions: things that involve spans of text are best handled
> as markup, while things that are very local (e.g. spanning no more than a
> grapheme cluster) may be more suitable for representation as characters.

And yet we have, apart from bidi controls, characters whose effect in
various ways spans several other characters/grapheme clusters:

0600;ARABIC NUMBER SIGN;Cf;0;AN;N;
0601;ARABIC SIGN SANAH;Cf;0;AN;N;
0602;ARABIC FOOTNOTE MARKER;Cf;0;AN;N;
0603;ARABIC SIGN SAFHA;Cf;0;AN;N;
0604;ARABIC SIGN SAMVAT;Cf;0;AN;N;

06DD;ARABIC END OF AYAH;Cf;0;AN;N;

070F;SYRIAC ABBREVIATION MARK;Cf;0;AL;N;

FFF9;INTERLINEAR ANNOTATION ANCHOR;Cf;0;ON;N;
FFFA;INTERLINEAR ANNOTATION SEPARATOR;Cf;0;ON;N;
FFFB;INTERLINEAR ANNOTATION TERMINATOR;Cf;0;ON;N;

110BD;KAITHI NUMBER SIGN;Cf;0;L;N;

1D173;MUSICAL SYMBOL BEGIN BEAM;Cf;0;BN;N;
1D174;MUSICAL SYMBOL END BEAM;Cf;0;BN;N;
1D175;MUSICAL SYMBOL BEGIN TIE;Cf;0;BN;N;
1D176;MUSICAL SYMBOL END TIE;Cf;0;BN;N;
1D177;MUSICAL SYMBOL BEGIN SLUR;Cf;0;BN;N;
1D178;MUSICAL SYMBOL END SLUR;Cf;0;BN;N;
1D179;MUSICAL SYMBOL BEGIN PHRASE;Cf;0;BN;N;
1D17A;MUSICAL SYMBOL END PHRASE;Cf;0;BN;N;
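The property values in the list above can be spot-checked with Python's `unicodedata` (illustrative only):

```python
import unicodedata

# All of these are format controls (General_Category Cf); the Arabic
# prefixed signs additionally carry Bidi_Class AN, the musical
# begin/end controls BN.
for cp in ("\u0600", "\u070F", "\uFFF9", "\U0001D173"):
    print(f"U+{ord(cp):04X}",
          unicodedata.category(cp),
          unicodedata.bidirectional(cp))
```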

Ok, Ruby does already have XHTML/(HTML5) markup that seems better.

/Kent K
 
> Peter





Re: N4106

2011-11-07 Thread Kent Karlsson

Den 2011-11-07 10:34, skrev "vanis...@boil.afraid.org"
:

>> So despite being given (as proposed) vanilla above/below mark properties,
>> they do not "stack" the
>> way such characters normally do, but is supposed to invoke an entirely new
>> behaviour.
> 
> I agree, except that if we give them any but a ccc=220/230, then canonical
> reordering will separate them from the modifier letters that they are attached

Nit: modifier letters (as that term is used in Unicode) are not combining
marks; here you mean combining marks.

> to. I think this is one of those cases where a definition needs to expand in
> order to accommodate architecture. We do already have some non-stacking
> behaviour defined for these characters in order to accommodate polytonic
> Greek, 
> so we do have some experience with disparate appearances of consecutive marks.

Yes, but that they have special behavior needs to be made explicit.

>> That supposedly stacking combining marks *sometimes* (more a font dependence
>> than a character
>> dependence) don't stack but instead are laid out linearly is not new. But to
>> *require* non-stacking
>> behaviour for certain characters is new.
> 
> Then think of it as the "non-spacing" version of stacking behaviour.

Would not be sufficient. See below.

>> So we have a combination of:
>> 
>> 1. Splitting. (Normally only used for some Indic scripts).
>> 
>> 2. Indeed splitting with no other characters to use for the decomposition,
>> thus requiring the use of
>>PUA characters, to stay compliant, for representing the result of the
>> split at the character level.
>>(This is entirely new, as far as I can tell.)
> 
> I cannot imagine in any way how this requires PUA characters.

Splitting is usually done at the character level... I know, some say that
this should always be done at the glyph level (somehow), but IIUC that is
not so in practice. And I think it is preferable to do it at the character
level, so that is not just handwaved away (oh, the font should do this...)
leaving it up to each and every font designer to do this odd-ball extra
(and thus won't be done most of the time, even if the font framework may
support it). Laying out linearly instead of stacking is quite enough
odd-ball extra.

>> 3. The split is entirely *within* the sequence of combining characters
>> (except for COMBINING
>>PARENTHESES OVERLAY, which behaves as split vowels normally do, but still
>> with issue 2), not
>>around the combining sequence including the base. (This is entirely new.)
>> 
>> 4. Requiring (if at all supported) to use linear layout of combining
>> characters instead of stacking.
>>(This is entirely new.)
> 
> If I were designing a font, I would simply make the in/out mark attachment
> point near the top/middle of the parentheses, so that it drops down around the
> "base" mark, and then attaches any subsequent marks as if the parentheses
> weren't there. I think you're making this too complicated.

But glyphs for combining marks may be of different widths, for example a
(glyph for a) dot below is much narrower than a (proposed) wiggly line
below. Or, consider LENIS MARK and DOUBLE LENIS MARK (both for Teuthonista,
and both apparently used together with parentheses). The usual, and general,
way of handling that is to actually split the
character-that-goes-on-both-sides of something that may have different
widths in different instances. Of course you also need width info for
combining marks. I would still consider splitting to be a needless
complication here, and instead encode begin/end pairs of combining
parentheses instead of what is in N4106.

> 
>> This makes these proposed characters entirely unique in their display
>> behaviour, IMO.
> 
> I do, however, agree totally with this assessment, I just believe it is more
> manageable than you paint it.
> 
> [snip]
>> /Kent K 
> 
> I do, myself, have a couple of concerns in regards to several proposed
> characters in N4106 as well. Namely, I believe that U+1DF2, U+1DF3, and U+1DF4
> should require significant justification as to why they should not be encoded
> as U+0363 + U+0308, U+0366 + U+0308, and U+0367 + U+0308.

There is the issue of whether the diaeresis applies to the base letter (plus
something) or if it applies to the combining mark just under the diaeresis.

/Kent K


> I have similar 
> concerns about U+A799, U+AB30, U+AB33, U+AB38, U+AB3E, U+AB3F, etc.
> 
> Van A
> 
> 





Re: N4106

2011-11-06 Thread Kent Karlsson

Den 2011-11-05 04:23, skrev "António Martins-Tuválkin" :

> I'm going through N4106 ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4106.pdf ),
...

I see the following characters being put forward for proposing to be
encoded:

1ABB COMBINING PARENTHESES ABOVE
1ABC COMBINING DOUBLE PARENTHESES ABOVE
1ABD COMBINING PARENTHESES BELOW
1ABE COMBINING PARENTHESES OVERLAY

Well, COMBINING DOUBLE PARENTHESES ABOVE seems to be the same as
<COMBINING PARENTHESES ABOVE, COMBINING PARENTHESES ABOVE>. And COMBINING
PARENTHESES OVERLAY seems to be just a tiny parenthesis before and a tiny
parenthesis after; no need for a combining mark, especially one with a
splitting behaviour.

Otherwise, I think COMBINING ((DOUBLE)) PARENTHESES ABOVE/BELOW are an
entirely new brand of
characters in Unicode (if accepted as proposed). They are supposed to split
(ok, we have split
vowels in some Indic scripts, more on that below), but these split around
*another combining mark*.
So despite being given (as proposed) vanilla above/below mark properties,
they do not "stack" the
way such characters normally do, but are supposed to invoke an entirely new
behaviour.

Split vowels are not new, but they split around base characters (or more
generally, around combining
sequences), not around (a) combining character(s) only. Indeed, one can
split these vowels into two
characters (sometimes by canonical decomposition, when done right; sometimes
by cheating a bit and
split into another character and the supposedly split vowel character but
not interpreted as the
second part of the decomposition; in principle one may need to cheat even
more and use PUA characters
in order to do this at the character level, but then that is really bad).

That supposedly stacking combining marks *sometimes* (more a font dependence
than a character
dependence) don't stack but instead are laid out linearly is not new. But to
*require* non-stacking
behaviour for certain characters is new.

So we have a combination of:

1. Splitting. (Normally only used for some Indic scripts).

2. Indeed splitting with no other characters to use for the decomposition,
thus requiring the use of
   PUA characters, to stay compliant, for representing the result of the
split at the character level.
   (This is entirely new, as far as I can tell.)

3. The split is entirely *within* the sequence of combining characters
(except for COMBINING
   PARENTHESES OVERLAY, which behaves as split vowels normally do, but still
with issue 2), not
   around the combining sequence including the base. (This is entirely new.)

4. Requiring (if at all supported) to use linear layout of combining
characters instead of stacking.
   (This is entirely new.)

This makes these proposed characters entirely unique in their display
behaviour, IMO.

This could be alleviated by encoding COMBINING BEGIN/END PARENTHESIS
ABOVE/BELOW.
That way the issues with split, as listed above, can be avoided. There is
still the issue of requiring
(when at all supported) linear layout instead of stacking. But at least that
is a lesser concern.

In summary, I'd propose replacing the four problematic proposed characters
above with:

COMBINING BEGIN PARENTHESES ABOVE (or LEFT)
COMBINING END PARENTHESES ABOVE (or RIGHT)

COMBINING BEGIN PARENTHESES BELOW (or LEFT)
COMBINING END PARENTHESES BELOW (or RIGHT)

BASELINE SMALL BEGIN PARENTHESES (or LEFT)
BASELINE SMALL END PARENTHESES (or RIGHT)
(or MODIFIER LETTER instead of BASELINE; the latter two are not combining)

/Kent K



Re: Arabic date format and Microsoft programs

2011-10-18 Thread Kent Karlsson
Just in case somebody has missed this:

There is a public review issue very much related to this thread:
http://www.unicode.org/review/pri205/, Proposed addition of AL MARK
and LEVEL DIRECTION MARK.
The latter proposed addition is partially motivated by date format
direction issues.

/Kent K


Den 2011-10-17 13:00, skrev "Eli Zaretskii" :

>> Date: Mon, 17 Oct 2011 10:09:06 +0100
>> From: "Peter Krefting" 
>> 
>> Eli Zaretskii :
>> 
>>> However, it could be that the confusion is mine, and it stems from the
>>> fact that the logical order of these characters was not stated by the
>>> OP.  Is it
>>> 
>>>  1999/12/31
>>> 
>>> or
>>> 
>>>  31/12/1999
>>> 
>>> ?
>> 
>> The logical order in the document that was cited is 1999/12/31
>> (١٩٩٩/١٢/٣١). I just did a pen-and-paper run of the bi-di algorithm, and
>> it does look to me as if the 1999/12/31 rendering is the correct one, even
>> with the paragraph set to right-to-left in HTML.
> 
> If the logical order is 1999/12/31, then you are right.  I'm sorry for
> confusion I caused; for some reason I thought the logical order was
> the other way around.
> 
> Sorry.
> 






Re: about P1 part of BIDI alogrithm

2011-10-11 Thread Kent Karlsson

Den 2011-10-11 09:43, skrev "Eli Zaretskii" :

> Let me give you just one example: if the character should be mirrored,
> you cannot decide whether it fits the display line until _after_ you
> know what its mirrored glyph looks like.  But mirroring is only
> resolved at a very late stage of reordering, so if you want to reorder
> _after_ breaking into display lines, you will have to back up and
> reconsider that decision after reordering, which will slow you down.

Well, I think there is a silent (but reasonable, I would say) assumption
that mirroring does not change the width of a glyph... I would think that if
a font does not fulfill that, then you have a font problem (or mix of fonts
problem), not a bidi problem. Glyphs for characters that may mirror do not
normally form ligatures with other glyphs; and even if they do, the width of
the ligature should not change relative to the total width of the pre-ligature
glyphs involving glyphs for mirrorable characters (and if it does change
anyway, you again have a font problem that may result in a somewhat ugly
display that should be fixed by fixing the font, not a bidi problem). I'm
not thinking about Emacs here, but in general.

IMHO
/Kent K





Re: Need for Level Direction Mark

2011-09-22 Thread Kent Karlsson



Den 2011-09-22 10:54, skrev "Philippe Verdy" :

> 2011/9/21 Richard Wordingham :
>> LRE...PDF acts like a character with BiDi class L, and likewise for
>> RLE...PDF.  I suppose the principle is that in a right-to-left context a
>> word composed of letters of BiDi class L should be treated like an
>> embedding.
> 
> That's where I think this behavior is wrong. this shoud just set the
> direction to be used internally, hiding this detail to the outside.
> Both sequences should behave like if this was a single character with
> weak Bidi class. Otherwise they are not really "embedding". Even the
> name "pop directional format" is misleading in this case because it
> actually does not restore the state that was before the state pushed
> by LRE/RLE.
> 

I think LRE...PDF (and similarly for the other start bidi bracketings)
should behave as if they had an inherent LDM (LEVEL DIRECTION MARK);
as I have hinted before.

If one changes LRE etc. to have an inherent LDM *functionality*, an
actual character for LDM is not needed, *nor* is a new bidi category
needed. The function of an LDM character can then be achieved by
 (or , , or ); note: empty
string between the start and end bidi control codes.

I still think there are plenty of other reasons to go for a UBA v.2;
also the change suggested here is probably best done in a UBA v.2
rather than in the current UBA.

/Kent K






Re: Need for Level Direction Mark

2011-09-19 Thread Kent Karlsson

Den 2011-09-19 04:53, skrev "Peter Edberg" :

> Philippe,
> 
> On Sep 17, 2011, at 12:54 PM, Philippe Verdy wrote:
> 
...
>> 
>> Note also that there's no way to specify a weak direction for the
>> internal content of embedded fields, as we don't have the WDE..PDF
>> mechanism described in the UBA for now (but may be we could emulate it
>> using RLE,B..PDF or LRE,B..PDF (but with which B character ?).
> 
> Sorry, which section of the UBA are you referring to that describes WDE (Weak
> Direction Embedding)?

I would guess none, as this was, IIUC, a suggestion for the same thing as:

> Deborah Goldsmith has suggested a "native direction embedding" which is like
> LRE/RLE but uses the inferred primary direction of the embedded text. I will
> try to put together a proposal about this for the next UTC.

And is indeed what I suggested for "(", "[" and other beginning punctuation
(general category Ps and default for Pi) in my response to the PRI,
*without* necessarily actually having a new *control character* for it.
Ending punctuation (Pe and default for Pf) would in my suggestion act like
bidi PDF. Note that the beginning and ending punctuation also must take on
the current (surrounding) embedding directionality, which unfortunately
LRE/RLE/PDF characters by themselves don't do in current UBA; and one must
of course not do rule X9 for characters that aren't pure bidi controls.

> 
>> And I
>> also spoke about interlinear annotations (whose equivalent in HTML is
>> the ruby notation, and in CSS the abolutely positionned blocks with
>> "display:inline-block;position:absolute") which suffer the same
>> problem in the UBA (note that ruby notations are frequently used to
>> insert interlinear translitterations into a different script that may
>> have a different direction than the script used in the annotated
>> text).

I would agree that INTERLINEAR ANNOTATION SEPARATOR should act (for bidi)
as  (i.e. implicitly have LDM before, and
 after), and INTERLINEAR ANNOTATION TERMINATOR should act (for
bidi) like PDF (i.e. implicitly have  before and LDM just after
the interlinear annotation terminator).

/Kent K

>> -- Philippe.
> 
> - Peter E
> 
> 
> 





Re: Need for Level Direction Mark

2011-09-14 Thread Kent Karlsson

Den 2011-09-14 19:56, skrev "Philippe Verdy" :

> 2011/9/14 Kent Karlsson :
>>> And how will you define what is an "implicit" LDM ? For example "1.2"
>> 
>> Did you actually READ my submission re. the PRI? Seems like not. There is a
>> suggestion there (which requires a bit of character contextual processing).
>> It is also possible to use a different analysis for special cases, e.g.
>> domain names or URLs (if detectable somehow, e.g. via markup).
> 
> Yes I have read it, and I'm convinced this will not work. It breaks
> the UBA in a non-conforming and incompatible way. I'm now sure that
> LDM is not even needed if the UBA is implemented correctly.

Note that my suggestion was aimed at a possible UBA v.2 (which is option 3
in the PRI). UBA (v.1) would be unchanged. It is not the case that all bidi
control characters can be avoided in all cases using my suggestion. But
a great many cases, many that surprise users, would with the implicit bidi
control approach work with much less surprise, and no need to insert
explicit bidi controls (something which is not so easy).

Back to the original issue of this thread: All the workarounds w.r.t. LDM
depend on the directionality of neighbouring characters, not directly on
the embedding level direction. Therefore I think none of them will work
properly in all cases (even though they may give the seemingly correct
result in many cases). And they all require an inordinate amount of
insertion of bidi control characters. (Much better to have *fewer* bidi
control characters and still get a desirable display.)

/Kent K





Re: Need for Level Direction Mark

2011-09-14 Thread Kent Karlsson

Den 2011-09-14 19:05, skrev "Philippe Verdy" :

> 2011/9/14 Kent Karlsson :
>> Because that stability guarantee says "The Bidi_Class property values will
>> not be further subdivided." I'm not too keen on the word "subdivided" here,
... 
> That's absolutely not the way I understand it, notably if you consider

Your interpretation seems to be quite contrary to the interpretation done by
the UTC...

> And how will you define what is an "implicit" LDM ? For example "1.2"

Did you actually READ my submission re. the PRI? Seems like not. There is a
suggestion there (which requires a bit of character contextual processing).
It is also possible to use a different analysis for special cases, e.g.
domain names or URLs (if detectable somehow, e.g. via markup).

/Kent K






Re: Need for Level Direction Mark

2011-09-14 Thread Kent Karlsson

Den 2011-09-14 03:31, skrev "Philippe Verdy" :

> 2011/9/13 Kent Karlsson :
...
>> for the new one, and to the paragraph bidi level for the three old ones). (I
>> know, this would be a form of "option 1" in the PRI.)
> 
> You can turn it as you want it is still a splitting of the bidi class
> if you change the behavior of class S like this.

I did write that it was a version of the PRIs "option 1"!

> Once again, if you
> want to encode new characters, why would you restrict yourself to
> reusing an existing bidi class just to break it?

Because that stability guarantee says "The Bidi_Class property values will
not be further subdivided." I'm not too keen on the word "subdivided" here,
but it (here) means there will be *no additions* to the set of values for
the Bidi_class property. Not even for new characters.

As far as I can tell, there is no restriction saying that the bidi algorithm
cannot look at code points as well as bidi category values.
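[Editor's note: the Bidi_Class values discussed here are easy to inspect from the standard library; a small Python sketch showing the frozen property value set in action:]

```python
import unicodedata

# Bidi_Class (unicodedata.bidirectional) for a few representative characters.
samples = {"A": "Latin letter", "\u05D0": "Hebrew alef", "1": "digit",
           "\t": "TAB (segment separator, class S)", "\u200E": "LRM"}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: Bidi_Class={unicodedata.bidirectional(ch)}")
```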

But as I pointed out in my submitted response to the PRI, the bidi algorithm
has "glaring deficiencies" that I think would be best handled by going for
the third option: "bidi v. 2", where these "glaring deficiencies" can be
addressed; to a large extent by the use of *implicit* LDMs (and *implicit*
LRE/RLEs and PDFs).

/Kent K






Re: Need for Level Direction Mark

2011-09-13 Thread Kent Karlsson
I'm not at all sure the suggested workaround works in general, and not just
in a few examples.

Another possibility, as long as we are just "brain-storming" a bit here, is
to use the bidi category S (Segment Separator) for the LEVEL DIRECTION MARK
(which would be a normally invisible (bidi) format control character). I.e.
it would work just like TAB (as specified in the UBA), except that it
wouldn't do tabbing. But then it would work only for the paragraph bidi
direction. However, the idea that TAB (and the other bidi S characters)
magically cuts through *all* nested bidi levels seems a bit strange to me...
Going just to the closest explicit embedding/(override) level seems less
drastic. Without formally subdividing "S", one could treat different "bidi
S" (old and new) to reset to different levels (to the embedding bidi level
for the new one, and to the paragraph bidi level for the three old ones). (I
know, this would be a form of "option 1" in the PRI.)

/Kent K



Den 2011-09-13 09:43, skrev "Richard Wordingham"
:

> This is a summary of what I have already submitted for Public Review
> Issue 205 (http://www.unicode.org/review/pri205/).  I am mentioning it
> here in case there is something wrong with my idea.
> 
> My basic idea is that one does not need a 'level direction mark'.  The
> desired effect can be achieved by embedding neutrals in a sequence
> LRM...RLM or RLM...LRM.  They will then take on the directionality of
> the embedding by Bidi Rule N2.
> 
> For an example, see my submission.  It may be helpful to view the source
> of the full page view, for that has examples in HTML written solely in
> ASCII.
> 
> Richard.
> 





Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-11 Thread Kent Karlsson

Den 2011-09-11 18:53, skrev "Peter Constable" :

> There's no requirement that the width of glyphs in a monospaced font be 1 em.
> I would agree, though, that if a monospaced font forms a ligature of a pair
> like <0066, 0069>, then it should be twice the width (not necessarily 2em) of
> single-character glyphs.

That's fine (assuming the ligature is well designed, in the case of a
monospace font connecting the bar of the f to the top serif of the i and
only that).

> In a monospace font, nothing prevents the glyph for FB01 being a ligature, and
> some monospaced fonts do have a ligature glyph for that character.

Fine too. But see below.

> Of course, in a monospaced font, the glyph for that character should be the
> same width as all other glyphs. So if it's not a ligature, then the "f" and
> "i" elements still need to be narrower than the glyphs for 0066 and 0069.
> 
> Hence, in a monospaced font, FB01 certainly should look different from <0066,
> 0069>, regardless of whether ligature glyphs are used in either case.

If "monospace" is interpreted that rigidly, then it is much better *not* to
have any glyph at all for FB01 (and other characters like it) in a
"monospace" font.
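[Editor's note: the fallback for a font with no glyph for FB01 is the compatibility decomposition back to the letter pair, which any conformant normalizer exposes; a quick Python check:]

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI carries a <compat> decomposition to "fi",
# so NFKC folds it back to the plain letter pair.
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"
print(unicodedata.decomposition("\uFB01"))
```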

/Kent K

> 
> Peter
> 
> -Original Message-
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf
> Of Philippe Verdy
> Sent: Saturday, September 10, 2011 10:33 PM
> To: Michael Everson
> Cc: unicode Unicode Discussion
> Subject: Re: ligature usage - WAS: How do we find out what assigned code
> points aren't normally used in text?
> 
> 2011/9/11 Michael Everson :
>> On 11 Sep 2011, at 00:23, Richard Wordingham wrote:
>> 
>>> A font need not support such ligation, but a glyph for U+FB01 must
>>> ligate the letters - otherwise it's not U+FB01!
>> 
>> Not in monowidth, it doesn't.
> 
> I also agree, a monospaced font can perfectly show the dot and ligate the
> letters, using a "double-width" (2em) ligature without any problem, or simply
> not map it at all, or choose to just map a composite glyph made of the
> 1em-width glyphs assigned to the two letters f and (dotted) i without showing
> any visible ligation between those glyphs (this being consistent with
> monospaced fonts that remove all ligations, variable advances and kernings
> between letters).
> 
> You could as well have a font design in which all pairs of Latin letters are
> joined, including in a monospaced font, in which case you should not see any
> difference between FB01 and the pair of Basic Latin letters. Joining letters
> is fully independent of the fact that the upper part of letter f may or may
> not interact graphically with the presence of a dot. If the style of letter
> glyphs does not cause any interaction, there's no reason to remove the dot
> over i or j in the "ligature" or joining letters.
> 
> You should not be limited by the common style used in modern Times-like fonts
> (notably in italic styles, where the letter f is overhanging over the nearby
> letters). Other font styles also exist that do not require adjustment to
> remove the dot, or merge it with a graphic feature of the preceding letter f
> which is specific to some fonts.
> 
> As the pair of letters f and (dotted) i is perfectly valid in Turkish, there's
> absolutely no reason why the fi ligature would be invalid in Turkish. But
> given that this character is just provided for compatibility with legacy
> encodings, I would still not recommand it for Turkish or for any other
> language, including English. This FB01 character is not necessary to any
> orthography and if possible, should be replaced by the pair of Basic Latin
> letters (and in fact I don't see any reason why a font would not choose to do
> this everywhere)
> 
> -- Philippe.
> 
> 
> 
> 





Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-10 Thread Kent Karlsson

Den 2011-09-11 01:23, skrev "Richard Wordingham"
:

> On Sat, 10 Sep 2011 23:53:34 +0200
> Kent Karlsson  wrote:
> 
>> IMO, a glyph (if any) for that compatibility character should look
>> *exactly* like an "fi" (after automatic ligature formation, if that
>> is done for "fi") in the font used. So if no ligature for "fi" is
>> formed, the glyph for U+FB01 (if any) should have a dot just like
>> "fi" would have a dot. (I know, this is not commonly the case at the
>> moment.)
> 
> A font need not support such ligation,

True.

> but a glyph for U+FB01 must
> ligate the letters -

And this "ligature" can look just like "fi" in that font.
I see no reason whatsoever that it could not.

> otherwise it's not U+FB01!

Of course it would be.

> In such a case, I do
> not see the need for the dot.

That does not follow.

/Kent K

> Richard.
> 
> 





Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-10 Thread Kent Karlsson

Den 2011-09-10 23:06, skrev "Richard Wordingham"
:

> On Sat, 10 Sep 2011 22:19:27 +0200
> Kent Karlsson  wrote:
> 
>> 
>> Den 2011-09-10 20:58, skrev "Jukka K. Korpela" :
>> 
>>> According to Oxford Style
>>> Manual, one should not use the fi ligature in Turkish, as that
>>> would obscure the distinction between normal i and dotless i (ı).
>  
>> It does not make perfect sense to me. Rather that:
> 
> I believe the point is that the glyph of fi U+FB01 LATIN SMALL LIGATURE
> FI

Which is a character that should not be used for any language. Typographic
ligatures (if any) should be formed automatically by the font (and font
handling system).

> is unsuitable for Turkish because it is normally undotted, or at
> least, the dot is barely visible. (Confusingly, my e-mail client chooses
> a dotted glyph!)

IMO, a glyph (if any) for that compatibility character should look *exactly*
like an "fi" (after automatic ligature formation, if that is done for "fi")
in the font used. So if no ligature for "fi" is formed, the glyph for U+FB01
(if any) should have a dot just like "fi" would have a dot. (I know, this is
not commonly the case at the moment.)

/Kent K

> Richard.
> 
> 






Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-10 Thread Kent Karlsson

Den 2011-09-10 20:58, skrev "Jukka K. Korpela" :

> There is a deeper language-dependency. According to Oxford Style Manual,
> one should not use the fi ligature in Turkish, as that would obscure the
> distinction between normal i and dotless i (ı). This makes perfect sense
> to me.

It does not make perfect sense to me. Rather that:

*If f followed by i is such that their font glyphs overlap (using
normal letter spacing), making a ligature appropriate, makes that
*font* unsuitable for Turkish, as such a ligature would obscure...*.

If that is what you (and other who have said the same thing) meant,
then fine. But taken at face value, your statement does not make
(typographic) sense.

/Kent K






Re: Continue:Glaring mistake in the code list for South Asian Script

2011-09-09 Thread Kent Karlsson

Den 2011-09-10 00:53, skrev "delex r" :

> I figure out that Unicode has not addressed the sovereignty issues of a
> language

Which, I daresay, is irrelevant from a *character* encoding perspective.

> while trying to devise an ASCII like encoding system for almost all
> the characters and symbols used on earth. I am continuing with my observation
> of the glaring mistake done by Unicode by naming a South Asian Script as
> ³Bengali². Here I would like to give certain information that I think will be
> of some help for Unicode in its endeavour to faithfully represent a Universal
> Character encoding standard truer to even micro-facts.
> 
> India is believed to have at least 1652 mother tongues out of which only 22

One list of languages in India is given in
http://www.ethnologue.com/show_country.asp?name=IN
(I did not count the number of entries)

> are recognized by the Indian Constitution as official languages for
> administrative communication among local governments and to the citizens. And
> the constitution has not explicitly recognized any official script. As Unicode
> has listed the languages and scripts, the Indian Constitution has also listed

Unicode does not list any languages at all. Ok, the CLDR subproject copies a
list of language codes from the IANA language subtag registry, which (in a
complex manner) takes its language codes from (among others) the ISO 639-3
registry, which largely is in sync with Ethnologue (as in the list above);
but I guess that is not what you referred to.

> the official languages ( In its 8th schedule). The first entry in that list is
> the Assamese language.  Assamese is a sovereign language with its own grammar

Which I don't think is in dispute at all.

> and ³script² that contains some unique characters that you will not find in
> any of the scripts so far discovered by Unicode. At least 30 million people

Unicode (at this stage) does not do any "discovery". Unicode and ISO/IEC
10646 is driven by applications (proposals) to encode characters (and define
properties of characters).

> call it the "Assamese Script" and if provided with computers and internet

If you want to disunify the Bengali script (and characters) from Assamese,
you need to show, in a proposal document, that they really are different
scripts, and should not be unified as just different uses of the same
script.

> connection can bomb the Unicode e-mail address with confirmations. These

Hmm, an email bombing threat... I'm sure Sarasvati can find a way to block
those (or we may all simply file them away as spam).

> characters are, I repeat, the one that is given a Hexcode 09F0  and the other
> with 09F1 by this universal character encoding system but unfortunately
> has described both as "Bengali" Ra etc. etc. I don't know who has advised
> Unicode to use the tag "Bengali" to name the block that includes these two
> characters. 
> 
> If you are not an Indian then just google an image of an Indian Currency note.
> There on one side of the note you will find a box inside which the value of
> the currency note is written in words in at least 15 scripts of official
> Indian languages. (I don't know why it is not 22.) At the top, the script is
> Assamese as Assamese is the first officially recognized language (script?).
> Next below it you will find almost similar shapes. That is in Bengali. India
> officially recognises the distinction between these two scripts which although
> shaped similar but sounds very different at many points. And the standard

Minor font differences is not a reason for disunification. Different
pronunciations of the same letters is not a reason for disunification
either. Just think of how many different ways Latin letters (and letter
combinations) are pronounced in different languages (x, j, h, v, w, f, ...;
even "a" gets different pronunciation in British English vs. US English,
and that is within the same language...; and most orthographies aren't
very accurately phonetic anyway, with quite a bit of varying (contextual
and dialectal) pronunciation for the letters).

> assamese alphabet set has extra characters which are never bengali just like
> London is never in Germany.

There are 8 Londons in the USA, two in Canada, one in Kiribati, ... ;-)
(http://en.wikipedia.org/wiki/London_(disambiguation))

> Coming again to the Hexcodes 09F0 (Raw) and 09F1 (wabo). Both have nothing
> Bengali in them and interestingly 09F1 (sounds WO or WA when used within
> words) has even nothing 'Ra' sound in it. Thus you know, with actual Bengali
> alphabet set one can't write anything to produce the sound "Watt" as in James
> Watt and instead need to combine three alphabets but even then only to sound
> like "OOYAT" in Bengali itself.

Yes, English has a rather peculiar pronunciation for the letter W... ;-)
Several languages will pronounce Watt (without changing the spelling) as
Vatt, and regard that as a normal pronunciation of Watt.

> Therefore Unicode must consider terming the blo

Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Kent Karlsson



Den 2011-09-10 02:32, skrev "Stephan Stiller" :

>Actually, I *was* talking about purely typographic/aesthetic ligatures as
> well. I'm aware that which di-/trigraphs need to be considered from a font
> design perspective is language-dependent. But the point is that I observe
> that:
>  (a) aesthetic ligatures are not frequently seen in modern German print and

I would assume that is because many commonly used fonts are designed in such
a way that letter glyphs don't overlap anyway. And then you should not use
any ligature. (Sorry if my original "should" implied otherwise.)


/Kent K




Re: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Kent Karlsson
Oh, my apologies.


In that case, CaseFolding.txt (from the Unicode character database) says:

FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

which seems rather straightforward...
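[Editor's note: those two CaseFolding.txt entries correspond to what any full case folding implementation does; in Python, for instance:]

```python
# str.casefold() applies Unicode full case folding, so both ligature
# characters fold to the two-letter sequence "st".
assert "\uFB05".casefold() == "st"   # LATIN SMALL LIGATURE LONG S T
assert "\uFB06".casefold() == "st"   # LATIN SMALL LIGATURE ST
print("\uFB05\uFB06".casefold())
```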

/Kent K



Den 2011-09-10 01:25, skrev "Karl Williamson" :

> On 09/09/2011 02:36 PM, Kent Karlsson wrote:

>>> getting these ligatures to work is quite hard, and it would be nice to
>> 
>> How do you mean "getting these ligatures to work"? These particular

> Sorry that I took out too much of the original context.  This is about
> implementing the Case_Folding property.





Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Kent Karlsson

I was talking about purely typographic ligatures, in particular
ligatures used because the glyphs (normally spaced) would otherwise
overlap in an unpleasing manner. If the glyphs don't overlap (or
there is extra spacing, which is quite ugly in itself if used in
"normal" text), no need to  use a (purely typographic) ligature.
So it is a font design issue. (And then there are also ornamental
typographic ligatures, like the st ligature, but those are outside
of what I was talking about here.) But of course, which pairs of
letters (or indeed also punctuation) are likely to occur adjacently
is language dependent.

/Kent K


Den 2011-09-09 23:45, skrev "Stephan Stiller" :

> Pardon my asking, as this is not my specialty:
> 
>> There are several other ligatures
>> that *should* be formed (automatically) by "run of the mill" fonts:
>> for instance the "fj" ligature, just to mention one that I find
>> particularly important (and that does not have a compatibility code
>> point).
> 
> About the "should" - isn't this language-dependent? For example I recall
> that ordinary German print literature barely uses any ligatures at all
> these days (ie: I'm not talking about historical texts). And, has anyone
> ever attempted to catalogue such ligature practices? (Is this suitable
> for CLDR?)
> 
> (I also recall being taken aback by the odd look of ligatures in many
> LaTeX-typeset English scientific documents, but I suspect that's rather
> because some of the commonly used fonts there are lacking in aesthetic
> design.)
> 
> Stephan
> 
> 





Re: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Kent Karlsson



Den 2011-09-09 21:24, skrev "Karl Williamson" :

> On 07/06/2011 04:23 PM, Ken Whistler wrote:
>> I'm not sure whether the FB05/FB06 instance is important enough to add
>> or not. Neither of those compabitility ligatures should ordinarily be used
>> in text, anyway  ...
>> 
>> --Ken
> 
> I'm wondering what other characters might not ordinarily be used in
> text, or how to discover which ones aren't.  We're discovering that
> getting these ligatures to work is quite hard, and it would be nice to

How do you mean "getting these ligatures to work"? These particular
ligatures would be formed (automatically) by specialty fonts only.
Not for "run of the mill" fonts. There are several other ligatures
that *should* be formed (automatically) by "run of the mill" fonts:
for instance the "fj" ligature, just to mention one that I find
particularly important (and that does not have a compatibility code
point).

Neither for "run of the mill" ligatures (like "fi", "fl", fj", "ft",...)
nor for specialty ligatures (like ligature of long s and t, or any
other ligature with long s) should the compatibility code points be
used. One reason in particular is that (compatibility) code points
exist only for a small subset of such ligatures. And that will remain
true, since Unicode/10646 will not allocate code points for any more
(typographic) ligatures. The ones already encoded are for compatibility
only.

Note that I am here referring to purely *typographic* ligatures.
*Not* to ligatures that have "graduated" to letter (like æ or ø) or
orthographic (like ĳ) status. (I realise that LAM-ALEPH ligatures
would be borderline from this description, but the consensus is
not to use any of the LAM-ALEPH ligature *characters*, but regard
it as a required typographic ligature of character pairs.)

/Kent K

> know where it's really appropriate to expend the implementation effort.
> 






Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Kent Karlsson

Den 2010-11-15 23:53, skrev "Doug Ewell" :

>> When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and
>> then click the button labeled "Normalize to NFC", the character
>> becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard
>> in this case? Is this not truly Unicode normalization?
> 
> Crap.  Yes, Ken and BabelPad are right.  Some ideographs do have
> singleton mappings and can thus be different between NFD and NFC.

No, both NFD and NFC will map U+FA47 to U+6F22; singleton canonical
mappings are not "reversed" in the composition phase of transforming to NFC.
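[Editor's note: this is easy to verify against any conformant normalization library, e.g. Python's:]

```python
import unicodedata

# U+FA47 is a CJK compatibility ideograph with a *singleton* canonical
# decomposition to U+6F22. Singletons are never re-composed, so NFD and
# NFC agree here.
assert unicodedata.normalize("NFD", "\uFA47") == "\u6F22"
assert unicodedata.normalize("NFC", "\uFA47") == "\u6F22"
```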

/Kent K






Re: looks like some problem in Scripts.txt file of UCD

2010-08-13 Thread Kent Karlsson

Den 2010-08-13 02.28, skrev "Pravin Satpute" :

> Yes, problem is happening only when these characters come at initial
> position.
> i.e. U+0951 and U+0952 in isolation should render with U+25CC

U+25CC should never be inserted automatically. That some systems do so is a
bug (no matter how consciously it was made). (I know, there are some Indic
script characters that should have had a canonical decomposition but don't
have one; using what should have been the canonical decomposition should
then be marked somehow in rendering, but using a dotted circle is not the
way to do that I think).

>> "Inherited" means that the character inherits its Script property from
>> the preceding character(s), so if either of the stress signs is preceded
>> by a Devanagari character, it should make no difference whether the
>> stress sign itself is categorized as Devanagari or Inherited.
> 
> looks good, but hmm its really hard to guess characters script when it
> will be alone.
> I think one need to add extra check, when character will be at initial
> position with property inherited

When a combining character sequence is ill-formed ("at the initial
position"), it should be rendered *as if* applied to an NBSP (regardless
of script).


http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf, section 5.13:
"Defective combining character sequences should be rendered as if they had
a no-break space as a base character. (See Section 7.9, Combining Marks.)"

http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf, section 7.9:
"Marks as Spacing Characters. By convention, combining marks may be
exhibited in (apparent) isolation by applying them to U+00A0 no-break
space."
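[Editor's note: following that convention mechanically in code is simple; a minimal Python sketch — the helper name `displayable` is my own, not anything from the standard:]

```python
import unicodedata

def displayable(ch: str) -> str:
    """Prefix an isolated combining mark (gc = Mn/Mc/Me) with NBSP
    so it has a base character to attach to, per TUS section 7.9."""
    if unicodedata.category(ch).startswith("M"):
        return "\u00A0" + ch
    return ch

# U+0951 DEVANAGARI STRESS SIGN UDATTA is a nonspacing mark (Mn)
print(repr(displayable("\u0951")))
```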

/Kent K





Re: CSUR Tonal

2010-08-06 Thread Kent Karlsson


Den 2010-08-06 11.02, skrev "Andrew West" :

> On 6 August 2010 05:14, Doug Ewell  wrote:
>> 
>> What makes this troublesome for me is that, on the one hand, there are the
>> perfectly ordinary-looking 0 through 8, and on the other hand there are the
>> invented digits for 9 and 11 through 15, and then in the middle there's this
>> bizarre use of an ordinary 9-glyph to mean decimal 10. That's what messes it
>> up for me and makes me think the '9' isn't really a 9, and what the heck,
>> maybe none of the "ordinary" digits are what they appear to be, so let's
>> CSUR-encode all of them.
> 
> Looking at the examples shown on
> , it seems to me that
> 0-8 are ordinary digits, and the symbols for 9 through 15 are inverted
> or inverted+modified forms of the digits '7' through '1', so that
> there is some sort of imperfect bilateral symmetry on the clock and
> compass faces, with '0' and '8' as the axis of symmetry. Thus the '9'
> is an inverted '6' (as 16-6=10) not an ordinary '9'. So except for the
> odd glyph forms for 9, 11, 12 and 15 (which would be expected to be
> simple inversions of '7', '5', '4' and '1') it makes sense as a system
> to me.
> 
> Anyhow, I do not think the ordinary digits '0' through '8' should be
> encoded in the CSUR.

Nyström himself writes
(http://books.google.com/books?id=aNYGYAAJ&pg=PA105&source=gbs_selected_pages&cad=0_1#v=onepage&q&f=false):

"In the Tonal System it is proposed to add six new figures to the 10
arabic"... (page 15)
and 
"Although the old figures in the Tonal System bears the old value (except 9)
one by one"... (page 17)

/kent k







Re: CSUR Tonal

2010-08-04 Thread Kent Karlsson

Den 2010-08-05 00.20, skrev "Doug Ewell" :

> Kent Karlsson  wrote:
> 
>> I see absolutely no point in reencoding the digits 0-9 even though
>> 9 is (strangely) used to denote the value that is usually denoted 10.
>> That is just a (very strange) usage, not different characters from
>> the ordinary 0-9.
> 
> I suggested encoding all of them because U+0030 through U+0039 have the
> Nd nature.

That does not prevent anyone from using them for (ordinary) hexadecimal,
octal, etc.

> Remember, this is just for the ConScript registry, not for
> "real" Unicode, so it's more of an exercise than anything.

I know.

/kent k





Re: CSUR Tonal

2010-08-04 Thread Kent Karlsson
I see absolutely no point in reencoding the digits 0-9 even though
9 is (strangely) used to denote the value that is usually denoted 10.
That is just a (very strange) usage, not different characters from
the ordinary 0-9.

/kent k



Den 2010-08-02 19.54, skrev "Doug Ewell" :

> "Luke-Jr"  wrote:
> 
>> I've copied an updated draft proposal to:
>> http://luke.dashjr.org/tmp/chores/tonal.html
>> I believe I have addressed all of the suggestions raised to my earlier draft.
>> Please let me know what you all think.
> 
> If it were up to me, which it is not, I would consider this proposal
> suitable for posting on the CSUR site.
> 
> --
> Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
> RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
> 
> 
> 
> 






Re: CSUR Tonal

2010-07-31 Thread Kent Karlsson
Since no-one else seems to have responded to Luke...

Den 2010-07-30 22.09, skrev "Luke-Jr" :

> This isn't about them not looking *exactly* the same, it's about these
> existing modifiers being inconsistent with each other in visibly noticeable
> ways. 

That is most certainly a font issue, likely a font substitution issue
(several system try to substitute another font if a glyph is missing for
some codepoint in the font you selected). Indeed, superscript characters
may not have been given all that much attention w.r.t. consistency review
(within each font), especially not between Unicode blocks... (But that is
just my suspicion!)

/kent k





Re: High dot/dot above punctuation?

2010-07-29 Thread Kent Karlsson

Den 2010-07-29 08.47, skrev "Khaled Hosny" :

> I have few fonts where I implemented a 'locl' OpenType feature that maps
> European to Arabic digits, and contextual substitution feature that
> replaces the dot with Arabic decimal separator when it comes between two
> Arabic numbers, so I think it is doable.

Doable is not the same thing as a good idea. Your example here is one of the
not-at-all-good ideas.

/Kent K





Re: High dot/dot above punctuation?

2010-07-28 Thread Kent Karlsson

Den 2010-07-28 17.09, skrev "Jukka K. Korpela" :

> Kent Karlsson wrote:
> 
>> And the Nameslist says:
>> 002E  FULL STOP
>>     = period, dot, decimal point
>>     * may be rendered as a raised decimal point in old style numbers
> 
> Right, I remembered there is such a comment somewhere but did not remember
> where.
> 
>> However, I think that is a bad idea: firstly the digits here aren't
>> necessarily "old style" (indeed, André wrote "lining", i.e. NOT
>> old style). And even if they are old style, it seems to me to be a
>> bad idea to make this a contextual rendering change for FULL STOP
>> (and it also says "may" not "shall" so there is no way of knowing
>> which rendering you should get even with old style digits).
> 
> I don't think the comment suggests the kind of contextual rendering you seem
> to be thinking. It just says "may", without specifying what controls the
> rendering and restricting raised rendering to old style numbers.
>
> But admittedly, if you wish to use raised dot rendering, you will need
> either some programmed logic that applies such style to FULL STOP in certain
> contexts but not others (and this is nontrivial) or manual work that sets
> the style of each FULL STOP used as decimal point.

That's "contextual rendering", in the general sense. And as Asmus says, I
don't think anyone has implemented any automatic contextual rendering as
hinted in that annotation (and in the chapter text you referenced), nor do
I think anyone should implement that.
 
>> Better stay with the MIDDLE DOT for the raised decimal dot.
>> 
>> Further, I don't see any major problem with using U+02D9 DOT ABOVE
>> for "high dot" in this case.
> 
> I see several problems with both approaches. The rendering will depend on
> font and may not be at all suitable for a raised dot. These characters have
> properties different from those of FULL STOP, and you never know what this
> may imply. Software that handles characters by their Unicode properties may
> do unexpected and unsuitable things if some character just "looks adequate"
> but has properties different from those of the character(s) that would be
> semantically (more) correct.

Such as which problems, exactly? Any problems here are surely dwarfed,
by several orders of magnitude, by the fact that some locales use "." as
thousands separator and others as decimal separator, while some locales
use "," as thousands separator and others as decimal separator.

/kent k
 
> Jukka 
> 
> 






Re: High dot/dot above punctuation?

2010-07-28 Thread Kent Karlsson



On 2010-07-28 09.50, "Jukka K. Korpela" wrote:

> André Szabolcs Szelp wrote:
> 
>> Generally, for the decimal point . (U+002E FULLSTOP) and , (U+002C
>> COMMA) is used in the SI world. However, earlier conventions could use
>> different notation, such as the common British raised dot which
>> centers with the lining digits (i.e. that would be U+00B7 MIDDLE DOT).
> 
> The different dot-like characters are quite a mess, but the case of British
> raised dot is simple: it is regarded as typographic variant of FULL STOP.
> 
> Ref.: http://unicode.org/uni2book/ch06.pdf (second page, paragraph with
> run-in heading "Typographic variation").

And the Nameslist says:
002E    FULL STOP
        = period, dot, decimal point
        * may be rendered as a raised decimal point in old style numbers

However, I think that is a bad idea: firstly the digits here aren't
necessarily "old style" (indeed, André wrote "lining", i.e. NOT
old style). And even if they are old style, it seems to me to be a
bad idea to make this a contextual rendering change for FULL STOP
(and it also says "may" not "shall" so there is no way of knowing
which rendering you should get even with old style digits).
Better stay with the MIDDLE DOT for the raised decimal dot.

Further, I don't see any major problem with using U+02D9 DOT ABOVE
for "high dot" in this case.

/Kent K

> Yucca
> 
> 






Re: CSUR Tonal

2010-07-26 Thread Kent Karlsson

On 2010-07-26 02.48, "Doug Ewell" wrote:

> superscript letters S, T, b, m, r, s, and t.  [...]
... 
> Imagine my surprise, then, when I found that these superscripts are not
> formally encoded (only i and n are).  [...]

There are more superscripted letters than i and n that are encoded; among
them are:

1D40;MODIFIER LETTER CAPITAL T;Lm;0;L;<super> 0054
1D47;MODIFIER LETTER SMALL B;Lm;0;L;<super> 0062
1D50;MODIFIER LETTER SMALL M;Lm;0;L;<super> 006D
02B3;MODIFIER LETTER SMALL R;Lm;0;L;<super> 0072
02E2;MODIFIER LETTER SMALL S;Lm;0;L;<super> 0073
1D57;MODIFIER LETTER SMALL T;Lm;0;L;<super> 0074

(Only MODIFIER LETTER CAPITAL S from the list at the top of the quote
above is missing.)

/kent k





Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-25 Thread Kent Karlsson

On 2010-07-25 03.09, "Michael Everson" wrote:

> On 25 Jul 2010, at 02:02, Bill Poser wrote:
> 
>> As I said, it isn't a huge issue, but scattering the digits makes the
>> programming a bit more complex and error-prone and the programs a little less
>> efficient.
> 
> But it would still *work*. So my hyperbole was not outrageous. And nobody has
> actually scattered them. THough there are various types of "runs" in existing
> encoded digits and numbers.

While not formally of general category Nd (they are "No"), the superscript
digits are a bit scattered:

00B2;SUPERSCRIPT TWO
00B3;SUPERSCRIPT THREE
00B9;SUPERSCRIPT ONE
2070;SUPERSCRIPT ZERO
2074;SUPERSCRIPT FOUR
2075;SUPERSCRIPT FIVE
2076;SUPERSCRIPT SIX
2077;SUPERSCRIPT SEVEN
2078;SUPERSCRIPT EIGHT
2079;SUPERSCRIPT NINE

And there are situations where one wants to interpret them as digits
in a decimal positional system.

/kent k





Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread Kent Karlsson

On 2010-07-24 10.07, "Philippe Verdy" wrote:

> Double diacritics have a combining property equal to zero, so they

No, they don't. The above ones have combining class 234 and the below
ones have combining class 233 (other characters with the word DOUBLE
in them are 'double' in some other way):

035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;N;
035F;COMBINING DOUBLE MACRON BELOW;Mn;233;NSM;N;
0362;COMBINING DOUBLE RIGHTWARDS ARROW BELOW;Mn;233;NSM;N;
1DFC;COMBINING DOUBLE INVERTED BREVE BELOW;Mn;233;NSM;N;

035D;COMBINING DOUBLE BREVE;Mn;234;NSM;N;
035E;COMBINING DOUBLE MACRON;Mn;234;NSM;N;
0360;COMBINING DOUBLE TILDE;Mn;234;NSM;N;
0361;COMBINING DOUBLE INVERTED BREVE;Mn;234;NSM;N;
1DCD;COMBINING DOUBLE CIRCUMFLEX ABOVE;Mn;234;NSM;N;

So everything you write based on your false premise is unjustified
(and most is false).

> block the reordering for canonical equivalences and the relative order
> and independance for the encoding of base grapheme clusters will be
> preserved during normalizations.
...
...
...

/kent k





Re: [indic] Indian Rupee symbol

2010-07-15 Thread Kent Karlsson

On 2010-07-15 11.54, "N. Ganesan" wrote:

> On Thu, Jul 15, 2010 at 4:08 AM, Dr Pavanaja 
> wrote:
>> Now that Indian Rupee symbol has been finalised and accepted  by the Indian
>> Parliament can it go into Unicode ver 6.0?
>> 
> 
> For a look at the new sign for Indian Rupees:
> 
> http://minal.nairi.net/images/work/rupee_01_L.jpg

That glyph is different from the others cited for this approved(?) symbol.
Separate enough not to be unifiable with the others... (Neither of these
looks anything like any of the "finalists".)

> http://news.google.com/news/story?pz=1&cf=i&ned=us&hl=en&q=india+rupee+symbol&;
> ncl=d6Fqo-wq_tHz-YMha5mr0AeQdPizM
> 
> http://www.business24-7.ae/banking-finance/banking/indian-rupee-symbol-is-appr
> oved-2010-07-15-1.266820

Subheading says: "Currency becomes fifth in the world to have a distinct
identity". Hmm, silly me, I thought each currency (nearly 300 of them
worldwide, going by ISO currency codes) had a "distinct identity".

Even looking just at currency symbols, many of them are used for more
than one currency; and even just counting the number of currency symbols,
there are a lot more than four already...

/Kent K

> Hope it gets into the Unicode soon
> like $ etc.,
> 
> N. Ganesan
> 
> 






Re: Indian Rupee Sign to be chosen today

2010-06-28 Thread Kent Karlsson

On 2010-06-28 10.16, "Michael Everson" wrote:

> On 28 Jun 2010, at 06:41, Mahesh T. Pai wrote:
> 
>> Mahesh T. Pai said on Mon, Jun 28, 2010 at 10:57:53AM +0530,:
>> 
>>> On a serious note -
>>> 
>>> 1. Would a change of glypn / glyph shape be considered?
> 
> It would depend on what is chosen by the Government, I should think.

Also note that the character named RUPEE SIGN has a compatibility
decomposition to "Rs" (short for "Rupees" I guess, not for "rupee sign"):
20A8;RUPEE SIGN;Sc;0;ET;<compat> 0052 0073;;;;N;;;;;

This compatibility decomposition cannot be changed, nor can the sample
glyph be changed significantly (i.e. it must continue to look much like
"Rs").

Unfortunately, some current fonts erroneously display a glyph for
that character that looks more like "Rp". That is a font error,
and nothing else.

If a "NEW RUPEE SIGN" (or whatever name is preferred) is chosen, and
this goes the same way as the EURO SIGN (i.e. it is highly likely to be
actually used, as was the case when the EURO SIGN was added to Unicode),
and assuming the glyph does not look like "Rs", a new character will
need to be added.

/kent k





Re: U+0000 in C strings

2004-11-16 Thread Kent Karlsson
Of course you can have NULL in a C string.

It is just that many (not all) standard C functions that accept
a string argument, interpret/use NULL for ending a string.
That is the convention used for many user defined functions
that accept string arguments or produce a string too.

But some standard C functions, e.g. memcpy and fwrite, as well as many
user defined functions, instead take length or position arguments (in
one way or another). In such cases NULL (represented as a 0 value) in
the string argument/result is just another character. (Note that strncpy,
despite taking a length argument, still treats a 0 in the source as a
terminator.) If you want to be able to process NULL as a character in a
text string, you must be careful to use functions that don't interpret
NULL as string termination.

A NULL (0) in a string can be used for all sorts of things,
e.g. new line (yes, new line), data missing, or even a
printable character, like COMMERCIAL AT (in the GSM 7-bit
encoding). And it may occur in input text files (or streams)
and it does NOT terminate a (text) file/stream.

But NULL (represented as 0) termination is handy when the
string is "known" to (or at least supposed to) contain
only "printable" characters in an ASCII or EBCDIC based
encoding. Like format strings, error messages, and so on.
But not for buffers containing (text) data obtained (e.g.)
from a file; unless you filter away, or internally "escape"
most control characters.

Many (text) *protocols*, quite apart from C, interpret some
characters specially, e.g. NULL (or all C0 control characters),
and those characters must be "escaped" somehow, if the protocol
allows for that. But that has nothing to do with C.

I've missed some of this thread, so I don't know why this
has become an issue. It's been like this for decades...

/kent k






RE: bit notation in ISO-8859-x is wrong

2004-10-11 Thread Kent Karlsson

> Not counting from zero leads to weird situations at times, such
> as the missing year 0 in the BC/AD system ;-)

Well, since you couldn't resist, nor can I...

There are (really) two systems here, not one, and the relationship is:

year X AD = year (1-X) BC

(and thus: year X BC = year (1-X) AD)

Both of the AD and BC systems (really) have zero and negative year
numbers, but in each instance one prefers to use the system where
the number, for the year in question, has a strictly positive value.
So there is no inconsistency with the whole numbers... ;-)

/kent k

> A./





RE: Saudi-Arabian Copyright sign

2004-09-21 Thread Kent Karlsson

Kenneth Whistler wrote:

> Second, there is the question of cursive joining for Arabic.
> I don't know anything in the Unicode Standard that states that
> a combining enclosing mark breaks cursive ligation. It stands
> to reason that it *should*, but I don't know anything that
> requires it.

Well, according to the Unicode standard, it used to break the joining
on one side (the right side; unless one follows the bidi algorithm
literally and does the join analysis after bidi, in which case it would
be the left side). I complained about this (and other things about
joining properties), suggesting that "Me" characters (like an enclosing
circle) should break the joining on both sides. The UTC decision was the
opposite, but equally good: Me characters should (shall?) not break the
joining on any side. This decision was communicated to the
bidi list recently:

Mark Davis wrote, on
> Sent: Thursday, August 12, 2004 12:08 AM
> To: [EMAIL PROTECTED]
> Subject: [bidi] Fw: ArabicShaping suggestion
>
>
> The UTC reviewed document L2/04-290 from Kent. I'll put the
> results (very briefly!) here:
...
> 6. We will add General Category = Me to Transparent.

/Kent K





Re: Arabic Implementation

2004-08-18 Thread Kent Karlsson
> After a character changes the display form into one mentioned
> in Arabic
> Presentation Form B does it still belong to a joining type.

Nope. All the Arabic presentation forms implicitly have the joining
type U (non-joining) [and the joining metagroup ].

> For example: Lets say Unicode Character : 0x0622 which is a
> right joining
> type , when this changes the display form into ISOLATED FORM
> its Unicode
> becomes : 0xfe81.

You can base a *partial* implementation of DISPLAY of the Arabic
script that way. But note that many of the more "exotic" Arabic
letters do not have any corresponding presentation form characters.
What one is supposed to do is to look up the presentation form
*glyphs* in the ("smart") font. That does not rely on any of the
presentation form characters. Nor should text be stored using the
presentation form characters.

> I personally feel that a particular character belonging to a
> particular
> joining type will have all its different display forms also
> belonging to the
> particular joining type .

No. It is only the "nominal" Arabic letters that are "shaped".
The preshaped ones already have their shape, and do not affect
the shape of neighbouring characters. Note that a "shaper"
based on using the presentation form characters, should also
interpret ZWJ and ZWNJ, but may remove them after interpretation.
(You should not store the resulting text beyond what is needed
for display/print.)

/kent k





Re: Error in Hangul composition code

2004-07-06 Thread Kent Karlsson
>  says:
>
> int SIndex = last - SBase;
...

The arithmetic decomposition of the Hangul Syllable
characters can be described as follows:

Each Hangul precomposed syllable character of
Hangul_Syllable_Type LV has a canonical decomposition
into an L and a V Hangul jamo:

LV: s
L in 1100–1112: LBase + ((s – SBase) div NCount)
V in 1161–1175: VBase + (((s – SBase) mod NCount) div TCountP1)

Each Hangul precomposed syllable character of
Hangul_Syllable_Type LVT has a canonical decomposition
into an LV Hangul syllable character and a T Hangul jamo:

LVT:s
LV: SBase + (((s – SBase) div NCount) * NCount)
T in 11A8–11C2: TBaseM1 + ((s – SBase) mod TCountP1)

(TBaseM1 is TBase-1, and TCountP1 is TCount+1)

This makes them decompose just like other canonical
decompositions into (one or) two other characters;
not more than two. The arithmetic description is then
just a shorthand for a long list of 11000+ canonical
decompositions (which can't be into more than two
other characters). They could in principle be handled
in normalisation code just like any other canonical
decomposition/composition, given that expanded table.
Code based on the arithmetic expressions are just more
efficient in achieving the same thing.

The composition can likewise be described arithmetically.

Note the use of the (relatively) new Hangul_Syllable_Type
property.

Some pseudo-code (for those who like code) based on this
for composing Hangul Syllable characters (I will spare
you the pseudocode for decomposing, this reply is getting
too long already):

public static String composeHangul(String source)
{
int len = source.length();
if (len == 0)
return "";


StringBuffer result = new StringBuffer();

// Hangul is in the BMP, so we need not worry about higher planes.

char prev = source.charAt(0);// get first char

for (int i = 1; i < len; i++)
{
char curr = source.charAt(i);

if ('\u1100' <= prev && prev <= '\u1112' && // "modern" L
'\u1161' <= curr && curr <= '\u1175')   // "modern" V
{
// make a syllable of the form LV
prev = (char)(SBase + ((prev - LBase) * NCount) +
  ((curr - VBase) * TCountP1));
}
else if (hangulSyllableType(prev) == HangulSyllableType.LV &&
 '\u11A8' <= curr && curr <= '\u11C2') // "modern" T
{
// make a syllable of the form LVT
prev += curr - TBaseM1;
}
else
{
// no arithmetic composition possible, move on
result.append(prev);
prev = curr;
}
}
result.append(prev);  // don't lose last char in string
return result.toString();
}

Note that, while NOT part of Unicode decompositions, many
of the Hangul Jamo characters decompose into two or three
other Hangul Jamo letters. But that is well beyond UAX #15,
unfortunately.

  /kent k





RE: Script vs Writing System

2004-05-13 Thread Kent Karlsson
Peter Constable wrote:

> > of "featural" is probably refering to this
> > feature of Hangul of grouping letters into square syllables.
> 
> No, that is definitely *not* what was meant. In the taxonomy 
> devised by
> Gelb and promoted by Daniels, Hangul is described as a 
> "featural" script
> because the description of the script prepared under King Sejong's
> administration described the shape of certain jamos (e.g. k, t, m) as
> being iconically related to the corresponding points of articulation.

The *letters*, especially the consonants, are "featural" in that sense.
However, the syllables aren't "featural" in *that* sense. A syllable
block just consists of the letters of the syllable. That would
be a quite different sense of "featural". Latin is a "featural" script,
since the letters are grouped into words (rather simplistically and
linearly, but that's a different matter)...

/kent k



