Re: Missing geometric shapes

2012-11-09 Thread Asmus Freytag

On 11/9/2012 1:26 AM, Jean-François Colson wrote:

For a five-level rating, ○ ◔ ◑ ◕ ● could do the job.


Yes, it's possible to use other sets of symbols to indicate a rating, but 
when it comes to such use of symbols, Unicode would not encode the 
semantic of rating but that of star. The deeper semantic is one of 
convention. Not unlike the question whether y is a vowel or a 
consonant (yo-yo), which is a matter of convention between writer and 
reader.


A./



Re: Missing geometric shapes

2012-11-09 Thread Asmus Freytag

On 11/9/2012 5:53 PM, Philippe Verdy wrote:

Why then stars? Any symbol, even any Unicode letter, could be repeated
and half-filled.


There's nothing magical about limiting the half-filled geometrical 
shapes to the current (haphazard) set. If half-filled stars can be 
documented, they are legitimate targets for encoding. If someone later 
documents half-filled pentagons, again, the case would be decided on the 
merits.


I really hate the speculation on this list about notational conventions 
-- the rule should be: if notational conventions exist, and can be 
documented, the characters needed for them should be eligible to be 
considered.



Even logos (I've seen Apple logos used this way)


Logos are ineligible for other reasons, and that puts them out of 
discussion here.



or pictograms (I've seen 


Most of these graphics are simply used in repetition. Only shapes that 
lend themselves to being half-filled will show up in use.


Now, Unicode has recently introduced the innovation of formally encoding 
variation sequences for emoji-style symbols - expressing a desire to 
explicitly represent a unification of certain basic shapes with 
precisely equivalent fancy renditions of the same.


In the current instance that could mean (by extension) that at some 
point various fancy renditions of stars are officially unified with 
the plain stars (by adding a similar variation sequence). Fancy star 
symbols that I have seen include those that are colored instead of black 
or have a colored background (on a per-symbol basis, not text background 
like highlighting).




Even today, using the existing Unicode for the WHITE STAR character
allows performing styling on it to render an empty, full, or partially
filled star.


There's clear precedent that Unicode views white/black/partially filled 
as a distinction on the character level (this is definitely the case for 
several types of geometrical symbols - witness circles and squares). 
Using styles to achieve that effect is possible (lots of things are 
possible), but it would be a violation of the character / glyph model to 
achieve such distinction by style, when it is present on the character 
level.


The precedent here clearly speaks in favor of recognizing half-filled 
stars likewise as a distinction on the character level.



If you start encoding a document using uncommon characters, automated
Braille or aural readers won't know what to do with them...


I think this argument is a red herring.


For me all the graphical substitutions of numeric figures are NOT
plain text; they are presentational features for visual rendering, ...


The fact that you can think of a series of symbols as representing their 
count doesn't make a series of symbols merely a numeric representation 
of that count.


But even if one were to take this view: Unicode painstakingly encodes 
characters for the different representations of digits, instead of 
relying merely on styles and glyphs to handle the representation of 
numbers. So, you see, even here, the precedent goes the other way.


A./





Re: The rules of encoding (from Re: Missing geometric shapes)

2012-11-09 Thread Asmus Freytag

On 11/9/2012 7:14 PM, Philippe Verdy wrote:

2012/11/9 Asmus Freytag asm...@ix.netcom.com:

Actually, there are certain instances where characters are encoded based on 
expected usage. Currency symbols are a well known case for that, but there have 
been instances of phonetic characters encoded in order to facilitate creation 
and publication of certain databases for specialists, without burdening them 
with instant obsolescence (if they had used PUA characters).

But work is still being performed to implement the characters and
start using them massively, even if they're not encoded.


I think this entire line of discussion is rather drifting into 
irrelevant details. Yes, I agree that it should matter whether serious 
resources have been committed in support of a new symbol or new piece of 
notation. That forms part of the evidence that marks some of  these 
exceptional cases as viable standardized characters - despite lack of 
prior, widespread use. That, somehow, was my point.


However, I find it pointless to speculate about the details. Exceptions 
are exceptions, and the most important issue is to reserve the 
flexibility to deal with them, when they arise.


After they have arisen, they are best dealt with on a case-by-case basis 
(or, in the case of currency symbols, we now have an entire category for 
which there is consensus that it merits exceptional treatment).


A./



Re: Missing geometric shapes

2012-11-11 Thread Asmus Freytag

On 11/11/2012 2:08 PM, Doug Ewell wrote:

Personal opinions follow.

It looks like the only actual use case we have, exemplified by the 
xkcd strip, is for a star with the left half black and the right half 
white. There *might* also be a case for the left-white, right-black star.


Precedent is for encoding these in pairs, and if there were any doubts 
about the wisdom of this, Simon Montagu's mail illustrates the bidi 
ramifications (thanks to Frederic Grosshans for the reminder).


So, let's not prevaricate any longer and admit we have a prima facie 
use case for the pair.


Everything else, including one-quarter and three-quarter stars, 
rendering tomatoes or doughnuts or film reels as glyph variants of 
stars, facilitating a right-to-left rating system for Arabic- or 
Hebrew-speaking environments, or turning Unicode into a standard for 
rating systems in general, is a complete flight of fancy.


Flights of fancy, indeed. I couldn't have said it better.


I think in this case, as in many others, one introductory, exploratory 
proposal would be worth ten thousand speculative mailing-list posts.


You said it.

A./




Re: Missing geometric shapes

2012-11-11 Thread Asmus Freytag

On 11/11/2012 4:50 PM, Philippe Verdy wrote:
2012/11/12 Kent Karlsson kent.karlsso...@telia.com


 rendering tomatoes or doughnuts or film reels as glyph variants of
 stars,

They should certainly **NOT** be treated as glyph variants of
stars! Ever!


Who said that? NOT me.

If you think so, this is a misinterpretation of what I said


You wrote so many things that it's impossible to be sure what you said. :)

A./


Re: Missing geometric shapes

2012-11-11 Thread Asmus Freytag

On 11/11/2012 8:47 PM, Philippe Verdy wrote:
No, I was clear throughout, using the same arguments, that encoding 
things for the purpose of representing empty, full, half filled 
like if it was a nuemric gauge was a bad idea.


Trying to encode a gauge is indeed a losing proposition.



When I spoke about the various representations of gauges (including 
with photos) it was just to demonstrate that this is a domain where 
designers and authors are extremely creative, and there's absolutely 
no standard way of doing things, as each representation is a purely 
local decision.


However, there's no argument that stars are used as symbols, including 
half filled ones. Stars are part of our family of geometrical shapes, 
and those shapes also have many members that are partially filled.


There's no reason to pass judgment on why people might be using stars.


Just consider the case of the classification of hotels and campsites: 
they are just given an integer number of stars, and whether these stars 
are white (hollow/transparent filling), black (completely filled), 
multicolor, or even half filled does not change the classification.


And this is where the discussion leaves the plane of encoding and veers 
into the realm of orthography. Orthography, loosely understood in its 
wider sense, is the realm of conventional use of written symbols. 
Orthographies associate conventional meaning with symbols and sequences of 
symbols (not just letters and words, but also punctuation marks etc.).


Unicode's role has to be strictly limited to providing building blocks.


Now if you think about half-filled stars, there is also the case of 
half-cut stars (left side or right side shown) to represent half units 
as well. Or stars with only 1 to 4 branches filled, with variation in 
the position where branches are cut: in the middle of a branch, creating 
a thinner triangle, or between branches (which are kept as complete 
diamonds extending up to the center). Variations as well in the number 
of branches of the star itself.


Correct - there are many designs for stars and variations of those 
designs. And also correct: at some point there is a limit where you 
don't need a standardized encoding for all of these, because things 
will get so specialized that few users will be able to benefit from 
this standardization.


However, the half-filled, five pointed stars are garden-variety type 
symbols, and, as I keep pointing out, they absolutely fall within the 
scope of geometrical symbols for which there is ample precedent 
supporting both plain text usage as well as a standardized encoding.


The suggested characters (they haven't actually been formally proposed 
yet) would in no way push the envelope.


(skipping over lots of text that I think is not very relevant)


We should only encode characters that users would reliably draw 
manually using a plume or rollerball pen, independently of color, or of 
the width of the tool used to draw strokes, or possibly to fill them: 
basic orientation of glyphs, however, will be a candidate if its 
variation in the same text orientation is significant (this includes 
mirrored or upside-down characters, or significant changes of size 
and position relative to the baseline). Some exceptions are given to 
maths symbols (including letter-like ones), which are encoded specifically 
with their maths semantics for use in maths, but not for general-purpose 
text.



This is an entirely novel theory of encoding, and one that, I would like 
to point out, is very much your personal view. It does not have a 
foundation (or echo, or equivalent) in anything that really defines how 
encoding is done for the Unicode standard.


A./



Re: Missing geometric shapes

2012-11-11 Thread Asmus Freytag

On 11/11/2012 9:26 PM, Philippe Verdy wrote:




2012/11/12 Asmus Freytag asm...@ix.netcom.com



However, the half-filled, five pointed stars are garden-variety
type symbols, and, as I keep pointing out, they absolutely fall
within the scope of geometrical symbols for which there is ample
precedent supporting both plain text usage as well as a
standardized encoding.


I oppose your argument of garden-variety type symbols because 
consistency of this usage with a defined pattern is not demonstrated, 
including in the precise domain where they are found.



None of the geometric symbols have a precise domain where they are 
used. Typical for these symbols is that they have a wide variety of use 
and that therefore any encoding that tries to tie these characters to 
only some specific usage is doomed to fail.


That does not mean that it's not important to show that there is at 
least one usage that is consistent with plain text.




The suggested characters (they haven't actually been formally
proposed yet) would in no way push the envelope.

[1] We should only encode characters that users would reliably
draw manually using a plume or rollerball pen, independently of
color, or of the width of the tool used to draw strokes, or
possibly to fill them: basic orientation of glyphs, however, will
be a candidate if its variation in the same text orientation is
significant (this includes mirrored or upside-down characters,
or significant changes of size and position relative to the
baseline).


[2] Some exceptions are given to maths symbols (including
letter-like ones) which are encoded specifically with their maths
semantics for use in maths, but not for general purpose text.

This is an entirely novel theory of encoding, and one that, I
would like to point out, is very much your personal view. It does
not have a foundation (or echo, or equivalent) in anything that
really defines how encoding is done for the Unicode standard.


[1] The first part is a good real-life expression of what is meant by 
abstract character and the fact that we don't encode glyphs. So this 
is not so much a novelty (it is stated in the standard that we don't 
encode glyphs but only abstract characters, independently of orthogonal 
styles and tools used to render them).


This is not how abstract characters are defined.


[2] The second part is the expression of the exceptions that have been 
made ONLY because there REALLY was a well-defined pattern of usage 
where the additional meaning of a precise style is consistent (and 
really HAD TO be)... allowing then these exceptions (the other 
exceptions have been for interoperability with older character sets 
for terminals that had almost no graphic capabilities). So this is 
also not a novelty.


For now we lack the evidence of a consistent meaning in any given 
domain (not too specialized to a single source at a single place 
for this consistency).


This whole thing comes down to a misunderstanding of semantic in the 
context of the statement that abstract characters represent a semantic 
over a presentational aspect.


The semantic of a letter 'a' is its a-ness - in contrast to all the 
other letters of the same script. The semantic of an integral sign is 
integral sign in contrast to the other mathematical operators. (If 
there were two alternate notations for integral, then Unicode would not 
encode the concept of integral, but the several concrete symbols used 
to denote that concept - see, for example, element of, where there is 
such variation.)
The semantic of a FULL STOP is to be a dot on the line. It can represent 
many different concepts (from sentence period to abbreviation to 
domain-name separator to decimal mark), but all of these are a matter of 
convention external to the encoding standard.
Finally, the semantic of a geometrical shape is in essence the shape - 
how it is used in the context of text will give it additional meaning, 
but those meanings are not what is standardized.


A./



Re: Missing geometric shapes

2012-11-12 Thread Asmus Freytag

On 11/12/2012 10:13 AM, Philippe Verdy wrote:
2012/11/12 Asmus Freytag asm...@ix.netcom.com


On 11/11/2012 9:26 PM, Philippe Verdy wrote:

2012/11/12 Asmus Freytag asm...@ix.netcom.com


However, the half-filled, five pointed stars are
garden-variety type symbols, and, as I keep pointing out,
they absolutely fall within the scope of geometrical symbols
for which there is ample precedent supporting both plain text
usage as well as a standardized encoding.


I oppose your argument of garden-variety type symbols because
consistency of this usage with a defined pattern is not
demonstrated, including in the precise domain where they are found.

That does not mean that it's not important to show that there is
at least one usage that is consistent with plain text.


That's exactly what I meant. There must be at least one precise domain 
where this usage is consistent.


No, there's no need for usage to be consistent. The only requirement 
is that it occurs.


Unicode is not designed to be in the business of what people write, only 
in the business of enumerating the basic elements (written signs) needed 
for that communication.


In some cases, a wide variety of shapes will be understood to represent 
a single written sign, with the alternation being stylistic. That's 
the case you have with letters and fonts.


In other cases, it is not possible a priori, or reliably, or ever, to 
decide what variance in shape can legitimately happen under the umbrella 
of a single written sign (as conventionally understood).


At some point, all you have to go on is the shape itself.

Whether an arrow is barbed or not, single or double stroked, filled or 
outlined makes no difference in its basic identification as arrow and 
no difference at all when it is used to merely point. However, different 
contexts (mathematics for one) have ascribed conventional meanings to 
some of the various appearances.


In order to make the case for encoding them, the primary task is to show 
that they can and will be used in contrast. If that can be shown, the 
details of what each style represents are of lesser importance. Those 
details come into play when the use is not so much one that is in 
contrast with the generic usage, but one where a convention 
arbitrarily requires a specific shape - and followers of that convention 
will not recognize a generic substitute as being the particular written 
sign in question.


A./

I certainly did NOT mean ONE AND ONLY ONE. So all the rest about the (for 
example) use of the full stop for various purposes is not relevant: at 
least some of these uses are consistent in their domain.


But for now we've not seen any such usage for the half stars, and I don't 
know why you think they will be more important to encode than the 
various other representations of ratings or similar concepts like 
gauges, whose many variants largely overwhelm the particular cases 
where a half star MAY very infrequently be used without any consistency, 
as if it were a sort of standard (the purpose of encoding in Unicode 
is to endorse such an existing standard or norm, either national or 
international, or adopted by a measurable community over some large 
enough period, and not in isolated documents, whatever their medium, 
electronic or physical).






Re: Caret

2012-11-12 Thread Asmus Freytag

On 11/12/2012 1:27 PM, Khaled Hosny wrote:

I’m not sure where you are getting your statistics from, but I have to
deal with all those “rare” and “extremely rare” situations all day.
Khaled, don't mind Philippe - his experience is a bit on the 
theoretical end.


A./


Re: Caret

2012-11-12 Thread Asmus Freytag

On 11/12/2012 7:13 AM, David Starner wrote:

On Mon, Nov 12, 2012 at 4:39 AM, Julian Bradfield
jcb+unic...@inf.ed.ac.uk wrote:

Again, it depends. A user-oriented editor will treat é as a single
unit anyway, for text manipulations. In my programmer-oriented editor,
when the cursor is on e or  ́, the two codepoints are displayed
separately instead of combined, so again there is no ambiguity.

What do non-English speaking programmers do? It seems that if I spoke
good Hindi or Arabic and little to no English, it would be deeply
frustrating to try and use comments and strings in such an editor.

As a programmer, you do want to be able to edit *and view* strings as 
sequences of code units. Doing so only in the context of binary memory 
dumps gets tedious. (In English, this means, for example, being able to 
view whitespace easily - a task that too many editors make hard).


For typing comments and strings, the display would not be an issue, 
because any partial characters would be handled the same way as in 
regular word processing. Editing the middle of a word might be 
different, but smarter editors could turn that feature off for comments. 
For strings it's something you'd want more often - depends a bit on what 
programs you are writing.


When inspecting strings I certainly would want to be able to distinguish 
between precomposed and decomposed e-accent, and whether I know English 
has got nothing to do with it.
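
To make the distinction concrete, here is a minimal Python sketch 
(using the standard unicodedata module; the sample string is purely 
illustrative) of what an inspecting programmer needs to be able to see:

    import unicodedata

    precomposed = "\u00e9"    # e-acute as a single code point
    decomposed  = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

    # Both render identically, but they are different code point sequences.
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print([hex(ord(c)) for c in decomposed])                        # ['0x65', '0x301']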


A./


Re: Missing geometric shapes

2012-11-12 Thread Asmus Freytag
In the business of character encoding, it's not helpful to try to 
construct algorithmic rules that lead from one set of conditions to the 
state of "encoded". It just doesn't work that way.


What does work is to think of factors, or criteria, that you can use in 
weighing a question. Certain factors weigh in favor of encoding, others 
don't (or have large negative weights - logos currently have infinite 
negative weights :) ).


Many of these criteria managed to get written down in the Policies and 
Procedures document and have been helping Unicode and WG2 decide 
encoding questions. Others are still mainly present in the collective 
consciousness of the encoding committee. Such is life.


What's not helpful is for outside observers to propound theories of 
encoding that are seemingly based on more algorithmic foundations, or 
that embody more rigid or formulaic requirements for this, that, and the 
other thing.


It's not that meeting certain requirements isn't helpful in advancing 
the case for encoding a character or symbol, but rather that it works 
only by increasing the weight in favor, not by flipping a switch up or 
down. It's really important to not mischaracterize the nature of the 
character encoding business in this way.


That's all I want to contribute to the current thread.

A./



Re: xkcd: LTR

2012-11-27 Thread Asmus Freytag

On 11/27/2012 5:39 AM, Masatoshi Kimura wrote:

(2012/11/27 20:27), Philippe Verdy wrote:
Could you please stop spreading an unfounded rumor such as Firefox is 
wrong because it ignores the lack of an HTML5 prolog? 


Getting Philippe to stop spreading unfounded anything is a near 
impossible task. :)


A./





Re: UTF-8 ill-formed question

2012-12-11 Thread Asmus Freytag

On 12/11/2012 11:50 AM, vanis...@boil.afraid.org wrote:

From: James Lin James_Lin_at_symantec.com

Hi
Does anyone know why ill-formed sequences occur in UTF-8? Besides that it
doesn't follow the pattern of UTF-8 byte sequences, I'm just wondering how or why.
If I have a code point, U+4E8C or 二:
in UTF-8, it's E4 BA 8C, while in UTF-16, it's 4E8C. Where does this BA
come from?

thanks
-James

Each of the UTF encodings represents the binary data in different ways. So we
need to break the scalar value, U+4E8C, into its binary representation before
we proceed.

4E8C - 0100 1110 1000 1100

Then, we need to look up the rules for UTF-8. It states that code points
between U+0800 and U+FFFF are encoded with three bytes, in the form
1110xxxx 10xxxxxx 10xxxxxx. So, plugging our data into the x positions
(split into groups of 4, 6, and 6 bits), we get

      0100     111010     001100
+ 1110xxxx   10xxxxxx   10xxxxxx

= 11100100   10111010   10001100
or    E  4      B  A       8  C

-Van Anderson


Nice!

A./

PS: I fixed a missing \
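
For anyone who wants to check the arithmetic mechanically, here is a 
minimal Python sketch of the same three-byte step, using only the bit 
operations described above (a sketch for the U+0800..U+FFFF range only, 
not a full UTF-8 encoder, and ignoring the surrogate range):

    def utf8_three_bytes(cp: int) -> bytes:
        # Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 bits).
        assert 0x0800 <= cp <= 0xFFFF
        b1 = 0xE0 | (cp >> 12)           # top 4 bits
        b2 = 0x80 | ((cp >> 6) & 0x3F)   # middle 6 bits
        b3 = 0x80 | (cp & 0x3F)          # low 6 bits
        return bytes([b1, b2, b3])

    print(utf8_three_bytes(0x4E8C).hex().upper())   # E4BA8C
    print("二".encode("utf-8").hex().upper())        # E4BA8C, the same result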



Re: wrongly identified geometric shape

2012-12-17 Thread Asmus Freytag
In relating the size of different series of geometric shapes with each 
other, the relevant aspect is not the height of the ink but the area, in 
my opinion.


I'm currently not able to take the time to sift through various 
documents and propose any resolution, but I would like to make sure that 
this point is not lost.


A diamond of the same height as a square becomes effectively an 
inscribed diamond. When you compare the areas, the difference is 50% (!)
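
To spell out the arithmetic behind that figure: a diamond of the same 
height as a square of side $s$ has both diagonals equal to $s$, so (in 
LaTeX notation)

    A_{\mathrm{square}} = s^2, \qquad
    A_{\mathrm{diamond}} = \tfrac{1}{2}\, d_1 d_2 = \tfrac{1}{2}\, s \cdot s = \tfrac{1}{2}\, s^2

i.e. the inscribed diamond carries exactly half the ink area of the square.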


Seen in running text (and not next to each other) the diamond will look 
smaller, even though it has the same height.


For the other shapes, the same effect exists, but is not always as 
severe. Whether one matches the size of a hexagon with that of a 
pentagon by height or area, for example, may not result in an obvious 
and observable difference in the impression of their relative size.


Ideally, as an author or font designer, I would aim for a set of symbols 
that have the same optical weight, or impression of weight. Shapes 
that are more compact might be allowed to have a little more area, 
because they might otherwise look short. But in this balancing act, I 
would expect the most functional (and pleasing) choices for mathematical 
use to be those where the shapes end up rather closely (but perhaps 
deliberately not perfectly) matched in area.


As to whether the exact size progression within each series is best 
realized as a geometric or linear, or some other progression, I can't 
suggest a definite answer right now.


In terms of text sizes for CSS, the concept of fixed ratios seems to be 
prevalent. Ideally we would have some more input from mathematical 
typesetters and font designers. Whatever the progression ends up being 
would require that all steps can be distinguished in on-screen viewing 
at some point size (and a traditional, not bleeding-edge, set of DPI values).


A./


On 12/17/2012 3:37 PM, Michel Suignard wrote:


Philip,

It would have helped if you had updated your critique of N4115 to the 
current proposed code points. The updated version is N4384 
(L2/12-368). The number of characters proposed and their allocation 
have changed, although the status for geometric shapes has not changed much.


I spent some time analyzing your documents and I can see you are 
trying to harmonize the size of the diamond and the square shapes by 
applying the concept that the length of a side should dictate the 
‘size’, not the ink height. By doing so you force the rule found on 
small sizes onto the larger sizes, which makes you deviate from the 
current TR25 recommendation; basically you are sizing down all the 
squares to match the diamonds. For example, now the regular size 
square side would be slightly above half the EM box side, which is 
what a medium square is today. And at the end you still have to add a 
new XL size which is not part of TR25. I also looked at the current 
font implementations of squares and they are all over the place in 
relative sizes, but all have bigger sizes than what you propose. By far 
the most consistent set is the Wingdings set, but there are so many 
size intercorrelations in geometric shapes that I can’t just put them 
in the charts. What I have found consistently among implemented fonts 
is a large gap between ‘small’ and ‘very small’, which reinforces my 
introduction of ‘slightly small’. As long as we don’t try to force the 
diamond scaling onto the square scaling I don’t see an issue with the 
current schema. The name ‘slightly small’ is not exactly pretty but we 
are running out of adjectives here.


If you had made the same arguments a year or more ago, it would have been 
easier to influence the content of amendment 1; now it is quite late. 
Geometric shapes representation is always subjective and various 
schemas can be used. The one used in N4115 does not try to merge the 
size scale between squares and diamonds (I don’t think there was a 
mandate to do so). Another goal was to take into consideration 
existing practice among math fonts. None of that is cast in stone, and 
I am sure we will see more fine tuning when math fonts implement the 
full set of these geometric shapes. The mapping of Wingdings/Webdings 
into Unicode is not frozen and TR25 is still a work in progress.


Always open to civilized discussion. Using terms such as ‘idiot’ and 
‘the arithmetic involved shouldn't challenge the average 12-year old’ 
will guarantee no answer from my part in the future.


Best regards,

Michel

*From:*philip chastney [mailto:philip_chast...@yahoo.com]
*Sent:* Sunday, December 16, 2012 10:45 AM
*To:* Michel Suignard
*Cc:* unicode List
*Subject:* Re: wrongly identified geometric shape

On 2012/Dec/08 02:34, Michel Suignard wrote:

* From:*philip chastney

 anybody converting a document currently using Wingding fonts to one 
using Unicode values and Unicode fonts instead, using the 
transliteration proposed in N 4384, will find their squares somewhat 
diminished in size (in this case, by one third)


this is 

Re: wrongly identified geometric shape

2012-12-18 Thread Asmus Freytag

On 12/17/2012 10:55 PM, Michel Suignard wrote:


Asmus

TR25 today takes an intermediate approach (ref page 19): the diamond 
exceeds the height, but its sides are smaller than those of the ‘equivalent’ square.




Which is what I suggested below.

In fact, in smaller sizes, there is an equivalence between the sides of 
diamonds and squares, but in larger sizes the square sides become 
increasingly larger than the diamond sides. At some point, you can’t fit 
the diamond into the EM box without bleeding over the edges, which is 
not acceptable.




In a true mathematical font you have very tall glyphs that are not 
appropriate for a mere symbol or dingbat font. Just consider the 
integral signs.


If you take the ink area rule to the letter, you would have to 
significantly decrease the ink for the larger squares to match the 
diamond ink for the same ‘size’. Again, if you look at the current 
version of TR25 (page 20), the ink for the diamonds is quite a bit smaller 
than the ink for the same ‘sized’ squares in the context of large sizes.


In an ideal world we would define all geometric shapes consistently, 
but they were created in an ad hoc manner and it becomes increasingly 
difficult to define consistent and uniform rules without creating 
regression issues.


It is inherently difficult to use the same scale progression between 
shapes that fit nicely the EM box (squares and circles) and spiky 
shapes (diamond, lozenge).


Added to that, the STIX fonts, which are a common implementation of 
math symbols, tend to size squares and circles even larger than the 
Unicode charts. So any effort to size down these shapes (implied by an 
alignment with the diamond sizes) would go opposite from current practice.




I think trying to solve this on the character encoding level, without 
double checking that with the wider mathematical/typographic community 
is a mistake.


We did some outreach when we came up with the specifications in the 
original TR#25, but it may well be that this could use some updating 
based on new input, new experience and the new characters.


STIX is an important element for this, but perhaps not the only one - if 
you know you are using the STIX fonts you can adjust your styles or 
glyph (character) selection to tweak the outcome, something that isn't 
an option for a generic prescription.


A./


Michel

*From:*Asmus Freytag [mailto:asm...@ix.netcom.com]
*Sent:* Monday, December 17, 2012 6:22 PM
*To:* Michel Suignard
*Cc:* philip chastney; unicode List
*Subject:* Re: wrongly identified geometric shape

In relating the size of different series of geometric shapes with each 
other, the relevant aspect is not the height of the ink but the area, 
in my opinion.


I'm currently not able to take the time to sift through various 
documents and propose any resolution, but I would like to make sure 
that this point is not lost.


A diamond of the same height as a square becomes effectively an 
inscribed diamond. When you compare the areas, the difference is 50% (!)


Seen in running text (and not next to each other) the diamond will 
look smaller, even though it has the same height.


For the other shapes, the same effect exists, but is not always as 
severe. Whether one matches the size of a hexagon with that of a 
pentagon by height or area, for example, may not result in an obvious 
and observable difference in the impression of their relative size.


Ideally, as an author or font designer, I would aim for a set of 
symbols that have the same optical weight, or impression of 
weight. Shapes that are more compact might be allowed to have a 
little more area, because they might otherwise look short. But in 
this balancing act, I would expect the most functional (and pleasing) 
choices for mathematical use to be those where the shapes end up 
rather closely (but perhaps deliberately not perfectly) matched in area.


As to whether the exact size progression within each series is best 
realized as a geometric or linear, or some other progression, I can't 
suggest a definite answer right now.


In terms of text sizes for CSS, the concept of fixed ratios seems to 
be prevalent. Ideally we would have some more input from mathematical 
typesetters and font designers. Whatever the progression ends up being 
would require that all steps can be distinguished in on-screen viewing 
at some point size (and a traditional, not bleeding-edge, set of DPI 
values).


A./


On 12/17/2012 3:37 PM, Michel Suignard wrote:

Philip,

It would have helped if you had updated your critique of N4115 to
the current proposed code points. The updated version is N4384
(L2/12-368). The number of characters proposed and their
allocation have changed, although the status for geometric shapes
has not changed much.

I spent some time analyzing your documents and I can see you are
trying to harmonize the size of the diamond and the square shapes
by applying the concept that the length of a side should dictate
the ‘size

Re: Character name translations

2012-12-20 Thread Asmus Freytag

On 12/20/2012 2:52 AM, Martinho Fernandes wrote:

Hello,

I was wondering if there is a list of character names translated into
other languages somewhere. Is there?




A French list was created, and for a while maintained with funding from 
the Canadian government. It covered the complete list of Unicode names 
for the version of Unicode at the time. It was hosted at the time on the 
Unicode site - there were issues because it's no longer fully 
up-to-date. Don't know the status.


There was a subset list of names based on a much earlier version of the 
Standard, in Swedish. Have no idea where that is accessible, if anywhere.


There have been efforts at a Japanese translation of the text of the 
standard, I have no idea whether that contains translated names for 
characters.


For many scripts, the character names consist of a prefix identifying 
the script, a designator that distinguishes basic classifications such as 
capital/small letters, vowels, and consonants, and a part that is often 
some transliteration of the character.


After translating the script name and the few words for these 
designators, what remains is the selection of an appropriate 
transliteration scheme for that script in the target language.
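
As a rough illustration of that structure, a translation tool might 
split a formal name as in the Python sketch below (the designator list 
is a small illustrative sample, not a complete inventory):

    # Split a formal Unicode name into script prefix, designators, and remainder.
    DESIGNATORS = {"CAPITAL", "SMALL", "LETTER", "VOWEL", "CONSONANT", "SIGN"}

    def split_name(name: str, script: str):
        words = name.split()
        prefix = script.split()
        assert words[:len(prefix)] == prefix
        rest = words[len(prefix):]
        designators = [w for w in rest if w in DESIGNATORS]
        remainder   = [w for w in rest if w not in DESIGNATORS]
        return script, designators, " ".join(remainder)

    print(split_name("CYRILLIC CAPITAL LETTER ZHE", "CYRILLIC"))
    # ('CYRILLIC', ['CAPITAL', 'LETTER'], 'ZHE') - ZHE is the transliterated part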


For most of these elements, existing translations should exist, and be 
easily accessible from the usual dictionaries and online resources, 
except perhaps for the script names.


Punctuation marks and symbols tend to have more detailed names and 
present more issues to a translator.


In all the translated lists that I have seen it has been customary to 
use all uppercase letters, but allow the use of accented characters - 
essentially replacing the notion of A-Z with something like the basic 
alphabet for the given language. Some languages may require certain 
punctuation marks, in addition to hyphen, because these marks form part 
of the words used for traditional names of characters.


In many instances, translators have chosen to provide a new name for a 
character in the target language, usually based on a common name, or in 
analogy to other names in that language, rather than to translate 
word-for-word the English name.


It is unclear whether all languages benefit from an effort to translate 
all character names in the Standard, but having a cross reference of 
character codes to local names for widely used characters (or those of 
regional importance) seems a worthy goal.


Character names serve two purposes, which are sometimes at odds. One is 
to simply act as formal identifiers that are more or less mnemonic 
(which the hex codes are not). The other is to aid in identifying a 
character, for look-up or selection.


For the latter case, the formal names can be insufficient, because at 
times they are very arbitrary and don't represent the most common name, 
or because there isn't a single, common name for the character.


The French translation therefore wasn't limited to the character names; 
it translated the full character names list (what is used to print the 
code charts) with all the alternate descriptions (aliases) and 
annotations for the characters. Once you do that, it's clear that the 
work is indeed useful to ordinary users, because you enable them to 
search for a character by some word in their own language, and it is no 
longer a question of whether you are translating pure identifiers.


A./


Re: Character name translations

2012-12-20 Thread Asmus Freytag

On 12/20/2012 7:26 AM, Leif Halvard Silli wrote:

Andreas Prilop, Thu, 20 Dec 2012 15:41:28 +0100 (CET):

On Thu, 20 Dec 2012, Jukka K. Korpela wrote:


http://www.ling.helsinki.fi/filt/info/mes2/

Unicode names have certain restrictions (capital ASCII letters, etc).
This Finnish list even uses non-ASCII characters but sticks to
capital letters. Why no small letters if non-ASCII letters are allowed?

Which characters could be used for a Russian translation?
Cyrillic letters?
Only capital letters? If so — why?

My impression is that Unicode character names are limited to - in order
of priority:

  1. language (en-US)
  2. character set (US-ASCII)
  3. uppercase


Language
Letters + digits + some punctuation
UPPERCASE


What is the basis for the choice of uppercase? The probable answer
might be that it sticks out. It makes the name appear as code rather than 
ordinary words (which could thus lead to mistakes: is it a word or a code?).

The same way of thinking *plus* a desire to look like Unicode could justify 
why translations into e.g. Finnish and Russian would apply the same rules.

If you take the Unicode character names in the context of OTHER 
information about the character, as presented in the Unicode character 
nameslist (code charts) for example, then being able to distinguish the 
formal names (UPPER CASE) from informal aliases (lower/mixed case) is 
very handy.


In my other message, I made clear that I think translations of just the 
names is a lot less useful than translation of the full information 
presented in the code charts, which includes block (and therefore 
script) names, annotations and listing of alternate names by which these 
characters are known to ordinary users.


If your language uses a bicameral script, then the easiest way is to 
follow the same typographical conventions (or analogous ones) as the 
original text.


A./

PS: some languages use punctuation in forming words. If avoiding such 
use would make the names appear artificially restricted, such use might 
be allowed in addition to HYPHEN-MINUS for such a language.


PPS: ideally, the translated character names obey the same uniqueness 
under analogous 'loose matching' rules as the original character names, 
and where formally published by a standards organization as a 'national' 
version of 10646, one would expect similar guarantees for name stability.
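
For illustration, a simplified Python sketch of the kind of loose 
matching meant here (modeled on, but deliberately simpler than, rule 
UAX44-LM2 in UAX #44, which additionally preserves certain medial hyphens):

    def loose_key(name: str) -> str:
        # Ignore case, spaces, underscores, and hyphens when comparing names.
        return "".join(c for c in name.upper() if c not in " _-")

    assert loose_key("ZERO WIDTH SPACE") == loose_key("zero-width space")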




Re: Character name translations

2012-12-20 Thread Asmus Freytag

On 12/20/2012 2:36 PM, Jukka K. Korpela wrote:

2012-12-20 14:13, David Starner wrote:


It may be useful to try to agree on official or semi-official names for
characters in a language. Such a list hardly needs to cover all of the
over 100,000 Unicode characters.


Why not? Why should an English speaker sticking an arbitrary character
into a character map program get a name for it but a non-English
speaker not?


For most characters, a “translated” name would be arbitrary. I would 
compare this to names of biological species. Most species lack names 
in most languages, and when names exist, they are often vaguely and 
inconsistently used. 


But when real people, not biologists, want to look up information they 
have precisely two choices: they can look at a visual index (for species 
that can be arranged visually) or they can look up the scientific name 
for the species based on the only thing they know: the local popular name.



That’s why people use scientific (Linnaean) names. We use common names 
for common animals, but it just would not make sense to assign a name 
to the millions of insect species in each human language. The 
scientific name is a crucial key to information. With Unicode 
characters, both the number and the name act as such keys, though the 
name is usually descriptive of meaning, too.


Unlike species, all characters for living scripts have popular local 
names in at least one language other than English.


It may not be desirable to blindly translate ALL such names into ALL 
languages, but major languages (not only English) may be used by people 
that are familiar with or study many other languages and scripts. For 
those languages, their community of scholars represents another set of 
users who benefit from translated names.


Finally, for arcane scripts, there's usually an easily translatable part 
of the character name (think of LATIN SMALL LETTER) and an arbitrary 
part of the name (e.g. A) which comes from a transliteration scheme, a 
catalog number or the like.


If a language doesn't have a unique transliteration scheme for a 
particular script, the choices are to either use the same as present in 
the Unicode Standard, or to use one from another, culturally more 
relevant language (e.g. a French-based instead of an English-based 
transliteration).






So Unicode names should not be translated at all, any more than you
translate General Category values for example.


Why wouldn't you?


Because those values are identifiers.


No, names have multiple uses; especially if you take the formal name as 
one in a series of aliases for each character - that's why it's often 
more useful to think of translations of the full code charts and 
character index, instead of just the formal names. (The latter, by 
themselves, are not so useful.)





There's an argument that they're generally useful
for programmers only and programming often requires English knowledge,
but if I were explaining the character categories in Esperanto, I
would certainly say that Sm is “matematikaj simboloj” or “Simbolo
Matematika”, not act like “Symbol, Math” should have any importance to
my audience.


We can and often should *explain* meanings of identifiers in different 
languages, but that’s different from naming things. The value “Sm” has 
a technical meaning, and it is not identical with the common-language 
expression “mathematical symbol” or its variants, though rather close.




The linguistic content of the short labels is indeed limited; however, I 
can see good reasons to provide alternate abbreviations for characters, 
e.g. for ZWSP or WJ, because these terms are used in places where they 
do not act as identifiers.


A./



Re: Character name translations

2012-12-20 Thread Asmus Freytag

On 12/20/2012 5:12 PM, Philippe Verdy wrote:
Given the form of these names in the UCD, most of them could be 
translated automatically using a common dictionary and resolving some 
terminologies that are approximative in Unicode.


If you mean by that that it can usefully be done with a translation 
memory, I'd agree - but nothing fully automatic, I'm afraid.


But translated names should not be capitalized and not restricted to 
plain ASCII (including in US English; the standard names of the UCD are 
not really in a human language but are names in a computer language, as 
in the default C locale).


Why? Why not?

The reason existing translations used uppercase is because they were 
trying to translate existing documents (e.g. code charts) where that 
convention was used.


If you merely create some DB for other purposes, by all means, use 
what's appropriate.


There's just no one-size-fits-all, one way or the other.
A./



Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Asmus Freytag

On 12/23/2012 3:55 PM, Joó Ádám wrote:

Roger, thank you for sharing this excerpt, I truly enjoyed it. You
drew my attention to a book I should definitely have a look at.


It's definitely a nice way to introduce people to this book, or remind 
them of it. I'm sure some misguided publishers would like one to have to 
get permission even to quote the endorsements from the cover text, but I 
find that attitude silly and, frankly, counter-productive.


If anything, the combination of this particular excerpt and source 
should help to generate more interest in people to obtain the book if 
they don't have a copy yet. It was not as if Roger gave away the plot, 
or pulled out the only memorable part of the book.



I must agree with Karl: I was surprised by Jukka’s reaction, since this
kind of quotation is both legally and ethically unquestionable here,
in the very center of Europe...


Glad to hear that.

I also agree with the points that Karl had raised.


I am not willing to be silent when what I perceive to be bullying is
expressed on this list.  This should be a safe place for any newbie to
post.  I found an unwarranted aggressiveness in Jukka's response to
Roger's apparently well-intentioned post.

unicode.org is based in the USA.  As another poster said, this
quotation would be considered fair use under USA law.  Quoting like
this is extremely common in USA writings.

The post uses only US-ASCII.  I'm sure that Jukka knows that US-ASCII
does not have an EM dash.  The standard I was taught in school (in the
USA) was to represent an EM dash in such situations precisely as the
original post does, as a sequence of two hyphen-minuses.

I do not believe that either the EM dash or the miscapitalization of
a word constitutes distorting the text, and I find it difficult to
believe that Jukka really does either.  Therefore I believe that Jukka
was not being honest in his response to the post; it appears to me
that he concealed the real reason he objects to it.

I could be wrong, and perhaps there are cultural differences between
the USA and Finland that are being unconsciously expressed here.  But
I can tell you that as a native USA English speaker, I found nothing
wrong with the original post.  And I found Jukka's response
objectionable.


There are cultural differences in the way users on online forums (and 
lists) correct each other's spelling and punctuation. I'm thinking of a 
few examples, but even the more relaxed ones will try to encourage some 
minimal standards, like avoiding ALL CAPS, reducing the use of totally 
random spelling, or introducing a minimal number of paragraph breaks. Some 
of these things just grate on people's ears and there's always someone 
who says enough and posts some suggestions for the newbie. Usually, 
the tone of such messages fits the style of the list. There are cultural 
differences there as well (culture of the group, that is).


The fact that Roger's post was a quote wasn't clear to me until I 
reached the attribution at the very end. Had I stopped reading half-way 
through, I would have attributed the clever words to him. I'm sure that 
was not his intention. Because of that, though, there's a reason that 
the reader may find the use of the quote subconsciously more 
questionable than if Roger had given the source up front (and perhaps 
included a sentence or two of his own on whether he can recommend the book 
and why).


Sometimes a post can rub somebody else the wrong way. Something that may 
have less to do with what's in the post, but with the state the reader 
is in when he comes across it.


A./









Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Asmus Freytag

On 12/30/2012 1:22 PM, Costello, Roger L. wrote:

Hi Folks,

I have heard it stated that, in the context of character encoding and decoding:

 Interoperability is getting better.

Do you have data to back up the assertion that interoperability is getting 
better?


The number of times that I receive e-mail or open web sites in other 
languages or scripts WITHOUT seeing garbled characters or boxes has 
definitely increased for me. That would be my personal observation.


More people are sending me material in other scripts and languages, 
whether on this list or via social media. Interoperability as measured 
in those terms has clearly improved as well; again, as experienced 
personally.


I still see the occasional garbled characters, most often because of a 
Latin-1/Latin-15 mismatch with UTF-8. Interoperability is not perfect. 
There's also no real reason to continue to create material in those 
8-bit sets, especially if the data is mislabeled as UTF-8 (or sometimes 
vice versa).
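
A quick way to see what such a mismatch does (a minimal Python sketch; 
the sample text is purely illustrative):

    # UTF-8 bytes misread as Latin-1 produce the classic mojibake pattern.
    text = "déjà vu"
    garbled = text.encode("utf-8").decode("latin-1")
    print(garbled)   # dÃ©jÃ  vu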


In my experience, the rate of incidence for these appears to be going 
down as well, but I'm personally not running an actual count. I can 
imagine that there are places (and software configurations) that expose 
some users to higher rates of incidence than I am experiencing.


Rather than dissecting general statements such as whether 
interoperability is getting better or not, it seems more productive to 
address specific shortcomings of particular content providers or tools.


In the final analysis, what counts is whether users can send and receive 
text with the lowest possible rate of problems - and if that requires 
transition away from certain legacy practices, it would be important to 
focus the energies on making sure that such transition takes place.


A./




Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Asmus Freytag

On 12/30/2012 3:19 PM, Leif Halvard Silli wrote:

My feeling is that interoperability is getting better everywhere. But one 
field which lags behind is e-mail. Especially Web archives of
e-mail (for instance, take WHATwg.org’s web archive). And also some e-mail 
programs fail to default to UTF-8.


Archiving seems to occasionally destroy whatever settings made the 
original work. I have seen that not only with e-mail, but also with 
forums that have a separate archive format.


Time to get those tools to move to UTF-8.

A./



Re: Interoperability is getting better ... What does that mean?

2012-12-31 Thread Asmus Freytag

On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:

Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
The Web archive for this very list needs a fix as well … 



The way to formally request any action by the Unicode Consortium is via 
the contact form (found on the home page).


A./



Basic Latin

2013-01-01 Thread Asmus Freytag

On 1/1/2013 3:53 PM, Naena Guru wrote:
(By the way, Unicode is quietly suppressing the Basic Latin block by 
removing it from the Latin group at the top of the code block page 
(http://www.unicode.org/charts/) and hiding it under different names 
in the lower part of the page.)


I don't know what you mean here; you get it by clicking on the header 
Latin at the very top of the Latin group. The word basic was deemed 
redundant in the index (a choice that you can argue about forever - if 
space wasn't at a premium on that page, it might have been an easy 
decision to add an alias).


A./


Re: Terminology: does the term codepoint apply to non-Unicode character sets?

2013-01-01 Thread Asmus Freytag

On 1/1/2013 12:43 PM, Costello, Roger L. wrote:

Hi Folks,

Does the term codepoint apply to non-Unicode character sets?

For example, are there codepoints in iso-8859-1? In Windows-1252?

/Roger




The short answer is yes.

The term code point was in use for locations in IBM code pages long 
before Unicode was created; in the context of other standards, slightly 
different terms were in use, such as code location. (Windows-1252, 
while created by Microsoft, was registered in the IBM code page 
collection at the time, which assigned to it the number 1252, so the use 
of code point  for that character set is definitely an extension of 
the earlier usage).


It's worthwhile, if you operate in the context of some other standard, 
to make sure you follow the terminology as defined there; but for 
general use, the word code point is not tied to or reserved for Unicode 
(just be clear which character set you are talking about).


Both spellings, with and without the intervening space, can be found, 
but Unicode uses the term only without the space.


A./


Re: Terminology: does the term codepoint apply to non-Unicode character sets?

2013-01-02 Thread Asmus Freytag

On 1/2/2013 9:00 AM, Doug Ewell wrote:

Asmus wrote:

 Both spellings, with and without the intervening space, can be 
found, but Unicode uses the term only without the space.


This didn't sound right to me, so I checked the Glossary, and it lists 
the term as two words with a space.


http://www.unicode.org/glossary/#code_point


OK. There are a few terms where Unicode doesn't use the space. I could 
have sworn this was one of them; looks like I got it backwards.

A./





Re: Basic Latin

2013-01-02 Thread Asmus Freytag

On 1/2/2013 3:26 PM, Jukka K. Korpela wrote:

2013-01-03 0:22, Markus Scherer wrote:




The page has been modified to add an alias for Basic Latin (ASCII) under
the Latin heading.


I can see that, but I don’t think it’s an improvement. It puts the 
Latin script in a special status. 


The special status results from the fact that nearly all other scripts 
don't use the word Basic but have a block whose name is equal 
to the name of the script. The other factor is that this block 
happens to be the most looked-up block, so a small change accommodates 
many users.


The purpose of the index page is to allow people to find what they are 
looking for, and when they are looking for Basic Latin because of the 
block name, they should not be required to do mental gymnastics to 
puzzle out where that block might be hidden.


And it makes both “Latin” and “Basic Latin (ASCII)” links to the same 
page, violating fundamental accessibility principles: duplicate links 
should be avoided, and when they can’t be avoided, they should have 
exactly the same link texts.


Nice principle, but utterly misapplied.

Look at any book index and you will find the same page (even passage) 
indexed under multiple terms - as appropriate.


And, if you look at the page source for the chart index you will find 
that there are already several links to the same page in other 
instances, so this change is not some kind of dramatic departure.


The original design was created the way it was based on considerations 
like the ones you raise here. Over time, evidence piled up that this was 
creating a usability problem. That has been fixed, so now we can all 
move along, nothing to see here.


A./




Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Asmus Freytag

On 1/4/2013 2:36 AM, Stephan Stiller wrote:

All,

There are plenty of unassigned code points within blocks that are in 
use; these often come at the end of a block but there are plenty of 
holes as well.


I have a cluster of interrelated questions:
1. What sorts of reasons are there (or have there been) for leaving 
holes? Code page conversion and changes to casing by simple 
arithmetic? What else?


There are a number of reasons why a code chart may not be contiguous 
besides the reason you give. Sometimes, a character gets removed from 
the draft at the last minute; in those cases, a hole may be left. In 
general, the possible reasons for leaving a hole cannot be enumerated 
in a fixed list. It's more of a case-by-case thing.
1.1 The rationale for particular holes is not documented in the code 
charts I looked at; is there documentation? (Yes, in some instances 
the answer can be guessed.)


In general, no. Sometimes, there's explanation in the text.
1.2 How is the number of holes determined? It seems like multiples of 
16 are used for block sizes merely for practical reasons.

Blocks end on a value ending in F in hexadecimal notation.
2. I notice that ranges are often used to describe where scripts are 
found. Do holes have properties? Are there other block-related policies 
that give holes a certain semantics?


There are default values for some properties that apply to unassigned 
code points, in order to make an algorithm do its best with 
as-yet-unassigned characters (so that if a new character is assigned, the 
algorithm doesn't necessarily have to be reimplemented but still gives 
good results).


There's no distinction between holes and other unassigned characters.
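
(The defaults are visible in libraries that ship the Unicode Character 
Database; a minimal sketch in Python, using an arbitrary code point that 
is unassigned as of this writing:)

    import unicodedata
    # Unassigned code points report the default General_Category "Cn"
    # (Other, not assigned); defaults like this let an algorithm degrade
    # gracefully on not-yet-assigned characters.
    print(unicodedata.category("\U000E4567"))   # Cn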
2.1 If not, how likely is it that Unicode assigns script-external 
characters to holes?


It's generally not desirable, but there's no firm policy that blocks 
must have a single script value (and in fact, existing blocks observe no 
such restriction).
2.2 If yes, how does the number of assigned code points differ, if 
holes that are assumed to be filled only by certain types of 
characters are counted?


???
2.2.1 Would this make much of a difference wrt the question (this 
comes up from time to time it seems) of how much of Unicode will 
eventually fill up?


If strong technical reasons exist for placing a character into the BMP, 
there will be temptation to fill a hole if the BMP is otherwise full. 
Likewise, many, many years (decades) from now, similar pressure might 
exist should the rest of the code space become filled.


However, the most likely scenario is that Unicode will continue for an 
indefinite period with sufficient open space (and the occasional hole).

3. Have there been mistakes wrt to hole assignment?


Unicode doesn't make mistakes. :)

A./


Stephan








Re: Is that character U+A7AC LATIN CAPITAL LETTER SCRIPT G ?

2013-01-10 Thread Asmus Freytag

On 1/10/2013 2:08 AM, Otto Stolz wrote:

Hello,

le 09/01/2013 18:07, Frédéric Grosshans a écrit :
Yes, but I actually don't know. I'd really like to have some idea on 
those old

printing techniques, but I fear we're drifting to off topic subjects...


Am 2013-01-09 um 18:16 schrieb Frédéric Grosshans:

Actually, the preceding tool combined with
http://en.wikipedia.org/wiki/Mimeograph would be my best (uninformed) 
guess.


I’d rather guess, he used this technique:
  http://en.wikipedia.org/wiki/Dry_transfer.
I have used it myself, in the 70s, to insert all those
Greek symbols into the formulae in my Dipl.-Phys. thesis.
It renders much clearer glyphs than the mimeograph
technique.

Best wishes,
  Otto Stolz



LetraSet (the market leader at the time) was indeed widely used by the 
70s, but was this available as early as the date of the manuscript?


The hallmark is absolutely identical shapes, but with a strong likelihood 
of small positioning errors (in both axes and rotation). The latter 
should show up on careful examination. Sometimes a letter could tear, 
or the thin foil could fold or crease upon transfer. Usually, 
in a careful production one would redo the letter, but sometimes such 
small imperfections survive - they look very different from defects in 
other forms of typography.


A./




Re: Is that character U+A7AC LATIN CAPITAL LETTER SCRIPT G ?

2013-01-10 Thread Asmus Freytag

On 1/10/2013 5:21 AM, Frédéric Grosshans wrote:

Le 10/01/2013 11:08, Otto Stolz a écrit :

Hello,

le 09/01/2013 18:07, Frédéric Grosshans a écrit :
Yes, but I actually don't know. I'd really like to have some idea on 
those old

printing techniques, but I fear we're drifting to off topic subjects...


Am 2013-01-09 um 18:16 schrieb Frédéric Grosshans:

Actually, the preceding tool combined with
http://en.wikipedia.org/wiki/Mimeograph would be my best 
(uninformed) guess.


I’d rather guess, he used this technique:
http://en.wikipedia.org/wiki/Dry_transfer.
I have used it myself, in the 70s, to insert all those
Greek symbols into the formulae in my Dipl.-Phys. thesis.
It renders much clearer glyphs than the mimeograph
technique.
I don't think so, because it is a 'real book' ( 
http://books.google.fr/books/about/La_th%C3%A9orie_des_particules_de_spin_1_2.html?id=3qzvMAAJredir_esc=y 
), which was printed in enough copies to be available six decades 
later in several libraries and on sale on the internet for a reasonable price.

The dry transfer technique does not seem suited to such a publication.


One would apply the dry transfer to the original typescript. The book 
itself would then be printed by some photo-mechanical means (e.g. PMT).


I was involved in some print publication in the early eighties where the 
original was created using a variation of a photo-typesetting machine 
which, however, just created a single column of text. The output from 
that was pasted up (together with graphics) and then transferred 
photo-mechanically onto a drum for offset printing.


Something analogous could easily have been done to a high quality 
typescript with LetraSet for the special characters. The fact that the 
book uses a typewriter-like font for the running text seems to hint at 
that. (Some later typewriter ribbons used a technique similar to the dry 
transfer, and unlike the inked ribbon for which early typewriters are 
known)


I don't remember ever learning the proper terms for all of these things, 
but it should be easy to find those buried in Wikipedia somewhere.


A./


   Frédéric






Re: help with an unknown character

2013-01-10 Thread Asmus Freytag

http://ts2.mm.bing.net/th?id=H.4791646751032057pid=1.7w=176h=155c=7rs=1


??

Relation? Visual or otherwise. Pun?

(Note the similarity :widder: :wider:)

Just thinking out loud.

A./



Re: help with an unknown character

2013-01-17 Thread Asmus Freytag

On 1/16/2013 5:35 PM, Philippe Verdy wrote:

Fair enough. It's not a problem to ask the question, Is this a candidate for 
encoding? It becomes a problem when the poster assumes, because that blob appeared 
in such-and-so location, that it MUST be a candidate for encoding, and no level of 
argument about the character/glyph model, or the need to interchange the blob, or 
anything else, will change that person's mind.

Was there any sign of such an assumption in the original question sent by Elbrecht? He just 
asks for help, nothing else. He does not request a new encoding. He just speaks about 
something he found for which there's no easy mapping to Unicode.


Where Philippe is right, he is right.

Yes, there are a few very obstinate individuals, but they are well known. 
However, it seems, that frequent interaction with them has given the 
list an allergic sensitization. That is unfortunate. It should be 
possible to come to the list, even if one is convinced the sign, symbol 
or letter is new to Unicode. I would even claim that most people who 
post here are discouraged by the negative reaction anyway, and never 
file a submission - even if their case has merits. Heck, even the 
obstinate ones don't always get around to filing a submission :)


The proper place for this list is to offer discussion, background and 
advice - it's not the ruling body, and the final determination of what is 
or is not a valid character belongs to the proper committee, such as the 
UTC. Something that occasionally gets forgotten.




Re: Spiral symbol

2013-01-21 Thread Asmus Freytag

On 1/21/2013 4:11 PM, Andrés Sanhueza wrote:

Hello.
I have wondered if it may be a good idea to make a proposal for a 
spiral character, basically because I believe it is the only major 
symbol recurrently used to represent swearing in comics that's 
missing from Unicode.


If it should come to a proposal, I can help out with one or two 
citations of the use of this symbol for that purpose in contexts that 
are not that different from other lettering in the same sources. Not 
more than emoji are from regular words.


A./

Most of the time it is replaced with the more common at (@), but still 
an actual one may be good. Not sure yet if there's enough 
documentation. Some Emoji representations display the CYCLONE 
character (U+1F300) as one, yet I don't think that fits as a better 
replacement.


Andrés Sanhueza





Re: End of story character

2013-01-25 Thread Asmus Freytag



On Thu, 24 Jan 2013 20:05:41 -0300
Andrés Sanhueza peroyomasli...@gmail.com wrote:


Do you think that an end of story symbol may be feasible/useful?


My position is that the attempt to encode such semantics that are 
defined on the whole text level is a mistake. In fact, it is a common 
mistake that keeps surfacing in proposals or tentative proposals.


When Unicode encodes semantics, it's on the level of individual symbols. 
If there were a recognized notation that defined an end of text 
symbol, then you could encode that in Unicode, and expect that to be 
rendered with ordinary stylistic variations (governed by font selection 
- with the font not selected just for that symbol, but once, for all 
aspects of that notation).


Such a use would then be analogous to something like the integral sign, 
which has a (small) range of customary and conventional shapes, e.g. 
upright or slanted, bulky or slender, which fall into what anyone would 
consider stylistic variations. The precise variation is usually selected 
by choice of font not just for the integral, but a whole set of other 
mathematical symbols as well (the full notation in fact).


Placing a symbol of some sort at the end of a text is a fairly 
widespread convention, but there is no agreement on any set or range of 
customary shapes for that purpose. In a way, that makes this convention 
less a notation and more something different. In some ways it's more similar 
to the way that languages may agree on representing the concept house 
as a noun, albeit with completely different sets of shapes (house, 
Haus, hus, maison etc.).


For languages, those representations would be called spellings, and I 
think that's the appropriate concept as well for the end of story 
convention. Rather than conceiving of it as  a single character with a 
range of glyphs, it's a convention on the whole text level that is 
customarily expressed by different spellings (choice of abstract or 
pictorial symbol).


Just as Unicode does not unify spellings, the different  choices of 
symbol for end of story should remain disunified. Each user of the 
convention decides on an appropriate character or symbol for the 
purpose. (Another analogy would be list item markers which are equally 
not unified into a generic control code with glyph variants, but are 
separate characters).


Because the semantic of the convention is not directly represented / 
representable on the encoding level, there's also no need to encode 
multiple characters of different shapes such as end of story-1, end 
of story-2 etc. Instead, like the use of . or , for decimal point, 
the semantic of end of story comes from context. Whenever a symbol is 
placed consistently at the end of every story in a collection, that 
symbol acquires the end of story semantics.


There are cases where Unicode has duplicated characters (using the same 
shape) based on which convention they happen to be used with. All these 
duplications are problematic in many contexts, however well intentioned 
they may have been. These cases make poor precedents and must be 
properly understood as the exceptions they are. The general encoding 
principle in Unicode remains that Unicode does not encode spelling - 
which means that symbols and other characters can be put into new 
contexts and there acquire new semantics to the human reader - without 
requiring the addition of dedicated code points.


With this, we can turn back to the original question. Should an end of 
story character be encoded? The answer must be negative. However, if 
particular shapes have been in widespread enough use for that purpose, 
but are not yet encoded in Unicode as their own symbol, then encoding 
such symbols for general use would be appropriate.


Some of the more fancy symbols used for end of story on the other hand 
might be better implemented as private use characters. For example the 
use of corporate logos at the end of magazine articles.


A./



Re: End of story character

2013-01-25 Thread Asmus Freytag

On 1/25/2013 6:52 AM, Joó Ádám wrote:

Asmus, I would be happy to hear your opinion on my question, in
context of which I may not have been clear on that my intent is not to
propose a general character for all uses as end-of-story sign but a
well-defined symbol based on both shape and usage pattern (a perfect
filled square, appropriately sized based on x-height or whatnot, used
as end-of-story sign). The name may well be something more visually
descriptive, not necessarily END OF STORY.

Á

Such a character would be a geometrical symbol. X-HEIGHT SQUARE ON 
BASELINE might be a descriptive name to distinguish it from other small 
square symbols that might happen to be in the standard already.


Alternatively it might be considered a punctuation character, but the 
symbol is so generic that giving it the punctuation semantics seems 
debatable. But I wouldn't exclude that option.


Naming it end of story would imply that it is the only such character, 
so perhaps end of story square would be more suitable.


I am a strong proponent of not unifying geometric shapes (or certain 
punctuation marks) merely on the ink part of their shape, while 
disregarding vertical or horizontal placement. Instead, such placement 
can be significant, and if there is evidence that it relates to 
differences in usage, I tend to support that as evidence for disunification.


A./


Re: End of story character

2013-01-25 Thread Asmus Freytag

On 1/25/2013 7:44 AM, Mark E. Shoulson wrote:

On 01/25/2013 08:12 AM, Joó Ádám wrote:

I don’t know of its use outside of Hungary, but here, as the quote of
Halmos suggests, the tombstone is traditionally used in print
magazines as end of story. We have adopted it to the web on the
Weblabor magazine, where it stands at the end of all blog posts, so
the reader knows whether it is worth opening the story on its own, or 
whether the excerpt on the front page was the whole story.

We had a problem with U+220E END OF PROOF though, as in most fonts it
is a rectangle, while in traditional use it is almost always a perfect
square. So we decided to use U+25A0 BLACK SQUARE instead, which has
its own problem since it really is oversized for this usage, so we had
to mark it up and scale it down.

Most of the times I've seen it, it's actually some form of a logo of 
the magazine in question, or at least a square with the magazine's 
initial(s) in it. Those all seem to be specialized forms of END OF 
PROOF to me. It fits the semantics too; a black block at the end of 
the article. If some magazines use squarer blocks and some more 
rectangular, that's glyph variation.


A good start at a counterexample might be a math journal that uses 
different-shaped blocks at the ends of its proofs and articles. Still 
might just be different fonts, but it does start to address it at least.


As I point out in another post, the comparison to other conventions 
points to things like list bullets. Clearly, almost any character (or 
image) can be used as list bullet. There simply is not a universal list 
bullet character, although BULLET is a very common character for that 
purpose. It would be a mistake, in my view, to conceptualize the use, 
say of a square bullet, as merely a glyph variant of such a universal 
bullet character.


The correct view, in my opinion, is to see these are different 
spellings of the same general concept, a concept that is therefore not 
directly expressed on the level of character semantics (just as many 
other conventions that use characters are not represented directly on 
the character level - they merely use characters by some sort of 
convention).


End of story markers can also be decorative. A boating magazine might 
use an anchor or a sailboat silhouette, for example, both representable 
by existing characters. As a result, the task should reduce to 
identifying whether there are generically usable symbols that are 
deployed for end-of-story markers. If these aren't encoded, they could 
be. (Make that should be), while idiosyncratic symbols should probably 
not be encoded - and either represented as PUA codes or directly as 
inline images in rich text.


As for representing the end of story semantic in a parseable way, that 
would be the domain of XML or similar structural markup, it would seem 
to me. Just because we speak of character semantics doesn't mean that 
all semantic aspects of a document need to be expressed on that level.


A./



Re: Long-term archiving of electronic text documents

2013-01-28 Thread Asmus Freytag

On 1/28/2013 5:12 AM, Martinho Fernandes wrote:

Similarly, there could be a type of pdf document where the text within the pdf 
document were stored in UTF-64 format.



FWIW, there is already a PDF variant designed for long-term archiving
known as PDF/A. You may want to look into that.




Good point.

Also, and that is a reply to William's original suggestion, please note 
that each new format that is introduced, especially on the character 
code level, means adding a circle to the big Venn diagram - dividing ALL 
tools into a huge number of those that cannot handle the format and an 
initially minuscule number that can.


The small spot in the diagram reserved for tools that can handle ALL 
formats (or at least those desired for archiving) will correspondingly 
shrink.


As a result, instead of improving the recoverability of archived 
electronic text documents, you've found a way to make it less reliable - 
any given combination of recovery tool and format may not work, and the 
more combinations there are, the lower the probability of success.


A./



Re: Long-term archiving of electronic text documents

2013-01-28 Thread Asmus Freytag

On 1/28/2013 4:30 AM, William_J_G Overington wrote:

The idea is that there would be an additional UTF format, perhaps UTF-64, so 
that each character would be expressed in UTF-64 notation using 64 bits, thus 
providing error checking and correction facilities at a character level.


I think this proposal is a few weeks early, and that it should be 
resubmitted on the proper date, but as UTF-256 - for greater redundancy.


UTF-256 allows each hex digit of UTF-32 to be expressed as an ASCII hex 
digit (characters 0-9 and A-F encoded as bytes 0x30-0x39 and 0x41-0x46).


This leaves two bits per hex digit unused which could be utilized for 
bit-level error correction, or you could go to UTF-512 and encode each 
code twice.
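
(A minimal sketch of the scheme in Python; the function name is invented 
for the occasion, and the doubling that gets you to UTF-512 is left out:)

    def utf256_encode(s):
        # Spell each UTF-32 code unit out as eight ASCII hex digits
        # (bytes 0x30-0x39 and 0x41-0x46).
        return b"".join(b"%08X" % ord(c) for c in s)

    print(utf256_encode("A"))   # b'00000041'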


The possibilities are endless.

A./


Re: External Link (Was: Spiral symbol)

2013-01-31 Thread Asmus Freytag

Mark,

in my view, the key aspect of the notice cited by Debbie, is the 
rejection of an external link semantic, which would act as a kind of 
generic code and could be rendered in many different ways.


Instead, the notice leaves open a request to standardize a particular 
shape, which then could be used as external link symbol by anyone 
wishing to use that particular shape for that purpose.


I happen to believe that the UTC got that one right, but I do see room 
for encoding a particular shape, if there's a user community behind it, 
whether based on passive evidence or, preferably in my view, active 
support.


Passive evidence is usually the preferred method for support, but in 
this case you may well run into a chicken and egg problem, unless you 
can find, say, a significant set of PDF documents where actual glyphs  
were used.


Active community support might be tricky because, unlike currency 
symbols or mathematical notation, it's not clear what constitutes a 
representative user community. However, if a community could be found to 
whom the preservation of this symbol matters when documents are 
converted to plain text, then that should help the case.


The fact that this keeps bubbling up is, to me, a sign that the notion 
that this ought to be a character is widespread - that certainly 
satisfies one of the necessary conditions, but as the UTC notice shows 
it's not a sufficient condition.


A./

On 1/31/2013 3:53 PM, Deborah W. Anderson wrote:

Mark,
The External Link symbol has been proposed*, you are correct, but it was
rejected by the UTC. See the Notice of Non-Approval, dated 06 June 2012:
http://www.unicode.org/alloc/nonapprovals.html

Debbie Anderson

*L2/06-268, L2/12-143, L2/12-169


-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
Behalf Of Mark E. Shoulson
Sent: Wednesday, January 30, 2013 5:27 PM
To: unicode@unicode.org
Subject: External Link (Was: Spiral symbol)

I found myself the other day looking once again for the character
representation of the external link sign so prevalent on Wikipedia and
Mathworld and other sites.  There has got to be enough evidence for
recording something like this.  And I've seen a proposal for it too!
http://www.unicode.org/review/pr-101.html and the proposal itself at
http://www.unicode.org/review/pr-101-06268-ext-link.pdf and proposed by
our own Karl Pentzlin back in 2006.  What has happened with it since?
Still in review?  I don't see it on the Pipeline page.

Can we revive this proposal, if indeed it needs reviving?  I think this
character needs encoding.

~mark








Re: External Link (Was: Spiral symbol)

2013-01-31 Thread Asmus Freytag

On 1/31/2013 5:55 PM, Mark E. Shoulson wrote:
So if a generic external link symbol isn't acceptable, I definitely 
see reason for at least the adoption of box-with-arrow, possibly 
*called* EXTERNAL LINK or something. 
Make that: possibly aliased or annotated as one of the symbols used 
to indicate an external link.


A./


Re: Word reversal from Adobe to Word

2013-02-07 Thread Asmus Freytag
How come I'm not surprised to see the problem traced to an RTF format 
incompatibility. Trying to figure out which parts of the RTF spec to 
support when is nearly impossible...


A./


On 2/7/2013 8:08 AM, Murray Sargent wrote:

If you include a {\fonttbl...} entry that defines \f0 as an Arabic font, Word 
displays it correctly. For example, include {\fonttbl{\f0\fswiss\fcharset177 
Arial;}}

as in

{\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}}
\pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE}
}

This displays as קודמ

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Dreiheller, Albrecht
Sent: Thursday, February 7, 2013 7:33 AM
To: Raymond Mercier; unicode@unicode.org
Subject: RE: Word reversal from Adobe to Word


Raymond,


If I have a Hebrew text displayed in Adobe Acrobat I can select part
of it and can paste it into Word. The trouble is that while individual
characters are correctly displayed the order is reversed.
Thus if I have
in Acrobat
קודמ (meaning 'prior')
when pasted into Word I get
םדוק

The Windows clipboard is a multi-channel medium, i.e. several different data 
formats may be supplied at the same time by the sending application.
The receiving application may choose one of these formats.

Using a clipboard debugging tool, I see that Word fills up to 18 formats, like 
000D  Unicode Text  (10 Bytes)
C090  Rich Text Format  (5815 Bytes)
C10E  HTML Format   (3641 Bytes),
whereas Adobe fills only 6 formats, e.g.
000D  Unicode Text   (11 Bytes)
C090  Rich Text Format (178 Bytes)

In both cases, the Unicode Text format contains the sequence
U+05E7, U+05D5, U+05D3, U+05DE in logical order.

When paste is used in Word, a high level format is preferred by default, so I 
suppose the RTF format is the problem here.

Word creates an RTF sequence like
{\ltrch\fcs1 \af220\afs40\alang1033 \rtlch\fcs0   \f220\fs40\lang1037
\langnp1033\langfenp2052\insrsid13502069\charrsid6162033\'f7\'e5\'e3\'ee}}

N.B. \'f7\'e5\'e3\'ee  is the CP1255 byte sequence for the Hebrew word above.

Adobe produces this RTF sequence:
\pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} 
which is the right character sequence, but seems to be misunderstood by Word.

A solution is to use the Word command Paste contents ... (might be necessary to add it with 
Customize), and then choose unformatted Unicode text from the format list.

Albrecht.
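
(For illustration, the \uN \'XX pairs Adobe emits can be generated 
mechanically. A minimal sketch in Python; the helper name and the choice 
of CP1255 as fallback code page are mine, and escaping of the RTF 
specials {, } and \ is ignored:)

    def rtf_escape(s, codepage="cp1255"):
        # Each non-ASCII character becomes a signed-decimal \uN escape
        # followed by a legacy-codepage byte that old readers fall back to.
        # BMP only; supplementary characters would need surrogate pairs.
        out = []
        for c in s:
            n = ord(c)
            if n > 127:
                fb = c.encode(codepage, "replace")[0]
                out.append("\\u%d \\'%02X" % (n if n < 32768 else n - 65536, fb))
            else:
                out.append(c)
        return "".join(out)

    print(rtf_escape("קודמ"))   # \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE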











Re: s-j combination in Unicode?

2013-02-13 Thread Asmus Freytag

On 2/13/2013 1:59 PM, Andries Brouwer wrote:

[Concerning the g-slash, r-slash, eth-slash symbols,
they can be coded using U+0337 as g̷ r̷ ð̷.
Unicode generally does not decompose slashed symbols - so for example, 
o-slash does not have a decomposition using U+0337.  The UTC may not 
feel bound by this as a precedent, but it would mean that such encoding 
could definitely be proposed, and probably should be, to get any 
decision to decompose these explicitly on the record.


A./


Re: s-j combination in Unicode?

2013-02-13 Thread Asmus Freytag

On 2/13/2013 1:24 PM, Stephan Stiller wrote:



It looks like something that has not been encoded.


What is the reason for not having a true combining grapheme joiner, 
one that overlays graphemes? Or a code point that instructs that the 
preceding (or following, I guess) code point should be printed at this 
position but otherwise be treated as having zero width?




The reason is that Unicode is not a text layout language.

A./



Re: s-j combination in Unicode?

2013-02-13 Thread Asmus Freytag

On 2/13/2013 2:58 PM, Buck Golemon wrote:

On Wed, Feb 13, 2013 at 2:30 PM, Asmus Freytag asm...@ix.netcom.com wrote:


On 2/13/2013 1:24 PM, Stephan Stiller wrote:


  It looks like something that has not been encoded.
What is the reason for not having a true combining grapheme joiner, one
that overlays graphemes? Or a code point that instructs that the preceding
(or following, I guess) code point should be printed at this position but
otherwise be treated as having zero width?



The reason is that Unicode is not a text layout language.

A./



That addresses his second question, but not the first.


Actually --- not. It is intended to address the entire quoted section.


A grapheme combining character would only be usable if a normalized
combined character was also defined, and the mapping between the combined 
characters and the un-combined characters with combiner.


Where do you get that?

In other words adding such a thing wouldn't solve the problem you've posed 
(adding a combined sj character) since combining characters are (as I 
understand it) intended to be ephemeral and only fully combined characters are 
intended for communications.


That understanding of combining characters does not seem to be backed up 
by anything in the standard.


A./







Re: s-j combination in Unicode?

2013-02-13 Thread Asmus Freytag

On 2/13/2013 2:56 PM, Leo Broukhis wrote:

On Wed, Feb 13, 2013 at 11:31 AM, Andries Brouwer a...@win.tue.nl wrote:

I wondered how to code an s-j overstrike combination in Unicode.

I'd write s ZWJ j and use a font that has the appropriate ligature.



These features in Unicode aren't intended as just hacks to get the 
right appearance. The idea is that you can encode the intention of the 
author more directly. Unless the overstruck sj form happens to be 
nothing more than fancy presentation of an otherwise normal s, j sequence.


A ZWJ doesn't let you indicate whether you want an overstruck form or 
some other fused form, that choice would reside in the font - making the 
solution font dependent - which doesn't quite seem the correct approach.


Otherwise, why not use the BS control code. In the old days of teletypes 
that would nicely produce this overstruck effect. No need to define 
another format character if all you want to do is restore the semantics 
of that old control character.


A./



Re: s-j combination in Unicode?

2013-02-13 Thread Asmus Freytag

On 2/13/2013 6:00 PM, Leo Broukhis wrote:

Everything dialectology-related is a fancy presentation of the
phoneme attribute markup.


Well, that's one view.

A./


Leo

On Wed, Feb 13, 2013 at 5:51 PM, Asmus Freytag asm...@ix.netcom.com wrote:

On 2/13/2013 2:56 PM, Leo Broukhis wrote:

On Wed, Feb 13, 2013 at 11:31 AM, Andries Brouwer a...@win.tue.nl wrote:

I wondered how to code an s-j overstrike combination in Unicode.

I'd write s ZWJ j and use a font that has the appropriate ligature.




These features in Unicode aren't intended as just hacks to get the right
appearance. The idea is that you can encode the intention of the author more
directly. Unless the overstruck sj form happens to be nothing more than
fancy presentation of an otherwise normal s, j sequence.

A ZWJ doesn't let you indicate whether you want an overstruck form or some
other fused form, that choice would reside in the font - making the solution
font dependent - which doesn't quite seem the correct approach.

Otherwise, why not use the BS control code. In the old days of teletypes
that would nicely produce this overstruck effect. No need to define
another format character if all you want to do is restore the semantics of
that old control character.

A./





Re: s-j combination in Unicode?

2013-02-14 Thread Asmus Freytag

On 2/14/2013 5:38 AM, Andries Brouwer wrote:

I asked:

: wondered how to code an s-j overstrike combination

and learn from Karl Pentzlin about n3555.pdf where Michael Everson
proposes U+1E0A2 LATIN SMALL LETTER ESJ (and many other characters).
This document is from 2008. What is the status?


From the document record, it seems that 
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4081.pdf now replaces 3555, but 
the newer document contains only a subset of the characters.


Doc. 3555 was considered during meeting 53 of ISO/IEC JTC1/SC2/WG2 but 
only reached the state where there was a request for feedback.


Without digging deeper it appears as if the repertoire that contains the 
proposed overstrike was not followed up, while the work concentrated on 
Teuthonista.
(See mention of N3555  in 
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n3703-AI.pdf)


Therefore, getting these letters encoded would require resubmitting the 
sections from 3555 that contain them and restarting the discussion in UTC 
and WG2.


But I'm sure you'll eventually hear from a direct participant.


On Wed, Feb 13, 2013 at 02:24:12PM -0800, Asmus Freytag wrote:

On 2/13/2013 1:59 PM, Andries Brouwer wrote:

[Concerning the g-slash, r-slash, eth-slash symbols,
they can be coded using U+0337 as g̷ r̷ ð̷.

Unicode generally does not decompose slashed symbols - so for
example, o-slash does not have a decomposition using U+0337.  The
UTC may not feel bound by this as a precedent, but it would mean
that such encoding could definitely be proposed, and probably should
be, to get any decision to decompose these explicitly on the record.

Yes, o-slash is not decomposed, so is different from o followed by U+0337.
But otherwise: are the characters with names starting with COMBINING
not intended to be used as combining diacriticals? Wouldn't use such
as the above be precisely as intended?


Some of the slashes are used, for example, 0338 is used with 
mathematical symbols for denoting negation.


It is just that o-slash, the most widely used representative of the 
*letters* was never decomposed, so to start now would make the treatment 
of letters uneven.
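
(Both facts are easy to check against the character database, e.g. in 
Python:)

    import unicodedata
    # U+00F8 o-slash has no canonical decomposition, so o + U+0337
    # remains a distinct, uncomposable sequence under normalization.
    print(unicodedata.decomposition("\u00F8") == "")      # True: none
    print(len(unicodedata.normalize("NFC", "o\u0337")))   # 2: stays o + U+0337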


[However, n3555.pdf also contains
U+1E067 LATIN SMALL LETTER ETH WITH STROKE
U+1E06E LATIN SMALL LETTER G WITH DIAGONAL STROKE
U+1E096 LATIN SMALL LETTER R WITH DIAGONAL STROKE
and, e.g.,
U+1E0AE LATIN SMALL LETTER NASAL Y
for y with ogonek. At first sight I do not see the a-ring-ogonek here.
Does it occur elsewhere?]


You could try to search for it by constructing the likely character 
name by analogy with existing characters.


A./


Andries






Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/15/2013 11:59 PM, Andries Brouwer wrote:

On Fri, Feb 15, 2013 at 10:56:17PM -0600, Ben Scarborough wrote:

On Feb 16, 2013 02:13, Andries Brouwer wrote:

The fragment of text I showed
was not from dialectology, but just from a novel written in Elfdalian.
The symbols are meant to be those of ordinary orthography.

Does that mean there's also a capital S-J?

Probably, in entirely capitalized text. At sentence start I see
capitalized I-ogonek, O-ogonek, U-ogonek, Å-ogonek in ordinary text.
I have only seen the s-j following d or t, not word-initially.

Andries



That would make it analogous in a way to German ß.

The minute things show up in real orthographies, the pressure to handle 
ALL CAPS exists.


The wider use an orthography has, the stronger that pressure is, of course.

A./


Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 1:38 AM, Stephan Stiller wrote:



That would make it analogous in a way to German ß.

The minute things show up in real orthographies the pressure to 
handle ALL CAPS exists.


The question then is whether you'll find SJ or overlaid S/J. Or 
how a Swede would instinctively handle this, in the absence of an 
example of a consistently applied rule.


There's a question, first, of whether there's a difference between s+j 
and simple sj. Is it just to mark a different pronunciation of what 
would be sj in standard Swedish, or are these contrasting in 
Elfdalian as well?


I suspect that the fallback would be SJ, if nothing else is available, 
but currently, anybody using s+j would use private fonts and thus 
there's not necessarily a need to use a fallback.


This is different from German use where telegraphs and typewriters were 
instrumental in creating and cementing the need for a fallback.


The German-style fallback is painful enough as it is; better to make sure 
it's not Unicode creating the bottleneck.




(By the way, for those finding the German rule to write SS 
unsatisfactory: It's hard to come by an actual minimal pair. 


MASSE - mass or measurements? See, not hard at all.
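
(The lossiness is baked into the standard case mappings; e.g. in Python:)

    # Full uppercasing maps ß to SS, so Maße and Masse collide in caps;
    # the operation cannot be undone mechanically.
    print("Maße".upper())    # MASSE
    print("Masse".upper())   # MASSE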

With the new orthography, ss vs. ß affects the pronunciation of the 
preceding vowel. It's irritating to see SS because you have to 
override that rule when you know that the word in lowercase was 
pronounced differently.


And, as Andreas had painstakingly done, you can collect a nearly infinite 
array of examples where users, in rule-bound Germany(!), simply continue 
to ignore that rule.


A./

PS:
And it's not like capitalization is otherwise invertible – the 
capitalization bits contain information as well, after all.)


Besides the point a bit. Even though it's true that mixed case carries 
information that's lost in all upper or all lowercase, the issue is a 
bit different, as it's not focused on one letter.





Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 7:04 AM, Andries Brouwer wrote:

[BTW Is the fact that o-slash is not decomposed not entirely
analogous to the fact that i is not decomposed? I would say
that neither gives an indication of how symbols involving
a combining dot or combining slash are handled in general.]


Why don't you just take the precedent as what it is and make your 
proposal accordingly. Some decisions that went into Unicode could have 
come out different perhaps, but history says they didn't, and we are 
stuck with them. Changing horses in mid-stream helps nobody.


A./


Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 7:04 AM, Andries Brouwer wrote:

I found Diauni.ttf at
http://www.thesauruslex.com/typo/dialekt.htm  (swedish)
http://www.thesauruslex.com/typo/engdial.htm  (english)

It has landmålsalfabetet at E100-E197 (lower case only)
and s-j at E19F, S-J at E1A5, with Y-ogonek, Å-ogonek,
G-slash, R-slash, Ð-slash nearby.
So you have evidence that the uppercase form is implemented, if not yet 
a citation of actual use.


Since the latter is expected to be rare, I personally would be 
comfortable with making a code point for it, so that fonts like this, 
which are actually used, can be mapped to Unicode w/o forcing people 
into weird fallbacks over a rare character.


A./


Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 10:48 AM, Stephan Stiller wrote:



the issue is a bit different, as not focused on one letter
While we're splitting hairs: Word- or larger-level all-caps /does/ 
normally make a one-letter difference. When we undo all-caps, one can 
/normally/ lowercase everything of the word except the first letter. 
The capitalization bit of that one letter is sometimes unclear.


And usually not totally sense-destroying to a human reader with context 
available. But these fallbacks allow clearly misspelled words to appear, 
not just miscapitalized ones. That's huge.


A./





Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 10:48 AM, Stephan Stiller wrote:



the issue is a bit different, as not focused on one letter
While we're splitting hairs: Word- or larger-level all-caps /does/ 
normally make a one-letter difference. When we undo all-caps, one can 
/normally/ lowercase everything of the word except the first letter. 
The capitalization bit of that one letter is sometimes unclear.


Sorry, not what I meant. It can hit any letter of the alphabet. The ß 
issue hits only one specific letter.


A./





Re: German »ß«

2013-02-16 Thread Asmus Freytag

On 2/16/2013 12:06 PM, Philippe Verdy wrote:

2013/2/16 Stephan Stiller stephan.stil...@gmail.com:

Of course in my worldview, all-caps writing is deprecated :-)

This is a presentation style which makes words more readable in some
conditions, notably on plates displayed on roads (cities are extremely
rarely written in lowercase, as this is more difficult to read from
far away when driving). Capitals anyway do not exclude preserving
distinctions (so there's a capital Ess-Tsett which preserves the
distinction with SS, and accents are still present, even if they are
difficult to distinguish from far away on roads)


This may be a French thing.

A./

For US, see discussion here: 
http://www.studio360.org/2011/jan/21/design-real-world/


For Germany, look at 
http://www.ace-online.de/fileadmin/user_uploads/Der_Club/Presse-Archiv/Bilder/Verkehr/Autobahn/Autobahn_01.jpg

or google Autobahnschilder  for more

PPS: Sweden has quite a bit of UPPERCASE, but seems to use mixed case 
for some purposes (such as legends on warning signs and minor 
destinations on road signs).


Deprecation only concerns long texts, presented in multiline
paragraphs, for which capitals make the text less easy to read.







Re: s-j combination in Unicode?

2013-02-16 Thread Asmus Freytag

On 2/16/2013 9:55 PM, Stephan Stiller wrote:

from earlier:

Otto Scholz

Oops, sorry. Otto Stolz.

And usually not totally sense-destroying to a human reader with 
context available. But these fallbacks allow clear misspelled words 
to appear, not just miscapitalized ones. That's huge.


I'm all for a capital version of ß and other such letters, but you may 
be talking in extremes too much.


Never!

;)

Actually, the question that started this particular discussion is most 
likely moot, given the fact that Andries has located, at the minimum, an 
existing font implementation of capital S+J. That seems to indicate 
that, again at the bare minimum, there are other people who think that 
SJ is not the way to render this.


A./




Re: s-j combination in Unicode?

2013-02-17 Thread Asmus Freytag

On 2/17/2013 12:30 AM, Stephan Stiller wrote:



But I have to ask one more thing:
Since the latter is expected to be rare, I personally would be 
comfortable with making a code point for it, so that fonts like this, 
which are actually used, can be mapped to Unicode w/o forcing people 
into weird fallbacks over a rare character.
Why would that be so? I thought your normal way of doing things is 
to require attestation of a particular usage. If a character is more 
frequent, it's more likely we're convinced of its being used in a 
particular way.


Law of diminishing returns.

I think it's a waste of everybody's time to even contemplate forcing 
fallback transformations (which are a pain to program) when a perfectly 
straightforward capital form can be deduced, and has been deduced (at 
least by font creators - we don't know what user requests they based 
their work on).


Casing irregularities are expensive compared to adding a code point for 
a rare character.


A./





Re: German »ß«

2013-02-17 Thread Asmus Freytag

On 2/16/2013 11:19 PM, Julian Bradfield wrote:

On 2013-02-17, Philippe Verdy verd...@wanadoo.fr wrote:

True lowercase letters are causing problems on road sign indicators on
roads with high speed: they are hard to read and if the driver has to
look at them for one more second, he does not look at the road.

AS I SAID, empirical evaluation by those who had good cause to care
about the issue indicates the opposite, that people take longer to
read all caps (as is also the case in normal text).
This evaluation was done specifically for high speed roads. It
included live testing on one motorway.

Would not be the first time that Mr. Verdy's statements are in an 
interesting relation to empirically determined results.


:)

A./


Re: New Canonical Decompositions to Non-Starters

2013-02-17 Thread Asmus Freytag

On 2/17/2013 8:20 AM, Richard Wordingham wrote:

Is there any guarantee that U+E4567 will not have a
canonical decomposition mapping to U+0F73 TIBETAN VOWEL SIGN II,
U+E4568? If so, where is it published?  I thought we had guarantees
that new canonical decompositions to non-starters would not be created
(to U+0F71, U+0F72, U+E4568 in this case), but I cannot find it.  This
conceivable decomposition mapping appears to wriggle through a
loophole because U+0F73 is a starter, i.e. has canonical combining
class 0.

Richard.



Let me see whether I follow that.

If you encode a new character, it can have a decomposition only if that 
decomposition also contains at least one new character. Otherwise, you 
might have existing data that contains that decomposition but wasn't 
previously normalizable with NFC (and now would be).


Now, does it make a difference whether that required new character in 
the decomposition is the first or the second? (Remember, all 
decompositions are defined to be pairs, except when they are singletons. 
If a one-to-many mapping is desired, enough intermediate, partially 
composed characters must exist to allow this longer mapping to be 
represented as a chain of simpler mappings.) And if it does, can one 
point to a stability guarantee where that is expressed?
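
(For concreteness, the U+0F73 data in question, as reported by Python's 
unicodedata:)

    import unicodedata
    # U+0F73 is a starter (combining class 0) whose canonical
    # decomposition consists of two non-starters.
    print(unicodedata.decomposition("\u0F73"))     # 0F71 0F72
    print(unicodedata.combining("\u0F73"))         # 0
    print(unicodedata.combining("\u0F71"),
          unicodedata.combining("\u0F72"))         # 129 130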


Is that what you are asking?

A./


Re: German »ß«

2013-02-18 Thread Asmus Freytag

Nice collection of links, here, Neil.

A./

On 2/17/2013 10:52 AM, Neil Harris wrote:

On 17/02/13 10:48, Philippe Verdy wrote:
I was not citing empirical results but things that are regulated by 
legislation.

And your existing empirical results are just informal tests ignoring
important parts of the population of drivers...


Here are some excellent articles about the evidence-based approach 
that led to the development of current road signage in the United States.


http://www.nytimes.com/2007/08/12/magazine/12fonts-t.html?pagewanted=1_r=0 



and this, on research on the legibility of mixed/lower case vs. ALL CAPS:

http://www.microsoft.com/typography/ctfonts/WordRecognition.aspx

Regarding Clearview and older drivers, this:

http://deldot.gov/information/pubs_forms/manuals/de_mutcd/pdf/20080731061923147.pdf 



is particularly interesting: the take-home quote is this:



The greatest improvement in legibility distance afforded by Clearview 
was realized by older drivers when viewed under headlamp illumination 
during nighttime conditions (an increase in legibility distance of 
between 6.0 percent and 6.8 percent)


-- Neil











Re: Private Use Area

2013-02-18 Thread Asmus Freytag

On 2/18/2013 5:43 AM, Erkki I Kolehmainen wrote:

This looks quite clear to me. If I create something and somebody else uses my 
creation in the intended context, he agrees to my definition. His agreement is 
private, outside the standard, since the same code points may represent a 
multitude of different meanings. It may also be the result of a negotiating 
process within a special purpose user group.


William,

when you write a standard, you can't avoid the use of technical terms. 
One of those is the meaning of private used here, as Erkki has so ably 
explained.


You had written:


... about ... private agreement. That is, I feel a somewhat unfortunate way 
of explaining the situation. You do not need the agreement of anybody to define your 
assignments in the Private Use Area. Certainly, if someone then wants to use the font and 
access an alternate glyph then he or she needs to go along with what you have assigned in 
order to use the font. To me, that sounds like following the documentation of the font 
rather than being an agreement.


Interpreting the characters (Unicode's term for any operation 
other than blind transactions, like copying a string) requires you to 
follow some definition of which code point goes with which encoded 
character.


You correctly note that a font, in a way, provides a private 
specification (private as seen from the point of view of the original 
standard, which remains ignorant of it).


No user can correctly use your font without that specification, 
whether you make it available as a document or whether the user reverse 
engineers it by looking at the font in an editor and recognizing the 
shapes.


By agreeing to follow your specification (and not someone else's) your 
user now has a private agreement with you. Simple as that.


Matters may seem more complex because most software supports, to a 
degree, a generic treatment of private use characters for the purpose 
of rendering only. That is, if the rendering requirements consist of 
left-to-right layout of boxes, with their width defined in the font, 
then any font using your private assignment of shapes can be rendered 
without the software needing to be modified.


That default treatment is, of course, very useful, but it is not 
required by the Unicode Standard. In fact, it's an (usually implicit) 
private specification by the maker of the software, and by designing a 
font that takes advantage of it, you are now in a private agreement 
with the software maker.


Note that the default treatment for sorting, capitalization, and a host 
of other functions is not going to work for you (or most users, for that 
matter), because, unlike the case of fonts, there's no widely supported 
data format for specifying how to interpret a character outside 
rendering.
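
(A sketch of how little comes for free, in Python:)

    import unicodedata
    # Private-use code points carry only the category "Co"; case mappings,
    # sorting weights and the like are left to the private agreement.
    print(unicodedata.category("\uE000"))    # Co
    print("\uE000".upper() == "\uE000")      # True: no case mapping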


A./


Re: German »ß«

2013-02-18 Thread Asmus Freytag

Great, a trilingual message! Who will keep it going?

A./

On 2/18/2013 10:25 PM, Charlie Ruland wrote:
Don't make fun of Monsieur Verdy: he is the last of the 
polymath Mohicans! ☺


Charlie


On Sunday, 17 February 2013, Asmus Freytag wrote:
Would not be the first time that Mr. Verdy's statements are in an 
interesting relation to empirically determined results.


:)

A./








Re: Capitalization in German

2013-02-20 Thread Asmus Freytag

On 2/19/2013 9:35 AM, Leif Halvard Silli wrote:

Werner LEMBERG, Tue, 19 Feb 2013 10:48:52 +0100 (CET):

Otto Stoltz wrote:

Here is a minimal pair to illustrate that point:
 Er hat in Moskau liebe Genossen.
 Er hat in Moskau Liebe genossen.
which translates to:
 At Moskow, he’s got dear comrades.
 At Moskow, he has enjoyed love.

A classical joke are those two newspaper header lines:

   Der Gefangene floh
   Der gefangene Floh

which translates to

   The Prisoner Escaped
   The Caught Flea

And in this case, the prosody in German is *exactly* the same.

So in this case, the imaginary newspapers made use of written forms
that they perhaps would not have used orally, if instead of newspapers
they had been Radio channels.

The general subject here is the fact that “outer“ things, such as the
(effect of the) “look“ of the language, affect the “inner“ things, 
namely how we use the language.
In the earlier posts on the readability of road signs there was a link 
to a paper that reported a research result that is interesting here.


People read more slowly when a written form has a non-standard 
pronunciation (even for well-known words) and faster, when it has 
standard pronunciation (even for unknown words).


The example given was hint/rint vs. pint.

Interesting that.

Also, ransom note capitalization is the hardest to read of all forms of 
capitalization. Take that, CamelCase :)


A./


Re: Private Use Area

2013-02-20 Thread Asmus Freytag

On 2/19/2013 2:26 PM, Andries Brouwer wrote:

On Tue, Feb 19, 2013 at 09:55:09AM +0100, Elbrecht wrote:


The academical TITUS project occupied U+E000 thru U+EFFF
of the Private Use Area ...

The primary Private Use Area is U+E000 .. U+F8FF today.
However, Unicode 1.0 defined a Private Use Area U+E800 .. U+FDFF.

The Linux keyboard driver uses the Private Use Area as it was
at the time of Unicode 1.0 for internal purposes, and assumes
that unicode characters have different values.
Since Unicode changed its mind, this is no longer true.



The very early Unicode versions aren't compatible with later versions 
(for more than one reason, but ultimately the cause was changes forced 
upon the design by various parties).


If something claims conformance to Unicode 1.0 at this stage, it should 
be investigated as to whether it isn't overdue for an update...


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

Richard,

the situation with the raised decimal point is a mess in Unicode.

I know that Mark thinks we have too many dots, but the reason this case 
is a mess is because the unification with U+002E is both non-workable in 
practice and runs counter to precedent.


The precedent in Unicode is to separately encode characters when they 
have different appearance, except if, fundamentally, it's the same 
character and the difference in appearance can be determined 
unambiguously by context.


There are two primary kinds of context that Unicode admits here. One is 
based on surrounding text (such as positional forms of Arabic letters). 
The other is overall stylistic context, such as a font choice (such as 
upright vs. slanted integral symbols).


When the appearance of a character is different based on the author's 
intent, and two (or more) different appearances can occur in the same 
document with different significance, then the usual response by Unicode 
has been to encode explicit characters. (The sets of phonetic characters 
are full of examples of this, like the lower case a without hook or the 
g with hook, both of which need to be distinguishable from other forms 
of these letters in phonetics).


So, if a British document can use both inline dots and raised dots, then 
you can't assign a single font to cover both. Well, the thought was, 
software might recognize the numeric context. However, as you've pointed 
out, section numbers are numeric and do not have the raised dot. In 
fact, as far as such documents are concerned, the raised dot itself can 
be used by the reader to distinguish decimal numbers from other use of 
numbers separated by dots (something not possible in other languages 
that lack this convention).


So, on the face of it, the choice to unify the raised decimal dot with 
002E violates the encoding model, by pushing semantic distinctions into 
some kind of markup. On top of that, it's not really practical to expect 
to have to either mark up all decimal numbers or all section numbers 
with separate styles or font bindings. That's something not required 
anywhere else.


So far, that's bad enough.

Next, you have the issue that Unicode refused (quite properly) to encode 
a generic decimal separator character, the appearance of which was 
supposed to vary on external context (like locale or a document global 
style). This suggestion had been intended to allow numerical expressions 
to be cut and pasted between documents in different languages with all 
numbers formatted correctly w/o further editing. That is, the same 
character would appear as either comma or period (or raised period) 
depending on context.


I wrote that I agreed with the choice to not code such special character 
for that purpose. However, by not encoding a character for the raised 
decimal point, Unicode did an about-face and made 002E a limited 
purpose version of a decimal separator. Suddenly, there is a 
character that is supposed to have different appearance based on context 
- on the line for US documents, off the line for British documents.


This directly violates the precedent established by the refusal to 
encode the generic decimal separator.


What can be done?

I believe the Unicode Standard should be fixed by explicitly removing 
all suggestions in the text that the raised decimal point is unified 
with 002E.


Second, the standard should be amended by identifying which character is 
to be used instead for this purpose.


It might be something like 00B7. In that case, 00B7 would have to have 
properties that effectively produce the correct result in numeric 
context, while leaving non-numeric context unchanged. I believe that is 
entirely possible, and non-disruptive, insofar as numeric use of 00B7 
does not exist for any purpose other than showing a raised decimal point 
(I suspect there are documents in the wild that already use this 
character for that purpose).


If that alternative is deemed not acceptable, the only remaining choice 
would be to add a new character. (I would recommend that only as the 
last resort).


A./




Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

On 3/9/2013 1:51 PM, Jukka K. Korpela wrote:

2013-03-09 21:30, Asmus Freytag wrote:


I believe the Unicode Standard should be fixed by explicitly removing
all suggestions in the text that the raised decimal point is unified
with 002E.


That would be a good move if agreement can be found on the recommended 
coding of the middle dot.



Second, the standard should be amended by identifying which character is
to be used instead for this purpose.

It might be something like 00B7.


There are several reasons why that would be a bad move. First, 00B7 is 
a seriously overloaded character already.

As is 002E. Overloading characters is not ipso facto a bad thing.

The standard precedent in Unicode recognizes the need to primarily 
support rendering differences that cannot be determined absent markup. 
Only in very limited situations are characters of identical rendering 
behavior encoded twice on the basis of properties alone. The most common 
case of this exception is the dual coding of non-breaking characters 
(space, dash, etc.). A special exception for bidi properties exists for 
Arabic digits.


However, many characters, like dashes and dots, have multiple uses to 
the human writer and reader, and despite some differences in processing 
(line breaking etc.) the general approach is to overload the character 
and let humans (and software) disambiguate it on context - which at 
least humans can do as long as it renders properly. (The latter is the 
reason, in my view, why Unicode tends to disunify primarily for rendering).

Second, it’s a middle dot, which may differ from a raised dot. 
Mixed-language documents may well contain both British number 
notations and occurrences of the middle dot in various contexts, and it 
should be possible to make them appear different.


I would agree with that concern if you could demonstrate, with the usual 
evidence, that there is a distinction. Note that 8859-1 contains 00B7 at 
B7, and this will have been used by anyone needing a raised dot and not 
having a font that magically supplies one based on context. (As James and 
Richard have pointed out, that kind of font technology does not exist, 
and there seems to be no interest from vendors in supplying it - hence 
underscoring the need for a different character.)


Due to another unfortunate unification (or semi-unification), 0387 
(Greek ano teleia) has been defined as canonically equivalent to 00B7, 
with the note “00B7 is the preferred character”. This means that glyph 
design for 00B7 needs to take this into account - a problem, since Greek 
ano teleia isn’t really a middle dot (rather, an upper dot, appearing 
roughly at the x-height of a font, rather than at half of x-height, 
which is the natural position for a middle dot).


This appears to be another possible mistake. However, the Greek script 
does provide a context which could be used to select the ano teleia 
appearance and properties (unless you tell me that the character appears 
in Greek surrounded by non-Greek alphabet characters).
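
That equivalence, by the way, is easy to demonstrate; a quick check in 
Python (using the standard unicodedata module):

    import unicodedata

    # U+0387 GREEK ANO TELEIA has a singleton canonical decomposition
    # to U+00B7 MIDDLE DOT, so any canonical normalization form erases
    # the code point distinction.
    assert unicodedata.normalize('NFD', '\u0387') == '\u00B7'
    assert unicodedata.normalize('NFC', '\u0387') == '\u00B7'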


The code chart comment on 002E (full stop) says: “may be rendered as a 
raised decimal point in old style numbers”. But checking a few fonts 
that use the OpenType feature for old style numbers (onum), I was 
unable to find any that has such a glyph selectable that way.


Yes, this comment makes no sense. It was a pious wish by the character 
encoders during the early days of Unicode. It's not been picked up by 
anyone in 20 years, so far as we know, which means it should be 
recognized for what it is: an evolutionary dead branch that needs to be 
trimmed.


I wonder what character and techniques British publishers use to 
produce notations with a raised dot. Is it 002E, with typographic 
tools used to raise it, or is it 00B7?


I agree, data would help settle this. Richard?



I believe that is
entirely possible, and non-disruptive, insofar as numeric use of 00B7
does not exist for any purpose other than showing a raised decimal point


I’m afraid there is mathematical use of 00B7. It is tempting to use it 
as a multiplication dot (as in 2 · 2, meaning the same as 2 × 2), 
especially if you are limited to using the ISO Latin 1 repertoire or you 
find 00B7 essentially simpler to type than 22C5 (dot operator). 
Standards have been vague or ignorant of the issue (ISO 80000-2 now 
explicitly defines the multiplication dot as 22C5, but I wonder how 
many people know about this).


For mathematical notation, the mathematical publishers are well 
organized and have agreements on how to handle issues like that (hence 
the ISO standard). The fact that some individual authors might have used 
00B7 as a fallback (or out of ignorance) is not really relevant here. 
For rendering it's not an issue, and for automatic parsing it's like any 
other typo.


Especially if the middle dot is used as multiplication symbol without 
spaces around it, confusion would be guaranteed.


Human readers don't read the code points.



If that alternative is deemed not acceptable, the only

Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

On 3/9/2013 3:41 PM, Philippe Verdy wrote:

2013/3/9 Asmus Freytag asm...@ix.netcom.com:

This appears to be another possible mistake. However, the Greek script does
provide a context which could be used to select the ano teleia appearance
and properties (unless you tell me that the character appears in Greek
surrounded by non-Greek alphabet characters).

And even this basic rule will be defeated in maths formulas where the MIDDLE 
DOT 00B7 has been used as a common multiplication operator, along with numbers 
and variables named after Greek letters. Of course Unicode now has distinctive 
symbols for maths, but that's another story.

Right, because 22C5 exists for that purpose.


There's no reliable way to contextually infer an ano teleia rendering
(at the middle of the x-height instead of the middle of the M-height,
the intended rendering for 00B7, which also works when the middle dot is
used as an appended diacritic after the letter L/l in Catalan) where
it would break the common appearance between digits with the intended
same meaning as a multiplication sign, for texts that are not encoded
using maths operators but legacy Greek letters and 00B7.






Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

On 3/9/2013 5:30 PM, Richard Wordingham wrote:

On Sat, 09 Mar 2013 14:41:11 -0800
Asmus Freytag asm...@ix.netcom.com wrote:


On 3/9/2013 1:51 PM, Jukka K. Korpela wrote:

2013-03-09 21:30, Asmus Freytag wrote:
I wonder what character and techniques British publishers use to
produce notations with a raised dot. Is it 002E, with typographic
tools used to raise it, or is it 00B7?

I agree, data would help settle this. Richard?

I'm not in the publishing business, but here's what I know.

The general feeling seems to be that computers don't do proper decimal points, 
and so the raised decimal point is dropping out of use.  In so far as character 
coding is involved, the raised decimal point seems to be produced using U+00B7, 
and I was taken aback by the statement that that was not the correct character.


This would not be the first instance of new writing/printing/processing 
technology feeding back onto how people write, lay out text, or even 
sort. Whenever new technology becomes pervasive but doesn't support 
certain features, it can create pressure to remove them.


'The Lancet' reportedly insists on the use of the raised decimal point 
(http://www.download.thelancet.com/flatcontentassets/authors/artwork-guidelines.pdf) 
and gives the instructions 'Type decimal points midline (ie, 23·4, not 23.4). To 
create a midline decimal on a PC: hold down ALT key and type 0183 on the number 
pad, or on a Mac: ALT shift 9.'  On Windows, that gives U+00B7 MIDDLE DOT.

That's sensible advice, in a way, because B7 is in 8859-1 and therefore 
supported in a huge variety of fonts; for practical purposes, the 
coverage among non-decorative text fonts is pretty near universal.
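
(The encoding side of that claim is trivial to sanity-check; a Python 
snippet, for what it's worth:

    # B7 is the Latin-1 byte for MIDDLE DOT; it round-trips losslessly.
    assert '\u00B7'.encode('iso-8859-1') == b'\xb7'
    assert b'\xb7'.decode('iso-8859-1') == '\u00B7'

The font-coverage side, of course, can only be checked font by font.)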


I've googled for advice on how to produce the raised decimal point.
Apart from suggestions to use a character picker (generally implying 
U+00B9),

recte: 00B7

  the only other method I've seen is a TeX package called
'decimal'.  It appears to render '.' as the (raised) decimal point and '\.' as 
the full stop.  That's the closest I've found to raising a full stop.


Well, in TeX, you can attach style or markup to any input character, 
and there's no explicit reference to any character encoding, because 
ultimately TeX output gets resolved to a combination of glyphs plus 
positions (that is, you can directly raise or lower any glyph using a 
TeX macro, without the need for font support).
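
For illustration only - a minimal LaTeX sketch of that kind of macro 
(my own toy, not the 'decimal' package itself):

    % \raisebox is a LaTeX kernel macro; no special font support needed.
    \newcommand{\raiseddot}{\raisebox{0.5ex}{.}}
    % usage: 23\raiseddot 4   prints roughly as 23·4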


Because of that, TeX fonts don't technically need separate glyphs for 
dots at different relative vertical position from the baseline.


Regular fonts might reuse the actual sequence of instructions for 
drawing the dot, but would still expose separate glyph records 
containing the different positions.


Back in May 1999, John Cowan said on this list 'That is the British
decimal-point convention. It can be represented in Unicode plain text with 
U+00B7 MIDDLE DOT', and no one contradicted him in the thread.


Looks like the community voted to not accept the Unicode recommendation 
for using formatting magic on 002E, so this reinforces the call to 
remove such recommendations as misleading and contrary to accepted practice.


A./









Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

Richard has given some cogent arguments below.

Another counter example is the use of : to form abbreviations in 
Swedish. (It's inserted in the word to replace the elided part). In that 
use, this punctuation character is suddenly part of a word.


To handle the full set of general cases, word recognition has to be 
plenty smart (and context- or environment-sensitive). The basic, 
untailored default word-breaking algorithm will only ever do the plain 
vanilla cases right.
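
A toy Python sketch of the difference (the tailoring here is invented 
for illustration and is nothing like a full UAX #29 implementation):

    import re

    # Untailored: words are maximal runs of letters only.
    naive = re.compile(r"[^\W\d_]+")

    # "Tailored": also allow word-internal hyphen, apostrophe and colon
    # (for Swedish abbreviations such as S:t), when letters follow.
    tailored = re.compile(r"[^\W\d_]+(?:[-':][^\W\d_]+)*")

    text = "co-operate near S:t Eriksgatan"
    print(naive.findall(text))     # ['co', 'operate', 'near', 'S', 't', 'Eriksgatan']
    print(tailored.findall(text))  # ['co-operate', 'near', 'S:t', 'Eriksgatan']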


Basing decisions about the encoding of characters on the failings of such 
simple-minded algorithms is really a non-starter. (The few existing 
exceptions just prove the rule.)


A./

On 3/9/2013 6:52 PM, Richard Wordingham wrote:

On Sat, 09 Mar 2013 16:21:17 -0700
Karl Williamson pub...@khwilliamson.com wrote:


Rendering is not the only consideration.  Processing textual content
for 0387 is broken because it is considered to be an ID_Continue
character, whereas its Greek usage is equivalent to the English
semicolon, something that would never occur in the middle of a word
nor an identifier.

ID_Continue is for processing things like variable names.  How does
allowing U+0387 in variable names cause problems in the processing of
text?

How would ID_Continue allow you to process English «foc’s’le» or
«co-operate»?  The default word boundary determination has been
tailored to give you the right results, and should work for Greek unless
you are working with scriptio continua, in which case you have massive
problems regardless.

Note also that word boundary determination is intended to be
tailorable, which would allow one to exclude U+00B7 and U+0387 from
words or deal with miscoded accents and breathings physically at the
start of a word beginning with a capitalised vowel. One should also be
able to tailor it to deal with word final apostrophes - though doing
that in the CLDR style could be computationally excessive if the text
may contain quoting apostrophes.  One might even tailor it to allow
Greek «ὅ,τι», depending on whether one wishes to count it as a word.

Richard.








Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

On 3/9/2013 6:01 PM, Stephan Stiller wrote:



'The Lancet' reportedly insists on the use of the raised decimal point
(http://www.download.thelancet.com/flatcontentassets/authors/artwork-guidelines.pdf) 


and gives the instructions 'Type decimal points midline (ie, 23·4, not
23.4). To create a midline decimal on a PC: hold down ALT key and type
0183 on the number pad, or on a Mac: ALT shift 9.'  On Windows, that
gives U+00B7 MIDDLE DOT.

And in this linked-to document it's raised to only what appears to be 
half the x-height; I'd raise a multiplicative dot to half the 
M-height. Philippe's post just now might relate to that in some way.


Math operators are usually aligned on the math center line, wherever 
that happens to be. However, for fully correct math layout, requiring 
math mode (i.e. global markup selecting math layout) is an appropriate 
restriction, and some minor infidelities in pure plain-text rendering 
of math are therefore tolerable.


Mathematical layout has all sorts of little idiosyncratic rules about 
spacing etc. that are subtly different from regular text, even though 
many characters can occur in both environments. That's why high-fidelity 
math layout needs to first identify those areas of a document where math 
layout rules apply. In TeX that's handled by using $ as a delimiter; in 
other environments other conventions (including out-of-band styling) are 
used.


A./




Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Asmus Freytag

On 3/9/2013 5:47 PM, Philippe Verdy wrote:

2013/3/10 Asmus Freytag asm...@ix.netcom.com:

On 3/9/2013 3:41 PM, Philippe Verdy wrote:

2013/3/9 Asmus Freytag asm...@ix.netcom.com:

This appears to be another possible mistake. However, the Greek script
does
provide a context which could be used to select the ano teleia
appearance
and properties (unless you tell me that the character appears in Greek
surrounded by non-Greek alphabet characters).

And even this basic rule will be defeated in maths formulas where the
MIDDLE DOT 00B7 has been used as a common multiplication operator, along
with numbers and variables named after Greek letter. Of course Unicode has
now distinctive symbols for maths, but that's another story.

Right, because 22C5 exists for that purpose.

But still, all the other related symbols are multipurpose and cannot
be fixed. They are still usable, including in maths contexts, even if
their rendering is not always adequate for maths (where they may
become confusable).

But these other characters should not need to take maths into
consideration, so the MIDDLE DOT 00B7 should still behave correctly in
Catalan as a diacritic and as a punctuation mark, and should remain:

- between the middle of the M-height and the middle of the x-height
(for correct display after l/L, or as a punctuation mark);

- but not on the middle of the math line like 22C5 (along with the
mathematical MINUS sign and the PLUS sign, the same center used as well
for the x-shaped multiplication sign, the division sign... all these
maths symbols having stricter presentation constraints).


Nothing prevents a mathematical layout program from fine tuning the 
display of 00B7 if used as a raised decimal point. (see other post).


A./


Re: Are there any pre-Unicode 5.2 applications still in existence?

2013-03-14 Thread Asmus Freytag

On 3/13/2013 10:25 PM, Peter Constable wrote:

I would  be inclined to assume that there are Unicode 1.1 apps loitering about.


What marks an implementation as version X.y?

If the implementation doesn't support any processing of characters for 
which there is a mandatory conformance requirement (such as 
normalization or bidi), then this is difficult indeed. Even then, 
implementations are free to handle only a partial repertoire and still 
claim conformance to a given version. (This subsetting may not be 
permitted for some required operations).


That said, there are some specific incompatibilities in character 
assignment for Unicode 1.1 and earlier, which would allow one to detect 
a Unicode 1.1 implementation (e.g. of Korean) if it indeed implemented 
the older character assignments for those cases.


A Unicode implementation that passively accepts a character stream and 
does nothing other than ring a bell upon accepting a U+0007 character 
would be trivially conformant to *any* version of the Unicode Standard. 
How would we assign this one a version number?


Is it Unicode 1.0? Or Unicode 6.3? Or some random version number 
corresponding to the latest version of the Unicode Standard that 
happened to be published at the time the application was designed? 
Compiled? Released?
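
About the best a caller can do is ask a library which UCD it ships, or 
probe the repertoire; a Python sketch (Lisu, used as the probe here, was 
added in Unicode 5.2 - any character from the target version would do):

    import unicodedata

    # The UCD version this runtime was built against.
    print(unicodedata.unidata_version)

    def is_assigned(cp):
        # Unassigned code points have no character name.
        try:
            unicodedata.name(chr(cp))
            return True
        except ValueError:
            return False

    # U+A4D0 LISU LETTER BA entered the standard in Unicode 5.2, so
    # this probe separates pre-5.2 from 5.2-or-later repertoires.
    print(is_assigned(0xA4D0))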


A./

Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Richard Wordingham
Sent: March 8, 2013 1:42 PM
To: unicode@unicode.org
Subject: Re: Are there any pre-Unicode 5.2 applications still in existence?

On Fri, 8 Mar 2013 15:54:57 +
Costello, Roger L. coste...@mitre.org wrote:


Are there any pre-Unicode 5.2 applications still in existence?

Strange question!  Unicode 5.2 was released in 2009.  Consequently, on the 
Ubuntu release I'm running, all characters new in Unicode 5.2 compare equal 
(and that nearly bit me - fortunately, the C locale was good enough for my 
purpose). The MS Office I have at home on my Windows 7 machine is Office XP 
(i.e. 2002), and at work we use MS Office 2007 on Windows XP. I suppose it's 
possible that these versions have been upgraded to a more recent version of 
Unicode, but I suspect it's unlikely.

Richard.












Re: Processing Digit Variants

2013-03-19 Thread Asmus Freytag
On the basis of security considerations, it might be necessary to not 
allow variation selectors to salt strings for parsing. If the string 
cannot be rejected, then the proper thing might be to parse it as if the 
variation selectors were not present (on the basis that they do not 
affect semantics - by design - setting aside Han for the moment, where 
that story isn't totally clear).


Similar considerations would apply to other invisible characters, like 
redundant directional marks, as well as joiners and non-joiners. Again, 
if their presence can't be used to reject a string, parsing needs to 
handle them properly, so that what the user sees is what actually gets 
parsed.
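
A minimal Python sketch of the parse-as-if-absent option (the set of 
ignorables here is illustrative only, not exhaustive):

    # Display-only characters that, by design, must not change the value:
    # the two text/emoji variation selectors plus the LRM/RLM bidi marks.
    IGNORABLE = {'\uFE0E', '\uFE0F', '\u200E', '\u200F'}

    def parse_decimal(text):
        # Strip ignorables so that what the user sees is what gets parsed.
        cleaned = ''.join(ch for ch in text if ch not in IGNORABLE)
        return int(cleaned)

    assert parse_decimal('1\uFE0E2\uFE0E') == 12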


A./


On 3/19/2013 1:45 PM, Richard Wordingham wrote:

On Mon, 18 Mar 2013 17:28:30 -0700
Steven R. Loomis s...@icu-project.org wrote:

On Monday, March 18, 2013, Richard Wordingham wrote:

The issue is rather with emphatically plain text U+0031, U+FE0E,
U+0032, U+FE0E.

It's the same situation as with something like an implementation of LDML
number parsing. U+FE0E is not part of a number.

I agree that the same arguments are applicable to both parsing and
collating, though not necessarily with equal force.

Formally, U+0031, U+FE0E, U+0032, U+FE0E seems to be just as much a
number as U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO,
which the current LDML semantics do treat on an even footing with
12.  If the emoji digits had been encoded as new characters, ICU
would support them without batting an eyelid.  Because the difference
does not merit full characterhood, they are encoded by a sequence
rather than a single character.  Remember, all that U+FE0E does is to
request a particular glyph.  In a sense, we have 20 new decimal digits,
U+0030, U+FE0E to U+0039, U+FE0E and U+0030, U+FE0F to U+0039,
U+FE0F.

So, why do you consider U+0031, U+FE0E, U+0032, U+FE0E not to be
a valid decimal number?


10ZWJ0ZWJ0 would be perfectly reasonable for text
likely to be rendered by a cursive Latin font.

Identifying such an edge case does not prove that numeric tailoring is
broken.

An 'edge case' is often just a case that shows that an algorithm that
often works has not been thought through thoroughly.  Now, as CLDR
seems to value speed above perfect correctness, perhaps handling
variation sequences will be rejected on that basis.  All I was trying
to find out on this list was whether U+0031, U+FE0E, U+0032, U+FE0E
should be regarded as a proper number.

Special characters intended for just one aspect of text processing
should not affect other aspects. Unfortunately, a parametric tailoring
to ignore irrelevant characters while complying with the UCA is not
quite as simple as just ignoring them.  The issues arise with the
blocking of discontiguous contractions and the possibility that, for
example, one might wish to collate character variants differently.  On
the other hand, ignoring variation selectors by default might be
excusable, for they should not occur where they might block canonical
reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).

Richard.







Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/21/2013 4:22 PM, Philippe Verdy wrote:

2013/3/21 Richard Wordingham richard.wording...@ntlworld.com:

Further, the code chart glyphs for the ANO TELEIA and the MIDDLE DOT
differ, see attachment.  If they are canonically equivalent, and one
is a mandatory decomposition of the other, why do they have differing
glyphs?

Because the codepoints are usually associated with different fonts?
For a more striking example, compare the code chart glyphs for U+2F831,
U+2F832 and U+2F833, which are all canonically equivalent to U+537F.

This is another good example where a semantic variation selector



Philippe, let's not go there.

Semantic selectors are pure pseudo-coding, because if the semantic 
differentiation is needed it is needed in plain text - and then it 
should be expressible in plain character codes.


If you need to annotate text with the results of semantic analysis as 
performed by a human reader, then you either need XML, or some other 
format that can express that particular intent.


Internal to your application you can design a lightweight markup format 
using noncharacters, if you wish, but for portability of this kind of 
information you would be best off going with something widely supported.


The number of conventions that can be applicable to certain punctuation 
characters is truly staggering, and it seems unlikely that Unicode is 
the right place to

a) discover all of them or
b) standardize an expression for them.

The problem is, even if you could encode some selectors for certain 
common cases, the scheme would not be extensible to capture other 
information that pre-processing (or user input) might have provided and 
which might be useful to carry around in certain implementations - I'm 
thinking here that the full spectrum of natural language analysis for 
word-types might be as interesting as certain individual characters.


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/22/2013 4:02 AM, Philippe Verdy wrote:

2013/3/22 Asmus Freytag asm...@ix.netcom.com:

Semantic selectors are pure pseudo-coding, because if the semantic
differentiation is needed it is needed in plain text - and then it should be
expressible in plain character codes.


We don't disagree; that's exactly what I meant here: plain character
codes for expressing the semantics, even if many renderers or
collators will treat them as ignorable.


No, we do disagree.


The Unicode model is to put all information about character identity 
(what you call semantics) into the character code, not into a sequence 
of a character plus ignorable attributes.


Separating the identity of the character into attributes would be 
something very novel and best not attempted.


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/22/2013 4:08 AM, Philippe Verdy wrote:

2013/3/22 Asmus Freytag asm...@ix.netcom.com:

If you need to annotate text with the results of semantic analysis as
performed by a human reader, then you either need XML, or some other format
that can express that particular intent.

Absolutely NO. If this encodes semantics, this is part of plain text,


I think we are on a different page here. In some ways the Unicode term 
semantics is very misleading in this context. What Unicode means by 
this fancy term is the character's identity - not its use.


If you use a colon to mark abbreviation (as in Swedish) you are using a 
colon - the use may be very different from how a colon is used 
elsewhere, but it does not create a new character.


Unicode does not encode the semantics of a sentence or word, but 
provides a string of characters of known identity that lets a human 
reader determine the semantics of that sentence or word as unambiguously 
as if that sentence had been reproduced by analog means - that's, in a 
nutshell, what Unicode attempts to do.



and not part of an upper-layer protocol. Notably these characters
should be used to alter the default (ambiguous) character properties of
the characters they modify, and notably to give them the semantics
needed for existing Unicode algorithms (general categories:
punctuation, diacritic; word-breaking properties...)


Character properties define the *default* behavior of  a given 
character. There are many examples, especially in the context of 
punctuation where a character can have different uses. Each use may need 
a different treatment by readers (or algorithms).


To handle some behaviors, you may need complex processing (natural 
language processing) that mimics what a human reader can do.


There are a few exceptions where characters are disunified based on 
properties - the most principled of these involve properties that can't 
be modified, such as the bidi property. There are about a dozen 
characters that look entirely alike (by design and derivation) yet have 
been disunified based on bidi properties - because bidi properties 
cannot be overridden.


There are a few other cases, usually where a character can be both 
letter and punctuation, where such disunifications were made based on 
overridable properties. Here the reason was that this distinction has 
such a wide reach (and had to be applied by many basic algorithms) that 
breaking the principle of single character identity could be justified.


If a problem is sufficiently severe, then you'd possibly have 
justification to disunify. If not, then the answer would be outside the 
scope of character encoding.




adding new variants of existing characters like what was done
specifically for maths is not a stable long-term solution; solutions
similar to variation selectors however are much more meaningful, and
will allow for example making the distinction between a MIDDLE DOT
punctuation and an ANO TELEIA, and will also allow them to be rendered
differently (even if there's no requirement to do so).

This is absolutely not pseudo-coding.

Pseudo-coding refers to making distinctions between characters not in 
their basic encoding, but by means of attributes such as the selectors 
you are suggesting.




Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/22/2013 4:16 AM, Philippe Verdy wrote:

2013/3/22 Asmus Freytag asm...@ix.netcom.com:

The number of conventions that can be applicable to certain punctuation
characters is truly staggering, and it seems unlikely that Unicode is the
right place to
a) discover all of them or
b) standardize an expression for them.

My intent is certainly not to discover and encode all of them. But
existing characters are well known for having very common distinct
semantics which merit separate encodings.


This claim would have to be scrutinized, and, to be accepted, would 
require very detailed evidence. Also, on what principles would you base 
the requirement to make a distinction in encoding?



And this includes notably their use as numeric grouping separators or decimal 
separators.


Well, the standard currently rules that such use does not warrant 
separate encoding - and the standard has been consistent about that for 
the entire 20+ years of its existence.


Further, all other character encoding standards have encoded these 
characters as unified with ordinary punctuation. This is very different 
from the ANO TELEIA discussion, where an argument could be made that 
*before* Unicode, the character occurred only in *specific* character 
sets - and that was a distinction that was lost when these character 
sets were mapped to Unicode.


No such argument exists for either middle dot or raised decimal point 
(except insofar as you could possibly claim that raised decimal point 
had never been encoded properly before, but then you'd have to show some 
evidence for that position).


Such common semantic modifiers would be easier to support than
encoding many new special variants of characters (which won't even be
rendered by most applications, and thus won't be used).


That might be the case - except that they would introduce a number of 
problems. Any modifier that has no appearance of its own can get 
separated from the base character during editing.


The huge base of installed software is not prepared to handle an 
entirely different *kind* of character code, whereas support for simple 
character additions is something that will eventually percolate through 
most systems - that fact makes disunifications a much more 
straightforward process.


Some examples : the invisible multiplication sign, the invisible
function sign,


Nah, these are not modifiers. They stand on their own. Their 
invisibility is not ideal, but not any worse than word joiner or 
zwsp. All of these characters are separators - with the difference 
that the nature of the separator was determined to be crucial enough to 
encode explicitly. (And of course, reasonable people can disagree on 
each case).


Note that Unicode cloned several characters based on their word-break 
(or non-break) behavior, which is not a novel idea (earlier character 
encodings did the same with no-break space). Already at that stage the 
train of having a word-break attribute character (what you call a 
modifier) had left the station.


The only way to handle these issues, for better or for worse, is by 
disunification (where that can be justified in exceptional circumstances).



and even the Latin/Greek mathematical letter-symbols,
which were only encoded to capture style differences that have
occasional but rare semantic significance. For me, adding those
variants was really pseudo-coding, breaking the fundamental encoding
model, complicating the task for font creators and renderer designers,
and increasing a lot the size and complexity of collation tables.

Many of these character variants could have been expressed as a base
character and some modifier (whose distinct rendering was only
optional), allowing a much easier integration and better use. Because
of that the UCD is full of many added variants that are almost never
used, and we have to live with encoded texts that persist in using
ambiguous characters for the most common possible distinctions.

No, for the math alphabetics you would have had to have a modifier that 
was *not* optional, breaking the variation selector model.


There was certainly discussion of a combining bold or combining 
italic at the time.


One of the major reasons this was rejected was the desire to prevent 
the creation of such operators that could be applied to *every* 
character in the standard.


And, of course, the desire to allow ordinary software to do the right 
thing in displaying these - the whole infrastructure to handle such 
modifiers would have been lacking.


Further, when you use an italic a in math, you do not need most (or 
all) software to be aware that this relates to an ordinary a in any 
way. It doesn't, really, except in text-to-speech conversion or similar, 
highly specialized tasks. So, unlike variation selectors, there would 
have been no benefit in using a modifier.
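
For those specialized tasks, the compatibility decomposition already 
records the relationship; a quick Python check:

    import unicodedata

    # U+1D44E MATHEMATICAL ITALIC SMALL A carries a <font> compatibility
    # decomposition, so NFKC recovers the ordinary letter on demand.
    assert unicodedata.normalize('NFKC', '\U0001D44E') == 'a'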


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/22/2013 12:08 PM, Karl Williamson wrote:

On 03/21/2013 04:48 PM, Richard Wordingham wrote:

For linguistic analysis, you need the normalisation appropriate to the
task.


Linguistic analysis (in general) being a hugely complex undertaking, 
mere normalization pales in comparison, so wrapping normalization into 
the processing isn't going to make it that much more complicated.



This is a case where Unicode normalisation generally throws away
information (namely, how the author views the characters),


Canonical normalization is supposed to take care of distinctions that 
fit within the same view of the character by the author and concern 
principally distinctions that could be said to be artifacts of the 
encoding.


The same is emphatically NOT true for COMPATIBILITY normalization.
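
The contrast is easy to demonstrate in Python:

    import unicodedata

    s = '2\u00B2'  # "2²": the superscript is part of what the author wrote
    # Canonical normalization preserves the author's distinction...
    assert unicodedata.normalize('NFC', s) == s
    # ...compatibility normalization throws it away.
    assert unicodedata.normalize('NFKC', s) == '22'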


whereas in
analysing Burmese you may want to ignore the order of non-interacting
medial signs even though they have canonical combining class 0. I have
found it useful to use a fake UnicodeData.txt to perform a non-Unicode
normalisation using what were intended to be routines for performing
Unicode normalisation.  Fake decompositions are routinely added to the
standard ones when generating the default collation weights for the
Unicode Collation Algorithm - but there the results still comply with
the principle of canonical equivalence.


This description seems to capture an implementation technique that 
could be a shortcut - assuming that normalization wasn't a separate, 
up-front pass. Some algorithms may have needs to normalize data in ways 
that might make adding the standard Unicode Normalization aspects into 
them attractive from a performance point of view (even if not from a 
maintenance point of view).




However, distinguishing U+00B7 and U+0387 would fail spectacularly
if the text had been converted to form NFC before you received it.


That's a claim for which the evidence isn't yet solid, and which, if it 
could be made solid, would be very interesting.


This is the first time I've heard someone suggest that one can 
tailor normalizations.  Handling Greek shouldn't require having to 
fake UnicodeData.txt.  And writing normalization code is complex and 
tricky, so people use pre-written code libraries to do this.  What 
you're suggesting says that one can't use such a library as-is, but 
you would have to write your own.  I suppose another option is to 
translate all the characters you care about into non-characters before 
calling the normalization library, and then translate back afterwards, 
and hope that the library doesn't use the same non-character(s) 
internally.


Handling Greek in the context of run-of-the-mill algorithms should 
probably not be done by folding Normalization into them (for the 
excellent reasons given). But for some performance sensitive and rather 
complex types of detailed linguistic analysis I might accept the 
suggestion as a possible shortcut (over a two-pass process). Given the 
existence of such a shortcut, modifying the normalization part of the 
combined algorithm is an interesting suggestion as an implementation 
technique.


Tunneling through an existing normalization library would be a hack, 
which should never be necessary except where normalization is broken 
(see compatibility Han characters).


However, even if standard canonical decompositions can be mistaken, 
tunneling isn't really a fool-proof answer, because it assumes that data 
didn't get normalized en route. There's nothing that reliably prevents 
that from happening in a distributed system (unless all the parts are 
under your tight control, which would seem to make it a distributed 
system in name only).


And the question I have is under what circumstances would better 
results be obtained by doing this normalization?  I suspect that the 
answer is only for backward compatibility with code written before 
Unicode came into existence.  If I'm right, then it would be better 
for most normalization routines to ignore/violate the Standard, and 
not do this normalization.




Let's get back to the interesting question:

Is it possible to correctly process text that uses 00B7 for ANO TELEIA, 
or is this fundamentally impossible? If so, under what scenario?


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Asmus Freytag

On 3/22/2013 6:17 PM, Richard Wordingham wrote:

On Fri, 22 Mar 2013 18:01:14 -0700
Asmus Freytag asm...@ix.netcom.com wrote:


On 03/21/2013 04:48 PM, Richard Wordingham wrote:

However, distinguishing U+00B7 and U+0387 would fail spectacularly
if the text had been converted to form NFC before you received it.

That's a claim for which the evidence isn't yet solid and if it could
be made solid would make that claim very interesting.


Distinguishing the character codes will fail trivially. The question is 
whether analysis or processing of the text will fail spectacularly. 
The latter is the true test of whether the unification is broken.


I, like many others on the various lists, would like to see a concisely 
argued and well-documented case made for or against this, using 
real-world examples.


A./

PS: If you quote selectively, my text doesn't make any sense. At this 
point in the message you dropped Karl's reply, which is what I am 
referring to next:

However, even if standard canonical decompositions can be mistaken,
tunneling isn't really a fool-proof answer, because it assumes that
data didn't get normalized en route.

Isn't that the key part of what I said above?


No it isn't, and even if it was, I was not replying to your words here.

A./


Richard.







Re: Rendering Raised FULL STOP between Digits

2013-03-23 Thread Asmus Freytag

On 3/23/2013 4:55 AM, Michael Everson wrote:

On 23 Mar 2013, at 01:01, Asmus Freytag asm...@ix.netcom.com wrote:


Let's get back to the interesting question:

Is it possible to correctly process text that uses 00B7 for ANO TELEIA, or is 
this fundamentally impossible? If so, under what scenario?

It is possible to process text without Unicode at all, using sets and sets of 
8-bit font-hack fonts. We all did it for years.




A bit of a non-sequitur in that whatever may have been done with 8-bit 
standards doesn't necessarily advance the discussion about how to do 
things in Unicode. Also, arguably not fully applicable, because the 
types of processing that could be done with those legacy sets exclude 
some important real-world scenarios that only Unicode enables...


In Unicode, 00B7 and 0387 are canonically mapped, so making distinctions 
based on code point is not guaranteed to be portable. That's why I 
singled out 00B7 (not 0xB7, but U+00B7).


The question was, given that: is it possible to correctly process text 
that uses one and the same character code for ano teleia, middle dot, 
raised decimal point (and fourteen other uses), or is this fundamentally 
impossible? If so, under what scenario?


I think handling raised decimal dot is not any more difficult than 
recognizing when period is a decimal point (there are some edge cases 
there that are challenging, but implementations have settled on using 
period, so that's a done deal).


I don't know about the fourteen other uses, but there's been a lot of 
griping about ano teleia. (That's why I singled out that one, even 
though I know most of the griping took place in a parallel discussion 
on another list.)


I think it would be useful to actually write down an overview of the 
recommended implementation approach for handling ALL the different uses 
for middle dot and to make sure that what is recommended is not only 
theoretically possible, but acceptable and accepted(!) as best practice 
by implementers, users, and font designers alike.


If such a document were to successfully cover all (widely-)known cases, 
it would make fine material for adding to the character description. If 
there are holes (things that can't be done - see the question) then it 
would form a basis on which UTC could make some decisions on how to 
improve the standard.


A./



Re: Rendering Raised FULL STOP between Digits

2013-03-27 Thread Asmus Freytag
The question is who would be able to take on the drafting of a document 
that explains the recommended usage of 00B7 for the various purposes 
(including recommended ways of getting the correct rendering and 
processing).


ONLY by having such a document is it possible to be certain that the 
encoding (now or in future) will not become an obstacle to any of the 
several usage scenarios.


At the moment, the statement that the existing encoding is actually 
implementable is something that must be considered unproven (enough 
issues have been pointed out for various elements of the unification 
already to allow such a conclusion).


What we are not getting closer to is a rational understanding of how to 
improve this situation. Random addition of middle dot characters for 
some purpose is just as bad as pretending everything is fine with the 
status quo.


I applaud any effort you can make to hold off such additions, but 
without addressing the larger question we are not getting to a place 
where we can be confident of what we have (or need).


A./




On 3/27/2013 10:56 AM, Michel Suignard wrote:

I think it would be useful to actually write down an overview of the 
recommended implementation approach for handling ALL the different uses for 
middle dot and to make sure that what is recommended is not only theoretically 
possible, but acceptable and accepted(!) as best practice by implementers, 
users, and font designers alike.

If such a document were to successfully cover all (widely-)known cases, it 
would make fine material for adding to the character description. If there 
are holes (things that can't be done - see the question) then it would form 
a basis on which UTC could make some decisions on how to improve the standard.

Needless to say, the project editor for 10646, who has been pushing the can 
down the road for EVER concerning the proposed 'A78F LATIN LETTER MIDDLE DOT', 
would appreciate such production. As RichardW has shown before, it is just an 
additional use case among many other middle-dot scenarios. It does not seem 
wise to finalize a decision concerning the encoding of another middle dot 
unless clarification is brought concerning the de facto unification of middle 
dot, ano teleia, and the British decimal point. If some dis-unification is 
considered, all these aspects should be taken into account.

Michel






Re: Rendering Raised FULL STOP between Digits

2013-03-27 Thread Asmus Freytag

On 3/27/2013 12:07 PM, Philippe Verdy wrote:

2013/3/27 Asmus Freytag asm...@ix.netcom.com:

At the moment, the statement that the existing encoding is actually
implementable is something that must be considered unproven (enough issues
have been pointed out for various elements of the unification already to
allow such a conclusion).

What we are not getting closer to is a rational understanding of how to
improve this situation. Random addition of middle dot characters for some
purpose is just as bad as pretending everything is fine with the status quo.

We are in fact not discussing random additions but want to handle
correctly use cases that are in fact very frequently needed.


Ah, what additions are you discussing?


For example, the Catalan syllable breaker is not a random case; it is
in fact highly used and needed as part of the standard orthography
(and Catalan is not a minor language, we cannot just ignore it).


Are you suggesting the addition of a character for it?


There are very frequent uses of the dots and hyphens, which are too
overloaded in their original ASCII-only encoding; same thing about
apostrophes/quotes. This causes enough nightmares when trying to
parse text, and it's unbelievable that there's no solution to augment
the text with either distinct characters, or some variant selectors, or
some other format controls to disambiguate these uses, which really
bear on essential character properties (properties that have been in
Unicode for a long time, like the general category).


That's restating well-known issues. Thanks for agreeing. However, let's 
limit the discussion to dots, otherwise we'll never get any conclusion.


For the dashes there are many explicit characters that were encoded 
already; same for the quotes. In those cases, there is often a more 
readily discernible difference in appearance that made the decision to 
disunify somewhat easier. The situation for the middle dot is both less 
well understood and less well addressed.


The solution based on an upper-layer protocol will not work (for
example in filenames, in databases of toponyms, or in archived legal
documents whose interpretation should not cause any trouble, including
when these documents are converted or exported to many other formats).
We are here exactly within the definition of linguistic rules for each
language, some of them being highly standardized, and which would
require a stricter, less ambiguous encoding. The time of ASCII-only is
over. The UCS offers many new unused possibilities, as well as many
existing technical solutions, which should not be based just on a
heuristic (which will always break in many cases). Users want to be
sure that their text will not be misinterpreted, or rendered in an
ambiguous or wrong way.


Again, a nice general statement. However, it lacks the kind of detail 
and documented evidence of particular usage that would bring us further 
at this point.


Even if the solutions proposed seem novel, this should not block us.
And even a novel solution can work in compatibility with the huge
existing corpus of texts, which will remain ambiguous as they are. The
novel encoding solution can perfectly well provide a fallback mechanism
where it adopts the old compatibility scheme (similar to ASCII).

Of course, nothing will prevent anyone from using characters as they
want in random cases, even if this breaks all commonly admitted
properties and behaviors.

My use of the word random was directed at piecemeal addition of 
characters. You are using it in a different sense.

But this should be distinguished from
frequently used cases which have rules formulated long ago in
well-known languages (except that now the texts have to live in an
environment which is more and more multilingual, for which it's not
possible to just infer which language to select to apply its
well-known rules). We have no other solution than providing explicit
hints in the encoded texts (and forgetting the time of ASCII-only,
except in some technical domains like programming languages and
transport/storage protocols, which have their own internal syntaxes and
which do not really qualify as plain text).


You've advocated hints or semantic selectors. While that is a feasible 
model in principle, the main issue I see is that it would create yet 
another type of encoding; this is especially troublesome in light of the 
precedent for quotes and dashes, where there was a careful addition of 
single-purpose (not overloaded) characters.


Unless you can present a detailed analysis of the requirements which 
could be used to prove that ONLY such a novel coding construct can 
handle the needed rendering and processing tasks, I fear it would be 
difficult to get traction for such a proposal.


But that brings me back to my original issue: nobody has done the 
necessary analysis of the requirements for all (or at least the major) 
use cases for a mid-level to raised-level dot and pinned down what is or 
isn't possible in software support (rendering

Re: If Unicode wants to show the Red Card to someone ...

2013-04-01 Thread Asmus Freytag

On 4/1/2013 12:19 PM, Buck Golemon wrote:
The only remaining question is whether the colors should be 
represented in the HSL or HSV color space.



Go HSV http://www.hsv.de/news/!




Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-22 Thread Asmus Freytag

On 4/22/2013 4:27 AM, Charlie Ruland ☘ wrote:

* William_J_G Overington [2013/4/22]:

[...]

If the scope of Unicode becomes widened in this way, this will provide a basis 
upon which those people who so choose may research and develop localizable 
sentence technology with the knowledge that such research and development 
could, if successful, lead to encoding in plane 13 of the Unicode system.
I don’t think your problem is “the scope of Unicode” but the size of 
the community that uses “localizable sentences.” The Unicode 
Consortium is prepared to encode all characters that can be shown to 
be in actual use.


Please submit a formal proposal that can serve as a basis for further 
discussion of the topic.


I'm afraid that any proposal submitted this way would just become the 
basis for a rejection with prejudice. Independent of the lack of 
technical merit of the proposal, the utter lack of support (or use) by 
any established community would make such a proposal a non-starter.


In other words, “can be shown to be in actual use” is an important hurdle 
that this scheme, however dear to its inventor, cannot seem to pass.


The bar would actually be a bit higher than you state it. The use 
has to be of a kind that benefits from standardization. Usually, that 
means that the use is wide-spread, or failing that, that the 
character(s) in question are essential elements of a script or notation 
that, while themselves perhaps rare, complete a repertoire that has 
sufficient established use.


Characters invented for possible use (as in could become successful) 
simply don't pass that hurdle, even if for example, the inventor were to 
publish documents using these characters. There are honest attempts, for 
example, to add new symbols to mathematical notation, which have to wait 
until there's evidence that they have become accepted by the community 
before they can be considered for encoding.


Mr. Overington is quite aware of what would be the inevitable outcome of 
submitting an actual proposal; that's why he keeps raising this issue 
with some regularity here on the open list.


A./



Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-22 Thread Asmus Freytag

On 4/22/2013 12:35 PM, Stephan Stiller wrote:

[Charlie Ruland:]
The Unicode Consortium is prepared to encode all characters that can 
be shown to be in actual use.
Are you sure there is a precedent for what is essentially markup for a 
system of (alpha)numerical IDs?



You don't even have to look that far. These inventions utterly fail the 
actual use test, in the sense that I explained in my other message.


I'm always suspicious if someone wants to discuss scope of the standard 
before demonstrating a compelling case on the merits of wide-spread 
actual use.


A./




Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-23 Thread Asmus Freytag

On 4/23/2013 3:00 AM, Philippe Verdy wrote:
Do you realize the operating cost of any international standards 
committee, or of the maintenance and security of an international 
registry? Who will pay?


Currently we all are paying by having interminable discussions of 
half-baked ideas foisted onto us. There's a word for this.


Time for this discussion to be dropped.

A./





Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-23 Thread Asmus Freytag

On 4/23/2013 2:01 AM, William_J_G Overington wrote:

On Monday 22 April 2013, Asmus Freytag asm...@ix.netcom.com wrote:
  

I'm always suspicious if someone wants to discuss scope of the standard before 
demonstrating a compelling case on the merits of wide-spread actual use.
  
The reason that I want to discuss the scope is that there is uncertainty.


I'm not going to engage on a scope discussion with you, even on this 
lovely list, without some shred of evidence that there is compelling need.


Cheers,

A./




Re: Suggestion for new dingbats/symbols

2013-05-28 Thread Asmus Freytag

On 5/26/2013 3:15 PM, David Starner wrote:

On Sun, May 26, 2013 at 12:40 PM, Andreas Stötzner a...@signographie.de wrote:

One of the bodies in the world still ignorant of this fact to this very day
is Unicode. Which I feel is a mess.

Problems from Unicode generally come from one of two places: compatibility
with non-Unicode data sets, and people with different goals working on
it.


Excellent insight.

However, both come with the territory of designing a universal 
character encoding.


With a mandate like that, it's difficult to leave any significant user 
population behind, which forces you both to include the superset of what 
went before and to encompass people with overlapping, but partially 
divergent, goals.


Unicode has some characteristics that emerged and took on added 
importance over time. These include a desire for longevity and 
stability, which, among other things, require that characters, once 
admitted, must be carried along forever - and that implies that one must 
be leery of anything that hasn't stood the test of time.


Characters fall out of use in the real world all the time, but the ideal 
for Unicode is to include primarily those that have an ongoing use in 
archiving and historical study, which in the digital universe might 
include anything used on a wide enough scale.


I sympathize with Andreas' take that the nature and development of 
modern pictographic writing are rather less well understood than they 
deserve to be, and that decisions about encoding are therefore made in 
partial ignorance of all the facts.


Solid scholarly study of the use of signs, symbols and pictographs might 
help - except that there seem to be no scholars that tackle these from 
an angle that would ultimately be useful for encoding. I don't believe 
that is merely a funding problem, but something more fundamental.


A./

PS: German uses the same term wissenschaftlich for both scientific and 
scholarly approaches to knowledge. There are prefixes you can use to 
narrow things down, but in context, they are often dropped. This, in 
turn, can lead to confusion because the wrong choice can be made in 
translation. I don't think there's a natural science of character 
encoding, and I don't believe that Andreas was really claiming that. 
Still, there are ways of rigorously studying the phenomenon, an activity 
that would be considered scholarship.





Re: Suggestion for new dingbats/symbols

2013-05-29 Thread Asmus Freytag

On 5/29/2013 1:39 AM, Andreas Stötzner wrote:


On 29 May 2013, at 01:06, David Starner wrote:


And what you'll run into is the fact that people don't agree that that
belongs in Unicode.



What Andreas was suggesting is rigorous study. I think that is a 
commendable suggestion.


The more interesting question is what aspects such a study should 
encompass, what its starting points are to be, and what kinds of 
conclusions should be possible after it is completed.


With better facts in hand it will be much easier to double-check whether 
currently-held assumptions about their relevance for encoding hold up or 
need revisiting. Without facts, this kind of discussion just deals in 
pre-conceived notions, and therefore adds little value.


A./


Re: Preconditions for changing a representative glyph?

2013-05-29 Thread Asmus Freytag

On 5/29/2013 8:39 AM, Leo Broukhis wrote:


I'd like to ask: what is supposed to be the trigger condition for the 
UTC to consider changing the representative glyph of <your favorite 
symbol here> to a novel design?



The answer: the purpose of the representative glyph is not to track 
fashions in representation but to give an easily recognized orthodox 
shape.


In the case of symbols, shape matters differently than for letters 
(where you have a word context that allows even decorative font shapes 
to be readable).


For symbols, once you leave the canonical shape behind, there's always 
the argument that what you have is in fact a new symbol.


There are some exceptions to this, where the notational aspect of symbol 
use is so strong that variations really function identically and can be 
unified without issues. This might be the case in your example. However, 
in general, I would dispute that this is true for non-notational symbols.


In the case you give, the new design is clearly not the canonical 
shape, because it deliberately innovates. If it ever replaces the other 
sign in a majority of uses (not just in NYC) then perhaps updating the 
glyph might be appropriate.


At this time, we are far from that point.

A./



Re: Preconditions for changing a representative glyph?

2013-05-29 Thread Asmus Freytag

On 5/29/2013 9:38 AM, Leo Broukhis wrote:
On Wed, May 29, 2013 at 9:35 AM, Asmus Freytag asm...@ix.netcom.com 
wrote:




In the case you give, the new design is clearly not the
canonical shape, because it deliberately innovates. If it ever
replaces the other sign in a majority of uses (not just in NYC)
then perhaps updating the glyph might be appropriate.

At this time, we are far from that point.

That we are far from that point is clear to me; I was asking if there 
is a (semi-)formal definition of that point. What is a majority of uses?



I think Michael's answer covered that.

A./



Re: Preconditions for changing a representative glyph?

2013-05-29 Thread Asmus Freytag

On 5/29/2013 9:53 AM, Manuel Strehl wrote:
Out of curiosity, has it happened before, that a glyph was updated 
(i.e., substantially changed) in the standard?





Yes, Philippe gives some examples of typical situations.

Representative glyphs are not immutable - what is immutable is the 
identity of the character that is encoded. A change in representative 
glyph that affects the perception of that identity in an adverse way 
must be avoided; conversely, a glyph that leads to misidentification of 
a character can, and in typical situations should, be corrected.


For symbols, the identity of the character does not necessarily exist 
independently of its shape. Two similar shapes may exist where each is 
used only in some context, or where the usage contexts only partially 
overlap. If that is the case, it should be questioned whether this is 
really a matter of two representations of the same character, or a case 
of two characters that happen to be related.


For letters, you have the word context that allows you to resolve the 
identity question. For symbols, there is no such single, overriding context.


A./



Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-19 Thread Asmus Freytag

On 6/19/2013 6:36 AM, Michael Everson wrote:

Only in text which has been decomposed. Not all text gets decomposed.

All text may get decomposed without warning.

As data is shipped around and processed in various parts of a 
distributed system, nobody can make any safe assumptions about the 
normalization state of their data. Texts may get composed, decomposed, 
or they may miraculously remain in whatever mixed normalization state 
they were created in.


The point is, any technical argument or design decision that implies 
that one has control of the normalization state is ipso facto suspect.
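
To make the point concrete, here is a minimal sketch in Python (using 
only the standard library's unicodedata module; the variable names are 
mine):

    import unicodedata

    # U+0123 LATIN SMALL LETTER G WITH CEDILLA, as used in Latvian, can
    # travel either precomposed or decomposed into g + U+0327 COMBINING CEDILLA.
    precomposed = "\u0123"   # single code point
    decomposed = "g\u0327"   # base letter plus combining mark

    # The two forms are canonically equivalent, yet a naive code-point
    # comparison treats them as different strings.
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

Any intermediary is free to apply either normalization form, so code 
that compares or hashes strings without normalizing first will behave 
differently depending on which form happens to arrive.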


A./


Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-19 Thread Asmus Freytag

On 6/19/2013 6:36 AM, Michael Everson wrote:

The issue of the cedilla can easily be solved at a higher level: font technologies 
like OpenType can easily display one set of glyphs for Latvian or Livonian and 
different glyphs for Marshallese.

Only in environments which permit language tagging. I'd like Marshallese 
children to be able to write their language in filenames.


Language tagging doesn't seem to be reliable enough to require its use 
in anything other than high-end typography.


A./


Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-03 Thread Asmus Freytag

On 7/3/2013 2:04 AM, Michael Everson wrote:

On 3 Jul 2013, at 09:52, Martin J. Dürst due...@it.aoyama.ac.jp wrote:


Quite a few people might expect their Japanese filenames to appear with a 
Japanese font/with Japanese glyph variants, and their Chinese filenames to 
appear with a Chinese font/Chinese glyph variants. But that's never how this 
was planned, and that's not how it works today.

Yeah, but CJK is a world of difference away from alphabets of 30-40 characters.


That sounds dangerously close to special pleading.



And it's a pretty easy guess that there are quite a few more users with 
Japanese and Chinese filenames in the same file system than users with Latvian 
and Marshallese filenames in the same file system, both because both Chinese 
and Japanese are used by many more people than Latvian or Marshallese and 
because China and Japan are much closer than Latvia and the Marshall Islands.

I oppose language-tagging as a mechanism to fix the cock-up of slavishly 
following 8859 decomposition for cedilla and comma-below. Character encoding is 
the better way to deal with this.


That's the more fundamental point. If comma below and cedilla are really 
fundamentally different marks, then treating them as such is a 
principled solution.


However, the compromise sounds dangerously like it introduces another 
one of those irregularities that people will trip over in the future.
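
To see the irregularity in the data itself, a few lines of Python 
(standard library only; this merely reports what the UCD currently says):

    import unicodedata

    # The Latvian letter ģ carries "CEDILLA" in its name and decomposes
    # to U+0327 COMBINING CEDILLA, even though Latvian typography renders
    # the mark as a comma below.
    print(unicodedata.name("\u0123"))           # LATIN SMALL LETTER G WITH CEDILLA
    print(unicodedata.decomposition("\u0123"))  # 0067 0327

    # A genuinely distinct combining mark exists alongside it:
    print(unicodedata.name("\u0326"))  # COMBINING COMMA BELOW
    print(unicodedata.name("\u0327"))  # COMBINING CEDILLA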


A./




Re: Scalability of ScriptExtensions (was: RE: Borrowed Thai Punctuation in Tai Tham Text)

2013-07-08 Thread Asmus Freytag

On 7/8/2013 1:35 PM, Whistler, Ken wrote:
A much more productive approach, it seems to me, would be instead to 
try to establish information about various, identifiable typographical 
traditions for use of punctuation around the world, and then associate 
exemplar sets of punctuation used with those traditions.


I would recommend that an approach like that be used behind the scenes 
to manage the update of the data file.


We are stuck with a format that seemingly assumes that all characters 
are treated individually. However, I agree with you that this is not 
the case; instead, there are sets of punctuation marks for certain 
typographical traditions.


In addition, there are issues like the Dandas, where specific marks have 
been unified across a range of related scripts.


A flexible way to pull this information together would be a UTN that 
collects it in human-readable, rather than machine-readable, form, 
with commentary and background.


If the information in the UTN is considered solid, it could then be 
reflected, in a separate pass, in the existing property file. Because 
you would work on the basis of typographical sets (or explicit encoding 
decisions), there would be less temptation to jiggle individual 
characters' property values.
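
As a sketch of what set-based maintenance might look like behind the 
scenes (the tradition names and member sets below are invented for 
illustration; they are not actual UCD data):

    # Hypothetical model: punctuation sets are maintained per typographical
    # tradition; per-character ScriptExtensions-style entries are generated
    # from them in a separate pass.
    traditions = {
        "Thai-derived punctuation": (["Thai", "Lana"], ["\u0E2F", "\u0E46"]),
        "Indic dandas":             (["Beng", "Deva", "Gujr"], ["\u0964", "\u0965"]),
    }

    entries = {}
    for _name, (scripts, chars) in traditions.items():
        for ch in chars:
            # A character shared by several traditions accumulates their scripts.
            entries.setdefault(ch, set()).update(scripts)

    for ch, scripts in sorted(entries.items()):
        print(f"{ord(ch):04X} ; {' '.join(sorted(scripts))}")

Edits would then be made to the tradition sets, never to individual 
characters, and the per-character file would simply be regenerated.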


A./



Re: Scalability of ScriptExtensions

2013-07-09 Thread Asmus Freytag

On 7/8/2013 8:15 PM, Richard Wordingham wrote:

On Mon, 08 Jul 2013 14:42:15 -0700
Asmus Freytag asm...@ix.netcom.com wrote:


We are stuck with a format that seemingly assumes that all characters
are treated individually. However, I agree with you that this is not
the case; instead, there are sets of punctuation marks for certain
typographical traditions.

UCD files are intended for computer use.  Are you proposing that text
rendering systems try to identify the typographical 'tradition' in
use?  If not, the format seems appropriate for computer use.


I'm suggesting that we change the model of how this particular file is 
maintained, not how the information in it is represented.


That was implicit in the part of my reply that you deleted in your answer.



In addition, there are issues like the Dandas, where specific marks
have been unified across a range of related scripts.

And effectively unrelated, like the Latin script.

Richard.







Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Asmus Freytag

What is wrong with using DIAMOND OPERATOR?
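
For comparison, the usual candidates and their formal names can be 
listed in a few lines of Python (standard library only; the choice of 
candidates is mine):

    import unicodedata

    # Candidate code points for the modal box and diamond; which shape a
    # given text uses is a matter of notational convention.
    for cp in ["\u25FB",   # a box suggested by the charts
               "\u25A1",   # an alternative box shape
               "\u22C4",   # the diamond proposed above
               "\u25CA",   # another diamond-like shape
               "\u27E0"]:  # the shape asked about below
        print(f"U+{ord(cp):04X} {cp} {unicodedata.name(cp)}")

This prints WHITE MEDIUM SQUARE, WHITE SQUARE, DIAMOND OPERATOR, 
LOZENGE, and LOZENGE DIVIDED BY HORIZONTAL RULE, respectively.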

A./

On 7/18/2013 8:27 PM, Stephan Stiller wrote:

Hi all,

Modal logic uses a box and a diamond (that is what they are 
informally called) as operators (accepting one formula and returning 
another) to denote necessity and possibility, respectively. Older texts 
might use the letters L and M (respectively). Which Unicode codepoints 
do the modal box and diamond correspond to?


According to the charts, it seems like the box is
◻ (U+25FB)
(is this definitive?), but what about the diamond? Unlike what one 
might glean from the charts, ⟠ (U+27E0) is afaiu /not/ normally used 
to denote possibility in the default† sense. Wikipedia's List of logic 
symbols article has something to say about this too, but I'm always 
cautious about information from there.


Stephan

† e.g. in the sense of λ푥.¬◻¬푥, with ◻ as used in, say, the axiom 
schema conventionally named *T* in modal logic





